Distribution of short paired duplications in mammalian genomes


Contributed by Michael Wigler, May 28, 2004


Abstract

Mammalian genomes are densely populated with long duplicated sequences. In this paper, we demonstrate the existence of doublets, short duplications between 25 and 100 bp, distinct from previously characterized repeats. Each doublet is a pair of exact matches, separated by some distance. The distribution of these intermatch distances is strikingly nonrandom. An unexpectedly high number of doublets have matches either within 100 bp (adjacent) or at distances tightly concentrated ≈1,000 bp apart (nearby). We focus our study on these proximate doublets. First, they tend to have both matches on the same strand. By comparing nearby doublets shared in human and chimpanzee, we can also see that these doublets seem to arise by an insertion event that produces a copy without markedly affecting the surrounding sequence. Most doublets in humans are shared with chimpanzee, but many new pairs arose after the divergence of the species. Doublets found in human but not chimpanzee are most often composed of almost tandem matches, whereas older doublets (found in both species) are more likely to have matches spaced by ≈1 kb, indicating that the nearly tandem doublets may be more dynamic. The spacing of doublets is highly conserved. So far, we have found clearly recognizable doublets in the following genomes: Homo sapiens, Mus musculus, Arabidopsis thaliana, and Caenorhabditis elegans, indicating that the mechanism generating these doublets is widespread. A mechanism that generates short local duplications while conserving polarity could have a profound impact on the evolution of regulatory and protein-coding sequences.

Genome expansion through duplication has been a prominent force in evolution. The human genome in particular is littered with signs of past duplication (1, 2). Transposons (3, 4), processed pseudogenes (5), and segmental duplications (6) are all known classes of repeats found in mammalian genomes. All of these types of duplications play an important role in gene and genome evolution (7), either through gene duplication and subsequent gene specialization or through the creation of unstable genomic regions.

Duplication is also important on a smaller scale. Comparative studies of promoters such as the vertebrate growth hormone gene (8) make it clear that gene regulation often evolves by increasing the number of copies of a given cis regulatory motif. Similarly, protein function can also evolve by the addition of tandem copies of protein domains. These types of short, tandem, or nearly tandem duplication events can have as striking an effect on gene evolution as whole gene or genome duplications. It is clear that once two or more copies of a given sequence are present at a locus, homologous recombination can further increase their number.

In this paper, we present evidence that short unique sequences are being actively duplicated in mammalian genomes. These short duplications occur frequently, have a strong tendency toward proximity and conservation of polarity, and do not fit into any of the well-studied classes of interspersed repeats. Studying these short duplications will give us insight into the process by which a unique sequence is duplicated, an important first step in the creation of a tandem array and potentially a key process in the evolution of gene regulation and protein function.

Methods

Identifying Doublets. We used the human genome sequence from the April 2003 assembly from University of California, Santa Cruz (9). We first identified cores of at least 25 bp that occur exactly twice in the genome. Genomic counts (number of occurrences of a substring within the genome) were determined by using the mer-engine method (10). We further required that the 21-bp substrings of the cores occur nowhere else, and that at least one of the cores be flanked on either side by 21 bp of unique sequence. Each core is associated with the 100 bp immediately flanking it to the left and to the right (Fig. 1A). The Needleman–Wunsch global alignment algorithm (11) was used to calculate the alignment scores between the flanks with match, mismatch, and gap scores set to 1, 0, and -1, respectively. Many of the flanks share a large degree of homology (Fig. 1B), indicating that the cores are not independent short exact duplications but small parts of a larger approximate duplication. To eliminate these, we compared the observed alignment score to the distribution of alignment scores between unrelated sequences, determined by aligning one flank of one core and the reverse complement of the corresponding flank of the other core. We calculated the mean plus two standard deviations of the distribution of reverse-complemented sequences and used this number as an upper bound on the maximum allowable alignment score. Approximately 86% of paired sequences were eliminated at this stage.
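The flank-homology filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the gap score of -1 is assumed to apply linearly to each gapped position, and the population standard deviation is assumed for the mean-plus-two-SD threshold, because the text specifies neither choice.

```python
from statistics import mean, pstdev

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def nw_score(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score (linear gap penalty assumed)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def homology_threshold(null_scores) -> float:
    """Mean plus two standard deviations of the unrelated-sequence scores.

    Assumption: population SD (pstdev); the paper does not say which estimator was used.
    """
    return mean(null_scores) + 2 * pstdev(null_scores)


def flanks_are_homologous(flank1: str, flank2: str, threshold: float) -> bool:
    """True when the observed flank score exceeds the allowed upper bound."""
    return nw_score(flank1, flank2) > threshold
```

In this sketch, the null scores would come from aligning one flank of one core against the reverse complement of the corresponding flank of the other core, and a candidate pair would be discarded whenever `flanks_are_homologous` returns True.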

Fig. 1.

(A) Anatomy of a doublet. Each doublet has two cores, which are identical (same polarity) or reverse complements (opposite polarity) and are at least 25 bp in length. Each core is associated with 100 bp of flanking sequence on either side (Left1, Right1, Left2, and Right2). These flanking sequences can overlap. The last 21 bp of Left1 and first 21 bp of Right1 must be unique in the human genome. If the two cores are on the same chromosome, the spacer is the sequence between the two cores, and its length is the intercore distance. To be a doublet, the flanks cannot be homologous. (B) Homology between flanks. The hatched histogram shows the distribution of alignment scores from comparing Left1 with Left2 (see A). The red plot shows the distribution of alignment scores from comparing sequences that should be unrelated, Left1 and the reverse complement of Right2. The homology threshold is depicted as a vertical black line. (C) Distribution of intercore distances of doublets on chromosome 2. The observed distribution (black squares) is compared with two random models for doublet occurrences: the same number of total doublets, each core occurring uniformly and independently at random across the chromosome (empty circles), and cores occurring at observed locations but randomly re-paired into doublets (empty triangles).

We anticipated that a small percentage of the remaining pairs would be the result of processed pseudogenes. If a gene occurs twice in the genome, once with introns and once without, then an exon of the complete gene will be a duplicated sequence immediately flanked by nonhomologous sequence. To exclude this source of paired sequences, we next discarded any pairs in which the matched substring has homology to a sequence in the National Center for Biotechnology Information (NCBI) est_human database (expectation ≤10⁻⁴ using megablast with default parameters; http://www.ncbi.nih.gov/blast). Approximately 20% of the pairs were eliminated at this stage.

To find doublets in other genomes, the same procedure was carried out, using matched pairs of genomic sequences and coding sequence databases. Mus musculus sequence is from NCBI Build 30 (12), Caenorhabditis elegans sequence is WS110 from WormBase (13), Drosophila melanogaster sequence is release 3-1 from FlyBase (14), Plasmodium falciparum sequence is from the Sanger Center (15), and Arabidopsis thaliana sequence is from the NCBI (16).

Intercore Distance Distribution. For a doublet with both cores on the same chromosome (intrachromosomal doublets), the intercore distance is the number of base pairs in the spacer (Fig. 1A) between the two occurrences of the core. For each chromosome, we plotted the distribution of intercore distances of all intrachromosomal doublets on that chromosome (Fig. 1C depicts human chromosome 2). We compared this distribution with two random models that take into account the overall number of intrachromosomal doublets in each chromosome. The first model assumes each core location is independently and uniformly distributed along the chromosome, yielding an expected distance distribution of $P(\mathrm{distance} < d) = 2d - d^{2}$, where d is the intercore distance normalized by chromosome length. If the distribution of core locations along the chromosome is nonuniform (some regions are core-rich and others core-poor), the distance distribution will deviate from this model. To account for such nonuniformities, the second model uses the true locations of all the intrachromosomal cores on the chromosome but assumes cores are randomly matched up into doublets, independent of their locations. The expected distance distribution based on this model was calculated by Monte Carlo simulation.
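Both random models can be illustrated with a short simulation. For two independent uniform positions on a chromosome of normalized length 1, the distance |x - y| has density 2(1 - d), which integrates to the 2d - d² distribution quoted above. The sketch below checks that expression numerically and shows one way to re-pair observed core positions at random; the function names, the use of Python's `random` module, and the shuffle-then-pair-neighbors scheme are illustrative assumptions rather than the procedure actually used in the study.

```python
import random


def uniform_model_cdf(d: float) -> float:
    """P(distance < d) for two independent uniform positions on [0, 1]."""
    return 2 * d - d * d


def simulate_uniform_distances(n_pairs: int, rng: random.Random) -> list:
    """First model: each core is placed uniformly and independently."""
    return [abs(rng.random() - rng.random()) for _ in range(n_pairs)]


def repaired_distances(core_positions, rng: random.Random) -> list:
    """Second model: keep the observed core positions, re-pair them at random.

    Positions are shuffled and consecutive entries are paired (an assumed,
    simple way to draw a random perfect matching).
    """
    shuffled = list(core_positions)
    rng.shuffle(shuffled)
    return [abs(shuffled[i] - shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]


def empirical_cdf(distances, d: float) -> float:
    """Fraction of simulated distances strictly below d."""
    return sum(x < d for x in distances) / len(distances)


if __name__ == "__main__":
    rng = random.Random(0)
    sims = simulate_uniform_distances(50_000, rng)
    for d in (0.01, 0.1, 0.5):
        print(d, round(empirical_cdf(sims, d), 4), round(uniform_model_cdf(d), 4))
```

Repeating `repaired_distances` many times and averaging the resulting histograms would give a Monte Carlo estimate of the expected distance distribution under the second model.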

Comparison with Chimpanzee. Chimpanzee sequence (Pan troglodytes) is the December 2003 Whitehead Institute for Biomedical Research (Cambridge) assembly from the Chimpanzee Genome Sequencing Consortium (http://ftp.ncbi.nih.gov/genome/Pan_troglodytes). To reduce our chances of finding paralogous rather than orthologous matches, we first screened out doublets in which either core was flanked by nonunique DNA. To do this, we determined the genomic counts for all 21-bp words in each of the flanks. If any of these 21-bp words occurred 10 or more times in the genome, or if the average genomic count was >3, the corresponding doublet was eliminated from the analysis.
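The flank-uniqueness screen amounts to two simple tests on the genomic counts of a flank's 21-bp words. The following sketch assumes a hypothetical `genomic_count` callable that returns the number of genomic occurrences of a word (a stand-in for a mer-engine lookup, whose real interface is not described here); the function names and the handling of flanks shorter than the word length are also assumptions.

```python
from typing import Callable, List


def kmers(seq: str, k: int = 21) -> List[str]:
    """All overlapping k-bp words of a sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def flank_passes_screen(
    flank: str,
    genomic_count: Callable[[str], int],  # hypothetical lookup of genome-wide word counts
    k: int = 21,
    max_word_count: int = 10,   # a word seen 10 or more times fails the flank
    max_mean_count: float = 3,  # an average count above 3 fails the flank
) -> bool:
    """Return True if the flank looks unique enough to keep the doublet."""
    counts = [genomic_count(word) for word in kmers(flank, k)]
    if not counts:
        # Assumption: a flank shorter than k cannot be assessed, so reject it.
        return False
    return max(counts) < max_word_count and sum(counts) / len(counts) <= max_mean_count
```

A doublet would be kept only if every one of its flanks passes; for example, `all(flank_passes_screen(f, lookup) for f in flanks)` with `lookup` backed by a precomputed count table.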

For each of the remaining doublets, we identified the outermost 100-bp flanks of the doublet and used megablast to find regions in the chimpanzee genome with at least 80% identity over >90 bp. We then extracted the intervening sequence. We created four different versions of the original doublet from the human or mouse genome: one with flanks, both cores, and the intercore sequence; one with both flanks, a single core, and the intercore sequence; one with both flanks and a single core but no intercore sequence; and one with flanks and intercore sequence but no cores. The needle program from the EMBOSS suite (www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/needle.html; gap open penalty, 10; gap extend penalty, 0.5; match score, 5; mismatch score, -4) was then used to align the genomic region from the alternate genome to all of these sequences, and the alignment with the best score was used to assign a label to the doublet: two or more cores conserved, one core and spacer conserved, one core and no spacer conserved, or no spacers conserved. The resulting alignments are viewable in Fig. 5, which is published as supporting information on the PNAS web site, and the results are summarized in Table 1.

Table 1. Conservation of human proximate doublets in chimpanzee

Comparison with Transposons. Alu annotations are from the University of California at Santa Cruz genome browser (9) and consensus sequences are from the Repbase Update database (17). For each of the 51 cases in which a doublet overlaps an annotated Alu, two versions of the annotated sequence were generated, one with the core (as found within the genome) and one with the core removed. Both versions were globally aligned to the consensus by using needle (gap open penalty, 10; gap extend penalty, 0.5; match score, 5; mismatch score, -4), and if the core-excised version had a higher scoring alignment to the consensus, the core was classified as an insertion.

Composition of Intercore Sequences. For each of the 2,020 intercore spacer sequences from nearby human doublets, we downloaded overlapping RepeatMasker, segmental duplication, and RefGene annotations from the University of California, Santa Cruz, genome browser (9). We did the same for five sets of randomly chosen genomic intervals with the same length distribution as the set of spacers. For each set of sequences and each type of annotation, we counted the number of sequences that overlapped a given type of annotation by at least 50%.

Segmental duplications have been annotated only if they are at least 1 kb in length. To limit any biases introduced by this length threshold, we also looked at uniqueness of the spacer sequences, compared to random genomic intervals and a random sampling of annotated segmental duplications. To determine uniqueness, we annotated each sequence with the number of genomic occurrences of each of its constituent 18-mers (10). We then calculated what percentage of the 18-mers in any given sequence set were unique (found only once in the genome), low count (found between 2 and 5 times in the genome), or high count (found in more than five locations in the genome).
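The 18-mer uniqueness profile can be expressed compactly. The sketch below is illustrative only: `genomic_count` is again a hypothetical lookup standing in for the mer-engine counts, and the function names are not taken from the original work. The three bins follow the definitions in the text (1 occurrence; 2 to 5 occurrences; more than 5 occurrences).

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def classify_count(count: int) -> str:
    """Bin a genomic word count as 'unique', 'low', or 'high'."""
    if count < 1:
        # Words drawn from the genome itself must occur at least once.
        raise ValueError("genomic count must be at least 1")
    if count == 1:
        return "unique"
    if count <= 5:
        return "low"
    return "high"


def uniqueness_profile(
    sequences: Iterable[str],
    genomic_count: Callable[[str], int],  # hypothetical genome-wide word-count lookup
    k: int = 18,
) -> Dict[str, float]:
    """Percentage of all k-mers in a sequence set that fall into each bin."""
    tally = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            tally[classify_count(genomic_count(seq[i:i + k]))] += 1
    total = sum(tally.values())
    if total == 0:
        return {"unique": 0.0, "low": 0.0, "high": 0.0}
    return {label: 100.0 * tally[label] / total for label in ("unique", "low", "high")}
```

Computing this profile separately for the spacers, the random genomic intervals, and the sampled segmental duplications would give the three-way comparison described above.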

Results and Discussion

To find instances of short repeats, we searched the human genome for all exact matches (at least 25 bp in length) with dissimilar flanking sequences. We chose to look only at identically matching sequences, or cores, with precisely two copies, to simplify both the definition of the sequences under consideration and the interpretation of the results. We required the flanking sequences to be unrelated to ensure that we were not looking at a small exact patch within a long approximately duplicated region. After filtering out sequences with homologous flanks or with homology to expressed sequences (see Methods), we were left with 32,057 of these paired sequences, or doublets, in the human genome (Fig. 1A). Although we set no maximum length on the core sequences, 99.9% are <100 bp in length. In fact, over half are 25 bp long, and their length distribution decays rapidly (Fig. 2 A–C).

Fig. 2.

Histograms of core length distributions are shown for several different populations of doublets. For each of these populations, a bin size of 4 bp was used to bin the core lengths, and the distribution was plotted. (A) 2,696 adjacent human doublets with spacer lengths ≤100 bp; (B) 2,077 nearby human doublets with spacer lengths >100 bp and ≤10 kb; (C) 29,013 remote human doublets with spacer lengths >10 kb; (D) 3,430 proximate mouse doublets with spacer lengths ≤10 kb; (E) 2,283 proximate human doublets that are shared between the chimpanzee and human genomes; and (F) 306 proximate young human doublets, where one of the two cores of the doublet is missing in chimpanzee.

Doublets have several interesting characteristics. First, the distribution of their intercore distances is strikingly nonrandom (Fig. 3A). We observe three populations of doublets: those that are extremely close together (adjacent; cores at most 100 bp apart), those with distances distributed around 1 kb (nearby; cores >100 bp and at most 10 kb apart), and those with cores >10 kb apart or interchromosomal (remote). In addition, there is a bias toward conservation of polarity: the adjacent doublets are always direct repeats, and the nearby doublets have both cores in the same polarity ≈70% of the time. Not surprisingly, the remote doublets show no bias toward conservation of polarity. We made essentially the same observations in mouse (Figs. 3B and 2D).
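The three populations defined above translate directly into a small classifier. This is a sketch with hypothetical names; it assumes the intercore distance is given in base pairs and that `None` marks a doublet whose cores lie on different chromosomes.

```python
from typing import Optional

ADJACENT_MAX_BP = 100      # adjacent: cores at most 100 bp apart
NEARBY_MAX_BP = 10_000     # nearby: more than 100 bp and at most 10 kb apart


def classify_doublet(intercore_distance: Optional[int]) -> str:
    """Label a doublet as 'adjacent', 'nearby', or 'remote'.

    intercore_distance is the spacer length in bp, or None when the two
    cores are on different chromosomes (interchromosomal doublets).
    """
    if intercore_distance is None:
        return "remote"
    if intercore_distance <= ADJACENT_MAX_BP:
        return "adjacent"
    if intercore_distance <= NEARBY_MAX_BP:
        return "nearby"
    return "remote"


def is_proximate(intercore_distance: Optional[int]) -> bool:
    """Proximate doublets are the adjacent and nearby classes taken together."""
    return classify_doublet(intercore_distance) in ("adjacent", "nearby")
```

Tallying these labels together with a same-strand flag for each doublet would reproduce the kind of per-class polarity summary reported in the text.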

Fig. 3.

For four different organisms, the distance between the two cores of a doublet is plotted vs. the normalized chromosomal position of one of the cores. Doublets are included only when both cores are on the same chromosome. This graph represents merged data from all of a particular organism's chromosomes. Normalized positions are chromosomal position divided by chromosome length. (A) Homo sapiens. (B) M. musculus. (C) C. elegans. (D) A. thaliana.

The numbers of adjacent and nearby doublets are significantly larger than what can be expected by chance, even considering the biases associated with the overall number of doublets (Fig. 1C). The vast majority of such doublets, which we collectively call proximate, are extremely unlikely to be coincidental matches. However, it is difficult to discern whether the large number of remote doublets is a result of biases in genome sequence composition or represents a more specific phenomenon. We have therefore concentrated our attention on proximate doublets. This decision is supported by the observations that core lengths are shorter among remote than proximate doublets (Fig. 2 A–C) and that remote doublet cores tend to be more AT-rich than proximate ones (data not shown). Adjacent doublets are composed of two identical sequences separated by a short spacer sequence of 1–100 bp. Because their polarity is preserved, they can be viewed as a subclass of tandem repeats, loosely defined as direct repeats of approximate matches with little or no spacer. Some of our adjacent doublets are clearly tandem repeats of two units that appear to have a spacer sequence only because the repeat has been partially eroded by point mutations. It is possible that all of our adjacent doublets are variants of this type, and that more of this class would have been found had we loosened our strict ascertainment criteria.

Nearby doublets, with long intercore distances between 100 and 10,000 bp, cannot be classified as degenerate tandem repeats. To study the dynamics of both adjacent and nearby doublets, we compared them to the draft P. troglodytes (chimpanzee) sequence. Of 3,083 doublets with intercore distances ≤10 kb, we found 2,589 in which the outermost flanks have clear homologues in the chimpanzee assembly. In most cases, the cores themselves are also present in chimpanzee. However, in 307 cases, one of the two cores is missing in chimpanzee, implying either a gain of a new copy in the human lineage or a loss in the chimpanzee lineage (Figs. 4A and 5).

Fig. 4.

(A) An alignment between one core of a human doublet and P. troglodytes sequence. The core sequence (highlighted in red) is clearly missing in the orthologous region of chimpanzee. This sequence is polymorphic within human populations; the inserted core has an allele frequency of ≈46%. (B) An alignment between one locus of a different doublet and an AluSp transposon consensus sequence. The core is clearly an insertion relative to the consensus.

In one nearby doublet, we see that both one core and the intercore sequence are missing in chimpanzee relative to human. This particular doublet likely represents a recombination-mediated loss of core and intercore sequence in the chimpanzee lineage rather than a gain in the human lineage (see doublet 643 in Fig. 5 in the supplementary information). However, this example is an exception; in the rest of the nearby doublets, the second core in humans appears to be an insertion relative to chimpanzee.

To unequivocally discriminate between gains and losses of copies, we selected six of the above nearby human doublets for further investigation and used PCR to detect the presence of both cores in chimpanzee, gorilla, orangutan, macaque, spider monkey, and lemur individuals, as well as a set of humans of diverse ethnicity. For each doublet, one core was always missing in non-human primates, whereas the other was always present (data not shown). In one of these six doublets, portrayed in Fig. 4A, we found the variable core to be polymorphic within the human population (data not shown). These data strongly indicate that the nearby doublets seen in human but not in chimpanzee arise by gain of a new copy.

Gains giving rise to nearby doublets in humans are most easily visualized as a simple insertion of a copy of a core into a nearby site with minimal alteration to the surrounding sequence. To look for further examples of these structures, we compared paralogous regions within the human genome. To this end, we identified those doublets that overlap the Alu family of transposons and examined the doublet sequences to determine whether the cores are insertions relative to the Alu consensus sequences. Of 51 nearby doublets with cores that overlap Alu annotations, we found 41 cases where the core appears to be an insertion relative to the transposon (Fig. 4B and Fig. 6, which is published as supporting information on the PNAS web site).

One possible source of nearby doublet generation is segmental duplication. Nearby doublets could be the remnants of old segmental duplications, where only a short exact match remains. Although these should have been eliminated through our filtration process, we used several tests to determine whether they were still a source of doublets. Segmental duplications are preferentially located near centromeres and telomeres in humans (18), so as a first test, we compared the chromosomal distribution of segmental duplications to that of doublets. We did not find any positive correlation between proximate doublet location and these chromosomal structures (Fig. 7, which is published as supporting information on the PNAS web site). Furthermore, we did not observe any clustering of proximate doublets within any 100-kb partitions of the human genome, which strongly argues against their origin as remnants of larger segmental duplications (Fig. 8, which is published as supporting information on the PNAS web site). As a final test, we compared the length distribution of young doublet cores that are found in human but not chimpanzee to the length distribution of conserved cores. If doublets originated in longer sequences, then younger doublets should be longer on average than old ones (Fig. 2 E and F). In fact, the distributions are very similar, with a slight length increase in young doublets, presumably due to the decreased number of point mutations in young sequences.

To search for further clues to the origins of the doublets, particularly the nearby doublets, we compared the content of the intercore sequence to randomly chosen genomic intervals of similar length and also to randomly chosen genomic intervals from annotated segmental duplications. We examined these intervals for the uniqueness of their constituent 18-mers and overlap with the following types of genomic annotations: RepeatMasker, RefSeq genes, and segmental duplications. With respect to uniqueness, intercore intervals are essentially indistinguishable from random genomic intervals and clearly very distinct from segmental duplications (Table 2, which is published as supporting information on the PNAS web site). Intercore intervals are drastically depleted of annotated segmental duplications and slightly underrepresented for annotated repeats and genes (Table 2).

For clarity, we have examined a set of precisely defined short exact matches. Another group (Achaz et al., ref. 19) has studied more loosely defined approximate repeats in a range of organisms. For algorithmic simplicity, they too looked only at pairs of duplicated sequences. Although their data presumably encompass segmental duplications and pseudogenes as well as doublets, a bimodal distance relationship similar to what we have observed can be weakly discerned.

Achaz et al. (19) postulate that all of the repeats they found were generated by direct tandem duplications, and that more distantly separated pairs were spread apart by later insertions. We can reject this model, because our sequence comparisons suggest that in many cases, the nearby doublets can be viewed as an insertion of a core copy into an existing target sequence with minimal collateral damage to the target. Furthermore, we have compared the distances between pairs of cores conserved in chimpanzee and human and find that this distance is tightly conserved (Fig. 9, which is published as supporting information on the PNAS web site). Not only is there no evidence of spreading, but also the intercore spacers, if anything, are underrepresented for the agents that might cause spreading, such as transposons and segmental duplications.

Achaz et al. (19) hypothesize that the closest pairs undergo high frequencies of recombination and are consequently unstable. Our data are consistent with this idea. Although there are roughly equal numbers of adjacent and nearby doublets in humans, the doublets that have changed since human–chimpanzee divergence are mainly adjacent. Of the proximate doublets in humans that are conserved in chimpanzee, 68% are adjacent. Of the proximate doublets that are new since chimpanzee, 96% are adjacent (Table 1). This is further supported by the observation that the intercore sequences are often lost in adjacent doublets, implicating a deletion event in the transition from two cores to one.

Much more of the genome may have arisen by this duplication process than is immediately apparent. By requiring exact identity between the two cores, we have missed much older and more divergent short duplications present in the genome. In fact, because only 6% of the proximate doublets are new since the divergence of human and chimpanzee, we expect that most doublets are ancient in origin. Moreover, more than half of exact doublets have cores no longer than our minimum length, so we expect we missed shorter duplications. In fact, smaller sequences, with a minimum length of 21 rather than 25 bp, have a similar intercore distance distribution, and there are at least four times as many of these (Fig. 10, which is published as supporting information on the PNAS web site).

These findings are important because they suggest that the mammalian genome can expand and remodel by local random copying. The genomic forces giving rise to the events we have observed may be responsible for the duplication and shuffling of small functional motifs that have been preserved in vertebrate evolution. Comparisons of doublets orthologous between human and chimpanzee suggest that the short adjacent duplications may be reversible, providing an inexpensive way for the species to rapidly explore the functionality of its local sequence space. In future studies, it might be interesting to relax the requirement of exact identity between cores, to gain further insight into the mutational dynamics of doublets.

As we have already mentioned, the distribution and character of mouse doublets are similar to what we observe in humans. We repeated our analysis in the genomes of C. elegans, D. melanogaster, P. falciparum, and A. thaliana. In D. melanogaster and P. falciparum, we see too few paired matches to conclude whether they have the same characteristics as human doublets. In the other two genomes, we see very significant numbers of doublets (Fig. 3 C and D). In A. thaliana, doublets are mainly adjacent, whereas in C. elegans, doublets are mainly nearby. These observations suggest that the mechanisms that give rise to doublets are fairly widespread among eukaryotic genomes, but that unknown factors alter the relative contribution of these mechanisms to the evolution of different species.

A model involving double-stranded breaks leaving 5′ overhangs, subsequently repaired by filling in followed by nonhomologous recombination, can explain the adjacent doublets. Although we do not know of documented cases of this type of repair, the model seems plausible. The nearby doublets are not so readily explained. Because they too preserve polarity, we may surmise that they too reflect a repair event, but polarity is not absolutely preserved, and different classes of proximate doublets predominate in different genomes, suggesting different types of repair are at play. We offer no mechanism for the much more abundant remote doublets and in fact cannot offer persuasive statistical arguments that remote doublets are not coincidental. A finished assembly of the chimpanzee genome would help resolve this issue. In any case, breakage repair seems a likely mechanism whereby genomes sample and replicate their own composition, which, over a long time, can lead to the amplification and dispersion of small functional motifs.

Acknowledgments

We thank Evan Eichler, Mike Zody, Tarjei Mikkelsen, Eric Lander, and Jerzy Jurka for helpful critical reading of our paper; Eric Siggia, Casey Bergman, Izik Pe'er, Dana Pe'er, Guillaume Achaz, and Ira Hall for interesting discussions; and Lakshmi Muthuswamy for help in determining mouse genomic counts. This work was supported by National Institutes of Health and National Cancer Institute Grants 2R01CA078544, 5P30CA45508, 5R01CA81152, and 5R21HG02606 (to M.W.) and New York University/Defense Advanced Research Projects Agency Grant F5239. M.W. is an American Cancer Society Research Professor; E.E.T. is a Farish–Gerry Fellow of the Watson School of Biological Sciences and a Howard Hughes Medical Institute predoctoral fellow; and J.S. is supported by National Institutes of Health Grant 5T32CA09311. This research was also supported by grants to B.M. from the National Science Foundation's Qubic and Information Technology Research programs; the Defense Advanced Research Projects Agency's BioCOMP/BioSPICE program; the U.S. Air Force; and the New York State Office of Science, Technology, and Academic Research (NYSTAR) program.

Footnotes

¶ To whom correspondence should be addressed. E-mail: wigler@cshl.edu.

† Present address: Center for Genomics Research, Harvard University, 7 Divinity Avenue, Cambridge, MA 02138.

Copyright © 2004, The National Academy of Sciences

References

1. International Human Genome Sequencing Consortium (2001) Nature 409, 860–921. pmid:11237011
2. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304–1351. pmid:11181995
3. Deininger, P. L. & Batzer, M. A. (2002) Genome Res. 12, 1455–1465. pmid:12368238
4. Prak, E. L. & Kazazian, H. H. (2000) Nat. Rev. Genet. 1, 134–144. pmid:11253653
5. Vanin, E. (1985) Annu. Rev. Genet. 19, 253–272. pmid:3909943
6. Samonte, R. V. & Eichler, E. E. (2002) Nat. Rev. Genet. 3, 65–72. pmid:11823792
7. Ohno, S. (1970) Evolution by Gene and Genome Duplication (Springer, Berlin).
8. Chuzhanova, N. A., Krawczak, M., Nemytikova, L. A., Gusev, V. D. & Cooper, D. N. (2000) Gene 254, 9–18. pmid:10974531
9. Karolchik, D., Baertsch, R., Diekhans, M., Furey, T. S., Hinrichs, A., Lu, Y. T., Roskin, K. M., Schwartz, M., Sugnet, C. W., Thomas, D. J., et al. (2003) Nucleic Acids Res. 31, 51–54. pmid:12519945
10. Healy, J., Thomas, E. E., Schwartz, J. T. & Wigler, M. (2003) Genome Res. 13, 2306–2315. pmid:12975312
11. Needleman, S. B. & Wunsch, C. D. (1970) J. Mol. Biol. 48, 443–453. pmid:5420325
12. Mouse Genome Sequencing Consortium (2002) Nature 420, 520–562. pmid:12466850
13. The C. elegans Sequencing Consortium (1998) Science 282, 2012–2018. pmid:9851916
14. Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D., Amanatides, P. G., Scherer, S. E., Li, P. W., Hoskins, R. A., Galle, R. F., et al. (2000) Science 287, 2185–2195. pmid:10731132
15. Gardner, M. J., Hall, N., Fung, E., White, O., Berriman, M., Hyman, R. W., Carlton, J. M., Pain, A., Nelson, K. E., Bowman, S., et al. (2002) Nature 419, 498–511. pmid:12368864
16. Arabidopsis Genome Initiative (2000) Nature 408, 796–815. pmid:11130711
17. Jurka, J. (2000) Trends Genet. 16, 418–420. pmid:10973072
18. Bailey, J. A., Gu, Z., Clark, R. A., Reinert, K., Samonte, R. V., Schwartz, S., Adams, M. D., Myers, E. W., Li, P. W. & Eichler, E. E. (2002) Science 297, 1003–1007. pmid:12169732
19. Achaz, G., Netter, P. & Coissac, E. (2001) Mol. Biol. Evol. 18, 2280–2288. pmid:11719577