Origins of recently gained introns in Caenorhabditis

Edited by Jeffrey Donald Palmer, Indiana University, Bloomington, IN, and approved May 28, 2004 (received for review December 10, 2003)

Related Article

Worm genomes hold the smoking guns of intron gain - Jul 26, 2004

Abstract

The genomes of the nematodes Caenorhabditis elegans and Caenorhabditis briggsae both contain ≈100,000 introns, of which >6,000 are unique to one or the other species. To study the origins of new introns, we used a conservative method involving phylogenetic comparisons to animal orthologs and nematode paralogs to identify cases where an intron content difference between C. elegans and C. briggsae was caused by intron insertion rather than deletion. We identified 81 recently gained introns in C. elegans and 41 in C. briggsae. Novel introns have a stronger exon splice site consensus sequence than the general population of introns and show the same preference for phase 0 sites in codons over phases 1 and 2. More of the novel introns are inserted in genes that are expressed in the C. elegans germ line than expected by chance. Thirteen of the 122 gained introns are in genes whose protein products function in pre-mRNA processing, including three gains in the gene for spliceosomal protein SF3B1 and two in the nonsense-mediated decay gene smg-2. Twenty-eight novel introns have significant DNA sequence identity to other introns, including three that are similar to other introns in the same gene. All of these similarities involve minisatellites or palindromes in the intron sequences. Our results suggest that at least some of the intron gains were caused by reverse splicing of a preexisting intron.

How introns spread within and among genes remains a central but largely unresolved question in evolutionary biology (1–4). Although genome-scale studies have shown that both losses and gains of introns occurred at substantial rates during the evolution of the major eukaryotic lineages (5), studies focused on more recent evolutionary periods have found many examples of losses but few gains. A survey of mammalian genes found six cases of intron losses in rodents relative to human but no intron gains (6). Recent intron losses are also frequently seen in plant genes (7). Fedorov et al. (8) did not detect any recently duplicated (i.e., gained) introns within the genome sequences of human, Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana. Finding recent intron gains and identifying the origin of their DNA is likely to be a key to understanding where new introns come from.

Although few in number, some examples of recent intron gain are supported by strong evidence. Logsdon et al. (9) compared triose-phosphate isomerase genes from many different eukaryotes and found that in some cases an intron's phylogenetic distribution could be explained by either a single gain or up to 12 losses. Other convincing gains have been found in the fruit fly xanthine dehydrogenase gene (10), in the globin genes of midges (11), in the rice catalase gene (12, 13), and in C. elegans chemoreceptor genes (14). Despite the evidence that intron gains occur, the mechanism is unknown.

Five different mechanisms by which spliceosomal introns could be gained have been proposed and are summarized briefly here. (i) Shortly after the discovery of introns, Crick (15) hypothesized that novel introns arise by insertion of a transposon. There is a large body of evidence that some recent insertions of transposable elements in laboratory strains of animals and plants can be spliced out, often with little or no phenotypic consequence (16, 17). However, the sole possible example of this occurring on an evolutionary timescale is the similarity between a short novel intron in the catalase gene of some rice species and a SINE element (13). (ii) Rogers (18) suggested that new spliceosomal introns may originate by insertion of a group II intron via reverse self-splicing, but there is no evidence to support this. (iii) Rogers also proposed that novel introns could be formed by tandem duplication of an internal fragment of an exon containing the sequence AGGT, with activation of the resultant cryptic splice sites (18). Three novel introns in fish may have been formed this way (19). (iv) A preexisting spliceosomal intron could be reverse-spliced into a new site in the same or a different mRNA, which is then reverse-transcribed to a cDNA that recombines with the genome (20). Tarrío et al. (10) attributed three novel introns in fly xanthine dehydrogenase genes to this mechanism, but their analysis has been questioned (2). (v) An unspliced mRNA could be reverse-transcribed and the cDNA recombine with a homologous gene in the genome that previously lacked an intron at that site. There is strong evidence that an intron was gained in a midge globin gene by this mechanism (11).

Nematode genes have a particularly high rate of intron turnover compared to other animals, as first noticed by Logsdon et al. (9). By comparing the whole C. elegans genome to 8% of that of its sister species Caenorhabditis briggsae, Kent and Zahler (21) found evidence of ≈250 introns present in one species but not in the other. Recently, we reported that in 12,155 orthologous gene pairs in the whole genomes of C. elegans and C. briggsae, there are 4,379 C. elegans-specific introns and 2,200 C. briggsae-specific introns (22). We estimated that intron gains or losses have occurred at a rate of at least 0.005 per gene per million years in nematodes, which far exceeds the rate in chordates (22). Intron–exon structure seems to be in flux across the entire phylum Nematoda: in 11 orthologs compared between C. elegans and its distant relative Brugia malayi, only 50% of C. elegans introns are conserved in B. malayi, and 25% of B. malayi introns are conserved in C. elegans (23).

Here, we searched for novel introns that can be identified unambiguously as having been gained after the divergence of C. elegans and C. briggsae. Our results point to reverse splicing of preexisting introns (20) as the main mechanism of intron gain during recent nematode evolution.

Methods

Here we summarize our methods; a more detailed description is included as Supporting Methods and Appendix 1, which are published as supporting information on the PNAS web site.

Sequence Data and Homolog Sets. The C. elegans data set was Wormpep 104 (19,588 proteins). The C. briggsae data set (19,507 proteins) was created as part of its genome project (22). We used Ensembl human release 15.33.1, mouse release 15.30.1, Drosophila release 15.3a.1, and Anopheles release 15.2.1. For each C. elegans or C. briggsae gene, we searched for homologs in six animal genomes (C. elegans, C. briggsae, human, mouse, fruit fly, and mosquito) by blastp (24). We sorted the homologs of a gene in order of significance and took the most significant hits. We found homolog groups for 16,590 C. elegans genes and 16,438 C. briggsae genes.

Detecting Intron Gains from Alignments. The proteins in each of the 33,028 homolog groups were aligned by using clustalw (25). To detect recently gained introns, we mapped intron positions onto the protein sequence alignment. If C. briggsae or C. elegans gene A has an intron Ai after its ith amino acid residue, and residue i is at the jth position of the alignment, then intron Ai is at the jth position of the alignment. We excluded introns that fall in unreliable regions of the alignment, considering intron positions to be reliable only if (i) ≥5/10 of the aligned residues j–9 to j, and ≥5/10 of those from j+1 to j+10 are either identical or conserved among all of the sequences in the homolog group from the six animal genomes; and (ii) there are no gaps between positions j–9 to j+10. Taking only those introns whose positions are reliable, an intron is considered as a putative recent gain in gene A if there is no intron in any of the homologs of A from j–4 to j+5, to exclude possible intron sliding. This analysis yielded 244 putative novel introns in C. elegans and 124 in C. briggsae, which were then tested for absence in B. malayi and phylogenetic support as described below.
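The position-mapping and filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the alignment is represented as equal-length gapped strings, and a column is counted as conserved only when all residues are identical (the text also accepts "conserved" substitutions, whose definition is not given here).

```python
from typing import Iterable, Optional, Sequence

GAP = "-"

def residue_to_column(aligned_seq: str, i: int) -> Optional[int]:
    """Return the 1-based alignment column j that holds the i-th residue (1-based)."""
    seen = 0
    for col, ch in enumerate(aligned_seq, start=1):
        if ch != GAP:
            seen += 1
            if seen == i:
                return col
    return None  # the sequence has fewer than i residues

def is_reliable(alignment: Sequence[str], j: int) -> bool:
    """Apply criteria (i) and (ii) to the windows j-9..j and j+1..j+10."""
    left = range(j - 9, j + 1)
    right = range(j + 1, j + 11)
    if left[0] < 1 or right[-1] > len(alignment[0]):
        return False  # window runs off the alignment

    def column(c: int) -> list:
        return [seq[c - 1] for seq in alignment]

    # (ii) no gaps anywhere between j-9 and j+10
    if any(GAP in column(c) for c in (*left, *right)):
        return False

    # (i) at least 5 of 10 columns conserved on each side
    # (assumption: "conserved" is simplified to "identical in all sequences")
    def n_identical(cols: Iterable[int]) -> int:
        return sum(len(set(column(c))) == 1 for c in cols)

    return n_identical(left) >= 5 and n_identical(right) >= 5

def is_putative_gain(j: int, homolog_intron_columns: Iterable[int]) -> bool:
    """True if no homolog has an intron from j-4 to j+5 (guards against intron sliding)."""
    return not any(j - 4 <= c <= j + 5 for c in homolog_intron_columns)

# Toy usage with three identical, ungapped sequences (made-up data).
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLI"
alignment = [seq, seq, seq]
j = residue_to_column(alignment[0], 15)
print(j, is_reliable(alignment, j), is_putative_gain(j, [9, 21]))
```

A real implementation would substitute a similarity-group or scoring-matrix test for the identity check in `n_identical`.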

Comparison to B. malayi. We checked whether the 368 putative novel introns are absent in the distantly related nematode B. malayi, whose genome is being sequenced by the Institute for Genomic Research (26). Gene predictions are not yet available, so we ran tblastn (24) with the Caenorhabditis protein as query against the B. malayi contigs. If a putative novel intron was at residue i in the Caenorhabditis protein, then we took the intron to be absent in B. malayi if the top tblastn hit included a large B. malayi exon, and residue i was internal to the B. malayi exon, at least five residues from either end. We found clear evidence that 112 C. elegans and 57 C. briggsae novel introns are absent from B. malayi and retained these for further analysis.
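The "internal to the exon" rule can be written as a one-line check. The sketch below is hypothetical: it assumes the top tblastn hit has already been parsed into the query-coordinate range covered by a single ungapped B. malayi exon, and it leaves out the "large exon" requirement because no size threshold is stated in this summary.

```python
def intron_absent_in_hit(i: int, exon_query_start: int, exon_query_end: int,
                         margin: int = 5) -> bool:
    """True if residue i lies inside the exon hit, at least `margin` residues from either end.

    Coordinates are positions in the Caenorhabditis query protein (hypothetical input format).
    """
    return (i - exon_query_start >= margin) and (exon_query_end - i >= margin)

# Made-up example: an exon hit covering query residues 10-120.
print(intron_absent_in_hit(50, 10, 120))  # well inside the exon
print(intron_absent_in_hit(12, 10, 120))  # too close to the exon start
```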

Phylogenetic Support for Intron Gains. For each gene containing a putative novel intron, we constructed a phylogenetic tree for the corresponding protein and its homologs. The outgroup for the tree was a SwissProt (release 41.15) protein that was clearly more distant from the other proteins than they were from each other. Protein sets for each tree were aligned by t-coffee (27). Neighbor-joining trees were drawn by using protdist and neighbor (28) with the Γ correction, and 1,000 bootstrap replicates were made by seqboot (28). A phylogenetic tree was accepted only if there were at least two internal branches with bootstrap values ≥70% between the outgroup and the gene containing a putative novel intron. We found phylogenetic support for 41 C. briggsae and 81 C. elegans putative novel introns.

Control Sets of Introns. To compare the novel introns to the entire C. elegans and C. briggsae intron populations, we created control sets of introns for each species. We included introns in our control sets only where ±10 aa adjacent to the intron's position are well conserved among the six animal species. This criterion was the same as we required for novel introns. The control sets consist of 19,942 C. elegans introns (20% of all C. elegans introns) and 18,516 C. briggsae introns (20%).

Repeat Elements and Similarity Among Introns. To find repeat elements in introns, we used fasta (29) with an E value cutoff of 10⁻¹⁰ and searched the repeat libraries for C. elegans and C. briggsae (22). The program palindrome (30) was used to find palindromes with a repeating unit of 50–150 bp. Minisatellites of 7–50 bp were detected by using a sliding-window approach (31). The thermodynamic stability of intron RNAs was predicted by using mfold (32). Known repeat elements from the repeat libraries were masked before running palindrome and mfold. To detect sequence similarity among introns and estimate its significance, we used ssearch and prss (29) after masking repeats from the repeat libraries.

Results

Identification of Novel Introns and Control Intron Sets. We aimed to find clear examples of introns that are recent gains in one nematode species, rather than to compile an exhaustive list of all possible gains. We considered a C. elegans or C. briggsae intron to be novel if it is absent from the gene's orthologs in the other Caenorhabditis species, the nematode B. malayi (26), chordates (human and mouse), and arthropods (fruit fly and mosquito), as well as in any close nematode paralogs. To ensure that a putative novel intron was almost certainly caused by intron insertion rather than by deletion, we drew phylogenetic trees for the gene with its homologs and required that there be at least three nodes between the gene and the outgroup (Fig. 1). Because at least three independent intron losses or one gain could explain the intron distribution, it is more parsimonious to infer intron gain. Furthermore, to ensure that a putative novel intron is very unlikely to be due to intron sliding, the novel intron had to be more than five codons from the nearest intron in any homolog. We also used stringent parameters for both global and local sequence alignment quality (see Methods). Using this rigorous approach, we found 41 novel introns in 39 C. briggsae genes and 81 novel introns in 74 C. elegans genes. There are seven cases where introns have been gained (at different sites) in both a C. briggsae gene and its C. elegans ortholog, so in total 106 different genes have gained introns. The phylogenetic trees and protein alignments showing the positions of novel introns can be viewed at http://wolfe.gen.tcd.ie/avril/introns.html.

Fig. 1. Identifying novel introns. To ensure that a putative novel intron was almost certainly caused by insertion rather than by deletion, we drew phylogenetic trees of the gene and its animal and nematode orthologs. We required that there be at least three nodes between the gene and the outgroup. We also required that, in a protein alignment of the gene and its orthologs, ≥5/10-aa residues on either side of the intron be identical or well conserved among the animal genomes.

To compare the novel introns to the entire C. elegans and C. briggsae intron populations, we created control sets of ≈19,000 introns from each species that conformed to the same criteria regarding protein sequence conservation as were applied to the novel introns (see Methods). Novel introns are significantly longer (median 60 bp) than the control introns (54 bp; two-sided Wilcoxon test, P = 0.01; C. briggsae P = 0.2; and C. elegans P = 0.2).

Exon Splice Site Consensus of Novel Introns. Spliceosomal introns tend to be flanked by exon consensus sequences, with AG immediately upstream of the intron and GT immediately downstream of it (18, 33). The novel nematode introns conform more strongly to the exon consensus sequences at all four nucleotide sites in both C. elegans and C. briggsae than do the control sets of introns (Fig. 2). For the 81 novel C. elegans introns, the differences are statistically significant at the A–2, G–1, and G+1 positions, all with P < 10⁻⁵ (one-sided Fisher's test). For the 41 novel C. briggsae introns, the differences are again significant at the same sites (P < 0.04, P < 10⁻⁴, and P < 0.002, respectively). At the T+2 site, the novel introns in both species have a higher frequency of T than the control set (Fig. 2), but the differences are not statistically significant.

Fig. 2. Exon splice site consensus of novel introns in C. elegans and C. briggsae, compared with the consensus for control sets of introns from each genome. Numbers show the percentage of introns in each group that have the indicated base at each position.

Phases of Novel Introns. An intron is defined as phase 0 if it lies between two codons in a gene, phase 1 if it is after the first base of a codon, or phase 2 if it is after the second base of a codon. If introns inserted into random positions in genes, novel introns would have an equal probability of being phase 0, 1, or 2. However, of the 41 novel introns in C. briggsae, 22 (54%) are phase 0, 12 (29%) are phase 1, and 7 (17%) are phase 2. This is a significant deviation from equal proportions of each phase (χ² test; P = 0.01). This trend is also seen in novel C. elegans introns, which are 52% phase 0, 26% phase 1, and 22% phase 2 (P = 0.002). The phase distributions of novel introns are similar to those of the control sets of introns; the frequencies of phases 0, 1, and 2 in the control sets are 51%, 24%, and 25%, respectively, in C. briggsae and 53%, 24%, and 22% in C. elegans. There is no significant difference between the phase distributions of novel and control introns in either species (χ² test; P ≥ 0.4).
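The test against equal phase proportions can be reproduced from the C. briggsae counts given above (22, 12, and 7). The following is a minimal sketch, assuming a standard χ² goodness-of-fit test with 2 degrees of freedom; the function name is illustrative, and the closed-form tail probability exp(−x/2) is valid only for 2 degrees of freedom, that is, for exactly three categories.

```python
import math
from typing import Sequence, Tuple

def chi_square_equal_phases(counts: Sequence[int]) -> Tuple[float, float]:
    """Goodness-of-fit test of three phase counts against equal expected proportions."""
    if len(counts) != 3:
        raise ValueError("the closed-form P value below assumes exactly three phases")
    expected = sum(counts) / 3
    stat = sum((c - expected) ** 2 / expected for c in counts)
    # For a chi-square distribution with 2 degrees of freedom, P(X >= x) = exp(-x / 2).
    p_value = math.exp(-stat / 2)
    return stat, p_value

# Phase 0, 1, 2 counts for the 41 C. briggsae introns reported in the text.
stat, p_value = chi_square_equal_phases([22, 12, 7])
print(f"chi2 = {stat:.2f}, P = {p_value:.3f}")
```

With these counts the statistic is about 8.54 and the P value about 0.014, in line with the reported P = 0.01.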

Test for Partial Exon Duplication. If novel introns arise by duplication of an exon region containing AGGT (18, 19), we would expect the region around the 5′ intron–exon boundary to be homologous to that around the 3′ boundary. We found 10 novel introns that have significant similarity in prss (P ≤ 0.05 with -f -7 -g -3 options; ref. 29), taking a region from 25 bp upstream to 25 bp downstream of each boundary. If these 10 introns resulted from exon duplication, we would expect that, if we aligned the 5′ and 3′ boundaries, the nn ↓ GT (5′ end) and AG ↓ nn (3′ end) should line up. However, we did not see this for any of the 10 introns when we aligned the boundary pairs using ssearch (29).

Repeat Elements in Novel Introns. We tested the hypothesis that novel introns originate from transposable elements (13, 15) by testing whether novel introns contain more repeat elements than do control introns. We used a recent reannotation of repeat families in both species for this analysis (22). The novel introns in both species include repeat elements from a wide variety of families, but the proportion of novel introns that contain annotated genomic repeats (9%) is not significantly higher than the proportion in the control set (7%; one-sided Fisher's test, P = 0.1; C. briggsae P = 0.1; and C. elegans P = 0.2).

Sequence Similarity Between Novel and Old Introns. To test the hypothesis that new introns arise by propagation of preexisting introns (20), we compared novel introns to all other introns in the same species. Known repetitive element sequences from the repeat libraries were masked. We found the best alignment on the sense strand between introns using ssearch and calculated a P value for the alignment by using 500 prss iterations (29). This search strategy is computationally intensive but more sensitive than blast or fasta for sequences that are as short as typical nematode introns. We identified 32 novel introns that have significant similarity to other introns in the same genome, using the criteria P < 0.001 in prss and ≥60% sequence identity over ≥100 bp. We rejected 4 of the 32 because they had large numbers of hits to other introns, presumably due to undescribed repeat elements that were missing from the repeat libraries. We retained the other 28 novel introns (7 C. briggsae and 21 C. elegans) for further analysis. These introns are listed in the Supporting Methods and have 1–11 hits each to other introns.

The similarities among the 28 novel introns and other introns in the same genome seem to be largely due to minisatellite repeats or larger palindromes, or to low-copy-number genomic repeat elements (Fig. 3). That is, in all cases, the prss match region (in the query or the hit or both) either contains minisatellite repeats or palindromic repeats or has a weak fasta hit (E ≤ 0.1) to a known repeat element in the repeat libraries (too weak to have been detected by the masking algorithm). We used 7–50 bp as a size definition for a minisatellite and 50–150 bp per repeating unit for palindromes (see Methods). This distinction is somewhat arbitrary, because large minisatellite arrays often form palindromes.

Fig. 3. Dot matrix comparison of introns in C. briggsae gene CBG18597. Intron 3 is novel and has similarity to intron 1. The plot was made by Dotter (34), with min = 0 and max = 100 in the greyramp tool.

The DNA in the whole set of novel introns tends to be more internally repetitive than in other introns. In both C. elegans and C. briggsae, there are palindromes in 27% of novel introns, compared to 12% of 1,000 control introns from the same species (Fisher's test; P = 0.005 for C. elegans and 0.0005 for C. briggsae). Furthermore, the novel C. briggsae introns are enriched in minisatellites (for 40% of them, ≥70% of their length is occupied by minisatellite, compared to 16% of control introns; Fisher's test; P = 0.0004), although no enrichment is seen in C. elegans (11% in novel introns and 13% in controls; P = 0.8). In addition, more of the 122 novel introns are predicted to fold into stable RNA structures (ΔG ≤ –100 kcal/mol) compared to the 2,000 control introns (21% vs. 13%; one-sided Fisher's test, P = 0.009; C. briggsae P = 0.04, C. elegans P = 0.06). This may be partly because novel introns tend to be longer than control introns, and ΔG decreases with length.

Among the 28 novel introns with similarity to other introns, there are three with similarity to another intron in the same gene. For example, intron 3 of C. briggsae gene CBG18597 is novel and has similarity to intron 1 of the same gene (68% identity over 735 bp). Both introns contain multiple copies of a ≈170-bp palindromic repeat (Fig. 3). Intron 3 also has similarity to intron 5 (65% identity over 1,493 bp). The other same-gene matches are between novel intron 7 and old introns 5 and 6 in C. briggsae CBG21228 (>70% identity over >470-bp alignments) and in C. elegans Y22D7AL.5 (hsp-60), where novel intron 4 matches old intron 5 (63% over 580 bp). There are minisatellites of ≈10 and ≈20 bp in CBG21228 intron 7 and Y22D7AL.5 intron 4, respectively. Additional dot-matrix plots showing similarity between novel introns and old introns, and minisatellites or palindromes within novel introns, are included as Supporting Methods.

If a novel intron had an equal probability of hitting any other intron in the genome, the probability of hitting another intron from the same gene would be ≈4 × 10⁻⁵, because there are ≈10⁵ introns in the genome and five introns per gene (so four other introns in the same gene). Hence, among the 148 prss matches between the 28 novel introns and other introns, we would expect to see no same-gene hits (148 × 4 × 10⁻⁵ ≈ 0), but we observe five (two in each of CBG18597 and CBG21228 and one in Y22D7AL.5). Thus, same-gene hits do seem to occur more frequently than we would expect by chance alone. Furthermore, the same-gene matches are the strongest matches had by any novel intron in either species; they are the only ones with ≥63% identity over ≥450 bp. However, the counts are too small to allow statistical testing of whether there are more same-gene than other-gene hits.
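The back-of-envelope expectation above can be written out explicitly. This sketch uses only the rounded figures quoted in the text (≈10⁵ introns per genome, five introns per gene, 148 matches); the function name is illustrative.

```python
def same_gene_hit_probability(total_introns: int, introns_per_gene: int) -> float:
    """Chance that a random hit lands on one of the other introns of the same gene."""
    return (introns_per_gene - 1) / total_introns

p_same_gene = same_gene_hit_probability(total_introns=100_000, introns_per_gene=5)
expected_hits = 148 * p_same_gene  # expected same-gene hits among the 148 prss matches

print(f"P(same gene) = {p_same_gene:.1e}")
print(f"expected same-gene hits = {expected_hits:.4f} (observed: 5)")
```

The expected count is about 0.006, far below the five same-gene hits observed.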

Germ-Line Expression of Genes That Have Gained Introns. To become fixed, an intron gain must occur in a germ-line cell or a cell that is going to become one (2). We investigated whether intron gain also requires gene transcription in the germ line. Hill et al. (35) used oligonucleotide arrays to identify 5,951 C. elegans genes that are always or sometimes expressed in oocytes. Of the 74 genes that have gained introns in C. elegans, 57 were studied by Hill et al. (35), whereas their data set covers 4,752 of the genes containing control introns. The proportion of the 57 genes that gained introns that are always or sometimes oocyte-expressed (63%) is significantly greater than the proportion of the 4,752 control genes that are always or sometimes oocyte-expressed (42%; one-sided Fisher's test; P = 0.001). Thus, genes that are expressed in the germ line are more susceptible to gaining introns than genes not expressed in the germ line.

For novel introns to originate by a reverse-splicing mechanism (20), both the gene containing the novel intron and the gene from which the intron was derived should be expressed in the germ line. The 21 novel C. elegans introns with similarity to other introns have been inserted into 19 “recipient” genes, 11 of which were studied by Hill et al. (35), who found 9 of 11 (82%) to be germ line expressed. In contrast, of the 87 candidate “source” genes containing introns with prss matches to the 21 novel introns, 49 were studied by Hill et al. (35), of which 39% are germ line expressed. This is not significantly different from the proportion of control genes that are germ line expressed (42%, two-sided Fisher's test, P = 0.8). However, it is obvious that, at most, only 21 of the 87 candidates could actually have been sources of novel introns.

Functions of Genes Containing Novel Introns. The 122 novel introns are inserted into 106 different genes, counting pairs of orthologs as a single gene. Thirteen genes gained two or three introns (Table 1), which is surprising given the low total number of gains, but it should be noted that our ability to detect novel introns depends on gene-specific factors (rate of sequence evolution, existence of orthologs in other species, and bootstrap support for a phylogenetic tree) as well as intron-specific factors.

Table 1. Some of the nematode genes with novel introns

It is striking that several genes with novel introns, including one that gained three introns, code for proteins involved in pre-mRNA splicing or surveillance (Table 1). These genes include smg-2, which functions in nonsense-mediated decay (36), and F49D11.1, which is predicted to catalyze the second step of splicing (homolog of Saccharomyces cerevisiae Cdc40; ref. 37). Novel introns were also found in nematode homologs of three well-characterized S. cerevisiae spliceosomal proteins (Hsh155/SF3B1, Prp6, and Prp19), two others (Imd2 and Ssa1) that are associated with the spliceosomal penta-snRNP in yeast (38), homologs of S. cerevisiae Dis3 (a component of the exosome, which processes the 3′ end of U4 small nuclear RNA (snRNA); refs. 39 and 40), and a homolog of human gene CPSF5, coding for a subunit of pre-mRNA cleavage factor Im (41). Of the 122 novel introns, 13 are in genes with known splicing-related functions, and four more are in putative RNA helicase genes with DEAD-box motifs (Table 1).

As an approximate test of the significance of this observation, we tested whether genes with mRNA processing functions are overrepresented in the novel intron group, compared to their frequency in a control group of germ-line-expressed genes containing control introns. We used Gene Ontology annotations for all of SwissProt instead of C. elegans alone, because documentation of some pre-mRNA processing and spliceosome components is more complete in yeasts and vertebrates. We identified nematode genes with blastp hits (E < 10⁻⁵⁰) to proteins in the Gene Ontology category “mRNA processing.” This method inferred mRNA processing roles for 5 of the 106 nematode genes with novel introns (4.7%), compared to only 17 of the 1,990 C. elegans control genes that are expressed in the germ line (0.9%; one-sided Fisher's test; P = 0.004).
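The one-sided Fisher's test on these counts (5 of 106 vs. 17 of 1,990) can be sketched with the hypergeometric distribution and Python's standard library alone. This is an illustration rather than the authors' script: the function name is hypothetical, and the calculation assumes the two gene groups are disjoint, which the text does not state explicitly.

```python
from math import comb

def fisher_one_sided_greater(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """P(group A has >= hits_a annotated genes) with the 2x2 table margins held fixed."""
    total = size_a + size_b        # all genes in the table
    annotated = hits_a + hits_b    # all genes with the annotation
    upper = min(annotated, size_a)
    # Sum hypergeometric numerators for tables at least as extreme as the observed one.
    numerator = sum(
        comb(annotated, x) * comb(total - annotated, size_a - x)
        for x in range(hits_a, upper + 1)
    )
    return numerator / comb(total, size_a)

# mRNA-processing genes: 5 of 106 intron-gaining genes vs. 17 of 1,990 control genes.
p_enrichment = fisher_one_sided_greater(5, 106, 17, 1990)
print(f"one-sided P = {p_enrichment:.4f}")
```

Exact integer arithmetic via `math.comb` avoids the floating-point underflow that a naive product of factorials would cause at these table sizes.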

It is also notable that genes that have gained introns tend to be part of operons. For C. elegans, whose operons have been mapped (42), 26% of genes with novel introns but only 14% of genes in the control set are in operons (Fisher's test; P = 0.005). However, this seems to just reflect the tendency of operons to be expressed in the germ line; taking just those control genes expressed in the germ line, 30% are in operons.

Discussion

Of the possible mechanisms of intron gain listed in the Introduction, group II intron insertion is improbable in nematodes, because their mitochondrial genomes do not contain group II introns. Also, intron gain by gene conversion with a homologous intron-containing gene can result only in the novel intron being gained at the same position as the source intron (11). We included only novel introns for which there was no intron at the same position in any close homolog, so the novel introns in our data set could not have arisen by this mechanism. Thus, in the following discussion, we consider whether the remaining three mechanisms could explain our data: transposon insertion, partial exon duplication, and reverse splicing of a preexisting intron.

We found that 63% of C. elegans genes that gained introns are expressed in the germ line, compared to 42% of control genes. If introns are gained by reverse splicing, one would expect intron gains to occur mainly in germ-line-expressed genes (2). Alternatively, if novel introns arise by transposon insertion, the transposons may have an insertion preference for actively transcribed regions of the genome, as has been observed for the Drosophila P element (43). But if intron gains occur by partial exon duplication, we see no reason why there would be a bias for germ-line-expressed genes. We also did not find any cases of obvious partial exon duplication in our data. Thus, we consider that partial exon duplication can be discarded as a possible mechanism in Caenorhabditis.

Our novel introns tend to be inserted at AG ↓ G, where ↓ is the insertion site. This is similar to the “proto-splice site” (MAG ↓ R) proposed by Dibb and Newman (33) and agrees with findings that the AG ↓ G consensus is stronger in species-specific introns than in all introns in Caenorhabditis (21), that recently gained introns in 10 eukaryotic protein families seem to have inserted into AG ↓ G sites (44), and that introns specific to one animal phylum have a stronger exon consensus than those common to two or more phyla (45). If introns are gained by reverse splicing, the spliceosome may insert the novel intron into AG ↓ G, because this would be the reverse of its normal role of removing an intron from AG ↓ G. Alternatively, if novel introns arise by transposon insertion with a target site duplication containing AGG, the resultant intron would be found at AG ↓ G (16). We also found, similar to Rogozin et al. (5) and Qiu et al. (44), that novel introns tend to insert into phase 0 positions in codons. If novel introns insert into AG ↓ G, 51% of insertions will be in phase 0 because of the genetic code (ref. 46; see ref. 3 for discussion). This is close to the fraction of novel introns in phase 0 that we observed: 54% in C. briggsae and 52% in C. elegans. Thus, the excess of phase 0 introns among the novel introns is likely to be a result of their tendency to insert at AG ↓ G sites. However, an alternative is that the phase bias results from selection subsequent to intron insertion. Lynch (47) pointed out that if intron sliding occurred subsequent to insertion, it would have greater negative consequences for phase 1 and 2 than phase 0 introns.

We found that novel introns are as likely as control introns to contain genomic repetitive elements from the repeat libraries. This result suggests that novel introns probably did not originate by insertion of transposable elements. However, we also found that novel introns are more likely than control introns to contain palindromes. We cannot tell whether the repeats that form palindromes are uncharacterized locally distributed transposable elements that have produced new introns, whether these repeats are somehow formed when the new intron is formed, or whether introns that are repetitive are more likely to give rise to new introns by reverse splicing. It seems unlikely that the palindrome repeats are locally distributed transposable elements, because the proportion of C. elegans novel introns with prss matches to introns from genes that are within ±500 genes on the same chromosome (8%) is not any greater than expected by chance (5%; one-sided Fisher's test, P = 0.1). However, palindromes in RNA molecules often fold into hairpins. The fact that novel introns are predicted to fold into more stable RNA structures than do most introns would fit the expectation that introns with longer half-lives are more likely to be duplicated by reverse splicing (4).

Our finding that several novel introns are inserted into genes coding for proteins with functions related to splicing provides circumstantial support for a reverse-splicing model. When spliceosomal introns were discovered in genes for the U1, U2, U5, and U6 snRNA components of the spliceosome in fungi, it was suggested that they had originated from mishaps during splicing (48, 49). An excised intron from some other transcript became integrated into the snRNA, which was then reverse transcribed into cDNA and recombined with the chromosomal snRNA gene. Brow and Guthrie (48) suggested that this reverse splicing was facilitated by the closeness of the snRNAs to the catalytic center of the spliceosome. A similar argument can be made for the novel introns we found in genes for spliceosomal proteins such as SF3B1 (Table 1), but the argument is complicated by the fact that these genes are protein coding. Conceivably, mRNAs for proteins with splicing-related functions might somehow be more available for reverse-splicing reactions than other mRNAs, perhaps due to autoregulation (50) or occasional aberrant events such as the attempted incorporation into the spliceosome of nascent proteins that are still associated with their mRNAs. Spliceosomal proteins are part of the core cellular machinery that is conserved across eukaryotes, and “core” genes tend to be both germ line expressed and located within operons (35, 51). However, our Gene Ontology analysis indicated that novel introns are unusually frequent in genes with mRNA processing functions, relative to germ-line-expressed genes, which suggests that it is the function of these genes, rather than their mode of transcription, that makes them amenable to gaining introns.

Logsdon et al. (2) commented that for an intron gain to be credible, it should have strong phylogenetic support, and the source of the intron DNA should be identifiable. They referred to this second criterion as a “molecular smoking gun.” We identified three novel nematode introns with significant sequence similarity to another intron in the same gene, a result that is suggestive of a reverse-splicing model where an excised intron sometimes reintegrates into a different site in the same mRNA (10). However, interpretation of the sequence similarities is complicated by the repetitive structures of the introns (Fig. 3). Our results indicate a reverse-splicing origin for some novel nematode introns but do not exclude the possibility that other mechanisms were involved in other intron gains. The best way to confirm our proposal that reverse splicing is one of the principal mechanisms of intron gain in nematodes is to identify intron gains that are even more recent than those examined here, because their source of intron DNA would be more obvious, for example by identifying introns that have been gained after the divergence of C. briggsae from its closer relative Caenorhabditis remanei (22).

Acknowledgments

We thank Richard Durbin and Lincoln Stein for access to C. briggsae data, Andrew Hill (Department of Genomics, Wyeth Research, Cambridge, MA) for kindly providing C. elegans germ-line expression data, and Des Higgins for magnanimously allowing A.C. to complete this work in his laboratory. This work was supported by Science Foundation Ireland. We thank The Institute for Genomic Research (TIGR) for allowing prepublication use of data from the International Brugia Genome Sequencing Project, which is supported by an award from the National Institute of Allergy and Infectious Diseases, National Institutes of Health.

Footnotes

* To whom correspondence should be addressed. E-mail: khwolfe@tcd.ie.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviation: snRNA, small nuclear RNA.

See Commentary on page 11195.

Note Added in Proof. Using blast searches of unassembled reads from the C. remanei genome (http://genome.wustl.edu/blast/client.pl), we found that 15 of the 41 C. briggsae novel introns are absent from its sister species C. remanei and so must have been gained since speciation. Another 19 introns are shared by the two species, although we could not unambiguously score the remaining 7. The fraction of the 15 younger introns that have prss matches to other introns in the same genome (5/15; 33%) is significantly greater than the fraction of the 19 older introns with same-genome matches (0%; one-sided Fisher's test, P = 0.01). This strongly suggests that the same-genome prss matches are vestiges of intron birth.
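
The comparison of 5/15 younger introns with 0/19 older introns can be checked with a one-sided Fisher's exact test. The Python sketch below is a minimal standard-library implementation under the usual fixed-margins (hypergeometric) assumption; the function name and the table layout are choices made for this illustration, and it is not necessarily the software used for the published value.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a), where X is the count in the top-left cell when
    all row and column totals are held fixed.
    """
    row1 = a + b          # e.g. number of younger introns
    col1 = a + c          # e.g. number of introns with a same-genome match
    n = a + b + c + d
    denom = comb(n, row1)
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += comb(col1, x) * comb(n - col1, row1 - x) / denom
    return p


# Rows: younger (15) vs older (19) introns; columns: match vs no match.
p = fisher_one_sided(5, 10, 0, 19)
print(round(p, 4))
```

With these counts the only table at least as extreme as the observed one is the observed table itself, so the P value equals C(15,5)/C(34,5) = 3003/278256, or about 0.011, consistent with the P = 0.01 reported above.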

Copyright © 2004, The National Academy of Sciences

References

1. Gilbert, W. (1978) Nature 271, 501. pmid:622185
2. Logsdon, J. M., Jr., Stoltzfus, A. & Doolittle, W. F. (1998) Curr. Biol. 8, R560–R563. pmid:9707398
3. Logsdon, J. M., Jr. (1998) Curr. Opin. Genet. Dev. 8, 637–648. pmid:9914210
4. Lynch, M. & Richardson, A. O. (2002) Curr. Opin. Genet. Dev. 12, 701–710. pmid:12433585
5. Rogozin, I. B., Wolf, Y. I., Sorokin, A. V., Mirkin, B. G. & Koonin, E. V. (2003) Curr. Biol. 13, 1512–1517. pmid:12956953
6. Roy, S. W., Fedorov, A. & Gilbert, W. (2003) Proc. Natl. Acad. Sci. USA 100, 7158–7162. pmid:12777620
7. Charlesworth, D., Liu, F. L. & Zhang, L. (1998) Mol. Biol. Evol. 15, 552–559. pmid:9580984
8. Fedorov, A., Roy, S., Fedorova, L. & Gilbert, W. (2003) Genome Res. 13, 2236–2241. pmid:12975308
9. Logsdon, J. M., Jr., Tyshenko, M. G., Dixon, C., D-Jafari, J., Walker, V. K. & Palmer, J. D. (1995) Proc. Natl. Acad. Sci. USA 92, 8507–8511. pmid:7667320
10. Tarrío, R., Rodriguez-Trelles, F. & Ayala, F. J. (1998) Proc. Natl. Acad. Sci. USA 95, 1658–1662. pmid:9465072
11. Hankeln, T., Friedl, H., Ebersberger, I., Martin, J. & Schmidt, E. R. (1997) Gene 205, 151–160. pmid:9461389
12. Frugoli, J. A., McPeek, M. A., Thomas, T. L. & McClung, C. R. (1998) Genetics 149, 355–365. pmid:9584109
13. Iwamoto, M., Nagashima, H., Nagamine, T., Higo, H. & Higo, K. (1999) Theor. Appl. Genet. 98, 853–861.
14. Robertson, H. M. (2001) Chem. Senses 26, 151–159. pmid:11238245
15. Crick, F. (1979) Science 204, 264–271. pmid:373120
16. Giroux, M. J., Clancy, M., Baier, J., Ingham, L., McCarty, D. & Hannah, L. C. (1994) Proc. Natl. Acad. Sci. USA 91, 12150–12154. pmid:7991598
17. Purugganan, M. D. (2002) in Horizontal Gene Transfer, eds. Syvanen, M. & Kado, C. I. (Chapman & Hall, London), pp. 187–195.
18. Rogers, J. H. (1989) Trends Genet. 5, 213–216. pmid:2551082
19. Venkatesh, B., Ning, Y. & Brenner, S. (1999) Proc. Natl. Acad. Sci. USA 96, 10267–10271. pmid:10468597
20. Sharp, P. A. (1985) Cell 42, 397–400. pmid:2411416
21. Kent, W. J. & Zahler, A. M. (2000) Genome Res. 10, 1115–1125. pmid:10958630
22. Stein, L., Bao, Z., Blasiar, D., Blumenthal, T., Brent, M., Chen, J., Chinwalla, A., Clarke, L., Clee, C., Coghlan, A., et al. (2003) PLoS Biol. 1, 166–192.
23. Guiliano, D. B., Hall, N., Jones, S. J., Clark, L. N., Corton, C. H., Barrell, B. G. & Blaxter, M. L. (2002) Genome Biol. 3, RESEARCH0057. pmid:12372145
24. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389–3402. pmid:9254694
25. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) Nucleic Acids Res. 22, 4673–4680. pmid:7984417
26. Ghedin, E., Wang, S., Foster, J. M. & Slatko, B. E. (2004) Trends Parasitol. 20, 151–153. pmid:15099548
27. Notredame, C., Higgins, D. G. & Heringa, J. (2000) J. Mol. Biol. 302, 205–217. pmid:10964570
28. Felsenstein, J. (1989) Cladistics 5, 164–166.
29. Pearson, W. R. (1996) Methods Enzymol. 266, 227–258. pmid:8743688
30. Rice, P., Longden, I. & Bleasby, A. (2000) Trends Genet. 16, 276–277. pmid:10827456
31. Katti, M. V., Ranjekar, P. K. & Gupta, V. S. (2001) Mol. Biol. Evol. 18, 1161–1167. pmid:11420357
32. Zuker, M. (2003) Nucleic Acids Res. 31, 3406–3415. pmid:12824337
33. Dibb, N. J. & Newman, A. J. (1989) EMBO J. 8, 2015–2021. pmid:2792080
34. Sonnhammer, E. L. & Durbin, R. (1995) Gene 167, GC1–G10. pmid:8566757
35. Hill, A. A., Hunter, C. P., Tsung, B. T., Tucker-Kellogg, G. & Brown, E. L. (2000) Science 290, 809–812. pmid:11052945
36. Page, M. F., Carr, B., Anders, K. R., Grimson, A. & Anderson, P. (1999) Mol. Cell. Biol. 19, 5943–5951. pmid:10454541
37. Ben Yehuda, S., Dix, I., Russell, C. S., Levy, S., Beggs, J. D. & Kupiec, M. (1998) RNA 4, 1304–1312. pmid:9769104
38. Stevens, S. W., Ryan, D. E., Ge, H. Y., Moore, R. E., Young, M. K., Lee, T. D. & Abelson, J. (2002) Mol. Cell 9, 31–44. pmid:11804584
39. van Hoof, A., Lennertz, P. & Parker, R. (2000) Mol. Cell. Biol. 20, 441–452. pmid:10611222
40. Allmang, C., Petfalski, E., Podtelejnikov, A., Mann, M., Tollervey, D. & Mitchell, P. (1999) Genes Dev. 13, 2148–2158. pmid:10465791
41. Ruegsegger, U., Blank, D. & Keller, W. (1998) Mol. Cell 1, 243–253. pmid:9659921
42. Blumenthal, T., Evans, D., Link, C. D., Guffanti, A., Lawson, D., Thierry-Mieg, J., Thierry-Mieg, D., Chiu, W. L., Duke, K., Kiraly, M., et al. (2002) Nature 417, 851–854. pmid:12075352
43. Timakov, B., Liu, X., Turgut, I. & Zhang, P. (2002) Genetics 160, 1011–1022. pmid:11901118
44. Qiu, W. G., Schisler, N. & Stoltzfus, A. (2004) Mol. Biol. Evol. 21, 1252–1263. pmid:15014153
45. Sverdlov, A. V., Rogozin, I. B., Babenko, V. N. & Koonin, E. V. (2003) Curr. Biol. 13, 2170–2174. pmid:14680632
46. Long, M., de Souza, S. J., Rosenberg, C. & Gilbert, W. (1998) Proc. Natl. Acad. Sci. USA 95, 219–223. pmid:9419356
47. Lynch, M. (2002) Proc. Natl. Acad. Sci. USA 99, 6118–6123. pmid:11983904
48. Brow, D. A. & Guthrie, C. (1989) Nature 337, 14–15. pmid:2909888
49. Takahashi, Y., Tani, T. & Ohshima, Y. (1996) J. Biochem. (Tokyo) 120, 677–683. pmid:8902636
50. Lewis, B. P., Green, R. E. & Brenner, S. E. (2003) Proc. Natl. Acad. Sci. USA 100, 189–192. pmid:12502788
51. Blumenthal, T. & Gleason, K. S. (2003) Nat. Rev. Genet. 4, 112–120. pmid:12560808