msari: Multiple sequence alignments for statistical detectio


Communicated by Peter W. Shor, Massachusetts Institute of Technology, Cambridge, MA, July 6, 2004 (received for review July 2, 2003)


Abstract

We present a highly accurate method for identifying genes with conserved RNA secondary structure by searching multiple sequence alignments of a large set of candidate orthologs for correlated arrangements of reverse-complementary regions. This approach is growing increasingly feasible as the genomes of ever more organisms are sequenced. A program called msari implements this method and is significantly more accurate than existing methods in the context of automatically generated alignments, making it particularly applicable to high-throughput scans. In our tests, it discerned clustalw-generated multiple sequence alignments of signal recognition particle or RNaseP orthologs from controls with 89.1% sensitivity at 97.5% specificity and with 74.4% sensitivity with no false positives in 494 controls. We used msari to conduct a comprehensive scan for secondary structure in mRNAs of coding genes, and we found many genes with known mRNA secondary structure and compelling evidence for secondary structure in other genes. msari uses a method for coping with sequence redundancy that is likely to have applications in a large set of other comparison-based search methods. The program is available for download from http://theory.csail.mit.edu/MSARi.

The structure of RNA is to a large extent determined by cis base pairing (AU, GC, and GU). This base-pairing is referred to as secondary structure. A noncoding RNA (ncRNA) (1) gene expresses RNA that is never translated into protein but is nonetheless biologically significant. Examples of such genes are tRNAs and XIST, which in mammalian females suppresses expression of genes on one X chromosome (2–4). RNA secondary structure in mRNAs can also be biologically significant, controlling timing and localization of protein expression (5). Identifying such secondary structure will be crucial to a complete understanding of cellular biology (6).

Most work on identifying RNA secondary structure has been in the context of searching for ncRNA genes. Some approaches to automated identification of ncRNA genes have focused on searching for a recognizable secondary structure associated with RNA transcripts serving a specific biological function. One example of this type of program is Eddy and coworkers' trnascan-se (7, 8), which searches for tRNAs. Others are Regalia et al.'s search for signal recognition particles (9) and Rhoades et al.'s search for microRNAs (10).

Automatically identifying novel biologically significant RNA secondary structure has proven to be difficult. By itself, RNA secondary structure in stand-alone genes is not particularly amenable to computer-based recognition methods, as many RNA sequences seem to have thermodynamically plausible secondary structures of no biological relevance (11). Moreover, ncRNA genes cannot be discerned by using standard computational gene detection algorithms, which are targeted at genes that express proteins and rely heavily on locating stop codons and other protein-specific guides (12–17).

Comparative methods provide a way to cut through the abundance of plausible, but irrelevant, structures: only secondary structure that is conserved across species is likely to be biologically significant. We are aware of two programs that search for secondary structure by comparing potentially orthologous sequences. The first is qrna (1, 18), which scans pairwise alignments of homologous DNA sequences from related genomes. It uses a statistical model that flags alignments showing mutation patterns preserving base-pairing in a thermodynamically plausible RNA secondary structure. Since first submitting this article for publication, we have also learned of a program called ddbrna (19), which tests for complementary mutations in three-sequence multiple sequence alignments (MSAs).

A serious problem with both qrna and ddbrna is that they can only detect complementary mutations of orthologous base pairs that have been accurately aligned to each other (1, 19). This makes them imperfect for large-scale genome scans, as standard alignment algorithms do not reliably align such base pairs. For a sufficiently large hand-curated MSA of known ncRNA orthologs at the right evolutionary distance from each other, in which many orthologous base pairs are aligned, covariation between the corresponding columns of the MSA can be used to discern it from controls such as those described in Results with nearly perfect accuracy. However, hand curation of MSAs cannot be part of any high-throughput genomic scan.

A problem related to RNA secondary structure detection is fine-grained secondary-structure prediction. This is the problem of determining all of the base pairs in sequences known to have secondary structure. Hofacker et al.'s program alifold (20) and Fariza et al.'s dcfold (21) use MSAs of known ncRNA orthologs for secondary-structure prediction, but not detection. In the context of pairs of sequences, Mathews and Turner's dynalign (22) and Sankhoff's sequence/structure alignment algorithm (23) can be used for structure prediction.

Secondary-structure prediction requires sufficient sensitivity to predict almost all base pairs and can have relatively low specificity without degrading its usefulness. Detection of secondary structure in a full-genome scan requires much greater specificity, but can be useful with much lower per-base-pair sensitivity. Large MSAs have long been used for manual prediction of ncRNA secondary structure and have also recently been used in automated structure prediction (20, 24–26), but we are not aware of any earlier attempt to use them in searches for novel RNA secondary structure.

Here, we propose an ab initio RNA secondary-structure detection scheme using large MSAs that does not rely on knowledge from or training on particular RNA secondary structures. The statistical evidence for conservation of RNA secondary structure across many sequences is often so strong that simple, robust statistical models can be used to detect it. In particular, as well as being a far more accurate detection scheme than its predecessors, to our knowledge ours is currently the only one that copes with the inaccuracies typical of automatically generated alignments. Our approach is based on computing the statistical significance of short, contiguous potential secondary-structure base-paired regions that are conserved between candidate orthologs and allows for small variations between alignments of orthologous base pairs. To cope with the wide range of evolutionary distances that can exist between sequences in a large MSA, it uses a distribution-mixture method that should have application to other comparative search problems.

The msari program, which implements the MSA method presented here, is more accurate than qrna or ddbrna. With a cutoff giving 97.5% specificity, it has 89.1% sensitivity, and with a cutoff giving 74.4% sensitivity, there were no false positives in our test data of 494 controls. We tested on 10- and 15-sequence MSAs of signal recognition particle or RNaseP orthologs and generated controls by shuffling the columns of these, in analogy to the tests described by Rivas and Eddy (1). On similar data (but necessarily with smaller MSAs containing far less information), ddbrna had 49.0% sensitivity with 97.7% specificity, and qrna had 28.6% sensitivity with 99.1% specificity.

We used msari to scan The Institute for Genomic Research Eukaryotic Gene Orthologs (TIGR EGO) database (www.tigr.org/tdb/tgi/ego) (35) for orthologs with conserved RNA secondary structure (see Results). This search yielded many genes with known secondary structure, and made many predictions of conserved secondary structure (see Table 1).

Table 1. EGO ortholog classes in which msari found significant conservation of secondary structure

Algorithm

Overview. The algorithm used by msari is based on two key innovations. First, it allows for slight misalignments of orthologous base pairs by looking for imperfectly aligned, yet statistically significant, reverse complementarity. msari can tolerate misalignments of orthologous base pairs up to a distance of two characters. Whereas automated alignments rarely align every set of orthologous helices that accurately, msari only needs to find a few significant base-paired regions to confidently identify conserved secondary structure. There are usually a few helices in sufficiently well conserved regions of the orthologs, so this is all of the flexibility needed. Second, it estimates the significance of variations in highly redundant sequences, based on determining which portions of sequences within the MSAs should be treated as mutations of other sequences and which portions are so different that they should be treated as independently selected.

Estimating the Significance of Reverse Complementarity. When msari processes an MSA, it first uses rnafold (27) (a program that predicts the secondary structure of individual sequences) as a preprocessor to locate probable base pairs in each of the constituent sequences. For each pair of positions in the MSA where rnafold predicted that a sequence had a probability of >5% of base pairing, msari examines windows of length 7 around the pair for complementary mutations.‡ By examining only such window pairs, rather than all pairs, msari greatly increases its sensitivity, because it reduces the Bonferroni multiple-sampling factors in the null-hypothesis probability estimates described below.

Suppose a pair of positions v1 and v2 are chosen in this way. Assume v1 < v2. For each sequence in the MSA, the window of seven nucleotide characters centered on v1 is considered. To compensate for possible misalignments, multiple windows in the vicinity of v2 are considered, namely the five windows of seven nucleotide characters centered on v2 ± {0, 1, 2}. The number of reverse-complementary positions in each pair of windows is counted, and the window near v2 with the largest number of positions reverse-complementary to the v1 window is chosen. For instance, suppose the window centered on v1 contains GUGAGUU, while the nucleotides to be considered around v2 are CAGACUCACGG. Then the window that will be chosen near v2 is GACUCAC, because all seven positions are reverse-complementary (G-C, U-A, G-C, A-U, G-C, U-A, and U-G) while the other windows near v2 have two or three reverse-complementary positions.
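The window choice in this example can be reproduced with a few lines of Python. This is only an illustrative sketch, not msari's actual code: the names `PAIRS`, `count_reverse_complementary`, and `best_window` are hypothetical, and ties between windows are broken arbitrarily (the first maximal window wins).

```python
# Pairs allowed in RNA helices: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def count_reverse_complementary(left, right):
    """Count positions i where left[i] can pair with right[-1 - i]."""
    return sum((a, b) in PAIRS for a, b in zip(left, reversed(right)))

def window_counts(left, neighborhood, width=7):
    """Score every width-long window of `neighborhood` against `left`."""
    windows = [neighborhood[i:i + width]
               for i in range(len(neighborhood) - width + 1)]
    return [(w, count_reverse_complementary(left, w)) for w in windows]

def best_window(left, neighborhood, width=7):
    """Return the (window, count) pair with the most complementary positions."""
    return max(window_counts(left, neighborhood, width), key=lambda wc: wc[1])

# The example from the text: 11 nucleotides around v2 give five candidate windows.
window, score = best_window("GUGAGUU", "CAGACUCACGG")
print(window, score)
```

On the example above, the five candidate windows score 3, 3, 7, 2, and 3, so GACUCAC is selected, in agreement with the text.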

The nucleotides in these windows are assumed to be independently drawn from null-hypothesis distributions that will be described shortly. Given these distributions, we compute the probability p of seeing at least as many complementary positions as observed in the chosen pair of windows. To compensate for the fact that five window pairs were considered, the null-hypothesis probability of this sequence at this pair of positions is estimated by 1 - (1 - p)^5. To get an estimate for the entire MSA at this pair of positions, estimates for all its sequences are computed and multiplied together.
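A minimal sketch of this calculation follows, under stated assumptions: the per-position pairing probabilities are taken as given inputs (in msari they would come from the null-hypothesis distributions), positions are treated as independent, and the product over sequences is accumulated as a sum of natural logarithms. The function names are hypothetical, and the base of the logarithm used by msari is not stated in the text.

```python
import math

def tail_probability(pair_probs, observed):
    """P(at least `observed` complementary positions) for independent positions.

    pair_probs[i] is the null-hypothesis probability that position i pairs.
    """
    dist = [1.0]  # dist[k] = probability of exactly k pairing positions so far
    for c in pair_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - c)   # this position does not pair
            nxt[k + 1] += mass * c       # this position pairs
        dist = nxt
    return sum(dist[observed:])

def corrected_probability(p, n_windows=5):
    """Compensate for having picked the best of n_windows candidate windows."""
    return 1.0 - (1.0 - p) ** n_windows

def msa_log_probability(per_sequence_p, n_windows=5):
    """Log of the product of the corrected per-sequence estimates.

    Assumes every p is strictly positive, so the logarithm is defined.
    """
    return sum(math.log(corrected_probability(p, n_windows))
               for p in per_sequence_p)
```

A sequence whose corrected estimate is 1 adds log(1) = 0 to the sum, which is how a sequence identical to an earlier one ends up contributing nothing to the significance of a pair.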

We used a Bonferroni-style test for rejection of the null hypothesis. In 15-sequence MSAs, if this procedure yields a probability of <1/(200 * no. of region pairs considered) for a given pair, we consider the pair to be significant. Thus, we are only considering pairs showing a degree of complementary mutation that would occur in <0.05% of MSAs drawn from the null-hypothesis distribution. For 10-sequence MSAs, only pairs with probabilities <1/(5 * no. of region pairs considered) are considered. (These cutoffs were chosen empirically.) The significant pairs are sorted by significance, and msari selects a subset in which there are no pseudoknots: it picks the most significant pair, then the next most significant that does not form a pseudoknot in conjunction with the first, and so on. Finally, it multiplies the probabilities for the selected pairs together, and this product is used as the estimate for the significance of the sequence.
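The selection step can be sketched as a greedy filter. The code below is an illustration under simplifying assumptions, not msari's implementation: each region pair is reduced to its two center positions (v1, v2), two pairs are said to form a pseudoknot when their positions interleave, and the names `crosses`, `select_pairs`, and `msa_score` are hypothetical. The `divisor` argument stands for the empirically chosen constant (200 for 15-sequence MSAs, 5 for 10-sequence MSAs).

```python
import math

def crosses(a, b):
    """True if pairs a=(i, j) and b=(k, l) interleave, i.e. i < k < j < l."""
    (i, j), (k, l) = sorted([a, b])
    return i < k < j < l

def select_pairs(scored_pairs, n_considered, divisor=200):
    """Keep significant pairs, most significant first, skipping pseudoknots.

    scored_pairs: list of ((v1, v2), probability) with v1 < v2.
    n_considered: number of region pairs examined (the Bonferroni factor).
    """
    cutoff = 1.0 / (divisor * n_considered)
    significant = sorted((sp for sp in scored_pairs if sp[1] < cutoff),
                         key=lambda sp: sp[1])
    chosen = []
    for pair, prob in significant:
        if not any(crosses(pair, other) for other, _ in chosen):
            chosen.append((pair, prob))
    return chosen

def msa_score(chosen):
    """Log of the product of the probabilities of the selected pairs."""
    return sum(math.log(prob) for _, prob in chosen)

# Hypothetical input: (30, 70) interleaves with (10, 50) and is dropped,
# while (20, 40) is nested inside (10, 50) and is kept.
pairs = [((10, 50), 1e-9), ((30, 70), 1e-6), ((20, 40), 1e-7), ((80, 90), 0.5)]
print([p for p, _ in select_pairs(pairs, n_considered=4)])
```

Being greedy, the procedure does not search for the pseudoknot-free subset with the best overall product; it simply follows the order of significance, as the text describes.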

Distribution Mixtures. To estimate the significance of observed base pairs, a null-hypothesis model for random mutations in an MSA of related sequences is needed. For ease of computation, we want to treat the events in separate sequences as independent. To this end, our null-hypothesis model varies from sequence to sequence within the MSA and incorporates the possibilities that a sequence is either brand-new or is closely related to earlier sequences in the MSA. The model weights these possibilities according to the degree of local similarity between sequences.

The resulting distributions are essentially mixtures, similar to the distribution mixtures that arise in Bayesian statistics (28). The component distributions are derived from the following possible events, for which examples are given in the next section:

(i) The current sequence window is closely related to a prior sequence, and the current nucleotide is the same as the nucleotide at the same position in that sequence. In this case, a constant distribution that always returns that nucleotide is used.

(ii) The current sequence window is closely related to a prior sequence, and the current nucleotide is a mutation from the nucleotide at the same position in the sequence. In this case, the distribution is computed from the local preponderance of nucleotides, with the nucleotide in the prior sequence removed. The nucleotides in all sequences in the MSA within the window of length 7 centered on the current position are used to compute this distribution.

(iii) The current sequence window is too far from the sequences seen so far, and the current nucleotide is drawn from a separate distribution computed from the local preponderance of nucleotides. Only the nucleotides in the current sequence and current window are used to compute this distribution.

A weighted sum of these distributions is used as the null-hypothesis distribution for the current position in the current sequence. This mixture contains a distribution for each sequence in the MSA above the current sequence, either distribution type i or ii, depending on whether the nucleotide at the current position in those sequences is the same or different from the current nucleotide. The weighting assigned to these distributions is determined by the degree of similarity between the associated prior sequence and the current sequence. If within the current window the proportion of positions in which the sequences have identical nucleotides is q, then the unnormalized weight assigned to its distribution is q^2. If the maximum over the prior sequences of these proportions is Q, then the unnormalized weight assigned to distribution type iii is (1 - Q)^2.
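The weighting scheme above can be sketched in Python as follows. This is one reading of the description rather than msari's code, and several details are assumptions: all names are hypothetical, the MSA is a list of equal-length strings over A, C, G, U, gap characters are simply ignored when counting, the base-pair window is passed in as a half-open range, and a uniform fallback is used if a local count is empty.

```python
from collections import Counter

NUCLEOTIDES = "ACGU"

def local_distribution(chars, exclude=None):
    """Relative nucleotide frequencies in `chars`, optionally dropping one."""
    counts = Counter(c for c in chars if c in NUCLEOTIDES and c != exclude)
    total = sum(counts.values())
    if total == 0:  # nothing to count: fall back to uniform (an assumption)
        allowed = [n for n in NUCLEOTIDES if n != exclude]
        return {n: (1.0 / len(allowed) if n in allowed else 0.0)
                for n in NUCLEOTIDES}
    return {n: counts[n] / total for n in NUCLEOTIDES}

def identity(a, b):
    """Proportion of aligned positions at which a and b are identical."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def null_distribution(msa, row, pos, win_start, win_end, half=3):
    """Mixture for msa[row][pos]; [win_start, win_end) is the base-pair window."""
    current = msa[row]
    lo, hi = max(0, pos - half), pos + half + 1  # window of length 7 around pos
    components, best_q = [], 0.0
    for prior in msa[:row]:
        q = identity(prior[win_start:win_end], current[win_start:win_end])
        best_q = max(best_q, q)
        if prior[pos] == current[pos]:   # type i: constant distribution
            dist = {n: float(n == current[pos]) for n in NUCLEOTIDES}
        else:                            # type ii: mutation away from prior[pos]
            pooled = "".join(seq[lo:hi] for seq in msa)
            dist = local_distribution(pooled, exclude=prior[pos])
        components.append((q ** 2, dist))
    # type iii: the sequence is treated as unrelated to everything above it
    components.append(((1.0 - best_q) ** 2, local_distribution(current[lo:hi])))
    total = sum(w for w, _ in components)
    return {n: sum(w * d[n] for w, d in components) / total
            for n in NUCLEOTIDES}
```

For the first sequence of an MSA only the type iii term is present, and for a sequence identical to an earlier one within the window the mixture collapses to the constant type i distribution.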

Examples of Distribution Mixtures. Suppose msari is estimating the significance of the following regions in an MSA. The base-pair windows are indicated with overlines as shown in Fig. 1. There are no sequences before the first one, so the nucleotide distributions are composed entirely of distribution type iii. The best pair of windows in the first row is UUGGGUC with GACCUGG. Thus the distribution that the first U in the first window is drawn from is taken from the preponderance of nucleotides in the window of length 7 around it, ACAUUGG. Because this window contains seven nucleotides altogether, and two As, P(A) = 2/7, and similarly, P(C) = 1/7, P(G) = 2/7, and P(U) = 2/7.

Fig. 1. Portions of MSA used in demonstration of msari's algorithm. We describe how msari would calculate the statistical significance of the mutations preserving complementarity between the left and right overbar regions of the sequences.

For the second sequence, the best pair of windows is again UUGGGUC with GGUCCAG, so in this case, each of the nucleotide distributions has a term of type i, coming from the first sequence. Because the sequences are entirely identical in this window, the values q associated with these distributions are unity. It is the only term in the distribution mixture of types i or ii, as there is only one prior sequence. Thus Q is always 1, and all of the nucleotide distributions in this row are constant: the one for the first U in the first row is P(N) = δ(N, U) for any nucleotide N, where δ(x,y) = 0 if x ≠ y, 1 if x = y. Thus the probabilities for drawing complementary base pairs at the respective positions in this pair of windows are all unity, and this sequence contributes nothing to the significance estimate for this pair.

For the third sequence, the best window pair is UUGGGCU with GGCUUGG. The sequence in the second window has changed, so the values of q for both of the previous sequences will be less than unity, and (1 - Q)^2 will be nonzero. Thus the distribution associated with the first G in the window will be a mixture of a type i distribution, with weight 2q^2 (for the two identical, prior sequences), and a type iii distribution, with weight (1 - q)^2. On the other hand, the distribution associated with the first C will be a mixture of a type ii distribution and a type iii distribution. Because it is a mutation from G, the type ii distribution has P(G) = 0 and is given by the preponderances of A, C, and U in the window of length 7 around that position. The complementarity-preserving differences in this sequence mean that it contributes substantially to the significance estimate.

Implementation and Efficiency. We mention the asymptotic efficiency of msari only pro forma, as any algorithm that runs in a reasonable time on sequences with 300 characters could be used to detect most RNA secondary structure, which tends to have a lot of important short-range interactions. Thus one can examine overlapping windows as we have done in Results. The algorithm that rnafold implements takes O(n^3) steps, where n is the length of the sequence it is processing, and this is the dominant factor in the asymptotic run time. The number of steps required by msari after this preprocessing is linear in the number of possible base-pairings returned by rnafold, which is O(n^2) or less. Thus with respect to sequence length the overall asymptotic run time of the algorithm is comparable to the O(n^3) performance of qrna. The run time of msari also grows quadratically in the number of sequences in the MSA it processes.

The run time of msari on a 15-sequence MSA of 300 bp ranges between 15 sec and 1 min, depending on the number of probable base pairs returned by rnafold. We used a single 2.4-GHz Pentium processor (Intel, Santa Clara, CA) for all tests described in this article.

Apart from the use of rnafold, at the moment msari is implemented entirely in the computer language Python. We believe we could accelerate it by an order of magnitude by rewriting parts of it in the C programming language if necessary, but its current speed has been adequate for our tests so far.

Results

Dataset Generation. Construction of MSAs. All alignments were constructed by using clustalw, which is commonly used in RNA structure detection and prediction. We considered using programs such as mavid (29), multipipmaker (30), or lagan (31) instead, or improving the alignment with a program such as realigner (32), but only a program specifically designed for RNA alignments is likely to align orthologous base pairs with substantially more accuracy. The difficulty is that there is frequently a great deal of variation among the bases in orthologous RNA helices, giving standard alignments relatively few clues about the most accurate alignment. Only an algorithm that specifically includes evidence of base-pair conservation is likely to help with this problem. lagan, multipipmaker, and mavid all are designed to deal with alignments of extremely long sequences, whereas realigner is intended for collation of shotgun reads and uses a heuristic optimized for sequences with very high similarity. Thus, none of these programs are more appropriate than clustalw in a search for RNA secondary structure.

Benchmark datasets. The sequences used in the benchmark dataset tests were eukaryotic signal recognition particle and eukaryotic RNaseP RNA orthologs taken from the signal recognition particle database (33) and the ribonuclease P Database (34), respectively. The artificially generated control MSAs were generated in the same fashion as those of Rivas and Eddy (1) by randomly shuffling the columns of the genuine ncRNA MSAs. To get fair controls by shuffling the columns, it was necessary to then strip the gaps from the shuffled sequences and realign them with clustalw. Otherwise, msari found it easy to detect the controls from randomly interspersed gaps that shuffling by itself produces.
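The control construction can be sketched as follows. This is a minimal illustration with hypothetical function names; the MSA is assumed to be a list of equal-length strings with "-" as the gap character, and the final realignment with clustalw is an external step that is not shown.

```python
import random

def shuffle_columns(msa, rng=None):
    """Return a copy of the MSA with its columns in random order.

    Every row is permuted with the same column order, so each column's
    contents (and hence each sequence's composition) are preserved.
    """
    rng = rng or random.Random()
    order = list(range(len(msa[0])))
    rng.shuffle(order)
    return ["".join(row[i] for i in order) for row in msa]

def strip_gaps(msa, gap="-"):
    """Remove gap characters so the sequences can be realigned from scratch."""
    return [row.replace(gap, "") for row in msa]

# Hypothetical three-sequence alignment.
msa = ["GGC-AU", "GGCUAU", "G-CUAC"]
control = strip_gaps(shuffle_columns(msa, random.Random(0)))
print(control)  # ungapped sequences, ready to be realigned externally
```

Shuffling whole columns destroys the positional relationship between paired columns while keeping column-wise conservation, which is why it serves as a null model for covariation.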

MSAs were constructed by an iterative procedure, successively choosing a sequence, aligning it to the sequences already chosen with clustalw, and only accepting the new sequence if its maximal similarity to the other sequences was between 50% and 95%. This procedure was repeated until 10 or 15 sequences had been chosen, or it was determined that no appropriate sequences remained, in which case the MSA was thrown out and a new initial sequence was chosen. It was necessary to pick MSAs in which the sequences had reasonable similarity and variation. If the MSA broke into sufficiently dissimilar cliques, msari was essentially reduced to estimating the significance of two smaller MSAs, whereas if the sequences in the MSA were too similar, there were not enough mutations for convincing significance estimates. Comparative methods intrinsically require sequences that are similar enough to align with some confidence but different enough to show interesting variation (17). However, the range of variation allowed in these tests is very broad. The performance statistics we cite for qrna and ddbrna are for alignments with sequence identities between 60–80% and 60–100%, respectively.
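A rough sketch of this selection loop is given below. It is not the authors' script: the alignment step is abstracted into a caller-supplied `similarity(a, b)` function returning a fraction between 0 and 1 (in the procedure described above that figure would come from a clustalw alignment), and the name `build_sequence_set` and its parameters are hypothetical.

```python
import random

def build_sequence_set(candidates, similarity, target=15,
                       low=0.50, high=0.95, rng=None):
    """Greedily collect `target` sequences, or return None on failure.

    A candidate is accepted only if its maximal similarity to the
    sequences already chosen lies between `low` and `high`.
    """
    rng = rng or random.Random()
    pool = list(candidates)
    rng.shuffle(pool)
    chosen = [pool.pop()]              # random initial sequence
    progress = True
    while progress and len(chosen) < target:
        progress = False
        for seq in list(pool):
            best = max(similarity(seq, other) for other in chosen)
            if low <= best <= high:
                chosen.append(seq)
                pool.remove(seq)
                progress = True
                if len(chosen) == target:
                    break
    # No appropriate sequences remain: the caller discards this attempt
    # and may retry from a new initial sequence.
    return chosen if len(chosen) == target else None

def toy_identity(a, b):
    """Stand-in similarity for equal-length strings (illustration only)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

seqs = ["AAAAAAAAAA", "AAAAAAAACC", "AAAAAACCCC"]
print(build_sequence_set(seqs, toy_identity, target=3, rng=random.Random(1)))
```

The outer loop repeats passes over the remaining pool because a candidate that was too dissimilar at first can become acceptable once a closer relative has been added.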

EGO dataset. For each ortholog class in the EGO database (www.tigr.org/tdb/tgi/ego and ref. 35), we aligned its sequences by using clustalw. To perform the search in a statistically similar context to that of the benchmark datasets, we restricted the search to alignments containing 300 characters or fewer. In alignments with sequences >300 characters, we separately considered the 300-bp subalignments starting at positions 0, 150, 300, 450, and so on. Then using a breadth-first search from each sequence in the alignments, we looked for subsets of 15 sequences in which each sequence had 65–90% similarity to at least one other sequence in the subset. We produced 4,972 such alignments from 2,853 ortholog classes.
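The two preprocessing steps can be sketched as below, with hypothetical names throughout. Two details are assumptions because the text does not specify them: trailing windows shorter than 300 columns are kept as they are, and the similarity test is abstracted into a caller-supplied predicate `similar(i, j)` that is true when sequences i and j are 65–90% similar.

```python
from collections import deque

def subalignments(msa, width=300, step=150):
    """Split an MSA into overlapping windows starting at 0, step, 2*step, ..."""
    length = len(msa[0])
    if length <= width:
        return [list(msa)]
    return [[row[start:start + width] for row in msa]
            for start in range(0, length, step)]

def bfs_subset(start, n_seqs, similar, size=15):
    """Breadth-first search for `size` sequences linked by pairwise similarity.

    Every sequence added is similar to the one it was reached from, so each
    member of the result is similar to at least one other member.
    """
    seen, queue = [start], deque([start])
    while queue and len(seen) < size:
        i = queue.popleft()
        for j in range(n_seqs):
            if j not in seen and similar(i, j):
                seen.append(j)
                queue.append(j)
                if len(seen) == size:
                    break
    return seen if len(seen) == size else None

# Hypothetical similarity relation: sequence k resembles only k - 1 and k + 1.
print(bfs_subset(0, 5, lambda i, j: abs(i - j) == 1, size=3))
```

The 150-column step means consecutive windows overlap by half, so a structure that straddles one window boundary falls wholly inside a neighboring window.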

Benchmark Dataset Results. msari's performance. The msari program separates the MSAs of genuine ncRNA orthologs from the control set extremely accurately (see above). With 15-sequence MSAs and a cutoff log-probability threshold of -15.7, msari distinguished genuine MSAs from controls with 89.1% sensitivity and 97.5% specificity, whereas with a threshold of -29.4, it had 74.4% sensitivity and found no false positives in 494 controls (≈99.8% specificity). With 10-sequence MSAs and a threshold of -31.9, it had 74.9% sensitivity and 97.5% specificity, whereas with a threshold of -48.3, it had 56% sensitivity and no false positives in 866 controls (≈99.9% specificity). This is a marked improvement over the performances of qrna and ddbrna.
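For readers who want to reproduce this kind of evaluation on their own scores, the following sketch shows how sensitivity and specificity follow from a log-probability threshold. The function name and the score lists are hypothetical and are not the benchmark data; an MSA is assumed to be flagged when its score falls below the threshold.

```python
def sensitivity_specificity(genuine_scores, control_scores, threshold):
    """Fraction of genuine MSAs flagged and fraction of controls not flagged."""
    true_pos = sum(s < threshold for s in genuine_scores)
    false_pos = sum(s < threshold for s in control_scores)
    sensitivity = true_pos / len(genuine_scores)
    specificity = 1.0 - false_pos / len(control_scores)
    return sensitivity, specificity

# Made-up scores, for illustration only.
genuine = [-40.0, -30.0, -20.0, -10.0]
controls = [-16.0, -5.0, -3.0, -1.0]
print(sensitivity_specificity(genuine, controls, -15.7))
print(sensitivity_specificity(genuine, controls, -29.4))
```

Lowering the threshold trades sensitivity for specificity, which is the trade-off behind the two operating points quoted for each MSA size.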

To confirm that msari's accommodation of misalignments substantially improves its accuracy, we also tested a version of it with this feature turned off. We ran it on 625 of the 10-sequence MSAs described above and found this version of the program had 48% sensitivity at 97.5% specificity and 16.3% sensitivity with no false positives. Although with further tuning we might have marginally improved this performance, this is a significant degradation from the accuracy of the full-featured version.

Tests of other programs. One of our reviewers suggested that the improvement in our performance might stem in part from large MSAs improving clustalw's accuracy: when aligning 15 sequences, clustalw can get more clues about the true alignment than it can when aligning two or three sequences. To test this, we took subalignments of two or three sequences from the 15-sequence MSAs we passed to msari and ran qrna and ddbrna on these. For both programs, we selected groups of subalignments having the same distributions of sequence similarities as described by Rivas and Eddy (1) and di Bernardo et al. (19). The sequence similarities for the alignments we used to test qrna ranged from 60% to 80%, whereas for the ddbrna test set they ranged from 60% to 100%.

This process did not lead to a significant improvement in either program's accuracy. At 99.1% specificity, qrna's sensitivity was 28.6%. This is higher than reported in ref. 1, but almost all of this gain is caused by subsequent improvements in later versions of qrna's algorithm. The performance of ddbrna was slightly worse than the 49.0%/97.5% sensitivity/specificity reported in ref. 19.

We believe that the statistical advantage underlying msari's greater accuracy comes from the much larger MSAs that it is capable of considering and from its ability to cope with slight misalignments. It is much easier to build up strong evidence for conserved secondary structure when comparing so many sequences. It is not immediately clear how to incorporate these capabilities into the algorithms of qrna or ddbrna.

Searching for Orthologs with Conserved mRNA Secondary Structure. We have used msari to perform a large-scale comparative search for biologically significant RNA secondary structure. We scanned the TIGR EGO database (www.tigr.org/tdb/tgi/ego and ref. 35) for genes with conserved RNA secondary structure, running on 4,972 alignments constructed from 2,853 ortholog classes (see Benchmark datasets). We found that 39 of the ortholog classes produced alignments for which msari reported log probabilities less than the most stringent threshold we chose above for 15-sequence MSAs. See Table 1 for information on the ortholog classes msari flagged. Of such ortholog classes, four have no protein names or functions assigned by EGO's annotations (EGO accession nos. TOG126402, TOG127160, TOG129802, and TOG127282). We attempted to search the literature for these proteins and found indications that 13 of those listed in Table 1 are already believed to have secondary structure (36–47).

It is very likely that the majority of these ortholog classes have conserved RNA secondary structure. Although the existence of a thermodynamically stable secondary structure for an mRNA does not by itself constitute strong evidence that the secondary structure is biologically significant (11), msari estimates the likelihood that chance alone could account for the compensatory mutations that it observes. This evidence can be extremely compelling. For instance, at msari's most stringent cutoff threshold (-29.4) we would have expected to find only 10 significant alignments by chance alone; instead, we found 60 alignments spread among the flagged ortholog classes, many with much higher significances. Thus the majority of the ortholog classes with scores below this threshold are extremely likely to have important mRNA secondary structure.

Discussion

With genome sequencing capacity skyrocketing, comparative methods based on the genomes of many organisms are now feasible, as our scan of the EGO database shows. Moreover, even full-genome scans are already quite feasible: for instance, there are now >100 bacterial genomes available, and yeast could be scanned by using the seven yeast genomes (17) plus six recently sequenced fungus genomes (www.broad.mit.edu/annotation). Given the current efforts to sequence mammalian genomes, even a full-genome scan for secondary structure in the human genome will be possible very soon.

Solitary ncRNA genes do not seem to show statistical traits as distinctive as codon usage frequencies in coding genes (11), but we have demonstrated that multiple candidate orthologs can provide an ensemble with more than enough information to reliably distinguish conservation of secondary structure.

We plan to extend this approach to predict potentially novel ncRNA genes in yeast and higher eukaryotes through MSAs of whole genomes as they become available. With a blast-like (48) approach to searching for reverse-complementary regions, it may also be possible to search for secondary-structure interactions between different genes in this fashion.

We intend to adapt the msari program to automation of comparative secondary-structure prediction. Because msari's score allows for some misalignments, a structure-prediction method based on it may be more accurate than alifold or dcfold when run on automatically generated MSAs. Indeed, it may be sensible to correct MSAs so they respect the misaligned orthologous base pairs found by msari, using comparative structure for postprocessing. msari's estimate for the statistical significance of candidate compensatory mutations also copes more flexibly with varying rates of mutation between sequences than alifold or dcfold. It may also be possible to incorporate this estimate in an improved structure-prediction algorithm.

Finally, we believe that the distribution-mixture approach used to construct msari's null hypotheses could be applied to a broad set of comparative search problems.

Footnotes

† To whom correspondence should be addressed. E-mail: bab{at}mit.edu.

Abbreviations: ncRNA, noncoding RNA; MSA, multiple sequence alignment.

‡ We experimented with windows of lengths 5, 6, 7, 9 and 10 and found msari to be most accurate for windows of length 7.

Freely available online through the PNAS open access option.

Copyright © 2004, The National Academy of Sciences

References

1. Rivas, E. & Eddy, S. (October 10, 2001) BMC Bioinformatics, 10.1186/1471-2105-2-8.
2. Brown, C., Hendrich, B., Rupert, J., Lafreniere, R., Xing, Y., Lawrence, J. & Willard, H. (1992) Cell 71, 527-542. pmid:1423611
3. Hong, Y.-K., Ontiveros, S. & Strauss, W. (2000) Mamm. Genome 11, 220-224. pmid:10723727
4. Eddy, S. (2001) Nat. Rev. Genet. 2, 919-929. pmid:11733745
5. Jansen, R.-P. (2001) Nat. Rev. Mol. Cell Biol. 2, 247-256. pmid:11283722
6. Storz, G. (2002) Science 296, 1260-1263. pmid:12016301
7. Lowe, T. & Eddy, S. (1997) Nucleic Acids Res. 25, 955-964. pmid:9023104
8. Eddy, S. R. & Durbin, R. (1994) Nucleic Acids Res. 22, 2079-2088. pmid:8029015
9. Regalia, M., Rosenblad, M. & Samuelsson, T. (2002) Nucleic Acids Res. 30, 3368-3377. pmid:12140321
10. Rhoades, M., Reinhart, B., Lim, L., Burge, C., Bartel, B. & Bartel, D. (2002) Cell 110, 513-520. pmid:12202040
11. Rivas, E. & Eddy, S. (2000) Bioinformatics 7, 583-605.
12. Burge, C. & Karlin, S. (1997) J. Mol. Biol. 268, 78-94. pmid:9149143
13. Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B. & Lander, E. S. (2000) Genome Res. 7, 950-958.
14. Bafna, V. & Huson, D. (2000) in Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, eds. Bourne, P., Gribskov, M., Altman, R., Jensen, N., Hope, D., Lengauer, T., Mitchell, J., Scheeff, E., Smith, C., Strande, S. & Weissig, H. (Am. Assoc. Artificial Intelligence, Menlo Park, CA), pp. 3-12.
15. Solovyev, V., Salamov, A. & Lawrence, C. (1995) in Proceedings of the Third International Conference on Intelligent Systems in Molecular Biology, ed. Rawlings, C. (Am. Assoc. Artificial Intelligence, Menlo Park, CA), pp. 367-375.
16. Gelfand, M., Mironov, A. & Pevzner, P. (1996) Proc. Natl. Acad. Sci. USA 93, 9061-9066. pmid:8799154
17. Kellis, M., Patterson, N., Birren, B., Berger, B. & Lander, S. (2004) J. Comput. Biol. 11, 315-355.
18. McCutcheon, J. P. & Eddy, S. R. (2003) Nucleic Acids Res. 31, 4119-4128. pmid:12853629
19. di Bernardo, D., Down, T. & Hubbard, T. (2003) Bioinformatics 19, 1606-1611. pmid:12967955
20. Hofacker, I., Fekete, M. & Stadler, P. (2002) J. Mol. Biol. 319, 1059-1066. pmid:12079347
21. Fariza, T., Manolo, G. & Mireille, R. (2002) Comput. Chem. 26, 521-530. pmid:12144180
22. Mathews, D. & Turner, D. (2002) J. Mol. Biol. 317, 191-203. pmid:11902836
23. Sankhoff, D. (1985) SIAM J. Appl. Math. 45, 810-825.
24. Fox, G. & Woese, C. (1975) Nature 256, 505-507. pmid:808733
25. James, B., Olsen, G., Liu, J. & Pace, N. (1988) Cell 52, 19-26. pmid:2449969
26. Pace, N., Smith, D., Olsen, G. & James, B. (1989) Gene 82, 65-75. pmid:2479592
27. Hofacker, I., Fontana, W., Stadler, P., Bonhoeffer, L., Tacker, M. & Schuster, P. (1994) Monatshefte Chem. 125, 167-188.
28. Gelman, A., Carlin, J., Stern, H. & Rubin, D. (2004) Bayesian Data Analysis (Chapman & Hall, London).
29. Bray, N. & Pachter, L. (2003) Nucleic Acids Res. 31, 3525-3526. pmid:12824358
30. Schwartz, S., Elnitski, L., Li, M., Weirauch, M., Riemer, C., Smit, A., Green, E., Hardison, R. & Miller, W. (2003) Nucleic Acids Res. 31, 3518-3524. pmid:12824357
31. Brudno, M., Do, C., Cooper, G., Kim, M., Davydov, E., Green, E., Sidow, A. & Batzoglou, S. (2003) Genome Res. 13, 721-731. pmid:12654723
32. Anson, E. & Myers, E. (1997) Comput. Biol. 4, 369-383.
33. Gorodkin, J., Knudsen, B., Zwieb, C. & Samuelsson, T. (2001) Nucleic Acids Res. 29, 169-170. pmid:11125080
34. Brown, J. (1999) Nucleic Acids Res. 27, 314. pmid:9847214
35. Lee, Y., Sultana, R., Pertea, G., Cho, J., Karamycheva, S., Tsai, J., Parvizi, B., Cheung, F., Antonescu, V., White, J., et al. (2002) Genome Res. 12, 493-502. pmid:11875039
36. Landthaler, M. & Shub, D. (2003) Nucleic Acids Res. 31, 3071-3077. pmid:12799434
37. Bourdeau, V., Ferbeyre, G., Pageau, M., Paquin, B. & Cedergren, R. (2000) Nucleic Acids Res. 27, 4457-4467.
38. Fraboulet, S., Boudouresque, F., Delfino, C. & Ouafik, L'H. (1998) Endocrinology 139, 894-904. pmid:9492018
39. Nocker, A., Hausherr, T., Balsiger, S., Krstulovic, N.-P., Hennecke, H. & Narberhaus, F. (2001) Nucleic Acids Res. 29, 4800-4807. pmid:11726689
40. McCarthy, T., Siegel, E., Mroczkowski, B. & Heywood, S. (1983) Biochemistry 22, 935-941. pmid:6188483
41. Guan, K. & Weiner, H. (1989) J. Biol. Chem. 264, 17764-17769. pmid:2808349
42. Pelchat, M. & Lapointe, J. (1999) RNA 5, 281-289. pmid:10024179
43. Yen, T., Machlin, P. & Cleveland, D. (1988) Nature 334, 580-585. pmid:3405308
44. Cowan, N., Dobner, P., Fuchs, E. & Cleveland, D. (1983) Mol. Cell Biol. 3, 1738-1745. pmid:6646120
45. de la Cruz, B., Prieto, S. & Scheffler, I. (2002) Yeast 19, 887-902. pmid:12112242
46. Kislauskis, E., Zhu, X. & Singer, R. (1994) J. Cell Biol. 127, 441-451. pmid:7929587
47. Reenan, R., Hanrahan, C. & Ganetzky, B. (2000) Neuron 25, 139-149. pmid:10707979
48. Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D. (1990) J. Mol. Biol. 215, 403-410. pmid:2231712