Sequence context-specific profiles for homology searching

Edited by David Baker, University of Washington, Seattle, WA, and approved January 9, 2009 (received for review October 24, 2008)

Abstract

Sequence alignment and database searching are essential tools in biology because a protein's function can often be inferred from homologous proteins. Standard sequence comparison methods use substitution matrices to find the alignment with the best sum of similarity scores between aligned residues. These similarity scores do not take the local sequence context into account. Here, we present an approach that derives context-specific amino acid similarities from short windows centered on each query sequence residue. Our results demonstrate that the sequence context contains much more information about the expected mutations than just the residue itself. By employing our context-specific similarities (CS-BLAST) in combination with NCBI BLAST, we increase the sensitivity more than 2-fold on a difficult benchmark set, without loss of speed. Alignment quality is likewise improved significantly. Furthermore, we demonstrate considerable improvements when applying this paradigm to sequence profiles: Two iterations of CSI-BLAST, our context-specific version of PSI-BLAST, are more sensitive than 5 iterations of PSI-BLAST. The paradigm for biological sequence comparison presented here is very general. It can replace substitution matrices in sequence- and profile-based alignment and search methods for both protein and nucleotide sequences.

Keywords: alignment, pseudocounts, substitution matrix, context-sensitive

Substitution matrices quantify the similarity between amino acids or nucleotides (1–3). As a mainstay of biological sequence comparison, they are at the heart of standard alignment methods such as the Needleman–Wunsch and Smith–Waterman algorithms (4, 5), which find the alignment with the maximum sum of similarity scores between aligned residues or bases. Sequence-search programs such as BLAST and FASTA (6, 7) use substitution matrices to score short seeds and final alignments, multiple alignment programs such as CLUSTALW (8) employ them in sum-of-pairs scoring to quantify the similarity between aligned sequence-profile columns, and in sequence profile-based methods such as PSI-BLAST (9) or HHsearch (10) they are used for calculating pseudocounts (11, 12).

For proteins, the importance of substitution matrices for identifying homologs and calculating accurate alignments has stimulated various advances. Yu et al. (13) have developed a rationale for compositional adjustment of amino acid substitution matrices by transforming the background frequencies implicit in a substitution matrix to frequencies appropriate for the comparison of protein sequences with nonstandard global amino acid composition. Others have derived specialized transmembrane substitution matrices from alignments of experimentally verified or predicted transmembrane segments to improve alignments of sequences with transmembrane regions (14–16). The logic is that the structural environment of an amino acid residue partly determines which amino acids it is likely to mutate into.

Taking this idea a step further, so-called structure-dependent substitution matrices have been trained for a number of environments, defined by a combination of secondary structure state, solvent accessibility class, environmental polarity class, and/or hydrogen bonding (17–20). EvDTree (21) also computes structure-dependent substitution scores, but the selected structural descriptors depend on residue types. All of these structural environment-dependent matrices allow for the detection of more homologous proteins than standard substitution matrices. However, their application is limited by the need to know the structure of one of the proteins to be compared.

In contrast, sequence context-dependent methods do not rely on 3D structure information to define local environments. They describe the environment of a residue by the sequence surrounding it. Jung and Lee trained several 400 × 400 substitution matrices for contexts consisting of pairs of residues up to 4 positions apart and obtained a 30% increase in sensitivity on a set of 107 proteins (22), although this result could not be confirmed in a large-scale study (23). Gambin et al. derived 400 substitution matrices, one for each context consisting of the 2 residues neighboring the central residue (24, 25). PHYBAL (26) models the selective pressure inside and outside of hydrophobic blocks by 2 different substitution matrices and 2 different sets of gap penalties.

Huang et al. (27) took a decisive step forward, employing 281 substitution matrices for 281 states of a hidden Markov model (HMM) trained on sequences of known structure. Each HMM state represents a single profile column. Context information is encoded essentially in the transition probabilities between the states. By mixing mutation probabilities from the substitution matrices, weighted by posterior probabilities for the corresponding HMM states, HMMSUM achieved considerable improvements in alignment quality when compared with standard substitution matrices (27). We expect such sequence contexts to predict mutation probabilities better than structural environments, because very different sequences with very specific amino acid preferences can adopt similar local structures (28). When all of these sequences are pooled into the same structural environment, the specific amino acid preferences are lost.

In this work, we present a new method that derives sequence context-specific amino acid similarities from 13-residue windows centered on each residue. We predict the expected mutation probabilities for each position by comparing its sequence window to a library with thousands of context profiles, generated by clustering a large, representative set of sequence-profile windows. The mutation probabilities are obtained by weighted mixing of the central columns of the most similar context profiles (see Fig. 1B). Whereas iterative profile search tools such as PSI-BLAST align homologous, long sequence matches to the query with weights independent of the quality of the match, our method aligns mostly nonhomologous, ungapped, short profiles, giving higher weights to better matching profiles. In contrast to HMMSUM, no substitution matrices are needed. Also, the context information is encoded explicitly in the context profiles with no need for transition probabilities. This leads to a simpler computation and to a much better runtime that scales linearly instead of quadratically with the number of states/contexts (see Discussion). The context library can therefore be many times larger, and hence finer-grained, than in HMMSUM, enabling us to describe contexts as specific as “a large aliphatic residue with preference for I or M on the hydrophobic face of an amphipathic α-helix,” for example.

Fig. 1. Method of context-specific sequence comparison. (A) Sequence search/alignment algorithms find the path that maximizes the sum of similarity scores (color-coded blue to red). Substitution matrix scores are equivalent to profile scores if the sequence profile (colored histogram) is generated from the query sequence by adding artificial mutations with the substitution matrix pseudocount scheme. Histogram bar heights represent the fraction of amino acids in profile columns. (B) Computation of context-specific pseudocounts. The expected mutations (i.e., pseudocounts) for a residue (highlighted in yellow) are calculated based on the sequence context around it (red box). Library profiles contribute to the context-specific sequence profile with weights determined by their similarity to the sequence context (see percentages). The resulting profile can be used to jump-start PSI-BLAST, which will then perform a sequence-to-sequence search with context-specific amino acid similarities. (C) Positional window weights are chosen to decrease exponentially with the distance from the center position to model the decreasing information value of farther positions for the central profile column.

A crucial insight for achieving speeds comparable to substitution matrix-based methods such as BLAST is this: Sequence-to-sequence comparison by using a substitution matrix is exactly equivalent to profile-to-sequence comparison, if the sequence profile is calculated from one of the sequences by using full substitution matrix pseudocounts. Hence, we can employ profile-based methods, which run at speeds similar to those of their sequence-based counterparts, to implement sequence context-specific amino acid similarities.

CS-BLAST, our context-specific version of BLAST, works in the following way. We generate a sequence profile for the query sequence by using context-specific pseudocounts and then jump-start NCBI's profile-to-sequence search method PSI-BLAST with this profile. We demonstrate that, on a difficult benchmark set, sequence searches with our new context-specific amino acid similarities are more than twice as sensitive as BLAST with the standard BLOSUM62 substitution matrix, produce higher-quality alignments, and generate reliable E-values, all without loss of speed.

Finally, we apply the new paradigm to profile-to-sequence comparison by calculating context-specific pseudocounts for sequence profiles. The only difference from the previously described sequence-based scheme is that we now compare sequence-profile windows with our library of context profiles. In contrast to substitution matrix and Dirichlet pseudocounts (11, 12, 29, 30), these pseudocounts depend not only on the single profile column but also on the entire sequence context of the profile column. We report considerable improvements of this context-specific scheme (CSI-BLAST) over PSI-BLAST.

Results

We first show that amino acid substitution scores are directly related to pairwise amino acid mutation probabilities and sequence-profile pseudocounts. We can therefore derive sequence context-specific amino acid similarity scores from context-specific mutation probabilities. These mutation probabilities can be predicted with a probabilistic model by using a large library of sequence-profile windows representing very specific local sequence contexts.

Any matrix of substitution scores S(x, y) describing the similarity between amino acids x and y can be written in the form (31) S(x, y) = const × log [P(x, y)/(P(x) P(y))], where P(x, y) is the probability that x and y occur aligned to each other in an alignment of homologous sequences, and P(x) and P(y) are the background probabilities of x and y to occur in representative sequences (whether aligned or unaligned). Up to the constant factor, this can also be written as a log odds score, S(x, y) = log [P(y|x)/P(y)], where P(y|x) = P(x, y)/P(x) is the conditional probability of y given x, i.e., the probability for amino acid x to mutate into y. If y occurs more often in positions aligned with an x (described by P(y|x)) than would be expected by chance (described by P(y)), then the score is positive, otherwise negative.
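The log-odds relation can be made concrete with a short sketch. The two-letter alphabet and the joint probabilities below are invented purely for illustration, and `log_odds_scores` is a hypothetical helper name, not part of any published tool.

```python
import math

def log_odds_scores(joint, scale=1.0):
    """Return S(x, y) = scale * log[P(x, y) / (P(x) P(y))] for a joint table."""
    letters = sorted({x for x, _ in joint})
    # Background probabilities are taken as the marginals of the joint table.
    background = {x: sum(joint[(x, y)] for y in letters) for x in letters}
    return {
        (x, y): scale * math.log(joint[(x, y)] / (background[x] * background[y]))
        for x in letters
        for y in letters
    }

# Toy joint probabilities (illustrative values only, not from a real matrix).
toy_joint = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
scores = log_odds_scores(toy_joint)
```

In this toy table, identical letters are aligned more often than chance would predict, so their score is positive, whereas the mismatch score is negative.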

We next explore the connection of mutation probabilities P(y|x) with sequence-profile pseudocounts. A sequence profile is a matrix p(i, y) that succinctly represents a multiple alignment of homologous sequences: p(i, y) is the frequency of amino acid y in column i of the multiple alignment. The profile describes what amino acids are likely to occur in related sequences at each position, or, in other words, the probability of a residue at position i to mutate into amino acid y. A single sequence (x_i) can be turned into a sequence profile by adding artificial mutations (i.e., pseudocounts) with the method of substitution matrix pseudocounts (11, 12): p(i, y) = P(y|x_i). Here, P(y|x_i) are the conditional probabilities giving rise to substitution matrix S(x, y). The profile-to-sequence score of column i of this single-sequence profile p with residue y_j of a sequence (y_j) is

$$S(p(i,\cdot), y_j) = \log \frac{p(i, y_j)}{P(y_j)} = \log \frac{P(y_j \mid x_i)}{P(y_j)} = S(x_i, y_j). \tag{1}$$

Hence, substitution matrix scores can be seen as a special case of profile-to-sequence scores, where the profile is generated from one of the sequences by using substitution matrix pseudocounts.
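As a minimal sketch of this equivalence, the following code builds a single-sequence profile from conditional probabilities and checks that its profile-to-sequence scores coincide with the log-odds substitution scores. The toy alphabet, the joint probabilities, and all function names are assumptions made for illustration.

```python
import math

def conditional(joint, letters):
    """P(y|x) = P(x, y) / P(x), with P(x) taken as the marginal of the joint table."""
    background = {x: sum(joint[(x, y)] for y in letters) for x in letters}
    cond = {x: {y: joint[(x, y)] / background[x] for y in letters} for x in letters}
    return cond, background

def single_sequence_profile(seq, cond):
    """Profile with full substitution-matrix pseudocounts: p(i, y) = P(y | x_i)."""
    return [dict(cond[x]) for x in seq]

def profile_score(column, y, background):
    """Profile-to-sequence score log[p(i, y) / P(y)]."""
    return math.log(column[y] / background[y])

# Toy data (illustrative values only).
letters = "AB"
toy_joint = {("A", "A"): 0.5, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.3}
cond, background = conditional(toy_joint, letters)
query = "ABBA"
profile = single_sequence_profile(query, cond)
```

Scoring column i of `profile` against a residue y gives the same value as the substitution score S(x_i, y), which is why a profile-based search engine can stand in for a matrix-based one.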

Fig. 1A illustrates the equivalence of sequence-to-sequence and profile-to-sequence scoring with the alignment matrix of 2 zinc-finger sequences (x_i) and (y_j). The query profile resulting from the artificial mutations is illustrated as a histogram, in which the bar heights are proportional to the corresponding amino acid probabilities p(i, y). The score of each matrix cell (i, j) can be interpreted in 2 ways: either as sequence-to-sequence score S(x_i, y_j) between residues x_i and y_j, or as profile-to-sequence score S(p(i,·), y_j) between profile column p(i,·) and residue y_j.

In the above schemes, the expected mutation probabilities P(y|x_i) at position i depend only on the single amino acid x_i. However, the sequence context X_i, defined below, contains much more information than just residue x_i itself about what amino acids to expect in related sequences. If we were able to calculate a context-specific mutation probability P(y|X_i), we could define a score in a way analogous to Eq. 1, but by using a context-specific profile p_cs(i, y) = P(y|X_i) instead of P(y|x_i).

The context X_i is defined as the window of l residues surrounding x_i, i.e., X_i = (x_{i−d}, …, x_{i+d}) with l = 2d + 1. To predict the mutation probabilities for each position i, we compare its sequence window X_i with a precomputed library of K context profiles, p_1, …, p_K, each of length l. The context-specific mutation probability P(y|X_i), i.e., the probability of observing amino acid y in a homologous sequence given context X_i, will be calculated by a weighted mixing of the amino acids in the central columns of the most similar context profiles (Fig. 1B). To derive the weight of each profile p_k, we first need the probability P(X_i|p_k) that the sequence window X_i is emitted by profile p_k, which is equal to the product of probabilities for x_{i+j} (j ∈ {−d, …, d}) being emitted by profile column p_k(j,·): P(X_i|p_k) = ∏_{j=−d}^{d} p_k(j, x_{i+j}). Because the inner positions in the window will be most informative to predict the amino acid distribution for the central residue, we can refine the above formula by defining coefficients w_j, which weight the contribution of each window position: P(X_i|p_k) ∝ ∏_{j=−d}^{d} p_k(j, x_{i+j})^{w_j}. The values of w_j are parameterized by w_center and β (see Fig. 1C). (For i within d residues from either end of (x_i), the product runs only over those j for which x_{i+j} is defined.)
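The weighted window probability can be sketched in log space as follows. The exact functional form of the weights is not given in this section; the code assumes w_j = w_center · β^|j|, which matches the exponential decay described for Fig. 1C but remains an assumption. The profile representation (a mapping from offset j to an amino acid distribution) and the function names are likewise illustrative.

```python
import math

def window_weights(d, w_center, beta):
    """Assumed form w_j = w_center * beta**|j| for offsets j = -d..d."""
    return {j: w_center * beta ** abs(j) for j in range(-d, d + 1)}

def log_emission(seq, i, profile, weights):
    """Log of prod_j profile[j][x_{i+j}]**w_j.

    Offsets that fall off either end of the sequence are skipped.
    `profile` maps offset j to a dict {amino acid: probability}.
    """
    total = 0.0
    for j, w in weights.items():
        if 0 <= i + j < len(seq):
            total += w * math.log(profile[j][seq[i + j]])
    return total

# Weights for a 13-residue window (d = 6) with the optimization-set parameters.
weights = window_weights(6, 1.6, 0.85)

# Tiny toy profile of length 3 over a two-letter alphabet (illustrative only).
toy_profile = {j: {"A": 0.5, "C": 0.5} for j in (-1, 0, 1)}
toy_weights = window_weights(1, 1.0, 0.5)
toy_log_p = log_emission("AC", 0, toy_profile, toy_weights)
```

Working with logarithms avoids numerical underflow when the product runs over many columns and many profiles.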

Next, we need to know the probability P(p_k|X_i) that profile p_k was the one that emitted X_i. Using Bayes' theorem, we find

$$P(p_k \mid X_i) = \frac{P(X_i \mid p_k)\, P(p_k)}{P(X_i)}.$$

P(p_k) is the Bayesian prior probability for profile p_k, determined in the process of computing the profile library [supporting information (SI) Appendix]. It quantifies the probability that a sequence window is emitted by profile p_k prior to knowing that sequence window. P(X_i) = Σ_k P(X_i|p_k) P(p_k) is a normalization constant.

We can now calculate the context-specific mutation probabilities P(y|X_i) by mixing the amino acid distributions p_k(0, y) from the central columns of all K profiles with weights P(p_k|X_i):

$$P(y \mid X_i) = \sum_{k=1}^{K} p_k(0, y)\, P(p_k \mid X_i) \propto \sum_{k=1}^{K} p_k(0, y)\, P(p_k) \prod_{j=-d}^{d} p_k(j, x_{i+j})^{w_j}. \tag{2}$$

Normalizing over all 20 amino acids yields the expected mutation probability P(y|X_i). To have more flexibility in adjusting the diversity of the context-specific profile p_cs(i,·), we mutate only a fraction τ ∈ [0,1] of (x_i) while leaving a fraction 1 − τ unchanged:

$$p_{cs}(i, y) = (1 - \tau)\, \delta_{x_i, y} + \tau\, P(y \mid X_i). \tag{3}$$

Here, δ_{x_i, y} = 1 if x_i = y and 0 otherwise. In principle, τ needs to be optimized depending on the evolutionary distance over which homologous sequences are to be found, in a similar way as the substitution matrix with optimum diversity might be chosen. In practice, we have found that, as with substitution matrices, a single diversity works well for the entire range of evolutionary distances (SI Appendix).
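A minimal sketch of the posterior weighting, the mixing of central columns, and the admixture of the query residue is given below. The two-profile library, its numbers, and the function names are invented for illustration; a real library would hold thousands of profiles over 20 amino acids.

```python
import math

def posteriors(log_emissions, priors):
    """P(p_k|X_i) by Bayes' theorem, normalized in log space for stability."""
    logs = [le + math.log(pr) for le, pr in zip(log_emissions, priors)]
    top = max(logs)
    unnorm = [math.exp(v - top) for v in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def context_specific_column(x_i, central_columns, post, tau):
    """(1 - tau) * delta(x_i, y) + tau * normalized sum_k P(p_k|X_i) p_k(0, y)."""
    letters = list(central_columns[0].keys())
    mixed = {y: sum(w * col[y] for w, col in zip(post, central_columns))
             for y in letters}
    norm = sum(mixed.values())
    return {
        y: (1.0 - tau) * (1.0 if y == x_i else 0.0) + tau * mixed[y] / norm
        for y in letters
    }

# Toy library of two context profiles (illustrative values only).
toy_log_emissions = [math.log(0.6), math.log(0.2)]
toy_priors = [0.5, 0.5]
toy_central = [{"C": 0.9, "S": 0.1}, {"C": 0.2, "S": 0.8}]

post = posteriors(toy_log_emissions, toy_priors)
column = context_specific_column("C", toy_central, post, tau=0.9)
```

In this toy case the first profile receives a posterior weight of 0.75, so the resulting column stays dominated by cysteine, mirroring the zinc-finger example.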

Fig. 1B illustrates the calculation of expected mutation probabilities P(y|X_i) for a cysteine residue (highlighted in yellow) at position i belonging to a zinc-finger motif. Three profiles similar to the sequence window X_i (red box) are shown, whose central columns contribute to the context-specific sequence profile p(i, y) = P(y|X_i) at position i with weights P(p_k|X_i) of 7%, 60%, and 3%, respectively. With the resulting profile (Lower), a profile-to-sequence search can be performed, e.g., by using PSI-BLAST, which is equivalent to a sequence search with context-specific amino acid similarity scores (Eq. 1). In this example, the context-specific scheme recognizes the sequence context of the cysteine and correctly assigns a zinc-finger profile a high weight, resulting in a highly conserved cysteine.

The context-specificity paradigm is not restricted to sequences but applies equally well to sequence profiles or profile hidden Markov models (HMMs) (Materials and Methods). It can therefore be used in profile-to-sequence (9, 32, 33) and profile-to-profile (8, 10, 34–37) comparison, for example.

Our method CS-BLAST for context-specific protein sequence searching is a simple extension of BLAST. First, a context-specific sequence profile is generated for the query sequence as described. This step is very fast. Then PSI-BLAST is jump-started with this profile. PSI-BLAST is extended to the context-specific case in an analogous way (CSI-BLAST) (Materials and Methods).

Benchmark

The homology detection performance of our context-specific method CS-BLAST and standard NCBI BLAST is evaluated on a benchmark dataset derived from SCOP 1.73 (38), filtered to a maximum pairwise sequence identity of 20% (SCOP20, 6,616 domains). SCOP is a database of protein domains with known structure, hierarchically ordered by class, fold, superfamily, and family. Following a standard procedure, we consider all domains from the same superfamily to be homologous (true positives) and all pairs from different SCOP folds to be nonhomologous (false positives). Domain pairs from the same fold but different superfamilies are ignored.
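The pair-labeling rule can be sketched with SCOP concise classification strings of the form class.fold.superfamily.family (for example `a.1.1.2`). The function name and the returned labels are hypothetical choices for this illustration.

```python
def classify_pair(sccs_a, sccs_b):
    """Label a domain pair from two SCOP classification strings.

    Same superfamily    -> "true_positive"
    Different folds     -> "false_positive"
    Same fold, different superfamily -> "ignored"
    """
    parts_a, parts_b = sccs_a.split("."), sccs_b.split(".")
    if parts_a[:3] == parts_b[:3]:
        return "true_positive"
    if parts_a[:2] != parts_b[:2]:
        return "false_positive"
    return "ignored"

labels = [
    classify_pair("a.1.1.1", "a.1.1.2"),  # same superfamily
    classify_pair("a.1.1.1", "a.1.2.1"),  # same fold, different superfamily
    classify_pair("a.1.1.1", "b.1.1.1"),  # different fold
]
```

Ignoring the same-fold, different-superfamily pairs keeps ambiguous cases, where remote homology cannot be ruled out, from counting as errors.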

We randomly assign members of every fifth fold in SCOP20 to the optimization set (1,329 domains), the others to the test set (5,287 domains). By using the optimization set, we determined the best values for the pseudocount admixture (τ = 0.9) and the window weights (w_center = 1.6, β = 0.85). The values for the window length (l = 13) and the context library size (K = 4,000) are a trade-off between sensitivity and time efficiency (see SI Appendix).

We perform an all-against-all comparison of the test-set domains and count the true and false positive hits at various E-value thresholds (Fig. 2A). To prevent a few large families from dominating the benchmark, we weight each true and false positive pair with 1/(size of SCOP family of first domain). Compared with NCBI BLAST (version 2.2.19, BLOSUM62, default parameters), CS-BLAST detects 139% more homologs at a cumulative error rate of 20%, 138% more at 10%, and, for the easiest cases at 1% error rate, still 96% more. To get an idea of the upside potential when parameters are trained on a larger set, we optimized w_center, β, and τ directly on the test set (red broken trace). These parameters (w_center = 1.3, β = 0.9, and τ = 0.95) are used in the official version of CS-BLAST.
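The family-size weighting can be illustrated with a small sketch that accumulates weighted true and false positives while walking through hits in order of increasing E-value. The hit tuples, family names, and sizes are made up for the example, and `weighted_roc` is a hypothetical name.

```python
def weighted_roc(hits, family_size):
    """Cumulative (weighted false positives, weighted true positives) points.

    `hits` holds tuples (e_value, family_of_first_domain, is_true_positive).
    Each pair is weighted by 1 / (size of the SCOP family of the first domain).
    """
    tp = fp = 0.0
    curve = []
    for e_value, family, is_tp in sorted(hits, key=lambda h: h[0]):
        weight = 1.0 / family_size[family]
        if is_tp:
            tp += weight
        else:
            fp += weight
        curve.append((fp, tp))
    return curve

# Toy data (illustrative only).
toy_family_size = {"f1": 4, "f2": 1}
toy_hits = [
    (1e-3, "f2", True),
    (1e-5, "f1", True),
    (0.1, "f1", False),
]
curve = weighted_roc(toy_hits, toy_family_size)
```

With this weighting, a family of four domains contributes at most as much per query as a singleton family, so large families cannot dominate the curve.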

Fig. 2. Context information improves search performance and alignment quality. (A) Homology detection benchmark on SCOP20 dataset: true positives (pairs from the same SCOP superfamily) versus false positives (pairs from different folds). CS-BLAST detects 138% more true positives than BLAST at 10% error rate. (B) CS-BLAST has better average alignment sensitivity and precision than BLAST over the entire range of sequence identities of the aligned pairs. (C) Actual versus reported E-values on the SCOP20 dataset show that CS-BLAST E-values are too optimistic by a factor of 3 to 5. (D) Same benchmark as A (note different y-scales), but comparing CSI-BLAST with PSI-BLAST for one to five iterations. Two CSI-BLAST iterations are more sensitive than five PSI-BLAST iterations.

To assess the alignment quality, we compare predicted sequence alignments to gold-standard structural alignments generated by TM-align (39). We start by randomly picking up to 10 domain pairs from each family in SCOP 1.73, requiring a maximum sequence identity of 30%, and aligning each pair with TM-align. Domain pairs that are not well superposable (TM-align score < 0.6) are discarded. This results in 11,457 domain pairs from 5,747 different domains. With each of the 5,747 domains we perform a CS-BLAST and NCBI BLAST search against a database consisting of all domains belonging to the same family as the query domain and evaluate the quality of the predicted alignments for those pairs with a structural reference alignment. The alignment quality is assessed by 2 standard performance measures: Alignment sensitivity is the fraction of structurally aligned residue pairs that are correctly predicted, i.e., pairs correctly aligned/pairs structurally alignable. Alignment precision is defined as the fraction of aligned residue pairs in the predicted alignment that are correct, i.e., pairs correctly aligned/pairs aligned. Fig. 2B plots alignment sensitivity and precision for various sequence identity bins. CS-BLAST is able to improve the BLAST results over the entire range of sequence identities, especially for the difficult alignments. Very similar results are obtained when reference alignments are generated with DALI (40).
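The two quality measures can be sketched directly from their definitions, representing each alignment as a set of aligned residue index pairs (query position, target position). The toy alignments and the function name are illustrative.

```python
def alignment_quality(predicted_pairs, reference_pairs):
    """Return (sensitivity, precision) of a predicted alignment.

    sensitivity = pairs correctly aligned / pairs structurally alignable
    precision   = pairs correctly aligned / pairs aligned in the prediction
    """
    predicted, reference = set(predicted_pairs), set(reference_pairs)
    correct = len(predicted & reference)
    sensitivity = correct / len(reference) if reference else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return sensitivity, precision

# Toy alignments (illustrative only).
reference = {(1, 1), (2, 2), (3, 3), (4, 4)}
predicted = {(1, 1), (2, 2), (3, 4)}
sensitivity, precision = alignment_quality(predicted, reference)
```

Here two of the four reference pairs are recovered (sensitivity 0.5), and two of the three predicted pairs are correct (precision 2/3).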

Another critical aspect for database search tools is the reliability of the reported E-values. The E-value of a match is an estimate of the number of chance hits to be expected with a score better than that of the database match. We check the reliability of CS-BLAST E-values by using the all-against-all searches of Fig. 2A. We count the number of false positives at a given E-value threshold, which, together with the size of the benchmark database, allows us to derive the actual E-value. Fig. 2C plots the actual against the reported E-value. NCBI BLAST's reported E-values are nearly identical to the observed ones. CS-BLAST E-values are too optimistic by a factor of ≈3–5, e.g., a reported E-value of 10^−3 corresponds to an E-value of 5 × 10^−3. Considering that this deviation is quite small and that it changes little with E-value, it should be easy to accommodate in practice.
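The text does not spell out the formula for the actual E-value. One plausible reading, assumed here, is the average number of false positives per query at a given reported threshold, with the number of queries equal to the benchmark database size in an all-against-all search. The function name and the numbers are hypothetical.

```python
def observed_e_value(false_positive_e_values, threshold, num_queries):
    """Average number of false positives per query with reported E-value <= threshold.

    Assumption: in an all-against-all benchmark, every database sequence is
    used once as a query, so num_queries equals the database size.
    """
    count = sum(1 for e in false_positive_e_values if e <= threshold)
    return count / num_queries

# Toy data (illustrative only): reported E-values of known false positives.
toy_false_positives = [1e-4, 5e-4, 2e-3, 0.5]
actual = observed_e_value(toy_false_positives, threshold=1e-3, num_queries=1000)
optimism_factor = actual / 1e-3
```

A ratio of actual to reported E-value above 1 indicates that reported values are too optimistic; in this toy case the factor is 2.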

Finally, we evaluate the homology detection performance of CSI-BLAST, the context-specific version of PSI-BLAST, on the benchmark of Fig. 2A. Because, typically, PSI-BLAST searches are done with a large sequence database, such as the nonredundant protein database (NR) at NCBI (41), to build diverse profiles, only the last search is performed against our benchmark database; all previous iterations use the full NR database (E-value inclusion thresholds set to 1 × 10^−3 for PSI-BLAST and 2 × 10^−4 for CSI-BLAST). Fig. 2D plots true positives versus false positives detected by PSI-BLAST and CSI-BLAST after up to 5 search iterations. (The trace for CSI-BLAST with 5 iterations has been omitted because it does not significantly improve over 3 iterations anymore. The traces for one iteration are the same as in Fig. 2A.) Remarkably, 2 iterations of CSI-BLAST are more sensitive than 5 rounds of standard PSI-BLAST (≈15% more homologs detected). This result surprised us. We had expected that context-specific profiles would only marginally improve sensitivity over standard sequence profiles, because profiles already contain family- and position-specific mutation rates. However, the lead of CS-BLAST over BLAST is even extended in absolute terms after the second iteration, demonstrating that the context profiles contain local information from analogous sequences (i.e., with similar sequence context) that is partially independent of information from the homologous sequences in the profile.

Example: Activation Domain of SOX-9

Fig. 1B gave an example in which the context-specific method led to above-average conservation of Zn-finger cysteines. In practice, it will be equally important to be able to guess which residues are conserved less than average. As an example, Fig. 3 presents profiles of a region from the activation domain of human SOX-9 transcription factor, generated with substitution matrix pseudocounts (Left) and context-specific pseudocounts (Right). Because this region is natively disordered, its sequence is only very weakly conserved. The substitution matrix method assigns the same amino acid distribution to the prolines as it would to a proline in a globular domain. The context-specific method, however, mixes the pseudocounts mainly from contexts that are also disordered, weakly conserved, and have a similar, biased amino acid distribution. Therefore, its profile exhibits below-average conservation of prolines, alanines, and glutamines while having higher overall probabilities for these residues.

Fig. 3. Proline-rich region in human transcription factor SOX-9. The mutation profile computed with substitution matrix pseudocounts (Left) overestimates the conservation in this region. The context-specific profile (Right) shows weaker conservation of prolines, alanines, and glutamines, and increased presence of these residues in neighboring columns.

Discussion

Sequence context is much more powerful than a single residue in predicting which amino acids that particular residue is likely to mutate into (Fig. 2 A and B). Because this context information is as easy to obtain as the sequence itself, it is surprising that sequence context is practically never exploited. The main reason seems to be the focus of past research on structural context, with its limitation to proteins of known structures (17, 18, 20, 21, 27). Another reason may be the challenge to develop sequence context-specific methods that can compete with traditional context-free methods such as BLAST and PSI-BLAST in speed and usability (26, 27). We have shown how context-specific pseudocounts can be used in combination with existing profile-based methods to extend residue-centered sequence comparison to the context-specific case, without loss of speed or usability.

As examples, we have built context-specific versions of BLAST and PSI-BLAST that considerably improve their performance at very little runtime overhead. For a typical protein of length L = 250 and a library size of K = 4,000, the computation of the context-specific profile requires ≈1 s CPU time. Also, runtime scales favorably, T ∝ K l L (SI Appendix, Fig. S1). (Note that HMMSUM's runtime scales as T ∝ K^2 L, which places a strict practical limit on the number K of states/contexts in HMMSUM.) Because the output of CS-BLAST and CSI-BLAST is generated by the BLAST and PSI-BLAST programs themselves, users do not have to get accustomed to different command line options or output formats, and updates to the BLAST package will directly benefit the context-specific versions. The only caveat is that E-values need to be corrected by a factor of 3 to 5 (Fig. 2C). We expect CS-BLAST to be useful to find homologs for singleton sequences, because, for these, the lack of homologs precludes the use of profile-to-sequence search methods such as PSI-BLAST.
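To give a feel for the two scaling laws, the following back-of-the-envelope sketch counts elementary operations for the stated typical values. It ignores all proportionality constants, so the resulting ratio is only indicative; the function names are hypothetical.

```python
def context_profile_ops(K, l, L):
    """Operation count proportional to K * l * L (context-profile scheme)."""
    return K * l * L

def quadratic_state_ops(K, L):
    """Operation count proportional to K**2 * L (scaling quoted for HMMSUM)."""
    return K * K * L

K, l, L = 4000, 13, 250
context_ops = context_profile_ops(K, l, L)
quadratic_ops = quadratic_state_ops(K, L)
ratio = quadratic_ops / context_ops  # equals K / l
```

For K = 4,000, l = 13, and L = 250, this gives 1.3 × 10^7 versus 4 × 10^9 operations, a ratio of K/l ≈ 308, which illustrates why a quadratic dependence on K limits the library size.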

A pleasant surprise is the extent of improvements of sequence profiles through context-specific pseudocounts (Fig. 2D), even though profiles already contain evolutionary information on position- and family-specific mutation probabilities. Hence, the information from locally similar, analogous sequences that is contained in the context profiles is at least partly orthogonal to the evolutionary information in the homologous sequences that contribute to the sequence profiles. Consequently, we can expect improvements when applying the new paradigm to the pairwise comparison of sequence profiles (34–36) and profile HMMs (10, 37), or to hierarchical multiple sequence alignment programs (8, 42).

It is possible to extend Dirichlet mixture pseudocounts (29, 30) to the context-specific case. This would yield an alternative formulation of context-specific sequence comparison that is worth exploring. In that scheme, the context library would have K metaprofiles, i.e., multicolumn pseudocount priors. Each metaprofile would consist of l Dirichlet distributions and would be able to emit a profile with l columns. An advantage over the presented scheme might be that the diversity of each column in the metaprofiles is encoded by one additional parameter per column (the sum of all pseudocounts in a column), which might lead to better modeling of the profile contexts.

The paradigm presented here should be easily transferable to nucleotide sequences. The application to noncoding regions such as promoter regions and regions harboring putative noncoding RNAs (ncRNAs) is of particular interest. The low information content of nucleotide sequences and the often weak overall conservation in these regions render alignments between related species difficult, whereas reliable alignments offer enormous potential to identify functional regions (such as cis-regulatory elements or ncRNAs) through their interspecies conservation (see, e.g., ref. 43).

In summary, the paradigm of sequence context specificity offers greatly improved sensitivity and alignment quality in protein sequence comparison and is likely to hold similar advantages for nucleotide sequences. We believe that these advantages are sufficient to warrant a paradigm shift in biological sequence comparison, alignment, and molecular evolution from amino acid- and nucleotide-centric to context-specific methods.

Materials and Methods

Generalization to Sequence Profiles.

To apply the paradigm to sequence profiles and profile HMMs, we show how to generalize the calculation of pseudocounts from the single sequence case in Eq. 2 to the case of sequence alignments, from which the profile is derived. In analogy to the sequence context X_i, we define the context of the query alignment at position i as Q_i = (c_q(i − d,·), …, c_q(i + d,·)), where c_q(j, x) are the counts of amino acid x at position j of the query alignment. These counts are obtained from the sequence profile q(j, x) by multiplying with the effective number of sequences N_q(j) at position j in the query alignment: c_q(j, x) = N_q(j) q(j, x) (see SI Appendix for details). We now merely need to show how to generalize P(X_i|p_k) to P(Q_i|p_k), because all other transformations leading to Eq. 2 remain essentially unchanged. To derive P(Q_i|p_k), we model the amino acid counts c_q(i) with multinomial distributions. Because N_q(j) can be real-valued, however, we replace the factorials in the multinomial distribution by Gamma functions (n! = Γ(n + 1)):

$$P(Q_i \mid p_k) = \prod_{j=-d}^{d} \left[ \frac{\Gamma(N_q(i+j) + 1)}{\prod_{x} \Gamma(c_q(i+j, x) + 1)} \prod_{x} p_k(j, x)^{c_q(i+j, x)} \right]^{w_j}.$$

Note that, because the factor containing the Gamma functions does not depend on k, it will cancel out during the normalization of P(p_k|Q_i) (Eq. 2, SI Appendix, Eq. S9). Similar to PSI-BLAST (9), we pick the pseudocount admixture τ in Eq. 3 depending on the diversity of the query alignment, τ = a(b + 1)/(b + N_q(i)), where a = 0.9 and b = 12.0 have been determined on the training set as described in SI Appendix.
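Two pieces of this generalization lend themselves to a short sketch: the diversity-dependent admixture τ and the log of a single multinomial column term with factorials replaced by Gamma functions. The function names and the toy counts are assumptions; `math.lgamma` is the standard-library log-Gamma function.

```python
import math

def admixture(n_eff, a=0.9, b=12.0):
    """tau = a * (b + 1) / (b + N_q(i)), with the constants quoted in the text."""
    return a * (b + 1.0) / (b + n_eff)

def log_multinomial(counts, probs):
    """Log of Gamma(N+1) / prod_x Gamma(c_x+1) * prod_x p_x**c_x.

    `counts` may be real-valued; N is the sum of the counts.
    """
    n = sum(counts.values())
    value = math.lgamma(n + 1.0)
    for x, c in counts.items():
        value -= math.lgamma(c + 1.0)
        if c > 0:
            value += c * math.log(probs[x])
    return value

tau_single = admixture(1.0)    # one effective sequence
tau_diverse = admixture(14.0)  # a more diverse alignment
toy_log_p = log_multinomial({"A": 2.0, "B": 1.0}, {"A": 0.5, "B": 0.5})
```

For a single sequence (N_q = 1) the formula returns τ = 0.9, the value found on the optimization set for CS-BLAST, and τ shrinks as the query alignment becomes more diverse.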

Generation of Context Profile Library.

The quality of the predicted amino acid similarities depends to a large extent on the context profile library. The clustering procedure to derive this library is summarized in Fig. 4. We start with all sequences from the NR database, clustered into groups with maximum intergroup sequence identity of 30% (NR30). In contrast to other approaches, in which only sequences with solved structure in the PDB were used (17, 18, 20, 21, 27), this guarantees an appropriate representation of all classes of local sequence contexts, such as membrane helices, natively unfolded regions, or highly repetitive sequences. From the 1.5 million cluster alignments in this NR30 database, we discard those with an effective number of sequences <2.5 (see SI Appendix) and jump-start a PSI-BLAST search against the full NR database with each of the remaining alignments (E-value threshold = 0.001). This ensures an alignment diversity that is sufficient to produce mutation probabilities in the same range as the BLOSUM62 matrix. After converting the alignments to profiles, we randomly sample 1 million training profile windows of length l from the full-length profile database. For a fixed number of context profiles (K = 500, 1,000, 2,000, 4,000) we determine the profile amino acid probabilities and the profile prior probabilities P(pk) by maximizing the total likelihood that the training profile windows are emitted by the context profiles. The maximization is done with the expectation maximization (EM) algorithm (44) (see SI Appendix).
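The training step can be pictured with a generic EM loop for a mixture of multinomial window profiles. The sketch below is an illustration under stated assumptions, not the procedure of the SI Appendix: training windows are represented as count columns, column weights are omitted, a small smoothing constant `eps` keeps all probabilities positive, and profiles are initialized from randomly chosen training windows. The function name and all parameters are hypothetical.

```python
import math
import random


def em_context_profiles(windows, k, n_iter=20, seed=0, eps=1e-3):
    """Fit k context profiles and their priors to training windows (sketch).

    windows : list of windows; each window is a list of columns and each
              column a list of (possibly real-valued) residue counts.
    Returns (priors, profiles) with profiles[k][j][x] = p_k(j, x).
    Requires k <= len(windows).
    """
    rng = random.Random(seed)
    n_cols = len(windows[0])
    n_sym = len(windows[0][0])

    def normalise(col):
        s = sum(col)
        return [v / s for v in col]

    # Initialise each profile from a distinct, smoothed training window.
    profiles = [
        [normalise([c + eps for c in col]) for col in win]
        for win in rng.sample(windows, k)
    ]
    priors = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: responsibility of each profile for each training window.
        resp = []
        for win in windows:
            logs = []
            for prior, prof in zip(priors, profiles):
                ll = math.log(prior)
                for c_col, p_col in zip(win, prof):
                    ll += sum(c * math.log(p) for c, p in zip(c_col, p_col))
                logs.append(ll)
            m = max(logs)
            unnorm = [math.exp(v - m) for v in logs]
            z = sum(unnorm)
            resp.append([u / z for u in unnorm])

        # M-step: re-estimate priors and profile probabilities.
        for kk in range(k):
            total_r = sum(r[kk] for r in resp)
            priors[kk] = (total_r + eps) / (len(windows) + k * eps)
            new_prof = []
            for j in range(n_cols):
                acc = [eps] * n_sym
                for win, r in zip(windows, resp):
                    for x in range(n_sym):
                        acc[x] += r[kk] * win[j][x]
                new_prof.append(normalise(acc))
            profiles[kk] = new_prof
    return priors, profiles
```

For the library sizes quoted in the text (1 million windows, K up to 4,000), a pure-Python loop like this would be far too slow; it is meant only to make the alternation between assigning windows to profiles and re-estimating P(pk) and pk(j, x) concrete.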

Fig. 4. Computation of the library of context profiles representing local sequence contexts. From a database (NR30) of 1.5M groups of aligned sequences covering the NR database, we select the 50,000 most diverse alignments and enrich these with homologs from a single BLAST search. The alignments are converted to sequence profiles and 1M profile windows are randomly sampled and used to train K context profiles (K = 500, 1,000, 2,000, 4,000) with the expectation maximization algorithm.

Appendix: Availability of Datasets and Executables.

The context profile library, all benchmark datasets, and results data files can be downloaded from ftp://toolkit.lmb.uni-muenchen.de/csblast. CS-BLAST executables for Linux (32 and 64 bit), Windows, and Mac are freely available for academic users. A free CS-BLAST webserver can be accessed at http://toolkit.lmb.uni-muenchen.de/cs_blast.

Acknowledgments

We thank Michael Remmert for computational support and Nick Grishin and 2 anonymous referees for their very helpful comments. J.S. thanks Andrei Lupas for making his first 5 years in computational biology such an exciting and insightful experience.

Footnotes

1To whom correspondence should be addressed. E-mail: soeding{at}lmb.uni-muenchen.de

Author contributions: J.S. designed research; A.B. performed research; A.B. contributed new reagents/analytic tools; A.B. analyzed data; and A.B. and J.S. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0810767106/DCSupplemental.

Freely available online through the PNAS open access option.

References

1. Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in proteins. Atlas Protein Sequence Struct 5:345–352.
2. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919.
3. Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445.
4. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453.
5. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197.
6. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410.
7. Pearson WR (1991) Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 11:635–650.
8. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680.
9. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.
10. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951–960.
11. Henikoff JG, Henikoff S (1996) Using substitution probabilities to improve position-specific scoring matrices. Comput Appl Biosci 12:135–143.
12. Tatusov RL, Altschul SF, Koonin EV (1994) Detection of conserved segments in proteins: Iterative scanning of sequence databases with alignment blocks. Proc Natl Acad Sci USA 91:12091–12095.
13. Yu YK, Wootton JC, Altschul SF (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci USA 100:15688–15693.
14. Jones DT, Taylor WR, Thornton JM (1994) A mutation data matrix for transmembrane proteins. FEBS Lett 339:269–275.
15. Ng PC, Henikoff JG, Henikoff S (2000) PHAT: A transmembrane-specific substitution matrix. Bioinformatics 16:760–766.
16. Mueller T, Rahmann S, Rehmsmeier M (2001) Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics 17:182–189.
17. Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL (1992) Environment-specific amino acid substitution tables: Tertiary templates and prediction of protein folds. Protein Sci 1:216–226.
18. Rice DW, Eisenberg D (1997) A 3D–1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J Mol Biol 267:1026–1038.
19. Shi J, Blundell TL, Mizuguchi K (2001) FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 310:243–257.
20. Goonesekere NC, Lee B (2008) Context-specific amino acid substitution matrices and their use in the detection of protein homologs. Proteins 71:910–919.
21. Gelly JC, Chiche L, Gracy J (2005) EvDTree: Structure-dependent substitution profiles based on decision tree classification of 3D environments. BMC Bioinformatics 6:4.
22. Jung J, Lee B (2000) Use of residue pairs in protein sequence-sequence and sequence-structure alignments. Protein Sci 9:1576–1588.
23. Crooks GE, Green RE, Brenner SE (2005) Pairwise alignment incorporating dipeptide covariation. Bioinformatics 21:3704–3710.
24. Gambin A, Lasota S, Szklarczyk R, Tiuryn J, Tyszkiewicz J (2002) Contextual alignment of biological sequences (Extended abstract). Bioinformatics 18(Suppl 2):116–127.
25. Gambin A, Otto R (2005) Contextual multiple sequence alignment. J Biomed Biotechnol 2005:124–131.
26. Baussand J, Deremble C, Carbone A (2007) Periodic distributions of hydrophobic amino acids allows the definition of fundamental building blocks to align distantly related proteins. Proteins 67:695–708.
27. Huang YM, Bystroff C (2006) Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions. Bioinformatics 22:413–422.
28. Han KF, Baker D (1996) Global properties of the mapping between local amino acid sequence and local structure in proteins. Proc Natl Acad Sci USA 93:5814–5818.
29. Sjoelander K, et al. (1996) Dirichlet mixtures: A method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci 12:327–345.
30. Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ Press, Cambridge), pp 117–118.
31. Altschul SF (1991) Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 219:555–565.
32. Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: Detection of distantly related proteins. Proc Natl Acad Sci USA 84:4355–4358.
33. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755–763.
34. Rychlewski L, Jaroszewski L, Li W, Godzik A (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 9:232–241.
35. Yona G, Levitt M (2002) Within the twilight zone: A sensitive profile-profile comparison tool based on information theory. J Mol Biol 315:1257–1275.
36. Sadreyev R, Grishin N (2003) COMPASS: A tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 326:317–336.
37. Madera M (2008) Profile Comparer (PRC): A program for scoring and aligning profile hidden Markov models. Bioinformatics 24:2630–2631.
38. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540.
39. Zhang Y, Skolnick J (2005) TM-align: A protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33:2302–2309.
40. Holm L, Sander C (1996) Mapping the protein universe. Science 273:595–603.
41. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2008) GenBank. Nucleic Acids Res 36:25–30.
42. Notredame (2007) Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol 3:e123.
43. Stark, et al. (2007) Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450:219–232.
44. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 39:1–38.