Automated structure prediction of weakly homologous proteins


Edited by Michael Levitt, Stanford University School of Medicine, Stanford, CA (received for review September 5, 2003)


Abstract

We have developed tasser, a hierarchical approach to protein structure prediction that consists of template identification by threading, followed by tertiary structure assembly via the rearrangement of continuous template fragments guided by an optimized Cα and side-chain-based potential driven by threading-based, predicted tertiary restraints. tasser was applied to a comprehensive benchmark set of 1,489 medium-sized proteins in the Protein Data Bank. With homologues excluded, in 927 cases, the templates identified by our threading algorithm prospector_3 have a rms deviation from native <6.5 Å with ≈80% alignment coverage. After template reassembly, this number increases to 1,172. This shows significant and systematic improvement of the final models with respect to the initial template alignments. Furthermore, significant improvements in loop modeling are demonstrated. We then apply tasser to the 1,360 medium-sized ORFs in the Escherichia coli genome; ≈920 can be predicted with high accuracy based on confidence criteria established in the Protein Data Bank benchmark. These results from our unprecedented comprehensive folding benchmark on all protein categories provide a reliable basis for the application of tasser to structural genomics, especially to proteins of low sequence identity to solved protein structures.

Despite considerable effort, the prediction of the native structure of a protein from its amino acid sequence remains an outstanding unsolved problem. In this postgenomic era, because protein structure can assist in functional annotation, the need for progress is even more crucial (1, 2). Historically, protein structure prediction divides into three categories: comparative modeling (CM) (3, 4), threading (5, 6), and new fold prediction (7–9). In CM, the protein structure is predicted by aligning the target sequence to an evolutionarily related, solved template structure. Threading goes beyond CM in that it is designed to match sequences to proteins adopting similar folds, where the target and template sequences need not be evolutionarily related. Finally, for new folds, the target sequence could adopt a structure not seen before and modeling should be done ab initio. This is the hardest category with the lowest prediction accuracy.

CM/threading methods are the most robust of the protein structure prediction approaches, but they involve three main issues. First, a necessary precondition for their success is the completeness of the library of solved structures in the Protein Data Bank (PDB) (10). Recently, it was demonstrated that the PDB library is most likely complete for single domain protein structures at low to moderate resolution (11); e.g., for any given protein up to 100 residues, regardless of whether it is evolutionarily related to other solved protein structures, there is at least one already solved structure in the PDB that has a rms deviation (rmsd) from native <4 Å for 90% of its residues. This strongly suggests that the protein structure prediction problem can in principle be solved by using CM/threading methodologies and that new fold approaches may not be necessary. However, an effective fold recognition algorithm must be developed to identify these correct template proteins and alignments.

Second, having a threading template with gapped alignments and average coverage, it is nontrivial to build a complete model that is useful for functional studies. Most successful structure predictions are still dictated by the evolutionary relationship between target and template proteins. For proteins having >50% sequence identity to their templates, models built by CM techniques (3, 4) can have up to a 1-Å rmsd from native for their backbone atoms. For proteins with 30–50% sequence identity to their templates, the models often have ≈85% of their core regions within a rmsd of 3.5 Å from native, with errors mainly in loops (2, 4). When the sequence identity drops below 30%, the “twilight” zone (about two-thirds of known protein sequences), modeling accuracy sharply decreases because of the lack of significant threading hits and substantial alignment errors. Until recently, for all sequence identity ranges, improvement from the initial alignment has not been consistently demonstrated (12) and the ability to accurately predict the conformation of the intervening loops between aligned regions has been rather limited (4, 12). Therefore, the development of an effective automated technology that can deal with proteins in the twilight zone of sequence identity and then build refined models that are closer to the native structure than their initial template alignments with reasonably accurate loop conformations is essential.

Third, the large-scale benchmarking and validation of any given structure prediction methodology are of key importance. Previously, most approaches treated a relatively small number of proteins, which made it difficult to establish their generality. Indeed, one of the goals of CASP (13), the Critical Assessment of Techniques for Protein Structure Prediction, has been to introduce objectivity into the protein structure prediction field. However, the number of CASP targets has been relatively small, making it difficult to fully establish general trends.

To address these issues, we develop a structure prediction methodology called threading assembly refinement (tasser) that has the capacity to recognize the majority of nonevolutionarily related folds in the PDB library, to significantly refine the structures with respect to their initial template, and to generate good predictions for the loops. To assess its generality, we present folding results based on a large-scale benchmark of all representative single-domain proteins in the PDB where structural templates of >30% sequence identity to the targets are excluded. To demonstrate the generality of the conclusions and as an example of tasser's application to structural genomics, we describe the structure prediction results on all small and medium size ORFs in the Escherichia coli genome.

Methods

The tasser methodology consists of template identification, structure assembly, and model selection; an overview is presented in Fig. 1.

Fig. 1.

Overview of the tasser structure prediction methodology that consists of template identification by the prospector_3 threading algorithm (6), CAS fragment assembly, and fold selection by spicker clustering (18). The entire process for 1ayyD is shown as an example.

Threading. The structure templates for a target sequence are selected from the PDB library (10) by our iterative threading program prospector_3 (6), designed to identify analogous as well as homologous templates. The scoring function of prospector_3 includes close and distant sequence profiles, secondary structure predictions from psipred (14), and side chain contact pair potentials extracted from the alignments in the previous threading iteration. Alignments are generated by using a Needleman–Wunsch global alignment algorithm (15). Based on score significance, target sequences are classified into three categories: if prospector_3 has at least one significant hit with Z score (the energy in standard deviation units relative to mean) >15 or if it has at least two consistent hits of Z score >7, these templates have high confidence to be correct and the target is assigned to the “easy set.” In practice, the majority of “easy” cases have a correct template and good alignments. [We note that easy does not mean that they are trivially identified; indeed, in the benchmark set, see below, prospector_3 correctly assigns more than twice the number of targets to their correct templates as psi-blast (16) does.] Those sequences that either hit a single template with 7 < Z < 15 or hit multiple templates lacking a significant consensus region are assigned to the “medium set”; these have the correct fold identified in most cases, but the alignment may be incorrect. Finally, those sequences that cannot be assigned by prospector_3 to a template belong to the “hard set,” and from the point of view of the algorithm are new folds, although according to the finding of the completeness of the PDB (11), (almost) all proteins should be assigned to either the easy or medium set by a “perfect” threading algorithm.
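The three-way classification described above can be sketched in a few lines of Python. This is only an illustration of the stated Z-score thresholds, not the prospector_3 code: the function name, its inputs, and the `consistent` flag are hypothetical, and how consistency between hits is actually measured is not specified in the text, so it is left to the caller.

```python
from typing import Sequence


def classify_target(z_scores: Sequence[float], consistent: bool) -> str:
    """Assign a target to the easy, medium, or hard set.

    z_scores   : Z scores of the templates hit for this target (hypothetical input).
    consistent : True if the hits with Z > 7 agree with one another
                 (assumed to be decided elsewhere).
    """
    significant = [z for z in z_scores if z > 7.0]
    # Easy: one hit with Z > 15, or at least two consistent hits with Z > 7.
    if any(z > 15.0 for z in z_scores):
        return "easy"
    if len(significant) >= 2 and consistent:
        return "easy"
    # Medium: a single hit with 7 < Z <= 15, or several hits without consensus.
    if significant:
        return "medium"
    # Hard: no template could be assigned.
    return "hard"
```

For example, a target whose only hit has Z = 4.9 falls in the hard set under this sketch, whereas two agreeing hits with Z = 8 and Z = 9 place it in the easy set.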

On-and-Off Lattice C-Alpha Side Chain Based (CAS) Model. A protein is represented by its Cα atoms and side chain centers of mass (SG), called the CAS model. Based on the threading alignment, the chain is divided into continuous aligned regions (more than five residues) whose local conformation is unchanged during assembly and gapped ab initio regions. For computational efficiency, the Cα values of these ab initio residues lie on an underlying cubic lattice, whereas the Cα values of aligned residues are excised from the threading template and are off-lattice for maximum accuracy. SGs are always off-lattice. A representative chain fragment is shown in Fig. 2. The CAS potential includes predicted secondary structure propensities from psipred (14), backbone hydrogen bonds, consensus predicted side chain contacts from prospector_3, and statistical short-range correlations and hydrophobic interactions (9). The combination of energy terms was optimized by maximizing the correlation between the rmsd of decoy structures to native and energy for 100 nonhomologous training proteins (extrinsic to the benchmark set used here), each with 60,000 decoys. Optimization resulted in a funnel-like energy landscape for training proteins, with an average correlation coefficient of 0.69 between the energy and rmsd to native (9).
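The division of the chain into rigid aligned regions and ab initio regions can be illustrated as follows. This is a minimal sketch under stated assumptions: the alignment encoding (a list holding the aligned template residue index, or None for an unaligned residue) and the function name are hypothetical, and treating aligned runs of five residues or fewer as ab initio is one reading of "more than five residues" rather than a documented rule.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # inclusive (start, end) target residue indices


def split_regions(aligned: List[Optional[int]],
                  min_len: int = 6) -> Tuple[List[Range], List[Range]]:
    """Split a target chain into rigid fragments and ab initio regions."""
    n = len(aligned)
    rigid = [False] * n
    fragments: List[Range] = []
    i = 0
    while i < n:
        if aligned[i] is None:
            i += 1
            continue
        # Extend the run while target and template both stay continuous.
        j = i
        while (j + 1 < n and aligned[j + 1] is not None
               and aligned[j + 1] == aligned[j] + 1):
            j += 1
        if j - i + 1 >= min_len:
            fragments.append((i, j))
            for k in range(i, j + 1):
                rigid[k] = True
        i = j + 1

    # Everything not covered by a rigid fragment is modeled ab initio.
    ab_initio: List[Range] = []
    start = None
    for k in range(n):
        if not rigid[k] and start is None:
            start = k
        elif rigid[k] and start is not None:
            ab_initio.append((start, k - 1))
            start = None
    if start is not None:
        ab_initio.append((start, n - 1))
    return fragments, ab_initio
```

Keeping the fragment list separate from the per-residue flags means two fragments that are adjacent in the target but come from different parts of the template stay distinct rigid bodies.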

Fig. 2.

Schematic representation of a piece of polypeptide chain in the on- and off-lattice CAS model. Each residue is described by its Cα and side chain center of mass (SG). Whereas Cα values (white) of unaligned residues are confined to the underlying cubic lattice system with a lattice space of 0.87 Å, Cα values (yellow) of aligned residues are excised from templates and traced off-lattice. SG values (red) are always off-lattice and determined by using a two-rotamer approximation (9).
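To make the on-lattice representation concrete, the sketch below snaps an arbitrary Cα position to the nearest node of a cubic lattice with the 0.87 Å spacing quoted in the caption. It is an illustration only: the function name is hypothetical, and the actual CAS model constrains Cα–Cα bond vectors on the lattice rather than simply rounding coordinates.

```python
from typing import Tuple

LATTICE_SPACING = 0.87  # Å, lattice spacing quoted in the Fig. 2 caption


def snap_to_lattice(point: Tuple[float, float, float],
                    spacing: float = LATTICE_SPACING):
    """Return the integer lattice index nearest to a point and its coordinates."""
    index = tuple(round(c / spacing) for c in point)
    snapped = tuple(i * spacing for i in index)
    return index, snapped
```

Storing the integer index rather than the floating-point position is what makes lattice moves cheap: a bond vector becomes a small integer triple that can be looked up in a precomputed table.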

Template Assembly and Refinement. For a given threading template, an initial full-length model is built by connecting the continuous template fragments by a random walk of Cα–Cα lattice bond vectors. If a template gap region cannot be spanned by the unaligned residues, a long Cα–Cα bond remains, and a spring-like external force that draws sequential fragments together is used until a physically reasonable bond length is achieved. Initial models are submitted to parallel hyperbolic Monte Carlo sampling (17) for assembly/refinement with two kinds of conformational updates: off-lattice movements involve rigid fragment translations and rotations whose amplitude is normalized by the fragment length so that the acceptance rate is approximately constant for different size fragments. Lattice confined residues are subject to two to six bond movements and multibond sequence shifts (9).
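The off-lattice update described above, a rigid-body move whose amplitude shrinks with fragment length, might look like the following. This is a sketch under assumptions, not the tasser move set: the function name, the `base_angle` and `base_shift` values, and the simple division by the number of residues are all illustrative choices, since the text states only that the amplitude is normalized by fragment length.

```python
import math
import random
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]


def rigid_fragment_move(coords: List[Vec], base_angle: float = 0.5,
                        base_shift: float = 1.0,
                        rng: Optional[random.Random] = None) -> List[Vec]:
    """Propose a random rigid rotation plus translation of one fragment.

    base_angle (radians) and base_shift (Å) are illustrative amplitudes;
    both are divided by the fragment length so long fragments move less.
    """
    if not coords:
        raise ValueError("fragment must contain at least one residue")
    rng = rng or random.Random()
    n = len(coords)
    angle = rng.uniform(-1.0, 1.0) * base_angle / n
    shift = [rng.uniform(-1.0, 1.0) * base_shift / n for _ in range(3)]

    # Draw a random unit vector as the rotation axis.
    norm = 0.0
    while norm < 1e-8:
        axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(a * a for a in axis))
    ux, uy, uz = (a / norm for a in axis)

    # Rotate about the fragment centroid so the move stays local.
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    moved = []
    for x, y, z in coords:
        px, py, pz = x - cx, y - cy, z - cz
        # Rodrigues' formula: v' = v cos(a) + (u x v) sin(a) + u (u.v)(1 - cos(a))
        dot = ux * px + uy * py + uz * pz
        cross = (uy * pz - uz * py, uz * px - ux * pz, ux * py - uy * px)
        rx = px * cos_a + cross[0] * sin_a + ux * dot * (1.0 - cos_a)
        ry = py * cos_a + cross[1] * sin_a + uy * dot * (1.0 - cos_a)
        rz = pz * cos_a + cross[2] * sin_a + uz * dot * (1.0 - cos_a)
        moved.append((rx + cx + shift[0], ry + cy + shift[1], rz + cz + shift[2]))
    return moved
```

Because the transformation is rigid, all Cα–Cα distances inside the fragment are unchanged; only its position and orientation relative to the rest of the chain are perturbed, and a Monte Carlo acceptance test would then decide whether to keep the proposal.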

Certainly, the idea of assembling tertiary structure from protein fragments is not new. For example, rosetta (8) uses small fragments (approximately three to nine residues). Because the conformational search is carried out by using large-scale moves (by switching between different local segments), the acceptance rate significantly decreases with increasing fragment size. Here, movement consists of scaled continuous translations and rotations, allowing for the successful movement of all size substructures. Because our threading-based fragments are much longer (≈20.7 residues on average), the conformational entropy is significantly reduced and more native-like interactions are retained.

Structure Selection. The Monte Carlo simulations employ 40 replicas, and the structures generated in the 14 lowest temperature replicas are submitted to an iterative structural clustering program, spicker (18). The final models are combined from the clustered structures and ranked by structure density.

Folding Results on a Comprehensive PDB Benchmark. To undertake a comprehensive test of the methodology, we developed an exhaustive benchmark set of all PDB structures with 41–200 amino acids. This set contains 1,489 nonhomologous single domain proteins with a maximum 35% pairwise sequence identity to each other. A total of 448, 434, and 550 targets are α, β, αβ proteins, respectively (the remaining 57 targets either have only Cα atoms in their solved structures or irregular secondary structures). Twenty targets are transmembrane proteins.

Among the 1,489 target sequences, prospector_3 assigns 877 to the easy set, with an average rmsd to native of 4.4 Å, and 87% alignment coverage (Table 1); 84% of these templates have a rmsd to native <6.5 Å (a statistically significant cutoff; ref. 19). In 799 cases, the top two scoring templates have a consensus region with 67% coverage and an average rmsd of 3.3 Å. This consensus region serves as an additional artificial template in the structure assembly of easy set proteins. There are 605 proteins assigned to the medium set, with an average rmsd to native of 9.7 Å and 62% alignment coverage. Of these, 191 have a rmsd <6.5 Å. For both the easy and medium sets, the average target/template sequence identity is ≈22%.

Table 1. Summary of threading results from prospector_3 and optimization by tasser

Combining the easy and medium set results, 63% (927 of 1,482) of the targets have an acceptable template on the basis of the prospector_3 alignment (with rmsd <6.5 Å over 80% average coverage). Furthermore, if we ask whether a related fold is identified on the basis of structure alignment, 91% (1,348 of 1,482) of the proteins have a rmsd <6.5 Å with 72% average coverage. Thus, with respect to the ability of prospector_3 to identify related folds, it fails in ≈10% of the target sequences, although the alignment accuracy needs to be improved for one-third of the targets. Note that there are only seven proteins in the hard set where no global template is predicted. The average results for the threading templates, as well as the corresponding final models, are summarized in Table 1.

In Fig. 3A, we show the rmsd to native of the best model in the top five clusters selected by spicker compared to the initial alignments provided by prospector_3. Exactly the same aligned regions are used in both rmsd calculations. There are obvious improvements for almost all quality templates, with the biggest absolute rmsd improvement for the poorer quality targets (initial rmsd >8 Å), which mainly belong to the medium (Fig. 3A, red triangles) and hard (Fig. 3A, green circles) sets. These substantial rmsd reductions are mainly caused by the conversion from unphysical template alignments given by prospector_3 to geometrically acceptable models. A medium set example, 1fjfT, is shown in Fig. 4 A and B. Here, the template has substantial gaps (Fig. 4A) with high-quality local substructures, but a large deviation of global topology from native. By moving these rigid fragments and reassembling them into physically realistic models, a dramatic reduction of rmsd results (Fig. 4B).

Fig. 3.

(A) Scatter plot of rmsd to native for final models by tasser versus rmsd to native for the initial templates from prospector_3 (6). The same aligned region is used in both rmsd calculations. (B) Similar data as in A, but the models are from modeller. (C) Fraction of targets with a rmsd improvement d by the tasser approach greater than some threshold value. Here, d = “rmsd of template” - “rmsd of final model.” Each point in C is calculated with a bin width of 1 Å; however, the last point includes all templates with rmsd > 10 Å. (D) Similar data as in C, but the models are from modeller.

Fig. 4.

Representative examples showing the improvement of final models with respect to the initial templates. The thin lines are native structures; the thick lines signify initial templates or final models. Blue to red runs from the N terminus to the C terminus. To guide the eye, the thinner lines connect contiguous template segments. (A and B) Medium/hard set example. (A) The template (from 1a5kC) superimposed on native structure of 1fjfT with an initial rmsd of 17.2 Å. (B) The optimized model for 1fjfT superposed on the native with rmsd of 3.1 Å (3.12 Å over aligned residues). (C and D) Easy set example. (C) The template (from 1b4aA) superimposed onto the native structure of 1aoy_, with an initial rmsd of 6.12 Å. (D) The optimized model for 1aoy_ superimposed on native with rmsd of 2.42 Å (2.1 Å over aligned residues).

In Fig. 3C, the fraction of targets having a rmsd improvement above the given threshold value is plotted as a function of the initial rmsd of the aligned residues. For initial models with an ≈4- to 5-Å rmsd, 58% of targets improve by at least 1 Å. Similarly, 43% of very good templates, ≈2- to 3-Å initial rmsd, have at least a 0.5-Å improvement. Thus, many distant CM targets are brought into the range of more traditional CM results (30–50% identity) on a systematic basis. For most initially good templates, mainly from the easy set (Fig. 3A, open cyan circles), with an initial rmsd of ≈2–6 Å to native, there is consistently an ≈1- to 3-Å improvement because of the better packing of local structures and side chain groups after CAS optimization. A representative example, 1aoy_, is shown in Fig. 4 C and D. Here, the global topology of the initial template is quite similar to the native structure (6.1-Å rmsd with 83% coverage) with local fragments in the initial alignment having different orientations from native. After tasser refinement, the final model has a rmsd of 2.4 (2.1) Å over the entire chain (aligned regions).
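The statistic plotted in Fig. 3C, the fraction of targets whose improvement d = (rmsd of template) - (rmsd of final model) exceeds a threshold, binned by initial template rmsd, can be computed as sketched below. The function name and input format are hypothetical; the 1-Å bin width and the pooled last bin follow the figure caption.

```python
import math
from typing import Dict, List, Tuple


def fraction_improved(pairs: List[Tuple[float, float]], threshold: float,
                      bin_width: float = 1.0,
                      last_bin_start: float = 10.0) -> Dict[float, float]:
    """Fraction of targets with improvement d > threshold, per template-rmsd bin.

    pairs holds (template_rmsd, model_rmsd), both measured over the same
    aligned residues. The last bin pools every template at or beyond
    last_bin_start.
    """
    counts: Dict[float, Tuple[int, int]] = {}
    for template_rmsd, model_rmsd in pairs:
        d = template_rmsd - model_rmsd
        start = min(math.floor(template_rmsd / bin_width) * bin_width,
                    last_bin_start)
        total, hits = counts.get(start, (0, 0))
        counts[start] = (total + 1, hits + (1 if d > threshold else 0))
    return {start: hits / total
            for start, (total, hits) in sorted(counts.items())}
```

Calling it once per threshold (for example 0.5 Å and 1 Å) would give one curve per threshold, each point being the fraction for one initial-rmsd bin.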

There are a few cases where refinement made the models worse (see Fig. 3A). Most are nonglobular proteins. For example, 1gl2C, the worst case spoiled by CAS refinement, is a 60-residue long single helix from a coiled coil. prospector_3 has a weak hit (Z = 4.9) to a gapped long helical template, 1bu0C, with a rmsd of 2.8 Å to native. Ideally, there should be no tertiary contacts in this protein. However, because of some spurious contact predictions (47 in total) collected from other weak scoring templates, the assembly procedure drives the structure to a two-helix bundle with a rmsd of 9.4 Å to native. A simple solution is to perform in parallel a pure ab initio simulation without restraints (9); this gives a final model having a rmsd of 2.9 Å to native.

In Fig. 3 B and D, we also show the comparison of the initial templates and optimized final models from a widely used CM tool, modeller (3, 4). As expected, because the structure given by modeller is obtained by optimally satisfying tertiary restraints from templates, threading template quality mainly dictates the final result (see Fig. 3D). In contrast, in tasser, because the relative orientations of template fragments are allowed to move, the strong reliance on the initial alignment is alleviated, and the final models can be significantly different, especially when the alignment is very gapped and the CAS potential does not favor the initial alignment. For good templates (mostly easy set targets), the alignments are much less gapped and the tertiary contact restraints from prospector_3 are much more consistent. This way, tasser tends to automatically “select” better templates and “refine” worse templates. Overall, for 1,349 targets, the final model is closer (with smaller rmsd) to native compared to the initial template within the aligned region, often significantly so.

There are 6,101 continuous regions (ranging from 1 to 170 residues long and mainly on loops and tails) in the 1,489 targets where prospector_3 does not have coordinates aligned, and tasser needs to build the fragments by ab initio CAS approaches. In Fig. 5A, we show the average rmsd of the unaligned/loop regions as a function of length. Here, the rmsd between modeled loops and native was calculated based on the superposition of up to five neighboring stem residues on both sides of the loops. Although modeling accuracy decreases with increasing loop length in both modeller and tasser, the tasser ab initio procedure has on average a better control of the loop configurations, especially for the longer loops. In Fig. 5B, we show the histogram of 1,968 unaligned/loop regions that have length ≥4 residues. The average rmsd for these loops by tasser and modeller are 6.7 Å and 14.9 Å, respectively. If we consider for example a rmsd cutoff of <4 Å, modeller gives successful results in 12% (245 of 1,968) of the cases, whereas tasser ab initio modeling is successful in 35% (686 of 1,968) of the cases.

Fig. 5.

(A) Average rmsd to native of all unaligned/loop regions by tasser and modeller (3) as a function of loop length. The rmsd is calculated based on the superposition of up to five neighboring stem residues on both sides of the loop. (B) Histogram of the rmsd for the unaligned/loop regions with ≥4 residues (1,968 in total) modeled by tasser and modeller.

Fig. 6 summarizes the rmsd distribution of the full-length models by tasser and modeller, both starting from the same prospector_3 alignments. For the easy targets, tasser outperforms modeller, although the differences are smaller than in the medium and hard sets, where they are even more pronounced. This comparison may not be entirely fair because modeller was designed to fold homologous proteins, and homologous templates have been excluded from our template library. However, this difference shows the utility of using tasser. Overall, the average rmsd of the best models to native are 12.16 Å and 5.49 Å for modeller and tasser, respectively (this significant difference is partially due to the fact that modeller generates random structures in some hard targets that have very short alignment templates); in 1,403 cases, tasser has a lower rmsd, and in 85 cases, modeller does.

Fig. 6.

Histograms of foldable proteins using modeller (3) and tasser based on the same templates and alignments from prospector_3 (6).

If we define foldable cases as those where one of the top five structures has a rmsd to native <6.5 Å, as shown in the last column of Table 1, the overall success rate for tasser full-length models is 66% (990 of 1,489). The total number of templates having an rmsd <6.5 Å in the aligned regions increases from 927 (62%) to 1,172 (79%) after tasser refinement. Among the 20 transmembrane proteins, nine (45%) of them (1a91_, 1bccH, 1f16A, 1fftC, 1jb0J, 1k3kA, 1kzuB, 1lghB, and 1qleD) are foldable, with an average rmsd of 3.8 Å. Furthermore, in contrast to many previous approaches (7–9), tasser does not show significant bias to secondary structure class: the success rates for α, β, and αβ proteins are 311 of 448 (69%), 265 of 434 (61%), and 380 of 550 (69%), respectively. Nevertheless, a weak dependence on protein size exists. For targets <120 residues, the success rate is 73% (642 of 884); but for targets >120 residues, it is 58% (348 of 605). All results, including threading templates, structure trajectories, and final combined models for each of the targets, are available on our web site, www.bioinformatics.buffalo.edu/abinitio/1489.
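The success rates quoted above follow directly from the reported counts, as the short check below illustrates. The dictionary keys and the helper name are labels chosen for this sketch; the counts are those given in the text.

```python
# (foldable, total) counts as reported in the text
COUNTS = {
    "overall": (990, 1489),
    "templates_before": (927, 1489),
    "templates_after": (1172, 1489),
    "transmembrane": (9, 20),
    "alpha": (311, 448),
    "beta": (265, 434),
    "alpha_beta": (380, 550),
    "under_120_residues": (642, 884),
    "over_120_residues": (348, 605),
}


def percent(hits: int, total: int) -> int:
    """Success rate rounded to the nearest whole percent."""
    return round(100 * hits / total)


RATES = {name: percent(h, t) for name, (h, t) in COUNTS.items()}

# The two size classes partition the benchmark: 884 + 605 = 1,489 targets
# and 642 + 348 = 990 foldable cases.
SIZE_TOTAL = COUNTS["under_120_residues"][1] + COUNTS["over_120_residues"][1]
SIZE_HITS = COUNTS["under_120_residues"][0] + COUNTS["over_120_residues"][0]
```

The size-class totals summing to the overall figures is a useful consistency check when transcribing numbers of this kind.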

Structure Predictions for the E. coli Genome. As an example of a genomic application of tasser, we apply here the approach to all 1,360 ORFs in the E. coli genome (20) ≤200 residues in length. prospector_3 assigns 829 (61%) to the easy set, 521 (38%) to the medium set, and 10 (0.7%) to the hard set. These threading assignment results are quite similar to those of the PDB benchmark, with a slightly larger portion of targets assigned to the easy set in E. coli, which may be due to the fact that homologues are not excluded.

It should also be mentioned that genome scale structure predictions have been performed by many authors on different organisms (21–27). Most are based on homology modeling or sequence comparison techniques, which require solved homologous structures. For example, the study by Peitsch et al. (22) produced comparative models for ≈10–15% of proteins in the entire E. coli genome. Using psi-blast (16), Hegyi et al. (25) assigned 28% of all E. coli ORFs to SCOP domains. In the PEDANT database (27), Frishman et al. showed that 50% of E. coli ORFs have a psi-blast hit to PDB structures, but the assignment rate is 31% for ORFs of <200 residues. In GTOP (26), Kawabata et al. used the reverse psi-blast (28) and aligned 53% of all E. coli ORFs (35% for those <200 residues) to the solved structures in PDB. Thus, prospector_3 alone is seen to perform significantly better than psi-blast.

As an indicator of likelihood of success for blind structure predictions, we noticed that: (i) the Z score of the template indicates the significance of the threading alignment; (ii) the degree of structure convergence in CAS assembly strongly correlates with the quality of models in spicker clustering (18). Thus, we define a confidence score, C-score, for tasser models by

$$\text{C-score} = \ln\left( \frac{M}{M_{\text{tot}}\,\langle \text{rmsd} \rangle} \times Z \right) \qquad [1]$$

where M is the multiplicity of structures in a spicker cluster, M_tot is the total number of structures submitted for clustering, and 〈rmsd〉 denotes the average rmsd of the structures to the cluster centroid.
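A minimal sketch of such a confidence score follows. It assumes that the C-score is the natural logarithm of the normalized cluster density, M / (M_tot 〈rmsd〉), multiplied by the template Z score; that functional form, the function name, and the parameter names are assumptions of this sketch and should be checked against Eq. 1 before use.

```python
import math


def c_score(multiplicity: int, total_structures: int,
            mean_rmsd: float, z_score: float) -> float:
    """Confidence score from cluster convergence and threading significance.

    Assumed form: ln( M / (M_tot * <rmsd>) * Z ), where M is the cluster
    multiplicity, M_tot the number of structures clustered, <rmsd> the mean
    rmsd to the cluster centroid, and Z the template Z score.
    """
    if total_structures <= 0 or mean_rmsd <= 0 or z_score <= 0 or multiplicity <= 0:
        raise ValueError("all inputs must be positive for the logarithm to be defined")
    density = multiplicity / (total_structures * mean_rmsd)
    return math.log(density * z_score)
```

Under this form, a large, tight cluster built from a significant template scores high, whereas a sparse cluster or a weak threading hit pushes the score toward negative values.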

In Fig. 7, we show the C-score distribution of rank-one clusters generated for E. coli ORFs as well as that for the PDB benchmark proteins. The benchmarking data indeed show the significant sensitivity of the C-score to the prediction success rate. For example, if we use a C-score threshold of -0.5 for the rank-one clusters, the false positive (negative) rate is 12.4% (14.7%). The C-score distribution of E. coli ORFs is consistent with the PDB benchmark, except that slightly more targets are distributed at high negative C-score regions for E. coli ORFs, due to the fact that we did not exclude homologous proteins. If we assume that tasser has similar C-score sensitivity in E. coli as that in the PDB benchmark, we would expect ≈920 (68%) ORFs to have acceptable models.
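Error rates at a given C-score cutoff can be estimated from benchmark data along the following lines. This sketch uses the standard definitions (false positive rate = FP / (FP + TN), false negative rate = FN / (FN + TP)); whether the percentages quoted above use these denominators is an assumption, and the function name and input format are hypothetical.

```python
from typing import Sequence, Tuple


def error_rates(scores: Sequence[float], foldable: Sequence[bool],
                threshold: float = -0.5) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) at a C-score cutoff.

    A target is predicted foldable when its C-score is >= threshold;
    foldable[i] is the benchmark truth (best rmsd in top five < 6.5 Å).
    """
    tp = fp = tn = fn = 0
    for score, ok in zip(scores, foldable):
        predicted = score >= threshold
        if predicted and ok:
            tp += 1
        elif predicted and not ok:
            fp += 1
        elif not predicted and ok:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

Sweeping the threshold over the benchmark and recording both rates would show the trade-off that motivates a cutoff such as -0.5.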

Fig. 7.

Histogram distributions of the C-score (defined in Eq. 1) for the PDB benchmark proteins and E. coli genome. The PDB benchmark targets whose best rmsd in the top five clusters is below or above 6.5 Å are shown in different colors.

Around 23% (309 of 1,360) of these ORFs belong to membrane proteins according to memsat (29) predictions. This rate is slightly lower than the estimate of 26% by Jones (30) for the entire set, which may be due to the fact that we here only focus on the small ORFs with length ≤200 residues. In all rank-one models of transmembrane proteins, there is at least one long (putative transmembrane) helix, which shows the consistency of our modeling with the memsat prediction. If we define the confidence level based on the C-score of the predicted models, 56% (174 of 309) of the membrane proteins have >60% probability for the predicted models to have rmsd <6.5 Å; 47% (146 of 309) have >80% probability for the models to have a rmsd <6.5 Å. The predicted models for all of the ORFs with corresponding C-scores and confidence indexes are available on our web site: www.bioinformatics.buffalo.edu/genome/ecoli.

We did not mask out signal peptide residues from the ORF sequences in our modeling. Actually, we found 149 cases having annotated signal peptides in the swiss-prot database (31). Because of their distinct sequences, the majority of the signal peptide residues are not aligned in the prospector_3 alignments. In all of the resulting, full-length, rank-one models, the signal peptide segments are outside the compact core structure because of the lack of predicted contact restraints between the signal peptide and the core regions. Therefore, the signal peptide sequence does not exert much of an influence on the core regions of tasser modeling. Indeed, one possibility to be pursued is to use this method to predict signal sequences.

Conclusions

We have developed tasser, an algorithm for protein tertiary structure assembly that spans the range from CM to ab initio folding. To establish its generality, we applied the methodology to a comprehensive benchmark set of 1,489 medium-sized proteins that covers the whole PDB at the level of 35% sequence identity. Consistent with our finding that the PDB is a complete set of single domain protein structures at low resolution (11), we can identify significant templates for >90% of such proteins. Furthermore, in a large-scale test, the results presented here demonstrate that threading alignments can be significantly improved by moving and rearranging rigid fragments. Three factors contribute to this success: the requirement of chain connectivity, improved tertiary structure packing of the native-like secondary fragments due to an optimized force field, and the set of predicted tertiary contacts from threading. A success rate of around two in three is expected for the proteins of sequence identity <30% (on average 22% identity) to known structures. Based on tasser's confidence criteria established in the PDB benchmark, comparable performance is obtained for the E. coli genome. Although significant improvements in tasser are still being developed, the ability to fold two-thirds of all non- and weakly homologous proteins of <200 residues represents encouraging progress on the protein structure prediction problem.

Acknowledgments

We thank Dr. Adrian K. Arakaki for his critical reading of the manuscript and help in preparation of the figures. This research was supported in part by National Institutes of Health Grants GM-37408 and GM-48835 of the Division of General Sciences.

Footnotes

* To whom correspondence should be addressed. E-mail: skolnick{at}buffalo.edu.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: CM, comparative modeling; rmsd, rms deviation; tasser, threading assembly refinement.

Copyright © 2004, The National Academy of Sciences

References

1. Skolnick, J., Fetrow, J. S. & Kolinski, A. (2000) Nat. Biotechnol. 18, 283-287. pmid:10700142
2. Baker, D. & Sali, A. (2001) Science 294, 93-96. pmid:11588250
3. Sali, A. & Blundell, T. L. (1993) J. Mol. Biol. 234, 779-815. pmid:8254673
4. Fiser, A., Do, R. K. & Sali, A. (2000) Protein Sci. 9, 1753-1773. pmid:11045621
5. Bowie, J. U., Luthy, R. & Eisenberg, D. (1991) Science 253, 164-170. pmid:1853201
6. Skolnick, J., Kihara, D. & Zhang, Y. (2004) Proteins, in press.
7. Liwo, A., Lee, J., Ripoll, D. R., Pillardy, J. & Scheraga, H. A. (1999) Proc. Natl. Acad. Sci. USA 96, 5482-5485. pmid:10318909
8. Simons, K. T., Strauss, C. & Baker, D. (2001) J. Mol. Biol. 306, 1191-1199. pmid:11237627
9. Zhang, Y., Kolinski, A. & Skolnick, J. (2003) Biophys. J. 85, 1145-1164. pmid:12885659
10. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000) Nucleic Acids Res. 28, 235-242. pmid:10592235
11. Kihara, D. & Skolnick, J. (2003) J. Mol. Biol. 334, 793-802. pmid:14636603
12. Tramontano, A. & Morea, V. (2003) Proteins 53, Suppl. 6, 352-368. pmid:14579324
13. Moult, J., Fidelis, K., Zemla, A. & Hubbard, T. (2003) Proteins 53, Suppl. 6, 334-339. pmid:14579322
14. Jones, D. T. (1999) J. Mol. Biol. 292, 195-202. pmid:10493868
15. Needleman, S. B. & Wunsch, C. D. (1970) J. Mol. Biol. 48, 443-453. pmid:5420325
16. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389-3402. pmid:9254694
17. Zhang, Y., Kihara, D. & Skolnick, J. (2002) Proteins 48, 192-201. pmid:12112688
18. Zhang, Y. & Skolnick, J. (2004) J. Comp. Chem. 25, 865-871. pmid:15011258
19. Reva, B. A., Finkelstein, A. V. & Skolnick, J. (1998) Folding Des. 3, 141-147.
20. Blattner, F. R., Plunkett, G., III, Bloch, C. A., Perna, N. T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J. D., Rode, C. K., Mayhew, G. F., et al. (1997) Science 277, 1453-1474. pmid:9278503
21. Fischer, D. & Eisenberg, D. (1997) Proc. Natl. Acad. Sci. USA 94, 11929-11934. pmid:9342339
22. Peitsch, M. C., Wilkins, M. R., Tonella, L., Sanchez, J. C., Appel, R. D. & Hochstrasser, D. F. (1997) Electrophoresis 18, 498-501. pmid:9150930
23. Sanchez, R. & Sali, A. (1998) Proc. Natl. Acad. Sci. USA 95, 13597-13602. pmid:9811845
24. Kihara, D., Zhang, Y., Kolinski, A. & Skolnick, J. (2002) Proc. Natl. Acad. Sci. USA 99, 5993-5998. pmid:11959918
25. Hegyi, H., Lin, J., Greenbaum, D. & Gerstein, M. (2002) Proteins 47, 126-141. pmid:11933060
26. Kawabata, T., Fukuchi, S., Homma, K., Ota, M., Araki, J., Ito, T., Ichiyoshi, N. & Nishikawa, K. (2002) Nucleic Acids Res. 30, 294-298. pmid:11752318
27. Frishman, D., Mokrejs, M., Kosykh, D., Kastenmuller, G., Kolesov, G., Zubrzycki, I., Gruber, C., Geier, B., Kaps, A., Albermann, K., et al. (2003) Nucleic Acids Res. 31, 207-211. pmid:12519983
28. Marchler-Bauer, A., Panchenko, A. R., Shoemaker, B. A., Thiessen, P. A., Geer, L. Y. & Bryant, S. H. (2002) Nucleic Acids Res. 30, 281-283. pmid:11752315
29. Jones, D. T., Taylor, W. R. & Thornton, J. M. (1994) Biochemistry 33, 3038-3049. pmid:8130217
30. Jones, D. T. (1998) FEBS Lett. 423, 281-285. pmid:9515724
31. Bairoch, A. & Apweiler, R. (1998) Nucleic Acids Res. 26, 38-42. pmid:9399796