Prediction of protein folding rates from the amino acid seq


Communicated by I. M. Gelfand, Rutgers, The State University of New Jersey, Piscataway, NJ, April 15, 2004 (received for review November 4, 2003)



We present a method for predicting folding rates of proteins from their amino acid sequences only, or rather, from their chain lengths and their helicity predicted from their sequences. The method achieves 82% correlation with experiment over all 64 “two-state” and “multistate” proteins (including two artificial peptides) studied up to now.

Proteins have very different rates of folding. Some of them fold within microseconds (1), some need an hour to fold (2). Small proteins usually (but far from always) fold faster than the larger ones (3). The correlation between folding rates and protein sizes is not very large, however: 64% (being stronger for “multistate” proteins that have folding intermediates when folded in water, and weaker for “two-state” proteins that do not have such intermediates) (4).

The folding rate of proteins is predicted more accurately when a “contact order” of the 3D structure is taken into account (5, 6) in addition to the chain length: the correlation then reaches 74% for the totality of proteins (6).

It has been noticed that a high helical content is the main structural feature that decreases the contact order and accelerates folding of two-state proteins (7), and, for this group of proteins, a folding rate prediction method based on the content of secondary structure in their 3D structures has been suggested recently (8). [However, examining the whole set of proteins studied up to now, we saw that the reported equation, which works well for the two-state proteins (8), is much worse at predicting folding rates of multistate proteins and short peptides.]

The empirical dependence of folding rate on some features of amino acid sequences has also been reported for some small groups of two-state proteins (9, 10), but no method to predict folding rates from sequences for the totality of proteins has been suggested so far.

The present work shows that folding rates for proteins of all kinds (as well as for short peptides) are well estimated from secondary structure predictions based on the amino acid sequences and the lengths of these sequences. This estimate has a clear physical sense, and it achieves 82% correlation with experiment over all 62 two-state and multistate proteins and 2 peptides studied up to now [i.e., it works even better than estimates based on whole 3D structures (5, 6)]. As a result, we can suggest a general method that predicts the rate of in-water protein folding directly from its primary structure and does not need any information on its 3D fold.

Materials and Methods

List of Proteins. This list (Table 1, which is published as supporting information on the PNAS web site) includes all peptides and single-domain proteins having no S–S bonds or covalently bound ligands, whose in-water folding rates (kf) and primary and 3D structures have been established experimentally. It includes all 57 proteins and peptides from table 1 of Ivankov et al. (6) and, in addition, 7 recently studied proteins: Trp-cage [Protein Data Bank (ref. 11) ID code 1L2Y, 20 residues, kf = 2.5 × 10^5 sec^–1] (1); villin headpiece (1VII, 36 residues, kf = 9.8 × 10^4 sec^–1) (12); B domain of staphylococcal protein A (1BDD, 58 residues, kf = 1.2 × 10^5 sec^–1) (13); engrailed homeodomain (1ENH, 61 residues, kf = 3.6 × 10^4 sec^–1) (14); HypF-N (1GXT, 91 residues, kf = 81 sec^–1) (10); common-type acylphosphatase (2ACY, 98 residues, kf = 2.5 sec^–1) (15); and VlsE (1L8W, 341 residues, kf = 4.9 sec^–1) (16).

Secondary Structure Establishment. Secondary structure was established from Protein Data Bank (11) coordinates of proteins by using the program dssp (ref. 17), which marks helical residues by symbols H and β-structural residues by symbols E.

Secondary Structure Prediction. Secondary structure was predicted by using the programs psipred (ref. 18) and alb (ref. 19). The residues predicted as helical are marked by H by psipred and by H and & by alb, and those predicted as β-structural are marked by E by psipred and by S and B by alb.
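The symbol conventions listed above can be collapsed into one common alphabet before any counting is done. The following is a minimal Python sketch, not the authors' code: the function name `normalize`, the program keys, and the three-letter output alphabet (H, E, C) are assumptions introduced here for illustration, and the input is assumed to be one symbol per residue.

```python
# Helix and strand symbols as stated in the text for each program.
# The output alphabet (H = helix, E = strand, C = everything else)
# is an assumption made for this sketch.
HELIX = {"psipred": {"H"}, "alb": {"H", "&"}, "dssp": {"H"}}
STRAND = {"psipred": {"E"}, "alb": {"S", "B"}, "dssp": {"E"}}


def normalize(ss: str, program: str) -> str:
    """Map a per-residue secondary structure string to H, E, or C."""
    helix, strand = HELIX[program], STRAND[program]
    return "".join(
        "H" if symbol in helix else "E" if symbol in strand else "C"
        for symbol in ss
    )
```

Any symbol not listed for the chosen program is treated as non-helical and non-β, which matches the text only in so far as the text names no other helical or β symbols.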

Results and Discussion

Both analytical theory (20, 21) and off-lattice computer simulations (22) of folding suggest that the logarithm of folding rate (kf) decreases in proportion to some power of the protein chain length [although the value of this power for in-water folding of proteins is still determined rather crudely: from two-thirds to one-half (see refs. 6 and 20–22), and maybe even below that (23)].

Theoretically, the protein chain length is determined as a number of the chain links (“folding units”) (20, 21), which is usually (4, 6, 20–23) calculated as the number L of the chain residues. However, if the folding chain contains some preformed blocks (or the blocks, which are rapidly and independently formed on the other chain during the folding), the effective length of the folding chain (Leff) should be smaller than the number of residues L in proportion to the total number of residues involved in these blocks.

Because α-helices are natural candidates for the role of the internally stable and/or rapidly and independently folding blocks, the effective length of the folding chain can be taken in the form

$$L_{\mathrm{eff}} = L - L_H + l_1 \times N_H \qquad [1]$$

where LH is the number of residues in helical conformation, NH is the number of helices, and l1 means that we consider the whole block (a helix) as l1 chain residues [from a physical point of view, l1 should not exceed a length of one turn of α-helix (4 residues); the l1 value should be optimized from comparison with experiment].
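As an illustration of Eq. 1, the effective length can be computed from a per-residue secondary structure string. This is a sketch under stated assumptions: the function name `effective_length` is hypothetical, helical residues are assumed to be marked by `H`, and a helix is taken to be a maximal run of consecutive `H` symbols (so two helices with no non-helical residue between them are counted as one).

```python
from itertools import groupby


def effective_length(ss: str, l1: float = 3.0) -> float:
    """Return Leff = L - LH + l1 * NH for a secondary structure string."""
    L = len(ss)
    # Collapse the string into (symbol, run_length) pairs.
    runs = [(symbol, sum(1 for _ in group)) for symbol, group in groupby(ss)]
    LH = sum(length for symbol, length in runs if symbol == "H")  # helical residues
    NH = sum(1 for symbol, _ in runs if symbol == "H")            # number of helices
    return L - LH + l1 * NH
```

For a made-up 21-residue string with two helices of 6 and 4 residues, L = 21, LH = 10, NH = 2, so Leff = 21 – 10 + 3×2 = 17 at l1 = 3.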

Now, according to refs. 6 and 20–23, we can consider the following dependence of the folding rate on the effective chain length:

$$\log(k_f) \sim -(L_{\mathrm{eff}})^P \qquad [2]$$

where the value of power P is to be fitted from comparison with experimental data. It is noteworthy that the case P = 0 corresponds to correlation of log(kf) with log(Leff), because const – L^P = const – exp(P×ln(L)) ≈ (const – 1) – P×2.3×log(L) ∝ const′ – log(L) when P → 0. A correlation of log(kf) with log(L) has been suggested by Gutin et al. (24) (on the basis of in silico folding of simplified models of protein chains) for the ambient conditions that are most favorable for folding. The correlation of log(kf) with L^(2/3) concerns [according to the analytical theory of Finkelstein and Badretdinov (21)] the other extreme, “the minimal ambient folding conditions,” i.e., the midtransition between the native structure and the coil. And, at last, the intermediate P values in the region 0 < P < ⅔ may be expected for various intermediate “moderately folding” conditions, including folding in water.
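The P → 0 limit mentioned above follows from a first-order expansion of the exponential. The short derivation below is added for clarity; it uses only the symbols of the text, with log denoting the decimal logarithm and ln 10 ≈ 2.3.

```latex
\begin{align*}
L^{P} &= e^{P \ln L} = 1 + P \ln L + O(P^{2}), \\
\mathrm{const} - L^{P} &\approx (\mathrm{const} - 1) - P \ln L
  = (\mathrm{const} - 1) - P \, \ln 10 \, \log L .
\end{align*}
```

Because a correlation coefficient is unchanged by an additive constant or a positive factor, the correlation of log(kf) with –L^P tends to its correlation with –log(L) as P → 0.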

The value of L is directly obtained from the protein sequence, whereas the values of LH and NH can be estimated from the same sequence, using some good program of secondary structure prediction, e.g., the most effective program psipred (18), based on local sequence similarity, or a physics-based program alb (19). Besides (as a control test) we can compute LH and NH from known 3D protein structures by using dssp (17).

Having L, LH, and NH, we can estimate the value of log(kf) by using Eqs. 1 and 2. In each case, we varied the P and l1 values so as to maximize the correlation between log(kf) and –(Leff)^P.
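The optimization described above can be sketched as a plain grid search over P and l1. This is an illustrative reconstruction rather than the authors' procedure: the function names, the default grids, and the input layout (one tuple of L, LH, NH, and measured log kf per protein) are assumptions, and the case P = 0 is handled as the log(Leff) limit described in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def transformed(leff, P):
    """Return -(Leff)^P, or -log10(Leff) in the limiting case P = 0."""
    return -math.log10(leff) if P == 0 else -(leff ** P)


def best_fit(proteins, powers=(0.0, 0.1, 0.3, 0.5, 2 / 3), l1_values=(1, 2, 3, 4)):
    """proteins: list of (L, LH, NH, log_kf). Returns (r, P, l1) with maximal r."""
    y = [log_kf for _, _, _, log_kf in proteins]
    best = None
    for l1 in l1_values:
        for P in powers:
            x = [transformed(L - LH + l1 * NH, P) for L, LH, NH, _ in proteins]
            r = pearson(x, y)
            if best is None or r > best[0]:
                best = (r, P, l1)
    return best
```

Called with the experimental data set (not reproduced here), `best_fit` would return the grid point of highest correlation; it gives no error estimate, so nearby grid points with nearly equal correlation should be regarded as equivalent.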

Fig. 1 presents the results for the case when the secondary structure prediction is done by psipred. One can see that the correlation between log(kf) and –(Leff)^P exceeds 80% and is nearly the same (within small statistical errors) for all P values below 0.7 (cf. ref. 23). The maximal correlation, 82 ± 4%, is formally achieved at P = 0.1, and l1 = 3, but, actually, all of the region P = 0.0–0.5 and l1 = 1–4 is equally good. The latter shows that, actually, only the helical content is important.

Fig. 1. Correlation between the logarithm of protein folding rate in water and the value (L – LH + 3×NH)^P at P = 0.1 [the number of helical residues (LH) and number of helices (NH) are predicted for each protein from its sequence by using psipred; L is the total number of chain residues]. •, Two-state proteins; □, multistate proteins; ▵, short artificial peptides (α-helix and β-hairpin) without tertiary structure. The straight heavy line represents the best linear fit, log(kf, sec^–1) = 10.7 – 16.6×[(L – LH + 3×NH)^0.1 – 1]. Standard deviation between predicted and observed log(kf) is ± 1.07. From a practical point of view, the following equations, which have essentially equal predictive power, may be suggested to estimate the folding rate: log(kf, sec^–1) = 12.4 – 5.7×log(L – LH + 3×NH) (line “log,” correlation 81.8%); log(kf, sec^–1) = 10.7 – 16.6×[(L – LH + 3×NH)^0.1 – 1] (straight line “0.1,” correlation 81.8%); log(kf, sec^–1) = 8.2 – 2.4×[(L – LH + 3×NH)^0.3 – 1] (line “0.3,” correlation 81.4%); log(kf, sec^–1) = 6.6 – 0.6×[(L – LH + 3×NH)^(1/2) – 1] (line “1/2,” correlation 80.4%); and log(kf, sec^–1) = 5.7 – 0.2×[(L – LH + 3×NH)^(2/3) – 1] (line “2/3,” correlation 79.1%). It is hardly possible to distinguish the quality of these approximating equations because, as the figure shows, the divergence of even the most deviating of the above functions, log and 2/3, is much smaller than the standard deviation (≈± 1.1) of experimental points from any of them, and the correlation of log(X) and X^(2/3) at the interval covered by experimental points ([L – LH + 3×NH]min = 8 ≤ X ≤ [L – LH + 3×NH]max = 270) is very high, 96% (much higher than the correlation of experimental points and each of the approximating equations). (Inset) Correlation coefficient between log(kf) and –(L – LH + l1 × NH)^P depending on power P (the lines are drawn for l1 values equal to 1, 2, 3, and 4; they virtually coincide). The standard error bars (which are virtually equal in all of the cases) are shown for the line with l1 = 3.
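The five approximating equations quoted in the caption of Fig. 1 can be evaluated with a few lines of Python. The coefficients below are copied from the caption; the function name `predict_log_kf`, the dictionary layout, and the use of the line labels as keys are assumptions of this sketch.

```python
import math

# label: (intercept, slope, power); power None marks the logarithmic form.
FITS = {
    "log": (12.4, 5.7, None),
    "0.1": (10.7, 16.6, 0.1),
    "0.3": (8.2, 2.4, 0.3),
    "1/2": (6.6, 0.6, 0.5),
    "2/3": (5.7, 0.2, 2.0 / 3.0),
}


def predict_log_kf(L, LH, NH, fit="0.1"):
    """Estimate log10 of the folding rate (kf in sec^-1) from L, LH, and NH."""
    x = L - LH + 3 * NH  # effective length with l1 = 3, as in the caption
    intercept, slope, power = FITS[fit]
    if power is None:
        return intercept - slope * math.log10(x)
    return intercept - slope * (x ** power - 1.0)
```

Given the stated standard deviation of about ±1.1 in log(kf), any value returned here should be read as an order-of-magnitude estimate, and the choice among the five forms makes little practical difference.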

Nearly the same results are obtained when the secondary structure is predicted from sequence by alb (then the correlation achieves 78 ± 5%) or extracted from 3D structures by dssp (then the correlation achieves 81 ± 4%). Thus, the folding rate prediction obtained from the amino acid sequence alone (with the help of a secondary structure prediction done by psipred) is at least not worse than the predictions done from known 3D structures.

It should be noted that, although the obtained dependence works a little better for the totality of proteins than for their subgroups taken separately, it is not very sensitive to the set of proteins used. If we exclude two short artificial peptides (which are represented as two triangles in the left of Fig. 1, and are not true proteins, having no tertiary structure), the maximal correlation between log(kf) and –(Leff)^P for the remaining proteins is as high as 78 ± 5%. If, in addition, we exclude tryptophan synthase β2-subunit [which is represented as the rightmost square in Fig. 1, and, unlike the other proteins used in this study, is not a true single-domain protein, according to the Structural Classification of Proteins (ref. 25)], the correlation between log(kf) and –(Leff)^P decreases by only an additional 2–3%. If we consider only the two-state proteins, the maximal correlation between log(kf) and –(Leff)^P is 74 ± 8%; and for only the multistate proteins it is 77 ± 10%.

Some remarks in conclusion:

We did not manage to improve the predictions by consideration of β-structure, maybe because α and β contents are strongly anticorrelated (at the level of 87%), and because there is, as yet, no method to predict internally stable and therefore rapidly folding (cf. ref. 26) β-hairpins.

The suggested theory is simple, but inevitably approximate, because it does not take into account those sequence mutations that do not change the secondary structure but can sometimes change the folding rate by two orders of magnitude (27). Also, the theory does not take into account a solvent-induced change in protein stability, which can change the folding rate manyfold (21, 27, 28). Therefore, it is only natural that this theory, which predicts the in-water folding rates, makes this prediction with a precision of plus or minus an order of magnitude (cf. Fig. 1); however, this is a relatively small error against the background of the 10 orders of magnitude difference in observed protein folding rates.

α-Helices may lead to effective shortening of the folding protein chain either because some preformed helices already exist in the unfolded state of the chain [which, indeed, may be the case for in-water conditions (26)], or because the helices are rapidly formed in the course of folding. Therefore, we would like to avoid any speculations on hierarchic or nonhierarchic mechanism of protein folding that may arise from the presented results.

The presented theory reveals a high (≈80%) correlation between the folding rate and the number (Leff) of nonhelical residues in the protein chain. However, the correlation between the folding rate and the content (= Leff/L) of nonhelical residues in the protein chain is poor: only ≈25% (results are not shown). This poor correlation shows that the main determinant of the protein folding rate is the number of degrees of freedom that are to be fixed during the rate-limiting step of folding (cf. refs. 3, 4, 6, and 21–24).

Supplementary Material

Supporting Table


We are grateful to O. V. Galzitskaya and S. O. Garbuzynskiy for help and seminal discussions. This work was supported by the Russian Academy of Sciences (Program “Physical and Chemical Biology” and Grant “Scientific Schools” 1968.2003.4), by the Russian Foundation for Basic Research, and by an International Research Scholar's Award from the Howard Hughes Medical Institute (to A.V.F.).


* To whom correspondence should be addressed. E-mail: afinkel{at}

Received November 4, 2003. Copyright © 2004, The National Academy of Sciences


1. Qiu, L., Pabit, S. A., Roitberg, A. E. & Hagen, S. J. (2002) J. Am. Chem. Soc. 124, 12952–12953. pmid:12405814
2. Goldberg, M. E., Semisotnov, G. V., Friguet, B., Kuwajima, K., Ptitsyn, O. B. & Sugai, S. (1990) FEBS Lett. 263, 51–56. pmid:1691989
3. Galzitskaya, O. V., Ivankov, D. N. & Finkelstein, A. V. (2001) FEBS Lett. 489, 113–118. pmid:11165233
4. Galzitskaya, O. V., Garbuzynskiy, S. O., Ivankov, D. N. & Finkelstein, A. V. (2003) Proteins 51, 162–166. pmid:12660985
5. Plaxco, K. W., Simons, K. T. & Baker, D. (1998) J. Mol. Biol. 277, 985–994. pmid:9545386
6. Ivankov, D. N., Garbuzynskiy, S. O., Alm, E., Plaxco, K. W., Baker, D. & Finkelstein, A. V. (2003) Protein Sci. 12, 2057–2062. pmid:12931003
7. Mirny, L. & Shakhnovich, E. I. (2001) Annu. Rev. Biophys. Biomol. Struct. 30, 361–396. pmid:11340064
8. Gong, H., Isom, D. G., Srinivasan, R. & Rose, G. D. (2003) J. Mol. Biol. 327, 1149–1154. pmid:12662937
9. Shao, H., Peng, Y. & Zeng, Z.-H. (2003) Protein Pept. Lett. 10, 277–280. pmid:12871147
10. Calloni, G., Taddei, N., Plaxco, K. W., Ramponi, G., Stefani, M. & Chiti, F. (2003) J. Mol. Biol. 330, 577–591. pmid:12842473
11. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rogers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977) Eur. J. Biochem. 80, 319–324. pmid:923582
12. Islam, S. A., Karplus, M. & Weaver, D. L. (2000) J. Mol. Biol. 318, 199–215.
13. Myers, J. K. & Oas, T. G. (2001) Nat. Struct. Biol. 8, 552–558. pmid:11373626
14. Mayor, U., Johnson, C. M., Daggett, V. & Fersht, A. R. (2000) Proc. Natl. Acad. Sci. USA 97, 13518–13522. pmid:11087839
15. Taddei, N., Chiti, F., Paoli, P., Fiaschi, T., Bucciantini, M., Stefani, M., Dobson, C. M. & Ramponi, G. (1999) Biochemistry 38, 2135–2142. pmid:10026297
16. Jones, K. & Wittung-Stafshede, P. (2003) J. Am. Chem. Soc. 125, 9606–9607. pmid:12904024
17. Kabsch, W. & Sander, C. (1983) Biopolymers 22, 2577–2637. pmid:6667333
18. Jones, D. T. (1999) J. Mol. Biol. 292, 195–202. pmid:10493868
19. Ptitsyn, O. B. & Finkelstein, A. V. (1983) Biopolymers 22, 15–25. pmid:6673754
20. Thirumalai, D. (1995) J. Phys. (Orsay, Fr.) 5, 1457–1469.
21. Finkelstein, A. V. & Badretdinov, A. Ya. (1997) Folding Des. 2, 115–121.
22. Koga, N. & Takada, S. (2001) J. Mol. Biol. 313, 171–180. pmid:11601854
23. Li, M. S., Klimov, D. K. & Thirumalai, D. (2003) Polymer 45, 573–579.
24. Gutin, A. M., Abkevich, V. I. & Shakhnovich, E. I. (1996) Phys. Rev. Lett. 77, 5433–5436. pmid:10062802
25. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536–540. pmid:7723011
26. Finkelstein, A. V. & Ptitsyn, O. B. (2002) Protein Physics: A Course of Lectures (Academic, New York), pp. 103–116.
27. Fersht, A. (1999) Structure and Mechanism in Protein Science: A Guide to Enzyme Catalysis and Protein Folding (Freeman, New York), pp. 540–572.
28. Gunasekaran, K., Eyles, S. J., Hagler, A. T. & Gierasch, L. M. (2001) Curr. Opin. Struct. Biol. 11, 83–93. pmid:11179896