Reliabilities of identifying positive selection by the branc


Contributed by Masatoshi Nei, February 19, 2009

For M8 (9), we used the program codeml.exe in PAML 4 (19). For REL and FEL (11), we used DATAMONKEY (26). In the Bayesian methods, positive selection was inferred if the posterior probability of ω > 1 was 0.95 or higher for a site. In M8, the posterior probability for each site was computed three times with different initial ω values (0.5, 1.5, and 2.5), and the results from the run with the highest lnL are presented. In this method, we considered two different approaches, the naive-empirical-Bayes and Bayes-empirical-Bayes methods (27). If one of the approaches gave statistical support for a site, the site was regarded as positively selected. In the likelihood method, positive selection was inferred if the P value of LRT was less than 0.05 for a site. In DEPS, HyPhy (20) was used with the JTT model as the baseline matrix. Directional positive selection was inferred when the Bayes factor was equal to or larger than 100. Correctly predicted sites are shown in bold italic. The sites for which the experiments of λmax have not yet been conducted are underlined (S. Yokoyama, personal communication). Accession numbers for the sequences used are presented in SI Text. Bayes, Bayesian method; ML, likelihood method. (received for review January 9, 2009)


Abstract

Natural selection operating in protein-coding genes is often studied by examining the ratio (ω) of the rates of nonsynonymous to synonymous nucleotide substitution. The branch-site method (BSM), based on a likelihood ratio test, is one such test for detecting positive selection on a predetermined branch of a phylogenetic tree. However, because the number of nucleotide substitutions involved is often very small, we conducted a computer simulation to examine the reliability of BSM in comparison with the small-sample method (SSM) based on Fisher's exact test. The results indicate that BSM often generates false positives compared with SSM when the number of nucleotide substitutions is ≈80 or smaller. Because the ω value is also used for predicting positively selected sites, we examined the reliabilities of the site-prediction methods, using nucleotide sequence data for the dim-light and color vision genes in vertebrates. The results showed that the site-prediction methods have a low probability of identifying experimentally determined functional changes of amino acids and often falsely identify other sites where amino acid substitutions are unlikely to be important. This low rate of predictability occurs because most of the current statistical methods are designed to identify codon sites with high ω values, which may not have anything to do with functional changes. The codon sites showing functional changes generally do not show a high ω value. To understand adaptive evolution, some form of experimental confirmation is necessary.

branch-site method; small-sample method

In the current statistical methods of inferring positive selection using the ω value, it is assumed that ω > 1, ω = 1, and ω < 1 represent positive, neutral, and negative selection, respectively (1). One of the statistical methods using this approach is the branch-site method (BSM) (2, 3). In this method, the branches of a phylogenetic tree are divided into a predetermined (foreground) branch and other (background) branches, and codon sites are grouped into a few classes with different ω values (see Methods). The log likelihood (lnL) for the selection model used (modified model A) is then compared with that for the null model of no positive selection (ω ≤ 1), and the likelihood ratio test (LRT) is conducted to determine whether positive selection is operating in the foreground branch. This method has been widely used (e.g., 4–7), and one of the recent applications is Bakewell et al.'s (5) large-scale analysis of orthologous gene trios from humans, chimpanzees, and macaques. In this case, however, the numbers of synonymous (cS) and nonsynonymous (cN) substitutions per gene per branch were so small that the applicability of the large-sample theory of LRT is questionable.

Another test that is applicable to this type of dataset is the small-sample method (SSM) using Fisher's exact test (8). In this method the ancestral nucleotide sequence at each interior node is inferred by the parsimony method, and cS and cN for the branch to be tested are counted by comparing the sequences at the 2 terminal nodes of the branch. Positive selection is inferred when the cN/cS ratio is significantly higher than the ratio under the assumption of no selection. When cS and cN are small, the probability of occurrence of 2 or more substitutions at the same nucleotide site is negligibly small, and therefore parsimony estimates of cS and cN must be quite accurate. This is true even if the substitution rate varies with codon site to some extent. SSM should then be applicable to the primate dataset, and the results can be compared with those of BSM.

The ω value has also been used for predicting the positively selected codon sites in protein-coding genes (9–12). However, simulation studies showed that the Bayesian methods for predicting such sites often give false positives (13, 14). In fact, Yokoyama et al.'s (15) experimental study showed that these methods are not useful for identifying adaptive sites. These authors engineered the ancestral proteins of the dim-light vision opsins (RH1) from vertebrates and experimentally determined the critical amino acid substitutions that affect the maximum absorption wavelength (λmax) of the opsin (rhodopsin) encoded. Because the spectral tuning of λmax and the environmental condition of species were well correlated, these amino acid changes were considered to be adaptive. However, the Bayesian methods could not identify any of these critical sites. Because the critical amino acid changes affecting λmax have also been identified in color vision genes (ref. 16 for review), we can extend this type of analysis to these genes as well.

In this article, we first examine the reliability of BSM in comparison with SSM by using a computer simulation. We are particularly interested in evaluating the false-positive rates of BSM and clarifying their causes. We then study the reliabilities of the Bayesian and other statistical methods for detecting positively selected sites by using real sequence data.

Results

Computer Simulation Mimicking the Primate Data.

Our computer simulation for studying the reliability of BSM was done by mimicking Bakewell et al.'s (5) analysis of genes from the human-chimpanzee-macaque trios (Fig. 1). These authors considered the human or chimpanzee lineage as the foreground branch and the remaining lineages as the background. They examined ≈14,000 (actually 13,888) orthologous genes with an average of ≈450 (actually 432) codons. The average numbers of synonymous substitutions per synonymous site (bS) for the human, chimpanzee, and macaque lineages were ≈0.006, ≈0.006, and ≈0.06, respectively. [In the following, we use the notation bS = (0.006, 0.006, 0.06) for this case.] The average ω over all codon sites in these lineages was ≈0.25 (17). The transition/transversion rate ratio (κ) was ≈4 (18). On the basis of this information, we generated 14,000 sets of the human-chimpanzee-macaque trio sequences by a computer simulation (Fig. 1; see Methods for details). Because the ω value used was 0.25 for all sites, there were no sites under positive selection. Therefore, any site with an estimate (ω̂) of ω > 1 must be caused by sampling or estimation errors. When we applied BSM to the 14,000 sets of genes considering the human lineage as the foreground branch, we obtained 32 genes showing positive selection at the 5% significance level (α = 0.05) by using the computer program PAML 4 (19) (Table 1). (The results of BSM were obtained by PAML 4 unless otherwise stated.) By contrast, SSM based on Fisher's exact test showed no genes suggesting positive selection at the same significance level.

Fig. 1. Phylogenetic tree showing the simulation scheme. The foreground branch is shown by a bold line. In all simulations, κ is 4 and the number of codon sites (n) is 450. Not to scale.

Table 1. False-positive cases (P < 0.05) obtained by BSM in a computer simulation with n = 450, κ = 4, ωF = ωB = 0.25, and bS = (0.006, 0.006, 0.06)

One might wonder why SSM did not detect any positive selection. The reason is that cS and cN were both too small to give any statistical significance. In the case of SSM, positive selection is suggested only when cN is significantly greater than cS, and for this to happen cN must be 9 or greater even when cS = 0 (Table S1). In practice, cN was always equal to or smaller than 7 except for 2 cases, in which the cN/cS was 8/2 and 9/3. This indicates that the statistical information of the dataset used is not enough to yield a significant result and that all of the 32 cases obtained by BSM are false positives.
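The threshold quoted above can be illustrated with a minimal sketch of a one-sided Fisher's exact test on substitution counts. The numbers of nonsynonymous and synonymous sites used here (945 and 405, i.e., 30% synonymous sites among 1,350 nucleotide sites) are assumptions chosen for illustration, not the values in Table S1, and the function name is hypothetical. Counts must be integers, so decimal parsimony estimates would need rounding first.

```python
from math import comb


def fisher_one_sided(c_n, c_s, n_sites, s_sites):
    """One-sided Fisher's exact test for an excess of nonsynonymous changes.

    The 2 x 2 table is (changed, unchanged) by (nonsynonymous, synonymous).
    Returns P(X >= c_n) under the hypergeometric null of no selection.
    """
    changed = c_n + c_s
    total = n_sites + s_sites
    upper = min(changed, n_sites)
    numerator = sum(
        comb(n_sites, x) * comb(s_sites, changed - x)
        for x in range(c_n, upper + 1)
    )
    return numerator / comb(total, changed)


# Assumed site counts for a 450-codon gene (illustrative only).
N_SITES, S_SITES = 945, 405

p_9_0 = fisher_one_sided(9, 0, N_SITES, S_SITES)
p_8_0 = fisher_one_sided(8, 0, N_SITES, S_SITES)
p_8_2 = fisher_one_sided(8, 2, N_SITES, S_SITES)
print(p_9_0, p_8_0, p_8_2)
```

Under these assumed site counts, 9 nonsynonymous and 0 synonymous substitutions fall below the 5% level, whereas 8 and 0, or 8 and 2, do not, which is in line with the behavior described in the text.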

In the present simulation, we used the parsimony method for estimating cS and cN. However, because we recorded all mutations in the evolutionary process using a discrete-time model, we can use the true values of cS and cN for SSM. Table 1 shows that the parsimony estimates are often decimal, because there are 2 or more equally parsimonious pathways when 2 or more nucleotide differences exist between the 2 codons compared (21). In the computer simulation, we can identify which pathway was used so that the true numbers of substitutions are always integers. When these true numbers of cS and cN were used, virtually no changes in P values (type I error rates) occurred in SSM, and none of the genes showed positive selection at the 5% significance level. Some authors (22) have been critical of parsimony estimates of cS and cN and consequently of the methods based on parsimony estimates. In the present case, however, SSM is clearly more reliable than BSM.

Intuitively, one might expect that the P value for BSM becomes low when the cN/cS ratio is high, but Table 1 shows that this is not necessarily the case. This result was apparently obtained because the LRT is seriously affected by sampling errors when the number of nucleotide substitutions is small and because the regularity conditions for the χ² approximation are not satisfied in this method (3). Fig. 2 shows that the P value for SSM was always >0.1, indicating that the small cS and cN values are not informative enough to generate a significant result in any replication.

Fig. 2. Relationship between the P values for BSM and SSM. n = 450, κ = 4, ωF = ωB = 0.25, and 14,000 replications.

Table 1 also shows the estimates (ω̂2) of ω (= ω2) for the group of codons for which positive selection was inferred by BSM (see Methods). A surprising observation is that the ω̂2 values in the false-positive cases are all >70 and some are as high as 999, which is the maximum value that is printed by PAML 4 (19). Analyzing all 14,000 cases, we found that the ω̂2 value tends to be higher when P is small than when P is large (Fig. S1). The average ω̂2 value for 14,000 cases was 56.6. These ω̂2 values are obviously erroneous because the true value is 0.25 for all codons.

One might argue that there is no need to worry about this type of abnormal behavior of LRT because the observed false-positive rate (32/14,000 = 0.23%) is lower than the expected rate (5%) in large-sample tests. In the present case, however, BSM produces significant results when these results are not supposed to be obtained theoretically. This indicates that there is a computational problem in BSM. In addition, the false-positive rate in BSM can be >5% even under the condition of ω ≤ 1, as will be shown below.

False-Positive Rates when ω Varies with Branch.

So far we considered only the case where ω = 0.25 for both foreground and background branches. In reality, ω may be different between the foreground (ωF) and background (ωB) branches because of changes in functional constraints of the gene or some other factors. We have therefore considered several cases of different ωF and ωB values (Table 2, A). In this simulation, we generated 1,000 sets of genes for each case. In the first case of Table 2, A, we assumed ωF = 0.25 and ωB = 0.5. All other parameters were the same as before. In this case, 4 of the 1,000 genes showed positive selection in BSM, but none was detected by SSM. When we assumed ωF = 0.5 and ωB = 0.25, 0.5, or 1.0, BSM falsely detected positive selection with appreciable frequencies (0.9–1.6%), but SSM showed none. When ωF = 1 and ωB = 0.25, 0.5, or 1.0, the false-positive rate of BSM was ≈6% irrespective of the ωB value. This rate is slightly higher than the expected false-positive rate (5%) in large-sample tests. By contrast, the false-positive rates for SSM were still ≈1%. This low rate indicates that the number of nucleotide substitutions was still too small to be used for detecting positive selection. Because SSM is based on Fisher's exact test and the errors introduced by parsimony estimation of nucleotide substitutions are minor, the P value for BSM is apparently inflated by sampling errors. To see the effect of gene size, a similar simulation was conducted by using 900 codons instead of 450 codons. However, the results obtained were essentially the same as those of Table 2, A.

Table 2. Numbers (percent) of false positives obtained by BSM and SSM in a computer simulation with n = 450, κ = 4, and 1,000 replications

In the above computer simulation, the number of nucleotide substitutions per site was small because the human-chimpanzee-macaque trio was considered. However, BSM is also used for groups of species that are more genetically divergent. We therefore extended our simulation to the case of greater bS values [bS = (0.06, 0.06, 0.1)] for species 1, 2, and 3 (Fig. 1). Roughly speaking, bS = 0.06 for the foreground branch corresponds to the divergence time of ≈60 million years (MY) if humans and chimpanzees diverged ≈6 MY ago, whereas the total divergence time for the 3 species is ≈80 MY [corresponding to bS = (0.06 + 0.1)/2 = 0.08]. Therefore, these divergence times are similar to those of primates and placental mammals, respectively (23).
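The conversion from bS to divergence time in the paragraph above is a simple proportional scaling. The sketch below reproduces that arithmetic, assuming a constant rate calibrated from bS ≈ 0.006 for the ≈6 MY human lineage; the constant and function names are hypothetical.

```python
# Calibration taken from the text: bS of about 0.006 accumulates on the
# human lineage over about 6 million years (MY). A constant rate is assumed.
RATE_PER_MY = 0.006 / 6.0


def divergence_time_my(b_s):
    """Convert synonymous substitutions per synonymous site into MY."""
    return b_s / RATE_PER_MY


foreground_my = divergence_time_my(0.06)         # foreground branch
total_my = divergence_time_my((0.06 + 0.1) / 2)  # average depth of the 3-species tree
print(round(foreground_my), round(total_my))
```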

The results of this simulation are presented in Table 2, B. The false-positive rates for BSM were more or less the same as those in Table 2, A. These results indicate that a larger number of nucleotide substitutions (e.g., ≈80 substitutions when ωF = 1) does not improve the reliability of BSM. In the case of ωF = 1, however, SSM showed a higher false-positive rate when bS is large than when it is small, as expected from the increased number of nucleotide substitutions. Yet, the false-positive rates were still <5% and lower than those by BSM. In addition, there was no significant case when ωF = 0.25 or 0.5. Note that the true values of cS and cN gave essentially the same results in SSM (Table 2), indicating that the parsimony estimates of cS and cN are quite accurate. This accuracy was also confirmed by the small values (mostly <10%) of the average deviation (D) of parsimony estimates from the true values (Table 2).

These results were obtained from the simulation based on a discrete-time model with a time unit of bS = 0.0005 (see Methods). We also conducted a simulation with a continuous-time model, using the computer program “evolverNSbranches.exe” in PAML 4. The results of the simulation using this model were essentially the same as those for the discrete-time model (Tables S2 and S3 and Fig. S2). It should be noted that a similar computer simulation mimicking the sequence evolution of the human-chimpanzee-macaque trios was conducted by Bakewell et al. (5) and Suzuki (24). Suzuki examined the false-positive rates for the cases of ωF = 1 and ωB = 0.25 or 1 with 450 and 750 codons and obtained rates of 7–8% instead of the expected rate of 5% in large-sample tests. Bakewell et al.'s simulation for the cases of ωF = 1 with 400 and 1,000 codons also showed a false-positive rate of 6–8%. These results show excessively high false-positive rates, although the rate for SSM was not computed in these studies.

Data Analysis for Evaluating the Accuracy of Site-Prediction Methods.

In addition to BSM, the ω approach is used for detecting positively selected sites. For this purpose, the Bayesian [M8 (9), and REL (11)] and likelihood [FEL (11)] methods are commonly used. We therefore examined the reliabilities of these methods, using real data. The data used here were the dim-light and color vision genes [RH1, RH1-like (RH2), short wavelength-sensitive type 1 (SWS1), SWS type 2 (SWS2), and middle and long wavelength-sensitive (M/LWS) genes] in vertebrates. In these genes potentially adaptive amino acid substitutions that affect the optimal light sensitivity measured by λmax have been experimentally identified (e.g., 15, 16, 25). Using this information, we compared statistically predicted sites of positive selection with experimentally determined adaptive sites.

The results are shown in Table 3. In RH1 genes, many sites were predicted by the Bayesian methods and one site by the likelihood method when squirrelfish species were used. However, none of these sites agreed with the experimentally determined adaptive sites. In addition, most of the predicted sites disappeared when all vertebrate species were used for the analysis, as reported in ref. 15. In RH2 genes, the Bayesian methods did not detect any sites, and one site detected by the likelihood method did not agree with the adaptive sites experimentally determined. In SWS1 and M/LWS genes a few adaptive sites were correctly identified by the statistical methods when closely related species were used. Yet, most of the adaptive sites could not be detected by these methods. In SWS2 genes none of the sites was predicted as positively selected. These results indicate that in most cases the current statistical methods for site prediction with the ω value cannot detect the adaptive sites, and instead they often falsely identify other sites as positively selected.
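The comparison described above amounts to calling sites from per-site statistics and intersecting them with the experimentally determined set. The sketch below assumes the 0.95 posterior-probability cutoff stated for the Bayesian methods; the site numbers, posterior values, and function names are hypothetical and are not taken from Table 3.

```python
def call_sites(posteriors, threshold=0.95):
    """Return sites whose posterior probability of omega > 1 meets the cutoff."""
    return {site for site, prob in posteriors.items() if prob >= threshold}


def compare_sites(predicted, experimental):
    """Split predictions into hits, false positives, and missed adaptive sites."""
    predicted, experimental = set(predicted), set(experimental)
    return {
        "hits": sorted(predicted & experimental),
        "false_positives": sorted(predicted - experimental),
        "missed": sorted(experimental - predicted),
    }


# Hypothetical per-site posteriors and a hypothetical experimental set.
posteriors = {37: 0.99, 90: 0.41, 162: 0.96, 213: 0.80}
experimental_sites = {90, 213}

summary = compare_sites(call_sites(posteriors), experimental_sites)
print(summary)
```

In this made-up example the two called sites are both false positives and both adaptive sites are missed, which is the pattern the text reports for most of the genes examined.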

Table 3. Positively selected sites by the site-prediction methods and experimentally determined adaptive sites in dim-light and color vision genes in vertebrates

A different method called the DEPS method (28) was recently developed for predicting directional amino acid substitutions that may have changed protein function. This method does not rely on the ω value but uses the general pattern (baseline) of amino acid substitutions such as the JTT matrix. If a particular type of amino acid substitution occurs more frequently than expected from the baseline matrix of amino acid substitutions, the amino acid substitution is assumed to be adaptive. The predicted sites by DEPS are also shown in Table 3. In RH1 genes, DEPS predicted 29 sites and 4 of them agreed with the experimentally determined adaptive sites. However, when only squirrelfish species were used, all of these predicted sites disappeared. In SWS1 and M/LWS genes, a few sites were correctly predicted as in the case of the ω-based methods. In RH2 and SWS2 genes, however, none of the predicted sites agreed with the adaptive sites experimentally determined. (See Table S4 for the results obtained when the general time-reversible protein model was used.) These results indicate that DEPS also does not work well in predicting adaptive sites.

Why Does the Statistical Inference of Positive Selection Fail?

One obvious answer to this question is the effect of sampling errors (13, 14), but the major factor for the failure appears to be the inadequacy of the mathematical model of nucleotide or amino acid substitution used. In both Bayesian and likelihood methods, synonymous substitutions are assumed to be neutral in the codon substitution model and the rate of nonsynonymous substitution is ω times higher than the rate of synonymous substitution, ω being the same for all nonsynonymous substitutions occurring in the same codon. Furthermore, the current Bayesian and likelihood methods all attempt to identify codon sites where the ω̂ value is significantly >1, and these sites are regarded to be under positive selection. For this reason, the average ω̂ value for these predicted sites is >1 even when ω̂ was computed by the conservative Suzuki-Gojobori method (29) (Table 4).
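The assumption described above, that nonsynonymous changes occur at ω times the synonymous rate, is usually written as an instantaneous rate matrix over the 61 sense codons. The block below gives one widely used formulation of such a matrix; that it matches the exact model of ref. 19 term by term is an assumption, and q_ij (the rate from codon i to codon j) and π_j (the equilibrium frequency of codon j) are notation introduced here for illustration.

```latex
q_{ij} =
\begin{cases}
0, & \text{if codons } i \text{ and } j \text{ differ at more than one position} \\
\pi_j, & \text{for a synonymous transversion} \\
\kappa \pi_j, & \text{for a synonymous transition} \\
\omega \pi_j, & \text{for a nonsynonymous transversion} \\
\omega \kappa \pi_j, & \text{for a nonsynonymous transition}
\end{cases}
```

In this form a single ω multiplies every nonsynonymous change at a codon, regardless of which amino acids are exchanged, which is the property the argument in the text turns on.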

Table 4. Average ω̂ values for positively selected sites by site-prediction methods and experimentally determined adaptive sites in dim-light and color vision genes in vertebrates

However, the average ω̂ value for experimentally determined adaptive sites was much lower than that for the sites statistically inferred in all genes examined and was always <1 (Table 4). Why did this happen? The answer is that the functional change of a protein often occurs by replacement of a specific amino acid by another specific amino acid at 1 or a few codon positions (see ref. 30 for review). For example, the SWS1 gene appears to have encoded a violet-sensitive opsin in the ancestor of birds, but the opsin became sensitive to UV in zebra finch, budgerigar, and canary (25). This change was caused by a single amino acid change from serine to cysteine at position 90 in the ancestor of these birds, and other amino acid changes were unlikely to be important (16, 25). In this case, it is quite difficult to detect this site by the statistical methods because adaptive substitution occurs very rarely. In fact, none of the statistical methods predicted this site even when only bird sequences were used. By contrast, the codon sites where many amino acid substitutions occurred may be falsely predicted as positively selected sites because of a high ω value that is obtained by chance even if the substitutions were essentially neutral (31). Note that >90% of amino acid substitutions are conservative and do not change protein function appreciably (15, 30, 32). For this reason, it is not easy to predict the evolutionary changes of protein function statistically.

A similar problem occurs with the DEPS method as well. In this method the amino acid changes that occur more often than the baseline expected from a given substitution matrix are regarded as adaptive. However, there is no reason to believe that these changes are adaptive if only specific amino acid substitutions that occur rarely are adaptive. Furthermore, the theoretical basis of this method is not well established, because its baseline substitution matrix is not neutral but includes adaptive and conservative substitutions. Note that the baseline matrix is usually constructed from empirical data including all kinds of amino acid substitutions.

Discussion

We have shown that BSM gives false predictions of positive selection when the number of nucleotide substitutions in the foreground branch is small. This is apparently caused by the inadequacy of the statistical model used in BSM. For example, in the case of bS = (0.006, 0.006, 0.06), 450 codons (n = 450), and ω = 0.25, the number of substitutions (cS + cN) in the foreground branch was ≈4 on average, but we have to estimate the 6 parameters, p0, p1, p2a, p2b, ω0, and ω2, from this small number (see Methods). (The number of independent parameters is 4 instead of 6, because p2a and p2b are computed from p0 and p1.) Obviously, the number of substitutions is insufficient for obtaining reliable estimates of the parameters.
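A rough back-of-the-envelope check of the "≈4 substitutions" figure can be sketched as follows. The fraction of synonymous sites (30%) is an assumption made for illustration and is not given in the text, the approximation that the nonsynonymous rate per site equals ω times bS is a simplification, and the function name is hypothetical.

```python
def expected_counts(n_codons, b_s, omega, syn_fraction):
    """Approximate expected synonymous and nonsynonymous substitution counts.

    Assumes cS = bS * S and cN = omega * bS * N, where S and N are the
    numbers of synonymous and nonsynonymous sites in the gene.
    """
    sites = 3 * n_codons
    s_sites = syn_fraction * sites
    n_sites = sites - s_sites
    return b_s * s_sites, omega * b_s * n_sites


# Values from the text, except syn_fraction, which is an assumed 30%.
c_s, c_n = expected_counts(n_codons=450, b_s=0.006, omega=0.25, syn_fraction=0.30)
print(c_s, c_n, c_s + c_n)
```

With these inputs the expected total is a little under 4 substitutions, consistent with the average quoted in the text, and it makes concrete how little information is available for estimating 4 independent parameters.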

In fact, the estimates (p̂0, p̂1, p̂2a, p̂2b, ω̂0, and ω̂2) of the parameters varied widely among different replications. For example, the estimates of the parameters for 5 randomly chosen nonsignificant and 5 randomly chosen significant replications in the case of bS = (0.006, 0.006, 0.06) and ωF = ωB = 1 are given in Table 5 (see also Table S5). Because p1 represents the proportion of class 1 sites that are assumed to be under no selection, p̂1 should be close or equal to 1 at least in nonsignificant cases. However, the 5 nonsignificant cases showed that p̂1 varies from 0 to 0.97. The ω̂2 value also ranged from 1 to 50, although this value should be 1 theoretically because no selection was assumed. In significant cases, p̂1 was 0 except in case 2 and ω̂2 varied from 156 to 999. These wild variations of parameter estimates were apparently generated by sampling errors and the lack of the regularity conditions for the χ² approximation in LRT mentioned earlier. Note that p̂2 (sum of the estimates of the proportions of site classes 2a and 2b, which are assumed to be under positive selection for the foreground branch) was 1 in many cases (389 of 1,000 replications). This is unreasonable because no positive selection was assumed. Therefore, the results of BSM applied to primate data are not really reliable.
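The dependence of p2a and p2b on p0 and p1 can be made explicit with a short sketch. The formulas below follow the parameterization commonly given for the branch-site model A in the literature; that PAML 4 uses exactly this form is an assumption here, and the function name is hypothetical.

```python
def site_class_proportions(p0, p1):
    """Return (p0, p1, p2a, p2b) for the four site classes.

    The remaining mass 1 - p0 - p1 is split between classes 2a and 2b in
    the same ratio as p0 : p1. The split is undefined when p0 + p1 = 0,
    so that boundary case is rejected in this sketch.
    """
    if p0 < 0 or p1 < 0 or p0 + p1 > 1:
        raise ValueError("p0 and p1 must be non-negative and sum to at most 1")
    if p0 + p1 == 0:
        raise ValueError("split between classes 2a and 2b is undefined")
    rest = 1.0 - p0 - p1
    p2a = rest * p0 / (p0 + p1)
    p2b = rest * p1 / (p0 + p1)
    return p0, p1, p2a, p2b


# Hypothetical values of p0 and p1 for illustration.
props = site_class_proportions(0.6, 0.3)
print(props, sum(props))
```

Because only p0 and p1 are free, the model has 4 independent parameters (p0, p1, ω0, and ω2), as noted in the Discussion.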

Table 5. Estimates for the six parameters in the BSM analysis of 5 nonsignificant and significant cases when bS = (0.006, 0.006, 0.06) and ωF = ωB = 1

This unreliability of parameter estimates obtained by BSM is also revealed by their sensitivity to differences in the computational procedure. Table 1 shows the P and ω̂2 values obtained by the computer programs HyPhy (20) and PAML 4 (19). The maximization procedures of likelihood in the 2 programs are somewhat different, and this difference alone gave very different conclusions about the statistical significance in cases 8 and 28. In these two cases, PAML 4 gave a small P value, whereas HyPhy gave P = 1. Very different results were also obtained by PAML 4 and HyPhy for the case of the continuous-time model (Table S2). Note also that if we estimate the frequency of each codon (π) from the data rather than by assuming π = 1/61, the results of BSM again change considerably in both PAML 4 and HyPhy (Table S6).

Another indication of the difficulty of obtaining reliable likelihood estimates of parameters by BSM is the fact that when multiple nonsynonymous substitutions occur in a codon the gene is often identified as positively selected even if no positive selection actually operates in the gene (24). We call this the Suzuki effect, and this effect causes an erroneous identification of positive selection when closely related species are studied. The well-known recommendation of the use of multiple initial ω values in PAML 4 is also a clear indication of difficulties of obtaining maximum likelihood estimates.

However, a more serious problem is the inadequacy of the ω approach, as was shown with respect to the statistical prediction of positively selected sites. If the ω approach is not applicable, BSM or any other method using ω would give questionable results. We have also shown that the prediction of adaptive amino acid substitution by the DEPS method (28) often disagrees with the experimentally determined adaptive sites.

What should we do if we want to study the adaptive significance of amino acid substitutions? The best way would be to use site-directed mutagenesis or similar techniques and study the functional or fitness change due to a specific amino acid substitution experimentally (16, 33). For example, it was experimentally shown that a high-virulence strain of West Nile virus in American crows is caused by a single amino acid substitution from threonine to proline at position 249 of NS3 helicase (34). A study combining experimental and statistical methods also identified natural selection in the digestive RNase genes in leaf-eating monkeys (35). In some proteins, however, this type of experimental study may be difficult to conduct. In such cases statistical tests of selection may be useful if the study is done by considering the biochemical data available. For example, major histocompatibility complex (MHC) loci are known to be extraordinarily polymorphic, but the cause of this polymorphism was not known until Hughes and Nei (1, 36) showed that the ω value was significantly >1 at the antigen binding region (ABR) of the MHC molecules and that the ABR tends to include amino acid substitutions that cause charge changes of the molecules (37). From these observations, Hughes and Nei (1, 36) concluded that the MHC polymorphism must be caused by some kind of balancing selection. In general, statistical methods in combination with biological information may be useful for immune systems or antigenic genes.

In the above discussion we considered a model in which ω varies with codon site within a gene. Originally, however, ω was proposed to measure the direction and extent of selection operating for an entire gene. In this case ω is computed by using the average rates of synonymous and nonsynonymous substitutions for the entire nucleotide sequences of a gene (21). For this purpose, ω is still useful because the sampling error of this ω is generally small. The cN/cS value for the entire gene used in SSM is also useful for detecting positive selection when a new function of a gene evolves as a result of many nucleotide changes in the same direction [e.g., generating cationic proteins (38)].

The fitness of an individual is a complex character and is determined by a large number of genes, particularly with respect to morphological characters. Therefore, even if some gene experiences a functional change, it may not necessarily affect the fitness of the individual. It is interesting to note that even the selective advantage of trichromatic color vision over dichromatic vision has been disputed in New World monkeys (39, 40). It is important not to be overenthusiastic about statistical signatures of positive selection without biological confirmation.

Methods

Computer Simulation.

In generating DNA sequences we used the discrete-time model to compute the true cS and cN values for each evolutionary lineage. The 3 nucleotide sequences in Fig. 1 were generated by using the codon substitution model (19). The equilibrium frequencies were assumed to be the same for all 61 sense codons (π = 1/61) and κ = 4. The evolutionary time unit as measured by bS was 0.0005 in our simulation. To confirm the accuracy of our computation, we also generated sequences by using the continuous-time model “evolverNSbranches.exe” in PAML 4 (19).
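The idea of a discrete-time simulation that records the true cS and cN can be sketched as below. This is a simplified illustration and not the authors' program: the per-step probability scale `mu` is a hypothetical parameter that is not calibrated to bS, at most one change per codon per step is allowed, and changes to stop codons are simply excluded. The number of steps for the human branch (0.006/0.0005 = 12) follows from the time unit stated above.

```python
import random

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SENSE = [codon for codon, aa in CODE.items() if aa != "*"]  # 61 sense codons
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def neighbours(codon, kappa, omega):
    """List (new_codon, relative_rate, is_synonymous) for single-base changes."""
    out = []
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            new = codon[:pos] + base + codon[pos + 1:]
            if CODE[new] == "*":
                continue  # changes to stop codons are not allowed
            rate = kappa if (codon[pos], base) in TRANSITIONS else 1.0
            synonymous = CODE[new] == CODE[codon]
            if not synonymous:
                rate *= omega
            out.append((new, rate, synonymous))
    return out


def evolve(seq, steps, mu, kappa, omega, rng):
    """Evolve a list of codons for `steps` time units, counting true cS and cN."""
    seq = list(seq)
    c_s = c_n = 0
    for _ in range(steps):
        for i, codon in enumerate(seq):
            u, acc = rng.random(), 0.0
            for new, rate, synonymous in neighbours(codon, kappa, omega):
                acc += mu * rate  # cumulative probability of each change
                if u < acc:
                    seq[i] = new
                    if synonymous:
                        c_s += 1
                    else:
                        c_n += 1
                    break
    return seq, c_s, c_n


rng = random.Random(0)
ancestor = [rng.choice(SENSE) for _ in range(450)]  # equal codon frequencies
steps = round(0.006 / 0.0005)  # 12 time units for the human branch
descendant, c_s, c_n = evolve(ancestor, steps, mu=1e-4, kappa=4.0, omega=0.25, rng=rng)
print(c_s, c_n)
```

Because every change is recorded as it happens, the counts returned are the true numbers of substitutions along the branch and are always integers, in contrast to parsimony estimates.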

Statistical Methods.

After generating sequences, we conducted the BSM analysis, using the program codeml.exe in PAML 4. In this method, the branches of a tree are divided into the foreground and the background branches. All codon sites are categorized into classes 0, 1, 2a, and 2b with proportions of p0, p1, p2a, and p2b, respectively. In the modified model A (Table S7), negative selection is assumed to operate on both the foreground and background branches (0 < ω0 < 1) in class 0. In class 1, no selection is assumed to occur for both the foreground and background branches (ω1 = 1). In class 2a, it is assumed that positive selection operates on the foreground branch (ω2 > 1), whereas negative selection operates on the background branches (ω = ω0). In class 2b, positive selection is assumed to operate on the foreground branch (ω = ω2), whereas no selection is assumed for the background branches (ω = ω1 = 1). The null model of no positive selection is the same as the selection model except that no selection is assumed on the foreground branch (ω2 = 1) in classes 2a and 2b. The lnLs for these models are computed, and positive selection is inferred for the foreground branch if the LRT statistic is greater than χ²₁ = 3.84 (5% significance level, 1 degree of freedom) (3). In this study, we computed lnLs 3 times, using 3 different initial ω values (0.5, 1.5, and 2.5) as recommended. The highest lnL among the 3 trials was used for the computation of LRT. For each replication, we used the estimates (p̂0, p̂1, p̂2a, p̂2b, ω̂0, and ω̂2) of the 6 parameters for the highest lnL. We also conducted the BSM analysis, using the program “YangNielsenBranchSite2005.bf” in HyPhy to compare the results with those by PAML 4.
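The decision rule described above can be sketched in a few lines. The lnL values below are hypothetical, the function names are illustrative, and taking the best of the 3 runs for both the selection and null models is an assumption about how the procedure was applied. For 1 degree of freedom, the χ² tail probability equals erfc(√(x/2)), so no external library is needed.

```python
from math import erfc, sqrt

CRITICAL_5PCT = 3.84  # chi-square critical value, 1 degree of freedom


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0)) if x > 0 else 1.0


def branch_site_lrt(lnl_alt_runs, lnl_null_runs):
    """LRT from the best lnL over runs started at different initial omega values."""
    lnl_alt = max(lnl_alt_runs)
    lnl_null = max(lnl_null_runs)
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, chi2_sf_df1(stat), stat > CRITICAL_5PCT


# Hypothetical lnL values for initial omega = 0.5, 1.5, and 2.5.
alt_runs = [-2000.0, -1998.0, -1999.5]
null_runs = [-2000.5, -2000.4, -2000.6]

stat, p_value, significant = branch_site_lrt(alt_runs, null_runs)
print(stat, p_value, significant)
```

The χ² approximation used here is the large-sample result whose applicability the article questions when the number of substitutions is small, so the P value from this sketch inherits the same limitation.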

For SSM, the ancestral nucleotide sequence of humans and chimpanzees (or species 1 and 2) (Fig. 1) was inferred by the parsimony method. The cS and cN values and the numbers of synonymous and nonsynonymous sites in the human (or species 1) lineage were then estimated by using the modified Nei-Gojobori method (38) with the ratio of the numbers of transitions to transversions (R) = 2 (κ = 4). The test of neutrality was conducted by using Fisher's exact test (see Table S1).
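A minimal sketch of such a test is given below. It assumes a 2 × 2 table whose columns are nonsynonymous and synonymous and whose rows are changed and unchanged sites, and it computes the one-tailed probability of observing at least as many nonsynonymous changes as estimated. The exact layout of Table S1 is not reproduced here, the function name is hypothetical, and the rounding of estimated counts to integers is a simplification of this sketch.

```python
from math import comb


def fisher_one_sided(c_n, c_s, n_sites, s_sites):
    """One-tailed Fisher's exact test for an excess of nonsynonymous changes.

    c_n, c_s         : numbers of nonsynonymous and synonymous substitutions
    n_sites, s_sites : numbers of nonsynonymous and synonymous sites
    All values are rounded to integers (an assumption of this sketch).
    """
    a, b = round(c_n), round(c_s)
    col_n, col_s = round(n_sites), round(s_sites)
    changes = a + b                  # row total: all substitutions
    total = col_n + col_s            # all sites
    denom = comb(total, changes)
    # Sum hypergeometric probabilities for tables with >= a nonsynonymous changes.
    tail = 0
    for x in range(a, min(changes, col_n) + 1):
        tail += comb(col_n, x) * comb(col_s, changes - x)
    return tail / denom
```

For instance, with 3 nonsynonymous and 0 synonymous substitutions at 3 nonsynonymous and 3 synonymous sites, only one of the 20 possible tables is as extreme as the observed one, so the function returns 0.05.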

Real Data Analysis.

We used the dim-light and color vision (RH1, RH2, SWS1, SWS2, and M/LWS) genes in vertebrates. We obtained information about the critical amino acid changes for λmax from previous studies (ref. 15 for RH1 genes and ref. 16 for the other genes). The nucleotide sequences of these genes were obtained from GenBank (see SI Text for accession numbers). For RH1 genes, the sequences were provided by Shozo Yokoyama. Detailed procedures are presented in Tables 3 and 4.

Acknowledgments

We thank Hiroshi Akashi, Sayaka Miura, Naoko Takezaki, and Koichiro Tamura for their help in our computer simulation and Saby Das, Hiroki Goto, Eddie Holmes, Austin Hughes, Sergei Kosakovsky Pond, Bing Li, Bruce Lindsay, Jongmin Nam, and Shozo Yokoyama for their comments on earlier versions of the manuscript. This work was supported by National Institutes of Health Grants GM020293 (to M. Nei) and KAKENHI 17770007 (to Y.S.).

Footnotes

¹To whom correspondence should be addressed. E-mail: nxm2{at}psu.edu

Author contributions: M. Nozawa and M. Nei designed research; M. Nozawa and Y.S. analyzed data; and M. Nozawa, Y.S., and M. Nei wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0901855106/DCSupplemental.

References

1. Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335:167–170.
2. Yang Z, Nielsen R (2002) Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol Biol Evol 19:908–917.
3. Zhang J, Nielsen R, Yang Z (2005) Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol Biol Evol 22:2472–2479.
4. Arbiza L, Dopazo J, Dopazo H (2006) Positive selection, relaxation, and acceleration in the evolution of the human and chimp genome. PLoS Comput Biol 2:e38.
5. Bakewell MA, Shi P, Zhang J (2007) More genes underwent positive selection in chimpanzee evolution than in human evolution. Proc Natl Acad Sci USA 104:7489–7494.
6. Kosiol C, et al. (2008) Patterns of positive selection in six Mammalian genomes. PLoS Genet 4:e1000144.
7. Studer RA, Penel S, Duret L, Robinson-Rechavi M (2008) Pervasive positive selection on duplicated and nonduplicated vertebrate protein coding genes. Genome Res 18:1393–1402.
8. Zhang J, Kumar S, Nei M (1997) Small-sample tests of episodic adaptive evolution: A case study of primate lysozymes. Mol Biol Evol 14:1335–1338.
9. Yang Z, Nielsen R, Goldman N, Pedersen AM (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.
10. Suzuki Y (2004) New methods for detecting positive selection at single amino acid sites. J Mol Evol 59:11–19.
11. Kosakovsky Pond SL, Frost SD (2005) Not so different after all: A comparison of methods for detecting amino acid sites under selection. Mol Biol Evol 22:1208–1222.
12. Massingham T, Goldman N (2005) Detecting amino acid sites under positive selection and purifying selection. Genetics 169:1753–1762.
13. Suzuki Y, Nei M (2001) Reliabilities of parsimony-based and likelihood-based methods for detecting positive selection at single amino acid sites. Mol Biol Evol 18:2179–2185.
14. Suzuki Y, Nei M (2002) Simulation study of the reliability and robustness of the statistical methods for detecting positive selection at single amino acid sites. Mol Biol Evol 19:1865–1869.
15. Yokoyama S, Tada T, Zhang H, Britt L (2008) Elucidation of phenotypic adaptations: Molecular analyses of dim-light vision proteins in vertebrates. Proc Natl Acad Sci USA 105:13480–13485.
16. Yokoyama S (2008) Evolution of dim-light and color vision pigments. Annu Rev Genomics Hum Genet 9:259–282.
17. Gibbs RA, et al. (2007) Evolutionary and biomedical insights from the rhesus macaque genome. Science 316:222–234.
18. Rosenberg MS, Subramanian S, Kumar S (2003) Patterns of transitional mutation biases within and among mammalian genomes. Mol Biol Evol 20:988–993.
19. Yang Z (2007) PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586–1591.
20. Kosakovsky Pond SL, Frost SD, Muse SV (2005) HyPhy: Hypothesis testing using phylogenies. Bioinformatics 21:676–679.
21. Nei M, Kumar S (2000) Molecular Evolution and Phylogenetics (Oxford Univ Press, New York).
22. Yang Z, Bielawski JP (2000) Statistical methods for detecting molecular adaptation. Trends Ecol Evol 15:496–503.
23. Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W (2007) Using genomic data to unravel the root of the placental mammal phylogeny. Genome Res 17:413–421.
24. Suzuki Y (2008) False-positive results obtained from the branch-site test of positive selection. Genes Genet Syst 83:331–338.
25. Shi Y, Yokoyama S (2003) Molecular analysis of the evolutionary significance of ultraviolet vision in vertebrates. Proc Natl Acad Sci USA 100:8308–8313.
26. Kosakovsky Pond SL, Frost SD (2005) Datamonkey: Rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics 21:2531–2533.
27. Yang Z, Wong WS, Nielsen R (2005) Bayes empirical Bayes inference of amino acid sites under positive selection. Mol Biol Evol 22:1107–1118.
28. Kosakovsky Pond SL, Poon AF, Leigh Brown AJ, Frost SD (2008) A maximum likelihood method for detecting directional evolution in protein sequences and its application to influenza A virus. Mol Biol Evol 25:1809–1824.
29. Suzuki Y, Gojobori T, Nei M (2001) ADAPTSITE: Detecting natural selection at single amino acid sites. Bioinformatics 17:660–661.
30. Nei M (2005) Selectionism and neutralism in molecular evolution. Mol Biol Evol 22:2318–2342.
31. Hughes AL, Friedman R (2008) Codon-based tests of positive selection, branch lengths, and the evolution of mammalian immune system genes. Immunogenetics 60:495–506.
32. Perutz MF (1983) Species adaptation in a protein molecule. Mol Biol Evol 1:1–28.
33. Jermann TM, Opitz JG, Stackhouse J, Benner SA (1995) Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374:57–59.
34. Brault AC, et al. (2007) A single positively selected West Nile viral mutation confers increased virogenesis in American crows. Nat Genet 39:1162–1166.
35. Zhang J (2006) Parallel adaptive origins of digestive RNases in Asian and African leaf monkeys. Nat Genet 38:819–823.
36. Hughes AL, Nei M (1989) Nucleotide substitution at major histocompatibility complex class II loci: Evidence for overdominant selection. Proc Natl Acad Sci USA 86:958–962.
37. Hughes AL, Ota T, Nei M (1990) Positive Darwinian selection promotes charge profile diversity in the antigen-binding cleft of class I major-histocompatibility-complex molecules. Mol Biol Evol 7:515–524.
38. Zhang J, Rosenberg HF, Nei M (1998) Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc Natl Acad Sci USA 95:3708–3713.
39. Regan BC, et al. (2001) Fruits, foliage and the evolution of primate colour vision. Philos Trans R Soc Lond B Biol Sci 356:229–283.
40. Hiramatsu C, et al. (2008) Importance of achromatic contrast in short-range fruit foraging of primates. PLoS ONE 3:e3356.