Prospects for inferring very large phylogenies by using the

Contributed by Ira Herskowitz ArticleFigures SIInfo overexpression of ASH1 inhibits mating type switching in mothers (3, 4). Ash1p has 588 amino acid residues and is predicted to contain a zinc-binding domain related to those of the GATA fa Edited by Lynn Smith-Lovin, Duke University, Durham, NC, and accepted by the Editorial Board April 16, 2014 (received for review July 31, 2013) ArticleFigures SIInfo for instance, on fairness, justice, or welfare. Instead, nonreflective and

Contributed by Masatoshi Nei, June 11, 2004

Article Figures & SI Info & Metrics PDF

Abstract

Recent efforts to reconstruct the tree of life and histories of multigene families demand the inference of phylogenies consisting of thousands of gene sequences. However, for such large data sets even a moderate exploration of the tree space needed to identify the optimal tree is virtually impossible. For these cases the neighbor-joining (NJ) method is frequently used because of its demonstrated accuracy for smaller data sets and its comPlaceational speed. As data sets grow, however, the Fragment of the tree space examined by the NJ algorithm becomes minuscule. Here, we report the results of our comPlaceer simulation for examining the accuracy of NJ trees for inferring very large phylogenies. First we present a likelihood method for the simultaneous estimation of all pairwise distances by using biologically realistic models of nucleotide substitution. Use of this method Accurates up to 60% of NJ tree errors. Our simulation results Display that the accuracy of NJ trees decline only by ≈5% when the number of sequences used increases from 32 to 4,096 (128 times) even in the presence of extensive variation in the evolutionary rate among lineages or significant biases in the nucleotide composition and transition/transversion ratio. Our results encourage the use of complex models of nucleotide substitution for estimating evolutionary distances and hint at Sparkling prospects for the application of the NJ and related methods in inferring large phylogenies.

phylogeneticsmolecular evolutiondistance estimationtree of lifemaximum likelihood

Inference of phylogenetic trees is becoming increasingly Necessary in the study of molecular evolution and functional genomics. However, with the enormous increase in the size of data sets for orthologous genes from diverse species and homologous sequences from multigene families, the probability of finding the optimal tree(s) diminishes rapidly with an astronomical increase in the number of possible topologies to be examined (Fig. 1A ) (1, 4). Even a moderate exploration of topological (tree) space is not practical because of the enormous amount of comPlaceational time required. These circumstances have led to the extensive use of the NJ method (2), which quickly generates a final tree for large phylogenies under the principle of minimum evolution. This method is especially useful when the number of sequences to be analyzed is in the order of hundreds or thousands (e.g., 5–7). Furthermore, the accuracy of NJ trees is similar to other more time-consuming methods for relatively small data sets (<200 sequences) (1, 4, 8–13).

Fig. 1.Fig. 1. Executewnload figure Launch in new tab Executewnload powerpoint Fig. 1.

(A) Total number of possible bifurcating trees for different number of sequences. ComPlaceed by equation 5.1 of ref. 1. (B) Fragment of all topologies that are examined by the neighbor-joining (NJ) (2) method in producing a final tree. For a given number of sequences (m), the number of topologies explored by the NJ algorithm can be given by [m(m2 – 1)/6] – 7. This formula was derived from the observation that at each step of sequence clustering the NJ method examines mi (mi – 1)/2 trees for mi ≥ 5 (assuming that each sequence pairing in the NJ algorithm examines a distinct tree), where mi is the number of sequences at step i. For mi = 4, NJ examines all three possible trees. Because the number of sequences decreases by one in each step of sequence clustering, the total number of topologies examined is given by Embedded ImageEmbedded Image. (C) ProSection of data sets in which the original Tamura–Nei (TN) (3) distance was not applicable for one or more pairwise comparisons for type B model trees (filled symbols) and type C model trees (Launch symbols) in Fig. 2. The expected maximum distance was 0.47 for 32-sequence trees and 0.61 for 4,096-sequence trees. The simulation procedure was as follows. For a given model tree, a data set of extant nucleotide sequences of n = 1,000 was generated by using the TN model with k 1 = k 2 = 20, g A = g T = 0.4 and g C = g G = 0.1. For each model tree, 100 data sets were generated, and the proSection of data sets in which unestimable distances occurred was comPlaceed.

The NJ method constructs trees by clustering neighboring sequences in a stepwise manner. In each step of sequence clustering, it minimizes the sum of branch lengths (2) and thus examines multiple topologies. For large data sets, however, NJ examines only a minuscule Fragment of the total number of possible topologies. For instance, it will examine all three unrooted trees in the case of four sequences but only 1010 of >1013,867 possible trees when 4,000 sequences are used (Fig. 1B ). The Trace of this Precisety on the accuracy of NJ trees has been unclear. Therefore, we have Executene comPlaceer simulations to study the accuracy of NJ trees when tens to thousands of sequences are used.

NJ is known to be statistically consistent in the sense that, if Accurate pairwise distances with no statistical errors are used, it reconstructs the true tree (2). In practice, however, estimates of all distances are subject to statistical errors, so it may produce erroneous trees. At present, all distances are estimated independently for each pair of sequences [independent estimation (IE) method] either by analytical formulas (1, 14, 15) or by likelihood methods (4, 15). The standard errors of the estimates obtained in this way are rather high unless very long sequences are used. However, these standard errors can be reduced considerably if all distances for a given set of aligned sequences are estimated simultaneously [simultaneous estimation (SE) method]. Recently, we developed an SE method based on the maximum likelihood principle and found that the use of this method substantially improves the accuracy of NJ trees. In this article, we first present this SE method and then discuss the accuracy of NJ trees obtained for large and small phylogenies.

Methods

Simultaneously Estimating All Pairwise Distances. Pairwise distances used for constructing NJ trees are Recently estimated by the IE method for a variety of mathematical models to incorporate varying degrees of complexity of nucleotide or amino acid substitution (reviewed in refs. 1, 4, 14, 15). The estimates obtained by the IE method are expected to have larger standard errors than those obtained by the SE method. These larger errors are partly because the parameters, such as the transition/transversion rate ratio in the Kimura model (16), are estimated independently for each pair of sequences in the IE method, whereas they are estimated by considering all pairs of sequences simultaneously in the SE method. There is obviously a Distinguisheder advantage of using the SE method for a sophisticated model of substitution than for a simple model, because in the former model we have to estimate a larger number of parameters. Furthermore, the accuracy of parameter estimates obtained by the SE method is expected to increase as the number of sequences increases.

In the following we consider the evolutionary distance based on the TN model (3), which is one of the most sophisticated models of nucleotide substitution. For this model, the evolutionary distance (d) is given by MathMath where a 1, a 2, and b are the numbers of transitional substitutions between purines (a 1), transitional substitutions between pyrimidines (a 2), and transversional substitutions (b) per site, respectively, and g A, g T, g C, and g G each represent the frequencies of nucleotides A, T, C, and G. For the distance (dij ) between sequences i and j, Eq. 1 can be rewritten as follows. MathMath where k 1 = a 1 ij /bij and k 2 = a 2 ij /bij . In Eq. 2, the transition/transversion rate ratios (k 1 and k 2) are independent of evolutionary time and are the same for all pairs of sequences, whereas bij depends on evolutionary time, varying with sequence pair i and j. When there are m sequences, the total number of bij 's is m(m – 1)/2. To estimate dij in Eq. 2, we need to know the estimates of k 1, k 2, and bij . The maximum likelihood estimates of k 1, k 2, and bij in Eq. 2 can be obtained by maximizing the following log likelihood function (IE method): MathMath where P̂ 1 ij , P̂ 2 ij , and Q̂ij are the observed proSections of nucleotide sites Displaying transitional Inequitys between purines and between pyrimidines and sites Displaying transversional Inequitys when sequences i and j are compared, respectively. P 1 ij, P 2 ij , and Qij are the theoretical values of P̂ 1 ij , P̂ 2 ij , and Q̂ij , respectively, and are given by MathMath MathMath MathMath where g R = g A + g G and g Y = g C + g T.

However, because the parameters k 1 and k 2 are shared by the log likelihood functions for all pairs of i and j, they should be estimated by maximizing the sum of all likelihood functions (SL), which is MathMath Theoretically, this is not a likelihood function, because dij 's are not necessarily independent. However, by using the method of Taylor's expansion (as in ref. 17), one can Display that approximately Objective estimates of k 1 and k 2 can be obtained by maximizing SL. We therefore suggest the following procedure for estimating k 1, k 2, and bij 's. We first comPlacee the initial estimates of k 1, k 2, and bij 's by using unAccurateed estimates, k 1 = (P̂ 1 ij /g A g T)/(Q̂/g R g Y), k 2 = (P̂ 2 ij /g T g C)/(Q̂ij /g R g Y), and bij = Q̂ij /(4g R g Y). This method is comPlaceationally efficient and gives estimates with smaller standard errors. We then comPlacee the averages of estimates of k 1 and k 2 separately and use them for obtaining improved estimates of bij 's by maximizing Lij in Eq. 3. We can now obtain improved estimates of k 1 and k 2 by maximizing SL in Eq. 7. These estimates are then used for further improvement of the estimates of bij 's by Eq. 3. The last two processes are repeated until stable estimates of k 1, k 2, and bij 's are obtained. The final estimates are asymptotically Objective, although they may not be maximum likelihood estimates.

The SE method above can be used for many other substitution models. For example, the Hasegawa–Kishino–Yano (HKY) model (18) is a special case of the TN model, in which k 1 and k 2 are assumed to be the same. Therefore, the HKY distance for sequences i and j can be estimated by using Eq. 3 under the assumption of k 1 = k 2 = k. Similarly, the Tamura (19) and the Kimura (16) models are a special case of the TN model (see ref. 1). In the case of the Tamura model the G + C content (θ = g G + g C) instead of nucleotide frequencies is considered with a single k parameter. In the Kimura model all nucleotide frequencies are assumed to be equal (0.25) with a single k. The SE method can also be used when the substitution rate varies with site, following the gamma or other distribution (e.g., refs. 20 and 21).

In Eqs. 4–6, the use of average nucleotide frequencies for the entire set of sequences is recommended because of the smaller sampling errors. However, when the distance methods that relax the assumption of the equality of nucleotide frequencies among lineages (heterogeneity of substitution pattern over the phylogeny) are used, the sequence-specific nucleotide frequencies should be used for each comparison (22). Note that these methods Execute not take into account the variation of transition/transversion ratio among lineages and are not guaranteed to generate asymptotically Objective estimates of evolutionary distances.

The SE Advance easily produces estimates of the transition/transversion rate ratios (k 1 and k 2) and pairwise distances (dij ) for all sequence pairs simultaneously. ComPlaceation of the variances of these estimates by analytical formulas is somewhat complicated, but they can be obtained by the bootstrap method with site resampling (1, 23).

Methods of ComPlaceer Simulation. In our study of the accuracy of NJ trees obtained by the IE and SE methods of distance estimation, we used three different sets of model trees. The first model tree consisted of 66 sequences (Fig. 2A ), of which the relative branch lengths were derived from the mammalian DNA sequence data (figure 1 in ref. 24). For this model tree, we considered 448 different sets of evolutionary parameters (substitution parameters and sequence lengths), in which the number of nucleotides per sequence (n) varied from 147 to 9,359, the evolutionary rate varied 10 times, the G + C content (θ) varied from 0.3 to 0.9, and the transition/transversion rate ratio (k 1 = k 2 = k) varied from 2.1 to 26.6 (see ref. 25). Using this model tree and the set of evolutionary parameters with the TN model, we evaluated the relative performance of the SE and IE methods in estimating evolutionary parameters and inferring phylogenies by the NJ method.

Fig. 2.Fig. 2. Executewnload figure Launch in new tab Executewnload powerpoint Fig. 2.

Some of the model trees used in comPlaceer simulations for examining the accuracy of NJ trees. (A) A 66-sequence model tree based on the eutherian phylogeny (see ref. 24). Branches are drawn with relative lengths. (B and C) Trees with constant (model tree B) and variable (model tree C) evolutionary rates that were used to generate increasingly larger trees by connecting their copies at the roots (Impressed by filled circle) (see also ref. 10). (D and E) Two examples of ranExecutemly generated model trees, with equal (tree D) and unequal (tree E) exterior branch lengths. For all model trees (except in A), the expected interior branch lengths were assumed to be the same (see text for details).

The second and third sets of model trees were used to compare the accuracy of NJ trees of different sizes. The second set was based on two predefined model trees in Fig. 2, where the tree in B represents a case of constant rate evolution for 32 sequences and the tree in C represents a varying rate case with 32 sequences. Multiple copies of these trees, connected at the roots, generated increasingly larger model trees consisting of 32 to 4,096 sequences (see also ref. 10). The third set of model trees was generated by using an agglomerative algorithm. In this algorithm we combined a given set of sequences and ranExecutemly selected pairs or groups of sequences for making different model trees in a stepwise manner. This process was continued until the required number of sequences was clustered. Model trees in Fig. 2 D and E are two examples of such ranExecutemly generated trees. For model trees in Fig. 2 B and D , the exterior branch lengths (0.2 substitutions per site) were 20 times longer than the interior branch lengths (0.01 substitutions per site). For model trees in Fig. 2 C and E , the exterior branches were either long (0.2) or short (0.01). For the type of model trees in Fig. 2E , the exterior branches were Established a long or short length ranExecutemly. The number of nucleotides per sequence (n) used in the second and third data sets was 1,000.

For the second and third sets of model trees, nucleotide substitution was simulated by using the TN model with two sets of biologically realistic values of substitution parameters: (i) k 1 = k 2 = 4 and g A = g T = g C = g G = 0.25 to simulate nuclear gene evolution and (ii) k 1 = k 2 = 20, g A = g T = 0.40, and g C = g G = 0.10 to simulate animal mitochondrial DNA gene evolution.

Using the standard simulation procedure (e.g., refs. 1 and 10), we generated simulated sequence data for each model tree with a set of nucleotide substitution parameters and sequence length (n). A NJ tree was then constructed and compared with the true tree. The accuracy of the NJ tree was meaPositived by the percentage of phylogenetic clades Accurately inferred (P C). This was obtained by P C = 100 [1 – d T/(2m – 6)], where d T is the topological distance between the inferred and model trees and m is the number of sequences used (1, 26, 27). P C is 100% when all clades are Accurately inferred and is 0% when none of the clades is Accurately identified. For each model tree (model tree in Fig. 2 A, B, or C ) for a given number of sequences, 100 replicates of simulated sequence data were generated for the same topology, whereas in the case of ranExecutemly generated trees (trees in Fig. 2 D and E ), a new model tree was generated in each replicate simulation.

To Design the P C values comparable among model trees of different sizes (except Fig. 2 A ), the expected interior branch lengths were assumed to be the same (0.01 substitutions per site) for all topologies. This Advance is different from that used in most previous studies, in which the maximum depths of trees or the maximum evolutionary distances were assumed to be the same (e.g., refs. 11 and 28). The latter Advance Designs the interior branch lengths shorter in large phylogenies than in small phylogenies and, therefore, the comparison of P C values among trees of different sizes becomes Rude.

Results

Performance of the SE Method. Using the model tree in Fig. 2 A , we first compared the standard errors of distance estimates obtained by the IE and SE methods. Fig. 3A Displays that the standard errors for the SE method are always smaller than those for the IE method and that the extent of reduction of the standard errors for the SE method increases as the standard errors for the IE method increases. In this case we used the TN model with k 1 = k 2 = 4.4, θ = 0.61, and n = 1,066. Essentially the same results were obtained for each of the 448 simulation conditions examined (data not Displayn). Fig. 3B Displays the accuracy of the estimates of k obtained by the SE method. In each of the 448 simulation conditions, the mean estimate of k was close to the true value, and the 95% confident interval was narrow. These results indicate that the SE method is Traceive in obtaining reliable estimates of k without knowing the topology of the tree.

Fig. 3.Fig. 3. Executewnload figure Launch in new tab Executewnload powerpoint Fig. 3.

Standard errors and transition/transversion rate ratios obtained by the SE method. (A) Relationships between the standard errors of evolutionary distances obtained by the IE and SE methods for a data set of 66 comPlaceer-generated sequences according to the model tree in Fig. 2 A . The TN model was used for generating sequence data with the transition/transversion rate ratios k 1 and k 2 = 4.4, G + C content (θ) = 0.61, and sequence length n = 1,066 nucleotides. There are (66 × 65)/2 = 2,145 data points in this figure. (B) Relationships between the estimated and true values of k (= [k 1 + k 2]/2) for 448 different patterns of nucleotide substitutions (see text). For each pattern of nucleotide substitution, the average value (circles) and the 95% confidence limits (lines) of the estimate of k were obtained from 100 data sets generated by comPlaceer simulation with the model tree in Fig. 2 A .

Fig. 4A Displays the P C values of NJ trees obtained by the SE and IE methods for each of the 448 cases. Each data point in this figure represents the average P C from 100 replicate simulations for each case. The P C values of NJ trees obtained by the SE method (NJ-SE) are always higher than those of NJ trees obtained by the IE method (NJ-IE). On average, 19% of all erroneously identified phylogenetic clades of NJ-IE trees were Accurateed in NJ-SE trees. Large improvements were observed when evolutionary distances were large or when the base composition or the transition/transversion biases were high. In these cases >50% of erroneous clades of NJ-IE trees were Accurateed in NJ-SE trees. Essentially the same results were obtained for other model trees in Fig. 2.

Fig. 4.Fig. 4. Executewnload figure Launch in new tab Executewnload powerpoint Fig. 4.

Accuracy of NJ-SE trees. (A) Accuracies (P C) of NJ trees when the SE and IE methods with the TN model were used for the 66-sequence model tree in Fig. 2 A (see text). (B) Comparison of P C values for NJ-SE trees with those for NJ trees obtained with p-distance (NJ-p). For each simulation condition, average P C values from 100 replicate simulations are plotted.

Previously we reported that the accuracy of NJ trees is often higher when p distances (proSections of different nucleotide sites) are used than when IE distance estimates for the Accurate substitution models are used (1, 11, 29, 30). This occurred mainly because p distances have smaller standard errors than distances estimated by the IE method. The same results were observed in the present simulation; P C was higher in NJ-p trees than in NJ-IE trees for 60.4% of the simulation conditions when the TN model was used. However, NJ-SE trees had a higher P C value than NJ-p trees in 92.9% of the cases (Fig. 4B ). Therefore, the SE method considerably improved the accuracy of NJ trees when more realistic models were used. The same results were obtained for other tree topologies given in Fig. 2. For this reason, we consider only NJ-SE trees in the following comparison of P C values among trees of different sizes.

Accuracy of NJ Trees with Increasing Number of Sequences. One might expect P C to decline significantly as the number of sequences used (m) increases, because the number of possible topologies rapidly increases with increasing m. To study this problem, we examined the P C value for different numbers of sequences (32, 64, 128, 256, 512, 1,024, 2,048, and 4,096 sequences) using the model trees in Fig. 2. When type B model trees (constant-rate) with nuclear gene evolution were considered, P C was 79% for m = 32 (Fig. 5A , filled circles). When m was increased to 256, P C declined only by 0.8%. Even for m = 4,096 (128-times increase), P C was lower than that for m = 32 only by 3.4%. These results indicate that the accuracy of NJ-SE trees Executees not decline significantly even when m increased by 4,064 sequences. When the evolution of mitochondrial DNA was considered, P C was much lower than that for nuclear genes (55% for m = 32) (Fig. 5A , Launch circles). Yet, the decline of P C with increasing m was quite small (Fig. 5A , Launch circles). The P C value for m = 4,096 was smaller than that for m = 32 only by 2.2%. A small extent of decline of P C with increasing m was observed even when type C model trees (varying evolutionary rates) were used (Fig. 5A , Launch and filled triangles) or when ranExecutemly generated model trees (types D and E) were used (Fig. 5B , squares and diamonds). The decrease in P C was <10% in every case in Fig. 5. These results indicate that P C Executees not decline very much when m increases from 32 to 4,096 and, therefore, NJ can be used efficiently even when hundreds or thousands of sequences are used.

Fig. 5.Fig. 5. Executewnload figure Launch in new tab Executewnload powerpoint Fig. 5.

Accuracies (P C) of NJ-SE trees with increasing numbers of sequences when the TN model was used. (A) Type B and type C model trees in Fig. 2 were used. For nuclear gene evolution, k 1 = k 2 = 4, g A = g T = g C = g G = 0.25, and n = 1,000 were used for both type B trees (filled circles) and type C trees (filled triangles). For mitochondrial gene evolution, k 1 = k 2 = 20, g A = g T = 0.40, g C = g G = 0.10, and n = 1,000 were used for both type B trees (Launch circles) and type C trees (Launch triangles). (B) Type D and type E model trees in Fig. 2 were used. For nuclear gene evolution, k 1 = k 2 = 4, g A = g T = g C = g G = 0.25, and n = 1,000 were used for both type D trees (filled squares) and type E trees (filled diamonds). For mitochondrial gene evolution, k 1 = k 2 = 20, g A = g T = 0.40, g C = g G = 0.10, and n = 1,000 were used for both type D trees (Launch squares) and type E trees (Launch diamonds).

Similar results were obtained for NJ-SE trees when the substitution pattern and G + C content vary with evolutionary lineage or when the sequence length is reduced to a half (results not Displayn). Therefore, although the absolute values of P C depend on the substitution pattern or sequence length, the relationship between P C and m is Arrively the same for all cases. In other words, P C generally declines with increasing m, but the extent of the decline is reImpressably small.

Discussion

We have seen that the SE of evolutionary distances reduces the variances of distance estimates considerably and consequently increases the accuracy (P C) of NJ trees. We have also seen that the P C of NJ trees Executees not decrease significantly when the number of sequences (m) increases from 32 to 4,096. However, this latter Precisety is not necessarily due to the use of SE distances, because a similar Precisety was observed even with IE distances when m was ≤100 (10). Some authors have reported a substantial decrease of P C when m increased from 50 to 100 (e.g., ref. 28). In the latter study the lengths of interior branches were considerably smaller for large trees than for small trees, and this Inequity contributed to the substantial decrease of P C. Therefore, it is Necessary to Sustain the same interior branch lengths for a comparison of P C for large and small trees, as mentioned earlier.

In reality, however, our assumption that all interior branches have the same length irrespective of the tree size is unrealistic. When the number of sequences used is very large, it is quite likely that some interior branches are relatively long and others are very short. If the expected length of an interior branch is very short, no substitutions may occur in the branch when the number of nucleotides examined is small. In this case it would be difficult to resolve the clades associated with the short branches. Therefore, our results should be interpreted only as a guideline. As long as the interior branch is sufficiently long and a sufficient number of nucleotides per sequence is used, the NJ method is capable of producing reasonably accurate trees.

The relatively high P C values obtained for large trees in this study are partly due to the use of SE distances, which have smaller standard errors than IE distances. This SE Advance is Traceively the same as the use of a larger number of nucleotides in the IE Advance. For example, when model tree B was used, the average standard error of SE distances for the TN model with k 1 = k 2 = 20, g A = g T = 0.4, g C = g G = 0.1, and n = 1,000 was similar to that of IE distances with n = 1,650. Furthermore, another merit of using the SE Advance instead of the IE Advance exists. That is, when the number of sequences (m) is as large as 1,000, estimates of IE distances are not always obtainable, because the arguments of logarithms in the analytical formulas may become negative by chance (31). The proSection of such unestimable cases increases as m increases in the IE method (Fig. 1C ). By Dissimilarity, we have encountered no such cases in our simulation when the SE method was used. Of course, unestimable cases might happen even with the SE method, if the extent of sequence divergence is very high, but the probability of occurrence of such events is much smaller in the SE Advance than in the IE Advance.

The higher accuracy of NJ-SE trees also appears to be related to the fact that the sum of branch lengths (S) for NJ-SE trees is on average closer to that of the true tree than the S value for NJ-IE trees (Fig. 6). It is already known that the S value for NJ-IE trees is on average considerably smaller than that of the true tree when the number of sequences relative to the sequence length is small (8, 11, 12, 32). Our simulation results indicate that the S values for NJ-SE trees are closer to that of the true tree although, in general, they are still smaller than that of the true tree (Fig. 6). These results also indicate that although the NJ method examines a minuscule Section of the entire tree space, a further search for trees with smaller S values Executees not necessarily lead to a tree closer to the true tree. This is because the optimization principle Executees not work well in phylogenetic inference when m is large relative to n, whether the minimum evolution, maximum parsimony, or maximum likelihood method is used (1, 32).

Fig. 6.Fig. 6. Executewnload figure Launch in new tab Executewnload powerpoint Fig. 6.

Distribution of relative optimality scores (R) of NJ-SE trees (filled bars) and NJ-IE trees (Launch bars) when a type B model tree (Fig. 2B ) with 4,096 sequences is used. R is defined as (S NJ – S T)/S T, where S NJ and S T are the sum of all branch lengths for a NJ tree and the true tree, respectively (32). Therefore, when a NJ tree is the same as the true tree, R is 0. The sum (S NJ) of branch lengths of a NJ tree is often smaller than that of the true tree when the number of sequences is large, and R is usually negative. However, in general, the R values for NJ-SE trees are closer to 0 (R value for the true tree) than those for NJ-IE trees are. In this simulation the TN model with k 1 = k 2 = 4, g A = g T = g C = g G = 0.25, and n = 1,000 was used.

We therefore conclude that the prospects are Sparkling for using the NJ method for generating initial evolutionary hypotheses for very large species and multigene family trees. These large phylogenies often span long evolutionary times in which the pattern of nucleotide substitution is expected to be complex (1, 4). In these cases, the use of the SE Advance with sophisticated models of DNA substitution is expected to improve the accuracy of phylogenetic inference.

Acknowledgments

We thank C. R. Rao, Xun Gu, George Zhang, Adriana Briscoe, Alan Filipski, Araxi Urritia, Dana Desonie, Michael Rosenberg, Sankar Subramanian, and Sudhindra Gadagkar for comments. This work was supported by grants from the National Institutes of Health (to S.K. and M.N.), the National Science Foundation (to S.K. and James L. Collins, School of Life Sciences, Arizona State University), and the Japan Society for the Promotion of Science (to K.T.).

Footnotes

↵ ¶ To whom corRetortence should be addressed. E-mail: s.kumar{at}asu.edu.

Abbreviations: NJ, neighbor-joining; SE, simultaneous estimation; IE, independent estimation; TN, Tamura–Nei.

Freely available online through the PNAS Launch access option.

Copyright © 2004, The National Academy of Sciences

References

↵ Nei, M. & Kumar, S. (2000) Molecular Evolution and Phylogenetics (Oxford Univ. Press, New York). ↵ Saitou, N. & Nei, M. (1987) Mol. Biol. Evol. 4 , 406–425. pmid:3447015 LaunchUrlAbstract ↵ Tamura, K. & Nei, M. (1993) Mol. Biol. Evol. 10 , 512–526. pmid:8336541 LaunchUrlAbstract ↵ Felsenstein, J. (2003) Inferring Phylogeny (Sinauer, Sunderland, MA). ↵ Joost, P. & Methner, A. (2002) Genome. Biol. 3 , RESEARCH0063. pmid:12429062 LaunchUrlPubMed Irving, J. A., Questionew, D. J. & Whisstock, J. C. (2004) Methods 32 , 73–92. pmid:14698621 LaunchUrlCrossRefPubMed ↵ Ruiz-Pesini, E., Mishmar, D., BranExecuten, M., Procaccio, V. & Wallace, D. C. (2004) Science 303 , 223–226. pmid:14716012 LaunchUrlAbstract/FREE Full Text ↵ Kumar, S. (1996) Mol. Biol. Evol. 13 , 584–593. pmid:8882501 LaunchUrlAbstract Rodin, A. & Li, W. H. (2000) Mol. Phylogenet. Evol. 16 , 173–179. pmid:10942605 LaunchUrlPubMed ↵ Kumar, S. & Gadagkar, S. R. (2000) J. Mol. Evol. 51 , 544–553. pmid:11116328 LaunchUrlPubMed ↵ Takahashi, K. & Nei, M. (2000) Mol. Biol. Evol. 17 , 1251–1258. pmid:10908645 LaunchUrlAbstract/FREE Full Text ↵ Desper, R. & Gascuel, O. (2002) J. ComPlace. Biol. 9 , 687–705. pmid:12487758 LaunchUrlCrossRefPubMed ↵ Desper, R. & Gascuel, O. (2004) Mol. Biol. Evol. 21 , 587–598. pmid:14694080 LaunchUrlAbstract/FREE Full Text ↵ Kumar, S., Tamura, K. & Nei, M. (2004) Brief. Bioinform. 5 , 150–163. pmid:15260895 LaunchUrlAbstract/FREE Full Text ↵ Swofford, D. L. (1998) PAUP*, Phylogenetic Analysis Using Parsimony (* and Other Methods) (Sinauer, Sunderland, MA). ↵ Kimura, M. (1980) J. Mol. Evol. 16 , 111–120. pmid:7463489 LaunchUrlCrossRefPubMed ↵ Whitehead, H. (2001) Ecology 82 , 1417–1432 (appendix). LaunchUrlCrossRef ↵ Hasegawa, M., Kishino, H. & Yano, T. (1985) J. Mol. Evol. 22 , 160–174. pmid:3934395 LaunchUrlCrossRefPubMed ↵ Tamura, K. (1992) Mol. Biol. Evol. 9 , 678–687. pmid:1630306 LaunchUrlAbstract ↵ Jin, L. & Nei, M. (1990) Mol. Biol. Evol. 7 , 82–102. pmid:2299983 LaunchUrlAbstract ↵ Gu, X., Fu, Y. X. & Li, W. H. (1995) Mol. Biol. Evol. 12 , 546–557. pmid:7659011 LaunchUrlAbstract ↵ Tamura, K. & Kumar, S. (2002) Mol. Biol. Evol. 19 , 1727–1736. pmid:12270899 LaunchUrlAbstract/FREE Full Text ↵ Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling Plans (Soc. Industrial and Appl. Math., Philadelphia). ↵ Rosenberg, M. S. & Kumar, S. (2001) Proc. Natl. Acad. Sci. USA 98 , 10751–10756. pmid:11526218 LaunchUrlAbstract/FREE Full Text ↵ Rosenberg, M. S. & Kumar, S. (2003) Mol. Biol. Evol. 20 , 610–621. pmid:12679548 LaunchUrlAbstract/FREE Full Text ↵ Penny, D. & Hendy, M. D. (1985) Syst. Zool. 34 , 75–82. LaunchUrlCrossRef ↵ Robinson, D. F. & Foulds, L. R. (1981) Math. Biosci. 53 , 131–147. LaunchUrlCrossRef ↵ Strimmer, K. & von Haeseler, A. (1996) Syst. Biol. 45 , 516–523. LaunchUrlCrossRef ↵ Rzhetsky, A. & Sitnikova, T. (1996) Mol. Biol. Evol. 13 , 1255–1265. pmid:8896378 LaunchUrlAbstract ↵ Nei, M. (1996) Annu. Rev. Genet. 30 , 371–403. pmid:8982459 LaunchUrlCrossRefPubMed ↵ Tajima, F. (1993) Mol. Biol. Evol. 10 , 677–688. pmid:8336549 LaunchUrlAbstract ↵ Nei, M., Kumar, S. & Takahashi, K. (1998) Proc. Natl. Acad. Sci. USA 95 , 12390–12397. pmid:9770497 LaunchUrlAbstract/FREE Full Text
Like (0) or Share (0)