Local feature frequency profile: A method to measure str


Contributed by Sung-Hou Kim, December 24, 2003



Measures of structural similarity between known protein structures provide an objective basis for classifying protein folds and for revealing a global view of the protein structure universe. Here, we describe a rapid method to measure structural similarity based on the profiles of representative local features of Cα distance matrices of compared protein structures. We first extract a finite number of representative local feature (LF) patterns from the distance matrices of all protein fold families by medoid analysis. Then, each Cα distance matrix of a protein structure is encoded by labeling all its submatrices by the index of the nearest representative LF pattern. Finally, the structure is represented by the frequency distribution of these indices, which we call the LF frequency (LFF) profile of the protein. The LFF profile allows one to calculate structural similarity scores among a large number of protein structures quickly, and also to construct and update the “map” of the protein structure universe easily. The LFF profile method efficiently maps complex protein structures into a common Euclidean space without prior assignment of secondary structure information or structural alignment.

protein structural similarity | protein distance matrix | local protein structural features profile | protein fold | protein fold space

Recent advances in experimental techniques and automation in molecular and structural biology have led to a rapid increase in the number of determined protein structures. The number of structures deposited in the Protein Data Bank (PDB) (1) is now >20,000, and the contents are growing rapidly. Furthermore, the ongoing structural genomics projects, which aim to determine representative structures in protein fold space, have begun to produce, in a high-throughput way, a large number of structures (2), including many structures of proteins encoded by genes of unknown function, the “hypothetical” proteins. Over half of all of the proteins of sequenced genomes have no inferable molecular (biochemical and biophysical) functions. Just as sequence similarity implies functional similarity, structural similarity also implies similarity in molecular function: if a hypothetical protein has a structure similar to one or more protein structures of known function, the structural similarity provides a powerful clue to the molecular function of the hypothetical protein (3).

Measures of structural similarity, assessed computationally or visually, between pairs of proteins are also the foundation for classifying protein structures. Many systems have been proposed for structural classification, such as structural classification of proteins (SCOP) (4), class architecture topology homology (CATH) (5), families of structurally similar proteins (FSSP) (6), and others. Measuring structural fold similarity is usually done by structural alignment algorithms such as dali (7), ce (8), vast (9), ssap (10), and others. Most of these methods are computationally intensive and time-consuming, especially when searching large databases, because of the intrinsic complexity of structural alignment. To shorten computational time, several methods have been developed that do not depend on structural alignment, such as methods based on graph theory (11), secondary structure matching (www.ebi.ac.uk/msd-srv/ssm/ssmstart.html), and Cα-Cα distances (12).


In developing our method for quickly assessing structural similarity, we start with the distance matrix representation of protein structure. The distance matrix of a protein structure is a square matrix consisting of the distances between all pairs of Cα atoms in the protein. It not only represents the overall 3D folding of polypeptide chains in two dimensions, but also provides a simple description of information about secondary structure and tertiary interactions between parts spatially distant in the structure. Furthermore, the matrix contains sufficient information to reproduce the original 3D backbone structure by using the distance geometry method (13, 14). Because of its rich information content, the distance matrix has been exploited in diverse studies such as domain recognition (15), structure alignment (dali) (7), protein folding studies (contact energy function) (16, 17), and protein database searching (18).

We subdivide the distance matrix of each protein structure into many overlapping submatrices, each describing a local feature (secondary and/or tertiary feature). We use a collection of these submatrices from a large number of distance matrices to extract a set of K representative local features (medoid submatrices) from K clusters of submatrices by medoid analysis (19). Then, any given protein structure can be represented by a profile, a vector of a common length K, containing the frequencies of occurrence of these representative local features (medoid submatrices) in the structure. Thus, we can now treat protein structures as points in K-dimensional Euclidean space ($\mathbb{R}^K$). After converting each protein structure into a local feature frequency (LFF) profile, the fold similarity between a pair of proteins can be computed very easily as the Euclidean distance or cosine distance between the two corresponding LFF profile vectors. This enables quick computation of an all-against-all structural similarity matrix of a very large set of proteins, which, in turn, can be used for objectively clustering protein structures of similar fold, for constructing a map of the “protein structure universe,” and for exploring protein fold space.

Nonredundant Protein Structure Set. The test of the method was implemented on a representative SCOP fold set from the SCOP database release 1.61 (November 2002). The PDB-style files for the SCOP nonredundant set (a sub-SCOP set filtered at 40% sequence identity) were downloaded from the ASTRAL compendium database (20), and LFF profiles were computed for all 3,792 structural domains in this set, which includes the all-α, all-β, α/β, and α+β classes of proteins.

Representative Local Feature Patterns in Distance Matrix. In this test, 100 proteins randomly selected from the 3,792 in the nonredundant SCOP fold set were indexed by $p = 1, \dots, P$ ($P = 100$). When there are $n_p$ residues in protein $p$, its distance matrix is the matrix $D_p = \{d_p(i, j) : i, j = 1, \dots, n_p\}$, where $d_p(i, j)$ is the Cα-Cα distance (in Å) between residues $i$ and $j$. The overlapping submatrices representing local features involving $m$ residues by $m$ residues in the protein form the collection

$$\delta_p(m) = \{ D_p^{ij}(m) \}$$

of $m \times m$ submatrices described by

$$D_p^{ij}(m) = \{ d_p(i + k, j + l) : k, l = 0, \dots, m - 1 \}.$$

To emphasize the importance of the close contacts in the protein structures, all Cα-Cα distances ≥20 Å are set to 20 Å. The collection of these submatrices over the $P$ proteins is denoted $\delta(m)$. The submatrices are grouped into K clusters, and each cluster is represented by a medoid in the space $\delta(m)$ metrized by the Euclidean distance, by using the partitioning around medoids (PAM) analysis of Kaufman and Rousseeuw (19). Algorithmically, the PAM procedure searches for K representative objects, or medoids, among the observations and then constructs K clusters by assigning each observation to the nearest medoid. PAM can be applied to general data types and tends to be more robust than the k-means algorithm (19). In this study, we use K = 100 and m = 10 (see Fig. 3). Thus, we use the 100 m × m medoid submatrices as the reference to which all m × m submatrices from all protein distance matrices are compared.
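The submatrix extraction described above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the function names `distance_matrix` and `extract_submatrices` are hypothetical, every window position (i, j) is enumerated (the symmetric half of the matrix could equally be skipped), and each submatrix is flattened to a vector so that Euclidean distances between submatrices are easy to compute.

```python
import numpy as np

def distance_matrix(ca_coords):
    """Pairwise C-alpha distances (in angstroms) for an (n, 3) coordinate array."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def extract_submatrices(dist, m=10, cap=20.0):
    """Return all overlapping m x m submatrices as rows of length m*m.

    Distances >= cap are set to cap, as in the text. Enumerating every
    window position (i, j) is an assumption of this sketch.
    """
    capped = np.minimum(dist, cap)
    n = capped.shape[0]
    if n < m:
        return np.empty((0, m * m))
    # windows has shape (n-m+1, n-m+1, m, m)
    windows = np.lib.stride_tricks.sliding_window_view(capped, (m, m))
    return windows.reshape(-1, m * m)

# Toy example: 12 residues on a straight line, 3.8 angstroms apart.
coords = np.column_stack([np.arange(12) * 3.8, np.zeros(12), np.zeros(12)])
dist = distance_matrix(coords)
subs = extract_submatrices(dist, m=10, cap=20.0)
print(subs.shape)  # one row per window position
```

For a protein of n residues this yields (n - m + 1)² rows, so collecting submatrices over many proteins grows quickly; that is why a clustering step is needed to reduce them to K representatives.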

Fig. 3. One hundred medoid submatrices obtained from partitioning around medoids (PAM) analysis of distance matrices of 100 sampled proteins. They reflect 100 representative local structural features. Various combinations of these features can reconstruct the original distance matrices of all 100 proteins. The medoid submatrices are indexed arbitrarily (from 1 to 100).

Generation of the LFF Profile and Calculation of Similarity/Dissimilarity Scores. To express the distance matrix of a protein $p$ in terms of the representative local feature patterns (medoid submatrices), each of its $m \times m$ submatrices is labeled by the index of the nearest medoid submatrix. Again, the space of submatrices is metrized by the Euclidean distance. Then, the frequency of the medoid submatrices assigned to label $k$, $n_{pk}$, is counted. The count vector $n_p = [n_{p1} \dots n_{pK}]$ summarizes the frequency distribution of local feature patterns of the protein. We call this decoding process profiling of the protein structure by LFF and the final feature vector $n_p$, or its transformation $A_p$, the structural profile or simply the profile of protein $p$. Here, we normalize the frequency $n_{pk}$ of local interaction pattern $k$ in protein $p$ by the spread of the counts for that pattern (see Fig. 4) to obtain $A_{pk}$,

and use $A_p = [A_{p1} \dots A_{pK}] \in \mathbb{R}^K$ as the profile of protein $p$. The collection of profiles, or the protein-by-pattern matrix

$$A = \begin{bmatrix} A_1 \\ \vdots \\ A_P \end{bmatrix} \in \mathbb{R}^{P \times K},$$

is our raw data matrix for computing similarity. As a measure of structural similarity between two proteins $p$ and $q$ with profiles $A_p$ and $A_q$ in $\mathbb{R}^K$, we use their cosine

$$\cos(A_p, A_q) = \frac{A_p \cdot A_q}{\lVert A_p \rVert \, \lVert A_q \rVert}.$$

It is also called the normalized inner product, because the cosine is simply the dot product if the vectors are normalized to unit length. The cosine distance is defined as $1 - \cos(A_p, A_q)$ and is used to represent structural dissimilarity or structural distance. Note that the cosine distance ranges from 0 (closest) to 1 (farthest).
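As an illustration of the cosine similarity and cosine distance defined above, the following NumPy sketch computes an all-against-all distance matrix from a protein-by-pattern matrix. The function names are hypothetical, and the sketch assumes that no profile is the zero vector.

```python
import numpy as np

def cosine_similarity(a, b):
    """Normalized inner product of two profile vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance_matrix(profiles):
    """All-against-all cosine distances for a (P, K) profile matrix.

    Assumes every row has nonzero length.
    """
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    unit = profiles / norms
    sim = unit @ unit.T
    # Clip guards against tiny floating-point overshoot beyond [-1, 1].
    return 1.0 - np.clip(sim, -1.0, 1.0)

profiles = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
print(cosine_distance_matrix(profiles))
```

Because the whole computation reduces to one matrix product of unit-length rows, an all-against-all comparison of several thousand profiles of length K = 100 is inexpensive.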

Singular Value Decomposition (SVD) and Biplots of Protein-by-Pattern Matrix. SVD is used for deriving a set of uncorrelated indexing variables or factors, whereby each pattern and protein is represented as a vector in $\mathbb{R}^K$ using elements of the left and right singular vectors. For a $P \times K$ matrix $A$, with $P \ge K$ and $\mathrm{rank}(A) = r$, the SVD of $A$ is defined as $A = U \Sigma V^T$, where $U^T U = V^T V = I_K$ (the $K \times K$ identity matrix) and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_K)$, $\sigma_1 \ge \dots \ge \sigma_r > 0 = \sigma_{r+1} = \dots = \sigma_K$. The columns $u_i$ and $v_i$ of $U$ and $V$, respectively, are referred to as left and right singular vectors. The matrices $U$, $V$, and $\Sigma$ reflect a breakdown of the original relationships into linearly independent vectors or factor values. The use of the $\kappa$ factors with the largest singular values is equivalent to approximating the original protein-by-pattern matrix by

$$A_\kappa = \sum_{i=1}^{\kappa} \sigma_i u_i v_i^T.$$

We compute the truncated SVD with $\kappa = 3$ to obtain the rank-three approximation $A_3$ of the protein-by-pattern matrix, because the first three $\sigma$ values are significantly greater than the rest. We can represent proteins and patterns in the same $\mathbb{R}^3$ space by their first three principal coordinates, obtained from the leading three left and right singular vectors.

In this paper, the (1st, 2nd) and (2nd, 3rd) principal coordinate pairs are plotted as biplots (21) in $\mathbb{R}^2$.
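A minimal sketch of the truncated SVD step, using `numpy.linalg.svd`, is given below. The function names are hypothetical. The scaling used in `principal_coordinates` (singular vectors multiplied by singular values) is an assumption of this sketch; it is one common biplot convention and is not necessarily the scaling used to draw the figures.

```python
import numpy as np

def truncated_svd(A, kappa=3):
    """Return the leading kappa factors of A and the rank-kappa approximation."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :kappa], s[:kappa], Vt[:kappa, :]
    A_k = (U_k * s_k) @ Vt_k  # sum of sigma_i * u_i * v_i^T for i <= kappa
    return U_k, s_k, Vt_k, A_k

def principal_coordinates(U_k, s_k, Vt_k):
    """Coordinates for rows (proteins) and columns (patterns).

    Assumed scaling: singular vectors times singular values.
    """
    return U_k * s_k, Vt_k.T * s_k

rng = np.random.default_rng(0)
A = rng.random((50, 10))  # stand-in for a protein-by-pattern matrix
U3, s3, Vt3, A3 = truncated_svd(A, kappa=3)
protein_xyz, pattern_xyz = principal_coordinates(U3, s3, Vt3)
print(protein_xyz.shape, pattern_xyz.shape)
```

Plotting the first and second columns (or the second and third) of both coordinate arrays on the same axes gives a biplot of proteins and patterns.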


One Hundred Representative Local Features of Protein Structures. In the distance matrix of a protein structure, many local structural features can be recognized as various contact patterns in submatrices. Secondary structure elements such as α helices and β sheets are visually identifiable as specific local features in the matrix, as thick line patterns and thin line patterns on and off the diagonal areas, respectively, and the tertiary interactions between them appear as patches of contacts in off-diagonal areas of the matrix. Among β strands, parallel β strands appear as thin line patterns parallel to the main diagonal, and antiparallel β strands appear as thin line patterns perpendicular to the main diagonal. Other tertiary features, like α-β interactions and coils, also emerge as specific patterns in the distance matrix (Fig. 1).

Fig. 1. Representation of protein structures by their distance matrices and representative local structure feature patterns (medoid submatrices). The procedure is illustrated by using 3D protein structures, distance matrices, and 50 representative patterns (medoids) of four proteins sampled one each from the all-α, all-β, α/β, and α+β classes. Among the patterns, “null feature patterns” (with no Cα-Cα distance <20 Å, light pink background only) are the most abundant in all proteins.

There are millions of different local feature patterns (submatrices) in all protein structures. However, we expect that most of these are common to many protein structures, and that the majority of the local feature patterns are null patterns without any contact within a threshold of 20 Å (i.e., all submatrix elements have a Cα-Cα distance >20 Å). Thus, we expect that a finite number, K, of representative local features (K medoid submatrices) will adequately represent all observed local features in all proteins. Then, all local feature patterns can be labeled according to the index (from 1 to K) of the closest medoid, where “closeness” can be defined in terms of Euclidean distance or other distance metrics.

To determine the optimum submatrix size (m) and number of medoids (K), 100 protein structures were randomly chosen from the SCOP representative folds. The lengths of the proteins in the set range from 29 to 595 residues, with an average of 165. We then varied K from 10 to 300 and m from 8 to 16 while doing the medoid analysis. After replacing all observed submatrices by the representative medoid submatrices, the reconstructed distance map was calculated by averaging the overlapping medoid submatrices. The distance matrix error (DME), which is the root-mean-square difference between the original distance map and the reconstructed one, is plotted in Fig. 2. Based on this test, K and m were set to 100 and 10, respectively.
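The reconstruction and error measurement described above can be sketched as follows. The function names are hypothetical, the `labels` array is assumed to hold one medoid index per window position (i, j), and the DME here averages the squared differences over all N × N matrix elements, which is an assumption about the exact normalization.

```python
import numpy as np

def reconstruct_distance_matrix(n, m, labels, medoids):
    """Average overlapping medoid submatrices into an n x n distance map.

    labels:  (n-m+1, n-m+1) integer array, medoid index for each window.
    medoids: (K, m, m) array of medoid submatrices.
    """
    total = np.zeros((n, n))
    count = np.zeros((n, n))
    w = n - m + 1
    for i in range(w):
        for j in range(w):
            total[i:i + m, j:j + m] += medoids[labels[i, j]]
            count[i:i + m, j:j + m] += 1
    # Every cell is covered by at least one window when n >= m.
    return total / count

def distance_matrix_error(original, reconstructed):
    """Root-mean-square difference over all matrix elements (assumed form)."""
    return float(np.sqrt(np.mean((original - reconstructed) ** 2)))

# Toy example: two 2 x 2 "medoids" and a 4-residue protein.
medoids = np.stack([np.zeros((2, 2)), np.ones((2, 2))])
labels = np.ones((3, 3), dtype=int)
recon = reconstruct_distance_matrix(4, 2, labels, medoids)
print(distance_matrix_error(np.zeros((4, 4)), recon))
```

Repeating this calculation over a grid of K and m values, and averaging the DME over the sampled proteins, gives the kind of curve used to choose the parameters.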

Fig. 2. Optimization of K, the number of representative local feature patterns (medoid submatrices), and m, the size of the submatrix. The distance matrices of 100 chosen protein structures were reconstructed by using the K closest representative medoid submatrices of size m. There is no significant error reduction beyond K = 100 medoids, and the best reconstruction condition is at m = 10 and K = 100. The dissimilarity between the original distance matrix and the reconstructed distance matrix was measured by the distance matrix error, the root-mean-square difference between $d_p(i, j)$ and $d'_p(i, j)$, which are the Cα-Cα distances (in Å) between residues $i$ and $j$ in the original and reconstructed distance matrix, respectively. $N$ is the number of residues in the protein.

For the submatrix size m = 10, ≈1.6 × 10^6 different local patterns (submatrices) were retrieved from the training set of 100 protein structures. One hundred representative local feature patterns were identified as 100 “medoids.” They can be considered as the centers of 100 clusters of the 1.6 × 10^6 input patterns. Then, all input patterns can be labeled from 1 to 100 according to the index of the closest medoid, where closeness is defined in terms of Euclidean distance in this study. Fig. 3 shows the 100 representative patterns found by this medoid analysis.
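To make the notion of a medoid concrete, the sketch below clusters a small set of vectors by a simple alternating k-medoids iteration. This is not the PAM swap procedure of Kaufman and Rousseeuw used in this study; it is a simpler stand-in chosen for brevity, and the function name `k_medoids` is hypothetical. It builds the full pairwise distance matrix, so it is only suitable for small inputs.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Alternating k-medoids on the rows of X with Euclidean distance.

    Returns (medoid_indices, labels). A simplified stand-in for PAM.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each observation goes to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        # Update step: the member with the smallest total distance to
        # the other members becomes the cluster's medoid.
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

X = np.array([[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]])
medoid_idx, labels = k_medoids(X, k=2)
print(X[medoid_idx].ravel(), labels)
```

Unlike a k-means centroid, each medoid is an actual observation, so in the setting of this paper every representative pattern is a real submatrix taken from a real protein.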

LFF Profile as a Representation of Protein Structure. The method we use here is analogous to that used in text information retrieval (22), in which each document is represented as a vector of word counts. In our approach, each protein structure is considered as a document, consisting of many words (different medoid submatrices representing different local features). A protein structure as represented by its distance matrix is treated as a collection of overlapping submatrices (local feature patterns), and each of them is labeled by the index of the closest medoid submatrix. Thus, a protein structure can be represented by the profile of the frequency distribution of the medoid pattern indices. We call this the LFF profile, or simply the profile of the protein.
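The word-count analogy can be sketched directly: each submatrix is a "word" that is mapped to its nearest medoid, and the profile is the histogram of the resulting indices. In this sketch the function name `lff_counts` is hypothetical, submatrices and medoids are assumed to be flattened to rows of equal length, and the result is the raw count vector before any normalization.

```python
import numpy as np

def lff_counts(submatrices, medoids):
    """Raw LFF count vector for one protein.

    submatrices: (N, d) array, one flattened submatrix per row.
    medoids:     (K, d) array, one flattened medoid per row.
    Returns an integer vector of length K.
    """
    # Squared Euclidean distances via |x|^2 - 2 x.m + |m|^2, which avoids
    # building an (N, K, d) intermediate array.
    d2 = ((submatrices ** 2).sum(axis=1)[:, None]
          - 2.0 * submatrices @ medoids.T
          + (medoids ** 2).sum(axis=1)[None, :])
    labels = np.argmin(d2, axis=1)
    return np.bincount(labels, minlength=medoids.shape[0])

medoids = np.array([[0.0, 0.0], [10.0, 10.0]])
subs = np.array([[0.0, 1.0], [9.0, 9.0], [10.0, 11.0], [1.0, 0.0]])
print(lff_counts(subs, medoids))
```

Every protein, whatever its length, is thereby mapped to a vector of the same length K, which is what allows structures to be compared without alignment.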

Structural Similarity Calculation Using LFF Profile. After the profiling described above, protein structures can be mapped into a common space where the similarity or dissimilarity between any two protein structures can be computed easily as a cosine or a cosine distance (or Euclidean distance), respectively, between two profile vectors. However, because the abundance of local patterns varies considerably from one pattern to another, some normalization of the profile is necessary, as shown in Methods. For example, the “null” pattern (15th submatrix in Fig. 3) is the most abundant of all, and, without normalization, such an abundant pattern will dominate when computing structural similarity or dissimilarity distances. This is not desirable, because the frequency of the null pattern contains little structural information. As can be seen in Fig. 4, similarity between structural profiles reflects, in general, the similarity between 3D structures according to SCOP classification.

Fig. 4. LFF profiles of protein structures from the globin family (a.1.1.2), the Ig V set domain family (b.1.1.1), the α-amylases N-terminal domain family (c.1.8.1), and the microbial ribonuclease family (d.1.1.1) in the SCOP database. (Upper four plots) The raw counts of LFF are plotted as a function of the 100 representative medoids (shown in Fig. 3) in red, blue, yellow, and green, respectively, for the four protein families. The highest peak in each family corresponds to medoid index 15 of Fig. 3, which is the “null” medoid submatrix, with all of the matrix elements having a distance >20 Å. LFF profiles for five proteins sampled from each family are shown. The quality of clustering of local features is difficult to discern because of the domination of the null medoid and the low signal-to-noise ratio of the rest of the medoids. (Lower four plots) After normalization by the spread of the counts in each representative medoid, the similarity among LFF profiles within each family is evident.

A Global Presentation of the Protein Fold Universe. Analogous to a map of the physical universe, a map of the protein fold universe provides a global view of the distribution of different protein structures in fold space, of objective classification of protein structures, and of the evolution of protein structures (23). First, the structural profiles of all 3,792 nonredundant SCOP domains were computed. The profiles were assembled into a protein-by-local pattern matrix of size 3,792 (proteins) by 100 (patterns). The matrix is processed by SVD as described in Methods. We compute the truncated SVD with κ = 3 to obtain the rank-three approximation $A_3$ of the protein-by-pattern matrix. This approximation is justified by the fact that the first three singular values are significantly greater than the rest. Fig. 5 shows biplots of the 100 representative patterns (medoid submatrices) and the 3,792 representative SCOP proteins, using the 1st-2nd and the 2nd-3rd principal axis pairs. From the plots, a correlation between representative patterns and structure classes is clearly visible. We also observe that the first three principal coordinates are approximately related to the length of the protein, the type of secondary structural elements (SSEs), and the parallelism of β strands, respectively. One embedding of the protein fold universe in 3D space using the SVD analysis of the profile matrix $A$ is shown in Fig. 5.

Fig. 5. Biplots of 3,792 protein structure profiles and 100 representative medoid patterns after SVD of the protein-by-pattern matrix. The 1st-2nd and the 2nd-3rd principal axis pairs are drawn. The 1st, 2nd, and 3rd principal coordinates can be interpreted as approximately related to the length of the protein, the types of secondary structure elements (SSEs), and the parallelism of β strands, respectively. Proteins belonging to the all-α, all-β, α/β, and α+β classes according to SCOP are colored red, blue, yellow, and green, respectively. The overall 3D plot is also shown.

Comparison with Other Methods. How well does our structural similarity measure agree with other classification schemes? As a test, we asked whether the nearest neighbor of a given protein structure by our profile method belongs to the same fold family as that protein structure in the manually curated SCOP, which is often considered to be the gold standard. The LFF profile-based classification agrees with the SCOP classification in 93% and 71% of the cases at the class and fold levels, respectively (Table 1). The method agrees less well with the CATH classification: 70% (class) and 61% (architecture). These features are also shown in Fig. 6 by the dendrogram constructed from the structural similarity scores of the LFF profile method, with color coding of the classifications by the SCOP and CATH methods.
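The nearest-neighbor test described above can be sketched as follows. The function name `nearest_neighbor_agreement` is hypothetical, and the inputs are assumed to be a precomputed all-against-all distance matrix together with one classification label (for example, a SCOP class or fold identifier) per protein.

```python
import numpy as np

def nearest_neighbor_agreement(dist, labels):
    """Fraction of proteins whose nearest neighbor shares their label.

    dist:   (P, P) symmetric distance matrix.
    labels: length-P sequence of classification labels.
    """
    d = np.array(dist, dtype=float)  # copy, so the input is not modified
    np.fill_diagonal(d, np.inf)      # a protein is not its own neighbor
    nearest = np.argmin(d, axis=1)
    labels = np.asarray(labels)
    return float(np.mean(labels[nearest] == labels))

# Toy example: four "proteins" placed on a line.
pos = np.array([0.0, 1.0, 10.0, 12.0])
dist = np.abs(pos[:, None] - pos[None, :])
print(nearest_neighbor_agreement(dist, ["a", "a", "b", "a"]))
```

Running the same function with labels taken at different levels of a hierarchy gives one agreement figure per level.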

Fig. 6. The dendrogram of 3,792 SCOP protein domains (in four classes, filtered at 40% sequence identity) was constructed by the hierarchical clustering method based on the LFF profile distances. The red (all α), blue (all β), yellow (α/β), and green (α+β) colors in the top bar indicate SCOP class designations. The CATH classification of 2,679 intact protein chains that have counterparts in SCOP domains was used. Three CATH classes are color-coded red (mainly α), blue (mainly β), and yellow (mixed α-β) in the bottom bar.

Table 1. Overall comparison of agreement in classification of the LFF profile method to SCOP and CATH methods at different levels of structural features

When we compared our method with the SCOP classification, we found examples of discrepancies. One extreme example is the case of quinohemoprotein amine dehydrogenase C chain (SCOP ID: d1jmxg_), which is classified in the “nonglobular all-α subunits of globular proteins” fold in SCOP. Among other proteins in the same SCOP fold, the closest one (SCOP ID: d1l8cb_) ranks 3,350th among 3,792 structures by our profile method. Furthermore, our method finds a DNA/RNA-binding three-helical bundle fold (SCOP ID: d1opc_), which belongs to a SCOP fold different from that of d1jmxg_, as the nearest neighbor fold to d1jmxg_. The distance map shows that the contact pattern of d1jmxg_ is quite different from those of other structures in the same SCOP fold, whereas d1jmxg_ and d1opc_ share considerably similar contact patterns (Fig. 7). This difference illustrates the different criteria used by the two methods in assessing the similarity between two protein structures: assessment based on visual similarity of the 3D fold in SCOP, and assessment based on computational similarity of distance matrix features in the LFF profile method.

Fig. 7. An example of discrepancy between the LFF profile method and the SCOP classification. The distance matrix of quinohemoprotein amine dehydrogenase C chain (SCOP ID: d1jmxg_) is visually quite different from those of other proteins in the same SCOP fold (a.137). However, the OmpR DNA-binding domain (SCOP ID: d1opc_), which belongs to another SCOP fold (a.4), is detected by the profile method to be closest to d1jmxg_.


For testing the concept of the structure profile method, we used a simplified approach: (i) in constructing the reference set of medoid submatrices, we extracted them from 100 protein structures randomly selected from the 3,792 nonredundant folds in the SCOP database; (ii) instead of extracting 100 representative LFs (medoid submatrices) directly from all submatrices of the 100 distance matrices (several million submatrices), we first found 50 medoids from each distance matrix, collected them together (50 × 100), and then extracted 100 medoids from the 5,000 medoids. These 100 medoids of medoids were used as the representative local features of all 3,792 proteins in LFF profiling.

In addition to expanding the structure database, from which we can extract a better set of medoid submatrices, we expect that the accuracy of the structural similarity score is likely to improve with calibration of various parameters in our method. One is the size of the local submatrix window, which should be large enough to capture nontrivial 3D interactions but small enough for the patterns to be observable in many different proteins and to remain computable. Also, the number K of representative local feature patterns, or medoids, can be increased beyond the 100 used in our test to achieve optimum signal-to-noise ratios. Furthermore, a statistical score function can be developed to recognize folds that have no statistically significant structural similarity with known structures.

One immediate utility of the LFF profile method is a quick “mapping” of a recently determined structure in relation to all other structures in the PDB, or any subset of them, in protein fold space (23). Another application may be to search for structural homologs of a query structure. For example, one could screen the whole PDB quickly by using the LFF profile method to find, say, the top 20 structural homologs of the query protein structure, and then run the dali search among those 20 to find the best alignment.
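The prescreening idea can be sketched as a ranking of database profiles by cosine distance to the query profile. The function name `top_hits` is hypothetical, the profiles are assumed to be already normalized LFF vectors with nonzero length, and the subsequent alignment step on the returned candidates is not shown.

```python
import numpy as np

def top_hits(query, profiles, n_hits=20):
    """Indices of the n_hits database profiles closest to the query.

    query:    length-K profile vector.
    profiles: (P, K) matrix of database profiles.
    Results are ordered from closest to farthest by cosine distance.
    """
    q = query / np.linalg.norm(query)
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    dist = 1.0 - unit @ q
    n_hits = min(n_hits, dist.shape[0])
    # argpartition selects the n_hits smallest without a full sort.
    candidates = np.argpartition(dist, n_hits - 1)[:n_hits]
    return candidates[np.argsort(dist[candidates])]

database = np.array([[1.0, 0.0],
                     [1.0, 1.0],
                     [0.0, 1.0]])
query = np.array([1.0, 0.1])
print(top_hits(query, database, n_hits=2))
```

Only the short list returned here would be passed on to a slower alignment-based comparison, so the expensive step runs on a fixed, small number of candidates regardless of database size.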


We are grateful to Drs. Chao Zhang, Stephen Holbrook, and Paul Adams for their comments and suggestions. This work was supported by National Science Foundation Grant DBI-0114707.


∥ To whom correspondence should be addressed. E-mail: shkim{at}cchem.berkeley.edu.

§ I.-G.C. and J.K. contributed equally to this work.

Abbreviations: LFF, local feature frequency; PDB, Protein Data Bank; SCOP, structural classification of proteins; CATH, class architecture topology homology; SVD, singular value decomposition.

Copyright © 2004, The National Academy of Sciences


1. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000) Nucleic Acids Res. 28, 235-242. pmid:10592235
2. Service, R. F. (2002) Science 298, 948-950. pmid:12411682
3. Bartlett, G. J., Todd, A. E. & Thornton, J. M. (2003) Methods Biochem. Anal. 44, 387-407. pmid:12647396
4. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536-540. pmid:7723011
5. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997) Structure (London) 5, 1093-1108.
6. Holm, L. & Sander, C. (1996) Nucleic Acids Res. 24, 206-209. pmid:8594580
7. Holm, L. & Sander, C. (1993) J. Mol. Biol. 233, 123-138. pmid:8377180
8. Shindyalov, I. & Bourne, P. (1998) Protein Eng. 11, 739-747. pmid:9796821
9. Gibrat, J. F., Madej, T. & Bryant, S. H. (1996) Curr. Opin. Struct. Biol. 6, 377-385. pmid:8804824
10. Orengo, C. A. & Taylor, W. R. (1996) Methods Enzymol. 266, 617-635. pmid:8743709
11. Harrison, A., Pearl, F., Mott, R., Thornton, J. & Orengo, C. (2002) J. Mol. Biol. 323, 909-926. pmid:12417203
12. Carugo, O. & Pongor, S. (2002) J. Mol. Biol. 315, 887-898. pmid:11812155
13. Havel, T. F., Kuntz, I. D. & Crippen, G. M. (1983) J. Theor. Biol. 104, 359-381. pmid:6656266
14. Vendruscolo, M., Kussell, E. & Domany, E. (1997) Fold Des. 2, 295-306. pmid:9377713
15. Holm, L. & Sander, C. (1994) Proteins 19, 256-268. pmid:7937738
16. Tanaka, S. & Scheraga, H. A. (1976) Macromolecules 9, 945-950. pmid:1004017
17. Mirny, L. & Domany, E. (1996) Proteins 26, 391-410. pmid:8990495
18. Aung, Z., Fu, W. & Tan, K. L. (2003) in Proceedings of the 8th International Symposium on Database Systems for Advanced Application, Kyoto, Japan (IEEE Computer Society, Los Alamitos, CA), pp. 311-318.
19. Kaufman, L. & Rousseeuw, P. (1990) in Finding Groups in Data: An Introduction to Cluster Analysis (Wiley, New York), pp. 68-163.
20. Chandonia, J.-M., Walker, N. S., Conte, L. L., Koehl, P., Levitt, M. & Brenner, S. E. (2002) Nucleic Acids Res. 30, 260-263. pmid:11752310
21. Gabriel, K. R. (1971) Biometrika 58, 453-467.
22. van Rijsbergen, C. J. (1979) in Information Retrieval (Butterworths, London).
23. Hou, J., Sims, G. E., Zhang, C. & Kim, S.-H. (2003) Proc. Natl. Acad. Sci. USA 100, 2386-2390. pmid:12606708