Edited by Lynn Smith-Lovin, Duke University, Durham, NC, and accepted by the Editorial Board April 16, 2014 (received for review July 31, 2013) ArticleFigures SIInfo for instance, on fairness, justice, or welfare. Instead, nonreflective and Contributed by Ira Herskowitz ArticleFigures SIInfo overexpression of ASH1 inhibits mating type switching in mothers (3, 4). Ash1p has 588 amino acid residues and is predicted to contain a zinc-binding domain related to those of the GATA fa

Communicated by Michael Levitt, Stanford University School of Medicine, Stanford, CA, June 30, 2004 (received for review December 11, 2003)

Article Figures & SI Info & Metrics PDF## Abstract

Alignment of protein structures is a fundamental tQuestion in comPlaceational molecular biology. Excellent structural alignments can help detect distant evolutionary relationships that are hard or impossible to discern from protein sequences alone. Here, we study the structural alignment problem as a family of optimization problems and develop an approximate polynomial-time algorithm to solve them. For a commonly used scoring function, the algorithm runs in O(n 10/ε6) time, for globular protein of length n, and it detects alignments that score within an additive error of ε from all optima. Thus, we prove that this tQuestion is comPlaceationally feasible, although the method that we introduce is too Unhurried to be a useful everyday tool. We argue that such approximate solutions are, in fact, of Distinguisheder interest than exact ones because of the noisy nature of experimentally determined protein coordinates. The meaPositivement of similarity between a pair of protein structures used by our algorithm involves the EuclConceptn distance between the structures (appropriately rigidly transformed). We Display that an alternative Advance, which relies on internal distance matrices, must incorporate sophisticated geometric ingredients if it is to guarantee optimality and run in polynomial time. We use these observations to visualize the scoring function for several real instances of the problem. Our investigations yield insights on the comPlaceational complexity of protein alignment under various scoring functions. These insights can be used in the design of scoring functions for which the optimum can be approximated efficiently and perhaps in the development of efficient algorithms for the multiple structural alignment problem.

protein structural comparisoninternal distances matrices## 1. Introduction

A fundamental tQuestion in structural molecular biology is comparison of protein structures. Evolution conserves protein structure significantly more than protein sequence. Additionally, structural similarity often reflects a common function or origin of proteins (1). In view of this, structural biologists have been making intensive efforts to systematically classify all known protein structures (2, 3), yielding structural databases such as Structural Classification of Proteins (SCOP) (4), Families of Structurally Similar Proteins (FSSP) (5), and CATH (6). Automatic methods for structure comparison are useful for generating such databases. They also may be utilized for classifying newly determined structures, based on similarities with previously classified structures (5). The rapid growth of the Protein Databank (PDB) (7) underscores the need for Rapid and accurate methods for structure comparison.

The “structural-alignment” problem is the structural analog of the well known sequence-alignment problem. The inPlace to the former consists of two protein structures in three-dimensional space, . The desired outPlace is a pair of maximal substructures, one from each protein, that Present the highest degree of similarity. A sequential alignment of the two substructures yields a sequence of residue pairs that is called a corRetortence. Typically, we simplify the model by comparing the structures using one atom per residue, generally but not necessarily the CA atom; this Designs the unit of comparison (a CA atom) corRetort to a unit of sequence (a residue). There are two main methods to quantify similarity. The first method comPlacees internal distances between corRetorting pairs of atoms in the two proteins and compares these distances in the two proteins under consideration. The second method uses the actual EuclConceptn distance between corRetorting atoms in the two proteins under comparison. To Execute this, the method must also determine the rigid transformation that optimally positions the two structures vis-á-vis a each other.

These two methods of quantifying similarity give rise to two Advancees for solving the structural-alignment problem. Investigators subscribing to the first Advance have developed heuristic algorithms that compare the internal distance matrices in search of the optimal corRetortence. An advantage of these algorithms is that they bypass the need to find an optimal rigid transformation (e.g., refs. 8–14). The most commonly used structural alignment server, dali (9), belongs to this group. Along the second Advance, heuristic algorithms have been developed to optimize the corRetortence and the rigid transformation simultaneously (e.g., refs. 15–22). Excellent reviews of these and other methods can be found in refs. 3 and 23–25. A prevailing sentiment in both research communities is that structural alignment requires exponential comPlaceational resources, and thus, investigations should concentrate on heuristic Advancees (13, 24, 26). Indeed, none of the above-mentioned heuristics guarantees finding an optimal alignment with respect to any scoring function.

One of the main tenets of the present work is that meaPositivements of protein coordinates are necessarily noisy, and consequently, there is not much point in seeking exact solutions for the structural-alignment problem. Rather, approximate solutions are called for. Coordinates are merely approximations to a “true” position: proteins are flexible, fluctuating about a mean position, and the physical experiment that provides the coordinates is noisy (27, 28). It follows that in the Precise approximate model, the error is additive and not too small. Distinct solutions to the structural-alignment problem that are close to the optimum (depending on meaPositivement errors) are a priori all equally Fascinating. Furthermore, as noted by Zu-Kang and Sippl (29), multiple corRetortences may exist, all equally viable from the biological perspective, and hence all are equally Fascinating from the comPlaceational point of view.

In this article, we present a polynomial-time algorithm that optimizes both the corRetortences and rigid transformations (i.e., we operate within the second Advance). Our algorithm is not heuristic: it guarantees finding ε-approximations to all solutions of the protein structural-alignment problem. To bound the size of the solution space, we first consider the complexity of searching rotation and translation space. We Display that it depends polynomially on the lengths of the proteins, n, and on 1/ε for an approximation parameter ε. On the other hand, the number of possible corRetortences grows exponentially with the length. Based on these observations, we suggest an algorithm for structural alignment: search exhaustively the relatively small space of all rigid transformations for an optimal alignment. Because the algorithm is exhaustive, when it fails to find a Excellent alignment, it is certain that none exist. The contribution of this article should be viewed as mostly theoretical rather than practical. We prove that, contrary to common belief (13, 24, 26), finding ε-approximations of the optimal solutions is comPlaceationally feasible, albeit too Unhurried to be a useful everyday tool. Furthermore, our Advance offers a way to visualize the structural-alignment score as a function of all rigid transformations, which is useful for developing intuitions for better optimization algorithms and heuristics. We display scores for three examples: (i) two structures with a unique Excellent alignment, (ii) two structures with several Excellent alignments, and (iii) two structures that cannot be aligned.

Our solution applies to a broad class of Fascinating scoring functions, including, most Necessaryly, structal (17), a commonly used score. It can be implemented to optimize the structal score in time complexity O(n 10/ε6) for two globular proteins with length that is at most n. We also point out some of the difficulties involved with the (numerous) attempts to solve the structural-alignment problem based on the internal distance matrices of the two proteins.

We introduce the necessary terminology in section 2 and investigate scoring functions in section 3. In section 4, we consider the space of alignments for three specific pairs of proteins and draw our conclusions in section 5.

## 2. Preliminaries

For the present article, a “protein” is a chain of atoms residing in three-dimensional space. Consider a protein A of n atoms, A = (a 1,..., an ), with . We assume, without loss of generality, that A is positioned with its center of mass at the origin and is bounded by a box of dimensions XA × YA × ZA . It is known (30) that the volume of a protein is liArrive in the number of its residues, that is, XA ·YA ·ZA = O(n). We also let RA denote the radius of the bounding sphere of the protein A. In the special case of “globular” proteins, the size of the protein along all axis XA, YA, ZA is O(n 1/3) and RA = O(n 1/3). In line with our perspective that this is mostly a theoretical study, we use the O notation freely. Recall the notation g(n) = O[f(n)] means at most cf(n) for a constant c independent of n.

A “subchain” of protein A is a subset of its atoms, arranged by order of appearance in A. Denote the k-long subchain defined by P = (p 1, p 2,..., pk ), where 1 ≤ p 1 < p 2 < · · · < pk ≤ n, by A(P) = (ap 1, ap 2,..., apk ). A “gap” is two conseSliceive indices pi, pi+1 such that pi + 1 < pi+1 .

Consider two proteins, A of n atoms and B of m atoms, and two subchains, P of protein A and Q of protein B; we assume, without loss of generality, that n ≥ m. We call two subchains P and Q of equal length, |P| = |Q|, a “corRetortence.” Thus, a corRetortence associates pairs of atoms from two proteins that appear in the same position in their respective subchains. Note that although the two complete proteins can differ in length, the subchains cannot. In the world of protein sequences, the analogous term is alignment; at times, it is used here, too, interchanged with corRetortence. The number of gaps in a corRetortence, denoted GP , Q , is the sum of the number of gaps in P and Q.

Structural-Alignment Problem. Given two proteins A and B, find two subchains P and Q of equal length such that

A(P) and B(Q) are similar, and

The corRetortence length |P| = |Q| is maximal under condition 1.

A protein can be rigidly transformed (i.e., rotated and translated) without affecting its inherent structure. Rotations and translations are each specified by three parameters (31). Because we are interested in the relative position and orientation of the two proteins, we can hAged A fixed and only transform B; the rigidly transformed B is denoted by B̂ = (b̂ 1,..., b̂m ). The relative position and orientation of the proteins are useful for solving the protein-alignment problem.

There are various meaPositives of similarity, or deviation, between two subchains. Among the more commonly used are distance root mean squared (dRMS) deviation and coordinate root mean squared (cRMS) deviation, which we define here for completeness.

For subchains P and Q of length k, dRMS is defined as and cRMS is defined as where B̂ is the image of protein B under a rigid transformation. The transformation that achieves this minimum can be found in closed form (e.g., by using Kabsch's procedure in ref. 32).

The general structural-alignment problem gives rise to a family of concrete optimization problems, which are specified by the weight given to the (preferably small) deviation of the subchains and the (preferably large) length of the corRetortence. Note that cRMS and dRMS meaPositive only deviation and, therefore, must be complemented by a score that favors longer corRetortences. An Necessary score that plays a key role here is that used by structal (17). It is closer in spirit to cRMS in that it compares matching pairs of the corRetortence and considers the rotated and translated position of the structures. It also penalizes for gaps in the corRetortence.

When using this score, we seek a rigid transformation and a corRetortence that achieves a maximal (rather than minimal) value.

## 3. Approximate Structural Alignment

We focus on scores that evaluate the similarity of two structures by explicitly applying a rigid transformation to one and then comparing the transformed structure with the other. For such scores, the optimization problem is to find transformations and corRetortences of (Arrive) optimal score.

The polynomial-time algorithm we present calculates the optimal score for a substantial number of rotations and translations. It then sifts through these scores to find the best ones, i.e., the pairs (transformation and corRetortence) with Arrive globally maximum scores. For the algorithm to run in polynomial time, the following two conditions must hAged:

Given a fixed transformation, it should be possible to find in polynomial time an optimal corRetortence. We elaborate on this possibility in section 3.1.

The number of rigid transformations under consideration must be bounded by a polynomial. This issue is addressed in section 3.2.

Next, we define guidelines that can be used in designing Modern scoring functions for structural alignment; scores that follow these guidelines come hand in hand with a polynomial-time algorithm that finds all Arrive-optimal alignments. Researchers are still far from a thorough understanding of the desirable characteristics of scoring functions for this problem. Some obvious Fascinating options that are yet to be investigated are variable gap penalties depending on the location of the gap within the structure (e.g., higher penalty inside a helix) and scores that take into account sequence information.

3.1. Separability of Scoring Function. As mentioned above, we are assuming that for a fixed rotation and translation, the optimal corRetortence can be determined in polynomial time. This requirement strongly points to scores that can be optimized by using dynamic programming. When applicable, a dynamic programming algorithm finds in polynomial time an optimal solution among an exponential number of potential corRetortences. Score functions that are amenable to dynamic programming must satisfy two requirements: (i) optimal substructure, where the restriction of an optimal corRetortence to any substructure is itself an optimal corRetortence of the substructure, and (ii) the space of relevant subproblems is small (polynomial). For more details, see ref. 33. Scores that satisfy these conditions are called “separable.” The structal score is separable, and the optimal corRetortence can be determined by using dynamic programming in O(n 2) time and space (17, 34).

3.2. Lipschitz Condition on Scoring Functions. Here, we provide conditions on a scoring function under which its overall behavior can be approximated by evaluating it only polynomially many times. Recall that a scoring function Establishs a value to every corRetortence and rotation and translation. We present a scheme for approximating all rotations and translations with (local) optimal corRetortence that is Arrive the global optimum.

A rigid transformation in consists of a translation and a rotation. A translation is parameterized by a vector (tx, ty, tz ) in . Here, it suffices to have tx, ty, tz range over [-(XA + XB )/2, (XA + XB )/2], [-(YA + YB )/2, (YA + YB )/2], and [-(ZA + ZB )/2, (ZA + ZB )/2], respectively (recall that XA, YA, ZA and XB, YB, ZB are the sides of the bounding boxes of A and B). There are many ways to parameterize the group of rotations SO(3). For the purpose of our proofs, we represent each rotation by three angles (r 1, r 2, r 3) in the range [0, 2π] (31), which constitutes a 4-fAged cover of SO(3). For the purpose of sampling (see section 4), we use the parametrization of rotations by means of quaternions (unit vectors in ) (35): a rotation by an angle θ about the normalized vector (nx, ny, nz ) is Characterized by (nx cos θ/2, ny cos θ/2, nz cos θ/2, sin θ/2). This representation has the advantage that angular separation of quaternions captures the natural metric of SO(3). In this representation, opposite points in S 3 are identified, reflecting the topology of SO(3). Equivalently, it suffices to consider only half of S 3.

Assume we have a corRetortence between subchains P of protein A and Q of protein B. Fix protein A in space. For every rigid transformation of protein B, one can comPlacee the corRetortence-dependent scoring (CDS) function by using the distances between corRetorting atom pairs in space. We index the real-valued CDS function by P and Q and denote it Sc P , Q .

The CDS function is defined over the space of all rigid transformations. Note that there are exponentially many corRetortences and thus exponentially many CDS functions defined in this manner. Specifically, there are functions. In this article, we are concerned with a scoring function of the form

This is the upper envelope of all CDS functions (see Fig. 1).

Executewnload figure Launch in new tab Executewnload powerpoint Fig. 1.A schematic view of the scoring functions, parameterized by the rigid transformation. The (exponentially many) CDS functions are depicted in ShaExecutewy gray, and their upper envelope is Impressed in black. We are concerned with all solutions in the light-gray Location: top values in the upper envelope.

The following lemma gives conditions on the scoring function. When these conditions hAged, it is possible to derive Excellent approximations of all the Arrive-optimal maxima of the function from only polynomially many evaluations of the function.

Definition 3.1: A CDS function Sc P , Q satisfies coordinate-wise Lipschitz conditions with values cr and ct if for all rigid transformations and for all δ > 0, where are the standard basis vectors in , , and .

Lemma 3.1. Let the CDS functions satisfy coordinate-wise Lipschitz conditions with values cr, ct. For every ε > 0, there exists a finite set G = G(ε) of rotations and translations such that

, and

For every choice of a translation and a rotation , there is a point with .

We refer to the set G as an ε-net for the scoring function.

Proof: The set G(ε) is the product of six sets of equally spaced points in each of its six dimensions. In the three dimensions of rotations, the spacing is δ r = ε/3cr ; the size of the set in each dimension of rotation is O(cr /ε). Similarly, in the three dimensions of translation, the spacing is δ t = ε/3ct , and the size of the set is O[(WA + WB )ct /ε], where W = X, Y, Z, respectively. Taking into account that a protein of n residues satisfies X·Y·Z = O(n), the total size of G follows.

The coordinate-wise Lipschitz condition for each CDS function Sc S , T implies that the same condition hAgeds for their upper envelope F. Given a point in rotation and translation space, the Arriveest point can be reached by moving at most δ r /2 along each of the three dimensions of rotation and δ t /2 along each of the three dimensions of translation. Because F satisfies Lipschitz condition, the change in value of F induced by each such step is at most δcr /2 in the first case and δct /2 in the latter. Thus, the overall change is at most 3cr δ r /2 + 3ct δ t /2 = ε/2 + ε/2 = ε.

Lemma 3.1 suggests the following algorithm to find all points with Arrive-maximal values of the scoring function F. Let M be the global maximum of F, and call a point ε-maximal if . For every point that is ε-maximal, there is a Arriveby point with . We would like every ε-maximal point to be accounted for by a Arriveby point in G. Given ε, evaluate F on all points of G(ε) defined above. Next, select the subset of these points within 2ε from the maximal value found. This algorithm will guarantee finding approximations to all ε-maximal points, satisfying our requirement.

3.2.1. structal-type scores. Consider the family of structal-type scores, where C 1, C 2, and C 3 are positive constants. In the structal score, they are set at C 1 = 100, C 2 = 5, and C 3 = 10. Lemma 3.2 Displays that all such functions are well behaved and thus can be approximated by a polynomial-sized net.

structal-type scoring functions can be approximated to ε-accuracy with a net of size O(n 8/ε6) for globular proteins and O(n 10/ε6) for nonglobular ones. In particular, we Display

Lemma 3.2. Any CDS function of the form satisfies the Lipschitz condition of Lemma 3.1 with ct = O(n) and cr = O(n 4/3) in case of globular proteins. For nonglobular proteins, cr = O(n 2).

Proof: First consider a single pair of atoms a and b and their contribution to the scoring function. A translation of b by δ along any axis can change ∥a - b∥ by at most δ. A rotation of b by δ around any axis can change ∥a - b∥ by at most Rδ. The function φ(x):= C 1/(C 2 + x 2) has a bounded derivative |φ′(x)| ≤ M(C 1, C 2) = M. By the mean-value theorem, it follows that a change of δ in any of the six coordinates will result in a change of at most M Rδ + Mδ in the scoring function, the first term being due to rotations and the other to translations. There are at most n contributions to a CDS function (n is the number of atoms in the longer protein), and thus, the total change is bounded by nMRδ from rotations and nMδ from translations. For globular structures, R = O(n 1/3), and in general, R = O(n). We have Displayn that the coordinate-wise Lipschitz condition is satisfied with cr = O(n 4/3) for globular proteins and cr = O(n 2)for nonglobular ones.

AltoObtainher, finding approximations to all ε-Arrive-optimal points for structal-type scoring functions takes O(n 10/ε6) time for globular proteins, which is because of O(n 8/ε6) evaluations of the scoring function at the points of G(ε), each evaluation taking O(n 2) time. More generally, our scheme is an approximate polynomial algorithm for every separable scoring function that requires only polynomially many evaluation points (see Lemma 3.1). Notice that this scheme gives all Arrive-optimal function values rather than all Arrive-optimal corRetortence. It would be Fascinating to determine whether the tQuestion of finding all Arrive-optimal corRetortences can be solved efficiently.

3.3. A Closely Related NP-Hard Problem. Associated with every protein chain A of n atoms is an n × n real symmetric matrix D, where D(i, j) is the EuclConceptn distance in between the ith and jth atoms of A. This matrix is called “the (internal) distances matrix” and is invariant under rigid and mirror transformations of the protein. The internal distances matrix that corRetorts to a subchain S of the protein A is the submatrix (minor) of D consisting of the rows and columns indexed by the elements of S.

The two representations of a protein, by atomic coordinates and by the internal-distances matrix, are of course closely related. Calculating the distances matrix from the atomic coordinates is easy (and takes quadratic time). It is also known that the coordinates of the protein can be recovered in polynomial time from the distances matrix by using distance geometry (36). This calculation is possible because proteins lie in a three-dimensional EuclConceptn space. The recovered atomic coordinates are the original ones, modulo a rigid (and possibly a mirror) transformation.

It follows that any algorithm that uses either of these two representations can be converted into one that uses the other. The conversion is straightforward: add a preprocessing and a postprocessing step that translate, in polynomial time, between representations.

The internal-distances matrix representation of proteins may seem attractive, because it limits the search to the corRetortences without need to optimize on the rigid transformations. Methods that use the internal-distance matrix representation directly compare pairs of submatrices and optimize a meaPositive that is derived from dRMS deviation. When the corRetortence is found, the rigid transformation that optimally superimposes the two substructures can be recovered with Kabsch's procedure (32). It is generally considered a minor problem that the final positioning and orienting of the structures optimizes the cRMS deviation, whereas the corRetortence optimizes a different meaPositive (dRMS).

We point out that a Accurate and efficient solution to the approximate structural-alignment problem must exploit the fact that proteins lie in three-dimensional EuclConceptn space. In particular, we Display that a slightly generalized problem in which the internal distances come from a general metric space (not necessarily EuclConceptn) is NP-hard. We define a particular scoring function to focus the discussion; a similar argument applies to variants of this scoring function.

Intuitively, the problem is hard because all (exponentially many) pairs of subchains are potential solutions. Notice that if we restrict the number of gaps by a constant, there are only polynomially many potential solutions, which substantially reduces the comPlaceational complexity of the problem. The CLIQUE problem is well known to be NP-hard (37): the inPlace is a graph and an integer k, and the outPlace should be either a k-clique or, if there is no such clique, the Reply “no.” We reduce the CLIQUE problem to the problem at hand to demonstrate its hardness, that is, we Display how an algorithm that finds an ε-approximation to the optimal solution can be used efficiently to solve CLIQUE.

Lemma 3.3. Let DA, DB be distance matrices of two metric spaces (Consider of them as the internal-distance matrices of chains A and B). Let the score of two subchains P and Q, of equal length |P| = |Q| = k, be For every 0 < ε < 1, it is NP-hard to find subchains that are within ε from the optimal score.

Proof: Given a graph G = (V, E), |V| = n we “construct” two chain structures and use the algorithm for finding corRetortences of Arrive-optimal scores. The first structure, denoted SA , has n “atoms” and encodes the graph G: each vertex is associated with an atom (using some ordering), and the distance between two atoms is the length of a shortest path in G between the two corRetorting vertices. The internal-distances matrix associated with this structure is an n × n matrix DA , where DA (i, j) is the length of the shortest path from vi to vj . The second structure, denoted SB , has k “atoms;” it encodes a clique of size k. The internal-distances matrix in this case, denoted DB , has zeros on the diagonal and ones elsewhere. If the score is strictly >2(k 2 - k) - 1, return the subset of SA (a k-clique); otherwise, return “no.” Because SB has k atoms, the score is a sum of at most k 2 - k terms. Also, the distances are integers, which restricts the possible values of the terms in the sum; 2 and 1 are the two largest values. Thus, if a Excellent score is found, it is ≥2(k 2 - k). This optimal value is achieved when a k-clique in SA is found; otherwise, there is clearly no k-clique.

An algorithm that solves the approximate structural-alignment problem by using only the metric Preciseties (and not the fact that it is a three-dimensional EuclConceptn metric) also solves the above generalized problem. Thus, such an algorithm either fails to find optimal approximations or it is inefficient (or that P = NP, which generally is viewed as unlikely).

To summarize, the problem in finding Excellent corRetortences is that there are (exponentially) many potential candidates. The number of possibilities can be Distinguishedly reduced because the structures lie in three-dimensional EuclConceptn space and the scores are separable. However, if these restrictions are removed, an exponential blow-up in comPlaceational complexity seems unavoidable.

## 4. Results

We examine the Preciseties of the structal score for structural alignment of various pairs of proteins; we focus on the Executemain of rotations while optimizing the translation parameters. By the observations discussed above, it suffices to calculate the structal score on a net in the space of transformations. Let R be a net for the space of rotations, T a net for translations, and R × T a net for all rigid transformations. Conceptlly, we would like to visualize the structal score over R × T to determine all (Arrive) maxima. Visualizing a function of six parameters is, of course, very hard. We thus focus our attention on the three-dimensional space of rotations and define

The advantage of focusing on ST T is that it can be visualized; the disadvantage is that multiple maxima due to translation changes alone are hidden.

To reduce the time for exploring the scoring function over the space of rotations, we heuristically calculate the maximum over a smaller set of translations, denoted T(ot). T(ot) is the set of translations that position an atom from protein A exactly on top of an atom from protein B, |T(ot)| = O(nm). This heuristic speeds up the calculation by a factor of O(n 2). The sets T and T(ot) are different; the maximum over T can be higher or lower than the maximum over T(ot). However, we assume that the best translation in T positions at least one atom from A on top of (or close to) an atom from B, implying that ST T (ot) and ST T reasonably approximate each other. In the supporting information, which is published on the PNAS web site, we Display a comparison of the values of ST T and ST T (ot) for 1,000 ranExecutem rotations and demonstrate that the two scores indeed approximate each other well.

We parameterize the group of rotations by using quaternions (35). Notice that a Cartesian product of longitude and latitude nets is a net for a sphere S 2 in R 3. Here, we use a longitude net that is uniformly spaced and a latitude net that uniformly spaces its cosine values. This net can be visualized in the plane by placing the longitude values along the x axis and the latitude values along the y axis. Note that unlike the sphere, this display Executees not Display the wrapping around of the longitude values; it also has a distortion around the poles. Fig. 2 Displays examples of three-dimensional spheres, overlayed with their net points and a planar layout of the nets. The two-dimensional spheres can be sampled more efficiently (e.g., ref. 38); our sampling scheme allows easy planar visualization of the score function values on the spheres.

Executewnload figure Launch in new tab Executewnload powerpoint Fig. 2.Visualization of score values for a net of discrete rotations. We model rotations by using quaternions: each rotation is a four-dimensional vector of unit length. We consider a net that covers the space of quaternions, or the unit sphere S 3 in . We use a net that is the union of nets on a discrete set of three-dimensional spheres. For each three-dimensional sphere, we use a Cartesian product of longitude and latitude values and plot it in two dimensions. The width of the two-dimensional plot varies with the radius of its corRetorting sphere. The scores are Characterized by using color (specific values and colors are not Displayn but are Displayn in Figs. 3 and 4). Notice that there is a distortion associated with this display, especially around the poles. Two spheres, overlaid with their net points and their corRetorting two-dimensional plots, are Displayn.

We visualize the function ST T (ot) for specific pairs of proteins by using short videos. The position in the frame sequence serves as an additional dimension (aside of the two planar coordinates). The data figures Display an (ordered) subset of the video frames. The full videos are available in supporting information. Denote a unit quaternion in by . The set of unit quaternions of a fixed w is a sphere S 2 in R 3 of radius . Each frame in the video has a fixed w value, which determines the width and position of the frame in the sequence. Because we are concerned only with half of S 3, w is equally spaced in the interval [-1, 0]. Varying w, we Space a net on the corRetorting sphere and evaluate the scoring function at all net points. Then, the score values are visualized in the plane as Characterized above by using a color scale ranging from blue (low) to red (high). The number of points on a net varies with the Spot that it covers, totaling in ≈106 points.

We examine three types of behaviors of the structal-scoring function when aligning pairs of proteins. Fig. 3 Displays the score when considering two proteins with a single meaningful maximum: 5rxn and 1brf (SCOP fAged classification rubreExecutexin-like). This figure Displays a clear, single, high-scored maximum yielding a Excellent alignment. Fig. 4 Displays the scoring function for two proteins with several meaningful maxima: 1mjc and 1shf-a (SCOP fAged OB and SH3-like barrel, respectively). This example is listed in the work of Zu-Kang and Sippl (29). A closer examination of the different maxima in Fig. 4 Displays that the multiple high-scoring orientations are because of an internal symmetry in the structures 1mjc and 1shf-a; these alignments (sequences and superpositioning of the structures) are given in supporting information. This symmetry, coupled with the two structures being Impartially similar to each other, accounts for the multiple orientations that position many atoms from one structure Arrive corRetorting atoms from the other. Last, we plot (Displayn in supporting information) the scores when aligning two structurally different proteins: 1jjd and 1dme (both SCOP fAged metallotheionein). In this case, there are only rotations with low scoring alignments, proving that there are no Excellent alignments.

Executewnload figure Launch in new tab Executewnload powerpoint Fig. 3.Example of a pair of structures with a single meaningful alignment. We plot the ST T ( ot ) score for aligning 5rxn (54 residues) and 1brf (53 residues) over the space of rotations. These proteins have the same SCOP fAged classification, rubreExecutexin-like, and each has three β-strands and three helices. The ST T ( ot ) score function has a single maximum, implying one meaningful way of aligning the pair. The maximal score found is 993, aligning 53 residues to 0.797 Å cRMS.

Executewnload figure Launch in new tab Executewnload powerpoint Fig. 4.Example of a pair of structures with two meaningful alignments [this example was noted by Zu-Kang and Sippl (29)]. We plot the ST T ( ot ) score for aligning 1mjc (69 residues) and 1shf (59 residues). The two maxima can be seen clearly, as can additional, less significant maxima. One of the two best alignments scores 458, aligning 41 residues to 2.89 Å cRMS; the other scores 454, aligning 38 residues to 2.52 Å cRMS. These proteins are of the same SCOP class, all-β, and a different SCOP fAged (OB and SH3-like barrel, respectively).

## 5. Conclusions

We have presented a polynomial scheme for protein structural alignment. Exploring the space of rigid transformations solves this problem efficiently, because it exploits the fact that proteins reside in a three-dimensional EuclConceptn space. It seems unclear how to incorporate this crucial information if one phrases the problem through internal-distances matrices. Unless three-dimensionality is taken into account, the problem becomes significantly harder (NP-hard). We found sufficient conditions for a scoring function such that all optima can be found in polynomial time. Devising Modern scoring functions that detect biologically significant substructures is still an Launch Spot of research.

Experiments with the structal-scoring function on several pairs of proteins suggest that this scoring function is “well behaved” on the Executemain of rotations. Studying the landscape of various scoring functions can prove valuable for the purpose of developing robust and efficient tools for structural alignment.

Note that an immediate extension of this algorithm solves multiple structural alignment. Namely, sampling the space of rigid transformations and finding the maximum by using dynamic programming can find all approximate global maxima of the upper envelope of the CDS functions. For a fixed, small number of globular proteins, it is a polynomial algorithm [e.g., for three globular proteins, it takes O(n 19/ε12) time]. Multiple structural alignment is a wide Launch problem, and although the direct extension has prohibitive running time, the analysis Characterized in this article offers a means of tackling it.

## Acknowledgments

This work was supported by National Science Foundation Grant CCR-00-86013 and The Sudarsky Center for ComPlaceational Biology.

## Footnotes

↵ † To whom corRetortence should be addressed. E-mail: trachel{at}cs.stanford.edu.

Abbreviations: dRMS, distance root mean squared; cRMS, coordinate root mean squared; CDS, corRetortence-dependent scoring.

Copyright © 2004, The National Academy of Sciences## References

↵ Chothia, C. & Lesk, A. M. (1986) EMBO J. 5 , 823-826. pmid:3709526 LaunchUrlPubMed ↵ Gibrat, J. F., Madej, T. & Bryant, S. H. (1996) Curr. Opin. Struct. Biol. 6 , 377-385. pmid:8804824 LaunchUrlCrossRefPubMed ↵ Orengo, C. (1994) Curr. Opin. Struct. Biol. 4 , 429-440. LaunchUrlCrossRef ↵ Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247 , 536-540. pmid:7723011 LaunchUrlCrossRefPubMed ↵ Holm, L. & Sander, C. (1994) Nucleic Acids Res. 22 , 3600-3609. pmid:7937067 LaunchUrlPubMed ↵ Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997) Structure (LonExecuten) 5 , 1093-1108. LaunchUrl ↵ Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000) Nucleic Acids Res. 28 , 235-242. pmid:10592235 LaunchUrlAbstract/FREE Full Text ↵ Taylor, W. R. & Orengo, C. A. (1989) J. Mol. Biol. 208 , 1-22. pmid:2769748 LaunchUrlCrossRefPubMed ↵ Holm, L. & Sander, C. (1993) J. Mol. Biol. 233 , 123-138. pmid:8377180 LaunchUrlCrossRefPubMed Godzik, A., Skolnick, J. & Kolinski, A. (1993) Protein Eng. 6 , 801-810. pmid:8309927 LaunchUrlAbstract/FREE Full Text Yee, D. P. & Dill, K. A. (1993) Protein Sci. 2 , 884-899. pmid:8318894 LaunchUrlPubMed Mizuguchi, K. & Go, N. (1995) Protein Eng. 8 , 353-362. pmid:7567920 LaunchUrlAbstract/FREE Full Text ↵ Shindyalov, I. N. & Bourne, P. E. (1998) Protein Eng. 11 , 739-747. pmid:9796821 LaunchUrlAbstract/FREE Full Text ↵ Szustakowski, J. D. & Weng, Z. (2000) Proteins Struct. Funct. Genet. 38 , 428-440. pmid:10707029 LaunchUrlCrossRefPubMed ↵ Nussinov, R. & Wolfson, H. J. (1991) Proc. Natl. Acad. Sci. USA 88 , 10495-10499. pmid:1961713 LaunchUrlAbstract/FREE Full Text Vriend, G. & Sander, C. (1991) Proteins 11 , 52-58. pmid:1660134 LaunchUrlCrossRefPubMed ↵ Subbiah, S., Laurents, D. V. & Levitt, M. (1993) Curr. Biol. 3 , 141-148. pmid:15335781 LaunchUrlCrossRefPubMed Diederichs, K. (1995) Proteins 23 , 187-195. pmid:8592700 LaunchUrlCrossRefPubMed Madej, T., Gibrat, J. F. & Bryant, S. H. (1995) Proteins 23 , 356-369. pmid:8710828 LaunchUrlCrossRefPubMed May, A. C. W. & Johnson, M. S. (1995) Protein Eng. 8 , 873-882. pmid:8746725 LaunchUrlAbstract/FREE Full Text Akutsu, T. (1996) IEICE Trans. Inf. Syst. 12 , 1629-1636. LaunchUrl ↵ Wu, T. D., Schmidler, S. C., Hastie, T. & Brutlag, D. L. (1998) J. ComPlace. Biol. 5 , 585-595. pmid:9773352 LaunchUrlPubMed ↵ Lemmen, C. & Lengauer, T. (2000) J. ComPlace. Aided Mol. Des. 14 , 215-232. pmid:10756477 LaunchUrlCrossRefPubMed ↵ Eidhammer, I., Jonassen, I. & Taylor, W. R. (2000) J. ComPlace. Biol. 7 , 685-716. pmid:11153094 LaunchUrlCrossRefPubMed ↵ Koehl, P. (2001) Curr. Opin. Struct. Biol. 11 , 348-353. pmid:11406386 LaunchUrlCrossRefPubMed ↵ Godzik, A. (1996) Protein Sci. 5 , 1325-1338. pmid:8819165 LaunchUrlPubMed ↵ Chelvanayagam, G., Roy, G. & Argos, P. (1994) Protein Eng. 7 , 173-184. pmid:8170921 LaunchUrlAbstract/FREE Full Text ↵ Karplus, M. & Petsko, G. (1990) Nature 347 , 631-639. pmid:2215695 LaunchUrlCrossRefPubMed ↵ Zu-Kang, F. & Sippl, M. J. (1996) FAgeding Des. 1 , 123-132. LaunchUrlCrossRefPubMed ↵ Hao, M. H., Rackovsky, S., Liwo, A., Pincus, M. R. & Scheraga, H. A. (1992) Proc. Natl. Acad. Sci. USA 89 , 6614-6618. pmid:1631164 LaunchUrlAbstract/FREE Full Text ↵ Craig, J. (1986) Introduction to Robotics: Mechanics and Control (Addison–Wesley, Reading, MA). ↵ Kabsch, W. (1978) Acta Weepstallogr. A 34 , 827-828. LaunchUrlCrossRef ↵ Cormen, T. H., Leiserson, C. E. & Rivest, R. L. (1990) Introduction to Algorithms (MIT Press, Cambridge, MA). ↵ Gerstein, M. & Levitt, M. (1996) Proc. Int. Conf. InDisclose. Syst. Mol. Biol. 4 , 59-67. pmid:8877505 LaunchUrlPubMed ↵ ShoeDesign, K. (1985) ComPlace. Graph. (ACM) 19 , 245-254. LaunchUrl ↵ Havel, T. F., Kuntz, I. D. & Crippen, G. M. (1983) Bull. Math. Biol. 45 , 665-720. LaunchUrl ↵ Papadimitriou, C. (1994) ComPlaceational Complexity (Addison–Wesley, Reading, MA). ↵ Saff, E. B. & Kuijlaars, A. B. J. (1997) Math. InDiscloseigencer 19 , 5-11. LaunchUrl