Fast structure similarity searches among protein models: efficient clustering of protein fragments
© Fogolari et al; licensee BioMed Central Ltd. 2012
Received: 5 September 2011
Accepted: 29 May 2012
Published: 29 May 2012
For many predictive applications a large number of models is generated and later clustered in subsets based on structure similarity. In most clustering algorithms an all-vs-all root mean square deviation (RMSD) comparison is performed. Most of the time is typically spent on comparison of non-similar structures. For sets with more than, say, 10,000 models this procedure is very time-consuming and alternative faster algorithms, restricting comparisons only to most similar structures would be useful.
We exploit the inverse triangle inequality on the RMSD between two structures given the RMSDs with a third structure. The lower bound on RMSD may be used, when restricting the search of similarity to a reasonably low RMSD threshold value, to speed up similarity searches significantly. Tests are performed on large sets of decoys which are widely used as test cases for predictive methods, with a speed-up of up to 100 times with respect to all-vs-all comparison depending on the set and parameters used. Sample applications are shown.
The algorithm presented here allows fast comparison of large data sets of structures with limited memory requirements. As an example of application we present clustering of more than 100000 fragments of length 5 from the top500H dataset into few hundred representative fragments. A more realistic scenario is provided by the search of similarity within the very large decoy sets used for the tests. Other applications regard filtering nearly-indentical conformation in selected CASP9 datasets and clustering molecular dynamics snapshots.
A linux executable and a Perl script with examples are given in the supplementary material (Additional file 1). The source code is available upon request from the authors.
Computational predictions and simulations of biological systems entail a wide variety of processes and length and time scales. Going down from ecological systems, to organisms, organs and cells and subcellular components, the lowest level description of biological systems is in terms of single molecules and atoms . At this level, the structure and dynamics of biomolecules and biocomplexes are of utmost importance in determining their function, whose knowledge and elucidation is the ultimate goal of structural biology. Experimental methods for structural characterization of biomolecules are often too slow or have limitations in targets and resolution that cannot be overcome. For these reasons one often resorts to computational predictions or simulations.
A common feature of computational methods is that they generate a large number, typically in the range of thousands, of molecular models which are meant as samples of the large conformational space of a molecule or of a complex.
As a consequence, clustering of different conformations of the same molecular structure is a frequently performed task which allows on the one hand to reduce the number of conformations to be subjected to further analysis, and on the other hand to choose the most representative conformations among many . Clustering is mostly performed based on pairwise distance, see e.g. the GROMACS package manual for some widely used methods (URL: http://www.gromacs.org). In many cases a dissimilarity criterion is used instead of a similarity criterion (see e.g. for a general discussion [3, 4]). These issues are well illustrated in the fields of protein structure predictions and molecular dynamics simulations.
For instance, prediction of protein structure is typically accomplished in two steps: first a large number of plausible models (possibly including near-native models) is generated and afterwards models are scored and ranked . The latter step often considers each model as representative of an ensemble of similar conformations and some kind of weighted average is performed to pick up the best model (see e. g. [6–9]). For what concerns proteins the field of structural predictions has been largely explored in the last decades and the world-wide experiment CASP (Critical Assessment of Structural Predictions) has set standards for evaluation and has provided extensive evaluation of methods and approaches . Since 2006, among other prediction categories, the model Quality Assessment (QA) category has been introduced where predictors are asked to evaluate the quality of the many models (in the range of hundreds) proposed by servers for each target sequence. In recent rounds of CASP consensus methods have consistently scored better than single structure methods. Consensus may be taken on the score given by different model quality assessment programs or on the similarity of the models within the ensemble itself. Actually in the last CASP8 and CASP9 experiments the top performing model quality estimators were those using clustering/consensus methods on the ensemble of deposited models. Using the similarity among different predictive models was recognized earlier as a significant improvement factor for assessing the quality of a model . Since then methods have known continuous evolution up to the recent remarkable achievements in blind assessment experiment by the programs QmeanClust [12, 13], 3D-Jury , Pcons , ModFoldClust  and others . In these methods the similarity among predictive models, as well as scores obtained sometimes by different, possibly independent, sources, is used in different ways tailored on the specific model quality assessment method used. In some, but not all the methods, only similarities above a given threshold are used . The reader is referred to the original literature for details.
Somewhat surprisingly, however, a naive consensus method ranking the models based on the average similarity with other models in the ensemble was found to perform like the best consensus model quality estimators  as earlier suggested by Elofsson and coworkers . Structural comparisons appear therefore very important for proper choice or scoring of predictive models.
Similarly to predictive tasks, in molecular dynamics simulations typically many conformations are generated, in the range of thousands, by taking snapshots of the trajectory at given time intervals. Here clustering of different conformations of the same molecular structure is performed to reduce the number of conformations to be subjected to further analysis (e.g. docking simulations), or to choose the most representative conformations among many .
The measures of similarity that are typically used for proteins have been set in the context of the CASP experiment. The most straightforward measure of (dis)similarity is the root mean square deviation (RMSD) of corresponding atoms after optimal superposition of two molecular structures. This measure requires only the definition of the set of atoms to be superimposed and the set of atoms on which to compute the RMSD. The RMSD is considered sometimes less sensitive than other measures because well modeled parts of the protein will be not represented by the RMSD dominated by large deviations and, viceversa, large deviations in small regions will not contribute significantly to the RMSD computed from averaging over a typically much larger set of atoms.
Several other measures of similarity have been proposed to give a representation of similarity which reflects the fraction of well modeled residues. In this respect MaxSub , GDT_TS  and TM-score  are considered more appropriate than the RMSD. The details and motivations of these similarity scores are reported in the original papers and they will not be repeated here. Suffice it to say that these methods assign a score based on the maximum number of residues that can be superimposed at one or more given distance threshold. It is worth to remind that these scores are assigned through iterative algorithms that require by far more computations than a single RMSD computation. For this reason we will consider here the widely used definition of distance as the RMSD between corresponding Cα (CA) atomic positions after optimal superposition. However, as long as a dissimilarity measure is a metric the methodology reported hereafter applies.
Once a distance definition has been chosen, in the consensus procedures or representative model selections discussed above, the all-vs-all pairwise similarity computations are straightforward to implement, but the time required to analyse a set of, say, 10000 structures may become exceedingly large, requiring approximately 50 × 106 comparisons.
There are two issues to consider with all-vs-all comparisons. The first is the quadratic number of comparisons and the second is that RMSD computation is time consuming itself.
The latter aspect has been investigated and, compared to earlier methods of RMSD calculation [20, 21], significantly faster solutions which avoid finding the optimal rotation matrix, have been proposed based on quaternion formalism .
An approximated solution to the all-vs-all comparison problem has been proposed by Li and Zhou , who avoid finding the minimum RMSDs altogether, but rather choose randomly a reference structure, superpose only once all structures on the reference one (using N - 1 rototranslations) and compute the RMSD without any further rototranslation. The authors showed that the choice of reference structure is almost ininfluential, at least on their datasets, and that the RMSD computed in this way furnishes a good approximation to the minimum RMSD.
Here we improve the efficiency of all-vs-all comparisons by considering that often only good similarities are searched for. If a threshold RMSD value is set reasonably low, compared to the average random RMSD in the set, many comparisons can be avoided by using the inverse triangular inequality satisfied by the RMSD. We exploit this fact and we show that a significant speed-up is obtained on large decoy sets. An application to clustering of a very large number of fragments to a representative set is presented as an example. A more realistic scenario where different models for the same protein (resulting from structural predictions or molecular dynamics simulations) is provided by the search of similarity within very large decoy sets. Other applications like filtering nearly identical structures in a set or clustering molecular dynamics snapshots are shown as examples.
Given a set of N decoys a straighforward similarity search would require structural comparisons. After the first structure is compared with all the others, however, the corresponding set of pairwise RMSDs is available. We use the set of pairwise comparisons with the common reference structure to exclude from comparison those pairs of structures whose RMSD is surely above the threshold.
(In the appendix we provide a brief demonstration of the inverse triangle inequality for RMSD opt ). Based on the inverse triangle inequality, if we have computed with i = 2, .., N it is possible to exclude from further comparison all pairs i and j for which is larger than the chosen threshold t. By using this principle similarity searches may be sped up significantly.
Let us have an ensemble of N structures to be searched for structural similarity and let us assume that the distribution of pairwise RMSDs within the ensemble is f(x) (f(x) has been found by us in many decoys sets to be close to a normal distribution).
The algorithm we will describe iterates over steps requiring first a one-vs-all comparison (N - 1 comparisons when we have N structures) and then using the list of RMSDs to avoid unnecessary comparisons. In the latter task the number of comparisons depends, obviously, on the RMSDs distribution, but also on the use we make of the already computed RMSDs.
If the list of RMSDs is sorted, the structures corresponding to lowest RMSD will be the more distant from other structures and the inverse triangular inequality will make most comparisons unnecessary. On the other hand as we consider structures whose RMSD is in the most populated region of the distribution there will be many other structures with similar RMSD and the inverse triangular inequality will make many fewer comparison unnecessary.
It will be therefore more efficient to use the inverse triangular inequality only on the structures displaying an RMSD below a certain value s.
where s is the RMSD where we stop using the inverse triangular inequality on the already computed RMSDs and we move to the next step calculating RMSDs for a single structure and using the newly computed RMSDs.
At each step s should maximize the ratio based on the observed distribution of RMSDs. In practice the value s, i.e. the maximum RMSD whose corresponding structure is compared to all others using the inverse triangular inequality, is chosen when the ratio , which is computed while the comparisons are done, reaches a maximum.
Hereafter the implementation described above is reported in C-like pseudocode:
/* Initialization */
for(i = 0; i < n_structures; i++)
done[i] = 0;
/* Iterations */
for(i1 = 0; i1 < n_structures; i1++)
n_cmp = 0;
k = 0;
for(i2 = i1 + 1; i2 < n_structures; i2++)
index[k] = i2
rmsd[k] = RMSD(struct i1, struct i2)
n_cmp = n_cmp + 1
if(rmsd <= threshold )
output(i1, i2, rmsd)
k = k + 1
done[i1] = 1
n_left = k
index_rmsd <- (index, rmsd)
sort index_rmsd by rmsd
ratio = 0
for(j = 0; ((j+1)/n_cmp) > ratio && j<n_left; j++)
ratio = ((j+1)/n_cmp)
for(k=j+1; k<n_left; k++)
if ((index_rmsd.rmsd[k] - index_rmsd.rmsd[j]) <= threshold)
rmsd = RMSD(index_rmsd.index[k], index_rmsd.index[j])
n_cmp = n_cmp + 1
if(rmsd <= threshold)
output(index _rmsd.index[k], index_rmsd.index[j], rmsd)
done[index_rmsd[j].index] = 1
Better schemes keeping track of all the already computed RMSDs would require larger memory. The algorithm has been tested on decoy sets for small proteins downloaded from the Decoys'r'us database , representing a realistic step in a clustering scenario.
2 Results and discussion
2.1 Tests on decoy sets
The effect of the choice of the threshold has been tested on the relatively small 4state_reduced decoy datasets  which include on average 665 decoys for each target protein.
Number of RMSD computations for the 4state_reduced decoy dataset with varying RMSD threshold.
t = 2.4 Å
t = 2.8 Å
t = 3.2 Å
t = 3.6 Å
The method has been tested on the semfold decoy datasets which include on average 12900 decoys and in one case more than 20000 decoys .
Number of RMSD computations for the semfold decoy dataset.
2.2 Clustering fragments from the top500H dataset
As expected the above tests show that for lower RMSDs application of the inverse triangular inequality is very efficient in reducing the number of comparisons to be performed. The idea is exploited here to cluster short protein fragments.
Fifirst RMSDs below a very low threshold are computed. The computations will be many less than because the inverse triangular inequality is used in the best way at low thresholds.
where is the weight before clustering (i.e. 1 in the first clustering and greater or equal to 1 in subsequent clusterings) and t is the threshold RMSD chosen. This scheme is similar to others suggested for weighting similar structures  or for choosing the most representative fragment among similar ones [7, 8]. The list of fragments is then sorted by the weight and for each fragment, starting with the one with larger weight, all other fragments with RMSD less than threshold from the reference one are clustered together and represented by the reference one.
The RMSD threshold is increased and the procedure repeated until a single cluster is found.
As an example we have taken the dataset obtained by considering all continuous five residues fragments from the proteins in the top500H dataset  which includes 500 curated non redundant protein structures obtained by X-ray crystallography with resolution better than 1.8 Å and with few deviations from ideal geometry.
We have chosen a length of five following Micheletti et al.  who showed that a small dataset of five-residues fragments is able to represent accurately all five-residues fragments. Many of these fragments are nearly identical because of secondary structure elements and therefore the task of clustering residues should involve either a filtering or a large number of comparisons. The number of fragments is 107184 which implies, for an all-vs-all comparison ca. 5.7 billion RMSDs computations.
Our algorithm was able to cluster at various thresholds the fragments in a couple of hours on a laptop, with minimal memory requirements.
RMSD threshold (Å)
Fragment clustering on backbone.
RMSD threshold (Å)
From Tables 3 and 4 it is immediately seen how effective is the fast structural similarity search proposed in this work compared to all-vs-all comparisons. The number of actual computations is just 2% and 6% of that implied by all-vs-all comparisons for data in Tables 3 and 4, respectively.
Overall these results confirm the reliability of the methodology whose implementation is made possible by the fast computation of structural similarities.
2.3 Filtering predictive models for model quality assessment
where RMSD 100 is the estimated RMSD when 100 residues are aligned and N is the number of aligned residues. When N is less than 14 the RMSD is set to a maximum value.
We chose all server predictions for FM (free modeling) targets in the recent CASP9 experiment (i.e. T0529, T0531, T0534, T0537, T0544, T0547, T0550, T0553, T0555, T0561, T0571, T0578, T0581, T0604, T0608, T0616, T0618, T0621, T0624, T0629, T0637 and T0639). We assume that FM models that display a normalized pairwise RMSD below 1.0 Å are not independent and therefore only one representative model should be considered. In this way all nearly identical models are identified. In the sets considered there are many models (deposited by the same predictors) that are identical.
The effectiveness of the procedure may be judged by considering the number of comparisons performed (115399) versus those to be performed in an all-vs-all comparison (1040520) which amounts to a ratio of 0.11. The procedure allows to remove on average 40 predictive models that have a nearly identical model in each set.
2.4 Clustering molecular dynamics snapshots
The analysis of molecular dynamics trajectories often involves the comparison and clustering of thousands of snapshots. Clustering is mostly based on the analysis of pairwise RMSDs. When only closest similar conformations are to be grouped together the method proposed here strongly reduces the number of comparisons to be performed.
In the following two illustrative applications of the method are given.
We consider the decoy set vhp_mcmd , a decoy set containing alternative conformations for the villin headpiece obtained from snapshots of long molecular dynamics runs in explicit solvent starting from five different conformations, including the native one. The decoy set includes 6255 conformations. In order to choose the most representative ones we consider similarities at 1.0 Å threshold. The search at this similarity level is performed with 1717823 comparison, i.e. 0.088 times the comparisons required by a straightforward all-vs-all comparison (6255 × 3127). Clustering of structures based on the obtained similarities allows to identify for each of the five simulations a representative structure. Among these the largest cluster (245 out of 651 structures) is found for the simulation starting from the native structure, which highlights the lesser conformational dispersion for simulations starting from the native structure with respect to those starting from non-native structures.
We consider as another application the analysis of the simulation of cis-trans isomerization of Pro 32 in β2-microglobulin that drives a switch in the hydrogen bonding network of residues Arg 3, Thr 4, His 31, Pro 32, His 84 and Thr 86 . We consider the pairwise RMSD after superposition of all the N, CA, C and O atoms of the above six residues. We set the threshold at 0.4 Å to spot significantly different conformations. The number of comparison performed is 0.25 of those required by an all-vs-all comparison. The first two most representative clustered conformations are relative to the native cis Pro 32 conformation and non-native trans Pro 32 conformation, thus allowing selection of most representative conformations at the selected region.
The proposed method for fast structural similarity search below a given distance threshold makes use of the inverse triangular disequality in order to avoid unnecessary comparisons.
The applications reported here shows that the method is able to save many comparisons whenever the distance threshold is set much smaller than the average distance between structures in the set.
The advantage of the method compared to all-vs-all comparisons is more and more evident with larger datasets.
The method seems therefore mostly suited to calculate fastly similarities among large ensembles of conformers, e.g. those obtained by predictive softwares like Rosetta  where thousands of models are typically generated, or for large structural database analysis. In this respect the first three applications described in the Results section appear particularly suited for the method. Analysis of molecular dynamics snapshot is also an important field of application. The only drawback of the method is that the threshold distance between conformations must be much lower than the average distance for the method to be efficient. In this respect the comparison of conformations of subsystems (that display conformational mobility) or comparison at a very low threshold distance (aiming at identifying nearly identical conformers) appear to be the situation of choice for the application of the method, as shown in the examples of the Results section.
The main limitation of the method is that it relies on the use of the metric properties of the dissimilarity measure. For this reason it is not evident how to extend the application in order to make use of useful similarities measures like MaxSub , GDT_TS  and TM-score . Turning the latter measures into proper metrics is a prerequisite for using the method in predictive contexts like that set up in the CASP experiment. On the other hand it is apparent that the method may be used also in other contexts whenever a metric can be defined and similarities at low distances (compared to the average pairwise distance) are sought.
A reference program for Linux, using, with minor modifications, the RMSD calculation routines of Theobald , is available as supplementary material (Additional file 1). The source code is available upon request from the authors. A Perl script implementing the algorithm is provided with a subroutine dist() that wraps any external measure of distance.
6 Authors' contribution
FF conceived the study and implemented the algorithm, AC, PV and GE contributed design of the study and the tests. All authors read and approved the manuscript.
Let us consider two structures i and j which have been optimally superposed on structure 1, so that and are known. We look for a lower bound on based on and .
where r ik and r jk are the vectors of atom k in structure i and j, respectively, and is the rotation that superposes optimally, i.e. with the least RMSD, structure i onto structure j.
Let us define as Note that because the norm is invariant under rotations.
We thank Dr. I. M. Lait for valuable support.
- Fogolari F: Computers and Biosciences. Frascati Physics Series 2006, 40:275–280.
- Leach AR: Molecular modelling: principles and applications. 2nd edition. Harlow, Essex, UK: Pearson Education Limited; 2001.
- Jain AK, Murty NM, Flynn PJ: Data clustering: a review. ACM Comput Surv 1999, 31:264–323.View Article
- Zhao Y, Karypis G: Data clustering in life science. Mol Biotechnol 2005, 31:55–80.PubMedView Article
- Bonneau R, Baker D: Ab initio protein structure prediction: progress and prospects. Ann Rev Biophys Biomol Struct 2001, 30:173–189.View Article
- Shortle D, Simons KT, Baker D: Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci 1998, 95:11158–11162.PubMedView Article
- Xiang Z, Soto CS, Honig B: Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc Natl Acad Sci USA 2002, 99:7432–7437.PubMedView Article
- Fogolari F, Tosatto SC: Application of MM/PBSA colony free energy to loop decoy discrimination: toward correlation between energy and root mean square deviation. Protein Sci 2005, 14:889–901.PubMedView Article
- Kryshtafovych A, Fidelis K, Tramontano A: Evaluation of model quality assessment predictions in CASP9. Proteins: Struct Funct Bioinf 2011, 79S:91–106.View Article
- Moult J: A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Op Struct Biol 2005, 15:285–289.View Article
- Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A: Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci 2001, 10:2354–2362.PubMedView Article
- Benkert P, Tosatto SC, Schomburg D: QMEAN: A comprehensive scoring function for model quality assessment. Proteins 2008, 71:261–277.PubMedView Article
- Benkert P, Schwede T, Tosatto SC: QMEANclust: estimation of protein model quality by combining a composite scoring function with structural density information. BMC Struct Biol 2009, 9:35.PubMedView Article
- Ginalski K, Elofsson A, Fischer D, Rychlewski L: 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 2003, 19:1015–1018.PubMedView Article
- Larsson P, Skwark MJ, Wallner B, Elofsson A: Assessment of global and local model quality in CASP8 using Pcons and ProQ. Proteins 2009,77(Suppl 9):167–172.PubMedView Article
- McGuffin LJ: The ModFOLD server for the quality assessment of protein structural models. Bioinformatics 2008, 24:586–587.PubMedView Article
- Siew N, Elofsson A, Rychlewski L, Fischer D: MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 2000, 16:776–785.PubMedView Article
- Zemla A: LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res 2003, 31:3370–3374.PubMedView Article
- Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality. Proteins 2004, 57:702–710.PubMedView Article
- Kabsch W: A solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 1976, 32A:922–923.View Article
- Kabsch W: A discussion of the solution for the best rotation to related two sets of vectors. Acta Crystallogr A 1978, 34:827–828.View Article
- Theobald DL: Rapid calculation of RMSDs using a quaternion-based characteristic polynomial. Acta Crystallogr A 2005, 61:478–480.PubMedView Article
- Li H, Zhou Y: SCUD: fast structure clustering of decoys using reference state to remove overall rotation. J Comput Chem 2005, 26:1189–1192.PubMedView Article
- Edwards J, Chatham G, Glunt W, McDonald D, Wells C, Hayden T: Sampling properties of the alternating projection distance geometry algorithm applied to unconstrained polypeptide chains. Computers Chem 1997, 21:115–124.View Article
- Steipe B: A revised proof of the metric properties of optimally superimposed vector sets. Acta Crystallogr A 2002, 58:506.PubMedView Article
- Kaindl K, Steipe B: Metric properties of the root-mean-square deviation of vector sets. Acta Crystallogr A 1997, 53:809.View Article
- Samudrala R, Levitt M: Decoys 'R' us: a database of incorrect protein conformations to improve protein structure prediction. Protein Sci 2000, 9:1399–1401.PubMedView Article
- Park B, Levitt M: Energy function that discriminate X-ray and Near-native folds from well-constructed decoys. J Mol Biol 1996, 258:367–392.PubMedView Article
- Samudrala R, Levitt M: A comprehensive analysis of 40 blind protein structure predictions. BMC Struct Biol 2002, 2:3–18.PubMedView Article
- Micheletti C, Seno F, Maritan A: Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies. Proteins 2000, 40:662–674.PubMedView Article
- Lovell SC, Davis IW, Arendall WB, de Bakker PI, Word JM, Prisant MG, Richardson JS, Richardson DC: Structure validation by Calpha geometry: phi,psi and Cbeta deviation. Proteins 2003, 50:437–450.PubMedView Article
- Carugo O, Pongor S: A normalized root-mean-square distance for comparing protein three-dimensional structures. Protein Sci 2001, 10:1470–1473.PubMedView Article
- Fogolari F, Tosatto SC, Colombo G: A decoy set for the thermostable subdomain from chicken villin headpiece, comparison of different free energy estimators. BMC Bioinformatics 2005, 6:301.PubMedView Article
- Fogolari F, Corazza A, Varini N, Rotter M, Gumral D, Codutti L, Rennella E, Viglino P, Bellotti V, Esposito G: Molecular dynamics simulation of β 2-microglobulin in denaturing and stabilizing conditions. Proteins 2011, 79:986–1001.PubMedView Article
- Leaver-Fay A, Tyka M, Lewis SM, Lange OF, Thompson J, Jacak R, Kaufman K, Renfrew PD, Smith CA, Sheffler W, Davis IW, Cooper S, Treuille A, Mandell DJ, Richter F, Ban YE, Fleishman SJ, Corn JE, Kim DE, Lyskov S, Berrondo M, Mentzer S, Popovic Z, Havranek JJ, Karanicolas J, Das R, Meiler J, Kortemme T, Gray JJ, Kuhlman B, Baker D, Bradley P: ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Meth Enzymol 2011, 487:545–574.PubMedView Article