Given a set of N decoys a straighforward similarity search would require structural comparisons. After the first structure is compared with all the others, however, the corresponding set of pairwise RMSDs is available. We use the set of pairwise comparisons with the common reference structure to exclude from comparison those pairs of structures whose RMSD is surely above the threshold.
Let us define
as the RMSD between structures i
, after optimal superposition and with no optimal superposition, respectively. It has been shown by Edwards et al. [24
] and by Steipe and Kaindl [25
] that RMSD
is a metric on the space of the classes of equivalent structures (i. e. structures that can be superimposed exactly by a rototranslation). As a consequence both the triangle inequality
and the inverse triangle inequality
hold. Note that the metric properties of RMSD
are not trivial as for RMSD because a different rototranslation is in principle implied by each RMSD
. As a consequence of RMSD
being a metric, the following inverse triangle inequality holds:
(In the appendix we provide a brief demonstration of the inverse triangle inequality for RMSD
). Based on the inverse triangle inequality, if we have computed with i = 2, .., N it is possible to exclude from further comparison all pairs i and j for which is larger than the chosen threshold t. By using this principle similarity searches may be sped up significantly.
Let us have an ensemble of N structures to be searched for structural similarity and let us assume that the distribution of pairwise RMSDs within the ensemble is f(x) (f(x) has been found by us in many decoys sets to be close to a normal distribution).
The algorithm we will describe iterates over steps requiring first a one-vs-all comparison (N - 1 comparisons when we have N structures) and then using the list of RMSDs to avoid unnecessary comparisons. In the latter task the number of comparisons depends, obviously, on the RMSDs distribution, but also on the use we make of the already computed RMSDs.
If the list of RMSDs is sorted, the structures corresponding to lowest RMSD will be the more distant from other structures and the inverse triangular inequality will make most comparisons unnecessary. On the other hand as we consider structures whose RMSD is in the most populated region of the distribution there will be many other structures with similar RMSD and the inverse triangular inequality will make many fewer comparison unnecessary.
It will be therefore more efficient to use the inverse triangular inequality only on the structures displaying an RMSD below a certain value s.
The number of comparisons N
involved at the step when N
structures have not been compared yet, will be the sum of the (N
- 1) comparisons needed to generate the list of (N
- 1) RMSDs and the number of comparisons that make use, through the inverse triangular inequality, of the RMSD list computed:
where s is the RMSD where we stop using the inverse triangular inequality on the already computed RMSDs and we move to the next step calculating RMSDs for a single structure and using the newly computed RMSDs.
After the comparisons have been made the number of structures whose RMSD below threshold t
with all other structures have been found (N
) will be:
At each step s should maximize the ratio based on the observed distribution of RMSDs. In practice the value s, i.e. the maximum RMSD whose corresponding structure is compared to all others using the inverse triangular inequality, is chosen when the ratio , which is computed while the comparisons are done, reaches a maximum.
Hereafter the implementation described above is reported in C-like pseudocode:
/* Initialization */
for(i = 0; i < n_structures; i++)
done[i] = 0;
/* Iterations */
for(i1 = 0; i1 < n_structures; i1++)
n_cmp = 0;
k = 0;
for(i2 = i1 + 1; i2 < n_structures; i2++)
index[k] = i2
rmsd[k] = RMSD(struct i1, struct i2)
n_cmp = n_cmp + 1
if(rmsd <= threshold )
output(i1, i2, rmsd)
k = k + 1
done[i1] = 1
n_left = k
index_rmsd <- (index, rmsd)
sort index_rmsd by rmsd
ratio = 0
for(j = 0; ((j+1)/n_cmp) > ratio && j<n_left; j++)
ratio = ((j+1)/n_cmp)
for(k=j+1; k<n_left; k++)
if ((index_rmsd.rmsd[k] - index_rmsd.rmsd[j]) <= threshold)
rmsd = RMSD(index_rmsd.index[k], index_rmsd.index[j])
n_cmp = n_cmp + 1
if(rmsd <= threshold)
output(index _rmsd.index[k], index_rmsd.index[j], rmsd)
done[index_rmsd[j].index] = 1
Better schemes keeping track of all the already computed RMSDs would require larger memory. The algorithm has been tested on decoy sets for small proteins downloaded from the Decoys'r'us database , representing a realistic step in a clustering scenario.