An effective sequence-alignment-free superpositioning of pairwise or multiple structures with missing data
Algorithms for Molecular Biology volume 11, Article number: 18 (2016)
Abstract
Background
Superpositioning is an important problem in structural biology. Determining an optimal superposition requires a one-to-one correspondence between the atoms of two protein structures. In practice, however, some atoms are missing from the original structures. Current superposition implementations handle missing data crudely, by simply omitting such atoms from the structures.
Results
In this paper, we propose an effective method for superpositioning pairwise and multiple structures without sequence alignment. It is a two-stage procedure consisting of data reduction and data registration.
Conclusions
Numerical experiments demonstrated that our method is effective and efficient. The code package of the protein structure superposition method for addressing the cases with missing data (PSSM) is implemented in MATLAB and is freely available from: http://sourceforge.net/projects/pssm123/files/?source=navbar
Background
Superposition is a frequently used method for measuring the spatial similarity of three-dimensional objects in fields such as computer vision, image science and molecular biology. Molecular biology employs superposition to support a wide variety of tasks, and superimposing two or more protein structures is a central problem in structural bioinformatics. Superpositioning problems have been explored in many studies [1–5]. The optimal superposition of three-dimensional (3D) conformations of similar structures is necessary in many practical cases. Determining an optimal superposition normally requires a one-to-one correspondence between the atoms in the different structures [6]. The superposition of multiple structures is complicated by the fact that if structure X is superimposed on structure Y and structure Z is superimposed on structure Y, then, in general, structure X is not optimally superimposed on structure Z. In this case, the superposition of X on Z is optimal only if two of the three structures are identical in shape.
A superposition is a particular orientation of objects in three-dimensional space, and there are many approaches to computing it. One is the method proposed by Kabsch [3], which computes the optimal transformation via singular value decomposition of a covariance matrix derived from the coordinates of the corresponding three-dimensional structures. Another approach, proposed by Kearsley [7], uses the algebra of quaternions. Multiple structure superposition programs have many applications, including understanding evolutionary conservation and divergence, functional prediction, automated docking, comparative modeling, protein and ligand design, construction of benchmark data sets, and protein structure prediction [8–11].
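To make the SVD-based approach concrete, the following is a minimal sketch of a Kabsch-style fit for two point sets whose rows are already paired one-to-one. It is an illustration written for this text, not the authors' code: the function name `kabsch`, the use of NumPy, and the toy data are all assumptions.

```python
import numpy as np

def kabsch(P, X):
    """Rigid transform (R, t) minimising sum ||x_i - (R p_i + t)||^2 for paired rows."""
    P = np.asarray(P, dtype=float)
    X = np.asarray(X, dtype=float)
    mu_p, mu_x = P.mean(axis=0), X.mean(axis=0)
    H = (P - mu_p).T @ (X - mu_x)            # 3x3 covariance of the centred coordinates
    U, _, Vt = np.linalg.svd(H)
    # Flip the last singular direction if needed so that R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_x - R @ mu_p
    return R, t

# Toy check (hypothetical data): rotate and shift a random cloud, then recover the transform.
rng = np.random.default_rng(1)
P = rng.normal(size=(20, 3))
angle = np.radians(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
X = P @ R_true.T + t_true
R_est, t_est = kabsch(P, X)
```

Note that this fit presupposes the one-to-one correspondence discussed above, which is exactly what is unavailable when atoms are missing.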
Structure alignment is different from superposition of structures. A structural alignment is the assignment of amino acid residue-residue correspondences between structurally similar proteins [12]. One way to represent an alignment is the familiar row-and-column matrix format, in which sequence alignments use single-letter abbreviations for residues. Alignments of amino acid sequences of proteins play important roles in structural molecular biology, such as the study of evolution in protein families, the identification of patterns of conservation in sequences, homology modeling, and protein crystal structure solution by molecular replacement.
In molecular biology, corresponding residues have similar structures. Many homologous proteins share a common core structure, in which the chain retains the topology of its folding pattern, but varies in geometric details. This retained similarity makes it possible to align the residues of the core. Since the structure of many proteins is still unknown and proteins with similar structural motifs often exhibit similar biological properties even when they are distantly related, structure alignment can help characterize the role of many proteins.
There are two kinds of protein structure alignment: sequence-based alignments and non-sequence-based alignments (e.g., Structal [13], TM-align [13], LovoAlign [13]). For closely related proteins, sequence-based alignments give consistent answers, reflecting evolutionary divergence. For distantly related proteins, however, sequence-based alignments lead to diverse residue correspondences. In this case, we need non-sequence-based alignments. Non-sequential alignments can handle many cases such as reordering of domains and circular permutations [13–15].
Most multiple structure alignment programs are based on pairwise structural alignment programs [16, 17]. Even simplified variants of structure alignment are known to be NP-hard [18, 19]. In many cases, certain residues are missing. For example, one crystal structure of a protein may omit loop regions that are present in another crystal structure of the same protein [20]. Most multiple structural alignment methods divide the problem into two subproblems. The first is to identify multiple corresponding structural elements. The second is to calculate the appropriate rigid-body transformation for each structure to create an optimal superposition.
Structure alignment programs fall into three broad classes. The first class is aligned fragment pair (AFP) chaining methods [21]. The second class is distance matrix methods [22]. The third class includes everything else, such as geometric hashing and methods using secondary structural elements [22]. THESEUS is a software package that accounts for missing data by adopting an expectation-maximization (EM) algorithm [23]. However, the EM algorithm relies on a sequential structure alignment and is highly dependent on the choice of the initial value. In this paper, we propose a new method for non-sequential structure superposition that combines principal component analysis (PCA) with the iterative closest point (ICP) registration technique. The key point of our method is that we treat each protein as a whole structure.
In this work, we propose a simple and efficient protein structure superposition method for addressing the cases with missing data (PSSM). We adopt a two-stage procedure consisting of data reduction and registration techniques to address this problem. We have applied it to the cytochrome C data, Globins family data, Serine Proteinases family data, Fischer's dataset and simulated data to demonstrate its efficiency and accuracy.
Methods
Here we introduce a two-stage method for the optimal superposition of pairwise and multiple structures with incomplete data. In the first stage, the key is to adopt a data reduction technique to obtain a reduced representation that is insensitive to noise and missing residues. Based on this representation, we obtain a rough superposition of pairwise or multiple structures with a least-squares technique. In the second stage, we employ the powerful iterative closest point (ICP) algorithm to further refine the superposition and find the optimal solution (Fig. 1).
The iterative closest point algorithm, originally introduced in the area of computer vision for image registration, can be used in bioinformatics for the alignment of complete protein structures. Bertolazzi et al. [24] used this method for the structural alignment of protein surfaces.
We implemented the method in MATLAB as a package named PSSM.
Discovering rough superpositioning based on principal-axes transform
In this section we introduce principal component analysis, the principal-axes transform technique and the rotational search needed in some cases.
Principal component analysis
Principal component analysis (PCA) is a very popular subspace analysis technique that has been applied successfully in many domains for dimension reduction. It reduces the number of variables in an analysis by describing a series of uncorrelated linear combinations of the variables that contain most of the variance. This reduction is achieved by transforming the original variables into new, uncorrelated variables, the principal components. These new variables are ordered so that the first few retain most of the variation present in all of the original variables.
Computing the principal components of the atomic coordinates gives the principal directions of the point set. We then rotate the points about a principal direction, which gives us an initial orientation for the points. After this step, we employ the iterative closest point algorithm to further refine the superposition and find the optimal solution.
Principal-axes transform
The principal axes of a protein structure are computed directly from its atomic coordinates. The first moment of these points is their center of mass, and the three eigenvectors and eigenvalues of the second moment tensor give the principal axes and their relative lengths. The transform aligns the centers of mass and the principal axes in order of decreasing relative length. The principal axes are coarse shape descriptors and are affected very little by noise or small differences in the structure and region being aligned [13]. The least-squares method was used to align corresponding principal axes. As an example, we demonstrate the alignment of two two-dimensional shapes using the principal-axes transform in Fig. 2.
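The computation of the center of mass and the principal axes described above can be sketched in a few lines. This is an illustrative sketch only: it assumes equal weights for all atoms, uses NumPy, and the function name `principal_axes` and the toy point cloud are hypothetical.

```python
import numpy as np

def principal_axes(coords):
    """Return the centroid, the eigenvalues (descending) and the principal axes (columns)."""
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)                           # first moment (equal weights assumed)
    centered = coords - center
    second_moment = centered.T @ centered / len(coords)    # 3x3 second-moment tensor
    eigvals, eigvecs = np.linalg.eigh(second_moment)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                      # reorder to decreasing relative length
    return center, eigvals[order], eigvecs[:, order]

# Toy point cloud (hypothetical data) stretched mostly along x, then y, then z.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])
center, lengths, axes = principal_axes(points)
```

Each eigenvector is only defined up to sign, so two structures with matching axes can still differ by a flip about an axis. This is one reason a principal-axes alignment may need a follow-up search over orientations.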
Rotational search strategy
The principal-axes transform is expected to yield a correct rough superposition for many initial orientations. However, it may fail in some cases. To handle these, we use a rotational search strategy that tests multiple orientations. The axis of rotation is the line through the point (0, 0, 0) (the geometric center) and u (a linear combination of the eigenvectors of one protein). The rotation step is set to \(10^{\circ }\). In practice, the principal-axes alignment method is applied first. If the resulting superposition does not reach an RMSD (root mean squared deviation) below a given value, a rotational search step is performed and the principal-axes alignment method is applied again.
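As a sketch of the rotational search, the candidate orientations can be generated with Rodrigues' rotation formula about an axis through the origin. The code below assumes the coordinates have already been centred at the geometric center; the function names and the single-point example are hypothetical.

```python
import math

def rotation_about_axis(axis, angle_deg):
    """Rodrigues' formula: 3x3 matrix for a rotation about `axis` through the origin."""
    norm = math.sqrt(sum(c * c for c in axis))
    x, y, z = (c / norm for c in axis)
    t = math.radians(angle_deg)
    c, s, v = math.cos(t), math.sin(t), 1.0 - math.cos(t)
    return [
        [c + x * x * v,     x * y * v - z * s, x * z * v + y * s],
        [y * x * v + z * s, c + y * y * v,     y * z * v - x * s],
        [z * x * v - y * s, z * y * v + x * s, c + z * z * v],
    ]

def rotate(points, matrix):
    """Apply a 3x3 matrix to every point (points are assumed centred at the origin)."""
    return [tuple(sum(matrix[r][k] * p[k] for k in range(3)) for r in range(3))
            for p in points]

def candidate_orientations(points, axis, step_deg=10):
    """Yield (angle, rotated points) for every multiple of step_deg in [0, 360)."""
    for angle in range(0, 360, step_deg):
        yield angle, rotate(points, rotation_about_axis(axis, angle))

# Example: 36 candidate poses of a single point rotated about the z-axis.
poses = list(candidate_orientations([(1.0, 0.0, 0.0)], axis=(0.0, 0.0, 1.0)))
```

Each candidate pose would then be passed to the principal-axes alignment and refinement steps, and the search stops once the RMSD criterion is met.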
Structures with random rotations
To show the effectiveness of the PSSM method, we use random rotation matrices to generate a randomly oriented copy of a structure. A random orthogonal matrix is generated with a MATLAB expression [i.e., orth(rand(3,3))]. Such a matrix changes the position and orientation of the points.
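For readers who want to reproduce this kind of test outside MATLAB, the sketch below draws a random rotation in a different way from the expression above: it normalises four Gaussian numbers into a unit quaternion and converts it to a rotation matrix. This is a swapped-in technique, not the authors' procedure. One reason to prefer it is that a general orthogonal matrix may have determinant -1 (a reflection), whereas a matrix built from a unit quaternion always has determinant +1. All function names are hypothetical.

```python
import math
import random

def quaternion_to_rotation(q):
    """Rotation matrix of the unit quaternion q = [q0, q1, q2, q3]."""
    q0, q1, q2, q3 = q
    return [
        [q0*q0 + q1*q1 - q2*q2 - q3*q3, 2*(q1*q2 - q0*q3), 2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3), q0*q0 + q2*q2 - q1*q1 - q3*q3, 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2), 2*(q2*q3 + q0*q1), q0*q0 + q3*q3 - q1*q1 - q2*q2],
    ]

def random_rotation(rng):
    """Draw a random proper rotation from a normalised 4D Gaussian sample."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    q = [c / norm for c in q]
    if q[0] < 0:                      # enforce q0 >= 0; q and -q give the same rotation
        q = [-c for c in q]
    return quaternion_to_rotation(q)

def determinant(m):
    """Determinant of a 3x3 matrix, used to check for a proper rotation."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

R = random_rotation(random.Random(0))
```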
Refining the superpositioning based on iterative closest point algorithm
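The iterative closest point (ICP) algorithm is based on quaternions [25]. The unit quaternion is a four-vector \(\vec {q_R}=[q_0,q_1,q_2,q_3]^T\), where \(q_0 \ge 0\) and \(q_0^2+q_1^2+q_2^2+q_3^2=1\). The \(3\times 3\) rotation matrix generated by a unit rotation quaternion is
\[R=\begin{bmatrix} q_0^2+q_1^2-q_2^2-q_3^2 & 2(q_1q_2-q_0q_3) & 2(q_1q_3+q_0q_2)\\ 2(q_1q_2+q_0q_3) & q_0^2+q_2^2-q_1^2-q_3^2 & 2(q_2q_3-q_0q_1)\\ 2(q_1q_3-q_0q_2) & 2(q_2q_3+q_0q_1) & q_0^2+q_3^2-q_1^2-q_2^2 \end{bmatrix}.\]
Let \(\vec {q_T}=[q_4,q_5,q_6]^T\) be a translation vector. The complete registration state vector \(\vec {q}\) is denoted as \(\vec {q}=[\vec {q_R},\vec {q_T}]^T.\) Let \(P=\{{\vec p_i}\}_{i=1}^{N_p}\) be a measured data point set to be aligned with a model point set \(X=\{{\vec x_i}\}_{i=1}^{N_x}\), where \(N_x=N_p\) and each point \(\vec p_i\) corresponds to the point \(\vec x_i\) with the same index. The mean square objective function to be minimized is
\[f(\vec {q})=\frac{1}{N_p}\sum _{i=1}^{N_p}\Vert \vec x_i-R(\vec {q_R})\vec p_i-\vec {q_T}\Vert ^2.\]
Define the centers of mass \(\vec {\mu _p}\) and \(\vec {\mu _x}\) of P and X by
\[\vec {\mu _p}=\frac{1}{N_p}\sum _{i=1}^{N_p}\vec p_i,\qquad \vec {\mu _x}=\frac{1}{N_x}\sum _{i=1}^{N_x}\vec x_i.\]
The cross-covariance matrix \(\Sigma _{px}\) of the sets P and X is given by
\[\Sigma _{px}=\frac{1}{N_p}\sum _{i=1}^{N_p}\left[ (\vec p_i-\vec {\mu _p})(\vec x_i-\vec {\mu _x})^T\right] =\frac{1}{N_p}\sum _{i=1}^{N_p}\left[ \vec p_i\vec x_i^T\right] -\vec {\mu _p}\vec {\mu _x}^T.\]
The symmetric \(4\times 4\) matrix \(Q(\Sigma _{px})\) is
\[Q(\Sigma _{px})=\begin{bmatrix} \mathrm{tr}(\Sigma _{px}) & \Delta ^T\\ \Delta & \Sigma _{px}+\Sigma _{px}^T-\mathrm{tr}(\Sigma _{px})I_3 \end{bmatrix},\]
where \(\Delta =[A_{23}\ A_{31}\ A_{12}]^T\) and \(A_{i,j}=(\Sigma _{px}-\Sigma _{px}^{T})_{i,j}\). \(I_3\) is the \(3\times 3\) identity matrix. The unit eigenvector, denoted as \(\vec {q_R}=[q_0,q_1,q_2,q_3]^T\), corresponding to the maximum eigenvalue of the matrix \(Q(\Sigma _{px})\) is selected as the optimal rotation. The optimal translation vector is given by
\[\vec {q_T}=\vec {\mu _x}-R(\vec {q_R})\vec {\mu _p}.\]
This least-squares quaternion operation is \(O(N_p)\) and is denoted as
\[(\vec {q},d_{ms})=\mathcal {Q}(P,X),\]
where \(d_{ms}\) is the mean square point matching error. The notation \(\vec {q}(P)\) is used to denote the point set P after transformation by the registration vector \(\vec {q}\).
Let d be the distance metric between an individual data point \(\vec {p}\) and a model shape X; then \(d(\vec {p},X)\) is defined as
\[d(\vec {p},X)=\min _{\vec x\in X}\Vert \vec x-\vec {p}\Vert .\]
The closest point in X is denoted \(\vec {y}\), such that \(d(\vec {p},\vec {y})=d(\vec {p},X)\). Let Y be the resulting set of corresponding points (the set of all closest points), and let \(\mathcal {C}\) be the closest point operator; then
\[Y=\mathcal {C}(P,X).\]
The least-squares registration is then computed as described above:
\[(\vec {q},d)=\mathcal {Q}(P,Y).\]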
The positions of the data shape point set are then updated via \(P={ \vec{q} }(P)\).
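The quaternion-based least-squares registration described above can be sketched as follows, assuming the standard formulation of Ref. [25]. The code is illustrative rather than the authors' MATLAB implementation; the function names `quaternion_to_rotation` and `register`, the NumPy dependency and the toy data are assumptions.

```python
import numpy as np

def quaternion_to_rotation(q):
    """3x3 rotation matrix generated by the unit quaternion q = [q0, q1, q2, q3]."""
    q0, q1, q2, q3 = q
    return np.array([
        [q0*q0 + q1*q1 - q2*q2 - q3*q3, 2*(q1*q2 - q0*q3), 2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3), q0*q0 + q2*q2 - q1*q1 - q3*q3, 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2), 2*(q2*q3 + q0*q1), q0*q0 + q3*q3 - q1*q1 - q2*q2],
    ])

def register(P, X):
    """Least-squares registration of P onto X (paired rows). Returns (q_R, q_T, d_ms)."""
    P = np.asarray(P, dtype=float)
    X = np.asarray(X, dtype=float)
    mu_p, mu_x = P.mean(axis=0), X.mean(axis=0)
    sigma = (P - mu_p).T @ (X - mu_x) / len(P)       # cross-covariance Sigma_px
    A = sigma - sigma.T                              # antisymmetric part
    delta = np.array([A[1, 2], A[2, 0], A[0, 1]])    # [A_23, A_31, A_12]
    Q = np.empty((4, 4))
    Q[0, 0] = np.trace(sigma)
    Q[0, 1:] = delta
    Q[1:, 0] = delta
    Q[1:, 1:] = sigma + sigma.T - np.trace(sigma) * np.eye(3)
    _, eigvecs = np.linalg.eigh(Q)                   # eigenvalues in ascending order
    q_R = eigvecs[:, -1]                             # eigenvector of the maximum eigenvalue
    if q_R[0] < 0:                                   # enforce q0 >= 0
        q_R = -q_R
    R = quaternion_to_rotation(q_R)
    q_T = mu_x - R @ mu_p
    d_ms = np.mean(np.sum((X - (P @ R.T + q_T)) ** 2, axis=1))
    return q_R, q_T, d_ms

# Toy check (hypothetical data): recover a known quaternion and translation.
rng = np.random.default_rng(2)
P = rng.normal(size=(15, 3))
q_true = np.array([0.9, 0.1, -0.3, 0.3])
q_true /= np.linalg.norm(q_true)
t_true = np.array([0.5, 0.0, -1.0])
X = P @ quaternion_to_rotation(q_true).T + t_true
q_R, q_T, d_ms = register(P, X)
```

The cost is dominated by the sums over the points, while the eigen-decomposition is of a fixed \(4\times 4\) matrix, which is consistent with the \(O(N_p)\) cost quoted in the text.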
Algorithm 3.1 ICP procedure

1. Given the point set P with \(N_p\) points \({\vec {p}}\) from the data shape and the model shape X.
2. The iteration is initialized by setting \(P_0=P,~\vec q_0=[1,0,0,0,0,0,0]^T\) and \(k=0\). The registration vectors are defined relative to the initial data set \(P_0\) so that the final registration represents the complete transformation. Steps (a)–(d) below are applied until convergence within a tolerance \(\tau \).
(a) Compute the closest points: \(Y_k=\mathcal {C}(P_k,X)\), where \(\mathcal {C}\) denotes the closest point operator.
(b) Compute the registration: \((\vec {q}_{k},d_{k})=\mathcal {Q}(P_{0},Y_{k}).\)
(c) Apply the registration: \(P_{k+1}=\vec {q}_{k}(P_{0})\).
(d) Terminate the iteration when the change in mean-square error falls below a preset positive threshold \(\tau \) (i.e. \(\Vert d_{k}-d_{k+1}\Vert <\tau \)), which specifies the desired precision of the registration; otherwise, set k = k+1 and go to step (a).
It is worth noting that in Eq. (8) \(\mathcal {C}\) is not a unique map from P to X, but this does not influence the algorithm. The ICP algorithm does not require a one-to-one correspondence between P and X. It was proved in Ref. [25] that the ICP algorithm always converges monotonically to a local minimum with respect to the mean square distance objective function. Our superpositioning algorithm also worked well in all of our numerical experiments.
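The ICP loop of Algorithm 3.1 can be sketched as below. This is a minimal illustration under stated assumptions, not the PSSM code: closest points are found by brute force over all pairs, the registration step is the quaternion fit of Ref. [25], and the function names, the tolerance, the iteration cap and the toy data (a data set with five points missing relative to the model) are hypothetical.

```python
import numpy as np

def quaternion_to_rotation(q):
    q0, q1, q2, q3 = q
    return np.array([
        [q0*q0 + q1*q1 - q2*q2 - q3*q3, 2*(q1*q2 - q0*q3), 2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3), q0*q0 + q2*q2 - q1*q1 - q3*q3, 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2), 2*(q2*q3 + q0*q1), q0*q0 + q3*q3 - q1*q1 - q2*q2],
    ])

def register(P, Y):
    """Quaternion least-squares fit of paired rows; returns (R, t, mean square error)."""
    mu_p, mu_y = P.mean(axis=0), Y.mean(axis=0)
    sigma = (P - mu_p).T @ (Y - mu_y) / len(P)
    A = sigma - sigma.T
    delta = np.array([A[1, 2], A[2, 0], A[0, 1]])
    Q = np.zeros((4, 4))
    Q[0, 0] = np.trace(sigma)
    Q[0, 1:] = delta
    Q[1:, 0] = delta
    Q[1:, 1:] = sigma + sigma.T - np.trace(sigma) * np.eye(3)
    R = quaternion_to_rotation(np.linalg.eigh(Q)[1][:, -1])
    t = mu_y - R @ mu_p
    d = np.mean(np.sum((Y - (P @ R.T + t)) ** 2, axis=1))
    return R, t, d

def closest_points(P, X):
    """Closest point operator C: for each row of P, the nearest row of X (brute force)."""
    dist2 = ((P[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return X[dist2.argmin(axis=1)]

def icp(P0, X, tol=1e-12, max_iter=50):
    P0, X = np.asarray(P0, dtype=float), np.asarray(X, dtype=float)
    P, previous, history = P0.copy(), np.inf, []
    for _ in range(max_iter):
        Y = closest_points(P, X)          # step (a)
        R, t, d = register(P0, Y)         # step (b): always relative to P0
        P = P0 @ R.T + t                  # step (c)
        history.append(d)
        if abs(previous - d) < tol:       # step (d)
            break
        previous = d
    return R, t, d, history

# Toy example: the data set lacks 5 of the 40 model points and is slightly displaced.
rng = np.random.default_rng(3)
X = rng.uniform(-5.0, 5.0, size=(40, 3))
a = np.radians(1.0)
R_small = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0, 0.0, 1.0]])
P0 = X[:35] @ R_small.T + np.array([0.05, -0.05, 0.02])
R_est, t_est, d_final, history = icp(P0, X)
```

The toy displacement is deliberately small. For larger initial misalignments the loop can settle in a poor local minimum, which is why a rough superposition is computed before ICP is applied.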
The combined procedure for pairwise and multiple structure superposition
Principal component analysis gives the principal axes of each protein structure. The ICP algorithm is a powerful method for point registration. However, it only converges to a local minimum and is sensitive to the initial value. In the following, we describe the combined procedure for pairwise structure superposition in detail.
Data preprocessing is needed. We download proteins from the National Center for Biotechnology Information (NCBI) database or another database in Protein Data Bank (PDB) format. We extract the 3-dimensional coordinates and store the data in txt format. The MATLAB program runs on Windows 7 with an AMD Athlon(tm) P340 Dual-Core Processor.
Algorithm 3.2 Pairwise structure superposition

1. Input the protein structure data \(P_a\), \(P_b\), and set the initial value k = 1.
2. Employ principal component analysis to find the principal components. For each of the two proteins \(P_a\) and \(P_b\), the eigenvectors and eigenvalues are calculated (\(u_1, u_2, u_3\) for \(P_a\) and \(v_1, v_2, v_3\) for \(P_b\)), and the geometric center is determined.
3. The protein \(P_b\) is rotated. The rotation axis passes through O (0, 0, 0) (the geometric center) and is parallel to the vector v (here, v is \(v_1\), \(v_1\pm v_2\) or \(v_1\pm v_3\)). The rotation step is set to \(10^{\circ }\).
4. For each rotated position of \(P_b\), the eigenvectors and eigenvalues are calculated again. The principal axes of the new \(P_b\) and of \(P_a\) are aligned using the least-squares method.
5. The ICP algorithm is applied.
6. If RMSD \(<c\) (e.g., \(c=1.5\)) or the number of iterations exceeds a preset limit, output the cumulative rotation matrix and translation vector and stop; otherwise, go back to step 3.
7. If RMSD \(>c\) (e.g., \(c=1.5\)) for the whole circle, choose the case with the smallest RMSD and output its rotation matrix and translation vector.
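The control flow of steps 3 to 7 (try candidate orientations, stop at the first one whose refined RMSD falls below the threshold, otherwise keep the best one seen) can be sketched independently of the geometry. In the sketch below, `refine` stands in for steps 4 and 5 and is a hypothetical callable supplied by the caller; the toy `fake_refine` only mimics an RMSD that depends on the starting angle.

```python
def search_best_orientation(candidates, refine, threshold=1.5):
    """candidates: iterable of starting poses; refine(pose) -> (rmsd, transform).

    Returns the first result with rmsd < threshold, or the smallest-RMSD
    result if no candidate reaches the threshold.
    """
    best = None
    for pose in candidates:
        rmsd, transform = refine(pose)
        if best is None or rmsd < best[0]:
            best = (rmsd, transform)
        if rmsd < threshold:
            break                      # step 6: accept and stop searching
    return best                        # step 7: otherwise, smallest RMSD of the whole circle

# Toy stand-in for "principal-axes alignment + ICP": the RMSD is lowest near 40 degrees.
def fake_refine(angle):
    return abs(angle - 40) / 10 + 1.0, angle

accepted = search_best_orientation(range(0, 360, 10), fake_refine, threshold=1.5)
fallback = search_best_orientation(range(0, 360, 10), fake_refine, threshold=0.5)
```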
The multiple structure superposition algorithm is a natural extension of the pairwise one. We suggest using the structure with the median chain length as the template protein. The key idea is to apply pairwise structure superposition between each remaining protein and the template protein. For example, if three proteins X, Y and Z are to be superimposed and X is the middle (template) protein, then Y is superimposed on X and Z is also superimposed on X.
The details of our multiple structure superposition algorithm are as follows:
Algorithm 3.3 Multiple structure superposition

1. Input the protein structures \(C=\{P_1, P_2,\ldots , P_n\}\), \(n\ge 3\).
2. Calculate the length of each protein and sort the proteins by length.
3. Choose the middle-sized protein as the template structure, denoted as \(M_i\). For each protein in C, call the pairwise structure superposition algorithm and output the RMSD between this protein and the template together with the protein's number; the resulting set is denoted as \(T_{i}\). The initial value of i is 1.
4. Sort the proteins in \(T_{i}\) by RMSD in ascending order. If the RMSD \(< c\) (e.g., \(c=1.5\)), put the protein and the corresponding RMSD in set \(S_{i}\). If the RMSD \(>c\), put the protein and the corresponding RMSD in set \(T_{i+1}\).
5. Choose the protein with the largest RMSD in set \(S_{i}\) as the template \(M_{i+1}\). For each protein in \(T_{i+1}\), call the pairwise structure superposition algorithm and update its RMSD in \(T_{i+1}\).
6. Set \(i \leftarrow i+1\) and repeat steps 4 and 5 to update \(M_i\), \(T_i\) and \(S_i\).
7. If \(T_{i}=T_{i+1}\) or \(T_{i}=\emptyset \), stop.
8. Output the rotation matrix R and translation vector T of each protein.
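The template-selection and re-templating logic of Algorithm 3.3 can be sketched as below. Several details are assumptions made for the illustration: `pairwise_rmsd` is a hypothetical callable standing in for the pairwise algorithm, the lower middle element is taken when the number of structures is even, an RMSD exactly equal to c is treated as a rejection, and the sketch records which template each structure was matched to rather than the rotation matrices and translation vectors.

```python
def multiple_superposition(lengths, pairwise_rmsd, c=1.5):
    """lengths: dict name -> chain length; pairwise_rmsd(name, template) -> float.

    Returns (first template, {name: (template used, rmsd)}).
    """
    ordered = sorted(lengths, key=lengths.get)
    template = first = ordered[(len(ordered) - 1) // 2]    # median length
    pending = [name for name in ordered if name != template]
    result = {}
    while pending:
        scored = sorted((pairwise_rmsd(name, template), name) for name in pending)
        accepted = [(r, n) for r, n in scored if r < c]     # set S_i
        rejected = [(r, n) for r, n in scored if r >= c]    # set T_{i+1}
        for r, n in accepted:
            result[n] = (template, r)
        if not accepted:               # no progress, i.e. T_i = T_{i+1}: stop
            for r, n in rejected:
                result[n] = (template, r)
            break
        template = accepted[-1][1]     # largest-RMSD member of S_i becomes M_{i+1}
        pending = [n for _, n in rejected]
    return first, result

# Toy data: structures placed on a line, with "RMSD" equal to their separation.
position = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.2, "e": 4.4}
lengths = {"a": 100, "b": 104, "c": 102, "d": 108, "e": 110}
first, result = multiple_superposition(
    lengths, lambda x, y: abs(position[x] - position[y]))
```

In this toy run, structures that are too far from the first template are picked up by later templates, which is the purpose of steps 4 to 6.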
Performance metrics
There are two parameters to measure the quality of the protein structure superposition: the number of residues that are aligned in the superposition and the average pairwise root mean squared deviation (RMSD) between aligned atoms. Clearly, the goal is to minimize the RMSD while maximizing the number of residues used in the superposition. In the following sections, if we do not mention the number of points used in superposition, the number is the smaller one between a pair of proteins.
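For reference, the RMSD between two equally long lists of aligned coordinates can be computed as in the short sketch below. The function name `rmsd` and the example coordinates are hypothetical, and the sketch assumes the two lists are already superimposed and paired index by index.

```python
import math

def rmsd(a, b):
    """Root mean squared deviation between paired 3D points a[i] and b[i]."""
    if len(a) != len(b) or not a:
        raise ValueError("point lists must be non-empty and of equal length")
    total = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(total / len(a))

# Two points, each displaced by 1 along z.
value = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])
```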
Results
In this section, we test our method PSSM using both simulated data and protein structures from the PDB. We compare it with several typical methods including least squares (LS), \(C_\alpha \)-match [26], CPSARST [27], CCP4 [28], SuperPose [29] and MUSTANG [30].
Results of the simulated data
We used the protein structure d1cih (835 main-chain atoms) as an example, and generated four rotated structures with three random orthogonal rotation matrices \(r_1\), \(r_2\) and \(r_3\) and one specific matrix \(r_4\) representing a 90-degree rotation around the z-axis.
The four rotation matrices \(r_1\), \(r_2\), \(r_3\), \(r_4\) are as follows:
We superimposed the four structures on the original one. Numerical experiments show that PSSM works well for all cases (Table 1). However, the running time varies with how far the initial position and orientation are from the optimal ones. Alternatively, given the full correspondence between two structures, least squares (LS) can be applied directly, which can also give a good superposition. The complexity of this algorithm is \(O(n^2)\), where n is the number of sequence-aligned atoms. However, this algorithm needs a sequence alignment. The complexity of our pairwise structure superposition is \(O(mn)\), where m and n are the numbers of \(C_\alpha \) atoms of the two proteins.
Because the least-squares (LS) method is popular and serves as an optimality criterion for determining the best superposition, we compared our method with it. We use the \(C_\alpha \) atomic coordinates of five pairs of protein structures drawn from d1cih, d1lfma, d1m60a, d2pcbb, and d1kyow to demonstrate that our method achieves superposition accuracy similar to that of LS. Our algorithm indeed obtains almost the same RMSD as LS (Table 2). Although PSSM may take more time, it does not require an initial correspondence or a sequence alignment.
Numerical experiments show that our method obtains comparable RMSD with a larger number of aligned residues than \(C_\alpha \)-match and CPSARST (Table 3). This may be because our method treats a structure with missing data as a whole structure.
We compare our PSSM method with CCP4 and SuperPose (Table 4) and find that each method has its own advantages. We adopt four pairs of proteins, 1nls and 2bqp, 1glh and 1cpn, 1yad and 2dua, and 1zbd and 1puj, as the test system. Taking 1nls and 2bqp as an example, PSSM obtains 228 aligned residues (\(C_\alpha \)) with an RMSD of 1.4 Å, CCP4 obtains 114 aligned residues with an RMSD of 0.999 Å, and SuperPose obtains 205 aligned residues with an RMSD of 18.14 Å. Compared with CCP4 and SuperPose, PSSM aligns more residues, gives an RMSD that is reasonable and competitive with that of CCP4, and shows better overall results than SuperPose. A possible reason is that SuperPose uses a secondary structure alignment strategy to guide the superposition, which is suited to secondary structure alignment and good at detecting domain or hinge motions in proteins, whereas our method is designed for full-structure superposition (see more examples in Additional file 1: Tables S2 and S3).
We also benchmark the performance of PSSM against DALI and MATT using Fischer's benchmark dataset (Table 5). Fischer's dataset is a popular benchmark for testing protein structure alignment programs and contains 68 pairs of protein structures. Table 5 reports the average number of aligned residues and the average RMSD (the per-pair alignment performance is given in Additional file 1: Tables S4 and S5). The average RMSD (aveRMSD) of our method is greater than that of DALI and MATT, but its average number of aligned residues (aveAligned) is larger than theirs.
The usability of PSSM algorithm
The following analysis shows how missing data affect the performance of PSSM. We keep one copy of a protein structure and delete some atoms from another copy of it to simulate a protein structure with missing data. Two deletion approaches are explored: the first deletes atoms in order, and the second deletes atoms at random.
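The two deletion schemes can be mimicked with a few lines of code. The sketch below is an assumption-laden illustration: "in order" is interpreted as removing one contiguous run of atoms starting at a chosen index, the random scheme uses a fixed seed for reproducibility, and the function names and stand-in coordinates are hypothetical.

```python
import random

def delete_in_order(coords, n_missing, start=0):
    """Remove a contiguous run of n_missing atoms beginning at index `start`."""
    return coords[:start] + coords[start + n_missing:]

def delete_at_random(coords, n_missing, seed=0):
    """Remove n_missing atoms chosen uniformly at random, keeping the original order."""
    rng = random.Random(seed)
    dropped = set(rng.sample(range(len(coords)), n_missing))
    return [c for i, c in enumerate(coords) if i not in dropped]

# Stand-in coordinates: 108 points, matching the number of C-alpha atoms quoted for d1cih.
atoms = [(float(i), 0.0, 0.0) for i in range(108)]
sequential_copy = delete_in_order(atoms, 10, start=20)
random_copy = delete_at_random(atoms, 10)
```

Either reduced copy can then be rotated and superimposed on the complete structure to measure how the RMSD changes with the number of deleted atoms.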
Figure 3a shows the performance of pairwise superposition between the protein structure d1cih (with only \(C_\alpha \) atoms) and a mimic structure obtained by rotating d1cih. Here the rotation matrix is:
(d1cih has 108 \(C_\alpha \) atoms.) We can see that when atoms are deleted in order and the number of deleted atoms is below 10, the RMSD is very small. However, when the number exceeds 10, the RMSD increases sharply to about 3 Å. For random deletion, the RMSD remains small until more than 20 atoms are deleted.
Figure 3b shows the performance of pairwise superposition between the protein structure d1cih with all main-chain atoms and a mimic structure obtained by rotating d1cih. The rotation matrix r is:
The main chain of d1cih has 835 atoms. When atoms are deleted in order, the RMSD remains small up to 50 deleted atoms. However, when more than 50 atoms are deleted, the RMSD increases sharply to 2 Å and stays at a similar level up to 70 deleted atoms.
From the above analysis, we can see that PSSM for pairwise structure superposition is more robust to random missing data than to sequential missing data. From the two cases above and further cases we ran, we find that, for structure superposition with missing data, PSSM requires the lengths of the two proteins to differ by less than about 20 %.
Multiple protein structure superposition
We test our method using proteins from three families. The first case is a superposition of ten protein structures from the cytochrome C family: d1cih, d1pcbb, d1lfma, d1crj, d1csu, d1csx, d1yeb, d1kyow, d1m60a, and d1u74d. The other two families are the Globins and the Serine Proteinases. We choose five proteins from the Globins family and seven proteins from the Serine Proteinases family. These proteins have different amino acid sequences, yet similar structures. We choose d1cih, 2dhbb and 2pka, which have the median lengths in their respective families, as the template protein structures. In practice, we only need to find one template protein per case. We show the pairwise superposition RMSDs obtained by our method for these three families (Tables 5, 6, 7), respectively. We can see that our PSSM method works very well. Only one case, 2pka versus 1ppb, has a relatively larger RMSD than the other pairs (Table 6).
We also compared our PSSM with MUSTANG using five proteins in the Globins family and seven proteins in the Serine Proteinases family as test systems (Tables 8, 9). Generally, the two methods show very competitive results. For the Globins family, PSSM is better than MUSTANG, with two more aligned residues and a slightly smaller RMSD (1.37 versus 1.41 Å). For the Serine Proteinases family, PSSM aligned more atoms with a slightly larger RMSD (1.72 versus 1.56 Å).
Conclusion
We have proposed an effective method, PSSM, for superpositioning pairwise and multiple structures with missing data. The method does not need a sequence alignment in advance. It employs principal component analysis to find an initial rough superposition, and then uses an iterative closest point algorithm to refine it and obtain an accurate registration. To the best of our knowledge, this is the first time that PCA and the ICP algorithm have been combined to study the problem of non-sequential superposition. Numerical experiments demonstrate its accuracy and effectiveness. This method has accuracy comparable to that of the least-squares method, a classical method for protein structure superposition. However, the least-squares method needs a sequence alignment.
Abbreviations
3D: three-dimensional
EM: expectation maximization
PSSM: protein structure superposition method for addressing the cases with missing data
ICP: iterative closest point
PCA: principal component analysis
RMSD: root mean squared deviation
NCBI: National Center for Biotechnology Information
PDB: Protein Data Bank
LS: least square
CPSARST: Circular Permutation Search Aided by Ramachandran Sequential Transformation
CCP4: Collaborative Computational Project Number 4
MUSTANG: MUltiple STructural AligNment AlGorithm
CAS: Chinese Academy of Sciences
References
1. Diamond R. On the comparison of conformations using linear and quadratic transformations. Acta Crystallogr A. 1976;32:1–10.
2. Cohen G. Align: a program to superimpose protein coordinates, accounting for insertions and deletions. J Appl Crystallogr. 1997;30:1160–1.
3. Kabsch W. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr Sect A: Crystal Phys Diffract Theor General Crystallogr. 1978;34:827–8.
4. Coutsias EA, Seok C, Dill KA. Using quaternions to calculate RMSD. J Comput Chem. 2004;25:1849–57.
5. Theobald DL, Wuttke DS. Accurate structural correlations from maximum likelihood superpositions. PLoS Comput Biol. 2008;4:43.
6. Flower DR. Rotational superposition: a review of methods. J Mol Graph Model. 1999;17:238–44.
7. Kearsley SK. On the orthogonal transformation used for structural comparisons. Acta Crystallogr Sect A: Foundations Crystallogr. 1989;45:208–10.
8. Irving J, Whisstock JC, Lesk AM. Protein structural alignments and functional genomics. Proteins. 2001;42:378–82.
9. Edgar R, Batzoglou S. Multiple sequence alignment. Curr Opin Struct Biol. 2006;16:368–73.
10. Dunbrack RL. Sequence comparison and protein structure prediction. Curr Opin Struct Biol. 2006;16:274–84.
11. Panchenko A, Marchler-Bauer A, Bryant SH. Threading with explicit models for evolutionary conservation of structure and sequence. Proteins. 1999;S3:133–40.
12. Martinez L, Andreani R, Martinez J. Convergent algorithms for protein structural alignment. BMC Bioinformatics. 2007;8:306.
13. Grishin NV. Fold change in evolution of protein structures. J Struct Biol. 2001;134:167–85.
14. Zuker M, Somorjai RL. The alignment of protein structures in three dimensions. Bull Math Biol. 1989;51:57–78.
15. Sujatha S, Balaji S, Srinivasan N. PALI: a database of alignments and phylogeny of homologous protein structures. Bioinformatics. 2001;17:375–6.
16. Ye Y, Godzik A. Multiple flexible structure alignment using partial order graphs. Bioinformatics. 2005;21:2362–9.
17. Torarinsson E. Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007;23:926–32.
18. Goldman D, Istrail S, Papadimitriou CH. Algorithmic aspects of protein structure similarity. In: P B, editor. Proceedings of the 40th Annual Symposium on Foundations of Computer Science. Los Alamitos: IEEE Computer Society; 1999. p. 512–22.
19. Wang L, Jiang T. On the complexity of multiple sequence alignment. J Comput Biol. 1994;1:512–22.
20. Barthel D, Hirst J, Bazewicz J, Burke E, Krasnogor N. ProCKSI: a decision support system for protein (structure) comparison, knowledge, similarity and information. BMC Bioinformatics. 2007;8:416.
21. Menke M. Matt: local flexibility aids protein multiple structure alignment. PLoS Comput Biol. 2008;4:10.
22. Dror O. Multiple structural alignment by secondary structures: algorithm and applications. Protein Sci. 2003;12:2492–507.
23. Theobald DL, Steindel PA. Optimal simultaneous superpositioning of multiple structures with missing data. Bioinformatics. 2012;28:1972–9.
24. Bertolazzi P, Guerra C, Liuzzi G. A global optimization algorithm for protein surface alignment. BMC Bioinformatics. 2010;11:488.
25. Besl PJ, McKay ND. A method for registration of 3D shapes. IEEE Trans Pattern Anal Mach Intell. 1992;14:239–56.
26. Bachar O, Fischer D, Nussinov R, Wolfson H. A computer vision based technique for 3D sequence-independent structural comparison of proteins. Protein Eng Design Selection. 1993;6:279–87.
27. Lo WC, Lyu PC. CPSARST: an efficient circular permutation search tool applied to the detection of novel protein structural relationships. Genome Biol. 2008;9:11.
28. Winn M, Ballard C, Cowtan K, Dodson E, Emsley P, Evans P, Keegan R, Krissinel E, Leslie A, McCoy A, McNicholas S, Murshudov G, Pannu N, Potterton E, Powell H, Read R, Vagin A, Wilson K. Overview of the CCP4 suite and current developments. Acta Crystallogr D Biol Crystallogr. 2011;67:235–42.
29. Rajarshi M, Domselaar GV, Zhang H, David S. SuperPose: a simple server for sophisticated structural superposition. Nucleic Acids Res. 2004;32:W590–4.
30. Konagurthu AS, Whisstock JC, Stuckey PJ, Lesk AM. MUSTANG: a multiple structural alignment algorithm. Proteins. 2006;64:559–74.
Authors' contributions
JL, SZ and BL conceived and designed this study; JL implemented the algorithm and carried out the experiment; JL, GX, SZ and BL analyzed the data, wrote the paper and approved the final manuscript. All authors read and approved the final manuscript.
Acknowledgements
We would like to thank Drs. Lingyun Wu, Shiyang Bai and Bin Tu for their helpful discussions.
Competing interests
The authors declare that they have no competing interests.
Funding
This project was supported by the National Natural Science Foundation of China (No. 91530102, 21573274, 11321061 and 61379092), the Outstanding Young Scientist Program of Chinese Academy of Sciences (CAS), the CAS Program for Cross and Cooperative Team of the Science and Technology Innovation, the State Key Laboratory of Scientific/Engineering Computing, the Key Laboratory of Random Complex Structures and Data and the National Center for Mathematics and Interdisciplinary Sciences at CAS.
Additional file
13015_2016_79_MOESM1_ESM.pdf
Additional file 1: Table S1. The superposition results of PSSM for two identical protein structures with one randomly generated by a rotation from another one. Table S2. The RMSD of pairwise superposition between 2pka and others with PSSM for Serine Proteinases data set: 3rp2b, 1arb, 1ppb, 1sgt, 1ton, 2alp, 2sga, 2snv, 4ptp, 5chab. Table S3. The RMSD of pairwise superposition between 2pka and others with PSSM for Serine Proteinases data set: 3rp2b, 1arb, 1ppb, 1sgt, 1ton, 2alp, 2sga, 2snv, 4ptp, 5chab. Table S4. The RMSD of pairwise superposition with PSSM for Fischer's dataset (67 pairs). Table S5. The RMSD of pairwise superposition with PSSM for Fischer's dataset (67 pairs). Table S6. Comparison between PCA+ICP and ICP.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lu, J., Xu, G., Zhang, S. et al. An effective sequence-alignment-free superpositioning of pairwise or multiple structures with missing data. Algorithms Mol Biol 11, 18 (2016). https://doi.org/10.1186/s1301501600793
DOI: https://doi.org/10.1186/s1301501600793