- Research
- Open Access
Protein sequence and structure alignments within one framework
- Gundolf Schenk^{1},
- Thomas Margraf^{1} and
- Andrew E Torda^{1}
https://doi.org/10.1186/1748-7188-3-4
© Schenk et al; licensee BioMed Central Ltd. 2008
- Received: 29 January 2008
- Accepted: 01 April 2008
- Published: 01 April 2008
Abstract
Background
Protein structure alignments are usually based on very different techniques to sequence alignments. We propose a method which treats sequence, structure and even combined sequence + structure in a single framework. Using a probabilistic approach, we calculate a similarity measure which can be applied to fragments containing only protein sequence, structure or both simultaneously.
Results
Proof-of-concept results are given for the different problems. For sequence alignments, the methodology is no better than conventional methods. For structure alignments, the techniques are very fast, reliable and tolerant of a range of alignment parameters. Combined sequence and structure alignments may provide a more reliable alignment for pairs of proteins where pure structural alignments can be misled by repetitive elements or apparent symmetries.
Conclusion
The probabilistic framework has an elegance in principle, merging sequence and structure descriptors into a single framework. It has a practical use in fast structural alignments and a potential use in finding those examples where sequence and structural similarities apparently disagree.
Keywords
- Protein Data Bank
- Structure Alignment
- Protein Pair
- Bivariate Gaussian Distribution
- Substitution Matrix
Background
Protein sequence alignments usually rely on a substitution matrix. This reflects an evolutionary model and the probability that one type of residue has mutated to another [1, 2]. Protein structures can also be aligned, but using a very different set of heuristics. Here, we propose a single framework which estimates the similarity of small protein fragments and can be applied to sequence, structure or both simultaneously. The cost is that one has to discard the evolutionary model and replace it with one based purely on descriptive statistics. The benefit is that after the initial approximations, one has a rather rigorous measure of the similarity of pieces of proteins.
Just considering sequences, there is already a history of working with different sized fragments. Firstly, one can think of conventional sequence alignment as working with fragments of length k, where k = 1. There is plenty of data to estimate the log-odds probabilities of 20 × 21/2 = 210 possible mutations [2]. Since sites in a protein are not independent, one could try to build a substitution matrix for k = 2 (dipeptides) [3, 4]. Unfortunately, there is simply not enough data to estimate all of the 400 × 401/2 = 80 200 mutation rates [4, 5]. The direct parameter estimation requires that all mutations be observed and, for reliable statistics, frequently observed.
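The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `n_substitution_pairs` is hypothetical and not part of any published software.

```python
def n_substitution_pairs(alphabet_size: int, k: int) -> int:
    """Number of unordered pairs (self-pairs included) of fragments of length k."""
    n = alphabet_size ** k          # number of distinct fragments
    return n * (n + 1) // 2         # entries in a symmetric substitution matrix


# k = 1 (single residues) and k = 2 (dipeptides), as in the text
print(n_substitution_pairs(20, 1))
print(n_substitution_pairs(20, 2))
# The same formula for k = 6 gives a number far beyond any available data set
print(n_substitution_pairs(20, 6))
```

The rapid growth with k is the reason a directly estimated substitution matrix is impractical for longer fragments.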
Proteins can also be aligned on the basis of their structures, but there is no single popular methodology. Structure reflects the arrangement of residues in space and is not a property of a single residue, so fragments with k = 1 will never be a good way to represent structural properties. Furthermore, of the mass of structural alignment methods [6–34], hardly any explicitly try to estimate log-odds probabilities as a similarity measure [35].
If one is willing to forget the evolutionary model, it should be possible to statistically measure fragment similarity, but based on what is observed, rather than requiring that everything possible be observed. Furthermore, one should be able to work with larger fragment lengths. A fragment could be characterised by some vector of properties and the similarity of two such vectors would measure the similarity of the fragments.
This has been done using physical or chemical properties which seem reasonable to a chemist [16, 28], but we have aimed for a more objective statistical approach. In this work, long vectors are created, but they come from a probabilistic classification procedure. With N_{ c } classes, a fragment has a vector of probabilities that it is in class 1, 2, ..., N_{ c }. Given this vector for two such fragments, one can then ask, what is the probability that two fragments are in the same class? Regardless of which class this is, similar fragments will have similar vectors of probabilities. The classification may not be perfect, so some fragments may have a non-zero probability of being in several classes. Even if one cannot say which class the fragments are in, similar fragments will have similar patterns of probabilities. This could be seen by the dot product of class membership probability vectors and is formalised below (eq. 4). In this work, the classification comes from a maximally parsimonious Bayesian classification of fragments. The number of classes is typically of the order of 10^{2}, the fragment length k = 6 and the amount of training data of the order of 10^{6} observations.
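As a toy illustration of the idea, the following Python sketch compares invented class membership vectors for three fragments over four classes. The numbers and the function names are assumptions made purely for the example.

```python
import math


def normalise(p):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in p))
    return [x / norm for x in p]


def similarity(p_i, p_j):
    """Dot product of two unit-length class membership vectors."""
    u, v = normalise(p_i), normalise(p_j)
    return sum(a * b for a, b in zip(u, v))


# Invented class membership probabilities (each vector sums to 1)
frag_a = [0.60, 0.35, 0.05, 0.00]
frag_b = [0.55, 0.40, 0.05, 0.00]   # spread over the same classes as frag_a
frag_c = [0.00, 0.05, 0.15, 0.80]   # mostly in a different class

print(similarity(frag_a, frag_b))   # close to 1
print(similarity(frag_a, frag_c))   # close to 0
```

Neither `frag_a` nor `frag_b` is confidently assigned to one class, yet their similar pattern of probabilities still gives a high score.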
The classes used here are sets of statistical distributions. These are multinomial Bernoulli distributions for the discrete (sequence) properties, Gaussian for the continuous (structural) properties and appropriate mixture models to combine sequence and structure. For example, one may have a pure structure classification based on φ and ψ backbone angles. One class within such a classification would have k pairs of φ and ψ distributions (one for each of the k residues). Given some observation (fragment), one can calculate its probability of being in a class by calculating the probability of each φ,ψ pair within the corresponding distributions that define the class and taking the product of these probabilities.
Exactly the same process can be applied to sequence by using distributions of amino acid probabilities at each of the k sites within a class. Instead of Gaussian distributions, one has 20-way probabilities at each site. Different classes will reflect the different probabilities of finding each amino acid at each position. Class membership of a fragment is simply calculated from the product of the probabilities of each amino acid occurring at each site within the class.
Finally, a classification can combine sequence and structure distributions. Class membership of a fragment is just the product of probabilities of finding its sequence (discrete) and structure (continuous) descriptors within some class.
In practice, structure classes were based on bivariate Gaussian distributions in order to account for the strong correlation of φ,ψ angles within residues.
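The product of per-site probabilities described above can be sketched as follows. This is a minimal illustration, not the authors' code: the data layout, the function names and the toy parameters are assumptions, angles are taken to be in radians and the periodicity of φ and ψ is ignored.

```python
import math


def bivariate_gaussian(phi, psi, mean, cov):
    """Density of a 2D Gaussian; cov = (var_phi, var_psi, cov_phi_psi)."""
    var_phi, var_psi, cov_pp = cov
    det = var_phi * var_psi - cov_pp ** 2
    d_phi, d_psi = phi - mean[0], psi - mean[1]
    # Quadratic form (x - mu)^T C^-1 (x - mu) written out for a 2 x 2 matrix
    quad = (var_psi * d_phi ** 2 - 2.0 * cov_pp * d_phi * d_psi
            + var_phi * d_psi ** 2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def fragment_likelihood(sequence, angles, site_aa_probs, site_gaussians):
    """Unnormalised probability of one fragment under one class.

    sequence       : string of k one-letter amino acid codes
    angles         : k (phi, psi) pairs
    site_aa_probs  : k dicts mapping amino acid -> probability at that site
    site_gaussians : k (mean, cov) pairs for the phi, psi distribution
    """
    likelihood = 1.0
    for aa, (phi, psi), aa_probs, (mean, cov) in zip(
            sequence, angles, site_aa_probs, site_gaussians):
        likelihood *= aa_probs.get(aa, 0.0)                      # discrete part
        likelihood *= bivariate_gaussian(phi, psi, mean, cov)    # continuous part
    return likelihood


# Toy class with k = 2 sites and only two amino acid types
toy_aa = [{"A": 0.5, "G": 0.5}, {"A": 0.5, "G": 0.5}]
toy_gauss = [((0.0, 0.0), (1.0, 1.0, 0.0)), ((0.0, 0.0), (1.0, 1.0, 0.0))]
print(fragment_likelihood("AG", [(0.0, 0.0), (0.0, 0.0)], toy_aa, toy_gauss))
```

Dropping either factor inside the loop gives the pure structure or pure sequence case.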
Describing proteins by fragments is not new, but the philosophy here differs from most literature examples [36–39]. Firstly, the classification is probabilistic. A fragment is never a member of just one class. It may have 0.99 probability of being in one class or it may have partial membership of a few classes. This is particularly important for robustness in comparison problems as described below. Secondly, the clustering does not rely on an explicit measure of cluster similarity or distance. Instead, a model is constructed for the data and the likelihood of the model is optimised, with no need to explicitly consider distances between clusters. Finally, there are almost no preconceptions built into the clustering since we rely on unsupervised learning. If α-helices, β-strands or sequence patterns of hydrophobicity and hydrophilicity are found, it is a consequence of fitting a statistical model, not chemical preconceptions.
Methods
Data sets
The training data was a set of protein chains taken from the protein data bank (PDB) [40] such that no two members of the list had more than 50 % sequence identity [41]. After removing all chains with fewer than 40 residues and the few with unknown sequence, each possible overlapping fragment of length k was extracted. Fragments with any bond longer than 2 Å were discarded, leaving a set of just over 1.5 × 10^{6} fragments of length k ≤ 6.
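Extracting every overlapping fragment of length k from a chain amounts to a sliding window. A minimal sketch, with a hypothetical function name and a made-up ten-residue sequence:

```python
def overlapping_fragments(residues, k=6):
    """Return all overlapping windows of length k (empty if the chain is shorter than k)."""
    return [residues[i:i + k] for i in range(len(residues) - k + 1)]


chain = "ACDEFGHIKL"                 # invented 10-residue sequence
fragments = overlapping_fragments(chain, k=6)
print(len(fragments), fragments)     # n_r - k + 1 windows
```

The same windowing would apply to a list of (φ, ψ) pairs instead of a string. Filtering on bond lengths, as described above, would be a separate step that is not shown here.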
A set of protein pairs was used for testing alignments and selected so that there should be some structural similarity, but little sequence similarity. Starting from a list of related pairs of proteins [42–45], a set of 2 902 pairs was selected by requiring that the members of each pair have less than 19 % sequence identity, but be superimposable to 3 Å or less over at least 40 residues.
Classification
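For orientation, the standard bivariate Gaussian density is written out here in the symbols defined in the next paragraph. The normalisation shown is the textbook one, so this should be read as an assumption about the form used rather than a quotation:

$$p(\theta) = \frac{1}{2\pi\,|C|^{1/2}} \exp\left(-\frac{1}{2}\,(\theta-\mu_{\theta})^{T}\, C^{-1}\, (\theta-\mu_{\theta})\right)$$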
where θ is the two-dimensional vector for a φ, ψ pair and μ_{ θ } is the corresponding vector of means. C is the covariance matrix, |C| the absolute value of the corresponding determinant and (θ-μ_{ θ })^{T} is the transpose of (θ-μ_{ θ }). Classifications using both sequence and structure used a mixture model with both the discrete and continuous distributions.
Given the distribution types, expectation maximization was used to find the model (parameter set) which maximises the likelihood of the data [46]. One uses an initial guess for the distribution parameters and estimates the class membership probabilities of each observation. These estimates are then used to re-calculate the distribution properties and the process is iterated until a maximum in terms of likelihood is reached. This is usually a local maximum, so the entire classification process is repeated many times.
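The alternation between estimating memberships and re-calculating distribution parameters can be illustrated with a deliberately small example. The sketch below fits a mixture of one-dimensional Gaussians to invented data. It is not the Bayesian classifier used in this work: there is no prior and no search over the number of classes, and all names and numbers are assumptions.

```python
import math


def gauss(x, mu, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(data, mu, var, w, n_iter=50):
    """Plain EM for a 1D Gaussian mixture; mu, var, w are initial guesses per class."""
    mu, var, w = list(mu), list(var), list(w)
    n_class = len(mu)
    for _ in range(n_iter):
        # E-step: class membership probabilities of every observation
        resp = []
        for x in data:
            joint = [w[j] * gauss(x, mu[j], var[j]) for j in range(n_class)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-calculate weights, means and variances from the memberships
        for j in range(n_class):
            n_j = sum(r[j] for r in resp)
            w[j] = n_j / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            spread = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / n_j
            var[j] = max(spread, 1e-6)   # guard against a collapsing class
    return mu, var, w


DATA = [-2.2, -2.0, -1.8, -2.1, -1.9, 1.9, 2.0, 2.1, 2.2, 1.8]
print(em_gaussian_mixture(DATA, mu=[-1.0, 1.0], var=[1.0, 1.0], w=[0.5, 0.5]))
```

Because the result depends on the starting guess, a practical run would repeat this from many initial parameter sets and keep the most likely model.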
where v_{ j } is the set of distribution properties describing class j. w_{ j } is the weight or probability associated with class j. The product runs over the m attributes and considers the parameters v_{j,m} which describe the m'th attribute in the j'th class. When calculating the probability of a fragment being in a class, eq 2 is applied to all classes and normalised so that the sum of probabilities is one. The class weights, w, reflect the importance of a class and are subject to the normalisation $\sum_{j} w_{j} = 1$.
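The normalisation step can be sketched in a few lines. The sketch assumes that the per-class likelihoods (the products over attributes) have already been computed and that the class weight enters as a simple multiplicative factor; the function name and the numbers are illustrative only.

```python
def class_membership(likelihoods, weights):
    """Turn per-class likelihoods and class weights into probabilities summing to one."""
    joint = [w * like for w, like in zip(weights, likelihoods)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("fragment has zero likelihood in every class")
    return [p / total for p in joint]


# Invented likelihoods of one fragment in three classes, and class weights
probs = class_membership([0.02, 0.005, 0.0001], [0.5, 0.3, 0.2])
print(probs)
```

The resulting vector is what characterises a fragment in the similarity calculations.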
and this introduces a strong element of parsimony. Any time new parameters are introduced, one brings in a multiplicative factor less than one. Thus, any time a new class is introduced, the probability of the data set appears to decrease unless the new class is strongly supported by the data. This means the method is not very susceptible to overfitting and there is a tendency to find the minimal number of classes necessary to model the data.
Parameters optimized during clustering
Classification | n _{ disc } | n _{ cont } | n _{ total } |
---|---|---|---|
sequence | 20k | 0 | 20k |
structure | 0 | 5k | 5k |
sequence + structure | 20k | 5k | 25k |
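For a concrete fragment length, the entries of the table can be evaluated directly. The sketch below takes the 20k and 5k counts from the table; the reading of 5 as two means, two variances and one covariance per residue follows from the bivariate Gaussian and is stated here as an interpretation. The function name is hypothetical.

```python
def parameters_per_class(k, use_sequence, use_structure):
    """Return (n_disc, n_cont, n_total) for fragments of length k."""
    n_disc = 20 * k if use_sequence else 0    # 20 amino acid probabilities per site
    n_cont = 5 * k if use_structure else 0    # bivariate Gaussian: 2 means + 3 covariance terms
    return n_disc, n_cont, n_disc + n_cont


for label, seq, struct in [("sequence", True, False),
                           ("structure", False, True),
                           ("sequence + structure", True, True)]:
    print(label, parameters_per_class(6, seq, struct))
```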
Similarity and alignments
Given a classification, it could then be used for the calculation of alignments. If a classification is based on fragments of length k, then a protein with n_{ r }residues is broken into n_{ r }- k + 1 overlapping fragments. The class membership probabilities could then be assigned using eq 2. Given n_{ c }classes, a fragment is characterised by an n_{ c }-dimensional vector, so a protein can be seen as n_{ r }- k + 1 vectors in an n_{ c }-dimensional space. Given two such protein fragments i and j, probably from different proteins, one can calculate a similarity measure s_{ ij }
s_{ ij }= p_{ i }·p_{ j }
where p_{ i }denotes the vector of class probabilities for fragment i. If the two vectors have been normalised to unit vector length, s_{ ij }offers a rather rigorous measure of similarity in the range 0 to 1. These s_{ ij }scores can be used as the elements of a similarity matrix suitable for calculating optimal pairwise alignments. The procedure can be applied to probabilities calculated from pure sequence or pure structure or combined sequence and structure. Unlike conventional scoring methods, eq. 4 does not relate to single sites or amino acids in a protein. The vector p_{ i }reflects k residues. This means that each entry s_{ ij }in the score matrix reflects the contribution of k overlapping fragments, each of length k, so it is sensitive to an environment of 2k - 1 residues. All alignments were calculated with wurst [47] using the Gotoh version [49] of the Smith and Waterman [50] algorithm and with parameters optimized as described below.
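To show how such a score matrix feeds into dynamic programming, here is a minimal Smith-Waterman sketch. It is a simplification under stated assumptions: it uses a linear gap penalty rather than the affine Gotoh scheme used in this work, it returns only the best local score, and the `gap` and `zero_level` values are arbitrary placeholders rather than the optimized parameters.

```python
import math


def unit(v):
    """Scale a class probability vector to unit length (vectors are never all zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def score_matrix(vectors_a, vectors_b):
    """s_ij = dot product of unit-length class vectors of fragments i and j."""
    ua = [unit(v) for v in vectors_a]
    ub = [unit(v) for v in vectors_b]
    return [[sum(x * y for x, y in zip(p, q)) for q in ub] for p in ua]


def local_alignment_score(sim, gap=0.5, zero_level=0.3):
    """Best local alignment score; zero_level shifts scores so poor matches are negative."""
    n, m = len(sim), len(sim[0])
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(0.0,
                          h[i - 1][j - 1] + sim[i - 1][j - 1] - zero_level,  # match
                          h[i - 1][j] - gap,                                 # gap in b
                          h[i][j - 1] - gap)                                 # gap in a
            best = max(best, h[i][j])
    return best


# Invented class vectors for the fragments of two short proteins
prot_a = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
prot_b = [[0.8, 0.2, 0.0], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
print(local_alignment_score(score_matrix(prot_a, prot_b)))
```

Since the dot products lie between 0 and 1, some shift of the zero level is needed for a local alignment to terminate, which is why it appears as a parameter.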
where ${r}_{ij}^{nat}$ is the distance between C^{ α }_{ i }and C^{ α }_{ j }in the native structure and ${r}_{ij}^{model}$ is the corresponding distance in the model and the summation runs over the N_{ res }aligned residues. Next, one defines a threshold, DME^{ cut }= 4.0 Å, bearing in mind the typical C^{ α }- C^{ α }distance between adjacent residues is 3.8 Å. Then one discards the elements where the two distance matrices are most different, until DME_{nat,model}is less than or equal to DME^{ cut }. The remaining fraction of the distance matrix is f({r^{ nat }},{r^{ model }}) where {r^{ x }} is the set of C^{ α }coordinate vectors from molecule x. In pseudocode, one can describe the process:
while (DME_{nat,model} > DME^{ cut }) {
    remove largest distance difference from C^{ α } distance matrix
    recalculate DME_{nat,model}
}
f({r^{ nat }},{r^{ model }}) = fraction of distance difference matrix remaining
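A runnable version of this procedure can be sketched in Python. It assumes that DME is the root-mean-square difference over the C^α distance pairs still retained, which is the usual definition and is taken here as an assumption. The function name and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def fraction_within_dme(nat, model, dme_cut=4.0):
    """Fraction of C-alpha distance pairs kept once DME drops to dme_cut or below.

    nat, model : lists of (x, y, z) coordinates for the same aligned residues
    """
    sq_diffs = sorted(
        (math.dist(nat[i], nat[j]) - math.dist(model[i], model[j])) ** 2
        for i, j in combinations(range(len(nat)), 2))
    total = len(sq_diffs)
    if total == 0:
        return 0.0                       # fewer than two residues: nothing to compare
    # prefix[r] = sum of the r smallest squared differences
    prefix = [0.0]
    for d in sq_diffs:
        prefix.append(prefix[-1] + d)
    remaining = total
    # Dropping the largest remaining difference is the same as shortening the sorted list
    while remaining > 0 and math.sqrt(prefix[remaining] / remaining) > dme_cut:
        remaining -= 1
    return remaining / total


# Four residues on a line, 3.8 Å apart; in the model the last one is badly misplaced
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (111.4, 0.0, 0.0)]
print(fraction_within_dme(native, model))
```

Sorting once and working with prefix sums avoids recalculating the full sum after every removal, but gives the same result as the loop in the pseudocode. `math.dist` requires Python 3.8 or later.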
where a = 0.7 and b = 15 (an arbitrary choice for the shape of the sigmoid). The summation ran over all N_{ pair }= 2 902 protein pairs.
The one psi-blast database search referred to below used a profile built with acceptance parameters orders of magnitude more careful than default values [58]. 15 iterations were run, accepting homologues from the non-redundant protein sequence database with an e-value < 10^{-10}, 10 iterations with the threshold set at 10^{-8} and 5 iterations with the threshold set at 2 × 10^{-5}. This profile was then used as a query against sequences derived from protein data bank structures. For comparison, the default acceptance threshold is e-value < 5 × 10^{-3}.
Results
Classification in general
where the superscript H denotes a histogram from the training data, C denotes the classification and p_{ ij }is the probability for bin i,j. D_{ KL }is zero if two distributions are the same and grows as they differ. Similarly, one can treat the two-dimensional histogram from the training data and probabilities from the classification as vectors and then calculate a dot product. This will equal 1 if the two distributions are the same.
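Both comparisons are easy to sketch for histograms stored as flat lists of bin probabilities. The sketch assumes the usual form of the Kullback-Leibler divergence, summing p^H ln(p^H / p^C) over bins, and normalises the vectors to unit length before taking the dot product. The bin values are invented.

```python
import math


def kl_divergence(p_hist, p_class):
    """D_KL between training histogram and classification (empty histogram bins are skipped).

    Assumes p_class is non-zero wherever p_hist is non-zero.
    """
    return sum(p * math.log(p / q) for p, q in zip(p_hist, p_class) if p > 0.0)


def normalised_dot(p, q):
    """Dot product of the two distributions treated as unit-length vectors."""
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(x * x for x in q))
    return sum(x * y for x, y in zip(p, q)) / (norm_p * norm_q)


# Invented bin probabilities for a tiny flattened 2 x 2 histogram
hist = [0.50, 0.30, 0.15, 0.05]
model_probs = [0.45, 0.35, 0.15, 0.05]
print(kl_divergence(hist, model_probs), normalised_dot(hist, model_probs))
```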
Given these overall properties, one can consider some example results from each type of classification.
Sequence classification and alignment
This type of classification is included as a matter of principle, rather than practical use. There are, however, two reasons why it may have been of interest. Firstly, if one believes in the importance of sequence motifs, this could be a method for finding them. Practically all motif finding methods use some form of supervised learning (training from known data) [59–61]. The approach in this classification is simply to look for statistically significant patterns without any knowledge of function. Secondly, one might hope that patterns of amino acid probabilities are a sequence signal which would be preserved over longer evolutionary time-scales than simple sequence similarity. In this case, one could align protein sequences using the similarity based on eq 4.
First we consider whether there are some statistical patterns which are so strong and distinct that they will be found by this kind of unsupervised learning/classification. The answer is yes, but it is of no practical use. For fragments of length k = 6, the most statistically unusual class, as measured by the cross entropy, is HHHHHH. The second most unusual class was another common sequence tag. The other classes may be interpreted in terms of chemical properties, but it is more sensible to refrain from over-interpretation. This kind of unsupervised learning is not the best way to recognise biologically interesting sequence motifs.
Next, we briefly consider the question of sequence alignment using a score matrix based on similarities of class probability vectors (eq. 4). With the set of 2 902 distantly related protein pairs, alignments were calculated, models constructed and the alignment quality measured as described under Methods. For comparison, the same procedure was done with conventional pair-wise alignments based on a blosum62 substitution matrix [2]. The same optimization of gap penalties and matrix zero level was then calculated after filling score/alignment matrices with Gaussian-distributed random numbers. Figure 1 compares the results from the different approaches. The cost function (eq. 6) is based on the similarity of distance matrices (eq. 5), so even with random elements in the score matrix, the score will not be zero due to small fragments of similar structure. On this set of remote homologues, the more expensive method does not produce better alignments than those using a conventional substitution matrix. Although it is technically interesting to find a genuinely new scoring scheme for sequence alignments, it is more useful to consider this methodology in a context where it seems to be very effective.
Structure alignment
Unlike a pure sequence-based classification, the pure structure-based classification leads to a directly useful application (structural alignment) and often easily interpretable results. We concentrate on results from a classification with fragment length k = 6 and 248 classes. Not surprisingly, the three most populated classes are recognisable classic secondary structure, but soon one reaches classes which may or may not have literature names. The practical application of this classification is more interesting than a reinvestigation of protein structural motifs. When the vectors in eq. 4 are based only on structural properties, they form the basis of a swift and robust protein structure alignment method, available as a web service [62] and fast enough to search a set of 17 000 representative protein folds in minutes.
Firstly, one can look at the very gross average behaviour and compare the quality of the alignments with those from the same methodology using sequences or conventional sequence alignment (previous section). Figure 1 shows the value of the testing function from the optimization described above (2 902 remote protein pairs) and the bar labelled "struct frag" refers to the structure-based alignment with this methodology. As expected, when aligning pairs of proteins with weak sequence homology, a structure based method performs much better. One may also note that the bars never go below -0.7. This reflects the fact that one is working with protein pairs whose structures are often somewhat dissimilar and the function only approaches -1 as structures become identical.
One can look at average performance, but when comparing protein alignments, it may be that there are many methods which perform similarly, even with approaches based on methods ranging from local distance information to mean field methods [6–35, 63–66]. In this case, it is more important to look for examples which characterise a method and where the results reflect the peculiarities of a technique.
Combined sequence and structure alignment
The three classes with the highest statistical weight are structurally indistinguishable (all α-helical), but differ in their sequence profile. The second and third classes show the periodicity of amphipathic helices. Two example β-strand classes are shown which again differ in their sequence propensities. The last example (class 15) at the bottom of the plot shows a different property. The amino acid probabilities do not differ too much from background probabilities, except at position 4, which almost has to be a glycine. Looking at the fragment, it is clear that this is part of a classic, well characterised turn [69, 70].
A combined sequence to structure alignment resulted in the superposition of Figure 7. Within the 89 aligned residues there is only 13 % sequence identity and it only covers about 17 % of the 529 residues from chain A of 1zpu. It would be reasonable to doubt its significance. In fact, both are copper containing proteins involved in redox chemistry, albeit one from algae [72] the other from yeast [73]. Interestingly, it is possible to find a very remote sequence connection between the two proteins. Using the sequence of 1fa4 as a query, a sequence profile was built using the non-redundant protein sequence database and used to search against protein data bank sequences [58]. This finds 1zpu as a potential homologue with a very poor e-value (0.02). By itself, this would also not be considered significant. Most persuasively, the iterated sequence search from psi-blast aligns residues 34 to 123 of 1zpu and the combined sequence/structure alignment using our code aligns almost exactly the same stretch (residues 37 to 124). This appears to be simply an example of normal divergent evolution, but it is an example of where structure has diverged to the point where a simple structural superposition is not conclusive.
Again one should be clear that this kind of result is not in itself significant. When one is dealing with remote homologues, different programs will produce different results. With enough time, one would be able to find alignments which are found with other codes, but missed or miscalculated using our methods. The interesting point is that there is one method and one scoring scheme which can operate on both protein sequences and structures.
Discussion
Clearly, it is possible to have a single probabilistic methodology for finding similarities based on sequence, structure or simultaneous sequence and structure. The question is whether one would want to. The application to sequence alignment is interesting, but not obviously useful. The pure structure alignment, based on continuous descriptors is obviously useful and available as a web service [62]. The combination of sequence and structure descriptors is an unexploited method which has different properties to other alignment techniques which leads to two future possibilities. Firstly, it is accepted dogma that protein sequences evolve faster than structure, so one can detect similarities even when sequence homology is not significant [74]. With the tools here, it is relatively easy to search for cases where structural alignment is weak, but combined alignment appears to be significant. Secondly, there is the general question of remote homology detection. Protein structure searches are an essential tool when proteins have diverged so much that sequence similarity is, by itself, not significant. The question is whether combining all available descriptors will usually yield even better results. Although we give one example above (1fa4 and 1zpu) it needs more testing and the collection of new benchmarks to find if it is useful and if so, in what regime of similarity. From one point of view, one should use all available information (sequence as well as structure). From another point of view, this may not be true. Sequence mutations are often modelled as random events or walks through possible sequences [75, 76]. If two sequences have diverged such that there is little sequence similarity, adding sequence information will introduce noise as well as signal.
The methodology is in most senses rather unusual and there are some assumptions and limitations. If one feels the underlying statistical models are a good representation of protein data, then the rest of the procedure is completely justified. Of course the underlying models are not perfect. Gaussian distributions are mainly chosen for convenience and one knows that there are some correlations which could be included. The distributions in this work accounted for φ, ψ correlations within a residue, but test calculations on smaller data sets suggest that in a small number of classes there are correlations between neighbouring residues which could be accounted for. The problem is that there are currently 30 parameters per class in a pure structure classification with k = 6 (5 per residue, as in the table of optimized parameters). Using a full covariance matrix over all 12 angles results in 90 parameters per class (12 means and 78 distinct covariance elements).
There are already many protein fragment classifications in the literature, but usually with a different philosophy. Generally, these use a structure classification and then see which sequence patterns fit to each structure motif (or vice versa). They also require some similarity measure between clusters [35–38, 54, 77–81]. The methodology here is based on a mixture model which can treat all these properties simultaneously and this leads to a very different kind of result. As shown in Figure 6, a single structural motif can accommodate different sequence patterns. These are detected in this work since all the descriptors are considered simultaneously. Figure 6 shows half a dozen classes, but if one were to look through the other 261 classes, there are numerous examples of different sequence patterns fitting to a basic structural unit.
This raises the question as to which is most important when sequence and structure are combined. Unfortunately, there is no simple answer since it varies from class to class and site to site. As shown in Figure 6, a class may have relatively flat distributions for amino acids or sometimes a particular site has a distribution far from background probability. In crude terms, summing over all observations and all classes, the structural descriptors are about 3.5 times more important than the sequence descriptors in terms of discriminating power.
The next major difference compared to other classification schemes is the application. Usually this is connected to prediction. If one has a sequence clustering one can collect structure properties to make structural predictions or vice versa. The Bayesian classification scheme used in this work has been used for this purpose [47], but not in this work. Here, we are interested in a single kind of similarity measure which operates in different contexts.
The results (or the lack thereof) for pure sequence alignment make it clear that this methodology will not displace conventional sequence-based methods. The results for structure and combined sequence and structure are far more promising. There are several reasons. Firstly, there are no preconceptions of regular protein structure. If some motif is statistically described it is part of the model. There is little preference for strands, helices or recognized turns over other motifs. Next, the method handles unusual structures. When faced with a fragment which has never been seen before, it will not be placed into any particular class. It will have some probability of being in a few classes. Any similar fragment, even if it has never been observed before, will have a similar set of class membership probabilities and will be recognized as similar. Next, the procedure is rather free of thresholds. The probability of similarity (eq. 4) runs smoothly between 0 and 1. There are no absolute matching steps necessary. Finally, each residue in a protein is involved in k overlapping fragments, which together span 2k - 1 residues. In this work (k = 6), this means that each element in a similarity matrix reflects the properties of 11 residues. These factors are probably why the method is rather tolerant of poor structures such as the example in Figure 5.
The procedure is also rather swift. To make database searches fast, class probability vectors for representative chains can be precalculated. This leaves the normal quadratic running time for the dynamic programming step. Compared to simple sequence alignment, this is slower by a constant factor since the normal table lookup from a substitution matrix is replaced by a dot product calculation.
The approach may be useful for pure structure alignment, but its performance needs to be demonstrated quantitatively in terms of accuracy and speed, rather than by the proof-of-concept examples given here. The main advance is that one can mix sequence and structure on equal probabilistic terms without any ad hoc weighting of the terms. Since the methodology is fast enough for phylogenetic calculations, we are now interested in finding examples where the different approaches yield different results. It remains to be seen which is more reliable or at least persuasively believable.
Conclusion
With modest assumptions, it is possible to combine protein sequence and structure in one framework for protein alignment and comparison. Larger scale testing needs to be done to estimate its significance. The server is available for structure comparisons [62] and all software is free for download [82].
References
- Dayhoff M, Schwartz R, Orcutt B: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. Edited by: Dayhoff M. 1978, 5: 345-358. Washington D.C.: National Biomedical Research FoundationGoogle Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992, 89: 10915-10919. 10.1073/pnas.89.22.10915PubMedPubMed CentralView ArticleGoogle Scholar
- Jung J, Lee B: Use of residue pairs in protein sequence-sequence and sequence-structure alignments. Protein Sci. 2000, 9: 1576-1588.PubMedPubMed CentralView ArticleGoogle Scholar
- Gonnet GH, Cohen MA, Benner SA: Analysis of Amino-Acid Substitution During Divergent Evolution – the 400 by 400 Dipeptide Substitution Matrix. Biochem Biophys Res Commun. 1994, 199: 489-496. 10.1006/bbrc.1994.1255PubMedView ArticleGoogle Scholar
- Crooks GE, Green RE, Brenner SE: Pairwise alignment incorporating dipeptide covariation. Bioinformatics. 2005, 21: 3704-3710. 10.1093/bioinformatics/bti616PubMedView ArticleGoogle Scholar
- Zuker M, Somorjai RL: The alignment of protein structures in three dimensions. Bull Math Biol. 1989, 51: 55-78.PubMedView ArticleGoogle Scholar
- Russell RB, Barton GJ: Multiple protein sequence alignment from tertiary structure comparison. Proteins. 1992, 14: 309-323. 10.1002/prot.340140216PubMedView ArticleGoogle Scholar
- Holm L, Sander C: Protein-Structure Comparison by Alignment of Distance Matrices. J Mol Biol. 1993, 233: 123-138. 10.1006/jmbi.1993.1489PubMedView ArticleGoogle Scholar
- Subbiah S, Laurents DV, Levitt M: Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol. 1993, 3: 141-148. 10.1016/0960-9822(93)90255-MPubMedView ArticleGoogle Scholar
- Alexandrov NN: SARFing the PDB. Protein Eng. 1996, 9: 727-732. 10.1093/protein/9.9.727PubMedView ArticleGoogle Scholar
- Gibrat J-F, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996, 6: 377-385. 10.1016/S0959-440X(96)80058-3PubMedView ArticleGoogle Scholar
- Orengo CA, Taylor WR: SSAP: Sequential structure alignment program for protein structure comparison. Method Enzymol. 1996, 266: 617-635.View ArticleGoogle Scholar
- Suyama M, Matsuo Y, Nishikawa K: Comparison of protein structures using 3D profile alignment. J Mol Evol. 1997, 44: S163-173. 10.1007/PL00000065PubMedView ArticleGoogle Scholar
- Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11: 739-747. 10.1093/protein/11.9.739PubMedView ArticleGoogle Scholar
- Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics. 2000, 16: 566-567. 10.1093/bioinformatics/16.6.566PubMedView ArticleGoogle Scholar
- Jung J, Lee B: Protein structure alignment using environmental profiles. Protein Eng. 2000, 13: 535-543. 10.1093/protein/13.8.535PubMedView ArticleGoogle Scholar
- Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS: ProSup: a refined tool for protein structure alignment. Protein Eng. 2000, 13: 745-752. 10.1093/protein/13.11.745PubMedView ArticleGoogle Scholar
- Ortiz AR, Strauss CEM, Olmea O: MAMMOTH (Matching molecular models obtained from theory): An automated method for model comparison. Protein Sci. 2002, 11: 2606-2621. 10.1110/ps.0215902PubMedPubMed CentralView ArticleGoogle Scholar
- Shatsky M, Nussinov R, Wolfson HJ: Flexible protein alignment and hinge detection. Proteins. 2002, 48: 242-256. 10.1002/prot.10100
- Blankenbecler R, Ohlsson M, Peterson C, Ringnér M: Matching protein structures with fuzzy alignments. Proc Natl Acad Sci USA. 2003, 100: 11936-11940. 10.1073/pnas.1635048100
- Kawabata T: MATRAS: a program for protein 3D structure comparison. Nucl Acids Res. 2003, 31: 3367-3369. 10.1093/nar/gkg581
- Ilyin VA, Abyzov A, Leslin CM: Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci. 2004, 13: 1865-1874. 10.1110/ps.04672604
- Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst D. 2004, 60: 2256-2268. 10.1107/S0907444904026460
- Ochagavia ME, Wodak H: Progressive combinatorial algorithm for multiple structural alignments: Application to distantly related proteins. Proteins. 2004, 55: 436-454. 10.1002/prot.10587
- Shapiro J, Brutlag D: FoldMiner and LOCK 2: protein structure comparison and motif discovery on the web. Nucl Acids Res. 2004, 32: W536-W541. 10.1093/nar/gkh389
- Carpentier M, Brouillet S, Pothier J: YAKUSA: A fast structural database scanning method. Proteins. 2005, 61: 137-151. 10.1002/prot.20517
- Chen L, Zhou T, Tang Y: Protein structure alignment by deterministic annealing. Bioinformatics. 2005, 21: 51-62. 10.1093/bioinformatics/bth467
- Chen Y, Crippen GM: A novel approach to structural alignment using realistic structural and environmental information. Protein Sci. 2005, 14: 2935-2946. 10.1110/ps.051428205
- Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucl Acids Res. 2005, 33: 2302-2309. 10.1093/nar/gki524
- Zhu JH, Weng ZP: FAST: A novel protein structure alignment algorithm. Proteins. 2005, 58: 618-627. 10.1002/prot.20331
- Konagurthu AS, Whisstock JC, Stuckey PJ, Lesk AM: MUSTANG: A multiple structural alignment algorithm. Proteins. 2006, 64: 559-574. 10.1002/prot.20921
- Lisewski AM, Lichtarge O: Rapid detection of similarity in protein structure and function through contact metric distances. Nucl Acids Res. 2006, 34: E152. 10.1093/nar/gkl788
- Taubig H, Buchner A, Griebsch J: PAST: fast structure-based searching in the PDB. Nucl Acids Res. 2006, 34: W20-W23. 10.1093/nar/gkl273
- Oldfield TJ: CAALIGN: a program for pairwise and multiple protein-structure alignment. Acta Cryst D. 2007, 63 (4): 514-525. 10.1107/S0907444907000844
- Friedberg I, Harder T, Kolodny R, Sitbon E, Li ZW, Godzik A: Using an alignment of fragment strings for comparing protein structures. Bioinformatics. 2007, 23: E219-E224. 10.1093/bioinformatics/btl310
- Camproux AC, Tuffery P, Chevrolat JP, Boisvieux JF, Hazout S: Hidden Markov model approach for identifying the modular framework of the protein backbone. Protein Eng. 1999, 12: 1063-1073. 10.1093/protein/12.12.1063
- Hunter CG, Subramaniam S: Protein fragment clustering and canonical local shapes. Proteins. 2003, 50: 580-588. 10.1002/prot.10309
- Sander O, Sommer I, Lengauer T: Local protein structure prediction using discriminative models. BMC Bioinformatics. 2006, 7: 14. 10.1186/1471-2105-7-14
- Simons KT, Ruczinski I, Kooperberg C, Fox BA, Bystroff C, Baker D: Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins. 1999, 34: 82-95. 10.1002/(SICI)1097-0134(19990101)34:1<82::AID-PROT7>3.0.CO;2-A
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucl Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235
- Li WZ, Jaroszewski L, Godzik A: Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001, 17: 282-283. 10.1093/bioinformatics/17.3.282
- Holm L, Sander C: The FSSP database of structurally aligned protein fold families. Nucl Acids Res. 1994, 22: 3600-3609.
- Holm L, Sander C: The FSSP database: Fold classification based on structure-structure alignment of proteins. Nucl Acids Res. 1996, 24: 206-209. 10.1093/nar/24.1.206
- Holm L, Sander C: Dali/FSSP classification of three-dimensional protein folds. Nucl Acids Res. 1997, 25: 231-234. 10.1093/nar/25.1.231
- Holm L, Sander C: Touring protein fold space with Dali/FSSP. Nucl Acids Res. 1998, 26: 316-319. 10.1093/nar/26.1.316
- Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B. 1977, 39: 1-38.
- Wurst server. http://www.zbh.uni-hamburg.de/wurst
- Cheeseman P, Stutz J: Bayesian Classification (Autoclass): Theory and Results. Advances in Knowledge Discovery and Data Mining. Edited by: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R. 1995, 61-83. Menlo Park: The AAAI Press
- Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol. 1982, 162: 705-708. 10.1016/0022-2836(82)90398-9
- Smith TF, Waterman MS: Identification of Common Molecular Subsequences. J Mol Biol. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5
- Onuchic JN, Luthey-Schulten Z, Wolynes PG: Theory of protein folding: The energy landscape perspective. Annu Rev Phys Chem. 1997, 48: 545-600. 10.1146/annurev.physchem.48.1.545
- Goldstein RA, Luthey-Schulten ZA, Wolynes PG: Protein tertiary structure recognition using optimized Hamiltonians with local interactions. Proc Natl Acad Sci USA. 1992, 89: 9029-9033. 10.1073/pnas.89.19.9029
- Levitt M: Molecular dynamics of native protein. II. Analysis and nature of motion. J Mol Biol. 1983, 168: 621-657. 10.1016/S0022-2836(83)80306-4
- Rooman MJ, Rodriguez J, Wodak SJ: Automatic definition of recurrent local structure motifs in proteins. J Mol Biol. 1990, 213: 327-336. 10.1016/S0022-2836(05)80194-9
- Crippen GM: Easily searched protein folding potentials. J Mol Biol. 1996, 260: 467-475. 10.1006/jmbi.1996.0414
- Havel TF: The sampling properties of some distance geometry algorithms applied to unconstrained polypeptide chains: a study of 1830 independently computed conformations. Biopolymers. 1990, 29: 1565-1585. 10.1002/bip.360291207
- Russell A, Torda AE: Protein sequence threading – averaging over structures. Proteins. 2002, 47: 496-505. 10.1002/prot.10088
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389
- Attwood TK, Blythe MJ, Flower DR, Gaulton A, Mabey JE, Maudling N, McGregor L, Mitchell AL, Moulton G, Paine K, Scordis P: PRINTS and PRINTS-S shed light on protein ancestry. Nucl Acids Res. 2002, 30: 239-241. 10.1093/nar/30.1.239
- Liu J, Rost B: Domains, motifs and clusters in the protein universe. Curr Opin Chem Biol. 2003, 7: 5-11. 10.1016/S1367-5931(02)00003-0
- Sigrist CJA, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P: PROSITE: A documented database using patterns and profiles as motif descriptors. Brief Bioinform. 2002, 3: 265-274. 10.1093/bib/3.3.265
- Salami: Protein structure similarity searches based on classification probability vectors. http://www.zbh.uni-hamburg.de/salami
- Guyon F, Camproux AC, Hochez J, Tuffery P: SA-Search: a web tool for protein structure mining based on a Structural Alphabet. Nucl Acids Res. 2004, 32: W545-W548. 10.1093/nar/gkh467
- Koehl P: Protein structure similarities. Curr Opin Struct Biol. 2001, 11: 348-353. 10.1016/S0959-440X(00)00214-1
- Levine M, Stuart D, Williams J: A Method for the Systematic Comparison of the 3-Dimensional Structures of Proteins and Some Results. Acta Cryst A. 1984, 40: 600-610. 10.1107/S0108767384001239
- Martinez L, Andreani R, Martinez JM: Convergent algorithms for protein structural alignment. BMC Bioinformatics. 2007, 8: 306. 10.1186/1471-2105-8-306
- Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D: Design of a novel globular protein fold with atomic level accuracy. Science. 2003, 302: 1364-1368. 10.1126/science.1089427
- Song HK, Lee KN, Kwon KS, Yu MH, Suh SW: Crystal structure of an uncleaved alpha 1-antitrypsin reveals the conformation of its inhibitory reactive loop. FEBS Lett. 1995, 377: 150-154. 10.1016/0014-5793(95)01331-8
- Wilmot CM, Thornton JM: Analysis and Prediction of the Different Types of β-Turn in Proteins. J Mol Biol. 1988, 203: 221-232. 10.1016/0022-2836(88)90103-9
- Wilmot CM, Thornton JM: β-Turns and their distortions: a proposed new nomenclature. Protein Eng. 1990, 3: 479-493. 10.1093/protein/3.6.479
- Shindyalov IN, Bourne PE: A database and tools for 3-D protein structure comparison and alignment using the Combinatorial Extension (CE) algorithm. Nucl Acids Res. 2001, 29: 228-229. 10.1093/nar/29.1.228
- Ma L, Jorgensen AMM, Sorensen GO, Ulstrup J, Led JJ: Elucidation of the Paramagnetic R_{1} Relaxation of Heteronuclei and Protons in Cu(II) Plastocyanin from Anabaena variabilis. J Am Chem Soc. 2000, 122: 9473-9485. 10.1021/ja001368z
- Taylor AB, Stoj CS, Ziegler L, Kosman DJ, Hart PJ: The copper-iron connection in biology: Structure of the metallo-oxidase Fet3p. Proc Natl Acad Sci USA. 2005, 102: 15459-15464. 10.1073/pnas.0506227102
- Holm L, Sander C: Mapping the protein universe. Science. 1996, 273: 595-602. 10.1126/science.273.5275.595
- Ewens WJ, Grant GR: Statistical methods in bioinformatics. 2001, New York: Springer-Verlag
- Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian protein metabolism. Edited by: Munro HN. 1969, 21-123. New York: Academic Press
- Camproux AC, Gautier R, Tuffery P: A hidden Markov model derived structural alphabet for proteins. J Mol Biol. 2004, 339: 591-605. 10.1016/j.jmb.2004.04.005
- Camproux AC, Tuffery P, Buffat L, Andre C, Boisvieux JF, Hazout S: Analyzing patterns between regular secondary structures using short structural building blocks defined by a hidden Markov model. Theor Chem Account. 1999, 101: 33-40.
- Dong QW, Wang XL, Lin L: Methods for optimizing the structure alphabet sequences of proteins. Comput Biol Med. 2007, 37: 1610-1616. 10.1016/j.compbiomed.2007.03.002
- Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol. 1997, 268: 209-225. 10.1006/jmbi.1997.0959
- Tendulkar AV, Ogunnaike B, Wangikar PP: Protein local conformations arise from a mixture of Gaussian distributions. J Biosci. 2007, 32: 899-908. 10.1007/s12038-007-0090-4
- Wurst. http://backpan.perl.org/authors/id/W/WU/WURST/
- Kraulis PJ: MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J Appl Crystallogr. 1991, 24: 946-950. 10.1107/S0021889891004399
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.