Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of two sequences
© Polyanovsky et al; licensee BioMed Central Ltd. 2011
Received: 31 December 2010
Accepted: 27 October 2011
Published: 27 October 2011
Algorithms of sequence alignment are the key instruments for computer-assisted studies of biopolymers. Obviously, it is important to take into account the "quality" of the obtained alignments, i.e. how closely the algorithms manage to restore the "gold standard" alignment (GS-alignment), which superimposes positions originating from the same position in the common ancestor of the compared sequences. As an approximation of the GS-alignment, a 3D-alignment is commonly used not quite reasonably. Among the currently used algorithms of a pair-wise alignment, the best quality is achieved by using the algorithm of optimal alignment based on affine penalties for deletions (the Smith-Waterman algorithm). Nevertheless, the expedience of using local or global versions of the algorithm has not been studied.
Using model series of amino acid sequence pairs, we studied the relative "quality" of results produced by local and global alignments versus (1) the relative length of similar parts of the sequences (their "cores") and their nonhomologous parts, and (2) relative positions of the core regions in the compared sequences. We obtained numerical values of the average quality (measured as accuracy and confidence) of the global alignment method and the local alignment method for evolutionary distances between homologous sequence parts from 30 to 240 PAM and for the core length making from 10% to 70% of the total length of the sequences for all possible positions of homologous sequence parts relative to the centers of the sequences.
We revealed criteria allowing to specify conditions of preferred applicability for the local and the global alignment algorithms depending on positions and relative lengths of the cores and nonhomologous parts of the sequences to be aligned. It was demonstrated that when the core part of one sequence was positioned above the core of the other sequence, the global algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the local algorithm. On the contrary, when the cores were positioned asymmetrically, the local algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the global algorithm. This opens a possibility for creation of a combined method allowing generation of more accurate alignments.
Pair-wise alignment of amino acid sequences is the main method of comparative protein analysis. Among the most popular algorithms based on comparison of protein primary structures the Needleman-Wunch algorithm , the Smith-Waterman algorithm , BLAST , and FASTA  should be noted. On the basis of paper  the algorithm  was created for comparing sequences with intermittent similarities. The improved version  makes use of multiple parameter sets in computation of an optimal alignment of the two sequences. A number of algorithms (Walquist et al. , Litvinov et al. , etc.) also take into account specific features of protein primary structures. However, it is important to know how closely algorithmic alignments produced through optimization of any chosen target function reflect an evolution-based alignment of the appropriate amino acid sequences, e.g. the one, which juxtaposes the positions in the compared proteins originating from the same position in their common predecessor.
The "quality" of the alignment algorithms, i.e. mutual concordance of algorithmic and GS alignments, was analyzed from different points of view; in most cases, alignments based on intercomparison of three-dimensional structures were used as the GS alignments. It was premised on the fact that 3D structures of proteins are much more conservative than their amino acid sequences .
In other words, sequences corresponding to a certain fold are greatly confluent: the same structure corresponds to somewhat dissimilar or even totally dissimilar sequences. There are also a number of counter-examples, when similar sequences correspond to totally different 3D structures, but such examples are much less common . Vingron and Argos  demonstrated that there was a relationship between conservatism of the optimal global alignment region in a set of suboptimal alignments and its similarity with the structural alignment results. They showed that regions of optimal alignment, recurring most frequently in suboptimal alignments, were very similar to alignments produced by the structural alignment methods.
In works [12, 13], evaluation of the accuracy of the optimal alignment was based on determination of the matching accuracy for each pair of matched amino acid residues, with the following plotting of the robustness index values versus the number of the aligned pair of residues. For example, Mevissen and Vingron  used a weight difference for the optimal alignment and the alignment with the largest weight, in which residue i and residue j were not matched, as a measure of robustness for matching residue i with residue j. In the work of Schlosshauer and Olsson , the measure of validity for matching residue i with residue j was based on substituting the discrete function "max" in the dynamic programming algorithm with a parameter-dependent analog function. It allowed evaluating possible suboptimal alternatives for the chosen aligned pair of residues, thus also allowing a numerical evaluation of the accuracy of their matching. This numerical index calculated for each pair of residues serves as a measure of the local accuracy of the alignment.
As opposed to the works mentioned above, our evaluation of algorithmic alignment methods was based not on the assessment of the alignment results for a few selected positions, but on the comparison of algorithmic alignments with the GS alignment as a whole over the total length of the sequences (see [14–16]). From the results of comparison of structural alignments with local algorithmic Smith-Waterman alignments Sunyaev et al.  made a conclusion that the possibility of reconstruction of structural alignments from algorithmic ones depends on the degree of similarity of appropriate proteins; besides, examination of internal structures of both alignments allowed to develop a more efficient procedure for aligning two sequences, taking into account not only the mean level of their identity, but also the distribution of more or less similar regions in the sequences in the structural alignment.
However, all the works cited above had a common fault: algorithmic alignments were compared not with the true evolutionary alignments (which were unknown!), but with their approximations. This introduced an error in the results, which could not be estimated by the usual direct methods. We suggest using a comparison of artificially generated sequences to evaluate the quality of alignment algorithms, because the GS alignment for such sequences is known from the very beginning. A similar numerical experiment was described in [17, 18]. However, generation of the test set of sequences in  did not reflect completely available data on the evolutionary process, because insertions and deletions were generated in accordance with an over-simplified algorithm. In the work , to generate a set of test sequences, a different evolution model was used, which had been described in [19–21]. The model included both point mutations and indels. Numerical values of the mean accuracy were obtained for the global alignment algorithm with affine gap penalties (the global version of the Smith-Waterman algorithm) for various evolutionary distances.
The purpose of our work was to determine conditions of preferred applicability of the local and global versions of the algorithm for determination of optimal alignments with an affine penalty function for indels . Thereinafter, for the sake of brevity, this algorithm will be called the "Smith-Waterman algorithm". As is well known, the global algorithm finds such positions for gaps in the sequences, which correspond to the maximum value of the difference between summed weights of matched residues and summed penalties for the gaps. A local algorithm allows finding the optimal alignment of two fragments of the studied sequences, whereby the regions before and after the fragments forming the alignment with the maximum weight are not taken into account when the weight is calculated. Thus, unlike the global algorithm, the local algorithm allows determining not only optimal positions for gaps in some fragments, but also the fragments themselves, which provide for their appropriate positioning.
Our task was to determine the relative quality of alignments obtained through the global and local algorithms versus the degree of homology of similar regions in the sequences (the "cores") and the length of nonhomologous regions at the ends of the sequences (the "consoles"). In particular, we tried to determine the application threshold for the global algorithm, i.e. the values of the above-mentioned parameters, which provided for the same or better quality of alignments by using the global algorithm in comparison to the quality of alignments by using the local algorithm (see the definition of the alignment quality in section 2.3).
2.1 Preparation of test sets of sequence pairs
2.1.1 General description of sets
To carry out computer experiments we prepared 224 test sets, each set including 1000 sequence pairs ("test pairs"). All sets were prepared using the same technique, and different sets have different technique parameters (see below).
A test sequence pair consists of two sequences generated independent of each other from a common initial sequence ("ancestor"), each of their compared sequences consisting of a homologous part ("core") and nonhomologous parts ("consoles") surrounding the core.
PAM is the measure of evolutional distance from the common ancestor to the sequence cores (see );
r is the ratio of the total length of consoles to the length of the common ancestor of sequence cores (this ratio is accepted to be equal for the both sequences of a pair and, as a result, the total lengths of consoles are equal for the both sequences);
c is the ratio of the absolute value of the difference between console lengths to their total length (this ratio is the same for the both sequences, but for sequence S1 the left console is no longer than the right one, and vice versa for sequence S2).
PAM = 30, 60, 120, 240 PAM;
r = 10%, 20%, 50%, 100% , 200%;
c = 0% (the consoles are of the same length), 10%, 20%, ... 100% (one console is missing).
2.1.2 Description of generation of a test sequence pair
Generation of a common ancestor P of cores of test sequences;
Generation of cores K1 and K2 of test sequences in accord with the PAM value;
Estimation of the total length of consoles for each of the sequences in accord with the r values;
Estimation of the length of each console in accord with the c value;
Generation of consoles L1, R1 (the left and right consoles of the first sequence), L2, R2 (the left and right consoles of the second sequence);
- (vi)Construction of desired test sequences S1, S2:
To solve this task we used the following evolution model, which is a version of the evolution model used in our previous paper . Therein we analyzed the alignment of the ancestor sequence S0 and generated sequence S1 from it, whereas here we examine alignments of sequences S1 and S2 generated from the common ancestor sequence S0.
Generation of cores is described in subsection 2.1.3.
2.1.3 Construction of homologous regions of test sequences (cores)
Cores K1 and K2 were constructed of an ancestor sequence P independently and according to the same procedure. The procedure consisted of two stages.
where PAM is the number characterizing the evolutional distance between the original and modified sequences .
For each position i, i = 1, ..., L+1, we determined by a random choice if there is an indel. Then we decided at random if an insertion or deletion appeared, the probability of the deletion was chosen to be 0.55. This value was determined empirically, provided that the total length of insertions and deletions was equal. When p(ins) = p(del) = 0.5 at the accepted procedure, the total length of insertions was greater than that of deletions. The length of an insertion or deletion was chosen at random from Zipfian distribution which, as stated in , is independent of the evolutional distance. If the chosen deletion length exceeded the distance from position i to the sequence terminus (in particular, at i = L+1) or the beginning of the deletion coincided with the already deleted position, the attempt was ignored. Ignored was also the insertion if its beginning coincided with the elongation of the earlier made insertion or deletion. Thus, we prevented insertions and deletions, whose length was not equal to the value obtained from the preset distribution.
Correspondence of %id and PAM.
The letter composition of insertions was generated analogous to original sequences. It should be noted that contrary to the scheme used in , in this case the total length of insertions-deletions was not a constant value for all pairs of sequences from one test set and obeyed the law of random numbers (which is closer to the real evolutional set).
2.2. Comparison of "ancestor-descendant" and "descendant1-descendant2" tests
In the above procedure, test pairs of sequences are generated according to the "descendant1-descendant2" scheme (i.e. two descendants of a common ancestor are compared) rather than by the "ancestor-descendant" scheme (the ancestor sequence is compared to its descendant). The first scheme in a better way models a comparison of real sequences. However the PAM parameter, used traditionally for characterizing the evolutional distance, is second-scheme-oriented, but since the parameter is universal, nothing prevents its use for estimating the distance between descendants of a common ancestor.
Characteristics of modified sequences generated according to the sequential ("ancestor-descendant") and diverging ("descendant1-descendant2") schemes.
2.3 Alignment of sequences with consoles
Alignment of generated sequences was performed using two variations of the Smith-Waterman algorithm - the global and local ones. For global alignment the substitution weight matrix PAM250 and gap-open and gap-extension penalties 14 and 2 were used; for local alignment the Gonnet250 matrix and penalties 10 and 0.5 were used. Such parameter values were chosen on the base of the preliminary computer experiments [15, 17] which allowed getting the best values of the quality of alignments (see subsection 2.4) through all test sets. These values are close to those commonly used.
Prior to the tests with sequences of variable homology along the sequence, we investigated the dependence of the quality of the alignment on the applied substitution matrix and gap-open and gap-extension penalties. For this purpose the following numerical experiment was held.
Test sets including 1000 pairs of sequences each were generated according to the "sequential" scheme of evolution with average distances between ancestors and descendants of 60, 100, 200, 300 PAM (see the description of "sequential" and "diverging" schemes in item 2.2).
Pairs of sequences of each test set were aligned by global version of Smith-Waterman algorithm using two matrices: "native" (i.e. the PAM matrix corresponding to the evolutionary distance between sequences) and PAM250. Values of the measures of similarity (2.4) of algorithmic alignments via the reference alignments derived from "native" matrix and the PAM250 matrix, are practically the same (difference less than 1%, see the table in additional file 1, sheet PAM250). As we observed, the high homologous proteins (id > 45%) can be successfully aligned, using a matrices designed for large evolutionary distances. At the same time, attempts to align low homologous proteins, using a matrix designed for shorter distances, significantly deteriorate the results, signifying mismatch of algorithmic and reference alignments.
This is confirmed by a more complete numerical experiment. For this aim, each pair of sequences of all test sets were aligned using matrices PAM60, PAM100, PAM200, and PAM300. Gap-open and gap-extension penalties ranged from 10 to 20 and from 1 to 5, respectively (see the data in additional file 1, sheet Max Values). A practical conclusion of this finding, obviously, is that if the evolutionary distance between the sequences is unknown, the evolutionary distance corresponding to applied matrix, should be certainly greater than expected distance between the sequences.
Complete data on the comparative analysis of weight matrices are given in additional file 1, sheet All Values.
2.4 Determination of the quality of algorithmic alignments
3. Results and Discussion
We have analyzed the dependence of the quality (i.e. accuracy and confidence) of the global and local alignments of console sequences versus the following values: (1) the evolutional distance between homologous fragments of sequences ("cores"); (2) the console length; and (3) console asymmetry ("shifted cores").
3.1. Symmetrical consoles
Accuracy and confidence of global and local alignment at symmetrical consoles
Evolutional distance, PAM
Console length, %
Thus, at a symmetrical position of the consoles, the global algorithm has higher resistance to the increase of evolutional distance and console length than the local algorithm.
3.2. Asymmetrical consoles
It should be noted that the above accuracy values of global alignments are much lower than the values given in previous papers (see, for example, ). This is explained by the fact that we analyze the sequences with consoles, whereas in the cited and some other papers only alignments of sequences homologous as a whole were considered.
When the evolutional distance between homologous regions of sequences continues to increase, this tendency becomes more pronounced. Thus, when the evolutional distance is 60 PAM, the total console length is 100% and the core shift changes from 50% to 70%, the accuracy value decreases from 87% to 30%, and the confidence value decreases from 85% to 29%. When the evolutional distance is 120 PAM, the total console length is 50% and the core shift changes from 40% to 60%, the accuracy value decreases from 67% to 38%, and the confidence value decreases from 64% to 36%. When the evolutional distance is 240 PAM and the console length is 50%, the accuracy and confidence values begin to decrease at a shift of 10%.
So, an increase in the evolutional distance between homologous regions of sequences causes a noticeable decrease in the quality of global alignment, a considerable decrease of the quality of global alignment taking place at a decreasing length of consoles and a diminishing shift of the core from the center of sequences.
The entire records of the dependence of accuracy and confidence values on the three parameters are given in Additional File 2.
It is essential that at all considered values of PAM, length and asymmetry of consoles, there is no remarkable decrease in the quality of local alignment. In other words, in case of asymmetrical consoles, the local algorithm is more resistant to the increase in evolutional distance and console length than the global algorithm.
Thus, we can conclude that there exists some threshold value of the three parameters: extent of core homology, total length of consoles and asymmetry of consoles. Prior to this value, the quality of the global alignment is higher than that of the local alignment, but above this value the global alignment quality decreases sharply, whereas the local alignment quality remains the same.
3.3. Determination of the "slope zone"
The plots in Figure 3 show that when the asymmetry of consoles c changes from 0 to 100% (at fixed evolutional distances between cores and the total length of consoles), the accuracy and confidence of global alignments decrease. In this case the decrease is sharp in a relatively narrow (~ 20%) range of c values; this range will be called the slope zone. In this section we will demonstrate how to predict theoretically where this area is. It will be accepted that the substitution weight matrix and gap penalties are fixed (see subsection 2.3).
Let us consider one of the sets of test sequences described in subsection 2.1. Using D ker and L ker we will denote mean values: Dker(P) and Lker(P), respectively, for the sequences in the set under discussion.
where GOP and GEP are gap-open and gap-extension penalties.
of the mean weight of reference alignments to the mean length of test sequences in a corresponding set. The D glob value will be called the mean density of reference alignments. Similarly, the mean density of random alignments D rand will be the relation of the mean weight of optimal alignment of two independent random sequences to the length of these sequences (see the details in subsection 3.4).
(here the length of random sequences is L ker +L con ). In other words, it is impossible to restore the reference alignment if its weight does not (almost) differ from the alignment weight of random sequences. Using the L ker , PAM, r values, the above stated allows us to roughly determine the range of asymmetry values c at which the quality of global alignment decreases.
3.4. Determination of density of random alignments
where Pi is the probability of appearance of the i-th amino acid residue and S ij is the substitution weight.
3.5. Comparison with other algorithms
Additionally, we compared three algorithms: Needleman-Wunsch , Smith-Waterman  and GAP3 . A generalized global alignment algorithm (GAP3) is a development of the standard Needleman-Wunsch dynamic programming algorithm designed for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions. (In difference to the algorithm given in , the algorithm described in  besides the usual weight substitution matrix and gap-open and gap-extension penalties, requires an additional parameter - constant penalty for each difference block.)
A comparison test was carried out on thirteen sets of 1,000 pairs of sequences each with the following parameters (2.1.1): 1) evolutionary distance from a common ancestor to the sequence cores PAM = 120 PAM; 2) the ratio of the total length of consoles to the length of the common ancestor of sequence cores r = 20, 50, 100%; 3) the ratio of the absolute value of the difference between console lengths to their total length c = 0, 30, 50, 70%.
Accuracy and confidence of global, local and GAP3 alignment at various consoles
The table shows that in comparison with Smith-Waterman local alignment, in all cases the GAP3 alignment benefits slightly in Accuracy, but advantage in Confidence is significantly larger (it means that GAP3 makes a little more correct matches than the first algorithm, making fewer false matches).
As for comparison with global Needleman-Wunsch alignments, in the case without consoles the GAP3 alignments have a much lower Accuracy and a slightly higher Confidence. For symmetric consoles simultaneously with increase in their length (Core shift = 0, Console length = 20,...,100%), this tendency is reduced and becomes less prominent. It was mentioned above that as a result of the displacement of the cores of aligned sequences in the opposite direction from the center, the reliability of the global alignment is markedly reduced. However, a local alignment has no such a tendency. As shown, at the evolutionary distance of 120PAM, the GAP3 alignment is not only resistant to displacement, but also resistant to increasing the length of nonhomologous consoles.
The study has revealed regularities allowing for defining more exactly the areas of effective application of every algorithm: when consoles are positioned symmetrically, the global algorithm is more resistant to increasing evolutional distance and console length than the local algorithm (about 10% accuracy and about 8% confidence at 120PAM and up to 20% accuracy and confidence at 240PAM); quite the opposite, when consoles are asymmetrical, the local algorithm is more resistant to increasing evolutional distance and console length than the global algorithm. The boundary of the global algorithm preference is determined roughly by the value of asymmetrical position of homologous fragments of sequences (cores) at which the reference alignment density is almost equal to the density of random sequence alignment. The mean divergence of 5 ÷ 10%, which is typical both of accuracy and confidence of global and local alignments at a symmetrical position of cores, preconditions the developing of a combined method for making a more reliable alignment.
This work has been supported by a grant from the Presidium of the Russian Academy of Sciences (Molecular and Cell Biology Program), grants 08-01-92496 and 09-04-01053 from RFBR, grant 2011-07.514.11.4006 from MESRF and Russian-French Scientific Exchange Program.
- Needleman SB, Wunsch CD: A general method applicable to the search of similarity in the amino-acid sequence of two proteins. J Mol Biol 1970, 48:443–453.PubMedView Article
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195–197.PubMedView Article
- Altschul SF, Gish W, Miller W, Myers E, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403–410.PubMed
- Lipman DJ, Pearson WR: Rapid and sensitive protein similarity searches. Science 1985, 227:1435–1441.PubMedView Article
- Huang X, Chao KM: A generalized global alignment algorithm. Bioinformatics 2003, 19:228–233.PubMedView Article
- Huang X, Brutlag DL: Dynamic use of multiple parameter sets in sequence alignment. Nucleic Acids Res 2007, 35:678–686.PubMedView Article
- Wallqvist A, Fukunishi Y, Murphy LR, Fadel A, Levy RM: Iterative sequence secondary structure search for protein homologs: Comparison with amino acid sequence alignments and application to fold recognition in genome databases. Bioinformatics 2000, 16:988–1002.PubMedView Article
- Litvinov II, Lobanov MI, Mironov AA, Finkelstein MA: Information on the secondary structure improves the quality of protein sequence alignment. Molecular Biology 2006, 40:474–480.View Article
- Doolittle RF: Similar amino acid sequences: chance or common ancestry? Science 1981, 214:149–159.PubMedView Article
- Alexander PA, He Y, Chen Y, Orban J, Bryan PN: A minimal sequence code for switching protein structure and function. Proc Natl Acad Sci USA 2009, 106:21149–21154.PubMedView Article
- Vingron M, Argos P: Determination of reliable regions in protein sequence alignments. Prot Eng 1990, 3:565–569.View Article
- Mevissen HT, Vingron M: Quantifying the local reliability of a sequence alignment. Prot Eng 1996, 9:127–132.View Article
- Schlosshauer M, Olsson M: A novel approach to local reliability of sequence alignments. Bioinformatics 2002, 6:847–854.View Article
- Vogt G, Etzold T, Argos P: An assessment of amino acid exchange matrices in aliging protein sequences: the twilight zone revisited. J Mol Biol 1995, 249:816–831.PubMedView Article
- Domingues FS, Lackner P, Andreeva A, Sippl MJ: Structure-based evaluation of squence coparison and fold recognition alignment accuracy. J Mol Biol 2000, 297:1003–1013.PubMedView Article
- Sunyaev SR, Bogopolsky GA, Oleynikova NV, Vlasov PK, Finkelstein AV, Roytberg MA: From analysis of protein structural alignments toward a novel approach to align protein sequences. Proteins: Structure, Function and Bioinforrmatics 2004, 54:569–582.View Article
- Polyanovskii VO, Demchuk EY, Tumanyan VG: Efficiency of alignment procedure in respect of reconstruction reliability. Molecular Biology 1995, 28:833–835.
- Polyanovsky V, Roytberg MA, Tumanyan VG: Reconstruction of genuine pair-wise sequence alignment. J Comp Biol 2008, 15:379–391.View Article
- Dayhoff M, Schwartz R, Orcutt B: A model of evolutionary change in proteins. In Atlas of protein sequence and structure. Edited by: Dayhoff M. Washington: National Biomedical Research Foundation; 1978:345–352.
- Benner SA, Cohen MA, Gonnet GH: Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Biol 1993, 229:1065–1082.PubMedView Article
- Reese JT, Pearson WR: Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 2002, 18:1500–1507.PubMedView Article
- Waterman MS: Mathematical methods for DNA sequences. Boca Raton, Florida: CRC Press, Inc; 1989.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.