Instability in progressive multiple sequence alignment algorithms

Background Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time. Results We examine this tradeoff and find that, because of a loss of information in the early steps of the approach, the alignments generated by the most common multiple sequence alignment programs are inherently unstable, and simply reversing the order of the sequences in the input file will cause a different alignment to be generated. Although this effect is more obvious with larger numbers of sequences, it can also be seen with data sets in the order of one hundred sequences. We also outline the means to determine the number of sequences in a data set beyond which the probability of instability will become more pronounced. Conclusions This has major ramifications for both the designers of large-scale multiple sequence alignment algorithms, and for the users of these alignments.


Background
The creation of a multiple sequence alignment is a routine step in the analysis of homologous genes or proteins. For aligning more than a few hundred sequences, most methods use a heuristic approach termed "progressive alignment" by Feng and Doolittle [1]. This is a two-stage process: first a guide tree [2] is created by clustering the sequences based on some distance or similarity measure, and then the branching structure of the guide tree is used to order the pairwise alignment of sequences. The power of progressive multiple sequence alignement may come from the fact that "more similar" sequences are aligned first: "...assuming that in progressive alignment, the best accuracy is obtained at each node by aligning the two profiles that have fewest differences, even if they are not evolutionary neighbours" [3].
The guide tree determines the order in which the sequences are aligned. All sequences are compared to each other to generate a matrix of distance measures between each pair of sequences. By necessity, the calculation of these distance measures must be fast as it will clearly require O(N 2 ) time and memory for N sequences. Most alignment programs use k-tuple scores [4,5] to measure the similarity of two sequences, or related word-based measures. Some use other string-matching algorithms to the same effect. While these approaches are fast, they only score exact matches between two sequences. For proteins, amino acids that are considered very similar, for example using the PAM [6] or BLO-SUM [7] matrices, are treated as complete mismatches.
This paper examines the impact of the tradeoff of accuracy for speed in the construction of the guide trees in protein progressive multiple sequence alignment. We find that, because of a loss of information when calculating the distance measures, the alignments generated are inherently unstable. This instability is easily seen by changing the order of the protein sequences in the input file. This will cause a different alignment to be generated. We also show that, while this instability is more apparent with larger alignments and with some alignment programs, it is also found in small alignments of less than 100 sequences. This instability is due to huge numbers of tied scores in the distance matrices used to make the guide trees. With word-based distances, there is a relatively small number of possible distance scores that can be found between two sequences. This number will depend on the length of the sequences and on the metric used. The effect is that once you get to even moderately large numbers of sequences, the distance matrix will have many tied scores which would ideally be represented in the guide tree as multifurcations. Progressive alignment is a strictly pairwise algorithm and the branching order within these tied groups will be completely arbitrary and determined purely by how the clustering code was written. If you change the sequence order, you will change the cluster order and hence the order of progressive alignment. This means that the supposed power of the guide tree to sensibly align the sequences in the correct order is lost and the considerable computation effort required to calculate them may be completely wasted.

HomFam
The analysis presented here uses the HomFam alignment benchmark system [8]. This consists of the single-domain Pfam [9] (version 25) families which have at least 5 members with known structures in a HOMSTRAD [10] structural alignment. We measure the proportion of correctly aligned core columns out of all aligned core columns in the reference sequences (BAliSCORE TC score [11]), when these sequences are embedded in larger data sets. The TC score ranges from 0.0 (no core columns in the reference sequences correctly aligned) to 1.0 (all reference sequence core columns correctly aligned). An alternative TC score measures the proportion of all correctly-aligned columns. While the results were similar, we use core columns in this paper.
On examining the HomFam sequences, it was noticed that a number of proteins had the same amino acid sequence even though they were (correctly) labelled differently in Pfam. As an example, in the zinc finger family (Pfam accession number PF00096), the sequence information for: >D2I3U5_AILME/95-116 ACADCGKTFSQSSHLVQHRRIH and >ZN787_HUMAN/95-116 ACADCGKTFSQSSHLVQHRRIH are identical. Table 1 shows the number of sequences in each HomFam family and the number of these which are unique. In the remaining analysis, duplicate sequences were removed from the HomFam families. Having duplicate sequences will automatically give tied distances and we wished to separate this effect from effects due to using k-tuple scores. One side effect of the removal process is that the remaining sequences are sorted in ascending alphabethical order of the sequences (not the sequence names) within each of the HomFam families. As each dataset is later randomly shuffled before being aligned, this will not have an effect on any of the alignments produced.

Software
This article examines the instability of the alignments produced by the progressive multiple sequence alignment programs Clustal Omega [12], Kalign [13], Mafft [14] and Muscle [3]. These programs were selected based on their widespread use, their ability to align more than a thousand protein sequences, and their use of a guide tree based on the similarity between each pair of sequences to determine the order in which the sequences will be aligned.
Each of the alignment programs generates a distance matrix containing the similarity or distance measures between all pairwise combinations of input sequences. Kalign does not output this distance matrix by default, but on examining kalign2_main.c line 135, the code to output the distance matrix has been commented out. This code was uncommented and modified to output the distance matrix to a specific text file. In addition, the distance measures were output to 25 decimal places to ensure that any duplicates were not as a result of rounding when formatting the output.
The other three alignment programs were also modified to output distance measures to 25 decimal places: Clustal Omega: line 327 of clustal/symmatrix.c; Mafft: line 2643 of io.c; Muscle: line 59 of fastclust.cpp.
For all four alignment programs, the runtime parameters were limited to those required to generate a distance matrix. By default, Clustal Omega uses the mBed algorithm [8] to cluster the sequences on the basis of a small number of "seed" sequences. This only requires the calculation of the similarity measures between these seed sequences and all other sequences in the input file. By requesting that a full distance matrix be generated and output, the sequences were clustered using the similarity measures between all pairs of input sequences.
For Mafft, the FFT-NS-1, FFT-NS-2 and G-INS-1 algorithms were used. With FFT-NS-1, a distance matrix is first generated using the 6-tuple score between each pair of sequences-both sequences are scanned from the start for matching 6-tuples, and when a match is found the score is incremented and scanning continues from the next residue [4]. A guide tree is then constructed by clustering according to these distances, and the The list of HomFam protein families, the total number of sequences in each family, the number of unique sequences, and the percentage of the total number of sequences that are duplicates sequences are then aligned using the branching order of the guide tree. With FFT-NS-2, the alignment produced by the FFT-NS-1 method is used to regenerate the distance matrix and the guide tree, and then do a second progressive alignment. In this paper, FFT-NS-1 will be specified whenever distance measures are needed. If no distance measures are required, the default FFT-NS-2 method will be used. The G-INS-1 algorithm was also used in Figure1 for comparison with a distance measure that doesn't rely on matching k-tuples.
With Muscle, the number of iterations was limited to 2 rather than the default of 16. This is the number of iterations recommended by the authors for large datasets.
The program versions and runtime parameters used are as follows:

Supporting material
A package of utility programs, data files and scripts is available for download from http://www.bioinf.ucd.ie/ download/2015instability.tar.gz.

Results and discussion
In the following sections, we refer to distance matrices and the calculation of distances between sequences. In most of the cases we discuss here, we actually use similarity scores. Nonetheless these can be easily converted to distances and we retain the use of the words distance and distances out of convenience.

Alignment instability
For each of the 94 HomFam families we selected the HOMSTRAD reference sequences and a random selection of sequences to make up 1000 sequences in total.
Families with an insufficient number of sequences were excluded, leaving a total of 68 families. The 1000 sequences were randomly shuffled, a default alignment was generated (for Mafft, both the FFT-NS-2 and G-INS-1 algorithms were used), and the alignment quality measured using its BAliSCORE TC score. The order of the sequences in the input file was then reversed, the alignment repeated with the same parameters and the quality of this alignment measured. The difference between the two quality scores was then calculated. This process was repeated 10 times for each of the 68 Hom-Fam families, and the results are presented in Fig. 1.
In the first four panels, and for virtually all of the represented HomFam families, reversing the order in which the sequences are listed in the input file has an impact on the quality of the alignments produced. For some protein families and alignment programs this impact is considerable, with the alignment of up to 50 % of columns in the reference sequences changing by reversing the order of the input sequences. In the fifth panel, Mafft G-INS-1 uses Needleman-Wunsch [15] to calculate the distance measures between each pair of sequences. Although some instability is still present, it is significantly lower than for the other alignment programs using their default parameters.
It should be noted that Mafft's G-INS-1 is considerably slower than FFT-NS-2 for the given number of sequences, taking approximately two orders of magnitude longer to run. It also requires over ten times more memory, and both memory and time requirements scale quadradically. As a result it is not recommended for aligning more than a few hundred sequences, but was included in the figure for reference purposes. In the remainder of this paper, we will only examine the distance measure calculations used when aligning larger numbers of sequences.

Unique distances
Clustal Omega uses 1-tuple scores to determine the distance measures between proteins, where the scores are calculated in the same was as Mafft's 6-tuple score except for the different lengths of matching string. Muscle uses 6-tuple scores calculated in the same way as Mafft, and Kalign uses the Muth Manber [16] approximate string matching algorithm. Such methods essentially count the number of matches between sequences, ignoring both the position of the matches and the actual values matched. The number of matches between sequences is therefore related to the lengths of the sequences. Clearly different, unrelated pairs of sequences can generate the same distance. In addition, the chances of seeing such matches will increase as the number of sequences being aligned increases. It is not clear how the clustering algorithm used in each of the alignment programs resolves such ties in distance measures. However, unless this scenario is specifically catered for, the default approach will be to choose between pairs of sequences based on their positions in the input file, either the first pair with that distance measure or the last pair.
In order to further investigate the frequency of such tied distances, the number of unique distance values in a distance matrix computed for a dataset was determined. The four alignment programs, Clustal Omega, Kalign, Mafft and Muscle were run with the parameters listed previously on random samples of sequences drawn from each of the HomFam protein families. Sample sizes ranged from 50 to 10,000 sequences (or as many unique sequences as were in the HomFam family), and each sampling was repeated 100 times. The number of unique distances were counted in the distance matrices produced from each alignment, and the mean number of unique distances for each family and number of sequences are presented in Fig. 2.
The number of unique distances generated by Mafft is considerably higher than for the other alignment programs. However, for all alignment programs, the numbers of unique distances show clear trends of levelling off as the number of sequences increases. In addition, as the total number of distances calculated is given by N (N − 1)/2 for N sequences, for the larger data sets the vast majority of distance measures are duplicated in each alignment program.

Same length sequences
To determine why the number of unique distances reaches a plateau while the total number of pairwise distances increases quadratically, we need to examine how the distances between sequences are calculated. To simplify the analysis, we will first look at sequences of the same length.
Clustal Omega uses 1-tuple scores for comparing sequences. With sequences of the same length it can therefore only generate a maximum of L + 1 unique distances where L is the length of the sequences. These correspond to sequences with no matches, 1 match, 2 matches, etc. up to identical sequences. Mafft and Muscle use 6-tuple scores, so the maximum number of unique distances between sequences of length L is (L − 5) + 1 where (L − 5) is the number of 6-tuples in a sequence and the additional +1 is necessary if no matches are found. The calculation of these distance measures ignores both the position of the matches and the values matched.
Depending on the actual amino acids, Kalign calculates the distance measure as zero between pairs of protein sequences of up to 32 amino acids each.

Different length sequences
Clustal Omega scales the 1-tuple scores by the length of the shorter of the two sequences. Similarly Muscle scales its 6-tuple scores by the number of 6-tuples in the shorter sequence, and Kalign scales based on the length of the longer sequence. In Mafft, the distance measure is calculated as: where: S ij is the 6-tuple score between sequences i and j, and x and y are the lengths of the longer and shorter sequences respectively. The additional scaling is deemed necessary as D ij can be near zero when comparing very short and very long sequences, even if the sequences are unrelated.

Theoretical maximum number of unique distances and sequences
Based on this analysis, the two factors that determine the number of different possible distance measures are the lengths of the sequences and the number of different sequence lengths. For simplicity, we will ignore the minor adjustments to the sequence lengths due to using 1-tuples or 6-tuples. Hence, for Clustal Omega, Kalign and Muscle, the theoretical maximum number of unique distances is given as the product of the longest sequence length and the number of different sequence lengths in the dataset. For Mafft, as both sequence lengths are included in the additional scaling factor, the theoretical maximum is the longest sequence length times the square of the number of different sequence lengths. These theoretical maxima are conservative as all sequences may not be as long as the longest sequence, and all possible matches for all sequence lengths may not be found. So, for Clustal Omega, Kalign and Muscle: and for Mafft: where MaxUniqueDists is the theoretical maximum number of unique distances, MaxSeqLength is the length of the longest sequence in the dataset, and Count(SeqLengths) is the number of different sequence lengths. In addition where MaxSeqs is the maximum number of sequences that can be aligned before duplicate distance measures are generated. Figure 3 plots these theoretical maxima for Clustal Omega, Kalign and Muscle (3a), and Mafft (3b) for each HomFam family based on all sequences in each family. Also shown are the maximum numbers of unique distances for each family found in the datasets used to construct Fig. 2 previously. As can be seen, the pattern of unique distances in the datasets follows but is lower than the theoretical maxima.
The lower plots in Fig. 3 shows the maximum number of sequences that can be aligned without duplicate distance measures, derived from these maximum numbers of unique distances. Again these maximum numbers of sequences are a conservative measure, as they are based on all lengths of sequences occurring in the dataset and each sequence having its full range of possible matches. Perhaps the most stiking thing about the lower plot is that the numbers of sequences are so low, particularly for Clustal Omega, Kalign and Muscle.
It should be noted, however, that duplicate distance measures do not necessarily lead to instability in the alignment generated. It will depend on whether the duplicate measures are the lowest values in the distance matrix at that MaxSeqs(MaxSeqs − 1)/2 = MaxUniqueDists step in the clustering process, which will in turn depend on what has happened in the previous clustering steps. Hence, we cannot say for definite that duplicate measures will lead to alignment instability. However, as the number of duplicate measures increases, so too does the likelihood of alignment instability. As the alignment instability is determined by the characteristics of the input sequences, we recommend that alignment programs be modified to issue a warning of potential instability when the clustering algorithm encounters tied distance measures.

Smaller alignments
While the instability demonstrated earlier is more apparent in larger alignments, it can also be present when smaller numbers of sequences are aligned. This can be shown by randomly selecting 50, 100 and 250 sequences (including each family's reference sequences) from each HomFam family and calculating the TC scores for the forward and reversed datasets, as was done in Fig. 1. 100 random samples were used for each HomFam family and for each of the three dataset sizes. For each sample, the forward and reverse TC scores were compared, and the number of differences for each HomFam family were counted. These counts are shown in Fig. 4.
As can be seen, the instability in sequence alignments occurs even with small alignments. Also, as the number of sequences increases so too does the number of differences in TC scores. While there is no clear trend between the number of unique distances and the number of TC score differences for a particular alignment program, this trend can be seen across the different programs-Mafft shows the fewest number of differences in TC scores and Kalign the most.

Algorithm symmetry
It should also be pointed out that another reason for the difference in TC scores reported above may be due to the asymmetry of the different implementations of distance measure calculations. Different distance measures could then cause a different clustering order and give a different tree topology, causing sequences to be aligned in a different order.
To illustrate, we randomly select two Retroviral aspartyl protease (Pfam accession number PF00077) sequences, run the four alignment programs and extract the distance measures between the two sequences. The order of the two sequences is then reveresed, the alignment programs run again, and the distance measure from this second run is compared with the original. (Clustal Omega requires a minimum of three sequences, so three sequences were selected at random and the distances between the first and third sequences were compared.) Out of 10,000 samples, for Clustal Omega there were 9 different distances identified. With Mafft and Muscle, no different distance measures were found. However, with Kalign 6516 differences were found.

Conclusions
In this paper we have demonstrated a very strong dependence on the order of the input sequences in a data file when we measure multiple alignment accuracy. This effect is disconcerting as merely changing the order of the sequences can change the alignment. The scale of this effect is somewhat surprising and mainly shows up when the numbers of sequences grows large. It can, nonetheless, be seen in data sets of the order of a hundred sequences or so.
We have also noticed that when we examine distance matrices generated by some widely used MSA packages that these become increasing dominated by tied values. The more sequences you have, the greater the percentage of the scores in a distance matrix that are duplicates of other scores. We can trace this effect to the use of k-tuple scores for computing these distances. For sequences of a given length, there is a finite and relatively small number of possible scores that can be generated. For shorter length sequences, the number of possible distances is also reduced. If you use real alignment scores using an amino acid weight matrix such as BLOSUM [7], the number of possible scores is still finite although much greater than with k-tuple distances. Given enough sequences though, you will inevitably get many tied values in a distance matrix. The use of such alignment scores is limited however, to relatively small datasets as they are expensive to compute, as was seen with Mafft G-INS-1 in Fig. 1. For really big alignments, of many thousands of sequences, we have little alternative to the use of k-tuple or word based scores at some stage of the progressive alignment procedure. Iteration, as carried out by Clustal Omega, Mafft and Muscle can help as the later alignments can use real alignment scores but these are very expensive computationally and do not eliminate tied scores. It is also possible to mitigate the alignment instability by, say, ordering the input sequences lexicographically before calculating the k-tuple scores. However, while this will result in a consistent alignment being produced, it is  Theoretical maximum and actual number of unique distances, and maximum theoretical number of sequences that can be aligned without duplicate distance measures. The theoretical maximum number of unique distances for each HomFam family, the actual number of unique distances found in the datasets used to generate Fig. 1, and the maximum theoretical number of sequences that can be aligned without generating duplicate distance measures based on the calculation that N sequences will produce N(N − 1)/2 distance measures difficult to justify from a biological point of view why one particular alignment out of numerous alternatives should be chosen. The solution to this issue is not clear-cut. We have previously shown [17] that the accuracy of progressive alignment decreases markedly with very large datasets. We assumed this was due to the greedy nature of the algorithm. Here we show that progressive alignment also produces alignments that have a strong dependence on the sequence order in the input file. The use of "chained" guide trees [18] can help improve accuracy but will still have a strong dependence on input file sequence order.