MSARC: Multiple sequence alignment by residue clustering
 Michał Modzelewski^{1} and
 Norbert Dojer^{1}
https://doi.org/10.1186/1748-7188-9-12
© Modzelewski and Dojer; licensee BioMed Central Ltd. 2014
Received: 1 December 2013
Accepted: 6 April 2014
Published: 16 April 2014
Abstract
Background
Progressive methods offer efficient and reasonably good solutions to the multiple sequence alignment problem. However, the resulting alignments are biased by guide trees, especially for relatively distant sequences.
Results
We propose MSARC, a new graph-clustering based algorithm that aligns sequence sets without guide trees. Experiments on the BAliBASE dataset show that MSARC achieves alignment quality similar to the best progressive methods.
Furthermore, MSARC outperforms them on sequence sets whose evolutionary distances are difficult to represent by a phylogenetic tree. These datasets are most exposed to the guide-tree bias of alignments.
Availability
MSARC is available at http://bioputer.mimuw.edu.pl/msarc
Background
Determining the alignment of a group of biological sequences is among the most common problems in computational biology. The dynamic programming method of pairwise sequence alignment can be readily extended to multiple sequences but requires the computation of an n-dimensional matrix to align n sequences. Since the size of such a matrix is exponential with respect to n, the time and space complexity of this method is exponential too.
Progressive alignment [1] offers a substantial complexity reduction at the cost of possible loss of the optimal solution. Within this approach, subset alignments are sequentially pairwise aligned to build the final multiple alignment. The order of pairwise alignments is determined by a guide tree representing the phylogenetic relationships between sequences.
The progressive alignment approach has two drawbacks. First, the accuracy of the guide tree affects the quality of the final alignment. This problem is particularly important in the field of phylogeny reconstruction, because multiple alignment acts as a preprocessing step in most prominent methods of inferring a phylogenetic tree of sequences. It has been shown that, within this approach, the inferred phylogeny is biased towards the initial guide tree [2, 3].
Second, only sequences belonging to the currently aligned subsets contribute to their pairwise alignment. Even if a guide tree reflects correct phylogenetic relationships, these alignments may be inconsistent with the remaining sequences, and the inconsistencies are propagated to further steps. To address this problem, in recent programs [4–8] progressive alignment is usually preceded by a consistency transformation (incorporating information from all pairwise alignments into the objective function) and/or followed by iterative refinement of the multiple alignment of all sequences. Moreover, several strategies that avoid guide trees altogether have recently been proposed [9–11].
Experiments on the BAliBASE dataset [12] show that our approach is competitive with the best progressive methods and significantly outperforms most non-progressive algorithms. Moreover, MSARC is the best aligner for sequence sets with very low levels of conservation. This feature makes MSARC a promising preprocessing tool for phylogeny reconstruction pipelines.
Methods
MSARC aligns sequence sets in several steps. In a preprocessing step, following Probalign [8], stochastic alignments are calculated for all pairs of sequences and a consistency transformation is applied to the resulting posterior probabilities of residue correspondences. The transformed probabilities, called residue alignment affinities, represent weights of an alignment graph^{a}.
Pairwise stochastic alignment
The concept of stochastic (or probability) alignment was proposed in [14]. Given a pair of sequences, this framework defines statistical weights of their possible alignments. Based on these weights, for each pair of residues from both sequences, the posterior probability of being aligned may be computed.
A consensus of highly weighted suboptimal alignments was shown to contain pairs with significant probabilities that agree with structural alignments despite the optimal alignment deviating significantly. Mückstein et al. [15] suggest the use of the method as a starting point for improved multiple sequence alignment procedures.
where β corresponds to the inverse of Boltzmann’s constant and should be adjusted to the match/mismatch scoring function s(x,y) (in fact, β simply rescales the scoring function).
Let P(a_{ i }∼b_{ j }) denote the posterior probability that residues a_{ i } and b_{ j } are aligned.
Here we use the notation ${\mathcal{A}}_{i,j}$ for an alignment of the sequence prefixes a_{1}⋯a_{ i } and b_{1}⋯b_{ j }, and ${\hat{\mathcal{A}}}_{i,j}$ for an alignment of the sequence suffixes a_{ i }⋯a_{ m } and b_{ j }⋯b_{ n }. Analogously, Z_{i,j} is the partition function over the prefix alignments and ${\hat{Z}}_{i,j}$ is the (reverse) partition function over the suffix alignments.
The reverse partition function can be calculated using the same recursion in reverse, starting from the ends of the aligned sequences.
In this case insertions and deletions must be separated by at least one match/mismatch position. This variant was proposed by Miyazawa [14] and applied in the Probalign [8] and MSAProbs [18] aligners.
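The forward and reverse partition functions described above can be sketched in a few lines. This is a minimal illustration, not MSARC's implementation: it assumes a single linear gap penalty instead of the affine and terminal gap penalties of the full model, and the names `posterior_match_probs`, `simple_score`, `fwd` and `bwd` are hypothetical. The posterior of a residue pair is the forward weight of the prefixes, times the Boltzmann weight of the pair, times the reverse weight of the suffixes, divided by the total partition function.

```python
import math

def posterior_match_probs(a, b, score, gap=-2.0, beta=1.0):
    """Posterior P(a_i ~ b_j) under a Boltzmann-weighted alignment ensemble.

    Simplification (an assumption of this sketch): one linear gap penalty per
    gapped position instead of affine open/extension/terminal penalties.
    """
    m, n = len(a), len(b)
    g = math.exp(beta * gap)  # Boltzmann weight of one gapped position

    # fwd[i][j]: partition function over alignments of prefixes a[:i], b[:j]
    fwd = [[0.0] * (n + 1) for _ in range(m + 1)]
    fwd[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += fwd[i - 1][j - 1] * math.exp(beta * score(a[i - 1], b[j - 1]))
            if i > 0:
                total += fwd[i - 1][j] * g
            if j > 0:
                total += fwd[i][j - 1] * g
            fwd[i][j] = total

    # bwd[i][j]: partition function over alignments of suffixes a[i:], b[j:]
    bwd = [[0.0] * (n + 1) for _ in range(m + 1)]
    bwd[m][n] = 1.0
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m and j == n:
                continue
            total = 0.0
            if i < m and j < n:
                total += bwd[i + 1][j + 1] * math.exp(beta * score(a[i], b[j]))
            if i < m:
                total += bwd[i + 1][j] * g
            if j < n:
                total += bwd[i][j + 1] * g
            bwd[i][j] = total

    z = fwd[m][n]  # total partition function over all alignments
    return [[fwd[i][j] * math.exp(beta * score(a[i], b[j])) * bwd[i + 1][j + 1] / z
             for j in range(n)] for i in range(m)]

def simple_score(x, y):
    # Toy match/mismatch scores, purely illustrative
    return 2.0 if x == y else -1.0

probs = posterior_match_probs("ACGT", "AGT", simple_score)
```

Each row and column of `probs` sums to at most 1, the remainder being the probability that the residue is aligned to a gap.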
Alignment graphs
Let us regard probabilities P(a_{ i }∼b_{ j }) as a representation of a bipartite graph with weighted edges, i.e. a graph with residues from both sequences as nodes and edges joining each a_{ i } with each b_{ j }.
Given a set S of k sequences to be aligned, we would like to analogously represent their residue alignment affinity by a k-partite weighted graph. It may be obtained by joining the pairwise alignment graphs for all pairs of S-sequences. However, computing edge weights separately for each pair of sequences does not exploit the information contained in the remaining alignments. We therefore address this problem with a so-called consistency transformation [4, 7], successfully used in progressive methods.
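Joining pairwise graphs into a k-partite graph can be sketched as follows. The representation is an assumption made for illustration: nodes are `(sequence index, residue index)` pairs, edges live in a dictionary, and entries below a cutoff are dropped to keep the graph sparse (0.01 is the default cutoff quoted later in the text). The function names are hypothetical.

```python
def pairwise_edges(x, y, P, cutoff=0.01):
    """Edges between residues of sequences x and y whose weight is >= cutoff."""
    return {((x, i), (y, j)): p
            for i, row in enumerate(P)
            for j, p in enumerate(row)
            if p >= cutoff}

def alignment_graph(posteriors, cutoff=0.01):
    """Join pairwise graphs; `posteriors` maps (x, y) with x < y to a matrix."""
    edges = {}
    for (x, y), P in posteriors.items():
        edges.update(pairwise_edges(x, y, P, cutoff))
    return edges

# Toy posterior matrices for three sequences of length 2 (made-up numbers)
posteriors = {
    (0, 1): [[0.9, 0.005], [0.0, 0.8]],
    (0, 2): [[0.7, 0.1], [0.02, 0.6]],
    (1, 2): [[0.85, 0.0], [0.0, 0.75]],
}
graph = alignment_graph(posteriors)
```

No edge joins two residues of the same sequence, which is what makes the graph k-partite.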
where w_{ x y } are weights specifying the relative contribution to the transformation of a sequence pair xy.
where · stands for matrix multiplication.
It results in the variant of consistency transformation used in Probalign [8] and ProbCons [7].
The idea behind the above formula is that the sum of a row/column of a matrix P_{ a b } yields the probability that the corresponding residue is aligned to a residue in the other sequence (rather than to a gap). If sequences a and b are similar, alignments with fewer gaps are preferred, so (at least for the shorter sequence) most of the sums are close to 1. Consequently, w_{ a b } is close to 1 as well. On the other hand, weights are much closer to 0 for pairs of dissimilar sequences.
Thus w_{ a b } measures the similarity of sequences a and b. Therefore sequences c that are similar to a and b contribute to ${P}_{\mathit{\text{ab}}}^{\prime}$ more significantly than others.
The consistency transformation may be iterated any number of times, but excessive iterations blur the structure of residue affinity. Following Probalign [8] and ProbCons [7], MSARC performs two iterations by default.
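One iteration of the transformation can be sketched as below. The exact formulas are assumptions reconstructed from the prose: the weight of a pair is taken to be the total mass of its posterior matrix divided by the length of the shorter sequence, and the transformed matrix is the weighted average of the products P_{ac}·P_{cb} over all sequences c (with P_{aa} the identity). Setting `weighted=False` gives the plain average. All function names are hypothetical.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def pair_weight(P):
    # Assumed weight: close to 1 when nearly every residue of the shorter
    # sequence is aligned to a residue rather than to a gap
    return sum(map(sum, P)) / min(len(P), len(P[0]))

def consistency_transform(P, lengths, weighted=True):
    """One iteration; P maps (x, y) with x < y to a posterior matrix."""
    k = len(lengths)

    def mat(x, y):
        if x == y:
            return identity(lengths[x])
        return P[(x, y)] if (x, y) in P else transpose(P[(y, x)])

    def w(x, y):
        return pair_weight(mat(x, y)) if weighted else 1.0

    out = {}
    for a in range(k):
        for b in range(a + 1, k):
            acc = [[0.0] * lengths[b] for _ in range(lengths[a])]
            total = 0.0
            for c in range(k):
                wt = w(a, c) * w(c, b)
                prod = matmul(mat(a, c), mat(c, b))
                for i, row in enumerate(prod):
                    for j, v in enumerate(row):
                        acc[i][j] += wt * v
                total += wt
            # Leave an all-zero matrix untouched instead of dividing by zero
            out[(a, b)] = [[v / total if total else v for v in row] for row in acc]
    return out

# Sequences 1 and 2 are weakly related directly but both resemble sequence 0
P = {
    (0, 1): [[0.9, 0.0], [0.0, 0.9]],
    (0, 2): [[0.9, 0.0], [0.0, 0.9]],
    (1, 2): [[0.1, 0.0], [0.0, 0.1]],
}
plain = consistency_transform(P, [2, 2, 2], weighted=False)
weighted = consistency_transform(P, [2, 2, 2], weighted=True)
```

In this toy case the affinity between sequences 1 and 2 rises in both variants, and rises more in the weighted one, because the evidence routed through the similar sequence 0 carries more weight.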
Residue clustering
Towards this objective, MSARC applies top-down hierarchical clustering (see Figure 2). Within this approach, the alignment graph is recursively split into two parts until no ambiguous cluster is left. Each partition step results from a single cut through all sequences, so clusterings are conflict-free at each step of the procedure. Consequently, the final clustering represents a proper multiple alignment.
Optimal clustering is expected to maximize residue alignment affinity within clusters and minimize it between them. Therefore, the partition selection in recursive steps of the clustering procedure should minimize the sum of weights of edges cut by the partition. This is in fact the objective of the well-known problem of graph partitioning, i.e. dividing graph nodes into roughly equal parts such that the sum of weights of edges connecting nodes in different parts is minimized.
The Fiduccia-Mattheyses algorithm [13] is an efficient heuristic for the graph partitioning problem. After selecting an initial, possibly random partition, it calculates for each node the change in cost caused by moving it to the other part, called the gain. Subsequently, single nodes are greedily moved between parts based on the maximum gain, and the gains of the remaining nodes are updated. The process is repeated in passes, where each node can be moved only once per pass. The best partition found in a pass is chosen as the initial partition for the next pass. The algorithm terminates when a pass fails to improve the partition. Grouping single moves into passes helps the algorithm to escape local optima, since intermediate moves in a pass may have negative gains. An additional balance condition is enforced, disallowing movement from a part that contains fewer than a minimum desired number of nodes.
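The pass structure can be sketched on a generic weighted graph as follows. This is a simplified illustration with hypothetical names: gains are recomputed from scratch at every move, whereas the original algorithm maintains them incrementally in bucket lists to reach linear time, and the balance rule is reduced to "never shrink a part to fewer than `min_size` nodes".

```python
def cut_weight(w, side):
    """Total weight of edges whose endpoints lie in different parts."""
    return sum(wt for (u, v), wt in w.items() if side[u] != side[v])

def fm_pass(nodes, w, side, min_size):
    """One pass: every node moves at most once; return the best partition seen."""
    side = dict(side)
    adj = {u: [] for u in nodes}
    for (u, v), wt in w.items():
        adj[u].append((v, wt))
        adj[v].append((u, wt))

    def gain(u):
        # Reduction of the cut weight obtained by moving u to the other part
        return sum(wt if side[v] != side[u] else -wt for v, wt in adj[u])

    size = {0: sum(1 for u in nodes if side[u] == 0),
            1: sum(1 for u in nodes if side[u] == 1)}
    locked = set()
    cost = cut_weight(w, side)
    best_cost, best = cost, dict(side)
    while True:
        cands = [u for u in nodes
                 if u not in locked and size[side[u]] > min_size]
        if not cands:
            break
        u = max(cands, key=gain)   # greedy choice; the gain may be negative
        cost -= gain(u)
        size[side[u]] -= 1
        side[u] = 1 - side[u]
        size[side[u]] += 1
        locked.add(u)
        if cost < best_cost:
            best_cost, best = cost, dict(side)
    return best, best_cost

def fiduccia_mattheyses(nodes, w, side, min_size):
    """Repeat passes until a pass fails to improve the partition."""
    cost = cut_weight(w, side)
    while True:
        new_side, new_cost = fm_pass(nodes, w, side, min_size)
        if new_cost >= cost:
            return side, cost
        side, cost = new_side, new_cost

# Two triangles joined by one light edge, starting from a poor partition
nodes = [0, 1, 2, 3, 4, 5]
w = {(0, 1): 3, (1, 2): 3, (0, 2): 3,
     (3, 4): 3, (4, 5): 3, (3, 5): 3,
     (2, 3): 1}
start = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}
part, cost = fiduccia_mattheyses(nodes, w, start, min_size=2)
```

On this toy graph the first pass already separates the two triangles, cutting only the light edge between them.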
The Fiduccia-Mattheyses algorithm needs to be modified in order to deal with alignment graphs. Most importantly, residues are not moved independently: since the graph topology has to be maintained, moving a residue involves moving all the residues positioned between it and the current cut point on its sequence. This modification implies further changes in the design of the data structures for gain processing. Next, the sizes of the parts in the considered partitions cannot differ by more than the maximum cluster size in a final clustering, i.e., the number of aligned sequences. This choice yields a minimal search space that still contains the partitions consistent with all possible multiple alignments. In the initial partition, sequences are cut at their midpoints.
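The effect of moving residues together with everything between them and the cut point can be illustrated by describing a partition with one cut position per sequence. This is only a sketch under that assumed representation: `cut[x]` is the number of residues of sequence x in the left part, and the gain is computed by brute force rather than with the specialised data structures mentioned above. The names are hypothetical.

```python
def cut_cost(edges, cut):
    """Weight of edges joining a left-part residue to a right-part residue."""
    total = 0
    for ((x, i), (y, j)), wt in edges.items():
        if (i < cut[x]) != (j < cut[y]):
            total += wt
    return total

def move_gain(edges, cut, x, new_pos):
    """Gain of shifting the cut point of sequence x to new_pos.

    Shifting the cut point moves every residue between the old and the new
    position at once, so the order of residues on x is preserved.
    """
    moved = dict(cut)
    moved[x] = new_pos
    return cut_cost(edges, cut) - cut_cost(edges, moved)

# Two sequences of length 4 whose residues pair up position by position
edges = {((0, i), (1, i)): 1 for i in range(4)}
midpoints = {0: 2, 1: 2}   # initial partition: cut at the midpoints
skewed = {0: 2, 1: 3}      # residue 2 of sequence 1 pulled into the left part
```

Here `cut_cost(edges, midpoints)` is 0, and moving the cut point of sequence 1 from position 3 back to position 2 has gain 1.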
Refinement
Therefore we decided to add a refinement step, following the method used in ProbCons [7]. Sequences are split into two groups and the groups are pairwise realigned. Realignment is performed using the Needleman-Wunsch algorithm with the score for each pair of positions defined as the sum of posterior probabilities for all non-gap pairs and a zero gap penalty. First, each sequence is realigned with the remaining sequences, since such a division is very efficient in removing superfluous spaces. Next, several randomly selected sequence subsets are realigned against the rest.
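The realignment of two groups reduces to a Needleman-Wunsch recursion with zero gap penalty. A minimal sketch, assuming the score matrix `S` already holds, for each pair of columns of the two groups, the sum of posterior probabilities over their non-gap residue pairs (the function name and the toy scores are hypothetical):

```python
def align_zero_gap(S):
    """Maximise the total score of matched column pairs; gaps cost nothing."""
    m, n = len(S), len(S[0])
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j - 1] + S[i - 1][j - 1],  # match columns
                          D[i - 1][j],                        # gap in group 2
                          D[i][j - 1])                        # gap in group 1
    # Trace back to recover the matched column pairs
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if D[i][j] == D[i - 1][j - 1] + S[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif D[i][j] == D[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return D[m][n], pairs

# Group 1 has two columns, group 2 has three; the middle column of group 2
# has no affinity to anything and ends up opposite a gap
S = [[0.9, 0.0, 0.0],
     [0.0, 0.0, 0.8]]
total, matched = align_zero_gap(S)
```

With a zero gap penalty, columns are matched only when doing so adds affinity, which is why this step tends to close superfluous spaces.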
Figures 6(c-d) show the results of refining the alignments from Figures 6(a-b). Refinement removed the superfluous spaces left by the clustering process and optimized the alignment. Note that the final post-refinement alignments turned out to be the same for both the Fiduccia-Mattheyses and the multilevel method of graph partitioning.
Löytynoja and Goldman argue in [3] that progressive methods tend to force alignments of non-homologous sequence fragments inserted in corresponding locations of aligned sequences. This tendency leads to systematic errors in the downstream analyses of phylogenetic pipelines, including overestimation of substitution and deletion events. Unfortunately, iterative refinement may be one possible source of such effects. Therefore the number of iterations of the subset realignment step in MSARC is adjustable; in particular, the whole step may be turned off.
Computational complexity
Let n denote the number of sequences to align and let l be their maximum length. Both the time and space complexities of stochastic alignment are $\mathcal{O}\left({n}^{2}{l}^{2}\right)$.
Computations in the other steps use data structures for sparse matrices, so the complexity depends on the number c of nonzero values per row/column. This number depends on the cutoff parameter t_{ c } (entries <t_{ c } are set to 0), namely c≤1/t_{ c }. However, we observe that c tends to be much lower than this bound, e.g. c rarely exceeds 5 for the default t_{ c }=0.01.
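The bound c≤1/t_{ c } follows from a one-line argument, assuming (as noted in the discussion of the consistency transformation) that each row of a pairwise matrix sums to at most 1, since the row sum is the probability that the residue is aligned to a residue rather than a gap. Writing P_{ab}(i,j) for the entries of row i (a notation introduced here for illustration) and c for the number of entries that survive the cutoff:

```latex
\[
  c \, t_c \;\le\; \sum_{j \,:\, P_{ab}(i,j) \ge t_c} P_{ab}(i,j)
  \;\le\; \sum_{j} P_{ab}(i,j) \;\le\; 1
  \quad\Longrightarrow\quad
  c \;\le\; \frac{1}{t_c}.
\]
```

For the default t_{ c }=0.01 this gives at most 100 non-zero entries per row, far above the values of about 5 observed in practice.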
MSARC implementation of consistency transformation requires $\mathcal{O}\left({n}^{2}{c}^{2}l\right)$ time. Space complexity of this and the remaining steps is dominated by sparse matrices and equals $\mathcal{O}\left({n}^{2}\mathit{\text{cl}}\right)$.
The time complexity of one pass of the Fiduccia-Mattheyses algorithm on whole sequences is $\mathcal{O}\left({n}^{2}c{l}^{2}\right)$. We observe that the algorithm converges after very few passes, but it is hard to prove a reasonable asymptotic bound. The complexity of the whole clustering is asymptotically equal to the complexity of the main partition step.
The time complexity of iterative refinement is $\mathcal{O}\left({n}^{2}c{l}^{2}\right)$.
Results
Benchmark data and methodology
MSARC was tested against the BAliBASE 3.0 benchmark database [12]. It contains manually refined reference protein alignments based on 3D structural superpositions. Each alignment contains core regions that correspond to the most reliably alignable sections of the alignment. Alignments are divided into five sets designed to evaluate performance on varying types of problems:

RV11, RV12: equidistant sequences with two different levels of conservation, very divergent sequences (<20% identity) in RV11 and medium to divergent sequences (20−40% identity) in RV12
RV20: families aligned with a highly divergent “orphan” sequence
RV30: subgroups with <25% residue identity between groups
RV40: sequences with N/C-terminal extensions
RV50: internal insertions
BAliBASE 3.0 also provides a program comparing given alignments with a reference one. Alignments are scored according to two metrics: a sum-of-pairs (SP) score, showing the ratio of residue pairs that are correctly aligned, and a total column (TC) score, showing the ratio of correctly aligned columns. Both scores can be applied to full sequences or just to the core regions.
We decided to present results based on core-region scores only, since the corresponding sections of the reference alignments are most reliable. Moreover, results for full sequence scores are very similar.
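The two scores can be sketched as follows. This is a simplified illustration and not the BAliBASE scoring program: alignments are given as equal-length gapped strings with `-` for gaps, a reference column counts as correct only if the tested alignment contains an identical column, and restriction to core regions is omitted. The function names are hypothetical.

```python
from itertools import combinations

def columns(alignment):
    """Turn gapped rows into columns of (sequence index, residue index) pairs."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] != '-':
                col.append((s, counters[s]))
                counters[s] += 1
        cols.append(frozenset(col))
    return cols

def sp_tc(reference, test):
    """Return (SP, TC) of `test` measured against `reference`."""
    # Only reference columns that align at least two residues are scored
    ref_cols = [c for c in columns(reference) if len(c) >= 2]
    test_cols = columns(test)
    test_pairs = {frozenset(p) for c in test_cols for p in combinations(c, 2)}
    ref_pairs = [frozenset(p) for c in ref_cols for p in combinations(c, 2)]
    sp = sum(p in test_pairs for p in ref_pairs) / len(ref_pairs)
    test_set = set(test_cols)
    tc = sum(c in test_set for c in ref_cols) / len(ref_cols)
    return sp, tc

reference = ["AC-G",
             "ACTG"]
candidate = ["ACG-",
             "ACTG"]
sp, tc = sp_tc(reference, candidate)
```

In the example the candidate misplaces the final G of the first sequence, so two of the three reference pairs and two of the three scored reference columns are recovered.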
Benchmarking MSARC variants
Two steps of the MSARC algorithm, stochastic alignment and iterative refinement, follow the respective steps in Probalign [8]. Therefore we decided to set the related parameters to Probalign’s defaults. Namely, MSARC was run with the Gonnet 160 similarity matrix [20], gap penalties of −22, −1 and 0 for gap open, extension and terminal gaps respectively, β=0.2, a cutoff value for posterior probabilities of 0.01 (values smaller than the cutoff are set to 0 and operations designed for sparse matrices are used in order to speed up computations), two iterations of the consistency transformation and 100 iterations of iterative refinement.
On the other hand, we decided to evaluate three parameters that seem to be crucial for the steps specific to the MSARC approach. First, residue clustering may be performed with the basic or the multilevel Fiduccia-Mattheyses algorithm. Second, weighted or unweighted consistency transformation may be applied. Third, stochastic pairwise alignment may be based on equations (5)–(8) (i.e. a stochastic version of the classical Gotoh algorithm), or equations (6) and (7) may be replaced with equations (9) and (10), respectively. The modified formulas disallow consecutive insertions and deletions, as is done in Probalign and MSAProbs.
Evaluation of MSARC variants (SP/TC scores; * marks the best value in a column)

Alt. indels  Weighted  Multilevel  All          RV11         RV12         RV20         RV30         RV40         RV50
yes          yes       no          87.6*/57.1*  69.9*/46.3   94.5*/85.7   92.5*/39.2   83.7*/47.2   93.2*/62.3   88.7/51.6
yes          yes       yes         87.6*/57.0   69.7/46.5*   94.5*/85.8*  92.5*/39.0   83.6/46.9    93.2*/61.8   88.7/51.9*
yes          no        no          87.5/56.6    69.3/45.5    94.4/85.6    92.5*/39.6   83.7*/47.6*  93.0/61.2    88.6/49.6
yes          no        yes         87.5/56.6    69.6/45.6    94.5*/85.8*  92.5*/39.3   83.4/47.0    93.1/61.4    88.4/49.6
no           yes       no          87.5/57.0    69.2/45.6    94.4/85.7    92.5*/39.5   83.5/47.1    93.2*/62.2   89.0*/51.9*
no           yes       yes         87.5/57.1*   69.2/46.2    94.4/85.6    92.5*/39.2   83.7*/47.7   93.2*/62.4*  88.7/51.6
no           no        no          87.5/56.6    69.4/45.6    94.5*/85.7   92.5*/39.7*  83.5/46.9    93.0/61.3    88.5/49.7
no           no        yes         87.5/56.7    69.5/45.7    94.4/85.7    92.5*/39.1   83.5/47.0    93.1/61.7    88.6/49.7
Comparison to other aligners
MSARC results were compared to CLUSTAL Ω [1, 21] ver. 1.1.0, DIALIGN-T [9] ver. 0.2.2, DIALIGN-TX [22] ver. 1.0.2, MAFFT [6] ver. 6.903, MUSCLE [5] ver. 3.8.31, MSAProbs [18] ver. 0.9.7, Probalign [8] ver. 1.4, ProbCons [7] ver. 1.12, T-Coffee [4] ver. 9.02, FSA [10] ver. 1.15.7 and PicXAA [11] ver. 1.03. All the programs were executed with their default parameters.
Comparison of multiple sequence alignment methods (SP/TC scores and computation time; * marks the best value in a column)

Aligner     All          RV11         RV12         RV20         RV30         RV40         RV50         BB40037      Time
Non-progressive methods
MSARC       87.6/57.1    69.9*/46.3*  94.5/85.7    92.5/39.2    83.7/47.2    93.2*/62.3   88.7/51.6    98.7*/70.0*  16:36:37
DIALIGN-T   77.3/42.8    49.3/25.3    88.8/72.5    86.3/29.2    74.7/34.9    82.0/45.2    80.1/44.2    52.6/0.0     1:13:21
FSA         78.5/42.1    50.3/26.9    92.4/81.8    86.7/18.7    70.7/27.6    85.5/46.2    78.2/39.8    81.8/30.0    35:15:34
PicXAA      87.8*/59.4   69.0/46.3*   94.6/86.2    92.5/41.6    86.0/59.8    93.1/62.4*   89.2/53.0    98.7*/70.0*  5:54:18
Progressive methods
CLUSTAL Ω   84.0/55.4    59.0/35.8    90.6/78.9    90.2/45.0    86.2/57.5    90.2/57.9    86.2/53.3    61.2/0.0     12:15*
DIALIGN-TX  78.8/44.3    51.5/26.5    89.2/75.2    87.9/30.5    76.2/38.5    83.6/44.8    82.3/46.6    52.8/0.0     1:36:05
MAFFT       86.7/58.4    65.3/42.8    93.6/83.8    92.5/44.6    85.9/58.1    91.5/59.0    90.1/59.4    56.4/0.0     54:04
MSAProbs    87.8*/60.7*  68.2/44.1    94.6*/86.5*  92.8*/46.4*  86.5*/60.7*  92.5/62.2    90.8*/60.8*  59.5/0.0     6:43:51
MUSCLE      81.9/47.5    57.2/31.8    91.5/80.4    88.9/35.0    81.4/40.9    86.5/45.0    83.5/45.9    48.4/0.0     23:32
Probalign   87.6/58.9    69.5/45.3    94.6*/86.2   92.6/43.9    85.3/56.6    92.2/60.3    88.7/54.9    54.2/0.0     4:31:41
ProbCons    86.4/55.8    67.0/41.7    94.1/85.5    91.7/40.6    84.5/54.4    90.3/53.2    89.4/57.3    59.3/0.0     6:56:32
T-Coffee    85.7/55.1    65.5/40.9    93.9/84.8    91.4/40.1    83.7/49.0    89.2/54.5    89.4/58.5    50.9/0.0     13:53:02
Significance of differences in BAliBASE 3.0 SP/TC scores

Aligner     RV11               RV12               RV20               RV30               RV40               RV50               Total
Non-progressive methods
DIALIGN-T   +8.6e-8/+1.5e-6    +7.7e-9/+2.2e-8    +1.3e-7/+9.6e-5    +2.7e-6/+0.0024    +2.1e-9/+4.9e-8    +0.00098/+0.027    +5.3e-36/+3.6e-26
FSA         +8.6e-8/+1.2e-6    +3.5e-6/+0.00012   +3.6e-8/+1.2e-6    +2.6e-6/+8.5e-6    +8.3e-9/+1.2e-6    +0.00081/+0.021    +3.6e-34/+3.5e-27
PicXAA      +0.048/+(0.53)     −(0.82)/−(0.98)    −(0.055)/−0.018    −2.8e-5/−7.2e-6    +(0.11)/−(0.052)   −(0.063)/−(0.37)   −0.0079/−1.3e-6
Progressive methods
CLUSTAL Ω   +2.6e-7/+5.1e-5    +2.4e-5/+0.00019   +0.0048/−0.00054   −0.020/−0.00060    +2.2e-6/+(0.16)    +0.017/−(0.77)     +1.1e-13/+(0.30)
DIALIGN-TX  +1.0e-7/+1.3e-6    +6.2e-8/+4.0e-7    +2.3e-7/+0.00040   +8.7e-6/+0.038     +2.8e-9/+1.3e-7    +0.0017/+(0.066)   +3.1e-34/+9.5e-23
MAFFT       +0.0031/+(0.11)    +0.00085/+0.005    −(0.64)/−(0.052)   −0.0009/−0.0007    +0.0005/+(0.07)    −(0.072)/−(0.062)  +0.028/−(0.55)
MSAProbs    +0.028/+(0.23)     −(0.90)/−(0.67)    −0.011/−0.00032    −0.00017/−1.4e-5   +(0.61)/+0.048     −0.010/−0.0086     −0.020/−5.9e-8
MUSCLE      +7.3e-6/+0.00017   +2.8e-6/+0.00015   +0.00015/+(0.15)   +(0.19)/+(0.52)    +7.6e-9/+2.8e-6    +0.010/+(0.072)    +2.9e-22/+3.3e-12
Probalign   +(0.67)/+(0.52)    −(0.63)/−(0.88)    −0.032/−6.8e-5     −0.0099/−0.00056   +(0.62)/+(0.060)   −(0.18)/−(0.32)    −0.019/−6.0e-6
ProbCons    +0.021/+0.037      +0.0042/+(0.19)    +0.028/−(0.19)     −(0.15)/−0.010     +0.00026/+0.022    −(0.12)/−(0.17)    +0.00087/+(0.93)
T-Coffee    +0.0024/+0.016     +0.0017/+0.013     +0.0075/−(0.51)    −(0.29)/−(0.099)   +9.7e-5/+(0.29)    −0.048/−0.026      +1.3e-5/+(0.70)
Probalign aligns separately the first half of the sequences (blue and green) and the second half of the sequences (from yellow to red). Next, the prefixes of the second group are aligned with the suffixes of the first group, propagating an error within a yellow subalignment.
MSAProbs aligns separately the dark blue, light blue and red sequences. Next, the blue subalignments are aligned together. The resulting alignment has erroneously inserted gaps near the right ends of the dark blue sequences. This error is propagated in the next step, where the suffix of the blue alignment is aligned with the prefix of the red alignment. Finally, the single violet sequence is added to the alignment, splitting it in two.
For both programs, alignment errors introduced in the earlier steps are propagated to the final alignment. On the other hand, the non-progressive strategy used in MSARC yields a reasonable approximation of the reference alignment (see Figure 7(a-b)).
Conclusions
The progressive principle has dominated multiple alignment algorithms for nearly 20 years. Throughout this time, many groups have dedicated their effort to refining its accuracy to the current state. Other approaches were neglected because of high computational complexity and/or unsatisfactory quality. Recently, however, several non-progressive methods have been proposed. Two of them, PicXAA and MSARC, have proved to be competitive with the best progressive approaches. Moreover, both programs outperform progressive methods on sequence sets with evolutionary distances that are difficult to represent by a phylogenetic tree.
Beyond their algorithmic novelty, the non-progressive approaches to multiple alignment are interesting preprocessing tools for phylogeny reconstruction pipelines. The objective of these procedures is to infer the structure of a phylogenetic tree from a given sequence set. Multiple alignment is usually the first pipeline step. When alignment is guided by a tree, the reconstructed phylogeny is biased towards this tree. In order to minimize this effect, some phylogenetic pipelines alternately optimize a tree and an alignment [24–26]. The unbiased alignment process of MSARC may simplify this procedure and improve the reconstruction accuracy, especially in the most problematic cases.
MSARC also has potential for quality improvements. Alternative methods of computing residue alignment affinities could be used to improve the accuracy of both MSARC and Probalign-based methods. Other approaches to alignment graph partitioning may also lead to improvements in the accuracy of MSARC, for example a better method of pairing residues for multilevel coarsening than the currently used naive merging of consecutive neighbors.
The main disadvantage of MSARC is its computational cost, especially in the case of the multilevel scheme variant (MSARC is ∼2.5× slower than MSAProbs, and the MSARC variant with the multilevel scheme is slower still). This is the cost of avoiding the progressive approach.
Endnotes
^{a} Our notion of alignment graph slightly differs from the one of Kececioglu [27]: removing edges between clusters transforms the former into the latter.
Declarations
Acknowledgements
This work was supported by the Polish Ministry of Science and Higher Education [N N519 652740].
References
1. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680.
2. Wong KM, Suchard MA, Huelsenbeck JP: Alignment uncertainty and genomic analysis. Science. 2008, 319 (5862): 473-476. doi:10.1126/science.1151532.
3. Löytynoja A, Goldman N: Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008, 320 (5883): 1632-1635. doi:10.1126/science.1158395.
4. Notredame C, Higgins DG, Heringa J: T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000, 302 (1): 205-217. doi:10.1006/jmbi.2000.4042.
5. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797. doi:10.1093/nar/gkh340.
6. Katoh K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33 (2): 511-518. doi:10.1093/nar/gki198.
7. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S: ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 2005, 15 (2): 330-340. doi:10.1101/gr.2821705.
8. Roshan U, Livesay DR: Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics. 2006, 22 (22): 2715-2721. doi:10.1093/bioinformatics/btl472.
9. Subramanian AR, Weyer-Menkhoff J, Kaufmann M, Morgenstern B: DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics. 2005, 6: 66. doi:10.1186/1471-2105-6-66.
10. Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L: Fast statistical alignment. PLoS Comput Biol. 2009, 5 (5): e1000392. doi:10.1371/journal.pcbi.1000392.
11. Sahraeian SME, Yoon BJ: PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res. 2010, 38 (15): 4917-4928. doi:10.1093/nar/gkq255.
12. Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins. 2005, 61 (1): 127-136. doi:10.1002/prot.20527.
13. Fiduccia CM, Mattheyses RM: A linear-time heuristic for improving network partitions. Proceedings of the 19th Design Automation Conference. DAC ’82. 1982, Piscataway, NJ, USA: IEEE Press, 175-181. [http://dl.acm.org/citation.cfm?id=800263.809204]
14. Miyazawa S: A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 1995, 8 (10): 999-1009.
15. Mückstein U, Hofacker IL, Stadler PF: Stochastic pairwise alignments. Bioinformatics. 2002, 18 (Suppl 2): S153-S160. doi:10.1093/bioinformatics/18.suppl_2.S153.
16. Yu YK, Hwa T: Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J Comput Biol. 2001, 8 (3): 249-282. doi:10.1089/10665270152530845.
17. Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol. 1982, 162 (3): 705-708.
18. Liu Y, Schmidt B, Maskell DL: MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics. 2010, 26 (16): 1958-1964. doi:10.1093/bioinformatics/btq338.
19. Hendrickson B, Leland R: A multilevel algorithm for partitioning graphs. Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CD-ROM). Supercomputing ’95. 1995, New York, NY, USA: ACM. doi:10.1145/224170.224228.
20. Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science. 1992, 256 (5062): 1443-1445.
21. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011, 7: 539. doi:10.1038/msb.2011.75.
22. Subramanian AR, Kaufmann M, Morgenstern B: DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Alg Mol Biol. 2008, 3: 6. doi:10.1186/1748-7188-3-6.
23. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010, 59 (3): 307-321. doi:10.1093/sysbio/syq010.
24. Redelings BD, Suchard MA: Joint Bayesian estimation of alignment and phylogeny. Syst Biol. 2005, 54 (3): 401-418. doi:10.1080/10635150590947041.
25. Lunter G, Miklós I, Drummond A, Jensen JL, Hein J: Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics. 2005, 6: 83. doi:10.1186/1471-2105-6-83.
26. Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T: Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 2009, 324 (5934): 1561-1564. doi:10.1126/science.1171243.
27. Kececioglu J: The maximum weight trace problem in multiple sequence alignment. Proceedings of the 4th Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science. Berlin Heidelberg: Springer, 1993, 106-119.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.