MSARC aligns sequence sets in several steps. In a preprocessing step, following Probalign [8], stochastic alignments are calculated for all pairs of sequences and consistency transformation is applied to resulting posterior probabilities of residue correspondences. Transformed probabilities, called residue alignment affinities, represent weights of an alignment grapha.
MSARC clusters this graph with a top-down hierarchical method (Figure 2). Division steps are based on the Fiduccia-Mattheyses graph partitioning algorithm [13], adapted to satisfy constraints imposed by the sequence order of residues. Finally, the multiple alignment corresponding to the resulting clustering is refined with the iterative improvement strategy proposed in Probcons [7], adapted to remove clustering artefacts.
Pairwise stochastic alignment
The concept of stochastic (or probability) alignment was proposed in [14]. Given a pair of sequences, this framework defines statistical weights of their possible alignments. Based on these weights, for each pair of residues from both sequences, the posterior probability of being aligned may be computed.
A consensus of highly weighted suboptimal alignments was shown to contain pairs with significant probabilities that agree with structural alignments despite the optimal alignment deviating significantly. Mückstein et al. [15] suggest the use of the method as a starting point for improved multiple sequence alignment procedures.
The statistical weight of an alignment
is the product of the individual weights of (mis-)matches and gaps [16]. It may be obtained from the standard similarity scoring function with the following formula:
where β corresponds to the inverse of Boltzmann’s constant and should be adjusted to the match/mismatch scoring function s(x,y) (in fact, β simply rescales the scoring function).
The probability distribution over all alignments is achieved by normalizing this value. The normalization factor Z is called the partition function of the alignment problem [14], and is defined as
(2)
The probability of an alignment can be calculated by
(3)
Let P(a
i
∼b
j
) denote the posterior probability that residues a
i
and b
j
are aligned.
We can calculate it as the sum of probabilities of all alignments with a
i
and b
j
in a common column (denoted by ):
(4)
Here we use the notation for an alignment of the sequence prefixes a1⋯a
i
and b1⋯b
j
, and for an alignment of the sequence suffixes a
i
⋯a
m
and b
j
⋯b
n
. Analogously, Zi,j is the partition function over the prefix alignments and is the (reverse) partition function over the suffix alignments.
An efficient algorithm for calculating the partition function can be derived from the Gotoh maximum score algorithm [17] by replacing the maximum operations with additions [14–16]:
(5)
(6)
(7)
(8)
The reverse partition function can be calculated using the same recursion in reverse, starting from the ends of the aligned sequences.
We also considered a slight modification of formulas 6 and 7:
(9)
(10)
In this case insertions and deletions must be separated by at least one match/mismatch position. This variant was proposed by Miyazawa [14] and applied in the Probalign [8] and MSAProbs [18] aligners.
Alignment graphs
Let us regard probabilities P(a
i
∼b
j
) as a representation of a bipartite graph with weighted edges, i.e. a graph with residues from both sequences as nodes and edges joining each a
i
with each b
j
.
Given a set S of k sequences to be aligned, we would like to analogously represent their residue alignment affinity by a k-partite weighted graph. It may be obtained by joining pairwise alignment graphs for all pairs of S-sequences. However, separate computation of edge weights for each pair of sequences does not exploit information included in the remaining alignments. Thus we decided to address this problem with a so called consistency transformation[4, 7], successfully used in progressive methods.
In order to incorporate correspondence with residues from other sequences, MSARC re-estimates the residue alignment affinity according to the following formula:
(11)
where w
x
y
are weights specifying the relative contribution to the transformation of a sequence pair xy.
If P
a
b
is a matrix of current residue alignment affinities for sequences a and b, the matrix form equivalent transformation is given by
(12)
where · stands for matrix multiplication.
MSARC allows for two options of weight assignments. In the first one all the weights are set to 1 and the above formula simplifies to the following:
(13)
It results in the variant of consistency transformation used in Probalign [8] and ProbCons [7].
In the second option weights are calculated according to the following formula:
(14)
The idea behind the above formula is that the sum of a row/column of a matrix P
a
b
yields the probability that the corresponding residue is aligned to one in the other sequence (not a gap). If sequences a and b are similar, alignments with fewer gaps are preferred, so (at least for the shorter sequence) most of the sums are close to 1. Consequently, the w
a
b
is close to 1 as well. On the other hand, weights are much closer to 0 for pairs of dissimilar sequences.
Thus w
a
b
measures the similarity of sequences a and b. Therefore sequences c that are similar to a and b contribute to more significantly than others.
The consistency transformation may be iterated any number of times, but excessive iterations blur the structure of residue affinity. Following Probalign [8] and ProbCons [7], MSARC performs two iterations by default.
Residue clustering
Columns of any multiple alignment form a partition of the set of sequence residues. The main idea of MSARC is to reconstruct the alignment by clustering an alignment graph into columns. The clustering method must satisfy constraints imposed by alignment structure. First, each cluster may contain at most one residue from a single sequence. Second, the set of all clusters must be orderable consistently with sequence orders of their residues. Violation of the first constraint will be called ambiguity, while violation of the second one – conflict (see Figure 3).
Towards this objective, MSARC applies top-down hierarchical clustering (see Figure 2). Within this approach, the alignment graph is recursively split into two parts until no ambiguous cluster is left. Each partition step results from a single cut through all sequences, so clusterings are conflict-free at each step of the procedure. Consequently, the final clustering represents a proper multiple alignment.
Optimal clustering is expected to maximize residue alignment affinity within clusters and minimize it between them. Therefore, the partition selection in recursive steps of the clustering procedure should minimize the sum of weights of edges cut by the partition. This is in fact the objective of the well-known problem of graph partitioning, i.e. dividing graph nodes into roughly equal parts such that the sum of weights of edges connecting nodes in different parts is minimized.
The Fiduccia-Mattheyses algorithm [13] is an efficient heuristic for the graph partitioning problem. After selecting an initial, possibly random partition, it calculates for each node the change in cost caused by moving it between parts, called gain. Subsequently, single nodes are greedily moved between partitions based on the maximum gain and gains of remaining nodes are updated. The process is repeated in passes, where each node can be moved only once per pass. The best partition found in a pass is chosen as the initial partition for the next pass. The algorithm terminates when a pass fails to improve the partition. Grouping single moves into passes helps the algorithm to escape local optima, since intermediate partitions in a pass may have negative gains. An additional balance condition is enforced, disallowing movement from a partition that contains less than a minimum desired number of nodes.
Fiduccia-Mattheyses algorithm needs to be modified in order to deal with alignment graphs. Mainly, residues are not moved independently; since the graph topology has to be maintained, moving a residue involves moving all the residues positioned between it and a current cut point on its sequence. This modification implies further changes in the design of data structures for gain processing. Next, the sizes of parts in considered partitions cannot differ by more than the maximum cluster size in a final clustering, i.e., the number of aligned sequences. This choice implies minimal search space containing partitions consistent with all possible multiple alignments. In the initial partition sequences are cut in their midpoints.
The Fiduccia-Mattheyses heuristic may be optionally extended with a multilevel scheme [19]. In this approach increasingly coarse approximations of the graph are created by an iterative process called coarsening. At each iteration step selected pairs of nodes are merged into single nodes. Adjacent edges are merged accordingly and weighted with sums of original weights. The final coarsest graph is partitioned using the Fiduccia-Mattheyses algorithm. Then the partition is projected back to the original graph through the series of uncoarsening operations (see Figure 4), each of which is followed by a Fiduccia-Mattheyses based refinement. Because the last refinement is applied to the original graph, the multilevel scheme in fact reduces the problem of selecting an initial partition to the problem of selecting pairs of nodes to be merged. In alignment graphs only neighboring nodes can be merged, so MSARC just merges consecutive pairs of neighboring nodes (see Figure 5).
Refinement
An example of alignment columns produced by residue clustering can be seen in Figure 6(ab). Presented alignments contain surprisingly many spaces, especially in their right parts. Some of them are clearly superfluous, e.g. in both alignments there are 3 consecutive columns near the right end, each consisting of 4 spaces and 1 G-nucleotide occupying a different row.
Therefore we decided to add a refinement step, following the method used in ProbCons [7]. Sequences are split into two groups and the groups are pairwise re-aligned. Re-alignment is performed using the Needleman-Wunsch algorithm with the score for each pair of positions defined as the sum of posterior probabilities for all non-gap pairs and zero gap-penalty. First each sequence is re-aligned with the remaining sequences, since such division is very efficient in removing superfluous spaces. Next, several randomly selected sequence subsets are re-aligned against the rest.
Figures 6(cd) show the results of refining the alignments from Figures 6(ab). Refinement removed superfluous spaces from the clustering process and optimized the alignment. Note that the final post-refinement alignments turned out to be the same for both Fiduccia-Mattheyses and multilevel method of graph partitioning.
Löytynoja and Goldman argue in [3] that progressive methods tend to force alignments of non-homologous sequence fragments inserted in corresponding locations of aligned sequences. This tendency leads to systematic errors of the downstream analyses in phylogenetic pipelines, including overestimation of substitution and deletion events. Unfortunately, iterative refinement may be one of possible source of such effects. Therefore the number of iterations in subset re-alignment step in MSARC is adjustable, in particular the whole step may be turned off.
Computational complexity
Let n denote a number of sequences to align and let l be their maximum length. Both time and space complexities of stochastic alignment are .
Computations in the other steps use data structures for sparse matrices, so the complexity depends on the number c of non-zero values per row/column. This number depends on the cutoff parameter t
c
(entries <t
c
are set to 0), namely c≤1/t
c
. However, we observe that c tends to be much lower than this bound, e.g. c rarely exceeds 5 for the default t
c
=0.01.
MSARC implementation of consistency transformation requires time. Space complexity of this and the remaining steps is dominated by sparse matrices and equals .
The time complexity of one pass of the Fiduccia-Mattheyses algorithm on whole sequences is . We observe that the algorithm converges after very few passes, but it is hard to prove a reasonable asymptotic bound. The complexity of the whole clustering is asymptoticly equal to the complexity of the main partition step.
The time complexity of iterative refinement belongs to the class .