Formally, we consider the following optimization problem: we are given a set S = {s1, . . ., s
k
} of input sequences where l
i
is the length of sequence s
i
. A fragment f is a pair of two equal-length segments from two different input sequences. Thus, a fragment represents a local pairwise gap-free alignment of these two sequences. Each possible fragment f is assigned a weight score w(f) which, in our approach, depends on the probability P(f) of random occurrence of such a fragment. More precisely, if f is a local alignment of sequences s
i
and s
j
, then P (f) is the probability of finding a fragment of the same length as f with at least the same sum of matches or similarity values for amino acids in random sequences of length l
i
and l
j
, respectively. For protein alignment, a standard substitution matrix is used. Let F be the set of all possible fragments. The optimization problem is then to find a consistent set A ⊂ F of fragments with maximum total weight, i.e. a consistent set A maximizing
A set of fragments is called consistent if all fragments can be included into one single alignment, see [15]. Fragments in A are allowed to overlap if different pairs of sequences are involved. That is, if two fragments f1, f2∈ A involve sequence pairs s
i
, s
j
and s
j
, s
k
, respectively, then f1 and f2 are allowed to overlap in sequence s
j
. If two fragments involve the same pair of sequences, no overlap is allowed. It can be shown that the problem of finding an optimal consistent set A of fragments is NP-complete (Constructing multiple sequence alignments from pairwise data, Subramanian et al., in preparation). Therefore, we are motivated in finding intelligent approximations that deliver a good tradeoff between alignment quality and CPU time.
To decrease the computational complexity of this problem, we restrict ourselves to a reduced subset F' ⊂ F and we will first search for a consistent subset A ⊂ F' with maximum total score. As in previous versions of DIALIGN, we use pairwise optimal alignments as a filter. In other words, the set F' is defined as the set of all fragments contained in any of the optimal pairwise alignments of the sequences in our input data set. Here, we also restrict the length of fragments using some suitable constant.
For multiple alignment, previous versions of DIALIGN used the above outlined greedy approach. We call this approach a direct greedy approach, as opposed to the progressive greedy approach that we introduce in this paper. A modification of this 'direct greedy' approach was also used in our reimplementation DIALIGN-T. Here, we considered not only the weight scores of individual fragments (or their overlap weights [1]) but also took into account the overall degree of similarity between the two sequences involved in the fragment. The rationale behind this approach is that a fragment from a sequence pair with high overall similarity is less likely to be a random artefact than a fragment from an otherwise non-related sequence pair, see [16] for details.
2.1 Combining segment-based greedy and progressive alignment
To overcome the difficulties of a 'direct' greedy algorithm for multiple alignment, we combined greedy features with a 'progressive' alignment approach [17–20]. Roughly outlined, the new method we developed first computes a guide tree for the set of input sequences based on their pairwise similarity scores. The sequences are then aligned in the order defined by the guide tree. We divide the set of fragments contained in the respective optimal pairwise alignments into two subsets F0 and F1 where F0 consists of all fragments with weight scores below the average fragment score in all pairwise alignments, and F1 consists of the fragments with a weight above or equal to the average weight. In a first step, the set F1 is used to calculate an initial multiple alignment A1 in a 'progressive' manner. The low-scoring fragments from set F0 are added later to A1 in a 'direct' greedy way, provided they are consistent with A1. In addition, we construct an alternative multiple alignment A0 using the 'direct' greedy approach implemented in previous versions of DIALIGN and DIALIGN-T, respectively. The program finally returns either A0 or A1, depending on which one of these two alignments has the highest score.
To construct a guide tree for the progressive alignment algorithm, we use straight-forward hierarchical clustering. Here, we use a weighted combination of complete-linkage and average-linkage clustering based on pairwise similarity values R(p, q) for pairs of cluster (C
p
, C
q
). Initially, each cluster C
i
consists of one sequence s
i
only. The similarity R(i, j) between clusters C
i
and C
j
(or leaves i and j in our tree) is defined to be the score of the optimal pairwise alignment of s
i
and s
j
according to our objective function, i.e. the sum of the weights of the fragments in an optimal chain of fragments for these two sequences. In every step, we merge the two sequence clusters C
i
and C
j
with the maximum similarity value R(i, j) into a new cluster. Whenever a new cluster C
p
is created by merging clusters q and r (or a node p in the tree is created with children q and r), we define the similarity between p and all other remaining clusters m to be
The choice of this function has been inspired by MAFFT [21, 22]; it also worked very well in our situation on globally and locally related sequences after experiments on BALIBASE 3, BRAliBase II, IRMBASE 2 and DIRMBASE 1.
2.2 Merging two sub-alignments
The final multiple alignment of our input sequence set S is constructed bottom-up along the guide tree. Thus, the crucial step is to combine two sub-alignments represented by nodes q and r in our tree whenever a new node p is created. In the traditional 'progressive' alignment approach, this is done by calculating a pairwise alignment of profiles, but this procedure cannot be directly adapted to our segment-based approach. Let A
q
and A
r
be the existing subalignments of the sequences in clusters C
q
and C
r
, respectively, at the time where these clusters are merged to a new cluster C
p
. Let F
q, r
be the set of all fragments f ∈ F connecting one sequence from cluster C
q
with another sequence from cluster C
r
. Now, our main goal is to find a subset F
p
⊂ Fq,rwith maximum total weight score that is consistent with the existing alignments A
q
and A
r
. In other words, we are looking for a subset F
p
⊂ F
q, r
with maximum total weight such that
A
p
= A
q
∪ A
r
∪ F
p
describes a valid multiple sequence alignment of the sequence set represented by node p.
It is easy to see that at this time, before clusters A
q
and A
r
are merged, every single fragment f ∈ Fq,ris consistent with the existing (partial) alignments A
q
and A
r
and therefore consistent with the set of all fragments accepted so far. Only groups of at least two fragments from Fq,rcan lead to inconsistencies with the previously accepted fragments. Thus, there are different subtypes of consistency conflicts in Fq,rthat may arise when A
q
and A
r
are fixed. There are pairs, triples or, in general, l-tuples of fragments of Fq,rthat give rise to a conflict in the sense that the conflict can be resolved by removing exactly one fragment of such a conflicting l-tuple. Statistically, pairs of conflicting fragments are the most frequent type of conflict, so we will take care of them more intelligently rather than using only a greedy method. Since in our approach, the length of fragments is limited, we can easily determine in constant time for any pair of fragments (f1, f2) if the set
A
q
∪ A
r
∪ {f1, f2}
is consistent, i.e. if it forms a valid alignment, or if there is a pairwise conflict between f1 and f2. Here, the data structures described in [23] are used. With unbounded fragment length, the consistency check for the new fragments (f1, f2) would take O(|f1| × |f2|) time where |f| is the length of a fragment f .
This gives rise to a conflict graph Gq,rthat has a weighted node n
f
for every fragment f ∈ Fq,r. The weight w(n
f
) of node n
f
is defined to be the weight score w(f) of f, and for any two fragments f1, f2 there exists an edge connecting and iff there is a pairwise conflict between f1 and f2, i.e. if the set A
q
∪ A
r
∪ {f1, f2} is inconsistent. We are now interested in finding a good subset of Fq,rthat does not contain any pairwise conflicts in the above sense. The optimum solution would be obtained by removing a minimal weighted vertex cover from Gq,r. Since the weighted vertex cover problem is NP-complete we apply the 2-approximation given by Clarkson [24]. This algorithm roughly works as follows: in order to obtain the vertex cover C, the algorithm iteratively adds the node v with the maximum value
to C. For any edge (v, u) that connects a node u with v the weight w(u) is updated to
and the edge (u, v) is deleted. This iteration is followed as long as there are edges left.
Note that it is not sufficient to remove the vertex cover C from Fq,rto obtain a valid alignment since in the construction of C, only inconsistent pairs of fragments were considered. We therefore first remove C from Fq,rand we subsequently remove further inconsistent fragments from Fq,rusing our 'direct' greedy alignment as described in [16]. A consequence of this further reduction of the set Fq,ris that fragments that were previously removed because of pairwise inconsistencies, may became consistent again. A node may have been included into the set C and therefore removed from the alignment as the corresponding fragment f1 is part of an inconsistent fragment pair (f1, f2). However, after subtracting the set C from F
q, r
, the algorithm may detect that fragment f2 is part of a larger inconsistent group, and f2 is removed as well. In this case, it may be possible to include f1 again into the alignment. Therefore, our algorithm reconsiders in a final step the set C to see if some of the previously excluded fragments can now be reincluded into the alignment. This is again done using our 'direct' greedy method.
2.3 The overall algorithm
In the previous section, we discussed all ingredients that are necessary to give a high-level description of our algorithm to compute a multiple sequence alignment. For clarity, we omit algorithmical details and data structures such as the consistency frontiers or consistency boundaries, respectively, that are used to check for consistency as these features have been described elsewhere [23]. We use a subroutine PAIRWISE_ALIGNMENT (s
i
, s
j
, A) that takes two sequences s
i
and s
j
and (optionally) an existing consistent set of fragments A as input and calculates an optimal alignment of s
i
and s
j
under the side constraint that this alignment is consistent with A and that only those positions in the sequences are aligned that are not yet aligned by a fragment from A. Note that in DIALIGN, an alignment is defined as an equivalence relation on the set of all sequence positions, so a consistent set of fragments corresponds to an alignment. Therefore, we do not formally distinguish between alignments and sets of fragments.
Next, a subroutine GREEDY_ALIGNMENT (A, F') takes an alignment A and a set of fragments F' as arguments and returns a new alignment A' ⊃ A by adding fragments from the set F' in a 'directly' greedy fashion. For details on these subroutines see also [16]. Furthermore we use a subroutine BUILD_UPGMA (F') that takes a set F' of fragments as arguments and returns a tree and a subroutine MERGE(p, F') that takes the parent node p and the set of fragments F' as argument and returns an alignment of the set of sequences represented by node p. Those two subroutines are described in the previous two subsections. A pseudo-code description of the complete algorithm for multiple alignment is given in Figure 1. As in the original version of DIALIGN [1], the process of pairwise alignment and consistency filtering is carried out iteratively. Once a valid alignment A has been constructed by removing inconsistent fragments from the set F' of the fragments that are part of the respective optimal pairwise alignments, this procedure is repeated until no new fragments can be found. In the second and subsequent iteration steps, only those parts of the sequences are considered that are not yet aligned and optimal pairwise alignments are calculated under the consistency constraints imposed by the existing alignment A.