Formally, we consider the following optimization problem: we are given a set *S* = {*s*_{1}, . . ., *s*_{
k
}} of input sequences where *l*_{
i
}is the length of sequence *s*_{
i
}. A fragment *f* is a pair of two equal-length segments from two different input sequences. Thus, a fragment represents a local pairwise gap-free alignment of these two sequences. Each possible fragment *f* is assigned a *weight score w*(*f*) which, in our approach, depends on the probability *P*(*f*) of *random occurrence* of such a fragment. More precisely, if *f* is a local alignment of sequences *s*_{
i
}and *s*_{
j
}, then *P* (*f*) is the probability of finding a fragment of the same length as *f* with at least the same sum of matches or similarity values for amino acids in *random* sequences of length *l*_{
i
}and *l*_{
j
}, respectively. For protein alignment, a standard substitution matrix is used. Let *F* be the set of all possible fragments. The optimization problem is then to find a *consistent* set *A* ⊂ *F* of fragments with maximum total weight, i.e. a consistent set *A* maximizing

W(A):={\displaystyle \sum _{f\in A}w(f)}.

A set of fragments is called *consistent* if all fragments can be included into one single alignment, see [15]. Fragments in *A* are allowed to overlap if different pairs of sequences are involved. That is, if two fragments *f*_{1}, *f*_{2}∈ *A* involve sequence pairs *s*_{
i
}, *s*_{
j
}and *s*_{
j
}, *s*_{
k
}, respectively, then *f*_{1} and *f*_{2} are allowed to overlap in sequence *s*_{
j
}. If two fragments involve the same pair of sequences, no overlap is allowed. It can be shown that the problem of finding an optimal consistent set *A* of fragments is NP-complete (Constructing multiple sequence alignments from pairwise data, Subramanian *et al*., in preparation). Therefore, we are motivated in finding intelligent approximations that deliver a good tradeoff between alignment quality and CPU time.

To decrease the computational complexity of this problem, we restrict ourselves to a reduced subset *F'* ⊂ *F* and we will first search for a consistent subset *A* ⊂ *F'* with maximum total score. As in previous versions of DIALIGN, we use pairwise optimal alignments as a filter. In other words, the set *F'* is defined as the set of all fragments contained in any of the optimal *pairwise* alignments of the sequences in our input data set. Here, we also restrict the length of fragments using some suitable constant.

For multiple alignment, previous versions of DIALIGN used the above outlined *greedy* approach. We call this approach a *direct* greedy approach, as opposed to the *progressive* greedy approach that we introduce in this paper. A modification of this 'direct greedy' approach was also used in our reimplementation DIALIGN-T. Here, we considered not only the weight scores of individual fragments (or their *overlap* weights [1]) but also took into account the overall degree of similarity between the two sequences involved in the fragment. The rationale behind this approach is that a fragment from a sequence pair with high overall similarity is less likely to be a random artefact than a fragment from an otherwise non-related sequence pair, see [16] for details.

### 2.1 Combining segment-based greedy and progressive alignment

To overcome the difficulties of a 'direct' greedy algorithm for multiple alignment, we combined greedy features with a 'progressive' alignment approach [17–20]. Roughly outlined, the new method we developed first computes a *guide tree* for the set of input sequences based on their pairwise similarity scores. The sequences are then aligned in the order defined by the guide tree. We divide the set of fragments contained in the respective optimal pairwise alignments into two subsets *F*_{0} and *F*_{1} where *F*_{0} consists of all fragments with weight scores *below* the average fragment score in all pairwise alignments, and *F*_{1} consists of the fragments with a weight above or equal to the average weight. In a first step, the set *F*_{1} is used to calculate an initial multiple alignment *A*_{1} in a 'progressive' manner. The low-scoring fragments from set *F*_{0} are added later to *A*_{1} in a 'direct' greedy way, provided they are consistent with *A*_{1}. In addition, we construct an alternative multiple alignment *A*_{0} using the 'direct' greedy approach implemented in previous versions of DIALIGN and DIALIGN-T, respectively. The program finally returns either *A*_{0} or *A*_{1}, depending on which one of these two alignments has the highest score.

To construct a guide tree for the progressive alignment algorithm, we use straight-forward hierarchical clustering. Here, we use a weighted combination of complete-linkage and average-linkage clustering based on pairwise similarity values *R*(*p*, *q*) for pairs of cluster (*C*_{
p
}, *C*_{
q
}). Initially, each cluster *C*_{
i
}consists of one sequence *s*_{
i
}only. The similarity *R*(*i*, *j*) between clusters *C*_{
i
}and *C*_{
j
}(or leaves *i* and *j* in our tree) is defined to be the score of the optimal pairwise alignment of *s*_{
i
}and *s*_{
j
}according to our objective function, i.e. the sum of the weights of the fragments in an optimal chain of fragments for these two sequences. In every step, we merge the two sequence clusters *C*_{
i
}and *C*_{
j
}with the maximum similarity value *R*(*i*, *j*) into a new cluster. Whenever a new cluster *C*_{
p
}is created by merging clusters *q* and *r* (or a node *p* in the tree is created with children *q* and *r*), we define the similarity between *p* and all other remaining clusters *m* to be

R(m,p):=0.1\cdot \frac{1}{2}(R(m,p)+R(m,q))+0.9\cdot \mathrm{max}(R(m,p),R(m,q))

The choice of this function has been inspired by MAFFT [21, 22]; it also worked very well in our situation on globally and locally related sequences after experiments on BALIBASE 3, BRAliBase II, IRMBASE 2 and DIRMBASE 1.

### 2.2 Merging two sub-alignments

The final multiple alignment of our input sequence set *S* is constructed bottom-up along the guide tree. Thus, the crucial step is to combine two sub-alignments represented by nodes *q* and *r* in our tree whenever a new node *p* is created. In the traditional 'progressive' alignment approach, this is done by calculating a pairwise alignment of *profiles*, but this procedure cannot be directly adapted to our segment-based approach. Let *A*_{
q
}and *A*_{
r
}be the existing subalignments of the sequences in clusters *C*_{
q
}and *C*_{
r
}, respectively, at the time where these clusters are merged to a new cluster *C*_{
p
}. Let *F*_{
q, r
}be the set of all fragments *f* ∈ *F* connecting one sequence from cluster *C*_{
q
}with another sequence from cluster *C*_{
r
}. Now, our main goal is to find a subset *F*_{
p
}⊂ *F*_{q,r}with maximum total weight score that is consistent with the existing alignments *A*_{
q
}and *A*_{
r
}. In other words, we are looking for a subset *F*_{
p
}⊂ *F*_{
q, r
}with maximum total weight such that

*A*_{
p
}= *A*_{
q
}∪ *A*_{
r
}∪ *F*_{
p
}

describes a valid multiple sequence alignment of the sequence set represented by node *p*.

It is easy to see that at this time, before clusters *A*_{
q
}and *A*_{
r
}are merged, *every single* fragment *f* ∈ *F*_{q,r}is consistent with the existing (partial) alignments *A*_{
q
}and *A*_{
r
}and therefore consistent with the set of all fragments accepted so far. Only *groups* of at least two fragments from *F*_{q,r}can lead to inconsistencies with the previously accepted fragments. Thus, there are different subtypes of consistency conflicts in *F*_{q,r}that may arise when *A*_{
q
}and *A*_{
r
}are fixed. There are pairs, triples or, in general, *l*-tuples of fragments of *F*_{q,r}that give rise to a conflict in the sense that the conflict can be resolved by removing exactly one fragment of such a conflicting *l*-tuple. Statistically, pairs of conflicting fragments are the most frequent type of conflict, so we will take care of them more intelligently rather than using only a greedy method. Since in our approach, the length of fragments is limited, we can easily determine in constant time for any pair of fragments (*f*_{1}, *f*_{2}) if the set

*A*_{
q
}∪ *A*_{
r
}∪ {*f*_{1}, *f*_{2}}

is consistent, i.e. if it forms a valid alignment, or if there is a pairwise conflict between *f*_{1} and *f*_{2}. Here, the data structures described in [23] are used. With unbounded fragment length, the consistency check for the new fragments (*f*_{1}, *f*_{2}) would take *O*(|*f*_{1}| × |*f*_{2}|) time where |*f*| is the length of a fragment *f* .

This gives rise to a conflict graph *G*_{q,r}that has a weighted node *n*_{
f
}for every fragment *f* ∈ *F*_{q,r}. The weight *w*(*n*_{
f
}) of node *n*_{
f
}is defined to be the weight score *w*(*f*) of *f*, and for any two fragments *f*_{1}, *f*_{2} there exists an edge connecting {n}_{{f}_{1}} and {n}_{{f}_{2}} iff there is a *pairwise conflict* between *f*_{1} and *f*_{2}, i.e. if the set *A*_{
q
}∪ *A*_{
r
}∪ {*f*_{1}, *f*_{2}} is *inconsistent*. We are now interested in finding a good subset of *F*_{q,r}that does not contain any pairwise conflicts in the above sense. The optimum solution would be obtained by removing a minimal weighted vertex cover from *G*_{q,r}. Since the weighted vertex cover problem is NP-complete we apply the 2-approximation given by Clarkson [24]. This algorithm roughly works as follows: in order to obtain the vertex cover *C*, the algorithm iteratively adds the node *v* with the maximum value

to *C*. For any edge (*v*, *u*) that connects a node *u* with *v* the weight *w*(*u*) is updated to

w(u):=w(u)-\frac{degree(v)}{w(v)}

and the edge (*u*, *v*) is deleted. This iteration is followed as long as there are edges left.

Note that it is *not* sufficient to remove the vertex cover *C* from *F*_{q,r}to obtain a valid alignment since in the construction of *C*, only inconsistent pairs of fragments were considered. We therefore first remove *C* from *F*_{q,r}and we subsequently remove further inconsistent fragments from *F*_{q,r}using our 'direct' greedy alignment as described in [16]. A consequence of this further reduction of the set *F*_{q,r}is that fragments that were previously removed because of pairwise inconsistencies, may became consistent again. A node {n}_{{f}_{1}} may have been included into the set *C* and therefore removed from the alignment as the corresponding fragment *f*_{1} is part of an inconsistent fragment pair (*f*_{1}, *f*_{2}). However, after subtracting the set *C* from *F*_{
q, r
}, the algorithm may detect that fragment *f*_{2} is part of a larger inconsistent group, and *f*_{2} is removed as well. In this case, it may be possible to include *f*_{1} again into the alignment. Therefore, our algorithm reconsiders in a final step the set *C* to see if some of the previously excluded fragments can now be reincluded into the alignment. This is again done using our 'direct' greedy method.

### 2.3 The overall algorithm

In the previous section, we discussed all ingredients that are necessary to give a high-level description of our algorithm to compute a multiple sequence alignment. For clarity, we omit algorithmical details and data structures such as the *consistency frontiers* or *consistency boundaries*, respectively, that are used to check for consistency as these features have been described elsewhere [23]. We use a subroutine *PAIRWISE_ALIGNMENT* (*s*_{
i
}, *s*_{
j
}, *A*) that takes two sequences *s*_{
i
}and *s*_{
j
}and (optionally) an existing consistent set of fragments *A* as input and calculates an optimal alignment of *s*_{
i
}and *s*_{
j
}under the side constraint that this alignment is consistent with *A* and that only those positions in the sequences are aligned that are not yet aligned by a fragment from *A*. Note that in DIALIGN, an alignment is defined as an equivalence relation on the set of all sequence positions, so a consistent set of fragments corresponds to an alignment. Therefore, we do not formally distinguish between alignments and sets of fragments.

Next, a subroutine *GREEDY_ALIGNMENT* (*A*, *F'*) takes an alignment *A* and a set of fragments *F'* as arguments and returns a new alignment *A*' ⊃ *A* by adding fragments from the set *F'* in a 'directly' greedy fashion. For details on these subroutines see also [16]. Furthermore we use a subroutine *BUILD_UPGMA* (*F'*) that takes a set *F'* of fragments as arguments and returns a tree and a subroutine *MERGE*(*p*, *F'*) that takes the parent node *p* and the set of fragments *F'* as argument and returns an alignment of the set of sequences represented by node *p*. Those two subroutines are described in the previous two subsections. A pseudo-code description of the complete algorithm for multiple alignment is given in Figure 1. As in the original version of DIALIGN [1], the process of pairwise alignment and consistency filtering is carried out iteratively. Once a valid alignment *A* has been constructed by removing inconsistent fragments from the set *F'* of the fragments that are part of the respective optimal pairwise alignments, this procedure is repeated until no new fragments can be found. In the second and subsequent iteration steps, only those parts of the sequences are considered that are not yet aligned and optimal pairwise alignments are calculated under the consistency constraints imposed by the existing alignment *A*.