Bi-alignments with affine gaps costs

Background Commonly, sequence and structure elements are assumed to evolve congruently, such that homologous sequence positions correspond to homologous structural features. Assuming congruent evolution, alignments based on sequence and structure similarity can therefore optimize both similarities at the same time in a single alignment. To model incongruent evolution, where sequence and structural features diverge positionally, we recently introduced bi-alignments. This generalization of sequence and structure-based alignments is best understood as alignments of two distinct pairwise alignments of the same entities: one modeling sequence similarity, the other structural similarity. Results Optimal bi-alignments with affine gap costs (or affine shift cost) for two constituent alignments can be computed exactly in quartic space and time. Even bi-alignments with affine shift and gap cost, as well as bi-alignment with sub-additive gap cost are optimized efficiently. Affine gap-cost bi-alignment of large proteins (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 930$$\end{document}∼930 aa) can be computed. Conclusion Affine cost bi-alignments are of practical interest to study shifts of protein sequences and protein structures relative to each other. Availability The affine cost bi-alignment algorithm has been implemented in Python 3 and Cython. It is available as free software from https://github.com/s-will/BiAlign/releases/tag/v0.3 and as bioconda package bialign. Supplementary Information The online version contains supplementary material available at 10.1186/s13015-022-00219-7.


Incongruent evolution
While biological function is eventually encoded in a genomic sequence, it relies on the "decoding" of the sequence into a spatially structured RNA or protein, or into specific interactions, such as the binding of a DNA element by a transcription factor. Natural selection acts to conserve function over evolutionary times and therefore preserves functional RNA or protein structures, binding motifs, intron-exon boundaries, etc. Stabilizing selection on such a functional entity typically also causes the conservation of its encoding DNA sequence.
Homologous functional units, i.e. those that share a common ancestry [1], are therefore represented by homologous sequences. As a consequence, functional elements often can be identified based on their similarity in sequence alignments. For RNA and proteins, this allows the detection of consensus structures [2,3], enables the identification of transcription factor binding sites [4], and the detection of conserved (non-coding) transcripts through the conservation of splice junctions [5].
Homology of a feature or trait, however, does not require that all its constituent parts are homologous. Most obviously, insertions and deletions in a DNA sequence imply that not all nucleotides trace back to a common ancestor even if the sequence as a whole does. Similarly, homology of a structural feature does not imply that all its constituent contacts are preserved. There are indeed well-documented exceptions to the by far most common case of homologous features being produced

Open Access
Algorithms for Molecular Biology from homologous sequence positions. A well-studied, albeit apparently rare, example is intron-sliding, where the start and end of an intron "moves" in the same direction for the same number of nucleotides [6][7][8][9]. While the gene product is perfectly preserved, except possibly for some changes of the amino acids encoded by the few nucleotides involved in the sliding, both splice junctions are now encoded by non-homologous genomic positions. Promotors sometime exhibit a similar form of turnover, where a short binding site pattern at one site is replaced by the emergence of a matching sequence nearby [10]. In the context of biopolymer structures it is possible that contacts between nucleotides or amino acids are shifted relative to the underlying sequence in a way that preserves most features of the ancestral structure. Such transitions can be facilitated by the existence of kinetically accessible structural alternatives [11], of which different variants are stabilized by subsequent mutations in different lineages. In a preliminary survey, we recently observed that 72 of 1181 moderate-size Rfam families show evidence for this kind of incongruence between sequence and structure conservation [12]. This observation suggests that incongruent evolution of sequence and structure is relatively rare but still occurs with sufficient frequency to be non-negligible.
To our knowledge, incongruences between conserved protein sequence and conserved protein structures so far have not been studied systematically. However, the example of Fig. 1 demonstrates that (at least at the level of secondary structures) it is not at all difficult to obtain incongruence by performing a few mutations. Here, we (artificially) introduced substitutions into a peptide sequence such that predicted secondary structures shifted relative to the reference sequence. The comparative analysis of proteins occasionally reveals examples of natural incongruences between sequence and secondary structure; moreover, it shows that the phenomenon occurred at least occasionally in protein evolution. Figure 2(top) depicts the alignment of the extant human CYPB1 cytochrome P450 enzyme and its reconstructed ancestral mammalian counterpart, which was recently crystallized (PDB: 6OYU and 6OYV) and characterized functionally [15]. Despite the high level of similarity of the ancestral and extant folds, the bi-alignment (Fig. 2, bottom) reveals some differences in the extent of helices and suggests a shift of "helix D" by two amino acids, constituting an incongruence of the considered type. Another published example can be found in Fig. 5 of [16]: relative to the underlying sequence, one observes several small helix shifts in the evolution of the Pgp protein (MDR1) between human, mouse, and rat.
Incongruences between sequence homology and homology of structure or functional elements are rooted in the inherent redundancy of genotype-phenotype maps. For both RNA and proteins, very different sequences can encode the same fold or function [17][18][19], while at the same time identical sequences can appear in very different structural or functional contexts [20][21][22]. Together, these features sometimes lead to a sliding or migration of a functionally relevant structure in response to a fortuitously placed mutation. The occasional emergence of incongruences between sequence conservation and the conservation of structure thus is an expected consequence of the redundancies inherent in the sequence/structure relationship of biopolymers. It becomes a relevant empirical question, therefore, how frequent this process has been throughout evolutionary history. Fig. 1 Two pairwise alignments and a bi-alignment of peptide sequences and their predicted secondary structures (helix red, turn blue, β-sheet green, coil orange). Structure are predicted according to the Chou Fasman method [13] with CFSSP [14]. To facilitate quick visual assessment of sequence alignment quality, sequence mismatches are shown in bold black, sequence indels in non-bold black, and mismatches in dark red. The upper alignment optimizes sequence similarity, and shows the structure out of sync: the helix is moved to left, the last β-sheet is shifted to the right by 1 position. The second alignment maximizes structural similarity and thus shows little sequence similarity. The evolution of the two peptides is explained much better by a bi-alignment (third panel), which supports shift events (marked by rectangles) that can shift either sequence against its structure to the left ( < ) or to the right ( > ). The resulting regions of shift are indicated by in general k blue and red lines corresponding to shifts by k positions to the left or to the right. While the shift events shown in this example delete and insert structure of A with respect to both sequences and the structure of B, shift alignments also support as well analogous shifts of sequences and the second structure (which would be shown in the bottom row). In our representation, shift events are the only visible difference between the bi-alignment A in the third panel and the two alignments. Nevertheless, the representation can be mapped to our formalization of bi-alignments as alignments of two constituent alignments U and V : U is obtained from the 2nd and 3rd bi-alignment row by removing the two all-gap columns (i.e. the first and the 3rd-to-last column). The secondary structure alignment V coincides with the 1st and 4th row since there is no column that contains only gaps in these to rows Not only is incongruent evolution of interest as an under-studied aspect of evolutionary dynamics, but it has practical implications for data analysis. Incongruences impact our ability to detect and reconstruct consensus structures, since corresponding structural features are formed by evolutionarily unrelated nucleotides or amino acids, while homologous sequence positions form disparate structural elements. This means that (in the presence of incongruent evolution) a single multiple sequence alignment cannot simultaneously represent the similarities of sequence and structure. In particular, conserved structure can no longer be represented as 'consensus structure' , i.e. as an annotation of the columns of a sequence alignment.

Bi-alignments
We recently introduced bi-alignments [12,23] as a mathematically consistent way of describing incongruent evolutionary relationships. Bi-alignments are motivated by treating shifts between sequence and structure explicitly as evolutionary events. It is important to realize that it is not necessarily possible to find an optimal reconciliation of sequence and structure alignments by identifying shifts events a posteriori from a pair of sequence and structure alignments that have been computed separately. Instead, bi-alignments allow simultaneously predicting sequence and structure homologies and their relation. For this purpose, we define a bi-alignment to consist of two alignments (one based one sequence similarity, the other one based on structure similarity) that are related by a third alignment, which captures the shift events. All three constituent alignments contribute to a common score.
While bi-alignments have similarities to combined sequence and structure alignments (which also optimize a joint score for sequence and structure similarity), bi-alignments extend such models by supporting shift events explicitly. Combined sequence/structure alignments therefore can be interpreted as the limit case of bi-alignments where arbitrarily high shift penalties completely prohibit shift events. As important consequence, bi-alignments overcome the requirement of a consensus structure, which is the key assumption underlying combined sequence/structure alignments.
As their main purpose, bi-alignments provide a coherent framework to detect shift-like incongruences, i.e. a local "movement" of conserved structures relative to the underlying sequence. It is worth noting that the formal concept of bi-alignments is not tied to applications in structural biology. Instead, it can be seen as a way to quantify the effect of differences in scoring schemes that focus on different aspects of the same sequence. The only requirement for bi-alignments is a position-wise one-toone correspondence between the two different representations of each input object.
In this contribution, we extend bi-alignments with linear costs to a more realistic model with affine gap costs. We will illustrate our algorithmic developments using protein sequences and their secondary structures as an example, because the position-wise annotation of a secondary structure elements fits well with the framework of sequence alignments. The (artificial) example in Fig. 1 shows that incongruence between sequence and secondary structure can indeed be caused a few well-place substitutions. It also shows that bi-alignments are capable, at least in principle, to reconcile incongruent sequence and structure homologies and to identify shift events.
A bi-alignment is formally defined as an alignment relating two, generally different, alignments of the same objects.
consists of two pairwise alignments U and V of the objects a and b and an alignment W of U and V.
In Fig. 1, U is a sequence alignment (shown in the second row with the secondary structure annotation above and below the two sequences), while V is an alignment of the two respective secondary structures (shown in the second row with the two corresponding sequences between them). The columns of U and V are then aligned by W . Since the pairwise alignment of two pairwise alignments is equivalent to a 4-way alignment, bi-alignments can be thought of as multiple alignments A ∼ = (U, V, W) . The input objects a and b appear twice in A , once regarded as sequence (represented by the oneletter amino acid codes) and once regarded as secondary structure (shown a position-wise glyphs). Bi-alignments therefore differ from "structure-aware" sequence alignments by replacing the annotation of sequence positions with a secondary structure features by an alignment of both the sequence and the string of structural features. Importantly, A ∼ = (U, V, W) completely determines the alignments of the sequences of a and b with their secondary structures (shown in the third row of Fig. 1 as the first and last pair of rows, respectively.) These alignments in general contain gaps that indicate how the conserved "consensus" structure is shifted compared to the sequence positions.
Assuming a linear scoring model, i.e. scores for U , V , and W that are additively composed from single column contributions, it can be shown that the 4-way alignment A is scored additively as well [12,23]. Linear bi-alignment problems therefore can be exactly solved by dynamic programming [24,25] in quartic time. In this contribution we are interested in bi-alignments that are scored with affine gap costs.

Alignments as regular multi-tape grammars
To address this problem, it is helpful to describe the structure of alignments by multi-tape grammars, see e.g. [26] for a more detailed, formal discussion. In the simplest case, sequence alignments can be represented as regular grammars of the form A → Ac ǫ . The only nonterminal symbol A denotes a (pairwise) alignment, the terminal ǫ is the empty alignment, and the terminal c denotes an alignment column, which may be a (mis) Since alignments compare extant sequences rather than an ancestor/descendant pair, the two "indels" (insertion/ deletion) are biologically indistinguishable and hence receive the same score. The grammar simply expresses the fact that alignment can be constructed step-by-step by adding a column to an alignment of prefixes. For linear scoring functions, the production A → Ac allows adding the score of c to the previously accumulated score of the alignment A. Denote by M(x) the optimal score of an alignment of the prefixes a[1..
As noted e.g. in [27,28], the index vector of the penultimate column of the alignment is x − c , where • is interpreted as 1 and the gap character − as 0. The Needleman-Wunsch recursions [29] thus can be written in compact form (see also [24]) as Notably, in the non-affine case, the scoring function s(x, c) is completely determined by a single column. Affine gap cost. While linear gap costs are not very realistic in sequence alignment [30], arbitrary gap costs algorithmically require an additional factor O(n) in running time [31,32] and are difficult to parametrize in practice.
The affine gap cost model serves as a useful and convenient compromise that is most often used in practice. Here, the opening and the extension of a gap are scored differently. It is therefore necessary to distinguish three differ- alignments that end in a (mis)match, deletion, and insertion column, respectively. Again one obtains a regular grammar with analogous productions of the form A c → A c ′ c ǫ for the three non-terminals. Denote by M(x; c) the optimal score of an alignment of the prefixes a[1..
with end column of type c. We can then write Gotoh's well-known recursions [33] for pairwise affine gap cost alignment in the following compact form: formulation accommodates any scoring function s(x, c ′ , c) for which the column score depends on the gap pattern of the previous column. For instance, we could also score the closing of a gap separately. Both the Needleman-Wunsch algorithm and the Gotoh algorithm run in O(n 2 ) space and time. Recursion Eq. (1) also describes the dynamic programming algorithm for k-ary alignments [24,25,34,35], which requires O(n k ) space and time. The situation is more complicated, however, for affine gap costs. Sum-of-pairs scoring functions simply sum over the scores of all pairwise alignments contained in a given multiple alignment. Surprisingly, computing the optimal alignment of alignments with affine gap costs under the sum-of-pairsmodel is NP-complete unless the number of sequences in the constituent alignments is bounded [36]. On the other hand, scoring models of the form of Eq. (2) are of practical interest in particular for k = 3 [37][38][39].
In this contribution we show that the bi-alignment model with affine gap costs for the constituent alignments can be solved in polynomial time by dynamic programming. As we shall see, the recursions are of the form of Eq. (2) but require a subtle re-definition of M(x; c).

Bi-alignments
Recall that we define a bi-alignment as an alignment of alignments (Def. 1). It is well known that an alignment of alignments can be represented again as an alignment. This compositional structure of alignments is discussed formally in [40]. In our case, A is a 4-way alignment from which U (and V ) are obtained as "projections", i.e. by extracting the corresponding pair of rows and removing all columns consisting of a pair of gap characters. The alignment W , on the other hand, is obtained by considering each column in U and V as a single letter; and moreover interpreting the columns of the form − − (i.e. the ones that are removed in the projections to U and V ) as gap characters. The Bi-Alignment Problem for two input sequences a and b consists in optimizing with given scoring functions u, v, and w. The special case where u, v, and w are linear scoring functions has been discussed in [12,23].
The alignment W of U and V describes the shifts distinguishing U and V in the following manner. First, consider a match column α of W . It consists of a pair of columns with gap patterns c(α) and d(α) , respectively. Using their numerical interpretation, we observe that measures whether none, one, or both input sequences are shifted relative to each other (Fig. 3). Insertions and deletions in W correspond to inserting an all-gap column − − into U or V , respectively, and always lead to incongruences. We note, furthermore, that there is a one-to-one correspondence between the columns of W and the columns of the 4-way alignment A . Thus we can count the number of shifts s(A) = α∈A s(α) . The alignment A contains subalignments A (aa) and A (bb) of the first and second input sequence with itself. Let us denote the number of indels in these two projected alignments by δ a and δ b , respectively.

Lemma 2 If
Thus δ a (α) = 1 if α is an indel column in the projected self-alignment of a , and δ a (α) = 0 if α is a (mis)match column. Note that all-gap columns are omitted in the projection and thus do not contribute to the indel count. Thus δ a = α∈A δ a (α) correctly counts the indels in A (aa) . An analogous equality holds for δ b . A comparison with Eq. (4) completes the proof.

Bi-alignments with affine gaps costs
Lemma 2 provides an alternative interpretation in terms of a simple linear score for A (aa) and A (bb) . We can therefore think of Eq. (3) as a restricted sum-of-pairs model in which only four of the six pairwise alignments in A contribute. In this picture it is natural to assume that the constituent alignments U and V are scored with affine gap costs. In the light of the NP-hardness result of [36] it is not at all obvious, however, that the bi-alignment problem with affine gap costs can be solved in polynomial time.
In order to address this problem, we first recall the language of multi-way alignments. The following statement is "folklore", see e.g. [40]: every column of the 4-way alignment A is uniquely determined by sponds to a match in W . This regular language is sufficient for linear gap cost models [12,23]. In order to handle affine gap costs for U and V , we need to keep track of the gap patterns of the preceding alignment column in U and V . This is not the same as considering the preceding column of A because gap patterns of the form  correspond to all-gap columns, which are removed in U or V . Thus, we introduce a new notion of column type to address these 'preceding' gap patterns of the sub-alignments.

Definition 3
The end column type (p, q) of a bi-alignment A ∼ = (U, V, W) consists of the gap pattern p of the last column of U and the gap pattern q of the last column of V . The end column type of the empty alignment is left arbitrary.
The definition is illustrated in Fig. 4. Note that by construction, neither p nor q consist only of gaps. Now, we define a column-wise scoring function that captures the alignment score with affine gap cost. It scores a single column of a bi-alignment A , characterized by x y and c d , depending on the end column type c ′ d ′ of the previous column. This function has the form Since score(x, c ′ , 0) and score(y, d ′ , 0) , respectively, correspond to all-gap columns in U and V , we observe that the sum of the score Thus, Eq. (5) correctly scores the bi-alignment with general affine gap costs for both U and V.
In order to derive a dynamic programming algorithm that solves the bi-alignment problem with this type of scoring function, we consider a decomposition of the search space in grammar form. The non-terminals A (p,q) correspond to bi-alignment with end column type (p, q). The terminals are the 15 possible column types of a 4-way alignment, which we write as p q , with p, q = − − as well as − q q − where the − in the latter is a shorthand for − − . In addition, we write ǫ for the empty 4-way column. Fig. 4 The end column type of an bi-alignment is defined by the last column of each of the constituent pairwise alignments of a and b that is not an all-gap column Note that this grammar would allow terminating with any end column type. This is undesirable since we would like the first column to be scored as it was preceded by a match column in both U and V . This is easily implemented by an appropriate initialization for x = y = 0 , however.

Lemma 4 The language of bi-alignments with fixed end column type is generated by the productions
Definition 5 Let M p,q (x, y) denote the optimal score of a 4-way alignment with end column type (p, q).
In order to enforce that empty alignment is treated as

Theorem 6
The matrices M p,q satisfy the recursion We first note that every column of A is either a (mis)match or an indel column w.r.t. W . These correspond to the first three alternative productions in Eq. (7), and cover all alternatives. Since score c d depends only on the current column and the end column type, we obtain the optimal score of an alignment A with end column type (p, q) and last column (c, d) as the optimal score of an alignment A ′ with any of the matching column type plus the score score As an immediate consequence we have

Affine shift costs
While bi-alignment with affine gap cost and linear shift costs may be of the most obvious practical relevance, we also discuss two variations with affine shift costs. First of all, we clarify how to attribute affine shift cost in our bialignment scoring model. Let's take a step back to our original definition of the bi-alignment score (Eq. 3) and our previous suggestion to define the "shift" score component w(A) as −�s(A) , i.e. as a multiple of s(A) . Since the latter was defined as the number of gap columns in the alignments A (aa) and A (bb) , this amounts to scoring shifts in a linear cost model, where every shift has a cost of per column.
For affine shift costs, we take the view that every consecutive run of gap symbols in the pairwise alignments of the two copies of a and b represents one shift. This shift is scored in the same way as gaps are scored under affine gap cost, i.e. based on the shift opening cost o plus the shift extension cost times the length of the shift (number of shift columns).
We first consider affine shift cost and non-affine (i.e. linear) gap cost. Since affine shifts are scored exactly in the same way as affine gaps, this situation is symmetric to the case of affine gap cost combined with linear shift cost. The corresponding bi-alignment problem can thus be solved efficiently by applying exactly the same idea as in our previous algorithm (Theorem 6), only now keeping track by p and q of the gap patterns in the respective alignments of rows 1&3 and 2&4 . We immediately obtain

Combining affine gap and shift costs
More remarkably, we can even solve the general case of affine gap cost and affine shift cost in polynomial time by dynamic programming. Essentially, we combine the ideas of the above two algorithms. Our algorithm follows a grammar with general decomposition In order to evaluate affine gaps and affine shifts correctly at the same time, we need to know the last nongap-only gap patterns of all four pairwise alignments of rows 1&2 , 1&3 , 2&4 , and 3&4 ; thus, we utilize non-terminals A p , for all p that encode the respective gap patterns p = (p 12 , p 13 , p 24 , p 34 ) . By the same argument as before, we can show this information to be sufficient to score shifts and gaps correctly in affine cost models for every possible last column c.
One keeps track of the correct gap patterns for all of the relevant pairwise alignments by setting the entries of p ′ as (9) A p → A p ′ c for ij ∈ {12, 13, 24, 34} , depending on p and c in Eq. (9). For termination, we add the grammar rule: . This allows implicit accounting for gap and shift openings of respective gaps and shifts at the left end of alignment strings.
Remarks about generalizations and complexity Note that the existence of an efficient algorithm for general affine bi-alignment does not contradict the general hardness of multiple alignment with affine gap costs, even if it suggests the following generalization: Multiple (k-way) alignment with affine gap costs can be computed by dynamic programming following the above idea of keeping track of the right-most non-gap-only gap-patterns in all pairwise alignments. This requires considering k 2 many pairwise gap patterns, each out of three possibili- The resulting DP-algorithm for kway alignment thus needs exponentially many matrices in k.
In bi-alignments of two sequences, we need to consider only four gap patterns, two for the two alignments and two for the shifts between the sequence copies. That is, there are (at most) 3 4 = 81 combinations, which have to be represented by different matrices for the DP algorithm. This gets a little more practical, since many of these combinations cannot occur in valid bi-alignments.
For example, having gap patterns • • for both alignments of a and b , rules out all patterns for the alignments of the copies that contradict having last columns Consequently, we find only 51 consistent gap pattern combinations, while we can proof 30 combinations inconsistent due to an analogous argument as sketched above.

Sub-additive gap costs
The affine gap cost model, despite its algorithmic convenience, has been criticized because empirical gap length distributions usually are power laws thus suggesting a logarithmic gap costs [41]. However, gap costs of the form w(ℓ) = a + bℓ + c ln ℓ seem to yield better alignments in practice [42]. Pairwise alignments with subadditive gap costs can be computed by dynamic programming, considering insertions and deletions of arbitrary length: This idea does not seem to generalize to bi-alignments. It is possible, however, to generalize the end column type.
. The extensions of an insertion is scored by an analogous expression. The auxiliary entries M p,0 (x) are used to correctly score alignments in which the last column is different from the previous end gap pattern. This recursion runs in cubic time, but also requires cubic space (instead of quadratic space). For our purposes, however, it has the advantage that the score is again defined column-wise albeit at the expense of having to keep track of a linear instead of a constant number of end gap types. It generalizes to a recursion with four indices to compute the optimal bi-alignment.

Computational results
We implemented the bi-alignment algorithm with affine gap cost (Corollary 7) in Python 3. For improved performance, we adapted time-critical parts of the code to the Python C-extension Cython with some carefully chosen static typing. The new implementation was based on our previous implementation for RNA bi-alignment with linear gap cost [12,23]. Like the earlier version, it allows the user to limit the number of positions either sequence can be shifted to the left or right against its own structure by a constant . The restricted recursions, following in essence the idea of [34,35], have time complexity of O(n 2 2 ) instead of the unrestricted, but often impractical complexity O(n 4 ) . In addition to efficient bi-alignment with affine gap cost, new features have been added to the software: 1. Protein sequences may be scored with an arbitrary, user-defined similarity matrix. The BLOSUM62 matrix [43,44] is supplied as default. As a proof of concept we generated optimal bi-alignments of DNA Polymerase I from Escherichia (length 928) and Xanthomonas hortorum (length 933), while allowing shifts of sequence against structure by up to two positions to the left or to the right in either protein ( = 2 ). On an Intel(R) Core(TM) i7-10810U CPU, this (single-threaded) computation took 37 min. Note that a simple banding strategy on insertions and deletions could dramatically speed up such computations, typically without sacrificing alignment quality. The analogous computation allowing only one shift positions ( = 1 ) was performed in 10.5 min. Due to filling 9 dynamic programming matrices and considering 15 recursion cases per entry, the same implementation still takes 26 s, if shifts are completely forbidden ( = 0). Figure 5 shows the resulting bi-alignment for = 2 . For comparison, the results from = 1 and = 0 are given in Additional file 1. We chose a rather moderate shift cost = −210 , compared to a bonus of 800 per structure match as well as gap extension and gap opening costs of −50 and −200 , respectively. While we suspect that this parameter choice is too generous, it serves here to demonstrate that the algorithm readily predicts shifts that improve the compromise between primary and secondary structure alignment. The estimation of realistic shift costs is a non-trivial problem beyond the scope of this contribution.

Concluding remarks
We have shown here that bi-alignments with affine gap cost models for both constituent alignments and linear shift costs can be computed in quartic time by dynamic programming. Moreover, limiting the number of shifts to a constant reduces the cost to quadratic space and time. This makes the detection of locally-confined shifts computationally feasible for sequences of with length of realistic proteins or mRNAs. While we have illustrated our algorithmic innovations here using amino acid sequences and protein secondary structures as an example, the algorithm and its implementation is applicable to any linear representation of monomer-wise features along a biopolymer. In can be used, for instance, directly as an extension of the linear-gap-cost bi-alignments of RNAs described in [12].
We have focused here on the analysis of optimization problem and development of the algorithm. In addition to cost models for the constituent alignments U and V , a bi-alignment problem also requires the specification of the shift costs, i.e. the scoring model for W . Even though the scoring systems for U and V are borrowed from other studies, the choice of appropriate shift parameters remains an open problem for future work. This is a difficult problem for two reasons: (i) There is, at present, no collection of test cases with known shifts of sequences versus secondary structure for either proteins or RNAs that could be used to optimize the parameters. (ii) A biologically sound survey of proteins should presumably use a more elaborate scoring model for secondary structure elements that distinguishes amino acid positions depending on the distance from the element's ends. It stands to reason that the choice of the scoring model for the secondary structures would substantially influence estimates of the shift costs. Here, we are therefore content with a solution of algorithmic issues and a reference implementation. This provides the necessary tools for an in-depth empirical study of incongruent evolution of protein secondary structures in the future.
The formal framework of bi-alignments, Eq. (3), is much more general than the position-wise scoring models corresponding to regular multi-tape grammars. These were studied here because the corresponding optimization problems can be solved exactly by means of relatively simple dynamic programming algorithms. In a more general setting, one may want to consider V as an alignment of contact structures [45] or as an alignment of ordered sequences of 3D points, e.g. scored in terms of euclidean distances [46,47]. This is of increasing practical interest as recent advances in protein folding [48,49] provide access to high quality 3D structure predictions. The availability of accurately predicted protein structures of course also yields secondary structures, e.g. with the help of DSSP [50], which could be used for a systematic survey of incongruences in protein secondary structures. Alternatively, it seems promising to modify existing solutions to the protein structure alignment problems [51] to the corresponding bi-alignment problems. It is not obvious whether such a joint sequence and structure alignment problem implicitly contains a sequenceto-structure threading problem, which is known to be NP-complete [52]. In another forthcoming study, we are considering the corresponding problem for RNA secondary structures. In this case, the bi-alignment problem is amenable to a DP approach related to Sankoff 's algorithm for the simultaneous folding and alignment of RNAs [53].
In [12] we further generalized bi-alignments to polyalignments comprising k > 2 pairwise alignments U (i) , 1 ≤ i ≤ k ≥ 2 that are connected by a k-way alignment W . Each of the alignments U (i) then describes one particular aspect of the sequence. In addition to the individual amino acids and secondary structure elements, these may represent comparisons of profiles of physico-chemical parameters. It is not difficult to see that the grammar Eq. (7) generalizes to this case by defining end gap types (p 1 , p 2 , . . . p k ) with p i = − − . The corresponding grammar then needs to consider all 2 k gap patterns for the last column of the k-way alignment W . Optimal poly-alignments comprising k pairwise alignments with affine gap costs and additive cost contributions for the shifts between each pair of constituent alignments thus can be computed exactly in O(n 2k ) space and time. Complementarily, one may consider alignments U and V of more than two sequences and their corresponding structures. The scoring of W then must accommodate more complex shift patterns, whose total number again increases exponentially in k. It is unlikely, therefore, that exact dynamic programming algorithms for these generalized problems will be practical. This begs the question whether polyalignment problems can be approximated e.g. by progressive alignment schemes in a manner that is satisfactory from an applications point of view.