 Research
 Open access
 Published:
Bialignments with affine gaps costs
Algorithms for Molecular Biology volume 17, Article number: 10 (2022)
Abstract
Background
Commonly, sequence and structure elements are assumed to evolve congruently, such that homologous sequence positions correspond to homologous structural features. Assuming congruent evolution, alignments based on sequence and structure similarity can therefore optimize both similarities at the same time in a single alignment. To model incongruent evolution, where sequence and structural features diverge positionally, we recently introduced bialignments. This generalization of sequence and structurebased alignments is best understood as alignments of two distinct pairwise alignments of the same entities: one modeling sequence similarity, the other structural similarity.
Results
Optimal bialignments with affine gap costs (or affine shift cost) for two constituent alignments can be computed exactly in quartic space and time. Even bialignments with affine shift and gap cost, as well as bialignment with subadditive gap cost are optimized efficiently. Affine gapcost bialignment of large proteins (\(\sim 930\) aa) can be computed.
Conclusion
Affine cost bialignments are of practical interest to study shifts of protein sequences and protein structures relative to each other.
Availability
The affine cost bialignment algorithm has been implemented in Python 3 and Cython. It is available as free software from https://github.com/swill/BiAlign/releases/tag/v0.3 and as bioconda package bialign.
Introduction
Incongruent evolution
While biological function is eventually encoded in a genomic sequence, it relies on the “decoding” of the sequence into a spatially structured RNA or protein, or into specific interactions, such as the binding of a DNA element by a transcription factor. Natural selection acts to conserve function over evolutionary times and therefore preserves functional RNA or protein structures, binding motifs, intron–exon boundaries, etc. Stabilizing selection on such a functional entity typically also causes the conservation of its encoding DNA sequence. Homologous functional units, i.e. those that share a common ancestry [1], are therefore represented by homologous sequences. As a consequence, functional elements often can be identified based on their similarity in sequence alignments. For RNA and proteins, this allows the detection of consensus structures [2, 3], enables the identification of transcription factor binding sites [4], and the detection of conserved (noncoding) transcripts through the conservation of splice junctions [5].
Homology of a feature or trait, however, does not require that all its constituent parts are homologous. Most obviously, insertions and deletions in a DNA sequence imply that not all nucleotides trace back to a common ancestor even if the sequence as a whole does. Similarly, homology of a structural feature does not imply that all its constituent contacts are preserved. There are indeed welldocumented exceptions to the by far most common case of homologous features being produced from homologous sequence positions. A wellstudied, albeit apparently rare, example is intronsliding, where the start and end of an intron “moves” in the same direction for the same number of nucleotides [6,7,8,9]. While the gene product is perfectly preserved, except possibly for some changes of the amino acids encoded by the few nucleotides involved in the sliding, both splice junctions are now encoded by nonhomologous genomic positions. Promotors sometime exhibit a similar form of turnover, where a short binding site pattern at one site is replaced by the emergence of a matching sequence nearby [10]. In the context of biopolymer structures it is possible that contacts between nucleotides or amino acids are shifted relative to the underlying sequence in a way that preserves most features of the ancestral structure. Such transitions can be facilitated by the existence of kinetically accessible structural alternatives [11], of which different variants are stabilized by subsequent mutations in different lineages. In a preliminary survey, we recently observed that 72 of 1181 moderatesize Rfam families show evidence for this kind of incongruence between sequence and structure conservation [12]. This observation suggests that incongruent evolution of sequence and structure is relatively rare but still occurs with sufficient frequency to be nonnegligible.
To our knowledge, incongruences between conserved protein sequence and conserved protein structures so far have not been studied systematically. However, the example of Fig. 1 demonstrates that (at least at the level of secondary structures) it is not at all difficult to obtain incongruence by performing a few mutations. Here, we (artificially) introduced substitutions into a peptide sequence such that predicted secondary structures shifted relative to the reference sequence. The comparative analysis of proteins occasionally reveals examples of natural incongruences between sequence and secondary structure; moreover, it shows that the phenomenon occurred at least occasionally in protein evolution. Figure 2(top) depicts the alignment of the extant human CYPB1 cytochrome P450 enzyme and its reconstructed ancestral mammalian counterpart, which was recently crystallized (PDB: 6OYU and 6OYV) and characterized functionally [15]. Despite the high level of similarity of the ancestral and extant folds, the bialignment (Fig. 2, bottom) reveals some differences in the extent of helices and suggests a shift of “helix D” by two amino acids, constituting an incongruence of the considered type. Another published example can be found in Fig. 5 of [16]: relative to the underlying sequence, one observes several small helix shifts in the evolution of the Pgp protein (MDR1) between human, mouse, and rat.
Incongruences between sequence homology and homology of structure or functional elements are rooted in the inherent redundancy of genotypephenotype maps. For both RNA and proteins, very different sequences can encode the same fold or function [17,18,19], while at the same time identical sequences can appear in very different structural or functional contexts [20,21,22]. Together, these features sometimes lead to a sliding or migration of a functionally relevant structure in response to a fortuitously placed mutation. The occasional emergence of incongruences between sequence conservation and the conservation of structure thus is an expected consequence of the redundancies inherent in the sequence/structure relationship of biopolymers. It becomes a relevant empirical question, therefore, how frequent this process has been throughout evolutionary history.
Not only is incongruent evolution of interest as an understudied aspect of evolutionary dynamics, but it has practical implications for data analysis. Incongruences impact our ability to detect and reconstruct consensus structures, since corresponding structural features are formed by evolutionarily unrelated nucleotides or amino acids, while homologous sequence positions form disparate structural elements. This means that (in the presence of incongruent evolution) a single multiple sequence alignment cannot simultaneously represent the similarities of sequence and structure. In particular, conserved structure can no longer be represented as ‘consensus structure’, i.e. as an annotation of the columns of a sequence alignment.
Bialignments
We recently introduced bialignments [12, 23] as a mathematically consistent way of describing incongruent evolutionary relationships. Bialignments are motivated by treating shifts between sequence and structure explicitly as evolutionary events. It is important to realize that it is not necessarily possible to find an optimal reconciliation of sequence and structure alignments by identifying shifts events a posteriori from a pair of sequence and structure alignments that have been computed separately. Instead, bialignments allow simultaneously predicting sequence and structure homologies and their relation. For this purpose, we define a bialignment to consist of two alignments (one based one sequence similarity, the other one based on structure similarity) that are related by a third alignment, which captures the shift events. All three constituent alignments contribute to a common score.
While bialignments have similarities to combined sequence and structure alignments (which also optimize a joint score for sequence and structure similarity), bialignments extend such models by supporting shift events explicitly. Combined sequence/structure alignments therefore can be interpreted as the limit case of bialignments where arbitrarily high shift penalties completely prohibit shift events. As important consequence, bialignments overcome the requirement of a consensus structure, which is the key assumption underlying combined sequence/structure alignments.
As their main purpose, bialignments provide a coherent framework to detect shiftlike incongruences, i.e. a local “movement” of conserved structures relative to the underlying sequence. It is worth noting that the formal concept of bialignments is not tied to applications in structural biology. Instead, it can be seen as a way to quantify the effect of differences in scoring schemes that focus on different aspects of the same sequence. The only requirement for bialignments is a positionwise onetoone correspondence between the two different representations of each input object.
In this contribution, we extend bialignments with linear costs to a more realistic model with affine gap costs. We will illustrate our algorithmic developments using protein sequences and their secondary structures as an example, because the positionwise annotation of a secondary structure elements fits well with the framework of sequence alignments. The (artificial) example in Fig. 1 shows that incongruence between sequence and secondary structure can indeed be caused a few wellplace substitutions. It also shows that bialignments are capable, at least in principle, to reconcile incongruent sequence and structure homologies and to identify shift events.
A bialignment is formally defined as an alignment relating two, generally different, alignments of the same objects.
Definition 1
A bialignment \(\mathbb {A}\cong (\mathbb {U},\mathbb {V},\mathbb {W})\) consists of two pairwise alignments \(\mathbb {U}\) and \(\mathbb {V}\) of the objects \(\mathbf {a}\) and \(\mathbf {b}\) and an alignment \(\mathbb {W}\) of \(\mathbb {U}\) and \(\mathbb {V}\).
In Fig. 1, \(\mathbb {U}\) is a sequence alignment (shown in the second row with the secondary structure annotation above and below the two sequences), while \(\mathbb {V}\) is an alignment of the two respective secondary structures (shown in the second row with the two corresponding sequences between them). The columns of \(\mathbb {U}\) and \(\mathbb {V}\) are then aligned by \(\mathbb {W}\). Since the pairwise alignment of two pairwise alignments is equivalent to a 4way alignment, bialignments can be thought of as multiple alignments \(\mathbb {A}\cong (\mathbb {U},\mathbb {V},\mathbb {W})\). The input objects \(\mathbf {a}\) and \(\mathbf {b}\) appear twice in \(\mathbb {A}\), once regarded as sequence (represented by the oneletter amino acid codes) and once regarded as secondary structure (shown a positionwise glyphs). Bialignments therefore differ from “structureaware” sequence alignments by replacing the annotation of sequence positions with a secondary structure features by an alignment of both the sequence and the string of structural features. Importantly, \(\mathbb {A}\cong (\mathbb {U},\mathbb {V},\mathbb {W})\) completely determines the alignments of the sequences of \(\mathbf {a}\) and \(\mathbf {b}\) with their secondary structures (shown in the third row of Fig. 1 as the first and last pair of rows, respectively.) These alignments in general contain gaps that indicate how the conserved “consensus” structure is shifted compared to the sequence positions.
Assuming a linear scoring model, i.e. scores for \(\mathbb {U}\), \(\mathbb {V}\), and \(\mathbb {W}\) that are additively composed from single column contributions, it can be shown that the 4way alignment \(\mathbb {A}\) is scored additively as well [12, 23]. Linear bialignment problems therefore can be exactly solved by dynamic programming [24, 25] in quartic time. In this contribution we are interested in bialignments that are scored with affine gap costs.
Alignments as regular multitape grammars
To address this problem, it is helpful to describe the structure of alignments by multitape grammars, see e.g. [26] for a more detailed, formal discussion. In the simplest case, sequence alignments can be represented as regular grammars of the form \(A\rightarrow Ac \mathrel {\big \vert }\epsilon\). The only nonterminal symbol A denotes a (pairwise) alignment, the terminal \(\epsilon\) is the empty alignment, and the terminal c denotes an alignment column, which may be a (mis)match \({\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }\), a deletion \({\left( {\begin{matrix}\bullet \\ \end{matrix}}\right) }\), or an insertion \({\left( {\begin{matrix}\\ \bullet \end{matrix}}\right) }\}\). Since alignments compare extant sequences rather than an ancestor/descendant pair, the two “indels” (insertion/deletion) are biologically indistinguishable and hence receive the same score. The grammar simply expresses the fact that alignment can be constructed stepbystep by adding a column to an alignment of prefixes. For linear scoring functions, the production \(A\rightarrow Ac\) allows adding the score of c to the previously accumulated score of the alignment A. Denote by M(x) the optimal score of an alignment of the prefixes \(\mathbf {a}[1..x_1]\) and \(\mathbf {b}[1..x_2]\). As noted e.g. in [27, 28], the index vector of the penultimate column of the alignment is \(xc\), where \(\bullet\) is interpreted as 1 and the gap character − as 0. The Needleman–Wunsch recursions [29] thus can be written in compact form (see also [24]) as
Notably, in the nonaffine case, the scoring function s(x, c) is completely determined by a single column.
Affine gap cost. While linear gap costs are not very realistic in sequence alignment [30], arbitrary gap costs algorithmically require an additional factor O(n) in running time [31, 32] and are difficult to parametrize in practice. The affine gap cost model serves as a useful and convenient compromise that is most often used in practice. Here, the opening and the extension of a gap are scored differently. It is therefore necessary to distinguish three different nonterminal \(A_{{\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }}\), \(A_{{\left( {\begin{matrix}\bullet \\ \end{matrix}}\right) }}\), \(A_{{\left( {\begin{matrix}\\ \bullet \end{matrix}}\right) }}\) designating alignments that end in a (mis)match, deletion, and insertion column, respectively. Again one obtains a regular grammar with analogous productions of the form \(A_c \rightarrow A_{c^{\prime}}c \mathrel {\big \vert }\epsilon\) for the three nonterminals. Denote by M(x; c) the optimal score of an alignment of the prefixes \(\mathbf {a}[1..x_1]\) and \(\mathbf {b}[1..x_2]\) with end column of type c. We can then write Gotoh’s wellknown recursions [33] for pairwise affine gap cost alignment in the following compact form:
with initial conditions \(M(0,{\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) })=0\), \(M\left(0,{\left( {\begin{matrix}\\ \bullet \end{matrix}}\right) }\right)=M\left(0,{\left( {\begin{matrix}\bullet \\ \end{matrix}}\right) }\right)= \infty\). In principle this formulation accommodates any scoring function \(s(x,c^{\prime},c)\) for which the column score depends on the gap pattern of the previous column. For instance, we could also score the closing of a gap separately.
Both the Needleman–Wunsch algorithm and the Gotoh algorithm run in \(O(n^2)\) space and time. Recursion Eq. (1) also describes the dynamic programming algorithm for kary alignments [24, 25, 34, 35], which requires \(O(n^k)\) space and time. The situation is more complicated, however, for affine gap costs. Sumofpairs scoring functions simply sum over the scores of all pairwise alignments contained in a given multiple alignment. Surprisingly, computing the optimal alignment of alignments with affine gap costs under the sumofpairsmodel is NPcomplete unless the number of sequences in the constituent alignments is bounded [36]. On the other hand, scoring models of the form of Eq. (2) are of practical interest in particular for \(k=3\) [37,38,39].
In this contribution we show that the bialignment model with affine gap costs for the constituent alignments can be solved in polynomial time by dynamic programming. As we shall see, the recursions are of the form of Eq. (2) but require a subtle redefinition of M(x; c).
Theory
Bialignments
Recall that we define a bialignment as an alignment of alignments (Def. 1). It is well known that an alignment of alignments can be represented again as an alignment. This compositional structure of alignments is discussed formally in [40]. In our case, \(\mathbb {A}\) is a 4way alignment from which \(\mathbb {U}\) (and \(\mathbb {V}\)) are obtained as “projections”, i.e. by extracting the corresponding pair of rows and removing all columns consisting of a pair of gap characters. The alignment \(\mathbb {W}\), on the other hand, is obtained by considering each column in \(\mathbb {U}\) and \(\mathbb {V}\) as a single letter; and moreover interpreting the columns of the form \({\left( {\begin{matrix}\\ \end{matrix}}\right) }\) (i.e. the ones that are removed in the projections to \(\mathbb {U}\) and \(\mathbb {V}\)) as gap characters.
The BiAlignment Problem for two input sequences \(\mathbf {a}\) and \(\mathbf {b}\) consists in optimizing
with given scoring functions u, v, and w. The special case where u, v, and w are linear scoring functions has been discussed in [12, 23].
The alignment \(\mathbb {W}\) of \(\mathbb {U}\) and \(\mathbb {V}\) describes the shifts distinguishing \(\mathbb {U}\) and \(\mathbb {V}\) in the following manner. First, consider a match column \(\alpha\) of \(\mathbb {W}\). It consists of a pair of columns with gap patterns \(c(\alpha )\) and \(d(\alpha )\), respectively. Using their numerical interpretation, we observe that
measures whether none, one, or both input sequences are shifted relative to each other (Fig. 3). Insertions and deletions in \(\mathbb {W}\) correspond to inserting an allgap column \({\left( {\begin{matrix}\\ \end{matrix}}\right) }\) into \(\mathbb {U}\) or \(\mathbb {V}\), respectively, and always lead to incongruences. We note, furthermore, that there is a onetoone correspondence between the columns of \(\mathbb {W}\) and the columns of the 4way alignment \(\mathbb {A}\). Thus we can count the number of shifts \(s(\mathbb {A})=\sum _{\alpha \in \mathbb {A}} s(\alpha )\). The alignment \(\mathbb {A}\) contains subalignments \(\mathbb {A}^{(\mathbf {aa})}\) and \(\mathbb {A}^{(\mathbf {bb})}\) of the first and second input sequence with itself. Let us denote the number of indels in these two projected alignments by \(\delta _{\mathbf {a}}\) and \(\delta _{\mathbf {b}}\), respectively.
Lemma 2
If \(\mathbb {A}\cong (\mathbb {U},\mathbb {V},\mathbb {W})\) is a bialignment of \(\mathbf {a}\) and \(\mathbf {b},\) then \(s(\mathbb {A})=\delta _{\mathbf {a}}+\delta _{\mathbf {b}}.\)
Proof
For column \(\alpha\) of \(\mathbb {A}\) we write \(\delta _{\mathbf {a}}(\alpha ):=c_1(\alpha )d_1(\alpha )\) and \(\delta _{\mathbf {b}}(\alpha ):=c_2(\alpha )d_2(\alpha )\). Thus \(\delta _{\mathbf {a}}(\alpha )=1\) if \(\alpha\) is an indel column in the projected selfalignment of \(\mathbf {a}\), and \(\delta _{\mathbf {a}}(\alpha )=0\) if \(\alpha\) is a (mis)match column. Note that allgap columns are omitted in the projection and thus do not contribute to the indel count. Thus \(\delta _{\mathbf {a}}=\sum _{\alpha \in \mathbb {A}} \delta _{\mathbf {a}}(\alpha )\) correctly counts the indels in \(\mathbb {A}^{(\mathbf {aa})}\). An analogous equality holds for \(\delta _{\mathbf {b}}\). A comparison with Eq. (4) completes the proof. \(\square\)
A natural scoring function for \(\mathbb {W}\) is thus to penalize the total number of shifts, setting \(w(\mathbb {A})=\Delta s(\mathbb {A})\). This amounts to computing the shift contribution for each column \({\left( {\begin{matrix}c\\ d\end{matrix}}\right) }\) of \(\mathbb {A}\) as \({\text{shift}}(c,d)= \Delta cd = \Delta \,(c_1d_1+c_2d_2)\).
Bialignments with affine gaps costs
Lemma 2 provides an alternative interpretation in terms of a simple linear score for \(\mathbb {A}^{(\mathbf {aa})}\) and \(\mathbb {A}^{(\mathbf {bb})}\). We can therefore think of Eq. (3) as a restricted sumofpairs model in which only four of the six pairwise alignments in \(\mathbb {A}\) contribute. In this picture it is natural to assume that the constituent alignments \(\mathbb {U}\) and \(\mathbb {V}\) are scored with affine gap costs. In the light of the NPhardness result of [36] it is not at all obvious, however, that the bialignment problem with affine gap costs can be solved in polynomial time.
In order to address this problem, we first recall the language of multiway alignments. The following statement is “folklore”, see e.g. [40]: every column of the 4way alignment \(\mathbb {A}\) is uniquely determined by

i.
a fourdimensional index (x, y) identifying the prefixes \(\mathbf {a}[1..x_1]\), \(\mathbf {b}[1..x_2]\), \(\mathbf {a}[1..y_1]\), and \(\mathbf {b}[1..y_2]\) that are aligned up to the focal column.

ii.
a gap pattern \((c,d)=((c_1,c_2),(d_1,d_2))\) specifying whether the entry in a column is a letter or a gap character.
The language of 4way alignments is generated by the regular language \(A\rightarrow A{\left( {\begin{matrix}c\\ d\end{matrix}}\right) } \mathrel {\big \vert }\epsilon\), where the nonterminal A denotes a bialignment and the terminals \({\left( {\begin{matrix}c\\ d\end{matrix}}\right) }\) correspond to one of the 15 possible gap patterns in a column of elements (excluding the allgap column). Note that \(c={\left( {\begin{matrix}\\ \end{matrix}}\right) }\) and \(d={\left( {\begin{matrix}\\ \end{matrix}}\right) }\) respectively correspond to an insertion and deletion in \(\mathbb {W}\), while \(c,d\ne {\left( {\begin{matrix}\\ \end{matrix}}\right) }\) corresponds to a match in \(\mathbb {W}\). This regular language is sufficient for linear gap cost models [12, 23].
In order to handle affine gap costs for \(\mathbb {U}\) and \(\mathbb {V}\), we need to keep track of the gap patterns of the preceding alignment column in \(\mathbb {U}\) and \(\mathbb {V}\). This is not the same as considering the preceding column of \(\mathbb {A}\) because gap patterns of the form \({\left( {\begin{matrix}{\left( {\begin{matrix}\\ \end{matrix}}\right) }\\ d\end{matrix}}\right) }\) and \({\left( {\begin{matrix}c\\ {\left( {\begin{matrix}\\ \end{matrix}}\right) }\end{matrix}}\right) }\) correspond to allgap columns, which are removed in \(\mathbb {U}\) or \(\mathbb {V}\). Thus, we introduce a new notion of column type to address these ’preceding’ gap patterns of the subalignments.
Definition 3
The end column type (p, q) of a bialignment \(\mathbb {A}\cong (\mathbb {U},\mathbb {V},\mathbb {W})\) consists of the gap pattern p of the last column of \(\mathbb {U}\) and the gap pattern q of the last column of \(\mathbb {V}\). The end column type of the empty alignment is left arbitrary.
The definition is illustrated in Fig. 4. Note that by construction, neither p nor q consist only of gaps.
Now, we define a columnwise scoring function that captures the alignment score with affine gap cost. It scores a single column of a bialignment \(\mathbb {A}\), characterized by \({\left( {\begin{matrix}x\\ y\end{matrix}}\right) }\) and \({\left( {\begin{matrix}c\\ d\end{matrix}}\right) }\), depending on the end column type \({\left( {\begin{matrix}c^{\prime}\\ d^{\prime}\end{matrix}}\right) }\) of the previous column. This function has the form
Since \({\text{score}}(x,c^{\prime},0)\) and \({\text{score}}(y,d^{\prime},0)\), respectively, correspond to allgap columns in \(\mathbb {U}\) and \(\mathbb {V}\), we observe that the sum of the \({\text{score}}\left({\left( {\begin{matrix}x\\ y\end{matrix}}\right) },{\left( {\begin{matrix}c^{\prime}\\ d^{\prime}\end{matrix}}\right) },{\left( {\begin{matrix}c\\ d\end{matrix}}\right) }\right)\) over all columns of \(\mathbb {A}\) equals
Thus, Eq. (5) correctly scores the bialignment with general affine gap costs for both \(\mathbb {U}\) and \(\mathbb {V}\).
In order to derive a dynamic programming algorithm that solves the bialignment problem with this type of scoring function, we consider a decomposition of the search space in grammar form. The nonterminals \(A_{(p,q)}\) correspond to bialignment with end column type (p, q). The terminals are the 15 possible column types of a 4way alignment, which we write as \({\left( {\begin{matrix}p\\ q\end{matrix}}\right) }\), with \(p,q\ne {\left( {\begin{matrix}\\ \end{matrix}}\right) }\) as well as \({\left( {\begin{matrix}\\ q\end{matrix}}\right) }\) \({\left( {\begin{matrix}q\\ \end{matrix}}\right) }\) where the − in the latter is a shorthand for \({\left( {\begin{matrix}\\ \end{matrix}}\right) }\). In addition, we write \(\epsilon\) for the empty 4way column.
Lemma 4
The language of bialignments with fixed end column type is generated by the productions
Proof
Consider an alignment \(\mathbb {A}\) with last column (c, d) and end column type (p, q), and denote by \(\mathbb {A}^{\prime}\) the alignment without the last column. If \(c,d\ne {\left( {\begin{matrix}\\ \end{matrix}}\right) }\), i.e. the (mis)match case in \(\mathbb {W}\), then \(p=c\) and \(q=d\) and \(\mathbb {A}^{\prime}\) may have any endcolumn type. If \(c={\left( {\begin{matrix}\\ \end{matrix}}\right) }\), corresponding to the insertion case in \(\mathbb {W}\), \(\mathbb {A}\) inherits the first component c of its end column type from the previous alignment \(\mathbb {A}^{\prime}\). The other component is given by the second part of the last column, i.e. \(d=q\). Thus the second component of the end column type of \(\mathbb {A}^{\prime}\) is arbitrary. The case \(d={\left( {\begin{matrix}\\ \end{matrix}}\right) }\), deletion in \(\mathbb {W}\) analogously yields \(d=q\) and an end column type \((p^{\prime},q)\) for the \(\mathbb {A}^{\prime}\). \(\square\)
Note that this grammar would allow terminating with any end column type. This is undesirable since we would like the first column to be scored as it was preceded by a match column in both \(\mathbb {U}\) and \(\mathbb {V}\). This is easily implemented by an appropriate initialization for \(x=y=0\), however.
Definition 5
Let \(M_{p,q}(x,y)\) denote the optimal score of a 4way alignment with end column type (p, q).
In order to enforce that empty alignment is treated as having end column type \(\left({\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) },{\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }\right)\), we set \(M_{\left({\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) },{\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }\right)}(0,0)=0\) and \(M_{(c,d)}(0,0)= \infty\) for \((c,d)\ne \left({\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) },{\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }\right)\).
Theorem 6
The matrices \(M_{p,q}\) satisfy the recursion
Proof
We first note that every column of \(\mathbb {A}\) is either a (mis)match or an indel column w.r.t. \(\mathbb {W}\). These correspond to the first three alternative productions in Eq. (7), and cover all alternatives. Since \({\text{score}}\left({\left( {\begin{matrix}x\\ y\end{matrix}}\right) },{\left( {\begin{matrix}c^{\prime}\\ d^{\prime}\end{matrix}}\right) },{\left( {\begin{matrix}c\\ d\end{matrix}}\right) }\right)\) depends only on the current column and the end column type, we obtain the optimal score of an alignment \(\mathbb {A}\) with end column type (p, q) and last column (c, d) as the optimal score of an alignment \(\mathbb {A}^{\prime}\) with any of the matching column type plus the score \({\text{score}} \left({\left( {\begin{matrix}x\\ y\end{matrix}}\right) },{\left( {\begin{matrix}c^{\prime}\\ d^{\prime}\end{matrix}}\right) },{\left( {\begin{matrix}c\\ d\end{matrix}}\right) }\right)\) for the last column. The grammar in Eq. (7) specifies which end column types match. Furthermore, we note that, in the match case, the indices \((x^{\prime},y^{\prime})\) of the last column of the alignment to the left are given by \(xp\) and \(xq\), where (p, q) is gap pattern on the last column of \(\mathbb {A}\). Correspondingly we have \((x^{\prime},y^{\prime})=(xp,y)\) for the insertion case and \((x^{\prime},y^{\prime})=(x,y^{\prime}q)\) in the insertion case. Taken together, this established the correctness of the recursion. \(\square\)
As an immediate consequence we have
Corollary 7
The bialignment problem with affine gap cost models for the two constituent alignments can be solved in \(O(n^4)\) time and space.
Affine shift costs
While bialignment with affine gap cost and linear shift costs may be of the most obvious practical relevance, we also discuss two variations with affine shift costs. First of all, we clarify how to attribute affine shift cost in our bialignment scoring model.
Let’s take a step back to our original definition of the bialignment score (Eq. 3) and our previous suggestion to define the “shift” score component \(w(\mathbb {A})\) as \(\Delta s(\mathbb {A})\), i.e. as a multiple of \(s(\mathbb {A})\). Since the latter was defined as the number of gap columns in the alignments \(\mathbb {A}^{(\mathbf {aa})}\) and \(\mathbb {A}^{(\mathbf {bb})}\), this amounts to scoring shifts in a linear cost model, where every shift has a cost of \(\Delta\) per column.
For affine shift costs, we take the view that every consecutive run of gap symbols in the pairwise alignments of the two copies of \(\mathbf {a}\) and \(\mathbf {b}\) represents one shift. This shift is scored in the same way as gaps are scored under affine gap cost, i.e. based on the shift opening cost \(\Delta _o\) plus the shift extension cost \(\Delta\) times the length of the shift (number of shift columns).
We first consider affine shift cost and nonaffine (i.e. linear) gap cost. Since affine shifts are scored exactly in the same way as affine gaps, this situation is symmetric to the case of affine gap cost combined with linear shift cost. The corresponding bialignment problem can thus be solved efficiently by applying exactly the same idea as in our previous algorithm (Theorem 6), only now keeping track by p and q of the gap patterns in the respective alignments of rows \(1 \& 3\) and \(2 \& 4\). We immediately obtain
Corollary 8
The bialignment problem with affine shift cost models (and linear gap cost) can be solved in \(O(n^4)\) time and space.
Combining affine gap and shift costs
More remarkably, we can even solve the general case of affine gap cost and affine shift cost in polynomial time by dynamic programming. Essentially, we combine the ideas of the above two algorithms. Our algorithm follows a grammar with general decomposition
In order to evaluate affine gaps and affine shifts correctly at the same time, we need to know the last nongaponly gap patterns of all four pairwise alignments of rows \(1 \& 2\), \(1 \& 3\), \(2 \& 4\), and \(3 \& 4\); thus, we utilize nonterminals \(A_p\), for all p that encode the respective gap patterns \(p=(p_{12}, p_{13}, p_{24}, p_{34})\). By the same argument as before, we can show this information to be sufficient to score shifts and gaps correctly in affine cost models for every possible last column c.
One keeps track of the correct gap patterns for all of the relevant pairwise alignments by setting the entries of \(p^{\prime}\) as
for \(ij\in \{12,13,24,34\}\), depending on p and c in Eq. (9). For termination, we add the grammar rule:
for \(p^0:= \left({\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }, {\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }, {\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }, {\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }\right)\). This allows implicit accounting for gap and shift openings of respective gaps and shifts at the left end of alignment strings.
Remarks about generalizations and complexity Note that the existence of an efficient algorithm for general affine bialignment does not contradict the general hardness of multiple alignment with affine gap costs, even if it suggests the following generalization: Multiple (kway) alignment with affine gap costs can be computed by dynamic programming following the above idea of keeping track of the rightmost nongaponly gappatterns in all pairwise alignments. This requires considering \(\left( {\begin{array}{c}k\\ 2\end{array}}\right)\) many pairwise gap patterns, each out of three possibilities \({\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) },{\left( {\begin{matrix}\bullet \\ \end{matrix}}\right) },{\left( {\begin{matrix}\\ \bullet \end{matrix}}\right) }\). The resulting DPalgorithm for kway alignment thus needs exponentially many matrices in k.
In bialignments of two sequences, we need to consider only four gap patterns, two for the two alignments and two for the shifts between the sequence copies. That is, there are (at most) \(3^4=81\) combinations, which have to be represented by different matrices for the DP algorithm. This gets a little more practical, since many of these combinations cannot occur in valid bialignments. For example, having gap patterns \({\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }\) for both alignments of \(\mathbf {a}\) and \(\mathbf {b}\), rules out all patterns for the alignments of the copies that contradict having last columns \({\left( {\begin{matrix}\bullet \\ \bullet \\ \bullet \\ \bullet \end{matrix}}\right) }\),\({\left( {\begin{matrix}\bullet \\ \bullet \\  \\ \end{matrix}}\right) }\), or \({\left( {\begin{matrix} \\ \\ \bullet \\ \bullet \end{matrix}}\right) }\). Consequently, we find only 51 consistent gap pattern combinations, while we can proof 30 combinations inconsistent due to an analogous argument as sketched above.
Subadditive gap costs
The affine gap cost model, despite its algorithmic convenience, has been criticized because empirical gap length distributions usually are power laws thus suggesting a logarithmic gap costs [41]. However, gap costs of the form \(w(\ell )=a + b\ell + c\ln \ell\) seem to yield better alignments in practice [42]. Pairwise alignments with subadditive gap costs can be computed by dynamic programming, considering insertions and deletions of arbitrary length:
This idea does not seem to generalize to bialignments. It is possible, however, to generalize the end column type. Instead of only distinguishing \({\left( {\begin{matrix}1\\ 1\end{matrix}}\right) }\), \({\left( {\begin{matrix}1\\ 0\end{matrix}}\right) }\), \({\left( {\begin{matrix}0\\ 1\end{matrix}}\right) }\), we can make each of them length dependent. This allows us to write the end column types \(\langle p,\ell \rangle\), where \(\ell \ge 1\) is the length of the run of columns of type p at the end of the alignment. With this notation we can write
with initial condition \(M_{\langle p,0\rangle }(0)=0\). Here \(d(x,p,\ell )\) equals the match score s(x) for \(p={\left( {\begin{matrix}\bullet \\ \bullet \end{matrix}}\right) }\). For deletions, \(p={\left( {\begin{matrix}\bullet \\ \end{matrix}}\right) }\), we have \(d(x,p,\ell )= w(\mathbf {a}[x_1\ell +1..x_1]) w(\mathbf {a}[x_1\ell +1..x_11])\). The extensions of an insertion is scored by an analogous expression. The auxiliary entries \(M_{\langle p,0\rangle }(x)\) are used to correctly score alignments in which the last column is different from the previous end gap pattern. This recursion runs in cubic time, but also requires cubic space (instead of quadratic space). For our purposes, however, it has the advantage that the score is again defined columnwise albeit at the expense of having to keep track of a linear instead of a constant number of end gap types. It generalizes to a recursion with four indices to compute the optimal bialignment.
Computational results
We implemented the bialignment algorithm with affine gap cost (Corollary 7) in Python 3. For improved performance, we adapted timecritical parts of the code to the Python Cextension Cython with some carefully chosen static typing. The new implementation was based on our previous implementation for RNA bialignment with linear gap cost [12, 23]. Like the earlier version, it allows the user to limit the number of positions either sequence can be shifted to the left or right against its own structure by a constant \(\lambda\). The restricted recursions, following in essence the idea of [34, 35], have time complexity of \(O(n^2\lambda ^2)\) instead of the unrestricted, but often impractical complexity \(O(n^4)\). In addition to efficient bialignment with affine gap cost, new features have been added to the software:

1.
Protein sequences may be scored with an arbitrary, userdefined similarity matrix. The BLOSUM62 matrix [43, 44] is supplied as default.

2.
Protein secondary structures are scored using a simple bonus (here, 800) for matched secondary structure.

3.
The dynamic programming matrices are stored as sparse matrices to to limit space consumption to \(O(n^2\lambda ^2)\) (compared to \(O(n^4)\) space complexity of a hypothetical nonsparse implementation).

4.
In case of ambiguity, simpler shifts are preferred (For example the bialignment of Fig 1 has cooptima with shift events in both sequences or shift events that shift longer subsequences).

5.
Improved graphical output of bialignments. Figures 1 and 5 were produced using a Jupyter notebook that is included in the software distribution.

6.
A flexible command line and Python interface is included.
As a proof of concept we generated optimal bialignments of DNA Polymerase I from Escherichia (length 928) and Xanthomonas hortorum (length 933), while allowing shifts of sequence against structure by up to two positions to the left or to the right in either protein (\(\lambda =2\)). On an Intel(R) Core(TM) i710810U CPU, this (singlethreaded) computation took 37 min. Note that a simple banding strategy on insertions and deletions could dramatically speed up such computations, typically without sacrificing alignment quality. The analogous computation allowing only one shift positions (\(\lambda =1\)) was performed in 10.5 min. Due to filling 9 dynamic programming matrices and considering 15 recursion cases per entry, the same implementation still takes 26 s, if shifts are completely forbidden (\(\lambda =0\)).
Figure 5 shows the resulting bialignment for \(\lambda =2\). For comparison, the results from \(\lambda =1\) and \(\lambda =0\) are given in Additional file 1. We chose a rather moderate shift cost \(\Delta = 210\), compared to a bonus of 800 per structure match as well as gap extension and gap opening costs of \(50\) and \(200\), respectively. While we suspect that this parameter choice is too generous, it serves here to demonstrate that the algorithm readily predicts shifts that improve the compromise between primary and secondary structure alignment. The estimation of realistic shift costs is a nontrivial problem beyond the scope of this contribution.
Concluding remarks
We have shown here that bialignments with affine gap cost models for both constituent alignments and linear shift costs can be computed in quartic time by dynamic programming. Moreover, limiting the number of shifts to a constant reduces the cost to quadratic space and time. This makes the detection of locallyconfined shifts computationally feasible for sequences of with length of realistic proteins or mRNAs. While we have illustrated our algorithmic innovations here using amino acid sequences and protein secondary structures as an example, the algorithm and its implementation is applicable to any linear representation of monomerwise features along a biopolymer. In can be used, for instance, directly as an extension of the lineargapcost bialignments of RNAs described in [12].
We have focused here on the analysis of optimization problem and development of the algorithm. In addition to cost models for the constituent alignments \(\mathbb {U}\) and \(\mathbb {V}\), a bialignment problem also requires the specification of the shift costs, i.e. the scoring model for \(\mathbb {W}\). Even though the scoring systems for \(\mathbb {U}\) and \(\mathbb {V}\) are borrowed from other studies, the choice of appropriate shift parameters remains an open problem for future work. This is a difficult problem for two reasons: (i) There is, at present, no collection of test cases with known shifts of sequences versus secondary structure for either proteins or RNAs that could be used to optimize the parameters. (ii) A biologically sound survey of proteins should presumably use a more elaborate scoring model for secondary structure elements that distinguishes amino acid positions depending on the distance from the element’s ends. It stands to reason that the choice of the scoring model for the secondary structures would substantially influence estimates of the shift costs. Here, we are therefore content with a solution of algorithmic issues and a reference implementation. This provides the necessary tools for an indepth empirical study of incongruent evolution of protein secondary structures in the future.
The formal framework of bialignments, Eq. (3), is much more general than the positionwise scoring models corresponding to regular multitape grammars. These were studied here because the corresponding optimization problems can be solved exactly by means of relatively simple dynamic programming algorithms. In a more general setting, one may want to consider \(\mathbb {V}\) as an alignment of contact structures [45] or as an alignment of ordered sequences of 3D points, e.g. scored in terms of euclidean distances [46, 47]. This is of increasing practical interest as recent advances in protein folding [48, 49] provide access to high quality 3D structure predictions. The availability of accurately predicted protein structures of course also yields secondary structures, e.g. with the help of DSSP [50], which could be used for a systematic survey of incongruences in protein secondary structures. Alternatively, it seems promising to modify existing solutions to the protein structure alignment problems [51] to the corresponding bialignment problems. It is not obvious whether such a joint sequence and structure alignment problem implicitly contains a sequencetostructure threading problem, which is known to be NPcomplete [52]. In another forthcoming study, we are considering the corresponding problem for RNA secondary structures. In this case, the bialignment problem is amenable to a DP approach related to Sankoff’s algorithm for the simultaneous folding and alignment of RNAs [53].
In [12] we further generalized bialignments to polyalignments comprising \(k>2\) pairwise alignments \(\mathbb {U}^{(i)}\), \(1\le i\le k\ge 2\) that are connected by a kway alignment \(\mathbb {W}\). Each of the alignments \(\mathbb {U}^{(i)}\) then describes one particular aspect of the sequence. In addition to the individual amino acids and secondary structure elements, these may represent comparisons of profiles of physicochemical parameters. It is not difficult to see that the grammar Eq. (7) generalizes to this case by defining end gap types \((p_1,p_2,\dots p_k)\) with \(p_i\ne {\left( {\begin{matrix}\\ \end{matrix}}\right) }\). The corresponding grammar then needs to consider all \(2^k\) gap patterns for the last column of the kway alignment \(\mathbb {W}\). Optimal polyalignments comprising k pairwise alignments with affine gap costs and additive cost contributions for the shifts between each pair of constituent alignments thus can be computed exactly in \(O(n^{2k})\) space and time. Complementarily, one may consider alignments \(\mathbb {U}\) and \(\mathbb {V}\) of more than two sequences and their corresponding structures. The scoring of \(\mathbb {W}\) then must accommodate more complex shift patterns, whose total number again increases exponentially in k. It is unlikely, therefore, that exact dynamic programming algorithms for these generalized problems will be practical. This begs the question whether polyalignment problems can be approximated e.g. by progressive alignment schemes in a manner that is satisfactory from an applications point of view.
Availability of data and materials
Implementations of the algorithms used in this contribution are available as free software from https://github.com/swill/BiAlign/releases/tag/v0.3. For easy installation, we provide packages bialign on the Conda channel bioconda and the Python Package Index PyPI, respectively.
References
Wagner GP. Homology, genes, and evolutionary innovation. Princeton: Princeton Univ. Press; 2014.
Hofacker IL, Fekete M, Stadler PF. Secondary structure prediction for aligned RNA sequences. J Mol Biol. 2002;319:1059–66. https://doi.org/10.1016/S00222836(02)00308X.
Marks DS, Hopf TA, Sander C. Protein structure prediction from sequence variation. Nat Biotechnol. 2012;30:1072–80. https://doi.org/10.1038/nbt.2419.
Chapman MA, Donaldson IJ, Gilbert J, Grafham D, Rogers J, Green AR, Göttgens B. Analysis of multiple genomic sequence alignments: a web resource, online tools, and lessons learned from analysis of mammalian SCL loci. Genome Res. 2004;14:313–8. https://doi.org/10.1101/gr.1759004.
Hiller M, Findeiß S, Lein S, Marz M, Nickel C, Rose D, Schulz C, Backofen R, Prohaska SJ, Reuter G, Stadler PF. Conserved introns reveal novel transcripts in Drosophila melanogaster. Genome Res. 2009;19:1289–300. https://doi.org/10.1101/gr.090050.108.
Stoltzfus A, Logsdon JM Jr, Palmer JD, Ford DW. Intron “sliding’’ and the diversity of intron positions. Proc Natl Acad Sci USA. 1997;94:10739–44. https://doi.org/10.1073/pnas.94.20.10739.
Lehmann J, Eisenhardt C, Stadler PF, Krauss V. Some novel intron positions in conserved Drosophila genes are caused by intron sliding or tandem duplications. BMC Evol Biol. 2010;10:156. https://doi.org/10.1186/1471214810156.
Bocco S, Csűrös M. Splice sites seldom slide: intron evolution in oomycetes. Genome Biol Evol. 2016;8:2340–50. https://doi.org/10.1093/gbe/evw157.
Fekete E, Flipphi M, Ág N, Kavalecz N, Cerqueira G, Scazzocchio C, Karaffa L. A mechanism for a single nucleotide intron shift. Nucleic Acids Res. 2017;45:9085–92. https://doi.org/10.1093/nar/gkx520.
Hare EE, Peterson BK, Iyer VN, Meier R, Eisen MB. Sepsid evenskipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet. 2008;4:1000106. https://doi.org/10.1371/journal.pgen.1000106.
Flamm C, Fontana W, Hofacker I, Schuster P. RNA folding kinetics at elementary step resolution. RNA. 2000;6:325–38. https://doi.org/10.1017/s1355838200992161.
Waldl M, Will S, Wolfinger MT, Hofacker IL, Stadler PF. Bialignments as models of incongruent evolution of RNA sequence and secondary structure. In: Cazzaniga P, Besozzi D, Merelli I, Manzoni L, editors. Computational intelligence methods for bioinformatics and biostatistics (CIBB 2019)q, vol. 12313. Lecture notes in computer science. Cham: Springer; 2020. p. 159–70. https://doi.org/10.1007/9783030630614_15.
Chou PY, Fasman GD. Prediction of protein conformation. Biochemistry. 1974;13:222–45. https://doi.org/10.1021/bi00699a002.
Ashok Kumar T. CFSSP: Chou and Fasman secondary structure prediction server. Wide Spectr Res J. 2013;1:15–9. https://doi.org/10.5281/zenodo.50733.
Bart AG, Harris KL, Gillam EMJ, Scott EE. Structure of an ancestral mammalian family 1B1 cytochrome P450 with increased thermostability. J Biol Chem. 2020;295:5640–53. https://doi.org/10.1074/jbc.RA119.010727.
Dong M, Ladavière L, Penin F, Deléage G, Baggetto LG. Secondary structure of Pglycoprotein investigated by circular dichroism and amino acid sequence analysis. Biochim Biophys Acta Biomembr. 1998;1371:317–34. https://doi.org/10.1016/S00052736(98)000327.
Schuster P, Fontana W, Stadler PF, Hofacker IL. From sequences to shapes and back: a case study in RNA secondary structures. Proc R Soc Lond B. 1994;255:279–84. https://doi.org/10.1098/rspb.1994.0040.
Babajide A, Hofacker IL, Sippl MJ, Stadler PF. Neutral networks in protein space: a computational study based on knowledgebased potentials of mean force. Fold Des. 1997;2:261–9. https://doi.org/10.1016/S13590278(97)000370.
BornbergBauer E. How are model protein structures distributed in sequence space? Biophys J. 1997;73:2393–403. https://doi.org/10.1016/S00063495(97)782687.
Kabsch W, Sander C. On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations. Proc Natl Acad Sci USA. 1984;81:1075–8. https://doi.org/10.1073/pnas.81.4.1075.
Schultes EA, Bartel DP. One sequence, two ribozymes: implications for the emergence of new ribozyme folds. Science. 2000;289(5478):448–52. https://doi.org/10.1126/science.289.5478.448.
Alexander PA, He Y, Chen Y, Orban J, Bryan PN. A minimal sequence code for switching protein structure and function. Proc Natl Acad Sci USA. 2009;106:21149–54. https://doi.org/10.1073/pnas.0906408106.
Waldl M, Will S, Wolfinger ML, Hofacker IL, Stadler PF. Bialignments as models of incongruent evolution and RNA sequence and structure. In: Cazzaniga P, Besozzi D, Merelli I, editors. CIBB’19 proceedings. 2019. p. 6. https://doi.org/10.1101/631606.
Sankoff D. Minimal mutation trees of sequences. SIAM J Appl Math. 1975;28:35–42. https://doi.org/10.1137/0128004.
Sankoff D. The early introduction of dynamic programming into computational biology. Bioinformatics. 2000;16:41–7. https://doi.org/10.1093/bioinformatics/16.1.41.
Höner zu Siederdissen C, Hofacker IL, Stadler PF. Product grammars for alignment and folding. IEEE/ACM Trans Comp Biol Bioinf. 2015;12:507–19. https://doi.org/10.1109/TCBB.2014.2326155.
Setubal JC, Meidanis J. Introduction to computational molecular biology. Boston: PWS Publishing Co.; 1997.
Retzlaff N, Stadler PF. Partially local multiway alignments. Math Comp Sci. 2018;12:207–34. https://doi.org/10.1007/s1178601803384.
Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–53. https://doi.org/10.1016/00222836(70)900574.
Vingron M, Waterman MS. Sequence alignment and penalty choice: review of concepts, case studies and implications. J Mol Biol. 1994;235:1–12. https://doi.org/10.1016/S00222836(05)800063.
Waterman MS, Smith TF, Beyer WA. Some biological sequence metrics. Adv Math. 1976;20:367–87. https://doi.org/10.1016/00018708(76)902024.
Dewey TG. A sequence alignment algorithm with an arbitrary gap penalty function. J Comp Biol. 2001;8:177–90. https://doi.org/10.1089/106652701300312931.
Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982;162:705–8. https://doi.org/10.1016/00222836(82)903989.
Carrillo H, Lipman D. The multiple sequence alignment problem in biology. SIAM J Appl Math. 1988;48:1073–82. https://doi.org/10.1137/0148063.
Lipman DJ, Altschul SF, Kececioglu JD. A tool for multiple sequence alignment. Proc Natl Acad Sci USA. 1989;86:4412–5. https://doi.org/10.1073/pnas.86.12.4412.
Kececioglu J, Starrett D. Aligning alignments exactly. In: Bourne PE, Gusfield D, editors. Proceedings of the 8th ACM conference on research in computational molecular biology (RECOMB). New York: ACM; 2004. p. 85–96. https://doi.org/10.1145/974614.974626.
Gotoh O. Alignment of three biological sequences with an efficient traceback procedure. J Theor Biol. 1986;121:327–37. https://doi.org/10.1016/S00225193(86)801126.
Konagurthu AS, Whisstock J, Stuckey PJ. Progressive multiple alignment using sequence triplet optimization and threeresidue exchange costs. J Bioinf Comp Biol. 2004;2:719–45. https://doi.org/10.1142/S0219720004000831.
Kruspe M, Stadler PF. Progressive multiple sequence alignments from triplets. BMC Bioinform. 2007;8:254. https://doi.org/10.1186/147121058254.
Berkemer SJ, Höner zu Siederdissen C, Stadler PF. Compositional properties of alignments. Math Comp Sci. 2021;15:609–30. https://doi.org/10.1007/s11786020004968.
Gonnet GH, Cohen MA, Benner SA. Exhaustive matching of the entire protein sequence database. Science. 1992;256:1443–5. https://doi.org/10.1126/science.1604319.
Cartwright RA. Logarithmic gap costs decrease alignment accuracy. BMC Bioinform. 2006;7:527. https://doi.org/10.1186/147121057527.
Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol. 2004;22:1035–6. https://doi.org/10.1038/nbt08041035.
Styczynski MP, Jensen KL, Rigoutsos I, Stephanopoulos G. BLOSUM62 miscalculations improve search performance. Nat Biotechnol. 2008;26:274–5. https://doi.org/10.1038/nbt0308274.
Stadler PF. Alignments of biomolecular contact maps. Interface Focus. 2021;11:20200066. https://doi.org/10.1098/rsfs.2020.0066.
Poleksic A. Algorithms for optimal protein structure alignment. Bioinformatics. 2009;25:2751–6. https://doi.org/10.1093/bioinformatics/btp530.
Li SC. The difficulty of protein structure alignment under the RMSD. Algorithms Mol Biol. 2013;8:1. https://doi.org/10.1186/1748718881.
...Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, RomeraParedes B, Nikolov S, Jain R, Adler J, Beck T, Petersen S, Reimann D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature. 2021. https://doi.org/10.1038/s41586021038192.
...Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong QC, Kinch LN, Schaeffer RD, Millán C, Park HP, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, PavkovKeller T, Rathinaswamy MK, Dalwadi U, Yip CKY, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D. Accurate prediction of protein structures and interactions using a threetrack neural network. Science. 2021;373:871–6. https://doi.org/10.1126/science.abj8754.
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogenbonded and geometrical features. Biopolymers. 1983;22:2577–637. https://doi.org/10.1002/bip.360221211.
Daniluk P, Lesyng B. Theoretical and computational aspects of protein structural alignment. 2014;1:557–98. https://doi.org/10.1007/9783642285547_17.
Lathrop RH. The protein threading problem with sequence amino acid interaction preferences is NPcomplete. Protein Eng Des Sel. 1994;7:1059–68. https://doi.org/10.1093/protein/7.9.1059.
Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J Appl Math. 1985;45:810–25. https://doi.org/10.1137/0145048.
Acknowledgements
We thank Christian Höner zu Siederdissen for stimulating discussions and Verena Bender and Leonie Preker for preliminary work as part of the course Advanced Methods in Bioinformatics in 2020.
Funding
Open Access funding enabled and organized by Projekt DEAL. This work was supported in part by the German Research Foundation (DFG) as part of the Collaborative Research Centre 1423 “Structural Dynamics of GPCR Activation and Signal Transduction” (SFB 1423/1 421152132).
Author information
Authors and Affiliations
Contributions
Both authors contributed to deriving the mathematical results, the interpretation of results and the writing of the manuscript. SW implemented the algorithms. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Bialignments with different choices of λ. Bialignments of the same data as in Fig. 5 using a more restrictive value (λ = 1) and a shiftfree alignment (λ = 0). The latter corresponds to regular protein alignment with scores augmented by (mis)matches of the predicted secondary structure.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Stadler, P.F., Will, S. Bialignments with affine gaps costs. Algorithms Mol Biol 17, 10 (2022). https://doi.org/10.1186/s13015022002197
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13015022002197