Research  Open  Published:
Efficient edit distance with duplications and contractions
Algorithms for Molecular Biologyvolume 8, Article number: 27 (2013)
Abstract
Abstract
We propose three algorithms for string edit distance with duplications and contractions. These include an efficient general algorithm and two improvements which apply under certain constraints on the cost function. The new algorithms solve a more general problem variant and obtain better time complexities with respect to previous algorithms. Our general algorithm is based on minplus multiplication of square matrices and has time and space complexities of O (ΣMP (n)) and O (Σn^{2}), respectively, where Σ is the alphabet size, n is the length of the strings, and MP (n) is the time bound for the computation of minplus matrix multiplication of two n × n matrices (currently,
$\mathit{\text{MP}}\left(n\right)=O\left(\frac{{n}^{3}\stackrel{3}{\text{log}}\text{log}n}{\stackrel{2}{\text{log}}n}\right)$ due to an algorithm by Chan).
For integer cost functions, the running time is further improved to $O\left(\frac{\Sigma {n}^{3}}{\stackrel{2}{\text{log}}n}\right)$. In addition, this variant of the algorithm is online, in the sense that the input strings may be given letter by letter, and its time complexity bounds the processing time of the first n given letters. This acceleration is based on our efficient matrixvector minplus multiplication algorithm, intended for matrices and vectors for which differences between adjacent entries are from a finite integer interval D. Choosing a constant $\frac{1}{\underset{\leftD\right}{\text{log}}n}<\lambda <1$, the algorithm preprocesses an n × n matrix in $O\left(\frac{{n}^{2+\lambda}}{\leftD\right}\right)$ time and $O\left(\frac{{n}^{2+\lambda}}{\leftD\right{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}n}\right)$ space. Then, it may multiply the matrix with any given nlength vector in $O\left(\frac{{n}^{2}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}n}\right)$ time. Under some discreteness assumptions, this matrixvector minplus multiplication algorithm applies to several problems from the domains of contextfree grammar parsing and RNA folding and, in particular, implies the asymptotically fastest $O\left(\frac{{n}^{3}}{\stackrel{2}{\text{log}}n}\right)$ time algorithm for singlestrand RNA folding with discrete cost functions.
Finally, assuming a different constraint on the cost function, we present another version of the algorithm that exploits the runlength encoding of the strings and runs in $O\left(\frac{\Sigma \mathit{\text{nMP}}(\xf1)}{\xf1}\right)$ time and $O\left(\right\Sigma \left\mathrm{n\xf1}\right)$ space, where $\xf1$ is the length of the runlength encoding of the strings.
Background
Comparing strings is a wellstudied problem in computer science as well as in bioinformatics. Traditionally, string similarity is measured in terms of edit distance, which reflects the minimumcost edit of one string to the other, based on the edit operations of substitutions (including matches) and deletions/insertions (indels). In this paper, we address the problem of string edit distance with the additional operations of duplication and contraction. The motivation for this problem originated from the study of minisatellites and their comparisons in the context of population genetics [1].
Motivation: minisatellite comparison
A minisatellite is a section of DNA that consists of tandem repetitions of short (6–100 nucleotides) sequence motifs spanning 500 nucleotides to several thousand nucleotides. The repeated motifs also vary in sequence through base substitutions and indels. For one minisatellite locus, both the type and the number of motifs vary between individuals in a population. Therefore, pairwise comparisons of minisatellites are typically applied in studying the evolution of populations.
A minisatellite map represents a minisatellite region, where each motif is encoded by a letter and handled as one entity (denoted unit). When comparing minisatellite maps, one has to consider that regions of the map have arisen as a result of duplication events from the neighboring units. The single copy duplication model, where only one unit can duplicate at a time, is the most popular and its biological validation was asserted for the MSY1 minisatellites [1, 2]. According to this model, one unit can mutate to another unit via an indel or a mutation of a single nucleotide within it. Also, a unit can be duplicated, that is, an additional copy of the unit may appear next to the original one in the map (tandem repeat). Thus, when comparing minisatellite maps, two additional operations are considered: unit duplication and unit contraction.
The EDDC problem definition
The single copy duplication model of minisatellite maps gave rise to a new variant of the string edit distance problem, Edit Distance with Duplications and Contractions (EDDC), which allows five edit operations: insertion, deletion, mutation, duplication and contraction.
We start with some string notations. Let s be a string. Denote by s_{ i } the ith letter in s, starting at index 0, and by s_{i,j} the substring s_{ i }s_{i+ 1} … s_{j 1} of s. A substring of the form s_{i,i} is an empty string, which will be denoted by ε. We use superscripts to denote substrings without an explicit indication of their start and end positions, and write e.g. s = s^{a}s^{b} to indicate that s is a concatenation of the two substrings s^{a} and s^{b}.
In the edit distance problem, one is given a source string s and a target string t over a finite alphabet Σ. An edit script from s to t is a series of strings $\mathcal{E}\mathcal{S}=\u3008s={u}^{0},{u}^{1},{u}^{2},\dots ,{u}^{r}=t\u3009$, where each intermediate string u^{i} is obtained by applying a single edit operation to the preceding string u^{i1}. In the standard problem definition [3–5], the allowed edit operations are insertion of a letter at some position in an intermediate string u^{i}, deletion of a letter in u^{i}, and mutation of a letter in u^{i} to another letter. The singlecopy EDDC problem variant adds two operations: duplication  inserting into u^{i} a letter in a position adjacent to a position that already contains the same letter, and contraction  deleting from u^{i} one copy of a letter where there are two consecutive copies of this letter. Denote by ins (α),dup (α) and del (α) costs associated with the insertion, duplication and deletion operations applied to a letter α in the alphabet, respectively, by cont (α) the cost of contracting two consecutive occurrences of α into a single occurrence, and by mut (α,β) the cost of mutating α to a letter β. Define the cost of $\mathcal{E}\mathcal{S}$ to be the summation of its implied operation costs, and the length$\left\mathcal{E}\mathcal{S}\right=r$ of $\mathcal{E}\mathcal{S}$ to be the number of operations performed in $\mathcal{E}\mathcal{S}$. Clearly, for every pair of strings s and t, there is some script transforming s to t, e.g. the script that first deletes all letters in s and then inserts all letters in t. An optimal edit script from s to t is one which has a minimum cost. The edit distance from s to t, denoted by ed (s,t), is the cost of an optimal edit script from s to t. The goal of the EDDC problem is, given strings s and t, to compute e d (s,t).
Previous algorithms assume various constraints on operation costs (see Section “A comparison with previous works”). In this paper, the only limiting assumption made is that all operation costs are nonnegative. In addition, we can make the following assumption without loss of generality, which will be required by the algorithms presented in this paper:
Property 1.
It may be assumed without loss of generality that for every α,β ∈ Σ,

ins (α) = ed (ε,α), del (α) = ed (α,ε),

dup (α) = ed (α,α α), cont (α) = ed (α α,α),

mut (α,β) = ed (α,β).
This assumption can be made, since in case one of the operation costs violates the assumption, then such an operation can always be replaced by a series of operations that would induce the same modification at lower cost. For example, it cannot be that mut (α,β) < ed (α,β), since ed (α,β) is smaller then or equal to the cost of any script from α to β, among which is the script containing the single mutation operation of α into β. Moreover, if mut (α,β) > ed (α,β), then we can always replace any mutation of α into β by a series of operations that transform α into β at cost ed (α,β). In this case, we may simply assume that mut (α,β) = ed (α,β), and interpret any such a mutation appearing in a script as being implem ented by the corresponding series of operations. In particular, Property 1 implies that mut (α,α) = 0 (since all operation costs are nonnegative, ed (w,w) = 0 for every string w), dup (α) ≤ ins (α) (since the cost of the script from α to α α that applies a single insertion of α is ins (α) ≥ ed (α,α α) = dup (α)), cont (α) ≤ del (α), and mut (α,β) ≤ mut (α,γ) + mut (γ,β) for every γ ∈ Σ.
Insertions and duplications are considered to be generating operations, increasing by one letter the length of the string. Similarly, deletions and contractions are considered to be reducing operations, decreasing by one letter the length of the string. An edit script containing no reducing operation is called a nonreducing script, and an edit script containing no generating operation is called a nongenerating script.
Previous work
The EDDC problem was first defined by Bérard and Rivals [2], who suggested an O (n^{4}) time and O (n^{3}) space algorithm for the problem, where n is the length of the two input strings (for the sake of simplicity, we assume that both strings are of the same length). This was followed by the work of Behzadi and Steyaert [6], who gave an O (Σn^{3}) time and O (Σn^{2}) space algorithm for the problem, where Σ is the alphabet size (typically a few tens of unique units). Behzadi and Steyaert [7] improved their algorithms’ complexity, based on runlength encoding, to $O({n}^{2}+n{\xf1}^{2}+\Sigma {\xf1}^{3}+\Sigma {}^{2}{\xf1}^{2})$ time and $O\left(\Sigma (n+{\xf1}^{2})+{n}^{2}\right)$ space, where $\xf1$ is the length of the runlength encoding of the input strings. Runlength encoding was also used by Bérard et al. [8], who proposed an $O({n}^{3}+\Sigma \left{\xf1}^{3}\right)$ time and $O({n}^{2}+\Sigma \left{\xf1}^{2}\right)$ space algorithm. Abouelhoda et al. [9] gave an algorithm with an alphabet size independent time and space complexities of $O({n}^{2}+n{\xf1}^{2})$ and O(n^{2}), respectively. A detailed comparison between the different problem models appears in Section “A comparison with previous works”.
Our contribution
This paper presents several algorithms for EDDC which are currently the most general and efficient for the problem.

1.
We give an algorithm for EDDC for general nonnegative cost functions that is based on minplus square matrix multiplication. This algorithm is an adaptation of the framework of [10] (see also [11]). For two input strings over an alphabet Σ and of length n each, the time and space complexities of this algorithm are O (ΣMP (n)) and O (Σn ^{2}), respectively, where MP (n) is the time complexity of a minplus multiplication of two n × n matrices. Using the matrix multiplication algorithm of Chan [12], this algorithm runs in $O\left(\frac{\Sigma {n}^{3}\stackrel{3}{\text{log}}\text{log}n}{\stackrel{2}{\text{log}}n}\right)$ time (Section “A matrix multiplication based algorithm for EDDC”). Moreover, our algorithm applies less restrictions on the cost function with respect to previous algorithms and is currently the only algorithm that works for the most general problem settings (Section “A comparison with previous works”).

2.
We describe a more efficient algorithm for EDDC when all operation costs are integers. This algorithm can also be applied in an online setting, where in each step a letter is added to one of the input strings. The time complexity of processing n letters in the input is $O\left(\frac{\Sigma {n}^{3}}{\stackrel{2}{\text{log}}n}\right)$, where the base of the log function is determined by the range of cost values (Section “An online algorithm for EDDC using minplus matrixvector multiplication for discrete cost functions”). In order to achieve this, we obtained the following steppingstone results, which are of interest on their own.

(a)
Let A be an n × m matrix for which differences between adjacent entries are within some finite integer interval D. Choosing a time/space complexity tradeoff parameter λ, where $\frac{1}{\underset{\leftD\right}{\text{log}}(n+m)}<\lambda <1$, we describe a preprocessing algorithm for A that runs in $O\left(\frac{\mathit{\text{nm}}{(n+m)}^{\lambda}}{\leftD\right}\right)$ time and requires $O\left(\frac{\mathit{\text{nm}}{(n+m)}^{\lambda}}{\leftD\right{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}(n+m)}\right)$ space. This preprocessing allows later to compute minplus multiplications between A and mlength vectors sustaining the same discreteness requirement in $O\left(\frac{\mathit{\text{nm}}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}(n+m)}\right)$ time (Section “An efficient Ddiscrete minplus matrixvector multiplication algorithm”). The algorithm is an adaptation of Williams’ matrixvector multiplication algorithm over a finite semiring [13], with some notions similar to those in Frid and Gusfield’s RNA folding algorithm [14].

(a)
The manner in which the new matrixvector multiplication algorithm is integrated into the EDDC algorithm can be generalized to algorithms for a family of related problems, denoted VMT problems [11], under certain discreteness assumptions. This family includes many important problems from the domains of RNA folding and CFG parsing. An example of such a problem is the single strand RNA folding problem [15] under discrete scoring schemes. Our new matrixvector multiplication algorithm can be integrated into an algorithm for the latter problem to yield an $O\left(\frac{{n}^{3}}{\stackrel{2}{\text{log}}n}\right)$ time algorithm, improving the best previously known asymptotic time bound for the problem (see Section “Online VMT algorithms”).

(a)

3.
We extend our approach to exploit runlength encodings of the input strings, assuming some restrictions on the cost functions. This reduces the time and space complexities of the algorithm to $O\left(\Sigma {n}^{2}+\frac{\Sigma \mathit{\text{nMP}}(\xf1)}{\xf1}\right)$ and $O\left(\Sigma \mathrm{n\xf1}\right)$, respectively, where $\xf1$ is the length of the runlength encoding of the input (Section “Additional acceleration using runlength encoding”).
The rest of the paper is organized as follows. In Section “A baseline algorithm for the EDDC problem”, a recursive computation for EDDC and its implementation using dynamic programming (DP) is presented. Section “A matrix multiplication based algorithm for EDDC” shows how to accelerate the algorithm by incorporating efficient minplus matrix multiplication subroutines. In Section “An online algorithm for EDDC using minplus matrixvector multiplication for discrete cost functions”, an efficient minplus matrixvector multiplication algorithm is described for matrices and vectors which differences between adjacent entries are taken from a finite integer interval. This algorithm can be used for obtaining an accelerated online version of the EDDC algorithm, as well as for improving time complexities of several related problems. Section “Additional acceleration using runlength encoding” describes a variant of the EDDC algorithm that exploits runlength encoding. Comparison between this and previous works is given in Section “A comparison with previous works”, and Section “Conclusions and discussion” gives a concluding discussion. Additional proofs omitted from the main manuscript are given in the Appendix.
A baseline algorithm for the EDDC problem
In this section, we give a simple algorithm for the EDDC problem. We start by showing some recursive properties of the problem, and then formulate a straightforward dynamic programming implementation for the recursive computation.
The recurrence formula
Our recursive formulas resemble previous formulations [6, 9], yet solve a slightly more general variant of the problem (see discussion in Section “A comparison with previous works”). Since the proof of correctness of these recursive formulas is similar to previous ones, we defer it to Appendix “Correctness of the recursive computation”.
A (strict) partition of a string w of length at least 2 is a pair of strings (w^{a},w^{b}), such that w = w^{a}w^{b}and w^{a},w^{b}≠ ε. Denote by P (w) the set of all partitions of w. For example, for w = abac , P(w) = {(a,bac),(ab,ac),(aba,c)}.
For a source string which is either empty or contains a single letter and a target string t, Equations 1 to 3 (Figure 1) describe a recursive EDDC computation. This computation interleaves, in a mutually recursive manner, the computation of an additional special value ed^{′} (α,t), where ed^{′} (α,t) is defined to be the minimum cost of a nonreducing edit script from α to t that does not start with a mutation (t is required to contain at least two letters).
Symmetrically, Equations 4 to 6 give the recursive computation for a source string s and a target string which is either empty or contains a single letter. Here, ed^{′} (s,α) is defined as the minimum cost of a nongenerating edit script from s to α which does not end with a mutation (s is required to contain at least two letters).
In case both source string s and target string t are of length at least 2, the following equation can be used for computing ed (s,t) (Figure 2(a)):
For allowing efficient computation, Equation 7 can be replaced by Equations 8 and 9, which are computed in a mutually recursive manner to yield an equivalent computation (Figure 2(b) and Figure 2(c), respectively).
All basecases of the above recursive equations are implied from Property 1 in a straightforward manner.
Theorem 1.
Theorem 1. EDDC is correctly solved by Equations 19.
The proof of Theorem 1 appears in Appendix “Correctness of the recursive computation”.
A baseline dynamicprogramming algorithm for EDDC
In this section, we describe a DP algorithm implementing the recursive EDDC computation given by Equations 1 to 9, which is the basis for improvements introduced later in this paper.
Let s and t be the input source and target strings, respectively, and for simplicity assume both strings are of length n. The algorithm maintains the following matrices for storing solutions to subinstances of the input which occur along the recursive computation. All matrices are of size (n + 1) × (n + 1), with row and column indices in the range 0,1,2,…,n.

For every α ∈ Σ, the algorithm maintains matrices S^{′ α}, S^{α}, T^{′α}, and T^{α}. Entries S^{′α}[ k,i], S^{α}[ k,i], T^{′α}[ l,j] and T^{α}[ l,j] are used for storing the values ed^{′} (s_{k,i},α), ed(s_{k,i},α), ed^{′} (α,t_{l,j}), and ed (α,t_{l,j}), respectively.

Two matrices S^{ε}and T^{ε}, whose entries S^{ε}[ k,i] and T^{ε}[ l,j] are used for storing values of the forms ed (s_{k,i},ε) and ed (ε,t_{l,j}), respectively.

For every α ∈ Σ, a matrix EDT^{α}whose entries EDT^{α}[ i,j] are used for storing the values edt^{α}(s_{0,i},t_{0,j}).

A matrix ED, whose entries ED [ i,j] are used for storing the values ed (s_{0,i},t_{0,j}).
The algorithm consists of two stages: Stage 1 computes solutions to all subinstances in which one of the substrings is either empty or singlelettered, applying Equations 1 to 6. Stage 2 uses the values computed in Stage 1 in order to compute all prefixtoprefix solutions ed (s_{0,i},t_{0,j}) and edt^{α}(s_{0,i},t_{0,j}) according to Equations 8 and 9. In particular, Stage 2 computes the edit distance ed (s_{0,n},t_{0,n}) = ed (s,t) between the two complete strings. The entries are traversed in an order which guarantees that upon computing each entry, all solutions to subinstances appearing on the righthand side of the relevant equation are already computed and contained in the corresponding entries. Algorithm 1 gives the pseudocode for this computation.
Algorithm 1 BASELINEEDDC( s , t )
Complexity analysis of Algorithm 1
The running time of Algorithm 1 is dictated by the total time required to compute all entries in the DP matrices. Each entry is computed according to one of the recursive equations, where the number of operations in such a computation depends on the number of expressions examined on the righthand side of the corresponding recursive equation. Note that the value of each examined expression is obtained in a constant time, by querying previously computed values stored in the matrices.
The computation of each entry in matrices T^{ε}and S^{ε}and in matrices of the form T^{α}and S^{α}takes O (Σ) time, due to Equations 1, 4, 2, and 5, respectively. As there are O (Σ) such matrices and each matrix contains O (n^{2}) entries, their overall computation time is O (Σ^{2}n^{2}). The computation of entries in T^{′α}and S^{′α}take O (n) time, due to Equations 3 and 6, respectively. There are O (Σ) such matrices, each of size O (n^{2}), and so the total time for computing all entries in these matrices is O (Σn^{3}). According to Equation 9, computing each entry of the form EDT^{α}[ i,j] takes O (n) time, and as there are O (Σn^{2}) such entries the total time for computing all these entries is O (Σn^{3}). According to Equation 8, computing each entry of the form ED [ i,j] takes O (Σn) time, and since there are O (n^{2}) such entries, the total time for computing all these entries is again O (Σn^{3}). Thus, the total running time of the algorithm is O (Σn^{3} + Σ^{2}n^{2}). Under the assumption that Σ = O (n), the time is O (Σn^{3}). The algorithm requires O (Σn^{2}) space for maintaining the DP matrices.
A matrix multiplication based algorithm for EDDC
In previous work by the authors [11], Vector Multiplication Templates (VMTs) were identified as templates for computational problems sustaining certain properties, such that algorithms for solving these problems can be accelerated using efficient matrix multiplication subroutines (similarly to Valiant’s algorithm for CFG recognition [10]). Intuitively, standard algorithms for VMT problems perform computations that can be expressed in terms of vector multiplications, and these computations can be computed and combined more efficiently using efficient matrix multiplications. In this section, we show that EDDC exhibits such VMT properties, and formulate a new algorithm that incorporates matrixmatrix minplus multiplications. This algorithm yields a better running time than that of the baseline algorithm in the previous section.
Notations for matrices
For two integers p,q such that p ≤ q, I_{p,q} denotes the interval of integers I_{p,q} = [ p,p + 1,…,q  1]. We use the notation A_{n × m} to imply that the matrix A has n rows and m columns, and say that A has the dimensions n × m (rows and column indices start at 0). For a subset of row indices I and a subset of column indices J, denote by I × J the region which contains all pairs of indices (i,j), such that i ∈ I and j ∈ J. Define A [ I,J] to be the submatrix of A, which is induced by all entries in the region I × J. When I contains a single row i or J contains a single column j, we simplify the notation and write A [ i,J] or A [ I,j], respectively.
Define the following operations on matrices. Let tr (·) denote the transpose operation for matrices. For a set of matrices $\mathcal{A}=\left\{{A}^{1},{A}^{2},\dots ,{A}^{r}\right\}$ all of the same dimensions n × m, denote by $\text{min}\left\{\mathcal{A}\right\}$ the entrywise min operation over $\mathcal{A}$, whose result is a matrix C_{n × m}, such that $C[\phantom{\rule{0.3em}{0ex}}i,j]=\text{min}\left\{A[\phantom{\rule{0.3em}{0ex}}i,j]\phantom{\rule{0.3em}{0ex}}A\in \mathcal{A}\right\}$. $\text{min}\left\{\mathcal{A}\right\}$ can be computed in $O\left(\left\mathcal{A}\right\mathit{\text{nm}}\right)$ time in a straightforward manner. For matrices A_{n × k} and B_{k × m}, the minplus multiplication of A and B, denoted A ⊗ B, results in a matrix C_{n × m}, where the entries of C are defined by C [ i,j] = min{A [ i,h] + B [ h,j] 0 ≤ h < k}. Naively, A ⊗ B can be computed in O (nkm) operations. Denote the time complexity of a minplus multiplication of two n × n matrices by MP (n). At present, the asymptotically fastest algorithm for minplus square matrix multiplication is that of Chan [12], taking $O\left(\frac{{n}^{3}\stackrel{3}{\text{log}}\text{log}n}{\stackrel{2}{\text{log}}n}\right)$ time.
In the following observation, we point out how matrix multiplication can be computed as a composition of two parts, where each of the items (13) in the observation addresses a partitioning in one of the three dimensions. This will be later used by our recursive computation which is based on such partitioning.
Observation 1.
Let A_{n × k},B_{k × m}and C_{n × m}be matrices, such that C = A ⊗ B (see Figure 3).

1.
For every 0 ≤ h < m, C [ I _{0,n},I _{0,h}] = A ⊗ B [ I _{0,k},I _{0,h}] and C [ I _{0,n},I _{h,m}] = A ⊗ B [ I _{0,k},I _{h,m}].

2.
For every 0 ≤ h < n, C [ I _{0,h},I _{0,m}] = A [ I _{0,h},I _{0,k}] ⊗ B and C [ I _{h,n},I _{0,m}] = A[ I _{h,n},I _{0,k}] ⊗ B.

3.
For every 0 ≤ h < k, C = min {A [ I _{0,n},I _{0,h}] ⊗ B [ I _{0,h},I _{0,m}],A [ I _{0,n},I _{h,k}] ⊗ B [ I _{h,k},I _{0,m}]}.
EDDC expressed via minplus vector multiplications
The key observation that enables a further improvement of the worstcase bounds of EDDC is that Equations 3, 6, 8, and 9 can be expressed in terms of minplus vector multiplications. Under the assumption that all solutions to subinstances appearing on the righthand side of the equations are computed and stored in the corresponding entries, these equations can be written as follows:
The algorithm
The new algorithm has the same two stages as the baseline algorithm. It can be observed that the computation of all matrices of the forms S^{′α}, S^{α}, S^{ε}, T^{′α}, T^{α}, and T^{ε} performed in Stage 1 of the baseline algorithm adhere to the InsideVMT requirements as given in Definition 1 in [11]. The application of the generic InsideVMT algorithm[11] to this computation is immediate, and therefore we focus only on adapting the method to the computation of matrices of the form EDT^{α} and ED conducted in Stage 2 of the baseline algorithm.
After allocating all dynamic programming matrices and performing Stage 1 of the algorithm, the COMPUTEMATRIX procedure is used for implementing Stage 2 (see Algorithm 2 and Figure 4). This is a divideandconquer recursive procedure that accepts a region I × J and computes the values in all entries of ED and EDT^{α} within the region. The procedure partitions the given region into two parts and performs recursive calls on each part. In order to maintain a required precondition, the procedure applies minplus matrix multiplication subroutines between recursive calls. The correctness proof of Algorithm 2 appears in Appendix “Correctness of Algorithm 2”.
Algorithm 2 MATRIXEDDC( s , t )
Time complexity analysis
The time complexity of Algorithm 2 can be established by an identical analysis to that of the InsideVMT algorithm of [11] (see Section 3.3.1 of [11]). For completeness, we repeat this analysis here for Stage 2 of the computation, where the time complexity of Stage 1 can be inferred similarly. The complexity is expressed as a function of the bound MP (n) over the running time of a minplus multiplication of two n × n matrices. Note that MP (n) = Ω (n^{2}), as the input and output matrices of the computation contain O (n^{2}) entries. We assume here that MP (n) = Ω (n^{2+δ}) for some constant 0 < δ ≤ 1, which is true for the current best bound over MP (n) [12]. In some of the expressions developed below, we avoid the “big O” notation and give explicit bounds over the number of operations, as constant factors that may be hidden due to this notation cannot be ignored in the analysis.
The initialization time of Stage 2 is dominated by the matrix multiplications and entrywise min operations performed in line 4 of Algorithm 2. This initialization performs 2Σ multiplications of matrices of dimensions (n1)×1 with matrices of dimensions 1×(n  1), which can naively be implemented in O (Σn^{2}) time, and an entrywise min operation over a set containing Σ matrices of dimensions (n  1) × (n  1), which is also implemented in O (Σn^{2}) time.
The computation of the remaining entries in ED and EDT^{α} matrices is done within recursive calls to the COMPUTEMATRIX procedure. Observe that when COMPUTEMATRIX is called over a region of dimensions r × r for some even integer r ≥ 2, the procedure applies a vertical partitioning and performs two recursive calls over regions of dimensions $r\times \frac{r}{2}$ (lines 6 and 8). For a call over a region of dimensions $r\times \frac{r}{2}$, the procedure applies a horizontal partitioning and performs two recursive calls over regions of dimensions $\frac{r}{2}\times \frac{r}{2}$ (lines 11 and 13). For simplicity, assume that n  1 = 2^{p} for some integer p ≥ 0, and thus it follows that the dimensions of all regions occurring as inputs in recursive calls are either 1×1, or of the form r × r or $r\times \frac{r}{2}$ for some even integer r. Denote by T (x × y) an upper bound over the number of operations conducted when applying COMPUTEMATRIX over a region of dimensions x × y.
From line 2 of the procedure, T (1 × 1) = O (Σ). Consider a region of dimensions r × r for an even integer r ≥ 2. For such a region, the code in lines 58 of COMPUTEMATRIX is executed. In order to implement line 7 for some α ∈ Σ, it is necessary to compute first a minplus matrix multiplication C = A ⊗ B, where the matrix A = ED [ I_{i,k},I_{j,h}] is of dimensions $r\times \frac{r}{2}$, the matrix B = T^{α}[ I_{j,h},I_{h,l}] is of dimensions $\frac{r}{2}\times \frac{r}{2}$, and the resulting matrix C is of dimensions $r\times \frac{r}{2}$. Due to Observation 1, it is possible to compute independently the upper and lower halves of C, where $C[\phantom{\rule{0.3em}{0ex}}{I}_{0,\frac{r}{2}},{I}_{0,\frac{r}{2}}]=A[\phantom{\rule{0.3em}{0ex}}{I}_{0,\frac{r}{2}},{I}_{0,\frac{r}{2}}]\otimes B$ and $C[\phantom{\rule{0.3em}{0ex}}{I}_{\frac{r}{2},r},{I}_{0,\frac{r}{2}}]=A[\phantom{\rule{0.3em}{0ex}}{I}_{\frac{r}{2},r},{I}_{0,\frac{r}{2}}]\otimes B$. The time required to conduct this computation is $2\mathit{\text{MP}}\left(\frac{r}{2}\right)$. Then, it is required to compute min {EDT^{α}[ I_{i,k}, I_{h,l}], C} and to update EDT^{α}[ I_{i,k},I_{h,l}] to be the result of this operation, a computation which requires at most cr^{2} operations for some constant c. Since line 7 is computed for every α ∈ Σ, the total number of applied operations due to this line is at most $\Sigma \left(2\mathit{\text{MP}}\left(\frac{r}{2}\right)+c{r}^{2}\right)$. Besides line 7, two recursive calls are made in lines 6 and 8 over regions of dimensions $r\times \frac{r}{2}$, and therefore we get
When the procedure is called over a region of dimensions $r\times \frac{r}{2}$, the code in lines 1013 is executed. Similarly as above, it can be shown that the computation in line 12 requires at most $\Sigma \left(\mathit{\text{MP}}\left(\frac{r}{2}\right)+\frac{c{r}^{2}}{4}\right)$ operations, and due to the two recursive calls in lines 11 and 13 over regions of dimensions $\frac{r}{2}\times \frac{r}{2}$, we get
Therefore,
The explicit form of the above recursive equation can be established by the Master Theorem (under the assumption that MP (n) = Ω (n^{2+δ}), see Chapter 4 in [16]), yielding the expression T (r × r) = O (ΣMP (r)). Thus, the time complexity of Stage 2 of the algorithm is O (ΣMP (n)). The time analysis of the InsideVMT algorithm of [11], applied to implement Stage 1 of the algorithm yields the same bound of O (ΣMP (n)), and thus O (ΣMP (n)) is the time complexity of the entire algorithm. Using the currently asymptotically fastest algorithm for minplus matrix multiplication [12]$\mathit{\text{MP}}\left(n\right)=\Theta \left(\frac{{n}^{3}\stackrel{3}{\text{log}}\text{log}n}{\stackrel{2}{\text{log}}n}\right)$, we get the currently best explicit time bound for EDDC of $O\left(\frac{\Sigma {n}^{3}\stackrel{3}{\text{log}}\text{log}n}{\stackrel{2}{\text{log}}n}\right)$.
An online algorithm for EDDC using minplus matrixvector multiplication for discrete cost functions
In this section, we present an EDDC algorithm which is based on the general algorithm (given in Section “A matrix multiplication based algorithm for EDDC”) and improves its time complexity by a factor of O (log3 logn). This EDDC algorithm is intended for integer cost functions, but can also be applied to rational cost functions after they are scaled. It is an online algorithm; it can process the input strings letter by letter with a guaranteed low time bound for any prefix of the input. The EDDC algorithm presented in this section is based on a Ddiscrete matrixvector minplus multiplication algorithm we developed, which is generic and may be applied to other problems as well.
Ddiscrete matrices and the EDDC problem with integer costs
Given a matrix of integers A_{n × m} and indices 1 ≤ i < n and 0 ≤ j < n, call the pair of entries A [ i  1,j] and A [ i,j] adjacent. Let D = I_{a,b}= [ a,a + 1,…,b  1] be an integer interval for some integers a < b. Say that matrix A is Ddiscrete if for every pair of adjacent entries A [ i  1,j] and A [ i,j], their difference A [ i  1,j]  A [ i,j] is in D.
Consider the EDDC problem in the case of integer costs for all edit operations. In Lemma 1, we show that in this case, all matrix multiplications applied by Algorithm 2 are between Ddiscrete metrices, with respect to a certain integer interval D. This proof is similar to that of Masek and Paterson for simple edit distance [17]. This would allow conducting such matrix multiplications using a faster algorithm, described in Section “An efficient Ddiscrete minplus matrixvector multiplication algorithm”.
Lemma 1.
Given strings s and t and an integer cost function for EDDC, all matrix multiplications applied by Algorithm 2 are over Ddiscrete matrices, where D = I_{a,b} is determined according to the cost function by a= max{del(α)  α ∈ Σ} and b = max{ins (α)  α ∈ Σ} + 1.
The proof of Lemma 1 appears in Appendix “Proofs to lemmas corresponding to the EDDC algorithm for discrete cost functions”.
Ddiscrete matrices and vectors
Here, we present some properties of Ddiscrete matrices and vectors that are similar to those previously observed in [14, 17]. The following lemmas show that the set of Ddiscrete matrices is closed under the minplus multiplication and entrywise min operations. In what follows, let D=I_{a,b} be some integer interval. The proofs of the following lemmas appear in Appendix “Proofs to lemmas corresponding to the EDDC algorithm for discrete cost functions”.
Lemma 2.
Let X,Y and Z be matrices, such that X and Y contain only integer elements and Z = X ⊗ Y. If X is Ddiscrete, then is Ddiscrete.
Lemma 3.
Let X,Y and Z be matrices, such that X and Y contain only integer elements and Z = min{X,Y}. If X and Y are D discrete, then Z is Ddiscrete.
The following lemma implies that when the absolute difference between the first elements of two qlength Ddiscrete vectors x and y is sufficiently large, one of the vectors can be immediately taken as the result of the min(x,y) operation.
Lemma 4.
Let x =(x_{0},…,x_{q  1}) and y = (y_{0},…,y_{q  1}) be two qlength Ddiscrete vectors for some q > 0. If y_{0}  x_{0} ≥ q D, then min(x,y) = x.
In what follows, fix an integer q > 1. Let x = (x_{0},x_{1},…,x_{q1}) be a qlength Ddiscrete vector. By definition, for every 0<i < q, x_{i1}x_{ i } is an integer within D, and so x_{i1}x_{ i }  a is an integer within the interval I_{0,ba} = I_{0,D}. Therefore, the series x_{0}  x_{1}  a,x_{1}  x_{2}  a,…,x_{q2}x_{q1}a can be thought of as a series of q1 digits in a Dbase representation of an integer $\Delta x=\sum _{0\le i<q1}\leftD{}^{i}\right({x}_{i}{x}_{i+1}a)$, where 0 ≤ Δ x < D^{q1}. The Δ encoding of x is defined to be the pair of integers (x_{0},Δ x). We write x = (x_{0},Δ x) to indicate that (x_{0},Δ x) is the Δencoding of x, where x_{0} is called the offset of x and Δ x is called the canonical index of x. Note that for two qlength Ddiscrete vectors x = (x_{0},Δ x) and y = (y_{0},Δ y), Δ x = Δ y if and only if for every 0 ≤ i < q, x_{ i }  y_{ i } = c for some constant c. In particular, x and y share the same Δencoding if and only if they are identical. Call a Ddiscrete vector of the form x = (0,Δ x) (with an offset x_{0} = 0) a canonical vvector.
The next observations show that both operations of entrywise min and minplus multiplication, with respect to Ddiscrete matrices and vectors, can be expressed via canonical vectors.
Observation 2.
Observation 2. Let x = (x_{0},Δ x), y = (y_{0},Δ y), and z = (z_{0},Δ z) be qlength Ddiscrete vectors such that z = min (x,y). Then, for every number c it holds that min((x_{0}  c,Δ x),(y_{0}  c,Δ y)) = (z_{0}  c,Δ z). In particular, min((0,Δ x),(y_{0}  x_{0},Δ y)) = (z_{0}  x_{0},Δ z).
Observation 3.
LetB_{q × q} be a Ddiscrete matrix, x = (0,Δ x) a qlength canonical Ddiscrete vector, and y =(y_{0},Δ y) a qlength Ddiscrete vector, such that B ⊗ x = y. Then, for any number c it holds that B ⊗ (c,Δ x) = (y_{0} + c,Δ y).
An efficient Ddiscrete minplus matrixvector multiplication algorithm
Let A_{n×m} be a Ddiscrete matrix, and fix a constant $\frac{1}{\underset{\leftD\right}{\text{log}}(n+m)}<\lambda <1$. We give an algorithm for minplus Ddiscrete matrixvector multiplication that, after preprocessing A in $O\left(\frac{\mathit{\text{nm}}{(n+m)}^{\lambda}}{\leftD\right}\right)$ time and $O\left(\frac{\mathit{\text{nm}}{(n+m)}^{\lambda}}{\leftD\right{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}(n+m)}\right)$ space, computes A ⊗ x for any mlength Ddiscrete vector x in $O\left(\frac{\mathit{\text{nm}}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}(n+m)}\right)$ time under the RAM computational model. Our algorithm is an adaptation of Williams’ algorithm [13] for finite semiring matrixvector multiplications, with some notions similar to Frid and Gusfield’s acceleration technique for RNA folding [14]. It follows the concept of the FourRussians Algorithm[18] (see also [14, 17, 19]), i.e. preprocessing reoccurring computations, tabulating their results in lookup tables, and retrieving such results in order to accelerate the general computation.
Specifically, the algorithm stores preprocessed computations of two kinds: matrixvector minplus multiplications, and vector entrywise minima, where vectors and matrices are of qlength and of q × q dimensions, respectively, for q = ⌊ λ logD(n + m) ⌋. For conducting this preprocessing, we will assume that D ≤ n + m, otherwise q = 0 and the multiplication cannot be accelerated using the suggested method. In addition, for simplicity of the analysis we assume that q^{3} ≤ min(n,m). If this does not hold, a multiplication of the form A ⊗ x can be naively computed in the relatively efficient time complexity of $O\left(\text{max}(n,m)\underset{\leftD\right}{\overset{3}{\text{log}}}\left(n+m\right)\right)$. The space complexity of the preprocessing phase is higher than the O (n m) space complexity of the standard multiplication algorithm and depends on the constant λ, ranging between O(n mD) and $O\left(\frac{\mathit{\text{nm}}(n+m)}{\underset{\leftD\right}{\overset{2}{\text{log}}}(n+m)}\right)$ for λ values between $\frac{1}{\underset{\leftD\right}{\text{log}}(n+m)}$ and 1, correspondingly. The lower bound $\frac{1}{\underset{\leftD\right}{\text{log}}(n+m)}$ for λ is chosen so that the time complexity $O\left(\frac{\mathit{\text{nm}}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}(n+m)}\right)$ of matrixvector multiplications involving the preprocessed matrix would be better than the naive time complexity O (n m).
Preprocessing of matrixvector ⊗ computations
Let ${n}^{\prime}=q\u230a\frac{n}{q}\u230b$ and ${m}^{\prime}=q\u230a\frac{m}{q}\u230b$, and note that 0 ≤ n  n^{′} < q and 0 ≤ m  m^{′} < q. Let Q_{ k } denote the qlength integer interval Q_{ k } = [ k q,k q + 1,…,(k + 1) q  1]. The submatrix ${A}^{\prime}=A\left[{I}_{0,{n}^{\prime}},{I}_{0,{m}^{\prime}}\right]$ is decomposed into $\frac{{n}^{\prime}{m}^{\prime}}{{q}^{2}}$ blocks B_{i,j} = A[ Q_{ i },Q_{ j }] where $i=0,1,\dots ,\frac{{n}^{\prime}}{q}1$ and $j=0,1,\dots ,\frac{{m}^{\prime}}{q}1$. For each block B, a corresponding lookup table MUL_{ B } is created, which tabulates minplus multiplications between B and all canonical qlength Ddiscrete vectors. For the canonical vector x = (0,Δ x), the result y = B ⊗ x is stored in the entry MUL_{ B }[ Δ x] by its encoding (y_{0},Δ y) (by Lemma 2, y is also Ddiscrete and thus can be encoded accordingly).
The multiplication of a q × q block with a qlength vector can be done in O (q^{2}) time in a straightforward manner and the encoding of the resulting qlength vector requires additional O (q) time. There are $\frac{{n}^{\prime}{m}^{\prime}}{{q}^{2}}$ blocks in the decomposition of A^{′}, each is multiplied by D^{q1} canonical vectors, and so the total time required for computing these multiplications is $O\left({q}^{2}D{}^{q1}\frac{\mathit{\text{nm}}}{{q}^{2}}\right)=O\left(D{}^{q1}\mathit{\text{nm}}\right)=O\left(\frac{\mathit{\text{nm}}{(n+m)}^{\lambda}}{\leftD\right}\right)$.
Let (y_{0},Δ y) be the Δ  encoding of some result y = B ⊗ x computed in the preprocessing of A^{′} as described above. Note that y_{0} = min 0 ≤ i < q {B [ 0,i] + x [ i]} ≤ min 0 ≤ i < q {2 max(B [ 0,i],x[ i])}. Therefore, the number of bits in the binary representation of y_{0} is at most one plus the maximum number of bits required for the representations of B [ 0,i] and x [ i] for some 0 ≤ i < q. Also, note that $0\le \Delta y<D{}^{q1}=\frac{{(n+m)}^{\lambda}}{\leftD\right}$, and Δ y can be represented in O (log(n + m)) bits. Thus, under the RAM computational model assumptions, each such encoding (y_{0},Δ y) requires O (1) space units and can be written and read in a constant time, and therefore the overall space complexity for maintaining all MUL_{ B } tables is $O\left(\frac{D{}^{q1}\mathit{\text{nm}}}{{q}^{2}}\right)=O\left(\frac{\mathit{\text{nm}}{(n+m)}^{\lambda}}{\leftD\right{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}(n+m)}\right)$. In addition, given a canonical index Δ x, it is possible to retrieve the encoding (y_{0},Δ y) = B ⊗ (0,Δ x) stored in the entry MUL_{ B }[ Δ x] in a constant time.
Let x = (x_{0},Δ x) be some (not necessarily canonical) qlength Ddiscrete vector, for which we wish to compute B ⊗ x. Due to Observation 3, the multiplication result can be obtained in constant time by retrieving (y_{0},Δ y) = MUL_{ B }[ Δ x], and returning the encoding (y_{0} + x_{0},Δ y).
Preprocessing of vector entrywise min computations
The algorithm constructs a lookup table MIN, storing entrywise min calculations between every canonical qlength Ddiscrete vector x = (0,Δ x) and every qlength Ddiscrete vector y = (y_{0},Δ y) such that abs (y_{0}) < qD (here abs (y_{0}) denotes the absolute value of y_{0}). For every such x and y, the table entry MIN [ Δ x,y_{0},Δ y] stores the Δ  encoding (z_{0},Δ z) of the vector z = min (x,y) (due to Lemma 3, z is Ddiscrete and can be encoded accordingly). There are $O\left(q\leftD\rightD{}^{2(q1)}\right)=O\left(\frac{{(n+m)}^{2\lambda}\lambda \underset{\leftD\right}{\text{log}}(n+m)}{\leftD\right}\right)$ entries in the table MIN, and each entry can be computed in O(q) = O (λ logD(n + m)) time. Thus, the computation of all entries in MIN requires $O\left(\frac{{(n+m)}^{2\lambda}{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}(n+m)}{\leftD\right}\right)$ time, and the table occupies $O\left(\frac{{(n+m)}^{2\lambda}\lambda \underset{\leftD\right}{\text{log}}(n+m)}{\leftD\right}\right)$ space.
Given two encoded qlength Ddiscrete vectors x = (x_{0},Δ x) and y = (y_{0},Δ y), the encoding (z_{0},Δ z) of the vector z = min(x,y) can now be obtained in a constant time as follows: (z_{0},Δ z) = (x_{0},Δ x) if y_{0}  x_{0} ≥ q D or (z_{0},Δ z) = (y_{0},Δ y) if x_{0}  y_{0} ≥ q D, due to Lemma 4. Otherwise, y_{0}  x_{0} < q D, and for the vectors x^{′} = (0,Δ x), y^{′} = (y_{0}  x_{0},Δ y), and ${z}^{\prime}=\left({z}_{0}^{\prime},\Delta {z}^{\prime}\right)=\text{min}({x}^{\prime},{y}^{\prime})$, we have that (z^{′},Δ z^{′}) = MIN [ Δ x,y_{0}  x_{0},Δ y]. From Observation 2, $\left({z}_{0},\Delta z\right)=\left({z}_{0}^{\prime}+{x}_{0},\Delta {z}^{\prime}\right)$.
Computing matrixvector multiplications
Given an mlength Ddiscrete vector x and assuming the preprocessing of matrix A_{n × m} was preformed as described above, we next explain how to efficiently compute the vector y = A ⊗ x. Note that y is an nlength Ddiscrete vector, due to Lemma 2.
Our algorithm computes first the multiplication ${y}^{\prime}={A}^{\prime}\otimes x\left[{I}_{0,{m}^{\prime}}\right]$ in parts of length q. First, for every $0\le j<\frac{{m}^{\prime}}{q}$, the algorithm computes the encoding $\left({x}_{\phantom{\rule{0.3em}{0ex}}0}^{\phantom{\rule{0.3em}{0ex}}j},\Delta {x}^{\phantom{\rule{0.3em}{0ex}}j}\right)$ of the subvector x^{j} = x [ Q_{ j }] of x. These encodings can be obtained in a total time of O (m). Then, for every $0\le i<\frac{{n}^{\prime}}{q}$, the encoding $\left({y}_{\phantom{\rule{0.3em}{0ex}}0}^{i},\Delta {y}^{i}\right)$ of the subvector ${y}^{i}={A}^{\prime}\left[\phantom{\rule{0.3em}{0ex}}{Q}_{i},{I}_{0,{m}^{\prime}}\right]\otimes x\left[\phantom{\rule{0.3em}{0ex}}{I}_{0,{m}^{\prime}}\right]$ of y^{′} is computed independently of the other subvectors of y^{′}. By definition (see Figure 5),
The encoded result $\left({z}_{\phantom{\rule{2.77626pt}{0ex}}\phantom{\rule{2.77626pt}{0ex}}0}^{i,j},\Delta {z}^{i,j}\right)$ of each multiplication z^{i,j} = B_{i,j} ⊗ x^{j} can be obtained in a constant time as explained in Section “Preprocessing of matrixvector ⊗ computations”. As there are $\frac{{m}^{\prime}}{q}$ such terms to compute with respect to y^{i}, their total computation time is $O\left(\frac{m}{q}\right)$. In addition, the entrywise min over all these terms can be computed by initializing $\left({y}_{\phantom{\rule{0.3em}{0ex}}0}^{i},\Delta {y}^{i}\right)\leftarrow \left({z}_{\phantom{\rule{2.77626pt}{0ex}}\phantom{\rule{2.77626pt}{0ex}}\phantom{\rule{2.77626pt}{0ex}}0}^{i,0},\Delta {z}^{i,0}\right)$, and iteratively updating $\left({y}_{\phantom{\rule{0.3em}{0ex}}0}^{i},\Delta {y}^{i}\right)\leftarrow \text{min}\left(\left(\underset{\phantom{\rule{0.3em}{0ex}}0}{\overset{i}{y}},\Delta {y}^{i}\right),\left(\underset{\phantom{\rule{2.77626pt}{0ex}}\phantom{\rule{2.77626pt}{0ex}}0}{\overset{i,j}{z}},\Delta {z}^{i,j}\right)\right)$ for all $0<j<\frac{{m}^{\prime}}{q}$. Each such update is computed in a constant time as described in Section “Preprocessing of vector entrywise min computations”, and so the encoding of a single segment y^{i} in y^{′} is computed in a total time of $O\left(\frac{m}{q}\right)$, and the encodings of all $O\left(\frac{n}{q}\right)$ such segments are computed in $O\left(\frac{\mathit{\text{nm}}}{{q}^{2}}\right)$ time. Decoding all encoded vectors y^{i} can be done in additional O (n) operations, obtaining an explicit form of y^{′} in a total time of $O\left(\frac{\mathit{\text{nm}}}{{q}^{2}}\right)$.
Let ${y}^{\prime \prime}=A\left[{I}_{0,{n}^{\prime}},{I}_{{m}^{\prime},m}\right]\otimes x\left[{I}_{{m}^{\prime},m}\right]$, where from Observation 1, $y\left[{I}_{0,{n}^{\prime}}\right]=\text{min}\left({y}^{\prime},{y}^{\prime \prime}\right)$. The computation of y^{′′} can be conducted in O (n q) time in a straightforward manner, and the computation of min(y^{′},y^{′′}) requires additional O (n) time. In addition, $y\left[{I}_{{n}^{\prime},n}\right]=A\left[{I}_{{n}^{\prime},n},{I}_{0,m}\right]\otimes x$, where this computation can be done naively in O (m q) time, and so the overall running time for computing y is $O\left(\frac{\mathit{\text{nm}}}{{q}^{2}}+\mathit{\text{nq}}+\mathit{\text{mq}}\right)=O\left(\frac{\mathit{\text{nm}}}{{q}^{2}}\right)=O\left(\frac{\mathit{\text{nm}}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}(n+m)}\right)$.
The above matrixvector minplus multiplication algorithm can be used as a fast square matrixmatrix multiplication algorithm in a straightforward manner. For two Ddiscrete matrices A_{n × n} and B_{n × n}, the computation of C = A ⊗ B can be conducted by first preprocessing A as described in Sections “Preprocessing of matrixvector ⊗ computations” in $O\left(\frac{{n}^{2+\lambda}}{\leftD\right}\right)$ time and $O\left(\frac{{n}^{2+\lambda}}{\leftD\right{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}n}\right)$ space, and then computing each column j of C independently by multiplying A with the jth column of B, in $O\left(\frac{{n}^{2}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}n}\right)$ time as explained above. The total computation time of all n columns of C is therefore $O\left(\frac{{n}^{3}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}n}\right)$.
Online preprocessing of Ddiscrete matrices
In the previous section, we assumed the settings in which a Ddiscrete matrix is given, and that it is preprocessed once prior to any multiplication operation. Next, we describe how to maintain the required lookup tables for the case the input matrix is dynamic, acquiring additional rows and columns. Consider a streaming computational model, which begins with an initial empty matrix ${A}_{0\times 0}^{0}$. In each step r, the current matrix ${A}_{{n}_{r}\times {m}_{r}}^{r}$ is obtained from the previous matrix ${A}_{{n}_{r1}\times {m}_{r1}}^{r1}$ by either adding an m_{r1}length vector as the last row or adding an n_{r1}length vector as the last column in the matrix. Note that n_{ r } + m_{ r } = r, and therefore the preprocessing block length corresponding to A^{r} is q = ⌊ λ logD(n_{ r } + m_{ r }) ⌋ = ⌊ λ logD(r)⌋. For the purpose of this analysis, we assume that λ ≤ 0.5 (note that this does not limit the asymptotic upper bounds of the running time). This assumption implies the following inequality
Lookup tables corresponding to intermediate matrices along the series can be maintained as follows. Let r_{0} and r_{1} be the smallest integers such that the block sizes corresponding to ${A}^{{r}_{0}}$ and ${A}^{{r}_{1}}$ are q and q + 1, respectively. Assume that upon reaching ${A}^{{r}_{0}}$ in the matrix sequence, all required lookup tables with respect to ${A}^{{r}_{0}}$ are already computed. Along the series of steps r_{0},r_{0} + 1,…,r_{1}, we distribute two kinds of computations: (1) new MUL_{ B } tables for accumulated q × q blocks in matrices A^{r} for r_{0} ≤ r < r_{1}, and (2) a new MIN table, as well as new MUL_{ B } tables, with respect to block length q + 1.
(1) Computing MUL_{ B }tables for accumulated q × q blocks. Assume that for some r_{0} ≤ r < r_{1}, a column was added to the matrix at step r so that the number of columns m_{ r } in the intermediate matrix A^{r} is divisible by q. Thus, at most $\frac{{n}_{r}}{q}\le \frac{r}{q}$ new q × q complete blocks are now available for preprocessing. The computation of lookup tables of the form MUL_{ B } corresponding to these new blocks will be equally distributed along the series of q consecutive steps r,r + 1,…,r + q  1, during which it is guaranteed that no column addition would introduce new complete q × q blocks in the matrix. As shown in Section “Preprocessing of matrixvector ⊗ computations”, the time required for processing a single q × q block is O (q^{2}D^{q1}), and so the total time for processing all $O\left(\frac{r}{q}\right)$ blocks is O (q r D^{q1}). Thus, in each step among the q steps, there is a need to perform $O\left(rD{}^{q1}\right)=O\left(\frac{{r}^{1+\lambda}}{\leftD\right}\right)$ operations due to these computations. Symmetrically, computing lookup tables corresponding to new blocks added due to the accumulation of rows can be performed by conducting $O\left(\frac{{r}^{1+\lambda}}{\leftD\right}\right)$ operations per step r.
(2) Computing a new MIN lookup table and new MUL_{ B }tables with respect to block length q + 1 . By the selection of r_{0} and r_{1}, q  1 = ⌊ λ logD(r_{0}  1) ⌋ > λ logD(r_{0}  1)  1, and q + 1= ⌊ λ logD(r_{1}) ⌋ ≤ λ logD(r_{1}). Therefore, $\underset{\leftD\right}{\text{log}}\left({r}_{1}\right)>\underset{\leftD\right}{\text{log}}({r}_{0}1)+\frac{1}{\lambda}$, and so ${r}_{1}>({r}_{0}1)D{}^{\frac{1}{\lambda}}\ge {r}_{0}\frac{D{}^{\frac{1}{\lambda}}}{2}\stackrel{\text{Eq.14}}{\ge}2{r}_{0}$. In particular, ${r}_{2}=\frac{{r}_{1}}{2}$ satisfies r_{0} < r_{2} < r_{1}, and for every r_{2} ≤ r < r_{1} we have that O (r) = O (r_{1}). The computation of the table MIN and tables of the form MUL_{ B } with respect to block length q + 1 is distributed along the series of $\frac{{r}_{1}}{2}$ steps r_{2},r_{2} + 1,…,r_{1}.
The new MIN table is computed independently from the specific input instance, and its overall computation time is $O\left(\frac{{r}_{1}^{2\lambda}{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}\left({r}_{1}\right)}{\leftD\right}\right)$ (see Section “Preprocessing of vector entrywise min computations”). By distributing this computation evenly along all O (r_{1}) steps, the computation time required for each step r_{2} ≤ r < r_{1} is $O\left(\frac{{r}_{1}^{2\lambda 1}{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}\left({r}_{1}\right)}{\leftD\right}\right)=O\left(\frac{{r}^{2\lambda 1}{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}\left(r\right)}{\leftD\right}\right)$.
The MUL_{ B } tables are computed similarly as done in (1), starting with (q + 1) × (q + 1) blocks already present in ${A}^{{r}_{2}}$, and continuing with blocks accumulated as the sequence progress. The overall preprocessing time of all these blocks is $O\left(\frac{{r}_{1}^{2+\lambda}}{\leftD\right}\right)$ (see Section “Preprocessing of matrixvector ⊗ computations”), and so the computation time required for each step r_{2} ≤ r < r_{1} is $O\left(\frac{{r}_{1}^{1+\lambda}}{\leftD\right}\right)=O\left(\frac{{r}^{1+\lambda}}{\leftD\right}\right)$.
All in all, the time complexity due to computations of (1) and (2) for each step r_{0} ≤ r < r_{1} is $O\left(\frac{{r}^{1+\lambda}}{\leftD\right}\right)$. In particular, the overall time complexity of preprocessing the n  size prefix A^{0},A^{1},…,A^{n} of the streamed matrices is $O\left(\frac{{n}^{2+\lambda}}{\leftD\right}\right)$.
The EDDC algorithm based on efficient Ddiscrete minplus matrixvector multiplication
Consider the EDDC problem in cases where all edit operation costs are integers. As explained in Section “Ddiscrete matrices and the EDDC problem with integer costsDdiscrete matrices and the EDDC problem with integercosts”, the EDDC DP tables can be considered Ddiscrete. This property allows for efficient minplus square Ddiscrete matrixvector multiplications, using the algorithm described in Section “An efficient Ddiscrete minplus matrixvector multiplication algorithm” to yield an $O\left(\frac{\Sigma {n}^{3}}{\underset{\leftD\right}{\overset{2}{\text{log}}}n}\right)$ running time algorithm for EDDC. We next describe an online version of the algorithm, in which the letters of the input strings s and t are received in a streaming model.
Assume that some pair of prefixes s_{0,i} and t_{0,j1} was already processed, and all entries in the DP matrices corresponding to these prefixes are computed. We explain how to update the tables in case where the next letter to arrive is the letter t_{j1} in t, where the case in which the arriving letter is from s is symmetric. The DP matrices are Ddiscrete, and assume that lookup tables for efficient minplus multiplications of these matrices are maintained as explained in the previous section. The addition of t_{j1} requires updating all matrices of the forms T^{ε}, T^{α}, and T^{′α}, for which the jth row and column should be added. In addition, it is required to add the jth column to matrices of the form ED and EDT^{α}.
In the first stage, the algorithm computes rows and columns j in all matrices of the form T^{′α}, T^{α}, and T^{ε}. The process is similar to the computation of these entries by the loop in lines 5 to 8 of Algorithm 1, with the following modification. Let q_{ j } = ⌊ λ logD(2j) ⌋, and let ${j}^{\prime}={q}_{j}\u230a\frac{j}{{q}_{j}}\u230b$. The algorithm first initializes the entries [ j  1,j] in all these matrices with the corresponding basecase values. The column is partitioned to intervals of length q_{ j }, where as before Q_{ k } denotes the interval ${I}_{k{q}_{j},(k+1){q}_{j}}$. Once an interval Q_{ k } is computed (i.e. the loop was executed with respect to index l = k q_{ j }), the Δ  encoding of the subvector T^{α}[ Q_{ k },j] is computed and kept for its later usage as lookup index. In addition, upon starting to compute the entries within an interval Q_{ k } (i.e. when l = (k + 1)q_{ j }  1), the following multiplications are computed for every α ∈ Σ:
Observe that all required entries for the computation of y^{α,k} are already computed and stored in T^{α}, and that similarly as done in Section “Computing matrixvector multiplications”, y^{α,k} can be computed by performing $O\left(\frac{j}{{q}_{j}}\right)$ constant time lookup table queries. After y^{α,k} is computed, y^{α,k}[ x] contains the value min{T^{α}[ k q_{ j } + x,h] + T^{α}[ h,j]  (k + 1)q_{ j } ≤ h < j^{′}}. Given y^{α,k}[ x], the number of expressions that need to be examined in line 5 of the loop with respect to l = k q_{ j } + x reduces to O (q_{ j }) per entry (considering values of the index h between l and (k + 1)q_{ j }, and between j^{′} and j). Entries in matrices of the form T^{α} and T^{ε} are computed exactly as done in lines 6 and 7 of Algorithm 1, respectively.
In the second stage, column j is computed in matrices EDT^{α} and ED. This is achieved by extending Equations 13 and 12 to have an entire column on the lefthand side, as follows:
This completes the update of the DP tables due to the addition of the letter t_{j1}.
Complexity analysis
After receiving n letters, the prefixes s_{0,i} and t_{0,j} of the input strings were preprocessed for some i,j such that i + j = n. The maintenance of lookup tables for efficient Ddiscrete multiplications requires at most $O\left(\frac{\Sigma {n}^{1+\lambda}}{\leftD\right}\right)$ operations per step among the first n steps, and $O\left(\frac{\Sigma {n}^{2+\lambda}}{\leftD\right}\right)$ operations for all first n steps, as shown in Section “Online preprocessing of Ddiscrete matrices”.
Adding a letter t_{j1} to the instance, the time required for processing the entries in column j of the T matrices is as follows. $O\left(\frac{\Sigma j}{{q}_{j}}\right)$ vectors y^{α,k} need to be computed, each vector is computed in $O\left(\frac{j}{{q}_{j}}\right)$ time, and their total computation time is therefore $O\left(\frac{\Sigma {j}^{2}}{{q}_{j}^{2}}\right)$. In addition, O (Σj) entries in tables T^{′α} are computed in O (q_{ j }) time each, and O (Σj) entries in tables T^{α}and T^{ε}are computed in O (Σ) time each. Therefore, the total time for computing column j in all these matrices is $O\left(\frac{\Sigma {j}^{2}}{{q}_{j}^{2}}+\Sigma {}^{2}j\right)=O\left(\frac{\Sigma {n}^{2}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}\left(n\right)}+\Sigma {}^{2}n\right)$.
Computing column j in matrices EDT^{α} and ED, the algorithm performs O (Σ) matrixvector minplus multiplications (Equations 15 and 16), each taking $O\left(\frac{{n}^{2}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}\left(n\right)}\right)$ time using the algorithm in Section “The EDDC algorithm based on efficient Ddiscrete minplus matrixvector multiplication”, and computes the entrywise minimum of Σilength vectors (Equation 16) in O (Σi) time. Hence, the total time complexity of computing column j is $O\left(\frac{\Sigma {n}^{2}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}\left(n\right)}+\Sigma {}^{2}n\right)$. Symmetrically, this bounds the running time when the nth letter comes from the source string s, and so the total running time over all first n steps is $O\left(\frac{\Sigma {n}^{3}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}\left(n\right)}+\Sigma {}^{2}{n}^{2}\right)$. The algorithm requires $O\left(\frac{\Sigma {n}^{2+\lambda}}{{\lambda}^{2}\underset{\leftD\right}{\overset{2}{\text{log}}}\left(n\right)}\right)$ space for the computed tables.
Online VMT algorithms
The online algorithm for EDDC presented in the previous section can be generalized for other problems with similar properties. Specifically, VMT problems [11], which utilize minplus multiplications and for which it can be guaranteed that computed DP matrices are Ddiscrete, can have their algorithms implemented using the same framework as we have presented above. Thus, in contrast to the general case for VMT problems in which it is required that the complete input be available at the beginning of the algorithm’s run, in the Ddiscrete case the input can be obtained in a streaming model. In addition, the asymptotic time complexity in such cases is slightly reduced with respect to the time complexity of the case of minplus multiplication of general matrices. A concrete example to such a problem is the RNA basepairing maximization problem [11, 15], in which the difference between adjacent entries (in the single DP matrix the algorithm uses) is either 0 or 1. This property was previously exploited by Frid and Gusfield [14] to obtain an $O\left(\frac{{n}^{3}}{\text{log}n}\right)$ algorithm for the problem. Using the Ddiscrete minplus multiplication algorithm presented here, this immediately implies an algorithm having the improved time bound of $O\left(\frac{{n}^{3}}{\stackrel{2}{\text{log}}n}\right)$. Additional related problems from the domains of RNA folding and Context Free Grammars (CFGs) parsing fall under the VMT framework, and it is likely that Ddiscreteness can be exploited for accelerating the computation of more problems within this family.
Additional acceleration using runlength encoding
Let w be a string. A maximal substring of w containing multiple repeats of the same letter is called a run in w. The Run Length Encoding (RLE) of w is a representation of the string in which each run is encoded by the corresponding repeating letter α and its repeat count p (denoted α^{p}). For example, the string w = aabbbaccc is a concatenation of the four runs a a, bbb, a, and ccc, and its RLE is a^{2}b^{3}a^{1}c^{3}. Denote by $\stackrel{~}{w}$ the compressed form of w, which replaces each run in w by a single occurrence of the corresponding letter. When n denotes the length of w, $\xf1$ will denote the length of the compressed form of w. The run index$\u0129$ of a letter w_{ i } in w is the index of the run in which w_{ i } participates. It can be asserted that the compressed form of the substring w_{i,j} of w is the substring ${\stackrel{~}{w}}_{\u0129,\left(\stackrel{~}{j1}\right)+1}$ of $\stackrel{~}{w}$. In the above example, $\stackrel{~}{w}=\mathit{\text{abac}}$, and therefore $\xf1=4$ (while n = 9). The run indices of all letters in w are given by the sequence [ 0,0,1,1,1,2,3,3,3], and the compressed form of w_{3,8} = bbacc is ${\stackrel{~}{w}}_{\stackrel{~}{3},\stackrel{~}{7}+1}={\stackrel{~}{w}}_{1,4}=\mathit{\text{bac}}$.
Previous works [7–9] showed how RLE can be exploited for improving the efficiency of EDDC algorithms. In these works it was required that the costs of duplications and contractions be less than the costs of all other operations (the requirement was implicit in [9], see discussion in Section “A comparison with previous works”). This requirement is somewhat unnatural for the application of minisatellite map comparison, since it assumes that mutations, which are typically common events, should cost more than the less common events of duplications and contractions. In this section, we adapt a similar RLEbased acceleration to our EDDC algorithm. The application of this acceleration requires the following constraint over cost functions:
Constraint 1.
For every α,β ∈ Σ, dup (α) ≤ dup (β) + mut (β,α) ≤ ins (α), and cont (α) ≤ cont (β) + mut (α,β) ≤ del (α).
The constraint dup (β) + mut (β,α) ≤ ins (α) implies that it never costs more to replace an insertion of some letter α into some nonempty string by the duplication of a letter β adjacent to the insertion position, and its consecutive mutation to α. Thus, we may assume w.l.o.g that optimal edit scripts do not contain insertions (unless applied to empty strings), or in other words, generation of new letters can only be obtained via duplications. Such an assumption is relatively reasonable in the context of minisatellite map comparison, considering the biological mechanisms that describe generative modifications.
The constraint dup (α) ≤ dup (β) + mut (β,α) can be intuitively understood by the example of generating a string of the form α α β from a string of the form α β. Due to the constraint, it would cost the same or less if the string α α β is obtained by duplicating the α letter in α β, rather than by duplicating the β letter and mutating its left copy into α. Again, such an assumption is relatively reasonable for the minisatellite map application. Symmetric arguments hold with respect to the constraint over contraction and deletion costs.
Observation 4.
Observation 4. Let s and w be strings. Then, ed (s,w β β) ≤ ed (s,w β) + dup (β), and ed (s,β β w) ≤ ed (s,β w) + dup (β) for every β ∈ Σ.
The correctness of Observation 4 follows from the existence of a script from s to w β β whose cost is ed (s,w β) + dup (β): this script first applies an optimal script to transform s into w β at cost ed (s,w β), and then duplicates the last β in w β at cost dup (β).
Lemmma 5.
Let α,β be letters and w ≠ ε a string. When Constraint 1 holds, ed (α,β w),ed (α,w β) ≥ ed (α,w) + dup (β), and ed (β w,α),ed (w β,α) ≥ ed (w,α) + cont (β).
The proof of Lemma 5 appears in Appendix “Proofs to lemmas corresponding to the runlength encoding based EDDC algorithm”.
Next, we show how to reduce the number of expressions that need to be considered in the EDDC recursive equations, in case Constraint 1 applies. For a string w of length at least 2, denote by R (w) ⊆ P (w) the set of all partitions (w^{a},w^{b}) of w such the last letter in w^{a} is different from the first letter in w^{b}. For example, for w = aabbbcdddd,R (w) = {(a a,bbbcdddd ), (aabbb,cdddd),(aabbbc,dddd)}. Observe that $\leftR\right(w\left)\right=\xf11$.
We start by describing how to improve the computation efficiency of EDDC for cases in which one of the input strings contains a single letter. Denote by dupcost (w) the cost of the edit script from $\stackrel{~}{w}$ to w which generates each run α^{p} in w by applying p  1 duplication operations over the corresponding letter α in $\stackrel{~}{w}$. Similarly, denote by contcost (w) the cost of the edit script from w to $\stackrel{~}{w}$ which reduces each run α^{p} in w by applying p  1 contraction operations over α. For example, for w = aabbbbaaccc, dupcost (w) = 2 dup (a) + 3 dup (b) + 2 dup (c) and contcost (w) = 2 cont (a) + 3 cont (b) + 2 cont (c). Note that $\mathit{\text{dupcost}}\left(w\right)\ge \mathit{\text{ed}}(\stackrel{~}{w},w)$, and $\mathit{\text{contcost}}\left(w\right)\ge \mathit{\text{ed}}(w,\stackrel{~}{w})$. It is simple to assert the following recursive relations: [b]
The following lemma shows that when one of the input strings contains a single letter, the edit distance can be inferred from the edit distance between this letter and the compressed form of the second string.
Lemma 6.
Let α be a letter and w a string. When Constraint 1 holds,$\mathit{\text{ed}}(\alpha ,w)=\mathit{\text{ed}}(\alpha ,\stackrel{~}{w})+\mathit{\text{dupcost}}\left(w\right)$, and$\mathit{\text{ed}}(w,\alpha )=\mathit{\text{contcost}}\left(w\right)+\mathit{\text{ed}}(\stackrel{~}{w},\alpha )$.
The following lemma shows that given a certain edit script from string u, its cost is greater than or equal to the cost of its application on a superstring of u.
For a string s of the form s = s^{a}us^{b} and an edit script $\mathcal{E}\mathcal{S}=\u3008u={u}^{0},{u}^{1},\dots ,{u}^{r}=w\u3009$ from u to w, denote by $\mathcal{E}\mathcal{S}\left(s\right)$ the edit script $\mathcal{E}\mathcal{S}\left(s\right)=\u3008s={s}^{a}u{s}^{b}={s}^{a}{u}^{0}{s}^{b},{s}^{a}{u}^{1}{s}^{b},\dots ,{s}^{a}{u}^{r}{s}^{b}={s}^{a}w{s}^{b}\u3009$from s = s^{a}us^{b}to t = s^{a}ws^{b}.
Lemma 7.
For s = s^{a}us^{b} and$\mathcal{E}\mathcal{S}=\u3008u={u}^{0},{u}^{1},\dots ,{u}^{r}=w\u3009$, $\mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\left(s\right)\right)\le \mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\right)$.
The proofs of Lemma 6 and Lemma 7 appear in the Appendix.
Equations 17 and 18 and Lemma 6 support the following preprocessing algorithm, Algorithm 3. Given a target string t, Algorithm 3 generates data structures that enable retrieving in constant time values of the form ed (α,t_{i,j}) for every α ∈ Σ and every substring t_{i,j} of t. The algorithm generates tables of the form ${\stackrel{~}{T}}^{\alpha}$ for every α ∈ Σ, such that entries ${\stackrel{~}{T}}^{\alpha}[\phantom{\rule{0.3em}{0ex}}i,j]$ contain the corresponding values $\mathit{\text{ed}}\left(\alpha ,{\stackrel{~}{t}}_{i,j}\right)$. In addition, the algorithm generates a vector DC, such that entries DC [ j] contain the corresponding values dupcost (t_{0,j}). Then, queries of the form ed (α,t_{i,j}) can be answered in a constant time according to Equation 19 below.
Algorithm 3 RLLETTERTOSTRING( t )
An algorithm which is symmetric to Algorithm 3 can be described in order to preprocess a string s for queries of the form ed (s_{i,j},α).
We continue to describe the improved computation in the case where both input strings s and t are of length at least 2. To do so, we first add some auxiliary notation. For an interval I_{x,y} of positions within a string w, denote by ${\u0128}_{x,y}$ the subsequence of indices in I_{x,y} which are start positions of runs in w. For example, for w = aabbbaacccc, the interval I_{1,7} = [ 1,2,…,6] contains all positions of letters within the substring w_{1,7} = abbbaa, and ${\u0128}_{1,7}=[\phantom{\rule{0.3em}{0ex}}2,5]$ contains the start positions in w of the runs bbb and aa that are included in I_{1,7} (the first letter a in w_{1,7} belongs to a run in w that starts in position 0, and therefore position 1 is not included in ${\u0128}_{1,7}$). This notation will be used for defining subsequences of rows and columns in DP matrices maintained by the algorithm, where some of these intervals are derived from the source string s, and some from the target string t. We will assume that the string from which ${\u0128}_{x,y}$ was derived is clear from the context, and will not specify it explicitly. For example, when ${\u0128}_{x,y}$ defines rows in matrices ED or EDT^{α}, or either rows or columns in matrices S^{α}, then the indices in ${\u0128}_{x,y}$ are derived from the source string s. When ${\u0128}_{x,y}$ defines columns in matrices ED or EDT^{α}, or either rows or columns in matrices T^{α}, then the indices in ${\u0128}_{x,y}$ are derived from the target string t. Subsequences ${\u0128}_{x,y}$ will be used for defining sparse regions in matrices, i.e. regions containing sets of rows or columns which are not necessarily adjacent.
Consider the computation of ed (s,t) as expressed in Equation 8. Assume first the special case where s ends with a run of length at least 2. In this case, s is of the form s = w β β for some string w and a letter β. For every partition (t^{a},t^{b}) of t, it is possible to combine an optimal script ${\mathcal{E}\mathcal{S}}^{1}$ from the prefix w β of s to t^{a} and an optimal script ${\mathcal{E}\mathcal{S}}^{2}$ from the suffix β of s to t^{b}, and to obtain a script $\mathcal{E}\mathcal{S}=\u3008{\mathcal{E}\mathcal{S}}^{1}(w\beta \beta ),{\mathcal{E}\mathcal{S}}^{2}({t}^{a}\beta )\u3009$ from s to t. Therefore, $\mathit{\text{ed}}(w\beta \beta ,t)\le \mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\right)=\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{1}\right(w\beta \beta \left)\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{2}\right({t}^{a}\beta \left)\right)\stackrel{\text{Lem.7}}{\le}\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{1}\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{2}\right)=\phantom{\rule{0.3em}{0ex}}\mathit{\text{ed}}(w\beta ,{t}^{a})+\phantom{\rule{0.3em}{0ex}}\mathit{\text{ed}}(\beta ,{t}^{b})$. In particular, $\mathit{\text{ed}}(w\beta \beta ,t)\le \text{min}\left\{\mathit{\text{ed}}(w\beta ,{t}^{a})+\mathit{\text{ed}}(\beta ,{t}^{b})\phantom{\rule{1em}{0ex}}\left\phantom{\rule{1em}{0ex}}\right({t}^{a},{t}^{b})\in P(t)\right\}\stackrel{\text{Eq.9}}{=}\mathit{\text{ed}}{t}^{\beta}(w\beta ,t)$. In addition, it is possible to compose an edit script from w β β to t by first contracting the last two letters to obtain the string w β, and then applying an optimal script from w β to t. The cost of such a script is ed (w β,t) + cont (β), and therefore we get that ed (w β β,t)≤ min{edt^{β}(w β,t),ed (w β,t) + cont (β)}.
Next, we show that ed (w β β,t) ≥ min {edt^{β}(w β,t), ed (w β,t) + cont (β)}. From Equation 8, either ed (w β β,t) = ed (w β β,α) + ed (α,t) or ed (w β β,t) = edt^{α}(s^{a},t) + ed (s^{b},α) for some α ∈ Σ and (s^{a},s^{b}) ∈ P (w β β). Consider first the latter case. If (s^{a},s^{b}) = (w β,β), then
Else, s^{b} is of length at least 2, and there is some string u such that s^{b}= u β β and w = s^{a}u. In this case, $\mathit{\text{ed}}(w\beta \beta ,t)=\mathit{\text{ed}}{t}^{\alpha}\left({s}^{a},t\right)+\mathit{\text{ed}}(u\beta \beta ,\alpha )\stackrel{\text{Lem.5}}{\ge}\mathit{\text{ed}}{t}^{\alpha}\left({s}^{a},t\right)+\mathit{\text{ed}}(u\beta ,\alpha )+\mathit{\text{cont}}(\beta )\stackrel{\text{Eq.8}}{\ge}\mathit{\text{ed}}(w\beta ,t)+\mathit{\text{cont}}(\beta )$. Similarly, it can be shown that when ed (w β β,t) = ed (w β β,α) + ed (α,t) for some α ∈ Σ, ed (w β β,t) ≥ ed (w β,t) + cont (β), and so ed (w β β,t) ≥ min {edt^{β}(w β,t),ed (w β,t) + cont (β)}. Thus,
Formulating Equation 20 with respect to the data structures defined in Section “A baseline dynamicprogramming algorithm for EDDC” (under the assumption that all values appearing at the righthand side of the equation are computed and stored in the corresponding entries), we get the following equation:
Now, consider the case where the last run in s is of length 1 (i.e. s is not of the form w β β). Assume first that the term that yields the minimum value of the righthand side of Equation 8 is of the form edt^{α}(s^{a},t) + ed (s^{b},α) for some partition (s^{a},s^{b}) ∈ P (s) and a letter α ∈ Σ. If (s^{a},s^{b}) ∉ R(s), then there is some letter β ∈ Σ which is both the last letter of s^{a} and the first letter of s^{b}. In this case, we can write s^{a} = w β and s^{b} = β u. Note that u ≠ ε (since s ≠ w β β by definition), and so $\mathit{\text{ed}}(s,t)=\mathit{\text{ed}}{t}^{\alpha}(w\beta ,t)+\mathit{\text{ed}}(\beta u,\alpha )\stackrel{\text{Eq.9}}{=}\text{min}\left\{\mathit{\text{ed}}\left(w\beta ,{t}^{a}\right)+\mathit{\text{ed}}\left(\alpha ,{t}^{b}\right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\left({t}^{a},{t}^{b}\right)\in P(t)\right\}+\mathit{\text{ed}}(\beta u,\alpha )\stackrel{\begin{array}{l}\text{Lem.5},\\ \text{Obs.4}\end{array}}{\ge}\text{min}\left\{\left(\mathit{\text{ed}}\left(w\beta \beta ,{t}^{a}\right)\mathit{\text{dup}}(\beta )\right)+\mathit{\text{ed}}\left(\alpha ,{t}^{b}\right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\left({t}^{a},{t}^{b}\right)\in P(t)\right\}+\left(\mathit{\text{ed}}(u,\alpha )+\mathit{\text{dup}}(\beta )\right)=\text{min}\left\{\mathit{\text{ed}}\left(w\beta \beta ,{t}^{a}\right)+\mathit{\text{ed}}(\alpha ,{t}^{b})\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\left({t}^{a},{t}^{b}\right)\in P(t)\right\}+\mathit{\text{ed}}(u,\alpha )\stackrel{\text{Eq.9}}{=}\mathit{\text{ed}}{t}^{\alpha}(w\beta \beta ,t)+\mathit{\text{ed}}(u,\alpha )$. From the optimality of the partition (s^{a},s^{b}) = (w β,β u), it follows that ed (s,t) = edt^{α}(w β β,t) + ed (u,α). If u starts with β this step can be repeated, and inductively we can apply such partition refinements until obtaining a partition (s^{a},s^{b}) of s such that ed (s,t) = edt^{α}(s^{a},t) + ed (s^{b},α) and (s^{a},s^{b}) ∈ R (s). Now, Equation 8 can be revised as follows:
Using the DP formulation, we get
Similarly as shown for the computation of ed (s,t), it is possible to revise the computation of edt^{α}(s,t) under Constraint 1 and obtain the following equations:
Finally, we present Algorithm 4, which is an efficient version of Algorithm 2. Stage 1 of the new algorithm is accelerated by using Algorithm 3 to compute for every α∈Σ distances from α all substrings of s and distances from all substrings of t to α. In Stage 2, we use the equations developed above in order to accelerate the computation. The correctness of the algorithm follows from the correctness of the recursive equations, and can be asserted similarly as done for Algorithm 2.
Algorithm 4 RLMATRIXEDDC( s , t )
Complexity analysis
Assume for simplicity that compressed forms of both input strings s and t have the same length $\xf1$.
Algorithm 3.
The running time of line 1 of the algorithm is O (n) for computing the compressed form of the input string, and $O\left(\right\Sigma \left\mathit{\text{MP}}\right(\xf1\left)\right)$ for running Stage 1 of Algorithm 2 over this compressed string. Lines 2 and 3 require O (n) time, and so the overall time complexity of Algorithm 3 is $O(n+\Sigma \left\mathit{\text{MP}}\right(\xf1\left)\right)$. The space complexity for computing and maintaining all matrices is $O\left(\right\Sigma \left{\xf1}^{2}\right)$, and an additional O (n) space is required for the vector DC. Hence, the overall space complexity of the algorithm is $O(n+\Sigma \left{\xf1}^{2}\right)$ (see Section “Time complexity analysis” for complexity analysis of Stage 1 of Algorithm 2).
Algorithm 4.
Time and space complexities of Stage 1 of the algorithm are identical to those of Algorithm 3. As in Section “The algorithm”, the computations governing the running time of Stage 2 are those of matrix multiplications performed within recursive calls to RLCOMPUTEMATRIX.
The recursive computation of RLCOMPUTEMATRIX can be visualized as a tree (see Figure 4). Each node in the tree corresponds to a call to RLCOMPUTEMATRIX over some regions I_{i,k} × I_{j,l}, which is either a leaf in case that k = i + 1 and l = j + 1, or otherwise an internal node. In the latter case, the node has exactly two children, corresponding to the two recursive calls obtained from either a vertical (lines 11 and 13) or a horizontal (lines 16 and 18) partition of the region. For simplicity, assume that the interval length $\xf1=\left{\u0128}_{2,n+1}\right={2}^{x}$ for some integer x. It can be observed that the algorithm alternates between vertical and horizontal partitions along paths from the root of the tree, where regions of two different nodes in the same depth y are disjoint, and the union of all regions of nodes in depth y covers the entire initial region I_{2,n + 1} × I_{2,n + 1} of the root node. For every 0 ≤ y ≤ log(n), there are two series of intervals ${I}^{y,0},{I}^{y,1},\dots ,{I}^{y,{2}^{y}1}$ and ${J}^{y,0},{J}^{y,1},\dots ,{J}^{y,{2}^{y}1}$, such that the set of regions corresponding to all nodes in depth 2y is {I^{y,f}× J^{y,g}  0 ≤ f < 2^{y},0 ≤ g < 2^{y}}, and the set of regions corresponding to all nodes in depth 2 y + 1 is {I^{y,f}× J^{y + 1,g}  0 ≤ f < 2^{y},0 ≤ g < 2^{y + 1}}. In addition, the corresponding subsequences ${\u0128}^{y,0},\dots ,{\u0128}^{y,{2}^{y}1}$ and ${\stackrel{~}{J}}^{y,0},\dots ,{\stackrel{~}{J}}^{y,{2}^{y}1}$ have all the same size 2^{x  y}.
Consider a node of depth 2y whose corresponding region is I^{y,f} × J^{y,g}, and the two regions corresponding to its children I^{y,f} × J^{y + 1,2g} and I^{y,f} × J^{y + 1,2g + 1}. The computation time spent on the node is dominated by the matrix multiplications performed in line 12 of RLCOMPUTEMATRIX. This includes Σ matrix multiplications between pairs of matrices such the dimensions of the first matrix in each pair is $\left{I}^{y,f}\right\times \left{\stackrel{~}{J}}^{y+1,2g}\right=\left{I}^{y,f}\right\times {2}^{xy1}$, and the dimensions of the second matrix in a pair is $\left{\stackrel{~}{J}}^{y+1,2g}\right\times \left{\stackrel{~}{J}}^{y+1,2g+1}\right={2}^{xy1}\times {2}^{xy1}$. Observe that I^{y,f}≥ 2^{xy1}, and such multiplications can be implemented by dividing the interval I^{y,f} into $\frac{\left{I}^{y,f}\right}{{2}^{xy1}}$ intervals of length 2^{xy1} each, and performing $\frac{\left{I}^{y,f}\right}{{2}^{xy1}}$ multiplications between square matrices of dimensions 2^{xy1} × 2^{xy1} in a total time of $\frac{\left{I}^{y,f}\right}{{2}^{xy1}}\mathit{\text{MP}}\left({2}^{xy1}\right)$. Therefore, the time required for all matrix multiplications performed within nodes in depth 2 y is
Similarly, it is possible to show that the total time required for all matrix multiplications performed within nodes in depth 2 y + 1 is also $\frac{2n\Sigma \mathit{\text{MP}}\left({2}^{xy1}\right)}{\xf1}$, and so the total computation time of matrix multiplications throughout the entire algorithm run is $O\left(\frac{\Sigma n}{\xf1}\sum _{0\le y<x}\mathit{\text{MP}}\left({2}^{y}\right)\right)$. As in Section “The algorithm”, using the Master Theorem [16], this summation evaluates to $O\left(\frac{\Sigma \mathit{\text{nMP}}(\xf1)}{\xf1}\right)$. In addition to matrix multiplications, RLCOMPUTEMATRIX performs O (Σn^{2}) operations in base computations (lines 27), and so the total time complexity of the complete algorithm is $O\left(\Sigma {n}^{2}+\frac{\Sigma \mathit{\text{nMP}}(\xf1)}{\xf1}\right)$.
A simple implementation of Algorithm 4 can be done using the same space complexity of O (Σn^{2}), as the space complexity of Algorithm 2. A more involved implementation can be applied by observing that in fact the algorithm only examines and updates entries in matrices of dimensions at most $n\times \xf1$ or $\xf1\times n$ when performing matrix multiplications, and in addition it examines adjacent entries “to the left” or “above” an entry in a basecase region. This observation can be used in order to reduce the space complexity of the algorithm to $O\left(\Sigma \mathrm{n\xf1}\right)$, where the complete details of such an implementation are omitted from this text.
A comparison with previous works
In this section, we review the previous main algorithms for EDDC by Behzadi and Steyaert [7], Bérard et al. [8] and Abouelhoda et al. [9], and point out similarities and improvements made in our current work.
The main contribution of our work is in obtaining subcubic algorithms for EDDC, whereas all previous algorithms have cubic time complexities (for Σ the alphabet size, n the length of the input strings and $\xf1$ the length of their RLE compressed forms, the algorithms of [7], [8], and [9] obtain the time complexities $O\left({n}^{2}+n{\xf1}^{2}+\Sigma {\xf1}^{3}+\Sigma {}^{2}{\xf1}^{2}\right)$, $O\left({n}^{3}+\Sigma {\xf1}^{3}\right)$, and $O({n}^{2}+n{\xf1}^{2})$, respectively).
Notably, the algorithm of [9] eliminates a Σ factor that appears in the time complexities of the algorithms given in [7, 8] and here. However, this improvement is confined to a constrained model of duplication histories. As we do not assume this model here, we could not use the representation of [9] that allows the elimination of the Σ time complexity factor.
In general, the frameworks of all algorithms in [7–9] as well as the algorithms presented here are similar. All these algorithms apply two phases, where the first phase computes costs corresponding to all substrings of each one of the input strings separately, and the second phase uses these precomputed costs in order to compute the edit distance between each pair of prefixes of the input strings (our online variant described in Section “An online algorithm for EDDC using minplus matrixvector multiplication for discrete cost functions” interleaves these two phases, yet each operation it conducts can be conceptually attributed to one of the phases). The recursive formulas are similar as well, where those for the first phase can be viewed as special kinds of Weighted Context Free Grammar derivation rules.
Next, we address the cost function constraints. All algorithms assume that operation costs are nonnegative and apply additional assumptions similarly to those listed in our Property 1, which can be made without loss of generality.
In [8], operation costs were limited so that all duplications and contractions have the same constant cost (regardless of the letter over which they are applied), all deletions and insertions have the same constant cost, and all mutation costs are symmetric (i.e. mut (α,β) = mut (β,α) for every α,β ∈ Σ). While it was argued that these restrictions allow edit distance to be a metric, they limit the generality of the algorithm of [8], where the rest of the previous algorithms we mentioned can handle scoring schemes that not necessarily abide by these restrictions.
Both in [7] and in [8], it was required that all duplication and contraction costs are lower than the costs of any of the insertion, deletion, or mutation costs. This restriction is not explicitly stated in [9], yet seems to be required there as well. For the application of minisatellite map comparison, this requirement is somewhat unnatural since it assumes that mutations, which are typically common events, should cost more than the less common events of duplications and contractions. Our algorithms can be applied even when this restriction does not hold. However, one of our algorithms, the RLE variant (Section “Additional acceleration using runlength encoding”) adds a new requirement that was absent from those previous algorithms: it requires that for every α,β ∈ Σ, dup (α) ≤ dup (β) + mut (β,α) ≤ ins (α), and cont (α) ≤ cont (β) + mut (α,β) ≤ del (α) (our Constraint 1). On one hand, our Constraint 1 is more strict than the constraint of [7] and [8], in the sense that it implies nonnegative lower bounds over differences of the form ins (α)  dup (α) and del (α)  cont (α), while in [7] and [8] it was only required that these differences be nonnegative. On the other hand, our Constraint 1 does not require that the cost of mutations be higher than the cost of duplications and contractions.
We showed that our algorithms are more general with respect to the assumed constraints. We also claim that our algorithms are more precise with respect to the formal problem specification. All previous algorithms (excluding the first algorithm by Bérard and Rivals [2], which had an O (n^{4}) running time and assumed a constant cost for all mutations in addition to the restrictions in [8]) might output nonoptimal solutions in certain cases, as demonstrated in the following example. Consider the input s = a b, t = e f, and the cost function in which all duplications and contractions cost 1, all deletions and insertions cost 20, and the symmetric mutation costs are as given in Table 1. It can be shown that all three algorithms in [7], [8], and [9] would output the value 18 as the edit distance between the input strings, reflecting one of the edit scripts 〈ab,eb,ef〉 or 〈ab,af,ef〉. Nevertheless, the correct value is 17, due to the script 〈ab,cb,cc,c,d,dd,ed,ef〉. Perhaps it could be possible to specify additional restrictions over the cost functions in order to guarantee that the algorithms in [7], [8], and [9] return optimal solutions for all instances.
Conclusions and discussion
This work presents computational techniques for improving the time complexity of algorithms for the EDDC problem. We adapt the problem to the VMT framework defined in [11], which incorporates efficient matrix multiplication subroutines in order to accelerate standard dynamic programming algorithms. We describe an efficient algorithm, as well as two variants which are even more efficient, given some restrictions on the cost functions.
An additional result we give is the currently most efficient algorithm for the minplus multiplication of Ddiscrete matrices (matrices for which differences between adjacent entries are integers within an interval of length D).
We note that the running times of our algorithms depend on the alphabet size Σ. For the general algorithm, the running time is O (Σ·MP(n)), where MP(n) is the time complexity of the minplus multiplication of two n × n matrices, which is currently upperbounded by $O\left(\frac{{n}^{3}\stackrel{3}{\text{log}}\text{log}n}{\stackrel{2}{\text{log}}n}\right)$[12]. Some of the previous algorithms obtain alphabet independent time complexities, for example the algorithms in [9] and [2]. As we discussed in Section “A comparison with previous works”, such algorithms do not solve the most general variant of the problem and require some assumptions on the cost function. Nevertheless, we believe that the matrix multiplicationbased techniques for improving the time complexity presented in this paper can also be incorporated to the algorithm of [9], however the details of this enhancement are beyond the scope of this paper.
In contrast to the work of [9], our model assumes that intermediate strings along edit scripts may contain characters which are absent from both source and target strings. This implies that the size of the alphabet Σ is not bounded by the length of the input sequences. In the context of minisatellite comparison, identifying a feasible alphabet and cost function for this task is an interesting problem beyond the scope of this paper.
Appendix
Correctness of the recursive computation
This section proves Theorem 1, thus asserting the correctness of the recursive computation for the EDDC problem given in Section “The recurrence formula”. We start by adding some required notation and showing how long edit scripts can be decomposed to shorter partial scripts. Then, we use the observed recursive properties in order to prove the correctness of the recurrence.
Let s,w,t be strings, ${\mathcal{E}\mathcal{S}}^{1}=\u3008s={u}^{1,0},{u}^{1,1},\dots ,{u}^{1,{r}_{1}}=w\u3009$ an edit script from s to w, and ${\mathcal{E}\mathcal{S}}^{2}=\u3008w={u}^{2,0},{u}^{2,1},\dots ,{u}^{2,{r}_{2}}=t\u3009$ an edit script from w to t. Denote by $\mathcal{E}\mathcal{S}=\u3008{\mathcal{E}\mathcal{S}}^{0},{\mathcal{E}\mathcal{S}}^{1}\u3009$ the concatenated edit script $\mathcal{E}\mathcal{S}=\u3008s\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{u}^{1,0},{u}^{1,1},\dots ,{u}^{1,{r}_{1}}=w={u}^{2,0},{u}^{2,1},\dots ,{u}^{2,{r}_{2}}=t\u3009$ from s to t. Note that $\mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\right)=\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{1}\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{2}\right)$, and $\left\mathcal{E}\mathcal{S}\right=\left{\mathcal{E}\mathcal{S}}^{1}\right+\left{\mathcal{E}\mathcal{S}}^{2}\right$. This notation extends naturally to concatenations of more than two scripts. For example, $\mathcal{E}\mathcal{S}=\u3008{\mathcal{E}\mathcal{S}}^{1},{\mathcal{E}\mathcal{S}}^{2},\dots ,{\mathcal{E}\mathcal{S}}^{q}\u3009$ denotes an edit script from a string s to a string t obtained by a concatenation of q scripts, each script ${\mathcal{E}\mathcal{S}}^{i}$ transforms some intermediate string w^{i1} into a string w^{i}, and s = w^{0} and t = w^{q}.
Obeservation 5.
For every three strings s,w,t, ed(s,t) ≤ ed (s,w) + ed (w,t).
The correctness of the above observation follows from the fact that for a pair of optimal edit scripts ${\mathcal{E}\mathcal{S}}^{1}$ from s to w and ${\mathcal{E}\mathcal{S}}^{2}$ from w to t, the script $\mathcal{E}\mathcal{S}=\u3008{\mathcal{E}\mathcal{S}}^{1},{\mathcal{E}\mathcal{S}}^{2}\u3009$ from s to t satisfies $\mathit{\text{ed}}(s,t)\le \mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\right)=\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{1}\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{2}\right)=\mathit{\text{ed}}(s,w)+\mathit{\text{ed}}(w,t)$.
Lemma 7.
For s = s^{a}us^{b}and$\mathcal{E}\mathcal{S}=\u3008u={u}^{0},{u}^{1},\dots ,{u}^{r}=w\u3009$, $\mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\left(s\right)\right)\le \mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\right)$.
Proof.
Each edit operation transforming s^{a}u^{i}s^{b}to s^{a}u^{i + 1}s^{b}in $\mathcal{E}\mathcal{S}\left(s\right)$ corresponds to an operation transforming u^{i}to u^{i + 1} in $\mathcal{E}\mathcal{S}$. The only cases where corresponding operations may have different costs are those of insertions or deletions in $\mathcal{E}\mathcal{S}$ at the beginning or ending of u^{i}, which become duplications or contractions in $\mathcal{E}\mathcal{S}\left(s\right)$, respectively. For example, in case the applied operation over u^{i} in $\mathcal{E}\mathcal{S}$ is the deletion of its first letter α, and α is also the last letter of s^{a}, then the cost of the operation in $\mathcal{E}\mathcal{S}$ is del (α) while the cost of the corresponding operation in $\mathcal{E}\mathcal{S}\left(s\right)$ is cont (α) ≤ del (α). Similar scenarios may occur in case of an insertion of a letter at the beginning of u^{i}which is identical to the last letter of s^{a}, as well as in cases of insertions and deletions at the end of u^{i}of letters identical to the first letter of s^{b}. In any other case, each pair of corresponding operations have the same cost. Therefore the cost of each operation in $\mathcal{E}\mathcal{S}\left(s\right)$ is smaller than or equals to the cost of its corresponding operation in $\mathcal{E}\mathcal{S}$, and $\mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\left(s\right)\right)\le \mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\right)$. □
Lemma 8.
Let s and t be two strings, and (s^{a}, s^{b}) ∈ P (s), (t^{a}, t^{b}) ∈ P (t) partitions of s and t, respectively. Then, ed (s,t) ≤ ed (s^{a},t^{a}) + ed (s^{b},t^{b}).
Proof.
Let ${\mathcal{E}\mathcal{S}}^{a}$ be an optimal script from s^{a} to t^{a} and ${\mathcal{E}\mathcal{S}}^{b}$ an optimal script from s^{b} to t^{b}. The script ${\mathcal{E}\mathcal{S}}^{a}\left(s\right)$ is a script from s = s^{a}s^{b}to t^{a}s^{b}. Similarly, ${\mathcal{E}\mathcal{S}}^{b}\left({t}^{a}{s}^{b}\right)$ is a script from t^{a}s^{b}to t^{a}t^{b}= t. For the script $\mathcal{E}\mathcal{S}=\u3008{\mathcal{E}\mathcal{S}}^{a}\left(s\right),{\mathcal{E}\mathcal{S}}^{b}\left({t}^{a}{s}^{b}\right)\u3009$ from s to t, we have that $\mathit{\text{ed}}(s,t)\le \mathit{\text{cost}}\left(\mathcal{E}\mathcal{S}\right)=\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{a}\right(s\left)\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{b}\right({t}^{a}{s}^{b}\left)\right)\stackrel{\text{Lem.7}}{\le}\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{a}\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{b}\right)=\mathit{\text{ed}}\left({s}^{a},{t}^{a}\right)+\mathit{\text{ed}}\left({s}^{b},{t}^{b}\right)$. □
Let s and t be strings. Call a pair of partitions (s^{a}, s^{b}) ∈ P (s) and (t^{a}, t^{b}) ∈ P (t) an optimal pairwise partition of s and t if ed (s,t) = e (s^{a}, t^{a}) + ed (s^{b}, t^{b}). Say that an edit script $\mathcal{E}\mathcal{S}$ from s to t is a shortest optimal script from s to t if $\mathcal{E}\mathcal{S}$ is optimal, and for every other optimal script ${\mathcal{E}\mathcal{S}}^{\prime}$ from s to t, $\left\mathcal{E}\mathcal{S}\right\le \left{\mathcal{E}\mathcal{S}}^{\prime}\right$. For a script $\mathcal{E}\mathcal{S}=\u3008s={u}^{0},{u}^{1},\dots ,{u}^{r}=t\u3009$ from s to t and 0 ≤ i ≤ j ≤ r, denote by ${\mathcal{E}\mathcal{S}}^{i,j}=\u3008{u}^{i},{u}^{i+1},\dots ,{u}^{j}\u3009$ the partial script of $\mathcal{E}\mathcal{S}$ from u^{i} to u^{j}.
Observation 6.
Let$\mathcal{E}\mathcal{S}=\u3008{u}^{0},{u}^{1},\dots ,{u}^{r}\u3009$be a shortest optimal edit script from u^{0} to u^{r}. For every 0 ≤ i ≤ j ≤ r, the partial script${\mathcal{E}\mathcal{S}}^{i,j}$is a shortest optimal edit script from u^{i}to u^{j}. Moreover, for any shortest optimal script${\mathcal{E}\mathcal{S}}^{\ast i,j}$from u^{i}to u^{j}within$\mathcal{E}\mathcal{S}$, the script${\mathcal{E}\mathcal{S}}^{\ast}=\u3008{\mathcal{E}\mathcal{S}}^{0,i},{\mathcal{E}\mathcal{S}}^{\ast i,j},{\mathcal{E}\mathcal{S}}^{j,r}\u3009$is a shortest optimal script from u^{0} to u^{r}.
The correctness of the above observation is obtained by noting that if ${\mathcal{E}\mathcal{S}}^{i,j}$ is not a shortest optimal script from u^{i} to u^{j}, then for some shortest optimal script ${\mathcal{E}\mathcal{S}}^{\ast i,j}$ from u^{i} to u^{j} we get that the script ${\mathcal{E}\mathcal{S}}^{\ast}=\u3008{\mathcal{E}\mathcal{S}}^{0,i},{\mathcal{E}\mathcal{S}}^{\ast i,j},{\mathcal{E}\mathcal{S}}^{j,r}\u3009$ either has a lower cost than $\mathcal{E}\mathcal{S}$, or is a shorter script of the same cost, in contradiction to $\mathcal{E}\mathcal{S}$ being a shortest optimal script from u^{0} to u^{r}.
Lemma 9.
Let$\mathcal{E}\mathcal{S}=\u3008{u}^{0},{u}^{1},\dots ,{u}^{r}\u3009$be a shortest optimal edit script from u^{0} to u^{r}. If there are two indices 0 ≤ i < j ≤ r such that u^{i}and u^{j}are strings of length 1, then j = i + 1. In addition, for every 0 < k < r, u^{k}≠ ε.
Proof.
Assume there are two indices 0 ≤ i < j ≤ r such that u^{i}and u^{j}are strings of length 1, i.e u^{i}= α and u^{j}= β for some α,β ∈ Σ. From Observation 6, the partial script ${\mathcal{E}\mathcal{S}}^{i,j}=\u3008\alpha ={u}^{i},{u}^{i+1},\dots ,{u}^{j}=\beta \u3009$ is a shortest optimal script from α to β. Since j > i, it must be that α ≠ β (otherwise the script ${\mathcal{E}\mathcal{S}}^{\prime}=\u3008{\mathcal{E}\mathcal{S}}^{0,i},{\mathcal{E}\mathcal{S}}^{j,r}\u3009$ is a shorter script from u^{0} to u^{r} of no greater cost than $\mathcal{E}\mathcal{S}$, in contradiction to $\mathcal{E}\mathcal{S}$ being a shortest optimal script from u^{0} to u^{r}). From Property 1, ed (α,β) = mut (α,β), and so the edit script containing the single operation of mutating α to β is an optimal script from α to β, and it must be that j = i + 1.
In addition, assume by contradiction there is some index 0 < k < r such that u^{k} = ε. The only edit operation which may yield an empty string is a deletion from a singleletter string, and therefore u^{k1} = α for some letter α. Similarly, the only edit operation which may be applied over an empty string is an insertion, therefore u^{k+1} = β for some letter β, in contradiction to the fact that two intermediate strings of length 1 must be consecutive along a shortest optimal script, as shown above. □
Call an edit script $\mathcal{E}\mathcal{S}$ from a string s to a string t simple if $\mathcal{E}\mathcal{S}$ is a shortest optimal script from s to t, in which no generating operation precedes a reducing operation. The following lemma generalizes Lemma 2 of [6], by considering also indels in addition to contractions and duplications.
Lemma 10.
For every pair of strings s and t, there exists a simple edit script from s to t.
Proof.
Let s and t be two strings, and r the length of a shortest optimal script from s to t. When r ≤ 1, any shortest optimal script from s to t either contains no reducing operation or contains no generating operation, and in particular is a simple script. Otherwise, r > 1, and assume by induction the lemma holds for every pair of strings such that the length of a shortest optimal script from the source string to the target string is less than r. Let $\mathcal{E}\mathcal{S}=\u3008s={u}^{0},{u}^{1},\dots ,{u}^{r}=t\u3009$ be a shortest optimal script from s to t.
Case 1: The first operation in$\mathcal{E}\mathcal{S}$ is not a generating operation. From Observation 6, the partial script ${\mathcal{E}\mathcal{S}}^{1,r}$ is a shortest optimal script from u^{1} to u^{r}, whose length is r  1. From the inductive assumption, there is a simple script ${\mathcal{E}\mathcal{S}}^{\ast 1,r}$ from u^{1} to u^{r}, and from Observation 6 the script ${\mathcal{E}\mathcal{S}}^{\ast}=\u3008{\mathcal{E}\mathcal{S}}^{0,1},{\mathcal{E}\mathcal{S}}^{\ast 1,r}\u3009$ is a shortest optimal script from s to t. As the first operation in ${\mathcal{E}\mathcal{S}}^{\ast}$ is nongenerating (being the same first operation as in $\mathcal{E}\mathcal{S}$), ${\mathcal{E}\mathcal{S}}^{\ast}$ is simple, and the lemma follows.
Case 2: The first operation in$\mathcal{E}\mathcal{S}$ is a generating operation. Similarly as above, we may assume w.l.o.g. by applying the inductive assumption that the partial script ${\mathcal{E}\mathcal{S}}^{1,r}$ is simple. If this partial script is nonreducing, then $\mathcal{E}\mathcal{S}$ is nonreducing, and in particular it is simple. Otherwise, let 1≤i<r be the smallest index such that the transformation of u^{i} to u^{i+1} is by a reducing operation. Since neither generating nor reducing operations may precede this operation in the partial script ${\mathcal{E}\mathcal{S}}^{1,i}$, it follows that all operations in the partial script ${\mathcal{E}\mathcal{S}}^{1,i}$ (if there are any) are mutations.
The generating operation transforming u^{0} to u^{1} is either an insertion or a duplication of some letter α in u^{0}. In both cases, we can write s = u^{0} = vxw and u^{1} = vx^{′}w (v,x,x^{′} and w are strings), where in the former case x = ε and x^{′} = α, and in the latter case x=α and x^{′} = α α. As all operations in the partial script ${\mathcal{E}\mathcal{S}}^{1,i}$ are mutations, each intermediate string u^{j}, for 1 ≤ j ≤ i, is of the form v^{j}x^{j}w^{j}, where v^{j},x^{j}, and w^{j}are string obtained by applying zero or more mutations over v,x^{′}, and w, respectively. We argue that the reducing operation transforming u^{i}= v^{i}x^{i}w^{i}to u^{i+1} cannot be the deletion of a letter or a contraction involving at least one letter within the substring x^{i}. This is true, since in such a case it would have been possible to avoid the first generating operation in $\mathcal{E}\mathcal{S}$ (transforming x to x^{′}), as well as all mutation operations over a reduced letter in x^{i}, and the reducing operation from u^{i} to u^{i+1}. This would yield a script ${\mathcal{E}\mathcal{S}}^{\ast 0,i+1}$ from u^{0} to u^{i+1} which is shorter and of no higher coast than ${\mathcal{E}\mathcal{S}}^{0,i+1}$, in contradiction to Observation 6. Hence, the reducing operation from u^{i} to u^{i+1} either deletes a letter or contracts two letters within one of the substrings v^{i}or w^{i}of u^{i}.
Consider first the case where the reducing operation over u^{i}is applied within its prefix v^{i}. Thus, we can write u^{i+1} = v^{i+1}x^{i+1}w^{i+1}, where v^{i+1} is the string obtained by applying the corresponding reducing operation over v^{i}, x^{i+1} = x^{i}, w^{i+1} = w^{i}, and cost (〈 u^{i},u^{i+1} 〉) = cost (〈 v^{i},v^{i+1} 〉). The operations in ${\mathcal{E}\mathcal{S}}^{0,i+1}$ can be assigned into two independent scripts: a script ${\mathcal{E}\mathcal{S}}_{v}=\u3008v={v}^{\prime 0},{v}^{\prime 1},\dots ,{v}^{\mathrm{\prime p}}={v}^{i+1}\u3009$ from v to v^{i+1} obtained by merging each multiple occurrence of consecutive identical strings in the series v = v^{1},v^{2},…,v^{i+1} into a single occurrence, and similarly a script ${\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}=\u3008\mathit{\text{xw}}={\left(\mathit{\text{xw}}\right)}^{0},{\left(\mathit{\text{xw}}\right)}^{1}={x}^{\prime}w={x}^{1}{w}^{1},{\left(\mathit{\text{xw}}\right)}^{2},\dots ,{\left(\mathit{\text{xw}}\right)}^{q}\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{x}^{i+1}{w}^{i+1}\left(\right)close="\u3009">$ from x w to x^{i+1}w^{i+1}. Each operation in ${\mathcal{E}\mathcal{S}}^{0,i+1}$ corresponds to exactly one operation in either ${\mathcal{E}\mathcal{S}}_{v}$ or ${\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}$, where the costs of corresponding operations are equal, and therefore $\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{0,i+1}\right)=\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}_{v}\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}\right)$ and $\left{\mathcal{E}\mathcal{S}}^{0,i+1}\right=\left{\mathcal{E}\mathcal{S}}_{v}\right+\left{\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}\right$.
Now, the script ${\mathcal{E}\mathcal{S}}_{v}\left({u}^{0}\right)=\u3008{u}^{0}=\mathit{\text{vxw}}={v}^{\prime 0}\mathit{\text{xw}},{v}^{\prime 1}\mathit{\text{xw}},\dots ,{v}^{\prime p}\mathit{\text{xw}}={v}^{i+1}\mathit{\text{xw}}\u3009$ is a script from u^{0} to v^{i+1}xw, and similarly the script ${\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}\left({v}^{i+1}\mathit{\text{xw}}\right)=\u3008{v}^{i+1}\mathit{\text{xw}}={v}^{i+1}{\left(\mathit{\text{xw}}\right)}^{0},{v}^{i+1}{\left(\mathit{\text{xw}}\right)}^{1},\dots ,{v}^{i+1}{\left(\mathit{\text{xw}}\right)}^{q}={v}^{i+1}{x}^{i+1}{w}^{i+1}={u}^{i+1}\u3009$ is a script from v^{i+1}xw to u^{i+1}. Thus, the script ${\mathcal{E}\mathcal{S}}^{\ast 0,i+1}=\u3008{\mathcal{E}\mathcal{S}}_{v}\left({u}^{0}\right),{\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}\left({v}^{i+1}\mathit{\text{xw}}\right)\u3009$ is a script from u^{0} to u^{i+1}. Since ${\mathcal{E}\mathcal{S}}_{v}$ contains at least one operation (the reducing operation from v^{i} to v^{i+1}) and no generating operation (since besides the reducing operation ${\mathcal{E}\mathcal{S}}_{v}$ may contain only mutations), ${\mathcal{E}\mathcal{S}}^{\ast 0,i+1}$ starts with a nongenerating operation. In addition, $\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{\ast 0,i+1}\right)=\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}_{v}\right({u}^{0}\left)\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}\right({v}^{i+1}\mathit{\text{xw}}\left)\right)\stackrel{\text{Lem.7}}{\le}\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}_{v}\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}\right)=\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{0,i+1}\right)$ and $\left{\mathcal{E}\mathcal{S}}^{\ast 0,i+1}\right=\left{\mathcal{E}\mathcal{S}}_{v}\left({u}^{0}\right)\right+\left{\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}\left({v}^{i+1}\mathit{\text{xw}}\right)\right=\left{\mathcal{E}\mathcal{S}}_{v}\right+\left{\mathcal{E}\mathcal{S}}_{\mathit{\text{xw}}}\right=\left{\mathcal{E}\mathcal{S}}^{0,i+1}\right$. From Observation 6, ${\mathcal{E}\mathcal{S}}^{0,i+1}$ is a shortest optimal script from u^{0} to u^{i+1}, and so ${\mathcal{E}\mathcal{S}}^{\ast 0,i+1}$ is a shortest optimal script from u^{0} to u^{i+1}. Applying Observation 6 again, the script ${\mathcal{E}\mathcal{S}}^{\ast}=\u3008{\mathcal{E}\mathcal{S}}^{\ast 0,i+1}{\mathcal{E}\mathcal{S}}^{i+1,r}\u3009$ is a shortest optimal script from s to t. Now, the lemma follows from Case 1 of this proof and from the fact the first operation in ${\mathcal{E}\mathcal{S}}^{\ast}$ is not a generating operation. □
Lemma 11.
For every α ∈ Σ and every nonempty string t, any simple script from α to t is nonreducing.
Proof.
Let $\mathcal{E}\mathcal{S}$ be a simple script from α to t, and assume by contradiction $\mathcal{E}\mathcal{S}$ contains a reducing operation. Since $\mathcal{E}\mathcal{S}$ is simple, all reducing operations in $\mathcal{E}\mathcal{S}$ occur prior to any generating operation, and in particular the first reducing operation is applied after applying zero or more mutations over α. Such a reducing operation must be a deletion from a string of length 1, resulting with an empty intermediate string, in contradiction to Lemma 9. □
Lemma 12.
Let w and t be strings and β a letter, such that w ≠ ε, t is of length at least 2, and there is a nonreducing simple script from w β to t. Then, ed (w β,t) = min {ed (w,t^{a}) + ed (β,t^{b})  (t^{a},t^{b}) ∈ P (t)}.
Proof.
Let $\mathcal{E}\mathcal{S}=\u3008\mathrm{w\beta}={u}^{0},{u}^{1},\dots ,{u}^{r}=t\u3009$ be a nonreducing simple script from w β to t. For every 0 ≤ i ≤ r, construct a partition (u^{i,a}, u^{i, b}) of u^{i}which sustains that ed (w β, u^{i}) ≥ ed (w, u^{i, a}) + ed (β,u^{i,b}), as follows. For i = 0, set (u^{0,a},u^{0,b}) = (w,β), where by definition ed (w β,u^{0}) = ed (w,u^{0,a}) + ed (β,u^{0,b}) = 0. Now, assume inductively for some 0 < i ≤ r and a partition (u^{i1,a}, u^{i1, b}) of u^{i1} that ed (w β, u^{i1}) ≥ ed (w, u^{i1, a}) + ed (β, u^{i1, b}). If the nonreducing operation transforming u^{i1} to u^{i} is a mutation, an insertion, or a duplication of a letter within the prefix u^{i1, a}, then set u^{i,a} to be the string obtained by applying this operation over u^{i1, a}, and set u^{i, b} = u^{i1, b}. Otherwise, the operation is a mutation, an insertion, or a duplication of a letter within the suffix u^{i1, b}, and in this case set u^{i,b} to be the string obtained by applying this operation over u^{i1, b}, and u^{i,a} = u^{i1, a}. Note that in both cases, $\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{i1,i}\right)=\mathit{\text{ed}}\left({u}^{i1,a},{u}^{i,a}\right)+\mathit{\text{ed}}\left({u}^{i1,b},{u}^{i,b}\right)$, therefore we get from the inductive assumption that $\mathit{\text{ed}}\left(w\beta ,{u}^{i}\right)\stackrel{\text{Obs.6}}{=}\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{0,i}\right)=\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{0,i1}\right)+\mathit{\text{cost}}\left({\mathcal{E}\mathcal{S}}^{i1,i}\right)\phantom{\rule{0.3em}{0ex}}\ge \phantom{\rule{0.3em}{0ex}}\left(\mathit{\text{ed}}\left(w,{u}^{i1,a}\right)\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\mathit{\text{ed}}\left(\beta ,{u}^{i1,b}\right)\right)\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\left(\mathit{\text{ed}}\left({u}^{i1,a},{u}^{i,a}\right)\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\mathit{\text{ed}}\left({u}^{i1,b},{u}^{i,b}\right)\right)\phantom{\rule{0.3em}{0ex}}\stackrel{\text{Obs.5}}{\ge}\phantom{\rule{0.3em}{0ex}}\mathit{\text{ed}}\left(w,{u}^{i,a}\right)+\mathit{\text{ed}}\left(\beta ,{u}^{i,b}\right)$.
The process above generates a partition (t^{∗a}, t^{∗b}) = (u^{r, a},u^{r, b}) of t=u^{r}, for which ed (u^{i}, t) ≥ ed (w, t^{∗a}) + ed (β,t^{∗b}). In particular, ed (w β,t) ≥ min {ed (w,t^{a}) + ed (β,t^{b})  (t^{a},t^{b}) ∈ P (t)}. On the other hand, $\mathit{\text{ed}}(w\beta ,t)\stackrel{\text{Lem.8}}{\le}\text{min}\left\{\mathit{\text{ed}}\left(w,{t}^{a}\right)\phantom{\rule{1em}{0ex}}+\phantom{\rule{1em}{0ex}}\mathit{\text{ed}}\left(\beta ,{t}^{b}\right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\left({t}^{a},{t}^{b}\right)\phantom{\rule{1em}{0ex}}\in \phantom{\rule{1em}{0ex}}P(t)\right\}$, and the lemma follows. □
Based on the above observations and lemmas, we now turn to prove the recursive computation given in Section “The recurrence formula”, starting with Equation 7. Fix henceforth a pair of input strings s and t, each containing at least two letters. Note that for every α ∈ Σ and every partitions (s^{a}, s^{b}) ∈ P (s) and (t^{a}, t^{b}) ∈ P (t), ed (s,t) ≤ Obs.5 ed (s,α) + ed (α,t), and ed (s,t) ≤ Lem.8 ed (s^{a}, t^{a}) + ed (s^{b}, t^{b}) ≤ Obs.5 ed (s^{a}, t^{a}) + ed (s^{b}, α) + ed (α,t^{b}), therefore
Thus, to prove the correctness of Equation 7, it remains to show that