Efficient edit distance with duplications and contractions

Pinhas, Tamar; Zakov, Shay; Tsur, Dekel; Ziv-Ukelson, Michal

doi:10.1186/1748-7188-8-27

Research
Open access
Published: 29 October 2013

Tamar Pinhas¹,
Shay Zakov²,
Dekel Tsur¹ &
Michal Ziv-Ukelson¹

Algorithms for Molecular Biology volume 8, Article number: 27 (2013)

Abstract

We propose three algorithms for string edit distance with duplications and contractions. These include an efficient general algorithm and two improvements which apply under certain constraints on the cost function. The new algorithms solve a more general problem variant and obtain better time complexities with respect to previous algorithms. Our general algorithm is based on min-plus multiplication of square matrices and has time and space complexities of O (|Σ|MP (n)) and O (|Σ|n²), respectively, where |Σ| is the alphabet size, n is the length of the strings, and MP (n) is the time bound for the computation of min-plus matrix multiplication of two n × n matrices (currently, $MP(n) = O\left(\frac{n^{3} \log^{3} \log n}{\log^{2} n}\right)$ due to an algorithm by Chan).

For integer cost functions, the running time is further improved to $O\left(\frac{|Σ| n^{3}}{\log^{2} n}\right)$. In addition, this variant of the algorithm is online, in the sense that the input strings may be given letter by letter, and its time complexity bounds the processing time of the first n given letters. This acceleration is based on our efficient matrix-vector min-plus multiplication algorithm, intended for matrices and vectors for which differences between adjacent entries are from a finite integer interval D. Choosing a constant $\frac{1}{\log_{|D|} n} < λ < 1$, the algorithm preprocesses an n × n matrix in $O\left(\frac{n^{2 + λ}}{|D|}\right)$ time and $O\left(\frac{n^{2 + λ}}{|D| λ^{2} \log_{|D|}^{2} n}\right)$ space. Then, it may multiply the matrix with any given n-length vector in $O\left(\frac{n^{2}}{λ^{2} \log_{|D|}^{2} n}\right)$ time. Under some discreteness assumptions, this matrix-vector min-plus multiplication algorithm applies to several problems from the domains of context-free grammar parsing and RNA folding and, in particular, implies the asymptotically fastest $O\left(\frac{n^{3}}{\log^{2} n}\right)$ time algorithm for single-strand RNA folding with discrete cost functions.

Finally, assuming a different constraint on the cost function, we present another version of the algorithm that exploits the run-length encoding of the strings and runs in $O (\frac{| Σ | nMP (ñ)}{ñ})$ time and $O (| Σ | nñ)$ space, where $ñ$ is the length of the run-length encoding of the strings.

Background

Comparing strings is a well-studied problem in computer science as well as in bioinformatics. Traditionally, string similarity is measured in terms of edit distance, which reflects the minimum-cost edit of one string to the other, based on the edit operations of substitutions (including matches) and deletions/insertions (indels). In this paper, we address the problem of string edit distance with the additional operations of duplication and contraction. The motivation for this problem originated from the study of minisatellites and their comparisons in the context of population genetics [1].

Motivation: minisatellite comparison

A minisatellite is a section of DNA that consists of tandem repetitions of short (6–100 nucleotides) sequence motifs spanning 500 nucleotides to several thousand nucleotides. The repeated motifs also vary in sequence through base substitutions and indels. For one minisatellite locus, both the type and the number of motifs vary between individuals in a population. Therefore, pairwise comparisons of minisatellites are typically applied in studying the evolution of populations.

A minisatellite map represents a minisatellite region, where each motif is encoded by a letter and handled as one entity (denoted unit). When comparing minisatellite maps, one has to consider that regions of the map have arisen as a result of duplication events from the neighboring units. The single copy duplication model, where only one unit can duplicate at a time, is the most popular and its biological validation was asserted for the MSY1 minisatellites [1, 2]. According to this model, one unit can mutate to another unit via an indel or a mutation of a single nucleotide within it. Also, a unit can be duplicated, that is, an additional copy of the unit may appear next to the original one in the map (tandem repeat). Thus, when comparing minisatellite maps, two additional operations are considered: unit duplication and unit contraction.

The EDDC problem definition

The single copy duplication model of minisatellite maps gave rise to a new variant of the string edit distance problem, Edit Distance with Duplications and Contractions (EDDC), which allows five edit operations: insertion, deletion, mutation, duplication and contraction.

We start with some string notations. Let s be a string. Denote by $s_{i}$ the i-th letter in s, starting at index 0, and by $s_{i,j}$ the substring $s_{i} s_{i+1} \dots s_{j-1}$ of s. A substring of the form $s_{i,i}$ is an empty string, which will be denoted by ε. We use superscripts to denote substrings without an explicit indication of their start and end positions, and write e.g. $s = s^{a} s^{b}$ to indicate that s is a concatenation of the two substrings $s^{a}$ and $s^{b}$.

In the edit distance problem, one is given a source string s and a target string t over a finite alphabet Σ. An edit script from s to t is a series of strings $ES = \langle s = u^{0}, u^{1}, u^{2}, \dots, u^{r} = t \rangle$, where each intermediate string $u^{i}$ is obtained by applying a single edit operation to the preceding string $u^{i-1}$. In the standard problem definition [3–5], the allowed edit operations are insertion of a letter at some position in an intermediate string $u^{i}$, deletion of a letter in $u^{i}$, and mutation of a letter in $u^{i}$ to another letter. The single-copy EDDC problem variant adds two operations: duplication, which inserts into $u^{i}$ a letter in a position adjacent to a position that already contains the same letter, and contraction, which deletes from $u^{i}$ one copy of a letter where there are two consecutive copies of this letter. Denote by ins (α), dup (α) and del (α) the costs associated with the insertion, duplication and deletion operations applied to a letter α in the alphabet, respectively, by cont (α) the cost of contracting two consecutive occurrences of α into a single occurrence, and by mut (α,β) the cost of mutating α to a letter β. Define the cost of $ES$ to be the sum of its implied operation costs, and the length $|ES| = r$ of $ES$ to be the number of operations performed in $ES$. Clearly, for every pair of strings s and t, there is some script transforming s to t, e.g. the script that first deletes all letters in s and then inserts all letters in t. An optimal edit script from s to t is one which has a minimum cost. The edit distance from s to t, denoted by ed (s,t), is the cost of an optimal edit script from s to t. The goal of the EDDC problem is, given strings s and t, to compute ed (s,t).

Previous algorithms assume various constraints on operation costs (see Section “A comparison with previous works”). In this paper, the only limiting assumption made is that all operation costs are nonnegative. In addition, we can make the following assumption without loss of generality, which will be required by the algorithms presented in this paper:

Property 1.

It may be assumed without loss of generality that for every α,β ∈ Σ,

ins (α) = ed (ε,α), del (α) = ed (α,ε),
dup (α) = ed (α,α α), cont (α) = ed (α α,α),
mut (α,β) = ed (α,β).

This assumption can be made because, whenever one of the operation costs violates it, the operation can always be replaced by a series of operations that induces the same modification at a lower cost. For example, it cannot be that mut (α,β) < ed (α,β), since ed (α,β) is smaller than or equal to the cost of any script from α to β, among which is the script containing the single mutation operation of α into β. Moreover, if mut (α,β) > ed (α,β), then we can always replace any mutation of α into β by a series of operations that transform α into β at cost ed (α,β). In this case, we may simply assume that mut (α,β) = ed (α,β), and interpret any such mutation appearing in a script as being implemented by the corresponding series of operations. In particular, Property 1 implies that mut (α,α) = 0 (since all operation costs are nonnegative, ed (w,w) = 0 for every string w), dup (α) ≤ ins (α) (since the cost of the script from α to α α that applies a single insertion of α is ins (α) ≥ ed (α,α α) = dup (α)), cont (α) ≤ del (α), and mut (α,β) ≤ mut (α,γ) + mut (γ,β) for every γ ∈ Σ.

Insertions and duplications are considered to be generating operations, increasing by one letter the length of the string. Similarly, deletions and contractions are considered to be reducing operations, decreasing by one letter the length of the string. An edit script containing no reducing operation is called a non-reducing script, and an edit script containing no generating operation is called a non-generating script.
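The five operations and the generating/reducing distinction can be made concrete with a small sketch. It is only an illustration: the function names `apply_op`, `run_script` and `is_non_reducing`, and the encoding of an operation as an `(op, pos, letter)` triple, are assumptions introduced here and do not come from the paper.

```python
GENERATING = {"ins", "dup"}   # each adds one letter to the string
REDUCING = {"del", "cont"}    # each removes one letter from the string

def apply_op(u, op, pos, letter=None):
    """Apply one single-copy EDDC operation to string u at 0-based position pos."""
    if op == "ins":      # insert `letter` before position pos
        return u[:pos] + letter + u[pos:]
    if op == "del":      # delete the letter at position pos
        return u[:pos] + u[pos + 1:]
    if op == "mut":      # replace the letter at position pos by `letter`
        return u[:pos] + letter + u[pos + 1:]
    if op == "dup":      # add a second copy of u[pos] right next to it
        return u[:pos + 1] + u[pos] + u[pos + 1:]
    if op == "cont":     # merge u[pos] and u[pos + 1], which must be equal
        if pos + 1 >= len(u) or u[pos] != u[pos + 1]:
            raise ValueError("contraction needs two equal adjacent letters")
        return u[:pos] + u[pos + 1:]
    raise ValueError(f"unknown operation {op!r}")

def run_script(s, script):
    """Apply a list of (op, pos, letter) triples to s and return the final string."""
    u = s
    for op, pos, letter in script:
        u = apply_op(u, op, pos, letter)
    return u

def is_non_reducing(script):
    """True if the script contains no deletion and no contraction."""
    return all(op not in REDUCING for op, _, _ in script)

script = [("dup", 0, None), ("mut", 1, "c")]   # ab -> aab -> acb
print(run_script("ab", script), is_non_reducing(script))
```

The cost of such a script would be the sum of the costs of its triples under the chosen ins, del, mut, dup and cont functions.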

Previous work

The EDDC problem was first defined by Bérard and Rivals [2], who suggested an O (n⁴) time and O (n³) space algorithm for the problem, where n is the length of the two input strings (for the sake of simplicity, we assume that both strings are of the same length). This was followed by the work of Behzadi and Steyaert [6], who gave an O (|Σ|n³) time and O (|Σ|n²) space algorithm for the problem, where |Σ| is the alphabet size (typically a few tens of unique units). Behzadi and Steyaert [7] improved the complexity of their algorithm, based on run-length encoding, to $O (n^{2} + n ñ^{2} + | Σ | ñ^{3} + | Σ |^{2} ñ^{2})$ time and $O (| Σ | (n + ñ^{2}) + n^{2})$ space, where $ñ$ is the length of the run-length encoding of the input strings. Run-length encoding was also used by Bérard et al. [8], who proposed an $O (n^{3} + | Σ | ñ^{3})$ time and $O (n^{2} + | Σ | ñ^{2})$ space algorithm. Abouelhoda et al. [9] gave an algorithm whose time and space complexities, $O (n^{2} + n ñ^{2})$ and O(n²), respectively, are independent of the alphabet size. A detailed comparison between the different problem models appears in Section “A comparison with previous works”.

Our contribution

This paper presents several algorithms for EDDC which are currently the most general and efficient for the problem.

1.
We give an algorithm for EDDC for general non-negative cost functions that is based on min-plus square matrix multiplication. This algorithm is an adaptation of the framework of [10] (see also [11]). For two input strings over an alphabet Σ and of length n each, the time and space complexities of this algorithm are O (|Σ|MP (n)) and O (|Σ|n²), respectively, where MP (n) is the time complexity of a min-plus multiplication of two n × n matrices. Using the matrix multiplication algorithm of Chan [12], this algorithm runs in $O\left(\frac{|Σ| n^{3} \log^{3} \log n}{\log^{2} n}\right)$ time (Section “A matrix multiplication based algorithm for EDDC”). Moreover, our algorithm imposes fewer restrictions on the cost function than previous algorithms and is currently the only algorithm that works for the most general problem settings (Section “A comparison with previous works”).
2.
We describe a more efficient algorithm for EDDC when all operation costs are integers. This algorithm can also be applied in an online setting, where in each step a letter is added to one of the input strings. The time complexity of processing n letters in the input is $O\left(\frac{|Σ| n^{3}}{\log^{2} n}\right)$, where the base of the log function is determined by the range of cost values (Section “An online algorithm for EDDC using min-plus matrix-vector multiplication for discrete cost functions”). In order to achieve this, we obtained the following stepping-stone results, which are of interest on their own.
(a)
  Let A be an n × m matrix for which differences between adjacent entries are within some finite integer interval D. Choosing a time/space complexity tradeoff parameter λ, where $\frac{1}{\log_{|D|} (n + m)} < λ < 1$, we describe a preprocessing algorithm for A that runs in $O\left(\frac{nm (n + m)^{λ}}{|D|}\right)$ time and requires $O\left(\frac{nm (n + m)^{λ}}{|D| λ^{2} \log_{|D|}^{2} (n + m)}\right)$ space. This preprocessing later allows computing min-plus multiplications between A and m-length vectors sustaining the same discreteness requirement in $O\left(\frac{nm}{λ^{2} \log_{|D|}^{2} (n + m)}\right)$ time (Section “An efficient D-discrete min-plus matrix-vector multiplication algorithm”). The algorithm is an adaptation of Williams’ matrix-vector multiplication algorithm over a finite semiring [13], with some notions similar to those in Frid and Gusfield’s RNA folding algorithm [14].
(b)
  The manner in which the new matrix-vector multiplication algorithm is integrated into the EDDC algorithm can be generalized to algorithms for a family of related problems, denoted VMT problems [11], under certain discreteness assumptions. This family includes many important problems from the domains of RNA folding and CFG parsing. An example of such a problem is the single strand RNA folding problem [15] under discrete scoring schemes. Our new matrix-vector multiplication algorithm can be integrated into an algorithm for the latter problem to yield an $O\left(\frac{n^{3}}{\log^{2} n}\right)$ time algorithm, improving the best previously known asymptotic time bound for the problem (see Section “Online VMT algorithms”).
3.
We extend our approach to exploit run-length encodings of the input strings, assuming some restrictions on the cost functions. This reduces the time and space complexities of the algorithm to $O (| Σ | n^{2} + \frac{| Σ | nMP (ñ)}{ñ})$ and $O (| Σ | nñ)$ , respectively, where $ñ$ is the length of the run-length encoding of the input (Section “Additional acceleration using run-length encoding”).

The rest of the paper is organized as follows. In Section “A baseline algorithm for the EDDC problem”, a recursive computation for EDDC and its implementation using dynamic programming (DP) is presented. Section “A matrix multiplication based algorithm for EDDC” shows how to accelerate the algorithm by incorporating efficient min-plus matrix multiplication subroutines. In Section “An online algorithm for EDDC using min-plus matrix-vector multiplication for discrete cost functions”, an efficient min-plus matrix-vector multiplication algorithm is described for matrices and vectors whose differences between adjacent entries are taken from a finite integer interval. This algorithm can be used for obtaining an accelerated online version of the EDDC algorithm, as well as for improving time complexities of several related problems. Section “Additional acceleration using run-length encoding” describes a variant of the EDDC algorithm that exploits run-length encoding. A comparison between this and previous works is given in Section “A comparison with previous works”, and Section “Conclusions and discussion” gives a concluding discussion. Additional proofs omitted from the main manuscript are given in the Appendix.

A baseline algorithm for the EDDC problem

In this section, we give a simple algorithm for the EDDC problem. We start by showing some recursive properties of the problem, and then formulate a straightforward dynamic programming implementation for the recursive computation.

The recurrence formula

Our recursive formulas resemble previous formulations [6, 9], yet solve a slightly more general variant of the problem (see discussion in Section “A comparison with previous works”). Since the proof of correctness of these recursive formulas is similar to previous ones, we defer it to Appendix “Correctness of the recursive computation”.

A (strict) partition of a string w of length at least 2 is a pair of strings $(w^{a}, w^{b})$, such that $w = w^{a} w^{b}$ and $w^{a}, w^{b} \neq ε$. Denote by P (w) the set of all partitions of w. For example, for w = abac, P (w) = {(a,bac),(ab,ac),(aba,c)}.
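As a quick illustration of the definition, the set P (w) can be enumerated by choosing every interior split point. The helper name `partitions` is an assumption made for this sketch.

```python
def partitions(w):
    """All strict partitions (w_a, w_b) of w: both parts are non-empty."""
    return [(w[:h], w[h:]) for h in range(1, len(w))]

print(partitions("abac"))
```

A string of length m has m - 1 strict partitions, so a string of length 1 has none.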

For a source string which is either empty or contains a single letter and a target string t, Equations 1 to 3 (Figure 1) describe a recursive EDDC computation. This computation interleaves, in a mutually recursive manner, the computation of an additional special value ed^′ (α,t), where ed^′ (α,t) is defined to be the minimum cost of a non-reducing edit script from α to t that does not start with a mutation (t is required to contain at least two letters).

$$ed(ε, t) = \begin{cases} 0, & t = ε, \\ \min \{ ins(α) + ed(α, t) \mid α \in Σ \}, & \text{otherwise}. \end{cases}$$

(1)

$$ed(α, t) = \begin{cases} del(α), & t = ε, \\ mut(α, β), & t = β, \\ \min \{ mut(α, β) + ed'(β, t) \mid β \in Σ \}, & \text{otherwise}. \end{cases}$$

(2)

$$ed'(α, t) = \min \left\{ \begin{array}{l} ed(α, t^{a}) + ed(ε, t^{b}), \\ ed(ε, t^{a}) + ed(α, t^{b}), \\ dup(α) + ed(α, t^{a}) + ed(α, t^{b}) \end{array} \;\middle|\; (t^{a}, t^{b}) \in P(t) \right\} \quad (t \text{ is of length} \geq 2)$$

(3)

Symmetrically, Equations 4 to 6 give the recursive computation for a source string s and a target string which is either empty or contains a single letter. Here, ed^′ (s,α) is defined as the minimum cost of a non-generating edit script from s to α which does not end with a mutation (s is required to contain at least two letters).

$$ed(s, ε) = \begin{cases} 0, & s = ε, \\ \min \{ ed(s, α) + del(α) \mid α \in Σ \}, & \text{otherwise}. \end{cases}$$

(4)

$$ed(s, α) = \begin{cases} ins(α), & s = ε, \\ mut(β, α), & s = β, \\ \min \{ ed'(s, β) + mut(β, α) \mid β \in Σ \}, & \text{otherwise}. \end{cases}$$

(5)

$$ed'(s, α) = \min \left\{ \begin{array}{l} ed(s^{a}, α) + ed(s^{b}, ε), \\ ed(s^{a}, ε) + ed(s^{b}, α), \\ ed(s^{a}, α) + ed(s^{b}, α) + cont(α) \end{array} \;\middle|\; (s^{a}, s^{b}) \in P(s) \right\} \quad (s \text{ is of length} \geq 2)$$

(6)

In case both source string s and target string t are of length at least 2, the following equation can be used for computing ed (s,t) (Figure 2(a)):

$$ed(s, t) = \min \left\{ \begin{array}{l} ed(s, α) + ed(α, t), \\ ed(s^{a}, t^{a}) + ed(s^{b}, α) + ed(α, t^{b}) \end{array} \;\middle|\; \begin{array}{l} (s^{a}, s^{b}) \in P(s), \\ (t^{a}, t^{b}) \in P(t), \\ α \in Σ \end{array} \right\} \quad (s \text{ and } t \text{ are of lengths} \geq 2)$$

(7)

For allowing efficient computation, Equation 7 can be replaced by Equations 8 and 9, which are computed in a mutually recursive manner to yield an equivalent computation (Figure 2(b) and Figure 2(c), respectively).

$$ed(s, t) = \min \left\{ \begin{array}{l} ed(s, α) + ed(α, t), \\ edt^{α}(s^{a}, t) + ed(s^{b}, α) \end{array} \;\middle|\; \begin{array}{l} (s^{a}, s^{b}) \in P(s), \\ α \in Σ \end{array} \right\} \quad (s \text{ and } t \text{ are of lengths} \geq 2)$$

(8)

$$edt^{α}(s, t) = \min \left\{ ed(s, t^{a}) + ed(α, t^{b}) \;\middle|\; (t^{a}, t^{b}) \in P(t) \right\} \quad (t \text{ is of length} \geq 2)$$

(9)

All base-cases of the above recursive equations are implied from Property 1 in a straightforward manner.

Theorem 1.

EDDC is correctly solved by Equations 1-9.

The proof of Theorem 1 appears in Appendix “Correctness of the recursive computation”.
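To make the mutual recursion of Equations 1 to 6, 8 and 9 easier to follow, here is a direct memoized transcription. It is a minimal sketch rather than the paper's algorithm: it works on substrings instead of index pairs, takes exponential-size caches for granted, and assumes that the supplied costs already satisfy Property 1. The names `make_eddc`, `gen`, `gen_prime`, `red`, `red_prime` and `edt`, and the example cost values, are assumptions introduced for illustration.

```python
from functools import lru_cache

def make_eddc(sigma, ins, dele, dup, cont, mut):
    """Return ed(s, t) computed by Equations 1-6, 8 and 9 (costs must obey Property 1)."""
    def parts(w):                       # strict partitions P(w)
        return [(w[:h], w[h:]) for h in range(1, len(w))]

    @lru_cache(maxsize=None)
    def gen(a, t):                      # a is "" or one letter: Equations 1 and 2
        if a == "":
            return 0 if t == "" else min(ins[x] + gen(x, t) for x in sigma)
        if t == "":
            return dele[a]
        if len(t) == 1:
            return mut[a, t]
        return min(mut[a, b] + gen_prime(b, t) for b in sigma)

    @lru_cache(maxsize=None)
    def gen_prime(a, t):                # Equation 3, len(t) >= 2
        return min(min(gen(a, ta) + gen("", tb),
                       gen("", ta) + gen(a, tb),
                       dup[a] + gen(a, ta) + gen(a, tb)) for ta, tb in parts(t))

    @lru_cache(maxsize=None)
    def red(s, a):                      # a is "" or one letter: Equations 4 and 5
        if a == "":
            return 0 if s == "" else min(red(s, x) + dele[x] for x in sigma)
        if s == "":
            return ins[a]
        if len(s) == 1:
            return mut[s, a]
        return min(red_prime(s, b) + mut[b, a] for b in sigma)

    @lru_cache(maxsize=None)
    def red_prime(s, a):                # Equation 6, len(s) >= 2
        return min(min(red(sa, a) + red(sb, ""),
                       red(sa, "") + red(sb, a),
                       red(sa, a) + red(sb, a) + cont[a]) for sa, sb in parts(s))

    @lru_cache(maxsize=None)
    def edt(a, s, t):                   # Equation 9, len(t) >= 2
        return min(ed(s, ta) + gen(a, tb) for ta, tb in parts(t))

    @lru_cache(maxsize=None)
    def ed(s, t):
        if len(s) <= 1:
            return gen(s, t)
        if len(t) <= 1:
            return red(s, t)
        best = min(red(s, a) + gen(a, t) for a in sigma)        # Equation 8, first case
        for sa, sb in parts(s):                                  # Equation 8, second case
            best = min(best, min(edt(a, sa, t) + red(sb, a) for a in sigma))
        return best

    return ed

# Example costs (assumed for illustration): ins = del = mut = 2, dup = cont = 1.
SIGMA = "ab"
TWO = {x: 2 for x in SIGMA}
ONE = {x: 1 for x in SIGMA}
MUT = {(x, y): 0 if x == y else 2 for x in SIGMA for y in SIGMA}
ed = make_eddc(SIGMA, TWO, TWO, ONE, ONE, MUT)
print(ed("a", "aaa"), ed("aab", "ab"))
```

With these example costs, turning a into aaa uses two duplications, and turning aab into ab uses one contraction.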

A baseline dynamic-programming algorithm for EDDC

In this section, we describe a DP algorithm implementing the recursive EDDC computation given by Equations 1 to 9, which is the basis for improvements introduced later in this paper.

Let s and t be the input source and target strings, respectively, and for simplicity assume both strings are of length n. The algorithm maintains the following matrices for storing solutions to sub-instances of the input which occur along the recursive computation. All matrices are of size (n + 1) × (n + 1), with row and column indices in the range 0,1,2,…,n.

For every α ∈ Σ, the algorithm maintains matrices $S'^{α}$, $S^{α}$, $T'^{α}$, and $T^{α}$. Entries $S'^{α}[k,i]$, $S^{α}[k,i]$, $T'^{α}[l,j]$ and $T^{α}[l,j]$ are used for storing the values $ed'(s_{k,i}, α)$, $ed(s_{k,i}, α)$, $ed'(α, t_{l,j})$, and $ed(α, t_{l,j})$, respectively.
Two matrices $S^{ε}$ and $T^{ε}$, whose entries $S^{ε}[k,i]$ and $T^{ε}[l,j]$ are used for storing values of the forms $ed(s_{k,i}, ε)$ and $ed(ε, t_{l,j})$, respectively.
For every α ∈ Σ, a matrix $EDT^{α}$ whose entries $EDT^{α}[i,j]$ are used for storing the values $edt^{α}(s_{0,i}, t_{0,j})$.
A matrix ED, whose entries ED [i,j] are used for storing the values $ed(s_{0,i}, t_{0,j})$.

The algorithm consists of two stages: Stage 1 computes solutions to all sub-instances in which one of the substrings is either empty or single-lettered, applying Equations 1 to 6. Stage 2 uses the values computed in Stage 1 in order to compute all prefix-to-prefix solutions ed (s_0,i,t_0,j) and edt^α(s_0,i,t_0,j) according to Equations 8 and 9. In particular, Stage 2 computes the edit distance ed (s_0,n,t_0,n) = ed (s,t) between the two complete strings. The entries are traversed in an order which guarantees that upon computing each entry, all solutions to sub-instances appearing on the right-hand side of the relevant equation are already computed and contained in the corresponding entries. Algorithm 1 gives the pseudo-code for this computation.

Algorithm 1 BASELINE-EDDC( s , t )
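The pseudo-code of Algorithm 1 is not reproduced here. As a partial stand-in, the following sketch fills only the target-side matrices of Stage 1 (Equations 1 to 3), visiting substrings of t in order of increasing length so that every entry on the right-hand side is ready when it is needed. The function name `stage1_target`, the dictionary-of-lists layout, and the example costs are assumptions of this sketch, and the costs are assumed to satisfy Property 1.

```python
def stage1_target(t, sigma, ins, dele, dup, mut):
    """Fill T[a][l][j] = ed(a, t[l:j]), Tp[a][l][j] = ed'(a, t[l:j]), Te[l][j] = ed(eps, t[l:j])."""
    n = len(t)
    inf = float("inf")
    T = {a: [[inf] * (n + 1) for _ in range(n + 1)] for a in sigma}
    Tp = {a: [[inf] * (n + 1) for _ in range(n + 1)] for a in sigma}
    Te = [[inf] * (n + 1) for _ in range(n + 1)]
    for length in range(n + 1):                 # shorter substrings first
        for l in range(n - length + 1):
            j = l + length
            if length == 0:                     # empty target: base cases
                Te[l][j] = 0
                for a in sigma:
                    T[a][l][j] = dele[a]
                continue
            if length == 1:                     # single-letter target
                for a in sigma:
                    T[a][l][j] = mut[a, t[l]]
            else:
                for a in sigma:                 # Equation 3: O(n) split points per entry
                    Tp[a][l][j] = min(
                        min(T[a][l][h] + Te[h][j],
                            Te[l][h] + T[a][h][j],
                            dup[a] + T[a][l][h] + T[a][h][j])
                        for h in range(l + 1, j))
                for a in sigma:                 # Equation 2: O(|Sigma|) per entry
                    T[a][l][j] = min(mut[a, b] + Tp[b][l][j] for b in sigma)
            Te[l][j] = min(ins[a] + T[a][l][j] for a in sigma)   # Equation 1
    return T, Tp, Te

# Example costs (assumed for illustration): ins = del = mut = 2, dup = 1.
SIGMA = "ab"
TWO = {x: 2 for x in SIGMA}
ONE = {x: 1 for x in SIGMA}
MUT = {(x, y): 0 if x == y else 2 for x in SIGMA for y in SIGMA}
T, Tp, Te = stage1_target("aaa", SIGMA, TWO, TWO, ONE, MUT)
print(T["a"][0][3], Te[0][2])
```

The source-side matrices of Stage 1 would be filled symmetrically from Equations 4 to 6, and Stage 2 would then fill ED and the EDT matrices from Equations 8 and 9.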

Complexity analysis of Algorithm 1

The running time of Algorithm 1 is dictated by the total time required to compute all entries in the DP matrices. Each entry is computed according to one of the recursive equations, where the number of operations in such a computation depends on the number of expressions examined on the right-hand side of the corresponding recursive equation. Note that the value of each examined expression is obtained in a constant time, by querying previously computed values stored in the matrices.

The computation of each entry in matrices $T^{ε}$ and $S^{ε}$ and in matrices of the form $T^{α}$ and $S^{α}$ takes O (|Σ|) time, due to Equations 1, 4, 2, and 5, respectively. As there are O (|Σ|) such matrices and each matrix contains O (n²) entries, their overall computation time is O (|Σ|²n²). The computation of each entry in $T'^{α}$ and $S'^{α}$ takes O (n) time, due to Equations 3 and 6, respectively. There are O (|Σ|) such matrices, each of size O (n²), and so the total time for computing all entries in these matrices is O (|Σ|n³). According to Equation 9, computing each entry of the form $EDT^{α}[i,j]$ takes O (n) time, and as there are O (|Σ|n²) such entries the total time for computing all these entries is O (|Σ|n³). According to Equation 8, computing each entry of the form ED [i,j] takes O (|Σ|n) time, and since there are O (n²) such entries, the total time for computing all these entries is again O (|Σ|n³). Thus, the total running time of the algorithm is O (|Σ|n³ + |Σ|²n²). Under the assumption that |Σ| = O (n), the time is O (|Σ|n³). The algorithm requires O (|Σ|n²) space for maintaining the DP matrices.

A matrix multiplication based algorithm for EDDC

In previous work by the authors [11], Vector Multiplication Templates (VMTs) were identified as templates for computational problems sustaining certain properties, such that algorithms for solving these problems can be accelerated using efficient matrix multiplication subroutines (similarly to Valiant’s algorithm for CFG recognition [10]). Intuitively, standard algorithms for VMT problems perform computations that can be expressed in terms of vector multiplications, and these computations can be computed and combined more efficiently using efficient matrix multiplications. In this section, we show that EDDC exhibits such VMT properties, and formulate a new algorithm that incorporates matrix-matrix min-plus multiplications. This algorithm yields a better running time than that of the baseline algorithm in the previous section.

Notations for matrices

For two integers p,q such that p ≤ q, $I_{p,q}$ denotes the interval of integers $I_{p,q} = [p, p + 1, \dots, q - 1]$. We use the notation $A_{n \times m}$ to imply that the matrix A has n rows and m columns, and say that A has the dimensions n × m (row and column indices start at 0). For a subset of row indices I and a subset of column indices J, denote by I × J the region which contains all pairs of indices (i,j), such that i ∈ I and j ∈ J. Define A [I,J] to be the submatrix of A, which is induced by all entries in the region I × J. When I contains a single row i or J contains a single column j, we simplify the notation and write A [i,J] or A [I,j], respectively.

Define the following operations on matrices. Let tr (·) denote the transpose operation for matrices. For a set of matrices $\mathcal{A} = \{A^{1}, A^{2}, \dots, A^{r}\}$ all of the same dimensions n × m, denote by $\min \{\mathcal{A}\}$ the entry-wise min operation over $\mathcal{A}$, whose result is a matrix $C_{n \times m}$, such that $C[i, j] = \min \{A[i, j] \mid A \in \mathcal{A}\}$. $\min \{\mathcal{A}\}$ can be computed in $O(|\mathcal{A}| nm)$ time in a straightforward manner. For matrices $A_{n \times k}$ and $B_{k \times m}$, the min-plus multiplication of A and B, denoted A ⊗ B, results in a matrix $C_{n \times m}$, where the entries of C are defined by C [i,j] = min{A [i,h] + B [h,j] | 0 ≤ h < k}. Naively, A ⊗ B can be computed in O (nkm) operations. Denote the time complexity of a min-plus multiplication of two n × n matrices by MP (n). At present, the asymptotically fastest algorithm for min-plus square matrix multiplication is that of Chan [12], taking $O\left(\frac{n^{3} \log^{3} \log n}{\log^{2} n}\right)$ time.
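The two matrix operations just defined can be written down naively as follows. This is only the straightforward O (nkm) product and the entry-wise minimum, not Chan's algorithm; the function names `min_plus` and `entrywise_min` and the list-of-lists representation are assumptions of this sketch.

```python
def min_plus(A, B):
    """Naive min-plus product: C[i][j] = min over h of A[i][h] + B[h][j]."""
    n, k, m = len(A), len(B), len(B[0])
    return [[min(A[i][h] + B[h][j] for h in range(k)) for j in range(m)]
            for i in range(n)]

def entrywise_min(mats):
    """Entry-wise minimum over a non-empty list of equally sized matrices."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[min(M[i][j] for M in mats) for j in range(cols)] for i in range(rows)]

A = [[0, 3], [2, 1]]
B = [[1, 4], [0, 2]]
print(min_plus(A, B))
print(entrywise_min([A, B]))
```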

In the following observation, we point out how matrix multiplication can be computed as a composition of two parts, where each of the items (1-3) in the observation addresses a partitioning in one of the three dimensions. This will be later used by our recursive computation which is based on such partitioning.

Observation 1.

Let $A_{n \times k}$, $B_{k \times m}$ and $C_{n \times m}$ be matrices, such that C = A ⊗ B (see Figure 3).

1.
For every 0 ≤ h < m, $C[I_{0,n}, I_{0,h}] = A \otimes B[I_{0,k}, I_{0,h}]$ and $C[I_{0,n}, I_{h,m}] = A \otimes B[I_{0,k}, I_{h,m}]$.
2.
For every 0 ≤ h < n, $C[I_{0,h}, I_{0,m}] = A[I_{0,h}, I_{0,k}] \otimes B$ and $C[I_{h,n}, I_{0,m}] = A[I_{h,n}, I_{0,k}] \otimes B$.
3.
For every 0 ≤ h < k, $C = \min \{A[I_{0,n}, I_{0,h}] \otimes B[I_{0,h}, I_{0,m}],\; A[I_{0,n}, I_{h,k}] \otimes B[I_{h,k}, I_{0,m}]\}$.
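Items 1 and 3 of the observation can be checked numerically on a small example. The sketch below uses a naive min-plus product in which a product over an empty inner dimension is taken to be +∞, which is the convention assumed here for the case h = 0; the helper names `min_plus`, `cols` and `observation_holds` are illustrative.

```python
INF = float("inf")

def min_plus(A, B, m):
    """Naive min-plus product; m is the column count of B (explicit so B may have no rows)."""
    return [[min((row[h] + B[h][j] for h in range(len(B))), default=INF)
             for j in range(m)] for row in A]

def cols(M, lo, hi):
    """Submatrix of M made of columns lo, ..., hi - 1."""
    return [row[lo:hi] for row in M]

def observation_holds(A, B, m, h_col, h_inner):
    """Check item 1 (split B's columns at h_col) and item 3 (split the inner index at h_inner)."""
    C = min_plus(A, B, m)
    k = len(B)
    # Item 1: the left and right column blocks of C are computed independently.
    left = min_plus(A, cols(B, 0, h_col), h_col)
    right = min_plus(A, cols(B, h_col, m), m - h_col)
    item1 = all(l + r == c for l, r, c in zip(left, right, C))
    # Item 3: C is the entry-wise min of two products over complementary inner ranges.
    P = min_plus(cols(A, 0, h_inner), B[:h_inner], m)
    Q = min_plus(cols(A, h_inner, k), B[h_inner:], m)
    item3 = [[min(p, q) for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)] == C
    return item1 and item3

A = [[0, 3, 5], [2, 1, 4]]          # 2 x 3
B = [[1, 4], [0, 2], [3, 0]]        # 3 x 2
print(all(observation_holds(A, B, 2, hc, hi) for hc in range(2) for hi in range(3)))
```

Item 2 is the row-wise analogue of item 1 and could be checked the same way by slicing the rows of A.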

EDDC expressed via min-plus vector multiplications

The key observation that enables a further improvement of the worst-case bounds of EDDC is that Equations 3, 6, 8, and 9 can be expressed in terms of min-plus vector multiplications. Under the assumption that all solutions to sub-instances appearing on the right-hand side of the equations are computed and stored in the corresponding entries, these equations can be written as follows:

$$\begin{aligned} ed'(α, t_{l,j}) &\overset{Eq.3}{=} \min \left\{ \begin{array}{l} T^{α}[l, h] + T^{ε}[h, j], \\ T^{ε}[l, h] + T^{α}[h, j], \\ dup(α) + T^{α}[l, h] + T^{α}[h, j] \end{array} \;\middle|\; l < h < j \right\} \\ &= \min \left\{ \begin{array}{l} T^{α}[l, I_{l+1,j}] \otimes T^{ε}[I_{l+1,j}, j], \\ T^{ε}[l, I_{l+1,j}] \otimes T^{α}[I_{l+1,j}, j], \\ dup(α) + T^{α}[l, I_{l+1,j}] \otimes T^{α}[I_{l+1,j}, j] \end{array} \right\}, \end{aligned}$$

(10)

$$\begin{aligned} ed'(s_{k,i}, α) &\overset{Eq.6}{=} \min \left\{ \begin{array}{l} S^{α}[k, h] + S^{ε}[h, i], \\ S^{ε}[k, h] + S^{α}[h, i], \\ S^{α}[k, h] + S^{α}[h, i] + cont(α) \end{array} \;\middle|\; k < h < i \right\} \\ &= \min \left\{ \begin{array}{l} S^{α}[k, I_{k+1,i}] \otimes S^{ε}[I_{k+1,i}, i], \\ S^{ε}[k, I_{k+1,i}] \otimes S^{α}[I_{k+1,i}, i], \\ S^{α}[k, I_{k+1,i}] \otimes S^{α}[I_{k+1,i}, i] + cont(α) \end{array} \right\}, \end{aligned}$$

(11)

$$\begin{aligned} ed(s_{0,i}, t_{0,j}) &\overset{Eq.8}{=} \min \left\{ \begin{array}{l} S^{α}[0, i] + T^{α}[0, j], \\ EDT^{α}[k, j] + S^{α}[k, i] \end{array} \;\middle|\; \begin{array}{l} 0 < k < i, \\ α \in Σ \end{array} \right\} \\ &= \min \left\{ \begin{array}{l} S^{α}[0, i] + T^{α}[0, j], \\ tr(S^{α})[i, I_{1,i}] \otimes EDT^{α}[I_{1,i}, j] \end{array} \;\middle|\; α \in Σ \right\}, \end{aligned}$$

(12)

$$\begin{aligned} edt^{α}(s_{0,i}, t_{0,j}) &\overset{Eq.9}{=} \min \{ ED[i, l] + T^{α}[l, j] \mid 0 < l < j \} \\ &= ED[i, I_{1,j}] \otimes T^{α}[I_{1,j}, j]. \end{aligned}$$

(13)

The algorithm

The new algorithm has the same two stages as the baseline algorithm. It can be observed that the computation of all matrices of the forms $S'^{α}$, $S^{α}$, $S^{ε}$, $T'^{α}$, $T^{α}$, and $T^{ε}$ performed in Stage 1 of the baseline algorithm adheres to the Inside-VMT requirements as given in Definition 1 in [11]. The application of the generic Inside-VMT algorithm [11] to this computation is immediate, and therefore we focus only on adapting the method to the computation of matrices of the form $EDT^{α}$ and ED conducted in Stage 2 of the baseline algorithm.

After allocating all dynamic programming matrices and performing Stage 1 of the algorithm, the COMPUTE-MATRIX procedure is used for implementing Stage 2 (see Algorithm 2 and Figure 4). This is a divide-and-conquer recursive procedure that accepts a region I × J and computes the values in all entries of ED and $EDT^{α}$ within the region. The procedure partitions the given region into two parts and performs recursive calls on each part. In order to maintain a required precondition, the procedure applies min-plus matrix multiplication subroutines between recursive calls. The correctness proof of Algorithm 2 appears in Appendix “Correctness of Algorithm 2”.

Algorithm 2 MATRIX-EDDC( s , t )

Time complexity analysis

The time complexity of Algorithm 2 can be established by an identical analysis to that of the Inside-VMT algorithm of [11] (see Section 3.3.1 of [11]). For completeness, we repeat this analysis here for Stage 2 of the computation, where the time complexity of Stage 1 can be inferred similarly. The complexity is expressed as a function of the bound MP (n) over the running time of a min-plus multiplication of two n × n matrices. Note that MP (n) = Ω (n²), as the input and output matrices of the computation contain O (n²) entries. We assume here that $MP(n) = Ω(n^{2+δ})$ for some constant 0 < δ ≤ 1, which is true for the current best bound over MP (n) [12]. In some of the expressions developed below, we avoid the “big O” notation and give explicit bounds over the number of operations, as constant factors that may be hidden due to this notation cannot be ignored in the analysis.

The initialization time of Stage 2 is dominated by the matrix multiplications and entry-wise min operations performed in line 4 of Algorithm 2. This initialization performs 2|Σ| multiplications of matrices of dimensions (n-1)×1 with matrices of dimensions 1×(n - 1), which can naively be implemented in O (|Σ|n²) time, and an entry-wise min operation over a set containing |Σ| matrices of dimensions (n - 1) × (n - 1), which is also implemented in O (|Σ|n²) time.

The computation of the remaining entries in ED and EDT^α matrices is done within recursive calls to the COMPUTE-MATRIX procedure. Observe that when COMPUTE-MATRIX is called over a region of dimensions r × r for some even integer r ≥ 2, the procedure applies a vertical partitioning and performs two recursive calls over regions of dimensions $r \times \frac{r}{2}$ (lines 6 and 8). For a call over a region of dimensions $r \times \frac{r}{2}$ , the procedure applies a horizontal partitioning and performs two recursive calls over regions of dimensions $\frac{r}{2} \times \frac{r}{2}$ (lines 11 and 13). For simplicity, assume that n - 1 = 2^p for some integer p ≥ 0, and thus it follows that the dimensions of all regions occurring as inputs in recursive calls are either 1×1, or of the form r × r or $r \times \frac{r}{2}$ for some even integer r. Denote by T (x × y) an upper bound over the number of operations conducted when applying COMPUTE-MATRIX over a region of dimensions x × y.

From line 2 of the procedure, T (1 × 1) = O (|Σ|). Consider a region of dimensions r × r for an even integer r ≥ 2. For such a region, the code in lines 5-8 of COMPUTE-MATRIX is executed. In order to implement line 7 for some α ∈ Σ, it is necessary to compute first a min-plus matrix multiplication C = A ⊗ B, where the matrix A = ED [ I_i,k,I_j,h] is of dimensions $r \times \frac{r}{2}$ , the matrix B = T^α[ I_j,h,I_h,l] is of dimensions $\frac{r}{2} \times \frac{r}{2}$ , and the resulting matrix C is of dimensions $r \times \frac{r}{2}$ . Due to Observation 1, it is possible to compute independently the upper and lower halves of C, where $C [I_{0, \frac{r}{2}}, I_{0, \frac{r}{2}}] = A [I_{0, \frac{r}{2}}, I_{0, \frac{r}{2}}] \otimes B$ and $C [I_{\frac{r}{2}, r}, I_{0, \frac{r}{2}}] = A [I_{\frac{r}{2}, r}, I_{0, \frac{r}{2}}] \otimes B$ . The time required to conduct this computation is $2 MP (\frac{r}{2})$ . Then, it is required to compute min {EDT^α[ I_i,k, I_h,l], C} and to update EDT^α[ I_i,k,I_h,l] to be the result of this operation, a computation which requires at most cr² operations for some constant c. Since line 7 is computed for every α ∈ Σ, the total number of applied operations due to this line is at most $| Σ | (2 MP (\frac{r}{2}) + c r^{2})$ . Besides line 7, two recursive calls are made in lines 6 and 8 over regions of dimensions $r \times \frac{r}{2}$ , and therefore we get

T (r \times r) \leq 2 T (r \times \frac{r}{2}) + | Σ | (2 MP (\frac{r}{2}) + c r^{2}) .

When the procedure is called over a region of dimensions $r \times \frac{r}{2}$ , the code in lines 10-13 is executed. Similarly as above, it can be shown that the computation in line 12 requires at most $| Σ | (MP (\frac{r}{2}) + \frac{c r^{2}}{4})$ operations, and due to the two recursive calls in lines 11 and 13 over regions of dimensions $\frac{r}{2} \times \frac{r}{2}$ , we get

T (r \times \frac{r}{2}) \leq 2 T (\frac{r}{2} \times \frac{r}{2}) + | Σ | (MP (\frac{r}{2}) + \frac{c r^{2}}{4}) .

Therefore,

T (r \times r) \leq 4 T (\frac{r}{2} \times \frac{r}{2}) + | Σ | (4 MP (\frac{r}{2}) + \frac{3 c r^{2}}{2}) .

The explicit form of the above recursive equation can be established by the Master Theorem (under the assumption that MP (n) = Ω ($n^{2 + δ}$), see Chapter 4 in [16]), yielding the expression T (r × r) = O (|Σ|MP (r)). Thus, the time complexity of Stage 2 of the algorithm is O (|Σ|MP (n)). The time analysis of the Inside-VMT algorithm of [11], applied to implement Stage 1 of the algorithm, yields the same bound of O (|Σ|MP (n)), and thus O (|Σ|MP (n)) is the time complexity of the entire algorithm. Using the currently asymptotically fastest algorithm for min-plus matrix multiplication [12], for which $MP (n) = O (\frac{n^{3} \log^{3} \log n}{\log^{2} n})$ , we get the currently best explicit time bound for EDDC of $O (\frac{| Σ | n^{3} \log^{3} \log n}{\log^{2} n})$ .

An online algorithm for EDDC using min-plus matrix-vector multiplication for discrete cost functions

In this section, we present an EDDC algorithm which is based on the general algorithm (given in Section “A matrix multiplication based algorithm for EDDC”) and improves its time complexity by a factor of $O (\log^{3} \log n)$ . This EDDC algorithm is intended for integer cost functions, but can also be applied to rational cost functions after they are scaled. It is an online algorithm: it can process the input strings letter by letter with a guaranteed low time bound for any prefix of the input. The EDDC algorithm presented in this section is based on a D-discrete matrix-vector min-plus multiplication algorithm we developed, which is generic and may be applied to other problems as well.

D-discrete matrices and the EDDC problem with integer costs

Given a matrix of integers A_n × m and indices 1 ≤ i < n and 0 ≤ j < m, call the pair of entries A [ i - 1,j] and A [ i,j] adjacent. Let D = I_a,b = [ a,a + 1,…,b - 1] be an integer interval for some integers a < b. Say that matrix A is D-discrete if for every pair of adjacent entries A [ i - 1,j] and A [ i,j], their difference A [ i - 1,j] - A [ i,j] is in D.
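The definition can be checked mechanically. The following Python sketch is only an illustration of the definition above; the function name `is_d_discrete` and the example matrix are hypothetical and not taken from the paper.

```python
def is_d_discrete(A, a, b):
    """Return True if A[i-1][j] - A[i][j] lies in D = {a, ..., b-1}
    for every pair of (vertically) adjacent entries of A."""
    return all(
        a <= A[i - 1][j] - A[i][j] < b
        for i in range(1, len(A))
        for j in range(len(A[0]))
    )

# Hypothetical 3 x 2 integer matrix; its column-wise differences are 1, 0 and -1, 1.
A = [[3, 5],
     [2, 6],
     [2, 5]]
in_wide_interval = is_d_discrete(A, -1, 2)    # D = {-1, 0, 1}
in_narrow_interval = is_d_discrete(A, 0, 2)   # D = {0, 1}
```

Here the matrix is D-discrete for D = I_-1,2 but not for D = I_0,2, since the difference A [ 0,1] - A [ 1,1] = -1 falls outside the narrower interval.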

Consider the EDDC problem in the case of integer costs for all edit operations. In Lemma 1, we show that in this case, all matrix multiplications applied by Algorithm 2 are between D-discrete matrices, with respect to a certain integer interval D. The proof is similar to that of Masek and Paterson for simple edit distance [17]. This property allows conducting such matrix multiplications using a faster algorithm, described in Section “An efficient D-discrete min-plus matrix-vector multiplication algorithm”.

Lemma 1.

Given strings s and t and an integer cost function for EDDC, all matrix multiplications applied by Algorithm 2 are over D-discrete matrices, where D = I_a,b is determined according to the cost function by a = -max{del (α) | α ∈ Σ} and b = max{ins (α) | α ∈ Σ} + 1.

The proof of Lemma 1 appears in Appendix “Proofs to lemmas corresponding to the EDDC algorithm for discrete cost functions”.
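As a small illustration of how the interval of Lemma 1 follows from the cost function, the sketch below computes a and b from insertion and deletion costs. The function name and the cost values are hypothetical; only the two formulas for a and b come from the lemma.

```python
def discreteness_interval(ins_cost, del_cost):
    """Return (a, b) such that D = I_{a,b} = {a, ..., b-1}, as in Lemma 1."""
    a = -max(del_cost.values())
    b = max(ins_cost.values()) + 1
    return a, b

# Hypothetical integer costs over the alphabet {a, b}.
ins_cost = {"a": 2, "b": 3}
del_cost = {"a": 1, "b": 4}
a, b = discreteness_interval(ins_cost, del_cost)
D = range(a, b)  # the integer interval I_{a,b}
```

For these costs the interval is D = I_-4,4, so |D| = 8.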

D-discrete matrices and vectors

Here, we present some properties of D-discrete matrices and vectors that are similar to those previously observed in [14, 17]. The following lemmas show that the set of D-discrete matrices is closed under the min-plus multiplication and entry-wise min operations. In what follows, let D=I_a,b be some integer interval. The proofs of the following lemmas appear in Appendix “Proofs to lemmas corresponding to the EDDC algorithm for discrete cost functions”.

Lemma 2.

Let X, Y and Z be matrices, such that X and Y contain only integer elements and Z = X ⊗ Y. If X is D-discrete, then Z is D-discrete.

Lemma 3.

Let X, Y and Z be matrices, such that X and Y contain only integer elements and Z = min{X,Y}. If X and Y are D-discrete, then Z is D-discrete.
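The two closure properties can be exercised numerically. The following Python sketch is an illustration only and does not replace the proofs in the appendix; all function names and the choice D = I_-2,3 are assumptions made for the example.

```python
import random

def is_d_discrete(M, a, b):
    """True if M[i-1][j] - M[i][j] is in {a, ..., b-1} for all adjacent entries."""
    return all(a <= M[i - 1][j] - M[i][j] < b
               for i in range(1, len(M)) for j in range(len(M[0])))

def min_plus(X, Y):
    """Min-plus product: Z[i][j] = min over k of X[i][k] + Y[k][j]."""
    return [[min(X[i][k] + Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def entrywise_min(X, Y):
    return [[min(u, v) for u, v in zip(rx, ry)] for rx, ry in zip(X, Y)]

def random_d_discrete(rows, cols, a, b, rng):
    """Build a D-discrete matrix by drawing each column difference from D."""
    M = [[rng.randint(0, 20) for _ in range(cols)]]
    for _ in range(rows - 1):
        M.append([v - rng.randrange(a, b) for v in M[-1]])
    return M

rng = random.Random(0)
a, b = -2, 3
X = random_d_discrete(4, 3, a, b, rng)
W = random_d_discrete(4, 3, a, b, rng)
# Y is an arbitrary integer matrix: Lemma 2 only requires X to be D-discrete.
Y = [[rng.randint(-9, 9) for _ in range(5)] for _ in range(3)]
product_is_discrete = is_d_discrete(min_plus(X, Y), a, b)   # Lemma 2
min_is_discrete = is_d_discrete(entrywise_min(X, W), a, b)  # Lemma 3
```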

The following lemma implies that when the absolute difference between the first elements of two q-length D-discrete vectors x and y is sufficiently large, one of the vectors can be immediately taken as the result of the min(x,y) operation.

Lemma 4.

Let x =(x₀,…,x_{q - 1}) and y = (y₀,…,y_{q - 1}) be two q-length D-discrete vectors for some q > 0. If y₀ - x₀ ≥ q |D|, then min(x,y) = x.

In what follows, fix an integer q > 1. Let x = (x₀,x₁,…,x_{q - 1}) be a q-length D-discrete vector. By definition, for every 0 < i < q, x_{i - 1} - x_i is an integer within D, and so x_{i - 1} - x_i - a is an integer within the interval I_0,b-a = I_0,|D|. Therefore, the series x₀ - x₁ - a, x₁ - x₂ - a, …, x_{q - 2} - x_{q - 1} - a can be thought of as a series of q - 1 digits in a |D|-base representation of an integer $Δ x = \sum_{0 \leq i < q - 1} | D |^{i} (x_{i} - x_{i + 1} - a)$ , where $0 \leq Δ x < | D |^{q - 1}$ . The Δ-encoding of x is defined to be the pair of integers (x₀,Δ x). We write x = (x₀,Δ x) to indicate that (x₀,Δ x) is the Δ-encoding of x, where x₀ is called the offset of x and Δ x is called the canonical index of x. Note that for two q-length D-discrete vectors x = (x₀,Δ x) and y = (y₀,Δ y), Δ x = Δ y if and only if there is a constant c such that x_i - y_i = c for every 0 ≤ i < q. In particular, x and y share the same Δ-encoding if and only if they are identical. Call a D-discrete vector of the form x = (0,Δ x) (with an offset x₀ = 0) a canonical vector.
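The Δ-encoding can be made concrete with a short Python sketch. The function names `delta_encode` and `delta_decode` and the example vectors are hypothetical; the digit formula is the one given in the definition of Δ x.

```python
def delta_encode(x, a, b):
    """Return the Delta-encoding (offset, canonical index) of a D-discrete
    vector x, where D = {a, ..., b-1}."""
    size = b - a  # |D|
    delta = 0
    for i in range(len(x) - 1):
        digit = x[i] - x[i + 1] - a  # a digit in base |D|
        if not 0 <= digit < size:
            raise ValueError("vector is not D-discrete")
        delta += digit * size ** i
    return x[0], delta

def delta_decode(offset, delta, q, a, b):
    """Rebuild the q-length vector from its Delta-encoding."""
    size = b - a
    x = [offset]
    for _ in range(q - 1):
        delta, digit = divmod(delta, size)  # peel off the lowest digit
        x.append(x[-1] - digit - a)
    return x

a, b = -1, 2           # D = {-1, 0, 1}, so |D| = 3
x = [5, 6, 6, 5]
y = [0, 1, 1, 0]       # x shifted by the constant 5
enc_x = delta_encode(x, a, b)
enc_y = delta_encode(y, a, b)
```

For x the digits are 0, 1, 2, so Δ x = 0 + 1·3 + 2·9 = 21. The vector y differs from x by a constant and therefore has the same canonical index, with offset 0.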

The next observations show that both operations of entry-wise min and min-plus multiplication, with respect to D-discrete matrices and vectors, can be expressed via canonical vectors.

Observation 2.

Let x = (x₀,Δ x), y = (y₀,Δ y), and z = (z₀,Δ z) be q-length D-discrete vectors such that z = min (x,y). Then, for every number c it holds that min((x₀ - c,Δ x),(y₀ - c,Δ y)) = (z₀ - c,Δ z). In particular, min((0,Δ x),(y₀ - x₀,Δ y)) = (z₀ - x₀,Δ z).

Observation 3.

Let B_q × q be a D-discrete matrix, x = (0,Δ x) a q-length canonical D-discrete vector, and y = (y₀,Δ y) a q-length D-discrete vector, such that B ⊗ x = y. Then, for any number c it holds that B ⊗ (c,Δ x) = (y₀ + c,Δ y).

An efficient D-discrete min-plus matrix-vector multiplication algorithm

Let A_n×m be a D-discrete matrix, and fix a constant $\frac{1}{\log_{| D |} (n + m)} < λ < 1$ . We give an algorithm for min-plus D-discrete matrix-vector multiplication that, after preprocessing A in $O (\frac{nm {(n + m)}^{λ}}{| D |})$ time and $O (\frac{nm {(n + m)}^{λ}}{| D | λ^{2} {log}_{| D |}^{2} (n + m)})$ space, computes A ⊗ x for any m-length D-discrete vector x in $O (\frac{nm}{λ^{2} {log}_{| D |}^{2} (n + m)})$ time under the RAM computational model. Our algorithm is an adaptation of Williams’ algorithm [13] for finite semiring matrix-vector multiplications, with some notions similar to Frid and Gusfield’s acceleration technique for RNA folding [14]. It follows the concept of the Four-Russians Algorithm [18] (see also [14, 17, 19]), i.e. preprocessing reoccurring computations, tabulating their results in lookup tables, and retrieving such results in order to accelerate the general computation.

Specifically, the algorithm stores preprocessed computations of two kinds: matrix-vector min-plus multiplications, and vector entry-wise minima, where vectors and matrices are of q-length and of q × q dimensions, respectively, for $q = ⌊ λ \log_{| D |} (n + m) ⌋$ . For conducting this preprocessing, we will assume that |D| ≤ n + m, otherwise q = 0 and the multiplication cannot be accelerated using the suggested method. In addition, for simplicity of the analysis we assume that q³ ≤ min(n,m). If this does not hold, a multiplication of the form A ⊗ x can be naively computed in the relatively efficient time complexity of $O (max (n, m) {log}_{| D |}^{3} (n + m))$ . The space complexity of the preprocessing phase is higher than the O (n m) space complexity of the standard multiplication algorithm and depends on the constant λ, ranging between O(n m|D|) and $O (\frac{nm (n + m)}{{log}_{| D |}^{2} (n + m)})$ for λ values between $\frac{1}{\log_{| D |} (n + m)}$ and 1, correspondingly. The lower bound $\frac{1}{\log_{| D |} (n + m)}$ for λ is chosen so that the time complexity $O (\frac{nm}{λ^{2} {log}_{| D |}^{2} (n + m)})$ of matrix-vector multiplications involving the preprocessed matrix would be better than the naive time complexity O (n m).

Preprocessing of matrix-vector ⊗ computations

Let $n^{'} = q ⌊\frac{n}{q}⌋$ and $m^{'} = q ⌊\frac{m}{q}⌋$ , and note that 0 ≤ n - n^′ < q and 0 ≤ m - m^′ < q. Let Q_k denote the q-length integer interval Q_k = [ k q,k q + 1,…,(k + 1) q - 1]. The sub-matrix $A^{'} = A [I_{0, n^{'}}, I_{0, m^{'}}]$ is decomposed into $\frac{n^{'} m^{'}}{q^{2}}$ blocks B_i,j = A[ Q_i,Q_j] where $i = 0, 1, \dots, \frac{n^{'}}{q} - 1$ and $j = 0, 1, \dots, \frac{m^{'}}{q} - 1$ . For each block B, a corresponding lookup table MUL_B is created, which tabulates min-plus multiplications between B and all canonical q-length D-discrete vectors. For the canonical vector x = (0,Δ x), the result y = B ⊗ x is stored in the entry MUL_B[ Δ x] by its encoding (y₀,Δ y) (by Lemma 2, y is also D-discrete and thus can be encoded accordingly).

The multiplication of a q × q block with a q-length vector can be done in O (q²) time in a straightforward manner and the encoding of the resulting q-length vector requires additional O (q) time. There are $\frac{n^{'} m^{'}}{q^{2}}$ blocks in the decomposition of A^′, each of which is multiplied by $| D |^{q - 1}$ canonical vectors, and so the total time required for computing these multiplications is $O (q^{2} | D |^{q - 1} \frac{nm}{q^{2}}) = O (| D |^{q - 1} nm) = O (\frac{nm {(n + m)}^{λ}}{| D |})$ .

Let (y₀,Δ y) be the Δ-encoding of some result y = B ⊗ x computed in the preprocessing of A^′ as described above. Note that $y_{0} = \min_{0 \leq i < q} \{B [0, i] + x [i]\} \leq \min_{0 \leq i < q} \{2 \max (B [0, i], x [i])\}$ . Therefore, the number of bits in the binary representation of y₀ is at most one plus the maximum number of bits required for the representations of B [ 0,i] and x [ i] for some 0 ≤ i < q. Also, note that $0 \leq Δ y < | D |^{q - 1} = \frac{{(n + m)}^{λ}}{| D |}$ , and Δ y can be represented in O (log(n + m)) bits. Thus, under the RAM computational model assumptions, each such encoding (y₀,Δ y) requires O (1) space units and can be written and read in constant time, and therefore the overall space complexity for maintaining all MUL_B tables is $O (\frac{| D |^{q - 1} nm}{q^{2}}) = O (\frac{nm {(n + m)}^{λ}}{| D | λ^{2} {log}_{| D |}^{2} (n + m)})$ . In addition, given a canonical index Δ x, it is possible to retrieve the encoding (y₀,Δ y) = B ⊗ (0,Δ x) stored in the entry MUL_B[ Δ x] in constant time.

Let x = (x₀,Δ x) be some (not necessarily canonical) q-length D-discrete vector, for which we wish to compute B ⊗ x. Due to Observation 3, the multiplication result can be obtained in constant time by retrieving (y₀,Δ y) = MUL_B[ Δ x], and returning the encoding (y₀ + x₀,Δ y).
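The construction and use of a single MUL_B table can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names, the 3 × 3 block B, and the interval D = I_-1,2 are hypothetical, and the table is a plain list indexed by the canonical index.

```python
def encode(x, a, size):
    """Delta-encoding (offset, canonical index) of a D-discrete vector; size = |D|."""
    delta = sum(size ** i * (x[i] - x[i + 1] - a) for i in range(len(x) - 1))
    return x[0], delta

def decode(offset, delta, q, a, size):
    x = [offset]
    for _ in range(q - 1):
        delta, digit = divmod(delta, size)
        x.append(x[-1] - digit - a)
    return x

def min_plus_vec(B, x):
    """Direct min-plus product of a matrix with a (column) vector."""
    return [min(b + v for b, v in zip(row, x)) for row in B]

def build_mul_table(B, a, size):
    """MUL_B[delta] holds the encoding of B (x) (0, delta) for every canonical vector."""
    q = len(B)
    return [encode(min_plus_vec(B, decode(0, delta, q, a, size)), a, size)
            for delta in range(size ** (q - 1))]

def multiply_encoded(mul_table, x0, delta_x):
    """Constant-time product via Observation 3: look up, then shift the offset."""
    y0, delta_y = mul_table[delta_x]
    return y0 + x0, delta_y

a, size = -1, 3                         # D = {-1, 0, 1}
B = [[4, 2, 7], [3, 3, 6], [3, 2, 7]]   # a D-discrete block (column differences in D)
MUL_B = build_mul_table(B, a, size)
x = [10, 11, 10]                        # a non-canonical D-discrete vector
x0, delta_x = encode(x, a, size)
y = decode(*multiply_encoded(MUL_B, x0, delta_x), len(B), a, size)
```

The decoded vector y agrees with the direct product `min_plus_vec(B, x)`; the lookup itself touches one table entry and performs one addition.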

Preprocessing of vector entry-wise min computations

The algorithm constructs a lookup table MIN, storing entry-wise min calculations between every canonical q-length D-discrete vector x = (0,Δ x) and every q-length D-discrete vector y = (y₀,Δ y) such that abs (y₀) < q|D| (here abs (y₀) denotes the absolute value of y₀). For every such x and y, the table entry MIN [ Δ x,y₀,Δ y] stores the Δ-encoding (z₀,Δ z) of the vector z = min (x,y) (due to Lemma 3, z is D-discrete and can be encoded accordingly). There are $O (q | D | | D |^{2 (q - 1)}) = O (\frac{{(n + m)}^{2 λ} λ \log_{| D |} (n + m)}{| D |})$ entries in the table MIN, and each entry can be computed in $O (q) = O (λ \log_{| D |} (n + m))$ time. Thus, the computation of all entries in MIN requires $O (\frac{{(n + m)}^{2 λ} λ^{2} {log}_{| D |}^{2} (n + m)}{| D |})$ time, and the table occupies $O (\frac{{(n + m)}^{2 λ} λ \log_{| D |} (n + m)}{| D |})$ space.

Given two encoded q-length D-discrete vectors x = (x₀,Δ x) and y = (y₀,Δ y), the encoding (z₀,Δ z) of the vector z = min(x,y) can now be obtained in constant time as follows: (z₀,Δ z) = (x₀,Δ x) if y₀ - x₀ ≥ q |D| or (z₀,Δ z) = (y₀,Δ y) if x₀ - y₀ ≥ q |D|, due to Lemma 4. Otherwise, |y₀ - x₀| < q |D|, and for the vectors x^′ = (0,Δ x), y^′ = (y₀ - x₀,Δ y), and $z^{'} = (z_{0}^{'}, Δ z^{'}) = min (x^{'}, y^{'})$ , we have that $(z_{0}^{'}, Δ z^{'})$ = MIN [ Δ x,y₀ - x₀,Δ y]. From Observation 2, $(z_{0}, Δ z) = (z_{0}^{'} + x_{0}, Δ z^{'})$ .
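The MIN table and the three-way case analysis above can be sketched in Python as follows. The sketch is illustrative only: the helper names are hypothetical, a dictionary stands in for the three-dimensional table, and the parameters q = 3 and D = I_-1,2 are chosen to keep the table small.

```python
def encode(x, a, size):
    """Delta-encoding (offset, canonical index) of a D-discrete vector; size = |D|."""
    delta = sum(size ** i * (x[i] - x[i + 1] - a) for i in range(len(x) - 1))
    return x[0], delta

def decode(offset, delta, q, a, size):
    x = [offset]
    for _ in range(q - 1):
        delta, digit = divmod(delta, size)
        x.append(x[-1] - digit - a)
    return x

def build_min_table(q, a, size):
    """MIN[dx, y0, dy] = encoding of min((0, dx), (y0, dy)) for all |y0| < q * |D|."""
    indices = range(size ** (q - 1))
    table = {}
    for dx in indices:
        x = decode(0, dx, q, a, size)
        for y0 in range(-q * size + 1, q * size):
            for dy in indices:
                y = decode(y0, dy, q, a, size)
                z = [min(u, v) for u, v in zip(x, y)]
                table[dx, y0, dy] = encode(z, a, size)
    return table

def min_encoded(table, q, size, x, y):
    """Entry-wise min of two encoded vectors using Lemma 4 and Observation 2."""
    (x0, dx), (y0, dy) = x, y
    if y0 - x0 >= q * size:   # y is entry-wise larger (Lemma 4)
        return x
    if x0 - y0 >= q * size:   # x is entry-wise larger (Lemma 4)
        return y
    z0, dz = table[dx, y0 - x0, dy]
    return z0 + x0, dz        # shift the offset back (Observation 2)

q, a, size = 3, -1, 3
MIN = build_min_table(q, a, size)
z = min_encoded(MIN, q, size, (4, 6), (5, 2))
```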

Computing matrix-vector multiplications

Given an m-length D-discrete vector x and assuming the preprocessing of matrix A_n × m was performed as described above, we next explain how to efficiently compute the vector y = A ⊗ x. Note that y is an n-length D-discrete vector, due to Lemma 2.

Our algorithm computes first the multiplication $y^{'} = A^{'} \otimes x [I_{0, m^{'}}]$ in parts of length q. First, for every $0 \leq j < \frac{m^{'}}{q}$ , the algorithm computes the encoding $(x_{0}^{j}, Δ x^{j})$ of the sub-vector x^j = x [ Q_j] of x. These encodings can be obtained in a total time of O (m). Then, for every $0 \leq i < \frac{n^{'}}{q}$ , the encoding $(y_{0}^{i}, Δ y^{i})$ of the sub-vector $y^{i} = A^{'} [Q_{i}, I_{0, m^{'}}] \otimes x [I_{0, m^{'}}]$ of y^′ is computed independently of the other sub-vectors of y^′. By definition (see Figure 5),

\begin{aligned} y^{i} & = \min \{A^{'} [Q_{i}, Q_{j}] \otimes x [Q_{j}] \mid 0 \leq j < \frac{m^{'}}{q}\} \\ & = \min \{B_{i, j} \otimes x^{j} \mid 0 \leq j < \frac{m^{'}}{q}\} . \end{aligned}

The encoded result $(z_{0}^{i, j}, Δ z^{i, j})$ of each multiplication z^i,j = B_i,j ⊗ x^j can be obtained in a constant time as explained in Section “Preprocessing of matrix-vector ⊗ computations”. As there are $\frac{m^{'}}{q}$ such terms to compute with respect to yⁱ, their total computation time is $O (\frac{m}{q})$ . In addition, the entry-wise min over all these terms can be computed by initializing $(y_{0}^{i}, Δ y^{i}) \leftarrow (z_{0}^{i, 0}, Δ z^{i, 0})$ , and iteratively updating $(y_{0}^{i}, Δ y^{i}) \leftarrow min ((y_{0}^{i}, Δ y^{i}), (z_{0}^{i, j}, Δ z^{i, j}))$ for all $0 < j < \frac{m^{'}}{q}$ . Each such update is computed in a constant time as described in Section “Preprocessing of vector entry-wise min computations”, and so the encoding of a single segment yⁱ in y^′ is computed in a total time of $O (\frac{m}{q})$ , and the encodings of all $O (\frac{n}{q})$ such segments are computed in $O (\frac{nm}{q^{2}})$ time. Decoding all encoded vectors yⁱ can be done in additional O (n) operations, obtaining an explicit form of y^′ in a total time of $O (\frac{nm}{q^{2}})$ .

Let $y^{''} = A [I_{0, n^{'}}, I_{m^{'}, m}] \otimes x [I_{m^{'}, m}]$ , where from Observation 1, $y [I_{0, n^{'}}] = min (y^{'}, y^{''})$ . The computation of y^′′ can be conducted in O (n q) time in a straightforward manner, and the computation of min(y^′,y^′′) requires additional O (n) time. In addition, $y [I_{n^{'}, n}] = A [I_{n^{'}, n}, I_{0, m}] \otimes x$ , where this computation can be done naively in O (m q) time, and so the overall running time for computing y is $O (\frac{nm}{q^{2}} + nq + mq) = O (\frac{nm}{q^{2}}) = O (\frac{nm}{λ^{2} {log}_{| D |}^{2} (n + m)})$ .
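The decomposition of y = A ⊗ x into the parts y^′, y^′′ and the remaining rows can be sketched in Python. Note the simplification: in this illustration each block product is computed directly rather than retrieved from the MUL_B and MIN lookup tables, so the sketch shows only the structure of the computation and not its accelerated running time. The function names are hypothetical.

```python
import random

def min_plus_vec(A, x):
    """Direct min-plus product of a matrix with a (column) vector."""
    return [min(a + v for a, v in zip(row, x)) for row in A]

def blocked_min_plus_vec(A, x, q):
    """Compute A (x) x block by block, with blocks of side q (assumes m >= 1)."""
    n, m = len(A), len(x)
    n1, m1 = q * (n // q), q * (m // q)       # n' and m' in the text
    y = [None] * n
    for i in range(0, n1, q):                 # one segment y^i per block row
        acc = None
        for j in range(0, m1, q):             # min over B_{i,j} (x) x^j
            block = [row[j:j + q] for row in A[i:i + q]]
            z = min_plus_vec(block, x[j:j + q])
            acc = z if acc is None else [min(u, v) for u, v in zip(acc, z)]
        if m1 < m:                            # leftover columns: the part y''
            z = min_plus_vec([row[m1:] for row in A[i:i + q]], x[m1:])
            acc = z if acc is None else [min(u, v) for u, v in zip(acc, z)]
        y[i:i + q] = acc
    y[n1:] = min_plus_vec(A[n1:], x)          # leftover rows, computed naively
    return y

rng = random.Random(1)
A = [[rng.randint(0, 30) for _ in range(8)] for _ in range(7)]
x = [rng.randint(0, 30) for _ in range(8)]
y_blocked = blocked_min_plus_vec(A, x, 3)
y_direct = min_plus_vec(A, x)
```

Both computations produce the same vector; replacing the direct block products by table lookups on Δ-encoded sub-vectors yields the accelerated algorithm described in the text.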

The above matrix-vector min-plus multiplication algorithm can be used as a fast square matrix-matrix multiplication algorithm in a straightforward manner. For two D-discrete matrices A_n × n and B_n × n, the computation of C = A ⊗ B can be conducted by first preprocessing A as described in Sections “Preprocessing of matrix-vector ⊗ computations” in $O (\frac{n^{2 + λ}}{| D |})$ time and $O (\frac{n^{2 + λ}}{| D | λ^{2} {log}_{| D |}^{2} n})$ space, and then computing each column j of C independently by multiplying A with the j-th column of B, in $O (\frac{n^{2}}{λ^{2} {log}_{| D |}^{2} n})$ time as explained above. The total computation time of all n columns of C is therefore $O (\frac{n^{3}}{λ^{2} {log}_{| D |}^{2} n})$ .

Online preprocessing of D-discrete matrices

In the previous section, we assumed a setting in which a D-discrete matrix is given and is preprocessed once prior to any multiplication operation. Next, we describe how to maintain the required lookup tables for the case in which the input matrix is dynamic, acquiring additional rows and columns. Consider a streaming computational model, which begins with an initial empty matrix $A_{0 \times 0}^{0}$ . In each step r, the current matrix $A_{n_{r} \times m_{r}}^{r}$ is obtained from the previous matrix $A_{n_{r - 1} \times m_{r - 1}}^{r - 1}$ by either adding an $m_{r - 1}$-length vector as the last row or adding an $n_{r - 1}$-length vector as the last column in the matrix. Note that n_r + m_r = r, and therefore the preprocessing block length corresponding to A^r is $q = ⌊ λ \log_{| D |} (n_{r} + m_{r}) ⌋ = ⌊ λ \log_{| D |} (r) ⌋$ . For the purpose of this analysis, we assume that λ ≤ 0.5 (note that this does not limit the asymptotic upper bounds of the running time). This assumption implies the following inequality:

| D |^{\frac{1}{λ}} \geq 2^{2} = 4 .

(14)

Lookup tables corresponding to intermediate matrices along the series can be maintained as follows. Let r₀ and r₁ be the smallest integers such that the block sizes corresponding to $A^{r_{0}}$ and $A^{r_{1}}$ are q and q + 1, respectively. Assume that upon reaching $A^{r_{0}}$ in the matrix sequence, all required lookup tables with respect to $A^{r_{0}}$ are already computed. Along the series of steps r₀,r₀ + 1,…,r₁, we distribute two kinds of computations: (1) new MUL_B tables for accumulated q × q blocks in matrices A^r for r₀ ≤ r < r₁, and (2) a new MIN table, as well as new MUL_B tables, with respect to block length q + 1.

(1) Computing MUL_B tables for accumulated q × q blocks. Assume that for some r₀ ≤ r < r₁, a column was added to the matrix at step r so that the number of columns m_r in the intermediate matrix A^r is divisible by q. Thus, at most $\frac{n_{r}}{q} \leq \frac{r}{q}$ new q × q complete blocks are now available for preprocessing. The computation of lookup tables of the form MUL_B corresponding to these new blocks will be equally distributed along the series of q consecutive steps r,r + 1,…,r + q - 1, during which it is guaranteed that no column addition would introduce new complete q × q blocks in the matrix. As shown in Section “Preprocessing of matrix-vector ⊗ computations”, the time required for processing a single q × q block is $O (q^{2} | D |^{q - 1})$ , and so the total time for processing all $O (\frac{r}{q})$ blocks is $O (q r | D |^{q - 1})$ . Thus, in each step among the q steps, there is a need to perform $O (r | D |^{q - 1}) = O (\frac{r^{1 + λ}}{| D |})$ operations due to these computations. Symmetrically, computing lookup tables corresponding to new blocks added due to the accumulation of rows can be performed by conducting $O (\frac{r^{1 + λ}}{| D |})$ operations per step r.

(2) Computing a new MIN lookup table and new MUL_B tables with respect to block length q + 1. By the selection of r₀ and r₁, $q - 1 = ⌊ λ \log_{| D |} (r_{0} - 1) ⌋ > λ \log_{| D |} (r_{0} - 1) - 1$ , and $q + 1 = ⌊ λ \log_{| D |} (r_{1}) ⌋ \leq λ \log_{| D |} (r_{1})$ . Therefore, $\log_{| D |} (r_{1}) > \log_{| D |} (r_{0} - 1) + \frac{1}{λ}$ , and so $r_{1} > (r_{0} - 1) | D |^{\frac{1}{λ}} \geq r_{0} \frac{| D |^{\frac{1}{λ}}}{2} \overset{Eq.14}{\geq} 2 r_{0}$ . In particular, $r_{2} = \frac{r_{1}}{2}$ satisfies r₀ < r₂ < r₁, and for every r₂ ≤ r < r₁ we have that O (r) = O (r₁). The computation of the table MIN and tables of the form MUL_B with respect to block length q + 1 is distributed along the series of $\frac{r_{1}}{2}$ steps r₂,r₂ + 1,…,r₁.

The new MIN table is computed independently from the specific input instance, and its overall computation time is $O (\frac{r_{1}^{2 λ} λ^{2} {log}_{| D |}^{2} (r_{1})}{| D |})$ (see Section “Preprocessing of vector entry-wise min computations”). By distributing this computation evenly along all O (r₁) steps, the computation time required for each step r₂ ≤ r < r₁ is $O (\frac{r_{1}^{2 λ - 1} λ^{2} {log}_{| D |}^{2} (r_{1})}{| D |}) = O (\frac{r^{2 λ - 1} λ^{2} {log}_{| D |}^{2} (r)}{| D |})$ .

The MUL_B tables are computed similarly to (1), starting with (q + 1) × (q + 1) blocks already present in $A^{r_{2}}$ , and continuing with blocks accumulated as the sequence progresses. The overall preprocessing time of all these blocks is $O (\frac{r_{1}^{2 + λ}}{| D |})$ (see Section “Preprocessing of matrix-vector ⊗ computations”), and so the computation time required for each step r₂ ≤ r < r₁ is $O (\frac{r_{1}^{1 + λ}}{| D |}) = O (\frac{r^{1 + λ}}{| D |})$ .

All in all, the time complexity due to computations of (1) and (2) for each step r₀ ≤ r < r₁ is $O (\frac{r^{1 + λ}}{| D |})$ . In particular, the overall time complexity of preprocessing the n - size prefix A⁰,A¹,…,Aⁿ of the streamed matrices is $O (\frac{n^{2 + λ}}{| D |})$ .

The EDDC algorithm based on efficient D-discrete min-plus matrix-vector multiplication

Consider the EDDC problem in cases where all edit operation costs are integers. As explained in Section “D-discrete matrices and the EDDC problem with integer costs”, the EDDC DP tables can be considered D-discrete. This property allows for efficient min-plus square D-discrete matrix-vector multiplications, using the algorithm described in Section “An efficient D-discrete min-plus matrix-vector multiplication algorithm” to yield an $O (\frac{| Σ | n^{3}}{{log}_{| D |}^{2} n})$ running time algorithm for EDDC. We next describe an online version of the algorithm, in which the letters of the input strings s and t are received in a streaming model.

Assume that some pair of prefixes s_0,i and t_0,j-1 was already processed, and all entries in the DP matrices corresponding to these prefixes are computed. We explain how to update the tables in case where the next letter to arrive is the letter t_j-1 in t, where the case in which the arriving letter is from s is symmetric. The DP matrices are D-discrete, and assume that lookup tables for efficient min-plus multiplications of these matrices are maintained as explained in the previous section. The addition of t_j-1 requires updating all matrices of the forms T^ε, T^α, and T^′α, for which the j-th row and column should be added. In addition, it is required to add the j-th column to matrices of the form ED and EDT^α.

In the first stage, the algorithm computes rows and columns j in all matrices of the form T^′α, T^α, and T^ε. The process is similar to the computation of these entries by the loop in lines 5 to 8 of Algorithm 1, with the following modification. Let $q_{j} = ⌊ λ \log_{| D |} (2 j) ⌋$ , and let $j^{'} = q_{j} ⌊\frac{j}{q_{j}}⌋$ . The algorithm first initializes the entries [ j - 1,j] in all these matrices with the corresponding base-case values. The column is partitioned into intervals of length q_j, where as before Q_k denotes the interval $I_{k q_{j}, (k + 1) q_{j}}$ . Once an interval Q_k is computed (i.e. the loop was executed with respect to index l = k q_j), the Δ-encoding of the sub-vector T^α[ Q_k,j] is computed and kept for later use as a lookup index. In addition, upon starting to compute the entries within an interval Q_k (i.e. when l = (k + 1)q_j - 1), the following multiplications are computed for every α ∈ Σ:

\begin{aligned} y^{α, k} & = T^{α} [Q_{k}, I_{q_{j} (k + 1), j^{'}}] \otimes T^{α} [I_{q_{j} (k + 1), j^{'}}, j] \\ & \overset{Obs.1}{=} \min \{T^{α} [Q_{k}, Q_{p}] \otimes T^{α} [Q_{p}, j] \mid k < p < \frac{j^{'}}{q_{j}}\} \end{aligned}

Observe that all required entries for the computation of y^α,k are already computed and stored in T^α, and that similarly as done in Section “Computing matrix-vector multiplications”, y^α,k can be computed by performing $O (\frac{j}{q_{j}})$ constant time lookup table queries. After y^α,k is computed, y^α,k[ x] contains the value min{T^α[ k q_j + x,h] + T^α[ h,j] | (k + 1)q_j ≤ h < j^′}. Given y^α,k[ x], the number of expressions that need to be examined in line 5 of the loop with respect to l = k q_j + x reduces to O (q_j) per entry (considering values of the index h between l and (k + 1)q_j, and between j^′ and j). Entries in matrices of the form T^α and T^ε are computed exactly as done in lines 6 and 7 of Algorithm 1, respectively.

In the second stage, column j is computed in matrices EDT^α and ED. This is achieved by extending Equations 13 and 12 to have an entire column on the left-hand side, as follows:

\begin{align} \begin{array}{l} ED T^{α} [I_{2, i + 1}, j] \overset{Eq.13}{\leftarrow} ED [I_{2, i + 1}, I_{1, j}] \otimes T^{α} [I_{1, j}, j] \end{array} \end{align}

(15)

\begin{array}{l} ED [I_{2, i + 1}, j] \overset{Eq.12}{\leftarrow} \\ min \{\begin{array}{l} S^{α} [0, I_{2, i + 1}] + T^{α} [0, j], \\ tr (S^{α}) [I_{2, i + 1}, I_{1, i + 1}] \otimes ED T^{α} [I_{1, i + 1}, j] \end{array}| α \in Σ\} \end{array}

(16)

This completes the update of the DP tables due to the addition of the letter t_j-1.

Complexity analysis

After receiving n letters, the prefixes s_0,i and t_0,j of the input strings were preprocessed for some i,j such that i + j = n. The maintenance of lookup tables for efficient D-discrete multiplications requires at most $O (\frac{| Σ | n^{1 + λ}}{| D |})$ operations per step among the first n steps, and $O (\frac{| Σ | n^{2 + λ}}{| D |})$ operations for all first n steps, as shown in Section “Online preprocessing of D-discrete matrices”.

Adding a letter t_j-1 to the instance, the time required for processing the entries in column j of the T matrices is as follows. $O (\frac{| Σ | j}{q_{j}})$ vectors y^α,k need to be computed, each vector is computed in $O (\frac{j}{q_{j}})$ time, and their total computation time is therefore $O (\frac{| Σ | j^{2}}{q_{j}^{2}})$ . In addition, O (|Σ|j) entries in tables T^′α are computed in O (q_j) time each, and O (|Σ|j) entries in tables T^α and T^ε are computed in O (|Σ|) time each. Therefore, the total time for computing column j in all these matrices is $O (\frac{| Σ | j^{2}}{q_{j}^{2}} + | Σ |^{2} j) = O (\frac{| Σ | n^{2}}{λ^{2} {log}_{| D |}^{2} (n)} + | Σ |^{2} n)$ .

Computing column j in matrices EDT^α and ED, the algorithm performs O (|Σ|) matrix-vector min-plus multiplications (Equations 15 and 16), each taking $O (\frac{n^{2}}{λ^{2} {log}_{| D |}^{2} (n)})$ time using the algorithm in Section “An efficient D-discrete min-plus matrix-vector multiplication algorithm”, and computes the entry-wise minimum of |Σ| i-length vectors (Equation 16) in O (|Σ|i) time. Hence, the total time complexity of computing column j is $O (\frac{| Σ | n^{2}}{λ^{2} {log}_{| D |}^{2} (n)} + | Σ |^{2} n)$ . Symmetrically, this bounds the running time when the n-th letter comes from the source string s, and so the total running time over all first n steps is $O (\frac{| Σ | n^{3}}{λ^{2} {log}_{| D |}^{2} (n)} + | Σ |^{2} n^{2})$ . The algorithm requires $O (\frac{| Σ | n^{2 + λ}}{λ^{2} {log}_{| D |}^{2} (n)})$ space for the computed tables.

Online VMT algorithms

The online algorithm for EDDC presented in the previous section can be generalized to other problems with similar properties. Specifically, VMT problems [11], which utilize min-plus multiplications and for which it can be guaranteed that computed DP matrices are D-discrete, can have their algorithms implemented using the same framework as we have presented above. Thus, in contrast to the general case for VMT problems in which it is required that the complete input be available at the beginning of the algorithm’s run, in the D-discrete case the input can be obtained in a streaming model. In addition, the asymptotic time complexity in such cases is slightly reduced with respect to the time complexity of the case of min-plus multiplication of general matrices. A concrete example of such a problem is the RNA base-pairing maximization problem [11, 15], in which the difference between adjacent entries (in the single DP matrix the algorithm uses) is either 0 or 1. This property was previously exploited by Frid and Gusfield [14] to obtain an $O (\frac{n^{3}}{\log n})$ algorithm for the problem. Using the D-discrete min-plus multiplication algorithm presented here, this immediately implies an algorithm having the improved time bound of $O (\frac{n^{3}}{\log^{2} n})$ . Additional related problems from the domains of RNA folding and Context Free Grammars (CFGs) parsing fall under the VMT framework, and it is likely that D-discreteness can be exploited for accelerating the computation of more problems within this family.

Additional acceleration using run-length encoding

Let w be a string. A maximal substring of w containing multiple repeats of the same letter is called a run in w. The Run Length Encoding (RLE) of w is a representation of the string in which each run is encoded by the corresponding repeating letter α and its repeat count p (denoted α^p). For example, the string w = aabbbaccc is a concatenation of the four runs aa, bbb, a, and ccc, and its RLE is a²b³a¹c³. Denote by $\tilde{w}$ the compressed form of w, which replaces each run in w by a single occurrence of the corresponding letter. When n denotes the length of w, $ñ$ will denote the length of the compressed form of w. The run index $ĩ$ of a letter w_i in w is the index of the run in which w_i participates. It can be asserted that the compressed form of the substring w_i,j of w is the substring ${\tilde{w}}_{ĩ, (\tilde{j - 1}) + 1}$ of $\tilde{w}$ . In the above example, $\tilde{w} = abac$ , and therefore $ñ = 4$ (while n = 9). The run indices of all letters in w are given by the sequence [ 0,0,1,1,1,2,3,3,3], and the compressed form of w_3,8 = bbacc is ${\tilde{w}}_{\tilde{3}, \tilde{7} + 1} = {\tilde{w}}_{1, 4} = bac$ .
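The notions of RLE, compressed form, and run index can be reproduced with a few lines of Python. The sketch below is an illustration of the definitions on the example string from the text; the function names are hypothetical.

```python
from itertools import groupby

def run_length_encode(w):
    """List of (letter, repeat count) pairs, one per run of w."""
    return [(letter, len(list(run))) for letter, run in groupby(w)]

def compressed_form(w):
    """Replace each run of w by a single occurrence of its letter."""
    return "".join(letter for letter, _ in groupby(w))

def run_indices(w):
    """For each position of w, the index of the run it belongs to."""
    indices = []
    for run_index, (_, run) in enumerate(groupby(w)):
        indices.extend(run_index for _ in run)
    return indices

w = "aabbbaccc"
rle = run_length_encode(w)
w_tilde = compressed_form(w)
runs = run_indices(w)
# Compressed form of the substring w_{3,8}, obtained in two ways.
direct = compressed_form(w[3:8])
via_run_indices = w_tilde[runs[3]:runs[7] + 1]
```

Both ways of compressing the substring w_3,8 = bbacc give bac, matching the identity stated above.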

Previous works [7–9] showed how RLE can be exploited for improving the efficiency of EDDC algorithms. In these works it was required that the costs of duplications and contractions be less than the costs of all other operations (the requirement was implicit in [9], see discussion in Section “A comparison with previous works”). This requirement is somewhat unnatural for the application of minisatellite map comparison, since it assumes that mutations, which are typically common events, should cost more than the less common events of duplications and contractions. In this section, we adapt a similar RLE-based acceleration to our EDDC algorithm. The application of this acceleration requires the following constraint over cost functions:

Constraint 1.

For every α,β ∈ Σ, dup (α) ≤ dup (β) + mut (β,α) ≤ ins (α), and cont (α) ≤ cont (β) + mut (α,β) ≤ del (α).

The constraint dup (β) + mut (β,α) ≤ ins (α) implies that it never costs more to replace an insertion of some letter α into some nonempty string by the duplication of a letter β adjacent to the insertion position, and its consecutive mutation to α. Thus, we may assume w.l.o.g that optimal edit scripts do not contain insertions (unless applied to empty strings), or in other words, generation of new letters can only be obtained via duplications. Such an assumption is relatively reasonable in the context of minisatellite map comparison, considering the biological mechanisms that describe generative modifications.

The constraint dup (α) ≤ dup (β) + mut (β,α) can be intuitively understood by the example of generating a string of the form α α β from a string of the form α β. Due to the constraint, it would cost the same or less if the string α α β is obtained by duplicating the α letter in α β, rather than by duplicating the β letter and mutating its left copy into α. Again, such an assumption is relatively reasonable for the minisatellite map application. Symmetric arguments hold with respect to the constraint over contraction and deletion costs.
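Whether a given cost function meets Constraint 1 can be checked directly over all ordered pairs of letters. The sketch below is illustrative: the function name, the dictionary-based cost representation, and the convention mut (α,α) = 0 are assumptions made here, not part of the algorithms in this paper.

```python
def satisfies_constraint_1(alphabet, dup, cont, ins, dele, mut):
    """Check Constraint 1 for every ordered pair of letters.

    dup, cont, ins, dele map a letter to a cost; mut maps a pair
    (source, target) of distinct letters to a cost. The cost of
    "mutating" a letter to itself is assumed to be 0.
    """
    for a in alphabet:
        for b in alphabet:
            m_ba = 0 if a == b else mut[(b, a)]
            m_ab = 0 if a == b else mut[(a, b)]
            # dup(a) <= dup(b) + mut(b, a) <= ins(a)
            if not (dup[a] <= dup[b] + m_ba <= ins[a]):
                return False
            # cont(a) <= cont(b) + mut(a, b) <= del(a)
            if not (cont[a] <= cont[b] + m_ab <= dele[a]):
                return False
    return True

# Toy cost function (hypothetical values): unit duplications,
# contractions and mutations, indels of cost 3.
sigma = "ab"
unit = {"a": 1, "b": 1}
indel = {"a": 3, "b": 3}
mutation = {("a", "b"): 1, ("b", "a"): 1}
ok = satisfies_constraint_1(sigma, unit, unit, indel, indel, mutation)
```

With these toy values the check succeeds; lowering the insertion cost below dup (β) + mut (β,α) makes it fail.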

Observation 4.

Let s and w be strings. Then, ed (s,w β β) ≤ ed (s,w β) + dup (β), and ed (s,β β w) ≤ ed (s,β w) + dup (β) for every β ∈ Σ.

The correctness of Observation 4 follows from the existence of a script from s to w β β whose cost is ed (s,w β) + dup (β): this script first applies an optimal script to transform s into w β at cost ed (s,w β), and then duplicates the last β in w β at cost dup (β).

Lemma 5.

Let α,β be letters and w ≠ ε a string. When Constraint 1 holds, ed (α,β w),ed (α,w β) ≥ ed (α,w) + dup (β), and ed (β w,α),ed (w β,α) ≥ ed (w,α) + cont (β).

The proof of Lemma 5 appears in Appendix “Proofs to lemmas corresponding to the run-length encoding based EDDC algorithm”.

Next, we show how to reduce the number of expressions that need to be considered in the EDDC recursive equations when Constraint 1 applies. For a string w of length at least 2, denote by R (w) ⊆ P (w) the set of all partitions (w^a,w^b) of w such that the last letter in w^a is different from the first letter in w^b. For example, for w = aabbbcdddd, R (w) = {(aa,bbbcdddd), (aabbb,cdddd), (aabbbc,dddd)}. Observe that $| R (w) | = ñ - 1$ .
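The set R (w) can be enumerated in a single pass over w. The following sketch assumes that P (w) consists of the splits of w into two nonempty parts; the name `partitions_R` is hypothetical.

```python
def partitions_R(w):
    """Return the partitions (w_a, w_b) of w, both parts nonempty,
    whose split point lies between two different runs of w."""
    return [(w[:k], w[k:]) for k in range(1, len(w)) if w[k - 1] != w[k]]

splits = partitions_R("aabbbcdddd")
# One partition per boundary between consecutive runs.
assert len(splits) == 3
```

Since there is exactly one split point per boundary between consecutive runs, the number of returned partitions is the number of runs minus one.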

We start by describing how to improve the computation efficiency of EDDC for cases in which one of the input strings contains a single letter. Denote by dupcost (w) the cost of the edit script from $\tilde{w}$ to w which generates each run α^p in w by applying p - 1 duplication operations over the corresponding letter α in $\tilde{w}$ . Similarly, denote by contcost (w) the cost of the edit script from w to $\tilde{w}$ which reduces each run α^p in w by applying p - 1 contraction operations over α. For example, for w = aabbbbaaccc, dupcost (w) = 2 dup (a) + 3 dup (b) + 2 dup (c) and contcost (w) = 2 cont (a) + 3 cont (b) + 2 cont (c). Note that $dupcost (w) \geq ed (\tilde{w}, w)$ , and $contcost (w) \geq ed (w, \tilde{w})$ . It is simple to assert the following recursive relations: [b]

\begin{array}{l} dupcost (w β) = \{\begin{array}{l} dupcost (w) + dup (β), & w ends with β, \\ dupcost (w), & otherwise . \end{array} \end{array}

(17)

\begin{array}{l} dupcost (w) = \{\begin{array}{l} dupcost (w^{a}) + dupcost (w^{b}), & (w^{a}, w^{b}) \in R (w), \\ dupcost (w^{a} β) + dupcost (β w^{b}) + dup (β), & (w^{a} β, β w^{b}) \in P (w) . \end{array} \end{array}

(18)
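As an illustration of Equations 17 and 18, dupcost can be computed by a left-to-right scan that applies Equation 17 once per letter. This is a sketch under the assumption that duplication costs are given as a dictionary; the name `dupcost` mirrors the text, and the cost values are hypothetical.

```python
def dupcost(w, dup):
    """Apply Eq. 17 letter by letter: every letter that repeats its
    predecessor contributes one duplication of that letter."""
    total = 0
    for k in range(1, len(w)):
        if w[k] == w[k - 1]:
            total += dup[w[k]]
    return total

dup = {"a": 1, "b": 10, "c": 100}  # hypothetical costs
# w = aabbbbaaccc: 2 dup(a) + 3 dup(b) + 2 dup(c)
assert dupcost("aabbbbaaccc", dup) == 2 * 1 + 3 * 10 + 2 * 100
# Eq. 18, split between two runs: the costs simply add up.
assert dupcost("aabbb", dup) == dupcost("aa", dup) + dupcost("bbb", dup)
# Eq. 18, split inside a run: one extra duplication of the split letter.
assert dupcost("aabbb", dup) == dupcost("aab", dup) + dupcost("bb", dup) + dup["b"]
```

A symmetric function over contraction costs would compute contcost in the same way.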

The following lemma shows that when one of the input strings contains a single letter, the edit distance can be inferred from the edit distance between this letter and the compressed form of the second string.

Lemma 6.

Let α be a letter and w a string. When Constraint 1 holds, $ed (α, w) = ed (α, \tilde{w}) + dupcost (w)$ , and $ed (w, α) = contcost (w) + ed (\tilde{w}, α)$ .

The following lemma shows that given a certain edit script from string u, its cost is greater than or equal to the cost of its application on a superstring of u.

For a string s of the form s = s^aus^b and an edit script $E S = 〈u = u^{0}, u^{1}, \dots, u^{r} = w〉$ from u to w, denote by $E S (s)$ the edit script $E S (s) = 〈s = s^{a} u s^{b} = s^{a} u^{0} s^{b}, s^{a} u^{1} s^{b}, \dots, s^{a} u^{r} s^{b} = s^{a} w s^{b}〉$ from s = s^aus^b to t = s^aws^b.

Lemma 7.

For s = s^aus^b and $E S = 〈u = u^{0}, u^{1}, \dots, u^{r} = w〉$ , $cost (E S (s)) \leq cost (E S)$ .

The proofs of Lemma 6 and Lemma 7 appear in the Appendix.

Equations 17 and 18 and Lemma 6 support the following preprocessing algorithm, Algorithm 3. Given a target string t, Algorithm 3 generates data structures that enable retrieving in constant time values of the form ed (α,t_i,j) for every α ∈ Σ and every substring t_i,j of t. The algorithm generates tables of the form ${\tilde{T}}^{α}$ for every α ∈ Σ, such that entries ${\tilde{T}}^{α} [i, j]$ contain the corresponding values $ed (α, {\tilde{t}}_{i, j})$ . In addition, the algorithm generates a vector DC, such that entries DC [ j] contain the corresponding values dupcost (t_0,j). Then, queries of the form ed (α,t_i,j) can be answered in a constant time according to Equation 19 below.

\begin{array}{l} ed (α, t_{i, j}) & \overset{\begin{array}{l} Lem.6 \\ Eq.18 \end{array}}{=} {\tilde{T}}^{α} [ĩ, (\tilde{j - 1}) + 1] \\ + \{\begin{array}{l} DC [j] - DC [i], & t_{i - 1} \neq t_{i}, \\ DC [j] - DC [i] - dup (t_{i}), & otherwise . \end{array} \end{array}

(19)
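The dupcost term of Equation 19 can be sketched as follows. Only the second summand is shown; the lookup in ${\tilde{T}}^{α}$ is omitted. The function names are hypothetical, and treating i = 0 as a run boundary is an assumption made here for the case in which t_{i - 1} does not exist.

```python
def prefix_dupcosts(t, dup):
    """Return DC with DC[j] = dupcost(t[0:j]) for 0 <= j <= len(t)."""
    dc = [0] * (len(t) + 1)
    for j in range(1, len(t) + 1):
        repeats = j >= 2 and t[j - 1] == t[j - 2]
        dc[j] = dc[j - 1] + (dup[t[j - 1]] if repeats else 0)
    return dc

def substring_dupcost(t, dc, dup, i, j):
    """Return dupcost(t[i:j]) in constant time, for i < j,
    using the case split of Eq. 19."""
    if i == 0 or t[i - 1] != t[i]:
        # The substring starts at a run boundary.
        return dc[j] - dc[i]
    # The substring starts inside a run: DC[j] - DC[i] counts one
    # duplication that belongs to the prefix, not to the substring.
    return dc[j] - dc[i] - dup[t[i]]

t = "aabbbbaaccc"
dup = {"a": 1, "b": 10, "c": 100}  # hypothetical costs
dc = prefix_dupcosts(t, dup)
assert substring_dupcost(t, dc, dup, 2, 6) == 30  # "bbbb"
assert substring_dupcost(t, dc, dup, 3, 6) == 20  # "bbb"
```

Building DC takes O (n) time, after which each query costs a constant number of operations.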

Algorithm 3 RL-LETTER-TO-STRING( t )

An algorithm which is symmetric to Algorithm 3 can be described in order to preprocess a string s for queries of the form ed (s_i,j,α).

We continue to describe the improved computation in the case where both input strings s and t are of length at least 2. To do so, we first add some auxiliary notation. For an interval I_x,y of positions within a string w, denote by $Ĩ_{x, y}$ the subsequence of indices in I_x,y which are start positions of runs in w. For example, for w = aabbbaacccc, the interval I_1,7 = [ 1,2,…,6] contains all positions of letters within the substring w_1,7 = abbbaa, and $Ĩ_{1, 7} = [2, 5]$ contains the start positions in w of the runs bbb and aa that are included in I_1,7 (the first letter a in w_1,7 belongs to a run in w that starts in position 0, and therefore position 1 is not included in $Ĩ_{1, 7}$ ). This notation will be used for defining subsequences of rows and columns in DP matrices maintained by the algorithm, where some of these intervals are derived from the source string s, and some from the target string t. We will assume that the string from which $Ĩ_{x, y}$ was derived is clear from the context, and will not specify it explicitly. For example, when $Ĩ_{x, y}$ defines rows in matrices ED or EDT^α, or either rows or columns in matrices S^α, then the indices in $Ĩ_{x, y}$ are derived from the source string s. When $Ĩ_{x, y}$ defines columns in matrices ED or EDT^α, or either rows or columns in matrices T^α, then the indices in $Ĩ_{x, y}$ are derived from the target string t. Subsequences $Ĩ_{x, y}$ will be used for defining sparse regions in matrices, i.e. regions containing sets of rows or columns which are not necessarily adjacent.
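The subsequence $Ĩ_{x, y}$ can be extracted with a simple filter over the interval. In this sketch, the name `run_starts` is hypothetical, and I_x,y is taken to be the positions x, x + 1, …, y - 1, as in the example above.

```python
def run_starts(w, x, y):
    """Return the positions k with x <= k < y at which a run of w starts,
    i.e. k == 0 or w[k] differs from w[k - 1]."""
    return [k for k in range(x, y) if k == 0 or w[k] != w[k - 1]]

# Position 1 is not a run start: its run begins at position 0.
assert run_starts("aabbbaacccc", 1, 7) == [2, 5]
```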

Consider the computation of ed (s,t) as expressed in Equation 8. Assume first the special case where s ends with a run of length at least 2. In this case, s is of the form s = w β β for some string w and a letter β. For every partition (t^a,t^b) of t, it is possible to combine an optimal script ${E S}^{1}$ from the prefix w β of s to t^a and an optimal script ${E S}^{2}$ from the suffix β of s to t^b, and to obtain a script $E S = 〈{E S}^{1} (w β β), {E S}^{2} (t^{a} β)〉$ from s to t. Therefore, $ed (w β β, t) \leq cost (E S) = cost ({E S}^{1} (w β β)) + cost ({E S}^{2} (t^{a} β)) \overset{Lem.7}{\leq} cost ({E S}^{1}) + cost ({E S}^{2}) = ed (w β, t^{a}) + ed (β, t^{b})$ . In particular, $ed (w β β, t) \leq min \{ed (w β, t^{a}) + ed (β, t^{b}) | (t^{a}, t^{b}) \in P (t)\} \overset{Eq.9}{=} ed t^{β} (w β, t)$ . In addition, it is possible to compose an edit script from w β β to t by first contracting the last two letters to obtain the string w β, and then applying an optimal script from w β to t. The cost of such a script is ed (w β,t) + cont (β), and therefore we get that ed (w β β,t)≤ min{edt^β(w β,t),ed (w β,t) + cont (β)}.

Next, we show that ed (w β β,t) ≥ min {edt^β(w β,t), ed (w β,t) + cont (β)}. From Equation 8, either ed (w β β,t) = ed (w β β,α) + ed (α,t) or ed (w β β,t) = edt^α(s^a,t) + ed (s^b,α) for some α ∈ Σ and (s^a,s^b) ∈ P (w β β). Consider first the latter case. If (s^a,s^b) = (w β,β), then

\begin{array}{l} ed (w β β, t) & = ed t^{α} (w β, t) + ed (β, α) \\ \overset{Eq.9}{=} min \{ed (wβ, t^{a}) + ed (α, t^{b}) | (t^{a}, t^{b}) \in P (t)\} \\ + ed (β, α) \\ \overset{Obs.5}{\geq} min \{ed (w β, t^{a}) + ed (β, t^{b}) | (t^{a}, t^{b}) \in P (t)\} \\ \overset{Eq.9}{=} ed t^{β} (w β, t) . \end{array}

Else, s^b is of length at least 2, and there is some string u such that s^b= u β β and w = s^au. In this case, $ed (w β β, t) = ed t^{α} (s^{a}, t) + ed (u β β, α) \overset{Lem.5}{\geq} ed t^{α} (s^{a}, t) + ed (u β, α) + cont (β) \overset{Eq.8}{\geq} ed (w β, t) + cont (β)$ . Similarly, it can be shown that when ed (w β β,t) = ed (w β β,α) + ed (α,t) for some α ∈ Σ, ed (w β β,t) ≥ ed (w β,t) + cont (β), and so ed (w β β,t) ≥ min {edt^β(w β,t),ed (w β,t) + cont (β)}. Thus,

ed (w β β, t) = min \{ed t^{β} (w β, t), ed (w β, t) + cont (β)\}

(20)

Formulating Equation 20 with respect to the data structures defined in Section “A baseline dynamic-programming algorithm for EDDC” (under the assumption that all values appearing at the right-hand side of the equation are computed and stored in the corresponding entries), we get the following equation:

\begin{array}{l} ed (s_{0, i}, t_{0, j}) & = min \{ED T^{s_{i - 1}} [i - 1, j], ED [i - 1, j] \\ + cont (s_{i - 1})\} (when s_{i - 1} = s_{i - 2}) \end{array}

(21)

Now, consider the case where the last run in s is of length 1 (i.e. s is not of the form w β β). Assume first that the term that yields the minimum value of the right-hand side of Equation 8 is of the form edt^α(s^a,t) + ed (s^b,α) for some partition (s^a,s^b) ∈ P (s) and a letter α ∈ Σ. If (s^a,s^b) ∉ R (s), then there is some letter β ∈ Σ which is both the last letter of s^a and the first letter of s^b. In this case, we can write s^a = w β and s^b = β u. Note that u ≠ ε (since s ≠ w β β by definition), and so $ed (s, t) = ed t^{α} (w β, t) + ed (β u, α) \overset{Eq.9}{=} min \{ed (w β, t^{a}) + ed (α, t^{b}) | (t^{a}, t^{b}) \in P (t)\} + ed (β u, α) \overset{\begin{array}{l} Lem.5, \\ Obs.4 \end{array}}{\geq} min \{(ed (w β β, t^{a}) - cont (β)) + ed (α, t^{b}) | (t^{a}, t^{b}) \in P (t)\} + (ed (u, α) + cont (β)) = min \{ed (w β β, t^{a}) + ed (α, t^{b}) | (t^{a}, t^{b}) \in P (t)\} + ed (u, α) \overset{Eq.9}{=} ed t^{α} (w β β, t) + ed (u, α)$ . Here, Lemma 5 gives ed (β u,α) ≥ ed (u,α) + cont (β), and the argument symmetric to Observation 4 gives ed (w β β,t^a) ≤ ed (w β,t^a) + cont (β). From the optimality of the partition (s^a,s^b) = (w β,β u), it follows that ed (s,t) = edt^α(w β β,t) + ed (u,α). If u starts with β this step can be repeated, and inductively we can apply such partition refinements until obtaining a partition (s^a,s^b) of s such that ed (s,t) = edt^α(s^a,t) + ed (s^b,α) and (s^a,s^b) ∈ R (s). Now, Equation 8 can be revised as follows:

\begin{array}{l} ed (s, t) & = min \{\begin{array}{l} ed (s, α) + ed (α, t), \\ ed t^{α} (s^{a}, t) + ed (s^{b}, α) \end{array} |\begin{array}{l} (s^{a}, s^{b}) \in R (s), \\ α \in Σ \end{array}\} \\ (when s is not of the form wββ) \end{array}

(22)

Using the DP formulation, we get

\begin{array}{l} ed (s_{0, i}, t_{0, j}) & = min \{\begin{array}{l} S^{α} [0, i] + T^{α} [0, j], \\ ED T^{α} [h, j] + S^{α} [h, i] \end{array} |\begin{array}{l} h \in Ĩ_{0, i}, \\ α \in Σ \end{array}\} \\ = min \{\begin{array}{l} S^{α} [0, i] + T^{α} [0, j], \\ tr (S^{α}) [i, Ĩ_{0, i}] \otimes ED T^{α} [Ĩ_{0, i}, j] \end{array}| α \in Σ\} \\ (when s_{i - 1} \neq s_{i - 2}) \end{array}

(23)
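The second term of Equation 23 is a min-plus product restricted to the rows listed in $Ĩ_{0, i}$. A minimal sketch for a single entry and a single letter α follows; the function name and the toy matrix values are hypothetical, and the matrices are plain nested lists rather than the data structures of Algorithm 4.

```python
import math

def sparse_min_plus_entry(S_alpha, EDT_alpha, rows, i, j):
    """Return the minimum over h in rows of S_alpha[h][i] + EDT_alpha[h][j],
    or infinity when rows is empty."""
    return min((S_alpha[h][i] + EDT_alpha[h][j] for h in rows), default=math.inf)

# Toy 3 x 3 and 3 x 2 matrices (hypothetical values).
S_alpha = [[0, 1, 2],
           [9, 0, 1],
           [9, 9, 0]]
EDT_alpha = [[5, 5],
             [1, 7],
             [3, 2]]
# Only rows 0 and 2 are run starts in this toy setting, so row 1 is skipped.
value = sparse_min_plus_entry(S_alpha, EDT_alpha, [0, 2], i=2, j=1)
assert value == 2
```

Restricting h to run starts is what reduces the inner dimension of the product from the string length to the number of runs.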

Similarly as shown for the computation of ed (s,t), it is possible to revise the computation of edt^α(s,t) under Constraint 1 and obtain the following equations:

ed t^{α} (s, w β β) = min \{\begin{array}{l} ed t^{α} (s, w β) + dup (β), \\ ed (s, w β) + mut (α, β) \end{array}\}

(24)

\begin{array}{l} ed t^{α} (s_{0, i}, t_{0, j}) & = min \{\begin{array}{l} ED T^{α} [i, j - 1] + dup (t_{j - 1}), \\ ED [i, j - 1] + mut (α, t_{j - 1}) \end{array}\} \\ (when t_{j - 1} = t_{j - 2}) \end{array}

(25)

\begin{array}{l} ed t^{α} (s, t) = & min \{ed (s, t^{a}) + ed (α, t^{b}) | (t^{a}, t^{b}) \in R (t)\} \\ (when t is not of the form wββ) \end{array}

(26)

\begin{array}{l} ed t^{α} (s_{0, i}, t_{0, j}) & = min \{ED [i, h] + T^{α} [h, j] | h \in Ĩ_{0, j}\} \\ = ED [i, Ĩ_{0, j}] \otimes T^{α} [Ĩ_{0, j}, j] \\ (when t_{j - 1} \neq t_{j - 2}) \end{array}

(27)

Finally, we present Algorithm 4, which is an efficient version of Algorithm 2. Stage 1 of the new algorithm is accelerated by using Algorithm 3 and its symmetric counterpart to compute, for every α ∈ Σ, the distances from all substrings of s to α and the distances from α to all substrings of t. In Stage 2, we use the equations developed above in order to accelerate the computation. The correctness of the algorithm follows from the correctness of the recursive equations, and can be asserted similarly as done for Algorithm 2.

Algorithm 4 RL-MATRIX-EDDC( s , t )

Complexity analysis

Assume for simplicity that compressed forms of both input strings s and t have the same length $ñ$ .

Algorithm 3.

The running time of line 1 of the algorithm is O (n) for computing the compressed form of the input string, and $O (| Σ | MP (ñ))$ for running Stage 1 of Algorithm 2 over this compressed string. Lines 2 and 3 require O (n) time, and so the overall time complexity of Algorithm 3 is $O (n + | Σ | MP (ñ))$ . The space complexity for computing and maintaining all matrices is $O (| Σ | ñ^{2})$ , and an additional O (n) space is required for the vector DC. Hence, the overall space complexity of the algorithm is $O (n + | Σ | ñ^{2})$ (see Section “Time complexity analysis” for complexity analysis of Stage 1 of Algorithm 2).

Algorithm 4.

Time and space complexities of Stage 1 of the algorithm are identical to those of Algorithm 3. As in Section “The algorithm”, the computations governing the running time of Stage 2 are those of matrix multiplications performed within recursive calls to RL-COMPUTE-MATRIX.

The recursive computation of RL-COMPUTE-MATRIX can be visualized as a tree (see Figure 4). Each node in the tree corresponds to a call to RL-COMPUTE-MATRIX over some region I_i,k × I_j,l, which is either a leaf in case that k = i + 1 and l = j + 1, or otherwise an internal node. In the latter case, the node has exactly two children, corresponding to the two recursive calls obtained from either a vertical (lines 11 and 13) or a horizontal (lines 16 and 18) partition of the region. For simplicity, assume that the interval length $ñ = |Ĩ_{2, n + 1}| = 2^{x}$ for some integer x. It can be observed that the algorithm alternates between vertical and horizontal partitions along paths from the root of the tree, where regions of two different nodes in the same depth y are disjoint, and the union of all regions of nodes in depth y covers the entire initial region I_2,n + 1 × I_2,n + 1 of the root node. For every 0 ≤ y ≤ log(n), there are two series of intervals $I^{y, 0}, I^{y, 1}, \dots, I^{y, 2^{y} - 1}$ and $J^{y, 0}, J^{y, 1}, \dots, J^{y, 2^{y} - 1}$ , such that the set of regions corresponding to all nodes in depth 2y is $\{I^{y, f} \times J^{y, g} | 0 \leq f < 2^{y}, 0 \leq g < 2^{y}\}$ , and the set of regions corresponding to all nodes in depth 2y + 1 is $\{I^{y, f} \times J^{y + 1, g} | 0 \leq f < 2^{y}, 0 \leq g < 2^{y + 1}\}$ . In addition, the corresponding subsequences $Ĩ^{y, 0}, \dots, Ĩ^{y, 2^{y} - 1}$ and ${\tilde{J}}^{y, 0}, \dots, {\tilde{J}}^{y, 2^{y} - 1}$ all have the same size $2^{x - y}$ .

Consider a node of depth 2y whose corresponding region is $I^{y, f} \times J^{y, g}$ , and the two regions corresponding to its children, $I^{y, f} \times J^{y + 1, 2 g}$ and $I^{y, f} \times J^{y + 1, 2 g + 1}$ . The computation time spent on the node is dominated by the matrix multiplications performed in line 12 of RL-COMPUTE-MATRIX. This includes |Σ| matrix multiplications between pairs of matrices such that the dimensions of the first matrix in each pair are $|I^{y, f}| \times |{\tilde{J}}^{y + 1, 2 g}| = |I^{y, f}| \times 2^{x - y - 1}$ , and the dimensions of the second matrix in each pair are $|{\tilde{J}}^{y + 1, 2 g}| \times |{\tilde{J}}^{y + 1, 2 g + 1}| = 2^{x - y - 1} \times 2^{x - y - 1}$ . Observe that $|I^{y, f}| \geq 2^{x - y - 1}$ , and such multiplications can be implemented by dividing the interval $I^{y, f}$ into $\frac{|I^{y, f}|}{2^{x - y - 1}}$ intervals of length $2^{x - y - 1}$ each, and performing $\frac{|I^{y, f}|}{2^{x - y - 1}}$ multiplications between square matrices of dimensions $2^{x - y - 1} \times 2^{x - y - 1}$ in a total time of $\frac{|I^{y, f}|}{2^{x - y - 1}} MP (2^{x - y - 1})$ . Therefore, the time required for all matrix multiplications performed within nodes in depth 2y is

\begin{array}{l} \sum_{0 \leq f < 2^{y}} \sum_{0 \leq g < 2^{y}} & \frac{| Σ | |I^{y, f}| MP (2^{x - y - 1})}{2^{x - y - 1}} \\ = \sum_{0 \leq f < 2^{y}} \frac{2^{y} | Σ | |I^{y, f}| MP (2^{x - y - 1})}{2^{x - y - 1}} \\ = \frac{2 | Σ | nMP (2^{x - y - 1})}{ñ} . \end{array}

Similarly, it is possible to show that the total time required for all matrix multiplications performed within nodes in depth 2 y + 1 is also $\frac{2 n | Σ | MP (2^{x - y - 1})}{ñ}$ , and so the total computation time of matrix multiplications throughout the entire algorithm run is $O (\frac{| Σ | n}{ñ} \sum_{0 \leq y < x} MP (2^{y}))$ . As in Section “The algorithm”, using the Master Theorem [16], this summation evaluates to $O (\frac{| Σ | nMP (ñ)}{ñ})$ . In addition to matrix multiplications, RL-COMPUTE-MATRIX performs O (|Σ|n²) operations in base computations (lines 2-7), and so the total time complexity of the complete algorithm is $O (| Σ | n^{2} + \frac{| Σ | nMP (ñ)}{ñ})$ .

A simple implementation of Algorithm 4 can be done using the same space complexity of O (|Σ|n²), as the space complexity of Algorithm 2. A more involved implementation can be applied by observing that in fact the algorithm only examines and updates entries in matrices of dimensions at most $n \times ñ$ or $ñ \times n$ when performing matrix multiplications, and in addition it examines adjacent entries “to the left” or “above” an entry in a base-case region. This observation can be used in order to reduce the space complexity of the algorithm to $O (| Σ | nñ)$ , where the complete details of such an implementation are omitted from this text.

A comparison with previous works

In this section, we review the previous main algorithms for EDDC by Behzadi and Steyaert [7], Bérard et al. [8] and Abouelhoda et al. [9], and point out similarities and improvements made in our current work.

The main contribution of our work is in obtaining sub-cubic algorithms for EDDC, whereas all previous algorithms have cubic time complexities (for |Σ| the alphabet size, n the length of the input strings and $ñ$ the length of their RLE compressed forms, the algorithms of [7], [8], and [9] obtain the time complexities $O (n^{2} + n ñ^{2} + | Σ | ñ^{3} + | Σ |^{2} ñ^{2})$ , $O (n^{3} + | Σ | ñ^{3})$ , and $O (n^{2} + n ñ^{2})$ , respectively).

Notably, the algorithm of [9] eliminates a |Σ| factor that appears in the time complexities of the algorithms given in [7, 8] and here. However, this improvement is confined to a constrained model of duplication histories. As we do not assume this model here, we could not use the representation of [9] that allows the elimination of the |Σ| time complexity factor.

In general, the frameworks of all algorithms in [7–9] as well as the algorithms presented here are similar. All these algorithms apply two phases, where the first phase computes costs corresponding to all substrings of each one of the input strings separately, and the second phase uses these precomputed costs in order to compute the edit distance between each pair of prefixes of the input strings (our online variant described in Section “An online algorithm for EDDC using min-plus matrix-vector multiplication for discrete cost functions” interleaves these two phases, yet each operation it conducts can be conceptually attributed to one of the phases). The recursive formulas are similar as well, where those for the first phase can be viewed as special kinds of Weighted Context Free Grammar derivation rules.

Next, we address the cost function constraints. All algorithms assume that operation costs are nonnegative and apply additional assumptions similarly to those listed in our Property 1, which can be made without loss of generality.

In [8], operation costs were limited so that all duplications and contractions have the same constant cost (regardless of the letter over which they are applied), all deletions and insertions have the same constant cost, and all mutation costs are symmetric (i.e. mut (α,β) = mut (β,α) for every α,β ∈ Σ). While it was argued that these restrictions allow edit distance to be a metric, they limit the generality of the algorithm of [8], whereas the rest of the previous algorithms we mentioned can handle scoring schemes that do not necessarily abide by these restrictions.

Both in [7] and in [8], it was required that all duplication and contraction costs are lower than the costs of any of the insertion, deletion, or mutation costs. This restriction is not explicitly stated in [9], yet seems to be required there as well. For the application of minisatellite map comparison, this requirement is somewhat unnatural since it assumes that mutations, which are typically common events, should cost more than the less common events of duplications and contractions. Our algorithms can be applied even when this restriction does not hold. However, one of our algorithms, the RLE variant (Section “Additional acceleration using run-length encoding”) adds a new requirement that was absent from those previous algorithms: it requires that for every α,β ∈ Σ, dup (α) ≤ dup (β) + mut (β,α) ≤ ins (α), and cont (α) ≤ cont (β) + mut (α,β) ≤ del (α) (our Constraint 1). On one hand, our Constraint 1 is more strict than the constraint of [7] and [8], in the sense that it implies nonnegative lower bounds over differences of the form ins (α) - dup (α) and del (α) - cont (α), while in [7] and [8] it was only required that these differences be nonnegative. On the other hand, our Constraint 1 does not require that the cost of mutations be higher than the cost of duplications and contractions.

We showed that our algorithms are more general with respect to the assumed constraints. We also claim that our algorithms are more precise with respect to the formal problem specification. All previous algorithms (excluding the first algorithm by Bérard and Rivals [2], which had an O (n⁴) running time and assumed a constant cost for all mutations in addition to the restrictions in [8]) might output non-optimal solutions in certain cases, as demonstrated in the following example. Consider the input s = a b, t = e f, and the cost function in which all duplications and contractions cost 1, all deletions and insertions cost 20, and the symmetric mutation costs are as given in Table 1. It can be shown that all three algorithms in [7], [8], and [9] would output the value 18 as the edit distance between the input strings, reflecting one of the edit scripts 〈ab,eb,ef〉 or 〈ab,af,ef〉. Nevertheless, the correct value is 17, due to the script 〈ab,cb,cc,c,d,dd,ed,ef〉. Perhaps it could be possible to specify additional restrictions over the cost functions in order to guarantee that the algorithms in [7], [8], and [9] return optimal solutions for all instances.
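To make the structure of the script 〈ab,cb,cc,c,d,dd,ed,ef〉 explicit, the following sketch labels each of its steps with an operation type. The function `classify_step` is a hypothetical helper, not part of any of the algorithms discussed; when a removed or added letter is adjacent to an equal letter, the step is reported as a contraction or duplication, which is assumed here to be the intended reading.

```python
def classify_step(u, v):
    """Name the single edit operation that turns u into v,
    or return None if no single operation does."""
    if len(u) == len(v):
        changed = [k for k in range(len(u)) if u[k] != v[k]]
        return "mutation" if len(changed) == 1 else None
    if len(v) == len(u) - 1:
        for k in range(len(u)):
            if u[:k] + u[k + 1:] == v:
                # Removing u[k] is a contraction if a neighbour equals it.
                in_run = (k > 0 and u[k - 1] == u[k]) or (
                    k + 1 < len(u) and u[k + 1] == u[k])
                return "contraction" if in_run else "deletion"
        return None
    if len(v) == len(u) + 1:
        # A generating step is the reverse of a reducing step.
        reverse = classify_step(v, u)
        return {"contraction": "duplication", "deletion": "insertion"}.get(reverse)
    return None

script = ["ab", "cb", "cc", "c", "d", "dd", "ed", "ef"]
operations = [classify_step(u, v) for u, v in zip(script, script[1:])]
assert operations.count("mutation") == 5
assert operations.count("contraction") == 1
assert operations.count("duplication") == 1
```

The script thus consists of five mutations, one contraction, and one duplication. With duplications and contractions costing 1 each, a total cost of 17 means that the five mutation costs taken from Table 1 sum to 15.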

Table 1 Mutation costs for the instance s = ab , t = ef


Conclusions and discussion

This work presents computational techniques for improving the time complexity of algorithms for the EDDC problem. We adapt the problem to the VMT framework defined in [11], which incorporates efficient matrix multiplication subroutines in order to accelerate standard dynamic programming algorithms. We describe an efficient algorithm, as well as two variants which are even more efficient, given some restrictions on the cost functions.

An additional result we give is the currently most efficient algorithm for the min-plus multiplication of D-discrete matrices (matrices for which differences between adjacent entries are integers within an interval of length D).

We note that the running times of our algorithms depend on the alphabet size |Σ|. For the general algorithm, the running time is O (|Σ|·MP(n)), where MP(n) is the time complexity of the min-plus multiplication of two n × n matrices, which is currently upper-bounded by $O (\frac{n^{3} \log^{3} \log n}{\log^{2} n})$ [12]. Some of the previous algorithms obtain alphabet-independent time complexities, for example the algorithms in [9] and [2]. As we discussed in Section “A comparison with previous works”, such algorithms do not solve the most general variant of the problem and require some assumptions on the cost function. Nevertheless, we believe that the matrix multiplication-based techniques for improving the time complexity presented in this paper can also be incorporated into the algorithm of [9]; however, the details of this enhancement are beyond the scope of this paper.

In contrast to the work of [9], our model assumes that intermediate strings along edit scripts may contain characters which are absent from both source and target strings. This implies that the size of the alphabet |Σ| is not bounded by the length of the input sequences. In the context of minisatellite comparison, identifying a feasible alphabet and cost function for this task is an interesting problem beyond the scope of this paper.

Appendix

Correctness of the recursive computation

This section proves Theorem 1, thus asserting the correctness of the recursive computation for the EDDC problem given in Section “The recurrence formula”. We start by adding some required notation and showing how long edit scripts can be decomposed to shorter partial scripts. Then, we use the observed recursive properties in order to prove the correctness of the recurrence.

Let s,w,t be strings, ${E S}^{1} = 〈s = u^{1, 0}, u^{1, 1}, \dots, u^{1, r_{1}} = w〉$ an edit script from s to w, and ${E S}^{2} = 〈w = u^{2, 0}, u^{2, 1}, \dots, u^{2, r_{2}} = t〉$ an edit script from w to t. Denote by $E S = 〈{E S}^{1}, {E S}^{2}〉$ the concatenated edit script $E S = 〈s = u^{1, 0}, u^{1, 1}, \dots, u^{1, r_{1}} = w = u^{2, 0}, u^{2, 1}, \dots, u^{2, r_{2}} = t〉$ from s to t. Note that $cost (E S) = cost ({E S}^{1}) + cost ({E S}^{2})$ , and $|E S| = |{E S}^{1}| + |{E S}^{2}|$ . This notation extends naturally to concatenations of more than two scripts. For example, $E S = 〈{E S}^{1}, {E S}^{2}, \dots, {E S}^{q}〉$ denotes an edit script from a string s to a string t obtained by a concatenation of q scripts, where each script ${E S}^{i}$ transforms some intermediate string $w^{i - 1}$ into a string $w^{i}$ , with s = w⁰ and t = w^q.

Observation 5.

For every three strings s,w,t, ed(s,t) ≤ ed (s,w) + ed (w,t).

The correctness of the above observation follows from the fact that for a pair of optimal edit scripts ${E S}^{1}$ from s to w and ${E S}^{2}$ from w to t, the script $E S = 〈{E S}^{1}, {E S}^{2}〉$ from s to t satisfies $ed (s, t) \leq cost (E S) = cost ({E S}^{1}) + cost ({E S}^{2}) = ed (s, w) + ed (w, t)$ .

Lemma 7.

For s = s^aus^b and $E S = 〈u = u^{0}, u^{1}, \dots, u^{r} = w〉$ , $cost (E S (s)) \leq cost (E S)$ .

Proof.

Each edit operation transforming s^auⁱs^b to s^au^{i + 1}s^b in $E S (s)$ corresponds to an operation transforming uⁱ to u^{i + 1} in $E S$ . The only cases where corresponding operations may have different costs are those of insertions or deletions in $E S$ at the beginning or end of uⁱ, which become duplications or contractions in $E S (s)$ , respectively. For example, in case the applied operation over uⁱ in $E S$ is the deletion of its first letter α, and α is also the last letter of s^a, then the cost of the operation in $E S$ is del (α) while the cost of the corresponding operation in $E S (s)$ is cont (α) ≤ del (α). Similar scenarios may occur in case of an insertion of a letter at the beginning of uⁱ which is identical to the last letter of s^a, as well as in cases of insertions and deletions at the end of uⁱ of letters identical to the first letter of s^b. In any other case, each pair of corresponding operations has the same cost. Therefore the cost of each operation in $E S (s)$ is smaller than or equal to the cost of its corresponding operation in $E S$ , and $cost (E S (s)) \leq cost (E S)$ . □

Lemma 8.

Let s and t be two strings, and (s^a, s^b) ∈ P (s), (t^a, t^b) ∈ P (t) partitions of s and t, respectively. Then, ed (s,t) ≤ ed (s^a,t^a) + ed (s^b,t^b).

Proof.

Let ${E S}^{a}$ be an optimal script from s^a to t^a and ${E S}^{b}$ an optimal script from s^b to t^b. The script ${E S}^{a} (s)$ is a script from s = s^as^b to t^as^b. Similarly, ${E S}^{b} (t^{a} s^{b})$ is a script from t^as^b to t^at^b = t. For the script $E S = 〈{E S}^{a} (s), {E S}^{b} (t^{a} s^{b})〉$ from s to t, we have that $ed (s, t) \leq cost (E S) = cost ({E S}^{a} (s)) + cost ({E S}^{b} (t^{a} s^{b})) \overset{Lem.7}{\leq} cost ({E S}^{a}) + cost ({E S}^{b}) = ed (s^{a}, t^{a}) + ed (s^{b}, t^{b})$ . □

Let s and t be strings. Call a pair of partitions (s^a, s^b) ∈ P (s) and (t^a, t^b) ∈ P (t) an optimal pairwise partition of s and t if ed (s,t) = ed (s^a, t^a) + ed (s^b, t^b). Say that an edit script $E S$ from s to t is a shortest optimal script from s to t if $E S$ is optimal, and for every other optimal script ${E S}^{'}$ from s to t, $|E S| \leq |{E S}^{'}|$ . For a script $E S = 〈s = u^{0}, u^{1}, \dots, u^{r} = t〉$ from s to t and 0 ≤ i ≤ j ≤ r, denote by ${E S}^{i, j} = 〈u^{i}, u^{i + 1}, \dots, u^{j}〉$ the partial script of $E S$ from uⁱ to u^j.

Observation 6.

Let $E S = 〈u^{0}, u^{1}, \dots, u^{r}〉$ be a shortest optimal edit script from u⁰ to u^r. For every 0 ≤ i ≤ j ≤ r, the partial script ${E S}^{i, j}$ is a shortest optimal edit script from uⁱ to u^j. Moreover, for any shortest optimal script ${E S}^{* i, j}$ from uⁱ to u^j within $E S$ , the script ${E S}^{*} = 〈{E S}^{0, i}, {E S}^{* i, j}, {E S}^{j, r}〉$ is a shortest optimal script from u⁰ to u^r.

The correctness of the above observation is obtained by noting that if ${E S}^{i, j}$ is not a shortest optimal script from uⁱ to u^j, then for some shortest optimal script ${E S}^{* i, j}$ from uⁱ to u^j we get that the script ${E S}^{*} = 〈{E S}^{0, i}, {E S}^{* i, j}, {E S}^{j, r}〉$ either has a lower cost than $E S$ , or is a shorter script of the same cost, in contradiction to $E S$ being a shortest optimal script from u⁰ to u^r.

Lemma 9.

Let $E S = 〈u^{0}, u^{1}, \dots, u^{r}〉$ be a shortest optimal edit script from u⁰ to u^r. If there are two indices 0 ≤ i < j ≤ r such that uⁱ and u^j are strings of length 1, then j = i + 1. In addition, for every 0 < k < r, u^k ≠ ε.

Proof.

Assume there are two indices 0 ≤ i < j ≤ r such that uⁱ and u^j are strings of length 1, i.e. uⁱ = α and u^j = β for some α,β ∈ Σ. From Observation 6, the partial script ${E S}^{i, j} = 〈α = u^{i}, u^{i + 1}, \dots, u^{j} = β〉$ is a shortest optimal script from α to β. Since j > i, it must be that α ≠ β (otherwise the script ${E S}^{'} = 〈{E S}^{0, i}, {E S}^{j, r}〉$ is a shorter script from u⁰ to u^r of no greater cost than $E S$ , in contradiction to $E S$ being a shortest optimal script from u⁰ to u^r). From Property 1, ed (α,β) = mut (α,β), and so the edit script containing the single operation of mutating α to β is an optimal script from α to β, and it must be that j = i + 1.

In addition, assume by contradiction there is some index 0 < k < r such that u^k = ε. The only edit operation which may yield an empty string is a deletion from a single-letter string, and therefore u^{k-1} = α for some letter α. Similarly, the only edit operation which may be applied over an empty string is an insertion, therefore u^{k+1} = β for some letter β, in contradiction to the fact that two intermediate strings of length 1 must be consecutive along a shortest optimal script, as shown above. □

Call an edit script $E S$ from a string s to a string t simple if $E S$ is a shortest optimal script from s to t, in which no generating operation precedes a reducing operation. The following lemma generalizes Lemma 2 of [6], by considering also indels in addition to contractions and duplications.

Lemma 10.

For every pair of strings s and t, there exists a simple edit script from s to t.

Proof.

Let s and t be two strings, and r the length of a shortest optimal script from s to t. When r ≤ 1, any shortest optimal script from s to t either contains no reducing operation or contains no generating operation, and in particular is a simple script. Otherwise, r > 1, and assume by induction the lemma holds for every pair of strings such that the length of a shortest optimal script from the source string to the target string is less than r. Let $E S = 〈s = u^{0}, u^{1}, \dots, u^{r} = t〉$ be a shortest optimal script from s to t.

Case 1: The first operation in $E S$ is not a generating operation. From Observation 6, the partial script ${E S}^{1, r}$ is a shortest optimal script from u¹ to u^r, whose length is r - 1. From the inductive assumption, there is a simple script ${E S}^{* 1, r}$ from u¹ to u^r, and from Observation 6 the script ${E S}^{*} = 〈{E S}^{0, 1}, {E S}^{* 1, r}〉$ is a shortest optimal script from s to t. As the first operation in ${E S}^{*}$ is non-generating (being the same first operation as in $E S$ ), ${E S}^{*}$ is simple, and the lemma follows.

Case 2: The first operation in $E S$ is a generating operation. Similarly as above, we may assume w.l.o.g. by applying the inductive assumption that the partial script ${E S}^{1, r}$ is simple. If this partial script is non-reducing, then $E S$ is non-reducing, and in particular it is simple. Otherwise, let 1≤i<r be the smallest index such that the transformation of uⁱ to uⁱ⁺¹ is by a reducing operation. Since neither generating nor reducing operations may precede this operation in the partial script ${E S}^{1, i}$ , it follows that all operations in the partial script ${E S}^{1, i}$ (if there are any) are mutations.

The generating operation transforming u⁰ to u¹ is either an insertion or a duplication of some letter α in u⁰. In both cases, we can write s = u⁰ = vxw and u¹ = vx^′w (v, x, x^′ and w are strings), where in the former case x = ε and x^′ = α, and in the latter case x = α and x^′ = α α. As all operations in the partial script ${E S}^{1, i}$ are mutations, each intermediate string u^j, for 1 ≤ j ≤ i, is of the form v^jx^jw^j, where v^j, x^j, and w^j are strings obtained by applying zero or more mutations over v, x^′, and w, respectively. We argue that the reducing operation transforming uⁱ = vⁱxⁱwⁱ to uⁱ⁺¹ cannot be the deletion of a letter or a contraction involving at least one letter within the substring xⁱ. This is true, since in such a case it would have been possible to avoid the first generating operation in $E S$ (transforming x to x^′), as well as all mutation operations over a reduced letter in xⁱ, and the reducing operation from uⁱ to uⁱ⁺¹. This would yield a script ${E S}^{* 0, i + 1}$ from u⁰ to uⁱ⁺¹ which is shorter and of no higher cost than ${E S}^{0, i + 1}$ , in contradiction to Observation 6. Hence, the reducing operation from uⁱ to uⁱ⁺¹ either deletes a letter or contracts two letters within one of the substrings vⁱ or wⁱ of uⁱ.

Consider first the case where the reducing operation over uⁱ is applied within its prefix vⁱ. Thus, we can write uⁱ⁺¹ = vⁱ⁺¹xⁱ⁺¹wⁱ⁺¹, where vⁱ⁺¹ is the string obtained by applying the corresponding reducing operation over vⁱ, xⁱ⁺¹ = xⁱ, wⁱ⁺¹ = wⁱ, and cost (〈 uⁱ,uⁱ⁺¹ 〉) = cost (〈 vⁱ,vⁱ⁺¹ 〉). The operations in ${E S}^{0, i + 1}$ can be assigned to two independent scripts: a script ${E S}_{v} = 〈v = v^{' 0}, v^{' 1}, \dots, v^{′p} = v^{i + 1}〉$ from v to vⁱ⁺¹ obtained by merging each multiple occurrence of consecutive identical strings in the series v = v¹,v²,…,vⁱ⁺¹ into a single occurrence, and similarly a script ${E S}_{xw} = 〈xw = {(xw)}^{0}, {(xw)}^{1} = x^{'} w = x^{1} w^{1}, {(xw)}^{2}, \dots, {(xw)}^{q} = x^{i + 1} w^{i + 1} 〉$ from x w to xⁱ⁺¹wⁱ⁺¹. Each operation in ${E S}^{0, i + 1}$ corresponds to exactly one operation in either ${E S}_{v}$ or ${E S}_{xw}$ , where the costs of corresponding operations are equal, and therefore $cost ({E S}^{0, i + 1}) = cost ({E S}_{v}) + cost ({E S}_{xw})$ and $| {E S}^{0, i + 1} | = | {E S}_{v} | + | {E S}_{xw} |$ .

Now, the script ${E S}_{v} (u^{0}) = 〈u^{0} = vxw = v^{' 0} xw, v^{' 1} xw, \dots, v^{' p} xw = v^{i + 1} xw〉$ is a script from u⁰ to vⁱ⁺¹xw, and similarly the script ${E S}_{xw} (v^{i + 1} xw) = 〈v^{i + 1} xw = v^{i + 1} {(xw)}^{0}, v^{i + 1} {(xw)}^{1}, \dots, v^{i + 1} {(xw)}^{q} = v^{i + 1} x^{i + 1} w^{i + 1} = u^{i + 1}〉$ is a script from vⁱ⁺¹xw to uⁱ⁺¹. Thus, the script ${E S}^{* 0, i + 1} = 〈{E S}_{v} (u^{0}), {E S}_{xw} (v^{i + 1} xw)〉$ is a script from u⁰ to uⁱ⁺¹. Since ${E S}_{v}$ contains at least one operation (the reducing operation from vⁱ to vⁱ⁺¹) and no generating operation (since besides the reducing operation ${E S}_{v}$ may contain only mutations), ${E S}^{* 0, i + 1}$ starts with a non-generating operation. In addition, $cost ({E S}^{* 0, i + 1}) = cost ({E S}_{v} (u^{0})) + cost ({E S}_{xw} (v^{i + 1} xw)) \overset{Lem.7}{\leq} cost ({E S}_{v}) + cost ({E S}_{xw}) = cost ({E S}^{0, i + 1})$ and $|{E S}^{* 0, i + 1}| = |{E S}_{v} (u^{0})| + |{E S}_{xw} (v^{i + 1} xw)| = |{E S}_{v}| + |{E S}_{xw}| = |{E S}^{0, i + 1}|$ . From Observation 6, ${E S}^{0, i + 1}$ is a shortest optimal script from u⁰ to uⁱ⁺¹, and so ${E S}^{* 0, i + 1}$ is a shortest optimal script from u⁰ to uⁱ⁺¹. Applying Observation 6 again, the script ${E S}^{*} = 〈{E S}^{* 0, i + 1}, {E S}^{i + 1, r}〉$ is a shortest optimal script from s to t. Now, the lemma follows from Case 1 of this proof and from the fact that the first operation in ${E S}^{*}$ is not a generating operation. □
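The ordering condition in the definition of a simple script (no generating operation precedes a reducing operation) can be checked mechanically on a given sequence of intermediate strings. The sketch below is illustrative only: the helper names `classify` and `no_generating_before_reducing` are hypothetical, an operation is classified purely by how it changes the string, and the check does not verify that the script is a shortest optimal one.

```python
def classify(u, v):
    """Classify the single operation turning u into v.

    A length-preserving change of one letter is a mutation, adding one letter
    (insertion or duplication) is generating, and removing one letter
    (deletion or contraction) is reducing.
    """
    if len(v) == len(u):
        if sum(a != b for a, b in zip(u, v)) == 1:
            return "mutation"
    elif len(v) == len(u) + 1:
        if any(v[:i] + v[i + 1:] == u for i in range(len(v))):
            return "generating"
    elif len(v) == len(u) - 1:
        if any(u[:i] + u[i + 1:] == v for i in range(len(u))):
            return "reducing"
    raise ValueError(f"{u!r} -> {v!r} is not a single edit operation")


def no_generating_before_reducing(script):
    """Check the ordering condition of a simple script on a list of strings."""
    seen_generating = False
    for u, v in zip(script, script[1:]):
        kind = classify(u, v)
        if kind == "generating":
            seen_generating = True
        elif kind == "reducing" and seen_generating:
            return False
    return True
```

For example, the script `["aab", "ab", "cb", "cbb"]` (reducing, mutation, generating) satisfies the condition, whereas `["a", "aa", "a"]` does not.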

Lemma 11.

For every α ∈ Σ and every nonempty string t, any simple script from α to t is non-reducing.

Proof.

Let $E S$ be a simple script from α to t, and assume by contradiction that $E S$ contains a reducing operation. Since $E S$ is simple, all reducing operations in $E S$ occur prior to any generating operation, and in particular the first reducing operation is applied after applying zero or more mutations over α. Such a reducing operation must be a deletion from a string of length 1, resulting in an empty intermediate string, in contradiction to Lemma 9. □

Lemma 12.

Let w and t be strings and β a letter, such that w ≠ ε, t is of length at least 2, and there is a non-reducing simple script from w β to t. Then, ed (w β,t) = min {ed (w,t^a) + ed (β,t^b) | (t^a,t^b) ∈ P (t)}.

Proof.

Let $E S = 〈wβ = u^{0}, u^{1}, \dots, u^{r} = t〉$ be a non-reducing simple script from w β to t. For every 0 ≤ i ≤ r, construct a partition (u^{i, a}, u^{i, b}) of uⁱ which satisfies ed (w β, uⁱ) ≥ ed (w, u^{i, a}) + ed (β, u^{i, b}), as follows. For i = 0, set (u^{0, a}, u^{0, b}) = (w, β), where by definition ed (w β, u⁰) = ed (w, u^{0, a}) + ed (β, u^{0, b}) = 0. Now, assume inductively for some 0 < i ≤ r and a partition (u^{i-1, a}, u^{i-1, b}) of u^{i-1} that ed (w β, u^{i-1}) ≥ ed (w, u^{i-1, a}) + ed (β, u^{i-1, b}). If the non-reducing operation transforming u^{i-1} to uⁱ is a mutation, an insertion, or a duplication of a letter within the prefix u^{i-1, a}, then set u^{i, a} to be the string obtained by applying this operation over u^{i-1, a}, and set u^{i, b} = u^{i-1, b}. Otherwise, the operation is a mutation, an insertion, or a duplication of a letter within the suffix u^{i-1, b}, and in this case set u^{i, b} to be the string obtained by applying this operation over u^{i-1, b}, and u^{i, a} = u^{i-1, a}. Note that in both cases, $cost ({E S}^{i - 1, i}) = ed (u^{i - 1, a}, u^{i, a}) + ed (u^{i - 1, b}, u^{i, b})$ , therefore we get from the inductive assumption that $ed (w β, u^{i}) \overset{Obs.6}{=} cost ({E S}^{0, i}) = cost ({E S}^{0, i - 1}) + cost ({E S}^{i - 1, i}) \geq (ed (w, u^{i - 1, a}) + ed (β, u^{i - 1, b})) + (ed (u^{i - 1, a}, u^{i, a}) + ed (u^{i - 1, b}, u^{i, b})) \overset{Obs.5}{\geq} ed (w, u^{i, a}) + ed (β, u^{i, b})$ .

The process above generates a partition (t^{∗a}, t^{∗b}) = (u^{r, a}, u^{r, b}) of t = u^r, for which ed (w β, t) ≥ ed (w, t^{∗a}) + ed (β, t^{∗b}). In particular, ed (w β,t) ≥ min {ed (w,t^a) + ed (β,t^b) | (t^a,t^b) ∈ P (t)}. On the other hand, $ed (w β, t) \overset{Lem.8}{\leq} min \{ed (w, t^{a}) + ed (β, t^{b}) | (t^{a}, t^{b}) \in P (t)\}$ , and the lemma follows. □

Based on the above observations and lemmas, we now turn to prove the recursive computation given in Section “The recurrence formula”, starting with Equation 7. Fix henceforth a pair of input strings s and t, each containing at least two letters. Note that for every α ∈ Σ and every pair of partitions (s^a, s^b) ∈ P (s) and (t^a, t^b) ∈ P (t), $ed (s, t) \overset{Obs.5}{\leq} ed (s, α) + ed (α, t)$ , and $ed (s, t) \overset{Lem.8}{\leq} ed (s^{a}, t^{a}) + ed (s^{b}, t^{b}) \overset{Obs.5}{\leq} ed (s^{a}, t^{a}) + ed (s^{b}, α) + ed (α, t^{b})$ , therefore

\begin{array}{l} ed (s, t) \leq min \{\begin{array}{l} ed (s, α) + ed (α, t), \\ ed (s^{a}, t^{a}) + ed (s^{b}, α) + ed (α, t^{b}) \end{array} |\begin{array}{l} (s^{a}, s^{b}) \in P (s), \\ (t^{a}, t^{b}) \in P (t), \\ α \in Σ \end{array}\} . \end{array}

Thus, to prove the correctness of Equation 7, it remains to show that

\begin{array}{l} ed (s, t) \geq min \{\begin{array}{l} ed (s, α) + ed (α, t), \\ ed (s^{a}, t^{a}) + ed (s^{b}, α) + ed (α, t^{b}) \end{array} |\begin{array}{l} (s^{a}, s^{b}) \in P (s), \\ (t^{a}, t^{b}) \in P (t), \\ α \in Σ \end{array}\} . \end{array}

From Lemma 10, there is a simple script $E S = 〈s = u^{0}, u^{1}, \dots, u^{r} = t〉$ from s to t, and in particular, there is a string uⁱ along $E S$ such that the partial script ${E S}^{0, i}$ is non-generating, and the partial script ${E S}^{i, r}$ is non-reducing. Recall that $ed (s, t) = cost (E S) = cost ({E S}^{0, i}) + cost ({E S}^{i, r}) \overset{Obs.6}{=} ed (s, u^{i}) + ed (u^{i}, t)$ .

If uⁱ= β for some letter β, then ed (s,t) = ed (s,β) + ed (β,t) ≥ min {ed (s,α) + ed (α,t) | α ∈ Σ}. Otherwise, uⁱ contains at least two letters. In this case, we can write uⁱ = w β, where β is the last letter in uⁱ and w is the nonempty prefix of uⁱ containing all letters except for the last one. From Lemma 12, ed (uⁱ,t) = ed (w β, t) = min {ed (w, t^a) + ed (β, t^b) | (t^a, t^b) ∈ P (t)}. Symmetrically, it is possible to show that ed (s, uⁱ) = min {ed (s^a, w) + ed (s^b, β) | (s^a, s^b) ∈ P (s)}, and so

\begin{array}{rl} ed (s, t) & = ed (s, u^{i}) + ed (u^{i}, t) \\ & = min \{ed (s^{a}, w) + ed (s^{b}, β) | (s^{a}, s^{b}) \in P (s)\} \\ & \quad + min \{ed (w, t^{a}) + ed (β, t^{b}) | (t^{a}, t^{b}) \in P (t)\} \\ & \overset{Obs.5}{\geq} min \{ed (s^{a}, t^{a}) + ed (s^{b}, β) + ed (β, t^{b}) | (s^{a}, s^{b}) \in P (s), (t^{a}, t^{b}) \in P (t)\} \\ & \geq min \{\begin{array}{l} ed (s, α) + ed (α, t), \\ ed (s^{a}, t^{a}) + ed (s^{b}, α) + ed (α, t^{b}) \end{array} |\begin{array}{l} (s^{a}, s^{b}) \in P (s), \\ (t^{a}, t^{b}) \in P (t), \\ α \in Σ \end{array}\}, \end{array}

concluding the proof of Equation 7.

We next continue to develop the recursive computation, considering the simpler cases where one of the input strings is either empty or contains a single letter, and the other string contains at least two letters. Let $E S$ be a simple script from ε to t whose length is r. $E S$ must start with an insertion of some letter α, and from Observation 6, the remainder of the script ${E S}^{1, r}$ is an optimal script from α to t, implying the correctness of Equation 1. The correctness of Equation 4 is shown symmetrically.

Now, consider the computation of ed (α,t), as expressed in the last term of Equation 2. From Lemma 11, a simple script from α to t is non-reducing, and so the first operation in such a script is either the mutation of α, or some generating operation. If there is such a script in which the first operation is generating, then ed (α,t) = ed^′ (α,t) = mut (α,α) + ed^′ (α,t) ≥ min {mut (α,β) + ed^′ (β,t) | β ∈ Σ}. Else, there is a simple script from α to t in which the first operation is the mutation of α into some letter β. Due to Lemma 9, the following operation must be a generating operation, and so the remainder of the script is an optimal script from β to t in which the first operation is generating, implying again that ed (α,t) ≥ min {mut (α,β) + ed^′ (β,t) | β ∈ Σ}. The other direction of the inequality is shown as was done above for Equation 7, concluding the correctness proof of Equation 2.

We now address the correctness of Equation 3. Consider the minimum cost of a script from α to t which starts with a generating operation. Let $E S$ be such a script, and let r denote its length.

For the case where the first operation in $E S$ is an insertion of some letter γ after α, from Observation 6 we get that ${E S}^{1, r}$ is a non-reducing optimal script from u¹ = α γ to t, and therefore in this case

\begin{array}{rl} ed^{'} (α, t) & = cost (E S) = ins (γ) + ed (α γ, t) \\ & \overset{Lem.12}{=} min \{ins (γ) + ed (α, t^{a}) + ed (γ, t^{b}) | (t^{a}, t^{b}) \in P (t)\} \\ & \overset{Obs.5}{\geq} min \{ed (α, t^{a}) + ed (ε, t^{b}) | (t^{a}, t^{b}) \in P (t)\} . \end{array}

The cases where the first operation in $E S$ is the insertion of some letter before α, or the duplication of α, are solved similarly and imply that

\begin{array}{l} ed^{'} (α, t) \geq min \{\begin{array}{l} ed (α, t^{a}) + ed (ε, t^{b}), \\ ed (ε, t^{a}) + ed (α, t^{b}), \\ dup (α) + ed (α, t^{a}) + ed (α, t^{b}) \end{array}| (t^{a}, t^{b}) \in P (t)\} . \end{array}

The other direction of the inequality is shown in the same way as for Equation 7, concluding the proof of Equation 3. The correctness of Equations 5 and 6 is shown symmetrically.

Correctness of Algorithm 2

We next show that when the precondition of COMPUTE-MATRIX holds with respect to its input region I_{i, k} × I_{j, l}, executing the procedure establishes its postcondition, i.e., the procedure correctly computes all entries in the input region within EDT^α and ED.

The base case of COMPUTE-MATRIX occurs when k = i + 1 and l = j + 1. In this case, I_{i, k} = {i} and I_{j, l} = {j}, and from the precondition we get that $EDT^{α} [i, j] = ED [i, I_{1, j}] ⊗ T^{α} [I_{1, j}, j] \overset{Eq.13}{=} edt^{α} (s_{0, i}, t_{0, j})$ , and ED [i, j] = min {tr (S^α)[i, I_{1, i}] ⊗ EDT^α[I_{1, i}, j] | α ∈ Σ}. After running line 2 of the procedure, we have from Equation 12 that ED [i, j] = ed (s_{0, i}, t_{0, j}). Thus, all entries of the form EDT^α[i, j] and the entry ED [i, j] are correctly computed, and the postcondition holds.

Else, either k > i + 1 or l > j + 1. In the case where l - j ≥ k - i (lines 5-8), the algorithm partitions vertically the region to be computed into two parts of approximately equal sizes. Let $h = ⌈\frac{j + l}{2}⌉$ be the value computed in line 5 of the procedure. Note that from Item 1 of Observation 1, the fact that EDT^α[I_{i, k}, I_{j, l}] = ED [I_{i, k}, I_{1, j}] ⊗ T^α[I_{1, j}, I_{j, l}] implies that EDT^α[I_{i, k}, I_{j, h}] = ED [I_{i, k}, I_{1, j}] ⊗ T^α[I_{1, j}, I_{j, h}], and similarly ED [I_{i, k}, I_{j, h}] = min {tr (S^α)[I_{i, k}, I_{1, i}] ⊗ EDT^α[I_{1, i}, I_{j, h}] | α ∈ Σ}. Thus, all requirements of the precondition with respect to the region I_{i, k} × I_{j, h} are met, and the procedure is called recursively in line 6 over this region. From the postcondition of the recursive call, upon arriving at line 7 all entries in the region I_{i, k} × I_{j, h} in matrices EDT^α and ED contain the solutions for the corresponding sub-instances. In particular, it may be observed that at this point of the run, all requirements for the precondition to hold with respect to the region I_{i, k} × I_{h, l} are met, with the exception of the requirements regarding entries in the region I_{i, k} × I_{h, l} of matrices EDT^α. Again, from the precondition and Observation 1, at this stage EDT^α[I_{i, k}, I_{h, l}] = ED [I_{i, k}, I_{1, j}] ⊗ T^α[I_{1, j}, I_{h, l}] for every α ∈ Σ. From Item 3 of Observation 1, min {EDT^α[I_{i, k}, I_{h, l}], ED [I_{i, k}, I_{j, h}] ⊗ T^α[I_{j, h}, I_{h, l}]} = min {ED [I_{i, k}, I_{1, j}] ⊗ T^α[I_{1, j}, I_{h, l}], ED [I_{i, k}, I_{j, h}] ⊗ T^α[I_{j, h}, I_{h, l}]} = ED [I_{i, k}, I_{1, h}] ⊗ T^α[I_{1, h}, I_{h, l}], and therefore after executing line 7, the precondition holds with respect to the region I_{i, k} × I_{h, l}. After returning from the recursive call in line 8, all entries in the region I_{i, k} × I_{j, l} are computed, and the postcondition of the procedure is met. The correctness of the computation conducted in lines 10-13 in the case where l - j < k - i is shown similarly.

Note that the initial call to COMPUTE-MATRIX from line 5 of Algorithm 2 is applied over the complete region I_{i, k} × I_{j, l} = I_{2, n+1} × I_{2, n+1}. It may be observed that after the initialization in lines 3 and 4 of Algorithm 2, the precondition of COMPUTE-MATRIX is met with respect to this region. Therefore, it follows from the postcondition that once the computation terminates all entries in matrices EDT^α and ED contain the solutions for the corresponding sub-instances. In particular, ED [n, n] holds the solution ed (s_{0, n}, t_{0, n}) = ed (s, t), and the returned value in line 6 of Algorithm 2 is correct.
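The update analyzed above for line 7 rests on the property cited from Item 3 of Observation 1: a min-plus product taken over a concatenated range of inner indices equals the entrywise minimum of the two partial products. The following sketch checks this identity numerically on random integer matrices. It is not an implementation of COMPUTE-MATRIX; the function names, matrix sizes, and value ranges are assumptions made for illustration.

```python
import random


def min_plus(A, B):
    """Min-plus product: C[i][j] = min over r of A[i][r] + B[r][j]."""
    inner, cols = len(B), len(B[0])
    return [[min(row[r] + B[r][j] for r in range(inner)) for j in range(cols)]
            for row in A]


def split_identity_holds(rows=4, inner=6, cols=5, split=2, seed=0):
    """Compare the product over all inner indices with the minimum of the
    products over the index ranges [0, split) and [split, inner)."""
    rng = random.Random(seed)
    A = [[rng.randint(0, 20) for _ in range(inner)] for _ in range(rows)]
    B = [[rng.randint(0, 20) for _ in range(cols)] for _ in range(inner)]
    left = min_plus([row[:split] for row in A], B[:split])
    right = min_plus([row[split:] for row in A], B[split:])
    combined = [[min(x, y) for x, y in zip(lrow, rrow)]
                for lrow, rrow in zip(left, right)]
    return combined == min_plus(A, B)
```

In the procedure, the first partial product is the value already stored in EDT^α for the region, and the second is the product computed in line 7, so taking their minimum extends the inner range from I_{1, j} to I_{1, h}.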

Proofs to lemmas corresponding to the EDDC algorithm for discrete cost functions

Proof of lemma 1: Matrix multiplications computed along the run of Algorithm 2 occur in lines 7 and 12 of Procedure COMPUTE-MATRIX, and additional such multiplications occur implicitly when the Inside-VMT algorithm is used in Stage 1 of the algorithm. Note that in such computations, all entries in the multiplied sub-matrices already contain the computed solutions for the corresponding sub-instances. In addition, matrix multiplications conducted by the Inside-VMT algorithm are applied only over sub-matrices $A [I_{i_{1}, i_{2}}, I_{j_{1}, j_{2}}]$ such that i₂ ≤ j₁ (see [11]), and thus, D-discreteness in matrices computed in Stage 1 needs to be shown only with respect to adjacent entries A [i, j], A [i - 1, j] such that i ≤ j. In what follows, let 0 < i ≤ n and 0 ≤ j ≤ n be two integers, where n is the length of s and t.

Consider first the pair of adjacent entries ED [i, j] and ED [i - 1, j], which already contain the corresponding sub-instance solutions ed (s_{0, i}, t_{0, j}) and ed (s_{0, i-1}, t_{0, j}), respectively. An edit script transforming s_{0, i-1} to t_{0, j} can be composed by first inserting the letter s_{i-1} at the end of s_{0, i-1} to obtain the string s_{0, i} at cost ins (s_{i-1}), and then transforming s_{0, i} to t_{0, j} by applying an optimal script at cost ed (s_{0, i}, t_{0, j}). Therefore, ed (s_{0, i-1}, t_{0, j}) ≤ ins (s_{i-1}) + ed (s_{0, i}, t_{0, j}). Also, an edit script transforming s_{0, i} to t_{0, j} can be composed by first deleting the last letter s_{i-1} from s_{0, i} at cost del (s_{i-1}), and then transforming s_{0, i-1} to t_{0, j} at cost ed (s_{0, i-1}, t_{0, j}). Therefore, ed (s_{0, i}, t_{0, j}) ≤ del (s_{i-1}) + ed (s_{0, i-1}, t_{0, j}). Thus, a ≤ - del (s_{i-1}) ≤ ed (s_{0, i-1}, t_{0, j}) - ed (s_{0, i}, t_{0, j}) ≤ ins (s_{i-1}) < b. Since all operation costs are integers, the cost of any edit script is an integer. Hence, after the adjacent entries ED [i - 1, j] and ED [i, j] are computed, ED [i - 1, j] - ED [i, j] = ed (s_{0, i-1}, t_{0, j}) - ed (s_{0, i}, t_{0, j}) is an integer within the interval D = I_{a, b}. The D-discreteness proofs for computed sub-matrices in all matrices of the form T^′α, T^α, T^ε, S^′α, S^α, S^ε (as well as for the transformed matrix tr (S^α)) are obtained similarly.

For the matrix EDT^α, note that there exists an integer l^∗ such that $ed t^{α} (s_{0, i - 1}, t_{0, j}) \overset{Eq.9}{=} ed (s_{0, i - 1}, t_{0, l^{*}}) + ed (α, t_{l^{*}, j})$ . In addition, in the same manner as above, for the same l^∗ we get that $ed t^{α} (s_{0, i}, t_{0, j}) \overset{Eq.9}{\leq} ed (s_{0, i}, t_{0, l^{*}}) + ed (α, t_{l^{*}, j}) \leq del (s_{i - 1}) + ed (s_{0, i - 1}, t_{0, l^{*}}) + ed (α, t_{l^{*}, j}) = del (s_{i - 1}) + ed t^{α} (s_{0, i - 1}, t_{0, j})$ . Similarly, it can be shown that edt^α(s_{0, i-1}, t_{0, j}) ≤ ins (s_{i-1}) + edt^α(s_{0, i}, t_{0, j}), and so after the entries EDT^α[i - 1, j] and EDT^α[i, j] are computed, EDT^α[i - 1, j] - EDT^α[i, j] = edt^α(s_{0, i-1}, t_{0, j}) - edt^α(s_{0, i}, t_{0, j}) is an integer within D.
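The bound on adjacent entries can be illustrated on a simpler, substituted model: plain weighted edit distance without duplications and contractions, computed by the textbook dynamic program. This is not the EDDC computation of the paper, and the uniform integer costs `INS`, `DEL`, and `MUT` are hypothetical. The same insert-or-delete-the-last-letter argument applies, so every difference between vertically adjacent entries lies between -DEL and INS.

```python
INS, DEL, MUT = 2, 3, 1  # hypothetical uniform integer costs


def edit_table(s, t):
    """Table D with D[i][j] = plain edit distance between s[:i] and t[:j]."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * DEL                  # delete all of s[:i]
    for j in range(1, m + 1):
        D[0][j] = j * INS                  # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + DEL,
                D[i][j - 1] + INS,
                D[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else MUT),
            )
    return D


def adjacent_row_differences(D):
    """All differences D[i-1][j] - D[i][j] between vertically adjacent entries."""
    return [D[i - 1][j] - D[i][j]
            for i in range(1, len(D)) for j in range(len(D[0]))]
```

With these costs, any interval D = I_{a, b} such that a ≤ -3 and b > 2 contains all values returned by `adjacent_row_differences`.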

Proof of lemma 2: Consider a pair of adjacent entries Z [ i- 1,j],Z [ i,j] in Z. Let r₁ and r₂ be indices, such that Z [ i - 1,j] = X [ i - 1,r₁] + Y [ r₁,j] and Z [ i,j] = X [ i,r₂] + Y [ r₂,j]. Then:

\begin{array}{rl} Z [i - 1, j] - Z [i, j] & = X [i - 1, r_{1}] + Y [r_{1}, j] - (X [i, r_{2}] + Y [r_{2}, j]) \\ & \leq X [i - 1, r_{2}] + Y [r_{2}, j] - (X [i, r_{2}] + Y [r_{2}, j]) \\ & = X [i - 1, r_{2}] - X [i, r_{2}] < b. \end{array}

Similarly, it can be shown that Z [ i - 1,j] - Z [ i,j] ≥ a. Since X and Y contain only integer entries, it follows that Z contains only integer entries, and thus Z [ i - 1,j] - Z [ i,j] is an integer within D.

Proof of lemma 3: Consider a pair of adjacent entries Z [ i - 1,j],Z [ i,j] in Z. Then:

\begin{array}{rl} Z [i - 1, j] - Z [i, j] & = min \{X [i - 1, j], Y [i - 1, j]\} - min \{X [i, j], Y [i, j]\} \\ & < min \{X [i, j] + b, Y [i, j] + b\} - min \{X [i, j], Y [i, j]\} = b. \end{array}

Similarly, it can be shown that Z [ i - 1,j] - Z [ i,j] ≥ a. Since X and Y contain only integer entries, it follows that Z contains only integer entries, and thus Z [ i - 1,j] - Z [ i,j] is an integer within D.
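The closure properties argued in the proofs of Lemmas 2 and 3 can be spot-checked numerically. The sketch below assumes, as in those proofs, that D-discreteness means that every difference M[i-1][j] - M[i][j] is an integer in {a, ..., b-1}. The helper names, the interval endpoints a = -2 and b = 3, and the matrix sizes are hypothetical. As in the proof of Lemma 2, the check of the product uses only the discreteness of the left operand.

```python
import random


def min_plus(X, Y):
    """Min-plus product: Z[i][j] = min over r of X[i][r] + Y[r][j]."""
    inner, cols = len(Y), len(Y[0])
    return [[min(row[r] + Y[r][j] for r in range(inner)) for j in range(cols)]
            for row in X]


def random_discrete_matrix(rows, cols, a, b, rng):
    """Integer matrix with M[i-1][j] - M[i][j] in {a, ..., b-1} for all i > 0."""
    M = [[rng.randint(-10, 10) for _ in range(cols)]]
    for _ in range(rows - 1):
        M.append([above - rng.randint(a, b - 1) for above in M[-1]])
    return M


def is_discrete(M, a, b):
    """Check a <= M[i-1][j] - M[i][j] < b for every vertically adjacent pair."""
    return all(a <= M[i - 1][j] - M[i][j] < b
               for i in range(1, len(M)) for j in range(len(M[0])))


def closure_holds(a=-2, b=3, n=6, seed=0):
    rng = random.Random(seed)
    X = random_discrete_matrix(n, n, a, b, rng)
    Y = random_discrete_matrix(n, n, a, b, rng)
    product = min_plus(X, Y)                           # setting of Lemma 2
    entrywise_min = [[min(x, y) for x, y in zip(xr, yr)]
                     for xr, yr in zip(X, Y)]          # setting of Lemma 3
    return is_discrete(product, a, b) and is_discrete(entrywise_min, a, b)
```

A random check of this kind does not replace the proofs; it only guards against misreading the direction of the inequalities.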

Proof of lemma 4: Since both x and y are D-discrete, for every 0 < i < q, x_i < x₀ + ib and y₀ + ia ≤ y_i. Hence, since i < q and |D| = b - a, x_i < x₀ + ib = x₀ + ib + (y₀ - y₀ + ia - ia) = (y₀ + ia) + i (b - a) - (y₀ - x₀) < y_i + q|D| - (y₀ - x₀). Therefore, when y₀ - x₀ ≥ q|D|, x_i < y_i for every 0 ≤ i < q (for i = 0 this holds directly, since q|D| > 0).
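The statement of Lemma 4 can be tested on random vectors. The sketch assumes that a D-discrete vector is one whose consecutive increments x_i - x_{i-1} are integers in {a, ..., b-1}, which yields the two bounds used at the start of the proof; the helper names and the values of a, b, and q are hypothetical.

```python
import random


def discrete_vector(start, q, a, b, rng):
    """Vector of length q starting at `start` with increments in {a, ..., b-1}."""
    vec = [start]
    for _ in range(q - 1):
        vec.append(vec[-1] + rng.randint(a, b - 1))
    return vec


def strictly_below(a=-2, b=3, q=8, seed=0):
    """Return True if x_i < y_i for all i when y_0 - x_0 = q * |D|."""
    rng = random.Random(seed)
    size_d = b - a                                  # |D| for D = {a, ..., b-1}
    x = discrete_vector(0, q, a, b, rng)
    y = discrete_vector(q * size_d, q, a, b, rng)   # smallest gap the lemma allows
    return all(xi < yi for xi, yi in zip(x, y))
```

The gap y₀ - x₀ = q|D| is the boundary case of the lemma; larger gaps only widen the margin.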

Proofs to lemmas corresponding to the run-length encoding based EDDC algorithm

Proof of lemma 5: We show that ed (α,β w) ≥ ed (α,w) + dup (β), where the other inequalities are proven similarly.

Let r be the length of a simple script from α to β w. Observe that r ≥ 1, since by definition β w ≠ α. When r = 1, the single operation applied over α must be a generating operation (since w ≠ ε). As discussed in Section “Additional acceleration using run-length encoding”, we may assume this operation is the duplication of α, and so β = w = α. In this case, ed (α,β w) = dup (α) = ed (α,w) + dup (β).

When r > 1, assume by induction the lemma holds for every instance such that the length of a simple script from the source to the target string is less than r. A simple script from α to β w is non-reducing (as shown in Section “Correctness of the recursive computation”). If there is such a script in which the first operation is a mutation, then this operation mutates α into some letter γ ≠ α, and the remainder of the script is a simple script of length r - 1 from γ to β w. In this case, the inductive assumption implies that $ed (α, β w) = mut (α, γ) + ed (γ, β w) \geq mut (α, γ) + ed (γ, w) + dup (β) \overset{Obs.5}{\geq} ed (α, w) + dup (β)$ . Otherwise, there is a simple script from α to β w which starts with a generating operation. Again, we may assume this generating operation is the duplication of α. As shown in Section “Correctness of the recursive computation”, this implies that ed (α,β w) = dup (α) + ed (α,t^a) + ed (α,t^b) for some partition (t^a,t^b) ∈ P (β w), where the lengths of simple scripts from α to t^a and to t^b are strictly less than r. If t^a = β and t^b = w, then $ed (α, β w) = dup (α) + ed (α, β) + ed (α, w) = dup (α) + mut (α, β) + ed (α, w) \overset{Const.1}{\geq} ed (α, w) + dup (β)$ . Otherwise, t^a is of the form β u and w = ut^b for some string u ≠ ε. From the inductive assumption, $ed (α, β w) = dup (α) + ed (α, β u) + ed (α, t^{b}) \geq dup (α) + ed (α, u) + dup (β) + ed (α, t^{b}) \overset{Lem.8}{\geq} dup (α) + ed (α α, w) + dup (β) \overset{Obs.5}{\geq} ed (α, w) + dup (β)$ .

Proof of lemma 6: Note that if w = ε or w = β for some β ∈ Σ, then $\tilde{w} = w$ , dupcost (w) = contcost (w) = 0, and the lemma holds in a straightforward manner. Otherwise, w is of length at least 2, and we prove the lemma by induction over the length r of a simple script between α and w. Assume by induction the lemma holds for every pair of input strings such that the length of a simple script from the source to the target string is less than r. We show that $ed (α, w) = ed (α, \tilde{w}) + dupcost (w)$ , where the proof that $ed (w, α) = contcost (w) + ed (\tilde{w}, α)$ is symmetric.

Observe that $ed (α, w) \overset{Obs.5}{\leq} ed (α, \tilde{w}) + ed (\tilde{w}, w) \leq ed (α, \tilde{w}) + dupcost (w)$ , and therefore it remains to show that $ed (α, w) \geq ed (α, \tilde{w}) + dupcost (w)$ . As discussed in the proof of Lemma 5, there is a simple script from α to w which either starts with a mutation of α or its duplication. If the first operation in such a script is the mutation of α to some letter β ∈ Σ, then the remainder of the script is a simple script from β to w of length r - 1, and from the inductive assumption $ed (α, w) = mut (α, β) + ed (β, w) = mut (α, β) + ed (β, \tilde{w}) + dupcost (w) \overset{Obs.5}{\geq} ed (α, \tilde{w}) + dupcost (w)$ . Otherwise, the first operation is the duplication of α, and there is some partition (w^a,w^b) ∈ P (w) such that ed (α,w) = dup (α) + ed (α,w^a) + ed (α,w^b) and the sum of lengths of shortest optimal scripts from α to w^a and to w^b is r - 1. From the inductive assumption, $ed (α, w^{a}) = ed (α, {\tilde{w}}^{a}) + dupcost (w^{a})$ and $ed (α, w^{b}) = ed (α, {\tilde{w}}^{b}) + dupcost (w^{b})$ . If (w^a,w^b) ∈ R (w), then $\tilde{w} = {\tilde{w}}^{a} {\tilde{w}}^{b}$ and $dupcost (w) \overset{Eq.18}{=} dupcost (w^{a}) + dupcost (w^{b})$ . In this case, $ed (α, w) = dup (α) + (ed (α, {\tilde{w}}^{a}) + dupcost (w^{a})) + (ed (α, {\tilde{w}}^{b}) + dupcost (w^{b})) \overset{Lem.8}{\geq} dup (α) + ed (αα, \tilde{w}) + dupcost (w) \overset{Obs.5}{\geq} ed (α, \tilde{w}) + dupcost (w)$ . Else, (w^a,w^b) ∉ R (w), and so there is some letter β ∈ Σ such that w^a ends with β and w^b starts with β. In this case, there are some strings u^a, u^b and integers p,q > 0, such that w^a = u^aβ^p, w^b = β^qu^b, u^a does not end with β, and u^b does not start with β. Moreover, ${\tilde{w}}^{a} = ũ^{a} β$ , ${\tilde{w}}^{b} = β ũ^{b}$ , and $\tilde{w} = ũ^{a} β ũ^{b} = {\tilde{w}}^{a} ũ^{b}$ . Note that $ed (α, {\tilde{w}}^{b}) = ed (α, β ũ^{b}) \overset{Lem.5}{\geq} ed (α, ũ^{b}) + dup (β)$ , and therefore $ed (α, w) = dup (α) + ed (α, w^{a}) + ed (α, w^{b}) = dup (α) + (ed (α, {\tilde{w}}^{a}) + dupcost (w^{a})) + (ed (α, {\tilde{w}}^{b}) + dupcost (w^{b})) \geq dup (α) + ed (α, {\tilde{w}}^{a}) + ed (α, ũ^{b}) + dup (β) + dupcost (w^{a}) + dupcost (w^{b}) \overset{Eq.18}{=} dup (α) + ed (α, {\tilde{w}}^{a}) + ed (α, ũ^{b}) + dupcost (w) \overset{Lem.8}{\geq} dup (α) + ed (αα, \tilde{w}) + dupcost (w) \overset{Obs.5}{\geq} ed (α, \tilde{w}) + dupcost (w)$ .
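The two cases distinguished in the proof of Lemma 6 correspond to simple identities on run-length compressed strings, which the following sketch checks for every split of a string. It assumes that the compressed string keeps one letter per run and that dupcost (w) charges dup (β) for every repeated letter within a run of β, which agrees with the way Equation 18 is used above. The cost table `DUP` and the function names are hypothetical.

```python
from itertools import groupby

DUP = {"a": 1, "b": 2, "c": 3}  # hypothetical duplication costs


def collapse(w):
    """Run-length compressed form of w: one letter per maximal run."""
    return "".join(letter for letter, _ in groupby(w))


def dupcost(w):
    """Cost of re-expanding collapse(w) into w using duplications only."""
    return sum((len(list(run)) - 1) * DUP[letter] for letter, run in groupby(w))


def splits_agree(w):
    """Check both cases of the proof for every split of w into nonempty parts."""
    for cut in range(1, len(w)):
        wa, wb = w[:cut], w[cut:]
        if wa[-1] != wb[0]:
            # The split respects run boundaries: both quantities decompose.
            ok = (collapse(w) == collapse(wa) + collapse(wb)
                  and dupcost(w) == dupcost(wa) + dupcost(wb))
        else:
            # The split cuts a run of beta: one beta merges and one dup is added.
            beta = wa[-1]
            ok = (collapse(w) == collapse(wa) + collapse(wb)[1:]
                  and dupcost(w) == DUP[beta] + dupcost(wa) + dupcost(wb))
        if not ok:
            return False
    return True
```

For instance, `collapse("aabbbcab")` is `"abcab"` and `dupcost("aabbbcab")` is 1 + 2·2 = 5 under the table above.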

References

1. Jobling MA, Heyer E, Dieltjes P, de Knijff P: Y-chromosome-specific microsatellite mutation rates re-examined using a minisatellite, MSY1. Human Mol Genet. 1999, 8 (11): 2117-2120. 10.1093/hmg/8.11.2117.
2. Bérard S, Rivals E: Comparison of minisatellites. J Comput Biol. 2003, 10 (3–4): 357-372.
3. Levenshtein V: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. 1966, 10 (8): 707-710.
4. Waterman M: Introduction to computational biology: maps, sequences and genomes. 1995, London: Chapman & Hall/CRC
5. Needleman S, Wunsch C: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443-453.
6. Behzadi B, Steyaert JM: An improved algorithm for generalized comparison of minisatellites. J Discrete Algorithms. 2005, 3 (2–4): 375-389.
7. Behzadi B, Steyaert JM: The minisatellite transformation problem revisited: A run length encoded approach. Lecture Notes in Comput-Sci. 2004, 3240: 290-301. 10.1007/978-3-540-30219-3_25.
8. Bérard S, Nicolas F, Buard J, Gascuel O, Rivals E: A fast and specific alignment method for minisatellite maps. Evol Bioinformatics Online. 2006, 2: 303-
9. Abouelhoda MI, Giegerich R, Behzadi B, Steyaert JM: Alignment of minisatellite maps based on run-length encoding scheme. J Bioinformatics Comput Biol. 2009, 7 (2): 287-308. 10.1142/S0219720009004060.
10. Valiant LG: General context-free recognition in less than cubic time. J Comput Syst Sci. 1975, 10 (2): 308-314. 10.1016/S0022-0000(75)80046-8.
11. Zakov S, Tsur D, Ziv-Ukelson M: Reducing the worst case running times of a family of RNA and CFG problems, using Valiant’s approach. Algorithms Mol Biol. 2011, 6 (1): 20. http://www.almob.org/content/6/1/20
12. Chan TM: More algorithms for all-pairs shortest paths in weighted graphs. SIAM J Comput. 2010, 39 (5): 2075-2089. 10.1137/08071990X. 2007:590–598.
13. Williams R: Matrix-vector multiplication in sub-quadratic time: (some preprocessing required). Proc. 18th ACM-SIAM Symposium on Discrete Algorithms (SODA). 2007, 995-1001. Philadelphia, PA, USA: SIAM.
14. Frid Y, Gusfield D: A simple, practical and complete $O (\frac{n^{3}}{log n})$ -time Algorithm for RNA folding using the Four-Russians Speedup. Algorithms Mol Biol. 2010, 5: 13.
15. Nussinov R, Jacobson AB: Fast algorithm for predicting the secondary structure of single-stranded RNA. PNAS. 1980, 77 (11): 6309-6313.
16. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. 2001, Cambridge, MA: MIT Press.
17. Masek WJ, Paterson MS: A faster algorithm computing string edit distances. J Comput Syst Sci. 1980, 20: 18-31. 10.1016/0022-0000(80)90002-1.
18. Arlazarov VL, Dinic EA, Kronod MA, Faradzev IA: On economical construction of the transitive closure of an oriented graph. Sov Math Dokl. 1970, 11: 1209-1210.
19. Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. 1997, ISBN 0521585198, Cambridge: Cambridge University Press.

Acknowledgements

We would like to thank Prof. Yefim Dinitz for kindly pointing us to some relevant references. The research of T.P., S.Z. and M.Z.U. was partially supported by ISF grant 478/10 and by the Frankel Center for Computer Science at Ben Gurion University of the Negev. The research of D.T. was partially supported by ISF grant 981/11 and by the Frankel Center for Computer Science at Ben Gurion University of the Negev. The authors thank the anonymous reviewers for their very helpful comments.

Author information

Authors and Affiliations

Department of Computer Science, Ben-Gurion University of the Negev, Be’er Sheva, Israel
Tamar Pinhas, Dekel Tsur & Michal Ziv-Ukelson
Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA
Shay Zakov

Corresponding author

Correspondence to Michal Ziv-Ukelson.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors developed the algorithms, drafted the manuscript, read and approved the final manuscript.

Tamar Pinhas, Shay Zakov contributed equally to this work.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Pinhas, T., Zakov, S., Tsur, D. et al. Efficient edit distance with duplications and contractions. Algorithms Mol Biol 8, 27 (2013). https://doi.org/10.1186/1748-7188-8-27

Received: 13 August 2012
Accepted: 30 September 2013
Published: 29 October 2013
DOI: https://doi.org/10.1186/1748-7188-8-27