Open Access

Efficient edit distance with duplications and contractions

  • Tamar Pinhas1,
  • Shay Zakov2,
  • Dekel Tsur1 and
  • Michal Ziv-Ukelson1Email author
Contributed equally
Algorithms for Molecular Biology20138:27

https://doi.org/10.1186/1748-7188-8-27

Received: 13 August 2012

Accepted: 30 September 2013

Published: 29 October 2013

Abstract

Abstract

We propose three algorithms for string edit distance with duplications and contractions. These include an efficient general algorithm and two improvements which apply under certain constraints on the cost function. The new algorithms solve a more general problem variant and obtain better time complexities with respect to previous algorithms. Our general algorithm is based on min-plus multiplication of square matrices and has time and space complexities of O (|Σ|MP (n)) and O (|Σ|n2), respectively, where |Σ| is the alphabet size, n is the length of the strings, and MP (n) is the time bound for the computation of min-plus matrix multiplication of two n × n matrices (currently,

MP ( n ) = O n 3 log 3 log n log 2 n due to an algorithm by Chan).

For integer cost functions, the running time is further improved to O | Σ | n 3 log 2 n . In addition, this variant of the algorithm is online, in the sense that the input strings may be given letter by letter, and its time complexity bounds the processing time of the first n given letters. This acceleration is based on our efficient matrix-vector min-plus multiplication algorithm, intended for matrices and vectors for which differences between adjacent entries are from a finite integer interval D. Choosing a constant 1 log | D | n < λ < 1 , the algorithm preprocesses an n × n matrix in O n 2 + λ | D | time and O n 2 + λ | D | λ 2 log | D | 2 n space. Then, it may multiply the matrix with any given n-length vector in O n 2 λ 2 log | D | 2 n time. Under some discreteness assumptions, this matrix-vector min-plus multiplication algorithm applies to several problems from the domains of context-free grammar parsing and RNA folding and, in particular, implies the asymptotically fastest O n 3 log 2 n time algorithm for single-strand RNA folding with discrete cost functions.

Finally, assuming a different constraint on the cost function, we present another version of the algorithm that exploits the run-length encoding of the strings and runs in O | Σ | nMP ( ñ ) ñ time and O ( | Σ | ) space, where ñ is the length of the run-length encoding of the strings.

Keywords

Edit distance Minisatellites Min-plus matrix multiplication Four Russians

Background

Comparing strings is a well-studied problem in computer science as well as in bioinformatics. Traditionally, string similarity is measured in terms of edit distance, which reflects the minimum-cost edit of one string to the other, based on the edit operations of substitutions (including matches) and deletions/insertions (indels). In this paper, we address the problem of string edit distance with the additional operations of duplication and contraction. The motivation for this problem originated from the study of minisatellites and their comparisons in the context of population genetics [1].

Motivation: minisatellite comparison

A minisatellite is a section of DNA that consists of tandem repetitions of short (6–100 nucleotides) sequence motifs spanning 500 nucleotides to several thousand nucleotides. The repeated motifs also vary in sequence through base substitutions and indels. For one minisatellite locus, both the type and the number of motifs vary between individuals in a population. Therefore, pairwise comparisons of minisatellites are typically applied in studying the evolution of populations.

A minisatellite map represents a minisatellite region, where each motif is encoded by a letter and handled as one entity (denoted unit). When comparing minisatellite maps, one has to consider that regions of the map have arisen as a result of duplication events from the neighboring units. The single copy duplication model, where only one unit can duplicate at a time, is the most popular and its biological validation was asserted for the MSY1 minisatellites [1, 2]. According to this model, one unit can mutate to another unit via an indel or a mutation of a single nucleotide within it. Also, a unit can be duplicated, that is, an additional copy of the unit may appear next to the original one in the map (tandem repeat). Thus, when comparing minisatellite maps, two additional operations are considered: unit duplication and unit contraction.

The EDDC problem definition

The single copy duplication model of minisatellite maps gave rise to a new variant of the string edit distance problem, Edit Distance with Duplications and Contractions (EDDC), which allows five edit operations: insertion, deletion, mutation, duplication and contraction.

We start with some string notations. Let s be a string. Denote by s i  the i-th letter in s, starting at index 0, and by si,j the substring s i si+ 1 … sj- 1 of s. A substring of the form si,i is an empty string, which will be denoted by ε. We use superscripts to denote substrings without an explicit indication of their start and end positions, and write e.g. s = s a s b  to indicate that s is a concatenation of the two substrings s a  and s b .

In the edit distance problem, one is given a source string s and a target string t over a finite alphabet Σ. An edit script from s to t is a series of strings E S = s = u 0 , u 1 , u 2 , , u r = t , where each intermediate string u i  is obtained by applying a single edit operation to the preceding string ui-1. In the standard problem definition [35], the allowed edit operations are insertion of a letter at some position in an intermediate string u i , deletion of a letter in u i , and mutation of a letter in u i  to another letter. The single-copy EDDC problem variant adds two operations: duplication - inserting into u i  a letter in a position adjacent to a position that already contains the same letter, and contraction - deleting from u i  one copy of a letter where there are two consecutive copies of this letter. Denote by ins (α),dup (α) and del (α) costs associated with the insertion, duplication and deletion operations applied to a letter α in the alphabet, respectively, by cont (α) the cost of contracting two consecutive occurrences of α into a single occurrence, and by mut (α,β) the cost of mutating α to a letter β. Define the cost of E S to be the summation of its implied operation costs, and the length | E S | = r of E S to be the number of operations performed in E S . Clearly, for every pair of strings s and t, there is some script transforming s to t, e.g. the script that first deletes all letters in s and then inserts all letters in t. An optimal edit script from s to t is one which has a minimum cost. The edit distance from s to t, denoted by ed (s,t), is the cost of an optimal edit script from s to t. The goal of the EDDC problem is, given strings s and t, to compute e d (s,t).

Previous algorithms assume various constraints on operation costs (see Section “A comparison with previous works”). In this paper, the only limiting assumption made is that all operation costs are nonnegative. In addition, we can make the following assumption without loss of generality, which will be required by the algorithms presented in this paper:

Property 1.

It may be assumed without loss of generality that for every α,βΣ,

  • ins (α) = ed (ε,α), del (α) = ed (α,ε),

  • dup (α) = ed (α,α α), cont (α) = ed (α α,α),

  • mut (α,β) = ed (α,β).

This assumption can be made, since in case one of the operation costs violates the assumption, then such an operation can always be replaced by a series of operations that would induce the same modification at lower cost. For example, it cannot be that mut (α,β) < ed (α,β), since ed (α,β) is smaller then or equal to the cost of any script from α to β, among which is the script containing the single mutation operation of α into β. Moreover, if mut (α,β) > ed (α,β), then we can always replace any mutation of α into β by a series of operations that transform α into β at cost ed (α,β). In this case, we may simply assume that mut (α,β) = ed (α,β), and interpret any such a mutation appearing in a script as being implem ented by the corresponding series of operations. In particular, Property 1 implies that mut (α,α) = 0 (since all operation costs are nonnegative, ed (w,w) = 0 for every string w), dup (α) ≤ ins (α) (since the cost of the script from α to α α that applies a single insertion of α is ins (α) ≥ ed (α,α α) = dup (α)), cont (α) ≤ del (α), and mut (α,β) ≤ mut (α,γ) + mut (γ,β) for every γΣ.

Insertions and duplications are considered to be generating operations, increasing by one letter the length of the string. Similarly, deletions and contractions are considered to be reducing operations, decreasing by one letter the length of the string. An edit script containing no reducing operation is called a non-reducing script, and an edit script containing no generating operation is called a non-generating script.

Previous work

The EDDC problem was first defined by Bérard and Rivals [2], who suggested an O (n4) time and O (n3) space algorithm for the problem, where n is the length of the two input strings (for the sake of simplicity, we assume that both strings are of the same length). This was followed by the work of Behzadi and Steyaert [6], who gave an O (|Σ|n3) time and O (|Σ|n2) space algorithm for the problem, where |Σ| is the alphabet size (typically a few tens of unique units). Behzadi and Steyaert [7] improved their algorithms’ complexity, based on run-length encoding, to O ( n 2 + n ñ 2 + | Σ | ñ 3 + | Σ | 2 ñ 2 ) time and O | Σ | ( n + ñ 2 ) + n 2 space, where ñ is the length of the run-length encoding of the input strings. Run-length encoding was also used by Bérard et al. [8], who proposed an O ( n 3 + | Σ | ñ 3 ) time and O ( n 2 + | Σ | ñ 2 ) space algorithm. Abouelhoda et al. [9] gave an algorithm with an alphabet size independent time and space complexities of O ( n 2 + n ñ 2 ) and O(n2), respectively. A detailed comparison between the different problem models appears in Section “A comparison with previous works”.

Our contribution

This paper presents several algorithms for EDDC which are currently the most general and efficient for the problem.
  1. 1.

    We give an algorithm for EDDC for general non-negative cost functions that is based on min-plus square matrix multiplication. This algorithm is an adaptation of the framework of [10] (see also [11]). For two input strings over an alphabet Σ and of length n each, the time and space complexities of this algorithm are O (|Σ|MP (n)) and O (|Σ|n 2), respectively, where MP (n) is the time complexity of a min-plus multiplication of two n × n matrices. Using the matrix multiplication algorithm of Chan [12], this algorithm runs in O | Σ | n 3 log 3 log n log 2 n time (Section “A matrix multiplication based algorithm for EDDC”). Moreover, our algorithm applies less restrictions on the cost function with respect to previous algorithms and is currently the only algorithm that works for the most general problem settings (Section “A comparison with previous works”).

     
  2. 2.
    We describe a more efficient algorithm for EDDC when all operation costs are integers. This algorithm can also be applied in an online setting, where in each step a letter is added to one of the input strings. The time complexity of processing n letters in the input is O | Σ | n 3 log 2 n , where the base of the log function is determined by the range of cost values (Section “An online algorithm for EDDC using min-plus matrix-vector multiplication for discrete cost functions”). In order to achieve this, we obtained the following stepping-stone results, which are of interest on their own.
    1. (a)

      Let A be an n × m matrix for which differences between adjacent entries are within some finite integer interval D. Choosing a time/space complexity tradeoff parameter λ, where 1 log | D | ( n + m ) < λ < 1 , we describe a preprocessing algorithm for A that runs in O nm ( n + m ) λ | D | time and requires O nm ( n + m ) λ | D | λ 2 log | D | 2 ( n + m ) space. This preprocessing allows later to compute min-plus multiplications between A and m-length vectors sustaining the same discreteness requirement in O nm λ 2 log | D | 2 ( n + m ) time (Section “An efficient D-discrete min-plus matrix-vector multiplication algorithm”). The algorithm is an adaptation of Williams’ matrix-vector multiplication algorithm over a finite semiring [13], with some notions similar to those in Frid and Gusfield’s RNA folding algorithm [14].

       
    2. (a)

      The manner in which the new matrix-vector multiplication algorithm is integrated into the EDDC algorithm can be generalized to algorithms for a family of related problems, denoted VMT problems [11], under certain discreteness assumptions. This family includes many important problems from the domains of RNA folding and CFG parsing. An example of such a problem is the single strand RNA folding problem [15] under discrete scoring schemes. Our new matrix-vector multiplication algorithm can be integrated into an algorithm for the latter problem to yield an O n 3 log 2 n time algorithm, improving the best previously known asymptotic time bound for the problem (see Section “Online VMT algorithms”).

       
     
  3. 3.

    We extend our approach to exploit run-length encodings of the input strings, assuming some restrictions on the cost functions. This reduces the time and space complexities of the algorithm to O | Σ | n 2 + | Σ | nMP ( ñ ) ñ and O | Σ | , respectively, where ñ is the length of the run-length encoding of the input (Section “Additional acceleration using run-length encoding”).

     

The rest of the paper is organized as follows. In Section “A baseline algorithm for the EDDC problem”, a recursive computation for EDDC and its implementation using dynamic programming (DP) is presented. Section “A matrix multiplication based algorithm for EDDC” shows how to accelerate the algorithm by incorporating efficient min-plus matrix multiplication subroutines. In Section “An online algorithm for EDDC using min-plus matrix-vector multiplication for discrete cost functions”, an efficient min-plus matrix-vector multiplication algorithm is described for matrices and vectors which differences between adjacent entries are taken from a finite integer interval. This algorithm can be used for obtaining an accelerated online version of the EDDC algorithm, as well as for improving time complexities of several related problems. Section “Additional acceleration using run-length encoding” describes a variant of the EDDC algorithm that exploits run-length encoding. Comparison between this and previous works is given in Section “A comparison with previous works”, and Section “Conclusions and discussion” gives a concluding discussion. Additional proofs omitted from the main manuscript are given in the Appendix.

A baseline algorithm for the EDDC problem

In this section, we give a simple algorithm for the EDDC problem. We start by showing some recursive properties of the problem, and then formulate a straightforward dynamic programming implementation for the recursive computation.

The recurrence formula

Our recursive formulas resemble previous formulations [6, 9], yet solve a slightly more general variant of the problem (see discussion in Section “A comparison with previous works”). Since the proof of correctness of these recursive formulas is similar to previous ones, we defer it to Appendix “Correctness of the recursive computation”.

A (strict) partition of a string w of length at least 2 is a pair of strings (w a ,w b ), such that w = w a w b and w a ,w b ≠ ε. Denote by P (w) the set of all partitions of w. For example, for w = abac , P(w) = {(a,bac),(ab,ac),(aba,c)}.

For a source string which is either empty or contains a single letter and a target string t, Equations 1 to 3 (Figure 1) describe a recursive EDDC computation. This computation interleaves, in a mutually recursive manner, the computation of an additional special value ed (α,t), where ed (α,t) is defined to be the minimum cost of a non-reducing edit script from α to t that does not start with a mutation (t is required to contain at least two letters).
Figure 1

Edit distance computation for the case where the source string is either empty or contains a single letter. Charts (a), (b) and (c) exemplify Equations 1, 2 and 3, respectively.

ed ( ε , t ) = 0 , t = ε , min ins ( α ) + ed ( α , t ) | α Σ , otherwise .
(1)
ed ( α , t ) = del ( α ) , t = ε , mut ( α , β ) , t = β , min mut ( α , β ) + e d ( β , t ) | β Σ , otherwise .
(2)
e d ( α , t ) = min ed α , t a + ed ε , t b , ed ε , t a + ed α , t b , dup ( α ) + ed α , t a + ed α , t b t a , t b P ( t ) ( t is of length 2 )
(3)
Symmetrically, Equations 4 to 6 give the recursive computation for a source string s and a target string which is either empty or contains a single letter. Here, ed (s,α) is defined as the minimum cost of a non-generating edit script from s to α which does not end with a mutation (s is required to contain at least two letters).
ed ( s , ε ) = 0 , s = ε , min ed ( s , α ) + del ( α ) | α Σ , otherwise .
(4)
ed ( s , α ) = ins ( α ) , s = ε , mut ( β , α ) , s = β , min e d ( s , β ) + mut ( β , α ) | β Σ , otherwise .
(5)
e d ( s , α ) = min ed s a , α + ed s b , ε , ed s a , ε + ed s b , α , ed s a , α + ed s b , α + cont ( α ) s a , s b P ( s ) ( s is of length 2 )
(6)
In case both source string s and target string t are of length at least 2, the following equation can be used for computing ed (s,t) (Figure 2(a)):
Figure 2

Edit distance computation when both input strings are of length at least 2. Chart (a) illustrates Equation 7, while charts (b) and (c) illustrate its decomposition into Equations 8 and 9, respectively.

ed ( s , t ) = min ed ( s , α ) + ed ( α , t ) , ed s a , t a + ed s b , α + ed α , t b s a , s b P ( s ) , t a , t b P ( t ) , α Σ ( s and t are of lengths 2 )
(7)
For allowing efficient computation, Equation 7 can be replaced by Equations 8 and 9, which are computed in a mutually recursive manner to yield an equivalent computation (Figure 2(b) and Figure 2(c), respectively).
ed ( s , t ) = min ed ( s , α ) + ed ( α , t ) , ed t α s a , t + ed s b , α s a , s b P ( s ) α Σ ( s and t are of lengths 2 )
(8)
ed t α ( s , t ) = min ed s , t a + ed α , t b t a , t b P ( t ) ( t is of lengths 2 )
(9)

All base-cases of the above recursive equations are implied from Property 1 in a straightforward manner.

Theorem 1.

Theorem 1.EDDC is correctly solved by Equations 1-9.

The proof of Theorem 1 appears in Appendix “Correctness of the recursive computation”.

A baseline dynamic-programming algorithm for EDDC

In this section, we describe a DP algorithm implementing the recursive EDDC computation given by Equations 1 to 9, which is the basis for improvements introduced later in this paper.

Let s and t be the input source and target strings, respectively, and for simplicity assume both strings are of length n. The algorithm maintains the following matrices for storing solutions to sub-instances of the input which occur along the recursive computation. All matrices are of size (n + 1) × (n + 1), with row and column indices in the range 0,1,2,…,n.

  •  For every αΣ, the algorithm maintains matrices Sα, S α , Tα, and T α . Entries Sα[ k,i], S α [ k,i], Tα[ l,j] and T α [ l,j] are used for storing the values ed (sk,i,α), ed(sk,i,α), ed (α,tl,j), and ed (α,tl,j), respectively.

  •  Two matrices S ε and T ε , whose entries S ε [ k,i] and T ε [ l,j] are used for storing values of the forms ed (sk,i,ε) and ed (ε,tl,j), respectively.

  •  For every αΣ, a matrix EDT α whose entries EDT α [ i,j] are used for storing the values edt α (s0,i,t0,j).

  •  A matrix ED, whose entries ED [ i,j] are used for storing the values ed (s0,i,t0,j).

The algorithm consists of two stages: Stage 1 computes solutions to all sub-instances in which one of the substrings is either empty or single-lettered, applying Equations 1 to 6. Stage 2 uses the values computed in Stage 1 in order to compute all prefix-to-prefix solutions ed (s0,i,t0,j) and edt α (s0,i,t0,j) according to Equations 8 and 9. In particular, Stage 2 computes the edit distance ed (s0,n,t0,n) = ed (s,t) between the two complete strings. The entries are traversed in an order which guarantees that upon computing each entry, all solutions to sub-instances appearing on the right-hand side of the relevant equation are already computed and contained in the corresponding entries. Algorithm 1 gives the pseudo-code for this computation.

Algorithm 1 BASELINE-EDDC( s , t )

Complexity analysis of Algorithm 1

The running time of Algorithm 1 is dictated by the total time required to compute all entries in the DP matrices. Each entry is computed according to one of the recursive equations, where the number of operations in such a computation depends on the number of expressions examined on the right-hand side of the corresponding recursive equation. Note that the value of each examined expression is obtained in a constant time, by querying previously computed values stored in the matrices.

The computation of each entry in matrices T ε and S ε and in matrices of the form T α and S α takes O (|Σ|) time, due to Equations 1, 4, 2, and 5, respectively. As there are O (|Σ|) such matrices and each matrix contains O (n2) entries, their overall computation time is O (|Σ|2n2). The computation of entries in Tαand Sαtake O (n) time, due to Equations 3 and 6, respectively. There are O (|Σ|) such matrices, each of size O (n2), and so the total time for computing all entries in these matrices is O (|Σ|n3). According to Equation 9, computing each entry of the form EDT α [ i,j] takes O (n) time, and as there are O (|Σ|n2) such entries the total time for computing all these entries is O (|Σ|n3). According to Equation 8, computing each entry of the form ED [ i,j] takes O (|Σ|n) time, and since there are O (n2) such entries, the total time for computing all these entries is again O (|Σ|n3). Thus, the total running time of the algorithm is O (|Σ|n3 + |Σ|2n2). Under the assumption that |Σ| = O (n), the time is O (|Σ|n3). The algorithm requires O (|Σ|n2) space for maintaining the DP matrices.

A matrix multiplication based algorithm for EDDC

In previous work by the authors [11], Vector Multiplication Templates (VMTs) were identified as templates for computational problems sustaining certain properties, such that algorithms for solving these problems can be accelerated using efficient matrix multiplication subroutines (similarly to Valiant’s algorithm for CFG recognition [10]). Intuitively, standard algorithms for VMT problems perform computations that can be expressed in terms of vector multiplications, and these computations can be computed and combined more efficiently using efficient matrix multiplications. In this section, we show that EDDC exhibits such VMT properties, and formulate a new algorithm that incorporates matrix-matrix min-plus multiplications. This algorithm yields a better running time than that of the baseline algorithm in the previous section.

Notations for matrices

For two integers p,q such that p ≤ q, Ip,q denotes the interval of integers Ip,q = [ p,p + 1,…,q - 1]. We use the notation An × m to imply that the matrix A has n rows and m columns, and say that A has the dimensions n × m (rows and column indices start at 0). For a subset of row indices I and a subset of column indices J, denote by I × J the region which contains all pairs of indices (i,j), such that iI and jJ. Define A [ I,J] to be the submatrix of A, which is induced by all entries in the region I × J. When I contains a single row i or J contains a single column j, we simplify the notation and write A [ i,J] or A [ I,j], respectively.

Define the following operations on matrices. Let tr (·) denote the transpose operation for matrices. For a set of matrices A = A 1 , A 2 , , A r all of the same dimensions n × m, denote by min A the entry-wise min operation over A , whose result is a matrix Cn × m, such that C [ i , j ] = min A [ i , j ] | A A . min A can be computed in O | A | nm time in a straightforward manner. For matrices An × k and Bk × m, the min-plus multiplication of A and B, denoted A B, results in a matrix Cn × m, where the entries of C are defined by C [ i,j] = min{A [ i,h] + B [ h,j]| 0 ≤ h < k}. Naively, AB can be computed in O (nkm) operations. Denote the time complexity of a min-plus multiplication of two n × n matrices by MP (n). At present, the asymptotically fastest algorithm for min-plus square matrix multiplication is that of Chan [12], taking O n 3 log 3 log n log 2 n time.

In the following observation, we point out how matrix multiplication can be computed as a composition of two parts, where each of the items (1-3) in the observation addresses a partitioning in one of the three dimensions. This will be later used by our recursive computation which is based on such partitioning.

Observation 1.

Let An × k,Bk × mand Cn × mbe matrices, such that C = AB(see Figure 3).
  1. 1.

    For every 0 ≤ h < m, C [ I 0,n,I 0,h] = AB [ I 0,k,I 0,h] and C [ I 0,n,I h,m] = AB [ I 0,k,I h,m].

     
Figure 3

Decomposition of a matrix-multiplication. In all three charts, the matrix C is the result of the multiplication AB. Charts (a), (b), and (c) illustrate items 1, 2, and 3 of Observation 1, respectively.

  1. 2.

    For every 0 ≤ h < n, C [ I 0,h,I 0,m] = A [ I 0,h,I 0,k] B and C [ I h,n,I 0,m] = A[ I h,n,I 0,k] B.

     
  2. 3.

    For every 0 ≤ h < k, C = min {A [ I 0,n,I 0,h] B [ I 0,h,I 0,m],A [ I 0,n,I h,k] B [ I h,k,I 0,m]}.

     

EDDC expressed via min-plus vector multiplications

The key observation that enables a further improvement of the worst-case bounds of EDDC is that Equations 3, 6, 8, and 9 can be expressed in terms of min-plus vector multiplications. Under the assumption that all solutions to sub-instances appearing on the right-hand side of the equations are computed and stored in the corresponding entries, these equations can be written as follows:
e d α , t l , j = Eq.3 min T α [ l , h ] + T ε [ h , j ] , T ε [ l , h ] + T α [ h , j ] , dup ( α ) + T α [ l , h ] + T α [ h , j ] l < h < j , = min T α [ l , I l + 1 , j ] T ε [ I l + 1 , j , j ] , T ε l , I l + 1 , j T α [ I l + 1 , j , j ] , dup ( α ) + T α [ l , I l + 1 , j ] T α [ I l + 1 , j , j ] ,
(10)
e d s k , i , α = Eq.6 min S α [ k , h ] + S ε [ h , i ] , S ε [ k , h ] + S α [ h , i ] , S α [ k , h ] + S α [ h , i ] + cont ( α ) k < h < i = min S α [ k , I k + 1 , i ] S ε [ I k + 1 , i , i ] , S ε [ k , I k + 1 , i ] S α [ I k + 1 , i , i ] , S α [ k , I k + 1 , i ] S α [ I k + 1 , i , i ] + cont ( α ) ,
(11)
ed s 0 , i , t 0 , j = Eq.8 min S α [ 0 , i ] + T α [ 0 , j ] , ED T α [ k , j ] + S α [ k , i ] 0 < k < i α Σ = min S α [ 0 , i ] + T α [ 0 , j ] , tr ( S α ) [ i , I 1 , i ] ED T α [ I 1 , i , j ] α Σ ,
(12)
ed t α s 0 , i , t 0 , j = Eq.9 min ED [ i , l ] + T α [ l , j ] | 0 < l < j = ED [ i , I 1 , j ] T α [ I 1 , j , j ] .
(13)

The algorithm

The new algorithm has the same two stages as the baseline algorithm. It can be observed that the computation of all matrices of the forms Sα, S α , S ε , Tα, T α , and T ε  performed in Stage 1 of the baseline algorithm adhere to the Inside-VMT requirements as given in Definition 1 in [11]. The application of the generic Inside-VMT algorithm[11] to this computation is immediate, and therefore we focus only on adapting the method to the computation of matrices of the form EDT α  and ED conducted in Stage 2 of the baseline algorithm.

After allocating all dynamic programming matrices and performing Stage 1 of the algorithm, the COMPUTEMATRIX procedure is used for implementing Stage 2 (see Algorithm 2 and Figure 4). This is a divide-and-conquer recursive procedure that accepts a region I × J and computes the values in all entries of ED and EDT α  within the region. The procedure partitions the given region into two parts and performs recursive calls on each part. In order to maintain a required precondition, the procedure applies min-plus matrix multiplication subroutines between recursive calls. The correctness proof of Algorithm 2 appears in Appendix “Correctness of Algorithm 2”.
Figure 4

The tree of recursive calls to COMPUTE-MATRIX. Each call over a region containing more than one entry partitions the region either vertically or horizontally, and performs two recursive calls over the two sub-regions. After the first call was performed (the left or top sub-region for vertical or horizontal partition, respectively), a matrix multiplication involving the computed region is computed in order to meet the precondition required for the computation of the sibling region.

Algorithm 2 MATRIX-EDDC( s , t )

Time complexity analysis

The time complexity of Algorithm 2 can be established by an identical analysis to that of the Inside-VMT algorithm of [11] (see Section 3.3.1 of [11]). For completeness, we repeat this analysis here for Stage 2 of the computation, where the time complexity of Stage 1 can be inferred similarly. The complexity is expressed as a function of the bound MP (n) over the running time of a min-plus multiplication of two n × n matrices. Note that MP (n) = Ω (n2), as the input and output matrices of the computation contain O (n2) entries. We assume here that MP (n) = Ω (n2+δ) for some constant 0 < δ ≤ 1, which is true for the current best bound over MP (n) [12]. In some of the expressions developed below, we avoid the “big O” notation and give explicit bounds over the number of operations, as constant factors that may be hidden due to this notation cannot be ignored in the analysis.

The initialization time of Stage 2 is dominated by the matrix multiplications and entry-wise min operations performed in line 4 of Algorithm 2. This initialization performs 2|Σ| multiplications of matrices of dimensions (n-1)×1 with matrices of dimensions 1×(n - 1), which can naively be implemented in O (|Σ|n2) time, and an entry-wise min operation over a set containing |Σ| matrices of dimensions (n - 1) × (n - 1), which is also implemented in O (|Σ|n2) time.

The computation of the remaining entries in ED and EDT α  matrices is done within recursive calls to the COMPUTE-MATRIX procedure. Observe that when COMPUTE-MATRIX is called over a region of dimensions r × r for some even integer r ≥ 2, the procedure applies a vertical partitioning and performs two recursive calls over regions of dimensions r × r 2 (lines 6 and 8). For a call over a region of dimensions r × r 2 , the procedure applies a horizontal partitioning and performs two recursive calls over regions of dimensions r 2 × r 2 (lines 11 and 13). For simplicity, assume that n - 1 = 2 p  for some integer p ≥ 0, and thus it follows that the dimensions of all regions occurring as inputs in recursive calls are either 1×1, or of the form r × r or r × r 2 for some even integer r. Denote by T (x × y) an upper bound over the number of operations conducted when applying COMPUTE-MATRIX over a region of dimensions x × y.

From line 2 of the procedure, T (1 × 1) = O (|Σ|). Consider a region of dimensions r × r for an even integer r ≥ 2. For such a region, the code in lines 5-8 of COMPUTE-MATRIX is executed. In order to implement line 7 for some αΣ, it is necessary to compute first a min-plus matrix multiplication C = AB, where the matrix A = ED [ Ii,k,Ij,h] is of dimensions r × r 2 , the matrix B = T α [ Ij,h,Ih,l] is of dimensions r 2 × r 2 , and the resulting matrix C is of dimensions r × r 2 . Due to Observation 1, it is possible to compute independently the upper and lower halves of C, where C [ I 0 , r 2 , I 0 , r 2 ] = A [ I 0 , r 2 , I 0 , r 2 ] B and C [ I r 2 , r , I 0 , r 2 ] = A [ I r 2 , r , I 0 , r 2 ] B . The time required to conduct this computation is 2 MP ( r 2 ) . Then, it is required to compute min {EDT α [ Ii,k, Ih,l], C} and to update EDT α [ Ii,k,Ih,l] to be the result of this operation, a computation which requires at most cr2 operations for some constant c. Since line 7 is computed for every αΣ, the total number of applied operations due to this line is at most | Σ | 2 MP ( r 2 ) + c r 2 . Besides line 7, two recursive calls are made in lines 6 and 8 over regions of dimensions r × r 2 , and therefore we get
T ( r × r ) 2 T r × r 2 + | Σ | 2 MP r 2 + c r 2 .
When the procedure is called over a region of dimensions r × r 2 , the code in lines 10-13 is executed. Similarly as above, it can be shown that the computation in line 12 requires at most | Σ | MP ( r 2 ) + c r 2 4 operations, and due to the two recursive calls in lines 11 and 13 over regions of dimensions r 2 × r 2 , we get
T r × r 2 2 T r 2 × r 2 + | Σ | MP r 2 + c r 2 4 .
Therefore,
T ( r × r ) 4 T r 2 × r 2 + | Σ | 4 MP r 2 + 3 c r 2 2 .

The explicit form of the above recursive equation can be established by the Master Theorem (under the assumption that MP (n) = Ω (n2+δ), see Chapter 4 in [16]), yielding the expression T (r × r) = O (|Σ|MP (r)). Thus, the time complexity of Stage 2 of the algorithm is O (|Σ|MP (n)). The time analysis of the Inside-VMT algorithm of [11], applied to implement Stage 1 of the algorithm yields the same bound of O (|Σ|MP (n)), and thus O (|Σ|MP (n)) is the time complexity of the entire algorithm. Using the currently asymptotically fastest algorithm for min-plus matrix multiplication [12] MP ( n ) = Θ n 3 log 3 log n log 2 n , we get the currently best explicit time bound for EDDC of O | Σ | n 3 log 3 log n log 2 n .

An online algorithm for EDDC using min-plus matrix-vector multiplication for discrete cost functions

In this section, we present an EDDC algorithm which is based on the general algorithm (given in Section “A matrix multiplication based algorithm for EDDC”) and improves its time complexity by a factor of O (log3 logn). This EDDC algorithm is intended for integer cost functions, but can also be applied to rational cost functions after they are scaled. It is an online algorithm; it can process the input strings letter by letter with a guaranteed low time bound for any prefix of the input. The EDDC algorithm presented in this section is based on a D-discrete matrix-vector min-plus multiplication algorithm we developed, which is generic and may be applied to other problems as well.

D-discrete matrices and the EDDC problem with integer costs

Given a matrix of integers An × m and indices 1 ≤ i < n and 0 ≤ j < n, call the pair of entries A [ i - 1,j] and A [ i,j] adjacent. Let D = Ia,b= [ a,a + 1,…,b - 1] be an integer interval for some integers a < b. Say that matrix A is D-discrete if for every pair of adjacent entries A [ i - 1,j] and A [ i,j], their difference A [ i - 1,j] - A [ i,j] is in D.

Consider the EDDC problem in the case of integer costs for all edit operations. In Lemma 1, we show that in this case, all matrix multiplications applied by Algorithm 2 are between D-discrete metrices, with respect to a certain integer interval D. This proof is similar to that of Masek and Paterson for simple edit distance [17]. This would allow conducting such matrix multiplications using a faster algorithm, described in Section “An efficient D-discrete min-plus matrix-vector multiplication algorithm”.

Lemma 1.

Given strings s and t and an integer cost function for EDDC, all matrix multiplications applied by Algorithm 2 are over D-discrete matrices, where D = Ia,bis determined according to the cost function by a=- max{del(α) | αΣ} and b = max{ins (α) | αΣ} + 1.

The proof of Lemma 1 appears in Appendix “Proofs to lemmas corresponding to the EDDC algorithm for discrete cost functions”.

D-discrete matrices and vectors

Here, we present some properties of D-discrete matrices and vectors that are similar to those previously observed in [14, 17]. The following lemmas show that the set of D-discrete matrices is closed under the min-plus multiplication and entry-wise min operations. In what follows, let D=Ia,b be some integer interval. The proofs of the following lemmas appear in Appendix “Proofs to lemmas corresponding to the EDDC algorithm for discrete cost functions”.

Lemma 2.

Let X,Y and Z be matrices, such that X and Y contain only integer elements and Z = XY. If X is D-discrete, then is D-discrete.

Lemma 3.

Let X,Y and Z be matrices, such that X and Y contain only integer elements and Z = min{X,Y}. If X and Y are D -discrete, then Z is D-discrete.

The following lemma implies that when the absolute difference between the first elements of two q-length D-discrete vectors x and y is sufficiently large, one of the vectors can be immediately taken as the result of the min(x,y) operation.

Lemma 4.

Let x =(x0,…,xq - 1) and y = (y0,…,yq - 1) be two q-length D-discrete vectors for some q > 0. If y0 - x0 ≥ q |D|, then min(x,y) =  x.

In what follows, fix an integer q > 1. Let x = (x0,x1,…,xq-1) be a q-length D-discrete vector. By definition, for every 0<i < q, xi-1-x i  is an integer within D, and so xi-1-x i  - a is an integer within the interval I0,b-a = I0,|D|. Therefore, the series x0 - x1 - a,x1 - x2 - a,…,xq-2-xq-1-a can be thought of as a series of q-1 digits in a |D|-base representation of an integer Δ x = 0 i < q - 1 | D | i ( x i - x i + 1 - a ) , where 0 ≤ Δx < |D|q-1. The Δ -encoding of x is defined to be the pair of integers (x0,Δx). We write x = (x0,Δx) to indicate that (x0,Δx) is the Δ-encoding of x, where x0 is called the offset of x and Δ x is called the canonical index of x. Note that for two q-length D-discrete vectors x = (x0,Δx) and y = (y0,Δy), Δx = Δ y if and only if for every 0 ≤ i < q, x i  - y i  = c for some constant c. In particular, x and y share the same Δ-encoding if and only if they are identical. Call a D-discrete vector of the form x = (0,Δx) (with an offset x0 = 0) a canonical vvector.

The next observations show that both operations of entry-wise min and min-plus multiplication, with respect to D-discrete matrices and vectors, can be expressed via canonical vectors.

Observation 2.

Observation 2.Let x = (x0,Δ x), y = (y0,Δ y), and z = (z0,Δ z) be q-length D-discrete vectors such that z = min (x,y). Then, for every number c it holds that min((x0 - c,Δ x),(y0 - c,Δ y)) = (z0 - c,Δ z). In particular, min((0,Δ x),(y0 - x0,Δ y)) = (z0 - x0,Δ z).

Observation 3.

LetBq × qbe a D-discrete matrix, x = (0,Δ  x) a q-length canonical D-discrete vector, and y =(y0,Δ y) a q-length D-discrete vector, such that Bx = y. Then, for any number c it holds that B (c,Δ x) = (y0 + c,Δ y).

An efficient D-discrete min-plus matrix-vector multiplication algorithm

Let An×m be a D-discrete matrix, and fix a constant 1 log | D | ( n + m ) < λ < 1 . We give an algorithm for min-plus D-discrete matrix-vector multiplication that, after preprocessing A in O nm ( n + m ) λ | D | time and O nm ( n + m ) λ | D | λ 2 log | D | 2 ( n + m ) space, computes Ax for any m-length D-discrete vector x in O nm λ 2 log | D | 2 ( n + m ) time under the RAM computational model. Our algorithm is an adaptation of Williams’ algorithm [13] for finite semiring matrix-vector multiplications, with some notions similar to Frid and Gusfield’s acceleration technique for RNA folding [14]. It follows the concept of the Four-Russians Algorithm[18] (see also [14, 17, 19]), i.e. preprocessing reoccurring computations, tabulating their results in lookup tables, and retrieving such results in order to accelerate the general computation.

Specifically, the algorithm stores preprocessed computations of two kinds: matrix-vector min-plus multiplications, and vector entry-wise minima, where vectors and matrices are of q-length and of q × q dimensions, respectively, for q = λ log|D|(n + m) . For conducting this preprocessing, we will assume that |D| ≤ n + m, otherwise q = 0 and the multiplication cannot be accelerated using the suggested method. In addition, for simplicity of the analysis we assume that q3 ≤ min(n,m). If this does not hold, a multiplication of the form Ax can be naively computed in the relatively efficient time complexity of O max ( n , m ) log | D | 3 n + m . The space complexity of the preprocessing phase is higher than the O (n m) space complexity of the standard multiplication algorithm and depends on the constant λ, ranging between O(n m|D|) and O nm ( n + m ) log | D | 2 ( n + m ) for λ values between 1 log | D | ( n + m ) and 1, correspondingly. The lower bound 1 log | D | ( n + m ) for λ is chosen so that the time complexity O nm λ 2 log | D | 2 ( n + m ) of matrix-vector multiplications involving the preprocessed matrix would be better than the naive time complexity O (n m).

Preprocessing of matrix-vector computations

Let n = q n q and m = q m q , and note that 0 ≤ n - n < q and 0 ≤ m - m < q. Let Q k  denote the q-length integer interval Q k  = [ k q,k q + 1,…,(k + 1) q - 1]. The sub-matrix A = A I 0 , n , I 0 , m is decomposed into n m q 2 blocks Bi,j = A[ Q i ,Q j ] where i = 0 , 1 , , n q - 1 and j = 0 , 1 , , m q - 1 . For each block B, a corresponding lookup table MUL B  is created, which tabulates min-plus multiplications between B and all canonical q-length D-discrete vectors. For the canonical vector x = (0,Δ x), the result y = B x is stored in the entry MUL B [ Δ x] by its encoding (y0,Δ y) (by Lemma 2, y is also D-discrete and thus can be encoded accordingly).

The multiplication of a q × q block with a q-length vector can be done in O (q2) time in a straightforward manner and the encoding of the resulting q-length vector requires additional O (q) time. There are n m q 2 blocks in the decomposition of A, each is multiplied by |D|q-1 canonical vectors, and so the total time required for computing these multiplications is O q 2 | D | q - 1 nm q 2 = O | D | q - 1 nm = O nm ( n + m ) λ | D | .

Let (y0,Δ y) be the Δ - encoding of some result y = B x computed in the preprocessing of A as described above. Note that y0 = min 0 ≤ i < q {B [ 0,i] + x [ i]} ≤ min 0 ≤ i < q {2 max(B [ 0,i],x[ i])}. Therefore, the number of bits in the binary representation of y0 is at most one plus the maximum number of bits required for the representations of B [ 0,i] and x [ i] for some 0 ≤ i < q. Also, note that 0 Δ y < | D | q - 1 = ( n + m ) λ | D | , and Δ y can be represented in O (log(n + m)) bits. Thus, under the RAM computational model assumptions, each such encoding (y0,Δ y) requires O (1) space units and can be written and read in a constant time, and therefore the overall space complexity for maintaining all MUL B  tables is O | D | q - 1 nm q 2 = O nm ( n + m ) λ | D | λ 2 log | D | 2 ( n + m ) . In addition, given a canonical index Δ x, it is possible to retrieve the encoding (y0,Δ y) = B  (0,Δ x) stored in the entry MUL B [ Δ x] in a constant time.

Let x = (x0,Δ x) be some (not necessarily canonical) q-length D-discrete vector, for which we wish to compute Bx. Due to Observation 3, the multiplication result can be obtained in constant time by retrieving (y0,Δ y) = MUL B [ Δ x], and returning the encoding (y0 + x0,Δ y).

Preprocessing of vector entry-wise min computations

The algorithm constructs a lookup table MIN, storing entry-wise min calculations between every canonical q-length D-discrete vector x = (0,Δ x) and every q-length D-discrete vector y = (y0,Δ y) such that abs (y0) < q|D| (here abs (y0) denotes the absolute value of y0). For every such x and y, the table entry MIN [ Δ x,y0,Δ y] stores the Δ - encoding (z0,Δ z) of the vector z = min (x,y) (due to Lemma 3, z is D-discrete and can be encoded accordingly). There are O q | D | | D | 2 ( q - 1 ) = O ( n + m ) 2 λ λ log | D | ( n + m ) | D | entries in the table MIN, and each entry can be computed in O(q) = O (λ log|D|(n + m)) time. Thus, the computation of all entries in MIN requires O ( n + m ) 2 λ λ 2 log | D | 2 ( n + m ) | D | time, and the table occupies O ( n + m ) 2 λ λ log | D | ( n + m ) | D | space.

Given two encoded q-length D-discrete vectors x = (x0,Δ x) and y = (y0,Δ y), the encoding (z0,Δ z) of the vector z = min(x,y) can now be obtained in a constant time as follows: (z0,Δ z) = (x0,Δ x) if y0 - x0 ≥ q |D| or (z0,Δ z) = (y0,Δ y) if x0 - y0 ≥ q |D|, due to Lemma 4. Otherwise, |y0 - x0| < q |D|, and for the vectors x = (0,Δ x), y = (y0 - x0,Δ y), and z = z 0 , Δ z = min ( x , y ) , we have that (z,Δ z) = MIN [ Δ x,y0 - x0,Δ y]. From Observation 2, z 0 , Δ z = z 0 + x 0 , Δ z .

Computing matrix-vector multiplications

Given an m-length D-discrete vector x and assuming the preprocessing of matrix An × m was preformed as described above, we next explain how to efficiently compute the vector y = Ax. Note that y is an n-length D-discrete vector, due to Lemma 2.

Our algorithm computes first the multiplication y = A x [ I 0 , m ] in parts of length q. First, for every 0 j < m q , the algorithm computes the encoding x 0 j , Δ x j of the sub-vector x j  = x [ Q j ] of x. These encodings can be obtained in a total time of O (m). Then, for every 0 i < n q , the encoding y 0 i , Δ y i of the sub-vector y i = A Q i , I 0 , m x I 0 , m of y is computed independently of the other sub-vectors of y. By definition (see Figure 5),
Figure 5

The computation of a q-length segment in a multiplication of a D-discrete matrix with a D-discrete vector.

y i = min A Q i , Q j x Q j | 0 j < m q = min B i , j x j | 0 j < m q .

The encoded result z 0 i , j , Δ z i , j of each multiplication zi,j = Bi,jx j  can be obtained in a constant time as explained in Section “Preprocessing of matrix-vector computations”. As there are m q such terms to compute with respect to y i , their total computation time is O m q . In addition, the entry-wise min over all these terms can be computed by initializing y 0 i , Δ y i z 0 i , 0 , Δ z i , 0 , and iteratively updating y 0 i , Δ y i min y 0 i , Δ y i , z 0 i , j , Δ z i , j for all 0 < j < m q . Each such update is computed in a constant time as described in Section “Preprocessing of vector entry-wise min computations”, and so the encoding of a single segment y i  in y is computed in a total time of O m q , and the encodings of all O n q such segments are computed in O nm q 2 time. Decoding all encoded vectors y i  can be done in additional O (n) operations, obtaining an explicit form of y in a total time of O nm q 2 .

Let y = A I 0 , n , I m , m x I m , m , where from Observation 1, y I 0 , n = min y , y . The computation of y′′ can be conducted in O (n q) time in a straightforward manner, and the computation of min(y,y′′) requires additional O (n) time. In addition, y I n , n = A I n , n , I 0 , m x , where this computation can be done naively in O (m q) time, and so the overall running time for computing y  is O nm q 2 + nq + mq = O nm q 2 = O nm λ 2 log | D | 2 ( n + m ) .

The above matrix-vector min-plus multiplication algorithm can be used as a fast square matrix-matrix multiplication algorithm in a straightforward manner. For two D-discrete matrices An × n and Bn × n, the computation of C = AB can be conducted by first preprocessing A as described in Sections “Preprocessing of matrix-vector computations” in O n 2 + λ | D | time and O n 2 + λ | D | λ 2 log | D | 2 n space, and then computing each column j of C independently by multiplying A with the j-th column of B, in O n 2 λ 2 log | D | 2 n time as explained above. The total computation time of all n columns of C is therefore O n 3 λ 2 log | D | 2 n .

Online preprocessing of D-discrete matrices

In the previous section, we assumed the settings in which a D-discrete matrix is given, and that it is preprocessed once prior to any multiplication operation. Next, we describe how to maintain the required lookup tables for the case the input matrix is dynamic, acquiring additional rows and columns. Consider a streaming computational model, which begins with an initial empty matrix A 0 × 0 0 . In each step r, the current matrix A n r × m r r is obtained from the previous matrix A n r - 1 × m r - 1 r - 1 by either adding an mr-1-length vector as the last row or adding an nr-1-length vector as the last column in the matrix. Note that n r  + m r  = r, and therefore the preprocessing block length corresponding to A r  is q = λ log|D|(n r  + m r )  = λ log|D|(r). For the purpose of this analysis, we assume that λ ≤ 0.5 (note that this does not limit the asymptotic upper bounds of the running time). This assumption implies the following inequality
| D | 1 λ 2 2 = 4 .
(14)

Lookup tables corresponding to intermediate matrices along the series can be maintained as follows. Let r0 and r1 be the smallest integers such that the block sizes corresponding to A r 0 and A r 1 are q and q + 1, respectively. Assume that upon reaching A r 0 in the matrix sequence, all required lookup tables with respect to A r 0 are already computed. Along the series of steps r0,r0 + 1,…,r1, we distribute two kinds of computations: (1) new MUL B  tables for accumulated q × q blocks in matrices A r  for r0 ≤ r < r1, and (2) a new MIN table, as well as new MUL B  tables, with respect to block length q + 1.

(1) Computing MUL B tables for accumulated q × q blocks. Assume that for some r0 ≤ r < r1, a column was added to the matrix at step r so that the number of columns m r  in the intermediate matrix A r  is divisible by q. Thus, at most n r q r q new q × q complete blocks are now available for preprocessing. The computation of lookup tables of the form MUL B  corresponding to these new blocks will be equally distributed along the series of q consecutive steps r,r + 1,…,r + q - 1, during which it is guaranteed that no column addition would introduce new complete q × q blocks in the matrix. As shown in Section “Preprocessing of matrix-vector computations”, the time required for processing a single q × q block is O (q2|D|q-1), and so the total time for processing all O r q blocks is O (q r |D|q-1). Thus, in each step among the q steps, there is a need to perform O r | D | q - 1 = O r 1 + λ | D | operations due to these computations. Symmetrically, computing lookup tables corresponding to new blocks added due to the accumulation of rows can be performed by conducting O r 1 + λ | D | operations per step r.

(2) Computing a new MIN lookup table and new MUL B tables with respect to block length q + 1 . By the selection of r0 and r1, q - 1 = λ log|D|(r0 - 1)  > λ log|D|(r0 - 1) - 1, and q + 1= λ log|D|(r1)  ≤ λ log|D|(r1). Therefore, log | D | ( r 1 ) > log | D | ( r 0 - 1 ) + 1 λ , and so r 1 > ( r 0 - 1 ) | D | 1 λ r 0 | D | 1 λ 2 Eq.14 2 r 0 . In particular, r 2 = r 1 2 satisfies r0 < r2 < r1, and for every r2 ≤ r < r1 we have that O (r) = O (r1). The computation of the table MIN and tables of the form MUL B  with respect to block length q + 1 is distributed along the series of r 1 2 steps r2,r2 + 1,…,r1.

The new MIN table is computed independently from the specific input instance, and its overall computation time is O r 1 2 λ λ 2 log | D | 2 ( r 1 ) | D | (see Section “Preprocessing of vector entry-wise min computations”). By distributing this computation evenly along all O (r1) steps, the computation time required for each step r2 ≤ r < r1 is O r 1 2 λ - 1 λ 2 log | D | 2 ( r 1 ) | D | = O r 2 λ - 1 λ 2 log | D | 2 ( r ) | D | .

The MUL B  tables are computed similarly as done in (1), starting with (q + 1) × (q + 1) blocks already present in A r 2 , and continuing with blocks accumulated as the sequence progress. The overall preprocessing time of all these blocks is O r 1 2 + λ | D | (see Section “Preprocessing of matrix-vector computations”), and so the computation time required for each step r2 ≤ r < r1 is O r 1 1 + λ | D | = O r 1 + λ | D | .

All in all, the time complexity due to computations of (1) and (2) for each step r0 ≤ r < r1 is O r 1 + λ | D | . In particular, the overall time complexity of preprocessing the n - size prefix A0,A1,…,A n  of the streamed matrices is O n 2 + λ | D | .

The EDDC algorithm based on efficient D-discrete min-plus matrix-vector multiplication

Consider the EDDC problem in cases where all edit operation costs are integers. As explained in Section “D-discrete matrices and the EDDC problem with integer costsD-discrete matrices and the EDDC problem with integercosts”, the EDDC DP tables can be considered D-discrete. This property allows for efficient min-plus square D-discrete matrix-vector multiplications, using the algorithm described in Section “An efficient D-discrete min-plus matrix-vector multiplication algorithm” to yield an O | Σ | n 3 log | D | 2 n running time algorithm for EDDC. We next describe an online version of the algorithm, in which the letters of the input strings s and t are received in a streaming model.

Assume that some pair of prefixes s0,i and t0,j-1 was already processed, and all entries in the DP matrices corresponding to these prefixes are computed. We explain how to update the tables in case where the next letter to arrive is the letter tj-1 in t, where the case in which the arriving letter is from s is symmetric. The DP matrices are D-discrete, and assume that lookup tables for efficient min-plus multiplications of these matrices are maintained as explained in the previous section. The addition of tj-1 requires updating all matrices of the forms T ε , T α , and Tα, for which the j-th row and column should be added. In addition, it is required to add the j-th column to matrices of the form ED and EDT α .

In the first stage, the algorithm computes rows and columns j in all matrices of the form Tα, T α , and T ε . The process is similar to the computation of these entries by the loop in lines 5 to 8 of Algorithm 1, with the following modification. Let q j  = λ log|D|(2j) , and let j = q j j q j . The algorithm first initializes the entries [ j - 1,j] in all these matrices with the corresponding base-case values. The column is partitioned to intervals of length q j , where as before Q k  denotes the interval I k q j , ( k + 1 ) q j . Once an interval Q k  is computed (i.e. the loop was executed with respect to index l = k q j ), the Δ - encoding of the sub-vector T α [ Q k ,j] is computed and kept for its later usage as lookup index. In addition, upon starting to compute the entries within an interval Q k  (i.e. when l = (k + 1)q j  - 1), the following multiplications are computed for every αΣ:
y α , k = T α Q k , I q j ( k + 1 ) , j T α I q j ( k + 1 ) , j , j = Obs.1 min T α Q k , Q p T α Q p , j | k < p < j q j

Observe that all required entries for the computation of yα,k are already computed and stored in T α , and that similarly as done in Section “Computing matrix-vector multiplications”, yα,k can be computed by performing O j q j constant time lookup table queries. After yα,k is computed, yα,k[ x] contains the value min{T α [ k q j  + x,h] + T α [ h,j] | (k + 1)q j  ≤ h < j}. Given yα,k[ x], the number of expressions that need to be examined in line 5 of the loop with respect to l = k q j  + x reduces to O (q j ) per entry (considering values of the index h between l and (k + 1)q j , and between j and j). Entries in matrices of the form T α  and T ε are computed exactly as done in lines 6 and 7 of Algorithm 1, respectively.

In the second stage, column j is computed in matrices EDT α  and ED. This is achieved by extending Equations 13 and 12 to have an entire column on the left-hand side, as follows:
ED T α I 2 , i + 1 , j Eq.13 ED I 2 , i + 1 , I 1 , j T α I 1 , j , j
(15)
ED I 2 , i + 1 , j Eq.12 min S α 0 , I 2 , i + 1 + T α [ 0 , j ] , tr ( S α ) I 2 , i + 1 , I 1 , i + 1 ED T α I 1 , i + 1 , j α Σ
(16)

This completes the update of the DP tables due to the addition of the letter tj-1.

Complexity analysis

After receiving n letters, the prefixes s0,i and t0,j of the input strings were preprocessed for some i,j such that i + j = n. The maintenance of lookup tables for efficient D-discrete multiplications requires at most O | Σ | n 1 + λ | D | operations per step among the first n steps, and O | Σ | n 2 + λ | D | operations for all first n steps, as shown in Section “Online preprocessing of D-discrete matrices”.

Adding a letter tj-1 to the instance, the time required for processing the entries in column j of the T matrices is as follows. O | Σ | j q j vectors yα,k need to be computed, each vector is computed in O j q j time, and their total computation time is therefore O | Σ | j 2 q j 2 . In addition, O (|Σ|j) entries in tables Tα are computed in O (q j ) time each, and O (|Σ|j) entries in tables T α and T ε are computed in O (|Σ|) time each. Therefore, the total time for computing column j in all these matrices is O | Σ | j 2 q j 2 + | Σ | 2 j = O | Σ | n 2 λ 2 log | D | 2 ( n ) + | Σ | 2 n .

Computing column j in matrices EDT α  and ED, the algorithm performs O (|Σ|) matrix-vector min-plus multiplications (Equations 15 and 16), each taking O n 2 λ 2 log | D | 2 ( n ) time using the algorithm in Section “The EDDC algorithm based on efficient D-discrete min-plus matrix-vector multiplication”, and computes the entry-wise minimum of |Σ|i-length vectors (Equation 16) in O (|Σ|i) time. Hence, the total time complexity of computing column j is O | Σ | n 2 λ 2 log | D | 2 ( n ) + | Σ | 2 n . Symmetrically, this bounds the running time when the n-th letter comes from the source string s, and so the total running time over all first n steps is O | Σ | n 3 λ 2 log | D | 2 ( n ) + | Σ | 2 n 2 . The algorithm requires O | Σ | n 2 + λ λ 2 log | D | 2 ( n ) space for the computed tables.

Online VMT algorithms

The online algorithm for EDDC presented in the previous section can be generalized for other problems with similar properties. Specifically, VMT problems [11], which utilize min-plus multiplications and for which it can be guaranteed that computed DP matrices are D-discrete, can have their algorithms implemented using the same framework as we have presented above. Thus, in contrast to the general case for VMT problems in which it is required that the complete input be available at the beginning of the algorithm’s run, in the D-discrete case the input can be obtained in a streaming model. In addition, the asymptotic time complexity in such cases is slightly reduced with respect to the time complexity of the case of min-plus multiplication of general matrices. A concrete example to such a problem is the RNA base-pairing maximization problem [11, 15], in which the difference between adjacent entries (in the single DP matrix the algorithm uses) is either 0 or 1. This property was previously exploited by Frid and Gusfield [14] to obtain an O n 3 log n algorithm for the problem. Using the D-discrete min-plus multiplication algorithm presented here, this immediately implies an algorithm having the improved time bound of O n 3 log 2 n . Additional related problems from the domains of RNA folding and Context Free Grammars (CFGs) parsing fall under the VMT framework, and it is likely that D-discreteness can be exploited for accelerating the computation of more problems within this family.

Additional acceleration using run-length encoding

Let w be a string. A maximal substring of w containing multiple repeats of the same letter is called a run in w. The Run Length Encoding (RLE) of w is a representation of the string in which each run is encoded by the corresponding repeating letter α and its repeat count p (denoted α p ). For example, the string w = aabbbaccc is a concatenation of the four runs a a, bbb, a, and ccc, and its RLE is a2b3a1c3. Denote by w ~ the compressed form of w, which replaces each run in w by a single occurrence of the corresponding letter. When n denotes the length of w, ñ will denote the length of the compressed form of w. The run index ĩ of a letter w i  in w is the index of the run in which w i  participates. It can be asserted that the compressed form of the substring wi,j of w is the substring w ~ ĩ , ( j - 1 ~ ) + 1 of w ~ . In the above example, w ~ = abac , and therefore ñ = 4 (while n = 9). The run indices of all letters in w are given by the sequence [ 0,0,1,1,1,2,3,3,3], and the compressed form of w3,8 = bbacc is w ~ 3 ~ , 7 ~ + 1 = w ~ 1 , 4 = bac .

Previous works [79] showed how RLE can be exploited for improving the efficiency of EDDC algorithms. In these works it was required that the costs of duplications and contractions be less than the costs of all other operations (the requirement was implicit in [9], see discussion in Section “A comparison with previous works”). This requirement is somewhat unnatural for the application of minisatellite map comparison, since it assumes that mutations, which are typically common events, should cost more than the less common events of duplications and contractions. In this section, we adapt a similar RLE-based acceleration to our EDDC algorithm. The application of this acceleration requires the following constraint over cost functions:

Constraint 1.

For every α,βΣ, dup (α) ≤ dup (β) + mut (β,α) ≤ ins (α), and cont (α) ≤ cont (β) + mut (α,β) ≤ del (α).

The constraint dup (β) + mut (β,α) ≤ ins (α) implies that it never costs more to replace an insertion of some letter α into some nonempty string by the duplication of a letter β adjacent to the insertion position, and its consecutive mutation to α. Thus, we may assume w.l.o.g that optimal edit scripts do not contain insertions (unless applied to empty strings), or in other words, generation of new letters can only be obtained via duplications. Such an assumption is relatively reasonable in the context of minisatellite map comparison, considering the biological mechanisms that describe generative modifications.

The constraint dup (α) ≤ dup (β) + mut (β,α) can be intuitively understood by the example of generating a string of the form α α β from a string of the form α β. Due to the constraint, it would cost the same or less if the string α α β is obtained by duplicating the α letter in α β, rather than by duplicating the β letter and mutating its left copy into α. Again, such an assumption is relatively reasonable for the minisatellite map application. Symmetric arguments hold with respect to the constraint over contraction and deletion costs.

Observation 4.

Observation 4.Let s and w be strings. Then, ed (s,w β β) ≤ ed (s,w β) + dup (β), and ed (s,β β w) ≤ ed (s,β w) + dup (β) for every βΣ.

The correctness of Observation 4 follows from the existence of a script from s to w β β whose cost is ed (s,w β) + dup (β): this script first applies an optimal script to transform s into w β at cost ed (s,w β), and then duplicates the last β in w β at cost dup (β).

Lemmma 5.

Let α,β be letters and w ≠ εa string. When Constraint 1 holds, ed (α,β w),ed (α,w β) ≥ ed (α,w) + dup (β), and ed (β w,α),ed (w β,α) ≥ ed (w,α) + cont (β).

The proof of Lemma 5 appears in Appendix “Proofs to lemmas corresponding to the run-length encoding based EDDC algorithm”.

Next, we show how to reduce the number of expressions that need to be considered in the EDDC recursive equations, in case Constraint 1 applies. For a string w of length at least 2, denote by R (w) P (w) the set of all partitions (w a ,w b ) of w such the last letter in w a  is different from the first letter in w b . For example, for w = aabbbcdddd,R (w) = {(a a,bbbcdddd ), (aabbb,cdddd),(aabbbc,dddd)}. Observe that | R ( w ) | = ñ - 1 .

We start by describing how to improve the computation efficiency of EDDC for cases in which one of the input strings contains a single letter. Denote by dupcost (w) the cost of the edit script from w ~ to w which generates each run α p  in w by applying p - 1 duplication operations over the corresponding letter α in w ~ . Similarly, denote by contcost (w) the cost of the edit script from w to w ~ which reduces each run α p  in w by applying p - 1 contraction operations over α. For example, for w = aabbbbaaccc, dupcost (w) = 2 dup (a) + 3 dup (b) + 2 dup (c) and contcost (w) = 2 cont (a) + 3 cont (b) + 2 cont (c). Note that dupcost ( w ) ed ( w ~ , w ) , and contcost ( w ) ed ( w , w ~ ) . It is simple to assert the following recursive relations: [b]
dupcost ( w β ) = dupcost ( w ) + dup ( β ) , w ends with β , dupcost ( w ) , otherwise .
(17)
dupcost ( w ) = dupcost ( w a ) + dupcost ( w b ) , ( w a , w b ) R ( w ) , dupcost ( w a β ) + dupcost ( β w b ) + dup ( β ) , ( w a β , β w b ) P ( w ) .
(18)

The following lemma shows that when one of the input strings contains a single letter, the edit distance can be inferred from the edit distance between this letter and the compressed form of the second string.

Lemma 6.

Let α be a letter and w a string. When Constraint 1 holds, ed ( α , w ) = ed ( α , w ~ ) + dupcost ( w ) , and ed ( w , α ) = contcost ( w ) + ed ( w ~ , α ) .

The following lemma shows that given a certain edit script from string u, its cost is greater than or equal to the cost of its application on a superstring of u.

For a string s of the form s = s a us b  and an edit script E S = u = u 0 , u 1 , , u r = w from u to w, denote by E S ( s ) the edit script E S ( s ) = s = s a u s b = s a u 0 s b , s a u 1 s b , , s a u r s b = s a w s b from s = s a us b to t = s a ws b .

Lemma 7.

For s = s a us b and E S = u = u 0 , u 1 , , u r = w , cost E S ( s ) cost ( E S ) .

The proofs of Lemma 6 and Lemma 7 appear in the Appendix.

Equations 17 and 18 and Lemma 6 support the following preprocessing algorithm, Algorithm 3. Given a target string t, Algorithm 3 generates data structures that enable retrieving in constant time values of the form ed (α,ti,j) for every αΣ and every substring ti,j of t. The algorithm generates tables of the form T ~ α for every αΣ, such that entries T ~ α [ i , j ] contain the corresponding values ed α , t ~ i , j . In addition, the algorithm generates a vector DC, such that entries DC [ j] contain the corresponding values dupcost (t0,j). Then, queries of the form ed (α,ti,j) can be answered in a constant time according to Equation 19 below.
ed α , t i , j = Lem.6 Eq.18 T ~ α [ ĩ , ( j - 1 ~ ) + 1 ] + DC [ j ] - DC [ i ] , t i - 1 t i , DC [ j ] - DC [ i ] - dup ( t i ) , otherwise .
(19)

Algorithm 3 RL-LETTER-TO-STRING( t )

An algorithm which is symmetric to Algorithm 3 can be described in order to preprocess a string s for queries of the form ed (si,j,α).

We continue to describe the improved computation in the case where both input strings s and t are of length at least 2. To do so, we first add some auxiliary notation. For an interval Ix,y of positions within a string w, denote by Ĩ x , y the subsequence of indices in Ix,y which are start positions of runs in w. For example, for w = aabbbaacccc, the interval I1,7 = [ 1,2,…,6] contains all positions of letters within the substring w1,7 = abbbaa, and Ĩ 1 , 7 = [ 2 , 5 ] contains the start positions in w of the runs bbb and aa that are included in I1,7 (the first letter a in w1,7 belongs to a run in w that starts in position 0, and therefore position 1 is not included in Ĩ 1 , 7 ). This notation will be used for defining subsequences of rows and columns in DP matrices maintained by the algorithm, where some of these intervals are derived from the source string s, and some from the target string t. We will assume that the string from which Ĩ x , y was derived is clear from the context, and will not specify it explicitly. For example, when Ĩ x , y defines rows in matrices ED or EDT α , or either rows or columns in matrices S α , then the indices in Ĩ x , y are derived from the source string s. When Ĩ x , y defines columns in matrices ED or EDT α , or either rows or columns in matrices T α , then the indices in Ĩ x , y are derived from the target string t. Subsequences Ĩ x , y will be used for defining sparse regions in matrices, i.e. regions containing sets of rows or columns which are not necessarily adjacent.

Consider the computation of ed (s,t) as expressed in Equation 8. Assume first the special case where s ends with a run of length at least 2. In this case, s is of the form s = w β β for some string w and a letter β. For every partition (t a ,t b ) of t, it is possible to combine an optimal script E S 1 from the prefix w β of s to t a  and an optimal script E S 2 from the suffix β of s to t b , and to obtain a script E S = E S 1 ( w β β ) , E S 2 ( t a β ) from s to t. Therefore, ed ( w β β , t ) cost ( E S ) = cost ( E S 1 ( w β β ) ) + cost ( E S 2 ( t a β ) ) Lem.7 cost ( E S 1 ) + cost ( E S 2 ) = ed ( w β , t a ) + ed ( β , t b ) . In particular, ed ( w β β , t ) min ed ( w β , t a ) + ed ( β , t b ) | ( t a , t b ) P ( t ) = Eq.9 ed t β ( w β , t ) . In addition, it is possible to compose an edit script from w β β to t by first contracting the last two letters to obtain the string w β, and then applying an optimal script from w β to t. The cost of such a script is ed (w β,t) + cont (β), and therefore we get that ed (w β β,t)≤ min{edt β (w β,t),ed (w β,t) + cont (β)}.

Next, we show that ed (w β β,t) ≥ min {edt β (w β,t), ed (w β,t) + cont (β)}. From Equation 8, either ed (w β β,t) = ed (w β β,α) + ed (α,t) or ed (w β β,t) = edt α (s a ,t) + ed (s b ,α) for some αΣ and (s a ,s b ) P (w β β). Consider first the latter case. If (s a ,s b ) = (w β,β), then
ed ( w β β , t ) = ed t α ( w β , t ) + ed ( β , α ) = Eq.9 min ed , t a + ed α , t b | t a , t b P ( t ) + ed ( β , α ) Obs.5 min ed w β , t a + ed β , t b | t a , t b P ( t ) = Eq.9 ed t β ( w β , t ) .
Else, s b  is of length at least 2, and there is some string u such that s b = u β β and w = s a u. In this case, ed ( w β β , t ) = ed t α s a , t + ed ( u β β , α ) Lem.5 ed t α s a , t + ed ( u β , α ) + cont ( β ) Eq.8 ed ( w β , t ) + cont ( β ) . Similarly, it can be shown that when ed (w β β,t) = ed (w β β,α) + ed (α,t) for some α Σ, ed (w β β,t) ≥ ed (w β,t) + cont (β), and so ed (w β β,t) ≥ min {edt β (w β,t),ed (w β,t) + cont (β)}. Thus,
ed ( w β β , t ) = min ed t β ( w β , t ) , ed ( w β , t ) + cont ( β )
(20)
Formulating Equation 20 with respect to the data structures defined in Section “A baseline dynamic-programming algorithm for EDDC” (under the assumption that all values appearing at the right-hand side of the equation are computed and stored in the corresponding entries), we get the following equation:
ed s 0 , i , t 0 , j = min ED T s i - 1 [ i - 1 , j ] , ED [ i - 1 , j ] + cont ( s i - 1 ) ( when s i - 1 = s i - 2 )
(21)
Now, consider the case where the last run in s is of length 1 (i.e. s is not of the form w β β). Assume first that the term that yields the minimum value of the right-hand side of Equation 8 is of the form edt α (s a ,t) + ed (s b ,α) for some partition (s a ,s b ) P (s) and a letter αΣ. If (s a ,s b ) R(s), then there is some letter βΣ which is both the last letter of s a  and the first letter of s b . In this case, we can write s a  = w β and s b  = β u. Note that u ≠ ε (since s ≠ w β β by definition), and so ed ( s , t ) = ed t α ( w β , t ) + ed ( β u , α ) = Eq.9 min ed w β , t a + ed α , t b | t a , t b P ( t ) + ed ( β u , α ) Lem.5 , Obs.4 min ed w β β , t a - dup ( β ) + ed α , t b | t a , t b P ( t ) + ed ( u , α ) + dup ( β ) = min ed w β β , t a + ed ( α , t b ) | t a , t b P ( t ) + ed ( u , α ) = Eq.9 ed t α ( w β β , t ) + ed ( u , α ) . From the optimality of the partition (s a ,s b ) = (w β,β u), it follows that ed (s,t) = edt α (w β β,t) + ed (u,α). If u starts with β this step can be repeated, and inductively we can apply such partition refinements until obtaining a partition (s a ,s b ) of s such that ed (s,t) = edt α (s a ,t) + ed (s b ,α) and (s a ,s b ) R (s). Now, Equation 8 can be revised as follows:
ed ( s , t ) = min ed ( s , α ) + ed ( α , t ) , ed t α s a , t + ed s b , α s a , s b R ( s ) , α Σ ( when s is not of the form wββ )
(22)
Using the DP formulation, we get
ed s 0 , i , t 0 , j = min S α [ 0 , i ] + T α [ 0 , j ] , ED T α [ h , j ] + S α [ h , i ] h Ĩ 0 , i , α Σ = min S α [ 0 , i ] + T α [ 0 , j ] , tr S α [ i , Ĩ 0 , i ] ED T α Ĩ 0 , i , j α Σ ( when s i - 1 s i - 2 )
(23)
Similarly as shown for the computation of ed (s,t), it is possible to revise the computation of edt α (s,t) under Constraint 1 and obtain the following equations:
ed t α ( s , w β β ) = min ed t α ( s , w β ) + dup ( β ) , ed ( s , w β ) + mut ( α , β )
(24)
ed t α s 0 , i , t 0 , j = min ED T α [ i , j - 1 ] + dup t j - 1 , ED [ i , j - 1 ] + mut α , t j - 1 ( when t j - 1 = t j - 2 )
(25)
ed t α ( s , t ) = min ed s , t a + ed α , t b | t a , t b R ( t ) ( when t is not of the form wββ )
(26)
ed t α s 0 , i , t 0 , j = min ED [ i , h ] + T α [ h , j ] | h Ĩ 0 , j = ED i , Ĩ 0 , j T α Ĩ 0 , j , j when t j - 1 t j - 2
(27)

Finally, we present Algorithm 4, which is an efficient version of Algorithm 2. Stage 1 of the new algorithm is accelerated by using Algorithm 3 to compute for every αΣ distances from α all substrings of s and distances from all substrings of t to α. In Stage 2, we use the equations developed above in order to accelerate the computation. The correctness of the algorithm follows from the correctness of the recursive equations, and can be asserted similarly as done for Algorithm 2.

Algorithm 4 RL-MATRIX-EDDC( s , t )

Complexity analysis

Assume for simplicity that compressed forms of both input strings s and t have the same length ñ .

Algorithm 3.

 The running time of line 1 of the algorithm is O (n) for computing the compressed form of the input string, and O ( | Σ | MP ( ñ ) ) for running Stage 1 of Algorithm 2 over this compressed string. Lines 2 and 3 require O (n) time, and so the overall time complexity of Algorithm 3 is O ( n + | Σ | MP ( ñ ) ) . The space complexity for computing and maintaining all matrices is O ( | Σ | ñ 2 ) , and an additional O (n) space is required for the vector DC. Hence, the overall space complexity of the algorithm is O ( n + | Σ | ñ 2 ) (see Section “Time complexity analysis” for complexity analysis of Stage 1 of Algorithm 2).

Algorithm 4.

 Time and space complexities of Stage 1 of the algorithm are identical to those of Algorithm 3. As in Section “The algorithm”, the computations governing the running time of Stage 2 are those of matrix multiplications performed within recursive calls to RL-COMPUTE-MATRIX.

The recursive computation of RL-COMPUTE-MATRIX can be visualized as a tree (see Figure 4). Each node in the tree corresponds to a call to RL-COMPUTE-MATRIX over some regions Ii,k × Ij,l, which is either a leaf in case that k = i + 1 and l = j + 1, or otherwise an internal node. In the latter case, the node has exactly two children, corresponding to the two recursive calls obtained from either a vertical (lines 11 and 13) or a horizontal (lines 16 and 18) partition of the region. For simplicity, assume that the interval length ñ = Ĩ 2 , n + 1 = 2 x for some integer x. It can be observed that the algorithm alternates between vertical and horizontal partitions along paths from the root of the tree, where regions of two different nodes in the same depth y are disjoint, and the union of all regions of nodes in depth y covers the entire initial region I2,n + 1 × I2,n + 1 of the root node. For every 0 ≤ y ≤ log(n), there are two series of intervals I y , 0 , I y , 1 , , I y , 2 y - 1 and J y , 0 , J y , 1 , , J y , 2 y - 1 , such that the set of regions corresponding to all nodes in depth 2y is {Iy,f× Jy,g | 0 ≤ f  < 2 y ,0 ≤ g < 2 y }, and the set of regions corresponding to all nodes in depth 2 y + 1 is {Iy,f× Jy + 1,g | 0 ≤ f < 2 y ,0 ≤ g < 2y + 1}. In addition, the corresponding subsequences Ĩ y , 0 , , Ĩ y , 2 y - 1 and J ~ y , 0 , , J ~ y , 2 y - 1 have all the same size 2x - y.

Consider a node of depth 2y whose corresponding region is Iy,f × Jy,g, and the two regions corresponding to its children Iy,f × Jy + 1,2g and Iy,f × Jy + 1,2g + 1. The computation time spent on the node is dominated by the matrix multiplications performed in line 12 of RL-COMPUTE-MATRIX. This includes |Σ| matrix multiplications between pairs of matrices such the dimensions of the first matrix in each pair is I y , f × J ~ y + 1 , 2 g = I y , f × 2 x - y - 1 , and the dimensions of the second matrix in a pair is J ~ y + 1 , 2 g × J ~ y + 1 , 2 g + 1 = 2 x - y - 1 × 2 x - y - 1 . Observe that |Iy,f|≥ 2x-y-1, and such multiplications can be implemented by dividing the interval Iy,f into I y , f 2 x - y - 1 intervals of length 2x-y-1 each, and performing I y , f 2 x - y - 1 multiplications between square matrices of dimensions 2x-y-1 × 2x-y-1 in a total time of I y , f 2 x - y - 1 MP ( 2 x - y - 1 ) . Therefore, the time required for all matrix multiplications performed within nodes in depth 2 y is
0 f < 2 y 0 g < 2 y | Σ | I y , f MP ( 2 x - y - 1 ) 2 x - y - 1 = 0 f < 2 y 2 y | Σ | I y , f MP ( 2 x - y - 1 ) 2 x - y - 1 = 2 | Σ | nMP ( 2 x - y - 1 ) ñ .

Similarly, it is possible to show that the total time required for all matrix multiplications performed within nodes in depth 2 y + 1 is also 2 n | Σ | MP ( 2 x - y - 1 ) ñ , and so the total computation time of matrix multiplications throughout the entire algorithm run is O | Σ | n ñ 0 y < x MP ( 2 y ) . As in Section “The algorithm”, using the Master Theorem [16], this summation evaluates to O | Σ | nMP ( ñ ) ñ . In addition to matrix multiplications, RL-COMPUTE-MATRIX performs O (|Σ|n2) operations in base computations (lines 2-7), and so the total time complexity of the complete algorithm is O | Σ | n 2 + | Σ | nMP ( ñ ) ñ .

A simple implementation of Algorithm 4 can be done using the same space complexity of O (|Σ|n2), as the space complexity of Algorithm 2. A more involved implementation can be applied by observing that in fact the algorithm only examines and updates entries in matrices of dimensions at most n × ñ or ñ × n when performing matrix multiplications, and in addition it examines adjacent entries “to the left” or “above” an entry in a base-case region. This observation can be used in order to reduce the space complexity of the algorithm to O | Σ | , where the complete details of such an implementation are omitted from this text.

A comparison with previous works

In this section, we review the previous main algorithms for EDDC by Behzadi and Steyaert [7], Bérard et al. [8] and Abouelhoda et al. [9], and point out similarities and improvements made in our current work.

The main contribution of our work is in obtaining sub-cubic algorithms for EDDC, whereas all previous algorithms have cubic time complexities (for |Σ| the alphabet size, n the length of the input strings and ñ the length of their RLE compressed forms, the algorithms of [7], [8], and [9] obtain the time complexities O n 2 + n ñ 2 + | Σ | ñ 3 + | Σ | 2 ñ 2 , O n 3 + | Σ | ñ 3 , and O ( n 2 + n ñ 2 ) , respectively).

Notably, the algorithm of [9] eliminates a |Σ| factor that appears in the time complexities of the algorithms given in [7, 8] and here. However, this improvement is confined to a constrained model of duplication histories. As we do not assume this model here, we could not use the representation of [9] that allows the elimination of the |Σ| time complexity factor.

In general, the frameworks of all algorithms in [79] as well as the algorithms presented here are similar. All these algorithms apply two phases, where the first phase computes costs corresponding to all substrings of each one of the input strings separately, and the second phase uses these precomputed costs in order to compute the edit distance between each pair of prefixes of the input strings (our online variant described in Section “An online algorithm for EDDC using min-plus matrix-vector multiplication for discrete cost functions” interleaves these two phases, yet each operation it conducts can be conceptually attributed to one of the phases). The recursive formulas are similar as well, where those for the first phase can be viewed as special kinds of Weighted Context Free Grammar derivation rules.

Next, we address the cost function constraints. All algorithms assume that operation costs are nonnegative and apply additional assumptions similarly to those listed in our Property 1, which can be made without loss of generality.

In [8], operation costs were limited so that all duplications and contractions have the same constant cost (regardless of the letter over which they are applied), all deletions and insertions have the same constant cost, and all mutation costs are symmetric (i.e. mut (α,β) = mut (β,α) for every α,βΣ). While it was argued that these restrictions allow edit distance to be a metric, they limit the generality of the algorithm of [8], where the rest of the previous algorithms we mentioned can handle scoring schemes that not necessarily abide by these restrictions.

Both in [7] and in [8], it was required that all duplication and contraction costs are lower than the costs of any of the insertion, deletion, or mutation costs. This restriction is not explicitly stated in [9], yet seems to be required there as well. For the application of minisatellite map comparison, this requirement is somewhat unnatural since it assumes that mutations, which are typically common events, should cost more than the less common events of duplications and contractions. Our algorithms can be applied even when this restriction does not hold. However, one of our algorithms, the RLE variant (Section “Additional acceleration using run-length encoding”) adds a new requirement that was absent from those previous algorithms: it requires that for every α,βΣ, dup (α) ≤ dup (β) + mut (β,α) ≤ ins (α), and cont (α) ≤ cont (β) + mut (α,β) ≤ del (α) (our Constraint 1). On one hand, our Constraint 1 is more strict than the constraint of [7] and [8], in the sense that it implies nonnegative lower bounds over differences of the form ins (α) - dup (α) and del (α) - cont (α), while in [7] and [8] it was only required that these differences be nonnegative. On the other hand, our Constraint 1 does not require that the cost of mutations be higher than the cost of duplications and contractions.

We showed that our algorithms are more general with respect to the assumed constraints. We also claim that our algorithms are more precise with respect to the formal problem specification. All previous algorithms (excluding the first algorithm by Bérard and Rivals [2], which had an O (n4) running time and assumed a constant cost for all mutations in addition to the restrictions in [8]) might output non-optimal solutions in certain cases, as demonstrated in the following example. Consider the input s = a b, t = e f, and the cost function in which all duplications and contractions cost 1, all deletions and insertions cost 20, and the symmetric mutation costs are as given in Table 1. It can be shown that all three algorithms in [7], [8], and [9] would output the value 18 as the edit distance between the input strings, reflecting one of the edit scripts 〈ab,eb,ef〉 or 〈ab,af,ef〉. Nevertheless, the correct value is 17, due to the script 〈ab,cb,cc,c,d,dd,ed,ef〉. Perhaps it could be possible to specify additional restrictions over the cost functions in order to guarantee that the algorithms in [7], [8], and [9] return optimal solutions for all instances.
Table 1

Mutation costs for the instance s = ab , t = ef

 

a

b

c

d

e

f

a

0

6

3

6

9

9

b

6

0

3

6

9

9

c

3

3

0

3

6

6

d

6

6

3

0

3

3

e

9

9

6

3

0

6

f

9

9

6

3

6

0

Conclusions and discussion

This work presents computational techniques for improving the time complexity of algorithms for the EDDC problem. We adapt the problem to the VMT framework defined in [11], which incorporates efficient matrix multiplication subroutines in order to accelerate standard dynamic programming algorithms. We describe an efficient algorithm, as well as two variants which are even more efficient, given some restrictions on the cost functions.

An additional result we give is the currently most efficient algorithm for the min-plus multiplication of D-discrete matrices (matrices for which differences between adjacent entries are integers within an interval of length D).

We note that the running times of our algorithms depend on the alphabet size |Σ|. For the general algorithm, the running time is O (|ΣMP(n)), where MP(n) is the time complexity of the min-plus multiplication of two n × n matrices, which is currently upper-bounded by O n 3 log 3 log n log 2 n [12]. Some of the previous algorithms obtain alphabet independent time complexities, for example the algorithms in [9] and [2]. As we discussed in Section “A comparison with previous works”, such algorithms do not solve the most general variant of the problem and require some assumptions on the cost function. Nevertheless, we believe that the matrix multiplication-based techniques for improving the time complexity presented in this paper can also be incorporated to the algorithm of [9], however the details of this enhancement are beyond the scope of this paper.

In contrast to the work of [9], our model assumes that intermediate strings along edit scripts may contain characters which are absent from both source and target strings. This implies that the size of the alphabet |Σ| is not bounded by the length of the input sequences. In the context of minisatellite comparison, identifying a feasible alphabet and cost function for this task is an interesting problem beyond the scope of this paper.

Appendix

Correctness of the recursive computation

This section proves Theorem 1, thus asserting the correctness of the recursive computation for the EDDC problem given in Section “The recurrence formula”. We start by adding some required notation and showing how long edit scripts can be decomposed to shorter partial scripts. Then, we use the observed recursive properties in order to prove the correctness of the recurrence.

Let s,w,t be strings, E S 1 = s = u 1 , 0 , u 1 , 1 , , u 1 , r 1 = w an edit script from s to w, and E S 2 = w = u 2 , 0 , u 2 , 1 , , u 2 , r 2 = t an edit script from w to t. Denote by E S = E S 0 , E S 1 the concatenated edit script E S = s = u 1 , 0 , u 1 , 1 , , u 1 , r 1 = w = u 2 , 0 , u 2 , 1 , , u 2 , r 2 = t from s to t. Note that cost ( E S ) = cost E S 1 + cost E S 2 , and E S = E S 1 + E S 2 . This notation extends naturally to concatenations of more than two scripts. For example, E S = E S 1 , E S 2 , , E S q denotes an edit script from a string s to a string t obtained by a concatenation of q scripts, each script E S i transforms some intermediate string wi-1 into a string w i , and s = w0 and t = w q .

Obeservation 5.

For every three strings s,w,t, ed(s,t) ≤ ed (s,w) + ed (w,t).

The correctness of the above observation follows from the fact that for a pair of optimal edit scripts E S 1 from s to w and E S 2 from w to t, the script E S = E S 1 , E S 2 from s to t satisfies ed ( s , t ) cost ( E S ) = cost ( E S 1 ) + cost ( E S 2 ) = ed ( s , w ) + ed ( w , t ) .

Lemma 7.

For s = s a us b and E S = u = u 0 , u 1 , , u r = w , cost E S ( s ) cost ( E S ) .

Proof.

 Each edit operation transforming s a u i s b to s a ui + 1s b in E S ( s ) corresponds to an operation transforming u i to ui + 1 in E S . The only cases where corresponding operations may have different costs are those of insertions or deletions in E S at the beginning or ending of u i , which become duplications or contractions in E S ( s ) , respectively. For example, in case the applied operation over u i  in E S is the deletion of its first letter α, and α is also the last letter of s a , then the cost of the operation in E S is del (α) while the cost of the corresponding operation in E S ( s ) is cont (α) ≤ del (α). Similar scenarios may occur in case of an insertion of a letter at the beginning of u i which is identical to the last letter of s a , as well as in cases of insertions and deletions at the end of u i of letters identical to the first letter of s b . In any other case, each pair of corresponding operations have the same cost. Therefore the cost of each operation in E S ( s ) is smaller than or equals to the cost of its corresponding operation in E S , and cost E S ( s ) cost ( E S ) . □

Lemma 8.

Let s and t be two strings, and (s a , s b ) P (s), (t a , t b ) P (t) partitions of s and t, respectively. Then, ed (s,t) ≤ ed (s a ,t a ) + ed (s b ,t b ).

Proof.

 Let E S a be an optimal script from s a  to t a  and E S b an optimal script from s b  to t b . The script E S a ( s ) is a script from s = s a s b to t a s b . Similarly, E S b ( t a s b ) is a script from t a s b to t a t b = t. For the script E S = E S a ( s ) , E S b ( t a s b ) from s to t, we have that ed ( s , t ) cost ( E S ) = cost ( E S a ( s ) ) + cost ( E S b ( t a s b ) ) Lem.7 cost ( E S a ) + cost ( E S b ) = ed s a , t a + ed s b , t b . □

Let s and t be strings. Call a pair of partitions (s a , s b ) P (s) and (t a , t b ) P (t) an optimal pairwise partition of s and t if ed (s,t) = e (s a , t a ) + ed (s b , t b ). Say that an edit script E S from s to t is a shortest optimal script from s to t if E S is optimal, and for every other optimal script E S from s to t, E S E S . For a script E S = s = u 0 , u 1 , , u r = t from s to t and 0 ≤ i ≤ j ≤ r, denote by E S i , j = u i , u i + 1 , , u j the partial script of E S from u i  to u j .

Observation 6.

Let E S = u 0 , u 1 , , u r be a shortest optimal edit script from u0to u r . For every 0 ≤ i ≤ j ≤ r, the partial script E S i , j is a shortest optimal edit script from u i to u j . Moreover, for any shortest optimal script E S i , j from u i to u j within E S , the script E S = E S 0 , i , E S i , j , E S j , r is a shortest optimal script from u0 to u r .

The correctness of the above observation is obtained by noting that if E S i , j is not a shortest optimal script from u i to u j , then for some shortest optimal script E S i , j from u i  to u j  we get that the script E S = E S 0 , i , E S i , j , E S j , r either has a lower cost than E S , or is a shorter script of the same cost, in contradiction to E S being a shortest optimal script from u0 to u r .

Lemma 9.

Let E S = u 0 , u 1 , , u r be a shortest optimal edit script from u0 to u r . If there are two indices 0 ≤ i < j ≤ r such that u i and u j are strings of length 1, then j = i + 1. In addition, for every 0 < k < r, u k ≠ ε.

Proof.

 Assume there are two indices 0 ≤ i < j ≤ r such that u i and u j are strings of length 1, i.e u i = α and u j = β for some α,βΣ. From Observation 6, the partial script E S i , j = α = u i , u i + 1 , , u j = β is a shortest optimal script from α to β. Since j > i, it must be that α ≠ β (otherwise the script E S = E S 0 , i , E S j , r is a shorter script from u0 to u r  of no greater cost than E S , in contradiction to E S being a shortest optimal script from u0 to u r ). From Property 1, ed (α,β) = mut (α,β), and so the edit script containing the single operation of mutating α to β is an optimal script from α to β, and it must be that j = i + 1.

In addition, assume by contradiction there is some index 0 < k < r such that u k  = ε. The only edit operation which may yield an empty string is a deletion from a single-letter string, and therefore uk-1 = α for some letter α. Similarly, the only edit operation which may be applied over an empty string is an insertion, therefore uk+1 = β for some letter β, in contradiction to the fact that two intermediate strings of length 1 must be consecutive along a shortest optimal script, as shown above. □

Call an edit script E S from a string s to a string t simple if E S is a shortest optimal script from s to t, in which no generating operation precedes a reducing operation. The following lemma generalizes Lemma 2 of [6], by considering also indels in addition to contractions and duplications.

Lemma 10.

For every pair of strings s and t, there exists a simple edit script from s to t.

Proof.

 Let s and t be two strings, and r the length of a shortest optimal script from s to t. When r ≤ 1, any shortest optimal script from s to t either contains no reducing operation or contains no generating operation, and in particular is a simple script. Otherwise, r > 1, and assume by induction the lemma holds for every pair of strings such that the length of a shortest optimal script from the source string to the target string is less than r. Let E S = s = u 0 , u 1 , , u r = t be a shortest optimal script from s to t.

Case 1: The first operation in E S is not a generating operation. From Observation 6, the partial script E S 1 , r is a shortest optimal script from u1 to u r , whose length is r - 1. From the inductive assumption, there is a simple script E S 1 , r from u1 to u r , and from Observation 6 the script E S = E S 0 , 1 , E S 1 , r is a shortest optimal script from s to t. As the first operation in E S is non-generating (being the same first operation as in E S ), E S is simple, and the lemma follows.

Case 2: The first operation in E S is a generating operation. Similarly as above, we may assume w.l.o.g. by applying the inductive assumption that the partial script E S 1 , r is simple. If this partial script is non-reducing, then E S is non-reducing, and in particular it is simple. Otherwise, let 1≤i<r be the smallest index such that the transformation of u i  to ui+1 is by a reducing operation. Since neither generating nor reducing operations may precede this operation in the partial script E S 1 , i , it follows that all operations in the partial script E S 1 , i (if there are any) are mutations.

The generating operation transforming u0 to u1 is either an insertion or a duplication of some letter α in u0. In both cases, we can write s = u0 = vxw and u1 = vxw (v,x,x and w are strings), where in the former case x = ε and x = α, and in the latter case x=α and x = α α. As all operations in the partial script E S 1 , i are mutations, each intermediate string u j , for 1 ≤ j ≤ i, is of the form v j x j w j , where v j ,x j , and w j are string obtained by applying zero or more mutations over v,x, and w, respectively. We argue that the reducing operation transforming u i = v i x i w i to ui+1 cannot be the deletion of a letter or a contraction involving at least one letter within the substring x i . This is true, since in such a case it would have been possible to avoid the first generating operation in E S (transforming x to x), as well as all mutation operations over a reduced letter in x i , and the reducing operation from u i  to ui+1. This would yield a script E S 0 , i + 1 from u0 to ui+1 which is shorter and of no higher coast than E S 0 , i + 1 , in contradiction to Observation 6. Hence, the reducing operation from u i  to ui+1 either deletes a letter or contracts two letters within one of the substrings v i or w i of u i .

Consider first the case where the reducing operation over u i is applied within its prefix v i . Thus, we can write ui+1 = vi+1xi+1wi+1, where vi+1 is the string obtained by applying the corresponding reducing operation over v i , xi+1 = x i , wi+1 = w i , and cost (〈 u i ,ui+1 〉) = cost (〈 v i ,vi+1 〉). The operations in E S 0 , i + 1 can be assigned into two independent scripts: a script E S v = v = v 0 , v 1 , , v ′p = v i + 1 from v to vi+1 obtained by merging each multiple occurrence of consecutive identical strings in the series v = v1,v2,…,vi+1 into a single occurrence, and similarly a script E S xw = xw = ( xw ) 0 , ( xw ) 1 = x w = x 1 w 1 , ( xw ) 2 , , ( xw ) q = x i + 1 w i + 1 from x w to xi+1wi+1. Each operation in E S 0 , i + 1 corresponds to exactly one operation in either E S v or E S xw , where the costs of corresponding operations are equal, and therefore cost ( E S 0 , i + 1 ) = cost ( E S v ) + cost ( E S xw ) and | E S 0 , i + 1 | = | E S v | + | E S xw | .

Now, the script E S v ( u 0 ) = u 0 = vxw = v 0 xw , v 1 xw , , v p xw = v i + 1 xw is a script from u0 to vi+1xw, and similarly the script E S xw ( v i + 1 xw ) = v i + 1 xw = v i + 1 ( xw ) 0 , v i + 1 ( xw ) 1 , , v i + 1 ( xw ) q = v i + 1 x i + 1 w i + 1 = u i + 1 is a script from vi+1xw to ui+1. Thus, the script E S 0 , i + 1 = E S v ( u 0 ) , E S xw ( v i + 1 xw ) is a script from u0 to ui+1. Since E S v contains at least one operation (the reducing operation from v i  to vi+1) and no generating operation (since besides the reducing operation E S v may contain only mutations), E S 0 , i + 1 starts with a non-generating operation. In addition, cost ( E S 0 , i + 1 ) = cost ( E S v ( u 0 ) ) + cost ( E S xw ( v i + 1 xw ) ) Lem.7 cost ( E S v ) + cost ( E S xw ) = cost ( E S 0 , i + 1 ) and E S 0 , i + 1 = E S v ( u 0 ) + E S xw ( v i + 1 xw ) = E S v +