Let s be the given string. We want to find a substring t of s with the maximum self-alignment value . Let s[1, j] be the substring of s that consists of the first j letters. Informally, we use w(i, j) to denote the maximum self-alignment value of a suffix t
j
of s[1, j] such that there are i letters at the right end of the self-alignment of t
j
that are aligned with spaces and scored as
. Note that, the right end of the self-alignment of t
j
could contain more than i spaces. However, only the last i spaces are scores as
each and the rest of them are scored as the score for μ(x, Δ).
Let T
j
be the set of all suffixes of s[1, j]. For a substring t of s and an integer i, Π(t, i) = {π(t)|π(t) ∈ Π(t) and |tk-1t
k
| ≥ i}, where tk-1, t
k
are the last two repeats in π(t). We define
to be the maximum V(·, c, i) value of all the partitions in Π(t, i), where t is a substring in T
j
. To compute w(i, j) using dynamic programming method, we first consider the boundary values of w(i, j). We set w(0, j) = -∞ since we do not allow suf(tk-1t
k
, i) to be empty. Note that, by definition, i ≤ j.
Lemma 1 For a sequence s of length n, w(j, j) = c·j for 1 ≤ j ≤ n.
Proof. For a partition π(t) = {t1, t2, ..., t
k
} satisfying t ∈ T
j
and π(t) ∈ Π(t, j), from the definition of Π(t, j), |tk-1t
k
| ≥ j. Since t is a suffix of s[1, j] and |t| ≥ |tk-1t
k
| ≥ j, we have t = s[1, j] and 1 ≤ k ≤ 2. Consider the self alignment of π(t) such that the last j letters in t are mapped to spaces with score
. Two cases arise. Case 1: k = 1 and t = t1 = s[1, j]. In this case, the middle alignment is empty. Thus, V(π(t), c, j) =
× (|t1| + j) = c·j. Case 2: k = 2 and t = t1t2 = s[1, j]. In this case, the middle alignment is the alignment between |t2| spaces and t2. By the assumption that
> μ(Δ, x), V(π(t), c, j) = |t2| × μ(Δ, x) +
× (|t1| + j) <c·j. □
From the above analysis, the initial and boundary values are
w(0, j) = -∞, w(j, j) = c·j (2)
Theorem 2 For a sequence s of length n and 2 ≤ j ≤ n, 1 ≤ i <j,
Proof. Consider the partition π(t) = {t1, t2, ..., t
k
} such that t ∈ T
j
, π(t) ∈ Π(t, i) and V(π(t), c, i) = w(i, j). We analyze the value of V(π(t), c, i) based on different cases.
case 1. π(t) has only one repeat. We have |t| = |t1| = i and the middle alignment is empty. Therefore V(π(t), c, i) =
× (|t1| + i) = c·i.
case 2. π(t) has k ≥ 2 repeats. In this case, the middle alignment is not empty since it contains t
k
. The last column in the middle alignment has three configurations: (a) the last column contains two letters s[j - i] and s[j], (b) the last column contains a space and the letter s[j], (c) the last column contains the letter s[j - i] and a space. For sub-case (a), if we take away the last letter s[j] from the self alignment of π(t), we can get a self alignment of π'(t'), where t' ∈ Tj-1, π'(t') ∈ Π(t', i) and V(π'(t'), c, i) = w(i, j - 1). By comparing the two self alignment, we have V(π(t , c, i = V(π'(t'), c, i) + μ(s[j - i], s[j]) = w(i, j - 1) + μ(s[j - i], s[j]). For sub-case (b), if we take away the last letter s[j] and space aligned with s[j] from the self alignment of π(t), we can get a self alignment of π'(t'), where t' ∈ Tj-1, π'(t') ∈ Π(t', i - 1) and V(π'(t'), c, i) = w(i - 1, j - 1). Notice that in the end of the self alignment of π'(t'), there are only i - 1 letters mapped to spaces with score
, and there is one more letter in the self alignment of π(t) mapped to spaces with score
. Thus, V(π(t), c, i) = w(i - 1, j - 1) + μ(Δ, s[j]) +
. For sub-case (c), π(t) ∈ Π(t, i + 1). We can impose that the alignment of the letter s[j - i] and the space is scored as
, not μ(s[j - i], Δ). From V(π(t), c, i + 1) = w(i + 1, j), we have V(π(t), c, i) = w(i + 1, j) + μ(s[j - i], Δ) - 
Based on Theorem 2, a dynamic programming algorithm is designed. Let n be the length of the input sequence s. We compute w(i, j) in the order shown below:
for j = 1 to n do
for i = j downto 1 docompute w(i, j) based on Theorem 2.
Obviously, the time complexity is O(n2), where n is the length of the whole string. A standard backtracking process allows us to find the local optimal pseudo-periodic region t.
The following example illustrates the algorithm. Let s = CAGAGT. We set c = -2 and use the following score scheme: a match costs 10, a mismatch costs -10, and an insertion or a deletion costs -10. The table constructed by using the dynamic programming algorithm is shown in Figure 3. The table is constructed in from the top to the bottom. For every row in the table, the w(i, j)'s are computed from left to right. From the table, it is easy to see that the maximum value of w(i, j) is w(2, 5) = 16. From the maximum value w(2, 5) = 16, we know that the local optimal pseudo-periodic region t is a suffix of s[1, 5] = CAGAG and there are 2 letters aligned with spaces and scored as
at the right end of the self alignment of the local optimal pseudo-periodic region t. From w(2, 5), we can backtrack w(2, 5) → w(2, 4) → w(2, 3) and stop at w(2, 3) since w(2,3) gets its value from c·i indicating the first segment of the partition of t ends at 3-th letter in s and the length of the segment is 2. Thus, we get t = AGAG. From the self alignment, it is easy to get the partition of π(t) = {AG, AG}.
The space complexity required is also O(n2) if we are not careful. However, we can release the space whenever they are no longer useful. Thus, only two columns, are required for the computation. For each of the two columns, we use two arrays: one array stores the value of w(i, j) and the other array stores the starting position of the subsequence t that maximizes w(i, j). Therefore, the space complexity is O(n) for computing all w(i, j)'s. After w(i, j)'s are computed, we know the substring t that leads to the optimal w value. Therefore, we can reconstruct the alignment for t in
time and space, where n1 is the length of t, the repeated region. If n1 is still big, we can use the standard technique in [12] to reduce the space to O(n1) by doubling the computation time for reconstructing the alignment of t.
In practice, a sequence may contain more than one repeated region. To find all the repeated regions, we can select the best k values of w(i, j)'s for some pre-defined value k. Each backtracking gives a repeated region. Another way to set a threshold for the value of w(i, j) and select all w(i, j)'s with value greater than the threshold.