In this section we propose a distance measure between entire genomes based on the notion of underlying subwords. In order to build a sound similarity measure between genomes, we need first to study the properties of the matching statistics. Our first contribution is the characterization of the subwords that are needed to compute the matching statistics. A second contribution is the selection of these subwords so that the resulting similarity measure does not contain overcounts. Our main idea is to avoid overlaps between selected subwords, more precisely by discarding common subwords occurring in regions covered by other more significant subwords.
Irredundant common subwords
In the literature, the values l[i] used by the ACS approach are called the matching statistics, as described in detail by Gusfield [17]. Our first contribution is to characterize the matching statistics in order to identify which subwords are essential.
It is well known that the total number of distinct subwords of any length found in a sequence of length n can be Θ(n^{2}). Remarkably, there exists a family of fewer than 2n subwords that are maximal in the host sequence, in the sense that it is impossible to extend a word in this class by appending one or more characters to it without losing some of its occurrences [18]. It has been shown that the matching statistics can be derived from this set of maximal subwords [19]. Here we further tighten this bound by showing that, to compute the matching statistics, it is enough to consider a subset of the maximal subwords, called the irredundant common subwords.
The notion of irredundancy was introduced in[20] and later modified for the problem of protein comparison[21, 22]. It proved useful in different contexts from data compression[23] to the identification of transcription factors[24]. In this paper we introduce the concept of irredundant common subwords (i.e., without mismatches/wildcards). This ensures that there exists a close correspondence between the irredundant common subwords and the matching statistics.
Definition 1
(Irredundant/Redundant common subword) A common subword w is irredundant if and only if at least one occurrence of w in s_{1} or s_{2} is not covered by occurrences of other common subwords. A common subword that does not satisfy this condition is called a redundant common subword.
We observe that the number of irredundant common subwords |\mathcal{I}_{s_{1},s_{2}}| is bounded by m + n, where |s_{1}| = n and |s_{2}| = m, since it is a subset of the set of maximal common subwords (see [19, 25] for a more complete treatment of this topic).
Proposition 1
The matching statistics l_{s_{1}}(i) can be computed by combining all and only the irredundant common subwords of s_{1} and s_{2}.
Proof
To show that the vector l_{s_{1}}(i) can be derived from the irredundant common subwords, we define a new vector of scores l_{w} for each subword w, where l_{w}[j] = |w| − j + 1 represents the length of the suffix of w starting at position j, with j = 1,…,|w|. Then, for each subword w in \mathcal{I}_{s_{1},s_{2}}, we superimpose the vector l_{w} on all the occurrences of w in s_{1}. For each position i in s_{1}, l_{s_{1}}(i) is the maximum of the scores max_{w}(l_{w}[j]) such that k + j = i and k is an occurrence of w.
To complete the proof we have to show that every occurrence of a common subword of s_{1} and s_{2} is covered by some occurrence of a subword in \mathcal{I}_{s_{1},s_{2}}. By definition of irredundant common subword, any occurrence of a common subword either corresponds to an irredundant common subword or is covered by some subword in \mathcal{I}_{s_{1},s_{2}}. Moreover, every irredundant common subword w has at least one occurrence i that is not covered by other subwords; thus l_{s_{1}}(i) corresponds exactly to |w|, and the subword w is necessary to compute the matching statistics. In conclusion, by using the method described above for l_{s_{1}}(i), we can compute for each position the length of the longest common subword starting at that location, which corresponds to the matching statistics. □
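The superimposition used in this proof can be sketched directly. The helper below is illustrative only; the function name and the input encoding (a map from subword length to a list of starting positions in s_{1}) are our own:

```python
def matching_statistics_from_subwords(n, occurrences):
    """Compute l_{s1}(i) by superimposing suffix-length score vectors.

    `occurrences` maps each common subword length |w| to the list of its
    starting positions in s1 (0-based). An occurrence of w starting at k
    contributes the score |w| - j to position k + j, i.e. the length of
    the suffix of w starting there; l_{s1}(i) keeps the maximum.
    """
    l = [0] * n
    for length, starts in occurrences.items():
        for k in starts:
            for j in range(length):
                l[k + j] = max(l[k + j], length - j)
    return l
```

For instance, with n = 5 and a length-3 subword at position 0 plus a length-2 subword at positions 0 and 3, the resulting vector is [3, 2, 1, 2, 1].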
In summary, the notion of irredundant common subwords is useful to decompose the information provided by the matching statistics into several patterns. Unfortunately these subwords can still overlap in some positions. This might lead to an overcount in the matching statistics, in which the same region of the string contributes more than once. Our aim is to remove the possibility of overcounting by selecting the most representative common subword for each region of the sequences s_{1} and s_{2}, which will also remove all overlaps.
Underlying subwords
When comparing entire genomes we want to prevent large noncoding regions, which by nature tend to be highly repetitive, from contributing to the similarity score multiple times, thus misleading the final score. In fact, when analyzing massive genomes, the number of repeated patterns is very high, particularly in the nongenic regions. Therefore we need to filter out part of this information and select the "best" common subword, by some measure, for each region of the sequences.
In this regard, we recall the definitions of pattern priority and of underlying pattern, adapted from [26] to the case of pairwise sequence comparison. We take as input the irredundant common subwords with underlying quorum u = 2, i.e., each subword must appear at least twice. Let w and w′ be two distinct subwords. We say that w has priority over w′, denoted w → w′, if and only if either |w| > |w′|, or |w| = |w′| and the first occurrence of w appears before the first occurrence of w′. Note that every subword is identified by its length and one of its starting positions in the sequences, so any set of subwords is totally ordered with respect to the priority rule. We say that an occurrence l of w is tied to an occurrence l′ of a subword w′ if the two occurrences overlap, i.e., [l, l + |w| − 1] ∩ [l′, l′ + |w′| − 1] ≠ ∅, and w′ → w. Otherwise, we say that l is untied from l′.
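The priority rule and the interval-overlap test behind tied occurrences can be sketched as follows. The helper names are ours, and `tied` checks only the overlap condition; the full definition additionally requires that the other subword has priority:

```python
def has_priority(w1, w2):
    """True if w1 -> w2: longer subwords win, and ties in length are
    broken by the earlier first occurrence. Each subword is encoded
    as a (length, first_occurrence) pair."""
    (len1, pos1), (len2, pos2) = w1, w2
    return len1 > len2 or (len1 == len2 and pos1 < pos2)

def overlaps(l1, len1, l2, len2):
    """True iff the intervals [l1, l1+len1-1] and [l2, l2+len2-1]
    of two occurrences intersect."""
    return not (l1 + len1 - 1 < l2 or l2 + len2 - 1 < l1)
```

With this encoding, any set of subwords is totally ordered by `has_priority`, as noted above.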
Definition 2
(Underlying subwords) A set of subwords \mathcal{U}_{s_{1},s_{2}} ⊆ \mathcal{I}_{s_{1},s_{2}} is said to be underlying if and only if:

(i)
every subword w in \mathcal{U}_{s_{1},s_{2}}, called an underlying subword, has at least two occurrences, one in s_{1} and the other in s_{2}, that are untied from all the untied occurrences of other subwords in \mathcal{U}_{s_{1},s_{2}} ∖ {w}, and

(ii)
there does not exist a subword w ∈ \mathcal{I}_{s_{1},s_{2}} ∖ \mathcal{U}_{s_{1},s_{2}} such that w has at least two occurrences, one per sequence, that are untied from all the untied occurrences of subwords in \mathcal{U}_{s_{1},s_{2}}.
This subset of \mathcal{I}_{s_{1},s_{2}} is composed only of those subwords that rank highest under our priority rule with respect to s_{1}. In fact, if there are overlaps between subwords in \mathcal{I}_{s_{1},s_{2}}, we select only the subwords with the highest priority. Similarly to the score ACS(s_{1}, s_{2}), the set \mathcal{U}_{s_{1},s_{2}} is asymmetric and depends on the order of the two sequences; we address this issue in Section "A distance-like measure based on underlying subwords". As for the underlying patterns [26], one can show that the set of underlying subwords exists and is unique. As a corollary, the untied occurrences of the underlying subwords can be mapped onto the sequences s_{1} and s_{2} without overlaps. Moreover, by definition, the total length of the untied occurrences cannot exceed the length of the sequences. These two properties are crucial when building a similarity measure, because any similarity based on these subwords counts the contribution of each region of the sequence only once.
Efficient computation of underlying subwords
To summarize we select the irredundant common subwords that best fit each region of s_{1} and s_{2}, employing a technique that we call Underlying Approach or, in short, UA. This technique is based on a simple pipeline. We first select the irredundant common subwords and subsequently filter out the subwords that are not underlying. From a different perspective, we start from the smallest set of subwords that captures the matching statistics and remove the overlaps by applying our priority rule. In the following we show how to compute the irredundant common subwords and the matching statistics, and then we present an approach for the selection of the underlying subwords among these subwords. The general structure of the Underlying Approach (UA) is the following:

1) Compute the set of the irredundant common subwords \mathcal{I}_{s_{1},s_{2}}.

2) Rank all subwords in \mathcal{I}_{s_{1},s_{2}} according to the priority rule and initialize \mathcal{U} to the empty set.

3) Iteratively select the next subword p from \mathcal{I}_{s_{1},s_{2}} following the ranking.

4a) If p has at least two untied occurrences: add p to \mathcal{U} and update the corresponding regions of Γ (see below) in which p occurs;

4b) otherwise, discard p and return to (3).
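The steps above can be sketched as a greedy loop. This is a simplified illustration under two assumptions of ours: the subwords arrive already ranked as (length, occurrence-list) pairs, and untied occurrences are checked in s_{1} only, whereas the actual test also requires an untied occurrence in s_{2}:

```python
def select_underlying(n, ranked):
    """Greedy filter (steps 2-4): gamma[i] is True when position i of s1
    is covered by an untied occurrence of an already selected subword.
    Returns the selected subwords with their untied occurrences."""
    gamma = [False] * n
    underlying = []
    for length, occs in ranked:
        untied = []
        for i in occs:
            # Endpoint test: previously selected subwords are at least as
            # long, so covering any interior location of [i, i+length-1]
            # would also cover one of its endpoints.
            if not gamma[i] and not gamma[i + length - 1]:
                untied.append(i)
                for j in range(i, i + length):
                    gamma[j] = True
        if untied:
            underlying.append((length, untied))
    return underlying
```

The endpoint test is the same one justified in detail later, in the discussion of the vector Γ.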
Discovery of the irredundant common subwords
In step (1) we construct the generalized suffix tree T_{s_{1},s_{2}} of s_{1} and s_{2}. We recall that an occurrence of a subword is left- (right-) maximal if it cannot be covered from the left (right) by some other common subword. The first step consists in a depth-first traversal of all nodes of T_{s_{1},s_{2}}, coloring each internal node with the colors of its leaves (each color corresponds to an input sequence). In this traversal, for each leaf i of T_{s_{1},s_{2}}, we find the lowest ancestor of i having both colors c_{1} and c_{2}, say the node w. Then w is a common subword, and i is one of its right-maximal occurrences (in s_{1} or in s_{2}); we select all subwords having at least one right-maximal occurrence. The resulting set is linear in the size of the sequences, that is, O(m + n). This is only a superset of the irredundant common subwords, since the occurrences of these subwords may not be left-maximal.
In a second phase, we map the length of each right-maximal occurrence i into l_{s_{1}}(i) and, using Proposition 1, we check which occurrences i have length greater than or equal to the length stored at location i − 1 (for locations i ≥ 2). These occurrences are also left-maximal, since they cannot be covered by a subword appearing at position i − 1. Finally, we retain all subwords that have at least one occurrence that is both right- and left-maximal, i.e., the set of irredundant common subwords \mathcal{I}_{s_{1},s_{2}}. Note that, by employing the above technique, we directly discover both the irredundant common subwords and the matching statistics l_{s_{1}}(i).
The construction of the generalized suffix tree T_{s_{1},s_{2}} and the subsequent extraction of the irredundant common subwords \mathcal{I}_{s_{1},s_{2}} can be completed in time and space linear in the size of the sequences.
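For intuition, the matching statistics themselves admit a brute-force computation. The sketch below is a quadratic reference implementation of our own, not the linear-time suffix-tree procedure described above, and is useful only for checking results on small inputs:

```python
def naive_matching_statistics(s1, s2):
    """l[i] is the length of the longest prefix of s1[i:] that occurs
    anywhere in s2. Brute force via substring tests; a generalized
    suffix tree achieves the same result in linear time."""
    l = []
    for i in range(len(s1)):
        k = 0
        while k < len(s1) - i and s1[i:i + k + 1] in s2:
            k += 1
        l.append(k)
    return l
```

For example, for s1 = "abab" and s2 = "ba" the vector is [1, 2, 1, 1].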
Selection of the underlying subwords
In this section we describe how, given the set of irredundant common subwords \mathcal{I}_{s_{1},s_{2}}, we filter out the subwords that are not underlying, obtaining the set of underlying subwords \mathcal{U}_{s_{1},s_{2}}.
The extraction of underlying subwords takes as input the set \mathcal{I}_{s_{1},s_{2}} and the tree T_{s_{1},s_{2}} from the previous section. First we sort all subwords in \mathcal{I}_{s_{1},s_{2}} according to the priority rule (step 2). Then, starting from the top-ranked subword, we iteratively analyze all subwords by checking their untied occurrences (step 3). If a subword passes the validity test we select it as underlying (step 4a); otherwise we move on to the next subword (step 4b). The two key steps of this algorithm are sorting the subwords (step 2) and checking their untied occurrences (step 4a).
Step 2 is implemented as follows. For each subword we retrieve its length and first occurrence in s_{1} from the tree T_{s_{1},s_{2}}; each subword is thus characterized by its length and its first occurrence. Since these are integers in the range [0, n], we can apply radix sort [27], first by length and then by occurrence. This step takes linear time.
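Step 2 can be sketched with two stable counting-sort passes, least-significant key first, which is the radix-sort idea used here. The function names and the (length, first-occurrence) encoding are ours:

```python
def priority_sort(subwords):
    """Sort subwords by priority: decreasing length, then increasing
    first occurrence. Each subword is a (length, first_occurrence)
    pair with integer values in [0, n]."""
    def counting_sort(items, key, max_val, reverse=False):
        # Stable bucket pass over integer keys in [0, max_val].
        buckets = [[] for _ in range(max_val + 1)]
        for it in items:
            buckets[key(it)].append(it)
        order = reversed(buckets) if reverse else buckets
        return [it for b in order for it in b]

    if not subwords:
        return []
    n = max(max(l for l, _ in subwords), max(p for _, p in subwords))
    # Least-significant key first: occurrence ascending, then length descending.
    by_occ = counting_sort(subwords, key=lambda w: w[1], max_val=n)
    return counting_sort(by_occ, key=lambda w: w[0], max_val=n, reverse=True)
```

Because each pass is stable, ties in length preserve the ascending order of first occurrences, matching the priority rule.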
To implement step 4a we define a vector Γ of n booleans representing the locations of s_{1}: if Γ[i] is true, then location i is covered by some untied occurrence. We also preprocess the input tree, adding to every node v a link to its closest irredundant ancestor, prec(v). This can be done by traversing the tree in preorder: when visiting a node v, if v is not irredundant we transmit prec(v) to its children, otherwise we transmit v itself. This preprocessing can be implemented in linear time and space.
For each subword w in \mathcal{I}_{s_{1},s_{2}} we maintain the list \mathcal{L}_{w} of occurrences to be checked. All lists \mathcal{L}_{w} are initialized in the following way: every leaf v, representing a position i, sends its value i to the occurrence list of its closest irredundant ancestor using the link prec(v). Again, this preprocessing takes linear time and space, since each position appears in exactly one list. We update these lists \mathcal{L}_{w} only with the occurrences still to be checked, i.e., those not covered by some underlying subword already discovered. We start by analyzing the top-ranked subword w; in this case \mathcal{L}_{w} is composed of all the occurrences of w.
For each occurrence i of w we need to check only its first and last locations in the vector Γ, i.e., the locations Γ[i] and Γ[i + |w| − 1]. If one of these two values is true, then i is tied to some subword w′. Otherwise, if both values are false, then i must be untied from all other subwords: since all subwords already evaluated are at least as long as w, they cannot cover any location in Γ[i, i + |w| − 1] without also covering Γ[i] or Γ[i + |w| − 1]. Thus, if Γ[i] and Γ[i + |w| − 1] are both false, we mark the occurrence i as untied for the subword w and update the vector Γ accordingly.
If Γ[i] is true, we can discard the occurrence i entirely, both for the subword w and for all its prefixes, which are represented by the ancestors of w in the tree T_{s_{1},s_{2}}. Thus the occurrence i will no longer be evaluated for any other subword.
If Γ[i] is false and Γ[i + |w| − 1] is true, we need to further evaluate this occurrence for some ancestors of w. In this case, one computes the longest prefix w′ of w such that Γ[i + |w′| − 1] is false and w′ is an irredundant common subword. The occurrence i is then inserted into the list \mathcal{L}_{w′}.
This step is performed by first computing the length d < |w| such that Γ[i + d − 1] is false and Γ[i + d] is true, and then retrieving the corresponding prefix w′ of w in the tree that spells an irredundant common subword of length at most d. We can compute d by means of a length table χ maintained in support of (or in place of) the boolean vector Γ. For each untied occurrence i of w, χ stores the values 1, 2, …, |w| in the locations i, i + 1, …, i + |w| − 1, similarly to the proof of Proposition 1. Using this auxiliary table we compute d for the location i under study as d = |w| − χ[i + |w| − 1].
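The length table χ and the computation of d can be sketched as follows. This is a hypothetical standalone fragment of ours: in the real algorithm χ is maintained alongside Γ during the greedy selection, and a nonzero χ entry doubles as Γ being true:

```python
def mark_untied(chi, i, length):
    """Record an untied occurrence of a length-`length` subword at
    position i: chi stores 1..|w| over [i, i + |w| - 1]."""
    for j in range(length):
        chi[i + j] = j + 1

def free_prefix_length(chi, i, length):
    """d = |w| - chi[i + |w| - 1]: the number of leading free locations
    of the occurrence at i, assuming its right end falls inside a
    region marked by a previously selected subword."""
    return length - chi[i + length - 1]
```

For example, after marking an untied occurrence of a length-3 subword at position 4, a new length-4 occurrence at position 2 has right end inside the marked region and d = 2, i.e., only its first two locations are free.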
Now, to select w′, the longest prefix of w with |w′| ≤ d, we employ an algorithm proposed by Kopelowitz and Lewenstein [28] for the weighted ancestor problem, where, in the case of a suffix tree, weights correspond to the lengths of the words spelled from the root to each node. In the weighted ancestor problem one preprocesses a weighted tree to support fast predecessor queries on the path from a query node to the root. That is, after linear preprocessing of a tree of height n, this algorithm locates any ancestor node w′ of weight less than d in time O(log log n). In our case, the maximum length of an irredundant subword is min{m, n}, thus we can find a suitable ancestor w′ of w in time O(log log min{m, n}), with O(m + n) preprocessing of the tree T_{s_{1},s_{2}}.
At the end of the process, if the subword w has at least one untied occurrence per sequence, we mark w as an underlying subword. Otherwise, all the occurrences of w that are not covered are sent to its ancestors, using the procedure above.
To analyze the overall complexity we need to bound how many times the same location i is evaluated. Suppose, for example, that i belongs to the list \mathcal{L}_{w} of the subword w. The location i is evaluated again for some \bar{w}, and inserted into the list \mathcal{L}_{\bar{w}}, only if Γ[i] is false and Γ[i + |w| − 1] is true. Note that the locations not already covered are in the range [i, i + |w| − d − 1], with d > 0. Then the subword \bar{w} is the longest prefix of w that is an irredundant common subword and that lies completely within the locations [i, i + |w| − d − 1]; however, \bar{w} may not cover the entire interval. Now, the occurrence i will be evaluated again only if there exists another subword w′ that overlaps \bar{w} and has higher priority than \bar{w}. The worst case occurs when w′ ends exactly at position i + |w| − d − 1 and overlaps \bar{w} by only one location. Since w′ must be evaluated before \bar{w}, we have |w′| ≥ |\bar{w}|. Thus the worst case is when the two subwords have about the same length; in this setting the length of the subword \bar{w} can be at most (|w| − d)/2. We can iterate this argument at most O(log |w|) times for the same position i. Therefore any location can be evaluated at most O(log min{m, n}) times. In conclusion, our approach requires O((m + n) log min{m, n} log log min{m, n}) time and O(m + n) space to discover the set of all underlying subwords \mathcal{U}_{s_{1},s_{2}}.
Extension to inversions and complements
In this section we discuss how to extend the algorithmic structure described above to accommodate inversion and complement matches as well.
A simple idea is to concatenate each sequence with its inverse and its complement, while keeping separate the occurrences coming from direct matches, inversions, and complements. In brief, we first define \hat{x} as the concatenation of a string x with its inverse, followed by its complement, in this exact order. Then we compute the irredundant common subwords \mathcal{I}_{s_{1},\hat{s_{2}}} on the sequences s_{1} and \hat{s_{2}}. We subsequently select the underlying subwords by ranking all the irredundant common subwords in the set \mathcal{I}_{s_{1},\hat{s_{2}}}. Using the same algorithm described above we compute the set \mathcal{U}_{s_{1},\hat{s_{2}}}, and then map each subword occurrence onto the reference sequence s_{1}. This also includes inversions and complements of s_{2} that are shared with s_{1}. In this way, we can store all the untied occurrences and consider all possible matches for each region of s_{1}.
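A minimal sketch of the construction of \hat{s_{2}}, assuming the DNA alphabet {A, C, G, T}; note that a practical implementation would likely also insert unique separator symbols between the three parts to prevent matches straddling the boundaries, which the plain concatenation below does not do:

```python
def extend_sequence(s2):
    """Build s2-hat: s2, then its inverse (reversal), then its
    complement, in this exact order. A match of s1 falling in each
    third is recorded as a direct, inverted, or complemented match
    respectively."""
    comp = str.maketrans("ACGT", "TGCA")
    return s2 + s2[::-1] + s2.translate(comp)
```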
In this framework, we choose to take into account all these symmetries, and thus the experiments presented will use this extended approach. We will also measure the contribution of inversions and complements to our similarity measure.
A distancelike measure based on underlying subwords
In the following we report the basic steps of our distance-like measure. Let us assume that we have computed \mathcal{U}_{s_{1},s_{2}} and the specular set \mathcal{U}_{s_{2},s_{1}}. For every subword w ∈ \mathcal{U}_{s_{1},s_{2}} we add the score h_{w}^{s_{1}} \sum_{i=1}^{|w|} i = h_{w}^{s_{1}} |w|(|w| + 1)/2 to UA(s_{1}, s_{2}), where h_{w}^{s_{1}} is the number of its untied occurrences in s_{1}, similarly to ACS [7]. Then we average UA(s_{1}, s_{2}) over the length of the first sequence, |s_{1}| = n, yielding
\mathit{\text{UA}}(s_{1},s_{2}) = \frac{\sum_{w \in \mathcal{U}_{s_{1},s_{2}}} h_{w}^{s_{1}} |w|(|w|+1)}{2n}.
This is a similarity score that is large when the two sequences are similar; therefore we take its inverse.
Moreover, for a fixed sequence s_{1}, this score can also grow with the length of s_{2}, since the probability of having a match in s_{1} increases with the length of s_{2}. For this reason we consider the measure log_{4}(|s_{2}|)/UA(s_{1}, s_{2}); we use a base-4 logarithm since DNA sequences have four bases. Another issue with the above formula is that it does not equal zero for s_{1} = s_{2}; thus we subtract the correction term log_{4}(|s_{1}|)/UA(s_{1}, s_{1}), which ensures that this condition is always satisfied. Since \mathcal{U}_{s_{1},s_{1}} contains only one subword, the sequence s_{1} itself, which trivially has only one untied occurrence in s_{1}, we obtain UA(s_{1}, s_{1}) = |s_{1}|(|s_{1}| + 1)/(2|s_{1}|) = (|s_{1}| + 1)/2. The following formulas accommodate all of these observations in a symmetrical distance-like measure d_{UA}(s_{1}, s_{2}) between the sequences s_{1} and s_{2}:
\overline{\mathit{\text{UA}}}(s_{1},s_{2}) = \frac{\log_{4}(|s_{2}|)}{\mathit{\text{UA}}(s_{1},s_{2})} - \frac{2\log_{4}(|s_{1}|)}{|s_{1}|+1},
d_{\mathit{\text{UA}}}(s_{1},s_{2}) = \frac{\overline{\mathit{\text{UA}}}(s_{1},s_{2}) + \overline{\mathit{\text{UA}}}(s_{2},s_{1})}{2}.
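The formulas above can be sketched numerically as follows. The encoding of an underlying-subword set as a list of (length, untied-count) pairs is our own simplification:

```python
import math

def ua_similarity(untied, n):
    """UA(s1, s2): sum of h * |w| * (|w| + 1) over underlying subwords,
    divided by 2n, where n = |s1| and `untied` lists (|w|, h) pairs
    (subword length, number of untied occurrences in s1)."""
    return sum(h * l * (l + 1) for l, h in untied) / (2 * n)

def ua_bar(untied, n, m):
    """log4(|s2|) / UA(s1, s2) minus the self-comparison correction
    term 2 * log4(|s1|) / (|s1| + 1), with n = |s1| and m = |s2|."""
    return math.log(m, 4) / ua_similarity(untied, n) - 2 * math.log(n, 4) / (n + 1)

def d_ua(untied_12, untied_21, n, m):
    """Symmetrized distance-like measure d_UA(s1, s2)."""
    return (ua_bar(untied_12, n, m) + ua_bar(untied_21, m, n)) / 2
```

As a sanity check, for s_{1} = s_{2} of length 16 the only underlying subword is the sequence itself with one untied occurrence, UA(s_{1}, s_{1}) = 17/2, and the distance is zero, as required.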
We can easily see that the correction term rapidly converges to zero as |s_{1}| increases. Moreover, we notice that d_{UA}(s_{1}, s_{2}) grows as the two sequences s_{1} and s_{2} diverge. From now on we will simply refer to the measure d_{UA}(s_{1}, s_{2}) as the Underlying Approach measure, or UA.