To summarize, we select the irredundant common subwords that best fit each region of *s*_{1} and *s*_{2}, employing a technique that we call the *Underlying Approach* or, in short, UA. This technique is based on a simple pipeline: we first select the irredundant common subwords and subsequently filter out the subwords that are not underlying. From a different perspective, we start from the smallest set of subwords that captures the matching statistics and remove the overlaps by applying our priority rule. In the following we show how to compute the irredundant common subwords and the matching statistics, and then we present an approach for selecting the underlying subwords among them. The general structure of the Underlying Approach (UA) is the following:

#### Discovery of the irredundant common subwords

In step (1) we construct the generalized suffix tree ${T}_{{s}_{1},{s}_{2}}$ of *s*_{1} and *s*_{2}. We recall that an occurrence of a subword is left-maximal (respectively, right-maximal) if it cannot be covered from the left (respectively, right) by some other common subword. The first step consists of a depth-first traversal of all nodes of ${T}_{{s}_{1},{s}_{2}}$, coloring each internal node with the colors of its leaves (each color corresponds to an input sequence). In this traversal, for each leaf *i* of ${T}_{{s}_{1},{s}_{2}}$, we capture the lowest ancestor of *i* having both colors *c*_{1} and *c*_{2}, say the node *w*. Then *w* is a common subword, and *i* is one of its right-maximal occurrences (in *s*_{1} or in *s*_{2}); we select all subwords having at least one right-maximal occurrence. The resulting set is linear in the size of the sequences, that is, *O*(*m* + *n*). It is only a superset of the irredundant common subwords, since the occurrences of these subwords may not be left-maximal.

In a second phase, we map the length of each right-maximal occurrence *i* into ${l}_{{s}_{1}}\left(i\right)$ and, using Proposition 1, we check which occurrences *i* have length greater than or equal to the length stored at location *i*−1 (for locations *i* ≥ 2). These occurrences are also left-maximal, since they cannot be covered by a subword appearing at position *i*−1. Finally, we retain all subwords that have at least one occurrence that is both right- and left-maximal, i.e., the set of irredundant common subwords ${\mathcal{I}}_{{s}_{1},{s}_{2}}$. Note that, by employing the above technique, we directly discover both the irredundant common subwords and the matching statistics ${l}_{{s}_{1}}\left(i\right)$.

The construction of the generalized suffix tree ${T}_{{s}_{1},{s}_{2}}$ and the subsequent extraction of the irredundant common subwords ${\mathcal{I}}_{{s}_{1},{s}_{2}}$ can be completed in time and space linear in the size of the sequences.
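The left-maximality check on the matching statistics can be illustrated with a small sketch. This is not the linear-time suffix-tree procedure described above: it assumes a naive quadratic computation of ${l}_{{s}_{1}}\left(i\right)$, uses 0-based positions, and only looks at occurrences in *s*_{1}. The function names `matching_statistics` and `left_maximal_candidates` are hypothetical.

```python
def matching_statistics(s1, s2):
    """Naive stand-in for the suffix-tree traversal: stats[i] is the length
    of the longest subword of s1 starting at position i that occurs in s2."""
    stats = []
    for i in range(len(s1)):
        k = 0
        # Extend the match one character at a time while it still occurs in s2.
        while i + k < len(s1) and s1[i:i + k + 1] in s2:
            k += 1
        stats.append(k)
    return stats


def left_maximal_candidates(s1, s2):
    """Keep the subwords whose occurrence at i is at least as long as the one
    stored at i - 1 (the check described in the text); position 0 has no
    predecessor and is kept whenever its match is non-empty."""
    stats = matching_statistics(s1, s2)
    selected = set()
    for i, length in enumerate(stats):
        if length == 0:
            continue
        if i == 0 or length >= stats[i - 1]:
            selected.add(s1[i:i + length])
    return stats, selected
```

For `s1 = "abcab"` and `s2 = "bcabc"` the statistics are `[3, 4, 3, 2, 1]`, and only the occurrences at positions 0 and 1 pass the check, giving `abc` and `bcab`.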

#### Selection of the underlying subwords

In this section we describe, given the set of the irredundant common subwords ${\mathcal{I}}_{{s}_{1},{s}_{2}}$, how to filter out the subwords that are not underlying, obtaining the set of underlying subwords ${\mathcal{U}}_{{s}_{1},{s}_{2}}$.

The extraction of underlying subwords takes as input the set ${\mathcal{I}}_{{s}_{1},{s}_{2}}$ and the tree ${T}_{{s}_{1},{s}_{2}}$ from the previous section. First we sort all subwords in ${\mathcal{I}}_{{s}_{1},{s}_{2}}$ according to the priority rule (step 2). Then, starting from the top subword, we iteratively analyze all subwords by checking their untied occurrences (step 3). If the subword passes a validity test we select it as underlying (step 4a); otherwise we move on to the next subword (step 4b). The two key steps of this algorithm are sorting the subwords (step 2) and checking for their untied occurrences (step 4a).

Step 2 is implemented as follows. For all subwords we retrieve their lengths and first occurrences in *s*_{1} from the tree ${T}_{{s}_{1},{s}_{2}}$, so that each subword is characterized by its length and its first occurrence. Since these are integers in the range [0,*n*], we can apply radix sort [27], first by length and then by occurrence. This step can be done in linear time.
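A two-pass bucket sort is one way to realize this radix sort. The sketch below assumes that the priority rule orders subwords by decreasing length and breaks ties by increasing first occurrence; the tie-breaking direction is an assumption, as are the names `bucket_pass` and `sort_by_priority`. Each subword is represented as a hypothetical `(length, first_occurrence)` pair.

```python
def bucket_pass(items, key, n, descending=False):
    """One stable bucket-sort pass over integer keys in the range [0, n]."""
    buckets = [[] for _ in range(n + 1)]
    for item in items:
        buckets[key(item)].append(item)
    ordered = reversed(buckets) if descending else buckets
    # Items keep their relative order inside each bucket, so the pass is stable.
    return [item for bucket in ordered for item in bucket]


def sort_by_priority(subwords, n):
    """Least-significant key first: first occurrence (ascending), then
    length (descending). Both passes take O(len(subwords) + n) time."""
    by_occurrence = bucket_pass(subwords, key=lambda t: t[1], n=n)
    return bucket_pass(by_occurrence, key=lambda t: t[0], n=n, descending=True)
```

Because the second pass is stable, subwords of equal length stay in the order produced by the first pass.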

In order to implement step 4a we define the vector Γ of *n* booleans, representing the locations of *s*_{1}. If Γ[*i*] is true, then the location *i* is covered by some untied occurrence. We also preprocess the input tree and add, for every node *v*, a link to its closest irredundant ancestor, say *prec*(*v*). This can be done by traversing the tree in preorder: during the visit of a node *v*, if *v* is not irredundant we transmit *prec*(*v*) to its children; otherwise, if *v* is irredundant, we transmit *v*. This preprocessing can be implemented in linear time and space.
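The preorder propagation of *prec* links might look like the following sketch. The tree encoding (a `children` dictionary keyed by node identifier) and the function name `closest_irredundant_ancestor` are assumptions made for illustration; an explicit stack replaces recursion so that deep trees do not hit the recursion limit.

```python
def closest_irredundant_ancestor(children, root, irredundant):
    """Return prec[v], the closest proper ancestor of v that is irredundant,
    or None if no such ancestor exists. Each node is visited once."""
    prec = {root: None}
    stack = [root]
    while stack:
        v = stack.pop()
        # An irredundant node passes itself down; otherwise it forwards
        # the link it received from its own parent.
        passed_down = v if v in irredundant else prec[v]
        for child in children.get(v, []):
            prec[child] = passed_down
            stack.append(child)
    return prec
```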

For each subword *w* in ${\mathcal{I}}_{{s}_{1},{s}_{2}}$ we consider the list ${\mathcal{L}}_{w}$ of occurrences to be checked. All lists ${\mathcal{L}}_{w}$ are initialized in the following way. Every leaf *v*, which represents a position *i*, sends its value *i* to the location list of its closest irredundant ancestor using the link *prec*(*v*). Again, this preprocessing takes linear time and space, since every position appears in exactly one location list. We will update these lists ${\mathcal{L}}_{w}$ only with the occurrences still to be checked, i.e., those that are not covered by some underlying subword already discovered. We start by analyzing the top subword *w*; in this case ${\mathcal{L}}_{w}$ consists of all the occurrences of *w*.
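As a rough illustration of this initialization, the sketch below distributes leaf positions to the lists of their closest irredundant ancestors. It assumes the *prec* links are already available as a dictionary and that leaves are given as a mapping from leaf identifier to position; both encodings and the name `init_occurrence_lists` are hypothetical.

```python
from collections import defaultdict


def init_occurrence_lists(leaf_position, prec):
    """Send each leaf position to the list of prec(leaf).
    Every position ends up in at most one list, so the total size is linear."""
    lists = defaultdict(list)
    for leaf, position in leaf_position.items():
        target = prec.get(leaf)
        if target is not None:  # leaves with no irredundant ancestor are skipped
            lists[target].append(position)
    return dict(lists)
```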

For each occurrence *i* of *w* we need to check only its first and last location in the vector Γ; i.e., we need to check the locations Γ[*i*] and Γ[*i* + |*w*|−1]. If one of these two values is set to true, then *i* is tied by some subword *w*′. Otherwise, if both values are set to false, then *i* must be untied from all other subwords. Since all subwords already evaluated are not shorter than *w*, they cannot cover some locations in Γ[*i*,*i* + |*w*|−1] without also covering Γ[*i*] or Γ[*i* + |*w*|−1]. Thus, if Γ[*i*] and Γ[*i* + |*w*|−1] are both set to false, we mark this occurrence *i* as untied for the subword *w* and update the vector Γ accordingly.

If Γ[*i*] is true, we can completely discard the occurrence *i*, for the subword *w* and also for all its prefixes, which are represented by the ancestors of *w* in the tree ${T}_{{s}_{1},{s}_{2}}$. Thus the occurrence *i* will no longer be evaluated for any other subword.

If Γ[*i*] is false and Γ[*i* + |*w*|−1] is true, we need to further evaluate this occurrence for some ancestors of *w*. In this case, one can compute the longest prefix *w*′ of *w* such that Γ[*i* + |*w*′|−1] is set to false and *w*′ is an irredundant common subword. Then the occurrence *i* is inserted into the list ${\mathcal{L}}_{{w}^{\prime}}$.
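The three cases for an occurrence can be summarized in a short sketch. It assumes 0-based positions and a plain Python list of booleans for Γ; the names `classify_occurrence` and `mark_untied`, and the string labels returned, are illustrative only.

```python
def classify_occurrence(gamma, i, wlen):
    """Inspect only the two endpoints of the occurrence [i, i + wlen - 1]."""
    last = i + wlen - 1
    if gamma[i]:
        return "discard"   # first location covered: drop i for w and its prefixes
    if gamma[last]:
        return "shorten"   # only the tail is covered: retry with a prefix of w
    return "untied"        # both endpoints free: the whole occurrence is free


def mark_untied(gamma, i, wlen):
    """Record an untied occurrence by covering all of its locations."""
    for k in range(i, i + wlen):
        gamma[k] = True
```

The endpoint test is sufficient only because subwords are processed in order of non-increasing length, as argued in the text.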

This step is performed by first computing the length *d* < |*w*| such that Γ[*i* + *d*−1] is false and Γ[*i* + *d*] is true, and then retrieving the corresponding prefix *w*′ of *w* in the tree that spells an irredundant common subword with length equal to or shorter than *d*. We can compute *d* by means of a *length table* *χ* in support (or in place) of the boolean vector Γ. For each untied occurrence *i* of *w*, *χ* stores the values [1,2,…,|*w*|] in the locations [*i*,*i* + 1,…,*i* + |*w*|−1], similarly to the proof of Proposition 1. Using this auxiliary table we can compute the value of *d* for the location under study *i* as *d* = |*w*|−*χ*[*i* + |*w*|−1].
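The role of the length table can be made concrete with a minimal sketch, assuming 0-based positions and a list *χ* initialized to zeros so that a zero entry plays the role of a false entry in Γ. The function names `mark_untied_lengths` and `free_prefix_length` are hypothetical.

```python
def mark_untied_lengths(chi, i, wlen):
    """Store 1, 2, ..., wlen in chi[i .. i + wlen - 1] for an untied occurrence."""
    for k in range(wlen):
        chi[i + k] = k + 1


def free_prefix_length(chi, i, wlen):
    """d = |w| - chi[i + |w| - 1]: the number of uncovered locations at the start
    of the occurrence, valid when chi[i] == 0 and the last location is covered."""
    return wlen - chi[i + wlen - 1]
```

For example, after an untied occurrence of length 4 is recorded at position 5, an occurrence of length 4 at position 3 ends at location 6, where *χ* holds 2, so *d* = 4 − 2 = 2: locations 3 and 4 are free and location 5 is the first covered one.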

Now, to select *w*′, the longest prefix of *w* with |*w*′| ≤ *d*, we employ an algorithm proposed by Kopelowitz and Lewenstein [28] for solving the *weighted ancestor problem*, where, in the case of a suffix tree, the weight of a node is the length of the word spelled on the path from the root to that node. In the weighted ancestor problem one preprocesses a weighted tree to support fast predecessor queries on the path from a query node to the root. That is, with a linear preprocessing on a tree of height *n*, using the above algorithm it is possible to locate any ancestor node *w*′ that has a weight of at most *d* in time *O*(log log *n*). In our case, the maximum length for an irredundant subword is min{*m*,*n*}; thus we can find a suitable ancestor *w*′ of *w* in time *O*(log log min{*m*,*n*}), with *O*(*m* + *n*) preprocessing of the tree ${T}_{{s}_{1},{s}_{2}}$.
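To make the query itself concrete, the sketch below answers it by simply walking up parent links. This is a plain substitute, not the Kopelowitz and Lewenstein data structure, and it takes time proportional to the number of ancestors rather than *O*(log log *n*). The dictionaries `parent` and `depth` (string depth of each node) and the function name are assumptions.

```python
def longest_irredundant_prefix(node, d, parent, depth, irredundant):
    """Return the deepest node on the path from `node` to the root whose
    string depth is at most d and which is irredundant, or None."""
    v = node
    while v is not None:
        if depth[v] <= d and v in irredundant:
            return v
        v = parent[v]  # string depth strictly decreases towards the root
    return None
```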

At the end of the process, if the subword *w* has at least one untied occurrence per sequence, then we mark *w* as an underlying subword. Otherwise, all the occurrences of *w* that are not covered are sent to its ancestors, using the previous procedure.

To analyze the overall complexity we need to count how many times the same location *i* is evaluated. Suppose, for example, that *i* belongs to the list ${\mathcal{L}}_{w}$ of the subword *w*. The location *i* is evaluated again for some $\bar{w}$, and inserted into the list ${\mathcal{L}}_{\bar{w}}$, only if Γ[*i*] is false and Γ[*i* + |*w*|−1] is true. Note that the locations not already covered are in the range [*i*,*i* + |*w*|−*d*−1], with *d* > 0. Then the subword $\bar{w}$ is the longest prefix of *w* that is an irredundant common subword and that lies completely within the locations [*i*,*i* + |*w*|−*d*−1]; however, $\bar{w}$ may not cover the entire interval. Now, the occurrence *i* will be evaluated again only if there exists another subword *w*′ that overlaps with $\bar{w}$ and that has a higher priority than $\bar{w}$. The worst case is when *w*′ ends exactly at position *i* + |*w*|−*d*−1 and overlaps with $\bar{w}$ by only one location. Since *w*′ must be evaluated before $\bar{w}$, we have $\left|{w}^{\prime}\right|\ge \left|\bar{w}\right|$. Thus the worst case is when the two subwords have about the same length. In this setting the length of the subword $\bar{w}$ can be at most (|*w*|−*d*)/2. We can iterate this argument at most *O*(log |*w*|) times for the same position *i*. Therefore any location can be evaluated at most *O*(log min{*m*,*n*}) times. In conclusion, our approach requires *O*((*m* + *n*) log min{*m*,*n*} log log min{*m*,*n*}) time and *O*(*m* + *n*) space to discover the set of all underlying subwords ${\mathcal{U}}_{{s}_{1},{s}_{2}}$.
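The halving argument can be written out as a short derivation. The notation ${\ell}_{k}$, for the length of the subword under which location *i* is evaluated for the *k*-th time, is introduced here only for illustration. With ${\ell}_{0}=\left|w\right|$ and the bound (|*w*|−*d*)/2 applied at every step:

$$
{\ell}_{k+1}\le \frac{{\ell}_{k}-d}{2}<\frac{{\ell}_{k}}{2}
\quad\Longrightarrow\quad
1\le {\ell}_{k}<\frac{\left|w\right|}{{2}^{k}}
\quad\Longrightarrow\quad
k<{\log}_{2}\left|w\right|\le {\log}_{2}\min\{m,n\}.
$$

Each evaluation costs one weighted ancestor query, so the total time is bounded by

$$
\underbrace{O(m+n)}_{\text{locations}}\cdot \underbrace{O(\log \min\{m,n\})}_{\text{evaluations per location}}\cdot \underbrace{O(\log \log \min\{m,n\})}_{\text{cost per query}}.
$$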