Finding maximal exact matches in graphs

Background We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa$$\end{document}κ (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa$$\end{document}κ-MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023) even on acyclic graphs. Results In this paper we show an \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(n\cdot L \cdot d^{L-1} + m + M_{\kappa ,L})$$\end{document}O(n·L·dL-1+m+Mκ,L)-time algorithm finding all \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa$$\end{document}κ-MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$m = |Q|$$\end{document}m=|Q|, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M_{\kappa ,L}$$\end{document}Mκ,L is the number of output MEMs. We use this algorithm to develop a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa$$\end{document}κ-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(nH^2 + m + M_\kappa )$$\end{document}O(nH2+m+Mκ), where H is the maximum number of nodes in a block, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M_\kappa$$\end{document}Mκ is the total number of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\kappa$$\end{document}κ-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection. Conclusions We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available in https://github.com/algbio/efg-mems.


Introduction
Sequence alignment has been studied since the 1970s [34,46] and is a fundamental problem of computational molecular biology.Solving the classical problems of longest common subsequence (LCS) and edit distance (ED) between two strings takes quadratic time with simple dynamic programs, and it was recently proven that no strongly subquadratic-time algorithms exist conditioned on the Strong Exponential Time Hypothesis (SETH) [5,3].To overcome this hardness, researchers have used heuristics such as co-linear chaining [1]: by taking (short) matches between the input strings, known as anchors, we can take an ordered subset of these anchors and chain them together into an alignment.Furthermore, when using maximal exact matches (MEMs) as anchors, different co-linear chaining formulations capture both LCS [31] and ED [24] in near-linear time.MEMs are also used in popular seed-and-extend read aligners [25,32].In fact, practical tools limit the number of MEMs by considering only κ-MEMs (MEMs of length at least κ) [36,45].
Extending alignment between sequences to sequence-to-graph alignment is an emerging and central challenge of computational pangenomics [10], as labeled graphs are a popular representation of pangenomes used in recent bioinformatics tools [27,8,37,26].We assume that a labeled graph G = (V, E, ℓ) (ℓ : V → Σ + ) is the reference pangenome of interest.Unfortunately, even finding exact occurrences of a given pattern in G does not admit strongly subquadratic-time solutions under SETH [15], and furthermore, a graph cannot be indexed in polynomial time to answer strongly subquadratic-time pattern matching queries [14].To circumvent this difficulty, research efforts have concentrated on finding parameterized solutions to (exact) pattern matching in labeled graphs [7,12,11,41].Moreover, the use of MEMs and co-linear chaining has also been extended to graphs [27,8,37,26].
In this paper, we study the problem of efficiently finding MEMs between a query string Q and a labeled graph G, where we extend the MEM definition to capture any maximal match between Q and the string spelled by some path of G.More precisely, our contributions are as follows: • In Section 3.1, we adapt the MEM finding algorithm between two strings of Belazzougui et al. [4] to find all κ-node-MEMs between Q and G = (V, E, ℓ) in O(m + n + M κ ) time, where m = |Q|, n = v∈V |ℓ(v)| is the cumulative length of the node labels, and M κ is the number of κ-node-MEMs (of length at least κ and between the node labels and Q).
• Next, in Section 3.2, we extend the previous algorithm to find all κ-MEMs spanning exactly L nodes of G in time O(m ), where d is the maximum degree of any node v ∈ V and M κ,L are the κ-MEMs of interest.Note that MEMs spanning less than L nodes can occur multiple times in paths spanning exactly L nodes, and our contribution is to introduce an efficient technique to filter out these MEMs.
• In Section 4, we obtain the following results: -We study κ-MEMs in indexable Elastic Founder Graphs (EFGs) [16], a subclass of labeled acyclic graphs admitting a poly-time indexing scheme for linear-time pattern matching.Given an indexable EFG G of height H (the maximum number of nodes in a graph block), we develop a suffix-tree-based solution to find all κ-MEMs spanning more than 3 nodes in G in O(nH 2 + m + M κ,4+ ) time, where M κ,4+ are the number of output MEMs.
-Combined with the above results for L = 1, 2, 3, we can find κ-MEMs of an indexable EFG G in O(nH 2 + m + M κ ) time.
-We note that the previous results easily generalize to the batched query setting: by substituting Q with the concatenation of different query strings Q 1 , . . ., Q t of total length m, we compute all κ-MEMs between any query string and the graph with the same stated running time.
• Finally, in Section 5 we provide some preliminary experimental results on finding MEMs from a collection of strains of covid19.We use the bidirectional r-index [2] as the underlying machinery.On the one hand, we build the r-index of the concatenation of the strains and find all m κ κ-MEMs.On the other hand, we build an indexable EFG of the strains and find an upper bound on all M κ κ-MEMs in this case.Our results indicate that M κ is an order of magnitude smaller than m κ , thus confirming that graph MEMs compactly represent all MEMs.
.n] is a sequence of symbols from Σ, that is, T ∈ Σ n where Σ n denotes the set of strings of length n over Σ.The length of a string T is denoted |T | and the empty string ε is the string of length 0. In this paper, we assume that σ is always smaller or equal to the length of the strings we are working with.The concatenation of strings T 1 and T 2 is denoted as We denote by T [x..y] the substring of T made of the concatenation of its characters from the x-th to the y-th, both inclusive; if x = y then we also use T [x] and if y < x then T [x..y] = ε.The reverse of a string T [1.
.n], denoted by T , is the string T read from right to left, that is, We denote by Σ * the set of finite strings over Σ, and also .y] for some interval [x..y]; in this case, we say that [x..y] is a match of Q in T .Moreover, we study matches between substrings of Q and T : a maximal exact match (MEM) between Q and and the match cannot be extended to the left nor to the right, that is, In this case, we say that the substring The lexicographic order of two strings T 1 and T 2 is naturally defined by the total order ≤ of the alphabet: . We avoid the prefix-case by adding an end marker $ ∈ Σ to the strings and considering $ to be the lexicographically smaller than any character in Σ. Labeled graphs.Let G = (V, E, ℓ) be a labeled graph with V being the set of nodes, E being the set of edges, and ℓ : V → Σ + being a function giving a label to each node.A length-k path

AGCTA
of P is the concatenation of the labels of the nodes in the path.For a node v and a path P we use • to denote its string length, that is, v = |ℓ(v)| and P = |ℓ(P )|.Let Q be a query string.We say that Q occurs in G if Q occurs in ℓ(P ) for some path P .In this case, the exact match of Q in G can be identified by the triple (i, with 1 ≤ i ≤ v 1 and 1 ≤ j ≤ v k , and we call such triple a substring of G. Given a substring (i, P, j) of G, we define its left-extension lext(i, P, j) as the singleton {ℓ(v 1 )[i − 1]} if i > 1 and otherwise as the set of characters {ℓ(u)[ u ] | (u, v 1 ) ∈ E}.Symmetrically, the right-extension rext(i, P, j) is {ℓ(v k )[j + 1]} if j < v k and otherwise it is {ℓ(v) [1] | (v k , v) ∈ E}.Note that the left (right) extension can be equal to the empty set ∅, if the start (end) node of P does not have incoming (outgoing) edges.Figure 1 illustrates these concepts.

Basic tools.
A trie or keyword tree [13] of a set of strings is an ordinal tree where the outgoing edges of each node are labeled by distinct symbols (the order of the children follows the order of the alphabet) and there is a unique root-to-leaf path spelling each string in the set; the shared part of two rootto-leaf paths spells the longest common prefix of the corresponding strings.In a compact trie [22], the maximal non-branching paths of a trie become edges labeled with the concatenation of labels on the path.The suffix tree of T ∈ Σ * is the compact trie of all suffixes of the string T ′ = T $.In this case, the edge labels are substrings of T and can be represented in constant space as an interval of T .Such tree uses linear space and can be constructed in linear time, assuming that σ ≤ |T |, so that when reading the root-to-leaf paths from left to right, the suffixes are listed in their lexicographic order [44,17].As such, the order spelled by the leaves of the suffix tree form the suffix array SA T is the i-th smallest suffix in lexicographic order.When applied to a string T , the Burrows-Wheeler transform (BWT) [6] yields another string BWT T such that BWT T Let Q [1..m] be a query string.If Q occurs in T , then the locus or implicit node of Q in the suffix tree of T is (v, k) such that Q = XY , where X is the path spelled from the root to the parent of v and Y is the prefix of length k of the edge from the parent of v to v. The leaves in the subtree rooted at v, also known as the leaves covered by v, correspond to all the suffixes sharing the common prefix Q.Such leaves form an interval in the SA and equivalently in the BWT.Let aX and X be the strings spelled from the root of the suffix tree to nodes v and w, respectively.Then one can store a suffix link from v to w. Suffix links from implicit nodes are called implicit suffix links.
The bidirectional BWT [43] is a compact BWT-based index capable of solving the MEM finding problem in linear time [4].The algorithm simulates the traversal of the corresponding suffix trees to find maximal occurrences in both strings: in the first step, it locates the suffix tree nodes (intervals in the BWTs) corresponding to the maximal matches, that is, the MEM strings, and then it uses a cross-product algorithm to extract each MEM from the BWT intervals.
Let B [1..n] be a bitvector, that is, a string over the alphabet {0, 1}.There is a data structure that can be constructed in time O(n) which answers r = rank(B, i) and j = select(B, r) in constant time, where the former operation returns the number of 1s in B [1..i] and the latter returns the position j ≤ i of the r-th 1 in B [9,23].
Let D [1..n] be an array of integers.There is a range minimum query data structure that can be constructed in O(n) time which answers RMQ D (i, j) in constant time [19], where RMQ D (i, j) returns an index k such that D[k] is the minimum value in the subarray D[i..j].We will use the following lemma that exploits range queries recursively.

Lemma 1. Let D[1..n] be an array of integers. One can preprocess D in O(n) time so that given a threshold ∆, one can list all elements of D such that D[i] ≤
∆ in linear time in the size of the output.
Proof.We can build the range minimum query data structure on D. Consider a recursive algorithm analogous to the one in [33] which starts with the algorithm stops as no element in the range is at least ∆.Otherwise, the algorithm reports k and recursively continues with RMQ D (1, k − 1) and RMQ D (k + 1, n).Note that each recursive call performs exactly one RMQ operation: if an element is reported the RMQ is charged to this element, otherwise it is charged to its parent (in the recursion tree), and thus the number of RMQ operations is linear in the output size.

Finding MEMs in Labeled Graphs
Let us consider the problem of finding all maximal exact matches (MEMs) between a labeled graph G = (V, E, ℓ), with ℓ : V → Σ + , and a query string Definition 1 (MEM between a pattern and a graph).Let G = (V, E, ℓ) be a labeled graph, with ℓ : V → Σ + , let Q ∈ Σ + be a query string, and let κ > 0 be a threshold.Given a match (i, P, j) of Q[x..y] in G, we say that the pair ([x..y], (i, P, j)) is left-maximal ( right-maximal) if it cannot be extended to the left (right, respectively) in both Q and G, that is, We call ([x..y], (i, P, j)) a κ-MEM iff LeftMax ∨ |lext(i, P, j)| ≥ 2, RightMax ∨ |rext(i, P, j)| ≥ 2, and y − x + 1 ≥ κ, meaning that it is of length at least κ, it is left-maximal or its left (graph) extension is not a singleton, and it is right-maximal or its right (graph) extension is not a singleton.
For example, with Q = CACCGTAT, κ = 0, v being the first underlined node of Figure 1, and u being the second in-neighbor of v, then ([1..7], (5, uv, 6)) is a MEM since it is left and right maximal.Note that pair ([2..7], (1, v, 6)) is also a MEM since it is right-maximal, and the left extension of (1, v, 6) is not a singleton (lext(v) = {A, C}): this match is not left-maximal but our definition includes it as there are at least two different characters to the left.
We use this particular extension of MEMs to graphs-with the additional conditions on non-singletons lext and rext-as it captures all MEMs between Q and ℓ(P ), where P is a source-to-sink path in G.Moreover, this MEM formulation (with κ = 1) captures LCS through co-linear chaining, whereas avoiding the additional conditions would fail [38].
The rest of this section is structured as follows.In Section 3.1, we show how to adapt the MEM finding algorithm of Belazzougui et al. [4] for the case of node MEMs, which ignore the singleton conditions of Definition 1.Then, in Section 3.2, we show how to further generalize this approach to report all κ-MEMs spanning exactly L nodes.

MEMs in Node Labels
We say that a match ([x..y], (i, v, j)) is a node MEM if it is left and right maximal w.r.t.ℓ(P ) only in the string sense.In other words, a node MEM is a (string) MEM between Q and ℓ(v) (especially in the case x = 1 or y = ℓ(v)).For this, we consider the text where 0 / ∈ Σ is used as a delimiter to prevent MEMs spanning more than a node label.
Running the MEM finding algorithm of Belazzougui et al. [4,Theorem 7.10] on Q and T nodes will retrieve exactly the node MEMs we are looking for.Given such a MEM (x 1 , x 2 , ℓ), to transform the coordinates of T nodes [x 2 ..x 2 + ℓ − 1] into the corresponding graph substring (i, P, j) we augment the index with a bitvector B marking the locations of 0s of T nodes , so that r = rank(B, x 2 ) identifies the corresponding node of G, i = x 2 − select(B, r) and j = i + ℓ − 1.
Lemma 2. Given a labeled graph G = (V, E, ℓ), with ℓ : V → Σ + , a query string Q, and a threshold κ > 0, we can compute all node MEMs of length at least κ between Q and G in time O(n + m + N κ ), where n is the total length of node labels, m = |Q|, and N κ is the number of output MEMs.
Proof.The claim follows by applying the MEM finding algorithm of Belazzougui et al. [4,Theorem 7.10] on T nodes and Q, and using B to extract the corresponding graph matches.We include a brief explanation of the original algorithm (following the later adaptation of [28,Algorithm 11.3]) as we will modify this algorithm later.In our context, the algorithm takes as inputs the bidirectional BWT index on T nodes $ and the bidirectional BWT index on Q$, and produces MEM strings 1 times: the suffixes of T nodes having Q ′ as a prefix have lexicographic ranks between i G and j G , and the prefixes of T nodes having Q ′ (the reverse of Q ′ ) as a suffix have co-lexicographic (lexicographic of the reverse) ranks between i ′ G and j ′ G .Analogously, the last two BWT intervals are such that a prefix have lexicographic ranks between i Q and j Q , and the prefixes of Q having Q ′ as a suffix have co-lexicographic ranks between i ′ Q and j ′ Q .For each of these four BWT intervals reported, the algorithm runs a cross-product routine outputting all MEMs whose MEM string is Q ′ .Globally, the algorithm is linear in both the input and output size, since the exploration of MEM strings Q ′ takes linear time in the size of the input strings, and since in total the cross-product routine runs in linear time in the number of BWT intervals considered as well as in the number of output MEMs [4, Theorem 7.10].To see the linearity with respect to the input strings, let us study how the MEM strings Q ′ are explored.The algorithm starts with intervals covering all suffixes/prefixes corresponding to a match of the empty string.It then executes recursively, so we can consider a generic step with intervals corresponding to a match of the string Q ′ .Moreover, the algorithm maintains the invariant that the current match Q ′ is right-maximal.In the recursive step, the algorithm first checks whether the match is left-maximal, in which case it reports the corresponding MEMs using the aforementioned cross-product algorithm.It then extends the match to the left with every possible character extension.For each such extension aQ ′ , it checks whether the extension contains a right-maximal match: if this is not the case the suffix tree of T nodes #Q (# ∈ Σ) does not have an internal node corresponding to aQ ′ , as all suffixes starting with aQ ′ continue with the same symbol, otherwise (the extension contains a right-maximal match) the suffix tree has an internal node corresponding to aQ ′ .This exploration is bounded by the number of implicit Weiner links in the suffix tree of T nodes #Q, which is linear in the input length [4, Observation 1].

MEMs Spanning Exactly L Nodes
Given a threshold κ, we want to find all κ-MEMs ([x..y], (i, P, j)) spanning exactly L nodes in G, that is, |P | = L.Note that the MEMs obtained for L = 1 are a subset of the ones obtained in Lemma 2: for a node MEM ([x..y], (i, v, j)) it might hold that i = 1 and {Q As per Definition 1, MEMs cannot be recognized without looking at the context of the paths in G (sets lext and rext).With this in mind, we consider the text where left(u) = c when lext(u) = {c} and otherwise left(u) = #, right(u) = d when rext(u) = {d} and otherwise right(u) = #, 0 = #, 0, # / ∈ Σ, and We have added the unique left-and right-extension symbols c and d to avoid reporting exact matches that can potentially be extended to longer paths.When these extensions are not unique (or empty), one can safely report a MEM, since there is a path diverting with a symbol different from that of the pattern (or the path cannot be extended further).With these left-and right-extension symbols, it suffices to modify the MEM finding algorithm of Section 3.1 to use some extra information regarding the starting position of each suffix inside string the ℓ(P ), as explained next.
To avoid reporting MEMs spanning less than L nodes (only if L > 1), we use an array We also preprocess, in linear time, a bitvector B marking the locations of 0s of T L so that we can map in constant time a position i in T L to the r-th path appended to T L for r = rank(B, i).
The only modification of [4,Theorem 7.10] required to only output MEMs spanning exactly L nodes (and only if L > 1) is to change its very last step when considering a MEM candidate Q ′ with |Q ′ | ≥ κ.Namely, the cross-product routine loops over all characters a, b, c, d where the first two are the intervals in the bidirectional BWT on T L corresponding to aQ ′ b and the latter two are the intervals in the bidirectional BWT on Q corresponding to cQ ′ d.After that, it outputs a triple (k, It suffices to modify the first iteration using Lemma 1 to loop only over k ∈ Our claims are that the running time stays linear in the input and output size on constant-size alphabet, and that only MEMs spanning exactly L nodes are output.The latter claim follows directly on how array D is defined and used with Lemma 1.For the former claim, the cross-product part of the original algorithm is linear in the output size (also on non-constant-size alphabet) since for each combination of left-and right-extension considered, the work can be charged to the output.In our case, due to the use of Lemma 1, some combinations may lead to empty outputs introducing an alphabet-factor (constant) multiplier on the input length.

Remark 1. Note that the algorithm in Lemma 3 works in time
is the total label length of G and d is the maximum in-or out-degree of a node.Indeed, T L corresponds to the concatenation of length-L paths of G: the number of paths containing label ℓ(v)

MEMs in Elastic Founder Graphs
The approach of Section 3.2 is exponential on L, so we can only use it for constant L if aiming for a poly-time MEM finding routine.For general labeled graphs, this may be the best achievable, as we cannot find MEMs with a threshold κ between a pattern Q and G in truly sub-quadratic time unless the Orthogonal Vectors Hypothesis (OVH) is false [15], and exponential-time indexing is required for truly sub-quadratic MEM finding with threshold κ unless OVH is false [14]: finding MEMs between Q and G finds matches of Q in G, and if κ = |Q| MEM finding is exactly equivalent to matching a pattern in a labeled graph.To have a poly-time indexing of G that can solve MEM finding in truly sub-quadratic time, it is necessary to constrain the family of graphs in question.Therefore, we now focus on indexable Elastic Founder Graphs (indexable EFGs), that are a subclass of labeled directed acyclic graphs (labeled DAGs) having the feature that they support poly-time indexing for linear-time queries [16].We will show that the same techniques used to query if Q appears in indexable EFG G can be extended to solve MEM finding on G with arbitrary length threshold κ.Definition 2 (Elastic Founder Graph [16]).Consider a block graph G = (V, E, ℓ), where ℓ : We say that a block graph is an indexable Elastic Founder Graph (indexable EFG) if the semi-repeat-free property holds: for each v ∈ V i , ℓ(v) occurs in G only as prefix of paths starting with some w ∈ V i .
Note that the semi-repeat-free property allows a node label to be prefix of other node labels in the same block, whereas it forbids them to appear as a proper suffix of other node labels nor anywhere else in the graph.Indexable EFGs can be obtained from a set of aligned sequences, in a way such that the resulting indexable EFG spells the sequences but also their recombination: for a gapless alignment, we can build in time linear to the size of the alignment an optimal indexable EFG with minimum height H of a block, where the height of block V i is defined as |V i |, solution generalized to the case with gaps by using an alternative height definition [40].
Let us now consider MEM finding with threshold κ on an indexable EFG G = (V, E, ℓ).We can use the general κ-MEM finding algorithm of Lemma 3 between Q and G spanning exactly L nodes, with L = 1, 2, 3; then, we find 1.If we cannot continue with 0, Q[1..y] is part of some MEM between Q and G spanning at most 3 nodes, so we ignore it, take the suffix link of p and consider matching Q [2..y] in G.
2. If we can continue with 0 and the occurrences of Q [1..y] span at most two nodes in G, then we also take the suffix link of p and consider matching Q [2..y].Thanks to the semi-repeat-free property, we can check this condition by retrieving any leaf in the subtree rooted at node p 0 , reached by reading 0 from p.
By repeating the suffix walk and tentative match of case 3 until we cannot read 0 from the failing node, we find the maximal prefix Q[1.
Then ([1..y], (i, ) is a MEM between Q and G for all (i, u 1 ) ∈ U 1 and (u, u L+1 , j) ∈ E L , and also ([1..y], (i, is a MEM for all (i, u 1 ) ∈ U 1 , if u L exists: these MEMs satisfy conditions (i), (ii), and (iii), and We are not missing any MEM satisfying conditions (i), (ii), and (iii): due to the semi-repeat-free property, any MEM ([x..y], (i ′ , P ′ , j ′ )) with x < x < x′ spanning more than 3 nodes shares substring ℓ(u k )ℓ(u k+1 ) with the previously computed MEM, for some k ∈ [2..L − 3], and is such that x′ < y since we assume (iii) to hold; the algorithm would have matched Q[x..y] with case 3 in the previous iteration, leading to a contradiction.The time globally spent reading Q is still O(|Q|), because each character of Q is considered at most twice.
Finally, we are ready to describe how to compute all remaining MEMs ([x..y], (i, P, j)) between Q and indexable EFG G spanning at least 4 nodes, that is, MEMs such that condition (i) holds and at least one of (ii) and (iii) do not: it is easy to see that Q[x..y] must be contained in the MEMs that we have already computed; also, since they span at least 4 nodes their matches must involve some of nodes u 2 , . . ., u L−1 of MEMs satisfying (i), (ii), and (iii).Indeed, whenever case 3 holds and we decompose Q We can gather all the elements of U RT during each descending walk in the suffix tree of T ′ 3 , since they correspond to the leaves of subtrees of branching nodes in the tentative match of Q[x..y].Analogously, we can find set We can compute U LT by analyzing the leaves of subtrees of branching nodes in the walk in and U RT are a compact representation of all MEMs spanning at least 4 nodes and involving (any substring of) Q[x..y]: a cross-product-like algorithm that matches elements of U 1 or L with elements of u L , E L , or U RT , joined by the relevant part of u 2 • • • u L−1 , can explicitly output the MEMs spanning more than 3 nodes in linear time with respect to the size of the output, by exploiting the fact that U LT and U RT are computed and ordered block by block.Theorem 1.Let alphabet Σ be of constant size, and let G = (V, E, ℓ) be an indexable Elastic Founder Graph of height H, that is, the maximum number of nodes in a block of G is H.An encoding of MEMs between query string Q ∈ Σ m and G with arbitrary length threshold κ can be reported in time O(nH 2 +m+M κ ), where n = v∈V v and M κ is the number of MEMs of interest.
Proof.We can apply the algorithm of Lemma 3 to find κ-MEMs spanning L nodes, with Let M κ,4+ be the number of MEMs satisfying threshold κ and spanning at least 4 nodes in G.The suffix tree of T ′ 3 can be constructed in time O(|T ′ 3 |) and the described modification of a descending suffix walk on Q takes constant amortized time per step, assuming constant-size alphabet.The time spent gathering and U RT , forming an encoding of the MEMs involving Q[x..y], can be charged to M κ,4+ because each element of U 1 , E L , U LT , and U RT corresponds to one or more MEMs, that could be retrieved in an explicit form with a cross-product-like procedure.Indeed: for U 1 we can retrieve all leaves of the subtree rooted at p of the suffix tree of T ′ 3 ; for E L and U RT , we can do the same for node r reached by the last tentative match of Q[y + 1..], and for branching nodes reached during every tentative match; for U LT , using a compact trie and blind search [18] in the representation of each T u allows to compare only the branching symbols.Finally, it is easy to see that in case 3, after we decomposed |Q[x..y]| as αℓ(u 2 ) • • • ℓ(u L−1 )β as in Observation 1, we know the length of strings α, ℓ(u 2 ), . . ., ℓ(u L−1 ), and β, so we can postpone the computation of sets U 1 , E L , U LT , and U RT and avoid computing MEMs of length smaller than κ.Thus, finding an encoding of all MEMs between Q and G with threshold κ and spanning more than 3 nodes takes O(|Q| The stated time complexity is reached due to the fact that |T Proof. where $ is a unique symbol not occurring in the queries nor in the graph.No MEM can span over such unique separator and hence the MEMs between graph G and concatenation Q are the same as those between G and each Q i .It is thus sufficient to feed concatenation Q as input to the algorithms and project each resulting MEM to the corresponding query sequence.Proof.Lemma 2 does not need any modification as the node labels are automatically substrings of T .Same applies for the edge labels, but for longer paths we need to make sure we do not create combinations not supported by T .This can be accomplished with the help of the r-index: with the claimed time and space one can build the run-length encoded BWT of T [35] and the associated data structures to form the counting version of the r-index that supports backward step in O(log log n) time [20].As we concatenate paths consisting of L nodes for MEM finding in Lemma 3, we can first search them using the r-index, and only include them if they occur in T .MEMs spanning more than 3 nodes in Theorem 1 and Corollary 1 can be searched afterwards with the r-index to filter out those MEMs not occurring in T ; these MEMs cannot mutually overlap each other in Q by more than one full node label, so the running time of the verification can be charged on the size of the elastic founder graph.

Experiments
The benefit of Corollary 2 over the mere use of r-index for MEM finding [42] is that a MEM can occur many times in a repetitive collection while the occurrences starting at the same column of a MSA of the collection can be represented by a small number of paths in the indexable Elastic Founder Graph.
To test this hypothesis, we implemented the MEM finding algorithm using the bidirectional r-index [2].The code is available at https://github.com/algbio/br-index-mems.The implementation works both for MEM finding between two sequences and between a sequence and a graph.In the case of a graph, the implementation covers the algorithms of Sections 3.1 and 3.2 up to paths of length 3 nodes.If the input graph is an indexable EFG and κ is at most the length of the shortest string spelled by an edge, the implementation outputs all κ-MEMs, yet some longer MEMs are not fully extended and/or may be reported in several pieces. 1 With these considerations, the implementation gives an upper bound on the number of κ-MEMs in the case of graphs, but it provides the exact number of MEMs in the case of two sequences as input.
We performed experiments with the same multiple sequence alignment (MSA) of covid19 strains as in [29].We first filtered out strains whose alignments had a run of gaps of length more than 100 bases. 2 Then we extracted a sub-MSA of 100 random strains from the remaining and extracted MSAs of the first 20, 40, 60, and 80 strains from this MSA of 100 strains.For each such dataset, we built the bidirectional r-index of the sequences (without gaps) and the indexable EFG of the MSA.The latter were constructed using the tool https://github.com/algbio/founderblockgraphs with parameters --elastic --gfa.
We used κ = 12 in all experiments, as this was the length of the shortest string spelled by an edge over all indexable EFGs.For the queries, we extracted 1000 substrings of length 100 from the first 20 strains.For each query, we selected two random positions and mutated them with equal probability for A, C, G, or T. The queries were then concatenated into a long sequence and the bidirectional r-index was built on it as described by Corollary 1.The MEMs were computed between the queries and the respective text/graph index.
The number of MEMs for each index is reported in Figure 2 and the number of runs in the two Burrows-Wheeler transforms of each index is reported in Figure 3.As can be seen from the results, the number of MEMs is greatly reduced when indexing the graph compared to indexing the collection of strains.In this setting, almost all graph MEMs have a counterpart in the collection: we implemented also a filtering mode analogous to Corollary 2 so that only graph MEMs occurring in the original strains are reported.The number of path MEMs (and thus total number of MEMs) reported dropped by 92, 97, 90, 91, and 91 for the case of indexable EFG on 20, 40, 60, 80, and 100 strains, respectively.That is, less than 2% of the reported graph MEMs were recombinants of the input strains.However, since the implementation outputs long MEMs in pieces, this percentage may be higher for full-length graphs MEMs.
The number of runs (the major factor affecting the space used by the indexes) for the bidirectional r-index of the collection and that of the concatenation of node labels are comparable.For edges and especially for paths of length 3 nodes the number of runs is significantly higher.Fortunately, the growth of these metrics when more strains are added is limited.This is not surprising, as the stains are highly similar and thus the added information content is limited and known to be correlated with the number of BWT runs [30].
Running times correlate with the index size comparison: with 100 strains, MEM finding using the the bidirectional r-index took 20 seconds, node MEM finding on the indexable EFG took 31 seconds, edge MEM finding 97 seconds,  Here sequences refers to the bidirectional r-index, labels efgnodes, efg-edges, efg-paths refer to the concatenation of node labels, paths of length 2 and paths of length 3, respectively.Label efg-total is the sum of the previous three numbers of runs.and path MEM finding 217 seconds.The running times were measured on a server with Intel Xeon 2.9 Ghz processor and exclude index construction, which took 1 second for the patterns, 11 seconds for the bidirectional r-index of 100 strains, and 13 seconds for the additional indexes required by the corresponding indexable EFG.The construction of the indexable EFG on 100 strains took 10 seconds.To speed-up MEM finding, we also tested switching the bidirectional r-index to a wavelet tree implementation of bidirectional BWT (https:// https://github.com/jnalanko/BDBWT index).On the same 100 strains indexable EFG, the total time for MEM finding was 96 seconds, including index construction, which means 3.7 speed-up over the bidirectional r-index-based implementation.

Discussion
An alternative strategy to achieve the same goal as in our experiments is to encode the graph as an aggregate over the collection, apply MEM finding on the r-index, and report the distinct aggregate values on lexicographic MEM ranges to identify MEM locations in the graph [21].This approach is not comparable to ours directly, as the compressibility of the aggregates depends on the graph properties, and the indexable Elastic Founder Graph's size has not been analyzed with respect to r.Also, the two approaches use different MEM definitions.Our Definition 1 is symmetric and local, while the version used in earlier work with the r-index [42,21] is asymmetric and semi-global: they define a MEM as a substring of a query that occurs in the text, but its query extensions do not appear in the text.For the purpose of chaining, only the symmetric definition yields connections to the Longest Common Subsequence problem [38].For completeness, our implementation (https://github.com/algbio/br-indexmems)also supports this asymmetric MEM definition; our algorithms can be simplified for this case.
We did not implement the general suffix tree-based approach to handle arbitrary long MEMs.From the experiments, we can see that already the case of paths of length 3 nodes handled by the generic MEM finding routine causes some scalability issues, and the suffix tree-based approach uses even more space.In our recent work [39], we have solved pattern search in indexable EFGs using only edges, and our aim is to extend that approach to work with MEMs, so that the whole mechanism could work on top of a plain bidirectional r-index.Some of our results assume a constant-size alphabet.This assumption can be relaxed with additional data structures, but also a more careful amortized analysis may lead to better bounds.

Lemma 3 .
it tells the distance of the k-th suffix of T L in the lexicographic order to the start of the last node of the corresponding path.With the help of Lemma 1 on D, we can then adapt the MEM finding algorithm to output suffixes corresponding to MEMs spanning exactly L nodes as follows.Let alphabet Σ be of constant size.Given a labeled graph G = (V, E, ℓ), a pattern Q ∈ Σ m , a threshold κ ≥ 1, and an integer L ≥ 1, we can compute an encoding of all MEMs of length at least κ and spanning exactly L nodes of G in time O(m + |T L | + M κ,L ).Here, T L is defined as in Equation (1) and M κ,L is the number of output MEMs.Proof.We build the bidirectional BWT indexes for T L $ and Q$, the suffix array of T L $, and preprocess D[1..|T L |] as in Lemma 1 in time O(|T L | + |Q|).
and E L form a compact representation of all MEMs spelling Q[1..y].So far the procedure computes all MEMs spanning more than 3 nodes, satisfying LeftMax and RightMax, and spelling maximal Q[1..y].We can extend it to find all MEMs satisfying the first two constraints and spelling any substring Q[x..y], with Q[x..y] maximal.Let x be the index for which we have computed MEMs spelling Q[x..y] (x = 1 in the first iteration).If cases 1 or 2 hold, we can start to search MEMs spelling Q[x+1..] in amortized linear time, since we follow the suffix link of p.If case 3 holds, we can restart the algorithm looking for MEMs spelling Q[x ′ ..y], where x′ verifying RightMax and describing a MEM where (iii) fails-or |rext(1, u, j)| ≥ 2-verifying the non-singleton condition of Definition 1 and describing a MEM where (ii) fails.

3 Corollary 1 .
| dominates |T ′ 3 |, |T 2 |, and |T 1 |, and for indexable EFGs |T 3 | ∈ O(nH 2 ), since every character of every node label ℓ(u) gets repeated at most H 2 times, which is an upper bound on the number of paths of length 3 containing u.The results of Lemmas 2 and 3 and Theorem 1 hold when query Q[1..m] is replaced by a set of queries of total length m.The respective algorithms can be modified to report MEMs between the graph and each query separately.

Corollary 2 .
The algorithms of Lemmas 2 and 3, Theorem 1, and Corollary 1 can be modified to report only MEMs that occur in text T formed by concatenating the rows (ignoring gaps and adding separator symbols) of the input MSA of the indexable EFG.This can be done in additional O(|T | + r log r) time and O(r log n) bits of space, and with multiplicative factor O(log log n) added to the running times of the respective algorithms, where r is the number of equal-letter runs in the BWT of T .

Figure 2 :
Figure 2: Number of MEMs with different indexes and varying number of covid19 strains.Here sequences refers to bidirectional r-index.For indexable EFG the results are shown for nodes, edges, and paths of length 3 nodes (these also include longer MEMs counted multiple times).Line efg-total is the total number of EFG MEMs.Note the logarithmic scale on the y-axis.

Figure 3 :
Figure 3: Number of BWT runs with different indexes and varying number of covid19 strains.Here sequences refers to the bidirectional r-index, labels efgnodes, efg-edges, efg-paths refer to the concatenation of node labels, paths of length 2 and paths of length 3, respectively.Label efg-total is the sum of the previous three numbers of runs.