Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time

A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications it is important to store the k-mers in small space, while keeping the representation easy and efficient to use (i.e. without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. Our algorithm first constructs the de Bruijn graph in linear time and then uses a Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.


Introduction
Motivation A k-mer is a DNA string of length k that is considered equal to itself and its reverse complement. A common pattern in bioinformatics is to reduce a set of input strings to their constituent k-mers. Such representations are at the core of many bioinformatics pipelines-see e.g. Schmidt et al. [1] or Brinda et al. [2] for an overview of applications. The wide-spread use of k-mer sets has prompted the question of what is the smallest plain text representation for a set of k-mers. Here, a plain text representation means a set of strings that have the same set of k-mers as the input strings, i.e. the spectrum is preserved. Such representations are also called spectrum preserving string sets (SPSS) [3], or simplitigs [2]. This has the following advantages over encoded representations: • When storing k-mer sets to disk, plain text may remove the need of decompression before usage, as some tools that usually take unitigs as input can take any other plain text representation without modification (e.g. Bifrost [4]). • Within an application, an encoded representation would require decoding whenever a k-mer is accessed, which may slow down the application a lot compared to when each k-mer is in RAM in plain text.
Further, in applications, it might be useful if the representation contains each k-mer exactly once. This is because some applications, like e.g. SSHash [5], are able to take any set of k-mers as input, but cannot easily deal with duplicate k-mers in the input. Related work There are two heuristic approaches to the construction of a small SPSS without repetitions, namely ProphAsm [2] and UST [3]. While neither of these computes a minimum representation, Rahman et al. [3] also present a lower bound to the minimum size of any representation without repetition, and they show that they are within 3% of this lower bound in practice. They also present a counter-example showing that their lower bound is not tight. Small SPSSs without repetitions are used e.g. in SSHash [5] and are also computed by stateof-the-art de Bruijn graph compactors like Cuttlefish 2 [6]. Additionally, the state-of-the-art de Bruijn graph compressor GGCAT [7] was extended to compute Eulertigs, in addition to other variants of SPSSs.
When k-mer repetitions are allowed in an SPSS, there is a known polynomially computable minimum representation, namely matchtigs [1]. The matchtig algorithm joins unitigs by first iterating all possible joins repeating up to k − 1 k-mers, and then using minimum perfect matching to find a set of joins that minimises the size of the representation. This is similar to the algorithm presented in this paper, which leaves out the matching step and only joins unitigs that are adjacent. While matchtigs are expensive to compute, the authors also present a more efficient greedy heuristic that is able to compute a near-minimum representation on a modern server with no significant penalty in runtime (when compared to computing just unitigs), but a significant increase in RAM usage.
In [1,2] the authors also showed that decreasing the size of an SPSS results in significantly better performance in downstream applications, i.e. when further compressing the representation with general purpose compressors, or when performing k-mer-based queries.
The authors of both [2] and [3] consider whether computing a minimum representation without repetitions may be NP-hard, as it is equivalent to computing a minimum path cover in a de Bruijn graph, which is NP-hard in general graphs by reduction from Hamiltonian cycle. However, computing a Hamiltonian cycle in a de Bruijn graph is actually polynomial [8]. The authors of [8] argue that de Bruijn graphs are a subclass of adjoint graphs, in which solving the Hamiltonian cycle problem is equivalent to solving the Eulerian cycle problem in the original of the adjoint graph, which can be computed in linear time. 1 However, the argument is only made for normal directed (and not bidirected) graphs, and thus is not applicable to our setup, where a k-mer is also considered equal to its reverse complement.
Our contributions Our first technical contribution is to carefully define the notion of a bidirected de Bruijn graph such that the spectrum of the input is accurately modelled in the allowed walks of the graph. While defining a bidirected de Bruijn graph is not new [10], we take specific care around k-mers that are their own reverse complement. This technicality is often neglected in the literature, and sidestepped by requiring that the value of k is odd, in which case this special case does not occur. To make sure that our definition is correct for any k, we show that our de Bruijn graph admits exactly the strings that can be spelled from the k-mers that it was constructed from. We give a suffix-tree-based deterministic linear-time algorithm to construct such a graph, filling a theory gap in the literature, as existing approaches [4,6,11,12] depend on the value of k and/or are probabilistic due to the of use hashing, minimizers or Bloom filters, or do not use the reverse-complement-aware definition of k-mers [13].
Given the bidirected de Bruijn graph, we present an algorithm that computes a minimum plain text representation of k-mer sets without repetitions, which runs in output sensitive linear time. Steps 1 to 3 run in linear time in the number of nodes and arcs in the graph. In short, it works as follows: 1 Add breaking arcs into this graph to make it Eulerian. 2 Compute a Eulerian cycle in the resulting graph. 3 Break that cycle at the breaking arcs. 4 Output the strings spelled by the resulting walks.
The algorithm is essentially an adaption of the matchtigs algorithm [1], removing the possibility of joining walks by repeating k-mers. We give detailed descriptions for all these steps and prove their correctness in our bidirected de Bruijn graph model. Together with our linear-time de Bruijn graph construction algorithm, we obtain the main result of our paper: Theorem 1 Let k be a positive integer and let I be a set of strings of length at least k over some alphabet . Then we can compute a set of strings I ′ of length at least k with minimum cumulative length and CS k ( where CS k (I) = CS k (I ′ ) means that I ′ is an SPSS of I, and ||I|| is the cumulative length of I (see Sect. "Preliminaries" for accurate definitions). This gives a positive answer to the open question if a minimum SPSS without repetitions can be computed in polynomial time. Additionally, this gives an easily computable tight lower bound on the size of a minimum SPSS without repetitions. We also give a counter example where previous heuristics are not necessarily optimal.
For our experiments, we have implemented steps 1 to 4 in Rust, taking the de Bruijn graph as given. The implementation is available on github: https:// github. com/ algbio/ match tigs. Our experimental evaluation shows that our algorithm does not result in significant practical improvements, but for the first time allows to benchmark the quality the heuristics ProphAsm and UST against an optimal solution. It turns out that both produce close-tooptimal results, but with a different distribution of computational resources.
Our work also shows that using arc-centric de Bruijn graphs can aid the intuition for certain problems, as in this case, the node-centric variant hides the relationship between Eulerian cycles and minimum SPSS without repetition.
Organisation of the paper In Sect. "Preliminaries" we give preliminary definitions of well-known concepts. In Sect. "De Bruijn graphs" we define de Bruijn graphs and prove the soundness of the definitions. In Sect. "Lineartime construction of compacted bidirected de Bruijn graphs" we show how to construct de Bruijn graphs by our definitions in linear time. In Sect. "Linear-time minimum SPSS without repetitions" we show how to construct a minimum SPSS without repetitions in linear time if the de Bruijn graph is given. Additionally, we give an example where previous heuristics were not optimal. In Sect. "Experiments" we compare our algorithm and Eulertigs against strings computed with ProphAsm and UST on practical data sets.

Preliminaries
In this section we give the prerequisite knowledge required for this paper.

Bidirected graphs
In this section we define our notion of the bidirected graphs and the incidence model.
A multiset is defined as a set M, and an implicit function # M : M → Z + mapping elements to their multiplicities. The cardinality is defined as |M| := s∈M # M (s).
An alphabet is an ordered set, and an -word is a string of characters of that set. String concatenation is written as ab for two strings a and b. The set k is the set of all -words of length k and the set * is the set of all -words, including the empty word ǫ . Given a positive integer k, the k-suffix suf k (w) (k-prefix pre k (w) ) of a word w is the substring of its last (first) k characters. A k-mer is a word of length k. A complement function over is a function comp : → mapping characters to characters that is self-inverse (i.e. comp ( comp (x)) = x , also called an involution). A reverse complement function for alphabet is a function rc : * → * defined as rc ((w 1 , . . . , w ℓ )) := ( comp (w ℓ ), . . . , comp (w 1 )) , for some arbitrary complement function comp . On sets, rc is defined to compute the reverse complement of each element in the set. Note that rc is self-inverse. A canonical k-mer is a k-mer that is lexicographically smaller than or equal to its reverse complement.
Given an integer k and an alphabet , the k-spectrum of a set of strings I ⊆ k ′ ≥k k ′ is a set of strings The canonical k-spectrum of I is CS k (I) := {w ∈ S k (I) | w is canonical} . For simplicity, the spectrum and canonical spectrum are defined for a single string w as if it were a set {w} . A spectrum preserving string set of a set of strings I is a set of strings I ′ such that CS k (I) = CS k (I ′ ) . The cumulative length of I is ||I|| := w∈I |w|.
Our definition of a bidirected graph is mostly standard like in e.g. [14], however we allow self-complemental nodes that occur in bidirected de Bruijn graphs. A bidirected graph is a tuple G = (V , E, c) with a set of normal and self-complemental nodes v ∈ V , a set of arcs e ∈ E , and a function c : V → {1, 0} marking self-complemental nodes with 1, and normal nodes with 0. An incidence is a pair vd , where d ∈ {⊕, ⊖, ⊙} is called its sign (e.g. v⊕ ). The negation of a sign is defined as ¬⊕ := ⊖ , ¬⊖ := ⊕ and ¬⊙ := ⊙ . For self-complemental nodes v ∈ V , only incidences v⊙ are allowed, and for normal nodes only incidences v⊕ and v⊖ are allowed. An arc is a tuple of incidences and a unique identifier η , where η can be of any type. The reversal of an arc is denoted by If not required, we may drop the identifier (i.e. just write . We count the incidences present in an arc e using multiset notation like # e (vd) , returning 0 if the arc does not contain the incidence vd , returning 1 if it contains the incidence once and returning 2 if it is a selfloop with that incidence. If a node v ∈ V is present with a ⊕ ( ⊖ ) sign in an arc, then the arc is outgoing (incoming) from (to) v.
Note that, other than in standard directed graphs, in bidirected graphs arcs can be outgoing or incoming on both ends, and the order of the incidences in the arc does not affect if it is outgoing or incoming to a node. Further, our notation differs from that of standard bidirected graphs in that arcs have a direction. This is required because we will work with arc-centric de Bruijn graphs (see Sect. "De Bruijn graphs"), which have labels on the arcs and not the nodes. Using the sign of the incidence pairs, it is possible to decide if a node is traversed forwards or backwards, but not if the arc is traversed forwards or backwards. But to decide which label (forwards or reverse complement) to use when computing the string spelled by an arc, the direction is relevant. See Fig. 1a for an example of a bigraph, which has labels that make it a de Bruijn graph as well.
A walk in a bigraph is a sequence of arcs where for every i it holds that (we can arbitrarily walk over arcs forwards and reverse), and for every i < ℓ it holds that v ′ i = v i+1 and d ′ i = ¬d i+1 . The length of a walk is ℓ = |W | .
For a node v ∈ V , the multiset of incidences is defined as I (v) := {vd | d ∈ {⊕, ⊖, ⊙}} , with multiplicities # I (v) (vd) := e∈E # e (vd) . For a node v ∈ V that is not self-complemental, the outdegree is defined as δ + (v) := # I (v) (v⊕) , and the indegree is defined as We define the imbalance of a node v ∈ V that is not self-complemental as the difference of its outdegree and indegree imbalance (v) A labelled graph is a bidirected graph G = (V , E, c) where the identifiers of arcs are strings over some alphabet (e.g. (v 1 ⊕, v 2 ⊖, ACCTG) ∈ E).

Suffix arrays and suffix trees
Section "Linear-time construction of compacted bidirected de Bruijngraphs" requires knowledge of suffix arrays and suffix trees. We assume the reader is familiar with these data structures, and briefly give the relevant definitions and properties below. We point the reader to Gusfield [15] and Mäkinen [16] for an in-depth treatment of the topics.
A suffix array SA T for a string T is an array of length |T| such that SA T [i] is the starting position of the lexicographically i-th suffix of T. The suffix array interval of a string x is the maximal interval [i..j] such that all the suffixes pointed by SA T [i], . . . , SA T [j] have x as a prefix, or the empty interval if x is not a substring of T.
A suffix tree of a string T is a compacted version of the trie of all suffixes of T, such that non-branching paths are merged into single arcs, with arcs pointing away from the root. The compactification concatenates the labels of the arcs on the compacted path. The nodes that were compacted away and are now in the middle of an arc are called implicit nodes, and the rest of the nodes are explicit. A locus (plural loci) is a node that is either explicit or implicit. A locus v is represented by a pair (u, d), where u is the explicit suffix tree node at the end of the arc containing v (u is equal to v if v is explicit), and d is the depth of locus v in the trie of loci. The suffix array interval of a node is the interval of leaves in the subtree of the node. The suffix array interval of an implicit locus (u, d) is the same as the suffix array interval of u.
The suffix tree can be constructed in log-linear time in |T | log | | using e.g. Ukkonen's algorithm [17] or in linear time in |T| using Farach's algorithm [18]. The tree comes with a function child that takes an explicit node and a character, and returns the child at the end of the arc from that node whose label starts with the given character (if such node exists). This can be implemented in O(log |�|) time by binary searching over child pointers sorted by labels. The child function can also be easily implemented for implicit loci. Ukkonen's algorithm also produces suffix links for the explicit nodes, which map from the suffix tree node of a string cx to the suffix tree node of string x. It is possible to emulate suffix links on the implicit loci using constant-time weighted level-ancestor queries [19] by mapping is the destination of a suffix link from u, and f d−1 (SL(u)) is the furthest suffix tree ancestor from SL(u) at depth at least d − 1 in the trie of loci. The inverse pointers of suffix links are called Weiner links, and they can also be simulated on the implicit loci by mapping is the destination of a Weiner link from u with character c. The de Bruijn graph of order k of a set of input strings I is defined as a labelled graph constructed by Algorithm 1. See Fig. 1a for an example. The algorithm inserts an arc for each canonical k-mer, and connects the arcs via nodes according to their k − 1 overlaps. Depending on if these overlaps use the reverse complement or if the k − 1-mer of a node is self-complemental, it adds directions to the incidences. A de Bruijn graph computed by this algorithm has the following property.

Proof
Observe that the values of w and w ′ computed in Lines 5 and 7 of Algorithm 1 are equal to pre k−1 (η 1 ) and suf k−1 (η 1 ) for e 1 and equal to pre k−1 (η 2 ) and suf k−1 (η 2 ) for e 2 . Further, observe that the values of v and v ′ computed in Lines 6 to 8 are equal to v 1 and v ′ 1 for e 1 and equal to v 2 and v ′ 2 for e 2 . This for all possible values of d 1 , and therefore For a walk W : in a de Bruijn graph, its sequence of k-mers is The string spell (W ) is the string spelled by W, which is defined as its collapsed sequence of kmers, i.e. its sequence of k-mers gets concatenated while overlapping consecutive k-mers by k − 1 . This is computed by Algorithm 2. It spells out the first k-mer (or its reverse complement) completely, and then adds the last characters of all subsequent k-mers (or their reverse complements). We prove the following lemmas to show that our definition of a de Bruijn graph together with the spell (·) function is sound for our purposes, i.e. walks in the de Bruijn graph can spell exactly the strings containing subsets of the k-mers used to create the de Bruijn graph. Due to this property, we can use our de Bruijn graph and spell to in the Eulertig algorithm, such that finding a minimum SPSS is equivalent to finding a minimum walk cover of the de Bruijn graph.

Lemma 3 Let k be a positive integer and let I be a set of strings of length at least k. Let
) be a walk in G, and K := (κ 1 , . . . , κ ℓ ) its sequence of k-mers. Then for each consecutive pair of kmers We can apply Lemma 2 case by case.

Proof
Let (κ 1 , . . . , κ ℓ ) := K W . We use induction over the length of W. For an empty W, K is empty, spell (W ) is empty, and therefore K ′ is empty as well. For |W | = 1 , Algorithm 2 outputs spell (W ) = κ 1 and it holds that For |W | ≥ 2 we consider that K X = K ′ X holds for a prefix X of W with |X| = |W | − 1 . When i = |W | at the beginning of the loop in Line 8, then s = spell (X) . By Lemma 3 it holds that the last k − 1 characters of s are equal to the first k − 1 characters of κ ℓ . Therefore, by appending the last character from κ ℓ to s, κ ℓ is appended to K ′ X forming K ′ W . Therefore, last k-mer of K ′ W equals the last k-mer of K W , and the first ℓ − 1 k-mers of K ′ W equal those of K W by induction.

Lemma 5 Let k be a positive integer and let I be a set of strings of length at least k. Let
) as follows: for each i, let η i be the canonical of κ i and f i ∈ E be the arc whose identifier is η i . We set e i = f i if κ i is canonical, and e i = f −1 i otherwise.
For W to fulfil the definition of a walk we need that v ′ i = v i+1 and d ′ i = ¬d ′ i+1 for all i. Using Lemma 2, we get: To complete the proof we need to show that spell (W ) = w . By definition, the sequence of k-mers K W of W is equivalent to K w . And since W is a walk, by Lemma 4 we get that the sequence of k-mers of spell (W ) is equivalent to K W , and therefore spell (W ) = w . A walk cover W of a bigraph G is a set of walks such that for each arc e ∈ E it holds that e is part of some walk W ∈ W , or e −1 is part of some walk W ∈ W.

Proof
If CS k (I ′ ) ⊆ CS k (I) , then for each string w ′ ∈ I ′ it holds that CS k (w ′ ) ⊆ CS k (I) . Therefore, by Lemma 5, there exists a walk w in G with spell (w) = w ′ . Then, the set of all such walks W spells I ′ . Further, because CS k (I) ⊆ CS k (I ′ ) , the identifier η of each arc e ∈ E is in CS k (I ′ ) , and therefore in the sequence of kmers K w ′ of some string w ′ ∈ I ′ (possibly as a reverse complement). By Lemma 4 it holds that K w ′ = K w , where K w is the sequence of k-mers of walk w. By the definition of the sequence of k-mers of a walk, this implies that w visits e (possible in reverse direction). Since this holds for each e ∈ E , it holds that W is a walk cover of G.
Assume that there is a walk cover W in G that spells the strings in I ′ , and let w ∈ W be a walk, K w its sequence of k-mers, w ′ := spell (w) and K w ′ the sequence of k-mers of w ′ . Then, by Lemma 4, K w = K w ′ , which, by the definition of the sequence of k-mers of a walk implies that CS k (I) ⊆ CS k (I ′ ) . And since W is a walk cover of G, we get CS k (I) = CS k (I ′ ) .

Corollary 7
By setting I = I ′ in Theorem 6 we see that our de Bruijn graph contains the strings it was constructed from. Further, by Theorem 6 it holds that walks in the de Bruijn graph spell exactly the strings that can be formed from the k-mers that were used to create the graph. Hence, our definition of a de Bruijn graph is sound for all k.
A compacted de Bruijn graph is constructed from a de Bruijn graph by contracting all nodes v ∈ V that are either self-complemental and have exactly two arcs that have exactly one incidence to v each, or that are not self-complemental and have exactly one incoming and one outgoing arc. For simplicity, we use uncompacted de Bruijn graphs in our theoretical sections, however all results equally apply to compacted de Bruijn graphs.

Linear-time construction of compacted bidirected de Bruijn graphs
In this section, we fill a gap in the literature by describing on a high level an algorithm to construct the bidirectional de Bruijn graph of a set of input strings in time linear in the total length of the input strings, independent of the value of k.

Algorithm
Let I = {w 1 , . . . w m } be the set of input strings. Consider the following concatenation: where $ is a special character outside of the alphabet of the input strings. We require an index on T that can answer the following queries: extendRight, extend-Left, contractRight and contractLeft in constant time. The extension operations take as input a character c ∈ and the interval of a string x in the suffix array of T, and return the suffix array intervals of xc in the case of extendRight and cx in the case of extend-Left. The contraction operations are the inverse operations of these, mapping the suffix array intervals of xc to x in the case of contractRight and cx to x in the case of contractLeft. For efficiency, we also require operations enumerateRight and enumerateLeft, which take a string x and give all characters such that extendRight and extendLeft respectively return a non-empty interval, in time that is linear in the number of such characters. Implementations for all the six subroutines are given in Sect. "Implementation of the subroutines". Using these operations, we can simulate the regular nonbidirected de Bruijn graph of T. Each k-mer of the input strings for a fixed k corresponds to a disjoint interval in the suffix array of T. The nodes are represented by their suffix array intervals. The outgoing arcs from a (k − 1) -mer x are those characters c where extendRight(x, c) returns a non-empty interval. We can enumerate all the characters c with this property in constant time using enumerateRight(x). The incoming arcs can be enumerated symmetrically with the enumerateLeft(x). Finally, we can find the destination or origin of an arc labelled with x by running a contractLeft or con-tractRight operation respectively on x.
To construct the bidirected de Bruijn graph, we merge together nodes that are the reverse complement of each other. To find which nodes are complemental, we scan the input strings I while maintaining the suffix array interval of the current k-mer using extendRight and contract-Left operations, while at the same time maintaining the suffix array interval of the reverse complement using extendLeft and contractRight operations. Whenever we merge two nodes, we combine the incoming and outgoing arcs, assigning the incidences of the arcs according to the incidence rules in our definition. We are able to tell in constant time which k-mer of a pair of complemental k-mers is canonical by comparing the suffix array intervals of the k-mers: the k-mer whose suffix array interval has a smaller starting point is the canonical k-mer. If the starting points are the same, the k-mer is self-complemental.
Using the enumerateRight and enumerateLeft functions, we can check if a node would be contracted in a compacted de Bruijn graph. By extending k-mers over such nodes, we can in linear time also output only the arcs and nodes of a compacted de Bruijn graph. For storing the labels, we use one pointer into the input strings to store a single k-mer, as well as a flag that is set whenever the label is not canonical. If a label has multiple k-mers, then we store the remaining k-mers as explicit strings, however without their overlap with the "pointer-k-mer". This way, we can store each label in O(ℓ) space, where ℓ is the number of k-mers in the label. We additionally store the first and last character of each label, as an easy way to make the spell function run in output sensitive linear time.

Implementation of the subroutines
All required the subroutines extendRight, extend-Left, contractRight, contractLeft, enumera-teRight and enumerateLeft can be implemented with the suffix tree of T by simulating the trie of the suffix tree loci as described in Sect. "Suffix arrays and suffix trees". The suffix array intervals of explicit nodes can be stored with the nodes, so that we can operate on loci (u, d) and retrieve the suffix array intervals on demand. The operation extendRight follows an arc from a locus to a child, and the operation contractRight is implemented by going to the parent of the current locus. The operation contractLeft follows a suffix link from the current locus, and extendLeft follows a Weiner link. The operations enumerateRight and enumerate-Left are implemented by storing the children and the Weiner links from explicit suffix tree nodes as neighbor lists. The total number of these links is linear in |T| [16]. With this implementation, the slowest operations are extendRight and extendLeft, taking O(log |�|) time to binary search the neighbor lists. We therefore obtain the following result:

Theorem 8 The compacted arc-centric bidirected de Bruijn graph of order k of a set of input strings I from the alphabet can be constructed in time O(||I|| log |�|).
We note that the same operations can also be implemented on top of the bidirectional BWT index of Belazzougui and Cunial [20], using the data structures of Belazzougui et al. [21] for the enumeration operations. This gives an index that supports all the required subroutines in constant time. The drawback of the bidirectional BWT index is that only randomized construction algorithms are known, but the expected time is still linear in |T|. We leave as an open problem the construction of the compacted arc-centric bidirected de Bruijn graph in deterministic linear time independent of the alphabet size.

Pseudocode
The pseudocode for computing a compacted de Bruijn graph in linear time is given by Algorithm 4 which uses Algorithm 3 as a subroutine. The data structure D used by the algorithms is that described in Sect. "Linear-time construction of compacted bidirected de Bruijn graphs". Note that if we compute the arc labels as plain strings as in Algorithm 1, we need up to O(k log |�|) bits to store a single-k-mer arc. And since arcs are not substrings of input strings (but potentially combinations of input strings), we would need up to O(k||I|| log |�|) bits to store all arc labels without referring to the input strings. This contradicts the algorithm being linear in ||I|| log | | . However, we can store the labels as tuples (p, η, q, r) , where pηq is the label where p and q are explicit strings while η is a pointer to a k-mer in the input. If r is true, then the label must be reverse complemented to match that defined by Algorithm 1. With this fix, the size of a label that represents ℓ k-mers is O(ℓ log |�|) , and in total the de Bruijn graph represets O(||I||) k-mers. Schmidt The comparison on Line 16 of Algorithm 4 can be done in linear time in |η 1 | + |η 2 | by finding the suffix array intervals of η 1 ηη 2 and rc (η 1 ηη 2 ) with extendLeft and extendRight from η and rc (η) respectively, and comparing the starts of the intervals. This way, the total time taken by all those comparisons is proportional to the sum of |η 1 | + |η 2 | over all unitigs, which is linear in ||I|| because each character of η 1 and η 2 can be mapped to a distinct edge in the non-compacted de Bruijn graph of ||I||. Therefore, the algorithm can be implemented to run in O(||I|| log |�|) time.
Our pseudocode does not compute the first and last character of each arc-label, but this can be easily computed in constant time using w i , η 1 and η 2 in Algorithm 4.

Linear-time minimum SPSS without repetitions
Let I be a set of strings. To compute an SPSS without repetitions we first build a compacted de Bruijn graph G from I. By Theorem 6, finding an SPSS is equivalent to finding a walk cover in G. Further, with Lemma 4, we get that an SPSS without repetitions is equivalent to a walk cover that visits each arc exactly once (either once forwards, or once reverse, but not both forwards and reverse). We call such a walk cover a unique walk cover.
For minimality, observe that the cumulative length of an SPSS S relates to its equivalent set of walks W as follows: This is because in Algorithm 2, in Line 7, k − 1 characters are appended to the result, and then in the loop in Line 8, one additional character per arc in W is appended. We cannot alter the sum W ∈W |W | , since we need to cover all arcs in G. However we can alter the number of strings, and decreasing or increasing this number by one will decrease or increase the cumulative length of S by k − 1 . Therefore, finding a minimum SPSS of I without repetitions equals finding a unique walk cover of G that has a minimum number of walks.
Note that computing a minimum SPSS in a bigraph that is not connected is equivalent to separately computing an SPSS in each maximal connected subgraph. Therefore we restrict to connected bigraphs from here on.

A lower bound for an SPSS without repetitions
Using the imbalance of the nodes of a bigraph, we can derive a lower bound for the number of walks in a walk cover.

Lemma 9
Let v ∈ V be an unbalanced node in a bigraph G = (V , E, c) . Then in a unique walk cover W of G, either at least | imbalance (v)| walks start in v, or at least | imbalance (v)| walks end in v.

Proof
If v is self-complemental, then its imbalance is 1, so by definition v has an odd number of incident arcs. Each walk that does not start or end in v needs to enter and leave v via two distinct arcs whenever it visits v. But since the number of incident arcs is odd, there is at least one arc that cannot be covered this way, implying that a walk needs to start or end in this arc.
If v is not self-complemental and has a positive imbalance, then it has imbalance (v) more outgoing arcs then incoming arcs. Since walks need to leave v with the opposite sign than they entered v, at least imbalance (v) arcs cannot be covered by walks that do not start or end in v.
If v has negative imbalance, the situation is symmetric.

Definition 10
The imbalance imbalance (G) of a bigraph G = (V , E, c) is the sum of the absolute imbalance of all nodes v∈V | imbalance (v)|.

Theorem 11
Let G be a bigraph. A walk cover W of G has a minimum string count of imbalance (G)/2.

Proof
Let v ∈ V be an unbalanced node. Then, by Lemma 9 at least | imbalance (v)| walks start in v or at least | imbalance (v)| walks end in v. Since each walk has exactly one start node and one end node, W has a minimum string count of imbalance (G)/2 .

Eulerising a bigraph
A directed graph is called Eulerian, if all nodes have indegree equal to outdegree, i.e. are balanced [22]. If the graph is strongly connected, 2 then this is equivalent to the graph admitting a Eulerian cycle, i.e. a cycle that visits each arc exactly once. The same notion can be used with bidirected graphs, using our definition of imbalance.

Definition 12 A bigraph is Eulerian, if all nodes have imbalance zero.
A connected bigraph can be transformed into a Eulerian bigraph by adding arcs using Algorithm 5. See Fig. 1b for an example. The algorithm lists all nodes that are out of balance, and inserts arbitrary arcs to balance them.

Proof
Adding or removing an arc changes the imbalance of two nodes by 1, or of one node by two. In both cases, the imbalance of the graph can only change by −2 , 0, or 2. Since the imbalance of a graph without arcs is 0, this implies that there can be no graph with odd imbalance. G = (V , E, c) , Algorithm 5 outputs a Eulerian bigraph G ′ = (V , E ′ , c).

Proof
Algorithm 5 is well-defined, since by Lemma 13, it holds that L has even length in each iteration of the loop in Line 10, so the removal operation in Line 12 always has something to remove.
The output of Algorithm 5 is a valid bigraph, since for selfcomplemental nodes v ∈ V , only incidences v⊙ are added to G ′ , and for not self-complemental nodes v ∈ V , only incidences v⊕ and v⊖ are added to G ′ .
Further, the output is a Eulerian bigraph, because for all v ∈ V , it holds that imbalance (v) is 0, by the following argument: • If c(v) = 1 and v has imbalance zero in G, then its imbalance stays the same in G ′ . If it has imbalance 1, then one incident arc is inserted, making its degree even and its imbalance therefore zero. • If c(v) = 0 and v has positive imbalance i in G, then i incoming arcs are added to v (counting incoming selfloops twice), and no outgoing arcs are added. Therefore, it has imbalance zero in G ′ . By symmetry, if v has negative imbalance in G, it has imbalance zero in G ′ .  The loop in Line 3 runs |V| times and each iteration runs in O(| imbalance (v)|) for a node v, because a doubly linked list supports appending in constant time. The sum of absolute imbalances of all nodes cannot exceed 2|E|, because each arc adds at most 1 to the absolute imbalance of at most two nodes, or adds at most 2 to the absolute imbalance of at most one node. Therefore, the length

Computing a Eulerian cycle in a bigraph
of list L after completing the loop is at most 2|E|, and the loop runs in O(|V | + |E|) time.
The loop in Line 10 runs at most |L| ≤ 2|E| times and performs only constant-time operations, since L is a doubly linked list and we can insert arcs into an adjacency list in constant time. Therefore, this loop also runs in O(|V | + |E|) time.
With Lemma 14 and 15 we get the following.

Theorem 16 Algorithm 5 is correct and runs in
After Eulerising the bigraph, we can compute a Eulerian cycle using Algorithm 6. We do this similarly to Hierholzer's classic algorithm for Eulerian cycles [22]. First we find an arbitrary cycle. Then, as long as there are unused arcs left, we search along the current cycle for unused arcs, and find additional cycles through such unused arcs. We integrate each of those additional cycles into the main cycle. See Fig. 1c for an example of a Eulerian cycle.

Lemma 17 Given a connected Eulerian bigraph
to be a Eulerian cycle, it must be a cycle that contains each arc exactly once.
The sequence W ′ constructed by the loop in Line 10 is a walk by construction, and since G is Eulerian it is a cycle after the loop terminates. After finding the initial cycle in the first iteration of the outer loop, each additional cycle is started from a node on the initial cycle, and is a cycle again. Therefore it can be inserted into the original cycle without breaking its cycle property.
Since each arc is deleted when being added to W ′ , there is no duplicate arc in W. And if the algorithm terminates, then |E| = 0 (Line 1), so W contains all arcs.
For termination, consider that if W is not complete after the first iteration of the outer loop, then the loop in Line 7 searches for an unused arc using the first_unfinished pointer. Since the prefix of W up to including first_unfinished is never modified (Line 19), and first_unfinished is only advanced when its  Table 1 Experiments on references and read sets of single genomes with k = 51 and a min abundance of 10 for human and 1 for the others The CL and SC ratios are compared to the CL-optimal Eulertigs. For time and memory, we report the total time and maximum memory required to compute the tigs from the respective data set. BCALM2 directly computes unitigs, while UST-and Eulertigs require a run of BCALM2 first before they can be computed themselves. Prophasm can only be run for k ≤ 32 , which does not make sense for large genomes. The number in parentheses behind time and memory indicates the slowdown/ increase over computing just unitigs with BCALM2. BCALM2 was run with 28 threads, while all other tools support only one thread. The lengths of the genomes are 100Mbp for C. elegans, 482Mbp for B. mori and 3.21Gbp for H. sapiens and the read data sets have a coverage of 64x for C. elegans, 58x for B. mori and 300x for H. sapiens pointee cannot reach any arc anymore, it holds that no arc in W can reach an arc in E when first_unfinished gets advanced over the end of W. Since G was initially Eulerian and only Eulerian cycles have been removed from G, this implies that all nodes visited by W are still balanced and therefore have no incident arcs anymore. And since G was originally connected, W has visited all nodes, i.e. |E| = 0 . Therefore, first_unfinished cannot be advanced over the end of W, because the outer loop terminates before that.
To complete the proof of termination, consider that in each iteration of the outer loop, at least one arc gets removed from E. In the first iteration, this happens at least in Line 3, and in all following iterations, this happens in Line 11.

Proof
We use a doubly linked list for W and W ′ , and an adjacency list for G. Then all lines can be executed in constant time.
The loop in Line 10 removes one arc from E each iteration, so it runs at most |E| times in total (over all iterations of the outer loop). The loop in Line 7 advances first_unfinished each iteration. Since the algorithm is correct by Lemma 17, |W | ≤ |E| and first_unfinished never runs over the end of first_unfinished , so the loop runs at most |E| times in total (over all iterations of the outer loop).
The condition for the loop in Line 10 is true at least once in each iteration of the outer loop, since the preceding branch sets up (vd, v ′ d ′ , η) such that it has a successor (in the first iteration because of Eulerianess). So in each iteration of the outer loop, at least one arc gets removed, so the outer loop runs at most |E| times in total.     As a result, all loops individually run at most |E| times, therefore Algorithm 6 terminates after O(|E|) steps. Because G is connected, this is equivalent to O(|V | + |E|) steps.
With Lemma 17 and 18 we get the following.
Theorem 19 Algorithm 6 is correct and runs in O(|V | + |E|) time.

Computing a minimum SPSS without repetitions
We convert the Eulerian cycle into a walk cover of the original bigraph by breaking it at all arcs inserted by Algorithm 5, and removing those arcs (see Fig. 1d for an example). This results in a walk cover with either one walk, if Algorithm 5 inserted zero or one arcs, or imbalance (G)/2 arcs, if Algorithm 5 inserted more arcs. By Theorem 11, this is a minimum number of walks, and therefore the SPSS spelled by these walks is minimum as well. Constructing the de Bruijn graph takes O(||I|| log |�|) time, and it has O(||I||) k-mers, so it holds that |V | ∈ O(||I||) and |E| ∈ O(||I||) . Further, spelling the walk cover takes time linear to the cumulative length of the spelled strings. Since we compute a minimum representation, it holds that the output is not larger than the total length of the input strings. Therefore we get: Theorem 1 Let k be a positive integer and let I be a set of strings of length at least k over some alphabet . Then we can compute a set of strings I ′ of length at least k with minimum cumulative length and CS k (I) = CS k (I ′ ) in O(||I|| log |�|) time.

Previous heuristics were not optimal
The heuristics implemented by UST [3] and Prophasm [2] are not optimal, as shown experimentally below. Here, we also give a simple counter-example to argue that the previous heuristics were not optimal. Even though the previous algorithms were described in node-centric de Bruijn graphs, we describe them here in the arc-centric variants to stick with the terminology of this paper.
UST works by starting from an arbitrary arc and extending forwards to unused arcs as long as possible. If there is no unused arc, but the last chosen arc has a successor that is the start of another walk, then the walks are joined. On the other hand, ProphAsm works by choosing an arbitrary arc and extending both forwards and backwards to unused arcs as long as possible. Both algorithms may fail to produce an optimal solution in the example given in Fig. 2. They may both first choose AGGTG and then continue to GTG GGA T, producing a string AGG TGG GAT. When they then process GTG CCG TG, they cannot join it with the previous string, hence they produce two strings of a cumulative length of 17. The optimal solution in Fig. 2 has one string with a cumulative length of 14.

Experiments
We ran our experiments on a server running Linux with two 64-core AMD EPYC 7H12 processors with 2 logical cores per physical core, 1.96TiB RAM and an SSD. Our data sets are the same as in [1], and we also adapted their metrics cumulative length (CL), which is the total count of characters in all strings, and string count (SC), which is the number of strings. Our implementation does not use the formalisation of bidirected graphs introduced in this work, but instead uses the formalisation from [1]. For constructing de Bruijn graphs, we do not implement our purely theoretical linear time algorithm, since practical de Bruijn graph construction is a well-researched field [4,6,11,[23][24][25], and we want to focus more on computing the compressed representation from unitigs. UST only supports unitigs constructed by BCALM2 [11], since it needs certain additional data. BCALM2 is not a linear time algorithm, but is efficient in practice. Therefore, we use BCALM2 to construct a node-centric de Bruijn graph, and then convert it to an arc-centric variant using a union-find data structure. For the human pangenome, which hits some built-in limit of BCALM2, we use Cuttlefish 2 [6] instead. This prevents us from running UST, but instead we run ProphAsm on the unitigs computed by Cuttlefish 2.
Our experimental pipeline is constructed with [26] and using the bioconda software repository [27]. We ran all multithreaded tools with up to 28 threads and never used more than 128 cores of our machine at once to prevent hyperthreading from affecting our timing. The code to reproduce our experiments is licensed under the Creative Commons Attribution 4.0 International license and available on zenodo [28]. We additionally provide our implementation of the Eulertigs algorithm on zenodo [29] as well as github [30], conda [31] and crates. io [32].
The performance figures in Tables 1 and 2 are all very similar, with two exceptions. Prophasm does not support parallel computation at the moment, therefore its runtime is much higher. Compared to that, all other algorithms use parallel computation to compute unitigs, but computing the final tigs from unitigs seems to be negligible compared to computing the de Bruijn topology. Moreover, running UST or Eulertigs on read data sets of larger genomes consumes significantly more memory than computing just unitigs. This is likely because BCALM2 uses external memory to compute unitigs, while the other tools simply load the whole set of unitigs into memory.In terms of CL, we