 Research
 Open access
 Published:
Eulertigs: minimum plain text representation of kmer sets without repetitions in linear time
Algorithms for Molecular Biology volume 18, Article number: 5 (2023)
Abstract
A fundamental operation in computational genomics is to reduce the input sequences to their constituent kmers. For maximum performance of downstream applications it is important to store the kmers in small space, while keeping the representation easy and efficient to use (i.e. without kmer repetitions and in plain text). Recently, heuristics were presented to compute a nearminimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. Our algorithm first constructs the de Bruijn graph in linear time and then uses a Euleriancyclebased algorithm to compute the minimum representation, in time linear in the size of the output.
Introduction
Motivation A kmer is a DNA string of length k that is considered equal to itself and its reverse complement. A common pattern in bioinformatics is to reduce a set of input strings to their constituent kmers. Such representations are at the core of many bioinformatics pipelines—see e.g. Schmidt et al. [1] or Brinda et al. [2] for an overview of applications. The widespread use of kmer sets has prompted the question of what is the smallest plain text representation for a set of kmers. Here, a plain text representation means a set of strings that have the same set of kmers as the input strings, i.e. the spectrum is preserved. Such representations are also called spectrum preserving string sets (SPSS) [3], or simplitigs [2]. This has the following advantages over encoded representations:

When storing kmer sets to disk, plain text may remove the need of decompression before usage, as some tools that usually take unitigs as input can take any other plain text representation without modification (e.g. Bifrost [4]).

Within an application, an encoded representation would require decoding whenever a kmer is accessed, which may slow down the application a lot compared to when each kmer is in RAM in plain text.
Further, in applications, it might be useful if the representation contains each kmer exactly once. This is because some applications, like e.g. SSHash [5], are able to take any set of kmers as input, but cannot easily deal with duplicate kmers in the input.
Related work There are two heuristic approaches to the construction of a small SPSS without repetitions, namely ProphAsm [2] and UST [3]. While neither of these computes a minimum representation, Rahman et al. [3] also present a lower bound to the minimum size of any representation without repetition, and they show that they are within 3% of this lower bound in practice. They also present a counterexample showing that their lower bound is not tight. Small SPSSs without repetitions are used e.g. in SSHash [5] and are also computed by stateoftheart de Bruijn graph compactors like Cuttlefish 2 [6]. Additionally, the stateoftheart de Bruijn graph compressor GGCAT [7] was extended to compute Eulertigs, in addition to other variants of SPSSs.
When kmer repetitions are allowed in an SPSS, there is a known polynomially computable minimum representation, namely matchtigs [1]. The matchtig algorithm joins unitigs by first iterating all possible joins repeating up to \(k  1\) kmers, and then using minimum perfect matching to find a set of joins that minimises the size of the representation. This is similar to the algorithm presented in this paper, which leaves out the matching step and only joins unitigs that are adjacent. While matchtigs are expensive to compute, the authors also present a more efficient greedy heuristic that is able to compute a nearminimum representation on a modern server with no significant penalty in runtime (when compared to computing just unitigs), but a significant increase in RAM usage.
In [1, 2] the authors also showed that decreasing the size of an SPSS results in significantly better performance in downstream applications, i.e. when further compressing the representation with general purpose compressors, or when performing kmerbased queries.
The authors of both [2] and [3] consider whether computing a minimum representation without repetitions may be NPhard, as it is equivalent to computing a minimum path cover in a de Bruijn graph, which is NPhard in general graphs by reduction from Hamiltonian cycle. However, computing a Hamiltonian cycle in a de Bruijn graph is actually polynomial [8]. The authors of [8] argue that de Bruijn graphs are a subclass of adjoint graphs, in which solving the Hamiltonian cycle problem is equivalent to solving the Eulerian cycle problem in the original of the adjoint graph, which can be computed in linear time.^{Footnote 1} However, the argument is only made for normal directed (and not bidirected) graphs, and thus is not applicable to our setup, where a kmer is also considered equal to its reverse complement.
Our contributions Our first technical contribution is to carefully define the notion of a bidirected de Bruijn graph such that the spectrum of the input is accurately modelled in the allowed walks of the graph. While defining a bidirected de Bruijn graph is not new [10], we take specific care around kmers that are their own reverse complement. This technicality is often neglected in the literature, and sidestepped by requiring that the value of k is odd, in which case this special case does not occur. To make sure that our definition is correct for any k, we show that our de Bruijn graph admits exactly the strings that can be spelled from the kmers that it was constructed from. We give a suffixtreebased deterministic lineartime algorithm to construct such a graph, filling a theory gap in the literature, as existing approaches [4, 6, 11, 12] depend on the value of k and/or are probabilistic due to the of use hashing, minimizers or Bloom filters, or do not use the reversecomplementaware definition of kmers [13].
Given the bidirected de Bruijn graph, we present an algorithm that computes a minimum plain text representation of kmer sets without repetitions, which runs in output sensitive linear time. Steps 1 to 3 run in linear time in the number of nodes and arcs in the graph. In short, it works as follows:

1
Add breaking arcs into this graph to make it Eulerian.

2
Compute a Eulerian cycle in the resulting graph.

3
Break that cycle at the breaking arcs.

4
Output the strings spelled by the resulting walks.
The algorithm is essentially an adaption of the matchtigs algorithm [1], removing the possibility of joining walks by repeating kmers. We give detailed descriptions for all these steps and prove their correctness in our bidirected de Bruijn graph model. Together with our lineartime de Bruijn graph construction algorithm, we obtain the main result of our paper:
Theorem 1
Let k be a positive integer and let I be a set of strings of length at least k over some alphabet \(\Sigma\). Then we can compute a set of strings \(I'\) of length at least k with minimum cumulative length and \({{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')\) in \(O(I\log \Sigma )\) time.
where \({{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')\) means that \(I'\) is an SPSS of I, and I is the cumulative length of I (see Sect. "Preliminaries" for accurate definitions). This gives a positive answer to the open question if a minimum SPSS without repetitions can be computed in polynomial time. Additionally, this gives an easily computable tight lower bound on the size of a minimum SPSS without repetitions. We also give a counter example where previous heuristics are not necessarily optimal.
For our experiments, we have implemented steps 1 to 4 in Rust, taking the de Bruijn graph as given. The implementation is available on github: https://github.com/algbio/matchtigs. Our experimental evaluation shows that our algorithm does not result in significant practical improvements, but for the first time allows to benchmark the quality the heuristics ProphAsm and UST against an optimal solution. It turns out that both produce closetooptimal results, but with a different distribution of computational resources.
Our work also shows that using arccentric de Bruijn graphs can aid the intuition for certain problems, as in this case, the nodecentric variant hides the relationship between Eulerian cycles and minimum SPSS without repetition.
Organisation of the paper In Sect. "Preliminaries" we give preliminary definitions of wellknown concepts. In Sect. "De Bruijn graphs" we define de Bruijn graphs and prove the soundness of the definitions. In Sect. "Lineartime construction of compacted bidirected de Bruijn graphs" we show how to construct de Bruijn graphs by our definitions in linear time. In Sect. "Lineartime minimum SPSS without repetitions" we show how to construct a minimum SPSS without repetitions in linear time if the de Bruijn graph is given. Additionally, we give an example where previous heuristics were not optimal. In Sect. "Experiments" we compare our algorithm and Eulertigs against strings computed with ProphAsm and UST on practical data sets.
Preliminaries
In this section we give the prerequisite knowledge required for this paper.
Bidirected graphs
In this section we define our notion of the bidirected graphs and the incidence model.
A multiset is defined as a set M, and an implicit function \(\#_M: M \rightarrow {\mathbb {Z}}^+\) mapping elements to their multiplicities. The cardinality is defined as \(M:= \sum _{s \in M} \#_M(s)\).
An alphabet \(\Sigma\) is an ordered set, and an \(\Sigma\)word is a string of characters of that set. String concatenation is written as ab for two strings a and b. The set \(\Sigma ^k\) is the set of all \(\Sigma\)words of length k and the set \(\Sigma ^*\) is the set of all \(\Sigma\)words, including the empty word \(\epsilon\). Given a positive integer k, the ksuffix \({{\,\textrm{suf}\,}}_k(w)\) (kprefix \({{\,\textrm{pre}\,}}_k(w)\)) of a word w is the substring of its last (first) k characters. A kmer is a word of length k. A complement function over \(\Sigma\) is a function \({{\,\textrm{comp}\,}}: \Sigma \rightarrow \Sigma\) mapping characters to characters that is selfinverse (i.e. \({{\,\textrm{comp}\,}}({{\,\textrm{comp}\,}}(x)) = x\), also called an involution). A reverse complement function for alphabet \(\Sigma\) is a function \({{\,\textrm{rc}\,}}: \Sigma ^* \rightarrow \Sigma ^*\) defined as \({{\,\textrm{rc}\,}}((w_1, \dots , w_\ell )):= ({{\,\textrm{comp}\,}}(w_\ell ), \dots , {{\,\textrm{comp}\,}}(w_1))\), for some arbitrary complement function \({{\,\textrm{comp}\,}}\). On sets, \({{\,\textrm{rc}\,}}\) is defined to compute the reverse complement of each element in the set. Note that \({{\,\textrm{rc}\,}}\) is selfinverse. A canonical kmer is a kmer that is lexicographically smaller than or equal to its reverse complement.
Given an integer k and an alphabet \(\Sigma\), the kspectrum of a set of strings \(I \subseteq \bigcup _{k' \ge k} \Sigma ^{k'}\) is a set of strings \({{\,\textrm{S}\,}}_k(I):= \{ w \in \Sigma ^k \mid \exists i \in I: w \text { is substring of } i \text { or }{{\,\textrm{rc}\,}}(i) \}\). The canonical kspectrum of I is \({{\,\textrm{CS}\,}}_k(I):= \{w \in {{\,\textrm{S}\,}}_k(I) \mid w \text { is canonical} \}\). For simplicity, the spectrum and canonical spectrum are defined for a single string w as if it were a set \(\{w\}\). A spectrum preserving string set of a set of strings I is a set of strings \(I'\) such that \({{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')\). The cumulative length of I is \(I:= \sum _{w \in I} w\).
Our definition of a bidirected graph is mostly standard like in e.g. [14], however we allow selfcomplemental nodes that occur in bidirected de Bruijn graphs. A bidirected graph is a tuple \(G = (V, E, c)\) with a set of normal and selfcomplemental nodes \(v \in V\), a set of arcs \(e \in E\), and a function \(c: V \rightarrow \{1, 0\}\) marking selfcomplemental nodes with 1, and normal nodes with 0. An incidence is a pair \(vd\), where \(d\in \{\oplus , \ominus , \odot \}\) is called its sign (e.g. \(v\oplus\)). The negation of a sign is defined as \(\lnot \oplus := \ominus\), \(\lnot \ominus := \oplus\) and \(\lnot \odot := \odot\). For selfcomplemental nodes \(v \in V\), only incidences \(v\odot\) are allowed, and for normal nodes only incidences \(v\oplus\) and \(v\ominus\) are allowed. An arc \((v_1d_1, v'_1d'_1, \eta ) \in E\) is a tuple of incidences and a unique identifier \(\eta\), where \(\eta\) can be of any type. The reversal of an arc is denoted by \((v_1d_1, v'_1d'_1, \eta )^{1}:= (v'_1d'_1, v_1d_1, \eta )\). If not required, we may drop the identifier (i.e. just write \((v_1\ominus , v'_1\odot ) \in E\)). We count the incidences present in an arc e using multiset notation like \(\#_e(vd)\), returning 0 if the arc does not contain the incidence \(vd\), returning 1 if it contains the incidence once and returning 2 if it is a selfloop with that incidence. If a node \(v \in V\) is present with a \(\oplus\) (\(\ominus\)) sign in an arc, then the arc is outgoing (incoming) from (to) v.
Note that, other than in standard directed graphs, in bidirected graphs arcs can be outgoing or incoming on both ends, and the order of the incidences in the arc does not affect if it is outgoing or incoming to a node. Further, our notation differs from that of standard bidirected graphs in that arcs have a direction. This is required because we will work with arccentric de Bruijn graphs (see Sect. "De Bruijn graphs"), which have labels on the arcs and not the nodes. Using the sign of the incidence pairs, it is possible to decide if a node is traversed forwards or backwards, but not if the arc is traversed forwards or backwards. But to decide which label (forwards or reverse complement) to use when computing the string spelled by an arc, the direction is relevant. See Fig. 1a for an example of a bigraph, which has labels that make it a de Bruijn graph as well.
A walk in a bigraph is a sequence of arcs \(W:= ((v_1d_1, v'_1d'_1, \eta _1),(v_2d_2, v'_2d'_2, \eta _2),\dots ,(v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))\) where for every i it holds that \((v_id_i, v'_id'_i, \eta _i) \in E\) or \((v'_id'_i, v_id_i, \eta _i) \in E\) (we can arbitrarily walk over arcs forwards and reverse), and for every \(i < \ell\) it holds that \(v'_{i} = v_{i+1}\) and \(d'_i = \lnot d_{i+1}\). The length of a walk is \(\ell = W\). If \(v_1 = v'_\ell\) and \(d_1 = \lnot d'_\ell\), then W is a cycle. A bigraph is connected, if for each pair of nodes \(v_1, v_2 \in V\) there is a walk from \(v_1\) to \(v_2\).
For a node \(v \in V\), the multiset of incidences is defined as \({{\,\textrm{I}\,}}(v):= \{vd\mid d\in \{\oplus , \ominus , \odot \}\}\), with multiplicities \(\#_{{{\,\textrm{I}\,}}(v)}(vd):= \sum _{e \in E} \#_e(vd)\). For a node \(v \in V\) that is not selfcomplemental, the outdegree is defined as \(\delta ^+(v):= \#_{{{\,\textrm{I}\,}}(v)}(v\oplus )\), and the indegree is defined as \(\delta ^(v):= \#_{{{\,\textrm{I}\,}}(v)}(v\ominus )\). For a selfcomplemental node \(v \in V\), the degree is defined as \(\delta (v):= \#_{{{\,\textrm{I}\,}}(v)}(v\odot )\).
We define the imbalance of a node \(v \in V\) that is not selfcomplemental as the difference of its outdegree and indegree \({{\,\textrm{imbalance}\,}}(v):= \delta ^+(v)  \delta ^(v)\). For a selfcomplemental node \(v \in V\) the imbalance is defined as \({{\,\textrm{imbalance}\,}}(v):= 1\) if \(\delta (v)\) is odd, and \({{\,\textrm{imbalance}\,}}(v):= 0\) otherwise. A node \(v \in V\) is called unbalanced, if \({{\,\textrm{imbalance}\,}}(v) \ne 0\), and balanced otherwise.
A labelled graph is a bidirected graph \(G = (V, E, c)\) where the identifiers of arcs are strings over some alphabet \(\Sigma\) (e.g. \((v_1\oplus , v_2\ominus , ACCTG) \in E\)).
Suffix arrays and suffix trees
Section "Lineartime construction of compacted bidirected de Bruijn graphs" requires knowledge of suffix arrays and suffix trees. We assume the reader is familiar with these data structures, and briefly give the relevant definitions and properties below. We point the reader to Gusfield [15] and Mäkinen [16] for an indepth treatment of the topics.
A suffix array \(SA_T\) for a string T is an array of length T such that \(SA_T[i]\) is the starting position of the lexicographically ith suffix of T. The suffix array interval of a string x is the maximal interval [i..j] such that all the suffixes pointed by \(SA_T[i], \ldots , SA_T[j]\) have x as a prefix, or the empty interval if x is not a substring of T.
A suffix tree of a string T is a compacted version of the trie of all suffixes of T, such that nonbranching paths are merged into single arcs, with arcs pointing away from the root. The compactification concatenates the labels of the arcs on the compacted path. The nodes that were compacted away and are now in the middle of an arc are called implicit nodes, and the rest of the nodes are explicit. A locus (plural loci) is a node that is either explicit or implicit. A locus v is represented by a pair (u, d), where u is the explicit suffix tree node at the end of the arc containing v (u is equal to v if v is explicit), and d is the depth of locus v in the trie of loci. The suffix array interval of a node is the interval of leaves in the subtree of the node. The suffix array interval of an implicit locus (u, d) is the same as the suffix array interval of u.
The suffix tree can be constructed in loglinear time in \(T \log \Sigma \) using e.g. Ukkonen’s algorithm [17] or in linear time in T using Farach’s algorithm [18]. The tree comes with a function child that takes an explicit node and a character, and returns the child at the end of the arc from that node whose label starts with the given character (if such node exists). This can be implemented in \(O(\log \Sigma )\) time by binary searching over child pointers sorted by labels. The child function can also be easily implemented for implicit loci. Ukkonen’s algorithm also produces suffix links for the explicit nodes, which map from the suffix tree node of a string cx to the suffix tree node of string x. It is possible to emulate suffix links on the implicit loci using constanttime weighted levelancestor queries [19] by mapping \((u,d) \mapsto (f_{d1}(SL(u)), d1)\), where SL(u) is the destination of a suffix link from u, and \(f_{d1}(SL(u))\) is the furthest suffix tree ancestor from SL(u) at depth at least \(d1\) in the trie of loci. The inverse pointers of suffix links are called Weiner links, and they can also be simulated on the implicit loci by mapping \((u, d) \mapsto (WL(u, c), d+1)\), where WL(u, c) is the destination of a Weiner link from u with character c.
De Bruijn graphs
The de Bruijn graph of order k of a set of input strings I is defined as a labelled graph constructed by Algorithm 1. See Fig. 1a for an example. The algorithm inserts an arc for each canonical kmer, and connects the arcs via nodes according to their \(k1\) overlaps. Depending on if these overlaps use the reverse complement or if the \(k1\)mer of a node is selfcomplemental, it adds directions to the incidences. A de Bruijn graph computed by this algorithm has the following property.
Lemma 2
Let k be a positive integer and let I be a set of strings of length at least k. Let \(G = (V, E, c)\) be the de Bruijn graph of order k constructed from I. For all pairs of arcs \(e_1:= (v_1d_1, v'_1d'_1, \eta _1), e_2:= (v_2d_2, v'_2d'_2, \eta _2) \in E\) it holds that:

(a)
(\(v'_1 = v_2\) and \(d'_1 = \lnot d_2\)) if and only if \({{\,\textrm{suf}\,}}_{k1}(\eta _1) = {{\,\textrm{pre}\,}}_{k1}(\eta _2)\),

(b)
(\(v'_1 = v'_2\) and \(d'_1 = \lnot d'_2\)) if and only if \({{\,\textrm{suf}\,}}_{k1}(\eta _1) = {{\,\textrm{pre}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _2))\),

(c)
(\(v_1 = v_2\) and \(d_1 = \lnot d_2\)) if and only if \({{\,\textrm{suf}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _1)) = {{\,\textrm{pre}\,}}_{k1}(\eta _2)\), and

(d)
(\(v_1 = v'_2\) and \(d_1 = \lnot d'_2\)) if and only if \({{\,\textrm{suf}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _1)) = {{\,\textrm{pre}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _2))\).
Proof
Observe that the values of w and \(w'\) computed in Lines 5 and 7 of Algorithm 1 are equal to \({{\,\textrm{pre}\,}}_{k1}(\eta _1)\) and \({{\,\textrm{suf}\,}}_{k1}(\eta _1)\) for \(e_1\) and equal to \({{\,\textrm{pre}\,}}_{k1}(\eta _2)\) and \({{\,\textrm{suf}\,}}_{k1}(\eta _2)\) for \(e_2\). Further, observe that the values of v and \(v'\) computed in Lines 6 to 8 are equal to \(v_1\) and \(v'_1\) for \(e_1\) and equal to \(v_2\) and \(v'_2\) for \(e_2\). This makes \(v_1\), \(v'_1\), \(v_2\) and \(v'_2\) the canonicals of \({{\,\textrm{pre}\,}}_{k1}(\eta _1)\), \({{\,\textrm{suf}\,}}_{k1}(\eta _1)\), \({{\,\textrm{pre}\,}}_{k1}(\eta _2)\) and \({{\,\textrm{suf}\,}}_{k1}(\eta _2)\). Finally, observe that the sign values \(d\) and \(d'\) computed in Lines 9 to 14 are equal to \(d_1\) and \(d'_1\) for \(e_1\) and equal to \(d_2\) and \(d'_2\) for \(e_2\).

(a)
If \(v'_1 = v_2\) and \(d'_1 = \lnot d_2\), then \(w'_1 = w_2\) for all possible values of \(d'_1\), and therefore \({{\,\textrm{suf}\,}}_{k1}(\eta _1) = {{\,\textrm{pre}\,}}_{k1}(\eta _2)\). If \({{\,\textrm{suf}\,}}_{k1}(\eta _1) = {{\,\textrm{pre}\,}}_{k1}(\eta _2)\), then \(w'_1 = w_2\), and therefore \(v'_1 = v_2\) because \(v'_1\) and \(v_2\) are the canonicals of \(w'_1\) and \(w_2\). Additionally, \(d'_1 = \lnot d_2\) for all possible values of \(d'_1\).

(b)
If \(v'_1 = v'_2\) and \(d'_1 = \lnot d'_2\), then \(w'_1 = {{\,\textrm{rc}\,}}(w'_2)\) for all possible values of \(d'_1\), and therefore \({{\,\textrm{suf}\,}}_{k1}(\eta _1) = {{\,\textrm{rc}\,}}({{\,\textrm{suf}\,}}_{k1}(\eta _2)) = {{\,\textrm{pre}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _2))\). If \({{\,\textrm{suf}\,}}_{k1}(\eta _1) = {{\,\textrm{pre}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _2))\), then \(w'_1 = {{\,\textrm{rc}\,}}(w'_2)\), and therefore \(v'_1 = v'_2\) because \(v'_1\) and \(v'_2\) are the canonicals of \(w'_1\) and \(w'_2\). Additionally, \(d'_1 = \lnot d'_2\) for all possible values of \(d'_1\).

(c)
If \(v_1 = v_2\) and \(d_1 = \lnot d_2\), then \({{\,\textrm{rc}\,}}(w_1) = w_2\) for all possible values of \(d_1\), and therefore \({{\,\textrm{suf}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _1)) = {{\,\textrm{rc}\,}}({{\,\textrm{pre}\,}}_{k1}(\eta _1)) = {{\,\textrm{pre}\,}}_{k1}(\eta _2)\). If \({{\,\textrm{suf}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _1)) = {{\,\textrm{pre}\,}}_{k1}(\eta _2)\), then \(w_1 = {{\,\textrm{rc}\,}}(w_2)\), and therefore \(v_1 = v_2\) because \(v_1\) and \(v_2\) are the canonicals of \(w_1\) and \(w_2\). Additionally, \(d_1 = \lnot d_2\) for all possible values of \(d_1\).

(d)
This case is equivalent to the first case when swapping \(e_1\) and \(e_2\), because \({{\,\textrm{suf}\,}}_{k1}(\eta _1) = {{\,\textrm{pre}\,}}_{k1}(\eta _2) \iff {{\,\textrm{suf}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _2)) = {{\,\textrm{pre}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _1))\).
\(\square\)
For a walk \(W:= (e_1 = (v_1d_1, v'_1d'_1, \eta _1), \dots , e_\ell = (v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))\) in a de Bruijn graph, its sequence of kmers is \(K:= (\kappa _1, \dots , \kappa _\ell )\), where for each i we define \(\kappa _i\) as \(\eta _i\) if \(e_i \in E\), and as \({{\,\textrm{rc}\,}}(\eta _i)\) if \(e_i^{1} \in E\). The string \({{\,\textrm{spell}\,}}(W)\) is the string spelled by W, which is defined as its collapsed sequence of kmers, i.e. its sequence of kmers gets concatenated while overlapping consecutive kmers by \(k1\). This is computed by Algorithm 2. It spells out the first kmer (or its reverse complement) completely, and then adds the last characters of all subsequent kmers (or their reverse complements). We prove the following lemmas to show that our definition of a de Bruijn graph together with the \({{\,\textrm{spell}\,}}(\cdot )\) function is sound for our purposes, i.e. walks in the de Bruijn graph can spell exactly the strings containing subsets of the kmers used to create the de Bruijn graph. Due to this property, we can use our de Bruijn graph and \({{\,\textrm{spell}\,}}\) to in the Eulertig algorithm, such that finding a minimum SPSS is equivalent to finding a minimum walk cover of the de Bruijn graph.
Lemma 3
Let k be a positive integer and let I be a set of strings of length at least k. Let \(G = (V, E, c)\) be the de Bruijn graph of order k constructed from I. Let \(W:= (e_1 = (v_1d_1, v'_1d'_1, \eta _1), \dots , e_\ell = (v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))\) be a walk in G, and \(K:= (\kappa _1, \dots , \kappa _\ell )\) its sequence of kmers. Then for each consecutive pair of kmers \(\kappa _i, \kappa _{i+1}\) it holds that \({{\,\textrm{suf}\,}}_{k1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k1}(\kappa _{i+1})\).
Proof
Let \(i \in \{1, \dots , \ell  1\}\). By the definition of walk it holds that \(v'_i = v_{i+1}\) and \(d'_i = \lnot d_{i+1}\). We can apply Lemma 2 case by case.

(a)
If \(e_i, e_{i+1} \in E\), then by Lemma 2 (a), it holds that \({{\,\textrm{suf}\,}}_{k1}(\eta _i)\) equals \({{\,\textrm{pre}\,}}_{k1}(\eta _{i+1})\). By definition, \(\kappa _i = \eta _i\) and \(\kappa _{i+1} = \eta _{i+1}\), so \({{\,\textrm{suf}\,}}_{k1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k1}(\kappa _{i+1})\).

(b)
If \(e_i, e^{1}_{i+1} \in E\), then by Lemma 2 (b) applied to \(e_i, e^{1}_{i+1}\), it holds that \({{\,\textrm{suf}\,}}_{k1}(\eta _i)\) equals \({{\,\textrm{pre}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _{i+1}))\). By definition, \(\kappa _i = \eta _i\) and \(\kappa _{i+1} = {{\,\textrm{rc}\,}}(\eta _{i+1})\), so \({{\,\textrm{suf}\,}}_{k1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k1}(\kappa _{i+1})\)

(c)
If \(e^{1}_i, e_{i+1} \in E\), then by Lemma 2 (c) applied to \(e^{1}_i, e_{i+1}\), it holds that \({{\,\textrm{suf}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _i))\) equals \({{\,\textrm{pre}\,}}_{k1}(\eta _{i+1})\). By definition, \(\kappa _i = {{\,\textrm{rc}\,}}(\eta _i)\) and \(\kappa _{i+1} = \eta _{i+1}\), so \({{\,\textrm{suf}\,}}_{k1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k1}(\kappa _{i+1})\).

(d)
If \(e^{1}_i, e^{1}_{i+1} \in E\), then by Lemma 2 (d) applied to \(e^{1}_i, e^{1}_{i+1}\), it holds that \({{\,\textrm{suf}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _i))\) equals \({{\,\textrm{pre}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _{i+1}))\). By definition, \(\kappa _i = {{\,\textrm{rc}\,}}(\eta _i)\) and \(\kappa _{i+1} = {{\,\textrm{rc}\,}}(\eta _{i+1})\), so \({{\,\textrm{suf}\,}}_{k1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k1}(\kappa _{i+1})\).
\(\square\)
We define the sequence of kmers \(K = (\kappa _1, \dots , \kappa _\ell )\) of a string \(w = (a_1, \dots , a_{\ell +k1})\) by \(\kappa _i:= (a_i, \dots , a_{i+k1})\) for each i.
Lemma 4
Let k be a positive integer and let I be a set of strings of length at least k. Let \(G = (V, E, c)\) be the de Bruijn graph of order k constructed from I. Let W be a walk in G, \(K_W\) its sequence of kmers and \(K'_W\) the sequence of kmers of \({{\,\textrm{spell}\,}}(W)\). Then \(K_W = K'_W\).
Proof
Let \((\kappa _1, \dots , \kappa _\ell ):= K_W\). We use induction over the length of W. For an empty W, K is empty, \({{\,\textrm{spell}\,}}(W)\) is empty, and therefore \(K'\) is empty as well. For \(W = 1\), Algorithm 2 outputs \({{\,\textrm{spell}\,}}(W) = \kappa _1\) and it holds that \(K'_W = (\kappa _1) = K_W\).
For \(W \ge 2\) we consider that \(K_X = K'_X\) holds for a prefix X of W with \(X = W  1\). When \(i = W\) at the beginning of the loop in Line 8, then \(s = {{\,\textrm{spell}\,}}(X)\). By Lemma 3 it holds that the last \(k  1\) characters of s are equal to the first \(k  1\) characters of \(\kappa _\ell\). Therefore, by appending the last character from \(\kappa _\ell\) to s, \(\kappa _\ell\) is appended to \(K'_X\) forming \(K'_W\). Therefore, last kmer of \(K'_W\) equals the last kmer of \(K_W\), and the first \(\ell  1\) kmers of \(K'_W\) equal those of \(K_W\) by induction. \(\square\)
Lemma 5
Let k be a positive integer and let I be a set of strings of length at least k. Let \(G = (V, E, c)\) be the de Bruijn graph of order k constructed from I. Let w be a string with \({{\,\textrm{CS}\,}}_k(w) \subseteq {{\,\textrm{CS}\,}}_k(I)\). Then there exists a walk W in G with \({{\,\textrm{spell}\,}}(W) = w\).
Proof
Let \(K_w = (\kappa _1, \dots , \kappa _\ell )\) be the sequence of kmers of w. We construct \(W = (e_1 = (v_1d_1, v'_1d'_1, \eta _1), \dots , e_\ell = (v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))\) as follows: for each i, let \(\eta _i\) be the canonical of \(\kappa _i\) and \(f_i \in E\) be the arc whose identifier is \(\eta _i\). We set \(e_i = f_i\) if \(\kappa _i\) is canonical, and \(e_i = f^{1}_i\) otherwise.
For W to fulfil the definition of a walk we need that \(v'_i = v_{i+1}\) and \(d'_i = \lnot d'_{i+1}\) for all i. Using Lemma 2, we get:

If \(e_i, e_{i+1} \in E\), then \({{\,\textrm{suf}\,}}_{k1}(\eta _i) = {{\,\textrm{suf}\,}}_{k1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k1}(\kappa _{i+1}) = {{\,\textrm{pre}\,}}_{k1}(\eta _{i+1})\). Therefore, by Lemma 2 (a), it holds that \(v'_i = v_{i+1}\) and \(d'_i = \lnot d'_{i+1}\).

If \(e_i, e^{1}_{i+1} \in E\), then \({{\,\textrm{suf}\,}}_{k1}(\eta _i) = {{\,\textrm{suf}\,}}_{k1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k1}(\kappa _{i+1}) = {{\,\textrm{pre}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _{i+1}))\). Therefore, by Lemma 2 (b), it holds that \(v'_i = v_{i+1}\) and \(d'_i = \lnot d'_{i+1}\).

If \(e^{1}_i, e_{i+1} \in E\), then \({{\,\textrm{suf}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _i)) = {{\,\textrm{suf}\,}}_{k1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k1}(\kappa _{i+1}) = {{\,\textrm{pre}\,}}_{k1}(\eta _{i+1})\). Therefore, by Lemma 2 (c), it holds that \(v'_i = v_{i+1}\) and \(d'_i = \lnot d'_{i+1}\).

If \(e^{1}_i, e^{1}_{i+1} \in E\), then \({{\,\textrm{suf}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _i)) = {{\,\textrm{suf}\,}}_{k1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k1}(\kappa _{i+1}) = {{\,\textrm{pre}\,}}_{k1}({{\,\textrm{rc}\,}}(\eta _{i+1}))\). Therefore, by Lemma 2 (d), it holds that \(v'_i = v_{i+1}\) and \(d'_i = \lnot d'_{i+1}\).
To complete the proof we need to show that \({{\,\textrm{spell}\,}}(W) = w\). By definition, the sequence of kmers \(K_W\) of W is equivalent to \(K_w\). And since W is a walk, by Lemma 4 we get that the sequence of kmers of \({{\,\textrm{spell}\,}}(W)\) is equivalent to \(K_W\), and therefore \({{\,\textrm{spell}\,}}(W) = w\). \(\square\)
A walk cover \({\mathcal {W}}\) of a bigraph G is a set of walks such that for each arc \(e \in E\) it holds that e is part of some walk \(W \in {\mathcal {W}}\), or \(e^{1}\) is part of some walk \(W \in {\mathcal {W}}\).
Theorem 6
Let k be a positive integer and let I and \(I'\) be sets of strings of length at least k. Let \(G = (V, E, c)\) be the de Bruijn graph of order k constructed from I. Then it holds that \({{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')\), if and only if there is a walk cover \({\mathcal {W}}\) in G that spells the strings in \(I'\).
Proof
If \({{\,\textrm{CS}\,}}_k(I') \subseteq {{\,\textrm{CS}\,}}_k(I)\), then for each string \(w' \in I'\) it holds that \({{\,\textrm{CS}\,}}_k(w') \subseteq {{\,\textrm{CS}\,}}_k(I)\). Therefore, by Lemma 5, there exists a walk w in G with \({{\,\textrm{spell}\,}}(w) = w'\). Then, the set of all such walks \({\mathcal {W}}\) spells \(I'\). Further, because \({{\,\textrm{CS}\,}}_k(I) \subseteq {{\,\textrm{CS}\,}}_k(I')\), the identifier \(\eta\) of each arc \(e \in E\) is in \({{\,\textrm{CS}\,}}_k(I')\), and therefore in the sequence of kmers \(K_{w'}\) of some string \(w' \in I'\) (possibly as a reverse complement). By Lemma 4 it holds that \(K_{w'} = K_w\), where \(K_w\) is the sequence of kmers of walk w. By the definition of the sequence of kmers of a walk, this implies that w visits e (possible in reverse direction). Since this holds for each \(e \in E\), it holds that \({\mathcal {W}}\) is a walk cover of G.
Assume that there is a walk cover \({\mathcal {W}}\) in G that spells the strings in \(I'\), and let \(w \in {\mathcal {W}}\) be a walk, \(K_w\) its sequence of kmers, \(w':= {{\,\textrm{spell}\,}}(w)\) and \(K_{w'}\) the sequence of kmers of \(w'\). Then, by Lemma 4, \(K_w = K_{w'}\), which, by the definition of the sequence of kmers of a walk implies that \({{\,\textrm{CS}\,}}_k(I) \subseteq {{\,\textrm{CS}\,}}_k(I')\). And since \({\mathcal {W}}\) is a walk cover of G, we get \({{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')\). \(\square\)
Corollary 7
By setting \(I = I'\) in Theorem 6 we see that our de Bruijn graph contains the strings it was constructed from. Further, by Theorem 6 it holds that walks in the de Bruijn graph spell exactly the strings that can be formed from the kmers that were used to create the graph. Hence, our definition of a de Bruijn graph is sound for all k.
A compacted de Bruijn graph is constructed from a de Bruijn graph by contracting all nodes \(v \in V\) that are either selfcomplemental and have exactly two arcs that have exactly one incidence to v each, or that are not selfcomplemental and have exactly one incoming and one outgoing arc. For simplicity, we use uncompacted de Bruijn graphs in our theoretical sections, however all results equally apply to compacted de Bruijn graphs.
Lineartime construction of compacted bidirected de Bruijn graphs
In this section, we fill a gap in the literature by describing on a high level an algorithm to construct the bidirectional de Bruijn graph of a set of input strings in time linear in the total length of the input strings, independent of the value of k.
Algorithm
Let \(I = \{w_1, \ldots w_m\}\) be the set of input strings. Consider the following concatenation:
where \(\$\) is a special character outside of the alphabet \(\Sigma\) of the input strings. We require an index on T that can answer the following queries: extendRight, extendLeft, contractRight and contractLeft in constant time. The extension operations take as input a character \(c \in \Sigma\) and the interval of a string x in the suffix array of T, and return the suffix array intervals of xc in the case of extendRight and cx in the case of extendLeft. The contraction operations are the inverse operations of these, mapping the suffix array intervals of xc to x in the case of contractRight and cx to x in the case of contractLeft. For efficiency, we also require operations enumerateRight and enumerateLeft, which take a string x and give all characters such that extendRight and extendLeft respectively return a nonempty interval, in time that is linear in the number of such characters. Implementations for all the six subroutines are given in Sect. "Implementation of the subroutines".
Using these operations, we can simulate the regular nonbidirected de Bruijn graph of T. Each kmer of the input strings for a fixed k corresponds to a disjoint interval in the suffix array of T. The nodes are represented by their suffix array intervals. The outgoing arcs from a \((k1)\)mer x are those characters c where extendRight(x, c) returns a nonempty interval. We can enumerate all the characters c with this property in constant time using enumerateRight(x). The incoming arcs can be enumerated symmetrically with the enumerateLeft(x). Finally, we can find the destination or origin of an arc labelled with x by running a contractLeft or contractRight operation respectively on x.
To construct the bidirected de Bruijn graph, we merge together nodes that are the reverse complement of each other. To find which nodes are complemental, we scan the input strings I while maintaining the suffix array interval of the current kmer using extendRight and contractLeft operations, while at the same time maintaining the suffix array interval of the reverse complement using extendLeft and contractRight operations. Whenever we merge two nodes, we combine the incoming and outgoing arcs, assigning the incidences of the arcs according to the incidence rules in our definition. We are able to tell in constant time which kmer of a pair of complemental kmers is canonical by comparing the suffix array intervals of the kmers: the kmer whose suffix array interval has a smaller starting point is the canonical kmer. If the starting points are the same, the kmer is selfcomplemental.
Using the enumerateRight and enumerateLeft functions, we can check if a node would be contracted in a compacted de Bruijn graph. By extending kmers over such nodes, we can in linear time also output only the arcs and nodes of a compacted de Bruijn graph. For storing the labels, we use one pointer into the input strings to store a single kmer, as well as a flag that is set whenever the label is not canonical. If a label has multiple kmers, then we store the remaining kmers as explicit strings, however without their overlap with the “pointerkmer”. This way, we can store each label in \(O(\ell )\) space, where \(\ell\) is the number of kmers in the label. We additionally store the first and last character of each label, as an easy way to make the spell function run in output sensitive linear time.
Implementation of the subroutines
All required the subroutines extendRight, extendLeft, contractRight, contractLeft, enumerateRight and enumerateLeft can be implemented with the suffix tree of T by simulating the trie of the suffix tree loci as described in Sect. "Suffix arrays and suffix trees". The suffix array intervals of explicit nodes can be stored with the nodes, so that we can operate on loci (u, d) and retrieve the suffix array intervals on demand. The operation extendRight follows an arc from a locus to a child, and the operation contractRight is implemented by going to the parent of the current locus. The operation contractLeft follows a suffix link from the current locus, and extendLeft follows a Weiner link. The operations enumerateRight and enumerateLeft are implemented by storing the children and the Weiner links from explicit suffix tree nodes as neighbor lists. The total number of these links is linear in T [16]. With this implementation, the slowest operations are extendRight and extendLeft, taking \(O(\log \Sigma )\) time to binary search the neighbor lists. We therefore obtain the following result:
Theorem 8
The compacted arccentric bidirected de Bruijn graph of order k of a set of input strings I from the alphabet \(\Sigma\) can be constructed in time \(O(I \log \Sigma )\).
We note that the same operations can also be implemented on top of the bidirectional BWT index of Belazzougui and Cunial [20], using the data structures of Belazzougui et al. [21] for the enumeration operations. This gives an index that supports all the required subroutines in constant time. The drawback of the bidirectional BWT index is that only randomized construction algorithms are known, but the expected time is still linear in T. We leave as an open problem the construction of the compacted arccentric bidirected de Bruijn graph in deterministic linear time independent of the alphabet size.
Pseudocode
The pseudocode for computing a compacted de Bruijn graph in linear time is given by Algorithm 4 which uses Algorithm 3 as a subroutine. The data structure D used by the algorithms is that described in Sect. "Lineartime construction of compacted bidirected de Bruijn graphs". Note that if we compute the arc labels as plain strings as in Algorithm 1, we need up to \(O(k\log \Sigma )\) bits to store a singlekmer arc. And since arcs are not substrings of input strings (but potentially combinations of input strings), we would need up to \(O(kI\log \Sigma )\) bits to store all arc labels without referring to the input strings. This contradicts the algorithm being linear in \(I\log \Sigma \). However, we can store the labels as tuples \((p, \eta , q, r)\), where \(p\eta q\) is the label where p and q are explicit strings while \(\eta\) is a pointer to a kmer in the input. If r is true, then the label must be reverse complemented to match that defined by Algorithm 1. With this fix, the size of a label that represents \(\ell\) kmers is
\(O(\ell \log \Sigma )\), and in total the de Bruijn graph represents O(I) kmers.
The comparison on Line 16 of Algorithm 4 can be done in linear time in \(\eta _1 + \eta _2\) by finding the suffix array intervals of \(\eta _1\eta \eta _2\) and \({{\,\textrm{rc}\,}}(\eta _1\eta \eta _2)\) with extendLeft and extendRight from \(\eta\) and \({{\,\textrm{rc}\,}}(\eta )\) respectively, and comparing the starts of the intervals. This way, the total time taken by all those comparisons is proportional to the sum of \(\eta _1 + \eta _2\) over all unitigs, which is linear in I because each character of \(\eta _1\) and \(\eta _2\) can be mapped to a distinct edge in the noncompacted de Bruijn graph of I. Therefore, the algorithm can be implemented to run in \(O(I\log \Sigma )\) time.
Our pseudocode does not compute the first and last character of each arclabel, but this can be easily computed in constant time using \(w_i\), \(\eta _1\) and \(\eta _2\) in Algorithm 4.
Lineartime minimum SPSS without repetitions
Let I be a set of strings. To compute an SPSS without repetitions we first build a compacted de Bruijn graph G from I. By Theorem 6, finding an SPSS is equivalent to finding a walk cover in G. Further, with Lemma 4, we get that an SPSS without repetitions is equivalent to a walk cover that visits each arc exactly once (either once forwards, or once reverse, but not both forwards and reverse). We call such a walk cover a unique walk cover.
For minimality, observe that the cumulative length of an SPSS S relates to its equivalent set of walks \({\mathcal {W}}\) as follows:
This is because in Algorithm 2, in Line 7, \(k  1\) characters are appended to the result, and then in the loop in Line 8, one additional character per arc in W is appended. We cannot alter the sum \(\sum _{W \in {\mathcal {W}}} W\), since we need to cover all arcs in G. However we can alter the number of strings, and decreasing or increasing this number by one will decrease or increase the cumulative length of S by \(k  1\). Therefore, finding a minimum SPSS of I without repetitions equals finding a unique walk cover of G that has a minimum number of walks.
Note that computing a minimum SPSS in a bigraph that is not connected is equivalent to separately computing an SPSS in each maximal connected subgraph. Therefore we restrict to connected bigraphs from here on.
A lower bound for an SPSS without repetitions
Using the imbalance of the nodes of a bigraph, we can derive a lower bound for the number of walks in a walk cover.
Lemma 9
Let \(v \in V\) be an unbalanced node in a bigraph \(G = (V, E, c)\). Then in a unique walk cover \({\mathcal {W}}\) of G, either at least \({{\,\textrm{imbalance}\,}}(v)\) walks start in v, or at least \({{\,\textrm{imbalance}\,}}(v)\) walks end in v.
Proof
If v is selfcomplemental, then its imbalance is 1, so by definition v has an odd number of incident arcs. Each walk that does not start or end in v needs to enter and leave v via two distinct arcs whenever it visits v. But since the number of incident arcs is odd, there is at least one arc that cannot be covered this way, implying that a walk needs to start or end in this arc.
If v is not selfcomplemental and has a positive imbalance, then it has \({{\,\textrm{imbalance}\,}}(v)\) more outgoing arcs then incoming arcs. Since walks need to leave v with the opposite sign than they entered v, at least \({{\,\textrm{imbalance}\,}}(v)\) arcs cannot be covered by walks that do not start or end in v. If v has negative imbalance, the situation is symmetric. \(\square\)
Definition 10
The imbalance \({{\,\textrm{imbalance}\,}}(G)\) of a bigraph \(G = (V, E, c)\) is the sum of the absolute imbalance of all nodes \(\sum _{v \in V} {{\,\textrm{imbalance}\,}}(v)\).
Theorem 11
Let G be a bigraph. A walk cover \({\mathcal {W}}\) of G has a minimum string count of \({{\,\textrm{imbalance}\,}}(G) / 2\).
Proof
Let \(v \in V\) be an unbalanced node. Then, by Lemma 9 at least \({{\,\textrm{imbalance}\,}}(v)\) walks start in v or at least \({{\,\textrm{imbalance}\,}}(v)\) walks end in v. Since each walk has exactly one start node and one end node, \({\mathcal {W}}\) has a minimum string count of \({{\,\textrm{imbalance}\,}}(G) / 2\). \(\square\)
Eulerising a bigraph
A directed graph is called Eulerian, if all nodes have indegree equal to outdegree, i.e. are balanced [22]. If the graph is strongly connected,^{Footnote 2} then this is equivalent to the graph admitting a Eulerian cycle, i.e. a cycle that visits each arc exactly once. The same notion can be used with bidirected graphs, using our definition of imbalance.
Definition 12
A bigraph is Eulerian, if all nodes have imbalance zero.
A connected bigraph can be transformed into a Eulerian bigraph by adding arcs using Algorithm 5. See Fig. 1b for an example. The algorithm lists all nodes that are out of balance, and inserts arbitrary arcs to balance them.
Lemma 13
The imbalance of a bigraph is even.
Proof
Adding or removing an arc changes the imbalance of two nodes by 1, or of one node by two. In both cases, the imbalance of the graph can only change by \(2\), 0, or 2. Since the imbalance of a graph without arcs is 0, this implies that there can be no graph with odd imbalance. \(\square\)
Lemma 14
Given a connected bigraph \(G = (V, E, c)\), Algorithm 5 outputs a Eulerian bigraph \(G' = (V, E', c)\).
Proof
Algorithm 5 is welldefined, since by Lemma 13, it holds that L has even length in each iteration of the loop in Line 10, so the removal operation in Line 12 always has something to remove.
The output of Algorithm 5 is a valid bigraph, since for selfcomplemental nodes \(v \in V\), only incidences \(v\odot\) are added to \(G'\), and for not selfcomplemental nodes \(v \in V\), only incidences \(v\oplus\) and \(v\ominus\) are added to \(G'\).
Further, the output is a Eulerian bigraph, because for all \(v \in V\), it holds that \({{\,\textrm{imbalance}\,}}(v)\) is 0, by the following argument:

If \(c(v) = 1\) and v has imbalance zero in G, then its imbalance stays the same in \(G'\). If it has imbalance 1, then one incident arc is inserted, making its degree even and its imbalance therefore zero.

If \(c(v) = 0\) and v has positive imbalance i in G, then i incoming arcs are added to v (counting incoming selfloops twice), and no outgoing arcs are added. Therefore, it has imbalance zero in \(G'\). By symmetry, if v has negative imbalance in G, it has imbalance zero in \(G'\).
\(\square\)
Lemma 15
Given a bigraph \(G = (V, E, c)\), Algorithm 5 terminates after \(O(V + E)\) steps.
Proof
For the list data structure we choose a doubly linked list, and for the graph an adjacency list (and array with an entry for each node containing a doubly linked list for the arcs).
The loop in Line 3 runs V times and each iteration runs in \(O({{\,\textrm{imbalance}\,}}(v))\) for a node v, because a doubly linked list supports appending in constant time. The sum of absolute imbalances of all nodes cannot exceed 2E, because each arc adds at most 1 to the absolute imbalance of at most two nodes, or adds at most 2 to the absolute imbalance of at most one node. Therefore, the length of list L after completing the loop is at most 2E, and the loop runs in \(O(V + E)\) time.
The loop in Line 10 runs at most \(L \le 2E\) times and performs only constanttime operations, since L is a doubly linked list and we can insert arcs into an adjacency list in constant time. Therefore, this loop also runs in \(O(V + E)\) time. \(\square\)
With Lemma 14 and 15 we get the following.
Theorem 16
Algorithm 5 is correct and runs in \(O(V + E)\) time.
Computing a Eulerian cycle in a bigraph
After Eulerising the bigraph, we can compute a Eulerian cycle using Algorithm 6. We do this similarly to Hierholzer’s classic algorithm for Eulerian cycles [22]. First we find an arbitrary cycle. Then, as long as there are unused arcs left, we search along the current cycle for unused arcs, and find additional cycles through such unused arcs. We integrate each of those additional cycles into the main cycle. See Fig. 1c for an example of a Eulerian cycle.
Lemma 17
Given a connected Eulerian bigraph \(G = (V, E, c)\), Algorithm 6 terminates and outputs a Eulerian cycle W.
Proof
For \(W = (e_1 = (v_1d_1, v'_1d'_1, \eta _1), \dots , e_\ell = (v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))\) to be a Eulerian cycle, it must be a cycle that contains each arc exactly once.
The sequence \(W'\) constructed by the loop in Line 10 is a walk by construction, and since G is Eulerian it is a cycle after the loop terminates. After finding the initial cycle in the first iteration of the outer loop, each additional cycle is started from a node on the initial cycle, and is a cycle again. Therefore it can be inserted into the original cycle without breaking its cycle property.
Since each arc is deleted when being added to \(W'\), there is no duplicate arc in W. And if the algorithm terminates, then \(E = 0\) (Line 1), so W contains all arcs.
For termination, consider that if W is not complete after the first iteration of the outer loop, then the loop in Line 7 searches for an unused arc using the \(first\_unfinished\) pointer. Since the prefix of W up to including \(first\_unfinished\) is never modified (Line 19), and \(first\_unfinished\) is only advanced when its pointee cannot reach any arc anymore, it holds that no arc in W can reach an arc in E when \(first\_unfinished\) gets advanced over the end of W. Since G was initially Eulerian and only Eulerian cycles have been removed from G, this implies that all nodes visited by W are still balanced and therefore have no incident arcs anymore. And since G was originally connected, W has visited all nodes, i.e. \(E = 0\). Therefore, \(first\_unfinished\) cannot be advanced over the end of W, because the outer loop terminates before that.
To complete the proof of termination, consider that in each iteration of the outer loop, at least one arc gets removed from E. In the first iteration, this happens at least in Line 3, and in all following iterations, this happens in Line 11. \(\square\)
Lemma 18
Given a connected Eulerian bigraph \(G = (V, E, c)\), Algorithm 6 terminates after \(O(V + E)\) steps.
Proof
We use a doubly linked list for W and \(W'\), and an adjacency list for G. Then all lines can be executed in constant time.
The loop in Line 10 removes one arc from E each iteration, so it runs at most E times in total (over all iterations of the outer loop). The loop in Line 7 advances \(first\_unfinished\) each iteration. Since the algorithm is correct by Lemma 17, \(W \le E\) and \(first\_unfinished\) never runs over the end of \(first\_unfinished\), so the loop runs at most E times in total (over all iterations of the outer loop).
The condition for the loop in Line 10 is true at least once in each iteration of the outer loop, since the preceding branch sets up \((vd, v'd', \eta )\) such that it has a successor (in the first iteration because of Eulerianess). So in each iteration of the outer loop, at least one arc gets removed, so the outer loop runs at most E times in total.
As a result, all loops individually run at most E times, therefore Algorithm 6 terminates after O(E) steps. Because G is connected, this is equivalent to \(O(V + E)\) steps. \(\square\)
With Lemma 17 and 18 we get the following.
Theorem 19
Algorithm 6 is correct and runs in \(O(V + E)\) time.
Computing a minimum SPSS without repetitions
We convert the Eulerian cycle into a walk cover of the original bigraph by breaking it at all arcs inserted by Algorithm 5, and removing those arcs (see Fig. 1d for an example). This results in a walk cover with either one walk, if Algorithm 5 inserted zero or one arcs, or \({{\,\textrm{imbalance}\,}}(G)/2\) arcs, if Algorithm 5 inserted more arcs. By Theorem 11, this is a minimum number of walks, and therefore the SPSS spelled by these walks is minimum as well. Constructing the de Bruijn graph takes \(O(I\log \Sigma )\) time, and it has O(I) kmers, so it holds that \(V \in O(I)\) and \(E \in O(I)\). Further, spelling the walk cover takes time linear to the cumulative length of the spelled strings. Since we compute a minimum representation, it holds that the output is not larger than the total length of the input strings. Therefore we get:
Theorem 1 Let k be a positive integer and let I be a set of strings of length at least k over some alphabet \(\Sigma\). Then we can compute a set of strings \(I'\) of length at least k with minimum cumulative length and \({{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')\) in \(O(I\log \Sigma )\) time.
Previous heuristics were not optimal
The heuristics implemented by UST [3] and Prophasm [2] are not optimal, as shown experimentally below. Here, we also give a simple counterexample to argue that the previous heuristics were not optimal. Even though the previous algorithms were described in nodecentric de Bruijn graphs, we describe them here in the arccentric variants to stick with the terminology of this paper.
UST works by starting from an arbitrary arc and extending forwards to unused arcs as long as possible. If there is no unused arc, but the last chosen arc has a successor that is the start of another walk, then the walks are joined. On the other hand, ProphAsm works by choosing an arbitrary arc and extending both forwards and backwards to unused arcs as long as possible. Both algorithms may fail to produce an optimal solution in the example given in Fig. 2. They may both first choose AGGTG and then continue to GTGGGAT, producing a string AGGTGGGAT. When they then process GTGCCGTG, they cannot join it with the previous string, hence they produce two strings of a cumulative length of 17. The optimal solution in Fig. 2 has one string with a cumulative length of 14.
Experiments
We ran our experiments on a server running Linux with two 64core AMD EPYC 7H12 processors with 2 logical cores per physical core, 1.96TiB RAM and an SSD. Our data sets are the same as in [1], and we also adapted their metrics cumulative length (CL), which is the total count of characters in all strings, and string count (SC), which is the number of strings. Our implementation does not use the formalisation of bidirected graphs introduced in this work, but instead uses the formalisation from [1]. For constructing de Bruijn graphs, we do not implement our purely theoretical linear time algorithm, since practical de Bruijn graph construction is a wellresearched field [4, 6, 11, 23,24,25], and we want to focus more on computing the compressed representation from unitigs. UST only supports unitigs constructed by BCALM2 [11], since it needs certain additional data. BCALM2 is not a linear time algorithm, but is efficient in practice. Therefore, we use BCALM2 to construct a nodecentric de Bruijn graph, and then convert it to an arccentric variant using a unionfind data structure. For the human pangenome, which hits some builtin limit of BCALM2, we use Cuttlefish 2 [6] instead. This prevents us from running UST, but instead we run ProphAsm on the unitigs computed by Cuttlefish 2.
Our experimental pipeline is constructed with [26] and using the bioconda software repository [27]. We ran all multithreaded tools with up to 28 threads and never used more than 128 cores of our machine at once to prevent hyperthreading from affecting our timing. The code to reproduce our experiments is licensed under the Creative Commons Attribution 4.0 International license and available on zenodo [28]. We additionally provide our implementation of the Eulertigs algorithm on zenodo [29] as well as github [30], conda [31] and crates.io [32].
The performance figures in Tables 1 and 2 are all very similar, with two exceptions. Prophasm does not support parallel computation at the moment, therefore its runtime is much higher. Compared to that, all other algorithms use parallel computation to compute unitigs, but computing the final tigs from unitigs seems to be negligible compared to computing the de Bruijn topology. Moreover, running UST or Eulertigs on read data sets of larger genomes consumes significantly more memory than computing just unitigs. This is likely because BCALM2 uses external memory to compute unitigs, while the other tools simply load the whole set of unitigs into memory.
In terms of CL, we see that the SPSS computed with UST mostly remains within the expected \(3\%\) of the lower bound, but it is up to \(5\%\) above the lower bound on more compressible data sets. The SPSS computed by ProphAsm is very close to the optimum in all cases, and we assume that this difference in quality is because ProphAsm extends paths both forwards and backwards, while the UST heuristic merely extends them forwards.
Looking at SC, we see that Eulertigs are always the lowest, which is due to the string count directly being connected to the cumulative length by Eq. 1. This also explains the correlation between CL ratio and SC ratio, which can be observed in all cases.
We conduct our experiments also with even k in Tables 3 and 4 to prove that our implementation also works with even k. To verify that the strings are correct, the tigs computed for the E. coli pangenome are compared against each other by loading all kmers into a hash table and checking if different tigs contain the same kmers. There are no significant differences between the experiments with even k and odd k.
Additionally, we conduct our experiments with higher k in Tables 5 and 6 to show that performance stays the same when k is increased. The increase in k seems to increase the runtimes of BCALM2 and Cuttlefish 2, but the runtime of our Eulertigs implementation does not change significantly. As a result, the ratio between the runtimes of BCALM2 and Cuttlefish 2 and the runtime of our Eulertigs implementation becomes smaller for larger k.
Conclusions
We have presented a linear and hence optimal algorithm for computing a minimum SPSS without repetitions for a fixed alphabet size. This closes the open question about its complexity raised in [2, 3]. Using our optimal algorithm, we were able to accurately evaluate the existing heuristics and show that they are very close to the optimum in practice. Further, we have published our algorithm as a commandline tool on github.com [30] and conda [31] and a library on crates.io [32], allowing it to easily be used in any kmerbased tool. While the difference in cumulative length between previous heuristics and the optimum is not large, our tool works for any value of k, and is, combined with a de Bruijn graph compactor such as BCALM, much faster than ProphAsm, which achieves nearly indistinguishable cumulative length. Hence, our tool is better suited to be used any kmerbased application including SSHash [5], and specifically in the applications listed in [2] which include compressed storage on disk and kmerbased queries.
Further, we have presented how bidirected de Bruijn graphs can be formalised without excluding any corner cases. We have also shown how such a graph can be constructed in linear time for a fixedsize alphabet. The construction of the compacted arccentric bidirected de Bruijn graph in linear time independent of the alphabet size stays an open problem.
Availability of data and materials
The implementation of the Eulertig algorithm is available on github [30]. The name of the project is matchtigs. It is platform independent, and can be compiled locally or installed from bioconda as described in the README of the project. It is licensed under the 2clause BSD license. The version used for our experiments is available on zenodo [29], and the implementation together with all code to reproduce the experiments is available at [28]. The experiment code is licensed under the Creative Commons Attribution 4.0 International license. See [1] for the availability of the nonoriginal data used for our experiments.
Notes
The original of an adjoint graph can be computed by splitting each node v into two nodes \(v'\) and \(v''\) such that \(v'\) keeps the incoming arcs, and \(v''\) the outgoing arcs as in [9, Figure 4] Then, the graph is a collection of complete bipartite graphs [9]. These graphs can be contracted into single nodes, and then we add an arc between the contracted representations of each \(v'\) and \(v''\). This can be computed in linear time and is the original graph, since all nodes have become arcs again, and the arcs have the correct predecessors and successors.
Strongly connected means that there is a directed path from each node \(v_1\) to each node \(v_2\).
References
Schmidt S, Khan S, Alanko J, Tomescu AI. Matchtigs: minimum plain text representation of kmer sets. bioRxiv. 2021. https://doi.org/10.1101/2021.12.15.472871.
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol. 2021;22(1):1–24.
Rahman A, Medevedev P. Representation of kMer sets using spectrumpreserving string sets. J Comput Biol. 2021;28(4):381–94.
Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol. 2020;21(1):1–20.
Pibiri GE. Sparse and skew hashing of kmers. bioRxiv. 2022. https://doi.org/10.1101/2022.01.15.476199.
Khan J, Kokot M, Deorowicz S, Patro R. Scalable, ultrafast, and lowmemory construction of compacted de Bruijn graphs with Cuttlefish 2. bioRxiv. 2021. https://doi.org/10.1101/2021.12.14.472718.
Cracco A, Tomescu AI. Extremelyfast construction and querying of compacted and colored de Bruijn graphs with GGCAT. bioRxiv. 2022. https://doi.org/10.1101/2022.10.24.513174.
Kasprzak M. Classification of de Bruijnbased labeled digraphs. Discrete Appl Math. 2018;234:86–92. https://doi.org/10.1016/j.dam.2016.10.014.
Blazewicz J, Hertz A, Kobler D, de Werra D. On some properties of DNA graphs. Discrete Appl Math. 1999;98(1–2):1–19.
Rahman A, Medvedev P. Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs. Genome Res. 2022;32(9):1746–53.
Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016;32(12):201–8.
Bankevich A, Bzikadze AV, Kolmogorov M, Antipov D, Pevzner PA. Multiplex de Bruijn graphs enable genome assembly from long, highfidelity reads. Nat Biotechnol. 2022. https://doi.org/10.1038/s41587022012206.
Cazaux B, Lecroq T, Rivals E. From indexing data structures to de Bruijn graphs. In: Kulikov AS, Kuznetsov SO, Pevzner P, editors. Symposium on combinatorial pattern matching. Springer: Berlin; 2014. p. 89–99.
Kundeti V, Rajasekaran S, Dinh H. An efficient algorithm for Chinese postman walk on bidirected de Bruijn graphs. In: Wu W, Daescu O, editors. Combinatorial optimization and applications. Berlin, Heidelberg: Springer; 2010. p. 184–96.
Gusfield D. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge: Cambridge University Press; 1997. https://doi.org/10.1017/cbo9780511574931.
Mäkinen V, Belazzougui D, Cunial F, Tomescu AI. Genomescale algorithm design. Cambridge: Cambridge University Press; 2015.
Ukkonen E. Online construction of suffix trees. Algorithmica. 1995;14(3):249–60.
Farach M. Optimal suffix tree construction with large alphabets. In: Proceedings 38th Annual Symposium on Foundations of Computer Science. IEEE. 1997; p. 137–43.
Belazzougui D, Kosolobov D, Puglisi SJ, Raman R. Weighted ancestors in suffix trees revisited. In: 32nd Annual Symposium on Combinatorial Pattern Matching. 2021.
Belazzougui D, Cunial F. Fullyfunctional bidirectional burrowswheeler indexes and infiniteorder de bruijn graphs. In: 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). 2019; Schloss DagstuhlLeibnizZentrum fuer Informatik
Belazzougui D, Cunial F, Kärkkäinen J, Mäkinen V. Versatile succinct representations of the bidirectional burrowswheeler transform. In: Bodlaender HL, Italiano GF, editors. European symposium on algorithms. Springer: Berlin; 2013. p. 133–44.
Fleischner H. Eulerian graphs and related topics. The Netherlands: Elsevier; 1990.
Crawford VG, Kuhnle A, Boucher C, Chikhi R, Gagie T. Practical dynamic de Bruijn graphs. Bioinformatics. 2018;34(24):4189–95.
Muggli MD, Bowe A, Noyes NR, Morley PS, Belk KE, Raymond R, Gagie T, Puglisi SJ, Boucher C. Succinct colored de Bruijn graphs. Bioinformatics. 2017;33(20):3181–7.
Muggli MD, Alipanahi B, Boucher C. Building large updatable colored de Bruijn graphs via merging. Bioinformatics. 2019;35(14):51–60.
Köster J, Rahmann S. Snakemakea scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520–2.
Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, TomkinsTinch CH, Valieris R, Köster J. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018;15(7):475–6.
Schmidt S. Eulertigs experiments. Zenodo. 2022. https://doi.org/10.5281/zenodo.7371148.
Schmidt S. Eulertigs. Zenodo. 2022. https://doi.org/10.5281/zenodo.7371184.
Schmidt S. Matchtigs. GitHub. https://github.com/algbio/matchtigs. Accessed 15 Apr 2023.
Schmidt S. Matchtigs. Bioconda. https://anaconda.org/bioconda/matchtigs. Accessed 15 Apr 2023.
Schmidt S. Matchtigs. Crates.io. https://crates.io/crates/matchtigs. Accessed 15 Apr 2023.
Acknowledgements
We wish to thank Andrea Cracco for providing us with the \(\sim\)309kx Salmonella pangenome. We also wish to thank the anonymous reviewers for there useful constructive feedback, which improved the presentation of the paper, the implementation and the experimental results. Open access funded by Helsinki University Library.
Funding
Open Access funding provided by University of Helsinki including Helsinki University Central Hospital. This work was partially funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO), and by NIH NIAID grant No. R01HG011392 and Academy of Finland grant 339070.
Author information
Authors and Affiliations
Contributions
JNA and SS discovered the problem, SS solved the problem when the de Bruijn graph is given and wrote most of the manuscript, JNA designed the lineartime de Bruijn graph construction algorithm and wrote Sect. "Lineartime construction of compacted bidirected de Bruijn graphs". SS implemented the algorithm and conducted and evaluated the experiments. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This study only uses publicly available datasets, hence an ethics approval or consent to participate is not required.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Schmidt, S., Alanko, J.N. Eulertigs: minimum plain text representation of kmer sets without repetitions in linear time. Algorithms Mol Biol 18, 5 (2023). https://doi.org/10.1186/s13015023002271
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13015023002271