Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time

Schmidt, Sebastian; Alanko, Jarno N.

doi:10.1186/s13015-023-00227-1

Research
Open access
Published: 04 July 2023

Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time

Algorithms for Molecular Biology volume 18, Article number: 5 (2023) Cite this article

1183 Accesses
1 Citations
2 Altmetric
Metrics details

Abstract

A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications it is important to store the k-mers in small space, while keeping the representation easy and efficient to use (i.e. without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. Our algorithm first constructs the de Bruijn graph in linear time and then uses a Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.

Introduction

Motivation A k-mer is a DNA string of length k that is considered equal to itself and its reverse complement. A common pattern in bioinformatics is to reduce a set of input strings to their constituent k-mers. Such representations are at the core of many bioinformatics pipelines—see e.g. Schmidt et al. [1] or Brinda et al. [2] for an overview of applications. The wide-spread use of k-mer sets has prompted the question of what is the smallest plain text representation for a set of k-mers. Here, a plain text representation means a set of strings that have the same set of k-mers as the input strings, i.e. the spectrum is preserved. Such representations are also called spectrum preserving string sets (SPSS) [3], or simplitigs [2]. This has the following advantages over encoded representations:

When storing k-mer sets to disk, plain text may remove the need of decompression before usage, as some tools that usually take unitigs as input can take any other plain text representation without modification (e.g. Bifrost [4]).
Within an application, an encoded representation would require decoding whenever a k-mer is accessed, which may slow down the application a lot compared to when each k-mer is in RAM in plain text.

Further, in applications, it might be useful if the representation contains each k-mer exactly once. This is because some applications, like e.g. SSHash [5], are able to take any set of k-mers as input, but cannot easily deal with duplicate k-mers in the input.

Related work There are two heuristic approaches to the construction of a small SPSS without repetitions, namely ProphAsm [2] and UST [3]. While neither of these computes a minimum representation, Rahman et al. [3] also present a lower bound to the minimum size of any representation without repetition, and they show that they are within 3% of this lower bound in practice. They also present a counter-example showing that their lower bound is not tight. Small SPSSs without repetitions are used e.g. in SSHash [5] and are also computed by state-of-the-art de Bruijn graph compactors like Cuttlefish 2 [6]. Additionally, the state-of-the-art de Bruijn graph compressor GGCAT [7] was extended to compute Eulertigs, in addition to other variants of SPSSs.

When k-mer repetitions are allowed in an SPSS, there is a known polynomially computable minimum representation, namely matchtigs [1]. The matchtig algorithm joins unitigs by first iterating all possible joins repeating up to $k - 1$ k-mers, and then using minimum perfect matching to find a set of joins that minimises the size of the representation. This is similar to the algorithm presented in this paper, which leaves out the matching step and only joins unitigs that are adjacent. While matchtigs are expensive to compute, the authors also present a more efficient greedy heuristic that is able to compute a near-minimum representation on a modern server with no significant penalty in runtime (when compared to computing just unitigs), but a significant increase in RAM usage.

In [1, 2] the authors also showed that decreasing the size of an SPSS results in significantly better performance in downstream applications, i.e. when further compressing the representation with general purpose compressors, or when performing k-mer-based queries.

The authors of both [2] and [3] consider whether computing a minimum representation without repetitions may be NP-hard, as it is equivalent to computing a minimum path cover in a de Bruijn graph, which is NP-hard in general graphs by reduction from Hamiltonian cycle. However, computing a Hamiltonian cycle in a de Bruijn graph is actually polynomial [8]. The authors of [8] argue that de Bruijn graphs are a subclass of adjoint graphs, in which solving the Hamiltonian cycle problem is equivalent to solving the Eulerian cycle problem in the original of the adjoint graph, which can be computed in linear time.^{Footnote 1} However, the argument is only made for normal directed (and not bidirected) graphs, and thus is not applicable to our setup, where a k-mer is also considered equal to its reverse complement.

Our contributions Our first technical contribution is to carefully define the notion of a bidirected de Bruijn graph such that the spectrum of the input is accurately modelled in the allowed walks of the graph. While defining a bidirected de Bruijn graph is not new [10], we take specific care around k-mers that are their own reverse complement. This technicality is often neglected in the literature, and sidestepped by requiring that the value of k is odd, in which case this special case does not occur. To make sure that our definition is correct for any k, we show that our de Bruijn graph admits exactly the strings that can be spelled from the k-mers that it was constructed from. We give a suffix-tree-based deterministic linear-time algorithm to construct such a graph, filling a theory gap in the literature, as existing approaches [4, 6, 11, 12] depend on the value of k and/or are probabilistic due to the of use hashing, minimizers or Bloom filters, or do not use the reverse-complement-aware definition of k-mers [13].

Given the bidirected de Bruijn graph, we present an algorithm that computes a minimum plain text representation of k-mer sets without repetitions, which runs in output sensitive linear time. Steps 1 to 3 run in linear time in the number of nodes and arcs in the graph. In short, it works as follows:

1
Add breaking arcs into this graph to make it Eulerian.
2
Compute a Eulerian cycle in the resulting graph.
3
Break that cycle at the breaking arcs.
4
Output the strings spelled by the resulting walks.

The algorithm is essentially an adaption of the matchtigs algorithm [1], removing the possibility of joining walks by repeating k-mers. We give detailed descriptions for all these steps and prove their correctness in our bidirected de Bruijn graph model. Together with our linear-time de Bruijn graph construction algorithm, we obtain the main result of our paper:

Theorem 1

Let k be a positive integer and let I be a set of strings of length at least k over some alphabet $\Sigma$. Then we can compute a set of strings $I'$ of length at least k with minimum cumulative length and ${{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')$ in $O(||I||\log |\Sigma |)$ time.

where ${{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')$ means that $I'$ is an SPSS of I, and ||I|| is the cumulative length of I (see Sect. "Preliminaries" for accurate definitions). This gives a positive answer to the open question if a minimum SPSS without repetitions can be computed in polynomial time. Additionally, this gives an easily computable tight lower bound on the size of a minimum SPSS without repetitions. We also give a counter example where previous heuristics are not necessarily optimal.

For our experiments, we have implemented steps 1 to 4 in Rust, taking the de Bruijn graph as given. The implementation is available on github: https://github.com/algbio/matchtigs. Our experimental evaluation shows that our algorithm does not result in significant practical improvements, but for the first time allows to benchmark the quality the heuristics ProphAsm and UST against an optimal solution. It turns out that both produce close-to-optimal results, but with a different distribution of computational resources.

Our work also shows that using arc-centric de Bruijn graphs can aid the intuition for certain problems, as in this case, the node-centric variant hides the relationship between Eulerian cycles and minimum SPSS without repetition.

Organisation of the paper In Sect. "Preliminaries" we give preliminary definitions of well-known concepts. In Sect. "De Bruijn graphs" we define de Bruijn graphs and prove the soundness of the definitions. In Sect. "Linear-time construction of compacted bidirected de Bruijn graphs" we show how to construct de Bruijn graphs by our definitions in linear time. In Sect. "Linear-time minimum SPSS without repetitions" we show how to construct a minimum SPSS without repetitions in linear time if the de Bruijn graph is given. Additionally, we give an example where previous heuristics were not optimal. In Sect. "Experiments" we compare our algorithm and Eulertigs against strings computed with ProphAsm and UST on practical data sets.

Preliminaries

In this section we give the prerequisite knowledge required for this paper.

Bidirected graphs

In this section we define our notion of the bidirected graphs and the incidence model.

A multiset is defined as a set M, and an implicit function $\#_M: M \rightarrow {\mathbb {Z}}^+$ mapping elements to their multiplicities. The cardinality is defined as $|M|:= \sum _{s \in M} \#_M(s)$.

An alphabet $\Sigma$ is an ordered set, and an $\Sigma$-word is a string of characters of that set. String concatenation is written as ab for two strings a and b. The set $\Sigma ^k$ is the set of all $\Sigma$-words of length k and the set $\Sigma ^*$ is the set of all $\Sigma$-words, including the empty word $\epsilon$. Given a positive integer k, the k-suffix ${{\,\textrm{suf}\,}}_k(w)$ (k-prefix ${{\,\textrm{pre}\,}}_k(w)$) of a word w is the substring of its last (first) k characters. A k-mer is a word of length k. A complement function over $\Sigma$ is a function ${{\,\textrm{comp}\,}}: \Sigma \rightarrow \Sigma$ mapping characters to characters that is self-inverse (i.e. ${{\,\textrm{comp}\,}}({{\,\textrm{comp}\,}}(x)) = x$, also called an involution). A reverse complement function for alphabet $\Sigma$ is a function ${{\,\textrm{rc}\,}}: \Sigma ^* \rightarrow \Sigma ^*$ defined as ${{\,\textrm{rc}\,}}((w_1, \dots , w_\ell )):= ({{\,\textrm{comp}\,}}(w_\ell ), \dots , {{\,\textrm{comp}\,}}(w_1))$, for some arbitrary complement function ${{\,\textrm{comp}\,}}$. On sets, ${{\,\textrm{rc}\,}}$ is defined to compute the reverse complement of each element in the set. Note that ${{\,\textrm{rc}\,}}$ is self-inverse. A canonical k-mer is a k-mer that is lexicographically smaller than or equal to its reverse complement.

Given an integer k and an alphabet $\Sigma$, the k-spectrum of a set of strings $I \subseteq \bigcup _{k' \ge k} \Sigma ^{k'}$ is a set of strings ${{\,\textrm{S}\,}}_k(I):= \{ w \in \Sigma ^k \mid \exists i \in I: w \text { is substring of } i \text { or }{{\,\textrm{rc}\,}}(i) \}$. The canonical k-spectrum of I is ${{\,\textrm{CS}\,}}_k(I):= \{w \in {{\,\textrm{S}\,}}_k(I) \mid w \text { is canonical} \}$. For simplicity, the spectrum and canonical spectrum are defined for a single string w as if it were a set $\{w\}$. A spectrum preserving string set of a set of strings I is a set of strings $I'$ such that ${{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')$. The cumulative length of I is $||I||:= \sum _{w \in I} |w|$.

Our definition of a bidirected graph is mostly standard like in e.g. [14], however we allow self-complemental nodes that occur in bidirected de Bruijn graphs. A bidirected graph is a tuple $G = (V, E, c)$ with a set of normal and self-complemental nodes $v \in V$, a set of arcs $e \in E$, and a function $c: V \rightarrow \{1, 0\}$ marking self-complemental nodes with 1, and normal nodes with 0. An incidence is a pair $vd$, where $d\in \{\oplus , \ominus , \odot \}$ is called its sign (e.g. $v\oplus$). The negation of a sign is defined as $\lnot \oplus := \ominus$, $\lnot \ominus := \oplus$ and $\lnot \odot := \odot$. For self-complemental nodes $v \in V$, only incidences $v\odot$ are allowed, and for normal nodes only incidences $v\oplus$ and $v\ominus$ are allowed. An arc $(v_1d_1, v'_1d'_1, \eta ) \in E$ is a tuple of incidences and a unique identifier $\eta$, where $\eta$ can be of any type. The reversal of an arc is denoted by $(v_1d_1, v'_1d'_1, \eta )^{-1}:= (v'_1d'_1, v_1d_1, \eta )$. If not required, we may drop the identifier (i.e. just write $(v_1\ominus , v'_1\odot ) \in E$). We count the incidences present in an arc e using multiset notation like $\#_e(vd)$, returning 0 if the arc does not contain the incidence $vd$, returning 1 if it contains the incidence once and returning 2 if it is a self-loop with that incidence. If a node $v \in V$ is present with a $\oplus$ ($\ominus$) sign in an arc, then the arc is outgoing (incoming) from (to) v.

Note that, other than in standard directed graphs, in bidirected graphs arcs can be outgoing or incoming on both ends, and the order of the incidences in the arc does not affect if it is outgoing or incoming to a node. Further, our notation differs from that of standard bidirected graphs in that arcs have a direction. This is required because we will work with arc-centric de Bruijn graphs (see Sect. "De Bruijn graphs"), which have labels on the arcs and not the nodes. Using the sign of the incidence pairs, it is possible to decide if a node is traversed forwards or backwards, but not if the arc is traversed forwards or backwards. But to decide which label (forwards or reverse complement) to use when computing the string spelled by an arc, the direction is relevant. See Fig. 1a for an example of a bigraph, which has labels that make it a de Bruijn graph as well.

A walk in a bigraph is a sequence of arcs $W:= ((v_1d_1, v'_1d'_1, \eta _1),(v_2d_2, v'_2d'_2, \eta _2),\dots ,(v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))$ where for every i it holds that $(v_id_i, v'_id'_i, \eta _i) \in E$ or $(v'_id'_i, v_id_i, \eta _i) \in E$ (we can arbitrarily walk over arcs forwards and reverse), and for every $i < \ell$ it holds that $v'_{i} = v_{i+1}$ and $d'_i = \lnot d_{i+1}$. The length of a walk is $\ell = |W|$. If $v_1 = v'_\ell$ and $d_1 = \lnot d'_\ell$, then W is a cycle. A bigraph is connected, if for each pair of nodes $v_1, v_2 \in V$ there is a walk from $v_1$ to $v_2$.

For a node $v \in V$, the multiset of incidences is defined as ${{\,\textrm{I}\,}}(v):= \{vd\mid d\in \{\oplus , \ominus , \odot \}\}$, with multiplicities $\#_{{{\,\textrm{I}\,}}(v)}(vd):= \sum _{e \in E} \#_e(vd)$. For a node $v \in V$ that is not self-complemental, the outdegree is defined as $\delta ^+(v):= \#_{{{\,\textrm{I}\,}}(v)}(v\oplus )$, and the indegree is defined as $\delta ^-(v):= \#_{{{\,\textrm{I}\,}}(v)}(v\ominus )$. For a self-complemental node $v \in V$, the degree is defined as $\delta (v):= \#_{{{\,\textrm{I}\,}}(v)}(v\odot )$.

We define the imbalance of a node $v \in V$ that is not self-complemental as the difference of its outdegree and indegree ${{\,\textrm{imbalance}\,}}(v):= \delta ^+(v) - \delta ^-(v)$. For a self-complemental node $v \in V$ the imbalance is defined as ${{\,\textrm{imbalance}\,}}(v):= 1$ if $\delta (v)$ is odd, and ${{\,\textrm{imbalance}\,}}(v):= 0$ otherwise. A node $v \in V$ is called unbalanced, if ${{\,\textrm{imbalance}\,}}(v) \ne 0$, and balanced otherwise.

A labelled graph is a bidirected graph $G = (V, E, c)$ where the identifiers of arcs are strings over some alphabet $\Sigma$ (e.g. $(v_1\oplus , v_2\ominus , ACCTG) \in E$).

Suffix arrays and suffix trees

Section "Linear-time construction of compacted bidirected de Bruijn graphs" requires knowledge of suffix arrays and suffix trees. We assume the reader is familiar with these data structures, and briefly give the relevant definitions and properties below. We point the reader to Gusfield [15] and Mäkinen [16] for an in-depth treatment of the topics.

A suffix array $SA_T$ for a string T is an array of length |T| such that $SA_T[i]$ is the starting position of the lexicographically i-th suffix of T. The suffix array interval of a string x is the maximal interval [i..j] such that all the suffixes pointed by $SA_T[i], \ldots , SA_T[j]$ have x as a prefix, or the empty interval if x is not a substring of T.

A suffix tree of a string T is a compacted version of the trie of all suffixes of T, such that non-branching paths are merged into single arcs, with arcs pointing away from the root. The compactification concatenates the labels of the arcs on the compacted path. The nodes that were compacted away and are now in the middle of an arc are called implicit nodes, and the rest of the nodes are explicit. A locus (plural loci) is a node that is either explicit or implicit. A locus v is represented by a pair (u, d), where u is the explicit suffix tree node at the end of the arc containing v (u is equal to v if v is explicit), and d is the depth of locus v in the trie of loci. The suffix array interval of a node is the interval of leaves in the subtree of the node. The suffix array interval of an implicit locus (u, d) is the same as the suffix array interval of u.

The suffix tree can be constructed in log-linear time in $|T| \log |\Sigma |$ using e.g. Ukkonen’s algorithm [17] or in linear time in |T| using Farach’s algorithm [18]. The tree comes with a function child that takes an explicit node and a character, and returns the child at the end of the arc from that node whose label starts with the given character (if such node exists). This can be implemented in $O(\log |\Sigma |)$ time by binary searching over child pointers sorted by labels. The child function can also be easily implemented for implicit loci. Ukkonen’s algorithm also produces suffix links for the explicit nodes, which map from the suffix tree node of a string cx to the suffix tree node of string x. It is possible to emulate suffix links on the implicit loci using constant-time weighted level-ancestor queries [19] by mapping $(u,d) \mapsto (f_{d-1}(SL(u)), d-1)$, where SL(u) is the destination of a suffix link from u, and $f_{d-1}(SL(u))$ is the furthest suffix tree ancestor from SL(u) at depth at least $d-1$ in the trie of loci. The inverse pointers of suffix links are called Weiner links, and they can also be simulated on the implicit loci by mapping $(u, d) \mapsto (WL(u, c), d+1)$, where WL(u, c) is the destination of a Weiner link from u with character c.

De Bruijn graphs

The de Bruijn graph of order k of a set of input strings I is defined as a labelled graph constructed by Algorithm 1. See Fig. 1a for an example. The algorithm inserts an arc for each canonical k-mer, and connects the arcs via nodes according to their $k-1$ overlaps. Depending on if these overlaps use the reverse complement or if the $k-1$-mer of a node is self-complemental, it adds directions to the incidences. A de Bruijn graph computed by this algorithm has the following property.

Lemma 2

Let k be a positive integer and let I be a set of strings of length at least k. Let $G = (V, E, c)$ be the de Bruijn graph of order k constructed from I. For all pairs of arcs $e_1:= (v_1d_1, v'_1d'_1, \eta _1), e_2:= (v_2d_2, v'_2d'_2, \eta _2) \in E$ it holds that:

(a)
($v'_1 = v_2$ and $d'_1 = \lnot d_2$) if and only if ${{\,\textrm{suf}\,}}_{k-1}(\eta _1) = {{\,\textrm{pre}\,}}_{k-1}(\eta _2)$,
(b)
($v'_1 = v'_2$ and $d'_1 = \lnot d'_2$) if and only if ${{\,\textrm{suf}\,}}_{k-1}(\eta _1) = {{\,\textrm{pre}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _2))$,
(c)
($v_1 = v_2$ and $d_1 = \lnot d_2$) if and only if ${{\,\textrm{suf}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _1)) = {{\,\textrm{pre}\,}}_{k-1}(\eta _2)$, and
(d)
($v_1 = v'_2$ and $d_1 = \lnot d'_2$) if and only if ${{\,\textrm{suf}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _1)) = {{\,\textrm{pre}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _2))$.

Proof

Observe that the values of w and $w'$ computed in Lines 5 and 7 of Algorithm 1 are equal to ${{\,\textrm{pre}\,}}_{k-1}(\eta _1)$ and ${{\,\textrm{suf}\,}}_{k-1}(\eta _1)$ for $e_1$ and equal to ${{\,\textrm{pre}\,}}_{k-1}(\eta _2)$ and ${{\,\textrm{suf}\,}}_{k-1}(\eta _2)$ for $e_2$. Further, observe that the values of v and $v'$ computed in Lines 6 to 8 are equal to $v_1$ and $v'_1$ for $e_1$ and equal to $v_2$ and $v'_2$ for $e_2$. This makes $v_1$, $v'_1$, $v_2$ and $v'_2$ the canonicals of ${{\,\textrm{pre}\,}}_{k-1}(\eta _1)$, ${{\,\textrm{suf}\,}}_{k-1}(\eta _1)$, ${{\,\textrm{pre}\,}}_{k-1}(\eta _2)$ and ${{\,\textrm{suf}\,}}_{k-1}(\eta _2)$. Finally, observe that the sign values $d$ and $d'$ computed in Lines 9 to 14 are equal to $d_1$ and $d'_1$ for $e_1$ and equal to $d_2$ and $d'_2$ for $e_2$.

(a)
If $v'_1 = v_2$ and $d'_1 = \lnot d_2$, then $w'_1 = w_2$ for all possible values of $d'_1$, and therefore ${{\,\textrm{suf}\,}}_{k-1}(\eta _1) = {{\,\textrm{pre}\,}}_{k-1}(\eta _2)$. If ${{\,\textrm{suf}\,}}_{k-1}(\eta _1) = {{\,\textrm{pre}\,}}_{k-1}(\eta _2)$, then $w'_1 = w_2$, and therefore $v'_1 = v_2$ because $v'_1$ and $v_2$ are the canonicals of $w'_1$ and $w_2$. Additionally, $d'_1 = \lnot d_2$ for all possible values of $d'_1$.
(b)
If $v'_1 = v'_2$ and $d'_1 = \lnot d'_2$, then $w'_1 = {{\,\textrm{rc}\,}}(w'_2)$ for all possible values of $d'_1$, and therefore ${{\,\textrm{suf}\,}}_{k-1}(\eta _1) = {{\,\textrm{rc}\,}}({{\,\textrm{suf}\,}}_{k-1}(\eta _2)) = {{\,\textrm{pre}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _2))$. If ${{\,\textrm{suf}\,}}_{k-1}(\eta _1) = {{\,\textrm{pre}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _2))$, then $w'_1 = {{\,\textrm{rc}\,}}(w'_2)$, and therefore $v'_1 = v'_2$ because $v'_1$ and $v'_2$ are the canonicals of $w'_1$ and $w'_2$. Additionally, $d'_1 = \lnot d'_2$ for all possible values of $d'_1$.
(c)
If $v_1 = v_2$ and $d_1 = \lnot d_2$, then ${{\,\textrm{rc}\,}}(w_1) = w_2$ for all possible values of $d_1$, and therefore ${{\,\textrm{suf}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _1)) = {{\,\textrm{rc}\,}}({{\,\textrm{pre}\,}}_{k-1}(\eta _1)) = {{\,\textrm{pre}\,}}_{k-1}(\eta _2)$. If ${{\,\textrm{suf}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _1)) = {{\,\textrm{pre}\,}}_{k-1}(\eta _2)$, then $w_1 = {{\,\textrm{rc}\,}}(w_2)$, and therefore $v_1 = v_2$ because $v_1$ and $v_2$ are the canonicals of $w_1$ and $w_2$. Additionally, $d_1 = \lnot d_2$ for all possible values of $d_1$.
(d)
This case is equivalent to the first case when swapping $e_1$ and $e_2$, because ${{\,\textrm{suf}\,}}_{k-1}(\eta _1) = {{\,\textrm{pre}\,}}_{k-1}(\eta _2) \iff {{\,\textrm{suf}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _2)) = {{\,\textrm{pre}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _1))$.

$\square$

For a walk $W:= (e_1 = (v_1d_1, v'_1d'_1, \eta _1), \dots , e_\ell = (v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))$ in a de Bruijn graph, its sequence of k-mers is $K:= (\kappa _1, \dots , \kappa _\ell )$, where for each i we define $\kappa _i$ as $\eta _i$ if $e_i \in E$, and as ${{\,\textrm{rc}\,}}(\eta _i)$ if $e_i^{-1} \in E$. The string ${{\,\textrm{spell}\,}}(W)$ is the string spelled by W, which is defined as its collapsed sequence of kmers, i.e. its sequence of k-mers gets concatenated while overlapping consecutive k-mers by $k-1$. This is computed by Algorithm 2. It spells out the first k-mer (or its reverse complement) completely, and then adds the last characters of all subsequent k-mers (or their reverse complements). We prove the following lemmas to show that our definition of a de Bruijn graph together with the ${{\,\textrm{spell}\,}}(\cdot )$ function is sound for our purposes, i.e. walks in the de Bruijn graph can spell exactly the strings containing subsets of the k-mers used to create the de Bruijn graph. Due to this property, we can use our de Bruijn graph and ${{\,\textrm{spell}\,}}$ to in the Eulertig algorithm, such that finding a minimum SPSS is equivalent to finding a minimum walk cover of the de Bruijn graph.

Lemma 3

Let k be a positive integer and let I be a set of strings of length at least k. Let $G = (V, E, c)$ be the de Bruijn graph of order k constructed from I. Let $W:= (e_1 = (v_1d_1, v'_1d'_1, \eta _1), \dots , e_\ell = (v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))$ be a walk in G, and $K:= (\kappa _1, \dots , \kappa _\ell )$ its sequence of k-mers. Then for each consecutive pair of kmers $\kappa _i, \kappa _{i+1}$ it holds that ${{\,\textrm{suf}\,}}_{k-1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k-1}(\kappa _{i+1})$.

Proof

Let $i \in \{1, \dots , \ell - 1\}$. By the definition of walk it holds that $v'_i = v_{i+1}$ and $d'_i = \lnot d_{i+1}$. We can apply Lemma 2 case by case.

(a)
If $e_i, e_{i+1} \in E$, then by Lemma 2 (a), it holds that ${{\,\textrm{suf}\,}}_{k-1}(\eta _i)$ equals ${{\,\textrm{pre}\,}}_{k-1}(\eta _{i+1})$. By definition, $\kappa _i = \eta _i$ and $\kappa _{i+1} = \eta _{i+1}$, so ${{\,\textrm{suf}\,}}_{k-1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k-1}(\kappa _{i+1})$.
(b)
If $e_i, e^{-1}_{i+1} \in E$, then by Lemma 2 (b) applied to $e_i, e^{-1}_{i+1}$, it holds that ${{\,\textrm{suf}\,}}_{k-1}(\eta _i)$ equals ${{\,\textrm{pre}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _{i+1}))$. By definition, $\kappa _i = \eta _i$ and $\kappa _{i+1} = {{\,\textrm{rc}\,}}(\eta _{i+1})$, so ${{\,\textrm{suf}\,}}_{k-1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k-1}(\kappa _{i+1})$
(c)
If $e^{-1}_i, e_{i+1} \in E$, then by Lemma 2 (c) applied to $e^{-1}_i, e_{i+1}$, it holds that ${{\,\textrm{suf}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _i))$ equals ${{\,\textrm{pre}\,}}_{k-1}(\eta _{i+1})$. By definition, $\kappa _i = {{\,\textrm{rc}\,}}(\eta _i)$ and $\kappa _{i+1} = \eta _{i+1}$, so ${{\,\textrm{suf}\,}}_{k-1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k-1}(\kappa _{i+1})$.
(d)
If $e^{-1}_i, e^{-1}_{i+1} \in E$, then by Lemma 2 (d) applied to $e^{-1}_i, e^{-1}_{i+1}$, it holds that ${{\,\textrm{suf}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _i))$ equals ${{\,\textrm{pre}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _{i+1}))$. By definition, $\kappa _i = {{\,\textrm{rc}\,}}(\eta _i)$ and $\kappa _{i+1} = {{\,\textrm{rc}\,}}(\eta _{i+1})$, so ${{\,\textrm{suf}\,}}_{k-1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k-1}(\kappa _{i+1})$.

$\square$

We define the sequence of k-mers $K = (\kappa _1, \dots , \kappa _\ell )$ of a string $w = (a_1, \dots , a_{\ell +k-1})$ by $\kappa _i:= (a_i, \dots , a_{i+k-1})$ for each i.

Lemma 4

Let k be a positive integer and let I be a set of strings of length at least k. Let $G = (V, E, c)$ be the de Bruijn graph of order k constructed from I. Let W be a walk in G, $K_W$ its sequence of k-mers and $K'_W$ the sequence of k-mers of ${{\,\textrm{spell}\,}}(W)$. Then $K_W = K'_W$.

Proof

Let $(\kappa _1, \dots , \kappa _\ell ):= K_W$. We use induction over the length of W. For an empty W, K is empty, ${{\,\textrm{spell}\,}}(W)$ is empty, and therefore $K'$ is empty as well. For $|W| = 1$, Algorithm 2 outputs ${{\,\textrm{spell}\,}}(W) = \kappa _1$ and it holds that $K'_W = (\kappa _1) = K_W$.

For $|W| \ge 2$ we consider that $K_X = K'_X$ holds for a prefix X of W with $|X| = |W| - 1$. When $i = |W|$ at the beginning of the loop in Line 8, then $s = {{\,\textrm{spell}\,}}(X)$. By Lemma 3 it holds that the last $k - 1$ characters of s are equal to the first $k - 1$ characters of $\kappa _\ell$. Therefore, by appending the last character from $\kappa _\ell$ to s, $\kappa _\ell$ is appended to $K'_X$ forming $K'_W$. Therefore, last k-mer of $K'_W$ equals the last k-mer of $K_W$, and the first $\ell - 1$ k-mers of $K'_W$ equal those of $K_W$ by induction. $\square$

Lemma 5

Let k be a positive integer and let I be a set of strings of length at least k. Let $G = (V, E, c)$ be the de Bruijn graph of order k constructed from I. Let w be a string with ${{\,\textrm{CS}\,}}_k(w) \subseteq {{\,\textrm{CS}\,}}_k(I)$. Then there exists a walk W in G with ${{\,\textrm{spell}\,}}(W) = w$.

Proof

Let $K_w = (\kappa _1, \dots , \kappa _\ell )$ be the sequence of k-mers of w. We construct $W = (e_1 = (v_1d_1, v'_1d'_1, \eta _1), \dots , e_\ell = (v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))$ as follows: for each i, let $\eta _i$ be the canonical of $\kappa _i$ and $f_i \in E$ be the arc whose identifier is $\eta _i$. We set $e_i = f_i$ if $\kappa _i$ is canonical, and $e_i = f^{-1}_i$ otherwise.

For W to fulfil the definition of a walk we need that $v'_i = v_{i+1}$ and $d'_i = \lnot d'_{i+1}$ for all i. Using Lemma 2, we get:

If $e_i, e_{i+1} \in E$, then ${{\,\textrm{suf}\,}}_{k-1}(\eta _i) = {{\,\textrm{suf}\,}}_{k-1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k-1}(\kappa _{i+1}) = {{\,\textrm{pre}\,}}_{k-1}(\eta _{i+1})$. Therefore, by Lemma 2 (a), it holds that $v'_i = v_{i+1}$ and $d'_i = \lnot d'_{i+1}$.
If $e_i, e^{-1}_{i+1} \in E$, then ${{\,\textrm{suf}\,}}_{k-1}(\eta _i) = {{\,\textrm{suf}\,}}_{k-1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k-1}(\kappa _{i+1}) = {{\,\textrm{pre}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _{i+1}))$. Therefore, by Lemma 2 (b), it holds that $v'_i = v_{i+1}$ and $d'_i = \lnot d'_{i+1}$.
If $e^{-1}_i, e_{i+1} \in E$, then ${{\,\textrm{suf}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _i)) = {{\,\textrm{suf}\,}}_{k-1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k-1}(\kappa _{i+1}) = {{\,\textrm{pre}\,}}_{k-1}(\eta _{i+1})$. Therefore, by Lemma 2 (c), it holds that $v'_i = v_{i+1}$ and $d'_i = \lnot d'_{i+1}$.
If $e^{-1}_i, e^{-1}_{i+1} \in E$, then ${{\,\textrm{suf}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _i)) = {{\,\textrm{suf}\,}}_{k-1}(\kappa _i) = {{\,\textrm{pre}\,}}_{k-1}(\kappa _{i+1}) = {{\,\textrm{pre}\,}}_{k-1}({{\,\textrm{rc}\,}}(\eta _{i+1}))$. Therefore, by Lemma 2 (d), it holds that $v'_i = v_{i+1}$ and $d'_i = \lnot d'_{i+1}$.

To complete the proof we need to show that ${{\,\textrm{spell}\,}}(W) = w$. By definition, the sequence of k-mers $K_W$ of W is equivalent to $K_w$. And since W is a walk, by Lemma 4 we get that the sequence of k-mers of ${{\,\textrm{spell}\,}}(W)$ is equivalent to $K_W$, and therefore ${{\,\textrm{spell}\,}}(W) = w$. $\square$

A walk cover ${\mathcal {W}}$ of a bigraph G is a set of walks such that for each arc $e \in E$ it holds that e is part of some walk $W \in {\mathcal {W}}$, or $e^{-1}$ is part of some walk $W \in {\mathcal {W}}$.

Theorem 6

Let k be a positive integer and let I and $I'$ be sets of strings of length at least k. Let $G = (V, E, c)$ be the de Bruijn graph of order k constructed from I. Then it holds that ${{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')$, if and only if there is a walk cover ${\mathcal {W}}$ in G that spells the strings in $I'$.

Proof

If ${{\,\textrm{CS}\,}}_k(I') \subseteq {{\,\textrm{CS}\,}}_k(I)$, then for each string $w' \in I'$ it holds that ${{\,\textrm{CS}\,}}_k(w') \subseteq {{\,\textrm{CS}\,}}_k(I)$. Therefore, by Lemma 5, there exists a walk w in G with ${{\,\textrm{spell}\,}}(w) = w'$. Then, the set of all such walks ${\mathcal {W}}$ spells $I'$. Further, because ${{\,\textrm{CS}\,}}_k(I) \subseteq {{\,\textrm{CS}\,}}_k(I')$, the identifier $\eta$ of each arc $e \in E$ is in ${{\,\textrm{CS}\,}}_k(I')$, and therefore in the sequence of kmers $K_{w'}$ of some string $w' \in I'$ (possibly as a reverse complement). By Lemma 4 it holds that $K_{w'} = K_w$, where $K_w$ is the sequence of k-mers of walk w. By the definition of the sequence of k-mers of a walk, this implies that w visits e (possible in reverse direction). Since this holds for each $e \in E$, it holds that ${\mathcal {W}}$ is a walk cover of G.

Assume that there is a walk cover ${\mathcal {W}}$ in G that spells the strings in $I'$, and let $w \in {\mathcal {W}}$ be a walk, $K_w$ its sequence of k-mers, $w':= {{\,\textrm{spell}\,}}(w)$ and $K_{w'}$ the sequence of k-mers of $w'$. Then, by Lemma 4, $K_w = K_{w'}$, which, by the definition of the sequence of k-mers of a walk implies that ${{\,\textrm{CS}\,}}_k(I) \subseteq {{\,\textrm{CS}\,}}_k(I')$. And since ${\mathcal {W}}$ is a walk cover of G, we get ${{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')$. $\square$

Corollary 7

By setting $I = I'$ in Theorem 6 we see that our de Bruijn graph contains the strings it was constructed from. Further, by Theorem 6 it holds that walks in the de Bruijn graph spell exactly the strings that can be formed from the k-mers that were used to create the graph. Hence, our definition of a de Bruijn graph is sound for all k.

A compacted de Bruijn graph is constructed from a de Bruijn graph by contracting all nodes $v \in V$ that are either self-complemental and have exactly two arcs that have exactly one incidence to v each, or that are not self-complemental and have exactly one incoming and one outgoing arc. For simplicity, we use uncompacted de Bruijn graphs in our theoretical sections, however all results equally apply to compacted de Bruijn graphs.

Linear-time construction of compacted bidirected de Bruijn graphs

In this section, we fill a gap in the literature by describing on a high level an algorithm to construct the bidirectional de Bruijn graph of a set of input strings in time linear in the total length of the input strings, independent of the value of k.

Algorithm

Let $I = \{w_1, \ldots w_m\}$ be the set of input strings. Consider the following concatenation:

$$\begin{aligned} T = \$w_1 \$ w_2 \$ \ldots \$ w_m \$ {{\,\textrm{rc}\,}}(w_1) \$ {{\,\textrm{rc}\,}}(w_2) \$ \ldots \$ {{\,\textrm{rc}\,}}(w_m)\$ , \end{aligned}$$

where $\$$ is a special character outside of the alphabet $\Sigma$ of the input strings. We require an index on T that can answer the following queries: extendRight, extendLeft, contractRight and contractLeft in constant time. The extension operations take as input a character $c \in \Sigma$ and the interval of a string x in the suffix array of T, and return the suffix array intervals of xc in the case of extendRight and cx in the case of extendLeft. The contraction operations are the inverse operations of these, mapping the suffix array intervals of xc to x in the case of contractRight and cx to x in the case of contractLeft. For efficiency, we also require operations enumerateRight and enumerateLeft, which take a string x and give all characters such that extendRight and extendLeft respectively return a non-empty interval, in time that is linear in the number of such characters. Implementations for all the six subroutines are given in Sect. "Implementation of the subroutines".

Using these operations, we can simulate the regular non-bidirected de Bruijn graph of T. Each k-mer of the input strings for a fixed k corresponds to a disjoint interval in the suffix array of T. The nodes are represented by their suffix array intervals. The outgoing arcs from a $(k-1)$-mer x are those characters c where extendRight(x, c) returns a non-empty interval. We can enumerate all the characters c with this property in constant time using enumerateRight(x). The incoming arcs can be enumerated symmetrically with the enumerateLeft(x). Finally, we can find the destination or origin of an arc labelled with x by running a contractLeft or contractRight operation respectively on x.

To construct the bidirected de Bruijn graph, we merge together nodes that are the reverse complement of each other. To find which nodes are complemental, we scan the input strings I while maintaining the suffix array interval of the current k-mer using extendRight and contractLeft operations, while at the same time maintaining the suffix array interval of the reverse complement using extendLeft and contractRight operations. Whenever we merge two nodes, we combine the incoming and outgoing arcs, assigning the incidences of the arcs according to the incidence rules in our definition. We are able to tell in constant time which k-mer of a pair of complemental k-mers is canonical by comparing the suffix array intervals of the k-mers: the k-mer whose suffix array interval has a smaller starting point is the canonical k-mer. If the starting points are the same, the k-mer is self-complemental.

Using the enumerateRight and enumerateLeft functions, we can check if a node would be contracted in a compacted de Bruijn graph. By extending k-mers over such nodes, we can in linear time also output only the arcs and nodes of a compacted de Bruijn graph. For storing the labels, we use one pointer into the input strings to store a single k-mer, as well as a flag that is set whenever the label is not canonical. If a label has multiple k-mers, then we store the remaining k-mers as explicit strings, however without their overlap with the “pointer-k-mer”. This way, we can store each label in $O(\ell )$ space, where $\ell$ is the number of k-mers in the label. We additionally store the first and last character of each label, as an easy way to make the spell function run in output sensitive linear time.

Implementation of the subroutines

All required the subroutines extendRight, extendLeft, contractRight, contractLeft, enumerateRight and enumerateLeft can be implemented with the suffix tree of T by simulating the trie of the suffix tree loci as described in Sect. "Suffix arrays and suffix trees". The suffix array intervals of explicit nodes can be stored with the nodes, so that we can operate on loci (u, d) and retrieve the suffix array intervals on demand. The operation extendRight follows an arc from a locus to a child, and the operation contractRight is implemented by going to the parent of the current locus. The operation contractLeft follows a suffix link from the current locus, and extendLeft follows a Weiner link. The operations enumerateRight and enumerateLeft are implemented by storing the children and the Weiner links from explicit suffix tree nodes as neighbor lists. The total number of these links is linear in |T| [16]. With this implementation, the slowest operations are extendRight and extendLeft, taking $O(\log |\Sigma |)$ time to binary search the neighbor lists. We therefore obtain the following result:

Theorem 8

The compacted arc-centric bidirected de Bruijn graph of order k of a set of input strings I from the alphabet $\Sigma$ can be constructed in time $O(||I|| \log |\Sigma |)$.

We note that the same operations can also be implemented on top of the bidirectional BWT index of Belazzougui and Cunial [20], using the data structures of Belazzougui et al. [21] for the enumeration operations. This gives an index that supports all the required subroutines in constant time. The drawback of the bidirectional BWT index is that only randomized construction algorithms are known, but the expected time is still linear in |T|. We leave as an open problem the construction of the compacted arc-centric bidirected de Bruijn graph in deterministic linear time independent of the alphabet size.

Pseudocode

The pseudocode for computing a compacted de Bruijn graph in linear time is given by Algorithm 4 which uses Algorithm 3 as a subroutine. The data structure D used by the algorithms is that described in Sect. "Linear-time construction of compacted bidirected de Bruijn graphs". Note that if we compute the arc labels as plain strings as in Algorithm 1, we need up to $O(k\log |\Sigma |)$ bits to store a single-k-mer arc. And since arcs are not substrings of input strings (but potentially combinations of input strings), we would need up to $O(k||I||\log |\Sigma |)$ bits to store all arc labels without referring to the input strings. This contradicts the algorithm being linear in $||I||\log |\Sigma |$. However, we can store the labels as tuples $(p, \eta , q, r)$, where $p\eta q$ is the label where p and q are explicit strings while $\eta$ is a pointer to a k-mer in the input. If r is true, then the label must be reverse complemented to match that defined by Algorithm 1. With this fix, the size of a label that represents $\ell$ k-mers is

$O(\ell \log |\Sigma |)$, and in total the de Bruijn graph represents O(||I||) k-mers.

The comparison on Line 16 of Algorithm 4 can be done in linear time in $|\eta _1| + |\eta _2|$ by finding the suffix array intervals of $\eta _1\eta \eta _2$ and ${{\,\textrm{rc}\,}}(\eta _1\eta \eta _2)$ with extendLeft and extendRight from $\eta$ and ${{\,\textrm{rc}\,}}(\eta )$ respectively, and comparing the starts of the intervals. This way, the total time taken by all those comparisons is proportional to the sum of $|\eta _1| + |\eta _2|$ over all unitigs, which is linear in ||I|| because each character of $\eta _1$ and $\eta _2$ can be mapped to a distinct edge in the non-compacted de Bruijn graph of ||I||. Therefore, the algorithm can be implemented to run in $O(||I||\log |\Sigma |)$ time.

Our pseudocode does not compute the first and last character of each arc-label, but this can be easily computed in constant time using $w_i$, $\eta _1$ and $\eta _2$ in Algorithm 4.

Linear-time minimum SPSS without repetitions

Let I be a set of strings. To compute an SPSS without repetitions we first build a compacted de Bruijn graph G from I. By Theorem 6, finding an SPSS is equivalent to finding a walk cover in G. Further, with Lemma 4, we get that an SPSS without repetitions is equivalent to a walk cover that visits each arc exactly once (either once forwards, or once reverse, but not both forwards and reverse). We call such a walk cover a unique walk cover.

For minimality, observe that the cumulative length of an SPSS S relates to its equivalent set of walks ${\mathcal {W}}$ as follows:

$$\begin{aligned} ||S|| = \sum _{W \in {\mathcal {W}}}(k - 1 + |W|) \end{aligned}$$

(1)

This is because in Algorithm 2, in Line 7, $k - 1$ characters are appended to the result, and then in the loop in Line 8, one additional character per arc in W is appended. We cannot alter the sum $\sum _{W \in {\mathcal {W}}} |W|$, since we need to cover all arcs in G. However we can alter the number of strings, and decreasing or increasing this number by one will decrease or increase the cumulative length of S by $k - 1$. Therefore, finding a minimum SPSS of I without repetitions equals finding a unique walk cover of G that has a minimum number of walks.

Note that computing a minimum SPSS in a bigraph that is not connected is equivalent to separately computing an SPSS in each maximal connected subgraph. Therefore we restrict to connected bigraphs from here on.

A lower bound for an SPSS without repetitions

Using the imbalance of the nodes of a bigraph, we can derive a lower bound for the number of walks in a walk cover.

Lemma 9

Let $v \in V$ be an unbalanced node in a bigraph $G = (V, E, c)$. Then in a unique walk cover ${\mathcal {W}}$ of G, either at least $|{{\,\textrm{imbalance}\,}}(v)|$ walks start in v, or at least $|{{\,\textrm{imbalance}\,}}(v)|$ walks end in v.

Proof

If v is self-complemental, then its imbalance is 1, so by definition v has an odd number of incident arcs. Each walk that does not start or end in v needs to enter and leave v via two distinct arcs whenever it visits v. But since the number of incident arcs is odd, there is at least one arc that cannot be covered this way, implying that a walk needs to start or end in this arc.

If v is not self-complemental and has a positive imbalance, then it has ${{\,\textrm{imbalance}\,}}(v)$ more outgoing arcs then incoming arcs. Since walks need to leave v with the opposite sign than they entered v, at least ${{\,\textrm{imbalance}\,}}(v)$ arcs cannot be covered by walks that do not start or end in v. If v has negative imbalance, the situation is symmetric. $\square$

Definition 10

The imbalance ${{\,\textrm{imbalance}\,}}(G)$ of a bigraph $G = (V, E, c)$ is the sum of the absolute imbalance of all nodes $\sum _{v \in V} |{{\,\textrm{imbalance}\,}}(v)|$.

Theorem 11

Let G be a bigraph. A walk cover ${\mathcal {W}}$ of G has a minimum string count of ${{\,\textrm{imbalance}\,}}(G) / 2$.

Proof

Let $v \in V$ be an unbalanced node. Then, by Lemma 9 at least $|{{\,\textrm{imbalance}\,}}(v)|$ walks start in v or at least $|{{\,\textrm{imbalance}\,}}(v)|$ walks end in v. Since each walk has exactly one start node and one end node, ${\mathcal {W}}$ has a minimum string count of ${{\,\textrm{imbalance}\,}}(G) / 2$. $\square$

Eulerising a bigraph

A directed graph is called Eulerian, if all nodes have indegree equal to outdegree, i.e. are balanced [22]. If the graph is strongly connected,^{Footnote 2} then this is equivalent to the graph admitting a Eulerian cycle, i.e. a cycle that visits each arc exactly once. The same notion can be used with bidirected graphs, using our definition of imbalance.

Definition 12

A bigraph is Eulerian, if all nodes have imbalance zero.

A connected bigraph can be transformed into a Eulerian bigraph by adding arcs using Algorithm 5. See Fig. 1b for an example. The algorithm lists all nodes that are out of balance, and inserts arbitrary arcs to balance them.

Lemma 13

The imbalance of a bigraph is even.

Proof

Adding or removing an arc changes the imbalance of two nodes by 1, or of one node by two. In both cases, the imbalance of the graph can only change by $-2$, 0, or 2. Since the imbalance of a graph without arcs is 0, this implies that there can be no graph with odd imbalance. $\square$

Lemma 14

Given a connected bigraph $G = (V, E, c)$, Algorithm 5 outputs a Eulerian bigraph $G' = (V, E', c)$.

Proof

Algorithm 5 is well-defined, since by Lemma 13, it holds that L has even length in each iteration of the loop in Line 10, so the removal operation in Line 12 always has something to remove.

The output of Algorithm 5 is a valid bigraph, since for self-complemental nodes $v \in V$, only incidences $v\odot$ are added to $G'$, and for not self-complemental nodes $v \in V$, only incidences $v\oplus$ and $v\ominus$ are added to $G'$.

Further, the output is a Eulerian bigraph, because for all $v \in V$, it holds that ${{\,\textrm{imbalance}\,}}(v)$ is 0, by the following argument:

If $c(v) = 1$ and v has imbalance zero in G, then its imbalance stays the same in $G'$. If it has imbalance 1, then one incident arc is inserted, making its degree even and its imbalance therefore zero.
If $c(v) = 0$ and v has positive imbalance i in G, then i incoming arcs are added to v (counting incoming self-loops twice), and no outgoing arcs are added. Therefore, it has imbalance zero in $G'$. By symmetry, if v has negative imbalance in G, it has imbalance zero in $G'$.

$\square$

Lemma 15

Given a bigraph $G = (V, E, c)$, Algorithm 5 terminates after $O(|V| + |E|)$ steps.

Proof

For the list data structure we choose a doubly linked list, and for the graph an adjacency list (and array with an entry for each node containing a doubly linked list for the arcs).

The loop in Line 3 runs |V| times and each iteration runs in $O(|{{\,\textrm{imbalance}\,}}(v)|)$ for a node v, because a doubly linked list supports appending in constant time. The sum of absolute imbalances of all nodes cannot exceed 2|E|, because each arc adds at most 1 to the absolute imbalance of at most two nodes, or adds at most 2 to the absolute imbalance of at most one node. Therefore, the length of list L after completing the loop is at most 2|E|, and the loop runs in $O(|V| + |E|)$ time.

The loop in Line 10 runs at most $|L| \le 2|E|$ times and performs only constant-time operations, since L is a doubly linked list and we can insert arcs into an adjacency list in constant time. Therefore, this loop also runs in $O(|V| + |E|)$ time. $\square$

With Lemma 14 and 15 we get the following.

Theorem 16

Algorithm 5 is correct and runs in $O(|V| + |E|)$ time.

Computing a Eulerian cycle in a bigraph

After Eulerising the bigraph, we can compute a Eulerian cycle using Algorithm 6. We do this similarly to Hierholzer’s classic algorithm for Eulerian cycles [22]. First we find an arbitrary cycle. Then, as long as there are unused arcs left, we search along the current cycle for unused arcs, and find additional cycles through such unused arcs. We integrate each of those additional cycles into the main cycle. See Fig. 1c for an example of a Eulerian cycle.

Lemma 17

Given a connected Eulerian bigraph $G = (V, E, c)$, Algorithm 6 terminates and outputs a Eulerian cycle W.

Proof

For $W = (e_1 = (v_1d_1, v'_1d'_1, \eta _1), \dots , e_\ell = (v_\ell d_\ell , v'_\ell d'_\ell , \eta _\ell ))$ to be a Eulerian cycle, it must be a cycle that contains each arc exactly once.

The sequence $W'$ constructed by the loop in Line 10 is a walk by construction, and since G is Eulerian it is a cycle after the loop terminates. After finding the initial cycle in the first iteration of the outer loop, each additional cycle is started from a node on the initial cycle, and is a cycle again. Therefore it can be inserted into the original cycle without breaking its cycle property.

Since each arc is deleted when being added to $W'$, there is no duplicate arc in W. And if the algorithm terminates, then $|E| = 0$ (Line 1), so W contains all arcs.

For termination, consider that if W is not complete after the first iteration of the outer loop, then the loop in Line 7 searches for an unused arc using the $first\_unfinished$ pointer. Since the prefix of W up to including $first\_unfinished$ is never modified (Line 19), and $first\_unfinished$ is only advanced when its pointee cannot reach any arc anymore, it holds that no arc in W can reach an arc in E when $first\_unfinished$ gets advanced over the end of W. Since G was initially Eulerian and only Eulerian cycles have been removed from G, this implies that all nodes visited by W are still balanced and therefore have no incident arcs anymore. And since G was originally connected, W has visited all nodes, i.e. $|E| = 0$. Therefore, $first\_unfinished$ cannot be advanced over the end of W, because the outer loop terminates before that.

To complete the proof of termination, consider that in each iteration of the outer loop, at least one arc gets removed from E. In the first iteration, this happens at least in Line 3, and in all following iterations, this happens in Line 11. $\square$

Lemma 18

Given a connected Eulerian bigraph $G = (V, E, c)$, Algorithm 6 terminates after $O(|V| + |E|)$ steps.

Proof

We use a doubly linked list for W and $W'$, and an adjacency list for G. Then all lines can be executed in constant time.

The loop in Line 10 removes one arc from E each iteration, so it runs at most |E| times in total (over all iterations of the outer loop). The loop in Line 7 advances $first\_unfinished$ each iteration. Since the algorithm is correct by Lemma 17, $|W| \le |E|$ and $first\_unfinished$ never runs over the end of $first\_unfinished$, so the loop runs at most |E| times in total (over all iterations of the outer loop).

The condition for the loop in Line 10 is true at least once in each iteration of the outer loop, since the preceding branch sets up $(vd, v'd', \eta )$ such that it has a successor (in the first iteration because of Eulerianess). So in each iteration of the outer loop, at least one arc gets removed, so the outer loop runs at most |E| times in total.

As a result, all loops individually run at most |E| times, therefore Algorithm 6 terminates after O(|E|) steps. Because G is connected, this is equivalent to $O(|V| + |E|)$ steps. $\square$

With Lemma 17 and 18 we get the following.

Theorem 19

Algorithm 6 is correct and runs in $O(|V| + |E|)$ time.

Computing a minimum SPSS without repetitions

We convert the Eulerian cycle into a walk cover of the original bigraph by breaking it at all arcs inserted by Algorithm 5, and removing those arcs (see Fig. 1d for an example). This results in a walk cover with either one walk, if Algorithm 5 inserted zero or one arcs, or ${{\,\textrm{imbalance}\,}}(G)/2$ arcs, if Algorithm 5 inserted more arcs. By Theorem 11, this is a minimum number of walks, and therefore the SPSS spelled by these walks is minimum as well. Constructing the de Bruijn graph takes $O(||I||\log |\Sigma |)$ time, and it has O(||I||) k-mers, so it holds that $|V| \in O(||I||)$ and $|E| \in O(||I||)$. Further, spelling the walk cover takes time linear to the cumulative length of the spelled strings. Since we compute a minimum representation, it holds that the output is not larger than the total length of the input strings. Therefore we get:

Theorem 1 Let k be a positive integer and let I be a set of strings of length at least k over some alphabet $\Sigma$. Then we can compute a set of strings $I'$ of length at least k with minimum cumulative length and ${{\,\textrm{CS}\,}}_k(I) = {{\,\textrm{CS}\,}}_k(I')$ in $O(||I||\log |\Sigma |)$ time.

Previous heuristics were not optimal

The heuristics implemented by UST [3] and Prophasm [2] are not optimal, as shown experimentally below. Here, we also give a simple counter-example to argue that the previous heuristics were not optimal. Even though the previous algorithms were described in node-centric de Bruijn graphs, we describe them here in the arc-centric variants to stick with the terminology of this paper.

UST works by starting from an arbitrary arc and extending forwards to unused arcs as long as possible. If there is no unused arc, but the last chosen arc has a successor that is the start of another walk, then the walks are joined. On the other hand, ProphAsm works by choosing an arbitrary arc and extending both forwards and backwards to unused arcs as long as possible. Both algorithms may fail to produce an optimal solution in the example given in Fig. 2. They may both first choose AGGTG and then continue to GTGGGAT, producing a string AGGTGGGAT. When they then process GTGCCGTG, they cannot join it with the previous string, hence they produce two strings of a cumulative length of 17. The optimal solution in Fig. 2 has one string with a cumulative length of 14.

Experiments

We ran our experiments on a server running Linux with two 64-core AMD EPYC 7H12 processors with 2 logical cores per physical core, 1.96TiB RAM and an SSD. Our data sets are the same as in [1], and we also adapted their metrics cumulative length (CL), which is the total count of characters in all strings, and string count (SC), which is the number of strings. Our implementation does not use the formalisation of bidirected graphs introduced in this work, but instead uses the formalisation from [1]. For constructing de Bruijn graphs, we do not implement our purely theoretical linear time algorithm, since practical de Bruijn graph construction is a well-researched field [4, 6, 11, 23,24,25], and we want to focus more on computing the compressed representation from unitigs. UST only supports unitigs constructed by BCALM2 [11], since it needs certain additional data. BCALM2 is not a linear time algorithm, but is efficient in practice. Therefore, we use BCALM2 to construct a node-centric de Bruijn graph, and then convert it to an arc-centric variant using a union-find data structure. For the human pangenome, which hits some built-in limit of BCALM2, we use Cuttlefish 2 [6] instead. This prevents us from running UST, but instead we run ProphAsm on the unitigs computed by Cuttlefish 2.

Our experimental pipeline is constructed with [26] and using the bioconda software repository [27]. We ran all multithreaded tools with up to 28 threads and never used more than 128 cores of our machine at once to prevent hyperthreading from affecting our timing. The code to reproduce our experiments is licensed under the Creative Commons Attribution 4.0 International license and available on zenodo [28]. We additionally provide our implementation of the Eulertigs algorithm on zenodo [29] as well as github [30], conda [31] and crates.io [32].

The performance figures in Tables 1 and 2 are all very similar, with two exceptions. Prophasm does not support parallel computation at the moment, therefore its runtime is much higher. Compared to that, all other algorithms use parallel computation to compute unitigs, but computing the final tigs from unitigs seems to be negligible compared to computing the de Bruijn topology. Moreover, running UST or Eulertigs on read data sets of larger genomes consumes significantly more memory than computing just unitigs. This is likely because BCALM2 uses external memory to compute unitigs, while the other tools simply load the whole set of unitigs into memory.

Table 1 Experiments on references and read sets of single genomes with $k = 51$ and a min abundance of 10 for human and 1 for the others

Full size table

Table 2 Experiments on (references of) pangenomes with $k = 31$ and a min abundance of 1

Full size table

In terms of CL, we see that the SPSS computed with UST mostly remains within the expected $3\%$ of the lower bound, but it is up to $5\%$ above the lower bound on more compressible data sets. The SPSS computed by ProphAsm is very close to the optimum in all cases, and we assume that this difference in quality is because ProphAsm extends paths both forwards and backwards, while the UST heuristic merely extends them forwards.

Looking at SC, we see that Eulertigs are always the lowest, which is due to the string count directly being connected to the cumulative length by Eq. 1. This also explains the correlation between CL ratio and SC ratio, which can be observed in all cases.

We conduct our experiments also with even k in Tables 3 and 4 to prove that our implementation also works with even k. To verify that the strings are correct, the tigs computed for the E. coli pangenome are compared against each other by loading all k-mers into a hash table and checking if different tigs contain the same k-mers. There are no significant differences between the experiments with even k and odd k.

Table 3 Experiments on references and read sets of single genomes with k = 52 and a min abundance of 10 for human and 1 for the others

Full size table

Table 4 Experiments on (references of) pangenomes with k = 32 and a min abundance of 1

Full size table

Additionally, we conduct our experiments with higher k in Tables 5 and 6 to show that performance stays the same when k is increased. The increase in k seems to increase the runtimes of BCALM2 and Cuttlefish 2, but the runtime of our Eulertigs implementation does not change significantly. As a result, the ratio between the runtimes of BCALM2 and Cuttlefish 2 and the runtime of our Eulertigs implementation becomes smaller for larger k.

Conclusions

Table 5 Experiments on references and read sets of single genomes with $k = 102$ and a min abundance of 10 for human and 1 for the others

Full size table

Table 6 Experiments on (references of) pangenomes with k = 64 and a min abundance of 1

Full size table

We have presented a linear and hence optimal algorithm for computing a minimum SPSS without repetitions for a fixed alphabet size. This closes the open question about its complexity raised in [2, 3]. Using our optimal algorithm, we were able to accurately evaluate the existing heuristics and show that they are very close to the optimum in practice. Further, we have published our algorithm as a command-line tool on github.com [30] and conda [31] and a library on crates.io [32], allowing it to easily be used in any k-mer-based tool. While the difference in cumulative length between previous heuristics and the optimum is not large, our tool works for any value of k, and is, combined with a de Bruijn graph compactor such as BCALM, much faster than ProphAsm, which achieves nearly indistinguishable cumulative length. Hence, our tool is better suited to be used any k-mer-based application including SSHash [5], and specifically in the applications listed in [2] which include compressed storage on disk and k-mer-based queries.

Further, we have presented how bidirected de Bruijn graphs can be formalised without excluding any corner cases. We have also shown how such a graph can be constructed in linear time for a fixed-size alphabet. The construction of the compacted arc-centric bidirected de Bruijn graph in linear time independent of the alphabet size stays an open problem.

Availability of data and materials

The implementation of the Eulertig algorithm is available on github [30]. The name of the project is matchtigs. It is platform independent, and can be compiled locally or installed from bioconda as described in the README of the project. It is licensed under the 2-clause BSD license. The version used for our experiments is available on zenodo [29], and the implementation together with all code to reproduce the experiments is available at [28]. The experiment code is licensed under the Creative Commons Attribution 4.0 International license. See [1] for the availability of the non-original data used for our experiments.

Notes

The original of an adjoint graph can be computed by splitting each node v into two nodes $v'$ and $v''$ such that $v'$ keeps the incoming arcs, and $v''$ the outgoing arcs as in [9, Figure 4] Then, the graph is a collection of complete bipartite graphs [9]. These graphs can be contracted into single nodes, and then we add an arc between the contracted representations of each $v'$ and $v''$. This can be computed in linear time and is the original graph, since all nodes have become arcs again, and the arcs have the correct predecessors and successors.
Strongly connected means that there is a directed path from each node $v_1$ to each node $v_2$.

References

Schmidt S, Khan S, Alanko J, Tomescu AI. Matchtigs: minimum plain text representation of kmer sets. bioRxiv. 2021. https://doi.org/10.1101/2021.12.15.472871.
Article PubMed PubMed Central Google Scholar
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol. 2021;22(1):1–24.
Article Google Scholar
Rahman A, Medevedev P. Representation of k-Mer sets using spectrum-preserving string sets. J Comput Biol. 2021;28(4):381–94.
Article CAS PubMed PubMed Central Google Scholar
Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol. 2020;21(1):1–20.
Article Google Scholar
Pibiri GE. Sparse and skew hashing of k-mers. bioRxiv. 2022. https://doi.org/10.1101/2022.01.15.476199.
Article Google Scholar
Khan J, Kokot M, Deorowicz S, Patro R. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. bioRxiv. 2021. https://doi.org/10.1101/2021.12.14.472718.
Article PubMed PubMed Central Google Scholar
Cracco A, Tomescu AI. Extremely-fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. bioRxiv. 2022. https://doi.org/10.1101/2022.10.24.513174.
Article Google Scholar
Kasprzak M. Classification of de Bruijn-based labeled digraphs. Discrete Appl Math. 2018;234:86–92. https://doi.org/10.1016/j.dam.2016.10.014.
Article Google Scholar
Blazewicz J, Hertz A, Kobler D, de Werra D. On some properties of DNA graphs. Discrete Appl Math. 1999;98(1–2):1–19.
Article Google Scholar
Rahman A, Medvedev P. Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs. Genome Res. 2022;32(9):1746–53.
Article PubMed PubMed Central Google Scholar
Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016;32(12):201–8.
Article Google Scholar
Bankevich A, Bzikadze AV, Kolmogorov M, Antipov D, Pevzner PA. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat Biotechnol. 2022. https://doi.org/10.1038/s41587-022-01220-6.
Article PubMed Google Scholar
Cazaux B, Lecroq T, Rivals E. From indexing data structures to de Bruijn graphs. In: Kulikov AS, Kuznetsov SO, Pevzner P, editors. Symposium on combinatorial pattern matching. Springer: Berlin; 2014. p. 89–99.
Chapter Google Scholar
Kundeti V, Rajasekaran S, Dinh H. An efficient algorithm for Chinese postman walk on bi-directed de Bruijn graphs. In: Wu W, Daescu O, editors. Combinatorial optimization and applications. Berlin, Heidelberg: Springer; 2010. p. 184–96.
Chapter Google Scholar
Gusfield D. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge: Cambridge University Press; 1997. https://doi.org/10.1017/cbo9780511574931.
Book Google Scholar
Mäkinen V, Belazzougui D, Cunial F, Tomescu AI. Genome-scale algorithm design. Cambridge: Cambridge University Press; 2015.
Book Google Scholar
Ukkonen E. On-line construction of suffix trees. Algorithmica. 1995;14(3):249–60.
Article Google Scholar
Farach M. Optimal suffix tree construction with large alphabets. In: Proceedings 38th Annual Symposium on Foundations of Computer Science. IEEE. 1997; p. 137–43.
Belazzougui D, Kosolobov D, Puglisi SJ, Raman R. Weighted ancestors in suffix trees revisited. In: 32nd Annual Symposium on Combinatorial Pattern Matching. 2021.
Belazzougui D, Cunial F. Fully-functional bidirectional burrows-wheeler indexes and infinite-order de bruijn graphs. In: 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). 2019; Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
Belazzougui D, Cunial F, Kärkkäinen J, Mäkinen V. Versatile succinct representations of the bidirectional burrows-wheeler transform. In: Bodlaender HL, Italiano GF, editors. European symposium on algorithms. Springer: Berlin; 2013. p. 133–44.
Google Scholar
Fleischner H. Eulerian graphs and related topics. The Netherlands: Elsevier; 1990.
Google Scholar
Crawford VG, Kuhnle A, Boucher C, Chikhi R, Gagie T. Practical dynamic de Bruijn graphs. Bioinformatics. 2018;34(24):4189–95.
Article CAS PubMed Google Scholar
Muggli MD, Bowe A, Noyes NR, Morley PS, Belk KE, Raymond R, Gagie T, Puglisi SJ, Boucher C. Succinct colored de Bruijn graphs. Bioinformatics. 2017;33(20):3181–7.
Article CAS PubMed PubMed Central Google Scholar
Muggli MD, Alipanahi B, Boucher C. Building large updatable colored de Bruijn graphs via merging. Bioinformatics. 2019;35(14):51–60.
Article Google Scholar
Köster J, Rahmann S. Snakemake-a scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520–2.
Article PubMed Google Scholar
Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018;15(7):475–6.
Article PubMed Google Scholar
Schmidt S. Eulertigs experiments. Zenodo. 2022. https://doi.org/10.5281/zenodo.7371148.
Article Google Scholar
Schmidt S. Eulertigs. Zenodo. 2022. https://doi.org/10.5281/zenodo.7371184.
Article Google Scholar
Schmidt S. Matchtigs. GitHub. https://github.com/algbio/matchtigs. Accessed 15 Apr 2023.
Schmidt S. Matchtigs. Bioconda. https://anaconda.org/bioconda/matchtigs. Accessed 15 Apr 2023.
Schmidt S. Matchtigs. Crates.io. https://crates.io/crates/matchtigs. Accessed 15 Apr 2023.

Download references

Acknowledgements

We wish to thank Andrea Cracco for providing us with the $\sim$309kx Salmonella pangenome. We also wish to thank the anonymous reviewers for there useful constructive feedback, which improved the presentation of the paper, the implementation and the experimental results. Open access funded by Helsinki University Library.

Funding

Open Access funding provided by University of Helsinki including Helsinki University Central Hospital. This work was partially funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO), and by NIH NIAID grant No. R01HG011392 and Academy of Finland grant 339070.

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Helsinki, Finland
Sebastian Schmidt & Jarno N. Alanko
Institute of Biology, National University of Sciences, Kiel, Germany
Jarno N. Alanko

Authors

Sebastian Schmidt
View author publications
You can also search for this author in PubMed Google Scholar
Jarno N. Alanko
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

JNA and SS discovered the problem, SS solved the problem when the de Bruijn graph is given and wrote most of the manuscript, JNA designed the linear-time de Bruijn graph construction algorithm and wrote Sect. "Linear-time construction of compacted bidirected de Bruijn graphs". SS implemented the algorithm and conducted and evaluated the experiments. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Sebastian Schmidt.

Ethics declarations

Ethics approval and consent to participate

This study only uses publicly available datasets, hence an ethics approval or consent to participate is not required.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Schmidt, S., Alanko, J.N. Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time. Algorithms Mol Biol 18, 5 (2023). https://doi.org/10.1186/s13015-023-00227-1

Download citation

Received: 13 February 2023
Accepted: 13 May 2023
Published: 04 July 2023
DOI: https://doi.org/10.1186/s13015-023-00227-1

Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time

Abstract

Introduction

Theorem 1

Preliminaries

Bidirected graphs

Suffix arrays and suffix trees

De Bruijn graphs

Lemma 2

Proof

Lemma 3

Proof

Lemma 4

Proof

Lemma 5

Proof

Theorem 6

Proof

Corollary 7

Linear-time construction of compacted bidirected de Bruijn graphs

Algorithm

Implementation of the subroutines

Theorem 8

Pseudocode

Linear-time minimum SPSS without repetitions

A lower bound for an SPSS without repetitions

Lemma 9

Proof

Definition 10

Theorem 11

Proof

Eulerising a bigraph

Definition 12

Lemma 13

Proof

Lemma 14

Proof

Lemma 15

Proof

Theorem 16

Computing a Eulerian cycle in a bigraph

Lemma 17

Proof

Lemma 18

Proof

Theorem 19

Computing a minimum SPSS without repetitions

Previous heuristics were not optimal

Experiments

Conclusions

Availability of data and materials

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Algorithms for Molecular Biology

Contact us