 Research
 Open Access
A representation of a compressed de Bruijn graph for pangenome analysis that enables search
 Timo Beller^{1} and
 Enno Ohlebusch^{1}Email author
https://doi.org/10.1186/s1301501600837
© The Author(s) 2016
 Received: 6 May 2016
 Accepted: 1 July 2016
 Published: 18 July 2016
The Erratum to this article has been published in Algorithms for Molecular Biology 2016 11:28
Abstract
Background
Recently, Marcus et al. (Bioinformatics 30:3476–83, 2014) proposed to use a compressed de Bruijn graph to describe the relationship between the genomes of many individuals/strains of the same or closely related species. They devised an \(O(n\log g)\) time algorithm called splitMEM that constructs this graph directly (i.e., without using the uncompressed de Bruijn graph) based on a suffix tree, where n is the total length of the genomes and g is the length of the longest genome. Baier et al. (Bioinformatics 32:497–504, 2016) improved their result.
Results
In this paper, we propose a new spaceefficient representation of the compressed de Bruijn graph that adds the possibility to search for a pattern (e.g. an allele—a variant form of a gene) within the pangenome. The ability to search within the pangenome graph is of utmost importance and is a design goal of pangenome data structures.
Keywords
 Compressed de Bruijn graph
 Burrows–Wheeler transform
 Backward search
 Pangenome analysis
Background
Nowadays, next generation sequencers produce vast amounts of DNA sequence information and it is often the case that multiple genomes of the same or closely related species are available. An example is the 1000 Genomes Project, which started in 2008. Its goal was to sequence the genomes of at least 1000 humans from all over the world and to produce a catalog of all variations (SNPs, indels, etc.) in the human population. The genomic sequences together with this catalog is called the “pangenome” of the population. There are several approaches that try to capture variations between many individuals/strains in a population graph; see e.g. [1–4]. These works all require a multialignment as input. By contrast, Marcus et al. [5] use a compressed de Bruijn graph of maximal exact matches (MEMs) as a graphical representation of the relationship between genomes; this is basically a compressed version of the colored de Bruijn graph introduced in [6]. They describe an \(O(n\log g)\) time algorithm that directly computes the compressed de Bruijn graph on a suffix tree, where n is the total length of the genomes and g is the length of the longest genome. Marcus et al. write in [5, Section 4]: “Future work remains to improve splitMEM and further unify the family of sequence indices. Although ..., most desired are techniques to reduce the space consumption ...”
The de Bruijn graph is also used in sequence assembly where an unknown long DNA sequence is reconstructed from a set of kmers (strings of length k). A de Bruijn graph of order k stores a kmer in each node. In sequence assembly, the de Bruijn graph has an edge between two nodes if the corresponding kmers overlap exactly \(k1\) characters. In this paper, all kmers originate from several longer strings and two nodes are connected by an edge if the corresponding kmers occur consecutively in one of the strings. The difference can be seen in Fig. 1 for \(k=3\): Although TAC and ACT overlap \(k1\) characters, there is no edge between them because they do not occur consecutively in the string. Therefore, algorithms that construct de Bruijn graphs for assembly (see e.g. [10–13] and references therein) can not be used for our problem and vice versa.
Recently, [14] suggested to use a Bloom Filter Trie for pangenome storage: They use a colored de Bruijn graph, where each stored kmer has a color, but the condition for an edge is the same as in sequence assembly, i.e., the presence of an edge is independent of the color and the origin of the corresponding kmers. Consequently, their algorithm can not be used in our application.
The contracted de Bruijn graph introduced by Cazaux et al. [15] is closely related but not identical to the compressed de Bruijn graph. A node in the contracted de Bruijn graph is not necessarily a substring of one of the genomic sequences (see the remark following Definition 3 in [15]). Thus the contracted de Bruijn graph, which can be constructed in linear time from the suffix tree [15], is not useful for our purposes.
Preliminaries
Index data structures of the string ACTACGTACGTACG$
i  \(\mathsf {SA}\)  \(\mathsf {LCP}\)  \(B_{r}\)  \(B_{l}\)  \(\mathsf {LF}\)  \(\Psi \)  \(\mathsf {BWT}\)  \(S_{\mathsf {SA}[i]}\) 

1  15  −1  0  0  10  5  G  $ 
2  12  0  1  0  13  6  T  ACG$ 
3  8  3  0  0  14  7  T  ACGTACG$ 
4  4  7  1  0  15  8  T  ACGTACGTACG$ 
5  1  2  0  0  1  9  $  ACTACGTACGTACG$ 
6  13  0  0  0  2  10  A  CG$ 
7  9  2  0  0  3  11  A  CGTACG$ 
8  5  6  0  0  4  12  A  CGTACGTACG$ 
9  2  1  0  1  5  15  A  CTACGTACGTACG$ 
10  14  0  0  0  6  1  C  G$ 
11  10  1  0  0  7  13  C  GTACG$ 
12  6  5  0  1  8  14  C  GTACGTACG$ 
13  11  0  0  0  11  2  G  TACG$ 
14  7  4  0  0  12  3  G  TACGTACG$ 
15  3  8  0  0  9  4  C  TACGTACGTACG$ 
16  −1 
Let \(\Sigma \) be an ordered alphabet of size \(\sigma \) whose smallest element is the sentinel character \({\$}\). In the following, S is a string of length n on \(\Sigma \) having the sentinel character at the end (and nowhere else). In pangenome analysis, S is the concatenation of multiple genomic sequences, where the different sequences are separated by special symbols (in practice, we use one separator symbol and treat the different occurrences of it as if they were different characters; see “Computation of rightmaximal kmers and node identifiers” section). For \(1 \le i \le n\), S[i] denotes the character at position i in S. For \(i \le j\), S[i..j] denotes the substring of S starting with the character at position i and ending with the character at position j. Furthermore, \(S_i\) denotes the ith suffix S[i..n] of S. The suffix array \(\mathsf {SA}\) of the string S is an array of integers in the range 1 to n specifying the lexicographic ordering of the n suffixes of S, that is, it satisfies \(S_{\mathsf {SA}[1]}< S_{\mathsf {SA}[2]}< \cdots < S_{\mathsf {SA}[n]};\) see Table 1 for an example. A suffix array can be constructed in linear time; see e.g. the overview article [16]. For every substring \(\omega \) of S, the \(\omega \)interval is the suffix array interval [i..j] so that \(\omega \) is a prefix of \(S_{\mathsf {SA}[k]}\) if and only if \(i\le k \le j\).
The Burrows–Wheeler transform [17] converts S into the string \(\mathsf {BWT}[1..n]\) defined by \(\mathsf {BWT}[i]=S[\mathsf {SA}[i] 1]\) for all i with \(\mathsf {SA}[i] \ne 1\) and \(\mathsf {BWT}[i] = {\$}\) otherwise; see Table 1. Several semiexternal and external memory algorithms are known that construct the \(\mathsf {BWT}\) directly (i.e., without constructing the suffix array); see e.g. [18–21].
The wavelet tree [22] of the \(\mathsf {BWT}\) supports one backward search step in \(O(\log \sigma )\) time [23]: Given the \(\omega \)interval [lb..rb] and a character c from the alphabet \(\Sigma \), backwardSearch(c, [lb..rb]) returns the \(c\omega \)interval [i..j] (i.e., \(i\le j\) if \(c\omega \) is a substring of S; otherwise \(i > j\)). This crucially depends on the fact that a bit vector B can be preprocessed in linear time so that an arbitrary \(rank_1(B,i)\) query (asks for the number of ones in B up to and including position i) can be answered in constant time [24]. Backward search can be generalized on the wavelet tree as follows: Given an \(\omega \)interval [lb..rb], a slight modification of the procedure getIntervals([lb..rb]) described in [25] returns the list \([(c,[i..j]) \mid c\omega \text{ is } \text{ a } \text{ substring } \text{ of } S \text{ and } [i..j] \text{ is } \text{ the } \) \(c\omega \text{interval}]\), where the first component of an element (c, [i..j]) must be a character. The worstcase time complexity of the procedure getIntervals is \(O(z + z \log (\sigma /z))\), where z is the number of elements in the output list; see [26, Lemma 3].
The \(\mathsf {LF}\)mapping (lasttofirstmapping) is defined as follows: If \(\mathsf {SA}[i] = q\), then \(\mathsf {LF}(i)\) is the index j so that \(\mathsf {SA}[j] = q1\) (if \(\mathsf {SA}[i] = 1\), then \(\mathsf {LF}(i)=1\)). In other words, if the ith entry in the suffix array is the suffix \(S_q\), then \(\mathsf {LF}(i)\) “points” to the entry at which the suffix \(S_{q1}\) can be found; see Table 1. The function \(\Psi \) is the inverse of the \(\mathsf {LF}\)mapping. Using the wavelet tree of the \(\mathsf {BWT}\), a value \(\mathsf {LF}(i)\) or \(\Psi (i)\) can be calculated in \(O(\log \sigma )\) time. For later purposes, we recall how the \(\mathsf {LF}\)mapping can be computed from the \(\mathsf {BWT}\). First, the Carray is calculated, where for each \(c \in \Sigma \), C[c] is the overall number of occurrences of characters in \(\mathsf {BWT}\) that are strictly smaller than c. Second, if in a lefttoright scan of the \(\mathsf {BWT}\), where the loopvariable i varies from 1 to n, C[c] is incremented by one for \(c = \mathsf {BWT}[i]\), then \(\mathsf {LF}[i] = C[c]\).
The suffix array \(\mathsf {SA}\) is often enhanced with the socalled \(\mathsf {LCP}\)array containing the lengths of longest common prefixes between consecutive suffixes in \(\mathsf {SA}\); see Table 1. Formally, the \(\mathsf {LCP}\)array is an array so that \(\mathsf {LCP}[1] = 1 = \mathsf {LCP}[n+1]\) and \(\mathsf {LCP}[i] = \left\mathsf {lcp}(S_{\mathsf {SA}[i1]},S_{\mathsf {SA}[i]})\right\) for \(2\le i \le n\), where \(\mathsf {lcp}(u,v)\) denotes the longest common prefix between two strings u and v. The \(\mathsf {LCP}\)array can be computed in linear time from the suffix array and its inverse, but it is also possible to construct it directly from the wavelet tree of the \(\mathsf {BWT}\) in \(O(n \log \sigma )\) time with the help of the procedure getIntervals [25].
A substring \(\omega \) of S is a repeat if it occurs at least twice in S. Let \(\omega \) be a repeat of length \(\ell \) and let [i..j] be the \(\omega \)interval. The repeat \(\omega \) is leftmaximal if \(\{\mathsf {BWT}[x] \mid i \le x \le j\} \ge 2\), i.e., the set \(\{S[\mathsf {SA}[x]1] \mid i \le x \le j\}\) of all characters that precede at least one of the suffixes \(S_{\mathsf {SA}[i]},\dots ,S_{\mathsf {SA}[j]}\) is not singleton (where \(S[0] := \$ \)). Analogously, the repeat \(\omega \) is rightmaximal if \(\{S[\mathsf {SA}[x]+\ell ] \mid i \le x \le j\} \ge 2\). A left and rightmaximal repeat is called maximal repeat. (Note that [5] use the term “maximal exact match” instead of the more common term “maximal repeat”. We will not use the term “maximal exact match” here). A detailed explanation of the techniques used here can be found in [27].
Compressed de Bruijn graph
Figure 2 shows how splitMEM represents the compressed de Bruijn graph G for \(k=3\) and the string \(S=\) ACTACGTACGTACG$. Each node corresponds to a substring \(\omega \) of S and consists of the components (id, len, posList, adjList), where id is a natural number that uniquely identifies the node, len is the length \(\omega \) of \(\omega \), posList is the list of positions at which \(\omega \) occurs in S (sorted in ascending order), and adjList is the list of the successors of the node (sorted in such a way that the walk through G that gives S is induced by the adjacency lists: if node G[id] is visited for the ith time, then its successor is the node that can be found at position i in the adjacency list of G[id]).

A uniqueNode represents a unique substring in the pangenome and has a single start position (i.e., posList contains just one element)

A repeatNode represents a substring that occurs at least twice in the pangenome, either as a repeat in a single genome or as a segment shared by multiple genomes.
According to [5], the compressed de Bruijn graph is most suitable for pangenome analysis: “This way the complete pangenome will be represented in a compact graphical representation such that the shared/strainspecific status of any substring is immediately identifiable, along with the context of the flanking sequences. This strategy also enables powerful topological analysis of the pangenome not possible from a linear representation.” It has one defect though: it is not possible to search efficiently for certain nodes and then to explore the graph in the vicinity of these nodes. A user might, for example, want to search for a certain allele in the pangenome and—if it is present—to examine the neighborhood of that allele in the graph. Here, we propose a new spaceefficient representation of the compressed de Bruijn graph that adds exactly this functionality.

The variable len is the length of the string \(\omega = S[\mathsf {SA}[lb]..\mathsf {SA}[lb]+len1]\) that corresponds to the node with identifier id

\([lb..lb+size1]\) is the \(\omega \)interval, lb is its left boundary, and size is its size

\([{ suffix}\_lb..{ suffix}\_lb+size1]\) is the interval of the k length suffix of \(\omega \)
Figure 3 shows an example. Henceforth this representation will be called implicit representation, while the representation from Fig. 2 will be called explicit representation. It is clear that in the implicit representation the list of all positions at which \(\omega \) occurs in S can be computed as follows: \([\mathsf {SA}[i] \mid lb \le i \le lb+size1]\). It will be explained later, how the graph can be traversed and how a pattern can be searched for. We shall see that this can be done efficiently by means of the fourth component \({ suffix}\_lb\).
Construction algorithm
We will build the implicit representation of the compressed de Bruijn graph directly from an FMindex (the wavelet tree of the \(\mathsf {BWT}\)) of S, using Lemma 1 (the simple proof is omitted).
Lemma 1
Let v be a node in the compressed de Bruijn graph and let \(\omega \) be the string corresponding to v. If v is not the start node, then it has at least two different predecessors if and only if the length k prefix of \(\omega \) is a leftmaximal repeat. It has at least two different successors if and only if the length k suffix of \(\omega \) is a rightmaximal repeat.
The general idea behind our algorithm is as follows. Compute the suffix array intervals of all rightmaximal kmers. For each such kmer v, compute all cvintervals, where \(c\in \Sigma \). Then, for each \(u=cv\), compute all buintervals, where \(b\in \Sigma \), etc. In other words, we start with all rightmaximal kmers and extend them as long as possible (and in all possible ways with the procedure getIntervals), character by character, to the left. According to Lemma 1, the leftextension of a string \(\omega \) must stop if (i) the length k prefix of \(\omega \) is a leftmaximal repeat (this is the case if the procedure getIntervals applied to the \(\omega \)interval returns a nonsingleton list). It must also stop if (ii) the length k prefix v of \(c\omega \) is a rightmaximal repeat for some \(c\in \Sigma \); see Fig. 4. This is because by Lemma 1 there is a node uv, \(u\in \Sigma ^*\), in the compressed de Bruijn graph with at least two different successors (the length k suffix v of uv is a rightmaximal repeat). Consequently, there must be a directed edge \((uv,\omega )\) in the compressed de Bruijn graph. In the following, we will explain the different phases of the algorithm in detail.
Computation of rightmaximal kmers and node identifiers
As explained above, there is a directed edge \((v,\omega )\) in the compressed de Bruijn graph if the klength suffix of v is a rightmaximal repeat or the klength prefix of \(\omega \) is a leftmaximal repeat. Since we proceed from right to left, we must be able to compute an identifier id of the node corresponding to v. In case that the klength suffix of v is a rightmaximal repeat, we will use the bit vector \(B_{r}\) to compute the identifier id. \(B_{r}\) has the property that \(B_{r}[i]=1\) if and only if the kmer \(u=S[\mathsf {SA}[i]..\mathsf {SA}[i]+k1]\) is a rightmaximal repeat and \(S_{\mathsf {SA}[i]}\) is either the lexicographically smallest or the lexicographically largest suffix of S that has u as a prefix (in other words, the left and right boundary of the uinterval in the suffix array is marked by a 1 in \(B_{r}\)). In case that the klength suffix of v is not a rightmaximal repeat but the klength prefix of \(\omega \) is a leftmaximal repeat, we will use the bit vector \(B_{l}\) to compute the identifier id. \(B_{l}\) has the property that \(B_{l}[i]=1\) if and only if (i) the kmer \(u=S[\mathsf {SA}[i]..\mathsf {SA}[i]+k1]\) is not a rightmaximal repeat, (ii) \(S_{\mathsf {SA}[i]}\) is the lexicographically largest suffix of S that has u as a prefix (in other words, i is the right boundary of the uinterval in the suffix array), and the kmer \(S[\mathsf {SA}[i]+1..\mathsf {SA}[i]+k]\) is a leftmaximal repeat.
To obtain the bit vector \(B_{r}\), we must compute all rightmaximal kmers and their suffix array intervals. Let u be a rightmaximal kmer and consider the uinterval [lb..rb] in the suffix array. Note that (1) \(\mathsf {LCP}[lb] < k\) and (2) \(\mathsf {LCP}[rb+1] < k\). Since u is rightmaximal, u is the longest common prefix of all suffixes in the interval [lb..rb]. This implies (3) \(\mathsf {LCP}[j] \ge k\) for all j with \(lb+1\le j \le rb\) and (4) \(\mathsf {LCP}[j] = k\) for at least one j with \(lb+1\le j \le rb\) (in the terminology of [28], [lb..rb] is an lcpinterval of lcpvalue k). It follows as a consequence that the bit vector \(B_{r}\) can be calculated with the help of the \(\mathsf {LCP}\)array. Using the algorithm of [25], Algorithm 1 constructs the \(\mathsf {LCP}\)array directly from the \(\mathsf {BWT}\) in \(O(n \log \sigma )\) time, where \(\sigma \) is the size of the alphabet. It is not difficult to verify that lines 9–17 of Algorithm 1 compute all suffix array intervals of rightmaximal kmers. Furthermore, on lines 16 and 17 the boundaries lb and \(rb=i1\) of the kmer intervals are marked by setting the entries of \(B_{r}\) at these positions to 1. On line 18, the node \((lb,k,ilb,lb)\) having the current value of the variable counter as identifier is added to the graph G. In contrast to the last two components, the first two components of a node may change later (they will change when a leftextension is possible). On line 19, the node identifier is added to the queue Q and then counter is incremented by one.
We would like to stress that all rightmaximal kmers can be determined without the entire \(\mathsf {LCP}\)array. In order to verify whether or not an interval satisfies properties (1)–(4), it is sufficient to compute all entries ≤ k in the \(\mathsf {LCP}\)array (the others have a value > k). Since the algorithm of [25] calculates entries in the \(\mathsf {LCP}\)array in ascending order, it is ideal for our purposes. We initialize an array L with values 2 and set \(L[1]=0\) and \(L[n+1]=0\). Two bits are enough to encode the case “\(< k\)” by 0, the case “\(= k\)” by 1, and the case “\(> k\)” by 2 (so initially all entries in the \(\mathsf {LCP}\)array are marked as being \(>k\), except for L[1] and \(L[n+1]\), which are marked as being \(<k\)). Then, for \(\ell \) from 0 to \(k1\), the algorithm of [25] calculates all indices p with entries \(\mathsf {LCP}[p] = \ell \) and sets \(L[p] = 0\). Furthermore, it continues to calculate all indices q with entries \(\mathsf {LCP}[q] = k\) and sets \(L[q] = 1\). Now the array L contains all the information that is needed to compute rightmaximal kmers.
As already mentioned, in pangenome analysis \(S=S^1\#S^2\#\dots S^{d1}\#S^d\$ \) is the concatenation of multiple genomic sequences \(S^1,\dots ,S^d\), separated by a special symbol \(\#\). Our algorithm treats the different occurrences of \(\#\) as if they were different characters. Assuming that \(\#\) is the second smallest character, this can be achieved as follows. As explained above, all rightmaximal kmers can be determined without the entire \(\mathsf {LCP}\)array if the algorithm in [25] is used. If there are \(d1\) occurrences of \(\#\) in total and this algorithm starts with \(d1\) singleton intervals [s..s], \(2\le s \le d\), instead of the \(\#\)interval \([2..d]\), then the different occurrences of \(\#\) are treated as if they were different characters.
Bit vector \(B_{l}\) is computed on lines 7–29 of Algorithm 1 as follows: If the suffix array interval [lb..rb] of a repeat \(\omega \) of length \(\ge k\) is detected, then it must be checked whether or not \(\omega \) is leftmaximal (note that \(rb= i1\)). Recall that \(\omega \) is a leftmaximal repeat if and only if \(\{\mathsf {BWT}[lb],\mathsf {BWT}[lb+1],\dots , \mathsf {BWT}[rb]\} \ge 2\). Algorithm 1 checks this condition by keeping track of the largest index \( lastdiff \) at which the characters \(\mathsf {BWT}[ lastdiff 1]\) and \(\mathsf {BWT}[ lastdiff ]\) differ; see lines 28 and 29. Since \( lastdiff \le rb= i1\), the characters \(\mathsf {BWT}[lb],\mathsf {BWT}[lb+1],\dots , \mathsf {BWT}[rb]\) are not all the same if and only if \( lastdiff > lb\). If this condition on line 21 evaluates to true, then for each \(c\notin \{\#,\$\}\) in \(\mathsf {BWT}[lb..rb]\) the algorithm sets \(B_{l}[\mathsf {LF}[q]]\) to 1 in lines 23–25, where q is the index of the last occurrence of \(c \in \mathsf {BWT}[lb..rb]\) and \(\mathsf {LF}\) is the lasttofirst mapping. How this is done by means of the Carray will be explained below. So a one in \(B_{l}\) marks a kmer that precedes a leftmaximal kmer. Since we are only interested in those kmers that are not rightmaximal (rightmaximal kmers are already covered by bit vector \(B_{r}\)), lines 30–38 of Algorithm 1 reset those onebits in \(B_{l}\) to zero that mark a rightmaximal kmer.
It remains for us to explain the computation of the \(B_{l}\) vector with the Carray. After the computation of the Carray on line 3 of Algorithm 1, for each \(c \in \Sigma \), C[c] is the overall number of occurrences of characters in S that are strictly smaller than c. Moreover, after line 8 of Algorithm 1 was executed, we have \(C[\mathsf {BWT}[i1]]=\mathsf {LF}[i1]\) (to see this, recall from "Preliminaries" section how the \(\mathsf {LF}\)mapping can be computed from the \(\mathsf {BWT}\)). Thus, when the forloop on lines 7–29 of Algorithm 1 is executed for a certain value of i, we have \(C[c]=\mathsf {LF}[q] \) for each character c in \(\mathsf {BWT}[1..i1]\), where q is the index of the last occurrence of c in \(\mathsf {BWT}[1..i1]\). Algorithm 1 uses this fact on line 25: \(C[c]=\mathsf {LF}[q] \), where q is the index of the last occurrence of c in \(\mathsf {BWT}[lb..i1]\).
Apart from the direct construction of the \(\mathsf {LCP}\)array from the \(\mathsf {BWT}\), which takes \(O(n \log \sigma )\) time, Algorithm 1 has a linear runtime. The overall runtime is therefore \(O(n \log \sigma )\).
Construction of the spaceefficient representation
Algorithm 2 constructs the implicit representation of the compressed de Bruijn graph. It calls Algorithm 1, which computes—besides the two bit vectors \(B_{r}\) and \(B_{l}\)—the suffix array interval \([lb..lb+size1]\) of each rightmaximal kmer \(\omega \), stores the quadruple (k, lb, size, lb) at G[id], where \(id = (rank_1(B_{r},lb)+1)/2\) (this is because Algorithm 1 computes rightmaximal kmer intervals in lexicographical order), and adds id to the (initially empty) queue Q. The attributes G[id].size and \(G[id].{ suffix}\_lb\) will never change, but the attributes G[id].len and G[id].lb will change when a leftextension is possible. In the forloop on lines 7–11, the stop nodes are added to G and their identifiers are added to Q. In the whileloop on lines 12–31, as long as the queue Q is not empty, the algorithm removes an identifier id from Q and in a repeatloop computes \(list = getIntervals([lb..rb])\), where \(lb=G[id].lb\) and \(rb= lb + G[id].size  1\). During the repeatloop, the interval [lb..rb] is the suffix array interval of some string \(\omega \) of length G[id].len. In the body of the repeatloop, a flag extendable is set to false. The procedure call getIntervals([lb..rb]) then returns the list list of all \(c\omega \)intervals. At this point, the algorithm tests whether or not the length k prefix of \(c\omega \) is a rightmaximal repeat. It is not difficult to see that the length k prefix of \(c\omega \) is a rightmaximal repeat if and only if the \(c\omega \)interval [i..j] is a subinterval of a rightmaximal kmer interval. Here, the bit vector \(B_{r}\) comes into play. At the beginning of Algorithm 2, all suffix array intervals of rightmaximal kmers have been computed and their boundaries have been marked in \(B_{r}\). It is crucial to note that these intervals are disjoint. Lemma 2 shows how the bit vector \(B_{r}\) can be used to test for nonrightmaximality.
Lemma 2
The \(c\omega \) interval [i..j] is not a subinterval of a rightmaximal k mer interval if and only if \(rank_1(B_{r},i)\) , the number of ones in \(B_{r}\) up to (and including) position i , is even and \(B_{r}[i] = 0\).
Proof
“Onlyif:” Suppose [i..j] is not a subinterval of a rightmaximal kmer interval. Since [i..j] cannot overlap with a rightmaximal kmer interval, it follows that \(rank_1(B_{r},i)\) must be even and \(B_{r}[i..j]\) contains only zeros.
“if:” Suppose [i..j] is a subinterval of a rightmaximal kmer interval [p..q]. If \(i\ne j\), then \(rank_1(B_{r},i)\) must be odd. If \(i = j\), then \(rank_1(B_{r},i)\) may be even. But in this case i must be the right boundary of the interval [p..q], so \(B_{r}[i] = B_{r}[q] =1\).\(\hfill\square\)
 1.
If list contains just one element (c, [i..j]), then \(\omega \) is not leftmaximal. In this case, the algorithms sets extendable to true, G[id].lb to i, and increments G[id].len by one. Now G[id] represents the \(c\omega \)interval [i..j] and the repeatloop continues with this interval. Note that \(G[id].size = ji+1\) because \(\omega \) is not leftmaximal.
 2.
Otherwise, \(\omega \) is leftmaximal. In this case, a split occurs (so the attributes of G[id] will not change any more) and Algorithm 2 must continue with the kmer prefix \(x=c\omega [1..k]\) of \(c\omega \). For the correctness of the algorithm, it is important to note that the interval [i..j] is the xinterval; see Lemma 3. We use the bit vector \(B_{l}\) to assign the unique identifier \(newId=rightMax + rank_1(B_{l},i1)+1\) to the next node, which corresponds to (or ends with) x (recall that rightMax is the number of all rightmaximal kmers and that x is not a rightmaximal kmer). So a quadruple \((k,i,ji+1,i)\) is inserted at G[newId] and newId is added to Q.
Lemma 3
Consider the \(c\omega \) interval [i..j] in Case 2 of Algorithm 2 (beginning at line 27). The interval [i..j] coincides with the \(c\omega [1..k]\) interval [p..q].
Proof
 1.
\(\ell =k\): In this case, Algorithm 2 stops (the length k prefix cu of \(c\omega \) is a rightmaximal repeat), so it cannot execute Case 2; a contradiction.
 2.
\(\ell > k\): Note that u has length \(\ell 1 \ge k\). Since u is a rightmaximal repeat, it is impossible that the procedure getIntervals is applied to the \(\omega \)interval [lb..rb]. This contradiction proves the Lemma.\(\hfill\square\)
As an example, we apply Algorithm 2 to \(k=3\) and the \(\mathsf {LCP}\)array and the \(\mathsf {BWT}\) of the string ACTACGTACGTACG$; see Table 1. There is only one right maximal kmer, ACG, so a node \((len,lb,size,{ suffix}\_lb) = (3,2,3,2)\) is inserted at G[1] and the identifier 1 is added to the queue Q in Algorithm 1. On line 9 of Algorithm 2 the stop node is added to G. It has the identifier \(rightMax+leftMax+1 = 1+2+1=4\), so G[4] is set to (1, 1, 1, 1) and 4 is added to Q. In the whileloop, the identifier 1 of node (3, 2, 3, 2) is dequeued and the procedure call getIntervals([2..4]) returns a list that contains just one interval, the TACGinterval [13..15]. Since \(rank_1(B_{r},13) = 2\) is even and \(B_{r}[13]=0\), Case 1 applies. So extendable is set to true and G[1] is modified to (4, 13, 3, 2). In the next iteration of the repeatloop, getIntervals([13..15]) returns the list \([(\texttt {C},[9..9]), (\texttt {G},[11..12])]\), where [9..9] is the CTACGinterval and [11..12] is the GTACGinterval. It is readily verified that Case 2 applies in both cases. For the CTACGinterval [9..9] we obtain the identifier \(rightMax+rank_1(B_{l},91)+1 = 1+0+1=2\), so G[2] is set to (3, 9, 1, 9). Analogously, the GTACGinterval [11..12] gets the identifier \(rightMax+rank_1(B_{l},111)+1 = 1+1+1=3\) and G[3] is set to (3, 11, 2, 11). Furthermore, the identifiers 2 and 3 are added to the queue Q. Next, the identifier 4 of the stop node (1, 1, 1, 1) is dequeued and the procedure call getIntervals([1..1]) returns a list that contains just one interval, the G$interval [10..10]. Case 1 applies, so G[4] is modified to (2, 10, 1, 1). In the second iteration of the repeatloop, getIntervals([10..10]) returns the CG$interval [6..6]. Again Case 1 applies and G[4] is modified to (3, 6, 1, 1). In the third iteration of the repeatloop, getIntervals([6..6]) returns the ACG$interval [2..2]. This time, \(rank_1(B_{r},2)=1\) is odd and therefore the repeatloop terminates. The computation continues until the queue Q is empty; the final compressed de Bruijn graph is shown in Fig. 3.
We claim that Algorithm 2 has a worstcase time complexity of \(O(n \log \sigma )\) and use an amortized analysis to prove this. Since the compressed de Bruijn graph has at most n nodes, it is an immediate consequence that at most n identifiers enter and leave the queue Q (this covers Case 2). Case 1 can occur at most n times because there are at most n leftextensions; so at most n intervals generated by the procedure getIntervals belong to this category. Each leftextension eventually ends; so at most n intervals generated by the procedure getIntervals belong to this category because there are at most n leftextensions. In summary, at most 2n intervals are generated by the procedure getIntervals. Since this procedure takes \(O(\log \sigma )\) time for each generated interval, the claim follows.
Construction of the explicit compressed de Bruijn graph
If the pangenome consists of \(d\) sequences, then \(S=S^1\#S^2\#\dots S^{d1}\#S^d\$ \) and there are d stop nodes. Since the implicit representation allows for an efficient backward traversal, there is no need for start nodes. By contrast, the explicit graph must provide them. That is why Algorithm 3 stores them in an array StartNodes of size d.
Algorithm 3 starts with the stop node of the last sequence \(S^d\), which has identifier \(id=rightMax+leftMax+1\). Let \(\omega \) be the string corresponding to node id. Since \(\omega \) ends with $ and $ appears at position n in S, the start position of \(\omega \) in S is \(pos=nG[id].len +1\). Consequently, pos is added to the front of G[id].posList on line 7 of Algorithm 3. Next, we have to find the predecessor of node id. It is not difficult to see that \(idx=G[id].lb\) is the index in the suffix array at which the suffix \(S_{pos}\) can be found (note that \(S_{pos}\) has \(\omega \) as a prefix). Clearly, \(i=LF(idx)\) is the index of the suffix \(S_{pos1}\) in the suffix array. Note that \(S_{pos1}\) has \(c\omega \) as a prefix, where \(c=\mathsf {BWT}[idx]\). If c is not a separator symbol (i.e., \(c \notin \{\#,{\$}\}\)), then the predecessor of node id is the node newId whose corresponding string u ends with the kmer prefix \(x = c\omega [1..k]\) of \(c\omega \). If x is a rightmaximal kmer, then newId is \((rank_1(B_{r},i)+1)/2\), otherwise it is \(rightMax + rank_1(B_{l},i1)+1\). Note that u ends at position \(pos1+(k1)\) in S because u and \(\omega \) overlap \(k1\) characters. It follows as a consequence that u starts at position \(newPos = pos1+k1G[newId].len +1 = pos1(G[newId].len  k)\). So the position newPos is added to the front of the position list of G[newId]. Because node G[id] is the successor of node G[newId], the identifier id is added to the front of the adjacency list of G[newId]. To find the predecessor of node newId in the same fashion, we must find the index idx at which the suffix \(S_{newPos}\) can be found in the suffix array. According to Lemma 4, this is \(G[newId].lb+(iG[newId].{ suffix}\_lb)\). The whileloop repeats the search for a predecessor node until a separator symbol is found. In this case, a start node has been reached and its identifier is stored in an array StartNodes of size d. Since there are d separator symbols, the whole process is executed d times.
Lemma 4
Let \(G[id] = (len,lb,size,{ suffix}\_lb)\) be a node in the implicit representation of the compressed de Bruijn graph. If G[id] is not a stop node and suffix \(S_p\) appears at index i in the interval \([b..e]=[{ suffix}\_lb..{ suffix}\_lb+size1]\) (i.e., \(\mathsf {SA}[i]=p\) ), then the suffix \(S_{p+(lenk)}\) appears at index \(lb+(i{ suffix}\_lb)\) in the interval \([lb..lb+size1]\).
Proof
Let u be the string corresponding to G[id] and let x be the kmer suffix of u. By construction, \([lb..lb+size1]\) is the uinterval and [b..e] is the xinterval in the suffix array. If \(u=x\), then \(len=k\), \(lb = { suffix}\_lb\), and there is nothing to show. So suppose \(u\ne x\) and let c be the character that precedes x in u (recall that x is not leftmaximal). Since \(S_{\mathsf {SA}[b]}< S_{\mathsf {SA}[b+1]}< \dots < S_{\mathsf {SA}[e]}\), it follows that \(cS_{\mathsf {SA}[b]}< cS_{\mathsf {SA}[b+1]}< \dots < cS_{\mathsf {SA}[e]}\). In other words, the cxinterval contains the suffixes \(S_{\mathsf {SA}[b]1}< S_{\mathsf {SA}[b+1]1}< \dots < S_{\mathsf {SA}[e]1}\). Consequently, if i is the qth element of [b..e] and \(\mathsf {SA}[i]=p\), then \(\mathsf {LF}(i)\) is the qth element of the cxinterval and \(\mathsf {SA}[\mathsf {LF}(i)]=p1\) (this implies in particular that \([\mathsf {LF}(b)..\mathsf {LF}(e)]\) is the cxinterval). Iterating this argument \(lenk\) times yields the Lemma. \(\square \)
Algorithm 3 has a worstcase time complexity of \(O(N\log \sigma )\), where N is the number of edges in the compressed de Bruijn graph. This is because in each execution of the whileloop an edge is added to the graph and a value \(\mathsf {LF}(idx)\) is computed in \(O(\log \sigma )\) time (all other operations take only constant time). Since the uncompressed de Bruijn graph has at most n edges, so does the compressed graph. Hence \(N \le n\). In fact, N is much smaller than n in virtually all cases. It follows from the preceding section that N can be characterized in terms of left and rightmaximal kmer repeats. We have seen that the number of nodes in the compressed de Bruijn graph equals \(V_1+V_2+d=rightMax+leftMax+d\), where \(V_1 = \{\omega \mid \omega \text{ is } \text{ a } \text{ rightmaximal } k\text{mer } \text{ repeat } \text{ in } S\}\) and \(V_2 = \{\omega \mid \exists i \in \{1,\dots ,nk\} : \omega = S[i..i+k1] \notin V_1\) and \(S[i+1..i+k] \text{ is } \text{ a } \text{ leftmaximal } k\text{mer } \text{ repeat } \text{ in } S\}\); the stop nodes are taken into account by adding d. The number N of edges in the compressed de Bruijn graph therefore is \(\{i \mid 1 \le i \le nk \text{ and } S[i..i+k1] \in V_1 \cup V_2\}\).
Operations on the compressed de Bruijn graph
It is our next goal to show how the combination of the implicit graph and the FMindex can be used to search for a pattern P of length \(m\ge k\). This is important, for example, if one wants to search for a certain allele in the pangenome and—if it is present—to examine the neighborhood of that allele in the graph. Algorithm 4 shows pseudocode for such a search. The main difficulty is to find the node of the klength suffix of P in the implicit graph. Once we have found this node, we can use the method introduced in the previous section to continue the search (where backward search replaces the \(\mathsf {LF}\)mapping).
Using the FMindex, we first find the suffix array interval [i..j] of the kmer suffix \(P[mk+1..m]\) of P. If \(i\le j\) (i.e., \(P[mk+1..m]\) occurs in the pangenome), we search for the node G[id] whose corresponding string \(\omega \) contains \(P[mk+1..m]\). If \(P[mk+1..m]\) is a suffix of \(\omega \), then the unknown identifier id can be determined by lines 9–13 in Algorithm 4. If it is not a suffix of \(\omega \), then there is a suffix u of \(\omega \) that has \(P[mk+1..m]\) as prefix; see Fig. 5. The key observation is that [i..j] is the suffix array interval of u. Moreover, u can be written as \(c_1c_2\dots c_{\ell } x\), where \(c_q \in \Sigma \) for \(q\in \{1,\dots ,\ell \}\) and x is the kmer suffix of u. Note that the value of \(\ell \) is unknown. Since \(c_2\dots c_{\ell } x\) is not leftmaximal, it follows that \([\Psi (i)..\Psi (j)]\) is its suffix array interval (this can be proven by similar arguments as in the proof of Lemma 4). Algorithm 4 iterates this process until either on line 18 the identifier of a stop node or on lines 9–13 the identifier of a nonstopnode is found. In the latter case, there are \(\ell \) characters before the kmer suffix x of u; so \(u=\ell +k\) and therefore \(G[id].len  \ell k\) characters precede u in \(\omega \) (see line 21). In the former case, \(u=c_1c_2\dots c_{\ell } \#\) has length \(\ell +1\) and thus \(G[id].len  \ell 1\) characters precede u in \(\omega \). To obtain this value on line 21, k is subtracted from \(\ell +1\) on line 20.
 1.
If \(\ell > 0\), then the current prefix of P[pos..m] still belongs to the current node. In this case \(\ell \) is decremented by one.
 2.
If \(\ell = 0\), then the kmer prefix of P[pos..m] belongs to the predecessor node of the current node. Its identifier id is determined in the usual way and then added to the front of resList. The variable \(\ell \) is set to the new value \(G[id].lenk\) because so many characters precede the kmer prefix of P[pos..m] in the string corresponding to node G[id].

In which sequences (or genomes) does pattern P (or node v) occur?

In how many sequences (or genomes) does pattern P (or node v) occur?

How often does pattern P (or node v) occur in a specific sequence (or genome)?
Experimental results
The experiments were conducted on a 64 bit Ubuntu 14.04.1 LTS (Kernel 3.13) system equipped with two tencore Intel Xeon processors E52680v2 with 2.8 GHz and 128GB of RAM (but no parallelism was used). All programs were compiled with g++ (version 4.8.2) using the provided makefile. As test files we used the E. coli genomes listed in the supplementary material of [5]. Additionally, we used 5 different assemblies of the human reference genome (UCSC Genome Browser assembly IDs: hg16, hg17, hg18, hg19, and hg38) as well as the maternal and paternal haplotype of individual NA12878 (Utah female) of the 1000 Genomes Project; see [29]. Our software and test data are available at https://www.uniulm.de/in/theo/research/seqana.html; splitMEM can be obtained from http://www.sourceforge.net/projects/splitmem/.
Runtime and maximum main memory usage for the construction of the compressed de Bruijn graph
k  Algorithm  40 E. coli  62 E. coli  7 × Chr1  7 × HG 

init  SplitMEM  117 (315.25)  141 (317.00)  −  − 
init  A1, A2  38 (5.00)  64 (5.00)  380 (5.00)  − 
init  A3, A4  131 (1.32)  202 (1.24)  1168 (1.24)  20,341 (1.24) 
50  SplitMEM  2261 (572.19)  −  −  − 
50  A1  57 (5.22)  92 (5.34)  596 (6.20)  − 
50  A2  61 (8.49)  97 (8.78)  619 (9.98)  − 
50  A3  188 (2.23)  300 (2.26)  1733 (3.07)  29,816 (2.77) 
50  A3compr1  208 (1.81)  346 (1.85)  1880 (2.66)  31,472 (2.36) 
50  A3compr2  236 (1.63)  374 (1.66)  2318 (2.51)  39,366 (2.22) 
50  A4  164 (1.75)  254 (1.82)  1419 (1.28)  25,574 (1.96) 
50  A4compr1  167 (1.46)  257 (1.53)  1435 (1.28)  25,866 (1.66) 
50  A4compr2  179 (1.32)  272 (1.24)  1526 (1.24)  27,365 (1.39) 
50  A4+explicit  172 (3.26)  268 (3.35)  1515 (3.59)  27,619 (3.88) 
50  A4compr1+explicit  176 (2.97)  271 (3.06)  1541 (3.31)  28,044 (3.64) 
50  A4compr2+explicit  188 (2.66)  289 (2.74)  1629 (2.96)  29,517 (3.38) 
100  SplitMEM  2568 (572.20)  −  −  − 
100  A1  59 (5.00)  95 (5.00)  595 (5.95)  − 
100  A2  62 (7.89)  99 (8.19)  605 (9.74)  − 
100  A3  188 (1.63)  299 (1.68)  1738 (2.74)  27,815 (2.23) 
100  A3compr1  205 (1.50)  326 (1.49)  1839 (2.33)  30,401 (1.80) 
100  A3compr2  232 (1.32)  411 (1.29)  2340 (2.14)  38,134 (1.66) 
100  A4  174 (1.71)  261 (1.79)  1422 (1.28)  25,723 (1.94) 
100  A4compr1  171 (1.42)  264 (1.50)  1439 (1.28)  26,040 (1.64) 
100  A4compr2  185 (1.32)  289 (1.24)  1544 (1.24)  27,464 (1.37) 
100  A4+explicit  178 (2.61)  270 (2.73)  1486 (3.21)  26,878 (3.36) 
100  A4compr1+explicit  175 (2.32)  273 (2.44)  1500 (2.92)  26,999 (3.07) 
100  A4compr2+explicit  190 (2.01)  299 (2.12)  1624 (2.68)  28,665 (2.80) 
500  SplitMEM  2116 (570.84)  −  −  − 
500  A1  72 (5.00)  113 (5.00)  620 (5.83)  − 
500  A2  83 (7.17)  117 (7.43)  640 (9.66)  − 
500  A3  194 (1.50)  304 (1.49)  1752 (2.67)  28,548 (2.07) 
500  A3compr1  216 (1.50)  325 (1.49)  1839 (2.19)  30,488 (1.65) 
500  A3compr2  241 (1.32)  378 (1.29)  2319 (2.06)  36,993 (1.50) 
500  A4  184 (1.65)  283 (1.74)  1453 (1.28)  26,362 (1.93) 
500  A4compr1  197 (1.35)  287 (1.44)  1477 (1.28)  26,545 (1.63) 
500  A4compr2  213 (1.32)  322 (1.24)  1622 (1.24)  28,501 (1.36) 
500  A4+explicit  185 (1.81)  285 (1.90)  1509 (3.14)  27,285 (3.14) 
500  A4compr1+explicit  198 (1.52)  288 (1.61)  1535 (2.83)  27,417 (2.79) 
500  A4compr2+explicit  214 (1.32)  323 (1.29)  1694 (2.56)  29,283 (2.58) 
Space in bytes per input base pair for the explicit and the implicit representation of the compressed de Bruijn graph
k  ds  40 E. coli  62 E. coli  7 × Chr1  7 × HG 

50  Explicit  1.80  1.89  2.80  2.57 
50  Implicit  0.84  0.82  0.77  0.76 
50  Implicitc1  0.55  0.53  0.47  0.47 
50  Implicitc2  0.30  0.27  0.25  0.26 
100  Explicit  1.46  1.51  2.55  2.12 
100  Implicit  0.80  0.79  0.75  0.74 
100  Implicitc1  0.51  0.50  0.46  0.45 
100  Implicitc2  0.26  0.24  0.23  0.24 
500  Explicit  1.07  1.08  2.50  2.01 
500  Implicit  0.74  0.74  0.75  0.74 
500  Implicitc1  0.44  0.44  0.45  0.44 
500  Implicitc2  0.20  0.18  0.23  0.23 
As the explicit compressed de Bruijn graph, the combination of the implicit graph and the FMindex supports a graph traversal (albeit in backward direction). For this task the implicit graph and the FMindex use much less space than the explicit graph. This can be seen as follows. One can store the explicit representation by using—for each node—an integer for len and a pointer to posList/adjList. Additionally, each edge causes an entry in the posList and an entry in the adjList of a node. Altogether, this requires two integers per node and two integers per edge. In contrast, the implicit representation needs four integers per node \((len,lb,size,{ suffix}\_lb)\) and (independent of the number of edges) the two bit vectors \(B_{l}\) and \(B_{r}\), both equipped with an additional data structure that supports rank queries in O(1) time. Table 3 shows the measured space usage of the different representations of the compressed de Bruijn graph.
Breakdown of the space usage of the variants of algorithm A4
Algo  Part  62 E. coli  7 × Chr1  7 × HG 

A4  Wtbwt  0.42 (23.83 %)  0.44 (36.23 %)  0.43 (22.68 %) 
A4  Nodes  0.10 (5.94 %)  0.03 (2.61 %)  0.04 (2.02 %) 
A4  \(B_{r}\)  0.16 (8.93 %)  0.16 (12.86 %)  0.16 (8.25 %) 
A4  \(B_{l}\)  0.14 (8.04 %)  0.14 (11.57 %)  0.14 (7.42 %) 
A4  Wtdoc  0.93 (53.26 %)  0.45 (36.73 %)  1.13 (59.63 %) 
A4compr1  Wtbwt  0.42 (28.57 %)  0.44 (47.83 %)  0.43 (26.85 %) 
A4compr1  Nodes  0.10 (7.12 %)  0.03 (3.44 %)  0.04 (2.39 %) 
A4compr1  \(B_{r}\)  0.00 (0.23 %)  0.00 (0.12 %)  0.00 (0.09 %) 
A4compr1  \(B_{l}\)  0.00 (0.23 %)  0.00 (0.12 %)  0.00 (0.08 %) 
A4compr1  Wtdoc  0.93 (63.85 %)  0.45 (48.49 %)  1.13 (70.59 %) 
A4compr2  Wtbwt  0.16 (13.03 %)  0.22 (31.01 %)  0.22 (15.62 %) 
A4compr2  Nodes  0.10 (8.67 %)  0.03 (4.55 %)  0.04 (2.76 %) 
A4compr2  \(B_{r}\)  0.00 (0.28 %)  0.00 (0.16 %)  0.00 (0.10 %) 
A4compr2  \(B_{l}\)  0.00 (0.28 %)  0.00 (0.16 %)  0.00 (0.10 %) 
A4compr2  Wtdoc  0.93 (77.74 %)  0.45 (64.11 %)  1.13 (81.42 %) 
Runtime and main memory usage for finding nodes
k  62 E. coli  7 × Chr1  7 × HG  

50  A4  3 (1.81)  9 (1.28)  9 (1.96) 
50  A4compr1  3 (1.52)  9 (0.98)  11 (1.66) 
50  A4compr2  6 (1.20)  20 (0.70)  29 (1.39) 
100  A4  3 (1.78)  12 (1.26)  27 (1.94) 
100  A4compr1  3 (1.49)  15 (0.97)  19 (1.64) 
100  A4compr2  6 (1.17)  31 (0.68)  51 (1.37) 
500  A4  9 (1.73)  20 (1.26)  22 (1.93) 
500  A4compr1  12 (1.43)  24 (0.96)  27 (1.63) 
500  A4compr2  17 (1.11)  55 (0.67)  74 (1.36) 
Runtime and main memory usage for finding sequences that correspond to given nodes
k  62 E. coli  7 × Chr1  7 × HG  

50  A4  10.84 (1.81)  3.31 (1.28)  15.33 (1.96) 
50  A4compr1  10.91 (1.52)  3.17 (0.98)  14.88 (1.66) 
50  A4compr2  11.02 (1.20)  3.07 (0.70)  13.02 (1.39) 
100  A4  8.31 (1.78)  2.72 (1.26)  10.99 (1.94) 
100  A4compr1  8.11 (1.49)  2.83 (0.97)  9.10 (1.64) 
100  A4compr2  8.23 (1.17)  2.84 (0.68)  9.25 (1.37) 
500  A4  2.43 (1.73)  1.32 (1.26)  4.51 (1.93) 
500  A4compr1  2.78 (1.43)  1.32 (0.96)  4.22 (1.63) 
500  A4compr2  2.32 (1.11)  1.29 (0.67)  4.30 (1.36) 
Length of the longest string corresponding to a node
k  62 E. coli  7 x Chr1  7 x HG 

50  79,967  41,571  36,579 
100  173,366  85,773  203,398 
500  179,671  2,283,980  1,402,896 
Conclusions
We have presented a spaceefficient method to build the compressed de Bruijn graph from scratch. An experimental comparison with splitMEM showed that our algorithm is more than an order of magnitude faster than splitMEM while using significantly less space (two orders of magnitude). To demonstrate its scalability, we successfully applied it to seven complete human genomes. Consequently, it is now possible to use the compressed de Bruijn graph for much larger pangenomes than before (consisting e.g. of hundreds or even thousands of different strains of bacteria). Moreover, the combination of the implicit graph and the FMindex can be used to search for a pattern P in the graph (and to traverse the graph).
Future work includes a parallel implementation of the construction algorithm. Moreover, it should be worthwhile to investigate the timespace tradeoff if one uses data structures that are optimized for highly repetitive texts; see [31] and the references therein.
Notes
Declarations
Authors' contributions
TB implemented the algorithms and conducted the benchmark experiments. All authors developed the algorithms and wrote the paper. Both authors read and approved the final manuscript.
Acknowledgements
We thank the anonymous reviewers for their helpful and constructive comments. The authors would also like to thank the DFG for funding.
Competing interests
The authors declare that they have no competing interests.
Funding
This project is supported by the DFG (OH 53/62).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Schneeberger K, Hagmann J, Ossowski S, Warthmann N, Gesing S, Kohlbacher O, Weigel D. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 2009;10(9):98.View ArticleGoogle Scholar
 Huang L, Popic V, Batzoglou S. Short read alignment with populations of genomes. Bioinformatics. 2013;29(13):361–70.View ArticleGoogle Scholar
 Rahn R, Weese D, Reinert K. Journaled string treea scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics. 2014;30(24):3499–505.View ArticlePubMedGoogle Scholar
 Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G. Improved genome inference in the MHC using a population reference graph. Nat Genet. 2015;47(6):682–8.View ArticlePubMedPubMed CentralGoogle Scholar
 Marcus S, Lee H, Schatz MC. SplitMEM: a graphical algorithm for pangenome analysis with suffix skips. Bioinformatics. 2014;30(24):3476–83.View ArticlePubMedPubMed CentralGoogle Scholar
 Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de bruijn graphs. Nat Genet. 2012;44(2):226–32.View ArticlePubMedPubMed CentralGoogle Scholar
 Beller T, Ohlebusch E. Efficient construction of a compressed de Bruijn graph for pangenome analysis. In: Proc. 26th annual symposium on combinatorial pattern matching. Lecture notes in computer science, vol. 9133. Berlin: Springer; 2015. p. 40–51.Google Scholar
 Baier U, Beller T, Ohlebusch E. Graphical pangenome analysis with compressed suffix trees and the Burrows–Wheeler transform. Bioinformatics. 2016;32(4):497–504. doi:10.1093/bioinformatics/btv603.View ArticlePubMedGoogle Scholar
 Marschall T, Marz M, Abeel T, Dijkstra L, Dutilh BE, Ghaffaari A, et al. Computational pangenomics: status, promises and challenges. bioRxiv. 2016. doi:10.1101/043430. http://www.biorxiv.org/content/early/2016/03/29/043430.full.pdf.Google Scholar
 Bowe A, Onodera T, Sadakane K, Shibuya T. Succinct de bruijn graphs. In: Algorithms in bioinformatics—12th international workshop, WABI 2012, Ljubljana, September 10–12, 2012. Proceedings. p. 225–35.Google Scholar
 Bonizzoni P, Vedova GD, Pirola Y, Previtali M, Rizzi R. Constructing string graphs in external memory. In: Algorithms in bioinformatics—14th international workshop, WABI 2014, Wroclaw, September 8–10, 2014. Proceedings, p. 311–25.Google Scholar
 Chikhi R, Limasset A, Jackman S, Simpson JT, Medvedev P. On the representation of de bruijn graphs. In: Research in computational molecular biology—18th annual international conference, RECOMB 2014, Pittsburgh, April 2–5, 2014, proceedings. p. 35–55.Google Scholar
 Boucher C, Bowe A, Gagie T, Puglisi SJ, Sadakane K. Variableorder de bruijn graphs. In: 2015 data compression conference, DCC 2015, Snowbird, April 7–9, 2015. p. 383–92.Google Scholar
 Holley G, Wittler R, Stoye J. Bloom filter trie: an alignmentfree and referencefree data structure for pangenome storage. Alg Mol Biol. 2016;11(1):1–9. doi:10.1186/s1301501600668.View ArticleGoogle Scholar
 Cazaux B, Lecroq T, Rivals E. From indexing data structures to de Bruijn graphs. In: Proc. 25th annual symposium on combinatorial pattern matching. Lecture notes in computer science, vol. 8486. Berlin: Springer; 2014. p. 89–99.Google Scholar
 Puglisi SJ, Smyth WF, Turpin A. A taxonomy of suffix array construction algorithms. ACM Comput Surv. 2007;39(2):4.View ArticleGoogle Scholar
 Burrows M, Wheeler DJ. A blocksorting lossless data compression algorithm. Research report 124, Digital Systems Research Center. 1994.Google Scholar
 Kärkkäinen J. Fast BWT in small space by blockwise suffix sorting. Theor Comput Sci. 2007;387(3):249–57.View ArticleGoogle Scholar
 Okanohara D, Sadakane K. A lineartime Burrows–Wheeler transform using induced sorting. In: Proc. 16th international symposium on string processing and information retrieval. Lecture notes in computer science, vol. 5721. Berlin: Springer; 2009. p. 90–101.Google Scholar
 Ferragina P, Gagie T, Manzini G. Lightweight data indexing and compression in external memory. In: Proc. 9th Latin American theoretical informatics symposium. Lecture notes in computer science, vol. 6034. Berlin: Springer; 2010. p. 697–710.Google Scholar
 Beller T, Zwerger M, Gog S, Ohlebusch E. Spaceefficient construction of the Burrows–Wheeler transform. In: Proc. 20th international symposium on string processing and information retrieval. Lecture notes in computer science, vol. 8214. Berlin: Springer; 2013. p. 5–16.Google Scholar
 Grossi R, Gupta A, Vitter JS. Highorder entropycompressed text indexes. In: Proc. 14th annual ACMSIAM symposium on discrete algorithms. 2003. p. 841–50.Google Scholar
 Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proc. 41st annual IEEE symposium on foundations of computer science. 2000. p. 390–98.Google Scholar
 Jacobson G. Spaceefficient static trees and graphs. In: Proc. 30th annual IEEE symposium on foundations of computer science. 1989. p. 549–54.Google Scholar
 Beller T, Gog S, Ohlebusch E, Schnattinger T. Computing the longest common prefix array based on the Burrows–Wheeler transform. J Discrete Alg. 2013;18:22–31.View ArticleGoogle Scholar
 Gagie T, Navarro G, Puglisi SJ. New algorithms on wavelet trees and applications to information retrieval. Theor Comput Sci. 2012;426–427:25–41.View ArticleGoogle Scholar
 Ohlebusch E. Bioinformatics algorithms: sequence analysis, genome rearrangements, and phylogenetic reconstruction. Bremen: Oldenbusch Verlag; 2013.Google Scholar
 Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discrete Alg. 2004;2:53–86.View ArticleGoogle Scholar
 Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M. AlleleSeq: analysis of allelespecific expression and binding in a network framework. Mol Syst Biol. 2011;7:522.View ArticlePubMedPubMed CentralGoogle Scholar
 Gog S, Beller T, Moffat A, Petri M. From theory to practice: plug and play with succinct data structures. In: Proc. 13th international symposium on experimental algorithms. Lecture notes in computer science, vol. 8504. Berlin: Springer; 2014. p. 326–37.Google Scholar
 Navarro G, Ordóñez A. Faster compressed suffix trees for repetitive text collections. In: Proc. 13th international symposium on experimental algorithms. Lecture notes in computer science, vol. 8504. Berlin: Springer; 2014. p. 424–35.Google Scholar