Using cascading Bloom filters to improve the memory usage for de Brujin graphs
- Kamil Salikhov^{1},
- Gustavo Sacomoto^{2, 3} and
- Gregory Kucherov^{4, 5}Email author
https://doi.org/10.1186/1748-7188-9-2
© Salikhov et al.; licensee BioMed Central Ltd. 2014
Received: 8 December 2013
Accepted: 17 February 2014
Published: 24 February 2014
Abstract
Background
De Brujin graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to a very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently.
Results
In this work, we show how to reduce the memory required by the data structure of Chikhi and Rizk (WABI’12) that represents de Brujin graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to their method, with insignificant impact on construction time. At the same time, our experiments showed a better query time compared to the method of Chikhi and Rizk.
Conclusion
The proposed data structure constitutes, to our knowledge, currently the most efficient practical representation of de Bruijn graphs.
Keywords
Background
Modern next-generation sequencing (NGS) technologies generate huge volumes of short nucleotide sequences (reads) drawn from a DNA sample under study. The length of a read varies from 35 to about 400 base pairs (letters) and the number of reads may be hundreds of millions, thus the total volume of data may reach tens or even hundreds of Gb.
Many computational tools dealing with NGS data, especially those devoted to genome assembly, are based on the concept of a de Bruijn graph, see e.g. [1]. Nodes of a de Bruijn graph^{a} correspond to all distinct k-mers occurring in the given set of reads, and two k-mers are linked by an arc if they have a suffix-prefix overlap of size k−1. The value of k is an open parameter that in practice is chosen between 20 and 64. The idea of using de Bruijn graphs for genome assembly goes back to the “pre-NGS era” [2]. Note, however, that de novo genome assembly is not the only application of those graphs when dealing with NGS data. There are several others, including: de novo transcriptome assembly [3] and de novo alternative splicing calling [4] from transcriptomic NGS data (RNA-seq); metagenome assembly [5] from metagenomic NGS data; and genomic variant detection [6] from genomic NGS data using a reference genome.
Due to a very large size of NGS datasets, it is essential to represent de Bruijn graphs as compactly as possible. This has been a very active line of research. Recently, several papers have been published that propose different approaches to compressing de Bruijn graphs [7–11].
Conway and Bromage [7] proposed a method based on classical succinct data structures, i.e. bitmaps with efficient rank/select operations. On the same direction, Bowe et al. [10] proposed an interesting succinct representation that, assuming only one string (read) is present, uses only 4E bits, where E is the number of arcs in the graph. The more realistic case where there are R reads can be easily reduced to the one string case by concatenating all R reads using a special separator character. In this case, however, the size of the structure is 4E+O(R logE) bits ([10], Theorem 1). Since the multiplicative constant of the second term is difficult to evaluate, it is hard to know precisely what would be the size of this structure in practice.
Ye et al. [8] proposed a different method based on a sparse representation of de Bruijn graphs, where only a subset of k-mers present in the dataset are stored. Using the Bloom filter data structure, Pell et al. [11] proposed probabilistic de Bruijn graphs as a compact approximate representation of full de Bruijn graphs. Finally, Chikhi and Rizk [9] improved Pell’s scheme in order to obtain an exact representation of the de Bruijn graph.
A direct application of Bloom filters to de Bruijn graphs, studied in [11], results in a space-efficient representation at the price of allowing one-sided errors, namely false positive nodes (k-mers). The method of [9] removes these errors and proposes a space-efficient data structure for the exact representation of de Bruijn graphs. The method is based on the following idea. In the genome assembly application, de Bruijn graphs are only used for traversal, and random accesses to graph nodes are never performed. If all queried nodes (k-mers) are only those which are reachable from some node known to belong to the graph, then only a fraction of all false positives can actually occur. Storing these false positives explicitly leads to an exact (false positive free) and space-efficient representation of the de Bruijn graph. This is the best practical exact representation of de Bruijn graphs for the purpose of genome assembly, now implemented in MINIA software [15].
Our main contribution is an improvement of the method of [9] by changing the representation of the set of false positives. We achieve this by iteratively applying a Bloom filter to represent the set of false positives, then the set of “false false positives” etc. We show analytically that this cascade of Bloom filters allows for a considerable further economy of memory, improving the method of [9]. Depending on the value of k, our method requires 30% to 40% less memory with respect to the method of [9]. Moreover, with our method, the memory grows very little as k grows. Finally, we implemented our method and tested it against [9] on real datasets. The tests confirm the theoretical predictions for the size of structure and show a 20% to 30% improvement in query times.
Preliminaries
A de Bruijn graph, for a given parameter k, of a set of reads (strings) $\mathcal{R}\subseteq {\Sigma}^{\ast}={\{A,C,T,G\}}^{\ast}$ is entirely defined by the set T⊆U=Σ^{ k } of k-mers present in . The nodes of the graph are precisely the k-mers of T and for any two vertices u,v∈T, there is an arc from u to v if the suffix of u of size (k−1) is equal to the prefix of v of the same size. Thus, given a set T⊆U of k-mers, we can represent its de Bruijn graph using a Bloom filter B. This approach has the disadvantage of having false positive nodes, as a direct consequence of false positives in the Bloom filter, which can create false connections in the graph (see [11] for the influence of false positive nodes on the topology of the graph). The naive way to remove those false positives nodes by explicitly storing (e.g. using a hash table) the set of all false positives of B is clearly inefficient, as the expected number of elements to be explicitly stored is $\left|U\right|\mathcal{F}={4}^{k}\mathcal{F}$.
The key idea of [9] is to explicitly store only a small subset of all false positives of B, the so-called critical false positives. Consider a k-mer u that belongs to T, u has at most 2|Σ|=8potential neighbors, i.e. k-mers overlapping u by (k−1) letters. The set of critical false positives consists of the potential neighbors of k-mers of T that are false positives of B. This set is, in general, much smaller than the set of all false positives of B, its expected size can be upper-bounded by $8\left|T\right|\mathcal{F}$. On the other hand, storing the set of critical false positives is clearly sufficient to represent the de Bruijn graph if one only wants to support graph traversal, i.e. navigation from a node of the graph to its neighbors. In this case, only potential neighbors of nodes in T are queried.
Cascading Bloom filter
Let be a set of reads and T_{0} be the set of occurring k-mers (nodes of the de Brujin graph) that we want to store. As stated in Section “Preliminaries”, the method of [9] stores T_{0} via a bitmap B_{1} using a Bloom filter, together with the set T_{1} of critical false positives. T_{1} consists of potential neighbors of T_{0} which are stored in B_{1} “by mistake”, i.e. belong^{b} to B_{1} but not to T_{0}. B_{1} and T_{1} are sufficient to represent the graph provided that the only queried k-mers are those which are potential neighbors of k-mers of T_{0}.
The idea we introduce in this work is to use this structure recursively and represent the set T_{1} by a new bitmap B_{2} and a new set T_{2}, then represent T_{2} by B_{3} and T_{3}, and so on. More formally, starting from B_{1} and T_{1} defined as above, we define a series of bitmaps B_{1},B_{2},… and a series of sets T_{1},T_{2},… as follows. B_{2} stores the set of false positives T_{1} using another Bloom filter, and T_{2} contains the critical false positives of B_{2}, i.e. true positives from T_{0} that are stored in B_{2} “by mistake” (we call them false false positives). B_{3} and T_{3}, and, generally, B_{ i } and T_{ i } are defined similarly: B_{ i } stores k-mers of T_{i−1} using a Bloom filter, and T_{ i } contains k-mers stored in B_{ i } “by mistake”, i.e. those k-mers in B_{ i } that do not belong to T_{i−1} but belong to T_{i−2}. Observe that T_{0}∩T_{1}=∅, T_{0}⊇T_{2}⊇T_{4}… and T_{1}⊇T_{3}⊇T_{5}….
The following lemma shows that the construction is correct, that is it allows one to verify whether or not a given k-mer belongs to the set T_{0}.
Lemma 1.
Given a k-mer (node) u, consider the smallest i such that u∉B_{i+1} (if u∉B_{1}, we define i=0). Then, if i is odd, then u∈T_{0}, and if i is even (including 0), then u∉T_{0}.
Proof.
Observe that u∉B_{i+1} implies u∉T_{ i } by the basic property of Bloom filters that membership queries have one-sided error, i.e. there are no false negatives. We first check the Lemma for i=0,1.
For i=0, we have u∉B_{1}, and then u∉T_{0}.
For i=1, we have u∈B_{1} but u∉B_{2}. The latter implies that u∉T_{1}, and then u must be a false false positive, that is u∈T_{0}. Note that here we use the fact that the only queried k-mers u are either nodes of T_{0} or their potential neighbors in the graph (see [9]), and therefore if u∈B_{1} and u∉T_{0} then u∈T_{1}.
For the general case i≥2, we show by induction that u∈T_{i−1}. Indeed, u∈B_{1}∩…∩B_{ i } implies u∈T_{i−1}∪T_{ i } (which, again, is easily seen by induction), and u∉B_{i+1} implies u∉T_{ i }.
Since T_{i−1}⊆T_{0} for odd i, and T_{i−1}⊆T_{1} for even i (for T_{0}∩T_{1}=∅), the lemma follows.
Naturally, Lemma 1 provides an algorithm to check if a given k-mer u belongs to the graph: it suffices to check successively if it belongs to B_{1},B_{2},… until we encounter the first B_{i+1} which does not contain u. Then, the answer will simply depend on whether i is even or odd: u belongs to the graph if and only if i is odd
In our reasoning so far, we assumed an infinite number of bitmaps B_{ i }. Of course, in practice we cannot store infinitely many (and even simply many) bitmaps. Therefore, we truncate the construction at some step t and store a finite set of bitmaps B_{1},B_{2},…,B_{ t } together with an explicit representation of T_{ t }. The procedure of Lemma 1 is extended in the obvious way: if for all 1≤i≤t, u∈B_{ i }, then the answer is determined by directly checking u∈T_{ t }.
Analysis of the data structure
Memory and time usage
A simple calculation shows that the minimum of this expression is achieved when r=5.464, and then the minimum memory used per k-mer is 8.45 bits.
As mentioned earlier, in practice we store only a finite number of bitmaps B_{1},…,B_{ t } together with an explicit representation (such as array or hash table) of T_{ t }. In this case, the memory taken by the bitmaps is a truncated sum r N+6r N c^{ r }+r N c^{ r }+.., and a data structure storing T_{ t } takes either $2k\xb7N{c}^{\lceil \frac{t}{2}\rceil r}$ or $2k\xb76N{c}^{\lceil \frac{t}{2}\rceil r}$ bits, depending on whether t is even or odd. The latter follows from the observations that we need to store $N{c}^{\lceil \frac{t}{2}\rceil r}$ (or $6N{c}^{\lceil \frac{t}{2}\rceil r}$) k-mers, each taking 2k bits of memory. Consequently, we have to adjust the optimal value of r minimizing the total space, and re-estimate the resulting space spent on one k-mer.
1st column: k -mer size; 2nd and 4th columns: optimal value of r for Bloom filters (bitmap size per number of stored elements) for t =4 and t =6 respectively; 3rd and 5th columns: the resulting space per k -mer (for t =4 and t =6); 6th column: space per k -mer for the method of [[9]] ( t =1)
k | Optimalr | Bits per k-mer | Optimalr | Bits per k-mer | Bits per k-mer |
---|---|---|---|---|---|
for t=4 | for t=4 | for t=6 | for t=6 | for t=1 ([9]) | |
16 | 5.777 | 8.556 | 5.506 | 8.459 | 12.078 |
32 | 6.049 | 8.664 | 5.556 | 8.470 | 13.518 |
64 | 6.399 | 8.824 | 5.641 | 8.490 | 14.958 |
128 | 6.819 | 9.045 | 5.772 | 8.524 | 16.398 |
The last column of Table 1 shows the memory usage of the original method of [9], obtained using the estimation $\left(1.44\underset{2}{log}\left(\frac{16k}{2.08}\right)+2.08\right)$ the authors provided. Note that according to that estimation, doubling the value of k results in a memory increment by 1.44 bits, whereas in our method the increment is of 0.11 to 0.22 bits.
Let us now comment on query and preprocessing times for our scheme. The query time can be split into two parts: the time spent on querying t Bloom filters and the time spent on querying T_{ t }. As stated in Section “Preliminaries”, each query in a Bloom filter corresponds to p=r ln2 hash functions evaluations. Clearly, the total query time for t Bloom filters is t p=Θ(t r). Thus, it is expected that using t Bloom filters, even if r decreases, the query time increases. For instance, with t=4 we have that r=6.049 (k=32) and the total number of hash function evaluations is proportional to r t≈24, whereas with t=1 we have that r=11.44 and r t≈12, a factor 2 increase in the number of hash function evaluations. On the other hand, the set T_{ t } is generally much smaller than T_{1}, due to the above-mentioned exponential decrease. Depending on the data structure for storing T_{ t }, the time saving in querying T_{ t } vs. T_{1} may even dominate the time loss in querying multiple Bloom filters. Our experimental results (Section “Construction algorithm” below) confirm that this situation does indeed occur in practice. Note that even in the case when querying T_{ t } weakly depends on its size (e.g. when T_{ t } is implemented by a hash table), the query time will not increase much, due to our choice of a small value for t, as discussed earlier.
At the preprocessing step, we need to construct Bloom filters B_{1},…,B_{ t } and set T_{ t }. At each stage i, we need to store T_{i−1} and T_{i−2} (possibly on disk, if we want to save on the internal memory used by the algorithm) to construct B_{ i } and T_{ i }. A key observation is that the sizes of B_{ i } and T_{ i } decrease exponentially on i and therefore the time spent to construct the whole structure is a linear function on the size of T_{0}. In particular, asymptotically it is only a small constant factor larger compared to the original method of [9]. If the value of t is small (such as t=4, as in Table 1), the preprocessing time is obviously even smaller.
Using different values of r for different filters
and the minimum value of this expression becomes 7.93 (this value is achieved with r_{1}=4.41;r_{ i }=1.44,i>1).
Estimated memory occupation for the case of different values of r vs. single value of r (shown in Table 1), for 4 Bloom filters ( t =4)
k | Optimalr_{1},r_{2},r_{3},r_{4} | Bits per k-mer | Bits per k-mer |
---|---|---|---|
different values ofr | single value ofr | ||
16 | 5.254, 3.541, 4.981, 8.653 | 8.336 | 8.556 |
32 | 5.383, 3.899, 5.318, 9.108 | 8.404 | 8.664 |
64 | 5.572, 4.452, 5.681, 9.108 | 8.512 | 8.824 |
128 | 5.786, 5.108, 6.109, 9.109 | 8.669 | 9.045 |
Query distribution among filters
The query algorithm of Lemma 1 simply queries Bloom filters B_{1},…,B_{ t } successively as long as the returned answer is positive. The query time then directly depends on the number of filters applied before getting a negative answer. Therefore, it is instructive to analyse how the query frequencies to different filters are distributed when performing a graph traversal. We provide such an analysis in this section.
We analyse query frequencies during an exhaustive traversal of the de Bruijn graph, when each true node is visited exactly once. We assume that each time a true node is visited, all its eight potential neighbors are queried, as there is no other way to tell which of those neighbors are real. Note however that this assumption does not take into account structural properties of the de Bruijn graph, nor any additional statistical properties of the genome (such as genomic word frequencies).
For a filter B_{ i }, we want to estimate the number of queried k-mers resolved by B_{ i } during the traversal, that is queries on which B_{ i } returns no. This number is the difference of the number of queries submitted to B_{ i } and the number of queries for which B_{ i } returns yes. Note that the queries submitted to B_{ i } are precisely those on which the previous filter B_{i−1} returns yes.
If the input set T_{0} contains N k-mers, then the number of queries in a graph traversal is 8N, since for each true node each of its 8 potential neighbors are queried. Moreover, about 2N queries correspond to true k-mers, as we assume that most of the graph nodes have two true neighbors. Filter B_{1} will return yes on 2N+6c^{ r }N queries, corresponding to the number of true and false positives respectively. For an arbitrary i, filter B_{ i } returns yes precisely on the k-mers inserted to B_{ i } (i.e. k-mers B_{ i } is built on), and the k-mers which are inserted to B_{i+1} (which are the critical false positives for B_{ i }). The counts then easily follow from the analysis of Section “Memory and time usage”.
Estimations of the number of queries made to filters B _{ 1 } , B _{ 2 } , B _{ 3 } , B _{ 4 } and of the fraction of queries resolved by each filter (for the optimal value r =5 . 464), in the case of infinite number of filters
B _{1} | B _{2} | B _{3} | B _{4} | |
---|---|---|---|---|
nb of queries | 8N | (2+6c^{ r })N | (6c^{ r }+2c^{ r })N | (2c^{ r }+6c^{2r})N |
Queries returning yes | (2+6c^{ r })N | (6c^{ r }+2c^{ r })N | (2c^{ r }+6c^{2r})N | (6c^{2r}+2c^{2r})N |
Queries returning no | (6−6c^{ r })N | (2−2c^{ r })N | (6c^{ r }−6c^{2r})N | (2c^{ r }−2c^{2r})N |
Fraction of resolved queries | 69.57% | 23.19% | 5.04% | 1.68% |
Estimated fractions of queries resolved by each filter and by the explicitely stored set T _{ t } for t =1,2,4, computed for k =32 and optimal value of r shown in the second column
Value oft | r | B _{1} | B _{2} | B _{3} | B _{4} | T _{ t } |
---|---|---|---|---|---|---|
1 | 11.44 | 74.70% | 0 | 0 | 0 | 25.3% |
2 | 8.060 | 73.44% | 24.48% | 0 | 0 | 2.08% |
4 | 6.049 | 70.90% | 23.63% | 3.88% | 1.29% | 0.3% |
Experimental results
Construction algorithm
In practice, constructing a cascading Bloom filter for a real-life read set is a computationally intensive step. To perform it on a commonly-used computer, the implementation makes an essential use of external memory. Here we give a short description of the construction algorithm for up to four Bloom filters. Extension for larger number of filters is straightforward.
We start from the input set T_{0} of k-mers written on disk. In this set, for each pair of k-mer and its reverse complement we keep only one of them, the lexicographically smaller, and identify the other to it. We build the Bloom filter B_{1} of appropriate size by inserting elements of T_{0} successively. Next, all possible extensions of each k-mer in T_{0} are queried against B_{1}, and those which return true are written to the disk. Then, in this set only the k-mers absent from T_{0} are kept, i.e. we perform a set difference from T_{0}. We cannot afford to load T_{0} entirely in memory, so we partition T_{0} and perform the set difference in several iterations, loading only one partition of T_{0} each time. This results in the set T_{1} of critical false positives, which is also kept on disk. Up to this point, the procedure is identical to that of [9].
Next, we insert all k-mers from T_{1} into B_{2} and to obtain T_{2}, we check for each k-mer in T_{0} if a query to B_{2} returns true. This results in the set T_{2}, which is directly stored on disk. Thus, at this point we have B_{1}, B_{2} and, by loading T_{2} from the disk, a complete representation for t=2. In order to build the data structure for t=4, we continue this process, by inserting T_{2} in B_{3} and retrieving (and writing directly on disk) T_{3} from T_{1} (stored on disk). It should be noted that to obtain T_{ i } we need T_{i−2}, and by always directly storing it on disk we guarantee not to use more memory than the size of the final structure. The set T_{ t } (that is, T_{1}, T_{2} or T_{4} in our experiments) is represented as a sorted array and is searched by a binary search. We found this implementation more efficient than a hash table.
Implementation and experimental setup
We implemented our method using MINIA software [9] and ran comparative tests for 2 and 4 Bloom filters (t=2,4). Note that since the only modified part of MINIA was the construction step and the k-mer membership queries, this allows us to precisely evaluate our method against the one of [9].
The first step of the implementation is to retrieve the list of k-mers that appear more than d times using DSK [13] – a constant memory streaming algorithm to count k-mers. Note, as a side remark, that performing counting allows us to perform off-line deletions of k-mers. That is, if at some point of the scan of the input set of k-mers (or reads) some of them should be deleted, it is done by a simple decrement of the counter.
Assessing the query time is done through the procedure of graph traversal, as it is implemented in [9]. Since the procedure is identical and independent on the data structure, the time spent on graph traversal is a faithful estimator of the query time.
We compare three versions: t=1 (i.e. the version of [9]), t=2 and t=4. For convenience, we define 1 Bloom, 2 Bloom and 4 Bloom as the versions with t=1,2 and 4, respectively.
E.coli dataset, varying k
The total size of the structures in bits per stored k-mer, i.e. the size of B_{1} and T_{1} (respectively, B_{1},B_{2}, T_{2} or B_{1},B_{2},B_{3},B_{4}, T_{4}) is shown in Figure 1(a). As expected, the space for 4 Bloom filters is the smallest for all values of k considered, showing a considerable improvement, ranging from 32% to 39%, over the version of [9]. Even the version with just 2 Bloom filters shows an improvement of at least 20% over [9], for all values of k. Regarding the influence of the k-mer size on the structure size, we observe that for 4 Bloom filters the structure size is almost constant, the minimum value is 8.60 and the largest is 8.89, an increase of only 3%. For 1 and 2 Bloom the same pattern is seen: a plateau from k=16 to 32, a jump for k=33 and another plateau from k=33 to 64. The jump at k=32 is due to switching from 64-bit to 128-bit representation of k-mers in the table T_{ t }.
Figure 1(b) shows the size of table T_{ t } (number of k-mers) for t=1,2,4, depending on k. It clearly demonstrates the sharp decrease of the size of T_{ t } with growing t, in accordance with the exponential decrease estimated analytically in Section “Memory and time usage”. We also observe a decrease in the size of T_{ t } with growing k for t=1 and, to a smaller extent, for t=2, while for t=4 the decrease is not noticeable. This is explained by the increase rate of optimal r (Table 1) which is high for t=1, smaller for t=2 and yet smaller for t=4. Since the size of T_{ t } is O(N c^{t r/2}) (Section “Memory and time usage”) for c<1 and almost invariable N, the decrease rate is exponential w.r.t. the increase rate of r.
Traversal times for each version are shown in Figure 1(c). The fastest version is 4 Bloom, showing an improvement over [9] of 18% to 30%, followed by 2 Bloom. This result is surprising and may seem counter-intuitive, as we have four filters to apply to the queried k-mer rather than a single filter as in [9]. However, the size of T_{4} (or even T_{2}) is much smaller than T_{1}, as the size of T_{ i }’s decreases exponentially. As T_{ t } is stored in an array, the time economy in searching T_{4} (or T_{2}) compared to T_{1} dominates the time lost on querying additional Bloom filters, which explains the overall gain in query time.
As far as the construction time is concerned (Figure 1(d)), our versions yielded also a faster construction, with the 4 Bloom version being 5% to 22% faster than that of [9]. The gain is explained by the time required for sorting the array storing T_{ t }, which is much higher for T_{0} than for T_{2} or T_{4}. However, the gain is less significant here, and, on the other hand, was not observed for bigger datasets (see Section “Human dataset”).
E. coli dataset, varying coverage
E. coli dataset, query statistics
Human dataset
Results of 1, 2 and 4 Bloom filters version for 564M Human reads of 100 bp using k =23
Method | 1 Bloom[9] | 2 Bloom | 4 Bloom |
---|---|---|---|
Construction time (s) | 40160.7 | 43362.8 | 44300.7 |
Traversal time (s) | 46596.5 | 35909.3 | 34177.2 |
r coefficient | 11.10 | 7.80 | 5.97 |
B_{1}=3250.95 | B_{1}=2283.64 | B_{1}=1749.04 | |
B_{2}=323.08 | B_{2}=591.57 | ||
Bloom filters size (MB) | B_{3}=100.56 | ||
B_{4}=34.01 | |||
False positive table | T_{1}=545.94 | T_{2}=425.74 | T_{4}=36.62 |
size (MB) | |||
Total size (MB) | 3796.89 | 3032.46 | 2511.8 |
Size (bits/ k -mer) | 12.96 | 10.35 | 8.58 |
The results are in general consistent with the previous tests on E.coli datasets. There is an improvement of 34% (21%) for the 4 Bloom (2 Bloom) in the size of the structure. The graph traversal is also 26% faster in the 4 Bloom version. However, in contrast to the previous results, the graph construction time increased by 10% and 7% for 4 and 2 Bloom versions respectively, when compared to the 1 Bloom version. This is due to the fact that disk writing/reading operations now dominate the time for the graph construction, and 2 and 4 Bloom versions generate more disk accesses than 1 Bloom. As stated in Section “Construction algorithm”, when constructing the 1 Bloom structure, the only part written on the disk is T_{1} and it is read only once to fill an array in memory. For 4 Bloom, T_{1} and T_{2} are written to the disk, and T_{0} and T_{1} are read at least one time each to build B_{2} and B_{3}. Moreover, since the size coefficient of B_{1} reduces, from r=11.10 in 1 Bloom to r=5.97 in 4 Bloom, the number of false positives in T_{1} increases.
Discussion and conclusions
Using cascading Bloom filters for storing de Bruijn graphs has clear advantages over the single-filter method of [9]. In terms of memory consumption, which is the main parameter here, we obtained an improvement of around 30%-40% in all our experiments. Our data structure takes 8.5 to 9 bits per stored k-mer, compared to 13 to 15 bits by the method of [9]. This confirms our analytical estimations. The above results were obtained using only four filters and are very close to the estimated optimum (around 8.4 bits/k-mer) produced by the infinite number of filters. This is consistent with both our analytical estimations and experimental data showing that over 99% of queries are resolved by the four filters, without resorting to the explicitely stored set T_{ t }. Even two filters only resolve about 95% of queries. An interesting characteristic of our method is that the memory grows insignificantly with the growth of k, even slower than with the method of [9]. Somewhat surprisingly, we also obtained a significant decrease, of order 20%-30%, of query time. The construction time of the data structure varied from being 10% slower (for the human dataset) to 22% faster (for the bacterial dataset). Cascading Bloom filters have now been implemented by default in the MINIA software [15].
As stated previously, another compact encoding of de Bruijn graphs has been proposed in [10], however no implementation of the method was made available. For this reason, we could not experimentally compare our method with the one of [10]. We remark, however, that the space bound of [10] heavily depends on the number of reads (i.e. coverage), while in our case, the data structure size is almost invariant with respect to the coverage (Section “E. coli dataset, varying coverage”).
An interesting open question is whether the Bloom filter construction can be made online, so that new k-mers (reads) can be inserted without reconstructing the whole data structure from scratch. Note that the presented construction (Section “Construction algorithm”) is inherently off-line, as all k-mers should be known before the data structure is built.
Another interesting prospect for possible further improvements of our method is offered by work [16], where an efficient replacement to Bloom filter was introduced. The results of [16] suggest that we could hope to reduce the memory to about 5 bits per k-mer. However, there exist obstacles on this way: an implementation of such a structure would probably result in a significant construction and query time increase.
Endnotes
^{a} Note that this is actually a subgraph of the de Bruijn graph under its classical combinatorial definition. However, we still call it de Bruijn graph to follow the terminology common to the bioinformatics literature.
^{b} By a slight abuse of notation, we also view B_{ j } as the set of all k-mers on which the filter B_{ j } returns the positive answer.
Declarations
Acknowledgements
Part of this work has been done during the visit of KS to LIGM in France, supported by the CNRS French-Russian exchange program in Computer Science. GK has been partly supported by the ABS2NGS grant of the French gouvernement (program Investissement d’Avenir) as well as by a Marie-Curie Intra-European Fellowship for Carrier Development. GS was supported by the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no. [247073]10.
Authors’ Affiliations
References
- Miller JR, Koren S, Sutton G:Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95 (6): 315-327.View ArticlePubMedPubMed CentralGoogle Scholar
- Pevzner PA, Tang H, Waterman MS:An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001, 98 (17): 9748-9753.View ArticlePubMedPubMed CentralGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A:Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech. 2011, 29 (7): 644-652. 10.1038/nbt.1883.View ArticleGoogle Scholar
- Sacomoto G, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot M-F, Peterlongo P, Lacroix V:KISSPLICE: de-novo calling alternative splicing events from RNA-seq data. BMC Bioinformatics. 2012, 13 (Suppl 6): 5-Google Scholar
- Peng Y, Leung HCM, Yiu SM, Chin FYL:Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics. 2011, 27 (13): 94-101. 10.1093/bioinformatics/btr216.View ArticleGoogle Scholar
- Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G:De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012, 44 (2): 226-232.View ArticlePubMedPubMed CentralGoogle Scholar
- Conway TC, Bromage AJ:Succinct data structures for assembling large genomes. Bioinformatics. 2011, 27 (4): 479-486.View ArticlePubMedGoogle Scholar
- Ye C, Ma Z, Cannon C, Pop M, Yu D:Exploiting sparseness in de novo genome assembly. BMC Bioinformatics. 2012, 13 (Suppl 6): 1-View ArticleGoogle Scholar
- Chikhi R, Rizk G:Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol Biol. 2013, 8 (1): 22-Preliminary version in WABI’2012View ArticlePubMedPubMed CentralGoogle Scholar
- Bowe A, Onodera T, Sadakane K, Shibuya T:Succinct de Bruijn graphs. Algorithms in Bioinformatics - 12th International Workshop, WABI 2012, Ljubljana, Slovenia, September 10-12, 2012. Proceedings. Lecture Notes in Computer Science Volume 7534. Edited by: Raphael BJ, Tang J. 2012, 225-235. Berlin: SpringerGoogle Scholar
- Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT:Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci USA. 2012, 109 (33): 13272-13277.View ArticlePubMedPubMed CentralGoogle Scholar
- Kirsch A, Mitzenmacher M:Less hashing, same performance: building a better Bloom filter. Random Struct Algorithms. 2008, 33 (2): 187-218. 10.1002/rsa.20208.View ArticleGoogle Scholar
- Rizk G, Lavenier D, Chikhi R:DSK: k-mer counting with very low memory usage. Bioinformatics. 2013, 29 (5): 652-3.View ArticlePubMedGoogle Scholar
- Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y:The complete genome sequence of Escherichia coli K-12. Science. 1997, 277 (5331): 1453-1462.View ArticlePubMedGoogle Scholar
- MINIAsoftware. [http://minia.genouest.org/], []
- Porat E:An optimal Bloom filter replacement based on matrix solving. Computer Science - Theory and Applications, Fourth International Computer Science Symposium in Russia, CSR 2009, Novosibirsk, Russia, August 18-23, 2009. Proceedings. Lecture Notes in Computer Science, Volume 5675. 2009, 263-273. Berlin: SpringerGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.