Using cascading Bloom filters to improve the memory usage for de Brujin graphs

Background De Brujin graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to a very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. Results In this work, we show how to reduce the memory required by the data structure of Chikhi and Rizk (WABI’12) that represents de Brujin graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to their method, with insignificant impact on construction time. At the same time, our experiments showed a better query time compared to the method of Chikhi and Rizk. Conclusion The proposed data structure constitutes, to our knowledge, currently the most efficient practical representation of de Bruijn graphs.


Introduction
Modern Next-Generation Sequencing (NGS) technologies generate huge volumes of short nucleotide sequences (reads) drawn from the DNA sample under study.The length of a read varies from 35 to about 400 base pairs (letters) and the number of reads may be hundreds of millions, thus the total volume of data may reach tens or even hundreds of Gb.
Many computational tools dealing with NGS data, especially those devoted to genome assembly, are based on the concept of de Bruijn graph, see e.g.[8].The nodes of the de Bruijn graph are all distinct k-mers occurring in the reads, and two k-mers are linked by an edge if there is a suffix-prefix overlap of size k − 1. 6 The value of k is an open parameter, that in practice is chosen between 20 and 64.The idea of using de Bruijn graph for genome assembly goes back to the "pre-NGS era" [11].Note, however, that de novo genome assembly is not the only application of those graphs when dealing with NGS data.There are several others, including: de novo transcriptome assembly [5] and de novo alternative splicing calling [14] from transcriptomic NGS data (RNA-seq); metagenome assembly [10] from metagenomic NGS data; and genomic variant detection [6] from genomic NGS data using a reference genome.
Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs as compactly as possible.This has been a very active line of research.Recently, several papers have been published that propose different approaches to compressing de Bruijn graphs [4,15,3,2,9].
Conway and Bromage [4] proposed a method based on classical succinct data structures, i.e. bitmaps with efficient rank/select operations.On the same direction, Bowe et al. [2] proposed a very interesting succinct representation that, assuming only one string (read) is present, uses only 4m bits, where m is the number of edges in the graph.The more realistic case, where there are M reads, can be easily reduced to the one string case by concatenating all M reads using a special separator character.However, in this case the size of the structure is 4m + O(M log m) bits ( [2], Theorem 1).Since the multiplicative constant of the second term is hidden by the asymptotic notation, it is hard to know precisely what would be the size of this structure in practice.
Ye at al. [15] proposed a different method based on a sparse representation of de Bruijn graphs, where only a subset of k-mers present in the dataset are stored.Pell et al. [9] proposed a method to represent it approximately, the so called probabilistic de Bruijn graph.In their representation a node have a small probability to be a false positive, i.e. the k-mer is not present in the dataset.Finally, Chikhi and Rizk [3] improved Pell's scheme in order to obtain an exact representation of the de Bruijn graph.This was, to our knowledge, the best practical representation of an exact de Bruijn graph.
In this work, we focus on the method proposed in [3] which is based on Bloom filters.They were first used in [9] to provide a very space-efficient representation of a subset of a given set (in our case, a subset of k-mers), at the price of allowing one-sided errors, namely false positives.The method of [3] is based on the following idea: if all queried nodes (k-mers) are only those which are reachable from some node known to belong to the graph, then only a fraction of all false positives can actually occur.Storing these false positives explicitly leads to an exact (false positive free) and space-efficient representation of the de Bruijn graph.
Our contribution is an improvement of this scheme by changing the representation of the set of false positives.We achieve this by iteratively applying a Bloom filter to represent the set of false positives, then the set of "false false positives" etc.We show analytically that this cascade of Bloom filters allows for a considerable further economy of memory, improving the method of [3].Depending on the value of k, our method requires 30% to 40% less memory with respect to the method of [3].Moreover, with our method, the memory grows very little with the growth of k.Finally, we implemented our method and tested it against [3] on real datasets.The tests confirm the theoretical predictions for the size of structure and show a 20% to 30% improvement in query times.
Let T 0 be the set of k-mers (nodes) of the de Brujin graph that we want to store.The method of [3] stores T 0 via a bitmap B 1 using a Bloom filter, together with the set T 1 of critical false positives.T 1 consists of those k-mers which overlap k-mers from T 0 by k − 1 letters (i.e. are "potential neighbors" of nodes of T 0 ) but which are stored in B 1 "by mistake", i.e. which belong to B 1 but are not in T 0 .7 B 1 and T 1 are sufficient to represent the graph provided that the only queried k-mers are those which are potential neighbors of k-mers of T 0 .
The idea we introduce in this note is to use this structure recursively and represent the set T 1 by a new bitmap B 2 and a new set T 2 , then represent T 2 by B 3 and T 3 , and so on.More formally, starting from B 1 and T 1 defined as above, we define a series of bitmaps B 1 , B 2 , . . .and a series of sets T 1 , T 2 , . . .as follows.B 2 will store the set T 1 using a Bloom filter, and the set T 2 will contain "true nodes" from T 0 that are stored in B 2 "by mistake" (we call them false 2 positives).B 3 and T 3 , and, generally, B i and T i are defined similarly: B i stores k-mers of T i−1 using a Bloom filter, and T i contains k-mers stored in B i "by mistake", i.e. those k-mers that do not belong to T i−1 but belong to T i−2 (we call them false i positives).Observe that The following lemma shows that the construction is correct, that is it allows one to verify whether or not a given k-mer is belongs to the set T 0 .Lemma 1.Given a k-mer (node) K, consider the smallest i such that K ∈ B i+1 (if K ∈ B 1 , we define i = 0).Then, if i is odd, then K ∈ T 0 , and if i is even (including 0), then K ∈ T 0 .
Proof.Observe that K ∈ B j+1 implies K ∈ T j by the basic property of Bloom filters.We first check the Lemma for i = 0, 1.
For i = 0, we have K ∈ B 1 , and then K ∈ T 0 .For i = 1, we have K ∈ B 1 but K ∈ B 2 .The latter implies that K ∈ T 1 , and then K must be a false 2 positive, that is K ∈ T 0 .Note that here we use the fact that the only queried k-mers K are either nodes of T 0 or their neighbors in the graph (see [3]), and therefore if K ∈ B 1 and K ∈ T 0 then K ∈ T 1 .
For the general case i ≥ 2, we show by induction that Since T i−1 ⊆ T 0 for odd i, and Naturally, the lemma provides an algorithm to check if a given k-mer K belongs to the graph: it suffices to check successively if it belongs to B 1 , B 2 , . . .until we encounter the first B i+1 which does not contain K. Then the answer will simply depend on whether i is even or odd.
In our reasoning so far, we assumed an infinite number of bitmaps B i .Of course, in practice we cannot store infinitely many (and even simply many) bitmaps.Therefore we "truncate" the construction at some step t and store a finite set of bitmaps B 1 , B 2 , . . ., B t together with an explicit representation of T t .The procedure of Lemma 1 is extended in the obvious way: if for all 1 ≤ i ≤ t, K ∈ B i , then the answer is determined by directly checking K ∈ T t .

Memory and time usage
First, we estimate the memory needed by our data structure, under the assumption of infinite number of bitmaps.Let N be the number of "true positives", i.e. nodes of T 0 .From properties of Bloom filters, if T 0 has to be stored via a bitmap B 1 of size rN , the false positive rate can be estimated as c r , where c = 0.6185 (see [7]).From this, the expected number of critical false positive nodes (set T 1 ) have been estimated in [3] to be 8N c r , as every node has eight extensions, i.e. potential neighbors in the graph.We slightly refine this estimation to 6N c r by noticing that for almost all nodes, two out of these eight extensions belong to T 0 and only six are potential false positives.Furthermore, to store these 6N c r critical false positive nodes, we use a bitmap B 2 of size 6rN c r .Bitmap B 3 is used for storing nodes of T 0 which are stored in B 2 "by mistake" (set T 2 ).We estimate the number of these nodes as the fraction c r (false positive rate of filter B 2 ) of N (size of T 0 ), that is N c r .Similarly, the number of nodes we need to put to B 4 is 6N c r multiplied by c r , i.e. 6N c 2r .Keeping counting in this way, we obtain that the memory needed for the whole structure is rN + 6rN c r + rN c r + 6rN c 2r + rN c 2r + ... bits.By dividing this expression by N to obtain the number of bits per one k-mer, we get A simple calculation shows that the minimum of this expression is achieved when r = 5.464, and then the minimum memory used per k-mer is 8.45 bits.
As mentioned earlier, in practice we store only a finite number of bitmaps B 1 , . . ., B t together with an explicit representation (such as array or hash table) of T t .In this case, the memory taken by the bitmaps is a truncated sum rN + 6rN c r + rN c r + .., and a data structure storing T t takes either 2k bits, depending on whether t is even or odd.The latter follows from the observations that we need to store N c t 2 r (or 6rN c t 2 r ) k-mers, each taking 2k bits of memory.Consequently, we have to adjust the optimal value of r minimizing the total space, and re-estimate the resulting space spent on one k-mer.In this case, the memory taken by the bitmaps is a truncated sum rN +8rN c r +rN c r +.., and a data structure storing T t takes either 2k •N c t 2 r or 2k • 8N c t 2 r bits, depending on whether t is even or odd.The latter follows from the observations that we need to store N c t 2 r (or 8rN c t Table 1 shows estimations for optimal values of r and the corresponding space per k-mer for t = 4 and several values of k.The data demonstrates that even such a small value of t leads to considerable memory savings.It appears that the space per k-mer is very close to the "optimal" space (8.45 bits) obtained for the infinite number of filters.The third column of Table 1 shows the memory usage of the original method of [3], obtained using the estimation (1.44 log 2 ( 16k 2.08 ) + 2.08) from [3].Note that according to the method of [3], doubling the value of k results in a memory increment by 1.44 bits, whereas in our method the increment is of 0.11 to 0.22 bits.
Let us now estimate preprocessing and query times for our scheme.If the value of t is small (such as t = 4, as in Table 1), the preprocessing time grows insignificantly in comparison to the original method of [3].To construct each B i , we need to store T i−2 (possibly on disk, if we want to save on the internal memory used by the algorithm) in order to compute those k-mers which are stored in B i−1 "by mistake".The preprocessing time increases little in comparison to the original method of [3], as the size of B i decreases exponentially and then the time spent to construct the whole structure is linear on the size of T 0 .
Clearly, applying t Bloom filters instead of a single one introduces a multiplicative factor of t.On the other hand, set T t is generally much smaller than T 0 , due to the above-mentioned exponential decrease.Depending on the data structure for storing T t , the time saving in querying T t vs. T 0 may even dominate the time loss in querying multiple Bloom filters.Our experimental results (Section 4.1 below) confirm that this situation does indeed occur in practice.Note that even in the case when querying T t weakly depends on its size (e.g. when T t is implemented by a hash table), the query time will not increase much, due to a small value of t, as discussed earlier.

Using different values of r for different filters
In the previous section, we assumed that each of our Bloom filters uses the same value of r, the ratio of bitmap size to the number of stored k-mers.However, formula (1) for the number of bits per k-mer shows a difference for odd and even filter indices.This suggests that using different parameters r for different filters, rather than the same for all filters, may reduce the space even further.If r i denotes the corresponding ratio for filter B i , then (1) should be rewritten to and the minimum value of this expression becomes 7.913.
In the same way, we can use different values of r i in the truncated case.This leads to a small 2% to 4% improvement in comparison with case of unique value of r. ).For the case of single r, its value is shown in Table 1.
4 Experimental results

Implementation and experimental setup
We implemented our method using the Minia software [3] and ran comparative tests for 2 and 4 Bloom filters (t = 2, 4).Note that since the only modified part was the construction step and the k-mer membership queries, this allows us to precisely evaluate our method against the one of [3].The first step of the implementation is to retrieve the list of k-mers that appear more than d times using DSK [13] -a constant memory streaming algorithm to count k-mers.Each k-mer appearing more than d times (set T 0 ) is inserted into B 1 .Next, all possible extensions of each k-mer in T 0 are queried against B 1 , and those which return true are written on the disk.Then, this set is traversed and only the k-mers absent from T 0 are kept.This results in the set T 1 of critical false positives, which is also kept on disk.Up to this point, the procedure is identical to that of [3].
Next, we insert all k-mers from T 1 into B 2 and to obtain T 2 , we check for each k-mer in T 0 if a query to B 2 returns true.This results in the set T 2 .Thus, at this point we have B 1 , B 2 and T 2 , a complete representation for t = 2.In order to build the data structure for t = 4, we continue this process, by inserting T 2 in B 3 and retrieving T 3 from T 1 (stored on disk).It should be noted that to obtain T i we need T i−2 , and by always storing it on disk we guarantee not to use more memory than the size of the final structure.The set T t (that is, T 1 , T 2 or T 4 in our experiments) is stored as a sorted array and is searched by a binary search.We found this implementation more efficient than a hash table.
Assessing the query time is done through the procedure of graph traversal, as it is implemented in [3].Since the procedure is identical and independent on the data structure, the time spent on graph traversal is a faithful estimator of the query time.

E.coli dataset, varying k
In this set of tests, our main goal was to evaluate the influence of the k-mer size on principal parameters: the size of the whole data structure, the size of the set T t , the graph traversal time, and the time of construction of the data structure.We retrieved 10M E. coli reads of 100bp from the Short Read Archive (ERX008638) without read pairing information and extracted all k-mers occurring at least two times.The total number of k-mer considered varied, depending on the value of k, from 6.967.781(k = 15) to 5.923.501(k = 63).We ran each version, 1 Bloom ([3]), 2 Bloom and 4 Bloom, for values of k ranging from 16 to 64.The results are shown in Fig. 1.The total size of the structures in bits per stored k-mer, i.e. the size of B 1 and T 1 (respectively, B 1 , B 2 ,T 2 or B 1 , B 2 , B 3 , B 4 ,T 4 ) is shown in Fig. 1(a).As expected, the space for 4 Bloom filters is the smallest for all values of k considered, showing a considerable improvement, ranging from 32% to 39%, over the version of [3].Even the version with just 2 Bloom filters shows an improvement of at least 20% over [3], for all values of k.Regarding the influence of the k-mer size on the structure size, we observe that for 4 Bloom filters the structure size is almost constant, the minimum value is 8.60 and the largest is 8.89, an increase of only 3%.For 1 and 2 Bloom the same pattern is seen: a plateau from k = 16 to 32, a jump for k = 33 and another plateau from k = 33 to 64.The jump at k = 32 is due to switching from 64-bit to 128-bit representation of k-mers in the table T t .
The traversal times for each version is shown in Fig. 1(c).The fastest version is 4 Bloom, showing an improvement over [3] of 18% to 30%, followed by 2 Bloom.This result is surprising and may seem counter-intuitive, as we have four filters to apply to the queried k-mer rather than a single filter as in [3].However, the size of T 4 (or even T 2 ) is much smaller than T 1 , as the size of T i 's decreases exponentially.As T t is stored in an array, the time economy in searching T 4 (or T 2 ) compared to T 1 dominates the time lost on querying additional Bloom filters, which explains the overall gain in query time.
As far as the construction time is concerned (Fig. 1(d)), our versions yielded also a faster construction, with the 4 Bloom version being 5% to 22% faster than that of [3].The gain is explained by the time required for sorting the array storing T t , which is much higher for T 0 than for T 2 or T 4 .However, the gain is less significant here, and, on the other hand, was not observed for bigger datasets (see Section 4.4).

E. coli dataset, varying coverage
From the complete E. coli dataset (≈44M reads) from the previous section, we selected several samples ranging from 5M to 40M reads in order to assess the impact of the coverage on the size of the data structures.This strain E. coli (K-12 MG1655) is estimated to have a genome of 4.6M bp [1], implying that a sample of 5M reads (of 100bp) corresponds to ≈100X coverage.We set d = 3 and k = 27.The results are shown in Fig. 2. As expected, the memory consumption per k-mer remains almost constant for increasing coverage, with a slight decrease for 2 and 4 Bloom.The best results are obtained with the 4 Bloom version, an improvement of 33% over the 1 Bloom version of [3].On the other hand, the number of distinct k-mers increases markedly (around 10% for each 5M reads) with increasing coverage, see Fig. 2(b).This is due to sequencing errors: an increase in coverage implies more errors with higher coverage, which are not removed by our cutoff d = 3.This suggests that the value of d should be chosen according to the coverage of the sample.Moreover, in the case where read qualities are available, a quality control pre-processing step may help to reduce the number of sequencing errors.

Human dataset
We also compared 2 and 4 Bloom versions with the 1 Bloom version of [3] on a large dataset.For that, we retrieved 564M Human reads of 100bp (SRA: SRX016231) without pairing information and discarded the reads occurring less than 3 times.The dataset corresponds to ≈17X coverage.A total of 2.455.753.508k-mers were indexed.We ran each version, 1 Bloom ([3]), 2 Bloom and 4 Bloom with k = 23.The results are shown in Table 3.
The results are in general consistent with the previous tests on E.coli datasets.There is an improvement of 34% (21%) for the 4 Bloom (2 Bloom) in the size of the structure.The graph traversal is also 26% faster in the 4 Bloom version.However, in contrast to the previous results, the graph construction time increased by 10% and 7% for 4 and 2 Bloom versions respectively, when compared to the 1 Bloom version.This is due to the fact that disk writing/reading operations now dominate the time for the graph construction, and 2 and 4 Bloom  versions generate more disk accesses than 1 Bloom.As stated in Section 4.1, when constructing the 1 Bloom structure, the only part written on the disk is T 1 and it is read only once to fill an array in memory.For 4 Bloom, T 1 and T 2 are written to the disk, and T 0 and T 1 are read at least one time each to build B 2 and B 3 .Moreover, since the size coefficient of B 1 reduces, from r = 11.10 in 1 Bloom to r = 5.97 in 4 Bloom, the number of false positives in T 1 increases.

Discussion and Conclusions
Using cascading Bloom filters for storing de Bruijn graphs brings a clear advantage over the single-filter method of [3].In terms of memory consumption, which is the main parameter here, we obtained an improvement of around 30%-40% in all our experiments.Our data structure takes 8.5 to 9 bits per stored k-mer, compared to 13 to 15 bits by the method of [3].This confirms our analytical estimations.The above results were obtained using only four filters and are very close to the estimated optimum (around 8.4 bits/k-mer) produced by the infinite number of filters.An interesting characteristic of our method is that the memory grows insignificantly with the growth of k, even slower than with the method of [3].Somewhat surprisingly, we also obtained a significant decrease, of order 20%-30%, of query time.The construction time of the data structure varied from being 10% slower (for the human dataset) to 22% faster (for the bacterial dataset).
As stated previously, another compact encoding of de Bruijn graphs has been proposed in [2], however no implementation of the method was made available.For this reason, we could not experimentally compare our method with the one of [2].We remark, however, that the space bound of [2] heavily depends on the number of reads (i.e.coverage), while in our case, the data structure size is almost invariant with respect to the coverage (Section 4.3).An interesting prospect for further possible improvements of our method is offered by work [12], where an efficient replacement to Bloom filter was introduced.The results of [12] suggest that we could hope to reduce the memory to about 5 bits per k-mer.However, there exist obstacles on this way: an implementation of such a structure would probably result in a significant construction and query time increase.This issue remains, however, to be more carefully studied.

Fig. 1 .Fig. 2 .
Fig. 1. Results for 10M E.coli reads of 100bp using several values of k.The 1 Bloom version corresponds to the one presented in [3].(a) Size of the structure in bits used per k-mer stored.(b) Number of false positives stored in T1, T2 or T4 for 1, 2 or 4 Bloom filters, respectively.(c) De Bruijn graph construction time, excluding k-mer counting step.(d) De Bruijn graph traversal time, including branching k-mer indexing.

Table 1
Table 1 reveals another advantage of our improvement: the number of bits per stored k-mer remains almost constant for different values of k.
[3]st column: k-mer size; 2nd column: optimal value of r for Bloom filters (bitmap size per number of stored elements) for t = 4; 3rd column: the resulting space per k-mer; 4th column: space per k-mer for the method of[3](t = 1)

Table 2 .
Table 1 shows results for the case t = 4 for different values of k.Estimated memory occupation for the case of different values of r vs. single value of r, for 4 Bloom filters (t = 4

Table 3 .
Results of 1, 2 and 4 Bloom filters version for 564M Human reads of 100bp using k = 23.The 1 Bloom version corresponds to the one presented in [3].