Gerbil: a fast and memory-efficient k-mer counter with GPU-support

Background: A basic task in bioinformatics is the counting of k-mers in genome sequences. Existing k-mer counting tools are most often optimized for small k < 32 and suffer from excessive memory consumption or degrading performance for large k. However, given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important.

Results: We present the open source k-mer counting software Gerbil, which has been designed for the efficient counting of k-mers for k ≥ 32. Our software is the result of an intensive process of algorithm engineering. It implements a two-step approach. In the first step, genome reads are loaded from disk and redistributed to temporary files. In a second step, the k-mers of each temporary file are counted via a hash table approach. In addition to its basic functionality, Gerbil can optionally use GPUs to accelerate the counting step. In a set of experiments with real-world genome data sets, we show that Gerbil is able to efficiently support both small and large k.

Conclusions: While Gerbil's performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications that require large values of k.

Electronic supplementary material: The online version of this article (doi:10.1186/s13015-017-0097-9) contains supplementary material, which is available to authorized users.


Introduction
The counting of k-mers in large amounts of reads is a common task in bioinformatics. The problem is to count the occurrences of all k-long substrings in a large amount of sequencing reads. Its most prominent application is de novo assembly of genome sequences. Although building a histogram of k-mers seems to be quite a simple task from an algorithmic point of view, it has attracted a considerable amount of attention in recent years. In fact, the counting of k-mers becomes a challenging problem for large instances if it is to be both resource- and time-efficient, which makes it an interesting object of study for algorithm engineering. Existing tools for k-mer counting are often optimized for k < 32 and lack good performance for larger k. Recent technological advances towards larger read lengths create the need to cope with values of k exceeding 32. Studies elaborating on the optimal choice of k recommend relatively high values for various applications [1, 13]. In particular, working with long sequencing reads helps to improve accuracy and contig assembly (with k values in the hundreds) [11]. In this paper, we develop a tool with high performance for such large values of k.

Related Work
Among the first software tools that succeeded in counting the k-mers of large genome data sets was Jellyfish [5], which uses a lock-free hash table that allows parallel insertion. In the following years, several tools were published, successively reducing running time and required memory. BFCounter [6] uses Bloom filters for k-mer counting to filter out rarely occurring k-mers stemming from sequencing errors. Other tools like DSK [7] and KMC [2] exploit a two-disk architecture and aim at reducing expensive IO operations. Turtle [10] replaces a standard Bloom filter by a cache-efficient counterpart. MSPKmerCounter [4] introduces the concept of minimizers to k-mer counting, thus further optimizing the disk-based approach. The minimizer approach was later refined to signatures within KMC2 [3]. Up to now, the two most efficient open source software tools have been KMC2 and DSK. KMC2 uses a sorting-based counting approach that has been optimized for k < 32. However, its performance drops when k grows larger. DSK, in contrast, uses a single large hash table and is therefore efficient for large k (but does not support k > 127). However, for small k, it is clearly slower than KMC2. To the best of our knowledge, the only existing approach that uses GPUs for counting k-mers is the work by Suzuki et al. [12].

Contribution
In this article we present the open source k-mer counting tool Gerbil. Our software is the result of an extensive process of algorithm engineering that tried to bring together the best ideas from the literature. The result is a k-mer counting tool that is both time-efficient and memory-frugal. In addition, Gerbil can optionally use GPUs to accelerate the counting step. It significantly outperforms its strongest competitors in both efficiency and resource consumption. For large values of k, it reduces the running time by up to a factor of four. The software is written in C++ and CUDA and is freely available at https://github.com/uni-halle/gerbil under the MIT license.
In the next section we describe the general algorithmic workflow of Gerbil. Thereafter, in Section 3, we focus on algorithm engineering aspects that proved essential for high performance and describe details such as the integration of a GPU into the counting process. In Section 4, we evaluate Gerbil's performance in a set of experiments and compare it with that of KMC2 and DSK. We conclude this article with a short summary and a glance at future work.

Methods
Gerbil is divided into two phases: (1) Distribution and (2) Counting. In this section, we give a high-level description of Gerbil's workflow.

Distribution
Whole genome data sets typically do not fit into main memory. Hence, it is necessary to split the input data into a number of smaller temporary files. Gerbil uses a two-disk approach similar to that of most contemporary k-mer counting tools [3, 4, 7]. The first disk contains the input read data and is used to store the counted k-mer values. We call this disk the input/output-disk. The second disk, which we call the working disk, is used to store temporary files. The key idea is to ensure that the temporary files partition the input reads in such a way that all occurrences of a certain k-mer are stored in the same temporary file. This way, one can simply count the k-mers of the temporary files independently of each other, with small main memory requirements. To split the genome data into temporary files, we make use of the minimizer approach that was proposed by [9] and later refined by [3]. A genome sequence can be decomposed into a number of overlapping super-mers. Each super-mer is a substring of maximal length such that all k-mers on that substring share the same minimizer. Here, the minimizer of a k-mer is defined as its lexicographically smallest substring of a fixed length m < k with respect to some total ordering on strings of length m. See Fig. 1 for an example. It suffices to partition the set of super-mers into different temporary files to achieve a partitioning of all different k-mers [3].
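To make the decomposition concrete, the following C++ sketch (our illustration, not Gerbil's actual implementation) splits a read into super-mers under a plain lexicographic ordering on m-mers, ignoring reverse complements; the naive quadratic scan is chosen for clarity, whereas efficient implementations compute minimizers incrementally. With k = 4 and m = 3 it reproduces the decomposition of Fig. 1.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Lexicographically smallest m-long substring of the k-mer starting at pos.
std::string minimizer(const std::string& read, size_t pos, size_t k, size_t m) {
    std::string best = read.substr(pos, m);
    for (size_t i = pos + 1; i + m <= pos + k; ++i)
        best = std::min(best, read.substr(i, m));
    return best;
}

// Decompose a read into super-mers: maximal substrings whose k-mers all
// share the same minimizer.
std::vector<std::string> superMers(const std::string& read, size_t k, size_t m) {
    std::vector<std::string> result;
    size_t start = 0;                          // first k-mer of current super-mer
    std::string cur = minimizer(read, 0, k, m);
    for (size_t i = 1; i + k <= read.size(); ++i) {
        std::string next = minimizer(read, i, k, m);
        if (next != cur) {                     // minimizer changed: close super-mer
            result.push_back(read.substr(start, (i - 1) + k - start));
            start = i;
            cur = next;
        }
    }
    result.push_back(read.substr(start));      // final super-mer
    return result;
}

int main() {
    // Prints CAAGA, AGAA, GAACA, ACAG, CAGTG, matching Fig. 1.
    for (const auto& s : superMers("CAAGAACAGTG", 4, 3))
        std::cout << s << '\n';
}
```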

Counting
The counting of k-mers is typically done by one of two approaches: sorting and compressing [3], or using a hash table with k-mers as keys and counters as values [5, 7]. The efficiency of the sorting approach typically relies on radix sort, whose running time increases with the length of the k-mers. Since we aim at high efficiency for large k, we decided to implement the hash table approach and use a specialized hash table with k-mers as keys and counters as values.

Work Flow
Although the following description of the main process is sequential, all of the steps are interleaved and therefore executed in parallel. This is done by a classical pipeline architecture: the output of each step forms the input of the next. We use ring buffers to connect the steps of the pipeline. Such buffers are specialized for all combinations of single (S)/multiple (M) producers (P) and single (S)/multiple (M) consumers (C). The actual number of parallel threads depends on the system and is determined by the software at runtime to achieve optimal memory throughput.
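As an illustration of such a connecting structure, here is a minimal, mutex-based ring buffer sketch; it is deliberately simple and works for any producer/consumer combination, whereas Gerbil's actual buffers are specialized per combination for performance, which is not reflected here.

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>

// Minimal blocking ring buffer connecting two pipeline stages.
template <typename T>
class RingBuffer {
    std::vector<T> buf;
    size_t head = 0, tail = 0, count = 0;
    std::mutex mtx;
    std::condition_variable notFull, notEmpty;
public:
    explicit RingBuffer(size_t capacity) : buf(capacity) {}

    void push(T item) {                        // producer side: blocks when full
        std::unique_lock<std::mutex> lock(mtx);
        notFull.wait(lock, [&] { return count < buf.size(); });
        buf[tail] = std::move(item);
        tail = (tail + 1) % buf.size();
        ++count;
        notEmpty.notify_one();
    }

    T pop() {                                  // consumer side: blocks when empty
        std::unique_lock<std::mutex> lock(mtx);
        notEmpty.wait(lock, [&] { return count > 0; });
        T item = std::move(buf[head]);
        head = (head + 1) % buf.size();
        --count;
        notFull.notify_one();
        return item;
    }
};
```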

Phase One: Distribution
The goal of the first phase is to split the input data into a number of temporary files. Fig. 2 visualizes the first phase.
1. A group of reader threads reads the genome reads from the input disk into main memory. For compressed input, these threads also handle decompression.
2. A second group of parser threads converts the read data from the input format into an internal read bundle format.
3. A group of splitter threads computes the minimizers of the reads. All subsequent substrings of a read that share the same minimizer are stored as a super-mer in an output buffer.
4. A single writer thread stores the output buffers in a variable number of temporary files on the working disk.

Phase Two: Counting
After the first phase has been completed, the temporary files are sequentially re-read from the working disk and processed in the following manner (see Fig. 3).
1. A single reader thread reads the super-mers of a temporary file and stores them in main memory.
2. A group of threads splits the super-mers into k-mers. Each k-mer is distributed to one of multiple hasher threads by applying a hash function to it (see the sketch after this list). This ensures that multiple occurrences of the same k-mer are assigned to the same hasher thread and allows separate hash tables to be distributed to different memory spaces.
3. A group of hasher threads inserts the k-mers into their thread-own hash tables. After a temporary file has been completely processed, each hasher thread sends the content of its hash table to an output buffer.
4. A single writer thread writes from the output buffer to the output disk.
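The routing in step 2 can be sketched as follows; the k-mer is assumed to be given as a plain string, and the FNV-1a hash is only a stand-in, since the text does not specify which hash function Gerbil uses for this purpose.

```cpp
#include <cstdint>
#include <string>

// Route a k-mer to one of nHashers hasher threads. Any deterministic hash
// works here: equal k-mers always map to the same thread.
size_t hasherIndex(const std::string& kmer, size_t nHashers) {
    uint64_t h = 1469598103934665603ULL;       // FNV-1a 64-bit offset basis
    for (char c : kmer) {
        h ^= static_cast<uint8_t>(c);
        h *= 1099511628211ULL;                 // FNV-1a 64-bit prime
    }
    return static_cast<size_t>(h % nHashers);
}
```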

Undetermined bases
DNA reads typically contain bases that could not be identified correctly during the sequencing process. Usually, such bases are marked N in FASTQ input files. In accordance with established k-mer counting tools, we ignore all k-mers that contain an undetermined base.

Reverse-Complement
Since DNA is organized as a double helix, each k-mer x ∈ {A, C, G, T}^k corresponds to its reverse-complement, which is obtained by reversing x and replacing A ⇔ T and C ⇔ G. Thus, the k-mer ACCG corresponds to CGGT. Many applications do not distinguish between a k-mer and its reverse-complement. Therefore, each occurrence of ACCG and CGGT is counted as an occurrence of their unique canonical representation. Gerbil uses the lexicographically smaller of the two k-mers as the canonical representation. Reverse-complement normalization can be turned off via a command line flag.
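In code, the canonical representation can be computed as in the following sketch (our illustration; internally, Gerbil works on bit-encoded k-mers rather than character strings).

```cpp
#include <algorithm>
#include <string>

// Reverse complement: reverse the k-mer and swap A<->T, C<->G.
std::string reverseComplement(const std::string& kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
        }
    }
    return rc;
}

// Canonical representation: the lexicographically smaller of the pair.
// canonical("ACCG") and canonical("CGGT") both return "ACCG".
std::string canonical(const std::string& kmer) {
    return std::min(kmer, reverseComplement(kmer));
}
```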

Implementation Details
We now point out several details of the algorithm engineering process that proved essential for high performance.

Total ordering on minimizers
The choice of a total ordering has large effects on the size of the temporary files and thus also on performance. To find a good total ordering, we have to balance various aspects. On the one hand, the total number of resulting super-mers is to be minimized, in order to reduce the total disk space needed by the temporary files. On the other hand, the maximal number of distinct k-mers that share the same minimizer should not be too large, since we want an approximately uniform distribution of k-mers over the temporary files. An "ideal" total ordering would have both a small total number of super-mers and a small maximal number of distinct k-mers per minimizer. Since these requirements contradict each other, we experimentally evaluated the pros and cons of various ordering strategies.
CGAT
The lexicographic ordering of minimizers based on C < G < A < T.

Roberts et al. [8]
They propose the lexicographic ordering of minimizers with respect to C < A < T < G. Furthermore, within the minimizer computation, all bases at even positions are replaced by their reverse complement. Thus, rare minimizers like CGCGCG are preferred.

KMC2
The ordering proposed by [3] is a lexicographic ordering with A < C < G < T and some built-in exceptions that eliminate the large number of minimizers starting with AAA or ACA.

Random
A random ordering of all strings of fixed length m is unlikely to have both a small number of super-mers and a highly imbalanced distribution of distinct k-mers. It is simple to establish, since we do not need frequency samples or further assumptions about the distribution of minimizers.

Distance from Pivot (dfp(p))
To explain this strategy, consider the following observations: Sorting the minimizers ascendingly by their frequency favors rare minimizers. As a consequence, the maximal number of distinct k-mers per minimizer is small. However, the total number of super-mers can be very large. Similarly, a descendingly sorted ordering results in quite the opposite effect. To find a compromise between both extremes, we initially sort the set of minimizers by their frequency. Since the frequencies depend on the data set, we approximate them by taking samples at runtime. We fix a pivot factor 0 ≤ p ≤ 1 and re-sort the minimizers by the absolute difference of their initial position to the pivot position 4^m · p. The result is an ordering that prefers neither very rare nor very common minimizers and therefore makes a good compromise.
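A sketch of the dfp(p) ranking follows, assuming minimizers are identified by integer codes 0, ..., 4^m − 1 and that sampled frequencies are already available; the function name dfpOrdering is ours.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

// Rank all 4^m minimizers by the dfp(p) strategy: sort ascendingly by sampled
// frequency, then re-sort by the distance of that position from the pivot
// position 4^m * p. freq[i] is the sampled frequency of minimizer code i.
std::vector<uint32_t> dfpOrdering(const std::vector<uint64_t>& freq, double p) {
    const size_t n = freq.size();                  // n = 4^m
    std::vector<uint32_t> byFreq(n);
    std::iota(byFreq.begin(), byFreq.end(), 0u);
    std::sort(byFreq.begin(), byFreq.end(),
              [&](uint32_t a, uint32_t b) { return freq[a] < freq[b]; });

    std::vector<size_t> pos(n);                    // frequency rank of each code
    for (size_t i = 0; i < n; ++i) pos[byFreq[i]] = i;

    const double pivot = p * static_cast<double>(n);
    std::vector<uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    std::sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
        return std::abs(static_cast<double>(pos[a]) - pivot)
             < std::abs(static_cast<double>(pos[b]) - pivot);
    });
    return order;                                  // order[r] = code of rank r
}
```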

Evaluation
See Fig. 4 for a rating of each strategy. The value on the x-axis corresponds to the expected temporary disk memory, whereas the value on the y-axis is correlated with the maximal main memory consumption of our program. A perfect strategy would be located at the bottom left corner. Several strategies seem to be reasonable choices. We evaluated each strategy and found that a small number of super-mers is more important than a small maximal number of k-mers per minimizer for most data sets. As a result, we confirm that the total ordering already used by KMC2 is a good choice for most data sets. Therefore, Gerbil uses the strategy from KMC2 for its ranking of minimizers.

Length of minimizers
The length m of minimizers is a parameter that has to be chosen with care. However, we can consider a basic rule: the larger m is chosen, the less likely it becomes that consecutive k-mers share the same minimizer. Therefore, the number of super-mers grows with increasing m. An advantage of a larger number of super-mers is that the set of super-mers can be distributed to the temporary files more uniformly, which results in temporary files of approximately uniform size. However, a major drawback of a large number of super-mers is the increased total size of the temporary files. Thus, a small m results in better data compression.
In our experiments, we found that choosing minimizer length m = 7 is most efficient for the data sets.
Figure 5: GPU memory access pattern. The figure shows the memory area that is scanned while probing a hash table entry stored at memory address p. In this example, k = 3 and each table entry needs four bytes for the key and four bytes for the counter. Therefore, 16 entries can be loaded from global memory within one step and scanned in parallel.

GPU Integration
To integrate one or more GPUs into the process of k-mer counting, several problems have to be dealt with. Typically, a GPU performs well only if it processes data in a parallel manner. In addition, memory-bound tasks (i.e., tasks that do not require a lot of arithmetic operations) like the counting of k-mers require a carefully chosen memory access pattern to minimize the number of accesses to the GPU's global memory.
We decided to transfer the hash table based counting approach to the GPU.

GPU Hash Tables
When compiled and executed with GPU support, Gerbil automatically detects CUDA-capable GPUs. For each GPU, Gerbil replaces a CPU hasher thread by a GPU hasher thread, which maintains its own hash table in GPU memory. Each GPU hash table is similar in function to a traditional hash table. However, unlike the traditional approach, we add a large number of k-mers in parallel. Therefore, the insertion procedure is slightly changed. First, a bundle of several thousand k-mers is copied to the GPU's global memory space. Afterwards, we launch a large number of CUDA blocks, each consisting of 32 threads. Each block sequentially inserts a few k-mers into the GPU hash table. Since it becomes more and more probable to find a mismatch when probing a hash table position as the table fills up, we additionally scan adjacent table positions in a range of 128 bytes when probing a hash table entry (see Fig. 5). Due to the architecture of a GPU, this can be done within the same global memory access. Thus, we scan up to 16 table entries in parallel, thereby reducing the number of accesses to the GPU's global memory. In addition, the total number of probing operations is drastically reduced. In particular, when probing just one entry after the other, 90.37% of all 28-mers of the F Vesca data set (Table 1) could be inserted at the first probing, and no 28-mer needed more than 29 probings. In contrast, with scanning of adjacent table entries, 99.94% of all 28-mers could be inserted at the first trial, and no 28-mer needed more than seven probing operations. To eliminate race conditions between CUDA blocks, we synchronize the probing of the hash table by using atomic operations to lock and unlock hash table entries. Since such operations are efficiently implemented in hardware, a large number of CUDA blocks can be executed in parallel.
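The following CUDA sketch illustrates the window-scanning idea. It is our reconstruction, not Gerbil's source: it assumes k ≤ 16 so that a key fits into four bytes, a 128-byte-aligned table whose size is a multiple of 16 entries, and it claims slots lock-free via atomicCAS instead of the entry locking described above.

```cuda
#include <cstdint>

// One table entry: 4-byte key (2-bit-encoded k-mer, k <= 16) plus a 4-byte
// counter, so a 128-byte window holds 16 entries. EMPTY marks unused slots
// (a real table needs a spare code, since EMPTY collides with the all-T 16-mer).
struct Entry { uint32_t key; uint32_t count; };

constexpr uint32_t EMPTY = 0xFFFFFFFFu;
constexpr int WINDOW = 16;              // entries per 128-byte window
constexpr int MAX_PROBES = 64;          // bounded probing, as in the CPU tables

// Executed by all 32 threads of a block for the same k-mer: the first 16
// lanes each inspect one entry of the current window, so the whole window is
// fetched with a single 128-byte global-memory access.
__device__ bool warpInsert(Entry* table, uint32_t numWindows,
                           uint32_t kmer, uint32_t hash) {
    const int lane = threadIdx.x % 32;
    for (int probe = 0; probe < MAX_PROBES; ++probe) {
        Entry* e = &table[((hash + probe) % numWindows) * WINDOW + lane % WINDOW];
        for (;;) {
            uint32_t key = (lane < WINDOW) ? e->key : EMPTY;
            uint32_t hitMask =
                __ballot_sync(0xFFFFFFFFu, lane < WINDOW && key == kmer);
            if (hitMask) {                             // k-mer already present
                if (lane == __ffs(hitMask) - 1) atomicAdd(&e->count, 1u);
                return true;
            }
            uint32_t emptyMask =
                __ballot_sync(0xFFFFFFFFu, lane < WINDOW && key == EMPTY);
            if (!emptyMask) break;                     // window full: next probe
            int winner = __ffs(emptyMask) - 1;         // claim first empty slot
            int ok = 0;
            if (lane == winner)
                ok = (atomicCAS(&e->key, EMPTY, kmer) == EMPTY);
            if (__shfl_sync(0xFFFFFFFFu, ok, winner)) {
                if (lane == winner) atomicAdd(&e->count, 1u);
                return true;
            }
            // Lost the race: rescan this window; the new key may even be ours.
        }
    }
    return false;   // k-mer is diverted to the emergency mechanism (see below)
}
```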

Load Balancing
We dynamically balance the amount of k-mers that are assigned to the various CPU and GPU hasher threads. To this end, we constantly measure the throughput of each hasher thread, i.e., the CPU time needed to insert a certain number of k-mers. Whenever a new temporary file is loaded from disk, we rebalance the number of k-mers that are assigned to each hasher thread, considering the throughput and capacity of each hash table. Thereby, we automatically determine a good division of labour between CPU and GPU hasher threads without the need for careful hand-tuning.
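A minimal sketch of such a rebalancing step, assuming throughput is measured in k-mers per second over the previous file and ignoring the per-table capacity constraints mentioned above:

```cpp
#include <cstdint>
#include <vector>

// Before each temporary file, assign every hasher thread a share of the
// file's k-mers proportional to its measured throughput.
std::vector<uint64_t> rebalance(uint64_t kmersInFile,
                                const std::vector<double>& throughput) {
    double total = 0.0;
    for (double t : throughput) total += t;
    std::vector<uint64_t> share(throughput.size());
    for (size_t i = 0; i < throughput.size(); ++i)
        share[i] = static_cast<uint64_t>(kmersInFile * (throughput[i] / total));
    return share;   // rounding remainders and capacity checks omitted
}
```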

Estimating table sizes
We aim at estimating the expected size of each hash table as accurately as possible to save main memory. We do so since reduced memory consumption leaves more memory to the operating system, which can use it as a cache when writing temporary files. To this end, we approximate the number of expected distinct k-mers in each temporary file. We use a simple approximation mechanism that predicts the number of distinct k-mers in a file by multiplying the number of k-mers in the file with a constant that has been determined experimentally (see Fig. 6). Since this ratio depends on properties of the data set, we dynamically adjust the ratio at runtime.
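For illustration, a minimal sketch of such an estimator; the initial constant and the exact update rule are plausible assumptions on our part rather than Gerbil's precise scheme.

```cpp
#include <cstdint>

// Predict the number of distinct k-mers in a temporary file from its total
// k-mer count via a ratio that is re-estimated from files counted so far.
struct DistinctEstimator {
    double ratio = 0.175;      // initial guess, cf. the 0.15-0.2 range in Fig. 6
    uint64_t seenKmers = 0, seenDistinct = 0;

    uint64_t predict(uint64_t kmersInFile) const {
        return static_cast<uint64_t>(ratio * kmersInFile);
    }
    void update(uint64_t kmers, uint64_t distinct) {   // after counting a file
        seenKmers += kmers;
        seenDistinct += distinct;
        ratio = static_cast<double>(seenDistinct) / static_cast<double>(seenKmers);
    }
};
```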

Hash Function
To insert a k-mer into the hash table, we use a hash function that implements double hashing.

Probing Strategy
As a general strategy, we use double hashing. We stop the probing of the hash table after a constant number of trials. Consequently, it is possible that some k-mers cannot be inserted into a hash table. For that reason, Gerbil has a built-in emergency mechanism that handles such k-mers and prevents them from getting lost. CPU and GPU hasher threads follow different strategies. CPU hasher threads store such k-mers in an additional temporary file, which is processed after the work on the current temporary file has been completed. In contrast, GPU hasher threads use part of the free GPU memory to sequentially store those k-mers that could not be inserted. After all k-mers of a temporary file have been processed, the k-mers in this area are counted via a sorting and compression approach. However, it is still possible to exceed the available GPU memory. In such a case, we copy the whole set of k-mers in that area back to main memory and store them in a temporary file, similar to the CPU emergency handling. Such an operation is very costly. However, we never observed a single execution of the GPU emergency handling and only few executions of the CPU emergency handling when processing real-world data sets.
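A sequential CPU sketch of this bounded double-hashing insertion follows; the two hash values h1 and h2 are assumed to be given (h2 must be nonzero and should be coprime to the table size), and the sentinel simplification stands in for Gerbil's actual key encoding.

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t FREE = ~0ull;     // sentinel for an unused slot (simplified)
constexpr int MAX_TRIALS = 32;       // constant bound on probing, as in the text

struct Slot { uint64_t key = FREE; uint32_t count = 0; };

// Double hashing: the i-th probe lands at (h1 + i*h2) mod size. Returns false
// if the k-mer could not be placed after MAX_TRIALS probings; the caller then
// diverts it to the emergency mechanism (extra temporary file).
bool insert(std::vector<Slot>& table, uint64_t kmer, uint64_t h1, uint64_t h2) {
    const uint64_t size = table.size();
    for (int i = 0; i < MAX_TRIALS; ++i) {
        Slot& s = table[(h1 + i * h2) % size];
        if (s.key == FREE) { s.key = kmer; s.count = 1; return true; }   // new
        if (s.key == kmer) { ++s.count; return true; }                   // seen
    }
    return false;                     // handled by the emergency mechanism
}
```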

Results
We tested our implementation in a set of experiments, using the same instances as Deorowicz et al. [3] (see Table 1). For each data set we counted all k-mers for k = 28, 40, 56, and 65 and compared Gerbil's running time with those of KMC2 in version 2.3.0 and DSK in version 2.0.7. In addition, we used a synthesized test set GRCh38, created from the Genome Reference Consortium Human Reference 38 (GCA_000001405.2), from which we uniformly sampled k-mers of length 1000. The purpose of this data set is to provide longer reads that allow testing the performance for larger values of k. To judge performance on various types of hardware, we executed the experiments on two different desktop computers. See Table 2 for details about the hardware configuration of the test systems. Table 3 and Fig. 7 show the results of the performance evaluation. We want to point out several interesting observations.
• Gerbil with GPU support (gGerbil) is the most efficient tool in almost all cases. Exceptions occur for small k = 28, where the sorting-based approach of KMC2 is sometimes slightly more efficient.
• Fig. 7a shows that Gerbil starts outperforming KMC2 at k ≈ 36. Interestingly, the running time of each tool decreases with growing k. This can be explained by the small read length of the G Gallus data set. In addition, one can observe the erratic increase of running time near k = 32 and k = 64 for all tools, due to a change of the internal k-mer representation.
• When k grows, KMC2 becomes more and more inefficient, while Gerbil stays efficient. When counting the 200-mers of the GRCh38 data set, KMC2 did not finish within 20 hours, whereas Gerbil required only 98 minutes (Fig. 7b). The running time of DSK grows similarly fast to that of KMC2. Recall that DSK does not support values of k > 127.
• For small k, the use of a GPU decreases the running time by a significant amount. However, with growing k, the data structure that stores k-mers grows larger. Therefore, the number of table entries that can be scanned in parallel decreases. Experimentally, we found that the GPU-induced speedup vanishes when k exceeds 150.
We gain some additional interesting insights when we take a closer look at Table 4, which shows detailed information on running time and memory usage.
• The use of a GPU accelerates Gerbil's second phase by up to a factor of two, whereas the additional speedup given by a second GPU is only moderate.
• All tools were called with an option that sets the maximal memory size to 14 GB on Test System One and 30 GB on Test System Two. However, Gerbil typically uses much less memory due to its dynamic prediction of the hash table sizes. In contrast, both KMC2 and DSK use more main memory.
• Gerbil's disk usage is comparable to that of KMC2, whereas the disk usage of DSK is much larger.
• Gerbil's frugal use of disk and main memory is a main reason for its high performance. The use of little main memory gives the operating system the opportunity to use the remaining main memory for buffering disk operations. A small disk space consumption is essential, since disk operations are far more expensive than the actual counting.
• We can compare the actual running time with the theoretically minimal running time given by hardware constraints (here, for Test System Two). We find that the running time of phase one is bounded by the read rate of the input disk. Thus, a further speedup of the first phase can only be achieved by reading compressed input files. In contrast, there is potential for speedup in phase two: in GPU mode, Gerbil's throughput in phase two is only about 61% of the possible throughput given by the write rate of the output disk; in non-GPU mode it is about 35%. Interestingly, the throughput of the working disk is not a critical component on our test system.

Conclusion
We introduced the k-mer counting software Gerbil, which uses a hash table based approach for the counting of k-mers. For large k, a use case that becomes important for long reads, we are able to clearly outperform the state-of-the-art open source k-mer counting tools, while using significantly fewer resources. We showed that Gerbil's running time can be reduced by the use of GPUs. However, since this only affects the second phase, the overall additional speedup is only moderate. As future work, we plan to evaluate strategies for using GPUs to accelerate the first phase as well. Another option for further speedup would be to give up exactness by using Bloom filters.
In the latter case, all bits of the first byte are set to 1. The remaining four bytes contain the counter as a conventional 32-bit unsigned integer. Examples (X is undefined):
• 67 AACGTG ⇒ 01000011 00000110 1110XXXX
• 345 TGGATC ⇒ 11111111 00000000 00000000 00000001 01011001 11101000 1101XXXX
When called with the command line argument -x h, Gerbil additionally creates a human-readable CSV file containing the same, but uncompressed, information. This option should only be used for very small data sets.
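From the two examples, the layout can be reconstructed: counts below 255 occupy a single byte, larger counts an all-ones marker byte followed by a 32-bit big-endian counter, and the k-mer follows 2-bit-encoded (A = 00, C = 01, G = 10, T = 11), padded in the last byte. The following sketch encodes this reading; it is our reconstruction from the fragment above and has not been checked against Gerbil's source.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Append a counter and a 2-bit-encoded k-mer to the output buffer, following
// the byte layout of the two examples above.
void emit(std::vector<uint8_t>& out, uint32_t count, const std::string& kmer) {
    if (count < 255) {
        out.push_back(static_cast<uint8_t>(count));          // e.g. 67 -> 01000011
    } else {
        out.push_back(0xFF);                                 // marker: all bits 1
        for (int s = 24; s >= 0; s -= 8)                     // 32-bit counter,
            out.push_back(static_cast<uint8_t>(count >> s)); // big-endian
    }
    uint8_t byte = 0;
    int bits = 0;
    for (char c : kmer) {                                    // 2 bits per base
        uint8_t code = (c == 'A') ? 0 : (c == 'C') ? 1 : (c == 'G') ? 2 : 3;
        byte = static_cast<uint8_t>(byte << 2 | code);
        if ((bits += 2) == 8) { out.push_back(byte); byte = 0; bits = 0; }
    }
    if (bits > 0)                                            // pad last byte (X)
        out.push_back(static_cast<uint8_t>(byte << (8 - bits)));
}
```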

D Datasets
The data sets were downloaded from the following URLs.

Figure 1: Minimizers and super-mers of the DNA string CAAGAACAGTG. Here, k = 4 and m = 3. For each k-mer, the bold part is its minimizer. The example uses the lexicographic ordering on 3-mers based on A < C < G < T. The sequence is divided into the five super-mers CAAGA, AGAA, GAACA, ACAG, and CAGTG, which would be stored in temporary files.
Figure 6: (a) Number of 28-mers and number of distinct 28-mers in the 512 temporary files that were created while processing the F Vesca data set; each point corresponds to a temporary file. Here, the KMC2 minimizer ordering did not succeed in creating uniformly sized temporary files, since a single file contains far more 28-mers than the other 511 files. (b) Ratio of 28-mers and distinct 28-mers: dividing the number of 28-mers in each file by the number of its distinct 28-mers leads to a ratio that is used to determine the size of the hash tables. A ratio between 0.15 and 0.2 is a proper choice for the F Vesca data set.
Figure 7: Running time for various k (Test System Two). (a) Running times for 20 ≤ k ≤ 80 on the G Gallus data set. (b) Running times for 28 ≤ k ≤ 200 on the GRCh38 data set.

Table 3: Running times in the format mm:ss (best performer in bold). Each entry is the average over three runs. Missing running times for DSK are due to insufficient disk space. The label 'gGerbil' stands for Gerbil with activated GPU mode; plain 'Gerbil' does not use any GPU.

Table 4: Detailed running times (in the format mm:ss) and maximal main memory and disk space consumption (in GB) for the G Gallus instance. Each entry is the average of three runs.