Bloom Filter Trie: an alignmentfree and referencefree data structure for pangenome storage
 Guillaume Holley^{1, 2, 3}Email author,
 Roland Wittler^{1, 2, 3} and
 Jens Stoye^{1, 2, 3}
https://doi.org/10.1186/s1301501600668
© Holley et al. 2016
 Received: 8 December 2015
 Accepted: 31 March 2016
 Published: 14 April 2016
Abstract
Background
High throughput sequencing technologies have become fast and cheap in the past years. As a result, largescale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pangenome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (CDBG) extracts from the sequences all colored kmers, strings of length k, and stores them in vertices.
Results
In this paper, we present an alignmentfree, referencefree and incremental data structure for storing a pangenome as a CDBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored kmers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another stateoftheart data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying kmers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.
Conclusion
We present a novel succinct data structure called the Bloom Filter Trie for indexing a pangenome as a colored de Bruijn graph. The trie stores kmers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another stateoftheart data structure.
Availability
Keywords
 Pangenome
 Similar genomes
 Population genomics
 Colored de bruijn graph
 Bloom filter
 Compression
 Trie
 Index
 Succinct data structure
Background
A string x is a sequence of characters drawn from a finite, nonempty set, called the alphabet \(\mathcal {A}\). Its length is denoted by x. The character at position i is denoted by x[i], the substring starting at position i and ending at position j by x[i..j]. Strings are concatenated by juxtaposition. If \(x = ps\) for (potentially empty) strings p and s, then p is a prefix and s is a suffix of x.
A genome is the collection of all inheritable material of a cell. Ideally it can be represented as a single string over the DNA alphabet \(\mathcal {A} = \{{a},{c},{g},{t}\}\) (or as a few strings in case of species with multiple chromosomes). In practice, however, genomes in databases are often less perfect, either left unchanged in form of the raw data as produced by sequencing machines (millions of short sequences called reads), or after some incomplete assembly procedure in form of contiguous chromosome regions (hundreds of contigs of various lengths). We are interested in the problem of storing and compressing a set of multiple highly similar genomes, e.g. the pangenome of a bacterial species, comprising hundreds, or even thousands of strains that share large sequence parts, but differ by individual mutations from one another. An abstract structure that has been proposed for this task is the colored de Bruijn graph (CDBG) [1]. It is a directed graph \(G = (V_G,E_G)\) in which each vertex \(v \in V_G\) represents a kmer, a string of length k over \(\mathcal {A}\), associated with a set of colors representing the genomes in which the kmer occurs. A directed edge \(e \in E_G\) from vertex v to vertex \(v'\), respectively from kmer x to kmer \(x'\), exists if \(x[2..k] = x'[1..k1]\). Each kmer x has \(\mathcal {A}\) possible successors x[2..k] c and \(\mathcal {A}\) possible predecessors \(c x[1..k1]\) with \(c\in \mathcal {A}\). An implementation of such a graph does not have to store edges since they are implicitly given by vertices overlapping on \(k1\) characters.
In this paper, we propose a new data structure for indexing and compressing a pangenome as a CDBG, the Bloom Filter Trie (BFT). It is alignmentfree, referencefree and incremental, i.e., it does not need to be entirely rebuilt when a new genome is inserted. BFTs provide insertion and lookup operations for strings of fixed length associated with an annotation. This paper is an extended version of the preliminary work presented in [2].
In the next section, existing data structures and software for pangenome representation are reviewed. The BFT and the operations it supports are then described, followed by the traversal method of a CDBG stored as a BFT. Finally, experimental results showing the performance of the data structure are provided.
Existing approaches
The BFT, as well as existing tools for pangenome storage, uses a variety of basic data structures reviewed in the following. Existing tools for pangenome storage will then be discussed.
Data structures
One common way to index and compress a set of strings is the BurrowsWheeler Transform (BWT) [3] that rearranges the input data to enable better compression by aggregating characters with similar context. For multiple sets of strings, a diskbased approach [4] or different terminator characters must be used. The FMIndex [5] allows to count and locate the occurrences of a substring in the BWT.
The Sequence Bloom Tree (SBT) [7] is a binary tree with BFs as vertices. An internal vertex is the union of its two children BFs, i.e., a BF where a slot is set to 1 if the slot at the same position in at least one of the two children is 1.
A trie [8] is a rooted edgelabeled tree \(T = (V_T,E_T)\) storing a set of strings. Each edge \(e \in E_T\) is labeled with a character and no two edges starting at the same vertex can have the same character. A path from the root to a leaf represents the string obtained by concatenating all the characters on this path. The depth of a vertex v in T is denoted by depth(v, T) and is the number of edges between the root of T and v. The height of T, denoted by height(T), is the number of edges on the longest path from the root of T to a leaf. The burst trie [9] is an efficient implementation of a trie which reduces its number of branches by compressing subtries into leaves. Its internal vertices are labeled with multiple prefixes of length 1, linked to children. The leaves are labeled with multiple suffixes of arbitrary length. A leaf has a limited capacity of suffixes and is burst when this capacity is exceeded. A burst splits suffixes of a leaf into prefixes of length 1, linked to new leaves representing the remaining suffixes.
Software for pangenome storage
Existing tools for pangenome storage are mostly alignmentbased or referencebased and take a set of assembled genomes as input. Alignments naturally exhibit shared and unique regions of the pangenome but are computationally expensive to obtain. In addition, misalignments can lead to an inaccurate estimation of the pangenome regions [10]. PanCake [11] is an extension of string graphs, known from genome assembly [12], which achieves compression based on pairwise alignments. Experiments showed compression ratios of 3:1 to 5:1. Nguyen et al. [13] formulated the pangenome construction problem as an optimization problem of arranging alignment blocks for a set of genomes partitioned by homology. The complexity of the problem has been shown to be NPhard, and a heuristic using Cactus graphs [14] was provided. However, a multiple sequence alignment is required for creating the blocks, another NPhard problem.
Among the referencebased tools, Huang et al. [15] proposed to build a pangenome by annotating a reference genome with all the variants detected between a set of genomes and the reference. The BWT of the augmented reference is then computed and can be used by an aligner based on the FMIndex. While being more accurate with the augmented reference genome than BWA [16] with the reference alone, the aligner is between 10 to 100 times slower, uses significantly more memory and can introduce false positive alignments. RCSI [17] (Referentially Compressed Search Index) uses referential compression with a compressed suffix tree to store a pangenome and to search for exact or inexact matches. The inexact matching allows a limited number of edit distance operations. 1092 human genomes totaling 3.09 TB of data were compressed into an index of 115 GB, offering a compression ratio of about 28:1. Yet, the index is built for a maximum length query and a maximum number of edit operations. MRCSI [18] improves on RCSI by proposing a compressed search index based on multiple references.
Closer to our approach is SplitMEM [19], which uses a CDBG to build a pangenome from assembled genomes and extract the shared regions. The CDBG is directly constructed in a compressed way, where a nonbranching path is stored in a single vertex, using an augmented suffix tree. Baier et al. [20] improved SplitMEM in theory and practice with two algorithms that use the BWT and a compressed suffix tree. Unfortunately, both tools use more memory than the original size of the input sequences.
The tool Khmer [21] provides a lightweight representation of de Bruijn graphs [22] based on Bloom filters and a graph labeling method based on graph partitioning. Unfortunately, the graph labeling method does not offer yet enough flexibility to reproduce the experiments presented in this paper.
The SBT data structure has been implemented in an alignmentfree, referencefree and incremental software [7] to label raw sequences with their colors. The proposed tool is designed to index and compress data from sequencing experiments for effective query of fulllength genes or transcripts by separation into kmers. A leaf of an SBT is used to represent a sequencing experiment by extracting all its kmers and storing them in the BF of the leaf. The SBT software does not represent exactly the set of kmers of the sequencing experiments they contain, though, due to the inexact nature of BFs.
The Bloom Filter Trie
The Bloom Filter Trie (BFT) that we propose in this paper is an implementation of a CDBG. It is based on a burst trie and is used to store kmers associated with a set of colors. For the moment we may assume that colors are represented by a bit array color initialized with 0s. Each color has an index \(i_{color}\) such that \({color}_x[i_{color}] = 1\) records that kmer x has color \(i_{color}\). Sets of colors will later be compressed. All arrays in a BFT are dynamic: An insertion at position pos in an array A reallocates it and shifts every slot having an index \(\ge {pos}\) by one position in \(\mathcal {O}(A)\) time.
In the following, let \(t = (V_t,E_t)\) be a BFT created for a certain value of k where we assume that k is a multiple of an integer l such that kmers can be split into \(\frac{k}{l}\) equallength substrings. The maximum height of t is \({height}_{max}(t) = \frac{k}{l}1\). The alphabet we consider is the DNA alphabet \(\mathcal {A} = \{\textit{a},\textit{c},\textit{g},\textit{t}\}\), and because \(\mathcal {A} = 4\), each character can be stored using two bits. A vertex in a BFT is a list of containers, zero or more of which are compressed, plus zero or one uncompressed container. In the following, we will explain how the containers are represented and how an uncompressed container is burst when its capacity is exceeded.
Uncompressed container
Compressed container

store \(q \le c\) suffix prefixes in compressed form (in Fig. 1b, \(q=4\)),

store links to children containing the suffixes, and

reconstruct suffix prefixes and find the corresponding children.

quer is a BF represented as a bit array of length m and f hash functions, used to record and filter for the presence of q suffix prefixes;

pref is a bit array of \(2^{\lambda }\) bits initialized with 0s and used to record prefix presence exactly. Here the binary representation \(\alpha\) of a prefix a is interpreted as an integer such that \({pref}[\alpha ]\) set to 1 records the presence of a;

suf is an array of q suffixes b sorted in ascending lexicographic order of the original suffix prefixes they belong to;

clust is an array of q bits, one per suffix of array suf, that represents cluster starting points. A cluster is a list of consecutive suffixes in array suf that share the same prefix. It has an index \(i_{cluster}\) with \(1 \le i_{cluster} \le 2^{\lambda }\) and a start position \({pos}_{cluster}\) in the array suf with \(i_{cluster} \le {pos}_{cluster} \le q\). Position pos in array clust is set to 1 to indicate that the suffix in suf[pos] starts a cluster because it is the lexicographically smallest suffix of its cluster. A cluster contains \(n \ge 1\) suffixes and, therefore, position i in array clust is set to 0 for \({pos} < i < {pos}+n\). The end of a cluster is indicated by the beginning of the next cluster or if \({pos} \ge q\).
The size of q substrings in a compressed container is \(m + 2^{\lambda } + q \cdot (\mu + 1)\) bits. A bursting minimizes this size by choosing a prefix length a and a BF size m such that the set of substrings stored in a compressed container does not occupy more memory than their original representation in an uncompressed container, i.e., \(m + 2^{\lambda } \le q \cdot (\lambda  1)\). Each suffix prefix inserted after a bursting costs only \(\mu + 1\) bits. When the average size per suffix prefix stored is close to \(\mu + 1\) bits, arrays pref, suf and clust can be recomputed by increasing a and decreasing b, such that \(2^{\lambda '} + q \cdot \mu ' < 2^{\lambda } + q \cdot \mu\), where \(\lambda '\) and \(\mu '\) are the values of \(\lambda\) and \(\mu\), respectively, after resizing.
Operations supported by the Bloom Filter Trie
The BFT supports all operations necessary for storing, traversing and searching a pangenome, as well as to extract the relevant information of the contained genomes and subsets thereof. Here we describe the most basic ones of them, Lookup and Insertion, as well as how the sets of colors are compressed. The traversal of the graph is discussed in the next section.
The algorithms use three auxiliary functions. \(\textsf {HammingWeight}(\alpha ,{pref})\) counts the number of 1s in \({pref}[1..\alpha ]\) and corresponds to how many prefixes represented in array pref are lexicographically smaller than or equal to an inserted prefix a with binary representation \(\alpha\) of length \(\lambda\) bits. This requires \(\mathcal {O}(2^{\lambda })\) time. The second function, \(\textsf {Rank}(i,{clust})\), iterates over array clust from its first position until the ith entry 1 is found and returns the position of this entry. It corresponds to the start position of cluster i in array clust. If the entry is not found, the function returns \({clust}+1\) as a position. While \(\textsf {Rank}\) could be implemented in \(\mathcal {O}(1)\) time [5], we use a more naive but space efficient \(\mathcal {O}(q)\) time implementation, where q is the number of suffix prefixes in a compressed container. Finally, \(\textsf {BinarySearch}({uc},{s})\) searches for the suffix s in the uncompressed container uc in \(\mathcal {O}(\log _2 c)\) time, where c is the capacity of uc.
Lookup
The function that tests whether a suffix prefix \(s_{pref} = ab\) with binary representation \(\alpha \beta\) is stored in a compressed container cc is given in Algorithm 1. Line 1 verifies the presence of prefix a in the array pref in \(\mathcal {O}(1)\) time. If a is present, line 2 computes in \(\mathcal {O}(2^{\lambda })\) time the Hamming weight i of a, i.e., the index of the cluster in which suffix b is possibly situated. Line 3 locates the rank of i, i.e., the start position of the cluster, and lines 4–7 compare the suffixes of the cluster to b. Lines 3–7 are computed in \(\mathcal {O}(q)\) time. Algorithm 1 has therefore a worst case running time of \(\mathcal {O}(2^{\lambda } + q)\).
The function that tests whether a kmer x is present in a BFT \(t = (V_t,E_t)\) is given in Algorithm 2. Each vertex \(v \in V_t\) represents kmer suffixes possibly stored in its uncompressed container or rooted from its compressed containers. The lookup traverses t from the root and, for a vertex v, queries its containers one after the other for suffix \(x_{suf} = {x[l \cdot {depth}(v,t) + 1..k]}\). If the queried container is compressed, its BF quer is queried for \(x_{suf}[1..l]\) using the function \(\textsf {MayContain}\) in \(\mathcal {O}(f)\) time where, as above, f is the number of hash functions used by the BF. In case of a positive answer, the function \(\textsf {Contains}\) is used for an exact membership of \(x_{suf}[1..l]\). If it is found, the traversing procedure continues recursively on the corresponding child. The absence of \(x_{suf}[1..l]\) indicates the absence of x in t since \(x_{suf}[1..l]\) cannot be in another container of v because of the insertion process explained later in this paper. If the container is uncompressed, the presence of \(x_{suf}\) is detected using the function \(\textsf {BinarySearch}\). As an uncompressed container has no children, a match indicates the presence of the kmer. Algorithm 2 is initially called as \(\textsf {TreeContains}(x, 1, l, {root})\). In the worst case, all vertices on a traversed path represent all possible suffix prefixes and the BFs quer have a false positive ratio of 0. In such case, each traversed vertex contains \(\left\lceil \frac{\mathcal {A}^l}{c}\right\rceil\) containers. The longest path of a BFT has \(\frac{k}{l}\) vertices. Therefore, the worst case time of \(\textsf {TreeContains}\) is \(\mathcal {O}\left (\frac{k}{l} \cdot \left (\left\lceil \frac{\mathcal {A}^l}{c}\right\rceil \cdot f + 2^{\lambda } + q\right )\right)\).
Insertion
Prior to any kmer insertion into a BFT t, a lookup verifies if the kmer is already present. If it is, only its set of colors is modified. Otherwise, the lookup stops the trie traversal on a container cont of a vertex v where the searched suffix prefix or kmer suffix is not present. If cont is uncompressed, the insertion of the kmer suffix and its color is a simple \(\mathcal {O}(\log _2 c)\) time process. If cont is compressed, the insertion of suffix prefix \(s_{pref} = ab\) is a bit more intricate. In fact, it will only be triggered if cont is the first compressed container of v to have \(s_{pref}\) as a false positive (\(\textsf {MayContain}(s_{pref},{cont}.{quer}) = {true}\) and \(\textsf {Contains}(s_{pref},{cont}) = {false}\)). False positives are therefore “recycled”, which is a nice property of BFTs: The BF quer remains unchanged, and only pref, suf and clust need to be updated in a way similar to Algorithm 1. The presence of prefix a must be first verified by testing the value of \({pref}[\alpha ]\) where \(\alpha\) is the binary representation of a. If \({pref}[\alpha ] = 0\), prefix a is not present and is recorded by setting \({pref}[\alpha ]\) to 1. Then, the index \({id}_{cluster}\) and start position \({pos}_{cluster}\) of the new cluster are computed using \(\textsf {HammingWeight}\) and \(\textsf {Rank}\). The suffix b is inserted into \({suf}[{pos}_{cluster}]\) and a 1 into \({clust}[{pos}_{cluster}]\). This takes \(\mathcal {O}(2^{\lambda } + 2q)\) time. If \({pref}[\alpha ] = 1\) prior to insertion, prefix a is already present, and \({id}_{cluster}\) and \({pos}_{cluster}\) have already been computed by \(\textsf {Contains}(s_{pref},{cont})\). Let n be the number of suffixes in cluster \({id}_{cluster}\). Suffix b is inserted into suf[pos] such that \({pos}_{cluster} \le {pos} \le {{pos}_{cluster}+n}\) and \({suf}[{pos}1] < {suf}[{pos}]\). If \({pos} = {pos}_{cluster}\), b starts its cluster: A 1 is inserted into clust[pos] and \({clust}[{pos}+1]\) is set to 0. Otherwise, a 0 is inserted into clust[pos]. This takes \(\mathcal {O}(2q)\) time. The worst case insertion time of a kmer is \(\mathcal {O}(d + 2^{\lambda } + 2q)\) with d being the worst case time lookup.
Color compression
Remember from the BFT description that color sets associated with kmers in a CDBG are initially stored as bit arrays in BFTs. However, these can be compressed by storing sets of colors that are identical for multiple kmers once. To this end, a list of all color sets occurring in the BFT is built and sorted in decreasing order of total size, i.e., the number of kmers sharing a color set multiplied by its size. Then, by iterating over the list, each color set is added incrementally to an external array if the integer encoding its position in the array uses less space than the size of the color set itself. Finally, each color set present in the external array is replaced in the BFT by its position in the external array.
Traversing successors and predecessors
Let t be a BFT that represents a CDBG G. For a kmer x, visiting all its predecessors or successors in G, denoted pred(x, G) and succ(x, G), respectively, implies the lookup of \(\mathcal {A}\) different kmers in t. Such a lookup would visit in the worst case \(\mathcal {A} \cdot ({height}_{max}(t)+1)\) vertices in t. This section describes how to reduce the number of vertices and containers visited in t during the traversal of a vertex in G.
Observation 1
Let G be a CDBG represented by a BFT t and x a kmer corresponding to a vertex of G. All kmers of succ(x, G) share x[2..k] as a common prefix and therefore share a common subpath in t starting at the root. On the other hand, kmers of pred(x, G) have different first characters and, therefore, except for the root of t do not share a common subpath. Hence, the maximum number of visited vertices in t for all kmers of succ(x, G) is \(1 + {height}_{max}(t)\) and for all kmers of pred(x, G) is \(1 + \mathcal {A} \cdot {height}_{max}(t)\).
Lemma 1
Let G be a CDBG represented by a BFT t, x a kmer in t and v a vertex of t that terminates the shared subpath of the kmers in succ(x, G). If \({depth}(v,t) = {height}_{max}(t)\), succ(x, t) suffixes may be stored in any container of v. If not, they are stored in the uncompressed container of v.
Proof
A vertex v is the root of a subtrie storing kmer suffixes of length \({l \cdot ({height}_{max}(t)  {depth}(v,t) + 1)}\) with \(l = \frac{k}{{height}_{max}(t)+1}\). Let s be a kmer suffix of succ(x, t) rooted at a vertex \(v \in V_t\). If \({depth}(v,t) \ne {height}_{max}(t)\) but s is rooted at a compressed container in v, then this compressed container stores s[1..l], and \(s[l+1..s]\) is rooted in one of its children. As the divergent character between the kmer suffixes of succ(x) is in position \(s  1\), this character is in \(s[l+1..s]\), rooted at one child of this compressed container. Therefore v does not terminate the common subpath shared by succ(x, t) kmers. \(\square\)

searching at the root of t for kmers of pred(x, G),

if \({depth}(v,t) = {height}_{max}(t)\), searching at vertex v for suffixes of succ(x, G).
Restricting the hash functions used in the compressed containers to take only positions 2 through \(l1\) into account, allows to limit the search space.
Lemma 2
Let t be a BFT where the f hash functions \(h_i\) of quer have the form \(h_i(s_{pref}):s_{pref}[2..l1]\rightarrow \{1,..,m\}\) for \(i=1,...,f\). Then, for a vertex v of t and a suffix prefix \(s_{pref}\), all possible substrings \(s'_{pref} = c_1 s_{pref}[2..l1] c_2\) are contained in the same container of v.
Proof
Assume a kmer suffix s inserted in a vertex v of t. A lookup for s analyzes the containers of v from the head to the tail of the container list. In the worst case, s can be rooted, according to BFs quer, in all compressed containers as a true positive or as a false positive. However, a lookup stops either on the first compressed container claiming to contain the suffix prefix \(s_{pref} = s[1..l]\), or on the uncompressed container. As the hash functions of quer consider only \(s_{pref}[2..l1]\), a lookup will therefore stop on the same container for any substring \(s'_{pref} = c_1 s_{pref}[2..l1] c_2\). \(\square\)
As a consequence of Lemma 2, each suffix prefix \(s_{pref}\) stored or to store in arrays pref, suf and clust is modified such that \(s'_{pref} = s_{pref}[2..l] s_{pref}[1]\), which guarantees that all \(s''_{pref} = s'_{pref}[1..l2] c_2 c_1\) are in the same container. Furthermore, suffixes stored in array suf are required to have a minimum length of two characters to ensure that characters \(c_1\) and \(c_2\), the variable parts between the different \(s''_{pref}\), are stored in array suf. Hence, as all \(s''_{pref}\) share \(s'_{pref}[1..l2]\) as a prefix, they share the same cluster in arrays suf and clust. Suffix prefixes \(s''_{pref} = s'_{pref}[1..l1] c_1\) also have consecutive suffixes in their cluser.
Evaluation
Insertion time and memory usage for the real (P. aeruginosa) and simulated (Y. pestis) dataset. The compression ratio is given w.r.t. the original file sizes. Disk sizes for the SBT are given for the leaves first and then for the complete tree
P. aeruginosa  Y. pestis  

BFT  SBT  BFT  SBT  
Insertion time (min)  168.52  371.45  29.88  32.67 
Peak of main memory (MB/compr. ratio)  7487/112:1  7356/114:1  1313/182:1  1586/151:1 
Disk size (MB/compr. ratio)  1644/511:1  2076–4572/405:1–184:1  484/495:1  538–1117/445:1–214:1 
Compressed disk size (MB/compr. ratio)  833/1009:1  1906–4280/441:1–196:1  225/1064:1  528–1099/454:1–218:1 
Total and per kmer query times for the real (P. aeruginosa) and simulated (Y. pestis) dataset with peaks of main memory
P. aeruginosa  Y. pestis  

BFT  SBT  BFT  SBT  
Total query time (min)  1.19  61.86  0.57  37.42 
Query time per kmer (μs)  7.14  371.16  3.42  224.52 
Peak of main memory (MB)  2076  11,678  544  7775 
A third experiment gives an estimation of the time required to traverse the graph represented by a BFT: It verifies for each kmer of the batch queries whether its corresponding vertex in the graph is branching. This experiment first computes information about the root in a negligible amount of time and memory. Then, the BFT is queried for its branching vertices. For the real dataset, this experiment took 55.52 s (average time of 5.55 \(\mu\)s per 63mer), resulting in 1,574,198 branching vertices. For the simulated dataset, this experiment took 38.79 s (average time of 3.88 \(\mu\)s per 54mer), resulting in 141,802 branching vertices.
In summary, in our experiments the BFT was up to two times faster to build than the SBT while using about the same amount of main memory. When written on disk, the BFT used less memory than SBT for both datasets and when compressed with 7z, the BFT was about two times smaller than the SBT. For querying kmers, the BFT was about 52 to 66 times faster than the SBT while using about 5.5 to 14.3 times less memory.
Conclusions
We proposed a novel data structure called the Bloom Filter Trie for storing a pangenome as a colored de Bruijn graph. The trie stores kmers and their colors. A new representation of vertices is proposed to compress and index shared substrings. It uses four basic data structures that allow to quickly verify the presence of substrings. In the worst case, the compressed strings have a memory footprint close to their binary representation. However, we observe in practice substantial memory savings. Future work concerns the possiblity to compress nonbranching paths that share the same colors [19] and also the extraction of the different pangenome regions.
Declarations
Authors’ contributions
GH developed, implemented and evaluated the method. All authors wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
The authors wish to thank the authors of SBT for helpful comments. GH and RW are funded by the International DFG Research Training Group GRK 1906/1.
Competing interests
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012;44(2):226–32.View ArticlePubMedPubMed CentralGoogle Scholar
 Holley G, Wittler R, Stoye J. Bloom Filter Triea data structure for Pangenome storage. In: Proc. of 15th Workshop on Algorithms in Bioinformatics, vol. 9289. 2015. p. 217–30.Google Scholar
 Burrows M, Wheeler DJ. A blocksorting lossless data compression algorithm. Digital SRC Research Report 124. 1994.Google Scholar
 Cox AJ, Bauer MJ, Jakobi T, Rosone G. Largescale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics. 2012;28(11):1415–9.View ArticlePubMedGoogle Scholar
 Ferragina P, Manzini G. An experimental study of an opportunistic index. Proc. of the 12th ACMSIAM Symposium on Discrete Algorithms. 2001;1:269–78.Google Scholar
 Bloom BH. Space/time tradeoffs in hash coding with allowable errors. Comm ACM. 1970;13(7):422–6.View ArticleGoogle Scholar
 Solomon B, Kingsford C. Largescale search of transcriptomic read sets with sequence Bloom Trees. bioRxiv 017087. 2015.Google Scholar
 Fredking E. Trie Memory. Comm ACM. 1960;3(9):490–9.View ArticleGoogle Scholar
 Heinz S, Zobel J, Williams HE. Burst tries: a fast, efficient data structure for string keys. ACM Trans Inf Syst. 2002;20(2):192–223.View ArticleGoogle Scholar
 DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M. A framework for variation discovery and genotyping using nextgeneration DNA sequencing data. Nat Genet. 2011;43(5):491–8.View ArticlePubMedPubMed CentralGoogle Scholar
 Ernst C, Rahmann S. PanCake: a data structure for pangenomes. Proc German Conf Bioinform. 2013;34:35–45.Google Scholar
 Myers EW. The fragment assembly string graph. Bioinformatics. 2005;21:79–85.Google Scholar
 Nguyen N, Hickey G, Zerbino DR, Raney B, Earl D, Armstrong J, Haussler D, Paten B. Building a pangenome reference for a population. J Comput Biol. 2015;22(5):387–401.View ArticlePubMedGoogle Scholar
 Paten B, Diekhans M, Earl D, John JS, Ma J, Suh B, Haussler D. Cactus graphs for genome comparisons. J Comput Biol. 2011;18(3):469–81.View ArticlePubMedGoogle Scholar
 Huang L, Popic V, Batzoglou S. Short read alignment with populations of genomes. Bioinformatics. 2013;29(13):361–70.View ArticleGoogle Scholar
 Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60.View ArticlePubMedPubMed CentralGoogle Scholar
 Wandelt S, Starlinger J, Bux M, Leser U. RCSI: Scalable similarity search in thousand(s) of genomes. Proc VLDB Endowment. 2013;6(13):1534–45.View ArticleGoogle Scholar
 Wandelt S, Leser U. MRCSI: compressing and searching string collections with multiple references. Proc VLDB Endowment. 2015;8(5):461–72.View ArticleGoogle Scholar
 Marcus S, Lee H, Schatz MC. SplitMEM: a graphical algorithm for pangenome analysis with suffix skips. Bioinformatics. 2014;30(24):3476–83.View ArticlePubMedPubMed CentralGoogle Scholar
 Baier U, Beller T, Ohlebusch E. Graphical pangenome analysis with compressed suffix trees and the Burrows–Wheeler transform. Bioinformatics. 2016;32(4):497–504.View ArticlePubMedGoogle Scholar
 Crusoe MR, Alameldin HF, Awad S, Bucher E, Caldwell A, Cartwright R, Charbonneau A, Constantinides B, Edvenson G, Fay S, Fenton J, Fenzl T, Fish J, GarciaGutierrez L, Garland P, Gluck J, Gonzalez I, Guermond S, Guo J, Gupta A, Herr JR, Howe A, Hyer A, Harpfer A, Irber L, Kidd R, Lin D, Lippi J, Mansour T, McA’Nulty P, McDonald E, Mizzi J, Murray K, Nahum JR, Nanlohy K, Nederbragt AJ, OrtizZuazaga H, Ory J, Pell J, PepeRanney C, Russ ZN, Schwarz E, Scott C, Seaman J, Sievert S, Simpson J, Skennerton CT, Spencer J, Srinivasan R, Standage D, Stapleton JA, Steinman SR, Stein J, Taylor B, Trimble W, Wiencko HL, Wright M, Wyss B, Zhang Q, Zyme E, Brown CT. The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research 4. 2015.Google Scholar
 Pell J, Hintze A, CaninoKoning R, Howe A, Tiedje JM, Brown CT. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci. 2012;109(33):13272–7.View ArticlePubMedPubMed CentralGoogle Scholar
 Li H. Wgsim. 2011. https://github.com/lh3/wgsim (Accessed 26 Nov 2015).
 Chikhi R, Medvedev P. Informed and automated kmer size selection for genome assembly. Bioinformatics. 2014;30(1):31–7.View ArticlePubMedGoogle Scholar
 Raman R, Raman V, Rao SS. Succinct indexable dictionaries with applications to encoding kary trees and multisets. In: Proc. of the 13h Annual ACMSIAM Symposium on Discrete Algorithms. 2002. p. 233–242.Google Scholar
 Pavlov I. 7Zip compressor. 1999. http://www.7zip.org/ (Accessed 26 Nov 2015).