 Research
 Open access
 Published:
On weighted kmer dictionaries
Algorithms for Molecular Biology volume 18, Article number: 3 (2023)
Abstract
We consider the problem of representing a set of \(k\)mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a \(k\)mer is efficient. The representation is called a weighted dictionary of \(k\)mers and finds application in numerous tasks in Bioinformatics that usually count \(k\)mers as a preprocessing step. In fact, \(k\)mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185–194, 2022) to also store compactly the weights of the \(k\)mers. From a technical perspective, we exploit the order of the \(k\)mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on realworld datasets and comparison with competitive alternatives. Up to date, SSHash is the only \(k\)mer dictionary that is exact, weighted, associative, fast, and small.
Introduction
Recent advancements in the socalled Next Generation Sequencing (NGS) technology made possible the availability of very large collections of DNA. However, before being able to actually analyze the data at this scale, efficient methods are required to index and search such collections. One popular strategy to address this challenge is to consider short substrings of fixed length k, known as \(k\)mers. Software tools based on \(k\)mers are predominant in Bioinformatics and they have found applications in genome assembly [1, 2], variant calling [3, 4], pangenome analysis [5, 6], metagenomics [7], sequence comparison [8,9,10], just to name a few ones.
For several such applications it is important to quantify how many times a given \(k\)mer is present in a DNA database. In fact, many efficient \(k\)mer counting tools have been developed for this task [11,12,13,14,15]. The output of these tools is a table where each distinct \(k\)mer in the database is associated to its abundance count, or weight. The weights are either exact or approximate (in this work, we focus on exact weights). These genomic tables are usually very large and take several GBs—in the range of 40–80 bits/\(k\)mer or more according to recent experiments [13, 16, 17]. Therefore, the tables should be compressed effectively while permitting efficient random access queries in order to be useful for online processing tasks. This is precisely the goal of this work. We better formalize the problem as follows.
Problem 1
(The Weighted \(k\)mer Dictionary Problem) Let S be a long DNA string and \(\mathcal {K}\) be the set of the n distinct pairs \(\langle g, w(g) \rangle \), where g is a \(k\)mer of S and w(g) is the weight of g, i.e., the number of times g occurs in S. We want a compressed representation of \(\mathcal {K}\) that permits to efficiently check the exact membership of g to \(\mathcal {K}\) and, if g actually belongs to \(\mathcal {K}\), retrieve w(g).
In a previous investigation, we proposed a sparse and skew hashing scheme for \(k\)mers (SSHash, henceforth) [18]—a compressed dictionary that relies on \(k\)mer minimizers [8] and minimal perfect hashing [19, 20] to support fast membership (in both random and streaming query modality) in succinct space. These features make SSHash useful for several applications, including reference indexing and pseudoalignment [21]. However, we did not consider the weights of the \(k\)mers. In this work, therefore, we enrich the SSHash data structure with the weight information to solve Problem 1. The main practical result is that, by exploiting the order of the \(k\)mers represented in SSHash, the compressed exact weights take only a small extra space on top of the space of SSHash. This extra space is proportional to the number of runs (maximal subsequences formed by all equal symbols) in the weights and not proportional to the number of distinct \(k\)mers. As a consequence, the weights are represented in a much smaller space than the empirical entropy lower bound.
We study the problem of reducing the number of runs in the weights and model it as a graph covering problem, for which we give an optimal, lineartime, algorithm. The optimization algorithm effectively reduces the number of runs to the minimum, hence improving space even further.
When empirically compared to other weighted dictionaries that can be either somewhat smaller but much slower or much larger, SSHash embodies a robust tradeoff between index space and query efficiency.
Related work
A solution to the weighted \(k\)mer dictionary problem can be obtained using the popular FMindex [22]. The FMindex represents the original DNA string taking the BurrowsWheeler transform (BWT) [23] of the string. Reporting the weight of a \(k\)mer is solved using the count operation of the FMindex which involves O(k) rank queries over the BWT.
Another solution using the BWT is the socalled BOSS data structure [24] that is a succinct representation of the de Bruijn graph of the input—a graph where the nodes are the \(k\)mers and the edges model the overlaps between the \(k\)mers. The BOSS data structure has been recently enriched with the weights of the \(k\)mers [16], by deltaencoding the weights on a spanning branching of the graph. Since consecutive \(k\)mers often have equal (or very similar) weights, good space effectiveness is achieved by this technique.
Other solutions, instead, rely on hashing for faster query evaluation compared to BWTbased indexes. For example, both deBGR [17] and Squeakr [13] use a counting quotient filter [25] to store the \(k\)mers and the weights. They can either return approximate weights, i.e., wrong answers with a prescribed (low) probability, for better space usage of exact weights at the price of more index space. In any case, the memory consumption of these solutions is not competitive with that of BWTbased ones as they do not employ sophisticated compression techniques and were designed for other purposes, e.g., dynamic updates.
A closely related problem is that of realizing maps from \(k\)mers to weights, i.e., data structures that do not explicitly represent the \(k\)mers and so return arbitrary answers for outofset keys. In the context of this work, we distinguish between such approaches, maps, and dictionaries that instead represent both the \(k\)mers and the weights. Besides minimal perfect hashing [19, 20], some efficient maps have been proposed and tailored specifically for genomic counts, such as based on setmin sketches [26] and compressed static functions (CSFs) [27]. These proposals leverage on the repetitiveness of the weights (lowentropy distributions) to obtain very compact space.
Lastly in this section, we report that other works [28, 29] considered the multidocument version of the problem studied here, that is, how to retrieve a vector of weights for a query \(k\)mer, where each component of the vector represents the weight of the \(k\)mer in a distinct document. Also such count vectors are usually very “regular” (or can be made so by introducing some approximation) [28] and present runs of equal symbols that can be compressed effectively with runlength encoding (RLE).
Representing runs of weights
In this section we describe the compression scheme for the weights that we use in SSHash. Recall that we indicate with \(\mathcal {K}\) the set of n distinct \(\langle \) \(k\)mer, weight\(\rangle \) \(=\) \(\langle g, w(g) \rangle \) pairs, that we want to store in a dictionary. With a little abuse of notation, we write “\(g \in \mathcal {K}\)” for a \(k\)mer g to mean that there is a pair of \(\mathcal {K}\) whose \(k\)mer is g. We first highlight the main properties of SSHash that we are going to exploit in the following to obtain good space effectiveness for the weights (for all the other details concerning the SSHash index, we point the interested reader to our previous work [18]).
From a highlevel perspective, SSHash implements the function \(h: \Sigma ^k \rightarrow \{0,1,\ldots ,n\}\), where \(n=\mathcal {K}\) and \(\Sigma ^k\) is the whole set of klength strings over the DNA alphabet \(\Sigma =\mathsf {\{A,C,G,T\}}\). In particular, h(g) is a unique value \(1 \le i \le n\) if \(g \in \mathcal {K}\); or \(h(g)=0\) otherwise, i.e., if \(g \notin \mathcal {K}\). In other words, SSHash serves the same purpose of a minimal and perfect hash function (MPHF) [20] for \(\mathcal {K}\) but, unlike a traditional MPHF, SSHash rejects alien \(k\)mers. This is possible because the \(k\)mers of \(\mathcal {K}\) are actually represented in SSHash whereas the space of a traditional MPHF does not depend on the input keys.
The value \(i=h(g)\) for the \(k\)mer \(g \in \mathcal {K}\) is the handle of g, or its “hash” code. The hash codes can be used to associate some satellite information to the \(k\)mers such as, for example, the weights themselves using an array W[1..n] where \(W[h(g)]=w(g)\).
SSHash takes advantage of the fact that the input set \(\mathcal {K}\) can be processed into a socalled spectrumpreserving string set (or SPSS) \(\mathcal {S}\)—a collection of strings \(\mathcal {S}=\{{S}_1,\ldots ,{S}_m\}\) where each \(k\)mer of \(\mathcal {K}\) appears exactly once. We omit the details here on how the collection \(\mathcal {S}\) can be built; we only report that there are efficient algorithms for this purpose that also try to minimize the total number of symbols in \(\mathcal {S}\), i.e., the quantity \(\sum _{i=1}^m {S}_i\). One such algorithm is the UST algorithm [30] that we also use to prepare the input for SSHash. The key property of SSHash in which we are interested is that—after \(\mathcal {K}\) is processed into the SPSS \(\mathcal {S}\)—the function h preserves the relative order of the \(k\)mers, that is: if \(g_1[1..k]\) and \(g_2[1..k]\) are two \(k\)mers with \(g_1[2..k]=g_2[1..k1]\) (i.e., \(g_2\) comes immediately after \(g_1\) in a string), then \(h(g_2)=h(g_1)+1\). Therefore, consecutive \(k\)mers, i.e., those sharing an overlap of \(k1\) symbols, are also given consecutive hash codes.
Therefore, once an order \({S}_1,\ldots ,{S}_m\) for the strings of \(\mathcal {S}\) is fixed, then also an order \(i=1,\ldots ,n\) for the \(k\)mers \(g_i\) is uniquely determined. Let W[1..n] be the sequence of weights in this order. Then, we have: \( h(g_i)=i \text { and } W[i] = w(g_i), \text { for } i=1,\ldots ,n. \)
This orderpreserving behavior of h induces a property on the sequence of weights W[1..n] that significantly aids compression: W contains runs, i.e., maximal subsequences of equal weights. This is so because consecutive \(k\)mers are very likely to have the same weight due to the high specificity of the strings. This a known fact, also observed in prior work [16, 27, 28]. Here, we are exploiting the order of the \(k\)mers given by SSHash to preserve the natural order of the weights in W. Note that this cannot be achieved by approximate schemes that do not represent the \(k\)mers themselves, like a generic MPHF or a CSF. Even if the \(k\)mers were available, those schemes are unable to assign consecutive hashes to consecutive \(k\)mers, actually shuffling the weights at random and, thus, making W very difficult to compress.
It is standard to represent a sequence W featuring r runs of equal symbols using runlength encoding (RLE), i.e., W is modeled as a sequence of runlength pairs \( RLW = \langle w_1,\ell _1 \rangle \langle w_2,\ell _2 \rangle \cdots \langle w_r,\ell _r \rangle \) where \(w_i\) and \(\ell _i\) are, respectively, the value of the run and the length of the ith run in W. Figure 1 shows an example of RLW for a collection \(\mathcal {S}\) with 4 weighted strings.
Encoding RLW
Let \(\mathcal {D}\) be the set of all distinct \(w_i\) in RLW. Clearly, \(r \ge \mathcal {D}\) as we must have at least one run per distinct weight. We store \(\mathcal {D}\) using \(\mathcal {D}\lceil \log _2max \rceil \) bits where \(max \ge 1\) is the largest \(w_i\). We use \(\mathcal {D}\) to uniquely represent each \(w_i\) in RLW with \(\lceil \log _2\mathcal {D}\rceil \) bits. Since runs are maximal subsequences in W by definition, then \(w_i \ne w_{i+1}\) for every \(i=1,\ldots ,r1\) (adjacent weights must be different). Then we take the prefixsums of the sequence \(0,\ell _1,\ldots ,\ell _{r1}\) into an array L[1..r] and encode it with EliasFano [31, 32].^{Footnote 1} By construction we have that \(\sum _{i=1}^r \ell _i = n\) since the runs must cover the whole set of \(k\)mers. So the largest element in L is actually \(n\ell _r\) and we spend at most \(r\lceil \log _2(n/r)\rceil + 2r + o(r)\) bits for L. Summing up, we spend at most
for representing RLW on top of the space of SSHash. In conclusion, the weights are represented in space proportional to the number of runs in W (i.e., \(r=RLW\)) and not proportional to the number of \(k\)mers, which is n. As a consequence, this space is likely to be considerably less than the empirical entropy \(H_0(W)\) as we are going to see with the experiments in “Experiments” section.
To retrieve the weight w(g) from \(i=h(g)\), all that is required is to identify the run containing i. This operation is done in \(O(\log (n/r))\) time with a predecessor query over L given that we represent L with EliasFano. If the identified run is the jth run in W, then \(w_j\) is retrieved in O(1) from \(\mathcal {D}\).
The problem of reducing the number of runs
In “Representing runs of weights” section we presented an encoding scheme for the \(k\)mer weights whose space is proportional to the number of runs in the sequence of weights W. Therefore, in this section we consider the problem of reducing the number of runs in the weights to optimize the space of the encoding.
Rules of the game
We assume that the strings in \(\mathcal {S}\) are atomic entities: it is not allowed to partition them into substrings (e.g., in correspondance of the runs of weights in the strings). In fact, since the strings are obtained by the UST algorithm [30] with the purpose of minimizing the number of nucleotides as we explained in “Representing runs of weights” section, breaking them will lead to an increased space usage for the \(k\)mers, actually dwarfing any spacesaving effort spent for the weights. With this constraint specified, there are only two degrees of freedom that can be exploited to obtain better compression for W: (1) the order of the strings, and (2) the orientation of the strings. Altering \(\mathcal {S}\) using these two degrees of freedom does not affect the correctness nor the (relative) orderpreserving property of the function \(h: \Sigma ^k \rightarrow \{0,1,\ldots ,n\}\) implemented by SSHash. In fact, as evident from our description in “Representing runs of weights” section, the output of h will still be \(\{1,\ldots ,n\}\) as the \(k\)mers themselves do not change (even when taking reversecomplements into account as they are considered to be identical). What changes is just the absolute order of the \(k\)mers as a consequence of permuting the order of the strings \(\{{S}_1,\ldots ,{S}_m\}\) in \(\mathcal {S}\).
Therefore, our goal is to permute the order of the strings in \(\mathcal {S}\) and possibly change their orientations to reduce the number of runs in W. We now consider an illustrative example to motivate why both these two operations—those of changing the order and orientation of a string—are important to reduce the number of runs. Refer to Fig. 2a which shows an example collection of \(m=\mathcal {S}=4\) weighted strings (for \(k=3)\). Applying the permutation \(\pi =[1,4,2,3]\) as shown in Fig. 2b reduces the number of runs by 1 because the run at the junction of string 4 and 2 can be glued. Lastly, applying the signed permutation \(\pi =[+1,+4,2,+3]\) as in Fig. 2c reduces the number of runs by 3, which is the best possible. Our objective is to compute such a signed permutation \(\pi \) for an input collection of strings, in order to permute \(\mathcal {S}\) as shown in Algorithm 1.
Figure 2 also suggests that the final result \(\pi \) solely depends on the weight of the first and last \(k\)mer of each sequence—which we call the endpoint weights (or just endpoints) of a sequence—and not on the other weights nor the nucleotide sequences. Therefore, it is useful to model an input collection \(\mathcal {S}\) using a graph, defined as follows.
Definition 1
(Endpoint weight graph) Given a collection of weighted sequences \(\mathcal {S}\), let G be a graph where:

There is a node u for each sequence of \(\mathcal {S}\) and u has two sides—a front and a back side—respectively labelled with the first and last weights of the sequence (endpoint weights).

There is an edge between any two nodes u and v that have a side with the same weight.
The graph G is called the endpoint weight graph for \(\mathcal {S}\) and indicated with \(G(\mathcal {S})\).
In the following, we assume the collection \(\mathcal {S}\) to be clear from the context and we refer to \(G(\mathcal {S})\) simply as G. We indicate a node \(u \in G\) using the identifier (id) of the corresponding sequence of \(\mathcal {S}\). Also, we associate to u a \(sign \in \{1,+1\}\) (also called “orientation”), indicating whether the sequence should be reversecomplemented. In summary, a node \(u \in G\) is the 4tuple \((id ,front ,back ,sign )\).
Definition 2
(Oriented path) An oriented path of length \(\ell \) in G is either a single node (\(\ell =1\)) or a sequence of nodes \(u_1 \rightarrow \cdots \rightarrow u_{\ell }\) where each consecutive pair of nodes \(u_i \rightarrow u_{i+1}\) is oriented in such a way that \(u_i.back = u_{i+1}.front \), for any \(1 \le i < \ell \), \(\ell \ge 2\).
Since we will be interested only in oriented paths, we just refer to them as “paths”. For ease of notation, we will indicate a path in our examples as a sequence of signed numbers \((i_1 \rightarrow \cdots \rightarrow i_{\ell })\) where each number represents a node’s id and its sign represents the node’s sign. The first and the last node of a path P are indicated, respectively, with the \(P.front \) and the \(P.back \). The weights \(P.front .front \) and \(P.back .back \) are the two endpoints of the path.
Given this graph model, it follows that the problem of finding a signed permutation \(\pi \) for \(\mathcal {S}\) is equivalent to that of computing a (disjointnode) path cover C for G, i.e., a set of paths in G that visit all the nodes and where each node belongs to exactly one path. In fact note that, given a cover C for G, there is a lineartime reduction from C to \(\pi \) as illustrated in Algorithm 2. Since the cover C is a disjointnode path cover, the correctness of the algorithm is immediate as well as its complexity of \(\Theta (m)\).
Figure 3 illustrates the same example of Fig. 2 but with endpoint weight graphs. In Fig. 3b we would obtain a cover \(C=\{(+1 \rightarrow 3 \rightarrow +4 \rightarrow +2)\}\) formed by a single path. In this case the permutation \(\pi \), following Algorithm 2, would be \(\pi [1]=+1\), \(\pi [3]=2\), \(\pi [4]=+3\), and \(\pi [2]=+4\). This is indeed the same permutation discussed in Fig. 2c. Another example: for the graph in Fig. 3c, the cover would be \(C=\{(+2 \rightarrow 3 \rightarrow +4),(+1)\}\) and the permutation \(\pi \) would be \(\pi [2]=+1\), \(\pi [3]=2\), \(\pi [4]=+3\), and \(\pi [1]=+4\).
Computing a minimumsize path cover
We showed that changing the order and orientation of the strings in \(\mathcal {S}\) can reduce the number of runs in the weights. The question we now ask is: by how much? We are interested in computing an optimal signed permutation \(\pi \) for \(\mathcal {S}\), i.e., a permutation \(\pi \) such that \(\pi (\mathcal {S})\) has the minimum number of runs. Since we modeled the problem of computing \(\pi \) as the problem of finding a path cover for G, we reason in terms of G and a path cover C for G.
Let \(r_i\) be the number of runs in \({S}_i\) and let R be the total number of runs, i.e., \(R=\sum _{i=1}^{m} r_i\). Let also C be the number of paths in the cover C. We observe that the final number of runs r in the permuted \(\mathcal {S}\) will be equal to \(Rm+C\). In fact, every path in C must begin (resp. end) with a node whose front side (resp. back side) cannot be glued with any other path’s side. Therefore, a new run begins with the first node of every path. Since we wish to minimize the quantity \(Rm+C\), and considering that \(Rm\) is constant for a given \(\mathcal {S}\), it follows that the problem reduces to that of minimizing C, the number of paths in the cover. In other words, the problem of minimizing the number of runs r is equivalent to that of finding a minimumcardinality path cover C for G.
We recall that the problem of computing a minimumcardinality path cover for directed graphs is NPhard. However, in this section we show that the problem can be solved optimally using linear time and space on endpoint weight graphs. Specifically, we explain the steps of the mincover algorithm (Algorithm 3) and prove its optimality.
Preliminaries and notation
Our focus is on the endpoint weights of the nodes, thus throughout “Computing a minimumsize path cover” section we will denote a node \((id ,front ,back ,sign )\) just by its weights \((front ,back )\). Without loss of generality, we assume that each \(u \in G\) is such that \(u.front \le u.back \) since we can change the orientation of u if \(u.front > u.back \) with a primitive changeorientation. In this way, two nodes (x, y) and (y, x) are considered as equal regarding their endpoints.
Let \(\mathcal {W}(G)\) be the set of the distinct endpoint weights of G and \(C^*(G)\) a minimumsize (i.e., optimal) path cover for G. Whenever clear from the context, we will avoid specifying G and refer to \(\mathcal {W}(G)\) and \(C^*(G)\) simply as \(\mathcal {W}\) and \(C^*\) respectively. Let G denote the number of nodes in G.
Definition 3
(Incidence set and weight frequency) The set \(I_w\) of nodes where w appears as endpoint is called the incidence set of w, for every \(w \in \mathcal {W}\). We call the frequency of w the number of times w appears in the nodes of \(I_w\) and indicate this quantity with \(n(I_w)\).
Example 1
Consider the graph G in Fig. 4. The graph has \(m=32\) nodes, 3 connected components, and \(\mathcal {W}=16\) distinct weights. For example, the incidence set for the weight 3 is
and \(n(I_3)=I_3=10\). Another example: \(I_8=\{(3,8),(8,8),(8,8)\}\) and \(n(I_8)=5\) but \(I_8=3\).
Graph simplification
In general, a path \(u_1 \rightarrow \cdots \rightarrow u_{\ell }\) in G can be logically replaced by the single node of endpoints \((u_1.front ,u_{\ell }.back )\). When we do so, we say that G is “simplified” to a new graph where nodes \(\{u_1,\ldots ,u_{\ell }\}\) are removed and the single node \((u_1.front ,u_{\ell }.back )\) is added. (When \(\ell =2\) and nodes u and v are merged on weight w, we refer to this operation as \(\textsc {Merge}({u,v,w})\) (see Algorithm 5). However, we remark that the simplification is only logical and not physical, i.e., the nodes removed from G are not truly discarded since each node of G must appear once in \(C^*\). We therefore assume that the new node \((u_1.front ,u_{\ell }.back )\) keeps track of its inner structure and, when visited (e.g., in the forloop in lines 5–7 of Algorithm 2), it actually visits each node in the path \(u_1 \rightarrow \cdots \rightarrow u_{\ell }\).
Lemma 1
Let \(E_{x,y}\) be the subset of all equal nodes (x, y) of G. Let \(d=E_{x,y}\). If d is even, then the d nodes can be oriented to form a maximal path of either endpoints (x, x) or (y, y). If d is odd, then the path has endpoints (x, y).
Proof
We proceed by induction on d. Base case: if \(d=1\) (odd case), then there is only the singleton path (x, y); if \(d=2\) (even case), then we can either form the path \((x,y)\rightarrow (y,x)\) of endpoints (x, x) or the path \((y,x)\rightarrow (x,y)\) of endpoints (y, y). So the base case is verified. Now we assume the Lemma holds true for a generic \(d > 2\) and we want to prove it for \(d+1\). If d is even, then \(d+1\) is odd and we can either have a path \((x,x)\rightarrow (x,y)\) or a path \((x,y)\rightarrow (y,y)\). In both cases the endpoints are (x, y). Symmetrically: if d is odd, then \(d+1\) is even and we can either have a path \((x,y)\rightarrow (y,x)\) with endpoints (x, x) or a path \((y,x)\rightarrow (x,y)\) with endpoints (y, y). \(\square \)
Using Lemma 1 we can simplify G using a routine preprocess as follows.

All nodes from the sets \(E_{x,x}\) are oriented to form a single path of endpoints (x, x). Hence all these nodes are removed from G and replaced with a new node (x, x). This new node is then merged to another node (x, y), if any, into a twonode path of endpoints (x, y).

All nodes from sets \(E_{x,y}\) are oriented to form a single path of endpoints (x, y) if \(E_{x,y}\) is odd or two paths, both of endpoints (x, y) if \(E_{x,y}\) is even. In this latter case, two equal nodes (x, y) are added to G.
After preprocess, G has the following form: there are no nodes of the form (x, x) and nodes of the form (x, y) appear at most twice. Without loss of generality, we are going to assume that G has this form from now on.
After preprocess, two sets are built (lines 3–6 of Algorithm 3), I and U, that are manipulated by the subsequent steps mergeeven and greedycover. The set I is the collection of all incidence sets \(I_w\), \(\forall \, w \in \mathcal {W}\). Note that, since nodes (w, w) are not present in G after preprocess, \(n(I_w)=I_w\) for any w. The set U represents the set of all nodes that are still to be added by the algorithm to some path of \(C^*\).
Whenever a node \(u = (x,y)\) is to be inserted in G, u is actually added to U, \(I_x\), and \(I_y\). This steps are illustrated in \(\textsc {Insert} ({u,I,U})\) (Algorithm 5). Symmetrically, we use \(\textsc {Erase} ({u,I,U})\) to erase the node \(u = (x,y)\) from U, \(I_x\), and \(I_y\).
Example 2
Figure 5 shows the simplification steps for the graph G from Fig. 4. Let \(G'\) be the simplified graph. In the example graph G we have: \(E_{1,3}=4\), hence \(G'\) has two nodes of endpoint (1, 3); also \(E_{1,5}=4\), hence \(G'\) has two nodes of endpoint (1, 5); \(E_{3,7}=3\), hence \(G'\) has a single node (3, 7); \(E_{11,12}=E_{21,43}=E_{17,43}=E_{3,6}=2\), hence \(G'\) has all the nodes in these sets.
Oddfrequency endpoints
Let us first consider the oddfrequency endpoints, i.e., those endpoints w such that \(I_w\) is odd.
Lemma 2
If \(I_w\) is odd then w appears as endpoint of some path in \(C^*\).
Proof
Since each node has two matching sides, one node u in \(I_w\) will remain unmatched. Hence u will be an endpoint of some path in \(C^*\). \(\square \)
Let \(\mathcal {W}\) be defined after preprocess of G. We partition \(\mathcal {W}\) into two sets, \(\mathcal {W}_{odd}\) and \(\mathcal {W}_{even}\): if \(I_w\) is odd, then \(w \in \mathcal {W}_{odd}\); otherwise, \(w \in \mathcal {W}_{even}\).
Lemma 3
\(\mathcal {W}_{odd}\) is even.
Proof
Observe that \(\sum _{w \in \mathcal {W}} I_w\) is even and equal to 2G because each node has two endpoints. Since \(\mathcal {W} = \mathcal {W}_{odd} \cup \mathcal {W}_{even}\) and \(\mathcal {W}_{odd} \cap \mathcal {W}_{even} = \varnothing \), the above sum can be rewritten as \( \sum _{w \in \mathcal {W}} I_w = \sum _{w \in \mathcal {W}_{odd}} I_w + \sum _{w \in \mathcal {W}_{even}} I_w = 2G. \) It follows that also \( \sum _{w \in \mathcal {W}_{odd}} I_w = 2G  \sum _{w \in \mathcal {W}_{even}} I_w \) must be even since it is obtained by difference of even quantities. Since each term in the sum \(\sum _{w \in \mathcal {W}_{odd}} I_w\) is odd by definition, the whole sum is even if and only if \(\mathcal {W}_{odd}\) is even, as the sum of an odd number of odd numbers is odd. \(\square \)
Lemma 4
If \(\mathcal {W}_{even} = \varnothing \) then \(C^* = \mathcal {W}_{odd}/2\) and the greedycover algorithm (Algorithm 4) is optimal.
Proof
For Lemma 2, any \(w \in \mathcal {W}_{odd}\) will appear as endpoint of some path in \(C^*\). Since each path must necessarily have two endpoints, then \(C^* = \mathcal {W}_{odd}/2\).
Consider now the greedycover algorithm (Algorithm 4). We want to show it computes a solution C with exactly \(\mathcal {W}_{odd}/2\) paths. At the beginning of each iteration of the main whileloop (lines 3–12), the algorithm takes an unvisited node from U (line 4) and begins a new path P from there. Nodes are appended to either the front or the back of P, for as much as possible (inner whileloop in the lines 6–11). For each node \(u=(x,y)\) appended to P (see Fig. 6), it is also removed from U, \(I_x\), and \(I_y\) with \(\textsc {Erase} ({u,I,U})\). Suppose now P cannot be extended any further. Then P must end with two weights that cannot appear as endpoints of any other path since P is of maximal length. Hence, each time the inner whileloop (lines 6–11) ends, we have two weights less in \(\mathcal {W}_{odd}\) that can appear as endpoints. It follows that greedycover finds a solution with exactly \(\mathcal {W}_{odd}/2\) paths. \(\square \)
Evenfrequency endpoints
Now note that the number of paths in \(C^*\) is surely at least the number of connected components of G since there must be at least one path for each connected component. Let \(\mathcal {C}\) be the set of connected components of G. The size of an optimal path cover for G is therefore \(C^* = \sum _{c \in \mathcal {C}} C^*(c)\) where \(C^*(c)\) is a minimumsize path cover for the connected component c.
Lemma 5
Let c be a connected component of G. At the end of the mergeeven algorithm (Algorithm 6), c is simplified to a graph where: either there is a single node, or \(I_w = \varnothing \) \(\forall \, w \in \mathcal {W}_{even}(c)\).
Proof
We show that the mergeeven algorithm maintains the following loop invariant: if \(I_w \ne \varnothing \) and \(w \in \mathcal {W}_{even}(c)\), there are at least two nodes u and v in \(I_w\); otherwise \(c=1\).
The invariant is true at the beginning of the algorithm before the first iteration, because: (1) either \(\mathcal {W}_{even}(c) = \varnothing \), or (2) \(I_w \ge 2\) for each \(w \in \mathcal {W}_{even}(c)\) given that there are no nodes (w, w) in c, or (3) \(c=1\).
Assume the invariant is true at the beginning of each iteration of the whileloop. We show it remains true after the iteration. At each iteration of the whileloop, let \(w^*\) be a weight of minimum even frequency (line 3). Two nodes are removed from \(I_{w^*}\) and merged into a parent node p with \(\textsc {Merge} ({u,v,w^*})\). One of the following two cases can happen.

The nodes u and v are such that \(u=(x,w^*)\) and \(v=(w^*,y)\), and p is therefore of endpoints (x, y). Suppose either \(I_x\) or \(I_y\) is even (or both are). Without loss of generality, assume \(I_x\) is even. Then it must be \(I_x \ge I_{w^*} \ge 2\) because \(w^*\) has minimum frequency, thus there are at least two nodes in \(I_x\) if \(c>1\) after merge, and the invariant is preserved (or p is the only node left in c and the invariant still holds). Otherwise, both \(I_x\) and \(I_y\) are odd. In this case, since both \(I_x\) and \(I_y\) remain odd after merge, nor x nor y will be considered by the algorithm at line 3.

The nodes u and v are such that \(u=(x,w^*)\) and \(v=(w^*,x)\), and p is therefore of endpoints (x, x). It could be that p is the only node where x appears, i.e., \(I_x=\{(x,x)\}\), and the invariant would be violated unless \(c=1\) after merge. However, the algorithm again merges p with any other node from \(I_x\) (lines 10–14) if \(I_x \ne \varnothing \) at line 10. Note that it must be that \(I_x \ne \varnothing \) at line 10 if \(I_x\) is odd, hence the second merge at line 13 is always possible. We now show that, if \(I_x = \varnothing \) instead at line 10 then it must also be that \(c=0\) and (x, x) will be the only node in c after the iteration, hence preserving the invariant. Assume by absurd that this is not the case, i.e., \(I_x=\varnothing \) but \(c \ne 0\) at line 10. Since \(c \ne 0\), there must be other nodes in \(I_{w^*}\) otherwise c cannot be a connected component at the beginning of the algorithm, i.e., \(I_{w^*}>2\). But this would imply that \(I_x<I_{w^*}\) because \(I_x=2\) before merge, which contradicts the hypothesis that \(w^*\) is a minimum even frequency weight. The algorithm therefore maintains the invariant that there are no nodes (x, x) in \(I_x\), thus \(I_x \ge 2\) unless \(c=1\).
In all cases, it is easy to see that at each iteration: the parity of \(I_x\) and \(I_y\) do not change, \(I_{w^*}\) decreases by 2, and the number of nodes in c decreases by 1 or 2. Summing up, merging on a weight of minimum even frequency preserves the loop invariant.
At the end of the algorithm, since at every iteration the even frequency of a weight is decreased by 2, either \(I_w = \varnothing \, \forall \, w \in \mathcal {W}_{even}(c)\) or \(c=1\) (or both). Note that if \(\mathcal {W}_{odd}(c) = \varnothing \) at the beginning of the algorithm then the loop invariant guarantees that \(c=1\) at the end of the algorithm. \(\square \)
Example 3
Consider Fig. 7 showing the merge operations performed by algorithm mergeeven. In the input graph, there are 8 weights whose frequency is even: 4, 5, 6, 8, 10, 17, 21, 43. After the execution of mergeeven, each of the three connected components of the graph has no node whose weights have even frequency, except for the one with the green background whose nodes all have weights with even frequency, according to Lemma 5. Indeed, note that the green component is simplified to a single node. Lastly, Fig. 8 shows the execution of the algorithm greedycover after the action of mergeeven. In this case, the final (optimal) path cover has 5 paths of endpoints: (3, 7), (1, 2), (9, 12), (11, 13), and (43, 43).
The final algorithm
We can now state and prove the optimality of the mincover algorithm.
Theorem 1
Let \(\mathcal {C}_{even}\) be the set of connected components of G whose nodes only have endpoints of even frequency. Then \(C^* = \mathcal {C}_{even} + \mathcal {W}_{odd}/2\) and the mincover algorithm (Algorithm 3) is optimal.
Proof
Consider the mincover algorithm (Algorithm 3). Since any pair of connected components have disjoint set of nodes by definition, \(C^* = \sum _{c \in \mathcal {C}} C^*(c) = \sum _{c \in \mathcal {C}_{even}} C^*(c) + \sum _{c \notin \mathcal {C}_{even}} C^*(c)\). If \(c \in \mathcal {C}_{even}\) then \(\mathcal {W}_{odd}(c)=\varnothing \) and \(C^*(c)=1\) for Lemma 5 since c will be simplified to one single node by mergeeven. Hence, \(\sum _{c \in \mathcal {C}_{even}} C^*(c) = \mathcal {C}_{even}\). If \(c \notin \mathcal {C}_{even}\) then after mergeeven there will only be nodes with odd weights in c. Thus, for Lemma 4, \(C^*(c)=\mathcal {W}_{odd}(c)/2\) and \(\sum _{c \notin \mathcal {C}_{even}} C^*(c) = \mathcal {W}_{odd}/2\). In conclusion, the mincover algorithm is optimal. \(\square \)
Time and space complexity
If we use hash tables to implement the sets I and U, then the operations take, insert, and erase, are all supported in O(1) on average. Also append can be performed in constant amortized time using a doubleended queue to represent the path P. The merge function in Algorithm 5 obviously takes constant time.
With these remarks in mind, it follows that the mincover algorithm runs in \(\Theta (m)\) amortized time, where \(m=G\) is the number of nodes in the input graph G.
In fact, the preprocess routine can be implemented by sorting the m nodes in O(m) time with radix sort, and scanning the nodes in sorted order. The greedycover algorithm also takes O(m) time because each node is visited and appended to a path exactly once. The complexity of the mergeeven algorithm (Algorithm 6) critically depends, instead, on the complexity of the step at line 3—the identification of the weight of minimum even frequency. In “Reporting the minimum weight in constant time” section we will show how to perform this operation in O(1) time. Hence, since the number of nodes in G reduces by 1 or 2 at each iteration of the whileloop and all operations in the body of the loop take constant time, the complexity of mergeeven is O(m).
Lastly, the algorithm also consumes \(\Theta (m)\) space because: (1) at most 2m nodes (and at least m) are inserted in I and exactly m in U during the initialization (lines 5–6 of mincover); (2) the mergeeven algorithm creates at most 2m new nodes.
Reporting the minimum weight in constant time
In this section we solve the following problem. We have an array A of \((key ,value )\) pairs, initially sorted by \(value \). Keys are all distinct and values are even integers larger than or equal to 2. We want to answer minkey queries over A, i.e., to report the key of minimum value, under the condition that the value of a key x can be decreased by 2 with a decreasevalue(x) operation.
Since we use a solution to this problem to implement the reporting of the weight of minimum even frequency in the mergeeven algorithm (line 3 of Algorithm 6), we work under the assumption that decreasevalue(x) cannot be called if the value associated to the key x is already 0. The minimum values in A can therefore be 0 but we want to report the key with minimum value larger than 0.
Let K and V be the set of keys and values of A respectively. We use two linearspace maps (implemented as hash tables), \(R: V \rightarrow \{1,\ldots ,A\}^2\) and \(P: K \rightarrow \{1,\ldots ,A\}\). Before answering any query on A, R and P are initialized such that \(R[v] = [begin ,end )\) indicates that the values of the elements \(A[begin ,end )\) are all equal to v, and such that \(P[x]=i\) indicates that \(A[i].key =x\). Also we add the special value 0 to R and let \(R[0] = [1,1)\) at the beginning.
We want to show that the algorithm decreasevalue (Algorithm 8) maintains the following invariants: (1) A is sorted by value; (2) all the values of the elements \(A[begin ,end )\) are equal to v if \(R[v] = [begin ,end )\) for each \(v \in V \cup \{0\}\); (3) if \(P[x]=i\) then \(A[i].key =x\) for each key \(x \in K\). If these invariants hold true after each call of decreasevalue, then \(R[0].end \) is the position of the key of minimum value larger than 0 and minkey can simply return \(A[R[0].end ].key \) in O(1).
The invariants are all trivially satisfied by construction before answering any query: A is sorted and \(R[0].end =1\), hence \(A[1].key \) is the key of minimum value.
Suppose now that the invariants are true for some \(1 < R[0].end \le A\). We want to show that they are also true after decreasevalue(x). The position j of x is determined at line 2 and its value v at line 3. By assumption, \(v \ge 2\). Then the algorithm identifies the beginning of the range of values equal to v in A, i.e., position i. It then swaps A[i] with A[j] and update P consequently so that, at line 9, i indicates the position of the key x whose value v has to be decreased by 2. Hence, lines 29 maintain the invariant that A is sorted by value as well as the invariants on R and P. Before decreasing the value \(A[i].value \) by 2 at line 14, the algorithm adjusts R[v] and \(R[v2]\). Since the value v is going to be decreased by 2, \(R[v2].end \) has to be increased by 1 and, symmetrically, \(R[v].begin \) has to be decreased by 1. Whenever we decrease a value, it is possible that \(v2\) does not yet exist in R. If that is the case, then \(R[v2]\) is initialized with \([i,i+1)\) at line 11, which is correct because i is the position of the key x whose value has to be decreased. Whenever we increase the beginning of a range instead, it is possible that the range is exhausted, i.e., \(R[v].begin = R[v].end \). If that is the case, it means that there is only one key with value v and, since v is going to be decrease, v is correctly erased from R. Figure 9 shows an example for an array A of 8 pairs and how the invariants on A, P, and R are preserved after decreasevalue(e).
In conclusion, all the invariants are preserved after every call of decreasevalue. Therefore—if all values are decreased to 0—then there will only be the value 0 in R and \(R[0] = [1,A+1)\). Note that each step of decreasevalue takes O(1) time since R and P are implemented with hash tables, hence the overall time of decreasevalue is O(1).
Let M be the data structure holding the array A, the maps R and P, and exposing the functions build, hasnext, minkey, and decreasevalue. Algorithm 9 illustrates how to use the data structure to implement the mergeeven algorithm previously described as Algorithm 6. The data structure is built from the nodes in U by letting the weights be the keys in A and the frequencies of the weights be the values in A. Since there are at most \(m=G\) nodes in U, the initialization of A, R, and P takes O(m) time and the whole data structure consumes O(m) space.
Note A solution to the problem of maintaining a sorted list of integers subject to increments/decrements was also given by Knuth [42] to implement the adaptive Huffman coding algorithm, but using a different combination of elementary data structures. The presentation given here is specifically suited for Algorithm 6 (see also Algorithm 9).
Experiments
In this section we evaluate the proposed weight compression scheme for SSHash and compare it to several competitive baselines. We first describe our experimental setup. Experiments were run using a server machine equipped with a Intel i99900K CPU (clocked at 3.60 GHz) and 64 GB of RAM. All the tested software was compiled with gcc 11.2.0 under Ubuntu 19.10 (Linux kernel 5.3.0, 64 bits), using the flags O3 and march=native. Our implementation of SSHash is written in C++17 and available at https://github.com/jermp/sshash.
All timings were collected using a single core of the processor. The dictionaries are loaded in internal memory before executing queries. For all the experiments, we fix k to 31.
Datasets
We use the following genomic collections: EColi and CElegans are, respectively, the full genomes of E. Coli (Sakai strain) and C. Elegans that were also used in the experimentation by Shibuya et al. [27]; SEnterica100 is a pangenome of 100 genomes of S. Enterica, collected by Rossi et al. [43]; HumanChr13 is the 13th human chromosome from the genome assembly GRCh38. Table 1 reports some basic statistics for the collections. The weights were collected using the tool BCALM (v2) [44] without any filtering (option allabundancecounts). In general, note the very low empirical entropy of the weights, \(H_0(W)\). This is expected since most \(k\)mers actually appear once for largeenough values of k. Instead, the weights on the pangenome SEnterica100 have much higher entropy due to the fact that many \(k\)mers have weight equal to the number of genomes in the collection (in this specific case, equal to 100). This is useful to test the effectiveness of our encoding on both low and high entropy inputs.
All datasets, in raw and preprocessed form, are available on Zenodo at https://zenodo.org/record/7772316.
Weight compression in SSHash
We now consider the space effectiveness of the encoding scheme described in “Representing runs of weights” section. Table 2 reports the space as average bits per \(k\)mer: we see that, in all cases, the space is well below the empirical entropy lower bound \(H_0(W)\)—usually below by several times. The optimization strategy described in “The problem of reducing the number of runs” section brings further advantage. (The space shown is comprehensive of the \(\mathcal {D}\lceil \log _2max \rceil \) bits used to represent the distinct weights in the collection. Note that this space takes a negligible fraction of the total space since \(\mathcal {D}\) is very small as reported in Table 1).
Table 3, instead, shows the performance of the path cover Algorithm 3. As already mentioned in “Representing runs of weights” section, the set of strings indexed by SSHash is obtained by building a spectrumpreserving string set (SPSS) from the raw genome, using the algorithm UST [30] over the output of BCALM [44]. (At the code repository https://github.com/jermp/sshash we provide further details on how to take these preliminary steps before indexing with SSHash). The number of strings in each collection, m, determines the runtime of Algorithm 3 whose complexity is \(\Theta (m)\). The lineartime complexity is evident from the reported timings and makes the algorithm very fast, taking \(\approx \)5080 nanoseconds per node.
In Appendix we report additional experimental results.
Overall comparison
In Table 4 we show a comparison between the following weighted \(k\)mer dictionaries:

The dBGFM index [45]^{Footnote 2} based on the popular FMindex [22]. In particular, this representation implements a weighted \(k\)mer dictionary via the count query which returns the number of occurrences of a given \(k\)mer in the input. The count query, in turn, is implemented using rank queries over the BWT. The dBGFM implementation has a main tradeoff parameter, s, to control the practical performance of rank queries. We test the values \(s=32,64,128\).

The cwdBG [16]^{Footnote 3} dictionary based on the data structure called BOSS [24]. Similarly to an FMindex, also cwdBG has a tradeoff parameter that we vary as \(s=32,64,128\). (The authors used \(s=64\) in their own experiments).

The nonweighted SSHash itself coupled with the fast compressed static function (CSF) tailored for lowentropy distributions, proposed by Shibuya et al. [27].^{Footnote 4} As reviewed in “Related work” section, a CSF does not represent the \(k\)mers but just realizes a map from \(k\)mers to their weights. Such map is collisionfree only over the set of \(k\)mers that was used to actually build the function. Therefore, we use SSHash as an efficient dictionary for the \(k\)mers and the CSF to represent the weights. The authors proposed two different versions of their approach, BCSF and AMB, with different space/time tradeoffs.

The weighted SSHash dictionary proposed in this work, which we refer to as wSSHash in the following, after the runreduction optimization (Tables 2 and 3). We use the regular index variant of SSHash. The main parameter of the index—the minimizer length—is always set to \(\lceil \log _4 N \rceil + 1\) where N is the number of nucleotides in the SPSSs of the datasets, following the recommendation given in the previous paper [18]. Therefore, we use the following minimizer lengths: 13, 14, 15, and 15, for respectively, EColi, SEnterica100, HumanChr13, and CElegans. Also the AMB algorithm by Shibuya et al. [27] is based on minimizers and we use the same lengths.
We did not compare against deBGR [17] and Squeakr [13] as the authors of cwdBG showed in their experimentation [16] that both tools take considerably more space than cwdBG, e.g., one order of magnitude more space. Here, we are interested in a good balance between space effectiveness and query efficiency. The same consideration applies to the popular \(k\)mer counting tool KMC3 [12] which stores both \(k\)mers and their abundances in a hash table without compression.
To measure querytime—the time it takes to retrieve the weight w(g) given the \(k\)mer g—we sampled \(10^6\) \(k\)mers uniformly at random from the collections and use them as queries. We report the mean between 5 measurements. Half of the queries were transformed into their reverse complements to make sure we benchmark the dictionaries in the most general case.
The space of wSSHash is generally competitive with that of the fastest variant of dBGFM (\(s=32\)), but wSSHash has (more than) one order of magnitude better query time. Note that on SEnterica100 the dBGFM index is spaceinefficient since it redundantly represents many repeated \(k\)mers. Using a higher sampling rate reduces the space of dBGFM at the price of slowing down querytime; however, the most spaceefficient variant tested (\(s=128\)) is not even \(2\times \) smaller than wSSHash.
The cwdBG index is the smallest tested dictionary. Its space effectiveness is comparable to that of dBGFM \(s=128\), and indeed generally twice as better as that of wSSHash. The price to pay for this enhanced compression ratio is a significant penalty at querytime. Indeed, wSSHash can be two order of magnitude faster than cwdBG. Consider, for example, the two dictionaries built for SEnterica100: we have 0.5 vs. \(\approx \)60–110 μs per query.
The two CSFs, BCSF and AMB, make SSHash consistently slower and larger than wSSHash This comparison motivates the need for a unified data structure to handle efficiently both the \(k\)mers and the weights, like wSSHash. While the increase in space due to the CSF is not much for the lowentropy datasets because both BCSF and AMB are very spaceefficient in those cases, the gap is more evident on SEnterica100.
As a last note, observe that there is no significant slowdown in accessing the weights in wSSHash compared to a simpler membership query (the time reported in shaded color in Table 4), hence proving the RLEbased scheme to be efficient too and not only very effective.
Conclusions
In this work we extended the recent SSHash [18] dictionary to also store the weights of the \(k\)mers in compressed format. In particular, we represented the weights using compressed runs of equal symbols. While using runlength encoding to compress highly repetitive sequences is not novel per se and indeed a folklore strategy at the basis of many other data structures, this allows to use a very small extra space (e.g., much less than the empirical entropy of the weights) on top of SSHash with only a slight penalty at retrieval time. The crucial point is that it is possible to use runlength encoding because SSHash preserves the (relative) order of the \(k\)mers in the indexed sequences. The main practical takeaway is, therefore, that SSHash handles weighted \(k\)mer sets in an exact manner without noticeable extra costs. Our software is publicly available to encourage its use and reproducibility of results.
We also introduced the concept of endpoint weight graph and showed its usefulness in reducing the number of runs in the weights. Precisely, we showed that minimizing the number of runs in a collection of sequences corresponds to the problem of computing a minimumcardinality path cover for the endpoint weight graph of the sequences. We presented an optimal algorithm that computes a cover of minimum size in lineartime (in the number of nodes of the graph). As a result of this optimization, the space spent to represent the weights is unlikely to be improved using runlength encoding.
Although several approaches in the literature [13, 26,27,28] also consider approximate weights, we did not pursue this direction here as the weights are already encoded spaceefficiently in SSHash and in an exact way, so there may be no need for approximation.
The distribution of weights in large collections is usually expected to be very skew, i.e., most \(k\)mers actually appear once and few of them repeat many times [26, 27]. A common strategy to save space is then to avoid the representation of the most frequent weight(s). Note that, since we represent runs of weights and not the individual weights, we are already optimizing (potentially very large) subsets of weights equal to the most frequent one. That is, runlength encoding is also a good match for such skew distributions.
Availability of data and materials
The SSHash software is available on GitHub at https://github.com/jermp/sshash. The datasets used in this article are available on Zenodo at https://zenodo.org/record/7772316.
Notes
EliasFano represents a monotone integer sequence S[1..n] with \(S[n] \le U\) in at most \(n\lceil \log _2(U/n)\rceil +2n\) bits. With o(n) extra bits it is possible to decode any S[i] in constant time and support predecessor queries in \(O(\log (U/n))\) time. For a complete description of the method, we point the reader to the survey by Pibiri and Venturini [33, Section 3.4]. We also remark that EliasFano has been recently used in many compressed, practical, data structures, e.g., inverted indexes [33,34,35,36,37], tries [38,39,40], and fulltext indexes [41].
References
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. Spades: a new genome assembly algorithm and its applications to singlecell sequencing. J Comput Biol. 2012;19(5):455–77.
Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, et al. Abyss 2.0: resourceefficient assembly of large genomes using a bloom filter. Genome Res. 2017;27(5):768–77.
Khorsand P, Hormozdiari F. Nebula: ultraefficient mappingfree structural variant genotyper. Nucl Acids Res. 2021;49(8):47–47.
Standage DS, Brown CT, Hormozdiari F. Kevlar: a mappingfree framework for accurate discovery of de novo variants. Iscience. 2019;18:28–36.
Baier U, Beller T, Ohlebusch E. Graphical pangenome analysis with compressed suffix trees and the BurrowsWheeler transform. Bioinformatics. 2016;32(4):497–504.
Marcus S, Lee H, Schatz MC. Splitmem: a graphical algorithm for pangenome analysis with suffix skips. Bioinformatics. 2014;30(24):3476–83.
Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3):1–12.
Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20(18):3363–9.
Sahlin K. Effective sequence similarity detection with strobemers. Genome Res. 2021;31(11):2080–94.
Sahlin K. Strobemers: an alternative to kmers for sequence comparison. bioRxiv (2021).
Deorowicz S, Kokot M, Grabowski S, DebudajGrabysz A. Kmc 2: fast and resourcefrugal kmer counting. Bioinformatics. 2015;31(10):1569–76.
Kokot M, Długosz M, Deorowicz S. Kmc 3: counting and manipulating kmer statistics. Bioinformatics. 2017;33(17):2759–61.
Pandey P, Bender MA, Johnson R, Patro R. Squeakr: an exact and approximate kmer counting system. Bioinformatics. 2018;34(4):568–75.
Marçais G, Kingsford C. A fast, lockfree approach for efficient parallel counting of occurrences of kmers. Bioinformatics. 2011;27(6):764–70.
Rizk G, Lavenier D, Chikhi R. DSK: kmer counting with very low memory usage. Bioinformatics. 2013;29(5):652–3.
Italiano G, Prezza N, Sinaimeri B, Venturini R. Compressed weighted de Bruijn graphs. In: 32nd annual symposium on combinatorial pattern matching (CPM 2021), vol. 191. 2021. p. 16–11616. https://github.com/nicolaprezza/cwdBg.
Pandey P, Bender MA, Johnson R, Patro R. deBGR: an efficient and nearexact representation of the weighted de Bruijn graph. Bioinformatics. 2017;33(14):133–41.
Pibiri GE. Sparse and skew hashing of kmers. Bioinformatics. 2022;38(Supplement_1):185–94.
Pibiri GE, Trani R. Parallel and externalmemory construction of minimal perfect hash functions with PTHash. CoRR arXiv:2106.02350 (2021)
Pibiri GE, Trani R. PTHash: revisiting FCH minimal perfect hashing. In: SIGIR ’21: the 44th international ACM SIGIR conference on research and development in information retrieval, virtual event, Canada, July 11–15, 2021. 2021. p. 1339–48.
Fan J, Khan J, Pibiri GE, Patro R. Spectrum preserving tilings enable sparse and modular reference indexing. In: Research in computational molecular biology. 2023. p. 21–40.
Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proceedings 41st annual symposium on foundations of computer science. New York: IEEE; 2000. p. 390–8.
Burrows M, Wheeler D. A blocksorting lossless data compression algorithm. In: Digital SRC research report. Citeseer; 1994.
Bowe A, Onodera T, Sadakane K, Shibuya T. Succinct de Bruijn graphs. In: International workshop on algorithms in bioinformatics (WABI). Berlin: Springer; 2012. p. 225–35.
Pandey P, Bender M.A, Johnson R, Patro R. A generalpurpose counting filter: making every bit count. In: Proceedings of the 2017 ACM international conference on management of data. 2017. p. 775–87.
Shibuya Y, Belazzougui D, Kucherov G. Setmin sketch: a probabilistic map for powerlaw distributions with application to kmer annotation. J Comput Biol. 2022;29(2):140–54.
Shibuya Y, Belazzougui D, Kucherov G. Spaceefficient representation of genomic kmer count tables. Algorithms Mol Biol. 2022;17(1):1–15.
Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. Reindeer: efficient indexing of kmer presence and abundance in sequencing datasets. Bioinformatics. 2020;36(Supplement_1):177–85.
Karasikov M, Mustafa H, Rätsch G, Kahles A. Lossless indexing with counting de Bruijn graphs. bioRxiv (2021)
Rahman A, Medvedev P. Representation of \(k\)mer sets using spectrumpreserving string sets. In: International conference on research in computational molecular biology. Berlin: Springer; 2020. p. 152–68. https://github.com/medvedevgroup/UST.
Elias P. Efficient storage and retrieval by content and address of static files. J ACM. 1974;21(2):246–60.
Fano RM. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT. 1971.
Pibiri GE, Venturini R. Techniques for inverted index compression. ACM Comput Surv. 2021;53(6):125–112536.
Ottaviano G, Venturini R. Partitioned EliasFano indexes. In: Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval. 2014. p. 273–82.
Pibiri GE, Venturini R. Clustered EliasFano indexes. ACM Trans Inf Syst. 2017;36(1):2–1233.
Pibiri GE, Venturini R. On optimally partitioning variablebyte codes. IEEE Trans Knowl Data Eng. 2020;32(9):1812–23.
Vigna S. Quasisuccinct indices. In: Proceedings of the sixth ACM international conference on web search and data mining. 2013. p. 83–92.
Perego R, Pibiri GE, Venturini R. Compressed indexes for fast search of semantic data. IEEE Trans Knowl Data Eng. 2021;33(9):3187–98.
Pibiri GE, Venturini R. Efficient data structures for massive ngram datasets. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval. 2017. p. 615–24.
Pibiri GE, Venturini R. Handling massive Ngram datasets efficiently. ACM Trans Inf Syst. 2019;37(2):25–12541.
Ma D, Puglisi SJ, Raman R, Zhukova B. On eliasfano for rank queries in fmindexes. In: 2021 data compression conference (DCC). New York: IEEE; 2021. p. 223–32.
Knuth DE. Dynamic Huffman coding. J Algorithms. 1985;6(2):163–80.
Rossi M, Silva MSD, RibeiroGonçalves BF, Silva DN, Machado MP, Oleastro M, Borges V, Isidro J, Viera L, Halkilahti J, Jaakkonen A, Palma F, Salmenlinna S, Hakkinen M, Garaizar J, Bikandi J, Hilbert F, Carriço JA. INNUENDO whole genome and core genome MLST schemas and datasets for Salmonella enterica. 2018.
Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016;32(12):201–8.
Chikhi R, Limasset A, Jackman S, Simpson JT, Medvedev P. On the representation of de Bruijn graphs. In: International conference on research in computational molecular biology. Berlin: Springer; 2014. p. 35–55. https://github.com/jts/dbgfm.
Almodaresi F, Sarkar H, Srivastava A, Patro R. A space and timeefficient index for the compacted colored de Bruijn graph. Bioinformatics. 2018;34(13):169–77.
Acknowledgements
The author wishes to thank an anonymous reviewer of this work for pointing out the connection between the data structure described in “Reporting the minimum weight in constant time” section and the Knuth’s algorithm for adaptive Huffman coding.
Funding
Funding for this research has been partially provided by the European Union’s Horizon Europe research and innovation programme (EFRA project, Grant Agreement Number 101093026).
Author information
Authors and Affiliations
Contributions
Not applicable given that G. E. P. is the only author of the manuscript. The author read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The author declares no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Additional experimental results
Appendix: Additional experimental results
In Tables 5 and 6 we report the performance of the mincover algorithm and of wSSHash on four additional, larger, collections that we also used in our previous work [18], namely the full genomes of G. morhua (Cod), F. tinnunculus (Kestrel), and H. sapiens (Human), and a collection of more than 8000 bacterial genomes (Bacterial) [46]. Precisely, the results in Table 6 are for regular wSSHash dictionaries with minimizer lengths equal to 17, 17, 20, and 20, for respectively, Cod, Kestrel, Human, and Bacterial.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Pibiri, G.E. On weighted kmer dictionaries. Algorithms Mol Biol 18, 3 (2023). https://doi.org/10.1186/s13015023002262
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13015023002262