Main algorithm
Our starting point is a set of canonical \(k\)-mers K, the graph cdBG(K), and a vertex-disjoint normalized path cover \(\Psi\) of cdBG(K) returned by UST.Footnote 1 To develop the intuition for our algorithm, we first consider a simple example (Fig. 1A). In this example, we see a vertex-disjoint path cover \(\Psi\) composed of two paths, \(\psi ^p\) and \(\psi ^c\). There is an edge between an internal vertex (=unitigFootnote 2) \(u^p\) of \(\psi ^p\) and the initiator vertex \(u^c\) of \(\psi ^c\). Such an edge is an example of an absorption edge. ESS-Compress constructs an enriched string representation of K, as shown in the figure. The basic idea is that \(u^p\) and \(u^c\) share a common \((k-1)\)-mer (i.e. GT). We can cut out this common portion from the string representing \(u^c\) and replace it with a special marker character “+”. We can then include \(u^c\) inside of the representation of \(u^p\) by surrounding \(u^c\) with brackets. The marker character \(``+''\) is a placeholder for the \(k-1\) nucleotides right before the opening bracket. To decompress the enriched string, we first replace the marker to get TCGT[GTAA]T and then cleave out the bracketed string to get \(\{TCGTT, GTAA\}\). This exactly recovers the SPSS representation of \(\psi ^p\) and \(\psi ^c\).
Formally, we say that an edge in cdBG(K) is an absorption edge iff (1) it connects two unitigs \(u^p\) and \(u^c\), on two distinct paths \(\psi ^p\) and \(\psi ^c\), respectively, (2) \(u^p\) is an internal vertex, and (3) \(u^c\) is an initiator vertex. We refer to \(u^p\) and \(\psi ^p\) as parents and \(u^c\) and \(\psi ^c\) as children; we also say that \(\psi ^p\) and \(u^p\) absorb \(\psi ^c\) and \(u^c\).Footnote 3
Figure 1B–D shows the other cases, corresponding to the possible orientation of the absorption edge. The logic is the same, but we need to introduce a second marker character \(``-''\) that is a placeholder for the reverse complement of the last \(k-1\) characters right before the opening bracket. In each of these cases, we add 3 extra characters (two brackets and one marker) and remove \(k - 1\) nucleotide characters. Note that, expanding the alphabet has its inherent cost, but even after taking that into account, we get lower number of characters than SPSS representation when \(k > 4\).
Next, observe that a single parent path can absorb multiple children paths, as illustrated in Fig. 2A. Also, observe that a single parent unitig can absorb more than one child path, as shown in Fig. 2B. As in the previous example, we save \(k - 1 - 3 = k -4\) characters for every absorbed edge.
These absorptions can be recursively combined, as shown in Fig. 2C. Because we require a parent unitig to be an internal vertex and a child unitig to be an initiator vertex, the same unitig cannot be both parent and child. Therefore, ESS-Compress can construct a representation recursively, without any conflicts. The recursion tree is reflected in the nesting structure of the brackets in the enriched string.
However, we must be careful to avoid cycles in the recursion. We define the absorption digraph \(D_A\) as the directed graph whose vertex set is the set of paths \(\Psi\) and an edge \((\psi ^p \rightarrow \psi ^c)\) if \(\psi ^p\) absorbs \(\psi ^c\). For every edge in \(D_A\), we also associate the corresponding bidirected edge between \(u^p\) and \(u^c\) in cdBG(K). We would like to select a subset of edges F along which to perform absorptions, so as to avoid cycles in \(D_A\) and to make sure a path cannot be absorbed by more than one other path. We would also try to choose as many edges as possible, since each absorption saves \(k-4\) characters. To achieve these goals, ESS-Compress defines F as a spanning out-forest in \(D_A\) with the maximum number of edges. We postpone the algorithm to find F to Sect. "Algorithm to choose absorption edges".
The high-level pseudo-code of ESS-Compress is shown in Algorithm 1 and illustrated in Fig. 3. The recursive algorithm to create the enriched representation using F as a guide is shown in Algorithm 2. It follows the intuition we just developed. It starts from the paths that will not be absorbed (i.e. the roots in F). For a path \(\psi ^p\), it first computes the enriched representations of all the child paths (Lines 3–9). It then integrates them into the appropriate locations of \(spell(\psi ^p)\) (Lines 10–14). It then uses a marker to replace the redundant sequence in the spelling of \(\psi ^p\), with respect to \(\psi ^p\)’s own parent (Lines 17–31). To decide which marker to use, it receives as a parameter the absorption edge \(e_D\) that was used to absorb \(\psi ^p\).
Decompression is done by a recursive algorithm DEC that takes as input an enriched string x and a \((k-1)\)-mer called markerReplacement. Initially, DEC is called independently on every enriched string \(x\in \text {ESS-Compress} (K)\), with \(markerReplacement = null\). We call the characters of x which are not enclosed within brackets outer. The brackets themselves are not considered outer characters. DEC first replaces any occurrence of an outer \(``+''\) (respectively, \(``-''\)) with markerReplacement (respectively, the reverse complement of markerReplacement). It then outputs all the outer characters as a single string. Then, for every top-level open/close bracket pair in x, it calls DEC recursively on the sequence in between the brackets, and passes as markerReplacement the rightmost \(k-1\) outer characters to the left of the open bracket.
Algorithm to choose absorption edges
Let D be any directed graph and consider the problem of finding a spanning out-forest with the maximum number of edges. We call this the problem of finding an edge-maximizing spanning out-forest. This problem is a specific instance of the maximum weight out-forest problem [33], which allows for weights to be placed on the edges. As we show in this section, there is an optimal algorithm for our problem that is simpler than the algorithm for arbitrary weights described in [33].
Our algorithm first decomposes D into strongly connected components, and builds SC(D), the strongly connected component digraph of D. In SC(D), the vertices are the strongly connected components of D, and there is an edge from component \(c_1\) to \(c_2\) if there is an edge in D from some vertex in \(c_1\) to some vertex in \(c_2\). For every component that is a source in SC(D), we pick an arbitrary vertex from it (in D) and put it into a “starter” set. Then, we perform a depth-first search (DFS) traversal of D, but whenever we start a new tree, we initiate it with a vertex from the starter set, if one is available. We remove the vertex from the starter set once it is used to initiate a tree. We then output the DFS forest F.
We will prove that F is a spanning out-forest of D with the maximum number of edges.
Lemma 3.1
(Correctness of edge-maximizing spanning out-forest algorithm) Let D be a directed graph, let F be the spanning out-forest returned by our algorithm run on D, and let \(n_{sc}\) be the number of source components in SC(D). Then, the number of out-trees in F is \(n_{sc}\) and this is the smallest possible for any spanning out-forest. Also, the number of edges in F is the maximum possible for any spanning out-forest.
Proof
Consider any spanning out-forest of D. If it has less than \(n_{sc}\) out-trees, then by the pigeonhole principle, there are two source components \(c_1\) and \(c_2\) with vertices \(v_1\) and \(v_2\), respectively, belonging to the same out-tree. This is a contradiction, since \(c_1\) and \(c_2\) are source components and hence there cannot be a path between them. Hence, any spanning out-forest must have at least \(n_{sc}\) out-trees. Now, consider F. Every vertex in D is reachable from one of the vertices in the starter set, by its construction. There are \(n_{sc}\) starter vertices, so F will have at most \(n_{sc}\) out-trees. Since any spanning out-forest must have at least \(n_{sc}\) out-trees, F will have \(n_{sc}\) out-trees and it will be the minimum achievable. Also, in any spanning out-forest, the number of edges is the number of vertices minus the number of out-trees; hence F will have the the maximum number of edges of any spanning out-forest. \(\square\)
The weight of the ESS-compress representation
In this section, we derive a formula for the weight of the ESS-Compress representation and explore the potential benefits of some variations of ESS-Compress.
Theorem 3.2
Let K be a set of canonical \(k\)-mers, and let \(\Psi\) be a vertex-disjoint normalized path cover of cdBG(K) that is used by \(\text {ESS-Compress} (K)\). Let \(n_{sc}\) be the number of sources in the strongly connected component graph of the absorption graph \(D_A\). Let X be the solution returned by \(\text {ESS-Compress} (K)\). Then
$$\begin{aligned} weight(X) = |K| + 3|\Psi | + n_{sc} (k-4) \end{aligned}$$
Proof
If we unroll the recursion of ESS-Compress, then there are exactly \(|\Psi |\) runs of Spell-Path-Enrich, one for each \(\psi \in \Psi\). For each call, we let \(n_\psi\) be the number of characters in the returned string that are added non-recursively (i.e. everything except \(ins_0\) and \(ins_1\)). Considering the structure of the recursion and accounting for characters in this way, we have that \(weight(X) = \sum _{\psi \in \Psi } n_\psi\).
Prior to marker replacement (Line 17, the non-recursive part of x is \(spell(\psi )\)). When \(\psi\) is a root in the absorption forest F, then the marker absorption stage is not executed and so \(n_\psi = |spell(\psi )|\). Otherwise, the marker absorption phase (Lines 17 to 31) removes \(k-1\) characters, adds 1 new marker character, and adds two new bracket characters. Hence, \(n_\psi = |spell(\psi )| - (k-1) + 3 = |spell(\psi )| - (k - 4)\). By Lemma 3.1, F contains \(n_{sc}\) roots. Hence,
$$\begin{aligned} weight(X)&= \sum _{\psi \in \Psi } n_\psi = \sum _{\psi \text { is a root}} |spell(\psi )| + \sum _{\psi \text { is not a root}} \left( |spell(\psi )| - (k-4) \right) \\&= \sum _{\psi \in \Psi } |spell(\psi )| - (k-4)(|\Psi | - n_{sc}) = |K| + 3|\Psi | - n_{sc} (k-4) \end{aligned}$$
The last equality follows by applying Eq. (2.1) from Sect. "Preliminaries". \(\square\)
We can use Theorem 3.2 to better understand ESS-Compress. The weight depends on the choice of \(\Psi\). The \(\Psi\) returned by UST has, empirically, almost the minimum \(|\Psi |\) possible [26]. This (almost) minimizes the \(3|\Psi |\) term in Theorem 3.2. However, this may not necessarily lead to the lowest total weight, because there is an interplay between \(\Psi\) and \(n_{sc}\), as follows. Let \(\Psi '\) be a vertex-disjoint normalized path cover with \(|\Psi '| > |\Psi |\). Its paths are shorter, on average, than \(\Psi\)’s. There may now be edges of cdBG(K) that become absorption edges, that were not with \(\Psi\). For example, an edge between two unitigs which are internal in \(\Psi\) is not, by our definition, an absorption edge. With the shorter paths in \(\Psi '\), one of these unitigs may become an initiator vertex, making the edge absorbing. This may in turn improve connectivity in \(D_A\) and decrease \(n_{sc}\), counterbalancing the increase in \(|\Psi '|\). Nevertheless, ESS-Compress does not consider alternative path covers and always uses the one returned by UST.
Another aspect of ESS-Compress that could be changed is the definition of absorption edge. We restrict absorption edges to be between an initiator unitig and an internal unitig; however, one could in principle also define ways to absorb between an endpoint unitig and an internal unitig, or between two internal unitigs. This could potentially decrease \(n_{sc}\) by increasing the number of absorption edges, though it would likely need more complicated and space-consuming encoding schemes.
How much could be gained by modifying the path cover and the absorption rules that ESS-Compress uses? We can answer this by observing that \(n_{sc}\) cannot be less than C, the number of connected components of the undirected graph underlying cdBG(K). At the same time, in [26] we gave an algorithm to compute an instance-specific lower bound \(\beta\) on the number of paths in any vertex-disjoint path cover. Putting this together, we conclude that regardless of which path cover is used and which subset of cdBG(K) edges are allowed to be absorbing, the weight of a ESS-Compress representation cannot be lower than:
$$\begin{aligned} |K| + 3\beta + C(k-4) \end{aligned}$$
(3.1)
As we will see in the results, the weight of ESS-Compress is never more than 2% higher than this lower bound, which is why we did not pursue these other possible optimizations to ESS-Compress. We note, however, that the above is not a general lower bound and does not rule out the possibility of lower-weight string set representations that beat ESS-Compress.