Superbubbles revisited

Background Superbubbles are distinctive subgraphs in directed graphs that play an important role in assembly algorithms for high-throughput sequencing (HTS) data. Their practical importance derives from the fact that they are connected to their host graph by a single entrance and a single exit vertex, thus allowing them to be handled independently. Efficient algorithms for the enumeration of superbubbles are therefore of importance for the processing of HTS data. Superbubbles can be identified within the strongly connected components of the input digraph after transforming them into directed acyclic graphs. The algorithm by Sung et al. (IEEE ACM Trans Comput Biol Bioinform 12:770–777, 2015) achieves this task in O(m log(m))-time. The extraction of superbubbles from the transformed components was later improved by Brankovic et al. (Theor Comput Sci 609:374–383, 2016), resulting in an overall O(m + n)-time algorithm. Results A re-analysis of the mathematical structure of superbubbles showed that the construction of auxiliary DAGs from the strongly connected components in the work of Sung et al. missed some details that can lead to the reporting of false positive superbubbles. We propose an alternative, even simpler auxiliary graph that solves the problem and retains the linear running time for general digraphs.
Furthermore, we describe a simpler, space-efficient O(m + n)-time algorithm for detecting superbubbles in DAGs that uses only simple data structures. Implementation We present a reference implementation of the algorithm that accepts many commonly used formats for the input graph and provides convenient access to the improved algorithm. https://github.com/Fabianexe/Superbubble.


Background
Under idealizing assumptions, the genome assembly problem reduces to finding an Eulerian path in the de Bruijn graph [1] that represents the collection of sequencing reads [2]. In real-life data sets, however, sequencing errors and repetitive sequence elements contaminate the de Bruijn graph with additional, false vertices and edges. Assembly tools therefore employ filtering steps that are based on recognizing local motifs in the de Bruijn graphs that correspond to this kind of noise, see e.g. [3]. Superbubbles also appear naturally in the multigraphs in the context of supergenome coordinatization [4], i.e., the problem of finding good common coordinate systems for multiple genomes.
The simplest such motif is a bubble, comprising two or more isolated paths connecting a source s to a target t, see [5] for a formal analysis. While bubbles are easily recognized, most other motifs are much more difficult to find. Superbubbles are a complex generalization of bubbles that were proposed in [6] as an important class of subgraphs in the context of HTS assembly. It will be convenient for the presentation in this paper to first consider a more general class of structures, which is obtained by omitting the minimality criterion:
Definition 1 (Superbubbloid) Let G = (V, E) be a directed multi-graph and let (s, t) be an ordered pair of distinct vertices. Denote by U_st the set of vertices reachable from s without passing through t and write U⁺_ts for the set of vertices from which t is reachable without passing through s. Then the subgraph G[U_st] induced by U_st is a superbubbloid in G if the following three conditions are satisfied:
(S1) t is reachable from s (reachability condition).
(S2) U_st = U⁺_ts (matching condition).
(S3) G[U_st] is acyclic (acyclicity condition).
We call s, t, and U_st \ {s, t} the entrance, exit, and interior of the superbubbloid. We denote the induced subgraph G[U_st] by ⟨s, t⟩ if it is a superbubbloid with entrance s and exit t.
A superbubble is a superbubbloid that is minimal in the following sense:
Definition 2 A superbubbloid ⟨s, t⟩ is a superbubble if there is no s′ ∈ U_st \ {s} such that ⟨s′, t⟩ is a superbubbloid.
We note that Definition 2 is a simple rephrasing of the language used in [6], where a simple O(n(m + n))-time algorithm was proposed that, for each candidate entrance s, explicitly retrieves all superbubbles ⟨s, t⟩. Since the definition is entirely based on reachability, multiple edges are irrelevant and can be omitted altogether. Hence we only consider simple digraphs throughout.
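The definitions can be turned into a brute-force check almost literally. The following Python sketch is an illustration only and is not the algorithm of [6]: it assumes that the digraph is given as a dictionary mapping each vertex to a list of its successors, and the function names (`reach`, `is_superbubbloid`, `superbubbles`) are hypothetical. The sets U_st and U⁺_ts are computed by a breadth-first search that never expands the excluded vertex; a pair (s, t) is accepted if t is reachable from s, the two sets coincide, and the induced subgraph is acyclic. Minimality is then tested as in Definition 2.

```python
from collections import deque

def reach(adj, start, stop):
    """Vertices reachable from `start` on paths that do not pass through `stop`.
    `stop` may be reached as an endpoint but is never expanded."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        if u == stop:
            continue
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def reverse(adj):
    """Adjacency lists of the digraph with all edges reversed."""
    rev = {u: [] for u in adj}
    for u, succs in adj.items():
        for v in succs:
            rev.setdefault(v, []).append(u)
    return rev

def is_acyclic(adj, U):
    """Kahn's algorithm restricted to the subgraph induced by U."""
    indeg = {u: 0 for u in U}
    for u in U:
        for v in adj.get(u, ()):
            if v in U:
                indeg[v] += 1
    queue = deque(u for u in U if indeg[u] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in adj.get(u, ()):
            if v in U:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return removed == len(U)

def is_superbubbloid(adj, rev, s, t):
    """Reachability, matching, and acyclicity conditions for the pair (s, t)."""
    if s == t:
        return False
    U = reach(adj, s, t)
    return t in U and U == reach(rev, t, s) and is_acyclic(adj, U)

def superbubbles(adj):
    """All pairs (s, t) that are minimal superbubbloids (Definition 2)."""
    rev = reverse(adj)
    vertices = set(adj) | {v for succs in adj.values() for v in succs}
    result = []
    for s in vertices:
        for t in vertices:
            if is_superbubbloid(adj, rev, s, t):
                interior = reach(adj, s, t) - {s, t}
                if not any(is_superbubbloid(adj, rev, x, t) for x in interior):
                    result.append((s, t))
    return sorted(result)

# A simple bubble between 1 and 4, flanked by single edges.
bubble = {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
# The directed 3-cycle.
cycle = {1: [2], 2: [3], 3: [1]}
```

For the bubble graph the sketch reports the pairs (0, 1), (1, 4), and (4, 5); for the directed 3-cycle it reports the three subgraphs on two vertices. The running time is far from linear, so the sketch is useful only for checking small examples.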
The vertex set of every digraph G(V, E) can be partitioned into its strongly connected components. Denote by V̄ the set of singletons, i.e., the strongly connected components without edges. One easily checks that the induced subgraph G[V̄] is acyclic. Furthermore, denote by S the partition of V comprising the non-singleton strongly connected components of G and the union V̄ of the singletons. The key observation of [7] can be stated as

Proposition 1 Every superbubble ⟨s, t⟩ in G(V, E) is an induced subgraph of G[C] for some C ∈ S.
Proposition 1 ensures that it is sufficient to search separately for superbubbles within G[C] for C ∈ S. However, these induced subgraphs may contain additional superbubbles that are created by omitting the edges between different components. In order to preserve this information, the individual components C are augmented by artificial vertices [7]. The augmented component C is then converted into a directed acyclic graph (DAG). Within each DAG the superbubbles can be enumerated efficiently. With the approach of [7], this yields an overall O(m log m)-time algorithm, the complexity of which is determined by the extraction of the superbubbles from the component DAGs. The partitioning of G(V, E) into the components G[C] for C ∈ S and the transformation into DAGs can be achieved in O(m + n)-time. Recently, Brankovic et al. [8] showed that superbubbles can be found in linear time within a DAG. Their improvement uses the fact that the DAG can always be topologically ordered in such a way that superbubbles appear as contiguous blocks.
In this ordering, furthermore, the candidates for entrance and exit vertices can be narrowed down considerably. For each pair of entrance and exit candidates (s, t), it can then be decided in constant time whether U_st indeed induces a superbubble. Using additional properties of superbubbles to further prune the candidate list of (s, t) pairs results in O(m + n)-time complexity.
The combination of the work of [7] with the improvement of [8] results in the state-of-the-art algorithm. The concept of a superbubble was extended to bi-directed and bi-edged graphs, where the corresponding structures are called ultrabubbles [9][10][11]. The enumeration algorithm for ultrabubbles in [9] has a worst case complexity of O(mn), and hence does not provide an alternative for directed graphs.
A careful analysis showed that sometimes false-positive superbubbles are reported, see Fig. 1. These do not constitute a fatal problem because they can be recognized easily in linear total time simply by checking the tail of incoming and the head of outgoing edges. It is nevertheless worthwhile to analyse the issue and to seek a direct remedy. As we shall see below, the false positive subgraphs are a consequence of the way in which a strongly connected component C is transformed into two DAGs that are augmented by either the source or target vertices.
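The edge check mentioned above can be sketched as follows. This is an illustration under stated assumptions, not code from the reference implementation: the graph is assumed to be given by successor lists `succ` and predecessor lists `pred` (dictionaries from vertex to list), the function names are hypothetical, and the example graph is a made-up instance in the spirit of Fig. 1b. A candidate with vertex set U, entrance s, and exit t is rejected if some edge leaves U at a vertex other than t or enters U at a vertex other than s.

```python
def predecessors(succ):
    """Predecessor lists derived from successor lists."""
    pred = {u: [] for u in succ}
    for u, vs in succ.items():
        for v in vs:
            pred.setdefault(v, []).append(u)
    return pred

def is_false_positive(succ, pred, U, s, t):
    """True if an edge leaves U at a vertex other than t,
    or enters U at a vertex other than s."""
    U = set(U)
    for x in U:
        if x != t and any(w not in U for w in succ.get(x, ())):
            return True
        if x != s and any(w not in U for w in pred.get(x, ())):
            return True
    return False

# Hypothetical example: a directed 3-cycle on 1, 2, 3 with an extra
# incoming edge at vertex 1 and an extra outgoing edge at vertex 3.
succ = {0: [1], 1: [2], 2: [3], 3: [1, 4], 4: []}
pred = predecessors(succ)
```

In this example the candidate with U = {3, 1}, entrance 3, and exit 1 is rejected because of the edges (3, 4) and (0, 1), whereas the candidate with U = {1, 2} passes. Each call only inspects edges incident to vertices of U.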

Theory
In the first part of this section we revisit the theory of superbubbles in digraphs in some more detail. Although some of the statements below have appeared at least in similar form in the literature [6][7][8], we give concise proofs and take care to disentangle properties that depend on minimality from those that hold more generally. This refined mathematical analysis sets the stage in the second part for identifying the reason for the problems with the auxiliary graph constructed in [7] and shows how the problem can be solved efficiently in these cases using an even simpler auxiliary graph. In the third part we elaborate on the linear time algorithm of [8] for DAGs. We derive a variant that has the same asymptotic running time but seems easier to explain.

Weak superbubbloids
Although we do not intend to compute superbubbloids in practice, they feature several convenient mathematical properties that will simplify the analysis of superbubbles considerably. The main aim of this section is to prove moderate generalizations of the main results of [6,7]. To this end, it will be convenient to rephrase the reachability and matching conditions (S1) and (S2) for the vertex set U of a superbubbloid with entrance s and exit t in the following, more expanded form.
Lemma 1 Let G be a digraph, U ⊂ V(G) and s, t ∈ U. Then (S1) and (S2) hold for U = U_st = U⁺_ts if and only if the following four conditions are satisfied:
(S.i) Every u ∈ U is reachable from s.
(S.ii) t is reachable from every u ∈ U.
(S.iii) If u ∈ U and w ∉ U, then every w → u path contains s.
(S.iv) If u ∈ U and w ∉ U, then every u → w path contains t.
Proof Suppose (S1) and (S2) are true. Then u ∈ U_st and u ∈ U⁺_ts implies, by definition, that u is reachable from s and that t is reachable from u, i.e., (S.i) and (S.ii) hold. By (S2) we have U := U_st = U⁺_ts. If w ∉ U, it is not reachable from s without passing through t. Since every u ∈ U is reachable from s without passing through t, we would have w ∈ U if w was reachable from any u ∈ U on a path not containing t, hence (S.iv) holds. Similarly, since t is reachable from u without passing through s, we would have w ∈ U if u could be reached from w along a path that does not contain s, i.e., (S.iii) holds. Now suppose (S.i), (S.ii), (S.iii), and (S.iv) hold. Clearly, (S.i) and (S.ii) already imply (S1). Since every u ∈ U is reachable from s by (S.i) and every path reaching w ∉ U passes through t by (S.iv), we have U = U_st. By (S.ii), t is reachable from every u ∈ U and by (S.iii) t can be reached from w ∉ U only by passing through s, i.e., U = U⁺_ts. Hence U_st = U⁺_ts.
Fig. 1 False positive superbubble returned by the algorithm of [7]. The directed 3-cycle a on the l.h.s. correctly yields the three subgraphs on two vertices as superbubbles. The graph b on the r.h.s., on the other hand, includes a as the only non-trivial strongly connected component. The vertices 1 and 3 have additional neighbors, which are replaced by artificial nodes r and r′, respectively. c, d are the corresponding DFS trees using an artificial source as root. Since no artificial source is present in a, a random vertex, here 1, is used as root. The corresponding DAGs in e, f are constructed from duplicate copies of the DFS trees, augmented by source and sink vertices in e since these were lacking in c. Note that the same DAGs (g, h) are obtained for a and the non-trivial copy of a in b. Hence the same superbubbles are returned in both cases. While ⟨3, 1⟩ is a valid result for a, it is a false positive for b since 3 is not a valid entrance and 1 is not a valid exit in b.
Corollary 1 Suppose U, s, and t satisfy (S.i), (S.ii), (S.iii), and (S.iv). Then every path connecting s to u ∈ U and u to t is contained within U.
Proof Assume, for contradiction, that there is a u → t path containing a vertex w ∉ ⟨s, t⟩. By definition of the set U_st, w ∉ U_st is not reachable from u ∈ U_st without passing through t first, i.e., w cannot be part of a u → t path.
Corollary 1 shows that subgraphs satisfying (S1) and (S2) are related to reachability structures explored in some detail in [12,13]. In the following it will be useful to consider
(S.v) If (u, v) is an edge in U, then every v → u path in G contains both t and s.
In the following we shall see that (S.v) acts as a slight relaxation of the acyclicity axiom (S3).
Lemma 2 Let G be a digraph, U ⊂ V(G) and s, t ∈ U. If U is the vertex set of a superbubbloid ⟨s, t⟩, then it satisfies (S.v). Conversely, if U satisfies (S.i), (S.ii), (S.iii), (S.iv), and (S.v), then G[U] \ {(t, s)} is acyclic.
Proof Suppose U is the vertex set of a superbubbloid with entrance s and exit t. If (u, v) is an edge in U, then v ≠ s by (S3). Since v is reachable from s within U, no v → s path can exist within U, since otherwise there would be a cycle, contradicting (S3). Hence any v → s path leaves U and thus passes through t. There are two cases: If (t, s) ∈ E, any path containing this edge trivially contains both s and t. The existence of the edge (t, s), however, contradicts (S3). Otherwise, any t → s path contains at least one vertex x ∉ U. By (S.iv) and (S.iii), respectively, every v → x path contains t and every x → u path contains s. Hence the first statement holds.
Conversely, suppose (S.v) holds, i.e., every directed cycle Z within U contains s and t. Suppose (t, s) is not contained in Z, i.e., there is a vertex u ∈ U \ {s, t} such that (t, u) ∈ E. By (S.ii), t is reachable from u without passing through s, and every u → t path is contained in U by Corollary 1. Thus there is a directed cycle within U that contains u and t but not s, contradicting (S.v). Removing the edge (t, s) thus cuts every directed cycle within U, and hence G[U] \ {(t, s)} is acyclic.
Although the definition of [6] (our Definition 2) is also used in [7], the notion of a superbubble is tacitly relaxed in [7] by allowing an edge (t, s) from exit to entrance, although this contradicts the acyclicity condition (S3). This suggests relaxing (S3) to (S.v), which leads to the notion of a weak superbubbloid (Definition 3). We denote a weak superbubbloid with entrance s and exit t by ⟨s, t⟩ and write U_st for its vertex set. As an immediate consequence of Definition 3 and Lemma 2, a weak superbubbloid ⟨s, t⟩ is a superbubbloid whenever G contains no edge (t, s) (Corollary 2). The possibility of an edge connecting t to s will play a role below, hence we will focus on weak superbubbloids in this contribution.
First we observe that a weak superbubbloid contained within another weak superbubbloid must be a superbubbloid because the existence of an edge from exit to entrance contradicts (S.v) for the surrounding weak superbubbloid. We record this fact as
Lemma 3 If ⟨s, t⟩ and ⟨s′, t′⟩ are weak superbubbloids with s′, t′ ∈ ⟨s, t⟩ and {s′, t′} ≠ {s, t}, then ⟨s′, t′⟩ is a superbubbloid.
The result will be important in the context of minimal (weak) superbubbloids below.
Another immediate consequence of Lemma 2 is
Corollary 3 Let ⟨s, t⟩ be a weak superbubbloid in G. If there is an edge (u, v) in ⟨s, t⟩ that is contained in a cycle, then every edge in ⟨s, t⟩ is contained in a cycle containing s and t.
Proof By (S.v) there is a cycle running through s and t. Let (u, v) be an edge in ⟨s, t⟩. Since u is reachable from s and v reaches t within U, there is a cycle containing s, t, and the edge (u, v).

Theorem 1 Every weak superbubbloid ⟨s, t⟩ in G(V, E) is an induced subgraph of G[C] for some C ∈ S.
Proof First assume that ⟨s, t⟩ contains an edge (u, v) that is contained in a cycle. Then by (S.v), there is a cycle through s and t and thus in particular a t → s path. For every u ∈ U, there is a path within U from s to t through u by (S.i), (S.ii), and Lemma 1. Thus ⟨s, t⟩ is contained as an induced subgraph in a strongly connected component G[C] of G. If there is no edge in ⟨s, t⟩ that is contained in a cycle, then every vertex in ⟨s, t⟩ is a strongly connected component on its own. ⟨s, t⟩ is therefore an induced subgraph of G[V̄].
Theorem 1 establishes Proposition 1, the key result of [7], in sufficient generality for our purposes. Next we derive a few technical results that set the stage for considering minimality among weak superbubbloids.
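The partition S used in Theorem 1 can be computed with any linear-time algorithm for strongly connected components. The sketch below is one possible illustration, assuming the digraph is a dictionary from vertices to successor lists; it uses an iterative variant of the two-pass (Kosaraju-style) method, which is a choice made here for brevity and not necessarily the method used in the reference implementation. The function names are hypothetical. Singletons without a loop are collected into one set, all other components are returned separately.

```python
def strongly_connected_components(adj):
    """Two-pass computation of strongly connected components (iterative)."""
    vertices = set(adj) | {v for succs in adj.values() for v in succs}
    rev = {v: [] for v in vertices}
    for u, succs in adj.items():
        for v in succs:
            rev[v].append(u)
    # Pass 1: record vertices in DFS postorder.
    seen, order = set(), []
    for root in vertices:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj.get(root, ())))]
        while stack:
            u, successors = stack[-1]
            for v in successors:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj.get(v, ()))))
                    break
            else:
                order.append(u)
                stack.pop()
    # Pass 2: explore the reversed graph in decreasing postorder.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            u = stack.pop()
            component.add(u)
            for v in rev[u]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        components.append(component)
    return components

def partition(adj):
    """Non-singleton components and the union of the edge-free singletons."""
    parts, singletons = [], set()
    for component in strongly_connected_components(adj):
        v = next(iter(component))
        if len(component) == 1 and v not in adj.get(v, ()):
            singletons.add(v)
        else:
            parts.append(component)
    return parts, singletons

# Hypothetical example: a 3-cycle on 1, 2, 3 with one vertex before and one after it.
example = {0: [1], 1: [2], 2: [3], 3: [1, 4]}
```

For the example graph, `partition(example)` yields the single non-trivial component {1, 2, 3} and the singleton set {0, 4}.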

Lemma 4 Assume that ⟨s, t⟩ is a weak superbubbloid and let u be an interior vertex of ⟨s, t⟩. Then ⟨s, u⟩ is a superbubbloid if and only if ⟨u, t⟩ is a superbubbloid.
Proof Suppose ⟨s, u⟩ is a superbubbloid. Set W_ut := (U_st \ U_su) ∪ {u} and consider w ∈ W_ut. The subgraph induced by W_ut is an induced subgraph of ⟨s, t⟩ \ {(t, s)}. Hence it is acyclic and in particular (t, u) ∉ E, i.e., (S.v) and even (S3) holds. Since t ∉ U_su, every path from s to t runs through u. Since w is reachable from s, there is a path from s through u to w, i.e., w is reachable from u. Thus (S.i) holds. (S.ii) holds by assumption since t is reachable from w. Now suppose v ∉ W_ut and w ∈ W_ut. If v ∉ U_st, then every v → w path passes through s and then through u, the exit of ⟨s, u⟩, before reaching w. If v ∈ U_st, then v ∈ U_su \ {u} and thus every v → w path passes through u as the exit of ⟨s, u⟩. Hence W_ut satisfies (S.iii). If v ∈ U_st, then v ∈ U_su \ {u} and thus every w → v path passes through s. By (S.v) there is no w → s path within ⟨s, t⟩ \ {(t, s)}, and thus any w → v path includes (t, s) or a vertex y ∉ U_st. By construction, all w → y paths contain t, and thus all w → v paths also pass through t and W_ut also satisfies (S.iv).
Conversely, suppose ⟨u, t⟩ is a superbubbloid. We have to show that W_su := (U_st \ U_ut) ∪ {u} induces a superbubbloid. The proof strategy is very similar. As above we observe that (S.v), (S.i), and (S.ii) are satisfied. Now consider v ∉ W_su and w ∈ W_su. If v ∉ U_st, then every v → w path contains s; otherwise v ∈ U_ut \ {u} and every v → w path passes through t and thus also through s by Corollary 1, thus (S.iii) holds. If v ∈ U_st, then v ∈ U_ut \ {u}, in which case every w → v path passes through u. Otherwise v ∉ U_st, and then every w → v path runs through t ∈ U_st and thus in particular also through u. Hence (S.iv) holds.

Lemma 5
Let ⟨w, u⟩ and ⟨s, t⟩ be two weak superbubbloids such that u is an interior vertex of ⟨s, t⟩, s is an interior vertex of ⟨w, u⟩, w is not contained in ⟨s, t⟩ and t is not contained in ⟨w, u⟩. Then the intersection ⟨s, u⟩ = ⟨w, u⟩ ∩ ⟨s, t⟩ is also a superbubbloid.
Proof First consider the intersection ⟨s, u⟩. u ∈ ⟨s, t⟩ is reachable from s, hence (S1) holds. Furthermore, ⟨s, u⟩ is an induced subgraph of ⟨s, t⟩ \ {(t, s)} and hence again acyclic (S3). Set W_su := U_wu ∩ U_st and consider v ∈ W_su. First we note that v is reachable from s by definition of ⟨s, t⟩ and u is reachable from v by definition of ⟨w, u⟩. Let x ∉ W_su. If x ∉ U_st, then every x → v path passes through s. Otherwise x ∉ U_wu (and v ∈ U_wu) and thus every x → v path passes through w. Since w ∉ U_st, we know that every x → v path contains s. If x ∉ U_wu, then every v → x path passes through u. Otherwise x ∉ U_st, so that every v → x path passes through t ∉ U_wu and hence also through u. Thus W_su is the vertex set of a superbubbloid.
We include the following result for completeness, although it is irrelevant for the algorithmic considerations below.

Lemma 6 Let ⟨w, u⟩ and ⟨s, t⟩ be defined as in Lemma 5. Then the union ⟨w, t⟩ = ⟨w, u⟩ ∪ ⟨s, t⟩ is a superbubbloid if and only if the induced subgraph ⟨w, t⟩ satisfies (S.v).
Proof Since ⟨w, s⟩, ⟨s, u⟩, ⟨u, t⟩ are superbubbloids, t is reachable from w, i.e., (S1) holds. By the same token, every v ∈ W_wt := U_wu ∪ U_st is reachable from w or s and reaches u or t. Since s is reachable from w and t is reachable from u, every v ∈ W_wt is reachable from w and reaches t. Now consider x ∉ W_wt and v ∈ W_wt. If v ∈ U_wu, every x → v path passes through w; if v ∈ U_st, it passes through s ∈ U_wu and thus also through w. If v ∈ U_st, then every v → x path passes through t. If v ∈ U_wu, it passes through u ∈ U_st and thus also through t. Thus W_wt satisfies (S2). Thus ⟨w, t⟩ is a weak superbubbloid if and only if (S.v) holds.

Lemma 7 Let ⟨s, t⟩ be a weak superbubbloid in G with vertex set U_st ⊆ W ⊆ V. Then ⟨s, t⟩ is also a weak superbubbloid in the induced subgraph G[W].
Proof Conditions (S.i), (S.ii), and (S.v) are trivially conserved when G is restricted to G[W]. Since every w → u and u → w path with u ∈ U_st and w ∉ U_st within W is also such a path in V, we conclude that (S.iii) and (S.iv) are satisfied w.r.t. W whenever they are true w.r.t. the larger set V.
The converse is not true. The restriction to induced subgraphs thus can introduce additional (weak) superbubbloids. As the examples in Fig. 1 show, it is also possible to generate additional superbubbles.
Finally we turn our attention to the minimality condition.
The "non-symmetric" phrasing of the minimality condition in Definitions 2 and 4 [6][7][8] is justified by Lemma 4: If ⟨s, t⟩ and ⟨s, t′⟩ with t′ ∈ ⟨s, t⟩ are superbubbloids, then ⟨t′, t⟩ is also a superbubbloid, and thus ⟨s, t⟩ is not a superbubble. As a direct consequence of Lemma 3, furthermore, we have
Corollary 4 Every superbubble is also a weak superbubble.
Lemma 4 also implies that every weak superbubbloid, which is not a superbubble itself, can be decomposed into consecutive superbubbles:

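Corollary 5 If ⟨s, t⟩ is a weak superbubbloid, then it is either a weak superbubble or there is a sequence of vertices s = v_1, v_2, ..., v_k = t with k ≥ 3 such that ⟨v_i, v_{i+1}⟩ is a superbubble for all i ∈ {1, ..., k − 1}.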
A useful consequence of Lemma 5, furthermore, is that superbubbles cannot overlap at interior vertices since their intersection is again a superbubbloid and thus neither of them could have been minimal. Furthermore, Lemma 4 immediately implies that ⟨w, s⟩ and ⟨u, t⟩ are also superbubbloids, i.e., neither ⟨w, u⟩ nor ⟨s, t⟩ is a superbubble in the situation of Lemma 5. Figure 2 shows a graph in which all (weak) superbubbloids and superbubbles are indicated.

Reduction to superbubble finding in DAGs
Theorem 1 guarantees that every weak superbubbloid and thus every superbubble in G(V, E) is completely contained within one of the induced subgraphs G[C], C ∈ S. It does not guarantee, however, that a superbubble in G[C] is also a superbubble in G. This was already noted in [7]. This fact suggests augmenting the induced subgraph G[C] of G by an artificial source a and an artificial sink b.

Definition 5
The augmented graph G(C) is constructed from G[C] by adding the artificial source a and the artificial sink b. There is an edge (a, x) in G(C) whenever x ∈ C has an incoming edge from another component in G and there is an edge (x, b) whenever x ∈ C has an outgoing edge to another component of G.
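Definition 5 can be sketched in a few lines. The following Python fragment is an illustration only: it assumes successor-list dictionaries, the function name `augment` is hypothetical, and the labels "a" and "b" for the artificial vertices are placeholders that must not clash with real vertex names.

```python
def augment(adj, C, a="a", b="b"):
    """Augmented graph on C: edges inside C are kept, every edge entering C
    from outside becomes an edge from a, every edge leaving C becomes an edge to b."""
    C = set(C)
    g = {x: [v for v in adj.get(x, ()) if v in C] for x in C}
    g[a], g[b] = [], []
    for u, succs in adj.items():
        for v in succs:
            if u not in C and v in C and v not in g[a]:
                g[a].append(v)      # x = v has an incoming edge from another component
            if u in C and v not in C and b not in g[u]:
                g[u].append(b)      # x = u has an outgoing edge to another component
    return g

# Hypothetical example: 3-cycle on 1, 2, 3 entered at 1 and left at 3.
example = {0: [1], 1: [2], 2: [3], 3: [1, 4]}
augmented = augment(example, {1, 2, 3})
```

In the example, the artificial source receives the single edge (a, 1) and vertex 3 receives the edge (3, b), while the cycle itself is unchanged.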
Since G[V̄] is acyclic, a has only outgoing edges and b only incoming ones, it follows that the augmented graph G(V̄) is also acyclic.
Fig. 2 Example graph with all (weak) superbubbloids and superbubbles indicated. In a, all weak superbubbloids (blue) and all superbubbloids (green) are marked. Note that besides ⟨0, 2⟩ and ⟨7, 10⟩ all weak superbubbloids are also superbubbloids. In b, all weak superbubbles (blue) and all superbubbles (green) are marked. The weak superbubbloid ⟨0, 2⟩ is the only one that creates no (weak) superbubble. Thus ⟨7, 10⟩ is the only weak superbubble that is not a superbubble.

Lemma 8 ⟨s, t⟩ is a weak superbubbloid in G if and only if it is a weak superbubbloid of G(C) or a superbubbloid in G(V̄) that does not contain an auxiliary source a or an auxiliary sink b.
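Proof First assume that ⟨s, t⟩ is an induced subgraph of the strongly connected component G[C] of G. By construction, G[C] is also a strongly connected component of G(C). Thus reachability within C is the same w.r.t. G and G(C). Also by construction, some vertex w ∉ C is reachable from x ∈ C in G if and only if b is reachable from x in G(C), and x ∈ C is reachable from some vertex w ∉ C in G if and only if x is reachable from a in G(C). Hence ⟨s, t⟩ is a weak superbubbloid w.r.t. G if and only if it is one w.r.t. G(C). For the special case that ⟨s, t⟩ is an induced subgraph of the acyclic graph G[V̄] we can argue in exactly the same manner.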
For strongly connected components C, the graph G(C) contains exactly 3 strongly connected components whose vertex sets are C and the singletons {a} and {b}. Since a and b are not adjacent, Theorem 1 implies that every weak superbubbloid in G(C) is contained in G[C] and hence contains neither a nor b. Superbubbloids containing a or b cannot be excluded for the acyclic component G[V̄], however.
It is possible, therefore, to find the weak superbubbloids of G by computing the weak superbubbloids not containing an artificial source or sink vertex in the augmented graphs. In the remainder of this section we show how this can be done efficiently.
The presentation below depends strongly on the properties of depth first search (DFS) trees and vertex orders associated with them. We thus briefly recall their relevant features. A vertex order is a bijection ρ : V → {1, ..., |V|}. We write ρ⁻¹(i) for the vertex at the i-th position of the ρ-ordered vertex list. Later we will also need vertex sets that form intervals w.r.t. ρ. DFS on a strongly connected digraph G (exploring only along directed edges) is well known to enumerate all vertices starting from an arbitrary root [14]. The corresponding DFS tree consists entirely of edges of G pointing away from the root. In the following we will reserve the symbol ρ for the reverse postorder of the DFS tree T in a strongly connected graph. Edges of G can be classified relative to a given DFS tree T with root x. By definition, all tree edges (u, v) are considered to be oriented away from the root x; hence ρ(u) < ρ(v). An edge (u, v) ∈ E(G) is a forward edge if v is reachable from u along a path consisting of tree edges, hence it satisfies ρ(u) < ρ(v). The edge (u, v) is a backward edge if u is reachable from v along a path consisting of tree edges, hence ρ(u) > ρ(v). The remaining, so-called cross edges have no well-defined behavior w.r.t. ρ. We refer to [14,15] for more details on depth first search, DFS trees, and the associated vertex orders.
A topological sorting of a directed graph is an order π of V such that π(u) < π(v) holds for every directed edge (u, v) [16]. Equivalently, π is a topological sorting if there are no backward edges. A directed graph admits a topological sorting if and only if it is a DAG. In particular, if v is reachable from u then π(u) < π(v) must hold. In a DAG, a topological sorting can be obtained as the reverse postorder of an arbitrary DFS tree that is constructed without considering the edge directions in G [15].
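The reverse postorder and the backward edges can be obtained in a single DFS pass. The sketch below is an illustration under the assumption that the digraph is a dictionary of successor lists and that all vertices of interest are reachable from the chosen root; the function name `dfs_order` is hypothetical. An edge (u, v) is reported as backward exactly when v is still on the current DFS path, i.e., when u is reachable from v along tree edges.

```python
def dfs_order(adj, root):
    """Iterative DFS from `root` along directed edges.
    Returns (rho, backward): rho maps each visited vertex to its position
    (starting at 1) in reverse postorder; backward lists the backward edges."""
    seen, on_path = {root}, {root}
    post, backward = [], []
    stack = [(root, iter(adj.get(root, ())))]
    while stack:
        u, successors = stack[-1]
        for v in successors:
            if v not in seen:
                seen.add(v)
                on_path.add(v)
                stack.append((v, iter(adj.get(v, ()))))
                break
            if v in on_path:            # v is an ancestor of u in the DFS tree
                backward.append((u, v))
        else:                           # all successors of u have been examined
            on_path.discard(u)
            post.append(u)
            stack.pop()
    rho = {v: i + 1 for i, v in enumerate(reversed(post))}
    return rho, backward

cycle = {1: [2], 2: [3], 3: [1]}
dag = {0: [1, 2], 1: [3], 2: [3], 3: []}
```

For the directed 3-cycle rooted at 1, the only backward edge is (3, 1). For the small DAG no backward edge is found, and the returned order is a topological sorting.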

Lemma 9
Let G be a strongly connected digraph, ⟨s, t⟩ be a weak superbubbloid in G, w ∉ ⟨s, t⟩, and ρ the reverse postorder of a DFS tree T rooted at w. Then the induced subgraph ⟨s, t⟩ of G contains no backward edge w.r.t. ρ except possibly (t, s).
Proof Let T be a DFS tree rooted in w and let δ denote the preorder of T. First we rule out δ(s) > δ(t). Since t cannot be reached from anywhere along a path that does not contain s, this is only possible if ρ(t) = 1, i.e., if t is the root of the DFS tree T. This contradicts the assumption that ρ(w) = 1 for some w outside ⟨s, t⟩. Hence δ(s) < δ(t). The DFS tree T therefore contains a directed path from s to t. Since interior vertices of ⟨s, t⟩ are only reachable through s and reach outside only through t, it follows that the subtree T* of T induced by ⟨s, t⟩ is a tree and only s and t are incident to edges of T outside of ⟨s, t⟩. In the DFS reverse postorder ρ we therefore have ρ(s) < ρ(u) < ρ(t) for every vertex u interior to ⟨s, t⟩, and either ρ(w) < ρ(s) or ρ(w) > ρ(t) for all w outside of ⟨s, t⟩. The graph G_st obtained from ⟨s, t⟩ by removing the possible (t, s) edge is a DAG, the subtree T* is a DFS tree on G_st, whose reverse postorder ρ* is collinear with ρ, i.e., ρ*(u) < ρ*(v) holds whenever ρ(u) < ρ(v). Therefore, there are no backward edges in G_st.
Lemma 9 is the key prerequisite for constructing an acyclic graph that contains all weak superbubbles of G(C). Similar to the arguments above, however, we cannot simply ignore the backward edges. Instead, we will again add edges to the artificial source and sink vertices.

Definition 6
Given a DFS tree T with a root w = ρ⁻¹(1) that is neither an interior vertex nor the exit of a weak superbubbloid of G(C), the auxiliary graph Ĝ(C) is obtained from G(C) by replacing every backward edge (v, u) with respect to ρ in G(C) with both an edge (a, u) and an edge (v, b).
Note that Definition 6 implies that all backward edges (v, u) of G(C) are removed in Ĝ(C). As a consequence, Ĝ(C) is acyclic. The construction of Ĝ is illustrated in Fig. 3.
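The replacement rule of Definition 6 can be sketched as follows. The fragment is an illustration under stated assumptions: the augmented graph is a dictionary of successor lists that already contains the artificial vertices, the backward edges w.r.t. a suitable DFS tree are supplied as a list of pairs, and the names `auxiliary_dag`, "a", and "b" are hypothetical.

```python
def auxiliary_dag(g, backward, a="a", b="b"):
    """Replace every backward edge (v, u) by the two edges (a, u) and (v, b);
    all other edges are copied unchanged, duplicates are suppressed."""
    backward = set(backward)
    h = {x: [] for x in g}

    def add(u, v):
        if v not in h[u]:
            h[u].append(v)

    for v, succs in g.items():
        for u in succs:
            if (v, u) in backward:
                add(a, u)
                add(v, b)
            else:
                add(v, u)
    return h

# Hypothetical example: augmented 3-cycle with source edge (a, 1) and sink edge (3, b).
# A DFS rooted at a visits a, 1, 2, 3, b, so (3, 1) is the only backward edge.
g = {1: [2], 2: [3], 3: [1, "b"], "a": [1], "b": []}
h = auxiliary_dag(g, [(3, 1)])
```

In the example the backward edge (3, 1) disappears; the edges (a, 1) and (3, b) it would create are already present, so the result is the path a, 1, 2, 3, b.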

Lemma 10
Let C be a strongly connected component of G and let T be a DFS tree on G(C) with a root w = ρ⁻¹(1) that is neither an interior vertex nor the exit of a weak superbubbloid of G. Then ⟨s, t⟩ with s, t ∈ C is a weak superbubble of G contained in G(C) if and only if ⟨s, t⟩ is a superbubble in Ĝ(C) that does not contain the auxiliary source a or the auxiliary sink b.
Proof Assume that ⟨s, t⟩ is a weak superbubble in G(C) that does not contain a or b. Lemma 8 ensures that this is equivalent to ⟨s, t⟩ being a weak superbubble of G. By Lemma 9, ⟨s, t⟩ contains no backward edges in G(C), with the possible exception of the edge (t, s). Since G(C) and Ĝ(C) by construction differ only in the backward edges, the only difference affecting ⟨s, t⟩ is the possible insertion of edges from a to s or from t to b. Neither affects a weak superbubble, however, and hence ⟨s, t⟩ is a superbubble in Ĝ(C). Now assume that ⟨s, t⟩ is a superbubble in Ĝ(C) with vertex set U_st and a, b ∉ U_st. Since the restriction of Ĝ(C) to C is by construction a subgraph of G(C), we know that reachability within C w.r.t. Ĝ(C) implies reachability w.r.t. G(C). Therefore U_st satisfies (S.i) and (S.ii) also w.r.t. G(C). Therefore, if ⟨s, t⟩ is not a weak superbubble in G(C), then there must be a backward edge (x, v) or a backward edge (v, x) with v in the interior of ⟨s, t⟩. The construction of Ĝ(C), however, ensures that Ĝ(C) then contains an edge (a, v) or (v, b), respectively, which would contradict (S.iii), (S.iv), or acyclicity (in case x ∈ U_st) and hence (S.v). Therefore ⟨s, t⟩ is a weak superbubble in G(C).
The remaining difficulty is to find a vertex w that can safely be used as a root for the DFS tree T. In most cases, one can simply set ρ(a) = 1 since Lemma 8 ensures that a is not part of a weak superbubbloid of G. However, there is no guarantee that an edge of the form (a, w) exists, in which case G(C) is not connected. Thus another root for the DFS tree must be chosen. A closer inspection shows that three cases have to be distinguished:
A. a has an out-edge. In this case we can choose a as the root of the DFS tree, i.e., ρ(a) = 1.
B. a has no out-edge, but b has an in-edge. In this case we have to identify vertices that can only be entrances of a superbubble. These can then be connected with the artificial source vertex without destroying a superbubble.
C. Neither a nor b has edges. This case requires special treatment.
In order to handle case (B), we use the following

Lemma 11 Let a and b be the artificial source and sink of G(C). Let a′ and b′ be a successor of a and a predecessor of b, respectively. Then
(i) a′ is neither an interior vertex nor the exit of a superbubble.
(ii) A predecessor a′′ of a′ is neither an interior vertex nor an entrance of a superbubble.
(iii) b′ is neither an interior vertex nor the entrance of a superbubble.
(iv) A successor b′′ of b′ is neither an interior vertex nor an exit of a superbubble.
Proof If a′ is contained in a superbubble, it must be the entrance, since otherwise its predecessor, the artificial vertex a, would belong to the same superbubble. If a′′ is an interior vertex or the entrance of a superbubble, then its successor a′ would be an interior vertex or the exit of the same superbubble, which is impossible by (i). The statements for b follow analogously.

Corollary 6 If b has an in-edge in G(C), then every successor b′′ ≠ b of every predecessor b′ of b can be used as a root of the DFS search tree. At least one such vertex exists.
Proof By assumption, b has at least one predecessor b′. Since G[C] is strongly connected, b′ has at least one successor b′′ ≠ b, which by Lemma 11(iv) is either not contained in a superbubble or is the entrance of a superbubble.
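The case distinction for the DFS root can be summarized in a small helper. This is a sketch under stated assumptions, not the reference implementation: the augmented graph is a dictionary of successor lists containing the artificial vertices under the hypothetical labels "a" and "b", and the function name `choose_root` is made up. Case (C) is only signalled, since it needs the separate treatment discussed in the text.

```python
def choose_root(g, a="a", b="b"):
    """Root for the DFS on an augmented graph.
    Case A: a has an out-edge, use a.
    Case B: otherwise take a successor b'' != b of a predecessor b' of b (Corollary 6).
    Case C: neither applies, return None."""
    if g.get(a):
        return a
    for b_pred, succs in g.items():
        if b in succs:
            for candidate in succs:
                if candidate != b:
                    return candidate
    return None

# Hypothetical examples for the three cases.
case_a = {1: [2], 2: [1], "a": [1], "b": []}
case_b = {10: [11, "b"], 11: [10], "a": [], "b": []}
case_c = {1: [2], 2: [1], "a": [], "b": []}
```

In the second example, vertex 10 is the only predecessor of b, and its successor 11 is returned as the root. Which of several admissible candidates is returned depends on the iteration order and is immaterial for correctness by Corollary 6.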
The approach sketched above fails in case (C) because there does not seem to be an efficient way to find a root for the DFS tree that is guaranteed not to be an interior vertex or the exit of a (weak) superbubbloid. Sung et al. [7] proposed the construction of a more complex auxiliary DAG H that not only retains the superbubbles of G[C] but also introduces additional ones. Then all weak superbubbles in H(G) are identified and tested whether they also appear in G[C].
Definition 7 (Sung graphs) Let G be a strongly connected graph with a DFS tree T with root x. The vertex set V(H) = V′ ∪ V′′ ∪ {a, b} consists of two copies v′ ∈ V′ and v′′ ∈ V′′ of each vertex v ∈ V(G), a source a, and a sink b. The edge set of H comprises four classes of edges: (i) edges (u′, v′) and (u′′, v′′) whenever (u, v) is a forward edge in G w.r.t. T, (ii) edges (u′, v′′) whenever (u, v) is a backward edge in G, (iii) edges (a, v′) whenever (a, v) is an edge in G, and (iv) edges (v′′, b) whenever (v, b) is an edge in G.
The graph H is a connected DAG since a topological sorting on H is obtained by using the reverse postorder of T within each copy of V(G) and placing the first copy entirely before the second. We refer to [7] for further details.
Fig. 3 Construction of Ĝ(C) from G (top). The graph G has two non-trivial SCCs (indicated by the white and orange vertices, resp.). In addition, there are two singleton SCCs (purple vertices) from which G(V̄) is constructed. The middle panel shows the graphs G(C). Each is obtained by adding the artificial source and sink vertices a and b. The artificial source of the second SCC has no incident edge and in the DAG G(V̄) the artificial sink b has no incoming edge. These vertices are not shown since only the connected components containing C or V̄ are of interest. The edges (10, 1), (5, 9) and (6, 9) in G form connections between the SCCs and the DAG, resp. Hence they are replaced by corresponding edges to an artificial source or artificial sink vertex according to Definition 5. The bottom panel shows the graphs Ĝ(C) obtained with the help of DFS searches. The reverse postorder is shown. In the case of the second SCC, the artificial source a is connected to 11 as described in Corollary 6. The back edges (5, 2), (7, 1), (7, 6) and (10, 11) are then replaced with the corresponding edges from a and to b as prescribed by Definition 6. The three graphs have the same superbubbles as G.
The graph H contains two types of weak superbubbloids: those that contain no backward edges w.r.t. T, and those that contain backward edges. Members of the first class do not contain the root of T by Lemma 9 and hence are also superbubbles in G. Every weak superbubble of this type is present (and will be detected) in both V′ and V″. A weak superbubble with a backward edge has a "front part" in V′ and a "back part" in V″ and appears exactly once in H. The vertex sets V′ and V″ are disjoint. It is possible that H contains superbubbles that have duplicated vertices, i.e., vertices v′ and v″ deriving from the same vertex in V. These candidates are removed, together with one of the two copies of each superbubble that appears in both V′ and V″. We refer to this filtering step as Sung filtering since it was proposed in [7].
This construction is correct in case (C) if there are no other edges connecting G[C] within G. The additional connections to a and b, introduced to account for edges that connect G[C] to other vertices in G, may fail. To see this, consider an interior vertex v′ in a superbubble ⟨s, t⟩ with a backward edge. It is possible that its original has an external out-edge and thus v′ should be connected to b. This is not accounted for in the construction of H, which requires that V′ is connected to a only, and V″ is connected to b only. These "missing" edges may introduce false positive superbubbles as shown in Fig. 1. This is not a dramatic problem because it is easy to identify the false positives: it suffices to check whether there is an edge (x, w) or (w, y) with w ∉ U_st, x ∈ U_st \ {t} and y ∈ U_st \ {s}. Clearly, this can be achieved in linear total time for all superbubble candidates U_st, providing an easy completion for the algorithm of Sung et al. [7]. Our alternative construction eliminates the need for this additional filtering step.

Proof The correctness of Algorithm 1 is an immediate consequence of the discussion above. Let us briefly consider its running time. The strongly connected components of G can be computed in linear, i.e., O(|V| + |E|), time [14,17,18]. The cycle-free part G[V] as well as its connected components [19] are also obtained in linear time. The directed DFS (to construct T) and the undirected DFS (to construct π in a DAG) also require only linear time [14,15], as does the classification of forward and backward edges. The construction of the auxiliary DAGs Ĝ(C) and H(C) and the determination of the root for the DFS searches are then also linear in time. Since the vertex sets considered in the auxiliary DAGs are disjoint in G, we conclude that the superbubbles can be identified in linear time in an arbitrary digraph if the problem can be solved in linear time in a DAG.

Algorithm 1 Top level organization of the computation of superbubbles in a digraph
The algorithm of Brankovic et al. [8] shows that this is indeed the case.

Corollary 7 The (weak) superbubbles in a digraph G(V, E) can be identified in O(|V | + |E|) time using Algorithm 1.
In the following section we give a somewhat different account of a linear-time algorithm for superbubble finding that may be more straightforward than the approach in [8], which relies heavily on range queries. An example graph as well as the different auxiliary graphs are shown in Fig. 4.
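The post-filter for false positives of the Sung construction, described earlier in this section, tests each candidate vertex set U_st for an edge that leaves it from a vertex other than t or enters it at a vertex other than s. A minimal Python sketch of that test follows. The function name `has_external_edge` and the representation of the original graph G as successor and predecessor dictionaries are illustrative assumptions, not part of LSD or of [7].

```python
def has_external_edge(succ, pred, members, s, t):
    """Return True if the candidate <s, t> with vertex set `members`
    is a false positive, i.e. some edge (x, w) or (w, y) exists with
    w outside `members`, x in members \\ {t}, y in members \\ {s}.

    succ / pred: dicts mapping a vertex of the ORIGINAL graph G to an
    iterable of its successors / predecessors (assumed representation).
    """
    members = set(members)
    # Edges leaving the candidate from any vertex except the exit t.
    for x in members - {t}:
        if any(w not in members for w in succ.get(x, ())):
            return True
    # Edges entering the candidate at any vertex except the entrance s.
    for y in members - {s}:
        if any(w not in members for w in pred.get(y, ())):
            return True
    return False


# Usage: a diamond 1 -> {2, 3} -> 4 embedded in a larger graph.
succ = {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
pred = {0: [], 1: [0], 2: [1], 3: [1], 4: [2, 3], 5: [4]}
clean = has_external_edge(succ, pred, {1, 2, 3, 4}, 1, 4)

# An extra out-edge from the interior vertex 2 invalidates the candidate.
succ[2].append(9)
pred[9] = [2]
broken = has_external_edge(succ, pred, {1, 2, 3, 4}, 1, 4)
```

Each call inspects only the edges incident to the candidate's vertices, which is what makes a linear total cost plausible when the candidates share few vertices.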

Detecting superbubbles in a DAG
The identification of (weak) superbubbles is drastically simplified in DAGs since acyclicity, i.e., (S3), and thus (S.v), can be taken for granted. In particular, therefore, every weak superbubbloid is a superbubbloid. A key result of [8] is the fact that there are vertex orders for DAGs in which all superbubbles appear as intervals. The proof of Proposition 2 does not make use of the minimality condition; hence we can state the result here more generally for superbubbloids and arbitrary DFS trees on G:
(i) Every interior vertex u of ⟨s, t⟩ satisfies π(s) < π(u) < π(t).
(ii) If w ∉ ⟨s, t⟩ then either π(w) < π(s) or π(t) < π(w).
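The vertex order π used here is the reverse postorder of a DFS tree. As a minimal sketch, the following Python function computes such an order with an iterative DFS, assuming a DAG given as a successor dictionary in which every vertex is reachable from a single source. The name `reverse_postorder` and the input representation are illustrative assumptions.

```python
def reverse_postorder(succ, root):
    """Return the vertices reachable from `root` in reverse DFS postorder.

    succ: dict mapping each vertex to a list of its successors.
    The position of a vertex in the returned list plays the role of pi.
    """
    visited = {root}
    postorder = []
    # Each stack entry keeps the vertex and an iterator over its successors,
    # so the scan of a vertex resumes where it stopped.
    stack = [(root, iter(succ.get(root, ())))]
    while stack:
        v, successors = stack[-1]
        for w in successors:
            if w not in visited:
                visited.add(w)
                stack.append((w, iter(succ.get(w, ()))))
                break
        else:
            # All successors of v are finished: v gets its postorder number.
            postorder.append(v)
            stack.pop()
    return postorder[::-1]


# Usage: the diamond 1 -> {2, 3} -> 4 is a superbubble <1, 4>.
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
order = reverse_postorder(succ, 1)   # [1, 3, 2, 4]
pi = {v: i for i, v in enumerate(order)}
```

In this example the interior vertices 2 and 3 receive π-values strictly between those of the entrance 1 and the exit 4, in line with property (i).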
The following two functions were also introduced in [8]: […] We slightly modify the definition here to assign values also to the sink and source vertices of the DAG G. The functions return the predecessor and successor of v that is furthest away from v in terms of the DFS order π. It is convenient to extend this definition to intervals by setting […]

A main result of this contribution is that superbubbles are characterized completely by these two functions, resulting in an alternative linear-time algorithm for recognizing superbubbles in DAGs that also admits a simple proof of correctness. To this end we will need a few simple properties of the OutParent and OutChild functions for intervals. First we observe that [k, l] ⊆ [i, j] implies the inequalities […] A key observation for our purposes is the following

Proof
(i) By definition π⁻¹(j − 1) has at least one successor. On the other hand, all successors of π⁻¹(j − 1) come after j − 1 and are, by definition, not later than j. Hence π⁻¹(j) is uniquely defined.
(ii) We proceed by induction w.r.t. the length of the interval [i, j − 1]. If i = j − 1, i.e., a single vertex, the assertion (ii) is obviously true. Now assume that the assertion is true for [i + 1, j]. By definition of OutChild, i has a successor in [i + 1, j], from which π⁻¹(j) is reachable.
(iii) Again, we proceed by induction. The assertion holds trivially for single vertices. Assume that the assertion is true for [i + 1, j]. By definition of OutChild, every successor u of π⁻¹(i) is contained in π⁻¹([i + 1, j]). By induction hypothesis, every path from u to a vertex w ∉ π⁻¹([i − 1, j − 1]) contains π⁻¹(j), and also all paths from π⁻¹(i) to …

It is important to notice that Lemma 13 depends crucially on the fact that π, by construction, is a reverse postorder of a DFS tree. It does not generalize to arbitrary topological sortings.
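The OutParent and OutChild functions used in Lemma 13 can be illustrated with a short Python sketch. It assumes, following the verbal description in the text, that OutParent(v) is the smallest π-value among the predecessors of v and OutChild(v) the largest π-value among its successors, and that the interval versions take the minimum and maximum over the interval. The sentinel values −1 (no predecessor) and ∞ (no successor) as well as the names `out_arrays`, `out_parent_interval`, and `out_child_interval` are assumptions made for this illustration.

```python
import math


def out_arrays(order, succ):
    """Compute OutParent and OutChild as lists indexed by position in `order`.

    order: vertices in reverse DFS postorder (position = pi value, 0-based).
    succ:  dict mapping a vertex to an iterable of its successors.
    """
    pos = {v: i for i, v in enumerate(order)}
    n = len(order)
    out_parent = [-1] * n        # assumed sentinel: vertex has no predecessor
    out_child = [math.inf] * n   # assumed sentinel: vertex has no successor
    for u in order:
        for w in succ.get(u, ()):
            i, j = pos[u], pos[w]
            out_parent[j] = i if out_parent[j] == -1 else min(out_parent[j], i)
            out_child[i] = j if out_child[i] == math.inf else max(out_child[i], j)
    return out_parent, out_child


def out_parent_interval(out_parent, i, j):
    """OutParent of the interval [i, j]: the minimum over its vertices."""
    return min(out_parent[i:j + 1])


def out_child_interval(out_child, i, j):
    """OutChild of the interval [i, j]: the maximum over its vertices."""
    return max(out_child[i:j + 1])


# Usage on the diamond 1 -> {2, 3} -> 4 with order [1, 3, 2, 4].
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
out_parent, out_child = out_arrays([1, 3, 2, 4], succ)
```

With these conventions, enlarging an interval can only lower its OutParent value and raise its OutChild value.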
Replacing successor by predecessor in the proof of Lemma 13 we obtain […]

Let us now return to the superbubbloids. We first need two simple properties of the OutParent and OutChild functions for individual vertices:
We now proceed to show that the possible superbubbloids and superbubbles can be found efficiently, i.e., in linear time, using only the reverse postorder of the DFS tree and the corresponding functions OutChild and OutParent. As an immediate consequence of (F2) and Lemma 13, we have the following necessary condition for exits:
We now use the minimality condition of Definition 2 to identify the superbubbles among the superbubbloids.

Lemma 16
If t is the exit of a superbubbloid, then t is also the exit of a superbubble ⟨s, t⟩ whose entrance s is the vertex with the largest value of π(s) < π(t) such that (F1) and (F2) are satisfied.
Proof This is an immediate consequence of Lemma 5, which shows that the intersection ⟨s, t⟩ ∩ ⟨s′, t′⟩ would be a superbubbloid, contradicting minimality of ⟨s, t⟩.
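Lemma 16 can be read as a search rule: for a given exit, walk backwards in π until the first position that satisfies both conditions. The sketch below implements this rule directly for a single exit, so it takes time linear in the distance walked and is quadratic when applied to every exit; it is not the linear-time scan developed in this section. It assumes that (F1) reads OutParent([π(s) + 1, π(t)]) = π(s) and (F2) reads OutChild([π(s), π(t) − 1]) = π(t), that positions are 0-based, and that OutChild is ∞ for vertices without successors. The name `nearest_entrance` is hypothetical.

```python
import math


def nearest_entrance(out_parent, out_child, j):
    """Return the position of the nearest entrance for exit position j,
    or None if no position satisfies (F1) and (F2).

    out_parent / out_child: lists indexed by position in the reverse DFS
    postorder (assumed conventions: -1 for sources, math.inf for sinks).
    """
    min_parent = out_parent[j]          # OutParent of the interval [i + 1, j]
    for i in range(j - 1, -1, -1):
        if out_child[i] > j:
            # (F2) fails for this i and for every smaller i as well.
            return None
        if min_parent == i:
            # (F1) holds; (F2) holds because no OutChild value exceeded j.
            return i
        min_parent = min(min_parent, out_parent[i])
    return None


# Usage: arrays of the diamond 1 -> {2, 3} -> 4 in the order [1, 3, 2, 4].
out_parent = [-1, 0, 0, 1]
out_child = [2, 3, 3, math.inf]
entrance_of_last = nearest_entrance(out_parent, out_child, 3)   # 0
```

Only the last position has an entrance in this example; for the two interior positions the walk stops immediately because the preceding vertex has a successor beyond them.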

Lemma 18
Let ⟨s, t⟩ be a superbubble and suppose t′ is an interior vertex of ⟨s, t⟩. Then there is a vertex v with π(s) ≤ π(v) < π(t′) such that OutChild(v) > π(t′).
Taken together, these observations suggest organizing the search by scanning the vertex set for candidate exit vertices t in reverse order. For every such t, one then searches for the corresponding entrance s such that the pair s, t fulfills (F1) and (F2). Using eq. (3), one can test (F2) independently for each v by checking whether OutChild(v) ≤ π(t). Checking (F1) requires that the interval [π(s) + 1, π(t)] is considered. The value of its OutParent function can be obtained incrementally as the minimum of OutParent(v) and the OutParent interval of the previous step: OutParent([π(v), π(t)]) = min(OutParent(v), OutParent([π(v) + 1, π(t)])). By Lemma 16, the nearest entrance s to the exit t completes the superbubble. The tricky part is to identify all superbubbles in a single scan. Lemma 17 ensures that no valid entrance can be found for exit t′ if a vertex v with OutChild(v) > π(t′) is encountered. In this case t′ can be discarded. Lemma 18 ensures that a false exit candidate t′ within a superbubble candidate ⟨s, t⟩ cannot "mask" the entrance s belonging to t, i.e., there is necessarily a vertex v satisfying OutChild(v) > π(t′) with π(s) < π(v).
It is natural therefore to use a stack S to hold the exit candidates. Since the OutParent interval explicitly refers to an exit candidate t, it must be re-initialized whenever a superbubble is completed or the candidate exit is rejected. More precisely, the OutParent interval of the previous exit candidate t must be updated. This is achieved by computing
Require: DAG G(V, E) with reverse DFS postorder π
empty stack S
empty map outParentMap
empty exit t
for k = n..
Algorithm 2 presents this idea in a more formal way.
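The stack-based scan described above can be sketched in Python as follows. This is a reconstruction from the prose of this section, not a transcription of Algorithm 2, and it rests on several assumptions: positions in π are 0-based, the DAG has a single source from which all vertices are reachable, OutParent is −1 for the source and OutChild is ∞ for sinks, and the names `out_arrays`, `superbubbles_in_dag`, and `drop` are hypothetical. Note that under these conventions a single edge (s, t) with no other edge leaving s or entering t is reported as a superbubble as well.

```python
import math


def out_arrays(order, succ):
    """OutParent / OutChild as lists indexed by position in `order`."""
    pos = {v: i for i, v in enumerate(order)}
    out_parent = [-1] * len(order)        # assumed sentinel for sources
    out_child = [math.inf] * len(order)   # assumed sentinel for sinks
    for u in order:
        for w in succ.get(u, ()):
            i, j = pos[u], pos[w]
            out_parent[j] = i if out_parent[j] == -1 else min(out_parent[j], i)
            out_child[i] = j if out_child[i] == math.inf else max(out_child[i], j)
    return out_parent, out_child


def superbubbles_in_dag(order, succ):
    """Report (entrance, exit) pairs in one backward scan over `order`,
    which must be a reverse DFS postorder of the DAG `succ`."""
    out_parent, out_child = out_arrays(order, succ)
    stack = []     # outer exit candidates (positions), innermost on top
    t = None       # current exit candidate (position)
    op = {}        # op[t]: OutParent of the scanned interval ending at t
    found = []

    def drop(cand):
        """Give up `cand`; merge its interval into the next outer candidate."""
        if not stack:
            return None
        outer = stack.pop()
        op[outer] = min(op[outer], op[cand])
        return outer

    for k in range(len(order) - 1, -1, -1):
        child = out_child[k]
        if child == k + 1:
            # The vertex at k + 1 becomes a new exit candidate.
            if t is not None:
                stack.append(t)
            t = k + 1
            op[t] = out_parent[t]
        else:
            # A successor beyond t rules out t as an exit (cf. Lemma 17).
            while t is not None and t < child:
                t = drop(t)
        if t is not None and op[t] == k:
            # (F1) holds for the pair (k, t); (F2) holds since t survived.
            found.append((order[k], order[t]))
            t = drop(t)
        if t is not None:
            op[t] = min(op[t], out_parent[k])
    return found


# Usage: a diamond 2 -> {3, 4} -> 5 nested inside 1 -> ... -> 6.
succ = {1: [6, 2], 2: [4, 3], 3: [5], 4: [5], 5: [6], 6: []}
nested = superbubbles_in_dag([1, 2, 3, 4, 5, 6], succ)
```

Every position is pushed and dropped at most once, so the scan itself performs a number of stack operations that is linear in the number of vertices; computing the two arrays takes one pass over the edges.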

Lemma 19 Algorithm 2 identifies the superbubbles in a DAG G.
Proof Every reported candidate satisfies (F1) since OutParent([π(s) + 1, π(t)]) = π(s) is used to identify the entrance for the current t. Since OutChild(v) ≤ π(t) is checked for every v ∈ π⁻¹([π(s), π(t) − 1]), (F2) holds due to eq. (3), because by Lemma 13 this is equivalent to testing the interval. Hence every reported candidate is a superbubbloid. By Lemma 16, ⟨s, t⟩ is minimal and thus a superbubble. Lemma 18 ensures that the corresponding entrance is identified for every valid exit t, i.e., that all false candidate exits are rejected before the next valid entrance is encountered.

Proof Given the reverse DFS postorder π, the for loop processes every vertex exactly once. All computations except OutChild(v), OutParent(v), and the while loop take constant time. This explicitly includes the calculation of the minimum of two integer values that are needed to …

Implementation
Algorithms 1 and 2 were implemented in Python and are available as Linear Superbubble Detector, LSD for short. LSD can be installed with pip. 1 The source is available on GitHub. 2 It is intended as a reference implementation emphasizing easy understanding rather than as a performance-optimized production tool. The underlying graph structures make use of NetworkX [20], which has the benefit that many input formats can be parsed easily.
To our knowledge, SUPBUB 3 [8] is the only other publicly available implementation of a superbubble detector. Unfortunately, it has some bugs, e.g., in the handling of successors in the DFS tree, which lead to problems with superbubbles that contain a backward edge. An analysis of the code shows, furthermore, that the construction of the auxiliary graphs strictly follows [7]. Hence it cannot serve as a reference implementation.
In order to compare our approach to the state-of-the-art algorithm we re-implemented the workflow of Sung et al. [7] and Brankovic et al. [8] using the same Python libraries. This allows a direct comparison that focusses on the algorithms rather than the differences between programming languages and compilers. The workflow can be subdivided into two separate tasks: (1) the construction of the DAGs, and (2) the recognition of superbubbles within the DAG. For the first task, we compare our approach and the algorithm of Sung et al. [7] augmented by a simple linear-time filter to detect the false positives. For the second task, we compare our stack-based approach with the range-query method of Brankovic et al. [8]. Table 1 summarizes the empirical results for test data of different sizes taken from our recent work on supergenome coordinatization and the Stanford Large Network Dataset Collection [21]. Although the running times are comparable, we find that LSD consistently performs better than the alternative for both tasks. The combined improvement of LSD is at least a factor of 2 in the examples tested here. All results and methods are available in the git repository. 4

Conclusion
We have re-investigated the mathematical properties of superbubbles and their obvious generalization, the weak superbubbloids. We not only re-derive foundational …

Table 1 Comparison of running times. The four combinations of algorithms compared here are: LSD (using the auxiliary graphs Ĝ(C) and the stack-based superbubble detector), S+LSD using Sung graphs with our stack-based detector plus a post-filter for the false positives, LSD+B using our graph construction with the range-query-based detector of [8], and S+B using the re-implementation of the state-of-the-art method with the post-filter. All computations were performed on a 2.5 GHz quad-core Intel Core i7 processor (Turbo Boost up to 3.7 GHz) with 6 MB shared L3 cache and 16 GB of 1600 MHz DDR3L onboard memory. Test data sets are taken from [4] and from the Stanford Large Network Dataset Collection [21]. The