Consistent orientation of long reads
For perfect data it is possible to consistently determine the reading direction of each read relative to the genome from which it derives. This is not necessarily the case in real-life data. The relative orientation of two reads is implicitly determined by the relative orientation of overlapping reads, i.e., by the signature \(\theta (r_1,r_2)\) of the edge \(r_1r_2\in E(G)\). To formalize this idea we consider a subset \(D\subseteq E(G)\) and define the orientation of D as \(\theta (D):=\prod _{r_1r_2\in D} \theta (r_1,r_2)\). For a disjoint union of two edge sets D and \(D'\) we therefore have
and, more generally, their symmetric different \(D\oplus D'\) satisfies \(\theta (D\oplus D')=\theta (D)\theta (D')\) since the edges in \(D\cap D'\) appear twice in \(\theta (D)\theta (D')\) and thus each of these edges contributes a factor \((\pm 1)^2=1\).
Definition 2
Two vertices \(r_1,r_2\in V(G)\) are orientable if \(\theta (P)=\theta (P')\) holds for any two paths P and \(P'\) connecting \(r_1\) and \(r_2\) in G. We say that G is orientable if all pairs of vertices in G are orientable.
Lemma 3
G is orientable if and only if every cycle C in G satisfies \(\theta (C)=1\).
Proof
Let \(r,r'\) be two vertices of G and write \({\mathscr {C}}(r,r')\) for the set of all cycles that contain r and \(r'\). If \(r=r'\) or \({\mathscr {C}}(r,r')=\emptyset \), then r and \(r'\) are orientable by definition. Now assume \(r\ne r'\), \({\mathscr {C}}(r,r')\ne \emptyset \), and consider a cycle \(C\in {\mathscr {C}}(r,r')\). Clearly, C can be split into two edge-disjoint path \(C_1\) and \(C_2\) both of which connect r and \(r'\). If r and \(r'\) are orientable, then \(\theta (C_1)=\theta (C_2)\) and thus \(\theta (C)=\theta (C_1)\theta (C_2)=1\). If r and \(r'\) are not orientable, then there is a pair of paths \(P_1\) and \(P_2\) connecting r and \(r'\) such that \(\theta (P_1)=-\theta (P_2)\). Since
is an edge-disjoint union of cycles \(C_i\) we have \(-1 = \theta (P_1)\theta (P_2)=\prod _{i=1}^{k} \theta (C_i)\) and thus there is least one cycle \(C_i\) with \(\theta (C_i)=-1\) in G. \(\square \)
The practical importance of Lemma 3 is the implication that only a small set of cycles needs to be considered, because every cycle in G can be obtained as an \(\oplus \)-sum of cycles in a cycle basis [34, 35]. Every graph G with c connected components has a cycle basis comprising \(|E|-|V|-c\) cycles. Particular cycles bases, known as Kirchhoff bases, are obtained from a spanning tree T of G as the set \({\mathscr {B}}\) of cycles \(C_e\) consisting of the edge \(e\in E\setminus T\) and the unique path in T connecting the endpoints of e [36]. Every cycle C of G can then be written as \(C = \bigoplus _{e\in C\setminus T} C_e\), see e.g. [35].
Theorem 4
Let G be a graph with signature \(\theta :E(G)\rightarrow \{-1,1\}\) on its edges, and let \({\mathscr {B}}\) be a cycle basis of G. Then G is orientable if and only if \(\theta (C)=1\) for all \(C\in {\mathscr {B}}\).
Proof
The theorem follows from Lemma 3 and the fact that every cycle C in G can be written as an \(\oplus \)-sum of basis cycles, i.e., \(\theta (C)=1\) for every cycle in C if and only if \(\theta (C')=1\) for every basis cycle \(C'\in {\mathscr {B}}\). \(\square \)
Thm. 4 suggests the following, conservative heuristic to extract an orientable subgraph from G:
- (1):
-
Construct a maximum weight spanning tree \(T_G\) of G by using the \(\Omega \)-scores as edge weights. Tree \(T_G\) can easily be obtained using, e.g., Kruskal’s algorithm [37].
- (2):
-
Construct a Kirchhoff cycle basis \({\mathscr {B}}\) from \(T_G\).
- (3):
-
For every cycle \(C\in {\mathscr {B}}\), check whether \(\theta (C)= -1\). If so, find the \(\Omega \)-minimum weighted edge \({\hat{e}} \in C\) and remove it from E(G) and (possibly) from \(T_G\) if \({\hat{e}}\in E(T_G)\).
We delete the offending edge because it is very unlikely that the preprocessing correctly identified that two long reads overlap but failed to determine the correct relative orientation. The edge deletion is simplified by the following observation:
Lemma 5
If T is maximal \(\Omega \)-weight spanning tree of T and end e is a non-tree edge, then \(\Omega (e)=\min _{e'\in C_e} \Omega (e')\).
Proof
Let \(e'\in C_e\setminus \{e\}\) by a tree edge in the cycle \(C_e\). Then \(T'=T\setminus \{e'\}\cup \{e\}\) is again a spanning tree of G since the vertex set \(V(C_e)\) is still connected and \(T'\) contains not cycle. Its weight it weight is \(\Omega (T')=\Omega (T)-\Omega (e')+\Omega (e)\le \Omega (T)\), since T is a maximum weight spanning tree by assumption. Thus \(\Omega (e)\le \Omega (e')\), i.e., e has minimum \(\Omega \)-weight. \(\square \)
As a consequence, the minimum weight edge of an offending cycle is always the non-tree edge. Step (3) above therefore reduces to finding the basis edges \({\hat{e}}\) with negative signature cycles \(C_{{\hat{e}}}\) and to remove these edges. The removal of \({\hat{e}}\) leaves \(T_G\) unchanged and thus does not affect the contiguity of the assembly. The end result of the procedure outlined above is therefore a connected subgraph \(G'\) and a spanning forest \(T_{G'}=T_G\) for \(G'\).
Lemma 6
Let G be an undirected connected graph with signature \(\theta \) and let \(G'\) be the residual graph produced by the heuristic steps (1)-(3). Then (i) \(G'\) is connected, (ii) \(G'\) is orientable, and (iii) \(T_G\) is an \(\Omega \)-maximal spanning tree of \(G'\).
Proof
(i) By Lemma 5, \(T_G\subseteq E(G')\), hence \(T_G\) is a spanning tree of \(G'\) and thus \(G'\) is connected. (ii) Since the heuristic removes all non-tree edges e with \(\theta (C_e)=-1\), Thm. 4 implies that \(G'\) is orientable. (iii) Since \(T_G\subseteq E(G')\), Kruskal’s maximum weight spanning tree algorithm will pick the same spanning tree edges again from \(E(G')\), and \(T_G\) is an \(\Omega \)-maximal spanning tree. \(\square \)
In order to expedite the identification of edges that violate orientability in G, we define an orientation \(\varphi \) for the vertices of G, i.e., the long reads. To this end, we pick an arbitrary \(r_{*}\in V(G)\) as reference and set \(\varphi (r_{*}):=+1\). Denote by \(P_T(r)\) the unique path from \(r^*\) to r and define \(\varphi (r) := \theta (P_T(r))\).
Lemma 7
If G is a connected orientable graph with signature \(\theta \), then the vertex orientation \(\varphi \) with reference \(\varphi (r_{*}):=+1\) is independent of the choice of the spanning tree T.
Proof
Let P be an arbitrary path connecting r and \(r^*\). By connectedness, such a path exists and since G is orientable w.r.t. \(\theta \) we have \(\theta (P)=\theta (P_T)\). Furthermore r and \(r^*\) are connected by the backbone of any spanning tree of T, \(\varphi \) is independent of the choice of T. \(\square \)
As an immediate consequence we observe:
Corollary 8
If G is an orientable graph with signature \(\theta \) and vertex orientation \(\varphi \), then every pair of adjacent vertices satisfies \(\varphi (r')\varphi (r'')=\theta (r'r'')\).
It follows that the heuristic to extract an orientable subgraph can be implemented in linear time:
- (1):
-
An \(\Omega \)-maximal spanning tree \(T_G\) is obtained in \({\mathcal {O}}(|V|+|E|)\) time using Kruskal’s algorithm.
- (2):
-
The vertex orientation \(\varphi \) is computed by traversal of the spanning tree \(T_G\) in \({\mathcal {O}}(|V|)\) time.
- (3):
-
For each \(e\in E\setminus T_G\), one checks in constant time whether \(\varphi (r')\varphi (r'')\ne \theta (r'r'')\) and if so deletes the edge \(r'r''\). The total effort is therefore \({\mathcal {O}}(|E|)\).
We remark that one could now define a graph \({\tilde{G}}\), obtained from G by inverting all long-reads r with a negative orientation \(\varphi (r)=-1\). This amounts to replacing each long read r by its reverse complement. Since processing of the overlap graph does not explicitly consider the sequence information, it would be sufficient to replace the coordinates [p, q] of a match interval by \([\ell -q+1,\ell -p+1]\) and to invert the directionality of extension by another long read. The bit scores of matches, of course, remain unchanged. In \({\tilde{G}}\) all edge signatures are \({{\tilde{\theta }}}(e)=+1\). It is not necessary, however, to construct \({\tilde{G}}\) explicitly. Instead, we simply keep track of the vertex orientations \(\varphi (r)\).
From here on, we again write G for the orientable graph \(G'\).
Reduction to a directed acyclic graph
We next make use of the direction of extension of long read \(r_1\) and \(r_2\) defined by the mutual overhangs in the case that \(r_1r_2\) is an edge in G. We write
for the directed version of a connected component G of the residual graph \(G'\) constructed above. For each edge \(r_1r_2\in E(G)\) we create the corresponding edge \(e \in E\)(
) as
$$\begin{aligned} e:= {\left\{ \begin{array}{ll} r_1r_2 &{} \text {if }\varphi (r_1 ) = +1 \text { and } r_1 \rightarrow r_2 \text { or }\\ &{} \varphi (r_1 ) = -1 \text { and } r_1 \leftarrow r_2; \\ r_2r_1 &{} \text {if }\varphi (r_1 ) = +1 \text { and } r_1 \leftarrow r_2 \text { or }\\ &{} \varphi (r_1 ) = -1 \text { and } r_1 \rightarrow r_2. \\ \end{array}\right. } \end{aligned}$$
(3)
Suppose the data used to construct
are free of repetitive sequences and contain no false-positive overlaps. In such perfect data,
is a directed interval graph. Since we have contracted edges corresponding to nested reads (i.e., intervals),
is in fact a proper interval graph or indifference graph [38]. In addition
is directed in a manner consistent with the ordering of the intervals. More precisely, there is an ordering \(\prec \) of the vertices (long reads) that satisfies the umbrella property [39]: \(r_1\prec r_2\prec r_3\) and \(r_1r_3\in E\) (
) implies \(r_1r_2,r_2r_3\in E\)(
). We can interpret \(r_1 \prec r_2\) to mean that \(r_1\) extends \(r_2\) to the left, i.e., towards smaller coordinate values in the final assembly. A “normal interval representation” and a linear order \(\prec \) of the reads can be computed in \({\mathcal {O}}(|{\mathcal {R}}|)\) time [40, 41] for proper interval graphs.
Due to the noise in the data, however, we have to expect that the original overlap graph only approximates a proper interval graph. On the other hand, we have already obtained an orientation of the edges that – in ideal data – would be consistent with interval order. We therefore consider necessary conditions for directed indifference graphs and set out to enforce them.
First, we observe that
should be acyclic. Orientability w.r.t. a signature \(\theta \), does not guarantee acyclicity since
still may contain some spurious “back-linking” edges due to unrecognized repetitive elements. The obvious remedy is to remove a \(\Omega \)-minimal set of directed edges. This amounts to solving an feedback arc set problem, which is known to be NP-complete in both weighted and unweighted versions, see [42] for a recent overview. We therefore resort to a heuristic that makes use of our expectations on the structure of
: In general we expect multiple overlaps of correctly placed reads, i.e., r is expected to have several incoming edges from its predecessors and several outgoing edges exclusively to a small set of succeeding reads. In contrast, we expect incorrect edges to appear largely in isolation. This suggests to adapt Khan’s topological sorting algorithm [43]. In its original version, it identifies a source u, i.e., a vertex with in-degree 0, appends it to the list W of ordered vertices, and then deletes all its out-edges. It stops with “fail” when no source can be found before the sorting is complete, i.e., W does not contain all vertices of the given graph, indicating that a cycle has been encountered. We modify this procedure in two ways:
First, if multiple sources are available in a given step, we always pick the one with largest total \(\Omega \)-weight of edges incoming from the sorted set W. As a consequence, incomparable paths in
are sorted contiguously, i.e., a new path is initiated only after the previous one cannot be continued any further. Note that keeping track of the total input weight from W does not alter the \({\mathcal {O}}(|V|+|E|)\) running time of the Kahn’s algorithm.
Second, we replace the “fail” state by a heuristic to find an “almost source” to continue the sorting. Denote by \(N_+(W)\) the out-neighborhood of the set W that has been sorted so far and consider the set \(K := N_+(W)\setminus W\) the not-yet-sorted out-neighbors of W. These are the natural candidates for the next source. For each \(u\in K\) we distinguish incoming edges xu from \(x\in W\), \(x\in K\), and \(x\in V\setminus (W\cup K)\) and consider two cases:
- (1):
-
There is a \(u\in K\) without an in-edge xu from some other \(x\in K\). Then we choose among all vertices of this type the vertex \({\hat{u}}\) with the largest total \(\Omega \)-weight incoming from W because \({\hat{u}}\) then overlaps with most of the previously sorted reads.
- (2):
-
If for each \(u\in K\) there is an in-edge xu from some other \(x\in K\), then the candidate set K forms a strongly connected digraph. In this case we choose the candidate \({\hat{u}}\in K\) with the largest difference of \(\Omega \)-weights incoming from W and K, i.e., \(\hat{u}:=\text{arg\,max}_{u\in K}\sum _{w\in W}\Omega (w,u) - \sum _{k\in K\setminus \{u\}}\Omega (k,u)\).
In either case, we add the edges incoming from \(V\setminus W\) into \({\hat{u}}\) to the set F of edges that violate the topological order. It is clear from the construction that (i) F remains empty if
is a DAG since in this case a source is available in each step, and (ii) the graph
obtained by from
by deleting the edges in F is acylic since all in-edges to u in
emanate from W, the set of vertices sorted before u, and all out-edges from u point to the as yet unsorted set. Thus F is a feedback arc set for
.
Lemma 9
The modified Kahn algorithm can be implemented to run in \(O(|E|+|V|\log |V|)\) time.
Proof
Our modified Kahn algorithm keeps the not-yet-sorted vertices in a priority queue instead of a simple queue. The priority of a vertex \(u\in V\setminus W\) depends on the number of total \(\Omega \)-scores of the in-edges wu with \(w\in W\cap N^-(u)|\), \(w\in K\cap N^-(u)\), and \(w\in N^{-}(u)\cap V\setminus (W\cup K)\) respectively. Every time a vertex v is added to W, these values have to be updated for the out-neighbors \(u\in N^+(v)\). Each update only comprises of the addition or subtraction of \(\Omega (v,u)\) and the decision whether the second and/or third values are zero, and thus require total time \(O(E(\mathbf {G}))\). Highest priority is given to vertices u with \(N^-(u)\subseteq W\), i.e., true sources, next vertices \(u\in K\) with \(N^-(u)\cap K=\emptyset \), and the last tier is formed by the remaining vertices. Assuming an efficient implementation of the priority queue as a heap, the total effort for its maintenance is O(E) plus \(O(|V|\log |V|)\) for the dequeuing operations, see e.g. [44, 45]. \(\square \)
It is possible that
is not connected. In this case, each connected component can be processed independently in subsequent processing steps. If the feedback set F is disjoint from \(T_G\), then \(T_G\) is still a \(\Omega \)-maximal spanning tree of
. Otherwise, edges in \(F\cap T_G\) need to be replaced. Lemma 5 that the replacement edges have to be drawn from non-tree edges between the vertex sets spanned by the connected components of \(T_G\setminus F\). In principle, this can be done efficiently with specialized data structures for dynamic connectivity queries, in particular if \(F\cap T_G\) is small [46]. However, the effort to run Kruskal’s algorithm again on
is by no means prohibitive, since the update has to be done only once.
Golden paths
For perfect data, the directed proper interval graph
has a single source and a single sink vertex, corresponding to the left-most and right-most long reads \(r'\) and \(r''\), respectively. Furthermore, every directed path connecting \(r'\) and \(r''\) is a golden path, that is, a sequence of overlapping intervals that covers the entire chromosome. Even more stringently, every read \(r\ne r',r''\) has at least one predecessor and at least one successor in
. An undirected graph is a proper interval graph if there is a set of intervals, corresponding to the vertices, such that (i) no interval is properly contained within another, and (ii) two vertices are adjacent iff their intervals intersect. For perfect data, therefore, the overlap graph is a proper interval graph.
Lemma 10
[47] A connected proper interval graph has a unique vertex order (other than the reversal of the order).
The vertex order of a connected proper interval graph is therefore completely determined by fixing the orientation of single edge. In our case, the orientation is fixed by \(r^*\). We choose the ascending vertex order, i.e., \(r_1\prec r_2\) for every directed edge \(r_1r_2\). A proper interval graph with such an orientation of edges is a directed proper interval graph.
For perfect data, therefore,
is directed proper interval graph and thus it would suffice to compute the unique topological sorting of
. For real-life data, however, we cannot expect that even the acyclic graph
is a directed proper interval graph. Ploidy in eukaryotes may constitute a valid exception to this assumption, as differences in chromosomes ideally also cause diverging structures. However, given the high error rate of long reads, low sequence variation can only be differentiated in very high coverage scenarios or with the help of known ancestral relationships [48]; both are explicitly not targeted by LazyB. In practice, ploidy is commonly reduced even when sufficient coverage is available but can be recovered via variant calling [49]. High accuracy short read assemblies originating from different alleles can be expected to match equally well to the same long reads given their low quality. Therefore, also ploidy variation will normally be merged to a single consensus. Accordingly, we did not detect any mayor duplication issues in the human, fly, or yeast.
Our aim now is to approximate the DAG
by a disjoint union of connected directed proper intervals graphs. To gain some intuition for this task, we first consider reductions of directed graphs that expose longest paths.
A transitive reduction
of some directed graph
is a subgraph of
with as few edges as possible such that two vertices x and y are connected by a directed path in
if and only if they are connected by a directed path in H [50, 51]. It is well-known that every directed acyclic graph has a unique transitive reduction [51, Thm. 1]. This property allows us to call an edge e of an acyclic digraph
redundant if \(e\notin E\)(
). Unfortunately, computation of the transitive reduction requires \({\mathcal {O}}(|V|\,|E|)\) time in sparse graphs and \({\mathcal {O}}(|V|^{\omega })\), where \(\omega \approx 2.3729\) is the matrix multiplication constant. This is impractical for our purposes.
As a simpler analog of transitive reduction, we define the triangle reduction
of H as the digraph obtained from
by removing all edges \(uw\in E\) (
) for which there is a vertex v with \(uv,vw\in E\)(
).
Lemma 11
If
is a connected directed proper interval graph then (i)
is a path, and (ii)
=
.
Proof
By Lemma 10,
has a unique topological sorting, i.e., \(\prec \) is a unique total order. Property (ii) now is an immediate consequence of the umbrella property, and (iii) follows from the fact the transitive reduction is a subgraph of the triangle reduction and preserves connectedness. \(\square \)
As an immediate consequence of Lemma 11 we observe that if
is a connected induced subgraph of a directed proper interval graph
, then
is an induced path in the triangle reduction
of
. Of course, Lemma 11 does not imply that the triangle reduction
is a path. It serves as motivation, however, to identify long-read contigs as maximal paths in the triangle reduction
of the directed acycling graph
. Since the topological sorting along any such path is unique, it automatically identifies all redundant non-triangle edges along a path.
We note that it is not necessary to first compute the transitive or triangle reduction if one is only interested in the maximal paths.
Lemma 12
Let
be a directed acyclic graph with triangle reduction
and transitive reduction
. Then P is a maximal path in
if and only if it is a also maximal path in
or
.
Proof
Every maximal path in
connects a source with a sink, since otherwise it could be extended at one the the ends. Now suppose that a longest path P contains an edge \(e=r'r''\) that this not contained in the transitive reduction. By definition, then there is a path \(P_{r'r''}\) of length at least 2 from \(r'\) and \(r''\), and since H is acyclic, no vertex in \(P_{r'r''}\) lies along the path P. Thus \(P'\) obtained from P by replacing e with \(P_{r'r''}\) is again a path, which is strictly longer then P, contradicting the assumption that P was maximal. Thus P is contained
and
. Since
and
is a subgraph of
and P is maximal in
, it is also maximal in
and
. \(\square \)
We note, furthermore, that the modified Kahn algorithm described above has the useful side effect of producing long runs of consecutive vertices \(r_i,r_{i+1},\dots r_{j-1},r_j\). These can be used to effectively reduce the graph
by removing all arcs connecting non-consecutive vertices with any such run.
The longest path terminating in a given vertex x can be computed with \({\mathcal {O}}(|E|)\) effort [52], suggesting that the explicit computation of reductions will not be helpful in practice. It also does not address the issue that the triangle-reduction
differs from a unique golden path by bubbles, tips, and crosslinks, see Fig. 7. Tips and bubbles predominantly are caused by edges that are missing e.g. due to mapping noise between reads that belong to a shared contig region. Remember that ploidy is collapsed to one haplotype due to the high error rates of long reads. Hence, any path through a bubble or superbubble yields essentially the same assembly of the affected region and thus can be chosen arbitrarily whereas tips may prematurely end a contig. Node-disjoint alternative paths within a (super-)bubble [53] start and end in the neighborhood of the original path. Tips either originate or end in neighborhood of the chosen path. As tips themselves may also be subject to mild noise, and crosslinks may occur near the start- or end-sites of the true paths, both are not always easily distinguished. Crosslinks represent connections between two proper contigs by spurious overlaps, caused, e.g., by repetitive elements that have escaped filtering. As crosslinks can occur at any position, a maximal path may not necessarily follow the correct connection and thus may introduce chimeras into the assembly.
We therefore have to expect that solving the longest path problem on
will sometimes follow spurious edges rather than locally more plausible choices since these may lead to overall shorter paths. As a remedy, we therefore aim to resolve the path choices based on local information. More precisely, we measure how well an edge e fits into a local region that forms an induced proper interval graph. Recall that a tournament is an orientation of a complete graph, and is called transitive if and only if it is acyclic [54].
Lemma 13
If
is a directed proper interval graph, then the subgraph
induced by the closed outneighborhood \(N_+(r) := N^+(r)\cup \{r\}\) is a transitive tournament.
Proof
By definition there is an arc from r to every \(u\in N^+(r)\). Furthermore, we already know that
has a unique topological ordering. The umbrella property therefore implies that there is an arc from u to v whenever u preceeds v in the unique topological ordering. Thus
is a transitive tournament. \(\square \)
For ideal data, the out-neighborhoods
form transitive tournaments, and their triangle reductions form induced subpath of
. In fact, collecting results from the literature, it can be shown that is also necessary:
Theorem 14
A connected directed graph H is a directed proper interval graph if and only if the out-neighorhood \(N^+(r)\) is complete (and hence a transitive tournament) for every \(r\in V\) and forms an interval in the (unique) vertex order.
Proof
The equivalence of proper interval graphs and so-called closed graphs is shown in [55]. By definition, H is closed if it has so-called closed vertex ordering equivalent to the umbrella property [55]. Prop.2.2 in [56] states that a vertex ordering \(\prec \) is closed if and only if the out-neighborhood is complete and forms an interval w.r.t. \(\prec \). Together with the forward-orientation of the edges of H w.r.t. \(\prec \), this in particular implies that \(N^+(r)\) is transitive tournament. \(\square \)
An analogous result holds for the in-neighbors. Equivalently, proper interval graphs are also characterized by the fact that they admit a straight vertex order in which the in-neighbors of r, r itself, and then the out-neighbors of r appear consecutively [47].
For real data the subgraph
induced by the out-neighbors of r will in general violate the transitive tournament property. The problem of finding the maximum transitive tournament in an acyclic graph is NP-hard [57]. An approximation can be obtained, however, using the fact that a transitive tournament has a unique directed Hamiltonian path. Thus candidates for transitive tournaments in
can be retrieved efficiently as the maximal path \(P_{rq}\) in
that connects r with an endpoint q, i.e., a vertex without an outgoing edge within
. Using the topological order of
, the maximal paths \(P_{rq}\) can be traced back in \({\mathcal {O}}(|N_+(r)|)\) time for each endpoint \(P_{rq}\).
For \(P_{rq}\) we compute the number
. The subgraph
with the largest value of \(h_{rq}\) serves as approximation for the maximal transitive tournament with r as its top element. Its edge set
is used to define the interval support of an edge
as
Here, d(r, e) is the minimal number of edges in the unique path from r to e in the path formed by the edges in \(H_r\). The interval support can be interpreted as the number of triangles that support e as lying within an induced proper interval graph. Importantly, it suffices to compute \(\nu (e)\) for
. The idea is now to choose, at every vertex r with more than one successor or precedssor in
the edges in \(N^+(r)\) and \(N^+(r)\) that have the maximal interval support. We observed empirically that determining the best path by greedily optimizing \(\nu (e)\) at branch points results in contigs with a better solution quality compared to optimizing the weight \(\Omega (e)\) of the spanning tree edges of \(T_G\). Taken together, we arrive at the following heuristic to iteratively extract meaningful paths:
- (i):
-
Find a maximal path \({\mathbf {p}} = r_1,\ldots , r_n\) in
such that at every junction, we choose the incoming and outgoing edges e with maximal interval support \(\nu (e)\).
- (ii):
-
Add the path \({\mathbf {p}}\) to the contig set if it is at least two nodes long and neither the in-neighborhood \(N_-(r_1)\) nor the out-neighborhood \(N_+(r_n)\) is already marked as “visited” in
. Otherwise, we have found a tip if one of \(N_-(r_1)\) or \(N_+(r_n)\) was visited before and a bubble if both were visited. Such paths are assumed to have arisen from more complex crosslinks and can be added to the contig set if they exceed a user-defined minimum length.
- (iii):
-
The path \({\mathbf {p}}\) is marked “visited” in
and all corresponding nodes and edges are deleted from
.
- (iv):
-
The procedure terminates when
is empty.
As the result, we obtain a set of paths, each defining a contig.