Heuristic shortest hyperpaths in cell signaling hypergraphs
Algorithms for Molecular Biology volume 17, Article number: 12 (2022)
Abstract
Background
Cell signaling pathways, which are a series of reactions that start at receptors and end at transcription factors, are basic to systems biology. Properly modeling the reactions in such pathways requires directed hypergraphs, where an edge is now directed between two sets of vertices. Inferring a pathway by the most parsimonious series of reactions corresponds to finding a shortest hyperpath in a directed hypergraph, which is NP-complete. The current state of the art for shortest hyperpaths in cell signaling hypergraphs solves a mixed-integer linear program to find an optimal hyperpath that is restricted to be acyclic, and offers no efficiency guarantees.
Results
We present, for the first time, a heuristic for general shortest hyperpaths that properly handles cycles, and is guaranteed to be efficient. We show the heuristic finds provably optimal hyperpaths for the class of singleton-tail hypergraphs, and also give a practical algorithm for tractably generating all source-sink hyperpaths. The accuracy of the heuristic is demonstrated through comprehensive experiments on all source-sink instances from the standard NCI-PID and Reactome pathway databases, which show it finds a hyperpath that matches the state-of-the-art mixed-integer linear program on over 99% of all instances that are acyclic. On instances where only cyclic hyperpaths exist, the heuristic surpasses the state of the art, which finds no solution; on every such cyclic instance, enumerating all source-sink hyperpaths shows the solution found by the heuristic was in fact optimal.
Conclusions
The new shortest hyperpath heuristic is both fast and accurate. This makes finding source-sink hyperpaths, which in general may contain cycles, now practical for real cell signaling networks.
Availability
Source code for the hyperpath heuristic in a new tool we call Hhugin (as well as for hyperpath enumeration, and all dataset instances) is available free for noncommercial use at http://hhugin.cs.arizona.edu.
Background
Cell signaling pathways are cornerstones of molecular and cellular biology. They underlie cellular communication, govern environmental response, and their perturbation has been implicated in the cause of many diseases [1]. While signaling pathways have classically been modeled as ordinary graphs, using directed or undirected edges to link pairs of interacting molecules [2, 3], both Klamt, Haus and Theis [4] and Ritz, Tegge, Kim, Poirel and Murali [5] have shown that ordinary graphs cannot adequately represent cellular activity that involves the assembly and disassembly of protein complexes, or multi-way reactions among such complexes.
Directed hypergraphs are generalizations of ordinary graphs where an edge, now called a hyperedge, is directed from one set of vertices, called its tail, to another set of vertices, called its head. Hypergraphs have been used to model many cellular processes [4,5,6,7,8,9,10,11,12]. In particular, a biochemical reaction that involves multiple reactants, all of which must be present for the reaction to proceed, and that results in multiple products, all of which are produced upon its completion, is correctly captured by a single hyperedge directed from its set of reactants to its set of products. Despite hypergraphs affording more faithful models of reaction networks, the lack of practical hypergraph algorithms has hindered their potential for properly representing and reasoning about molecular reactions.
Biologically, a typical cell-signaling pathway consists of membrane-bound receptors that bind to extracellular ligands, triggering intracellular cascades of reactions, culminating in the activation of transcriptional regulators and factors [13]. Computationally, treating receptors as sources, and transcription factors as targets, finding the most efficient way to synthesize a particular transcription factor from a set of receptors maps to the shortest hyperpath problem we consider here: Given a cell-signaling network whose reactants and reactions are modeled by the vertices and weighted hyperedges of a directed hypergraph, together with a set of sources and a target, find a hyperpath consisting of hyperedges from the sources to the target of minimum total weight. We briefly summarize prior work on related problems next.
Related work
Hypergraphs have been studied in the algorithms community [14,15,16], and applied within systems biology to metabolic networks [17,18,19,20] and cell-signaling networks [12, 21,22,23].
In the field of algorithms, Italiano and Nanni [14] first proved that finding a shortest source-sink hyperpath is NP-complete, even when hyperedges have a single head vertex. In a seminal paper that is the source for much of the subsequent work on hypergraphs, Gallo, Longo, Pallottino and Nguyen [15] explore special cases of hypergraphs, and define several versions of hyperpaths, including what they call a B-path (though see the correction of Nielsen and Pretolani [24]), which is essentially equivalent to our definition of hyperpath (given in the following section on shortest hyperpaths in directed hypergraphs). They show the vertices reachable from a source vertex in a hypergraph can be found in time linear in the total size of the tail and head sets of all hyperedges, give an efficient algorithm for a variant of shortest hyperpaths with a so-called additive cost function, and prove that finding a minimum cut in a hypergraph is NP-complete. Ausiello and Laura [16] survey results on hypergraphs whose hyperedges have singleton head sets, and note that a consequence of the NP-completeness reduction [14] for shortest hyperpaths from the set cover problem is that, unless \(\text {P} \!=\! \text {NP}\), no polynomial-time approximation algorithm can exist for shortest hyperpaths on hypergraphs of n vertices with approximation ratio \(\bigl ( 1 - o(1) \bigr ) \ln n\).
In metabolic networks, Cottret, Milreu and Acuña et al. [17] examine the minimum precursor problem: given a hypergraph G, a set of sources S, and a set of targets T, find a source subset \(P \subseteq S\) of minimum cardinality that has a factory from P to T, where a factory is a set of hyperedges that produce targets T from precursor set P while satisfying weaker ordering constraints on hyperedges than required by hyperpaths. They show this problem is NP-complete, and give an algorithm that enumerates all minimal precursor sets whose factory is acyclic. Acuña, Milreu and Cottret et al. [18] subsequently enumerate all minimal precursor sets allowing cycles. Andrade, Wannagat and Klein et al. [19] extend these algorithms to accommodate stoichiometry and conserve intermediate metabolites within the factory. Carbonell, Fichera, Pandit and Faulon [20] give an efficient algorithm to find a source-sink hyperpath if one exists (irrespective of its length), and prove that finding any hyperpath that must contain a specified set of hyperedges is NP-complete. They also offer an approach to hyperpath enumeration that relies on solutions to this NP-complete problem, for which they employ a heuristic.
In cell-signaling networks, Ritz, Avent and Murali [12, 21] were the first to solve the shortest acyclic hyperpath problem by formulating it as a mixed-integer linear program (MILP), the current state of the art for shortest hyperpaths, and showed that in practice, optimal acyclic hyperpaths can be found even for large cell-signaling hypergraphs. Their formulation does not extend to hyperpaths with cycles, and requires exponential time in the worst case (which may be unavoidable, as the acyclic problem remains NP-complete). Recently, Franzese, Groce, Murali and Ritz [22] defined a parameterized notion of connectivity that interpolates between hyperpath and ordinary-path connectivity, while Schwob, Zhan and Dempsey [23] modified the acyclic MILP of Ritz et al. [21] to include time dependence among reactions.
Our contributions
In contrast to prior work, we present a heuristic for shortest hyperpaths that handles cycles, is worst-case efficient, and finds hyperpaths that are demonstrably optimal or close to optimal in real cell-signaling hypergraphs. In more detail, we make the following contributions.

We present an efficient heuristic for shortest hyperpaths that runs in \(O(\ell \,+\, m^2 \, k)\) time on a hypergraph of size \(\ell\) (the total cardinality of all hyperedge tail and head sets), where m is the number of hyperedges that are doubly reachable from the source and sink vertices, and k is defined analogously to \(\ell\) over these doubly-reachable hyperedges.

We prove that the heuristic finds an optimal shortest hyperpath for the class of singleton-tail hypergraphs, where the tails of all hyperedges in the hypergraph are single vertices.

We also give a practical algorithm for hyperpath enumeration that generates all possible source-sink hyperpaths, allowing us to tractably measure how close our heuristic is to the optimum.

Our heuristic matches the state-of-the-art MILP for shortest acyclic hyperpaths on over 99% of all instances from two standard databases of cell-signaling pathways.

Our heuristic surpasses the state of the art on instances where every source-sink hyperpath is cyclic, and hence the MILP finds no solution. On all such cyclic biological instances, our hyperpath enumeration algorithm verified that the heuristic was in fact optimal.
To our knowledge, this heuristic is the first in the literature for shortest source-sink hyperpaths in general directed hypergraphs, where hyperedges have arbitrary tail and head sets, and the length of a hyperpath is the sum of the weights of its hyperedges.
We note that the worst-case efficiency and subclass optimality of the heuristic highlighted in the first two points above show that the shortest hyperpaths problem is polynomial-time solvable for singleton-tail hypergraphs, in contrast to its NP-completeness for singleton-head hypergraphs [14], which does not appear to have been observed before in the literature [16]. Furthermore, while prior work has developed specialized algorithms that are tailored to shortest hyperpaths under so-called additive cost functions [15] (which also handle singleton-tail hypergraphs), we give a general heuristic for arbitrary hypergraphs under the non-additive cost function of total weight of the hyperpath, which as a consequence is optimal for the special case of singleton-tail hypergraphs.
Source code for an implementation of the shortest hyperpath heuristic in a new tool we call Hhugin [25] (short for “hypergraph heuristic for general shortest source-sink hyperpaths”), as well as the hyperpath enumeration algorithm and all dataset instances, is available free for noncommercial use at http://hhugin.cs.arizona.edu.
Plan of the paper
The next section defines the general shortest hyperpath problem, allowing cycles. The following section then presents our heuristic for shortest hyperpaths, analyzes its time complexity, shows it returns a feasible solution whenever one exists, and proves it finds optimal solutions for singleton-tail hypergraphs. The next section gives our algorithm for generating all source-sink hyperpaths, proves its correctness, and analyzes its time complexity. The subsequent section compares the heuristic, through experiments on all source-sink instances from standard databases, to the state-of-the-art MILP for acyclic instances, or to the optimum of all enumerated hyperpaths for cyclic instances, and discusses three examples of cyclic shortest hyperpaths in cell signaling networks. Finally, the last section concludes, and provides directions for further research.
Shortest hyperpaths in directed hypergraphs
A directed hypergraph is a generalization of an ordinary directed graph, where an edge, instead of touching two vertices, now connects two subsets of vertices. Formally, a directed hypergraph is a pair (V, E), where V is a set of vertices, and E is a set of directed hyperedges. (The literature sometimes uses the term hyperarc for an edge in a directed hypergraph, but we prefer the simpler term hyperedge, just as the term edge is conventionally used for both directed and undirected ordinary graphs. We will occasionally abbreviate the term hyperedge to simply edge, when it is clear that the context is with respect to a directed hypergraph.) Each hyperedge \(e \in E\) is an ordered pair (X, Y), where both \(X,Y \subseteq V\) are vertex subsets. Edge e is directed from set X to set Y. We call set X the tail of e, and set Y the head of e, and refer to these sets by the functions \(\mathrm {tail}(e) = X\) and \(\mathrm {head}(e) = Y\). We also refer to the in-edges of vertex v by \(\text {in}(v) := \{ e \!\in \! E \,:\, v \in \mathrm {head}(e) \}\), and the out-edges of v by \(\text {out}(v) := \{ e \!\in \! E \,:\, v \in \mathrm {tail}(e) \}\). Figure 1 shows a directed hyperedge.
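As a small illustration of these definitions, the following Python sketch stores hyperedges as (tail, head) pairs and derives the in-edge and out-edge sets of each vertex. The names `Hyperedge`, `in_edges`, and `out_edges` are hypothetical choices for this example and are not taken from the Hhugin source code.

```python
from collections import defaultdict
from typing import NamedTuple


class Hyperedge(NamedTuple):
    """A directed hyperedge (X, Y) with tail set X and head set Y (illustrative type)."""
    tail: frozenset
    head: frozenset


def in_edges(edges):
    """Map each vertex v to in(v) = {e : v in head(e)}."""
    table = defaultdict(set)
    for e in edges:
        for v in e.head:
            table[v].add(e)
    return table


def out_edges(edges):
    """Map each vertex v to out(v) = {e : v in tail(e)}."""
    table = defaultdict(set)
    for e in edges:
        for v in e.tail:
            table[v].add(e)
    return table


# A reaction A + B -> C modeled as one hyperedge, followed by C -> D.
e1 = Hyperedge(frozenset({"A", "B"}), frozenset({"C"}))
e2 = Hyperedge(frozenset({"C"}), frozenset({"D"}))
E = [e1, e2]
```

Here vertex C has the single in-edge `e1` and the single out-edge `e2`, while both A and B have `e1` as their only out-edge.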
In ordinary directed graphs, a path from a vertex s to a vertex t is a sequence of edges starting from s and ending at t, where for consecutive edges e and f in the sequence, the preceding edge e must enter the vertex that the following edge f leaves. We say t is reachable from s when there is such a path from s to t.
In generalizing these notions to directed hypergraphs, the conditions both for when a hyperedge can follow another in a hyperpath, and when a vertex is reachable from another, become more involved. A hyperpath is again a sequence of hyperedges, but now for hyperedge f in a hyperpath, for every vertex \(v \in \mathrm {tail}(f)\), there must be some hyperedge e that precedes f in the hyperpath for which \(v \in \mathrm {head}(e)\). Reachability is captured by the following notion of superpath.
Definition 1
(Superpath) In a directed hypergraph (V, E), an s,t-superpath, for vertices \(s,t \in V\), is an edge subset \(F \subseteq E\) such that the hyperedges of F can be ordered \(e_1, e_2, \ldots , e_k\), where

(i)
\(\mathrm {tail}(e_1) \,=\, \{s\}\),

(ii)
for each \(1 < i \le k\),
$$\begin{aligned} \mathrm {tail}(e_i) \;\;\subseteq \;\; \{s\} \;\cup \; \bigcup _{1 \le j < i} \mathrm {head}(e_j) \, , \end{aligned}$$ 
(iii)
and \(t \,\in \, \mathrm {head}(e_k)\).
For an s,t-superpath, we call s its source, t its sink, and we say t is reachable from s.\({\square}\)
We can now define hyperpaths in terms of superpaths. Recall that a set S is minimal with respect to some property X if S satisfies X, but no proper subset of S satisfies X.
Definition 2
(Hyperpath) An s,t-hyperpath is a minimal s,t-superpath.\({\square}\)
In other words, a hyperpath P is a superpath for which removing any edge \(e \in P\) leaves a subset \(P - \{e\}\) that is no longer a superpath. Essentially, hyperpaths eliminate unnecessary edges from superpaths. Figures 7, 8, and 9 later show examples of hyperpaths.
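The minimality condition can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the paper's pseudocode: hyperedges are (tail, head) pairs of frozensets, \(s \ne t\), and the function names are hypothetical. Instead of searching for an ordering satisfying (i)–(iii), the sketch tests the monotone condition that t is reachable from s using only the given edges; an edge set that is minimal for this condition has every edge needed, so it admits such an ordering.

```python
def reachable_vertices(edges, s):
    """Vertices reachable from s using only the given (tail, head) hyperedges."""
    reached = {s}
    changed = True
    while changed:  # simple fixed-point iteration (not the linear-time method)
        changed = False
        for tail, head in edges:
            # An edge can fire once its whole tail has been reached.
            if tail <= reached and not head <= reached:
                reached |= head
                changed = True
    return reached


def connects(edges, s, t):
    """True if t is reachable from s using only these hyperedges."""
    return t in reachable_vertices(edges, s)


def is_hyperpath(edges, s, t):
    """True if the edges connect s to t and no single edge can be dropped."""
    edges = list(edges)
    if not connects(edges, s, t):
        return False
    return all(not connects(edges[:i] + edges[i + 1:], s, t)
               for i in range(len(edges)))


e1 = (frozenset({"s"}), frozenset({"a"}))
e2 = (frozenset({"s"}), frozenset({"b"}))
e3 = (frozenset({"a", "b"}), frozenset({"t"}))
e4 = (frozenset({"a"}), frozenset({"c"}))  # not needed to reach t
```

In this example `[e1, e2, e3]` is a hyperpath, while adding `e4` still connects s to t but is no longer minimal, and `[e1, e3]` does not connect s to t because vertex b in the tail of `e3` is never produced.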
We say a hyperpath P contains a cycle if, for every ordering \(e_1, \ldots , e_k\) of its hyperedges satisfying properties (i)–(iii) in the definition of superpath, P contains some hyperedge f with a vertex in \(\mathrm {head}(f)\) that also occurs in \(\mathrm {tail}(e)\) for an earlier hyperedge e in the ordering. While in ordinary graphs a minimal s,t-path can never contain a cycle, in hypergraphs an s,t-hyperpath can in fact contain cycles, as shown in our later section on biological examples.
We can now define the shortest hyperpath problem. For an edge weight function \(\omega (e)\), we extend \(\omega\) to edge subsets \(F \subseteq E\) by \(\omega (F) \,:=\, \sum _{e \in F} \, \omega (e)\).
Definition 3
(Shortest Hyperpaths) The Shortest Hyperpaths problem is the following. Given a directed hypergraph (V, E), a positive edge weight function \(\omega : E \!\rightarrow \! \mathcal {R}^+\), source \(s \in V\) and sink \(t \in V\), find an s,t-hyperpath \(P \subseteq E\) of minimum total weight \(\omega (P)\).\({\square}\)
Note that for positive edge weights, Shortest Hyperpaths is equivalent to finding an s,t-superpath of minimum total weight.
Shortest Hyperpaths with a single source and sink vertex also captures more general versions of the problem with multiple sources and multiple sinks, as follows. To find a hyperpath that starts from a set of sources \(S \subseteq V\), simply add a new source vertex s to the hypergraph together with a single hyperedge \((\{s\},S)\) of zero weight, and equivalently find a hyperpath from the single source s. To find a hyperpath that reaches all vertices in a set of sinks \(T \subseteq V\), add a new sink vertex t, a zero-weight hyperedge \((T,\{t\})\), and equivalently find a hyperpath to the single sink t. To find a hyperpath that reaches some vertex in a set of sinks \(T \subseteq V\), add new sink vertex t, zero-weight hyperedges \((\{v\},\{t\})\) from all \(v \in T\), and again equivalently find a hyperpath to the single sink t. Thus versions of shortest hyperpaths with multiple sources and sinks can be reduced to the problem with a single source and sink.
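The three reductions above can be sketched in a few lines of Python. This is only an illustration: hyperedges are assumed to be (tail, head, weight) triples, and the function names and the default vertex labels `"_s"` and `"_t"` are hypothetical (any labels not already used as vertices would do).

```python
def add_super_source(edges, sources, s="_s"):
    """Multiple sources S: add one zero-weight hyperedge ({s}, S)."""
    extra = (frozenset({s}), frozenset(sources), 0.0)
    return edges + [extra], s


def add_all_sinks(edges, sinks, t="_t"):
    """Reach ALL sinks in T: add one zero-weight hyperedge (T, {t})."""
    extra = (frozenset(sinks), frozenset({t}), 0.0)
    return edges + [extra], t


def add_any_sink(edges, sinks, t="_t"):
    """Reach SOME sink in T: add a zero-weight hyperedge ({v}, {t}) per v in T."""
    extra = [(frozenset({v}), frozenset({t}), 0.0) for v in sorted(sinks)]
    return edges + extra, t


# One reaction r1 + r2 -> x of unit weight.
edges = [(frozenset({"r1", "r2"}), frozenset({"x"}), 1.0)]
with_source, s = add_super_source(edges, {"r1", "r2"})
with_all, t_all = add_all_sinks(edges, {"x", "y"})
with_any, t_any = add_any_sink(edges, {"x", "y"})
```

The first two reductions add a single hyperedge, while the third adds one hyperedge per sink, so the reduced instance grows only by the size of S or T.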
Shortest Hyperpaths is NP-complete [14] (even for acyclic hypergraphs with singleton head sets), so we likely cannot efficiently compute shortest hyperpaths in the worst case. The next section presents an efficient heuristic for shortest hyperpaths that is highly accurate at finding demonstrably optimal or near-optimal hyperpaths in real cell-signaling hypergraphs.
An efficient shortest hyperpath heuristic
We now give a fast heuristic for Shortest Hyperpaths that always finds an s,t-hyperpath if one exists. While the heuristic is not guaranteed to find a shortest s,t-hyperpath in general, our later comprehensive experiments over the two standard cell-signaling databases in the literature show it quickly finds a hyperpath that is optimal or remarkably close to optimal on the vast majority of instances. Furthermore, we will prove that the heuristic is guaranteed to find a shortest s,t-hyperpath for the class of singleton-tail hypergraphs, where the tail-sets of all hyperedges are single vertices.
We present the heuristic by providing detailed pseudocode at a level that can be directly implemented, as the heuristic is carefully designed and many of its component algorithms are surprisingly tricky to implement correctly. After describing the heuristic, we give a time analysis that shows it is always efficient, prove its feasibility, and then show that it finds optimal hyperpaths for singleton-tail hypergraphs.
While at a high level the heuristic has some aspects in common with Dijkstra’s algorithm for single-source shortest paths in an ordinary directed graph (see [26, pp. 658–659]), in that the heuristic maintains a heap of elements prioritized by estimated path lengths, it has significant differences. In contrast to Dijkstra’s algorithm, the heuristic is edge-based, rather than vertex-based, and the heap maintains hyperedges e prioritized by the length of the shortest known hyperpath from the source s to edge e, which will be formally defined later. Also in contrast to Dijkstra’s algorithm, maintaining a single in-edge to a vertex no longer suffices for recovering a path back to source s; instead, recovering an s,t-hyperpath now requires the heuristic to maintain a set of in-edges to each hyperedge e that are candidates for the final edges on the path from s to e. Furthermore, the total length of a hyperpath P to e is no longer a simple function (like a minimum or a sum) of the lengths of hyperpaths to the in-edges of e in P that cover the tail of e, since the constituent hyperpaths within P to these in-edges of e can have arbitrarily complicated sharing of hyperedges. Simply determining the length of the best recovered hyperpath for a hyperedge e on the heap, using these stored in-edges to each hyperedge, is itself now a hard combinatorial problem, which the heuristic tackles by a carefully constructed greedy procedure.
The overall structure of the heuristic is a breadth-first search over the hyperedges e reachable from source s, ordered by the estimated length of the shortest hyperpath from s to e. (Admittedly a shortest s,t-hyperpath P is not necessarily composed of shortest hyperpaths from s to individual hyperedges e in P, which is partly why this approach is a heuristic.) The search repeatedly invokes a greedy procedure to recover the currently best-known hyperpath to e in order to evaluate its length. As hyperpaths are by definition minimal superpaths, to determine minimality this greedy recovery procedure repeatedly tests reachability of hyperedges. Moreover, for efficiency, the overall breadth-first search proceeds over a smaller subgraph of the input hypergraph that only contains hyperedges that are reachable both from source s and in reverse from sink t. Hence at base, the heuristic builds upon fast algorithms for computing reachability in a hypergraph.
Accordingly, to present the heuristic, we first give pseudocode for these fundamental algorithms for directed reachability. These algorithms use the following terminology of forward-reachable, backward-traceable, and doubly-reachable, which we define next.
Definition 4
(Reachability and Traceability) Vertex v is forward reachable from source s in hypergraph G if there is an s,v-superpath in G. Hyperedge e is forward reachable from s if all vertices \(v \in \mathrm {tail}(e)\) are forward reachable from s.
Vertex v is backward traceable from sink t if \(v = t\), or recursively if \(v \in \mathrm {tail}(e)\) for an edge e where some \({w \in \mathrm {head}(e)}\) is backward traceable from t. Hyperedge e is backward traceable from t if some \({v \in \mathrm {head}(e)}\) is backward traceable from t.
A vertex v or hyperedge e is doubly reachable if v or e, respectively, is both forward reachable from s and backward traceable from t.\({\square}\)
To describe the heuristic, it will also be convenient to extend the definitions of superpath and hyperpath to a path from a source s to a hyperedge e.
Definition 5
(Superpath and Hyperpath from Source to Hyperedge) An s,e-superpath is an edge subset S with \(e \in S\) where all vertices in \(\mathrm {tail}(e)\) are forward reachable from source s using hyperedges in S. An s,e-hyperpath is a minimal s,e-superpath.\({\square}\)
The pseudocode that we present accesses a hypergraph G through the fields G.vertices and G.edges. We access the tail-set and head-set of a hyperedge e through the fields e.tail and e.head. We access the set of in-edges and out-edges of a vertex v through the fields v.in and v.out. For a list Q that is handled as a queue, the operation Q.Put(x) appends item x to the rear of Q, while the operation Q.Get() removes and returns the item at the front of Q. For a min-heap H, the operation H.Insert(x, k) inserts item x with key k into H, and returns a reference p to the heap node containing this pair (x, k) in H; the operation H.Extract() removes and returns the item in H with minimum key; and the operation H.Decrease(p, k) takes a reference p to a heap node in H and decreases its key to k if k is smaller than its current key. All functions assume hypergraph G is passed by reference.
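For readers who want to experiment, the heap interface just described can be approximated in Python as sketched below. This is an assumption-laden stand-in and not the data structure analyzed in the paper: it uses the standard `heapq` module with lazy invalidation instead of a true decrease-key, so each operation costs logarithmic time, and unlike H.Decrease(p, k) the `decrease` method returns a new reference that replaces the old one. The class name `MinHeap` is hypothetical.

```python
import heapq
import itertools


class MinHeap:
    """Min-heap with Insert, Extract, and a lazy Decrease (illustrative only)."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # unique tie-breaker, so items are never compared

    def insert(self, item, key):
        """Insert item with the given key and return a reference to its node."""
        node = [key, next(self._tie), item, True]  # last field: node is still live
        heapq.heappush(self._heap, node)
        return node

    def decrease(self, node, key):
        """Lower the key of a live node; returns the reference to use from now on."""
        if node[3] and key < node[0]:
            node[3] = False                # invalidate the old entry in place
            return self.insert(node[2], key)
        return node

    def extract(self):
        """Remove and return the live item with minimum key."""
        while self._heap:
            node = heapq.heappop(self._heap)
            if node[3]:
                node[3] = False
                return node[2]
        raise IndexError("extract from empty heap")


h = MinHeap()
p = h.insert("e1", 5.0)
h.insert("e2", 3.0)
p = h.decrease(p, 1.0)          # e1 now has the smallest key
first, second = h.extract(), h.extract()
```

After these calls `first` is `"e1"` and `second` is `"e2"`; the stale entry for `"e1"` with key 5.0 is silently skipped on a later extraction.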
Figure 2 gives pseudocode for the two functions ForwardReachable and BackwardTraceable. Function ForwardReachable returns the set of all hyperedges that are forward reachable from source s, while function BackwardTraceable returns the set of hyperedges that are backward traceable from sink t. Function ForwardReachable uses the Boolean vertex field v.reached, and the integer edge field e.count, which it assumes have already been initialized to the values \(v.\text {reached} = \textsf {false}\) for all \(v \in V\) and \(e.\text {count} = \bigl |\mathrm {tail}(e)\bigr |\) for all \(e \in E\). Function BackwardTraceable also uses the Boolean edge field e.marked, which it similarly assumes is initialized to false for all e. (This initialization will be done once for hypergraph G in the shortest hyperpath heuristic, which allows these functions when called repeatedly to run in time bounded by just the size of the forward-reachable or backward-traceable subgraphs.) Function ForwardReachable uses the field e.count to detect when all vertices in \(\mathrm {tail}(e)\) have been reached from s, and hence e is now reached from s. Function BackwardTraceable performs a similar but simpler computation in reverse from sink t. The worst-case time for both these functions is linear in the size of the subgraph they explore, as analyzed in the following section on the time-complexity of the heuristic.
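The counter-based idea behind these two functions can be sketched in Python as follows. This is a reading of the description above, not a transcription of Fig. 2: hyperedges are assumed to be (tail, head) pairs of frozensets identified by list index, the local `count`, `reached`, and `marked` structures stand in for the fields e.count, v.reached, and e.marked, and for self-containment the sketch rebuilds its incidence tables on every call (so it is linear in the whole input rather than output-sensitive).

```python
from collections import deque


def forward_reachable(edges, s):
    """Indices of hyperedges whose entire tail is reachable from s."""
    out = {}
    for i, (tail, _) in enumerate(edges):
        for v in tail:
            out.setdefault(v, []).append(i)
    count = [len(tail) for tail, _ in edges]  # unreached tail vertices per edge
    reached, queue, result = {s}, deque([s]), []
    while queue:
        v = queue.popleft()
        for i in out.get(v, []):
            count[i] -= 1
            if count[i] == 0:                 # all of tail(e) reached: e is reached
                result.append(i)
                for w in edges[i][1]:
                    if w not in reached:
                        reached.add(w)
                        queue.append(w)
    return result


def backward_traceable(edges, t):
    """Indices of hyperedges with some head vertex backward traceable from t."""
    into = {}
    for i, (_, head) in enumerate(edges):
        for v in head:
            into.setdefault(v, []).append(i)
    marked = [False] * len(edges)
    seen, queue, result = {t}, deque([t]), []
    while queue:
        v = queue.popleft()
        for i in into.get(v, []):
            if not marked[i]:
                marked[i] = True
                result.append(i)
                for w in edges[i][0]:
                    if w not in seen:
                        seen.add(w)
                        queue.append(w)
    return result


E = [
    (frozenset({"s"}), frozenset({"a"})),       # 0
    (frozenset({"s"}), frozenset({"b"})),       # 1
    (frozenset({"a", "b"}), frozenset({"t"})),  # 2
    (frozenset({"c"}), frozenset({"t"})),       # 3: c is never produced
    (frozenset({"t"}), frozenset({"d"})),       # 4: leads away from t
]
doubly = sorted(set(forward_reachable(E, "s")) & set(backward_traceable(E, "t")))
```

In this toy instance hyperedge 3 is backward traceable but not forward reachable, hyperedge 4 is forward reachable but not backward traceable, and only hyperedges 0, 1, and 2 are doubly reachable.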
Figure 3 gives pseudocode for the function ShortestHyperpathHeuristic, our heuristic. Like Dijkstra’s shortest path algorithm for ordinary graphs, this function maintains a heap H, but in contrast to Dijkstra’s algorithm this is now a heap of hyperedges e (rather than a heap of vertices), which are prioritized by keys that are the best known estimate of the length of a shortest s,e-hyperpath. We refer to this estimate as the current path length for e. The heuristic starts from the out-edges of source s, and in a reaching computation repeatedly extracts from heap H the hyperedge e with minimum key. When hyperedge e is removed from H, the estimated path length for e is evaluated, and stored in field e.length. To compute this length estimate, it must construct the best s,e-hyperpath it can find, and evaluate its total weight. Of course, computing an optimal s,e-hyperpath is NP-complete, so it uses a greedy heuristic to construct this path by the function RecoverShortHyperpath. This greedy path-construction heuristic consists of two steps: (1) recovering an s,e-superpath by recursively backward-traversing hyperedges that enter \(\mathrm {tail}(e)\), followed by (2) finding a minimal subset of this superpath that is an s,e-hyperpath while attempting to minimize its total weight.
Figure 4 gives pseudocode for the function RecoverShortHyperpath that implements this greedy path-construction heuristic. For the first step, recovering the s,e-superpath S is done by recursively backward-traversing what we call in-edges: those hyperedges whose head-sets intersect the tail-set of a given hyperedge. Function ShortestHyperpathHeuristic maintains for a hyperedge e the field e.inedges, which stores the subset of in-edges f to e whose field f.length has been determined.
For the second step, function RecoverShortHyperpath attempts to find the minimum weight subset of S that is still a superpath by greedily considering hyperedges \(f \in S\) for removal in decreasing order of f.length, which is the estimated total length of a shortest s,f-hyperpath. (Note this is more sophisticated than a naive greedy approach that simply removes hyperedges f in decreasing order of their edge-weight \(\omega (f)\), which would degenerate to removing edges in random order in real cell-signaling networks where hyperedges typically all have unit weight, and hence would all be tied for removal.) This greedy process for trimming superpath S repeatedly tests whether \(\mathrm {tail}(e)\) is still reachable from s on removing f by calling Boolean function IsReachable. Pseudocode for IsReachable is not given, but it simply implements a version of function ForwardReachable that halts and returns true as soon as it adds e to the set of hyperedges reachable from s, or returns false after collecting the entire reachable set without encountering e.
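The trimming step can be illustrated with the following Python sketch. It is a simplified rendering of the description above, not the pseudocode of Fig. 4: hyperedges are assumed to be hashable (tail, head) pairs of frozensets, the dictionary `length` stands in for the field f.length, `tail_reachable` is a naive substitute for IsReachable, and all names are hypothetical.

```python
def tail_reachable(edges, s, target_tail):
    """True if every vertex of target_tail is reachable from s via these edges."""
    reached = {s}
    changed = True
    while changed:  # naive fixed point; the paper uses a linear-time routine
        changed = False
        for tail, head in edges:
            if tail <= reached and not head <= reached:
                reached |= head
                changed = True
    return target_tail <= reached


def greedy_trim(superpath, length, s, e):
    """Drop edges in decreasing order of length[f] while tail(e) stays reachable."""
    kept = list(superpath)
    for f in sorted(superpath, key=lambda g: length[g], reverse=True):
        if f == e:
            continue                      # the target hyperedge itself is kept
        trial = [g for g in kept if g != f]
        if tail_reachable(trial, s, e[0]):
            kept = trial                  # f was not needed
    return kept


e1 = (frozenset({"s"}), frozenset({"a"}))
e2 = (frozenset({"s"}), frozenset({"b"}))
e3 = (frozenset({"a"}), frozenset({"c"}))   # longer route to c
e4 = (frozenset({"b"}), frozenset({"c"}))   # shorter route to c
e = (frozenset({"c"}), frozenset({"t"}))
length = {e1: 1, e2: 1, e3: 3, e4: 2, e: 3}  # made-up path-length estimates
trimmed = greedy_trim([e1, e2, e3, e4, e], length, "s", e)
```

On this example the edge `e3` with the largest estimate is considered first and removed, which then makes `e1` removable as well, leaving `[e2, e4, e]`. Because reachability of \(\mathrm {tail}(e)\) can only shrink as edges are removed, a single pass of this kind leaves a set from which no further edge can be dropped.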
We note that most of the computation of the shortest hyperpath heuristic proceeds over a much smaller subgraph of the input hypergraph G: namely the subgraph induced by the hyperedges \({D \subseteq E}\) that are doubly reachable (both forward reachable from s and backward traceable from t). This preserves correctness, since hyperedges that are not doubly reachable cannot be on an s,t-hyperpath and can safely be ignored (as argued in the later section on feasibility of the heuristic in the proof of Theorem 2).
To summarize, the shortest hyperpath heuristic proceeds greedily like Dijkstra’s algorithm, but with some important differences: it maintains a heap of hyperedges prioritized by estimated shortest path lengths to tail-sets, records a set of potential in-edges to a given hyperedge used for recovering a hyperpath to the hyperedge, and recovering such a hyperpath now involves another greedy heuristic to find a minimal superpath of small total weight.
Our later section on experimental results shows this heuristic is remarkably close to optimal on real cell-signaling hypergraphs. Given that no practical exact algorithm exists for general shortest hyperpaths, we determine the optimum by enumerating all s,t-hyperpaths and taking the minimum of their lengths, using an algorithm we develop in the later section on tractably generating all source-sink hyperpaths.
We note for this heuristic that the inapproximability of the shortest hyperpath problem [16], together with the polynomial time analysis of the next subsection, imply that unless \(\text {P} = \text {NP}\), the heuristic cannot be a constant-factor approximation algorithm for shortest hyperpaths.
In the following subsections, we first analyze the running time of the heuristic; then show it always finds a feasible solution whenever one exists; and finally prove it actually finds an optimal solution for the class of singleton-tail hypergraphs.
Time complexity of the heuristic
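We now bound the time complexity of the shortest hyperpath heuristic. Our analysis is in terms of the following parameters measured on a hypergraph, or an induced subgraph. For a hypergraph G with vertices V and hyperedges E, we denote its number of vertices and hyperedges by
$$\begin{aligned} n \;:=\; |V| \, , \qquad m \;:=\; |E| \, . \end{aligned}$$
We also use the size parameter
$$\begin{aligned} \ell \;:=\; \sum _{e \in E} \, \Bigl ( \bigl |\mathrm {tail}(e)\bigr | \,+\, \bigl |\mathrm {head}(e)\bigr | \Bigr ) \, , \end{aligned}$$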
and degree parameter
Note that in general, the space required to represent all hyperedges is \(\Theta (\ell )\). We assume all tail and head sets are nonempty, and every vertex is touched by a hyperedge, which implies \(m + n = O(\ell )\). When we need to refer to these measures for a particular hypergraph G, such as on an induced subgraph, we explicitly subscript the parameters by the specific hypergraph, such as \(n_G, \ldots , d_G\), where these parameters are then measured in terms of the vertices and edges of the subscripted hypergraph G.
The running time of the shortest hyperpath heuristic may be expressed as a function of parameters measured on both the input hypergraph and its doubly-reachable subgraph (induced by the hyperedges that are simultaneously forward reachable from the source and backward traceable from the sink).
Theorem 1
(Time complexity of the heuristic) The time complexity of the shortest hyperpath heuristic, in terms of the number of hyperedges m and size parameter \(\ell\) for both the input hypergraph G and its doubly-reachable subgraph H, is
$$\begin{aligned} O\bigl ( \ell _G \,+\, m_H^2 \, \ell _H \bigr ) \, . \end{aligned}$$
Proof
To bound the running time of the function ShortestHyperpathHeuristic, we analyze in turn its component functions ForwardReachable, BackwardTraceable, and RecoverShortHyperpath. The running time of the reachability computations ForwardReachable and BackwardTraceable (in Fig. 2) can be expressed in an output-sensitive way in terms of the size of the edge sets they return.
For ForwardReachable, let \(R \subseteq V\) be the set of vertices reachable from source s, and \(F \subseteq E\) be the set of hyperedges reachable from s that are returned. The total time for ForwardReachable is dominated by the time for its main while-loop, which takes time \(\Theta \bigl ( \, \sum _{v \in R} \, \bigl |\text {out}(v)\bigr | \,\,+\,\, \sum _{e \in F} \, \bigl |\mathrm {head}(e)\bigr | \, \bigr )\), or equivalently,
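For BackwardTraceable, let \(B \subseteq V\) be the set of vertices it reaches from sink t, and \(F \subseteq E\) be the set of hyperedges traceable from t that are returned. A similar analysis shows the time for BackwardTraceable is
$$\begin{aligned} \Theta \Bigl ( \, \sum _{v \in B} \, \bigl |\text {in}(v)\bigr | \,\,+\,\, \sum _{e \in F} \, \bigl |\mathrm {tail}(e)\bigr | \, \Bigr ) \, . \end{aligned}$$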
So the time for both ForwardReachable and BackwardTraceable on the input hypergraph G is \(O\bigl (\ell _G\bigr )\), but can be bounded more tightly in terms of the subgraph of G they actually explore.
For the function RecoverShortHyperpath (in Fig. 4), when it is called by ShortestHyperpathHeuristic, all its computations are performed on G restricted to the edge subset \(D \subseteq E\) of doubly-reachable hyperedges. We denote by hypergraph H the doubly-reachable subgraph induced by D.
In RecoverShortHyperpath, the time to recover s,e-superpath S by tracing back from e is at most
The time to greedily trim superpath S to s,e-hyperpath \(P \subseteq S\), in terms of cardinality \(k = |S|\), is at most
Thus the total time for RecoverShortHyperpath is
For the function ShortestHyperpathHeuristic (in Fig. 3), we break its time down into the following components. The time for the initialization, collecting the doubly-reachable edges D by calling ForwardReachable and BackwardTraceable, and restricting G to its subgraph H induced by D, is \(O\bigl (\ell _G\bigr )\). The main while-loop executes for \(m_H\) iterations, and spends \(O\bigl (m_H \, \log \, m_H\bigr )\) time for all Extracts. The total time across all iterations to compute s,e-hyperpath P for all extracted edges e by calling RecoverShortHyperpath is \(O\bigl (\ell _H \, m_H^2\bigr )\). The total time to collect the out-edges F for the extracted e across all iterations is \(O\bigl ( \sum _{e \in D} \, \sum _{v \in \mathrm {head}(e)} \, \bigl | \text {out}(v) \bigr | \bigr ) \,=\, O\bigl (d_H \, \ell _H\bigr )\). The total time across all iterations for Decrease and Insert, which take O(1) amortized time per edge in F using a Fibonacci heap (see [26, pp. 510–522]), is also \(O\bigl (d_H \, \ell _H\bigr )\). The time to recover the best s,t-hyperpath \(P^*\) is \(O\bigl (d_H \, \ell _H \, m_H \bigr )\).
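Finally, adding up the bounds for the above components, the total time for the shortest hyperpath heuristic is
\[ O\bigl (\ell _G \,+\, m_H \, \log \, m_H \,+\, \ell _H \, m_H^2 \,+\, d_H \, \ell _H \,+\, d_H \, \ell _H \, m_H\bigr ), \]
which is in turn \(O\bigl (\ell _G \,+\, \ell _H \, m_H^2\bigr )\).\({\square}\)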
Notice that the overall running time of the heuristic is dominated by the total time to recover short hyperpaths, which requires invoking RecoverShortHyperpath whenever the path length to a hyperedge is updated. This is necessary in hypergraphs, since in contrast to ordinary graphs the length of the hyperpath to a hyperedge can no longer be expressed as a simple function (such as a minimum or a sum) of the lengths of the hyperpaths to its in-edges.
As demonstrated in our later section on experimental results, for real biological instances the size of the doubly-reachable subgraph H is significantly smaller than the full input hypergraph G, so designing the heuristic to compute mainly over the much smaller hypergraph H yields a significant performance speedup in practice.
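To see why neither a minimum nor a sum rule works, consider a small hypothetical example (the hypergraph, edge names, and unit weights below are invented for illustration). The hyperpaths to the two in-edges of hyperedge e share an edge, which must be counted only once in the hyperpath to e.

```python
# Hypothetical unit-weight hypergraph: edge name -> (tail, head).
edges = {
    "e1": ({"s"}, {"x"}),
    "e2": ({"x"}, {"a"}),
    "e3": ({"x"}, {"b"}),
    "e":  ({"a", "b"}, {"t"}),
}
weight = {name: 1 for name in edges}

def total_weight(path):
    """Total weight of a hyperpath given as a set of edge names."""
    return sum(weight[name] for name in path)

path_to_e2 = {"e1", "e2"}            # shortest hyperpath to in-edge e2
path_to_e3 = {"e1", "e3"}            # shortest hyperpath to in-edge e3
path_to_e = path_to_e2 | path_to_e3 | {"e"}

true_length = total_weight(path_to_e)  # shared edge e1 is counted once
sum_rule = total_weight(path_to_e2) + total_weight(path_to_e3) + weight["e"]
min_rule = min(total_weight(path_to_e2), total_weight(path_to_e3)) + weight["e"]
```

Here the sum rule overcounts the shared edge and the minimum rule ignores one of the two required tail vertices, so the length has to be measured on the recovered edge set itself.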
Next we show the heuristic always finds a feasible solution.
Feasibility of the heuristic
The most basic property that a heuristic for a combinatorial optimization problem should satisfy is feasibility: that it always returns a feasible solution whenever one exists. In the context of Shortest Hyperpaths, a feasible solution is any s,t-hyperpath, while an optimal solution is a feasible solution of minimum total edge-weight.
For the hyperpath heuristic, we now show feasibility.
Theorem 2
(Feasibility of the heuristic) The shortest hyperpath heuristic finds a source-sink hyperpath whenever one exists.
Proof
Function ShortestHyperpathHeuristic (in Fig. 3) first restricts the input hypergraph G to its doubly-reachable subgraph, consisting of the hyperedges D that are both forward reachable from source s and backward traceable from sink t. Note that functions ForwardReachable and BackwardTraceable (in Fig. 2) together correctly collect these doubly-reachable hyperedges D: function ForwardReachable explores breadth-first the hyperedges that are forward reachable from s, maintaining a counter for each hyperedge e that records the number of vertices in its tail that have not yet been reached from s, and detecting when e is reached by this counter hitting zero; while function BackwardTraceable directly implements Definition 4 of backward traceability from t.
Furthermore, we claim that when restricting to the doubly-reachable subgraph \({\widetilde{G}}\), the heuristic does not lose any hyperedges on source-sink hyperpaths. Note that any hyperedge e on an s,t-hyperpath P in the input hypergraph G is forward reachable from s: consider the ordering of hyperedges in P from Definition 1, and take the prefix of this ordering up through e; this prefix is an s,e-superpath, so e is by definition forward reachable from s. Note also that any e on P in G is backward traceable from t as well: if \({t \in \mathrm {head}(e)}\), backward traceability immediately holds; otherwise, in the ordering of P there must be a hyperedge f following e with nonempty \(\mathrm {head}(e) \,\cap \, \mathrm {tail}(f)\) (else e can be removed from P, contradicting minimality); applying this same process again at f yields a subsequence of the ordering of P that ends in a hyperedge whose head contains t; considering this subsequence in reverse order satisfies Definition 4 for backward traceability of e from t. Hence restricting to the doubly-reachable subgraph \({\widetilde{G}}\) is safe.
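Backward traceability can be sketched in the same spirit. The Python below assumes the reading of Definition 4 used in the argument above, namely that a hyperedge is backward traceable from t if its head contains t or intersects the tail of another traceable hyperedge. The name `backward_traceable` and the `(tail, head)` encoding are assumptions for illustration, not the code of Fig. 2.

```python
from collections import deque

def backward_traceable(edges, sink):
    """Return (vertices reached backward, traceable hyperedge ids) from `sink`.

    `edges` is a list of (tail, head) pairs of vertex sets (assumed encoding).
    """
    # into[v] lists the hyperedges that have v in their head.
    into = {}
    for i, (_tail, head) in enumerate(edges):
        for v in head:
            into.setdefault(v, []).append(i)
    reached = {sink}
    traced = set()
    queue = deque([sink])
    while queue:
        v = queue.popleft()
        for i in into.get(v, []):
            if i in traced:
                continue
            traced.add(i)  # head of hyperedge i touches a vertex traced from the sink
            for u in edges[i][0]:
                if u not in reached:
                    reached.add(u)
                    queue.append(u)
    return reached, traced
```

Unlike forward reachability, a single head vertex suffices to trace a hyperedge, so a hyperedge can be backward traceable without being forward reachable; intersecting the two edge sets gives the doubly-reachable hyperedges.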
To show the implication of the theorem, notice ShortestHyperpathHeuristic explores all hyperedges that are forward reachable from s in \({\widetilde{G}}\), inserting hyperedge e into heap H when e is initially reached, again detecting when traversing e causes another hyperedge f to be first reached using counter f.count, and recording in field f.inedges all such e that have reached f. So if an s,t-hyperpath exists in G, which implies sink t has an in-edge e that is forward reachable from s in \({\widetilde{G}}\), this e will eventually be inserted into H, making e.node non-nil, and at the end of the heuristic causing RecoverShortHyperpath to be called on e.
We claim that when function RecoverShortHyperpath (in Fig. 4) is ultimately called on an in-edge to sink t, phase (I) first recovers an edge set S that is an s,t-superpath in G. Considering the hyperedges of S in reverse order of their removal from queue Q, they satisfy the three conditions for an s,t-superpath in Definition 1: the last hyperedge removed from Q solely has s in its tail, each hyperedge in S (other than this last one) has its tail set covered by hyperedges removed later from Q, and the first edge removed has t in its head.
Function RecoverShortHyperpath in phase (II) then trims S to a minimal s,t-superpath, yielding an s,t-hyperpath. Finally, ShortestHyperpathHeuristic returns the shortest such hyperpath found.
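The trimming phase can be pictured with the following Python sketch, which is a simplified stand-in for phase (II) of Fig. 4 rather than a transcription of it. It assumes hyperedges are `(tail, head)` pairs of vertex sets, that `length` maps each edge id to its length-field value, and that edges are tried for removal in decreasing order of that value; the helper `reaches` uses a plain fixpoint loop for brevity instead of the counter-based search.

```python
def reaches(edges, ids, source, target_edge):
    """True if hyperedge `target_edge` is reachable from `source` using only edges in `ids`."""
    reached = {source}
    fired = set()
    changed = True
    while changed:
        changed = False
        for i in ids:
            if i not in fired and edges[i][0] <= reached:
                fired.add(i)
                reached |= edges[i][1]
                changed = True
    return target_edge in fired

def trim_superpath(edges, superpath, length, source, target_edge):
    """Greedily trim an s,e-superpath (a set of edge ids) to a minimal one."""
    kept = set(superpath)
    for i in sorted(superpath, key=lambda j: length[j], reverse=True):
        if i == target_edge:
            continue  # the target hyperedge itself must stay
        if reaches(edges, kept - {i}, source, target_edge):
            kept.remove(i)  # edge i is not needed to reach the target
    return kept
```

One pass suffices for minimality in this sketch, because an edge that could not be dropped from a larger set cannot be dropped from any subset of it.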
Thus whenever a source-sink hyperpath exists, the heuristic finds one.\({\square}\)
Next we prove the heuristic actually solves Shortest Hyperpaths when the input is a singleton-tail hypergraph.
Optimality of the heuristic for singleton-tail hypergraphs
While our heuristic does not necessarily find shortest hyperpaths in general hypergraphs, we can prove that it does find optimal solutions for the following class of hypergraphs.
A singleton-tail hypergraph is a directed hypergraph G where every hyperedge e in G has \(\bigl |\mathrm {tail}(e)\bigr | = 1\). (The head sets of hyperedges can be arbitrary.) In other words, in singleton-tail hypergraphs, the tails of hyperedges are single vertices.
At a high level, the optimality argument for singleton-tail hypergraphs first shows that shortest source-sink hyperpaths are composed of shortest s,e-hyperpaths; then argues that the heuristic’s greedy superpath trimming recovers shortest s,e-hyperpaths when the hyperedge fields hold shortest hyperpath lengths; and finally proves that the heuristic computes exact shortest s,e-hyperpath lengths.
The following characterization states that in singleton-tail hypergraphs, a shortest s,t-hyperpath is composed of shortest s,e-hyperpaths to its constituent hyperedges. This does not hold for general hypergraphs, and is partly why the special case of shortest singleton-tail hyperpaths is polynomial-time solvable.
Lemma 1
(Characterizing shortest singleton-tail hyperpaths) In singleton-tail hypergraphs with nonnegative edge weights, every shortest s,t-hyperpath can be ordered as a sequence \(e_1 \cdots e_k\) of hyperedges where

(i)
each \(\mathrm {head}(e_i) \,\supseteq \, \mathrm {tail}(e_{i+1})\), and

(ii)
every prefix \(e_1 \cdots e_i\) is a shortest \(s,e_i\)-hyperpath.
Proof
Consider a shortest s,t-hyperpath P in a singleton-tail hypergraph. By definition, P is a minimal s,t-superpath, so its edges can be ordered as a sequence \({e_1 \cdots e_k}\) where \(\mathrm {tail}(e_1) = \{s\}\), \({\mathrm {head}(e_k) \supseteq \{t\}}\), and since tail sets contain a single vertex, for every hyperedge \(e_j\) in this sequence other than the first one, there is a prior hyperedge \(e_i\) with \({\mathrm {head}(e_i) \,\supseteq \, \mathrm {tail}(e_j)}\).
Starting from the last hyperedge \(e_k\), and repeatedly picking a prior hyperedge whose head covers the tail of the current hyperedge until reaching tail \(\{s\}\), yields a subsequence \(f_1 \cdots f_\ell\) specifying subset \({Q \,=\, \{f_1, \ldots , f_\ell \} \,\subseteq \, P}\), where again \({\mathrm {tail}(f_1) = \{s\}}\), \({\mathrm {head}(f_\ell ) \supseteq \{t\}}\), and now \({\mathrm {head}(f_i) \,\supseteq \, \mathrm {tail}(f_{i+1})}\) for \({1 \!\le \! i \!<\! \ell }\). Furthermore \({Q = P}\), otherwise P is not minimal. So subsequence \(f_1 \cdots f_\ell\) is exactly sequence \(e_1 \cdots e_k\).
Clearly every prefix \(e_1 \cdots e_i\) is an \(s,e_i\)-superpath. Moreover this prefix must be a minimal \(s,e_i\)-superpath, otherwise P is not minimal. Thus every prefix ending in \(e_i\) is an \(s,e_i\)-hyperpath.
Finally, every prefix \(e_1 \cdots e_i\) must be a shortest \(s,e_i\)-hyperpath. Otherwise, replacing this prefix by a shortest \(s,e_i\)-hyperpath yields an s,t-superpath S of total weight less than P. Furthermore, trimming S to a minimal s,t-superpath under nonnegative edge weights yields an s,t-hyperpath of total weight less than P, contradicting the optimality of P.\({\square}\)
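The chain structure in Lemma 1 suggests a simple cross-check for singleton-tail instances. The Python sketch below deliberately swaps in ordinary Dijkstra on vertices (it is not the hyperpath heuristic): each hyperedge with tail u and weight w is treated as edges of weight w from u to every head vertex. The function name and the `(tail_vertex, head_set, weight)` encoding are assumptions. The returned chain is an s,t-superpath whose weight matches the optimum; with zero-weight hyperedges it may still need trimming to be minimal.

```python
import heapq

def singleton_tail_shortest(edges, s, t):
    """edges: list of (tail_vertex, head_set, weight) with nonnegative weights.

    Returns (total weight, chain of edge ids from s to t), or None if t is unreachable.
    """
    out = {}
    for i, (u, _head, _w) in enumerate(edges):
        out.setdefault(u, []).append(i)
    dist = {s: 0}
    pred = {}            # pred[v] = id of the hyperedge used to reach v
    done = set()
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == t:
            break
        for i in out.get(u, []):
            _u, head, w = edges[i]
            for v in head:
                if v not in dist or d + w < dist[v]:
                    dist[v] = d + w
                    pred[v] = i
                    heapq.heappush(heap, (d + w, v))
    if t not in dist:
        return None
    chain = []
    v = t
    while v != s:        # walk hyperedge tails back to the source
        i = pred[v]
        chain.append(i)
        v = edges[i][0]
    chain.reverse()
    return dist[t], chain
```

A comparison of this kind only applies when every tail is a single vertex; for general hypergraphs no such reduction to ordinary shortest paths is available.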
In the following, the distance of hyperedge e from source s is the total weight of a shortest s,e-hyperpath, which we denote by d(e). Recall that function ShortestHyperpathHeuristic (in Fig. 3) maintains the field e.length, that holds the total weight of the best-known s,e-hyperpath, which upper bounds d(e).
The next lemma states that in singleton-tail hypergraphs, given two key conditions, the greedy superpath trimming that is used by the heuristic to recover a hyperpath to hyperedge e in fact finds a shortest s,e-hyperpath.
Lemma 2
(Recovering hyperpaths in singleton-tail hypergraphs) In a singleton-tail hypergraph with nonnegative edge weights, when the hyperpath heuristic recovers a hyperpath from source s to hyperedge e, suppose

(i)
field e.inedges contains among its hyperedges an in-edge to e from a shortest s,e-hyperpath, and

(ii)
in the s,e-superpath S found when recovering a hyperpath to e, for all hyperedges \(f \in S \!-\! \{e\}\), field f.length holds distance d(f).
Then the hyperpath to e that the heuristic recovers is a shortest s,e-hyperpath.
Proof
We first claim that under the assumptions of the lemma, when the hyperpath heuristic calls RecoverShortHyperpath (in Fig. 4) on a hyperedge e, its first phase recovers an s,e-superpath S that contains a shortest s,e-hyperpath. By assumption (i), field e.inedges contains a hyperedge f on a shortest s,e-hyperpath, and f will be in superpath S, hence by assumption (ii), the value of f.length is d(f). This value came from a shortest s,f-hyperpath Q that was found in a prior call to RecoverShortHyperpath on f, by trimming an s,f-superpath T. Notice that Q followed by e is an s,e-superpath \({\widetilde{P}}\), as \(\mathrm {head}(f) \supseteq \mathrm {tail}(e)\). Now trim \({\widetilde{P}}\) to an s,e-hyperpath P, and let \(P^*\) be a shortest s,e-hyperpath containing f that exists by assumption (i). By Lemma 1 and minimality of hyperpaths, \(P^*\) must consist of a shortest s,f-hyperpath \(Q^*\) followed by e. Under nonnegative edge weights,
Thus P is also a shortest s,e-hyperpath. Since f is in e.inedges, tracing back from e recovers the superpath
so the claim holds.
We next claim that when RecoverShortHyperpath in its second phase greedily trims superpath S, the resulting superpath \(T \subseteq S\) still contains a shortest hyperpath. To show this, we prove that each superpath \(S_i\) that remains after i iterations of greedy trimming contains a shortest s,e-hyperpath, by induction on i. For the basis at \(i \!=\! 0\), the initial superpath \(S_0\) before any trimming contains a shortest hyperpath by our first claim on S. For the induction at \(i \!>\! 0\), let P be a shortest s,e-hyperpath that superpath \(S_{i-1}\) contains by our hypothesis, and let f be the hyperedge removed from \(S_{i-1}\) at iteration i. If \(f \not \in P\), then \(S_i \,=\, S_{i-1} - \{f\}\) trivially contains P. So we assume \(f \in P\). In the following, the core of hyperpath P consists of the tail vertices of its hyperedges.
In an ordering of shortest hyperpath P that satisfies Lemma 1, consider the hyperedges in the suffix of P that begins with f. As edge weights are nonnegative, by Lemma 1 the distances of these hyperedges must be at least d(f), so by assumption (ii) the values of the length field for these hyperedges must be at least f.length. Greedy trimming proceeds in decreasing order of length-field values, so the hyperedges in this suffix of P must either have been already considered for trimming before f, or not yet considered due to being tied with f (from having zero edge-weight). If they were considered before f, then since they were not trimmed, there must be no alternate s,e-hyperpath in \(S_{i-1}\) that enters their head vertices on the core of P. If they were not considered yet, then since f can be removed from \(S_{i-1}\), there must be an alternate s,e-hyperpath \(Q \subseteq S_i\) distinct from P that enters one of the core head-vertices of the hyperedges in this suffix of P whose length field is tied with f. Moreover, this alternate hyperpath Q must enter P with the same length-field value as the edge of P sharing this core head-vertex. (If Q enters P at a smaller length-value, then P is not a shortest s,e-hyperpath; if Q enters at a greater length-value, hyperedge f would not be the next hyperedge removed, as instead a hyperedge from Q of greater length would be.) Since Q enters P at the same length-value, hyperpath Q is also a shortest s,e-hyperpath. Hence \(S_i \supseteq Q\) still contains a shortest hyperpath, which proves the second claim.
So the final trimmed s,e-superpath T returned by RecoverShortHyperpath contains a shortest s,e-hyperpath \(P \subseteq T\). Since T is minimal (as no further edges could be trimmed), and P by definition is minimal, we must have \(T = P\), which proves the lemma. \({\square}\)
We now show that the hyperpath heuristic solves Shortest Hyperpaths for singleton-tail hypergraphs.
Theorem 3
(Optimality of the heuristic on singleton-tail hypergraphs) For singleton-tail hypergraphs with nonnegative edge weights, the hyperpath heuristic finds a shortest source-sink hyperpath.
Proof
The key to proving optimality is showing that in singleton-tail hypergraphs, the estimates that the heuristic computes for shortest hyperpath lengths are exact. Recall that when function ShortestHyperpathHeuristic (in Fig. 3) removes hyperedge e from heap H, it calls RecoverShortHyperpath on e to recover an s,e-hyperpath P, and sets the field \(e.\text {length}\) to \(\omega (P)\), the total weight of P.
We claim that when this assignment occurs, field e.length holds distance d(e), the total weight of a shortest s,e-hyperpath. We now prove this claim by induction on the number of heap extractions. At a high level, the argument is similar to that for Dijkstra’s shortest-path algorithm (see [26, pp. 659–661]) on ordinary directed graphs.
For the basis, the first hyperedge extracted has \(\mathrm {tail}(e) = \{s\}\) and \(e.\text {key} = \omega (e)\), which equals d(e), as e itself is a shortest s,e-hyperpath (since all edge weights are nonnegative). The recovered s,e-hyperpath will consist of e (as e.inedges is empty), so after the assignment field e.length holds d(e).
For the induction, let e be the next hyperedge to be removed from the heap, and assume for all hyperedges h extracted prior to e that h.length holds d(h). Now consider a shortest s,e-hyperpath P, and in the ordering of P given by Lemma 1, let f be the first hyperedge in P that has not yet been removed from the heap. Note that f exists, as e has not been removed yet.
We first show \(f.\text {key} = d(f)\). In the special case where f is the first edge of P, notice \(d(f) = \omega (f)\) by the same reasoning as in the basis. Furthermore \(f.\text {key} = \omega (f)\), as f.key starts at \(\omega (f)\), never increases, and cannot decrease below this minimum value. So \({f.\text {key} = d(f)}\) in this special case.
In the general case where f is not the first edge of P, let g be the in-edge to f on P, and \(Q \subseteq P\) be the prefix of P ending in f, as illustrated in Fig. 5. Notice g has already been extracted from the heap (by the definition of f), so g is in f.inedges (as when a hyperedge is extracted, for all its out-edges h it is added to h.inedges). Furthermore Q is a shortest s,f-hyperpath by Lemma 1, so g is on a shortest hyperpath to f. For all hyperedges h extracted before e, by the induction hypothesis \(h.\text {length} = d(h)\), and only extracted h add themselves to the field inedges of their out-edges. Hence when g was extracted, added itself to f.inedges, and updated f.key by recovering an s,f-hyperpath, in the s,f-superpath S first found during recovery, all hyperedges \(h \in S\) had \(h.\text {length} = d(h)\). Thus by Lemma 2, the recovered s,f-hyperpath was a shortest hyperpath, so this updated f.key to d(f), and as argued before in the special case, this key will not change. So again \(f.\text {key} = d(f)\).
We next show,
\[ e.\text {key} \;\overset{(1)}{\le }\; f.\text {key} \;\overset{(2)}{=}\; d(f) \;\overset{(3)}{\le }\; d(e) \;\overset{(4)}{\le }\; e.\text {key}. \]
In the above, inequality (1) holds since e and f are both on the heap (as f was inserted in the heap either during initialization or when g was extracted), but e is removed before f. Equation (2) is from our prior analysis of f. Inequality (3) holds as Q and P are shortest s,f- and s,e-hyperpaths respectively, while \({Q \subseteq P}\) and edge weights are nonnegative. Lastly, inequality (4) holds since the key of e while it is on the heap is the total weight of some s,e-hyperpath. Thus relations (1)–(4) must all be equalities, so \(e.\text {key} = d(e)\).
We now argue \(e.\text {length} = d(e)\) after e is extracted. Since \(e.\text {key} = d(e)\) is the weight of a hyperpath recovered earlier for e, notice (i) there was an in-edge to e on a shortest s,e-hyperpath in e.inedges; moreover (ii) all hyperedges h in the s,e-superpath collected while recovering a hyperpath for e were extracted earlier, and hence by the induction hypothesis had \(h.\text {length} = d(h)\). Furthermore, hyperedges are never removed from the field inedges, and h.length never changes after h is extracted. Thus the assumptions in Lemma 2 are still met upon extraction of e, so when ShortestHyperpathHeuristic assigns to e.length the total weight of the hyperpath P recovered for e, by Lemma 2 this recovered P will again be a shortest s,e-hyperpath, hence \(e.\text {length} = d(e)\). This completes the inductive proof of our claim.
So for every hyperedge h in the doubly-reachable subgraph explored by ShortestHyperpathHeuristic, after extracting h from the heap, the relation \(h.\text {length} = d(h)\) holds. Finally, when recovering the best s,t-hyperpath at the end of the heuristic by examining the in-edges e to sink t, for each such hyperedge e the assumptions of Lemma 2 are still met (by the same reasoning as above), so the hyperpaths P obtained from calling RecoverShortHyperpath on these sink in-edges e are again shortest s,e-hyperpaths. Since a shortest s,t-hyperpath consists of doubly-reachable hyperedges (by the proof of Theorem 2), and is a shortest s,e-hyperpath for some in-edge e to sink t, the best of these recovered hyperpaths P, which is the hyperpath returned by the heuristic, is a shortest s,t-hyperpath.\({\square}\)
Theorem 3 (in combination with Theorem 1) shows that, while Shortest Hyperpaths is NP-complete for singleton-head hypergraphs [14], it is polynomial-time solvable for singleton-tail hypergraphs.
Generating all source-sink hyperpaths
In this section, we give a practical algorithm for generating all s,t-hyperpaths in a given hypergraph for a fixed source s and sink t. In our later experimental results, we use this algorithm on specific source-sink instances from real cell-signaling networks to tractably measure how close our heuristic is to optimal.
In general, the technique of inclusion and exclusion of Hamacher and Queyranne [27] provides a widely-applicable method for generating all the solutions to any combinatorial optimization problem whose feasible solutions are subsets of a ground set (where in our context, hyperpaths are subsets of hyperedges from a hypergraph), but it relies on the ability to efficiently compute a feasible solution that is constrained to include a given in-set and exclude a given out-set. Interestingly, for hyperpaths, Carbonell et al. [20] have shown that just determining whether an s,t-hyperpath exists that contains a specified in-set of hyperedges (regardless of the length of the hyperpath) is already NP-complete. Consequently, we cannot generate all s,t-hyperpaths using the standard inclusion-exclusion technique, as we cannot tractably solve the resulting subproblem that has both in- and out-set constraints.
Instead, we generate all hyperpaths through a simple and practical algorithm that only involves out-sets, given in Fig. 6. Function AllHyperpaths returns a list of all s,t-hyperpaths in hypergraph G, leveraging a function OneHyperpath that just has to return one s,t-hyperpath P in G that does not contain any hyperedges from set Out (so \(P \cap \text {Out} = \emptyset\)), or determine that no such hyperpath exists. This constrained hyperpath problem with only out-sets is easy to solve: remove all hyperedges in set Out from G, collect all vertices R and hyperedges F reachable from s in this reduced hypergraph, and if \(t \in R\), then find any minimal subset \(P \subseteq F\) in which t is still reachable from s; otherwise if \(t \not \in R\), no such hyperpath exists. Function OneHyperpath can efficiently find such an s,t-hyperpath P excluding set Out using repeated calls to ForwardReachable (given earlier in Fig. 2).
Function AllHyperpaths uses a queue of subproblems. A subproblem is described by a pair \((\text {Out}, \text {Keep})\), which corresponds to finding an s,t-hyperpath excluding Out, where any subsequent subproblems that arise from this given subproblem must not exclude any hyperedges from the set Keep (though their solutions are not required to actually use edges from Keep). The purpose of this set Keep is to ensure that all subproblems ever placed on the queue have distinct Out sets. (So any given subproblem described by an out-set is only ever solved once, as argued in the later section on the time complexity of the hyperpath enumeration algorithm in the proof of Theorem 5.) A subproblem that directly arises from a given one we call a child subproblem (as the entire collection of subproblems conceptually forms a tree that is explored breadth-first using the queue). Each child subproblem excludes one edge from the hyperpath found for its parent subproblem; in this way, the children will generate hyperpaths that are distinct from their parent hyperpath, if they have a solution. (Once a subproblem becomes infeasible due to its out-set eliminating any s,t-hyperpath as a solution, it also does not generate further subproblems.) Though the whole approach never repeatedly solves the same subproblem, in contrast to the inclusion-exclusion technique it can generate the same hyperpath from different subproblems, so we check whether hyperpath P is distinct from those already found before adding it to the list \({\mathcal {A}}\) of all hyperpaths.
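The queue of (Out, Keep) subproblems described above can be sketched in Python as follows. This is an illustrative reading of the text, not a transcription of Fig. 6: the names `target_reachable`, `one_hyperpath`, and `all_hyperpaths`, the `(tail, head)` edge encoding, and the fixpoint reachability test (used in place of ForwardReachable for brevity) are all assumptions.

```python
from collections import deque

def target_reachable(edges, ids, s, t):
    """True if t is forward reachable from s using only the hyperedges in `ids`."""
    reached = {s}
    pending = set(ids)
    changed = True
    while changed:
        changed = False
        for i in list(pending):
            if edges[i][0] <= reached:   # whole tail reached
                reached |= edges[i][1]
                pending.discard(i)
                changed = True
    return t in reached

def one_hyperpath(edges, s, t, out):
    """Return one s,t-hyperpath avoiding `out` as a frozenset of edge ids, or None."""
    kept = set(range(len(edges))) - set(out)
    if not target_reachable(edges, kept, s, t):
        return None
    for i in sorted(kept):               # drop every edge that is not needed
        if target_reachable(edges, kept - {i}, s, t):
            kept.remove(i)
    return frozenset(kept)

def all_hyperpaths(edges, s, t):
    """Enumerate all s,t-hyperpaths via a breadth-first queue of (Out, Keep) pairs."""
    found = []
    queue = deque([(frozenset(), frozenset())])
    while queue:
        out, keep = queue.popleft()
        path = one_hyperpath(edges, s, t, out)
        if path is None:
            continue                     # infeasible subproblem has no children
        if path not in found:
            found.append(path)
        keep = set(keep)
        for e in sorted(path - keep):
            queue.append((out | {e}, frozenset(keep)))
            keep.add(e)                  # later siblings may not exclude e
    return found
```

On a toy hypergraph with edges s→a, s→b, a→t, b→t, and {a,b}→t, this sketch finds the same hyperpath from two different subproblems, which is why the uniqueness check before appending is needed.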
We first prove this enumeration approach is correct, and then analyze its time complexity.
Correctness of the hyperpath enumeration algorithm
We next show that function AllHyperpaths solves the problem of source-sink hyperpath enumeration.
Theorem 4
(Correctness of hyperpath enumeration) The hyperpath enumeration algorithm generates every s,t-hyperpath exactly once.
Proof
For the function AllHyperpaths (in Fig. 6), we view the subproblems it processes as forming a tree: when a problem p is pulled off queue Q and causes a new subproblem q to be put onto Q, these subproblems q comprise the children of p in the tree. Each subproblem is specified by a pair \((\text {Out}, \text {Keep})\), representing the problem of finding an s,t-hyperpath that contains no hyperedge in the set Out. Let P be an s,t-hyperpath satisfying this out-constraint for problem p. Any other s,t-hyperpath \({\widetilde{P}}\) distinct from P that also satisfies the out-constraint for p must not contain some hyperedge in P. (If \({\widetilde{P}}\) contains every hyperedge of P yet is distinct, it is a strict superset of P, contradicting minimality.) Function AllHyperpaths forms the children of p by adding each hyperedge in P to the out-set of p for a different child. (So the hyperpaths satisfying the out-constraints of the children are all hyperpaths that both satisfy the constraints of parent p and are distinct from hyperpath P.) Consequently hyperpath P, together with every solution to the children of p, comprise all possible solutions to problem p.
This tree-like process begins at the root with a problem having an empty out-set (whose solutions are all possible s,t-hyperpaths), and continues refining each problem into its children subproblems until reaching the leaves (which have no solution). Thus the set consisting of each hyperpath P found at the nodes of this tree contains all s,t-hyperpaths.
In brief, function AllHyperpaths generates every s,t-hyperpath. Since it checks for uniqueness, the enumeration algorithm generates every source-sink hyperpath exactly once.\({\square}\)
Time complexity of the hyperpath enumeration algorithm
We now bound the running time of function AllHyperpaths in terms of the number of subproblems it solves, and parameters of the input hypergraph.
Theorem 5
(Time complexity of hyperpath enumeration) The running time of the hyperpath enumeration algorithm, when solving k subproblems on a hypergraph of size \(\ell\) with m hyperedges, is \(O(k \, \ell \, m)\).
Proof
We bound the running time of function AllHyperpaths (in Fig. 6) as follows. Solving a given subproblem from the queue by function OneHyperpath (which finds an s,t-hyperpath by iteratively removing hyperedges from the hypergraph and testing reachability to identify a minimal set in which t is still reachable from s), involves at most m calls to function ForwardReachable. A call to ForwardReachable takes \(O(\ell )\) time (by the analysis in the proof of Theorem 1), so solving a subproblem takes \(O(\ell \, m)\) time. If AllHyperpaths terminates after processing k subproblems, its total time is then \(O(k \, \ell \, m)\).
We argue next that the out-sets of subproblems are all distinct. Consider the tree of subproblems processed by AllHyperpaths (as in the proof of Theorem 4), and two arbitrary subproblems x and y in this tree. If one of x and y is a descendant of the other, their out-sets are distinct, as a child always adds a hyperedge to the out-set of its parent. If neither x nor y is a descendant of the other, let subproblem u be their nearest common ancestor, subproblems v and w be the children of u on the paths to x and y respectively, and assume without loss of generality that child v precedes child w. When child v adds hyperedge e to the set Out of its parent u, edge e is not added to set Out for any other children of u, and e is also added to set Keep for all children of u following v, including w. Furthermore, the set Out for a descendant is a superset of set Out for its ancestors, and set Out for a descendant is always disjoint from set Keep for its ancestors. Consequently, the above hyperedge e is in the out-set of subproblem x but not subproblem y, so their out-sets are again distinct.
Since subproblem out-sets are distinct, \(k \,=\, O(2^m)\). Combining this with the prior total time for hyperpath enumeration yields a worst-case time bound of \(O(2^m \, \ell \, m)\).\({\square}\)
In practice, typically \(k \ll 2^m\), so the running time is much faster than the worst-case bound suggests. Function AllHyperpaths can tractably generate all source-sink hyperpaths for large hypergraphs, as shown in the next section on experimental results, since many of its subproblems quickly become infeasible for real cell-signaling networks.
Experimental results
We now present results from computational experiments on real pathway databases that compare the hyperpath found by our heuristic to the optimal solution. We also remark on the prevalence of biological instances with cyclic shortest hyperpaths, study the cause of suboptimality in our heuristic, report actual running times, and discuss biological examples of cyclic hyperpaths.
Datasets
We evaluate the quality of our heuristic on four datasets built by combining different annotated signaling pathways from two pathway databases, NCI-PID and Reactome. NCI-PID [28] is a curated human-pathway database containing biochemical reactions for complex assembly, cellular transport, and transcriptional regulation. Reactome [29] also contains curated human signaling pathways, and is actively maintained with new reactions being continuously added. We constructed hypergraphs from three subsets of NCI-PID pathways used in Ritz et al. [5], named the Small, Medium, and Large datasets. The Small dataset is a small Wnt signaling pathway consisting of the union of two pathways: “degradation of \(\beta\)-catenin” and “canonical Wnt signaling”. The Medium dataset is a larger Wnt signaling pathway including four additional pathways: “non-canonical Wnt signaling”, “Wnt signaling network”, “regulation of nuclear \(\beta\)-catenin”, and “presenilin action in Notch and Wnt signaling”, which correspond to non-canonical branches of Wnt signaling. The Large dataset contains all NCI-PID pathways. Similarly, the Reactome dataset is the union of all Reactome pathways. The NCI-PID and Reactome pathways were downloaded in the BioPAX format [30] from Pathway Commons, and processed using a parser from Franzese et al. [22] built on PaxTools [31].
To construct the hypergraphs for each dataset, we mapped each entity (such as a protein, small molecule, and so on) to a vertex in the hypergraph. Each complex was represented as a unique vertex distinct from the entities in the complex. Multiple forms of the same protein map to different vertices denoting compartmentalization and post-translational modifications, such as phosphorylation and ubiquitination. We treated each variant as a distinct entity because many pathways describe the transportation of a protein from one cellular compartment to another, or the marking of a protein for degradation by ubiquitination, necessitating that the corresponding vertices be distinct to reflect these variants. Each reaction was mapped to a hyperedge, where the reactants and positive regulators comprise the tail of the hyperedge, and the products comprise the head. All hyperedges were given unit weight, even though the heuristic handles weighted edges, as NCI-PID is missing reaction rates for some reactions.
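The reaction-to-hyperedge mapping just described can be sketched as follows. The record format (dictionaries with `reactants`, `positive_regulators`, and `products` keys) and the tuple naming of entity variants are hypothetical; the actual pipeline parses BioPAX files, which this sketch does not attempt.

```python
def reactions_to_hyperedges(reactions):
    """Map reaction records to unit-weight hyperedges (tail, head, weight).

    Each reaction is a dict with 'reactants', 'products', and optionally
    'positive_regulators' (a hypothetical record format).
    """
    hyperedges = []
    for r in reactions:
        # Reactants and positive regulators together form the tail.
        tail = frozenset(r["reactants"]) | frozenset(r.get("positive_regulators", ()))
        head = frozenset(r["products"])
        hyperedges.append((tail, head, 1))  # unit weight for every reaction
    return hyperedges

# Variants of one protein are distinct vertices, named here by
# (entity, compartment, modification) tuples chosen for illustration.
example = [
    {
        "reactants": [("ProteinA", "cytosol", None)],
        "positive_regulators": [("KinaseB", "cytosol", None)],
        "products": [("ProteinA", "cytosol", "phospho")],
    },
    {
        "reactants": [("ProteinA", "cytosol", "phospho")],
        "products": [("ProteinA", "nucleus", "phospho")],
    },
]
hyperedges = reactions_to_hyperedges(example)
```

Keying vertices by compartment and modification keeps a transport reaction from collapsing into a self-loop on a single protein vertex.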
Table 1 gives statistics on the hypergraphs constructed from each of the four datasets. The hypergraphs are very sparse: there are fewer hyperedges than vertices in all four datasets, with Reactome being even sparser than the NCI-PID datasets. The hypergraphs from the Large and Reactome datasets contain respectively 40 and 433 self-loops, showing that many cyclic hyperpaths are likely to exist. However, a small number of these self-loops are unreachable, due to an otherwise unreachable vertex appearing in both the head and tail of the hyperedge. The sources and targets used in all our experiments are respectively vertices with no in-edges (or vertices whose only in-edge is an unreachable self-loop), and vertices with no out-edges. The number of forward-reachable, backward-traceable, and doubly-reachable hyperedges shows how many hyperedges remain after the heuristic prunes the input hypergraph to the doubly-reachable subgraph before computing a solution. On average, hyperedges from all four hypergraphs have small head and tail sets, and vertices have low in- and out-degree, reflecting the sparseness of the hypergraphs.
Experimental setup
To prepare the hypergraphs from each dataset for our experiments, we parsed the union of the pathways in the dataset. We connected a supersource s to all source vertices (namely, the input vertices with no in-edges) by a single zero-weight hyperedge whose tail consisted of the supersource s and whose head contained all the source vertices. We also included in the head of this hyperedge from supersource s all input vertices whose sole in-edge was a self-loop, since otherwise such a self-loop was not traversable. For each specific target vertex v (namely, each input vertex with no out-edges) we had a separate version of the hypergraph that differed only by connecting this target v to a sink t by a single zero-weight ordinary-graph edge directed from v to t, giving us a specific target instance. Note that these choices for the source and target vertices are reasonable, as they are the molecules where biologists stopped annotating a given pathway. Note also that the supersource s and the sink t remain the same across all target instances in a dataset.
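As a rough illustration of this construction, the following Python sketch builds one target instance from a list of hyperedges. It assumes hyperedges are `(tail, head, weight)` triples of frozen sets, that an in-edge of a vertex is any hyperedge with that vertex in its head, and that a self-loop is a hyperedge with the vertex in both head and tail. The function names and the toy input are hypothetical.

```python
def source_vertices(hyperedges):
    """Vertices with no in-edge, or whose sole in-edge is a self-loop."""
    vertices, in_edges = set(), {}
    for edge in hyperedges:
        tail, head, _ = edge
        vertices |= tail | head
        for v in head:
            in_edges.setdefault(v, []).append(edge)
    sources = set()
    for v in vertices:
        incoming = in_edges.get(v, [])
        if not incoming:
            sources.add(v)
        elif len(incoming) == 1 and v in incoming[0][0]:
            sources.add(v)  # the only in-edge also has v in its tail: a self-loop
    return sources


def make_target_instance(hyperedges, target, s="s", t="t"):
    """Add one zero-weight hyperedge from s to all sources, and one
    zero-weight ordinary edge from the chosen target to the sink t."""
    super_edge = (frozenset([s]), frozenset(source_vertices(hyperedges)), 0.0)
    sink_edge = (frozenset([target]), frozenset([t]), 0.0)
    return [super_edge] + list(hyperedges) + [sink_edge]


# Toy input (illustrative only): D's sole in-edge is a self-loop, E has no out-edges.
edges = [
    (frozenset({"A", "B"}), frozenset({"C"}), 1.0),
    (frozenset({"C", "D"}), frozenset({"D", "E"}), 1.0),
]
instance = make_target_instance(edges, target="E")
print(sorted(instance[0][1]))
```

Only the final sink edge changes between target instances, which matches the statement that s and t are shared across all instances of a dataset.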
For each target instance, we trimmed the hypergraph to the doubly-reachable set: the set of hyperedges that were both forward-reachable from supersource s, and backward-traceable from sink t. Table 1 gives the average and maximum size of the forward-reachable, backward-traceable, and doubly-reachable sets over all target instances for a given dataset. Trimming to the doubly-reachable set dramatically reduces the size of the hypergraph over which the heuristic performs most of its computation.
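A minimal Python sketch of this trimming step follows. It rests on two assumptions that this section does not spell out: a hyperedge is forward-reachable once every vertex in its tail has been reached from s, and a hyperedge is backward-traceable if some vertex in its head has been traced back from t (after which its whole tail is traced). The fixed-point loops are deliberately simple rather than efficient, and all names are hypothetical.

```python
def forward_reachable(hyperedges, s):
    """Indices of hyperedges whose entire tail becomes reachable from s."""
    reached, fired, changed = {s}, set(), True
    while changed:
        changed = False
        for i, (tail, head) in enumerate(hyperedges):
            if i not in fired and tail <= reached:
                fired.add(i)
                reached |= head
                changed = True
    return fired


def backward_traceable(hyperedges, t):
    """Indices of hyperedges with some head vertex traced back from t."""
    traced, used, changed = {t}, set(), True
    while changed:
        changed = False
        for i, (tail, head) in enumerate(hyperedges):
            if i not in used and head & traced:
                used.add(i)
                traced |= tail
                changed = True
    return used


def doubly_reachable(hyperedges, s, t):
    return forward_reachable(hyperedges, s) & backward_traceable(hyperedges, t)


# Toy hypergraph (illustrative only), as (tail, head) pairs.
H = [
    (frozenset({"s"}), frozenset({"a", "b"})),  # 0
    (frozenset({"a", "b"}), frozenset({"c"})),  # 1
    (frozenset({"c"}), frozenset({"t"})),       # 2
    (frozenset({"a", "x"}), frozenset({"t"})),  # 3: x is never reached from s
    (frozenset({"b"}), frozenset({"d"})),       # 4: d does not lead to t
]
print(sorted(doubly_reachable(H, "s", "t")))
```

In the toy input, hyperedge 3 is backward-traceable but not forward-reachable, and hyperedge 4 is the reverse, so both are pruned.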
For each target instance, we found a hyperpath from supersource s to sink t using our shortest hyperpath heuristic implemented in the new tool Hhugin [25], and compared its length to the solution of the MILP of Ritz et al. [21] if the heuristic hyperpath was acyclic. For each cyclic target instance where the heuristic output a cyclic hyperpath, we exhaustively enumerated all s,t-hyperpaths, and compared the heuristic hyperpath to the shortest hyperpath found by this enumeration. (Enumerating all s,t-hyperpaths for one source-sink instance takes on average around 20 hours in practice, so it is not feasible to perform this enumeration on all acyclic target instances.)
Abundance of cyclic hyperpaths
Cyclic shortest hyperpaths appear in all four datasets. To take just one example, in the Small and Medium datasets, the only hyperpath from ubiquitinated \(\beta\)-catenin to APC is cyclic, so for this target instance the acyclic shortest-hyperpath MILP fails to find a solution. Admittedly this particular source-target pair is specially chosen, as ubiquitinated \(\beta\)-catenin has an in-edge and APC has an out-edge, so they would not normally be considered under our definition of sources and targets. Nevertheless, this pair demonstrates there do exist cyclic hyperpaths in the NCI-PID database (even in the union of just two pathways) that are missed by the current state-of-the-art when computing only acyclic shortest hyperpaths.
In the Large dataset, 38 target instances have cyclic heuristic hyperpaths. Of these, 22 were cyclic because of a self-loop, and 16 were cyclic due to a nontrivial cycle. For all these instances, no acyclic hyperpath exists between supersource s and sink t. It is likely that even more cycles exist within the hypergraph from the Large dataset, as there were 8 self-loops that were not on any hyperpath found by the heuristic.
In the Reactome dataset, the heuristic found a cyclic shortest hyperpath on 22 target instances, and only one of these instances was cyclic due to a self-loop. In general, Reactome is much sparser than NCI-PID, and 432 of the 433 self-loops in Reactome are never used in a heuristic hyperpath.
The abundance of cyclic hyperpaths in the NCI-PID and Reactome datasets demonstrates the importance of a shortest hyperpath algorithm that properly handles cycles. We discuss concrete examples of biological cyclic shortest hyperpaths in a later section on biological examples.
Quality of the hyperpath heuristic
To determine the quality of our hyperpath heuristic, we compared the length of the heuristic hyperpath to an optimal shortest hyperpath. In general, no practical exact algorithm is currently known for finding a shortest source-sink hyperpath. Consequently, on the target instances where the heuristic found a cyclic hyperpath, we determined the optimum by generating all source-sink hyperpaths and retaining the shortest one, using our algorithm for hyperpath enumeration. On the target instances where the heuristic found an acyclic hyperpath, we compared its length just to the optimal hyperpath returned by the MILP for shortest acyclic hyperpaths. An even shorter cyclic hyperpath could exist for these latter instances, but finding it by enumerating all hyperpaths is simply too time-consuming to carry out for every such instance.
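The choice of reference optimum depends on whether the heuristic hyperpath is cyclic. The Python sketch below shows one way to make that decision, under the assumption that a hyperpath is cyclic exactly when the ordinary directed graph with an arc from every tail vertex to every head vertex of each hyperedge contains a directed cycle (a self-loop being the smallest case). The test uses Kahn's topological-sort algorithm; `is_cyclic` and `reference_method` are hypothetical names, not functions from Hhugin.

```python
from collections import defaultdict, deque


def is_cyclic(hyperpath):
    """hyperpath: iterable of (tail, head) pairs of vertex sets."""
    succ, indeg, vertices = defaultdict(set), defaultdict(int), set()
    for tail, head in hyperpath:
        vertices |= set(tail) | set(head)
        for u in tail:
            for v in head:
                if v not in succ[u]:
                    succ[u].add(v)
                    indeg[v] += 1
    # Kahn's algorithm: repeatedly remove vertices with no remaining in-arcs.
    queue = deque(v for v in vertices if indeg[v] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return removed < len(vertices)  # leftover vertices lie on or behind a cycle


def reference_method(hyperpath):
    """Which exact method the heuristic hyperpath is compared against."""
    return "enumeration" if is_cyclic(hyperpath) else "acyclic MILP"


# Toy hyperpaths (illustrative only).
acyclic_path = [({"s"}, {"a"}), ({"a"}, {"t"})]
cyclic_path = [({"s"}, {"x"}), ({"x"}, {"c"}), ({"c"}, {"x", "t"})]
print(reference_method(acyclic_path), reference_method(cyclic_path))
```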
Table 2 summarizes the quality of the heuristic on acyclic instances. On the Small, Medium, and Reactome datasets, the heuristic hyperpath is optimal on all target instances, meaning the heuristic hyperpath and the shortest acyclic hyperpath from the MILP have the same length. On the Large dataset, the heuristic is optimal on over 99% of the instances, demonstrating the quality of the heuristic on these biological datasets. The small fraction of instances where our heuristic was suboptimal are discussed in more detail in the next subsection.
Table 3 summarizes the quality of the heuristic on instances where it output a cyclic hyperpath. On all these cyclic instances, the acyclic MILP failed to find a solution, so we could not compare the heuristic to an optimal hyperpath other than by exhaustively enumerating all hyperpaths and picking the shortest one, which verified that the heuristic on these instances in fact found an optimal solution. Cyclic instances from the Reactome (and Large) datasets contain many distinct hyperpaths, with a median of 22 (respectively 3) hyperpaths, and a maximum of 136 (respectively 364) hyperpaths. The hyperpaths tend to vary in length, with a maximum difference between the length of the longest and shortest hyperpath of 15 (respectively 43) hyperedges, and a median difference of [2, 3] (respectively 1) hyperedges. This demonstrates that the heuristic is discriminating between hyperpaths of different lengths and choosing the best hyperpath over worse hyperpaths, further indicating the quality of the heuristic. In every cyclic target instance, all s,t-hyperpaths were cyclic, and many shared a common cycle; most of the hyperedges occurring in one hyperpath but not another appeared outside this shared cycle.
Studying the suboptimality of the heuristic
We call the small number of target instances in these experiments where the heuristic found a known suboptimal hyperpath its suboptimal instances. Table 4 summarizes these 23 suboptimal instances, which are all from the Large NCI-PID dataset, and are all acyclic instances. (The heuristic was optimal on all cyclic instances, and all Reactome, Small, and Medium instances. We mention as well that the maximum values across the table occur in distinct target instances.) To gain insight into why the heuristic found a suboptimal solution on these instances, we enumerated all source-sink hyperpaths for every suboptimal instance. (This enumeration also verified that on all suboptimal instances, the acyclic MILP in fact found a shortest hyperpath, as there was no shorter cyclic hyperpath.)
Hyperpath enumeration confirmed that these suboptimal instances are much harder than the cyclic instances. The median number of hyperpaths is nearly 140 times higher for suboptimal NCI-PID instances compared to cyclic NCI-PID instances, and the length difference between the longest and shortest hyperpaths is 30 times larger. This stark contrast indicates the inherent difficulty of these suboptimal instances, where the heuristic must now discriminate among a much higher number of hyperpaths that have much greater path-length variance. The fraction of all hyperpaths that are optimal is fairly small, with only around 3% being optimal for the median instance. Even faced with many alternate solutions, the heuristic still found a hyperpath that was nearly optimal: the median difference between the length of the heuristic hyperpath and the shortest hyperpath was 1 hyperedge, the maximum difference was 6 hyperedges, and the median ratio of the length of the heuristic hyperpath to the shortest hyperpath was 1.1 (so it was only 10% longer). Next we investigate what could be causing this suboptimality.
The suboptimality of the heuristic is likely coming from the repeated calls to the function RecoverShortHyperpath, which proceeds in two phases. In phase (I), this function recovers an s,e-superpath S, relying on in-edge lists to hyperedges f, where the in-edge list for f contains only hyperedges removed from the heap prior to f, which may exclude hyperedges in a shortest s,e-hyperpath. In phase (II), this function trims superpath S to a hyperpath by greedily considering hyperedges in S for removal, which may also remove a hyperedge in an optimal s,e-hyperpath.
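To illustrate why greedy trimming in phase (II) can discard a hyperedge of an optimal hyperpath, here is a small Python sketch. It is not the RecoverShortHyperpath routine itself: the order in which hyperedges are considered, the reachability test, and the names `reaches` and `trim_superpath` are all assumptions made for illustration, and hyperedges are assumed to be distinct `(tail, head)` pairs of frozen sets.

```python
def reaches(edges, s, t):
    """True if t is reachable from s when a hyperedge can only be used
    once every vertex in its tail has been reached."""
    reached, changed = {s}, True
    while changed:
        changed = False
        for tail, head in edges:
            if tail <= reached and not head <= reached:
                reached |= head
                changed = True
    return t in reached


def trim_superpath(superpath, s, t):
    """Greedy trimming sketch: consider hyperedges in the given order and
    drop each one whose removal still leaves t reachable from s."""
    kept = list(superpath)
    for edge in list(superpath):
        candidate = [f for f in kept if f != edge]
        if reaches(candidate, s, t):
            kept = candidate
    return kept


# Toy superpath (illustrative only): a two-edge route via a, and a direct edge.
e1 = (frozenset({"s"}), frozenset({"a"}))
e2 = (frozenset({"a"}), frozenset({"t"}))
e3 = (frozenset({"s"}), frozenset({"t"}))
print(len(trim_superpath([e1, e2, e3], "s", "t")))
print(len(trim_superpath([e3, e1, e2], "s", "t")))
```

Both results are valid hyperpaths with no removable hyperedge, but when the direct edge `e3` is considered first it is removed, leaving the longer two-edge route. The outcome of greedy trimming therefore depends on the order of consideration.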
To determine whether the recover or trim phases were responsible for suboptimality, we ran the following experiment. After the heuristic determined its estimated path length for every hyperedge in the hypergraph, we called RecoverShortHyperpath on each in-edge to the target, where we ran its recovery phase but stopped before its trimming phase, and unioned together the resulting s,t-superpaths from each in-edge to create one large s,t-superpath F. We then took an optimal s,t-hyperpath P and examined whether \(P \subseteq F\): in other words, whether the recovery phase permitted the heuristic to potentially find an optimal hyperpath. We discovered that for all 23 suboptimal instances \(P \not\subseteq F\), indicating that phase (I) of RecoverShortHyperpath, which recovers an s,e-superpath, was forcing the heuristic to be suboptimal on every instance.
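The containment test in this experiment amounts to a set union followed by a subset check. A minimal Python sketch, assuming hyperedges are represented by hashable identifiers and using hypothetical function names and toy data:

```python
def union_superpath(superpaths):
    """Union the superpaths recovered from each in-edge to the target into F."""
    F = set()
    for S in superpaths:
        F |= set(S)
    return F


def recovery_permits_optimum(optimal_hyperpath, superpaths):
    """True if every hyperedge of the optimal hyperpath P lies in F,
    i.e. P is a subset of F."""
    return set(optimal_hyperpath) <= union_superpath(superpaths)


# Toy data (illustrative only): hyperedge identifiers are strings.
recovered = [{"e1", "e2"}, {"e2", "e3"}]
print(recovery_permits_optimum({"e1", "e3"}, recovered))
print(recovery_permits_optimum({"e1", "e4"}, recovered))
```

A result of `False` corresponds to the case \(P \not\subseteq F\): no trimming of F could then yield that particular optimal hyperpath.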
On the other hand, the trimming phase of RecoverShortHyperpath could also be leading to suboptimality, which we investigated as follows. For each suboptimal instance, we modified the recovery phase of RecoverShortHyperpath to use all in-edges in the hypergraph to each hyperedge, rather than the in-edge lists collected by the heuristic. (In this situation, the recovered superpath F definitely contains a shortest hyperpath P.) Phase (II) then trimmed this superpath as normal. We discovered that the trimming phase often fails to find a shortest hyperpath within this larger superpath (which was the entire doubly-reachable subgraph). This indicates that while phase (I) is definitely causing suboptimality, simply changing phase (I) to recover a larger superpath may in turn lead to suboptimality in phase (II).
Implementation and running time
The heuristic is implemented in Python 2.7.3, comprising around 500 lines of code. The parser used to convert the BioPAX format into hypergraphs is from [22]. For directed hypergraph representation and reachability we used Halp (github.com/Murali-group/halp/). All heuristic and hyperpath enumeration source code is available at http://hhugin.cs.arizona.edu.
Experiments were run on a laptop with a 2.9 GHz Intel Core i5 CPU, and 16 GB of RAM. The running time of the hyperpath heuristic was 55 seconds on average for the instances from the Large and Reactome datasets, which have just under 1000 doubly-reachable hyperedges on average. Memory usage was low, with the heuristic using less than 2 GB of memory.
Enumerating all hyperpaths for the instances is time-consuming, taking 20.4 hours on average for the suboptimal instances with a maximum time of 53.8 hours, which is not practical to carry out for all 4600 target instances.
Biological examples
We now discuss three instances with cyclic shortest hyperpaths from the Large and Reactome datasets. The hyperpath found by our heuristic for these three instances is optimal (as was the case for all instances where the heuristic found a cyclic path), and is drawn in Figs. 7, 8, and 9. We describe the hypergraph structure and constituent reactions for each instance.
Assembly of the JUP/DSP complex. The first example captures the assembly of the JUP/DSP complex from the Large dataset. Figure 7 shows the shortest hyperpath returned by our heuristic with the JUP/DSP complex as the target. All vertices at the top of the figure are connected to the supersource.
This hyperpath includes seven hyperedges from four different NCI-PID pathways: “E-cadherin signaling in the nascent adherens junction” (hyperedges \(e_1\) and \(e_5\)), “Posttranslational regulation of adherens junction stability and dissassembly” (hyperedges \(e_2\), \(e_6\) and \(e_7\)), “Signaling events mediated by PRL” (hyperedge \(e_3\)), and “Signaling events mediated by hepatocyte growth factor receptor (c-Met)” (hyperedge \(e_4\)). We briefly describe the key events in this hyperpath. Protein \(\gamma\)-catenin (also known as junction plakoglobin or JUP) is initially complexed with Cadherin 1 (CDH1) in the tail of hyperedge \(e_1\). In hyperedge \(e_2\), the metalloprotease meprin-\(\beta\) cleaves E-cadherin (CDH1), releasing it from its complex with \(\alpha\)-catenin (CTNNA1) and \(\delta\)-catenin (CTNND1) [32]. The CDH1/JUP complex adds \(\alpha\)-catenin (CTNNA1 in hyperedge \(e_1\)) and CTNND1 and Ca\(^{2+}\) (in hyperedge \(e_5\)) to form a five-member complex. Hepatocyte growth factor (HGF) activates the proto-oncogene tyrosine-protein kinase Src (hyperedge \(e_4\)) [33]. Src regulates the breakup of this complex into its individual components [34] (hyperedge \(e_6\)), freeing JUP to bind with DSP and creating the two cycles in this hyperpath via CTNNA1 and CTNNB1. The hyperpath culminates in the formation of a complex between desmoplakin (DSP) and JUP.
The hypergraph for this instance is large, with 6168 forward-reachable hyperedges, 2642 backward-traceable hyperedges, and 1665 doubly-reachable hyperedges. There is no acyclic hyperpath from the supersource to JUP/DSP. When enumerating all s,t-hyperpaths for this instance, there were 16 alternate hyperpaths, and the longest hyperpath had 3 more hyperedges than the heuristic path, which was verified to be optimal.
Phosphorylation of p53. The second example captures the phosphorylation of p53 by NUAK1 (ARK5) from the Reactome dataset. The heuristic hyperpath, which is optimal, is shown in Fig. 8. All of the vertices at the top are connected to the supersource.
Hyperedge \(e_1\) shows the complex formation of FOXO3 and FOXO4 with the STK11 gene, allowing for the transcription of the gene in hyperedge \(e_2\). Hyperedges \(e_3\) and \(e_4\) deal with the transcription of protein p53 (TP53), and its formation into a homotetramer. The p53 tetramer then forms a complex with NUAK1 (ARK5) and STK11 in hyperedge \(e_5\), allowing for the phosphorylation of NUAK1 via ATP in hyperedge \(e_6\). Once NUAK1 is phosphorylated, it directly phosphorylates p53 [35], activating it and allowing it to assist in DNA damage repair. The final hyperedge \(e_7\), shown in red, breaks apart the p53 tetramer/NUAK1/STK11 complex, resulting in a cycle of free STK11. This hyperpath features two transcriptional hyperedges \(e_2\) and \(e_3\), shown dotted.
This example from Reactome is slightly smaller than the example from the Large dataset, with only 4645 forward-reachable edges, 7021 backward-traceable edges, and 1632 hyperedges in the doubly-reachable set. There was no acyclic hyperpath for this instance. In contrast to the first example, no alternate hyperpaths to the target exist in the hypergraph.
HEY2/ARNT complex assembly. The final example we discuss is the formation of the HEY2/ARNT complex from the Large dataset. The shortest hyperpath from the supersource to HEY2/ARNT, which was found by the heuristic, is shown in Fig. 9. Once again, the sources are at the top of the figure, with the hyperedge from the supersource not shown.
This hyperpath with eleven edges spans three pathways: “Notch signaling pathway” (hyperedges \(e_1\)–\(e_7\)), “Hypoxic and oxygen homeostasis regulation of HIF-1-\(\alpha\)” (hyperedges \(e_9, e_{10}\)), and “Notch-mediated HES/HEY network” (hyperedges \(e_8, e_{11}\)). Hypoxia-inducible factor 1 (HIF-1) is a heterodimeric transcription factor that regulates genes that are induced by hypoxia [36]. It is a complex of HIF-1\(\alpha\) (HIF1A) and HIF-1\(\beta\) (aryl hydrocarbon receptor nuclear translocator or ARNT). “Hairy/enhancer-of-split related with YRPW motif protein 2” (HEY2) is a transcriptional repressor [37] that physically interacts with ARNT (hyperedge \(e_{11}\)). The hyperedges \(e_9\) and \(e_{11}\) show a pair of reactions where HIF-1 is formed and then repressed by HEY2. Hyperedges \(e_1\)–\(e_7\) capture events in the Notch signaling pathway that occur upstream of the formation of the transcriptional activator formed by the complex of the nuclear protein “Recombining binding protein suppressor of hairless” (RBPJ) and Notch intracellular domain (NICD). The expression of protein HEY2 is upregulated by the NICD/RBPJ complex [38].
This signaling hypergraph was markedly smaller than the other two examples. The hypergraph had 6169 forward-reachable hyperedges, but only 23 hyperedges were backward-traceable, hence only 23 hyperedges were doubly-reachable, due to the poor connectivity of the HEY2/ARNT complex to other vertices in the graph. Even though the hypergraph is small, the hyperpath shown is not the only shortest hyperpath to the target, as \(e_2\) and \(e_3\) can be replaced by hyperedges containing Jagged2 instead of Jagged1.
Conclusions
We have presented the first heuristic for Shortest Hyperpaths in general directed hypergraphs with positive edge weights, where the length of a hyperpath is the sum of the weights of its hyperedges. The heuristic handles cycles, is guaranteed to be efficient, finds optimal hyperpaths for singleton-tail hypergraphs, and is highly accurate in practice. It matches the state-of-the-art mixed-integer linear program for shortest acyclic hyperpaths on over 99% of all instances from the NCI-PID and Reactome databases, and surpasses the state-of-the-art on all instances where no acyclic hyperpath exists. Moreover, exhaustively enumerating all source-sink hyperpaths using our hyperpath enumeration algorithm demonstrates that on every cyclic instance from these databases, the heuristic was provably optimal.
Further research
Given that we can quickly find hyperpaths that are close to optimal in real cell-signaling hypergraphs, several research directions beckon. While the inapproximability of Shortest Hyperpaths [16] rules out a constant-factor approximation unless \(\text{P} = \text{NP}\), is there an approximation algorithm whose approximation ratio on hypergraphs with n vertices matches the theoretical lower bound of \(\ln n\)? More practically, given that in our experiments our heuristic was suboptimal only on acyclic instances, is there a fast method for acyclic hyperpaths that outperforms our heuristic? Since a user would like to know how close to optimal a computed hyperpath is for their particular input graph, is there an efficient heuristic that, as well as giving an upper bound on the optimum through its hyperpath, also outputs a lower bound on the length of the shortest hyperpath? Many intriguing research avenues are open.
Availability of data and materials
Source code for the hyperpath heuristic and the hyperpath enumeration algorithm, as well as the hypergraphs from the parsed Reactome, Small, Medium, and Large datasets, is available free for noncommercial use at http://hhugin.cs.arizona.edu.
Change history
29 December 2022
A Correction to this paper has been published: https://doi.org/10.1186/s13015-022-00222-y
References
1. Li Y, McGrail DJ, Latysheva N, Yi S, Babu MM, Sahni N. Pathway perturbations in signaling networks: linking genotype to phenotype. Semin Cell Dev Biol. 2020;99:3–11.
2. Sharan R, Ideker T. Modeling cellular machinery through biological network comparison. Nat Biotechnol. 2006;24(4):427–33.
3. Vidal M, Cusick ME, Barabási AL. Interactome networks and human disease. Cell. 2011;144(6):986–98.
4. Klamt S, Haus UU, Theis F. Hypergraphs and cellular networks. PLoS Comput Biol. 2009;5(5):1000385.
5. Ritz A, Tegge AN, Kim H, Poirel CL, Murali TM. Signaling hypergraphs. Trends Biotechnol. 2014;32(7):356–62.
6. Ramadan E, Tarafdar A, Pothen A. A hypergraph model for the yeast protein complex network. In: Proceedings of the 18th Parallel and Distributed Processing Symposium. 2004. p. 189–196.
7. Hu Z, Mellor J, Wu J, Kanehisa M, Stuart JM, DeLisi C. Towards zoomable multidimensional maps of the cell. Nat Biotechnol. 2007;25(5):547–54.
8. Christensen TS, Oliveira AP, Nielsen J. Reconstruction and logical modeling of glucose repression signaling pathways in Saccharomyces cerevisiae. BMC Syst Biol. 2009;3(1):7.
9. Heath LS, Sioson AA. Semantics of multimodal network models. IEEE/ACM Trans Comput Biol Bioinform. 2009;6(2):271–80.
10. Ramadan E, Perincheri S, Tuck D. A hypergraph approach for analyzing transcriptional networks in breast cancer. In: Proceedings of the 1st ACM Conference on Bioinformatics and Computational Biology (ACM-BCB). 2010. p. 556–562.
11. Zhou W, Nakhleh L. Properties of metabolic graphs: biological organization or representation artifacts? BMC Bioinform. 2011;12(1):132.
12. Ritz A, Murali TM. Pathway analysis with signaling hypergraphs. In: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB). 2014. p. 249–258.
13. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. Molecular Biology of the Cell. New York: Garland Science; 2007.
14. Italiano GF, Nanni U. Online maintenance of minimal directed hypergraphs. Technical Report, Department of Computer Science, Columbia University. 1989.
15. Gallo G, Longo G, Pallottino S, Nguyen S. Directed hypergraphs and applications. Discret Appl Math. 1993;42(2–3):177–201.
16. Ausiello G, Laura L. Directed hypergraphs: introduction and fundamental algorithms—a survey. Theor Comput Sci. 2017;658:293–306.
17. Cottret L, Vieira Milreu P, Acuña V, Marchetti-Spaccamela A, Viduani Martinez F, Sagot MF, Stougie L. Enumerating precursor sets of target metabolites in a metabolic network. In: Proceedings of the 8th Workshop on Algorithms in Bioinformatics (WABI). 2008. p. 233–244.
18. Acuña V, Milreu PV, Cottret L, Marchetti-Spaccamela A, Stougie L, Sagot MF. Algorithms and complexity of enumerating minimal precursor sets in genome-wide metabolic networks. Bioinformatics. 2012;28(19):2474–83.
19. Andrade R, Wannagat M, Klein CC, Acuña V, Marchetti-Spaccamela A, Milreu PV, Stougie L, Sagot MF. Enumeration of minimal stoichiometric precursor sets in metabolic networks. Algorithms Mol Biol. 2016;11(1):25.
20. Carbonell P, Fichera D, Pandit SB, Faulon JL. Enumerating metabolic pathways for the production of heterologous target chemicals in chassis organisms. BMC Syst Biol. 2012;6(1):10.
21. Ritz A, Avent B, Murali TM. Pathway analysis with signaling hypergraphs. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(5):1042–55.
22. Franzese N, Groce A, Murali TM, Ritz A. Hypergraph-based connectivity measures for signaling pathway topologies. PLoS Comput Biol. 2019;15(10):1–26.
23. Schwob MR, Zhan J, Dempsey A. Modeling cell communication with time-dependent signaling hypergraphs. IEEE/ACM Trans Comput Biol Bioinform. 2021;18(3):1151–63.
24. Nielsen LR, Pretolani D. A remark on the definition of a \(B\)-hyperpath. Technical Report, Department of Operations Research, University of Aarhus. 2001.
25. Krieger S, Kececioglu J. Hhugin: hypergraph heuristic for general shortest source-sink hyperpaths, version 1.0. 2021. http://hhugin.cs.arizona.edu
26. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. 3rd ed. Cambridge, Massachusetts: MIT Press; 2009.
27. Hamacher HW, Queyranne M. \(K\) best solutions to combinatorial optimization problems. Ann Oper Res. 1985;4:123–43.
28. Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. PID: the pathway interaction database. Nucl Acids Res. 2009;37:674–9.
29. Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L. Reactome: a knowledgebase of biological pathways. Nucl Acids Res. 2005;33:428–32.
30. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol. 2010;28(9):935–42.
31. Demir E, Babur Ö, Rodchenkov I, Aksoy BA, Fukuda KI, Gross B, Sümer OS, Bader GD, Sander C. Using biological pathway data with Paxtools. PLoS Comput Biol. 2013;9(9):1003194.
32. Huguenin M, Müller EJ, Trachsel-Rösmann S, Oneda B, Ambort D, Sterchi EE, Lottaz D. The metalloprotease meprinbeta processes E-cadherin and weakens intercellular adhesion. PLoS One. 2008;3(5):2153.
33. Palacios F, Tushir JS, Fujita Y, D’Souza-Schorey C. Lysosomal targeting of E-cadherin: a unique mechanism for the downregulation of cell-cell adhesion during epithelial to mesenchymal transitions. Mol Cell Biol. 2005;25(1):389–402.
34. Miravet S, Piedra J, Castaño J, Raurell I, Francì C, Duñach M, García de Herreros A. Tyrosine phosphorylation of plakoglobin causes contrary effects on its association with desmosomes and adherens junction components and modulates \(\beta\)-catenin-mediated transcription. Mol Cell Biol. 2003;23(20):7391–402.
35. Hou X, Liu JE, Liu W, Liu CY, Liu ZY, Sun ZY. A new role of NUAK1: directly phosphorylating p53 and regulating cell proliferation. Oncogene. 2011;30(26):2933–42.
36. Jiang BH, Rue E, Wang GL, Roe R, Semenza GL. Dimerization, DNA binding, and transactivation properties of hypoxia-inducible factor 1. J Biol Chem. 1996;271(30):17771–8.
37. Chin MT, Maemura K, Fukumoto S, Jain MK, Layne MD, Watanabe M, Hsieh CM, Lee ME. Cardiovascular basic helix loop helix factor 1, a novel transcriptional repressor expressed preferentially in the developing and adult cardiovascular system. J Biol Chem. 2000;275(9):6381–7.
38. Iso T, Chung G, Hamamori Y, Kedes L. HERP1 is a cell type-specific primary target of Notch. J Biol Chem. 2002;277(8):6598–607.
39. Krieger S, Kececioglu J. Fast approximate shortest hyperpaths for inferring pathways in cell signaling hypergraphs. In: Proceedings of the 21st ISCB Workshop on Algorithms in Bioinformatics (WABI). Leibniz International Proceedings in Informatics, vol 201. 2021. p. 1–20.
Acknowledgements
We especially wish to thank T.M. Murali for introducing us to the problem of shortest hyperpaths in cell-signaling hypergraphs, for orienting us to the biology literature, and for discussing the JUP/DSP biological example. In addition, we thank Anna Ritz for discussing the NCI-PID and Reactome datasets, and for providing the BioPAX parser. We also thank the anonymous reviewers for their helpful comments.
This paper is an extended journal version of a prior conference paper by the coauthors [39].
Funding
This research was supported by the US National Science Foundation through grants CCF-1617192 and IIS-2041613 to JK.
Author information
Contributions
SK and JK designed and analyzed the hyperpath heuristic and hyperpath enumeration algorithm. SK implemented the hyperpath heuristic and hyperpath enumeration algorithm, and performed all experiments. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The formatting issues of definitions, lemmas, theorems, and mathematical proofs throughout the paper were corrected.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Krieger, S., Kececioglu, J. Heuristic shortest hyperpaths in cell signaling hypergraphs. Algorithms Mol Biol 17, 12 (2022). https://doi.org/10.1186/s13015-022-00217-9
DOI: https://doi.org/10.1186/s13015-022-00217-9