 Research
 Open Access
 Published:
Unrooted unordered homeomorphic subtree alignment of RNA trees
Algorithms for Molecular Biology volume 8, Article number: 13 (2013)
Abstract
We generalize some current approaches for RNA tree alignment, which are traditionally confined to ordered rooted mappings, to also consider unordered unrooted mappings. We define the Homeomorphic Subtree Alignment problem (HSA), and present a new algorithm which applies to several modes, combining global or local, ordered or unordered, and rooted or unrooted tree alignments. Our algorithm generalizes previous algorithms that either solved the problem in an asymmetric manner, or were restricted to the rooted and/or ordered cases. Focusing here on the most general unrooted unordered case, we show that for input trees T and S, our algorithm has an O(n_{ T }n_{ S } + min(d_{ T },d_{ S })L_{ T }L_{ S }) time complexity, where n_{ T },L_{ T } and d_{ T } are the number of nodes, the number of leaves, and the maximum node degree in T, respectively (satisfying d_{ T } ≤ L_{ T } ≤ n_{ T }), and similarly for n_{ S },L_{ S } and d_{ S } with respect to the tree S. This improves the time complexity of previous algorithms for less general variants of the problem.
In order to obtain this time bound for HSA, we developed new algorithms for a generalized variant of the MinCost Bipartite Matching problem (MCM), as well as to two derivatives of this problem, entitled AllCavityMCM and AllPairsCavityMCM. For two input sets of size n and m, where n ≤ m, MCM and both its cavity derivatives are solved in O(n^{3} + n m) time, without the usage of priority queues (e.g. Fibonacci heaps) or other complex data structures. This gives the first cubic time algorithm for AllPairsCavityMCM, and improves the running times of MCM and AllCavityMCM problems in the unbalanced case where n ≪ m.
We implemented the algorithm (in all modes mentioned above) as a graphical software tool which computes and displays similarities between secondary structures of RNA given as input, and employed it to a preliminary experiment in which we ran allagainstall interfamily pairwise alignments of RNAse P and Hammerhead RNA family members, exposing new similarities which could not be detected by the traditional rooted ordered alignment approaches. The results demonstrate that our approach can be used to expose structural similarity between some RNAs with higher sensitivity than the traditional rooted ordered alignment approaches. Source code and webinterface for our tool can be found in http://www.cs.bgu.ac.il/\~negevcb/FRUUT.
Background
Secondary structure of RNA molecules serves important functions in many noncoding RNAs [1]. Functional constraints lead to evolutionary structural conservation that in many cases exceeds the level of sequence conservation. Thus, detecting similarity between RNA secondary structures is of major importance in functional RNA research [2, 3].
A mainstream approach for (pseudoknot free) RNA secondary structure comparison represents them as trees, and applies tree alignment algorithms [4–6] to their comparison.
Several variants of tree edit distance and alignment problems were previously studied. These variants differ in the type of trees they examine (ordered/unordered, rooted/unrooted), in the type of edit operations or alignment restrictions they apply [4–14], and in their algorithmic approaches (see [7]).
Currently available bioinformatic softwares for RNA tree comparison usually apply rooted ordered tree alignment[11, 12, 14, 15]. However, there are known evolutionary phenomena, such as segment insertions, translocations and reversals, which may result in a reordering or rerooting of RNA structural elements [16]. These events can yield two similarly structured motifs, which are rooted differently (with respect to the standard “external loop” corresponding roots) [17], and/or permuted with respect to branching order. There are known examples of such unrooted/unordered RNA structural conservations [18, 19] (Figure 1), therefore, it is possible that searching for unordered and unrooted structural similarities may reveal new relations between RNA molecules that were previously undetected.
The general Unordered Tree Edit Distance problem is MAXSNP hard [20], promoting the study of constrained variants. The Subtree Isomorphism problem [21, 22] is, given a pattern tree S and a text tree T, to find if there is some subtree T′ of T which is isomorphic to S. The Subtree Homeomorphism problem [23–25] is a variant of the former problem, where degree2 nodes may be deleted from the selected subtree T′ of the text. Pinter et al. [26] efficiently solved the Subtree Homeomorphism problem, under the unrooted unordered settings. In addition, their algorithm assigns costs for alignments and finds an alignment of minimum cost, thus solving a weighted variant of the problem. The running time of the algorithm of [26] is $O({n}_{S}^{2}{n}_{T}+{n}_{S}{n}_{T}log{n}_{T})$, where n_{ T } and n_{ S } are the number of nodes in T and S, respectively (improved time complexities under some scoring scheme restrictions were also shown in [26]). The Constrained Edit Distance Between Unordered Labeled Trees problem, presented by Zhang in [27], is a restricted version of rooted unordered tree edit distance, which allows the edit operations of node relabeling, subtree pruning, and deletions of degree2 nodes (where in the general edit distance variant, nodes of arbitrary degrees may be deleted). Zhang gave an O(n_{ T }n_{ S }(d_{ T } + d_{ S })l o g(d_{ T } + d_{ S })) time algorithm for this variant, where d_{ T } and d_{ S } are the maximum node degrees in T and S respectively. In this sense, the algorithm of [27] can be viewed as a symmetric (allowing deletions from both input trees), yet rooted variant of the algorithm of [26].
The essential approach in many tree comparison algorithms is a recursive rooted comparison of subtrees of the input trees, and finding the best combination of such subinstance solutions to yield a solution for the input instance. The computation considers the cases in which either one of the roots is deleted, and the case where the roots are aligned to each other. In the latter case, it is required to find an optimal matching between the two children sets of the roots, where in the ordered variant such matching is restricted to maintain the child order (and may be computed by a reduction to sequence alignment), and in the unordered variant no such restriction holds (and thus an optimal matching can be found by a bipartite graph matching algorithm).
Our contribution
We propose an efficient algorithm for comparing unordered unrooted trees. Specifically, we define the Homeomorphic Subtree Alignment (HSA) problem, for which we give an O(n_{ T }n_{ S } + m i n(d_{ T },d_{ S })L_{ T }L_{ S }) running time algorithm, where L_{ T } and L_{ S } are the numbers of leaves in the input trees T and S, respectively. Our approach can be viewed as a generalization of the two previous works [26, 27], which relaxes the asymmetric “textpattern” restriction of [26], as well as the rooting restriction of [27].
Both algorithms in [26, 27], as well as the algorithm presented here, make use of subroutines for solving the Minimum Cost Bipartite Matching (MCM) problem, which dictate their time complexities. Here, we define the AllPairsCavityMCM problem, a generalization of the AllCavityMCM problem [28], and show how to integrate it into our tree alignment algorithm. For MCM and both its cavity derivatives, we use similar ideas to those applied in [29] to obtain O(n^{3} + n m) time algorithms, where n and m are the sizes of the input sets, and n ≤ m. This gives the first cubic time algorithm for AllPairsCavityMCM, and improves the running times of MCM and AllCavityMCM problems in the unbalanced case where n ≪ m. The new MCM algorithms we developed allow our HSA algorithm to match, and even improve, the running times of the previous algorithms of [26, 27] for less general variants of the HSA problem.
We implemented the algorithm (in all combininations of global or local, ordered or unordered, and rooted or unrooted modes) as a graphical software tool named FRUUT which computes and displays similarities between secondary structures of RNA given as input, and employed it to a preliminary experiment in which we ran allagainstall interfamily pairwise alignments of RNAse P and Hammerhead RNA family members, exposing new similarities which could not be detected by the traditional rooted ordered alignment approaches. The results demonstrate that our approach can be used to expose structural similarity between some RNAs with higher sensitivity than the traditional rooted ordered alignment approaches.
Preliminaries
Tree notations
A tree T = (V,E) is an undirected, connected and acyclic graph. For a node v ∈ V, denote by N(v) the set of neighbors of v: N(v) = {u ∈ V:(v,u) ∈ E}. Denote by d_{ v } = N(v) the degree of v. A node v for which d_{ v } ≤ 1 is called a leaf in T. For simplicity, we henceforth use the notation v ∈ T and (v,u) ∈ T to imply that v is a node and (v,u) is an edge in a tree T. We use the notation (v → u) to indicate that the generally undirected edge (v,u) is being considered with respect to the specific direction from v to u. Denote by n_{ T }, L_{ T }, and d_{ T } the number of nodes, the number of leaves, and the maximum degree of a node in T, respectively.
A rooted tree is a tree in which one of the nodes is selected as its root. Denote by T^{v} the tree T when rooted upon the node v ∈ T. An ordered tree is a tree T in which for each node v ∈ T, the elements in N(v) are ordered. In this work we consider unrooted unordered trees, rooted unordered trees, unrooted ordered trees, and rooted ordered trees. If no indication is given, we assume that the mentioned trees are unrooted and unordered.
Let T = (V,E) be a tree. A smoothing of a node v of degree 2 in T is obtained by removing v from T and connecting its two neighbors by an edge. A smoothing of T is a tree obtained by smoothing zero or more nodes in T. A subtree of T is a connected subgraph of T. For an edge (v → u) ∈ T, denote by ${T}_{u}^{v}$ the rooted subtree of T induced by v as a root, and all nodes x in T such that the path between v and x in T starts with (v → u).
Since a tree T with n nodes contains n  1 undirected edges, the total number of directed edges, and hence the number of rooted subtrees of the form ${T}_{u}^{v}$, is 2(n  1).
A pruning of a tree T with respect to an edge (v → u) is the removal from T of all nodes in ${T}_{u}^{v}$, except for v. Observe that every nonempty subtree of T is obtained by pruning T with respect to zero or more edges.
MinCost bipartite matching
Similarly to previous tree alignment and edit distance algorithms [26–28], the algorithm presented here makes use of mincost bipartite matching algorithms as subroutines. Below, we define extended variants of the bipartite matching problem, in which the input groups may be ordered or unordered, and the score incorporates both standard element matching scoring terms, as well as penalties for unmatched elements. In addition, we define “cavity” variants of the problem, which are used for speeding up our tree alignment algorithms.
The (generalized) MinCost bipartite matching problem (MCM)
Let X and Y be two sets. A bipartite matching M between X and Y is a set of pairs M ⊆ X × Y, such that each element in X ∪ Y participates in at most one pair in M. If some element z ∈ X ∪ Y does not participate in any pair in M, we say that z is unmatched by M and denote z ∉ M. A (generalized) matching cost function w for X and Y assigns costs w(x,y) for every (x,y) ∈ X × Y and costs w(z) for every z ∈ X ∪ Y. The cost of a bipartite matching M between X and Y with respect to w is given by
A matching instance is a triplet (X,Y,w), where X and Y are two sets, and w is a matching cost function for X and Y. The MinCost Bipartite Matching problem (MCM) is, given a matching instance (X,Y,w), to find the minimum cost of a matching between X and Y with respect to w. Denote by MCM(X,Y,w) the solution of the MCM problem for the instance (X,Y,w), and call a matching whose cost equals to the solution optimal.
Numerous works study and suggest algorithms for the MCM problem, usually when no unmatched element costs are taken into account (see e.g., [30–33]). A standard approach is to reduce MCM to the MinCost MaxFlow problem, which yields an O(n m^{2} + m^{2} log m) algorithm for MCM, where n = min(X,Y) and m = max(X,Y). In [27], an adapted reduction was presented which generalizes the problem definition to incorporate unmatched element costs and also runs in O(n m^{2} + m^{2} log m) time. An algorithm suggested by Dinitz in [29] solves the MCM problem in O(n^{3} + n m) time, without reducing it to MinCost MaxFlow. Since that paper is in Russian and since it considers neither unmatched element costs nor the cavity variants of MCM (see the following section), we prefer to explicitly adapt the MinCost MaxFlow approach to our needs, using some variation of the ideas of [29].
Cavity MCM
The AllCavityMCM problem [28] is, given a matching instance (X,Y,w), to compute MCM(X,Y ∖ {y},w) for all y ∈ Y. We define the AllPairsCavityMCM problem as, given a matching instance (X,Y,w), to compute MCM(X ∖ {x},Y ∖ {y},w) for all x ∈ X and y ∈ Y.
Clearly, algorithms for both AllCavityMCM and AllPairsCavityMCM problems can be implemented by repeatedly running an algorithm for MCM on all required inputs. In [28], an algorithm for AllCavityMCM was proposed, which is more efficient than the naïve algorithm and retains the same cubic running time as the standard algorithm for MCM. To the best of our knowledge, no algorithm for AllPairsCavityMCM which improves upon the naïve algorithm (i.e. repeatedly executing the algorithm of [28] for AllCavityMCM over all x ∈ X) was previously described.
In Section ‘Algorithms for bipartite matching problems’, we give new algorithms for (generalized, unordered) MCM, AllCavityMCM and AllPairsCavityMCM. The running times of these algorithms are summarized in the following theorem, whose correctness is shown in Section ‘Algorithms for bipartite matching problems’.
Theorem 1. Let (X,Y,w) be a matching instance, and denote n = min(X,Y), m = max(X,Y). Then, each one of the problems MCM, AllCavityMCM, and AllPairsCavityMCM over the instance (X,Y,w) can be solved in O(n^{3} + n m) running time. Moreover, this may be done without the usage of priority queues (e.g. Fibonacci heaps [31]) or other complex data structures.
Ordered MCM variants
For an ordered set Z = 〈 z_{0},z_{1},…,z_{n1} 〉 and an integer k, the k rotation of Z is the reordering of its elements ${Z}_{k}=\u3008{z}_{0}^{\prime},{z}_{1}^{\prime},\dots ,{z}_{n1}^{\prime}\u3009$, where ${z}_{i}^{\prime}={z}_{(i+k)\text{mod}n}$ (that is, Z_{ k } = 〈 z_{ k },z_{k + 1},…,z_{n1},z_{0},…,z_{k1} 〉). Note that Z_{0} = Z.
Let X and Y be two ordered sets, and M a bipartite matching between X and Y. Say that M preserves linear order if for every $({x}_{i},{y}_{j}),({x}_{{i}^{\prime}},{y}_{{j}^{\prime}})$ ∈ M, i ≤ i′ ⇔ j ≤ j′. Say that M preserves cyclic order if there are some integers k,l such that M preserves linear order with respect to the rotated sets X_{ k } and Y_{ l }. It is possible to show, that defining M as preserving cyclic order if there exists an integer l such that M preserves linear order with respect to X and Y_{ l }, is equivalent to the definition above.
The Linear Ordered MCM problem (LinearMCM) and the Cyclic Ordered MCM problem (CyclicMCM) are defined similarly to MCM, with the restrictions that the considered matchings have to preserve linear or cyclic order, respectively. LinearMCM is essentially equivalent to the Sequence Alignment problem, which can be solved in O(n m) running time [34]. CyclicMCM can be solved by taking the minimum cost solution among LinearMCM solutions for all rotations of the smaller set in the input, in O(n^{2}m). More efficient algorithms for CyclicMCM can be implied from [35–37].
Homeomorphic subtree alignment
An isomorphic alignment between two trees T = (V,E) and S = (V′,E′) is a bijection A : V → V′, such that for every pair of nodes v,u ∈ V we have that (v,u) ∈ E ⇔ (A(v),A(u)) ∈ E ′. A homeomorphic alignment between T and S is an isomorphic alignment between some smoothing T′ of T and some smoothing S′ of S, and a homeomorphic subtree alignment (HSA) between T and S is a homeomorphic alignment between some subtree T′ of T and some subtree S ′ of S (Figure 2). For short, we write (v,v′) ∈ A to indicate that A(v) = v′.
Let T and S be two trees, and A a homeomorphic subtree alignment between them. Let T′ and S′ be the subtrees of T and S, and let T′′ and S′′ be the smoothings of T′ and S′, respectively, such that A is an isomorphic alignment between T′′ and S′′. Say that a node v ∈ T is aligned by A if v ∈ T′′, and that v is smoothed by A if v ∈ T′ and v ∉ T′′. Say that a subtree ${T}_{u}^{v}$ is pruned by A if v ∈ T′ and u ∉ T′. Let $\mathit{\text{prune}}\left({T}_{u}^{v}\right)$ be a cost associated with pruning the subtree ${T}_{u}^{v}$ from T, s m o o t h(v) be a cost associated with smoothing node v, and a l i g n(v,v′) be a cost associated with aligning node v against some node v′. Definitions for S are similar. Denote by π(A) the set of pruned subtrees and by δ(A) the set of smoothed nodes of T and S, with respect to A. Define the alignment cost:
Denote by HSA(T,S) the minimum alignment cost of an HSA between T and S, and call an HSA A optimal with respect to T and S if w(T,S,A) = HSA(T,S). The MinCost HSA problem is, given a pair of trees T and S, to compute HSA(T,S).
Remark 1. We do not give the details of how to construct optimal alignments in this paper. As usual for dynamic programming algorithms, this may be done by a standard backtracking procedure applied on the computed dynamic programming tables.
Rooted and ordered alignments
In addition to the general MinCost HSA problem, we also consider special cases of the problem in which the two input trees are rooted and/or ordered, and the alignment is required to satisfy certain restrictions with respect to these additional properties. For two rooted trees T^{v} and ${S}^{v\prime}$, say that A is a rooted HSA between T^{v} and ${S}^{v\prime}$ if A is an HSA between T and S, and (v,v′) ∈ A. The definition of ordered HSA requires some additional formalism, related to bipartite matchings.
Let A be an HSA between the trees T and S. For an edge (v→u) ∈ T, say that u is a relevant neighbor of v (with respect to A) if there is some $x\in {T}_{u}^{v},x\ne v$, which is aligned by A. Define relevant neighbors in S similarly.
Observation 1. Let A be an HSA between trees T and S, and (v,v′), (x,x′), (y,y′) ∈ A. The path between x and y in T goes through v if and only if the path between x′ and y′ in S goes through v′.
The correctness of the observation can be asserted from the fact that A is an isomorphic alignment between a smoothed subtree of T that contains v,x, and y, and a smoothed subtree of S that contains v′,x′, and y′.
Lemma 1. Let (v,v′) ∈ A, and let u be a relevant neighbor of v. Then, there is a unique relevant neighbor u′ of v′ such that for every (y,y′) ∈ A, $y\in {T}_{u}^{v}\iff {y}^{\prime}\in {S}_{{u}^{\prime}}^{{v}^{\prime}}$.
Proof. Since u is a relevant neighbor of v, there is a node $x\in {T}_{u}^{v}$, such that x ≠ v and x is aligned by A. Let x′ = A(x), and let u′ be the relevant neighbor of v′ such that ${x}^{\prime}\in {S}_{{u}^{\prime}}^{{v}^{\prime}}$. For (y,y′) ∈ A, observe that $y\notin {T}_{u}^{v}$ if and only if the path between x and y in T passes through v. Similarly, ${y}^{\prime}\notin {S}_{{u}^{\prime}}^{{v}^{\prime}}$ if and only if the path between x′ and y′ in S passes through v′. Applying Observation 1, this implies that $y\in {T}_{u}^{v}\iff {y}^{\prime}\in {S}_{{u}^{\prime}}^{{v}^{\prime}}$. □
Lemma 1 implies that for (v,v′) ∈ A, the alignment induces a bipartite matching ${M}_{v,{v}^{\prime}}^{A}$ between N(v) and N(v′) in which the matched elements are exactly those relevant neighbors of v and v′ (Figure 3). Say that A is ordered if T and S are ordered trees, and for every (v,v′)∈A, the corresponding bipartite matching ${M}_{v,{v}^{\prime}}^{A}$ is cyclically ordered.
Now, we can define three additional variants of the HSA problem. Let T^{v} and${S}^{{v}^{\prime}}$ be rooted and ordered trees. Denote by OrderedHSA(T,S),$\mathit{\text{RootedHSA}}({T}^{v},{S}^{{v}^{\prime}})$ and$\mathit{\text{OrderedRootedHSA}}({T}^{v},{S}^{{v}^{\prime}})$ the minimum costs of an ordered HSA, a rooted HSA, and an ordered and rooted HSA between T^{v} and ${S}^{{v}^{\prime}}$, respectively. Define the corresponding variants of the MinCost HSA problem whose goals are to compute these values.
Algorithm for homeomorphic subtree alignment
In this section we describe a basic algorithm for HSA for its unordered unrooted variant (though it is adequate for the other variants as well with some simple modifications).
Recursive computation
Let A be an HSA between T and S. Let (v,v′) ∈ A, and ${M}_{v,{v}^{\prime}}^{A}$ the corresponding bipartite matching between N(v) and N(v′), as defined in Section ‘Homeomorphic subtree alignment’. Note that A can be viewed as a rooted alignment between T^{v} and ${S}^{{v}^{\prime}}$, which is the union of a set of rooted subalignments${A}_{u}^{v}$ between rooted subtree pairs of the form ${T}_{u}^{v}$ and ${S}_{{u}^{\prime}}^{{v}^{\prime}}$, where$(u,{u}^{\prime})\in {M}_{v,{v}^{\prime}}^{A}$ (Figure 3). The alignment cost can therefore be obtained by summing the costs of these subalignments, which cover all scoring terms implied by matching nodes, smoothing nodes, and pruning subtrees by the corresponding subalignments, and the additional pruning costs of pruned subtrees of the forms${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$ (where u,u′ are unmatched by ${M}_{v,{v}^{\prime}}^{A}$). Note that the pair (v,v′) belongs by definition to each of the subalignments ${A}_{u}^{v}$. In order to avoid multiple additions of the term a l i g n(v,v′) when summing subalignment costs, define ${w}^{r}({T}^{v},{S}^{{v}^{\prime}},A)=w({T}^{v},{S}^{{v}^{\prime}},A)\mathit{\text{align}}(v,{v}^{\prime})$. The cost $w({T}^{v},{S}^{{v}^{\prime}},A)$ can then be written as follows:
Call a rooted alignment nontrivial if it aligns at least one additional pair of nodes besides the roots. Note that every rooted subalignment${A}_{u}^{v}$ is nontrivial (since u and u′ are relevant neighbors of v and v′). Denote by ${\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ the minimum w^{r} cost of a nontrivial rooted alignment between ${T}_{u}^{v}$ and ${S}_{{u}^{\prime}}^{{v}^{\prime}}$.
Clearly, if A is an optimal rooted HSA between T^{v} and ${S}^{{v}^{\prime}}$, then for each$(u,{u}^{\prime})\in {M}_{v,{v}^{\prime}}^{A}$${w}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}},{A}_{u}^{v}\right)={\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ (otherwise, it is possible to produce a rooted alignment with a better cost than A for T^{v} and ${S}^{{v}^{\prime}}$). Define the bipartite matching instance $\left(N\right(v),N({v}^{\prime}),{w}_{v,{v}^{\prime}})$, where for u∈N(v), u′∈N(v′), se${w}_{v,{v}^{\prime}}(u,{u}^{\prime})={\mathit{\text{RootedHSA}}}^{r}({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}),\phantom{\rule{1em}{0ex}}{w}_{v,{v}^{\prime}}\left(u\right)=\mathit{\text{prune}}\left({T}_{u}^{v}\right)$, and ${w}_{v,{v}^{\prime}}\left({u}^{\prime}\right)=\mathit{\text{prune}}\left({S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$. Observe that the righthand side of Equation 4 equals the cost of the bipartite matching${M}_{v,{v}^{\prime}}^{A}$ for the matching instance$\left(N\right(v),N({v}^{\prime}),{w}_{v,{v}^{\prime}})$ (see Figure 3b). In addition, every bipartite matching between N(v) and N(v′) corresponds to some valid rooted HSA between T^{v} and ${S}^{v\prime}$, so that the matching and alignment costs are equal.
Therefore, a minimum cost bipartite matching induces a minimum cost alignment, and we get that
Assuming the nondegenerate case where an optimal HSA between T and S contains at least one pair (v,v′), we can compute the cost of an optimal HSA by solving RootedHSA with respect to all possible root pairs, and taking the pair which induces a minimum cost as a solution for the unrooted case:
In order to obtain cost functions of the form${w}_{v,{v}^{\prime}}$ for the computation of Equation 5, we need to compute solutions of the form${\mathit{\text{RootedHSA}}}^{r}({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}})$ for subinstances of the input. When u and u′ are leaves, the only nontrivial rooted alignment between ${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$ contains both pairs (v,v′) and (u,u′), and therefore we get that${\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)=\mathit{\text{align}}(u,{u}^{\prime})$. Otherwise, Equation 7 computes${\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ recursively (see Figure 4), where${w}_{u,{u}^{\prime}}^{v,{v}^{\prime}}$ is defined similarly to${w}_{u,{u}^{\prime}}$ with respect to the sets N(u) ∖ {v} and N(u′)∖{v′}.
Proof. [Equation 7]. By definition of the RootedHSA^{r}$\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ score, it corresponds to the score of the best nontrivial alignment between${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$, minus the root alignment cost term a l i g n(v,v′). We may cover the set of all possible nontrivial rooted alignments between${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$ by three sets: (a) alignments in which u is unmatched, (b) alignments in which u′ is unmatched, and (c) alignments in which both u and u′ are matched (note that there might be an intersection between groups (a) and (b)). We show that each one of the terms I, II, and III in the righthand side of Equation 7 computes the minimum cost of an alignment in each one of the groups (a), (b), and (c), respectively, and therefore the minimum among all these terms gives the correct value${\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$.
We start by showing that term I computes the minimum cost of an alignment in group (a). Let A be an alignment in group (a) of minimum cost. Since A is nontrivial, and u is smoothed in A, u has exactly one additional relevant neighbor x^{∗} besides v. It therefore follows that all nodes in${T}_{u}^{v}$ except for v which are matched by A belong to the subtree${T}_{{x}^{\ast}}^{u}$. Defining${A}_{{x}^{\ast}}^{u}=\left(A\setminus \left\{\right(v,{v}^{\prime}\left)\right\}\right)\cup \left\{\right(u,{v}^{\prime}\left)\right\}$, we have that${A}_{{x}^{\ast}}^{u}$ is a nontrivial rooted HSA between${T}_{{x}^{\ast}}^{u}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$. Therefore, the cost of A is obtained by the summation of pruning costs of all subtrees${T}_{y}^{u}$ for y ∈ N(u) ∖ {v,x^{∗}}, the cost of smoothing u, and the cost${w}^{r}\left({T}_{{x}^{\ast}}^{u},{S}_{{u}^{\prime}}^{{v}^{\prime}},{A}_{{x}^{\ast}}^{u}\right)$ (which counts for all cost terms corresponding to node matchings, node smoothings, and subtree prunings, implied by the subalignment${A}_{{x}^{\ast}}^{u}$). Since A is optimal, it is clear that${w}^{r}\left({T}_{{x}^{\ast}}^{u},{S}_{{u}^{\prime}}^{{v}^{\prime}},{A}_{{x}^{\ast}}^{u}\right)={\mathit{\text{RootedHSA}}}^{r}\left({T}_{{x}^{\ast}}^{u},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ (otherwise, it is possible to improve the cost of A by replacing the subalignment${A}_{{x}^{\ast}}^{u}$ with an optimal alignment for the corresponding subinstance). Thus,
Since for every x ∈ N(u) ∖ {v} the term
is the w^{r} cost of a possible nontrivial alignment between${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$ in which u is smoothed (where the subtree${T}_{x}^{u}$ is optimally aligned to${S}_{{u}^{\prime}}^{{v}^{\prime}}$), we get that x^{∗} satisfies that
hence the correctness of term I.
The proof that term II of the equation computes the minimum cost of an alignment in group (b) is symmetric to the proof of term I. As for term III, note that the case where u and u′ are matched, but not to each other, implies a contradiction to Observation 1. Therefore, for any alignment in group (c), and in particular for such an alignment A of minimum cost, we have that (u,u′) ∈ A. In this case, the optimal alignment cost is obtained immediately from applying Equation 5 for this specific subinstance, as formulated by term III in the equation. □
The computation of${w}_{u,{u}^{\prime}}^{v,{v}^{\prime}}$ requires the computation of scores of the form${\mathit{\text{RootedHSA}}}^{r}\left({T}_{x}^{u},{S}_{{x}^{\prime}}^{{u}^{\prime}}\right)$ for all x ∈ N(u) ∖ {v} and all x′ ∈ N(u′) ∖ {v′}. It can be shown that all RootedHSA^{r} solutions required for the computation of the righthand side of the equation are for strictly smaller subinstances than the subinstance appearing in the lefthand side, thus the termination of the recursive computation is guaranteed. Equation 7 can be efficiently computed using Dynamic Programming (DP), as summarized by Algorithm 1 below.
Algorithm 1: HSA (T,S)
In Section ‘Time complexity of Algorithm 1’, we show that a straightforward implementation of Algorithm 1 obtains the running time of O(min(d_{ T },d_{ S })n_{ T }n_{ S } + min (d_{ T },d_{ S })^{3}L_{ T }L_{ S }). For some trees T,S with n_{ T },L_{ T },d_{ T }, n_{ S },L_{ S },d_{ S } = Θ(n) (e.g. “star” trees), this implies an O(n^{5}) running time. In Section ‘Improving the time complexity’, we show how to improve this time bound and obtain a cubic time algorithm for the problem.
We note that Algorithm 1 generalizes to also solve the ordered unrooted, unordered rooted, and ordered rooted variants of MinCost HSA in polynomial time. In case a rooted alignment is sought, the algorithm can compute Equation 5 in line 3 only with respect to the two roots, and avoid the computation of Equation 6. In case an ordered alignment is sought, the MCM application in Equation 5 can be replaced by CyclicMCM, and in Equation 7MCM can be replaced by LinearMCM, similarly to [27]. Traditionally, ordered matchings are implemented via reduction to sequence alignment [7, 26, 38]. Similarly to the improvement described in the next section for the unordered tree alignment algorithm, it seems that fast incremental/decremental versions of ordered matchings [35–37] can be integrated into the ordered variant of our algorithm to improve its time complexity. However, the detailed description of these techniques are beyond the scope of this paper.
Time complexity of Algorithm 1
We first analyze the time complexity of a straightforward implementation of the algorithm. Then, in Section ‘Improving the time complexity’, we show how this time complexity can be reduced by applying cavity matching subroutines.
Let$\mathit{\text{index}}\left({T}_{u}^{v}\right)$ and$\mathit{\text{index}}\left({S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ denote the row and column indices of${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$ in H, respectively. Let u ∈ T and u′ ∈ S be a pair of nodes. The set of subtrees${T}_{u}^{v}$ for v ∈ N(u) corresponds to a subset of rows in H. Similarly, the set of subtrees${S}_{{u}^{\prime}}^{{v}^{\prime}}$ for v′ ∈ N(u′) corresponds to a subset of columns in H, and thus all solutions of the form${\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ are stored in a submatrix of H of size${d}_{u}\times {d}_{{u}^{\prime}}$. Let${H}_{u,{u}^{\prime}}$ denote this submatrix, and let v_{ i } and${v}_{j}^{\prime}$ denote nodes in N(u) and N(u′) such that${T}_{u}^{{v}_{i}}$ and${S}_{{u}^{\prime}}^{{v}_{j}^{\prime}}$ correspond to the ith row and jth column in${H}_{u,{u}^{\prime}}$, respectively (i.e.$\mathit{\text{index}}\left({T}_{u}^{{v}_{1}}\right)$ is the first row in${H}_{u,{u}^{\prime}}$,$\mathit{\text{index}}\left({T}_{u}^{{v}_{2}}\right)$ is the second row, etc.). Note that H can be viewed as a union of submatrices of the form${H}_{u,{u}^{\prime}}$, where each entry in H is covered by exactly one submatrix${H}_{u,{u}^{\prime}}$ for some u ∈ T, u′ ∈ S.
The following observation identifies special properties of the second column and second row in${H}_{u,{u}^{\prime}}$, which are exploited for the efficient computation of Algorithm 1. Observe that for every 1 < i ≤ d_{ u },${T}_{{v}_{i}}^{u}$ is a subtree of${T}_{u}^{{v}_{1}}$, and therefore$\mathit{\text{index}}\left({T}_{{v}_{i}}^{u}\right)<\mathit{\text{index}}\left({T}_{u}^{{v}_{1}}\right)$. Also,${T}_{{v}_{1}}^{u}$ is a subtree of${T}_{u}^{{v}_{2}}$, and therefore$\mathit{\text{index}}\left({T}_{{v}_{1}}^{u}\right)<\mathit{\text{index}}\left({T}_{u}^{{v}_{2}}\right)$. Since$\mathit{\text{index}}\left({T}_{u}^{{v}_{1}}\right)<\mathit{\text{index}}\left({T}_{u}^{{v}_{2}}\right)$, we get the following observation (Figure 5):
Observation 2. For every 1 ≤ i ≤ d_{ u },$\mathit{\text{index}}\left({T}_{{v}_{i}}^{u}\right)<\mathit{\text{index}}\left({T}_{u}^{{v}_{2}}\right)$. Similarly, for every$1\le j\le {d}_{{u}^{\prime}}$,$\mathit{\text{index}}\left({S}_{{v}_{j}^{\prime}}^{u}\right)<\mathit{\text{index}}\left({S}_{{u}^{\prime}}^{{v}_{2}^{\prime}}\right)$.
In order to focus on the bottleneck expression in the running time analysis of the algorithm, we first summarize the complexity of its secondary computations in the following lemma:
Lemma 2. It is possible to implement Algorithm 1 so that all operations, besides computation of solutions to the MCM problem, require O(n_{ T }n_{ S }) running time.
Proof. It is simple to observe that the computations conducted in lines 1 and 4 of the algorithm consume O(n_{ T }n_{ S }) time (the computation of all subtree sizes and their sorting can be implemented in a linear time in a straightforward manner, where the details are omitted from this text). As computations of Equation 5 in line 3 and of term III of Equation 7 in line 2 are dominated by MCM computations, it remains to show that it is possible to compute terms I and II of Equation 7 in line 2 of the algorithm in O(n_{ T }n_{ S }) along the entire run of the algorithm.
Consider a pair of nodes u ∈ T and u′∈S, and the corresponding submatrix${H}_{u,{u}^{\prime}}$ (for illustration, here and on, see Figure 6a). It is simple to observe that an explicit computation of term I of the equation with respect to some subtrees${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$ can be conducted in O(d_{ u }) time. Nevertheless, we next show how to conduct this computation in O(1) amortized time. Fix an index$1\le j\le {d}_{{u}^{\prime}}$. Let${x}^{\ast}={\text{argmin}}_{x\in N\left(u\right)}\left({\mathit{\text{RootedHSA}}}^{r}\left({T}_{x}^{u},{S}_{{u}^{\prime}}^{{v}_{j}^{\prime}}\right)\mathit{\text{prune}}\left({T}_{x}^{u}\right)\right)$, and denote$\alpha ={\mathit{\text{RootedHSA}}}^{r}\left({T}_{{x}^{\ast}}^{u},{S}_{{u}^{\prime}}^{{v}_{j}^{\prime}}\right)\mathit{\text{prune}}\left({T}_{{x}^{\ast}}^{u}\right)$. As N(u) ∖ {v} ⊂ N(u) for every v ∈ N(u), it is clear that for v ≠ x^{∗},minx ∈ N(u) ∖ {v}$\left({\mathit{\text{RootedHSA}}}^{r}\left({T}_{x}^{u},{S}_{{u}^{\prime}}^{{v}_{j}^{\prime}}\right)\mathit{\text{prune}}\left({T}_{x}^{u}\right)\right)=\alpha \phantom{\rule{1em}{0ex}}$. Similarly, denote$\beta =\sum _{x\in N\left(u\right)}\mathit{\text{prune}}\left({T}_{x}^{u}\right)$, and observe that$\sum _{x\in N\left(u\right)\setminus \left\{v\right\}}\mathit{\text{prune}}\left({T}_{x}^{u}\right)=\beta \mathit{\text{prune}}\left({T}_{v}^{u}\right)$ for every v ∈ N(u). Thus, for v ≠ x^{∗}, term I of Equation 7 can be written as$\mathit{\text{smooth}}\left(u\right)+\beta \mathit{\text{prune}}\left({T}_{v}^{u}\right)+\alpha $, and given the values α and β, be computed in O(1) time. Due to Observation 2, when the algorithm is about to compute the entry in the second row and jth column of${H}_{u,{u}^{\prime}}$ (i.e. the entry corresponding to${T}_{u}^{{v}_{2}}$ and${S}_{{u}^{\prime}}^{{v}_{j}^{\prime}}$), all required values for computing x^{∗},α, and β, are already stored in H, and therefore these values may be computed in O(d_{ u }) time. Once computing these values, it is possible to compute term I for each of the remaining rows i > 1 in column j of${H}_{u,{u}^{\prime}}$, except for row i such that v_{ i } = x^{∗}, in O(1) time each. Additional O(d_{ u }) operations are required for computing the term for row 1 and the row i such that v_{ i } = x^{∗}, and therefore the total number of operations for computing the term for all d_{ u } entries in column j of${H}_{u,{u}^{\prime}}$ is O(d_{ u }). This implies that the amortized time for computing term I for each entry in${H}_{u,{u}^{\prime}}$ is O(1), and therefore the amortized time for computing term I for each entry in H is O(1). The proof for term II is symmetric. All in all, we get that the running time for all operations conducted by the algorithm, besides MCMvcomputations, is O(n_{ T }n_{ S }). □
Before continuing with the time complexity analysis, we formulate an auxiliary lemma.
Lemma 3. For a tree T,$\sum _{v\in T}{d}_{v}=O\left({n}_{T}\right)$, and$\sum _{\genfrac{}{}{0ex}{}{v\in T,}{{d}_{v}\ge 3}}{d}_{v}=O\left({L}_{T}\right)$.
Proof. It is well known that the number of undirected edges in T is n_{ T }1. Since each undirected edge ( v,u) contributes 1 unit to the degrees d_{ v } and d_{ u } of its endpoints, we get that$\sum _{v\in T}{d}_{v}=2({n}_{T}1)=O\left({n}_{T}\right)$.In order to show that$\sum _{\genfrac{}{}{0ex}{}{v\in T,}{{d}_{v}\ge 3}}{d}_{v}=O\left({L}_{T}\right)$, we will show that$\sum _{\genfrac{}{}{0ex}{}{v\in T}{{d}_{v}\ge 3}}{d}_{v}<3{L}_{T}.$ When T contains a single node, L_{ T } = 1,$\sum _{\genfrac{}{}{0ex}{}{v\in T}{{d}_{v}\ge 3}}{d}_{v}=0$,and the inequality follows. Assume by induction the correctness of the inequality for all trees with less than n > 1 nodes, and let T be a tree with n nodes. Let (x,y) ∈ T be an edge such that y is leaf in T. The subtree T′ of T containing all nodes in T except for y (and all edges except for (x,y)) is of size n  1, and from the inductive assumption$\sum _{\genfrac{}{}{0ex}{}{v\in {T}^{\prime}}{{d}_{v}^{\prime}\ge 3}}{d}_{v}^{\prime}<3{L}_{{T}^{\prime}}$, where${d}_{v}^{\prime}$ denotes the degree of v in T′. Besides y, T contains no additional leaves which are not already leaves in T′. In addition,${d}_{v}={d}_{v}^{\prime}$ for all nodes v ∈ T′ besides x, where${d}_{x}={d}_{x}^{\prime}+1$, and d_{ y } = 1. If x is a leaf in T′ then${L}_{T}={L}_{{T}^{\prime}}$ (as y replaces x as a leaf in T), and since${d}_{x}={d}_{x}^{\prime}+1<3$ we get that$\sum _{\genfrac{}{}{0ex}{}{v\in T}{{d}_{v}\ge 3}}{d}_{v}=\sum _{\genfrac{}{}{0ex}{}{v\in {T}^{\prime}}{{d}_{v}^{\prime}\ge 3}}{d}_{v}^{\prime}<3{L}_{{T}^{\prime}}=3{L}_{T}$. If x is not a leaf in T′ then${L}_{T}={L}_{{T}^{\prime}}+1$ (due to the addition of y as a leaf). Here, it is possible that${d}_{x}^{\prime}=2$ and d_{ x } = 3, which would contribute 3 to the degree summation for nodes with degree ≥ 3, while if${d}_{x}^{\prime}>2$ the increment in the degree of x would contribute 1 to this summation. Therefore,
and the lemma follows. □
In order to complete the time complexity analysis, we turn to count the number of operations applied in MCM computations throughout the algorithm’s run. Such computations are applied when computing term III of Equation 7, or when computing Equation 5. Term III of Equation 7 is computed once for every pair of subtrees${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$, in$O(\text{min}{({d}_{u},{d}_{{u}^{\prime}})}^{3}+{d}_{u}{d}_{{u}^{\prime}})$ running time (Theorem 1, see also Section ‘Reducing MCM to MinCost MaxFlow’). Therefore, for a given pair of nodes u ∈ T and u′ ∈ S, and all neighbor pairs v ∈ N(u),v′ ∈ N(u′), the time required for MCM computations due to term III is
In order to sum the expression above for all pairs u ∈ T, u′ ∈ S, we sum indempendetly the expressions$\sum _{u\in T}\sum _{{u}^{\prime}\in S}{d}_{u}{d}_{{u}^{\prime}}\text{min}{({d}_{u},{d}_{{u}^{\prime}})}^{3}$ and$\sum _{u\in T}\sum _{{u}^{\prime}\in S}{d}_{u}^{2}{d}_{{u}^{\prime}}^{2}$:
Thus, the overall MCM computation time due to term III in Equation 7 is O((d_{ T } + d_{ S })n_{ T }n_{ S } + d_{ T }d_{ S }L_{ T }L_{ S } + min(d_{ T },d_{ S })^{3}L_{ T }L_{ S }).
Equation 5 is computed once for each v ∈ T and v′ ∈ S in line 3, where the corresponding matching instance’s group sizes are d_{ v } and${d}_{v\prime}$. Similarly as above, it can be shown that summing the total number of operations in the implied MCM computations yields
Therefore, the overall running time of Algorithm 1 is dictated by the the bottleneck expression O((d_{ T } + d_{ S })n_{ T }n_{ S } + d_{ T }d_{ S }L_{ T }L_{ S } + min(d_{ T },d_{ S })^{3}L_{ T }L_{ S }).
Improving the time complexity
The time analysis in the previous section shows that all operations in Algorithm 1, besides MCM computations due to term III of Equation 7, are conducted in O(n_{ T }n_{ S } + min(d_{ T },d_{ S })L_{ T }L_{ S }) time. In this section, we show how to improve the time complexity of Algorithm 1, by incorporating cavity matching subroutines to speed up the MCM computations due to term III of Equation 7.
Let u ∈ T, u′ ∈ S, and consider the computation of term III of Equation 7 for instances in the first row in${H}_{u,{u}^{\prime}}$. Note that for the entries in this row, the first group in the bipartite matching instance is fixed and equals to N(u) ∖ {v_{ 1 }}, whereas for each column j, the second group in the matching instance is$N\left({u}^{\prime}\right)\setminus \left\{{v}_{j}^{\prime}\right\}$. The first entry in this row is computed by solving the MCM problem directly for the matching instance$\left(N\right(u)\setminus \{{v}_{1}\},N({u}^{\prime})\setminus \{{v}_{1}^{\prime}\},{w}_{u,{u}^{\prime}}^{{v}_{1},{v}_{1}^{\prime}})$ (Figures 6a,6b). Based on Observation 2, upon reaching the second entry in this row, all solutions${\mathit{\text{RootedHSA}}}^{r}\left({T}_{{v}_{i}}^{u},{S}_{{v}_{j}^{\prime}}^{{u}^{\prime}}\right)$ for i > 1 and j ≥ 1 are computed and stored in H. Therefore, the AllCavityMCM problem can be solved for the matching instance$\left(N\right(u)\setminus \{{v}_{1}\},N({u}^{\prime}),{w}_{u,{u}^{\prime}}^{{v}_{1}})$, where${w}_{u,{u}^{\prime}}^{{v}_{1}}$ is defined similarly as${w}_{u,u\prime}$ with respect to the sets N(u)∖{v_{ 1 }} and N(u′). This allows to compute term III for each one of the remaining entries in this row of${H}_{u,{u}^{\prime}}$ in O(1) time (Figures 7a,7b).
The first entry of the second row in${H}_{u,{u}^{\prime}}$ is again computed directly by solving the MCM problem for the matching instance$\left(N\right(u)\setminus \{{v}_{2}\},N({u}^{\prime})\setminus \{{v}_{1}^{\prime}\},{w}_{u,{u}^{\prime}}^{{v}_{2},{v}_{1}^{\prime}})$. Upon reaching the second entry of the second row of${H}_{u,{u}^{\prime}}$, Observation 2 implies that all solutions${\mathit{\text{RootedHSA}}}^{r}\left({T}_{{v}_{i}}^{u},{S}_{{v}_{j}^{\prime}}^{{u}^{\prime}}\right)$ for i,j ≥ 1 are already computed and stored in H. Therefore, the AllPairsCavityMCM problem can be solved for the matching instance$\left(N\right(u),N({u}^{\prime}),{w}_{u,{u}^{\prime}})$, allowing to compute term III for each one of the remaining entries in${H}_{u,{u}^{\prime}}$ in O(1) time (Figures 7c,7d). Thus, computing term III for all entries in${H}_{u,{u}^{\prime}}$ is done by solving the MCM problem twice, solving the AllCavityMCM problem once, and solving the AllPairsCavityMCM problem once, where the sizes of the two groups in the matching instances for these problems are at most d_{ u } and${d}_{{u}^{\prime}}$. Based on Theorem 1, this whole MCM computation for sub matrix${H}_{u,{u}^{\prime}}$ takes$O(\text{min}{({d}_{u},{d}_{{u}^{\prime}})}^{3}+{d}_{u}{d}_{{u}^{\prime}})$ time. Recall that matrix H is decomposed into matrices${H}_{u,{u}^{\prime}}$ for all pairs u ∈ T, u′ ∈ S, and so the total computation time of term III in Equation 7 throughout the entire run of the algorithm is
matching the running time of all other computations, and we get the following theorem:
Theorem 2. Algorithm 1 can be implemented with an O(n_{ T }n_{ S } + min(d_{ T },d_{ S })L_{ T }L_{ S }) = O(min(n_{ T },n_{ S }) n_{ T }n_{ S }) time complexity.
We would like to emphasize that replacing n_{ T } and n_{ S } by L_{ T } and L_{ S } or d_{ T } and d_{ S } in the time complexity term of Theorem 2 is due to a refined analysis rather than some algorithmic improvement, and such an analysis can also be applied to refine the time complexities of some of the previous algorithms. While in many tree comparison applications typical input trees T are characterized by having the number of leaves at the same order of magnitude as the number of nodes (i.e. L_{ T } = Θ (n_{ T })), and sometimes it is true that d_{ T } = Θ (n_{ T }) (e.g. in starlike trees), there are cases where L_{ T },d_{ T } ≪ n_{ T }. Specifically, it can be asserted that removing (by node smoothing) or adding (by subdividing edges) degree2 nodes to a tree do not change its maximum degree nor the number of its leaves, and thus trees with a high number of degree2 nodes have a low maximum degree and a small number of leaves with respect to the total number of nodes. As can be shown in our examples (see Section ‘RNA tree representation’), typical RNA trees in our application do have a relatively high number of degree2 nodes, and thus gain from the fact that the cubic term in the time complexity of the algorithm depends on the maximum node degrees and the number of leaves, rather than the number of nodes in the input trees.
Algorithms for bipartite matching problems
In this section we show efficient algorithms for the MCM, AllCavityMCM, and AllPairsCavityMCM problems defined in Section ‘MinCost bipartite matching’. These algorithms are based on a reduction to the MinCost MaxFlow problem. Since the MinCost MaxFlow problem is well known to computer scientists, we only provide a brief discussion of essential properties required for describing our algorithms, while omitting some of the details. For a definition of the MinCost MaxFlow problem, related theorems, and a thorough discussion of its properties, please refer to other works, e.g. [30, 39, 40].
Throughout this section, let (X,Y,w) be a matching instance. Assume w.l.o.g. that X ≤ Y, and denote X = n,Y = m. When the context is clear, instead of writing a “matching M for (X,Y,w)”, we simply write a “matching M”, and similarly when writing “M is an optimal matching” we mean that M is optimal with respect to (X,Y,w).
Efficient algorithm for MCM
Next, we refine the reduction of MCM to MinCost MaxFlow given at [27], in order to to improve the running time of the algorithm. Our modification reduces the O(n m^{2} + m^{2} logm) running time of the algorithm of [27] to O(n^{3} + n m), using a variant of the approach of [29].
Reducing MCM to MinCost MaxFlow
For x ∈ X and y ∈ Y, define the effective matching cost w_{ e }(x,y) of the pair (x,y) to be w_{ e }(x,y) = w(x,y)  w(x)  w(y). The effective matching cost w_{ e }(x,y) is the cost change due to the addition of the pair (x,y) into a matching in which both x and y are unmatched.
For each x ∈ X, define a subset Y_{ x } ⊆ Y of n smallest cost matches for x in Y, with respect to the effective matching costs. Define${Y}_{X}=\bigcup _{x\in X}{Y}_{x}$.
Lemma 4. There exists an optimal matching M^{∗} such that M^{∗} ⊆ X × Y_{ X }.
Proof. Let M^{∗} be an optimal matching. Since each element in X participates in at most one pair in M^{∗}, M^{∗} contains at most n pairs, and so there are at most n distinct elements y ∈ Y such that y ∈ M. If M^{∗} contains a pair (x,y) such that y ∉ Y_{ X }, then y ∉ Y_{ X }, and in particular there must be some y′ ∈ Y_{ x } such that y′ ∉ M (since Y_{ x } = n). By definition of Y_{ x }, w_{ e }(x,y′) ≤ w_{ e }(x,y), and the matching M′∗ obtained by removing the pair (x,y) from M^{∗} and adding the pair (x,y′) satisfies w(M′∗) = w(M^{∗})  w_{ e }(x,y) + w_{ e }(x,y′) ≤ w(M^{∗}). Since M^{∗} is optimal, M′∗ is also optimal. It is possible to continue and apply such modifications until getting an optimal matching M′′^{∗} which contains only pairs (x,y) such that y ∈ Y_{ X }, as required. □
Next, we describe how to reduce MCM into the MinCost MaxFlow problem. The reduction builds the cost flow network N = (G,s,t,c), where G is the network’s weighted graph, s and t are the source and sink nodes respectively, and c is the edge capacity function. The graph G = (V,E) is defined as follows (Figure 8a):

V = X ∪ Y_{ X }∪{s,t,ϕ}, where s, t, and ϕ are unique nodes different from all nodes in X and Y_{ x }. Note that we use the same notations for elements in X and Y_{ x } and their corresponding nodes in V, where ambiguity can be resolved by the context. By definition, Y_{ X } ≤ n^{2}, and so V = O(n^{2}).

E = E_{ 1 } ∪ E_{ 2 }, where E_{ 1 } = {(x,y):x ∈ X,y ∈ Y_{ x }}, and E_{ 2 } = {(s,x):x ∈ X} ∪ {(y,t):y ∈ Y_{ X }} ∪ {(x,ϕ):x ∈ X} ∪ {(ϕ,t)}. The cost c o s t(x,y) of every edge (x,y) ∈ E_{ 1 } is the corresponding effective matching cost w_{ e }(x,y), and the cost c o s t(u,v) of each edge (u,v) ∈ E_{ 2 } is zero. Note that E = O(n^{2}.
The capacity function c assigns unit capacities to all edges in E, with the exception that c(ϕ,t) = n.
For a maximum flow f in N, define the set of pairs M_{ f } = {(x,y) : x ∈ X,y ∈ Y_{ X },f(x,y) = 1}.From flow conservation constraints, it is simple to assert that every z ∈ X ∪ Y participates in at most one pair in M_{ f }, and thus M_{ f } is a valid matching, and in addition, M_{ f } ⊆ X × Y_{ X }. For a matching M for (X,Y,w) such that M ⊆ X × Y_{ X }, define the maximum flow f_{ M } in N as the flow which is obtained by transmitting one flow unit on every path of the form s → x → y → t for all (x,y) ∈ M, and one flow unit on every path of the form s → x → ϕ → t for all x ∈ X such that x ∉ M. It is simple to observe that f_{ M } is a valid flow in N (satisfying the capacity and flow conservation constraints), where its maximality is asserted from the fact that it saturates the cut ({s},V ∖ {s}). Note that for every matching M ⊆ X × Y_{ X },${M}_{{f}_{M}}=M$, and for every maximum flow f in N,${f}_{{M}_{f}}=f$.
Denote by c o s t(f) the cost of a flow f in N, and by w_{X,Y} the summation${w}_{X,Y}=\sum _{z\in X\cup Y}w\left(z\right)$.
Lemma 5 shows the relation between a cost of a matching M ⊆ X × Y_{ X } and its corresponding flow f_{ M }.
Lemma 5. For every matching M ⊆ X × Y_{ X }, w(M) = c o s t(f_{ M }) + w_{X,Y}.
Proof. Note that only edges in E_{ 1 } in N are assigned nonzero costs, and therefore the cost of f_{ M } is given by$\mathit{\text{cost}}\left({f}_{M}\right)=\sum _{(x,y)\in {E}_{1}}{f}_{M}(x,y)\xb7\mathit{\text{cost}}(x,y)$. In addition, note that for every (x,y) ∈ E_{ 1 }, f_{ M }(x,y) = 1 if (x,y) ∈ M, and otherwise f_{ M }(x,y) = 0.
Hence,$\mathit{\text{cost}}\left({f}_{M}\right)\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\sum _{(x,y)\in M}\mathit{\text{cost}}(x,y)=\sum _{(x,y)\in M}{w}_{e}(x,y)$. Now,
□
Call a minimumcost maximumflow in N an optimal flow with respect to N. Proposition 1 concludes the relation between optimal flows in N and optimal matchings of (X,Y,w).
Proposition 1. Let f^{∗} an optimal flow in N. Then,$\mathit{\text{MCM}}(X,Y,w)=w\left({M}_{{f}^{\ast}}\right)=\mathit{\text{cost}}\left({f}^{\ast}\right)+{w}_{X,Y}$.
Proof. For the matching${M}_{{f}^{\ast}}$, we have that MCM$(X,Y,w)\le w\left({M}_{{f}^{\ast}}\right)\stackrel{\text{Lem.5}}{=}\mathit{\text{cost}}\left({f}_{{M}_{{f}^{\ast}}}\right)+{w}_{X,Y}=\mathit{\text{cost}}\left({f}^{\ast}\right)+{w}_{X,Y}$. On the other hand, from Lemma 4 there is an optimal matching M^{∗} such that M^{∗} ⊆ X × Y_{ X }, and$\mathit{\text{MCM}}(X,Y,w)=w\left({M}^{\ast}\right)\stackrel{\text{Lem.5}}{=}\mathit{\text{cost}}\left({f}_{{M}^{\ast}}\right)+{w}_{X,Y}\ge \mathit{\text{cost}}\left({f}^{\ast}\right)+{w}_{X,Y}$, proving the proposition. □
Efficient computation and time complexity
For an efficient computation of an optimal flow in N, we first describe a modification that can be applied to residual graphs and allows a computational speed up.
Let f be a flow in N, and y ∈ Y_{ X }. Due to the flow conservation constraints, either f does not transmit any flow unit through y, or f transmits one flow unit that enters y over some ingoing edge (x,y) (where y ∈ Y_{ X }), and leaves y by the unique outgoing edge (y,t) in E.In both cases, and since all edges adjacent to y have a unit capacity, the residual graph G^{f} = (V,E^{f}) contains a single outgoing edge from y: the original edge (y,t) in the former case, or the residual edge (y,x) in the latter case (where the edge (y,t) is reversed in G^{f} to become an ingoing edge (t,y)). Denote by v_{ y } the unique node in X ∪ {t} into which there is an outgoing edge from y in E^{f}. In addition, observe that the number of ingoing edges into y in G^{f} equals to the number of ingoing edges into y in G, implying the following observation:
Observation 3
$\sum _{y\in {Y}_{X}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\left\left\{(u,y)\in {E}^{f}\right\}\right\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\sum _{y\in {Y}_{X}}\left\left\{(u,y)\in E\right\}\right\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\left{E}_{1}\right\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{n}^{2}.$
Let${\u011c}^{f}=(\widehat{V},{\xca}^{f})$be the graph defined by

$\widehat{V}=X\cup \{s,t,\varphi \}$,

${\xca}^{f}={\xca}_{1}^{f}\cup {\xca}_{2}^{f}$, where
${\xca}_{1}^{f}={E}^{f}\setminus \left\{(u,v)\in {E}^{f}:u\in {Y}_{X}\text{or}\phantom{\rule{1em}{0ex}}v\in {Y}_{X}\right\}$, and
${\xca}_{2}^{f}=\left\{(u,v):\exists y\in {Y}_{X}\text{s.t.}(u,y)\in {E}^{f}\text{and}\phantom{\rule{1em}{0ex}}v={v}_{y}\right\}$.
Observe that${\xca}_{1}^{f}$ is included in E^{f}. Denote by$\mathit{\text{cos}}{t}_{{\xca}^{f}}$ the cost function over edges in${\xca}^{f}$. For$(u,v)\in {\xca}_{1}^{f}$ define$\mathit{\text{cos}}{t}_{{\xca}^{f}}(u,v)=\mathit{\text{cost}}(u,v),$ and for$(u,v)\in {\xca}_{2}^{f}$ define$\mathit{\text{cos}}{t}_{{\xca}^{f}}(u,v)=\underset{(}{\text{min}}\mathit{\text{cost}}(u,y)+\mathit{\text{cost}}(y,v)).$y ∈ Y_{ X } : (u,y) ∈ E^{f} and v = v_{ y }.
For nodes u,v ∈ V, denote by${d}_{u,v}^{f}$ the minimum cost of a path from u to v in G^{f}, and for$u,v\in \widehat{V}$, denote by${\widehat{d}}_{u,v}^{f}$ the minimum cost of a path from u to v in${\u011c}^{f}$ (if there is no path between u and v in one of these graphs, define the corresponding minimum path cost to be ∞).
Lemma 6. For every$u,v\in \widehat{V}$,${d}_{u,v}^{f}={\widehat{d}}_{u,v}^{f}$.
Proof. Let$u,v\in \widehat{V}$, and let P be a minimum cost path in G^{f} from u to v. If P traverses a node y′ ∈ Y_{ x }, this traversal is of the form u^{′} → y^{′} → v^{′} for some${u}^{\prime},{v}^{\prime}\in \widehat{V}$ such that (u^{′},y^{′}) ∈ E^{f} and${v}^{\prime}={v}_{{y}^{\prime}}$ (note that there are no edges between two nodes in Y_{ x }). By construction, the edge (u^{′},v^{′}) is in${\xca}_{2}^{f}$, where its cost satisfies$\mathit{\text{cos}}{t}_{{\xca}^{f}}({u}^{\prime},{v}^{\prime})=\underset{(}{\text{min}}\mathit{\text{cost}}({u}^{\prime},y)+\mathit{\text{cost}}(y,{v}^{\prime}))\le \mathit{\text{cost}}({u}^{\prime},{y}^{\prime})+\mathit{\text{cost}}({y}^{\prime},{v}^{\prime})$.y ∈ Y_{ X } : (u^{′},y) ∈ E^{f} and v^{′} = v_{ y }.
Thus, each such subpath u′→y′→v′ in P can be replaced by an edge$({u}^{\prime},{v}^{\prime})\in {\xca}_{2}^{f}$, where the cost of the replacing edge is at most the cost of the subpath. This yields a corresponding path$\widehat{P}$ from u to v in${\u011c}^{f}$, which has the same or lower cost than P. In particular,${d}_{u,v}^{f}\ge {\widehat{d}}_{u,v}^{f}$.
On the other hand, let$\widehat{P}$ be a minimum cost path in${\u011c}^{f}$ from u to v. Similarly as above, every edge$({u}^{\prime},{v}^{\prime})\in {\xca}_{2}^{f}$ in $\widehat{P}$ can be replaced by a path u′ → y → v′ in E^{f} such that $\mathit{\text{cost}}({u}^{\prime},y)+\mathit{\text{cost}}(y,{v}^{\prime})=\mathit{\text{cos}}{t}_{{\xca}^{f}}({u}^{\prime},{v}^{\prime})$, yielding a path P from u to v in G^{f} of the same cost as $\widehat{P}$. Hence ${d}_{u,v}^{f}\le {\widehat{d}}_{u,v}^{f}$, and so we get that if there is a path from u to v in one of the graphs,${d}_{u,v}^{f}={\widehat{d}}_{u,v}^{f}$. In the case where there are no paths from u to v in both graphs,${d}_{u,v}^{f}={\widehat{d}}_{u,v}^{f}=\infty $. □
Lemma 7. For every y ∈ Y_{ X } and every$v\in \widehat{V}$,${d}_{y,v}^{f}=\mathit{\text{cost}}(y,{v}_{y})+{\widehat{d}}_{{v}_{y},v}^{f}$, and${d}_{v,y}^{f}=\underset{(u,y)\in {E}^{f}}{\text{min}}({\widehat{d}}_{v,u}^{f}+\mathit{\text{cost}}(u,y\left)\right)$.
Proof. Let y ∈ Y_{ X } and$v\in \widehat{V}$. A minimum cost path P from y to v in G^{f} (if there is such a path) must start with the only outgoing edge (y,v_{ y }) from y, and from the optimality of P the remainder of P is a minimum cost path from v_{ y } into v in G^{f}. Thus, the cost of P is ${d}_{y,v}^{f}=\mathit{\text{cost}}(y,{v}_{y})+{d}_{{v}_{y},v}^{f}\stackrel{\text{Lem.6}}{=}\mathit{\text{cost}}(y,{v}_{y})+{\widehat{d}}_{{v}_{y},v}^{f}$. If there is no path from y to v in G^{f}, then in particular there is no path from v_{ y } to v in G^{f}, and${d}_{y,v}^{f}=\infty ={d}_{{v}_{y},v}^{f}\stackrel{\text{Lem.6}}{=}{\widehat{d}}_{{v}_{y},v}^{f}=\mathit{\text{cost}}(y,{v}_{y})+{\widehat{d}}_{{v}_{y},v}^{f}$.
To show the second equality in the lemma, observe similarly that for a minimum cost path P from v to y in G^{f} and the last edge (u′,y) in P, the cost of P is${d}_{v,y}^{f}={d}_{v,{u}^{\prime}}^{f}+\mathit{\text{cost}}({u}^{\prime},y)\stackrel{\text{Lem.6}}{=}{\widehat{d}}_{v,{u}^{\prime}}^{f}+\mathit{\text{cost}}({u}^{\prime},y)\ge \underset{(u,y)\in {E}^{f}}{\text{min}}({\widehat{d}}_{v,u}^{f}+\mathit{\text{cost}}(u,y\left)\right)$. Since for every u such that (u,y) ∈ E^{f} there is a path from v to y of cost${d}_{v,u}^{f}+\mathit{\text{cost}}(u,y)$ (the path concatenating the edge (u,y) at the end of a minimum cost path from v to u in G^{f}), it follows that${d}_{v,y}^{f}\phantom{\rule{0.6em}{0ex}}\le \phantom{\rule{0.6em}{0ex}}\underset{\phantom{\rule{0.3em}{0ex}}(u,y)\in {E}^{f}}{\text{min}}\phantom{\rule{0.3em}{0ex}}(\phantom{\rule{0.3em}{0ex}}{d}_{v,u}^{f}\phantom{\rule{0.6em}{0ex}}+\phantom{\rule{0.6em}{0ex}}\mathit{\text{cost}}(\phantom{\rule{0.3em}{0ex}}u,\phantom{\rule{0.3em}{0ex}}y\phantom{\rule{0.3em}{0ex}}\left)\phantom{\rule{0.3em}{0ex}}\right)\phantom{\rule{0.6em}{0ex}}\stackrel{\text{Lem.6}}{=}\phantom{\rule{1.2em}{0ex}}\underset{(u,y)\in {E}^{f}}{\text{min}}\phantom{\rule{0.3em}{0ex}}({\widehat{d}}_{v,u}^{f}\phantom{\rule{0.6em}{0ex}}+\phantom{\rule{0.6em}{0ex}}\mathit{\text{cost}}(\phantom{\rule{0.3em}{0ex}}u,\phantom{\rule{0.6em}{0ex}}y\phantom{\rule{0.3em}{0ex}}\left)\phantom{\rule{0.3em}{0ex}}\right)$, and we get that${d}_{v,y}^{f}=\underset{(u,y)\in {E}^{f}}{\text{min}}({\widehat{d}}_{v,u}^{f}+\mathit{\text{cost}}(u,y\left)\right)$. The case where there is no path from v to y in G^{f} is resolved similarly as above. □
Using Lemmas 6 and 7, we turn to analyze the time complexity for the reduction we described for solving MCM(X,Y,w).
The computation of each subset Y_{ x } can be done in O(m) time, using the algorithm of [41], and so Y_{ x } can be computed in O(n m) time. The network N contains O(n^{2}) nodes and edges, and thus can be constructed in O(n^{2}) time.
An optimal flow in N can be found with the algorithm of Edmonds and Karp [30]. Essentially, this is an iterative algorithm that maintains a valid flow f in N. Starting with a zeroflow, at each iteration the algorithm increases the flow by adding the bottleneck capacity flow along a minimum cost path from s to t in the residual graph G^{f}. In our specific reduction, all edges leaving s have a unit capacity, and thus each augmentation path increases the size of the flow by one unit. As the size of the maximum flow in N is n (asserted by the minimum cut ({s},V ∖ {s}) of capacity n in N), the algorithm performs n iterations.
The time required for each iteration is dictated by the time required for computing minimum cost paths from s into every node in the residual graph G^{f}. Such paths can be computed efficiently for weighted graphs with nonnegative edge costs using Dijkstra’s algorithm [42]. In order to render all edge costs in the residual graphs to nonnegative, the algorithm of [30] applies a node labeling function. Such a function assigns to each node v ∈ V a label l(v), and shortest paths in the residual network are computed with respect to the modified edge costs c o s t′(u,v) = c o s t(u,v) + l(u)  l(v). It is known that a path from u to v in G^{f} is of minimum cost with respect to the original cost function if and only if it is of minimum cost with respect to the modified cost function, and that setting$l\left(v\right)={d}_{s,v}^{f}$ for every v ∈ V, where f is the flow at the beginning of the ith iteration, guarantees nonnegative modified edge costs in the residual graph at the beginning of the (i+1)th iteration (after f was augmented and the residual graph was modified, see [30]). Thus, given that minimum cost paths are computed from s to all nodes in the residual graph, label maintenance at each iteration is done in linear time with respect to the number of nodes in the network.
In order to compute minimum cost paths from s to all nodes in G^{f}, we first construct the corresponding graph${\u011c}^{f}$ as described above (under the assumption that edge costs are rendered to be nonnegative). It is simple to observe that this construction can be implemented in O(n^{2}) time. By construction,$\left\widehat{V}\right=\leftX\cup \{s,t,\varphi \}\right=O\left(n\right)$. Using Dijkstra’s algorithm [42],${\widehat{d}}_{s,v}^{f}$ can be computed for all$v\in \widehat{V}$ in$O\left(\widehat{V}{}^{2}\right)=O\left({n}^{2}\right)$ time. Note that we are referring to the original Dijkstra algorithm (published in 1959) rather than to its commonly used improvement due to Fredman and Tarjan [31]. While the latter improvement reduces the running time of the algorithm for nondense graphs, it involves the usage of a relatively sophisticated data structure (Fibonacci heap). In our case, the simple implementation described in [42] does not exceed the O(n^{2}) running time required for the construction of${\u011c}^{f}$, without the usage of any kind of priority queue or other complex data structures.
Given the values${\widehat{d}}_{s,v}^{f}$ for all$v\in \widehat{V}$, values${d}_{s,v}^{f}$ can be computed for all v ∈ V by applying Lemmas 6 and 7. The time required for computing distances${d}_{s,v}^{f}$ for all$v\in \widehat{V}=V\setminus {Y}_{X}$ due to Lemma 6 is$O\left(\left\widehat{V}\right\right)=O\left(n\right)$, and the time required for computing distances${d}_{s,y}^{f}$ for all y ∈ Y_{ X } due to Lemma 7 is$\sum _{y\in {Y}_{X}}\left\left\{(u,y)\in {E}^{f}\right\}\right\stackrel{\text{Obs.3}}{=}O\left({n}^{2}\right)$. Therefore, each iteration is conducted in O(n^{2}) time, and the total running time of all n iterations of the algorithm is O(n^{3}).
A special attention is required though for the initialization of the algorithm of [30]. There, it was assumed that all edges in the input network N are of nonnegative costs, while our described reduction from MCM does not sustain this property. Nevertheless, this may be overcome by setting the initial labels of all nodes v ∈ V to the corresponding minimum path costs d_{s,v} in the graph G of N. Since G is acyclic, these initial costs can be computed in O(V + E) = O(n^{2}) time using a simple topological traversal (see e.g., [43]).
In all, we get that the total running time required for constructing a flow network N corresponding to the matching instance (X,Y,w) and finding an optimal flow in N is O(n^{3} + n m). Computing w_{X,Y} can be done in O(n + m) time, and applying Proposition 1, MCM(X,Y,w) can be computed in O(n^{3} + n m) time, proving the statement in Theorem 1 regarding MCM.
Additional practical improvements using sparsification
It is possible to further improve the running time of the reduction in practice, by reducing the number of edges in the network, and computing MinCost Flow instead of MinCost MaxFlow.
We may assume without lost of generality that an optimal matching M^{∗}between X and Y contains no pairs (x,y) such that w(x,y)  w(x)  w(y) ≥ 0 (since removing such an edge from the matching can only decrease its cost). Consequentially, there is an optimal flow in N that does not transmit flow through edges (x,y) for which w′(x,y) = w(x,y)  w(x)  w(y) ≥ 0. Therefore, the cost of an optimal flow in the network N′, obtained by removing such edges from N, equals to the cost of an optimal flow in N. Note that in the case of RNA tree alignment, removed edges correspond to pairs of subtrees which are sufficiently dissimilar so that their pruning cost is lower than their alignment cost. For a reasonable cost function, it is expected that many of the subtreepairs of the input would sustain this condition (even in the extreme case of aligning a tree to itself), as they represent different parts of complex molecules.
The second improvement exploits a property of the MinCost MaxFlow algorithm of [30], for which it was shown that the series of computed augmentation paths are of nondecreasing costs (Theorem 5 in [30]). Note that as long as there is an augmentation path in the network, there is such a path with cost zero (since there is some node x for which (s,x) is not saturated, and thus the path s → x → ϕ →t is an augmentation path of cost zero). Therefore, once the minimum cost of an augmentation path is zero, all consecutive augmentation paths would have a zero cost, and increasing the flow along these paths will not change the overall cost of the flow. It is thus possible to stop the flow algorithm upon the first iteration at which the minimum augmentation path cost is zero, where the flow at this stage is a minimumcost flow which is not necessarily of maximum size (while it has the same cost as a mincost maxflow). It is possible to show that increasing the flow over a negative cost augmentation path necessarily increases the size of the corresponding matching (otherwise it implies a negative cost cycle in the residual network), therefore the number of iterations in the above described MinCost Flow algorithm is M^{∗} ≤ n (where M^{∗} is an optimal matching of minimum size), and so the total running time of the algorithm is O(M^{∗}n^{2} + n m), which is faster than O(n^{3} + n m) in the case where M^{∗} is small. Again, for the RNA tree alignment application, in most cases it is expected that the matching would be computed with respect to dissimilar sets of subtrees, for which it is expected to have a small optimal matching size.
Efficient algorithms for AllCavityMCM and AllPairsCavityMCM
We now present Algorithm 2, which solves AllPairsCavityMCM. Similarly to Kao et al. [28], we show that solutions for instances of the form (X ∖ {x},Y ∖ {y},w) correspond to certain shortest paths in the residual flow network obtained when solving the instance (X,Y,w). This observation allows to solve both AllCavityMCM and AllPairsCavityMCM at the same time complexity O(n^{3} + n m) as that of the algorithm for MCM presented in the previous section.
Algorithm 2: AllPairsCavityMCM (X,Y,w)
In order to prove the algorithm’s correctness, we first formulate an auxiliary lemma.
Lemma 8. Let G = (V,E) be a directed graph with a cost function c o s t(·) over its edges and no negative cost cycle. Let P be a minimum cost path from a node y to a node x in G. Let G′ = (V,E′) be the graph obtained from G by adding, for every edge (u,v)∈P, the reversed edge (v,u) (if not already in G), and setting its cost to c o s t(v,u) =  c o s t(u,v). Then, G′contains no negative cost cycle.
Proof. Following [30], we call a flow extreme if it is of minimum cost among all flows of the same size. By [30] (Theorem 3), a flow is extreme if and only if the corresponding residual network contains no negative cost cycle. In addition, a flow obtained by increasing an extreme flow along a minimum cost augmentation path from the source to the sink of the network, is also extreme (see [44], page 121).
Let N be the flow network defined by the graph G, the source node y, the sink node x, the cost function c o s t(·), and the capacity function c which assigns two capacity units to all edges in G. Since there is no negative cost cycle in N, the zero flow is extreme with respect to zerosize flows. Let f be the unit flow along P in N, and note that the residual graph G^{f} is identical to G′ by definition. Since P is a minimum cost path from y to x in G, f is extreme with respect to all flows of size 1. Thus, G^{f} contains no negative cost cycle. □
In the remainder of this section, let N be the network corresponding to the matching instance (X,Y,w) and f an optimal flow in N. Denote by N_{x,y} the network whose graph G_{x,y} is obtained by excluding from G the node x, the node y in case that y ∈ Y_{ X }, and all edges adjacent to these nodes. The edge costs in G_{x,y} are inherited from G, and the maximum flow size in N_{x,y} is n  1. Note that any flow in N which does not transmit flow units through x and through y (when y ∈ Y_{ X }), is also a valid flow in N_{x,y}. In addition, observe that G_{x,y} contains the graph corresponding to the subinstance (X ∖ {x},Y ∖ {y},w) (it is possible that G_{x,y} contains additional nodes in Y ∖ {y} excluded from (Y ∖ {y})_{X ∖ {x}} and additional edges, since x′∈X ∖ {x} may have n adjacent edges of the form (x′,y′) in G_{x,y}, while in the graph corresponding to (X ∖ {x},Y ∖ {y},w) is has (n  1) such adjacent edges). Nevertheless, it can be asserted from the lemmas in Section ‘Reducing MCM to MinCost MaxFlow’, Proposition 1, and their proofs, that for an optimal flow f′ in N_{x,y}, MCM(X ∖ {x},Y ∖ {y},w) = c o s t(f^{′}) + w_{X ∖ {x},Y ∖ {y}}. The correctness of Algorithm 2 is implied by the following analysis.
Lemma 9. For every x ∈ X and every y ∈ Y_{ X }, there is a path from y to x in G^{f}.
Proof. If f(y,t) = 0, then G^{f} contains the edge (y,t), and in particular there is a path from y to t in G^{f}. Else, f(y,t) = 1, and from flow conservations constraints there is some x′ ∈ X such that f(x′,y) = 1. In this case, the path y → x′ → ϕ → t is a path from y to t in G^{f} (again, the existence of all edges of this path in G^{f} can be asserted from the flow conservation constraints). Similarly, let z∈Y_{ X } ∪ {ϕ} be the node such that f(x,z) = 1, and note that the path t → z → x is a path from t to x in G^{f}. The concatenation of a path from y to t and a path from t to x in G^{f} proves that there is a path from y to x in G^{f}. □
Now, we show the relation between MCM solutions for instances (X,Y,w) and solutions for subinstances of the form (X ∖ {x},Y ∖ {y},w). As defined in Algorithm 2, let${d}_{y,x}=\left\{\begin{array}{cc}{d}_{y,x}^{f},& y\in {Y}_{X},\\ {d}_{t,x}^{f},& y\in Y\setminus {Y}_{X}\end{array}\right.$
Lemma 10. The cost of an optimal flow in N_{x,y}is c o s t(f)+d_{y,x}.
Proof. In order to prove the lemma, we construct a flow f′ in N_{x,y} such that w(f^{′}) = c o s t(f) + d_{y,x}, and prove that f′ is optimal with respect to N_{x,y}.
Consider the case where y ∈ Y_{ X }, and so${d}_{y,x}={d}_{y,x}^{f}$. From Lemma 9, there exists a path from y to x in the residual graph G^{f}. Let P be such a path of minimum cost, and define the weighted graph G′ that includes all nodes and edges in G^{f} with the same edge costs, and in addition, for each (u,v) ∈ P, G′ contains the reversed edge (v,u), whose cost is set to c o s t(v,u) =  c o s t(u,v). Since G^{f} contains no negative cost cycle (due to the optimality of f), Lemma 8 indicates that G′ contains no negative cost cycle.
Define the flow f′ is obtained from f by returning one flow unit from t to s along the path P′ as follows: if P starts with (y,t), then P′ is obtained by removing (y,t) from P and concatenating (x,s) at its end (see Figure 8d). Else, P′ is obtained by concatenating (t,y), P, and (x,s). Observe that f′ is a valid flow in N_{x,y} (since f′ passes no flow units through x and y), its cost is c o s t(f′) = c o s t(f)+c o s t(P′) = c o s t(f) + d_{y,x} (since edges adjacent to t and s have zero costs, and so the costs of P′ and P are equal), and its size is f′ = f  1 = n  1. Therefore, f′ is a maximum flow in N_{x,y}. Also, observe that the residual graph${G}_{x,y}^{{f}^{\prime}}$ is a subgraph of G′ (since the only edges in P′ which are not in P are adjacent to either x or y, and therefore their reversed edges, which are included in${G}^{{f}^{\prime}}$, are excluded from${G}_{x,y}^{{f}^{\prime}}$), and therefore it contains no negative cycle. Thus, f′ is an optimal flow in N_{x,y} of cost c o s t(f)+d_{y,x}.
For the case where y ∈ Y ∖ Y_{ X } (and${d}_{y,x}={d}_{t,x}^{f}$), let P be a minimum cost path from t to x in G^{f}, and P^{′} the path from t to s in G^{f} obtained by concatenating the residual edge (x,s) at the end of P. Similarly as above, it can be shown that the flow f′ obtained by returning one flow unit from t to s along P^{′} is an optimal flow in N_{x,y} of cost c o s t(f) + d_{y,x}, and the lemma follows. □
From Lemma 10, for every x ∈ X and every y ∈ Y, the value c o s t(f) + d_{y,x} is the cost of an optimal flow in N_{x,y}. From Proposition 1, adding to this cost the value w_{X ∖ {x},Y ∖ {y}} = w_{X,Y}  w(x)  w(y) gives the minimum matching cost for the matching instance (X ∖ {x},Y ∖ {y},w), and thus the correctness of Algorithm 2.
In order to use the algorithm for solving AllCavityMCM, we apply the following simple modification. Say we are interested in finding solutions to subinstances of the form (X,Y ∖ {y},w) for every y ∈ Y. We replace the set X by the set X′ = X∪{x_{ ϕ }} for some new element x_{ ϕ } ∉ X (and arbitrarily define w(x_{ ϕ }) = 0 and w(x_{ ϕ },y) = 0 for every y ∈ Y). We then solve AllPairsCavityMCM for the instance (X′,Y,w), and return MCM(X′∖{x_{ ϕ }},Y ∖ {y}) as the solution for the instance MCM(X,Y ∖ {y}). Finding solutions for all instances (X ∖ {x},Y,w) is done symmetrically.
Time complexity of Algorithm 2
From the analysis in Section ‘Efficient computation and time complexity’, lines 1 and 2 of the algorithm can be computed in O(n^{3} + n m) running time. After finding an optimal flow f in N, it is possible to compute for some x ∈ X the values${d}_{v,x}^{f}$ for every v ∈ V in O(n^{2}), as explained in Section ‘Efficient computation and time complexity’ (applying Lemma 7 and using a reversed variant of Dijkstra’s algorithm [42]). Thus, the computation of such values for all x ∈ X, performed in line 3 of the algorithm, can be implemented in O(n^{3}) running time.
Given these values, the execution of line 4 takes O(n m) time. In all, the running time of Algorithm 2 is O(n^{3} + n m), proving the statements regarding AllCavityMCM and AllPairsCavityMCM in Theorem 1.
Implementation details
Our algorithm is implemented (in Java) as a tool called FRUUT (Fast RNA Unordered Unrooted Tree mapper). The RNA tree representation is described in Section ‘RNA tree representation’ and the scoring scheme employed by FRUUT is described in Section ‘Alignment cost function’. FRUUT allows the user to select any alignment mode combination (rooted / unrooted, ordered / unordered, local / global) and to compute optimal pairwise alignments of RNA trees with an attached significance scoring model described in Section ‘pValue computation’. We also provide an interactive PHP webserver for running FRUUT in our website (RNA plots are are generated by the Vienna Package [2]).
RNA tree representation
There are several previous models for representing a pseudoknotfree RNA secondary structure (example in Figure 9a) as an ordered, rooted tree [3, 9, 10, 13, 45–48]. For example, [45] represented the RNA structure as a tree, where nodes correspond to loop elements of the secondary structure (hairpin loops, bulges, internal loops or multiloops) and the edges correspond to basepaired (stem) regions. Another, different representation is given in Zhang’s work [9]: the nodes of the tree represent either unpaired bases (leaves) or paired bases (internal nodes). Each node is labeled with a base or a pair of bases, respectively. There are two kinds of edges, alternatively connecting either consecutive stem basepairs or a leaf base with the last basepair in the corresponding stem. The aforementioned trees are rooted and ordered, their order corresponds to the 5’3’ orientation of an RNA sequence and their root is traditionally a designated node parenting the motif in which the first 5’ base participates.
The tree representation that we employed uses a similar modeling as in Höchsman et al. [48], with some variations we describe next. Given an RNA secondary structure, each loop, basepair, and interval of unpaired bases generates a node in the tree representation of the structure. Labels are assigned to tree nodes in order to indicate their type and content. Node types are: BP (basepair), UPI (unpairedbase interval), HP (hairpin), IL (internal loop or bulge), ML (multiloop), and EXT (external loop). For a UPI node, the label also includes the 5’ to 3’ basesequence of the corresponding interval, and for a BP node the label includes the corresponding bases (see Figure 9).
Each loop node (HP, IL, ML, and EXT) is connected, in 5’ to 3’ sequence order, to all UPI nodes which correspond to intervals of unpaired bases associated to the loop, and to all BP nodes which correspond to stemterminating basepairs adjacent to the loop. BP nodes are nodes of degree 2, where the two neighbors of such nodes are the BP nodes that correspond to adjacent stacked basepairs. The set of leaves in the tree corresponds to the set of UPI nodes or BP nodes terminating loopfree stems.
Alignment cost function
Node smoothing costs were set to 3 for BP, 11 for ML/EXT and 5 for IL. For subtree pruning costs, we have designed a cost function that counts the occurrences of different types of elements appearing in the subtree and deduces a corresponding penalty. The function is of the form: p r u n e(T) = C_{ u p } · N_{ u p } + C_{ b p } · N_{ b p } + C_{ h p } · N_{ h p } + …, where the values of C_{ x } are constant penalty factors, N_{ u p } is the number of unpaired bases in T, N_{ b p } is the number of basepairs, N_{ h p } is the number of hairpins, and so on (specific C_{ x } values are given in Table 1). Table 2 summarizes the costs of matching node pairs by the alignment. Sequence alignment costs and basepair alignment costs were set using the RIBOSUM8560 scoring matrix [49]. In order to support the local alignment mode, we added an option to set the subtree pruning cost to zero: p r u n e(T) = 0.
Relative scoring
We used a relative score formula described by Höchsmann et al. [48] to assess the similarity of two trees, normalizing the alignment cost by the average of the selfalignment costs of the compared trees. Let H S A_{ m }(T,S) denote the optimal alignment cost of trees T and S in alignment mode m, where m is one of the following modes: RootedOrdered, RootedUnordered, UnrootedOrdered or UnrootedUnordered. Let R e l S c o r e_{ m }(T,S) denote the relative score of T and S in alignment mode m, given by [48]:
The scoring scheme we use satisfies that for every tree T, H S A_{ m }(T,T) < 0, and for every pair of trees T,S, H S A_{ m }(T,T),H S A_{ m }(S,S) ≤ H S A_{ m }(T,S). Under these conditions, the relative score for any pair of trees is upper bounded by 1, and the similarity of the trees increases as the score approaches 1.
pValue computation
We apply the following pValue computation in order to determine whether an alignment score obtained by comparing two RNA trees is significant.
For the purpose of assessing the significance of a score, we need to know what distribution HSA scores follow. In order to identify the correct distribution, we first create a set of random observations to inspect. Algorithm 3 describes a routine for shuffling a tree, while preserving most of its structural features (e.g. number of stacks, amount of multiloops) and keeping the tree valid in terms of RNA secondary structure (exemplified in Figure 10).
Algorithm 3: shuffle (T)
Algorithm 4 creates a set of observations by randomly selecting pairs of trees from a dataset, shuffling them and reporting the relative score of their alignment. In each alignment mode m, where m ∈ {R O,R U,U O,U U}, we ran Algorithm 4, setting the input parameter dataset to all the structures from over the RNAStrand [50] database (containing 1751 RNA structures), and setting the number of iterations (the amount parameter, in the code below) to 2 × 10^{6}.
Stats (d a t a s e t,a m o u n t)
The observed results where plotted in an accumulative manner and a Maximum Likelihood (ML) fitting technique was used to determine the best distribution and its parameters. We tried several types of distributions and used the KolmogorovSmirnov test (KS test) formula [51] to measure the goodness of the data fit to a given distribution. In all four alignment modes the ML Gaussian distribution provided the best fit. Figure 11 exemplifies the fitting of the data to this distribution with alignment mode UU. The figure also displays the ML Gumbel distribution. The KS score of the ML Gaussian distribution is better than the score for the ML Gumbel distribution. Following these results we used the ML fitted Gaussian distribution to model the HSA results. This allowed us to compute for a pair of trees T and S a pValue score analytically, using the following formula:
where x is the relative score of T and S, and X is a random variable normally distributed with the ML fitted parameters.
A Bonferroni correction was applied to all the reported pvalue computations described in the following sections. This was done by multiplying the computed pvalue by the number of tests performed (i.e. the number of tree pairs aligned within the family that participated in the corresponding test).
Algorithm 4: Stats (dataset, amount)
Results
RNase P family
RNase P is the endoribonuclease responsible for the 5’ maturation of tRNA precursors [19]. Secondary structures of bacterial RNase P RNAs have been studied in detail, primarily using comparative methods [52], and were shown to share a common core of primary and secondary structure. In bacteria, synthetic minimal RNase P RNAs consisting only of these core sequences and structures were shown to be catalytically proficient. Sequences encoding RNase P RNAs of various genomes have been determined and a database established [53], which consists of a compilation of ribonuclease P RNA sequences, sequence alignments, secondary structures and threedimensional models.
We conducted a preliminary experiment, intended to identify examples of pairs of RNA trees for which an RNA structural comparison approach supporting unrooting and branch shuffling may detect (otherwise hidden) structural similarity. To achieve this, we ran a benchmark of allagainstall pairwise alignments of bacterial RNAse P RNA secondary structure trees, using our tool’s different treealignment modes and comparing the differences between the obtained alignment costs. The alignment cost functions and parameters used in our experiment are given in Section ‘Alignment cost function’.
Our benchmark was based on 470 RNAse P structures, ranging across various organisms, taken from the RNAse P database [53] (molecule naming conventions are according to [50]). After filtering out partial and environmental sequences, 170 distinct structures remained, yielding 14,365 distinct pairs of trees. The sizes of the trees in this dataset ranged from 82 to 230 nodes, averaging at 142. The total running time of the benchmark was approximately 33 minutes on a single Xeon X5690 using around 300Mb of memory.
Each pair of trees T,S was compared in two modes to obtain the corresponding scores and alignments: rootedordered (RO) and rootedunordered (RU), and the relative score was computed for each pair in each mode according to Section ‘Relative scoring’.
Our goal in this experiment was to identify evolutionary events that can be explained by unordered alignments. Thus, we sought pairs of RNAse P RNAs that are highly conserved, and yet their alignment can still benefit substantially from unordered mappings. To achieve this, we removed from the set pairs of trees for which R e l S c o r e_{ R U }(T,S) < 0.5. We sorted the remaining pairs of trees according to the difference between the RU and RO modes (R e l S c o r e_{ R U }(T,S)  R e l S c o r e_{ R O }(T,S)).
When examining the top 50 alignments carefully, two distinct types of mapping patterns were observed among them, where each of the top 50 pairs belongs (with slight variations) to one of these two types (33 to Type 1 and 17 to Type 2). In the next paragraphs, we exemplify the highest ranking alignment of each of the two types (the first type is shown in Figure 1b). As mentioned before, the input for FRUUT alignments consisted only of sequence and secondary structure information. The tertiary structure (pseudoknot annotations) for the topranking alignments were only considered later, during the alignment interpretations.
Type 1: loop swapping in main multiloop accompanied by hairpin deletion in P17.1 The first type of alignment pattern is characterized by comparisons between a green sulfur bacteria Chlorobium and gamma purple bacterial RNAse P RNAs. This alignment pattern is exemplified in Figure 1b, for the bacterial RNAs ASE_00047 and ASE_00334. When examining the corresponding tertiary structure information (Figure 12), the transformations predicted by FRUUT seem to make sense: observe that there is an additional duplex connecting the loops of intervals 13 and 15. Notice that the alignment between the two trees maps the intervals in a manner that preserves the tertiary structural information and the other information surrounding the loops  thus exemplifies a biologically verified alignment which does not preserve branch ordering.
Type 2: hairpin swapping between P17 and P17.1 The second type of alignment pattern is characterized by comparisons between Agrobacterium tumefaciens (ASE_00018) and several Chlamydia trachomatis members. This alignment pattern is exemplified in Figure 13. An interesting elementtwist transformation is observed here in hairpins P17 and P17.1 of ASE_00018, which are mapped onto their corresponding hairpins P17.1 and P17 in ASE_00070, respectively, via a subtree reordering mapping operation. When examining the corresponding tertiary structure information (Figure 13), we observe that the loops of the hairpins P17 and P17.1 are engaged in pseudoknots with loops L (named P6).
The Hammerhead Ribozyme family
Another type of homology detected by our tool is exemplified in the Hammerhead Ribozyme family, which is characterized by two distinct transcript types, yielding the same functional RNA (Figures 1a and14).
The Hammerhead Ribozyme, a derivative of several selfcleaving satellite virus RNAs [54], is a single strand RNA with autocatalytic capabilities. Naturally, it has a highly specific selfcleavage site at C17, operating via isomeration and rearrangement of the linkage phosphodiester bond [55]. Furthermore, Birikh et al.[18] suggested that the Hammerhead Ribozyme may undergo synthetic modification by removing the loop of one of its helical arms, thus making it catalytically active and able to cleave other RNAs. Hammerheads are therefore widely used in the biotechnological industry as biosensors, enzymes for specific RNAs and gene discovery agents.
The tertiary structure of the minimal version of the Hammerhead Ribozyme has been thoroughly studied by [56, 57]. It is composed of three base paired helices, entitled I, II, III, according to their position in the sequence. There are highly conserved sequences within the multiloop between stem I and II (containing the sequence box CUGA), between stem II and III (containing the box GAAA) and between stem III and I (containing the cleavage reaction site, C17). In nature, there are two types of Hammerheads: type I, where stem I starts at the 5’ and 3’ ends of the strand, and type III, where stem III starts at the 5’ and 3’ ends of the strand. As of today, no natural type II Hammerhead has been found.
Due to the fact that the two natural Hammerhead types share a conserved structure, though it is formed by two distinct transcripts, we chose this family to demonstrate the sensitivity gained by extending the classical rootedordered RNA tree alignment to more flexible variants. Our benchmark was based on 146 Hammerhead structures (86 of type I and 62 of type III), ranging across various organisms, taken from the RNA Strand database [50]. This yields a total of 10,585 pairs of trees, with tree sizes ranging from 17 to 48 nodes, averaging at 25.5.
Each pair of trees T,S was compared in two modes to obtain the corresponding scores and alignments: rootedordered (RO) and unrootedordered (UO). Within each mode, we used the relative score formula as described in Section ‘Relative scoring’. We partitioned the tree pairs into two groups, the first containing all pairs where both members are from the same type (5377 pairs) and the second group containing mixed pairs where the two members in each pair belong to different types (5208 pairs). We calculated the average relative score for each group in both alignment modes. The results, summarized in Table 3, show that the UO tree alignment mode is more sensitive to the similarity between the two different Hammerhead types than the RO mode. The similarity between Hammerhead structures of different types is not captured by the classical rooted ordered alignment (Figure 14a). However, when comparing the same pair using unrooted ordered alignment, the similarity between the structures is revealed, in line with the similarity of biological function (Figure 14b).
Conclusions
In this paper we define the (unrooted unordered) Homeomorphic Subtree Alignment (HSA) problem, as well as additional three restricted variants of it: OrderedHSA, RootedHSA, and OrderedRootedHSA. We focus on the general (unrooted and unordered) HSA variant, and present a cubic time algorithm for it.
The new algorithm is implemented as a tool (which allows solving all four HSA variants) and is applied to pairwise alignments of RNA secondary structures. Preliminary experimental results over members of the RNAs P and Hammerhead families show that the tool can be used for detecting new structural similarities between RNA molecules, which could not be detected by the classical rootedordered tree alignment methods. In order to obtain an O(n^{3}) running time of an otherwise O(n^{5}) time algorithm, we extend the AllCavity Bipartite Matching problem, previously defined by Kao et al., to the AllPairsCavity problem, give an efficient algorithm for it, and show how to integrate it as a subroutine within our tree alignment algorithm.
References
 1.
Agmon I, Auerbach T, Baram D, Bartels H, Bashan A, Berisio R, Fucini P, Hansen H, Harms J, Kessler M, et al: On peptide bond formation, translocation, nascent protein progression and the regulatory properties of ribosomes. Eur J Biochem. 2003, 270 (12): 25432556. 10.1046/j.14321033.2003.03634.x.
 2.
Hofacker I: Vienna RNA secondary structure server. Nucleic Acids Res. 2003, 31 (13): 342910.1093/nar/gkg599.
 3.
Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R: RNAshapes: anintegrated RNA analysis package based on abstract shapes. Bioinformatics. 2006, 22 (4): 500503. 10.1093/bioinformatics/btk010.
 4.
Höchsmann M, Toller T, Giegerich R, Kurtz S: Local similarity in RNA secondary structures. Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE. 2003, IEEE, 159168. 10.1109/CSB.2003.1227315..
 5.
Jiang T, Lin G, Ma B, Zhang K: A general edit distance between RNA structures. J Comput Biol. 2002, 9 (2): 371388. 10.1089/10665270252935511.
 6.
Zhang K, Wang L, Ma B: Computing similarity between RNA structures. Combinatorial Pattern Matching. 1999, Springer, 281293.
 7.
Bille P: A survey on tree edit distance and related problems. Theor Comput Sci. 2005, 337 (13): 217239. 10.1016/j.tcs.2004.12.030.
 8.
Jiang T, Wang L, Zhang K: Alignment of trees—an alternative to tree edit. Theor Comput Sci. 1995, 143: 137148.
 9.
Zhang K: Computing similarity between RNA secondary structures. INTSYS ’98: Proceedings of the IEEE International Joint Symposia on Intelligence and Systems. 1998, Washington: IEEE Computer Society, 126126.
 10.
Le S, Nussinov R, Maizel J: Tree graphs of RNA secondary structures and their comparisons. Comput Biomed Res. 1989, 22 (5): 461473. 10.1016/00104809(89)900396.
 11.
Schirmer S, Giegerich R: Forest alignment with affine gaps and anchors. Combinatorial Pattern Matching. 2011, Springer, 104117. 10.1007/9783642214585\_11.
 12.
Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte fur Chemie/Chemical Monthly. 1994, 125 (2): 167188. 10.1007/BF00818163.
 13.
Liu J, Wang J, Hu J, Tian B: A method for aligning RNA secondary structures and its application to RNA motif detection. BMC Bioinformatics. 2005, 6: 8910.1186/14712105689.
 14.
Blin G, Denise A, Dulucq S, Herrbach C, Touzet H: Alignments of RNA structures. Comput Biol Bioinformatics, IEEE/ACM Trans. 2010, 7 (2): 309322.
 15.
Allali J, Sagot M: A multiple graph layers model with application to RNA secondary structures comparison. String Processing and Information Retrieval. 2005, Springer, 348359. 10.1007/11575832\_39.
 16.
Jan E: Divergent IRES elements in invertebrates. Virus Res. 2006, 119: 1628. 10.1016/j.virusres.2005.10.011.
 17.
Perreault J, Weinberg Z, Roth A, Popescu O, Chartrand P, Ferbeyre G, Breaker R: Identification of hammerhead ribozymes in all domains of life reveals novel structural variations. PLoS Comput Biol. 2011, 7 (5): e100203110.1371/journal.pcbi.1002031.
 18.
Birikh K, Heaton P, Eckstein F: The structure, function and application of the hammerhead ribozyme. Eur J Biochem. 1997, 245: 116. 10.1111/j.14321033.1997.t01300001.x.
 19.
Haas E, Brown J: Evolutionary variation in bacterial RNase P RNAs. Nucleic Acids Res. 1998, 26 (18): 40934099. 10.1093/nar/26.18.4093.
 20.
Zhang K, Jiang T: Some MAX SNPhard results concerning unordered labeled trees. Inf Process Lett. 1994, 49 (5): 249254. 10.1016/00200190(94)900620.
 21.
Matula D: Subtree isomorphism in O(n5/2). Ann Discrete Math. 1978, 2: 91106.
 22.
Shamir R, Tsur D: Faster subtree isomorphism. J Algorithms. 1999, 33: 267280. 10.1006/jagm.1999.1044.
 23.
Chung M: O(n2.5) time algorithms for the subgraph homeomorphism problem on trees. J Algorithms. 1987, 8: 106112. 10.1016/01966774(87)900307.
 24.
Reyner S: An analysis of a good algorithm for the subtree problem. SIAM J Comput. 1977, 6: 73010.1137/0206053.
 25.
Valiente G: Constrained tree inclusion. J Discrete Algorithms. 2005, 3 (24): 431447. 10.1016/j.jda.2004.08.017.
 26.
Pinter RY, Rokhlenko O, Tsur D, ZivUkelson M: Approximate labelled subtree homeomorphism. J Discrete Algorithms. 2008, 6 (3): 480496. 10.1016/j.jda.2007.07.001.
 27.
Zhang K: A constrained edit distance between unordered labeled trees. Algorithmica. 1996, 15 (3): 205222. 10.1007/BF01975866.
 28.
Kao M, Lam T, Sung W, Ting H: Cavity matchings, label compressions, and unrooted evolutionary trees. SIAM J Comput. 2000, 30 (2): 602624. 10.1137/S0097539797332275.
 29.
Dinic E: On solution of two assignment problems. Studies in Discrete Optimization. Edited by: Fridman A. 1976, Nauka. Moscow: Nauka, 333348.
 30.
Edmonds J, Karp R: Theoretical improvements in algorithmic efficiency for network flow problems. J ACM (JACM). 1972, 19 (2): 248264. 10.1145/321694.321699.
 31.
Fredman M, Tarjan R: Fibonacci heaps and their uses in improved network optimization algorithms. J ACM (JACM). 1987, 34 (3): 596615. 10.1145/28869.28874.
 32.
Gabow H, Tarjan R: Faster scaling algorithms for network problems. SIAM J Comput. 1989, 18: 101310.1137/0218069.
 33.
Orlin J, Ahuja R: New scaling algorithms for the assignment and minimum mean cycle problems. Math Program. 1992, 54: 4156. 10.1007/BF01586040.
 34.
Needleman S, Wunsch C, et al: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443453. 10.1016/00222836(70)900574.
 35.
Maes M: On a cyclic stringtostring correction problem. Inf Process Lett. 1990, 35 (2): 7378. 10.1016/00200190(90)90109B.
 36.
Schmidt JP: All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM J Comput. 1998, 27 (4): 972992. 10.1137/S0097539795288489.
 37.
Tiskin A: Semilocal string comparison: Algorithmic techniques and applications. Math Comput Sci. 2008, 1 (4): 571603. 10.1007/s1178600700333.
 38.
Zhang K: Algorithms for the constrained editing distance between ordered labeled trees and related problems. Pattern Recognit. 1995, 28 (3): 463474. 10.1016/00313203(94)00109Y.
 39.
Tarjan R: Data Structures and Network Algorithms, Volume 44. 1983, Society for, Industrial Mathematics, 10.1137/1.9781611970265.fm.
 40.
Ahuja R, Magnanti T, Orlin J, Weihe K: Network flows: theory, algorithms and applications. ZORMethods Models Oper Res. 1995, 41 (3): 252254.
 41.
Blum M, Floyd R, Pratt V, Rivest R, Tarjan R: Time bounds for selection. J Comput Syst Sci. 1973, 7 (4): 448461. 10.1016/S00220000(73)800339.
 42.
Dijkstra E: A note on two problems in connexion with graphs. Numerische mathematik. 1959, 1: 269271. 10.1007/BF01386390.
 43.
Lawler E: Combinatorial Optimization: Networks and Matroids. 1976, New York: Holt,Rinehart and Winston
 44.
Ford Jr L, Fulkerson D, Ziffer A: Flows in networks. Phys Today. 1963, 16: 54
 45.
Shapiro B: An algorithm for comparing multiple RNA secondary structures. Comput Appl Biosci. 1986, 4 (3): 387393.
 46.
Waterman M: Secondary structure of singlestranded nucleic acids. Adv Math Suppl Studies. 1978, 1: 167212.
 47.
Fontana W, Konings D, Stadler P, Schuster P: Statistics of RNA secondary structures. Biopolymers. 1993, 33 (9): 13891404. 10.1002/bip.360330909.
 48.
Höchsmann M, Voss B, Giegerich R: Pure multiple RNA secondary structure alignments: a progressive profile approach. IEEE Trans Comput Biol Bioinformatics. 2004, 1: 5362. 10.1109/TCBB.2004.11.
 49.
Klein R, Eddy S: RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003, 4: 4410.1186/14712105444.
 50.
Andronescu M, Bereg V, Hoos HH, Condon A: RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics. 2008, 9: 34010.1186/147121059340.
 51.
Massey Jr F: The KolmogorovSmirnov test for goodness of fit. J Am Stat Assoc. 1951, 46: 6878. 10.1080/01621459.1951.10500769.
 52.
Pace NR, Brown JW: Evolutionary perspective on the structure and function of ribonuclease P, a ribozyme. J Bacteriol. 1995, 177 (8): 19191928.
 53.
Brown J: The ribonuclease P database. Nucleic Acids Res. 1999, 27: 31410.1093/nar/27.1.314.
 54.
Murray J, Terwey D, Maloney L, Karpeisky A, Usman N, Beigelman L, Scott W: The structural basis of hammerhead ribozyme selfcleavage. Cell. 1998, 92 (5): 665673. 10.1016/S00928674(00)811344.
 55.
Hean J, Weinberg M: The hammerhead ribozyme revisited: new biological insights. RNA and the Regulation of Gene Expression: A Hidden Layer of Complexity. Edited by: Morris KV. 2008, Caister Academic, Pr, 11.
 56.
Pley H, Lindes D, DeLucaFlaherty C, McKay D: Crystals of a hammerhead ribozyme. J Biol Chem. 1965, 268 (26): 6
 57.
Scott W, Finch J, Klug A: The crystal structure of an allRNA hammerhead ribozyme. Nucleic Acids Symposium Series, Volume 34. 1995, IRL PRESS LTD, 214216.
Acknowledgements
The authors are grateful to Naama Amir for pointing out challenges in extending previous algorithms to allow deletions from both trees. This research was partially supported by ISF grants 478/10 and 580/11, and by a donation from the Moshe Yanai foundation.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
NM participated in some parts of the algorithm construction, implemented the software, participated in the writing of the manuscript and was in charge of the data collection, the statistical model, the experimental results, and the web tool. SZ contributed in all parts of the algorithm construction, participated in the software implementation and in the writing of the manuscript. EK participated in a preliminary version of this study. EB participated in the statistical model and in the experimental results. YD participated in some parts of the algorithm construction and in the overall checking of the text. MZU conceived, designed and led the study, and participated in all parts of the project, including the algorithm construction, the experimental results and the writing of the manuscript. All authors read and approved the final manuscript.
Nimrod Milo, Shay Zakov contributed equally to this work.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.