Unrooted unordered homeomorphic subtree alignment of RNA trees
 Nimrod Milo†^{1},
 Shay Zakov†^{2},
 Erez Katzenelson^{1},
 Eitan Bachmat^{1},
 Yefim Dinitz^{1} and
 Michal ZivUkelson^{1}Email author
https://doi.org/10.1186/17487188813
© Milo et al.; licensee BioMed Central Ltd. 2013
Received: 20 December 2012
Accepted: 5 February 2013
Published: 16 April 2013
Abstract
We generalize some current approaches for RNA tree alignment, which are traditionally confined to ordered rooted mappings, to also consider unordered unrooted mappings. We define the Homeomorphic Subtree Alignment problem (HSA), and present a new algorithm which applies to several modes, combining global or local, ordered or unordered, and rooted or unrooted tree alignments. Our algorithm generalizes previous algorithms that either solved the problem in an asymmetric manner, or were restricted to the rooted and/or ordered cases. Focusing here on the most general unrooted unordered case, we show that for input trees T and S, our algorithm has an O(n_{ T }n_{ S } + min(d_{ T },d_{ S })L_{ T }L_{ S }) time complexity, where n_{ T },L_{ T } and d_{ T } are the number of nodes, the number of leaves, and the maximum node degree in T, respectively (satisfying d_{ T } ≤ L_{ T } ≤ n_{ T }), and similarly for n_{ S },L_{ S } and d_{ S } with respect to the tree S. This improves the time complexity of previous algorithms for less general variants of the problem.
In order to obtain this time bound for HSA, we developed new algorithms for a generalized variant of the MinCost Bipartite Matching problem (MCM), as well as to two derivatives of this problem, entitled AllCavityMCM and AllPairsCavityMCM. For two input sets of size n and m, where n ≤ m, MCM and both its cavity derivatives are solved in O(n^{3} + n m) time, without the usage of priority queues (e.g. Fibonacci heaps) or other complex data structures. This gives the first cubic time algorithm for AllPairsCavityMCM, and improves the running times of MCM and AllCavityMCM problems in the unbalanced case where n ≪ m.
We implemented the algorithm (in all modes mentioned above) as a graphical software tool which computes and displays similarities between secondary structures of RNA given as input, and employed it to a preliminary experiment in which we ran allagainstall interfamily pairwise alignments of RNAse P and Hammerhead RNA family members, exposing new similarities which could not be detected by the traditional rooted ordered alignment approaches. The results demonstrate that our approach can be used to expose structural similarity between some RNAs with higher sensitivity than the traditional rooted ordered alignment approaches. Source code and webinterface for our tool can be found in http://www.cs.bgu.ac.il/\~negevcb/FRUUT.
Keywords
Background
Secondary structure of RNA molecules serves important functions in many noncoding RNAs [1]. Functional constraints lead to evolutionary structural conservation that in many cases exceeds the level of sequence conservation. Thus, detecting similarity between RNA secondary structures is of major importance in functional RNA research [2, 3].
A mainstream approach for (pseudoknot free) RNA secondary structure comparison represents them as trees, and applies tree alignment algorithms [4–6] to their comparison.
Several variants of tree edit distance and alignment problems were previously studied. These variants differ in the type of trees they examine (ordered/unordered, rooted/unrooted), in the type of edit operations or alignment restrictions they apply [4–14], and in their algorithmic approaches (see [7]).
The general Unordered Tree Edit Distance problem is MAXSNP hard [20], promoting the study of constrained variants. The Subtree Isomorphism problem [21, 22] is, given a pattern tree S and a text tree T, to find if there is some subtree T′ of T which is isomorphic to S. The Subtree Homeomorphism problem [23–25] is a variant of the former problem, where degree2 nodes may be deleted from the selected subtree T′ of the text. Pinter et al. [26] efficiently solved the Subtree Homeomorphism problem, under the unrooted unordered settings. In addition, their algorithm assigns costs for alignments and finds an alignment of minimum cost, thus solving a weighted variant of the problem. The running time of the algorithm of [26] is $O({n}_{S}^{2}{n}_{T}+{n}_{S}{n}_{T}log{n}_{T})$, where n_{ T } and n_{ S } are the number of nodes in T and S, respectively (improved time complexities under some scoring scheme restrictions were also shown in [26]). The Constrained Edit Distance Between Unordered Labeled Trees problem, presented by Zhang in [27], is a restricted version of rooted unordered tree edit distance, which allows the edit operations of node relabeling, subtree pruning, and deletions of degree2 nodes (where in the general edit distance variant, nodes of arbitrary degrees may be deleted). Zhang gave an O(n_{ T }n_{ S }(d_{ T } + d_{ S })l o g(d_{ T } + d_{ S })) time algorithm for this variant, where d_{ T } and d_{ S } are the maximum node degrees in T and S respectively. In this sense, the algorithm of [27] can be viewed as a symmetric (allowing deletions from both input trees), yet rooted variant of the algorithm of [26].
The essential approach in many tree comparison algorithms is a recursive rooted comparison of subtrees of the input trees, and finding the best combination of such subinstance solutions to yield a solution for the input instance. The computation considers the cases in which either one of the roots is deleted, and the case where the roots are aligned to each other. In the latter case, it is required to find an optimal matching between the two children sets of the roots, where in the ordered variant such matching is restricted to maintain the child order (and may be computed by a reduction to sequence alignment), and in the unordered variant no such restriction holds (and thus an optimal matching can be found by a bipartite graph matching algorithm).
Our contribution
We propose an efficient algorithm for comparing unordered unrooted trees. Specifically, we define the Homeomorphic Subtree Alignment (HSA) problem, for which we give an O(n_{ T }n_{ S } + m i n(d_{ T },d_{ S })L_{ T }L_{ S }) running time algorithm, where L_{ T } and L_{ S } are the numbers of leaves in the input trees T and S, respectively. Our approach can be viewed as a generalization of the two previous works [26, 27], which relaxes the asymmetric “textpattern” restriction of [26], as well as the rooting restriction of [27].
Both algorithms in [26, 27], as well as the algorithm presented here, make use of subroutines for solving the Minimum Cost Bipartite Matching (MCM) problem, which dictate their time complexities. Here, we define the AllPairsCavityMCM problem, a generalization of the AllCavityMCM problem [28], and show how to integrate it into our tree alignment algorithm. For MCM and both its cavity derivatives, we use similar ideas to those applied in [29] to obtain O(n^{3} + n m) time algorithms, where n and m are the sizes of the input sets, and n ≤ m. This gives the first cubic time algorithm for AllPairsCavityMCM, and improves the running times of MCM and AllCavityMCM problems in the unbalanced case where n ≪ m. The new MCM algorithms we developed allow our HSA algorithm to match, and even improve, the running times of the previous algorithms of [26, 27] for less general variants of the HSA problem.
We implemented the algorithm (in all combininations of global or local, ordered or unordered, and rooted or unrooted modes) as a graphical software tool named FRUUT which computes and displays similarities between secondary structures of RNA given as input, and employed it to a preliminary experiment in which we ran allagainstall interfamily pairwise alignments of RNAse P and Hammerhead RNA family members, exposing new similarities which could not be detected by the traditional rooted ordered alignment approaches. The results demonstrate that our approach can be used to expose structural similarity between some RNAs with higher sensitivity than the traditional rooted ordered alignment approaches.
Preliminaries
Tree notations
A tree T = (V,E) is an undirected, connected and acyclic graph. For a node v ∈ V, denote by N(v) the set of neighbors of v: N(v) = {u ∈ V:(v,u) ∈ E}. Denote by d_{ v } = N(v) the degree of v. A node v for which d_{ v } ≤ 1 is called a leaf in T. For simplicity, we henceforth use the notation v ∈ T and (v,u) ∈ T to imply that v is a node and (v,u) is an edge in a tree T. We use the notation (v → u) to indicate that the generally undirected edge (v,u) is being considered with respect to the specific direction from v to u. Denote by n_{ T }, L_{ T }, and d_{ T } the number of nodes, the number of leaves, and the maximum degree of a node in T, respectively.
A rooted tree is a tree in which one of the nodes is selected as its root. Denote by T^{ v } the tree T when rooted upon the node v ∈ T. An ordered tree is a tree T in which for each node v ∈ T, the elements in N(v) are ordered. In this work we consider unrooted unordered trees, rooted unordered trees, unrooted ordered trees, and rooted ordered trees. If no indication is given, we assume that the mentioned trees are unrooted and unordered.
Let T = (V,E) be a tree. A smoothing of a node v of degree 2 in T is obtained by removing v from T and connecting its two neighbors by an edge. A smoothing of T is a tree obtained by smoothing zero or more nodes in T. A subtree of T is a connected subgraph of T. For an edge (v → u) ∈ T, denote by ${T}_{u}^{v}$ the rooted subtree of T induced by v as a root, and all nodes x in T such that the path between v and x in T starts with (v → u).
Since a tree T with n nodes contains n  1 undirected edges, the total number of directed edges, and hence the number of rooted subtrees of the form ${T}_{u}^{v}$, is 2(n  1).
A pruning of a tree T with respect to an edge (v → u) is the removal from T of all nodes in ${T}_{u}^{v}$, except for v. Observe that every nonempty subtree of T is obtained by pruning T with respect to zero or more edges.
MinCost bipartite matching
Similarly to previous tree alignment and edit distance algorithms [26–28], the algorithm presented here makes use of mincost bipartite matching algorithms as subroutines. Below, we define extended variants of the bipartite matching problem, in which the input groups may be ordered or unordered, and the score incorporates both standard element matching scoring terms, as well as penalties for unmatched elements. In addition, we define “cavity” variants of the problem, which are used for speeding up our tree alignment algorithms.
The (generalized) MinCost bipartite matching problem (MCM)
A matching instance is a triplet (X,Y,w), where X and Y are two sets, and w is a matching cost function for X and Y. The MinCost Bipartite Matching problem (MCM) is, given a matching instance (X,Y,w), to find the minimum cost of a matching between X and Y with respect to w. Denote by MCM(X,Y,w) the solution of the MCM problem for the instance (X,Y,w), and call a matching whose cost equals to the solution optimal.
Numerous works study and suggest algorithms for the MCM problem, usually when no unmatched element costs are taken into account (see e.g., [30–33]). A standard approach is to reduce MCM to the MinCost MaxFlow problem, which yields an O(n m^{2} + m^{2} log m) algorithm for MCM, where n = min(X,Y) and m = max(X,Y). In [27], an adapted reduction was presented which generalizes the problem definition to incorporate unmatched element costs and also runs in O(n m^{2} + m^{2} log m) time. An algorithm suggested by Dinitz in [29] solves the MCM problem in O(n^{3} + n m) time, without reducing it to MinCost MaxFlow. Since that paper is in Russian and since it considers neither unmatched element costs nor the cavity variants of MCM (see the following section), we prefer to explicitly adapt the MinCost MaxFlow approach to our needs, using some variation of the ideas of [29].
Cavity MCM
The AllCavityMCM problem [28] is, given a matching instance (X,Y,w), to compute MCM(X,Y ∖ {y},w) for all y ∈ Y. We define the AllPairsCavityMCM problem as, given a matching instance (X,Y,w), to compute MCM(X ∖ {x},Y ∖ {y},w) for all x ∈ X and y ∈ Y.
Clearly, algorithms for both AllCavityMCM and AllPairsCavityMCM problems can be implemented by repeatedly running an algorithm for MCM on all required inputs. In [28], an algorithm for AllCavityMCM was proposed, which is more efficient than the naïve algorithm and retains the same cubic running time as the standard algorithm for MCM. To the best of our knowledge, no algorithm for AllPairsCavityMCM which improves upon the naïve algorithm (i.e. repeatedly executing the algorithm of [28] for AllCavityMCM over all x ∈ X) was previously described.
In Section ‘Algorithms for bipartite matching problems’, we give new algorithms for (generalized, unordered) MCM, AllCavityMCM and AllPairsCavityMCM. The running times of these algorithms are summarized in the following theorem, whose correctness is shown in Section ‘Algorithms for bipartite matching problems’.
Theorem 1. Let (X,Y,w) be a matching instance, and denote n = min(X,Y), m = max(X,Y). Then, each one of the problems MCM, AllCavityMCM, and AllPairsCavityMCM over the instance (X,Y,w) can be solved in O(n^{3} + n m) running time. Moreover, this may be done without the usage of priority queues (e.g. Fibonacci heaps [31]) or other complex data structures.
Ordered MCM variants
For an ordered set Z = 〈 z_{0},z_{1},…,z_{n1} 〉 and an integer k, the k rotation of Z is the reordering of its elements ${Z}_{k}=\u3008{z}_{0}^{\prime},{z}_{1}^{\prime},\dots ,{z}_{n1}^{\prime}\u3009$, where ${z}_{i}^{\prime}={z}_{(i+k)\text{mod}n}$ (that is, Z_{ k } = 〈 z_{ k },z_{k + 1},…,z_{n1},z_{0},…,z_{k1} 〉). Note that Z_{0} = Z.
Let X and Y be two ordered sets, and M a bipartite matching between X and Y. Say that M preserves linear order if for every $({x}_{i},{y}_{j}),({x}_{{i}^{\prime}},{y}_{{j}^{\prime}})$ ∈ M, i ≤ i′ ⇔ j ≤ j′. Say that M preserves cyclic order if there are some integers k,l such that M preserves linear order with respect to the rotated sets X_{ k } and Y_{ l }. It is possible to show, that defining M as preserving cyclic order if there exists an integer l such that M preserves linear order with respect to X and Y_{ l }, is equivalent to the definition above.
The Linear Ordered MCM problem (LinearMCM) and the Cyclic Ordered MCM problem (CyclicMCM) are defined similarly to MCM, with the restrictions that the considered matchings have to preserve linear or cyclic order, respectively. LinearMCM is essentially equivalent to the Sequence Alignment problem, which can be solved in O(n m) running time [34]. CyclicMCM can be solved by taking the minimum cost solution among LinearMCM solutions for all rotations of the smaller set in the input, in O(n^{2}m). More efficient algorithms for CyclicMCM can be implied from [35–37].
Homeomorphic subtree alignment
Denote by HSA(T,S) the minimum alignment cost of an HSA between T and S, and call an HSA A optimal with respect to T and S if w(T,S,A) = HSA(T,S). The MinCost HSA problem is, given a pair of trees T and S, to compute HSA(T,S).
Remark 1. We do not give the details of how to construct optimal alignments in this paper. As usual for dynamic programming algorithms, this may be done by a standard backtracking procedure applied on the computed dynamic programming tables.
Rooted and ordered alignments
In addition to the general MinCost HSA problem, we also consider special cases of the problem in which the two input trees are rooted and/or ordered, and the alignment is required to satisfy certain restrictions with respect to these additional properties. For two rooted trees T^{ v } and ${S}^{v\prime}$, say that A is a rooted HSA between T^{ v } and ${S}^{v\prime}$ if A is an HSA between T and S, and (v,v′) ∈ A. The definition of ordered HSA requires some additional formalism, related to bipartite matchings.
Let A be an HSA between the trees T and S. For an edge (v→u) ∈ T, say that u is a relevant neighbor of v (with respect to A) if there is some $x\in {T}_{u}^{v},x\ne v$, which is aligned by A. Define relevant neighbors in S similarly.
Observation 1. Let A be an HSA between trees T and S, and (v,v′), (x,x′), (y,y′) ∈ A. The path between x and y in T goes through v if and only if the path between x′ and y′ in S goes through v′.
The correctness of the observation can be asserted from the fact that A is an isomorphic alignment between a smoothed subtree of T that contains v,x, and y, and a smoothed subtree of S that contains v′,x′, and y′.
Lemma 1. Let (v,v′) ∈ A, and let u be a relevant neighbor of v. Then, there is a unique relevant neighbor u′ of v′ such that for every (y,y′) ∈ A, $y\in {T}_{u}^{v}\iff {y}^{\prime}\in {S}_{{u}^{\prime}}^{{v}^{\prime}}$.
Proof. Since u is a relevant neighbor of v, there is a node $x\in {T}_{u}^{v}$, such that x ≠ v and x is aligned by A. Let x′ = A(x), and let u′ be the relevant neighbor of v′ such that ${x}^{\prime}\in {S}_{{u}^{\prime}}^{{v}^{\prime}}$. For (y,y′) ∈ A, observe that $y\notin {T}_{u}^{v}$ if and only if the path between x and y in T passes through v. Similarly, ${y}^{\prime}\notin {S}_{{u}^{\prime}}^{{v}^{\prime}}$ if and only if the path between x′ and y′ in S passes through v′. Applying Observation 1, this implies that $y\in {T}_{u}^{v}\iff {y}^{\prime}\in {S}_{{u}^{\prime}}^{{v}^{\prime}}$. □
Now, we can define three additional variants of the HSA problem. Let T^{ v } and${S}^{{v}^{\prime}}$ be rooted and ordered trees. Denote by OrderedHSA(T,S),$\mathit{\text{RootedHSA}}({T}^{v},{S}^{{v}^{\prime}})$ and$\mathit{\text{OrderedRootedHSA}}({T}^{v},{S}^{{v}^{\prime}})$ the minimum costs of an ordered HSA, a rooted HSA, and an ordered and rooted HSA between T^{ v } and ${S}^{{v}^{\prime}}$, respectively. Define the corresponding variants of the MinCost HSA problem whose goals are to compute these values.
Algorithm for homeomorphic subtree alignment
In this section we describe a basic algorithm for HSA for its unordered unrooted variant (though it is adequate for the other variants as well with some simple modifications).
Recursive computation
Call a rooted alignment nontrivial if it aligns at least one additional pair of nodes besides the roots. Note that every rooted subalignment${A}_{u}^{v}$ is nontrivial (since u and u′ are relevant neighbors of v and v′). Denote by ${\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ the minimum w^{r} cost of a nontrivial rooted alignment between ${T}_{u}^{v}$ and ${S}_{{u}^{\prime}}^{{v}^{\prime}}$.
Clearly, if A is an optimal rooted HSA between T^{ v } and ${S}^{{v}^{\prime}}$, then for each$(u,{u}^{\prime})\in {M}_{v,{v}^{\prime}}^{A}$${w}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}},{A}_{u}^{v}\right)={\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ (otherwise, it is possible to produce a rooted alignment with a better cost than A for T^{ v } and ${S}^{{v}^{\prime}}$). Define the bipartite matching instance $\left(N\right(v),N({v}^{\prime}),{w}_{v,{v}^{\prime}})$, where for u∈N(v), u′∈N(v′), se${w}_{v,{v}^{\prime}}(u,{u}^{\prime})={\mathit{\text{RootedHSA}}}^{r}({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}),\phantom{\rule{1em}{0ex}}{w}_{v,{v}^{\prime}}\left(u\right)=\mathit{\text{prune}}\left({T}_{u}^{v}\right)$, and ${w}_{v,{v}^{\prime}}\left({u}^{\prime}\right)=\mathit{\text{prune}}\left({S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$. Observe that the righthand side of Equation 4 equals the cost of the bipartite matching${M}_{v,{v}^{\prime}}^{A}$ for the matching instance$\left(N\right(v),N({v}^{\prime}),{w}_{v,{v}^{\prime}})$ (see Figure 3b). In addition, every bipartite matching between N(v) and N(v′) corresponds to some valid rooted HSA between T^{ v } and ${S}^{v\prime}$, so that the matching and alignment costs are equal.
Proof. [Equation 7]. By definition of the RootedHSA^{r}$\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ score, it corresponds to the score of the best nontrivial alignment between${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$, minus the root alignment cost term a l i g n(v,v′). We may cover the set of all possible nontrivial rooted alignments between${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$ by three sets: (a) alignments in which u is unmatched, (b) alignments in which u′ is unmatched, and (c) alignments in which both u and u′ are matched (note that there might be an intersection between groups (a) and (b)). We show that each one of the terms I, II, and III in the righthand side of Equation 7 computes the minimum cost of an alignment in each one of the groups (a), (b), and (c), respectively, and therefore the minimum among all these terms gives the correct value${\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$.
hence the correctness of term I.
The proof that term II of the equation computes the minimum cost of an alignment in group (b) is symmetric to the proof of term I. As for term III, note that the case where u and u′ are matched, but not to each other, implies a contradiction to Observation 1. Therefore, for any alignment in group (c), and in particular for such an alignment A of minimum cost, we have that (u,u′) ∈ A. In this case, the optimal alignment cost is obtained immediately from applying Equation 5 for this specific subinstance, as formulated by term III in the equation. □
The computation of${w}_{u,{u}^{\prime}}^{v,{v}^{\prime}}$ requires the computation of scores of the form${\mathit{\text{RootedHSA}}}^{r}\left({T}_{x}^{u},{S}_{{x}^{\prime}}^{{u}^{\prime}}\right)$ for all x ∈ N(u) ∖ {v} and all x′ ∈ N(u′) ∖ {v′}. It can be shown that all RootedHSA^{r} solutions required for the computation of the righthand side of the equation are for strictly smaller subinstances than the subinstance appearing in the lefthand side, thus the termination of the recursive computation is guaranteed. Equation 7 can be efficiently computed using Dynamic Programming (DP), as summarized by Algorithm 1 below.
Algorithm 1: HSA (T,S)
In Section ‘Time complexity of Algorithm 1’, we show that a straightforward implementation of Algorithm 1 obtains the running time of O(min(d_{ T },d_{ S })n_{ T }n_{ S } + min (d_{ T },d_{ S })^{3}L_{ T }L_{ S }). For some trees T,S with n_{ T },L_{ T },d_{ T }, n_{ S },L_{ S },d_{ S } = Θ(n) (e.g. “star” trees), this implies an O(n^{5}) running time. In Section ‘Improving the time complexity’, we show how to improve this time bound and obtain a cubic time algorithm for the problem.
We note that Algorithm 1 generalizes to also solve the ordered unrooted, unordered rooted, and ordered rooted variants of MinCost HSA in polynomial time. In case a rooted alignment is sought, the algorithm can compute Equation 5 in line 3 only with respect to the two roots, and avoid the computation of Equation 6. In case an ordered alignment is sought, the MCM application in Equation 5 can be replaced by CyclicMCM, and in Equation 7MCM can be replaced by LinearMCM, similarly to [27]. Traditionally, ordered matchings are implemented via reduction to sequence alignment [7, 26, 38]. Similarly to the improvement described in the next section for the unordered tree alignment algorithm, it seems that fast incremental/decremental versions of ordered matchings [35–37] can be integrated into the ordered variant of our algorithm to improve its time complexity. However, the detailed description of these techniques are beyond the scope of this paper.
Time complexity of Algorithm 1
We first analyze the time complexity of a straightforward implementation of the algorithm. Then, in Section ‘Improving the time complexity’, we show how this time complexity can be reduced by applying cavity matching subroutines.
Let$\mathit{\text{index}}\left({T}_{u}^{v}\right)$ and$\mathit{\text{index}}\left({S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ denote the row and column indices of${T}_{u}^{v}$ and${S}_{{u}^{\prime}}^{{v}^{\prime}}$ in H, respectively. Let u ∈ T and u′ ∈ S be a pair of nodes. The set of subtrees${T}_{u}^{v}$ for v ∈ N(u) corresponds to a subset of rows in H. Similarly, the set of subtrees${S}_{{u}^{\prime}}^{{v}^{\prime}}$ for v′ ∈ N(u′) corresponds to a subset of columns in H, and thus all solutions of the form${\mathit{\text{RootedHSA}}}^{r}\left({T}_{u}^{v},{S}_{{u}^{\prime}}^{{v}^{\prime}}\right)$ are stored in a submatrix of H of size${d}_{u}\times {d}_{{u}^{\prime}}$. Let${H}_{u,{u}^{\prime}}$ denote this submatrix, and let v_{ i } and${v}_{j}^{\prime}$ denote nodes in N(u) and N(u′) such that${T}_{u}^{{v}_{i}}$ and${S}_{{u}^{\prime}}^{{v}_{j}^{\prime}}$ correspond to the ith row and jth column in${H}_{u,{u}^{\prime}}$, respectively (i.e.$\mathit{\text{index}}\left({T}_{u}^{{v}_{1}}\right)$ is the first row in${H}_{u,{u}^{\prime}}$,$\mathit{\text{index}}\left({T}_{u}^{{v}_{2}}\right)$ is the second row, etc.). Note that H can be viewed as a union of submatrices of the form${H}_{u,{u}^{\prime}}$, where each entry in H is covered by exactly one submatrix${H}_{u,{u}^{\prime}}$ for some u ∈ T, u′ ∈ S.
Observation 2. For every 1 ≤ i ≤ d_{ u },$\mathit{\text{index}}\left({T}_{{v}_{i}}^{u}\right)<\mathit{\text{index}}\left({T}_{u}^{{v}_{2}}\right)$. Similarly, for every$1\le j\le {d}_{{u}^{\prime}}$,$\mathit{\text{index}}\left({S}_{{v}_{j}^{\prime}}^{u}\right)<\mathit{\text{index}}\left({S}_{{u}^{\prime}}^{{v}_{2}^{\prime}}\right)$.
In order to focus on the bottleneck expression in the running time analysis of the algorithm, we first summarize the complexity of its secondary computations in the following lemma:
Lemma 2. It is possible to implement Algorithm 1 so that all operations, besides computation of solutions to the MCM problem, require O(n_{ T }n_{ S }) running time.
Proof. It is simple to observe that the computations conducted in lines 1 and 4 of the algorithm consume O(n_{ T }n_{ S }) time (the computation of all subtree sizes and their sorting can be implemented in a linear time in a straightforward manner, where the details are omitted from this text). As computations of Equation 5 in line 3 and of term III of Equation 7 in line 2 are dominated by MCM computations, it remains to show that it is possible to compute terms I and II of Equation 7 in line 2 of the algorithm in O(n_{ T }n_{ S }) along the entire run of the algorithm.
Before continuing with the time complexity analysis, we formulate an auxiliary lemma.
Lemma 3. For a tree T,$\sum _{v\in T}{d}_{v}=O\left({n}_{T}\right)$, and$\sum _{\genfrac{}{}{0ex}{}{v\in T,}{{d}_{v}\ge 3}}{d}_{v}=O\left({L}_{T}\right)$.
and the lemma follows. □
Thus, the overall MCM computation time due to term III in Equation 7 is O((d_{ T } + d_{ S })n_{ T }n_{ S } + d_{ T }d_{ S }L_{ T }L_{ S } + min(d_{ T },d_{ S })^{ 3 }L_{ T }L_{ S }).
Therefore, the overall running time of Algorithm 1 is dictated by the the bottleneck expression O((d_{ T } + d_{ S })n_{ T }n_{ S } + d_{ T }d_{ S }L_{ T }L_{ S } + min(d_{ T },d_{ S })^{3}L_{ T }L_{ S }).
Improving the time complexity
The time analysis in the previous section shows that all operations in Algorithm 1, besides MCM computations due to term III of Equation 7, are conducted in O(n_{ T }n_{ S } + min(d_{ T },d_{ S })L_{ T }L_{ S }) time. In this section, we show how to improve the time complexity of Algorithm 1, by incorporating cavity matching subroutines to speed up the MCM computations due to term III of Equation 7.
matching the running time of all other computations, and we get the following theorem:
Theorem 2. Algorithm 1 can be implemented with an O(n_{ T }n_{ S } + min(d_{ T },d_{ S })L_{ T }L_{ S }) = O(min(n_{ T },n_{ S }) n_{ T }n_{ S }) time complexity.
We would like to emphasize that replacing n_{ T } and n_{ S } by L_{ T } and L_{ S } or d_{ T } and d_{ S } in the time complexity term of Theorem 2 is due to a refined analysis rather than some algorithmic improvement, and such an analysis can also be applied to refine the time complexities of some of the previous algorithms. While in many tree comparison applications typical input trees T are characterized by having the number of leaves at the same order of magnitude as the number of nodes (i.e. L_{ T } = Θ (n_{ T })), and sometimes it is true that d_{ T } = Θ (n_{ T }) (e.g. in starlike trees), there are cases where L_{ T },d_{ T } ≪ n_{ T }. Specifically, it can be asserted that removing (by node smoothing) or adding (by subdividing edges) degree2 nodes to a tree do not change its maximum degree nor the number of its leaves, and thus trees with a high number of degree2 nodes have a low maximum degree and a small number of leaves with respect to the total number of nodes. As can be shown in our examples (see Section ‘RNA tree representation’), typical RNA trees in our application do have a relatively high number of degree2 nodes, and thus gain from the fact that the cubic term in the time complexity of the algorithm depends on the maximum node degrees and the number of leaves, rather than the number of nodes in the input trees.
Algorithms for bipartite matching problems
In this section we show efficient algorithms for the MCM, AllCavityMCM, and AllPairsCavityMCM problems defined in Section ‘MinCost bipartite matching’. These algorithms are based on a reduction to the MinCost MaxFlow problem. Since the MinCost MaxFlow problem is well known to computer scientists, we only provide a brief discussion of essential properties required for describing our algorithms, while omitting some of the details. For a definition of the MinCost MaxFlow problem, related theorems, and a thorough discussion of its properties, please refer to other works, e.g. [30, 39, 40].
Throughout this section, let (X,Y,w) be a matching instance. Assume w.l.o.g. that X ≤ Y, and denote X = n,Y = m. When the context is clear, instead of writing a “matching M for (X,Y,w)”, we simply write a “matching M”, and similarly when writing “M is an optimal matching” we mean that M is optimal with respect to (X,Y,w).
Efficient algorithm for MCM
Next, we refine the reduction of MCM to MinCost MaxFlow given at [27], in order to to improve the running time of the algorithm. Our modification reduces the O(n m^{ 2 } + m^{ 2 } logm) running time of the algorithm of [27] to O(n^{3} + n m), using a variant of the approach of [29].
Reducing MCM to MinCost MaxFlow
For x ∈ X and y ∈ Y, define the effective matching cost w_{ e }(x,y) of the pair (x,y) to be w_{ e }(x,y) = w(x,y)  w(x)  w(y). The effective matching cost w_{ e }(x,y) is the cost change due to the addition of the pair (x,y) into a matching in which both x and y are unmatched.
For each x ∈ X, define a subset Y_{ x } ⊆ Y of n smallest cost matches for x in Y, with respect to the effective matching costs. Define${Y}_{X}=\bigcup _{x\in X}{Y}_{x}$.
Lemma 4. There exists an optimal matching M^{∗} such that M^{∗} ⊆ X × Y_{ X }.
Proof. Let M^{∗} be an optimal matching. Since each element in X participates in at most one pair in M^{∗}, M^{∗} contains at most n pairs, and so there are at most n distinct elements y ∈ Y such that y ∈ M. If M^{∗} contains a pair (x,y) such that y ∉ Y_{ X }, then y ∉ Y_{ X }, and in particular there must be some y′ ∈ Y_{ x } such that y′ ∉ M (since Y_{ x } = n). By definition of Y_{ x }, w_{ e }(x,y′) ≤ w_{ e }(x,y), and the matching M′∗ obtained by removing the pair (x,y) from M^{∗} and adding the pair (x,y′) satisfies w(M′∗) = w(M^{∗})  w_{ e }(x,y) + w_{ e }(x,y′) ≤ w(M^{∗}). Since M^{∗} is optimal, M′∗ is also optimal. It is possible to continue and apply such modifications until getting an optimal matching M′′^{∗} which contains only pairs (x,y) such that y ∈ Y_{ X }, as required. □
Next, we describe how to reduce MCM into the MinCost MaxFlow problem. The reduction builds the cost flow network N = (G,s,t,c), where G is the network’s weighted graph, s and t are the source and sink nodes respectively, and c is the edge capacity function. The graph G = (V,E) is defined as follows (Figure 8a):

V = X ∪ Y_{ X }∪{s,t,ϕ}, where s, t, and ϕ are unique nodes different from all nodes in X and Y_{ x }. Note that we use the same notations for elements in X and Y_{ x } and their corresponding nodes in V, where ambiguity can be resolved by the context. By definition, Y_{ X } ≤ n^{2}, and so V = O(n^{2}).

E = E_{ 1 } ∪ E_{ 2 }, where E_{ 1 } = {(x,y):x ∈ X,y ∈ Y_{ x }}, and E_{ 2 } = {(s,x):x ∈ X} ∪ {(y,t):y ∈ Y_{ X }} ∪ {(x,ϕ):x ∈ X} ∪ {(ϕ,t)}. The cost c o s t(x,y) of every edge (x,y) ∈ E_{ 1 } is the corresponding effective matching cost w_{ e }(x,y), and the cost c o s t(u,v) of each edge (u,v) ∈ E_{ 2 } is zero. Note that E = O(n^{ 2 }.
The capacity function c assigns unit capacities to all edges in E, with the exception that c(ϕ,t) = n.
For a maximum flow f in N, define the set of pairs M_{ f } = {(x,y) : x ∈ X,y ∈ Y_{ X },f(x,y) = 1}.From flow conservation constraints, it is simple to assert that every z ∈ X ∪ Y participates in at most one pair in M_{ f }, and thus M_{ f } is a valid matching, and in addition, M_{ f } ⊆ X × Y_{ X }. For a matching M for (X,Y,w) such that M ⊆ X × Y_{ X }, define the maximum flow f_{ M } in N as the flow which is obtained by transmitting one flow unit on every path of the form s → x → y → t for all (x,y) ∈ M, and one flow unit on every path of the form s → x → ϕ → t for all x ∈ X such that x ∉ M. It is simple to observe that f_{ M } is a valid flow in N (satisfying the capacity and flow conservation constraints), where its maximality is asserted from the fact that it saturates the cut ({s},V ∖ {s}). Note that for every matching M ⊆ X × Y_{ X },${M}_{{f}_{M}}=M$, and for every maximum flow f in N,${f}_{{M}_{f}}=f$.
Denote by c o s t(f) the cost of a flow f in N, and by w_{X,Y} the summation${w}_{X,Y}=\sum _{z\in X\cup Y}w\left(z\right)$.
Lemma 5 shows the relation between a cost of a matching M ⊆ X × Y_{ X } and its corresponding flow f_{ M }.
Lemma 5. For every matching M ⊆ X × Y_{ X }, w(M) = c o s t(f_{ M }) + w_{X,Y}.
Proof. Note that only edges in E_{ 1 } in N are assigned nonzero costs, and therefore the cost of f_{ M } is given by$\mathit{\text{cost}}\left({f}_{M}\right)=\sum _{(x,y)\in {E}_{1}}{f}_{M}(x,y)\xb7\mathit{\text{cost}}(x,y)$. In addition, note that for every (x,y) ∈ E_{ 1 }, f_{ M }(x,y) = 1 if (x,y) ∈ M, and otherwise f_{ M }(x,y) = 0.
□
Call a minimumcost maximumflow in N an optimal flow with respect to N. Proposition 1 concludes the relation between optimal flows in N and optimal matchings of (X,Y,w).
Proposition 1. Let f^{ ∗ } an optimal flow in N. Then,$\mathit{\text{MCM}}(X,Y,w)=w\left({M}_{{f}^{\ast}}\right)=\mathit{\text{cost}}\left({f}^{\ast}\right)+{w}_{X,Y}$.
Proof. For the matching${M}_{{f}^{\ast}}$, we have that MCM$(X,Y,w)\le w\left({M}_{{f}^{\ast}}\right)\stackrel{\text{Lem.5}}{=}\mathit{\text{cost}}\left({f}_{{M}_{{f}^{\ast}}}\right)+{w}_{X,Y}=\mathit{\text{cost}}\left({f}^{\ast}\right)+{w}_{X,Y}$. On the other hand, from Lemma 4 there is an optimal matching M^{ ∗ } such that M^{ ∗ } ⊆ X × Y_{ X }, and$\mathit{\text{MCM}}(X,Y,w)=w\left({M}^{\ast}\right)\stackrel{\text{Lem.5}}{=}\mathit{\text{cost}}\left({f}_{{M}^{\ast}}\right)+{w}_{X,Y}\ge \mathit{\text{cost}}\left({f}^{\ast}\right)+{w}_{X,Y}$, proving the proposition. □
Efficient computation and time complexity
For an efficient computation of an optimal flow in N, we first describe a modification that can be applied to residual graphs and allows a computational speed up.
Let f be a flow in N, and y ∈ Y_{ X }. Due to the flow conservation constraints, either f does not transmit any flow unit through y, or f transmits one flow unit that enters y over some ingoing edge (x,y) (where y ∈ Y_{ X }), and leaves y by the unique outgoing edge (y,t) in E.In both cases, and since all edges adjacent to y have a unit capacity, the residual graph G^{ f } = (V,E^{ f }) contains a single outgoing edge from y: the original edge (y,t) in the former case, or the residual edge (y,x) in the latter case (where the edge (y,t) is reversed in G^{ f } to become an ingoing edge (t,y)). Denote by v_{ y } the unique node in X ∪ {t} into which there is an outgoing edge from y in E^{ f }. In addition, observe that the number of ingoing edges into y in G^{ f } equals to the number of ingoing edges into y in G, implying the following observation:
Observation 3
$\sum _{y\in {Y}_{X}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\left\left\{(u,y)\in {E}^{f}\right\}\right\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\sum _{y\in {Y}_{X}}\left\left\{(u,y)\in E\right\}\right\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\left{E}_{1}\right\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{n}^{2}.$
Let${\u011c}^{f}=(\widehat{V},{\xca}^{f})$be the graph defined by

$\widehat{V}=X\cup \{s,t,\varphi \}$,

${\xca}^{f}={\xca}_{1}^{f}\cup {\xca}_{2}^{f}$, where
${\xca}_{1}^{f}={E}^{f}\setminus \left\{(u,v)\in {E}^{f}:u\in {Y}_{X}\text{or}\phantom{\rule{1em}{0ex}}v\in {Y}_{X}\right\}$, and
${\xca}_{2}^{f}=\left\{(u,v):\exists y\in {Y}_{X}\text{s.t.}(u,y)\in {E}^{f}\text{and}\phantom{\rule{1em}{0ex}}v={v}_{y}\right\}$.
Observe that${\xca}_{1}^{f}$ is included in E^{ f }. Denote by$\mathit{\text{cos}}{t}_{{\xca}^{f}}$ the cost function over edges in${\xca}^{f}$. For$(u,v)\in {\xca}_{1}^{f}$ define$\mathit{\text{cos}}{t}_{{\xca}^{f}}(u,v)=\mathit{\text{cost}}(u,v),$ and for$(u,v)\in {\xca}_{2}^{f}$ define$\mathit{\text{cos}}{t}_{{\xca}^{f}}(u,v)=\underset{(}{\text{min}}\mathit{\text{cost}}(u,y)+\mathit{\text{cost}}(y,v)).$y ∈ Y_{ X } : (u,y) ∈ E^{ f } and v = v_{ y }.
For nodes u,v ∈ V, denote by${d}_{u,v}^{f}$ the minimum cost of a path from u to v in G^{ f }, and for$u,v\in \widehat{V}$, denote by${\widehat{d}}_{u,v}^{f}$ the minimum cost of a path from u to v in${\u011c}^{f}$ (if there is no path between u and v in one of these graphs, define the corresponding minimum path cost to be ∞).
Lemma 6. For every$u,v\in \widehat{V}$,${d}_{u,v}^{f}={\widehat{d}}_{u,v}^{f}$.
Proof. Let$u,v\in \widehat{V}$, and let P be a minimum cost path in G^{ f } from u to v. If P traverses a node y′ ∈ Y_{ x }, this traversal is of the form u^{′} → y^{′} → v^{′} for some${u}^{\prime},{v}^{\prime}\in \widehat{V}$ such that (u^{′},y^{′}) ∈ E^{ f } and${v}^{\prime}={v}_{{y}^{\prime}}$ (note that there are no edges between two nodes in Y_{ x }). By construction, the edge (u^{′},v^{′}) is in${\xca}_{2}^{f}$, where its cost satisfies$\mathit{\text{cos}}{t}_{{\xca}^{f}}({u}^{\prime},{v}^{\prime})=\underset{(}{\text{min}}\mathit{\text{cost}}({u}^{\prime},y)+\mathit{\text{cost}}(y,{v}^{\prime}))\le \mathit{\text{cost}}({u}^{\prime},{y}^{\prime})+\mathit{\text{cost}}({y}^{\prime},{v}^{\prime})$.y ∈ Y_{ X } : (u^{′},y) ∈ E^{ f } and v^{′} = v_{ y }.
Thus, each such subpath u′→y′→v′ in P can be replaced by an edge$({u}^{\prime},{v}^{\prime})\in {\xca}_{2}^{f}$, where the cost of the replacing edge is at most the cost of the subpath. This yields a corresponding path$\widehat{P}$ from u to v in${\u011c}^{f}$, which has the same or lower cost than P. In particular,${d}_{u,v}^{f}\ge {\widehat{d}}_{u,v}^{f}$.
On the other hand, let$\widehat{P}$ be a minimum cost path in${\u011c}^{f}$ from u to v. Similarly as above, every edge$({u}^{\prime},{v}^{\prime})\in {\xca}_{2}^{f}$ in $\widehat{P}$ can be replaced by a path u′ → y → v′ in E^{ f } such that $\mathit{\text{cost}}({u}^{\prime},y)+\mathit{\text{cost}}(y,{v}^{\prime})=\mathit{\text{cos}}{t}_{{\xca}^{f}}({u}^{\prime},{v}^{\prime})$, yielding a path P from u to v in G^{ f } of the same cost as $\widehat{P}$. Hence ${d}_{u,v}^{f}\le {\widehat{d}}_{u,v}^{f}$, and so we get that if there is a path from u to v in one of the graphs,${d}_{u,v}^{f}={\widehat{d}}_{u,v}^{f}$. In the case where there are no paths from u to v in both graphs,${d}_{u,v}^{f}={\widehat{d}}_{u,v}^{f}=\infty $. □
Lemma 7. For every y ∈ Y_{ X } and every$v\in \widehat{V}$,${d}_{y,v}^{f}=\mathit{\text{cost}}(y,{v}_{y})+{\widehat{d}}_{{v}_{y},v}^{f}$, and${d}_{v,y}^{f}=\underset{(u,y)\in {E}^{f}}{\text{min}}({\widehat{d}}_{v,u}^{f}+\mathit{\text{cost}}(u,y\left)\right)$.
Proof. Let y ∈ Y_{ X } and$v\in \widehat{V}$. A minimum cost path P from y to v in G^{ f } (if there is such a path) must start with the only outgoing edge (y,v_{ y }) from y, and from the optimality of P the remainder of P is a minimum cost path from v_{ y } into v in G^{ f }. Thus, the cost of P is ${d}_{y,v}^{f}=\mathit{\text{cost}}(y,{v}_{y})+{d}_{{v}_{y},v}^{f}\stackrel{\text{Lem.6}}{=}\mathit{\text{cost}}(y,{v}_{y})+{\widehat{d}}_{{v}_{y},v}^{f}$. If there is no path from y to v in G^{ f }, then in particular there is no path from v_{ y } to v in G^{ f }, and${d}_{y,v}^{f}=\infty ={d}_{{v}_{y},v}^{f}\stackrel{\text{Lem.6}}{=}{\widehat{d}}_{{v}_{y},v}^{f}=\mathit{\text{cost}}(y,{v}_{y})+{\widehat{d}}_{{v}_{y},v}^{f}$.
To show the second equality in the lemma, observe similarly that for a minimum cost path P from v to y in G^{ f } and the last edge (u′,y) in P, the cost of P is${d}_{v,y}^{f}={d}_{v,{u}^{\prime}}^{f}+\mathit{\text{cost}}({u}^{\prime},y)\stackrel{\text{Lem.6}}{=}{\widehat{d}}_{v,{u}^{\prime}}^{f}+\mathit{\text{cost}}({u}^{\prime},y)\ge \underset{(u,y)\in {E}^{f}}{\text{min}}({\widehat{d}}_{v,u}^{f}+\mathit{\text{cost}}(u,y\left)\right)$. Since for every u such that (u,y) ∈ E^{ f } there is a path from v to y of cost${d}_{v,u}^{f}+\mathit{\text{cost}}(u,y)$ (the path concatenating the edge (u,y) at the end of a minimum cost path from v to u in G^{ f }), it follows that${d}_{v,y}^{f}\phantom{\rule{0.6em}{0ex}}\le \phantom{\rule{0.6em}{0ex}}\underset{\phantom{\rule{0.3em}{0ex}}(u,y)\in {E}^{f}}{\text{min}}\phantom{\rule{0.3em}{0ex}}(\phantom{\rule{0.3em}{0ex}}{d}_{v,u}^{f}\phantom{\rule{0.6em}{0ex}}+\phantom{\rule{0.6em}{0ex}}\mathit{\text{cost}}(\phantom{\rule{0.3em}{0ex}}u,\phantom{\rule{0.3em}{0ex}}y\phantom{\rule{0.3em}{0ex}}\left)\phantom{\rule{0.3em}{0ex}}\right)\phantom{\rule{0.6em}{0ex}}\stackrel{\text{Lem.6}}{=}\phantom{\rule{1.2em}{0ex}}\underset{(u,y)\in {E}^{f}}{\text{min}}\phantom{\rule{0.3em}{0ex}}({\widehat{d}}_{v,u}^{f}\phantom{\rule{0.6em}{0ex}}+\phantom{\rule{0.6em}{0ex}}\mathit{\text{cost}}(\phantom{\rule{0.3em}{0ex}}u,\phantom{\rule{0.6em}{0ex}}y\phantom{\rule{0.3em}{0ex}}\left)\phantom{\rule{0.3em}{0ex}}\right)$, and we get that${d}_{v,y}^{f}=\underset{(u,y)\in {E}^{f}}{\text{min}}({\widehat{d}}_{v,u}^{f}+\mathit{\text{cost}}(u,y\left)\right)$. The case where there is no path from v to y in G^{ f } is resolved similarly as above. □
Using Lemmas 6 and 7, we turn to analyze the time complexity for the reduction we described for solving MCM(X,Y,w).
The computation of each subset Y_{ x } can be done in O(m) time, using the algorithm of [41], and so Y_{ x } can be computed in O(n m) time. The network N contains O(n^{2}) nodes and edges, and thus can be constructed in O(n^{2}) time.
An optimal flow in N can be found with the algorithm of Edmonds and Karp [30]. Essentially, this is an iterative algorithm that maintains a valid flow f in N. Starting with a zeroflow, at each iteration the algorithm increases the flow by adding the bottleneck capacity flow along a minimum cost path from s to t in the residual graph G^{ f }. In our specific reduction, all edges leaving s have a unit capacity, and thus each augmentation path increases the size of the flow by one unit. As the size of the maximum flow in N is n (asserted by the minimum cut ({s},V ∖ {s}) of capacity n in N), the algorithm performs n iterations.
The time required for each iteration is dictated by the time required for computing minimum cost paths from s into every node in the residual graph G^{ f }. Such paths can be computed efficiently for weighted graphs with nonnegative edge costs using Dijkstra’s algorithm [42]. In order to render all edge costs in the residual graphs to nonnegative, the algorithm of [30] applies a node labeling function. Such a function assigns to each node v ∈ V a label l(v), and shortest paths in the residual network are computed with respect to the modified edge costs c o s t′(u,v) = c o s t(u,v) + l(u)  l(v). It is known that a path from u to v in G^{ f } is of minimum cost with respect to the original cost function if and only if it is of minimum cost with respect to the modified cost function, and that setting$l\left(v\right)={d}_{s,v}^{f}$ for every v ∈ V, where f is the flow at the beginning of the ith iteration, guarantees nonnegative modified edge costs in the residual graph at the beginning of the (i+1)th iteration (after f was augmented and the residual graph was modified, see [30]). Thus, given that minimum cost paths are computed from s to all nodes in the residual graph, label maintenance at each iteration is done in linear time with respect to the number of nodes in the network.
In order to compute minimum cost paths from s to all nodes in G^{ f }, we first construct the corresponding graph${\u011c}^{f}$ as described above (under the assumption that edge costs are rendered to be nonnegative). It is simple to observe that this construction can be implemented in O(n^{2}) time. By construction,$\left\widehat{V}\right=\leftX\cup \{s,t,\varphi \}\right=O\left(n\right)$. Using Dijkstra’s algorithm [42],${\widehat{d}}_{s,v}^{f}$ can be computed for all$v\in \widehat{V}$ in$O\left(\widehat{V}{}^{2}\right)=O\left({n}^{2}\right)$ time. Note that we are referring to the original Dijkstra algorithm (published in 1959) rather than to its commonly used improvement due to Fredman and Tarjan [31]. While the latter improvement reduces the running time of the algorithm for nondense graphs, it involves the usage of a relatively sophisticated data structure (Fibonacci heap). In our case, the simple implementation described in [42] does not exceed the O(n^{ 2 }) running time required for the construction of${\u011c}^{f}$, without the usage of any kind of priority queue or other complex data structures.
Given the values${\widehat{d}}_{s,v}^{f}$ for all$v\in \widehat{V}$, values${d}_{s,v}^{f}$ can be computed for all v ∈ V by applying Lemmas 6 and 7. The time required for computing distances${d}_{s,v}^{f}$ for all$v\in \widehat{V}=V\setminus {Y}_{X}$ due to Lemma 6 is$O\left(\left\widehat{V}\right\right)=O\left(n\right)$, and the time required for computing distances${d}_{s,y}^{f}$ for all y ∈ Y_{ X } due to Lemma 7 is$\sum _{y\in {Y}_{X}}\left\left\{(u,y)\in {E}^{f}\right\}\right\stackrel{\text{Obs.3}}{=}O\left({n}^{2}\right)$. Therefore, each iteration is conducted in O(n^{2}) time, and the total running time of all n iterations of the algorithm is O(n^{3}).
A special attention is required though for the initialization of the algorithm of [30]. There, it was assumed that all edges in the input network N are of nonnegative costs, while our described reduction from MCM does not sustain this property. Nevertheless, this may be overcome by setting the initial labels of all nodes v ∈ V to the corresponding minimum path costs d_{s,v} in the graph G of N. Since G is acyclic, these initial costs can be computed in O(V + E) = O(n^{2}) time using a simple topological traversal (see e.g., [43]).
In all, we get that the total running time required for constructing a flow network N corresponding to the matching instance (X,Y,w) and finding an optimal flow in N is O(n^{3} + n m). Computing w_{X,Y} can be done in O(n + m) time, and applying Proposition 1, MCM(X,Y,w) can be computed in O(n^{3} + n m) time, proving the statement in Theorem 1 regarding MCM.
Additional practical improvements using sparsification
It is possible to further improve the running time of the reduction in practice, by reducing the number of edges in the network, and computing MinCost Flow instead of MinCost MaxFlow.
We may assume without lost of generality that an optimal matching M^{∗}between X and Y contains no pairs (x,y) such that w(x,y)  w(x)  w(y) ≥ 0 (since removing such an edge from the matching can only decrease its cost). Consequentially, there is an optimal flow in N that does not transmit flow through edges (x,y) for which w′(x,y) = w(x,y)  w(x)  w(y) ≥ 0. Therefore, the cost of an optimal flow in the network N′, obtained by removing such edges from N, equals to the cost of an optimal flow in N. Note that in the case of RNA tree alignment, removed edges correspond to pairs of subtrees which are sufficiently dissimilar so that their pruning cost is lower than their alignment cost. For a reasonable cost function, it is expected that many of the subtreepairs of the input would sustain this condition (even in the extreme case of aligning a tree to itself), as they represent different parts of complex molecules.
The second improvement exploits a property of the MinCost MaxFlow algorithm of [30], for which it was shown that the series of computed augmentation paths are of nondecreasing costs (Theorem 5 in [30]). Note that as long as there is an augmentation path in the network, there is such a path with cost zero (since there is some node x for which (s,x) is not saturated, and thus the path s → x → ϕ →t is an augmentation path of cost zero). Therefore, once the minimum cost of an augmentation path is zero, all consecutive augmentation paths would have a zero cost, and increasing the flow along these paths will not change the overall cost of the flow. It is thus possible to stop the flow algorithm upon the first iteration at which the minimum augmentation path cost is zero, where the flow at this stage is a minimumcost flow which is not necessarily of maximum size (while it has the same cost as a mincost maxflow). It is possible to show that increasing the flow over a negative cost augmentation path necessarily increases the size of the corresponding matching (otherwise it implies a negative cost cycle in the residual network), therefore the number of iterations in the above described MinCost Flow algorithm is M^{∗} ≤ n (where M^{∗} is an optimal matching of minimum size), and so the total running time of the algorithm is O(M^{∗}n^{2} + n m), which is faster than O(n^{ 3 } + n m) in the case where M^{ ∗ } is small. Again, for the RNA tree alignment application, in most cases it is expected that the matching would be computed with respect to dissimilar sets of subtrees, for which it is expected to have a small optimal matching size.
Efficient algorithms for AllCavityMCM and AllPairsCavityMCM
We now present Algorithm 2, which solves AllPairsCavityMCM. Similarly to Kao et al. [28], we show that solutions for instances of the form (X ∖ {x},Y ∖ {y},w) correspond to certain shortest paths in the residual flow network obtained when solving the instance (X,Y,w). This observation allows to solve both AllCavityMCM and AllPairsCavityMCM at the same time complexity O(n^{3} + n m) as that of the algorithm for MCM presented in the previous section.
Algorithm 2: AllPairsCavityMCM (X,Y,w)
In order to prove the algorithm’s correctness, we first formulate an auxiliary lemma.
Lemma 8. Let G = (V,E) be a directed graph with a cost function c o s t(·) over its edges and no negative cost cycle. Let P be a minimum cost path from a node y to a node x in G. Let G′ = (V,E′) be the graph obtained from G by adding, for every edge (u,v)∈P, the reversed edge (v,u) (if not already in G), and setting its cost to c o s t(v,u) =  c o s t(u,v). Then, G′contains no negative cost cycle.
Proof. Following [30], we call a flow extreme if it is of minimum cost among all flows of the same size. By [30] (Theorem 3), a flow is extreme if and only if the corresponding residual network contains no negative cost cycle. In addition, a flow obtained by increasing an extreme flow along a minimum cost augmentation path from the source to the sink of the network, is also extreme (see [44], page 121).
Let N be the flow network defined by the graph G, the source node y, the sink node x, the cost function c o s t(·), and the capacity function c which assigns two capacity units to all edges in G. Since there is no negative cost cycle in N, the zero flow is extreme with respect to zerosize flows. Let f be the unit flow along P in N, and note that the residual graph G^{ f } is identical to G′ by definition. Since P is a minimum cost path from y to x in G, f is extreme with respect to all flows of size 1. Thus, G^{ f } contains no negative cost cycle. □
In the remainder of this section, let N be the network corresponding to the matching instance (X,Y,w) and f an optimal flow in N. Denote by N_{x,y} the network whose graph G_{x,y} is obtained by excluding from G the node x, the node y in case that y ∈ Y_{ X }, and all edges adjacent to these nodes. The edge costs in G_{x,y} are inherited from G, and the maximum flow size in N_{x,y} is n  1. Note that any flow in N which does not transmit flow units through x and through y (when y ∈ Y_{ X }), is also a valid flow in N_{x,y}. In addition, observe that G_{x,y} contains the graph corresponding to the subinstance (X ∖ {x},Y ∖ {y},w) (it is possible that G_{x,y} contains additional nodes in Y ∖ {y} excluded from (Y ∖ {y})_{X ∖ {x}} and additional edges, since x′∈X ∖ {x} may have n adjacent edges of the form (x′,y′) in G_{x,y}, while in the graph corresponding to (X ∖ {x},Y ∖ {y},w) is has (n  1) such adjacent edges). Nevertheless, it can be asserted from the lemmas in Section ‘Reducing MCM to MinCost MaxFlow’, Proposition 1, and their proofs, that for an optimal flow f′ in N_{x,y}, MCM(X ∖ {x},Y ∖ {y},w) = c o s t(f^{′}) + w_{X ∖ {x},Y ∖ {y}}. The correctness of Algorithm 2 is implied by the following analysis.
Lemma 9. For every x ∈ X and every y ∈ Y_{ X }, there is a path from y to x in G^{ f }.
Proof. If f(y,t) = 0, then G^{ f } contains the edge (y,t), and in particular there is a path from y to t in G^{ f }. Else, f(y,t) = 1, and from flow conservations constraints there is some x′ ∈ X such that f(x′,y) = 1. In this case, the path y → x′ → ϕ → t is a path from y to t in G^{ f } (again, the existence of all edges of this path in G^{ f } can be asserted from the flow conservation constraints). Similarly, let z∈Y_{ X } ∪ {ϕ} be the node such that f(x,z) = 1, and note that the path t → z → x is a path from t to x in G^{ f }. The concatenation of a path from y to t and a path from t to x in G^{ f } proves that there is a path from y to x in G^{ f }. □
Now, we show the relation between MCM solutions for instances (X,Y,w) and solutions for subinstances of the form (X ∖ {x},Y ∖ {y},w). As defined in Algorithm 2, let${d}_{y,x}=\left\{\begin{array}{cc}{d}_{y,x}^{f},& y\in {Y}_{X},\\ {d}_{t,x}^{f},& y\in Y\setminus {Y}_{X}\end{array}\right.$
Lemma 10. The cost of an optimal flow in N_{x,y}is c o s t(f)+d_{y,x}.
Proof. In order to prove the lemma, we construct a flow f′ in N_{x,y} such that w(f^{′}) = c o s t(f) + d_{y,x}, and prove that f′ is optimal with respect to N_{x,y}.
Consider the case where y ∈ Y_{ X }, and so${d}_{y,x}={d}_{y,x}^{f}$. From Lemma 9, there exists a path from y to x in the residual graph G^{ f }. Let P be such a path of minimum cost, and define the weighted graph G′ that includes all nodes and edges in G^{ f } with the same edge costs, and in addition, for each (u,v) ∈ P, G′ contains the reversed edge (v,u), whose cost is set to c o s t(v,u) =  c o s t(u,v). Since G^{ f } contains no negative cost cycle (due to the optimality of f), Lemma 8 indicates that G′ contains no negative cost cycle.
Define the flow f′ is obtained from f by returning one flow unit from t to s along the path P′ as follows: if P starts with (y,t), then P′ is obtained by removing (y,t) from P and concatenating (x,s) at its end (see Figure 8d). Else, P′ is obtained by concatenating (t,y), P, and (x,s). Observe that f′ is a valid flow in N_{x,y} (since f′ passes no flow units through x and y), its cost is c o s t(f′) = c o s t(f)+c o s t(P′) = c o s t(f) + d_{y,x} (since edges adjacent to t and s have zero costs, and so the costs of P′ and P are equal), and its size is f′ = f  1 = n  1. Therefore, f′ is a maximum flow in N_{x,y}. Also, observe that the residual graph${G}_{x,y}^{{f}^{\prime}}$ is a subgraph of G′ (since the only edges in P′ which are not in P are adjacent to either x or y, and therefore their reversed edges, which are included in${G}^{{f}^{\prime}}$, are excluded from${G}_{x,y}^{{f}^{\prime}}$), and therefore it contains no negative cycle. Thus, f′ is an optimal flow in N_{x,y} of cost c o s t(f)+d_{y,x}.
For the case where y ∈ Y ∖ Y_{ X } (and${d}_{y,x}={d}_{t,x}^{f}$), let P be a minimum cost path from t to x in G^{ f }, and P^{′} the path from t to s in G^{ f } obtained by concatenating the residual edge (x,s) at the end of P. Similarly as above, it can be shown that the flow f′ obtained by returning one flow unit from t to s along P^{′} is an optimal flow in N_{x,y} of cost c o s t(f) + d_{y,x}, and the lemma follows. □
From Lemma 10, for every x ∈ X and every y ∈ Y, the value c o s t(f) + d_{y,x} is the cost of an optimal flow in N_{x,y}. From Proposition 1, adding to this cost the value w_{X ∖ {x},Y ∖ {y}} = w_{X,Y}  w(x)  w(y) gives the minimum matching cost for the matching instance (X ∖ {x},Y ∖ {y},w), and thus the correctness of Algorithm 2.
In order to use the algorithm for solving AllCavityMCM, we apply the following simple modification. Say we are interested in finding solutions to subinstances of the form (X,Y ∖ {y},w) for every y ∈ Y. We replace the set X by the set X′ = X∪{x_{ ϕ }} for some new element x_{ ϕ } ∉ X (and arbitrarily define w(x_{ ϕ }) = 0 and w(x_{ ϕ },y) = 0 for every y ∈ Y). We then solve AllPairsCavityMCM for the instance (X′,Y,w), and return MCM(X′∖{x_{ ϕ }},Y ∖ {y}) as the solution for the instance MCM(X,Y ∖ {y}). Finding solutions for all instances (X ∖ {x},Y,w) is done symmetrically.
Time complexity of Algorithm 2
From the analysis in Section ‘Efficient computation and time complexity’, lines 1 and 2 of the algorithm can be computed in O(n^{3} + n m) running time. After finding an optimal flow f in N, it is possible to compute for some x ∈ X the values${d}_{v,x}^{f}$ for every v ∈ V in O(n^{2}), as explained in Section ‘Efficient computation and time complexity’ (applying Lemma 7 and using a reversed variant of Dijkstra’s algorithm [42]). Thus, the computation of such values for all x ∈ X, performed in line 3 of the algorithm, can be implemented in O(n^{3}) running time.
Given these values, the execution of line 4 takes O(n m) time. In all, the running time of Algorithm 2 is O(n^{3} + n m), proving the statements regarding AllCavityMCM and AllPairsCavityMCM in Theorem 1.
Implementation details
Our algorithm is implemented (in Java) as a tool called FRUUT (Fast RNA Unordered Unrooted Tree mapper). The RNA tree representation is described in Section ‘RNA tree representation’ and the scoring scheme employed by FRUUT is described in Section ‘Alignment cost function’. FRUUT allows the user to select any alignment mode combination (rooted / unrooted, ordered / unordered, local / global) and to compute optimal pairwise alignments of RNA trees with an attached significance scoring model described in Section ‘pValue computation’. We also provide an interactive PHP webserver for running FRUUT in our website (RNA plots are are generated by the Vienna Package [2]).
RNA tree representation
The tree representation that we employed uses a similar modeling as in Höchsman et al. [48], with some variations we describe next. Given an RNA secondary structure, each loop, basepair, and interval of unpaired bases generates a node in the tree representation of the structure. Labels are assigned to tree nodes in order to indicate their type and content. Node types are: BP (basepair), UPI (unpairedbase interval), HP (hairpin), IL (internal loop or bulge), ML (multiloop), and EXT (external loop). For a UPI node, the label also includes the 5’ to 3’ basesequence of the corresponding interval, and for a BP node the label includes the corresponding bases (see Figure 9).
Each loop node (HP, IL, ML, and EXT) is connected, in 5’ to 3’ sequence order, to all UPI nodes which correspond to intervals of unpaired bases associated to the loop, and to all BP nodes which correspond to stemterminating basepairs adjacent to the loop. BP nodes are nodes of degree 2, where the two neighbors of such nodes are the BP nodes that correspond to adjacent stacked basepairs. The set of leaves in the tree corresponds to the set of UPI nodes or BP nodes terminating loopfree stems.
Alignment cost function
Cost functions: Element pruning penalty
Factor  C _{ u p }  C _{ b p }  C _{ h p }  C _{ m l }  C _{ i l }  C _{ e x t } 

Value  1  2.5  5  3  2  0 
Cost functions: Node matching costs
Type  UPI  BP  HP  IL  ML  EXT 

UPI  Sequence alignment  ∞  ∞  ∞  ∞  ∞ 
BP  ∞  Basepair alignment  ∞  ∞  ∞  ∞ 
HP  ∞  ∞  10  0  0  0 
IL  ∞  ∞  0  5  ∞  ∞ 
ML  ∞  ∞  0  ∞  7  7 
EXT  ∞  ∞  0  ∞  7  10 
Relative scoring
The scoring scheme we use satisfies that for every tree T, H S A_{ m }(T,T) < 0, and for every pair of trees T,S, H S A_{ m }(T,T),H S A_{ m }(S,S) ≤ H S A_{ m }(T,S). Under these conditions, the relative score for any pair of trees is upper bounded by 1, and the similarity of the trees increases as the score approaches 1.
pValue computation
We apply the following pValue computation in order to determine whether an alignment score obtained by comparing two RNA trees is significant.
Algorithm 3: shuffle (T)
Algorithm 4 creates a set of observations by randomly selecting pairs of trees from a dataset, shuffling them and reporting the relative score of their alignment. In each alignment mode m, where m ∈ {R O,R U,U O,U U}, we ran Algorithm 4, setting the input parameter dataset to all the structures from over the RNAStrand [50] database (containing 1751 RNA structures), and setting the number of iterations (the amount parameter, in the code below) to 2 × 10^{6}.
Stats (d a t a s e t,a m o u n t)
A Bonferroni correction was applied to all the reported pvalue computations described in the following sections. This was done by multiplying the computed pvalue by the number of tests performed (i.e. the number of tree pairs aligned within the family that participated in the corresponding test).
Algorithm 4: Stats (dataset, amount)
Results
RNase P family
RNase P is the endoribonuclease responsible for the 5’ maturation of tRNA precursors [19]. Secondary structures of bacterial RNase P RNAs have been studied in detail, primarily using comparative methods [52], and were shown to share a common core of primary and secondary structure. In bacteria, synthetic minimal RNase P RNAs consisting only of these core sequences and structures were shown to be catalytically proficient. Sequences encoding RNase P RNAs of various genomes have been determined and a database established [53], which consists of a compilation of ribonuclease P RNA sequences, sequence alignments, secondary structures and threedimensional models.
We conducted a preliminary experiment, intended to identify examples of pairs of RNA trees for which an RNA structural comparison approach supporting unrooting and branch shuffling may detect (otherwise hidden) structural similarity. To achieve this, we ran a benchmark of allagainstall pairwise alignments of bacterial RNAse P RNA secondary structure trees, using our tool’s different treealignment modes and comparing the differences between the obtained alignment costs. The alignment cost functions and parameters used in our experiment are given in Section ‘Alignment cost function’.
Our benchmark was based on 470 RNAse P structures, ranging across various organisms, taken from the RNAse P database [53] (molecule naming conventions are according to [50]). After filtering out partial and environmental sequences, 170 distinct structures remained, yielding 14,365 distinct pairs of trees. The sizes of the trees in this dataset ranged from 82 to 230 nodes, averaging at 142. The total running time of the benchmark was approximately 33 minutes on a single Xeon X5690 using around 300Mb of memory.
Each pair of trees T,S was compared in two modes to obtain the corresponding scores and alignments: rootedordered (RO) and rootedunordered (RU), and the relative score was computed for each pair in each mode according to Section ‘Relative scoring’.
Our goal in this experiment was to identify evolutionary events that can be explained by unordered alignments. Thus, we sought pairs of RNAse P RNAs that are highly conserved, and yet their alignment can still benefit substantially from unordered mappings. To achieve this, we removed from the set pairs of trees for which R e l S c o r e_{ R U }(T,S) < 0.5. We sorted the remaining pairs of trees according to the difference between the RU and RO modes (R e l S c o r e_{ R U }(T,S)  R e l S c o r e_{ R O }(T,S)).
When examining the top 50 alignments carefully, two distinct types of mapping patterns were observed among them, where each of the top 50 pairs belongs (with slight variations) to one of these two types (33 to Type 1 and 17 to Type 2). In the next paragraphs, we exemplify the highest ranking alignment of each of the two types (the first type is shown in Figure 1b). As mentioned before, the input for FRUUT alignments consisted only of sequence and secondary structure information. The tertiary structure (pseudoknot annotations) for the topranking alignments were only considered later, during the alignment interpretations.
The Hammerhead Ribozyme family
The Hammerhead Ribozyme, a derivative of several selfcleaving satellite virus RNAs [54], is a single strand RNA with autocatalytic capabilities. Naturally, it has a highly specific selfcleavage site at C17, operating via isomeration and rearrangement of the linkage phosphodiester bond [55]. Furthermore, Birikh et al.[18] suggested that the Hammerhead Ribozyme may undergo synthetic modification by removing the loop of one of its helical arms, thus making it catalytically active and able to cleave other RNAs. Hammerheads are therefore widely used in the biotechnological industry as biosensors, enzymes for specific RNAs and gene discovery agents.
The tertiary structure of the minimal version of the Hammerhead Ribozyme has been thoroughly studied by [56, 57]. It is composed of three base paired helices, entitled I, II, III, according to their position in the sequence. There are highly conserved sequences within the multiloop between stem I and II (containing the sequence box CUGA), between stem II and III (containing the box GAAA) and between stem III and I (containing the cleavage reaction site, C17). In nature, there are two types of Hammerheads: type I, where stem I starts at the 5’ and 3’ ends of the strand, and type III, where stem III starts at the 5’ and 3’ ends of the strand. As of today, no natural type II Hammerhead has been found.
Due to the fact that the two natural Hammerhead types share a conserved structure, though it is formed by two distinct transcripts, we chose this family to demonstrate the sensitivity gained by extending the classical rootedordered RNA tree alignment to more flexible variants. Our benchmark was based on 146 Hammerhead structures (86 of type I and 62 of type III), ranging across various organisms, taken from the RNA Strand database [50]. This yields a total of 10,585 pairs of trees, with tree sizes ranging from 17 to 48 nodes, averaging at 25.5.
A comparison of the RO and UO alignment modes
Mode ∖ Type  Same type  Different type 

RootedOrdered (RO)  0.55  0.19 
UnrootedOrdered (UO)  0.56  0.35 
Conclusions
In this paper we define the (unrooted unordered) Homeomorphic Subtree Alignment (HSA) problem, as well as additional three restricted variants of it: OrderedHSA, RootedHSA, and OrderedRootedHSA. We focus on the general (unrooted and unordered) HSA variant, and present a cubic time algorithm for it.
The new algorithm is implemented as a tool (which allows solving all four HSA variants) and is applied to pairwise alignments of RNA secondary structures. Preliminary experimental results over members of the RNAs P and Hammerhead families show that the tool can be used for detecting new structural similarities between RNA molecules, which could not be detected by the classical rootedordered tree alignment methods. In order to obtain an O(n^{3}) running time of an otherwise O(n^{5}) time algorithm, we extend the AllCavity Bipartite Matching problem, previously defined by Kao et al., to the AllPairsCavity problem, give an efficient algorithm for it, and show how to integrate it as a subroutine within our tree alignment algorithm.