Unrooted unordered homeomorphic subtree alignment of RNA trees

We generalize some current approaches for RNA tree alignment, which are traditionally confined to ordered rooted mappings, to also consider unordered unrooted mappings. We define the Homeomorphic Subtree Alignment problem (HSA), and present a new algorithm which applies to several modes, combining global or local, ordered or unordered, and rooted or unrooted tree alignments. Our algorithm generalizes previous algorithms that either solved the problem in an asymmetric manner, or were restricted to the rooted and/or ordered cases. Focusing here on the most general unrooted unordered case, we show that for input trees T and S, our algorithm has an O(nTnS + min(dT,dS)LTLS) time complexity, where nT,LT and dT are the number of nodes, the number of leaves, and the maximum node degree in T, respectively (satisfying dT ≤ LT ≤ nT), and similarly for nS,LS and dS with respect to the tree S. This improves the time complexity of previous algorithms for less general variants of the problem. In order to obtain this time bound for HSA, we developed new algorithms for a generalized variant of the Min-Cost Bipartite Matching problem (MCM), as well as to two derivatives of this problem, entitled All-Cavity-MCM and All-Pairs-Cavity-MCM. For two input sets of size n and m, where n ≤ m, MCM and both its cavity derivatives are solved in O(n3 + nm) time, without the usage of priority queues (e.g. Fibonacci heaps) or other complex data structures. This gives the first cubic time algorithm for All-Pairs-Cavity-MCM, and improves the running times of MCM and All-Cavity-MCM problems in the unbalanced case where n ≪ m. We implemented the algorithm (in all modes mentioned above) as a graphical software tool which computes and displays similarities between secondary structures of RNA given as input, and employed it to a preliminary experiment in which we ran all-against-all inter-family pairwise alignments of RNAse P and Hammerhead RNA family members, exposing new similarities which could not be detected by the traditional rooted ordered alignment approaches. The results demonstrate that our approach can be used to expose structural similarity between some RNAs with higher sensitivity than the traditional rooted ordered alignment approaches. Source code and web-interface for our tool can be found in http://www.cs.bgu.ac.il/\~negevcb/FRUUT.


Background
Secondary structure of RNA molecules serves important functions in many non-coding RNAs [1]. Functional constraints lead to evolutionary structural conservation that in many cases exceeds the level of sequence conservation. Thus, detecting similarity between RNA secondary structures is of major importance in functional RNA research [2,3]. http://www.almob.org/content/8/1 /13 Currently available bioinformatic softwares for RNA tree comparison usually apply rooted ordered tree alignment [11,12,14,15]. However, there are known evolutionary phenomena, such as segment insertions, translocations and reversals, which may result in a reordering or re-rooting of RNA structural elements [16]. These events can yield two similarly structured motifs, which are rooted differently (with respect to the standard "external loop" corresponding roots) [17], and/or permuted with respect to branching order. There are known examples of such unrooted/unordered RNA structural conservations [18,19] (Figure 1), therefore, it is possible that searching for unordered and unrooted structural similarities may reveal new relations between RNA molecules that were previously undetected.
The general Unordered Tree Edit Distance problem is MAX-SNP hard [20], promoting the study of constrained variants. The Subtree Isomorphism problem [21,22] is, given a pattern tree S and a text tree T, to find if there is some subtree T of T which is isomorphic to S. The Subtree Homeomorphism problem [23][24][25] is a variant of the former problem, where degree-2 nodes may be deleted from the selected subtree T of the text. Pinter et al. [26] efficiently solved the Subtree Homeomorphism problem, under the unrooted unordered settings. In addition, their algorithm assigns costs for alignments and finds an alignment of minimum cost, thus solving a weighted variant of the problem. The running time of the algorithm of [26] is O(n 2 S n T + n S n T log n T ), where n T and n S are the number of nodes in T and S, respectively (improved time complexities under some scoring scheme restrictions were also shown in [26]). The Constrained Edit Distance Between Unordered Labeled Trees problem, presented by Zhang in [27], is a restricted version of rooted unordered tree edit distance, which allows the edit operations of node relabeling, subtree pruning, and deletions of degree-2 nodes (where in the general edit distance variant, nodes of arbitrary degrees may be deleted). Zhang gave an O(n T n S (d T + d S )log(d T + d S )) time algorithm for this variant, where d T and d S are the maximum node degrees in T and S respectively. In this sense, the algorithm of [27] can be viewed as a symmetric (allowing deletions from both input trees), yet rooted variant of the algorithm of [26].
The essential approach in many tree comparison algorithms is a recursive rooted comparison of subtrees of the input trees, and finding the best combination of such sub-instance solutions to yield a solution for the input Unrooted and unordered RNA similarities. Nodes of the RNA trees are clustered to motifs marked by letters or numbers (stems, loops, and unpaired nucleotide intervals), where aligned motifs share the same annotation, and unaligned nodes are in gray. Nomenclature is according to [50]. (a) An unrooted alignment between Hammerhead RNAs: PDB_00693 (Type I, top) and RFA_00388 (Type III, bottom), with a computed and corrected, p-value of 2.1250 × 10 −7 . Arrows mark the roots chosen by our tool. The unrooted mode of FRUUT identifies the high similarity between the molecules, not being restricted to align external loops to each other. (b) An unordered alignment between RNAse P RNAs: ASE_00047 (left) and ASE_00334 (right), with a computed, and corrected, p-value of 1.190 × 10 −4 . In the unordered mode of FRUUT, the aligned motifs marked by 6 and 8 do not preserve order. In both molecules, pseudoknots occur between intervals annotated by 8 and 2, and between intervals annotated by 13 and 15 (see Figure 12), asserting the validity of the alignment. http://www.almob.org/content/8/1 /13 instance. The computation considers the cases in which either one of the roots is deleted, and the case where the roots are aligned to each other. In the latter case, it is required to find an optimal matching between the two children sets of the roots, where in the ordered variant such matching is restricted to maintain the child order (and may be computed by a reduction to sequence alignment), and in the unordered variant no such restriction holds (and thus an optimal matching can be found by a bipartite graph matching algorithm).

Our contribution
We propose an efficient algorithm for comparing unordered unrooted trees. Specifically, we define the Homeomorphic Subtree Alignment (HSA) problem, for which we give an O(n T n S + min(d T , d S )L T L S ) running time algorithm, where L T and L S are the numbers of leaves in the input trees T and S, respectively. Our approach can be viewed as a generalization of the two previous works [26,27], which relaxes the asymmetric "text-pattern" restriction of [26], as well as the rooting restriction of [27]. Both algorithms in [26,27], as well as the algorithm presented here, make use of subroutines for solving the Minimum Cost Bipartite Matching (MCM) problem, which dictate their time complexities. Here, we define the All-Pairs-Cavity-MCM problem, a generalization of the All-Cavity-MCM problem [28], and show how to integrate it into our tree alignment algorithm. For MCM and both its cavity derivatives, we use similar ideas to those applied in [29] to obtain O(n 3 + nm) time algorithms, where n and m are the sizes of the input sets, and n ≤ m. This gives the first cubic time algorithm for All-Pairs-Cavity-MCM, and improves the running times of MCM and All-Cavity-MCM problems in the unbalanced case where n m. The new MCM algorithms we developed allow our HSA algorithm to match, and even improve, the running times of the previous algorithms of [26,27] for less general variants of the HSA problem.
We implemented the algorithm (in all combininations of global or local, ordered or unordered, and rooted or unrooted modes) as a graphical software tool named FRUUT which computes and displays similarities between secondary structures of RNA given as input, and employed it to a preliminary experiment in which we ran all-against-all inter-family pairwise alignments of RNAse P and Hammerhead RNA family members, exposing new similarities which could not be detected by the traditional rooted ordered alignment approaches. The results demonstrate that our approach can be used to expose structural similarity between some RNAs with higher sensitivity than the traditional rooted ordered alignment approaches.

Tree notations
A tree T = (V , E) is an undirected, connected and acyclic graph. For a node v ∈ V , denote by N(v) the set of neighbors of v: N(v) = {u ∈ V : (v, u) ∈ E}. Denote by d v = |N(v)| the degree of v. A node v for which d v ≤ 1 is called a leaf in T. For simplicity, we henceforth use the notation v ∈ T and (v, u) ∈ T to imply that v is a node and (v, u) is an edge in a tree T. We use the notation (v → u) to indicate that the generally undirected edge (v, u) is being considered with respect to the specific direction from v to u. Denote by n T , L T , and d T the number of nodes, the number of leaves, and the maximum degree of a node in T, respectively.
A rooted tree is a tree in which one of the nodes is selected as its root. Denote by T v the tree T when rooted upon the node v ∈ T. An ordered tree is a tree T in which for each node v ∈ T, the elements in N(v) are ordered. In this work we consider unrooted unordered trees, rooted unordered trees, unrooted ordered trees, and rooted ordered trees. If no indication is given, we assume that the mentioned trees are unrooted and unordered.
Let T = (V , E) be a tree. A smoothing of a node v of degree 2 in T is obtained by removing v from T and connecting its two neighbors by an edge. A smoothing of T is a tree obtained by smoothing zero or more nodes in T. A subtree of T is a connected subgraph of T. For an edge (v → u) ∈ T, denote by T v u the rooted subtree of T induced by v as a root, and all nodes x in T such that the path between v and x in T starts with (v → u).
Since a tree T with n nodes contains n − 1 undirected edges, the total number of directed edges, and hence the number of rooted subtrees of the form T v u , is 2(n − 1). A pruning of a tree T with respect to an edge (v → u) is the removal from T of all nodes in T v u , except for v. Observe that every nonempty subtree of T is obtained by pruning T with respect to zero or more edges.

Min-Cost bipartite matching
Similarly to previous tree alignment and edit distance algorithms [26][27][28], the algorithm presented here makes use of min-cost bipartite matching algorithms as subroutines. Below, we define extended variants of the bipartite matching problem, in which the input groups may be ordered or unordered, and the score incorporates both standard element matching scoring terms, as well as penalties for unmatched elements. In addition, we define "cavity" variants of the problem, which are used for speeding up our tree alignment algorithms. http://www.almob.org/content/8/1/13

The (generalized) Min-Cost bipartite matching problem (MCM)
Let X and Y be two sets. A bipartite matching M between X and Y is a set of pairs M ⊆ X×Y , such that each element in X ∪ Y participates in at most one pair in M. If some element z ∈ X ∪ Y does not participate in any pair in M, we say that z is unmatched by M and denote z / ∈ M. A (generalized) matching cost function w for X and Y assigns costs w(x, y) for every (x, y) ∈ X × Y and costs w(z) for every z ∈ X ∪ Y . The cost of a bipartite matching M between X and Y with respect to w is given by A matching instance is a triplet (X, Y , w), where X and Y are two sets, and w is a matching cost function for X and Y. The Min-Cost Bipartite Matching problem (MCM) is, given a matching instance (X, Y , w), to find the minimum cost of a matching between X and Y with respect to w. Denote by MCM(X, Y , w) the solution of the MCM problem for the instance (X, Y , w), and call a matching whose cost equals to the solution optimal.
Numerous works study and suggest algorithms for the MCM problem, usually when no unmatched element costs are taken into account (see e.g., [30][31][32][33]). A standard approach is to reduce MCM to the Min-Cost Max-Flow problem, which yields an O(nm 2 +m 2 log m) algorithm for MCM, where n = min(|X|, |Y |) and m = max(|X|, |Y |).
In [27], an adapted reduction was presented which generalizes the problem definition to incorporate unmatched element costs and also runs in O(nm 2 + m 2 log m) time. An algorithm suggested by Dinitz in [29] solves the MCM problem in O(n 3 + nm) time, without reducing it to Min-Cost Max-Flow. Since that paper is in Russian and since it considers neither unmatched element costs nor the cavity variants of MCM (see the following section), we prefer to explicitly adapt the Min-Cost Max-Flow approach to our needs, using some variation of the ideas of [29].

Cavity MCM
The All-Cavity-MCM problem [28] is, given a matching instance (X, Y , w), to compute MCM(X, Y \ {y}, w) for all y ∈ Y . We define the All-Pairs-Cavity-MCM problem as, given a matching instance (X, Y , w), to compute MCM(X \ {x}, Y \ {y}, w) for all x ∈ X and y ∈ Y . Clearly, algorithms for both All-Cavity-MCM and All-Pairs-Cavity-MCM problems can be implemented by repeatedly running an algorithm for MCM on all required inputs. In [28], an algorithm for All-Cavity-MCM was proposed, which is more efficient than the naïve algorithm and retains the same cubic running time as the standard algorithm for MCM. To the best of our knowledge, no algorithm for All-Pairs-Cavity-MCM which improves upon the naïve algorithm (i.e. repeatedly executing the algorithm of [28] for All-Cavity-MCM over all x ∈ X) was previously described.
In Section ' Algorithms for bipartite matching problems' , we give new algorithms for (generalized, unordered) MCM, All-Cavity-MCM and All-Pairs-Cavity-MCM. The running times of these algorithms are summarized in the following theorem, whose correctness is shown in Section ' Algorithms for bipartite matching problems' . Theorem 1. Let (X, Y , w) be a matching instance, and denote n = min (|X| , |Y |), m = max (|X| , |Y |). Then, each one of the problems MCM, All-Cavity-MCM, and All-Pairs-Cavity-MCM over the instance (X, Y , w) can be solved in O(n 3 + nm) running time. Moreover, this may be done without the usage of priority queues (e.g. Fibonacci heaps [31]) or other complex data structures.

Ordered MCM variants
For an ordered set Z = z 0 , z 1 , . . . , z n−1 and an integer k, the k-rotation of Z is the reordering of its elements Let X and Y be two ordered sets, and M a bipartite matching between X and Y. Say that M preserves linear order if for every ( Say that M preserves cyclic order if there are some integers k, l such that M preserves linear order with respect to the rotated sets X k and Y l . It is possible to show, that defining M as preserving cyclic order if there exists an integer l such that M preserves linear order with respect to X and Y l , is equivalent to the definition above. The Linear Ordered MCM problem (Linear-MCM) and the Cyclic Ordered MCM problem (Cyclic-MCM) are defined similarly to MCM, with the restrictions that the considered matchings have to preserve linear or cyclic order, respectively. Linear-MCM is essentially equivalent to the Sequence Alignment problem, which can be solved in O(nm) running time [34]. Cyclic-MCM can be solved by taking the minimum cost solution among Linear-MCM solutions for all rotations of the smaller set in the input, in O(n 2 m). More efficient algorithms for Cyclic-MCM can be implied from [35][36][37].

Homeomorphic subtree alignment
An isomorphic alignment between two trees T = (V , E) and S = (V , E ) is a bijection A : V → V , such that for every pair of nodes v, u ∈ V we have that (v, u) ∈ E ⇔ (A(v), A(u)) ∈ E . A homeomorphic alignment between T and S is an isomorphic alignment between some smoothing T of T and some smoothing S of S, and a homeomorphic subtree alignment (HSA) between T and S is a homeomorphic alignment between some subtree T http://www.almob.org/content/8/1/13 of T and some subtree S of S ( Figure 2). For short, we Let T and S be two trees, and A a homeomorphic subtree alignment between them. Let T and S be the subtrees of T and S, and let T and S be the smoothings of T and S , respectively, such that A is an isomorphic alignment between T and S . Say that a node v ∈ T is be a cost associated with pruning the subtree T v u from T, smooth(v) be a cost associated with smoothing node v, and align(v, v ) be a cost associated with aligning node v against some node v . Definitions for S are similar. Denote by π(A) the set of pruned subtrees and by δ(A) the set of smoothed nodes of T and S, with respect to A. Define the alignment cost: Denote by HSA(T, S) the minimum alignment cost of an HSA between T and S, and call an HSA A optimal with respect to T and S if w(T, S, A) = HSA(T, S). The Min-Cost HSA problem is, given a pair of trees T and S, to compute HSA(T, S).

Remark 1.
We do not give the details of how to construct optimal alignments in this paper. As usual for dynamic programming algorithms, this may be done by a standard back-tracking procedure applied on the computed dynamic programming tables.

Rooted and ordered alignments
In addition to the general Min-Cost HSA problem, we also consider special cases of the problem in which the two input trees are rooted and/or ordered, and the alignment is required to satisfy certain restrictions with respect to these additional properties. For two rooted trees T v and S v , say that A is a rooted HSA between T v and S v if A is an HSA between T and S, and The definition of ordered HSA requires some additional formalism, related to bipartite matchings.
Let A be an HSA between the trees T and S. For an edge (v → u) ∈ T, say that u is a relevant neighbor of The correctness of the observation can be asserted from the fact that A is an isomorphic alignment between a smoothed subtree of T that contains v, x, and y, and a smoothed subtree of S that contains v , x , and y .  if T and S are ordered trees, and for every (v, v ) ∈ A, the corresponding bipartite matching M A v,v is cyclically ordered. Now, we can define three additional variants of the HSA problem. Let T v and S v be rooted and ordered trees. Denote by Ordered-HSA(T, S), Rooted-HSA(T v , S v ) and Ordered-Rooted-HSA(T v , S v ) the minimum costs of an ordered HSA, a rooted HSA, and an ordered and rooted HSA between T v and S v , respectively. Define the corresponding variants of the Min-Cost HSA problem whose goals are to compute these values.

Algorithm for homeomorphic subtree alignment
In this section we describe a basic algorithm for HSA for its unordered unrooted variant (though it is adequate for the other variants as well with some simple modifications).

Recursive computation
Let A be an HSA between T and S. Let (v, v ) ∈ A, and M A v,v the corresponding bipartite matching between N(v) and N(v ), as defined in Section 'Homeomorphic subtree alignment' . Note that A can be viewed as a rooted alignment between T v and S v , which is the union of a set of rooted sub-alignments A v u between rooted subtree pairs of the form Figure 3). The alignment cost can therefore be obtained by summing the costs of these sub-alignments, which cover all scoring terms implied by matching nodes, smoothing nodes, and pruning subtrees by the corresponding sub-alignments, and the additional pruning costs of pruned subtrees of the forms T v u and S v u (where u, u are unmatched by M A v,v ). Note that the pair (v, v ) belongs by definition to each of the sub-alignments A v u . In order to avoid multiple additions of the term align(v, v ) when summing subalignment costs, define can then be written as follows: Call a rooted alignment non-trivial if it aligns at least one additional pair of nodes besides the roots. Note that every rooted sub-alignment A v u is non-trivial (since u and u are relevant neighbors of v and v ). Denote by Rooted-HSA −r T v u , S v u the minimum w −r cost of a nontrivial rooted alignment between T v u and S v u . http://www.almob.org/content/8/1/13 Clearly, if A is an optimal rooted HSA between T v and S v , then for each (u, u it is possible to produce a rooted alignment with a better cost than A for T v and S v ). Define the bipartite matching instance Observe that the right-hand side of Equation 4 equals the cost of the bipartite matching M A v,v for the matching instance Figure 3b). In addition, every bipartite matching between N(v) and N(v ) corresponds to some valid rooted HSA between T v and S v , so that the matching and alignment costs are equal.
Therefore, a minimum cost bipartite matching induces a minimum cost alignment, and we get that Assuming the non-degenerate case where an optimal HSA between T and S contains at least one pair (v, v ), we can compute the cost of an optimal HSA by solving Rooted-HSA with respect to all possible root pairs, and taking the pair which induces a minimum cost as a solution for the unrooted case: In order to obtain cost functions of the form w v,v for the computation of Equation 5, we need to compute solutions of the form Rooted-HSA −r (T v u , S v u ) for subinstances of the input. When u and u are leaves, the only non-trivial rooted alignment between T v u and S v u contains both pairs (v, v ) and (u, u ), and therefore we get that Rooted- ]. By definition of the Rooted-HSA −r T v u , S v u score, it corresponds to the score of the best non-trivial alignment between T v u and S v u , minus the root alignment cost term align (v, v ). We may cover the set of all possible non-trivial rooted alignments between T v u and S v u by three sets: (a) alignments in which u is unmatched, (b) alignments in which u is unmatched, and (c) alignments in which both u and u are matched (note that there might be an intersection between groups (a) and (b)). We show that each one of the terms I, II, and III in the righthand side of Equation 7 computes the minimum cost of an alignment in each one of the groups (a), (b), and (c), respectively, and therefore the minimum among all these terms gives the correct value Rooted-HSA −r T v u , S v u . We start by showing that term I computes the minimum cost of an alignment in group (a). Let A be an alignment in group (a) of minimum cost. Since A is nontrivial, and u is smoothed in A, u has exactly one additional relevant neighbor x * besides v. It therefore follows that all nodes in T v u except for v which are matched by A belong to the subtree T u x * is a non-trivial rooted HSA between T u x * and S v u . Therefore, the cost of A is obtained by the summation of pruning costs of all subtrees T u y for y ∈ N(u) \ {v, x * }, the cost of smoothing u, and the cost w −r T u x * , S v u , A u x * (which counts for all cost terms corresponding to node matchings, node smoothings, and subtree prunings, implied by the sub-alignment A u it is possible to improve the cost of A by replacing the sub-alignment A u x * with an optimal alignment for the corresponding sub-instance). Thus, The computation of term I in the right-hand side of the equation: The term considers the best alignment score under the assumption that u is smoothed. In this case, the score is obtained by taking the smoothing cost of u, and summing it, for some x ∈ N(u) \ {v}, with the alignment score between T u x and S v u and the pruning cost of all subtrees T u y for y ∈ N(u) \ {v, x}. Vertex x is chosen to be the neighbor of u that induces a minimum cost with respect to this computation. The computation of term II is symmetric. (c) The computation of term III in the right-hand side of the equation: This term considers the case were neither u nor u are smoothed, and therefore these two nodes are aligned to each other. In this case, the score is computed similarly to the computation in Equation 5, with respect to the sets N(u) \ {v} and N(u ) \ {v }.
is the w -r cost of a possible non-trivial alignment between T v u and S v u in which u is smoothed (where the subtree T u x is optimally aligned to S v u ), we get that x * satisfies that hence the correctness of term I. The proof that term II of the equation computes the minimum cost of an alignment in group (b) is symmetric to the proof of term I. As for term III, note that the case where u and u are matched, but not to each other, implies a contradiction to Observation 1. Therefore, for any alignment in group (c), and in particular for such an alignment A of minimum cost, we have that (u, u ) ∈ A.
In this case, the optimal alignment cost is obtained immediately from applying Equation 5 for this specific subinstance, as formulated by term III in the equation.
The computation of w v,v u,u requires the computation of scores of the form Rooted- It can be shown that all Rooted-HSA −r solutions required for the computation of the right-hand side of the equation are for strictly smaller sub-instances than the sub-instance appearing in the left-hand side, thus the termination of the recursive computation is guaranteed. Equation 7 can be efficiently computed using Dynamic Programming (DP), as summarized by Algorithm 1 below. In Section 'Time complexity of Algorithm 1' , we show that a straightforward implementation of Algorithm 1 obtains the running time of O(min(d T , d S )n T n S + min (d T , d S ) 3 L T L S ). For some trees T, S with n T , L T , d T , http://www.almob.org/content/8/1/13 where n T and n S are the numbers of nodes in T and S, respectively. The rows of H correspond to subtrees T v u and the columns correspond to subtrees S v u , ordered with nondecreasing sizes; 2 Fill the entries of H row by row with increasing row indices, each row column by column, with increasing column indices. Each entry is filled according to Equation 7 with respect to the pair of subtrees corresponding to its row and column (all solutions for relevant instances in the right-hand side of the equation are already computed and stored in H, due to the order-by-size organization of the subtrees along the rows/columns of the matrix); to return HSA(T, S); n S , L S , d S = (n) (e.g. "star" trees), this implies an O(n 5 ) running time. In Section 'Improving the time complexity' , we show how to improve this time bound and obtain a cubic time algorithm for the problem.
We note that Algorithm 1 generalizes to also solve the ordered unrooted, unordered rooted, and ordered rooted variants of Min-Cost HSA in polynomial time. In case a rooted alignment is sought, the algorithm can compute Equation 5 in line 3 only with respect to the two roots, and avoid the computation of Equation 6. In case an ordered alignment is sought, the MCM application in Equation 5 can be replaced by Cyclic-MCM, and in Equation 7 MCM can be replaced by Linear-MCM, similarly to [27]. Traditionally, ordered matchings are implemented via reduction to sequence alignment [7,26,38]. Similarly to the improvement described in the next section for the unordered tree alignment algorithm, it seems that fast incremental/decremental versions of ordered matchings [35][36][37] can be integrated into the ordered variant of our algorithm to improve its time complexity. However, the detailed description of these techniques are beyond the scope of this paper.

Time complexity of Algorithm 1
We first analyze the time complexity of a straightforward implementation of the algorithm. Then, in Section 'Improving the time complexity' , we show how this time complexity can be reduced by applying cavity matching subroutines.
Let index(T v u ) and index(S v u ) denote the row and column indices of T v u and S v u in H, respectively. Let u ∈ T and u ∈ S be a pair of nodes. The set of subtrees T v u for v ∈ N(u) corresponds to a subset of rows in H. Similarly, the set of subtrees S v u for v ∈ N(u ) corresponds to a subset of columns in H, and thus all solutions of the form Rooted- Let H u,u denote this sub-matrix, and let v i and v j denote nodes in N(u) and N(u ) such that is the second row, etc.). Note that H can be viewed as a union of sub-matrices of the form H u,u , where each entry in H is covered by exactly one sub-matrix H u,u for some u ∈ T, u ∈ S.
The following observation identifies special properties of the second column and second row in H u,u , which are exploited for the efficient computation of Algorithm 1. Observe that for every 1 , we get the following observation ( Figure 5): In order to focus on the bottleneck expression in the running time analysis of the algorithm, we first summarize the complexity of its secondary computations in the following lemma: The subtree T v1 u (bounded by a solid red line) contains T u v2 and T u v3 as subtrees (bounded by dashed and dotted blue lines, respectively), and therefore The subtree T v2 u (bounded by a solid red line) contains T u v1 as a subtree (bounded by a dashed blue line), and therefore The DP matrix H. The rows of the matrix correspond to subtrees of T, sorted from top to bottom with non-decreasing tree size. Rows corresponding to subtrees of the form T vi u are in solid red, and rows corresponding to subtrees of the form T u vi are in waved-blue. All waved-blue rows have smaller indices than the row corresponding to T v2 u (circled with a dashed green line).
prune(T u x ), and observe that Thus, for v = x * , term I of Equation 7 can be written as operations are required for computing the term for row 1 and the row i such that v i = x * , and therefore the total http://www.almob.org/content/8/1/13 (a) (b) Figure 6 The incorporation of cavity matching subroutines in the DP algorithm.In (a), the DP matrix H is illustrated, where the solid red entries correspond to entries in the sub-matrix H u,u for some u ∈ T and u ∈ S. In this example, vj is a subtree of T v1 u , the rows corresponding to these subtrees have smaller indices than the row of T v1 u , and so the required solutions are already computed and stored in H (waved entries marked with a dashed green circle, in the same column above the computed entry). Similarly, for computing term II, there is a need to examine solutions Proof. It is well known that the number of undirected edges in T is n T −1. Since each undirected edge (v, u) contributes 1 unit to the degrees d v and d u of its endpoints, we get that When T contains a single node, L T = 1, v∈T dv≥3 d v = 0,and the inequality follows. Assume by induction the correctness of the inequality for all trees with less than n > 1 nodes, and let T be a tree with n nodes. Let (x, y) ∈ T be an edge such that y is leaf in T. The subtree T of T containing all nodes in T except for y (and all edges except for (x, y)) is of size n − 1, and from the inductive assumption then L T = L T +1 (due to the addition of y as a leaf ). Here, it is possible that d x = 2 and d x = 3, which would contribute 3 to the degree summation for nodes with degree ≥ 3, while if d x > 2 the increment in the degree of x would contribute 1 to this summation. Therefore, v∈T dv≥3  Flow'). Therefore, for a given pair of nodes u ∈ T and u ∈ S, and all neighbor pairs v ∈ N(u), v ∈ N(u ), the time required for MCM computations due to term III is In order to sum the expression above for all pairs u ∈ T, u ∈ S, we sum indempendetly the expressions Thus, the overall MCM computation time due to term III in Equation Equation 5 is computed once for each v ∈ T and v ∈ S in line 3, where the corresponding matching instance's group sizes are d v and d v . Similarly as above, it can be shown that summing the total number of operations in the implied MCM computations yields Therefore, the overall running time of Algorithm 1 is dictated by the the bottleneck expression

Improving the time complexity
The time analysis in the previous section shows that all operations in Algorithm 1, besides MCM computations due to term III of Equation 7, are conducted in O(n T n S + min(d T , d S )L T L S ) time. In this section, we show how to improve the time complexity of Algorithm 1, by incorporating cavity matching subroutines to speed up the MCM computations due to term III of Equation 7.
Let u ∈ T, u ∈ S, and consider the computation of term III of Equation 7 for instances in the first row in H u,u . Note that for the entries in this row, the first group in the bipartite matching instance is fixed and equals to N(u) \ {v 1 }, whereas for each column j, the second group in the matching instance is N(u ) \ {v j }. The first entry in this row is computed by solving the MCM problem directly for the matching instance (N(u) Figures 6a,6b). Based on Observation 2, upon reaching the second entry in this row, all are computed and stored in H. Therefore, the All-Cavity-MCM problem can be solved for the matching instance We would like to emphasize that replacing n T and n S by L T and L S or d T and d S in the time complexity term of Theorem 2 is due to a refined analysis rather than some algorithmic improvement, and such an analysis can also be applied to refine the time complexities of some of the previous algorithms. While in many tree comparison applications typical input trees T are characterized http://www.almob.org/content/8/1/13 by having the number of leaves at the same order of magnitude as the number of nodes (i.e. L T = (n T )), and sometimes it is true that d T = (n T ) (e.g. in star-like trees), there are cases where L T , d T n T . Specifically, it can be asserted that removing (by node smoothing) or adding (by subdividing edges) degree-2 nodes to a tree do not change its maximum degree nor the number of its leaves, and thus trees with a high number of degree-2 nodes have a low maximum degree and a small number of leaves with respect to the total number of nodes. As can be shown in our examples (see Section 'RNA tree representation'), typical RNA trees in our application do have a relatively high number of degree-2 nodes, and thus gain from the fact that the cubic term in the time complexity of the algorithm depends on the maximum node degrees and the number of leaves, rather than the number of nodes in the input trees.

Algorithms for bipartite matching problems
In this section we show efficient algorithms for the MCM, All-Cavity-MCM, and All-Pairs-Cavity-MCM problems defined in Section 'Min-Cost bipartite matching' . These algorithms are based on a reduction to the Min-Cost Max-Flow problem. Since the Min-Cost Max-Flow problem is well known to computer scientists, we only provide a brief discussion of essential properties required for describing our algorithms, while omitting some of the details. For a definition of the Min-Cost Max-Flow problem, related theorems, and a thorough discussion of its properties, please refer to other works, e.g. [30,39,40].
Throughout this section, let (X, Y , w) be a matching instance. Assume w.l.o.g. that |X| ≤ |Y |, and denote |X| = n, |Y | = m. When the context is clear, instead of writing a "matching M for (X, Y , w)", we simply write a "matching M", and similarly when writing "M is an optimal matching" we mean that M is optimal with respect to (X, Y , w).

Efficient algorithm for MCM
Next, we refine the reduction of MCM to Min-Cost Max-Flow given at [27], in order to to improve the running time of the algorithm. Our modification reduces the O(nm 2 + m 2 log m) running time of the algorithm of [27] to O(n 3 + nm), using a variant of the approach of [29].

Reducing MCM to Min-Cost Max-Flow
For x ∈ X and y ∈ Y , define the effective matching cost w e (x, y) of the pair (x, y) to be w e (x, y) = w(x, y) − w(x) − w(y). The effective matching cost w e (x, y) is the cost change due to the addition of the pair (x, y) into a matching in which both x and y are unmatched.
For each x ∈ X, define a subset Y x ⊆ Y of n smallest cost matches for x in Y, with respect to the effective matching costs. Define Y X = x∈X Y x .

Lemma 4.
There exists an optimal matching M * such that M * ⊆ X × Y X .
Proof. Let M * be an optimal matching. Since each element in X participates in at most one pair in M * , M * contains at most n pairs, and so there are at most http://www.almob.org/content/8/1/13 n distinct elements y ∈ Y such that y ∈ M. If M * contains a pair (x, y) such that y / ∈ Y X , then y / ∈ Y x , and in particular there must be some y ∈ Y x such that y / ∈ M (since |Y x | = n). By definition of Y x , w e (x, y ) ≤ w e (x, y), and the matching M * obtained by removing the pair (x, y) from M * and adding the pair (x, y ) satisfies w(M * ) = w(M * ) − w e (x, y) + w e (x, y ) ≤ w(M * ). Since M * is optimal, M * is also optimal. It is possible to continue and apply such modifications until getting an optimal matching M * which contains only pairs (x, y) such that y ∈ Y X , as required.
Next, we describe how to reduce MCM into the Min-Cost Max-Flow problem. The reduction builds the cost flow network N = (G, s, t, c), where G is the network's weighted graph, s and t are the source and sink nodes respectively, and c is the edge capacity function. The graph G = (V , E) is defined as follows (Figure 8a): where s, t, and φ are unique nodes different from all nodes in X and Y X . Note that we use the same notations for elements in X and Y X and their corresponding nodes in V, where ambiguity can be resolved by the context. By definition, |Y X | ≤ n 2 , and so |V | = O(n 2 ). The capacity function c assigns unit capacities to all edges in E, with the exception that c(φ, t) = n.
For a maximum flow f in N, define the set of pairs M f = (x, y) : x ∈ X, y ∈ Y X , f (x, y) = 1 .From flow conservation constraints, it is simple to assert that every z ∈ X ∪ Y participates in at most one pair in M f , and thus M f is a valid matching, and in addition, The corresponding return path P = t → φ → x 2 → y 1 → x 1 → s from t to s is obtained by removing from P the first edge y 2 → t (in dotted red), and adding at the end the edge x 1 → s (in dotted green). The flow f obtained by returning one flow unit from t to s along P is an optimal flow in the network N x1,y2 , corresponding to the sub-instance (X \ {x 1 }, Y \ {y 2 }, w) (see proof of Lemma 10). The corresponding matching for this flow is M f = {(x 2 , y 1 ), (x 3 , y 4 )}, which is an optimal matching for (X \ {x 1 }, Y \ {y 2 }, w).
Lemma 5 shows the relation between a cost of a matching M ⊆ X × Y X and its corresponding flow f M .

Now,
Call a minimum-cost maximum-flow in N an optimal flow with respect to N. Proposition 1 concludes the relation between optimal flows in N and optimal matchings of (X, Y , w).
On the other hand, from Lemma 4 there is an optimal matching M * such that M * ⊆ X × Y X , and

Efficient computation and time complexity
For an efficient computation of an optimal flow in N, we first describe a modification that can be applied to residual graphs and allows a computational speed up.
Let f be a flow in N, and y ∈ Y X . Due to the flow conservation constraints, either f does not transmit any flow unit through y, or f transmits one flow unit that enters y over some ingoing edge (x, y) (where y ∈ Y x ), and leaves y by the unique outgoing edge (y, t) in E.In both cases, and since all edges adjacent to y have a unit capacity, the residual graph G f = (V , E f ) contains a single outgoing edge from y: the original edge (y, t) in the former case, or the residual edge (y, x) in the latter case (where the edge (y, t) is reversed in G f to become an ingoing edge (t, y)). Denote by v y the unique node in X ∪ {t} into which there is an outgoing edge from y in E f . In addition, observe that the number of ingoing edges into y in G f equals to the number of ingoing edges into y in G, implying the following observation: there is no path from v to y in G f is resolved similarly as above.
Using Lemmas 6 and 7, we turn to analyze the time complexity for the reduction we described for solving MCM (X, Y , w).
The computation of each subset Y x can be done in O(m) time, using the algorithm of [41], and so Y X can be computed in O(nm) time. The network N contains O(n 2 ) nodes and edges, and thus can be constructed in O(n 2 ) time.
An optimal flow in N can be found with the algorithm of Edmonds and Karp [30]. Essentially, this is an iterative algorithm that maintains a valid flow f in N. Starting with a zero-flow, at each iteration the algorithm increases the flow by adding the bottleneck capacity flow along a minimum cost path from s to t in the residual graph G f . In our specific reduction, all edges leaving s have a unit capacity, and thus each augmentation path increases the size of the flow by one unit. As the size of the maximum flow in N is n (asserted by the minimum cut ({s}, V \ {s}) of capacity n in N), the algorithm performs n iterations.
The time required for each iteration is dictated by the time required for computing minimum cost paths from s into every node in the residual graph G f . Such paths can be computed efficiently for weighted graphs with nonnegative edge costs using Dijkstra's algorithm [42]. In order to render all edge costs in the residual graphs to nonnegative, the algorithm of [30] applies a node labeling function. Such a function assigns to each node v ∈ V a label l(v), and shortest paths in the residual network are computed with respect to the modified edge costs cost (u, v) = cost(u, v) + l(u) − l(v). It is known that a path from u to v in G f is of minimum cost with respect to the original cost function if and only if it is of minimum cost with respect to the modified cost function, and that setting l(v) = d f s,v for every v ∈ V , where f is the flow at the beginning of the i-th iteration, guarantees nonnegative modified edge costs in the residual graph at the beginning of the (i + 1)-th iteration (after f was augmented and the residual graph was modified, see [30]). Thus, given that minimum cost paths are computed from s to all nodes in the residual graph, label maintenance at each iteration is done in linear time with respect to the number of nodes in the network.
In order to compute minimum cost paths from s to all nodes in G f , we first construct the corresponding grapĥ G f as described above (under the assumption that edge costs are rendered to be nonnegative). It is simple to observe that this construction can be implemented in O(n 2 ) time. By construction, |V | = |X ∪ {s, t, φ}| = O(n). Using Dijkstra's algorithm [42],d f s,v can be computed for all v ∈V in O |V | 2 = O(n 2 ) time. Note that we are referring to the original Dijkstra algorithm (published in 1959) rather than to its commonly used improvement due to Fredman and Tarjan [31]. While the latter improvement reduces the running time of the algorithm for non-dense graphs, it involves the usage of a relatively sophisticated data structure (Fibonacci heap). In our case, the simple implementation described in [42] does not exceed the O(n 2 ) running time required for the construction ofĜ f , without the usage of any kind of priority queue or other complex data structures.
Given the valuesd A special attention is required though for the initialization of the algorithm of [30]. There, it was assumed that all edges in the input network N are of nonnegative costs, while our described reduction from MCM does not sustain this property. Nevertheless, this may be overcome by setting the initial labels of all nodes v ∈ V to the corresponding minimum path costs d s,v in the graph G of N. Since G is acyclic, these initial costs can be computed in O(|V | + |E|) = O(n 2 ) time using a simple topological traversal (see e.g., [43]).
In all, we get that the total running time required for constructing a flow network N corresponding to the matching instance (X, Y , w) and finding an optimal flow in N is O(n 3 +nm). Computing w X,Y can be done in O(n+m) time, and applying Proposition 1, MCM(X, Y , w) can be http://www.almob.org/content/8/1/13 computed in O(n 3 + nm) time, proving the statement in Theorem 1 regarding MCM.

Additional practical improvements using sparsification
It is possible to further improve the running time of the reduction in practice, by reducing the number of edges in the network, and computing Min-Cost Flow instead of Min-Cost Max-Flow.
We may assume without lost of generality that an optimal matching M * between X and Y contains no pairs (x, y) such that w(x, y) − w(x) − w(y) ≥ 0 (since removing such an edge from the matching can only decrease its cost). Consequentially, there is an optimal flow in N that does not transmit flow through edges (x, y) for which w (x, y) = w(x, y) − w(x) − w(y) ≥ 0. Therefore, the cost of an optimal flow in the network N , obtained by removing such edges from N, equals to the cost of an optimal flow in N. Note that in the case of RNA tree alignment, removed edges correspond to pairs of subtrees which are sufficiently dissimilar so that their pruning cost is lower than their alignment cost. For a reasonable cost function, it is expected that many of the subtree-pairs of the input would sustain this condition (even in the extreme case of aligning a tree to itself ), as they represent different parts of complex molecules.
The second improvement exploits a property of the Min-Cost Max-Flow algorithm of [30], for which it was shown that the series of computed augmentation paths are of non-decreasing costs (Theorem 5 in [30]). Note that as long as there is an augmentation path in the network, there is such a path with cost zero (since there is some node x for which (s, x) is not saturated, and thus the path s → x → φ → t is an augmentation path of cost zero). Therefore, once the minimum cost of an augmentation path is zero, all consecutive augmentation paths would have a zero cost, and increasing the flow along these paths will not change the overall cost of the flow. It is thus possible to stop the flow algorithm upon the first iteration at which the minimum augmentation path cost is zero, where the flow at this stage is a minimum-cost flow which is not necessarily of maximum size (while it has the same cost as a min-cost max-flow). It is possible to show that increasing the flow over a negative cost augmentation path necessarily increases the size of the corresponding matching (otherwise it implies a negative cost cycle in the residual network), therefore the number of iterations in the above described Min-Cost Flow algorithm is |M * | ≤ n (where M * is an optimal matching of minimum size), and so the total running time of the algorithm is O(|M * |n 2 + nm), which is faster than O(n 3 + nm) in the case where |M * | is small. Again, for the RNA tree alignment application, in most cases it is expected that the matching would be computed with respect to dissimilar sets of subtrees, for which it is expected to have a small optimal matching size.

Efficient algorithms for All-Cavity-MCM and All-Pairs-Cavity-MCM
We now present Algorithm 2, which solves All-Pairs-Cavity-MCM. Similarly to Kao et al. [28], we show that solutions for instances of the form (X \ {x}, Y \ {y}, w) correspond to certain shortest paths in the residual flow network obtained when solving the instance (X, Y , w). This observation allows to solve both All-Cavity-MCM and All-Pairs-Cavity-MCM at the same time complexity O(n 3 + nm) as that of the algorithm for MCM presented in the previous section.

Algorithm 2:
All-Pairs-Cavity-MCM (X, Y , w) 1 Compute the set Y X and construct the flow network N corresponding to (X, Y , w), as explained in Section 'Reducing MCM to Min-Cost Max-Flow'; 2 Compute an optimal flow f in N; 3 Compute for every x ∈ X and every z ∈ Y X ∪ {t} the value d f z,x (where t is the target node in N); 4 For every x ∈ X and every y ∈ Y , set and report In order to prove the algorithm's correctness, we first formulate an auxiliary lemma. Lemma 8. Let G = (V , E) be a directed graph with a cost function cost(·) over its edges and no negative cost cycle. Let P be a minimum cost path from a node y to a node x in G. Let G = (V , E ) be the graph obtained from G by adding, for every edge (u, v) ∈ P, the reversed edge (v, u) (if not already in G), and setting its cost to cost(v, u) = −cost(u, v). Then, G contains no negative cost cycle.
Proof. Following [30], we call a flow extreme if it is of minimum cost among all flows of the same size. By [30] (Theorem 3), a flow is extreme if and only if the corresponding residual network contains no negative cost cycle. In addition, a flow obtained by increasing an extreme flow along a minimum cost augmentation path from the source to the sink of the network, is also extreme (see [44], page 121).
Let N be the flow network defined by the graph G, the source node y, the sink node x, the cost function cost(·), and the capacity function c which assigns two capacity units to all edges in G. Since there is no negative cost cycle http://www.almob.org/content/8 /1/13 in N, the zero flow is extreme with respect to zero-size flows. Let f be the unit flow along P in N, and note that the residual graph G f is identical to G by definition. Since P is a minimum cost path from y to x in G, f is extreme with respect to all flows of size 1. Thus, G f contains no negative cost cycle.
In the remainder of this section, let N be the network corresponding to the matching instance (X, Y , w) and f an optimal flow in N. Denote by N x,y the network whose graph G x,y is obtained by excluding from G the node x, the node y in case that y ∈ Y X , and all edges adjacent to these nodes. The edge costs in G x,y are inherited from G, and the maximum flow size in N x,y is n − 1. Note that any flow in N which does not transmit flow units through x and through y (when y ∈ Y X ), is also a valid flow in N x,y . In addition, observe that G x,y contains the graph corresponding to the sub-instance (X \ {x}, Y \ {y}, w) (it is possible that G x,y contains additional nodes in Y \ {y} excluded from (Y \ {y}) X\{x} and additional edges, since x ∈ X \ {x} may have n adjacent edges of the form (x , y ) in G x,y , while in the graph corresponding to (X \ {x}, Y \ {y}, w) is has (n − 1) such adjacent edges). Nevertheless, it can be asserted from the lemmas in Section 'Reducing MCM to Min-Cost Max-Flow' , Proposition 1, and their proofs, that for an optimal flow f in N x,y , MCM(X \ {x}, Y \ {y}, w) = cost(f ) + w X\{x},Y \{y} . The correctness of Algorithm 2 is implied by the following analysis.

Lemma 9.
For every x ∈ X and every y ∈ Y X , there is a path from y to x in G f .
Proof. If f (y, t) = 0, then G f contains the edge (y, t), and in particular there is a path from y to t in G f . Else, f (y, t) = 1, and from flow conservations constraints there is some x ∈ X such that f (x , y) = 1. In this case, the path y → x → φ → t is a path from y to t in G f (again, the existence of all edges of this path in G f can be asserted from the flow conservation constraints). Similarly, let z ∈ Y X ∪ {φ} be the node such that f (x, z) = 1, and note that the path t → z → x is a path from t to x in G f . The concatenation of a path from y to t and a path from t to x in G f proves that there is a path from y to x in G f . Now, we show the relation between MCM solutions for instances (X, Y , w) and solutions for sub-instances of the form (X \ {x}, Y \ {y}, w). As defined in Algorithm 2, let Proof. In order to prove the lemma, we construct a flow f in N x,y such that w(f ) = cost(f ) + d y,x , and prove that f is optimal with respect to N x,y .
Consider the case where y ∈ Y X , and so d y,x = d f y,x . From Lemma 9, there exists a path from y to x in the residual graph G f . Let P be such a path of minimum cost, and define the weighted graph G that includes all nodes and edges in G f with the same edge costs, and in addition, for each (u, v) ∈ P, G contains the reversed edge (v, u), whose cost is set to cost(v, u) = −cost (u, v). Since G f contains no negative cost cycle (due to the optimality of f ), Lemma 8 indicates that G contains no negative cost cycle.
Define the flow f which is obtained from f by returning one flow unit from t to s along the path P as follows: if P starts with (y, t), then P is obtained by removing (y, t) from P and concatenating (x, s) at its end (see Figure 8d). Else, P is obtained by concatenating (t, y), P, and (x, s). Observe that f is a valid flow in N x,y (since f passes no flow units through x and y), its cost is cost(f ) = cost(f ) + cost(P ) = cost(f ) + d y,x (since edges adjacent to t and s have zero costs, and so the costs of P and P are equal), and its size is |f | = |f | − 1 = n − 1. Therefore, f is a maximum flow in N x,y . Also, observe that the residual graph G f x,y is a sub-graph of G (since the only edges in P which are not in P are adjacent to either x or y, and therefore their reversed edges, which are included in G f , are excluded from G f x,y ), and therefore it contains no negative cycle. Thus, f is an optimal flow in N x,y of cost cost(f ) + d y,x .
For the case where y ∈ Y \ Y X (and d y,x = d f t,x ), let P be a minimum cost path from t to x in G f , and P the path from t to s in G f obtained by concatenating the residual edge (x, s) at the end of P. Similarly as above, it can be shown that the flow f obtained by returning one flow unit from t to s along P is an optimal flow in N x,y of cost cost(f )+d y,x , and the lemma follows.
From Lemma 10, for every x ∈ X and every y ∈ Y , the value cost(f ) + d y,x is the cost of an optimal flow in N x,y . From Proposition 1, adding to this cost the value w X\{x},Y \{y} = w X,Y − w(x) − w(y) gives the minimum matching cost for the matching instance (X\{x}, Y \{y}, w), and thus the correctness of Algorithm 2.
In order to use the algorithm for solving All-Cavity-MCM, we apply the following simple modification. Say we are interested in finding solutions to sub-instances of the form (X, Y \ {y}, w) for every y ∈ Y . We replace the set X by the set X = X ∪ {x φ } for some new element x φ / ∈ X (and arbitrarily define w(x φ ) = 0 and w(x φ , y) = 0 for every y ∈ Y ). We then solve All-Pairs-Cavity-MCM for the instance (X , Y , w), and return MCM(X \ {x φ }, Y \ http://www.almob.org/content/8/1/13

{y}) as the solution for the instance MCM(X, Y \ {y}).
Finding solutions for all instances (X \ {x}, Y , w) is done symmetrically.

Time complexity of Algorithm 2
From the analysis in Section 'Efficient computation and time complexity' , lines 1 and 2 of the algorithm can be computed in O(n 3 + nm) running time. After finding an optimal flow f in N, it is possible to compute for some x ∈ X the values d f v,x for every v ∈ V in O(n 2 ), as explained in Section 'Efficient computation and time complexity' (applying Lemma 7 and using a reversed variant of Dijkstra's algorithm [42]). Thus, the computation of such values for all x ∈ X, performed in line 3 of the algorithm, can be implemented in O(n 3 ) running time.
Given these values, the execution of line 4 takes O(nm) time. In all, the running time of Algorithm 2 is O(n 3 + nm), proving the statements regarding All-Cavity-MCM and All-Pairs-Cavity-MCM in Theorem 1.

Implementation details
Our algorithm is implemented (in Java) as a tool called FRUUT (Fast RNA Unordered Unrooted Tree mapper). The RNA tree representation is described in Section 'RNA tree representation' and the scoring scheme employed by FRUUT is described in Section ' Alignment cost function' . FRUUT allows the user to select any alignment mode combination (rooted / unrooted, ordered / unordered, local / global) and to compute optimal pairwise alignments of RNA trees with an attached significance scoring model described in Section 'p-Value computation' . We also provide an interactive PHP web-server for running FRUUT in our website (RNA plots are are generated by the Vienna Package [2]).

RNA tree representation
There are several previous models for representing a pseudoknot-free RNA secondary structure (example in Figure 9a) as an ordered, rooted tree [3,9,10,13,[45][46][47][48]. For example, [45] represented the RNA structure as a tree, where nodes correspond to loop elements of the secondary structure (hairpin loops, bulges, internal loops or multi-loops) and the edges correspond to base-paired (stem) regions. Another, different representation is given in Zhang's work [9]: the nodes of the tree represent either unpaired bases (leaves) or paired bases (internal nodes). Each node is labeled with a base or a pair of bases, respectively. There are two kinds of edges, alternatively connecting either consecutive stem base-pairs or a leaf base with the last base-pair in the corresponding stem. The aforementioned trees are rooted and ordered, their order corresponds to the 5'-3' orientation of an RNA sequence and their root is traditionally a designated node parenting the motif in which the first 5' base participates.
The tree representation that we employed uses a similar modeling as in Höchsman et al. [48], with some variations we describe next. Given an RNA secondary structure, each loop, base-pair, and interval of unpaired (a) (b) Figure 9 Demonstration of the selected RNA representation. (a) RNA secondary structure as presented by [2]. (b) The tree representation of (a) in our model. Each base-pairs stacking is matched by color to it's representing branch in the tree. http://www.almob.org/content/8/1/13 bases generates a node in the tree representation of the structure. Labels are assigned to tree nodes in order to indicate their type and content. Node types are: BP (base-pair), UPI (unpaired-base interval), HP (hairpin), IL (internal loop or bulge), ML (multi-loop), and EXT (external loop). For a UPI node, the label also includes the 5' to 3' base-sequence of the corresponding interval, and for a BP node the label includes the corresponding bases (see Figure 9). Each loop node (HP, IL, ML, and EXT) is connected, in 5' to 3' sequence order, to all UPI nodes which correspond to intervals of unpaired bases associated to the loop, and to all BP nodes which correspond to stemterminating base-pairs adjacent to the loop. BP nodes are nodes of degree 2, where the two neighbors of such nodes are the BP nodes that correspond to adjacent stacked base-pairs. The set of leaves in the tree corresponds to the set of UPI nodes or BP nodes terminating loop-free stems.

Alignment cost function
Node smoothing costs were set to 3 for BP, 11 for ML/EXT and 5 for IL. For subtree pruning costs, we have designed a cost function that counts the occurrences of different types of elements appearing in the subtree and deduces a corresponding penalty. The function is of the form: prune(T) = C up · N up + C bp · N bp + C hp · N hp + . . ., where the values of C x are constant penalty factors, N up is the number of unpaired bases in T, N bp is the number of base-pairs, N hp is the number of hairpins, and so on (specific C x values are given in Table 1). Table 2 summarizes the costs of matching node pairs by the alignment. Sequence alignment costs and base-pair alignment costs were set using the RIBOSUM85-60 scoring matrix [49]. In order to support the local alignment mode, we added an option to set the subtree pruning cost to zero: prune(T) = 0.

Relative scoring
We used a relative score formula described by Höchsmann et al. [48] to assess the similarity of two trees, normalizing the alignment cost by the average of the selfalignment costs of the compared trees. Let HSA m (T, S) denote the optimal alignment cost of trees T and S in alignment mode m, where m is one of the following modes: Rooted-Ordered, Rooted-Unordered, Unrooted-Ordered or Unrooted-Unordered. Let RelScore m (T, S) denote the relative score of T and S in alignment mode m, given by [48]: . (8) The scoring scheme we use satisfies that for every tree T, HSA m (T, T) < 0, and for every pair of trees T, S, HSA m (T, T), HSA m (S, S) ≤ HSA m (T, S). Under these conditions, the relative score for any pair of trees is upper bounded by 1, and the similarity of the trees increases as the score approaches 1.

p-Value computation
We apply the following p-Value computation in order to determine whether an alignment score obtained by comparing two RNA trees is significant. For the purpose of assessing the significance of a score, we need to know what distribution HSA scores follow. In order to identify the correct distribution, we first create a set of random observations to inspect. Algorithm 3 describes a routine for shuffling a tree, while preserving most of its structural features (e.g. number of stacks, amount of multi-loops) and keeping the tree valid in terms of RNA secondary structure (exemplified in Figure 10). Algorithm 4 creates a set of observations by randomly selecting pairs of trees from a dataset, shuffling them and reporting the relative score of their alignment. In each alignment mode m, where m ∈ {RO, RU, UO, UU}, we ran Algorithm 4, setting the input parameter dataset to all the structures from over the RNAStrand [50] database (containing 1751 RNA structures), and setting the number of iterations (the amount parameter, in the code below) to 2 × 10 6 .  The observed results where plotted in an accumulative manner and a Maximum Likelihood (ML) fitting technique was used to determine the best distribution and its parameters. We tried several types of distributions and used the Kolmogorov-Smirnov test (K-S test) formula [51] to measure the goodness of the data fit to a given distribution. In all four alignment modes the ML Gaussian distribution provided the best fit. Figure 11 exemplifies the fitting of the data to this distribution with alignment mode UU. The figure also displays the ML Gumbel distribution. The K-S score of the ML Gaussian distribution is better than the score for the ML Gumbel distribution. Following these results we used the ML fitted Gaussian distribution to model the HSA results. This allowed us to compute for a pair of trees T and S a p-Value score analytically, using the following formula: where x is the relative score of T and S, and X is a random variable normally distributed with the ML fitted parameters.
A Bonferroni correction was applied to all the reported p-value computations described in the following sections. This was done by multiplying the computed p-value by the number of tests performed (i.e. the number of tree pairs aligned within the family that participated in the corresponding test).

RNase P family
RNase P is the endoribonuclease responsible for the 5' maturation of tRNA precursors [19]. Secondary structures of bacterial RNase P RNAs have been studied in detail, primarily using comparative methods [52], and were shown to share a common core of primary and secondary structure. In bacteria, synthetic minimal RNase P RNAs consisting only of these core sequences and structures were shown to be catalytically proficient. Sequences encoding RNase P RNAs of various genomes have been determined and a database established [53], which consists of a compilation of ribonuclease P RNA sequences, sequence alignments, secondary structures and three-dimensional models.
We conducted a preliminary experiment, intended to identify examples of pairs of RNA trees for which an RNA structural comparison approach supporting unrooting and branch shuffling may detect (otherwise hidden) structural similarity. To achieve this, we ran a benchmark of all-against-all pairwise alignments of bacterial RNAse P RNA secondary structure trees, using our tool's different tree-alignment modes and comparing the differences between the obtained alignment costs. http://www.almob.org/content/8/1/13 Figure 10 Example of the shuffle process. On the left is the input tree and on the right the output of the process.
The alignment cost functions and parameters used in our experiment are given in Section ' Alignment cost function' .
Our benchmark was based on 470 RNAse P structures, ranging across various organisms, taken from the RNAse P database [53] (molecule naming conventions are according to [50]). After filtering out partial and environmental sequences, 170 distinct structures remained, yielding 14,365 distinct pairs of trees. The sizes of the trees in this dataset ranged from 82 to 230 nodes, averaging at 142. The total running time of the benchmark was approximately 33 minutes on a single Xeon X5690 using around 300Mb of memory. Figure 11 The result of running Stats(D, 1 × 10 6 ) in UU mode of HSA with relative score and its fitting to Gaussian distribution.
Each pair of trees T, S was compared in two modes to obtain the corresponding scores and alignments: rootedordered (RO) and rooted-unordered (RU), and the relative score was computed for each pair in each mode according to Section 'Relative scoring' .
Our goal in this experiment was to identify evolutionary events that can be explained by unordered alignments. Thus, we sought pairs of RNAse P RNAs that are highly conserved, and yet their alignment can still benefit substantially from unordered mappings. To achieve this, we removed from the set pairs of trees for which RelScore RU (T, S) < 0.5. We sorted the remaining pairs of trees according to the difference between the RU and RO modes (RelScore RU (T, S) − RelScore RO (T, S)).
When examining the top 50 alignments carefully, two distinct types of mapping patterns were observed among them, where each of the top 50 pairs belongs (with slight variations) to one of these two types (33 to Type 1 and 17 to Type 2). In the next paragraphs, we exemplify the highest ranking alignment of each of the two types (the first type is shown in Figure 1b). As mentioned before, the input for FRUUT alignments consisted only of sequence and secondary structure information. The tertiary structure (pseudoknot annotations) for the topranking alignments were only considered later, during the alignment interpretations.
Type 1: loop swapping in main multiloop accompanied by hairpin deletion in P17. 1 The first type of alignment http://www.almob.org/content/8/1/13 pattern is characterized by comparisons between a green sulfur bacteria Chlorobium and gamma purple bacterial RNAse P RNAs. This alignment pattern is exemplified in Figure 1b, for the bacterial RNAs ASE_00047 and ASE_00334. When examining the corresponding tertiary structure information (Figure 12), the transformations predicted by FRUUT seem to make sense: observe that there is an additional duplex connecting the loops of intervals 13 and 15. Notice that the alignment between the two trees maps the intervals in a manner that preserves the tertiary structural information and the other information surrounding the loops -thus exemplifies a biologically verified alignment which does not preserve branch ordering.
Type 2: hairpin swapping between P17 and P17.1 The second type of alignment pattern is characterized by comparisons between Agrobacterium tumefaciens (ASE_00018) and several Chlamydia trachomatis members. This alignment pattern is exemplified in Figure 13. An interesting element-twist transformation is observed here in hairpins P17 and P17.1 of ASE_00018, which are mapped onto their corresponding hairpins P17.1 and P17 in ASE_00070, respectively, via a subtree reordering mapping operation. When examining the corresponding tertiary structure information (Figure 13), we observe that the loops of the hairpins P17 and P17.1 are engaged in pseudoknots with loops L (named P6).

The Hammerhead Ribozyme family
Another type of homology detected by our tool is exemplified in the Hammerhead Ribozyme family, which is characterized by two distinct transcript types, yielding the same functional RNA (Figures 1a and14).
The Hammerhead Ribozyme, a derivative of several self-cleaving satellite virus RNAs [54], is a single strand RNA with autocatalytic capabilities. Naturally, it has a highly specific self-cleavage site at C17, operating via isomeration and rearrangement of the linkage phosphodiester bond [55]. Furthermore, Birikh et al. [18] suggested that the Hammerhead Ribozyme may undergo synthetic modification by removing the loop of one of its helical arms, thus making it catalytically active and able to cleave other RNAs. Hammerheads are therefore widely used in the biotechnological industry as biosensors, enzymes for specific RNAs and gene discovery agents.
The tertiary structure of the minimal version of the Hammerhead Ribozyme has been thoroughly studied by [56,57]. It is composed of three base paired helices, Figure 12 RNAse P type 1 Tertiary structural information. Tertiary structural information for ASE_00047 and ASE_00334, taken from the RNAse P Database [53]. http://www.almob.org/content/8/1/13 (a) (b) Figure 13 RNAse P type 2 example. (a) An example of unordered alignment between two RNAse P RNAs: ASE_00018 and ASE_00070 with a corrected p-value score of 3.117 × 10 −6 . Grey colored bases in the FRUUT alignment graphics represent deletions. P6 is a pseudoknot marking of the tertiary structure information. (b) Tertiary structural information for ASE_00018 and ASE_00070, taken from the RNAse P Database [53]. http://www.almob.org/content/8/1/13 (a) (b) Figure 14 Example 2 on the HammerHead family. Results on different modes of alignment between two Hammerhead RNAs: PDB_01077 (Type I, left) and RFA_00433 (Type III, right) (a) A rooted ordered alignment between the two trees with a relative score of 0.2374. (b) A unrooted ordered alignment between the two trees with a relative score of 0.5111.
entitled I, II, III, according to their position in the sequence. There are highly conserved sequences within the multi-loop between stem I and II (containing the sequence box CUGA), between stem II and III (containing the box GAAA) and between stem III and I (containing the cleavage reaction site, C17). In nature, there are two types of Hammerheads: type I, where stem I starts at the 5' and 3' ends of the strand, and type III, where stem III starts at the 5' and 3' ends of the strand. As of today, no natural type II Hammerhead has been found. Due to the fact that the two natural Hammerhead types share a conserved structure, though it is formed by two distinct transcripts, we chose this family to demonstrate the sensitivity gained by extending the classical rootedordered RNA tree alignment to more flexible variants. Our benchmark was based on 146 Hammerhead structures (86 of type I and 62 of type III), ranging across various organisms, taken from the RNA Strand database [50]. This yields a total of 10,585 pairs of trees, with tree sizes ranging from 17 to 48 nodes, averaging at 25.5.
Each pair of trees T, S was compared in two modes to obtain the corresponding scores and alignments: rootedordered (RO) and unrooted-ordered (UO). Within each mode, we used the relative score formula as described in Section 'Relative scoring' . We partitioned the tree pairs into two groups, the first containing all pairs where both members are from the same type (5377 pairs) and the second group containing mixed pairs where the two members in each pair belong to different types (5208 pairs). We calculated the average relative score for each group in both alignment modes. The results, summarized in Table 3, show that the UO tree alignment mode is more sensitive to the similarity between the two different Hammerhead types than the RO mode. The similarity between Hammerhead structures of different types is not captured by the classical rooted ordered alignment (Figure 14a). However, when comparing the same pair using unrooted ordered alignment, the similarity between the structures is revealed, in line with the similarity of biological function (Figure 14b).

Conclusions
In this paper we define the (unrooted unordered) Homeomorphic Subtree Alignment (HSA) problem, as well as additional three restricted variants of it: Ordered-HSA, Rooted-HSA, and Ordered-Rooted-HSA. We focus on the general (unrooted and unordered) HSA variant, and present a cubic time algorithm for it.
The new algorithm is implemented as a tool (which allows solving all four HSA variants) and is applied to pairwise alignments of RNA secondary structures. Preliminary experimental results over members of the RNAs P and Hammerhead families show that the tool can be used for detecting new structural similarities between RNA molecules, which could not be detected by the classical rooted-ordered tree alignment methods. In order to obtain an O(n 3 ) running time of an otherwise O(n 5 ) time