Testing the agreement of trees with internal labels

Background: A semi-labeled tree is a tree where all leaves as well as, possibly, some internal nodes are labeled with taxa. Semi-labeled trees encompass ordinary phylogenetic trees and taxonomies. Suppose we are given a collection $\mathcal{P} = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k\}$ of semi-labeled trees, called input trees, over partially overlapping sets of taxa. The agreement problem asks whether there exists a tree $\mathcal{T}$, called an agreement tree, whose taxon set is the union of the taxon sets of the input trees such that the restriction of $\mathcal{T}$ to the taxon set of $\mathcal{T}_i$ is isomorphic to $\mathcal{T}_i$, for each $i \in \{1, 2, \ldots, k\}$. The agreement problem is a special case of the supertree problem, the problem of synthesizing a collection of phylogenetic trees with partially overlapping taxon sets into a single supertree that represents the information in the input trees. An obstacle to building large phylogenetic supertrees is the limited amount of taxonomic overlap among the phylogenetic studies from which the input trees are obtained. Incorporating taxonomies into supertree analyses can alleviate this issue.

Results: We give an $\mathcal{O}(nk(\sum_{i \in [k]} d_i + \log^2(nk)))$ algorithm for the agreement problem, where $n$ is the total number of distinct taxa in $\mathcal{P}$, $k$ is the number of trees in $\mathcal{P}$, and $d_i$ is the maximum number of children of a node in $\mathcal{T}_i$.

Conclusion: Our algorithm can aid in integrating taxonomies into supertree analyses. Our computational experience with the algorithm suggests that its performance in practice is much better than its worst-case bound indicates.


Introduction
In the tree agreement problem (agreement problem, for short), we are given a collection $\mathcal{P} = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k\}$ of rooted phylogenetic trees with partially overlapping taxon sets. $\mathcal{P}$ is called a profile and the trees in $\mathcal{P}$ are the input trees. The question is whether there exists a tree $\mathcal{T}$ whose taxon set is the union of the taxon sets of the input trees, such that, for each $i \in \{1, 2, \ldots, k\}$, $\mathcal{T}_i$ is isomorphic to the restriction of $\mathcal{T}$ to the taxon set of $\mathcal{T}_i$. If such a tree $\mathcal{T}$ exists, then we call $\mathcal{T}$ an agreement tree for $\mathcal{P}$ and say that $\mathcal{P}$ agrees; otherwise, $\mathcal{P}$ disagrees. The first explicit polynomial-time algorithm for the agreement problem is in reference [16]. The agreement problem can be solved in $O(n^2 k)$ time, where $n$ is the number of distinct taxa in $\mathcal{P}$ [10].
Here we study a generalization of the agreement problem, where the internal nodes of the input trees may also be labeled. These labels represent higher-order taxa; i.e., in effect, sets of taxa. Thus, for example, an input tree may contain the taxon Glycine max (soybean) nested within a subtree whose root is labeled Fabaceae (the legumes), itself nested within an Angiosperm subtree. Note that leaves themselves may be labeled by higher-order taxa. We present an $O(nk(\sum_{i \in [k]} d_i + \log^2(nk)))$ algorithm for the agreement problem for trees with internal labels, where $n$ is the total number of distinct taxa in $\mathcal{P}$, $k$ is the number of trees in $\mathcal{P}$, and, for each $i \in \{1, 2, \ldots, k\}$, $d_i$ is the maximum number of children of a node in $\mathcal{T}_i$.
Background. A close relative of the agreement problem is the compatibility problem. The input to the compatibility problem is a profile $\mathcal{P} = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k\}$ of rooted phylogenetic trees with partially overlapping taxon sets. The question is whether there exists a tree $\mathcal{T}$ whose taxon set is the union of the taxon sets of the input trees such that each input tree $\mathcal{T}_i$ can be obtained from the restriction of $\mathcal{T}$ to the taxon set of $\mathcal{T}_i$ through edge contractions. If such a tree $\mathcal{T}$ exists, we refer to $\mathcal{T}$ as a compatible tree for $\mathcal{P}$ and say that $\mathcal{P}$ is compatible; otherwise, $\mathcal{P}$ is incompatible. Compatibility is a less stringent requirement than agreement; therefore, any profile that agrees is compatible, but the converse is not true. The compatibility problem for phylogenies (i.e., trees without internal labels) is solvable in $O(M_{\mathcal{P}} \log^2 M_{\mathcal{P}})$ time, where $M_{\mathcal{P}}$ is the total number of nodes and edges in the trees of $\mathcal{P}$ [9]. Note that $M_{\mathcal{P}} = O(nk)$.
Compatibility and agreement reflect two distinct approaches to dealing with multifurcations; i.e., non-binary nodes, also known as polytomies. Suppose that node $v$ is a multifurcation in some input tree of $\mathcal{P}$ and that $\ell_1$, $\ell_2$, and $\ell_3$ are taxa in three distinct subtrees of $v$. In an agreement tree for $\mathcal{P}$, these three taxa must be in distinct subtrees of some node in the agreement tree. In contrast, a compatible tree for $\mathcal{P}$ may contain no such node, since a compatible tree is allowed to "refine" the multifurcation at $v$, that is, group two out of $\ell_1$, $\ell_2$, and $\ell_3$ separately from the third. Thus, compatibility treats multifurcations as "soft" facts; agreement treats them as "hard" facts [15]. Both viewpoints can be valid, depending on the circumstances.
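The difference between hard and soft multifurcations can be illustrated with a small sketch. It assumes leaf-labeled trees that are described directly by hand-entered cluster sets (the set of taxa below each node), and it relies on the standard fact that contracting an edge only removes a cluster. The function names and the taxon names `l1`, `l2`, `l3` are illustrative, not part of the paper.

```python
from typing import FrozenSet, Set

Clusters = Set[FrozenSet[str]]

def restrict(clusters: Clusters, taxa: Set[str]) -> Clusters:
    """Clusters of the restriction: intersect with `taxa`, drop empty sets."""
    return {frozenset(c & taxa) for c in clusters if c & taxa}

def agrees(candidate: Clusters, input_tree: Clusters, taxa: Set[str]) -> bool:
    # Hard polytomies: the restriction must match the input tree exactly.
    return restrict(candidate, taxa) == input_tree

def is_compatible(candidate: Clusters, input_tree: Clusters, taxa: Set[str]) -> bool:
    # Soft polytomies: edge contractions only remove clusters, so every
    # cluster of the input tree must be present in the restriction.
    return input_tree <= restrict(candidate, taxa)

taxa = {"l1", "l2", "l3"}
leaves = {frozenset({t}) for t in taxa}
star = leaves | {frozenset(taxa)}             # multifurcation at the root
resolved = star | {frozenset({"l1", "l2"})}   # groups l1 and l2 together

print(agrees(resolved, star, taxa))         # False: the polytomy was refined
print(is_compatible(resolved, star, taxa))  # True: refinement is allowed
```

Here the resolved tree is a compatible tree for the star-shaped input tree, but not an agreement tree for it.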
The agreement and compatibility problems are fundamental special cases of the supertree problem, the problem of synthesizing a collection of phylogenetic trees with partially overlapping taxon sets into a single supertree that represents the information in the input trees [4,2,18,24]. The original supertree methods were limited to input trees where only the leaves are labeled, but there has been increasing interest in incorporating internally labeled trees in supertree analysis, motivated by the desire to incorporate taxonomies in these analyses. Taxonomies group organisms according to a system of taxonomic rank (e.g., family, genus, and species); two examples are the NCBI taxonomy [21] and the Angiosperm taxonomy [23]. Taxonomies provide structure and completeness that can be hard to obtain otherwise [17,12,19], offering a way to circumvent one of the obstacles to building comprehensive phylogenies: the limited taxonomic overlap among different phylogenetic studies [20].
Although internally labeled trees, and taxonomies in particular, are not, strictly speaking, phylogenies, they have many of the same mathematical properties as phylogenies. Both phylogenies and internally labeled trees are X-trees (also called semi-labeled trees) [5,22]. Algorithmic results for compatibility and agreement of internally labeled trees are scarce, compared to what is available for ordinary phylogenies. To our knowledge, the first algorithm for testing compatibility of internally labeled trees is in [7] (see also [3]). The fastest known algorithm for the problem runs in $O(M_{\mathcal{P}} \log^2 M_{\mathcal{P}})$ time [8]. We are unaware of any previous algorithmic results for the agreement problem for internally labeled trees.
All algorithms for compatibility and agreement that we know of are indebted to Aho et al.'s Build algorithm [1]. The time bounds for agreement algorithms are higher than those of compatibility algorithms, due to the need for agreement trees to respect the multifurcations in the input trees. To handle agreement, Build has to be modified so that certain sets of the partition of the taxa it generates are re-merged to reflect the multifurcations in the input trees, adding considerable overhead [16,10] (similar issues are faced when testing consistency of triples and fans [14]). This issue becomes more complex for internally labeled trees, in part because internal nodes with the same label, but in different trees, may jointly imply multifurcations, even if all input trees are binary.
Organization of the paper. Section 2 provides a formal definition of the agreement problem for internally labeled trees. Section 3 studies the decomposability properties of profiles that agree. These properties allow us to reduce an agreement problem on a profile into independent agreement problems on subprofiles, leading to the agreement algorithm presented in Section 4. Section 5 contains some final remarks. All proofs are in the Appendix.

Graphs and trees.
Let $G$ be a graph. $V(G)$ and $E(G)$ denote the node and edge sets of $G$. Let $U$ be a subset of $V(G)$. Then the subgraph of $G$ induced by $U$ is the graph whose vertex set is $U$ and whose edge set consists of all of the edges in $E(G)$ that have both endpoints in $U$.
A tree is an acyclic connected graph. All trees here are assumed to be rooted. For a tree $T$, $r(T)$ denotes the root of $T$. Suppose $u, v \in V(T)$. Then, $u$ is an ancestor of $v$, denoted $u \le_T v$, if $u$ lies on the path from $v$ to $r(T)$ in $T$. If $\{u, v\} \in E(T)$ and $u \le_T v$, then $u$ is the parent of $v$ and $v$ is a child of $u$. For each $x \in V(T)$, we use $\mathrm{parent}_T(x)$, $\mathrm{Ch}_T(x)$, and $T(x)$ to denote the parent of $x$, the children of $x$, and the subtree of $T$ rooted at $x$, respectively. We extend the child notation to subsets of $V(T)$ in the natural way: for $U \subseteq V(T)$, $\mathrm{Ch}_T(U) = \bigcup_{u \in U} \mathrm{Ch}_T(u)$.

Let $T$ be a tree and suppose $U \subseteq V(T)$. The lowest common ancestor of $U$ in $T$, denoted $\mathrm{LCA}_T(U)$, is the unique greatest lower bound of $U$ under $\le_T$; that is, the common ancestor of the nodes in $U$ that is farthest from the root.

X-trees. Throughout the paper, $X$ denotes a set of labels (that is, taxa, which may be, e.g., species or families of species). An $X$-tree is a pair $\mathcal{T} = (T, \phi)$ where $T$ is a tree and $\phi$ is a mapping from $X$ to $V(T)$ such that, for every node $v \in V(T)$ of degree at most two, $v \in \phi(X)$. $X$ is the label set of $\mathcal{T}$ and $\phi$ is the labeling function of $\mathcal{T}$. For every node $v \in V(T)$, $\phi^{-1}(v)$ denotes the (possibly empty) subset of $X$ whose elements map into $v$; these elements are the labels of $v$. By definition, every leaf in an $X$-tree is labeled, and any node, including the root, that has a single child must be labeled. Nodes with two or more children may be labeled or unlabeled. An $X$-tree $\mathcal{T} = (T, \phi)$ is singly labeled if every node in $T$ has at most one label; $\mathcal{T}$ is fully labeled if every node in $T$ is labeled.
X-trees, also known as semi-labeled trees, generalize ordinary phylogenetic trees (also known as phylogenetic X-trees [22]). An ordinary phylogenetic tree is a semi-labeled tree $\mathcal{T} = (T, \phi)$ where $r(T)$ has degree at least two and $\phi$ is a bijection from $X$ onto the leaf set of $T$ (thus, internal nodes are not labeled).
Let $\mathcal{T} = (T, \phi)$ be an $X$-tree. For each $u \in V(T)$, $X(u)$ denotes the set of all labels in the subtree of $T$ rooted at $u$; that is, $X(u) = \bigcup_{v : u \le_T v} \phi^{-1}(v)$. $X(u)$ is called a cluster of $\mathcal{T}$. $\mathrm{Cl}(\mathcal{T})$ denotes the set of all clusters of $\mathcal{T}$. We extend the cluster notation to sets of nodes as follows. Let $U$ be a subset of $V(T)$. Then, $X(U) = \bigcup_{u \in U} X(u)$.

Suppose $Y \subseteq X$ for an $X$-tree $\mathcal{T} = (T, \phi)$. The restriction of $\mathcal{T}$ to $Y$, denoted $\mathcal{T}|Y$, is the semi-labeled tree whose cluster set is $\mathrm{Cl}(\mathcal{T}|Y) = \{W \cap Y : W \in \mathrm{Cl}(\mathcal{T}) \text{ and } W \cap Y \neq \emptyset\}$. Intuitively, $\mathcal{T}|Y$ is obtained from the minimal rooted subtree of $T$ that connects the nodes in $\phi(Y)$ by suppressing all vertices $v$ such that $v \notin \phi(Y)$ and $v$ has only one child.

Let $\mathcal{T} = (T, \phi)$ be an $X$-tree and $\mathcal{T}' = (T', \phi')$ be an $X'$-tree such that $X' \subseteq X$. $\mathcal{T}$ agrees with $\mathcal{T}'$ if $\mathrm{Cl}(\mathcal{T}') = \mathrm{Cl}(\mathcal{T}|X')$. It is well known that the clusters of a tree determine the tree, up to isomorphism [22, Theorem 3.5.2]. Thus, $\mathcal{T}$ agrees with $\mathcal{T}'$ if $\mathcal{T}'$ and $\mathcal{T}|X'$ are isomorphic.
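The cluster-based definitions of restriction and agreement can be made concrete with a short sketch. The encoding is an assumption made for illustration: a tree is given by a `children` dictionary and a `labels` dictionary (the inverse labeling, node to set of labels), and the function names are hypothetical rather than taken from the paper.

```python
from typing import Dict, FrozenSet, Hashable, List, Set

Node = Hashable
Clusters = Set[FrozenSet[str]]

def clusters(root: Node,
             children: Dict[Node, List[Node]],
             labels: Dict[Node, Set[str]]) -> Clusters:
    """Return Cl(T): the clusters X(u) for every node u of the tree."""
    result: Clusters = set()

    def visit(u: Node) -> FrozenSet[str]:
        # X(u) is the union of the labels at u and at all descendants of u.
        acc = set(labels.get(u, set()))
        for c in children.get(u, []):
            acc |= visit(c)
        cluster = frozenset(acc)
        result.add(cluster)
        return cluster

    visit(root)
    return result

def restrict(cl: Clusters, Y: Set[str]) -> Clusters:
    """Cl(T|Y) = {W & Y : W in Cl(T) and W & Y is nonempty}."""
    return {frozenset(W & Y) for W in cl if W & Y}

def agrees(cl_big: Clusters, cl_small: Clusters, X_small: Set[str]) -> bool:
    """T agrees with T' if Cl(T') equals Cl(T | X')."""
    return restrict(cl_big, X_small) == cl_small

# Root r carries the internal label "h"; node u is an unlabeled binary node.
children = {"r": ["u", "c"], "u": ["a", "b"]}
labels = {"r": {"h"}, "a": {"a"}, "b": {"b"}, "c": {"c"}}
cl = clusters("r", children, labels)

# A tree on {h, a, c}: root labeled h with leaf children a and c.
small = {frozenset("hac"), frozenset("a"), frozenset("c")}
print(agrees(cl, small, {"h", "a", "c"}))   # True

# A star on {a, b, c} is not matched, because the cluster {a, b} survives.
star = {frozenset("abc"), frozenset("a"), frozenset("b"), frozenset("c")}
print(agrees(cl, star, {"a", "b", "c"}))    # False
```

The recursion is adequate for small examples; a very deep tree would call for an explicit stack.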
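Profiles and agreement. Throughout the rest of this paper, $\mathcal{P}$ denotes a set $\{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k\}$ where, for each $i \in [k]$, $\mathcal{T}_i = (T_i, \phi_i)$ is an $X_i$-tree (see Figure 1a). We refer to $\mathcal{P}$ as a profile, and to the trees in $\mathcal{P}$ as input trees. We write $X_{\mathcal{P}}$ to denote $\bigcup_{i \in [k]} X_i$. A profile $\mathcal{P}$ agrees if there is an $X_{\mathcal{P}}$-tree $\mathcal{T}$ that agrees with each of the trees in $\mathcal{P}$. If $\mathcal{T}$ exists, we refer to $\mathcal{T}$ as an agreement tree for $\mathcal{P}$. See Figure 1b.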
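Given a subset $Y$ of $X_{\mathcal{P}}$, the restriction of $\mathcal{P}$ to $Y$, denoted $\mathcal{P}|Y$, is the profile defined as $\mathcal{P}|Y = \{\mathcal{T}_1|(Y \cap X_1), \mathcal{T}_2|(Y \cap X_2), \ldots, \mathcal{T}_k|(Y \cap X_k)\}$. The proof of the following lemma is straightforward.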
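Lemma 1. Suppose a profile $\mathcal{P}$ has an agreement tree $\mathcal{T}$. Then, for any $Y \subseteq X_{\mathcal{P}}$, $\mathcal{T}|Y$ is an agreement tree for $\mathcal{P}|Y$.

Suppose $\mathcal{P}$ contains trees that are not fully labeled. We can convert $\mathcal{P}$ into an equivalent profile $\mathcal{P}'$ of fully-labeled trees as follows. For each $i \in [k]$, let $l_i$ be the number of unlabeled nodes in $\mathcal{T}_i$. Each of these $l_i$ nodes is assigned a distinct new label, that is, a label that appears nowhere else in the profile. We refer to $\mathcal{P}'$ as the profile obtained by adding distinct new labels to $\mathcal{P}$ (see Figure 1a).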
Lemma 2. Let $\mathcal{P}'$ be the profile obtained by adding distinct new labels to $\mathcal{P}$. Then, $\mathcal{P}'$ agrees if and only if $\mathcal{P}$ agrees. Further, if $\mathcal{T}$ is an agreement tree for $\mathcal{P}'$, then $\mathcal{T}$ is also an agreement tree for $\mathcal{P}$.
From this point forward, we make the following assumption.
Assumption 1. For each $i \in [k]$, $\mathcal{T}_i$ is fully and singly labeled.
By Lemma 2, no generality is lost in assuming that all trees in $\mathcal{P}$ are fully labeled. The assumption that the trees are singly labeled is inessential; it is only for clarity. Note that, even with the latter assumption, a tree that agrees with $\mathcal{P}$ is not necessarily singly labeled. Figure 1b illustrates this fact.

Lemma 3. If profile $\mathcal{P}$ agrees, then $\mathcal{P}$ has an agreement tree $\mathcal{T} = (T, \phi)$ such that $\phi^{-1}(v) \neq \emptyset$ for each node $v \in V(T)$.
By Assumption 1, for each $i \in [k]$, there is a bijection between the labels in $X_i$ and the nodes in $V(T_i)$. For this reason, we will often refer to nodes by their labels. In particular, given a label $\ell \in X_i$, we write $X_i(\ell)$ to denote $X_i(\phi_i(\ell))$ (the cluster of $\mathcal{T}_i$ at the node labeled $\ell$), $\mathrm{Ch}_{T_i}(\ell)$ to denote $\phi_i^{-1}(\mathrm{Ch}_{T_i}(\phi_i(\ell)))$ (the labels of the children of $\ell$ in $T_i$), and, for $A \subseteq X_i$, $\mathrm{Ch}_{T_i}(A)$ to denote $\phi_i^{-1}(\mathrm{Ch}_{T_i}(\phi_i(A)))$.
The following characterization of agreement generalizes a result in [10].
Lemma 4. Let P be a profile and T = (T, φ) be an X P -tree.Then, T is an agreement tree for P if and only if, for each i ∈ [k], there exists a function , and (E3) for every two distinct labels b, c ∈ Ch Ti (a), there exist distinct nodes u, v ∈ Ch T (φ i (a)) such that φ i (b) ∈ X P (u) and φ i (c) ∈ X P (v).
We refer to a function φ i satisfying conditions (E1)-(E3) of Lemma 4 as a topological embedding of T i into T .Observe that, by transitivity, condition (E2) implies that, for any a, b ∈ X i , if a < Ti b, then φ i (a) < T φ i (b).

Positions in a Profile
A position in a profile $\mathcal{P}$ is a tuple $\pi = (\pi_1, \pi_2, \ldots, \pi_k)$ where, for each $i \in [k]$, either $\pi_i = \emptyset$ or $\pi_i = \{\ell\}$, for some $\ell \in X_i$. Note that the definition of a position allows for the possibility that there exist $i, j \in [k]$, $i \neq j$, such that $\ell \in \pi_i$, but $\ell \notin \pi_j$, even if $\ell \in X_i$ and $\ell \in X_j$. At any given point during its execution, our agreement algorithm focuses on testing the agreement of the subprofile of $\mathcal{P}$ determined by the subtrees associated with a specific position.
For a position π in P, let X P (π) denote the set of labels We say that position π has an agreement tree if P|X P (π) has an agreement tree.
A position The initial position for P is the position π init , where, for each i Clearly, π init is a valid position.Lemma 5. A profile P has an agreement tree if and only if there is an agreement tree for every valid position π in P.

Decomposing a position.
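In what follows, $\pi$ denotes a valid position in $\mathcal{P}$. For each $i \in [k]$ such that $\pi_i \neq \emptyset$, let $\ell_i \in X_i$ denote the single label in $\pi_i$. Let $\mathrm{Ch}_{\mathcal{P}}(\pi)$ denote the set of all children of some label in $\pi$; i.e., $\mathrm{Ch}_{\mathcal{P}}(\pi) = \bigcup_{i \in [k] : \pi_i \neq \emptyset} \mathrm{Ch}_{T_i}(\ell_i)$.

Let $\pi$ be a valid position in $\mathcal{P}$. A good decomposition of $\pi$ is a pair $(S, \Pi)$, where $S$ is a subset of the exposed labels in $\bigcup_{i \in [k]} \pi_i$ and $\Pi = \{\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(d)}\}$ is a collection of valid positions such that

(D1) $S \cup \bigcup_{j \in [d]} X_{\mathcal{P}}(\pi^{(j)}) = X_{\mathcal{P}}(\pi)$ and $S \cap \bigcup_{j \in [d]} X_{\mathcal{P}}(\pi^{(j)}) = \emptyset$, and

(D2) $X_{\mathcal{P}}(\pi^{(p)}) \cap X_{\mathcal{P}}(\pi^{(q)}) = \emptyset$, for all $p, q \in [d]$ such that $p \neq q$.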
Note that we allow $S$ or $\Pi$ to be empty. We refer to the labels in $S$ as semi-universal labels and to the positions in $\Pi$ as successor positions of $\pi$. The next result is central to our agreement algorithm.

Lemma 6. Let $\pi$ be a valid position in a profile $\mathcal{P}$. Then, $\pi$ has an agreement tree if and only if there exists a good decomposition $(S, \Pi)$ of $\pi$ such that $S \neq \emptyset$ and, for each position $\pi' \in \Pi$, $\pi'$ has an agreement tree. If such a good decomposition exists, then $\pi$ has an agreement tree $\mathcal{T} = (T, \phi)$ where $\phi^{-1}(r(T)) = S$.
Good partitions. To find a good decomposition of a position $\pi$, it is convenient to work with partitions of $\mathrm{Ch}_{\mathcal{P}}(\pi)$. (Recall that a partition of a set $Y$ is a collection $\Gamma$ of nonempty subsets of $Y$ such that every element $x \in Y$ is in exactly one set in $\Gamma$.) A good decomposition $(S, \Pi)$, where $\Pi = \{\pi^{(j)}\}_{j \in [d]}$, defines a partition $\Gamma$ of the set $\mathrm{Ch}_{\mathcal{P}}(\pi)$ where, for any $a, b \in \mathrm{Ch}_{\mathcal{P}}(\pi)$, $a$ and $b$ are in the same set of $\Gamma$ if and only if there exists $j \in [d]$ such that $a, b \in X_{\mathcal{P}}(\pi^{(j)})$. We refer to $\Gamma$ as the partition of $\mathrm{Ch}_{\mathcal{P}}(\pi)$ associated with $(S, \Pi)$. Next, we show that, conversely, certain partitions of $\mathrm{Ch}_{\mathcal{P}}(\pi)$ define good decompositions of $\pi$.
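Set $A \subseteq \mathrm{Ch}_{\mathcal{P}}(\pi)$ is nice with respect to a subset $S$ of the exposed labels in $\pi$ if, for each label $\ell \in \bigcup_{i \in [k]} \pi_i$ such that $\mathrm{Ch}_{\mathcal{P}}(\ell) \cap A \neq \emptyset$,

(N1) if $\ell \in S$, then $|\mathrm{Ch}_{T_i}(\ell) \cap A| = 1$ for each $i \in [k]$ such that $\ell \in \pi_i$, and

(N2) if $\ell \notin S$, then $\mathrm{Ch}_{\mathcal{P}}(\ell) \subseteq X_{\mathcal{P}}(A)$.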
Suppose A is a nice set.The position associated with A is the position π A , where, for each i ∈ [k], π A i is defined as follows.If π i = ∅, then π A i = ∅.Otherwise, let be the single element in π i .Then, A partition Γ of Ch P (π) is good with respect to S if each set A ∈ Γ is nice with respect to S and, for every two distinct sets A, B ∈ Γ , X P (π A ) ∩ X P (π B ) = ∅.
Lemma 7.There is a bijection between good decompositions of π and good partitions of Ch P (π).That is, the following statements hold.
We refer to the good partition $(S, \Gamma)$ of $\mathrm{Ch}_{\mathcal{P}}(\pi)$ obtained from a good decomposition $(S, \Pi)$ of $\pi$, as described in Lemma 7 (i), as the good partition of $\mathrm{Ch}_{\mathcal{P}}(\pi)$ associated with $(S, \Pi)$. Likewise, we refer to the good decomposition $(S, \Pi)$ of $\pi$ obtained from a good partition $(S, \Gamma)$ of $\mathrm{Ch}_{\mathcal{P}}(\pi)$, as described in Lemma 7 (ii), as the good decomposition of $\pi$ associated with $(S, \Gamma)$.
We refer to the (unique) good decomposition (S, Π) associated with the minimal good partition of Ch P (π) as the maximal good decomposition of π.
Corollary 1. Let $\pi$ be a valid position in a profile $\mathcal{P}$ and $(S, \Pi)$ be the maximal good decomposition of $\pi$. If $\pi$ has an agreement tree, then $S \neq \emptyset$.

Constructing an Agreement Tree
BuildAST (Algorithm 1) takes as input a profile $\mathcal{P}$ on a set of labels $X$ and either returns an agreement tree for $\mathcal{P}$ or reports that no such tree exists.

Algorithm 1: BuildAST($\mathcal{P}$). Data: A profile $\mathcal{P} = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k\}$ on a set of taxa $X$. Result: Returns an agreement tree $\mathcal{T}$ for $\mathcal{P}$, if one exists; otherwise, returns disagreement.

BuildAST proceeds from the top down, starting from the initial position $\pi^{\mathrm{init}}$ of $\mathcal{P}$, attempting to construct an agreement tree for $\mathcal{P}$ in a breadth-first manner. Like other algorithms based on breadth-first search, BuildAST uses a queue, which stores pairs $\langle \pi, \mathit{pred} \rangle$, where $\pi$ is a position in $\mathcal{P}$ and $\mathit{pred}$ is a reference to the parent of the tree node (potentially) to be created for $\pi$. At the outset, the queue contains only the pair $\langle \pi^{\mathrm{init}}, \mathit{null} \rangle$, corresponding to the root of the agreement tree, which has no parent.
At each iteration of its outer while loop (lines 3-13), BuildAST extracts a pair $\langle \pi, \mathit{pred} \rangle$ from its queue and invokes GetDecomposition to obtain the maximal good decomposition $(S, \Pi)$ of $\pi$. If $S = \emptyset$, then, by Corollary 1, no agreement tree for $\pi$ exists. BuildAST reports this fact (line 7) and terminates.
If $S \neq \emptyset$, BuildAST creates a tree node $r(\pi)$ for $\pi$; $r(\pi)$ is the tentative root for the agreement tree for $\pi$. By Lemma 6, if $\pi$ has an agreement tree, then it has an agreement tree where $\phi(\ell) = r(\pi)$ for each $\ell \in S$. Lines 10-11 set up the mapping $\phi$ accordingly. Also by Lemma 6, if $\pi$ has an agreement tree, then so does each position $\pi' \in \Pi$; furthermore, the roots of the trees for each position in $\Pi$ will be the children of $r(\pi)$. Thus, BuildAST adds $\langle \pi', r(\pi) \rangle$, for each $\pi' \in \Pi$, to the queue, to ensure that $\pi'$ is processed at a later iteration and that the root of the agreement tree constructed for $\pi'$ (if such a tree exists) is made to have $r(\pi)$ as its parent (lines 12-13). Therefore, if BuildAST terminates without reporting disagreement, then the result returned in line 14 is an agreement tree for $\mathcal{P}$. BuildAST indeed terminates, because there are only two possibilities at any given iteration: either the algorithm terminates reporting disagreement or (since $S \neq \emptyset$) the maximal good decomposition $(S, \Pi)$ of $\pi$ has the property that $\bigcup_{\pi' \in \Pi} X_{\mathcal{P}}(\pi')$ is a proper subset of $X_{\mathcal{P}}(\pi)$. The number of iterations of BuildAST cannot exceed the total number of nodes in an agreement tree for $\mathcal{P}$, which is $O(n)$. Thus, we have the following result.

Algorithm 2: GetDecomposition($\pi$), computing the maximal good decomposition. Data: A valid position $\pi$.
Theorem 1. Given a profile $\mathcal{P} = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k\}$, BuildAST returns an agreement tree $\mathcal{T}$ for $\mathcal{P}$, if such a tree exists; otherwise, BuildAST returns disagreement. The total number of iterations of BuildAST's outer loop is $O(n)$.
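The breadth-first control flow described above can be sketched as follows. This is only an outline under stated assumptions: the decomposition step is passed in as a callable `get_decomposition` (a hypothetical interface standing in for Algorithm 2), positions are opaque hashable values, tree nodes are integers, and disagreement is signaled by returning `None`.

```python
from collections import deque
from typing import Callable, Dict, Hashable, List, Optional, Set, Tuple

Position = Hashable
Decomposition = Tuple[Set[str], List[Position]]

def build_ast(initial: Position,
              get_decomposition: Callable[[Position], Decomposition]
              ) -> Optional[Tuple[Dict[int, Optional[int]], Dict[str, int]]]:
    parent: Dict[int, Optional[int]] = {}   # tree node -> its parent node
    phi: Dict[str, int] = {}                # label -> tree node carrying it
    queue = deque([(initial, None)])        # pairs <position, pred>
    while queue:
        pos, pred = queue.popleft()
        S, successors = get_decomposition(pos)
        if not S:
            return None                     # no semi-universal label: disagreement
        node = len(parent)                  # create the tentative root r(pos)
        parent[node] = pred
        for label in S:                     # labels in S map to r(pos)
            phi[label] = node
        for succ in successors:             # roots of successors become children
            queue.append((succ, node))
    return parent, phi

# Toy decomposition table, hand-written for illustration only.
table = {
    "root": ({"h"}, ["left", "right"]),
    "left": ({"a"}, []),
    "right": ({"b", "c"}, []),
}
print(build_ast("root", table.__getitem__))
```

In the toy run, the node created for `"right"` carries two labels, which mirrors the remark that an agreement tree need not be singly labeled.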
Finding the maximal good decomposition. GetDecomposition (Algorithm 2) computes the maximal good decomposition of a position $\pi$, relying on an auxiliary graph known as the display graph of the input profile and denoted by $H_{\mathcal{P}}$ [6,8,9]. The graph $H_{\mathcal{P}}$ is obtained from the disjoint union of the underlying trees $T_1, \ldots, T_k$ of $\mathcal{P}$ by identifying nodes that have the same label. Multiple edges between the same pair of nodes are replaced by a single edge. See Figure 2.
$H_{\mathcal{P}}$ has $O(nk)$ nodes and edges, and can be constructed in $O(nk)$ time. By Assumption 1, there is a bijection between the labels in $X$ and the nodes of $H_{\mathcal{P}}$. Thus, from this point forward, we refer to the nodes of $H_{\mathcal{P}}$ by their labels. For a valid position $\pi$, $H_{\mathcal{P}}(\pi)$ denotes the subgraph of $H_{\mathcal{P}}$ induced by $X_{\mathcal{P}}(\pi)$. Thus, $H_{\mathcal{P}}(\pi^{\mathrm{init}}) = H_{\mathcal{P}}$.
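As an illustration of the construction, the sketch below builds a display graph from fully and singly labeled trees and lists the connected components that remain once a set of labels is removed. The input encoding (each tree as a dictionary from a label to the labels of its children) and both function names are assumptions made for this example. The component search is a plain static traversal, not the dynamic connectivity structure that the running-time analysis relies on.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

def display_graph(trees: Iterable[Dict[str, List[str]]]) -> Dict[str, Set[str]]:
    """Union of the trees, identifying nodes that carry the same label."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for tree in trees:
        for parent, kids in tree.items():
            adj[parent]                     # make sure childless nodes exist too
            for child in kids:
                adj[parent].add(child)      # sets collapse repeated edges
                adj[child].add(parent)
    return dict(adj)

def components_without(adj: Dict[str, Set[str]], removed: Set[str]) -> List[Set[str]]:
    """Connected components of the graph after deleting the nodes in `removed`."""
    seen = set(removed)
    comps: List[Set[str]] = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], {start}
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

t1 = {"h": ["a", "b"]}
t2 = {"h": ["g", "c"], "g": ["a", "d"]}
t3 = {"h": ["a"]}                           # repeats the edge h-a
H = display_graph([t1, t2, t3])
print(sorted(sorted(c) for c in components_without(H, {"h"})))
# [['a', 'd', 'g'], ['b'], ['c']]
```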
Lines 2-10 of GetDecomposition construct the minimal good partition of Ch P (π).Line 2 initializes S to contain all exposed labels in π, and sets K to consist of the indices of the trees in P that contain the labels in S. Line 3 initializes Γ using H P (π).We say that a label ∈ S is bad if there exist i ∈ K Lemma 9. Let π be a valid position in a profile P and let (S * , Γ * ) be the minimal good partition of Ch P (π).Let (S 0 , Γ 0 ) denote the initial value of (S, Γ ) in GetDecomposition before entering the while loop, (S j , Γ j ) denote the value of (S, Γ ) after j executions of the body of the loop, and r denote the total number of iterations.Then, r ≤ k and (S 0 , Γ 0 ) By Lemma 9, the pair (S, Γ ) constructed in lines 4-10 of GetDecomposition is a minimal good partition of Ch P (π).The foreach loop of lines 11-19 simply uses Equation (1) to construct the maximal good decomposition (S, Π) of π from (S, Γ ).We thus have the following.
Lemma 10.GetDecomposition returns the maximal good decomposition of π.
Implementation. Throughout its execution, BuildAST maintains the display graph $H_{\mathcal{P}}$. Also, for each label $\ell \in X$, it maintains a field $\ell.\mathit{appear}$ containing every index $i$ such that $\pi_i = \{\ell\}$ for some $\pi$ in the queue $Q$. Label $\ell$ is exposed when $|\ell.\mathit{appear}| = k_\ell$, where $k_\ell$ denotes the number of input trees containing label $\ell$. For each $\pi$ in BuildAST's queue, the set $\mathrm{Ch}_{\mathcal{P}}(\pi)$ is stored as a sparse array $((i, \mathrm{Ch}_{T_i}(\pi_i)) : i \in [k] \text{ and } \mathrm{Ch}_{T_i}(\pi_i) \neq \emptyset)$. This enables GetDecomposition to access the parts of $\mathrm{Ch}_{\mathcal{P}}(\pi)$ associated with each input tree separately. We use this representation of $\mathrm{Ch}_{\mathcal{P}}(\pi)$ to build similar representations of the sets in the partition $\Gamma$ of $\mathrm{Ch}_{\mathcal{P}}(\pi)$ produced from $H_{\mathcal{P}}(\pi) \setminus S$ in line 3 of GetDecomposition. For each label $a \in \mathrm{Ch}_{\mathcal{P}}(\pi)$, we maintain a mapping that returns, in $O(1)$ time, the set $A \in \Gamma$ containing $a$. During the execution of GetDecomposition's while loop, sets in $\Gamma$ may be merged, and representations of these merged sets must be produced and the mapping from $\mathrm{Ch}_{\mathcal{P}}(\pi)$ to $\Gamma$ must be modified.
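The bookkeeping for exposed labels can be sketched as follows. The class name `LabelIndex` and its methods are hypothetical; the sketch only assumes what the paragraph states, namely that a label is exposed once its `appear` set contains as many tree indices as there are input trees containing that label.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

class LabelIndex:
    """Tracks, per label, the trees whose current position holds that label."""

    def __init__(self, label_sets: Iterable[Set[str]]) -> None:
        # label_sets[i] plays the role of X_i, the label set of input tree i.
        self.k: Dict[str, int] = defaultdict(int)
        for labels in label_sets:
            for lab in labels:
                self.k[lab] += 1            # number of trees containing lab
        self.appear: Dict[str, Set[int]] = defaultdict(set)

    def mark(self, label: str, i: int) -> None:
        """Record that tree i currently has `label` in its position."""
        self.appear[label].add(i)

    def is_exposed(self, label: str) -> bool:
        # A label that occurs in no tree is never reported as exposed.
        return self.k.get(label, 0) > 0 and len(self.appear[label]) == self.k[label]

idx = LabelIndex([{"h", "a"}, {"h", "b"}])
idx.mark("h", 0)
print(idx.is_exposed("h"))   # False: tree 1 has not reached h yet
idx.mark("h", 1)
print(idx.is_exposed("h"))   # True
```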

Concluding Remarks
BuildAST may be much faster in practice than Theorem 2 suggests, since that bound assumes the unlikely scenario where every edge deletion performed in constructing $H_{\mathcal{P}}(\pi) \setminus S$ in GetDecomposition generates a new component and that most of these components are remerged in GetDecomposition's while loop. In any case, Theorem 2 implies that BuildAST performs well if the sum of the maximum out-degrees is small relative to the number of taxa.
The running time of BuildAST can be further improved to $O(nk(\sum_{i \in [k]} d_i + \log^2(nk)/\log\log(nk)))$ using the graph connectivity data structure of [25]. It is not clear, however, that this data structure would have a practical impact. In fact, experimental work [11] suggests that data structures much simpler than HDT (and, therefore, than [25]) perform well in practice.
BuildAST can be modified to run in $O(nk \log^2(nk))$ time for profiles $\mathcal{P}$ where the input trees are all binary and solely leaf-labeled. For such profiles, $|A \cap \mathrm{Ch}_{T_i}(\pi_i)| \le 2$, for $A \in \Gamma$ and $i \in [k]$ in a position $\pi$ of $\mathcal{P}$. Labels $a, a' \in \mathrm{Ch}_{T_i}(\pi_i)$ are either in the same set $A$ or in different sets $A, A'$, where $A, A' \in \Gamma$. In the first case, $\ell \in \pi_i$ must be bad. Bad labels can then be detected earlier, in line 3, and directly removed from $S$. Thus, we can skip GetDecomposition's while loop. Hence, maintaining graph connectivity dominates the performance of BuildAST.
BuildAST enables users to deal with hard polytomies. In applications, we may encounter both hard and soft polytomies. It would be interesting to modify BuildAST to handle a mixture of both types of polytomies, as appropriate.
Proof of Lemma 11

For each position $\pi$ considered in line 3 of GetDecomposition, we only need to delete the edges from each new label $\ell$ in $\pi$ (and then delete $\ell$ itself). Therefore, each vertex and edge of $H_{\mathcal{P}}$ is deleted at most once, and the total number of vertex and edge deletions is $O(nk)$ over the entire execution of BuildAST, which, at $O(\log^2(nk))$ time per deletion, gives a total of $O(nk \log^2(nk))$ time.
Whenever an edge deletion splits a connected component, $\mathrm{Ch}_{\mathcal{P}}(\pi)$ is itself split, and we need to update the associated information. We can do so in $O(M_{\mathcal{P}} \log M_{\mathcal{P}})$ time by scanning the smaller of the two new connected components, as done in earlier papers [11,10]. We omit the details.
The while loop of lines 4-10 of GetDecomposition merges some of the components produced by line 3. These operations do not modify the display graph. We deal with these operations in Lemma 13. □

Proof of Lemma 12
To build sets $S$ and $K$ in line 2, we scan each $\pi_i$ in $\pi$ for each $i \in [k]$. Given $\ell \in \pi_i$, we update $\ell.\mathit{appear}$ with $i$ and test if $\ell$ is exposed. Suppose $\pi$ has a parent position $\pi^*$. Then, exposed label $\ell \in \pi_i$ is new if $\pi_i \neq \pi^*_i$. This step takes $O(k)$ time. Now, consider line 3. To find $\Gamma$, we scan each label $a \in \mathrm{Ch}_{T_i}(\pi_i)$ for each $i \in [k]$ and retrieve the set $A \in \Gamma$ that contains $a$ using the mapping from $\mathrm{Ch}_{\mathcal{P}}(\pi)$ to $\Gamma$. The entire process takes $O(\sum_{i \in [k]} d_i)$ time. □

Proof of Lemma 13
By Lemma 9, the while loop iterates $O(k)$ times. We complete the proof by showing that each iteration takes $O(\sum_{i \in [k]} d_i)$ time. We rely on an observation (Observation 1) that follows from the fact that, in line 3 of GetDecomposition, $H_{\mathcal{P}}(\pi) \setminus S$ is obtained by deleting at most $\sum_{i \in [k]} d_i$ edges from $H_{\mathcal{P}}(\pi)$.
For each set $A \in \Gamma$, we maintain a count, initialized to 0. By Observation 1, the total time to initialize the counts is $O(\sum_{i \in K} d_i)$ per iteration. To search for a bad label, for each $i \in K$, we scan each $a \in \mathrm{Ch}_{T_i}(\pi_i)$, and increase the count of the set $A$ to which $a$ belongs. If the count for any set $A \in \Gamma$ exceeds one, then $\ell \in \pi_i$ is a bad label and the search ends. Next, we consider the time taken by the body of the while loop. Retrieving $K' = \ell.\mathit{appear}$ in line 6 takes constant time. By Observation 1 and the fact that we have constant time access to mappings, building $\Gamma'$ in line 7 takes $O(\sum_{i \in K'} d_i)$ time.
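The counting scan for bad labels can be sketched as below. The names and the data layout are assumptions: `children_by_tree[i]` stands for the children of the label in position entry $i$, and `set_of[a]` is the constant-time mapping from a child label to an identifier of its set in the partition. The sketch keeps a fresh table of counts for each tree, which is one reading of the description (a label is bad when two of its children in the same tree fall into the same set).

```python
from typing import Dict, Hashable, Iterable, List, Optional

def find_bad_tree(K: Iterable[int],
                  children_by_tree: Dict[int, List[str]],
                  set_of: Dict[str, Hashable]) -> Optional[int]:
    """Return an index i in K whose position label is bad, or None."""
    for i in K:
        counts: Dict[Hashable, int] = {}    # counts for tree i only
        for a in children_by_tree.get(i, []):
            sid = set_of[a]                 # O(1) lookup of the set containing a
            counts[sid] = counts.get(sid, 0) + 1
            if counts[sid] > 1:             # two children of the label share a set
                return i
    return None

children_by_tree = {0: ["a", "b"], 1: ["c", "d"]}
print(find_bad_tree([0, 1], children_by_tree, {"a": 0, "b": 1, "c": 2, "d": 2}))  # 1
print(find_bad_tree([0, 1], children_by_tree, {"a": 0, "b": 1, "c": 2, "d": 3}))  # None
```

Each child label is examined once, so a scan costs time proportional to the number of children visited.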
We compute the union of the sets in $\Gamma'$ in line 8 as follows. We initialize $B$ to the empty set, and then successively consider each $A \in \Gamma'$. At each step, we append every child label $a$ from a non-empty entry in the representation of $A$ to the corresponding entry in $B$, and change the mapping of $a$ to $B$. Given our representation of the sets in $\Gamma$, this process takes $O(\sum_{i \in [k]} d_i)$ time in each iteration of the while loop. Updating $\Gamma$ in line 9 requires removing every $A \in \Gamma'$ from $\Gamma$ and then adding $B$. The time spent on updates is $O(|\Gamma'|)$, which is $O(\sum_{i \in K'} d_i)$. Finally, updating $S$ in line 10 takes constant time and updating $K$ takes $O(|K'|)$ time. □

Fig. 2: The display graph $H_{\mathcal{P}}$ for the profile of Figure 1a.

Lemma 11. The total time needed to maintain the display graph throughout the entire execution of BuildAST is $O(nk \log^2(nk))$.

In the following results, $d_i$ denotes the maximum number of children of a node in tree $\mathcal{T}_i$, for each $i \in [k]$.

Lemma 12. Excluding the time needed to maintain the display graph, lines 2 and 3 of GetDecomposition take $O(\sum_{i \in [k]} d_i)$ time.

Lemma 13. GetDecomposition's while loop takes $O(k \sum_{i \in [k]} d_i)$ time.

Theorem 2. BuildAST can be implemented to run in $O(nk(\sum_{i \in [k]} d_i + \log^2(nk)))$ time, where $n$ is the number of distinct taxa in $\mathcal{P}$, $k$ is the number of trees in $\mathcal{P}$, and $d_i$ is the maximum number of children of a node in tree $\mathcal{T}_i$, for $i \in [k]$.

Proof of Theorem 2

By Lemmas 12 and 13, lines 2-10 of GetDecomposition take $O(k \sum_{i \in [k]} d_i)$ time. By Theorem 1, GetDecomposition is invoked $O(n)$ times. Thus, lines 2-10 of GetDecomposition take $O(nk \sum_{i \in [k]} d_i)$ time over the entire execution of BuildAST. For each $A \in \Gamma$, the foreach loop of lines 12-19 obtains the corresponding successor position using Equation (1) in $O(k)$ time. Since BuildAST generates $O(n)$ positions, the total time spent on the foreach loop of lines 12-19 of GetDecomposition over the entire execution of BuildAST is $O(nk)$. To summarize, BuildAST takes $O(nk \sum_{i \in [k]} d_i)$ time to compute successors and, by Lemma 11, $O(nk \log^2(nk))$ time to maintain the display graph. The claimed time bound follows. □
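The accounting in this proof can be collected into a single display. This is only a restatement of the terms named above; the grouping and the braces are added for illustration.

```latex
\begin{align*}
\underbrace{O(n)}_{\text{calls to GetDecomposition}} \cdot
\underbrace{O\Bigl(k \sum_{i \in [k]} d_i\Bigr)}_{\text{lines 2--10 (Lemmas 12, 13)}}
\;+\; \underbrace{O(nk)}_{\text{foreach loop}}
\;+\; \underbrace{O\bigl(nk \log^2(nk)\bigr)}_{\text{display graph (Lemma 11)}}
\;=\; O\Bigl(nk\Bigl(\sum_{i \in [k]} d_i + \log^2(nk)\Bigr)\Bigr).
\end{align*}
```

The $O(nk)$ term for the foreach loop is dominated by the other two terms and does not appear in the final bound.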