Constructing perfect phylogenies and proper triangulations for three-state characters

In this paper, we study the problem of constructing perfect phylogenies for three-state characters. Our work builds on two recent results. The first result states that for three-state characters, the local condition of examining all subsets of three characters is sufficient to determine the global property of admitting a perfect phylogeny. The second result applies tools from minimal triangulation theory to the partition intersection graph to determine if a perfect phylogeny exists. Despite the wealth of combinatorial tools and algorithms stemming from the chordal graph and minimal triangulation literature, it is unclear how to use such approaches to efficiently construct a perfect phylogeny for three-state characters when the data admits one. We utilize structural properties of both the partition intersection graph and the original data in order to achieve a competitive time bound.


Background
In this paper, we study the problem of constructing phylogenies, or evolutionary trees, to describe ancestral relationships between a set of observed taxa. Each taxon is represented by a sequence and the evolutionary tree provides an explanation of branching patterns of mutation events transforming one sequence into another.
We will focus on the widely studied infinite sites model from population genetics, in which the mutation of any character can occur at most once in the phylogeny. Without recombination, the phylogeny is a tree called a perfect phylogeny. The problem of determining if a set of binary sequences fits the infinite sites model without recombination corresponds to determining if the data can be derived on a perfect phylogeny. A generalization of the infinite sites model is the infinite alleles model, in which any character can mutate multiple times but each mutation of the character must lead to a distinct allele (state). Again, without recombination, the phylogeny is a tree, called a multi-state perfect phylogeny. Correspondingly, the problem of determining if multi-state data fits the infinite alleles model without recombination corresponds to determining if the data can be derived on a multi-state perfect phylogeny.
*Correspondence: rsgysel@ucdavis.edu; flam@cs.ucdavis.edu. Department of Computer Science, University of California, Davis, 1 Shields Avenue, Davis CA 95616, USA.
Dress and Steel [1] and Kannan and Warnow [2] both give algorithms that construct a perfect phylogeny for three-state characters when one exists. The goal of this work is to extend the results in [3], using the minimal separators of the partition intersection graph to create a three-state construction algorithm that is competitive with Dress and Steel's algorithm.

Notation and prior results
The input to our problem is a set of $n$ taxa defined over a set of $m$ characters $C = \{\chi^1, \chi^2, \ldots, \chi^m\}$. We denote the states of character $\chi^i$ by $\chi^i_j$ for $0 \le j \le r-1$. A species is any sequence $s = s_1, s_2, \ldots, s_m$ with $s_i \in \{\chi^i_0, \chi^i_1, \ldots, \chi^i_{r-1}\} \cup \{*\}$ for $i = 1, 2, \ldots, m$. The $*$ denotes a missing value. A character $\chi^i$ can also be considered as a function mapping species to character states, writing $\chi^i(s) = s_i$. In this paper, every taxon is a species without missing values ($C$ is also called a set of full characters in the literature). We will consider the set of taxa as an $n \times m$ matrix $M$, where each row corresponds to a taxon and each column corresponds to a character (or site).
The perfect phylogeny problem is to determine whether the taxa defined by a matrix M can be displayed on a tree T such that
1. each taxon of M labels exactly one node in T,
2. each leaf in T is labeled by a taxon of M,
3. each node of T is labeled by a species,
4. for every character $\chi^i$ and for every state $\chi^i_j$ of character $\chi^i$, the set of all nodes in T labeled by species whose state of character $\chi^i$ is $\chi^i_j$ forms a connected subtree of T.
http://www.almob.org/content/7/1/26
Any tree satisfying conditions 1-4 is called a perfect phylogeny for M. Any character satisfying condition 4 is said to be compatible with T. The general perfect phylogeny problem (with no constraints on r, n, and m) is NP-complete [4,5]. However, the perfect phylogeny problem becomes polynomially solvable (in n and m) when r is fixed. For r = 2, this follows from the Splits Equivalence Theorem [6,7]. For r = 3, Dress and Steel gave an $O(nm^2)$ algorithm [1] and for r = 3 or 4, Kannan and Warnow gave an $O(n^2 m)$ algorithm [2]. For any fixed r, Agarwala and Fernández-Baca gave an $O(2^{3r}(nm^3 + m^4))$ algorithm [8], which was improved to $O(2^{2r} nm^2)$ by Kannan and Warnow [9].
The partition intersection graph G(M) of M has one vertex for each character state, and two vertices are adjacent if and only if some taxon of M has both states. Note that by definition, there are no edges in the partition intersection graph between states of the same character. It will be useful to consider the partition intersection graph $G(\chi^i, \chi^j, \chi^k)$ of the submatrix of M defined by the three characters $\chi^i, \chi^j, \chi^k$.
A graph is chordal if every cycle of length at least four has a chord. See [11] and [12] for further details on chordal graphs. Consider coloring the vertices of the partition intersection graph G(M) by colors 1, 2, ..., m as follows. For each character $\chi^i$, assign color i to the vertices $\chi^i_0, \chi^i_1, \ldots, \chi^i_{r-1}$. A pair of distinct vertices u, v of G(M) with the same color is called a monochromatic pair. A proper triangulation of the partition intersection graph G(M) is a chordal supergraph of G(M) such that every edge has endpoints with different colors. In [10], Buneman established the following fundamental connection between the perfect phylogeny problem and triangulations of the corresponding partition intersection graph.
Theorem 2.3. [10] M has a perfect phylogeny if and only if G(M) has a proper triangulation.
A triangulation of a graph G is minimal if it does not have a proper subgraph that is also a triangulation of G. Theorem 2.3 can be restated in terms of proper minimal triangulations of G(M) because removing edges from a proper triangulation cannot create an edge between two vertices of the same color.
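As a concrete illustration of the partition intersection graph, the following minimal sketch builds G(M) from a taxa-by-characters matrix, assuming the usual definition (one vertex per character state, an edge whenever some taxon has both states). The function name, the `(character, state)` vertex encoding, and the example matrix are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations

def partition_intersection_graph(M):
    """Return (vertices, edges) of G(M) for a taxa-by-characters matrix M.

    A vertex is a pair (i, s): character index i (which is also its color)
    and a state s that character i takes in M. Two vertices are adjacent
    when some row (taxon) of M has both states.
    """
    vertices, edges = set(), set()
    for row in M:
        states = list(enumerate(row))
        vertices.update(states)
        # Each taxon contributes a clique on its m character states.
        for u, v in combinations(states, 2):
            edges.add(frozenset((u, v)))
    return vertices, edges

# Hypothetical 4-taxon, 3-character input (not an example from the paper).
M = [(0, 0, 0),
     (0, 1, 1),
     (1, 1, 2),
     (2, 1, 2)]
V, E = partition_intersection_graph(M)
# No edge joins two states of the same character.
assert all(len({i for i, _ in e}) == 2 for e in E)
```

This direct construction visits every pair of entries in every row, so it takes time proportional to $nm^2$.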
If G(M) has a proper triangulation H, then a perfect phylogeny for M can be constructed from a clique tree of H. A tree T is a clique tree for a graph G if
1. the nodes of T are in bijection with the maximal cliques of G,
2. for each vertex v of G, the maximal cliques containing v form a connected subtree of T.
That is, given a clique tree T for a proper triangulation H of G(M), we label each node by its corresponding maximal clique. Because H is properly colored, this maximal clique includes at most one state per character and therefore defines a species. Each taxon t defines a clique $K_t$ of size m in G(M), and because H is a triangulation of G(M), $K_t$ is a clique in H as well. Furthermore, H is a proper triangulation, so $K_t$ is a maximal clique of H. For a clique tree T, we label the node corresponding to $K_t$ by t to obtain a perfect phylogeny for M. Conversely, if M has a perfect phylogeny T, then the species in T define a set of additional edges to obtain a proper triangulation for G(M). This is due to the following characterization of chordal graphs by the intersections of subtrees of a tree.

Theorem 2.4. [10,13] G is a chordal graph if and only if there is a tree T such that each vertex u of G induces a subtree $T_u$ of T and uv is an edge of G if and only if subtrees $T_u$ and $T_v$ share at least one node.
In particular, if a pair of character states appear in the same species of a perfect phylogeny for M but not in any input taxon of M, this pair defines a fill edge to add to obtain a proper triangulation of the partition intersection graph. This fill edge preserves the proper coloring because intersecting subtrees from the same character would contradict conditions 3 and 4 of the perfect phylogeny definition.
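The two ideas above, condition 4 of the perfect phylogeny definition and the fill edges contributed by species that are not input taxa, can be checked mechanically on small examples. The sketch below is one way to do so; the function names, the adjacency-dictionary tree encoding, and the example tree are assumptions made for illustration only.

```python
from itertools import combinations

def state_classes_connected(tree, label):
    """Condition 4: for each character i and state s, the nodes whose label
    has state s at position i must induce a connected subtree."""
    m = len(next(iter(label.values())))
    for i in range(m):
        for s in {lab[i] for lab in label.values()}:
            nodes = {x for x, lab in label.items() if lab[i] == s}
            # Depth-first search restricted to nodes carrying state s.
            start = next(iter(nodes))
            seen, stack = {start}, [start]
            while stack:
                x = stack.pop()
                for y in tree[x]:
                    if y in nodes and y not in seen:
                        seen.add(y)
                        stack.append(y)
            if seen != nodes:
                return False
    return True

def fill_edges(label, taxa):
    """State pairs that co-occur in some species of the tree but in no taxon."""
    def pairs(seqs):
        return {frozenset(p) for seq in seqs
                for p in combinations(list(enumerate(seq)), 2)}
    return pairs(label.values()) - pairs(taxa)

# Hypothetical example: four taxa plus one internal species x = (0, 1, 2).
taxa = [(0, 0, 0), (0, 1, 1), (1, 1, 2), (2, 1, 2)]
label = {"t1": taxa[0], "t2": taxa[1], "x": (0, 1, 2), "t3": taxa[2], "t4": taxa[3]}
tree = {"t1": ["t2"], "t2": ["t1", "x"], "x": ["t2", "t3", "t4"],
        "t3": ["x"], "t4": ["x"]}
```

In this example the internal species x pairs state 0 of the first character with state 2 of the third character, a combination that appears in no taxon, so it contributes exactly one fill edge.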
To illustrate some of these notions, consider the example in Figure 1. The species with sequence 2100 defines a fill edge $\chi^1_2 \chi^4_0$ which is not an edge of G(M) (this is the only such fill edge). Nevertheless G(M) itself is chordal, and adding this fill edge would result in a proper triangulation that is not minimal.
In recent work, it is shown that there is a complete description of minimal obstruction sets for three-state characters analogous to a well-known result on obstruction sets for binary characters (the four gamete condition) [3]. These results allow us to expand upon recent work of Gusfield [14] which uses properties of triangulations and minimal separators of partition intersection graphs to solve several problems related to multi-state perfect phylogenies.
Figure 1 (caption, partial): There are no species with missing values in T.
An (a,b)-separator of a graph G is a set of vertices whose removal from G separates a and b. A minimal (a,b)-separator is an (a,b)-separator such that no proper subset is an (a,b)-separator, and a minimal separator is a separator that is a minimal (a,b)-separator for some pair of vertices a and b. For a set of vertices X, let G − X be the induced subgraph of G after removing vertices X. If S and S′ are two minimal separators of G, we say S is parallel to S′ if there is a single connected component C of G − S′ such that S ⊆ C ∪ S′ (otherwise S and S′ cross). A pair of vertices a and b cross S if S is an (a,b)-separator. The neighborhood of a set of vertices X is N(X) = {v ∈ G − X : v is adjacent to some u ∈ X}. A connected component C of G − S is full if N(C) = S. The following characterization of minimal separators is critical to our arguments.

Lemma 2.5. [15] Let S be a subset of vertices of graph G. Then S is a minimal separator of G if and only if G-S has two or more full components.
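Lemma 2.5 gives a direct test for minimal separators, and the parallel relation can be tested from connected components alone. The sketch below assumes the standard definitions (a component C of G − S is full when N(C) = S; S is parallel to S′ when S lies inside C ∪ S′ for one component C of G − S′). The function names and the adjacency-dictionary encoding are illustrative assumptions.

```python
def components(adj, removed):
    """Connected components of the graph after deleting the vertex set `removed`."""
    seen, comps = set(removed), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def neighborhood(adj, X):
    """N(X): vertices outside X adjacent to at least one vertex of X."""
    return {v for u in X for v in adj[u]} - set(X)

def is_minimal_separator(adj, S):
    """Lemma 2.5: S is a minimal separator iff G - S has at least two full components."""
    S = set(S)
    full = [C for C in components(adj, S) if neighborhood(adj, C) == S]
    return len(full) >= 2

def are_parallel(adj, S, T):
    """S is parallel to T if S is contained in C | T for one component C of G - T."""
    S, T = set(S), set(T)
    return any(S <= (C | T) for C in components(adj, T))

# A chordless 4-cycle a-b-c-d-a: its two minimal separators cross.
C4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
```

On the 4-cycle, {a, c} and {b, d} are both minimal separators and they cross, which is why a maximal pairwise parallel set contains only one of them.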
In a colored graph, a legal separator is a separator in which no two vertices have the same color. Let $\Delta_G$ denote the set of minimal separators of graph G. For $S \in \Delta_G$, we saturate S by adding edges between every pair of vertices in S to create a clique. For $Q \subseteq \Delta_G$, $G_Q$ denotes the graph obtained by saturating every $S \in Q$. The following theorem shows the connection between minimal triangulations and collections of parallel minimal separators of a graph.
Theorem 2.6. (Minimal Triangulation Theorem [16][17][18]). Suppose $Q \subseteq \Delta_G$ is a maximal set of pairwise parallel minimal separators of G. Then $G_Q$ is a minimal triangulation of G and $\Delta_{G_Q} = Q$. Conversely, if H is a minimal triangulation of G, then $\Delta_H$ is a maximal pairwise parallel set of minimal separators of G.
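To make saturation concrete, the sketch below saturates a set of separators and then tests chordality with a maximum cardinality search: the graph is chordal exactly when, for every vertex, the neighbors visited before it form a clique. This is a simple quadratic-time check written for illustration, not the implementation analyzed in this paper; the function names and graph encoding are assumptions.

```python
from itertools import combinations

def saturate(adj, separators):
    """Return a copy of the graph in which every separator is made a clique."""
    new = {v: set(nbrs) for v, nbrs in adj.items()}
    for S in separators:
        for u, v in combinations(S, 2):
            new[u].add(v)
            new[v].add(u)
    return new

def is_chordal(adj):
    """Maximum cardinality search, checking that the earlier-visited
    neighbors of each vertex form a clique (perfect elimination ordering)."""
    weight = {v: 0 for v in adj}
    visited = []
    while weight:
        # Unvisited vertex with the most already-visited neighbors.
        v = max(weight, key=weight.get)
        del weight[v]
        earlier = [u for u in visited if u in adj[v]]
        if any(b not in adj[a] for a, b in combinations(earlier, 2)):
            return False
        visited.append(v)
        for u in adj[v]:
            if u in weight:
                weight[u] += 1
    return True

# Chordless 4-cycle; {a, c} alone is a maximal pairwise parallel set.
C4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
H = saturate(C4, [{"a", "c"}])
```

Saturating {a, c} adds the single chord ac, which is a minimal triangulation of the 4-cycle, in line with Theorem 2.6.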
The following are necessary and sufficient conditions for the existence of a perfect phylogeny for data over an arbitrary number of states. We refer the reader to [14] for the proofs.
For the special case of input M with characters over three states (r = 3), the partition intersection graph has additional structure, and the following theorems give necessary and sufficient conditions for the existence of a perfect phylogeny for M [3].

Theorem 2.9. [3] Given an input set M with at most three states per character (r ≤ 3), M admits a perfect phylogeny if and only if every subset of three characters of M admits a perfect phylogeny.
Furthermore, there is an explicit description of all minimal obstruction sets to the existence of a perfect phylogeny. This complete characterization of minimal obstruction sets allows us to simplify Theorem 2.8 in the case r = 3.

Theorem 2.11. [3] For input M on at most three states per character (r ≤ 3), there is a three-state perfect phylogeny for M if and only if the partition intersection graph for every pair of characters is acyclic and every monochromatic pair of vertices in G(M) is separated by a legal minimal separator.
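The first condition of Theorem 2.11, that the partition intersection graph of every pair of characters is acyclic, is easy to test with a union-find structure: an edge whose endpoints are already connected closes a cycle. The sketch below checks only this first condition (it does not test separation of monochromatic pairs); the function names are illustrative assumptions.

```python
from itertools import combinations

def pair_graph_is_acyclic(M, i, j):
    """True if the partition intersection graph of columns i and j is a forest."""
    edges = {((i, row[i]), (j, row[j])) for row in M}
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            # u and v are already connected, so this edge closes a cycle.
            return False
        parent[ru] = rv
    return True

def all_pairs_acyclic(M):
    """Check the pairwise condition for every pair of columns of M."""
    m = len(M[0])
    return all(pair_graph_is_acyclic(M, i, j)
               for i, j in combinations(range(m), 2))
```

For binary characters, a pair of columns containing all four gametes 00, 01, 10, 11 yields a 4-cycle and fails this test.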
Theorem 2.11 shows that the requirement of Theorem MSPN that the legal minimal separators in Q be pairwise parallel can be removed for the case of input data over three-state characters. The condition in Theorem 2.11 that the input is over three-state characters is necessary, as there are examples showing that the theorem does not extend to data with four-state characters.
All of the legal minimal separators for three-state input can be found in $O(nm^2)$ time, and the check that each monochromatic pair is separated by a legal minimal separator can be performed while generating the legal minimal separators (see Section "Proper triangulation algorithm"). Therefore, the 3-state perfect phylogeny decision problem can be solved in $O(nm^2)$ time using minimal separators. However, it is not clear how minimal separators can be used to solve the construction problem in a similar time bound. In [14], Gusfield used the minimal separator approach and integer linear programming methods to solve both the decision and construction problem for k-state perfect phylogeny. Since integer linear programming methods in general do not have polynomial time bounds, this naturally leads to the following question: is there an $O(nm^2)$ algorithm for the construction problem for 3-state perfect phylogeny using the separator approach? In this paper, we answer in the affirmative, and show that any algorithm which explicitly computes the partition intersection graph has a time bound of at least [...]
We first study the structure of separators in the partition intersection graph for 3-state input with the goal of answering this question. We begin by stating two lemmas from [3].
Lemma 2.12. (Lemma 3.4 [3]). Let M be a set of input taxa with at most three states per character, and consider any three characters $\chi^i, \chi^j, \chi^k$ in M. If the partition intersection graph $G(\chi^i, \chi^j, \chi^k)$ is properly triangulatable, then the only possible chordless cycles in $G(\chi^i, \chi^j, \chi^k)$ are chordless 4-cycles, with two colors appearing once and the remaining color appearing twice.
Figure 2 Minimal obstruction sets. Minimal obstruction sets for three-state characters up to relabeling. The boxes highlight the input entries that are identical for three of the obstruction sets.
Lemma 2.12 implies that if a subset of three characters

Structure of separators
In this section, our goal is to study the relationship between minimal separators in G(M) and G (M) when M is a set of taxa over 3-state characters. Our ultimate goal is to show that it suffices to consider only the legal minimal separators of G(M) while disregarding the illegal minimal separators. We first prove the following theorem on the separator structure of G (M). In order to use techniques in [14], the goal of our next two results will be to describe the relationship between the minimal separators of G (M) and the legal minimal separators of G(M) when M has a perfect phylogeny. We now prove the main result of this section.

Theorem 3.5. Suppose M is a set of taxa on 3-state characters. Then M has a perfect phylogeny if and only if any maximal pairwise parallel set of legal minimal separators Q of G(M) induces a proper minimal triangulation $G(M)_Q$ of G(M).
Proof. First, suppose that M has a perfect phylogeny, and let Q be a maximal pairwise parallel set of legal minimal separators of G(M). We show that $G(M)_Q$ is a proper triangulation of G(M). By

Proper triangulation algorithm
In this section, we build on techniques developed in [14] to generate the minimal separators of G (M) and their parallel relations in $O(nm^2)$ time. This will allow us to use a greedy approach to pick a maximal pairwise parallel set of legal minimal separators. These minimal separators will then define a set of fill edges for a proper minimal triangulation, and a perfect phylogeny will be constructed in the form of a clique tree using Maximum Cardinality Search (MCS).
[...]
5. [...]; stop if Q has more than 2n − 3 minimal separators.
6. Add edges to G (M) to make each S ∈ Q a clique. Call this graph $G_Q$.
7. Use MCS to construct a clique tree for $G_Q$.
We proceed with a series of lemmas that will be used in Theorem 4.11 to show that each step is $O(nm^2)$. The following simple observation is important for many of our time bounds.

Observation 4.2. Let M be a set of taxa whose characters have at most three states. Then G(M) has O(m) vertices (one vertex per state of each character) and $O(m^2)$ edges.
Step 2 of the algorithm uses concepts from [2,8,9,14], which we detail here for completeness. A proper cluster is a bipartition of the taxa (i.e., the taxa are split into two disjoint nonempty sets) such that each character shares at most one state across the bipartition, and at least one character is not shared across this bipartition [8,9]. There are O(m) proper clusters when r is fixed. In particular, suppose $\chi$ is not shared across the bipartition of a proper cluster. Then the proper cluster also creates a bipartition of $\chi$'s character states (see Figure 3). Hence, we can compute the set of proper clusters by exhaustively checking, for each character, whether some bipartition of its states splits the taxa into a proper cluster (there are $O(2^r)$ ways to split each character).
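The exhaustive search described above can be sketched as follows: for each character, try every split of its states into two non-empty parts, split the taxa accordingly, and keep the bipartition if every character shares at most one state across it and at least one character shares none. The function names and the representation of taxa by row indices are illustrative assumptions; no attempt is made here to match the time bounds discussed in the text.

```python
from itertools import combinations

def is_proper(M, A, B):
    """Each character shares at most one state across (A, B), and at least
    one character shares no state."""
    m = len(M[0])
    shared = [{M[t][i] for t in A} & {M[t][i] for t in B} for i in range(m)]
    return all(len(s) <= 1 for s in shared) and any(len(s) == 0 for s in shared)

def proper_clusters(M):
    """Enumerate proper clusters as unordered pairs of sets of row indices."""
    n, m = len(M), len(M[0])
    found = set()
    for c in range(m):
        states = sorted({row[c] for row in M})
        rest = states[1:]
        # Fixing states[0] on side A avoids listing each split twice;
        # k < len(rest) keeps the other side non-empty.
        for k in range(len(rest)):
            for extra in combinations(rest, k):
                side = {states[0], *extra}
                A = frozenset(t for t in range(n) if M[t][c] in side)
                B = frozenset(range(n)) - A
                if is_proper(M, A, B):
                    found.add(frozenset((A, B)))
    return found
```

With r states per character, each character contributes at most $2^{r-1} - 1$ candidate splits, which is where the O(m) count for fixed r comes from.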
Proper clusters generate the minimal separators in $\Delta^*_{G(M)}$ as follows [14]. For a connected component C of G(M) − S, let t(C) be the set of taxa with character-state $\chi^i_j$ for at least one $\chi^i_j \in C$. We will refer to the set of t(C) determined by the connected components of G(M) − S as the S-partition of the taxa. Recall S has at most m − 1 vertices by Lemma 4.1, so every taxon must have a character-state that is not a vertex of S. Hence no taxon can have all of its character-states as vertices of S. Additionally, each taxon defines a clique, so it cannot have vertices in more than one connected component of G(M) − S (this would define an edge between connected components). By Lemma 2.5, G(M) − S has two or more full components; let $C_1$ and $C_2$ be two of them. Place $t(C_1)$ and $t(C_2)$ in separate parts of the bipartition, then for the remaining connected components C of G(M) − S add t(C) to either part. This defines a bipartition where the shared character states (known as the splitting vector [9]) are exactly the vertices of S. To see this, suppose a character-state $\chi^i_j$ is a vertex of S. Because $C_1$ is a full component, there is a vertex $\chi^{i_0}_{j_0} \in C_1$ adjacent to $\chi^i_j$. Because these vertices are adjacent, $\chi^{i_0}_{j_0}$ and $\chi^i_j$ appear in the same row of M, which in turn is a taxon $t_1$ of $t(C_1)$. Similarly, there exists $t_2 \in t(C_2)$ such that $\chi^i(t_2) = j$, so $\chi^i_j$ is shared in the bipartition. See Figure 3 for an illustration of these concepts. This implies that $|\Delta^*_{G(M)}| = O(m)$. The following two lemmas are special cases of those found in [14].
[...] [11,21]. In this sense, legal minimal separators are analogous to splitting vectors.
[...] and let $S_x$ be the vertices of G(M) appearing as character-states in x. Define the equivalence relation g/x by the transitive closure of the relation tRt′ if and only if there is a character $\chi^i$ where $\chi^i(t) = \chi^i(t') = j$ and $\chi^i_j$ is not a shared character state in x; calculating g/x takes O(nm) time [9].
Given an equivalence class [t] of g/x, the vertices [...]
Before discussing the running time required to compute crossing relations, we first state two structural lemmas on minimal separators; the second follows from a lemma in [19].
[...]
Because of the slight change from Lemma 3.10 in [19] and for completeness, we give a proof of Lemma 4.6.
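The equivalence relation g/x described above can be computed with a union-find structure in a single pass over the matrix: within each column, all taxa that have the same non-shared state are merged. The sketch below assumes x is given as a list with one entry per character, holding the shared state or `None` when the character shares no state; this encoding and the function name are assumptions made for illustration.

```python
def classes_mod_x(M, x):
    """Equivalence classes of taxa (row indices) under g/x.

    x[i] is the state of character i that is shared across the bipartition,
    or None if character i shares no state (an assumed encoding).
    """
    n = len(M)
    parent = list(range(n))

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path halving
            t = parent[t]
        return t

    for i in range(len(x)):
        first = {}                          # state -> first taxon seen with it
        for t in range(n):
            s = M[t][i]
            if s == x[i]:
                continue                    # shared states do not relate taxa
            if s in first:
                parent[find(t)] = find(first[s])
            else:
                first[s] = t

    groups = {}
    for t in range(n):
        groups.setdefault(find(t), set()).add(t)
    return sorted(groups.values(), key=min)
```

Each matrix entry is inspected once and triggers at most one union, so the running time is close to linear in nm.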
Proof. Suppose S and S′ are parallel. Since S is a minimal separator, there are at least two full components in G − S and because S′ is parallel to S, there is a full component [...]
Now, suppose there are $C_S$ and $C_{S'}$ satisfying the conditions of the lemma. Then $S \subseteq N(C_S) \subseteq C_{S'} \cup N(C_{S'}) \subseteq C_{S'} \cup S'$, implying that S and S′ are parallel.
[...]
Conversely, assume that S and S′ are not parallel. Let $C_1$ be a full component of G(M) − S and $C_2$ be a full component of G(M) − S′. By Lemma 4.5, there is a vertex $v \in C_1 \cap S'$, and because $C_2$ is full, there is a $u \in C_2 \cap N(v)$. The taxa form an edge clique cover for G(M), so there is a taxon t having both character states corresponding to u and v. Note $v \in C_1$ so $t \in t(C_1)$ and $u \in C_2$ so $t \in t(C_2)$. S has at least two full components, and repeating this argument yields another full component [...]
[...] [20]) was incorrect, as demonstrated in Figure 4. We present a corrected bound for the number of minimal separators in the following lemma.
[...] make this correspondence explicit, for each node x of T we will write $K_x$ to mean the maximal clique of H that corresponds to x. A classic result in chordal graph theory says that if $S \in \Delta_H$, there is an edge xy of T such that $S = K_x \cap K_y$ [11,21]. Therefore the number of minimal separators in H is at most the number of edges of T.
First, consider any leaf a of T. We claim that $K_a$ contains a vertex of G that is not in any other maximal clique of H (this fact is well known in the chordal graph literature [22], but we prove it here for completeness). Suppose a′ is the neighbor of a in T. By maximality, $K_a \not\subseteq K_{a'}$, so there is a vertex v of H that is contained in $K_a$ but not contained in $K_{a'}$. If v is contained in a maximal clique of H that is not $K_a$, then the second property of clique trees implies that $v \in K_{a'}$ as well, a contradiction. Hence v is only contained in $K_a$, proving the claim. Further, v is some character-state $\chi^i_j$, and there is a taxon t of M such that $\chi^i(t) = j$. Taxon t can only label a because no other node of T corresponds to a maximal clique that contains $\chi^i_j$. Thus for each leaf of T there is a unique taxon that labels it.
To complete the proof, we show a similar result for internal nodes of T with degree two. Let z be such a node with neighbors $z_1$ and $z_2$. If z's maximal clique $K_z$ contains a vertex that is contained in no other maximal clique, our previous argument shows that z can be labeled by a unique taxon. Suppose this is not the case. Let $S_i = K_z \cap K_{z_i}$ for i = 1, 2. It must be that $K_z = S_1 \cup S_2$ because we are considering the case when $K_z$ does not contain a unique vertex. Further, we cannot have $S_1 \subseteq S_2$ since otherwise $K_z = S_2 \subseteq K_{z_2}$ would not be maximal. Similarly, $S_2 \not\subseteq S_1$. Pick $u_1 \in S_1 - S_2$ and $u_2 \in S_2 - S_1$, noting that $u_1 \notin K_{z_2}$ and $u_2 \notin K_{z_1}$. We argue that $K_z$ is the only maximal clique containing both $u_1$ and $u_2$. This is because if any other maximal clique K contains both vertices, then either $K_{z_1}$ or $K_{z_2}$ is on the path from $K_z$ to K in T (z has degree two) and by the second property of clique trees, this maximal clique also contains both vertices, a contradiction. Further, because each $S \in \Delta_H$ is of the form $S = K_x \cap K_y$ for an edge xy of T, there is no minimal separator of H containing both $u_1$ and $u_2$. By Theorem 2.6, $u_1 u_2$ is an edge of G(M) (i.e., it is not a fill edge) because H is a minimal triangulation of G(M), so all fill edges come from saturating each $S \in \Delta_H$. Writing $u_1 = \chi^{i_1}_{j_1}$ and $u_2 = \chi^{i_2}_{j_2}$, there is therefore a taxon t′ of M such that $\chi^{i_1}(t') = j_1$ and $\chi^{i_2}(t') = j_2$. As in the unique vertex case, z is the unique node with label t′.
Therefore any node of T with degree at most two is labeled by a unique taxon, implying there are at most n such nodes. Any tree containing at most n leaves and internal nodes of degree two has at most 2n − 3 edges. Hence T has at most 2n − 3 edges, and in turn H has at most 2n − 3 minimal separators, proving the bound.
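The counting step behind the $2n - 3$ bound can be spelled out with a short degree-sum argument. In the sketch below, $\ell$, $d$, and $b$ are names introduced here for the number of leaves, degree-two nodes, and nodes of degree at least three in T, with $\ell + d \le n$ and T having at least one edge.

```latex
\begin{align*}
|E(T)| &= \ell + d + b - 1, \\
2\,|E(T)| &= \sum_{x \in V(T)} \deg(x) \;\ge\; \ell + 2d + 3b
  \;\Longrightarrow\; b \le \ell - 2, \\
|E(T)| &\le 2\ell + d - 3 \;\le\; 2(\ell + d) - 3 \;\le\; 2n - 3 .
\end{align*}
```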
Remark. The proof of Lemma 4.8 requires minimality of the triangulation, but it does not require that M lacks missing values or that the number of states for each character is bounded. This lemma, along with the fact that each $S \in \Delta^*_{G(M)}$ has fewer than m vertices, gives the following result.
[...]
Combining these lemmas shows that our minimal separator algorithm for constructing perfect phylogenies for r = 3 is competitive with the algorithm of Dress and Steel [1], giving our main result.

Large partition intersection graphs
Ideally, one would like to find an $O(n^2 m)$ or $O(nm)$ algorithm for 3-state perfect phylogeny (i.e., one whose time bound has no $m^2$ term). In this section, we will construct a family of 3-state matrices M that have a perfect phylogeny and $\Omega(m^2)$ edges in G(M). This discourages attempts to improve our time bound using an approach that explicitly computes the partition intersection graph.
Any 3-state character compatible with a perfect phylogeny can be obtained by choosing any two edges of the phylogeny, removing them, and using the three resulting subtrees to define each taxon's state for that character. 2-state characters are obtained in a similar manner, removing a single edge instead of two edges. Therefore, if a 3-state matrix M with distinct columns (up to relabeling) has a perfect phylogeny, $m = O\left(\binom{n}{2}\right) = O(n^2)$. Consider the tree T with taxa $t_1, t_2, \ldots, t_n$ as depicted in Figure 5, and suppose i < j. We construct the character $\chi^{(i,j)}$ using the partition $\{t_1, t_2, \ldots, t_i\}$, $\{t_{i+1}, t_{i+2}, \ldots, t_j\}$, $\{t_{j+1}, t_{j+2}, \ldots, t_n\}$ as in Figure 5. These sets are called cell 0, cell 1, and cell 2 of $\chi^{(i,j)}$, respectively. That is, $\chi^{(i,j)}(t_1) = 0$, $\chi^{(i,j)}(t_{i+1}) = 1$, $\chi^{(i,j)}(t_{j+1}) = 2$, and so on. Let $M^*$ be the matrix whose columns are the characters $\chi^{(i,j)}$ for $1 \le i < j < n$. T is clearly a perfect phylogeny for $M^*$, and $m = \binom{n-1}{2} = \Theta(n^2)$. Next, we show that $G(M^*)$ has $\Omega(m^2)$ edges.
Figure 5 Characters of a perfect phylogeny with a large partition intersection graph. A 3-state character created using "intervals" of taxa from a fully resolved tree T. The 0th piece of $\chi^{(i,j)}$ is the interval [...]
Observation 5.1. Let $\chi^{(i,j)}$ and $\chi^{(i',j')}$ be distinct characters of $M^*$. Then $\chi^{(i,j)}_k \chi^{(i',j')}_k$ is an edge of $G(M^*)$ if and only if cell k of $\chi^{(i,j)}$ and cell k of $\chi^{(i',j')}$ have a non-empty intersection (i.e., share a taxon).
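The interval construction can be reproduced for small n to see how quickly the edge count grows. The sketch below builds the matrix with one column per pair $1 \le i < j < n$ and counts the edges of its partition intersection graph directly; the function names are illustrative assumptions. The sanity check it supports uses only the fact that $t_1$ has state 0 and $t_n$ has state 2 in every character, so these two rows alone contribute $m(m-1)$ edges; this is a simple lower bound and not necessarily the counting argument used in this section.

```python
from itertools import combinations
from math import comb

def interval_matrix(n):
    """Rows are taxa t_1..t_n; columns are the characters chi^(i,j), 1 <= i < j < n."""
    chars = list(combinations(range(1, n), 2))   # pairs (i, j) with i < j <= n - 1

    def state(t, i, j):
        # Cell 0 is {t_1..t_i}, cell 1 is {t_(i+1)..t_j}, cell 2 is the rest.
        return 0 if t <= i else (1 if t <= j else 2)

    M = [[state(t, i, j) for (i, j) in chars] for t in range(1, n + 1)]
    return M, chars

def count_edges(M):
    """Number of edges of the partition intersection graph of M."""
    edges = set()
    for row in M:
        # Pairs are generated with increasing column index, so each edge
        # has a single representation.
        for u, v in combinations(list(enumerate(row)), 2):
            edges.add((u, v))
    return len(edges)

M_star, chars = interval_matrix(6)
m = len(chars)
```

For n = 6 this gives m = 10 characters, and the edge count is at least m(m − 1) = 90.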

Conclusions
We have demonstrated how to use the minimal separator approach introduced in [14] to construct a perfect phylogeny for 3-state data in $O(nm^2)$ time. We also constructed a family of 3-state matrices M with a perfect phylogeny whose partition intersection graph G(M) has $\Omega(m^2)$ edges. Thus, any explicit analysis of the edges of G(M) or of a proper triangulation of G(M) is inadequate to speed up our approach. Faster proper triangulation algorithms should compute on M directly instead of on G(M), aided by theoretical results about G(M). Constructing tree representations in order to minimally triangulate a graph without explicitly computing the fill edges was studied in [19] in order to achieve a faster time bound, and it would be interesting to see if these ideas can be extended to find a faster construction algorithm for 3-state perfect phylogeny.