The space of phylogenetic mixtures for equivariant models
Algorithms for Molecular Biology volume 7, Article number: 33 (2012)
Abstract
Background
The selection of an evolutionary model to best fit given molecular data is usually a heuristic choice. In his seminal book, J. Felsenstein suggested that certain linear equations satisfied by the expected probabilities of patterns observed at the leaves of a phylogenetic tree could be used for model selection. It remained an open question, however, whether these equations were sufficient to fully characterize the evolutionary model under consideration.
Results
Here we prove that, for most equivariant models of evolution, the space of distributions satisfying these linear equations coincides with the space of distributions arising from mixtures of trees. In other words, we prove that the evolution of an observed multiple sequence alignment can be modeled by a mixture of phylogenetic trees under an equivariant evolutionary model if and only if the distribution of patterns at its columns satisfies the linear equations mentioned above. Moreover, we provide a set of linearly independent equations defining this space of phylogenetic mixtures for each equivariant model and for any number of taxa. Lastly, we use these results to perform a study of identifiability of phylogenetic mixtures.
Conclusions
The space of phylogenetic mixtures under equivariant models is a linear space that fully characterizes the evolutionary model. We provide an explicit algorithm to obtain the equations defining these spaces for a number of models and taxa. Its implementation has proved to be a powerful tool for model selection.
Background
The principal goal of phylogenetics is to reconstruct the ancestral relationships among organisms. Most popular phylogenetic reconstruction methods are based on mathematical models describing the molecular evolution of DNA. In spite of this, there exists no unified framework for model selection and the results are highly dependent on the models and methods used in the analysis (cf. [1]).
In this paper we assume the Darwinian model of evolution proceeding along phylogenetic trees and address the following question: how can the data evolving under a particular model be characterized? In other words, we look for invariants of the DNA patterns which have evolved following a tree (or a mixture of trees, as we will see below) under a particular model. The answer to this question provided in this paper leads to a complete characterization of the evolutionary model and to a novel model selection tool, which is valid for any mixture of trees.
In what follows, we briefly explain the motivation for this work. It has been shown that if the evolution along a phylogenetic tree is described by a particular model, the expected probabilities of nucleotide patterns at the leaves of the tree satisfy certain equalities (see e.g. [2], p.375). Several authors (e.g. [2–4]) pointed out that these equalities could potentially be used to test the fit of the model of base change. The full set of equations required for viable model selection, however, was unknown. The objective of this work is to fill this gap and to go a step further towards practical application by providing an algorithm to compute the required invariants for model selection.
In this work we consider a family of equivariant models ([5, 6]). These models are Markov processes on trees, whose transition matrices satisfy certain symmetries: the Jukes-Cantor model, the Kimura 2- and 3-parameter models, the strand symmetric model, and the general Markov model. Our first important result, Theorem 17, states that if evolution occurs according to trees (or even mixtures of trees) under these equivariant models, then the model of evolution is completely determined by the linear space defined by the aforementioned equalities. By exhaustively studying the group of symmetries of these models, we also give a straightforward combinatorial way of determining the equations of this linear space (see Theorem 22). The implementation of the algorithm producing the equations is available as a package SPIn ([7], http://genome.crg.es/cgi-bin/phylo_mod_sel/AlgModelSelection.pl), which has proved to be a successful tool in evolutionary model selection.
Our main technique consists in proving that the linear space above coincides with the space of phylogenetic mixtures evolving under the model, i.e. the set of points that are linear combinations of points lying in the phylogenetic varieties (see Preliminaries section for specific definitions). In biological words and in the stochastic context, this is the set of vectors of expected pattern frequencies for mixtures of trees evolving under the model (not necessarily the same tree topology in the mixture, and not necessarily the same transition matrices when the tree topologies coincide). In phylogenetics, the so-called i.i.d. hypothesis (independent and identically distributed) about the sites of an alignment is prevalent in the simplest models. When the assumption “identically distributed” is replaced by “distributed according to the same evolutionary model”, one obtains a phylogenetic mixture.
Phylogenetic mixtures are useful in modeling heterogeneous evolutionary processes, e.g. data comprising multiple genes, selected codon positions, or rate variation across sites (e.g. [8]). Among a plethora of applications, they are used in orthology predictions, gene and genome annotations, species tree reconstructions, and drug target identifications.
In addition to the main result, we determine the dimension of these linear spaces and use it to give an upper bound, h0 (n), on the number of mixtures that should be used in phylogenetic reconstruction on n taxa. This relates to the so-called identifiability problem in phylogenetic mixtures, which can be posed as determining the conditions that guarantee that the model parameters (discrete parameters in the form of tree topologies and the continuous parameters of the root and model distributions) can be recovered from the data. Identifiability is crucial for consistency of the maximum likelihood approaches and, though extensively studied in the phylogenetic context, few results are known (see for instance [9–13]).
In brief, in Theorem 30 we prove that either the tree topologies or the continuous parameters are not generically identifiable for mixtures on more than h0(n) trees under equivariant models. Here h0(n) is the quotient of the dimension of the linear space (computed in Proposition 20) by the number of free parameters of the model on a trivalent tree plus one. For example, for four taxa and the Jukes-Cantor model (resp. the Kimura 3-parameter model) this result proves that mixtures on three (resp. four) or more trees are not identifiable (i.e. either the discrete or the continuous parameters cannot be fully identified). A detailed discussion on this subject is provided in the last section.
The main tools used in this work are algebraic geometry and group theory. The reader is referred to [14, 15] for general references on these topics.
Main text
Preliminaries
Phylogenetic trees and Markov models of evolution have been widely used in the literature. In what follows we fix the notation needed to deal with them in our setting.
Let n be a positive integer and denote by [n] the set {1,2,…,n}. A phylogenetic tree T on the set of taxa [n] is a tree (i.e. a connected graph with no loops), whose n leaves are bijectively labeled by [n]. Its vertices represent species or other biological entities and its edges represent evolutionary processes between the vertices.
We allow internal vertices of any degree and if all the internal vertices are of degree 3 we say that the tree is trivalent. We will denote the set of vertices of T by N(T), the set of edges by E(T), and the set of interior nodes by Int(T). A rooted tree is a tree together with a distinguished node r called the root. The root induces an orientation on the edges of T, whereby the root represents the common ancestor to all the species represented in the tree. If e is an edge of a rooted tree T, we write pa(e) and ch(e) for its parent vertex (origin) and its child vertex (end), respectively. Two unrooted phylogenetic trees on the set of taxa [n] are said to have the same tree topology if their labeled graphs have the same topology.
We fix a positive integer k and an ordered set $B = \{b_1, b_2, \dots, b_k\}$. For example, for most applications we take B = {A,C,G,T} to be the set of nucleotides in a DNA sequence. We may think of B as the set of states of a discrete random variable. We call W the complex vector space spanned by B, so that B is a natural basis of W. For algebraic convenience, we usually work over the complex field and restrict to the stochastic setting when necessary. Vectors in W are thought of as probability distributions on the set of states B if their coordinates are non-negative and sum to one. In this setting the vector $\sum_{i} c_i b_i$ means that observation $b_i$ occurs with probability $c_i$. From now on, we will identify vectors in W with their coordinates in the basis B written as a column vector, e.g. we identify $\sum_{i} b_i$ with the vector $\mathbf{1} = (1,1,\dots,1)^t \in W$.
In order to model molecular evolution on a phylogenetic tree T, we consider a Markov process specified by a root distribution, Π ∈ W, and a collection of transition matrices, $A = (A^e)_{e \in E(T)}$, where each $A^e$ is a k × k matrix in End(W). The matrices $A^e$ represent the conditional probabilities of substitution between the states in B from the parent node pa(e) to the child node ch(e) of e. We adopt the convention that the matrices $A^e$ act on W from the right, i.e. a vector $\omega^t$ in pa(e) maps to $\omega^t A^e$ in ch(e).
Distinct forms of the transition matrices give rise to different evolutionary models. Using the terminology introduced above, we proceed to the definition of evolutionary models used throughout this work.
Definition 1. An (algebraic) evolutionary model is specified by giving a vector subspace $W_0 \subset W$ such that $\mathbf{1}^t\Pi \neq 0$ for some Π in $W_0$, together with a multiplicatively closed vector subspace Mod (for model) of End(W) containing the identity matrix. We will usually denote such a model by $\mathcal{M} = (W_0, \mathrm{Mod})$. We define the stochastic evolutionary model associated to $\mathcal{M}$ by taking $sW_0 = \{\Pi \in W_0 : \mathbf{1}^t\Pi = 1\}$ and $s\mathrm{Mod} = \{A \in \mathrm{Mod} : A\mathbf{1} = \mathbf{1}\}$. The term “stochastic” refers to the fact that, by restricting to the points in the spaces with non-negative real entries, we obtain distributions and Markov matrices. A phylogenetic tree T together with the parameters Π and $A = (A^e)_{e \in E(T)}$ is said to evolve under the algebraic evolutionary model $\mathcal{M}$ if $\Pi \in W_0$ and all matrices $A^e$ lie in Mod.
Remark 2. Note that $sW_0$ and sMod are not vector spaces. The condition $\mathbf{1}^t\Pi \neq 0$ in the above definition means that the sum of the coordinates of Π is not zero. Since vectors in $sW_0$ with non-negative coordinates represent the probability distributions for the set of observations B, this condition implies no restriction from a biological point of view. Moreover, it ensures that $sW_0$ has dimension equal to dim($W_0$) − 1. In particular, the simplex of stochastic vectors in $W_0$ will form a semialgebraic set of dimension equal to dim($W_0$) − 1 (as expected).
Remark 3. The subspace Mod of substitution matrices is usually required to be multiplicatively closed (as in the definition above) so that when two evolutionary processes are concatenated, the final process is of the same kind. The importance of this requirement is the starting point of [16], where a different approach to the definition of “evolutionary model” is provided.
Our definition of evolutionary models includes most of the well-known evolutionary models, namely those given in [17] and the equivariant models (see [5, 6]).
Example 4. Let G be a permutation group of B, that is, a group whose elements are permutations of the set B. Given g ∈ G, write $P_g$ for the k × k permutation matrix corresponding to g: $(P_g)_{i,j} = 1$ if g(j) = i and 0 otherwise. The G-equivariant evolutionary model is defined by taking Mod equal to
$$M(G) = \{A \in \mathrm{End}(W) \mid P_g A P_g^{-1} = A \ \text{for all } g \in G\}$$
and $W_0 = \{\Pi \in W \mid P_g \Pi = \Pi \ \text{for all } g \in G\}$. These subsets are vector subspaces of End(W) and W, respectively. Moreover, if $A_1, A_2 \in M(G)$, then
$$P_g A_1 A_2 P_g^{-1} = (P_g A_1 P_g^{-1})(P_g A_2 P_g^{-1}) = A_1 A_2$$
and $A_1 A_2 \in M(G)$. Therefore, equivariant models provide a wide family of examples of algebraic evolutionary models in the sense of Definition 1. For example, if B = {A,C,G,T}, it can be seen that the algebraic versions of the Jukes-Cantor model [18], the Kimura models with 2 or 3 parameters [19, 20], the strand symmetric model [21] or the general Markov model [22] are instances of equivariant models:
- if G is the full symmetric group on B, then the G-equivariant model is the algebraic Jukes-Cantor model JC69,
- if G = 〈(A C G T),(A G)〉, then the G-equivariant model is the algebraic Kimura 2-parameter model K80,
- if G = 〈(A C)(G T),(A G)(C T)〉, then the G-equivariant model is the algebraic Kimura 3-parameter model K81,
- if G = 〈(A T)(C G)〉, then the G-equivariant model is known as the strand symmetric model SSM, and
- if G = 〈e〉, then the G-equivariant model is the general Markov model GMM.
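The equivariance condition of Example 4 can be checked by brute force on small matrices. The sketch below is only illustrative: the helper names (`perm`, `generate_group`, `is_equivariant`, `k81_matrix`) are hypothetical, the states are indexed A=0, C=1, G=2, T=3, and the full symmetric group for JC69 is generated from an assumed pair of generators, (A C G T) and (A C). It tests the entrywise form $A_{g(i),g(j)} = A_{i,j}$ of $P_g A P_g^{-1} = A$.

```python
STATES = "ACGT"
IDX = {s: i for i, s in enumerate(STATES)}

def perm(*cycles):
    """Permutation of {0,1,2,3} from cycles over A, C, G, T, e.g. perm("AC", "GT")."""
    p = list(range(4))
    for cyc in cycles:
        for a, b in zip(cyc, cyc[1:] + cyc[0]):
            p[IDX[a]] = IDX[b]
    return tuple(p)

def generate_group(gens):
    """Brute-force closure of the generators under composition."""
    e = tuple(range(4))
    group, frontier = {e}, [e]
    while frontier:
        g = frontier.pop()
        for h in gens:
            hg = tuple(h[g[i]] for i in range(4))  # apply g first, then h
            if hg not in group:
                group.add(hg)
                frontier.append(hg)
    return group

GROUPS = {
    # Assumption: (A C G T) and (A C) are used here to generate the full symmetric group.
    "JC69": generate_group([perm("ACGT"), perm("AC")]),
    "K80": generate_group([perm("ACGT"), perm("AG")]),
    "K81": generate_group([perm("AC", "GT"), perm("AG", "CT")]),
    "SSM": generate_group([perm("AT", "CG")]),
    "GMM": generate_group([]),
}

def is_equivariant(A, group):
    """True if A[g(i)][g(j)] == A[i][j] for every g, i.e. P_g A P_g^{-1} = A."""
    return all(A[g[i]][g[j]] == A[i][j]
               for g in group for i in range(4) for j in range(4))

def k81_matrix(a, b, c, d):
    """Matrix whose (i, j) entry depends only on i XOR j (A=0, C=1, G=2, T=3)."""
    params = (a, b, c, d)
    return [[params[i ^ j] for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    A = k81_matrix(0.7, 0.1, 0.15, 0.05)
    for name, group in GROUPS.items():
        print(name, len(group), is_equivariant(A, group))
```

With three distinct off-diagonal parameters the matrix is equivariant for the K81 group but not for the larger K80 group; making the two transversion parameters equal restores K80-equivariance.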
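Given an evolutionary model $\mathcal{M} = (W_0, \mathrm{Mod})$ and a phylogenetic tree T, we define the space of parameters as
$$\mathrm{Par}_{\mathcal{M}}(T) = W_0 \times \prod_{e \in E(T)} \mathrm{Mod}.$$
Similarly, we define the space of stochastic parameters associated to T by
$$s\mathrm{Par}_{\mathcal{M}}(T) = sW_0 \times \prod_{e \in E(T)} s\mathrm{Mod}.$$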
Though it may seem artificial at first glance, tensors are a natural framework for the distributions on the set of patterns in B at the leaves of a phylogenetic tree. Indeed, if $p_{x_1 x_2 \dots x_n}$ denotes the joint probability of observing $x_1$ at leaf 1, $x_2$ at leaf 2, and so on, up to $x_n$ at leaf n, then the vector $p = (p_{x_1 \dots x_n})_{x_1,\dots,x_n \in B}$ provides a distribution on the set of patterns in B at the leaves of T, and this can be regarded as the tensor having these coordinates in the natural basis,
$$p = \sum_{x_1,\dots,x_n \in B} p_{x_1 \dots x_n}\, x_1 \otimes \cdots \otimes x_n.$$
This motivates the following definition.
Definition 5. Given a phylogenetic tree T on the set of taxa [n], an [n]-tensor is any element of the tensor power $\otimes^n W = W \otimes \cdots \otimes W$ (n factors).
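Given an algebraic evolutionary model and a phylogenetic tree T with root r, every Markov process on T (specified by a collection of parameters Π and $A = (A^e)_{e \in E(T)}$) gives rise to a tensor in $\otimes^n W$ in the following way: we consider a parametrization $\Psi_T$ from the space of parameters of T to $\otimes^n W$ defined by
$$\Psi_T(\Pi, A) = \sum_{x_1,\dots,x_n \in B} p_{x_1 \dots x_n}\, x_1 \otimes \cdots \otimes x_n, \qquad (1)$$
where
$$p_{x_1 \dots x_n} = \sum_{x_v \in B,\ v \in \mathrm{Int}(T)} \Pi_{x_r} \prod_{e \in E(T)} A^e_{x_{pa(e)},\, x_{ch(e)}}, \qquad (2)$$
$x_v$ denotes the state at the vertex v, pa(e) (resp. ch(e)) is the parent (resp. child) node of e, and $\Pi_x$, x ∈ B, are the coordinates of Π. When restricted to the stochastic matrices and distributions in $W_0$, this parametrization corresponds to the hidden Markov process on the tree T (the leaves correspond to the observed random variables and the interior nodes to the hidden variables).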
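The parametrization sums, over all states of the interior nodes, the product of the root probability and one transition entry per edge. A minimal brute-force sketch follows, assuming a four-taxon tree with interior nodes named `"r"` (the root) and `"u"`, Jukes-Cantor matrices with arbitrary illustrative parameters, and a uniform root distribution. The function names and the tree encoding are hypothetical.

```python
from itertools import product

def leaf_distribution(root, edges, leaves, pi, mats, k=4):
    """Joint distribution of leaf patterns, summing over all interior states.

    edges: list of (parent, child) pairs; mats[(parent, child)][a][b] is the
    probability of state b at the child given state a at the parent.
    """
    interior = []
    for pa, ch in edges:
        for v in (pa, ch):
            if v not in leaves and v not in interior:
                interior.append(v)
    p = {}
    for leaf_states in product(range(k), repeat=len(leaves)):
        state = dict(zip(leaves, leaf_states))
        total = 0.0
        for hidden in product(range(k), repeat=len(interior)):
            state.update(zip(interior, hidden))
            term = pi[state[root]]
            for pa, ch in edges:
                term *= mats[(pa, ch)][state[pa]][state[ch]]
            total += term
        p[leaf_states] = total
    return p

def jc69(a):
    """Jukes-Cantor matrix: every off-diagonal entry equals a."""
    return [[1 - 3 * a if i == j else a for j in range(4)] for i in range(4)]

# Tree 12|34 rooted at the interior node "r"; "u" is the other interior node.
EDGES = [("r", "1"), ("r", "2"), ("r", "u"), ("u", "3"), ("u", "4")]
LEAVES = ["1", "2", "3", "4"]
PI = [0.25] * 4
MATS = {e: jc69(0.02 * (i + 1)) for i, e in enumerate(EDGES)}

P = leaf_distribution("r", EDGES, LEAVES, PI, MATS)

if __name__ == "__main__":
    print(len(P), sum(P.values()))
```

The enumeration is exponential in the number of nodes, so it is only meant for checking small cases by hand; it is not how likelihoods are computed in practice.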
The parametrization (1) restricts to another polynomial map from the space of stochastic parameters of T to H, where $H \subset \otimes^n W$ is the hyperplane defined by $\sum_{x_1,\dots,x_n \in B} p_{x_1 \dots x_n} = 1$. Because we work in the algebraic setting, the use of the word “stochastic” in this paper is more general than usual, as we only require entries summing to one.
From now on, we will refer to this restriction as the stochastic parametrization. It is important to note that when we consider the distributions in $sW_0$ and the Markov matrices in sMod, their image under the stochastic parametrization lies in the standard simplex in $\otimes^n W$ (and thus in H). This in turn implies that the whole image is contained in H.
We proceed to define the algebraic varieties associated to the parametrization maps defined above. Roughly speaking, algebraic varieties are sets of solutions to systems of polynomial equations (e.g. [14]).
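Definition 6. The stochastic phylogenetic variety $V_{\mathcal{M}}(T)$ associated to a phylogenetic tree T is the smallest algebraic variety containing the image of the stochastic parametrization (in particular, $V_{\mathcal{M}}(T) \subset H$).
Similarly, the phylogenetic variety $CV_{\mathcal{M}}(T)$ associated to T is the smallest algebraic variety in $\otimes^n W$ that contains the image of the parametrization (1).
Below we explain the reason for the notation $CV_{\mathcal{M}}(T)$, which was adopted from [23].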
The reader may note that the position of the root r of T played a role in the above parameterizations. It can be shown, however, that under certain mild assumptions, and are independent of the root position in the following sense: if two phylogenetic trees have the same topology as unrooted trees, then the smallest algebraic varieties containing the corresponding image sets are the same. For example, any model satisfying (i) belongs to W0 for all Π ∈ W0 and all A ∈ Mod, and (ii) whenever exists (here D ω denotes the diagonal matrix with the entries of ω on the diagonal and zeros elsewhere) has this property (in this case, we say the model is root-independent). It is not difficult to check that the equivariant models satisfy these two properties (e.g. adapting the proof of [24] or [25]). For technical reasons, from now on we consider only the evolutionary models satisfying (i) and (ii). Indeed, in this case the notation refers to the fact that the phylogenetic variety is just the cone over the stochastic phylogenetic variety (see Figure 1 and the remark below).
Remark 7. Let be an evolutionary model satisfying (i) and (ii) above. For , , define . Then
and . This is well known for the general Markov model [23] and can be easily generalized to any model satisfying (i) and (ii).
The space of phylogenetic mixtures
In phylogenetics, the hypothesis that the sites of an alignment are independent and identically distributed is often used. When the assumption “identically distributed” is replaced by “distributed according to the same evolutionary model”, one obtains a phylogenetic mixture. Below, we introduce phylogenetic mixtures from the algebraic point of view (see also [26]).
Definition 8. Fix a set of taxa [n] and an algebraic evolutionary model . A phylogenetic mixture (on m-classes) or m-mixture is any vector of the form
where and for some tree topologies T i on the set of taxa [n]. As is a homogeneous map, phylogenetic mixtures are represented by vectors of the form , where . We call the space of all phylogenetic mixtures (on any number of classes) under the algebraic evolutionary model .
As mentioned in the introduction, the tree topologies contained in the mixture can be the same or different. An example of a phylogenetic mixture is the data modeled by the discrete Gamma-rates models (see e.g. [8]).
Restricting matrix rows to sum to one requires restricting the phylogenetic mixtures to the points of the form
We call the space of stochastic phylogenetic mixtures.
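As an illustration of a stochastic mixture, the following sketch combines two pattern distributions with weights summing to one. It is a simplified example under stated assumptions: both classes live on the star tree with three leaves (so they share a topology but differ in their transition matrices), the matrices are K81-type with made-up parameters, and the names `star_tree_distribution`, `mixture` and `k81_matrix` are hypothetical.

```python
from itertools import product

def k81_matrix(a, b, c, d):
    """K81-type matrix: entry (i, j) depends only on i XOR j (A=0, C=1, G=2, T=3)."""
    params = (a, b, c, d)
    return [[params[i ^ j] for j in range(4)] for i in range(4)]

def star_tree_distribution(pi, mats):
    """p[x] = sum_r pi[r] * prod_i mats[i][r][x_i] for a star tree rooted at its centre."""
    k, n = len(pi), len(mats)
    p = {}
    for x in product(range(k), repeat=n):
        total = 0.0
        for r in range(k):
            term = pi[r]
            for i in range(n):
                term *= mats[i][r][x[i]]
            total += term
        p[x] = total
    return p

def mixture(weights, distributions):
    """Linear combination sum_i weights[i] * distributions[i], coordinate by coordinate."""
    return {x: sum(w * d[x] for w, d in zip(weights, distributions))
            for x in distributions[0]}

PI = [0.25] * 4
P1 = star_tree_distribution(PI, [k81_matrix(0.85, 0.05, 0.07, 0.03)] * 3)
P2 = star_tree_distribution(PI, [k81_matrix(0.55, 0.20, 0.10, 0.15)] * 3)
MIX = mixture([0.3, 0.7], [P1, P2])

if __name__ == "__main__":
    print(sum(MIX.values()))
```

Because each class is invariant under the K81 group and the mixture is a linear combination, the mixture keeps that invariance, which is the elementary observation behind the linear equations studied later.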
Remark 9. The phylogenetic variety of a trivalent tree topology contains all phylogenetic varieties of the non-trivalent tree topologies obtained by contracting any of its interior edges. Indeed, the latter are a particular case of the former when the matrices associated to the contracted edges are equal to the identity matrix. It follows that the space of phylogenetic mixtures on the trivalent tree topologies coincides with the space of phylogenetic mixtures on all possible topologies.
The following result was proven by Matsen, Mossel and Steel in [26] for the two state random cluster model but, as proved below, it can be easily generalized to any evolutionary model.
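Lemma 10. Given a set of taxa [n] and an algebraic evolutionary model, the set of all phylogenetic mixtures is a vector subspace of $\otimes^n W$. Similarly, the space of stochastic phylogenetic mixtures is a linear variety and it equals the intersection of the space of phylogenetic mixtures with the hyperplane H.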
Proof. is a -vector space and is a linear variety by their definition. It follows that is an algebraic variety that contains for any phylogenetic tree T on the set of taxa [n]. Therefore, it also contains , and equals the set of points of the form , where . Similarly, is an algebraic variety that contains , so it also contains for any phylogenetic tree T. It follows that is formed by points of type , where and .
Now we check that . Let , so that for some m, , and . Clearly, . Moreover, the sum of coordinates of q, λ(q), satisfies . Thus, q ∈ H. Conversely, let with for certain tree topologies T i , and assume that λ (p) = 1. Apply Remark 7 to each pito get pi= λ (pi)q i for some . Then
and since each q i lies on H. This proves that . □
Remark 11. In the proof of the above lemma, we have seen that and can be alternatively described as the spaces of mixtures obtained from the respective varieties and (i.e., not only from the images of the parametrization maps).
The space of phylogenetic mixtures for equivariant evolutionary models
This section provides a precise description of the space for the equivariant models listed in Example 4 (JC69, K80, K81, SSM, and GMM). First, we recall some definitions and facts of group theory and linear representation theory. From now on, B = {A,C,G,T}, k = 4, , n is fixed and .
Background on representation theory
We introduce some tools in group representation theory needed in the sequel. We refer the reader to [15] as a classical reference for these concepts. Although some of the following results are valid for any permutation group, for simplicity in the exposition we restrict to permutations of four elements (as our applications deal only with the case B = {A,C,G,T}).
Let G be a permutation group of B. The trivial element in G will be denoted as e. We write $\rho_G$ for the restriction to G of the defining representation given by the permutations of the basis B of W. This representation induces a G-module structure on W by setting $g \cdot x := \rho_G(g)(x) \in W$. In fact, $\rho_G$ induces a G-module structure on any tensor power $\otimes^s W$ by setting
$$g \cdot (x_1 \otimes \cdots \otimes x_s) := g \cdot x_1 \otimes \cdots \otimes g \cdot x_s \qquad (3)$$
and extending by linearity. From now on, the space $\otimes^n W$ will be implicitly considered as a G-module with this action. We call χ the character associated to the representation $\rho_G : G \to GL(W)$, i.e. χ(g) is the trace of the corresponding permutation matrix or, in other words, χ(g) equals the number of elements of B fixed by the permutation g ∈ G. Then the character associated to the induced representation $G \to GL(\otimes^n W)$ is $\chi^n$, the n-th power of χ.
We write $N_1,\dots,N_t$ for the irreducible representations of G and $\omega_1,\dots,\omega_t$ for the corresponding irreducible characters, where $N_1$ and $\omega_1$ will denote the trivial representation and trivial character, respectively. Maschke’s Theorem applied to the action of G described in (3) states that there is a decomposition of $\otimes^s W$ into its isotypic components:
$$\otimes^s W = \bigoplus_{i=1}^{t} (\otimes^s W)[\omega_i], \qquad (4)$$
where each (⊗sW)ω i is isomorphic to a number of copies of the irreducible representation N i associated to ω i , , for some non-negative integer m i (s) called the multiplicity of⊗sW relative to ω i . The isotypic component of associated to the trivial representation will be denoted by and it is composed of the n-tensors invariant under the action of G defined in (3). If is the equivariant evolutionary model associated to G, will also be denoted as . It is easy to prove that (see Lemma 4.3 of [5]).
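We recall that the set $\Omega_G = \{\omega_i\}_{i=1,\dots,t}$ of irreducible characters of G forms an orthonormal basis of the space of characters relative to the inner product defined by
$$\langle \alpha, \beta \rangle = \frac{1}{|G|} \sum_{g \in G} \alpha(g)\, \overline{\beta(g)}. \qquad (5)$$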
We introduce the following notion.
Definition 12. An n - word over B is an ordered sequence X = x1x2 … x n , where every letter is taken from the alphabet B. The set of n-words is equivalent to the cartesian power Bnand will be denoted by .
Words will be denoted in typewriter uppercase font (like X) and their letters in lowercase (like x). Sometimes it will be convenient to identify the [n]-tensors of the form $x_1 \otimes \dots \otimes x_n$ with the n-words $X = x_1 \dots x_n$. Consequently, we will identify the set of n-words with the natural basis of $\otimes^n W$. Given an n-word X, we will denote by $\{X\}_G = \{gX \mid g \in G\}$ the G-orbit of X. We associate a G-invariant tensor, $\tau\{X\}_G$, to each orbit $\{X\}_G$: $\tau\{X\}_G = \sum_{Y \in \{X\}_G} Y$. It is straightforward to see that every G-invariant tensor can be written as a linear combination of the tensors $\tau\{X\}_G$. On the other hand, the set of different $\tau\{X\}_G$’s is linearly independent, since distinct G-orbits are disjoint subsets of the basis of n-words.
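The orbits $\{X\}_G$ and the invariant tensors built from them are easy to enumerate for small n. The sketch below assumes permutations are stored as 4-letter strings giving the images of A, C, G, T in that order; the helper names are hypothetical, and the generators used for the full symmetric group are an assumption. The number of orbits is the number of tensors $\tau\{X\}_G$, hence the dimension of the space of G-invariant tensors.

```python
from itertools import product

STATES = "ACGT"

def perm(*cycles):
    """Permutation as a string of images of A, C, G, T, e.g. perm("AT", "CG") == "TGCA"."""
    image = {s: s for s in STATES}
    for cyc in cycles:
        for a, b in zip(cyc, cyc[1:] + cyc[0]):
            image[a] = b
    return "".join(image[s] for s in STATES)

def act(g, word):
    """Apply the permutation g letter by letter to a word."""
    return word.translate(str.maketrans(STATES, g))

def generate_group(gens):
    """Brute-force closure of the generators under composition."""
    group, frontier = {STATES}, [STATES]
    while frontier:
        g = frontier.pop()
        for h in gens:
            hg = act(h, g)  # first g, then h
            if hg not in group:
                group.add(hg)
                frontier.append(hg)
    return group

def orbit(word, group):
    """The G-orbit {X}_G of an n-word."""
    return frozenset(act(g, word) for g in group)

def orbits(n, group):
    """All G-orbits on the set of n-words."""
    return {orbit("".join(w), group) for w in product(STATES, repeat=n)}

def tau(orb):
    """Coordinates of the invariant tensor of an orbit: 1 on its words, 0 elsewhere."""
    return {word: 1 for word in orb}

SSM = generate_group([perm("AT", "CG")])
K81 = generate_group([perm("AC", "GT"), perm("AG", "CT")])
# Assumption: (A C G T) and (A C) generate the full symmetric group used for JC69.
JC69 = generate_group([perm("ACGT"), perm("AC")])

if __name__ == "__main__":
    print(sorted(orbit("AC", SSM)))
    print(len(orbits(2, K81)), len(orbits(4, JC69)))
```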
Mixtures for equivariant models
For each x ∈ B, we write $S_G(x)$ for the stabiliser of x under the action of G, that is, $S_G(x) = \{g \in G : g \cdot x = x\}$.
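Proposition 13. Let G be a permutation group of B such that $S_G(x_0) = \{e\}$ for some $x_0 \in B$. Then every tensor of type $\tau\{X\}_G$, for an n-word X, lies in the image of the parametrization (1) for some tree topology T. In particular, the space of G-invariant tensors is contained in the space of phylogenetic mixtures.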
Proof. For any G-orbit {Y} G , , write . We will explicitly associate a tree topology and parameters (Π,A) to it so that the tensor τ {T} G is equal to . To this aim, we denote by B (Y) the set of letters appearing in Y. Then for every z ∈ B (Y), consider the set , so that .
We construct a tree T on the set of taxa [n] in the following way. We join each taxa in to a common node v z by an edge. Then each vertex v z is joined to the root of the tree (we call it r) by an edge that we denote as e(z) (see Figure 1). Now, in the edges joining any v z with some leaf in , we consider the identity matrix, while the matrix in e(z) is defined by taking
Finally, if c is the cardinality of {x0} G , define the distribution at the root Π=(Π A ,Π C ,Π G ,Π T ) by
It is straightforward to check that these matrices and the vector Π are G-equivariant, so . Now, from (2) and the definition of Π, we can write
where
(here δa,b stands for the Kronecker delta, i.e. δa,a= 1, δa,b= 0 if a ≠ b). Moreover, from the definition of the matrix Ae(z), we have
The hypothesis S G (x0) = {e} ensures that (g·x0,x z ) = (h·x0,h·z) if and only if g = h. From this, it becomes clear that unless
1. $x_z = g \cdot z$, for z ∈ B, and
2. for each leaf i joined to $v_z$, $x_i$ is equal to $x_z = g \cdot z$,
in which case . It follows that
and . Moreover, as the set of τ{Y} G , for , generates the vector space , the second claim follows. □
Remark 14. The above result is not true if the hypothesis $S_G(x_0) = \{e\}$ is removed. For example, if G = 〈(A C G T),(A G)〉 (so that the corresponding equivariant model is K80), then $S_G(A) = S_G(G) = \{e, (C\,T)\}$ and $S_G(C) = S_G(T) = \{e, (A\,G)\}$. In that case, it can be shown that the tensor of the G-orbit $\{ACGT\}_G$ does not arise from any tree topology T with 4 leaves.
Since the above condition on the group holds for G = 〈e〉, G = 〈(A T)(C G)〉, and G = 〈(A C)(G T),(A G)(C T)〉, we deduce the following claim.
Corollary 15. If G corresponds to any of the equivariant models K81, SSM or GMM, then the space of phylogenetic mixtures coincides with the space of G-invariant tensors.
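The trivial-stabiliser hypothesis of Proposition 13 can be verified mechanically for the five groups of Example 4. In this sketch permutations are again stored as strings of images of A, C, G, T, the helper names are hypothetical, and the generators chosen for the full symmetric group are an assumption.

```python
STATES = "ACGT"

def perm(*cycles):
    """Permutation as a string of images of A, C, G, T."""
    image = {s: s for s in STATES}
    for cyc in cycles:
        for a, b in zip(cyc, cyc[1:] + cyc[0]):
            image[a] = b
    return "".join(image[s] for s in STATES)

def act(g, word):
    """Apply the permutation g letter by letter."""
    return word.translate(str.maketrans(STATES, g))

def generate_group(gens):
    """Brute-force closure of the generators under composition."""
    group, frontier = {STATES}, [STATES]
    while frontier:
        g = frontier.pop()
        for h in gens:
            hg = act(h, g)
            if hg not in group:
                group.add(hg)
                frontier.append(hg)
    return group

GROUPS = {
    "JC69": generate_group([perm("ACGT"), perm("AC")]),  # assumed generators
    "K80": generate_group([perm("ACGT"), perm("AG")]),
    "K81": generate_group([perm("AC", "GT"), perm("AG", "CT")]),
    "SSM": generate_group([perm("AT", "CG")]),
    "GMM": generate_group([]),
}

def stabiliser(x, group):
    """S_G(x): the elements of the group fixing the state x."""
    return {g for g in group if act(g, x) == x}

def has_trivial_stabiliser(group):
    """True if some state of B is fixed only by the identity."""
    return any(len(stabiliser(x, group)) == 1 for x in STATES)

if __name__ == "__main__":
    for name, group in GROUPS.items():
        print(name, has_trivial_stabiliser(group))
```

The check succeeds for GMM, SSM and K81 and fails for K80 and JC69, matching the split between Corollary 15 and the two models that need a separate argument.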
In phylogenetics, an invariant of a phylogenetic tree T is an equation satisfied by the expected distributions of patterns at the leaves of T, irrespectively of the continuous parameters of the model . In the algebraic geometry setting, these are the equations satisfied by all . Invariants were introduced by Lake (see [27]) and Cavender and Felsenstein (see [28]). A phylogenetic invariant of T is an invariant of T, which is not an invariant of all other phylogenetic trees (under the same model ). Equivalently, f is a phylogenetic invariant of if it is an invariant of and there exists a tree topology T′ such that f is not an invariant of . In principle, phylogenetic invariants can be used for tree topology reconstruction purposes.
Remark 16.
(a) It can be seen that the condition of trivial stabiliser for some element of B given in Proposition 13 guarantees that all the irreducible representations of G will be present in the decomposition of W into its isotypic components. Then, by using the results of [6], it follows that the corresponding equivariant model will have no linear phylogenetic invariants. This fact was already known for the models in the above corollary: see [29] for the GMM, [21] for the SSM and [30] for the K81. Here we provided an alternative proof based on elementary tools of group theory.
(b) The models JC69 and K80 are known to have linear phylogenetic invariants, but these are the only linear invariants which do not define hyperplanes containing the space of G-invariant tensors, as can be deduced from [3, 30]. In fact, for these two models, the claim of the corollary is still true as stated in the following theorem. Nevertheless, we have not been able to provide a unified proof of this fact because of the different properties of the corresponding groups. There is no description of the space of linear invariants for other equivariant models not listed in Example 4, so we cannot claim that the result below still holds.
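Theorem 17. If the model is one of the equivariant evolutionary models JC69, K80, K81, SSM, or GMM, then the space of phylogenetic mixtures coincides with the space of G-invariant tensors, and the space of stochastic phylogenetic mixtures equals the intersection of this space with the hyperplane H.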
This theorem allows to identify the set of all phylogenetic mixtures with , which is a vector subspace of whose linear equations are easy to describe. In other words, is the smallest linear space containing the data coming from any mixture of trees evolving under the model . One can therefore use to select the most suitable model for the given data. This has been studied in [7].
Proof of Theorem 17. For equivariant models we have that for any tree T. Hence, by Lemma 10 and the definition of , is a vector subspace of .
From Corollary 15, we infer the equality for the models K81, SSM and GMM. For the other two models, JC69 and K80, it remains to prove that there does not exist any hyperplane Π containing and not containing . If such a hyperplane existed, then it would contain all the points of for any tree topology T. It suffices to prove that for these models there are no homogeneous linear polynomials vanishing on all tree topologies, except for the linear equations vanishing on . This has been seen in Remark 16(b).
The equality follows immediately from Lemma 10 and the first assertion in the statement of this theorem.
Remark 18. We are indebted to one of the referees of this paper for pointing out that the preceding result, as well as the second part of Proposition 13, can also be inferred from Proposition 4.9 of [5]: under the assumption that the stabiliser of some state is trivial, Draisma and Kuttler show that the star tree is the smallest algebraic variety containing the tensors $\tau\{X\}_G$, for pure tensors X (that is, tensors of rank 1). It follows that the set of mixtures on the star tree equals the space of G-invariant tensors.
Remark 19. It is not difficult to check that for K81, SSM or GMM, coincides with the space of mixtures on the star tree (see also [26], where the same result is proven for a 2-state model). On the contrary, this is not true for JC69 and K80 models because in this case the star tree lies in a smaller linear space as a consequence of the existence of phylogenetic linear equations (see Remark 16(b)).
Equations for the space
Our goal here is to compute the dimension of the space of G-invariant tensors for the groups associated to the equivariant models listed in Example 4, and to list a set of independent linear equations defining this space.
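Proposition 20. Using the notations above, the dimension of the space of G-invariant n-tensors is
(i) $\dfrac{4^{n-1} + 3 \cdot 2^{n-1} + 2}{6}$ for JC69,
(ii) $2^{2n-3} + 2^{n-2}$ for K80,
(iii) $4^{n-1}$ for K81, and
(iv) $2^{2n-1}$ for SSM.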
Proof. Consider any equivariant model with group G. By definition, we know that the space of G-invariant tensors is the isotypic component $(\otimes^n W)[\omega_1]$ of $\otimes^n W$ associated to the trivial representation. Since the dimension of the trivial representation is one, it follows that the dimension of this space is precisely the multiplicity $m_1(n)$, i.e. the number of times the trivial representation appears in the decomposition of $\otimes^n W$ into isotypic components. This multiplicity $m_1(n)$ equals (see (5))
$$m_1(n) = \langle \chi^n, \omega_1 \rangle = \frac{1}{|G|} \sum_{g \in G} \chi(g)^n.$$
The proof ends by grouping the elements of G in the conjugacy classes of G for SSM, K81, K80, or JC69. Recall that the conjugacy classes of a group G are the disjoint sets of the form $C(g) = \{h^{-1}gh : h \in G\}$. If $C_1,\dots,C_s$ are the conjugacy classes for G, write $(c_1,\dots,c_s)$ for the s-tuple of their cardinalities, so that $|G| = c_1 + \dots + c_s$. Recall that $\chi^n(g_1) = \chi^n(g_2)$ whenever $g_1$ and $g_2$ lie in the same conjugacy class, so we can represent $\chi^n$ by an s-tuple $(t_1,\dots,t_s)$, where $t_i = \chi^n(g)$ for any $g \in C_i$. Thus, we have $m_1(n) = \frac{1}{|G|} \sum_{i=1}^{s} c_i\, \chi^n(g_i)$, where $g_i$ is any element in the conjugacy class $C_i$. The result for SSM, K81, K80, or JC69 follows by applying Table 1.
□
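The multiplicity formula in the proof can be evaluated directly, since χ(g) is just the number of states fixed by g. The sketch below compares it with closed forms obtained from that same formula by hand; these closed forms, the helper names and the generators chosen for the full symmetric group are assumptions of this example, not quotations of the proposition. Permutations are stored as strings of images of A, C, G, T.

```python
STATES = "ACGT"

def perm(*cycles):
    """Permutation as a string of images of A, C, G, T."""
    image = {s: s for s in STATES}
    for cyc in cycles:
        for a, b in zip(cyc, cyc[1:] + cyc[0]):
            image[a] = b
    return "".join(image[s] for s in STATES)

def generate_group(gens):
    """Brute-force closure of the generators under composition."""
    group, frontier = {STATES}, [STATES]
    while frontier:
        g = frontier.pop()
        for h in gens:
            hg = g.translate(str.maketrans(STATES, h))  # first g, then h
            if hg not in group:
                group.add(hg)
                frontier.append(hg)
    return group

GROUPS = {
    "JC69": generate_group([perm("ACGT"), perm("AC")]),  # assumed generators
    "K80": generate_group([perm("ACGT"), perm("AG")]),
    "K81": generate_group([perm("AC", "GT"), perm("AG", "CT")]),
    "SSM": generate_group([perm("AT", "CG")]),
}

def chi(g):
    """Character of the permutation representation: number of fixed states."""
    return sum(1 for s, t in zip(STATES, g) if s == t)

def invariant_dimension(group, n):
    """m_1(n) = (1/|G|) * sum over g in G of chi(g)^n."""
    total = sum(chi(g) ** n for g in group)
    quotient, remainder = divmod(total, len(group))
    if remainder:
        raise ValueError("character sum is not divisible by the group order")
    return quotient

# Closed forms worked out from the character sum (valid for n >= 2 as integers).
CLOSED_FORMS = {
    "JC69": lambda n: (4 ** (n - 1) + 3 * 2 ** (n - 1) + 2) // 6,
    "K80": lambda n: 2 ** (2 * n - 3) + 2 ** (n - 2),
    "K81": lambda n: 4 ** (n - 1),
    "SSM": lambda n: 2 ** (2 * n - 1),
}

if __name__ == "__main__":
    for name, group in GROUPS.items():
        print(name, [invariant_dimension(group, n) for n in range(2, 6)])
```

For four taxa this gives 15 for JC69 and 64 for K81, the values behind the identifiability example mentioned in the Background.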
Our next goal is to provide a set of independent linear equations for . Before stating the main result, let us introduce some useful notation.
Notation 21. We consider the following subsets of :
The set is composed of all n-words with only one letter and it is contained in , , and . Similarly, is composed of all n-words with two letters at most. It is straightforward to check that and .
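We adopt multiplicative notation for n-words, for instance, we write $C^l$ for the word $C \dots C$ (l letters), and $(A^l)(G^m)x_{l+m+1} \dots x_n$ for $A \dots A\, G \dots G\, x_{l+m+1} \dots x_n$ (l letters A followed by m letters G), where $x_{l+m+1},\dots,x_n$ represent any letters.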
The main result of this section is the following:
Theorem 22. A set of linearly independent equations defining the space of G-invariant tensors for the model JC69, K80, K81, or SSM is given by
SSM: the equations $p_X = p_{(A\,T)(C\,G)X}$ for all n-words X with $x_1 \in \{A, C\}$;
K81: the equations for SSM, and the equations $p_X = p_{(A\,C)(G\,T)X}$ for all n-words X with $x_1 = A$;
: the equations in , plus the equations {p X = p( A )( G ) X } for all having x1 = A and satisfying the following condition: if T appears in X, then there is a C in a preceding position;
: the equations in , together with the equations {p X = p( A T ) X } for all of the form (Al)(Cm)xl + m + 1…x n ; plus the equations {p X = p( A C ) X } and {p X = p( A T ) X } for all of the form (Al)(Cm)xl + m + 1…x n and satisfying the condition: if T appears in X, then there is a G in a preceding position.
The number of equations added in each case is 22 n − 1for SSM, 22 n − 2 for K81, 22n − 3− 2n−2 for K80, and for JC69.
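As a quick consistency check of these counts (a worked instance added for illustration, not part of the original statement), take n = 3. The dimension of each space is the number of orbits of the corresponding group on the 64 words of length 3. For K80 this number is 10, and for JC69 it is 5, one orbit per equality pattern aaa, aab, aba, abb, abc. Hence JC69 must add 10 − 5 = 5 equations:

```latex
\begin{align*}
\text{SSM: }   & 2^{2n-1} = 2^{5} = 32, \\
\text{K81: }   & 2^{2n-2} = 2^{4} = 16, \\
\text{K80: }   & 2^{2n-3} - 2^{n-2} = 8 - 2 = 6, \\
\text{JC69: }  & (64 - 32 - 16 - 6) - 5 = 10 - 5 = 5, \\
\text{total: } & 32 + 16 + 6 + 5 = 59 = 4^{3} - 5 .
\end{align*}
```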
Before proving this theorem, we explain how these sets of equations were obtained. Notice that a system of linear equations for the linear space associated to a group $G$ is given by

$$p_X = p_{gX} \quad \text{for all } g \in G \text{ and all n-words } X.$$

The role played by the $G$-orbits on the set of n-words becomes apparent. Indeed, the idea is to relate the equations to the orbits of a subgroup of $G$. To this aim, let $H$ be a subgroup of $G$ and write $H \backslash G = \{Hg : g \in G\}$ for the set of right cosets of $H$ in $G$. We consider a transversal of $H \backslash G$, i.e. a collection $\{g_1,\dots,g_{[G:H]}\}$ such that $G = \bigsqcup_{i=1}^{[G:H]} H g_i$. Then the orbit of any n-word $X$ can be decomposed as

$$\{X\}_G = \bigcup_{i=1}^{[G:H]} \{g_i X\}_H. \qquad (6)$$

This decomposition establishes the connection between the $G$-orbits and the $H$-orbits. In order to obtain a system of equations for the space associated to $G$, once a system for the space associated to $H$ has been computed, it is enough to add the equations involving the permutations in a transversal $\{g_1 = e, g_2,\dots,g_{[G:H]}\}$ of $H \backslash G$:

$$p_X = p_{g_i X}, \quad i = 2,\dots,[G:H].$$

Notice that the union in (6) is not necessarily disjoint, as it may happen that $\{g_i X\}_H = \{g_j X\}_H$ for $i \neq j$. In this case, the equality $p_{g_i X} = p_{g_j X}$ already holds in the space associated to $H$ and does not provide any new restriction. In order to avoid this situation and obtain a minimal set of equations, we impose the special conditions on the words $X$ in the statement of the theorem.
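The decomposition (6) can be illustrated on a small case. The sketch below takes H = K81 and G = K80 with the transversal {e, (AG)} used later in the proof, and compares a word whose two H-orbits are disjoint with a word for which they coincide. Permutations are encoded as the images of A, C, G, T; the helper names are illustrative only.

```python
LETTERS = "ACGT"

def act(p, word):
    """Apply a letter permutation, given as the images of A, C, G, T, to a word."""
    return word.translate(str.maketrans(LETTERS, p))

def generate(generators):
    """All elements of the permutation group generated by the given permutations."""
    group, todo = {LETTERS}, [LETTERS]
    while todo:
        g = todo.pop()
        for s in generators:
            h = act(s, g)
            if h not in group:
                group.add(h)
                todo.append(h)
    return group

def orbit(group, word):
    """The orbit {X}_G of a word under a group."""
    return frozenset(act(g, word) for g in group)

AT_CG, AC_GT, AG = "TGCA", "CATG", "GCAT"
K81 = generate([AT_CG, AC_GT])
K80 = generate([AT_CG, AC_GT, AG])
TRANSVERSAL = [LETTERS, AG]                     # {e, (AG)}

def h_orbits(word):
    """The K81-orbits {g_i X}_H, one for each element g_i of the transversal."""
    return [orbit(K81, act(g, word)) for g in TRANSVERSAL]

if __name__ == "__main__":
    for word in ("ACG", "AGA"):
        pieces = h_orbits(word)
        print(word, sorted(orbit(K80, word)), [sorted(p) for p in pieces])
```

For ACG the two K81-orbits are disjoint and the K80-orbit has 8 elements, so the equation relating ACG and (AG)·ACG = GCA is new. For AGA the two K81-orbits coincide, which is the situation the conditions in Theorem 22 are designed to exclude.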
Proof. For each model $\mathcal{M}$, we prove that the corresponding equations are linearly independent and that there are as many equations as the codimension of the linear space associated to $\mathcal{M}$. By Proposition 20, this codimension is $2^{2n-1}$ for SSM, $3 \cdot 4^{n-1}$ for K81, $7 \cdot 2^{2n-3} - 2^{n-2}$ for K80, and $4^n - (4^{n-1} + 3 \cdot 2^{n-1} + 2)/6$ for JC69. In the sequel, we refer to the groups by the name of the equivariant model associated to them.
SSM: As SSM is the group $\{e, (AT)(CG)\}$, a set of equations for SSM is $\{p_X = p_{(AT)(CG)X}\}$. Fixing $x_1$ in $\{A, C\}$, we obtain $2^{2n-1}$ linearly independent equations (they involve different coordinates). The codimension of the space for SSM is equal to $2^{2n-1}$, which coincides with the number of equations given, and thus this set of equations defines it.
K81: Since a transversal of SSM\K81 is $\{e, (AC)(GT)\}$, the hyperplanes $p_X = p_{(AC)(GT)X}$ contain the space for K81 but not the space for SSM. Moreover, using (6) we see that the orbit $\{X\}_{K81}$ decomposes into the disjoint union of $\{X\}_{SSM}$ and $\{(AC)(GT)X\}_{SSM}$ for any n-word $X$. Therefore, the equations given for K81 relate coordinates that are not already related by the equations for SSM. Requiring $x_1 = A$, we obtain $4^{n-1}$ linearly independent new equations. Thus this set of equations defines the space for K81, because the number of linearly independent equations provided, $2^{2n-1} + 4^{n-1} = 3 \cdot 4^{n-1}$, coincides with its codimension.
K80: The set $\{e, (AG)\}$ is a transversal of K81\K80. In order to show that the equations provided are linearly independent from those for K81, we apply (6) to this transversal to obtain $\{X\}_{K80} = \{X\}_{K81} \cup \{(AG)X\}_{K81}$. If $X$ is not written only with the letters $\{A, G\}$ or only with the letters $\{C, T\}$, then $\{(AG)X\}_{K81}$ and $\{X\}_{K81}$ are disjoint, so each equation $p_X = p_{(AG)X}$ is linearly independent from the equations for K81. The set of such words has cardinality $4^n - 2^{n+1}$ and, for each of them, the orbit $\{X\}_{K80}$ has cardinality 8. Therefore, the number of different orbits of such words is $(4^n - 2^{n+1})/8 = 2^{2n-3} - 2^{n-2}$. Moreover, the choice of words $X$ with $x_1 = A$ satisfying "if T appears in $X$, there is a C in a preceding position" guarantees that we take only one element in each orbit $\{X\}_{K80}$, and thus we are adding exactly one equation for each of these orbits. Overall, there are $3 \cdot 4^{n-1} + (2^{2n-3} - 2^{n-2}) = 7 \cdot 2^{2n-3} - 2^{n-2}$ linearly independent equations for K80. This number coincides with the codimension of the space for K80, and hence these equations define it.
JC69: A transversal of K80\JC69 is $\{e, (AC), (AT)\}$, therefore (6) applies to give $\{X\}_{JC69} = \{X\}_{K80} \cup \{(AC)X\}_{K80} \cup \{(AT)X\}_{K80}$. We distinguish two cases:

if $X$ has exactly two distinct letters and is of the form $(A^l)(C^m)x_{l+m+1}\dots x_n$, then $\{(AC)X\}_{K80} = \{X\}_{K80}$ and $\{X\}_{JC69}$ is the disjoint union of $\{X\}_{K80}$ and $\{(AT)X\}_{K80}$. As such, each equation $p_X = p_{(AT)X}$ is linearly independent from the equations for K80. There are $2^{n-1} - 1$ such words, and the corresponding equations are linearly independent.

if $X$ has at least three distinct letters, then the three orbits $\{(AC)X\}_{K80}$, $\{(AT)X\}_{K80}$, and $\{X\}_{K80}$ have 8 elements each and are disjoint. Therefore, for these words, each equation of type $p_X = p_{(AC)X}$ or $p_X = p_{(AT)X}$ is linearly independent from the equations for K80. Moreover, as the set of such words has cardinality $4^n - 3 \cdot 2^{n+1} + 8$ and is covered by these orbits, we have $(4^n - 3 \cdot 2^{n+1} + 8)/24$ different JC69-orbits, each contributing two equations. The restriction to the words of the form $(A^l)(C^m)x_{l+m+1}\dots x_n$ satisfying "if T appears in $X$, there is some G in a preceding position" guarantees that the equations are written only once for each orbit.

Summing up, there are $7 \cdot 2^{2n-3} - 2^{n-2} + (2^{n-1} - 1) + (4^n - 3 \cdot 2^{n+1} + 8)/12$ linearly independent equations for JC69, all of which hold on the space for JC69. As this number is equal to its codimension, the proof is complete.
All the equalities among orbits used in this proof are summarized in Table 2.
Remark 23. The sets of equations of Theorem 22 have been successfully used in [7] for model selection. Although the dimensions of these linear spaces are exponential in n, in practice it is not necessary to consider the full set of equations, but only those involving the patterns observed in the data. This is crucial for the applicability of the method, since the number of different columns in an alignment is very small compared to the dimension of these spaces.
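The sparsity described in this remark can be illustrated with a toy computation. The sketch below is not the procedure of [7]; it only lists, for a given group, the equations p_X = p_gX that involve at least one pattern observed in a hypothetical list of alignment columns, together with the empirical frequencies on both sides. The function name and the toy columns are assumptions made for the example.

```python
from collections import Counter

LETTERS = "ACGT"

def act(p, word):
    """Apply a letter permutation, given as the images of A, C, G, T, to a word."""
    return word.translate(str.maketrans(LETTERS, p))

# The SSM group {e, (AT)(CG)}, each element written as the images of A, C, G, T.
SSM = ["ACGT", "TGCA"]

def observed_equations(columns, group):
    """Equations p_X = p_gX touching an observed pattern, with empirical frequencies."""
    counts = Counter(columns)
    total = sum(counts.values())
    pairs = set()
    for X in counts:
        for g in group:
            Y = act(g, X)
            if Y != X:
                pairs.add((min(X, Y), max(X, Y)))   # store each equation once
    # Counter returns 0 for patterns that never occur in the data.
    return {(X, Y): (counts[X] / total, counts[Y] / total) for X, Y in sorted(pairs)}

if __name__ == "__main__":
    toy_columns = ["ACG", "ACG", "TGC", "AAA"]      # hypothetical alignment columns
    for equation, freqs in observed_equations(toy_columns, SSM).items():
        print(equation, freqs)
```

The number of equations returned is at most (|G| − 1) times the number of distinct observed columns, regardless of n.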
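Example 24. As an example, we compute a minimal system of equations for SSM, K81, K80, and JC69 in the case of 3 leaves, following Theorem 22.

Equations for SSM: the system is composed of the 32 equations $p_X = p_{(AT)(CG)X}$ with $x_1 \in \{A, C\}$, for instance $p_{AAA} = p_{TTT}$ and $p_{CAG} = p_{GTC}$.

Equations for K81: the system is formed by the equations for SSM and the 16 equations $p_X = p_{(AC)(GT)X}$ with $x_1 = A$, for instance $p_{AAA} = p_{CCC}$ and $p_{AGT} = p_{CTG}$.

Equations for K80: the system is formed by the equations for K81 and the 6 equations $p_{AAC} = p_{GGC}$, $p_{ACA} = p_{GCG}$, $p_{ACC} = p_{GCC}$, $p_{ACG} = p_{GCA}$, $p_{ACT} = p_{GCT}$, $p_{AGC} = p_{GAC}$.

Equations for JC69: the system is formed by the equations for K80 and the 5 equations $p_{AAC} = p_{TTC}$, $p_{ACA} = p_{TCT}$, $p_{ACC} = p_{TCC}$, $p_{ACG} = p_{CAG}$, $p_{ACG} = p_{TCG}$.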
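The systems of this example can be regenerated mechanically. The sketch below is one possible reading of the selection rules of Theorem 22, with the word sets described explicitly (words mixing letters of {A,G} and {C,T} for K80; words with exactly two, or at least three, distinct letters for JC69) and with the transversal element (AC)(GT) for K81. Since every equation has the form p_X = p_Y, linear independence is checked with a union-find structure: an equation is independent of the previous ones exactly when it joins two classes that were still separate. All names are illustrative.

```python
from itertools import product

LETTERS = "ACGT"
AT_CG, AC_GT = "TGCA", "CATG"            # (AT)(CG) and (AC)(GT) as images of A, C, G, T
AG, AC, AT = "GCAT", "CAGT", "TCGA"      # the transpositions (AG), (AC), (AT)

def act(p, word):
    """Apply a letter permutation, given as the images of A, C, G, T, to a word."""
    return word.translate(str.maketrans(LETTERS, p))

def precedes(word, a, b):
    """Condition 'if b appears in word, then a appears in a preceding position'."""
    return b not in word or a in word[:word.index(b)]

def ac_form(word):
    """Words of the form (A^l)(C^m)x...: at least one A, then a C."""
    return word[0] == "A" and word.lstrip("A")[:1] == "C"

def theorem22_equations(n):
    """Equations added for each model, as pairs (X, Y) meaning p_X = p_Y."""
    words = ["".join(w) for w in product(LETTERS, repeat=n)]
    mixed = [X for X in words if set(X) & set("AG") and set(X) & set("CT")]
    eqs = {
        "SSM": [(X, act(AT_CG, X)) for X in words if X[0] in "AC"],
        "K81": [(X, act(AC_GT, X)) for X in words if X[0] == "A"],
        "K80": [(X, act(AG, X)) for X in mixed
                if X[0] == "A" and precedes(X, "C", "T")],
        "JC69": [(X, act(AT, X)) for X in words
                 if ac_form(X) and len(set(X)) == 2],
    }
    for X in words:
        if ac_form(X) and len(set(X)) >= 3 and precedes(X, "G", "T"):
            eqs["JC69"] += [(X, act(AC, X)), (X, act(AT, X))]
    return eqs

def check(n):
    """Return (equations added per model, all independent?, dimension of solutions)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    eqs = theorem22_equations(n)
    independent, merges = True, 0
    for model in ("SSM", "K81", "K80", "JC69"):
        for X, Y in eqs[model]:
            rx, ry = find(X), find(Y)
            if rx == ry:
                independent = False          # implied by the earlier equations
            else:
                parent[rx] = ry
                merges += 1
    return {m: len(e) for m, e in eqs.items()}, independent, 4 ** n - merges

if __name__ == "__main__":
    print(check(3))
    print(theorem22_equations(3)["K80"])
    print(theorem22_equations(3)["JC69"])
```

For n = 3 the counts are 32, 16, 6 and 5, all equations are independent, and the common solution space has dimension 5, the number of JC69-orbits on words of length 3.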
Identifiability of phylogenetic mixtures
In this section we study the identifiability of phylogenetic mixtures. To this end, we use projective algebraic varieties and techniques from algebraic geometry. It is not our intention to give the reader a background on these tools, so we refer to the algebraic geometry book [14] and, more specifically, to [10] for the usage of these techniques in the study of phylogenetic mixtures.
There is a natural isomorphism between the points lying in the hyperplane H considered above, , and the open affine subset of . We use the notation [p A … A :⋯:p T … T ] for projective coordinates (in contrast to used for affine coordinates). The projective phylogenetic variety associated to a phylogenetic tree T is the projective closure in of the image of the stochastic parameterization defined above. That is, it is the smallest projective variety in containing via the above isomorphism.
In what follows, we explain the relationship between this new variety and and . By Remark 7, it becomes clear that equals the affine cone over the projective phylogenetic variety (for the general Markov model, see also [23], Proposition 1). This implies that , and if belongs to , then q := p A … T :⋯:p T … T belongs to . Moreover, if is not zero, then is a point in the affine stochastic phylogenetic variety .
Before defining identifiability of mixtures, we consider the following construction of projective algebraic varieties.
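Definition 25. Given two projective varieties $X, Y \subset \mathbb{P}^m$, the join of $X$ and $Y$, $X \vee Y$, is the smallest variety in $\mathbb{P}^m$ containing all lines $\overline{xy}$ with $x \in X$, $y \in Y$, and $x \neq y$ (see [14], 8.1 for details). Similarly, one defines the join of projective varieties $X_1,\dots,X_h \subset \mathbb{P}^m$, $\vee_{i=1}^{h} X_i$, as the smallest subvariety in $\mathbb{P}^m$ containing all the linear varieties spanned by $x_1,\dots,x_h$ with $x_i \in X_i$ and $x_i \neq x_j$. It is known that

$$\dim\left(\vee_{i=1}^{h} X_i\right) \le \min\left\{\sum_{i=1}^{h} \dim(X_i) + h - 1,\; m\right\}.$$

The right hand side of this inequality is usually known as the expected dimension of $\vee_{i=1}^{h} X_i$.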
For instance, if we consider the join for certain tree topologies T i on the leaf set [n] and a given evolutionary model , then there is a (dominant rational) map
which is the projective closure of the parameterization defined by
Here, is isomorphic to an affine open subset of . In this setting, an h-mixture on {T1,…,T h } corresponds to a point in the variety . We will use this algebraic variety to study the identifiability of phylogenetic mixtures.
When considering unmixed models on trivalent trees on n taxa, generic identifiability of the tree topology is equivalent to the projective phylogenetic varieties of $T$ and $T'$ being different when $T \neq T'$ (see [31]). The identifiability of the continuous parameters must take into account the possibility of permuting the labels of the states at the interior nodes, as such permutations give rise to the same joint distribution at the leaves. In the language of algebraic geometry, generic identifiability of the continuous parameters of the model implies that the parameterization map of $T$ is generically finite (i.e. the preimage of a generic point is a finite number of points; see [31]). In this case, the fiber dimension theorem ([14], Theorem 11.12) applies and the dimension of the projective phylogenetic variety of $T$ is equal to the number of stochastic parameters of the model. Therefore, if the continuous parameters are generically identifiable for the unmixed trees under $\mathcal{M}$, then the dimension of this variety is the same for all trivalent tree topologies on n taxa. This dimension is denoted by $s_{\mathcal{M}}(n)$.
Example 26. The tree topologies and the continuous parameters are generically identifiable for the unmixed equivariant models JC69, K80, K81, SSM, and GMM on trees with any number of leaves (see [9] and [6], Corollary 3.9).
From now on we only consider trees without nodes of degree 2, so that the number of free stochastic parameters on a phylogenetic tree on n taxa under is .
We recall the definition of generic identifiability of the tree topologies on h-mixtures (see [10]).
Definition 27. The tree topologies on h-mixtures under are generically identifiable if for any set of trivalent tree topologies {T1,…,T h } and a generic choice of , the equality
for tree topologies and implies
In terms of algebraic varieties this is equivalent to saying that the variety is not contained in and vice versa.
The tree topologies are the discrete parameters of h-mixtures. When considering the continuous parameters of h-mixtures, the above mentioned label-swapping can be disregarded. We give the following definition according to [12].
Definition 28. The continuous parameters of h-mixtures on under an evolutionary model are generically identifiable if, for a generic choice of stochastic parameters , the equality
for stochastic parameters implies that there is a permutation such that , , and for . In other words, we only allow swapping of the continuous parameters when at least two tree topologies coincide.
Definition 29. An h-mixture under a model is said to be identifiable if both its tree topologies and its continuous parameters are generically identifiable.
In terms of algebraic varieties, generic identifiability of continuous parameters on h-mixtures implies that the generic fibers (i.e. preimages of generic points) of the map are finite. In this case, the fiber dimension theorem applied to (7) (cf. [14], Theorem 11.12) gives
The following result demonstrates the need for careful inspection of identifiability of mixtures with many components (i.e. large values of h).
Theorem 30. Let [n] be a set of taxa and be an evolutionary model for which the continuous parameters are generically identifiable on trivalent (unmixed) trees. In addition, let be the dimension of for any trivalent tree T, and set . Then the h-mixtures of trees on [n] evolving under are not identifiable for h ≥ h0(n).
Remark 31. Note that, in the above definition of h0(n), also depends on n.
Corollary 32. Let [n] be a set of taxa and be one of the equivariant models JC69, K80, K81, SSM, or GMM. Then the phylogenetic h-mixtures under these models are not identifiable for h ≥ h0(n), where
Proof. Theorem 17 describes the space of phylogenetic mixtures as a linear space, and Proposition 20 gives the dimension of this space in each case. Next, we calculate: $s_{GMM}(n) = 12(2n-3) + 3$, $s_{SSM}(n) = 6(2n-3) + 1$, $s_{K81}(n) = 3(2n-3)$, $s_{K80}(n) = 2(2n-3)$, and $s_{JC69}(n) = 2n-3$. Applying Theorem 30, we conclude the proof. □
Example 33. Consider the Kimura 3-parameter model K81 on n = 4 taxa. For any h ≥ 4, phylogenetic h-mixtures are not identifiable by Corollary 32. We are not aware of any result proving that mixtures of 2 or 3 different tree topologies under this model are identifiable (either for the tree parameters or for the continuous parameters).
Example 34. Consider the Jukes-Cantor model JC69 on n = 4 taxa. Then Corollary 32 tells us that for h ≥ 3, h-mixtures are not identifiable. Therefore, for this particular model on four taxa the cases in which the identifiability holds are known: the tree and the continuous parameters are generically identifiable for the unmixed model; the tree parameters are generically identifiable for 2-mixtures ([10], Theorem 10); the continuous parameters are generically identifiable for 2-mixtures on different tree topologies and not identifiable for the same tree topology ([10], Theorem 23); neither the continuous parameters nor the tree topologies are generically identifiable for mixtures with more than two components (Corollary 32).
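The thresholds quoted in Examples 33 and 34 can be recovered with a short computation. The displayed definition of h0(n) is not reproduced in the text above, so the sketch below works under an assumption: h0(n) is taken to be the smallest h for which the expected dimension h(s + 1) − 1 of the join reaches the projective dimension dim(L) − 1 of the linear space, i.e. the ceiling of dim(L)/(s + 1). This reading agrees with both examples. The dimensions for SSM, K81 and K80 are the complements of the codimensions stated in the proof of Theorem 22; the JC69 dimension is obtained by counting orbits, and 4^n is assumed for GMM. The function names are illustrative.

```python
def dim_linear_space(model, n):
    """Dimension of the linear space of each model on n >= 3 taxa (see assumptions above)."""
    return {
        "GMM": 4 ** n,
        "SSM": 2 ** (2 * n - 1),
        "K81": 4 ** (n - 1),
        "K80": 2 ** (2 * n - 3) + 2 ** (n - 2),
        "JC69": (4 ** (n - 1) + 3 * 2 ** (n - 1) + 2) // 6,
    }[model]

def num_parameters(model, n):
    """s_M(n) as listed in the proof of Corollary 32 (trivalent tree, 2n - 3 edges)."""
    edges = 2 * n - 3
    return {
        "GMM": 12 * edges + 3,
        "SSM": 6 * edges + 1,
        "K81": 3 * edges,
        "K80": 2 * edges,
        "JC69": edges,
    }[model]

def h0(model, n):
    """Assumed threshold: smallest h with h * (s + 1) >= dim(L), i.e. a ceiling division."""
    return -(-dim_linear_space(model, n) // (num_parameters(model, n) + 1))

if __name__ == "__main__":
    for model in ("GMM", "SSM", "K81", "K80", "JC69"):
        print(model, dim_linear_space(model, 4), num_parameters(model, 4), h0(model, 4))
```

For n = 4 this returns 4 for K81 (dimension 64, s = 15) and 3 for JC69 (dimension 15, s = 5), matching the two examples.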
Proof of Theorem 30. Let . Then the variety has dimension ≤ edim(h). Indeed, as is a parameterization of an open subset of , then the dimension of is less than or equal to . Moreover, the dimension of is equal to if T i is trivalent (since the continuous parameters for the unmixed models under consideration are generically identifiable) and is less than for non-trivalent trees. Therefore, .
If all T i are trivalent trees, then and, therefore, if and only if . Moreover, by the fiber dimension theorem applied to , the equality of dimensions holds if and only if the generic fiber of has dimension 0. In particular, if , then the continuous parameters of this phylogenetic mixture are not identifiable.
As , we have that . Now, we fix an with h ≥ h0(n), so that .
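There are two possible scenarios:
(a) For any set of tree topologies $\{T_1,\dots,T_h\}$, the dimension of the corresponding mixture variety is less than the dimension of the projective space of the linear space associated to $\mathcal{M}$.
(b) There exists a set of tree topologies $\{T_1,\dots,T_h\}$ for which these two dimensions coincide.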
Case (a) implies that the dimension of is less than edim(h) for any set of trivalent tree topologies . Based on the conclusions drawn above, this implies that the continuous parameters are not generically identifiable.
In case (b), coincides with . Indeed, is contained in , both varieties are irreducible, and , which implies that both varieties coincide. In particular, any h-mixture (which is a point in ) would be contained in , and therefore the topologies are not generically identifiable.
Remark 35. The negative result of Theorem 30 should be complemented with the following positive result of Rhodes and Sullivant in [12]: if $\mathcal{M}$ = GMM and one restricts to h-mixtures on the same trivalent tree topology T, then the tree topology and the continuous parameters are generically identifiable provided that h is smaller than an explicit bound depending on n given there.
Conclusions
In this paper, we have dealt with the space of phylogenetic mixtures for evolutionary equivariant models. We have shown that for the case of the Jukes-Cantor model, the Kimura models with two or three parameters, the strand symmetric model and the general Markov model, this linear space is defined by the set of linear equations satisfied by the distributions of the patterns at the leaves of a tree that evolves under that model. It follows that this space completely characterizes the model. The use of tools from group theory and group representation theory played a major role, and allowed us to design a procedure to produce minimal systems of equations for these spaces and for any number of taxa. This procedure has been implemented successfully in a new method for model selection in phylogenetics based on linear invariants (see [7]), which is available online at http://genome.crg.es/cgi-bin/phylo_mod_sel/AlgModelSelection.pl.
In the last part of the paper, we proved new results concerning the identifiability of phylogenetic mixtures. Namely, we provided an upper bound for the number of components (classes) of a mixture so that the identifiability of both the continuous and the discrete parameters is still possible.
References
Posada D: The effect of branch length variation on the selection of models of molecular evolution. J Mol Evol. 2001, 52: 434-444.
Felsenstein J: Inferring Phylogenies. 2004, Sunderland: Sinauer Associates, Inc.
Fu YX, Li W: Construction of linear invariants in phylogenetic inference. Math Biosci. 1992, 109 (2): 201-228. 10.1016/0025-5564(92)90045-X
Steel M, Hendy M, Székely L, Erdős P: Spectral analysis and a closest tree method for genetic sequences. Appl Math Lett. Int J Rapid Publication. 1992, 5: 63-67.
Draisma J, Kuttler J: On the ideals of equivariant tree models. Mathematische Annalen. 2009, 344: 619-644. 10.1007/s00208-008-0320-6
Casanellas M, Fernández-Sánchez J: Relevant phylogenetic invariants of evolutionary models. J de Mathématiques Pures et Appliquées. 2011, 96: 207-229.
Kedzierska A, Drton M, Guigó R, Casanellas M: SPIn: model selection for phylogenetic mixtures via linear invariants. Mol Biol Evol. 2012, 29: 929-937. 10.1093/molbev/msr259
Semple C, Steel M: Phylogenetics, Volume 24 of Oxford Lecture Series in Mathematics and its Applications. 2003, Oxford: Oxford University Press
Allman E, Rhodes J: The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J Comput Biol. 2006, 13: 1101-1113. 10.1089/cmb.2006.13.1101
Allman ES, Petrovic S, Rhodes JA, Sullivant S: Identifiability of two-tree mixtures for group-based models. IEEE/ACM Trans Comput Biol Bioinformatics. 2011, 8: 710-720.
Stefanovic D, Vigoda E: Phylogeny of mixture models: Robustness of maximum likelihood and non-identifiable distributions. J Comput Biol. 2007, 14: 156-189. 10.1089/cmb.2006.0126
Rhodes J, Sullivant S: Identifiability of large phylogenetic mixture models. Bull Math Biol. 2012, 74: 212-231. 10.1007/s11538-011-9672-2
Chai J, Housworth EA: On Rogers’s proof of identifiability for the GTR + Gamma + I model. Syst Biol. 2011, 60 (5): 713-718. 10.1093/sysbio/syr023
Harris J: Algebraic Geometry. A First Course, Volume 133 of Graduate Texts in Mathematics. 1992, New York: Springer-Verlag
Serre J: Linear Representations of Finite Groups. 1977, [Translated from the second French edition by Leonard L. Scott, Graduate Texts in Mathematics, Vol. 42], New York: Springer-Verlag
Sumner J, Fernández-Sánchez J, Jarvis P: On Lie Markov models. J Theor Biol. 2012, 298: 16-31.
Allman E, Rhodes J: Phylogenetic invariants for stationary base composition. J Symbolic Comput. 2006, 41 (2): 138-150. 10.1016/j.jsc.2005.04.004
Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism, Volume 3. Edited by: Munro HN. 1969, New York: Academic Press, 21-32.
Kimura M: A simple method for estimating evolutionary rates of base substitution through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16: 111-120. 10.1007/BF01731581
Kimura M: Estimation of evolutionary distances between homologous nucleotide sequences. Proc Nat Acad Sci. 1981, 78: 454-458. 10.1073/pnas.78.1.454
Casanellas M, Sullivant S: The strand symmetric model. Algebraic Statistics for Computational Biology. Edited by: Pachter L, Sturmfels B. 2005, New York: Cambridge University Press
Chang JT: Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci. 1996, 137: 51-73. 10.1016/S0025-5564(96)00075-2
Allman E, Rhodes J: Phylogenetic ideals and varieties for the general Markov model. Adv Appl Math. 2008, 40: 127-148. 10.1016/j.aam.2006.10.002
Allman E, Rhodes J: Phylogenetic invariants for the general Markov model of sequence mutation. Math Biosci. 2003, 186 (2): 113-144. 10.1016/j.mbs.2003.08.004
Steel M, Hendy M, Penny D: Reconstructing phylogenies from nucleotide pattern probabilities: a survey and some new results. Discrete Appl Math. 1998, 88 (1-3): 367-396. 10.1016/S0166-218X(98)00080-8
Matsen F, Mossel E, Steel M: Mixed-up trees: The structure of phylogenetic mixtures. Bull Math Biol. 2008, 70: 1115-1139. 10.1007/s11538-007-9293-y
Lake J: A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol. 1987, 4: 167-191.
Cavender J, Felsenstein J: Invariants of phylogenies in a simple case with discrete states. J Classif. 1987, 4: 57-71. 10.1007/BF01890075
Allman E, Rhodes J: Quartets and parameter recovery for the general Markov model of sequence mutation. Appl Math Res Express. 2004, 2004 (4): 107-131. 10.1155/S1687120004020283
Sturmfels B, Sullivant S: Toric ideals of phylogenetic invariants. J Comput Biol. 2005, 12: 204-228. 10.1089/cmb.2005.12.204
Allman E, Rhodes J: Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math Biosci. 2008, 211: 18-33. 10.1016/j.mbs.2007.09.001
Acknowledgements
All authors are partially supported by Generalitat de Catalunya, 2009 SGR 1284. Research of the first and second authors was partially supported by Ministerio de Educación y Ciencia MTM2009-14163-C02-02 (Spain). Research of the third author was partially supported by grant BIO2011-26205 from the Ministerio de Educación y Ciencia (Spain). We would like to thank both referees for very useful comments that led to major improvements.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors contributed equally and the author names order is alphabetical. All authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Casanellas, M., Fernández-Sánchez, J. & Kedzierska, A.M. The space of phylogenetic mixtures for equivariant models. Algorithms Mol Biol 7, 33 (2012). https://doi.org/10.1186/1748-7188-7-33
DOI: https://doi.org/10.1186/1748-7188-7-33