The space of phylogenetic mixtures for equivariant models

Background The selection of an evolutionary model to best fit given molecular data is usually a heuristic choice. In his seminal book, J. Felsenstein suggested that certain linear equations satisfied by the expected probabilities of patterns observed at the leaves of a phylogenetic tree could be used for model selection. It remained an open question, however, whether these equations were sufficient to fully characterize the evolutionary model under consideration. Results Here we prove that, for most equivariant models of evolution, the space of distributions satisfying these linear equations coincides with the space of distributions arising from mixtures of trees. In other words, we prove that the evolution of an observed multiple sequence alignment can be modeled by a mixture of phylogenetic trees under an equivariant evolutionary model if and only if the distribution of patterns at its columns satisfies the linear equations mentioned above. Moreover, we provide a set of linearly independent equations defining this space of phylogenetic mixtures for each equivariant model and for any number of taxa. Lastly, we use these results to perform a study of identifiability of phylogenetic mixtures. Conclusions The space of phylogenetic mixtures under equivariant models is a linear space that fully characterizes the evolutionary model. We provide an explicit algorithm to obtain the equations defining these spaces for a number of models and taxa. Its implementation has proved to be a powerful tool for model selection.


Background
The principal goal of phylogenetics is to reconstruct the ancestral relationships among organisms. Most popular phylogenetic reconstruction methods are based on mathematical models describing the molecular evolution of DNA. In spite of this, there exists no unified framework for model selection and the results are highly dependent on the models and methods used in the analysis (cf. [1]).
In this paper we assume the Darwinian model of evolution proceeding along phylogenetic trees and address the following question: how can the data evolving under a particular model be characterized? In other words, we look for invariants of the DNA patterns which have evolved following a tree (or a mixture of trees, as we will see below) under a particular model. The answer to this question provided in this paper leads to a complete characterization.
In what follows, we briefly explain the motivation for this work. It has been shown that if the evolution along a phylogenetic tree is described by a particular model, the expected probabilities of nucleotide patterns at the leaves of the tree satisfy certain equalities (see e.g. [2], p.375). Several authors (e.g. [2][3][4]) pointed out that these equalities could potentially be used to test the fitness of the model of base change. The full set of equations required for viable model selection, however, was unknown. The objective of this work is to fill this gap and to go a step further into practical application by providing an algorithm to compute the required invariants for model selection.
In this work we consider a group of equivariant models ([5,6]). These models are Markov processes on trees, whose transition matrices satisfy certain symmetries: the Jukes-Cantor model, the Kimura 2 and 3 parameter models, the strand symmetric model, and the general Markov model. Our first important result, Theorem 17, states that if evolution occurs according to trees (or even mixtures of trees) under these equivariant models, then the model of evolution is completely determined by the linear space defined by the aforementioned equalities. By exhaustively studying the group of symmetries of these models, we also give a straightforward combinatorial way of determining the equations of this linear space (see Theorem 22). The implementation of the algorithm producing the equations is available as a package SPIn ([7], http://genome.crg.es/cgi-bin/phylo_mod_sel/AlgModelSelection.pl), which has proved to be a successful tool in evolutionary model selection.
http://www.almob.org/content/7/1/33
Our main technique consists in proving that the linear space above coincides with the space $\mathcal{D}_M$ of phylogenetic mixtures evolving under the model M, i.e. the set of points that are linear combinations of points lying in the phylogenetic varieties $CV_T^M$ (see Preliminaries section for specific definitions). In biological words and in the stochastic context, this is the set of vectors of expected pattern frequencies for mixtures of trees evolving under the model M (not necessarily the same tree topology in the mixture, and not necessarily the same transition matrices when the tree topologies coincide). In phylogenetics, the so-called i.i.d. hypothesis (independent and identically distributed) about the sites of an alignment is prevalent in the simplest models. When the assumption "identically distributed" is replaced by "distributed according to the same evolutionary model", one obtains a phylogenetic mixture.
Phylogenetic mixtures are useful in modeling heterogeneous evolutionary processes, e.g. data comprising multiple genes, selected codon positions, or rate variation across sites (e.g. [8]). Among a plethora of applications, they are used in orthology predictions, gene and genome annotations, species tree reconstructions, and drug target identifications.
In addition to the main result, we determine the dimension of these linear spaces and use it to give an upper bound, $h_0(n)$, on the number of mixtures that should be used in phylogenetic reconstruction on n taxa. This relates to the so-called identifiability problem in phylogenetic mixtures, which can be posed as determining the conditions that guarantee that the model parameters (discrete parameters in the form of tree topologies and the continuous parameters of the root and model distributions) can be recovered from the data. Identifiability is crucial for consistency of the maximum likelihood approaches and, though extensively studied in the phylogenetic context, few results are known (see for instance [9][10][11][12][13]).
In brief, in Theorem 30 we prove that either the tree topologies or the continuous parameters are not generically identifiable for mixtures on more than $h_0(n)$ trees under equivariant models. Here $h_0(n)$ is the quotient of the dimension of the linear space $\mathcal{D}_M$ (computed in Proposition 20) by the number of free parameters of M on a trivalent tree plus one. For example, for four taxa and the Jukes-Cantor model (resp. the …) this result proves that mixtures on three (resp. four) or more trees are not identifiable (i.e. either the discrete or the continuous parameters cannot be fully identified). A detailed discussion on this subject is provided in the last section.
The main tools used in this work are algebraic geometry and group theory. The reader is referred to [14,15] for general references on these topics.

Preliminaries
Phylogenetic trees and Markov models of evolution have been widely used in the literature. In what follows we fix the notation needed to deal with them in our setting.
Let n be a positive integer and denote by [n] the set {1, 2, . . . , n}. A phylogenetic tree T on the set of taxa [n] is a tree (i.e. a connected graph with no cycles), whose n leaves are bijectively labeled by [n]. Its vertices represent species or other biological entities and its edges represent evolutionary processes between the vertices.
We allow internal vertices of any degree and if all the internal vertices are of degree 3 we say that the tree is trivalent. We will denote the set of vertices of T by N(T), the set of edges by E(T), and the set of interior nodes by Int(T). A rooted tree is a tree together with a distinguished node r called the root. The root induces an orientation on the edges of T, whereby the root represents the common ancestor to all the species represented in the tree. If e is an edge of a rooted tree T, we write pa(e) and ch(e) for its parent vertex (origin) and its child vertex (end), respectively. Two unrooted phylogenetic trees on the set of taxa [n] are said to have the same tree topology if their labeled graphs have the same topology.
We fix a positive integer k and an ordered set $B = \{b_1, b_2, \ldots, b_k\}$. For example, for most applications we take B = {A, C, G, T} to be the set of nucleotides in a DNA sequence. We may think of B as the set of states of a discrete random variable. We call W the complex vector space $W = \langle B \rangle_{\mathbb{C}}$ spanned by B, so that B is a natural basis of W. For algebraic convenience, we usually work over the complex field and restrict to the stochastic setting when necessary. Vectors in W are thought of as probability distributions on the set of states B if their coordinates are non-negative and sum to one. In this setting the vector $\sum_i c_i b_i$ means that observation $b_i$ occurs with probability $c_i$. From now on, we will identify vectors in W with their coordinates in the basis B written as a column vector, e.g. we identify $\sum_{i=1}^{k} b_i$ with the vector $\mathbf{1} = (1, 1, \ldots, 1)^t \in W$.
In order to model molecular evolution on a phylogenetic tree T, we consider a Markov process specified by a root distribution, $\pi \in W$, and a collection of transition matrices, $A = (A_e)_{e \in E(T)}$, where each $A_e$ is a k × k matrix in End(W). The matrices $A_e$ represent the conditional probabilities of substitution between the states in B from the parent node pa(e) to the child node ch(e) of e. We adopt the convention that the matrices $A_e$ act on W from the right, i.e. a vector $\omega^t$ in pa(e) maps to $\omega^t A_e$ in ch(e).
Distinct forms of the transition matrices give rise to different evolutionary models. Using the terminology introduced above, we proceed to the definition of evolutionary models used throughout this work.
Definition 1. An (algebraic) evolutionary model M is specified by giving a vector subspace $W_0 \subset W$ such that $\mathbf{1}^t \pi \neq 0$ for some π in $W_0$, together with a multiplicatively closed vector subspace Mod (for model) of $M_k(\mathbb{C})$ containing the identity matrix. We will usually denote such a model by $M = (W_0, \mathrm{Mod})$. We define the stochastic evolutionary model $sM = (sW_0, s\mathrm{Mod})$ associated to M by taking $sW_0 = \{\pi \in W_0 : \mathbf{1}^t \pi = 1\}$ and $s\mathrm{Mod} = \{A \in \mathrm{Mod} : A\mathbf{1} = \mathbf{1}\}$. The term "stochastic" refers to the fact that, by restricting to the points in the spaces with non-negative real entries, we obtain distributions and Markov matrices. A phylogenetic tree T together with the parameters π and $A = (A_e)_{e \in E(T)}$ is said to evolve under the algebraic evolutionary model M if $\pi \in W_0$, and all matrices $A_e$ lie in Mod.

Remark 2.
Note that $sW_0$ and sMod are not vector spaces. The condition $\mathbf{1}^t \pi \neq 0$ in the above definition means that the sum of the coordinates of π is not zero. Since vectors in $sW_0$ with non-negative coordinates represent the probability distributions for the set of observations B, this condition implies no restriction from a biological point of view. Moreover, it ensures that $W_0 \cap \{\sum_{x \in B} \pi_x = 1\}$ has dimension equal to $\dim(W_0) - 1$. In particular, the simplex of stochastic vectors in $W_0$ will form a semialgebraic set of $\langle B \rangle_{\mathbb{R}}$ of dimension equal to $\dim(W_0) - 1$ (as expected).

Remark 3.
The subspace Mod of substitution matrices is usually required to be multiplicatively closed (as in the definition above) so that when two evolutionary processes are concatenated, the final process is of the same kind. The importance of this requirement is the starting point of [16], where a different approach to the definition of "evolutionary model" is provided.
Our definition of evolutionary models includes most of the well-known evolutionary models, namely those given in [17] and the equivariant models (see [5,6]).
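The closure requirement discussed in Remark 3 can be checked numerically on a concrete family. The following is only a minimal sketch: it assumes the usual Kimura 3-parameter shape, in which the entry (i, j) depends only on the XOR of the state indices (states ordered A, C, G, T), and the helper names `k81`, `matmul` and `is_k81` are illustrative, not taken from the paper or from SPIn.

```python
from fractions import Fraction as F

def k81(a, b, c, d):
    """K81-shaped 4x4 matrix: entry (i, j) equals params[i XOR j] (assumed shape)."""
    params = (a, b, c, d)
    return [[params[i ^ j] for j in range(4)] for i in range(4)]

def matmul(X, Y):
    """Plain 4x4 matrix product with exact arithmetic."""
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def is_k81(M):
    """True if every entry (i, j) agrees with the first-row entry at i XOR j."""
    return all(M[i][j] == M[0][i ^ j] for i in range(4) for j in range(4))

# Two stochastic matrices of the assumed K81 shape (rows sum to one).
A1 = k81(F(7, 10), F(1, 10), F(3, 20), F(1, 20))
A2 = k81(F(4, 5), F(1, 20), F(1, 10), F(1, 20))

product = matmul(A1, A2)
closed = is_k81(product)                                 # concatenated process has the same shape
rows_sum_to_one = all(sum(row) == 1 for row in product)  # and is still stochastic
```

Exact fractions are used so that the shape test is an equality check rather than a floating-point comparison.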

Example 4.
Let G be a permutation group of B, that is, a group whose elements are permutations of the set B, $G \le S_k$. Given g ∈ G, write $P_g$ for the k × k permutation matrix corresponding to g: $(P_g)_{i,j} = 1$ if g(j) = i and 0 otherwise.
The G-equivariant evolutionary model $M_G$ is defined by taking Mod equal to $M(G) = \{A \in M_k(\mathbb{C}) \mid P_g A P_g^{-1} = A \text{ for all } g \in G\}$, and $W_0 = \{\pi \in W \mid P_g \pi = \pi \text{ for all } g \in G\}$. These subsets are vector subspaces of $M_k(\mathbb{C})$ and W, respectively. Moreover, if $A_1, A_2 \in M(G)$, then $A_1 A_2 \in M(G)$, since $P_g A_1 A_2 P_g^{-1} = (P_g A_1 P_g^{-1})(P_g A_2 P_g^{-1}) = A_1 A_2$ for all g ∈ G. Therefore, equivariant models provide a wide family of examples of algebraic evolutionary models in the sense of Definition 1. For example, if B = {A, C, G, T}, it can be seen that the algebraic versions of the Jukes-Cantor model [18], the Kimura models with 2 or 3 parameters [19,20], the strand symmetric model [21] or the general Markov model [22] are instances of equivariant models.
Given an evolutionary model M and a phylogenetic tree T, we define the space of parameters as $\mathrm{Par}_M(T) = W_0 \times \prod_{e \in E(T)} \mathrm{Mod}$. Similarly, we define the space of stochastic parameters associated to T by $\mathrm{Par}_{sM}(T) = sW_0 \times \prod_{e \in E(T)} s\mathrm{Mod}$.
Though artificial at first glance, the use of tensors in the framework that includes the distributions on the set of patterns in B at the leaves of a phylogenetic tree is a natural choice. Indeed, if $p_{x_1 x_2 \ldots x_n}$ denotes the joint probability of observing $x_1$ at leaf 1, $x_2$ at leaf 2, and so on, up to $x_n$ at leaf n, then the vector $p = (p_{x_1 x_2 \ldots x_n})_{x_1, \ldots, x_n \in B}$ provides a distribution on the set of patterns in B at the leaves of T, and this can be regarded as the tensor having these coordinates in the natural basis, $p = \sum_{x_1, \ldots, x_n \in B} p_{x_1 x_2 \ldots x_n}\, x_1 \otimes x_2 \otimes \cdots \otimes x_n$. This motivates the following definition.
Definition 5. Given a phylogenetic tree T on the set of taxa [n], an [n]-tensor is any element of the tensor power $L := \otimes^n W$.
Given an algebraic evolutionary model M and a phylogenetic tree T with root r, every Markov process on T (specified by a collection of parameters π and $A = (A_e)_{e \in E(T)}$) gives rise to a tensor in L in the following way: we consider a parametrization
$$\Psi_T^M : \mathrm{Par}_M(T) \to L, \qquad (\pi, A) \mapsto \sum_{x_1, \ldots, x_n \in B} p_{x_1 \ldots x_n}\, x_1 \otimes \cdots \otimes x_n, \qquad p_{x_1 \ldots x_n} = \sum_{x_v \in B,\ v \in \mathrm{Int}(T)} \pi_{x_r} \prod_{e \in E(T)} A_e(x_{\mathrm{pa}(e)}, x_{\mathrm{ch}(e)}), \qquad (1)$$
where $x_v$ denotes the state at the vertex v, pa(e) (resp. ch(e)) is the parent (resp. child) node of e, and $\pi_x$, x ∈ B, are the coordinates of π.
When restricted to the stochastic matrices and distributions in $W_0$, this parametrization corresponds to the hidden Markov process on the tree T (the leaves correspond to the observed random variables and the interior nodes to the hidden variables). The parametrization (1) restricts to another polynomial map $\phi_T^M : \mathrm{Par}_{sM}(T) \to L$. Because we work in the algebraic setting, the use of the word "stochastic" in this paper is more general than usual, as we only request entries summing to one.
From now on, we will refer to this restriction as the stochastic parametrization $\phi_T^M$. It is important to note that when we consider the distributions in $sW_0$ and the Markov matrices in sMod, their image by $\phi_T^M$ lies in the standard simplex in L, and thus in the hyperplane H of tensors whose coordinates sum to one. This in turn implies that the whole image Im $\phi_T^M$ is contained in H. We proceed to define the algebraic varieties associated to the parametrization maps defined above. Roughly speaking, algebraic varieties are sets of solutions to systems of polynomial equations (e.g. [14]).
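The stochastic parametrization can be evaluated by brute force on a small tree, summing over the hidden states at the interior nodes. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it uses a quartet tree with leaves 1, 2 attached to the root r and leaves 3, 4 attached to a second interior node v, Jukes-Cantor matrices with arbitrarily chosen off-diagonal entries, and a uniform root distribution. The function and variable names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

STATES = "ACGT"

def jc69(b):
    """Jukes-Cantor matrix with off-diagonal entry b; rows sum to one."""
    return [[1 - 3 * b if i == j else b for j in range(4)] for i in range(4)]

def joint_distribution(root, edges, leaves, pi):
    """Leaf pattern probabilities: sum over interior states of
    pi[x_root] * product over edges of A_e[x_parent][x_child]."""
    nodes = {root} | {child for _, child in edges}
    interior = sorted(nodes - set(leaves))
    p = {}
    for leaf_states in product(range(4), repeat=len(leaves)):
        total = 0
        for hidden in product(range(4), repeat=len(interior)):
            state = dict(zip(leaves, leaf_states))
            state.update(zip(interior, hidden))
            term = pi[state[root]]
            for (parent, child), A in edges.items():
                term *= A[state[parent]][state[child]]
            total += term
        p["".join(STATES[s] for s in leaf_states)] = total
    return p

# Quartet tree 12|34 rooted at the interior node r (illustrative parameters).
edges = {
    ("r", "1"): jc69(F(1, 20)),
    ("r", "2"): jc69(F(1, 10)),
    ("r", "v"): jc69(F(1, 25)),
    ("v", "3"): jc69(F(1, 20)),
    ("v", "4"): jc69(F(3, 50)),
}
pi = [F(1, 4)] * 4
p = joint_distribution("r", edges, ["1", "2", "3", "4"], pi)
```

With these parameters the 256 coordinates sum to one, so the resulting tensor lies in the hyperplane H, and coordinates related by a simultaneous relabeling of the states coincide, as expected for the Jukes-Cantor model.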
Below we explain the reason for the notation $CV_T^M$, which was adopted from [23].
The reader may note that the position of the root r of T played a role in the above parameterizations. It can be shown, however, that under certain mild assumptions, Im $\Psi_T^M$ and Im $\phi_T^M$ are independent of the root position in the following sense: if two phylogenetic trees have the same topology as unrooted trees, then the smallest algebraic varieties containing the corresponding image sets are the same. For example, any model $M = (W_0, \mathrm{Mod})$ satisfying (i) $\tilde{\pi}^t := \pi^t A$ belongs to $W_0$ for all $\pi \in W_0$ and all A ∈ Mod, and (ii) $D_{\tilde{\pi}}^{-1} A^t D_{\pi} \in \mathrm{Mod}$ whenever $D_{\tilde{\pi}}^{-1}$ exists (here $D_{\omega}$ denotes the diagonal matrix with the entries of ω on the diagonal and zeros elsewhere) has this property (in this case, we say the model is root-independent). It is not difficult to check that the equivariant models satisfy these two properties (e.g. adapting the proof of [24] or [25]). For technical reasons, from now on we consider only the evolutionary models satisfying (i) and (ii). Indeed, in this case the notation $CV_T^M$ refers to the fact that the phylogenetic variety is just the cone over the stochastic phylogenetic variety (see Figure 1 and the remark below).
This is well known for the general Markov model [23] and can be easily generalized to any model satisfying (i) and (ii).

The space of phylogenetic mixtures
In phylogenetics, the hypothesis that the sites of an alignment are independent and identically distributed is often used. When the assumption "identically distributed" is replaced by "distributed according to the same evolutionary model", one obtains a phylogenetic mixture. Below, we introduce phylogenetic mixtures from the algebraic point of view (see also [26]).

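Definition 8. Fix a set of taxa [n] and an algebraic evolutionary model M. A phylogenetic mixture (on m classes) or m-mixture is any vector $p \in L$ of the form
$$p = \sum_{i=1}^{m} \alpha_i p^i,$$
where $\alpha_i \in \mathbb{C}$ and $p^i \in \mathrm{Im}(\Psi_{T_i}^M)$ for some tree topologies $T_1, \ldots, T_m$ on [n]. The set of all phylogenetic mixtures is the space $\mathcal{D}_M$.
As mentioned in the introduction, the tree topologies contained in the mixture can be the same or different. An example of a phylogenetic mixture is the data modeled by the discrete Gamma-rates models (see e.g. [8]).
Restricting matrix rows to sum to one requires restricting the phylogenetic mixtures to the points of the form $p = \sum_{i=1}^{m} \alpha_i p^i$ with $p^i \in \mathrm{Im}(\phi_{T_i}^M)$ and $\sum_{i} \alpha_i = 1$. We call $\mathcal{D}_{sM}$ the space of stochastic phylogenetic mixtures.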

Remark 9.
The phylogenetic variety of a trivalent tree topology contains all phylogenetic varieties of the nontrivalent tree topologies obtained by contracting any of its interior edges. Indeed, the latter are a particular case of the former when the matrices associated to the contracted edges are equal to the identity matrix. It follows that the space of phylogenetic mixtures on the trivalent tree topologies coincides with the space of phylogenetic mixtures on all possible topologies.
The following result was proven by Matsen, Mossel and Steel in [26] for the two state random cluster model but, as proved below, it can be easily generalized to any evolutionary model.
Remark 11. In the proof of the above lemma, we have seen that $\mathcal{D}_M$ and $\mathcal{D}_{sM}$ can be alternatively described as the spaces of mixtures obtained from the respective varieties $CV_T^M$ and $V_T^M$ (i.e., not only from the images of the parametrization maps).

The space of phylogenetic mixtures for equivariant evolutionary models
This section provides a precise description of the space $\mathcal{D}_M$ of phylogenetic mixtures when M is an equivariant model.

Background on representation theory
We introduce some tools in group representation theory needed in the sequel. We refer the reader to [15] as a classical reference for these concepts. Although some of the following results are valid for any permutation group, for simplicity in the exposition we restrict to permutations of four elements (as our applications deal only with the case B = {A, C, G, T}).
Let $G \le S_4$ be a permutation group. The trivial element in $S_4$ will be denoted as e. We write $\rho_G$ for the restriction to G of the defining representation $\rho : S_4 \to GL(W)$ given by the permutations of the basis B of W. This representation induces a G-module structure on W by setting $g \cdot x := \rho(g)(x) \in W$. In fact, ρ induces a G-module structure on any tensor power $\otimes^s W$ by setting
$$g \cdot (x_1 \otimes \cdots \otimes x_s) := g x_1 \otimes \cdots \otimes g x_s \qquad (3)$$
and extending by linearity. From now on, the space $L = \otimes^n W$ will be implicitly considered as a G-module with this action. We call χ the character associated to the representation $\rho_G : G \to GL(W)$, i.e. χ(g) is the trace of the corresponding permutation matrix or, in other words, χ(g) equals the number of elements of B fixed by the permutation g ∈ G. Then the character associated to the induced representation $G \to GL(\otimes^n W)$ is $\chi^n$, the n-th power of χ.
We write $N_1, \ldots, N_t$ for the irreducible representations of G and $\omega_1, \ldots, \omega_t$ for the corresponding irreducible characters, where $N_1$ and $\omega_1$ will denote the trivial representation and trivial character, respectively. Maschke's Theorem applied to the action of G described in (3) states that there is a decomposition of $\otimes^s W$ into its isotypic components:
$$\otimes^s W = \bigoplus_{i=1}^{t} (\otimes^s W)[\omega_i], \qquad (4)$$
where each $(\otimes^s W)[\omega_i]$ is isomorphic to a number of copies of the irreducible representation $N_i$ associated to $\omega_i$, that is, $(\otimes^s W)[\omega_i] \cong N_i \otimes \mathbb{C}^{m_i(s)}$, for some non-negative integer $m_i(s)$ called the multiplicity of $\otimes^s W$ relative to $\omega_i$. The isotypic component of L associated to the trivial representation will be denoted by $L^G$ and it is composed of the [n]-tensors invariant under the action of G defined in (3). If M is the equivariant evolutionary model associated to G, $L^G$ will also be denoted as $L^M$. It is easy to prove that $CV_T^M \subset L^G$ (see Lemma 4.3 of [5]).
We recall that the set $\Omega_G = \{\omega_i\}_{i=1,\ldots,t}$ of irreducible characters of G forms an orthonormal basis of the space of characters relative to the inner product defined by
$$\langle \alpha, \beta \rangle := \frac{1}{|G|} \sum_{g \in G} \alpha(g)\, \overline{\beta(g)}. \qquad (5)$$
We introduce the following notion.
Definition 12. An n-word over B is an ordered sequence $X = x_1 x_2 \ldots x_n$, where every letter is taken from the alphabet B. The set of n-words is equivalent to the cartesian power $B^n$ and will be denoted by $\mathcal{B}$.
Words will be denoted in typewriter uppercase font (like X) and their letters in lowercase (like x). Sometimes it will be convenient to identify the [n]-tensors of the form $x_1 \otimes \ldots \otimes x_n$ with the n-words $X = x_1 \ldots x_n$. Consequently, we will identify $\mathcal{B}$ with the natural basis of L. Given $X \in \mathcal{B}$, we will denote by $\{X\}_G = \{gX \mid g \in G\}$ the G-orbit of X. We associate a G-invariant tensor, $\tau\{X\}_G$, to each orbit $\{X\}_G$: $\tau\{X\}_G := \sum_{g \in G} gX$. It is straightforward to see that every G-invariant tensor can be written as a linear combination of the tensors $\tau\{X\}_G$, $X \in \mathcal{B}$. On the other hand, the set of different $\tau\{X\}_G$'s is linearly independent, since the corresponding G-orbits $\{X\}_G$ are pairwise disjoint subsets of $\mathcal{B}$.
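The orbits and the associated invariant tensors can be enumerated directly for small n. The sketch below assumes, for illustration, that the group is the Klein four-group generated by (AC)(GT) and (AG)(CT); this choice and all helper names (`perm`, `generate_group`, `act`, `tau`, `orbits`) are assumptions of the example, not notation from the paper.

```python
from collections import Counter
from itertools import product

STATES = "ACGT"

def perm(images):
    """Permutation of STATES given by its images, e.g. 'CATG' is (AC)(GT)."""
    return dict(zip(STATES, images))

def generate_group(generators):
    """Close the identity under right multiplication by the generators."""
    identity = perm(STATES)
    elements, frontier = [identity], [identity]
    while frontier:
        g = frontier.pop()
        for h in generators:
            gh = {s: h[g[s]] for s in STATES}
            if gh not in elements:
                elements.append(gh)
                frontier.append(gh)
    return elements

def act(g, word):
    """Letterwise action of a permutation on an n-word."""
    return "".join(g[x] for x in word)

def tau(group, word):
    """Coefficients of the invariant tensor: sum over g in G of gX."""
    return Counter(act(g, word) for g in group)

def orbits(group, n):
    """Partition of the set of n-words into G-orbits."""
    seen, result = set(), []
    for word in map("".join, product(STATES, repeat=n)):
        if word not in seen:
            orbit = frozenset(act(g, word) for g in group)
            seen |= orbit
            result.append(orbit)
    return result

klein = generate_group([perm("CATG"), perm("GTAC")])
orbit_list = orbits(klein, 3)
tau_AAC = tau(klein, "AAC")
```

Since the orbits partition the set of words, the tensors obtained from distinct orbits have disjoint supports, which is the linear independence argument used above; the number of orbits is then the dimension of the invariant subspace.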

Mixtures for equivariant models
For each x ∈ B, we write $S_G(x)$ for the stabiliser of x under the action of G, that is, $S_G(x) = \{g \in G : g \cdot x = x\}$.

Proof. For any […]
We will explicitly associate a tree topology and parameters (π, A) to it so that the tensor $\tau\{T\}_G$ is equal to […]. We construct a tree T on the set of taxa [n] in the following way. We join each taxon in L Y z to a common node $v_z$ by an edge. Then each vertex $v_z$ is joined to the root of the tree (we call it r) by an edge that we denote as e(z) (see Figure 1). Now, in the edges joining any $v_z$ with some leaf in L Y z, we consider the identity matrix, while the matrix in e(z) is defined by taking […]. Finally, if c is the cardinality of $\{x_0\}_G$, define the distribution at the root $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$ by […]. It is straightforward to check that these matrices and the vector π are G-equivariant, so $(\pi, A) \in \mathrm{Par}_{M_G}(T)$. Now, from (2) and the definition of π, we can write […] where […]

In phylogenetics, an invariant of a phylogenetic tree T is an equation satisfied by the expected distributions of patterns at the leaves of T, irrespective of the continuous parameters of the model M. In the algebraic geometry setting, these are the equations satisfied by all $p \in CV_T^M$. Invariants were introduced by Lake (see [27]) and Cavender and Felsenstein (see [28]). A phylogenetic invariant of T is an invariant of T which is not an invariant of all other phylogenetic trees (under the same model M). Equivalently, f is a phylogenetic invariant of $CV_T^M$ if it is an invariant of $CV_T^M$ and there exists a tree topology T′ such that f is not an invariant of $CV_{T'}^M$. In principle, phylogenetic invariants can be used for tree topology reconstruction purposes.

Remark 16.
(a) It can be seen that the condition of trivial stabiliser for some element of B given in Proposition 13 guarantees that all the irreducible representations of G will be present in the decomposition of W into its isotypic components. Then, by using the results of [6], it follows that the corresponding equivariant model will have no linear phylogenetic invariants. This fact was already known for the models in the above corollary: see [29] for the GMM, [21] for the SSM and [30] for the K81. Here we provided an alternative proof based on elementary tools of group theory.
(b) The models JC69 and K80 are known to have linear phylogenetic invariants, but these are the only linear invariants which do not define hyperplanes containing $L^G$, as can be deduced from [3,30]. In fact, for these two models, the claim of the corollary is still true as stated in the following theorem. Nevertheless, we have not been able to provide a unified proof of this fact because of the different properties of the corresponding groups. There is no description of the space of linear invariants for other equivariant models not listed in Example 4, so we cannot claim that the result below still holds.
This theorem allows us to identify the set of all phylogenetic mixtures $\mathcal{D}_{M_G}$ with $L^G$, which is a vector subspace of L whose linear equations are easy to describe. In other words, $L^G$ is the smallest linear space containing the data coming from any mixture of trees evolving under the model $M_G$. One can therefore use $L^G$ to select the most suitable model for the given data. This has been studied in [7].
The equality $\mathcal{D}_{sM_G} = L^G \cap H$ follows immediately from Lemma 10 and the first assertion in the statement of this theorem.

Remark 18.
We are indebted to one of the referees of this paper for pointing out that the preceding result, as well as the second part of Proposition 13, can also be inferred from Proposition 4.9 of [5]: under the assumption that the stabiliser of some state is trivial, Draisma and Kuttler show that the star tree is the smallest algebraic variety containing the tensors $\tau\{X\}_G$, for pure tensors X (that is, tensors of rank 1). It follows that the set of mixtures on the star tree equals the space $L^G$.

Remark 19.
It is not difficult to check that for M = K81, SSM or GMM, $\mathcal{D}_M$ coincides with the space of mixtures on the star tree (see also [26], where the same result is proven for a 2-state model). On the contrary, this is not true for the JC69 and K80 models because in this case the star tree lies in a smaller linear space as a consequence of the existence of phylogenetic linear equations (see Remark 16(b)).

Equations for the space L G
Our goal here is to compute the dimension of $L^G$ for the groups associated to the equivariant models listed in Example 4, and to list a set of independent linear equations defining this space.
Proposition 20. Using the notations above,
$$\dim L^{SSM} = 2^{2n-1}, \qquad \dim L^{K81} = 4^{n-1}, \qquad \dim L^{K80} = 2^{2n-3} + 2^{n-2}, \qquad \dim L^{JC69} = \frac{2^{2n-3}+1}{3} + 2^{n-2}.$$
Proof. Let M be any equivariant model. By definition, we know that $L^G$ is the isotypic component $(\otimes^n W)[\omega_1]$ of $\otimes^n W$ associated to the trivial representation. Since the dimension of the trivial representation is one, it follows that the dimension of $L^M$ is precisely the multiplicity $m_1(n)$, i.e. the number of times the trivial representation appears in the decomposition of $\otimes^n W$ into isotypic components. This multiplicity $m_1(n)$ equals (see (5))
$$m_1(n) = \langle \chi^n, \omega_1 \rangle = \frac{1}{|G|} \sum_{g \in G} \chi(g)^n.$$
The proof ends by grouping the elements of G in the conjugacy classes of G for SSM, K81, K80, or JC69. Recall that the conjugacy classes of a group G are the disjoint sets of the form $C(g) = \{h^{-1} g h : h \in G\}$. If $C_1, \ldots, C_s$ are the conjugacy classes for G, write $C(G) = (|C_1|, \ldots, |C_s|)$ for the s-tuple of their cardinalities, so that $\sum_{i=1}^{s} |C_i| = |G|$. Recall that $\chi^n(g_1) = \chi^n(g_2)$ whenever $g_1$ and $g_2$ lie in the same conjugacy class, so we can represent $\chi^n$ by an s-tuple $(t_1, \ldots, t_s)$, where $t_i = \chi^n(g)$ for any $g \in C_i$. Thus,
$$m_1(n) = \frac{1}{|G|} \sum_{i=1}^{s} |C_i|\, \chi^n(g_i) = \frac{1}{|G|} \sum_{i=1}^{s} |C_i|\, t_i,$$
where $g_i$ is any element in the conjugacy class $C_i$. The result for M = SSM, K81, K80, or JC69 follows by applying Table 1.
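The character average used in the proof can be evaluated mechanically. The sketch below is only an illustration: the generators chosen for each group (strand complement for SSM, the Klein four-group for K81, the dihedral group preserving {A, G} | {C, T} for K80, and the full symmetric group for JC69) are the customary ones and are an assumption here, as are all function names. The closed forms are those consistent with the codimensions quoted in the proof of Theorem 22.

```python
STATES = "ACGT"

def perm(images):
    """Permutation of STATES given by its images, e.g. 'TGCA' is (AT)(CG)."""
    return dict(zip(STATES, images))

def generate_group(generators):
    """Close the identity under right multiplication by the generators."""
    identity = perm(STATES)
    elements, frontier = [identity], [identity]
    while frontier:
        g = frontier.pop()
        for h in generators:
            gh = {s: h[g[s]] for s in STATES}
            if gh not in elements:
                elements.append(gh)
                frontier.append(gh)
    return elements

# Assumed generators for the group attached to each model.
GROUPS = {
    "SSM": generate_group([perm("TGCA")]),                 # (AT)(CG)
    "K81": generate_group([perm("CATG"), perm("GTAC")]),   # (AC)(GT), (AG)(CT)
    "K80": generate_group([perm("CATG"), perm("GCAT")]),   # (AC)(GT), (AG)
    "JC69": generate_group([perm("CAGT"), perm("CGTA")]),  # (AC), (A C G T)
}

def chi(g):
    """Character of the permutation representation: number of fixed letters."""
    return sum(1 for s in STATES if g[s] == s)

def dim_invariants(group, n):
    """Multiplicity of the trivial character in the n-th tensor power."""
    total = sum(chi(g) ** n for g in group)
    assert total % len(group) == 0
    return total // len(group)

CLOSED_FORMS = {  # valid for n >= 2
    "SSM": lambda n: 2 ** (2 * n - 1),
    "K81": lambda n: 4 ** (n - 1),
    "K80": lambda n: 2 ** (2 * n - 3) + 2 ** (n - 2),
    "JC69": lambda n: (2 ** (2 * n - 3) + 1) // 3 + 2 ** (n - 2),
}

dims_n4 = {model: dim_invariants(group, 4) for model, group in GROUPS.items()}
```

Summing over group elements rather than over conjugacy classes gives the same value; the class grouping in the proof only shortens the hand computation.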
Our next goal is to provide a set of independent linear equations for $L^G$. Before stating the main result, let us introduce some useful notation.
The set $\mathcal{B}_0$ is composed of all n-words with only one letter and it is contained in $\mathcal{B}_{AC|GT}$, $\mathcal{B}_{AG|CT}$, and $\mathcal{B}_{AT|CG}$. Similarly, $\mathcal{B}_2$ is composed of all n-words with two letters at most. It is straightforward to check that $|\mathcal{B}_{AC|GT}| = |\mathcal{B}_{AG|CT}| = |\mathcal{B}_{AT|CG}| = 2^{n+1}$ and $|\mathcal{B}_2| = 3 \cdot 2^{n+1} - 8$.
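The cardinalities just stated can be confirmed by enumeration for small n. Because the defining formulas for these sets did not survive in the text above, the sketch assumes the reading suggested by the stated counts: a word belongs to the set indexed by AC|GT when all its letters lie in {A, C} or all lie in {G, T}, and similarly for the other two pairings. The function names are hypothetical.

```python
from itertools import product

def words(n):
    """All n-words over the alphabet A, C, G, T."""
    return ["".join(w) for w in product("ACGT", repeat=n)]

def split_words(n, first, second):
    """Words whose letters all lie in `first` or all lie in `second` (assumed definition)."""
    return {w for w in words(n) if set(w) <= set(first) or set(w) <= set(second)}

def one_letter_words(n):
    """Words using a single letter."""
    return {w for w in words(n) if len(set(w)) == 1}

def at_most_two_letter_words(n):
    """Words using at most two distinct letters."""
    return {w for w in words(n) if len(set(w)) <= 2}

n = 3
b_ac_gt = split_words(n, "AC", "GT")
b_ag_ct = split_words(n, "AG", "CT")
b_at_cg = split_words(n, "AT", "CG")
b0 = one_letter_words(n)
b2 = at_most_two_letter_words(n)
```

Under this reading the three split sets meet pairwise exactly in the one-letter words and their union is the set of words with at most two letters, which accounts for the count 3 · 2^(n+1) − 8 by inclusion-exclusion.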
We adopt multiplicative notation for n-words, for instance, we write $\mathtt{C}^l$ for the word C . . . C (l times).
[…] $x_n$ and satisfying the condition: if T appears in X, then there is a G in a preceding position.
The number of equations added in each case is $2^{2n-1}$ for SSM, $2^{2n-2}$ for K81, $2^{2n-3} - 2^{n-2}$ for K80, and $(2^{2n-2} - 1)/3$ for JC69.
Before proving this theorem, we explain how these sets of equations were obtained. Notice that a system of linear equations of $L^G$ is given by $\{p_X = p_{gX} : X \in \mathcal{B},\ g \in G\}$. The role played by the G-orbits on $\mathcal{B}$ becomes apparent.
If H is a subgroup of G and $\{g_1 = e, g_2, \ldots, g_{[G:H]}\}$ is a transversal of $H \backslash G$, then
$$\{X\}_G = \bigcup_{i=1}^{[G:H]} \{g_i X\}_H. \qquad (6)$$
This decomposition establishes the connection between the G-orbits and the H-orbits. In order to obtain a system of equations for $L^G$, once $E_H$ has been computed, it is enough to add the equations $p_X = p_{g_i X}$ involving the permutations in the transversal. Notice that the union in (6) is not necessarily disjoint as it may happen that $\{g_i X\}_H = \{g_j X\}_H$ for $i \neq j$. In this case, the equality $p_{g_i X} = p_{g_j X}$ already holds in the space $L^H$ and does not provide any new restriction. In order to avoid this situation and obtain a minimal set of equations for $L^G$, we request the special conditions on the $X \in \mathcal{B}$ in the statement of the theorem.
Proof. For each model M, we prove that the corresponding equations are linearly independent and there are as many equations as the codimension of $L^M$. By Proposition 20, the codimension of $L^M$ is $2^{2n-1}$ for SSM, $3 \cdot 4^{n-1}$ for K81, $7 \cdot 2^{2n-3} - 2^{n-2}$ for K80, and $4^n - \frac{2^{2n-3}+1}{3} - 2^{n-2}$ for JC69. In the sequel, we refer to the groups by the name of the equivariant model associated to them.
Summing up, the linearly independent equations in $E_{JC69}$ define a linear space containing $L^{JC69}$. As their number is equal to the codimension $4^n - \frac{2^{2n-3}+1}{3} - 2^{n-2}$ of $L^{JC69}$, the proof is complete.
All the equalities among orbits used in this proof are summarized in Table 2.

Identifiability of phylogenetic mixtures
Table 2 Equalities among orbits used in the proof of Theorem 22. For each word X in $\mathcal{B}$, the orbit in the model that indexes the column is described. The dots . . . mean the set on the left and " means the set on the top.
In this section we study the identifiability of phylogenetic mixtures. To this end, we use projective algebraic varieties and techniques from algebraic geometry. It is not our intention to give the reader a background on these tools, so we refer to the algebraic geometry book [14] and, more specifically, to [10] for the usage of these techniques in the study of phylogenetic mixtures.
There is a natural isomorphism between the points lying in the hyperplane H considered above, H = {p = […]
Definition 25. Given two projective varieties $X, Y \subset \mathbb{P}^m$, the join of X and Y, $X \vee Y$, is the smallest variety in $\mathbb{P}^m$ containing all lines xy with x ∈ X, y ∈ Y, and $x \neq y$ (see [14], 8.1 for details). Similarly, one defines the join of projective varieties $X_1, \ldots, X_h \subset \mathbb{P}^m$, $\vee_{i=1}^{h} X_i$, as the smallest subvariety in $\mathbb{P}^m$ containing all the linear varieties spanned by $x_1, \ldots, x_h$ with $x_i \in X_i$ and $x_i \neq x_j$. It is known that
$$\dim \vee_{i=1}^{h} X_i \le \min\Big\{ \sum_{i=1}^{h} \dim X_i + h - 1,\ m \Big\}.$$
The right hand side of this inequality is usually known as the expected dimension of the join.
For instance, if we consider the join $\vee_{i=1}^{h} PV_{T_i}^M$ for certain tree topologies $T_i$ on the leaf set [n] and a given evolutionary model M, then there is a (dominant rational) map to this join, labeled (7), which is the projective closure of the parameterization
$$\phi_{T_1} \vee \cdots \vee \phi_{T_h} : (\xi_1, \ldots, \xi_h, a) \mapsto \sum_{i=1}^{h} a_i\, \phi_{T_i}^M(\xi_i), \qquad \xi_i \in \mathrm{Par}_{sM}(T_i),\ a \in \Omega.$$
Here, $\Omega = \{a = (a_1, \ldots, a_h) \mid \sum_i a_i = 1\}$ is isomorphic to an affine open subset of $\mathbb{P}^{h-1}$. In this setting, an h-mixture on $\{T_1, \ldots, T_h\}$ corresponds to a point in the variety $\vee_{i=1}^{h} PV_{T_i}^M$. We will use this algebraic variety to study the identifiability of phylogenetic mixtures.
When considering unmixed models M on trivalent trees on n taxa, generic identifiability of the tree topology is equivalent to the projective varieties $PV_T^M$ and $PV_{T'}^M$ being different when $T \neq T'$ (see [31]). The identifiability of the continuous parameters must take into account the possibility of permuting the labels of the states at the interior nodes, as such permutations give rise to the same joint distribution at the leaves. In the language of algebraic geometry, generic identifiability of the continuous parameters of the model implies that the map $\phi_T^M$ is generically finite (i.e. the preimage of a generic point is a finite number of points; see [31]). In this case, the fiber dimension theorem ([14], Theorem 11.12) applies and we have that $\dim PV_T^M$ is equal to the number of stochastic parameters of the model, $\dim \mathrm{Par}_{sM}(T)$. Therefore, if the continuous parameters are generically identifiable for the unmixed trees under M, then the dimension of the variety $PV_T^M$ is the same for all trivalent tree topologies on n taxa. This dimension is denoted by $s_M(n)$.

Example 26.
The tree topologies and the continuous parameters are generically identifiable for the unmixed equivariant models JC69, K80, K81, SSM, and GMM on trees with any number of leaves (see [9] and [6], Corollary 3.9).
From now on we only consider trees without nodes of degree 2, so that the number of free stochastic parameters on a phylogenetic tree on $n$ taxa under $M$ is at most $s_M(n)$.
We recall the definition of generic identifiability of the tree topologies on $h$-mixtures (see [10]). [...] In terms of algebraic varieties this is equivalent to saying that the variety [...] and vice versa.
The tree topologies are the discrete parameters of $h$-mixtures. When considering the continuous parameters of $h$-mixtures, the above-mentioned label-swapping can be disregarded. We give the following definition according to [12]. [...] $(\xi_1, \dots, \xi_h, a)$, the equality [...] In other words, we only allow swapping of the continuous parameters when at least two tree topologies coincide.

Definition 29.
An h-mixture under a model M is said to be identifiable if both its tree topologies and its continuous parameters are generically identifiable.
In terms of algebraic varieties, generic identifiability of continuous parameters on $h$-mixtures implies that the generic fibers (i.e. preimages of generic points) of the map $\phi_{T_1} \vee \dots \vee \phi_{T_h}$ are finite. In this case, the fiber dimension theorem applied to (7) (cf. [14], Theorem 11.12) gives [...] The following result demonstrates the need for careful inspection of identifiability of mixtures with many components (i.e. large values of $h$). [...]

Example 34.
Consider the Jukes-Cantor model JC69 on $n = 4$ taxa. Then Corollary 32 tells us that for $h \ge 3$, $h$-mixtures are not identifiable. Therefore, for this particular model on four taxa the cases in which the identifiability holds are known: the tree and the continuous parameters are generically identifiable for the unmixed model; the tree parameters are generically identifiable for 2-mixtures ([10], Theorem 10); the continuous parameters are generically identifiable for 2-mixtures on different tree topologies and not identifiable for the same tree topology ([10], Theorem 23); neither the continuous parameters nor the tree topologies are generically identifiable for mixtures with more than two components (Corollary 32).

Proof of Theorem 30. Let $\mathrm{edim}(h) := h\, s_M(n) + h - 1$. Then the variety $\vee_{i=1}^{h} PV_{T_i}$ has dimension $\le \mathrm{edim}(h)$. Indeed, since $\vee_i \phi_{T_i}$ is a parameterization of an open subset of $\vee_{i=1}^{h} PV_{T_i}$, the dimension of $\vee_{i=1}^{h} PV_{T_i}$ is less than or equal to $\sum_{i=1}^{h} \dim PV_{T_i} + h - 1$. Moreover, the dimension of $PV_{T_i}$ is equal to $s_M(n)$ if $T_i$ is trivalent (since the continuous parameters for the unmixed models under consideration are generically identifiable) and is less than $s_M(n)$ for non-trivalent trees. Therefore, $\dim(\vee_{i=1}^{h} PV_{T_i}) \le \mathrm{edim}(h)$. If all $T_i$ are trivalent trees, then $\sum_{i=1}^{h} \dim PV_{T_i} + h - 1 = \mathrm{edim}(h)$ and, therefore, $\dim(\vee_{i=1}^{h} PV_{T_i}) < \mathrm{edim}(h)$ if and [...] Moreover, by the fiber dimension theorem applied to $\vee \phi_{T_i}$, the equality of dimensions holds if and only if the generic fiber of $\vee \phi_{T_i}$ has dimension 0. In particular, if $\dim(\vee_{i=1}^{h} PV_{T_i}) < \mathrm{edim}(h)$, then the continuous parameters of this phylogenetic mixture are not identifiable.
There are two possible scenarios: [...] Case (a) implies that the dimension of $\vee_{i=1}^{h} PV_{T_i}$ is less than $\mathrm{edim}(h)$ for any set of trivalent tree topologies $\{T_1, \dots, T_h\}$. Based on the conclusions drawn above, this implies that the continuous parameters are not generically identifiable. In case (b), [...], which implies that both varieties coincide. In particular, any $h$-mixture (which is a point in $\mathbb{P}(D_M)$) would be contained in $\vee_{i=1}^{h} PV_{T_i}$, and therefore the topologies are not generically identifiable.
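The dimension count behind Example 34 can be sketched numerically. The snippet below is an illustration under stated assumptions, not the authors' procedure. It takes edim(h) = h s_M(n) + h - 1 from the proof above and assumes, as the two scenarios of the proof suggest, that identifiability fails as soon as edim(h) reaches the dimension of P(D_M). The JC69 values for n = 4 are assumptions supplied here and are not quoted from this section: s_M(4) = 5 (one parameter on each of the five edges of a trivalent tree) and dim D_M = 15 (the number of pattern classes under permutations of the four states). The function names are hypothetical.

```python
def edim(h, s):
    """Expected dimension h*s + h - 1 of a join of h varieties of dimension s."""
    return h * s + h - 1


def first_h_reaching(s, ambient_dim):
    """Smallest h >= 1 such that edim(h, s) >= ambient_dim (requires s >= 0)."""
    if s < 0:
        raise ValueError("s must be non-negative")
    h = 1
    while edim(h, s) < ambient_dim:
        h += 1
    return h


# Assumed JC69 values on n = 4 taxa (not taken from this section).
S_JC69_4 = 5          # one parameter per edge, 2*4 - 3 = 5 edges
DIM_PROJ_DM = 15 - 1  # projective dimension of P(D_M) if dim D_M = 15

for h in (1, 2, 3):
    print(h, edim(h, S_JC69_4), edim(h, S_JC69_4) >= DIM_PROJ_DM)

threshold = first_h_reaching(S_JC69_4, DIM_PROJ_DM)
print("first h with edim(h) >= dim P(D_M):", threshold)
```

Under these assumptions the threshold is h = 3, which agrees with the statement in Example 34 that h-mixtures with h ≥ 3 are not identifiable for JC69 on four taxa.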

Conclusions
In this paper, we have dealt with the space of phylogenetic mixtures for equivariant evolutionary models. We have shown that for the Jukes-Cantor model, the Kimura models with two or three parameters, the strand symmetric model and the general Markov model, this linear space is defined by the set of linear equations satisfied by the distributions of the patterns at the leaves of a tree that evolves under that model. It follows that this space completely characterizes the model. The use of tools from group theory and group representation theory played a major role, and allowed us to design a procedure to produce minimal systems of equations for these spaces and for any number of taxa. This procedure has been implemented successfully in a new method for model selection in phylogenetics based on linear invariants (see [7]), which is available online at http://genome.crg.es/cgi-bin/phylo_mod_sel/AlgModelSelection.pl.
In the last part of the paper, we proved new results concerning the identifiability of phylogenetic mixtures. Namely, we provided an upper bound for the number of components (classes) of a mixture so that the identifiability of both the continuous and the discrete parameters is still possible.