The space of phylogenetic mixtures for equivariant models
 Marta Casanellas^{1},
 Jesús Fernández-Sánchez^{1} and
 Anna M Kedzierska^{2}
https://doi.org/10.1186/17487188733
© Casanellas et al; licensee BioMed Central Ltd. 2012
Received: 30 September 2011
Accepted: 19 November 2012
Published: 28 November 2012
Abstract
Background
The selection of an evolutionary model to best fit given molecular data is usually a heuristic choice. In his seminal book, J. Felsenstein suggested that certain linear equations satisfied by the expected probabilities of patterns observed at the leaves of a phylogenetic tree could be used for model selection. It remained an open question, however, whether these equations were sufficient to fully characterize the evolutionary model under consideration.
Results
Here we prove that, for most equivariant models of evolution, the space of distributions satisfying these linear equations coincides with the space of distributions arising from mixtures of trees. In other words, we prove that the evolution of an observed multiple sequence alignment can be modeled by a mixture of phylogenetic trees under an equivariant evolutionary model if and only if the distribution of patterns at its columns satisfies the linear equations mentioned above. Moreover, we provide a set of linearly independent equations defining this space of phylogenetic mixtures for each equivariant model and for any number of taxa. Lastly, we use these results to perform a study of identifiability of phylogenetic mixtures.
Conclusions
The space of phylogenetic mixtures under equivariant models is a linear space that fully characterizes the evolutionary model. We provide an explicit algorithm to obtain the equations defining these spaces for a number of models and taxa. Its implementation has proved to be a powerful tool for model selection.
Background
The principal goal of phylogenetics is to reconstruct the ancestral relationships among organisms. Most popular phylogenetic reconstruction methods are based on mathematical models describing the molecular evolution of DNA. In spite of this, there exists no unified framework for model selection and the results are highly dependent on the models and methods used in the analysis (cf. [1]).
In this paper we assume the Darwinian model of evolution proceeding along phylogenetic trees and address the following question: how can the data evolving under a particular model be characterized? In other words, we look for invariants of the DNA patterns which have evolved following a tree (or a mixture of trees, as we will see below) under a particular model. The answer to this question provided in this paper leads to a complete characterization of the evolutionary model and to a novel model selection tool, which is valid for any mixture of trees.
In what follows, we briefly explain the motivation for this work. It has been shown that if the evolution along a phylogenetic tree is described by a particular model, the expected probabilities of nucleotide patterns at the leaves of the tree satisfy certain equalities (see e.g. [2], p.375). Several authors (e.g. [2–4]) pointed out that these equalities could potentially be used to test the fit of the model of base change. The full set of equations required for viable model selection, however, was unknown. The objective of this work is to fill this gap and to go a step further towards practical application by providing an algorithm to compute the required invariants for model selection.
In this work we consider a group of equivariant models ([5, 6]). These models are Markov processes on trees, whose transition matrices satisfy certain symmetries: the Jukes-Cantor model, the Kimura 2- and 3-parameter models, the strand symmetric model, and the general Markov model. Our first important result, Theorem 17, states that if evolution occurs according to trees (or even mixtures of trees) under these equivariant models, then the model of evolution is completely determined by the linear space defined by the aforementioned equalities. By exhaustively studying the group of symmetries of these models, we also give a straightforward combinatorial way of determining the equations of this linear space (see Theorem 22). The implementation of the algorithm producing the equations is available as a package SPIn ([7], http://genome.crg.es/cgibin/phylo_mod_sel/AlgModelSelection.pl), which has proved to be a successful tool in evolutionary model selection.
Our main technique consists in proving that the linear space above coincides with the space ${\mathcal{D}}_{\mathcal{M}}$ of phylogenetic mixtures evolving under the model $\mathcal{M}$, i.e. the set of points that are linear combinations of points lying in the phylogenetic varieties $C{V}_{T}^{\mathcal{M}}$ (see Preliminaries section for specific definitions). In biological terms and in the stochastic context, this is the set of vectors of expected pattern frequencies for mixtures of trees evolving under the model $\mathcal{M}$ (not necessarily the same tree topology in the mixture, and not necessarily the same transition matrices when the tree topologies coincide). In phylogenetics, the so-called i.i.d. hypothesis (independent and identically distributed) about the sites of an alignment is prevalent in the simplest models. When the assumption “identically distributed” is replaced by “distributed according to the same evolutionary model”, one obtains a phylogenetic mixture.
Phylogenetic mixtures are useful in modeling heterogeneous evolutionary processes, e.g. data comprising multiple genes, selected codon positions, or rate variation across sites (e.g. [8]). Among a plethora of applications, they are used in orthology predictions, gene and genome annotations, species tree reconstructions, and drug target identifications.
In addition to the main result, we determine the dimension of these linear spaces and use it to give an upper bound, h_{0}(n), on the number of mixtures that should be used in phylogenetic reconstruction on n taxa. This relates to the so-called identifiability problem in phylogenetic mixtures, which can be posed as determining the conditions that guarantee that the model parameters (discrete parameters in the form of tree topologies and the continuous parameters of the root and model distributions) can be recovered from the data. Identifiability is crucial for consistency of the maximum likelihood approaches and, though extensively studied in the phylogenetic context, few results are known (see for instance [9–13]).
In brief, in Theorem 30 we prove that either the tree topologies or the continuous parameters are not generically identifiable for mixtures on more than h_{0}(n) trees under equivariant models. Here h_{0}(n) is the quotient of the dimension of the linear space ${\mathcal{D}}_{\mathcal{M}}$ (computed in Proposition 20) by the number of free parameters of $\mathcal{M}$ on a trivalent tree plus one. For example, for four taxa and the Jukes-Cantor model (resp. the Kimura 3-parameter model) this result proves that mixtures on three (resp. four) or more trees are not identifiable (i.e. either the discrete or the continuous parameters cannot be fully identified). A detailed discussion on this subject is provided in the last section.
The main tools used in this work are algebraic geometry and group theory. The reader is referred to [14, 15] for general references on these topics.
Main text
Preliminaries
Phylogenetic trees and Markov models of evolution have been widely used in the literature. In what follows we fix the notation needed to deal with them in our setting.
Let n be a positive integer and denote by [n] the set {1,2,…,n}. A phylogenetic tree T on the set of taxa [n] is a tree (i.e. a connected graph with no loops), whose n leaves are bijectively labeled by [n]. Its vertices represent species or other biological entities and its edges represent evolutionary processes between the vertices.
We allow internal vertices of any degree and if all the internal vertices are of degree 3 we say that the tree is trivalent. We will denote the set of vertices of T by N(T), the set of edges by E(T), and the set of interior nodes by Int(T). A rooted tree is a tree together with a distinguished node r called the root. The root induces an orientation on the edges of T, whereby the root represents the common ancestor to all the species represented in the tree. If e is an edge of a rooted tree T, we write pa(e) and ch(e) for its parent vertex (origin) and its child vertex (end), respectively. Two unrooted phylogenetic trees on the set of taxa [n] are said to have the same tree topology if their labeled graphs have the same topology.
We fix a positive integer k and an ordered set B = {b_{1},b_{2},…,b_{k}}. For example, for most applications we take B = {A,C,G,T} to be the set of nucleotides in a DNA sequence. We may think of B as the set of states of a discrete random variable. We call W the complex vector space $W={\langle B\rangle}_{\mathbb{C}}$ spanned by B, so that B is a natural basis of W. For algebraic convenience, we usually work over the complex field and restrict to the stochastic setting when necessary. Vectors in W are thought of as probability distributions on the set of states B if their coordinates are nonnegative and sum to one. In this setting the vector ${\sum}_{i}{c}_{i}{b}_{i}$ means that observation b_{i} occurs with probability c_{i}. From now on, we will identify vectors in W with their coordinates in the basis B written as a column vector, e.g. we identify ${\sum}_{i}{b}_{i}$ with the vector 1 = (1,1,…,1)^{t} ∈ W.
In order to model molecular evolution on a phylogenetic tree T, we consider a Markov process specified by a root distribution, Π ∈ W, and a collection of transition matrices, A = (A^{e})_{e ∈ E(T)}, where each A^{e} is a k × k matrix in End(W). The matrices A^{e} represent the conditional probabilities of substitution between the states in B from the parent node pa(e) to the child node ch(e) of e. We adopt the convention that the matrices A^{e} act on W from the right, i.e. a vector ω^{t} in pa(e) maps to ω^{t}A^{e} in ch(e).
Distinct forms of the transition matrices give rise to different evolutionary models. Using the terminology introduced above, we proceed to the definition of evolutionary models used throughout this work.
Definition 1. An (algebraic) evolutionary model $\mathcal{M}$ is specified by giving a vector subspace ${W}_{0}\subset W$ such that 1^{t}Π ≠ 0 for some Π in W_{0}, together with a multiplicatively closed vector subspace Mod (for model) of ${M}_{k}\left(\mathbb{C}\right)$ containing the identity matrix. We will usually denote such a model by $\mathcal{M}=({W}_{0},\mathit{\text{Mod}})$. We define the stochastic evolutionary model $s\mathcal{M}=(s{W}_{0},\mathit{\text{sMod}})$ associated to $\mathcal{M}$ by taking s W_{0} = {Π ∈ W_{0} : 1^{t}Π = 1} and sMod = {A ∈ Mod : A 1 = 1}. The term “stochastic” refers to the fact that, by restricting to the points in the spaces with nonnegative real entries, we obtain distributions and Markov matrices. A phylogenetic tree T together with the parameters Π and A = (A^{e})_{e ∈ E(T)} is said to evolve under the algebraic evolutionary model $\mathcal{M}$ if Π ∈ W_{0}, and all matrices A^{e} lie in Mod.
Remark 2. Note that s W_{0} and sMod are not vector spaces. The condition 1^{t}Π ≠ 0 in the above definition means that the sum of the coordinates of Π is not zero. Since vectors in s W_{0} with nonnegative coordinates represent the probability distributions for the set of observations B, this condition implies no restriction from a biological point of view. Moreover, it ensures that ${W}_{0}\cap \{{\sum}_{\mathtt{x}\in B}{\Pi}_{\mathtt{x}}=1\}$ has dimension equal to dim(W_{0}) − 1. In particular, the simplex of stochastic vectors in W_{0} will form a semialgebraic set of ${\langle B\rangle}_{\mathbb{R}}$ of dimension equal to dim(W_{0}) − 1 (as expected).
Remark 3. The subspace Mod of substitution matrices is usually required to be multiplicatively closed (as in the definition above) so that when two evolutionary processes are concatenated, the final process is of the same kind. The importance of this requirement is the starting point of [16], where a different approach to the definition of “evolutionary model” is provided.
Our definition of evolutionary models includes most of the well-known evolutionary models, namely those given in [17] and the equivariant models (see [5, 6]).
and A_{1}A_{2} ∈ M(G). Therefore, equivariant models provide a wide family of examples of algebraic evolutionary models in the sense of Definition 1. For example, if B = {A,C,G,T}, it can be seen that the algebraic versions of the Jukes-Cantor model [18], the Kimura models with 2 or 3 parameters [19, 20], the strand symmetric model [21] or the general Markov model [22] are instances of equivariant models:

if $G={\mathfrak{S}}_{4}$, then ${\mathcal{M}}_{G}$ is the algebraic Jukes-Cantor model JC69,

if G = 〈(A C G T),(A G)〉, then ${\mathcal{M}}_{G}$ is the algebraic Kimura 2-parameter model K80,

if G = 〈(A C)(G T),(A G)(C T)〉, then ${\mathcal{M}}_{G}$ is the algebraic Kimura 3-parameter model K81,

if G = 〈(A T)(C G)〉, then ${\mathcal{M}}_{G}$ is known as the strand symmetric model SSM, and

if G = 〈e〉, then ${\mathcal{M}}_{G}$ is the general Markov model GMM.
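As a quick sanity check on the list above, the following sketch closes each set of generators under composition and reports the order of the resulting permutation group. It is only an illustration: the helper names (`perm`, `compose`, `generate`) are ours, and for the full symmetric group we assume the generating pair (A C G T), (A C), since the text only states $G={\mathfrak{S}}_{4}$.

```python
LETTERS = "ACGT"


def perm(*cycles):
    """Permutation of B given by disjoint cycles, e.g. perm("AC", "GT").

    The result is a tuple holding the images of A, C, G, T (in that order).
    """
    image = {x: x for x in LETTERS}
    for cycle in cycles:
        for i, x in enumerate(cycle):
            image[x] = cycle[(i + 1) % len(cycle)]
    return tuple(image[x] for x in LETTERS)


def compose(g, h):
    """Return the permutation x -> g(h(x))."""
    return tuple(g[LETTERS.index(y)] for y in h)


def generate(generators):
    """Close a list of generators under composition (fine for tiny groups)."""
    identity = tuple(LETTERS)
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


# Generators as listed in the text; the pair used for JC69 is our assumption.
GENERATORS = {
    "JC69": [perm("ACGT"), perm("AC")],
    "K80": [perm("ACGT"), perm("AG")],
    "K81": [perm("AC", "GT"), perm("AG", "CT")],
    "SSM": [perm("AT", "CG")],
    "GMM": [],
}

ORDERS = {model: len(generate(gens)) for model, gens in GENERATORS.items()}
print(ORDERS)
```

The orders obtained this way (24, 8, 4, 2 and 1) reflect how much symmetry each model imposes: the larger the group, the more restrictive the model.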
This motivates the following definition.
x_{ v } denotes the state at the vertex v, pa(e) (resp. ch (e)) is the parent (resp. child) node of e, and Π _{ x },x ∈ B, are the coordinates of Π. When restricted to the stochastic matrices and distributions in W_{0}, this parametrization corresponds to the hidden Markov process on the tree T (the leaves correspond to the observed random variables and the interior nodes to the hidden variables).
The parametrization (1) restricts to another polynomial map ${\varphi}_{T}^{\mathcal{M}}:{\text{Par}}_{s\mathcal{M}}\left(T\right)\to H$, where $H\subset \mathcal{L}$ is the hyperplane defined by $H=\left\{p\in \mathcal{L}\mid {\sum}_{{\mathtt{x}}_{1},\dots ,{\mathtt{x}}_{n}\in B}{p}_{{\mathtt{x}}_{1}\dots {\mathtt{x}}_{n}}=1\right\}$. Because we work in the algebraic setting, the use of the word “stochastic” in this paper is more general than usual, as we only require the entries to sum to one.
From now on, we will refer to this restriction as the stochastic parametrization ${\varphi}_{T}^{\mathcal{M}}$. It is important to note that when we consider the distributions in s W_{0} and the Markov matrices in sMod, their image under ${\varphi}_{T}^{\mathcal{M}}$ lies in the standard simplex in $\mathcal{L}$ (and thus in H). This in turn implies that the whole image $\text{Im}\,{\varphi}_{T}^{\mathcal{M}}$ is contained in H.
We proceed to define the algebraic varieties associated to the parametrization maps defined above. Roughly speaking, algebraic varieties are sets of solutions to systems of polynomial equations (e.g. [14]).
Definition 6. The stochastic phylogenetic variety ${V}_{T}^{\mathcal{M}}$ associated to a phylogenetic tree T is the smallest algebraic variety containing $\text{Im}\,{\varphi}_{T}^{\mathcal{M}}=\left\{{\varphi}_{T}^{\mathcal{M}}({\Pi}_{r},\mathbf{A}):({\Pi}_{r},\mathbf{A})\in {\text{Par}}_{s\mathcal{M}}\left(T\right)\right\}$ (in particular, ${V}_{T}^{\mathcal{M}}\subset H$).
Similarly, the phylogenetic variety $C{V}_{T}^{\mathcal{M}}$ associated to T is the smallest algebraic variety in $\mathcal{L}$ that contains $\text{Im}\,{\mathrm{\Psi}}_{T}^{\mathcal{M}}=\left\{{\mathrm{\Psi}}_{T}^{\mathcal{M}}({\Pi}_{r},\mathbf{A}):({\Pi}_{r},\mathbf{A})\in {\text{Par}}_{\mathcal{M}}\left(T\right)\right\}$.
Below we explain the reason for the notation of $C{V}_{T}^{\mathcal{M}}$, which was adopted from [23].
and ${V}_{T}^{\mathcal{M}}=C{V}_{T}^{\mathcal{M}}\cap H$. This is well known for the general Markov model [23] and can be easily generalized to any model satisfying (i) and (ii).
The space of phylogenetic mixtures
In phylogenetics, the hypothesis that the sites of an alignment are independent and identically distributed is often used. When the assumption “identically distributed” is replaced by “distributed according to the same evolutionary model”, one obtains a phylogenetic mixture. Below, we introduce phylogenetic mixtures from the algebraic point of view (see also [26]).
where ${\alpha}_{i}\in \mathbb{C}$ and ${p}^{i}\in \text{Im}\left({\Psi}_{{T}_{i}}^{\mathcal{M}}\right)$ for some tree topologies T_{i} on the set of taxa [n]. As ${\Psi}_{{T}_{i}}^{\mathcal{M}}$ is a homogeneous map, phylogenetic mixtures are represented by vectors of the form ${\sum}_{i=1}^{m}{\check{p}}^{i}$, where ${\check{p}}^{i}\in \text{Im}\left({\Psi}_{{T}_{i}}^{\mathcal{M}}\right)$. We call ${\mathcal{D}}_{\mathcal{M}}\subset \mathcal{L}$ the space of all phylogenetic mixtures (on any number of classes) under the algebraic evolutionary model $\mathcal{M}$.
As mentioned in the introduction, the tree topologies contained in the mixture can be the same or different. An example of a phylogenetic mixture is the data modeled by the discrete Gamma-rates models (see e.g. [8]).
We call ${\mathcal{D}}_{s\mathcal{M}}$ the space of stochastic phylogenetic mixtures.
Remark 9. The phylogenetic variety of a trivalent tree topology contains all phylogenetic varieties of the nontrivalent tree topologies obtained by contracting any of its interior edges. Indeed, the latter are a particular case of the former when the matrices associated to the contracted edges are equal to the identity matrix. It follows that the space of phylogenetic mixtures on the trivalent tree topologies coincides with the space of phylogenetic mixtures on all possible topologies.
The following result was proven by Matsen, Mossel and Steel in [26] for the two-state random cluster model but, as proved below, it can be easily generalized to any evolutionary model.
Lemma 10. Given a set of taxa [n] and an algebraic evolutionary model $\mathcal{M}$, the set of all phylogenetic mixtures ${\mathcal{D}}_{\mathcal{M}}$ is a vector subspace of $\mathcal{L}$. Similarly, ${\mathcal{D}}_{s\mathcal{M}}$ is a linear variety and it equals ${\mathcal{D}}_{\mathcal{M}}\cap H$.
Proof. ${\mathcal{D}}_{\mathcal{M}}$ is a $\mathbb{C}$-vector space and ${\mathcal{D}}_{s\mathcal{M}}$ is a linear variety by definition. It follows that ${\mathcal{D}}_{\mathcal{M}}$ is an algebraic variety that contains $\text{Im}\,{\Psi}_{T}^{\mathcal{M}}$ for any phylogenetic tree T on the set of taxa [n]. Therefore, it also contains $C{V}_{T}^{\mathcal{M}}$, and ${\mathcal{D}}_{\mathcal{M}}$ equals the set of points of the form $p=\sum {p}_{i}$, where ${p}_{i}\in C{V}_{{T}_{i}}^{\mathcal{M}}$. Similarly, ${\mathcal{D}}_{s\mathcal{M}}$ is an algebraic variety that contains $\text{Im}\,{\varphi}_{T}^{\mathcal{M}}$, so it also contains ${V}_{T}^{\mathcal{M}}$ for any phylogenetic tree T. It follows that ${\mathcal{D}}_{s\mathcal{M}}$ is formed by points of type $q=\sum {\alpha}_{i}{q}_{i}$, where ${q}_{i}\in {V}_{{T}_{i}}^{\mathcal{M}}$ and ${\sum}_{i}{\alpha}_{i}=1$.
and $1=\lambda \left(p\right)={\sum}_{i}\lambda \left({p}^{i}\right)\lambda \left({q}_{i}\right)={\sum}_{i}\lambda \left({p}^{i}\right)$ since each q_{ i } lies on H. This proves that $p\in {\mathcal{D}}_{s\mathcal{M}}$. □
Remark 11. In the proof of the above lemma, we have seen that ${\mathcal{D}}_{\mathcal{M}}$ and ${\mathcal{D}}_{s\mathcal{M}}$ can be alternatively described as the spaces of mixtures obtained from the respective varieties $C{V}_{T}^{\mathcal{M}}$ and ${V}_{T}^{\mathcal{M}}$ (i.e., not only from the images of the parametrization maps).
The space of phylogenetic mixtures for equivariant evolutionary models
This section provides a precise description of the space ${\mathcal{D}}_{\mathcal{M}}$ for the equivariant models $\mathcal{M}$ listed in Example 4 (JC69, K80, K81, SSM, and GMM). First, we recall some definitions and facts of group theory and linear representation theory. From now on, B = {A,C,G,T}, k = 4, $W={\langle B\rangle}_{\mathbb{C}}$, n is fixed and $\mathcal{L}={\otimes}_{\left[n\right]}W$.
Background on representation theory
We introduce some tools in group representation theory needed in the sequel. We refer the reader to [15] as a classical reference for these concepts. Although some of the following results are valid for any permutation group, for simplicity in the exposition we restrict to permutations of four elements (as our applications deal only with the case B = {A,C,G,T}).
and extending by linearity. From now on, the space $\mathcal{L}={\otimes}^{n}W$ will be implicitly considered as a G-module with this action. We call χ the character associated to the representation ρ_{G} : G → GL(W), i.e. χ(g) is the trace of the corresponding permutation matrix or, in other words, χ(g) equals the number of elements of B fixed by the permutation g ∈ G. Then the character associated to the induced representation G → GL(⊗^{n}W) is χ^{n}, the n-th power of χ.
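The statement about χ^{n} can be checked by brute force for small n: since g permutes the basis of n-words, its trace on ⊗^{n}W is the number of n-words it fixes, and a word is fixed exactly when each of its letters is. The sketch below is a minimal illustration with helper names of our own choosing.

```python
from itertools import product

LETTERS = "ACGT"


def perm(*cycles):
    """Permutation of B as a dict letter -> image, built from disjoint cycles."""
    image = {x: x for x in LETTERS}
    for cycle in cycles:
        for i, x in enumerate(cycle):
            image[x] = cycle[(i + 1) % len(cycle)]
    return image


def chi(g):
    """Character of the permutation representation: number of fixed letters."""
    return sum(g[x] == x for x in LETTERS)


def trace_on_tensor_power(g, n):
    """Trace of g acting on n-words: the number of n-words fixed by g."""
    return sum(
        all(g[x] == x for x in word) for word in product(LETTERS, repeat=n)
    )


for g in (perm(), perm("AG"), perm("ACG"), perm("AT", "CG")):
    print(chi(g), [trace_on_tensor_power(g, n) for n in (1, 2, 3)])
```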
where each $\left({\otimes}^{s}W\right)\left[{\omega}_{i}\right]$ is isomorphic to a number of copies of the irreducible representation N_{i} associated to ω_{i}, $\left({\otimes}^{s}W\right)\left[{\omega}_{i}\right]\cong {N}_{i}\otimes {\mathbb{C}}^{{m}_{i}\left(s\right)}$, for some nonnegative integer m_{i}(s) called the multiplicity of ⊗^{s}W relative to ω_{i}. The isotypic component of $\mathcal{L}$ associated to the trivial representation will be denoted by ${\mathcal{L}}^{G}$ and it is composed of the n-tensors invariant under the action of G defined in (3). If $\mathcal{M}$ is the equivariant evolutionary model associated to G, ${\mathcal{L}}^{G}$ will also be denoted as ${\mathcal{L}}^{\mathcal{M}}$. It is easy to prove that $C{V}_{T}^{\mathcal{M}}\subset {\mathcal{L}}^{G}$ (see Lemma 4.3 of [5]).
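The inclusion $C{V}_{T}^{\mathcal{M}}\subset {\mathcal{L}}^{G}$ can be illustrated numerically on a small case. The sketch below rests on two assumptions of ours: that the parametrization has the usual hidden Markov form (a sum over the state at the interior node of a star tree, with the root placed there), and that K81 matrices have the four-parameter pattern written in `k81_matrix`, which is the shape invariant under the Klein group 〈(A C)(G T),(A G)(C T)〉. Exact rational arithmetic is used so that the invariance test is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product

LETTERS = "ACGT"
IDX = {x: i for i, x in enumerate(LETTERS)}

# Klein group of K81 as images of A, C, G, T:
# e, (A C)(G T), (A G)(C T), (A T)(C G)
K81_GROUP = [dict(zip(LETTERS, img)) for img in ("ACGT", "CATG", "GTAC", "TGCA")]


def k81_matrix(b, c, d):
    """Assumed shape of a K81 transition matrix (rows and columns A, C, G, T)."""
    a = 1 - b - c - d
    return [[a, b, c, d], [b, a, d, c], [c, d, a, b], [d, c, b, a]]


def star_tree_tensor(pi, matrices):
    """Joint distribution at the leaves of a star tree rooted at its centre."""
    p = {}
    for word in product(LETTERS, repeat=len(matrices)):
        total = Fraction(0)
        for r in range(4):  # sum over the hidden state at the interior node
            term = pi[r]
            for matrix, x in zip(matrices, word):
                term *= matrix[r][IDX[x]]
            total += term
        p["".join(word)] = total
    return p


def is_invariant(p, group):
    """True if p_X = p_{gX} for every word X and every g in the group."""
    return all(p[X] == p["".join(g[x] for x in X)] for X in p for g in group)


uniform = [Fraction(1, 4)] * 4
edges = [
    k81_matrix(Fraction(1, 10), Fraction(1, 20), Fraction(1, 5)),
    k81_matrix(Fraction(1, 7), Fraction(1, 9), Fraction(1, 11)),
    k81_matrix(Fraction(1, 3), Fraction(1, 30), Fraction(1, 6)),
]
p = star_tree_tensor(uniform, edges)
print(sum(p.values()), is_invariant(p, K81_GROUP))
```

Replacing one of the edge matrices by a matrix without the K81 symmetry (for instance one sending every state to A) breaks the invariance, which is the behaviour a model-selection test based on these linear equations relies on.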
We introduce the following notion.
Definition 12. An n-word over B is an ordered sequence X = x_{1}x_{2} … x_{n}, where every letter is taken from the alphabet B. The set of n-words is equivalent to the cartesian power B^{n} and will be denoted by $\mathcal{B}$.
Words will be denoted in typewriter uppercase font (like X) and their letters in lowercase (like x). Sometimes it will be convenient to identify the n-tensors of the form x_{1} ⊗ … ⊗ x_{n} with the n-words X = x_{1}…x_{n}. Consequently, we will identify $\mathcal{B}$ with the natural basis of $\mathcal{L}$. Given $\mathtt{X}\in \mathcal{B}$, we will denote by {X}_{G} = {g X ∣ g ∈ G} the G-orbit of X. We associate a G-invariant tensor, τ{X}_{G}, to each orbit {X}_{G}: $\tau {\left\{\mathtt{X}\right\}}_{G}:={\sum}_{g\in G}g\mathtt{X}$. It is straightforward to see that every G-invariant tensor can be written as a linear combination of the tensors τ{X}_{G}, $\mathtt{X}\in \mathcal{B}$. On the other hand, the set of different τ{X}_{G}’s is linearly independent, since distinct G-orbits {X}_{G} are disjoint subsets of $\mathcal{B}$.
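The orbit tensors τ{X}_{G} are easy to enumerate for small n. The following sketch stores each τ{X}_{G} as a multiset of words (coefficient = number of group elements sending X to that word) and collects one tensor per orbit. The two groups are hard-coded as lists of letter images, and the function names are ours.

```python
from collections import Counter
from itertools import product

LETTERS = "ACGT"

# Klein group of K81 as images of A, C, G, T:
# e, (A C)(G T), (A G)(C T), (A T)(C G)
K81 = [dict(zip(LETTERS, img)) for img in ("ACGT", "CATG", "GTAC", "TGCA")]
SSM = [K81[0], K81[3]]  # e and (A T)(C G)


def act(g, word):
    """Apply the permutation g letter by letter."""
    return "".join(g[x] for x in word)


def tau(group, word):
    """The tensor tau{X}_G = sum over g in G of gX, as word -> coefficient."""
    return Counter(act(g, word) for g in group)


def invariant_basis(group, n):
    """One tensor tau{X}_G per G-orbit of n-words."""
    basis, seen = [], set()
    for letters in product(LETTERS, repeat=n):
        word = "".join(letters)
        if word not in seen:
            t = tau(group, word)
            seen.update(t)  # mark the whole orbit as visited
            basis.append(t)
    return basis


print(tau(K81, "AAC"))
print(len(invariant_basis(SSM, 3)), len(invariant_basis(K81, 3)))
```

Because the supports of the tensors returned by `invariant_basis` are pairwise disjoint and cover all n-words, their number is the dimension of ${\mathcal{L}}^{G}$.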
Mixtures for equivariant models
For each x ∈ B, we write S_{G}(x) for the stabiliser of x under the action of G, that is, S_{G}(x) = {g ∈ G : g·x = x}.
Proposition 13. Let G be a subgroup of ${\mathfrak{S}}_{4}$ such that S_{ G }(x_{0}) = {e} for some x_{0}∈B. Then every tensor of type τ{X}_{ G }, $\mathtt{X}\in \mathcal{B}$, lies in the image of ${\Psi}_{T}^{{\mathcal{M}}_{G}}$ for some tree topology T. In particular, ${\mathcal{L}}^{G}\subset {\mathcal{D}}_{{\mathcal{M}}_{G}}$.
Proof. For any G-orbit {Y}_{G}, $\mathtt{Y}\in \mathcal{B}$, write $\tau {\left\{\mathtt{Y}\right\}}_{G}={\mathtt{y}}_{1}\otimes \dots \otimes {\mathtt{y}}_{n}+\sum _{g\ne e}g\cdot {\mathtt{y}}_{1}\otimes \dots \otimes g\cdot {\mathtt{y}}_{n}$. We will explicitly associate a tree topology T and parameters (Π,A) to it so that the tensor τ{Y}_{G} is equal to ${\Psi}_{T}^{{\mathcal{M}}_{G}}(\Pi ,\mathbf{A})$. To this aim, we denote by B(Y) the set of letters appearing in Y. Then for every z ∈ B(Y), consider the set ${L}_{\mathtt{z}}^{\mathtt{Y}}=\{i\in [n]:{\mathtt{y}}_{i}=\mathtt{z}\}$, so that ${\cup}_{\mathtt{z}\in B\left(\mathtt{Y}\right)}{L}_{\mathtt{z}}^{\mathtt{Y}}=\left[n\right]$.
 1. x_{z} = g · z, for z ∈ B, and
 2. for each $i\in {L}_{\mathtt{z}}^{\mathtt{Y}}$, x_{i} is equal to x_{z} = g · z,
and ${\Psi}_{T}^{\mathcal{M}}(\Pi ,\mathbf{A})=\tau {\left\{\mathtt{Y}\right\}}_{G}$. Moreover, as the set of τ{Y}_{ G }, for $\mathtt{Y}\in \mathcal{B}$, generates the vector space ${\mathcal{L}}^{G}$, the second claim follows. □
Remark 14. The above result is not true if the hypothesis S_{G}(x_{0}) = {e} is removed. For example, if G = 〈(A C G T),(A G)〉 (so that $\mathcal{M}=\mathtt{K80}$), then S_{G}(A) = S_{G}(G) = {e,(C T)} and S_{G}(C) = S_{G}(T) = {e,(A G)}. In that case, it can be shown that the tensor τ{ACGT}_{G} associated to the G-orbit {ACGT}_{G} is not in $\text{Im}\,{\Psi}_{T}^{\mathtt{K80}}$ for any tree topology T with 4 leaves.
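The stabilisers quoted in Remark 14, and the trivial-stabiliser hypothesis of Proposition 13, can be checked mechanically. The sketch below generates each group from its generators and lists the stabiliser of every letter. Helper names are ours, and permutations are stored as tuples of the images of A, C, G, T.

```python
LETTERS = "ACGT"


def perm(*cycles):
    """Permutation of B from disjoint cycles, as the images of A, C, G, T."""
    image = {x: x for x in LETTERS}
    for cycle in cycles:
        for i, x in enumerate(cycle):
            image[x] = cycle[(i + 1) % len(cycle)]
    return tuple(image[x] for x in LETTERS)


def generate(generators):
    """Close a list of generators under composition."""
    identity = tuple(LETTERS)
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = tuple(s[LETTERS.index(y)] for y in g)  # x -> s(g(x))
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def stabiliser(group, x):
    """Elements of the group fixing the letter x."""
    return {g for g in group if g[LETTERS.index(x)] == x}


K80 = generate([perm("ACGT"), perm("AG")])
K81 = generate([perm("AC", "GT"), perm("AG", "CT")])
SSM = generate([perm("AT", "CG")])

for name, group in (("K80", K80), ("K81", K81), ("SSM", SSM)):
    print(name, {x: len(stabiliser(group, x)) for x in LETTERS})
```

Under these conventions every letter has a stabiliser of order 2 in the K80 group, whereas the K81 and SSM groups act with trivial stabilisers.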
Since the above condition on the group holds for G = 〈e〉, G = 〈(A T)(C G)〉, and G = 〈(A C)(G T),(A G)(C T)〉, we deduce the following claim.
Corollary 15. If G corresponds to any of the equivariant models K81, SSM or GMM, we have ${\mathcal{L}}^{G}\subset {\mathcal{D}}_{{\mathcal{M}}_{G}}$.
In phylogenetics, an invariant of a phylogenetic tree T is an equation satisfied by the expected distributions of patterns at the leaves of T, irrespective of the continuous parameters of the model $\mathcal{M}$. In the algebraic geometry setting, these are the equations satisfied by all $p\in C{V}_{T}^{\mathcal{M}}$. Invariants were introduced by Lake (see [27]) and Cavender and Felsenstein (see [28]). A phylogenetic invariant of T is an invariant of T which is not an invariant of all other phylogenetic trees (under the same model $\mathcal{M}$). Equivalently, f is a phylogenetic invariant of $C{V}_{T}^{\mathcal{M}}$ if it is an invariant of $C{V}_{T}^{\mathcal{M}}$ and there exists a tree topology T^{′} such that f is not an invariant of $C{V}_{{T}^{\prime}}^{\mathcal{M}}$. In principle, phylogenetic invariants can be used for tree topology reconstruction purposes.
Remark 16. (a) It can be seen that the condition of trivial stabiliser for some element of B given in Proposition 13 guarantees that all the irreducible representations of G will be present in the decomposition of W into its isotypic components. Then, by using the results of [6], it follows that the corresponding equivariant model will have no linear phylogenetic invariants. This fact was already known for the models in the above corollary: see [29] for the GMM, [21] for the SSM and [30] for the K81. Here we have provided an alternative proof based on elementary tools of group theory.
(b) The models JC69 and K80 are known to have linear phylogenetic invariants, but these are the only linear invariants which do not define hyperplanes containing ${\mathcal{L}}^{G}$, as can be deduced from [3, 30]. In fact, for these two models, the claim of the corollary is still true as stated in the following theorem. Nevertheless, we have not been able to provide a unified proof of this fact because of the different properties of the corresponding groups. There is no description of the space of linear invariants for other equivariant models not listed in Example 4, so we cannot claim that the result below still holds.
Theorem 17. If ${\mathcal{M}}_{G}$ is one of the equivariant evolutionary models JC69, K80, K81, SSM, or GMM, then the space of phylogenetic mixtures ${\mathcal{D}}_{{\mathcal{M}}_{G}}$ coincides with ${\mathcal{L}}^{G}$, and ${\mathcal{D}}_{s{\mathcal{M}}_{G}}$ equals ${\mathcal{L}}^{G}\cap H$.
This theorem allows us to identify the set of all phylogenetic mixtures ${\mathcal{D}}_{{\mathcal{M}}_{G}}$ with ${\mathcal{L}}^{G}$, which is a vector subspace of $\mathcal{L}$ whose linear equations are easy to describe. In other words, ${\mathcal{L}}^{G}$ is the smallest linear space containing the data coming from any mixture of trees evolving under the model ${\mathcal{M}}_{G}$. One can therefore use ${\mathcal{L}}^{G}$ to select the most suitable model for the given data. This has been studied in [7].
Proof of Theorem 17. For equivariant models we have that $C{V}_{T}^{{\mathcal{M}}_{G}}\subset {\mathcal{L}}^{G}$ for any tree T. Hence, by Lemma 10 and the definition of ${\mathcal{D}}_{{\mathcal{M}}_{G}}$, ${\mathcal{D}}_{{\mathcal{M}}_{G}}$ is a vector subspace of ${\mathcal{L}}^{G}$.
From Corollary 15, we infer the equality ${\mathcal{L}}^{G}={\mathcal{D}}_{{\mathcal{M}}_{G}}$ for the models K81, SSM and GMM. For the other two models, JC69 and K80, it remains to prove that there does not exist any hyperplane Π containing ${\mathcal{D}}_{{\mathcal{M}}_{G}}$ and not containing ${\mathcal{L}}^{G}$. If such a hyperplane existed, then it would contain all the points of $C{V}_{T}^{{\mathcal{M}}_{G}}$ for any tree topology T. It suffices to prove that for these models there are no homogeneous linear polynomials vanishing on all tree topologies, except for the linear equations vanishing on ${\mathcal{L}}^{G}$. This has been seen in Remark 16(b).
The equality ${\mathcal{D}}_{s{\mathcal{M}}_{G}}={\mathcal{L}}^{G}\cap H$ follows immediately from Lemma 10 and the first assertion in the statement of this theorem.
Remark 18. We are indebted to one of the referees of this paper for pointing out that the preceding result, as well as the second part of Proposition 13, can also be inferred from Proposition 4.9 of [5]: under the assumption that the stabiliser of some state is trivial, Draisma and Kuttler show that the phylogenetic variety of the star tree is the smallest algebraic variety containing the tensors τ{X}_{G}, for pure tensors X (that is, tensors of rank 1). It follows that the set of mixtures on the star tree equals the space ${\mathcal{L}}^{G}$.
Remark 19. It is not difficult to check that for $\mathcal{M}=$K81, SSM or GMM, ${\mathcal{D}}_{\mathcal{M}}$ coincides with the space of mixtures on the star tree (see also [26], where the same result is proven for a 2state model). On the contrary, this is not true for JC69 and K80 models because in this case the star tree lies in a smaller linear space as a consequence of the existence of phylogenetic linear equations (see Remark 16(b)).
Equations for the space ${\mathcal{L}}^{\mathit{G}}$
Our goal here is to compute the dimension of ${\mathcal{L}}^{G}$ for the groups associated to the equivariant models listed in Example 4, and to list a set of independent linear equations defining this space.
 (i) $\dim {\mathcal{L}}^{\mathtt{SSM}}={2}^{2n-1}$,
 (ii) $\dim {\mathcal{L}}^{\mathtt{K81}}={4}^{n-1}$,
 (iii) $\dim {\mathcal{L}}^{\mathtt{K80}}={2}^{2n-3}+{2}^{n-2}$, and
 (iv) $\dim {\mathcal{L}}^{\mathtt{JC69}}=\frac{{2}^{2n-3}+1}{3}+{2}^{n-2}$.
Details of the conjugacy classes of some permutation groups needed in the proof of Proposition 20
| $G\le {\mathfrak{S}}_{4}$ | $\mathcal{M}$ | Representatives of conj. classes | Class sizes $\lvert\mathcal{C}(G)\rvert$ | ${\chi}_{\mathcal{C}(G)}^{n}$ |
|---|---|---|---|---|
| 〈(A T)(C G)〉 | SSM | {e, (A T)(C G)} | (1,1) | (4^{n},0) |
| 〈(A C)(G T),(A G)(C T)〉 | K81 | {e, (A T)(C G), (A C)(G T), (A G)(C T)} | (1,1,1,1) | (4^{n},0,0,0) |
| 〈(A C G T),(A G)〉 | K80 | {e, (A C)(G T), (A G)(C T), (A C G T), (A G)} | (1,2,1,2,2) | (4^{n},0,0,0,2^{n}) |
| ${\mathfrak{S}}_{4}$ | JC69 | {e, (A C)(G T), (A C G T), (A G), (A C G)} | (1,3,6,6,8) | (4^{n},0,0,2^{n},1) |
□
Our next goal is to provide a set of independent linear equations for ${\mathcal{L}}^{G}$. Before stating the main result, let us introduce some useful notation.
The set ${\mathcal{B}}_{0}$ is composed of all n-words with only one letter and it is contained in ${\mathcal{B}}_{\mathtt{A}\mathtt{C}\mid \mathtt{G}\mathtt{T}}$, ${\mathcal{B}}_{\mathtt{A}\mathtt{G}\mid \mathtt{C}\mathtt{T}}$, and ${\mathcal{B}}_{\mathtt{A}\mathtt{T}\mid \mathtt{C}\mathtt{G}}$. Similarly, ${\mathcal{B}}_{2}$ is composed of all n-words with two letters at most. It is straightforward to check that $\left|{\mathcal{B}}_{\mathtt{A}\mathtt{C}\mid \mathtt{G}\mathtt{T}}\right|=\left|{\mathcal{B}}_{\mathtt{A}\mathtt{G}\mid \mathtt{C}\mathtt{T}}\right|=\left|{\mathcal{B}}_{\mathtt{A}\mathtt{T}\mid \mathtt{C}\mathtt{G}}\right|={2}^{n+1}$ and $\left|{\mathcal{B}}_{2}\right|=3\cdot {2}^{n+1}-8$.
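These cardinalities can be confirmed by enumeration. The sketch below assumes (our reading of the notation, consistent with the counts just stated) that ${\mathcal{B}}_{\mathtt{A}\mathtt{C}\mid \mathtt{G}\mathtt{T}}$ consists of the n-words written only with letters from {A,C} or only with letters from {G,T}, and similarly for the other two splits.

```python
from itertools import product

LETTERS = "ACGT"
SPLITS = (("AC", "GT"), ("AG", "CT"), ("AT", "CG"))


def words(n):
    """All n-words over B as strings."""
    return ["".join(w) for w in product(LETTERS, repeat=n)]


def block_words(n, split):
    """Words whose letters all lie in a single block of the split (assumed)."""
    return {w for w in words(n) if any(set(w) <= set(block) for block in split)}


def word_sets(n):
    """Return B_0, the three split sets, and B_2 for n-words."""
    b0 = {w for w in words(n) if len(set(w)) == 1}
    parts = [block_words(n, split) for split in SPLITS]
    b2 = {w for w in words(n) if len(set(w)) <= 2}
    return b0, parts, b2


b0, parts, b2 = word_sets(3)
print(len(b0), [len(p) for p in parts], len(b2))
```

With this reading, ${\mathcal{B}}_{2}$ is the union of the three split sets and any two of them meet exactly in ${\mathcal{B}}_{0}$, which is where the correction term 8 in the count comes from.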
We adopt multiplicative notation for $n$-words. For instance, we write $\mathtt{C}^{l}$ for the word $\underbrace{\mathtt{C}\dots\mathtt{C}}_{l}$, and $(\mathtt{A}^{l})(\mathtt{G}^{m})x_{l+m+1}\dots x_{n}$ for $\underbrace{\mathtt{A}\dots\mathtt{A}}_{l}\,\underbrace{\mathtt{G}\dots\mathtt{G}}_{m}\,x_{l+m+1}\dots x_{n}$, where $x_{l+m+1},\dots,x_{n}$ represent any letters.
The main result of this section is the following:
Theorem 22. A set of linearly independent equations $\mathbb{E}^{\mathcal{M}}$ defining $\mathcal{L}^{\mathcal{M}}$ for $\mathcal{M}$ = JC69, K80, K81, or SSM is given by
$\mathbb{E}^{\mathtt{SSM}}$: the equations $p_{\mathtt{X}} = p_{(\mathtt{AT})(\mathtt{CG})\mathtt{X}}$ for all $\mathtt{X}\in\mathcal{B}$ with $x_{1}\in\{\mathtt{A},\mathtt{C}\}$;
$\mathbb{E}^{\mathtt{K81}}$: the equations in $\mathbb{E}^{\mathtt{SSM}}$, and the equations $p_{\mathtt{X}} = p_{(\mathtt{AC})(\mathtt{GT})\mathtt{X}}$ for all $\mathtt{X}\in\mathcal{B}$ with $x_{1} = \mathtt{A}$;
$\mathbb{E}^{\mathtt{K80}}$: the equations in $\mathbb{E}^{\mathtt{K81}}$, plus the equations $p_{\mathtt{X}} = p_{(\mathtt{AG})\mathtt{X}}$ for all $\mathtt{X}\in\mathcal{B}\setminus\mathcal{B}_{\mathtt{AG}\mid\mathtt{CT}}$ having $x_{1} = \mathtt{A}$ and satisfying the following condition: if $\mathtt{T}$ appears in $\mathtt{X}$, then there is a $\mathtt{C}$ in a preceding position;
$\mathbb{E}^{\mathtt{JC69}}$: the equations in $\mathbb{E}^{\mathtt{K80}}$, together with the equations $p_{\mathtt{X}} = p_{(\mathtt{AT})\mathtt{X}}$ for all $\mathtt{X}\in\mathcal{B}_{\mathtt{AC}\mid\mathtt{GT}}\setminus\mathcal{B}_{0}$ of the form $(\mathtt{A}^{l})(\mathtt{C}^{m})x_{l+m+1}\dots x_{n}$; plus the equations $p_{\mathtt{X}} = p_{(\mathtt{AC})\mathtt{X}}$ and $p_{\mathtt{X}} = p_{(\mathtt{AT})\mathtt{X}}$ for all $\mathtt{X}\in\mathcal{B}\setminus\mathcal{B}_{2}$ of the form $(\mathtt{A}^{l})(\mathtt{C}^{m})x_{l+m+1}\dots x_{n}$ and satisfying the condition: if $\mathtt{T}$ appears in $\mathtt{X}$, then there is a $\mathtt{G}$ in a preceding position.
The number of equations added in each case is $2^{2n-1}$ for SSM, $2^{2n-2}$ for K81, $2^{2n-3} - 2^{n-2}$ for K80, and $2^{n-1} - 1 + 2\left(\frac{2^{2n-3}+1}{3} - 2^{n-2}\right)$ for JC69.
Notice that the union in (6) is not necessarily disjoint, as it may happen that $\{g_{i}\mathtt{X}\}_{H} = \{g_{j}\mathtt{X}\}_{H}$ for $i \ne j$. In this case, the equality $p_{g_{i}\mathtt{X}} = p_{g_{j}\mathtt{X}}$ already holds in the space $\mathcal{L}^{H}$ and does not provide any new restriction. In order to avoid this situation and obtain a minimal set of equations for $\mathcal{L}^{G}$, we impose the special conditions on the words $\mathtt{X}\in\mathcal{B}$ in the statement of the theorem.
Proof. For each model $\mathcal{M}$, we prove that the corresponding equations are linearly independent and that there are as many equations as the codimension of $\mathcal{L}^{\mathcal{M}}$. By Proposition 20, the codimension of $\mathcal{L}^{\mathcal{M}}$ is $2^{2n-1}$ for SSM, $3\cdot 4^{n-1}$ for K81, $7\cdot 2^{2n-3} - 2^{n-2}$ for K80, and $4^{n} - \frac{2^{2n-3}+1}{3} - 2^{n-2}$ for JC69. In the sequel, we refer to the groups by the name of the equivariant model associated to them.
SSM: As SSM is the group $\{e,(\mathtt{AT})(\mathtt{CG})\}$, a set of equations for SSM is $\{p_{\mathtt{X}} = p_{(\mathtt{AT})(\mathtt{CG})\mathtt{X}}\}$. Fixing $x_{1}$ in $\{\mathtt{A},\mathtt{C}\}$ we obtain $2^{2n-1}$ linearly independent equations (equations involving different coordinates). The codimension of $\mathcal{L}^{\mathtt{SSM}}$ is equal to $2^{2n-1}$, which coincides with the number of equations given, and thus this set of equations defines $\mathcal{L}^{\mathtt{SSM}}$.
K81: Since a transversal of SSM\K81 is $\{e,(\mathtt{AC})(\mathtt{GT})\}$, the hyperplanes $p_{\mathtt{X}} = p_{(\mathtt{AC})(\mathtt{GT})\mathtt{X}}$ contain $\mathcal{L}^{\mathtt{K81}}$ but not $\mathcal{L}^{\mathtt{SSM}}$. Moreover, using (6) we see that the orbit $\{\mathtt{X}\}_{\mathtt{K81}}$ decomposes into the disjoint union of $\{\mathtt{X}\}_{\mathtt{SSM}}$ and $\{(\mathtt{AC})(\mathtt{GT})\mathtt{X}\}_{\mathtt{SSM}}$ for any $\mathtt{X}\in\mathcal{B}$. Therefore, the equations given for K81 involve different coordinates than those in $\mathbb{E}^{\mathtt{SSM}}$. Requiring $x_{1} = \mathtt{A}$, we obtain $4^{n-1}$ linearly independent new equations. Thus $\mathbb{E}^{\mathtt{K81}}$ defines the space $\mathcal{L}^{\mathtt{K81}}$, because the number of linearly independent equations provided, $2^{2n-1} + 4^{n-1} = 3\cdot 4^{n-1}$, coincides with the codimension of $\mathcal{L}^{\mathtt{K81}}$.
K80: The set $\{e,(\mathtt{AG})\}$ is a transversal of K81\K80. In order to show that the equations provided are linearly independent from those of $\mathbb{E}^{\mathtt{K81}}$, we apply (6) to this transversal to obtain $\{\mathtt{X}\}_{\mathtt{K80}} = \{\mathtt{X}\}_{\mathtt{K81}} \cup \{(\mathtt{AG})\mathtt{X}\}_{\mathtt{K81}}$. If $\mathtt{X}\notin\mathcal{B}_{\mathtt{AG}\mid\mathtt{CT}}$, then $\{(\mathtt{AG})\mathtt{X}\}_{\mathtt{K81}}$ and $\{\mathtt{X}\}_{\mathtt{K81}}$ are disjoint, so each equation $p_{\mathtt{X}} = p_{(\mathtt{AG})\mathtt{X}}$ is linearly independent from $\mathbb{E}^{\mathtt{K81}}$. The set $\mathcal{B}\setminus\mathcal{B}_{\mathtt{AG}\mid\mathtt{CT}}$ has cardinality $4^{n} - 2^{n+1}$ and, if $\mathtt{X}\in\mathcal{B}\setminus\mathcal{B}_{\mathtt{AG}\mid\mathtt{CT}}$, each orbit $\{\mathtt{X}\}_{\mathtt{K80}}$ has cardinality 8. Therefore, the number of different orbits for $\mathtt{X}\in\mathcal{B}\setminus\mathcal{B}_{\mathtt{AG}\mid\mathtt{CT}}$ is $(4^{n} - 2^{n+1})/8 = 2^{2n-3} - 2^{n-2}$. Moreover, the choice of the words $\mathtt{X}$ in $\mathcal{B}\setminus\mathcal{B}_{\mathtt{AG}\mid\mathtt{CT}}$ with $x_{1} = \mathtt{A}$ and satisfying “if $\mathtt{T}$ appears in $\mathtt{X}$, there is a $\mathtt{C}$ in a preceding position” guarantees that we take only one element in each $\{\mathtt{X}\}_{\mathtt{K80}}$, and thus we are adding exactly one equation for each of these orbits. Overall, there are $3\cdot 4^{n-1} + (2^{2n-3} - 2^{n-2}) = 7\cdot 2^{2n-3} - 2^{n-2}$ linearly independent equations in $\mathbb{E}^{\mathtt{K80}}$. This number coincides with the codimension of $\mathcal{L}^{\mathtt{K80}}$, and therefore these equations define $\mathcal{L}^{\mathtt{K80}}$.
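JC69: A transversal of K80\JC69 is $\{e,(\mathtt{AC}),(\mathtt{AT})\}$, therefore (6) applies to give $\{\mathtt{X}\}_{\mathtt{JC69}} = \{\mathtt{X}\}_{\mathtt{K80}} \cup \{(\mathtt{AC})\mathtt{X}\}_{\mathtt{K80}} \cup \{(\mathtt{AT})\mathtt{X}\}_{\mathtt{K80}}$. If $\mathtt{X}\in\mathcal{B}_{\mathtt{AC}\mid\mathtt{GT}}\setminus\mathcal{B}_{0}$, then $\{(\mathtt{AC})\mathtt{X}\}_{\mathtt{K80}} = \{\mathtt{X}\}_{\mathtt{K80}}$ and $\{\mathtt{X}\}_{\mathtt{JC69}}$ is the disjoint union of $\{\mathtt{X}\}_{\mathtt{K80}}$ and $\{(\mathtt{AT})\mathtt{X}\}_{\mathtt{K80}}$. As such, each equation $p_{\mathtt{X}} = p_{(\mathtt{AT})\mathtt{X}}$ is linearly independent from $\mathbb{E}^{\mathtt{K80}}$. Moreover, if $\mathtt{X}\in\mathcal{B}_{\mathtt{AC}\mid\mathtt{GT}}\setminus\mathcal{B}_{0}$ is of the form $(\mathtt{A}^{l})(\mathtt{C}^{m})x_{l+m+1}\dots x_{n}$, we have $2^{n-1} - 1$ such equations and they are linearly independent. Similarly, if $\mathtt{X}\in\mathcal{B}\setminus\mathcal{B}_{2}$, the orbits $\{\mathtt{X}\}_{\mathtt{K80}}$, $\{(\mathtt{AC})\mathtt{X}\}_{\mathtt{K80}}$ and $\{(\mathtt{AT})\mathtt{X}\}_{\mathtt{K80}}$ are pairwise disjoint, so the equations $p_{\mathtt{X}} = p_{(\mathtt{AC})\mathtt{X}}$ and $p_{\mathtt{X}} = p_{(\mathtt{AT})\mathtt{X}}$ are linearly independent from $\mathbb{E}^{\mathtt{K80}}$ and from each other; the conditions in the statement select one word per orbit $\{\mathtt{X}\}_{\mathtt{JC69}}$, which gives $2\left(\frac{2^{2n-3}+1}{3} - 2^{n-2}\right)$ such equations.
Summing up, there are $7\cdot 2^{2n-3} - 2^{n-2} + (2^{n-1} - 1) + 2\left(\frac{2^{2n-3}+1}{3} - 2^{n-2}\right)$ linearly independent equations in $\mathbb{E}^{\mathtt{JC69}}$, each defining a hyperplane that contains $\mathcal{L}^{\mathtt{JC69}}$. As this number is equal to the codimension $4^{n} - \frac{2^{2n-3}+1}{3} - 2^{n-2}$ of $\mathcal{L}^{\mathtt{JC69}}$, the proof is complete.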
Equalities among orbits used in the proof of Theorem 22

| $\mathtt{X}\in$ | $\{\mathtt{X}\}_{\mathtt{GMM}}$ | $\{\mathtt{X}\}_{\mathtt{SSM}}$ | $\{\mathtt{X}\}_{\mathtt{K81}}$ | $\{\mathtt{X}\}_{\mathtt{K80}}$ | $\{\mathtt{X}\}_{\mathtt{JC69}}$ |
| --- | --- | --- | --- | --- | --- |
| $\mathcal{B}_{0}$ | $\{\mathtt{X}\}$ | $\cdots\cup\{(\mathtt{AT})(\mathtt{CG})\mathtt{X}\}$ | $\cdots\cup\{(\mathtt{AC})(\mathtt{GT})\mathtt{X}\}_{\mathtt{SSM}}$ | $\cdots$ | $\cdots$ |
| $\mathcal{B}_{\mathtt{AG}\mid\mathtt{CT}}$ | ” | ” | ” | $\cdots$ | $\cdots\cup\{(\mathtt{AC})\mathtt{X}\}_{\mathtt{K80}}$ |
| $\mathcal{B}_{\mathtt{AC}\mid\mathtt{GT}}$ | ” | ” | ” | $\cdots\cup\{(\mathtt{AG})\mathtt{X}\}_{\mathtt{K81}}$ | $\cdots\cup\{(\mathtt{AT})\mathtt{X}\}_{\mathtt{K80}}$ |
| $\mathcal{B}_{\mathtt{AT}\mid\mathtt{CG}}$ | ” | ” | ” | $\cdots\cup\{(\mathtt{AG})\mathtt{X}\}_{\mathtt{K81}}$ | $\cdots\cup\{(\mathtt{AC})\mathtt{X}\}_{\mathtt{K80}}$ |
| $\mathcal{B}\setminus\mathcal{B}_{2}$ | ” | ” | ” | $\cdots\cup\{(\mathtt{AG})\mathtt{X}\}_{\mathtt{K81}}$ | $\cdots\cup\{(\mathtt{AC})\mathtt{X}\}_{\mathtt{K80}}\cup\{(\mathtt{AT})\mathtt{X}\}_{\mathtt{K80}}$ |
Remark 23. The sets of equations of Theorem 22 have been successfully used in [7] for model selection. Although the dimensions of these linear spaces are exponential in $n$, in practice it is not necessary to consider the full set of equations, but only those involving the patterns observed in the data. This is crucial for the applicability of the method, since the number of different columns in an alignment is very small compared to the dimension of these spaces.
Example 24. As an example, we compute a minimal system of equations for SSM, K81, K80, and JC69 in the case of 3 leaves.
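A system of this kind can be generated mechanically. The following is a minimal sketch, not the authors' implementation: it enumerates the pairs $(\mathtt{X}, g\mathtt{X})$ described in Theorem 22, reading the K81 equations with $g = (\mathtt{AC})(\mathtt{GT})$ and the K80 equations on $\mathcal{B}\setminus\mathcal{B}_{\mathtt{AG}\mid\mathtt{CT}}$ as in the proof, and then checks the count with a union-find structure. All function names are hypothetical.

```python
from itertools import product

LETTERS = "ACGT"


def perm(*cycles):
    """Letter permutation given by disjoint cycles, e.g. perm("AT", "CG")."""
    mapping = {c: c for c in LETTERS}
    for cycle in cycles:
        for i, c in enumerate(cycle):
            mapping[c] = cycle[(i + 1) % len(cycle)]
    return mapping


def act(g, word):
    """Apply the permutation g to every letter of the word."""
    return "".join(g[c] for c in word)


def c_before_t(word):
    """True unless T appears without a C in a preceding position."""
    return "T" not in word or ("C" in word and word.index("C") < word.index("T"))


def theorem22_equations(n):
    """Pairs (X, gX), one per equation p_X = p_gX, for each model."""
    words = ["".join(w) for w in product(LETTERS, repeat=n)]
    at_cg, ac_gt = perm("AT", "CG"), perm("AC", "GT")
    ag, ac, at = perm("AG"), perm("AC"), perm("AT")

    ssm = [(x, act(at_cg, x)) for x in words if x[0] in "AC"]
    k81 = ssm + [(x, act(ac_gt, x)) for x in words if x[0] == "A"]
    # X starts with A and is not in B_{AG|CT}, i.e. it contains a C or a T.
    k80 = k81 + [(x, act(ag, x)) for x in words
                 if x[0] == "A" and not set(x) <= set("AG") and c_before_t(x)]
    jc69 = list(k80)
    for x in words:
        order = "".join(dict.fromkeys(x))  # distinct letters by first appearance
        if order == "AC":                  # X in B_{AC|GT} \ B_0, form A^l C^m ...
            jc69.append((x, act(at, x)))
        elif order in ("ACG", "ACGT"):     # X in B \ B_2, a G precedes any T
            jc69.append((x, act(ac, x)))
            jc69.append((x, act(at, x)))
    return {"SSM": ssm, "K81": k81, "K80": k80, "JC69": jc69}


def count_classes(n, equations):
    """Classes of n-words after identifying X with Y for each equation."""
    parent = {"".join(w): "".join(w) for w in product(LETTERS, repeat=n)}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for x, y in equations:
        parent[find(x)] = find(y)
    return sum(1 for w in parent if parent[w] == w)


for model, eqs in theorem22_equations(3).items():
    classes = count_classes(3, eqs)
    # Equations of the form p_X = p_Y are linearly independent exactly when
    # each of them merges two previously distinct classes.
    assert len(eqs) + classes == 4 ** 3
    print(model, len(eqs), classes)
```

For $n = 3$ this produces 32, 48, 54 and 59 equations for SSM, K81, K80 and JC69 respectively, in agreement with the counts stated in Theorem 22, and the number of remaining classes equals the dimension given in Proposition 20.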
Identifiability of phylogenetic mixtures
In this section we study the identifiability of phylogenetic mixtures. To this end, we use projective algebraic varieties and techniques from algebraic geometry. It is not our intention to give the reader a background on these tools, so we refer to the algebraic geometry book [14] and, more specifically, to [10] for the usage of these techniques in the study of phylogenetic mixtures.
There is a natural isomorphism between the points lying in the hyperplane $H$ considered above, $H=\{p=(p_{\mathtt{A}\dots\mathtt{A}},\dots,p_{\mathtt{T}\dots\mathtt{T}})\in\mathcal{L} : \sum p_{x_{1}\dots x_{n}}=1\}$, and the open affine subset $\{p=[p_{\mathtt{A}\dots\mathtt{A}}:\cdots:p_{\mathtt{T}\dots\mathtt{T}}] : \sum p_{x_{1}\dots x_{n}}\ne 0\}$ of $\mathbb{P}^{4^{n}-1}=\mathbb{P}(\mathcal{L})$. We use the notation $[p_{\mathtt{A}\dots\mathtt{A}}:\cdots:p_{\mathtt{T}\dots\mathtt{T}}]$ for projective coordinates (in contrast to $(p_{\mathtt{A}\dots\mathtt{A}},\dots,p_{\mathtt{T}\dots\mathtt{T}})$ used for affine coordinates). The projective phylogenetic variety $\mathbb{P}V_{T}^{\mathcal{M}}$ associated to a phylogenetic tree $T$ is the projective closure in $\mathbb{P}(\mathcal{L})$ of the image of the stochastic parameterization $\varphi_{T}^{\mathcal{M}}$ defined above. That is, it is the smallest projective variety in $\mathbb{P}(\mathcal{L})$ containing $\mathrm{Im}\,\varphi_{T}^{\mathcal{M}}$ via the above isomorphism.
In what follows, we explain the relationship between this new variety and $CV_{T}^{\mathcal{M}}$ and $V_{T}^{\mathcal{M}}$. By Remark 7, it becomes clear that $CV_{T}^{\mathcal{M}}$ equals the affine cone over the projective phylogenetic variety $\mathbb{P}V_{T}^{\mathcal{M}}$ (for the general Markov model, see also [23], Proposition 1). This implies that $\dim CV_{T}^{\mathcal{M}}=\dim\mathbb{P}V_{T}^{\mathcal{M}}+1$, and if $p=(p_{\mathtt{A}\dots\mathtt{A}},\dots,p_{\mathtt{T}\dots\mathtt{T}})$ belongs to $CV_{T}^{\mathcal{M}}$, then $q := [p_{\mathtt{A}\dots\mathtt{A}}:\cdots:p_{\mathtt{T}\dots\mathtt{T}}]$ belongs to $\mathbb{P}V_{T}^{\mathcal{M}}$. Moreover, if $\lambda :=\sum p_{x_{1}\dots x_{n}}$ is not zero, then $\left(\frac{p_{\mathtt{A}\dots\mathtt{A}}}{\lambda},\dots,\frac{p_{\mathtt{T}\dots\mathtt{T}}}{\lambda}\right)$ is a point in the affine stochastic phylogenetic variety $V_{T}^{\mathcal{M}}$.
Before defining identifiability of mixtures, we consider the following construction of projective algebraic varieties.
The right-hand side of this inequality is usually known as the expected dimension of $\vee_{i=1}^{h}X_{i}$.
Here, $\Omega=\{\mathbf{a}=(a_{1},\dots,a_{h})\mid\sum_{i}a_{i}=1\}$ is isomorphic to an affine open subset of $\mathbb{P}^{h-1}$. In this setting, an $h$-mixture on $\{T_{1},\dots,T_{h}\}$ corresponds to a point in the variety $\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}^{\mathcal{M}}$. We will use this algebraic variety to study the identifiability of phylogenetic mixtures.
When considering unmixed models $\mathcal{M}$ on trivalent trees on $n$ taxa, generic identifiability of the tree topology is equivalent to the projective varieties $\mathbb{P}V_{T}^{\mathcal{M}}$ and $\mathbb{P}V_{T'}^{\mathcal{M}}$ being different when $T\ne T'$ (see [31]). The identifiability of the continuous parameters must take into account the possibility of permuting the labels of the states at the interior nodes, as such permutations give rise to the same joint distribution at the leaves. In the language of algebraic geometry, generic identifiability of the continuous parameters of the model implies that the map $\varphi_{T}^{\mathcal{M}}$ is generically finite (i.e. the preimage of a generic point is a finite number of points; see [31]). In this case, the fiber dimension theorem ([14], Theorem 11.12) applies and we have that $\dim\mathbb{P}V_{T}^{\mathcal{M}}$ is equal to the number of stochastic parameters of the model, $\dim\mathit{Par}_{s\mathcal{M}}(T)$. Therefore, if the continuous parameters are generically identifiable for the unmixed trees under $\mathcal{M}$, then the dimension of the variety $\mathbb{P}V_{T}^{\mathcal{M}}$ is the same for all trivalent tree topologies on $n$ taxa. This dimension is denoted by $s_{\mathcal{M}}(n)$.
Example 26. The tree topologies and the continuous parameters are generically identifiable for the unmixed equivariant models JC69, K80, K81, SSM, and GMM on trees with any number of leaves (see [9] and [6], Corollary 3.9).
From now on we only consider trees without nodes of degree 2, so that the number of free stochastic parameters on a phylogenetic tree on n taxa under $\mathcal{M}$ is $\le {s}_{\mathcal{M}}\left(n\right)$.
We recall the definition of generic identifiability of the tree topologies on $h$-mixtures (see [10]).
In terms of algebraic varieties this is equivalent to saying that the variety ${\vee}_{i=1}^{h}\mathbb{P}{V}_{{T}_{i}}^{\mathcal{M}}$ is not contained in ${\vee}_{i=1}^{h}\mathbb{P}{V}_{{T}_{i}^{\prime}}^{\mathcal{M}}$ and vice versa.
The tree topologies are the discrete parameters of $h$-mixtures. When considering the continuous parameters of $h$-mixtures, the above-mentioned label swapping can be disregarded. We give the following definition according to [12].
for stochastic parameters $(\xi'_{1},\dots,\xi'_{h},\mathbf{a}')$ implies that there is a permutation $\sigma\in\mathfrak{S}_{h}$ such that $\sigma\cdot(T_{1},\dots,T_{h})=(T_{1},\dots,T_{h})$, $\xi'_{i}=\xi_{\sigma(i)}$, and $a'_{i}=a_{\sigma(i)}$ for $i=1,\dots,h$. In other words, we only allow swapping of the continuous parameters when at least two tree topologies coincide.
Definition 29. An $h$-mixture under a model $\mathcal{M}$ is said to be identifiable if both its tree topologies and its continuous parameters are generically identifiable.
The following result demonstrates the need for careful inspection of identifiability of mixtures with many components (i.e. large values of h).
Theorem 30. Let $[n]$ be a set of taxa and $\mathcal{M}$ be an evolutionary model for which the continuous parameters are generically identifiable on trivalent (unmixed) trees. In addition, let $s_{\mathcal{M}}(n)$ be the dimension of $\mathbb{P}V_{T}^{\mathcal{M}}$ for any trivalent tree $T$, and set $h_{0}(n):=\frac{\dim\mathcal{D}_{\mathcal{M}}}{s_{\mathcal{M}}(n)+1}$. Then the $h$-mixtures of trees on $[n]$ evolving under $\mathcal{M}$ are not identifiable for $h\ge h_{0}(n)$.
Remark 31. Note that, in the above definition of $h_{0}(n)$, $\dim\mathcal{D}_{\mathcal{M}}$ also depends on $n$.
Proof. Theorem 17 shows that $\mathcal{L}^{\mathcal{M}}=\mathcal{D}_{\mathcal{M}}$ and Proposition 20 gives the dimension of $\mathcal{L}^{\mathcal{M}}$ in each case. Next, we calculate: $s_{\mathtt{GMM}}(n) = 12(2n-3)+3$, $s_{\mathtt{SSM}}(n) = 6(2n-3)+1$, $s_{\mathtt{K81}}(n) = 3(2n-3)$, $s_{\mathtt{K80}}(n) = 2(2n-3)$, and $s_{\mathtt{JC69}}(n) = 2n-3$. Applying Theorem 30, we conclude the proof. □
Example 33. Consider the Kimura 3-parameter model K81 on $n = 4$ taxa. For any $h\ge 4$, phylogenetic $h$-mixtures are not identifiable by Corollary 32. We are not aware of any result proving that mixtures of 2 or 3 different tree topologies under this model are identifiable (either for the tree parameters or for the continuous parameters).
Example 34. Consider the Jukes-Cantor model JC69 on $n = 4$ taxa. Then Corollary 32 tells us that for $h\ge 3$, $h$-mixtures are not identifiable. Therefore, for this particular model on four taxa the cases in which the identifiability holds are known: the tree and the continuous parameters are generically identifiable for the unmixed model; the tree parameters are generically identifiable for 2-mixtures ([10], Theorem 10); the continuous parameters are generically identifiable for 2-mixtures on different tree topologies and not identifiable for the same tree topology ([10], Theorem 23); neither the continuous parameters nor the tree topologies are generically identifiable for mixtures with more than two components (Corollary 32).
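The thresholds quoted in Examples 33 and 34 follow from the formula for $h_{0}(n)$ in Theorem 30. The sketch below evaluates it with exact rational arithmetic, taking $\dim\mathcal{D}_{\mathcal{M}}$ from the closed forms of Proposition 20 and $s_{\mathcal{M}}(n)$ from the values computed above. The function names are illustrative, and GMM is left out because its dimension is not listed in Proposition 20.

```python
from fractions import Fraction
from math import ceil


def dim_mixture_space(model, n):
    """dim D_M = dim L^M, using the closed forms of Proposition 20 (n >= 2)."""
    return {
        "SSM": 2 ** (2 * n - 1),
        "K81": 4 ** (n - 1),
        "K80": 2 ** (2 * n - 3) + 2 ** (n - 2),
        "JC69": (2 ** (2 * n - 3) + 1) // 3 + 2 ** (n - 2),
    }[model]


def s(model, n):
    """Dimension s_M(n) of the variety of a trivalent tree with 2n - 3 edges."""
    edges = 2 * n - 3
    return {
        "SSM": 6 * edges + 1,
        "K81": 3 * edges,
        "K80": 2 * edges,
        "JC69": edges,
    }[model]


def h0(model, n):
    """h_0(n) = dim D_M / (s_M(n) + 1), kept as an exact fraction."""
    return Fraction(dim_mixture_space(model, n), s(model, n) + 1)


def smallest_nonidentifiable_h(model, n):
    """Smallest integer h with h >= h_0(n)."""
    return ceil(h0(model, n))


for model in ("SSM", "K81", "K80", "JC69"):
    print(model, h0(model, 4), smallest_nonidentifiable_h(model, 4))
```

For $n = 4$ this gives $h_{0} = 64/16 = 4$ for K81 and $h_{0} = 15/6 = 5/2$ for JC69, so the smallest numbers of components covered by Theorem 30 are 4 and 3, as stated in the two examples.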
Proof of Theorem 30. Let $\mathit{edim}(h):=h\,s_{\mathcal{M}}(n)+h-1$. Then the variety $\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}$ has dimension $\le\mathit{edim}(h)$. Indeed, as $\vee_{i}\varphi_{T_{i}}$ is a parameterization of an open subset of $\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}$, the dimension of $\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}$ is less than or equal to $\sum\dim\mathbb{P}V_{T_{i}}+h-1$. Moreover, the dimension of $\mathbb{P}V_{T_{i}}$ is equal to $s_{\mathcal{M}}(n)$ if $T_{i}$ is trivalent (since the continuous parameters for the unmixed models under consideration are generically identifiable) and is less than $s_{\mathcal{M}}(n)$ for non-trivalent trees. Therefore, $\dim\left(\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}\right)\le\mathit{edim}(h)$.
If all $T_{i}$ are trivalent trees, then $\sum\dim\mathbb{P}V_{T_{i}}+h-1=\mathit{edim}(h)$ and, therefore, $\dim\left(\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}\right)<\mathit{edim}(h)$ if and only if $\dim\left(\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}\right)<\sum\dim\mathbb{P}V_{T_{i}}+h-1$. Moreover, by the fiber dimension theorem applied to $\vee\varphi_{T_{i}}$, the equality of dimensions holds if and only if the generic fiber of $\vee\varphi_{T_{i}}$ has dimension 0. In particular, if $\dim\left(\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}\right)<\mathit{edim}(h)$, then the continuous parameters of this phylogenetic mixture are not identifiable.
As $h_{0}(n)=\frac{\dim\mathcal{D}_{\mathcal{M}}}{s_{\mathcal{M}}(n)+1}$, we have that $\mathit{edim}(h_{0}(n))=h_{0}(n)\left(s_{\mathcal{M}}(n)+1\right)-1=\dim\mathcal{D}_{\mathcal{M}}-1$. Now, we fix an $h\in\mathbb{N}$ with $h\ge h_{0}(n)$, so that $\mathit{edim}(h)\ge\dim(\mathcal{D}_{\mathcal{M}})-1$. Then one of the following two cases holds:
 (a) For any set of tree topologies $\{T_{1},\dots,T_{h}\}$, the dimension of $\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}$ is less than $\dim(\mathcal{D}_{\mathcal{M}})-1$.
 (b) There exists a set of tree topologies $\{T_{1},\dots,T_{h}\}$ for which $\dim\left(\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}\right)=\dim(\mathcal{D}_{\mathcal{M}})-1$.
Case (a) implies that the dimension of ${\vee}_{i=1}^{h}\mathbb{P}{V}_{{T}_{i}}$ is less than edim(h) for any set of trivalent tree topologies $\{{T}_{1},\dots ,{T}_{h}\}$. Based on the conclusions drawn above, this implies that the continuous parameters are not generically identifiable.
In case (b), $\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}$ coincides with $\mathbb{P}(\mathcal{D}_{\mathcal{M}})$. Indeed, $\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}$ is contained in $\mathbb{P}(\mathcal{D}_{\mathcal{M}})$, both varieties are irreducible, and $\dim\left(\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}\right)=\dim(\mathcal{D}_{\mathcal{M}})-1=\dim\left(\mathbb{P}(\mathcal{D}_{\mathcal{M}})\right)$, which implies that both varieties coincide. In particular, any $h$-mixture (which is a point in $\mathbb{P}(\mathcal{D}_{\mathcal{M}})$) would be contained in $\vee_{i=1}^{h}\mathbb{P}V_{T_{i}}$, and therefore the topologies are not generically identifiable.
Remark 35. The negative result of Theorem 30 should be complemented with the following positive result of Rhodes and Sullivant in [12]: if $\mathcal{M}$ = GMM and one restricts to $h$-mixtures on the same trivalent tree topology $T$, then the tree topology and the continuous parameters are generically identifiable if $h<4^{\frac{n}{4}-1}$.
Conclusions
In this paper, we have dealt with the space of phylogenetic mixtures for evolutionary equivariant models. We have shown that for the case of the Jukes-Cantor model, the Kimura models with two or three parameters, the strand symmetric model and the general Markov model, this linear space is defined by the set of linear equations satisfied by the distributions of the patterns at the leaves of a tree that evolves under that model. It follows that this space completely characterizes the model. The use of tools from group theory and group representation theory played a major role, and allowed us to design a procedure to produce minimal systems of equations for these spaces and for any number of taxa. This procedure has been implemented successfully in a new method for model selection in phylogenetics based on linear invariants (see [7]), which is available online at http://genome.crg.es/cgibin/phylo_mod_sel/AlgModelSelection.pl.
In the last part of the paper, we proved new results concerning the identifiability of phylogenetic mixtures. Namely, we provided an upper bound for the number of components (classes) of a mixture so that the identifiability of both the continuous and the discrete parameters is still possible.
Declarations
Acknowledgements
All authors are partially supported by Generalitat de Catalunya, 2009 SGR 1284. Research of the first and second authors was partially supported by Ministerio de Educación y Ciencia MTM200914163C0202 (Spain). Research of the third author was partially supported by grant BIO201126205 from the Ministerio de Educación y Ciencia (Spain). We would like to thank both referees for very useful comments that led to major improvements.
References
 1. Posada D: The effect of branch length variation on the selection of models of molecular evolution. J Mol Evol. 2001, 52: 434444.
 2. Felsenstein J: Inferring Phylogenies. 2004, Sunderland: Sinauer Associates, Inc.
 3. Fu YX, Li W: Construction of linear invariants in phylogenetic inference. Math Biosci. 1992, 109 (2): 201228. 10.1016/00255564(92)90045X
 4. Steel M, Hendy M, Székely L, Erdős P: Spectral analysis and a closest tree method for genetic sequences. Appl Math Lett. Int J Rapid Publication. 1992, 5: 6367.
 5. Draisma J, Kuttler J: On the ideals of equivariant tree models. Mathematische Annalen. 2009, 344: 619644. 10.1007/s0020800803206
 6. Casanellas M, Fernandez-Sanchez J: Relevant phylogenetic invariants of evolutionary models. J de Mathématiques Pures et Appliquées. 2011, 96: 207229.
 7. Kedzierska A, Drton M, Guigó R, Casanellas M: SPIn: model selection for phylogenetic mixtures via linear invariants. Mol Biol Evol. 2012, 29: 929937. 10.1093/molbev/msr259
 8. Semple C, Steel M: Phylogenetics, Volume 24 of Oxford Lecture Series in Mathematics and its Applications. 2003, Oxford: Oxford University Press.
 9. Allman E, Rhodes J: The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J Comput Biol. 2006, 13: 11011113. 10.1089/cmb.2006.13.1101
 10. Allman ES, Petrovic S, Rhodes JA, Sullivant S: Identifiability of two-tree mixtures for group-based models. IEEE/ACM Trans Comput Biol Bioinformatics. 2011, 8: 710720.
 11. Stefanovic D, Vigoda E: Phylogeny of mixture models: Robustness of maximum likelihood and non-identifiable distributions. J Comput Biol. 2007, 14: 156189. 10.1089/cmb.2006.0126
 12. Rhodes J, Sullivant S: Identifiability of large phylogenetic mixture models. Bull Math Biol. 2012, 74: 212231. 10.1007/s1153801196722
 13. Chai J, Housworth EA: On Rogers’s proof of identifiability for the GTR + Gamma + I model. Syst Biol. 2011, 60 (5): 713718. 10.1093/sysbio/syr023
 14. Harris J: Algebraic Geometry. A First Course, Volume 133 of Graduate Texts in Mathematics. 1992, New York: Springer-Verlag.
 15. Serre J: Linear Representations of Finite Groups. 1977, [Translated from the second French edition by Leonard L. Scott, Graduate Texts in Mathematics, Vol. 42], New York: Springer-Verlag.
 16. Sumner J, Fernández-Sánchez J, Jarvis P: On Lie Markov models. J Theor Biol. 2012, 298: 1631.
 17. Allman E, Rhodes J: Phylogenetic invariants for stationary base composition. J Symbolic Comput. 2006, 41 (2): 138150. 10.1016/j.jsc.2005.04.004
 18. Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism, Volume 3. Edited by: Munro HN. 1969, 2132. New York: Academic Press.
 19. Kimura M: A simple method for estimating evolutionary rates of base substitution through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16: 111120. 10.1007/BF01731581
 20. Kimura M: Estimation of evolutionary distances between homologous nucleotide sequences. Proc Nat Acad Sci. 1981, 78: 454458. 10.1073/pnas.78.1.454
 21. Casanellas M, Sullivant S: The strand symmetric model. Algebraic Statistics for Computational Biology. Edited by: Pachter L, Sturmfels B. 2005, New York: Cambridge University Press.
 22. Chang JT: Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci. 1996, 137: 5173. 10.1016/S00255564(96)000752
 23. Allman E, Rhodes J: Phylogenetic ideals and varieties for the general Markov model. Adv Appl Math. 2008, 40: 127148. 10.1016/j.aam.2006.10.002
 24. Allman E, Rhodes J: Phylogenetic invariants for the general Markov model of sequence mutation. Math Biosci. 2003, 186 (2): 113144. 10.1016/j.mbs.2003.08.004
 25. Steel M, Hendy M, Penny D: Reconstructing phylogenies from nucleotide pattern probabilities: a survey and some new results. Discrete Appl Math. 1998, 88 (13): 367396. 10.1016/S0166218X(98)000808
 26. Matsen F, Mossel E, Steel M: Mixed-up trees: The structure of phylogenetic mixtures. Bull Math Biol. 2008, 70: 11151139. 10.1007/s115380079293y
 27. Lake J: A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol. 1987, 4: 167191.
 28. Cavender J, Felsenstein J: Invariants of phylogenies in a simple case with discrete states. J Classif. 1987, 4: 5771. 10.1007/BF01890075
 29. Allman E, Rhodes J: Quartets and parameter recovery for the general Markov model of sequence mutation. Appl Math Res Express. 2004, 2004 (4): 107131. 10.1155/S1687120004020283
 30. Sturmfels B, Sullivant S: Toric ideals of phylogenetic invariants. J Comput Biol. 2005, 12: 204228. 10.1089/cmb.2005.12.204
 31. Allman E, Rhodes J: Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math Biosci. 2008, 211: 1833. 10.1016/j.mbs.2007.09.001
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.