The space of phylogenetic mixtures for equivariant models

Casanellas, Marta; Fernández-Sánchez, Jesús; Kedzierska, Anna M

doi:10.1186/1748-7188-7-33

Research
Open access
Published: 28 November 2012


Algorithms for Molecular Biology volume 7, Article number: 33 (2012)


Abstract

Background

The selection of an evolutionary model to best fit given molecular data is usually a heuristic choice. In his seminal book, J. Felsenstein suggested that certain linear equations satisfied by the expected probabilities of patterns observed at the leaves of a phylogenetic tree could be used for model selection. It remained an open question, however, whether these equations were sufficient to fully characterize the evolutionary model under consideration.

Results

Here we prove that, for most equivariant models of evolution, the space of distributions satisfying these linear equations coincides with the space of distributions arising from mixtures of trees. In other words, we prove that the evolution of an observed multiple sequence alignment can be modeled by a mixture of phylogenetic trees under an equivariant evolutionary model if and only if the distribution of patterns at its columns satisfies the linear equations mentioned above. Moreover, we provide a set of linearly independent equations defining this space of phylogenetic mixtures for each equivariant model and for any number of taxa. Lastly, we use these results to perform a study of identifiability of phylogenetic mixtures.

Conclusions

The space of phylogenetic mixtures under equivariant models is a linear space that fully characterizes the evolutionary model. We provide an explicit algorithm to obtain the equations defining these spaces for a number of models and taxa. Its implementation has proved to be a powerful tool for model selection.

Background

The principal goal of phylogenetics is to reconstruct the ancestral relationships among organisms. Most popular phylogenetic reconstruction methods are based on mathematical models describing the molecular evolution of DNA. In spite of this, there exists no unified framework for model selection and the results are highly dependent on the models and methods used in the analysis (cf. [1]).

In this paper we assume the Darwinian model of evolution proceeding along phylogenetic trees and address the following question: how can the data evolving under a particular model be characterized? In other words, we look for invariants of the DNA patterns which have evolved following a tree (or a mixture of trees, as we will see below) under a particular model. The answer to this question provided in this paper leads to a complete characterization of the evolutionary model and to a novel model selection tool, which is valid for any mixture of trees.

In what follows, we briefly explain the motivation for this work. It has been shown that if the evolution along a phylogenetic tree is described by a particular model, the expected probabilities of nucleotide patterns at the leaves of the tree satisfy certain equalities (see e.g. [2], p.375). Several authors (e.g. [2–4]) pointed out that these equalities could potentially be used to test the fitness of the model of base change. The full set of equations required for viable model selection, however, was unknown. The objective of this work is to fill this gap and to go a step further into practical application by providing an algorithm to compute the required invariants for model selection.

In this work we consider a group of equivariant models ([5, 6]). These models are Markov processes on trees, whose transition matrices satisfy certain symmetries: the Jukes-Cantor model, the Kimura 2 and 3 parameter models, the strand symmetric model, and the general Markov model. Our first important result, Theorem 17, states that if evolution occurs according to trees (or even mixtures of trees) under these equivariant models, then the model of evolution is completely determined by the linear space defined by the aforementioned equalities. By exhaustively studying the group of symmetries of these models, we also give a straightforward combinatorial way of determining the equations of this linear space (see Theorem 22). The implementation of the algorithm producing the equations is available as a package SPIn ([7], http://genome.crg.es/cgi-bin/phylo_mod_sel/AlgModelSelection.pl), which has proved to be a successful tool in evolutionary model selection.

Our main technique consists in proving that the linear space above coincides with the space $D_{M}$ of phylogenetic mixtures evolving under the model $M$ , i.e. the set of points that are linear combinations of points lying in the phylogenetic varieties $C V_{T}^{M}$ (see the Preliminaries section for specific definitions). In biological terms and in the stochastic context, this is the set of vectors of expected pattern frequencies for mixtures of trees evolving under the model $M$ (not necessarily the same tree topology in the mixture, and not necessarily the same transition matrices when the tree topologies coincide). In phylogenetics, the so-called i.i.d. hypothesis (independent and identically distributed) about the sites of an alignment is prevalent in the simplest models. When the assumption “identically distributed” is replaced by “distributed according to the same evolutionary model”, one obtains a phylogenetic mixture.

Phylogenetic mixtures are useful in modeling heterogeneous evolutionary processes, e.g. data comprising multiple genes, selected codon positions, or rate variation across sites (e.g. [8]). Among a plethora of applications, they are used in orthology predictions, gene and genome annotations, species tree reconstructions, and drug target identifications.

In addition to the main result, we determine the dimension of these linear spaces and use it to give an upper bound, h₀ (n), on the number of mixtures that should be used in phylogenetic reconstruction on n taxa. This relates to the so-called identifiability problem in phylogenetic mixtures, which can be posed as determining the conditions that guarantee that the model parameters (discrete parameters in the form of tree topologies and the continuous parameters of the root and model distributions) can be recovered from the data. Identifiability is crucial for consistency of the maximum likelihood approaches and, though extensively studied in the phylogenetic context, few results are known (see for instance [9–13]).

In brief, in Theorem 30 we prove that either the tree topologies or the continuous parameters are not generically identifiable for mixtures on more than h₀ (n) trees under equivariant models. Here h₀ (n) is the quotient of the dimension of the linear space $D_{M}$ (computed in Proposition 20) by the number of free parameters of $M$ on a trivalent tree plus one. For example, for four taxa and the Jukes-Cantor model (resp. the Kimura 3-parameter model) this result proves that mixtures on three (resp. four) or more trees are not identifiable (i.e. either the discrete or the continuous parameters cannot be fully identified). A detailed discussion on this subject is provided in the last section.

The main tools used in this work are algebraic geometry and group theory. The reader is referred to [14, 15] for general references on these topics.

Main text

Preliminaries

Phylogenetic trees and Markov models of evolution have been widely used in the literature. In what follows we fix the notation needed to deal with them in our setting.

Let n be a positive integer and denote by [n] the set {1,2,…,n}. A phylogenetic tree T on the set of taxa [n] is a tree (i.e. a connected graph with no loops), whose n leaves are bijectively labeled by [n]. Its vertices represent species or other biological entities and its edges represent evolutionary processes between the vertices.

We allow internal vertices of any degree and if all the internal vertices are of degree 3 we say that the tree is trivalent. We will denote the set of vertices of T by N(T), the set of edges by E(T), and the set of interior nodes by Int(T). A rooted tree is a tree together with a distinguished node r called the root. The root induces an orientation on the edges of T, whereby the root represents the common ancestor to all the species represented in the tree. If e is an edge of a rooted tree T, we write pa(e) and ch(e) for its parent vertex (origin) and its child vertex (end), respectively. Two unrooted phylogenetic trees on the set of taxa [n] are said to have the same tree topology if their labeled graphs have the same topology.

We fix a positive integer k and an ordered set B = {b₁,b₂,…,b_k}. For example, for most applications we take B = {A,C,G,T} to be the set of nucleotides in a DNA sequence. We may think of B as the set of states of a discrete random variable. We call W the complex vector space $W = {〈 B 〉}_{C}$ spanned by B, so that B is a natural basis of W. For algebraic convenience, we usually work over the complex field and restrict to the stochastic setting when necessary. Vectors in W are thought of as probability distributions on the set of states B if their coordinates are non-negative and sum to one. In this setting the vector $\sum c_{i} b_{i}$ means that observation b_i occurs with probability c_i. From now on, we will identify vectors in W with their coordinates in the basis B written as a column vector, e.g. we identify $\sum_{k} b_{k}$ with the vector 1 = (1,1,…,1)^t∈ W.

In order to model molecular evolution on a phylogenetic tree T, we consider a Markov process specified by a root distribution, Π ∈ W, and a collection of transition matrices, A = (A^e)_{e ∈ E (T)}, where each A^e is a k × k-matrix in End(W). The matrices A^e represent the conditional probabilities of substitution between the states in B from the parent node pa(e) to the child node ch(e) of e. We adopt the convention that the matrices A^e act on W from the right, i.e. a vector ω^t in pa(e) maps to ω^t A^e in ch(e).

Distinct forms of the transition matrices give rise to different evolutionary models. Using the terminology introduced above, we proceed to the definition of evolutionary models used throughout this work.

Definition 1. An (algebraic) evolutionary model $M$ is specified by giving a vector subspace $W_{0} \subset W$ such that 1^t Π ≠ 0 for some Π in W₀, together with a multiplicatively closed vector subspace Mod (for model) of $M_{k} (C)$ containing the identity matrix. We will usually denote such a model by $M = (W_{0}, Mod)$ . We define the stochastic evolutionary model $s M = (s W_{0}, sMod)$ associated to $M$ by taking s W₀ = {Π ∈ W₀ : 1^t Π = 1} and sMod = {A ∈ Mod : A 1 = 1}. The term “stochastic” refers to the fact that, by restricting to the points in the spaces with non-negative real entries, we obtain distributions and Markov matrices. A phylogenetic tree T together with the parameters Π and A = (A^e)_{e ∈ E (T)} is said to evolve under the algebraic evolutionary model $M$ if Π ∈ W₀ and all matrices A^e lie in Mod.

Remark 2. Note that s W₀ and sMod are not vector spaces. The condition 1^t Π ≠ 0 in the above definition means that the sum of the coordinates of Π is not zero. Since vectors in s W₀ with non-negative coordinates represent the probability distributions for the set of observations B, this condition implies no restriction from a biological point of view. Moreover, it ensures that $W_{0} \cap \{\sum_{x \in B} Π_{x} = 1\}$ has dimension equal to dim(W₀) − 1. In particular, the simplex of stochastic vectors in W₀ will form a semialgebraic set of ${〈 B 〉}_{R}$ of dimension equal to dim(W₀) − 1 (as expected).

Remark 3. The subspace Mod of substitution matrices is usually required to be multiplicatively closed (as in the definition above) so that when two evolutionary processes are concatenated, the final process is of the same kind. The importance of this requirement is the starting point of [16], where a different approach to the definition of “evolutionary model” is provided.

Our definition of evolutionary models includes most of the well-known evolutionary models, namely those given in [17] and the equivariant models (see [5, 6]).

Example 4. Let G be a permutation group of B, that is, a group whose elements are permutations of the set B, $G \leq S_{k}$ . Given g ∈ G, write P_g for the k × k-permutation matrix corresponding to g: (P_g)_{i,j} = 1 if g(j) = i and 0 otherwise. The G-equivariant evolutionary model $M_{G}$ is defined by taking Mod equal to

M (G) = \{A \in M_{k} (C) ∣ P_{g} A P_{g}^{- 1} = A \text{ for all } g \in G\},

and W₀ = {Π ∈ W ∣ P_gΠ = Π for all g ∈ G}. These subsets are vector subspaces of $M_{k} (C)$ and W, respectively. Moreover, if A₁,A₂ ∈ M(G), then

P_{g} A_{1} A_{2} P_{g}^{- 1} = (P_{g} A_{1} P_{g}^{- 1}) (P_{g} A_{2} P_{g}^{- 1}) = A_{1} A_{2},

and A₁A₂ ∈ M (G). Therefore, equivariant models provide a wide family of examples of algebraic evolutionary models in the sense of Definition 1. For example, if B = {A,C,G,T}, it can be seen that the algebraic versions of the Jukes-Cantor model [18], the Kimura models with 2 or 3 parameters [19, 20], the strand symmetric model [21] or the general Markov model [22] are instances of equivariant models:

if $G = S_{4}$ , then $M_{G}$ is the algebraic Jukes-Cantor model JC69,
if G = 〈(A C G T),(A G)〉, then $M_{G}$ is the algebraic Kimura 2-parameter model K80,
if G = 〈(A C)(G T),(A G)(C T)〉, then $M_{G}$ is the algebraic Kimura 3-parameter model K81,
if G = 〈(A T)(C G)〉, then $M_{G}$ is known as the strand symmetric model SSM, and
if G = 〈e〉, then $M_{G}$ is the general Markov model GMM.
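The five groups above can be explored numerically. The following is a minimal sketch, not the authors' implementation: it generates each group from the listed generators and counts the G-orbits on pairs of states. Since the condition P_g A P_g^{-1} = A says that the entries of A are constant along those orbits, this count is the dimension of M(G) as a vector space (before imposing that rows sum to one). The helper names `from_cycles`, `generate` and `dim_mod` are illustrative, and the two generators chosen for S₄ are an assumption (any generating set would do).

```python
from itertools import product

STATES = "ACGT"

def from_cycles(*cycles):
    """Permutation of STATES in cycle notation, e.g. from_cycles("AC", "GT")."""
    image = {s: s for s in STATES}
    for cyc in cycles:
        for a, b in zip(cyc, cyc[1:] + cyc[0]):
            image[a] = b
    # g[i] is the index of the state that state i is sent to
    return tuple(STATES.index(image[s]) for s in STATES)

def generate(gens):
    """Close the generators under composition; for a finite group this is <gens>."""
    group = {tuple(range(4))}
    while True:
        new = {tuple(s[g[i]] for i in range(4)) for s in gens for g in group} - group
        if not new:
            return group
        group |= new

def dim_mod(group):
    """Number of G-orbits on pairs (i, j), i.e. free entries of a matrix in M(G)."""
    seen, orbits = set(), 0
    for i, j in product(range(4), repeat=2):
        if (i, j) not in seen:
            orbits += 1
            seen.update((g[i], g[j]) for g in group)
    return orbits

MODELS = {
    "JC69": [from_cycles("ACGT"), from_cycles("AC")],  # assumed generators of S_4
    "K80": [from_cycles("ACGT"), from_cycles("AG")],
    "K81": [from_cycles("AC", "GT"), from_cycles("AG", "CT")],
    "SSM": [from_cycles("AT", "CG")],
    "GMM": [],  # trivial group
}

for name, gens in MODELS.items():
    G = generate(gens)
    print(name, "group order:", len(G), "dim M(G):", dim_mod(G))
```

The orbit count recovers the familiar parameter structure of each model: one class of diagonal entries and one of off-diagonal entries for JC69, up to sixteen independent entries for GMM.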

Given an evolutionary model $M$ and a phylogenetic tree T, we define the space of parameters as

{Par}_{M} (T) = W_{0} \times (\prod_{e \in E (T)} Mod) .

Similarly, we define the space of stochastic parameters associated to T by

{Par}_{s M} (T) = s W_{0} \times (\prod_{e \in E (T)} sMod) .

Though artificial at first glance, the use of tensors in the framework that includes the distributions on the set of patterns in B at the leaves of a phylogenetic tree is a natural choice. Indeed, if $p_{x_{1} x_{2} \dots x_{n}}$ denotes the joint probability of observing x₁ at leaf 1, x₂ at leaf 2, and so on, up to x_n at leaf n, then the vector $p = (p_{b_{1} \dots b_{1}}, p_{b_{1} b_{1} \dots b_{2}}, \dots, p_{b_{k} \dots b_{k}})$ provides a distribution on the set of patterns in B at the leaves of T, and this can be regarded as the tensor having these coordinates in the natural basis,

p = \sum_{x_{1}, \dots, x_{n} \in B} p_{x_{1} \dots x_{n}} x_{1} \otimes \dots \otimes x_{n} .

This motivates the following definition.

Definition 5. Given a phylogenetic tree T on the set of taxa [n], an [n]-tensor is any element of the tensor power $L := \otimes_{[n]} W$ .

Given an algebraic evolutionary model $M$ and a phylogenetic tree T with root r, every Markov process on T (specified by a collection of parameters Π and A = (A^e)_{e ∈ E(T)}) gives rise to a tensor in $L$ in the following way: we consider a parametrization

Ψ_{T}^{M} : {Par}_{M} (T) \to L

(1)

defined by

Ψ_{T}^{M} (Π, A) = \sum_{x_{i} \in B} p_{x_{1} \dots x_{n}} x_{1} \otimes \dots \otimes x_{n},

where

p_{x_{1} \dots x_{n}} = \sum_{x_{v} \in B, v \in Int (T)} Π_{x_{r}} \prod_{e \in E (T)} A_{x_{pa(e)}, x_{ch(e)}}^{e},

(2)

x_v denotes the state at the vertex v, pa(e) (resp. ch(e)) is the parent (resp. child) node of e, and Π_x, x ∈ B, are the coordinates of Π. When restricted to the stochastic matrices and distributions in W₀, this parametrization corresponds to the hidden Markov process on the tree T (the leaves correspond to the observed random variables and the interior nodes to the hidden variables).
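Formula (2) can be evaluated directly on a small tree. The sketch below assumes a quartet tree with leaves 1 and 2 attached to the root r, leaves 3 and 4 attached to a second interior node v, and one interior edge from r to v. The Jukes-Cantor matrices and their numerical parameters are hypothetical values chosen only for illustration.

```python
from itertools import product

K = 4  # number of states, B = {A, C, G, T} indexed 0..3

def jc69(a):
    """Jukes-Cantor matrix: off-diagonal entries a, diagonal entries 1 - 3a."""
    return [[1 - 3 * a if i == j else a for j in range(K)] for i in range(K)]

def quartet_tensor(pi, A):
    """Joint distribution at the leaves, summing over the hidden states x_r, x_v.

    A maps an edge label to its matrix; edges are 'r1', 'r2', 'rv', 'v3', 'v4'.
    """
    p = {}
    for x1, x2, x3, x4 in product(range(K), repeat=4):
        total = 0.0
        for xr, xv in product(range(K), repeat=2):
            total += (pi[xr]
                      * A["r1"][xr][x1] * A["r2"][xr][x2]
                      * A["rv"][xr][xv]
                      * A["v3"][xv][x3] * A["v4"][xv][x4])
        p[(x1, x2, x3, x4)] = total
    return p

pi = [0.25] * K  # uniform root distribution
A = {"r1": jc69(0.05), "r2": jc69(0.10), "rv": jc69(0.02),
     "v3": jc69(0.07), "v4": jc69(0.03)}  # hypothetical parameters
p = quartet_tensor(pi, A)
print("sum of coordinates:", sum(p.values()))
print("p_AACC:", p[(0, 0, 1, 1)], "p_GGTT:", p[(2, 2, 3, 3)])
```

With stochastic parameters the coordinates sum to one, and patterns that differ by a permutation of the states receive the same probability under JC69, which is the kind of linear equality discussed in the Background.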

The parametrization (1) restricts to another polynomial map $ϕ_{T}^{M} : {Par}_{s M} (T) \to H$ , where $H \subset L$ is the hyperplane defined by $H = \{p \in L ∣ \sum_{x_{1}, \dots, x_{n} \in B} p_{x_{1} \dots x_{n}} = 1\}$ . Because we work in the algebraic setting, the use of the word “stochastic” in this paper is more general than usual, as we only request entries summing to one.

From now on, we will refer to this restriction as the stochastic parametrization $ϕ_{T}^{M}$ . It is important to note that when we consider the distributions in s W₀ and the Markov matrices in sMod, its image by $ϕ_{T}^{M}$ lies in the standard simplex in $L$ (and thus in H). This in turn implies that the whole image $Im ϕ_{T}^{M}$ is contained in H.

We proceed to define the algebraic varieties associated to the parametrization maps defined above. Roughly speaking, algebraic varieties are sets of solutions to systems of polynomial equations (e.g. [14]).

Definition 6. The stochastic phylogenetic variety $V_{T}^{M}$ associated to a phylogenetic tree T is the smallest algebraic variety containing $Im ϕ_{T}^{M} = \{ϕ_{T}^{M} (Π_{r}, A) : (Π_{r}, A) \in {Par}_{s M} (T)\}$ (in particular, $V_{T}^{M} \subset H$ ).

Similarly, the phylogenetic variety $C V_{T}^{M}$ associated to T is the smallest algebraic variety in $L$ that contains $Im Ψ_{T}^{M} = \{Ψ_{T}^{M} (Π_{r}, A) : (Π_{r}, A) \in {Par}_{M} (T)\}$ .

Below we explain the reason for the notation of $C V_{T}^{M}$ , which was adopted from [23].

The reader may note that the position of the root r of T played a role in the above parameterizations. It can be shown, however, that under certain mild assumptions, $Im Ψ_{T}^{M}$ and $Im ϕ_{T}^{M}$ are independent of the root position in the following sense: if two phylogenetic trees have the same topology as unrooted trees, then the smallest algebraic varieties containing the corresponding image sets are the same. For example, any model $M = (W_{0}, Mod)$ satisfying (i) ${\tilde{Π}}^{t} : = Π^{t} A$ belongs to W₀ for all Π ∈ W₀ and all A ∈ Mod, and (ii) $D_{\tilde{Π}}^{- 1} A^{t} D_{Π} \in Mod$ whenever $D_{\tilde{Π}}^{- 1}$ exists (here D_ω denotes the diagonal matrix with the entries of ω on the diagonal and zeros elsewhere) has this property (in this case, we say the model is root-independent). It is not difficult to check that the equivariant models satisfy these two properties (e.g. adapting the proof of [24] or [25]). For technical reasons, from now on we consider only the evolutionary models satisfying (i) and (ii). Indeed, in this case the notation $C V_{T}^{M}$ refers to the fact that the phylogenetic variety is just the cone over the stochastic phylogenetic variety (see Figure 1 and the remark below).

Remark 7. Let $M$ be an evolutionary model satisfying (i) and (ii) above. For $p \in L$ , $p = \sum p_{x_{1} \dots x_{n}} x_{1} \otimes \dots \otimes x_{n}$ , define $λ (p) : = \sum_{x_{i} \in B} p_{x_{1} \dots x_{n}}$ . Then

C V_{T}^{M} = \{p \in L | p = λ (p) q, q \in V_{T}^{M}\}

and $V_{T}^{M} = C V_{T}^{M} \cap H$ . This is well known for the general Markov model [23] and can be easily generalized to any model satisfying (i) and (ii).

The space of phylogenetic mixtures

In phylogenetics, the hypothesis that the sites of an alignment are independent and identically distributed is often used. When the assumption “identically distributed” is replaced by “distributed according to the same evolutionary model”, one obtains a phylogenetic mixture. Below, we introduce phylogenetic mixtures from the algebraic point of view (see also [26]).

Definition 8. Fix a set of taxa [n] and an algebraic evolutionary model $M$ . A phylogenetic mixture (on m-classes) or m-mixture is any vector of the form

p = \sum_{i = 1}^{m} α_{i} p^{i},

where $α_{i} \in C$ and $p^{i} \in Im (Ψ_{T_{i}}^{M})$ for some tree topologies T_i on the set of taxa [n]. As $Ψ_{T_{i}}^{M}$ is a homogeneous map, phylogenetic mixtures are represented by vectors of the form $\sum_{i = 1}^{m} \check{p}^{i}$ , where $\check{p}^{i} \in Im (Ψ_{T_{i}}^{M})$ . We call $D_{M} \subset L$ the space of all phylogenetic mixtures (on any number of classes) under the algebraic evolutionary model $M$ .

As mentioned in the introduction, the tree topologies contained in the mixture can be the same or different. An example of a phylogenetic mixture is the data modeled by the discrete Gamma-rates models (see e.g. [8]).

Restricting matrix rows to sum to one requires restricting the phylogenetic mixtures to the points of the form

q = \sum_{i = 1}^{m} α_{i} q^{i}, \quad \text{where } q^{i} \in Im (ϕ_{T_{i}}^{M}) \text{ and } \sum_{i} α_{i} = 1 .

We call $D_{s M}$ the space of stochastic phylogenetic mixtures.
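As an illustration of a stochastic mixture, the sketch below mixes three copies of the same three-leaf star tree under JC69, each with its branch lengths scaled by a different rate, in the spirit of the discrete rate-category models mentioned above. The rates, branch lengths and equal weights are hypothetical; the closed form of the Jukes-Cantor transition probabilities for a branch of length t is the standard one.

```python
import math
from itertools import product

def jc69_matrix(t):
    """Jukes-Cantor transition matrix for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def tripod(branch_lengths):
    """Pattern distribution on a 3-leaf star tree rooted at its interior node."""
    m = [jc69_matrix(t) for t in branch_lengths]
    return {x: sum(0.25 * m[0][r][x[0]] * m[1][r][x[1]] * m[2][r][x[2]]
                   for r in range(4))
            for x in product(range(4), repeat=3)}

def mixture(weights, tensors):
    """q = sum_i alpha_i q^i, coordinate by coordinate."""
    return {x: sum(w * t[x] for w, t in zip(weights, tensors)) for x in tensors[0]}

rates = [0.3, 1.0, 1.7]       # hypothetical rate categories
lengths = [0.10, 0.20, 0.15]  # hypothetical branch lengths
components = [tripod([r * t for t in lengths]) for r in rates]
q = mixture([1 / 3] * 3, components)

print("sum of coordinates:", sum(q.values()))
print("q_ACG:", q[(0, 1, 2)], "q_TAC:", q[(3, 0, 1)])
```

Because the weights sum to one, the mixture stays on the hyperplane H, and it inherits the invariance under permutations of the states from its components.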

Remark 9. The phylogenetic variety of a trivalent tree topology contains all phylogenetic varieties of the non-trivalent tree topologies obtained by contracting any of its interior edges. Indeed, the latter are a particular case of the former when the matrices associated to the contracted edges are equal to the identity matrix. It follows that the space of phylogenetic mixtures on the trivalent tree topologies coincides with the space of phylogenetic mixtures on all possible topologies.

The following result was proven by Matsen, Mossel and Steel in [26] for the two state random cluster model but, as proved below, it can be easily generalized to any evolutionary model.

Lemma 10. Given a set of taxa [n] and an algebraic evolutionary model $M$ , the set of all phylogenetic mixtures $D_{M}$ is a vector subspace of $L$ . Similarly, $D_{s M}$ is a linear variety and it equals $D_{M} \cap H$ .

Proof. $D_{M}$ is a $C$ -vector space and $D_{s M}$ is a linear variety by their definition. It follows that $D_{M}$ is an algebraic variety that contains $Im Ψ_{T}^{M}$ for any phylogenetic tree T on the set of taxa [n]. Therefore, it also contains $C V_{T}^{M}$ , and $D_{M}$ equals the set of points of the form $p = \sum p_{i}$ , where $p_{i} \in C V_{T_{i}}^{M}$ . Similarly, $D_{s M}$ is an algebraic variety that contains $Im ϕ_{T}^{M}$ , so it also contains $V_{T}^{M}$ for any phylogenetic tree T. It follows that $D_{s M}$ is formed by points of type $q = \sum α_{i} q_{i}$ , where $q_{i} \in V_{T_{i}}^{M}$ and $\sum_{i} α_{i} = 1$ .

Now we check that $D_{s M} = D_{M} \cap H$ . Let $q \in D_{s M}$ , so that $q = \sum_{i = 1}^{m} α_{i} q^{i}$ for some m, $q^{i} \in V_{T_{i}}^{M}$ , and $\sum α_{i} = 1$ . Clearly, $q \in D_{M}$ . Moreover, the sum of coordinates of q, λ(q), satisfies $λ (q) = \sum_{i} α_{i} λ (q^{i}) = \sum_{i} α_{i} = 1$ . Thus, q ∈ H. Conversely, let $p = \sum_{i = 1}^{m} p^{i}$ with $p^{i} \in C V_{T_{i}}^{M}$ for certain tree topologies T_i, and assume that λ (p) = 1. Apply Remark 7 to each pⁱ to get pⁱ = λ (pⁱ) q_i for some $q_{i} \in V_{T_{i}}^{M}$ . Then

p = \sum_{i} p^{i} = \sum_{i} λ (p^{i}) q_{i}

and $1 = λ (p) = \sum_{i} λ (p^{i}) λ (q_{i}) = \sum_{i} λ (p^{i})$ since each q_i lies on H. This proves that $p \in D_{s M}$ . □

Remark 11. In the proof of the above lemma, we have seen that $D_{M}$ and $D_{s M}$ can be alternatively described as the spaces of mixtures obtained from the respective varieties $C V_{T}^{M}$ and $V_{T}^{M}$ (i.e., not only from the images of the parametrization maps).

The space of phylogenetic mixtures for equivariant evolutionary models

This section provides a precise description of the space $D_{M}$ for the equivariant models $M$ listed in Example 4 (JC69, K80, K81, SSM, and GMM). First, we recall some definitions and facts of group theory and linear representation theory. From now on, B = {A,C,G,T}, k = 4, $W = {〈 B 〉}_{C}$ , n is fixed and $L = \otimes_{[n]} W$ .

Background on representation theory

We introduce some tools in group representation theory needed in the sequel. We refer the reader to [15] as a classical reference for these concepts. Although some of the following results are valid for any permutation group, for simplicity in the exposition we restrict to permutations of four elements (as our applications deal only with the case B = {A,C,G,T}).

Let $G \leq S_{4}$ be a permutation group. The trivial element in $S_{4}$ will be denoted as e. We write ρ_G for the restriction to G of the defining representation $ρ : S_{4} \to GL (W)$ given by the permutations of the basis B of W. This representation induces a G-module structure on W by setting g · x: = ρ(g)(x) ∈ W. In fact, ρ induces a G-module structure on any tensor power ⊗^sW by setting

g \cdot (x_{1} \otimes \dots \otimes x_{s}) : = g \cdot x_{1} \otimes \dots \otimes g \cdot x_{s},

(3)

and extending by linearity. From now on, the space $L = \otimes^{n} W$ will be implicitly considered as a G-module with this action. We call χ the character associated to the representation ρ_G: G → GL (W), i.e. χ (g) is the trace of the corresponding permutation matrix or, in other words, χ (g) equals the number of fixed elements in B by the permutation g∈G. Then the character associated to the induced representation G → GL (⊗ⁿW) is χⁿ, the n-th power of χ.

We write N₁,…,N_t for the irreducible representations of G and ω₁,…,ω_t for the corresponding irreducible characters, where N₁ and ω₁ will denote the trivial representation and trivial character, respectively. Maschke’s Theorem applied to the action of G described in (3) states that there is a decomposition of ⊗^sW into its isotypic components:

\otimes^{s} W = \oplus_{i = 1}^{t} (\otimes^{s} W) [ω_{i}],

(4)

where each (⊗^sW)[ω_i] is isomorphic to a number of copies of the irreducible representation N_i associated to ω_i, $(\otimes^{s} W) [ω_{i}] ≅ N_{i} \otimes C^{m_{i} (s)}$ , for some non-negative integer m_i(s) called the multiplicity of ⊗^sW relative to ω_i. The isotypic component of $L$ associated to the trivial representation will be denoted by $L^{G}$ and it is composed of the n-tensors invariant under the action of G defined in (3). If $M$ is the equivariant evolutionary model associated to G, $L^{G}$ will also be denoted as $L^{M}$ . It is easy to prove that $C V_{T}^{M} \subset L^{G}$ (see Lemma 4.3 of [5]).

We recall that the set Ω_G = {ω_i}_{i = 1,…,t} of irreducible characters of G forms an orthonormal basis of the space of characters relative to the inner product defined by

〈 f, h 〉 : = \frac{1}{| G |} \sum_{g \in G} f (g) \bar{h (g)} .

(5)
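Combining the facts above, the multiplicity of the trivial character in χⁿ is 〈χⁿ, ω₁〉 = (1/|G|) Σ_g χ(g)ⁿ, and this number is the dimension of $L^{G}$ . The sketch below evaluates it for the five groups of Example 4 by counting fixed points of each permutation. States are indexed A=0, C=1, G=2, T=3; the function names and the generators chosen for S₄ are assumptions of this illustration, and it is not the computation carried out in the paper.

```python
from fractions import Fraction

def generate(gens):
    """Close the generators under composition; for a finite group this is <gens>."""
    group = {tuple(range(4))}
    while True:
        new = {tuple(s[g[i]] for i in range(4)) for s in gens for g in group} - group
        if not new:
            return group
        group |= new

def chi(g):
    """Character of the permutation representation: number of fixed states."""
    return sum(1 for i in range(4) if g[i] == i)

def dim_invariants(group, n):
    """<chi^n, trivial character>: chi is integer valued, so no conjugation is needed."""
    m = Fraction(sum(chi(g) ** n for g in group), len(group))
    assert m.denominator == 1  # a multiplicity is always an integer
    return int(m)

GENERATORS = {
    "JC69": [(1, 2, 3, 0), (1, 0, 2, 3)],  # (A C G T), (A C): assumed generators of S_4
    "K80": [(1, 2, 3, 0), (2, 1, 0, 3)],   # (A C G T), (A G)
    "K81": [(1, 0, 3, 2), (2, 3, 0, 1)],   # (A C)(G T), (A G)(C T)
    "SSM": [(3, 2, 1, 0)],                 # (A T)(C G)
    "GMM": [],                             # trivial group
}

for name, gens in GENERATORS.items():
    G = generate(gens)
    print(name, [dim_invariants(G, n) for n in (2, 3, 4)])
```

For the trivial group every tensor is invariant, so the formula returns 4ⁿ, which gives a quick sanity check of the code.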

We introduce the following notion.

Definition 12. An n-word over B is an ordered sequence X = x₁x₂ … x_n, where every letter is taken from the alphabet B. The set of n-words is equivalent to the cartesian power Bⁿ and will be denoted by $B$ .

Words will be denoted in typewriter uppercase font (like X) and their letters in lowercase (like x). Sometimes it will be convenient to identify the [n]-tensors of the form x₁ ⊗ … ⊗ x_n with the n-words X = x₁…x_n. Consequently, we will identify $B$ with the natural basis of $L$ . Given $X \in B$ , we will denote by {X}_G = {g X ∣ g ∈ G} the G-orbit of X. We associate a G-invariant tensor, τ{X}_G, to each orbit {X}_G: $τ {X}_{G} : = \sum_{g \in G} g X$ . It is straightforward to see that every G-invariant tensor can be written as a linear combination of the tensors τ{X}_G, $X \in B$ . On the other hand, the set of different τ{X}_G’s is linearly independent, since the corresponding G-orbits {X}_G have non-overlapping composition of the elements of $B$ .
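Since the tensors τ{X}_G are indexed by the G-orbits of n-words, a basis of the invariant tensors can be listed by enumerating those orbits. The following sketch does this for S₄ (JC69) and for the Klein four-group of K81, with words encoded as tuples of state indices (A=0, C=1, G=2, T=3). The function names are illustrative, and `tau` stores τ{X}_G as a dictionary from words to integer coefficients.

```python
from collections import Counter
from itertools import permutations, product

S4 = list(permutations(range(4)))  # group of JC69
# e, (A C)(G T), (A G)(C T), (A T)(C G): group of K81
K81 = [(0, 1, 2, 3), (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)]

def act(g, word):
    """Letterwise action of a permutation g on a word."""
    return tuple(g[x] for x in word)

def word_orbits(group, n):
    """All G-orbits of n-words, each returned as a frozenset of words."""
    seen, orbits = set(), []
    for word in product(range(4), repeat=n):
        if word not in seen:
            orbit = frozenset(act(g, word) for g in group)
            seen |= orbit
            orbits.append(orbit)
    return orbits

def tau(group, word):
    """tau{X}_G = sum over g in G of gX, stored as word -> coefficient."""
    return Counter(act(g, word) for g in group)

for name, G in (("JC69", S4), ("K81", K81)):
    print(name, "orbits of 4-words:", len(word_orbits(G, 4)))

# Invariance check: acting with any g permutes the terms of tau{X}_G
t = tau(S4, (0, 0, 1, 2))
print(all(Counter({act(g, w): c for w, c in t.items()}) == t for g in S4))
```

The orbits partition the set of words, which is the reason the distinct τ{X}_G have disjoint supports and are linearly independent.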

Mixtures for equivariant models

For each x ∈ B, we write S_G(x) for the stabiliser of x under the action of G, that is, S_G(x) = {g ∈ G : g·x = x}.

Proposition 13. Let G be a subgroup of $S_{4}$ such that S_G(x₀) = {e} for some x₀∈B. Then every tensor of type τ{X}_G, $X \in B$ , lies in the image of $Ψ_{T}^{M_{G}}$ for some tree topology T. In particular, $L^{G} \subset D_{M_{G}}$ .

Proof. For any G-orbit {Y}_G, $Y \in B$ , write $τ {Y}_{G} = y_{1} \otimes \dots \otimes y_{n} + \sum_{g \neq e} g \cdot y_{1} \otimes \dots \otimes g \cdot y_{n}$ . We will explicitly associate a tree topology T and parameters (Π,A) to it so that the tensor τ{Y}_G is equal to $Ψ_{T}^{M_{G}} (Π, A)$ . To this aim, we denote by B (Y) the set of letters appearing in Y. Then for every z ∈ B (Y), consider the set $L_{z}^{Y} = \{i \in [n] : y_{i} = z\}$ , so that $\cup_{z \in B (Y)} L_{z}^{Y} = [n]$ .

We construct a tree T on the set of taxa [n] in the following way. We join each taxon in $L_{z}^{Y}$ to a common node v_z by an edge. Then each vertex v_z is joined to the root of the tree (we call it r) by an edge that we denote as e(z) (see Figure 1). Now, in the edges joining any v_z with some leaf in $L_{z}^{Y}$ , we consider the identity matrix, while the matrix in e(z) is defined by taking

A_{i, j}^{e (z)} = \begin{cases} 1 & \text{if } (i, j) = (h \cdot x_{0}, h \cdot z) \text{ for some } h \in G, \\ 0 & \text{otherwise.} \end{cases}

Finally, if c is the cardinality of {x₀}_G, define the distribution at the root Π = (Π_A, Π_C, Π_G, Π_T) by

Π_{z} = \begin{cases} \frac{1}{c} & \text{if } z \in \{x_{0}\}_{G}, \\ 0 & \text{otherwise.} \end{cases}

It is straightforward to check that these matrices and the vector Π are G-equivariant, so $(Π, A) \in {Par}_{M_{G}} (T)$ . Now, from (2) and the definition of Π, we can write

\begin{align} p_{x_{1} \dots x_{n}} = \sum_{\begin{matrix} g \in G \\ \{x_{z}\}_{z \in B (Y)} \subset B \end{matrix}} P_{x_{1} \dots x_{n}} (g, \{x_{z}\}_{z \in B (Y)}) \end{align}

where

\begin{align} P_{x_{1} \dots x_{n}} (g, \{x_{z}\}_{z \in B (Y)}) = Π_{g \cdot x_{0}} \prod_{z \in B (Y)} (A_{g \cdot x_{0}, x_{z}}^{e (z)} \prod_{j \in L_{z}^{Y}} δ_{x_{z}, x_{j}}) \end{align}

(here δ_{a,b} stands for the Kronecker delta, i.e. δ_{a,a} = 1, δ_{a,b} = 0 if a ≠ b). Moreover, from the definition of the matrix A^{e(z)}, we have

A_{g \cdot x_{0}, x_{z}}^{e (z)} = \begin{cases} 1 & \text{if } (g \cdot x_{0}, x_{z}) = (h \cdot x_{0}, h \cdot z) \text{ for some } h \in G, \\ 0 & \text{otherwise.} \end{cases}

The hypothesis S_G(x₀) = {e} ensures that (g·x₀, x_z) = (h·x₀, h·z) if and only if g = h. From this, it becomes clear that $P_{x_{1} \dots x_{n}} (g, \{x_{z}\}_{z \in B (Y)}) = 0$ unless

1. x_z = g · z, for z ∈ B, and
2. for each $i \in L_{z}^{Y}$ , x_i is equal to x_z = g · z,

in which case $P_{x_{1} \dots x_{n}} (g, {x_{z}}_{z \in B (Y)}) = Π_{g \cdot x_{0}} = \frac{1}{c}$ . It follows that

p_{x_{1} \dots x_{n}} = \begin{cases} 1 & \text{if } x_{1} \dots x_{n} \in \{Y\}_{G}, \\ 0 & \text{otherwise,} \end{cases}

and $Ψ_{T}^{M} (Π, A) = τ {Y}_{G}$ . Moreover, as the set of τ{Y}_G, for $Y \in B$ , generates the vector space $L^{G}$ , the second claim follows. □
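The construction in this proof can be checked numerically on a small case. The sketch below assumes the K81 group, x₀ = A and a four-letter word Y, builds Π and the matrices A^{e(z)} exactly as defined above, and evaluates formula (2) on the resulting tree (identity matrices on the leaf edges). In this sketch the nonzero coordinates come out as 1/c, so the comparison with τ{Y}_G is made up to that scalar factor, which is harmless because the parametrization is homogeneous. Function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import prod

# e, (A C)(G T), (A G)(C T), (A T)(C G) on indices A=0, C=1, G=2, T=3
K81 = [(0, 1, 2, 3), (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)]

def parameters(group, Y, x0=0):
    """Root distribution Pi and edge matrices A^{e(z)} for each letter z of Y."""
    orbit_x0 = {g[x0] for g in group}
    c = len(orbit_x0)
    pi = [Fraction(1, c) if z in orbit_x0 else Fraction(0) for z in range(4)]
    A = {}
    for z in set(Y):
        A[z] = [[0] * 4 for _ in range(4)]
        for h in group:
            A[z][h[x0]][h[z]] = 1  # entry (h.x0, h.z) equals 1
    return pi, A

def tensor(pi, A, Y):
    """Formula (2) on the tree of the proof, with identity matrices on leaf edges."""
    p = {}
    for x in product(range(4), repeat=len(Y)):
        # identity on leaf edges forces x_i = x_z for every leaf i in L_z^Y
        x_at = {}
        if not all(x_at.setdefault(z, xi) == xi for z, xi in zip(Y, x)):
            p[x] = Fraction(0)
            continue
        p[x] = sum(pi[r] * prod(A[z][r][s] for z, s in x_at.items())
                   for r in range(4))
    return p

Y = (0, 1, 2, 3)  # the word ACGT
pi, A = parameters(K81, Y)
p = tensor(pi, A, Y)
orbit = {tuple(g[y] for y in Y) for g in K81}
support = {x for x, v in p.items() if v != 0}
print("support equals the orbit of Y:", support == orbit)
print("values on the orbit:", {p[x] for x in orbit})
```

Multiplying Π by c, which is allowed in the algebraic setting because W₀ is a vector space, turns the tensor into the indicator of the orbit.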

Remark 14. The above result is not true if the hypothesis S_G(x₀) = {e} is removed. For example, if G = 〈(A C G T),(A G)〉 (so that $M = K80$ ), then S_G(A) = S_G(G) = {e,(C T)} and S_G(C) = S_G(T) = {e,(A G)}. In that case, it can be shown that the tensor τ{A C G T}_G associated to the G-orbit {A C G T}_G is not in $Im Ψ_{T}^{K80}$ for any tree topology T with 4 leaves.

Since the above condition on the group holds for G = 〈e〉, G = 〈(A T)(C G)〉, and G = 〈(A C)(G T),(A G)(C T)〉, we deduce the following claim.

Corollary 15. If G corresponds to any of the equivariant models K81, SSM or GMM, we have $L^{G} \subset D_{M_{G}}$ .
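The hypothesis of Proposition 13 is easy to test by brute force. The following sketch lists, for each group of Example 4, the states whose stabiliser is trivial. States are indexed A=0, C=1, G=2, T=3, and the function names and the generators chosen for S₄ are assumptions of this illustration.

```python
STATES = "ACGT"

def generate(gens):
    """Close the generators under composition; for a finite group this is <gens>."""
    group = {tuple(range(4))}
    while True:
        new = {tuple(s[g[i]] for i in range(4)) for s in gens for g in group} - group
        if not new:
            return group
        group |= new

def stabiliser(group, x):
    """S_G(x) = {g in G : g.x = x}."""
    return {g for g in group if g[x] == x}

def free_states(group):
    """States x0 with S_G(x0) = {e}, as required by Proposition 13."""
    return [STATES[x] for x in range(4) if len(stabiliser(group, x)) == 1]

GENERATORS = {
    "JC69": [(1, 2, 3, 0), (1, 0, 2, 3)],  # (A C G T), (A C): assumed generators of S_4
    "K80": [(1, 2, 3, 0), (2, 1, 0, 3)],   # (A C G T), (A G)
    "K81": [(1, 0, 3, 2), (2, 3, 0, 1)],   # (A C)(G T), (A G)(C T)
    "SSM": [(3, 2, 1, 0)],                 # (A T)(C G)
    "GMM": [],                             # trivial group
}

for name, gens in GENERATORS.items():
    print(name, "states with trivial stabiliser:", free_states(generate(gens)))
```

The list is empty for JC69 and K80, in agreement with Remark 14, which is why these two models need the separate argument given in the proof of Theorem 17.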

In phylogenetics, an invariant of a phylogenetic tree T is an equation satisfied by the expected distributions of patterns at the leaves of T, irrespective of the continuous parameters of the model $M$ . In the algebraic geometry setting, these are the equations satisfied by all $p \in C V_{T}^{M}$ . Invariants were introduced by Lake (see [27]) and Cavender and Felsenstein (see [28]). A phylogenetic invariant of T is an invariant of T which is not an invariant of all other phylogenetic trees (under the same model $M$ ). Equivalently, f is a phylogenetic invariant of $C V_{T}^{M}$ if it is an invariant of $C V_{T}^{M}$ and there exists a tree topology T^′ such that f is not an invariant of $C V_{T^{'}}^{M}$ . In principle, phylogenetic invariants can be used for tree topology reconstruction purposes.

Remark 16.

(a) It can be seen that the condition of trivial stabiliser for some element of B given in Proposition 13 guarantees that all the irreducible representations of G will be present in the decomposition of W into its isotypic components. Then, by using the results of [6], it follows that the corresponding equivariant model will have no linear phylogenetic invariants. This fact was already known for the models in the above corollary: see [29] for the GMM, [21] for the SSM and [30] for the K81. Here we provided an alternative proof based on elementary tools of group theory.
(b) The models JC69 and K80 are known to have linear phylogenetic invariants, but these are the only linear invariants which do not define hyperplanes containing $L^{G}$ , as can be deduced from [3, 30]. In fact, for these two models, the claim of the corollary is still true as stated in the following theorem. Nevertheless, we have not been able to provide a unified proof of this fact because of the different properties of the corresponding groups. There is no description of the space of linear invariants for other equivariant models not listed in Example 4, so we cannot claim that the result below still holds.

Theorem 17. If $M_{G}$ is one of the equivariant evolutionary models JC69 , K80 , K81 , SSM , or GMM , then the space of phylogenetic mixtures $D_{M_{G}}$ coincides with $L^{G}$ , and $D_{s M_{G}}$ equals $L^{G} \cap H$ .

This theorem allows us to identify the set of all phylogenetic mixtures $D_{M_{G}}$ with $L^{G}$ , which is a vector subspace of $L$ whose linear equations are easy to describe. In other words, $L^{G}$ is the smallest linear space containing the data coming from any mixture of trees evolving under the model $M_{G}$ . One can therefore use $L^{G}$ to select the most suitable model for the given data. This has been studied in [7].

Proof of Theorem 17. For equivariant models we have that $C V_{T}^{M_{G}} \subset L^{G}$ for any tree T. Hence, by Lemma 10 and the definition of $D_{M_{G}}$ , $D_{M_{G}}$ is a vector subspace of $L^{G}$ .

From Corollary 15, we infer the equality $L^{G} = D_{M_{G}}$ for the models K81, SSM and GMM. For the other two models, JC69 and K80, it remains to prove that there does not exist any hyperplane Π containing $D_{M_{G}}$ and not containing $L^{G}$ . If such a hyperplane existed, then it would contain all the points of $C V_{T}^{M_{G}}$ for any tree topology T. It suffices to prove that for these models there are no homogeneous linear polynomials vanishing on all tree topologies, except for the linear equations vanishing on $L^{G}$ . This has been seen in Remark 16(b).

The equality $D_{s M_{G}} = L^{G} \cap H$ follows immediately from Lemma 10 and the first assertion in the statement of this theorem.

Remark 18. We are indebted to one of the referees of this paper for pointing out that the preceding result, as well as the second part of Proposition 13, can also be inferred from Proposition 4.9 of [5]: under the assumption that the stabiliser of some state is trivial, Draisma and Kuttler show that the star tree is the smallest algebraic variety containing the tensors τ{X}_G, for pure tensors X (that is, tensors of rank 1). It follows that the set of mixtures on the star tree equals the space $L^{G}$ .

Remark 19. It is not difficult to check that for $M =$ K81, SSM or GMM, $D_{M}$ coincides with the space of mixtures on the star tree (see also [26], where the same result is proven for a 2-state model). On the contrary, this is not true for JC69 and K80 models because in this case the star tree lies in a smaller linear space as a consequence of the existence of phylogenetic linear equations (see Remark 16(b)).

Equations for the space $L^{G}$

Our goal here is to compute the dimension of $L^{G}$ for the groups associated to the equivariant models listed in Definition 4, and to list a set of independent linear equations defining this space.

Proposition 20. Using the notations above,

(i)
$dim L^{SSM} = 2^{2 n - 1}$ ,
(ii)
$dim L^{K81} = 4^{n - 1}$ ,
(iii)
$dim L^{K80} = 2^{2 n - 3} + 2^{n - 2}$ , and
(iv)
$dim L^{JC69} = \frac{2^{2 n - 3} + 1}{3} + 2^{n - 2}$ .

Proof. Let $M$ be any equivariant model. By definition, we know that $L^{G}$ is the isotypic component of ⊗ⁿW associated to the trivial representation (⊗ⁿW)[ω₁]. Since the dimension of the trivial representation is one, it follows that the dimension of $L^{M}$ is precisely the multiplicity m₁ (n), i.e. the number of times the trivial representation appears in the decomposition of ⊗ⁿW into isotypic components. This multiplicity m₁ (n) equals (see (5))

\begin{align} 〈 χ^{n}, ω_{1} 〉 = \frac{1}{| G |} \sum_{g \in G} χ^{n} (g) ω_{1} (g) . \end{align}

The proof ends by grouping the elements of G in the conjugacy classes of G for SSM, K81, K80, or JC69. Recall that the conjugacy classes of a group G are the disjoint sets of the form C (g) = {h^{− 1}gh : h ∈ G}. If C₁,…,C_s are the conjugacy classes for G, write $C (G) = (| C_{1} |, \dots, | C_{s} |)$ for the s-tuple of their cardinalities, so that $\sum_{i = 1}^{s} | C_{i} | = | G |$ . Recall that χⁿ(g₁) = χⁿ(g₂) whenever g₁ and g₂ lie in the same conjugacy class, so we can represent χⁿ by an s-tuple $χ_{C (G)}^{n} = (t_{1}, \dots, t_{s})$ , where t_i = χⁿ(g) for any g ∈ C_i. Thus, we have $m_{1} (n) = \frac{1}{| G |} \sum_{i = 1}^{s} χ^{n} (g_{i}) | C_{i} |$ , where g_i is any element in the conjugacy class C_i. The result for $M = SSM$ , K81, K80, or JC69 follows by applying Table 1.

Table 1 Details of the conjugacy classes of some permutation groups needed in the proof of Proposition 20

□
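The dimension count of Proposition 20 can be cross-checked numerically. The sketch below assumes, as in the usual setup for equivariant models, that W is the permutation representation on the four nucleotides, so that χ(g) is the number of letters fixed by g and the character of ⊗ⁿW is χ(g)ⁿ. It compares the character formula, a brute-force orbit count on n-words, and the closed forms of the proposition for n ≥ 2. The generators chosen for SSM and K81 and all function names are our own assumptions.

```python
from itertools import permutations, product

LETTERS = "ACGT"
IDX = {x: i for i, x in enumerate(LETTERS)}

def perm(*cycles):
    """Permutation as a tuple of images of A, C, G, T, built from disjoint cycles."""
    img = list(LETTERS)
    for cyc in cycles:
        for i, x in enumerate(cyc):
            img[IDX[x]] = cyc[(i + 1) % len(cyc)]
    return tuple(img)

def compose(p, q):
    """Permutation 'p after q'."""
    return tuple(p[IDX[q[i]]] for i in range(4))

def close(generators):
    """Subgroup generated by the given permutations."""
    identity = tuple(LETTERS)
    seen, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in seen:
                seen.add(h)
                frontier.append(h)
    return sorted(seen)

GROUPS = {
    "SSM": close([perm("AT", "CG")]),
    "K81": close([perm("AC", "GT"), perm("AG", "CT")]),
    "K80": close([perm("ACGT"), perm("AG")]),
    "JC69": [tuple(p) for p in permutations(LETTERS)],
}

def multiplicity(group, n):
    """m_1(n) = (1/|G|) * sum over g of chi(g)^n, chi(g) = number of fixed letters."""
    total = sum(sum(g[i] == LETTERS[i] for i in range(4)) ** n for g in group)
    return total // len(group)

def orbit_count(group, n):
    """Number of G-orbits on n-words, which is the dimension of the invariant space."""
    seen, count = set(), 0
    for word in product(LETTERS, repeat=n):
        if word not in seen:
            count += 1
            seen.update(tuple(g[IDX[x]] for x in word) for g in group)
    return count

def closed_form(model, n):
    """Dimensions stated in Proposition 20 (integer arithmetic, valid for n >= 2)."""
    return {
        "SSM": 2 ** (2 * n - 1),
        "K81": 4 ** (n - 1),
        "K80": 2 ** (2 * n - 3) + 2 ** (n - 2),
        "JC69": (2 ** (2 * n - 3) + 1) // 3 + 2 ** (n - 2),
    }[model]
```

For instance, `multiplicity(GROUPS["JC69"], 3)` and `orbit_count(GROUPS["JC69"], 3)` both return 5, the value given by item (iv) for n = 3.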

Our next goal is to provide a set of independent linear equations for $L^{G}$ . Before stating the main result, let us introduce some useful notation.

Notation 21. We consider the following subsets of $B = B^{n}$ :

\begin{align} B_{0} & = {A \dots A, C \dots C, G \dots G, T \dots T}, \\ B_{A C ∣ G T} & = {A, C}^{n} \cup {G, T}^{n}, \\ B_{A G ∣ C T} & = {A, G}^{n} \cup {C, T}^{n}, \\ B_{A T ∣ C G} & = {A, T}^{n} \cup {C, G}^{n}, and \\ B_{2} & = B_{A C ∣ G T} \cup B_{A G ∣ C T} \cup B_{A T ∣ C G} . \end{align}

The set $B_{0}$ is composed of all n-words with only one letter and it is contained in $B_{A C ∣ G T}$ , $B_{A G ∣ C T}$ , and $B_{A T ∣ C G}$ . Similarly, $B_{2}$ is composed of all n-words with two letters at most. It is straightforward to check that $| B_{A C ∣ G T} | = | B_{A G ∣ C T} | = | B_{A T ∣ C G} | = 2^{n + 1}$ and $| B_{2} | = 3 \cdot 2^{n + 1} - 8$ .
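The two cardinalities can be verified by inclusion-exclusion. Each of the three sets is a disjoint union of two copies of a two-letter alphabet raised to the n-th power, and any two of them (hence all three) intersect exactly in $B_{0}$ , which has 4 elements:

\begin{align} | B_{A C ∣ G T} | & = 2^{n} + 2^{n} = 2^{n + 1}, \\ | B_{2} | & = 3 \cdot 2^{n + 1} - 3 \cdot | B_{0} | + | B_{0} | = 3 \cdot 2^{n + 1} - 12 + 4 = 3 \cdot 2^{n + 1} - 8 . \end{align}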

We adopt multiplicative notation for n-words; for instance, we write C^l for the word $\underbrace{C \dots C}_{l}$ , and (A^l)(G^m)x_{l + m + 1}…x_n for $\underbrace{A \dots A}_{l} \underbrace{G \dots G}_{m} x_{l + m + 1} \dots x_{n}$ , where x_{l + m + 1},…,x_n represent any letters.

The main result of this section is the following:

Theorem 22. A set of linearly independent equations $E^{M}$ defining $L^{M}$ for $M$ =JC69 , K80 , K81 , or SSM is given by

$E^{SSM}$ : the equations $p_{X} = p_{(A T)(C G) X}$ for all $X \in B$ with x₁ ∈ {A,C};

$E^{K81}$ : the equations in $E^{SSM}$ , and the equations $p_{X} = p_{(A C)(G T) X}$ for all $X \in B$ with x₁ = A;

$E^{K80}$ : the equations in $E^{K81}$ , plus the equations $p_{X} = p_{(A G) X}$ for all $X \in B \setminus B_{A G ∣ C T}$ having x₁ = A and satisfying the following condition: if T appears in X, then there is a C in a preceding position;

$E^{JC69}$ : the equations in $E^{K80}$ , together with the equations $p_{X} = p_{(A T) X}$ for all $X \in B_{A C ∣ G T} \setminus B_{0}$ of the form (A^l)(C^m)x_{l + m + 1}…x_n; plus the equations $p_{X} = p_{(A C) X}$ and $p_{X} = p_{(A T) X}$ for all $X \in B \setminus B_{2}$ of the form (A^l)(C^m)x_{l + m + 1}…x_n and satisfying the condition: if T appears in X, then there is a G in a preceding position.

The number of equations added in each case is $2^{2 n - 1}$ for SSM, $2^{2 n - 2}$ for K81, $2^{2 n - 3} - 2^{n - 2}$ for K80, and $2^{n - 1} - 1 + 2 (\frac{2^{2 n - 3} + 1}{3} - 2^{n - 2})$ for JC69.

Before proving this theorem, we explain how these sets of equations were obtained. Notice that a system of linear equations of $L^{G}$ is given by

\begin{align} \{p_{g X} = p_{X} ∣ g \in G, X \in B\} . \end{align}

The role played by the G-orbits on $B$ becomes apparent. Indeed, the idea is to relate the equations to the orbits of a subgroup of G. To this aim, let H be a subgroup of G and write H \ G = {Hg:g ∈ G} for the set of right cosets of H in G. We consider a transversal of H \ G, i.e. a collection {g₁,…,g_[G:H]} such that $G = ⊔_{i = 1}^{[G : H]} H g_{i}$ . Then the orbit of any $X \in B$ can be decomposed as

\begin{align} \{X\}_{G} = \bigcup_{i = 1}^{[G : H]} \{g_{i} X\}_{H} . \tag{6} \end{align}

This decomposition establishes the connection between the G-orbits and the H-orbits. In order to obtain a system of equations for $L^{G}$ , once $E^{H}$ has been computed, it is enough to add the equations involving the permutations in a transversal {g₁ = e,g₂,…,g_[G:H]} of H \ G:

\begin{align} \left. \begin{array}{l} p_{X} = p_{g_{2} X} \\ p_{X} = p_{g_{3} X} \\ \dots \\ p_{X} = p_{g_{[G : H]} X} \end{array} \right\} \quad \text{for all } X \in B . \end{align}

Notice that the union in (6) is not necessarily disjoint, as it may happen that {g_iX}_H = {g_jX}_H for i ≠ j. In this case, the equality $p_{g_{i} X} = p_{g_{j} X}$ already holds in the space $L^{H}$ and does not provide any new restriction. To avoid this situation and obtain a minimal set of equations for $L^{G}$ , we impose the special conditions on the words $X \in B$ that appear in the statement of the theorem.
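The decomposition (6), and the fact that the union need not be disjoint, can be illustrated for H = K81 inside G = K80 with the transversal {e,(A G)}. In the sketch below the K81 group is written out explicitly as {e,(A C)(G T),(A G)(C T),(A T)(C G)}, the group K80 is rebuilt from the cosets Hg_i, and the function names are ours.

```python
LETTERS = "ACGT"

def perm(*cycles):
    """Permutation of ACGT (as a dict) from disjoint cycles; perm() is the identity."""
    p = {x: x for x in LETTERS}
    for cyc in cycles:
        for i, x in enumerate(cyc):
            p[x] = cyc[(i + 1) % len(cyc)]
    return p

def compose(p, q):
    """Permutation 'p after q'."""
    return {x: p[q[x]] for x in LETTERS}

def act(p, word):
    """Apply a permutation letter by letter to an n-word."""
    return "".join(p[x] for x in word)

def orbit(group, word):
    return {act(g, word) for g in group}

K81 = [perm(), perm("AC", "GT"), perm("AG", "CT"), perm("AT", "CG")]
TRANSVERSAL = [perm(), perm("AG")]                         # right-coset representatives
K80 = [compose(h, g) for g in TRANSVERSAL for h in K81]    # G is the union of the cosets H g_i

def decomposition(word):
    """The distinct H-orbits {g_i X}_H appearing on the right-hand side of (6)."""
    return {frozenset(orbit(K81, act(g, word))) for g in TRANSVERSAL}

generic = decomposition("ACG")      # two disjoint K81-orbits: one new equation is needed
degenerate = decomposition("AGA")   # both cosets give the same K81-orbit: nothing new
```

For X = ACG the two K81-orbits are disjoint and their union is the K80-orbit of size 8, whereas for X = AGA the word (A G)X = GAG already lies in the K81-orbit of X.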

Proof. For each model $M$ , we prove that the corresponding equations are linearly independent and there are as many equations as the codimension of $L^{M}$ . By Proposition 20, the codimension of $L^{M}$ is $2^{2 n - 1}$ for SSM, $3 \cdot 4^{n - 1}$ for K81, $7 \cdot 2^{2 n - 3} - 2^{n - 2}$ for K80, and $4^{n} - \frac{2^{2 n - 3} + 1}{3} - 2^{n - 2}$ for JC69. In the sequel, we refer to the groups by the name of the equivariant model associated to them.

SSM: As SSM is the group {e,(A T)(C G)}, a set of equations for SSM is $\{p_{X} = p_{(A T)(C G) X}\}$ . Fixing x₁ in {A,C} we obtain 2²ⁿ⁻¹ linearly independent equations (equations involving different coordinates). The codimension of $L^{SSM}$ is equal to 2²ⁿ⁻¹, which coincides with the number of equations given, and thus this set of equations defines $L^{SSM}$ .

K81: Since a transversal of SSM\K81 is {e,(A C)(G T)}, the hyperplanes $p_{X} = p_{(A C)(G T) X}$ contain $L^{K81}$ but not $L^{SSM}$ . Moreover, using (6) we see that the orbit {X}_K81 decomposes into the disjoint union of {X}_SSM and {(A C)(G T)X}_SSM for any $X \in B$ . Therefore, the equations given for K81 involve different coordinates than those in $E^{SSM}$ . Requiring x₁ = A, we obtain 4ⁿ⁻¹ linearly independent new equations. Thus $E^{K81}$ defines the space $L^{K81}$ because the number of linearly independent equations provided, 2²ⁿ⁻¹ + 4ⁿ⁻¹ = 3·4ⁿ⁻¹, coincides with the codimension of $L^{K81}$ .

K80: The set {e,(A G)} is a transversal of K81\K80. In order to show that the equations provided are linearly independent of those in $E^{K81}$ , we apply (6) to this transversal to obtain {X}_K80 = {X}_K81 ∪ {(A G)X}_K81. If $X \notin B_{A G ∣ C T}$ , then {(A G)X}_K81 and {X}_K81 are disjoint, so each equation $p_{X} = p_{(A G) X}$ is linearly independent from $E^{K81}$ . The set $B \setminus B_{A G ∣ C T}$ has cardinality 4ⁿ − 2^{n + 1} and, if $X \in B \setminus B_{A G ∣ C T}$ , each orbit {X}_K80 has cardinality 8. Therefore, the number of different orbits for $X \in B \setminus B_{A G ∣ C T}$ is (4ⁿ − 2^{n + 1})/8 = 2^{2 n − 3} − 2^{n − 2}. Moreover, the choice of X’s in $B \setminus B_{A G ∣ C T}$ with x₁ = A and satisfying “if T appears in X, there is a C in a preceding position” guarantees that we take only one element in each {X}_K80, and thus we are adding exactly one equation for each of these orbits. Overall, there are 3·4^{n − 1} + (2^{2 n − 3} − 2^{n − 2}) = 7·2^{2 n − 3} − 2^{n − 2} linearly independent equations in $E^{K80}$ . This number coincides with the codimension of $L^{K80}$ and these equations define $L^{K80}$ .

JC69: A transversal of K80\JC69 is {e,(A C),(A T)}, therefore (6) applies to give $\{X\}_{JC69} = \{X\}_{K80} \cup \{(A C) X\}_{K80} \cup \{(A T) X\}_{K80}$ . We distinguish two cases.

If $X \in B_{A C ∣ G T} \setminus B_{0}$ , then {(A C)X}_K80 = {X}_K80 and {X}_JC69 is the disjoint union of {X}_K80 and {(A T)X}_K80. As such, each equation $p_{X} = p_{(A T) X}$ is linearly independent from $E^{K80}$ . Moreover, restricting to the words $X \in B_{A C ∣ G T} \setminus B_{0}$ of the form (A^l)(C^m)x_{l + m + 1}…x_n, we have 2^{n − 1} − 1 such equations and they are linearly independent.

If $X \in B \setminus B_{2}$ , then the three orbits {(A C)X}_K80, {(A T)X}_K80, and {X}_K80 have 8 elements each and are disjoint. Therefore, for these X’s, each equation of type $p_{X} = p_{(A C) X}$ or $p_{X} = p_{(A T) X}$ is linearly independent from $E^{K80}$ . Moreover, as $B \setminus B_{2}$ has cardinality 4ⁿ − 3·2^{n + 1} + 8 and is covered by these orbits, we have $\frac{4^{n} - 3 \cdot 2^{n + 1} + 8}{24} = \frac{1}{3} (2^{2 n - 3} + 1) - 2^{n - 2}$ different orbits. The restriction to the elements of the form (A^l)(C^m)x_{l + m + 1}…x_n satisfying “if T appears in X, there is some G in a preceding position” guarantees that the equations are written only once for each orbit.

Summing up, there are

\begin{align} 7 \cdot 2^{2 n - 3} - 2^{n - 2} + (2^{n - 1} - 1 + 2 (\frac{1}{3} (2^{2 n - 3} + 1) - 2^{n - 2})) \end{align}

linearly independent equations in $E^{JC69}$ , all of which hold on $L^{JC69}$ . As this number is equal to the codimension $4^{n} - \frac{2^{2 n - 3} + 1}{3} - 2^{n - 2}$ of $L^{JC69}$ , the proof is complete.

All the equalities among orbits used in this proof are summarized in Table 2.

Table 2 Equalities among orbits used in the proof of Theorem 22

Remark 23. The sets of equations of Theorem 22 have been successfully used in [7] for model selection. Although the dimensions of these linear spaces are exponential in n, in practice it is not necessary to consider the full set of equations, but only those containing the patterns observed in the data. This is crucial for the applicability of the method, since the number of different columns in an alignment is very small compared to the dimension of these spaces.
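To make the point of Remark 23 concrete, the following sketch lists only the SSM equations that involve a pattern actually observed in an alignment. It is an illustration of the idea, not the procedure implemented in [7]; the toy alignment and the function names are hypothetical.

```python
from collections import Counter

# The SSM permutation (A T)(C G) as a translation table.
SSM_SWAP = str.maketrans("ACGT", "TGCA")

def column_patterns(alignment):
    """Count the patterns x_1...x_n read down the columns (one row per taxon)."""
    return Counter("".join(column) for column in zip(*alignment))

def observed_ssm_equations(alignment):
    """Equations p_X = p_Y of E^SSM in which X or Y is an observed pattern."""
    counts = column_patterns(alignment)
    equations = set()
    for pattern in counts:
        partner = pattern.translate(SSM_SWAP)
        # Orient each equation as in E^SSM: the left-hand word starts with A or C.
        equations.add((pattern, partner) if pattern[0] in "AC" else (partner, pattern))
    return counts, sorted(equations)

toy_alignment = ["ACGTA", "ACGTT", "TCGAA"]   # hypothetical: 3 taxa, 5 columns
counts, equations = observed_ssm_equations(toy_alignment)
```

For this toy input only three of the 32 equations of $E^{SSM}$ on 3 leaves involve an observed pattern.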

Example 24. As an example, we compute a minimal system of equations for SSM, K81, K80, and JC69 in the case of 3 leaves.

Equations for $L^{SSM}$ : $E^{SSM}$ is composed of the following equations:

\begin{align} p_{A A A} = p_{T T T}, p_{A A C} = p_{T T G}, p_{A A G} = p_{T T C}, \\ p_{A A T} = p_{T T A}, p_{A C A} = p_{T G T}, p_{A C C} = p_{T G G}, \\ p_{A C G} = p_{T G C}, p_{A C T} = p_{T G A}, p_{A G A} = p_{T C T}, \\ p_{A G C} = p_{T C G}, p_{A G G} = p_{T C C}, p_{A G T} = p_{T C A}, \\ p_{A T A} = p_{T A T}, p_{A T C} = p_{T A G}, p_{A T G} = p_{T A C}, \\ p_{A T T} = p_{T A A}, p_{C A A} = p_{G T T}, p_{C A C} = p_{G T G}, \\ p_{C A G} = p_{G T C}, p_{C A T} = p_{G T A}, p_{C C A} = p_{G G T}, \\ p_{C C C} = p_{G G G}, p_{C C G} = p_{G G C}, p_{C C T} = p_{G G A}, \\ p_{C G A} = p_{G C T}, p_{C G C} = p_{G C G}, p_{C G G} = p_{G C C}, \\ p_{C G T} = p_{G C A}, p_{C T A} = p_{G A T}, p_{C T C} = p_{G A G}, \\ p_{C T G} = p_{G A C}, p_{C T T} = p_{G A A} . \end{align}

Equations for $L^{K81}$ : $E^{K81}$ is formed by $E^{SSM}$ and

\begin{align} p_{A A A} = p_{C C C}, p_{A A C} = p_{C C A}, p_{A A G} = p_{C C T}, \\ p_{A A T} = p_{C C G}, p_{A C A} = p_{C A C}, p_{A C C} = p_{C A A}, \\ p_{A C G} = p_{C A T}, p_{A C T} = p_{C A G}, p_{A G A} = p_{C T C}, \\ p_{A G C} = p_{C T A}, p_{A G G} = p_{C T T}, p_{A G T} = p_{C T G}, \\ p_{A T A} = p_{C G C}, p_{A T C} = p_{C G A}, p_{A T G} = p_{C G T}, \\ p_{A T T} = p_{C G G} . \end{align}

Equations for $L^{K80}$ : $E^{K80}$ is formed by $E^{K81}$ and

\begin{align} p_{A A C} = p_{G G C}, p_{A C A} = p_{G C G}, p_{A C C} = p_{G C C}, \\ p_{A C G} = p_{G C A}, p_{A C T} = p_{G C T}, p_{A G C} = p_{G A C} . \end{align}

Equations for $L^{JC69}$ : $E^{JC69}$ is formed by $E^{K80}$ and

\begin{align} p_{A A C} = p_{T T C}, p_{A C A} = p_{T C T}, p_{A C C} = p_{T C C}, \\ p_{A C G} = p_{C A G}, p_{A C G} = p_{T C G} . \end{align}
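The construction of Theorem 22 can be checked mechanically for small n. The sketch below generates the equations added at each step and verifies two facts: the equations are linearly independent, and the solution space has the dimension given in Proposition 20. Since every equation has the form p_X = p_Y, independence amounts to the equations never closing a cycle in the graph on n-words, and the dimension is the number of connected components; a union-find structure tracks both. The K80 condition is taken in the form used in the proof (X outside $B_{A G ∣ C T}$ ), and all names are ours.

```python
from itertools import product

LETTERS = "ACGT"

def perm(*cycles):
    """Permutation of ACGT given by disjoint cycles, e.g. perm("AT", "CG")."""
    p = {x: x for x in LETTERS}
    for cyc in cycles:
        for i, x in enumerate(cyc):
            p[x] = cyc[(i + 1) % len(cyc)]
    return p

def act(p, word):
    return "".join(p[x] for x in word)

def first_of(word, letters):
    """First letter of `word` lying in `letters`, or None if there is none."""
    return next((x for x in word if x in letters), None)

def added_equations(n):
    """Pairs (X, gX), meaning p_X = p_gX, added at each step SSM -> K81 -> K80 -> JC69."""
    at_cg, ac_gt = perm("AT", "CG"), perm("AC", "GT")
    ag, ac, at = perm("AG"), perm("AC"), perm("AT")
    eqs = {"SSM": [], "K81": [], "K80": [], "JC69": []}
    for letters in product(LETTERS, repeat=n):
        X = "".join(letters)
        if X[0] in "AC":
            eqs["SSM"].append((X, act(at_cg, X)))
        if X[0] != "A":
            continue
        eqs["K81"].append((X, act(ac_gt, X)))
        # K80: X outside B_{AG|CT}; if T appears, a C precedes it.
        if first_of(X, "CT") == "C":
            eqs["K80"].append((X, act(ag, X)))
        # JC69: X = A^l C^m ..., i.e. the first letter other than A is C.
        if first_of(X, "CGT") != "C":
            continue
        if set(X) == {"A", "C"}:              # X in B_{AC|GT} minus B_0
            eqs["JC69"].append((X, act(at, X)))
        elif first_of(X, "GT") == "G":        # X outside B_2; a G precedes any T
            eqs["JC69"] += [(X, act(ac, X)), (X, act(at, X))]
    return eqs

def check(n):
    """Return (equations added per model, dimension of the solution space per model)."""
    eqs = added_equations(n)
    parent = {"".join(w): "".join(w) for w in product(LETTERS, repeat=n)}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components, dims = 4 ** n, {}
    for model in ("SSM", "K81", "K80", "JC69"):
        for x, y in eqs[model]:
            rx, ry = find(x), find(y)
            # An equation joining two words already identified would be redundant.
            assert rx != ry, f"redundant equation p_{x} = p_{y}"
            parent[rx] = ry
            components -= 1
        dims[model] = components
    return {m: len(e) for m, e in eqs.items()}, dims

counts, dims = check(3)
```

For n = 3 this gives 32, 16, 6 and 5 added equations and solution spaces of dimension 32, 16, 10 and 5, matching the counts in Theorem 22 and the dimensions in Proposition 20.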

Identifiability of phylogenetic mixtures

In this section we study the identifiability of phylogenetic mixtures. To this end, we use projective algebraic varieties and techniques from algebraic geometry. It is not our intention to give the reader a background on these tools, so we refer to the algebraic geometry book [14] and, more specifically, to [10] for the usage of these techniques in the study of phylogenetic mixtures.

There is a natural isomorphism between the points lying in the hyperplane H considered above, $H = \{p = (p_{A \dots A}, \dots, p_{T \dots T}) \in L : \sum p_{x_{1} \dots x_{n}} = 1\}$ , and the open affine subset $\{p = [p_{A \dots A} : \dots : p_{T \dots T}] : \sum p_{x_{1} \dots x_{n}} \neq 0\}$ of $P^{4^{n} - 1} = P (L)$ . We use the notation $[p_{A \dots A} : \dots : p_{T \dots T}]$ for projective coordinates (in contrast to $(p_{A \dots A}, \dots, p_{T \dots T})$ used for affine coordinates). The projective phylogenetic variety $P V_{T}^{M}$ associated to a phylogenetic tree T is the projective closure in $P (L)$ of the image of the stochastic parameterization $ϕ_{T}^{M}$ defined above. That is, it is the smallest projective variety in $P (L)$ containing $Im ϕ_{T}^{M}$ via the above isomorphism.

In what follows, we explain the relationship between this new variety and $C V_{T}^{M}$ and $V_{T}^{M}$ . By Remark 7, it becomes clear that $C V_{T}^{M}$ equals the affine cone over the projective phylogenetic variety $P V_{T}^{M}$ (for the general Markov model, see also [23], Proposition 1). This implies that $dim C V_{T}^{M} = dim P V_{T}^{M} + 1$ , and if $p = (p_{A \dots A}, \dots, p_{T \dots T})$ belongs to $C V_{T}^{M}$ , then $q := [p_{A \dots A} : \dots : p_{T \dots T}]$ belongs to $P V_{T}^{M}$ . Moreover, if $λ : = \sum p_{x_{1} \dots x_{n}}$ is not zero, then $(\frac{p_{A \dots A}}{λ}, \dots, \frac{p_{T \dots T}}{λ})$ is a point in the affine stochastic phylogenetic variety $V_{T}^{M}$ .

Before defining identifiability of mixtures, we consider the following construction of projective algebraic varieties.

Definition 25. Given two projective varieties $X, Y \subset P^{m}$ , the join of X and Y, X ∨ Y, is the smallest variety in $P^{m}$ containing all lines $\overline{xy}$ with x ∈ X, y ∈ Y, and x ≠ y (see [14], 8.1 for details). Similarly, one defines the join of projective varieties $X_{1}, \dots, X_{h} \subset P^{m}$ , $\lor_{i = 1}^{h} X_{i}$ , as the smallest subvariety in $P^{m}$ containing all the linear varieties spanned by $x_{1}, \dots, x_{h}$ with x_i ∈ X_i and x_i ≠ x_j for i ≠ j. It is known that

\begin{align} dim (\lor_{i = 1}^{h} X_{i}) \leq min {\sum_{i = 1}^{h} dim (X_{i}) + h - 1, m} . \end{align}

The right-hand side of this inequality is usually known as the expected dimension of $\lor_{i = 1}^{h} X_{i}$ .

For instance, if we consider the join $\lor_{i = 1}^{h} P V_{T_{i}}^{M}$ for certain tree topologies T_i on the leaf set [n] and a given evolutionary model $M$ , then there is a (dominant rational) map

\begin{align} P V_{T_{1}}^{M} \times P V_{T_{2}}^{M} \times \dots \times P V_{T_{h}}^{M} \times P^{h - 1} \dashrightarrow \lor_{i = 1}^{h} P V_{T_{i}}^{M} \subset P (L), \tag{7} \end{align}

which is the projective closure of the parameterization $ϕ_{T_{1}} \lor \dots \lor ϕ_{T_{h}}$ defined by

\begin{array}{ccc} Pa r_{s M} (T_{1}) \times \dots \times Pa r_{s M} (T_{h}) \times Ω & \to & L \\ (ξ_{1}, \dots, ξ_{h}, a) & \mapsto & \sum_{i} a_{i} ϕ_{T_{i}}^{M} (ξ_{i}) . \end{array}

Here, $Ω = {a = (a_{1}, \dots, a_{h}) ∣ \sum_{i} a_{i} = 1}$ is isomorphic to an affine open subset of $P^{h - 1}$ . In this setting, an h-mixture on {T₁,…,T_h} corresponds to a point in the variety $\lor_{i = 1}^{h} P V_{T_{i}}^{M}$ . We will use this algebraic variety to study the identifiability of phylogenetic mixtures.
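As a small numerical illustration of such a mixture, the sketch below builds two JC69 distributions on the star tree with 3 leaves and mixes them with weights in Ω. It assumes the usual JC69 parameterization (uniform root distribution, one substitution probability a per edge, transition matrix with 1 − 3a on the diagonal and a elsewhere); the parameter values and function names are hypothetical. Exact rational arithmetic is used so that the invariance p_{gX} = p_X under every permutation g of the nucleotides can be tested with equality.

```python
from fractions import Fraction
from itertools import permutations, product

LETTERS = "ACGT"

def jc69_matrix(a):
    """JC69 transition matrix: probability a for each substitution, 1 - 3a for no change."""
    return {r: {x: (1 - 3 * a if r == x else a) for x in LETTERS} for r in LETTERS}

def star_tree_distribution(rates):
    """Joint distribution at the leaves of a star tree with a uniform root distribution."""
    matrices = [jc69_matrix(a) for a in rates]
    dist = {}
    for word in product(LETTERS, repeat=len(rates)):
        total = Fraction(0)
        for root in LETTERS:
            term = Fraction(1, 4)
            for matrix, x in zip(matrices, word):
                term *= matrix[root][x]
            total += term
        dist["".join(word)] = total
    return dist

def mixture(weights, dists):
    """Point sum_i a_i p_i for weights a_i adding up to 1."""
    assert sum(weights) == 1
    return {w: sum(a * d[w] for a, d in zip(weights, dists)) for w in dists[0]}

def is_invariant(dist, group):
    """Check p_{gX} = p_X for every g in the group and every pattern X."""
    return all(dist["".join(g[x] for x in w)] == p for g in group for w, p in dist.items())

S4 = [dict(zip(LETTERS, image)) for image in permutations(LETTERS)]
p1 = star_tree_distribution([Fraction(1, 10), Fraction(1, 5), Fraction(1, 20)])
p2 = star_tree_distribution([Fraction(1, 8)] * 3)
mix = mixture([Fraction(1, 3), Fraction(2, 3)], [p1, p2])
```

Under these assumptions the mixture is again a distribution invariant under all 24 permutations, so it takes at most 5 distinct values on the 64 patterns, one per orbit, in line with $dim L^{JC69} = 5$ for n = 3.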

When considering unmixed models $M$ on trivalent trees on n taxa, generic identifiability of the tree topology is equivalent to the projective varieties $P V_{T}^{M}$ and $P V_{T^{'}}^{M}$ being different when T ≠ T^′ (see [31]). The identifiability of the continuous parameters must take into account the possibility of permuting the labels of the states at the interior nodes, as such permutations give rise to the same joint distribution at the leaves. In the language of algebraic geometry, generic identifiability of the continuous parameters of the model implies that the map $ϕ_{T}^{M}$ is generically finite (i.e. the preimage of a generic point is a finite number of points; see [31]). In this case, the fiber dimension theorem ([14], Theorem 11.12) applies and we have that $dim P V_{T}^{M}$ is equal to the number of stochastic parameters of the model, $dim Pa r_{s M} (T)$ . Therefore, if the continuous parameters are generically identifiable for the unmixed trees under $M$ , then the dimension of the variety $P V_{T}^{M}$ is the same for all trivalent tree topologies on n taxa. This dimension is denoted by $s_{M} (n)$ .

Example 26. The tree topologies and the continuous parameters are generically identifiable for the unmixed equivariant models JC69, K80, K81, SSM, and GMM on trees with any number of leaves (see [9] and [6], Corollary 3.9).

From now on we only consider trees without nodes of degree 2, so that the number of free stochastic parameters on a phylogenetic tree on n taxa under $M$ is $\leq s_{M} (n)$ .

We recall the definition of generic identifiability of the tree topologies on h-mixtures (see [10]).

Definition 27. The tree topologies on h-mixtures under $M$ are generically identifiable if for any set of trivalent tree topologies {T₁,…,T_h} and a generic choice of $(ξ_{1}, \dots ξ_{h}, a) \in Pa r_{s M} (T_{1}) \times \dots \times Pa r_{s M} (T_{h}) \times Ω$ , the equality

ϕ_{T_{1}} \lor \dots \lor ϕ_{T_{h}} (ξ_{1}, \dots ξ_{h}, a) = ϕ_{T_{1}^{'}} \lor \dots \lor ϕ_{T_{h}^{'}} (ξ_{1}^{'}, \dots ξ_{h}^{'}, a^{'}),

for tree topologies ${T_{1}^{'}, \dots, T_{h}^{'}}$ and $(ξ_{1}^{'}, \dots ξ_{h}^{'}, a^{'}) \in Pa r_{s M} (T_{1}^{'}) \times \dots \times Pa r_{s M} (T_{h}^{'}) \times Ω$ implies

\begin{align} \{T_{1}, \dots, T_{h}\} = \{T_{1}^{'}, \dots, T_{h}^{'}\} . \end{align}

In terms of algebraic varieties this is equivalent to saying that the variety $\lor_{i = 1}^{h} P V_{T_{i}}^{M}$ is not contained in $\lor_{i = 1}^{h} P V_{T_{i}^{'}}^{M}$ and vice versa.

The tree topologies are the discrete parameters of h-mixtures. When considering the continuous parameters of h-mixtures, the above mentioned label-swapping can be disregarded. We give the following definition according to [12].

Definition 28. The continuous parameters of h-mixtures on $T_{1}, \dots, T_{h}$ under an evolutionary model $M$ are generically identifiable if, for a generic choice of stochastic parameters $(ξ_{1}, \dots, ξ_{h}, a)$ , the equality

ϕ_{T_{1}} \lor \dots \lor ϕ_{T_{h}} (ξ_{1}, \dots ξ_{h}, a) = ϕ_{T_{1}} \lor \dots \lor ϕ_{T_{h}} (ξ_{1}^{'}, \dots ξ_{h}^{'}, a^{'})

for stochastic parameters $(ξ_{1}^{'}, \dots, ξ_{h}^{'}, a^{'})$ implies that there is a permutation $σ \in S_{h}$ such that $σ \cdot (T_{1}, \dots, T_{h}) = (T_{1}, \dots, T_{h})$ , $ξ_{i}^{'} = ξ_{σ (i)}$ , and $a_{i}^{'} = a_{σ (i)}$ for $i = 1, \dots, h$ . In other words, we only allow swapping of the continuous parameters when at least two tree topologies coincide.

Definition 29. An h-mixture under a model $M$ is said to be identifiable if both its tree topologies and its continuous parameters are generically identifiable.

In terms of algebraic varieties, generic identifiability of continuous parameters on h-mixtures implies that the generic fibers (i.e. preimages of generic points) of the map $ϕ_{T_{1}} \lor \dots \lor ϕ_{T_{h}}$ are finite. In this case, the fiber dimension theorem applied to (7) (cf. [14], Theorem 11.12) gives

\begin{align} dim (\lor_{i = 1}^{h} P V_{T_{i}}) = \sum_{i = 1}^{h} dim (P V_{T_{i}}) + h - 1 . \end{align}

The following result demonstrates the need for careful inspection of identifiability of mixtures with many components (i.e. large values of h).

Theorem 30. Let [n] be a set of taxa and $M$ be an evolutionary model for which the continuous parameters are generically identifiable on trivalent (unmixed) trees. In addition, let $s_{M} (n)$ be the dimension of $P V_{T}^{M}$ for any trivalent tree T, and set $h_{0} (n) : = \frac{dim D_{M}}{s_{M} (n) + 1}$ . Then the h-mixtures of trees on [n] evolving under $M$ are not identifiable for h ≥ h₀(n).

Remark 31. Note that, in the above definition of h₀(n), $dim D_{M}$ also depends on n.

Corollary 32. Let [n] be a set of taxa and $M$ be one of the equivariant models JC69, K80, K81, SSM, or GMM. Then the phylogenetic h-mixtures under these models are not identifiable for h ≥ h₀(n), where

\begin{align} h_{0} (n) = \begin{cases} \frac{4^{n}}{12 (2 n - 3) + 4}, & \text{if } M = GMM, \\ \frac{2^{2 n - 1}}{6 (2 n - 3) + 2}, & \text{if } M = SSM, \\ \frac{4^{n - 1}}{3 (2 n - 3) + 1}, & \text{if } M = K81, \\ \frac{2^{2 n - 3} + 2^{n - 2}}{2 (2 n - 3) + 1}, & \text{if } M = K80, \\ \frac{2^{2 n - 3} + 3 \cdot 2^{n - 2} + 1}{3 (2 n - 2)}, & \text{if } M = JC69 . \end{cases} \end{align}

Proof. Theorem 17 shows that $L^{M} = D_{M}$ and Proposition 20 gives the dimension of $L^{M}$ in each case. Next, we calculate: $s_{GMM} (n) = 12 (2 n - 3) + 3$ , $s_{SSM} (n) = 6 (2 n - 3) + 1$ , $s_{K81} (n) = 3 (2 n - 3)$ , $s_{K80} (n) = 2 (2 n - 3)$ , and $s_{JC69} (n) = 2 n - 3$ . Applying Theorem 30, we conclude the proof. □

Example 33. Consider the Kimura 3-parameter model K81 on n = 4 taxa. For any h ≥ 4, phylogenetic h-mixtures are not identifiable by Corollary 32. We are not aware of any result proving that mixtures of 2 or 3 different tree topologies under this model are identifiable (either for the tree parameters or for the continuous parameters).

Example 34. Consider the Jukes-Cantor model JC69 on n = 4 taxa. Then Corollary 32 tells us that for h ≥ 3, h-mixtures are not identifiable. Therefore, for this particular model on four taxa the cases in which the identifiability holds are known: the tree and the continuous parameters are generically identifiable for the unmixed model; the tree parameters are generically identifiable for 2-mixtures ([10], Theorem 10); the continuous parameters are generically identifiable for 2-mixtures on different tree topologies and not identifiable for the same tree topology ([10], Theorem 23); neither the continuous parameters nor the tree topologies are generically identifiable for mixtures with more than two components (Corollary 32).
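The thresholds used in Examples 33 and 34 can be recomputed directly from Corollary 32. The sketch below evaluates h₀(n) as an exact fraction from $dim D_{M}$ (Theorem 17 and Proposition 20, with the whole space of dimension 4ⁿ for GMM) and from the parameter counts in the proof of the corollary; the function names are ours.

```python
from fractions import Fraction
from math import ceil

def dim_mixture_space(model, n):
    """dim D_M, taken from Theorem 17 and Proposition 20 (4^n for GMM)."""
    two = Fraction(2)
    return {
        "GMM": Fraction(4) ** n,
        "SSM": two ** (2 * n - 1),
        "K81": Fraction(4) ** (n - 1),
        "K80": two ** (2 * n - 3) + two ** (n - 2),
        "JC69": (two ** (2 * n - 3) + 1) / 3 + two ** (n - 2),
    }[model]

def tree_parameters(model, n):
    """s_M(n): free stochastic parameters on a trivalent tree with n leaves (2n - 3 edges)."""
    edges = 2 * n - 3
    return {
        "GMM": 12 * edges + 3,
        "SSM": 6 * edges + 1,
        "K81": 3 * edges,
        "K80": 2 * edges,
        "JC69": edges,
    }[model]

def h0(model, n):
    """Threshold of Theorem 30: h-mixtures are not identifiable for h >= h0."""
    return dim_mixture_space(model, n) / (tree_parameters(model, n) + 1)

def first_non_identifiable(model, n):
    """Smallest integer h with h >= h0(n)."""
    return ceil(h0(model, n))
```

For n = 4 this gives h₀ = 4 for K81 and h₀ = 5/2 for JC69, so the first non-identifiable numbers of components are 4 and 3 respectively, as stated in the two examples.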

Proof of Theorem 30. Let $edim (h) : = h s_{M} (n) + h - 1$ . Then the variety $\lor_{i = 1}^{h} P V_{T_{i}}$ has dimension ≤ edim(h). Indeed, as $\lor_{i} ϕ_{T_{i}}$ is a parameterization of an open subset of $\lor_{i = 1}^{h} P V_{T_{i}}$ , then the dimension of $\lor_{i = 1}^{h} P V_{T_{i}}$ is less than or equal to $\sum dim P V_{T_{i}} + h - 1$ . Moreover, the dimension of $P V_{T_{i}}$ is equal to $s_{M} (n)$ if T_i is trivalent (since the continuous parameters for the unmixed models under consideration are generically identifiable) and is less than $s_{M} (n)$ for non-trivalent trees. Therefore, $dim (\lor_{i = 1}^{h} P V_{T_{i}}) \leq edim (h)$ .

If all T_i are trivalent trees, then $\sum dim P V_{T_{i}} + h - 1 = edim (h)$ and, therefore, $dim (\lor_{i = 1}^{h} P V_{T_{i}}) < edim (h)$ if and only if $dim (\lor_{i = 1}^{h} P V_{T_{i}}) < \sum dim P V_{T_{i}} + h - 1$ . Moreover, by the fiber dimension theorem applied to $\lor ϕ_{T_{i}}$ , the equality of dimensions holds if and only if the generic fiber of $\lor ϕ_{T_{i}}$ has dimension 0. In particular, if $dim (\lor_{i = 1}^{h} P V_{T_{i}}) < edim (h)$ , then the continuous parameters of this phylogenetic mixture are not identifiable.

As $h_{0} (n) = \frac{dim D_{M}}{s_{M} (n) + 1}$ , we have that $edim (h_{0} (n)) = h_{0} (n) (s_{M} (n) + 1) - 1 = dim D_{M} - 1$ . Now, we fix an $h \in N$ with h ≥ h₀(n), so that $edim (h) \geq dim (D_{M}) - 1$ .
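As a worked instance of these quantities (our own arithmetic, using the values of Corollary 32 for K81 on n = 4 taxa), we have $s_{K81} (4) = 3 (2 \cdot 4 - 3) = 15$ and $dim D_{K81} = 4^{3} = 64$ , so that

\begin{align} h_{0} (4) & = \frac{64}{15 + 1} = 4, \\ edim (3) & = 3 \cdot 15 + 3 - 1 = 47 < 63 = dim D_{K81} - 1, \\ edim (4) & = 4 \cdot 15 + 4 - 1 = 63 = dim D_{K81} - 1 . \end{align}

Thus h = 4 is the first number of components for which the expected dimension reaches that of $P (D_{K81})$ .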

There are two possible scenarios:

(a)
For any set of tree topologies ${T_{1}, \dots, T_{h}}$ , the dimension of $\lor_{i = 1}^{h} P V_{T_{i}}$ is less than $dim (D_{M}) - 1$ .
(b)
There exists a set of tree topologies ${T_{1}, \dots, T_{h}}$ for which $dim (\lor_{i = 1}^{h} P V_{T_{i}}) = dim (D_{M}) - 1 .$

Case (a) implies that the dimension of $\lor_{i = 1}^{h} P V_{T_{i}}$ is less than edim(h) for any set of trivalent tree topologies ${T_{1}, \dots, T_{h}}$ . Based on the conclusions drawn above, this implies that the continuous parameters are not generically identifiable.

In case (b), $\lor_{i = 1}^{h} P V_{T_{i}}$ coincides with $P (D_{M})$ . Indeed, $\lor_{i = 1}^{h} P V_{T_{i}}$ is contained in $P (D_{M})$ , both varieties are irreducible, and $dim (\lor_{i = 1}^{h} P V_{T_{i}}) = dim (D_{M}) - 1 = dim (P (D_{M}))$ , which implies that both varieties coincide. In particular, any h-mixture (which is a point in $P (D_{M})$ ) would be contained in $\lor_{i = 1}^{h} P V_{T_{i}}$ , and therefore the topologies are not generically identifiable.

Remark 35. The negative result of Theorem 30 should be complemented with the following positive result of Rhodes and Sullivant in [12]: if $M$ =GMM and one restricts to h-mixtures on the same trivalent tree topology T, then the tree topology and the continuous parameters are generically identifiable if $h < 4^{\frac{n}{4} - 1}$ .

Conclusions

In this paper, we have dealt with the space of phylogenetic mixtures for evolutionary equivariant models. We have shown that for the case of the Jukes-Cantor model, the Kimura models with two or three parameters, the strand symmetric model and the general Markov model, this linear space is defined by the set of linear equations satisfied by the distributions of the patterns at the leaves of a tree that evolves under that model. It follows that this space completely characterizes the model. The use of tools from group theory and group representation theory played a major role, and allowed us to design a procedure to produce minimal systems of equations for these spaces and for any number of taxa. This procedure has been implemented successfully in a new method for model selection in phylogenetics based on linear invariants (see [7]), which is available online at http://genome.crg.es/cgi-bin/phylo_mod_sel/AlgModelSelection.pl.

In the last part of the paper, we proved new results concerning the identifiability of phylogenetic mixtures. Namely, we provided an upper bound for the number of components (classes) of a mixture so that the identifiability of both the continuous and the discrete parameters is still possible.

References

Posada D: The effect of branch length variation on the selection of models of molecular evolution. J Mol Evol. 2001, 52: 434-444.
PubMed CAS Google Scholar
Felsenstein J: Inferring Phylogenies. 2004, Sunderland: Sinauer Associates, Inc.,
Google Scholar
Fu YX, Li W: Construction of linear invariants in phylogenetic inference. Math Biosci. 1992, 109 (2): 201-228. 10.1016/0025-5564(92)90045-X
Article PubMed CAS Google Scholar
Steel M, Hendy M, Székely L, Erdős P: Spectral analysis and a closest tree method for genetic sequences. Appl Math Lett. Int J Rapid Publication. 1992, 5: 63-67.
Google Scholar
Draisma J, Kuttler J: On the ideals of equivariants tree models. Mathematische Annalen. 2009, 344: 619-644. 10.1007/s00208-008-0320-6
Article Google Scholar
Casanellas M, Fernandez-Sanchez J: Relevant phylogenetic invariants of evolutionary models. J de Mathématiques Pures et Appliquées. 2011, 96: 207-229.
Article Google Scholar
Kedzierska A, Drton M, Guigó R, Casanellas M: SPIn: model selection for phylogenetic mixtures via linear invariants. Mol Biol Evol. 2012, 29: 929-937. 10.1093/molbev/msr259
Article PubMed CAS Google Scholar
Semple C, Steel M: Phylogenetics, Volume 24 of Oxford Lecture Series in Mathematics and its Applications. 2003, Oxford: Oxford University Press,
Google Scholar
Allman E, Rhodes J: The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J Comput Biol. 2006, 13: 1101-1113. 10.1089/cmb.2006.13.1101
Article PubMed CAS Google Scholar
Allman ES, Petrovic S, Rhodes JA, Sullivant S: Identifiability of two-tree mixtures for group-based models. IEEE/ACM Trans Comput Biol Bioinformatics. 2011, 8: 710-720.
Article Google Scholar
Stefanovic D, Vigoda E: Phylogeny of mixture models: Robustness of maximum likelihood and non-identifiable distributions. J Comput Biol. 2007, 14: 156-189. 10.1089/cmb.2006.0126
Article Google Scholar
Rhodes J, Sullivant S: Identifiability of large phylogenetic mixture models. Bull Math Biol. 2012, 74: 212-231. 10.1007/s11538-011-9672-2
Chai J, Housworth EA: On Rogers’s proof of identifiability for the GTR + Gamma + I model. Syst Biol. 2011, 60 (5): 713-718. 10.1093/sysbio/syr023
Harris J: Algebraic Geometry. A First Course, Volume 133 of Graduate Texts in Mathematics. 1992, New York: Springer-Verlag.
Serre J: Linear Representations of Finite Groups, Volume 42 of Graduate Texts in Mathematics. 1977, New York: Springer-Verlag. [Translated from the second French edition by Leonard L. Scott]
Sumner J, Fernández-Sánchez J, Jarvis P: On Lie Markov models. J Theor Biol. 2012, 298: 16-31.
Allman E, Rhodes J: Phylogenetic invariants for stationary base composition. J Symbolic Comput. 2006, 41 (2): 138-150. 10.1016/j.jsc.2005.04.004
Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism, Volume 3. Edited by: Munro HN. 1969, New York: Academic Press, 21-132.
Kimura M: A simple method for estimating evolutionary rates of base substitution through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16: 111-120. 10.1007/BF01731581
Kimura M: Estimation of evolutionary distances between homologous nucleotide sequences. Proc Natl Acad Sci. 1981, 78: 454-458. 10.1073/pnas.78.1.454
Casanellas M, Sullivant S: The strand symmetric model. Algebraic Statistics for Computational Biology. Edited by: Pachter L, Sturmfels B. 2005, New York: Cambridge University Press.
Chang JT: Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci. 1996, 137: 51-73. 10.1016/S0025-5564(96)00075-2
Allman E, Rhodes J: Phylogenetic ideals and varieties for the general Markov model. Adv Appl Math. 2008, 40: 127-148. 10.1016/j.aam.2006.10.002
Allman E, Rhodes J: Phylogenetic invariants for the general Markov model of sequence mutation. Math Biosci. 2003, 186 (2): 113-144. 10.1016/j.mbs.2003.08.004
Steel M, Hendy M, Penny D: Reconstructing phylogenies from nucleotide pattern probabilities: a survey and some new results. Discrete Appl Math. 1998, 88 (1-3): 367-396. 10.1016/S0166-218X(98)00080-8
Matsen F, Mossel E, Steel M: Mixed-up trees: The structure of phylogenetic mixtures. Bull Math Biol. 2008, 70: 1115-1139. 10.1007/s11538-007-9293-y
Lake J: A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol. 1987, 4: 167-191.
Cavender J, Felsenstein J: Invariants of phylogenies in a simple case with discrete states. J Classif. 1987, 4: 57-71. 10.1007/BF01890075
Allman E, Rhodes J: Quartets and parameter recovery for the general Markov model of sequence mutation. Appl Math Res Express. 2004, 2004 (4): 107-131. 10.1155/S1687120004020283
Sturmfels B, Sullivant S: Toric ideals of phylogenetic invariants. J Comput Biol. 2005, 12: 204-228. 10.1089/cmb.2005.12.204
Allman E, Rhodes J: Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math Biosci. 2008, 211: 18-33. 10.1016/j.mbs.2007.09.001

Acknowledgements

All authors are partially supported by Generalitat de Catalunya, 2009 SGR 1284. Research of the first and second authors was partially supported by Ministerio de Educación y Ciencia MTM2009-14163-C02-02 (Spain). Research of the third author was partially supported by grant BIO2011-26205 from the Ministerio de Educación y Ciencia (Spain). We would like to thank both referees for very useful comments that led to major improvements.

Author information

Authors and Affiliations

Departament de Matemàtica Aplicada I, ETSEIB, Universitat Politècnica de Catalunya, Avinguda Diagonal 647, 08028, Barcelona, Spain
Marta Casanellas & Jesús Fernández-Sánchez
Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003, Barcelona, Spain
Anna M Kedzierska

Corresponding author

Correspondence to Marta Casanellas.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors contributed equally, and the author names are listed in alphabetical order. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Casanellas, M., Fernández-Sánchez, J. & Kedzierska, A.M. The space of phylogenetic mixtures for equivariant models. Algorithms Mol Biol 7, 33 (2012). https://doi.org/10.1186/1748-7188-7-33

Received: 30 September 2011
Accepted: 19 November 2012
Published: 28 November 2012
DOI: https://doi.org/10.1186/1748-7188-7-33
