Open Access

The space of phylogenetic mixtures for equivariant models

  • Marta Casanellas,
  • Jesús Fernández-Sánchez and
  • Anna M Kedzierska
Algorithms for Molecular Biology 2012, 7:33

https://doi.org/10.1186/1748-7188-7-33

Received: 30 September 2011

Accepted: 19 November 2012

Published: 28 November 2012

Abstract

Background

The selection of an evolutionary model to best fit given molecular data is usually a heuristic choice. In his seminal book, J. Felsenstein suggested that certain linear equations satisfied by the expected probabilities of patterns observed at the leaves of a phylogenetic tree could be used for model selection. It remained an open question, however, whether these equations were sufficient to fully characterize the evolutionary model under consideration.

Results

Here we prove that, for most equivariant models of evolution, the space of distributions satisfying these linear equations coincides with the space of distributions arising from mixtures of trees. In other words, we prove that the evolution of an observed multiple sequence alignment can be modeled by a mixture of phylogenetic trees under an equivariant evolutionary model if and only if the distribution of patterns at its columns satisfies the linear equations mentioned above. Moreover, we provide a set of linearly independent equations defining this space of phylogenetic mixtures for each equivariant model and for any number of taxa. Lastly, we use these results to perform a study of identifiability of phylogenetic mixtures.

Conclusions

The space of phylogenetic mixtures under equivariant models is a linear space that fully characterizes the evolutionary model. We provide an explicit algorithm to obtain the equations defining these spaces for a number of models and taxa. Its implementation has proved to be a powerful tool for model selection.

Keywords

Evolutionary model, Equivariant model, Phylogenetic mixture, Identifiability

Background

The principal goal of phylogenetics is to reconstruct the ancestral relationships among organisms. Most popular phylogenetic reconstruction methods are based on mathematical models describing the molecular evolution of DNA. In spite of this, there exists no unified framework for model selection and the results are highly dependent on the models and methods used in the analysis (cf. [1]).

In this paper we assume the Darwinian model of evolution proceeding along phylogenetic trees and address the following question: how can the data evolving under a particular model be characterized? In other words, we look for invariants of the DNA patterns which have evolved following a tree (or a mixture of trees, as we will see below) under a particular model. The answer to this question provided in this paper leads to a complete characterization of the evolutionary model and to a novel model selection tool, which is valid for any mixture of trees.

In what follows, we briefly explain the motivation for this work. It has been shown that if the evolution along a phylogenetic tree is described by a particular model, the expected probabilities of nucleotide patterns at the leaves of the tree satisfy certain equalities (see e.g. [2], p.375). Several authors (e.g. [2-4]) pointed out that these equalities could potentially be used to test the fit of the model of base change. The full set of equations required for viable model selection, however, was unknown. The objective of this work is to fill this gap and to go a step further towards practical application by providing an algorithm to compute the required invariants for model selection.

In this work we consider a group of equivariant models ([5, 6]). These models are Markov processes on trees, whose transition matrices satisfy certain symmetries: the Jukes-Cantor model, the Kimura 2 and 3 parameter models, the strand symmetric model, and the general Markov model. Our first important result, Theorem 17, states that if evolution occurs according to trees (or even mixtures of trees) under these equivariant models, then the model of evolution is completely determined by the linear space defined by the aforementioned equalities. By exhaustively studying the group of symmetries of these models, we also give a straightforward combinatorial way of determining the equations of this linear space (see Theorem 22). The implementation of the algorithm producing the equations is available as a package SPIn ([7], http://genome.crg.es/cgi-bin/phylo_mod_sel/AlgModelSelection.pl), which has proved to be a successful tool in evolutionary model selection.

Our main technique consists in proving that the linear space above coincides with the space $\mathcal{D}_{\mathcal{M}}$ of phylogenetic mixtures evolving under the model $\mathcal{M}$, i.e. the set of points that are linear combinations of points lying in the phylogenetic varieties $CV_T^{\mathcal{M}}$ (see the Preliminaries section for specific definitions). In biological terms and in the stochastic context, this is the set of vectors of expected pattern frequencies for mixtures of trees evolving under the model $\mathcal{M}$ (not necessarily the same tree topology in the mixture, and not necessarily the same transition matrices when the tree topologies coincide). In phylogenetics, the so-called i.i.d. hypothesis (independent and identically distributed) about the sites of an alignment is prevalent in the simplest models. When the assumption “identically distributed” is replaced by “distributed according to the same evolutionary model”, one obtains a phylogenetic mixture.

Phylogenetic mixtures are useful in modeling heterogeneous evolutionary processes, e.g. data comprising multiple genes, selected codon positions, or rate variation across sites (e.g. [8]). Among a plethora of applications, they are used in orthology predictions, gene and genome annotations, species tree reconstructions, and drug target identifications.

In addition to the main result, we determine the dimension of these linear spaces and use it to give an upper bound, $h_0(n)$, on the number of mixtures that should be used in phylogenetic reconstruction on $n$ taxa. This relates to the so-called identifiability problem in phylogenetic mixtures, which can be posed as determining the conditions that guarantee that the model parameters (discrete parameters in the form of tree topologies and the continuous parameters of the root and model distributions) can be recovered from the data. Identifiability is crucial for consistency of the maximum likelihood approaches and, though extensively studied in the phylogenetic context, few results are known (see for instance [9-13]).

In brief, in Theorem 30 we prove that either the tree topologies or the continuous parameters are not generically identifiable for mixtures on more than $h_0(n)$ trees under equivariant models. Here $h_0(n)$ is the quotient of the dimension of the linear space $\mathcal{D}_{\mathcal{M}}$ (computed in Proposition 20) by the number of free parameters of $\mathcal{M}$ on a trivalent tree plus one. For example, for four taxa and the Jukes-Cantor model (resp. the Kimura 3-parameter model) this result proves that mixtures on three (resp. four) or more trees are not identifiable (i.e. either the discrete or the continuous parameters cannot be fully identified). A detailed discussion on this subject is provided in the last section.

The main tools used in this work are algebraic geometry and group theory. The reader is referred to [14, 15] for general references on these topics.

Main text

Preliminaries

Phylogenetic trees and Markov models of evolution have been widely used in the literature. In what follows we fix the notation needed to deal with them in our setting.

Let n be a positive integer and denote by [n] the set {1,2,…,n}. A phylogenetic tree T on the set of taxa [n] is a tree (i.e. a connected graph with no loops), whose n leaves are bijectively labeled by [n]. Its vertices represent species or other biological entities and its edges represent evolutionary processes between the vertices.

We allow internal vertices of any degree and if all the internal vertices are of degree 3 we say that the tree is trivalent. We will denote the set of vertices of T by N(T), the set of edges by E(T), and the set of interior nodes by Int(T). A rooted tree is a tree together with a distinguished node r called the root. The root induces an orientation on the edges of T, whereby the root represents the common ancestor to all the species represented in the tree. If e is an edge of a rooted tree T, we write pa(e) and ch(e) for its parent vertex (origin) and its child vertex (end), respectively. Two unrooted phylogenetic trees on the set of taxa [n] are said to have the same tree topology if their labeled graphs have the same topology.

We fix a positive integer $k$ and an ordered set $B = \{b_1, b_2, \ldots, b_k\}$. For example, for most applications we take $B = \{A, C, G, T\}$ to be the set of nucleotides in a DNA sequence. We may think of $B$ as the set of states of a discrete random variable. We call $W$ the complex vector space $W = \langle B \rangle_{\mathbb{C}}$ spanned by $B$, so that $B$ is a natural basis of $W$. For algebraic convenience, we usually work over the complex field and restrict to the stochastic setting when necessary. Vectors in $W$ are thought of as probability distributions on the set of states $B$ if their coordinates are non-negative and sum to one. In this setting the vector $\sum_i c_i b_i$ means that observation $b_i$ occurs with probability $c_i$. From now on, we will identify vectors in $W$ with their coordinates in the basis $B$ written as a column vector, e.g. we identify $\sum_{i} b_i$ with the vector $\mathbf{1} = (1, 1, \ldots, 1)^t \in W$.

In order to model molecular evolution on a phylogenetic tree $T$, we consider a Markov process specified by a root distribution, $\Pi \in W$, and a collection of transition matrices, $A = (A^e)_{e \in E(T)}$, where each $A^e$ is a $k \times k$ matrix in $\mathrm{End}(W)$. The matrices $A^e$ represent the conditional probabilities of substitution between the states in $B$ from the parent node $pa(e)$ to the child node $ch(e)$ of $e$. We adopt the convention that the matrices $A^e$ act on $W$ from the right, i.e. a vector $\omega^t$ in $pa(e)$ maps to $\omega^t A^e$ in $ch(e)$.

Distinct forms of the transition matrices give rise to different evolutionary models. Using the terminology introduced above, we proceed to the definition of evolutionary models used throughout this work.

Definition 1. An (algebraic) evolutionary model $\mathcal{M}$ is specified by giving a vector subspace $W_0 \subset W$ such that $\mathbf{1}^t \Pi \neq 0$ for some $\Pi$ in $W_0$, together with a multiplicatively closed vector subspace $\mathrm{Mod}$ (for model) of $M_k(\mathbb{C})$ containing the identity matrix. We will usually denote such a model by $\mathcal{M} = (W_0, \mathrm{Mod})$. We define the stochastic evolutionary model $s\mathcal{M} = (sW_0, s\mathrm{Mod})$ associated to $\mathcal{M}$ by taking $sW_0 = \{\Pi \in W_0 : \mathbf{1}^t \Pi = 1\}$ and $s\mathrm{Mod} = \{A \in \mathrm{Mod} : A \mathbf{1} = \mathbf{1}\}$. The term “stochastic” refers to the fact that, by restricting to the points in the spaces with non-negative real entries, we obtain distributions and Markov matrices. A phylogenetic tree $T$ together with the parameters $\Pi$ and $A = (A^e)_{e \in E(T)}$ is said to evolve under the algebraic evolutionary model $\mathcal{M}$ if $\Pi \in W_0$ and all matrices $A^e$ lie in $\mathrm{Mod}$.

Remark 2. Note that $sW_0$ and $s\mathrm{Mod}$ are not vector spaces. The condition $\mathbf{1}^t \Pi \neq 0$ in the above definition means that the sum of the coordinates of $\Pi$ is not zero. Since vectors in $sW_0$ with non-negative coordinates represent the probability distributions for the set of observations $B$, this condition implies no restriction from a biological point of view. Moreover, it ensures that $W_0 \cap \{\sum_{x \in B} \Pi_x = 1\}$ has dimension equal to $\dim(W_0) - 1$. In particular, the simplex of stochastic vectors in $W_0$ will form a semialgebraic set of $\langle B \rangle_{\mathbb{R}}$ of dimension equal to $\dim(W_0) - 1$ (as expected).

Remark 3. The subspace $\mathrm{Mod}$ of substitution matrices is usually required to be multiplicatively closed (as in the definition above) so that when two evolutionary processes are concatenated, the final process is of the same kind. The importance of this requirement is the starting point of [16], where a different approach to the definition of “evolutionary model” is provided.

Our definition of evolutionary models includes most of the well-known evolutionary models, namely those given in [17] and the equivariant models (see [5, 6]).

Example 4. Let $G$ be a permutation group of $B$, that is, a group whose elements are permutations of the set $B$, $G \leq \mathfrak{S}_k$. Given $g \in G$, write $P_g$ for the $k \times k$ permutation matrix corresponding to $g$: $(P_g)_{i,j} = 1$ if $g(j) = i$ and $0$ otherwise. The $G$-equivariant evolutionary model $\mathcal{M}_G$ is defined by taking $\mathrm{Mod}$ equal to
$$M(G) = \{ A \in M_k(\mathbb{C}) \mid P_g A P_g^{-1} = A \text{ for all } g \in G \},$$
and $W_0 = \{\Pi \in W \mid P_g \Pi = \Pi \text{ for all } g \in G\}$. These subsets are vector subspaces of $M_k(\mathbb{C})$ and $W$, respectively. Moreover, if $A_1, A_2 \in M(G)$, then
$$P_g A_1 A_2 P_g^{-1} = (P_g A_1 P_g^{-1})(P_g A_2 P_g^{-1}) = A_1 A_2,$$

and $A_1 A_2 \in M(G)$. Therefore, equivariant models provide a wide family of examples of algebraic evolutionary models in the sense of Definition 1. For example, if $B = \{A, C, G, T\}$, it can be seen that the algebraic versions of the Jukes-Cantor model [18], the Kimura models with 2 or 3 parameters [19, 20], the strand symmetric model [21] and the general Markov model [22] are instances of equivariant models:

  • if $G = \mathfrak{S}_4$, then $\mathcal{M}_G$ is the algebraic Jukes-Cantor model JC69,

  • if $G = \langle (A\,C\,G\,T), (A\,G) \rangle$, then $\mathcal{M}_G$ is the algebraic Kimura 2-parameter model K80,

  • if $G = \langle (A\,C)(G\,T), (A\,G)(C\,T) \rangle$, then $\mathcal{M}_G$ is the algebraic Kimura 3-parameter model K81,

  • if $G = \langle (A\,T)(C\,G) \rangle$, then $\mathcal{M}_G$ is known as the strand symmetric model SSM, and

  • if $G = \langle e \rangle$, then $\mathcal{M}_G$ is the general Markov model GMM.
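As an illustration of Example 4, the following Python sketch (not part of the original paper) generates each group from the generators listed above and counts the orbits of $G$ on $B \times B$. It relies on a standard fact that the text does not state explicitly: a matrix satisfies $P_g A P_g^{-1} = A$ for all $g \in G$ exactly when its entries are constant on these orbits, so the number of orbits is the dimension of $\mathrm{Mod}$. The encoding of permutations as image strings and the names `GENERATORS`, `close`, `orbits_on_pairs` and `summary` are assumptions made for this illustration.

```python
from itertools import product

B = "ACGT"

# Generators written as image strings: position i holds the image of B[i].
# E.g. the 4-cycle (A C G T) sends A->C, C->G, G->T, T->A, i.e. "CGTA".
GENERATORS = {
    "JC69": ["CGTA", "CAGT"],   # (A C G T), (A C): generate the symmetric group
    "K80":  ["CGTA", "GCAT"],   # (A C G T), (A G)
    "K81":  ["CATG", "GTAC"],   # (A C)(G T), (A G)(C T)
    "SSM":  ["TGCA"],           # (A T)(C G)
    "GMM":  [],                 # trivial group
}


def close(generators):
    """Return the group generated by the given permutations (naive closure)."""
    group = {B}  # the identity permutation
    while True:
        bigger = group | {"".join(h[B.index(c)] for c in g)
                          for g in group for h in generators}
        if bigger == group:
            return sorted(group)
        group = bigger


def orbits_on_pairs(group):
    """Orbits of G on B x B; an equivariant matrix is constant on each orbit."""
    orbits, seen = [], set()
    for x, y in product(B, repeat=2):
        if (x, y) not in seen:
            orbit = {(g[B.index(x)], g[B.index(y)]) for g in group}
            seen |= orbit
            orbits.append(sorted(orbit))
    return orbits


def summary():
    """Map each model name to (order of G, dimension of Mod)."""
    out = {}
    for name, gens in GENERATORS.items():
        group = close(gens)
        out[name] = (len(group), len(orbits_on_pairs(group)))
    return out


if __name__ == "__main__":
    for name, (order, dim_mod) in summary().items():
        print(f"{name}: |G| = {order}, dim Mod = {dim_mod}")
```

For JC69, K80 and K81 the group acts transitively on $B$, so requiring rows to sum to one removes a single parameter from the orbit count, which is consistent with the one, two and three parameters these models are named after.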

Given an evolutionary model $\mathcal{M}$ and a phylogenetic tree $T$, we define the space of parameters as
$$\mathrm{Par}_{\mathcal{M}}(T) = W_0 \times \prod_{e \in E(T)} \mathrm{Mod}.$$
Similarly, we define the space of stochastic parameters associated to $T$ by
$$\mathrm{Par}_{s\mathcal{M}}(T) = sW_0 \times \prod_{e \in E(T)} s\mathrm{Mod}.$$
Though artificial at first glance, tensors are a natural framework for describing the distributions on the set of patterns in $B$ at the leaves of a phylogenetic tree. Indeed, if $p_{x_1 x_2 \ldots x_n}$ denotes the joint probability of observing $x_1$ at leaf 1, $x_2$ at leaf 2, and so on, up to $x_n$ at leaf $n$, then the vector $p = (p_{b_1 \ldots b_1}, p_{b_1 \ldots b_1 b_2}, \ldots, p_{b_k \ldots b_k})$ provides a distribution on the set of patterns in $B$ at the leaves of $T$, and this can be regarded as the tensor having these coordinates in the natural basis,
$$p = \sum_{x_1, \ldots, x_n \in B} p_{x_1 \ldots x_n}\, x_1 \otimes \cdots \otimes x_n.$$

This motivates the following definition.

Definition 5. Given a phylogenetic tree $T$ on the set of taxa $[n]$, an $[n]$-tensor is any element of the tensor power $\mathcal{L} = \otimes_{[n]} W$.

Given an algebraic evolutionary model $\mathcal{M}$ and a phylogenetic tree $T$ with root $r$, every Markov process on $T$ (specified by a collection of parameters $\Pi$ and $A = (A^e)_{e \in E(T)}$) gives rise to a tensor in $\mathcal{L}$ in the following way: we consider a parametrization
$$\Psi_T^{\mathcal{M}} : \mathrm{Par}_{\mathcal{M}}(T) \to \mathcal{L} \tag{1}$$
defined by
$$\Psi_T^{\mathcal{M}}(\Pi, A) = \sum_{x_i \in B} p_{x_1 \ldots x_n}\, x_1 \otimes \cdots \otimes x_n,$$
where
$$p_{x_1 \ldots x_n} = \sum_{x_v \in B,\ v \in \mathrm{Int}(T)} \Pi_{x_r} \prod_{e \in E(T)} A^e_{x_{pa(e)},\, x_{ch(e)}}, \tag{2}$$

$x_v$ denotes the state at the vertex $v$, $pa(e)$ (resp. $ch(e)$) is the parent (resp. child) node of $e$, and $\Pi_x$, $x \in B$, are the coordinates of $\Pi$. When restricted to the stochastic matrices and distributions in $W_0$, this parametrization corresponds to the hidden Markov process on the tree $T$ (the leaves correspond to the observed random variables and the interior nodes to the hidden variables).
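To make equation (2) concrete, here is a minimal Python sketch (an illustration, not the authors' implementation) that evaluates the parametrization by brute force on a four-leaf tree under JC69-type matrices. The tree encoding (`PARENT`, `LEAVES`), the helper names `jc69` and `joint_distribution`, and the numerical parameter values are all hypothetical choices; exact rational arithmetic is used so that the checks are exact.

```python
from fractions import Fraction
from itertools import product

B = "ACGT"


def jc69(a):
    """JC69-type matrix: off-diagonal entries a, diagonal 1 - 3a (rows sum to one)."""
    return {x: {y: (1 - 3 * a if x == y else a) for y in B} for x in B}


def joint_distribution(parent, leaves, root_dist, matrices):
    """Brute-force evaluation of equation (2).

    parent    : dict child -> parent, describing the rooted tree
    leaves    : leaf names, in the order used to write patterns
    root_dist : dict state -> coordinate of Pi
    matrices  : dict child -> matrix A^e of the edge e = (parent[child], child)
    """
    nodes = sorted(set(parent) | set(parent.values()))
    root = next(v for v in nodes if v not in parent)
    p = {}
    for states in product(B, repeat=len(nodes)):
        x = dict(zip(nodes, states))
        weight = root_dist[x[root]]
        for child, par in parent.items():
            weight *= matrices[child][x[par]][x[child]]
        # Accumulating over all assignments sharing a leaf pattern is the
        # sum over interior states in equation (2).
        pattern = "".join(x[leaf] for leaf in leaves)
        p[pattern] = p.get(pattern, 0) + weight
    return p


# Quartet tree 12|34 rooted at the interior node r, with second interior node u.
PARENT = {"1": "r", "2": "r", "u": "r", "3": "u", "4": "u"}
LEAVES = ["1", "2", "3", "4"]
PI = {x: Fraction(1, 4) for x in B}
MATRICES = {
    "1": jc69(Fraction(1, 10)), "2": jc69(Fraction(1, 20)),
    "u": jc69(Fraction(1, 25)),
    "3": jc69(Fraction(1, 8)), "4": jc69(Fraction(1, 16)),
}

if __name__ == "__main__":
    p = joint_distribution(PARENT, LEAVES, PI, MATRICES)
    print(len(p), sum(p.values()))
    print(p["AAAA"] == p["CCCC"])
```

The enumeration visits all $4^{|N(T)|}$ state assignments, so it is only meant for very small trees; it is a direct transcription of (2) rather than an efficient algorithm.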

The parametrization (1) restricts to another polynomial map $\phi_T^{\mathcal{M}} : \mathrm{Par}_{s\mathcal{M}}(T) \to H$, where $H \subset \mathcal{L}$ is the hyperplane defined by $H = \{ p \in \mathcal{L} : \sum_{x_1, \ldots, x_n \in B} p_{x_1 \ldots x_n} = 1 \}$. Because we work in the algebraic setting, the use of the word “stochastic” in this paper is more general than usual, as we only require entries summing to one.

From now on, we will refer to this restriction as the stochastic parametrization $\phi_T^{\mathcal{M}}$. It is important to note that when we consider the distributions in $sW_0$ and the Markov matrices in $s\mathrm{Mod}$, their image under $\phi_T^{\mathcal{M}}$ lies in the standard simplex in $\mathcal{L}$ (and thus in $H$). This in turn implies that the whole image $\mathrm{Im}\,\phi_T^{\mathcal{M}}$ is contained in $H$.

We proceed to define the algebraic varieties associated to the parametrization maps defined above. Roughly speaking, algebraic varieties are sets of solutions to systems of polynomial equations (e.g. [14]).

Definition 6. The stochastic phylogenetic variety $V_T^{\mathcal{M}}$ associated to a phylogenetic tree $T$ is the smallest algebraic variety containing $\mathrm{Im}\,\phi_T^{\mathcal{M}} = \{\phi_T^{\mathcal{M}}(\Pi_r, A) : (\Pi_r, A) \in \mathrm{Par}_{s\mathcal{M}}(T)\}$ (in particular, $V_T^{\mathcal{M}} \subset H$).

Similarly, the phylogenetic variety $CV_T^{\mathcal{M}}$ associated to $T$ is the smallest algebraic variety in $\mathcal{L}$ that contains $\mathrm{Im}\,\Psi_T^{\mathcal{M}} = \{\Psi_T^{\mathcal{M}}(\Pi_r, A) : (\Pi_r, A) \in \mathrm{Par}_{\mathcal{M}}(T)\}$.

Below we explain the reason for the notation $CV_T^{\mathcal{M}}$, which was adopted from [23].

The reader may note that the position of the root $r$ of $T$ played a role in the above parametrizations. It can be shown, however, that under certain mild assumptions, $\mathrm{Im}\,\Psi_T^{\mathcal{M}}$ and $\mathrm{Im}\,\phi_T^{\mathcal{M}}$ are independent of the root position in the following sense: if two phylogenetic trees have the same topology as unrooted trees, then the smallest algebraic varieties containing the corresponding image sets are the same. For example, any model $\mathcal{M} = (W_0, \mathrm{Mod})$ satisfying (i) $\tilde{\Pi}^t := \Pi^t A$ belongs to $W_0$ for all $\Pi \in W_0$ and all $A \in \mathrm{Mod}$, and (ii) $D_{\tilde{\Pi}}^{-1} A^t D_{\Pi} \in \mathrm{Mod}$ whenever $D_{\tilde{\Pi}}^{-1}$ exists (here $D_{\omega}$ denotes the diagonal matrix with the entries of $\omega$ on the diagonal and zeros elsewhere) has this property (in this case, we say the model is root-independent). It is not difficult to check that the equivariant models satisfy these two properties (e.g. adapting the proof of [24] or [25]). For technical reasons, from now on we consider only the evolutionary models satisfying (i) and (ii). In this case, the notation $CV_T^{\mathcal{M}}$ refers to the fact that the phylogenetic variety is just the cone over the stochastic phylogenetic variety (see Figure 1 and the remark below).
Figure 1

On the left, the varieties $V_T^{\mathcal{M}}$ and $CV_T^{\mathcal{M}}$ are shown; on the right, the phylogenetic tree described in the proof of Proposition 13 is represented.
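The root-independence conditions (i) and (ii) stated before Figure 1 can be checked numerically on a single example. The sketch below is only an illustration under assumed values: it takes one strand symmetric matrix `A` and one invariant root distribution `PI` (both chosen arbitrarily for this example), forms $\tilde{\Pi}$ and $D_{\tilde{\Pi}}^{-1} A^t D_{\Pi}$, and verifies that both are again equivariant and that the joint distribution on a single edge is the same whichever endpoint is taken as the root. All identifiers are hypothetical.

```python
from fractions import Fraction as F

B = "ACGT"
SWAP = {"A": "T", "C": "G", "G": "C", "T": "A"}  # generator (A T)(C G) of the SSM group

# A strand symmetric matrix: entry (x, y) equals entry (SWAP[x], SWAP[y]).
ROWS = {"A": [F(6, 10), F(2, 10), F(1, 10), F(1, 10)],
        "C": [F(1, 10), F(5, 10), F(3, 10), F(1, 10)]}
A = {x: dict(zip(B, ROWS[x])) for x in "AC"}
for x in "GT":
    A[x] = {y: A[SWAP[x]][SWAP[y]] for y in B}

# A root distribution fixed by the group action.
PI = {"A": F(1, 10), "C": F(4, 10), "G": F(4, 10), "T": F(1, 10)}


def is_equivariant_matrix(m):
    """True if m is unchanged when rows and columns are permuted by SWAP."""
    return all(m[SWAP[x]][SWAP[y]] == m[x][y] for x in B for y in B)


def is_invariant_vector(v):
    """True if v is unchanged when its coordinates are permuted by SWAP."""
    return all(v[SWAP[x]] == v[x] for x in B)


# Condition (i): the distribution at the other end of the edge, Pi~^t = Pi^t A.
PI_TILDE = {y: sum(PI[x] * A[x][y] for x in B) for y in B}

# Condition (ii): the matrix D_{Pi~}^{-1} A^t D_Pi, i.e. the same edge read backwards.
A_REV = {y: {x: A[x][y] * PI[x] / PI_TILDE[y] for x in B} for y in B}

if __name__ == "__main__":
    print(is_invariant_vector(PI_TILDE), is_equivariant_matrix(A_REV))
```

A single numerical instance does not prove the property for the whole model; it only shows what the two conditions say in coordinates.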

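Remark 7. Let $\mathcal{M}$ be an evolutionary model satisfying (i) and (ii) above. For $p \in \mathcal{L}$, $p = \sum p_{x_1 \ldots x_n}\, x_1 \otimes \cdots \otimes x_n$, define $\lambda(p) := \sum_{x_i \in B} p_{x_1 \ldots x_n}$. Then
$$CV_T^{\mathcal{M}} = \{ p \in \mathcal{L} \mid p = \lambda(p)\, q,\ q \in V_T^{\mathcal{M}} \}$$

and $V_T^{\mathcal{M}} = CV_T^{\mathcal{M}} \cap H$. This is well known for the general Markov model [23] and can be easily generalized to any model satisfying (i) and (ii).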

The space of phylogenetic mixtures

In phylogenetics, the hypothesis that the sites of an alignment are independent and identically distributed is often used. When the assumption “identically distributed” is replaced by “distributed according to the same evolutionary model”, one obtains a phylogenetic mixture. Below, we introduce phylogenetic mixtures from the algebraic point of view (see also [26]).

Definition 8. Fix a set of taxa $[n]$ and an algebraic evolutionary model $\mathcal{M}$. A phylogenetic mixture (on $m$ classes) or $m$-mixture is any vector of the form
$$p = \sum_{i=1}^{m} \alpha_i p^i,$$

where $\alpha_i \in \mathbb{C}$ and $p^i \in \mathrm{Im}(\Psi_{T_i}^{\mathcal{M}})$ for some tree topologies $T_i$ on the set of taxa $[n]$. As $\Psi_{T_i}^{\mathcal{M}}$ is a homogeneous map, phylogenetic mixtures are represented by vectors of the form $\sum_{i=1}^{m} \check{p}^i$, where $\check{p}^i \in \mathrm{Im}(\Psi_{T_i}^{\mathcal{M}})$. We call $\mathcal{D}_{\mathcal{M}} \subset \mathcal{L}$ the space of all phylogenetic mixtures (on any number of classes) under the algebraic evolutionary model $\mathcal{M}$.

As mentioned in the introduction, the tree topologies contained in the mixture can be the same or different. An example of a phylogenetic mixture is the data modeled by the discrete Gamma-rates models (see e.g. [8]).

Restricting matrix rows to sum to one requires restricting the phylogenetic mixtures to the points of the form
$$q = \sum_{i=1}^{m} \alpha_i q^i \quad \text{where } q^i \in \mathrm{Im}(\phi_{T_i}^{\mathcal{M}}) \text{ and } \sum_i \alpha_i = 1.$$

We call $\mathcal{D}_{s\mathcal{M}}$ the space of stochastic phylogenetic mixtures.
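The definition of a stochastic mixture can be illustrated with a small Python sketch. It is a hedged example rather than anything taken from the paper: it builds two JC69 distributions on four taxa with different tree topologies, mixes them with weights summing to one, and checks that the coordinates of the mixture still sum to one. The helper names `quartet`, `mix` and `coordinate_sum`, the simplification that all pendant edges share one matrix, and the parameter values are assumptions made for brevity.

```python
from fractions import Fraction
from itertools import product

B = "ACGT"


def jc69(a):
    """JC69-type matrix: off-diagonal entries a, diagonal 1 - 3a."""
    return {x: {y: (1 - 3 * a if x == y else a) for y in B} for x in B}


def quartet(cherry, a_leaf, a_inner):
    """JC69 distribution on the quartet tree whose first cherry is the pair
    `cherry` of 0-based leaf positions; all pendant edges share jc69(a_leaf)."""
    leaf, inner = jc69(a_leaf), jc69(a_inner)
    other = [i for i in range(4) if i not in cherry]
    q = {}
    for pattern in product(B, repeat=4):
        total = Fraction(0)
        for r, u in product(B, repeat=2):  # hidden states at the two interior nodes
            w = Fraction(1, 4) * inner[r][u]
            for i in cherry:
                w *= leaf[r][pattern[i]]
            for i in other:
                w *= leaf[u][pattern[i]]
            total += w
        q["".join(pattern)] = total
    return q


def mix(weights, components):
    """Linear combination sum_i alpha_i q^i of pattern distributions."""
    return {w: sum(a * q[w] for a, q in zip(weights, components))
            for w in components[0]}


def coordinate_sum(q):
    """Sum of all coordinates of q."""
    return sum(q.values())


q1 = quartet((0, 1), Fraction(1, 10), Fraction(1, 20))  # topology 12|34
q2 = quartet((0, 2), Fraction(1, 8), Fraction(1, 30))   # topology 13|24
mixture = mix([Fraction(1, 3), Fraction(2, 3)], [q1, q2])

if __name__ == "__main__":
    print(coordinate_sum(q1), coordinate_sum(q2), coordinate_sum(mixture))
```

Because the coordinate sum is a linear function, the same check goes through for any weights with $\sum_i \alpha_i = 1$, including weights outside $[0, 1]$, which the algebraic definition allows.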

Remark 9. The phylogenetic variety of a trivalent tree topology contains all phylogenetic varieties of the non-trivalent tree topologies obtained by contracting any of its interior edges. Indeed, the latter are a particular case of the former when the matrices associated to the contracted edges are equal to the identity matrix. It follows that the space of phylogenetic mixtures on the trivalent tree topologies coincides with the space of phylogenetic mixtures on all possible topologies.

The following result was proven by Matsen, Mossel and Steel in [26] for the two state random cluster model but, as proved below, it can be easily generalized to any evolutionary model.

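Lemma 10. Given a set of taxa $[n]$ and an algebraic evolutionary model $\mathcal{M}$, the set of all phylogenetic mixtures $\mathcal{D}_{\mathcal{M}}$ is a vector subspace of $\mathcal{L}$. Similarly, $\mathcal{D}_{s\mathcal{M}}$ is a linear variety and it equals $\mathcal{D}_{\mathcal{M}} \cap H$.

Proof. $\mathcal{D}_{\mathcal{M}}$ is a $\mathbb{C}$-vector space and $\mathcal{D}_{s\mathcal{M}}$ is a linear variety by definition. It follows that $\mathcal{D}_{\mathcal{M}}$ is an algebraic variety that contains $\mathrm{Im}\,\Psi_T^{\mathcal{M}}$ for any phylogenetic tree $T$ on the set of taxa $[n]$. Therefore, it also contains $CV_T^{\mathcal{M}}$, and $\mathcal{D}_{\mathcal{M}}$ equals the set of points of the form $p = \sum_i p^i$, where $p^i \in CV_{T_i}^{\mathcal{M}}$. Similarly, $\mathcal{D}_{s\mathcal{M}}$ is an algebraic variety that contains $\mathrm{Im}\,\phi_T^{\mathcal{M}}$, so it also contains $V_T^{\mathcal{M}}$ for any phylogenetic tree $T$. It follows that $\mathcal{D}_{s\mathcal{M}}$ is formed by points of type $q = \sum_i \alpha_i q^i$, where $q^i \in V_{T_i}^{\mathcal{M}}$ and $\sum_i \alpha_i = 1$.

Now we check that $\mathcal{D}_{s\mathcal{M}} = \mathcal{D}_{\mathcal{M}} \cap H$. Let $q \in \mathcal{D}_{s\mathcal{M}}$, so that $q = \sum_{i=1}^{m} \alpha_i q^i$ for some $m$, $q^i \in V_{T_i}^{\mathcal{M}}$, and $\sum_i \alpha_i = 1$. Clearly, $q \in \mathcal{D}_{\mathcal{M}}$. Moreover, the sum of coordinates of $q$, $\lambda(q)$, satisfies $\lambda(q) = \sum_i \alpha_i \lambda(q^i) = \sum_i \alpha_i = 1$. Thus, $q \in H$. Conversely, let $p = \sum_{i=1}^{m} p^i$ with $p^i \in CV_{T_i}^{\mathcal{M}}$ for certain tree topologies $T_i$, and assume that $\lambda(p) = 1$. Apply Remark 7 to each $p^i$ to get $p^i = \lambda(p^i)\, q^i$ for some $q^i \in V_{T_i}^{\mathcal{M}}$. Then
$$p = \sum_i p^i = \sum_i \lambda(p^i)\, q^i$$

and $1 = \lambda(p) = \sum_i \lambda(p^i) \lambda(q^i) = \sum_i \lambda(p^i)$ since each $q^i$ lies on $H$. This proves that $p \in \mathcal{D}_{s\mathcal{M}}$. □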

Remark 11. In the proof of the above lemma, we have seen that $\mathcal{D}_{\mathcal{M}}$ and $\mathcal{D}_{s\mathcal{M}}$ can be alternatively described as the spaces of mixtures obtained from the respective varieties $CV_T^{\mathcal{M}}$ and $V_T^{\mathcal{M}}$ (i.e., not only from the images of the parametrization maps).

The space of phylogenetic mixtures for equivariant evolutionary models

This section provides a precise description of the space $\mathcal{D}_{\mathcal{M}}$ for the equivariant models $\mathcal{M}$ listed in Example 4 (JC69, K80, K81, SSM, and GMM). First, we recall some definitions and facts of group theory and linear representation theory. From now on, $B = \{A, C, G, T\}$, $k = 4$, $W = \langle B \rangle_{\mathbb{C}}$, $n$ is fixed and $\mathcal{L} = \otimes_{[n]} W$.

Background on representation theory

We introduce some tools in group representation theory needed in the sequel. We refer the reader to [15] as a classical reference for these concepts. Although some of the following results are valid for any permutation group, for simplicity in the exposition we restrict to permutations of four elements (as our applications deal only with the case $B = \{A, C, G, T\}$).

Let $G \leq \mathfrak{S}_4$ be a permutation group. The trivial element in $\mathfrak{S}_4$ will be denoted as $e$. We write $\rho_G$ for the restriction to $G$ of the defining representation $\rho : \mathfrak{S}_4 \to GL(W)$ given by the permutations of the basis $B$ of $W$. This representation induces a $G$-module structure on $W$ by setting $g \cdot x := \rho(g)(x) \in W$. In fact, $\rho$ induces a $G$-module structure on any tensor power $\otimes^s W$ by setting
$$g \cdot (x_1 \otimes \cdots \otimes x_s) := g \cdot x_1 \otimes \cdots \otimes g \cdot x_s, \tag{3}$$

and extending by linearity. From now on, the space $\mathcal{L} = \otimes^n W$ will be implicitly considered as a $G$-module with this action. We call $\chi$ the character associated to the representation $\rho_G : G \to GL(W)$, i.e. $\chi(g)$ is the trace of the corresponding permutation matrix or, in other words, $\chi(g)$ equals the number of elements of $B$ fixed by the permutation $g \in G$. Then the character associated to the induced representation $G \to GL(\otimes^n W)$ is $\chi^n$, the $n$-th power of $\chi$.

We write $N_1, \ldots, N_t$ for the irreducible representations of $G$ and $\omega_1, \ldots, \omega_t$ for the corresponding irreducible characters, where $N_1$ and $\omega_1$ will denote the trivial representation and trivial character, respectively. Maschke’s Theorem applied to the action of $G$ described in (3) states that there is a decomposition of $\otimes^s W$ into its isotypic components:
$$\otimes^s W = \bigoplus_{i=1}^{t} (\otimes^s W)[\omega_i], \tag{4}$$

where each $(\otimes^s W)[\omega_i]$ is isomorphic to a number of copies of the irreducible representation $N_i$ associated to $\omega_i$, $(\otimes^s W)[\omega_i] \cong N_i \otimes \mathbb{C}^{m_i(s)}$, for some non-negative integer $m_i(s)$ called the multiplicity of $\otimes^s W$ relative to $\omega_i$. The isotypic component of $\mathcal{L}$ associated to the trivial representation will be denoted by $\mathcal{L}^G$ and it is composed of the $n$-tensors invariant under the action of $G$ defined in (3). If $\mathcal{M}$ is the equivariant evolutionary model associated to $G$, $\mathcal{L}^G$ will also be denoted as $\mathcal{L}^{\mathcal{M}}$. It is easy to prove that $CV_T^{\mathcal{M}} \subset \mathcal{L}^G$ (see Lemma 4.3 of [5]).

We recall that the set $\Omega_G = \{\omega_i\}_{i=1,\ldots,t}$ of irreducible characters of $G$ forms an orthonormal basis of the space of characters relative to the inner product defined by
$$\langle f, h \rangle := \frac{1}{|G|} \sum_{g \in G} f(g)\, \overline{h(g)}. \tag{5}$$

We introduce the following notion.

Definition 12. An $n$-word over $B$ is an ordered sequence $\mathtt{X} = x_1 x_2 \ldots x_n$, where every letter is taken from the alphabet $B$. The set of $n$-words is equivalent to the cartesian power $B^n$ and will be denoted by $\mathbf{B}$.

Words will be denoted in typewriter uppercase font (like $\mathtt{X}$) and their letters in lowercase (like $x$). Sometimes it will be convenient to identify the $[n]$-tensors of the form $x_1 \otimes \cdots \otimes x_n$ with the $n$-words $\mathtt{X} = x_1 \ldots x_n$. Consequently, we will identify $\mathbf{B}$ with the natural basis of $\mathcal{L}$. Given $\mathtt{X} \in \mathbf{B}$, we will denote by $\{\mathtt{X}\}_G = \{g \cdot \mathtt{X} \mid g \in G\}$ the $G$-orbit of $\mathtt{X}$. We associate a $G$-invariant tensor, $\tau\{\mathtt{X}\}_G$, to each orbit $\{\mathtt{X}\}_G$: $\tau\{\mathtt{X}\}_G := \sum_{g \in G} g \cdot \mathtt{X}$. It is straightforward to see that every $G$-invariant tensor can be written as a linear combination of the tensors $\tau\{\mathtt{X}\}_G$, $\mathtt{X} \in \mathbf{B}$. On the other hand, the set of different $\tau\{\mathtt{X}\}_G$’s is linearly independent, since distinct $G$-orbits are disjoint subsets of $\mathbf{B}$ and the corresponding tensors therefore involve disjoint sets of basis elements.
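The orbit tensors just introduced are easy to enumerate for small $n$. The following Python sketch is an illustration under stated assumptions: it hard-codes the four elements of the K81 group as image strings, computes $\tau\{\mathtt{X}\}_G$ for every $n$-word, and keeps one tensor per orbit. The names `K81`, `act`, `tau` and `orbit_tensors` are hypothetical.

```python
from itertools import product

B = "ACGT"

# The K81 group as image strings: g[i] is the image of B[i].
# In order: identity, (A C)(G T), (A G)(C T), (A T)(C G).
K81 = ["ACGT", "CATG", "GTAC", "TGCA"]


def act(g, word):
    """Apply the permutation g letter by letter, as in (3)."""
    return word.translate(str.maketrans(B, g))


def tau(group, word):
    """The invariant tensor sum_{g in G} g.X, stored as {word: coefficient}."""
    t = {}
    for g in group:
        image = act(g, word)
        t[image] = t.get(image, 0) + 1
    return t


def orbit_tensors(group, n):
    """One tensor per G-orbit of n-words, keyed by the smallest word of the orbit."""
    basis = {}
    for letters in product(B, repeat=n):
        t = tau(group, "".join(letters))
        basis.setdefault(min(t), t)
    return basis


if __name__ == "__main__":
    basis = orbit_tensors(K81, 3)
    print(len(basis))
    print(tau(K81, "AAC"))
```

Since the supports of these tensors partition the set of $n$-words, the number of keys returned by `orbit_tensors` is the dimension of the space of invariant tensors for the chosen group and $n$.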

Mixtures for equivariant models

For each $x \in B$, we write $S_G(x)$ for the stabiliser of $x$ under the action of $G$, that is, $S_G(x) = \{g \in G : g \cdot x = x\}$.

Proposition 13. Let $G$ be a subgroup of $\mathfrak{S}_4$ such that $S_G(x_0) = \{e\}$ for some $x_0 \in B$. Then every tensor of type $\tau\{\mathtt{X}\}_G$, $\mathtt{X} \in \mathbf{B}$, lies in the image of $\Psi_T^{\mathcal{M}_G}$ for some tree topology $T$. In particular, $\mathcal{L}^G \subset \mathcal{D}_{\mathcal{M}_G}$.

Proof. For any $G$-orbit $\{\mathtt{Y}\}_G$, $\mathtt{Y} \in \mathbf{B}$, write $\tau\{\mathtt{Y}\}_G = y_1 \otimes \cdots \otimes y_n + \sum_{g \neq e} g \cdot y_1 \otimes \cdots \otimes g \cdot y_n$. We will explicitly associate a tree topology $T$ and parameters $(\Pi, A)$ to it so that the tensor $\tau\{\mathtt{Y}\}_G$ is equal to $\Psi_T^{\mathcal{M}_G}(\Pi, A)$. To this aim, we denote by $B(\mathtt{Y})$ the set of letters appearing in $\mathtt{Y}$. Then for every $z \in B(\mathtt{Y})$, consider the set $L_z^{\mathtt{Y}} = \{ i \in [n] : y_i = z \}$, so that $\bigcup_{z \in B(\mathtt{Y})} L_z^{\mathtt{Y}} = [n]$.

We construct a tree $T$ on the set of taxa $[n]$ in the following way. We join each taxon in $L_z^{\mathtt{Y}}$ to a common node $v_z$ by an edge. Then each vertex $v_z$ is joined to the root of the tree (we call it $r$) by an edge that we denote as $e(z)$ (see Figure 1). Now, on the edges joining any $v_z$ with some leaf in $L_z^{\mathtt{Y}}$, we consider the identity matrix, while the matrix on $e(z)$ is defined by taking
$$A^{e(z)}_{i,j} = \begin{cases} 1 & \text{if } (i,j) = (h \cdot x_0, h \cdot z) \text{ for some } h \in G, \\ 0 & \text{otherwise.} \end{cases}$$
Finally, if $c$ is the cardinality of $\{x_0\}_G$, define the distribution at the root $\Pi = (\Pi_A, \Pi_C, \Pi_G, \Pi_T)$ by
$$\Pi_z = \begin{cases} \frac{1}{c} & \text{if } z \in \{x_0\}_G, \\ 0 & \text{otherwise.} \end{cases}$$
It is straightforward to check that these matrices and the vector $\Pi$ are $G$-equivariant, so $(\Pi, A) \in \mathrm{Par}_{\mathcal{M}_G}(T)$. Now, from (2) and the definition of $\Pi$, we can write
$$p_{x_1 \ldots x_n} = \sum_{g \in G} \ \sum_{\{x_z\}_{z \in B(\mathtt{Y})} \subset B} P_{x_1 \ldots x_n}\big(g, \{x_z\}_{z \in B(\mathtt{Y})}\big)$$
where
$$P_{x_1 \ldots x_n}\big(g, \{x_z\}_{z \in B(\mathtt{Y})}\big) = \Pi_{g \cdot x_0} \prod_{z \in B(\mathtt{Y})} A^{e(z)}_{g \cdot x_0,\, x_z} \prod_{j \in L_z^{\mathtt{Y}}} \delta_{x_z, x_j}$$
(here $\delta_{a,b}$ stands for the Kronecker delta, i.e. $\delta_{a,a} = 1$, $\delta_{a,b} = 0$ if $a \neq b$). Moreover, from the definition of the matrix $A^{e(z)}$, we have
$$A^{e(z)}_{g \cdot x_0,\, x_z} = \begin{cases} 1 & \text{if } (g \cdot x_0, x_z) = (h \cdot x_0, h \cdot z) \text{ for some } h \in G, \\ 0 & \text{otherwise.} \end{cases}$$
The hypothesis $S_G(x_0) = \{e\}$ ensures that $(g \cdot x_0, x_z) = (h \cdot x_0, h \cdot z)$ if and only if $g = h$. From this, it becomes clear that $P_{x_1 \ldots x_n}\big(g, \{x_z\}_{z \in B(\mathtt{Y})}\big) = 0$ unless

  1. $x_z = g \cdot z$, for $z \in B(\mathtt{Y})$, and

  2. for each $i \in L_z^{\mathtt{Y}}$, $x_i$ is equal to $x_z = g \cdot z$,

in which case $P_{x_1 \ldots x_n}\big(g, \{x_z\}_{z \in B(\mathtt{Y})}\big) = \Pi_{g \cdot x_0} = \frac{1}{c}$. It follows that
$$p_{x_1 \ldots x_n} = \begin{cases} 1 & \text{if } x_1 \ldots x_n \in \{\mathtt{Y}\}_G, \\ 0 & \text{otherwise,} \end{cases}$$

and $\Psi_T^{\mathcal{M}_G}(\Pi, A) = \tau\{\mathtt{Y}\}_G$. Moreover, as the set of $\tau\{\mathtt{Y}\}_G$, for $\mathtt{Y} \in \mathbf{B}$, generates the vector space $\mathcal{L}^G$, the second claim follows. □
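The construction in this proof can be replayed on a small case. The sketch below is an illustration only, with assumed choices throughout: it uses the K81 group, the state $x_0 = A$ and the word `"AACG"`, builds the root distribution and the matrices $A^{e(z)}$ as defined above, evaluates (2) on the resulting tree, and compares the outcome with $\tau\{\mathtt{Y}\}_G$. With the normalised root distribution used here the comparison is made up to the constant factor $1/c$, which is harmless because the parametrization is homogeneous and the space of mixtures is closed under scaling. The names `parameters`, `psi` and `tau` are hypothetical.

```python
from fractions import Fraction
from itertools import product

B = "ACGT"
G = ["ACGT", "CATG", "GTAC", "TGCA"]  # K81 group; g[i] is the image of B[i]


def act(g, word):
    """Apply the permutation g to every letter of word."""
    return "".join(g[B.index(x)] for x in word)


def parameters(Y, x0="A"):
    """Root distribution Pi and edge matrices A^{e(z)} from the proof, for the word Y."""
    orbit_x0 = {act(g, x0) for g in G}
    c = len(orbit_x0)
    Pi = {z: Fraction(1, c) if z in orbit_x0 else Fraction(0) for z in B}
    A = {z: {(i, j): int(any((act(h, x0), act(h, z)) == (i, j) for h in G))
             for i in B for j in B}
         for z in sorted(set(Y))}
    return Pi, A, c


def psi(Y, Pi, A):
    """Equation (2) on the tree of the proof: root r, one node v_z per letter z
    of Y, and identity matrices on pendant edges (leaf i copies the state of v_{y_i})."""
    letters = sorted(A)
    p = {}
    for xr in B:
        for states in product(B, repeat=len(letters)):
            xz = dict(zip(letters, states))
            weight = Pi[xr]
            for z in letters:
                weight *= A[z][(xr, xz[z])]
            if weight:
                pattern = "".join(xz[y] for y in Y)
                p[pattern] = p.get(pattern, 0) + weight
    return p


def tau(Y):
    """The orbit tensor sum_{g in G} g.Y as {word: coefficient}."""
    t = {}
    for g in G:
        t[act(g, Y)] = t.get(act(g, Y), 0) + 1
    return t


Y = "AACG"
Pi, A, c = parameters(Y)
p = psi(Y, Pi, A)

if __name__ == "__main__":
    print(sorted(p.items()))
```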

Remark 14. The above result is not true if the hypothesis $S_G(x_0) = \{e\}$ is removed. For example, if $G = \langle (A\,C\,G\,T), (A\,G) \rangle$ (so that $\mathcal{M} = K80$), then $S_G(A) = S_G(G) = \{e, (C\,T)\}$ and $S_G(C) = S_G(T) = \{e, (A\,G)\}$. In that case, it can be shown that the tensor $\tau\{\mathtt{ACGT}\}_G$ associated to the $G$-orbit $\{\mathtt{ACGT}\}_G$ is not in $\mathrm{Im}\,\Psi_T^{K80}$ for any tree topology $T$ with 4 leaves.

Since the above condition on the group holds for $G = \langle e \rangle$, $G = \langle (A\,T)(C\,G) \rangle$, and $G = \langle (A\,C)(G\,T), (A\,G)(C\,T) \rangle$, we deduce the following claim.

Corollary 15. If $G$ corresponds to any of the equivariant models K81, SSM or GMM, we have $\mathcal{L}^G \subset \mathcal{D}_{\mathcal{M}_G}$.

In phylogenetics, an invariant of a phylogenetic tree $T$ is an equation satisfied by the expected distributions of patterns at the leaves of $T$, irrespective of the continuous parameters of the model $\mathcal{M}$. In the algebraic geometry setting, these are the equations satisfied by all $p \in CV_T^{\mathcal{M}}$. Invariants were introduced by Lake (see [27]) and Cavender and Felsenstein (see [28]). A phylogenetic invariant of $T$ is an invariant of $T$ which is not an invariant of all other phylogenetic trees (under the same model $\mathcal{M}$). Equivalently, $f$ is a phylogenetic invariant of $CV_T^{\mathcal{M}}$ if it is an invariant of $CV_T^{\mathcal{M}}$ and there exists a tree topology $T'$ such that $f$ is not an invariant of $CV_{T'}^{\mathcal{M}}$. In principle, phylogenetic invariants can be used for tree topology reconstruction purposes.

Remark 16.

  (a) It can be seen that the condition of trivial stabiliser for some element of $B$ given in Proposition 13 guarantees that all the irreducible representations of $G$ will be present in the decomposition of $W$ into its isotypic components. Then, by using the results of [6], it follows that the corresponding equivariant model will have no linear phylogenetic invariants. This fact was already known for the models in the above corollary: see [29] for the GMM, [21] for the SSM and [30] for the K81. Here we provided an alternative proof based on elementary tools of group theory.

  (b) The models JC69 and K80 are known to have linear phylogenetic invariants, but these are the only linear invariants which do not define hyperplanes containing $\mathcal{L}^G$, as can be deduced from [3, 30]. In fact, for these two models, the claim of the corollary is still true as stated in the following theorem. Nevertheless, we have not been able to provide a unified proof of this fact because of the different properties of the corresponding groups. There is no description of the space of linear invariants for other equivariant models not listed in Example 4, so we cannot claim that the result below still holds.


Theorem 17. If $\mathcal{M}_G$ is one of the equivariant evolutionary models JC69, K80, K81, SSM, or GMM, then the space of phylogenetic mixtures $\mathcal{D}_{\mathcal{M}_G}$ coincides with $\mathcal{L}^G$, and $\mathcal{D}_{s\mathcal{M}_G}$ equals $\mathcal{L}^G \cap H$.

This theorem allows one to identify the set of all phylogenetic mixtures $\mathcal{D}_{\mathcal{M}_G}$ with $\mathcal{L}^G$, which is a vector subspace of $\mathcal{L}$ whose linear equations are easy to describe. In other words, $\mathcal{L}^G$ is the smallest linear space containing the data coming from any mixture of trees evolving under the model $\mathcal{M}_G$. One can therefore use $\mathcal{L}^G$ to select the most suitable model for the given data. This has been studied in [7].

Proof of Theorem 17. For equivariant models we have that $CV_T^{\mathcal{M}_G} \subset \mathcal{L}^G$ for any tree $T$. Hence, by Lemma 10 and the definition of $\mathcal{D}_{\mathcal{M}_G}$, $\mathcal{D}_{\mathcal{M}_G}$ is a vector subspace of $\mathcal{L}^G$.

From Corollary 15, we infer the equality $\mathcal{L}^G = \mathcal{D}_{\mathcal{M}_G}$ for the models K81, SSM and GMM. For the other two models, JC69 and K80, it remains to prove that there does not exist any hyperplane $\Pi$ containing $\mathcal{D}_{\mathcal{M}_G}$ and not containing $\mathcal{L}^G$. If such a hyperplane existed, then it would contain all the points of $CV_T^{\mathcal{M}_G}$ for any tree topology $T$. It therefore suffices to prove that for these models there are no homogeneous linear polynomials vanishing on all tree topologies, except for the linear equations vanishing on $\mathcal{L}^G$. This has been seen in Remark 16(b).

The equality $\mathcal{D}_{s\mathcal{M}_G} = \mathcal{L}^G \cap H$ follows immediately from Lemma 10 and the first assertion in the statement of this theorem.

Remark 18. We are indebted to one of the referees of this paper for pointing out that the preceding result, as well as the second part of Proposition 13, can also be inferred from Proposition 4.9 of [5]: under the assumption that the stabiliser of some state is trivial, Draisma and Kuttler show that the star tree is the smallest algebraic variety containing the tensors τ{X} G , for pure tensors X (that is, tensors of rank 1). It follows that the set of mixtures on the star tree equals the space $\mathcal{L}^G$.

Remark 19. It is not difficult to check that for $\mathcal{M}$ = K81, SSM or GMM, $\mathcal{D}^{\mathcal{M}}$ coincides with the space of mixtures on the star tree (see also [26], where the same result is proven for a 2-state model). On the contrary, this is not true for the JC69 and K80 models because in this case the star tree lies in a smaller linear space as a consequence of the existence of phylogenetic linear equations (see Remark 16(b)).

Equations for the space $\mathcal{L}^G$

Our goal here is to compute the dimension of $\mathcal{L}^G$ for the groups associated to the equivariant models listed in Definition 4, and to list a set of independent linear equations defining this space.

Proposition 20. Using the notations above,
  (i) $\dim \mathcal{L}^{SSM} = 2^{2n-1}$,
  (ii) $\dim \mathcal{L}^{K81} = 4^{n-1}$,
  (iii) $\dim \mathcal{L}^{K80} = 2^{2n-3} + 2^{n-2}$, and
  (iv) $\dim \mathcal{L}^{JC69} = \frac{2^{2n-3}+1}{3} + 2^{n-2}$.

Proof. Let $\mathcal{M}$ be any equivariant model. By definition, we know that $\mathcal{L}^G$ is the isotypic component of $\otimes^n W$ associated to the trivial representation, $(\otimes^n W)[\omega_1]$. Since the dimension of the trivial representation is one, it follows that the dimension of $\mathcal{L}^{\mathcal{M}}$ is precisely the multiplicity $m_1(n)$, i.e. the number of times the trivial representation appears in the decomposition of $\otimes^n W$ into isotypic components. This multiplicity $m_1(n)$ equals (see (5))
$$\langle \chi^n, \omega_1 \rangle = \frac{1}{|G|} \sum_{g \in G} \chi^n(g)\, \omega_1(g).$$
The proof ends by grouping the elements of G in the conjugacy classes of G for SSM, K81, K80, or JC69. Recall that the conjugacy classes of a group G are the disjoint sets of the form $C(g) = \{h^{-1} g h : h \in G\}$. If $C_1,\dots,C_s$ are the conjugacy classes of G, write $C(G) = (|C_1|,\dots,|C_s|)$ for the s-tuple of their cardinalities, so that $\sum_{i=1}^{s} |C_i| = |G|$. Recall that $\chi^n(g_1) = \chi^n(g_2)$ whenever $g_1$ and $g_2$ lie in the same conjugacy class, so we can represent $\chi^n$ by an s-tuple $\chi^n_{C(G)} = (t_1,\dots,t_s)$, where $t_i = \chi^n(g)$ for any $g \in C_i$. Thus, we have $m_1(n) = \frac{1}{|G|} \sum_{i=1}^{s} \chi^n(g_i)\,|C_i|$, where $g_i$ is any element in the conjugacy class $C_i$. The result for $\mathcal{M}$ = SSM, K81, K80, or JC69 follows by applying Table 1.
Table 1

Details of the conjugacy classes of some permutation groups needed in the proof of Proposition 20

| $G \le S_4$ | $\mathcal{M}$ | Representatives of conj. classes | $C(G)$ | $\chi^n_{C(G)}$ |
|---|---|---|---|---|
| 〈(A T)(C G)〉 | SSM | {e, (A T)(C G)} | (1,1) | $(4^n, 0)$ |
| 〈(A C)(G T), (A G)(C T)〉 | K81 | {e, (A T)(C G), (A C)(G T), (A G)(C T)} | (1,1,1,1) | $(4^n, 0, 0, 0)$ |
| 〈(A C G T), (A G)〉 | K80 | {e, (A C)(G T), (A G)(C T), (A C G T), (A G)} | (1,2,1,2,2) | $(4^n, 0, 0, 0, 2^n)$ |
| $S_4$ | JC69 | {e, (A C)(G T), (A C G T), (A G), (A C G)} | (1,3,6,6,8) | $(4^n, 0, 0, 2^n, 1)$ |

For each permutation group in the column on the left, the corresponding equivariant model and conjugacy classes are described. For each conjugacy class, we give a list of representatives, its cardinality and the value taken by the character $\chi^n$ on it.
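The computation in the proof of Proposition 20 can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes that $\chi^n(g)$ is the n-th power of the number of states fixed by g (as in the last column of Table 1), generates each group from the generators listed in the table, and averages $\chi^n$ over the group. The helper names `perm`, `generate` and `dim_L` are ours.

```python
STATES = "ACGT"

def perm(*cycles):
    """Permutation of STATES (as a dict) built from cycles such as "ACGT" or "AG"."""
    p = {s: s for s in STATES}
    for c in cycles:
        for i, s in enumerate(c):
            p[s] = c[(i + 1) % len(c)]
    return p

def generate(gens):
    """Close the generators under composition; elements are tuples of images of A, C, G, T."""
    identity = tuple(STATES)
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for h in gens:
            gh = tuple(h[x] for x in g)  # apply g first, then the generator h
            if gh not in group:
                group.add(gh)
                frontier.append(gh)
    return group

GROUPS = {
    "SSM": generate([perm("AT", "CG")]),
    "K81": generate([perm("AC", "GT"), perm("AG", "CT")]),
    "K80": generate([perm("ACGT"), perm("AG")]),
    "JC69": generate([perm("ACGT"), perm("AC")]),  # a 4-cycle and a transposition generate S_4
}

def dim_L(model, n):
    """Multiplicity of the trivial representation: average over G of (fixed states)^n."""
    G = GROUPS[model]
    total = sum(sum(img == s for img, s in zip(g, STATES)) ** n for g in G)
    return total // len(G)

# Closed forms of Proposition 20 (valid for n >= 2).
CLOSED_FORM = {
    "SSM": lambda n: 2 ** (2 * n - 1),
    "K81": lambda n: 4 ** (n - 1),
    "K80": lambda n: 2 ** (2 * n - 3) + 2 ** (n - 2),
    "JC69": lambda n: (2 ** (2 * n - 3) + 1) // 3 + 2 ** (n - 2),
}

for model in GROUPS:
    print(model, [dim_L(model, n) for n in range(2, 6)])
```

For small n the averaged character agrees with the closed forms of the proposition, which is a useful sanity check when transcribing the formulas.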

Our next goal is to provide a set of independent linear equations for L G . Before stating the main result, let us introduce some useful notation.

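Notation 21. We consider the following subsets of $\mathcal{B} = \mathcal{B}_n$:
$$\mathcal{B}_0 = \{A\cdots A,\ C\cdots C,\ G\cdots G,\ T\cdots T\},$$
$$\mathcal{B}_{AC\mid GT} = \{A,C\}^n \cup \{G,T\}^n, \quad \mathcal{B}_{AG\mid CT} = \{A,G\}^n \cup \{C,T\}^n, \quad \mathcal{B}_{AT\mid CG} = \{A,T\}^n \cup \{C,G\}^n,$$
$$\text{and} \quad \mathcal{B}_2 = \mathcal{B}_{AC\mid GT} \cup \mathcal{B}_{AG\mid CT} \cup \mathcal{B}_{AT\mid CG}.$$

The set $\mathcal{B}_0$ is composed of all n-words with only one letter and it is contained in $\mathcal{B}_{AC\mid GT}$, $\mathcal{B}_{AG\mid CT}$, and $\mathcal{B}_{AT\mid CG}$. Similarly, $\mathcal{B}_2$ is composed of all n-words with two letters at most. It is straightforward to check that $|\mathcal{B}_{AC\mid GT}| = |\mathcal{B}_{AG\mid CT}| = |\mathcal{B}_{AT\mid CG}| = 2^{n+1}$ and $|\mathcal{B}_2| = 3 \cdot 2^{n+1} - 8$.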
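These cardinalities are easy to confirm by brute force for small n. The sketch below is only an illustration; the function names `words` and `notation_sets` are ours, and words are represented as Python strings over the alphabet ACGT.

```python
from itertools import product

def words(alphabet, n):
    """All n-words over the given alphabet, as a set of strings."""
    return {"".join(w) for w in product(alphabet, repeat=n)}

def notation_sets(n):
    """The subsets of Notation 21 for words of length n."""
    B0 = {s * n for s in "ACGT"}
    B_AC_GT = words("AC", n) | words("GT", n)
    B_AG_CT = words("AG", n) | words("CT", n)
    B_AT_CG = words("AT", n) | words("CG", n)
    B2 = B_AC_GT | B_AG_CT | B_AT_CG
    return B0, B_AC_GT, B_AG_CT, B_AT_CG, B2

for n in range(2, 6):
    B0, B_AC_GT, B_AG_CT, B_AT_CG, B2 = notation_sets(n)
    # B_0 is the common intersection of the three two-block sets.
    assert B_AC_GT & B_AG_CT & B_AT_CG == B0
    print(n, len(B_AC_GT), len(B2))
```

The count for $\mathcal{B}_2$ is an instance of inclusion-exclusion: the three sets have $2^{n+1}$ elements each and any two of them meet exactly in $\mathcal{B}_0$.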

We adopt multiplicative notation for n-words: for instance, we write $C^l$ for the word $\underbrace{C\cdots C}_{l}$, and $(A^l)(G^m)x_{l+m+1}\cdots x_n$ for $\underbrace{A\cdots A}_{l}\,\underbrace{G\cdots G}_{m}\,x_{l+m+1}\cdots x_n$, where $x_{l+m+1},\dots,x_n$ represent any letters.

The main result of this section is the following:

Theorem 22. A set of linearly independent equations $E_{\mathcal{M}}$ defining $\mathcal{L}^{\mathcal{M}}$ for $\mathcal{M}$ = JC69, K80, K81, or SSM is given by

$E_{SSM}$: the equations $p_X = p_{(AT)(CG)X}$ for all $X \in \mathcal{B}$ with $x_1 \in \{A, C\}$;

$E_{K81}$: the equations in $E_{SSM}$, and the equations $p_X = p_{(AC)(GT)X}$ for all $X \in \mathcal{B}$ with $x_1 = A$;

$E_{K80}$: the equations in $E_{K81}$, plus the equations $p_X = p_{(AG)X}$ for all $X \in \mathcal{B} \setminus \mathcal{B}_{AG\mid CT}$ having $x_1 = A$ and satisfying the following condition: if T appears in X, then there is a C in a preceding position;

$E_{JC69}$: the equations in $E_{K80}$, together with the equations $p_X = p_{(AT)X}$ for all $X \in \mathcal{B}_{AC\mid GT} \setminus \mathcal{B}_0$ of the form $(A^l)(C^m)x_{l+m+1}\cdots x_n$; plus the equations $p_X = p_{(AC)X}$ and $p_X = p_{(AT)X}$ for all $X \in \mathcal{B} \setminus \mathcal{B}_2$ of the form $(A^l)(C^m)x_{l+m+1}\cdots x_n$ and satisfying the condition: if T appears in X, then there is a G in a preceding position.

The number of equations added in each case is $2^{2n-1}$ for SSM, $2^{2n-2}$ for K81, $2^{2n-3} - 2^{n-2}$ for K80, and $2^{n-1} - 1 + 2\left(\frac{2^{2n-3}+1}{3} - 2^{n-2}\right)$ for JC69.
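The statement can be tested by brute force for small n. The sketch below is an illustration under our own conventions, not the authors' implementation: an equation $p_X = p_Y$ is stored as a pair of strings, the K81 equations use the coset representative (A C)(G T) and the K80 equations are taken over words outside $\mathcal{B}_{AG\mid CT}$ (as in the proof and in Table 2), and a union-find structure counts the classes of coordinates forced to be equal, which is the dimension of the solution space. All function names are ours.

```python
from itertools import product

def act(cycles, word):
    """Apply the permutation given by disjoint cycles (e.g. ["AT", "CG"]) letter by letter."""
    m = {}
    for c in cycles:
        for i, s in enumerate(c):
            m[s] = c[(i + 1) % len(c)]
    return "".join(m.get(s, s) for s in word)

def preceded(word, late, early):
    """True if `late` is absent, or some `early` occurs before the first `late`."""
    i = word.find(late)
    return i == -1 or early in word[:i]

def within(word, *alphabets):
    return any(set(word) <= set(a) for a in alphabets)

def a_then_c(word):
    """Form (A^l)(C^m)x...: starts with A and the first letter different from A is C."""
    return word.startswith("A") and word.lstrip("A").startswith("C")

def equations(model, n):
    B = ["".join(w) for w in product("ACGT", repeat=n)]
    eqs = [(X, act(["AT", "CG"], X)) for X in B if X[0] in "AC"]
    if model == "SSM":
        return eqs
    eqs += [(X, act(["AC", "GT"], X)) for X in B if X[0] == "A"]
    if model == "K81":
        return eqs
    eqs += [(X, act(["AG"], X)) for X in B
            if X[0] == "A" and not within(X, "AG", "CT") and preceded(X, "T", "C")]
    if model == "K80":
        return eqs
    for X in filter(a_then_c, B):
        if within(X, "AC"):                               # X in B_{AC|GT} \ B_0
            eqs.append((X, act(["AT"], X)))
        elif len(set(X)) >= 3 and preceded(X, "T", "G"):  # X in B \ B_2
            eqs += [(X, act(["AC"], X)), (X, act(["AT"], X))]
    return eqs

def solution_dimension(eqs, n):
    """Number of classes of coordinates identified by the equalities p_X = p_Y."""
    parent = {"".join(w): "".join(w) for w in product("ACGT", repeat=n)}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for x, y in eqs:
        parent[find(x)] = find(y)
    return len({find(x) for x in parent})

for model in ("SSM", "K81", "K80", "JC69"):
    eqs = equations(model, 3)
    print(model, len(eqs), solution_dimension(eqs, 3))
```

Since every equation identifies two coordinates, the equations are linearly independent exactly when their number equals $4^n$ minus the number of classes; comparing both numbers with Proposition 20 therefore checks independence and completeness at once.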

Before proving this theorem, we explain how these sets of equations were obtained. Notice that a system of linear equations of $\mathcal{L}^G$ is given by
$$p_{gX} = p_X \qquad \forall g \in G,\ X \in \mathcal{B}.$$
The role played by the G-orbits on $\mathcal{B}$ becomes apparent. Indeed, the idea is to relate the equations to the orbits of a subgroup of G. To this aim, let H be a subgroup of G and write $H \backslash G = \{Hg : g \in G\}$ for the set of right cosets of H in G. We consider a transversal of $H \backslash G$, i.e. a collection $\{g_1,\dots,g_{[G:H]}\}$ such that $G = \bigsqcup_{i=1}^{[G:H]} H g_i$. Then the orbit of any $X \in \mathcal{B}$ can be decomposed as
$$\{X\}_G = \bigcup_{i=1,\dots,[G:H]} \{g_i X\}_H. \qquad (6)$$
This decomposition establishes the connection between the G-orbits and the H-orbits. In order to obtain a system of equations for $\mathcal{L}^G$, once $E_H$ has been computed, it is enough to add the equations involving the permutations in a transversal $\{g_1 = e, g_2,\dots,g_{[G:H]}\}$ of $H \backslash G$:
$$p_X = p_{g_2 X}, \quad p_X = p_{g_3 X}, \quad \dots, \quad p_X = p_{g_{[G:H]} X} \qquad \text{for all } X \in \mathcal{B}.$$

Notice that the union in (6) is not necessarily disjoint, as it may happen that $\{g_i X\}_H = \{g_j X\}_H$ for $i \neq j$. In this case, the equality $p_{g_i X} = p_{g_j X}$ already holds in the space $\mathcal{L}^H$ and does not provide any new restriction. In order to avoid this situation and obtain a minimal set of equations for $\mathcal{L}^G$, we impose the special conditions on the $X \in \mathcal{B}$ in the statement of the theorem.
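The decomposition (6) and the possible overlap of its pieces can be seen on a small case. The following sketch, with names of our own choosing, takes H = K81 and G = K80 with the transversal {e, (A G)} and words of length 3; the eight elements of K80 are written out explicitly as lists of cycles.

```python
from itertools import product

def act(cycles, word):
    """Apply the permutation given by disjoint cycles to every letter of the word."""
    m = {}
    for c in cycles:
        for i, s in enumerate(c):
            m[s] = c[(i + 1) % len(c)]
    return "".join(m.get(s, s) for s in word)

K81 = [[], ["AC", "GT"], ["AG", "CT"], ["AT", "CG"]]
K80 = K81 + [["AG"], ["CT"], ["ACGT"], ["ATGC"]]

def orbit(group, X):
    return frozenset(act(g, X) for g in group)

def orbit_via_cosets(X):
    """Right-hand side of (6) for H = K81, G = K80 and the transversal {e, (A G)}."""
    return orbit(K81, X) | orbit(K81, act(["AG"], X))

for X in ("AAC", "AAG"):
    first, second = orbit(K81, X), orbit(K81, act(["AG"], X))
    relation = "disjoint" if first.isdisjoint(second) else "equal"
    print(X, relation, sorted(orbit_via_cosets(X)))
```

For X = AAC the two K81-orbits are disjoint, so $p_X = p_{(AG)X}$ is a new restriction; for X = AAG they coincide, so the corresponding equation already holds in $\mathcal{L}^{K81}$ and would be redundant.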

Proof. For each model $\mathcal{M}$, we prove that the corresponding equations are linearly independent and that there are as many equations as the codimension of $\mathcal{L}^{\mathcal{M}}$. By Proposition 20, the codimension of $\mathcal{L}^{\mathcal{M}}$ is $2^{2n-1}$ for SSM, $3 \cdot 4^{n-1}$ for K81, $7 \cdot 2^{2n-3} - 2^{n-2}$ for K80, and $4^n - \frac{2^{2n-3}+1}{3} - 2^{n-2}$ for JC69. In the sequel, we refer to the groups by the name of the equivariant model associated to them.

SSM: As SSM is the group {e, (A T)(C G)}, a set of equations for SSM is $\{p_X = p_{(AT)(CG)X}\}$. Fixing $x_1$ in {A, C} we obtain $2^{2n-1}$ linearly independent equations (equations involving different coordinates). The codimension of $\mathcal{L}^{SSM}$ is equal to $2^{2n-1}$, which coincides with the number of equations given, and thus this set of equations defines $\mathcal{L}^{SSM}$.

K81: Since a transversal of SSM\K81 is {e, (A C)(G T)}, the hyperplanes $p_X = p_{(AC)(GT)X}$ contain $\mathcal{L}^{K81}$ but not $\mathcal{L}^{SSM}$. Moreover, using (6) we see that the orbit $\{X\}_{K81}$ decomposes into the disjoint union of $\{X\}_{SSM}$ and $\{(AC)(GT)X\}_{SSM}$ for any $X \in \mathcal{B}$. Therefore, the equations given for K81 involve different coordinates than those in $E_{SSM}$. Requiring $x_1 = A$, we obtain $4^{n-1}$ linearly independent new equations. Thus $E_{K81}$ defines the space $\mathcal{L}^{K81}$ because the number of linearly independent equations provided, $2^{2n-1} + 4^{n-1} = 3 \cdot 4^{n-1}$, coincides with the codimension of $\mathcal{L}^{K81}$.

K80: The set {e, (A G)} is a transversal of K81\K80. In order to show that the equations provided are linearly independent of those in $E_{K81}$, we apply (6) to this transversal to obtain $\{X\}_{K80} = \{X\}_{K81} \cup \{(AG)X\}_{K81}$. If $X \notin \mathcal{B}_{AG\mid CT}$, then $\{(AG)X\}_{K81}$ and $\{X\}_{K81}$ are disjoint, so each equation $p_X = p_{(AG)X}$ is linearly independent from $E_{K81}$. The set $\mathcal{B} \setminus \mathcal{B}_{AG\mid CT}$ has cardinality $4^n - 2^{n+1}$ and, if $X \in \mathcal{B} \setminus \mathcal{B}_{AG\mid CT}$, each orbit $\{X\}_{K80}$ has cardinality 8. Therefore, the number of different orbits for $X \in \mathcal{B} \setminus \mathcal{B}_{AG\mid CT}$ is $(4^n - 2^{n+1})/8 = 2^{2n-3} - 2^{n-2}$. Moreover, the choice of X's in $\mathcal{B} \setminus \mathcal{B}_{AG\mid CT}$ with $x_1 = A$ and satisfying "if T appears in X, there is a C in a preceding position" guarantees that we take only one element in each $\{X\}_{K80}$, and thus we are adding exactly one equation for each of these orbits. Overall, there are $3 \cdot 4^{n-1} + (2^{2n-3} - 2^{n-2}) = 7 \cdot 2^{2n-3} - 2^{n-2}$ linearly independent equations in $E_{K80}$. This number coincides with the codimension of $\mathcal{L}^{K80}$, and so these equations define $\mathcal{L}^{K80}$.

JC69: A transversal of K80\JC69 is {e, (A C), (A T)}, therefore (6) applies to give $\{X\}_{JC69} = \{X\}_{K80} \cup \{(AC)X\}_{K80} \cup \{(AT)X\}_{K80}$. We distinguish two cases:

if $X \in \mathcal{B}_{AC\mid GT} \setminus \mathcal{B}_0$, then $\{(AC)X\}_{K80} = \{X\}_{K80}$ and $\{X\}_{JC69}$ is the disjoint union of $\{X\}_{K80}$ and $\{(AT)X\}_{K80}$. As such, each equation $p_X = p_{(AT)X}$ is linearly independent from $E_{K80}$. Moreover, if $X \in \mathcal{B}_{AC\mid GT} \setminus \mathcal{B}_0$ is of the form $(A^l)(C^m)x_{l+m+1}\cdots x_n$, we have $2^{n-1} - 1$ such equations and they are linearly independent.

if $X \in \mathcal{B} \setminus \mathcal{B}_2$, then the three orbits $\{(AC)X\}_{K80}$, $\{(AT)X\}_{K80}$, and $\{X\}_{K80}$ have 8 elements each and are disjoint. Therefore, for these X's, each equation of type $p_X = p_{(AC)X}$ or $p_X = p_{(AT)X}$ is linearly independent from $E_{K80}$. Moreover, as $\mathcal{B} \setminus \mathcal{B}_2$ has cardinality $4^n - 3 \cdot 2^{n+1} + 8$ and is covered by these orbits, we have $\frac{4^n - 3 \cdot 2^{n+1} + 8}{24} = \frac{1}{3}\left(2^{2n-3} + 1\right) - 2^{n-2}$ different orbits. The restriction to the elements of the form $(A^l)(C^m)x_{l+m+1}\cdots x_n$ and satisfying that "if T appears in X, there is some G in a preceding position" guarantees that the equations are written only once for each orbit.

Summing up, there are
$$7 \cdot 2^{2n-3} - 2^{n-2} + 2^{n-1} - 1 + 2\left(\frac{1}{3}\left(2^{2n-3} + 1\right) - 2^{n-2}\right)$$
linearly independent equations in $E_{JC69}$ that contain $\mathcal{L}^{JC69}$. As this number is equal to the codimension $4^n - \frac{2^{2n-3}+1}{3} - 2^{n-2}$ of $\mathcal{L}^{JC69}$, the proof is complete.

All the equalities among orbits used in this proof are summarized in Table 2.
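Table 2

Equalities among orbits used in the proof of Theorem 22

|   | $\{X\}_{GMM}$ | $\{X\}_{SSM}$ | $\{X\}_{K81}$ | $\{X\}_{K80}$ | $\{X\}_{JC69}$ |
|---|---|---|---|---|---|
| $\mathcal{B}_0$ | $\{X\}$ | … $\cup\ \{(AT)(CG)X\}$ | … $\cup\ \{(AC)(GT)X\}_{SSM}$ | … | … |
| $\mathcal{B}_{AG\mid CT}$ | ” | ” | ” | … | … $\cup\ \{(AC)X\}_{K80}$ |
| $\mathcal{B}_{AC\mid GT}$ | ” | ” | ” | … $\cup\ \{(AG)X\}_{K81}$ | … $\cup\ \{(AT)X\}_{K80}$ |
| $\mathcal{B}_{AT\mid CG}$ | ” | ” | ” | … $\cup\ \{(AG)X\}_{K81}$ | … $\cup\ \{(AC)X\}_{K80}$ |
| $\mathcal{B} \setminus \mathcal{B}_2$ | ” | ” | ” | … $\cup\ \{(AG)X\}_{K81}$ | … $\cup\ \{(AC)X\}_{K80} \cup \{(AT)X\}_{K80}$ |

For each word X in $\mathcal{B}$, the orbit in the model that indexes the column is described. The dots mean the set on the left and ” means the set on the top.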

Remark 23. The sets of equations of Theorem 22 have been successfully used in [7] for model selection. Although the dimensions of these linear spaces are exponential in n, in practice it is not necessary to consider the full set of equations, but only those containing the patterns observed in the data. This is crucial for the applicability of the method, since the number of different columns in an alignment is very small compared to the dimension of these spaces.
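To illustrate why only observed patterns matter, the sketch below groups the columns of a toy alignment by their orbit under K81 and reports the counts inside each observed orbit; under the model, the expected frequencies within an orbit coincide. This is only an illustration of the restriction to observed patterns and is not the procedure of [7]; the function name `observed_orbits` and the toy columns are hypothetical.

```python
from collections import Counter

def act(cycles, word):
    """Apply the permutation given by disjoint cycles to every letter of the word."""
    m = {}
    for c in cycles:
        for i, s in enumerate(c):
            m[s] = c[(i + 1) % len(c)]
    return "".join(m.get(s, s) for s in word)

K81 = [[], ["AC", "GT"], ["AG", "CT"], ["AT", "CG"]]

def observed_orbits(columns, group):
    """Map each orbit containing an observed pattern to the counts of its elements."""
    counts = Counter(columns)
    orbits = {}
    for X in counts:
        orb = frozenset(act(g, X) for g in group)
        orbits[orb] = {Y: counts.get(Y, 0) for Y in sorted(orb)}
    return orbits

# Hypothetical alignment columns on 3 taxa.
columns = ["AAC", "CCA", "AAC", "GGT", "ACG"]
for orb, table in observed_orbits(columns, K81).items():
    print(table)
```

Only two of the sixteen K81-orbits on three taxa appear here, so only the equations among the coordinates of those two orbits need to be examined.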

Example 24. As an example, we compute a minimal system of equations for SSM, K81, K80, and JC69 in the case of 3 leaves.

Equations for $\mathcal{L}^{SSM}$: $E_{SSM}$ is composed of the following equations:
$$\begin{array}{llll}
p_{AAA} = p_{TTT}, & p_{AAC} = p_{TTG}, & p_{AAG} = p_{TTC}, & p_{AAT} = p_{TTA}, \\
p_{ACA} = p_{TGT}, & p_{ACC} = p_{TGG}, & p_{ACG} = p_{TGC}, & p_{ACT} = p_{TGA}, \\
p_{AGA} = p_{TCT}, & p_{AGC} = p_{TCG}, & p_{AGG} = p_{TCC}, & p_{AGT} = p_{TCA}, \\
p_{ATA} = p_{TAT}, & p_{ATC} = p_{TAG}, & p_{ATG} = p_{TAC}, & p_{ATT} = p_{TAA}, \\
p_{CAA} = p_{GTT}, & p_{CAC} = p_{GTG}, & p_{CAG} = p_{GTC}, & p_{CAT} = p_{GTA}, \\
p_{CCA} = p_{GGT}, & p_{CCC} = p_{GGG}, & p_{CCG} = p_{GGC}, & p_{CCT} = p_{GGA}, \\
p_{CGA} = p_{GCT}, & p_{CGC} = p_{GCG}, & p_{CGG} = p_{GCC}, & p_{CGT} = p_{GCA}, \\
p_{CTA} = p_{GAT}, & p_{CTC} = p_{GAG}, & p_{CTG} = p_{GAC}, & p_{CTT} = p_{GAA}.
\end{array}$$
Equations for $\mathcal{L}^{K81}$: $E_{K81}$ is formed by $E_{SSM}$ and
$$\begin{array}{llll}
p_{AAA} = p_{CCC}, & p_{AAC} = p_{CCA}, & p_{AAG} = p_{CCT}, & p_{AAT} = p_{CCG}, \\
p_{ACA} = p_{CAC}, & p_{ACC} = p_{CAA}, & p_{ACG} = p_{CAT}, & p_{ACT} = p_{CAG}, \\
p_{AGA} = p_{CTC}, & p_{AGC} = p_{CTA}, & p_{AGG} = p_{CTT}, & p_{AGT} = p_{CTG}, \\
p_{ATA} = p_{CGC}, & p_{ATC} = p_{CGA}, & p_{ATG} = p_{CGT}, & p_{ATT} = p_{CGG}.
\end{array}$$
Equations for $\mathcal{L}^{K80}$: $E_{K80}$ is formed by $E_{K81}$ and
$$\begin{array}{lll}
p_{AAC} = p_{GGC}, & p_{ACA} = p_{GCG}, & p_{ACC} = p_{GCC}, \\
p_{ACG} = p_{GCA}, & p_{ACT} = p_{GCT}, & p_{AGC} = p_{GAC}.
\end{array}$$
Equations for $\mathcal{L}^{JC69}$: $E_{JC69}$ is formed by $E_{K80}$ and
$$\begin{array}{lll}
p_{AAC} = p_{TTC}, & p_{ACA} = p_{TCT}, & p_{ACC} = p_{TCC}, \\
p_{ACG} = p_{CAG}, & p_{ACG} = p_{TCG}. &
\end{array}$$

Identifiability of phylogenetic mixtures

In this section we study the identifiability of phylogenetic mixtures. To this end, we use projective algebraic varieties and techniques from algebraic geometry. It is not our intention to give the reader a background on these tools, so we refer to the algebraic geometry book [14] and, more specifically, to [10] for the usage of these techniques in the study of phylogenetic mixtures.

There is a natural isomorphism between the points lying in the hyperplane H considered above, $H = \{p = (p_{A\cdots A},\dots,p_{T\cdots T}) \in \mathcal{L} : \sum p_{x_1\cdots x_n} = 1\}$, and the open affine subset $\{p = [p_{A\cdots A} : \dots : p_{T\cdots T}] : \sum p_{x_1\cdots x_n} \neq 0\}$ of $\mathbb{P}^{4^n-1} = \mathbb{P}(\mathcal{L})$. We use the notation $[p_{A\cdots A} : \dots : p_{T\cdots T}]$ for projective coordinates (in contrast to $(p_{A\cdots A},\dots,p_{T\cdots T})$ used for affine coordinates). The projective phylogenetic variety $PV_T^{\mathcal{M}}$ associated to a phylogenetic tree T is the projective closure in $\mathbb{P}(\mathcal{L})$ of the image of the stochastic parameterization $\phi_T^{\mathcal{M}}$ defined above. That is, it is the smallest projective variety in $\mathbb{P}(\mathcal{L})$ containing $\operatorname{Im} \phi_T^{\mathcal{M}}$ via the above isomorphism.

In what follows, we explain the relationship between this new variety and $CV_T^{\mathcal{M}}$ and $V_T^{\mathcal{M}}$. By Remark 7, it becomes clear that $CV_T^{\mathcal{M}}$ equals the affine cone over the projective phylogenetic variety $PV_T^{\mathcal{M}}$ (for the general Markov model, see also [23], Proposition 1). This implies that $\dim CV_T^{\mathcal{M}} = \dim PV_T^{\mathcal{M}} + 1$, and if $p = (p_{A\cdots A},\dots,p_{T\cdots T})$ belongs to $CV_T^{\mathcal{M}}$, then $q := [p_{A\cdots A} : \dots : p_{T\cdots T}]$ belongs to $PV_T^{\mathcal{M}}$. Moreover, if $\lambda := \sum p_{x_1\cdots x_n}$ is not zero, then $\left(\frac{p_{A\cdots A}}{\lambda},\dots,\frac{p_{T\cdots T}}{\lambda}\right)$ is a point in the affine stochastic phylogenetic variety $V_T^{\mathcal{M}}$.

Before defining identifiability of mixtures, we consider the following construction of projective algebraic varieties.

Definition 25. Given two projective varieties $X, Y \subset \mathbb{P}^m$, the join of X and Y, $X \vee Y$, is the smallest variety in $\mathbb{P}^m$ containing all lines $\overline{xy}$ with $x \in X$, $y \in Y$, and $x \neq y$ (see [14], 8.1 for details). Similarly, one defines the join of projective varieties $X_1,\dots,X_h \subset \mathbb{P}^m$, $\vee_{i=1}^{h} X_i$, as the smallest subvariety in $\mathbb{P}^m$ containing all the linear varieties spanned by $x_1,\dots,x_h$ with $x_i \in X_i$ and $x_i \neq x_j$. It is known that
$$\dim\left(\vee_{i=1}^{h} X_i\right) \le \min\left\{\sum_{i=1}^{h} \dim(X_i) + h - 1,\ m\right\}.$$

The right hand side of this inequality is usually known as the expected dimension of $\vee_{i=1}^{h} X_i$.

For instance, if we consider the join $\vee_{i=1}^{h} PV_{T_i}^{\mathcal{M}}$ for certain tree topologies $T_i$ on the leaf set [n] and a given evolutionary model $\mathcal{M}$, then there is a (dominant rational) map
$$PV_{T_1}^{\mathcal{M}} \times PV_{T_2}^{\mathcal{M}} \times \dots \times PV_{T_h}^{\mathcal{M}} \times \mathbb{P}^{h-1} \dashrightarrow \vee_{i=1}^{h} PV_{T_i}^{\mathcal{M}} \subset \mathbb{P}(\mathcal{L}), \qquad (7)$$
which is the projective closure of the parameterization $\phi_{T_1} \vee \dots \vee \phi_{T_h}$ defined by
$$\mathrm{Par}_s^{\mathcal{M}}(T_1) \times \dots \times \mathrm{Par}_s^{\mathcal{M}}(T_h) \times \Omega \to \mathcal{L}, \qquad (\xi_1,\dots,\xi_h,\mathbf{a}) \mapsto \sum_i a_i\, \phi_{T_i}^{\mathcal{M}}(\xi_i).$$

Here, $\Omega = \{\mathbf{a} = (a_1,\dots,a_h) \mid \sum_i a_i = 1\}$ is isomorphic to an affine open subset of $\mathbb{P}^{h-1}$. In this setting, an h-mixture on $\{T_1,\dots,T_h\}$ corresponds to a point in the variety $\vee_{i=1}^{h} PV_{T_i}^{\mathcal{M}}$. We will use this algebraic variety to study the identifiability of phylogenetic mixtures.

When considering unmixed models $\mathcal{M}$ on trivalent trees on n taxa, generic identifiability of the tree topology is equivalent to the projective varieties $PV_T^{\mathcal{M}}$ and $PV_{T'}^{\mathcal{M}}$ being different when $T \neq T'$ (see [31]). The identifiability of the continuous parameters must take into account the possibility of permuting the labels of the states at the interior nodes, as such permutations give rise to the same joint distribution at the leaves. In the language of algebraic geometry, generic identifiability of the continuous parameters of the model implies that the map $\phi_T^{\mathcal{M}}$ is generically finite (i.e. the preimage of a generic point is a finite number of points; see [31]). In this case, the fiber dimension theorem ([14], Theorem 11.12) applies and we have that $\dim PV_T^{\mathcal{M}}$ is equal to the number of stochastic parameters of the model, $\dim \mathrm{Par}_s^{\mathcal{M}}(T)$. Therefore, if the continuous parameters are generically identifiable for the unmixed trees under $\mathcal{M}$, then the dimension of the variety $PV_T^{\mathcal{M}}$ is the same for all trivalent tree topologies on n taxa. This dimension is denoted by $s_{\mathcal{M}}(n)$.

Example 26. The tree topologies and the continuous parameters are generically identifiable for the unmixed equivariant models JC69, K80, K81, SSM, and GMM on trees with any number of leaves (see [9] and [6], Corollary 3.9).

From now on we only consider trees without nodes of degree 2, so that the number of free stochastic parameters on a phylogenetic tree on n taxa under $\mathcal{M}$ is $s_{\mathcal{M}}(n)$.

We recall the definition of generic identifiability of the tree topologies on h-mixtures (see [10]).

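Definition 27. The tree topologies on h-mixtures under $\mathcal{M}$ are generically identifiable if for any set of trivalent tree topologies $\{T_1,\dots,T_h\}$ and a generic choice of $(\xi_1,\dots,\xi_h,\mathbf{a}) \in \mathrm{Par}_s^{\mathcal{M}}(T_1) \times \dots \times \mathrm{Par}_s^{\mathcal{M}}(T_h) \times \Omega$, the equality
$$\phi_{T_1} \vee \dots \vee \phi_{T_h}(\xi_1,\dots,\xi_h,\mathbf{a}) = \phi_{T'_1} \vee \dots \vee \phi_{T'_h}(\xi'_1,\dots,\xi'_h,\mathbf{a}'),$$
for tree topologies $\{T'_1,\dots,T'_h\}$ and $(\xi'_1,\dots,\xi'_h,\mathbf{a}') \in \mathrm{Par}_s^{\mathcal{M}}(T'_1) \times \dots \times \mathrm{Par}_s^{\mathcal{M}}(T'_h) \times \Omega$, implies
$$\{T_1,\dots,T_h\} = \{T'_1,\dots,T'_h\}.$$

In terms of algebraic varieties this is equivalent to saying that the variety $\vee_{i=1}^{h} PV_{T_i}^{\mathcal{M}}$ is not contained in $\vee_{i=1}^{h} PV_{T'_i}^{\mathcal{M}}$ and vice versa.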

The tree topologies are the discrete parameters of h-mixtures. When considering the continuous parameters of h-mixtures, the above mentioned label-swapping can be disregarded. We give the following definition according to [12].

Definition 28. The continuous parameters of h-mixtures on $T_1,\dots,T_h$ under an evolutionary model $\mathcal{M}$ are generically identifiable if, for a generic choice of stochastic parameters $(\xi_1,\dots,\xi_h,\mathbf{a})$, the equality
$$\phi_{T_1} \vee \dots \vee \phi_{T_h}(\xi_1,\dots,\xi_h,\mathbf{a}) = \phi_{T_1} \vee \dots \vee \phi_{T_h}(\xi'_1,\dots,\xi'_h,\mathbf{a}')$$

for stochastic parameters $\xi'_1,\dots,\xi'_h,\mathbf{a}'$ implies that there is a permutation $\sigma \in S_h$ such that $\sigma \cdot (T_1,\dots,T_h) = (T_1,\dots,T_h)$, $\xi'_i = \xi_{\sigma(i)}$, and $a'_i = a_{\sigma(i)}$ for $i = 1,\dots,h$. In other words, we only allow swapping of the continuous parameters when at least two tree topologies coincide.

Definition 29. An h-mixture under a model M is said to be identifiable if both its tree topologies and its continuous parameters are generically identifiable.

In terms of algebraic varieties, generic identifiability of continuous parameters on h-mixtures implies that the generic fibers (i.e. preimages of generic points) of the map $\phi_{T_1} \vee \dots \vee \phi_{T_h}$ are finite. In this case, the fiber dimension theorem applied to (7) (cf. [14], Theorem 11.12) gives
$$\dim\left(\vee_{i=1}^{h} PV_{T_i}\right) = \sum_{i=1}^{h} \dim(PV_{T_i}) + h - 1.$$

The following result demonstrates the need for careful inspection of identifiability of mixtures with many components (i.e. large values of h).

Theorem 30. Let [n] be a set of taxa and $\mathcal{M}$ be an evolutionary model for which the continuous parameters are generically identifiable on trivalent (unmixed) trees. In addition, let $s_{\mathcal{M}}(n)$ be the dimension of $PV_T^{\mathcal{M}}$ for any trivalent tree T, and set $h_0(n) := \frac{\dim \mathcal{D}^{\mathcal{M}}}{s_{\mathcal{M}}(n) + 1}$. Then the h-mixtures of trees on [n] evolving under $\mathcal{M}$ are not identifiable for $h \ge h_0(n)$.

Remark 31. Note that, in the above definition of $h_0(n)$, $\dim \mathcal{D}^{\mathcal{M}}$ also depends on n.

Corollary 32. Let [n] be a set of taxa and $\mathcal{M}$ be one of the equivariant models JC69, K80, K81, SSM, or GMM. Then the phylogenetic h-mixtures under these models are not identifiable for $h \ge h_0(n)$, where
$$h_0(n) = \begin{cases}
\dfrac{4^n}{12(2n-3)+4}, & \text{if } \mathcal{M} = \text{GMM}, \\[2ex]
\dfrac{2^{2n-1}}{6(2n-3)+2}, & \text{if } \mathcal{M} = \text{SSM}, \\[2ex]
\dfrac{4^{n-1}}{3(2n-3)+1}, & \text{if } \mathcal{M} = \text{K81}, \\[2ex]
\dfrac{2^{2n-3}+2^{n-2}}{2(2n-3)+1}, & \text{if } \mathcal{M} = \text{K80}, \\[2ex]
\dfrac{2^{2n-3}+3 \cdot 2^{n-2}+1}{3(2n-2)}, & \text{if } \mathcal{M} = \text{JC69}.
\end{cases}$$

Proof. Theorem 17 shows that $\mathcal{L}^{\mathcal{M}} = \mathcal{D}^{\mathcal{M}}$ and Proposition 20 gives the dimension of $\mathcal{L}^{\mathcal{M}}$ in each case. Next, we calculate: $s_{GMM}(n) = 12(2n-3) + 3$, $s_{SSM}(n) = 6(2n-3) + 1$, $s_{K81}(n) = 3(2n-3)$, $s_{K80}(n) = 2(2n-3)$, and $s_{JC69}(n) = 2n-3$. Applying Theorem 30, we conclude the proof. □
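The bound of Corollary 32 is straightforward to tabulate. The following sketch, with names of our own choosing, computes $h_0(n)$ exactly from the dimensions of Proposition 20 (taking $\dim \mathcal{D}^{GMM} = 4^n$, as in the first case of the corollary) and from the parameter counts in the proof above, and reports the smallest integer h for which the corollary rules out identifiability.

```python
from fractions import Fraction
from math import ceil

# dim D^M = dim L^M (Theorem 17 and Proposition 20), for n >= 2.
DIM = {
    "GMM": lambda n: 4 ** n,
    "SSM": lambda n: 2 ** (2 * n - 1),
    "K81": lambda n: 4 ** (n - 1),
    "K80": lambda n: 2 ** (2 * n - 3) + 2 ** (n - 2),
    "JC69": lambda n: (2 ** (2 * n - 3) + 1) // 3 + 2 ** (n - 2),
}

# s_M(n): number of free stochastic parameters on a trivalent tree with n leaves.
PARAMS = {
    "GMM": lambda n: 12 * (2 * n - 3) + 3,
    "SSM": lambda n: 6 * (2 * n - 3) + 1,
    "K81": lambda n: 3 * (2 * n - 3),
    "K80": lambda n: 2 * (2 * n - 3),
    "JC69": lambda n: 2 * n - 3,
}

def h0(model, n):
    """h_0(n) = dim D^M / (s_M(n) + 1), as an exact fraction."""
    return Fraction(DIM[model](n), PARAMS[model](n) + 1)

def first_excluded_h(model, n):
    """Smallest integer h with h >= h_0(n)."""
    return ceil(h0(model, n))

for model in DIM:
    print(model, [first_excluded_h(model, n) for n in range(4, 9)])
```

Because the numerator grows exponentially in n while the denominator grows linearly, the bound becomes less restrictive as the number of taxa increases.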

Example 33. Consider the Kimura 3-parameter model K81 on n = 4 taxa. For any h ≥ 4, phylogenetic h-mixtures are not identifiable by Corollary 32. We are not aware of any result proving that mixtures of 2 or 3 different tree topologies under this model are identifiable (either for the tree parameters or for the continuous parameters).

Example 34. Consider the Jukes-Cantor model JC69 on n = 4 taxa. Then Corollary 32 tells us that for h ≥ 3, h-mixtures are not identifiable. Therefore, for this particular model on four taxa the cases in which the identifiability holds are known: the tree and the continuous parameters are generically identifiable for the unmixed model; the tree parameters are generically identifiable for 2-mixtures ([10], Theorem 10); the continuous parameters are generically identifiable for 2-mixtures on different tree topologies and not identifiable for the same tree topology ([10], Theorem 23); neither the continuous parameters nor the tree topologies are generically identifiable for mixtures with more than two components (Corollary 32).

Proof of Theorem 30. Let $\operatorname{edim}(h) := h\, s_{\mathcal{M}}(n) + h - 1$. Then the variety $\vee_{i=1}^{h} PV_{T_i}$ has dimension ≤ edim(h). Indeed, as $\vee_i \phi_{T_i}$ is a parameterization of an open subset of $\vee_{i=1}^{h} PV_{T_i}$, the dimension of $\vee_{i=1}^{h} PV_{T_i}$ is less than or equal to $\sum_i \dim PV_{T_i} + h - 1$. Moreover, the dimension of $PV_{T_i}$ is equal to $s_{\mathcal{M}}(n)$ if $T_i$ is trivalent (since the continuous parameters for the unmixed models under consideration are generically identifiable) and is less than $s_{\mathcal{M}}(n)$ for non-trivalent trees. Therefore, $\dim\left(\vee_{i=1}^{h} PV_{T_i}\right) \le \operatorname{edim}(h)$.

If all $T_i$ are trivalent trees, then $\sum_i \dim PV_{T_i} + h - 1 = \operatorname{edim}(h)$ and, therefore, $\dim\left(\vee_{i=1}^{h} PV_{T_i}\right) < \operatorname{edim}(h)$ if and only if $\dim\left(\vee_{i=1}^{h} PV_{T_i}\right) < \sum_i \dim PV_{T_i} + h - 1$. Moreover, by the fiber dimension theorem applied to $\vee_i \phi_{T_i}$, the equality of dimensions holds if and only if the generic fiber of $\vee_i \phi_{T_i}$ has dimension 0. In particular, if $\dim\left(\vee_{i=1}^{h} PV_{T_i}\right) < \operatorname{edim}(h)$, then the continuous parameters of this phylogenetic mixture are not identifiable.

As $h_0(n) = \frac{\dim \mathcal{D}^{\mathcal{M}}}{s_{\mathcal{M}}(n) + 1}$, we have that $\operatorname{edim}(h_0(n)) = h_0(n)\,(s_{\mathcal{M}}(n) + 1) - 1 = \dim \mathcal{D}^{\mathcal{M}} - 1$. Now, we fix an $h \in \mathbb{N}$ with $h \ge h_0(n)$, so that $\operatorname{edim}(h) \ge \dim(\mathcal{D}^{\mathcal{M}}) - 1$.
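As a worked instance of this computation (an illustration using the values of Proposition 20 and Corollary 32, not an additional step of the proof), take the model K81 on n = 4 taxa:

```latex
\dim \mathcal{D}^{K81} = 4^{3} = 64, \qquad
s_{K81}(4) = 3\,(2 \cdot 4 - 3) = 15, \qquad
h_0(4) = \frac{64}{15 + 1} = 4,
\qquad
\operatorname{edim}(4) = 4 \cdot 15 + 4 - 1 = 63 = \dim \mathcal{D}^{K81} - 1 .
```

So for h = 4 the expected dimension of the join already equals the dimension of the whole projectivised space of mixtures, in agreement with Example 33.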

There are two possible scenarios:
  (a) For any set of tree topologies $\{T_1,\dots,T_h\}$, the dimension of $\vee_{i=1}^{h} PV_{T_i}$ is less than $\dim(\mathcal{D}^{\mathcal{M}}) - 1$.
  (b) There exists a set of tree topologies $\{T_1,\dots,T_h\}$ for which $\dim\left(\vee_{i=1}^{h} PV_{T_i}\right) = \dim(\mathcal{D}^{\mathcal{M}}) - 1$.

Case (a) implies that the dimension of $\vee_{i=1}^{h} PV_{T_i}$ is less than edim(h) for any set of trivalent tree topologies $\{T_1,\dots,T_h\}$. Based on the conclusions drawn above, this implies that the continuous parameters are not generically identifiable.

In case (b), $\vee_{i=1}^{h} PV_{T_i}$ coincides with $\mathbb{P}(\mathcal{D}^{\mathcal{M}})$. Indeed, $\vee_{i=1}^{h} PV_{T_i}$ is contained in $\mathbb{P}(\mathcal{D}^{\mathcal{M}})$, both varieties are irreducible, and $\dim\left(\vee_{i=1}^{h} PV_{T_i}\right) = \dim(\mathcal{D}^{\mathcal{M}}) - 1 = \dim(\mathbb{P}(\mathcal{D}^{\mathcal{M}}))$, which implies that both varieties coincide. In particular, any h-mixture (which is a point in $\mathbb{P}(\mathcal{D}^{\mathcal{M}})$) would be contained in $\vee_{i=1}^{h} PV_{T_i}$, and therefore the topologies are not generically identifiable.

Remark 35. The negative result of Theorem 30 should be complemented with the following positive result of Rhodes and Sullivant in [12]: if $\mathcal{M}$ = GMM and one restricts to h-mixtures on the same trivalent tree topology T, then the tree topology and the continuous parameters are generically identifiable if $h < 4^{\lceil n/4 \rceil - 1}$.

Conclusions

In this paper, we have dealt with the space of phylogenetic mixtures for evolutionary equivariant models. We have shown that for the case of the Jukes-Cantor model, the Kimura models with two or three parameters, the strand symmetric model and the general Markov model, this linear space is defined by the set of linear equations satisfied by the distributions of the patterns at the leaves of a tree that evolves under that model. It follows that this space completely characterizes the model. The use of tools from group theory and group representation theory played a major role, and allowed us to design a procedure to produce minimal systems of equations for these spaces and for any number of taxa. This procedure has been implemented successfully in a new method for model selection in phylogenetics based on linear invariants (see [7]), which is available online at http://genome.crg.es/cgi-bin/phylo_mod_sel/AlgModelSelection.pl.

In the last part of the paper, we proved new results concerning the identifiability of phylogenetic mixtures. Namely, we provided an upper bound for the number of components (classes) of a mixture so that the identifiability of both the continuous and the discrete parameters is still possible.

Declarations

Acknowledgements

All authors are partially supported by Generalitat de Catalunya, 2009 SGR 1284. Research of the first and second authors was partially supported by Ministerio de Educación y Ciencia MTM2009-14163-C02-02 (Spain). Research of the third author was partially supported by grant BIO2011-26205 from the Ministerio de Educación y Ciencia (Spain). We would like to thank both referees for very useful comments that led to major improvements.

Authors’ Affiliations

(1) Departament de Matemàtica Aplicada I, ETSEIB, Universitat Politècnica de Catalunya
(2) Centre for Genomic Regulation (CRG)

References

  1. Posada D: The effect of branch length variation on the selection of models of molecular evolution. J Mol Evol. 2001, 52: 434-444.
  2. Felsenstein J: Inferring Phylogenies. 2004, Sunderland: Sinauer Associates, Inc.
  3. Fu YX, Li W: Construction of linear invariants in phylogenetic inference. Math Biosci. 1992, 109 (2): 201-228. 10.1016/0025-5564(92)90045-X
  4. Steel M, Hendy M, Székely L, Erdős P: Spectral analysis and a closest tree method for genetic sequences. Appl Math Lett. Int J Rapid Publication. 1992, 5: 63-67.
  5. Draisma J, Kuttler J: On the ideals of equivariant tree models. Mathematische Annalen. 2009, 344: 619-644. 10.1007/s00208-008-0320-6
  6. Casanellas M, Fernandez-Sanchez J: Relevant phylogenetic invariants of evolutionary models. J de Mathématiques Pures et Appliquées. 2011, 96: 207-229.
  7. Kedzierska A, Drton M, Guigó R, Casanellas M: SPIn: model selection for phylogenetic mixtures via linear invariants. Mol Biol Evol. 2012, 29: 929-937. 10.1093/molbev/msr259
  8. Semple C, Steel M: Phylogenetics, Volume 24 of Oxford Lecture Series in Mathematics and its Applications. 2003, Oxford: Oxford University Press.
  9. Allman E, Rhodes J: The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J Comput Biol. 2006, 13: 1101-1113. 10.1089/cmb.2006.13.1101
  10. Allman ES, Petrovic S, Rhodes JA, Sullivant S: Identifiability of two-tree mixtures for group-based models. IEEE/ACM Trans Comput Biol Bioinformatics. 2011, 8: 710-720.
  11. Stefanovic D, Vigoda E: Phylogeny of mixture models: Robustness of maximum likelihood and non-identifiable distributions. J Comput Biol. 2007, 14: 156-189. 10.1089/cmb.2006.0126
  12. Rhodes J, Sullivant S: Identifiability of large phylogenetic mixture models. Bull Math Biol. 2012, 74: 212-231. 10.1007/s11538-011-9672-2
  13. Chai J, Housworth EA: On Rogers’s proof of identifiability for the GTR + Gamma + I model. Syst Biol. 2011, 60 (5): 713-718. 10.1093/sysbio/syr023
  14. Harris J: Algebraic Geometry. A First Course, Volume 133 of Graduate Texts in Mathematics. 1992, New York: Springer-Verlag.
  15. Serre J: Linear Representations of Finite Groups. 1977, [Translated from the second French edition by Leonard L. Scott, Graduate Texts in Mathematics, Vol. 42], New York: Springer-Verlag.
  16. Sumner J, Fernández-Sánchez J, Jarvis P: On Lie Markov models. J Theor Biol. 2012, 298: 16-31.
  17. Allman E, Rhodes J: Phylogenetic invariants for stationary base composition. J Symbolic Comput. 2006, 41 (2): 138-150. 10.1016/j.jsc.2005.04.004
  18. Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism, Volume 3. Edited by: Munro HN. 1969, 21-32. New York: Academic Press.
  19. Kimura M: A simple method for estimating evolutionary rates of base substitution through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16: 111-120. 10.1007/BF01731581
  20. Kimura M: Estimation of evolutionary distances between homologous nucleotide sequences. Proc Nat Acad Sci. 1981, 78: 454-458. 10.1073/pnas.78.1.454
  21. Casanellas M, Sullivant S: The strand symmetric model. Algebraic Statistics for Computational Biology. Edited by: Pachter L, Sturmfels B. 2005, New York: Cambridge University Press.
  22. Chang JT: Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci. 1996, 137: 51-73. 10.1016/S0025-5564(96)00075-2
  23. Allman E, Rhodes J: Phylogenetic ideals and varieties for the general Markov model. Adv Appl Math. 2008, 40: 127-148. 10.1016/j.aam.2006.10.002
  24. Allman E, Rhodes J: Phylogenetic invariants for the general Markov model of sequence mutation. Math Biosci. 2003, 186 (2): 113-144. 10.1016/j.mbs.2003.08.004
  25. Steel M, Hendy M, Penny D: Reconstructing phylogenies from nucleotide pattern probabilities: a survey and some new results. Discrete Appl Math. 1998, 88 (1-3): 367-396. 10.1016/S0166-218X(98)00080-8
  26. Matsen F, Mossel E, Steel M: Mixed-up trees: The structure of phylogenetic mixtures. Bull Math Biol. 2008, 70: 1115-1139. 10.1007/s11538-007-9293-y
  27. Lake J: A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol. 1987, 4: 167-191.
  28. Cavender J, Felsenstein J: Invariants of phylogenies in a simple case with discrete states. J Classif. 1987, 4: 57-71. 10.1007/BF01890075
  29. Allman E, Rhodes J: Quartets and parameter recovery for the general Markov model of sequence mutation. Appl Math Res Express. 2004, 2004 (4): 107-131. 10.1155/S1687120004020283
  30. Sturmfels B, Sullivant S: Toric ideals of phylogenetic invariants. J Comput Biol. 2005, 12: 204-228. 10.1089/cmb.2005.12.204
  31. Allman E, Rhodes J: Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math Biosci. 2008, 211: 18-33. 10.1016/j.mbs.2007.09.001

Copyright

© Casanellas et al; licensee BioMed Central Ltd. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
