Notation
Recall that a rooted binary phylogenetic X-tree $\mathcal{T}$ is a tree with the following properties: There is one vertex ρ with indegree 0 and outdegree 1, which is called the root of $\mathcal{T}$. All edges are directed away from ρ, and all other vertices have indegree 1 and outdegree 0 or 2. Vertices with outdegree 0 are referred to as leaves of $\mathcal{T}$. Remember that in an X-tree the leaves are labelled bijectively by the taxon set X, so there are exactly |X| = n leaves. Thus, when there is no ambiguity, we use the terms leaf and taxon synonymously. Moreover, we often just write “phylogenetic tree” or “tree” when referring to a rooted binary phylogenetic tree.
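Furthermore, recall that a character f is a function $f: X \to \mathcal{C}$ for some set $\mathcal{C} := \{c_1, \dots, c_r\}$ of r character states ($r \in \mathbb{N}$). We denote by $\mathcal{C}^n$ the set of all $r^n$ possible characters on $\mathcal{C}$ and n taxa. For instance, for the four-state DNA alphabet, $\mathcal{C} = \{A, C, G, T\}$ and the set $\mathcal{C}^n$ consists of all $4^n$ possible characters.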
An extension of f to the vertex set $V(\mathcal{T})$ is a map $g: V(\mathcal{T}) \to \mathcal{C}$ such that g(i) = f(i) for all i in X. For such an extension g of f, we denote by $l_g(\mathcal{T})$ the number of edges e = {u, v} in $\mathcal{T}$ on which a substitution occurs, i.e. where g(u) ≠ g(v). The parsimony score of f on $\mathcal{T}$, denoted by $l_f(\mathcal{T})$, is obtained by minimizing $l_g(\mathcal{T})$ over all possible extensions g. Given a tree and a character f on the same taxon set, the parsimony score of f on $\mathcal{T}$ can be calculated efficiently with the Fitch algorithm [7]. Moreover, when a character state changes along an edge of the tree, we refer to this state change as a substitution or mutation. For our purposes only so-called manifest mutations are relevant, i.e. those mutations that can be observed and are not reversed. We therefore do not distinguish between mutations and substitutions and use these terms synonymously.
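The bottom-up pass of the Fitch algorithm can be sketched as follows. This is a minimal illustration, not the implementation of [7]: the nested-tuple tree encoding, the function name `fitch_score` and the example character are assumptions made here. The root edge is omitted because, without a constraint on the root state, it never requires a substitution.

```python
def fitch_score(tree, character):
    """Return the parsimony score of `character` on `tree`.

    `tree` is a nested tuple of leaf names, e.g. (("1", "2"), "3");
    `character` maps each leaf name to its state (hypothetical encoding).
    """
    def visit(node):
        # A leaf contributes its observed state and no substitutions.
        if not isinstance(node, tuple):
            return {character[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # The children can agree on a state: no extra substitution.
            return common, left_cost + right_cost
        # The children disagree: take the union and count one substitution.
        return left_set | right_set, left_cost + right_cost + 1

    _, score = visit(tree)
    return score


example_tree = ((("1", "2"), "3"), ("4", "5"))
example_character = {"1": "A", "2": "A", "3": "C", "4": "C", "5": "G"}
print(fitch_score(example_tree, example_character))
```

For the example character the two cherries and the taxon 3 force one state-set union each on the left and on the right, so two substitutions are counted in total.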
Construction of the OSM matrix
We now introduce the OSM framework in a stepwise fashion. The aim of the OSM approach is to determine the effects a single mutation occurring on a rooted tree has on a character evolving on that tree.
The first task of this approach is to formalize the term mutation and its effects on a single character state in $\mathcal{C}$. A mutation is an operation $\sigma: \mathcal{C} \to \mathcal{C}$ which is bijective, i.e. it satisfies the following condition:
C1. For all $c_i \in \mathcal{C}$ there is a $c_j \in \mathcal{C}$ such that $\sigma(c_i) = c_j$, and if $\sigma(c_i) = \sigma(c_j)$, then $c_i = c_j$.
This guarantees that a mutation affects a character state in a unique fashion. It is well-known that any bijective function on a finite discrete state set is a permutation (e.g., [13]). Thus, a mutation is a specific instance of a permutation applied to a character.
The next step is to select the set Σ of admissible permutations acting on $\mathcal{C}$. It is mathematically convenient to select Σ such that it forms an Abelian group [9] with a regular (transitive and free) action on $\mathcal{C}$. Hence, Σ satisfies the following conditions:
C2. For every pair $c_i, c_j \in \mathcal{C}$ there is exactly one permutation $\sigma \in \Sigma$ such that $\sigma(c_i) = c_j$, i.e., the action of Σ on $\mathcal{C}$ is regular.
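C3. For all $\sigma_1, \sigma_2 \in \Sigma$ also the product $\sigma_1 \circ \sigma_2 \in \Sigma$. Mathematically speaking, Σ is closed with respect to concatenation of its permutations.
C4. For all $\sigma_1, \sigma_2 \in \Sigma$ we have $\sigma_1 \circ \sigma_2 = \sigma_2 \circ \sigma_1$. Thus, Σ is commutative, and hence the order in which we assign permutations is irrelevant for the outcome.
C5. There is an element $\sigma_0 \in \Sigma$ such that for all $\sigma_1 \in \Sigma$ we have $\sigma_0 \circ \sigma_1 = \sigma_1 \circ \sigma_0 = \sigma_1$, i.e. there exists a so-called neutral element, namely the identity, in Σ. Moreover, $\sigma_0$ is the only element of Σ with $\sigma_0(c_i) = c_i$ for some $c_i \in \mathcal{C}$, i.e. $\sigma_i$ is fixed point free for all $\sigma_i \neq \sigma_0$.
C6. For every $\sigma_1 \in \Sigma$ there exists a $\sigma_2 \in \Sigma$ such that $\sigma_1 \circ \sigma_2 = \sigma_0$. Mathematically speaking, for every element of Σ there exists an inverse element. This guarantees that every permutation can be reversed within a single step.
C7. For all $\sigma_1, \sigma_2, \sigma_3 \in \Sigma$ we have $\sigma_1 \circ (\sigma_2 \circ \sigma_3) = (\sigma_1 \circ \sigma_2) \circ \sigma_3 = \sigma_1 \circ \sigma_2 \circ \sigma_3$, i.e. the associative law holds.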
It should be noted that the concatenation of permutations is always associative, i.e. any set of permutations satisfies C7. Thus, for a set of permutations Σ to be an Abelian group with a regular action on $\mathcal{C}$, it only needs to satisfy C1−C6.
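Conditions C1−C6 can be checked mechanically for a small candidate set. The following sketch assumes a hypothetical encoding in which a permutation is a tuple `p` with `p[i]` the index of the image of state i; the function names are introduced here for illustration only.

```python
from itertools import product


def compose(p, q):
    """Return the concatenation of p and q (apply q first, then p)."""
    return tuple(p[q[i]] for i in range(len(q)))


def is_regular_abelian(perms):
    """Check C1-C6 for permutations given as index tuples."""
    perms = set(perms)
    r = len(next(iter(perms)))
    identity = tuple(range(r))
    pairs = list(product(perms, repeat=2))
    checks = [
        # C1: every element is a bijection on the r states.
        all(sorted(p) == list(range(r)) for p in perms),
        # C2: exactly one permutation maps state i to state j.
        all(sum(p[i] == j for p in perms) == 1
            for i in range(r) for j in range(r)),
        # C3: closure under concatenation.
        all(compose(p, q) in perms for p, q in pairs),
        # C4: commutativity.
        all(compose(p, q) == compose(q, p) for p, q in pairs),
        # C5: the identity is contained in the set.
        identity in perms,
        # C6: every element has an inverse in the set.
        all(any(compose(p, q) == identity for q in perms) for p in perms),
    ]
    return all(checks)


# Four permutations on four states (indices 0..3), each an involution.
candidate = [(0, 1, 2, 3), (2, 3, 0, 1), (1, 0, 3, 2), (3, 2, 1, 0)]
print(is_regular_abelian(candidate))
```

The fixed-point-freeness stated in C5 does not need a separate test here, since it follows from the uniqueness required in C2.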
In the following, we consider the matrix representation of permutations. A permutation matrix σ over $\mathcal{C}$ is an r×r matrix such that $\sigma_{ij} = 1$ if $\sigma(c_i) = c_j$, and $\sigma_{ij} = 0$ otherwise. We consider it equivalent to discuss a permutation or its corresponding matrix. Therefore, concatenation “∘” is equivalent to the matrix multiplication “·”. We use σ to denote a permutation or a permutation matrix, depending on the context.
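Example 1. In genetics, the most commonly used character state set is $\mathcal{C} = \{A, C, G, T\}$. Up to isomorphism, there are two different Abelian groups for four states, namely the Klein-Four-group $\mathbb{Z}_2 \times \mathbb{Z}_2$ and the cyclic group $\mathbb{Z}_4$. The Klein-Four-group is constructed from the cyclic group $\mathbb{Z}_2$ over two elements, the identity $\tau_0$ and the flip $\tau_1$. These take the matrix form
$$\tau_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \tau_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
The Klein-Four-group consists of the four Kronecker products of these two matrices, i.e. $s_0 = \tau_0 \otimes \tau_0$, $s_1 = \tau_1 \otimes \tau_0$, $s_2 = \tau_0 \otimes \tau_1$, and $s_3 = \tau_1 \otimes \tau_1$. The Kronecker products here yield 4×4 matrices, e.g., with rows and columns ordered A, C, G, T,
$$s_1 = \tau_1 \otimes \tau_0 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}.$$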
The set $\Sigma_{K3ST} := \{s_0, s_1, s_2, s_3\}$ coincides with the substitution matrices under the Kimura 3ST model [6]. In particular, $s_1$ describes transitions within purines (A, G) and pyrimidines (C, T), $s_2$ represents transversions within the pairs (A, C) and (G, T), and $s_3$ represents the remaining set of transversions within the pairs (A, T) and (C, G).
The second Abelian group over four states, the cyclic group, is formed by selecting a 4-cycle, e.g., A→G→T→C→A, and concatenating this cycle with itself. The resulting set of permutations contains the following elements: the identity, the 4-cycle itself, its square, which swaps A with T and G with C, and its inverse A→C→T→G→A.
Note that there are actually six different four-cycles for $\mathcal{C}$. Since every cyclic group contains a four-cycle together with its inverse, these result in three distinguishable Abelian groups. Bryant [14] generates his cyclic group with the four-cycle A→C→G→T→A, and shows that the resulting set $\Sigma_{K2ST}$ underlies the Kimura 2ST model [15], where the square of the four-cycle corresponds to the transition within purines and pyrimidines, and the four-cycle and its inverse are the (not further distinguished) transversions.
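Both constructions of Example 1 can be reproduced in a few lines. The sketch below is illustrative only: it assumes the state order A, C, G, T for rows and columns, and the helper names `kron`, `as_mapping` and `cyclic_group` are introduced here, not taken from the cited works.

```python
from itertools import permutations

STATES = "ACGT"  # assumed row/column order of the permutation matrices


def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def as_mapping(matrix):
    """Read a permutation matrix as a state -> state dictionary."""
    return {STATES[i]: STATES[row.index(1)] for i, row in enumerate(matrix)}


tau0 = [[1, 0], [0, 1]]
tau1 = [[0, 1], [1, 0]]
klein = {
    "s0": as_mapping(kron(tau0, tau0)),
    "s1": as_mapping(kron(tau1, tau0)),
    "s2": as_mapping(kron(tau0, tau1)),
    "s3": as_mapping(kron(tau1, tau1)),
}


def cyclic_group(cycle):
    """Return the powers of a cycle such as "ACGT" (A->C->G->T->A)."""
    n = len(cycle)
    generator = {cycle[i]: cycle[(i + 1) % n] for i in range(n)}
    identity = {s: s for s in cycle}
    group, current = [identity], generator
    while current != identity:
        group.append(current)
        # Concatenate the current power with the generator once more.
        current = {s: generator[current[s]] for s in cycle}
    return group


bryant = cyclic_group("ACGT")
# All six four-cycles, each written with A as its first state.
distinct_groups = {
    frozenset(tuple(sorted(p.items()))
              for p in cyclic_group("A" + "".join(rest)))
    for rest in permutations("CGT")
}
print(klein["s1"], bryant[2], len(distinct_groups))
```

Under these assumptions the square of Bryant's four-cycle coincides with the transition $s_1$ of the Klein-Four-group, and the six four-cycles collapse into three groups.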
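The next step in constructing the OSM matrix is to construct a set of operations over $\mathcal{C}^n$ governed by $\mathcal{T}$, and based on the permutation set Σ. To this end, we first define $\Sigma^n$ as a set of operations which work elementwise, i.e. for $f = f_1 \dots f_n \in \mathcal{C}^n$ and $\sigma = (\sigma^{(1)}, \dots, \sigma^{(n)}) \in \Sigma^n$ we have
$$\sigma(f) = \sigma^{(1)}(f_1) \dots \sigma^{(n)}(f_n).$$
This can also be described by the Kronecker product, i.e. equally
$$\sigma(f) = \left(\sigma^{(1)} \otimes \cdots \otimes \sigma^{(n)}\right)(f). \tag{1}$$
This means that there are $r^n$ different operators in $\Sigma^n = \Sigma \otimes \cdots \otimes \Sigma$.
Remark 1. Therefore, for any pair of characters $f, g \in \mathcal{C}^n$ we can find an operation $\sigma \in \Sigma^n$ such that $\sigma(f) = g$.
Another noteworthy consequence of using the Kronecker product is that the elements of $\Sigma^n$ are permutations over $\mathcal{C}^n$ [16, 17], and in fact $\Sigma^n$ satisfies our Conditions C1−C7, i.e. $\Sigma^n$ is an Abelian group over $\mathcal{C}^n$.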
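Remark 1 can be made concrete for the K3ST group of Example 1. The sketch below is an illustration under assumptions: characters are encoded as strings over A, C, G, T, and the names `K3ST`, `find_operation` and `apply_operation` are hypothetical helpers introduced here.

```python
STATES = "ACGT"
# The four K3ST permutations of Example 1 as state -> state tables.
K3ST = {
    "s0": dict(zip(STATES, "ACGT")),  # identity
    "s1": dict(zip(STATES, "GTAC")),  # transitions A<->G, C<->T
    "s2": dict(zip(STATES, "CATG")),  # transversions A<->C, G<->T
    "s3": dict(zip(STATES, "TGCA")),  # transversions A<->T, C<->G
}


def find_operation(f, g):
    """Return the elementwise operation (one permutation name per
    position) that maps character f to character g."""
    # By regularity (C2) exactly one permutation maps x to y.
    return tuple(
        next(name for name, perm in K3ST.items() if perm[x] == y)
        for x, y in zip(f, g)
    )


def apply_operation(operation, f):
    """Apply an elementwise operation to the character f."""
    return "".join(K3ST[name][x] for name, x in zip(operation, f))


operation = find_operation("ACGT", "GGGG")
print(operation, apply_operation(operation, "ACGT"))
```

Because the operation acts position by position, the same tuple of permutations can afterwards be applied to any other character of the same length.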
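In the OSM framework we assume that the permutations acting on a character are derived from the underlying rooted tree $\mathcal{T}$. If permutation $\sigma_i \in \Sigma$ acts on the pendant edge leading to taxon j ∈ X, then the associated permutation matrix $\sigma_{j,i}$ acting on $\mathcal{C}^n$ has the form
$$\sigma_{j,i} = \underbrace{\sigma_0 \otimes \cdots \otimes \sigma_0}_{j-1} \otimes\, \sigma_i \otimes \underbrace{\sigma_0 \otimes \cdots \otimes \sigma_0}_{n-j}.$$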
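If a permutation acts on an interior edge e, then it simultaneously acts on the states of all descendant taxa of e, i.e. all those taxa whose path to the root passes e. For example, assume taxa 1 and 2 form a cherry, i.e. their most recent common ancestor, 12, has no other descendants, and permutation $\sigma_i \in \Sigma$, i = 1,…,r−1, is acting on the edge leading to this ancestor. Then, we get the permutation
$$\sigma_{12,i} = \sigma_i \otimes \sigma_i \otimes \sigma_0 \otimes \cdots \otimes \sigma_0 = (\sigma_i \otimes \sigma_0 \otimes \cdots \otimes \sigma_0) \cdot (\sigma_0 \otimes \sigma_i \otimes \sigma_0 \otimes \cdots \otimes \sigma_0) = \sigma_{1,i} \cdot \sigma_{2,i}. \tag{2}$$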
This shows in particular that a Kronecker product of some permutations acting on each character state is equivalent to the matrix product of the permutations acting on the entire character. The right-hand equality shows that a single permutation on an internal edge has the same effect as simultaneously applying the same permutation on the pendant edges of all descendant taxa. In other words, if de(e) denotes the set of descendant taxa of edge e, and $\sigma_i \in \Sigma$, then
$$\sigma_{e,i} = \prod_{j \in de(e)} \sigma_{j,i}. \tag{3}$$
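The identity between a permutation on an interior edge and the product of the corresponding pendant-edge permutations can be verified numerically. The following sketch assumes three taxa, the K3ST transition $s_1$ with states ordered A, C, G, T, and plain nested lists as matrices; `kron`, `matmul` and `pendant` are helper names introduced here.

```python
from functools import reduce


def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matmul(a, b):
    """Ordinary matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


I4 = [[int(i == j) for j in range(4)] for i in range(4)]  # sigma_0
S1 = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]  # transition


def pendant(perm, j, n):
    """Matrix of `perm` acting on the pendant edge of taxon j (1-based)
    among n taxa: `perm` at position j, identity elsewhere."""
    return reduce(kron, [perm if k == j else I4 for k in range(1, n + 1)])


n = 3
# Transition on the edge above the cherry formed by taxa 1 and 2.
interior = reduce(kron, [S1, S1, I4])
product_of_pendants = matmul(pendant(S1, 1, n), pendant(S1, 2, n))
print(interior == product_of_pendants)
```

The comparison relies on the mixed-product property of the Kronecker product; the third taxon is left untouched by both sides.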
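Note that the set $\Sigma_X$ of all permutations acting on the pendant edges is a generator of $\Sigma^n$, i.e. the closure of $\Sigma_X$ contains all permutations in $\Sigma^n$. Since $\Sigma^n$ contains a single permutation to transform character f into g, and since $\Sigma_X$ generates $\Sigma^n$, there is a shortest chain of permutations in $\Sigma_X$ which transforms f into g. $\Sigma_X$ is also the set of permutations implied by the star tree for X. In general, the set of all permutations on tree $\mathcal{T}$ is
$$\Sigma_{\mathcal{T}} := \{\sigma_{e,i} : e \in E(\mathcal{T}),\ i = 1, \dots, r-1\},$$
where $E(\mathcal{T})$ denotes the edge set of $\mathcal{T}$ and r is the number of character states, which equals the number of elements of Σ.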
For every X-tree $\mathcal{T}$ we have $\Sigma_X \subseteq \Sigma_{\mathcal{T}}$, and therefore $\Sigma_{\mathcal{T}}$ is a generator for $\Sigma^n$, too. An illustration of such a generator set over the character set $\mathcal{C}^n$ is the so-called Cayley graph [18], which has as vertices the characters of $\mathcal{C}^n$, and two characters f and g are connected if there is a permutation $\sigma \in \Sigma_{\mathcal{T}}$ such that σ(f) = g. In [5] Cayley graphs have been presented as alternative illustrations of the tree $\mathcal{T}$ over a binary state set.
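Example 2. Regard the K3ST model from Example 1 and the rooted two-taxon tree $\mathcal{T}$ depicted in Figure 1a. With this, $\Sigma_{\mathcal{T}}$ is given by the set
$$\Sigma_{\mathcal{T}} = \{s_{e,i} : e \in \{e_1, e_2, e_{12}\},\ i \in \{1, 2, 3\}\}.$$
Each permutation which acts on the characters is thus a symmetric 16×16 permutation matrix depicting a transition ($s_{e,1}$), transversion 1 ($s_{e,2}$), or transversion 2 ($s_{e,3}$) along edge $e \in \{e_1, e_2, e_{12}\}$. Figures 1b-d display the permutation matrices for a transition on branch $e_1$ ($s_{e_1,1}$), $e_2$ ($s_{e_2,1}$) and $e_{12}$ ($s_{e_{12},1}$), respectively. Figure 1e shows the Cayley graph associated with $\Sigma_{\mathcal{T}}$.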
We are now in a position to recall the definition of the OSM matrix for a rooted binary phylogenetic tree $\mathcal{T}$ as explained in [5] and [19]. For an edge $e \in E(\mathcal{T})$ we denote by $p_e$ the relative branch length of e, i.e. its actual branch length (expected number of substitutions per site) divided by the length of $\mathcal{T}$ (the sum of all branch lengths). Thus, one can view $p_e$ as the probability that a mutation is observed at edge e assuming that a single mutation occurred on $\mathcal{T}$. Clearly, $\sum_{e \in E(\mathcal{T})} p_e = 1$. Further, denote by $\alpha_{e,i}$ the probability that this mutation on e is of type i ∈ {1,…,r−1}, with $\sum_{i=1}^{r-1} \alpha_{e,i} = 1$ for all $e \in E(\mathcal{T})$. Then the OSM matrix $M_{\mathcal{T}}$ is the convex sum of the elements in $\Sigma_{\mathcal{T}}$, where each permutation $\sigma_{e,i}$ is multiplied by $\alpha_{e,i}\, p_e$, the probability of hitting the edge e with permutation $\sigma_i \in \Sigma$. Thus, we obtain:
$$M_{\mathcal{T}} = \sum_{e \in E(\mathcal{T})} p_e \sum_{i=1}^{r-1} \alpha_{e,i}\, \sigma_{e,i}. \tag{4}$$
$M_{\mathcal{T}}$ can be regarded as the weighted exchangeability matrix for all characters under the K3ST model assuming that a single substitution occurs on the tree $\mathcal{T}$. Figure 1f depicts the OSM matrix for the tree in Figure 1a. Here, colors indicate relative branch lengths $p_e$, and patterns denote permutation types $\alpha_i$. E.g., a blue square with horizontal lines indicates the product $\alpha_{e_2,1}\, p_{e_2}$, i.e. the probability of observing a transition $s_1$ on edge $e_2$.
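As an illustration of the convex sum, the OSM matrix of a rooted two-taxon tree under K3ST can be assembled entry by entry. The relative branch lengths and the uniform type probabilities below are made-up values chosen for this sketch, and the dictionary-of-dictionaries layout is an assumption made here.

```python
from fractions import Fraction
from itertools import product

STATES = "ACGT"
K3ST = {
    1: dict(zip(STATES, "GTAC")),  # transition s1
    2: dict(zip(STATES, "CATG")),  # transversion s2
    3: dict(zip(STATES, "TGCA")),  # transversion s3
}
# Positions of the descendant taxa of each edge of the two-taxon tree.
DESCENDANTS = {"e1": (0,), "e2": (1,), "e12": (0, 1)}
# Hypothetical relative branch lengths p_e (they sum to 1).
p = {"e1": Fraction(1, 4), "e2": Fraction(1, 4), "e12": Fraction(1, 2)}
# Hypothetical type probabilities alpha_{e,i}, uniform on every edge.
alpha = {(e, i): Fraction(1, 3) for e in DESCENDANTS for i in K3ST}


def apply_edge(e, i, f):
    """Apply permutation type i on edge e to the character f."""
    return "".join(K3ST[i][x] if pos in DESCENDANTS[e] else x
                   for pos, x in enumerate(f))


characters = ["".join(c) for c in product(STATES, repeat=2)]
M = {f: {g: Fraction(0) for g in characters} for f in characters}
for f in characters:
    for (e, i), a in alpha.items():
        # sigma_{e,i} contributes weight alpha_{e,i} * p_e to the entry
        # (f, sigma_{e,i}(f)).
        M[f][apply_edge(e, i, f)] += a * p[e]

print(M["AA"]["GG"], M["AA"]["GA"], sum(M["AA"].values()))
```

With exact fractions the rows sum to one, and the matrix is symmetric because every K3ST permutation is its own inverse.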
The transformation problem
With the construction of $M_{\mathcal{T}}$ we have generated the tools needed to formally describe the computations in Step 4 of the Misfits algorithm [4]. Given a rooted tree $\mathcal{T}$ and two characters f and $f^d$ in $\mathcal{C}^n$, we want to compute the minimal number of substitutions required on the tree to convert f into $f^d$. An efficient procedure to compute this minimal number of substitutions was presented in [4].
Algorithm 1
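INPUT: rooted binary phylogenetic tree $\mathcal{T}$ on leaf set X, characters f and $f^d$ on X, Abelian group Σ.
STEP 1: Using Remark 1, find the substitution type $\sigma_i$ which translates $f_j$ into $f^d_j$ for all positions j = 1,…,|X|. Let $\sigma \in \Sigma^n$ be the resulting operation, i.e. $\sigma(f) = f^d$.
STEP 2: Let $c := c_1 \dots c_1$ be a constant character on X with $c_1 \in \mathcal{C}$. Let h := σ(c).
STEP 3: Calculate m, the parsimony score of h on $\mathcal{T}$ with the state at the root fixed to $c_1$ (as in Example 3).
OUTPUT: m.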
We prove the correctness of our algorithm. In our framework, m corresponds to the minimum number of permutations $\sigma_1, \dots, \sigma_m \in \Sigma_{\mathcal{T}}$ such that $\sigma_1 \circ \cdots \circ \sigma_m (f) = f^d$. In this form, m has multiple equivalent interpretations. It is the length of the shortest path between f and $f^d$ in the Cayley graph for $\Sigma_{\mathcal{T}}$, where this path corresponds to $\sigma_1 \circ \cdots \circ \sigma_m$. Further, m corresponds to the minimum power k of $M_{\mathcal{T}}$ such that $(M_{\mathcal{T}}^j)_{f,f^d} = 0$ for j < k and $(M_{\mathcal{T}}^k)_{f,f^d} > 0$, because a positive entry in $M_{\mathcal{T}}^k$ means that there is a concatenation of k permutations connecting the associated characters.
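The shortest-path interpretation can be explored with a breadth-first search on the Cayley graph. The sketch below assumes the rooted two-taxon tree and the K3ST group of Example 2; the encoding of characters as two-letter strings and the function names are illustrative choices, not part of the original framework.

```python
from collections import deque

STATES = "ACGT"
K3ST = [dict(zip(STATES, "GTAC")),   # s1
        dict(zip(STATES, "CATG")),   # s2
        dict(zip(STATES, "TGCA"))]   # s3
# Positions of the descendant taxa of each edge of the two-taxon tree.
DESCENDANTS = {"e1": (0,), "e2": (1,), "e12": (0, 1)}


def neighbours(f):
    """Characters reachable from f by one permutation on one edge."""
    for taxa in DESCENDANTS.values():
        for perm in K3ST:
            yield "".join(perm[x] if pos in taxa else x
                          for pos, x in enumerate(f))


def cayley_distance(f, target):
    """Length of a shortest path from f to target in the Cayley graph."""
    distance = {f: 0}
    queue = deque([f])
    while queue:
        current = queue.popleft()
        if current == target:
            return distance[current]
        for nxt in neighbours(current):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return None  # unreachable; cannot happen for a generating set


print(cayley_distance("AA", "GG"), cayley_distance("AA", "GC"))
```

Here AA reaches GG with a single transition on the edge $e_{12}$, whereas GC requires two steps, since one permutation on $e_{12}$ always changes both taxa in the same way.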
Example 3. Figure 2 demonstrates how Algorithm 1 works under the K3ST model, i.e. when the group is $\Sigma = \Sigma_{K3ST}$ (Figure 2a). Consider the rooted five-taxon tree in Figure 2b and the character GTAGA at the leaves. Assume that the character GTAGA is to be converted into character ACCTC. By comparing the two characters position-wise, we need a substitution $s_1$ on the external branch leading to taxon 1 to convert G into A at the first position. Similarly, we need a substitution $s_1$ on the external branch leading to taxon 2, and a substitution $s_2$ on every external branch leading to taxa 3, 4, and 5. Thus, the operation $s := (s_1, s_1, s_2, s_2, s_2)$ transforms the character GTAGA into the character ACCTC. As the operation s also translates the constant character AAAAA into GGCCC, converting GTAGA into ACCTC is equivalent to evolving the character state A at the root along the tree to obtain the character GGCCC at the leaves. The Fitch algorithm [7] applied to the character GGCCC with the constraint that the character state at the root is A produces a unique most parsimonious solution of two substitutions as depicted by Figure 2c.
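The three steps of the algorithm can be traced in code for this example. The sketch rests on several assumptions: the tree of Figure 2b is not reproduced here, so a hypothetical topology ((1,2),((3,4),5)) with 0-based leaf positions is used, and the root-constrained score is computed with a small Sankoff-style dynamic program rather than with the Fitch algorithm itself. All function names are introduced for illustration.

```python
STATES = "ACGT"
K3ST = [dict(zip(STATES, "ACGT")),   # s0, identity
        dict(zip(STATES, "GTAC")),   # s1
        dict(zip(STATES, "CATG")),   # s2
        dict(zip(STATES, "TGCA"))]   # s3


def constrained_score(tree, h, root_state):
    """Minimum number of substitutions for character h on `tree`
    (nested tuples of 0-based leaf positions) with a fixed root state."""
    inf = float("inf")

    def costs(node):
        # Leaf: only the observed state is admissible.
        if not isinstance(node, tuple):
            return {s: (0 if s == h[node] else inf) for s in STATES}
        child_costs = [costs(child) for child in node]
        # Interior vertex: best child assignment for each own state.
        return {s: sum(min(c[t] + (t != s) for t in STATES)
                       for c in child_costs)
                for s in STATES}

    top = costs(tree)
    # The root edge costs one substitution if the state changes on it.
    return min(top[s] + (s != root_state) for s in STATES)


def minimal_substitutions(tree, f, fd, c1="A"):
    # STEP 1: the unique permutation per position translating f into fd.
    sigma = [next(p for p in K3ST if p[x] == y) for x, y in zip(f, fd)]
    # STEP 2: apply the same operation to the constant character c1...c1.
    h = "".join(perm[c1] for perm in sigma)
    # STEP 3: parsimony score of h with the root fixed to c1.
    return constrained_score(tree, h, c1)


hypothetical_tree = ((0, 1), ((2, 3), 4))
print(minimal_substitutions(hypothetical_tree, "GTAGA", "ACCTC"))
```

On this assumed topology the intermediate character is GGCCC and the result is two substitutions, in line with the count stated in the example; a different topology for Figure 2b could in principle give a different value.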