Definition of phylogenetic networks
We follow the definition of the phylogenetic networks as given in [[14], Definition 4, page 16]. For all other graph-theoretical definitions that are not given here, we follow [18]. A rooted phylogenetic network, simply called here a phylogenetic network, is defined in [19] as a rooted, directed acyclic graph (DAG), whose root has indegree 0 and the leaves have outdegree 0. The vertices whose indegree is greater than 1 are called reticulate vertices and the edges with reticulate vertices as head vertices are called reticulate edges. All other edges are termed tree edges. The definition given in [14] takes care of the so-called "time-consistency" restraint, namely, that the tree edges take place in a positive time and the reticulate vertices have parents that can only "coexist in time". Hence, forward, we recall the formal definition of phylogenetic networks as given in [14].
Given any directed graph, we say two vertices u and v cannot coexist in time if there exists a sequence P = (p1, p2, ..., p
k
) of paths in N such that:
-
1.
p
i
is a directed path that contains at least one tree edge, for every 1 ≤ i ≤ k,
-
2.
u is the tail of p 1, and v is the head of p
k
, and
-
3.
for every 1 ≤ i ≤ k - 1, there exists a network vertex whose two parents are the head p
i
and the tail of p i+1.
A phylogenetic network N is a rooted DAG obeying the following constraints:
-
1.
Every vertex has indegree and outdegree defined by one of the four combination (0, 2), (1, 0), (1, 2), or (2, 1) - corresponding to, respectively, root, leaves, internal tree vertices, and reticulate vertices. All vertices other than reticulate vertices are called tree vertices.
-
2.
If two vertices u and v cannot coexist in time, then there does not exist a network vertex w with edges (u, w) and (v, w).
-
3.
Given any edge of the network, at least one of its endpoints must be a tree vertex.
Another component of this definition is that for any edge in the phylogenetic network, at least one of its endpoints (either the head or tail) is a tree vertex. Here, we will use this definition. Wherever possible, we point out whether the conditions of the definition are necessary.
Phylogenetic networks can naїvely be thought of as a network that contain as subgraphs, the trees that explain the evolutionary histories of different segments of the input terminal sequences. Given a phylogenetic network, deleting one of each edge incident to a reticulate vertex does not guarantee a resulting phylogenetic tree with the same set of leaves as that of the network. This is an undesirable property, especially if the parsimony criterion is defined by finding a phylogenetic tree inside the network that is most parsimonious for the given site, as defined in [8, 11–13]. In order to avoid this problem, it is necessary to assume that no internal vertex has two reticulate children. We call this class of phylogenetic networks as a phylogenetic network with no sister reticulations. See Figure 1 for some examples of phylogenetic networks.
Before we proceed to the definition of the parsimony problems, the following is a useful observation. For a phylogenetic network N with no sister reticulations, and having r reticulate vertices and with leaf set X, we denote as the set of all trees contained in N. Each such tree is obtained by following two steps: (1) for each reticulate vertex, remove one of the incoming edges, and then (2) for every vertex v of indegree and outdegree 1, whose parent is u and child is w, contract the edges (u, v) and (v, w) into a single edge (u, w). The condition that each edge in N has a tree vertex as an endpoint and that each tree vertex has at least one tree vertex as a child, ensures that the set of leaves of the resulting tree is the same as that of the network. Hence the set contains exactly 2rphylogenetic trees whose leaf set is exactly X.
Maximum Parsimony
We refer the readers to [2, 3] for a general description of the idea of parsimony and to the discussion of various parsimony algorithms. It has been pointed out in [9] that the parsimony method for trees can be extended to phylogenetic networks. In a series of papers [8, 11, 12], one such parsimony criterion is defined by finding a tree in the network that has the best parsimonious score, and efficient algorithms to optimize this criterion on a given phylogenetic network have been devised. Although these algorithms are shown to perform well in practice, they can perform correctly only for phylogenetic networks with no sister reticulations, since it is straightforward to search for an optimal tree in these restricted class of networks. In this section, we state an alternate version of the parsimony problem and in the following sections provide some heuristic solutions for optimizing the score on any phylogenetic network.
Let [n] = {1, 2, ..., n} denote the set of leaf labels of a given phylogenetic network N. A function λ: [n] → {0, 1, ...,|Σ| - 1} is called a state assignment function over the alphabet Σ (a non-empty set) for N. We say that a function is an extension of λ on N if it agrees with λ on the leaves of N. For a vertex v in N, we call the as an assignment of on v. A fully assigned network is a network in which all the vertices have labels from {0, 1, ..., |Σ| - 1}. Let C be a cost matrix whose ijth entry c
ij
is the cost of transforming from state i to state j along any edge in N. If e = (u, v) is an edge in N, where u is the parent of v, we denote , where and . For a graph G, we let E(G) denote the edge set of G. Then the parsimony problem is defined as follows.
Input: A phylogenetic network N with leaf labels [n] and a state assignment function λ over the alphabet Σ for N.
Parsimony criterion: For an extension of λ, let
and
Output: Given P ∈ {P1, P2}, find that minimizes .
We note that is introduced in [8] and is the definition we will use in this paper. A more general approach is to minimize , where d
e
is a non-negative weight function on the edges of N. For the purposes of this paper, we restrict ourselves to P = P2, although the first of our approaches, the dynamic programming solution also holds for P = Q.
Parsimony algorithms on networks
Traversing a phylogenetic network
In a network, vertex traversal refers to the process of visiting each vertex, exactly once, in a systematic way. Such traversals are classified by the order in which the vertices are visited. We need two types of network vertex traversals to describe our algorithms. These are well-known for phylogenetic trees, and we present them here for phylogenetic networks. The algorithms for the traversals given below start from any given vertex v in the network. In this paper, we will always perform the traversals from the root vertex of the network.
Pre-order traversal of a phylogenetic network from a vertex v
-
1.
Visit the vertex v.
-
2.
Recursively perform pre-order traversal from each child that has not yet been visited.
Post-order traversal of a phylogenetic network from a vertex v
-
1.
Recursively perform post-order traversal from each child that has not yet been visited.
-
2.
Visit the vertex v.
Since a phylogenetic network is a DAG, such traversals will visit all the vertices of the network exactly once. (Refer to [18] for more details on existence on such traversals on DAGs). For the purposes of this paper, we assume that the vertices of a network are uniquely labelled by integers. Note that the leaves are already labelled from the set [n]; and so we use other integers for other vertices. Whenever the child vertices of v are extracted, they are also arranged in increasing order of their integer labelings and the pre- and post-order traversals are performed in this order. This will ensure the following: if vertices v and v' are such that there is no directed path between them, then the vertex v is traversed prior to vertex v' in the pre-order if and only if the vertex v is traversed prior to the vertex v' in the post-order. See the Figure 1 for some examples. With this property, we notice that the pre- and post-order traversals from the root of a phylogenetic network each trace the same spanning tree, which we call here the traversal tree.
Dynamic Programming solution
Dynamic programming is used to provide efficient solutions for finding the exact parsimony score when the network is a phylogenetic tree [15, 16]. In this section, we show that the same approach can be generalized to phylogenetic networks. Sankoff's algorithm on a tree traverses the vertices of the tree via post-order while computing the minimum costs of each state at each vertex from the leaves to the root, and then chooses the best assignments on each vertex by backtracking from the root to the leaves by traversing the tree vertices via pre-order. Both the phases are presented for networks in Algorithms 1 and 2 respectively. We describe them briefly below. It can be noted that if the network is a tree, then our algorithms match with the pre-order and post-order phases of Sankoff's method for trees.
Given a phylogenetic network N, with leaf vertices labeled [n] and with state assignment function λ over the alphabet Σ, assign to each vertex v ∈ V a quantity S
v
(i) for each i ∈ Σ. In phylogenetic trees, S
v
(i) denotes the minimum sum of costs of all the events from the vertex v to all the leaves that are reachable from v, given that v is assigned state i and all the descendant vertices from v are each assigned a state. In networks, there is no simple way to compute such a quantity. Instead, we allow S
v
(i) to be a lower-bound of the above exact score and it is calculated during the post-order traversal phase.
Post-order traversal phase: If v is a leaf of N, then S
v
(i) is assigned 0 if the observed state is state i, and infinite otherwise. Now all we need is a recursion relationship to calculate S
v
(i) for rest of the vertices. For each child w of v, we say w satisfies the post-order traversal condition with respect to v, or simply traversal condition with respect to v in view of the observation in the beginning of this section, if the following hold:
-
(i)
The vertex w is a reticulate vertex and
-
(ii)
if v' is the parent of w other than v, then the vertex v must be traversed prior to v' in the post-order traversal of N.
We now define recursively for each edge (v, w),
For a phylogenetic tree, s(v, w)(i) always assumes the first of these quantities, and it thus gives the sum of the substitution costs along the edges of the tree that lie below the vertex v, provided the vertex v is assigned the state i. For phylogenetic networks, in order to account for the substitution costs along the edges that lie below a reticulate vertex w just a single time when vertex v is assigned the state i, we let the 'parent' v of w in the traversal tree account for all the substitution costs along all the edges that lie below v. On the other hand, if v is not a parent of w in the traversal tree, s(v, w)(i) simply denotes the substitution cost from state i at vertex v to another state at w that is least expensive.
We then define
(1)
where the sum runs for all child(ren) vertex(s) w of v. As mentioned before, in phylogenetic trees, S
v
(i) denotes the minimum possible sum of substitution costs along all the edges from the vertex v to all the leaves that are reachable from v, given that v is assigned state i and all the vertices reachable from v are each assigned a state.
In phylogenetic networks, while calculating s(v, w)(i) where w is a reticulate vertex such that (v, w) is not an edge in the traversal tree, there is no prior knowledge of the state that will be later assigned at the reticulate vertex w. Thus s(v, w)(i) can only be a lower bound of the edges of the network that lie below the vertex v, if the vertex v is assigned the state i. The reasoning for this is that s(v, w)(i) is the substitution cost from state i at vertex v to another state at w that is least expensive, instead of the substitution cost from state i at v to the state at w that will be later assigned. Since the definition of S
v
(i) depends on the definition of s(v, w)(i), and they are defined recursively, we observe the following: S
v
(i) is a lower bound on the sum of substitution costs along the edges of the network that are reachable from the vertex v, provided that v is assigned state i and all the descendant tree vertices are assigned a unique state, and the reticulate vertices are assigned two states that are not necessarily the same. The assigned states of the reticulate vertex contributes to a conflict if the states are not the same. Let us suppose that state i is assigned to the root vertex r, and all tree vertices are assigned a unique state, while the reticulate vertices are assigned two states. Then the cost S
r
(i) denotes the minimum possible sum of substitution costs along all the edges of a traversal tree with one of states assigned for reticulate vertices, plus the sum of the substitution costs along the remaining reticulate edges with the alternate assignment state at the reticulate vertices. Since we seek an assignment on the vertices of the network with no conflicts in the reticulate vertices, S
r
(i) is a lower bound on the cost of such assignment where the root vertex is assigned i and all vertices are assigned with a unique assignment.
During this phase, we also store the states
(2)
to be able to backtrack the state of w that achieves the quantity s(v, w)(i) during the pre-order phase. See Algorithm 1.
Pre-order traversal phase: We first choose the minimum
where r is the root vertex and assign the state that attains the minimum at the root vertex, i.e., let such that S
r
(i
r
) = S. For any other vertex w that is not a reticulate vertex, whose parent v is already assigned with a state i, we assign the state t(v, w)(i). For a reticulate vertex w whose parent vertices are v and v', let us suppose that v and v' are assigned states i and i' respectively when traversing by the pre-order. The possible states j = t(v, w)(i) and j' = t(v', w)(i') of w that achieve s(v, w)(i) and s(v', w)(i') respectively, need not be the same. In other words, it is possible that j ≠ j'. In this case, we have a conflict on the reticulate vertex w. Thus, the dynamic programming technique fails to give an extension for λ whose parsimony score is S. In this case, we simply choose between j and j' for λ(w) according to which of the vertices among v and v' is traversed first in the pre-order. Thus, if the vertex w satisfies the traversal condition with respect to v we have .
After completing the pre-order phase, we can get the score corresponding to the extension by first setting S' = S and updating S' at each reticulate vertex w as follows: The upper bound score S' is updated corresponding to the assignment j at vertex w as S' -c
i' j'
+c
i' j
. See Algorithm 2. Figure 2 shows an example of how the algorithm runs on a network. Since S
r
(i) is a lower bound on the optimum assignment where the root vertex is assigned i and all vertices are assigned with a unique assignment, and since S = min
i
S
r
(i), we conclude that S is a lower bound of the optimum we seek to find. See Lemma 1 for a formal proof.
Lemma 1. The quantity S is a lower bound of the optimum parsimony score on the network N.
Proof. By the construction of S, we have
(3)
where the second summand is for the reticulate vertex w with parents are v and v', such that v satisfies the traversal condition w.r.t. w. Thus the cost is the substitution cost from the assigned state at v to the state at w. On the other hand, the cost is the substitution cost from the assigned state at v to the state at w. Note that the state is not necessarily same as the state , and S is the minimum among all assignments that may result in conflicts at the reticulate vertices.
Suppose is the optimum parsimony score on N with the function μ: V (N) → {0, 1, ..., |Σ| - 1} as the extension of λ we have
(4)
where in the second summand w is a reticulate vertex with parents v and v'. Since μ is a conflict-free assignment that is contained in the set of all assignments among whose costs S is the minimum (compare equation (3) and (4)) we have . □
Now for the complexity of the algorithm. Suppose the network N has n leaves and r reticulate vertices. Then the number of vertices in N is 2(n + r) -1. At each vertex v and for each state i, the quantity S can be computed in O(k2) time, where k = |Σ|. The pre-order traversal step involves finding S in O(k) complexity and assigning the best states for each vertex. Also, fixing conflicting reticulate vertex states takes O(r) time. Thus the complexity of the algorithm (presented here) to find a lower and an upper bound is O((n + r)k2). An alternate upper bound can be obtained in O(nk2) by simply assigning during the post-order traversal phase, for each reticulate vertex the state that occurs the maximum number of times at the leaves reachable from the respective reticulate vertex; and proceeding via finding S
v
(i) for the remaining vertices. The exact optimum can also be obtained by restricting the possible states to a single state for each reticulate vertex, by running the dynamic programming algorithm for each of the kr combinations of states for the reticulate vertices, and choosing the minimum among all of them. The time-complexity of this process is O(nkr+2).
Algorithm 1 Post-order traversal phase: Calculate the cost of each state at each vertex
-
1:
Input: Network N and the observed states from Σ at the leaves of N, i.e., a state assignment function λ over the alphabet Σ for N.
-
2:
For each leaf v, let S
v
(i) = 0 if λ(v) = i and ∞ otherwise.
-
3:
Repeat in post-order for each in internal vertex (root, internal tree vertex or reticulate vertex) v in N: For each state i, compute S
v
(i) given in (1) and t(v, w)(i) for each child w of v, given in (2).
-
4:
Output: {(S
v
(i), [t(v, w)(i): w is a child of v]): v ∈ V (N), i ∈ Σ}.
Minimizing the number of mutations on a phylogenetic network
The Fitch algorithm [17] counts the number of changes in a bifurcating phylogenetic tree for any character set, where the states can change from any state to any other state. Thus, the cost matrix is such that its diagonal elements are all zeros and the off-diagonal elements are all ones. In this section, we show how Fitch's algorithm extends to finding upper and lower bounds for the number of evolutionary changes in a given phylogenetic network. First, we show that the Fitch algorithm can be extended to give an upper bound for the optimum parsimony score. As before, the post-order and the pre-order traversal phases are given in Algorithms 3 and 4 below. See Figure 3 for an example run of the algorithm.
Algorithm 2 Pre-order traversal phase: Calculate lower and upper bounds of the optimum and the corresponding assignment of the upper bound
-
1:
Input: {(S
v
(i), [t(v, w)(i): w is a child of v]): v ∈ V (N), i ∈ Σ}.
-
2:
Let S = min
i
S
r
(i), where r is the root vertex and let .
-
3:
Let S' = S
-
4:
For each vertex w in pre-order whose parent vertex v immediately preceeds w in the pre-order, let where .
-
5:
Visit each reticulate vertex w with parents v and v' such that w satisfies the traversal condition with respect to v, with , , and update S' as follows:
-
6:
Output: (Lower bound, Upper bound) = (S, S'); extension corresponding to the upper bound score .
Algorithm 3 Post-order traversal phase: Calculate the optimum
-
1:
Input: Phylogenetic network N and a state assignment function λ over the alphabet Σ for N.
-
2:
For every leaf v of N, we are given A(v) = {λ(v)}, a singleton set containing the observed state at the leaf.
-
3:
Set UB = 0.
-
4:
Recurse using post-order: For a vertex v of T with children w1 and w2, let and If the vertex v has a single child w, then
and
5: ({A(v): v ∈ V (N)}, UB)
Since the pre-order traversal phase gives a conflict-free assignment on the vertices, UB is an upper bound. This is a special case of the dynamic algorithm presented for general cost-matrix. Suppose we restrict N to be a phylogenetic network with no sister reticulations, then any Fitch solution on any tree T in forms a lower bound for the optimal score on networks; and adding the cost on edges not in T gives an upper bound for the optimal score. Thus, it is possible to calculate our lower bound for counting the number of character changes only for phylogenetic networks with no sister reticulations, where it is straightforward to find a tree in .
Algorithm 4 Pre-order traversal phase: Assigning the states
-
1:
Input: Phylogenetic tree N and ({A(v): v ∈ V (N)}, UB).
-
2:
For every vertex v in the tree that is not already assigned, the algorithm computes as follows: For the root r, , where σ is an arbitrary element of A(r). Assign recursively via pre-order: For a vertex v whose parent u is assigned,
-
3:
Fixing the score: for each reticulate vertex v, if u' is not the parent in pre-order, and if , but , then increment UB by 1.
-
4:
Output: UB and extension function of λ.