Basic notations
Trees considered in this paper are rooted and labeled at their leaves only, each leaf being labeled with the name of a studied species. Given a tree T, its node set, edge set, leaf node set and root are resp. denoted V(T), E(T), L(T) and r(T). The label of a leaf u of T is denoted by and the set of labels of leaves of T is denoted by . When a node u has two children, they are denoted u1 and u2.
Given two nodes u and v of T, we write u ≤
T
v (resp. u <
T
v) if and only if v is on the unique path from u to r(T) (resp. and u ≠ v); if neither u <
T
v nor v <
T
u then u and v are said to be incomparable. As we consider rooted trees T only, we adopt the convention that an edge denoted (v,u) means that u <
T
v. For a node u of T, T
u
denotes the subtree of T rooted at u, p(u) the parent node of u, while (u
p
,u) is the parent edge of u. A tree T′ is a refinement of a tree T if T can be obtained from T′ by collapsing some edges in T′, i.e. by merging the two extremitites of these edges [31].
A species tree is a rooted binary tree depicting the evolutionary relationships of ancestral (internal nodes) species leading to a set of extant (leaf) species. A species tree S is considered here to be dated, that is associated to a time function such that if y <
S
x then θ (y) < θ (x). Such times are usually estimated on the basis of molecular sequences [32] and fossil records. Note that to ensure the time consistency of inferred transfers, absolute dates are not required, the important information being the ordering of the nodes of S induced by the dating.
Given a dated binary species tree S, the reconciliation model we rely on considers a variant S′ of S called a subdivision (as done also in [2, 6, 17]), associated to a time function . More precisely, for each node x ∈ V(S) ∖ L(S) and each edge (y
p
,y) ∈ E(S) s.t. θ
S
(y
p
) > θ
S
(x) > θ
S
(y), an artificial node w is inserted along the edge (y
p
,y), with (see Figure 1). Note that the height of S′ nodes (i.e. the number of their ancestors) is a valid time function that preserves the same partial order on nodes as and that the restriction of this time function to V(S)⊆V(S′) preserves the partial order induced by θ
S
.
A gene tree is a rooted binary tree depicting the evolutionary history of a gene family, that lead to a set of homologous sequences observed in current organisms. Each leaf of the gene tree has a unique label, corresponding to specific extant sequences of the gene. Indeed, several leaves of a gene tree can be associated to a same species due to duplication and transfert events. We denote by s(u) the species associated to leaf u ∈ V(G).
A gene tree G with supports is a gene tree whose internal edges each have a support value. Let w k
t
(G) ⊆ E(G) be the set of edges having a support value weaker than threshold t and let s t r
t
(G) be E(G) - w k
t
(G), that is the edges having a support equal or stronger than t.
Reconciliation model
Reconciling a gene tree G with a species tree S means building a mapping α that associates each gene g ∈ V(G) to a sequence of nodes in V(S), namely the species in which the sequence g evolved. This evolution is submitted to different kinds of biological events such as speciation, duplication and transfer. The following definition presents a discrete models of this evolution.
Definition 1
(Reconciliation model). Consider a gene tree G, a species tree S with a time function θ
S
, and its subdivision S′ with a time function θS′.
Let α be a function that maps each node u of G onto an ordered sequence of nodes of S′, denoted α(u). For u ∈ V(G), let ℓ denote the length of α(u) and let α
i
(u) be its ith element (where 1 ≤ i ≤ ℓ). α is said to be a reconciliation between G and S′ if and only if exactly one of the following atomic events occurs for each couple of nodes u of G and α
i
(u) of S′ (where α
i
(u) is denoted x):
-
1.
u∈L(G), x ∈ L(S ′) and . (
event)
-
2.
x is not artificial and {α 1(u 1),α 1(u 2)} = {x 1,x 2}. (
event)
-
3.
α 1(u 1)=α 1(u 2) = x. (
event)
-
4.
α 1(u 1) = x, and α 1(u 2) = x ′ is such that x ′ ≠ x and . (
event)
-
5.
x is an artificial vertex and α i+1(u) is its only
child. (
event)
-
6.
x is not artificial and α i+1(u) ∈ {x 1,x 2}. ( event)
-
7.
α i+1(u) = x ′ is such that x ′ ≠ x and . ( event)
The combinatorial events mentioned above (
,
,
,
,
, , ) are those defined in [2]. See Figure 2 for an illustration of these events and Figure 3 for an example of reconciliation according to this model.
Note that among these events, and are in fact a combination of two independent biological events. However, the fact that a loss is always taken into account jointly with another event allows to obtain a recursive algorithm and is done without loss of generality, i.e. does not reduce the power of the model [2].
Given a gene tree G and species tree S, there is an infinite number of possible reconciliations. Discrete evolutionary models compare them by counting the number of events they respectively induce. As different types of event can have different expectancies (e.g.
are thought to be more frequent than
and
), reconciliation models allow for a specific cost to be given to each kind of event. The cost of a reconciliation is then the sum of the costs of the individual events it induces. In that setting, the parsimony approach is then to prefer a reconciliation of lower cost. This is formalized in the following definition.
Definition 2.
Let us consider a gene tree G, a subdivision S′ of a species tree, and a reconciliation α between trees G and S′. The cost of α is defined as
where δ, τ, and λ respectively denote the cost of
,
, and
events, while d, t, and l denote the number of the corresponding events in α. Moreover, a event is atomic and costs (τ + λ), while a event just costs λ. Indeed, speciation events are most of the time considered as having a null cost, but the model easily accommodates for non-null costs if necessary.
The optimal reconciliation cost is
over all reconciliations α between G and S′.
Note that several distinct alternative reconciliations can have an optimal reconciliation cost.
Lemma 1 (Consecutive events)
Consider a gene tree G, the subdivision S′ of a species tree, and a reconciliation α of optimal cost C(G,S′) = c(α). For any node u of G, if α
i
(u) corresponds to a event, then αi + 1(u) does not.
This results from the observation that two in a row can be replaced by single , leading to a reconciliation of lesser cost.
Finding a most parsimonious reconciliation
To find one of the most parsimonious reconciliations between a gene G and a species tree S we will rely on the dynamic programming algorithm of Doyon et al. [2] that computes the optimal reconciliation cost, C(G,S′) on G and the subdivision S′ of S. This algorithm successively examines the nodes u of G and their possible mapping on nodes x of S′ (or equivalently on edges ending at such nodes). A node u of G can be mapped on such a vertex x according to different scenarios, each postulating a different event at node u among those of Definition 1. The optimal cost for mapping u at x is defined according to the scenario of minimal cost. For running time optimization reasons, the scenario involving a event, whose cost is denoted , is computed after the other possible scenarios, denoting the minimum cost that can be achieved among the latter. This decomposition is possible since a event is always followed by a
,
,
,
,
, or event (see Lemma 1). As a result, the best receiver for a event of node u with donor branch x can be computed from the costs over all vertices y other than x such that . The cost are themselves computed from values but for children u
i
of u (see below). These intricate notions are formally detailed in Definition 3
Definition 3 (Reconciliation cost matrix).
Consider a gene tree G and the subdivision S′ of a species tree S with a time function θS′. Let denote the cost matrix recursively defined as follows for a node u of G and a vertex x of S′: and , where the costs for all events x are defined below
-
, if u ∈ L(G), x ∈ L(S′) and .
-
-
c(u1,x2) + c(u2,x1)}, if u ∉ L(G) and .
-
, if u ∉ L(G).
-
-
+ τ, with u ∉ L(G) and z (resp. y) denoting a vertex that minimizes c(u2,z) (resp. c(u1,y)) over all vertices such that .
-
, if x has a single child.
-
, if x has two children.
-
, where y denotes a vertex that minimizes over all vertices x′ ∈ V(S′) ∖ {x} such that .
If the above constraints for an event on node u and vertex x are not respected, the corresponding cost is set to ∞.
The value c(u,x) is the optimal cost when mapping gene node u to node x in S′. The optimal cost for reconciling G with S′, denoted C(G,S′), is then .
The algorithm of Doyon et al. [2], called Mowgli, fills the dynamic programming cost matrix by two embedded loops: one loop visits all species nodes of S′ in time order (e.g. according to the partial order, while the other loop visits nodes of the gene tree G in postorder. Due to an optimization in precomputing the best receiver edge for transfer events of nodes u at a given time, this algorithm has O(|S|2.|G|) time complexity.
The problem considered in this paper is the following:
MOST PARSIMONIOUS RECONCILIATION GENE TREE (MPR-GT)
INPUT:
-
A dated species tree S with a time function θ
S
-
a gene tree G with supports on its edges and whose leaves are associated to leaves of S
-
costs δ, τ, resp. λ for
,
, resp.
and
-
a threshold t.
OUTPUT: a gene tree G′ such that both and , and such that C(G∗,S′) is minimum among all such trees.
Algorithm
We describe here a heuristic for the MPR-GT problem that relies on a hill-climbing strategy to seek a (rooted) gene tree G of minimum reconciliation cost (see Definition 3) using NNI edit operations [33].
Performing an NNI operation around an internal edge (w,v) means swapping the position of one of the two subtrees connected to v with that of the subtree connected to the sibling of v. Given an initial gene tree G and an edge of G, two “alternative” trees can be obtained from G by performing an NNI operation (see Figure 4). The hill-climbing proceeds as follows: (1) select a weak edge of G; (2) compute the reconciliation cost for the two alternative gene trees obtained by NNI on that edge; (3) if none of these trees decreases the reconciliation cost, then try another weak edge; if none of the weak edges allows to progress, then G is a local minimum and the hill climbing stops; (4) otherwise one of the alternative gene trees leads to a decrease in reconciliation cost, and the above process continues with the alternative tree of minimum reconciliation cost. MowgliNNI outputs the final binary rearrangement along with its most parsimonious reconciliation. In the worst cases, MowgliNNI examines all unreliable edges and does not find any better binary rearrangement of the given gene tree G since the topology G is already (locally) optimal. The whole scheme of MowgliNNI is described in Figure 5.
Consider now the time complexity of MowgliNNI. Identifying the weak edges is done in O(|G|) and generating the two alternative gene trees for a NNI operation is done in constant time. Hence, the complexity bottleneck of MowgliNNI is the number of times (denoted N) the Θ(|S|2 · |G|)Mowgli algorithm is called. Overall, the time complexity of MowgliNNI is Θ(|S|2 · |G| · N). The next section describes how we can avoid recomputing large parts of the cost matrix, and hence greatly reduce the running time of MowgliNNI.
Combinatorial optimization
We now present results that take advantage of the way the dynamic programming matrix is computed (Definition 3) to avoid recomputing from scratch the cost matrix associated to a gene tree G′ obtained by an NNI edit operation from a gene tree G. Consider the gene tree G of Figure 4, the NNI operation applied on edge (w,v) that swaps the two subtrees G
b
and G
c
, and the resulting gene tree denoted G′. We can observe that despite the global architecture of G and G′ differs, the local architectures of subtrees , remain unchanged. Hence, any cost that differs between the matrices C(G,S′) and C(G′,S′) (see Definition 3) is located in a column (i.e. node of the gene tree) associated to an ancestor of v (including v itself). For each of those nodes, there are two cases: (i) the node belongs to the NNI edge and its two children have subtree that have been modified (e.g. nodes w and v); (ii) the node is a strict ancestor of the NNI edge (w,v) and has exactly one child with a subtree that has been modified (e.g. g
k
,…,g0).
Lemma 2 below indicates which columns of the cost matrix don’t need to be recomputed.
Lemma 2.
Consider a gene tree G, the subdivision S′ of a species tree S, an edge (w,v) of G, and the gene tree G′ obtained from G by an NNI operation on (w,v). For each node z of G that is not ancestor of v in G and for each vertex x of S′, then c(z,x) = c′(z,x) holds.
This observation results from the fact that the dynamic algorithm of Mowgli computes the value of a cell (z,x) in the cost matrix using cells storing values either for the same node z or for its children (see formulas of Definition 3). Hence the value of a cell (z,x) directly or indirectly depends only on values for cells corresponding to z and its descendants. Going from gene tree G to G′ by an NNI operation, precisely changes the descendant relationships of v and its ancestors, i.e. all other nodes z have the same descendants in both G and G′ (see Figure 4), hence c(z,x) = c′(z,x) holds for all these nodes.
Unfortunately, there is no extension of Lemma 2 to ensure that when an edge has already been unsuccessfully tried for an NNI, it is useless to reconsider it later, even if it is a descendant in G of the edge leading to the last successful NNI.
Theorem 1.
Consider a gene tree G, the subdivision S′ of a species tree S, an edge (w,v) of G, a gene tree G′ obtained by an NNI operation on (w,v), and any strict ancestor u of w in G where the unique child of u that is an ancestor of w is u1 w.l.o.g. (i.e. w ≤ u1 in both G and G’). If c(u1,x) ≤ c′(u1,x) holds for all x ∈ V(S′), then c(u,x) ≤ c′(u,x) holds for all x ∈ V(S′), and as a corollary C(G,S′) ≤ C(G′,S′).
The proof of Theorem 1 is described in Appendix. This theorem leads to the optimized algorithm of MowgliNNI, formally stated in Algorithm 1 as an integrated procedure run after Mowgli. The later computes a dynamic programming matrix c :V(G) → V(S′) that MowgliNNI then partly recomputes given a rearrangement performed on the gene tree G. For each rearrangement, the matrix recomputed by MowgliNNI, denoted c′ :V(G′) → V(S′), is obtained in worst case time O(|S′| · h(G)), where h(G) is the height of G (i.e. the number of its ancestors)
Algorithm 1
M
o
w
g
l
i
N
N
I
(
G
,
c
): seeking a gene tree
G
′
of minimum reconciliation cost, starting from a gene tree
G
and the precomputed matrix reconciliation cost
, where
S
′
is the subdivided species tree.
Theorem 2.
MowgliNNI has worst case running time O(|S|2 · |G| + |S|2 · h(G)·N)
Indeed the steps of Algorithm 1 can be described as follows: initializing the reconciliation matrix for the initial gene tree is done in O(|S|2 · |G|) time; then updating the matrix for each of the N NNIs now only costs O(|S′| · h(G)) = O(|S|2·h(G)).
In MowgliNNI’s naïve implementation each rearrangement requires to recompute the cost associated to each and every node of the gene tree. In contrast, in the optimized version, an NNI around edge (w,v) is examined after updating only those costs associated to ancestral nodes of w. This has no impact on the worst case complexity (when the gene tree is a caterpillar h(G) is in O(|G|)) but significantly reduces the running times in practice since in most cases the number of nodes in G is much larger than their average height. For some random tree models the average height of a node in an n-leaf tree is indeed proportional to l o g(n) [34].