Locus-aware decomposition of gene trees with respect to polytomous species trees

Background Horizontal gene transfer (HGT), a process of acquisition and fixation of foreign genetic material, is an important biological phenomenon. Several approaches to HGT inference have been proposed. However, most of them either rely on approximate, non-phylogenetic methods or on the tree reconciliation, which is computationally intensive and sensitive to parameter values. Results We investigate the locus tree inference problem as a possible alternative that combines the advantages of both approaches. We present several algorithms to solve the problem in the parsimony framework. We introduce a novel tree mapping, which allows us to obtain a heuristic solution to the problems of locus tree inference and duplication classification. Conclusions Our approach allows for faster comparisons of gene and species trees and improves known algorithms for duplication inference in the presence of polytomies in the species trees. We have implemented our algorithms in a software tool available at https://github.com/mciach/LocusTreeInference.


Background
Horizontal gene transfer (HGT) is the process of acquisition and fixation of foreign genetic material. It can lead to substantial changes in the ecology and evolution of recipient organism, sometimes leading to the emergence of new pathogens [1]. HGT is interesting both from biological and computational perspective. Several methods of detecting horizontally transferred genes have been proposed, which can be roughly divided into two categories [2]. So-called surrogate methods are computationally efficient, yet often imprecise. The other group consists of the phylogenetic methods, most notably the tree reconciliation [3].
HGT and gene duplication are examples of evolutionary events in which an organism gains a new locus (referred later on as locus gain events). A locus is a fragment of a chromosome with a specific gene. The locus gain events cause an incongruence between a gene tree and a species tree. A species tree is a schematic representation of an evolutionary history of a set of species, in which a node corresponds to a speciation event (i.e. separation of a group of organisms into two distinct species). Likewise, a gene tree is a schematic representation of the evolutionary history of a set of genes. A node of a gene tree corresponds to an emergence of a new gene, whether due to a speciation or an evolutionary event like HGT or duplication. The leafs of the gene tree are usually labelled by the leafs of the corresponding species tree. Consequently, while the leaf labels of the species trees are unique, a single label may occur multiple times in the gene tree.
Apart from the aforementioned evolutionary events, populational effects are a major source of incongruence between gene and species trees. These effects arise due to the fact that each gene may undergo a mutation, which creates a new nucleotide sequence referred to as an allele of the gene. It follows that the set of alleles in a single species can itselt exhibit a complex evolutionary relationship. During a speciation, the alleles present in the population are sorted into two sets corresponding

Open Access
Algorithms for Molecular Biology to the diverging populations (a process termed allele or lineage sorting). It follows that the evolutionary distance between alleles may not correspond to the evolutionary distance between their respective species, causing an incongruence between the species tree and the gene tree inferred from the nucleotide sequences of those alleles. Such incongruence is referred to as an incomplete lineage sorting (ILS). It can be shown that as the time between speciations increases, the probability of an ILS decreases exponentially. Therefore, if all the speciations in the considered species tree are separated by a sufficiently large period of time, the populational effects may be considered negligible. For a more detailed discussion of the ILS, the reader is referred to e.g. [4][5][6].
The new locus created by one of the aforementioned evolutionary events evolves more or less independently of other loci. This observation leads to the concept of a locus tree [4,7], which serves as a schematic represenation of the evolutionary relationship between different loci. A node of a locus tree corresponds to an emergence of a new locus due to either a speciation or an evolutionary event (note that a mutation of the nucleotide sequence does not create a new locus). The loci are assumed to evolve within the branches of the species tree (or, in case of an HGT, between two branches), while the alleles are assumed to evolve within the branches of the locus tree. Therefore, a locus tree is an intermediate concept between the gene and the species tree, as it encompasses the evolutionary events like gene duplication or HGT, but not the populational effects. Another distinction between the gene and locus trees is that the former can be inferred from the nucleotide sequences, while the latter usually needs to be inferred by comparing the gene tree to the species tree.
When population effects are negligible due to long times between speciations, the evolutionary events are the only source of incongruence between the gene and the species tree. In this case, the evolutionary relationship between loci can be regarded as identical to the evolutionary relationship between alleles. Consequently, the locus tree can be represented as the gene tree with nodes labeled either as speciations or locus gains. In this approach, a locus gain node corresponds to the time point in which a new locus is first observed (note that after a duplication, the choice of the "new" versus the "old" locus is often arbitrary). Such labeling allows to decompose the gene tree into a forest of trees representing independent evolutionary histories of different loci by detaching the locus gain nodes (see e.g. Fig. 1). The concept of gene tree decomposition has been investigated earlier in the context of tree comparison [8], but, to our knowledge, not in the context of inference of evolutionary events or locus trees.
Distinguishing between different locus gain events is challenging, as their effects on gene trees are topologically similar. In reconciliation, weights of events have to be specified; these are, however, rarely known. The fact that the results depend strongly on those unknown parameters may undermine the credibility of biological conclusions. To properly estimate the weights, high-quality training datasets are needed, in which inferred events are biologically supported.
Many cases of HGT were found by manual inspection of incongruences in gene trees [9]. Inferring a locus Fig. 1 An example of locus trees for G = ((a 1 , b 2 ), (b 3 , c 4 )) with two decompositions F 1 and F 2 consistent with S = (a, (b, c), d) . These decompositions are created by cuts indicated with red color. M X : G → S is shown for every set of cuts X (for internal nodes). Here, �(F 1 , S) = �(F 2 , S) = 0 , �(F 1 ) = 2 · GAIN and �(F 2 ) = 3 · GAIN , i.e., F 1 is optimal tree facilitates such analyses, as it allows to automatically detect the incongruences. This approach has several advantages over reconciliation. It allows restricting to only two parameters: the locus gain and the locus loss weight. It is also more robust to imprecise data, as improperly placed branches will only be locally detected as new loci, without interfering with the global evolutionary scenario. This allows to disregard the noise when analyzing the tree, and instead focus on several chosen events. The locus tree inference has been addressed in populational genetics setting [4]. However, this approach requires several parameters, like speciation times or population sizes, which are often challenging to obtain. It has also been addressed in parsimony framework in the model of gene duplication and loss [7].
Our contribution In this work, we address the problem of locus tree inference when populational effects are negligible. This allows addressing the locus tree inference problem in a parsimony framework, and to adapt a more general approach than presented in [7]. We assume that incongruence between gene and species trees can be caused by locus acquisition events of any kind, including duplications and horizontal gene transfers. We propose to solve the locus tree inference problem by decomposing a binary gene tree into a forest of subtrees that can be embedded into a possibly polytomic species tree, in a way that minimizes the weighted sum of the forest size and the number of loss events. We propose two variants of the problem: the Locus Tree Inference, LTI, in which forest elements are subtrees of the species tree, and the Conditional Locus Tree Inference, CLTI, where each forest element is a subtree of some binarization (full refinement) of the species tree. We show a dynamic programming algorithm that solves LTI in O(|G||S|m) time and O(|G||S|) space, where m is the maximal degree of a node from the species tree. To solve CLTI, we propose a new mapping, called the highest separating rank. Based on the mapping, we show an O(d|G| + |S|) time and O(|G| + |S|) space algorithm, where d is the height of S, for inferring required and conditional duplications in gene trees, which improves an O(|G|(d + m) + |S|) time solution from [10]. Finally, we propose an efficient heuristic to solve CLTI, and present a comparative study on simulated and empirical data.

Definitions
Let T = �V T , E T � be a rooted directed tree. For a, b ∈ V T , by lca T (a, b) we denote the lowest common ancestor of a and b in T. We also use the binary order relation a b if b is a node on the path between a and the root of T (note that a a ). Two nodes a and b are called siblings, which is denoted a = sibling(b) , if they are children of lca T (a, b).
We call a and b comparable if a b or b a , otherwise a and b are called incomparable. The parent of a node a is denoted as parent(a) . The subtree of T rooted at v is denoted by T(v). By L(T) we denote the set of all leaves in a tree T and we use L(v) instead of L(T(v)). By root(T ) we denote the root of tree T. A species tree S is a rooted directed tree in which nodes are called taxa. A gene tree G is a rooted directed binary tree, such that every leaf of G is labeled by a leaf-taxon from S, i.e., an element of L(S). Note that a gene tree can have multiple copies of taxons (see Fig. 1). For a node g in G, by tax(g) ⊂ L(S) we denote the set of all labels of leaves from L(g).
The lowest common ancestor mapping, or lca-mapping, between G and S is a function M : V G → V S such that M(g) = t if g is a leaf labelled by the leaf-taxon t, or M(g) = lca S (M(g 1 ), M(g 2 )) if g has two children g 1 and g 2 . An internal node g in G is a duplication if M(g) = M(g i ) for any child g i of g. Every other node, i.e., a leaf or an internal node satisfying M(g) ≻ M(g i ) for every child g i of g, is called a speciation [11][12][13].
A node with more than two children is called a polytomy. For a polytomy s in a tree S, let H(s) be the set of all possible binary trees whose leaves are the children of s. For instance, if s is the polytomy node present in S from

Locus gain problems
In this section we introduce the parsimony framework for the (conditional) locus tree inference problem and a dynamic programming formula for solving the problem. We say that a gene tree G is embeddable (respectively, conditionaly embeddable) into a species tree S, if each node of G is a speciation (respectively, a speciation in some binarization of S). For instance, (a, (b, c)) is embeddable into (a, (b, c, e), f), while (a, (a, b)) is not. Since the polytomies in S can be resolved independently, we get the following result:

Lemma 1 G is conditionally embeddable into S if and only if there is a binarization
Proof (⇐ ) Obvious. ( ⇒ ) First, observe that for any two nodes g, g ′ ∈ G , if M(g) and M(g ′ ) are incomparable, then binarizing any node under M(g) will not change the mapping of any node below g ′ . Thus, we can binarize disjoint parts of S separately. Now, consider a maximal node g which maps to a non-binary node in S (i.e. a node g such that for any g ′ ≻ g , M(g ′ ) is binary). Since the parent of g is a speciation, the mappings of g and sibling(g) are incomparable in S. Since G is conditionally embeddable, there exists a binary tree T ∈ H (M(g)) such that g becomes a speciation after replacing M(g) with T. From the definition of a speciation node it follows that, after such replacement, the children of g map onto disjoint parts of S, and subsequent polytomies can be resolved independently to obtain the desired binarization.
Every internal node g of G induces a set of loss events defined as nodes of the species tree strictly between M(g) and M(parent(g)) , plus M(g) if M(g) is a polytomy. The above definition yields a notion of the loss cost, denoted by �(G, S) , and defined as the total number of loss events required to embed G into S.
A gene tree may not be embeddable into a species tree due to duplications or HGTs. Our goal is to decompose a gene tree into a set of embeddable subtrees in the most parsimonious way. We say that a forest F is a decomposition of G, if T ∈F L(T ) = L(G) , and for every T ∈ F , G| L(T ) = T , where, for A ⊆ L(G), G| A is a tree having A as the set of leaves and {lca G (a, b)|a, b ∈ A} as the set of internal nodes with the ancestor relation inherited from G. Decompositions can be equivalently obtained by tree edit operations as follows. Given a gene tree G, X ⊆ E G is called a set of cuts if no two edges in X share their top nodes.
Let X be a set of cuts from G. We define G X to be the graph obtained from G by removing all cuts from G, contracting all nodes with one parent and one child, and then removing all roots having exactly one child.

Lemma 2
For every set of cuts X from G, G X is a decomposition of G.
Proof Removing all cuts from G leads to a set of disjoint and connected subgraphs of G. As no two cuts share the top node, each subgraph contains at least one leaf from G, i.e., there is no subgraph composed entirely of G's internal nodes. On the other hand, each leaf from G is in exactly one subgraph. Internal nodes of G can be divided into four disjoint classes: (I) nodes disjoint with any cut, (II) bottom nodes of a cut incident to exactly one cut, (III) top nodes of a cut incident to exactly one cut, and (IV) nodes incident to exactly two cuts. Contracting nodes with exactly one parent and one child [(the group (III)] and removing single-child roots (IV) do not affect the leaves and the nodes from the group (I). Therefore, after these operations, each subgraph becomes either a binary tree or a leaf. It follows that G X is a forest such that T ∈G X L(T ) = L(G) . Since each tree in G X is obtained from a connected subgraph of G, it inherits the ancestral relation from G. Furthermore, note that a node with one parent and one child or a root with one child cannot be the last common ancestor of any set of leaves. It follows that for any T ∈ G X we have T = G| L(T ) .
In the context of X every node of G can be uniquely associated with a node from G X by a mapping σ X that maps a node g to the first non-removed node below g, connected by non-cut edges. Formally, σ X (g) = σ X (g 1 ) if the edge g, g 2 is a cut from X and g 1 is the sibling of g 2 , and σ X (g) = g , otherwise. By M X : G → S we denote the "locus-aware" lca-mapping, given by M X (g) = M ′ (σ X (g)) , where M ′ is the lca-mapping between T ∈ G X and S such that σ X (g) ∈ T (see Fig. 1).
Consider a set of cuts X in G. We say that X detaches g ∈ G , or is g-detaching, if σ X (g) is the root of some tree from G X . For example, in Fig. 1, the cuts from the bottom example detach the parent of leaf a 1 (as σ X (parent(a 1 )) = b 2 is a root in G X ), while the cuts from the top one do not ( σ X (parent(a 1 )) = a 1 is below the root).
We say that a decomposition is consistent (respectively, conditionally consistent) with a species tree S if for every T ∈ F , T is (respectively, conditionally) embeddable into S. From the definition of σ X we have:

Remark 3
Let G X be a decomposition consistent with S. Then, a set of cuts X detaches g ∈ G if and only if every tree in G X is either disjoint with or entirely contained in G(g).
Given a species tree S and a gene tree G we define a locus tree with respect to S as a pair (G, X), where X is a set of cuts such that the decomposition G X is consistent with S. Locus trees which induce the same decompositions are considered equivalent. A locus tree represents the evolutionary history of a set of genes, while a cut corresponds to the creation of a new locus by gene duplication or horizontal gene transfer. The decomposition induced by a set of cuts represents independent evolutionary histories of different loci.
The definition of a locus tree presented in this work is formal, and does not always agree with the biological intuition. In "Relationship between decompositions and locus trees" section we show a set of cuts which cannot be directly explained by duplications and transfers, and requires additional evolutionary events to explain the decomposition (so-called non-admissible events). The topic of non-admissible events together with an algorithm for detecting them is covered in detail in "Relationship between decompositions and locus trees" section.
From Lemma 2 it follows that for each set of cuts there is a unique decomposition induced by this set. Conversely, for every decomposition F of G there exists a set of cuts X such that F = G X . Inferring such a set from a given decomposition is straightforward by a bottom-up traversal of the gene tree. Therefore, we can consider decompositions as equivalent to locus trees. From the computational point of view, it is more natural to seek for optimal decompositions rather than sets of cuts.
In some applications (see e.g. "Results" section), given a locus tree (G, X) we seek for the ≺-maximal nodes, called source nodes, obtained from G after removing X. The set of all source nodes uniquely corresponds to the elements of G X . For example, in Fig. 1, the top locus tree has two sources: the root of G and b 2 , while in the bottom tree, we have three sources: the root of G, a 1 and the parent of b 3 .
Given a decomposition F, we define the total loss cost as �(F , S) = T ∈F �(T , S) . We can now define the Locus Tree Inference problem (LTI) in the parsimony framework: Problem 4 (Locus Tree Inference, LTI) Given a gene tree G and a species tree S. Find the decomposition F * of G consistent with S having the minimal weighted cost �(G, S) = GAIN · |F * | + LOSS · �(F * , S) in the set of all decompositions of G consistent with S, where GAIN ≥ 0 and LOSS ≥ 0 are the weights of locus gain and locus loss events, resp.
Decompositions which satisfy the above conditions are referred to as optimal. Note that, from the mathematical point of view, one of the weights in the above cost function could be set to 1, and the other adjusted accordingly.
In the same way, for conditional consistency, we define the Conditional Locus Tree Inference problem (CLTI). The problems are equivalent if the input species tree is binary. From the algorithmic point of view, LTI is similar to the reconciliation with DTL (duplication-transfer-loss) scenarios [14] with no duplications. A transfer event corresponds to the creation of a tree in a decomposition forest. Additionally, we do not count loss events at the root of a new tree.
Our algorithm for solving LTI, described below, consists of several functions of g ∈ G , s ∈ S and ι ∈ {0, 1} which denotes whether a set of cuts detaches g:

D1
δ(g, s, 0) is the minimal partial cost contribution of G(g) in the set of all g-detaching sets X such that M X (g) = s. D2 δ(g, s, 1) as above but for non-g-detaching sets of cuts.
is the minimal partial cost contribution of G(g) in the set of all g-detaching sets of cuts X such that M X (g) s . For ι = 1 , the cost additionally includes all losses created by σ X (g) and associated with every species node s ′ satisfying M X (g) ≺ s ′ � s.
Below we present the dynamic programming formulas (DP Algorithm) for solving LTI. Here, c(v) is the set of children of v ( ∅ for leaves). By 1 we denote the indicator function, that is, 1[p] is 1 if p is satisfied and 0 otherwise.
know if there is a polynomial time algorithm for CLTI. However, when the locus gain weight ( GAIN ) is much greater than the loss weight ( LOSS ), an efficient heuristic can be constructed, based on a mapping introduced in the next section.

Ranked trees and rank-based mappings
Usually, when comparing trees, mappings based on their topologies are used (e.g., the lca-mapping). However, some biological species trees contain additional useful structure: the taxonomic ranks, like species, genus, or family. A species tree with taxonomic ranks assigned is sometimes called a taxonomy. Several major ranks are common to almost all living organisms. In this section, we propose a mathematical formalization of ranks and two rank-based mappings, which are useful in duplication inference and CLTI.
A ranked species tree is a species tree S in which every node s of S has assigned a small positive integer called rank, denoted R(s) , such that, for every s and s ′ , if s ≺ s ′ then R(s) < R(s ′ ) . We assume that the rank of the root of is d > 0 and every leaf has rank 1.
After assigning integer numbers to the taxonomic ranks, a taxonomy becomes a ranked species tree. However, the latter concept is broader, as any species tree can be converted to a ranked one, for example by assigning ranks equal to the node depths. The theory and algorithms described in the following sections apply to any ranked tree, regardless of the way in which the ranks were assigned.
Let G be a gene tree and S be a ranked species tree. For a rank r and a leaf t in S, the unique directed path in S consisting of all taxa comparable with t having the rank lower than r will be called an (evolutionary) r-lineage of t. Note that every 1-lineage is empty. We say that leaf-taxa t and t ′ are separated by the rank r if for every x from the r-lineage of t and every y from the r-lineage of t ′ , x and y are incomparable. Observe that every pair of leaf-taxa is separated by the rank of 1. Moreover, if r separates t and t ′ then every rank lower than r also separates t and t ′ . For example, in Fig. 2 leaf-taxa a and c are separated by ranks 1, 2 and 3, but not by rank 4.

Theorem 5 (Solution to LTI) For every G and S:
Proof The proof is by induction on the structure of G and S, where the properties D1-D4 of all δ 's are proved. We omit technical details. CLTI can be solved by an algorithm similar to the one presented above. It requires an additional case in δ for resolving duplications. To model a proper binarization of a polytomy in M(g), both children of g have to be mapped into disjoint sets of children of M(g). Such solution requires extending all δ 's by a set of species nodes allowed for the mappings. In consequence, this approach has an exponential time and space complexity. We do not Let g be an internal node in G with children g 1 and g 2 . The highest separating rank mapping P : V G → {1, 2, . . . , d} is defined as The lowest common rank mapping I : V G → {1, 2, . . . , d} is defined as I(g) = R(M(g)) . We now present some fundamental properties of both mappings. Simple proofs are omitted for brevity.

Lemma 7
Let ρ(t, t ′ ) be the highest rank that separates leaf-taxa t and t ′ and let g be an internal node of G with two children g 1 and g 2 . Then, Proof For the proof of (A), let r = R(lca S (t, t ′ )) . Then, by the definition, for every element v of r-lineage for t, . And the same holds for the r-lineage of t ′ . Thus, both r-lineages consists P(g) = max r : r separates every pair of leaf-taxa of incomparable nodes. We conclude that the rank r separates t and t ′ . Note, that for ranks r ′ > r , both r ′ -lineages contain lca S (t, t ′ ) . Therefore, r is the highest rank separating t and t ′ . This completes the proof of (A). (B) follows from the definition of mapping P and (A). The proof of (C) is similar to (B). Both (D) and (E) follow from (A) and (B).
Taxonomic ranks have been used earlier for HGT detection [15]. In this work, the authors decorated nodes of the gene tree with the rank of the lowest taxon shared by each descendant leaf, equivalent with the I mapping. A high difference between the rank of a node and the one of its parent was one of the premisses for HGT. To the best of our knowledge, no mapping equivalent with the highest separating rank has been proposed to date.

Computing mappings
Given a species tree S and a gene tree G, to compute I we can use the classical algorithm for lca-queries, in which, after a linear-time preprocessing, computing lca-queries can be completed in constant time [16]. We conclude that I can be computed in O(|G| + |S|) time.
A naïve algorithm for computing P, based on Lemma 7, requires O(|G||S| 2 ) time. Here, we propose an O(d|G| + |S|) time solution. For two distinct leaves l 1 and l 2 of G, we write l 1 < p l 2 if l 1 is visited earlier than l 2 in prefix traversal of G. For instance, in Fig. 2 the leaves are linearly ordered starting from the left, i.e., d 1 < p b 2 < p f 3 < p . . ..

Lemma 8
For a fixed s ∈ S, the sequence of all assignment evaluations in line 9 such that v.smap = s induces a sequence of values v, denoted by v 1 , v 2 , . . . , v k such that: (I) the assignment s.lastvisited := v i is executed only when rank = s.rank, (II) v 1 < p v 2 < p · · · < p v k , and (III) Proof (I) is obvious by the condition in the second loop. By the condition in the inner loop, the order of leaves induced by a sequence of such assignments follows < p . and v cannot be assigned to a node incomparable with M(v).
Proof Let g ′ and g ′′ be the left and the right child of g, respectively. The proof is by induction on the rank r = 1, 2, . . . , d , where d is the highest rank in S. Let r = 1 . Assume that P(g) = 1 , we show that g.P = 1 . Let s ∈ tax(g ′ ) ∩ tax(g ′′ ) . Then, by Lemma 8, let s be the sequence {v 1 , v 2 , . . . , v k } of all leaves assigned to s.lastvisited such that M(v i ) = s and ordered by < p . Clearly, the list has the leaves from both subtrees of g. Thus there is an index j < k , such that v j ∈ L(g ′ ) and v j+1 ∈ L(g ′′ ) . Thus lca G (v j , v j+1 ) = g . Now, in line 8, when v is v j+1 then v.smap.lastvisited is v j . In such a case, either g.P is None and g.P will be set to 1, or g.P is already set. However, it can be only 1. This completes the first part of the proof.
Assume that P(g) = r and for every q, such that P(q) < r , we have P(q) = q.P . For every v ∈ L(g ′ ) and w ∈ L(g ′′ ) , R(lca S (M(v), M(w))) ≥ r . Thus g.P = None , when Algorithm 1 starts the main loop with rank = r . From Lemma 7, there is a pair taxa �t 1 , t 2 � ∈ĝ such that s = lca S (t, t ′ ) and R(s) = r . Thus, there are two leaves a 1 and a 2 in G such that for each i, M(a i ) = t i and lca(a 1 , a 2 ) = g , i.e. a 1 � g ′ and a 2 � g ′′ . Similarly, to the first step, the leaves from M −1 (L(s)) are all visited and set to s.lastvisited according to the order < p . The sequence contains elements a 1 and a 2 , therefore again there is j separating leaves from both subtrees of g. The rest of the proof is analogous: in line 8 either g.P is already set to r (if there was another s ′ , processed before s, with R(s ′ ) = r satisfying the same properties as s) or it will be set to r. This completes the proof.

Classification of gene duplications
Several methods for reconciliation with non-binary gene trees have been proposed [17][18][19][20][21][22]. However, reconciliation with non-binary species trees is harder to model. This is because a polytomy may represent a lack of knowledge about the order of speciations, and therefore some duplication nodes may correspond to biological speciations. This motivates a further classification of duplication nodes into conditional and required duplications [10]. In our model, we assume that the species tree is ranked. This approach can be applied to any species tree by assigning ranks, e.g. based on the node depth.
When reconciling a gene tree G with every binarization of S, if g from G is a duplication in every reconciliation, then g is called a required duplication. Similarly, if g is a duplication in at least one but not all reconciliations, we say that g is a conditional duplication. Note that G is conditionally embeddable in S if and only if each node in G is either a speciation or a conditional duplication.
In this section, we show how P and I can be used to solve the problem of gene duplication classification when the species tree has possible polytomies.
Note that the above Lemma also holds when P(g) = I(g) = 1 , i.e. when an internal node g is mapped to a leaf of S. In such a case the condition (A2) is satisfied trivially.

Lemma 12
Let g be an internal node of G such that P(g) = I(g). Then, g is a speciation if and only if M(g) is an internal node and there are exactly two subtrees rooted at children of M(g) having nodes from tax(g).
Proof (⇒) . We have that g is an internal node. In such a case I(g) > 1 and M(g) is an internal node. Then, by (A2) from Lemma 11, every child of M(g) has taxa present in at most one child of g. There are at least two children of M(g) satisfying this property. If there are more than two, then one child of g, say g 1 , has taxa from at least two children of M(g). Hence, M(g 1 ) = M(g) and g is a duplication node, a contradiction. (⇐) . Similarly, if M(g) is an internal node, then by (A2), the mappings of the children of g are incomparable and located below M(g). Therefore g cannot be a duplication.
We have a symmetric property whose proof is similar to the previous one.

Lemma 13 Let g be an internal node of G such that P(g) = I(g). Then, g is a duplication node if and only if either M(g) is a leaf or M(g) is an internal node and there are at least three subtrees rooted at a child of M(g) having nodes from tax(g).
Finally, we have the main property.
Theorem 14 (Classification Theorem) Let g be an internal node of G. Then, (C1) If P(g) = I(g) = 1 orP(g) < I(g) then g is a required duplication. (C2) If P(g) = I(g) > 1, then g is a duplication if and only if g is a conditional duplication.
Proof (C1) If P(g) = I(g) = 1 , then g is mapped to a leaf. Hence, every leaf below g has the same label. Thus, in every binarization of S, g is a duplication. Assume that P(g) < I(g) . Then M(g) is an internal node in S, having at least three taxa in L(M(g)) (otherwise, the two children of M(g) are leaves and P(g) = I(g) = 2 ). We can assume that there are three leaves t, t ′ ∈ tax(g 1 ) and t ′′ ∈ tax(g 2 ) such that lca S (t ′ , t ′′ ) ≺ lca S (t, t ′ , t ′′ ) . This property holds for every binarization T of S, where the possible polytomy M(g) is resolved. Moreover, in every T, M(g 1 ) � lca T (t, t ′ , t ′′ ) , thus M(g 1 ) is comparable with M(g 2 ) � t ′′ . Thus, M(g) = max(M(g 1 ), M(g 2 )) and g is a duplication node.
(C2,⇐ ). If g is a conditional duplication, then it is a duplication by definition. (C2,⇒ ). Assume that g is a duplication, then by condition (A2) from Lemma 11, the children of M(g) can be clustered into three disjoint sets X, X ′ and X ′′ such that every node from X has no taxa present in tax(g) , every node of X ′ has taxa from tax(g ′ ) but not from tax(g ′′ ) and analogously every node of X ′′ has taxa from tax(g ′′ ) but not from tax(g ′ ) , where g ′ and g ′′ are the children of g. Also, by Lemma 13 at least one among X ′ and X ′′ , say X ′ , has at least two elements. Consider a binary tree T in H(M(g)), such that all elements of X ′ and X ′′ are located on the left and the right subtree of T, respectively. Then, lca T (X ′ ) and lca T (X ′′ ) are incomparable. Thus, in such a binarization of S, g ′ and g ′′ maps below M(g), and g is a speciation node. Similarly, it can be shown that there exists a tree in H(M(g)) in which g is a duplication.
Based on Algorithm 1, classification theorem leads to a natural O(d|G| + |S|) time solution for the inference of required and conditional duplications when reconciling a given binary gene tree with a species tree. This improves the known O(|G|(d + m) + |S|) time algorithm from [10], where m is the maximal degree of a node from S. The improvement is beneficial for highly polytomic species trees. For example, as of 04.28.2017, the genus Aspergillus has 1950 children species in the NCBI Taxonomy.

Heuristic for CLTI
In this section, we propose the heuristic algorithm for CLTI when the locus gain weight is much higher than the loss weight. The algorithm is based on the following lemma, which follows directly from Theorem 14:

Lemma 15 Tree G is conditionally embeddable in S if and only if, for all internal nodes g in G,
Algorithm 2 is a greedy approach that iteratively finds the minimal nodes g such that P(g) < I(g) or I(g) = 1 and detaches an embeddable subtree below each node. Note the following: Let T 1 \ T 2 denote tree T 1 with detached subtree T 2 . Then, � (G ′ , S) +�(G(g) \ G ′ , S) is an estimate of the partial loss cost induced by detaching subtree G ′ . The detached subtree in Algorithm 2 is chosen to minimize this estimate. To limit the complexity of a single step, we consider only subtrees rooted at vertices at a close neighborhood of g.

Lemma 17 Algorithm 2 returns a decomposition conditionally consistent with S in O(ad|G| + a|S|) time, where a is the number of recomputations of I and P mappings.
Proof Correctness It follows immediately from the fact that each G(w) detached in the inner loop is conditionally embeddable. Time Let g be an element of Z. Then, as G is binary, there are at most six edges v, w adjacent to a child of g. (Case I) If v = g and w is the child of g, then both trees G(w) and G(g) \ G(w) are conditionally embeddable by the assumption that g is the minimal node such that G(g) is not conditionally embeddable. This completes the proof for v = g . (Case II) For the remaining case, there at most four grandchild edges connecting a child v with a grandchild w of g. We can arbitrarily index them by 1,2,3 and 4. Again, G(w) is conditionally embeddable, however, it is unclear for T = G(g) \ G(w) . By Lemma 15, it is sufficient to check whether, for the root t of T, I T (t) = P T (t) > 1 , where the mappings are from V T . I T (t) can be computed in O(1) time by the rank of lca S (M(v.sibling), M(w.sibling)) , however, for P T : V T → {1, 2, . . . , d} we need to use Algorithm 1. To avoid quadratic time, consider the following additional steps after line 4. i ∈ {1, 2, 3, 4} , create a copy G i of G. 2. For each g in Z, remove the i-th grandchild edge v, w and G(w) from G i . 3. For every i, let P i be the mapping P, denoted P i , computed by Algorithm 1.

For each
Then, when the ith grandchild edge v, w is processed in the inner loop, the value P T (t) is P i (t) . We conclude that each step of the main loop (lines 2-6), requires 4 additional runs of Algorithm 1.
A significant advantage of Algorithm 2 is the space complexity, which is O(|G| + |S|) . This makes the heuristic suitable for trees with hundreds or thousands of nodes. Note that in Lemma 17, a is pessimistically equal to the height of the gene tree, which makes this algorithm asymptotically quadratic. However, in applications, a is expected to be a small integer. This expectation is valid e.g. in cases where evolutionary events occur less frequently than speciations. Note, however, that it may not be valid for genes which exhibit particularly complex evolutionary history, like transposons.

Relationship between decompositions and locus trees
An evolutionary scenario is the set of evolutionary events which have shaped the observed gene tree. Usually, a scenario is represented by assigning duplication labels to some nodes of the gene tree, and transfer labels to some of its edges.
Each evolutionary scenario, understood as a set of duplication nodes and transfer edges, induces a decomposition of the gene tree into a set of trees representing evolutionary histories of different loci. A set of cuts for such decomposition consists of all transfer edges and one edge below each duplication. One could reasonably expect that a converse relation holds, i.e. that each decomposition induces an evolutionary scenario and a similar set of cuts. This, however, turns out to not be true. The decomposition depicted in Fig. 3 requires adding additional duplication nodes with no cuts below them. Thus, one of the trees in this decomposition corresponds to the evolutionary history of two loci. In this section, we give preliminary results on the relationship between decompositions and locus trees. We begin with formalizing the notion of an evolutionary scenario. Next, we formally define the non-admissible events induced by a decomposition, and show and algorithm of detecting them. The number of non-admissible events measures how well a decomposition agrees with biological intuition behind evolutionary scenarios.
The formal definition of a DTL scenario presented below is adopted from [23,24] with the difference that we focus more on the event based conditions. Definition 18 (DTL scenario, from [25]) A DTL scenario, or scenario, for a binary tree G, and a tree S and a labelling L : L G → L S is a tuple M, �, �, �, ξ such that L : L G → L S is the leaf labelling function, M : V G → V S is a mapping that extends L , { , , } is a partition of I G into speciation, duplication and transfer nodes, respectively, and ξ : � → V G determines the termination node of a transfer in G, subject the following conditions. For any g ∈ I G such that c 1 and c 2 are the children of g, let s = lca S (M(c 1 ), M(c 2 )) . Then, • We have g ∈ if and only if the mappings of the children of g are incomparable, and s = M(g). • If g ∈ then s M(g).
The above conditions denote the speciation, duplication and horizontal gene transfers events, respectively. An example of an evolutionary scenario has been depicted in Fig. 4. In DTL scenarios, a vertical descent is modeled by the condition that the mapping of a child is below or equal to the mapping of its parent. The condition holds for the children of speciation and duplication nodes. A destination of a transfer node g is defined by the function ξ . Therefore, we require that both the mapping of g and the mapping of ξ(g) are incomparable. Note that, the above definition of a DTL scenario does not exclude cyclic HGT scenarios, i.e. scenarios with at least two conflicting transfer edges. Conflicting edges occur e.g. when the acceptor of the first transfer is above the donor of the second one, and the acceptor of the second transfer is above the donor of the first one. Such scenarios are impossible to occur in nature, because the acceptors and donors need to be present at the same time point. If the optimal cost is defined as the minimal weighted sum of numbers of HGT and duplication events, then, for a given gene tree and a species tree, it can be computed in O(|G||S|) time [23,24], while the problem for acyclic scenarios is NP-hard [24].

Validation of decomposition
Let F be a decomposition of a gene tree G. Then, by the definition of decomposition every node of F is a node of G. Such nodes will be called F-internal. Now, we can determine how a given decomposition is evolutionarily congruent by searching for the DTL Fig. 3 An example of a decomposition of a gene tree G into two trees (a 1 , b 1 ) and (a 2 , b 2 ) for which evolutionary scenarios require one additional duplication located on the internal node of the decomposition forest (here the root of G). The scenario with the minimal number of duplications (here 2) is depicted on the left in the form of the embedding of G into S [12] Fig. 4 An example of a DTL scenario E for a gene tree G and a species tree S. E has two HGTs, one duplication, and three speciation events. The scenario is shown with the mapping M depicted for the internal nodes only. While the gene loss events are not formally modeled here, they are required when embedding a gene tree into a species tree according to a given DTL scenario. Therefore, our visualizations of DTL scenarios are extended by the required loss events (see also the scenario E 3 in Fig. 5) scenarios that preserve the largest number of speciation events in the forest and the minimal number of duplication and HGT events outside of it. Given a scenario E for G and S, we say that an event g ∈ I G is admissible if • g ∈ and g is not F-internal.
Given that | | + | | + | | = |G| − 1 , the decomposition congruency can be equivalently expressed in terms of minimizing the number of non-admissible events. (Validation of decomposition) Given a gene tree G, a species tree S and a decomposition F of G consistent with S. Find a DTL scenario for G and S that minimizes the number of non-admissible events.

Problem 19
Examples of such scenarios are depicted in Fig. 5. The above problem can be solved by a dynamic programming algorithm in O(|G||S|) time similar to the decomposition problem and the reconstruction of DTL scenarios [23,25]. Similarly to the decomposition formulas for g ∈ G and s ∈ S we have the following properties of the σ's: s) is the minimal number of non-admissible events located in G(g) for the scenarios between G(g) and S(s). V2 δ(g, s) as above but under assumption that g is mapped to s. V3 δ → (g, s) is the minimal number of non-admissible events located in G(g) in the set of all scenarios for G(g) and S(x) where x is incomparable with s.
Below we present the dynamic programming formulas for the validation of decompositions. Here g ′ and g ′′ are the children of g, α represents the case when g is a speciation, β -a duplication and γ-an HGT.

Theorem 20 Given a gene tree G, a species tree S and a decomposition F of G consistent with S. The minimal number of non-admissible events equals min s∈S σ △ (root(G), s)).
Proof The proof is by induction on the structure of G and S, where the properties V1-V3 of all σ 's are proved. We omit technical details.

Theorem 21 The number of non-admissible events can be computed inO(|G||S|m) time and O(|G||S|) space, where m is the maximal degree of a node from S.
Proof See the proof of Theorem 6.

Results
We have run numerical simulations to assess the performance of the proposed algorithms. First, we have run the heuristic and dynamic programming algorithms on pairs of random trees to compare the inferred forest sizes and loss costs. Next, we have performed simulations of realistic evolutionary scenarios to check the correctness of the algorithms' results. The optimal decomposition is seldom unique. Therefore, when analyzing the dynamic programming algorithm, for each gene tree-species tree pair we picked one of the optimal decompositions randomly. The heuristic algorithm always returns a single decomposition.

Comparison of algorithms
In the case of binary species trees, conditional embeddability is identical to strict embeddability, and both locus tree inference algorithms can be compared experimentally.
For each |L(G)| = 1, . . . , 20 and |L(S)| = 10 we have generated 100 pairs of random trees under the Yule-Harding model. The leaves of G have been assigned to Page 13 of 18 Ciach et al. Algorithms Mol Biol (2018) 13:11 Fig. 5 An example of the validation of a decomposition with one non-admissible event. Top A species tree S and a gene tree G decomposed into 4 trees (a 1 , b 1 ) (blue), (a 4 , c 1 ) (green), a 2 (red) and a 3 (light blue). Bottom (3 rows) DTL scenarios E 1 -E 3 with embeddings. In scenarios E 1 and E 2 there is one non-admissible HGT terminating in a 4 and c 1 , respectively, while E 3 has one non-admissible duplication at the root Fig. 6 Comparison of DP and heuristic algorithms for binary species trees in terms of forest size |F| and numbers of losses �(F, S) . The brown line depicts the median cost; the grey ribbon depicts the 90% confidence interval. The weights in the DP algorithm have been set to GAIN = 1000, LOSS = 1 . The plots have been smoothed with cubic splines leaves of S randomly. The numbers of losses for heuristic algorithm have been computed using a modification of the DP algorithm for LTI. The inferred costs are shown in Fig. 6.
The costs are similar for both algorithms. The forest size is approximately half the number of leaves in G. Using linear regression, we have determined that, on average, the inferred forest size is equal to 0.47|L(G)| for DP and 0.53|L(G)| for the heuristic. Large forest sizes in these examples can be explained by the fact that both gene and species trees are simulated independently, and most trees in the forests contain only one or two leaves.
The number of losses is slightly smaller for the heuristic algorithm (on average 0.98|L(G)| for DP and 0.90|L(G)| for the heuristic). We hypothesize that the reason for this is that the greedy approach tends to detach more concise trees.

Detecting evolutionary events
To validate our approach under more realistic conditions, we have run simulations of evolutionary scenarios using the GenPhyloData tool [26]. The tool simulates species trees under a stochastic birth-and-death model, in which each leaf gets duplicated or lost with a constant rate. The total rate of duplication (loss) is equal to the duplication (loss) rate times the number of leaves at a given time point (see [27] for details). When the loss rate is equal to zero, this process reduces to the Yule-Harding model. In the case of the species tree, a branch duplication models a speciation, and branch loss models an extinction.
In each branch of the species tree, a similar birthand-death process models the gene duplication and loss events. An additional Poisson process models the occurrence of horizontal gene transfer events. A recipient of the transferred gene is picked uniformly from species alive at a given time point.
Using GenPhyloData, we have simulated 100 species trees under a birth-and-death model with the birth rate 5 and the death rate 1. Trees have been pruned to remove extinct lineages. This procedure resulted in a set of binary species trees with approximately 75 leaves on average (see Fig. 7).
For each species tree, we have simulated 10 gene trees with duplication rate 0.4, transfer rate 0.1 and loss rate 1.2. Trees have been pruned to remove lost genes. We have obtained a set of 1000 binary gene trees with approximately 43 leaves and 4 duplication or transfer events on average (see Fig. 7). The maximum number of events in a single gene tree was 27.
The simulated gene trees have been decomposed with respect to the corresponding species trees using our heuristic and dynamic programming algorithms. In both cases, the numbers of duplication/transfer events were similar to the inferred number of new loci (i.e. forest size minus one, accounting for the "base" locus at the root of the gene tree). The forest sizes inferred by the heuristic algorithm have been depicted in Fig. 7. The number of locus acquisition events has been inferred properly in 82.6 and 82.7% of the gene trees by the heuristic and dynamic programming algorithm, respectively, and differed by at most one event in 98.3 and 98.2% cases. The dynamic programming algorithm underestimated the number of evolutionary events slightly more often than the heuristic one. The number of inferred events was lower than the actual number in 10.7 and 11.9% of trees for the heuristic and dynamic programming algorithms, respectively.
To further validate our results, we have checked whether the source nodes inferred by the decomposition algorithms correspond to simulated evolutionary events. Note that, from the definition, two source nodes cannot be siblings (see "Locus gain problems" section for the definition of a source node). A properly predicted source node is either the end of a transfer edge, or a child of a duplication node. We say that a properly predicted source node is a true positive event. Consequently, an event that failed to be predicted is a false negative; a speciation that is classified as a source is a false positive; and a properly predicted speciation is a true negative.
Note that for a transfer edge, a decomposition algorithm might classify the sibling of the edge's end as a source node. In this case, we assume that the algorithm is wrong twice: first, it incorrectly classifies a node as a source (a false positive), and second, it fails to detect an evolutionary event (a false negative). To account for this assumption, we have adopted the following convention. A duplication event corresponds to: 1. One true positive and one true negative event if one of its children is a source node, 2. One true negative and one false negative if none of its children are source nodes.
A transfer event corresponds to: 1. One true positive and one true negative if one of its children is a source node, and this child is the end of the transfer edge, 2. One false positive and one false negative if one of its children is a source node, and the end of the transfer edge is the sibling of the source, 3. One true negative and one false negative if none of its children are source nodes.
The results for both algorithms have been summarized in We have further verified the correctness of our approach by investigating the numbers of non-admissible events induced by the decompositions. The results are depicted in Fig. 7. Overall, there was no non-admissible event in 93.2% of decompositions returned by the heuristic algorithm and 90.4% of decompositions returned by the dynamic programming one.
A likely reason for the worse performance of the dynamic programming algorithm is the random choice of optimal decomposition. The number of non-admissible events might in future be used as an additional criterion for the optimality of the decomposition.

Applications of evolutionary history decomposition
We have compared our approach with a state-of-the-art reconciliation program, Notung 2.9 [28]. We have analyzed the evolution of the gene family of an aminotransferase from a fungus Penicillium lilacinoechinulatum [GenBank:ABV48733.1] with a history marked by HGT events [9,29]. Homologs of the protein sequence have been found using BLASTp suite. For analysis, we have chosen 45 closest homologs from 20 fungal species.
The sequences have been aligned using MAFFT and trimmed with TrimAL [30,31]. The phylogenetic tree has been created using PhyML with default parameters and aLRT branch support, and rooted by setting Amanita muscaria as the outgroup [32]. Nodes of the species tree have been collapsed to represent only the following taxonomic ranks: species, genus, family, order, class, phylum, kingdom. The gene tree has been reconciled with NCBI Taxonomy with loss weight 1, duplication weight 1.5, codivergence weight 0 and transfer weight varying from 3 to 8. The result of decomposition by our heuristic approach has been visualized using the Python ete3 package [33] and is depicted in Fig. 8.
The protein ABV48733.1 has been chosen for analysis because it exhibits a particularly complex evolutionary history. In the gene tree consisting of 45 homologs, our heuristic has inferred 23 locus acquisition events. Depending on the transfer weight, Notung 2.9 reported from 1 to 7 transfers and numerous duplications. The inference of evolutionary events is further complicated by the fact that the sets of transfer edges for transfer weights 3 and 6 are disjoint. However, even though in this case it is difficult to explain the whole evolutionary scenario by reconciliation, the decomposition can still be helpful in inferring biologically relevant conclusions.
Consider the light blue subtree on Fig. 8 composed of several Fusarium species. The protein has been  Furthermore, the light blue subtree branches with Aspergillus and Penicillium species. As the support of the "junction" node of both subtrees (labeled as 1 on the locus tree) is 1.00, we can assume that this is not due to erroneous gene tree inference. The Fusarium species are not present in any other part of the tree, which is indicative of a horizontal gene transfer from the ancestor of Aspergillus and closely related Penicillium species to the ancestor of Fusarium species from the light blue subtree. The horizontal gene transfer hypothesis is further supported by the fact that genus Aspergillus is distantly related to Fusarium. It is estimated that their ancestors separated 300-500 million years ago [34,35]. The branching of two distant groups of closely related species in this case is also visible from the values of mappings I and P, as their values at the junction node 1 are considerably higher than the values at the root of the light blue subtree and its sibling node. This transfer is consistent with reconciliation results for transfer weights greater or equal to 3.7.
The emergence of another light blue subtree, consisting of Penicillium oxalicum and Penicillium arizonense, could similarly be explained by a duplication or a horizontal gene transfer from an ancestor of Aspergillus udagawae. However, Aspergillus and Penicillium species are fairly closely related, and the support of the junction node (labeled as 2) is only 0.65. A closer inspection shows that performing a nearest neighbor interchange operation resolves the incongruence. This suggests that this locus subtree is an effect of an erroneous tree inference. Reconciliation explains this event by a horizontal gene transfer for transfer weights from 3 to 5, and as an ancestral duplication for greater weights. Now, consider the light green subtree. This subtree contains several species which are present in its sister subtree (A. lentulus, A. udagawae) and numerous closely related species. This is indicative of an ancestral duplication. The duplication hypothesis is further confirmed by the values of mappings I and P. For the junction node (labeled 3), the mapping P is considerably lower than the mapping I. This indicates a branching of two large, but closely related groups of organisms. Comparison of mappings I and P at junction nodes and their children can in future serve as a basis for automated classification of locus gain events as transfers or duplications.
Most other locus gain events are ambiguous, both in the case of history decomposition analysis and the tree reconciliation.

Conclusions
In this work, we have investigated two new problems, LTI and CLTI, for locus tree inference in a parsimony framework, defined as problems of decomposing a gene tree into trees consistent with a species tree. We have proposed a new mapping, called the highest separating rank, which has been applied to improve the classification of duplications and to solve CLTI. We have proposed two memory and time efficient solutions to the proposed problems: an O(n 2 ) dynamic programming algorithm for LTI and a near linear time heuristic for CLTI designed to solve large instances. Next, to verify the evolutionary consistency of the output our algorithms, we have proposed a validation method based on the model of evolutionary scenarios with HGTs. Finally, we have performed a number of tests on simulated data showing that these algorithms detect evolutionary events with high accuracy, and performed a proof-of-concept analysis of an empirical gene tree. Our results suggest that the new mapping, combined with the lca-mapping, can be used to locate cases of gene duplications and horizontal gene transfers.

Future outlooks
We plan to extend the solutions to LTI and CLTI to nonbinary gene trees, as it would allow to collapse nodes with low support and possibly to decrease the forest size. We will further investigate the properties of the highest separating rank mapping, especially in the context of supertree inference, gene tree rooting and gene tree correction. Finally, we plan to apply our methods to design automated tools for HGT inference. They will serve as a preprocessing step in obtaining a manually curated dataset of horizontally transferred genes.