Unifying duplication episode clustering and gene-species mapping inference

We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of partially leaf-labeled gene trees labels by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates. In addition, we design a method to infer distributions of gene-species mappings. We then demonstrate how to use DP to design an algorithm that solves MetaEC. Although the algorithm is exponential in the worst case, we introduce a heuristic modification of the algorithm that provides a solution with the knowledge that it is exact. To evaluate our method, we perform two computational experiments on simulated and empirical data containing whole genome duplication events, showing that our algorithm is able to accurately infer the corresponding events.


Introduction
In the field of computational biology, the use of gene families and the reconciliation model has become increasingly popular for studying the evolution of diverse organisms.These tools have facilitated the development of new algorithms and computational methods capable of handling large and complex datasets and exploring various types of evolutionary events.These events range from simple macroevolutionary processes such as gene duplications, gene losses, and horizontal gene transfers to more complex ones such as genomic duplications, speciations, and hybridizations.The reconciliation model has enabled researchers to reconstruct evolutionary histories by reconciling gene trees with species trees and identifying the evolutionary events that have led to the observed patterns of gene evolution.In the context of metagenomics, the reconciliation model has also been used to detect missing gene-species assignments using polynomial time algorithms.These developments have led to a better understanding of the evolutionary processes.
A classical reconciliation model [1,2] defines a mapping from every node from a gene tree into a node in the species tree and determines if such a node is related to a speciation or can be classified as a single gene duplication [3].In result, an embedding of the set of gene trees into a species tree can be interpreted as a joint evolutionary scenario [4].The classical least common ancestor (LCA) mapping minimizes the number of single gene duplications and losses for one gene tree and the species tree [4].
The whole genome duplication (WGD) phenomenon incorporates additional copies of a complete genome into the original genetic material, thus creating an opportunity to introduce novel evolutionary traits [5,6].From a macro perspective, this phenomenon played a crucial role in the divergence and formation of species and shaped the evolution of almost all major lineages of life.In particular, many WGDs were uncovered in the evolutionary histories of plants, especially crops.WDGs potentially enabled the successful domestication of plants [7] and are important in the fight against famine [8].Many traces and evidence of whole genome duplications can be found in the genomes of yeast and other fungal species [5,9].From the perspective of single cell evolution studies, WGDs are prevalent in cancer progression [10] and can lead us to the prognosis of advanced cancer stages [11] or the creation of strategies for targeted therapy [12].
Guigó et al. [13] proposed the first approach for detecting multiple gene duplication episodes from a collection of rooted gene trees.They designed a heuristic that aggregates single gene duplication events into a large gene duplication, given a collection of rooted gene trees and a rooted species tree.This approach was formalized and improved by Page and Cotton [14], who defined the problem of episode clustering ( EC ) as the task of identify- ing the minimal number of locations in the species tree where all duplications from the input gene trees can be placed.Fellows [15] applied this model in the context of the supertree problem.Polynomial-time solutions for two types of multiple gene duplication problems episode clustering and a more general variant of clustering called minimum episodes (ME) were proposed in [16,17].Luo et al. [18] proposed linear time and space algorithms, partially based on [19], for these problems.[20] introduced a unified approach by proposing a concept of interval models with a linear time and space solution to a broad class of clustering problems including EC and ME.Alternative approaches include generalization to unrooted gene trees; however, such approaches are often computationally complex [21,22].Other approaches include variants of clustering rules that depend on the maximal number of duplication episodes placed in one path [23,24].A comprehensive analysis of various models is available in [20].Furthermore, [25] proposes an integer linear programming formulation that simplifies the process of testing these models.Relevant computational complexity results on the ME problem are presented in [26].
Metagenomic studies provide valuable information for analyzing entire communities of organisms and revealing a complete picture of their functional and adaptive capacities crucial for ecology [27] or human health [28].Genetic material isolated in such studies can be used to detect whole genome duplication events.
One of the steps in metagenomic analysis is called binning.The aim of this procedure is to assign sequenced DNA fragments to the appropriate taxonomic groups.The assignment of certain genes to species can be ambiguous due to the limitations of annotation methods.A precise and comprehensive gene tree topology is essential for the accurate identification of potential duplication sites.The absence or misplacement of duplications in gene trees can, in turn, result in incorrect outcomes of methods aimed at determining whole genome duplication events.
To tackle the challenge of missing gene-species assignments in evolutionary studies, previous research has introduced methods based on the reconciliation score using gene duplication and loss events [29,30].In a related work, Mykowiecka et al. [31] extended this model by including horizontal gene transfer to better analyze bacterial evolution and proposed polynomial time algorithms for these models.These approaches utilize tree reconciliation according to the classical scheme, in which the gene tree includes symbols representing sequences with unclear species assignment in addition to the known gene labels.The objective is to assign the unknown gene labels to their corresponding species in a gene tree while minimizing the total reconciliation score, which is typically a weighted sum of evolutionary events such as gene duplication, gene loss, and horizontal gene transfer.
Here, we present a novel problem called MetaEC, which aims to infer gene-species assignments in a collection of gene trees with unknown labels by minimizing the size of episode clustering.This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories.To solve MetaEC, we propose a dynamic programming (DP) algorithm that verifies the existence of a set of duplication episodes from a predefined set of episode candidates.We then demonstrate how to use DP to design an algorithm that solves MetaEC.In addition, we design an algorithm to infer distributions of gene-species mappings from the set of all optimal solutions inferred by DP.Although the algorithm is exponential in the worst case, we introduce a heuristic modification of the algorithm that provides a solution with the knowledge that it is exact.To evaluate our method, we perform two computational experiments on simulated and empirical data containing WGD events, showing that our algorithm can accurately infer the corresponding events.It is important to note that a direct comparison of our novel algorithm with alternative methods is not possible, as there is currently no existing approach for the simultaneous inference of duplications and gene-species mappings.

Gene trees, species trees, and the duplication cost
We begin by recalling some basic definitions from graph theory.All trees in this article are rooted and binary, therefore we refer to them as trees.For a tree T = (V (T ), E(T )) , by root(T ) we denote the root, and by L(T) we denote the set of all leaves.Every non-leaf node will be called internal.A species tree is a tree whose leaves are called species.A gene tree over a species tree S is a tree with leaves labeled by the species from S. The set of all species present in a species tree or a gene tree T is denoted by L(T ) .Note that for a species tree S, L(S) = L(S) .Also, for a gene tree G over S, L(G) ⊆ L(S).
For nodes a and b, a b means that a and b are on the same path from the root, with b being closer to the root than a.We write a ≺ b if a b and a = b.
For a gene tree G over a species tree S, the least common ancestor (lca) mapping between G and S is a function for a child a of g.The duplication cost, denoted by D(G, S) , is the total number of duplications in G.Each non-duplication internal node of G we call a speciation.In the latter part of the article, a duplication g is called an s-duplication if M G (g) = s .Similarly, we use the notation for an s-speciation and an s-leaf.
An example of tree reconciliation and the lca-mapping is depicted in the leftmost part of Fig. 1.

Episode clustering problems
Below we present a model of duplication episodes proposed in [20].In short, this model admits all evolutionary scenarios using duplication and loss events with a minimal number of gene duplications.
Formally, the model of gene duplication episodes allows for relocating a gene duplication from its lca-mapping node to one of its ancestors under some additional constraints required to preserve the biological soundness of the scenario.For a gene tree G over S, a mapping Note that the model of valid mappings described above is more comprehensive than the model presented in [16].Figure 1 provides an example of valid mappings that uniquely define an evolutionary scenario that can be represented as a tree with an additional decoration of nodes.For more information on the formal modeling of evolutionary scenarios, refer to [4].
We denote by Dup T ⊂ V (T ) , the set of all duplication nodes in T. Let G 1 , G 2 , . . ., G n be a collection of rooted gene trees over a species tree S. Assume that, for every i ∈ {1, 2, . . ., n} , F i is a valid mapping between G i and S. Every element s ∈ i F i [Dup G i ] 1 denotes the location of multiple gene duplication events in S. We will refer to such locations as duplication episodes or simply episodes.Later on, we may also use the term episode to refer to the set of duplications that are mapped into it.
Problem 1 (Episode Clustering, EC) Given a collection of rooted gene trees G 1 , G 2 , . . ., G n over a species tree S. Compute the minimal number of duplication episodes, denoted by EC(G 1 , G 2 , . . ., G n , S) , in the set of all valid mappings F 1 , F 2 , . . ., F n such that This problem can be solved in linear time and space [18].
Fig. 1 From the left side: a gene tree G and a species tree S with the lca-mapping M shown using arrows from the internal nodes of G to the nodes of S.There are 5 gene duplications, 2 speciation nodes in G (red bars), and 8 valid mappings depicted as embeddings of G into S [4], where the blue lines in these embeddings correspond to the edges of G. E LCA is induced by the lca-mapping.Here, EC (G, S) as depicted in the three rightmost scenarios with episode sets {a, b, abcd} for E 5 and {a, abc, abcd} for E 6 and E 7 .The example trees are partially adapted from [24]

Gene-species mappings
We present the main problem for joint reconstruction of gene-species mappings and minimizing the set of episodes.
A partial gene tree is a rooted binary tree where each leaf is labeled by a species or has no label.Let G be a partial gene tree over S. By � G : L(G) → L(S) we denote the partial leaf labeling function such that � G (g) is the label (species) of the leaf g in G if defined.Note that any gene tree is a partial gene tree with the leaf labeling being a total function.If a leaf in G has no label, we say that the label is unknown and write � G (g) = ⊥ .We say that a gene tree G * over S extends a partial gene tree G over S if G and G * are isomorphic as graphs (i.e., V (G) = V (G * ) and E(G) = E(G * ) ), and, G * is a total function that extends G .

Inferring labelings by minimizing episodes
Now, we present the problem of the simultaneous reconstruction of leaf labelings and duplication episodes from collections of partial gene trees.
Problem 2 (MetaEC) Given a collection of partial gene trees G 1 , G 2 , . . ., G k over a species tree S. Com- pute the minimum For example, if (a, (⊥, (⊥, ⊥))) is a single gene tree with three unknown labels and (a, (b, c)) is a species tree, then the problem is to replace all occurrences of ⊥ by a, b or c such that the total number of duplica- tion episodes is minimized.In this case, the optimal cost is 1, since at least one duplication is needed when the gene tree has four leaves and there are only three species.

Methods
We begin by solving a simpler problem in which we assume that the set of duplication episodes is constrained to a given set of species tree nodes.Next, we show how to solve MetaEC for a single gene tree.Section MetaEC in the general case presents the general solution.

Episode feasibility problem
We start with a related constrained problem.Given a partial gene tree, we are interested in the question of whether there is an extension of the partial gene tree such that the set of corresponding duplication episodes is contained in a given fixed set of episode candidates.
Problem 3 (Episode Feasibility) Given a partial gene tree G over a species tree S and X ⊆ V (S) .Does there exist a gene tree G * and a valid mapping If a partial gene tree G satisfies the above property, we call G X-feasible with respect to a species tree S. If the context is clear, we omit the reference to S.
The solution to Episode Feasibility is a dynamic programming (DP) formulation expressed using Łukasiewicz's Three-Valued Logic Ł 3 [32] with three constants True , False , and Unknown (representing uncertainty) and ordered linearly: False < Unknown < True .The logic has binary operators ∨ (disjunction, max ), ∧ (conjunction, min ), and two unary operators L (certainty) and M (possibil- ity).See the interpretation in Table 1.
For a node g of a gene tree G, by G|g we denote the subtree of G rooted at g.For any non-leaf node t in a tree, by t ′ and t ′′ , we denote the children of t.To simplify the nota- tion, we assume that the set X ⊆ V (S) is fixed.Then, we have the following dynamic programming formulas that solve Episode Feasibility.Let g ∈ V (G) and s ∈ V (S).where In the next Lemma, we express properties satisfied by the above formulas.For a partial gene tree G over S, and nodes g ∈ V (G) and s ∈ V (S) , we say that a valid mapping F T : V (G|g) → V (S|s) is feasible for (g, s, X) if and only if T extends G|g and F T (Dup T ) ⊆ X .Feasible mappings represent episode scenarios that correspond to partial solutions to the instance of Episode Feasibility that forces duplications from G|g to be present in episodes from X ∩ V (S|s).
We say a duplication g in a gene tree T is upper if the path from g to the root of T contains only duplications.The set of all upper duplications in a tree T is denoted by UDup T .We say that a valid map- ping F T : V (G|g) → V (S|s) is weakly feasible for (g, s, X) if and only if there is no feasible mapping for (g, s, X), but there is T that extends G|g, T has at least one upper duplication d such that F T (d) / ∈ X and F T (Dup T \UDup T ) ⊆ X .In contrast to feasible map- pings, in weakly feasible mappings we constrain only non-upper duplications present in G|g.Here, the upper duplications are elements of episodes s / ∈ X .This situ- ation is modeled by Unknown value returned from δ ↓ (g, s) and δ(g, s) calls, meaning that there is at least one duplication that needs to be assigned later (if possible) to an episode from X \ V (S|s) , which eventually occurs at levels of recursion shallower than the level of (g, s).
Informally, the meaning of DP formulas can be understood as follows.Below, let T be an extension of the subtree of G rooted at a node g.The value of δ(g, s) is True if there exists T where g is an s-duplication and all duplications are assigned to the episodes from X (where s ∈ X as well).Similarly, the value of σ (g, s) is True if there exists T, where g represents an s-speciation or an g is internal and s ∈ X, (1) δ * (g, s) ∧ Unknown g is internal and s / ∈ X, (2) False otherwise, ǫ(g, s) ∨ Mδ ↓ (g, s ′ ) ∨ Mδ ↓ (g, s ′′ ) s internal, and s ∈ X. ( (10) ǫ(g, s) =σ (g, s) ∨ δ(g, s), (11) s-leaf node.Next, δ(g, s) is Unknown if the condition for δ(g, s) = True is not met.However, there still exists T, where g represents an s-duplication, and all non-upper duplications from T are assigned to the episodes from X, while the upper duplications are assigned to episodes outside of X.It is important to note that in this case, s is not an element of X.Note that σ (g, s) cannot be Unknown since speciation nodes are fixed.Moving on, δ ↓ (g, s) is True if there exists T where all duplications are assigned to the episodes from X. Lastly, δ ↓ (g, s) is Unknown if the condition for δ ↓ (g, s) = True is not met.However, there still exists T, where all non-upper duplications from T are assigned to the episodes from X, while the upper duplications are assigned to episodes outside of X.
While ǫ and δ * should be treated as "local" in the main formulas (i.e., they should not form separate arrays in implementation), their properties can be formulated as follows.Generally, if ǫ(g, s) is True , then there is a tree T where g is mapped into s and all duplications are assigned to the episodes from X. Since σ (g, s) cannot be Unknown , ǫ(g, s) is Unknown only if σ (g, s) is False and δ(g, s) = Unknown .Again, here g is mapped to s.Now, δ * (g, s) is True only if there is T where g is an s-duplica- tion, and all duplication nodes below g are assigned to episodes from X. Importantly, at least one of the children of g must be mapped to s, which is captured by ǫ .Fur- thermore, δ * (g, s) resembles δ(g, s) , but the condition that duplications must be assigned to episodes from X only applies to the duplications (or upper-duplications) below g if δ * (g, s) is True (or Unknown , respectively).
The following Lemma formalizes the conditions described above.
Lemma 1 Given a partial gene tree G over a species tree S and X ⊆ V (S) .Let g ∈ V (G) , s ∈ V (S) .Then, P1 δ(g, s) = True if and only if there is a gene tree T and a feasible mapping F T for (g, s, X) such that g is an s-duplication in T.
P2 δ(g, s) = Unknown if and only if there is a gene tree T and a weakly feasible mapping F T for (g, s, X) such that g is an s-duplication in T. P3 σ (g, s) = True if and only if there a gene tree T and a feasible mapping F T for (g, s, X) such that and g is an s-speciation or an s-leaf in T. P4 For any g and s, σ (g, s) = Unknown.P5 δ ↓ (g, s) is True if and only if there is a feasible map- ping for (g, s, X).P6 δ ↓ (g, s) is Unknown if and only if there is a weakly feasible mapping for (g, s, X).

Proof
The proof is by induction on the structure of G and S. The base of induction is when g ∈ L(G) and s ∈ L(S) , for which all properties are easy to verify.
Inductive assumption: For every x, y such that g ≻ x , and s y or g x and s ≻ y , P1-P6 are satisfied.Inductive hypothesis: For g and s, where at least one of g and s is not a leaf, P1-P6 are satisfied.
Additional notation: since functions are relations, we identify a function f : Note that the resulting mapping is valid for a gene tree (T ′ , T ′′ ) that extends G|g.
We start with several properties.
then there is a weakly feasible F T ′ for (g ′ , s, X) and g ′ is an s-duplication.Here, σ (g ′ , s) cannot be True , thus σ (g ′ , s) = False , by P3 and P4.Therefore, δ(g ′ , s) = Unknown .The rest follows from P2.
We first prove properties P1-P4 for δ and σ .Then, we show that P5 and P6 hold for g and s.
) is True , it follows from the inductive assumption for P1 and P3, that there is a feasible mapping F T ′ for (g ′ , s, X) such that M T ′ (g ′ ) = s .For, the other child, we have δ ↓ (g ′′ , s) = True .From P5, there is a feasible mapping F T ′′ for (g ′′ , s, X) .Now, let T = (T ′ , T ′′ ) .Then, the mapping F T ′ ⊕ F T ′′ is feasible for (g, s, X) and g is an s-duplication in T.
Then, similarly to the pre- vious case g ′ is an s-speciation and there is a feasible F T ′ for (g ′ , s, X) , while, from P6, there is a weakly fea- sible F T ′′ for (g ′′ , s, X) .Then, similarly to the previous case, let T = (T ′ , T ′′ ) .Then, F T ′ ⊕ F T ′′ is weakly feasi- ble mapping for (g, s, X).It remains to analyse the case when ǫ(g ′ , s) = Unknown .By (A2) there is a weakly feasible F T ′ for (g ′ , s, X) and g ′ is an s-duplication.Here, δ ↓ (g ′′ , s) ∈ {True, Unknown} , and depending on the value either, by P5 there is a feasible ( True ) or, by P6, weakly feasible ( Unknown ) F T ′′ for (g ′′ , s, X) .We conclude that F T ′ ⊕ F T ′′ is weakly feasible mapping for (g, s, X).Since g ′ is an s-duplication and s / ∈ X , there is no feasible map- ping for (g, s, X) in this case.
(P2, ⇐ ).Let F T be a weakly feasible mapping for (g, s, X) such that g is an s-duplication in T. Since g is an s-duplication, g is upper and s / ∈ X .Thus, δ(g, s) = δ * (g, s) ∧ Unknown from (2).W.l.o.g.we may assume that M T (g ′ ) = s (note that g is an s-duplication).If g ′ is a speciation or a leaf, then F T |g ′ is feasible for (g ′ , s, X) since no upper duplication is present in Similarly, for the second child of g, the mapping F T |g ′′ is either weakly feasible for (g ′′ , s, X) if g ′′ is upper dupli- cation outside X, or feasible otherwise.By P5 and P6, δ ↓ (g ′′ , s) ≥ Unknown .Finally, δ * (g, s) ≥ Unknown and δ(g, s) = δ * (g, s) ∧ Unknown = Unknown.
(P3, ⇒ ).Let σ (g, s) = True .Note that at least one of g and s is internal by the inductive assumption.Then, both nodes must be internal, from (7).W.l.o.g.we may assume that δ ↓ (g ′ , s ′ ) ∧ δ ↓ (g ′′ , s ′′ ) = True .Thus, from P5, we have two feasible mappings F T ′ for (g ′ , s ′ , X) and F T ′′ for (g ′′ , s ′′ , X) .Let T = (T ′ , T ′′ ) , then g is an s-speciation and (P3, ⇐ ).Assume that is a feasible mapping F T for (g, s, X) such that g is an s-speciation or an s-leaf in T. If g is a leaf, the statement is obvious.Assume that g is an s-speciation, then s is internal and σ (g, s) follows from (7).W.l.o.g.we may assume that M T (g ′ ) � s ′ and M T (g ′′ ) � s ′′ .Thus, (P4) It follows easily from the definition of σ and the operator L.
Note, that σ (g, s) = False , otherwise both g and s must be leaves.Thus, δ(g, s) = True and the feasi- ble mapping for (g, s, X) exists by the already proven P1. (Case P5.2).If s is internal and s / ∈ X , then, by (6), there is a feasible mapping for (g, s, X) from already proven P1 for g and s.Similarly, we have the mapping from P3 if σ (g, s) = True .If δ ↓ (g, s ′ ) = True , then there is a feasible mapping F T for (g, s ′ , X) from the inductive assumption for P5.It should be clear that the mapping obtained from F T by enlarg- ing the codomain from V (S|s ′ ) to V(S|s) is feasible for (g, s, X).The remaining case when δ ↓ (g, s ′′ ) = True is analogous.(Case P5.3).If s is internal and s ∈ X , then, by (5) The proof is the same as above if δ(g, s) = True , σ (g, s) = True , δ ↓ (g, s ′ ) = True or δ ↓ (g, s ′′ ) = True .For the remaining case, let δ ↓ (g, s ′ ) = Unknown , then there is a weakly feasible mapping F T for (g, s ′ , X) from the induc- tive assumption for P6.By placing all upper duplications from F T at episode s and enlarging the codomain from V (S|s ′ ) to V(S|s), we construct a feasible mapping for (g, s, X).The remaining case when δ ↓ (g, s ′′ ) = Unknown is analogous.
(P5, ⇐ ) Assume there is a feasible mapping for (g, s, X). (Case P5.1)If s is a leaf, then g is internal by the inductive assumption.Thus, g is an upper s-duplication in T. Thus, s ∈ X and by already proven P1, δ(g, s) = True , ǫ(g, s) = True and δ ↓ (g, s) = True .(Case P5.2) Assume s is an internal node.If g is an s-duplication, then similarly to the above case, from P1, s ∈ X , and δ(g, s) = True and δ ↓ (g, s) = True using (5).Analogously, we have the same conclusion, if g is an s-speciation (here, s does not have to be in X).For the remaining cases, there is v ≺ s , such that g is either an v-duplication, or an v-speciation.W.l.o.g., we may assume that v � s ′ ≺ s .(Case P5.2.a)If g is an v-speciation, then F T (g) = v and there is no upper duplication in T. Since T|v is a subtree of T |s ′ , F T with a codomain V (T |s ′ ) is feasible for (g, s ′ , X) .From, the inductive assumption for P5, δ ↓ (g, s ′ ) = True and also Mδ ↓ (g, s ′ ) = True (if s ∈ X ).In all cases, δ ↓ (g, s) = True .(Case P5.2.b)If g is an v-duplication, then, from the feasibility of F T , there is at least one node w ∈ X on the path from s to v (there must be a duplication episode for g).Let w be the lowest (i.e., closest to v) node with the property.Now, we have two cases.If w = s , then s ∈ X , and there is no candidate in X for g below s, therefore, there is no the feasible mapping for (g, s ′ , X) .However, there is a weakly feasible F ′ T for (g, s ′ , X) infered from F T by setting Hence, from the inductive assumption P6, δ ↓ (g ′ , s) = Unknown .Next, Mδ ↓ (g ′ , s) = True and δ ↓ (g, s) = True using (5).It remains to analyse the case when, w � s ′ ≺ s .Since, Then, F T with the codomain shrinked to V (S|s ′ ) is feasi- ble for (g, s ′ , X) .From the inductive assumption for P5, δ ↓ (g, s ′ ) = True .Here, it does not matter whether s ∈ X , in both cases we get δ ↓ (g, s) = True from (5) or (6).
(P6, ⇐ ) Assume there is a weakly feasible map- ping F T for (g, s, X). (Case P6.1)If s is a leaf, then g is internal by the inductive assumption.Thus, g is an upper s-duplication in T. Thus, s / ∈ X and by already proven P2, δ(g, s) = Unknown .Since, σ (g, s) = False , ǫ(g, s) = Unknown and δ ↓ (g, s) = Unknown from (4).(Case P6.2) Assume s is internal.Note that g cannot be an v-speciation for any v, otherwise there is no upper duplication in T and F T cannot be weakly feasible.If g is an s-duplication in T, then similarly to the previous case, from P2, s / ∈ X , and δ(g, s) = Unknown .Note that δ ↓ (g, s ′ ) = δ ↓ (g, s ′′ ) = False (since M T (g) = s ).We con- clude that δ ↓ (g, s) = Unknown from (6).For the remain- ing case, there is v ≺ s , such that g is an v-duplication.W.l.o.g., we may assume that v � s ′ ≺ s .From the weak feasibility of F T , no node from the path between s to v (inclusively) is in X.Similarly, to P5.2, we construct weakly feasible for the remaining nodes u from T. By the inductive assumption for P6, δ ↓ (g ′ , s) = Unknown .Next, if v = s ′ , then δ(g, s ′ ) = Unknown (since x / ∈ X ) by P2, otherwise δ(g, s ′ ) = False .Also, δ ↓ (g, s ′′ ) = False .This yields δ(g, s) = Unknown using (6).
To solve Episode Feasibility, we have to apply δ ↓ on the roots of the input trees.
Theorem 2 (Correctness) Given a partial gene tree G over a species tree S and

Theorem 3 (Complexity) Given a partial gene tree G over a species tree S and X ⊆ V (S) . The time and space complexity of solving Episode Feasibility by the dynamic programming algorithm is O(|V(G)||V(S)|).
Proof We have three arrays δ X , δ ↓ X and σ X (note that ǫ and δ * can be directly inserted in their calls), each of size O(|V(G)||V(S)|) and every cell of an array can be computed in O(1) time.
An example of DP execution with a feasible solution is depicted in Fig. 2.

Solving MetaEC for a single partial gene tree
Here we describe the main algorithm to solve MetaEC for instances with a single gene tree.First, we characterize an important property of episodes.
Lemma 4 (Fixed Episodes) Given a partial gene tree G over a species tree S. Assume that there are nodes g in G and s in S such that • if s is the root of S, then at least one proper subtree of G contains species (leaf-labels) from both children of s.
• otherwise, let p be the parent of s, then G|g is a gene tree, g is a p-speciation and a child of g is an s-duplication.
Then, for any G * that extends G, s is an episode in every valid mapping between G * and S.
Proof For the first case, all nodes above the root of the subtree are gene duplications mapped to the root of S in any G * that extends G. Therefore, the root of S is an epi- sode in all valid mappings.In the second case, the duplication child cannot be raised, therefore, its mapping is fixed.
The nodes satisfying the above conditions we call fixed episodes (for G and S).For example, for trees from Fig. 1, there are two fixed episodes: the root of S and the leaf b, where the duplications with fixed mappings are depicted using white marks in the exemplary gene tree G.The set of all fixed episodes can be computed in linear time and space by bottom-up traversal of the partial gene tree G and by using LCA-queries in the species tree S as follows.
For each node g from V(G), the algorithm computes a tuple (u, s, d), where Then, for each g and its tuple (u, s, d), and for each child of g with a tuple (u ′ , s ′ , d ′ ): • if s = s ′ = root(S) , then s (the root of S) is a fixed epi- sode, • if u = d = False , d ′ = True and the parent of s ′ is s, then s ′ is a fixed episode.
We omit correctness and complexity proofs for brevity.Note that the number of fixed episodes is the lower bound of EC(G, S).Algorithm 1 takes as input a partial gene tree G over a species tree S and outputs EC(G, S) .It first computes the set of fixed episodes F (see Lemma 4).The algorithm then starts with an initial maximal episode number b equal to Fig. 2 An example of the dynamic programming algorithm (from Sect.Episode Feasibility Problem) execution.The partial gene tree is G = ((⊥, (b, ((⊥, ⊥), (⊥, ⊥)))), (d, (c, a))) , which contains five unknown labels.The species tree, denoted as S, is represented as ((a, c), (b, d)).The marked nodes in S indicate episode candidates from X: the root of S (abcd) and the leaf node a.By applying dynamic programming, we obtain a feasible solution, depicted in the bottom-right corner.The resulting extension of the partial gene tree G is G * , where the valid mapping between G * and S is the lowest common ancestor (LCA) mapping.In G * , each duplication node is marked with a triangle or a square denoting their corresponding episode in S. Each node in G is decorated with an array that represents the values of DP formulas, where each row corresponds to a node in S, starting from abcd, bd, and so on as indicated in the first column.The next columns have the values of δ , δ ↓ , σ , and ǫ , respectively, for the gene tree node and the corresponding species tree node.For example, considering the root of G and the root of S, the top row of the array contains the following values: δ(root(G), root(S)) = δ ↓ (root(G), root(S)) = ǫ(root(G), root(S)) = True , while σ (root(G), root(S)) = False the number of nodes in S. In each iteration of a while loop, the algorithm checks if there is a set C of size b − |F | − 1 from the vertices of S that are not in F, such that the partial gene tree G is C ∪ F-feasible using the dynamic programming algorithm.The correctness of the algorithm follows from the fact that if there is no set X of size b − 1 such that G is X-fea- sible, then there is no set of any size smaller than b that satisfies the property.Since b represents the number of episodes from some valid mapping, it is also minimal in such a case.Therefore, when the algorithm terminates, b = EC(G, S) , and the algorithm returns the correct value.
The algorithm's worst-case time complexity is Despite the exponential time complexity, in our experiments on both simulated and empirical data, we were able to compute exact solutions after only a few executions of the main loop.

Distributions of gene-species mappings
To evaluate the accuracy of gene-species mappings, we propose a method that enhances the DP algorithm by incorporating formulas for inferring the number of genespecies mappings present in all feasible reconstructed mappings.These counts can be collectively integrated to determine, for any leaf with unknown label, the precise frequency of its mappings to each species leaf within these feasible mappings.An alternative method for approximating these frequencies was suggested in [31] using uniform sampling.However, in this work, we introduce an exact algorithm for this purpose.For a fixed species tree S, and a node g from a partial gene tree G, we call a mapping f : L(G|g) × L(S) → {0, 1, 2, . . .} a counter on L(G|g) if for every leaf l with unknown label, the sum s f (l, s) , denoted #f , does not depend on l.Counters will be used to count how many times a given gene ⊥-leaf is assigned to a species leaf in all feasible mappings.In such a case, #f is the number of all such feasible mappings.The coun- ter fixed to a gene tree ⊥-leaf l and represented by the function f (l, •) , is referred to as the gene-species distribu- tion (of l).Subsequently, we often examine normalized distributions, wherein each value is divided by #f (then s f (l, s) = 1 ).For convenience counters also include other leaves, but their counts will be set to 0. We have the following basic counters: • ∅ A is the zero counter on A, i.e., the counter with #∅ = 0, • for s in L(S), B l,s is a counter on {l} such that B l,s (l, s) = 1 , and B l,s (l, s ′ ) = 0 for all s ′ � = s.
Let ⊕ and ⊗ be commutative operators, where ⊗ has higher precedence that ⊕ , satisfying the following properties.
• If f and g are counters on disjoint sets A and B, respectively, then f ⊗ g is the counter on A ∪ B , such that for every l ∈ A , (f ⊗ g)(l, s) = f (l, s) • #g.
• If f and g are counters on A, then f ⊕ g is the counter on A, such that, for every l ∈ A , (f ⊕ g)(l, s) = f (l, s) + g(l, s).
The empty counter is used when a part of a gene tree has no leaves with unknown labels, while the counter B l,s represents the situation when leaf l is assigned to s.Additionally, we use a special counter E that represents situations where DP formulas return F .The counter is evaluated as follows: E ⊕ f = f and E ⊗ f = E for any counter f including E. By using the notation introduced in the dynamic programming algorithm, we define counters for DP δ #T (g, s) , δ #U (g, s) , δ ↓ #T (g, s) , δ ↓ #U (g, s) , ǫ #T (g, s) , ǫ #U (g, s) and σ # (g, s) to count distributions of gene-species mappings.E.g., δ #T (g, s) is a counter for the leaves from L(G|g) that counts mappings for these leaves for the cases when δ(g, s) is True , and so on.
The following lemma states the crucial property of DP counters.
Lemma 5 (Correctness of counters for DP) Given a partial gene tree G and a species tree S and X ⊆ V (S) .G is X-feasible if and only if the counter δ ↓ #T (root(G), root(S)) is not E and for every leaf l of G with unknown label, δ The proof of the above lemma follows by induction, similar to the proof of correctness of DP.We omit technical details.
An example of gene-species distributions is depicted in Fig. 3. DP counters with verification algorithm are implemented in the software package metaEC.

Extensions
To identify the optimal solution within the main loop, enumerating all possible combinations of size b − f − 1 from the set of episode candidates V (S) \ F may be time- consuming for larger instances.To address this issue, we propose a heuristic approach that randomly samples (e.g., > 1000 ) and adds a stopping condition based on the number of dynamic programming (DP) calls without improvement (e.g., after 100 calls).This approach not only speeds up the algorithm but also provides additional information on whether the returned value is exact or an upper bound obtained by switching to a heuristic mode.See Sect. 4 for more details.Fig. 3 An example of gene-species distributions of for the leaves with unknown label from a partial gene tree with three such leaves and a species tree S. In this case, the optimal number of episodes is 1, and the episode is located at the root of S, marked accordingly.The total number of feasible mappings is 52.The histograms at the leaves of the gene tree depict how many times a specific leaf is mapped to the corresponding species leaf in these feasible mappings.In other words, they represent δ , where l is the gene tree leaf associated with each histogram Fig. 4 Converting a multiple gene tree instance to a single gene tree instance using an outgroup ω .Red bars in G ω denote speciation nodes mapped to the root of S ω .Green squares represent new duplications clustered at a new duplication episode in the root of S ω Furthermore, based on our experiments, we have observed that the solution is often close to the set of fixed episodes.To leverage this observation, we propose a bottom-up algorithm that explores candidate sets starting from sizes 0, 1, 2, and so on until a feasible solution is found.In this case, the internal search has a time complexity of O( n − f i ) , starting from i = 0 .This algo- rithm can be combined with the heuristic variant described earlier to improve its effectiveness.However, the experimental evaluation did not show significant improvement compared to the top-down method in Algorithm 1.

MetaEC in the general case
Here we show that MetaEC in a general case can be solved using a single partial gene tree under an additional assumption.Given a collection of partial gene trees G 1 , G 2 , . . ., G k over a species tree S, let ω be a new species, called outgroup, not present in S. We first add the outgroup to every input tree.Let S ω be species tree Then, by ω-MetaEC we define the problem MetaEC with a single partial gene tree, where the extension of a partial labeling cannot introduce ω , i.e., if 4 for illustration.
We have the following property.
Lemma 6 Given a collection of at least two partial gene trees G 1 , G 2 , . . ., G k over a species tree S such that ω / ∈ L(S) .X ⊆ V (S) is the set of episodes that yields the solution of MetaEC for G 1 , G 2 , . . ., G k and S if and only if X ∪ {root(S ω )} is the set of episodes that yields the solu- tion to the instance G ω k and S ω of ω-MetaEC.
Since the extension of G i introduces only nodes from L(S), we have the same property with (G * i ) ω .Now, every parent of a leaf labeled ω in (G * k ) ω is speciation mapped to the root of S ω , since, for some i, its sibling is a root of G * i and ω / ∈ L(G * i ) .Thus, if i > 1 , the root of (G * i ) ω is a duplication mapped to the root of S ω .In summary, all duplications from gene trees G * i in (G * k ) ω are mapped below the root of S ω , and they are sepa- rated by speciation nodes from the duplications mapped to the root of S ω as indicated in Fig. 4. Now, we define a valid mapping ) .For each i, F ω k on the set of nodes of G i equals the cor- responding valid mapping between V (G * i ) and V(S) that yields the solution to MetaEC, while for the remaining nodes v we set F k (v) = root(S ω ) .It should be clear that F ω k is a valid mapping.Now, it is not difficult to see that the set of duplication episodes in (G * k ) ω is X ∪ root(S ω ) .If there is a solution to ω-MetaEC with a lower number of episodes than |X| + 1 , say obtained by X ′ ∪ {root(S ω )} with |X ′ | < |X| , then the construction can be reversed to obtain valid mappings and the corresponding duplication episodes X ′ for the initial instance of MetaEC.However, this is a contradiction with the assumption that |X| is the solution to the initial instance of MetaEC.(⇐ ) The proof of the second direction is analogous since the transformation between collections of partial gene trees and the partial gene tree G ω k is reversible.We omit easy details.
Note that the algorithms provided in the previous sections can be easily modified to solve ω-MetaEC, by replacing case (8) with: Then, DP will exclude extensions of ⊥ by ω.

Experiments
In this Section we present two computational studies based on simulated and empirical data.

Simulated dataset
The species trees having none, one, or two whole genome duplication events were taken directly from [25].Then, we estimated gene trees via tree inference software from simulated sequences and modified them to represent the uncertainty commonly associated with metagenomic data.Finally, our algorithm was evaluated on six datasets consisting of estimated gene trees.
To begin, we describe how the species and gene trees were generated.Then, we explain the modifications made to the gene trees to represent the uncertainty of metagenomic data.Finally, we show how the results obtained with our algorithms allow us to infer genome-wide duplication events and gene and species distributions.

A species tree
First, we briefly summarize the simulation procedure from [25].The simulated species trees were generated by SimPhy [33] with parameter settings used in a simulated study [34] that was based on an empirical dataset of 16 Fungi species [35].The species tree S of 20 taxa was generated by SimPhy with the speciation rate parameter equal to 1.8 × 10 −9 and the tree height parameter set to 1.8 × 10 9 .
To simulate a whole genome duplication (WGD), a node v in the species tree S was chosen as the location of the event.Subsequently, a modified species tree, denoted as S ′ , was constructed by substituting a subtree S|v with a duplicated version of itself.This duplication involved creating a new root connected to the original root of S|v and the root of its copy.The WGD variants used in the simulations are illustrated in Fig. 5, where S 1 represents a single recent WGD Ŵ , S 2 represents a single ancient WGD , S 3 represents two WGDs and , with occurring after , S 4 represents two close WGDs and ′ at the same branch, and S 5 represents two recent independent WGDs and .Let us refer to the original tree S with no WGD events as S 0 .Note that these simulation methods do not incorporate fractionation, that is, the loss of a gene copy to eliminate the redundancy [36].

Gene trees
For every S i , i ∈ {0, 1, . . ., 5} , one hundred true gene trees were generated using SimPhy.The duplication and loss rate parameter was set to 2 −10 events per genera- tion per lineage.To minimize the effect of incomplete lineage sorting, the population size parameter was set to 10.All other parameters were taken from [34].
Next, we describe our pipeline to infer estimated trees from true trees.For every true tree G, we first simulated a multiple sequence alignment (MSA).
For the MSA simulation, we used INDELible [37] and parameters from [34], with one difference, we used a constant sequence length of 1000.Then, we inferred an unrooted maximum-likelihood tree from each MSA using FastTree [38] with the GTR model.Finally, we performed midpoint-plateau rooting of each unrooted gene tree using URec [39].The rooting was inferred in a way Fig. 5 Summary of the inferred gene-species mappings and duplication episodes on the simulated datasets.Locations of the simulated whole-genome duplication (WGD) events are denoted by Greek letters.For clarity, all leaf labels have been removed from the visualization of species trees (see [25] for details).Each bar in the histograms shows the normalized average p-support of the corresponding species node.The key to histograms is present at the bottom-right corner.The number above a single bar represents the maximum height of a bar in its histogram.A histogram at node v is omitted as insignificant if the normalized average p-support is below 10 for all values of p Fig. 6 Summary of simulated dataset experiments for the estimated trees: histograms of exact and heuristic solutions returned by Algorithm 1 that minimizes the duplication-loss cost between the gene tree and the species tree S.
In summary, we obtained six datasets of estimated rooted gene trees denoted G i = {G i,1 , G i 2 , . . ., G i,100 } , where each dataset comprises 100 gene trees generated using the same species tree S but with a different WGD scenario S i .In our evaluation study, for each dataset G i , we generated a set of partial gene trees Gk,p i by randomly removing each leaf label from every gene tree G i,j in G i with probability p.We considered values of p from 0.0 to 0.6 in increments of 0.1 and generated 100 instances of Gk,p i for each value of p and k = 1, 2, . . ., K , with K = 100 .This resulted in a total of 4200 instances ( Gk,p i , S) of MetaEC, where S is the species tree used to generate G i , and Gk,p i consists of 100 partial gene trees.Note that the instances for p = 0 , correspond to no removal of leaf labels, i.e., Gk,0.0 i = G i , for each i and k.

WGD detection
The results of our algorithm on the estimated partial gene trees are depicted in Fig. 5, where we summarize the episode sizes in the form of histograms.Also, additional data is provided in Figs. 6 and 7.The evaluation took about 24 h of a computing server with 80 cores.In general, we observed that the runtime and the number of DP calls grow linearly with the value of parameter p on average as indicated in the bottom diagram of Fig. 8.We used the heuristic variant of the algorithm, where random sampling was applied if the number of combinations exceeded 100 trees, and with the stopping criterion equal to 50.Out of 4200 instances, 3037 were completed with the exact solution (see Fig. 6).The resulting costs without exact guarantee, were more often obtained for larger values of p. Additionally, we observed that the lower bound given by the number of fixed episodes was a tight approximation of the inferred cost (see Fig. 8).As p increased, the number of fixed episodes decreased.Note that for p = 1 (not included in our analysis), the solution to MetaEC is 1.In such a case the leaf labelings are constant functions and all internal nodes of each gene tree are duplications.Then there is just one episode, which can be placed at the root of the species tree.Now, we briefly summarize the outcome of WGD detection.Initially in the preliminary version of the article [40], we quantified support by counting the number of single gene duplications aggregated within a particular episode.Here, to reflect the contributions of single gene families while mitigating the influence of large gene families with potentially numerous duplications, we count the number of gene trees that contain duplications clustered within the episode.Note that in contrast to [40] where the true trees were used to detect WGD events, here we perform the detection on estimated gene trees.
In the context of the scenario without WGDs ( S 0 ), the episodes can be treated as background noise consisting of single duplication events.Consequently, when subtracting the contribution of background duplication episodes from the duplication clustering results on simulated datasets containing WGDs, we can emphasize the simulated WGD events as significant occurrences.Formally, our analysis is conducted using the following formulas.
Given a dataset ( Gk,p i , S) and a node s in S, the p-sup- port of s (w.r.t.k and i), denoted µ p,s,i,k is the number of gene trees obtained by applying the DP algorithm on trees from the dataset G k,p i , that have at least one duplication in the episode s.Then, the normalized average p-support of a species node s in scenario S i is defined as Recall that K = 100 is the number of estimated gene tree datasets for fixed p and i.In other words, the normalized average support accounts for the difference between the average support values of S i and S 0 .For example, given that every gene tree dataset consists of 100 estimated gene trees, the normalized average p-support of s in the scenario S i close to 100, denotes high support for the duplication episode at s (in S i ).Note also that the normal- ized p-support can be negative due to normalization.
The results for S 1 (as depicted in Fig. 5) reveal that WGD Ŵ is uniquely well-supported when p ≤ 0.4 , with support values exceeding 20.It is noteworthy that the outcomes obtained for S 1 outperform the analysis con- ducted in [40], which utilized true trees, in terms of accurately identifying the simulated WGD event.In summary, the WGD detection outcome depends on the location of a WGD and the value of p.In general, when p increases, the support of WGDs decreases.Recent WGDs like Ŵ , , and are well supported up to p = .3, while more ancient up to p = .2 .For ancient WGDs , , , ′ , the normalized average p-support is low and, for most cases, indistinguishable from non-WGD nodes.However, the root episode is generally well supported in scenarios having ancient WGDs, which suggest such an event close to the root location.

Gene-species distributions
We implemented the algorithm outlined in Sect.Distributions of gene-species mappings to derive counters with gene-species distributions for ⊥-leaves across all gene tree datasets, i.e., for all six species trees S 0 -S 5 , all six different positive values of p (i.e., we inferred 36 × 100 counter collections for each gene tree set).The results frequently yielded remarkably high numbers of feasible mappings, often exceeding 10 1000 .As a remedy, we employed normalization by dividing the values of the gene-species distributions by #f , where f was the relevant counter.
We first analyzed the domain of distributions with positive values.Let d : L(S) → [0, 1] be a gene-species distribution.The span of d is the set of all species leaves with a positive value.Furthermore, if the span of this distribution is the set of leaves originating from a subtree rooted at s within the species tree S, we categorize this distribution as a subtree-spanning distribution.A summary of subtree-spanning and non-subtree-spanning distributions can be found in Table 2.It is noteworthy that spanning distributions occur with high frequency.Furthermore, they often encompass significant portions of the species tree, and the property gets stronger with the increase of p. This, however, is not the desired characteristic, as the most sought-after distributions are the one-point distributions where a single species leaf has the maximum value of 1. Subsequently, we conducted an analysis to determine the extent to which these distributions deviate from uniformity.Detailed insights can be found in Fig. 9.In this figure, we present a computed coefficient of variation (CV) for each distribution, which is the standard deviation normalized by the mean.Lower CV values signify a closer resemblance to a uniform distribution.The majority of histograms display a tall blue bar indicating low CV values.This suggests that the distributions tended to exhibit a significant degree of near-uniformity in nearly all simulated datasets.
In summary, our study reveals that the species presented in the reconstructed gene leaf mappings extend across a substantial portion of the species tree.Additionally, the frequency distribution across all feasible mappings typically exhibits an almost uniform shape.As a result, identifying the true gene-species mapping signal within the leaf mapping proves challenging with the current approach.This challenge is, in part, attributed to the combinatorial explosion in the number of feasible mappings, consequently leading to flattened distributions of gene-species mappings based on leaf counts.

Empirical evaluation
To ensure that our algorithm was properly tested, we required a dataset that would capture the characteristics of the metagenomic data as closely as possible, while allowing us to assess the quality and accuracy of the results obtained.For this reason, we decided to prepare a dataset consisting of gene trees for species identified during metagenomic analysis.To simulate unknown gene-species assignments, we artificially removed some of the gene labels from the gene trees and retained information about their taxonomic origin for further analysis of the results.Another important issue was the presence of a previously described whole-genome duplication event that occurred in the evolutionary tree of selected species.Given the above requirements, we decided to use proteomes belonging to yeast species identified during metagenomic analysis of kefir [41].Note that a direct comparison of our results with alternative methods is not possible due to the absence of any existing approach for the simultaneous inference of duplications and gene-species mappings.consistent with the NCBI taxonomy and many papers on yeast evolution, is shown in Fig. 10.It also shows the location of the whole-genome duplication event confirmed by previous studies [42,43].
The proteomes used to infer gene trees were sourced from the UniProt database [44].Protein families were created by dividing the proteins into groups using the mcl program [45] with parameters I = 2 and I = 5 .How- ever, since the differences between the obtained sets were minimal, we used the set obtained for I = 2 in subse- quent steps.The protein sequences in each group were aligned with the MUSCLE algorithm [46], and unrooted gene trees were inferred using the phyml program [47] with the default parameter setting.Rooting of the gene trees was performed by URec program [39] using the minimal duplication-loss cost as the rooting criterion.
We removed trees containing fewer than 3 leaves or 3 species, as well as trees with edges of length 0 from the

Results
Figure 10 depicts histograms showing the results obtained for the described dataset.The evaluation was performed on the same computing server as before and took approximately one hour.For this evaluation, we set the sampling threshold and stopping criterion to 50, which yielded exact solutions for all cases.The number of fixed episodes was consistent across datasets with p < 0.6 , at 12 (note that the number of leaves in a tree was 15).For p = 0.6 , the number of fixed episodes fell within the range of 9 to 12.The number of DP calls ranged from 2 to 3 for p < 0.6 and between 2 and 11 for p = 0.6.
The results obtained by the algorithm for the yeast dataset are consistent with our knowledge of the wholegenome duplication localization.For the dataset with all leaf labels present and for p = 0.1 , we have the highest support for the WGD event.The number of supporting single duplications decreases gradually for successive p ′ s .For values of p ≥ 0.5 , the correct WGD localiza- tion is still supported by a significantly large number of single duplications.It is worth noting that even for the p = 0.6 , the right location is supported by three times as many duplications as the second most supported location, which is in the root.Additionally, we observed an increase in duplications at the leaves of the species tree as p increased.Since most of the leaves are fixed episodes, the algorithm often assigned labels to create duplications at the leaves, resulting in larger sizes of episodes at leaves.

Conclusions and future outlook
In this article, we presented a novel problem that integrates gene-species mapping inference and genomic duplication detection.We proposed efficient algorithms to solve the problem exactly in the majority of instances, along with a heuristic modification for cases where exact solutions are not feasible.To demonstrate the effectiveness and accuracy of our proposed algorithm, we conducted computational experiments on both simulated and empirical data.While there is presently no established method for the simultaneous inference of duplications and gene-species mappings, the results showed that our algorithm was able to accurately infer recent WGD events when the number of missing labels was relatively small for simulated data.Moreover, the algorithm performed even better on empirical data, demonstrating its robustness and applicability to real-world scenarios.Nevertheless, our findings regarding gene-species mapping inference underscore the challenging nature of the problem with the current approach.Inferring true gene leaf labels proves difficult due to the combinatorial explosion of potential solutions and the resulting nearly-uniform distributions of gene-species mappings, which extend across substantial portions of the species tree.
To maximize topological similarities between a gene tree and its species tree, speciation nodes should more frequently appear in the resulting extensions of input partial gene trees.We observe that the optimization model tends to reconstruct leaf labels in a way that prioritizes duplication events assigned to the nearest fixed episodes or the root, in the absence of such episodes.This is confirmed by the property that fixed episodes are tight approximations of the EC cost, leading to a reduction in the number of speciation events in the final gene Fig. 10 Summary of gene-species mappings and duplication episodes inference for the yeast dataset consisting of 3430 gene trees.Each bar with confidence levels represents the average percentage of the number of gene trees participating in the given duplication episodes at samples with value p from 0.0 to 0.6.WGD denotes the whole genome duplication event postulated in [42,43].For the description of symbols refer to Fig. 5 tree extensions.As a consequence, the model's effectiveness may be limited in some cases when the number of unknown labels in partial gene trees is significant.
Several avenues for future research exist.For instance, the simulation pipeline could be extended to model fractionation [36,48], offering a more detailed understanding of the biological processes involved.Alternatively, tools for segmental duplications [49] might provide a more accurate representation of gene duplication processes.Also, we plan to extend the analyzed model to strengthen the importance of the topological similarities between gene and species trees.Alternatively, one may limit the distance between the lca-mapping of a gene duplication and its destination mapping in the final scenario similarly to [25].Additionally, there are models of genomic duplications providing a higher level of detail than EC , such as minimum episodes (ME) [20] and RMP [23], which can be adapted in a similar way to infer gene-species mappings and minimize the number of duplication episodes simultaneously.These models can be further combined with more general models of valid mappings, which allow the introduction of more duplication events than the minimum obtained by the lca-mapping [4].The combination of these models can provide a more comprehensive approach to inferring gene-species mappings and identifying the minimum number of duplication episodes.
for any speciation node a (fixed spe- ciations), • F G (a) M G (a) for any duplication node a (duplica- tion can be raised), • F G (a) ≺ M G (b) for any speciation node b such that a ≺ b (fixed number of gene duplications).

Algorithm 1
This step requires |V (S)| − |F | b − |F | − 1 calls of DP in the worst case.If such a set C exists, the algorithm computes EC(G * , S) by the linear time algorithm from [20], where G * is the gene tree obtained by backtracking from the corresponding call of DP and updates b with the result.Note that b is not assigned the value of |C ∪ F | , since the minimal set of epi- sodes for G * and S is a subset of C ∪ F , and it is often sig- nificantly smaller than C ∪ F in early steps of iteration.Updating b with EC(G * , S) guarantees the minimal num- ber of episodes, where some elements of C may be unused.This is an important optimization step.If such a set C does not exist, the algorithm terminates and returns the current value of b.Solution to MetaEC with a single gene tree where f is the size of the set of fixed episodes ( f = |F | ), n denotes the number of vertices in S, and m denotes the number of vertices in G.

Fig. 7 Fig. 8
Fig. 7 Summary of simulated dataset experiments for the estimated trees: EC cost and the number of fixed episodes

Table 1
Boolean operations in Three-Valued-LogicHere, F = False , U = Unknown , and T = True Górecki et al.
False} is True if and only if there is a leaf with unknown label reachable from g, • s ∈ V (S) ∪ {None} is the least common ancestor of all non-⊥ labels reachable from g in S and None if only ⊥ 's are visible from g, • and d ∈ {True, False} is True if and only if u = False and g is a duplication node in a gene tree G|g.

Table 2
Summary of spanning and no-spanning distributions of gene-species mappings for the simulated dataset set of trees.This resulted in 3430 rooted gene trees.Similar to the first experiment, we created 10 datasets for each p ∈ {0.1, 0.2, . . ., 0.6} by randomly removing each leaf label with the probability p.This resulted in 60 datasets plus the original dataset representing p = 0. final