From pairs of most similar sequences to phylogenetic best matches

Background Many of the commonly used methods for orthology detection start from mutually most similar pairs of genes (reciprocal best hits) as an approximation for evolutionary most closely related pairs of genes (reciprocal best matches). This approximation of best matches by best hits becomes exact for ultrametric dissimilarities, i.e., under the Molecular Clock Hypothesis. It fails, however, whenever there are large lineage specific rate variations among paralogous genes. In practice, this introduces a high level of noise into the input data for best-hit-based orthology detection methods. Results If additive distances between genes are known, then evolutionary most closely related pairs can be identified by considering certain quartets of genes provided that in each quartet the outgroup relative to the remaining three genes is known. A priori knowledge of underlying species phylogeny greatly facilitates the identification of the required outgroup. Although the workflow remains a heuristic since the correct outgroup cannot be determined reliably in all cases, simulations with lineage specific biases and rate asymmetries show that nearly perfect results can be achieved. In a realistic setting, where distances data have to be estimated from sequence data and hence are noisy, it is still possible to obtain highly accurate sets of best matches. Conclusion Improvements of tree-free orthology assessment methods can be expected from a combination of the accurate inference of best matches reported here and recent mathematical advances in the understanding of (reciprocal) best match graphs and orthology relations. Availability Accompanying software is available at https://github.com/david-schaller/AsymmeTree.


Background
The distinction of orthologous and paralogous pairs of genes, respectively, is of key importance in evolutionary biology as well as genome annotation. As defined by Walter Fitch [1,2], two genes are orthologs if their last common ancestor (in the gene tree) corresponds to a speciation event, and they are paralogs if they arose through a duplication event. In general, orthologs are expected to have the same function in different organisms, while the functions of paralogs are usually similar but clearly distinct [3,4].
A large class of computational approaches to orthology assessment [5,6] uses symmetric best matches (SBM) [7], also known as bidirectional best hits (BBH) [8], reciprocal best hits (RBH) [9], or reciprocal smallest distance (RSD) [10]. The intuitive justification for these approaches is that symmetric best matches (in the sense of sequence similarity) approximate the idea of evolutionarily closest relatedness. These two concepts are not the same, however. The notion of evolutionary relatedness depends

Algorithms for Molecular Biology
Page 2 of 20 Stadler et al. Algorithms Mol Biol (2020) 15 :5 on the underlying phylogenetic tree T and is naturally expressed by comparing last common ancestors: a gene x is more closely related to a gene y than to y ′ if the last common ancestor lca (x, y) is a successor of lca (x, y ′ ) in T.
From an evolutionary point of view, therefore, one is interested in reciprocal best matches (defined in terms of the gene tree T) rather than in reciprocal best hits (defined in terms of some distance of similarity measure between sequences). Best matches and best hits are equivalent if and only if the Molecular Clock Hypothesis is satisfied [11,12]. In general this is not the case. In particular, paralogous members of a gene family often differ in their evolutionary rates due to (adaptive) changes in the function [13,14]. Both the "Duplication-Degeneration-Complementation" (DDC) model [15] and the "Escape from Adaptive Conflict" (EAC) model [16] predict that the fate of paralogs, including their evolutionary rate, may differ substantially between lineages that diverge soon after the duplication event due to different selective pressures. The simplest case is shown in Fig. 1: an ancestral gene is duplicated before the speciation event leading to two species (indicated by colors), each containing two paralogs (denoted by x and x ′ in the red species and y and y ′ in the blue species). The two paralogs evolve with very different rates in the two species. Although x and y as well x ′ and y ′ are orthologs, the evolutionary rates are more similar between x and y ′ , and x ′ and y, respectively. This situation is not at all uncommon. The asymmetric divergence of the genes in the HOXA cluster following the teleost-specific (3R) genome duplication may serve as a paradigmatic example [17]. While in fugu (Takifugu rubripes) and other percomorphs the HOXAb paralogs diverge faster, it is the HOXA13b paralog that evolves at a faster rate in zebrafish (Danio rerio), which diverged early from percomorphs within the Teleostei clade.
The situation as observed in the HOXA cluster is shown in Fig. 1. Here, the pair x, y ′ shows the smallest evolutionary distance and hence will appear as reciprocal best hit, while the closest evolutionary relative of x is the gene y. This discrepancy is not a consequence of inaccurate measurements but an intrinsic feature of the evolutionary process: more evolutionary events have accumulated on the path from x to y than on the path from x to y ′ . The correct reciprocal best hit therefore does not coincide with the correct reciprocal best match. This immediately begs the question whether such cases can be detected from sequence comparisons. We consider this issue at two levels: (i) Can (reciprocal) best matches be identified in principle, i.e., from perfectly accurate data, and (ii) how well can this be done in practice? To address the first question we will assume that we can determine an additive distance between any two genes and investigate the consequences of this assumption. To investigate the accuracy that can be achieved from sequence data we will devise a simulation system to generate evolutionary scenarios with complex rate variations.
The focus on additive metrics is motivated by the close connection between additive metrics and evolutionary trees. More precisely, an additive metric determines a unique unrooted phylogenetic tree T as well as its branch length [18,19], and vice versa. The determination of best matches, which are defined in terms of last common ancestors, however, requires a rooted phylogenetic tree T. From a theoretical point of view, therefore, the missing information is the placement of the root of T in the underlying unrooted phylogenetic tree T .
The problem of determining the position of the root in an unrooted tree T has been well studied in the phylogenetic literature [20]. The most common approach is the inclusion of an outgroup, i.e., a taxon z known to branch earlier than the taxa of interest. The root is then located in the branch leading to z. Outgroup rooting can be unreliable in the presence of rapid radiations or when only very distant outgroups are available [21,22]. The simplest method is midpoint rooting [23], which places the root at the midpoint on the longest path in the tree. Despite its simplicity it often works remarkably well [24]. An interesting variation on this theme is minimum variance rooting [25]. The estimation of dated phylogenies using a relaxed clock assumption yields an estimate for the position of the root as a by-product [26]. A related Bayesian method was introduced in [27]. In a phylogenomics setting, the root of the species tree can also be obtained by minimizing the number of inferred gene duplications [28]. Most recently, non-reversible substitution models have been employed for estimating rooted phylogenetic trees [29,30].
From a practical point of view, furthermore, we wish to avoid the explicit construction of a (rooted or unrooted) gene tree T since reconstructing accurate evolutionary trees from individual gene sequences is a notoriously difficult problem. Instead we aim to stay as close as possible y' x' y x Fig. 1 Lineage-specific rate variation between paralogs. The gene tree, with branch length indicating an additive evolutionary distances, pre-dates the speciation ( • ) of the red and blue species. We have lca (x, y) ≺ lca (x, y ′ ) but d(x, y ′ ) < d(x, y) Stadler et al. Algorithms Mol Biol (2020) 15:5 to the idea of reciprocal best hit methods and thus we will attempt to use only "local" comparisons of as few as possible measurements of evolutionary distances. This idea naturally leads us to considering quartets, i.e., unrooted trees describing four taxa, and the corresponding rooted triples. It is well known that the rooted triples are sufficient to determine the rooted tree in which they reside. Moreover, there is a polynomial-time algorithm that either constructs a rooted tree T from a set of rooted triples or determines that no such tree exists [31]. By Buneman's Theorem [18,19], an unrooted tree can be uniquely recovered from all its quartets. However, the problem of determining whether a given set is compatible (i.e., whether there is an unrooted tree T that contains all quartets) is NP-complete [32], a fact that reinforces the desire to avoid the explicit reconstruction of T . Nevertheless, these classical results ensure that the relevant information is contained in quartets. More directly, we will show in this contribution that if we can reliably determine a suitable outgroup, best matches can be extracted from a small set of quartets.
Although much of this work is based on the assumption that an additive distance between taxa is available, one has to keep in mind that additive evolutionary distances, like divergence times, cannot be measured directly. While it is common practice to determine a dissimilarity d ′ (x, y) of two taxa (genes) x and y from pairwise alignments, d ′ is a systematic under-estimate of the number of events d due to back-mutations, and thus not additive. In practice, the conversion of measurements of d ′ into an additive distance d that quantifies the number of evolutionary events is based on a Markov model of the evolutionary process. For sequence data, this may be the Jukes-Cantor model [33] or one of its more elaborate variants [34][35][36]. In the most benign setting, d and d ′ are related by a monotone transformation, which in particular implies that the measured distances d ′ correctly identify the best hits. It can also be shown, however, that non-additive distances in general cannot identify the correct topology of quartets [37]. Hence, we have no hope of computing correct best matches directly from non-additive (dis)similarities.
This contribution is organized as follows: in the following section we give a rigorous mathematical rendering of the background outlined above and use it to show that, given an additive distance measure, it is indeed possible to perfectly identify all best matches of a gene x of species s among its homologs {y 1 , . . . , y k } in species t provided a suitable outgroup z can be found for every set {x, y i , y j , z} of four genes. As a consequence, the practical problem becomes the reliable identification of correct "relative outgroups". Assuming knowledge on the phylogeny of species from which the genes are taken, we proceed to derive several conditions under which z cannot be a correct choice and use these insights to devise a heuristic approach that works nearly perfect given additive distance data. We then introduce (in "Methods" section) a simulation environment for generating gene family histories with complex rate variations and show that it is possible to recover best matches more accurately than approximating them by best hits.

Notation and basic definitions
Let T be a phylogenetic (gene) tree with leaf set L. For each gene x ∈ L we denote by σ (x) the species within which it resides. We write L[s] = {y ∈ L | σ (y) = s} for the set of genes in species s. For a leaf set L ′ ⊆ L we define the rooted tree T [L ′ ] as the tree obtained from T by retaining only the vertices and edges along paths from the root to a leaf in L ′ and suppressing all vertices with degree 2. The vertex set of a rooted tree T is endowed with a partial order ≺ such that x y whenever y lies along the unique path connecting x and the root ρ T . Thus the leaves are the minimal elements w.r.t. ≺ . Furthermore, for A ⊆ L we define the last common ancestor lca (A) = min{z | x � z for all x ∈ A} , where the minimum is taken w.r.t. the partial order ≺ . Moreover, if A = {x, y} contains only two elements, we write lca (x, y) instead of lca ({x, y}) . For every u ∈ V (T ) , we denote by T(u) the subtree of T rooted at u.
Consider a gene x in species s. Among all genes in species t = s , the best matches of x are all those genes y in species t that have the lowest lca (x, y) . These y are the closest relatives of x in species t. This concept is made precise in Definition 1 [38] Let T be a phylogenetic tree with leaf set L (denoting genes) and σ : L → S identifying the species σ (x) ∈ S in which a gene x resides. Then y ∈ L is a best match of x ∈ L , in symbols x → y , if lca (x, y) � lca (x, y ′ ) holds for all leaves y ′ from species σ (y ′ ) = σ (y).
The best match relation → is reflexive (since lca (x, x) = x ), but it is neither transitive nor symmetric. Its mathematical properties are discussed in detail in [38,39]. In particular, all orthologs of x are among its best matches [40].
The evolutionary relatedness of two taxa x and y is most directly expressed by the divergence time τ (x, y) , which is the total time elapsed in both lineages since the last common ancestor of x and y. Here, we consider only the case that all leaves refer to extant genes or taxa, i.e., τ (x, y) = 2τ ( lca (x, y)) , where τ is the age of lca (x, y) . Divergence times are ultrametric by definition.
Furthermore, there is a well-known one-to-one correspondence between isomorphism classes of dated, rooted, phylogenetic trees and ultrametrics, cf. [41,42]. The best match relation → can thus also be defined in terms of divergence time: x → y if and only if The distinction between best hits and best matches thus is simply the distance function: best matches require divergence times, while best hits use one of several (dis) similarity measures for sequence data. They are equivalent under the Molecular Clock Hypothesis, which however fails for most real life data sets.

Reconciliation of gene tree and species tree
Since genes evolve as part of species, we can expect that a priori knowlege of the species phylogeny can be helpful for understanding the phylogeny of a gene family. This link is made precise by considering the embedding of a gene tree T into a species tree S.
As it is possible that gene duplications and losses predate the first speciation event, we model the species tree S with leaf set S as a planted tree, i.e., we introduce a vertex 0 S that is called the planted root of S and has the "conventional" root ρ S = lca (L) as its single child. Using this construction, the embedding of the gene tree into the species tree is formalized by the reconciliation map µ : V (T ) → V (S) ∪ E(S) , which maps duplications to the edges of S and speciations to the inner vertices V 0 (S) of the species tree. We follow here the notation of [40]. Restricting ourselves to duplication/loss scenarios, i.e., disregarding horizontal gene transfer, the reconciliation map satisfies the root constraint (R0) µ(x) = 0 S if and only if x = 0 T ; the leaf constraint (R1) µ(x) = σ (x) for x ∈ L(T ) , the ancestor preservation (R2), i.e., x ≺ T y =⇒ µ(x) � S µ(y) , and the following two speciation constraints for all speciation vertices µ(x) ∈ V 0 (S) : [40]. Equivalent axiom systems are considered e.g. in [43][44][45]. Such reconciliation maps satisfy i.e, an event x ∈ V (T ) in the gene tree cannot be mapped to a node in the species tree below the last common ancestor of all the species In this contribution we assume that µ in addition satisfies (R4): If µ( lca T (x, y)) = µ( lca T (x, z)) ∈ V 0 (S) , then lca S (σ (x), σ (y)) = lca S (σ (x), σ (z)) . In essence, (R4) (1) y ∈ arg min τ x, y ′ y ′ ∈ L σ y (2) µ(x) S lca S (σ (L(T (x)))), ensures that a single node in T cannot represent two distinct speciation events, i.e., that the gene tree T is not "less resolved" than the species tree S into which it is embedded.
The reconciliation map µ defines event labels on the inner nodes of the gene tree T, identifying u as a duplication node if µ(u) ∈ E(S) and as speciation if µ(u) ∈ V (S) . While it is possibly to find a reconciliation map µ for every pair of gene and species tree [46], this is no longer true when event labels on T are given [45,47]. Conversely, T and S imply constraints on the event labels, identifying nodes that have to be duplications under any reconciliation map [40]. Here, we characterize these nodes further. We start from the following technical results: Let us first consider the case of binary gene trees: Lemma 3 Let T be a binary gene tree, S be a species tree, and µ : V (T ) → V (S) ∪ E(S) be a reconciliation map without horizontal gene transfer that does not necessarily satisfy (R4). Let x, y ∈ L(T ) be two genes with σ (x) = σ (y) . If lca S (σ (x), σ (y)) ≺ µ( lca T (x, y)) , then lca T (x, y) is a duplication event.
The assumption that T is binary is necessary here as the example in Fig. 2 shows. Such reconciliations, however, cannot be meaningfully interpreted in terms of evolutionary events. Instead, the root of T confounds the duplication leading to x and y and the speciation separating lca S (σ (x), σ (y)) from σ (z) . To suppress such undesirable cases, we in addition require that µ satisfies axiom (R4). In essence, (R4) forbids to map two distinct speciation events to the same vertex of S. If lca S (σ (x), σ (y)) ≺ µ( lca T (x, y)) , then lca T (x, y) is a duplication event.

Trees and (dis)similarities
Neither the divergence times nor the lca function of the phylogenetic tree T can be measured directly. The nextbest choice is to work with an evolutionary distance, which measures the number of evolutionary events that have taken place to separate two taxa. For each edge e = uv in T it is given by is the rate of evolution. In general µ e (t) depends both on the lineage, and thus the individual edges in T, as well as on the exact point in time along e. It associates with each edge e a measure ℓ(e) of changes incurred, and thus an additive distance. If µ e (t) = µ 0 is constant, we simply have d ℓ,T (x, y) = µ 0 τ (x, y) . This is the well-known Molecular Clock Hypothesis [11,12].
In general, we consider ℓ : E(T ) → R + simply as an assignment of positive lengths to the edges of T, which we interpret as a measure proportional to the number of evolutionary events. This gives rise to a metric distance function d ℓ,T (x, y) on L defined as the sum of the lengths ℓ(e) of the edges e along the unique path connecting x and y in T. From T we obtain an associated unrooted tree T by (i) omitting the planted root 0 T and its incident edge, and (ii), in case the root ρ in T has exactly two children u 1 and u 2 , by replacing the path u 1 ρu 2 by a single edge u 1 u 2 with length ℓ(u 1 u 2 ) := ℓ(u 1 ρ) + ℓ(ρu 2 ) . Note that the dissimilarity function ℓ is by construction the same on T and T . Thus T determines T up to the position of the root, i.e., T is obtained from T by inserting the root into an edge of T or declaring an inner vertex of T as the root. As for rooted trees, we define the restriction T [L ′ ] for some subset L ′ ⊆ L by retaining only the vertices and edges along the paths between pairs of vertices in A dissimilarity d on L is called additive if there is an unrooted tree T with edge lengths ℓ such that d = d ℓ,T . A key result in mathematical phylogenetics [18,19] characterizes additive (pseudo)metrics as those that satisfy the four point condition. It states that d is additive if and only if the restriction of d to each subset L ′ of L with |L ′ | = 4 , usually called a quartet, is additive and thus determines a tree on four leaves. Furthermore, the unrooted tree T is uniquely defined by d. In principle, therefore, distance data completely determines a phylogenetic tree up to the position of the root.
The results of [18,19] furthermore imply that T can be expressed in terms of its four-taxa subtrees. This provides us with a natural possibility to consider only "local" topologies instead of having to construct the unrooted tree T explicitly. To this end, we consider the restrictions T [p, q, r, s] of T to four distinct leaves p, q, r, s ∈ L and define the quartet relation [48,49] (pq|rs) if there is an edge e in T , and thus in T [p, q, r, s] , such that {p, q} and {r, s} are in different connected components of the forest obtained by removing e from T or T [p, q, r, s] . Equivalently, we have [48,49] For additive metrics, the two distance sums on the second line are equal [18,19]. All three terms are equal if and only if the four points form a star, whence the existence of a separating edge requires the strict inequality. By a slight abuse of notation we write T [p, q, r, s] = (pq|rs) if Eq. (3) holds, and T [p, q, r, s] = × if no quartet exists on these four leaves, i.e., if T [p, q, r, s] is the star tree.

From quartets to rooted triples
In a planted phylogenetic tree T with leaf set L ∪ {0 T } all inner vertices have degree at least 3. The special leaf 0 T identifies the ancestral state. Its only neighbor is the root ρ T . It will sometimes be useful to consider T(u) as planted tree by including the unique parent v of u and the edge vu. The leaf set of T(u) will be denoted by L(T(u)). The most common method to specify the root of a phylogenetic tree is the use of so-called outgroups, that is, additional taxa that are known a priori to be outside a monophyletic group of interest. Given a planted (or rooted) phylogenetic tree, on the other hand, monophyletic groups are the leaf sets of a subtree, i.e., L ′ is a monophyletic group if and only if there is a vertex Every edge in an unrooted tree T defines a split L ′ |L ′′ of L, where L ′ and L ′′ are the leaves in the connected components of . At most one of the two subtrees T ′ and T ′′ contains the root of the underlying phylogenetic tree T. If the root is not contained in T ′ , then the tree T ′ ∪ {e} planted at the endpoint of e describes a monophyletic group. In this case all x ∈ L ′′ are outgroups for T ′ . Which subtrees of T correspond to monophyletic groups is determined by the position of the root, and therefore requires external information.
It will be convenient in the following to define outgroups not only for monophyletic groups.

Definition 6
For a phylogenetic tree T with leaf set L, consider a subset L ′ ⊆ L and a leaf z ∈ L \ L ′ . We say that z is an outgroup for L ′ if lca (L ′ ) ≺ lca (L ′ , z).
Let us now return to the quartets of T . The following simple result, illustrated in Fig. 3, shows that quartets can be used to infer inequalities between lca vertices in T provided one of the four leaves is known to be an outgroup for the other three: Proof Since z is an outgroup by assumption, there are only three possible fully resolved rooted tree with L = {x, y ′ , y ′′ , z} , see Fig. 3. Each of these trees corresponds to a unique quadruple (annotated at the top). The relationship between lca (x, y ′ ) and lca (x, y ′′ ) is determined by the tree topology. The statement follows by inspecting the three cases.
If T [x, y ′ , y ′′ , z] is not fully resolved, no quartet is defined on {x, y ′ , y ′′ , z} , i.e., T is the star tree and thus Lemma 7 together with Obs. 8 implies that quartets with known outgroups can be used to identify best matches. More precisely, in order to determine the set {y ∈ L[s] | x → y} it suffices to consider leaf sets {x, y ′ , y ′′ , z} with y ′ , y ′′ ∈ L[s] such that z is an outgroup for {x, y ′ , y ′′ } . By Lemma 7, any set of this type implies an (in)equality between lca (x, y ′ ) and lca (x, y ′′ ) . It may not be necessary to consider all quartets. To explore ways to reduce the computational efforts, let us assume that for given x ∈ L and s ∈ S , s = σ (x) , we can identify sets Y ⊆ L[s] and Z ⊆ L such that the following three assumptions are satisfied: (A0) The noise in the data is small enough so that for any four taxa {x, y ′ , y ′′ , z} with y ′ , y ′′ ∈ Y and z ∈ Z one of the three possible quartets or the star topology is inferred correctly. (A1) The candidate set Y ⊆ L[s] contains all best matches of x in species s (but usually also additional leaves). (A2) Z is a non-empty set of outgroups for Y ∪ {x}.
Before we proceed, let us consider these three assumptions in some more detail. (A0) is satisfied by construction for additive distance data. In reallife applications it is often possible to obtain at least a very good approximation using explicit models of sequence evolution. In addition, several computational approaches have been proposed to estimate the quartet relation directly from sequence data. It is also worth noting that (A0) does not require precise distance data, it only asks for correct categorical data on the quartet relation.
Condition (A1) can always be enforced by setting Y = L[s] . We make this assumption explicit because in practice it will be desirable to work with small subsets Y ⊆ L[s] as using L[s] may be too expensive for large gene families. The inclusion of very distant relatives may be problematic for the construction of good multiple sequence alignments and thus the extraction of the quartet relation. Furthermore, it may be difficult to find suitable outgroup data in this case. Thus we will limit Y to a manageable size and sufficient sequence similarity. In ProteinOrtho [50], for example, Y ⊆ L[s] is defined as the set sequence with blast bit-scores exceeding a certain fraction of the best hit for x in species s.
Condition (A2), i.e., the knowledge of appropriate outgroups, is the only problematic assumption. As discussed above, distance-based methods by construction do not convey information on the root of the phylogenetic tree T but only determine its unrooted version T . As a consequence, additional information, not contained in the pairwise distance measurements, is necessary to determine the edge in T that harbors the position of the root ρ of T [51]. In general, Z will be chosen from one or more species that are outgroups to σ (x) and s in S. Even if outgroup species are given, gene duplications may pre-date the divergence of the available species set, so that a given data set will usually violate (A2) for some pairs of leaves. We will return to these issues in more detail in the following sections.
The discussion so far suggests to use the quadruple relation for sets of the form {x, y ′ , y ′ , z} with y ′ , y ′′ ∈ Y and z ∈ Z to determine the best matches of x in the species containing the homolog set Y. The procedure is summarized in Alg. 1. The main result of this section establishes its correctness. assumption (A0) are all inferred correctly. Therefore, lines 4-7 compute all quartets correctly, and thus the inequality between lca (x, y 1 ) and lca (x, y 2 ) is inferred correctly. The auxiliary graphs Ŵ therefore contains at least one arc between any two vertices y ′ , y ′′ ∈ Y and both the arc (y ′ , y ′′ ) and (y ′′ , y ′ ) if and only if lca (x, y ′ ) = lca (x, y ′′ ) , i.e., the strongly connected components are cliques.
Since the lca (x, y) are interior vertices of T that are totally ordered along the path from x to the root of T (Observation 8), there is a unique strongly connected component B in Ŵ that has no out-edges, whose vertices are those y ∈ B for which lca (x, y) is minimal. Thus B is the set of best matches of x with color s.
Algorithm 1 therefore works correctly at least under idealized assumptions. It also serves as a heuristic in cases where one of the assumptions (usually (A2)) is violated.

Identification of outgroups
In many practical applications, the phylogenetic relationships between the species under consideration are known. We therefore investigate here to what extent knowledge of the species tree S can help to identify good outgroup sets Z. Ideally, the genes chosen as outgroups Z are coorthologs of the focal gene set Y, i.e., the duplication event that produced y ′ and y ′′ occured after the speciation event that separates σ (z) for all z ∈ Z from σ (X) and σ (Y ) . As we shall see, it is not possible to identify outgroups with complete certainty. It is possible, however, to identify incorrect choices in many situations.
In the following we consider three species σ (X) , σ (Y ) , and σ (z) for z ∈ Z such that i.e., σ (z) is an outgroup in the species tree for σ (X) and σ (Y ) . Problematic cases in which quartets are interpreted incorrectly may appear whenever the duplication event lca T (y ′ , y ′′ ) separating two paralogs y ′ , y ′′ ∈ Y pre-dates the speciation event separating σ (z) from lca S (σ (X), σ (Y )) . We capture this situation in Definition 10 Let u be an inner node of the species tree S, let y ′ , y ′′ ∈ Y be paralogs in a species σ (Y ) ∈ L(S(u)) . Then lca T (y ′ , y ′′ ) is an ancient duplication relative to u ∈ V (S) for the reconciliation map µ : Clearly, if lca T (y ′ , y ′′ ) is an ancient duplication relative to lca S (σ (X), σ (Y ), σ (z)) , then genes in z ∈ Z are bad choices as outgroups {x, y ′ , y ′′ , z} . The difficulty is that we do not know the reconciliation map µ in our setting. In some cases, however, it is possible to identify (4) lca S (σ (X), σ (Y )) ≺ S lca S (σ (X), σ (Y ), σ (z)), vertices in T that are ancient duplications relative to some speciation for any reconciliation. Such cases can then be avoided.
Before we investigate possibilities to identify some ancient duplications in distance data, we prove a rather technical result that shows that in cases without too many ancient duplications, Algorithm 1 produces correct results. For the proof we will need to consider the reconciliation map µ for complete gene family histories, i.e., gene trees T containing extant genes as well as all branches leading to loss events. As above, we do not consider HGT. The leaf set of T is thus , where L e represents the extant genes and L 0 denotes loss events. Since the species map is naturally restricted to extant genes (i.e., σ : L e (T ) → L(S) ), we need to restrict (R1): If x ∈ L e (T ) , then µ(x) = σ (x) . We will refer to such gene trees and reconciliation maps as extended gene trees and extended reconciliation maps, respectively. Correspondingly, Lemma 2 only holds for L e , i.e., we can conclude that σ (L e (T (w 1 ))) ∩ σ (L e (T (w 2 ))) = ∅ . This can easily be seen by reusing the contradiction argument in [40][ Lemma 2]. As a consequence of loss events we now may have σ (L e (T (v))) = ∅ for some nodes v ∈ V (T ).
Proof First note that the statement is trivial if there exists only one gene in Y. Hence, we can assume that Y contains more than one gene. Moreover, Condition (A1) is trivially satisfied since, by assumption, the candidate set of best matches of x in Y is exactly Y. Since L e is nonempty, we have to consider the two cases |µ −1 (ρ S )| = 1 and |µ −1 (ρ S )| = 2.
The condition |µ −1 (ρ S )| ≤ 2 makes an explicit assumption on the true history of the gene family by limiting the scenario to at most one ancient duplication on . Figure 4 shows that this condition cannot be dropped: if there are two or more ancient duplications affecting X, Y, and Z, then the correct inference of best matches from quartets can no longer be guaranteed. It is important to note that the condition |µ −1 (ρ S )| ≤ 2 cannot be checked in real data since µ is unknown. In the simulated data, however, it is easy to validate and we observed empirically that it is rarely violated in our data (see "Simulation results" section).
In some situations ancient duplications can be inferred unambiguously, independent of the reconciliation map µ . This is in particular the case if there are incongruences between quartets of genes and species. Consider four genes a, b, c, d residing in four pairwise distinct species σ (a) , σ (b) , σ (c) , and σ (d) , and assume that these four species form the quartet (2) implies µ( lca T (a, c)) � lca S (σ (a), σ (c)) = u and µ( lca T (b, d)) � lca S (σ (a), σ (c)) = u . Thus both inner nodes p and q of the quartet are mapped no lower than u. The edge between them therefore must be mapped to an edge pre-dating u, since the speciation constraint (R3) implies that two ≺ T -comparable events in T of which one is a speciation cannot by mapped to the same vertex of S. Thus u ≺ S µ( lca T (a, b, c, d)) . The case (ad|bc) is handled by an analogous argument exchanging c and d. The fact that lca T (a, b, c, d) is a duplication event now follows from Lemma 4. This theorem can be used to discard suspicious outgroups: If T [x, y, z 1 , z 2 ] is incongruent with the known species tree, then σ (z 1 ) = σ (z 2 ) should be replaced by outgroup candidates from earlier-branching species. The downside of using Theorem 12 is that it requires the systematic investigation of a possibly large numbers of quartets.
We suspect that it is possible in most cases to unambiguously identify pairs whose last common ancestor in the gene tree pre-dates the last common ancestor of the species tree under consideration. While it may be difficult to determine the relative order of such duplications, we suspect that clustering methods used to extract groups of co-orthologs (COGs) can be adapted to disentangle such ancient "paralog groups".

Simulation results
Well curated data for gene family histories are not available at large scale. We therefore use simulated data to evaluate how well best matches (in the sense of evolutionary relatedness) can be estimated from both perfect and noisy evolutionary distance measurements. For this purpose, it is important to have data sets that emphasize asymmetric rate variations among paralogs, i.e., the situations in which sequence dissimilarities and divergence times are not well correlated. We therefore developed a simulation system (see "Methods" section) that can produce this type of test data. Each scenario consists of a dated, planted species tree S and a gene tree T, that was simulated along S and thus is also dated. Each edge in T is assigned a rate, drawn from a distribution to model rate differences between paralog groups following gene duplications [14,15,52]. The product of the time difference between the end points of an edge and the evolutionary rate assigned to it then defines its length. The genetic distance d(x, y) of two genes x and y is the sum of the edge lengths along the unique path connecting x and y in T, see Eq. (7). Thus d is additive by construction. A typical example of a gene tree with distances can be found in Additional file 1: Fig. S1. In total, we simulated 2000 scenarios.
Since perfect additivity of d cannot be expected in the presence of measurement noise, we therefore superimposed normally distributed noise on the distance data, using the standard deviation s to control the noise level, see "Simulation of measurement noise" section. As a more realistic way to produce noisy data, we instead simulated sequence data along T, such that the expected number of events is proportional to the edge length, and thus to d.
Gene families in real-life data differ quite drastically from each other not only in their rate of sequence evolution but also as far as the rates of gene duplication and loss of paralogs in concerned. We therefore consider here a mix of scenarios with a different number of species. See Additional file 1: Figs. S2-5 for various statistics of the data set including the distribution of species and gene number per scenario, the average number of genes per species, and the distribution of edge lengths.

Best matches from evolutionary distances
We compare three strategies to estimate best matches directly from the evolutionary distance d: 1. Reciprocal best hits are inferred directly from the distance data. In order to account for rate variations among paralogs, we follow the strategy of Pro-teinOrtho [50] and consider nearly co-optimal best hits by considering for a given gene x in species σ (x) all those y ∈ Y as almost best hits that have distance not worse than a factor 1 + ǫ than the most similar gene in Y.
In symbols: For further comparison we then chose the value of ǫ * that maximizes the F-measure ( ǫ * = 0.5 , see Fig. 6). Still, this approach produces a substantial number of both false positives and false negatives in data sets with large rate variations between paralogs. We expect that the optimal value of ǫ * will depend on the details of the data set, in particular on the extent of evolution rate asymmetries. In general these will have to be estimated from the gene family history. We refer to this approach as the " ǫ -method". Since we chose the cut-off ǫ * to maximize the F-measure, we effectively determine an upper bound on the performance of the best hits approach.
2. Explicit reconstruction of T . Since the additive distances completely determine T , the only source of errors for perfect data is a potentially incorrect choice for the root of T . For additive distance data, the Neighbor Joining algorithm [53] is guaranteed to produce the correct T [54]. We then use midpoint rooting [24] to pass from T to T * and compute the best matches in T. We refer to this method as "NJ+midpoint rooting". This method is not intended as a viable means of analysis for real-life sequence data. It serves, however, as a convenient way to assess the effects of rate imbalances because it isolates the errors that are introduceεd by the choice of the root alone i.e., by rate imbalances.
3. The "Quartet" approach starts from a known (rooted) species tree S. For x ∈ X , and y ′ , y ′′ ∈ Y we select the set of outgroup genes Z from outgroup species w.r.t. the species of X and Y, i.e., lca S (σ (X), σ (Y )) ≺ lca S (σ (X), σ (Y ), σ (z)) for z ∈ Z . To reduce the risk for too many ancient duplications, which are a source of error in this approach (see Lemma 11), we may require in addition that lca S (σ (X), σ (Y ), σ (z)) = ρ S , i.e., we only allow outgroup species from "the other side of the root". For reasons of time complexity, we randomly select min(20, |Z|) among the genes Z that meet this condition as the final outgroup set Z in Algorithm 1. Since we operate on distance data, quartets can directly be estimated using Eq. (3).
In order to benchmark the inference of best matches we compute recall and precision w.r.t. the true best matches restricted to pairs of gene sets X and Y for which such outgroups are available. On average we could assign outgroup genes to 74.6% of the n(n − 1)/2 gene pairs, where n is the number of non-loss leaves in the respective scenario (see also Additional file 1: Fig. S6 for the scenariowise percentage).
The comparison in Fig. 7 (bottom panel) shows that the quartet method outperforms the alternatives for different levels of the simulated random measurement error in terms of F-measure. As an example, for s = 1.0 the 10th percentile of the F-measure reaches 0.909 for the quartet method compared to 0.869 and 0.884 for the ǫ-method and the Neighborjoining trees, respectively. The median values, on the other hand, are almost identical and fairly high (around 0.985). Hence, we suspect that more sophisticated methods are advantageous for a number of rare (but not negligible) difficult cases. Moreover, note that both recall and precision are almost perfect for noiseless data ( s = 0 ) and that the results are robust over a wide range of simulated measurement error. The same was observed for systematically biased noise (see Additional file 1: Fig. S7), which was simulated by computing a convex combination of the original matrix and a perturbation matrix derived from another tree. The performance of all three methods drops quickly when the perturbation becomes large. The highest and most stable recall values could be obtained with the ǫ-method for both types of noisy data. For our choice of ǫ , this clearly comes at the cost of precision. Not surprisingly, the reconstruction of Neighborjoining trees already provides a higher precision than the ǫ-method. The simple midpoint rooting strategy however still incurs noticable level of error. For the quartet method operating on noiseless data the only source of errors are bad choices of outgroups, which are the consequence of ancient duplications. The number of ancient duplication exceeds 1 in 5.15% of the simulated gene family scenarios. Due to loss events predating the root of the species tree, the condition in Lemma 11 is only violated in 3.7% of the gene trees. Out of these problematic cases, little more than half ( 2.25% ) actually result in a non-perfect inference accuracy.
Restricting the choice of outgroup genes z to species that are separated from X and Y by the root of the species tree, i.e., such that lca S (σ (z), lca S (σ (X), σ (Y ))) = ρ S , is likely to be problematic whenever S is skewed in a way that leaves very few choices for σ (z) and whenever the divergence between σ (z) and lca S (σ (X), σ (Y )) is large. In the latter case, saturation effects may impair the quartet inference in fast evolving gene families. Hence, it would be advantageous to consider also genes from closer species. In principle, every relative outgroup w.r.t. species σ (X) and σ (Y ) is a viable candidate. These can then be filtered by applying Theorem 12 to reduce the number of bad choices of z. We find that filtering for outgroups with identifiable ancient duplications and giving preferences to the closest outgroup genes, i.e., those with the lowest lca S (σ (z), lca S (σ (X), σ (Y ))) indeed yields a further moderate improvement of the estimated best matches (see Fig. 8). However, the performance is slightly reduced for perfectly additive data due to ancient duplications that were not detected by the currently available filtering heuristics.
At first glance the F-measures in Figs. 7, 8 look "to good to be true". Our collection of scenarios, however, contains many easy instances with few paralogs and losses. Reallife data furthermore are plagued by systematic biases, incomplete and missing annotations, inconsistent choices between isoforms, etc., that affect the ability to correctly estimate evolutionary distances and thus pairwise best hits. So far, we have assumed that we have perfect data in this respect and evaluate only our ability to recover (reciprocal) best matches. In the following we briefly consider the effect of having to estimate evolutionary distances from sequences. Again, we will consider only the most benign situation, i.e., sequences generated from Markov processes.

Best matches from sequences
In applications to real-life data sets, additional uncertainties arise through the reconstruction of distances from sequences. We therefore simulated sequences (without in/dels) from the gene tree/species tree scenarios and inferred the best matches from the sequence data. Considering only substitutions avoids the need for computing sequence alignments. We explored two different ways of determining the quartet relation: (a) We derived an approximately additive evolutionary distance from the observed dissimilarity, before again applying Eq. (3). More precisely, for nucleic acid sequences we transformed the normalized Hamming distance using the simple Jukes-Cantor transform [33], and for amino acid sequences we applied the BLOSUM-based transformation [55] in the Biopython package [56]. (b) We directly inferred the quartets using the Quartet Mapping method (QM) [57] as outlined in "Methods" section. Figure 9 summarizes the results for simulated nucleic acid and amino acid sequences of different lengths and different scaling of the evolutionary rates. As expected, the short sequences incur a relative large noise level compared to the perfect additive distances. Nevertheless, the overwhelming majority of best matches is still estimated correctly (F-measures well above 0.9 for the vast majority of scenarios even for nucleic acid sequences as short as 200 nt). Larger false positive rates are observed only in a small number of scenarios with many duplications and losses. This is not surprising since our relatively simple rule for outgroup choice tends to fail if there are many ancient duplications. As expected, the F-measure improves with increasing sequence length due to the increased amount of information from which the distances are estimated. Since the standard deviation of the estimated distances (normalized by sequence length) decreases ∼ n −1/2 , the main effect of the sequence length is to tune the noise level. Likewise, the F-measure improves when saturation effects are reduced by downscaling the total number of events. To this end the edge lengths in the original trees were rescaled by a factor of 1/2 and 1/4, respectively. We expect, however, that reducing the number of even events further will ultimately lead to a decreasing performance, since deriving topology information from almost identical sequences is difficult or even impossible. The same trends were observed for simulated protein sequences. Figure 9 also shows that the Quartet Mapping method outperforms the other methods for the 200 nucleotide sequences, indicating that this approach is advantageous when sequence length is small. In case of long nucleic acid sequences (2000 nt) and amino acid sequences (500 nt), the best results are obtained by estimating additive distances from pairwise sequence comparison using the Jukes-Cantor transform or a BLOSUM-based transformation, respectively.
The rather disappointing performance of the QM method for long sequences is probably the consequence of the majority voting procedure used choose the quartets. We suspect that majority voting is too simple-minded in situations where none of the three possible splits dominates. In the default setting, these are interpreted as unresolved trees ( × ) and inserted as bidirectional edges into the auxiliary graph Ŵ . This, however, leads to a moderate overprediction of best matches. Alternatively, a consensus can be taken over multiple choices of the outgroup z. Finally, the unresolved quartets can be omitted altogether in the construction of the auxiliary graph Ŵ . Both alternatives perform worse than the default method, see Additional file 1: Fig. S8.
We also investigated to what extent the number of gene duplication and losses in a scenario influences the inference of best matches. As a representative example, Fig. 10 shows the false positive rate for QM. As expected, the number of false positives increases with increasing number of duplication and loss events. This can be observed both for the absolute and the relative number of events , i.e., after normalizing by the number of species (see also Additional file 1: Fig. S9 and S10 for the 200 nt sequences and for F-measure instead of false positive rate, respectively).

Discussion and conclusions
The idea to use quartet structures for improvement of orthology estimates is not new; it was used e.g. in QuartetS [58]. Quartets are also used as witnesses of non-orthology in OMA [59] to avoid some types of falsepositives. Here, we systematically investigate how and when quartets help to improve and/or correct empirical best-hit data to identify best matches in the sense of closest evolutionary relatives. We propose that reciprocal best matches, rather than the uncorrected reciprocal best hits, should then be used to infer orthology relationships. This second step has been the topic of a separate manuscript [40], in which the mathematical connections between (reciprocal) best matches and orthology are elucidated in detail.
The key observation of the present contribution is that the best matches of a gene x in the set Y of genes from a different species can be computed correctly if for every y ′ , y ′′ ∈ Y one can find a gene z from a third species that is an outgroup for {x, y ′ , y ′′ } . From a theoretical point of view, this condition is closely related to rooting the gene tree. The second necessary ingredient is an estimate of an additive evolutionary distance between the genes that is accurate enough to correctly identify the topology of a certain subset of quartets. We emphasize that this is a much less stringent condition compared to the ability of reconstructing the complete gene tree T.
Empirically, we observe that (partial) knowledge of the species tree (more precisely: reliable monophyletic groups) is very useful for the choice of outgroup genes z: excellent results are obtained by choosing a candidate z from a species that is an outgroup for σ (x) and σ (Y ) . The results can be further improved by using filtering criteria that identify ancient duplication events and by computing a consensus over several choices of z. In data sets with little measurement noise, we indeed obtain nearly perfect best match estimates. The theoretical considerations outlined here also suggest additional in-roads for further improvements by means of identifying ancient duplications, which not only serve as "witnesses of nonorthology" but can also be used to prune the set of candidate outgroups.
In order to make the methods described here applicable to very large real-life data sets, it will be necessary to optimize the computational performance. To this end, we will develop heuristic rules to prune the set Y in the case of large gene families. An obvious candidate is to use the ǫ-method as an initial filter, where ǫ is now chosen to optimize the tradeoff between |Y| and false negative predictions of best matches. We expect that the heuristic rules for choosing the set Z of candidate outgroups can also be improved substantially.
Best matches are rarely if ever of interest in isolation. Instead, they are an intermediate construction, in particular in orthology detection or the assessment of synteny. It is difficult therefore to benchmark the translation of reciprocal best hits to reciprocal best matches in a truly realistic setting because best hit data themselves are burdened with diverse sources of errors, including incorrect sequence assembly, incorrect or missing annotation of coding sequences, and the use of different splicing isoforms. The benchmarking results shown here thus have to be taken with a grain of salt. In particular, we expect that error levels of a pipeline that determines best hits and then converts them to best matches will be dominated by the first step, i.e., the computation of best hits from genome or proteome data. Our data also show, however, that there are difficult instances for which we currently have no good way to compute the correct best matches. Fortunately, these appear to be rare. We expect that methods for orthology assessment can be improved in both reliability and computational performance by combining the accurate estimation of best matches described here with a better understanding of (reciprocal) best match graphs [38,39,60] and their connection with the orthology relation [40]. Since tree-free methods for orthology detection rely on (pairwise) best hits as proxy for reciprocal best matches, we expect that the accuracy of most tools would improve if best matches are supplied as input data. This is not easy to test, however, since the best hit computation is usually an integral part of the software. Such a benchmark study is hence beyond the scope of this contribution.
The work reported here is primarily intended to provide a solid theoretical foundation for the construction of improved best match heuristics. The theoretical results give some guarantees for obtaining the correct best matches and highlight some limitations that cannot be overcome with certainty as long as only distance data are available. The most promising additional data source is synteny, or more precisely, genomic proximity [61]. Given two proximal genes u and v from different families in species A and a pair of family members u ′ and v ′ proximal in species B, it is very likely that either both u, u ′ and v, v ′ or neither of them are best matches. A more systematic development of such filters will be the topic of future work.
The software used for simulating and testing the conversion of best hits to best matches has been made available on github [62]. As a next step, it will be incorporated into ProteinOrtho [50,63,64] to assess and benchmark the achievable improvements on real-life data. Best matches instead of best hits could of course also be used in other orthology detection tools.

Simulations of dated species trees
As in previous work [40], we use the Innovation Model [65] to produce realistic topologies for the planted species tree S. We then construct a dating function τ : V (S) → [0, 1] such that τ (0 S ) = 1 and τ (x) = 0 for x ∈ L(S) . In order to assign a date to an interior vertex, we traverse S top-down, more precisely for the current node u at time τ (u) we proceed as follows: (1) We pick a child v ∈ child(u) and a leaf x ∈ L(S(v)) in the subtree below v. If v is already a leaf, we set τ (v) = 0 and proceed to the next child of u.
(2) Otherwise, we determine the number k of speciations on the path between v and x. Hence, the path from u to x comprises k + 2 edges. (3) We pick a random number r with mean 1 and range (0, 2) from a uniform distribution and set τ (v) = τ (u)(1 − r/(k + 2)) . This rule is chosen so that the expected time elapsed along the edge uv equals τ (u) divided by the number (k + 2) of edges along the path to the root and ensures that τ (v) > 0 . The result is a dated species tree in which each edge uv has length τ (u) − τ (v).
The choice of the uniform distribution in (3) is a mere convenience. In principle it should be replaced by an empirically estimated distribution. Alternatively, generators capable of producing dated trees such as TreeSi-mGM [66] could be used.

Simulation of gene trees in the dated species tree
We use the Gillespie algorithm [67] to simulate the duplication, loss and horizontal gene transfer events (HGT) occurring in S. The branches of the species tree S are independent in Duplication/Loss scenarios. However, horizontal gene transfer introduces dependencies between them. We therefore have to simulate the evolution process in such a way that at each time point τ the possible reactions are given by the Cartesian product G(τ ) × {D, L, H} , where g ∈ G(τ ) is a gene that is present at time τ in any one of the branches of the dates species tree, and q ∈ {D, L, H} is one of the three possible events (Duplication, Loss, HGT). Every possible simulation event ξ := (g, q) is associated with a rate r ξ (τ ) that may depend explicitly on the point in time. Rate constants are described below. In each step, two random numbers r 1 and r 2 are drawn independently from the uniform distribution on [0, 1]. The first random number r 1 is used to select ξ with probability r ξ (τ )/R(τ ) , where R(τ ) is the sum of the rates of all reactions available at time τ . We refer to [67] for a convenient way to implement the rate-proportional choice of the "reaction channel". Depending on the selected event type, the following actions are performed: (q = L ) Gene loss is modeled by removing g from the list of active genes.
(q = D ) Gene duplications are modeled by placing a copy g ′ of g into the same branch of S at time τ.
(q = H ) For HGT the copy of g ′ is placed into a different branch of S. The "landing site" for the HGT copy is chosen uniformly from the branches of S available at time τ with the exception of the branch harboring the parental gene g.
The rules determining the rate parameters for gene copies g ′ and the optional adjustment of rates for the genes g are discussed below. The second random variable r 2 is used to update the clock according to τ ← τ − �τ with �τ = ln(1/r 2 )/R . The simulation terminates as soon as τ − �τ ≤ 0. A complication arises from the fact that the time interval [τ , τ − �τ ] may contain a speciation event at time τ s . At a speciation, the gene content is copied into the daughter-lineages, and the rates are modified in a lineage-specific manner. As a consequence, the waiting time �τ has to be re-estimated since the set of reaction channels has changed. More precisely, we need to determine the distribution of waiting times from a time point t 0 until the next event conditioned on the fact that no event occurred between t 0 and t 1 , where t 1 designates the time point of the speciation. For the complementary cumulative distribution function and s := t 0 − t 1 we have Since the waiting time distributions are exponential with rate r 1 before t 1 and rate r 2 following the speciation event, we obtain Hence, if the simulated waiting time reaches beyond the speciation event, the clock is advanced to the speciation event and a new waiting time is drawn with the rates after the speciation event. In practice, a new random number to obtain the time step �τ ′ with the updated rates after the speciation event. In the new interval [τ S , τ S − �τ ′ ] we again have to check for speciation events. Since the speciation events are known a priori from the dated species tree S, they are held in a priority queue in temporal order. The final result is a dated gene tree T, i.e., each event is unambiguously associated with a time stamp. The simulation also completely determines the reconciliation map µ.
We simulated 2000 pairs of species and gene trees, where |L(S)| was drawn uniformly from the interval [3,50]. The duplication and loss rates were (independently) drawn from [0.5, 1.0).

Modeling rate imbalances
In order to produce realistic (sequence) data, an evolution rate ω e has to be associated with each edge e of T. To this end we use a hierarchical model that first determines a baseline gene substitution rate ω 0 e for each edge e of the species tree S in order to simulate effects such as variations of population size and generation time. This introduces a correlation between the rates of all genes in the same lineage of S. These base rates are then modified by gene-specific contributions that capture effects such as differences in selection pressures that depend on gene function and rate differences in the wake of duplications such as neofunctionalization and subfunctionalization [15]. In detail, we use the following parametrization: • mean substitution rate of the conserved members of a gene family (default 1.0). • variance σ 2 0 for the baseline substitution rate in S (default 0.2). • a gamma distribution for the substitution rates > 1 of divergent genes. The parameters are estimated from data for the whole genome duplication in saccharomycete yeasts [52]. Alternatively, a uniform distribution on (1; r max ] can be selected. • weights for the relative frequency of the possible fates of duplicates (functional conservation, subfunctionalization, neofunctionalization; default equal weights 1/3).
We determine the baseline substitution rates ω 0 uv for the edge uv ∈ E(S) as follows: We simply assign the mean substitution rate to the planted edge 0 S ρ S (i.e. 1.0 by default). We traverse S in pre-order and draw for each edge uv ∈ E(S) \ {0 S ρ S } the logarithm ln ω 0 uv of the rate of evolution from a normal distribution with variance σ 2 = σ 2 0 (τ (u) − τ (v)) . To avoid bias towards higher or lower rates, we normalize the mean of the normal distribution such that E(ω 0 uv ) = ω 0 par(u)u . For the gene specific rates we first sort all vertices u ∈ V (T ) by τ (u) in descending temporal order. We keep track of the current number of genes in each branch of the species tree. During the simulation, the edges of T will be marked as either conserved or divergent depending on the fate of the branch after a duplication event. For each edge e = uv ∈ E(T ) in the gene tree, we initialize an empty list L e of ordered pairs of the form (τ , ω) to record the gene-specific evolution rates ω and the corresponding time points τ at which they become valid during the existence of e. This allows us to reset the divergent status of a gene in case it is the last survivor in a given species. At present, we do not consider other events that change the rate of evolution of a gene within the edge e. The framework, however, can easily accommodate such rules in future refinements of the model. We denote by L e,i the ith ordered pair (τ i , ω i ) in L e and define τ (L e,i ) := τ i and ω(L e,i ) := ω i .
Recall that 0 T ρ T is the first (planted) edge in T. To initialize the simulation, we mark 0 T ρ T as conserved and append (τ (0 T ), 1.0) to L 0 T ρ T . Then for each vertex u in the sorted list we proceed as follows: (1). u is a speciation event Mark all edges uv with v ∈ child(u) the same as par(u)u . To L uv we append the pair (τ (u), ω) with ω = 1.0 (uv is conserved) or ω Gammadistributed (uv is divergent), respectively. (2). u is a duplication event If the edge par(u)u is marked as divergent, then all edges uv with v ∈ child(u) are also marked as divergent and corresponding pairs (τ (u), ω) are appended to L uv , where the values of ω are drawn i.i.d. from the Gamma distribution. If par(u)u is marked as conserved, we choose between (a) conservation, (b) subfunctionalization and (c) neofunctionalization with the specified weights. For (a) mark both incident edges below u as conserved, for (b) as divergent and for (c) one edge is conserved and the other is divergent. To L uv we append the pair (τ (u), ω) with ω = 1.0 (uv is conserved) or ω Gamma-distributed (uv is divergent), respectively. (3). u is a loss event If a single copy is left in the respective species after the loss: Let e * be the corresponding edge of the remaining copy at τ (u) . Mark e * as conserved and append the pair (τ (u), 1.0) to L e * . (4). u is an HGT event Let v 1 be the copy that remains in the species and v 2 the transferred copy. Mark uv 1 the same as par(u)u and append (τ (u), ω) to L uv 1 where ω is the last rate that was appended to L par(u)u . Mark uv 2 as divergent and append (τ (u), ω) to L uv 2 with ω Gamma-distributed. For each edge e = uv in T we finalize L e by appending (τ (v), ω) where ω is the last rate that was appended to L e so far. We then define the edge length ℓ(e) for each edge e in T as where f is the edge in the species tree S into which e is embedded.

Computation of distances
The resulting function ℓ : E(T ) → R + (see Eq. 6) defines an additive metric on the set of vertices V(T). We denote by d a distance function on the set of non-loss leaves in T (i.e., the extant genes at time τ = 0 ), representing the evolutionary distance between each pair of these genes. In order to compute d, we first construct the observable gene tree T by removing all branches that lead to losses only, and then contracting all inner vertices that are left with a single child. The distance d(x, y) of two leaves x and y in T is given by the sum of edge lengths on the unique path P xy connecting x and y in T , thus

Simulation of measurement noise
In order to simulate measurement noise we consider three strategies: (1) Adding i.i.d. random noise to the additive distance d in general violates the triangle inequality, i.e., the condition d(x, y) ≤ d(x, z) + d(z, y) no longer holds for all x, y, z ∈ L . We therefore use the following simple algorithm: choose two distinct x, y ∈ L at random. Moreover, we draw a noise factor ε xy from a normal distribution with mean 1 and standard deviation s, then substitute the distance of x and y, i.e. d(x, y) and d(y, x), by d ′ := ε xy d(x, y) . If the perturbed distance d ′ satisfies the triangle inequality, we accept the perturbed distance d ′ . Otherwise, d ′ is rejected and a new random perturbation is generated. We repeat this until |L| 2 perturbations have been accepted. An alternative approach is to first introduce perturbations to all distances and then to extract a corrected distance d using one of several algorithms for the "metric repair problem", see e.g. [68,69]. A cursory test showed that the trees reconstructed from distance matrices processed with these methods tend to be more different from the reference than with our approach of enforcing the triangle inequality immediately. We therefore did not pursue them further in this contribution. (2) We denote by D a distance matrix on the set of non-loss leaves in T whose entries correspond to the distances of d. It is easy to see that a convex combination (1 − α)D + αD ′ , 0 ≤ α ≤ 1 of two metrics D and D ′ is again a metric (i.e., in particular, satisfies the triangle inequality). Even if both D and D ′ are additive, however, their convex combination is not additive in general. This yields a distance that is affected by a systematic bias corresponding to the noise contribution αD ′ . (3) The edge lengths ℓ(e) (see Eq. (6)) can also be interpreted as the average number of evolutionary events per site. These are then simulated directly using the generation tool pyvolve [70]. We generated nucleic acid sequences of length 200 with equal transition and transversion rates and of length 2000 with a transition : transversion ratio of 2 : 1. To assess saturation effects, we scaled the rates (7) d(x, y) = e∈P xy ℓ(e).