The link between orthology relations and gene trees: a correction perspective
Algorithms for Molecular Biology volume 11, Article number: 4 (2016)
Abstract
Background
While tree-oriented methods for inferring orthology and paralogy relations between genes are based on reconciling a gene tree with a species tree, many tree-free methods are also available (usually based on sequence similarity). Recently, the link between orthology relations and gene trees has been formally considered from the perspective of reconstructing phylogenies from orthology relations. In this paper, we consider this link from a correction point of view. Indeed, a gene tree induces a set of relations, but the converse is not always true: a set of relations is not necessarily in agreement with any gene tree. A natural question is thus how to minimally correct an infeasible set of relations. Another natural question, given a gene tree and a set of relations, is how to minimally correct the gene tree so that the resulting gene tree fits the set of relations.
Results
We consider four variants of relation and gene tree correction problems, and provide hardness results for all of them. More specifically, we show that it is NP-hard to edit a minimum set of relations to make them consistent with a given species tree. We also show that the problem of finding a maximum subset of genes that share consistent relations is hard to approximate. We then demonstrate that editing a gene tree to satisfy a given set of relations in a minimum way is NP-hard, where “minimum” refers either to the number of modified relations depicted by the gene tree or the number of clades that are lost. We also discuss some of the algorithmic perspectives given these hardness results.
Background
Genes, the molecular units of heredity, hold the information to build and maintain cells. In the course of evolution, they are duplicated, lost, and passed to organisms through speciation. Genes originating from the same ancestral copy are called homologs. Homologous genes are grouped into gene families, usually via sequence similarity methods. Moreover, homologous genes can be orthologous, if their parental origin is a speciation, or paralogous, if it is a duplication. Orthologous genes are considered to be more similar in function than paralogs, a conjecture known as the orthology conjecture [1]. This is a major motivation for inferring gene evolution, as it is a prerequisite for functional prediction purposes.
Starting usually from a DNA or protein sequence alignment, the tree-based method requires building a phylogenetic tree, called a gene tree, for the considered gene family. Reconciliation [2] with the species tree then makes it possible to infer the evolutionary events (duplications and speciations) associated with the internal nodes of the gene tree. Hence the internal nodes of a gene tree can be labeled as duplications and speciations, and such a labeling induces a full set of orthology and paralogy relations between gene pairs. Tree-free methods for detecting orthology are also available. These methods are based on gene clustering according to sequence similarity (cf. e.g. the COG database [3], OrthoMCL [4], InParanoid [5], Proteinortho [6]), synteny [7, 8] or functional annotation of genes [9]. Such methods usually are not able to detect a full set of relations, but only a partial set, i.e. some relations among genes are not inferred.
Recent papers [10, 11] have investigated, from a graph theory point of view, the link between trees and orthology/paralogy relations (we just say “relations” in the following). Given a gene family \(\Gamma\) and a set \(\mathcal{C}\) of pairwise relations, a first problem is whether we can reconstruct a labeled gene tree for \(\Gamma\) inducing \(\mathcal{C}\). The problem can be subdivided into two parts. First, we can consider whether \(\mathcal{C}\) is satisfiable, i.e. whether there exists an event-labeled gene tree G in agreement with \(\mathcal{C}\). However, satisfiability is not sufficient to ensure the possibility for the relation set to reflect a true history, as nodes of G labeled as speciations can be contradictory. This raises the second question, which is the existence of an S-consistent gene tree, namely an event-labeled tree that can be obtained by reconciliation with a species tree S. A simple characterization of satisfiability is given in [10], when the set \(\mathcal{C}\) is a full set of relations (i.e. each pair of genes of \(\Gamma\) is in \({\mathcal C}\)). On the other hand, checking for S-consistency can be done in polynomial time for full sets [12, 13], and also partial sets of relations [14].
In this paper we explore the link between relations and trees in the perspective of relation and tree correction. Several gene tree databases from whole genomes are available, including for instance Ensembl Compara [15], Hogenom [16], Phog [17], MetaPHOrs [18], PhylomeDB [19], Panther [20]. However, due to various limitations such as alignment errors, systematic artifacts of inference methods or insufficient differentiation between sequences, trees are known to contain errors and uncertainties. Consequently, a great deal of effort has been put towards tools for gene tree editing [21–29]. Most of them are based on selecting, in a neighborhood of an input tree, one best fitting the species tree.
Two years ago, we developed the first algorithm for gene tree correction using orthology relations [7]. Here we address, from a complexity and approximation point of view, the more general problem of correcting a gene tree according to a set of orthology and paralogy relations. We consider two objective functions: the number of relations left unchanged (a change being from orthology to paralogy or vice versa), leading to the Maximum Homology Correction problem, and the number of unchanged clades (the Robinson-Foulds distance [30]), leading to the Maximum Clade Correction problem. We provide NP-completeness results for these two problems.
Conversely, we also address the problem of correcting a set of relations so that it represents a valid history in terms of S-consistency. A set of relations is usually represented as a graph R, where edges represent orthologous relations and non-edges represent paralogous relations. The satisfiability problem related to S-consistency reduces to adding or removing a minimum number of edges of R in order to make it \(P_4\)-free (that is, it contains no induced path of length three), as shown in [10]. The problem is known to be NP-hard and fixed-parameter tractable [31]. In [11], an integer linear programming formulation is used to correct relation graphs of reasonable size. An approximation algorithm of factor \(4 \Delta\), where \(\Delta\) is the maximum degree of the graph R, is given in [32]. The S-consistency problem, however, has never been studied.
In this paper, two criteria are considered for correcting a set R of relations: minimize the number of modified relations, and maximize the number of genes inducing an S-consistent set of relations. The first problem is shown to be NP-complete, while the second problem is shown to be not approximable within factor \(d n^{\frac{1}{2}(1-\varepsilon )}\), for any \(0 < \varepsilon < 1\) and any constant \(d > 0\).
Trees and orthology relations
All trees considered in this paper are assumed to be rooted. They are not necessarily binary, but we assume that all internal nodes are of degree at least three, except possibly the root that can be of degree two. Given a set X, a tree T for X is a tree whose leafset \({\mathcal L}(T)\) is in bijection with X. We denote by V(T) the set of nodes and by r(T) the root of T. Given an internal node u of T, the subtree rooted at u is denoted \(T_u\) and we call the leafset \({\mathcal L}(T_u)\) the clade of u. A node u is an ancestor of v if u is on the (inclusive) path between v and the root, and we then call v a descendant of u. If \(u \ne v\), then v is a strict descendant of u, and if u and v are connected by an edge of T, then v is a child of u. The lowest common ancestor (lca) of u and v, denoted \(lca_T(u, v)\), is the ancestor common to both nodes that is the most distant from the root. We say that u and v are separated if and only if \(lca_T(u, v) \notin \{u,v\}\) (i.e. none is an ancestor of the other). We define \(lca_T(U)\) analogously for a set U of nodes. Let \(L'\) be a subset of \({\mathcal L}(T)\). The restriction \(T_{L'}\) of T to \(L'\) is the tree with leaf set \(L'\) obtained from the subtree of T rooted at \(lca_T(L')\), by removing all leaves that are not in \(L'\), and contracting all internal nodes of degree two, except the root. Let \(T'\) be a tree such that \({\mathcal L}(T') = L' \subseteq {\mathcal L}(T)\). We say that T displays \(T'\) if and only if \(T_{L'}\) is \(T'\).
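The notions of ancestor, lowest common ancestor and separation can be made concrete with a minimal Python sketch. The child-to-parent dictionary encoding, the function names and the toy tree are assumptions made for this illustration only; they are not part of the formal development.

```python
def ancestors(parent, u):
    """Return the inclusive path from u up to the root."""
    path = [u]
    while parent[u] is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    """Lowest common ancestor: the common ancestor farthest from the root."""
    anc_u = set(ancestors(parent, u))
    for w in ancestors(parent, v):  # walk upward from v
        if w in anc_u:
            return w
    raise ValueError("nodes are not in the same tree")


def separated(parent, u, v):
    """u and v are separated iff neither is an ancestor of the other."""
    return lca(parent, u, v) not in (u, v)


# Hypothetical toy tree: root r with children x and c; x has children a and b.
toy = {"r": None, "x": "r", "c": "r", "a": "x", "b": "x"}
```

In this toy tree, the leaves a and b are separated, whereas x and a are not, since x is an ancestor of a.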
Evolution of a gene family
Species evolve through speciation, which is the separation of one species into distinct ones. A species tree S for a species set \(\Sigma\) represents an ordered set of speciation events that have led to \(\Sigma\): an internal node is an ancestral species at the moment of a speciation event, and its children are the new descendant species. Inside the species’ genomes, genes undergo speciation when the species to which they belong do, but also duplications, and losses (other events such as transfers can happen, but we ignore them here). A gene family is a set of genes \(\Gamma\) accompanied by a mapping function \(s : \Gamma \rightarrow \Sigma\) mapping each gene to its corresponding species. The evolutionary history of \(\Gamma\) can be represented as a node-labeled gene tree for \(\Gamma\), where each internal node refers to an ancestral gene at the moment of an event (either speciation or duplication), and is labeled as a speciation (Spec) or duplication (Dup) accordingly.
Formally, we call a DS-tree for \(\Gamma\) a pair \((G, ev_G)\), where G is a tree with \({\mathcal L}(G) = \Gamma\), and \(ev_G:V(G) \setminus {\mathcal L}(G) \rightarrow \{Dup, Spec\}\) is a function labeling each internal node of G as a duplication or a speciation node (we drop the G subscript from \(ev_G\) when it is clear from the context). Given a species tree S, the LCA-mapping function \(s_G : V(G) \rightarrow V(S)\) maps each gene of G, ancestral or extant, to a species as follows: if \(g \in {\mathcal L}(G)\), then \(s_G(g) = s(g)\); otherwise, \(s_G(g) = lca_S( \{s(g') : g' \in {\mathcal L}(G_g) \})\). An example is given in Fig. 1, where the label of each node of G represents its LCA-mapping with respect to S.
According to the Fitch [33] terminology, we say that two genes x, y of \(\Gamma\) are orthologous in G if \(ev(lca_G(x, y)) = Spec\), and paralogous in G if \(ev(lca_G(x, y)) = Dup\). We denote by \({\mathcal O} (G)\), respectively \({\mathcal P} (G)\), the set of all gene pairs that are orthologous, respectively paralogous in G. By \(xy \in {\mathcal O} (G)\) we mean \(\{x, y\} \in {\mathcal O} (G)\) (the same applies for \({\mathcal P} (G)\)). In Fig. 1, \(a_1c_1 \in {\mathcal O} (G)\) while \(a_1b_1 \in {\mathcal P} (G)\). We say that \(a_1c_1\) (respec. \(a_1b_1\)) is an orthology (respec. paralogy) relation induced by G.
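As an illustration of how a DS-tree induces a full set of relations, the following Python sketch computes the orthologous and paralogous pairs from the event label of each lca. The dictionary encoding, the function names and the toy DS-tree (which is not the tree of Fig. 1) are assumptions made for this example.

```python
from itertools import combinations


def ancestors(parent, u):
    """Return the inclusive path from u up to the root."""
    path = [u]
    while parent[u] is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    """Lowest common ancestor of u and v in a child-to-parent encoding."""
    anc_u = set(ancestors(parent, u))
    return next(w for w in ancestors(parent, v) if w in anc_u)


def induced_relations(parent, ev, leaves):
    """Split all leaf pairs into orthologs (lca is Spec) and paralogs (lca is Dup)."""
    orth, para = set(), set()
    for x, y in combinations(sorted(leaves), 2):
        if ev[lca(parent, x, y)] == "Spec":
            orth.add((x, y))
        else:
            para.add((x, y))
    return orth, para


# Hypothetical DS-tree: a duplication d at the root, then one speciation per copy.
parent = {"d": None, "s1": "d", "s2": "d",
          "a1": "s1", "b1": "s1", "a2": "s2", "b2": "s2"}
ev = {"d": "Dup", "s1": "Spec", "s2": "Spec"}
orth, para = induced_relations(parent, ev, ["a1", "b1", "a2", "b2"])
```

Here the only orthologous pairs are those whose lca is one of the two speciation nodes; every pair that meets at the root duplication is paralogous.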
While a history for \(\Gamma\) can be represented as a DS-tree, the converse is not always true, as a DS-tree G for \(\Gamma\) does not necessarily represent a valid history. For this to hold, any speciation node of G should reflect a clustering of species in agreement with S [14]. Formally G should be S-consistent, as defined below.
Definition 1
Let S be a species tree and G be a DS-tree. Let v be an internal node of G such that \(ev(v) = Spec\). Then the speciation node v is S-consistent if and only if for any two distinct children \(v_1, v_2\) of v, \(s_G(v_1)\) and \(s_G(v_2)\) are separated in S.
We say that G is S-consistent if and only if every speciation node of G is S-consistent.
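Definition 1 can be checked mechanically. The sketch below, given only as an illustration, computes the LCA-mapping bottom-up and then tests every speciation node. The encodings (a parent dictionary for S, a children dictionary for the internal nodes of G), the function names and the two toy gene trees are assumptions of this example.

```python
from itertools import combinations


def ancestors(parent, u):
    """Return the inclusive path from u up to the root."""
    path = [u]
    while parent[u] is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    """Lowest common ancestor of u and v in a child-to-parent encoding."""
    anc_u = set(ancestors(parent, u))
    return next(w for w in ancestors(parent, v) if w in anc_u)


def lca_mapping(children, root, s, s_parent):
    """Compute s_G: leaves map via s, internal nodes to the lca of their children."""
    s_g = {}

    def visit(g):
        kids = children.get(g, [])
        if not kids:
            s_g[g] = s[g]
            return
        for k in kids:
            visit(k)
        image = s_g[kids[0]]
        for k in kids[1:]:
            image = lca(s_parent, image, s_g[k])
        s_g[g] = image

    visit(root)
    return s_g


def is_s_consistent(children, root, ev, s, s_parent):
    """True iff the children of every Spec node map to pairwise separated species."""
    s_g = lca_mapping(children, root, s, s_parent)
    for v, kids in children.items():
        if ev[v] != "Spec":
            continue
        for v1, v2 in combinations(kids, 2):
            if lca(s_parent, s_g[v1], s_g[v2]) in (s_g[v1], s_g[v2]):
                return False
    return True


# Hypothetical species tree ((a, b), c) and one gene per species.
s_parent = {"r": None, "ab": "r", "c": "r", "a": "ab", "b": "ab"}
s = {"a1": "a", "b1": "b", "c1": "c"}
ev = {"v": "Spec", "u": "Spec"}
good = {"v": ["u", "c1"], "u": ["a1", "b1"]}  # ((a1, b1), c1)
bad = {"v": ["u", "b1"], "u": ["a1", "c1"]}   # ((a1, c1), b1)
```

In the second tree, the inner speciation node maps to the root of S, which is an ancestor of the species b, so the root speciation is not S-consistent.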
Notice that G and S are not required to be binary. In particular, the definition of S-consistency for a speciation node v of G does not require v to be binary, even if S is binary. The reason is that in such a case, one can “refine” v into a set of binary S-consistent speciation nodes based on the topology of S. This operation does not affect the orthology and paralogy relations of the genes of G (see Fig. 1). Duplication nodes can be refined as well. Lemma 1 formalizes this intuition. This will serve to show that our results hold for both non-binary and binary gene trees.
Lemma 1
Let G be an S-consistent DS-tree for some binary species tree S. Then there is a binary DS-tree \(G'\) such that \(G'\) is S-consistent, and such that \({\mathcal O} (G) = {\mathcal O} (G')\) and \({\mathcal P} (G) = {\mathcal P} (G')\).
Proof
Let v be a highest non-binary node (i.e. v has no non-binary ancestors) of G with children \(v_1, \ldots , v_k\). We show that v can be made binary while preserving \({\mathcal O} (G)\) and \({\mathcal P} (G)\), which suffices to prove the Lemma since we can repeat this operation successively on every non-binary node.
If \(ev_G(v) = Dup\), obtain a DS-tree \(G^*\) by removing \(v_2, \ldots , v_k\) from the children of v, adding a child \(v'\) to v and adding \(v_2, \ldots , v_k\) as children of \(v'\), setting \(ev_{G^*}(v') = Dup\). Notice that \(s_G(w) = s_{G^*}(w)\) for every \(w \in V(G) \cap V(G^*) = V(G^*) \setminus \{v'\}\), implying that all speciations remain S-consistent. It is readily seen that \({\mathcal O} (G) = {\mathcal O} (G^*)\) and \({\mathcal P} (G) = {\mathcal P} (G^*)\).
If instead \(ev_G(v) = Spec\), let \(s_1, s_2 \in V(S)\) be the two children of \(s_G(v)\). Let \(V_j = \{v_i : s_j\) is an ancestor of \(s_G(v_i)\), \(1 \le i \le k\}\) for \(j \in \{1, 2\}\). Notice that for any child \(v_i\) of v, \(s_G(v_i)\) is a strict descendant of \(s_G(v)\). For if not, v has a child \(v_i\) such that \(s_G(v_i) = s_G(v)\). But since v is a speciation, \(s_G(v_i) = s_G(v)\) is separated from \(s_G(v_j)\) for any \(j \ne i\), implying that \(s_G(v)\) is not the lca of \(s_G(v_i)\) and \(s_G(v_j)\), contradicting the definition of \(s_G\). This strict descendant condition implies that \(\{V_1, V_2\}\) partitions \(\{v_1, \ldots , v_k\}\). Also observe that \(V_1\) and \(V_2\) cannot be empty, for otherwise \(s_G(v)\) would be equal to either \(s_1\) or \(s_2\). Obtain \(G^*\) by removing the children of v, adding two children \(w_1\) and \(w_2\) to v, then adding \(V_1\) as children of \(w_1\) and \(V_2\) as children of \(w_2\). Set \(ev_{G^*}(v) = ev_{G^*}(w_1) = ev_{G^*}(w_2) = Spec\). Note that the children of \(w_1\) and \(w_2\) are still from separated species, and so both are S-consistent. As for v, by the definition of \(V_1\) and \(V_2\), \(s_{G^*}(w_1)\) is a descendant of \(s_1\) and \(s_{G^*}(w_2)\) is a descendant of \(s_2\) (not necessarily a strict descendant). Therefore, both are separated and so v is S-consistent. The species for every other node remaining unchanged, we conclude that \(G^*\) preserves S-consistency and does not modify \({\mathcal O} (G)\) nor \({\mathcal P} (G)\). \(\square\)
We can verify that both DS-trees in Fig. 1 are S-consistent. For example, the speciation node z in \(G'\) has children from species v, c, d and w, which are pairwise separated in S. Notice that, from Definition 1, if G is an S-consistent DS-tree, then the lca of two leaves of G belonging to the same species must be a duplication node. The converse is not true. For example, in the S-consistent gene tree G of Fig. 1, the parental node of \(e_1\) and \(f_1\) is a duplication node even though \(e_1\) and \(f_1\) belong to two different species.
Relation graph
A set of orthology/paralogy relations on \(\Gamma\) (or simply a relation set) is a pair \(C = (C_O, C_P)\) of subsets \(C_O, C_P \subseteq \binom{\Gamma}{2}\) such that \(C_O \cap C_P = \emptyset\) and if \(s(x) = s(y)\), then \(\{x, y\} \in C_P\). The relation set is said to be full if \(C_O \cup C_P = \binom{\Gamma}{2}\). A DS-tree G induces a full set \(({\mathcal O} (G), {\mathcal P} (G))\) of relations.
We adopt the graph representation considered in [14] for full relation sets. A relation graph R on a gene family \(\Gamma\) is a graph with vertex set \(V(R) = \Gamma\), in which we interpret each edge uv of the edge set E(R) of R as an orthology relation between u and v, and each missing edge (non-edge) \(uv \notin E(R)\) as a paralogy relation.^{Footnote 1} Notice that if \(s(u) = s(v)\), then \(uv \notin E(R)\). The relation graph of a DS-tree G, denoted by R(G), is the graph with vertex set \({\mathcal L}(G)\) and edge set \({\mathcal O} (G)\) (for example, see the relation graph R in Fig. 2).
A DS-tree for a gene family \(\Gamma\) leads to a relation graph, but the converse is not always true. A relation graph R is satisfiable if there exists a DS-tree G such that \(R(G) = R\). The problem of relation graph satisfiability has been addressed in [10]. The following theorem is a reformulation of one of the main results of that paper.
Theorem 1
([10]) A relation graph R is satisfiable if and only if R is \(P_4\)-free, meaning that no four vertices of R induce a path of length three.
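For small graphs, the condition of Theorem 1 can be tested by brute force. The following Python sketch is only an illustration: it enumerates all 4-subsets of vertices and takes time \(O(n^4)\), and the function name and graph encoding are assumptions of this example, not an algorithm taken from [10].

```python
from itertools import combinations, permutations


def is_p4_free(vertices, edge_list):
    """Return True iff no four vertices induce a path with three edges."""
    edges = {frozenset(e) for e in edge_list}

    def adj(u, v):
        return frozenset((u, v)) in edges

    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            is_path = adj(a, b) and adj(b, c) and adj(c, d)
            has_chord = adj(a, c) or adj(a, d) or adj(b, d)
            if is_path and not has_chord:
                return False  # a - b - c - d is an induced P4
    return True


path = [("a", "b"), ("b", "c"), ("c", "d")]   # the P4 itself
cycle = path + [("d", "a")]                   # a 4-cycle, which is P4-free
star = [("a", "b"), ("a", "c"), ("a", "d")]   # a star, which is P4-free
```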
For example, in Fig. 2, the relation graphs R and \(R''\) are satisfiable, while the graph \(R'\) is not. As a DS-tree does not necessarily represent a true history for \(\Gamma\) (see previous section and Definition 1), satisfiability of a relation graph does not ensure a possible translation in terms of a history for \(\Gamma\). For this to hold, R should be consistent with the species tree, according to the following definition.
Definition 2
Given a species tree S, a relation graph R for \(\Gamma\) is S-consistent if and only if R is satisfiable by a DS-tree G which is itself S-consistent.
For example the graph R in Fig. 2 is S-consistent. Notice that S-consistency implies satisfiability. Results from [14] complete the characterization of S-consistent graphs through Theorem 2. A triplet is a binary tree with leafset L of size three. For \(L = \{x, y, z\}\), we denote by \(xy|z\) the unique triplet T on L for which \(lca_T(x, y) \ne r(T)\) holds. Now \(P_3(R)\) is the subset of triplets of species induced by paths having exactly three vertices in \(R = (V, E)\): \(P_3(R) = \{ s(x)s(y)|s(z) : xz, yz \in E, \; xy \notin E \text{ and } s(x) \ne s(y) \}\).
We present in Theorem 2 a necessary and sufficient condition for Sconsistency of a relation graph in terms of \(P_3(R)\). First, we introduce in Lemma 2 an intermediate property, that is useful for proving Theorem 2.
Lemma 2
Let G be a DS-tree and S be a species tree. Then for any internal node v of G, there exist leaves x, y of \(G_v\) such that both the following hold: (1) \(lca_S(s(x), s(y)) = s_G(v)\) and (2) \(lca_G(x, y) = v\).
Proof
We first show that (1) must hold for some \(x, y \in {\mathcal L}(G_v)\). If \(s_G(v)\) has two children \(s_1\) and \(s_2\) for which there exist leaves x and y of \(G_v\) such that \(s_1\) is an ancestor of s(x) and \(s_2\) an ancestor of s(y), then (1) holds. Thus if we suppose that (1) does not hold, then \(s_G(v)\) has a child \(s'\) such that all leaves of \(G_v\) belong to a species that has \(s'\) as an ancestor. This implies that \(s'\) is a lower common ancestor than \(s_G(v)\) for the species present in \(G_v\), contradicting the definition of \(s_G\).
Now, take x and y satisfying (1). Suppose that (2) does not hold for x and y, i.e. \(lca_S(s(x), s(y)) = s_G(v)\), but that \(lca_G(x, y) \ne v\). Take \(z \in {\mathcal L}(G_v)\) such that z is separated from \(lca_G(x, y)\) by v (i.e. \(lca_G(z, lca_G(x, y)) = v\)). We have \(lca_G(x, z) = lca_G(y, z) = v\). If \(lca_S(s(x), s(z)) = s_G(v)\), then we are done as x and z satisfy both (1) and (2). Otherwise, \(lca_S(s(x), s(z))\) is on the \(s(x) - s_G(v)\) path, implying that \(lca_S(s(y), s(z)) = s_G(v)\) since \(s_G(v)\) separates s(x) from s(y). In this last case, y and z are the leaves of interest, ending the proof. \(\square\)
Theorem 2
Let \(R = (V,E)\) be a satisfiable relation graph and let S be a species tree. Then R is S-consistent if and only if S displays all the triplets of \(P_3(R)\).
Proof
\(\Rightarrow\) : let G be an S-consistent gene tree satisfying R, and let \(x, y, z \in V(R)\) such that \(zx, zy \in E(R)\) but \(xy \notin E(R)\) and \(s(x) \ne s(y)\). Then we must have \(zx, zy \in {\mathcal O} (G)\) and \(xy \in {\mathcal P} (G)\). We claim that S must display the \(s(x)s(y)|s(z)\) triplet. Let \(\alpha = lca_G(x, y), \beta = lca_G(x, z)\) and \(\gamma = lca_G(y, z)\). Since \(ev_G(\alpha ) \ne ev_G(\beta ) = ev_G(\gamma )\), \(xy|z\) must be a triplet of G. Moreover, since \(ev_G(\gamma ) = ev_G(\beta ) = Spec\), \(lca_S(s(x), s(y))\) and s(z) must be separated in S, implying that \(s(x)s(y)|s(z)\) is a triplet of S.
\(\Leftarrow\) : by assumption, R is satisfiable by some DS-tree \(G'\). We first obtain from \(G'\) a least-resolved DS-tree G satisfying R, in terms of speciation. That is, if \(G'\) has any speciation node v that has a speciation child w, we obtain \(G''\) by contracting v and w (delete w and give its children to v). Note that we have \({\mathcal O} (G') = {\mathcal O} (G'')\) and so \(G''\) still satisfies R. We obtain the DS-tree G by repeating this operation until we cannot find such a v and w. We claim that if S displays the triplets of \(P_3(R)\), then G is S-consistent.
Let v be a speciation node of G, and let \(v_1, v_2\) be any two distinct children of v. By the construction of G, \(ev_G(v_1) = ev_G(v_2) = Dup\). By Lemma 2, \(G_{v_1}\) has two leaves \(x_1, x_2\) such that \(lca_S(s(x_1), s(x_2)) = s_G(v_1)\) and \(lca_G(x_1, x_2) = v_1\). Similarly, \(G_{v_2}\) has two leaves \(y_1, y_2\) with \(lca_S(s(y_1), s(y_2)) = s_G(v_2)\) and \(lca_G(y_1, y_2) = v_2\). Since v is a speciation while \(v_1, v_2\) are duplications, we have \(x_1x_2, y_1y_2 \notin E(R)\) while \(x_1y_1, x_1y_2, x_2y_1, x_2y_2 \in E(R)\). Thus, \(x_1y_1x_2\) and \(x_1y_2x_2\) are induced paths of length two in R, which implies that S displays the \(s(x_1)s(x_2)|s(y_1)\) and \(s(x_1)s(x_2)|s(y_2)\) triplets. Analogously, S displays the \(s(y_1)s(y_2)|s(x_1)\) and \(s(y_1)s(y_2)|s(x_2)\) triplets. This is only possible if \(lca_S(s(x_1), s(x_2)) = s_G(v_1)\) and \(lca_S(s(y_1), s(y_2)) = s_G(v_2)\) are separated in S. We deduce that all child pairs of v are from separated species, and hence that G is S-consistent. \(\square\)
As an example, the graph \(R''\) in Fig. 2 is satisfiable but not S-consistent as the path of length 2 containing \(\{a_1, b_1, c_1\}\) induces the triplet \(ac|b\), while the triplet displayed by S is \(ab|c\).
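The condition of Theorem 2 can be sketched in Python as follows: collect the species triplets induced by paths on three vertices, then test whether S displays each of them. This is a minimal illustration that assumes R has already been checked to be satisfiable; the encodings, the function names and the toy instance (modeled on the situation described above, not copied from Fig. 2) are assumptions of this example.

```python
from itertools import combinations


def ancestors(parent, u):
    """Return the inclusive path from u up to the root."""
    path = [u]
    while parent[u] is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    """Lowest common ancestor of u and v in a child-to-parent encoding."""
    anc_u = set(ancestors(parent, u))
    return next(w for w in ancestors(parent, v) if w in anc_u)


def p3_triplets(vertices, edge_list, s):
    """Species triplets ({s(x), s(y)}, s(z)) for induced paths x - z - y."""
    edges = {frozenset(e) for e in edge_list}
    triplets = set()
    for z in vertices:
        nbrs = [u for u in vertices if frozenset((u, z)) in edges]
        for x, y in combinations(nbrs, 2):
            if frozenset((x, y)) not in edges and s[x] != s[y]:
                triplets.add((frozenset((s[x], s[y])), s[z]))
    return triplets


def displays(s_parent, pair, z):
    """S displays xy|z iff lca(x, y) lies strictly below lca(x, y, z)."""
    x, y = tuple(pair)
    xy = lca(s_parent, x, y)
    return lca(s_parent, xy, z) != xy


def triplets_displayed(vertices, edge_list, s, s_parent):
    """Condition of Theorem 2 (R is assumed to be satisfiable)."""
    return all(displays(s_parent, pair, z)
               for pair, z in p3_triplets(vertices, edge_list, s))


# Hypothetical species tree ((a, b), c), one gene per species.
s_parent = {"r": None, "ab": "r", "c": "r", "a": "ab", "b": "ab"}
s = {"a1": "a", "b1": "b", "c1": "c"}
genes = ["a1", "b1", "c1"]
centered_on_c = [("a1", "c1"), ("b1", "c1")]  # induces ab|c
centered_on_b = [("a1", "b1"), ("b1", "c1")]  # induces ac|b
```

The path centered on \(c_1\) induces the triplet \(ab|c\), which this S displays, whereas the path centered on \(b_1\) induces \(ac|b\), which it does not.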
We end this section with additional notations that will be of use later. A subgraph \(H'\) of H is a graph with \(V(H') \subseteq V(H)\) and \(E(H') \subseteq E(H)\). For a graph H and some \(V' \subseteq V(H)\), the subgraph of H induced by \(V'\), denoted \(H[V']\), is the subgraph of H with vertex set \(V'\) having the maximum number of edges. We say that \(H'\) is an induced subgraph of H if there is a subset \(V' \subseteq V(H)\) such that \(H' = H[V']\). If I is another graph, we say H is I-free if there is no \(V' \subseteq V(H)\) such that \(H[V']\) is isomorphic to I. Finally, for some edge set \(E' \subseteq E(H)\), \(H - E'\) is the subgraph \(H'\) with \(V(H') = V(H)\) and \(E(H') = E(H) \setminus E'\).
Relation correction problems
We raise the issue of leaving out a minimum of information from a relation graph R in order to reach satisfiability and S-consistency. Two optimality criteria are considered: (1) the minimum number of edges that need to be removed; (2) the maximum number of genes that can be kept.
The minimum edge-removal consistency problem
Based on the same construction used in paper [34], we show that adding the information on the species tree S does not make the problem of removing the minimum number of edges leading to a \(P_4\)-free graph simpler. Although a similar reduction is likely to hold in the general case of edge-modification (removal or insertion) [31], here we focus on edge removal, as this formulation is needed in subsequent developments. We show the NP-completeness of this problem, even when every gene from the family \(\Gamma\) comes from a distinct species.
Minimum edge-removal consistency problem:
Input: A relation graph R for a gene family \(\Gamma\), a species tree S and an integer k;
Output: “Yes” if and only if there exists an S-consistent subgraph \(R'\) of R with \(V(R') = V(R)\) such that \(|E(R) \setminus E(R')| \le k\).
Theorem 3
The Minimum Edge-Removal Consistency Problem is NP-complete, even if for any distinct \(g_1, g_2 \in \Gamma\), \(s(g_1) \ne s(g_2)\).
Proof
Given \(R'\) as a certificate, Theorem 2 easily translates into a polynomial-time algorithm to verify that \(R'\) is S-consistent. It is also clear that verifying whether \(|E(R) \setminus E(R')| \le k\) can be done quickly. The problem is therefore in NP. As for NP-hardness, the reduction is from the exact 3-cover problem, a classic NP-hard problem [35]: given a set \(W = \{w_1, \ldots , w_{3t} \}\) and a collection \(Z = \{Z_1, \ldots , Z_r\}\) of 3-element subsets of W, does there exist \(Z' \subseteq Z\) such that \(|Z'| = t\) and \(Z'\) is a partition of W? We assume that \(r \ge t\).
Given arbitrary W and Z, we construct R and S by first defining the species set \(\Sigma\). Let \(\alpha = \binom{3t}{2}\) and let \(X = \{X_1, \ldots , X_r\}\) and \(Y = \{Y_1, \ldots , Y_r\}\) be two collections of pairwise disjoint sets of species (i.e. for any distinct sets \(A, B \in X \cup Y\), \(A \cap B = \emptyset\)), with \(|X_i| = \alpha\) and \(|Y_i| = r^2\alpha\), for all \(1 \le i \le r\). Let \(X_{\Sigma } = \bigcup _{1 \le i \le r} X_i\) and \(Y_{\Sigma } = \bigcup _{1 \le i \le r} Y_i\) be the species in X and Y. Then the species set is \(\Sigma = W \cup X_{\Sigma } \cup Y_{\Sigma }\). Let \(S_{W}, S_X\) and \(S_Y\) be three trees such that \({\mathcal L}(S_{W}) = W, {\mathcal L}(S_X) = X_{\Sigma }\) and \({\mathcal L}(S_Y) = Y_{\Sigma }\). Then S is obtained by first connecting \(r(S_Y)\) with \(r(S_{W})\) to obtain a new tree \(S_{W Y}\), then connecting \(r(S_{W Y})\) with \(r(S_X)\) (see Fig. 3). Therefore S has exactly \(|\Sigma | = 3t + r(\alpha + r^2\alpha )\) leaves. The gene family \(\Gamma\) is then constructed so that it contains exactly one gene per species, as mentioned in the Theorem statement. In other words the mapping \(s : \Gamma \rightarrow \Sigma\) is a bijection. Thus for simplicity, we make no distinction between a gene g and its species s(g). We then define R with \(V(R) = \Sigma\) such that each of the sets \(W, X_1, \ldots , X_r, Y_1, \ldots , Y_r\) forms an individual clique. Finally we add two edge sets \(E_1\) and \(E_2\) to R, where \(E_1 = \{g_1g_2 : g_1 \in X_i, g_2 \in Z_i, \text{ for a given } 1 \le i \le r\}\) and \(E_2 = \{g_1g_2 : g_1 \in X_i, g_2 \in Y_i, \text{ for a given } 1 \le i \le r\}\). Then R has \(2r + 1\) cliques, namely \(W, X_1, \ldots , X_r, Y_1, \ldots , Y_r\). Also, for \(1 \le i \le r\), all edges between \(X_i\) and \(Y_i\) are present, as well as all edges between \(X_i\) and \(Z_i\).
Figure 3 gives an example with \(t=2\) and \(W = \{1,2,3,4,5,6\}\).
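To make the construction concrete, the following Python sketch builds the vertex sets and the edge set of R for a small instance, together with the set of edges deleted when an exact cover is known. It is an illustration under stated assumptions: the collection Z chosen here is hypothetical (it is not claimed to be the one of Fig. 3), the vertex encodings and function names are ours, and the species tree S is not built.

```python
from itertools import combinations
from math import comb


def build_relation_graph(t, Z):
    """Build W, the sets X_i and Y_i, and the edge set of R."""
    W = list(range(1, 3 * t + 1))
    r = len(Z)
    alpha = comb(3 * t, 2)
    X = [[("x", i, j) for j in range(alpha)] for i in range(r)]
    Y = [[("y", i, j) for j in range(r * r * alpha)] for i in range(r)]
    edges = set()
    for clique in [W] + X + Y:  # the 2r + 1 cliques
        edges.update(frozenset(p) for p in combinations(clique, 2))
    for i in range(r):
        edges.update(frozenset((x, w)) for x in X[i] for w in Z[i])  # E_1
        edges.update(frozenset((x, y)) for x in X[i] for y in Y[i])  # E_2
    return W, X, Y, edges, alpha


def deleted_edges(W, X, Z, cover):
    """Edges removed when the sets Z_i with i in `cover` partition W."""
    removed = set()
    for i in range(len(Z)):
        if i not in cover:  # drop all X_i - Z_i edges
            removed.update(frozenset((x, w)) for x in X[i] for w in Z[i])
    kept = {frozenset(p) for i in cover
            for p in combinations(sorted(Z[i]), 2)}  # the t triangles
    removed.update(frozenset(p) for p in combinations(W, 2)
                   if frozenset(p) not in kept)
    return removed


# Hypothetical instance with t = 2 and r = 3; Z[0] and Z[1] partition W.
t, Z = 2, [{1, 2, 3}, {4, 5, 6}, {3, 4, 5}]
W, X, Y, edges, alpha = build_relation_graph(t, Z)
removed = deleted_edges(W, X, Z, cover={0, 1})
budget = 3 * alpha * (len(Z) - t) + (alpha - 3 * t)
```

For this instance, \(\alpha = 15\), the graph has \(3t + r(\alpha + r^2\alpha ) = 456\) vertices, and the number of deleted edges equals \(3\alpha (r - t) + (\alpha - 3t) = 45 + 9 = 54\).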
Notice that the construction of R described above can clearly be done in polynomial time. We now show that W and Z admit an exact 3-cover if and only if R admits an S-consistent DS-tree after the deletion of at most \(3\alpha (r - t) + (\alpha - 3t)\) edges.
(\(\Rightarrow\)) : let \(Z' \subseteq Z\) be a partition of W, \(|Z'| = t\). Let \(R'\) be the subgraph of R in which all edges between \(Z_i\) and \(X_i\) are removed if and only if \(Z_i \notin Z'\) (which removes \(3\alpha (r - t)\) edges), and the only edges not removed from the W-clique are those belonging to a \(Z_i\) triangle with \(Z_i \in Z'\) (which removes \(\alpha - 3t\) edges). An example of \(R'\) is given in Fig. 3. Thus there are exactly \(3\alpha (r - t) + (\alpha - 3t)\) edges of R missing from \(R'\), as desired. Clearly, \(R'\) is \(P_4\)-free and thus satisfiable. To see that \(R'\) is S-consistent, we use Theorem 2. Notice that any induced path on three vertices in \(R'\) has the form \(wx_iy_i\) with \(w \in W, x_i \in X_i\) and \(y_i \in Y_i\) for some i, inducing the \(wy_i|x_i\) species triplet, which is in agreement with S. Therefore there exists an S-consistent gene tree \(G'\) satisfying \(R'\).
(\(\Leftarrow\)) : let \(R'\) be an S-consistent relation graph obtained by deleting at most \(3\alpha (r - t) + (\alpha - 3t)\) edges from R. Then, \(R'\) must be \(P_4\)-free. We show that \(R'[W]\) is partitioned into triangles which form a solution to the 3-cover instance. Let \(w \in W\). We claim that in \(R'\), there is exactly one \(X_i \in X\) such that w has neighbors in \(X_i\). Suppose first there are \(x_1 \in X_i\) and \(x_2 \in X_j\), \(i \ne j\), such that both \(x_1\) and \(x_2\) are neighbors of w in \(R'\). Then there is some \(y \in Y_i\) such that \(yx_1wx_2\) induce a \(P_4\), unless all edges between \(x_1\) and \(Y_i\) were deleted. But we reach a contradiction since there are \(r^2\alpha > 3\alpha (r - t) + (\alpha - 3t)\) such edges. Therefore w has neighbors in at most one \(X_i \in X\). Using that fact, we can see that w must have at least one neighbor in X, since otherwise at most \((3t - 1)\alpha\) edges between X and W would remain, implying the deletion of \(3 \alpha r - (3t - 1) \alpha = 3\alpha (r - t) + \alpha\) edges, more than permitted. This proves our claim.
Thus at best, each \(w \in W\) has \(\alpha\) neighbors in X, implying that at least \(3 \alpha r - 3t \alpha = 3\alpha (r - t)\) deleted edges are between X and W. This leaves a maximum of \(\alpha - 3t\) other edges that can be deleted.
Now, let C be a connected component of \(R'[W]\). We claim that all vertices of C must have their X neighbors in the same \(X_i \in X\). For suppose otherwise that there are two vertices \(c_1, c_2\) of C such that \(c_1\) has a neighbor \(x_1 \in X_i\) and \(c_2\) a neighbor \(x_2 \in X_j\) with \(i \ne j\). It is easy to see that such \(c_1\) and \(c_2\) can be chosen to be neighbors. Then \(x_1, c_1, c_2, x_2\) induce a \(P_4\), a contradiction. Thus all vertices of C have their X neighbors in a common \(X_i \in X\). Since each vertex of \(X_i\) has three neighbors in W, this implies that C has at most three vertices. Suppose that C has two vertices or fewer. Then since all vertices of \(R'[W]\) have at most two neighbors, it can have at most \(\frac{1}{2}(2(3t - 2) + 2) = 3t - 1\) edges (obtained by counting the sum of degrees). This, however, implies that at least \(\alpha - (3t - 1)\) additional edges were deleted, more than the \(\alpha - 3t\) available.
We conclude that \(R'[W]\) is partitioned into t connected components, each having three vertices. Moreover, each vertex in a given component C has neighbors in the same \(X_i \in X\), implying that \(Z_i\) contains the members of C. Finally since the components are all associated with a distinct \(Z_i\), \(R'[W]\) effectively defines a solution to the exact cover instance. \(\square\)
The Maximum Node Consistency problem
We introduce the Maximum Node Consistency Problem (in its decision version) and we consider the approximation complexity of the corresponding optimization version.
Maximum Node Consistency problem:
Input: A relation graph R for a gene family \(\Gamma\), a species tree S and an integer k;
Output: “Yes” if and only if there exists an S-consistent induced subgraph \(R'\) of R with \(|V(R')| \ge k\).
We show that Maximum Node Consistency is hard to approximate within a factor \(d n^{\frac{1}{2}(1-\varepsilon )}\) for any \(0 < \varepsilon < 1\) and any constant \(d>0\), by giving a gap-preserving reduction from Maximum Independent Set (n is the number of nodes of R). We refer the reader to [36] for a definition of gap-preserving reduction. Consider an instance \(H=(V_H,E_H)\) of Maximum Independent Set, with \(|V_H|=m\). We construct an instance of Maximum Node Consistency as follows.
First, we define the set of genes \(\Gamma\), i.e. the nodes of the relation graph R. Denote \(V_H = \{v_1, \ldots , v_m\}\) and for each \(v_i \in V_H\), we define a set \(I(v_i)\) of m genes: \(I(v_i)= \{ r_{i,j}: 1 \le j \le m \}\). The gene set \(\Gamma\) is \(\bigcup _{v_i \in V_H} I(v_i)\).
Now, we define the species tree S. First consider S as any binary tree over m leaves \({\ell }_1, \ldots , {\ell }_m\), and replace each leaf \({\ell }_i\) by any binary subtree \(T_i\) having m leaves (thus S has \(m^2\) leaves). Each gene in \(I(v_i)\) is mapped to a leaf of \(T_i\) in a bijective manner, and so each species has exactly one gene in R. We make no distinction between \(g \in \Gamma\) and s(g).
Now, define the relation graph \(R=(V_R,E_R)\). Set \(V_R = \Gamma\), so that \(n = |V_R| = m^2\). For each \(v_i \in V_H\), \(I(v_i)\) forms a clique in R. Moreover, for each \(\{v_i,v_j\} \in E_H\), define an edge \(\{ r_{i,t}, r_{j,t} \} \in E_R\), for each t with \(1 \le t \le m\).
Let \(R'\) be a solution of Maximum Node Consistency over instance (R, S). Denote by \(R'(v_i)\) the subset of nodes \(V(R') \cap I(v_i)\), that is, those nodes of \(I(v_i)\) that have not been removed. We pay particular attention to those \(R'(v_i)\) that contain more than one node.
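The construction of R from H can be sketched in a few lines. This is only an illustrative sketch: the function name `build_relation_graph`, the 0-based indexing and the encoding of gene \(r_{i,j}\) as the pair `(i, j)` are assumptions made here, and the species tree S is left out.

```python
from itertools import combinations

def build_relation_graph(m, h_edges):
    """Build the relation graph R from a graph H on vertices 0..m-1.

    Gene r_{i,j} is represented by the pair (i, j), with 0 <= i, j < m
    (hypothetical encoding, 0-based instead of the 1-based text).
    """
    nodes = [(i, j) for i in range(m) for j in range(m)]
    edges = set()
    # Each I(v_i) = {r_{i,0}, ..., r_{i,m-1}} forms a clique in R.
    for i in range(m):
        for j1, j2 in combinations(range(m), 2):
            edges.add(frozenset({(i, j1), (i, j2)}))
    # Each edge {v_i, v_j} of H yields the m edges {r_{i,t}, r_{j,t}}.
    for i, j in h_edges:
        for t in range(m):
            edges.add(frozenset({(i, t), (j, t)}))
    return nodes, edges

# Example: H is the path v_0 - v_1 - v_2.
nodes, edges = build_relation_graph(3, [(0, 1), (1, 2)])
```

For this small H, R has \(m^2 = 9\) nodes, nine clique edges and six edges coming from \(E_H\).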
Lemma 3
Let \(R'(v_i), R'(v_j)\) be two subsets of nodes of a solution \(R'\) of Maximum Node Consistency over instance (R, S) such that \(|R'(v_i)| \ge 2\) and \(|R'(v_j)| \ge 2\). Then there is no edge with one endpoint in \(R'(v_i)\) and the other in \(R'(v_j)\).
Proof
Assume on the contrary that there is some q such that \(r_{i,q} \in R'(v_i)\) and \(r_{j,q} \in R'(v_j)\) share an edge. Consider a node \(r_{i,z}\) of \(R'(v_i) \setminus \{ r_{i,q} \}\), which must exist since \(|R'(v_i)| \ge 2\). The \(P_3\) induced by \(r_{i,z}, r_{i, q}\) and \(r_{j, q}\) implies the triplet \((r_{i,z},r_{j,q}|r_{i,q})\), while S contains the triplet \((r_{i,z},r_{i,q}|r_{j,q})\), contradicting the S-consistency of \(R'\). \(\square\)
Now, we are ready to prove the main result of this section.
Lemma 4
Let a graph H be an instance of Maximum Independent Set with m nodes, and let (R, S) be the corresponding instance of Maximum Node Consistency with \(n = m^2\) nodes. Then

1.
Given an independent set \(V'\) of H, we can compute in polynomial time a solution of Maximum Node Consistency of size at least \(|V'|m\);

2.
Given a solution of Maximum Node Consistency on instance (R, S) of size at least km, we can compute in polynomial time an independent set \(V'\) of H such that \(|V'| \ge k\).
Proof

1.
Consider an independent set \(V'\) of H and define a solution of Maximum Node Consistency on instance (R, S) of size at least \(|V'|m\) as follows: remove each node of \(I(v_i)\) if and only if \(v_i \notin V'\). Let \(R'\) be the corresponding solution of Maximum Node Consistency. Since \(V'\) is an independent set, it follows that \(R'\) consists only of cliques \(R'(v_i)\), disconnected from one another. It has \(|V'|m\) nodes and, as \(R'\) is \(P_3\)-free, it is S-consistent.

2.
The case \(k = 1\) is trivial so we assume \(k > 1\). Consider a solution \(R'\) of Maximum Node Consistency on instance (R, S) of size at least km, and consider the subsets \(R'(v_i)\) in \(R'\) such that \(|R'(v_i)| > 1\). Notice that we can assume that there exist at least k such sets, otherwise \(R'\) would contain at most \((k-1)m + m - (k - 1) < km\) nodes.
Given an index j, consider the set \(R^j= \{ r_{i,j} \in R'(v_i): |R'(v_i)| > 1, 1 \le i \le m \}\), i.e. the nodes with index j that belong to some subset \(R'(v_i)\) of size larger than one. By Lemma 3 each set \(R^j\) is an independent set. Now, pick the set \(R^j\) having maximum cardinality. It follows that \(R^j\) contains at least k nodes, since otherwise \(R'\) would have at most \(m(k - 1) + m - k < mk\) nodes. Hence, \(V'= \{v_i: r_{i,j} \in R^j \}\) is an independent set of size at least k, thus concluding the proof. \(\square\)
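The second part of the proof is constructive, and the extraction of the independent set can be sketched as follows. The function name and the encoding of gene \(r_{i,j}\) as a pair `(i, j)` are assumptions of this sketch; the input is assumed to be the node set of an S-consistent solution \(R'\), which the code does not verify.

```python
from collections import defaultdict

def independent_set_from_solution(kept):
    """kept: iterable of pairs (i, j) for the genes r_{i,j} kept in R'."""
    by_vertex = defaultdict(set)
    for i, j in kept:
        by_vertex[i].add(j)
    # Keep only the subsets R'(v_i) containing more than one node.
    large = {i: js for i, js in by_vertex.items() if len(js) > 1}
    # R^j collects the vertices v_i with r_{i,j} in a large subset R'(v_i).
    r = defaultdict(set)
    for i, js in large.items():
        for j in js:
            r[j].add(i)
    if not r:
        return set()
    # By Lemma 3 every R^j is independent; return one of maximum cardinality.
    return max(r.values(), key=len)

# H is the path v_0 - v_1 - v_2 (0-based); the solution keeps I(v_0) and I(v_2).
kept = [(0, j) for j in range(3)] + [(2, j) for j in range(3)]
recovered = independent_set_from_solution(kept)
```

Here the solution has \(2m = 6\) nodes and the recovered set contains the two end vertices of the path, which are indeed non-adjacent in H.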
We say that a maximization problem cannot be approximated within a factor \(\alpha\) if, unless \(P = NP\), for any approximation algorithm \({\mathcal A}\) there are infinitely many instances for which \({\mathcal A}\) outputs a solution with value AP such that \(AP < \frac{1}{\alpha } OPT\), where OPT is the optimal value of a solution to the problem (note that equivalently, \(\frac{OPT}{AP} > \alpha\)). It is well known that Maximum Independent Set cannot be approximated within a factor \(cm^{1 - \varepsilon }\) for any \(0 < \varepsilon < 1\) and for any constant \(c > 0\) [37].
Theorem 4
The optimization version of Maximum Node Consistency cannot be approximated within a factor \(d n^{\frac{1}{2}(1-\varepsilon )}\) for any \(0 < \varepsilon < 1\) and for any constant \(d>0,\) where n is the number of nodes of the given relation graph. Moreover, this result holds even on instances in which for any distinct \(g_1, g_2 \in \Gamma\), \(s(g_1) \ne s(g_2)\).
Proof
Let H be a graph with m nodes and let (R, S) be the corresponding instance of Maximum Node Consistency with \(n = m^2\) nodes. Denote by \(OPT_I\) and \(OPT_N\), respectively, the values of an optimal solution for Maximum Independent Set and Maximum Node Consistency. Let \({\mathcal A}_N\) be any approximation algorithm for Maximum Node Consistency, and let \({\mathcal A}_I\) be the approximation algorithm for Maximum Independent Set that, on input H, runs \({\mathcal A}_N\) on the corresponding instance (R, S) and returns the independent set resulting from Lemma 4. Let \(AP_I(H)\) and \(AP_N(R, S)\) denote, respectively, the sizes of the solutions found by \({\mathcal A}_I(H)\) and \({\mathcal A}_N(R,S)\). By Lemma 4 we get that \(AP_I(H) \ge \lfloor AP_N(R, S)/m \rfloor \ge AP_N(R, S)/m - 1\) and \(OPT_N(R,S) \ge OPT_I(H)m\). Now,
\[\frac{OPT_N(R,S)}{AP_N(R,S)} \ge \frac{OPT_I(H)\, m}{(AP_I(H) + 1)\, m} \ge \frac{OPT_I(H)}{2 AP_I(H)}\]
as we may assume that \(AP_I(H) \ge 1\). Since Maximum Independent Set cannot be approximated within a factor \(c m^{1-\varepsilon }\), for any \(0 < \varepsilon < 1\) and any constant \(c>0\), then for any \(0 < \varepsilon < 1\) and any constant \(c>0\) there exist infinitely many instances H on which \(\frac{OPT_I(H)}{2AP_I(H)}~>~cm^{1-\varepsilon }\). Thus, it follows that
\[\frac{OPT_N(R,S)}{AP_N(R,S)} > c m^{1-\varepsilon } = c n^{\frac{1}{2}(1-\varepsilon )}\]
on infinitely many instances. Finally, the fact that the result holds even on instances in which for any distinct \(g_1, g_2 \in \Gamma\), \(s(g_1) \ne s(g_2)\) follows from the construction of R. \(\square\)
We get the following as an immediate corollary, which will be of use later:
Corollary 1
The decision version of Maximum Node Consistency is NP-Hard, even on instances in which for any distinct \(g_1, g_2 \in \Gamma,\) \(s(g_1) \ne s(g_2).\)
Gene tree correction problems
In this section, we are given a gene family \(\Gamma\), a species tree S, an S-consistent DS-tree G for \(\Gamma\), and a set \(C = (O, P)\) of orthology/paralogy constraints (not necessarily full). We focus on the problem of correcting G according to C in a minimal way. The goal is thus to find a DS-tree \(G'\) inducing C such that the difference between G and \(G'\) is minimum. We consider two ways of measuring the difference (or symmetrically the similarity) between gene trees, one based on conserved orthology/paralogy relations induced by the two trees, and one based on the number of conserved clades between the two trees, which is the Robinson-Foulds distance in the case that G, \(G'\) and S are all binary trees.
The Maximum Homology Correction problem
Maximum Homology Correction problem:
Input: A species tree S, an S-consistent DS-tree G for a gene family \(\Gamma\), an integer k, a set O of orthology relations and a set P of paralogy relations;
Output: “Yes” if there exists an S-consistent DS-tree \(G'\) for \(\Gamma\) with \(O \subseteq {\mathcal O} (G')\), \(P \subseteq {\mathcal P} (G')\) such that \(|{\mathcal O} (G) \cap {\mathcal O} (G')| + |{\mathcal P} (G) \cap {\mathcal P} (G')| \ge k\).
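To make the objective concrete, the following sketch computes the relations induced by two event-labeled trees and the number of relations they share. The tree encoding (a gene name, or a pair of an event label `"Spec"`/`"Dup"` and a list of children) and all function names are assumptions of this sketch; S-consistency of the trees is not checked here.

```python
from itertools import combinations

def leaves(tree):
    """Leaves of a DS-tree given as a gene name or (event, [children])."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [leaf for child in children for leaf in leaves(child)]

def relations(tree):
    """Return (orthologs, paralogs) induced by the event labels of the tree."""
    orth, para = set(), set()
    if isinstance(tree, str):
        return orth, para
    event, children = tree
    # Two genes under different children have this node as their LCA.
    for c1, c2 in combinations(children, 2):
        for a in leaves(c1):
            for b in leaves(c2):
                (orth if event == "Spec" else para).add(frozenset({a, b}))
    for child in children:
        o, p = relations(child)
        orth |= o
        para |= p
    return orth, para

def correction_score(g, g_prime, required_o, required_p):
    """Number of shared relations, or None if G' violates O or P."""
    o1, p1 = relations(g)
    o2, p2 = relations(g_prime)
    if not (required_o <= o2 and required_p <= p2):
        return None
    return len(o1 & o2) + len(p1 & p2)

g = ("Spec", [("Spec", ["a", "b"]), "c"])
g_prime = ("Spec", [("Dup", ["a", "b"]), "c"])
score = correction_score(g, g_prime, set(), {frozenset({"a", "b"})})
```

In this toy instance, requiring a and b to be paralogs breaks one of the three orthologies of G, so the two trees share two relations.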
Theorem 5
The Maximum Homology Correction problem is NP-Complete, even if S, G and \(G'\) are required to be binary.
Proof
The problem is clearly in NP, as verifying S-consistency can be done in polynomial time, as well as counting the common orthology/paralogy relations (the set of relations is quadratic in size). For our reduction, we use the Minimum Edge-Removal Consistency problem for the case of a gene family with at most one gene per genome, which is NP-Hard by Theorem 3. Given a species tree S, a relation graph R with V(R) in bijection with \({\mathcal L}(S)\) and an integer k, we construct an instance of the Maximum Homology Correction Problem, i.e. a species tree \(S'\), a DS-tree G, a set O of orthology relations and a set P of paralogy relations.
Let \(S' = S\) and construct G by mimicking S, that is, by first copying S and its leaf labels, then replacing each leaf \({\ell }\) of G by the gene \(s^{-1}({\ell })\). Note that if S is binary, then so is G. All internal nodes of G are labeled as speciations, so all genes of \(\Gamma\) are pairwise orthologous. Thus R(G) is a clique. Finally, let \(O = \emptyset\) and \(P = \{g_1g_2 : g_1g_2 \notin E(R) \}\). Therefore the objective is to break a minimum number of orthologies of G in order to satisfy P.
We show that there is an S-consistent subgraph \(R'\) of R obtained by removing at most k edges if and only if there is an \(S'\)-consistent DS-tree \(G'\) satisfying O and P with at most \(|P| + k\) relations that are not induced by G.
\(\Rightarrow\) : Let \(R'\) be a solution to the Minimum Edge-Removal Consistency Problem for R and S. Then there exists an S-consistent DS-tree \(G'\) satisfying \(R'\), which is obtained by deleting at most k edges from R. By Lemma 1, we may assume that if S is binary, then so is \(G'\). Now, since \(R'\) has at most \(|P| + k\) non-edges, \(G'\) has at most \(k + |P|\) paralogs and is therefore a solution to the constructed instance of the Maximum Homology Correction Problem that breaks at most \(k + |P|\) orthologies of R(G).
\(\Leftarrow\) : Let \(G'\) be a solution, binary or not, to the constructed Maximum Homology Correction Problem instance and let \(R' = R(G')\). Since \(G'\) satisfies P and breaks at most \(|P| + k\) orthologies, \(R'\) must have P as non-edges, plus at most k other non-edges. Thus \(R'\) can be obtained by removing at most k edges from \(R(G) - P = R\), as desired. \(\square\)
The Maximum Clade Correction problem
Maximum Clade Correction problem:
Input: A gene tree G, a species tree S, a set O of orthology and a set P of paralogy relations and an integer k;
Output: “Yes” if there exists an S-consistent DS-tree \(G'\) satisfying O and P such that G and \(G'\) have at least k clades in common.
Notice that if S, G and \(G'\) are required to be binary, the effective measure between G and \(G'\) is the Robinson-Foulds distance. This special case is handled as part of the general proof. First, however, we need the following lemma, which uses grafting operations to add leaves to G and satisfy a prescribed relation without breaking other relations.
Given two trees \(T_1\) and \(T_2\), connecting \(T_1\) with \(T_2\) corresponds to creating a new node x and giving it \(r(T_1)\) and \(r(T_2)\) as its two children. Grafting a new leaf x to a tree T corresponds to adding x to \({\mathcal L}(T)\) by either: (1) adding x as a new child of some node u of T; (2) connecting T with x; (3) subdividing an edge uv and adding x as a child of the newly created vertex.
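The connecting operation and the three grafting operations can be illustrated on a minimal tree structure. The `Node` class and the function names below are hypothetical helpers introduced only for this sketch; event labels and species mappings are deliberately left out.

```python
class Node:
    """Minimal rooted tree node; a node without children is a leaf."""

    def __init__(self, label=None, children=()):
        self.label = label
        self.parent = None
        self.children = []
        for child in children:
            self.add_child(child)

    def add_child(self, child):
        child.parent = self
        self.children.append(child)

    def leaf_labels(self):
        if not self.children:
            return [self.label]
        return [l for c in self.children for l in c.leaf_labels()]

def connect(t1, t2):
    """Create a new node whose two children are r(T1) and r(T2)."""
    return Node(children=[t1, t2])

def graft_as_child(u, x):
    """(1) Add leaf x as a new child of node u."""
    u.add_child(x)

def graft_above_root(root, x):
    """(2) Connect T with x; returns the new root."""
    return connect(root, x)

def graft_on_edge(u, v, x):
    """(3) Subdivide edge uv with a new vertex p and add x as a child of p."""
    p = Node()
    u.children[u.children.index(v)] = p  # p takes the place of v under u
    p.parent = u
    p.add_child(v)
    p.add_child(x)
    return p

# Example: graft x on the edge between the root and the clade {a, b}.
v = Node(children=[Node("a"), Node("b")])
root = Node(children=[v, Node("c")])
p = graft_on_edge(root, v, Node("x"))
```

After the call, the new vertex p has the clade {a, b} and the leaf x as its children, which is the situation considered in the proof of Lemma 5 when x is grafted on an edge uv.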
Lemma 5
Let G be an S-consistent gene tree, for some species tree S. Let x be a gene not in G and y be some gene in G with \(s(x) \ne s(y)\). Then there exists a gene tree \(G'\) obtained by grafting x to G such that the following conditions are satisfied:

1
x and y are orthologs in \(G'\);

2
\({\mathcal O} (G) \subseteq {\mathcal O} (G')\) and \({\mathcal P} (G) \subseteq {\mathcal P} (G')\);

3
\(G'\) is S-consistent.
Proof
If \(s_G(r(G))\) is a strict descendant of \(lca_S(s(x), s(y))\), then it is easy to see that connecting x to r(G) under a common parent yields the desired result. So we assume \(s_G(r(G))\) is an ancestor of \(lca_S(s(x), s(y))\). If there is some node u of G such that adding x as a child of u satisfies the three conditions of the Lemma, then we are done. So assume that there is no node u to which we can add x as a child.
Let uv be an edge of G, and suppose that we graft x on uv to obtain \(G'\). Call p the parent of x in \(G'\), and say that u is the parent of p (i.e. p has children x and v). Note that if \(s_G(u) = s_{G'}(u)\), then \(s_G(w) = s_{G'}(w)\) for any \(w \in V(G) \cap V(G')\), implying that setting \(ev_{G'}(z) = ev_{G}(z)\) for all \(z \in V(G) \cap V(G') \setminus \{u\}\) preserves S-consistency. We will find such a uv that guarantees this \(s_G(u) = s_{G'}(u)\) property, while ensuring that \(lca_{G'}(x, y)\) can be a speciation (i.e. \(ev_{G'}(lca_{G'}(x, y)) = Spec\) is S-consistent), and that \(ev_G(u) = ev_{G'}(u)\) is S-consistent. This will prove the Lemma.
Let \(s_{xy} = lca_S(s(x), s(y))\), and let g be the lowest ancestor of y in G such that \(s_G(g)\) is \(s_{xy}\) or an ancestor of \(s_{xy}\). Note that the case in which g does not exist was handled in the beginning of the proof. Now suppose that \(ev_{G}(g) = Dup\). Denote by \(g'\) the child of g that is also an ancestor of y. Note that \(s_G(g)\) is an ancestor of \(s_{xy}\) and \(s_G(g')\) is a strict descendant of \(s_{xy}\). We claim that \(uv = gg'\). To see this, obtain \(G'\) by grafting x to \(gg'\), p being the parent of x and g the parent of p. Then, \(s_{G'}(p) = s_{xy}\), and its children species \(s_{G'}(g')\) and \(s_{G'}(x)\) are separated in S by our choice of \(g'\). Thus setting \(ev_{G'}(p) = ev_{G'}(lca_{G'}(x, y)) = Spec\) preserves S-consistency. Also, \(s_G(g) = s_{G'}(g)\) since \(s_G(g)\) is already an ancestor of \(s_{xy} = s_{G'}(p)\). Finally, we are free to set \(ev_{G'}(g) = Dup\), satisfying all the required conditions.
So instead suppose that \(ev_G(g) = Spec\). Recall that adding x as a child of g to obtain a new tree \(G''\) is not a solution. Since in \(G''\), \(lca_{G''}(x, y) = g\), we must either have \(s_G(g) \ne s_{G''}(g)\), or \(ev_{G''}(g) = Spec\) is not S-consistent. By the choice of g, only the latter is possible, implying that all children of g are from separated species in G, but not in \(G''\). Therefore, there must be a child \(g'\) of g such that \(s_G(g')\) is an ancestor of s(x). Note that \(g'\) must be unique since otherwise, \(ev_G(g) = Spec\) would not be possible. We then claim that \(uv = gg'\). Indeed, obtain \(G'\) by grafting x to \(gg'\), p being the parent of x. We have \(s_G(p) = s_{G'}(g')\), and we set \(ev_{G'}(p) = Dup\). The species of the children of g remain unchanged in \(G'\), and so \(s_G(g) = s_{G'}(g)\) and \(ev_{G'}(g) = ev_{G'}(lca_{G'}(x, y)) = Spec\) is S-consistent, again satisfying all required conditions. \(\square\)
Theorem 6
The Maximum Clade Correction Problem is NP-Complete, even if S, G and \(G'\) are required to be binary.
Proof
Verifying S-consistency and comparing the sets of clades of G and \(G'\) can clearly be done in polynomial time, thus the problem is in NP. We use the Maximum Node Consistency problem for our reduction, which is NP-Hard by Corollary 1. Let R, S and k be the Maximum Node Consistency instance, where R is the relation graph with \(V(R) = \{v_1, \ldots , v_n\}\), S the species tree and k an integer. Let \(\alpha = n(n - 1 - k) + 2k\) (noting that \(\alpha > 0\) when \(k \le n\)). The constructed instance of the Maximum Clade Correction Problem uses the same species tree S. Construct G as follows: first consider G as any binary tree with n leaves \(l_1, \ldots , l_n\), where each leaf \(l_i\) is mapped to vertex \(v_i\) of R. Then, replace each leaf \(l_i\) by a subtree \(T_i\) constructed as follows: \(T_i\) is a caterpillar tree with \(n - 1 + \alpha\) leaves, and each leaf \({\ell }\) of \(T_i\) is such that \(s({\ell }) = s(v_i)\) (recall that a caterpillar tree is a path to which we add a leaf child to each internal node). Let \(L_i\) denote the set of the \(n-1\) deepest leaves of \(T_i\) (the depth of a leaf \({\ell }\) being the number of nodes on the path between \({\ell }\) and the root). Each leaf of \(L_i\) is mapped to a distinct node of \(V(R) \setminus \{v_i\}\). Denote by \({\ell }_{i, j}\) the leaf of \(T_i\) mapped to \(v_j\), and by \(N_i\) the subtree of \(T_i\) rooted at \(lca(L_i)\). Then G has exactly \(n(n - 1 + \alpha )\) leaves and \(n(n - 1 + \alpha ) - 1\) clades (since it is binary). An example is given in Fig. 4. Finally define \(O = \{ \{{\ell }_{i,j}, {\ell }_{j,i} \} : v_iv_j \in E(R)\}\) the set of orthology relations to satisfy and \(P = \{ \{{\ell }_{i,j}, {\ell }_{j,i} \} : v_iv_j \notin E(R)\}\) the set of paralogy relations to satisfy. Note that each \({\ell }_{i,j}\) is present in exactly one relation.
We show that R admits an S-consistent induced subgraph with at least k nodes if and only if G, O and P admit an S-consistent DS-tree \(G'\) satisfying O and P such that G and \(G'\) share at least \(k(\alpha + n - 2)\) clades.
(\(\Rightarrow\)) Let \(R'\) be a solution to the Maximum Node Consistency instance, with \(|V(R')| \ge k\), and let H be a DS-tree satisfying \(R'\) that is S-consistent. By Lemma 1, we may assume that if S is binary, then so is H. Now, since \({\mathcal L}(H) \subseteq V(R)\), to each leaf \(v_i\) of H corresponds a subtree \(T_i\) in G. Then build a DS-tree \(G^*\) from H by replacing each leaf \(v_i\) of H by \(T_i\), and labeling all internal nodes of inserted trees as Dup (in Fig. 4, \(G^*\) is the subtree of \(G'\) rooted at the common ancestor of \(T_a, T_b\) and \(T_c\)). We first argue that \(G^*\) is S-consistent and satisfies the subsets of O and P restricted to \({\mathcal L}(G^*)\). In a subsequent step, we will graft the genes missing from \(G^*\) using Lemma 5. Notice that \(s_H(v_i) = s_{G^*}(r(T_i))\) and that all nodes of H that are in \(G^*\) have the same LCA-mapping in both trees. It follows that \(G^*\) is also S-consistent. Also, for all \(v_i, v_j \in {\mathcal L}(H)\), \(ev_H(lca_H(v_i, v_j)) = ev_{G^*}(lca_{G^*}(r(T_i), r(T_j)))\). Thus for any pair of leaves \({\ell }_{i,j}, {\ell }_{j,i}\) in \({\mathcal L}(G^*)\) such that \(\{ {\ell }_{i,j}, {\ell }_{j, i} \} \in O\), \(lca_{G^*}({\ell }_{i,j}, {\ell }_{j, i})\) is a speciation (by the construction of O from R and the fact that H satisfies \(R'\)). By the same reasoning, if \(\{ {\ell }_{i,j}, {\ell }_{j, i} \} \in P\) then \({\ell }_{i,j}\) and \({\ell }_{j,i}\) are paralogous in \(G^*\).
The solution \(G'\) is obtained by grafting to \(G^*\) every leaf of G missing from \(G^*\) whilst preserving the \(T_i\) clades, maintaining satisfiability of O and P and S-consistency. If such a \(G'\) exists, then G and \(G'\) share at least k identical subtrees from \(\{T_1, \ldots , T_n\}\), and since each \(T_i\) contains \(\alpha + n - 2\) clades, it follows that G and \(G'\) share at least \(k(\alpha + n - 2)\) clades as required. Let \(L = {\mathcal L}(G) \setminus {\mathcal L}(G^*)\) be the leaves yet missing from \(G^*\). Let \(L_O = \{{\ell }\in L : \exists {\ell }' \in {\mathcal L}(G^*), {\ell }{\ell }' \in O\}\) (i.e. the leaves of L subject to an orthology constraint with some leaf already in \(G^*\)). The complement \(\overline{L_O}\) is the set of leaves of L that are either subject to a paralogy constraint with some leaf of \(G^*\), or not subject to any constraint with any leaf of \(G^*\). Let \(R(\overline{L_O})\) be the relation graph with vertex set \(\overline{L_O}\) and edge set \(\{{\ell }_1{\ell }_2 : {\ell }_1, {\ell }_2 \in \overline{L_O}, {\ell }_1{\ell }_2 \in O\}\), depicting the required orthologies within \(\overline{L_O}\). Recall that each leaf of L is contained in at most one relation, implying that each node of \(R(\overline{L_O})\) has degree at most 1. Thus \(R(\overline{L_O})\) is \(P_3\)-free and therefore is S-consistent. Let \(G_{L_O}\) be a DS-tree satisfying \(R(\overline{L_O})\) that is S-consistent, assuming that \(G_{L_O}\) is binary if S is. We update \(G^*\) by joining \(r(G^*)\) and \(r(G_{L_O})\) under a common parent x, and labeling x as Dup. Notice that this does not modify any orthology or paralogy relation previously in \(G^*\) or in \(G_{L_O}\), nor does it break S-consistency. This also ensures that paralogies \({\ell }_1{\ell }_2 \in P\) with \({\ell }_1 \in \overline{L_O}\) and \({\ell }_2 \in {\mathcal L}(G^*)\) are satisfied.
The final step is to graft the leaves of \(L_O\) to \(G^*\) in a way satisfying orthology requirements. This is done by successively applying Lemma 5 to each \({\ell }\in L_O\). As shown, each such \({\ell }\) can be grafted into \(G^*\) without modifying any orthology or paralogy relation already in \(G^*\) whilst satisfying the orthology requirement that \({\ell }\) is subject to. It is straightforward to see that, in addition, \({\ell }\) can be grafted without breaking any \(T_i\) clade present in \(G^*\), since every vertex in \(T_i\) is mapped to the same species. The tree \(G'\) obtained after all these grafting operations satisfies O and P and has the required common clades with G.
(\(\Leftarrow\)) Let \(G'\) be a solution, binary or not, to the Maximum Clade Correction Problem instance. Denote by C the number of clades shared by both G and \(G'\), with \(C \ge k(\alpha + n - 2)\). Recall that \(L_i\) is the set of the \(n - 1\) deepest leaves of \(T_i\) in G, with \(N_i\) being the subtree rooted at \(lca_G(L_i)\). Denote by \(G'_{L_i}\) the subtree of \(G'\) rooted at \(lca_{G'}(L_i)\). We say that \(N_i\) was preserved if every leaf of \(G'_{L_i}\) belongs to \({\mathcal L}(T_i)\) (in other words, the \(N_i\) clade might have been extended, but only to include other leaves from \(T_i\)). We claim that at least k of the \(N = \{N_1, \ldots , N_n\}\) subtrees are preserved in \(G'\). Assume, on the contrary, that at least \(n - k + 1\) subtrees from N are not preserved. Take a non-preserved subtree \(N_i\). Then some leaf \({\ell }\notin {\mathcal L}(T_i)\) belongs to the \(lca_{G'}(L_i)\) clade. This implies that for any ancestor x of \(r(N_i)\) in G, \(G'\) cannot contain the x clade. By construction of \(T_i\), \(r(N_i)\) has at least \(\alpha\) ancestors in G. Therefore, \(C \le n(n - 1 + \alpha ) - 1 - \alpha (n - k + 1)\). This leads to \(k(\alpha + n - 2) \le C \le n(n - 1 + \alpha ) - 1 - \alpha (n - k + 1)\), and then to \(\alpha \le n(n - 1 - k) + 2k - 1\), contradicting our choice of \(\alpha\).
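For readers who wish to check the last step, the inequality can be expanded as follows. This is a worked expansion using only the quantities defined in the proof, not an additional argument.

```latex
\begin{align*}
k(\alpha + n - 2) &\le n(n - 1 + \alpha) - 1 - \alpha(n - k + 1) \\
k\alpha + kn - 2k &\le n^2 - n + n\alpha - 1 - n\alpha + k\alpha - \alpha \\
kn - 2k &\le n^2 - n - 1 - \alpha \\
\alpha &\le n^2 - n - kn + 2k - 1 = n(n - 1 - k) + 2k - 1
\end{align*}
```

Since \(\alpha\) was set to \(n(n - 1 - k) + 2k\), the last line would give \(\alpha \le \alpha - 1\), which is impossible.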
Now, let \(N^p = \{N_i \in N : N_i\) is preserved in \(G'\}\). We have \(|N^p| \ge k\). Let \(L = \bigcup _{\{i:N_i \in N^p\}} L_i\) and \(H = G'_{L}\). Notice that to each \(N_i \in N^p\) corresponds exactly one subtree \(N'_i\) in H such that \({\mathcal L}(N_i) = {\mathcal L}(N'_i)\) (and all such \(N'_i\) subtrees are disjoint). Let \(H^*\) be the tree obtained by replacing every subtree \(N'_i\) in H by \(v_i\). Replacing \(N'_i\) by \(v_i\) changes no LCA-mapping value since all vertices of \(N'_i\) map to \(s(v_i)\). Thus, as \(G'\) is S-consistent, so are H and \(H^*\). Now, we claim that \(H^*\) induces the set of relations represented by \(R' = R[{\mathcal L}(H^*)]\), which proves the theorem since \(|{\mathcal L}(H^*)| = |N^p| \ge k\). By contradiction, suppose that \(v_iv_j \in E(R')\) but \(lca_{H^*}(v_i, v_j)\) is labeled Dup. Then \(lca_H({\ell }_{i,j}, {\ell }_{j,i})\) is also labeled Dup, and so is \(lca_{G'}({\ell }_{i,j}, {\ell }_{j, i})\). But \({\ell }_{i,j} {\ell }_{j,i} \in O\), contradicting our assumption that \(G'\) is a solution. The same reasoning applies when \(v_iv_j \notin E(R')\), ending the proof. \(\square\)
Algorithmic avenues
As the problems considered in this paper are all computationally hard, only non-polynomial exact algorithms or approximation algorithms can realistically be explored. Let us generalize the Minimum Edge-Removal Consistency problem to the minimum editing problem (i.e. minimizing edge removals and insertions). It is not hard to imagine a branch-and-bound algorithm that solves the problem. Call an induced subgraph H of a relation graph R bad if it is a \(P_4\), or if there is a triplet of \(P_3(H)\) not displayed by S. Each \(P_4\) can be solved by six possible edge editings, and each contradictory triplet of \(P_3(H)\) can be solved by three possible editings. Therefore, in a branch-and-bound process, one would verify whether a given graph \(R'\) contains a bad subgraph and, if so, proceed recursively on each graph obtained by an editing that removes it. If no bad subgraph exists, then \(R'\) is a possible solution and its number of editings is retained. If, at any point, \(R'\) has had more editings than the best solution encountered so far, the algorithm can stop the recursion. Notice however that an edge should not be edited more than once in order to avoid infinite loops. The idea of this branch-and-bound algorithm can also be applied to the Maximum Node Consistency problem. It is known that a \(P_4\), if one exists, can be found in linear time [38], and clearly a contradictory triplet, if any, can be found in time \(O(n^3)\) (though a more efficient algorithm may exist). A similar approach has been applied in [31] to design an FPT algorithm for the satisfiability problem.
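A minimal sketch of the branch-and-bound idea is given below, under a simplifying assumption: only induced \(P_4\)s are treated as bad subgraphs, so the species tree and the contradictory triplets are ignored and the sketch addresses satisfiability rather than consistency. The function names are hypothetical, and the \(P_4\) search is a brute-force enumeration rather than the linear-time method of [38].

```python
from itertools import combinations, permutations

def find_induced_p4(vertices, edges):
    """Return (a, b, c, d) inducing the path a-b-c-d, or None (brute force)."""
    def adj(u, v):
        return frozenset({u, v}) in edges
    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            if (adj(a, b) and adj(b, c) and adj(c, d)
                    and not adj(a, c) and not adj(a, d) and not adj(b, d)):
                return a, b, c, d
    return None

def min_p4_free_editing(vertices, edges):
    """Fewest edge insertions/removals that destroy every induced P_4."""
    best = {"cost": float("inf"), "edges": None}

    def branch(current, edited):
        if len(edited) >= best["cost"]:
            return  # bound: this branch cannot beat the best solution so far
        p4 = find_induced_p4(vertices, current)
        if p4 is None:
            best["cost"], best["edges"] = len(edited), current
            return
        # The six vertex pairs of the P_4 are the six candidate editings.
        for u, v in combinations(p4, 2):
            pair = frozenset({u, v})
            if pair in edited:
                continue  # never edit the same pair twice
            branch(current ^ {pair}, edited | {pair})

    branch(frozenset(edges), frozenset())
    return best["cost"], best["edges"]

# Example: the path 1-2-3-4 is itself a P_4.
path = {frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})}
cost, edited_graph = min_p4_free_editing([1, 2, 3, 4], path)
```

Extending this sketch to the consistency problem would require adding a second kind of bad subgraph (a contradictory triplet) with its three candidate editings, as described above.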
As for approximations, an algorithm proposed in [32] can be directly applied to the Minimum Edge-Removal Consistency problem and guarantees that we remove at most \(4 \Delta (R)\) times as many edges as an optimal solution, where \(\Delta (R)\) is the maximum degree of R. The idea is simple: as long as R has a bad subgraph H, remove every edge incident to a vertex of H and continue. Even though this is the best known approximation algorithm so far, it has the undesirable effect of isolating many vertices, motivating the exploration of alternative algorithms. One direction would be to consider existing ideas on the problem of satisfiability, i.e. finding the minimum number of editings required to make a graph \(P_4\)-free, and adapt them to the consistency problem, for instance the MinCut algorithm proposed in [39].
As for gene tree correction, we have developed in [14] a polynomial-time algorithm which, given a species tree S and partial sets of relations O and P, verifies whether there exists an S-consistent gene tree \(G'\) satisfying O and P and, if so, constructs one among the set of all possible solutions. In order to correct a gene tree G, we can envisage an extension of this algorithm that takes G as input and picks, among the solutions of the algorithm, the one closest to G (either in terms of common homology relations or clades).
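The isolating strategy can be sketched as follows. This is not a transcription of the algorithm of [32]: the detector shown only looks for contradictory triplets (a \(P_4\) detector could be plugged in the same way), and it assumes that an induced \(P_3\) with endpoints x, z and center y over three distinct species requires the triplet \(s(x)s(z)|s(y)\). The species tree encoding by `parent` and `depth` dictionaries and all function names are assumptions of this sketch.

```python
def lca(parent, depth, a, b):
    """Lowest common ancestor in a rooted tree given by parent pointers."""
    while a != b:
        if depth[a] < depth[b]:
            a, b = b, a
        a = parent[a]  # move the deeper (or first) node one step up
    return a

def displays(parent, depth, a, b, c):
    """True if the species tree displays the triplet ab|c."""
    ab = lca(parent, depth, a, b)
    return lca(parent, depth, ab, c) != ab

def find_contradictory_p3(adj, species, parent, depth):
    """Return the vertices of an induced P_3 x-y-z whose triplet is not in S."""
    for y in adj:
        for x in adj[y]:
            for z in adj[y]:
                if x == z or z in adj[x]:
                    continue
                sx, sy, sz = species[x], species[y], species[z]
                if len({sx, sy, sz}) == 3 and not displays(parent, depth, sx, sz, sy):
                    return {x, y, z}
    return None

def isolate_bad_subgraphs(adj, find_bad):
    """Repeatedly remove every edge incident to a vertex of a bad subgraph."""
    adj = {v: set(ns) for v, ns in adj.items()}
    removed = 0
    while True:
        bad = find_bad(adj)
        if bad is None:
            return adj, removed
        for v in bad:
            for u in adj[v]:
                adj[u].discard(v)
                removed += 1
            adj[v] = set()

# Species tree ((A,B),C) with one gene per species: a in A, b in B, c in C.
parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
depth = {"root": 0, "AB": 1, "C": 1, "A": 2, "B": 2}
species = {"a": "A", "b": "B", "c": "C"}
detector = lambda g: find_contradictory_p3(g, species, parent, depth)
# The path a-b-c requires the triplet AC|B, which S does not display.
result, removed = isolate_bad_subgraphs({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}, detector)
```

On this three-gene example both edges are removed, which illustrates the isolating effect mentioned above: a single edge removal would already have made the graph S-consistent.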
Conclusions
A gene tree induces a set of orthology and paralogy relations between members of a gene family, but the converse is not always true. In this paper we have shown that attempting to modify a set of relations as little as possible in order to ensure consistency with a species tree leads to the formulation of NP-Complete problems. Moreover, even assuming that the given relations are error-free, it remains computationally difficult to correct a gene tree in order to fit the given set of relations. As various model-free methods are available to infer orthology and paralogy, these correction problems are of practical biological interest. A future direction would be to explore the exact branch-and-bound algorithms and heuristics mentioned in the last section, and design fast approximation algorithms for the relation graph and gene tree editing problems.
Notes
 1.
The term ‘relation graph’ is also used in phylogenetics in the form of a generalization of a median network to a set of partitions. To make it clear, relation graphs in this paper have nothing to do with this notion.
References
 1.
Ohno S. Evolution by gene duplication. Berlin: Springer; 1970.
 2.
Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979;28:132–63.
 3.
Tatusov RL, Galperin MY, Natale DA, Koonin EV. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucl Acids Res. 2000;28:33–6.
 4.
Li L, Stoeckert CJJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–89.
 5.
Berglund AC, Sjolund E, Ostlund G, Sonnhammer EL. InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucl Acids Res. 2008;36:D263–6.
 6.
Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinform. 2011;12:124.
 7.
Lafond M, Semeria M, Swenson KM, Tannier E, El-Mabrouk N. Gene tree correction guided by orthology. BMC Bioinform. 2013;14(supp 15):S5.
 8.
Lafond M, Swenson K, El-Mabrouk N. Error detection and correction of gene trees. Models and algorithms for genome evolution. London: Springer; 2013.
 9.
Consortium TGO. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–9.
 10.
Hellmuth M, Hernandez-Rosales M, Huber K, Moulton V, Stadler P, Wieseke N. Orthology relations, symbolic ultrametrics, and cographs. J Math Biol. 2013;66(1–2):399–420.
 11.
Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. PNAS. 2014;112(7):2058–63.
 12.
Aho AV, Sagiv Y, Szymanski TG, Ullman JD. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981;10:405–21.
 13.
Hernandez-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler P. From event-labeled gene trees to species trees. BMC Bioinform. 2012;13(Suppl. 19):56.
 14.
Lafond M, El-Mabrouk N. Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics. 2014;15(Suppl 6):12.
 15.
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara gene trees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009;19:327–35.
 16.
Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M, Perrière G. Databases of homologous gene families for comparative genomics. BMC Bioinform. 2009;10(Suppl 6):S3. doi:10.1186/1471-2105-10-S6-S3.
 17.
Datta RS, Meacham C, Samad B, Neyer C, Sjölander K. Berkeley PHOG: PhyloFacts orthology group prediction web server. Nucleic Acids Res. 2009;37:84–9.
 18.
Pryszcz LP, Huerta-Cepas J, Gabaldón T. MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res. 2011;39:32.
 19.
Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, Denisov I, Kormes D, Marcet-Houben M, Gabaldón T. PhylomeDB v3.0: an expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions. Nucleic Acids Res. 2011;39:556–60.
 20.
Mi H, Muruganujan A, Thomas PD. Panther in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2012;41:377–86.
 21.
Chaudhary R, Burleigh JG, Eulenstein O. Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence. BMC Bioinform. 2011;13(Supp. 10):11.
 22.
Chen K, Durand D, Farach-Colton M. Notung: dating gene duplications using gene family trees. J Comput Biol. 2000;7:429–47.
 23.
Dondi R, El-Mabrouk N, Swenson KM. Gene tree correction for reconciliation and species tree inference: complexity and algorithms. J Discret Algorithms. 2014;25:51–65. doi:10.1016/j.jda.2013.06.001.
 24.
Doroftei A, El-Mabrouk N. Removing noise from gene trees. In: Przytycka TM, Sagot MF, editors. WABI 2011. Lecture notes in bioinformatics. vol. 6833. Berlin, Heidelberg: Springer; 2011. p. 76–91.
 25.
Gorecki P, Eulenstein O. Algorithms: simultaneous error-correction and rooting for gene tree reconciliation and the gene duplication problem. BMC Bioinform. 2011;13(Supp 10):14.
 26.
Gorecki P, Eulenstein O. A linear-time algorithm for error-corrected reconciliation of unrooted gene trees. In: Chen J, Wang J, Zelikovsky A, editors. ISBRA 2011. Lecture notes in bioinformatics. vol. 6674. Berlin, Heidelberg: Springer; 2011. p. 148–159.
 27.
Lafond M, Chauve C, Dondi R, El-Mabrouk N. Polytomy refinement for the correction of dubious duplications in gene trees. Bioinformatics. 2014;30(17):519–26. doi:10.1093/bioinformatics/btu463.
 28.
Swenson KM, Doroftei A, El-Mabrouk N. Gene tree correction for reconciliation and species tree inference. Algorithms Mol Biol. 2012;7:31.
 29.
Nguyen TH, Ranwez V, Pointet S, Chifolleau AM, Doyon JP, Berry V. Reconciliation and local gene tree rearrangement can be of mutual profit. Algorithms Mol Biol. 2013;8(8):12.
 30.
Robinson D, Foulds L. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–47.
 31.
Liu Y, Wang J, Guo J, Chen J. Complexity and parameterized algorithms for cograph editing. Theor Comput Sci. 2012;461:45–54. doi:10.1016/j.tcs.2011.11.040.
 32.
Natanzon A, Shamir R, Sharan R. Complexity classification of some edge modification problems. Discret Appl Math. 2001;113(1):109–28.
 33.
Fitch WM. Homology: a personal view on some of the problems. Trends Genet. 2000;16(5):227–31.
 34.
El-Mallah ES, Colbourn CJ. The complexity of some edge deletion problems. IEEE Trans Circuits Syst. 1988;35(3):354–62.
 35.
Garey MR, Johnson DS. Computers and intractability: a guide to the theory of NP-completeness. San Francisco: WH Freeman & Co.; 1979.
 36.
Vazirani VV. Approximation algorithms. New York: Springer; 2003.
 37.
Zuckerman D. Linear degree extractors and the inapproximability of max clique and chromatic number. Proc Thirty Eight Annu ACM Symp Theor Comput. 2007;3(1):103–28. doi:10.4086/toc.2007.v003a006.
 38.
Bretscher A, Corneil DG, Habib M, Paul C. A simple linear time LexBFS cograph recognition algorithm. SIAM J Discret Math. 2008;22(4):1277–96. doi:10.1137/060664690.
 39.
Altenhoff AM, Gil M, Gonnet GH, Dessimoz C. Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS One. 2013;8(1):53786.
Authors’ contribution
ML, RD and NE modeled the four problems presented, and devised and wrote the hardness proofs. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Funding
Publication of this work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de Recherche Nature et Technologies of Quebec (FRQNT).
Author information
Additional information
Manuel Lafond, Riccardo Dondi and Nadia El-Mabrouk contributed equally to this work.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lafond, M., Dondi, R. & El-Mabrouk, N. The link between orthology relations and gene trees: a correction perspective. Algorithms Mol Biol 11, 4 (2016). https://doi.org/10.1186/s13015-016-0067-7
Keywords
 Orthology
 Paralogy
 NP-Hardness
 Gene tree
 Species tree