The link between orthology relations and gene trees: a correction perspective

Background While tree-oriented methods for inferring orthology and paralogy relations between genes are based on reconciling a gene tree with a species tree, many tree-free methods are also available (usually based on sequence similarity). Recently, the link between orthology relations and gene trees has been formally considered from the perspective of reconstructing phylogenies from orthology relations. In this paper, we consider this link from a correction point of view. Indeed, a gene tree induces a set of relations, but the converse is not always true: a set of relations is not necessarily in agreement with any gene tree. A natural question is thus how to minimally correct an infeasible set of relations. Another natural question, given a gene tree and a set of relations, is how to minimally correct a gene tree so that the resulting gene tree fits the set of relations. Results We consider four variants of relation and gene tree correction problems, and provide hardness results for all of them. More specifically, we show that it is NP-Hard to edit a minimum set of relations to make them consistent with a given species tree. We also show that the problem of finding a maximum subset of genes that share consistent relations is hard to approximate. We then demonstrate that editing a gene tree to satisfy a given set of relations in a minimum way is NP-Hard, where "minimum" refers either to the number of modified relations depicted by the gene tree or the number of clades that are lost. We also discuss some of the algorithmic perspectives given these hardness results.


Background
Genes, the molecular units of heredity, hold the information to build and maintain cells. In the course of evolution, they are duplicated, lost, and passed to organisms through speciation. Genes originating from the same ancestral copy are called homologs. Homologous genes are grouped into gene families, usually via sequence similarity methods. Moreover, homologous genes can be orthologous, if their parental origin is a speciation, or paralogous, if it is a duplication. Orthologous genes are considered to be more similar in function than paralogs, a conjecture known as the orthology conjecture [1]. This is a major motivation for inferring gene evolution, as it is a prerequisite for functional prediction purposes.
Starting usually from a DNA or protein sequence alignment, the tree-based method requires building a phylogenetic tree, called a gene tree, for the considered gene family. Reconciliation [2] with the species tree then allows the inference of the evolutionary events (duplications and speciations) associated with the internal nodes of the gene tree. Hence the internal nodes of a gene tree can be labeled as duplications and speciations, and such a labeling induces a full set of orthology and paralogy relations between gene pairs. Tree-free methods are also available for detecting orthology. These methods are based on gene clustering according to sequence similarity (cf. e.g. the COG database [3], OrthoMCL [4], InParanoid [5], Proteinortho [6]), synteny [7,8] or functional annotation of genes [9]. Such methods are usually not able to detect a full set of relations, but only a partial set, i.e. some relations among genes are not inferred.
Recent papers [10,11] have investigated, from a graph theory point of view, the link between trees and orthology/paralogy relations (we just say "relations" in the following). Given a gene family Γ and a set C of pairwise relations, a first problem is whether we can reconstruct a labeled gene tree for Γ inducing C. The problem can be subdivided into two parts. First, we can consider whether C is satisfiable, i.e. whether there exists an event-labeled gene tree G in agreement with C. However, satisfiability is not sufficient to ensure that the relation set can reflect a true history, as nodes of G labeled as speciations can be contradictory. This raises the second question, which is the existence of an S-consistent gene tree, namely an event-labeled tree that can be obtained by reconciliation with a species tree S. A simple characterization of satisfiability is given in [10], when the set C is a full set of relations (i.e. each pair of genes of Γ is in C). On the other hand, checking for S-consistency can be done in polynomial time for full sets [12,13], and also for partial sets of relations [14].
In this paper we explore the link between relations and trees in the perspective of relation and tree correction. Several gene tree databases from whole genomes are available, including for instance Ensembl Compara [15], Hogenom [16], Phog [17], MetaPHOrs [18], PhylomeDB [19], Panther [20]. However, due to various limitations such as alignment errors, systematic artifacts of inference methods or insufficient differentiation between sequences, trees are known to contain errors and uncertainties. Consequently, a great deal of effort has been put towards tools for gene tree editing [21–29]. Most of them are based on selecting, in a neighborhood of an input tree, one best fitting the species tree.
Two years ago, we developed the first algorithm for gene tree correction using orthology relations [7]. Here we address, from a complexity and approximation point of view, the more general problem of correcting a gene tree according to a set of orthology and paralogy relations. We consider two objective functions: the number of relations left unchanged (i.e. not switched from orthology to paralogy or vice versa), leading to the Maximum Homology Correction problem, and the number of unchanged clades (the Robinson-Foulds distance [30]), leading to the Maximum Clade Correction problem. We provide NP-completeness results for these two problems.
Conversely, we also address the problem of correcting a set of relations so that it represents a valid history in terms of S-consistency. A set of relations is usually represented as a graph R, where edges represent orthologous relations and non-edges represent paralogous relations. The satisfiability problem related to S-consistency reduces to adding or removing a minimum number of edges of R in order to make it P_4-free (that is, it contains no induced path of length three), as shown in [10]. The problem is known to be NP-Hard and fixed-parameter tractable [31]. In [11], an integer linear programming formulation is used to correct relation graphs of reasonable size. An approximation algorithm of factor 4Δ, where Δ is the maximum degree of the graph R, is given in [32]. The S-consistency problem, however, has never been studied.
In this paper, two criteria are considered for correcting a set R of relations: minimize the number of modified relations, and maximize the number of genes inducing an S-consistent set of relations. The first problem is shown to be NP-complete, while the second problem is shown to be not approximable within factor d n^{(1−ε)/2}, for any 0 < ε < 1 and any constant d > 0.

Trees and orthology relations
All trees considered in this paper are assumed to be rooted. They are not necessarily binary, but we assume that all internal nodes are of degree at least three, except possibly the root, which can be of degree two. Given a set X, a tree T for X is a tree whose leafset L(T) is in bijection with X. We denote by V(T) the set of nodes and by r(T) the root of T. Given an internal node u of T, the subtree rooted at u is denoted T_u and we call the leafset L(T_u) the clade of u. A node u is an ancestor of v if u is on the (inclusive) path between v and the root, and we then call v a descendant of u. If u ≠ v, then v is a strict descendant of u, and if u and v are connected by an edge of T, then v is a child of u. The lowest common ancestor (lca) of u and v, denoted lca_T(u, v), is the ancestor common to both nodes that is the most distant from the root. We say that u and v are separated if and only if lca_T(u, v) ∉ {u, v} (i.e. none is an ancestor of the other). We define lca_T(U) analogously for a set U of nodes. Let L′ be a subset of L(T). The restriction T|_{L′} of T to L′ is the tree with leafset L′ obtained from the subtree of T rooted at lca_T(L′) by removing all leaves that are not in L′ and contracting all internal nodes of degree two, except the root. Let T′ be a tree such that L(T′) = L′ ⊆ L(T). We say that T displays T′ if and only if T|_{L′} is T′.
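To make the notions of ancestor, lowest common ancestor and separation concrete, the following is a minimal Python sketch. It is an illustration only: the child-to-parent dictionary encoding, the function names and the small example tree are assumptions introduced here, not part of the paper.

```python
def ancestors(parent, u):
    """Inclusive path from u up to the root (u itself comes first)."""
    path = [u]
    while parent.get(u) is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    """Common ancestor of u and v that is the most distant from the root."""
    anc_u = set(ancestors(parent, u))
    for w in ancestors(parent, v):  # first hit while climbing from v
        if w in anc_u:
            return w
    raise ValueError("u and v do not belong to the same tree")


def separated(parent, u, v):
    """True when neither node is an ancestor of the other."""
    return lca(parent, u, v) not in (u, v)


# Hypothetical tree ((a, b)x, c)r, stored as child -> parent.
parent = {"r": None, "x": "r", "c": "r", "a": "x", "b": "x"}
print(lca(parent, "a", "b"))        # x
print(separated(parent, "a", "c"))  # True
print(separated(parent, "x", "a"))  # False
```

The quadratic-time climb is chosen for readability; any standard lca data structure could be substituted.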

Evolution of a gene family
Species evolve through speciation, which is the separation of one species into distinct ones. A species tree S for a species set Σ represents an ordered set of speciation events that have led to Σ: an internal node is an ancestral species at the moment of a speciation event, and its children are the new descendant species. Inside the species' genomes, genes undergo speciation when the species to which they belong do, but also duplications, and losses (other events such as transfers can happen, but we ignore them here). A gene family is a set of genes Γ accompanied by a mapping function s : Γ → Σ mapping each gene to its corresponding species. The evolutionary history of Γ can be represented as a node-labeled gene tree for Γ, where each internal node refers to an ancestral gene at the moment of an event (either speciation or duplication), and is labeled as a speciation (Spec) or duplication (Dup) accordingly.
Formally, we call a DS-tree for Γ a pair (G, ev_G), where G is a tree with L(G) = Γ, and ev_G : V(G) \ L(G) → {Dup, Spec} is a function labeling each internal node of G as a duplication or a speciation node (we drop the G subscript from ev_G when it is clear from the context). Given a species tree S, the LCA-mapping function s_G : V(G) → V(S) maps each gene of G, ancestral or extant, to a species as follows: if g ∈ L(G), then s_G(g) = s(g); otherwise, s_G(g) = lca_S({s(g′) : g′ ∈ L(G_g)}). An example is given in Fig. 1, where the label of each node of G represents its LCA-mapping with respect to S.
According to the Fitch [33] terminology, we say that two genes x, y of Γ are orthologous in G if ev(lca_G(x, y)) = Spec, and paralogous in G if ev(lca_G(x, y)) = Dup. We denote by O(G), respectively P(G), the set of all gene pairs that are orthologous, respectively paralogous in G. By xy ∈ O(G) we mean {x, y} ∈ O(G) (the same applies for P(G)). In Fig. 1, a_1 c_1 ∈ O(G) while a_1 b_1 ∈ P(G). We say that a_1 c_1 (resp. a_1 b_1) is an orthology (resp. paralogy) relation induced by G.
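The way a DS-tree induces O(G) and P(G) can be sketched in a few lines of Python. The nested-tuple encoding (a leaf is a string, an internal node is a pair of a label and a list of children) and the small tree below are assumptions made for illustration; the tree is not the one of Fig. 1.

```python
from itertools import combinations


def leaves(node):
    """Leafset of the subtree rooted at node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [x for child in children for x in leaves(child)]


def relations(tree):
    """Return (O, P): gene pairs whose lca is a Spec, resp. a Dup, node."""
    orth, para = set(), set()

    def visit(node):
        if isinstance(node, str):
            return
        label, children = node
        target = orth if label == "Spec" else para
        # Two leaves have their lca at this node exactly when they lie
        # below two distinct children of it.
        for c1, c2 in combinations(children, 2):
            for x in leaves(c1):
                for y in leaves(c2):
                    target.add(frozenset((x, y)))
        for child in children:
            visit(child)

    visit(tree)
    return orth, para


# Hypothetical DS-tree: a speciation root over a duplication (a1, b1) and c1.
G = ("Spec", [("Dup", ["a1", "b1"]), "c1"])
O, P = relations(G)
```

On this toy tree, a1 and b1 are paralogous while c1 is orthologous to both.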
While a history for Γ can be represented as a DS-tree, the converse is not always true, as a DS-tree G for Γ does not necessarily represent a valid history. For this to hold, any speciation node of G should reflect a clustering of species in agreement with S [14]. Formally G should be S-consistent, as defined below.

Definition 1 Let S be a species tree and G be a DS-tree. Let v be an internal node of G such that ev(v) = Spec. Then the speciation node v is S-consistent if and only if for any two distinct children v_1, v_2 of v, s_G(v_1) and s_G(v_2) are separated in S. We say that G is S-consistent if and only if every speciation node of G is S-consistent.
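Definition 1 translates directly into a recursive check. The sketch below is illustrative only: it assumes a species tree stored as a child-to-parent dictionary, a DS-tree stored as nested (label, children) tuples with string leaves, and a dictionary s from genes to species; all names and the example data are hypothetical.

```python
from itertools import combinations


def ancestors(parent, u):
    """Inclusive path from u up to the root."""
    path = [u]
    while parent.get(u) is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    """Lowest common ancestor of u and v in the species tree."""
    anc_u = set(ancestors(parent, u))
    return next(w for w in ancestors(parent, v) if w in anc_u)


def lca_map(node, s, sp_parent):
    """LCA-mapping s_G(node): lca in S of the species below node."""
    if isinstance(node, str):
        return s[node]
    _, children = node
    mapped = lca_map(children[0], s, sp_parent)
    for child in children[1:]:
        mapped = lca(sp_parent, mapped, lca_map(child, s, sp_parent))
    return mapped


def is_s_consistent(node, s, sp_parent):
    """Every Spec node must have children mapped to pairwise separated species."""
    if isinstance(node, str):
        return True
    label, children = node
    if label == "Spec":
        maps = [lca_map(c, s, sp_parent) for c in children]
        for m1, m2 in combinations(maps, 2):
            if lca(sp_parent, m1, m2) in (m1, m2):  # not separated in S
                return False
    return all(is_s_consistent(c, s, sp_parent) for c in children)


# Hypothetical species tree ((a, b)x, c)r with one gene per species.
sp_parent = {"r": None, "x": "r", "c": "r", "a": "x", "b": "x"}
s = {"a1": "a", "b1": "b", "c1": "c"}
good = ("Spec", [("Spec", ["a1", "b1"]), "c1"])
bad = ("Spec", [("Spec", ["a1", "c1"]), "b1"])
```

Here `good` mirrors S and passes the check, whereas in `bad` the root has children mapped to r and b, which are not separated in S.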
Notice that G and S are not required to be binary. In particular, the definition of S-consistency for a speciation node v of G does not require v to be binary, even if S is binary. The reason is that in such a case, one can "refine" v into a set of binary S-consistent speciation nodes based on the topology of S. This operation does not affect the orthology and paralogy relations of the genes of G (see Fig. 1). Duplication nodes can be refined as well. Lemma 1 formalizes this intuition. This will serve to show that our results hold for both non-binary and binary gene trees.

Lemma 1
Let G be an S-consistent DS-tree for some binary species tree S. Then there is a binary DS-tree G′ such that G′ is S-consistent, and such that O(G) = O(G′) and P(G) = P(G′).
Proof Let v be a highest non-binary node (i.e. v has no non-binary ancestors) of G with children v_1, ..., v_k. We show that v can be made binary while preserving O(G) and P(G), which suffices to prove the Lemma since we can repeat this operation successively on every non-binary node.
If ev_G(v) = Dup, obtain a DS-tree G* by removing v_2, ..., v_k from the children of v, adding a child v′ to v and adding v_2, ..., v_k as children of v′, setting ev_{G*}(v′) = Dup. Notice that s_G(w) = s_{G*}(w) for every w ∈ V(G) ∩ V(G*) = V(G*) \ {v′}, implying that all speciations remain S-consistent. It is readily seen that O(G) = O(G*) and P(G) = P(G*).
If instead ev_G(v) = Spec, let s_1, s_2 ∈ V(S) be the two children of s_G(v). Let V_j = {v_i : s_j is an ancestor of s_G(v_i), 1 ≤ i ≤ k} for j ∈ {1, 2}. Notice that for any child v_i of v, s_G(v_i) must be a strict descendant of s_G(v): otherwise s_G(v_i) = s_G(v) would be an ancestor of s_G(v_j) for every other child v_j of v, contradicting the S-consistency of v. This strict descendant condition implies that {V_1, V_2} partitions {v_1, ..., v_k}. Also observe that V_1 and V_2 cannot be empty, for otherwise s_G(v) would be equal to either s_1 or s_2. Obtain G* by removing the children of v, adding two children w_1 and w_2 to v, then adding V_1 as children of w_1 and V_2 as children of w_2. Set ev_{G*}(v) = ev_{G*}(w_1) = ev_{G*}(w_2) = Spec. Note that the children of w_1 and w_2 are still from separated species, and so both are S-consistent. As for v, by the definition of V_1 and V_2, s_{G*}(w_1) is a descendant of s_1 and s_{G*}(w_2) is a descendant of s_2 (not necessarily a strict descendant). Therefore, both are separated and so v is S-consistent. The species for every other node remaining unchanged, we conclude that G* preserves S-consistency and does not modify O(G) nor P(G).

Fig. 1 A species tree S, a binary DS-tree G and a non-binary DS-tree G′. In DS-trees, Dup nodes are indicated by squares. All other nodes are speciation nodes. Each leaf α_i denotes a gene belonging to the genome α. G is a refinement of G′ such that O(G) = O(G′) and P(G) = P(G′). Notice that, although in this example the gene trees contain exactly one gene copy from each genome, this is not a requirement. Another example with multiple gene copies in genome a is given in Fig. 2.

Lafond et al. Algorithms Mol Biol (2016) 11:4

We can verify that both DS-trees in Fig. 1 are S-consistent. For example, the speciation node z in G′ has children from species v, c, d and w, which are pairwise separated in S. Notice that, from Definition 1, if G is an S-consistent DS-tree, then the lca of two leaves of G belonging to the same species must be a duplication node. The converse is not true. For example, in the S-consistent gene tree G of Fig. 1, the parental node of e_1 and f_1 is a duplication node even though e_1 and f_1 belong to two different species.

Relation graph
A set of orthology/paralogy relations on Γ (or simply a relation set) is a pair C = (O, P), where O is a set of orthology relations and P a set of paralogy relations between genes of Γ; it is full if every pair of genes of Γ is in C. We adopt the graph representation considered in [14] for full relation sets. A relation graph R on a gene family Γ is a graph with vertex set V(R) = Γ, in which we interpret each edge uv of the edge set E(R) of R as an orthology relation between u and v, and each missing edge (non-edge) uv ∉ E(R) as a paralogy relation. Notice that if s(u) = s(v), then uv ∉ E(R). The relation graph of a DS-tree G, denoted by R(G), is the graph with vertex set L(G) and edge set O(G) (for example, see the relation graph R in Fig. 2).
A DS-tree for a gene family Γ leads to a relation graph, but the converse is not always true. A relation graph R is satisfiable if there exists a DS-tree G such that R(G) = R. The problem of relation graph satisfiability has been addressed in [10]. The following theorem is a reformulation of one of the main results of that paper.

Theorem 1 ([10]) A relation graph R is satisfiable if and only if R is P_4-free, meaning that no four vertices of R induce a path of length three.
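Theorem 1 yields a simple, if naive, satisfiability test. The Python sketch below enumerates all vertex quadruples, so it runs in O(n^4) time; it is meant as an illustration under that assumption, not as the recognition algorithm used in [10]. Function names and the example graphs are hypothetical.

```python
from itertools import combinations


def has_induced_p4(vertices, edges):
    """True if some four vertices induce a path with three edges."""
    edge_set = {frozenset(e) for e in edges}
    for quad in combinations(vertices, 4):
        degree = {v: 0 for v in quad}
        for u, v in combinations(quad, 2):
            if frozenset((u, v)) in edge_set:
                degree[u] += 1
                degree[v] += 1
        # Among graphs on four vertices, only the path P_4 has
        # degree sequence (1, 1, 2, 2).
        if sorted(degree.values()) == [1, 1, 2, 2]:
            return True
    return False


def is_satisfiable(vertices, edges):
    """Satisfiability of a relation graph, following Theorem 1."""
    return not has_induced_p4(vertices, edges)


path = [("a", "b"), ("b", "c"), ("c", "d")]  # an induced P_4
star = [("a", "b"), ("a", "c"), ("a", "d")]  # P_4-free
print(is_satisfiable("abcd", path))  # False
print(is_satisfiable("abcd", star))  # True
```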
For example, in Fig. 2, the relation graphs R and R″ are satisfiable, while the graph R′ is not. As a DS-tree does not necessarily represent a true history for Γ (see previous section and Definition 1), satisfiability of a relation graph does not ensure a possible translation in terms of a history for Γ. For this to hold, R should be consistent with the species tree, according to the following definition.

Definition 2 Given a species tree S, a relation graph R is S-consistent if and only if there exists an S-consistent DS-tree G such that R(G) = R.
For example, the graph R in Fig. 2 is S-consistent. Notice that S-consistency implies satisfiability. Results from [14] complete the characterization of S-consistent graphs through Theorem 2. A triplet is a binary tree with leafset L of size three. For L = {x, y, z}, we denote by xy|z the unique triplet T on L for which lca_T(x, y) ≠ r(T) holds. Now P_3(R) is the set of triplets of species induced by paths having exactly three vertices in R = (V, E):
P_3(R) = {s(x)s(y)|s(z) : zx, zy ∈ E, xy ∉ E and s(x) ≠ s(y)}.
We present in Theorem 2 a necessary and sufficient condition for S-consistency of a relation graph in terms of P_3(R). First, we introduce in Lemma 2 an intermediate property that is useful for proving Theorem 2.

Lemma 2 Let G be a DS-tree and S be a species tree. Then for any internal node v of G, there exist leaves x, y of G_v such that both the following hold: (1) lca_S(s(x), s(y)) = s_G(v); (2) lca_G(x, y) = v.
Proof We first show that (1) must hold for some x, y ∈ L(G_v). If s_G(v) has two children s_1 and s_2 for which there exist leaves x and y of G_v such that s_1 is an ancestor of s(x) and s_2 an ancestor of s(y), then (1) holds. Thus if we suppose that (1) does not hold, then s_G(v) has a child s′ such that all leaves of G_v belong to a species that has s′ as an ancestor. This implies that s′ is a lower common ancestor than s_G(v) for the species present in G_v, contradicting the definition of s_G. Now, take x and y satisfying (1). Suppose that (2) does not hold for x and y, i.e. lca_S(s(x), s(y)) = s_G(v), but lca_G(x, y) ≠ v. Take z ∈ L(G_v) such that z is separated from lca_G(x, y) and lca_G(x, z) = lca_G(y, z) = v; such a leaf exists since lca_G(x, y) is a strict descendant of v. If lca_S(s(x), s(z)) = s_G(v), then we are done as x and z satisfy both (1) and (2). Otherwise, lca_S(s(x), s(z)) is a strict descendant of s_G(v), and since lca_S(s(x), s(y)) = s_G(v), we must have lca_S(s(y), s(z)) = s_G(v). In this last case, y and z are the leaves of interest, ending the proof.

Theorem 2 Let R be a satisfiable relation graph and let S be a species tree. Then R is S-consistent if and only if S displays all the triplets of P_3(R).
Proof (⇒): let G be an S-consistent gene tree satisfying R, and let x, y, z ∈ V(R) such that zx, zy ∈ E(R) but xy ∉ E(R) and s(x) ≠ s(y). Then we must have zx, zy ∈ O(G) and xy ∈ P(G). We claim that S must display the s(x)s(y)|s(z) triplet. Indeed, since lca_G(x, y) is a duplication while lca_G(z, x) and lca_G(z, y) are speciations, we have lca_G(z, x) = lca_G(z, y) = u, where u is a strict ancestor of lca_G(x, y). Let u_1 be the child of u that is an ancestor of x and y, and u_2 the child of u that is an ancestor of z. As u is an S-consistent speciation node, s_G(u_1) and s_G(u_2) are separated in S. Since s(x) and s(y) are distinct descendants of s_G(u_1) and s(z) is a descendant of s_G(u_2), S displays s(x)s(y)|s(z).
(⇐): let G′ be a DS-tree satisfying R, which exists since R is satisfiable. We first obtain from G′ a least-resolved DS-tree G satisfying R, in terms of speciation. That is, if G′ has any speciation node v that has a speciation child w, we obtain G″ by contracting v and w (delete w and give its children to v). Note that we have O(G′) = O(G″) and so G″ still satisfies R. We obtain the DS-tree G by repeating this operation until we cannot find such a v and w. We claim that if S displays the triplets of P_3(R), then G is S-consistent.
Let v be a speciation node of G, and let v_1, v_2 be any two distinct children of v. By the construction of G, v_1 and v_2 are not speciation nodes. By Lemma 2, G_{v_1} has two leaves x_1, x_2 with lca_S(s(x_1), s(x_2)) = s_G(v_1) and lca_G(x_1, x_2) = v_1. Similarly, G_{v_2} has two leaves y_1, y_2 with lca_S(s(y_1), s(y_2)) = s_G(v_2) and lca_G(y_1, y_2) = v_2. Thus, x_1 y_1 x_2 and x_1 y_2 x_2 are induced paths of length two in R, which implies that S displays the s(x_1)s(x_2)|s(y_1) and s(x_1)s(x_2)|s(y_2) triplets. Analogously, S displays the s(y_1)s(y_2)|s(x_1) and s(y_1)s(y_2)|s(x_2) triplets. This is only possible if lca_S(s(x_1), s(x_2)) = s_G(v_1) and lca_S(s(y_1), s(y_2)) = s_G(v_2) are separated in S. We deduce that all child pairs of v are from separated species, and hence that G is S-consistent.
As an example, the graph R″ in Fig. 2 is satisfiable but not S-consistent, as the path of length 2 containing {a_1, b_1, c_1} induces the triplet ac|b, while the triplet displayed by S is ab|c.
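The condition of Theorem 2 can be checked mechanically: collect the species triplets induced by the three-vertex induced paths of R, then test each one against S. The following Python sketch assumes that R is already known to be satisfiable, that S is given as a child-to-parent dictionary, and that the two three-gene graphs below are hypothetical instances loosely modeled on the ac|b versus ab|c situation described above (they are not the graphs of Fig. 2).

```python
from itertools import combinations


def ancestors(parent, u):
    """Inclusive path from u up to the root."""
    path = [u]
    while parent.get(u) is not None:
        u = parent[u]
        path.append(u)
    return path


def lca(parent, u, v):
    """Lowest common ancestor of u and v."""
    anc_u = set(ancestors(parent, u))
    return next(w for w in ancestors(parent, v) if w in anc_u)


def p3_triplets(edges, s):
    """Triplets s(x)s(y)|s(z) for induced paths x - z - y with s(x) != s(y)."""
    edge_set = {frozenset(e) for e in edges}
    neighbours = {}
    for e in edge_set:
        u, v = tuple(e)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    triplets = set()
    for z, around in neighbours.items():
        for x, y in combinations(sorted(around), 2):
            if frozenset((x, y)) not in edge_set and s[x] != s[y]:
                triplets.add((frozenset((s[x], s[y])), s[z]))
    return triplets


def displays(sp_parent, pair, z):
    """S displays xy|z iff lca(x, y) lies strictly below lca(x, y, z)."""
    x, y = tuple(pair)
    low = lca(sp_parent, x, y)
    return lca(sp_parent, low, z) != low


def is_s_consistent_graph(edges, s, sp_parent):
    """Theorem 2 test; assumes the relation graph is satisfiable."""
    return all(displays(sp_parent, pair, z) for pair, z in p3_triplets(edges, s))


sp_parent = {"r": None, "x": "r", "c": "r", "a": "x", "b": "x"}  # S displays ab|c
s = {"a1": "a", "b1": "b", "c1": "c"}
path_through_b = [("a1", "b1"), ("b1", "c1")]  # induces ac|b
path_through_c = [("a1", "c1"), ("b1", "c1")]  # induces ab|c
```

With these inputs, the first graph fails the test and the second passes it.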
We end this section with additional notations that will be of use later

Relation correction problems
We raise the issue of leaving out a minimum of information from a relation graph R in order to reach satisfiability and S-consistency. Two optimality criteria are considered: (1) the minimum number of edges that need to be removed; (2) the maximum number of genes that can be kept.

The minimum edge-removal consistency problem
Based on the same construction used in [34], we show that adding the information on the species tree S does not make the problem of removing the minimum number of edges leading to a P_4-free graph simpler. Although a similar reduction is likely to hold in the general case of edge-modification (removal or insertion) [31], here we focus on edge removal, as this formulation is needed in subsequent developments. We show the NP-Completeness of this problem, even when every gene from the family Γ comes from a distinct species.
Minimum Edge-Removal Consistency problem: Input: A relation graph R for a gene family Γ, a species tree S and an integer k; Output: "Yes" if and only if there exists an S-consistent relation graph R′ with V(R′) = V(R), E(R′) ⊆ E(R) and |E(R) \ E(R′)| ≤ k.
Theorem 3 The Minimum Edge-Removal Consistency problem is NP-Complete, even if every gene of Γ comes from a distinct species.
Proof Given R′ as a certificate, Theorem 2 easily translates into a polynomial-time algorithm to verify that R′ is S-consistent. It is also clear that verifying whether |E(R) \ E(R′)| ≤ k can be done quickly. The problem is therefore in NP. As for NP-Hardness, the reduction is from the exact 3-cover problem, a classic NP-Hard problem [35]: given a set W = {w_1, ..., w_{3t}} and a collection Z = {Z_1, ..., Z_r} of 3-element subsets of W, does there exist Z′ ⊆ Z such that |Z′| = t and Z′ is a partition of W? We assume that r ≥ t.
Given arbitrary W and Z, we construct R and S by first defining the species set Σ. Let α = \binom{3t}{2} = 3t(3t − 1)/2 and let 𝒳 = {X_1, ..., X_r} and 𝒴 = {Y_1, ..., Y_r} be two collections of pairwise disjoint sets of species (i.e. for any distinct sets A, B ∈ 𝒳 ∪ 𝒴, A ∩ B = ∅), with |X_i| = α and |Y_i| = r^2 α, for all 1 ≤ i ≤ r. Let X = ⋃_{1≤i≤r} X_i and Y = ⋃_{1≤i≤r} Y_i be the species in 𝒳 and 𝒴. Then the species set is Σ = W ∪ X ∪ Y. Let S_W, S_X and S_Y be three trees such that L(S_W) = W, L(S_X) = X and L(S_Y) = Y. Then S is obtained by first connecting r(S_Y) with r(S_W) to obtain a new tree S_{WY}, then connecting r(S_{WY}) with r(S_X) (see Fig. 3). Therefore S has exactly |Σ| = 3t + r(α + r^2 α) leaves. The gene family Γ is then constructed so that it contains exactly one gene per species, as mentioned in the Theorem statement. In other words the mapping s : Γ → Σ is a bijection. Thus for simplicity, we make no distinction between a gene g and its species s(g). We then define R with V(R) = Σ such that each of the sets W, X_1, ..., X_r, Y_1, ..., Y_r forms an individual clique. Finally we add two edge-sets E_1 and E_2 to R, where E_1 = {g_1 g_2 : g_1 ∈ X_i, g_2 ∈ Z_i, for a given 1 ≤ i ≤ r} and E_2 = {g_1 g_2 : g_1 ∈ X_i, g_2 ∈ Y_i, for a given 1 ≤ i ≤ r}. Then R has 2r + 1 cliques, namely W, X_1, ..., X_r, Y_1, ..., Y_r. Also, for 1 ≤ i ≤ r, all edges between X_i and Y_i are present, as well as all edges between X_i and Z_i. Figure 3 gives an example with t = 2 and W = {1, 2, 3, 4, 5, 6}.
Notice that the construction of R described above can clearly be done in polynomial time. We now show that W and Z admit an exact 3-cover if and only if R admits an S-consistent DS-tree after the deletion of at most 3α(r − t) + (α − 3t) edges.
(⇒): let Z′ ⊆ Z be a partition of W, |Z′| = t. Let R′ be the subgraph of R in which all edges between Z_i and X_i are removed if and only if Z_i ∉ Z′ (which removes 3α(r − t) edges), and the only edges not removed from the W-clique are those belonging to a Z_i triangle with Z_i ∈ Z′ (which removes α − 3t edges). An example of R′ is given in Fig. 3. Thus there are exactly 3α(r − t) + (α − 3t) edges of R missing from R′, as desired. Clearly, R′ is P_4-free and thus satisfiable. To see that R′ is S-consistent, we use Theorem 2. Notice that any induced path on three vertices in R′ has the form w x_i y_i with w ∈ W, x_i ∈ X_i and y_i ∈ Y_i for some i, inducing the w y_i | x_i speciation triplet, which is in agreement with S. Therefore there exists an S-consistent gene tree G′ satisfying R′.
(⇐): let R′ be an S-consistent relation graph obtained by deleting at most 3α(r − t) + (α − 3t) edges from R. Then, R′ must be P_4-free. We show that R′[W] is partitioned into triangles which form a solution to the 3-cover instance. Let w ∈ W. We claim that in R′, there is exactly one X_i ∈ 𝒳 such that w has neighbors in X_i. Suppose first there are x_1 ∈ X_i and x_2 ∈ X_j, i ≠ j, such that both x_1 and x_2 are neighbors of w in R′. Then there is some y ∈ Y_i such that y x_1 w x_2 induce a P_4, unless all edges between x_1 and Y_i were deleted. But we reach a contradiction since there are r^2 α > 3α(r − t) + (α − 3t) such edges. Therefore w has neighbors in at most one X_i ∈ 𝒳. Using that fact, we can see that w must have at least one neighbor in X, since otherwise at most (3t − 1)α edges between X and W would remain, implying the deletion of 3αr − (3t − 1)α = 3α(r − t) + α edges, more than permitted. This proves our claim.
Thus at best, each w ∈ W has α neighbors in X, implying that at least 3αr − 3tα = 3α(r − t) deleted edges are between X and W. This leaves a maximum of α − 3t other edges that can be deleted. Now, let C be a connected component of R′[W]. We claim that all vertices of C must have their X neighbors in the same X_i ∈ 𝒳. For suppose otherwise that there are two vertices c_1, c_2 of C such that c_1 has a neighbor x_1 ∈ X_i and c_2 a neighbor x_2 ∈ X_j with i ≠ j. It is easy to see that such c_1 and c_2 can be chosen to be neighbors. Then x_1, c_1, c_2, x_2 induce a P_4, a contradiction. Thus all vertices of C have their X neighbors in a common X_i ∈ 𝒳. Since each vertex of X_i has three neighbors in W, this implies that C has at most three vertices. Suppose that C has two vertices or less. Then since all vertices of R′[W] have at most two neighbors, R′[W] can have at most (1/2)(2(3t − 2) + 2) = 3t − 1 edges (obtained by counting the sum of degrees). This, however, implies that at least α − (3t − 1) additional edges were deleted, more than the α − 3t available. We conclude that R′[W] is partitioned into t connected components, each having three vertices. Moreover, each vertex in a given component C has neighbors in the same X_i ∈ 𝒳, implying that Z_i contains the members of C. Finally, since the components are all associated with a distinct Z_i, R′[W] effectively defines a solution to the exact cover instance.

Fig. 3 [...] Z′ is a subset of Z which is a partition of W. R′ is the "corrected" relation graph corresponding to Z′.

The Maximum Node Consistency problem
We introduce the Maximum Node Consistency problem (in its decision version) and we consider the approximation complexity of the corresponding optimization version.
Maximum Node Consistency problem: Input: A relation graph R for a gene family Γ, a species tree S and an integer k; Output: "Yes" if and only if there exists an S-consistent induced subgraph R′ of R with at least k nodes.
We show that Maximum Node Consistency is hard to approximate within a factor d n^{(1−ε)/2} for any 0 < ε < 1 and any constant d > 0, by giving a gap-preserving reduction from Maximum Independent Set (n is the number of nodes of R). We refer the reader to [36] for a definition of gap-preserving reduction. Consider an instance H = (V_H, E_H) of Maximum Independent Set, with |V_H| = m. We construct an instance of Maximum Node Consistency as follows.
First, we define the set of genes Γ, i.e. the nodes of the relation graph R. Denote V_H = {v_1, ..., v_m} and for each v_i ∈ V_H, we define a set I(v_i) of m genes: I(v_i) = {r_{i,t} : 1 ≤ t ≤ m}. Now, we define the species tree S. First consider S as any binary tree over m leaves ℓ_1, ..., ℓ_m, and replace each leaf ℓ_i by any binary subtree T_i having m leaves (thus S has m^2 leaves). Each gene in I(v_i) is mapped to a leaf of T_i in a bijective manner, and so each species has exactly one gene in R. We make no distinction between g ∈ Γ and s(g). Now, define the relation graph R = (V_R, E_R). Set V_R = Γ, and we get that n = |V_R| = m^2. For each v_i ∈ V_H, I(v_i) forms a clique in R. Moreover, for each {v_i, v_j} ∈ E_H, define an edge {r_{i,t}, r_{j,t}} ∈ E_R, for each t with 1 ≤ t ≤ m.
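The relation graph of this reduction is easy to generate, which may help in checking the counts used later. The Python sketch below builds only R (not the species tree S); genes r_{i,t} are encoded as pairs (i, t) with 0-based indices, an encoding chosen here for convenience and not taken from the paper.

```python
from itertools import combinations


def build_relation_graph(m, edges_h):
    """Relation graph R built from a graph H on nodes 0..m-1.

    Each node i of H yields a clique I(v_i) = {(i, t) : 0 <= t < m};
    each edge {i, j} of H yields the m edges {(i, t), (j, t)}.
    """
    genes = [(i, t) for i in range(m) for t in range(m)]
    edges = set()
    for i in range(m):
        for t1, t2 in combinations(range(m), 2):
            edges.add(frozenset(((i, t1), (i, t2))))
    for i, j in edges_h:
        for t in range(m):
            edges.add(frozenset(((i, t), (j, t))))
    return genes, edges


# Hypothetical instance: H has three nodes and the single edge {0, 1}.
genes, edges = build_relation_graph(3, [(0, 1)])
print(len(genes))  # 9, i.e. n = m^2
print(len(edges))  # 12 = 3 cliques of 3 edges + 3 edges for {0, 1}
```

In this instance, keeping the cliques of the independent set {0, 2} leaves two disconnected cliques, in line with the first part of the argument that follows.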
Let R′ be a solution of Maximum Node Consistency over instance (R, S). Denote by R′(v_i) the subset of nodes V(R′) ∩ I(v_i), that is those nodes of I(v_i) that have not been removed. We pay particular attention to those R′(v_i) that contain more than one node.

Lemma 3 Let R′ be a solution of Maximum Node Consistency over instance (R, S), and let v_i, v_j be distinct nodes of H such that |R′(v_i)| ≥ 2. Then no node of R′(v_i) shares an edge with a node of R′(v_j).
Proof Assume on the contrary that there is some q such that r_{i,q} ∈ R′(v_i) and r_{j,q} ∈ R′(v_j) share an edge. Consider a node r_{i,z} of R′(v_i) \ {r_{i,q}}, which must exist since |R′(v_i)| ≥ 2. The P_3 induced by r_{i,z}, r_{i,q} and r_{j,q} implies the triplet (r_{i,z}, r_{j,q} | r_{i,q}), while S contains the triplet (r_{i,z}, r_{i,q} | r_{j,q}), contradicting Theorem 2.
Now, we are ready to prove the main result of this section.

Lemma 4 Let a graph H be an instance of Maximum Independent Set with m nodes, and let (R, S) be the corresponding instance of Maximum Node Consistency. Then: 1. from an independent set V′ of H of size k, we can compute in polynomial time a solution of Maximum Node Consistency on (R, S) of size km; 2. from a solution of Maximum Node Consistency on (R, S) of size at least km, we can compute in polynomial time an independent set of H of size at least k.
Proof 1. Let V′ be an independent set of H, and let R′ be the subgraph of R induced by the nodes of the sets I(v_i) with v_i ∈ V′. We show that R′ is a solution of Maximum Node Consistency. Since V′ is an independent set, it follows that R′ consists only of cliques R′(v_i), disconnected one from the other. It has |V′|m nodes and as R′ is P_3-free, it is S-consistent. 2. The case k = 1 is trivial so we assume k > 1. Consider a solution R′ of Maximum Node Consistency on instance (R, S) of size at least km, and consider the subsets R′(v_i) such that |R′(v_i)| ≥ 2. Notice that we can assume that there exist at least k such sets, otherwise R′ would contain at most (k − 1)m + (m − k + 1) < km nodes. For 1 ≤ j ≤ m, let R_j = {r_{i,j} ∈ V(R′) : |R′(v_i)| ≥ 2}, i.e. the nodes with index j that belong to some subset R′(v_i) larger than one. By Lemma 3 each set R_j is an independent set. Now, pick the set R_j having maximum cardinality. It follows that R_j contains at least k nodes, since otherwise R′ would have at most m(k − 1) + m − k < mk nodes. Hence, V′ = {v_i : r_{i,j} ∈ R_j} is an independent set of size at least k, thus concluding the proof.

We say a maximization problem cannot be approximated within a factor α if, unless P = NP, for any approximation algorithm A there are infinitely many instances for which A outputs a solution with value AP such that AP < (1/α) OPT, where OPT is the optimal value of a solution to the problem (note that equivalently, OPT/AP > α). It is well-known that Maximum Independent Set cannot be approximated within a factor c m^{1−ε} for any 0 < ε < 1 and for any constant c > 0 [37].

Theorem 4 The optimization version of Maximum Node Consistency cannot be approximated within a factor d n^{(1−ε)/2} for any 0 < ε < 1 and for any constant d > 0, where n is the number of nodes of the given relation graph. Moreover, this result holds even on instances in which for any distinct g_1, g_2 ∈ Γ, s(g_1) ≠ s(g_2).
Proof Let H be a graph with m nodes and let (R, S) be the corresponding instance of Maximum Node Consistency with n = m^2 nodes. Denote by OPT_I and OPT_N, respectively, the values of an optimal solution for Maximum Independent Set and Maximum Node Consistency. Let A_N be any approximation algorithm for Maximum Node Consistency, and let A_I be the approximation algorithm for Maximum Independent Set that on input H, runs A_N on the corresponding instance (R, S) and returns the independent set resulting from Lemma 4. Let AP_I(H) and AP_N(R, S) denote, respectively, the sizes of the solutions found by A_I(H) and A_N(R, S). By Lemma 4 we get that AP_I(H) ≥ ⌊AP_N(R, S)/m⌋ ≥ AP_N(R, S)/m − 1 and OPT_N(R, S) ≥ OPT_I(H) m. Now, OPT_N(R, S)/AP_N(R, S) ≥ OPT_I(H) m / ((AP_I(H) + 1) m) ≥ OPT_I(H) / (2 AP_I(H)), as we may assume that AP_I(H) ≥ 1. Since Maximum Independent Set cannot be approximated within a factor c m^{1−ε}, for any 0 < ε < 1 and any constant c > 0, then for any 0 < ε < 1 and any constant c > 0 there exist infinitely many instances H on which OPT_I(H) / (2 AP_I(H)) > c m^{1−ε}. Thus, it follows that OPT_N(R, S)/AP_N(R, S) > c m^{1−ε} = c n^{(1−ε)/2} on infinitely many instances. Finally, the fact that the result holds even on instances in which for any distinct g_1, g_2 ∈ Γ, s(g_1) ≠ s(g_2) follows from the construction of R.
We get the following as an immediate corollary, which will be of use later:

Corollary 1 The decision version of Maximum Node Consistency is NP-Hard, even on instances in which for any distinct g_1, g_2 ∈ Γ, s(g_1) ≠ s(g_2).

Gene tree correction problems
In this section, we are given a gene family Γ, a species tree S, an S-consistent DS-tree G for Γ, and a set C = (O, P) of orthology/paralogy constraints (not necessarily full). We focus on the problem of correcting G according to C in a minimal way. The goal is thus to find a DS-tree G ′ inducing C such that the difference between G and G ′ is minimum. We consider two ways of measuring the difference (or, symmetrically, the similarity) between gene trees: one based on the orthology/paralogy relations conserved between the two trees, and one based on the number of clades conserved between the two trees, which corresponds to the Robinson-Foulds distance when G, G ′ and S are all binary trees.
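To make the first similarity measure concrete, the following Python sketch (not part of the original paper) computes the relations induced by a DS-tree and counts the gene pairs that receive the same relation in two trees. The nested encoding `(event, children)` and all function names are illustrative assumptions, and S-consistency is not checked here.

```python
from itertools import combinations

# Hypothetical encoding: a DS-tree is either a leaf name (str) or a tuple
# (event, children) with event in {"Spec", "Dup"}.

def leaves(tree):
    """Return the set of leaf names below `tree`."""
    if isinstance(tree, str):
        return {tree}
    _, children = tree
    return set().union(*(leaves(c) for c in children))

def induced_relations(tree):
    """Map each unordered pair of genes to the event label of their LCA."""
    relations = {}
    if isinstance(tree, str):
        return relations
    event, children = tree
    # Two genes taken from different child subtrees have `tree` as their LCA.
    for a, b in combinations(children, 2):
        for x in leaves(a):
            for y in leaves(b):
                relations[frozenset((x, y))] = event
    for child in children:
        relations.update(induced_relations(child))
    return relations

def conserved_relations(g1, g2):
    """Count gene pairs with the same relation (Spec or Dup) in both trees."""
    r1, r2 = induced_relations(g1), induced_relations(g2)
    return sum(1 for pair, ev in r1.items() if r2.get(pair) == ev)

G = ("Spec", [("Dup", ["a1", "a2"]), "b1"])
G_prime = ("Dup", [("Spec", ["a1", "b1"]), "a2"])
print(conserved_relations(G, G_prime))  # a1-a2 and a1-b1 agree, a2-b1 does not
```

In this toy example, "Spec" at the LCA stands for orthology and "Dup" for paralogy, so the two trees agree on two of the three gene pairs.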

The Maximum Homology Correction problem
Maximum Homology Correction problem: Input: A species tree S, an S-consistent DS-tree G for a gene family Γ, an integer k, a set O of orthology and a set P of paralogy relations; Output: "Yes" if there exists an S-consistent DS-

Theorem 5
The Maximum Homology Correction problem is NP-Complete, even if S, G and G ′ are required to be binary.
Proof The problem is clearly in NP, as verifying S-consistency can be done in polynomial time, as can counting the common orthology/paralogy relations (the set of relations is quadratic in size). For our reduction, we use the Minimum Edge-Removal Consistency problem for the case of a gene family with at most one gene per genome, which is NP-Hard by Theorem 3. Given a species tree S, a relation graph R with V(R) in bijection with L(S) and an integer k, we construct an instance of the Maximum Homology Correction Problem, i.e. a species tree S ′ , a DS-tree G, an orthology set O and a paralogy set P.
Let S ′ = S and construct G by mimicking S, that is, by first copying S and its leaf labels, then replacing each leaf ℓ of G by the gene s −1 (ℓ). Note that if S is binary, then so is G. All internal nodes of G are labeled as speciations, so all genes of Γ are pairwise orthologous. Thus R(G) is a clique. Finally, let O = ∅ and P = {g 1 g 2 : g 1 g 2 ∉ E(R)}. Therefore the objective is to break a minimum number of orthologies of G in order to satisfy P.
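The construction can be sketched in a few lines of Python. This is only an illustration under assumed encodings (species trees as nested tuples of species names, edges of R as two-element frozensets), and the function name `build_instance` is hypothetical.

```python
from itertools import combinations

def build_instance(species_tree, gene_of, edges):
    """Build (S', G, O, P) from a species tree and a relation graph R.

    species_tree: nested tuples whose leaves are species names (str).
    gene_of: dict mapping each species to its unique gene (one gene per genome).
    edges: set of frozenset({g1, g2}), the orthology edges of R.
    """
    def copy(node):
        if isinstance(node, str):
            return gene_of[node]  # replace leaf l by the gene s^{-1}(l)
        # Every internal node of G is labeled as a speciation.
        return ("Spec", [copy(child) for child in node])

    G = copy(species_tree)
    genes = sorted(gene_of.values())
    O = set()  # no orthology constraint is imposed
    # P contains exactly the non-edges of R.
    P = {frozenset(pair) for pair in combinations(genes, 2)} - set(edges)
    return species_tree, G, O, P

S = (("A", "B"), "C")
gene_of = {"A": "a", "B": "b", "C": "c"}
R_edges = {frozenset(("a", "b")), frozenset(("b", "c"))}
S_prime, G, O, P = build_instance(S, gene_of, R_edges)
print(G, P)
```

Here the only non-edge of R is the pair {a, c}, so P forces a and c to become paralogous even though G initially makes every pair orthologous.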
We show that there is an S-consistent subgraph R ′ of R obtained by removing at most k edges if and only if there is an S ′ -consistent DS-tree G ′ satisfying O and P with at most |P| + k relations that are not induced by G.
(⇒) Let R ′ be a solution to the Minimum Edge-Removal Consistency Problem for R and S. Then R ′ is obtained by deleting at most k edges from R, and there exists an S-consistent DS-tree G ′ satisfying R ′ . By Lemma 1, we may assume that if S is binary, then so is G ′ . Now, since R ′ has at most |P| + k non-edges, G ′ has at most k + |P| paralogs and is therefore a solution to the constructed instance of the Maximum Homology Correction Problem that breaks at most k + |P| orthologies of R(G).
(⇐) Let G ′ be a solution, binary or not, to the constructed Maximum Homology Correction Problem instance and let R ′ = R(G ′ ). Since G ′ satisfies P and breaks at most |P| + k orthologies, R ′ must have P as non-edges, plus at most k other non-edges. Thus R ′ can be obtained by removing at most k edges from R(G) − P = R, as desired.

The Maximum Clade Correction problem
Maximum Clade Correction problem: Input: A gene tree G, a species tree S, a set O of orthology and a set P of paralogy relations and an integer k; Output: "Yes" if there exists an S-consistent DS-tree G ′ satisfying O and P such that G and G ′ have at least k clades in common.
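As an illustration of the objective, the sketch below counts the clades shared by two trees on the same leaf set. It assumes, as in the counting used later in the proof, that a clade is the leaf set below an internal node (root included); the nested-tuple encoding and the function names are illustrative choices.

```python
def clades(tree):
    """Return the set of clades of a tree given as nested tuples of leaf names."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)  # one clade per internal node
        return below

    visit(tree)
    return found

def common_clades(g1, g2):
    """Number of clades present in both trees."""
    return len(clades(g1) & clades(g2))

G = ((("a", "b"), "c"), "d")
G_prime = (("a", "b"), ("c", "d"))
print(common_clades(G, G_prime))  # {a, b} and {a, b, c, d} are shared
```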
Notice that if S, G and G ′ are required to be binary, the effective measure between G and G ′ is the Robinson-Foulds distance. This special case is handled as part of the general proof. Before giving the proof, we need the following lemma, which uses grafting operations to add leaves to G and satisfy a prescribed relation without breaking other relations.
Given two trees T 1 and T 2 , connecting T 1 with T 2 corresponds to creating a new node x and giving it r(T 1 ) and r(T 2 ) as its two children. Grafting a new leaf x to a tree T corresponds to adding x to L(T ) by either: (1) adding x as a new child of some node u of T; (2) connecting T with x; (3) subdividing an edge uv and adding x as a child of the newly created vertex.
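The three grafting operations can be illustrated with a small mutable tree structure. The `Node` class and the function names below are assumptions made for this sketch only; event labels and species mappings are left out.

```python
class Node:
    """Minimal rooted tree node; leaves carry a name, internal nodes do not."""
    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])

def leaf_names(node):
    """Leaf names in left-to-right order."""
    if not node.children:
        return [node.name]
    return [n for child in node.children for n in leaf_names(child)]

def connect(t1, t2):
    """Create a new node whose two children are the roots of t1 and t2."""
    return Node(children=[t1, t2])

def graft_as_child(u, x):
    """Operation (1): add the new leaf x as a child of node u."""
    u.children.append(Node(x))

def graft_above_root(root, x):
    """Operation (2): connect the whole tree with the new leaf x."""
    return connect(root, Node(x))

def graft_on_edge(u, v, x):
    """Operation (3): subdivide edge uv and add x below the new vertex p."""
    p = Node(children=[Node(x), v])
    u.children[u.children.index(v)] = p
    return p

v = Node(children=[Node("a"), Node("b")])
root = Node(children=[v, Node("c")])
graft_on_edge(root, v, "x")
after_edge = leaf_names(root)
root = graft_above_root(root, "y")
graft_as_child(v, "z")
print(after_edge, leaf_names(root))
```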

Lemma 5
Let G be an S-consistent gene tree, for some species tree S. Let x be a gene not in G and y be some gene in G with s(x) ≠ s(y). Then there exists a gene tree G ′ obtained by grafting x to G such that the following conditions are satisfied:

Let uv be an edge of G, and suppose that we graft x on uv to obtain G ′ . Call p the parent of x in G ′ , and say that u is the parent of p (i.e. p has children x and v).
We will find such a uv that guarantees this s G (u) = s G ′ (u) property, while ensuring that lca G ′ (x, y) can be a speciation (i.e. ev G ′ (lca G ′ (x, y)) = Spec is S-consistent), and that ev G (u) = ev G ′ (u) is S-consistent. This will prove the Lemma.
Let s xy = lca S (s(x), s(y)), and let g be the lowest ancestor of y in G such that s G (g) is s xy or an ancestor of s xy . Note that the case in which g does not exist was handled in the beginning of the proof. Now suppose that ev G (g) = Dup. Denote by g ′ the child of g that is also an ancestor of y. Note that s G (g) is an ancestor of s xy and s G (g ′ ) is a strict descendant of s xy . We claim that uv = gg ′ . To see this, obtain G ′ by grafting x to gg ′ , p being the parent of x and g the parent of p. Then, s G ′ (p) = s xy , and its children species s G ′ (g ′ ) and s G ′ (x) are separated in S by our choice of g ′ . Thus setting ev G ′ (p) = ev G ′ (lca G ′ (x, y)) = Spec preserves S-consistency. Also, s G (g) = s G ′ (g) since s G (g) is already an ancestor of s xy = s G ′ (p). Finally, we are free to set ev G ′ (g) = Dup, satisfying all the required conditions.

So instead suppose that ev G (g) = Spec. Recall that adding x as a child of g to obtain a new tree G ′′ is not a solution. Since in G ′′ , lca G ′′ (x, y) = g, we must either have s G (g) ≠ s G ′′ (g), or ev G ′′ (g) = Spec is not S-consistent. By the choice of g, only the latter is possible, implying that all children of g are from separated species in G, but not in G ′′ . Therefore, there must be a child g ′ of g such that s G (g ′ ) is an ancestor of s(x). Note that g ′ must be unique since otherwise, ev G (g) = Spec would not be possible. We then claim that uv = gg ′ . Indeed, obtain G ′ by grafting x to gg ′ , p being the parent of x. We have s G ′ (p) = s G ′ (g ′ ), and we set ev G ′ (p) = Dup. The species of the children of g remain unchanged in G ′ , and so s G (g) = s G ′ (g) and ev G ′ (g) = ev G ′ (lca G ′ (x, y)) = Spec is S-consistent, again satisfying all required conditions.

Theorem 6 The Maximum Clade Correction Problem is NP-Complete, even if S, G and G ′ are required to be binary.
Proof Verifying S-consistency and comparing the set of clades from G and G ′ can clearly be done in polynomial time, thus the problem is in NP. We use the Maximum Node Consistency problem for our reduction, which is NP-Hard by Corollary 1. Let (R, S, k) be the Maximum Node Consistency instance, where R is the relation graph with V (R) = {v 1 , . . . , v n }, S the species tree and k an integer. Let α = n(n − 1 − k) + 2k (noting that α > 0 when k ≤ n). The constructed instance of the Maximum Clade Correction Problem uses the same species tree S. Construct G as follows: first consider G as any binary tree with n leaves l 1 , . . . , l n , where each leaf l i is mapped to vertex v i of R. Then, replace each leaf l i by a subtree T i constructed as follows: T i is a caterpillar tree with n − 1 + α leaves, and each leaf ℓ of T i is such that s(ℓ) = s(v i ) (recall that a caterpillar tree is a path to which we add a leaf child to each internal node). Let L i denote the set of the n − 1 deepest leaves of T i (the depth of a leaf ℓ being the number of nodes on the path between ℓ and the root). Each leaf of L i is mapped to a distinct node of V (R) \ {v i }. Denote by ℓ i,j the leaf of T i mapped to v j , and by N i the subtree of T i rooted at lca(L i ). Then G has exactly n(n − 1 + α) leaves and n(n − 1 + α) − 1 clades (since it is binary). An example is given in Fig. 4. Finally define O = {{ℓ i,j , ℓ j,i } : v i v j ∈ E(R)} the set of orthology relations to satisfy and P = {{ℓ i,j , ℓ j,i } : v i v j ∉ E(R)} the set of paralogy relations to satisfy. Note that each ℓ i,j is present in exactly one relation.
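The sizes used in this construction can be checked numerically. The sketch below builds one caterpillar subtree and verifies the leaf and clade counts for a small, arbitrarily chosen pair (n, k); the nested-tuple encoding, the leaf labels and the helper names are illustrative assumptions.

```python
def alpha(n, k):
    """The padding parameter alpha = n(n - 1 - k) + 2k from the construction."""
    return n * (n - 1 - k) + 2 * k

def caterpillar(labels):
    """Binary caterpillar as nested tuples; the last labels are the deepest leaves."""
    *rest, a, b = labels
    tree = (a, b)
    for leaf in reversed(rest):
        tree = (leaf, tree)
    return tree

def count_internal(tree):
    """Number of internal nodes, i.e. clades, of a nested-tuple tree."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(count_internal(child) for child in tree)

n, k = 4, 2                      # arbitrary small values for illustration
a = alpha(n, k)                  # 4 * 1 + 2 * 2 = 8
size = n - 1 + a                 # number of leaves of each subtree T_i
T = caterpillar([f"leaf{j}" for j in range(size)])
total_leaves = n * size          # leaves of G
total_clades = total_leaves - 1  # G is binary
print(a, count_internal(T), total_leaves, total_clades)
```

Each T i indeed contributes α + n − 2 clades, the quantity that appears in the clade bound k(α + n − 2).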
We show that R admits an S-consistent induced subgraph with at least k nodes if and only if G, O and P admit an S-consistent DS-tree G ′ satisfying O and P such that G and G ′ share at least k(α + n − 2) clades.
(⇒) Let R ′ be a solution to the Maximum Node Consistency instance, with |V (R ′ )| ≥ k, and let H be a DS-tree satisfying R ′ that is S-consistent. By Lemma 1, we may assume that if S is binary, then so is H. Now, since L(H) ⊆ V (R), to each leaf v i of H corresponds a subtree T i in G. Then build a DS-tree G * from H by replacing each leaf v i of H by T i , and labeling all internal nodes of inserted trees as Dup (in Fig. 4, G * is the subtree of G ′ rooted at the common ancestor of T a , T b and T c ). We first argue that G * is S-consistent and satisfies the subsets of O and P restricted to L(G * ). In a subsequent step, we will graft the genes missing from G * using Lemma 5. Notice that s H (v i ) = s G * (r(T i )) and that all nodes of H that are in G * have the same LCA-mapping in both trees. It follows that G * is also S-consistent. Also, for all v i , v j ∈ L(H), ev H (lca H (v i , v j )) = ev G * (lca G * (r(T i ), r(T j ))). Thus for any pair of leaves ℓ i,j , ℓ j,i in L(G * ) such that {ℓ i,j , ℓ j,i } ∈ O, lca G * (ℓ i,j , ℓ j,i ) is a speciation (by the construction of O from R and the fact that H satisfies R ′ ). By the same reasoning, if {ℓ i,j , ℓ j,i } ∈ P then ℓ i,j and ℓ j,i are paralogous in G * .
The solution G ′ is obtained by grafting to G * every leaf of G missing from G * whilst preserving the T i clades, maintaining satisfiability of O and P and S-consistency. If such a G ′ exists, then G and G ′ share at least k identical subtrees from {T 1 , . . . , T n }, and since each T i contains α + n − 2 clades, it follows that G and G ′ share at least k(α + n − 2) clades as required. Let L = L(G) \ L(G * ) be the leaves yet missing from G * . Let L O = {ℓ ∈ L : ∃ℓ ′ ∈ L(G * ), ℓℓ ′ ∈ O} (i.e. the leaves of L subject to an orthology constraint with some leaf already in G * ). The complement L̄ O = L \ L O is the set of leaves of L that are either subject to a paralogy constraint with some leaf of G * , or not subject to any constraint with any leaf of G * . Let R(L̄ O ) be the relation graph with vertex set L̄ O and edge set {ℓ 1 ℓ 2 : ℓ 1 , ℓ 2 ∈ L̄ O , ℓ 1 ℓ 2 ∈ O}, depicting the required orthologies within L̄ O . Recall that each leaf of L is contained in at most one relation, implying that each node of R(L̄ O ) has maximum degree 1. Thus R(L̄ O ) is P 3 -free and therefore is S-consistent. Let G L̄ O be a DS-tree satisfying R(L̄ O ) that is S-consistent, assuming that G L̄ O is binary if S is. We update G * by joining r(G * ) and r(G L̄ O ) under a common parent x, and labeling x as Dup. Notice that this does not modify any orthology or paralogy relation previously in G * or in G L̄ O , nor does it break S-consistency. This also ensures that paralogies ℓ 1 ℓ 2 ∈ P with ℓ 1 ∈ L̄ O and ℓ 2 ∈ L(G * ) are satisfied. The final step is to graft the leaves of L O to G * in a way satisfying orthology requirements. This is done by successively applying Lemma 5 to each ℓ ∈ L O . As shown, each such ℓ can be grafted into G * without modifying any orthology or paralogy relation already in G * whilst satisfying the orthology requirement that ℓ is subject to. It is straightforward to see that, in addition, ℓ can be grafted without breaking any T i clade present in G * , since every vertex in T i is mapped to the same species. The tree G ′ obtained after all these grafting operations satisfies O and P and has the required common clades with G.
(⇐) Let G ′ be a solution, binary or not, to the Maximum Clade Correction Problem instance. Denote by C the number of clades shared by both G and G ′ , with C ≥ k(α + n − 2). Recall that L i is the set of the n − 1 deepest leaves of T i in G, with N i being the subtree rooted at lca G (L i ). Denote by G ′ L i the subtree of G ′ rooted at lca G ′ (L i ). We say that N i was preserved if every leaf of G ′ L i belongs to L(T i ) (in other words, the N i clade might have been extended, but only to include other leaves from T i ). We claim that at least k of the N = {N 1 , . . . , N n } subtrees are preserved in G ′ . Assume, on the contrary, that at least n − k + 1 subtrees from N are not preserved. Take a non-preserved subtree N i . Then some leaf ℓ ∉ L(T i ) belongs to the lca G ′ (L i ) clade. This implies that for any ancestor x of r(N i ) in G, G ′ cannot contain the x clade. By construction of T i , r(N i ) has at least α ancestors in G. Therefore, C ≤ n(n − 1 + α) − 1 − α(n − k + 1). This leads to k(α + n − 2) ≤ C ≤ n(n − 1 + α) − 1 − α(n − k + 1), and then to α ≤ n(n − 1 − k) + 2k − 1, contradicting our choice of α. Now, let N p = {N i ∈ N : N i is preserved in G ′ }. We have |N p | ≥ k. Let $L = \bigcup_{i\,:\,N_i \in N_p} L_i$ and H = G ′ | L . Notice that to each N i ∈ N p corresponds exactly one subtree N ′ i in H such that L(N i ) = L(N ′ i ) (and all such N ′ i subtrees are disjoint). Let H * be the tree obtained by replacing every subtree N ′ i in H by v i . Replacing N ′ i by v i changes no LCA-mapping value since all vertices of N ′ i map to s(v i ). Thus, as G ′ is S-consistent, so are H and H * . Now, we claim that H * induces the set of relations represented by R ′ = R[L(H * )], which proves the theorem since |L(H * )| = |N p | ≥ k. By contradiction, suppose that v i v j ∈ E(R ′ ) but lca H * (v i , v j ) is labeled Dup. Then lca H (ℓ i,j , ℓ j,i ) is also labeled Dup, and so is lca G ′ (ℓ i,j , ℓ j,i ). But ℓ i,j ℓ j,i ∈ O, contradicting our assumption that G ′ is a solution. The same reasoning applies when v i v j ∉ E(R ′ ), ending the proof.
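For readers who wish to check the counting step of the (⇐) direction, the two bounds on C rearrange as follows. This is only an expansion of the inequalities stated in the proof, with no additional assumptions.

```latex
\begin{align*}
k(\alpha + n - 2) &\le n(n - 1 + \alpha) - 1 - \alpha(n - k + 1) \\
k\alpha + kn - 2k &\le n^2 - n - 1 + k\alpha - \alpha \\
\alpha &\le n^2 - n - kn + 2k - 1 \;=\; n(n - 1 - k) + 2k - 1
\end{align*}
```

Since α was chosen as n(n − 1 − k) + 2k, the last line cannot hold, which gives the contradiction.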

Algorithmic avenues
As the problems considered in this paper are all computationally hard, only non-polynomial exact algorithms or approximation algorithms can realistically be explored. Let us generalize the Minimum Edge-Removal Consistency problem to the minimum editing problem (i.e. minimizing edge removals and insertions). It is not hard to imagine a branch-and-bound algorithm that solves the problem. Call an induced subgraph H of a relation graph R bad if it is a P 4 , or there is a triplet of P 3 (H) not displayed by S. Each P 4 can be solved by six possible edge editings, and each contradictory triplet of P 3 (H) can be solved by three possible editings. Therefore, in a branch-and-bound process, one would verify if a given graph R ′ contains a bad subgraph and if so, proceed recursively on each graph obtained by an editing that removes it. If no bad subgraph exists, then R ′ is a possible solution and its number of editings is retained. If, at any point, R ′ has had more editings than the best solution encountered so far, the algorithm can stop the recursion. Notice however that an edge should not be edited more than once in order to avoid infinite loops. The idea of this branch-and-bound algorithm can also be applied to the Maximum Node Consistency problem. It is known that a P 4 , if one exists, can be found in linear time [38], and clearly a contradictory triplet, if any, can be found in time O(n 3 ) (though a more efficient algorithm may exist). A similar approach has been applied in [31] to design an FPT algorithm for the satisfiability problem.
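A minimal Python sketch of this branch-and-bound idea is given below. It is deliberately simplified and makes several assumptions: only induced P 4 's are treated as bad subgraphs (the triplet test against S is omitted), the P 4 search is brute force rather than the linear-time method of [38], and the names `find_p4` and `min_edits` are hypothetical.

```python
from itertools import combinations, permutations

def find_p4(vertices, edges):
    """Return the 4 vertices of an induced path a-b-c-d, or None (brute force)."""
    for quad in combinations(sorted(vertices), 4):
        for a, b, c, d in permutations(quad):
            if a > d:
                continue  # each path is seen in both directions; keep one
            path = {frozenset((a, b)), frozenset((b, c)), frozenset((c, d))}
            chords = {frozenset((a, c)), frozenset((a, d)), frozenset((b, d))}
            if path <= edges and not (chords & edges):
                return quad
    return None

def min_edits(vertices, edges, locked=frozenset(), done=0, best=None):
    """Minimum number of edge insertions/removals that make the graph P4-free."""
    if best is not None and done >= best:
        return best                  # bound: this branch cannot improve on best
    quad = find_p4(vertices, edges)
    if quad is None:
        return done                  # no bad subgraph left: candidate solution
    for pair in map(frozenset, combinations(quad, 2)):  # six possible edits
        if pair in locked:
            continue                 # never edit the same pair twice
        best = min_edits(vertices, edges ^ {pair}, locked | {pair},
                         done + 1, best)
    return best

p5 = {frozenset(p) for p in ("ab", "bc", "cd", "de")}
print(min_edits("abcde", p5))  # removing the edge bc is enough
```

On the path with five vertices, a single edit suffices. The search is exponential in the worst case, as expected for an exact algorithm on an NP-Hard problem.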
As for approximations, an algorithm proposed in [32] can be directly applied to the Minimum Edge-Removal Consistency problem and guarantees that we do not remove more than 4Δ(R) times more edges than the optimal solution, where Δ(R) is the maximum degree of R. The idea is simple: as long as R has a bad subgraph H, remove every edge incident to a vertex of H and continue. Even though this is the best known approximation algorithm so far, it has the undesirable effect of isolating many vertices, motivating the exploration of alternative algorithms. One direction would be to consider existing ideas on the problem of satisfiability, i.e. finding the minimum number of editings required to make a graph P 4 -free, and adapt them to the consistency problem, for instance the Min-Cut algorithm proposed in [39].
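The greedy idea described above can be sketched as follows. This is not the implementation of [32]: it assumes that bad subgraphs are detected by a user-supplied function, and the default detector below only looks for induced P 4 's by brute force, ignoring triplets contradicting S. All names are hypothetical.

```python
from itertools import combinations, permutations

def find_p4(vertices, edges):
    """Return the 4 vertices of an induced path a-b-c-d, or None (brute force)."""
    for quad in combinations(sorted(vertices), 4):
        for a, b, c, d in permutations(quad):
            path = {frozenset((a, b)), frozenset((b, c)), frozenset((c, d))}
            chords = {frozenset((a, c)), frozenset((a, d)), frozenset((b, d))}
            if path <= edges and not (chords & edges):
                return quad
    return None

def isolate_bad_subgraphs(vertices, edges, find_bad=find_p4):
    """Repeatedly delete every edge touching a bad subgraph.

    Returns (remaining_edges, removed_edges)."""
    edges = set(edges)
    removed = set()
    while True:
        bad = find_bad(vertices, edges)
        if bad is None:
            return edges, removed
        # Every edge with an endpoint in the bad subgraph is removed.
        touching = {e for e in edges if e & set(bad)}
        edges -= touching
        removed |= touching

p5 = {frozenset(p) for p in ("ab", "bc", "cd", "de")}
remaining, removed = isolate_bad_subgraphs("abcde", p5)
print(len(remaining), len(removed))
```

On the path with five vertices, all four edges are removed although deleting a single edge would make the graph P 4 -free, which illustrates the vertex-isolating behaviour mentioned above.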
As for gene tree correction, we have developed in [14] a polynomial-time algorithm which, given a species tree S and partial sets of relations O and P, verifies if there exists