The gene family-free median of three

Background The gene family-free framework for comparative genomics aims at providing methods for gene order analysis that do not require prior gene family assignment, but work directly on a sequence similarity graph. We study two problems related to the breakpoint median of three genomes, which asks for the construction of a fourth genome that minimizes the sum of breakpoint distances to the input genomes. Methods We present a model for constructing a median of three genomes in this family-free setting, based on maximizing an objective function that generalizes the classical breakpoint distance by integrating sequence similarity in the score of a gene adjacency. We study its computational complexity and we describe an integer linear program (ILP) for its exact solution. We further discuss a related problem called family-free adjacencies for k genomes for the special case of $k \le 3$ and present an ILP for its solution. However, for this problem, the computation of exact solutions remains intractable for sufficiently large instances. We then proceed to describe a heuristic method, FFAdj-AM, which performs well in practice. Results The developed methods compute accurate positional orthologs for genomes comparable in size to bacterial genomes on simulated data and genomic data acquired from the OMA orthology database. In particular, FFAdj-AM performs as well as or better than the well-established gene family prediction tool MultiMSOAR. Conclusions We study the computational complexity of a new family-free model and present algorithms for its solution. With FFAdj-AM, we propose an appealing alternative to established tools for identifying higher confidence positional orthologs.
Electronic supplementary material The online version of this article (doi:10.1186/s13015-017-0106-z) contains supplementary material, which is available to authorized users.


Introduction
The prediction of evolutionary relationships between genomic sequences is a longstanding problem in computational biology. According to Fitch [9], two genomic sequences are called homologous if they descended from a common ancestral sequence. Furthermore, Fitch identifies different events that give rise to a branching point in the phylogeny of homologous sequences, leading to the concepts of orthologous genes (which descend from their last common ancestor through a speciation) and paralogous genes (which descend from their last common ancestor through a duplication), concepts that reach far beyond evolutionary genomics [10]. Until quite recently, orthology and paralogy relationships were mostly inferred from sequence similarity. However, it is now well accepted that the syntenic context can carry valuable evolutionary information, which has led to the notion of positional orthologs [6]. In the present work, we describe a method to compute groups of likely orthologous genes for a group of three genomes, through a new problem we introduce, the gene family-free median of three.
Most methods for detecting potential orthologous groups require a prior clustering of the genes of the considered genomes into homologous gene families, defined as groups of genes assumed to originate from a single ancestral gene; clustering protein sequences into families is already in itself a difficult problem.
Here, we follow the matching-based approach, framed within the gene family-free principle, which embodies the idea of performing gene order analysis without the prerequisite of gene family or homology assignments. Instead, we are given all-against-all gene similarities through a symmetric and reflexive similarity measure $\sigma : \Sigma \times \Sigma \to \mathbb{R}_{\geq 0}$ over the universe of genes $\Sigma$ [4]. We use sequence similarity, but other similarity measures can fit the previous definition. Gene family or homology assignments represent a particular subgroup of gene similarity functions that require transitivity. Independent of the particular similarity measure $\sigma$, relations between genes imposed by $\sigma$ are considered as candidates for homology assignments. A gene family-free research program was outlined in [4] (see also [8]) and has so far been developed for the pairwise comparison of genomes [7,14,11] and shown to be effective for orthology analysis [12].
In Section 2 we introduce a new genome median problem in the family-free framework that generalizes the traditional breakpoint median problem [19]. For a group of three genomes, the input of the family-free median problem is a tripartite similarity graph of pairwise gene similarities. Informally, a median of three consists of a set of median genes, each defined by three extant genes forming a clique in the similarity graph and scored according to the edges of this clique, together with a set of median adjacencies formed by these genes, each supported by at least one extant gene adjacency (hence any median gene belongs to at least one median adjacency). A median is optimal if it maximizes the sum of the scores of its median genes. Hence, the optimization criterion of this problem fully integrates both sequence similarity and synteny conservation. In Section 3 we study its computational complexity and give an exact algorithm for its solution. In Section 4 we show that our method can be used for positional ortholog prediction in simulated and real data sets of bacterial genomes.

The gene family-free median of three
Extant genomes, genes and adjacencies. In this work, a genome $G$ is entirely represented by a tuple $G \equiv (C, A)$, where $C$ denotes a non-empty set of unique genes and $A$ is a set of adjacencies. Genes are represented by their extremities, i.e., a gene $g \equiv (g^t, g^h)$, $g \in C$, consists of a head $g^h$ and a tail $g^t$. Telomeres are modeled explicitly, as special genes of $C(G)$ with a single extremity, denoted by "•". Extremities $g^a, \bar{g}^b$, $a, b \in \{h, t\}$, of any two genes $g, \bar{g}$ can form an adjacency. In the following, we will conveniently use the notation $C(G)$ and $A(G)$ to denote the set of genes and the set of adjacencies of genome $G$, respectively. We indicate the presence of an adjacency $\{x_1^a, x_2^b\}$ in an extant genome $X$ by
$$\mathbb{I}_X[x_1^a, x_2^b] = \begin{cases} 1 & \text{if } \{x_1^a, x_2^b\} \in A(X) \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
Given two genomes $G$ and $H$ and gene similarity measure $\sigma$, two adjacencies $\{g_1^a, g_2^b\} \in A(G)$ and $\{h_1^a, h_2^b\} \in A(H)$ with $a, b \in \{h, t\}$ are conserved iff $\sigma(g_1, h_1) > 0$ and $\sigma(g_2, h_2) > 0$. We subsequently define the adjacency score of any four extremities $g^a, h^b, i^c, j^d$, where $a, b, c, d \in \{h, t\}$ and $g, h, i, j \in \Sigma$, as the geometric mean of their corresponding gene similarities:
$$s(g^a, h^b, i^c, j^d) \equiv \sqrt{\sigma(g, h) \cdot \sigma(i, j)}. \tag{2}$$
Median genome, genes and adjacencies. Informally, the family-free median problem asks for a fourth genome $M$ that maximizes the sum of pairwise adjacency scores to three given extant genomes $G$, $H$, and $I$. In doing so, the gene content of the requested median $M$ must first be defined: each gene $m \in C(M)$ must be unambiguously associated with a triple of extant genes $(g, h, i)$, $g \in C(G)$, $h \in C(H)$, and $i \in C(I)$. Moreover, we want to associate with a median gene $m$ a sequence similarity score relative to the three extant genes $(g, h, i)$ it is related to. As the sequence of the median gene is obviously not available, we define this score as the geometric mean of their pairwise similarities:
$$\sigma(g, m) = \sigma(h, m) = \sigma(i, m) \equiv \sqrt[3]{\sigma(g, h) \cdot \sigma(g, i) \cdot \sigma(h, i)}. \tag{3}$$
In the following we make use of the mappings $\pi_G(m) \equiv g$, $\pi_H(m) \equiv h$, and $\pi_I(m) \equiv i$ to relate gene $m$ with its extant counterparts. Two candidate median genes or telomeres $m_1$ and $m_2$ are conflicting if $m_1 \neq m_2$ and the intersection between the associated gene sets $\{\pi_G(m_1), \pi_H(m_1), \pi_I(m_1)\}$ and $\{\pi_G(m_2), \pi_H(m_2), \pi_I(m_2)\}$ is non-empty. A set of candidate median genes or telomeres $C$ is called conflict-free if no two of its members $m_1, m_2 \in C$ are conflicting. This definition trivially extends to the notion of a conflict-free median.
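The scores and the conflict relation defined above can be illustrated with a minimal Python sketch. It assumes that gene identifiers are unique across genomes, that a candidate median gene is a tuple `(g, h, i)`, and that similarities are stored in a dictionary keyed by unordered pairs; the function names and this data layout are illustrative choices and are not taken from the authors' implementation.

```python
from itertools import combinations
from math import sqrt


def sim(sigma, x, y):
    """Symmetric lookup of a pairwise gene similarity; absent pairs score 0."""
    return sigma.get(frozenset((x, y)), 0.0)


def median_gene_score(sigma, m):
    """Geometric mean of the three pairwise similarities of a triple (g, h, i)."""
    g, h, i = m
    return (sim(sigma, g, h) * sim(sigma, g, i) * sim(sigma, h, i)) ** (1.0 / 3.0)


def adjacency_score(sim_first, sim_second):
    """Geometric mean of the two gene similarities involved in an adjacency."""
    return sqrt(sim_first * sim_second)


def conflicting(m1, m2):
    """Two distinct candidate median genes conflict if they share an extant gene."""
    return m1 != m2 and bool(set(m1) & set(m2))


def is_conflict_free(candidates):
    """True if no two candidate median genes in the collection are conflicting."""
    return not any(conflicting(a, b) for a, b in combinations(candidates, 2))
```

For instance, the triples `("g1", "h1", "i1")` and `("g1", "h2", "i2")` are conflicting because both use the extant gene `g1`.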
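Problem 1 (FF-Median). Given three genomes $G$, $H$, and $I$, and gene similarity measure $\sigma$, find a conflict-free median $M$ which maximizes the following formula:
$$\mathcal{F}_\lambda(M) = \sum_{\{m_1^a, m_2^b\} \in A(M)} \; \sum_{X \in \{G, H, I\}} \mathbb{I}_X[\pi_X(m_1)^a, \pi_X(m_2)^b] \cdot s(m_1^a, \pi_X(m_1)^a, m_2^b, \pi_X(m_2)^b), \tag{4}$$
where $a, b \in \{h, t\}$, $\mathbb{I}_X[\cdot]$ equals 1 if the given adjacency is present in extant genome $X$ and 0 otherwise, and $s(\cdot)$ is the adjacency score as defined by Equation (2).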
Remark 1. The adjacency score for a median adjacency $\{m_1^a, m_2^b\}$ with respect to the corresponding potential extant adjacency $\{\pi_X(m_1)^a, \pi_X(m_2)^b\}$, where $\{m_1^a, m_2^b\} \in A(M)$ and $X \in \{G, H, I\}$, can be entirely expressed in terms of pairwise similarities between genes of extant genomes using Equation (3):
$$s(m_1^a, \pi_X(m_1)^a, m_2^b, \pi_X(m_2)^b) = \sqrt[6]{\prod_{\{Y, Z\} \subset \{G, H, I\}} \sigma(\pi_Y(m_1), \pi_Z(m_1)) \cdot \sigma(\pi_Y(m_2), \pi_Z(m_2))}.$$
In the following, a median gene $m$ and its extant counterparts $(g, h, i)$ are treated as equivalent. We denote the set of all candidate median genes by
$$\Sigma_\lambda = \{(g, h, i) \mid g \in C(G),\ h \in C(H),\ i \in C(I)\}.$$
Each pair of median genes $(g_1, h_1, i_1), (g_2, h_2, i_2) \in \Sigma_\lambda$ and extremities $a, b \in \{h, t\}$ give rise to a candidate median adjacency $\{(g_1^a, h_1^a, i_1^a), (g_2^b, h_2^b, i_2^b)\}$ if $(g_1^a, h_1^a, i_1^a) \neq (g_2^b, h_2^b, i_2^b)$, and $(g_1, h_1, i_1)$ and $(g_2, h_2, i_2)$ are non-conflicting. We denote by $\mathcal{A}$ the set of all candidate median adjacencies and by $\mathcal{A}^C \subseteq \mathcal{A}$ the set of all conserved candidate median adjacencies, i.e., those candidate median adjacencies $\{m_1^a, m_2^b\}$ for which $\{\pi_X(m_1)^a, \pi_X(m_2)^b\} \in A(X)$ for at least one extant genome $X \in \{G, H, I\}$.
Remark 2. A median gene can only belong to a median adjacency with non-zero adjacency score if all pairwise similarities of its corresponding extant genes $g, h, i$ are non-zero. Thus, the search for median genes can be limited to 3-cliques (triangles) in the tripartite similarity graph.
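Remark 2 suggests a direct way to enumerate useful candidate median genes: list the triangles of the tripartite similarity graph. The following sketch does so by plain enumeration; the dictionary layout for $\sigma$ and the function name are assumptions made for illustration, and no attempt is made at the efficiency a real implementation would need.

```python
def candidate_median_genes(genes_g, genes_h, genes_i, sigma):
    """Enumerate triples (g, h, i) whose three pairwise similarities are non-zero.

    sigma maps frozenset({x, y}) to a similarity; missing pairs count as 0.
    """
    def sim(x, y):
        return sigma.get(frozenset((x, y)), 0.0)

    triples = []
    for g in genes_g:
        # Restrict to the neighbours of g before testing the closing edge.
        neighbours_h = [h for h in genes_h if sim(g, h) > 0]
        neighbours_i = [i for i in genes_i if sim(g, i) > 0]
        for h in neighbours_h:
            for i in neighbours_i:
                if sim(h, i) > 0:
                    triples.append((g, h, i))
    return triples
```

Extant genes that never appear in the returned list belong to no 3-clique and therefore cannot contribute to any median adjacency with non-zero score.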
Remark 3. The right-hand side of the above formula for the weight of an adjacency is independent of genome X. From Equation (4), an adjacency in median M only has an impact on a solution to problem FF-Median if it participates in a gene adjacency in at least one extant genome. Hence, including in a median genome median genes that do not belong to a candidate median adjacency in $\mathcal{A}^C$ does not increase the objective function.
Related problems. The FF-median problem relates to previously studied gene order evolution problems. It is a generalization of the tractable mixed multichromosomal median problem introduced in [19], which can indeed be defined as an FF-median problem with a similarity graph composed of disjoint 3-cliques and edges all having the same weight. The FF-median problem also bears similarity to methods aimed at detecting groups of orthologous genes based on gene order evolution, especially the MultiMSOAR [18] algorithm, although other methods integrate synteny and sequence conservation for inferring orthogroups, see [6]. Our approach differs first and foremost in its family-free principle (all other methods require a prior gene family assignment). MultiMSOAR is the only other method that can handle more than two genomes with an optimization criterion that considers gene order evolution; both MultiMSOAR (for three genomes) and FF-median aim at computing a maximum weight tripartite matching. However, we differ fundamentally from MultiMSOAR by the full integration of sequence and synteny conservation into the objective function, while MultiMSOAR proceeds first by computing pairwise orthology assignments to define a multipartite graph.

Algorithmic and complexity results
We now describe our theoretical results: an NP-hardness proof, an exact Integer Linear Program (ILP), and an algorithm to detect locally optimal structures.
We describe the full hardness proof in Appendix A. It is based on a reduction from the Maximum Independent Set for Graphs of Bounded Degree 3.
An exact ILP algorithm to problem FF-Median. We now present program FF-Median, described by Algorithm 1, that exploits the specific properties of problem FF-Median to design an ILP using $O(n^5)$ variables and statements. Program FF-Median makes use of two types of binary variables a and b, as declared in domain specifications (D.01) and (D.02), that define the set of median genes $\Sigma_\lambda$ and of candidate conserved median adjacencies $\mathcal{A}^C$ (Remark 3). The former variable type indicates the presence or absence of candidate genes in an optimal median M. The latter, variable type b, specifies whether an adjacency between two gene extremities or telomeres is established in M. Constraint (C.01) ensures that M is conflict-free, by demanding that each extant gene (or telomere) can be associated with at most one median gene (or telomere). Further, constraint (C.02) dictates that a median adjacency can only be established between genes that are both part of the median. Lastly, constraint (C.03) guarantees that each gene extremity and telomere of the median participates in at most one adjacency.
Algorithm 1 Program FF-Median for three genomes (G, H, I)
Remark 4. The output of the algorithm FF-Median is a set of adjacencies between median genes that define a set of linear and/or circular orders, called CARs (Contiguous Ancestral Regions), where linear segments are not capped by telomeres. So formally the computed median might not be a valid genome. However, as adding adjacencies that do not belong to $\mathcal{A}^C$ does not modify the score of a given median, a set of median adjacencies can always be completed into a valid genome by such adjacencies that join the linear segments together and add telomeres. These extra adjacencies would not be supported by any extant genome and can thus be considered dubious; in our implementation, we only return the median adjacencies computed by the ILP, i.e. a subset of $\mathcal{A}^C$.
Remark 5. Following Remark 2, preprocessing the input extant genomes requires handling the extant genes that do not belong to at least one 3-clique in the similarity graph. Such genes cannot be part of any median. One could decide to leave them in the input, since the ILP can handle them and ensures that they are never part of the output solution. However, discarding them from the extant genomes can help recover adjacencies that have been disrupted, for example by the insertion of a mobile element, so in our implementation we follow this approach.
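To make the role of the three constraint families concrete, the sketch below checks whether a proposed selection of median genes and adjacencies satisfies the conditions described for (C.01), (C.02) and (C.03). It is only a feasibility check on an explicit solution, not the ILP itself; telomeres are left out, and the representation of an extremity as a pair `(median_gene, "h" or "t")` is an assumption made for this example.

```python
from collections import Counter


def satisfies_constraints(median_genes, adjacencies):
    """Check a candidate solution against the three constraint families.

    median_genes: iterable of triples (g, h, i) of extant gene identifiers.
    adjacencies:  iterable of pairs of extremities, each extremity being
                  (median_gene, "h") or (median_gene, "t").
    """
    selected = set(median_genes)

    # (C.01) each extant gene is associated with at most one median gene.
    usage = Counter(extant for m in selected for extant in m)
    if any(count > 1 for count in usage.values()):
        return False

    # (C.02) an adjacency needs both of its median genes to be selected.
    for (m1, _a), (m2, _b) in adjacencies:
        if m1 not in selected or m2 not in selected:
            return False

    # (C.03) each gene extremity takes part in at most one adjacency.
    extremity_usage = Counter(ext for adj in adjacencies for ext in adj)
    return all(count <= 1 for count in extremity_usage.values())
```

In an ILP, each of these checks becomes a linear inequality over the binary variables; the function above simply evaluates the same conditions on concrete sets.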
As discussed at the end of Section 2, the FF-median problem is a generalization of the mixed multichromosomal breakpoint median [19].However, it was shown in [19] that this breakpoint median problem can be solved in polynomial time by a Maximum-Weight Matching (MWM) algorithm.This motivates the results presented in the next paragraph that use a MWM algorithm to identify optimal median substructures by focusing on conflict-free sets of median genes.
Finding local optimal segments.Tannier et al. [19] solve the mixed multichromosomal breakpoint median problem by transforming it into an MWM problem, that we outline now.A graph is defined in which each extremity of a candidate median gene and each telomere gives rise to a vertex.Any two vertices are connected by an edge, weighted according to the number of observed adjacencies between the two gene extremities in extant genomes.Edges corresponding to adjacencies between a gene extremity and telomeres are weighted only by half as much.An MWM in this graph induces a set of adjacencies that defines an optimal median.
We first describe how this approach applies to our problem. We define a graph $\Gamma(G, H, I, \sigma)$ constructed from an FF-Median instance $(G, H, I, \sigma)$ that is similar to that of Tannier et al., only deviating by defining vertices as extremities of candidate median genes (or telomeres) and weighting an edge between two candidate median gene extremities (or telomeres) by the contribution of the corresponding candidate median adjacency to the objective of Problem 1, i.e., the sum of its adjacency scores over the extant genomes in which the adjacency is present. We first make the following observation, where a conflict-free matching is a matching that does not contain two conflicting vertices (candidate median genes):
Observation 2. Any conflict-free matching in graph $\Gamma(G, H, I, \sigma)$ of maximum weight defines an optimal median.
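As an illustration of how such an edge weight could be computed, the sketch below multiplies the adjacency score of Remark 1 by the number of extant genomes that contain the corresponding adjacency. This reading of the weight is an assumption consistent with Observation 2 rather than a formula quoted from the text; telomeres are ignored, and the data layout (one set of extant adjacencies per genome, extremities as `(gene, "h" or "t")` pairs) is hypothetical.

```python
def edge_weight(m1, a, m2, b, extant_adjacencies, sigma):
    """Weight of the edge between extremity a of m1 and extremity b of m2.

    m1, m2:             candidate median genes as triples (g, h, i).
    extant_adjacencies: three sets (for G, H, I) of frozensets of extremities.
    sigma:              dict mapping frozenset({x, y}) to a similarity.
    """
    def sim(x, y):
        return sigma.get(frozenset((x, y)), 0.0)

    # Sixth root of the product of all pairwise similarities of both triples.
    product = 1.0
    for g, h, i in (m1, m2):
        product *= sim(g, h) * sim(g, i) * sim(h, i)
    score = product ** (1.0 / 6.0)

    # Count the extant genomes in which the projected adjacency is present.
    support = sum(
        1
        for k, adjacencies in enumerate(extant_adjacencies)
        if frozenset(((m1[k], a), (m2[k], b))) in adjacencies
    )
    return support * score
```

With all similarities equal to 1, the weight reduces to the number of extant genomes supporting the adjacency, which matches the unweighted breakpoint setting of Tannier et al. for adjacencies between gene extremities.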
We now show that we can define notions of sub-instances of a full FF-median instance that contain no internal conflicts, for which applying the MWM allows us to detect whether the set of median genes defining the sub-instance is part of at least one optimal FF-median. Let S be a set of candidate median genes. An internal conflict is a conflict between two genes from S; an external conflict is a conflict between a gene from S and a candidate median gene not in S. We say that S is contiguous in extant genome X if the set $\pi_X(S)$ forms a unique, contiguous segment in X. We say that S is an internal-conflict free segment (IC-free segment) if it contains no internal conflict and is contiguous in all three extant genomes; this can be seen as the family-free equivalent of the notion of common interval in permutations [3]. An IC-free segment is framed if the extremities of the extant segments belong to the same two median genes, with conserved relative orientations (the equivalent of a conserved interval). An IC-free segment is a run if the order of the extant genes is conserved in all three extant genomes, up to a full reversal of the segment.
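The definition of an IC-free segment can be checked directly on small examples. The sketch below assumes, for simplicity, that each extant genome is a single linear chromosome given as an ordered list of unique gene identifiers and that gene orientation can be ignored; both are simplifications introduced here, and the function names are illustrative.

```python
from itertools import combinations


def is_contiguous(gene_order, genes):
    """True if the given genes occupy consecutive positions in gene_order."""
    index = {gene: pos for pos, gene in enumerate(gene_order)}
    positions = sorted(index[x] for x in set(genes))
    return bool(positions) and positions[-1] - positions[0] + 1 == len(positions)


def is_ic_free_segment(segment, orders):
    """Check that a set of candidate median genes is an IC-free segment.

    segment: list of triples (g, h, i).
    orders:  three gene orders, for genomes G, H and I respectively.
    """
    # No internal conflict: distinct triples must not share an extant gene.
    for m1, m2 in combinations(segment, 2):
        if m1 != m2 and set(m1) & set(m2):
            return False
    # Contiguity of the projection of the segment in every extant genome.
    return all(
        is_contiguous(order, [m[k] for m in segment])
        for k, order in enumerate(orders)
    )
```

Testing whether a segment is additionally framed or a run would require gene orientations and relative orders, which this sketch deliberately leaves out.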
Intuitively, one can find an optimal solution to the sub-instance defined by an IC-free segment, but it might not be part of an optimal median for the whole instance due to side effects of the rest of the instance. So we need to adapt the graph to which we apply an MWM algorithm to account for such side effects. To do so, we define the potential of a candidate median gene $m$, denoted $\Delta(m)$. We then extend graph $\Gamma =: (V, E)$ to graph $\Gamma' := (V, E')$ by adding edges between the extremities of each candidate median gene of an IC-free segment S, i.e. $E' = E \cup \{\{m^h, m^t\} \mid m \in S\}$. In the following we refer to these edges as conflict edges. Let $C(m)$ be the set of candidate median genes that are involved in an (external) conflict with a given candidate median gene $m$ of S; then the conflict edge $\{m^h, m^t\} \in E'$ is weighted by the maximum potential of a non-conflicting subset of $C(m)$, i.e., by $\max(\{\sum_{m' \in C'} \Delta(m') \mid C' \subseteq C(m) : C' \text{ is conflict-free}\})$. A conflict-free matching in $\Gamma'$ is a matching that does not contain a conflict edge.
Lemma 1. Given an internal conflict-free segment S, any maximum weight matching in graph $\Gamma'(S)$ that is conflict-free defines a set of median genes and adjacencies that belong to at least one optimal FF-median of the whole instance.
A proof is presented in Appendix B. Lemma 1 leads to a procedure (Algorithm 2) that iteratively identifies and tests IC-free segments in the FF-Median instance. For each identified IC-free segment S, an adjacency graph $\Gamma'(S)$ is constructed and a maximum weight matching is computed (lines 2-3). If the resulting matching is conflict-free (line 4), the adjacencies of IC-free segment S are reported and S is removed from the FF-Median instance by masking its internal adjacencies and removing all candidate median genes (and consequently their associated candidate median adjacencies) corresponding to external conflicts (lines 5-6). It then follows immediately from Lemma 1 that the set of median genes returned by Algorithm 2 belongs to at least one optimal solution to the FF-median problem.

Algorithm 2 Algorithm ICF-SEG
Input: FF-Median instance (G, H, I, σ)
Output: Set of adjacencies AdjM that is part of a median M of (G, H, I, σ).

Experimental results and discussion
Our algorithms have been implemented in Python and require CPLEX; they are freely available as part of the family-free genome comparison tool FFGC downloadable at http://bibiserv.cebitec.uni-bielefeld.de/ffgc.
In subsequent analyses, gene similarities are based on local alignment hits identified with BLASTP [2] on protein sequences using an e-value threshold of $10^{-5}$. In gene similarity graphs, we discard spurious edges by applying a stringency filter proposed by Lechner et al. [13] that utilizes a local threshold parameter $f \in [0, 1]$ and BLAST bitscores: a BLAST hit from a gene g to h is only retained if it has a higher or equal score than f times the best BLAST hit from h to any gene g′ that is a member of the same genome as g. In all our experiments, we set f to 0.5. Edge weights of the gene similarity graph are then calculated according to the relative reciprocal BLAST score (RRBS) [17]. Finally, we applied Algorithm ICF-SEG with conserved segments defined as runs.
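The filtering and weighting steps can be sketched as follows. The filter follows the rule stated above literally. The RRBS formula used here, the sum of the two reciprocal bitscores divided by the sum of the two self-hit bitscores, is the commonly cited definition and is an assumption as far as this text is concerned; the dictionary-based input format and the function names are likewise illustrative.

```python
def stringency_filter(bitscores, genome_of, f=0.5):
    """Keep a hit g -> h only if its bitscore is at least f times the best
    hit from h to any gene in the genome of g.

    bitscores: dict mapping (query, subject) to a BLAST bitscore.
    genome_of: dict mapping each gene to its genome label.
    """
    # Best bitscore of each query against each target genome.
    best = {}
    for (query, subject), score in bitscores.items():
        key = (query, genome_of[subject])
        best[key] = max(best.get(key, 0.0), score)

    return {
        (g, h): score
        for (g, h), score in bitscores.items()
        if score >= f * best.get((h, genome_of[g]), 0.0)
    }


def rrbs(bitscores, g, h):
    """Relative reciprocal BLAST score (assumed definition, see lead-in)."""
    self_total = bitscores.get((g, g), 0.0) + bitscores.get((h, h), 0.0)
    if self_total == 0:
        return 0.0
    reciprocal = bitscores.get((g, h), 0.0) + bitscores.get((h, g), 0.0)
    return reciprocal / self_total
```

In a pipeline of this kind, one would typically filter first and compute RRBS only for gene pairs whose hits survive the filter; whether a hit is required in both directions is a design decision that the text above does not settle.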
For solving the FF-Median problem, we granted CPLEX two CPU cores, 4 GB memory, and a time limit of 3 hours per dataset.
In our experiments, we compare ourselves against the orthology prediction tool MultiMSOAR [18].This tool requires precomputed gene families, which we constructed by following the workflow described in [18].
Evaluation on simulated data. We first evaluate our algorithms on simulated data sets obtained by ALF [5]. The ALF simulator covers many aspects of genome evolution, from point mutations to global modifications. The latter include two types of genome rearrangements, as well as various options to customize the process of gene family evolution. In our simulations, we mainly use standard parameters suggested by the authors of ALF and we focus on three parameters that primarily influence the outcome of gene family-free genome analysis: (i) the rate of sequence evolution, (ii) the rate of genome rearrangements, and (iii) the rate of gene duplications and losses. We keep all three rates constant, only varying the evolutionary distance between the generated extant genomes. We confine our simulations to protein coding sequences. A comprehensive list of parameter settings used in our simulations is shown in Table 2 in Appendix C. As root genome in the simulations, we used the genomic sequence of an E. coli K-12 strain, which comprises 4,320 protein coding genes. We then generated 7 × 10 data sets with increasing evolutionary distance ranging from 10 to 130 percent accepted mutations (PAM). Details about the generated data sets are shown in Table 1 in Appendix C.
Figure 2 (a) shows the outcome of our analysis with respect to precision and recall of inferring positional orthologs. In all simulations, FF-Median generated no or very few false positives, leading to perfect or near-perfect precision scores and consistently outperforming MultiMSOAR. However, since the objective of FF-Median only takes into account median genes that are conserved by synteny, the increase in mutational changes over evolutionary time causes a growing loss of syntenic context, which results in a lower recall. Therefore, MultiMSOAR retains a better recall for larger evolutionary distances, while FF-Median provides better results for more closely related genomes.
Evaluation on real data. We study 15 γ-proteobacterial genomes that span a large taxonomic spectrum and are contained in the OMA database [1]. A complete list of species names is given in Appendix D. We obtained the genomic sequences from the NCBI database and constructed, for each combination of three genomes, a gene similarity graph following the same procedure as for the simulated dataset. In 9 out of the 455 combinations of genomes the time limit prohibited CPLEX from finding an optimal solution. However, in those cases CPLEX was still able to find integer feasible suboptimal solutions. Figure 2 (b) displays statistics of the real dataset. The number of candidate median genes and adjacencies ranges from 442 to 18,043 and from 3,164 to 2,261,716, respectively, giving rise to up to 3,227 median genes that are distributed on 5 to 91 CARs per median. Some CARs are circular, indicating dubious conformations mostly arising from tandem duplications, but the number of such cases was low (mean: 2.78, max: 13).
We observed that the gene families in the OMA database are clustered tightly and therefore miss many true orthologies in the considered triples of genomes. As a result, many of the orthologous groups inferred by FF-Median and MultiMSOAR fall into more than one gene family inferred by OMA. We therefore evaluate our results by classifying the inferred orthologous groups into three categories: An orthologous group agrees with OMA if its three genes are in the same OMA group. It disagrees with OMA if extant genes x and y (of genomes X and Y respectively) are in different OMA groups but the OMA group of x contains another gene from genome Y. It is compatible with OMA if it neither agrees nor disagrees with OMA. We measure the number of median genes as well as the number of orthologous groups of MultiMSOAR in each of the three categories. Figure 2 (c) and (d) show the outcome of this analysis. MultiMSOAR is generally able to find more orthology relations in the dataset. This comes as no surprise, as it is clear from the objective of problem FF-Median and from the results on the simulated datasets that our method does not retain candidate median genes which have lost their syntenic context, which happens in triples of highly divergent genomes. The number of orthologous groups that disagree with OMA is comparably low for both FF-Median (mean: 35.16, var: 348) and MultiMSOAR (mean: 48.61, var: 348).
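The three categories can be expressed as a small classification routine. The sketch below assumes that OMA group membership is available as a dictionary from gene to group identifier, with genes outside any OMA group simply absent, and it treats a gene without an OMA group as being in a different group from any assigned gene; this last point is an interpretation, since the text does not spell out that case.

```python
def classify_group(group, oma_group, genome_of):
    """Classify an inferred orthologous group as agree, disagree or compatible.

    group:     tuple of three genes, one per extant genome.
    oma_group: dict mapping a gene to its OMA group identifier.
    genome_of: dict mapping a gene to its genome label.
    """
    ids = [oma_group.get(x) for x in group]
    if ids[0] is not None and len(set(ids)) == 1:
        return "agree"

    # Invert the mapping to list the members of every OMA group.
    members = {}
    for gene, group_id in oma_group.items():
        members.setdefault(group_id, set()).add(gene)

    for x in group:
        for y in group:
            if x == y:
                continue
            group_x = oma_group.get(x)
            if group_x is None or group_x == oma_group.get(y):
                continue
            # x and y differ in OMA, yet x's group holds another gene of Y.
            if any(z != y and genome_of[z] == genome_of[y] for z in members[group_x]):
                return "disagree"
    return "compatible"
```

Counting the returned labels over all inferred groups of a genome triple gives the kind of tallies summarized in Figure 2 (c) and (d).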
We then performed another analysis to assess the robustness of the positional orthology predictions. To this end, we look at orthologous groups across multiple datasets that share two extant genomes, but vary in the third. Given two genes, x of genome X and y of genome Y, an orthologous group that contains x and y is called robust if x and y occur in the same orthologous group, whatever the third extant genome is. We computed the percentage of robust orthologous groups for all gene pairs of genomes E. coli K-12 MG 1655 and S. enterica subsp. enterica serovar Typhimurium str. 14028s in our dataset. The results indicate that orthologous groups inferred by FF-Median are slightly more robust (95.61%) than those inferred by MultiMSOAR (91.77%). This is likely due to the strict constraint of defining median adjacencies only from genes that participate in at least one observed adjacency (Remark 4).
Overall, we observe that FF-Median performed better than MultiMSOAR only for triples of closely related genomes, which is consistent with our observation on simulated data, while being slightly more robust in general. This suggests that FF-Median is an interesting alternative for identifying higher confidence positional orthologs, at the expense of recall.
Future work. We first aim to investigate alternative methods to reduce the computational load of Program FF-Median by identifying further strictly suboptimal and optimal substructures, which might require a better understanding of the impact of internal conflicts within substructures defined by intervals in the extant genomes. Without drastically modifying either the FF-median problem definition or the ILP, one can think of more complex weighting schemes for adjacencies that could account for known divergence times between genomes, or of relaxed notions of adjacency that would address the lower recall we observe in FF-Median. In that regard, it would probably be interesting to combine this with the use of common intervals instead of runs to define conflict-free sub-instances. Finally, an ideal family-free analysis should take into account the effects of gene family evolution. However, the presented family-free median model can only resolve certain cases of gene duplication. It is generally susceptible to gene losses that occurred along the evolutionary paths between the three extant genomes and their common ancestor. The definition of a family-free median model that tolerates events of gene family evolution at a reasonable computational cost is likely an interesting research avenue.
edge-colorable using at most d or d + 1 colors. Hence, using colors $\chi_1, \chi_2, \chi_3, \chi_4$ with $\chi_1 = \chi_2 \equiv I$, $\chi_3 = \chi_4 \equiv H$ and Misra and Gries' algorithm [15], edges of graph Λ = (E, V) can be partitioned into two groups in O(|E||V|) time, implying an assignment to genomes H and I.
Example 1. Figure 3 (b) shows an FF-Median instance constructed with transformation scheme R from the simple graph depicted in Figure 3 (a). Gene similarities between genes are not shown, but can be derived from the genes' labeling.
We structure our proof that the presented transformation is in fact a valid mapping of a MAX IS-3 instance to an instance of FF-Median into three different lemmas:
Lemma 2. Given a median M of FF-Median instance R(Λ) = (G, H, I, σ), (1) for each median gene (g, h, i) ∈ C(M) where g, h, or i are associated with vertices in V(Λ), it holds that ξ(g) = ξ(h) ∩ ξ(i) = v, v ∈ V(Λ); (2) there exist at most two median genes whose corresponding extant genes are not associated to any vertex in V(Λ).
Proof. Assume for contradiction that claim (1) does not hold. Then either ξ(g) ≠ ξ(h) ∩ ξ(i), or ξ(h) ∩ ξ(i) = ∅, both of which violate the constraint of establishing gene similarities between associated genes that is given in step 5. For claim (2), observe that the only unassociated genes in genome G are genes $g_*$ and $\bar{g}_*$ introduced in step 6, limiting the overall number of unassociated genes in any median M.
⊓ ⊔
Lemma 3. The conserved adjacency set of any median M of FF-Median instance R(Λ) = (G, H, I, σ) is of the form A(M )∩A C = A G (M )∪{{m h * , m t * }, {m h * , m t * }}, where the extant genes corresponding to m * and m * are all unassociated genes of type " * " and A(M
Proof. Observe that both candidate median adjacencies a * = {m h * , m t * } and ā * = {m h * , m t * } are conserved in all three genomes, whereas all other conserved candidate adjacencies between associated and unassociated genes can be conserved at most in H and I. Establishing adjacencies a * , ā * gives rise to a cumulative adjacency score of 6. Conversely, up to 4 non-conflicting adjacencies between associated and unassociated genes can be established that are conserved in both genomes H and I. However, since such adjacencies are only conserved between unassociated genes of type "∅" whose gene similarities are set to 1/4, the best cumulative adjacency score cannot exceed 4. Thus, adjacencies a * , ā * must be contained in any median. Further, because of this and the fact that in both genomes H and I each gene associated with vertices of V(Λ) is only adjacent to an unassociated gene, M cannot contain adjacencies that are conserved in extant genomes other than genome G, which are the adjacencies of each gene pair $(g_v, \bar{g}_v)$ associated with the same vertex v ∈ V(Λ).
⊓ ⊔
Lemma 4. Given FF-median instance R(Λ) = (G, H, I, σ), let $m_u$, $m_v$ be any pair of candidate median adjacencies of $\mathcal{A}$ whose corresponding extant genes are associated to vertices u, v ∈ V(Λ); then $m_u$, $m_v$ are conflicting if and only if (u, v) ∈ E.
Proof. By construction in step 5 of transformation scheme R, each vertex v ∈ V is associated with exactly two candidate median genes $m_v$, $\bar{m}_v$. Further, let u be another vertex of V(Λ), such that (u, v) ∈ E(Λ), and let $m_u$, $\bar{m}_u$ be its two corresponding candidate median genes. Then, by construction in step 2, there exists exactly one extant gene x with ξ(x) = uv (which, by assignment in step 3, is either contained in genome H or I). Consequently, either $m_u$ is in conflict with $m_v$, or $m_u$ with $\bar{m}_v$, or $\bar{m}_u$ with $m_v$, or $\bar{m}_u$ with $\bar{m}_v$. Recall that by construction in step 2 in R and by Lemma 3, $m_u$, $\bar{m}_u$ and $m_v$, $\bar{m}_v$ form conserved candidate adjacencies. Clearly, independent of which of the candidate median gene pairs of u and v are in conflict, both pairs of candidate median adjacencies are in conflict with each other. Now, let u, v be two vertices of V(Λ) such that edge (u, v) ∉ E(Λ); then there exists no gene x in extant genomes H and I with ξ(x) = uv. Even more, due to Lemma 2, there cannot exist a candidate median gene (g, h, i) with {u, v} ⊆ ξ(g) ∪ ξ(h) ∪ ξ(i). Thus, the candidate median genes of u and v are not conflicting and neither are their corresponding candidate median adjacencies.
⊓ ⊔
We proceed to show that the given transformation scheme gives rise to an approximation preserving reduction known as L-reduction. An L-reduction reduces a problem P to a problem Q by means of two polynomial-time computable transformation functions: a function f : P → Q′ ⊆ Q that maps each instance of P onto an instance of Q, herein represented by transformation scheme R, and a function g : Q′ → P to transform any feasible solution of an instance in Q′ to a feasible solution of an instance of P. Here, a feasible solution means any, not necessarily optimal, solution that obeys the problem's constraints. A feasible solution of FF-Median instance (G, H, I, σ) is an ancestral genome X where C(X) ⊆ $\Sigma_\lambda$ and A(X) ⊆ $\mathcal{A}$ such that A(X) is conflict-free. We give the following transformation scheme to map ancestral genomes of an FF-Median instance to solutions of a MAX IS-3 instance:
We define score function $s_\lambda(X) \equiv \frac{1}{2} \mathcal{F}_\lambda(X) - 3$ of an ancestral genome X. For (R, S) to be an L-reduction, the following two properties must hold for any given MAX IS-3 instance (Λ, l): (1) there is some constant α such that for any median M of the transformed FF-Median instance R(Λ) it holds that $s_\lambda(M) \leq \alpha \cdot l$; (2) there is some constant β such that for any ancestral genome X of R(Λ) it holds that $l - |S(X)| \leq \beta \cdot |s_\lambda(M) - s_\lambda(X)|$. We proceed to prove the following lemma:
Proof. For any median M of FF-Median instance R(Λ), the number of conserved median adjacencies with correspondence to the same vertex of Λ is two, giving rise to a cumulative adjacency score of two. From Lemmata 3 and 4 it immediately follows that any ancestral genome of R(Λ) that maximizes the number of conserved adjacencies also maximizes the number of independent vertices in Λ.
Recalling that the two conserved adjacencies between unassociated genes of type " * " (which are part of all medians) give rise to a cumulative adjacency score of 6, we conclude that $|A(M) \cap \mathcal{A}^C| - 2 = \frac{1}{2} \mathcal{F}_\lambda(M) - 3 = s_\lambda(M) = l$, thus α = 1. Because $l = s_\lambda(M)$, it remains to show that $l - |S(X)| \leq \beta |l - s_\lambda(X)|$. In a suboptimal ancestral genome of R(Λ), median genes with no association to vertices of Λ can also contain extant genes of type "∅". These unassociated median genes can form "mixed" conserved adjacencies with genes that are associated with vertices of Λ. Such mixed conserved adjacencies have no correspondence to vertices in Λ and do not contribute to the transformed solution S(X) of an ancestral genome X. Yet, as mentioned earlier, the cumulative adjacency score of all mixed conserved adjacencies cannot exceed 4. Therefore it holds that $|S(X)| \geq s_\lambda(X)$ and we conclude β = 1.

B Speeding up the search for a median
Proof of Lemma 1. Given an IC-free segment $S = \{m_1, \ldots, m_k\}$ of an FF-Median instance (G, H, I, σ), let M be a conflict-free matching in graph $\Gamma'(S)$. Because M is conflict-free and S is contiguous in all three extant genomes, M must contain all candidate median genes of S. Now, let $M'$ be a median such that $S \not\subseteq C(M')$. Further, let $C(m)$ be the set of candidate median genes that are involved in a conflict with a given median gene m of S, and let $X = C(M') \cap (\bigcup_{m \in S} C(m) \cup S)$.
Clearly, $X \neq \emptyset$ and for the contribution $\mathcal{F}_\lambda(X)$ it must hold that $\mathcal{F}_\lambda(X) \geq \mathcal{F}_\lambda(S)$, otherwise $M'$ is not optimal since it is straightforward to construct a median with higher score which includes S. Clearly, the contribution $\mathcal{F}_\lambda(X)$ to the median is bounded by $\max(\{\sum_{m' \in C'} \Delta(m') \mid C' \subseteq C(m) : C' \text{ is conflict-free}\}) + \mathcal{F}_\lambda(S)$. But since S gives rise to a conflict-free matching with maximum score, the median $M''$ with $C(M'') = (C(M') \setminus X) \cup C(S)$ and $A(M'') = (A(M') \setminus A(X)) \cup A(S)$ must also be an (optimal) median.
⊓ ⊔

C Simulated sequence evolution with ALF

Property 1. The size (i.e. number of variables and statements) of any ILP returned by program FF-Median is limited by $O(n^5)$ where n = max(|C(G)|, |C(H)|, |C(I)|).

Fig. 2. Top: (a) Precision and recall of FF-Median and MultiMSOAR in simulations; (b) statistical assessment of CARs and median genes on real datasets. Bottom: agreement, compatibility and disagreement of positional orthologs inferred by (c) FF-Median and (d) MultiMSOAR with OMA database.

Fig. 3. (a) A simple graph bounded by degree three and (b) a corresponding FF-Median instance constructed with transformation scheme R.