A general framework for genome rearrangement with biological constraints

This paper generalizes previous studies on genome rearrangement under biological constraints, using double cut and join (DCJ). We propose a model for weighted DCJ, along with a family of optimization problems called \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varphi$$\end{document}φ-MCPS (Minimum Cost Parsimonious Scenario), that are based on labeled graphs. We show how to compute solutions to general instances of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varphi$$\end{document}φ-MCPS, given an algorithm to compute \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varphi$$\end{document}φ-MCPS on a circular genome with exactly one occurrence of each gene. These general instances can have an arbitrary number of circular and linear chromosomes, and arbitrary gene content. The practicality of the framework is displayed by presenting polynomial-time algorithms that generalize the results of Bulteau, Fertin, and Tannier on the Sorting by wDCJs and indels in intergenes problem, and that generalize previous results on the Minimum Local Parsimonious Scenario problem.


Context
The practical study of genome rearrangement scenarios has been limited by a lack of mathematical models capable of incorporating biological constraints, since foundational models focused on minimum length scenarios transforming one genome into another. In the modern age, where the collection of fully assembled and annotated genomes is ever-increasing, there is the need for the development of more elaborate mathematical models that consider the data from multiple biological experiments.
One way to incorporate biological information into the inference of evolutionary scenarios is to consider models that weight rearrangements according to their likelihood of occurring; a breakpoint may be more likely to occur in some intergenic regions than others. To this end, the study of length-weighted reversals was started in the late nineties by Blanchette et al. [1]. Baudet et al. present a summary of work done in this area, along with work on reversals centered around the origin of replication [2]. Recently, Tannier has published a series of papers focused on weighting intergenic regions by their length in nucleotides. In [3], Biller et al. pointed out that, according to the Nadeau-Taylor model of uniform random breakage [4,5], a breakpoint is more likely to occur in a longer intergenic region. Subsequent papers by Fertin et al. [6], and Bulteau et al. [7] present algorithmic results for models that take into account the length of intergenic regions. Using Hi-C data [8], Veron et al. along with our own study, have pointed out the importance of weighting pairs of breakpoints according to how close they tend to be in physical space [9,10]. In order to use this physical constraint, we partitioned intergenic regions into colocalized areas, and developed algorithms for computing distances that minimize the number of rearrangements that operate on breakpoints between different areas [11,12].
Much of this work is based on the mathematically clean model for genome rearrangement called Double Cut and Join, or DCJ [13,14]. Genomes are partitioned into n orthologous syntenic blocks that we will simply call genes. Each gene is represented by two extremities, and each chromosome is represented by an ordering of these extremities. Those extremities that are adjacent in this ordering are paired, and transformations of these pairs occur by swapping extremities of two pairs. DCJ can naturally be interpreted as a graph edit model with the use of the breakpoint graph, where there is an edge between gene extremities a and b for each adjacent pair. A DCJ operation replaces an edge pair {a, b}, {c, d} of the graph by {a, c}, {b, d} or {a, d}, {b, c} . This edge edit operation on a graph is called a 2-break.
This paper establishes a general framework for weighting rearrangements. The results are based on the problem of transforming one labeled graph into another through a scenario of operations, each weighted by an arbitrary function ϕ . The problem, called ϕ -Minimum Cost Parsimonious Scenario (or ϕ-MCPS), asks for a scenario with a minimum number of 2-breaks, such that the sum of the costs for the operations is minimized.

Applications of our framework
While our framework is general, we use it to render two previous studies more practical. The first study is our work relating the likelihood of rearrangement breakpoints to the physical proximity in the nucleus [11]. This work is based on the hypothesis that two breakpoints could be confused when they are physically close. The model in this study labels the breakpoint graph edges (corresponding to intergenic regions) with fixed "colors", and the cost of a DCJ has a weight of one if the labels are different and a weight of zero if they are the same. Using that cost function, we colored intergenic regions by grouping them according to their physical proximity, as inferred by Hi-C data. Although this technique of grouping proved to make biological sense [10,12], it is far from ideal since much of the information given by the Hi-C data is lost in the labeling, and it is not immediately clear how to best compute the grouping. Our results here bypass the complexity of grouping by allowing each DCJ to be weighted by the values taken directly from the Hi-C contact maps. We give an algorithm for ϕ-MCPS on a breakpoint graph with an arbitrary ϕ and fixed edge labels, that runs in O(n 5 ) time in the worst case but has better parameterized complexity in practice (see Example 1). We give in "Practical matters" section other reasons why the running times for this algorithm should remain practical.
The second study that we improve is that of Bulteau et al. [7]. Their biological constraint is based on the number of nucleotides in the intergenic regions containing breakpoints; they compute parsimonious scenarios that minimize the number of nucleotides inserted and deleted in intergenic regions. Their algorithm is restricted to instances where the breakpoint graph has only cycles (and no paths-sometimes referred to as cotailed genomes). Using their O(n log n) algorithm, our framework gives an O(n 3 ) algorithm on any breakpoint graph (see Example 3). This is an example of how our framework simplifies algorithm design on weighted DCJs. For a weight function adhering to our general criteria of "Cost-constrained 2-breaks" section, future algorithm designers now need only to concentrate on developing an efficient algorithm that works on a single cycle of a breakpoint graph. Thanks to Theorem 3, they will get a polynomial time algorithm that works on a general instance for free. "α-approximation for φ-MCPS" section shows that the same is true for approximation algorithms.
This paper is based on general results we obtain on weighted transformations of edge-labeled multi-graphs. The permitted transformations can change the connectivity of the graph through a 2-break, or change the edge labels, or both. This model not only proves to be powerful enough to subsume the previously mentioned results, but also offers other advantages. It is flexible enough so that DCJ costs can be based on the labels of edges in the breakpoint graph, or on the labels of the vertices, or a combination of both. Also, since single-gene insertions and deletions can be represented as "ghost" adjacencies [15], all of this paper applies to genomes where genes could be missing in one genome or the other. Most results can be applied to genomes with duplicate genes (as depicted in Fig. 1).

Our model and general results
The foundation of this paper is a model for cost-constraining scenarios of degree preserving graph transformations, called 2-breaks, that are also known as edge swaps, switches, rewirings, or flips [16]. A 2-break transforms a graph by replacing two edges {u, v} and {q, s} by {u, q} and {v, s} . These transformations have been studied not only in a restricted setting for genome   [15]. In the genomes A and A ′ , gene 1 is repeated twice, and the operation transforming A into A ′ is an insertion of a gene 2, corresponding to the 2-break G(A, B) → G(A ′ , B) . A DCJ scenario transforming A ′ into the linear genome B includes a deletion of a gene 4  [18,19], but also in the more general settings of generating random networks [16] and network design [20,21].
Our results are about the transformation of an arbitrary multi-graph G into another one H having the same degree sequence. We find it convenient to reason in a setting, where we are given an Eulerian 2-edge-colored multi-graph with black and gray edges, the black edges being from G and the gray from H. We transform the connectivity of the black edges into the connectivity of the gray edges using a sequence of 2-breaks. Therefore, whenever we use the word graph, path (respectively cycle), we are referring to an Eulerian 2-edge-colored multi-graph, a path (respectively cycle) that alternates between black and gray edges. Naturally, a cycle decomposition of a graph is a partition of the edges of an Eulerian 2-edge-colored multi-graph into a set of alternating cycles. A breakpoint graph is a graph with a vertex for each gene extremity-each incident to exactly one gray and one black edge-along with one chromosome endpoint vertex • that could have degree as high as 2n (see Fig. 2). "DCJ scenarios for genomes and breakpoint graphs" section introduces the breakpoint graph in detail, and defines the Double Cut and Join (DCJ) model.
Our model for weighting 2-breaks is primarily based on a graph labeling, a set O of valid operations, and a weight function ϕ : O → R + . Roughly speaking, a labeled input graph can be transformed through a series of operations in O , where an operation can change the connectivity of the black edges of the graph, and/or change the labels of the edges. Any weight function ϕ defines an optimization problem ϕ-MCPS, which asks for a scenario that minimizes the total weight of the operations. This model subsumes many previously studied weighted DCJ models, as described in "Examples of the cost-constrained DCJ problems in the literature" section.
The spine of our results is built from successive theorems that speak to the decomposability into subproblems of a ϕ-MCPS instance. Lemma 3 shows that a parsimonious scenario of 2-breaks transforming the black edges into the gray implies a Maximum Alternating Edgedisjoint Cycle Decomposition (or MAECD) [22]. Theorem 1 says that an optimal solution to ϕ-MCPS can be found using solutions to the MAECD problem, so that if ϕ-MCPS can be solved on a simple alternating cycle, then it can be solved on any instance. Theorem 2 says that an optimal solution to ϕ-MCPS on a simple alternating cycle can be found using a solution to the ϕ-MCPS problem on what we call a circle, that is, an alternating cycle that does not visit the same vertex twice (see Fig. 4).
Under the common genome model, where each gene occurs exactly once in each genome, a relationship exists between parsimonious DCJ scenarios and solutions to MAECD on a breakpoint graph [14,23]. We exploit this link in "ϕ-MCPS for a breakpoint graph" section. Theorem 3 ties everything together; an amortized analysis shows that, given an O(r t ) algorithm for computing ϕ-MCPS on a circle with r edges, ϕ-MCPS can be calculated on a breakpoint graph in O(n t+1 ) time.
Under a more general genome model, that allows for changes in copy numbers of genes (e.g. insertions, deletions, and duplications), the spine of our results still holds due to the convenient representation of missing genes as ghost adjacencies in an Eulerian 2-edge-colored multi-graph [15] (see Fig. 1). All of our results hold for pairs of genomes with non-duplicated genes, but unequal gene content. Indeed, a breakpoint graph (i.e. graph with limited degree for most nodes) can still represent the pair of genomes in this case.
Caprara proved that MAECD is NP-Hard for Eulerian 2-edge-colored multi-graphs where each vertex is incident to at most two gray and two black edges (which is the case when there are two copies of each gene) [22]. We present a simple integer linear program (or ILP) that solves ϕ-MCPS for these types of graphs, given a method to solve ϕ-MCPS on a circle. This ILP is likely to be unwieldy in general, since the number of variables is exponential in the number of simple alternating cycles. In the case of breakpoint graphs on specific genomes, this may not always be intractable, as the number of duplicate genes may be limited. See "Practical matters" section for a discussion of these practical matters.

DCJ scenarios for genomes and breakpoint graphs
A genome consists of chromosomes that are linear or circular orders of genes separated by potential breakpoint regions. In Fig. 2 the tail of an arrow represents the tail  extremity, and the head of an arrow represents the head extremity of a gene. We can represent a genome by a set of adjacencies between the gene extremities. An adjacency is either internal: an unordered pair of the extremities that are adjacent on a chromosome, or external: a single extremity adjacent to one of the two ends of a linear chromosome. In what follows we will suppose that two genomes A and B are partitioned into n genes each occurring exactly once in each genome, and our goal will be to transform A into B using a sequence of DCJs. We represent the pairs of the genomes with a help of a breakpoint graph [13,17].

2-Break scenarios for 2-edge-colored graphs
In this paper a graph is an Eulerian 2-edge-colored undirected multi-graph with edges colored black or gray as in Fig. 1. A graph with equal multi-sets of black and gray edges is called terminal, and our goal is to transform a given graph into a terminal one using 2-breaks.
of length m is a sequence of m 2-breaks transforming a graph into a terminal one.

Definition 4 (Eulerian graph and alternating cycle) G is
Eulerian if every vertex has equal black and gray degrees. A cycle is alternating if it is Eulerian. All use of the word cycle in this paper will be synonymous with alternating cycle.
Define a Maximum Alternating Edge-disjoint Cycle Decomposition (MAECD) of a graph G as a decomposition of G into a maximum number of edgedisjoint alternating cycles. Denote the size of a MAECD of G by c(G) and the number of its black edges by e(G). We make a distinction between simple cycles and circles (see Fig. 4 to see a simple cycle that is not a circle).

Definition 5 (simple cycle and circle)
A graph G is a simple cycle if the size of a MAECD, c(G) = 1 . If in addition to that the black and gray degrees deg b (G, v) and deg g (G, v) are equal to 1 for every vertex v, then G is called a circle.

Parsimonious 2-break scenarios
The problem of finding a minimum length (or parsimonious) 2-break scenario was treated in several unrelated settings using different terminology. Lemma 1 proven in "Proofs" section was treated in [20] where the authors also showed that finding a minimum length 2-break scenario is NP-hard due to the NP-hardness of finding a MAECD of a graph and provided a 7/4-approximation algorithm for finding this length. A variant of the problem for Eulerian digraphs where all the gray edges are loops was solved in [24].

Lemma 1 (Bienstock and Günlük in [20]) The minimum length of a 2-break scenario on a graph G is
Since finding a MAECD for a breakpoint graph is easy, Lemma 1 leads to a linear time algorithm for finding a parsimonious DCJ scenario [13]. The algorithm is based on Lemma 2 proven in "Proofs" section. [13]) The minimum length of a DCJ scenario transforming genome A into B is equal to d 2b (G(A, B)) = e(G(A, B)) − c(G (A, B)).

Decomposition of a 2-break scenario
In this section we will show how a 2-break scenario ρ of length m can be partitioned into subscenarios ρ 1 , . . . , ρ k and G can be decomposed into edge-disjoint Eulerian subgraphs H 1 , . . . , H k where ρ i is a scenario for H i , and k ≥ e(G) − m . We will use this decomposition in "ϕ-MCPS for a graph" section to show that ϕ-MCPS on a graph can be solved by solving ϕ-MCPS on its simple cycles. For a graph G and a 2-break scenario ρ we define a directed 1-edge-colored edge-labeled graph D(G, ρ) , akin to the trajectory graph introduced by Shao et al. [25]. Denote the sequence of the first l 2-breaks of ρ by ρ l and the graph obtained from G after these 2-breaks by G l . Define D(G, ρ 0 ) in the following way: for each black edge e of G we have two new vertices connected by a directed edge labeled by e (see Fig. 3). For the l-th 2-break of ρ , Proceed by adding two new vertices to D(G, ρ l−1 ) and two edges labeled {x 1 , x 3 } and {x 2 , x 4 } from the merged vertex to the newly added ones to obtain D(G, ρ l ) . Continue until D(G, ρ m ) is obtained, where m is the length of ρ , and denote it by D(G, ρ). Shao et al. [25] characterize the connected components of a trajectory graph for a parsimonious scenario. Using similar techniques we prove the following lemma in "Proofs" section.

Cost-constrained 2-breaks
In this section we outline our model for assigning costs to 2-breaks. We associate labels to both vertices and edges of a graph, and then describe a set O of valid operations of 2-breaks on labeled edges and edge-label changes. Our cost function is defined on O . This model generalizes the labeled DCJ problems of [7,11].
We will use letters u, v, q, s to denote vertices, letters a, b, c, d to denote vertex labels and x, y, z, t to denote edge labels. Given an alphabet of vertex labels V and one of edge labels E , fix a subset O containing a set of tuples An O-scenario does not necessarily exist for a given (G, ) , however if it exists, then the inequality d Ob (G, ) ≥ d 2b (G) holds, where d 2b (G) is the minimum length of a 2-break scenario on a graph G. In this paper we deal with the sets O that have the necessary operations to parsimoniously transform (G, ) into (G, ) . We call these sets p-sufficient.
The cost function that we consider is ϕ : O → R + . The cost of an O-scenario is the sum of the costs of its constituent operations. If O is p-sufficient for (G, ) , then MCPS ϕ (G, ) is the minimum cost of an O-scenario of the 2-break-length equal to d 2b (G) , otherwise MCPS ϕ (G, ) is ∞ . We consider the following problem:

Examples of the cost-constrained DCJ problems in the literature
Example 1 (Minimum Local Parsimonious Scenario) In [11] we supposed the adjacencies of genome A to be partitioned into spatial regions represented by different colors. We then developed a polynomial time algorithm for finding a parsimonious DCJ scenario minimizing the number of rearrangements whose breakpoints appear in different regions. The problem as was stated in [11] differs slightly from ϕ-MCPS, since in that study we do not have colors for the adjacencies of genome B.
We can bridge this gap as follows. In [11] we presented an O(n 4 ) time algorithm solving ϕ c -MCPS for a labeled breakpoint graph with the gray edges labeled by τ . In [12] we demonstrated that finding a minimum cost O scenario for such a breakpoint graph, when the parsimonious criteria is disregarded, is NP-hard. We proposed an algorithm that is exponential in the number of colors but not in the number of genes.
In "ϕ f -MCPS for a circle with fixed labels" section we use the same O , fix a symmetric function : 2 → R + , and define ϕ f ({a, a}, x), ({a, a}, y); ({a, a}, x), ({a, a}, y) = �(x, y) . This drastically enhances the model introduced in [11] as now rearrangements whose breakpoints appear in the same region can have non-zero costs. In "ϕ-MCPS for a breakpoint graph" section we provide an O(n 5 ) time algorithm solving the generalized problem of ϕ f -MCPS for a labeled breakpoint graph.
Example 2 (DCJ weighted by Hi-C) In [10] we weighted each DCJ by the value taken directly from the Hi-C contact map. In this model every intergenic region of genome A gets assigned an interval corresponding to its genomic coordinates on a chromosome. The weight of a DCJ acting on two intergenic regions is then equal to the average Hi-C value for their corresponding intervals.
In [10] we presented an algorithm greedily maximizing the weight of a parsimonious scenario and found that the obtained weight is significantly higher than the weight of a random parsimonious scenario.
Edge labels are the genomic intervals corresponding to the intergenic regions of a genome A plus an additional terminal label. There is a single vertex label V = {a} . O stays as in Example 1. � HiC (x, y) on two genomic intervals is their average Hi-C value. The problem that maximizes Hi-C values can be easily transformed into a minimization problem by setting the cost of a 2-break on labels ({a, a}, x), ({a, a}, y); ({a, a}, x), ({a, a}, y) to � max − � HiC (x, y) , where max is the maximum � HiC (x, y) over all x, y.
In [10] the optimality of the proposed greedy algorithm was not discussed, but our work presented in "ϕ f -MCPS for a circle with fixed labels" section of this paper provides us with a polynomial time algorithm for solving this problem exactly.

Example 3 (Sorting by wDCJs and indels in intergenes) Bulteau et al. [7] introduced a problem
where adjacencies of genomes are labeled with their genetic length (number of nucleotides). A wDCJ is a DCJ that preserves the sum of the genetic lengths of the adjacencies and an indel δ increases or decreases the genetic length of an adjacency by δ . The cost of a wDCJ is 0 and the cost of an indel δ is |δ| . A scenario of wDCJs and indels for (G, ) is said to be valid if its wDCJ-length is d 2b (G) . The paper presents an O(n log n) algorithm for finding a minimum cost scenario among the valid ones, for the genomes with circular chromosomes and n genes.
Translating this into our formalism yields the following ϕ-MCPS problem. Edge labels are the natural numbers, there is a single vertex label, and O contains 2-breaks on labels (({a, a}, w 1 ), ({a, a}, w 2 ); ({a, a}, w 3 ), ({a, a}, w 4 ) for w i ∈ E satisfying w 1 + w 2 = w 3 + w 4 . O also contains edge-label changes ({a, a}, w 1 ); ({a, a}, w 2 ) for w i ∈ . O is p-sufficient for any (G, ) since G can be first transformed into a terminal graph using any parsimonious 2-break scenario and then its labels can be adjusted. The cost ϕ l of a 2-break on labels is 0 and the cost ϕ l of a edgelabel change ({a, a}, w 1 ); ({a, a}, w 2 ) is |w 1 − w 2 |.
In [7] the authors presented an O(r log r) time algorithm for solving ϕ l -MCPS on a circle with r vertices. Combining this algorithm with our results from "ϕ-MCPS for a breakpoint graph" section gives an algorithm solving ϕ l -MCPS in O(n 3 ) time for a labeled breakpoint graph. The ILP defined in "ϕ-MCPS for a graph" section solves ϕ l -MCPS for any labeled graph. Example 4 (wDCJ-dist) Fertin et al. [6] treated a problem wDCJ-dist where wDCJs without indels are allowed, and the sums of the genetic lengths of the adjacencies of two genomes are equal.
In this case we keep the same E , V and O as in Example 3 except that the edge-label changes are excluded from O . A labeled graph is said to be balanced if the sums of the labels of black and gray edges are equal. wDCJdist is the problem of finding d Ob for a balanced graph whose connected components are circles. The authors show that wDCJ-dist is strongly NP-complete. However they also prove that d Ob (O, ) = d 2b (O) for a balanced circle O and that O is p-sufficient for a graph whose connected components are balanced circles. Proof For a cycle S of a labeled graph (G, ) , S denotes the labeling of S according to . We suppose that min(∅) = ∞ and prove the following: Suppose that there exists a MAECD C of G consisting of the simple cycles for which O is p-sufficient. For every S ∈ C take an O-scenario ρ S O of cost MCPS ϕ (S, S ) and 2-break-length d 2b (S) . By performing these scenarios one after another we obtain an O-scenario ρ O for (G, ) of 2-break-length S∈C d 2b (S) = d 2b (G) and of cost S∈C MCPS ϕ (S, S ) . This means that MCPS ϕ (G, ) ≤ S∈C MCPS ϕ (S, S ).
On the other hand, suppose that O is p-sufficient for (G, ) and take an O-scenario ρ O for (G, ) of length d 2b (G) . For ρ , a 2-break scenario obtained from ρ O when the labels are neglected, a decomposition C(ρ) corresponding to ρ is a MAECD of G due to Lemma 3. A sub- There exists an algorithm efficiently listing all the simple cycles of an undirected 1-edge-colored graph [26], however we are unaware of a similar result for the 2-edge-colored graphs. Computing c(G) is an APX-hard problem [27] and the size of S may be exponential in the size of G, which might make these ILPs intractable in general. For graphs representing genomes with duplicate genes, the number of simple cycles can grow exponentially as a function of the number of duplicate genes. For breakpoint graphs, however, the number grows quadratically and c(G) can be found in linear time.
Maximize S∈S x S Subject to S:e∈S x S ≤ 1 for each edge e of G and x S ∈ {0, 1} for simple cycle S ∈ S.
Minimize S∈S x S MCPS ϕ (S, S ) Subject to S:e∈S x S ≤ 1 for each edge e of G, S∈S x S = c(G) and x S ∈ {0, 1} for simple cycle S ∈ S.

ϕ-MCPS for a simple cycle
The decomposition theorem of "ϕ-MCPS for a graph" section reduces the computation of ϕ-MCPS on a graph to the computation of ϕ-MCPS on a simple alternating cycle. In this section we further decompose the problem into simpler versions of cycles, called circles, which are alternating cycles that contain a vertex only once. Denote deg 2 (G) for a graph G as the number of vertices with black and gray degree equal to two. It is easy to check that deg b (S, v) = deg g (S, v) ≤ 2 for any vertex v of a simple cycle S. If deg 2 (S) = 0 , then S is a circle. See the first column of Fig. 4 for examples of simple cycles that are not circles.
Take a simple labeled cycle (S, ) and denote S 0 as {(S, )} . Choose a vertex v of degree two in S and replace it by two vertices v 1 , v 2 / ∈ V labeled by the same label as v. If v is incident to a gray loop, then split it into two vertices v 1 and v 2 , as depicted on the top row of Fig. 4, to obtain a set S 1 consisting of a single simple cycle. Otherwise split it into two vertices, as depicted on the bottom row of Fig. 4, to obtain a set S 1 consisting of two simple cycles.
Simple labeled cycles in S 1 share the same set of vertices of degree two. Choose such a vertex and split it simultaneously in all the cycles in S 1 as previously to obtain a set S 2 of at most 4 simple labeled cycles sharing the same set of vertices and the same multi-set of labeled black edges. Continue this procedure until the set circ(S, ) = S deg 2 (S) of the labeled circles is obtained. Proof First we prove that MCPS ϕ (S, ) = min{MCPS ϕ (H, µ)| (H, µ) ∈ S 1 } . Labeled graphs in S 1 are obtained by splitting a vertex v of degree 2 into vertices v 1 and v 2 . For a labeled graph (H , µ) on vertices V ∪ {v 1 , v 2 } \ {v} denote r g (H, µ) as the labeled graph obtained from (H, µ) by reversing the split, that is, by merging the vertices v 1 and v 2 into v.
Choose (Ŝ,ˆ ) ∈ S 1 . By construction r g (Ŝ,ˆ ) = (S, ) . Denote For an edge f of (Ŝ,ˆ ) joining vertices q and s, the edge r e (f ) = {r v (q), r v (s)} is present in (S, ) and has the same label as f. r e defines a bijection between the labeled edges of (S, ) and (Ŝ,ˆ ) and thus between O operations on these graphs. This means that an operation in O transforming (Ŝ,ˆ ) into some (Ŝ ′ ,ˆ ′ ) transforms (S, ) into r g (Ŝ ′ ,ˆ ′ ) , and an operation in O transforming (S, ) into some Thus for an O-scenario of (Ŝ,ˆ ) there exists an O-scenario of the same ϕ cost and the same 2-break-length for (S, ) . On the other hand, an O-scenario for (S, ) provides us with a sequence ρ of O operations of the same ϕ cost and the same 2-break-length transforming (Ŝ,ˆ ) into (S, ) for which r g (S, ) is a terminal graph.

ϕ-MCPS for a breakpoint graph
In this section we suppose that there exists an algorithm for computing MCPS ϕ on a labeled circle (e.g. the algorithm of "ϕ f -MCPS for a circle with fixed labels" section).
Using this algorithm as a subroutine we will construct an algorithm for finding MCPS ϕ for a labeled breakpoint graph. This is a generalization of the work first presented in [11]. Take genomes A and B partitioned into n genes where each gene occurs exactly once in each genome, and a labeling of a breakpoint graph G (A, B). For all the vertices v = • we have deg g (G(A, B), v) = deg b (G(A, B), v) = 1 . Thus, if there is a circle in G(A, B) containing an edge then this circle is the only simple cycle containing this edge. This means that every MAECD of G (A, B) includes all of its circles. These set aside we are left with G(A, B) ′ , which is a union of alternating paths starting and ending at • with end edges of the same color. If this color is black we call the path AA, and BB otherwise.
We proceed by constructing a complete weighted bipartite graph H having the AA and BB paths of G(A, B) ′ as vertices. Every simple cycle of G(A, B) ′ is a union of an AA path and a BB path. To each edge joining these paths in H we assign weight equal to MCPS ϕ for a union of these paths. A MAECD of G(A, B) ′ corresponds to a maximum matching of H and every such matching corresponds to a MAECD of G(A, B) ′ . Denote ′ as the labeling of G(A, B) ′ according to . Using Theorem 1 we obtain that MCPS ϕ (G(A, B) ′ , ′ ) is equal to the minimum weight of a maximum matching of H. There is an equal number p of AA and BB paths. Let P denote the total number of edges in G(A, B) ′ . Using this notation we obtain the following lemma proven in "Proofs" section.  Corollary 2 Using the O(r log r) algorithm from [7] for the Sorting by wDCJs and indels in intergenes problem on a circle (see Example 3), we obtain an O(n 3 ) algorithm for solving the problem on a breakpoint graph.
α-approximation for ϕ-MCPS Theorems 1 and 2 demonstrate how ϕ-MCPS for any labeled graph can be solved if one is able to solve ϕ-MCPS for a labeled circle. This is exploited in Theorem 3 to solve ϕ-MCPS for a breakpoint graph. Analogous results proven in "Proofs" section hold if instead of an exact algorithm one has an α-approximation for ϕ-MCPS for a labeled circle.  Without loss of generality we can suppose that all of the black edges of a circle have different labels; if two edges are labeled with the same label x, then we simply replace one of these labels with a new label x and set � (x, y) = �(x, y) and � (y, z) = �(y, z) for y, z ∈ . Fix a circular embedding of V respecting the order of the black edges on the labeled circle (see Fig. 5). A graph with vertices V is said to be planar on the circle if none of its edges cross in this embedding. We prove Lemma 6 in "Proofs" section linking planar trees and parsimonious scenarios. Farnoud and Milenkovic in [19] provide a dynamic programming algorithm for finding a minimum cost planar tree on a circle. In "Proofs" section their proof for a following lemma is given which, together with Lemma 6, leads to Theorem 4.

Practical matters
Our algorithm for ϕ f -MCPS on a breakpoint graph with fixed labels has a running time of O(n 5 ) in the worst case. Note that the running time is dominated, however, by the maximum bipartite matching step in "ϕ-MCPS for a breakpoint graph" section. The size of the bipartite graph is determined by the number of AA and BB paths which is bounded by the maximum number of chromosomes m for the two species. Thus using Lemma 4 we know that the algorithm scales like O(mn 4 ) on biological data. For the same reason our algorithm for Sorting by wDCJs and indels in intergenes [7] on a breakpoint graph scales like O(m 2 n log n + m 3 ) instead of O(n 3 ) on biological data. Further, n is the number of syntenic blocks-and not literally the genes as we call them. Our analyses of Drosophila genomes yield no AA and BB paths, and less than 100 blocks [10]. Our analysis of Human and Mouse genomes yields between 250 and 800 syntenic blocks, depending on the parameters given to OrthoCluster [28].
For graphs with higher degree nodes, like those graphs that represent genomes with duplicated genes, the number of simple cycles can grow rapidly. Although this relationship is beyond the scope of this work, we expect that fixed parameter algorithms could be developed to handle biological data in the future.

Future direction
Our cost framework is liberal, and in our examples we have explored only a small portion of its capacities. Edges can be labeled by more complex objects such as vectors or trees. The cost can be a function of a combination of the edge and vertex labels. We hope that a closer study of the graph D(G, ρ) from "Decomposition of a 2-break scenario" section will lead to polynomial time algorithms for ϕ-MCPS on circles for a large family of cost functions. Once the set of scenarios for a circle is better understood, one could address the problems of counting and sampling the ϕ-MCPS scenarios. While all of our results apply to genomes with insertions or deletions of single genes, further study is required in order to increase efficiency on genomes with duplicate genes.
Our assumption of "minimum evolution" may not always be true as an actual evolutionary scenario might be non-parsimonious [29]. The Minimum Cost Scenario (MCS) problem of finding a minimum cost scenario among all the possible scenarios has already been studied for a couple of fairly simple cost functions [6,12] and proven to be NP-hard in both of these cases. However, as we have shown in [12], computationally tractable algorithms can still be implemented for certain NP-hard MCS problems. An intermediate problem between MCPS and MCS could be the one of finding a minimum cost scenario among the scenarios of a prescribed length.

Lemma 1
Lemma (Bienstock and Günlük in [20]) The minimum length of a 2-break scenario on a graph G is d 2b (G) = e(G) − c(G).
Proof A 2-break can increase the size of a MAECD by at most 1 and the size of a MAECD of a terminal graph is e(G). This leads to an inequality d 2b (G) ≥ e(G) − c(G).
In this paragraph the length of a cycle will be its number of black edges. For any cycle c of length l > 1 there is a 2-break transforming c into a union of length 1 and length l − 1 cycles. This way we obtain a scenario of length l − 1 for c, and can transform every cycle of a MAECD of G independently, obtaining a 2-break scenario of length e(G) − c(G) . Thus, d 2b (G) ≤ e(G) − c(G) .

Lemma 2
Lemma (Yancopoulos et al. in [13]) The minimum length of a DCJ scenario transforming genome A into B is equal to d 2b (G(A, B)) = e(G (A, B)) − c(G (A, B)).  G(B, B) is terminal, it follows that the minimum length of a scenario transforming A into B is d 2b (G(A, B)) and we conclude using Lemma 1.

Lemma 3
Lemma If D(G, ρ) has k connected components then ρ can be partitioned into k subscenarios ρ i and G can be partitioned into k edge-disjoint Eulerian subgraphs H i in such a way that ρ i is a scenario for H i for every i ∈ {1, . . . , k} . If ρ is parsimonious, then k = c(G) and C(ρ) = {H 1 , . . . , H k } is a MAECD of G.
Proof Take a connected component C of D(G, ρ) . It has an equal number of vertices of indegree 0 and vertices of outdegree 0. Its edges incident to the vertices of indegree 0 are labeled with the black edges of G and its edges incident to the vertices of outdegree 0 are labeled with the gray edges of G. Together these labels define a subgraph H of G that we will prove to be Eulerian.
Define C l to be a subgraph of D(G, ρ l ) consisting of its connected components containing the vertices of indegree 0 of C. This way C m = C . Define H l to be a subgraph of G l containing the gray edges of H and the black edges of G l labeling the edges of C l incident to the vertices of outdegree 0. This way H 0 = H and H m is a terminal graph.
We prove that H is Eulerian by induction. H m is Eulerian as it is terminal. Suppose that H l is Eulerian. By construction the two edges of G l replaced by the l-th 2-break of ρ either both belong to H l−1 or both are outside of H l−1 . In the first case, H l is obtained from H l−1 via a 2-break and as H l is Eulerian this means that H l−1 is also Eulerian. In the second case, H l = H l−1 , thus the latter stays Eulerian. Thus H = H 0 is Eulerian and we obtain a subsequence of ρ that is a scenario for H. D(G, ρ 0 ) has e(G) connected components. The l-th 2-break of ρ merges two vertices of D(G, ρ l−1 ) , thus reduces the number of the connected components by at most 1. This means that the number k of the connected components of D(G, ρ) is greater or equal to e(G) − m.
If ρ is parsimonious, then its length m is e(G) − c(G) using Lemma 1. This means that k ≥ c(G) and G can be partitioned into k edge-disjoint Eulerian subgraphs. Due to the maximality of c(G), we have that k = c(G) and all of the obtained edge-disjoint Eulerian subgraphs of G are simple cycles.