Finding local genome rearrangements
 Pijus Simonaitis^{1} and
 Krister M. Swenson^{1, 2}Email authorView ORCID ID profile
https://doi.org/10.1186/s1301501801272
© The Author(s) 2018
Received: 14 November 2017
Accepted: 16 April 2018
Published: 4 May 2018
Abstract
Background
The double cut and join (DCJ) model of genome rearrangement is well studied due to its mathematical simplicity and power to account for the many events that transform gene order. These studies have mostly been devoted to the understanding of minimum length scenarios transforming one genome into another. In this paper we search instead for rearrangement scenarios that minimize the number of rearrangements whose breakpoints are unlikely due to some biological criteria. One such criterion has recently become accessible due to the advent of the HiC experiment, facilitating the study of 3D spacial distance between breakpoint regions.
Results
We establish a link between the minimum number of unlikely rearrangements required by a scenario and the problem of finding a maximum edgedisjoint cycle packing on a certain transformed version of the adjacency graph. This link leads to a 3/2approximation as well as an exact integer linear programming formulation for our problem, which we prove to be NPcomplete. We also present experimental results on fruit flies, showing that HiC data is informative when used as a criterion for rearrangements.
Conclusions
A new variant of the weighted DCJ distance problem is addressed that ignores scenario length in its objective function. A solution to this problem provides a lower bound on the number of unlikely moves necessary when transforming one gene order into another. This lower bound aids in the study of rearrangement scenarios with respect to chromatin structure, and could eventually be used in the design of a fixed parameter algorithm with a more general objective function.
Keywords
Background
The problem of sorting genomes by a minimum number of biologically plausible rearrangements has been central to the theoretical comparative genomics community for roughly a quarter century. Traditionally, the likelihood of a rearrangement scenario has been based solely on the parsimony criterion. Unfortunately, a huge number of possible parsimonious scenarios between a pair of genomes exists [1–3]. This highlights the importance of methods that infer scenarios which conform to some extra biological constraints.
To this end we interest ourselves in data describing the 3D organization of chromatin, which is increasingly available due to the advent of an experiment called HiC [4, 5]. Indeed, the 3D spatial proximity of breakpoint regions have an important role in the formation [6, 7] and fixation [8] of genome rearrangements.
We have started development of methodology suitable for use with this type of constraint. Syntenic blocks of similar stretches of genomes are inferred, resulting in adjacencies that are candidate breakpoints for rearrangements. One can color these adjacencies for use with a cost function, where a DCJ acting on adjacencies having the same color is said to be local, and of zero cost, while a DCJ acting on adjacencies having different colors is nonlocal, and of cost one. In [9] we showed that the problem of finding—out of all parsimonious rearrangement scenarios—a scenario that minimizes the number of costly moves, the minimum local parsimonious scenario (MLPS) problem, is polynomialtime solvable. In this paper we disregard the parsimony criterion and instead focus solely on minimizing the number of costly moves required by a scenario, the minimum local scenario (MLS) problem.
In the "Minimum Local Scenario for circular genomes" section we treat a restricted case of MLS where only circular chromosomes are allowed. In the "Minimum Local Scenario for general genomes" section we show how general genomes can be capped in order to use the previously obtained results, establishing an exact formula for the number of costly moves in a minimum local scenario. This formula is based solely on the number of edges and the size of a maximum edgedisjoint cycle packing of the socalled junction graph, which is obtained by merging the vertices of the adjacency graph. We show that the MLS problem is NPComplete, while admitting a 3/2approximation (the "Complexity of MLS" section). We also implement an exact algorithm for MLS that is exponential in the number of colors but not in the number of genes. Despite the NPhardness of MLS the exact algorithm is efficient enough to be applied to the comparison of Drosophila melanogaster and Drosophila yakuba.
We use the genomes of these fruit flies in the "Experiments" section to demonstrate the utility of MLS. We attribute colors to the adjacencies of D. melanogaster using kmedoid clustering, randomized clustering, and clustering based on the linear ordering of the adjacencies along the chromosomes. We observe a significant difference between the randomized clustering and the other two, while only a small difference between the kmedoid and linear clustering methods. We conclude that normalizing the HiC data before clustering is imperative, and that further study on both normalization and clustering would be beneficial for effective coloring of adjacencies.
Finally, a modification of MLS that attributes a nonzero cost to local moves could be of interest. Finding a minimum cost scenario in this case remains an open problem. In the "Towards a more general cost function" section, however, we provide an upper bound for the length of such a scenario, which is of interest in practice, as supported by our experimental results (the "Experiments" section).
Definitions
Genome and DCJ
A block has an orientation as indicated by an arrow, where the tail of the arrow represents the tail extremity, and the head of the arrow represents the head extremity. We can represent a genome by a set of adjacencies between extremities. Such a set for the genome from Fig. 1 is \(\big \{\{A_{t}\}, \{A_{h}, B_{t}\}, \{B_{h},C_{h}\}, \{C_{t}\}\big \}\). An adjacency is either an unordered pair of the extremities that are adjacent on a chromosome, called internal adjacency, or a single extremity adjacent to one of the two ends of a linear chromosome, called an external adjacency.
Definition 1
 1.
\(\{a,b\},\{c,d\}\rightarrow \{a,c\},\{b,d\}\),
 2.
\(\{a,b\},\{c\}\rightarrow \{a,c\},\{b\}\),
 3.
\(\{a,b\}\rightarrow \{a\},\{b\}\),
 4.
\(\{a\},\{b\}\rightarrow \{a,b\}\).
In Fig. 2 an example of a DCJ corresponding to an inversion of a syntenic block is provided.
Renaming of block extremities
A DCJ cuts breakpoints and joins the resulting ends of the chromosomes while keeping the rest of the adjacencies in tact. The outcome of a DCJ only depends on whether the adjacencies it acts upon are internal or external. From Definition 1 we see that it does not matter if it acts on tail or head extremities, if the two adjacencies belong to different chromosomes, or if those chromosomes are linear or circular. This observation allows us to simplify the notation of genomes by renaming the block extremities. For example extremities of the genome from Fig. 1 can be renamed \((A_{h},B_{t},C_{h},B_{h},C_{t},A_{t})=(1,2,3,4,5,6)\) to obtain a set of adjacencies \(\big \{\{1, 2\}, \{3,4\}, \{5\},\{6\}\big \}\). A DCJ scenario can be performed on these adjacencies. Once it is finished we can undo the renaming to obtain a new genome and a DCJ scenario leading to it. For example a move \(\{1, 2\}, \{3,4\}\rightarrow \{1,4\},\{3,2\}\) results in a new set of adjacencies that, when renaming is undone, results in a genome \(\big \{\{A_{h}, B_{h}\}, \{B_{t},C_{h}\}, \{C_{t}\},\{A_{t}\}\big \}\).
Cost of a DCJ scenario
Definition 2
A coloring of the adjacencies of a genome A over a set of colors \(\Delta\) is a function \(col:A\rightarrow \Delta\) partitioning A into subsets of different colors.
A coloring is used to define the cost of a DCJ move. A move is local and of zero cost if it acts on adjacencies with equal colors, and it is nonlocal and of cost 1 otherwise. The cost of a sequence of DCJ moves, a DCJ scenario, is the sum of the costs of its constituent moves. For an adjacency \(p\in A\) we use notation (p, col(p)) for a colored adjacency.
Definition 3
 1.
\((\{a,b\},x),(\{c,d\},y)\rightarrow (\{a,c\},x),(\{b,d\},y)\) or \((\{a,c\},y),(\{b,d\},x)\),
 2.
\((\{a,b\},x),(\{c\},y)\rightarrow (\{a,c\},x),(\{b\},y)\) or \((\{a,c\},y),(\{b\},x)\),
 3.
\((\{a,b\},x)\rightarrow (\{a\},x),(\{b\},z)\) or \((\{a\},z),(\{b\},x)\) with any color z, and
 4.
\((\{a\},x),(\{b\},y)\rightarrow (\{a,b\},x)\) or \((\{a,b\},y)\).
In our previous work [9] we have treated the minimum local parsimonious scenario problem.
Problem 1
(MLPS) For genomes A and B, and a coloring of the adjacencies of A, find a minimum cost scenario among the DCJ scenarios of minimum length transforming A into B.
We have shown that MLPS is polynomialtime solvable. A real evolutionary scenario, however, might be nonparsimonious. In this paper we study the minimum local scenario problem which asks for potentially nonparsimonious scenarios.
Problem 2
(MLS) For genomes A and B, and a coloring of the adjacencies of A, find a minimum cost DCJ scenario transforming A into B.
Adjacency and junction graphs
The Adjacency graph was introduced in [10] for the study of DCJ rearrangements. We introduce a transformation of the adjacency graph, called a junction graph, that incorporates the information on a coloring.
Definition 4
(Adjacency graph) For two genomes A and B the adjacency graph AG(A, B) is defined as an undirected bipartite multigraph whose vertices are \(A\cup B\), and there are exactly \(p\cap q\) edges joining any \(p \in A\) and \(q\in B\).
Definition 5
(Junction graph) For two genomes A, B and a coloring col of A over \(\Delta\) we define an undirected multigraph \(J(A,B,col)=(\Delta , E)\). For every internal adjacency \(\{a,b\}\in B\) we add an edge (x, y) to E such that x and y are the colors of the adjacencies of A adjacent to \(\{a,b\}\) in AG(A, B).
In what follows we will use letters G, J and AG when speaking about general graphs, junction graphs and adjacency graphs respectively.
Definition 6
(2break) A 2break is a transformation on a graph that replaces edges (x, y) and (z, t) with either (x, z) and (y, t), or (x, t) and (y, z).
Example 1
Definition 7
(Eulerian graph) If all the vertices of a graph G are of even degree, then G is said to be Eulerian.
Minimum Local Scenario for circular genomes
In this case every vertex of an adjacency graph has degree two, all of its connected components are cycles and the junction graph is Eulerian. All connected components of AG(B, B) are cycles of length 2, thus at the end of a DCJ scenario transforming A into B we are left with a junction graph whose edges are all selfloops. We call such a graph terminal. For an Eulerian graph G we denote \(\ell (G)\) as the minimum length of a 2break scenario transforming G into a terminal graph.
Lemma 1
For sets of pairs A and B and a coloring col of A, the cost of a minimum local scenario transforming A into B is equal to \(\ell (J(A,B,col)).\)
Proof
We first transform any DCJ scenario into a 2break scenario on J(A, B, col). From Example 1 it should be clear that for any DCJ move \(A\rightarrow A'\), the transformation \(J(A,B,col)\rightarrow J(A',B,col')\) is a 2break. A DCJ move \(A\rightarrow A'\) of cost zero can be disregarded since \(J(A,B,col) = J(A',B,col')\). This means that a DCJ scenario of cost w transforming A (and its coloring col) into B (and its coloring \(col_{B}\)) provides us with a 2break scenario of length at most w transforming J(A, B, col) into a terminal graph \(J(B,B,col_{B})\). On the other hand for every 2break \(J\rightarrow J'\), a DCJ move \(A\rightarrow A'\) can be found such that \(J(A',B,col')=J'\). For any 2break scenario of length l transforming J(A, B, col) into a terminal graph we obtain a DCJ scenario of length l, thus of cost at most l, transforming A (and its coloring col) into a genome C (and its coloring \(col_{C}\)) such that \(J(C,B,col_{C})\) is terminal. This means that C’s pairs belonging to the same connected component of AG(C, B) are of the same color. A DCJ scenario transforming C into B and only acting on the pairs belonging to the same connected components of an adjacency graph can be easily found and such a scenario is of zero cost. This ensures that the scenario from A to C, and then from C to B, is a DCJ scenario transforming A into B of cost at most l. \(\square\)
Linking 2break scenarios and maximum edgedisjoint cycle packings
Using Lemma 1 we can shift our attention from a DCJ scenario on a set of pairs to a 2break scenario on a junction graph J.
Definition 8
(Maximum edgedisjoint cycle packing) An MECP of a graph G is a largest set of edgedisjoint cycles in G. If G is Eulerian, then an MECP covers all of its edges.
For a graph \(G=(V,E)\) we denote \(e(G)=E\) and c(G) as the size of its MECP.
Lemma 2
A 2break on an Eulerian graph G can increase the size of its MECP by at most one.
Proof
Without loss of generality we can suppose that \(G'\) is obtained from G via a 2break replacing edges (x, y) and (z, t) by edges (x, t) and (y, z). We take an MECP of \(G'\), call it \(C'\), and construct a cycle packing C of G of size at least \(C'1\) to prove the claim. The set of edgedisjoint cycles of \(C'\) that do not include edges (x, t) or (y, z) form a set of edgedisjoint cycles of G, so we include these cycles in C. If both newly added edges belong to the same cycle of \(C'\), then we are done since in this case \(C=C'1\). Otherwise the edges of the two cycles containing edges (x, t) or (y, z) form a cycle in G, as depicted in Fig. 7, providing us with the last cycle for C implying \(C=C'1\). \(\square\)
Lemma 3
For a graph G that is a cycle of length l there exists a 2break scenario of length \(l1\) transforming it into a terminal graph.
Proof
If \(l=1\) then G is terminal. Otherwise there exists a 2break that splits a cycle of length l into cycles of length 1 and \(l1\). We repeat this operation \(l1\) times to obtain a terminal graph. \(\square\)
Theorem 1
For a junction graph J we have \(\ell (J)=e(J)c(J).\)
Proof
In our restricted case J is Eulerian, so its MECP covers all of its edges. We can transform the cycles of a given MECP one by one, thus obtaining a terminal graph at the end of a scenario. The length of such a scenario is \(e(J)c(J)\) using Lemma 3. On the other hand, the size of an MECP of a terminal graph is equal to e(J), and according to Lemma 2 we need at least \(e(J)c(J)\) 2breaks to increase the size of an MECP from c(J) to e(J). \(\square\)
Minimum Local Scenario for general genomes
Genome capping
A genome can be extended into a set of pairs analyzed in the "Minimum Local Scenario for circular genomes" section by capping, which is the process of adding artificial gene extremities.
Definition 9
A pair \(\{i,j\}\) where \(i > 2n+2m\) and \(j > 2n+2m\) will be called a telomeric pair. Internal adjacencies of a genome are present in its extension and external adjacencies are simply complemented by an artificial gene extremity. This means that adjacencies of a genome, and nontelomeric pairs of its extension, can be mapped one to one. A coloring col of A can be trivially extended to a coloring \(\hat{col}\) of \(\hat{A}\in A_{+}\) by keeping the same colors for the nontelomeric pairs, and by choosing any colors for the telomeric pairs.
For a DCJ move \(A\rightarrow A'\) acting on two adjacencies of a genome there is an induced DCJ move \(\hat{A}\rightarrow \hat{A}'\) of the same cost, where \(\hat{A}'\in A'_{+}\) acting on the corresponding pairs of a genome extension. For example \((\{a\},x),(\{b\},y)\rightarrow (\{a,b\},x)\) induces \((\{a,\circ _{i}\},x),(\{b,\circ _{j}\},y)\rightarrow (\{a,b\},x),(\{\circ _{i},\circ _{j}\}, y)\) and \((\{a,b\},x),(\{c\},y)\rightarrow (\{a,c\},x),(\{b\},y)\) induces \((\{a,b\},x),(\{c,\circ _{i}\},y)\rightarrow (\{a,c\},x),(\{b,\circ _{i}\},y)\). A DCJ move of the form \((\{a,b\},x)\rightarrow (\{a\},x),(\{b\},z)\) (i.e. acting on a single adjacency) is different, as in this case we need a telomeric pair of color z to be present in a genome extension. For example \((\{a,b\},x)\rightarrow (\{a\},x),(\{b\},z)\) induces \((\{a,b\},x), (\{\circ _{i},\circ _{j}\},z)\rightarrow (\{a,\circ _i\},x),(\{b,\circ _j\},z)\) on a genome extension including \((\{\circ _{i},\circ _{j}\},z)\).
Lemma 4
For a DCJ scenario transforming genome A into B and a coloring of A, there exists genome extensions \(\hat{A}\in A_{+}\) and \(\hat{B}\in B_{+},\) as well as a scenario of the same cost transforming \(\hat{A}\) into \(\hat{B}.\)
Proof
In the DCJ scenario transforming A into B, let p be the number of DCJ moves acting on a single adjacency. These are of a form \(\{a,b\}\rightarrow \{a\}, \{b\}\). We take a genome extension \(\hat{A}\in A_{+}\) with p telomeric pairs. Every DCJ move \((\{a,b\},x)\rightarrow (\{a\},x),(\{b\},z)\) will induce a move acting on a different telomeric pair of a genome extension and its color will be the color z required by the DCJ move on the nonextended genome. In this way every DCJ move on a genome will induce a move on a genome extension, and after a scenario of cost w we will end up with \(\hat{B}\), an extension of genome B. \(\square\)
Lemma 5
For a DCJ scenario transforming \(\hat{A}\in A_{+}\) into \(\hat{B}\in B_{+}\) and a coloring of A there exists a DCJ scenario of the same cost or smaller transforming A into B.
Proof
We start with a pair \((A,\hat{A})\) and apply a scenario transforming \(\hat{A}\) into \(\hat{B}\) step by step, updating A along the way accordingly. After the first k moves of a scenario whose cost is \(w_{k}\) we get a pair \((A^{k},\hat{A}^{k})\) with \(\hat{A}^{k}\in A^{k}_{+}\) such that there is a scenario of cost at most \(w_{k}\) transforming A into \(A^{k}\). A pair \((A^{k+1},\hat{A}^{k+1})\) is constructed as follows. The \(k+1\)st move of a scenario is \(\hat{A}^{k}\rightarrow \hat{A}^{k+1}\).
If \(\hat{A}^{k+1}\in A^{k}_{+}\), then the next pair is \((A^{k},\hat{A}^{k+1})\). If \(\hat{A}^{k+1}\notin A^{k}_{+}\), then for a DCJ \((\{a_{1},a_{2}\},x),(\{a_{3},a_{4}\},y)\rightarrow (\{a_{1},a_{3}\},x),(\{a_{2},a_{4}\},y)\) transforming \(\hat{A}^{k}\) into \(\hat{A}^{k+1}\), at least one of the four involved pairs must contain two gene extremities. This observation allows us to find a genome C such that \(\hat{A}^{k+1}\in C_{+}\) and there is a DCJ move \(A^{k}\rightarrow C\) of the same cost as \(\hat{A}^{k}\rightarrow \hat{A}^{k+1}\). Therefore the next pair is \((C, \hat{A}^{k+1})\). There are numerous cases for C, but they are all trivial to analyze. For example, if \(a_{1}\) and \(a_{2}\) are both gene extremities and \(\{a_{3},a_{4}\}\) is a telomeric pair then a DCJ \((\{a_{1},a_{2}\},x)\rightarrow (\{a_{1}\},x),(\{a_{2}\},y)\) transforms \(A^{k}\) into such a C.
Now \(\hat{A}^{k+1}\in A^{k+1}_{+}\) and the scenario transforming A into \(A^{k+1}\) is of cost at most \(w_{k+1}\). We continue until we obtain \((B,\hat{B})\) with a scenario transforming A into B of cost at most w. \(\square\)
Define an Eulerian extension of a graph to be an Eulerian graph obtained from the initial graph by adding some edges. By construction \(J(\hat{A},\hat{B},\hat{col})\) is an Eulerian extension of J(A, B, col). We close this section by relating Eulerian extensions of J(A, B, col) to the junction graphs of genome extensions.
Lemma 6
For every Eulerian extension \(J'\) of \(J=J(A,B,col)\) there exists genome extensions \(\hat{A}\in A_{+}\) and \(\hat{B}\in B_{+}\) such that \(J(\hat{A},\hat{B},\hat{col})\) and \(J'\) have exactly the same nonloop edges. We say that such graphs are nonloop equal.
Proof
We will augment AG(A, B) with new adjacencies to obtain intermediate versions of adjacency graph \(AG'\) before finally arriving at an \(AG' = AG(\hat{A},\hat{B})\). Our running example will use the graphs shown in Figs. 8 and 9.
\(AG'\) has exactly the cycles of AG(A, B). The paths of AG(A, B), however, appear in \(AG'\) in a modified form. To each path we add new vertices at its endpoints, copying the endpoints’ colors (see Fig. 10 for an example). The junction graph of \(AG'\) is now nonloop equal to J. In our example AG(A, B) has no cycles and its three paths imply paths for \(AG'\) as shown in Fig. 10. The junction graph of \(AG'\) is given in Fig. 11.
For graphs \(G'=(V,E\cup E')\) and \(G=(V,E)\) we denote \(G'G=(V,E')\). Take an Eulerian subgraph H of \(J'J\) such that \(F=(J'J)H\) is a forest. For H we add a union of cycles to \(AG'\) such that H is the junction graph of these cycles. F can be partitioned into paths joining the vertices of odd degree, and for each of these paths we add to \(AG'\) a path with adjacencies of the corresponding colors. In our example, H is a cycle (z, y, z) and F has a single path (z, x). These add a cycle and a path to \(AG'\), shown in Fig. 12.
Now transform every path of \(AG'\) into a cycle in the following way. Each path of \(AG'\) that has endpoints of the same color is now transformed into a cycle by merging those endpoints into a single vertex of degree two. We are left with paths having endpoints of different colors. Consider one such path with an endpoint of color x. Merge this vertex with an endpoint of color x from another path. Such a path will always exist since every vertex in the Eulerian extension \(J'\) has an even degree, implying that there is an even number of paths in \(AG'\) having this vertex colored x as an endpoint. Continue this procedure until no paths are left. One possible outcome of such an operation is given in Fig. 13. The junction graph of \(AG'\) is now nonloop equal to \(J'\) as it can be seen in Fig. 14.
It is easy to reconstruct \(\hat{B}\in B_{+}\), \(\hat{A}\in A_{+}\) and its coloring \(\hat{col}\) such that \(AG'=AG(\hat{A},\hat{B},\hat{col})\), which guarantees that \(J(\hat{A},\hat{B},\hat{col})\) is nonloop equal to \(J'\). \(\square\)
A closed formula for minimum local scenario
Theorem 2
The cost w of a minimum local scenario transforming genome A into B is \(e(J)c(J)\) where \(J=J(A,B,col).\)
Proof
Constructions in lemmas 1, 4, 5 and 6 used in Theorem 2 are all polynomial. The next section shows that computing c(J) is at the heart of the complexity of the minimum local scenario.
Complexity of MLS
NPcompleteness of MLS
Theorem 3
The decision version of minimum local scenario is NPcomplete.
Proof
A 3/2approximation for MLS
For a cycle packing C of a graph G, denote the number of its lengthone and lengthtwo cycles by \(c_{1}(C)\) and \(c_{2}(C)\) respectively, and the number of longer cycles by \(c_{+}(C)\). Denote the number of G’s edges and lengthone cycles by e(G) and \(c_{1}(G)\) respectively. Finally, denote by \(c_{2}(G)\) the maximum of \(c_{2}(C)\) among all of the cycle packings of G.
Lemma 7
Proof
Theorem 4
Proof
An exact algorithm for MLS
Towards a more general cost function

(0, 1) is the minimum local scenario problem,

(1, 1) is the traditional double cut and join problem,

\((\omega _L, \omega _N)\) with \(\frac{\omega _L}{\omega _N\omega _L}>n\), where n is the number of adjacencies, is the minimum local parsimonious scenario problem,

\((\omega _L, \omega _N)\) with \(0<\omega _L<\omega _N\) is the problem that we consider in this section.
Experiments
Our theoretical work is based on a coloring of adjacencies where rearrangements are considered to be more likely when acting on those of the same color, and where the colors of the adjacencies are preserved across large evolutionary distances. Although there are many factors that may effect the likelihood of a rearrangement, in this section we focus on a coloring that partitions a genome into local 3D regions using HiC data. This idea is supported by the hypothesis that rearrangements are more likely to occur between breakpoints in close spatial proximity [6, 7], and an observation that syntenic blocks distant in the 1D sense in human, but adjacent in mouse, were observed in close 3D proximity in the human more often than expected [8].
We use HiC data as a similarity function for pairs of adjacencies. We propose a simple weight function of a coloring based on this similarity, and an algorithm kmedoidsthat provides colorings maximizing the weight. Unfortunately it is not clear exactly how HiC values are linked to 3D distance in the nucleus, thus there is no definitive way to know how well our clusters capture the 3D structure of a genome.
We compare these colorings to two other clustering algorithms: linear, which respects the 1D structure of the chromosomes, and random, which attributes colors to adjacencies at random. We report results in the "Clustering algorithms", "Colorings inferred from the adjacency graph" and "Divergence from linearity" sections.
All results of this section are possible due to the fact that, despite the NPhardness of the minimum local scenario problem, we find that it can be computed exactly (using our algorithm from the "An exact algorithm for MLS" section) for all of the colorings obtained for D. melanogaster and D. yakuba.
HiC data and normalization
We use both nonnormalized and normalized HiC data as similarity functions for pairs of adjacencies. These similarity functions are used to define the weight of a coloring.
Clustering algorithms
In what follows the terms coloring and clustering will be used interchangeably. We perform our experiments on the genomes of D. melanogaster and D. yakuba which were partitioned into 64 syntenic blocks using Orthocluster tool [14]. The DCJ distance between the two genomes is 51. D. melanogaster is used as genome A with a coloring that we obtain from the HiC data published in [5]. No HiC for D. yakuba is publicly available at the time of writing.
The first of the clustering methods we used is clustering around medoids, which was chosen for its simplicity and speed [15]. A medoid of a cluster is an element that maximizes the sum of the similarities to the rest of a cluster. This sum is the cluster’s weight, and when summed over all the clusters it provides us with a clustering weight. We use this algorithm with HiC data as a similarity function for the pairs of adjacencies.
The kmedoids algorithm starts with k randomly initialized centroids. The rest of the elements are then associated to the centroids that are most similar to them. The medoids of the obtained clusters are then computed and they become the new centroids around which the elements will be clustered. We continue this procedure until the clustering weight stops increasing.
The linear algorithm respects the 1D structure of the 6 chromosome arms of D. melanogaster. For a given \(k\ge 6\) we choose \(k6\) syntenic blocks at random and cut the chromosomes at these blocks. In this way we obtain k segments of the chromosomes, each of which has at least one adjacency. We then attribute a distinct color to each segment and assign to each adjacency the color of its segment.
The random algorithm attributes colors to the adjacencies uniformly at random while ensuring that for every color there is at least one adjacency of that color.
Clustering weight is well defined for any clustering. Nonnormalized and normalized HiC data provides two similarity functions for the pairs of adjacencies and thus two different clustering weights: nonnormalized and normalized. In Fig. 15 we compare MLS to clustering weights for the number of colors \(k=15\). We generate 100 random clusterings (black) and 100 linear clusterings (green) and compute their normalized and nonnormalized clustering weights and MLS. We generate 100 kmedoids clusterings using nonnormalized HiC (blue) and 100 kmedoids clusterings using normalized HiC (red). The meaning of olive, brown and orange outliers in Fig. 15 will be explained in the "Colorings inferred from the adjacency graph" and "Divergence from linearity" sections.
Colorings inferred from the adjacency graph
The kmedoids, linearand random algorithms color the adjacencies of a single genome using HiC data or 1D structure. However if we use the adjacencies of both genomes we can construct colorings with a much lower MLS cost. For example, if we chose a coloring for which the connected components of an adjacency graph are monochromatic, then the junction graph is terminal and MLS is equal to 0. The adjacency graph of D. melanogaster and D. yakuba has 19 connected components. This means that for \(k\le 19\) there exists a coloring with MLS equal to 0. We call such coloring optimal.
Another extreme case is a linear coloring minimizing MLS. When every chromosome is assigned its own color MLS is equal to 8. By increasing the number of colors for a linear clustering we can only increase its MLS, thus a lower MLS bound for linear clustering is 8. This lower bound was not observed when running linear, however using the adjacency graph of D. melanogaster and D. yakuba we have manually constructed colorings of up to 19 colors with MLS equal to 8. We call these colorings minimum linear.
Divergence from linearity
Clusters in optimal and kmedoids clusterings mostly contain adjacencies from only one or two chromosomes. Moreover, those adjacencies are mostly contiguous in the 1D sense on the chromosomes. We define divergence from linearity in order to quantify nonlinearity of these colorings, and to see if it is related to MLS.
Definition 10
(Divergence from linearity) Consider a set S of adjacencies colored with color x. Partition it into nonempty subsets \(S_{1}, S_{2}\ldots , S_{l}\) of adjacencies belonging to the chromosomes \(chr_{1}, chr_{2}\ldots , chr_{l}\). Sort these subsets according to the 1D structure of the chromosomes. Say that \(S_{i}\) has a gap if there is an adjacency p followed by q in \(S_{i}\), but in \(chr_{i}\) there is some adjacency r in between p and q. The divergence from linearity of S is the total number of gaps in S plus \(l1\). Sum these values for all the colors to obtain the divergence from linearity for a coloring.
The divergence from linearity of the linear coloring is 0. We use a greedy algorithm to find optimal colorings minimizing and maximizing the divergence from linearity.
In Fig. 16 we compare MLS to the divergence from linearity for \(k=15\) colors. We observe that kmedoids clusters using nonnormalized HiC are more linear than those using normalized HiC, with their mean values being 5.3 and 16.5 respectively. This was expected due to the nature of normalization, which accentuates the importance of the HiC values of the adjacencies that are distant in 1D sense or coming from different chromosomes.
We see that the linear and kmedoids clusterings for nonnormalized HiC are almost the same due to the low divergence from linearity of the latter. This explains why they have almost the same nonnormalized clustering weight, as can be seen in Fig. 15. optimal and minimum linear provides the bounds for MLS and divergence from linearity.
Conclusion and further work
Aside from problems that consider rearrangement length, little is known about weighted rearrangements [16–20]. In [9], we showed that with a simple cost function based on a partition of the adjacencies of one of the genomes into equivalence classes, one can choose—from the exponentially large set of shortest scenarios—a scenario that minimizes the number of moves acting across classes.
In this paper we showed that the genome rearrangement problem with an objective function based solely on the cost of DCJs is NPHard, even for a simple binary cost function. We gave a 3/2approximation derived from bounds on the sizes of cycles in a cycle packing of the junction graph. We also presented an exact algorithm and found that an exact solution can be computed quickly between D. melanogaster and D. yakuba.
Our algorithms depend on a coloring of the adjacencies between syntenic blocks. We use MLS to study these colorings by clustering normalized and nonnormalized HiC data from D. melanogaster. We find that our clusterings based on normalized data with the unsophisticated kmedoids technique give marginally lower MLS costs than clusterings that strictly preserve the linear ordering of the adjacencies. Both of those nonrandom clusterings give much lower MLS costs than randomized clusterings. Our rough linearity measure shows that, as kmedoid clusterings become more linear, the cost of the MLS decreases. A concerted effort towards the development of normalization and clustering is required to study these relationships in more detail.
Our model supports the coloring of a single genome only, yet this coloring may be capable of representing the 3D spatial structure between two genomes. Indeed, if spatial organization is somewhat conserved across large evolutionary distances, the chromatin conformation from a single genome could be informative for the inference of rearrangements over an entire scenario. Such conservation has been demonstrated in at least two cases. Recent results show that syntenic regions in mouse and human share a high degree of similarity in their higher order chromatin structure [21]. Further, syntenic blocks distant in the 1D sense in human, but adjacent in mouse, were observed in close 3D proximity in the human more often than expected; the authors concluded that there is a certain degree of conservation in spatial structure [8]. Ideally, adjacencies in these conserved regions would get the same color. Despite this largescale conservation, however, further extension of the model to accommodate discrepancies in the HiC data between species is of future interest.
Our work opens the door to the development of more complex models of genome rearrangement with positional constraints, where local moves would be attributed nonzero cost. To this end we established a useful link between the weighted distance and the difference between minimum local parsimonious scenario and minimum local scenario in the "Towards a more general cost function" section. Preliminary experimental results indicate that the problem of finding a minimum cost scenario among those of almost minimum length would be of practical interest.
Declarations
Authors' contributions
Both authors contributed to this paper. Both authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank Sylvain Pulicani for creating syntenic blocks between Drosophila species, and reviewer #3 for her or his many helpful remarks.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
These results can be reproduced using the data and scripts provided at https://bitbucket.org/pijus_simonaitis/localgenomerearrangements.
Ethics approval and consent to participate
Not applicable.
Funding
This work is partially supported by the IBC (Institut de Biologie Computationnelle) (ANR11BINF0002), by the Labex NUMEV flagship project GEM, and by the CNRS project Osez l’Interdisciplinarité.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Ouangraoua A, Bergeron A. Combinatorial structure of genome rearrangements scenarios. J Comput Biol. 2010;17(9):1129–44.View ArticlePubMedGoogle Scholar
 Braga MD, Stoye J. The solution space of sorting by DCJ. J Comput Biol. 2010;17(9):1145–65.View ArticlePubMedGoogle Scholar
 Miklós I, Tannier E. Approximating the number of double cutandjoin scenarios. Theor Comput Sci. 2012;439:30–40.View ArticleGoogle Scholar
 LiebermanAiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J. Comprehensive mapping of longrange interactions reveals folding principles of the human genome. Science. 2009;326(5950):289–93.View ArticlePubMedPubMed CentralGoogle Scholar
 Sexton T, Yaffe E, Kenigsberg E, Bantignies F, Leblanc B, Hoichman M, Parrinello H, Tanay A, Cavalli G. Threedimensional folding and functional organization principles of the Drosophila genome. Cell. 2012;148(3):458–72.View ArticlePubMedGoogle Scholar
 Roix JJ, McQueen PG, Munson PJ, Parada LA, Misteli T. Spatial proximity of translocationprone gene loci in human lymphomas. Nat Genet. 2003;34(3):287.View ArticlePubMedGoogle Scholar
 Mani RS, Chinnaiyan AM. Triggers for genomic rearrangements: insights into genomic, cellular and environmental influences. Nat Rev Genet. 2010;11(12):819.View ArticlePubMedGoogle Scholar
 Véron AS, Lemaitre C, Gautier C, Lacroix V, Sagot MF. Close 3D proximity of evolutionary breakpoints argues for the notion of spatial synteny. BMC Genom. 2011;12(1):303.View ArticleGoogle Scholar
 Swenson KM, Simonaitis P, Blanchette M. Models and algorithms for genome rearrangement with positional constraints. Algorithm Mol Biol. 2016;11(1):13.View ArticleGoogle Scholar
 Bergeron A, Mixtacki J, Stoye J. A unifying view of genome rearrangements. Berlin: Springer; 2006. p. 163–73.Google Scholar
 Holyer I. The NPcompleteness of some edgepartition problems. SIAM J Comput. 1981;10(4):713–7.View ArticleGoogle Scholar
 Caprara A, Panconesi A, Rizzi R. Packing cycles in undirected graphs. J Algorithm. 2003;48(1):239–56.View ArticleGoogle Scholar
 Pulicani S, Simonaitis P, Rivals E, Swenson KM. Rearrangement scenarios guided by chromatin structure. In: RECOMB international workshop on comparative genomics. Berlin: Springer; 2017. p 141–155.Google Scholar
 Zeng X, Nesbitt MJ, Pei J, Wang K, Vergara IA, Chen N. Orthocluster: a new tool for mining synteny blocks and applications in comparative genomics. In: Proceedings of the 11th international conference on extending database technology: advances in database technology. ACM; 2008. p 656–667.Google Scholar
 Park HS, Jun CH. A simple and fast algorithm for kmedoids clustering. Exp Syst Appl. 2009;36(Part 2):3336–41.View ArticleGoogle Scholar
 Blanchette M, Kunisawa T, Sankoff D. Parametric genome rearrangement. Gene. 1996;172(1):11–7.View ArticleGoogle Scholar
 Pinter RY, Skiena S. Genomic sorting with lengthweighted reversals. Genome Inform. 2002;13:103–11.PubMedGoogle Scholar
 Lefebvre JF, ElMabrouk N, Tillier E, Sankoff D. Detection and validation of single gene inversions. Bioinformatics. 2003;19:190–6.View ArticleGoogle Scholar
 Bender MA, Ge D, He S, Hu H, Pinter RY, Skiena S, Swidan F. Improved bounds on sorting by lengthweighted reversals. J Comput Syst Sci. 2008;74(5):744–74.View ArticleGoogle Scholar
 Galvão GR, Baudet C, Dias Z. Sorting circular permutations by super short reversals. IEEE/ACM Trans Comput Biol Bioinf. 2017;14(3):620–33.View ArticleGoogle Scholar
 Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485(7398):376.View ArticlePubMedPubMed CentralGoogle Scholar