# DCJ-Indel sorting revisited

- Phillip EC Compeau

**8**:6

https://doi.org/10.1186/1748-7188-8-6

© Compeau; licensee BioMed Central Ltd. 2013

**Received:** 21 December 2012

**Accepted:** 7 February 2013

**Published:** 1 March 2013

## Abstract

### Background

The introduction of the double cut and join operation (DCJ) caused a flurry of research into the study of multichromosomal rearrangements. However, little of this work has incorporated indels (i.e., insertions and deletions of chromosomes and chromosomal intervals) into the calculation of genomic distance functions, with the exception of Braga et al., who provided a linear time algorithm for the problem of DCJ-indel sorting. Although their algorithm only takes linear time, its derivation is lengthy and depends on a large number of possible cases.

### Results

We note the simple idea that a deletion of a chromosomal interval can be viewed as a DCJ that creates a new circular chromosome. This framework will allow us to amortize indels as DCJs, which in turn permits the application of the classical breakpoint graph to obtain a simplified indel model that still solves the problem of DCJ-indel sorting in linear time via a more concise formulation that relies on the simpler problem of DCJ sorting. Furthermore, we can extend this result to fully characterize the solution space of DCJ-indel sorting.

### Conclusions

Encoding indels as DCJ operations offers a new insight into why the problem of DCJ-indel sorting is not ultimately any more difficult than that of sorting by DCJs alone. There is still room for research in this area, most notably the problem of sorting when the cost of indels is allowed to vary with respect to the cost of a DCJ and we demand a minimum cost transformation of one genome into another.

## Background

In the simplest terms, DNA may mutate in two fundamentally different ways. On the one hand, single-nucleotide polymorphisms alter the base at a single position of the nucleic acid polymer; on the other hand, huge mutations called chromosomal rearrangements can move around, duplicate, insert, or delete huge blocks of DNA, often from one chromosome to another.

Chromosomal rearrangements were first observed by Dobzhansky and Sturtevant in 1938 ([1]), but extensive efforts to quantify their study did not take off until the early 1990s. In the last two decades, a number of discrete genomic models have been proposed and studied (see [2] for an overview of the combinatorics of genome rearrangements).

Having selected a genomic model and a collection of genome operations to consider, the standard algorithmic problem is the computation of the distance between two genomes *π* and *Γ*, or the minimum number of allowable operations required to transform *π* into *Γ*; the more difficult problem of sorting demands the operations themselves. The first historical example of such a discrete genomic distance is the prefix reversal distance for permutations (which model the order of genes along a single linear chromosome), introduced in [3] and bounded in [4–6]. The computation of prefix reversal distance has been proposed to be *NP*-Hard (see [7]).

More recent research has moved past permutations and toward multichromosomal genomic models that incorporate both linear and circular chromosomes. One of these models, which we will study in this paper, models the chromosomes of a genome with paths and cycles in a graph. For this model, the double cut and join operation (DCJ) was introduced in [8] and incorporates segment reversals with a number of other operations. Interestingly, a linear time greedy algorithm exists for DCJ sorting two genomes having equal gene content (see [9]).

The incorporation of insertions and deletions of chromosomes and chromosomal intervals (collectively called *indels*) into DCJ distance was discussed in [10] and quantified rigorously in [11]. The latter authors provided a linear time algorithm for the associated problem of DCJ-indel sorting, which gives a minimum collection of DCJ and indel operations required to transform one genome into another. Yet their argument is case-ridden, and so in this paper, which builds upon [12], we wish to provide a much simpler presentation of DCJ-indel sorting that still yields a linear-time solution to the problem.

## Main text

### Preliminaries

Begin with a perfect matching on 2*N* labeled vertices $\mathcal{V}$, forming a set $\mathcal{G}$ of *N* edges called genes; the vertices of each gene form its head and tail. We define a genome *π* as the edge-disjoint union of two matchings. The genes of *π*, denoted *g*(*π*), form a matching on $\mathcal{V}$ such that $g\left(\pi \right)\subseteq \mathcal{G}$; the adjacencies of *π*, denoted *a*(*π*), form a matching on *V*(*g*(*π*)). We color the genes of *π* black and the adjacencies of *π* blue (see Figure 1(a)).

A consequence of these definitions is that *π* comprises a disjoint collection of paths and cycles, where each connected component alternates between black genes and blue adjacencies. Each component of *π* is called a *chromosome*; paths (cycles) of *π* define *linear* (*circular*) chromosomes of *π*. The endpoint *v* of a path in *π* is called a telomere of *π*; *v* is not incident to an adjacency, and so for clerical purposes, we say that *v* has the null adjacency {*v*,*∅*}. A genome consisting of only circular (linear) chromosomes is called a circular (linear) *genome*. Note that *π* is circular if and only if the edges of *a*(*π*) form a perfect matching on *V*(*π*).

Henceforth, we only consider genome pairs {*π*,*Γ*} such that $g\left(\pi \right)\cup g\left(\Gamma \right)=\mathcal{G}$. A workhorse data structure encoding the relationship between *π* and *Γ* is the breakpoint graph ([13]), denoted by B(*π*,*Γ*) and defined as the edge-disjoint union ^{a} of *a*(*π*) and *a*(*Γ*), where adjacencies of *Γ* will be colored red (Figure 1(b)). Observe that *B*(*π*,*Γ*) is also a collection of disjoint paths and cycles, which alternate between red and blue edges. The *length* of a connected component of B(*π*,*Γ*) is its total number of edges; we consider an isolated vertex in B(*π*,*Γ*) to be a path of length 0. The breakpoint graph is also the line graph of the adjacency graph, which was first defined in [9] and has also been used in rearrangement studies.

A *double cut and join* operation (DCJ) on *π* (introduced in [8]) *uses* one or two adjacencies of *π* via one of the following four operations to produce a new genome *π*^{′}:

1. {*v*,*w*},{*x*,*y*} → {*v*,*x*},{*w*,*y*}
2. {*v*,*w*},{*x*,*∅*} → {*v*,*x*},{*w*,*∅*}
3. {*v*,*∅*},{*w*,*∅*} → {*v*,*w*}
4. {*v*,*w*} → {*v*,*∅*},{*w*,*∅*}
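The four operations can be read as one rule, {*v*,*w*},{*x*,*y*} → {*v*,*x*},{*w*,*y*}, in which any symbol may be *∅* and a pair {*∅*,*∅*} is simply dropped. The following is a minimal sketch under that reading. The representation (a genome as a set of two-element `frozenset` adjacencies, telomeres left implicit, `None` standing for *∅*) and the function name `dcj` are assumptions made here for illustration, not part of the original presentation.

```python
def dcj(adjacencies, first, second=(None, None)):
    """Apply {v,w},{x,y} -> {v,x},{w,y}; None plays the role of the empty partner.

    `adjacencies` is a set of frozensets of size 2. A pair such as (v, None)
    is a null adjacency and requires v to be a telomere.
    """
    v, w = first
    x, y = second
    matched = {u for edge in adjacencies for u in edge}
    result = set(adjacencies)

    # Cut: remove the real adjacencies that the operation uses.
    for a, b in (first, second):
        if a is not None and b is not None:
            edge = frozenset((a, b))
            if edge not in result:
                raise ValueError(f"{a}-{b} is not an adjacency")
            result.remove(edge)
        else:
            for u in (a, b):
                if u is not None and u in matched:
                    raise ValueError(f"{u} is not a telomere")

    # Join: add the new real adjacencies; pairs containing None become telomeres.
    for a, b in ((v, x), (w, y)):
        if a is not None and b is not None:
            result.add(frozenset((a, b)))
    return result


genome = {frozenset(("a", "b")), frozenset(("c", "d"))}
op1 = dcj(genome, ("a", "b"), ("c", "d"))            # operation 1
op2 = dcj({frozenset(("a", "b"))}, ("a", "b"), ("t", None))  # operation 2
op3 = dcj(set(), ("t1", None), ("t2", None))         # operation 3
op4 = dcj(genome, ("a", "b"))                        # operation 4
```

Treating the four cases uniformly keeps the cut and the join as two short loops, at the cost of a slightly less literal match with the list above.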

When *π* and *Γ* have the same genes (i.e., $g\left(\pi \right)=g\left(\Gamma \right)=\mathcal{G}$), the *DCJ distance* between *π* and *Γ*, written *d*_{DCJ}(*π*,*Γ*), is the minimum number of DCJs required to transform *π* into *Γ*. One can easily verify that *d*_{DCJ} forms a metric on the set of all genomes having gene set $\mathcal{G}$. A closed formula for DCJ distance was derived in [9] and translated into breakpoint graph notation in [14]:

$${d}_{\text{DCJ}}(\pi ,\Gamma )=N-\left[c(\pi ,\Gamma )+\frac{{p}_{\text{even}}(\pi ,\Gamma )}{2}\right]\qquad (1)$$

Here, *c*(*π*,*Γ*) and *p*_{even}(*π*,*Γ*) denote the number of cycles and even-length paths in B(*π*,*Γ*), respectively.
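Both statistics can be read off a single traversal of the breakpoint graph. The sketch below assumes the closed formula has the standard form *N* minus (*c* + *p*_{even}/2) and that both genomes have gene set $\mathcal{G}$. The function names and the encoding of adjacencies as pairs of vertex labels are hypothetical choices made for this example.

```python
def breakpoint_components(vertices, adj_pi, adj_gamma):
    """Return (number of cycles, list of path lengths) of B(pi, gamma)."""
    blue, red = {}, {}
    for v, w in adj_pi:
        blue[v], blue[w] = w, v
    for v, w in adj_gamma:
        red[v], red[w] = w, v

    seen = set()
    path_lengths = []
    # Paths start at vertices missing a blue or a red edge.
    for start in vertices:
        if start in seen or (start in blue and start in red):
            continue
        seen.add(start)
        current, use_blue, length = start, start in blue, 0
        while True:
            nxt = (blue if use_blue else red).get(current)
            if nxt is None:
                break
            seen.add(nxt)
            current, use_blue, length = nxt, not use_blue, length + 1
        path_lengths.append(length)

    # Every vertex not yet visited lies on an alternating cycle.
    cycles = 0
    for start in vertices:
        if start in seen:
            continue
        cycles += 1
        current, use_blue = start, True
        while current not in seen:
            seen.add(current)
            current = (blue if use_blue else red)[current]
            use_blue = not use_blue
    return cycles, path_lengths


def dcj_distance(n_genes, vertices, adj_pi, adj_gamma):
    """N - (c + p_even / 2) for two genomes with the same genes."""
    cycles, path_lengths = breakpoint_components(vertices, adj_pi, adj_gamma)
    even_paths = sum(1 for length in path_lengths if length % 2 == 0)
    return n_genes - cycles - even_paths // 2


# One linear chromosome with genes 1 and 2; gamma reverses gene 2.
vertices = ["1t", "1h", "2t", "2h"]
pi = [("1h", "2t")]
gamma = [("1h", "2h")]
distance = dcj_distance(2, vertices, pi, gamma)
```

In this example the breakpoint graph has no cycles and two even paths (the isolated vertex `1t` and a path of length 2), so the formula gives a distance of 1, matching the single reversal that separates the two genomes.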

For the more general case that *π* and *Γ* do not share the same genes, a deletion of a chromosomal interval of *π* replaces adjacencies {*v*,*w*} and {*x*,*y*} (contained in the order (*v*,*w*,*x*,*y*) along a chromosome of *π*) with the adjacency {*v*,*y*} and removes the path connecting *w* to *x*. We also allow deletions of entire chromosomes; however, we must stipulate (following the lead of the authors in [11]) that every vertex removed from *π* must belong to $\mathcal{V}-\mathcal{V}\left(\Gamma \right)$. ^{b} The insertion of a chromosome or chromosomal interval into *π* to obtain *π*^{′} is defined as the inverse of a corresponding deletion from *π*^{′} that yields *π*. Note that a consequence of this definition is that we may not insert a gene unless it is contained in $\mathcal{G}$. Insertions and deletions are collectively called indels; thus, we define the DCJ-indel distance between *π* and *Γ*, written ${d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )$, as the minimum number of DCJs and indels required to transform *π* into *Γ*.

Because insertions and deletions are inverse operations, it follows that ${d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )={d}_{\text{DCJ}}^{\text{ind}}(\Gamma ,\pi )$. However, although ${d}_{\text{DCJ}}^{\text{ind}}$ is symmetric, unlike *d*_{DCJ} it does not form a metric, as the triangle inequality does not hold; see [15] for a more complete discussion.

### DCJ-Indel sorting

#### Handling circular singletons

We begin our discussion of DCJ-indel sorting by defining a circular singleton of *π* (adapted from [11]) as a circular chromosome *C* such that *V*(*C*) ∩ *V*(*Γ*) = *∅*. Note that *C* is defined with respect to *Γ* as well as *π*. Ideally, we would delete (insert) all circular singletons of *π* and *Γ* immediately to simplify the problem of DCJ-indel sorting; fortunately, the following two results show that we may do so.

##### Proposition 1

*If* π^{′} *is formed by removing a circular singleton C from* π, *then* ${d}_{\text{DCJ}}^{\text{ind}}({\pi}^{\prime},\Gamma )={d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )-1$. *Furthermore, when transforming* π *into* Γ *via a minimum collection of DCJs and indels, no gene belonging to a circular singleton of* π *can ever appear in the same chromosome as a gene of* Γ.

##### Proof

Any collection of *k* DCJs and indels transforming *π*^{′} into *Γ* can be supplemented by the deletion of *C* to yield *k* + 1 DCJs and indels transforming *π* into *Γ*; thus, ${d}_{\text{DCJ}}^{\text{ind}}({\pi}^{\prime},\Gamma )\ge {d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )-1$.

To obtain the reverse bound, let us view a transformation $\mathbb{T}$ of *π* into *Γ* as a sequence (*π*_{0},*π*_{1},…,*π*_{n}) (*n* ≥ 1), where *π*_{0} = *π*, *π*_{n} = *Γ*, and *π*_{i + 1} is obtained from *π*_{i} as the result of a single DCJ or indel. Consider the sequence $({\pi}_{0}^{\prime},{\pi}_{1}^{\prime},\dots ,{\pi}_{n}^{\prime})$, where ${\pi}_{i}^{\prime}$ is constructed from *π*_{i} by removing the subgraph of *π*_{i} induced by the vertices of *C* under the stipulation that whenever we remove a path *P* connecting *v* to *w*, we replace adjacencies {*v*,*x*} and {*w*,*y*} in *π*_{i} with {*x*,*y*} in ${\pi}_{i}^{\prime}$. It is easy to see that ${\pi}_{0}^{\prime}={\pi}^{\prime}$, ${\pi}_{n}^{\prime}=\Gamma $, and for every *i* in range, either ${\pi}_{i+1}^{\prime}$ is the result of a DCJ or indel applied to ${\pi}_{i}^{\prime}$ or ${\pi}_{i+1}^{\prime}={\pi}_{i}^{\prime}$; thus, $({\pi}_{0}^{\prime},{\pi}_{1}^{\prime},\dots ,{\pi}_{n}^{\prime})$ encodes a transformation of *π*^{′} into *Γ* using at most *n* DCJs and indels. Furthermore, one can verify that ${\pi}_{i+1}^{\prime}={\pi}_{i}^{\prime}$ only when an adjacency of *C* is used by a DCJ in $\mathbb{T}$ changing *π*_{i} to *π*_{i + 1} or when *π*_{i + 1} is produced from *π*_{i} by a deletion of vertices that all belong to *C*. At least one such operation must always occur in $\mathbb{T}$; hence, ${d}_{\text{DCJ}}^{\text{ind}}({\pi}^{\prime},\Gamma )\le {d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )-1$.

The proposition’s second conclusion follows from the fact that if for some *j* (1 ≤ *j* ≤ *n* - 1), a chromosome of *π*_{j} contains a gene *g*_{1} of *π* and a gene *g*_{2} of *C*, then one DCJ was required to combine *g*_{1} and *g*_{2} into the same chromosome, and another will be needed to separate them, yielding two distinct values of *i* for which ${\pi}_{i+1}^{\prime}={\pi}_{i}^{\prime}$. From the first part of the proof, we may conclude that ${d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )<n$. □

Letting sing (*π*,*Γ*) denote the total number of circular singletons of *π* and *Γ*, we have an immediate corollary.

##### Corollary 2

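*The DCJ-indel distance is given by the following:*

$${d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )={d}_{\text{DCJ}}^{\text{ind}}({\pi}^{0},{\Gamma}^{0})+\text{sing}(\pi ,\Gamma )\qquad (2)$$

*where* π^{0} (Γ^{0}) *is formed by removing all circular singletons from* π (Γ).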

With respect to DCJ-indel sorting, Corollary 2 allows us to assume without loss of generality that *π* and *Γ* do not contain any circular singletons.

The deletion of a chromosomal interval of *π* connecting *w* to *x* may be viewed as a DCJ: {*v*,*w*},{*x*,*y*} → {*v*,*y*},{*w*,*x*}; this operation produces a circular chromosome containing *w* and *x* that is scheduled for removal, including the case that *v* or *y* equals *∅* (the deletion of an entire linear chromosome is handled by *v* = *y* = *∅*); see Figure 3. Because insertions are the inverses of deletions, we would like to conclude that indels may be placed in a one-to-one correspondence with the removal of circular chromosomes. Ironically, the apparent exception to this proposed rule is the deletion of an entire circular chromosome.

Yet if a deleted circular chromosome *C* is not produced as the result of a DCJ, then *C* must be a circular singleton of *π* in order to be deleted. Otherwise, *C* has been produced as the result of a DCJ applied to a chromosomal interval; by the method we just described, we can amortize the deletion in this DCJ unless the DCJ also creates another circular chromosome *C*^{′} that is scheduled for deletion. However, this sequence of operations cannot arise in a minimum collection of DCJs and indels transforming *π* into *Γ*, as we could simply delete the original chromosome from which *C* and *C*^{′} were produced by the DCJ in question, thus requiring a single operation instead of three.
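The correspondence between an interval deletion and a DCJ that splits off a circular chromosome can be checked on a small example. The sketch below assumes that all four of *v*, *w*, *x*, *y* are real vertices appearing in that order along a chromosome; the function name `excise_interval` and the `gene_partner` map (each vertex mapped to the other end of its gene) are hypothetical.

```python
def excise_interval(adjacencies, gene_partner, v, w, x, y):
    """Apply the DCJ {v,w},{x,y} -> {v,y},{w,x} and split off the new circle.

    Returns (adjacencies left after removing the circle, genes on the circle).
    Assumes the interval between w and x starts with the gene edge at w.
    """
    after_dcj = set(adjacencies)
    after_dcj -= {frozenset((v, w)), frozenset((x, y))}
    after_dcj |= {frozenset((v, y)), frozenset((w, x))}

    partner = {}
    for edge in after_dcj:
        a, b = tuple(edge)
        partner[a], partner[b] = b, a

    # Walk the circular chromosome through w: gene edge, adjacency, gene edge, ...
    circle_genes, circle_adjacencies = [], set()
    current = w
    while True:
        other = gene_partner[current]
        circle_genes.append(frozenset((current, other)))
        nxt = partner[other]
        circle_adjacencies.add(frozenset((other, nxt)))
        current = nxt
        if current == w:
            break
    return after_dcj - circle_adjacencies, circle_genes


# Linear chromosome a, b, c, d; remove the interval holding genes b and c.
gene_partner = {}
for g in "abcd":
    gene_partner[g + "t"], gene_partner[g + "h"] = g + "h", g + "t"
adjacencies = {frozenset(("ah", "bt")), frozenset(("bh", "ct")), frozenset(("ch", "dt"))}
remaining, removed = excise_interval(adjacencies, gene_partner, "ah", "bt", "ch", "dt")
```

Here `remaining` holds the single adjacency joining `ah` to `dt`, which is exactly what deleting the interval would leave, while `removed` lists genes b and c on the excised circle.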

#### Toward a new model of Indels

We will follow the observation made in [16] that the actual removal of deleted chromosomes can occur as a final step in the transformation of *π* into *Γ*. As a result, we may view the transformation of *π* into *Γ* as composed of three steps: inserting chromosomes into *π* to yield a new genome *π*^{′} with $g\left({\pi}^{\prime}\right)=\mathcal{G}$; applying a sequence of DCJs to produce a genome *Γ*^{′} having the same genes as *π*^{′}; and finally, deleting chromosomes from *Γ*^{′} to produce *Γ*. Note that we can equivalently view the first step as the deletion of chromosomes from *π*^{′} to obtain *π*. Combining this observation with our correspondence between indels and circular chromosomes above, we may introduce the following framework.

We define a completion of a genome *π* as a genome *π*^{′} having $g\left({\pi}^{\prime}\right)=\mathcal{G}$ and for which *a*(*π*^{′}) is composed of *a*(*π*) together with a perfect matching on *V*(*π*^{′}) - *V*(*π*). We call the adjacencies of *a*(*π*^{′}) - *a*(*π*) new. Note that the chromosomes of *π* embed as chromosomes of *π*^{′} and that the components of *π*^{′} - *π* form cycles because the new adjacencies of *π*^{′} induce a perfect matching on *V*(*π*^{′}) - *V*(*π*); we may now without ambiguity call these circular chromosomes of *π*^{′} the indels of *π*^{′}. A completion of a pair of genomes (*π*,*Γ*) is simply a pair (*π*^{′},*Γ*^{′}) for which *π*^{′} and *Γ*^{′} are completions of *π* and *Γ*, respectively. The above discussion implies that for any minimum cost transformation of *π* into *Γ*, the indels of *π*^{′} correspond bijectively to DCJ operations, so that we will amortize each unit indel cost by that of a DCJ operation. This amortization yields the following equation for DCJ-indel distance:

$${d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )=\underset{({\pi}^{\prime},{\Gamma}^{\prime})}{\min}\,{d}_{\text{DCJ}}({\pi}^{\prime},{\Gamma}^{\prime})\qquad (3)$$

where the minimum is taken over all completions of (*π*,*Γ*). A completion (*π*^{∗},*Γ*^{∗}) is optimal if it attains the minimum in (3). Applying the closed form equation for the DCJ distance in (1) to (3) immediately produces the following result.

##### Theorem 3

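*The DCJ-indel distance is given by the following equation:*

$${d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )=N-\underset{({\pi}^{\prime},{\Gamma}^{\prime})}{\max}\left[c({\pi}^{\prime},{\Gamma}^{\prime})+\frac{{p}_{\text{even}}({\pi}^{\prime},{\Gamma}^{\prime})}{2}\right]\qquad (4)$$

*where the maximum is taken over all completions of* (π,Γ).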

#### Constructing an optimal completion

In light of Theorem 3, we have reduced DCJ-indel sorting to the problem of constructing indels intelligently to maximize a weighted sum of breakpoint graph components. Once we have produced an optimal completion (*π*^{∗},*Γ*^{∗}), we can simply invoke the *O*(*N*)-time sorting algorithm described in [9] to transform *π*^{∗} into *Γ*^{∗} via a minimum collection of DCJs.

Our goal is to construct (*π*^{∗},*Γ*^{∗}) by direct analysis of B(*π*,*Γ*). Because *π* and *Γ* do not necessarily share the same genes, B(*π*,*Γ*) may contain path endpoints that are not telomeres. Accordingly, we define a vertex *v* to be *π*-open (*Γ*-open) if *v* ∉ *V*(*π*) (*v* ∉ *V*(*Γ*)). In other words, a *π*-open vertex *v* must be matched to some other *π*-open vertex when constructing the indels of *π*^{∗}. ^{c} The paths of B(*π*,*Γ*) are therefore classified according to their endpoints: a *π*-path (*Γ*-path) ends in one *π*-open (*Γ*-open) vertex and one telomere (of either *π* or *Γ*); a {*π*,*Γ*}-path ends in a *π*-open vertex and a *Γ*-open vertex (such a path must have even length at least 2); a {*Π*,*Π*}-path ({*γ*,*γ*}-path) ends in two *π*-open (*Γ*-open) vertices and must therefore have odd length. We also need statistics for counting these different components. Define *p*^{Π,γ} as the number of {*π*,*Γ*}-paths in B(*π*,*Γ*); ${p}_{\text{even}}^{\Pi}$ as the number of even-length *π*-paths in B(*π*,*Γ*); and ${p}_{\text{even}}^{0}$ as the number of even-length paths in B(*π*,*Γ*) containing no open vertices (i.e., ending in two telomeres). Similar statistics counting odd-length paths can be defined analogously. We have dropped the genomes {*π*,*Γ*} from these statistics for the sake of simplicity; all component statistics will be taken with respect to B(*π*,*Γ*) unless otherwise noted.
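The classification depends on which edge color is missing at each path endpoint: an endpoint with no blue edge is either *π*-open or a telomere of *π*, and an endpoint with no red edge is either *Γ*-open or a telomere of *Γ*. An isolated vertex lacks both colors and therefore supplies both ends of a path of length 0. The sketch below follows that reading; the function name, the string labels, and the argument layout are assumptions made for illustration.

```python
from collections import Counter


def classify_path(u, v, length, vertices_pi, vertices_gamma, blue_matched, red_matched):
    """Return (kind, parity) for a path of B(pi, gamma) with endpoints u and v.

    kind is "0" (two telomeres), "pi", "gamma", "gamma,pi", "pi,pi" or
    "gamma,gamma", naming the open ends of the path.
    """
    ends = []
    for end in {u, v}:  # a set, so an isolated vertex is inspected once
        if end not in blue_matched:
            ends.append("tel" if end in vertices_pi else "pi")
        if end not in red_matched:
            ends.append("tel" if end in vertices_gamma else "gamma")
    opens = sorted(e for e in ends if e != "tel")
    kind = ",".join(opens) if opens else "0"
    parity = "even" if length % 2 == 0 else "odd"
    return kind, parity


# pi is the linear chromosome (a); gamma is the linear chromosome (a, b).
vertices_pi = {"at", "ah"}
vertices_gamma = {"at", "ah", "bt", "bh"}
blue_matched = set()            # pi has no adjacencies
red_matched = {"ah", "bt"}      # gamma has the adjacency {ah, bt}
paths = [("at", "at", 0), ("ah", "bt", 1), ("bh", "bh", 0)]
stats = Counter(
    classify_path(u, v, n, vertices_pi, vertices_gamma, blue_matched, red_matched)
    for u, v, n in paths
)
```

In this example the tally contains one even path with no open vertices, one odd *π*-path, and one even *π*-path (the isolated vertex `bh`, which is *π*-open and also a telomere of *Γ*).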

We first present a proposition regarding the parity of the paths of B (π,Γ).

##### Proposition 4

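*The component statistics of* B(π,Γ) *satisfy the following condition:*

$${p}_{\text{odd}}^{\Pi}-{p}_{\text{even}}^{\Pi}\equiv {p}^{\Pi ,\gamma}\equiv {p}_{\text{odd}}^{\gamma}-{p}_{\text{even}}^{\gamma}\pmod{2}\qquad (5)$$

##### Proof

The total number of *π*-open vertices is equal to |*V*(*π*^{′}) - *V*(*π*)| and must therefore be even. Of course, the same is the case for *Γ*-open vertices, and counting *π*-open and *Γ*-open vertices over the connected components of B(*π*,*Γ*) thus produces the following equivalences:

$${p}^{\Pi ,\gamma}+{p}_{\text{odd}}^{\Pi}+{p}_{\text{even}}^{\Pi}\equiv 0\pmod{2}\qquad (6)$$

$${p}^{\Pi ,\gamma}+{p}_{\text{odd}}^{\gamma}+{p}_{\text{even}}^{\gamma}\equiv 0\pmod{2}\qquad (7)$$

Adding *p*^{Π,γ} to both sides of (6) and (7) gives the following:

$${p}_{\text{odd}}^{\Pi}+{p}_{\text{even}}^{\Pi}\equiv {p}^{\Pi ,\gamma}\equiv {p}_{\text{odd}}^{\gamma}+{p}_{\text{even}}^{\gamma}\pmod{2}\qquad (8)$$

The equivalence of (5) and (8) is an arithmetical fact. □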

We next establish two necessary conditions on optimal completions by culling the set of possible adjacencies of any such completion. Our general strategy is to consider the addition of a new adjacency {*v*,*w*} to a completion *π*^{′} as linking the component(s) of B(*π*,*Γ*) whose endpoints are the (*π* - open) vertices *v* and *w*. Our first result states that we must always link the endpoints of any {*Π*,*Π*} - path to each other.

##### Lemma 5

*If* (π^{∗},Γ^{∗}) *is an optimal completion of* (π,Γ), *then every* {Π,Π}*-path* ({γ,γ}*-path*) *of length* 2*k* - 1 *in* B(π,Γ) (*k* ≥ 1) *embeds into a cycle of length* 2*k* *in* B(π^{∗},Γ^{∗}).

##### Proof

Let *P* be a path of length 2 *k* - 1 connecting *π*-open vertices *v* and *w* in B (*π*,*Γ*). Our claim is that we must link *v* and *w* in B (*π*^{∗},*Γ*^{∗}). Suppose for the sake of contradiction that we have a completion (*π*^{′},*Γ*^{′}) such that *P* does not embed into a cycle of length 2 *k* in B (*π*^{′},*Γ*^{′}); in this case, we must have adjacencies {*v*,*x*} and {*w*,*y*} in *a*(*π*^{′}), where all four vertices are distinct.

Consider the completion *π*^{″} that is identical to *π*^{′} except that {*v*,*x*} and {*w*,*y*} are replaced by {*v*,*w*} and {*x*,*y*}. In B(*π*^{″},*Γ*^{′}), we have closed *P* into a cycle of length 2 *k*, and at the same time, we have changed neither the parity nor the linearity/circularity of the component containing *x* and *y*. Because we have increased the number of breakpoint graph cycles by 1 without changing the total number of paths, it follows from (1) that *d*_{DCJ}(*π*^{″},*Γ*^{′}) = *d*_{DCJ}(*π*^{′},*Γ*^{′}) - 1, and so (*π*^{′},*Γ*^{′}) cannot be optimal. □

Having dealt with {*Π*,*Π*} - and {*γ*,*γ*} - paths of B(*π*,*Γ*), any remaining component of B(*π*^{∗},*Γ*^{∗}) must be either a *j* - bracelet, which is a cycle linking *j* {*π*,*Γ*} - paths (where *j* ≥ 2 and *j* is even), or a *k* - chain, in which two *π* - paths or two *Γ* - paths are linked via an intermediate number of {*π*,*Γ*} - paths to form a path containing *k* components from B(*π*,*Γ*) (*k* ≥ 2). Note that when *k* is even, a *k*- chain *C* must contain either two *π* - paths or two *Γ*- paths, and when *k* is odd, *C* must contain one *π* - path and one *Γ* - path.

For the sake of simplicity, we will represent a *j*-bracelet by (*P*_{1} : *P*_{2} : ⋯ : *P*_{j}) and a *k*-chain by [*P*_{1} : *P*_{2} : ⋯ : *P*_{k}], where every *P*_{i} is linked to *P*_{i + 1}, and in the case of a *j*-bracelet, *P*_{1} is linked to *P*_{j}. Because we wish to maximize a weighted sum of breakpoint graph components, we might guess that we should look for many short bracelets and chains. Indeed, the length of a bracelet or chain in B(*π*^{∗},*Γ*^{∗}) is heavily restricted by the following lemma.

##### Lemma 6

*If* (π^{∗},Γ^{∗}) *is an optimal completion, then a component C*^{∗} *of* B(π^{∗},Γ^{∗}) *can only contain two or more* {π,Γ}*-paths if C*^{∗} *is a* 2*-bracelet.*

##### Proof

Again, say for the sake of contradiction that we have an optimal completion (*π*^{′},*Γ*^{′}) for which a component *C*^{′} of B(*π*^{′},*Γ*^{′}) contains two or more {*π*,*Γ*}-paths. If *C*^{′} is not a 2 - bracelet, then it must contain {*π*,*Γ*} - paths *P*_{1} and *P*_{2} that are linked by precisely one new adjacency. Say that *P*_{1} joins *π* - open vertex *v* to *Γ* - open vertex *w* and that *P*_{2} joins *π* - open vertex *x* to *Γ* - open vertex *y*. To meet the assumption that *P*_{1} and *P*_{2} are linked by precisely one new adjacency, suppose that {*v*,*x*} ∈ *a*(*π*^{′}) but {*w*,*y*} ∉ *a*(*Γ*^{′}), where instead {*w*,*w*^{′}} and {*y*,*y*^{′}} are in *a*(*Γ*^{′}). Replacing these two adjacencies with {*w*,*y*} and {*w*^{′},*y*^{′}} defines a different completion *Γ*^{″} for which B(*π*^{′},*Γ*^{″}) contains (*P*_{1} : *P*_{2}). Viewed as an operation on B(*π*^{′},*Γ*^{′}) to yield B(*π*^{′},*Γ*^{″}), we have two cases.

First, if *C*^{′} was a bracelet, then we have formed two new bracelets from *C*^{′}, one of which is (*P*_{1} : *P*_{2}). Otherwise, *C*^{′} was a chain, in which case we have formed a chain (of the same parity) in addition to (*P*_{1} : *P*_{2}). In either case, *d*_{DCJ} (*π*^{′},*Γ*^{″})<*d*_{DCJ} (*π*^{′},*Γ*^{′}), and so (*π*^{′},*Γ*^{′}) cannot be optimal. □

Following Lemma 6, we may only have 2 - bracelets, 2 - chains, and 3 - chains in B (*π*^{∗},*Γ*^{∗}). After a simple result about 2-chain components, we will be ready to state our main result on DCJ-indel sorting.
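The parity arguments used from here on rest on how the length of a linked component is computed. As a brief worked calculation (writing |*P*| for the length of a path *P*, a notation introduced here only for illustration), each link contributes exactly one new adjacency, so the three surviving component types have the following lengths:

```latex
\begin{align*}
\bigl|[P_1 : P_2]\bigr|       &= |P_1| + |P_2| + 1,\\
\bigl|[P_1 : P_2 : P_3]\bigr| &= |P_1| + |P_2| + |P_3| + 2,\\
\bigl|(P_1 : P_2)\bigr|       &= |P_1| + |P_2| + 2.
\end{align*}
```

Since every {*π*,*Γ*}-path has even length, a 2-chain of two *π*-paths is even exactly when its two paths have opposite parity, and a 3-chain is even exactly when its outer *π*-path and *Γ*-path have the same parity.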

##### Proposition 7

*The breakpoint graph of an optimal completion cannot have one 2-chain joining two odd π - paths and another 2-chain joining two even π-paths. The same holds for Γ - paths.*

##### Proof

Once again, proceed by contradiction and assume that (*π*^{′},*Γ*^{′}) is an optimal completion with such 2-chains [*P*_{1} : *P*_{2}] and [*P*_{3} : *P*_{4}]. Replacing these 2-chains with [*P*_{1} : *P*_{3}] and [*P*_{2} : *P*_{4}] replaces two odd paths in B(*π*^{′},*Γ*^{′}) with two even paths; hence, (*π*^{′},*Γ*^{′}) cannot be optimal. □

##### Theorem 8

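*Algorithm* 1, *given below, defines an O(N) time algorithm for DCJ-indel sorting. For pairs* {π,Γ} *having* sing(π,Γ) = 0, *the DCJ-indel distance is given by the following equation:*

$${d}_{\text{DCJ}}^{\text{ind}}(\pi ,\Gamma )=N-\left[c+{p}^{\Pi ,\Pi}+{p}^{\gamma ,\gamma}+\left\lfloor \frac{{p}^{\Pi ,\gamma}}{2}\right\rfloor +\frac{1}{2}\left({p}_{\text{even}}^{0}+\min \left\{{p}_{\text{odd}}^{\Pi},{p}_{\text{even}}^{\Pi}\right\}+\min \left\{{p}_{\text{odd}}^{\gamma},{p}_{\text{even}}^{\gamma}\right\}+\delta \right)\right]\qquad (9)$$

Here, ${p}^{\Pi ,\Pi}$ (${p}^{\gamma ,\gamma}$) denotes the number of {*Π*,*Π*}-paths ({*γ*,*γ*}-paths) in B(*π*,*Γ*), and *δ* = 1 when *p*^{Π,γ} is odd and either both ${p}_{\text{odd}}^{\Pi}>{p}_{\text{even}}^{\Pi}$ and ${p}_{\text{odd}}^{\gamma}>{p}_{\text{even}}^{\gamma}$, or both ${p}_{\text{odd}}^{\Pi}<{p}_{\text{even}}^{\Pi}$ and ${p}_{\text{odd}}^{\gamma}<{p}_{\text{even}}^{\gamma}$; otherwise, *δ* = 0.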

##### Proof

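We will show that Algorithm 1 constructs an optimal completion (*π*^{∗},*Γ*^{∗}) having

$$c({\pi}^{\ast},{\Gamma}^{\ast})=c+{p}^{\Pi ,\Pi}+{p}^{\gamma ,\gamma}+\left\lfloor \frac{{p}^{\Pi ,\gamma}}{2}\right\rfloor$$

$${p}_{\text{even}}({\pi}^{\ast},{\Gamma}^{\ast})={p}_{\text{even}}^{0}+\min \left\{{p}_{\text{odd}}^{\Pi},{p}_{\text{even}}^{\Pi}\right\}+\min \left\{{p}_{\text{odd}}^{\gamma},{p}_{\text{even}}^{\gamma}\right\}+\delta$$

First, we count the cycles of B(*π*^{∗},*Γ*^{∗}). By Lemma 5, every {*Π*,*Π*}-path or {*γ*,*γ*}-path of B(*π*,*Γ*) must be closed into a cycle by adding a single new adjacency (Step 1 of Algorithm 1). We now claim that there exists an optimal completion containing $\left\lfloor \frac{{p}^{\Pi ,\gamma}}{2}\right\rfloor$ 2-bracelets. Note that we may always replace 3-chains [*P*_{1} : *P*_{2} : *P*_{3}] and [*P*_{4} : *P*_{5} : *P*_{6}] (where *P*_{1} and *P*_{4} are *π*-paths) with [*P*_{1} : *P*_{4}], (*P*_{2} : *P*_{5}), and [*P*_{3} : *P*_{6}], without increasing the DCJ distance of the associated completion because we have obtained a cycle from two paths. This argument implies Step 2 of Algorithm 1 and produces the value of *c*(*π*^{∗},*Γ*^{∗}) stated above.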

As for the even paths of B(*π*^{∗},*Γ*^{∗}), let us operate under the assumption that *p*^{Π,γ} is odd. Then after forming a maximal collection of 2-bracelets, we will be left with one additional {*π*,*Γ*}-path *P*. We claim that (*π*^{∗},*Γ*^{∗}) will be optimal if we link as many *π*-paths (*Γ*-paths) of opposite parity as possible. On the one hand, Proposition 7 states that we cannot have 2-chains [*P*_{1} : *P*_{2}] and [*P*_{3} : *P*_{4}], where *P*_{1} and *P*_{2} are even *π*-paths and *P*_{3} and *P*_{4} are odd *π*-paths. On the other hand, say that we have a 2-chain [*P*_{1} : *P*_{2}] and a 3-chain [*P*_{3} : *P* : *P*_{4}], where without loss of generality we assume that *P*_{1} and *P*_{2} are odd *π*-paths, *P*_{3} is an even *π*-path, and *P*_{4} is a *Γ*-path. Replacing these chains with the chains [*P*_{1} : *P*_{3}] and [*P*_{2} : *P* : *P*_{4}] does not decrease the number of paths of even length in B(*π*^{∗},*Γ*^{∗}), implying Step 3 of Algorithm 1.

As a result, all remaining *π* - paths must have the same parity, as must all the *Γ* - paths; thus, we may choose any *π* - path and *Γ* - path to link to *P* (Step 4 of Algorithm 1) and form a 3-chain. The length of this 3-chain may be even (*δ* = 1) or odd (*δ* = 0) depending on whether the length of its *π* - path and *Γ* - path have equal parity or not. All remaining paths must therefore be 2-chains linking pairs of *π* - paths or pairs of *Γ* - paths (Step 5 of Algorithm 1).

If instead *p*^{Π,γ} is even, then *δ* = 0, and the argument for constructing an optimal completion proceeds similarly, except that no {*π*,*Γ*}-paths will remain after forming a maximal collection of 2-bracelets, eliminating the need for Step 4. □

##### Algorithm 1.

Given genomes (*π*,*Γ*), the following algorithm constructs an optimal completion (*π*^{∗},*Γ*^{∗}) in *O*(*N*) time.

0 Remove all circular singletons from *π* and *Γ*.

1 Close every {*Π*,*Π*}-path ({*γ*,*γ*}-path) into a cycle by adding a single new adjacency to *π*^{∗} (*Γ*^{∗}).

2 Form a maximum set of 2-bracelets.

3 Form a maximum set of even 2-chains by linking pairs of *π*-paths (*Γ*-paths) having opposite parity.

4 If *p*^{Π,γ} is odd, then link the remaining {*π*,*Γ*}-path with any remaining *π*-path and *Γ*-path to form a 3-chain.

5 Arbitrarily link pairs of remaining *π*-paths, all of which have the same parity, to form 2-chains. Do the same for remaining *Γ*-paths.
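Because each step of Algorithm 1 creates a known number of cycles and even paths, the distance can be evaluated from the component statistics alone, without building the completion. The sketch below assumes that the closed form of Theorem 8 reads *N* minus [*c* + *p*^{Π,Π} + *p*^{γ,γ} + ⌊*p*^{Π,γ}/2⌋ + (${p}_{\text{even}}^{0}$ + min{${p}_{\text{odd}}^{\Pi}$, ${p}_{\text{even}}^{\Pi}$} + min{${p}_{\text{odd}}^{\gamma}$, ${p}_{\text{even}}^{\gamma}$} + *δ*)/2], which is the value obtained by tallying Steps 1 through 5. The function and parameter names are hypothetical.

```python
def dcj_indel_distance(n_genes, c=0, p_pipi=0, p_gg=0, p_pg=0, p0_even=0,
                       pi_odd=0, pi_even=0, g_odd=0, g_even=0):
    """Evaluate the assumed closed form for genomes without circular singletons.

    c       : cycles of B(pi, gamma)
    p_pipi  : {Pi,Pi}-paths         p_gg   : {gamma,gamma}-paths
    p_pg    : {pi,Gamma}-paths      p0_even: even paths ending in two telomeres
    pi_odd, pi_even: odd / even pi-paths
    g_odd,  g_even : odd / even Gamma-paths
    """
    same_direction = ((pi_odd > pi_even and g_odd > g_even)
                      or (pi_odd < pi_even and g_odd < g_even))
    delta = 1 if (p_pg % 2 == 1 and same_direction) else 0

    # Steps 1 and 2: closed paths and 2-bracelets become cycles.
    cycles = c + p_pipi + p_gg + p_pg // 2
    # Steps 3 and 4: even 2-chains, plus the 3-chain when it has even length.
    even_paths = p0_even + min(pi_odd, pi_even) + min(g_odd, g_even) + delta
    if even_paths % 2:
        raise ValueError("inconsistent statistics: odd number of even paths")
    return n_genes - cycles - even_paths // 2


# pi = (a), gamma = (a, b): one insertion suffices.
d1 = dcj_indel_distance(2, p0_even=1, pi_odd=1, pi_even=1)
# pi = (a, b), gamma = (a, c): delete b, insert c.
d2 = dcj_indel_distance(3, p_pg=1, p0_even=1, pi_even=1, g_even=1)
```

The two examples evaluate to 1 and 2, in agreement with the obvious sorting scenarios noted in the comments; the second one exercises the *δ* = 1 case.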

### The solution space of DCJ-Indel sorting

The problem of DCJ sorting is well understood, its solution space having been described in [17]. Thus, by Theorem 3, to identify the solution space of DCJ-indel sorting (an open problem), we simply need to enumerate the construction of indels of an optimal completion. We mentioned this enumeration in [12], but here we will explore the details of the calculation.

#### Handling circular singletons

By Proposition 1, we may consider the circular singletons of *π* and *Γ* independently of other chromosomes; for that matter, because insertions and deletions are defined symmetrically, we may assume that *π* contains *k* chromosomes and that *Γ* is the empty genome. Then by Corollary 2 and the trivial fact that any DCJ applied to *π* changes the total number of chromosomes of *π* by at most 1 (see [8]), we may obtain *Γ* from *π* in *k* steps if and only if we perform *j* successive DCJs (0 ≤ *j* < *k*), each of which fuses two circular chromosomes into one, followed by applying *k* - *j* chromosome deletions.

Assuming that *k* is relatively small, the enumeration of all such transformations of *π* into *Γ* poses a tedious but straightforward task, as a fusion of two circular chromosomes corresponds to a DCJ using two adjacencies from different chromosomes.

#### Genomes lacking circular singletons

Having handled circular singletons, we may assume that sing(*π*,*Γ*) = 0. Fortunately, the lemmas presented before Theorem 8 have greatly reduced the collection of possible optimal completions, which we now continue to pare down.

##### Proposition 10

*Every π - path (Γ - path) embedding into a 3-chain of an optimal completion must have the same parity.*

##### Proof

Say for the sake of contradiction that we have an optimal completion (*π*^{′},*Γ*^{′}) such that B(*π*^{′},*Γ*^{′}) contains 3-chains [*P*_{1} : *P*_{2} : *P*_{3}] and [*P*_{4} : *P*_{5} : *P*_{6}], where *P*_{1} and *P*_{4} are *π* - paths of opposite parity. Consider the completion (*π*^{″},*Γ*^{″}), which is defined by rejoining adjacencies of (*π*^{′},*Γ*^{′}) to form [*P*_{1} : *P*_{4}], (*P*_{2} : *P*_{5}), and [*P*_{3} : *P*_{6}] in B(*π*^{″},*Γ*^{″}). The 2-chain [*P*_{1} : *P*_{4}] must have even length, and (*P*_{2} : *P*_{5}) is a cycle; thus, *d*_{DCJ} (*π*^{″},*Γ*^{″}) < *d*_{DCJ} (*π*^{′},*Γ*^{′}), and so (*π*^{′},*Γ*^{′}) cannot be optimal. □

##### Proposition 11

*If p*^{Π,γ} *is even, then the breakpoint graph of an optimal completion must contain a maximum set of even-length 2-chains.*

##### Proof

We proceed by contradiction. Say that (*π*^{′},*Γ*^{′}) is an optimal completion for which an odd *π* - path *P*_{1} and an even *π* - path *P*_{2} are contained in different components of B(*π*^{′},*Γ*^{′}), neither of which is an even 2-chain. By Propositions 7 and 10, we may assume that *P*_{1} and *P*_{2} embed into an odd-length 2-chain [*P*_{1} : *P*_{5}] and a 3-chain [*P*_{2} : *P*_{3} : *P*_{4}]. Because *p*^{Π,γ} is even by Proposition 4, we must have at least one additional 3-chain [*P*_{6} : *P*_{7} : *P*_{8}], where (again by Proposition 10) *P*_{6} is an even-length *π* -path, and the *Γ* - paths *P*_{4} and *P*_{8} have the same parity. With these assumptions in hand, we may rejoin adjacencies to form the four components [*P*_{1} : *P*_{2}] (even), [*P*_{5} : *P*_{6}] (even), (*P*_{3} : *P*_{7}), and [*P*_{4} : *P*_{8}] (odd), producing a cycle and two even 2-chains from our original three paths. Hence, by (4), (*π*^{′},*Γ*^{′}) cannot be optimal. □

We are now ready to fully describe the collection of optimal completions when *p*^{Π,γ} is even. To construct an optimal completion, after closing each {*Π*,*Π*} - path and {*γ*,*γ*} - path, which can be done uniquely, we must form a maximum collection of even 2-chains by Proposition 11. Recall that our aim is to maximize the statistic $c({\pi}^{\ast},{\Gamma}^{\ast})+\frac{{p}_{\text{even}}({\pi}^{\ast},{\Gamma}^{\ast})}{2}$, and consider the following two subcases.

##### Case 1

*p*^{Π,γ} is even, ${p}_{\text{odd}}^{\Pi}\le {p}_{\text{even}}^{\Pi}$, and ${p}_{\text{odd}}^{\gamma}\ge {p}_{\text{even}}^{\gamma}$. First, a maximal collection of even-length 2-chains will total ${p}_{\text{odd}}^{\Pi}+{p}_{\text{even}}^{\gamma}$ components, which requires simply choosing ${p}_{\text{odd}}^{\Pi}$ even-length *π*-paths, then matching them to odd-length *π*-paths. This can be achieved in *A*_{1} ways, where

Similarly, we match *Γ*-paths of opposite parity, yielding *B*_{1} total matchings:

Here, we use P(*n*,*k*) to denote the partial permutation statistic: $\mathrm{P}(n,k)=\frac{n!}{(n-k)!}$. We will be left with ${p}_{\text{even}}^{\Pi}-{p}_{\text{odd}}^{\Pi}$ even *π*-paths and ${p}_{\text{odd}}^{\gamma}-{p}_{\text{even}}^{\gamma}$ odd *Γ*-paths. It is impossible to create any more even-length paths in B(*π*^{∗},*Γ*^{∗}), and so we must form a maximum collection of $\frac{{p}^{\Pi ,\gamma}}{2}$ 2-bracelets from the {*π*,*Γ*}-paths:

Finally, we link arbitrary remaining *π*-paths to each other and arbitrary remaining *Γ*-paths to each other:

By the independence of these four procedures, the total number of optimal completions is simply given by the product *A*_{1} · *B*_{1} · *C*_{1} · *D*_{1}.
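The first matching step of Case 1 can be sketched in Python. This is an illustrative reconstruction from the prose, not the paper's displayed formula: it assumes that each chosen pair of paths can be linked in exactly one way, so that choosing $k$ partners among $n$ candidates and matching them to $k$ labelled paths gives $\binom{n}{k}\cdot k!=\mathrm{P}(n,k)$ possibilities. The function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def partial_permutations(n: int, k: int) -> int:
    """P(n, k) = n! / (n - k)!, the partial permutation statistic from the text."""
    if k < 0 or k > n:
        return 0
    return factorial(n) // factorial(n - k)


def brute_force_matchings(n: int, k: int) -> int:
    """Count injective assignments of k labelled paths to n labelled candidates.

    Assumption (not from the paper): a matched pair can be linked in one way only,
    so each injective assignment corresponds to exactly one matching.
    """
    return sum(1 for _ in permutations(range(n), k))


# Example: 5 even-length pi-paths, 2 odd-length pi-paths to be matched.
n_even, n_odd = 5, 2
closed_form = partial_permutations(n_even, n_odd)
enumerated = brute_force_matchings(n_even, n_odd)
```

With 5 even-length *π* - paths and 2 odd-length *π* - paths, both the closed form and the enumeration give 20 matchings.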

##### Case 2

*p*^{Π,γ} is even, ${p}_{\text{odd}}^{\Pi}>{p}_{\text{even}}^{\Pi}$, and ${p}_{\text{odd}}^{\gamma}>{p}_{\text{even}}^{\gamma}$. In this case, we first form a maximum set of 2 - chains:

[equation for *A*_{2} not recovered]

[…] *π* - paths and ${p}_{\text{odd}}^{\gamma}-{p}_{\text{even}}^{\gamma}$ odd-length *Γ* - paths remaining. Assume without loss of generality that ${p}_{\text{odd}}^{\Pi}-{p}_{\text{even}}^{\Pi}\ge {p}_{\text{odd}}^{\gamma}-{p}_{\text{even}}^{\gamma}$, and set $m=\min\{{p}^{\Pi ,\gamma},{p}_{\text{odd}}^{\gamma}-{p}_{\text{even}}^{\gamma}\}$. We may attain the formula in (9) if and only if we form 2*j* even-length 3 - chains for some integer *j* satisfying $0\le j\le \frac{m}{2}$, then create $\frac{{p}^{\Pi ,\gamma}}{2}-j$ total 2 - bracelets from the remaining {*π*,*Γ*} - paths. Any remaining odd-length *π* - paths (*Γ* - paths) must then be linked to each other to form (odd-length) 2 - chains in B(*π*^{∗},*Γ*^{∗}). The number of such possibilities can be counted by the following statistic *B*_{2}:

[equation for *B*_{2} not recovered]

Again, the two statistics can be carried out independently, yielding *A*_{2} · *B*_{2} total optimal completions.
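The range of admissible values of *j* in Case 2 can be listed with a short Python sketch. It only enumerates the options described in the prose (2*j* even-length 3 - chains and $\frac{{p}^{\Pi ,\gamma}}{2}-j$ 2 - bracelets for each integer $0\le j\le \frac{m}{2}$); it does not attempt to reproduce the statistic *B*_{2}. The function and parameter names are hypothetical.

```python
def three_chain_options(p_pi_gamma: int, surplus_odd_gamma: int) -> list[tuple[int, int, int]]:
    """List (j, number of 3-chains, number of 2-bracelets) for each admissible j.

    p_pi_gamma        -- number of {pi, Gamma}-paths (must be even in Case 2)
    surplus_odd_gamma -- p_odd^gamma - p_even^gamma, the odd Gamma-paths left over
    """
    if p_pi_gamma % 2 != 0:
        raise ValueError("Case 2 assumes an even number of {pi, Gamma}-paths")
    if surplus_odd_gamma < 0:
        raise ValueError("surplus of odd Gamma-paths must be non-negative")
    m = min(p_pi_gamma, surplus_odd_gamma)
    # j ranges over the integers with 0 <= j <= m / 2.
    return [(j, 2 * j, p_pi_gamma // 2 - j) for j in range(m // 2 + 1)]


# Example: four {pi, Gamma}-paths and three surplus odd Gamma-paths.
options = three_chain_options(4, 3)
```

Each tuple accounts for all {*π*,*Γ*} - paths: the 2*j* paths used in 3 - chains plus twice the number of 2 - bracelets always sum to *p*^{Π,γ}.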

In both of the first two cases, reversing the inequalities will lead to analogous arguments. For the next two cases, suppose instead that *p*^{Π,γ} is odd, and select a single {*π*,*Γ*} - path *P* that must belong to a 3 - chain.

##### Case 3

*p*^{Π,γ} is odd, ${p}_{\text{odd}}^{\Pi}<{p}_{\text{even}}^{\Pi}$, and ${p}_{\text{odd}}^{\gamma}>{p}_{\text{even}}^{\gamma}$. Note that there are *A*_{3} = *p*^{Π,γ} total ways to select a {*π*,*Γ*} - path *P*. Of the four possibilities for the parity of the paths to which *P* may be linked to form a 3 - chain, one may wish to verify that the only way we cannot attain the maximum in (9) is if we link *P* to an odd-length *π* - path and an even-length *Γ* - path. Thus, we arrive at three mutually exclusive subcases.

*P* is linked to an even-length *π* - path and an odd-length *Γ* - path:

[equation for *B*_{3} not recovered]

We now have an even number of {*π*,*Γ*} - paths remaining and have reduced our problem to a simpler one that falls under Case 1 above, from which we may obtain some number *C*_{3} of optimal completions.

[…] *P* to an odd-length *π* - path and an odd-length *Γ* - path. First, select two such paths:

[equation for *D*_{3} not recovered]

[…] *E*_{3} total optimal completions. In our third and final subcase, we join *P* to an even *π* - path and an even *Γ* - path:

[equation for *F*_{3} not recovered]

Say that applying Case 1 to the resulting subcase in which *p*^{Π,γ} is even yields *G*_{3} total optimal completions. Then by independence, the total number of optimal completions over all three subcases will be given by *A*_{3} · (*B*_{3} · *C*_{3} + *D*_{3} · *E*_{3} + *F*_{3} · *G*_{3}).
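The way the Case 3 total combines a product (independent choices) with a sum (mutually exclusive subcases) can be sketched in Python. The selection counts *B*_{3}, *D*_{3}, and *F*_{3} below are assumptions: by analogy with the product given for *B*_{4} in Case 4, each is taken to be the number of *π* - paths of the required parity times the number of *Γ* - paths of the required parity. The counts *C*_{3}, *E*_{3}, and *G*_{3} come from the reduced problems and are passed in as inputs. All names are hypothetical.

```python
def pair_choices(n_pi_paths: int, n_gamma_paths: int) -> int:
    """Ways to pick one pi-path and one Gamma-path (assumed to be a plain product)."""
    return n_pi_paths * n_gamma_paths


def case3_total(p_pi_gamma: int,
                pi_even: int, pi_odd: int,
                gamma_even: int, gamma_odd: int,
                c3: int, e3: int, g3: int) -> int:
    """Combine the three subcases as A3 * (B3*C3 + D3*E3 + F3*G3)."""
    a3 = p_pi_gamma                          # ways to select the path P
    b3 = pair_choices(pi_even, gamma_odd)    # assumed: even pi-path, odd Gamma-path
    d3 = pair_choices(pi_odd, gamma_odd)     # assumed: odd pi-path, odd Gamma-path
    f3 = pair_choices(pi_even, gamma_even)   # assumed: even pi-path, even Gamma-path
    return a3 * (b3 * c3 + d3 * e3 + f3 * g3)


# Illustrative inputs only; c3, e3, g3 are made-up counts for the reduced problems.
total = case3_total(3, pi_even=2, pi_odd=1, gamma_even=1, gamma_odd=2, c3=5, e3=4, g3=1)
```

The subcase terms are added because no completion can fall under two subcases at once, and each term is a product because the selection and the reduced problem are carried out independently.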

##### Case 4

*p*^{Π,γ} is odd, ${p}_{\text{odd}}^{\Pi}>{p}_{\text{even}}^{\Pi}$, and ${p}_{\text{odd}}^{\gamma}>{p}_{\text{even}}^{\gamma}$. Having selected *P* from the *A*_{4} = *p*^{Π,γ} total {*π*,*Γ*} - paths, one may verify that the only way we can achieve the maximum in (9) is by linking *P* to an odd-length *π* - path and an odd-length *Γ* - path, of which there are ${B}_{4}={p}_{\text{odd}}^{\Pi}\cdot {p}_{\text{odd}}^{\gamma}$ total choices. We have therefore reduced our problem of linking components of B(*π*,*Γ*) to a smaller problem, falling under Case 2, for which *p*^{Π,γ} is even. If there are *C*_{4} total solutions to this smaller problem, then the number of optimal completions is given by *A*_{4} · *B*_{4} · *C*_{4}.
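The Case 4 count is a straightforward product and can be sketched as follows. The values *A*_{4} and *B*_{4} follow the formulas stated above; *C*_{4}, the number of solutions of the reduced Case 2 problem, is treated as an input because its formula is not reproduced here. The function name is hypothetical.

```python
def case4_total(p_pi_gamma: int, p_odd_pi: int, p_odd_gamma: int, c4: int) -> int:
    """Return A4 * B4 * C4 for Case 4.

    p_pi_gamma  -- number of {pi, Gamma}-paths (odd in Case 4)
    p_odd_pi    -- number of odd-length pi-paths
    p_odd_gamma -- number of odd-length Gamma-paths
    c4          -- number of solutions of the reduced problem (supplied by the caller)
    """
    if p_pi_gamma % 2 != 1:
        raise ValueError("Case 4 assumes an odd number of {pi, Gamma}-paths")
    a4 = p_pi_gamma              # ways to select the path P
    b4 = p_odd_pi * p_odd_gamma  # ways to pick the odd pi-path and odd Gamma-path
    return a4 * b4 * c4


# Illustrative inputs only; c4 = 5 is a made-up count for the reduced problem.
example_total = case4_total(3, 2, 4, 5)
```

With three {*π*,*Γ*} - paths, two odd-length *π* - paths, four odd-length *Γ* - paths, and five solutions of the reduced problem, the product is 3 · 8 · 5 = 120.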

As in the first two cases, reversing the inequalities defining Cases 3 and 4 will result in analogous arguments.

## Conclusions

In this paper, we have demonstrated how the problem of DCJ-indel sorting, first solved in [11], can equally be handled via direct inspection of the breakpoint graph. Unfortunately, we still do not see a natural correspondence between the two approaches to DCJ-indel sorting, which appear to be at odds because their definitions of indels are equivalent but motivated differently.

Furthermore, modeling an indel as a circular chromosome resulting from a DCJ has uncovered the solution space of DCJ-indel sorting, thus resolving an open problem. We wonder if other operations could be adapted to a similar model to yield a straightforward calculation of other genomic distances involving indels. We are also curious whether this model applies to the case of finding a minimum-cost transformation of one genome into another as we vary the parameter associated with the (constant) indel cost.

## Endnotes

^{a} This definition allows B(*π*,*Γ*) to contain cycles of length 2.

^{b} In particular, this requirement bars the trivial transformation of *π* into *Γ* in which every chromosome from *π* is deleted, and then all the chromosomes of *Γ* inserted.

^{c} Note that *v* cannot be simultaneously *π* - and *Γ* - open, although it may be a telomere of both *π* and *Γ* or be *π* - open and a telomere of *Γ* (in both cases, *v* is an isolated vertex of B(*π*,*Γ*), i.e., a path of length 0).

## Declarations

### Acknowledgements

The author would like to acknowledge the support of Pavel Pevzner (UC San Diego Department of Computer Science), who offered guidance during the drafting of the manuscript.

## References

- Dobzhansky T, Sturtevant AH: Inversions in the chromosomes of drosophila pseudoobscura. Genetics. 1938, 23: 28-64.
- Fertin G, Labarre A, Rusu I, Tannier E, Vialette S: Combinatorics of Genome Rearrangements. Cambridge: MIT Press 2009.
- Problem E2569. Am Math Mon. 1975, 82: 1010. 10.2307/2318261.
- Gates WH, Papadimitriou CH: Bounds for sorting by prefix reversal. Discrete Math. 1979, 27: 47-57. 10.1016/0012-365X(79)90068-2. [http://www.sciencedirect.com/science/article/pii/0012365X79900682]
- Heydari MH, Sudborough I: On the diameter of the pancake network. J Algorithms. 1997, 25: 67-94. 10.1006/jagm.1997.0874. [http://www.sciencedirect.com/science/article/pii/S0196677497908749]
- Chitturi B, Fahle W, Meng Z, Morales L, Shields C, Sudborough I, Voit W: An upper bound for sorting by prefix reversals. Theor Comput Sci. 2009, 410 (36): 3372-3390. 10.1016/j.tcs.2008.04.045. [http://www.sciencedirect.com/science/article/pii/S0304397508003575]. [Graphs, Games and Computation: Dedicated to Professor Burkhard Monien on the Occasion of his 65th Birthday]
- Bulteau L, Fertin G, Rusu I: Pancake flipping is hard. CoRR. preprint, abs/1111.0434.
- Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005, 21 (16): 3340-3346. [http://bioinformatics.oxfordjournals.org/content/21/16/3340.abstract]
- Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. WABI 2006. LNCS (LNBI). 2006, 4175: 163-173.
- Yancopoulos S, Friedberg R: DCJ path formulation for genome transformations which include insertions, deletions, and duplications. J Comput Biol. 2009, 16 (10): 1311-1338.
- Braga MDV, Willing E, Stoye J: Genomic distance with DCJ and indels. Proc 10th Int Conf Algorithms Bioinformatics. 2010, 90-101. [http://portal.acm.org/citation.cfm?id=1885783.1885793]
- Compeau PEC: A simplified view of DCJ-Indel distance. WABI Volume 7534 of Lecture Notes in Computer Science. Edited by: Raphael BJ, Tang J. 2012, 365-377. [http://dblp.uni-trier.de/db/conf/wabi/wabi2012.html#Compeau12], Springer
- Bafna V, Pevzner PA: Genome rearrangements and sorting by reversals. SIAM J Comput. 1996, 25 (2): 272-289. 10.1137/S0097539793250627.
- Tannier E, Zheng C, Sankoff D: Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics. 2009, 10: 120. [http://www.biomedcentral.com/1471-2105/10/120]
- Braga M, Machado R, Ribeiro L, Stoye J: On the weight of indels in genomic distances. BMC Bioinformatics. 2011, 12 (Suppl 9): S13. [http://www.biomedcentral.com/1471-2105/12/S9/S13]
- Ma J, Ratan A, Raney BJ, Suh BB, Miller W, Haussler D: The infinite sites model of genome evolution. Proc Natl Acad Sci USA. 2008, 105 (38): 14254-14261. [http://dx.doi.org/10.1073/pnas.0805217105]
- Braga MD, Stoye J: The solution space of sorting by DCJ. J Comput Biol. 2010, 17 (9): 1145-1165.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.