DCJ-Indel sorting revisited

Compeau, Phillip EC

doi:10.1186/1748-7188-8-6

Research
Open access
Published: 01 March 2013

DCJ-Indel sorting revisited

Phillip EC Compeau¹

Algorithms for Molecular Biology volume 8, Article number: 6 (2013) Cite this article

3424 Accesses
22 Citations
Metrics details

Abstract

Background

The introduction of the double cut and join operation (DCJ) caused a flurry of research into the study of multichromosomal rearrangements. However, little of this work has incorporated indels (i.e., insertions and deletions of chromosomes and chromosomal intervals) into the calculation of genomic distance functions, with the exception of Braga et al., who provided a linear time algorithm for the problem of DCJ-indel sorting. Although their algorithm only takes linear time, its derivation is lengthy and depends on a large number of possible cases.

Results

We note the simple idea that a deletion of a chromosomal interval can be viewed as a DCJ that creates a new circular chromosome. This framework will allow us to amortize indels as DCJs, which in turn permits the application of the classical breakpoint graph to obtain a simplified indel model that still solves the problem of DCJ-indel sorting in linear time via a more concise formulation that relies on the simpler problem of DCJ sorting. Furthermore, we can extend this result to fully characterize the solution space of DCJ-indel sorting.

Conclusions

Encoding indels as DCJ operations offers a new insight into why the problem of DCJ-indel sorting is not ultimately any more difficult than that of sorting by DCJs alone. There is still room for research in this area, most notably the problem of sorting when the cost of indels is allowed to vary with respect to the cost of a DCJ and we demand a minimum cost transformation of one genome into another.

Background

In the simplest terms, DNA may mutate in two fundamentally different ways. On the one hand, single-nucleotide polymorphisms alter the base at a single position of the nucleic acid polymer; on the other hand, huge mutations called chromosomal rearrangements can move around, duplicate, insert, or delete huge blocks of DNA, often from one chromosome to another.

Chromosomal rearrangements were first observed by Dobzhansky and Sturtevant in 1938 ([1]), but extensive efforts to quantify their study did not take off until the early 1990s. In the last two decades, a number of discrete genomic models have been proposed and studied (see [2] for an overview of the combinatorics of genome rearrangements).

Having selected a genomic model and a collection of genome operations to consider, the standard algorithmic problem is the computation of the distance between two genomes π and Γ, or the minimum number of allowable operations required to transform π into Γ; the more difficult problem of sorting demands the operations themselves. The first historical example of such a discrete genomic distance is the prefix reversal distance for permutations (which model the order of genes along a single linear chromosome), introduced in [3] and bounded in [4–6]. The computation of prefix reversal distance has been proposed to be NP-Hard (see [7]).

More recent research has moved past permutations and toward multichromosomal genomic models that incorporate both linear and circular chromosomes. One of these models, which we will study in this paper, models the chromosomes of a genome with paths and cycles in a graph. For this model, the double cut and join operation (DCJ) was introduced in [8] and incorporates segment reversals with a number of other operations. Interestingly, a linear time greedy algorithm exists for DCJ sorting two genomes having equal gene content (see [9]).

The incorporation of insertions and deletions of chromosomes and chromosomal intervals (collectively called indels) into DCJ distance was discussed in [10] and quantified rigorously in [11]. The latter authors provided a linear time algorithm for the associated problem of DCJ-indel sorting, which gives a minimum collection of DCJ and indel operations required to transform one genome into another. Yet their argument is case-ridden, and so in this paper, which builds upon [12], we wish to provide a much simpler presentation of DCJ-indel sorting that still yields a linear-time solution to the problem.

Main text

Preliminaries

Say that we are given a perfect matching on 2N labeled vertices $V$ , forming a set $G$ of N edges called genes; the vertices of each gene form its head and tail. We define a genome π as the edge-disjoint union of two matchings. The genes of π, denoted g(π), form a matching on $V$ such that $g (π) \subseteq G$ ; the adjacencies of π, denoted a(π), form a matching on V(g(π)). We color the genes of π black and the adjacencies of π blue (see Figure 1(a)).

A consequence of these definitions is that π comprises a disjoint collection of paths and cycles, where each connected component alternates between black genes and blue adjacencies. Each component of π is called a chromosome; paths (cycles) of π define linear (circular) chromosomes of π. The endpoint v of a path in π is called a telomere of π; v is not incident to an adjacency, and so for clerical purposes, we say that v has the null adjacency {v,∅}. A genome consisting of only circular (linear) chromosomes is called a circular (linear) genome. Note that π is circular if and only if the edges of a(π) form a perfect matching on V(π).

Henceforth, we only consider genome pairs {π,Γ} such that $g (π) \cup g (Γ) = G$ . A workhorse data structure encoding the relationship between π and Γ is the breakpoint graph ([13]), denoted by B(π,Γ) and defined as the edge-disjoint union ^a of a(π) and a(Γ), where adjacencies of Γ will be colored red (Figure 1(b)). Observe that B(π,Γ) is also a collection of disjoint paths and cycles, which alternate between red and blue edges. The length of a connected component of B(π,Γ) is its total number of edges; we consider an isolated vertex in B(π,Γ) to be a path of length 0. The breakpoint graph is also the line graph of the adjacency graph, which was first defined in [9] and has also been used in rearrangement studies.

A double cut and join operation (DCJ) on π (introduced in [8]) uses one or two adjacencies of π via one of the following four operations to produce a new genome π^′:

1.
{v,w},{x,y} →{v,x},{w,y}
2.
{v,w},{x,∅}→{v,x},{w,∅}
3.
{v,∅},{w,∅} →{v,w}
4.
{v,w} →{v,∅},{w,∅}

The DCJ incorporates a wide range of genome rearrangements, as shown in Figure 2.

For the particular case that π and Γ have the same genes (i.e., $g (π) = g (Γ) = G$ ), the DCJ distance between π and Γ, written d_DCJ(π,Γ), is the minimum number of DCJs required to transform π into Γ One can easily verify that d_DCJ forms a metric on the set of all genomes having gene set $G$ . A closed formula for DCJ distance was derived in [9] and translated into breakpoint graph notation in [14]:

d_{DCJ} (π, Γ) = N - c (π, Γ) - \frac{p_{even} (π, Γ)}{2}

(1)

Here, c(π,Γ) and p_even(π,Γ) denote the number of cycles and even-length paths in B(π,Γ), respectively.

For the more general case that π and Γ do not share the same genes, a deletion of a chromosomal interval of π replaces adjacencies {v,w} and {x,y} (contained in the order (v,w,x,y) along a chromosome of π) with the adjacency {v,y} and removes the path connecting w to x. We also allow deletions of entire chromosomes; however, we must stipulate (following the lead of the authors in [11]) that every vertex removed from π must belong to $V - V (Γ)$ . ^b The insertion of a chromosome or chromosomal interval into π to obtain π^′ is defined as the inverse of a corresponding deletion from π^′ that yields π. Note that a consequence of this definition is that we may not insert a gene unless it is contained in $G$ . Insertions and deletions are collectively called indels; thus, we define the DCJ-indel distance between π and Γ, written $d_{DCJ}^{ind} (π, Γ)$ , as the minimum number of DCJs and indels required to transform π into Γ.

Because insertions and deletions are inverse operations, it follows that $d_{DCJ}^{ind} (π, Γ) = d_{DCJ}^{ind} (Γ, π)$ . However, although $d_{DCJ}^{ind}$ is symmetric, unlike d_DCJ it does not form a metric, as the triangle inequality does not hold; see [15] for a more complete discussion.

DCJ-Indel sorting

Handling circular singletons

We begin our discussion of DCJ-indel sorting by defining a circular singleton of π (adapted from[11]) as a circular chromosome C such that V(C) ∩ V (Γ) = ∅. Note that C is defined with respect to Γ as well as π. Ideally, we could delete (insert) all circular singletons of π and Γ immediately to simplify the problem of DCJ-indel sorting; fortunately, this is indeed the case, as shown by the following two results.

Proposition 1

If π^′is formed by removing a circular singleton C from π,then $d_{DCJ}^{ind} (π^{'}, Γ) = d_{DCJ}^{ind} (π, Γ) - 1$ .Furthermore, when transforming π into Γ via a minimum collection of DCJs and indels, no gene belonging to a circular singleton of π can ever appear in the same chromosome as a gene of Γ.

Proof

Any collection of k DCJs and indels transforming π^′ into Γ can be supplemented by the deletion of C to yield k + 1 DCJs and indels transforming π into Γ; thus, $d_{DCJ}^{ind} (π^{'}, Γ) \geq d_{DCJ}^{ind} (π, Γ) - 1$ .

To obtain the reverse bound, let us view a transformation $T$ of π into Γ as a sequence (π₀,π₁,…,π_n) (n ≥ 1), where π₀ = π, π_n= Γ, and π_{i + 1}is obtained from π_ias the result of a single DCJ or indel. Consider the sequence $(π_{0}^{'}, π_{1}^{'}, \dots, π_{n}^{'})$ , where $π_{i}^{'}$ is constructed from π_iby removing the subgraph of π_iinduced by the vertices of C under the stipulation that whenever we remove a path P connecting v to w, we replace adjacencies {v,x} and {w,y} in π with {x,y} in $π_{i}^{'}$ . It is easy to see that $π_{0}^{'} = π^{'}$ , $π_{n}^{'} = Γ$ , and for every i in range, either $π_{i + 1}^{'}$ is the result of a DCJ or indel applied to $π_{i}^{'}$ or $π_{i + 1}^{'} = π_{i}^{'}$ ; thus, $(π_{0}^{'}, π_{1}^{'}, \dots, π_{n}^{'})$ encodes a transformation of π^′ into Γ using at most n DCJs and indels. Furthermore, one can verify that $π_{i + 1}^{'} = π_{i}^{'}$ only when an adjacency of C is used by a DCJ in $T$ changing π_ito π_{i + 1}or when π_{i + 1}is produced from π_iby a deletion of vertices that all belong to C. At least one such operation must always occur in $T$ ; hence, $d_{DCJ}^{ind} (π^{'}, Γ) \leq d_{DCJ}^{ind} (π, Γ) - 1$ .

The proposition’s second conclusion follows from the fact that if for some j (1 ≤ j ≤ n - 1), a chromosome of π_jcontains a gene g₁ of π and a gene g₂ of C, then one DCJ was required to combine g₁ and g₂ into the same chromosome, and another will be needed to separate them, yielding two distinct values of i for which $π_{i + 1}^{'} = π_{i}^{'}$ . From the first part of the proof, we may conclude that $d_{DCJ}^{ind} (π, Γ) < n$ . □ □

Letting sing (π,Γ) denote the total number of circular singletons of π and Γ, we have an immediate corollary.

Corollary 2

The DCJ-indel distance is given by the following:

d_{DCJ}^{ind} (π, Γ) = sing (π, Γ) + d_{DCJ}^{ind} (π^{0}, Γ^{0})

(2)

where π⁰ (Γ⁰) is formed by removing all circular singletons from π (Γ).

With respect to DCJ-indel sorting, Corollary 2 allows us to assume without loss of generality that π and Γ do not contain any circular singletons.

We next make an observation taken from [16], which is that the deletion of a chromosomal interval of π connecting w to x may be viewed as a DCJ: {v,w},{x,y } → {v,y},{w,x}; this operation produces a circular chromosome containing w and x that is scheduled for removal, including the case that v or y equals ∅ (the deletion of an entire linear chromosome is handled by u = x = ∅); see Figure 3. Because insertions are the inverses of deletions, we would like to conclude that indels may be placed in a one-to-one correspondence with the removal of circular chromosomes. Ironically, the apparent exception to this proposed rule is the deletion of an entire circular chromosome.

Yet if a deleted circular chromosome C is not produced as the result of a DCJ, then C must be a circular singleton of π in order to be deleted. Otherwise, C has been produced as the result of a DCJ applied to a chromosomal interval; by the method we just described, we can amortize the deletion in this DCJ unless the DCJ also creates another circular chromosome C^′ that is scheduled for deletion. However, this sequence of operations cannot arise in a minimum collection of DCJs and indels transforming π into Γ, as we could simply delete the original chromosome from which C and C^′ were produced by the DCJ in question, thus requiring a single operation instead of three.

Toward a new model of Indels

We will follow the observation made in [16] that the actual removal of deleted chromosomes can occur as a final step in the transformation of π into Γ. As a result, we may view the transformation of π into Γ as composed of three steps: inserting chromosomes into π to yield a new genome π^′ with $g (π^{'}) = G$ ; applying a sequence of DCJs to produce a genome Γ^′ having the same genes as π^′; and finally, deleting chromosomes from Γ^′ to produce Γ. Note that we can equivalently view the first step as the deletion of chromosomes from π^′ to obtain π. Combining this observation with our correspondence between indels and circular chromosomes above, we may introduce the following framework.

Define a completion of π as a genome π^′ having $g (π^{'}) = G$ and for which a(π^′) is composed of a(π) together with a perfect matching on V(π^′) - V(π). We call the adjacencies of a(π^′) - a(π) new. Note that the chromosomes of π embed as chromosomes of π^′ and that the components of π^′ - π form cycles because the new adjacencies of π^′ induce a perfect matching on V(π^′) - V(π); we may now without ambiguity call these circular chromosomes of π^′ the indels of π^′. A completion of a pair of genomes (π,Γ) is simply a pair (π^′,Γ^′) for which π^′ and Γ^′ are completions of π and Γ, respectively. The above discussion implies that for any minimum cost transformation of π into Γ, the indels of π^′ correspond bijectively to DCJ operations, so that we will amortize each unit indel cost by that of a DCJ operation. This amortization yields the following equation for DCJ-indel distance:

d_{DCJ}^{ind} (π, Γ) = min_{(π^{'}, Γ^{'})} \{d_{DCJ} (π^{'}, Γ^{'})\}

(3)

where the minimum is taken over all completions of (π,Γ). A completion (π^∗,Γ^∗) is optimal if it attains the minimum in (3). Applying the closed form equation for the DCJ distance in (1) to immediately produces the following result.

Theorem 3

The DCJ-indel distance is given by the following equation:

d_{DCJ}^{ind} (π, Γ) = N - max_{(π^{'}, Γ^{'})} \{c (π^{'}, Γ^{'}) + \frac{p_{even} (π^{'}, Γ^{'})}{2}\}

(4)

where the maximum is taken over all completions of (π,Γ).

Constructing an optimal completion

In light of Theorem 3, we have reduced DCJ-indel sorting to the problem of constructing indels intelligently to maximize a weighted sum of breakpoint graph components. Once we have produced an optimal completion (π^∗,Γ^∗), we can simply invoke the O(N) - time sorting algorithm described in [9] to transform π^∗ into Γ^∗ via a minimum collection of DCJs.

Our goal is to construct (π^∗,Γ^∗) by direct analysis of B(π,Γ). Because π and Γ do not necessarily share the same genes, B(π,Γ) may contain path endpoints that are not telomeres. Accordingly, we define a vertex v to be Π - open (γ - open) if v ∉ π (v ∉ Γ). In other words, v must be matched to some other π - open vertex when constructing the indels of π^∗. ^c The paths of B(π,Γ) are therefore classified according to their endpoints: a Π - path (γ - path) ends in one π - open (Γ - open) vertex and one telomere (of either π or Γ); a { Π,γ} - path ends in a π - open vertex and a Γ - open vertex (such a path must have even length at least 2); a { Π,Π} - path ({ γ,γ} - path) ends in two π - open (Γ - open) vertices and must therefore have odd length. We should also provide statistics for counting these different components. Define p^Π,γ as the number of {π,Γ} - paths in B(π,Γ); $p_{even}^{Π}$ as the number of even-length π - paths in B(π,Γ); and $p_{even}^{0}$ as the number of even-length paths in B(π,Γ) containing no open vertices (i.e., ending in two telomeres). Similar statistics counting odd-length paths can be defined analogously. We have dropped the genomes {π,Γ} from these statistics for the sake of simplicity; all component statistics will be taken with respect to B(π,Γ) unless otherwise noted.

We first present a proposition regarding the parity of the paths of B (π,Γ).

Proposition 4

The component statistics of B(π,Γ) satisfy the following condition:

\begin{align} p^{Π, γ} \equiv |p_{odd}^{Π} - p_{even}^{Π}| \equiv |p_{odd}^{γ} - p_{even}^{γ}| mod 2 \end{align}

(5)

Proof

The total number of π - open vertices is equal to V (π^′ ) - V (π) and must therefore be even. Of course, the same is the case for Γ - open vertices, and counting π - open and Γ - open vertices over the connected components of B (π,Γ) thus produces the following equivalences:

\begin{align} p_{odd}^{Π} + p_{even}^{Π} + p^{Π, γ} & \equiv 0 mod 2 \end{align}

(6)

\begin{align} p_{odd}^{γ} + p_{even}^{γ} + p^{Π, γ} & \equiv 0 mod 2 \end{align}

(7)

Adding p^Π,γto both sides of (6) and (7) gives the following:

p^{Π, γ} \equiv (p_{odd}^{Π} + p_{even}^{Π}) \equiv (p_{odd}^{γ} + p_{even}^{γ}) mod 2

(8)

The equivalence of (5) and (8) is an arithmetical fact. □

We next establish two necessary conditions on optimal completions by culling the set of possible adjacencies of any such completion. Our general strategy is to consider the addition of a new adjacency {v,w} to a completion π^′ as linking the component(s) of B(π,Γ) whose endpoints are the (π - open) vertices v and w. Our first result states that we must always link the endpoints of any {Π,Π} - path to each other.

Lemma 5

If (π^∗,Γ^∗) is an optimal completion of (π,Γ), then every {Π,Π} - path ({γ,γ} - path) of length 2 k - 1 in B(π,Γ)(k ≥ 1) embeds into a cycle of length 2 k in B (π^∗,Γ^∗).

Proof

Let P be a path of length 2 k - 1 connecting π-open vertices v and w in B (π,Γ). Our claim is that we must link v and w in B (π^∗,Γ^∗). Suppose for the sake of contradiction that we have a completion (π^′,Γ^′) such that P does not embed into a cycle of length 2 k in B (π^′,Γ^′); in this case, we must have adjacencies {v,x} and {w,y} in a(π^′), where all four vertices are distinct.

Consider the completion π^″ that is identical to π^′ except that {v,x} and {w,y} are replaced by {v,w} and {x,y}. In B(π^″,Γ^′), we have closed P into a cycle of length 2 k, and at the same time, we have changed neither the parity nor the linearity/circularity of the component containing x and y. Because we have increased the number of breakpoint graph cycles by 1 without changing the total number of paths, it follows from (1) that d_DCJ(π^″,Γ^′) = d_DCJ(π^′,Γ^′) - 1, and so (π^′,Γ^′) cannot be optimal. □

Having dealt with {Π,Π} - and {γ,γ} - paths of B(π,Γ), any remaining component of B(π^∗,Γ^∗) must be either a j - bracelet, which is a cycle linking j {π,Γ} - paths (where j ≥ 2 and j is even), or a k - chain, in which two π - paths or two Γ - paths are linked via an intermediate number of {π,Γ} - paths to form a path containing k components from B(π,Γ) (k ≥ 2). Note that when k is even, a k- chain C must contain either two π - paths or two Γ- paths, and when k is odd, C must contain one π - path and one Γ - path.

For the sake of simplicity, we will represent a j - bracelet by (P₁ : P₂ : ⋯ : P_j) and a k - chain by [P₁ : P₂ : ⋯ : P_k], where every P_iis linked to P_{i + 1}, and in the case of a j - bracelet, P₁ is linked to P_j. Because we wish to maximize a weighted sum of breakpoint graph components, we might guess that we should look for many short bracelets and chains. Indeed, the length of a bracelet or chain in B(π^∗,Γ^∗) is heavily restricted by the following lemma.

Lemma 6

If (π^∗,Γ^∗)is an optimal completion, then a component C^∗of B (π^∗,Γ^∗) can only contain two or more {π,Γ} - paths if C^∗is a 2 - bracelet.

Proof

Again, say for the sake of contradiction that we have an optimal completion (π^′,Γ^′) for which a component C^′ of B(π^′,Γ^′) contains two or more {π,Γ}-paths. If C^′ is not a 2 - bracelet, then it must contain {π,Γ} - paths P₁ and P₂ that are linked by precisely one new adjacency. Say that P₁ joins π - open vertex v to Γ - open vertex w and that P₂ joins π - open vertex x to Γ - open vertex y. To meet the assumption that P₁ and P₂ are linked by precisely one new adjacency, suppose that {v,x} ∈ a(π^′) but {w,y} ∉ a(Γ^′), where instead {w,w^′} and {y,y^′} are in a(Γ^′). Replacing these two adjacencies with {w,y} and {w^′,y^′} defines a different completion Γ^″ for which B(π^′,Γ^″) contains (P₁ : P₂). Viewed as an operation on B(π^′,Γ^′) to yield B(π^′,Γ^″), we have two cases.

First, if C^′ was a bracelet, then we have formed two new bracelets from C^′, one of which is (P₁ : P₂). Otherwise, C^′ was a chain, in which case we have formed a chain (of the same parity) in addition to (P₁ : P₂). In either case, d_DCJ (π^′,Γ^″)<d_DCJ (π^′,Γ^′), and so (π^′,Γ^′) cannot be optimal. □

Following Lemma 6, we may only have 2 - bracelets, 2 - chains, and 3 - chains in B (π^∗,Γ^∗). After a simple result about 2-chain components, we will be ready to state our main result on DCJ-indel sorting.

Proposition 7

The breakpoint graph of an optimal completion cannot have one 2-chain joining two odd π - paths and another 2-chain joining two even π-paths. The same holds for Γ - paths.

Proof

Once again, proceed by contradiction and assume that (π^′,Γ^′) is an optimal completion with such 2-chains [P₁ : P₂] and [P₃ : P₄]. Replacing these 2-chains with [P₁ : P₃] and [P₂ : P₄] replaces two odd paths in B(π^′,Γ^′) with two even paths; hence, (π^′,Γ^′) cannot be optimal. □

Theorem 8

Algorithm 1, given below, defines an O(N) time algorithm for DCJ - indel sorting. For pairs {π,Γ} having sing (π,Γ) = 0, the DCJ-indel distance is given by the following equation:

\begin{align} d_{DCJ}^{ind} (π, Γ) & = N - [(c + p^{Π, Π} + p^{γ, γ} + ⌊\frac{p^{Π, γ}}{2}⌋) \\ + \frac{1}{2} (p_{even}^{0} + min \{p_{odd}^{Π}, p_{even}^{Π}\} \\ + min \{p_{odd}^{γ}, p_{even}^{γ}\} + δ)] \end{align}

(9)

Here, δ = 1 when p^Π,γis odd and either $p_{odd}^{Π} > p_{even}^{Π}$ , $p_{odd}^{γ} > p_{even}^{γ}$ or $p_{odd}^{Π} < p_{even}^{Π}, p_{odd}^{γ} < p_{even}^{γ}$ ; otherwise, δ = 0.

Proof

We aim to construct an optimal completion (π^∗,Γ^∗) having

\begin{align} c (π^{*}, Γ^{*}) & = c + p^{Π, Π} + p^{γ, γ} + ⌊\frac{p^{Π, γ}}{2}⌋ \end{align}

(10)

\begin{align} p_{even} (π^{*}, Γ^{*}) & = p_{even}^{0} + min \{p_{odd}^{Π}, p_{even}^{Π}\} \\ + min \{p_{odd}^{γ}, p_{even}^{γ}\} + δ \end{align}

(11)

First, we count the cycles of B(π^∗,Γ^∗). By Lemma 5, every {Π,Π} - path or {γ,γ} - path of B (π,Γ) must be closed into a cycle by adding a single new adjacency (Step 1 of Algorithm 1). We now claim that there exists an optimal completion containing $⌊\frac{p^{Π, γ}}{2}⌋$ 2 - bracelets. Note that we may always replace 3-chains [P₁ : P₂ : P₃] and [P₄ : P₅ : P₆] (where P₁ and P₄ are π - paths) with [P₁ : P₄], (P₂ : P₅), and [P₃ : P₆], without increasing the DCJ distance of the associated completion because we have obtained a cycle from two paths. This argument implies Step 2 of Algorithm 1 and produces the value of c(π^∗,Γ^∗) stated above.

As for the even paths of B(π^∗,Γ^∗), let us operate under the assumption that p^Π,γ is odd. Then after forming a maximal collection of 2-bracelets, we will be left with one additional {π,Γ} - path P. We claim that (π^∗,Γ^∗) will be optimal if we link as many π - paths (Γ - paths) of opposite parity as possible. On the one hand, Proposition 7 states that we cannot have 2 - chains [P₁ : P₂] and [P₃ : P₄], where P₁ and P₂ are even π - paths and P₃ and P₄ are odd π - paths. On the other hand, say that we have a 2 - chain [P₁ : P₂] and a 3 - chain [P₃ : P : P₄], where without loss of generality we assume that P₁ and P₂ are odd π - paths, P₃ is an even π - path, and P₄ is a Γ - path. Replacing these chains with the chains [P₁ : P₃] and [P₂ : P : P₄] does not change the number of paths of even length in B(π^∗,Γ^∗), implying Step 3 of Algorithm 1

As a result, all remaining π - paths must have the same parity, as must all the Γ - paths; thus, we may choose any π - path and Γ - path to link to P (Step 4 of Algorithm 1) and form a 3-chain. The length of this 3-chain may be even (δ = 1) or odd (δ = 0) depending on whether the length of its π - path and Γ - path have equal parity or not. All remaining paths must therefore be 2-chains linking pairs of π - paths or pairs of Γ - paths (Step 5 of Algorithm 1).

If instead p^Π,γis even, then δ = 0, and the argument for constructing an optimal completion proceeds similarly, except that no {π,Γ} - paths will remain after forming a maximal collection of 2-bracelets, eliminating the need for Step 4. □

Algorithm 1.

Given genomes (π,Γ), the following algorithm constructs an optimal completion (π^∗,Γ^∗) in O(N) time.

0 Remove all circular singletons from π and Γ.

1 Close every {Π,Π} - path ({γ,γ}-path) into a cycle by adding a single new adjacency to π ^∗ (Γ ^∗).

2 Form a maximum set of 2-bracelets.

3 Form a maximum set of even 2 - chains by linking pairs of π-paths (Γ - paths) having opposite parity.

4 If p ^Π,γ is odd, then link the remaining {π,Γ} - path with any remaining π - path and Γ - path to form a 3 - chain.

5 Arbitrarily link pairs of remaining π - paths, all of which have the same parity, to form 2-chains. Do the same for remaining Γ-paths.

The solution space of DCJ-Indel sorting

The problem of DCJ sorting is well understood, its solution space having been described in [17]. Thus, by Theorem 3, to identify the solution space of DCJ-indel sorting (an open problem), we simply need to enumerate the construction of indels of an optimal completion. We mentioned this enumeration in [12], but here we will explore the details of the calculation.

Handling circular singletons

By Proposition 1, we may consider the circular singletons of π and Γ independently of other chromosomes; for that matter, because insertions and deletions are defined symmetrically, we may assume that π contains k chromosomes and that Γ is the empty genome. Then by Corollary 2 and the trivial fact that any DCJ applied to π changes the total number of chromosomes of π by at most 1 (see [8]), we may obtain Γ from π in k steps if and only if we perform j successive DCJs (0 ≤ j < k), each of which fuses two circular chromosomes into one, followed by applying k - j chromosome deletions.

Assuming that k is relatively small, the enumeration of all such transformations of π into Γ poses a tedious but straightforward task, as a fusion of two circular chromosomes corresponds to a DCJ using two adjacencies from different chromosomes.

Genomes lacking circular singletons

Having handled circular singletons, we may assume that sing(π,Γ) = 0. Fortunately, the lemmas presented before Theorem 8 have greatly reduced the collection of possible optimal completions, which we now continue to pare down.

Proposition 10

Every π - path (Γ - path) embedding into a 3-chain of an optimal completion must have the same parity.

Proof

Say for the sake of contradiction that we have an optimal completion (π^′,Γ^′) such that B(π^′,Γ^′) contains 3-chains [P₁ : P₂ : P₃] and [P₄ : P₅ : P₆], where P₁ and P₄ are π - paths of opposite parity. Consider the completion (π^″,Γ^″), which is defined by rejoining adjacencies of (π^′,Γ^′) to form [P₁ : P₄], (P₂ : P₅), and [P₃ : P₆] in B(π^″,Γ^″). The 2-chain [P₁ : P₄] must have even length, and (P₂ : P₅) is a cycle; thus, d_DCJ (π^″,Γ^″) < d_DCJ (π^′,Γ^′), and so (π^′,Γ^′) cannot be optimal. □

Proposition 11

If p ^Π,γ is even, then the breakpoint graph of an optimal completion must contain a maximum set of even-length 2 - chains.

Proof

We proceed by contradiction. Say that (π^′,Γ^′) is an optimal completion for which an odd π - path P₁ and an even π - path P₂ are contained in different components of B(π^′,Γ^′), neither of which is an even 2-chain. By Propositions 7 and 10, we may assume that P₁ and P₂ embed into an odd-length 2-chain [P₁ : P₅] and a 3-chain [P₂ : P₃ : P₄]. Because p^Π,γ is even by Proposition 4, we must have at least one additional 3-chain [P₆ : P₇ : P₈], where (again by Proposition 10) P₆ is an even-length π -path, and the Γ - paths P₄ and P₈ have the same parity. With these assumptions in hand, we may rejoin adjacencies to form the four components [P₁ : P₂] (even), [P₅ : P₆] (even), (P₃ : P₇), and [P₄ : P₈] (odd), producing a cycle and two even 2-chains from our original three paths. Hence, by (4), (π^′,Γ^′) cannot be optimal. □

We are now ready to fully describe the collection of optimal completions when p^Π,γ is even. To construct an optimal completion, after closing each {Π,Π} - path and {γ,γ} - path, which can be done uniquely, we must form a maximum collection of even 2-chains by Proposition 11. Recall that our aim is to maximize the statistic $c (π^{*}, Γ^{*}) + \frac{p_{even} (π^{*}, Γ^{*})}{2}$ , and consider the following two subcases.

Case 1

p^Π,γis even, $p_{odd}^{Π} \leq p_{even}^{Π}$ , and $p_{odd}^{γ} \geq p_{even}^{γ}$ . First, a maximal collection of even-length 2 - chains will total $p_{odd}^{Π} + p_{even}^{γ}$ components, which requires simply choosing $p_{odd}^{Π}$ even-length π - paths, then matching them to odd-length π - paths. This can be achieved in A₁ ways, where

A_{1} = (\binom{p_{even}^{Π}}{p_{odd}^{Π}}) \cdot (p_{odd}^{Π})! = P (p_{even}^{Π}, p_{odd}^{Π})

(12)

Next, we follow the same method for forming even-length 2-chains by linking Γ - paths of opposite parity, yielding B₁ total matchings:

B_{1} = P (p_{odd}^{γ}, p_{even}^{γ})

(13)

Here, we use P(n,k) to denote the partial permutation statistic: $P (n, k) = \frac{n!}{(n - k)!}$ . We will be left with $p_{even}^{Π} - p_{odd}^{Π}$ even π - paths and $p_{odd}^{γ} - p_{even}^{γ}$ odd Γ - paths. It is impossible to create any more even-length paths in B(π^∗,Γ^∗), and so we must form a maximum collection of $\frac{p^{Π, γ}}{2}$ 2 - bracelets from the {π,Γ} - paths:

C_{1} = (p_{even}^{Π, γ} - 1)!! = (p_{even}^{Π, γ} - 1) (p_{even}^{Π, γ} - 3) \dots (5) (3) (1)

(14)

Note the definition of double factorial. Finally, we link arbitrary remaining π - paths to each other and arbitrary remaining Γ - paths to each other:

D_{1} = (p_{even}^{Π} - p_{odd}^{Π} - 1)!! \cdot (p_{odd}^{γ} - p_{even}^{γ} - 1)!!

(15)

By the independence of these four procedures, the total number of optimal completions is simply given by the product A₁ · B₁ · C₁ · D₁.

Case 2

p^Π,γis even, $p_{odd}^{Π} > p_{even}^{Π}$ , and $p_{odd}^{γ} > p_{even}^{γ}$ . In this case, we first form a maximum set of 2 - chains:

A_{2} = P (p_{odd}^{Π}, p_{even}^{Π}) \cdot P (p_{odd}^{γ}, p_{even}^{γ})

(16)

We then have $p_{odd}^{Π} - p_{even}^{Π}$ odd-length π - paths and $p_{odd}^{γ} - p_{even}^{γ}$ odd-length Γ - paths remaining. Assume without loss of generality that $p_{odd}^{Π} - p_{even}^{Π} \geq p_{odd}^{γ} - p_{even}^{γ}$ , and set $m = min {p^{Π, γ}, p_{odd}^{γ} - p_{even}^{γ}}$ . We may attain the formula in (9) if and only if we form 2j even-length 3 - chains for some integer j satisfying $0 \leq j \leq \frac{m}{2}$ , then create $\frac{p^{Π, γ}}{2} - j$ total 2 - bracelets from the remaining {π,Γ} - paths. Any remaining odd-length π - paths (Γ - paths) must then be linked to each other to form (odd-length) 2 - chains in B (π^∗,Γ^∗). The number of such possibilities can be counted by the following statistic B₂:

\begin{array}{l} B_{2} = \sum_{j = 0}^{m / 2} (\binom{p_{odd}^{Π} - p_{even}^{Π}}{2 j}) (\binom{p_{odd}^{γ} - p_{even}^{γ}}{2 j}) (\binom{p^{Π, γ}}{2 j}) {[(2 j)!]}^{2} \cdot \\ (p_{odd}^{Π} - p_{even}^{Π} - 2 j - 1)!! (p_{odd}^{γ} - p_{even}^{γ} - 2 j - 1)!! \\ (p^{Π, γ} - 2 j - 1)!! \end{array}

(17)

Again, the two statistics can be carried out independently, yielding A₂ · B₂ total optimal completions.

In both of the first two cases, reversing the inequalities will lead to analogous arguments. For the next two cases, suppose instead that p^Π,γ is odd, and select a single {π,Γ} - path P that must belong to a 3 - chain.

Case 3

p^Π,γ is odd, $p_{odd}^{Π} < p_{even}^{Π}$ , and $p_{odd}^{γ} > p_{even}^{γ}$ . Note that there are A₃ = p^Π,γ total ways to select a {π,Γ} - path P. Of the four possibilities for the parity of the paths to which P may be linked to form a 3 - chain, one may wish to verify that the only way we cannot attain the maximum in (9) is if we link P to an odd-length π - path and an even-length Γ - path. Thus, we arrive at three mutually exclusive subcases.

In our first subcase, P is linked to an even-length π - path and an odd-length Γ - path:

B_{3} = p_{even}^{Π} \cdot p_{odd}^{γ}

(18)

We now have an even number of {π,Γ} - paths remaining and have reduced our problem to a simpler one that falls under Case 1 above, from which we may obtain some number C₃ of optimal completions.

In the second subcase, we join P to an odd-length π - path and an odd-length Γ - path. First, select two such paths:

D_{3} = p_{odd}^{Π} \cdot p_{odd}^{γ}

(19)

Again we have reduced the problem to a subproblem falling under Case 1, from which we may obtain E₃ total optimal completions. In our third and final subcase, we join P to an even π - path and an even Γ - path:

F_{3} = p_{even}^{Π} \cdot p_{even}^{γ}

(20)

Say that applying Case 1 to the resulting subcase in which p^Π,γ is even yields G₃ total optimal completions. Then by independence, the total number of optimal completions over all three subcases will be given by A₃ · (B₃ · C₃ + D₃ · E₃ + F₃ · G₃).

Case 4

p^Π,γis odd, $p_{odd}^{Π} > p_{even}^{Π}$ , and $p_{odd}^{γ} > p_{even}^{γ}$ . Having selected P from the A₄ = p^Π,γ total {π,Γ} - paths, one may verify that the only way we can achieve the maximum in (9) is by linking P to an odd-length π - path and an odd-length Γ - path, of which there are $B_{4} = p_{odd}^{Π} \cdot p_{odd}^{γ}$ total choices. We have therefore reduced our problem of linking components of B(π,Γ) to a smaller problem, falling under Case 2, for which p^Π,γis even. If there are C₄ total solutions to this smaller problem, then the number of optimal completions is given by A₄ · B₄ · C₄.

As in the first two cases, reversing the inequalities defining Cases 3 and 4 will result in analogous arguments.

Conclusions

In this paper, we have demonstrated how the problem of DCJ-indel sorting, first solved in [11], can equally be handled via direct inspection of the breakpoint graph. Unfortunately, we still do not see a natural correspondence between the two approaches to DCJ-indel sorting, which appear to be at odds because their definitions of indels are equivalent but motivated differently.

Furthermore, modeling an indel as a circular chromosome resulting from a DCJ has uncovered the solution space of DCJ-indel sorting, thus resolving an open problem. We wonder if other operations could be adapted to a similar model to yield a straightforward calculation of other genomic distances involving indels. We are also curious whether this model applies to the case of finding a minimum-cost transformation of one genome into another as we vary the parameter associated with the (constant) indel cost.

Endnotes

^aThis definition allows B(π,Γ) to contain cycles of length 2.^bIn particular, this requirement bars the trivial transformation of π into Γ in which every chromosome from π is deleted, and then all the chromosomes of Γ inserted. ^cNote that v cannot be simultaneously π - and Γ - open, although it may be a telomere of both π and Γ or be π - open and a telomere of Γ (in both cases, v is an isolated vertex of B(π,Γ), i.e., a path of length 0).

References

Dobzhansky T, Sturtevant AH: Inversions in the chromosomes of drosophila pseudoobscura. Genetics. 1938, 23: 28-64.
PubMed Central CAS PubMed Google Scholar
Fertin G, Labarre A, Rusu I, Tannier E, Vialette S: Combinatorics of Genome Rearrangements. Cambridge: MIT Press 2009.
Book Google Scholar
, : Problem E2569. Am Math Mon. 1975, 82: 1010-10.2307/2318261.
Article Google Scholar
Gates WH, Papadimitriou CH: Bounds for sorting by prefix reversal. Discrete Math. 1979, 27: 47-57. 10.1016/0012-365X(79)90068-2. [http://www.sciencedirect.com/science/article/pii/0012365X79900682], []
Article Google Scholar
Heydari MH, Sudborough I: On the diameter of the pancake network. J Algorithms. 1997, 25: 67-94. 10.1006/jagm.1997.0874. [http://www.sciencedirect.com/science/article/pii/S0196677497908749], []
Article Google Scholar
Chitturi B, Fahle W, Meng Z, Morales L, Shields C, Sudborough I, Voit W: An upper bound for sorting by prefix reversals. Theor Comput Sci. 2009, 410 (36): 3372-3390. 10.1016/j.tcs.2008.04.045. [http://www.sciencedirect.com/science/article/pii/S0304397508003575]. [Graphs, Games and Computation: Dedicated to Professor Burkhard Monien on the Occasion of his 65th Birthday], []. [Graphs, Games and Computation: Dedicated to Professor Burkhard Monien on the Occasion of his 65th Birthday]
Article Google Scholar
Bulteau L, Fertin G, Rusu I: Pancake flipping is hard. CoRR. preprint, abs/1111.0434.
Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005, 21 (16): 3340-3346. [http://bioinformatics.oxfordjournals.org/content/21/16/3340.abstract], []
Article CAS PubMed Google Scholar
Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. WABI 2006. LNCS (LNBI). 2006, 4175: 163-173.
Google Scholar
Yancopoulos S, Friedberg R: DCJ path formulation for genome transformations which include insertions, deletions, and duplications. J Comput Biol. 2009, 16 (10): 1311-1338.
Article CAS PubMed Google Scholar
Braga MDV, Willing E, Stoye J: Genomic distance with DCJ and indels. Proc 10th Int Conf Algorithms Bioinformatics. 2010, 90-101. [http://portal.acm.org/citation.cfm?id=1885783.1885793], []
Chapter Google Scholar
Compeau PEC: A simplified view of DCJ-Indel distance. WABI Volume 7534 of Lecture Notes in Computer Science. Edited by: Raphael BJ, Tang J. 2012, 365-377. [http://dblp.uni-trier.de/db/conf/wabi/wabi2012.html#Compeau12], Springer, []
Google Scholar
Bafna V, Pevzner PA: Genome rearrangements and sorting by reversals. SIAM J Comput. 1996, 25 (2): 272-289. 10.1137/S0097539793250627.
Article Google Scholar
Tannier E, Zheng C, Sankoff D: Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics. 2009, 10: 120-[http://www.biomedcentral.com/1471-2105/10/120], []
Article PubMed Central PubMed Google Scholar
Braga M, Machado R, Ribeiro L, Stoye J: On the weight of indels in genomic distances. BMC Bioinformatics. 2011, 12 (Suppl 9): S13-[http://www.biomedcentral.com/1471-2105/12/S9/S13], []
Article PubMed Central PubMed Google Scholar
Ma J, Ratan A, Raney BJ, Suh BB, Miller W, Haussler D: The infinite sites model of genome evolution. Proc Natl Acad Sci USA. 2008, 105 (38): 14254-14261. [http://dx.doi.org/10.1073/pnas.0805217105], []
Article PubMed Central CAS PubMed Google Scholar
Braga MD, Stoye J: The solution space of sorting by DCJ. J Comput Biol. 2010, 17 (9): 1145-1165.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The author would like to acknowledge the support of Pavel Pevzner (UC San Diego Department of Computer Science), who offered guidance during the drafting of the manuscript.

Author information

Authors and Affiliations

Department of Mathematics, UC San Diego, 9500 Gilman Drive 0112, San Diego, CA, 92093, United States
Phillip EC Compeau

Authors

Phillip EC Compeau
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Phillip EC Compeau.

Additional information

Competing interests

The author declares that he has no competing interests.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Compeau, P.E. DCJ-Indel sorting revisited. Algorithms Mol Biol 8, 6 (2013). https://doi.org/10.1186/1748-7188-8-6

Download citation

Received: 21 December 2012
Accepted: 07 February 2013
Published: 01 March 2013
DOI: https://doi.org/10.1186/1748-7188-8-6

DCJ-Indel sorting revisited

Abstract

Background

Results

Conclusions

Background

Main text

Preliminaries

DCJ-Indel sorting

Handling circular singletons

Proposition 1

Proof

Corollary 2

Toward a new model of Indels

Theorem 3

Constructing an optimal completion

Proposition 4

Proof

Lemma 5

Proof

Lemma 6

Proof

Proposition 7

Proof

Theorem 8

Proof

Algorithm 1.

The solution space of DCJ-Indel sorting

Handling circular singletons

Genomes lacking circular singletons

Proposition 10

Proof

Proposition 11

Proof

Case 1

Case 2

Case 3

Case 4

Conclusions

Endnotes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Algorithms for Molecular Biology

Contact us