DCJ-indel and DCJ-substitution distances with distinct operation costs
Algorithms for Molecular Biology volume 8, Article number: 21 (2013)
Abstract
Background
Classical approaches to compute the genomic distance are usually limited to genomes with the same content and take into consideration only rearrangements that change the organization of the genome (i.e. positions and orientation of pieces of DNA, number and type of chromosomes, etc.), such as inversions, translocations, fusions and fissions. These operations are generically represented by the double-cut and join (DCJ) operation. The distance between two genomes, in terms of number of DCJ operations, can be computed in linear time. In order to handle genomes with distinct contents, insertions and deletions of fragments of DNA, named indels, must also be allowed. More powerful than an indel is a substitution of a fragment of DNA by another fragment of DNA. Indels and substitutions are called content-modifying operations. It has been shown that both the DCJ-indel and the DCJ-substitution distances can also be computed in linear time, assuming that the same cost is assigned to any DCJ or content-modifying operation.
Results
In the present study we extend the DCJ-indel and the DCJ-substitution models, considering that the content-modifying cost is distinct from and upper bounded by the DCJ cost, and show that the distance in both models can still be computed in linear time. Although the triangular inequality can be disrupted in both models, we also show how to efficiently fix this problem a posteriori.
Background
The distance between two genomes is often computed using only the common markers, i.e. those that occur in both genomes. Such a distance allows rearrangements that change the organization of the genome, that is, the positions and orientations of markers and the number and types of chromosomes. Inversions, translocations, fusions and fissions are some of these operations [1]. All these rearrangements can be generically represented as a double-cut-and-join (DCJ) operation [2]. The DCJ distance, which takes into consideration only DCJ operations, can be computed in linear time [3].
Nevertheless, genomes with the same content are rare, and differences in gene content may reflect important evolutionary aspects. In order to handle genomes with unequal contents, one has to take into consideration content-modifying operations, which change the contents of the genomes. These operations can be an insertion or a deletion of a piece of DNA. Insertions and deletions are also called indels. Some extensions of the classical approaches lead to models that handle genomes with unequal contents, but without duplicated markers, allowing rearrangements and indels. In 2001, El-Mabrouk [4] extended the classical sorting-by-inversions approach [5] and developed a method to compare unichromosomal genomes with unequal contents, considering only inversions and indels. She provided an exact algorithm that deals with insertions and deletions asymmetrically, and a heuristic that handles the operations symmetrically. Then, in 2009, a model to sort multichromosomal genomes with unequal contents, using both DCJ and indel operations, was introduced by Yancopoulos and Friedberg [6]. Later, Braga et al. [7] presented an exact formula for the DCJ-indel distance, which can be computed in linear time and handles indels symmetrically.
Recently, in 2011, a more powerful content-modifying operation has also been considered: a substitution allows a piece of DNA to be substituted by another piece of DNA [8]. Observe that it is not suggested that a substitution occurs in a precise moment in evolution, but instead it represents a region that underwent continuous mutations (duplications, losses and gene mutations), so that a group of genes is transformed into a different group of genes (either of which may also be empty, allowing a substitution to represent an insertion or a deletion). Other studies also represent continuous mutations as a rearrangement event [9, 10]. By minimizing substitutions we are able to establish a relation between indels that could have occurred in the same position of the compared genomes, identifying genomic regions that could be subject to these continuous mutations. It has been shown that the DCJ-substitution distance can also be computed in linear time [8].
The approaches mentioned above [4, 6–8] assign the same cost to any rearrangement or content-modifying operation. However, during the evolution of many organisms, content-modifying operations are said to occur more often than rearrangements and, consequently, should be assigned a lower cost. Examples are bacteria that are obligate intracellular parasites, such as Rickettsia [11]. The genomes of such intracellular parasites are observed to have a reductive evolution, that is, a process by which genomes shrink and undergo extreme levels of gene degradation and loss. In the present work, we refine the DCJ-indel [7] and the DCJ-substitution [8] models by adopting a distinct content-modifying cost that is upper bounded by the DCJ cost. For simplicity, we assign a cost of 1 to DCJ and a positive cost of $w\le 1$ to content-modifying operations. We are then able to give exact formulas for both the DCJ-indel and the DCJ-substitution distances, for any positive $w\le 1$.
Content-modifying operations are applied to pieces of DNA of any size, and a side effect of this fact is that the triangular inequality often does not hold for distances that consider these operations [4, 6–8, 12]. In the case of the models we study here, it is possible to do an a posteriori correction, using an approach similar to the one described in [12].
This paper is an extension of [13] and is organized as follows. In the remainder of this section we give definitions and previous results used in this work. We will then present our results, including the formulas for the distances with distinct DCJ and content-modifying costs and the correction to establish the triangular inequality.
Genomes
We deal with models in which duplicated markers are not allowed. Given two genomes A and B, possibly with unequal content, let $\mathcal{G}$, $\mathcal{A}$ and $\mathcal{B}$ be three disjoint sets, such that $\mathcal{G}$ is the set of markers that occur both in A and B, $\mathcal{A}$ is the set of markers that occur only in A, and $\mathcal{B}$ is the set of markers that occur only in B. The markers in sets $\mathcal{A}$ and $\mathcal{B}$ are also called unique markers. We denote by $u(A,B)=|\mathcal{A}|+|\mathcal{B}|$ the number of unique markers in genomes A and B.
Each marker g in a genome is a DNA fragment and is represented by the symbol $g$, if it is read in direct orientation, or by the symbol $\overline{g}$, if it is read in reverse orientation. Each one of the two extremities of a linear chromosome is called a telomere, represented by the symbol ∘. Each chromosome in a genome can then be represented by a string that is circular, if the chromosome is circular, or linear and flanked by the symbols ∘, if the chromosome is linear. In general, a genome is either circular (composed of circular chromosomes) or linear (composed of linear chromosomes). As an example, consider the linear genomes $A=\{\circ\, b\,s\,u\,\overline{c}\,a\,v\,\overline{d}\,e\,\circ\}$ and $B=\{\circ\, a\,w\,b\,\overline{x}\,c\,\circ,\ \circ\, y\,d\,z\,e\,\circ\}$, represented in Figure 1. Here we have $\mathcal{G}=\{a,b,c,d,e\}$, $\mathcal{A}=\{s,u,v\}$ and $\mathcal{B}=\{w,x,y,z\}$.
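The partition into common and unique markers can be sketched in a few lines of Python. The encoding below is an assumption made only for illustration: a genome is a list of linear chromosomes, each chromosome is a list of marker names, and a leading "-" marks reverse orientation. The function names are hypothetical and not taken from the paper.

```python
def markers(genome):
    """Set of unsigned marker names occurring in a genome."""
    return {m.lstrip("-") for chrom in genome for m in chrom}

def partition_markers(genome_a, genome_b):
    """Return (common, only_in_a, only_in_b) as three disjoint sets."""
    a, b = markers(genome_a), markers(genome_b)
    return a & b, a - b, b - a

# The example genomes A and B of the text (linear chromosomes).
A = [["b", "s", "u", "-c", "a", "v", "-d", "e"]]
B = [["a", "w", "b", "-x", "c"], ["y", "d", "z", "e"]]

common, only_a, only_b = partition_markers(A, B)
u = len(only_a) + len(only_b)  # number of unique markers u(A, B)
print(sorted(common), sorted(only_a), sorted(only_b), u)
```

For the example genomes this reproduces the three sets listed above and gives seven unique markers.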
The DCJ model
In this section we will summarize the DCJ model, which allows the sorting of the common content of two genomes, also called DCJ-sorting. We will also show how the DCJ distance can be easily computed with the help of the adjacency graph.
Given two genomes A and B, we denote the two extremities of each $g\in\mathcal{G}$ by $g^{t}$ (tail) and $g^{h}$ (head). Then, a $\mathcal{G}$-adjacency or simply adjacency [7] in genome A (respectively in genome B) is a string $v=\gamma_{1}\ell\gamma_{2}\equiv\gamma_{2}\overline{\ell}\gamma_{1}$, such that each $\gamma_{i}$ can be a telomere or an extremity of a marker from $\mathcal{G}$ and $\ell$ is a substring composed of the markers that are between $\gamma_{1}$ and $\gamma_{2}$ in A (respectively in B) and contains no marker that also belongs to $\mathcal{G}$. The substring $\ell$ is the label of $v$. If $\ell$ is empty, the adjacency is said to be clean, otherwise it is said to be labeled. If a linear chromosome is composed only of unique markers, it is represented by an adjacency $\circ\ell\circ$. Similarly, a circular chromosome composed only of unique markers is represented by a (circular) adjacency $\ell$. For the linear genomes represented in Figure 1, the set of adjacencies in A is $\{\circ b^{t},\ b^{h}su\,c^{h},\ c^{t}a^{t},\ a^{h}v\,d^{h},\ d^{t}e^{t},\ e^{h}\circ\}$ and the set of adjacencies in B is $\{\circ a^{t},\ a^{h}w\,b^{t},\ b^{h}\overline{x}c^{t},\ c^{h}\circ,\ \circ y\,d^{t},\ d^{h}z\,e^{t},\ e^{h}\circ\}$.
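As a minimal sketch of the definition above, the following Python function lists the adjacencies of a linear genome as (left extremity, label, right extremity) triples. The encoding is an assumption for illustration only: "o" stands for the telomere symbol, a leading "-" marks reverse orientation, and an extremity is written as the marker name followed by "t" or "h". Circular chromosomes are not handled.

```python
TELOMERE = "o"  # illustrative stand-in for the telomere symbol

def adjacencies(genome, common):
    """G-adjacencies of a linear genome as (left, label, right) triples."""
    result = []
    for chrom in genome:
        left, label = TELOMERE, []
        for m in chrom:
            name, reverse = m.lstrip("-"), m.startswith("-")
            if name in common:
                # A reversed marker is entered through its head.
                first, second = (name + "h", name + "t") if reverse else (name + "t", name + "h")
                result.append((left, tuple(label), first))
                left, label = second, []
            else:
                label.append(m)  # unique markers accumulate in the label
        result.append((left, tuple(label), TELOMERE))
    return result

common = {"a", "b", "c", "d", "e"}
A = [["b", "s", "u", "-c", "a", "v", "-d", "e"]]
B = [["a", "w", "b", "-x", "c"], ["y", "d", "z", "e"]]
adj_a = adjacencies(A, common)
adj_b = adjacencies(B, common)
```

A chromosome made only of unique markers yields a single triple with a telomere on both sides, which matches the linear singleton representation.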
Adjacency graph
Given two genomes A and B, the adjacency graph AG(A, B) [3] is the bipartite multigraph whose vertices are the adjacencies of A and of B and that has one edge for each common extremity of a pair of vertices. Each of the connected components of AG(A, B) alternates vertices in genome A and in genome B. Each component can be either a cycle, or an AB-path (that has one endpoint in genome A and the other in B), or an AA-path (that has both endpoints in genome A), or a BB-path (that has both endpoints in B). A special case of an AA- or a BB-path is a linear singleton, that is, a linear chromosome represented by an adjacency of type $\circ\ell\circ$, where $\ell$ contains only unique markers. Paths occur when the genomes are linear. For circular genomes, the graph AG(A, B) is composed of cycles only, and may also have a special type of component composed of a single vertex, called a circular singleton, which corresponds to a circular chromosome composed only of markers that are not in $\mathcal{G}$. In Figure 2 we show the adjacency graph built over the linear genomes represented in Figure 1.
DCJ operations
A cut performed on a genome A separates two adjacent markers of A. A cut affects a single adjacency $v$ in A: it is done between two symbols of $v$, creating two open ends. In general a cut can be performed between two markers of a label, but the DCJ-indel distance can be computed considering only cuts that do not “break” labels. A double-cut and join or DCJ applied on a genome A is the operation that performs cuts in two different adjacencies in A, creating four open ends, and joins these open ends in a different way. In other words, a DCJ rearranges two adjacencies in A, transforming them into two new adjacencies. As an example consider a DCJ applied to genome A (from Figure 1), that rearranges the adjacencies $a^{h}v\,d^{h}$ and $d^{t}e^{t}$ into the new adjacencies $a^{h}v\,d^{t}$ and $d^{h}e^{t}$. Observe that this operation corresponds to the inversion of marker d in genome A. Indeed, a DCJ operation can correspond to several rearrangements, such as an inversion, a translocation, a fusion or a fission [2].
DCJ-sorting and DCJ distance
Given two genomes A and B, the components of AG(A, B) with 3 or more vertices need to be reduced, by applying DCJ operations, to components with only 2 vertices, which can be cycles or AB-paths [14]. This procedure is called DCJ-sorting of A into B. The number of AB-paths in AG(A, B) is always even and a DCJ can be of three types [7]: it can either decrease the number of cycles by one, or the number of AB-paths by two (counter-optimal); or it does not affect the number of cycles and AB-paths (neutral); or it can either increase the number of cycles by one, or the number of AB-paths by two (optimal). The DCJ distance of A and B, denoted by $d_{DCJ}(A,B)$, is the minimum number of steps required to do a DCJ-sorting of A into B, given by the following theorem.
Theorem 1 (from [3]). Given two genomes A and B, we have $d_{DCJ}(A,B)=|\mathcal{G}|-c-\frac{b}{2}$, where $\mathcal{G}$ is the set of common markers and c and b are, respectively, the number of cycles and of AB-paths in AG(A, B).
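The formula of Theorem 1 can be evaluated with one pass over the adjacencies. The sketch below is only an illustration under stated assumptions: genomes are linear, encoded as lists of chromosomes with a leading "-" for reverse orientation, and the components of the adjacency graph are found with a small union-find structure, which is a convenient implementation choice and not necessarily how [3] proceeds. All function names are hypothetical.

```python
from collections import defaultdict

TELOMERE = "o"  # illustrative stand-in for the telomere symbol

def common_markers(genome_a, genome_b):
    names = lambda g: {m.lstrip("-") for chrom in g for m in chrom}
    return names(genome_a) & names(genome_b)

def extremity_pairs(genome, common):
    """Adjacencies of a linear genome as (left, right) pairs; labels are dropped."""
    pairs = []
    for chrom in genome:
        left = TELOMERE
        for m in chrom:
            name = m.lstrip("-")
            if name not in common:
                continue  # unique markers only contribute to labels
            ends = ("h", "t") if m.startswith("-") else ("t", "h")
            pairs.append((left, (name, ends[0])))
            left = (name, ends[1])
        pairs.append((left, TELOMERE))
    return pairs

def dcj_distance(genome_a, genome_b):
    common = common_markers(genome_a, genome_b)
    vertices = [("A", p) for p in extremity_pairs(genome_a, common)]
    vertices += [("B", p) for p in extremity_pairs(genome_b, common)]
    parent = list(range(len(vertices)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    seen = {}  # extremity -> first vertex holding it
    for i, (_, pair) in enumerate(vertices):
        for ext in pair:
            if ext == TELOMERE:
                continue
            if ext in seen:
                parent[find(i)] = find(seen[ext])  # one edge per shared extremity
            else:
                seen[ext] = i

    # Telomere counts per component: [in genome A, in genome B].
    telomeres = defaultdict(lambda: [0, 0])
    for i, (side, pair) in enumerate(vertices):
        telomeres[find(i)][0 if side == "A" else 1] += pair.count(TELOMERE)
    cycles = sum(1 for t in telomeres.values() if t == [0, 0])
    ab_paths = sum(1 for t in telomeres.values() if t == [1, 1])
    return len(common) - cycles - ab_paths // 2

A = [["b", "s", "u", "-c", "a", "v", "-d", "e"]]
B = [["a", "w", "b", "-x", "c"], ["y", "d", "z", "e"]]
print(dcj_distance(A, B))
```

A component without telomeres is a cycle, and a component with exactly one telomere in each genome is an AB-path; AA-paths, BB-paths and singletons do not enter the formula.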
Internal DCJ operations and recombinations
Observe that a DCJ operation ρ acts on two different adjacencies, that can be in the same or in two distinct connected components of the graph. The components on which the cuts are applied are called sources and the components obtained after the joinings are called resultants of ρ. With respect to the adjacency graph, ρ can be of two types: internal, when ρ is applied to two adjacencies belonging to a single component; and recombination, when ρ is applied to adjacencies belonging to two distinct components.
Any recombination applied to a vertex of an AA-path and a vertex of a BB-path is optimal [14]. A recombination applied to vertices of two distinct AB-paths can be either neutral, when the resultants are also AB-paths, or counter-optimal, when the resultants are an AA-path and a BB-path. All other types of path recombinations are neutral and all recombinations involving at least one cycle are counter-optimal.
It is possible to do a separate DCJ-sorting in any component P of AG(A, B) [14] by applying DCJs internal to P. We denote by $d_{DCJ}(P)$ the number of optimal DCJ operations used for DCJ-sorting P separately ($d_{DCJ}(P)$ depends only on the number of vertices or, equivalently, the number of edges of P [14]). Thus, the DCJ distance can also be rewritten in terms of the sum of the distance per component:
Lemma 1 (derived from [14]). Given two genomes A and B, we have $d_{DCJ}(A,B)=\sum_{P\in AG(A,B)}d_{DCJ}(P)$.
Only optimal DCJs, counted in the equivalent formulas given by Theorem 1 and Lemma 1, are necessary to do a DCJ-sorting. Given a DCJ $\rho$, the DCJ variation of $\rho$, denoted by $\Delta_{DCJ}(\rho)$, is defined to be respectively 0, 1 and 2 depending on whether $\rho$ is optimal, neutral or counter-optimal.
Modifying the content of a genome
In the previous section, the unique markers appeared as labels of adjacencies, but the DCJ operations are only able to change the organization of the genomes. Here we introduce the operations that are applied to the labels and change the content of the genomes.
Indel operations
The most classical content-modifying operations are insertions and deletions of blocks of contiguous markers [4, 6]. We refer to insertions and deletions as indel operations. In the model we consider, an indel only affects the label of one single adjacency, by deleting or inserting contiguous markers in this label, with the restriction that an insertion cannot produce duplicated markers [7]. Thus, while sorting A into B, the indels are the steps in which the markers in $\mathcal{A}$ are deleted and the markers in $\mathcal{B}$ are inserted. At most one chromosome can be entirely deleted or inserted at once. We illustrate an indel with the following example: the deletion of markers su from adjacency $b^{h}su\,c^{h}$ of genome A (Figure 2), which results in the clean adjacency $b^{h}c^{h}$. The opposite operation would be an insertion.
Substitutions
Substitutions are more powerful content-modifying operations, which allow blocks of contiguous markers to be substituted by other blocks of contiguous markers [8]. In other words, a deletion and a subsequent insertion that occur at the same position of the genome can be modeled as a substitution, counting together for one single sorting step.
A substitution only affects the label of one single adjacency, by substituting contiguous markers in this label, with the restriction that it cannot produce duplicated markers [8]. An example is the substitution of markers su in adjacency $b^{h}su\,c^{h}$ by $\overline{x}$, which results in adjacency $b^{h}\overline{x}c^{h}$. At most one chromosome can be entirely substituted at once (but we do not allow the substitution of a linear by a circular chromosome nor vice versa). As previously mentioned, insertions and deletions are special cases of substitutions. If a block of markers is substituted by the empty string, we have a deletion. Analogously, if the empty string is substituted by a block of markers, we have an insertion.
Runs, indel- and substitution-potentials
In this section we introduce some definitions and concepts that will help us to integrate the DCJ model with content-modifying operations. These concepts are used in our results, where we show how to use DCJ operations to minimize the number of content-modifying operations to be performed.
First, let us recall the concept of run, introduced in [7]. Given two genomes A and B and a component P of AG(A, B), a run is a maximal subpath of P, in which the first and the last vertices are labeled and all labeled vertices belong to the same genome (or partition). An example is given in Figure 3. A run in genome A is also called an $\mathcal{A}$-run, and a run in genome B is called a $\mathcal{B}$-run. We denote by $\Lambda(P)$ the number of runs in a component P. While a path can have any number of runs, a cycle has either 0, 1, or an even number of runs.
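Counting runs only requires the order of the labeled vertices along a component. The sketch below assumes a component is given as a list of (genome, is_labeled) pairs in traversal order, which is an illustrative encoding and not the paper's data structure; for a cycle, the first and last blocks are merged when they belong to the same genome.

```python
from itertools import groupby

def count_runs(vertices, is_cycle=False):
    """Number of runs in a component given as (genome, is_labeled) pairs."""
    labeled = [genome for genome, has_label in vertices if has_label]
    # Consecutive labeled vertices of the same genome form one run.
    blocks = [genome for genome, _ in groupby(labeled)]
    if is_cycle and len(blocks) > 1 and blocks[0] == blocks[-1]:
        blocks.pop()  # the run wraps around the cycle
    return len(blocks)

# A path whose labeled vertices lie, in order, in B, A, B, B.
path = [("A", False), ("B", True), ("A", True), ("B", True), ("A", False), ("B", True)]
runs_in_path = count_runs(path)
runs_in_cycle = count_runs([("A", True), ("B", True), ("A", True), ("B", False)], is_cycle=True)
```

The wrap-around merge is what restricts a cycle to 0, 1, or an even number of runs.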
A set of labels of one genome can be accumulated with DCJs. For example, take the adjacencies $d^{h}z\,e^{t}$ and $d^{t}\overline{y}\circ$ from genome B (Figure 3). A DCJ applied to these two adjacencies could result in $d^{t}e^{t}$ and $d^{h}z\overline{y}\circ$, in which the label $z\overline{y}$ resulted from the accumulation of the labels of the two original adjacencies. In particular, when we apply optimal DCJs internal to a single component of the adjacency graph, we can accumulate an entire run into a single adjacency [7].
Runs can be merged by DCJ operations. Consequently, during the optimal DCJ-sorting of a component P, we can reduce its number of runs. The indel-potential of P, denoted by $\lambda(P)$, is defined in [7] as the minimum number of runs that we can obtain by DCJ-sorting P with optimal DCJ operations. An example is given in Figure 4.
The indel-potential of a component depends only on its number of runs:
Proposition 1 (from [7]). Given two genomes A and B and a component P of AG(A, B), the indel-potential of P is given by $\lambda(P)=\left\lceil\frac{\Lambda(P)+1}{2}\right\rceil$, if $\Lambda(P)\ge 1$. Otherwise, if $\Lambda(P)=0$, then $\lambda(P)=0$.
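Proposition 1 translates directly into integer arithmetic. The function name below is hypothetical; the ceiling is computed with floor division to avoid floating-point rounding.

```python
def indel_potential(runs):
    """Indel-potential of a component with the given number of runs."""
    if runs == 0:
        return 0
    return -(-(runs + 1) // 2)  # ceil((runs + 1) / 2)

potentials = [indel_potential(r) for r in range(7)]
print(potentials)  # [0, 1, 2, 2, 3, 3, 4]
```

The printed values show that the potential grows by one for every two additional runs.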
Similarly, the substitution-potential of a component P is the minimum number of substitutions that we can obtain by DCJ-sorting P with optimal DCJ operations. The substitution-potential is denoted by $\sigma(P)$ and can be computed as follows:
Proposition 2 (from [8]). Given genomes A and B and a component P of AG(A, B), the substitution-potential of P is given by $\sigma(P)=\left\lceil\frac{\Lambda(P)+1}{4}\right\rceil$, if $\Lambda(P)\ge 1$. Otherwise, if $\Lambda(P)=0$, then $\sigma(P)=0$.
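The same integer-arithmetic sketch applies to Proposition 2, with a divisor of four. The function name is again hypothetical.

```python
def substitution_potential(runs):
    """Substitution-potential of a component with the given number of runs."""
    if runs == 0:
        return 0
    return -(-(runs + 1) // 4)  # ceil((runs + 1) / 4)

sub_potentials = [substitution_potential(r) for r in range(9)]
print(sub_potentials)  # [0, 1, 1, 1, 2, 2, 2, 2, 3]
```

Here the potential grows by one for every four additional runs, so a component never needs more substitutions than indels.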
Results
In this section we show how to compute the DCJ-indel and the DCJ-substitution distances, considering that the content-modifying cost is distinct from and upper bounded by the DCJ cost. We assign the cost of 1 to each DCJ and a positive cost $w\le 1$ to each content-modifying operation.
The DCJ-indel model with distinct operation costs
First we consider the case in which only indels are allowed as content-modifying operations. Given two genomes A and B, we define the DCJ-indel distance of A and B, denoted by $d_{DCJ}^{id}(A,B)$, as the minimum cost of a DCJ-indel sequence of operations that sorts A into B. If $w=1$, the DCJ-indel distance corresponds exactly to the minimum number of steps required to sort A into B. To compute the distance in this case, a linear-time algorithm was given in [7]. Here we present a more general method to compute the DCJ-indel distance for any positive $w\le 1$.
An upper bound for the DCJ-indel distance
We can obtain a good upper bound for the DCJ-indel distance by showing how to compute the DCJ-indel distance per component. Given a DCJ operation $\rho$, let $\lambda_{0}$ and $\lambda_{1}$ be, respectively, the sum of the indel-potentials for the components of the adjacency graph before and after $\rho$, and let $\Delta\lambda(\rho)=\lambda_{1}-\lambda_{0}$. If $\rho$ is an optimal DCJ internal to a single component of the graph, the definition of indel-potential implies $\Delta\lambda(\rho)\ge 0$. We also have $\Delta\lambda(\rho)\ge 0$, if $\rho$ is counter-optimal, and $\Delta\lambda(\rho)\ge -1$, if $\rho$ is neutral [7]. Recall that $\Delta_{DCJ}(\rho)$ is, respectively, 0, 1 and 2, depending on whether the DCJ $\rho$ is optimal, neutral or counter-optimal. We define $\Delta_{DCJ\lambda}(\rho)=\Delta_{DCJ}(\rho)+w\,\Delta\lambda(\rho)$.
We know that each component P of AG(A, B) can be DCJ-sorted separately, and the labels can then be easily sorted with indel operations. Let $d_{DCJ}^{id}(P)$ be the DCJ-indel distance of P, that is, the minimum cost of a DCJ-indel sequence of operations sorting P separately. This can be computed according to the following proposition.
Proposition 3. For each $P\in AG(A,B)$, $d_{DCJ}^{id}(P)=d_{DCJ}(P)+w\lambda(P)$.
Proof. By the definition of $\lambda$, the best we can do with optimal DCJs is $d_{DCJ}(P)+w\lambda(P)$. From [7], we have $\Delta_{DCJ\lambda}(\rho)\ge 2$ if $\rho$ is counter-optimal, thus we can only get more expensive sorting scenarios if we use such an operation. We also know that, if $\rho$ is neutral, $\Delta_{DCJ\lambda}(\rho)\ge 1-w\ge 0$ for any positive $w\le 1$, so a neutral operation cannot lower the cost either. □
This allows us to get a good upper bound for the DCJ-indel distance with distinct operation costs:
Lemma 2. Given two genomes A and B and a positive indel cost $w\le 1$, we have $d_{DCJ}^{id}(A,B)\le d_{DCJ}(A,B)+w\sum_{P\in AG(A,B)}\lambda(P)$.
Proof. If we sort the components separately we have $d_{DCJ}^{id}(A,B)\le\sum_{P\in AG(A,B)}d_{DCJ}^{id}(P)$, which, according to Lemma 1 and Proposition 3, corresponds exactly to $d_{DCJ}(A,B)+w\sum_{P\in AG(A,B)}\lambda(P)$. □
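The bound obtained by sorting components separately is simple to evaluate once the DCJ distance and the number of runs of each component are known. The sketch below uses hypothetical function names and exact rational arithmetic for $w$. The input values are an assumption: they were worked out by hand for the example genomes of Figure 1 (a DCJ distance of 4 and components with 3, 2 and 0 runs) and serve only as an illustration.

```python
from fractions import Fraction

def indel_potential(runs):
    """Indel-potential from Proposition 1."""
    return 0 if runs == 0 else -(-(runs + 1) // 2)

def dcj_indel_upper_bound(dcj_dist, runs_per_component, w):
    """DCJ distance plus w times the sum of the indel-potentials."""
    if not 0 < w <= 1:
        raise ValueError("the indel cost w must satisfy 0 < w <= 1")
    return dcj_dist + w * sum(indel_potential(r) for r in runs_per_component)

# Hand-derived, illustrative inputs (see the lead-in).
bound_half = dcj_indel_upper_bound(4, [3, 2, 0], Fraction(1, 2))
bound_unit = dcj_indel_upper_bound(4, [3, 2, 0], Fraction(1))
print(bound_half, bound_unit)
```

This is only the bound for separate sorting; the exact distance may be lower once recombinations between components are taken into account.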
Recombinations and the exact DCJ-indel distance
Until this point, we have explored the possible effects of any DCJ that is internal to a single component of the graph. Now we will analyze the effect of recombinations, which have $\Delta\lambda\ge -2$ [7]. We saw previously that any recombination involving cycles is counter-optimal. Since any counter-optimal recombination has $\Delta_{DCJ\lambda}\ge 2-2w\ge 0$, only path recombinations can have $\Delta_{DCJ\lambda}<0$.
Although the space of recombinations is not small, some observations allow us to explore it efficiently. Proposition 1 shows that the indel-potential increases by one when the number of runs increases by two. Furthermore, when we decrease the number of runs of a path by one, the indel-potential decreases only if the initial number of runs is one or a multiple of two. However, the exact number of runs does not really matter. In the path recombination analysis, we only have to consider the following properties for each path:

whether it is an AA-, a BB-, or an AB-path;

whether it has zero, or an odd or an even number of runs; and

whether its first run is in A or in B (by convention, an AB-path is always read from A to B).
An empty sequence (with no run) is represented by $\varepsilon$. For an integer $i\ge 0$, let $\mathcal{A}$ (respectively $\mathcal{B}$) be a sequence with an odd number $2i+1$ of runs, starting and ending with an $\mathcal{A}$-run (respectively $\mathcal{B}$-run). Similarly, let $\mathcal{A}\mathcal{B}$ (respectively $\mathcal{B}\mathcal{A}$) be a sequence with an even number $2i+2$ of runs, starting with an $\mathcal{A}$-run (respectively $\mathcal{B}$-run) and ending with a $\mathcal{B}$-run (respectively $\mathcal{A}$-run). Then each one of the notations $AA_{\varepsilon}$, $AA_{\mathcal{A}}$, $AA_{\mathcal{B}}$, $AA_{\mathcal{A}\mathcal{B}}\equiv AA_{\mathcal{B}\mathcal{A}}$, $BB_{\varepsilon}$, $BB_{\mathcal{A}}$, $BB_{\mathcal{B}}$, $BB_{\mathcal{A}\mathcal{B}}\equiv BB_{\mathcal{B}\mathcal{A}}$, $AB_{\varepsilon}$, $AB_{\mathcal{A}}$, $AB_{\mathcal{B}}$, $AB_{\mathcal{A}\mathcal{B}}$ and $AB_{\mathcal{B}\mathcal{A}}$ represents a particular type of path (AA, BB or AB) with a particular structure of runs ($\varepsilon$, $\mathcal{A}$, $\mathcal{B}$, $\mathcal{A}\mathcal{B}$ or $\mathcal{B}\mathcal{A}$). An example of this notation is given in Figure 5, which represents a neutral recombination possibly with $\Delta_{DCJ\lambda}<0$.
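Because runs alternate between the two genomes, the notation above is determined by the kind of path, the genome of the first run and the parity of the number of runs. The following sketch, with hypothetical names and a plain-text rendering such as "BB_AB" for $BB_{\mathcal{A}\mathcal{B}}$, derives the type of a path from these three properties.

```python
def run_structure(first_run, num_runs):
    """Run structure: 'ε', 'A', 'B', 'AB' or 'BA'."""
    if num_runs == 0:
        return "ε"
    other = "B" if first_run == "A" else "A"
    # Odd: starts and ends in the same genome; even: ends in the other one.
    return first_run if num_runs % 2 == 1 else first_run + other

def path_type(kind, first_run, num_runs):
    """Type of a path, e.g. 'AA_AB'; kind is 'AA', 'BB' or 'AB'."""
    structure = run_structure(first_run, num_runs)
    # AA- and BB-paths can be read in either direction, so BA is the same as AB.
    if kind in ("AA", "BB") and structure == "BA":
        structure = "AB"
    return f"{kind}_{structure}"

print(path_type("BB", "B", 2), path_type("AB", "B", 3), path_type("AB", "B", 4))
```

For AB-paths the direction is fixed by the convention of reading from A to B, so the structures AB and BA stay distinct.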
Each type of recombination can lead to different resultants, depending on where the cuts are applied. However, it is always possible to choose the “best” resultants in each case: we take the recombination with the smallest $\Delta_{DCJ\lambda}$, whose resultants can be better reused in further recombinations. The main observations to guide this task are: only recombinations of paths whose runs are $\mathcal{A}\mathcal{B}$ or $\mathcal{B}\mathcal{A}$ have $\Delta\lambda=-2$, and only recombinations of type AA+BB are optimal and have $\Delta_{DCJ}=0$. In Table 1, we list all path recombinations that can have $\Delta_{DCJ\lambda}<0$, together with neutral recombinations that have $\Delta_{DCJ\lambda}=1-w\ge 0$, but produce an $AA_{\mathcal{A}\mathcal{B}}$ or a $BB_{\mathcal{A}\mathcal{B}}$ path. We denote by ∙ an AB-path that never appears as a source of a recombination in this table (these paths are $AB_{\varepsilon}$, $AB_{\mathcal{A}}$ and $AB_{\mathcal{B}}$).
The DCJ-indel distance formula
By analyzing the whole universe of operations, we could identify groups of recombinations, as listed in Table 2. Since some resultants of recombinations can be used in other recombinations, the groups can have more than one recombination. Groups $\mathcal{P}$, $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ are composed of a single recombination, while groups $\mathcal{T}$, $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ are composed of two recombinations and groups $\mathcal{Q}$ and $\mathcal{M}$ are composed of three recombinations. Recombination is not an associative operation; thus, in column ‘DCJ seq.’ of Table 2, we indicate how the sequence of DCJs must be applied in each group (the symbol ≺ separates preceding and succeeding recombinations).
While in groups $\mathcal{Q}$ and $\mathcal{T}$ the preceding recombinations have lower $\Delta_{DCJ\lambda}$, in groups $\mathcal{M}$, $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ we need to use operations of type $n_{1}$ in order to prepare better recombinations. Another important observation concerning groups $\mathcal{Q}$ and $\mathcal{T}$ is that, although their $\Delta_{DCJ\lambda}$ values indicate that $\mathcal{Q}$ could be applied for $w>1/4$ and $\mathcal{T}$ could be applied for $w>1/3$, the last operation of these groups is of type $n_{2}$ and actually increases $\Delta_{DCJ\lambda}$ for $w\le 1/2$. For this reason, we skip groups $\mathcal{Q}$ and $\mathcal{T}$ for $w\le 1/2$ (there is no loss with this approach, since their optimal operations are then counted in $\mathcal{S}_{1}$).
The deductions shown in Table 2 can be computed with an approach that greedily maximizes the number of occurrences in $\mathcal{P}$, $\mathcal{Q}$, $\mathcal{T}$, $\mathcal{S}_{1}$, $\mathcal{S}_{2}$, $\mathcal{M}$, $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$, in this order. The two groups in $\mathcal{Q}$ are mutually exclusive after maximizing $\mathcal{P}$. The lines in $\mathcal{T}$ are subgroups of the lines in $\mathcal{Q}$, that is, they are only computed when there are enough remaining components after maximizing $\mathcal{Q}$. Similarly, each one of the remaining groups is computed when there are enough remaining components after maximizing the upper groups. With the results presented in this section we have an exact formula to compute the DCJ-indel distance:
Theorem 2. Given two genomes A and B and a positive indel cost w≤1,
where $\mathcal{P}$, $\mathcal{Q}$, $\mathcal{T}$, $\mathcal{S}_{1}$, $\mathcal{S}_{2}$, $\mathcal{M}$, $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ are computed as described above.
As we mentioned before, the groups $\mathcal{Q}$ and $\mathcal{T}$ are skipped ($\mathcal{Q}=\mathcal{T}=0$) for $w\le 1/2$. Furthermore, we also have $\mathcal{S}_{2}=\mathcal{M}=\mathcal{N}_{1}=0$ if $w\le 1/2$ and $\mathcal{N}_{2}=0$ if $w\le 2/3$. Although some groups have reusable resultants, those are actually never reused (if groups that are lower in the table use as sources resultants from higher groups, the sources of all referred groups would be previously consumed in groups that occupy even higher positions in the table). Due to this fact, the number of occurrences in each group depends only on $w$ and the initial number of each type of component.
Observe that, for $w=1$, our formula is identical to the one proposed in [7]. Actually, for any $2/3<w\le 1$, the two formulas are equivalent, since the same occurrences of groups of recombinations and an equivalent upper bound are taken into account.
We illustrate the result of our formula with an example. Let AG(A, B) have only the following labeled paths: two $AA_{\mathcal{A}\mathcal{B}}$, one $BB_{\mathcal{A}}$ and one $BB_{\mathcal{B}}$. In this case, there are no occurrences of $\mathcal{P}$, thus we have $\mathcal{P}=0$. If we take $w>\frac{1}{2}$, all labeled paths are consumed in one occurrence of $\mathcal{Q}$. We have $\mathcal{Q}=1$, while all other values are zero, resulting in $\Delta_{DCJ\lambda}=1-4w$. On the other hand, if $w\le\frac{1}{2}$, we automatically set $\mathcal{Q}=\mathcal{T}=\mathcal{S}_{2}=\mathcal{M}=\mathcal{N}_{1}=\mathcal{N}_{2}=0$. The labeled paths are consumed in two occurrences of $\mathcal{S}_{1}$, that is, $\mathcal{S}_{1}=2$, resulting in $\Delta_{DCJ\lambda}=-2w$. Indeed, $-2w\le 1-4w$ holds exactly when $w\le\frac{1}{2}$.
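The comparison in this example can be checked numerically. The sketch below only encodes the two deductions stated in the text (one occurrence of $\mathcal{Q}$ gives $1-4w$, two occurrences of $\mathcal{S}_{1}$ give $-2w$) together with the rule of skipping $\mathcal{Q}$ for $w\le 1/2$; it does not implement the general formula of Theorem 2, and the function names are hypothetical.

```python
from fractions import Fraction

def deduction_one_q(w):
    """Variation obtained with one occurrence of group Q in the example."""
    return 1 - 4 * w

def deduction_two_s1(w):
    """Variation obtained with two occurrences of group S1 in the example."""
    return -2 * w

def example_deduction(w):
    """Variation chosen for the example: Q is skipped when w <= 1/2."""
    if w <= Fraction(1, 2):
        return deduction_two_s1(w)
    return deduction_one_q(w)

for w in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1)):
    print(w, deduction_two_s1(w), deduction_one_q(w), example_deduction(w))
```

At $w=1/2$ both alternatives give the same value, which is why the threshold can be assigned to either side without changing the result.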
The DCJ-substitution model with distinct operation costs
Now we consider a different model in which substitutions are the content-modifying operations. Recall that substitutions include indels. Again we assign the cost of 1 to each DCJ and the cost of $w\le 1$ to each substitution. The DCJ-substitution distance of genomes A and B, denoted by $d_{DCJ}^{sb}(A,B)$, is then the minimum cost of a DCJ-substitution sequence that sorts A into B. If $w=1$, this corresponds exactly to the minimum number of steps required to sort A into B and can be computed in linear time [8]. Here we present a general method to compute the DCJ-substitution distance for any positive $w\le 1$. Similarly to the approach used with the DCJ-indel model, we will first use internal DCJs to obtain a good upper bound and then analyze recombinations to compute the exact DCJ-substitution distance.
An upper bound for the DCJ-substitution distance
We can also obtain a good upper bound for the DCJ-substitution distance by showing how to compute the DCJ-substitution distance per component. Given a DCJ operation $\rho$, let $\sigma_{0}$ and $\sigma_{1}$ be, respectively, the sum of the substitution-potentials for the components of the adjacency graph before and after $\rho$, and let $\Delta\sigma(\rho)=\sigma_{1}-\sigma_{0}$. If $\rho$ is an optimal DCJ internal to a single component of the graph, the definition of substitution-potential implies $\Delta\sigma(\rho)\ge 0$. We also have $\Delta\sigma(\rho)\ge 0$, if $\rho$ is counter-optimal, and $\Delta\sigma(\rho)\ge -1$, if $\rho$ is neutral [8]. We define $\Delta_{DCJ\sigma}(\rho)=\Delta_{DCJ}(\rho)+w\,\Delta\sigma(\rho)$.
After DCJ-sorting a component P of AG(A, B), the remaining labels can be easily sorted with substitutions. Let $d_{DCJ}^{sb}(P)$ be the DCJ-substitution distance of P, that is, the minimum cost of a DCJ-substitution sequence of operations sorting P separately. This is given by the following proposition.
Proposition 4. For each $P\in AG(A,B)$, $d_{DCJ}^{sb}(P)=d_{DCJ}(P)+w\sigma(P)$.
Proof. Analogous to the proof of Proposition 3. □
If P is a singleton in AG(A,B), $d_{\mathit{DCJ}}^{\mathit{sb}}(P)=w$ (the indel of the whole chromosome). A linear singleton cannot be substituted by a circular singleton and vice versa. However, a pair composed of a singleton in genome A and a singleton in genome B, such that both are linear or both are circular, can be sorted with one substitution (which saves one sorting step per pair). Let P_{LS} and P_{CS} be, respectively, the maximum number of disjoint pairs of linear and circular singletons in AG(A,B). Together with Proposition 4, these numbers give a good upper bound for the DCJ-substitution distance:
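Lemma 3. Given genomes A and B and a positive substitution cost w≤1,
$d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)\le d_{\mathit{DCJ}}(A,B)+w\left(\sum_{P\in AG(A,B)}\sigma(P)-P_{LS}-P_{CS}\right)$,
where P_{LS} and P_{CS} are the numbers of disjoint pairs of linear and circular singletons.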
Proof. If we sort the components separately we have $d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)\le \sum_{P\in AG(A,B)}d_{\mathit{DCJ}}^{\mathit{sb}}(P)$, which, according to Lemma 1 and Proposition 4, corresponds exactly to $d_{\mathit{DCJ}}(A,B)+w\sum_{P\in AG(A,B)}\sigma(P)$. Sorting each of the P_{LS}+P_{CS} pairs of singletons with one substitution, instead of two indels, then saves w per pair. □
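The upper bound can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the authors' implementation: each component is summarized by a hypothetical pair (number of vertices, number of runs), the closed form σ(P)=⌈(Λ+1)/4⌉ for a component with Λ≥1 runs is assumed (it agrees with the bounds on σ(P) used in the proof of Lemma 5), singletons are passed as plain counts, and all function and parameter names are illustrative.

```python
from fractions import Fraction


def dcj_steps(num_vertices):
    """DCJs needed to sort one component: floor((|P| - 1) / 2)."""
    return (num_vertices - 1) // 2


def substitution_potential(num_runs):
    """ASSUMPTION: sigma(P) = ceil((runs + 1) / 4) if runs >= 1, else 0."""
    return 0 if num_runs == 0 else (num_runs + 4) // 4


def component_distance(num_vertices, num_runs, w):
    """Proposition 4: d_DCJ(P) + w * sigma(P) for a non-singleton component."""
    return dcj_steps(num_vertices) + w * substitution_potential(num_runs)


def upper_bound(components, w, linear_a=0, linear_b=0, circular_a=0, circular_b=0):
    """Sort components separately, then pair singletons of the same type.

    components: iterable of (num_vertices, num_runs) pairs (hypothetical encoding).
    linear_*/circular_*: numbers of linear and circular singletons per genome.
    """
    total = sum(component_distance(v, r, w) for v, r in components)
    pairs_ls = min(linear_a, linear_b)      # P_LS
    pairs_cs = min(circular_a, circular_b)  # P_CS
    singletons = linear_a + linear_b + circular_a + circular_b
    # Each singleton costs w on its own; each same-type pair saves w.
    return total + w * (singletons - pairs_ls - pairs_cs)


if __name__ == "__main__":
    w = Fraction(1, 2)
    print(upper_bound([(4, 4), (3, 1)], w, linear_a=2, linear_b=1, circular_b=1))
```

Exact rational arithmetic (`Fraction`) is used for w so that comparisons between bounds are not affected by floating-point rounding.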
Recombinations and the exact DCJ-substitution distance
Now we also need to analyze the effect of path recombinations, which have Δσ(ρ)≥−2[8], on the DCJ-substitution distance. Here the space of recombinations is even larger, but can still be efficiently explored. Proposition 2 shows that the substitution-potential increases by one when the number of runs increases by four. Furthermore, when we decrease the number of runs of a path by one, the substitution-potential decreases only if the initial number of runs is one or a multiple of four. Again, the exact number of runs does not really matter. We have to consider the following properties for each path:

whether it is an AA-, a BB- or an AB-path;

whether it has zero, or a number of runs that is a multiple of four, or a multiple of four plus 1, or a multiple of four plus 2, or a multiple of four plus 3; and

whether its first run is in A or in B (by convention, an AB-path is always read from A to B).
Recall that an empty sequence (with no run) is represented by ε. For labeled paths we adopt a different meaning for $\mathcal{A}$, $\mathcal{B}$, $\mathcal{A}\mathcal{B}$, $\mathcal{B}\mathcal{A}$: for an integer i≥0, let $\mathcal{A}$ (respectively $\mathcal{B}$) be a sequence with an odd number 4i+1 of runs, starting and ending with an $\mathcal{A}$-run (respectively $\mathcal{B}$-run), and let $\mathcal{A}\mathcal{B}$ (respectively $\mathcal{B}\mathcal{A}$) be a sequence with an even number 4i+2 of runs, starting with an $\mathcal{A}$-run (respectively $\mathcal{B}$-run) and ending with a $\mathcal{B}$-run (respectively $\mathcal{A}$-run). Here we still have some additional cases: let $\mathcal{A}\mathcal{B}\mathcal{A}$ (respectively $\mathcal{B}\mathcal{A}\mathcal{B}$) be a sequence with an odd number 4i+3 of runs, starting and ending with an $\mathcal{A}$-run (respectively $\mathcal{B}$-run), and let $\mathcal{A}\mathcal{B}\mathcal{A}\mathcal{B}$ (respectively $\mathcal{B}\mathcal{A}\mathcal{B}\mathcal{A}$) be a sequence with an even number 4i+4 of runs, starting with an $\mathcal{A}$-run (respectively $\mathcal{B}$-run) and ending with a $\mathcal{B}$-run (respectively $\mathcal{A}$-run). Then, for each type of path (AA, BB or AB) with a particular structure of runs ($\mathcal{A}$, $\mathcal{B}$, $\mathcal{A}\mathcal{B}$, $\mathcal{B}\mathcal{A}$, $\mathcal{A}\mathcal{B}\mathcal{A}$, $\mathcal{B}\mathcal{A}\mathcal{B}$, $\mathcal{A}\mathcal{B}\mathcal{A}\mathcal{B}$, or $\mathcal{B}\mathcal{A}\mathcal{B}\mathcal{A}$), we have a particular notation. An example of this notation is given in Figure 6, which represents a neutral recombination with Δ_{DCJσ}=1−w.
Again, although each type of recombination can lead to different resultants, it is always possible to choose the “best” resultants in each case: we take the recombination with the smallest Δ_{DCJσ}, whose resultants can be better reused. In Table 3, we list all recombinations that can have Δ_{DCJσ}<0, together with those that have Δ_{DCJσ}=1−w≥0, but produce an AA- or a BB-path with runs $\mathcal{A}\mathcal{B}\mathcal{A}\mathcal{B}$ or $\mathcal{A}$ or $\mathcal{B}$. We denote by ∙ an AB-path that never appears as a source in this table (these are all AB-paths, with the exception of $AB_{\mathcal{A}\mathcal{B}\mathcal{A}\mathcal{B}}$ and $AB_{\mathcal{B}\mathcal{A}\mathcal{B}\mathcal{A}}$).
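The classification of a path by its run structure depends only on the number of runs modulo four and on the genome of the first run. A minimal Python sketch of this mapping follows; the function names and the plain-text labels (such as "ABA" for $\mathcal{A}\mathcal{B}\mathcal{A}$, or "AA_ABA" for an AA-path with that structure) are illustrative choices, not notation from the paper.

```python
def run_structure(num_runs, first_run):
    """Return the run-structure class of a path.

    num_runs:  number of runs in the path (>= 0).
    first_run: 'A' or 'B', the genome of the first run (ignored if num_runs == 0).
    4i+1 runs -> 'A'/'B', 4i+2 -> 'AB'/'BA', 4i+3 -> 'ABA'/'BAB',
    4i+4 -> 'ABAB'/'BABA'; an empty sequence is 'ε'.
    """
    if num_runs < 0:
        raise ValueError("num_runs must be non-negative")
    if num_runs == 0:
        return "ε"
    if first_run not in ("A", "B"):
        raise ValueError("first_run must be 'A' or 'B'")
    other = "B" if first_run == "A" else "A"
    length = (num_runs - 1) % 4 + 1  # representative length: 1, 2, 3 or 4
    return "".join(first_run if i % 2 == 0 else other for i in range(length))


def path_label(path_type, num_runs, first_run):
    """Combine the path type ('AA', 'BB' or 'AB') with its run structure."""
    if path_type not in ("AA", "BB", "AB"):
        raise ValueError("path_type must be 'AA', 'BB' or 'AB'")
    return f"{path_type}_{run_structure(num_runs, first_run)}"


if __name__ == "__main__":
    print(path_label("AA", 7, "A"))
```

Only the three properties listed above are needed as input, which is why the exact number of runs never has to be stored once the class is known.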
The DCJ-substitution distance formula
In Table 4 we list groups of recombinations, which allow the computation of the exact DCJ-substitution distance, with an approach that greedily maximizes the number of occurrences in $\mathcal{U}$, $\mathcal{V}$, $\mathcal{W}$, ${\mathcal{X}}_{1}$, ${\mathcal{X}}_{2}$, $\mathcal{Y}$, ${\mathcal{Z}}_{1}$ and ${\mathcal{Z}}_{2}$ in this order. The two groups in $\mathcal{V}$ are mutually exclusive after maximizing $\mathcal{U}$, while those in $\mathcal{W}$ are subgroups of $\mathcal{V}$ (they are only computed when there are enough remaining components after maximizing $\mathcal{V}$). Similarly, each one of the remaining groups is computed when there are enough remaining components after maximizing the upper groups. As previously observed, the recombination is not associative, thus the column ‘DCJ seq.’ determines in which order the sequence of DCJs must be applied in each group. Here we also need to skip some recombinations depending on the value of w. In particular, although Δ_{DCJσ} indicates that $\mathcal{W}$ could be applied for w>1/3 and $\mathcal{V}$ for w>1/4, the last operation of these groups is of type n_{2} and increases Δ_{DCJσ} for w≤1/2. Groups $\mathcal{V}$ and $\mathcal{W}$ are therefore skipped for w≤1/2, and their optimal operations are then counted in ${\mathcal{X}}_{1}$.
The recombinations allow us to obtain an exact formula for the DCJ-substitution distance:
Theorem 3
Given genomes A and B and a positive substitution cost w≤1,
where $\mathcal{U}$, $\mathcal{V}$, $\mathcal{W}$, ${\mathcal{X}}_{1}$, ${\mathcal{X}}_{2}$, $\mathcal{Y}$, ${\mathcal{Z}}_{1}$ and ${\mathcal{Z}}_{2}$ are computed as described above and P_{LS} and P_{CS} are the numbers of disjoint pairs of linear and circular singletons.
Observe that the number of occurrences in each group depends only on w and the initial number of each type of component and, for any 2/3<w≤1, our formula is equivalent to the one proposed in[8], since the same occurrences of groups of recombinations and an equivalent upper bound are taken into account.
Complexity
Both AG(A,B) and d_{DCJ}(A,B) can be computed in linear time[3]. The number of occurrences in each recombination group depends only on w and the initial components. The runs are obtained by a single walk through each path, thus the whole procedure takes linear time for both models.
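The single walk that collects the runs can be sketched as follows. This Python sketch assumes a hypothetical encoding in which a component is given as the sequence of its vertices in walk order, each entry being 'A' or 'B' for a labeled vertex of that genome and None for an unlabeled vertex; it also assumes that a run is a maximal stretch whose labeled vertices all belong to the same genome, so that unlabeled vertices do not interrupt a run. The function name and the `cyclic` flag are illustrative.

```python
def count_runs(labels, cyclic=False):
    """Count runs in one pass over the vertices of a component.

    labels: iterable of 'A', 'B' or None (unlabeled), in walk order.
    cyclic: if True, the component is a cycle, so a first and a last run
            in the same genome are counted as a single run.
    """
    runs = 0
    first = None     # genome of the first labeled vertex
    previous = None  # genome of the most recent labeled vertex
    for genome in labels:
        if genome is None:
            continue  # unlabeled vertices do not break a run
        if first is None:
            first = genome
        if genome != previous:
            runs += 1
            previous = genome
    if cyclic and runs > 1 and first == previous:
        runs -= 1  # the walk re-enters the run it started in
    return runs


if __name__ == "__main__":
    print(count_runs(["A", None, "A", "B", None, "B", "A"]))
```

Each vertex is visited once and only constant extra state is kept, which is consistent with the linear running time stated above.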
Establishing the triangular inequality
We have presented two genomic distances that combine DCJ and content-modifying operations and can be computed in linear time. However, content-modifying operations are applied to pieces of DNA of any size, and a side effect of this fact is that the triangular inequality often does not hold for distances that consider these operations[4, 6, 8, 12].
Let A, B and C be three genomes, with unequal contents, and consider, without loss of generality, that $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)\ge d_{\mathit{DCJ}}^{\mathit{id}}(A,C)$ and $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)\ge d_{\mathit{DCJ}}^{\mathit{id}}(B,C)$. The triangular inequality is then the property which guarantees that the inequality $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)\le d_{\mathit{DCJ}}^{\mathit{id}}(A,C)+d_{\mathit{DCJ}}^{\mathit{id}}(B,C)$ also holds. Unfortunately this is not the case for the DCJ-indel distance, and also not the case for the DCJ-substitution distance. Take for example the genomes A={∘a b c d e∘}, $B=\{\circ\, a\, c\, \overline{d}\, b\, e\, \circ\}$ and C={∘a e∘}[6]. While the cost of sorting A (or B) into C is w (one indel), the minimum number of DCJs (that are inversions in this case) required to sort A into B is three. We have $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)=3$, $d_{\mathit{DCJ}}^{\mathit{id}}(A,C)=w$, $d_{\mathit{DCJ}}^{\mathit{id}}(B,C)=w$ and, since 2w≤2<3, the triangular inequality is disrupted.
Denote by $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, $\mathcal{D}$, $\mathcal{E}$, $\mathcal{F}$ and $\mathcal{G}$ the disjoint sets of markers such that: $\mathcal{A}$, $\mathcal{B}$ or $\mathcal{C}$ are the sets of markers that occur respectively only in genome A, B or C, the markers in $\mathcal{D}$ are common only to genomes A and B, the markers in $\mathcal{E}$ are common only to B and C, the markers in $\mathcal{F}$ are common only to A and C, and $\mathcal{G}$ is the set of markers that are common to all three genomes A, B and C. These sets are represented in Figure 7.
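These seven sets can be obtained directly with set operations. The Python sketch below is illustrative only: genomes are reduced to their marker sets (order and orientation are ignored, which is all this partition needs), and the function name and dictionary keys are hypothetical.

```python
def partition_markers(a, b, c):
    """Split the markers of three genomes into the seven disjoint sets A..G.

    a, b, c: iterables of marker names for genomes A, B and C.
    """
    a, b, c = set(a), set(b), set(c)
    return {
        "A": a - b - c,    # only in genome A
        "B": b - a - c,    # only in genome B
        "C": c - a - b,    # only in genome C
        "D": (a & b) - c,  # common only to A and B
        "E": (b & c) - a,  # common only to B and C
        "F": (a & c) - b,  # common only to A and C
        "G": a & b & c,    # common to all three genomes
    }


if __name__ == "__main__":
    # Marker sets of the example genomes A = {a b c d e}, B = {a c -d b e}, C = {a e}.
    parts = partition_markers("abcde", "acdbe", "ae")
    print(sorted(parts["D"]), sorted(parts["G"]))
```

For the example genomes given earlier, the set of markers common only to A and B is non-empty, which is the situation in which the triangular inequality can fail.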
When $\mathcal{D}=\varnothing$, meaning that genomes A and B have no common marker that does not occur in C, the triangular inequality holds for both DCJ-indel and DCJ-substitution distances[12]. However, if $\mathcal{D}\ne \varnothing$, the triangular inequality can be disrupted for $d_{\mathit{DCJ}}^{\mathit{id}}$ and $d_{\mathit{DCJ}}^{\mathit{sb}}$, and this may be an obstacle if one intends to use these distances to compute the median of three or more genomes and in phylogenetic reconstructions.
It is possible to establish the triangular inequality in our two models a posteriori, by adapting an approach proposed in[12]: we simply add to each distance a surcharge that depends on the number of unique markers, as we will see in the following subsections.
We define the diameter as the maximum distance between any pair of genomes, usually as a function of the size of the genomes. We use this definition in the next results.
Correction for the DCJ-indel distance
For genomes A and B and a positive constant k, let $m^{\mathit{id}}(A,B)=d_{\mathit{DCJ}}^{\mathit{id}}(A,B)+k\cdot u(A,B)$, where u(A,B) is the number of unique markers between A and B[7, 12]. We then have $m^{\mathit{id}}(A,B)=d_{\mathit{DCJ}}^{\mathit{id}}(A,B)+k(|\mathcal{A}|+|\mathcal{F}|+|\mathcal{B}|+|\mathcal{E}|)$, $m^{\mathit{id}}(A,C)=d_{\mathit{DCJ}}^{\mathit{id}}(A,C)+k(|\mathcal{A}|+|\mathcal{D}|+|\mathcal{C}|+|\mathcal{E}|)$ and $m^{\mathit{id}}(B,C)=d_{\mathit{DCJ}}^{\mathit{id}}(B,C)+k(|\mathcal{B}|+|\mathcal{D}|+|\mathcal{C}|+|\mathcal{F}|)$. From this definition we can derive a simpler inequality that can be used to determine the value of the constant k:
Proposition 5 (from[12]). Given three genomes A, B and C without duplicated markers, the inequality m^{id}(A,B)≤m^{id}(A,C)+m^{id}(B,C) holds if, and only if, $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)\le d_{\mathit{DCJ}}^{\mathit{id}}(A,C)+d_{\mathit{DCJ}}^{\mathit{id}}(B,C)+2k|\mathcal{D}|$, where $\mathcal{D}$ is the set of markers common only to A and B.
The problem now is to find the minimum value of k for which the inequality of Proposition 5 holds. In order to accomplish this task, the first step is to determine the diameter of the DCJ-indel distance.
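The corrected measure is straightforward to compute once the uncorrected distance is known. The following Python sketch is an illustration under stated assumptions: genomes are reduced to marker sets, the distances 3, w and w of the example genomes A, B and C given earlier are supplied by hand rather than computed, and the value k=3/4 is chosen only for illustration; function names are hypothetical.

```python
from fractions import Fraction


def unique_markers(a, b):
    """u(A, B): number of markers that occur in exactly one of the two genomes."""
    return len(set(a) ^ set(b))


def corrected_distance(distance, a, b, k):
    """m(A, B) = d(A, B) + k * u(A, B)."""
    return distance + k * unique_markers(a, b)


if __name__ == "__main__":
    w = Fraction(1, 2)
    k = Fraction(3, 4)  # illustrative surcharge constant
    gen_a, gen_b, gen_c = "abcde", "acdbe", "ae"  # marker sets only
    m_ab = corrected_distance(3, gen_a, gen_b, k)
    m_ac = corrected_distance(w, gen_a, gen_c, k)
    m_bc = corrected_distance(w, gen_b, gen_c, k)
    print(m_ab, m_ac, m_bc, m_ab <= m_ac + m_bc)
```

In this example A and B have the same markers, so their distance receives no surcharge, while both distances to C are increased by k for each of the three markers missing from C.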
Lemma 4. Given a positive indel cost w≤1 and two genomes A and B with n common markers, then
$d_{\mathit{DCJ}}^{\mathit{id}}(A,B)\le (w+1)n+w(L_{A}+S_{A}+L_{B}+S_{B})$,
where L_{A}, S_{A} and L_{B}, S_{B} are, respectively, the number of linear chromosomes and circular singletons in genomes A and B.
Proof. Let |P| be the number of vertices in component P, which is DCJ-sorted with $\lfloor \frac{|P|-1}{2}\rfloor$ DCJs[14]. If |P| is even, P is sorted with $\frac{|P|}{2}-1$ DCJs and $\lambda(P)\le \frac{|P|}{2}+1$ indels, then $d_{\mathit{DCJ}}^{\mathit{id}}(P)\le \frac{|P|}{2}-1+w\left(\frac{|P|}{2}+1\right)=\frac{(w+1)|P|}{2}+w-1$. If |P| is odd, P is sorted with $\frac{|P|-1}{2}$ DCJs and $\lambda(P)\le \frac{|P|+1}{2}$ indels, then $d_{\mathit{DCJ}}^{\mathit{id}}(P)\le \frac{|P|-1}{2}+w\frac{|P|+1}{2}=\frac{(w+1)|P|+w-1}{2}$.
As w≤1 implies $w-1\le \frac{w-1}{2}\le 0$, for any component P we have $d_{\mathit{DCJ}}^{\mathit{id}}(P)\le \frac{(w+1)|P|+w-1}{2}$. Then, $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)\le \sum_{P\in AG(A,B)}d_{\mathit{DCJ}}^{\mathit{id}}(P)\le \sum_{P\in AG(A,B)}\frac{(w+1)|P|+w-1}{2}=\frac{w+1}{2}\sum_{P\in AG(A,B)}|P|+\sum_{P\in AG(A,B)}\frac{w-1}{2}$. Each linear chromosome corresponds to one path in AG(A,B), thus the number of components is at least (L_{A}+S_{A}+L_{B}+S_{B}) and $\sum_{P\in AG(A,B)}\frac{w-1}{2}\le \frac{(L_{A}+S_{A}+L_{B}+S_{B})(w-1)}{2}\le 0$.
Furthermore, from[12] we know that $\sum_{P\in AG(A,B)}|P|=2n+L_{A}+S_{A}+L_{B}+S_{B}$. Substituting, we obtain $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)\le (w+1)n+\frac{(w+1)+(w-1)}{2}(L_{A}+S_{A}+L_{B}+S_{B})=(w+1)n+w(L_{A}+S_{A}+L_{B}+S_{B})$. □
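The per-component step of this proof can be checked numerically. The Python sketch below is an illustration, not part of the original proof: it evaluates the worst-case cost of a component with a given number of vertices, using the DCJ count and the bounds on λ(P) stated above, and compares it with the uniform bound ((w+1)|P|+w−1)/2. Function names are hypothetical.

```python
from fractions import Fraction


def indel_component_bound(size, w):
    """Worst-case cost of one component with `size` vertices.

    Uses floor((size - 1) / 2) DCJs plus w times the largest indel-potential
    allowed in the proof: size/2 + 1 if size is even, (size + 1)/2 if odd.
    """
    dcjs = (size - 1) // 2
    max_indels = size // 2 + 1 if size % 2 == 0 else (size + 1) // 2
    return dcjs + w * max_indels


def uniform_component_bound(size, w):
    """The bound ((w + 1)|P| + w - 1) / 2 valid for every component."""
    return ((w + 1) * size + w - 1) / 2


def indel_diameter_bound(n, w, la, sa, lb, sb):
    """(w + 1) n + w (L_A + S_A + L_B + S_B)."""
    return (w + 1) * n + w * (la + sa + lb + sb)


if __name__ == "__main__":
    w = Fraction(1, 2)
    for size in range(1, 9):
        print(size, indel_component_bound(size, w), uniform_component_bound(size, w))
```

For odd sizes the two values coincide; for even sizes the uniform bound is larger by (1−w)/2, which is non-negative because w≤1.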
We are ready to generalize the result of[12], and determine the minimum possible value of k.
Theorem 4. For any positive indel cost w≤1, the function m^{id} satisfies the triangular inequality if and only if $k\ge \frac{w+1}{2}$.
Proof. Recall that, to prove the triangular inequality for m^{id}, we only need to find a k such that $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)\le d_{\mathit{DCJ}}^{\mathit{id}}(A,C)+d_{\mathit{DCJ}}^{\mathit{id}}(B,C)+2k|\mathcal{D}|$ holds (Proposition 5). We know that the inequality holds when $\mathcal{D}=\varnothing$[12]. It remains to examine the case in which $\mathcal{D}\ne \varnothing$. The worst case would be to have an empty genome C[12]. Let X_{A} and X_{B} be the number of chromosomes in A and B. Since C is empty, we know that $d_{\mathit{DCJ}}^{\mathit{id}}(A,C)=wX_{A}$ and $d_{\mathit{DCJ}}^{\mathit{id}}(B,C)=wX_{B}$. From Lemma 4, we have $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)\le (w+1)|\mathcal{D}|+w(L_{A}+S_{A}+L_{B}+S_{B})$. It is therefore sufficient that $(w+1)|\mathcal{D}|+w(L_{A}+S_{A}+L_{B}+S_{B})\le w(X_{A}+X_{B})+2k|\mathcal{D}|$.
Since L_{A}+S_{A}+L_{B}+S_{B}≤X_{A}+X_{B}, it is enough to have $(w+1)|\mathcal{D}|\le 2k|\mathcal{D}|$, which holds for any $k\ge \frac{w+1}{2}$.
For the necessity, take A and B with n common markers and let each genome be composed of one circular chromosome, meaning that we have one adjacency per common marker in each genome (or n adjacencies per genome). Then let AG(A,B) have one single cycle with 2n vertices and let each vertex be labeled, so that the number of runs in the cycle is 2n and the number of unique markers in each genome is n. Thus, we have $d_{\mathit{DCJ}}^{\mathit{id}}(A,B)=(n-1)+w(n+1)=(w+1)n+(w-1)$ and the corrected distance is m^{id}(A,B)=(w+1)n+(w−1)+2kn. Take C as an empty genome, so that $d_{\mathit{DCJ}}^{\mathit{id}}(A,C)=d_{\mathit{DCJ}}^{\mathit{id}}(B,C)=w$ and m^{id}(A,C)=m^{id}(B,C)=w+2kn. The inequality m^{id}(A,B)≤m^{id}(A,C)+m^{id}(B,C) corresponds to (w+1)n+(w−1)+2kn≤2w+4kn or, equivalently, 2kn≥(w+1)n−w−1, that is $k\ge \frac{w+1}{2}\left(1-\frac{1}{n}\right)$, which holds for all n only if $k\ge \frac{w+1}{2}$. □
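The necessity argument can be made concrete by computing, for the single-cycle family used in the proof, the smallest k that still satisfies the corrected inequality. The Python sketch below is illustrative only; it hard-codes the distances of that family as stated in the proof, and the function name is hypothetical.

```python
from fractions import Fraction


def min_k_for_cycle_family(n, w):
    """Smallest k with m(A,B) <= m(A,C) + m(B,C) for the single-cycle example.

    A and B share n markers and each has n unique markers; C is empty.
    """
    d_ab = (n - 1) + w * (n + 1)  # one cycle with 2n vertices and 2n runs
    d_ac = d_bc = w               # one indel removes the whole chromosome
    # d_ab + 2kn <= d_ac + d_bc + 4kn  <=>  k >= (d_ab - d_ac - d_bc) / (2n)
    return Fraction(d_ab - d_ac - d_bc) / (2 * n)


if __name__ == "__main__":
    w = Fraction(1, 2)
    for n in (1, 2, 10, 100, 1000):
        print(n, min_k_for_cycle_family(n, w), (w + 1) / 2)
```

The required k grows with n and gets arbitrarily close to (w+1)/2 without reaching it, so no constant smaller than (w+1)/2 works for all n.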
Correction for the DCJ-substitution distance
Similarly, in the case of the DCJ-substitution distance, for genomes A and B and a positive constant k^{′}, let $m^{\mathit{sb}}(A,B)=d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)+k^{\prime}\cdot u(A,B)$, where u(A,B) is the number of unique markers between A and B[7, 12]. We then have $m^{\mathit{sb}}(A,B)=d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)+k^{\prime}(|\mathcal{A}|+|\mathcal{F}|+|\mathcal{B}|+|\mathcal{E}|)$, $m^{\mathit{sb}}(A,C)=d_{\mathit{DCJ}}^{\mathit{sb}}(A,C)+k^{\prime}(|\mathcal{A}|+|\mathcal{D}|+|\mathcal{C}|+|\mathcal{E}|)$ and $m^{\mathit{sb}}(B,C)=d_{\mathit{DCJ}}^{\mathit{sb}}(B,C)+k^{\prime}(|\mathcal{B}|+|\mathcal{D}|+|\mathcal{C}|+|\mathcal{F}|)$. Again, from this definition we can derive a simpler inequality that can be used to determine the value of the constant k^{′}:
Proposition 6 (from[12]). Given three genomes A, B and C without duplicated markers, the inequality m^{sb}(A,B)≤m^{sb}(A,C)+m^{sb}(B,C) holds if, and only if, $d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)\le d_{\mathit{DCJ}}^{\mathit{sb}}(A,C)+d_{\mathit{DCJ}}^{\mathit{sb}}(B,C)+2k^{\prime}|\mathcal{D}|$, where $\mathcal{D}$ is the set of markers common only to A and B.
In order to find the minimum value of k^{′} for which the inequality of Proposition 6 holds, we need to determine the diameter of the DCJ-substitution distance, which is given by the following lemma.
Lemma 5. If A and B are genomes with n common markers, then
$d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)\le \frac{(w+2)}{2}n+w(L_{A}+S_{A}+L_{B}+S_{B})$,
where L_{A}, S_{A}, L_{B} and S_{B} are, respectively, the number of linear chromosomes and circular singletons in genomes A and B.
Proof. Let |P| be the number of vertices in component P, which is DCJ-sorted with $\lfloor \frac{|P|-1}{2}\rfloor$ DCJs[14]. If |P| is even, then P can be DCJ-sorted with $\frac{|P|}{2}-1$ DCJs. We have to analyze two cases: (i) if |P|=4x+4, then $\sigma(P)\le \frac{|P|}{4}+1$ and $d_{\mathit{DCJ}}^{\mathit{sb}}(P)\le \left(\frac{|P|}{2}-1\right)+w\left(\frac{|P|}{4}+1\right)=\frac{(w+2)|P|}{4}+w-1$; (ii) if |P|=4x+2, then $\sigma(P)\le \frac{|P|-2}{4}+1$ and $d_{\mathit{DCJ}}^{\mathit{sb}}(P)\le \left(\frac{|P|}{2}-1\right)+w\left(\frac{|P|-2}{4}+1\right)=\frac{(w+2)|P|}{4}+\frac{w-2}{2}$. As 0<w≤1 implies $\frac{w-2}{2}\le w-1\le 0$, in both cases we have $d_{\mathit{DCJ}}^{\mathit{sb}}(P)\le \frac{(w+2)|P|}{4}$. If |P| is odd, then P is an AA- or a BB-path and can be DCJ-sorted with $\frac{|P|-1}{2}$ DCJs. Again, we have to analyze two cases: (i) if |P|=4x+3, then $\sigma(P)\le \frac{|P|+1}{4}$ and $d_{\mathit{DCJ}}^{\mathit{sb}}(P)\le \frac{|P|-1}{2}+w\left(\frac{|P|+1}{4}\right)=\frac{(w+2)|P|}{4}+\frac{w-2}{4}$; (ii) if |P|=4x+1, then $\sigma(P)\le \frac{|P|+3}{4}$ and $d_{\mathit{DCJ}}^{\mathit{sb}}(P)\le \frac{|P|-1}{2}+w\left(\frac{|P|+3}{4}\right)=\frac{(w+2)|P|}{4}+\frac{3w-2}{4}$. In this last case we could have $d_{\mathit{DCJ}}^{\mathit{sb}}(P)>\frac{(w+2)|P|}{4}$. Observe however that the numbers of AA- and BB-paths are bounded, respectively, by L_{A} and L_{B}.
Then, $d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)\le \sum_{P\in AG(A,B)}d_{\mathit{DCJ}}^{\mathit{sb}}(P)\le \sum_{P\in AG(A,B)}\frac{(w+2)|P|}{4}+\frac{(3w-2)(L_{A}+L_{B})}{4}=\frac{w+2}{4}\sum_{P\in AG(A,B)}|P|+\frac{(3w-2)(L_{A}+L_{B})}{4}$. From[12] we know that $\sum_{P\in AG(A,B)}|P|=2n+L_{A}+S_{A}+L_{B}+S_{B}$. Therefore, $d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)\le \frac{w+2}{4}(2n+L_{A}+S_{A}+L_{B}+S_{B})+\frac{(3w-2)(L_{A}+L_{B})}{4}=\frac{(w+2)}{2}n+w(L_{A}+L_{B})+\frac{(w+2)(S_{A}+S_{B})}{4}\le \frac{(w+2)}{2}n+w(L_{A}+L_{B}+S_{A}+S_{B})$. □
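The four cases of the per-component analysis can be checked mechanically. The Python sketch below is an illustration only: it encodes the bounds on σ(P) exactly as they are stated in the proof, computes the resulting worst-case cost of a component, and returns its offset from the reference value (w+2)|P|/4. Function names are hypothetical.

```python
from fractions import Fraction


def max_substitution_potential(size):
    """Largest sigma(P) allowed in the proof for a component with `size` vertices."""
    remainder = size % 4
    if remainder == 0:
        return size // 4 + 1
    if remainder == 2:
        return (size - 2) // 4 + 1
    if remainder == 3:
        return (size + 1) // 4
    return (size + 3) // 4  # remainder == 1


def sb_component_bound(size, w):
    """floor((size - 1) / 2) DCJs plus w times the largest substitution-potential."""
    return (size - 1) // 2 + w * max_substitution_potential(size)


def offset_from_reference(size, w):
    """Difference between the component bound and (w + 2)|P| / 4."""
    return sb_component_bound(size, w) - (w + 2) * size / 4


if __name__ == "__main__":
    w = Fraction(1, 2)
    for size in range(1, 9):
        print(size, sb_component_bound(size, w), offset_from_reference(size, w))
```

Only components whose size is of the form 4x+1 can exceed the reference value, and only when w>2/3, which matches the remark in the proof about the last case.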
We are ready to generalize the result of[12], and determine the minimum possible value of k^{′}.
Theorem 5. For any positive substitution cost w≤1, the function m^{sb} satisfies the triangular inequality if and only if $k^{\prime}\ge \frac{w+2}{4}$.
Proof. Recall that, to prove the triangular inequality for m^{sb}, we only need to find a k^{′} such that $d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)\le d_{\mathit{DCJ}}^{\mathit{sb}}(A,C)+d_{\mathit{DCJ}}^{\mathit{sb}}(B,C)+2k^{\prime}|\mathcal{D}|$ holds (Proposition 6). We know that the inequality holds when $\mathcal{D}=\varnothing$[12]. It remains to examine the case in which $\mathcal{D}\ne \varnothing$. The worst case would be to have an empty genome C[12]. Let X_{A} and X_{B} be the number of chromosomes in A and B. Since C is empty, we know that $d_{\mathit{DCJ}}^{\mathit{sb}}(A,C)=wX_{A}$ and $d_{\mathit{DCJ}}^{\mathit{sb}}(B,C)=wX_{B}$. From Lemma 5, we have $d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)\le \frac{(w+2)|\mathcal{D}|}{2}+w(L_{A}+S_{A}+L_{B}+S_{B})$.
It is therefore sufficient that $\frac{(w+2)|\mathcal{D}|}{2}+w(L_{A}+S_{A}+L_{B}+S_{B})\le w(X_{A}+X_{B})+2k^{\prime}|\mathcal{D}|$. Since L_{A}+S_{A}+L_{B}+S_{B}≤X_{A}+X_{B}, it is enough to have $\frac{(w+2)|\mathcal{D}|}{2}\le 2k^{\prime}|\mathcal{D}|$, which holds for any $k^{\prime}\ge \frac{w+2}{4}$.
For the necessity, take A and B with n common markers, for n even, and let each genome be composed of one circular chromosome, meaning that we have one adjacency per common marker in each genome (or n adjacencies per genome). Then let AG(A,B) have one single cycle with 2n vertices and let each vertex be labeled, so that the number of runs in the cycle is 2n and the number of unique markers in each genome is n. Thus, we have $d_{\mathit{DCJ}}^{\mathit{sb}}(A,B)=(n-1)+w\left(\frac{n}{2}+1\right)=\frac{(w+2)n}{2}+(w-1)$ and the corrected distance is $m^{\mathit{sb}}(A,B)=\frac{(w+2)n}{2}+(w-1)+2k^{\prime}n$. Take C as an empty genome, so that $d_{\mathit{DCJ}}^{\mathit{sb}}(A,C)=d_{\mathit{DCJ}}^{\mathit{sb}}(B,C)=w$ and m^{sb}(A,C)=m^{sb}(B,C)=w+2k^{′}n.
The inequality m^{sb}(A,B)≤m^{sb}(A,C)+m^{sb}(B,C) corresponds to $\frac{(w+2)n}{2}+(w-1)+2k^{\prime}n\le 2w+4k^{\prime}n$ or, equivalently, $2k^{\prime}n\ge \left(\frac{w+2}{2}\right)n-w-1$, that is $k^{\prime}\ge \frac{w+2}{4}-\frac{w+1}{2n}$, which holds for all n only if $k^{\prime}\ge \frac{w+2}{4}$. □
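As for the DCJ-indel case, the necessity argument can be illustrated by computing the smallest admissible k′ for the single-cycle family of the proof. The Python sketch below hard-codes the distances stated in the proof for even n; it is an illustration with a hypothetical function name, not a general distance computation.

```python
from fractions import Fraction


def min_k_prime_for_cycle_family(n, w):
    """Smallest k' with m(A,B) <= m(A,C) + m(B,C) for the single-cycle example.

    A and B share n markers (n even) and each has n unique markers; C is empty.
    """
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    d_ab = (n - 1) + w * (n // 2 + 1)  # one cycle with 2n vertices and 2n runs
    d_ac = d_bc = w                    # one indel removes the whole chromosome
    # d_ab + 2k'n <= d_ac + d_bc + 4k'n  <=>  k' >= (d_ab - d_ac - d_bc) / (2n)
    return Fraction(d_ab - d_ac - d_bc) / (2 * n)


if __name__ == "__main__":
    w = Fraction(1, 2)
    for n in (2, 10, 100, 1000):
        print(n, min_k_prime_for_cycle_family(n, w), (w + 2) / 4)
```

The required k′ approaches (w+2)/4 from below as n grows, so (w+2)/4 is the smallest constant that works for every n.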
Conclusions
In this work we have presented methods to compute in linear time the DCJ-indel and DCJ-substitution distances between two genomes without duplicated markers, when the content-modifying cost is distinct from and upper bounded by the DCJ cost. Content-modifying operations can be applied to pieces of DNA of any size, and a side effect of this property is that the triangular inequality does not hold for our distance formulas. However, we have shown that an a posteriori correction can be applied to establish the triangular inequality in both DCJ-indel and DCJ-substitution distances.
References
1. Hannenhalli S, Pevzner P: Transforming men into mice (polynomial algorithm for genomic distance problem). 36th Annual IEEE Symposium on Foundations of Computer Science. 1995, 581-592.
2. Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005, 21: 3340-3346.
3. Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. Algorithms in Bioinformatics, Lecture Notes in Computer Science, Volume 4175. Springer, 2006, 163-173.
4. El-Mabrouk N: Sorting signed permutations by reversals and insertions/deletions of contiguous segments. J Discrete Algorithms. 2001, 1: 105-122.
5. Hannenhalli S, Pevzner P: Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM. 1999, 46: 1-27. 10.1145/300515.300516.
6. Yancopoulos S, Friedberg R: DCJ path formulation for genome transformations which include insertions, deletions, and duplications. J Comput Biol. 2009, 16 (10): 1311-1338.
7. Braga MDV, Willing E, Stoye J: Double Cut and Join with Insertions and Deletions. J Comput Biol. 2011, 18 (9): 1167-1184. (a preliminary version appeared in proceedings of WABI 2010, LNBI vol. 6293, p. 90–101)
8. Braga MDV, Machado R, Ribeiro LC, Stoye J: Genomic distance under gene substitutions. BMC Bioinformatics. 2011, 12 (Suppl 9): S8.
9. Boore JL: The duplication/random loss model for gene rearrangement exemplified by mitochondrial genomes of deuterostome animals. Comparative Genomics, Volume 1. Springer, 2000, 133-147.
10. Moritz C, Dowling TE, Brown WM: Evolution of animal mitochondrial DNA: relevance for population biology and systematics. Annual Review of Ecology and Systematics, Volume 18. Annual Reviews Inc, 1987, 269-292.
11. Blanc G, Ogata H, Robert C: Reductive genome evolution from the mother of Rickettsia. PLoS Genet. 2007, 3: e14.
12. Braga MDV, Machado R, Ribeiro LC, Stoye J: On the weight of indels in genomic distances. BMC Bioinformatics. 2011, 12 (Suppl 9): S13.
13. da Silva PH, Braga MDV, Machado R, Dantas S: DCJ-indel distance with distinct operation costs. Algorithms in Bioinformatics, Lecture Notes in Computer Science, Volume 7534. Springer, 2012, 378-390.
14. Braga MDV, Stoye J: The solution space of sorting by DCJ. J Comput Biol. 2010, 17 (9): 1145-1165.
Acknowledgements
This research was partially supported by the Brazilian research agencies CNPq and FAPERJ.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
PHS, RM, SD and MDVB have elaborated the model, proved the results and written the paper. All authors read and approved the final manuscript.
Keywords
 Double cut and join (DCJ)
 Insertions and deletions (indels)
 Substitution
 Genome rearrangements
 Genomic distance
 Evolution
 Comparative genomics
 Combinatorics
 Algorithms