DCJ-indel and DCJ-substitution distances with distinct operation costs

Background: Classical approaches to compute the genomic distance are usually limited to genomes with the same content and take into consideration only rearrangements that change the organization of the genome (i.e. positions and orientation of pieces of DNA, number and type of chromosomes, etc.), such as inversions, translocations, fusions and fissions. These operations are generically represented by the double-cut and join (DCJ) operation. The distance between two genomes, in terms of number of DCJ operations, can be computed in linear time. In order to handle genomes with distinct contents, insertions and deletions of fragments of DNA, named indels, must also be allowed. More powerful than an indel is a substitution of a fragment of DNA by another fragment of DNA. Indels and substitutions are called content-modifying operations. It has been shown that both the DCJ-indel and the DCJ-substitution distances can also be computed in linear time, assuming that the same cost is assigned to any DCJ or content-modifying operation.
Results: In the present study we extend the DCJ-indel and the DCJ-substitution models, considering that the content-modifying cost is distinct from and upper bounded by the DCJ cost, and show that the distance in both models can still be computed in linear time. Although the triangular inequality can be disrupted in both models, we also show how to efficiently fix this problem a posteriori.


Background
The distance between two genomes is often computed using only the common markers, that is, the markers that occur in both genomes. Such a distance allows rearrangements that change the organization of the genome, that is, the positions and orientations of markers and the number and types of chromosomes. Inversions, translocations, fusions and fissions are some of these operations [1]. All these rearrangements can be generically represented as a double-cut-and-join (DCJ) operation [2]. The DCJ distance, which takes into consideration only DCJ operations, can be computed in linear time [3].
Nevertheless, genomes with the same content are rare, and differences in gene content may reflect important evolutionary aspects. In order to handle genomes with unequal contents, one has to take into consideration insertions and deletions of fragments of DNA, named indels.
http://www.almob.org/content/8/1/21
Recently, in 2011, a more powerful content-modifying operation has also been considered: a substitution allows a piece of DNA to be substituted by another piece of DNA [8]. Observe that it is not suggested that a substitution occurs at a precise moment in evolution; instead, it represents a region that underwent continuous mutations (duplications, losses and gene mutations), so that a group of genes is transformed into a different group of genes (either of which may also be empty, allowing a substitution to represent an insertion or a deletion). Other studies also represent continuous mutations as a rearrangement event [9,10]. By minimizing substitutions we are able to establish a relation between indels that could have occurred in the same position of the compared genomes, identifying genomic regions that could be subject to these continuous mutations. It has been shown that the DCJ-substitution distance can also be computed in linear time [8].
The approaches mentioned above [4,6-8] assign the same cost to any rearrangement or content-modifying operation. However, during the evolution of many organisms, content-modifying operations are said to occur more often than rearrangements and, consequently, should be assigned a lower cost. Examples are bacteria that are obligate intracellular parasites, such as Rickettsia [11]. The genomes of such intracellular parasites are observed to have a reductive evolution, that is, a process by which genomes shrink and undergo extreme levels of gene degradation and loss. In the present work, we refine the DCJ-indel [7] and the DCJ-substitution [8] models, by adopting a distinct content-modifying cost that is upper bounded by the DCJ cost. For simplicity, we assign a cost of 1 to DCJ and a positive cost of w ≤ 1 to content-modifying operations. We are then able to give exact formulas for both the DCJ-indel and the DCJ-substitution distances, for any positive w ≤ 1.
Content-modifying operations are applied to pieces of DNA of any size, and a side effect of this fact is that the triangular inequality often does not hold for distances that consider these operations [4,6-8,12]. In the case of the models we study here, it is possible to do an a posteriori correction, using an approach similar to the one described in [12].
This paper is an extension of [13] and is organized as follows. In the remainder of this section we give definitions and previous results used in this work. We will then present our results, including the formulas for the distances with distinct DCJ and content-modifying costs and the correction to establish the triangular inequality.

Genomes
We deal with models in which duplicated markers are not allowed. Given two genomes A and B, possibly with unequal content, let $\mathcal{G}$, $\mathcal{A}$ and $\mathcal{B}$ be three disjoint sets, such that $\mathcal{G}$ is the set of markers that occur both in A and B, $\mathcal{A}$ is the set of markers that occur only in A, and $\mathcal{B}$ is the set of markers that occur only in B. The markers in sets $\mathcal{A}$ and $\mathcal{B}$ are also called unique markers. We denote by $u(A, B) = |\mathcal{A}| + |\mathcal{B}|$ the number of unique markers in genomes A and B.
Each marker g in a genome is a DNA fragment and is represented by the symbol $g$, if it is read in direct orientation, or by the symbol $\overline{g}$, if it is read in reverse orientation. Each one of the two extremities of a linear chromosome is called a telomere, represented by the symbol •. Each chromosome in a genome can then be represented by a string that is circular, if the chromosome is circular, or linear and flanked by the symbols •, if the chromosome is linear. In general, a genome is either circular (composed of circular chromosomes) or linear (composed of linear chromosomes). As an example, consider the linear genomes $A = \{• b s u \overline{c} a v \overline{d} e •\}$ and $B = \{• a w b x c •, • y d z e •\}$, represented in Figure 1. Here we have $\mathcal{G} = \{a, b, c, d, e\}$, $\mathcal{A} = \{s, u, v\}$ and $\mathcal{B} = \{w, x, y, z\}$.
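The definitions above can be illustrated with a short Python sketch. The encoding (a genome as a list of linear chromosomes, each a list of marker names, with a "-" prefix for reverse orientation) and the function names are assumptions made only for this illustration; the reverse orientation of c and d in genome A is likewise an assumption, inferred from the adjacencies quoted later in the text.

```python
def marker_sets(genome_a, genome_b):
    """Return (common, only_a, only_b) for two genomes without duplicated markers."""
    names_a = {m.lstrip("-") for chrom in genome_a for m in chrom}
    names_b = {m.lstrip("-") for chrom in genome_b for m in chrom}
    return names_a & names_b, names_a - names_b, names_b - names_a


def unique_marker_count(genome_a, genome_b):
    """u(A, B): number of markers that occur in exactly one of the two genomes."""
    _, only_a, only_b = marker_sets(genome_a, genome_b)
    return len(only_a) + len(only_b)


# Example genomes (orientation of c and d in A is assumed, see lead-in).
GENOME_A = [["b", "s", "u", "-c", "a", "v", "-d", "e"]]
GENOME_B = [["a", "w", "b", "x", "c"], ["y", "d", "z", "e"]]

common, only_a, only_b = marker_sets(GENOME_A, GENOME_B)
print(sorted(common), sorted(only_a), sorted(only_b))
print(unique_marker_count(GENOME_A, GENOME_B))
```

Orientation plays no role in these sets, so the sign prefix is simply stripped before comparing marker names.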

The DCJ model
In this section we summarize the DCJ model, which allows the sorting of the common content of two genomes, also called DCJ-sorting. We also show how the DCJ distance can be easily computed with the help of the adjacency graph.
Given two genomes A and B, we denote the two extremities of each $g \in \mathcal{G}$ by $g^t$ (tail) and $g^h$ (head). Then, a $\mathcal{G}$-adjacency or simply adjacency [7] in genome A (respectively in genome B) is a string $v = \gamma_1 \ell \gamma_2 \equiv \gamma_2 \overline{\ell} \gamma_1$, such that each $\gamma_i$ can be a telomere or an extremity of a marker from $\mathcal{G}$ and $\ell$ is a substring composed of the markers that are between $\gamma_1$ and $\gamma_2$ in A (respectively in B) and contains no marker that also belongs to $\mathcal{G}$. The substring $\ell$ is the label of v. If $\ell$ is empty, the adjacency is said to be clean, otherwise it is said to be labeled. If a linear chromosome is composed only of unique markers, it is represented by an adjacency $• \ell •$. Similarly, a circular chromosome composed only of unique markers is represented by a (circular) adjacency that contains only the label $\ell$. For the linear genomes represented in Figure 1, the set of adjacencies in A is $\{• b^t,\ b^h su\, c^h,\ c^t a^t,\ a^h v\, d^h,\ d^t e^t,\ e^h •\}$ and the set of adjacencies in B is $\{• a^t,\ a^h w\, b^t,\ b^h x\, c^t,\ c^h •,\ • y\, d^t,\ d^h z\, e^t,\ e^h •\}$.
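The construction of labeled adjacencies can be sketched as a single left-to-right walk over each linear chromosome. This is only an illustrative sketch: the list-based genome encoding with a "-" prefix for reverse orientation, the triple representation (left extremity, label, right extremity), and the names used below are assumptions, and circular chromosomes are not handled.

```python
TELOMERE = "•"


def labeled_adjacencies(genome, common):
    """List the adjacencies of a linear genome as (left, label, right) triples.

    `common` is the set of markers shared with the other genome; every other
    marker is unique and is accumulated into the label of an adjacency.
    """
    result = []
    for chrom in genome:
        left, label = TELOMERE, []
        for marker in chrom:
            name = marker.lstrip("-")
            if name not in common:
                label.append(marker)  # unique marker: part of the current label
                continue
            if marker.startswith("-"):  # reverse orientation: head comes first
                first, second = name + "h", name + "t"
            else:  # direct orientation: tail comes first
                first, second = name + "t", name + "h"
            result.append((left, "".join(label), first))
            left, label = second, []
        result.append((left, "".join(label), TELOMERE))
    return result


GENOME_A = [["b", "s", "u", "-c", "a", "v", "-d", "e"]]
GENOME_B = [["a", "w", "b", "x", "c"], ["y", "d", "z", "e"]]
COMMON = set("abcde")

for adjacency in labeled_adjacencies(GENOME_A, COMMON):
    print(adjacency)
```

A chromosome without any common marker yields a single triple with two telomeres, which corresponds to the adjacency of a linear singleton.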

Adjacency graph
Given two genomes A and B, the adjacency graph AG(A,B) [3] is the bipartite multigraph whose vertices are the adjacencies of A and of B and that has one edge for each common extremity of a pair of vertices. Each of the connected components of AG(A,B) alternates vertices in genome A and in genome B. Each component can be either a cycle, or an AB-path (that has one endpoint in genome A and the other in B), or an AA-path (that has both endpoints in genome A), or a BB-path (that has both endpoints in B). A special case of an AA- or a BB-path is a linear singleton, that is, a linear chromosome represented by an adjacency of type $• \ell •$, where $\ell$ contains only unique markers. Paths occur when the genomes are linear. For circular genomes, the graph AG(A,B) is composed of cycles only, and may also have a special type of component composed of a single vertex, called a circular singleton, which corresponds to a circular chromosome composed only of markers that are not in $\mathcal{G}$. In Figure 2 we show the adjacency graph built over the linear genomes represented in Figure 1.

DCJ operations
A cut performed on a genome A separates two adjacent markers of A. A cut affects a single adjacency v in A: it is done between two symbols of v, creating two open ends. In general a cut can be performed between two markers of a label, but the DCJ-indel distance can be computed considering only cuts that do not "break" labels. A double-cut and join or DCJ applied on a genome A is the operation that performs cuts in two different adjacencies in A, creating four open ends, and joins these open ends in a different way. In other words, a DCJ rearranges two adjacencies in A, transforming them into two new adjacencies. As an example consider a DCJ applied to genome A (from Figure 1), that rearranges the adjacencies $a^h v\, d^h$ and $d^t e^t$ into the new adjacencies $a^h v\, d^t$ and $d^h e^t$. Observe that this operation corresponds to the inversion of marker d in genome A. Indeed, a DCJ operation can correspond to several rearrangements, such as an inversion, a translocation, a fusion or a fission [2].

DCJ-sorting and DCJ distance
Given two genomes A and B, the components of AG(A,B) with 3 or more vertices need to be reduced, by applying DCJ operations, to components with only 2 vertices, which can be cycles or AB-paths [14]. This procedure is called DCJ-sorting of A into B. The number of AB-paths in AG(A,B) is always even and a DCJ can be of three types [7]: it can either decrease the number of cycles by one, or the number of AB-paths by two (counter-optimal); or it does not affect the number of cycles and AB-paths (neutral); or it can either increase the number of cycles by one, or the number of AB-paths by two (optimal). The DCJ distance of A and B, denoted by $d_{DCJ}(A,B)$, is the minimum number of steps required to do a DCJ-sorting of A into B, given by the following theorem.
Theorem 1 (from [3]). Given two genomes A and B, we have $d_{DCJ}(A,B) = |\mathcal{G}| - c - \frac{b}{2}$, where $|\mathcal{G}|$ is the number of common markers and c and b are, respectively, the number of cycles and of AB-paths in AG(A,B).
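To make the component types and the distance computation concrete, the following Python sketch builds the adjacency graph of two linear genomes with a union-find structure and classifies each component by the telomeres it contains. It assumes the standard DCJ distance formula of [3], that is, the number of common markers minus the number of cycles minus half the number of AB-paths. The genome encoding ("-" prefix for reverse orientation) and all function names are assumptions of this sketch; circular chromosomes are not handled, and the union-find is near-linear rather than strictly linear.

```python
from collections import defaultdict

TELOMERE = "•"


def adjacencies(genome, common):
    """Adjacencies of a linear genome as pairs of extremities (labels ignored)."""
    result = []
    for chrom in genome:
        left = TELOMERE
        for marker in chrom:
            name = marker.lstrip("-")
            if name not in common:
                continue  # unique markers only contribute to labels
            if marker.startswith("-"):
                first, second = name + "h", name + "t"
            else:
                first, second = name + "t", name + "h"
            result.append((left, first))
            left = second
        result.append((left, TELOMERE))
    return result


def component_counts(genome_a, genome_b):
    """Return (number of common markers, counts per component type of AG(A, B))."""
    def names(genome):
        return {m.lstrip("-") for chrom in genome for m in chrom}

    common = names(genome_a) & names(genome_b)
    sides = {"A": adjacencies(genome_a, common), "B": adjacencies(genome_b, common)}
    parent = {(s, i): (s, i) for s, adjs in sides.items() for i in range(len(adjs))}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    holders = defaultdict(list)  # extremity -> the two vertices that contain it
    for s, adjs in sides.items():
        for i, adj in enumerate(adjs):
            for ext in adj:
                if ext != TELOMERE:
                    holders[ext].append((s, i))
    for u, v in holders.values():  # one edge per common extremity
        parent[find(u)] = find(v)

    telomeres = defaultdict(lambda: {"A": 0, "B": 0})
    for s, adjs in sides.items():
        for i, adj in enumerate(adjs):
            telomeres[find((s, i))][s] += adj.count(TELOMERE)

    counts = {"cycle": 0, "AB": 0, "AA": 0, "BB": 0}
    for t in telomeres.values():
        if t["A"] == 0 and t["B"] == 0:
            counts["cycle"] += 1
        elif t["A"] == 1 and t["B"] == 1:
            counts["AB"] += 1
        elif t["A"] == 2:
            counts["AA"] += 1
        else:
            counts["BB"] += 1
    return len(common), counts


def dcj_distance(genome_a, genome_b):
    n, counts = component_counts(genome_a, genome_b)
    return n - counts["cycle"] - counts["AB"] // 2


GENOME_A = [["b", "s", "u", "-c", "a", "v", "-d", "e"]]
GENOME_B = [["a", "w", "b", "x", "c"], ["y", "d", "z", "e"]]
print(component_counts(GENOME_A, GENOME_B), dcj_distance(GENOME_A, GENOME_B))
```

Classifying components by telomere counts avoids an explicit path traversal: a cycle contains no telomere, an AB-path one in each genome, and an AA-path (respectively BB-path) two in A (respectively B).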

Internal DCJ operations and recombinations
Observe that a DCJ operation ρ acts on two different adjacencies, that can be in the same or in two distinct connected components of the graph. The components on which the cuts are applied are called sources and the components obtained after the joinings are called resultants of ρ. With respect to the adjacency graph, ρ can be of two types: internal, when ρ is applied to two adjacencies belonging to a single component; and recombination, when ρ is applied to adjacencies belonging to two distinct components.
Any recombination applied to a vertex of an AA-path and a vertex of a BB-path is optimal [14]. A recombination applied to vertices of two distinct AB-paths can be either neutral, when the resultants are also AB-paths, or counter-optimal, when the resultants are an AA-path and a BB-path. All other types of path recombinations are neutral and all recombinations involving at least one cycle are counter-optimal.
It is possible to do a separate DCJ-sorting in any component P of AG(A,B) [14] by applying DCJs internal to P. We denote by $d_{DCJ}(P)$ the number of optimal DCJ operations used for DCJ-sorting P separately ($d_{DCJ}(P)$ depends only on the number of vertices or, equivalently, the number of edges of P [14]). Thus, the DCJ distance can also be rewritten as the sum of the distances per component:
Lemma 1 (from [14]). Given two genomes A and B, we have $d_{DCJ}(A,B) = \sum_{P \in AG(A,B)} d_{DCJ}(P)$.
Only optimal DCJs, counted in the equivalent formulas given by Theorem 1 and Lemma 1, are necessary to do a DCJ-sorting. Given a DCJ ρ, the DCJ variation of ρ, denoted by $\Delta_{DCJ}(\rho)$, is defined to be respectively 0, 1 and 2 depending on whether ρ is optimal, neutral or counter-optimal.

Modifying the content of a genome
In the previous section, the unique markers appeared as labels of adjacencies, but the DCJ operations are only able to change the organization of the genomes. Here we introduce the operations that are applied to the labels and change the content of the genomes.

Indel operations
The most classical content-modifying operations are insertions and deletions of blocks of contiguous markers [4,6]. We refer to insertions and deletions as indel operations. In the model we consider, an indel only affects the label of one single adjacency, by deleting or inserting contiguous markers in this label, with the restriction that an insertion cannot produce duplicated markers [7]. Thus, while sorting A into B, the indels are the steps in which the markers in $\mathcal{A}$ are deleted and the markers in $\mathcal{B}$ are inserted. At most one chromosome can be entirely deleted or inserted at once. We illustrate an indel with the following example: the deletion of markers su from adjacency $b^h su\, c^h$ of genome A (Figure 2), which results in the clean adjacency $b^h c^h$. The opposite operation would be an insertion.

Substitutions
Substitutions are more powerful content-modifying operations that allow blocks of contiguous markers to be substituted by other blocks of contiguous markers [8]. In other words, a deletion and a subsequent insertion that occur at the same position of the genome can be modeled as a substitution, counting together as one single sorting step.
A substitution only affects the label of one single adjacency, by substituting contiguous markers in this label, with the restriction that it cannot produce duplicated markers [8]. An example is the substitution of markers su in adjacency $b^h su\, c^h$ by x, which results in adjacency $b^h x\, c^h$. At most one chromosome can be entirely substituted at once (but we do not allow the substitution of a linear by a circular chromosome nor vice-versa). As previously mentioned, insertions and deletions are special cases of substitutions. If a block of markers is substituted by the empty string, we have a deletion. Analogously, if the empty string is substituted by a block of markers, we have an insertion.

Runs, indel- and substitution-potentials
In this section we introduce some definitions and concepts that will help us to integrate the DCJ model with content-modifying operations. These concepts will be very useful in our results, where we show how to use DCJ operations to minimize the number of content-modifying operations to be performed.
First, let us recall the concept of run, introduced in [7]. Given two genomes A and B and a component P of AG(A,B), a run is a maximal subpath of P, in which the first and the last vertices are labeled and all labeled vertices belong to the same genome (or partition). An example is given in Figure 3. A run in genome A is also called an A-run, and a run in genome B is called a B-run. We denote by $\Lambda(P)$ the number of runs in a component P. While a path can have any number of runs, a cycle has either 0, 1, or an even number of runs.
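Counting runs amounts to collapsing consecutive labeled vertices of the same genome. A minimal Python sketch follows, assuming that a component is given as the ordered sequence of its vertices, each a pair (genome, label); this representation and the function name are hypothetical. The example path uses the label pattern that the example genomes suggest for the AB-path of Figure 3.

```python
def count_runs(vertices, cyclic=False):
    """Number of runs in a component.

    `vertices` lists the vertices in path (or cycle) order as (side, label)
    pairs, where side is "A" or "B" and an empty label means a clean adjacency.
    """
    sides = [side for side, label in vertices if label]  # labeled vertices only
    runs = [s for i, s in enumerate(sides) if i == 0 or s != sides[i - 1]]
    # In a cycle the last and the first run are adjacent and may be the same run.
    if cyclic and len(runs) > 1 and runs[0] == runs[-1]:
        runs.pop()
    return len(runs)


# AB-path read from A to B: clean, w, v, z, clean, y  ->  runs B, A, B
ab_path = [("A", ""), ("B", "w"), ("A", "v"), ("B", "z"), ("A", ""), ("B", "y")]
print(count_runs(ab_path))
```

For cycles the wrap-around merge is what enforces the property stated above: the result is 0, 1, or an even number.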
A set of labels of one genome can be accumulated with DCJs. For example, take the adjacencies $d^h z\, e^t$ and $d^t y\, •$ from genome B (Figure 3). A DCJ applied to these two adjacencies could result in $d^t e^t$ and $d^h zy\, •$, in which the label zy resulted from the accumulation of the labels of the two original adjacencies. In particular, when we apply optimal DCJs internal to a single component of the adjacency graph, we can accumulate an entire run into a single adjacency [7].
Runs can be merged by DCJ operations. Consequently, during the optimal DCJ-sorting of a component P, we can reduce its number of runs. The indel-potential of P, denoted by λ(P), is defined in [7] as the minimum number of runs that we can obtain by DCJ-sorting P with optimal DCJ operations. An example is given in Figure 4.
Figure 3: An AB-path with 3 runs (extracted from Figure 2).
Figure 4: Two optimal sequences for DCJ-sorting an AB-path with $\Lambda = 3$ (the cuts of each DCJ in each sequence are represented by "|"). In (i) the overall number of runs in the resulting components is three, while in (ii) the resulting components have only two runs. Indeed, in this case, the best we can have is the indel-potential $\lambda = 2$.
The indel-potential of a component depends only on its number of runs:

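Proposition 1 (from [7]). Given two genomes A and B and a component P of AG(A,B), the indel-potential of P is given by
$$\lambda(P) = \begin{cases} 0, & \text{if } \Lambda(P) = 0, \\ \left\lceil \dfrac{\Lambda(P) + 1}{2} \right\rceil, & \text{if } \Lambda(P) \geq 1, \end{cases}$$
where $\Lambda(P)$ is the number of runs in P.
Similarly, the substitution-potential of a component P is the minimum number of substitutions that we can obtain by DCJ-sorting P with optimal DCJ operations. The substitution-potential is denoted by $\sigma(P)$ and can be computed as follows:
Proposition 2 (from [8]). Given two genomes A and B and a component P of AG(A,B), the substitution-potential of P is given by
$$\sigma(P) = \begin{cases} 0, & \text{if } \Lambda(P) = 0, \\ \left\lceil \dfrac{\Lambda(P) + 1}{4} \right\rceil, & \text{if } \Lambda(P) \geq 1. \end{cases}$$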
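Both potentials depend only on the number of runs, so they can be tabulated directly. The sketch below assumes the closed forms ⌈(Λ + 1)/2⌉ for the indel-potential and ⌈(Λ + 1)/4⌉ for the substitution-potential (with value 0 when there are no runs), as attributed to [7] and [8]; these forms agree with the behaviour described in the text, namely that the potentials grow by one every two, respectively four, additional runs. The function names are hypothetical.

```python
def indel_potential(runs):
    """lambda(P) as a function of the number of runs: ceil((runs + 1) / 2)."""
    if runs < 0:
        raise ValueError("the number of runs cannot be negative")
    return 0 if runs == 0 else (runs + 2) // 2


def substitution_potential(runs):
    """sigma(P) as a function of the number of runs: ceil((runs + 1) / 4)."""
    if runs < 0:
        raise ValueError("the number of runs cannot be negative")
    return 0 if runs == 0 else (runs + 4) // 4


for runs in range(9):
    print(runs, indel_potential(runs), substitution_potential(runs))
```

Integer floor division is used instead of floating-point ceilings, since ⌈(n + 1)/k⌉ equals (n + k) // k for positive integers n and k.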

Results
In this section we show how to compute the DCJ-indel and the DCJ-substitution distances, considering that the content-modifying cost is distinct from and upper bounded by the DCJ cost. We assign the cost of 1 to each DCJ and a positive cost w ≤ 1 to each content-modifying operation.

The DCJ-indel model with distinct operation costs
First we consider the case in which only indels are allowed as content-modifying operations. Given two genomes A and B, we define the DCJ-indel distance of A and B, denoted by $d^{id}_{DCJ}(A, B)$, as the minimum cost of a DCJ-indel sequence of operations that sorts A into B. If w = 1, the DCJ-indel distance corresponds exactly to the minimum number of steps required to sort A into B. To compute the distance in this case, a linear-time algorithm was given in [7]. Here we present a more general method to compute the DCJ-indel distance for any positive w ≤ 1.

An upper bound for the DCJ-indel distance
We can obtain a good upper bound for the DCJ-indel distance by showing how to compute the DCJ-indel distance per component. Given a DCJ operation ρ, let $\lambda_0$ and $\lambda_1$ be, respectively, the sum of the indel-potentials for the components of the adjacency graph before and after ρ, and let $\Delta\lambda(\rho) = \lambda_1 - \lambda_0$. If ρ is an optimal DCJ internal to a single component of the graph, the definition of indel-potential implies $\Delta\lambda(\rho) \geq 0$. We also have $\Delta\lambda(\rho) \geq 0$, if ρ is counter-optimal, and $\Delta\lambda(\rho) \geq -1$, if ρ is neutral [7]. Recall that $\Delta_{DCJ}(\rho)$ is, respectively, 0, 1 and 2, depending on whether the DCJ ρ is optimal, neutral or counter-optimal. We define $\Delta_{DCJ\text{-}\lambda}(\rho) = \Delta_{DCJ}(\rho) + w\,\Delta\lambda(\rho)$.
We know that each component P of AG(A, B) can be DCJ-sorted separately, and the labels can then be easily sorted with indel operations. Let $d^{id}_{DCJ}(P)$ be the DCJ-indel distance of P, that is, the minimum cost of a DCJ-indel sequence of operations sorting P separately. This can be computed according to the following proposition.
Proposition 3. Given two genomes A and B and a positive indel cost w ≤ 1, for any component P of AG(A, B) we have $d^{id}_{DCJ}(P) = d_{DCJ}(P) + w\,\lambda(P)$.
Proof. By the definition of λ, the best we can do with optimal DCJs is $d_{DCJ}(P) + w\,\lambda(P)$. From [7], we have $\Delta_{DCJ\text{-}\lambda}(\rho) \geq 2$ if ρ is counter-optimal, thus we can only get more expensive sorting scenarios if we use such an operation. We also know that, if ρ is neutral, $\Delta_{DCJ\text{-}\lambda}(\rho) \geq 1 - w \geq 0$, for any positive w ≤ 1.
This allows us to get a good upper bound for the DCJ-indel distance with distinct operation costs:

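Lemma 2. Given two genomes A and B and a positive indel cost w ≤ 1, we have $d^{id}_{DCJ}(A, B) \leq d_{DCJ}(A, B) + w \sum_{P \in AG(A,B)} \lambda(P)$.
Proof. If we sort the components separately we have exactly $\sum_{P \in AG(A,B)} d^{id}_{DCJ}(P) = d_{DCJ}(A, B) + w \sum_{P \in AG(A,B)} \lambda(P)$.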
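The upper bound obtained by sorting the components separately can be evaluated with a few lines of Python. This sketch assumes that the bound is the DCJ distance plus w times the sum of the indel-potentials of all components, with the indel-potential computed as ⌈(Λ + 1)/2⌉ for a component with Λ ≥ 1 runs. The function names are hypothetical, and the input values (DCJ distance 4 and components with 3, 2 and 0 runs) are hand-derived for the example genomes of Figure 1 and serve only as illustrative inputs.

```python
from fractions import Fraction


def indel_potential(runs):
    """Assumed closed form: 0 for no runs, otherwise ceil((runs + 1) / 2)."""
    return 0 if runs == 0 else (runs + 2) // 2


def dcj_indel_upper_bound(dcj_distance, runs_per_component, w):
    """DCJ distance plus w times the total indel-potential of all components."""
    if not 0 < w <= 1:
        raise ValueError("the indel cost w must satisfy 0 < w <= 1")
    total_potential = sum(indel_potential(r) for r in runs_per_component)
    return dcj_distance + w * total_potential


for w in (Fraction(1, 4), Fraction(1, 2), Fraction(1)):
    print(w, dcj_indel_upper_bound(4, [3, 2, 0], w))
```

Exact fractions are used for w so that comparisons against thresholds such as 1/2 or 2/3 are not affected by floating-point rounding.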

Recombinations and the exact DCJ-indel distance
Until this point, we have explored the possible effects of any DCJ that is internal to a single component of the graph. Now we will analyze the effect of recombinations, which have $\Delta\lambda \geq -2$ [7]. We saw previously that any recombination involving cycles is counter-optimal. Since any counter-optimal recombination has $\Delta_{DCJ\text{-}\lambda} \geq 2 - 2w \geq 0$, only path recombinations can have $\Delta_{DCJ\text{-}\lambda} < 0$.
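The sign arguments above are easy to check numerically. The following sketch assumes that the combined variation of an operation is its DCJ variation (0, 1 or 2) plus w times its variation of the indel-potential, mirroring the definition given for the substitution model later in the text; the function and parameter names are hypothetical.

```python
from fractions import Fraction

DCJ_VARIATION = {"optimal": 0, "neutral": 1, "counter-optimal": 2}


def combined_variation(kind, delta_potential, w):
    """Cost variation of a DCJ: DCJ variation plus w times the potential variation."""
    if not 0 < w <= 1:
        raise ValueError("the cost w must satisfy 0 < w <= 1")
    return DCJ_VARIATION[kind] + w * delta_potential


w = Fraction(3, 4)
print(combined_variation("counter-optimal", -2, w))  # never negative for w <= 1
print(combined_variation("neutral", -2, w))          # negative only if w > 1/2
print(combined_variation("optimal", -1, w))          # negative for any w > 0
```

With the potential variation bounded below by −2, a counter-optimal recombination can at best break even (at w = 1), which is why only path recombinations need to be examined.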
Although the space of recombinations is not small, some observations allow us to explore it efficiently. Proposition 1 shows that the indel-potential increases by one when the number of runs increases by two. Furthermore, when we decrease the number of runs of a path by one, it will decrease the indel-potential only if its initial number of runs is one or a multiple of two. However, the exact number of runs does not really matter. In the path recombination analysis, we only have to consider the following properties for each path:
• whether it is an AA, or a BB, or an AB-path;
• whether it has zero, or an odd or an even number of runs; and
• whether its first run is in A or in B (by convention, an AB-path is always read from A to B).
An empty sequence (with no run) is represented by ε. For an integer i ≥ 0, let A (respectively B) be a sequence with an odd number 2i + 1 of runs, starting and ending with an A-run (respectively B-run). Similarly, let AB (respectively BA) be a sequence with an even number 2i + 2 of runs, starting with an A-run (respectively B-run) and ending with a B-run (respectively A-run). Then each one of the notations $AA_{\varepsilon}$, $AA_{A}$, $AA_{B}$, $AA_{AB} \equiv AA_{BA}$, $BB_{\varepsilon}$, $BB_{A}$, $BB_{B}$, $BB_{AB} \equiv BB_{BA}$, $AB_{\varepsilon}$, $AB_{A}$, $AB_{B}$, $AB_{AB}$ and $AB_{BA}$ represents a particular type of path (AA, BB or AB) with a particular structure of runs (ε, A, B, AB or BA). An example of this notation is given in Figure 5, which represents a neutral recombination possibly with $\Delta_{DCJ\text{-}\lambda} < 0$. Each type of recombination can lead to different resultants, depending on where the cuts are applied. However, it is always possible to choose the "best" resultants in each case: we take the recombination with the smallest $\Delta_{DCJ\text{-}\lambda}$, whose resultants can be better reused in further recombinations. The main observations to guide this task are: only recombinations of paths whose runs are AB or BA have $\Delta\lambda = -2$ and only recombinations of type AA + BB are optimal and have $\Delta_{DCJ} = 0$. In Table 1, we list all path recombinations that can have $\Delta_{DCJ\text{-}\lambda} < 0$, together with neutral recombinations that have $\Delta_{DCJ\text{-}\lambda} = 1 - w \geq 0$, but produce an $AA_{AB}$ or a $BB_{AB}$ path. We denote by • an AB-path that never appears as a source of a recombination in this table (these paths are $AB_{\varepsilon}$, $AB_{A}$ and $AB_{B}$).
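The path notation for the indel model can be derived mechanically from the type of a path and the sequence of its runs. A minimal sketch follows, assuming the runs are given as an alternating list of "A" and "B" in reading order (AB-paths read from A to B); the function name and the string form of the result (for instance "AA_AB") are hypothetical conventions of this sketch.

```python
def indel_path_class(path_type, run_sides):
    """Return the notation of a path, e.g. "AB_B" or "AA_AB".

    `path_type` is "AA", "BB" or "AB"; `run_sides` lists the genome of each
    run in reading order and is assumed to alternate between "A" and "B".
    """
    if path_type not in ("AA", "BB", "AB"):
        raise ValueError("unknown path type")
    if not run_sides:
        return path_type + "_ε"
    if len(run_sides) % 2 == 1:
        structure = run_sides[0]  # odd: starts and ends in the same genome
    else:
        structure = run_sides[0] + run_sides[-1]  # even: "AB" or "BA"
        if path_type in ("AA", "BB"):
            structure = "AB"  # AA_AB and AA_BA denote the same class
    return path_type + "_" + structure


print(indel_path_class("AB", ["B", "A", "B"]))
print(indel_path_class("AA", ["B", "A"]))
print(indel_path_class("AB", []))
```

Only the parity of the number of runs and the genome of the first run enter the classification, which reflects the observation that the exact number of runs does not matter.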

The DCJ-indel distance formula
By analyzing the whole universe of operations, we could identify groups of recombinations, as listed in Table 2. Since some resultants of recombinations can be used in other recombinations, the groups can have more than one recombination. Groups P, $S_1$ and $S_2$ are composed of a single recombination, while groups T, $N_1$ and $N_2$ are composed of two recombinations and groups Q and M are composed of three recombinations. Recombination is not an associative operation, thus, in column 'DCJ seq.' of Table 2, we indicate how the sequence of DCJs must be applied in each group (the symbol ≺ separates preceding and succeeding recombinations).
While in groups Q and T the preceding recombinations have lower $\Delta_{DCJ\text{-}\lambda}$, in groups M, $N_1$ and $N_2$ we need to use operations of type n-1 (neutral with $\Delta\lambda = -1$) in order to prepare better recombinations. Another important observation concerning groups Q and T is that, although their $\Delta_{DCJ\text{-}\lambda}$ indicates that Q could be applied for w > 1/4 and T could be applied for w > 1/3, the last operation of these groups is of type n-2 (neutral with $\Delta\lambda = -2$) and actually increases $\Delta_{DCJ\text{-}\lambda}$ for w ≤ 1/2. For this reason, we skip groups Q and T for w ≤ 1/2 (there is no loss with this approach, since their optimal operations are then counted in $S_1$).
The deductions shown in Table 2 can be computed with an approach that greedily maximizes the number of occurrences in P, Q, T, $S_1$, $S_2$, M, $N_1$ and $N_2$ in this order. The two groups in Q are mutually exclusive after maximizing P. The lines in T are subgroups of the lines in Q, that is, they are only computed when there are enough remaining components after maximizing Q. Similarly, each one of the remaining groups is computed when there are enough remaining components after maximizing the upper groups.
Figure 5: Neutral recombination that has $\Delta_{DCJ\text{-}\lambda} = 1 - 2w$ (we represent only the labels of the adjacencies, the cuts of the recombination are represented by "/" and "\").
Table 1 legend: Recombinations of type o-2 (optimal with $\Delta\lambda = -2$), o-1 (optimal with $\Delta\lambda = -1$) and n-2 (neutral with $\Delta\lambda = -2$) can have $\Delta_{DCJ\text{-}\lambda} < 0$. Recombinations of type n-1 (neutral with $\Delta\lambda = -1$) have $\Delta_{DCJ\text{-}\lambda} = 1 - w \geq 0$, but produce an $AA_{AB}$ or a $BB_{AB}$ path.
Table 2: All recombination groups that determine the deductions for computing the DCJ-indel distance.
With the results presented in this section we have an exact formula to compute the DCJ-indel distance:

Theorem 2. Given two genomes A and B and a positive indel cost w
where P, Q, T, $S_1$, $S_2$, M, $N_1$ and $N_2$ are computed as described above.
As we mentioned before, the groups Q and T are skipped (Q = T = 0) for w ≤ 1/2. Furthermore, we also have $S_2 = M = N_1 = 0$ if w ≤ 1/2 and $N_2 = 0$ if w ≤ 2/3. Although some groups have reusable resultants, those are actually never reused (if groups that are lower in the table use as sources resultants from higher groups, the sources of all referred groups would be previously consumed in groups that occupy even higher positions in the table). Due to this fact, the number of occurrences in each group depends only on w and the initial number of each type of component.
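The dependence of the groups on w can be summarized in a small helper. The sketch below encodes only the thresholds stated in the text (Q, T, S2, M and N1 require w > 1/2, and N2 requires w > 2/3); it does not compute the numbers of occurrences, which depend on Table 2. The function name and the plain-string group names are hypothetical.

```python
from fractions import Fraction

GROUP_ORDER = ["P", "Q", "T", "S1", "S2", "M", "N1", "N2"]


def enabled_indel_groups(w):
    """Groups whose occurrences may be non-zero for the indel cost w."""
    w = Fraction(w)
    if not 0 < w <= 1:
        raise ValueError("the indel cost w must satisfy 0 < w <= 1")
    enabled = {"P", "S1"}  # never skipped
    if w > Fraction(1, 2):
        enabled |= {"Q", "T", "S2", "M", "N1"}
    if w > Fraction(2, 3):
        enabled.add("N2")
    return [group for group in GROUP_ORDER if group in enabled]


for w in (Fraction(1, 2), Fraction(3, 5), Fraction(1)):
    print(w, enabled_indel_groups(w))
```

The list is returned in the greedy maximization order P, Q, T, S1, S2, M, N1, N2, so it can be iterated directly by a routine that consumes components group by group.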
Observe that, for w = 1, our formula is identical to the one proposed in [7]. Actually, for any 2/3 < w ≤ 1, the two formulas are equivalent, since the same occurrences of groups of recombinations and an equivalent upper bound are taken into account.
We illustrate the result of our formula with an example. Let AG(A, B) have only the following labeled paths: two $AA_{AB}$, one $BB_{A}$ and one $BB_{B}$. In this case, there are no occurrences of P, thus we have P = 0. If we take w > 1/2, all labeled paths are consumed in one occurrence of Q.
We have Q = 1, while all other values are zero, resulting in $\Delta_{DCJ\text{-}\lambda} = 1 - 4w$. On the other hand, if w ≤ 1/2, we automatically set $Q = T = S_2 = M = N_1 = N_2 = 0$. The labeled paths are consumed in two occurrences of $S_1$, that is,

The DCJ-substitution model with distinct operation costs
Now we consider a different model in which substitutions are the content-modifying operations. Recall that substitutions include indels. Again we assign the cost of 1 to each DCJ and the cost of w ≤ 1 to each substitution. The DCJ-substitution distance of genomes A and B, denoted by $d^{sb}_{DCJ}(A, B)$, is then the minimum cost of a DCJ-substitution sequence that sorts A into B. If w = 1, this corresponds exactly to the minimum number of steps required to sort A into B and can be computed in linear time [8]. Here we present a general method to compute the DCJ-substitution distance for any positive w ≤ 1. Similarly to the approach used with the DCJ-indel model, we will first use internal DCJs to obtain a good upper bound and then analyze recombinations to compute the exact DCJ-substitution distance.

An upper bound for the DCJ-substitution distance
We can also obtain a good upper bound for the DCJ-substitution distance by showing how to compute the DCJ-substitution distance per component. Given a DCJ operation ρ, let $\sigma_0$ and $\sigma_1$ be, respectively, the sum of the substitution-potentials for the components of the adjacency graph before and after ρ, and let $\Delta\sigma(\rho) = \sigma_1 - \sigma_0$. If ρ is an optimal DCJ internal to a single component of the graph, the definition of substitution-potential implies $\Delta\sigma(\rho) \geq 0$. We also have $\Delta\sigma(\rho) \geq 0$, if ρ is counter-optimal, and $\Delta\sigma(\rho) \geq -1$, if ρ is neutral [8]. We define $\Delta_{DCJ\text{-}\sigma}(\rho) = \Delta_{DCJ}(\rho) + w\,\Delta\sigma(\rho)$. After DCJ-sorting a component P of AG(A, B), the remaining labels can be easily sorted with substitutions. Let $d^{sb}_{DCJ}(P)$ be the DCJ-substitution distance of P, that is, the minimum cost of a DCJ-substitution sequence of operations sorting P separately. This is given by the following proposition.
Proposition 4. Given two genomes A and B and a positive substitution cost w ≤ 1, for any component P of AG(A, B) we have $d^{sb}_{DCJ}(P) = d_{DCJ}(P) + w\,\sigma(P)$.

Recombinations and the exact DCJ-substitution distance
Now we also need to analyze the effect of path recombinations, which have $\Delta\sigma(\rho) \geq -2$ [8], in the DCJ-substitution distance. Here the space of recombinations is even larger, but can still be efficiently explored. Proposition 2 shows that the substitution-potential increases by one when the number of runs increases by four. Furthermore, when we decrease the number of runs of a path by one, it will decrease the substitution-potential only if its initial number of runs is one or a multiple of four. Again, the exact number of runs does not really matter. We have to consider the following properties for each path:
• whether it is an AA, or a BB, or an AB-path;
• whether it has zero, or a number of runs that is a multiple of four, or a multiple of four plus 1, or a multiple of four plus 2, or a multiple of four plus 3; and
• whether its first run is in A or in B (by convention, an AB-path is always read from A to B).
Recall that an empty sequence (with no run) is represented by ε. For labeled paths we adopt a different meaning for A, B, AB, BA: for an integer i ≥ 0, let A (respectively B) be a sequence with an odd number 4i + 1 of runs, starting and ending with an A-run (respectively B-run), and let AB (respectively BA) be a sequence with an even number 4i + 2 of runs, starting with an A-run (respectively B-run) and ending with a B-run (respectively A-run). Here we still have some additional cases: let ABA (respectively BAB) be a sequence with an odd number 4i + 3 of runs, starting and ending with an A-run (respectively B-run), and let ABAB (respectively BABA) be a sequence with an even number 4i + 4 of runs, starting with an A-run (respectively B-run) and ending with a B-run (respectively A-run). Then, for each type of path (AA, BB or AB) with a particular structure of runs (ε, A, B, AB, BA, ABA, BAB, ABAB, or BABA), we have a particular notation. An example of this notation is given in Figure 6, which represents a neutral recombination with $\Delta_{DCJ\text{-}\sigma} = 1 - w$.
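In the substitution model the run structure of a path is determined by the genome of its first run and by the number of runs modulo four. A minimal sketch of this reduction is given below; the function name and the use of plain strings such as "ABA" for the structures are assumptions of this illustration.

```python
def substitution_run_structure(first_side, run_count):
    """Run structure (ε, A, AB, ABA, ABAB, B, BA, BAB or BABA) of a path.

    `first_side` is the genome ("A" or "B") of the first run and `run_count`
    the total number of runs; runs are assumed to alternate between genomes.
    """
    if run_count < 0:
        raise ValueError("the number of runs cannot be negative")
    if run_count == 0:
        return "ε"
    if first_side not in ("A", "B"):
        raise ValueError("first_side must be 'A' or 'B'")
    other = "B" if first_side == "A" else "A"
    length = (run_count - 1) % 4 + 1  # 4i+1 -> 1, 4i+2 -> 2, 4i+3 -> 3, 4i+4 -> 4
    return "".join(first_side if k % 2 == 0 else other for k in range(length))


for runs in range(9):
    print(runs, substitution_run_structure("A", runs))
```

Reducing the number of runs to the range 1 to 4 reflects the fact that the substitution-potential changes only every four runs, so two paths of the same type whose run counts differ by a multiple of four fall into the same class.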
Again, although each type of recombination can lead to different resultants, it is always possible to choose the "best" resultants in each case: we take the recombination with the smallest $\Delta_{DCJ\text{-}\sigma}$, whose resultants can be better reused. In Table 3, we list all recombinations that can have $\Delta_{DCJ\text{-}\sigma} < 0$, together with those that have $\Delta_{DCJ\text{-}\sigma} = 1 - w \geq 0$, but produce an AA or a BB-path with runs ABAB or A or B. We denote by • an AB-path that never appears as a source in this table (these are all AB-paths, with the exception of $AB_{ABAB}$ and $AB_{BABA}$).
Figure 6: Neutral recombination that has $\Delta_{DCJ\text{-}\sigma} = 1 - w$ (we represent only the labels of the adjacencies, the cuts of the recombination are represented by "/").

The DCJ-substitution distance formula
In Table 4 we list groups of recombinations, which allow the computation of the exact DCJ-substitution distance, with an approach that greedily maximizes the number of occurrences in U, V, W, $X_1$, $X_2$, Y, $Z_1$ and $Z_2$ in this order. The two groups in V are mutually exclusive after maximizing U, while those in W are subgroups of V (they are only computed when there are enough remaining components after maximizing V). Similarly, each one of the remaining groups is computed when there are enough remaining components after maximizing the upper groups. As previously observed, recombination is not associative, thus the column 'DCJ seq.' determines in which order the sequence of DCJs must be applied in each group. Here we also need to skip some recombinations depending on the value of w. In particular, although $\Delta_{DCJ\text{-}\sigma}$ indicates that W could be applied for w > 1/3 and V for w > 1/4, the last operation of these groups is of type n-2 and increases $\Delta_{DCJ\text{-}\sigma}$ for w ≤ 1/2. Groups V and W are skipped for w ≤ 1/2, and their optimal operations are then counted in $X_1$.
The recombinations allow us to obtain an exact formula for the DCJ-substitution distance:

Theorem 3. Given genomes A and B and a positive substitution cost
where U, V, W, $X_1$, $X_2$, Y, $Z_1$ and $Z_2$ are computed as described above and $P_{LS}$ and $P_{CS}$ are the numbers of disjoint pairs of linear and circular singletons.
Observe that the number of occurrences in each group depends only on w and the initial number of each type of component and, for any 2/3 < w ≤ 1, our formula is equivalent to the one proposed in [8], since the same occurrences of groups of recombinations and an equivalent upper bound are taken into account.

Complexity
Both AG(A, B) and $d_{DCJ}(A, B)$ can be computed in linear time [3]. The number of occurrences in each recombination group depends only on w and the initial components. The runs are obtained by a single walk through each path; thus the whole procedure takes linear time for both models.
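The single walk through a component can be sketched as follows. This is a minimal illustration, assuming each path or cycle is given as the sequence of its vertex labels in walk order, with 'A' or 'B' for a vertex labeled with unique markers of that genome and None for an unlabeled vertex. The function name `count_runs` and this encoding are assumptions of the sketch, not notation taken from the paper.

```python
from itertools import groupby


def count_runs(labels, cyclic=False):
    """Count the runs of a component in a single pass.

    labels: vertex labels in walk order ('A', 'B', or None if unlabeled).
    cyclic: True if the component is a cycle, so that a run may wrap around.
    """
    # Unlabeled vertices do not interrupt a run, so they are dropped first.
    labeled = [x for x in labels if x is not None]
    # Collapse consecutive equal labels: each group is one run.
    runs = [key for key, _ in groupby(labeled)]
    # In a cycle, the first and last runs merge if they carry the same label.
    if cyclic and len(runs) > 1 and runs[0] == runs[-1]:
        runs.pop()
    return len(runs)
```

Each vertex is visited a constant number of times, so the cost is linear in the size of the component, which matches the linear bound stated above.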

Establishing the triangular inequality
We have presented two genomic distances that combine DCJ and content-modifying operations and can be computed in linear time. However, content-modifying operations are applied to pieces of DNA of any size, and a side effect of this fact is that the triangular inequality often does not hold for distances that consider these operations [4,6-8,12]. Let A, B and C be three genomes, with unequal contents, and consider, without loss of generality, that $d^{id}_{DCJ}(A,B) \geq d^{id}_{DCJ}(A,C)$ and $d^{id}_{DCJ}(A,B) \geq d^{id}_{DCJ}(B,C)$. The triangular inequality is then the property which guarantees that the inequality $d^{id}_{DCJ}(A,B) \leq d^{id}_{DCJ}(A,C) + d^{id}_{DCJ}(B,C)$ also holds. Unfortunately this is not the case for the DCJ-indel distance, and also not the case for the DCJ-substitution distance.

Denote by A, B, C, D, E, F and G the disjoint sets of markers such that: A, B or C are the sets of markers that occur respectively only in genome A, B or C; the markers in D are common only to genomes A and B; the markers in E are common only to B and C; the markers in F are common only to A and C; and G is the set of markers that are common to all three genomes A, B and C. These sets are represented in Figure 7.
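The seven disjoint marker sets can be obtained with elementary set operations. The following is a minimal sketch, assuming each genome is reduced to the collection of its marker identifiers; the function name `marker_partition` and the dictionary keys are hypothetical and serve only to mirror the set names used above.

```python
def marker_partition(genome_a, genome_b, genome_c):
    """Split the markers of three genomes into the disjoint sets A..G."""
    a, b, c = set(genome_a), set(genome_b), set(genome_c)
    return {
        "A": a - b - c,    # only in genome A
        "B": b - a - c,    # only in genome B
        "C": c - a - b,    # only in genome C
        "D": (a & b) - c,  # common only to A and B
        "E": (b & c) - a,  # common only to B and C
        "F": (a & c) - b,  # common only to A and C
        "G": a & b & c,    # common to all three genomes
    }


parts = marker_partition([1, 4, 6, 7], [2, 4, 5, 7], [3, 5, 6, 7])
# Here parts["D"] == {4}: marker 4 is shared by A and B but absent from C.
```

The set D is the one that matters for the triangular inequality: it contains exactly the markers that A and B share but that a path through C would have to delete and reinsert.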
Figure 7 The set of markers of each genome is represented by a circle.

When D = ∅, meaning that genomes A and B have no common marker that does not occur in C, the triangular inequality holds for both DCJ-indel and DCJ-substitution distances [12]. However, if D ≠ ∅, the triangular inequality can be disrupted for $d^{id}_{DCJ}$ and $d^{sb}_{DCJ}$, and this may be an obstacle if one intends to use these distances to compute the median of three or more genomes and in phylogenetic reconstructions.
It is possible to establish the triangular inequality in our two models a posteriori, by adapting an approach proposed in [12]: we simply add to each distance a surcharge that depends on the number of unique markers, as we will see in the following subsections.
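The surcharge idea can be sketched as follows. This is a minimal illustration, assuming that the corrected distance has the form m(A, B) = d(A, B) + k·u(A, B), where u(A, B) is the number of markers that occur in exactly one of the two genomes. The function names, the representation of genomes as collections of marker identifiers, and the precomputed distance passed as an argument are all hypothetical; the appropriate value of k is the subject of the following subsections.

```python
def unique_markers(genome_a, genome_b):
    """Number of markers occurring in exactly one of the two genomes."""
    return len(set(genome_a) ^ set(genome_b))


def corrected_distance(distance, genome_a, genome_b, k):
    """Add a surcharge of k per unique marker to a precomputed distance.

    distance: DCJ-indel or DCJ-substitution distance, computed elsewhere.
    k: surcharge constant (assumed to be supplied by the caller).
    """
    return distance + k * unique_markers(genome_a, genome_b)


# Markers 1, 4 and 5 are unique, so the surcharge is 3 * 0.75.
m = corrected_distance(3.0, [1, 2, 3], [2, 3, 4, 5], k=0.75)
```

Because the surcharge depends only on marker content, it can be applied after the distance has been computed and does not affect the linear running time.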
We define the diameter as the maximum distance between any pair of genomes, usually as a function of the size of the genomes. We use this definition in the next results. The problem now is to find the minimum value of k for which the inequality of Proposition 5 holds. In order to accomplish this task, the first step is to determine the diameter of the DCJ-indel distance.

[...] ≤ 0. Furthermore, from [12] we know that $\sum_{P \in AG(A,B)} |P| = 2n + L_A + S_A + L_B + S_B$.

Correction for the DCJ-indel distance
We are ready to generalize the result of [12], and determine the minimum possible value of k.

Proof. Recall that, to prove the triangular inequality for $m^{id}$, we only need to find a k such that $d^{id}_{DCJ}(A,B) \leq d^{id}_{DCJ}(A,C) + d^{id}_{DCJ}(B,C) + 2k|D|$ holds (Proposition 5). We know that the inequality holds when D = ∅ [12]. It remains to examine the case in which D ≠ ∅. The worst case would be to have an empty genome C [12]. Let $X_A$ and $X_B$ be the number of chromosomes in A and B. Since C is empty, we know that $d^{id}_{DCJ}(A,C) = wX_A$ and $d^{id}_{DCJ}(B,C) = wX_B$. From Lemma 4, we have $d^{id}_{DCJ}(A,B) \leq (w+1)|D| + w(L_A + S_A + L_B + S_B)$. It is therefore sufficient that $(w+1)|D| + w(L_A + S_A + L_B + S_B) \leq w(X_A + X_B) + 2k|D|$. Since $L_A + S_A + L_B + S_B \leq X_A + X_B$, it suffices that $(w+1)|D| \leq 2k|D|$, which holds for any $k \geq \frac{w+1}{2}$.

For the necessity, take A and B with n common markers and let each genome be composed of one circular chromosome, meaning that we have one adjacency per common marker in each genome (or n adjacencies per genome). Then let AG(A,B) have one single cycle with 2n vertices and let each vertex be labeled, so that the number of runs in the cycle is 2n and the number of unique markers in each genome is n. Thus, we have [...]

In order to find the minimum value of k for which the inequality of Proposition 6 holds, we need to determine the diameter of the DCJ-substitution distance, which is given by the following lemma.

Proof. Let |P| be the number of vertices in component P, that is DCJ-sorted with |P|−1