Approximation algorithm for rearrangement distances considering repeated genes and intergenic regions

The rearrangement distance is a method to compare genomes of different species. Such distance is the number of rearrangement events necessary to transform one genome into another. Two commonly studied events are the transposition, which exchanges two consecutive blocks of the genome, and the reversal, which reverts a block of the genome. When dealing with such problems, seminal works represented genomes as sequences of genes without repetition. More realistic models started to consider gene repetition or the presence of intergenic regions, sequences of nucleotides between genes and in the extremities of the genome. This work explores the transposition and reversal events applied to a genome representation considering both gene repetition and intergenic regions. We define two problems called Minimum Common Intergenic String Partition and Reverse Minimum Common Intergenic String Partition. Using a relation with these two problems, we show a Θ(k)-approximation for the Intergenic Transposition Distance, the Intergenic Reversal Distance, and the Intergenic Reversal and Transposition Distance problems, where k is the maximum number of copies of a gene in the genomes. Our practical experiments on simulated genomes show that the use of partitions improves the estimates for the distances.


Introduction
In the field of Computational Biology, when analyzing the relationship between two genomes, one can estimate the evolutionary distance by calculating the number of mutations necessary to transform one genome into another. These mutations can be non-conservative (i.e., affect the quantity of genetic material), which is the case of insertion, deletion, duplication, or substitution of individual nucleotides [1][2][3], or the mutations can be conservative (i.e., do not insert or remove genetic material), which is the case of the conservative genome rearrangement events [4], which affect only the order and orientation of genes in the genome.
Some conservative events affect a single chromosome, such as the reversal, which inverts a sequence of genes, and the transposition, which exchanges the position of two consecutive sequences of genes. There are also events that may affect more than one chromosome, such as the translocation, which swaps extremities of two chromosomes. The translocation and reversal events can be simulated by the Double-Cut-and-Join (DCJ) [5] operation, which cuts the genome at two positions and creates two new adjacencies by joining the four extremities affected by these cuts. This work focuses on the reversal and transposition events; consequently, we only consider genomes with a single chromosome.
When comparing genomes with a rearrangement-based distance, one must select a rearrangement model (i.e., the set of allowed rearrangement events) and find a representation for the genomes suitable to the selected model. With a given model, a rearrangement distance problem aims at finding the minimum number of allowed rearrangement events necessary to transform one genome into another.

Open Access. Algorithms for Molecular Biology. *Correspondence: gabriel.siqueira@ic.unicamp.br, Institute of Computing, University of Campinas, Campinas, Brazil
Genomes can be represented by a string, where each character represents a gene. Multiple genes may be represented by the same character; those genes constitute a gene family.
If we assume that there are no replicated characters, the characters are usually represented by integer numbers, and a string of size n corresponds to a permutation of numbers from 1 to n. In this case, when comparing two genomes G and H of size n, one of them is represented by the identity permutation ι = (1 2 . . . n) and the other by a permutation π . Consequently, finding the rearrangement distance is equivalent to finding the minimum number of allowed rearrangement events necessary to sort the permutation π.
A string (or a permutation) may also include information regarding gene orientation, and such information is encoded as signs, + or −, associated with each character. In this case, we have a signed string (or a signed permutation).
When there are replicated characters, two common approaches are adopted to transform the strings into permutations. The first selects an exemplar of each gene family [6], and the second establishes a correspondence between characters of both strings [7,8], which allows us to discriminate between multiple copies of the same character. The second approach has the advantage of losing less information but can only be applied when such correspondence can be established. In the presence of non-conservative events, the correspondence between genes may not be possible, and a preprocessing step is required to eliminate genes present in only one of the genomes.
In biological terms, this correspondence is called an orthologous assignment. The distance between permutations resulting from an orthologous assignment gives us a valid upper bound for the distance between the original strings. As there are multiple possible assignments, there are some strategies to find assignments that lead to lower distances [7,8].
Recent works [9,10] argue that considering the size of intergenic regions (i.e., number of nucleotides between genes and in the extremities of the genome) improves the estimated distances. When the sizes of intergenic regions are taken into account, the genome representation includes a string representing the gene sequence and a sequence of integers corresponding to the size of each intergenic region.
Each combination of genome representation and rearrangement model defines a different rearrangement distance problem. Table 1 shows a summary of results from the literature, considering different rearrangement distance problems and the contributions of the present work (last three rows). For each problem, we mention whether there is a known polynomial-time algorithm or an NP-hardness proof and, in the latter case, the best known approximation factor for that problem.
It is worth mentioning that, to ensure an approximation, the distance between strings takes into account the result of the string partition problems [26]. Such problems seek to split two strings into sub-strings that can be concatenated in different orders to form the original strings. The way in which the sub-strings appear in each original string defines the problem. If the sub-strings must appear in the same orientation in both original strings, we have the minimum common string partition problem. If the sub-strings can appear inverted in the original strings, we have the signed minimum common string partition problem when considering signed strings, and the reverse minimum common string partition problem when considering unsigned strings.
If there is an ℓ-approximation for the minimum common string partition problem, then there exists a 3ℓ-approximation for the transposition distance on strings problem [21]. Similarly, if there is an ℓ-approximation for the signed minimum common string partition problem, then there exists a 2ℓ-approximation for the reversal distance on signed strings problem [7]. The same relation can be applied to the reversal distance on strings and the reverse minimum common string partition problems [26].
The best known approximation algorithms for the partition problems have factors in O(log n log* n) [27], where n is the size of the string, and in Θ(k) [20], where k is the maximum number of copies of a character in the string.
This work describes approximation algorithms for the intergenic transposition distance, intergenic reversal distance, and intergenic reversal and transposition distance problems, where the representation of the genomes takes into account both repeated genes and intergenic regions. Initially, we present some definitions and formalize the problems. Next, we generalize the minimum common string partition and the reverse minimum common string partition problems to consider intergenic regions. We also present relations between the partitions and distance problems that consider intergenic regions and describe a Θ(k)-approximation algorithm for the partition problems ensuring a Θ(k)-approximation for the distance problems. Finally, we performed some practical tests on simulated genomes to evaluate the improvement in

Definitions
In the following definitions we use ordered sequences of elements (lists). The number of elements in a list $X$ is denoted by $|X|$, and the element at the $i$-th position of a list $X$ is denoted by $X_i$. The list $Y = rev(X)$ is equal to the list $X$ in the reverse order (i.e., $|X| = |Y|$ and $Y_i = X_{|X|-i+1}$ for $1 \le i \le |X|$).
Given a string $S$, the set $\Sigma_S$ of distinct elements of $S$ is the alphabet of $S$, and each element of $\Sigma_S$ is called a label. The occurrence of a label $\alpha$ in a string $S$ is the number of characters of $S$ with label $\alpha$, and is denoted by $occ(\alpha, S)$. The maximum occurrence of any label in $S$ is $occ(S) = \max_{\alpha \in \Sigma_S} occ(\alpha, S)$. A character whose label has occurrence one is called a singleton, and a character whose label has occurrence at least two is called a replica. Two strings $S$ and $P$ are balanced if $\Sigma_S = \Sigma_P$ and $occ(\alpha, S) = occ(\alpha, P)$ for all $\alpha \in \Sigma_S$. In other words, balanced strings are formed by the same characters in possibly different orders.
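The occurrence-based definitions above translate almost directly into code. The following Python sketch is only an illustration; it assumes a string is a plain Python `str` with one character per gene, and the function names are chosen here rather than taken from the paper.

```python
from collections import Counter


def occ(label, s):
    """Number of characters of s that carry the given label."""
    return Counter(s)[label]


def max_occ(s):
    """Maximum occurrence of any label in s (0 for an empty string)."""
    return max(Counter(s).values(), default=0)


def replica_labels(s):
    """Labels whose occurrence in s is at least two."""
    return {label for label, count in Counter(s).items() if count >= 2}


def balanced(s, p):
    """True when s and p use the same labels with the same occurrences."""
    return Counter(s) == Counter(p)
```

For instance, in the string "ABCAB" the labels A and B are replicas and C is a singleton, and the string is balanced with "BACBA".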
When modeling genomes, we consider the intergenic regions between genes represented by their sizes. Usually, an actual genome starts and ends with intergenic regions but, to construct our representation, we include two artificial genes in the beginning and end of the genome. In this process, usually called extension or capping, we use the same pair of genes for any genome.
Formally, a genome $\mathcal{G} = (g_1, \breve{g}_1, g_2, \ldots, \breve{g}_{n-1}, g_n)$ with size $n$ is an interleaved sequence of $n$ genes ($g_1, \ldots, g_n$) and $n-1$ intergenic regions ($\breve{g}_1, \ldots, \breve{g}_{n-1}$). We represent a genome $\mathcal{G} = (S, \breve{S})$ with a string $S$ and a list of integers $\breve{S}$, such that:
• The gene $g_i$ is represented by the character $S_i$ of $S$, for $1 \le i \le n$.
• The intergenic region $\breve{g}_i$ is represented by the integer $\breve{S}_i$ of $\breve{S}$, for $1 \le i \le n-1$.
Two genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$ are called co-tailed if they have the same initial and final gene (i.e., $S_1 = P_1$ and $S_n = P_n$). Note that any two genomes resulting from an extension are co-tailed.
The reverse of a genome $\mathcal{G} = (S, \breve{S})$, denoted by $rev(\mathcal{G})$, is the genome represented by the lists $rev(S)$ and $rev(\breve{S})$. We say that two genomes $\mathcal{G}$ and $\mathcal{H}$ are equal ($\mathcal{G} = \mathcal{H}$) if their corresponding strings and their corresponding integer lists are equal. Additionally, we say that two genomes $\mathcal{G}$ and $\mathcal{H}$ are congruent ($\mathcal{G} \cong \mathcal{H}$) if $\mathcal{G} = \mathcal{H}$ or $\mathcal{G} = rev(\mathcal{H})$. Figure 1 shows an example of a genome and its reverse.
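Given a genome $\mathcal{G} = (S, \breve{S})$, the subgenome $\mathcal{G}_{i,j} = (S_{i,j}, \breve{S}_{i,j})$ is the portion of genome $\mathcal{G}$ between the genes $g_i$ and $g_j$. Consequently, the subgenome $\mathcal{G}_{i,j}$ is represented by lists $S_{i,j}$ and $\breve{S}_{i,j}$, such that:
• $S_{i,j} = (S_i, S_{i+1}, \ldots, S_j)$.
• $\breve{S}_{i,j} = (\breve{S}_i, \breve{S}_{i+1}, \ldots, \breve{S}_{j-1})$.
A genome $\mathcal{G}$ contains another genome $\mathcal{H}$ if $\mathcal{H}$ is equal to some subgenome of $\mathcal{G}$. We denote that relation by $\mathcal{H} \subset \mathcal{G}$. We also use $\mathcal{H} \not\subset \mathcal{G}$ to indicate that $\mathcal{G}$ does not contain $\mathcal{H}$.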
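As an illustration of the genome representation, the sketch below stores a genome as a pair made of a gene string and a list of intergenic sizes. It is a minimal sketch under stated assumptions: gene indices are 1-based as in the text, the subgenome between genes i and j is assumed to keep exactly the intergenic regions lying between those two genes, and all function names are hypothetical.

```python
from typing import List, Tuple

Genome = Tuple[str, List[int]]  # (gene string, intergenic region sizes)


def make_genome(genes: str, regions: List[int]) -> Genome:
    """Build a genome, checking that n genes come with n - 1 regions."""
    if len(regions) != len(genes) - 1:
        raise ValueError("a genome with n genes needs n - 1 intergenic regions")
    return (genes, list(regions))


def rev(g: Genome) -> Genome:
    """Reverse both the gene string and the list of intergenic sizes."""
    genes, regions = g
    return (genes[::-1], regions[::-1])


def subgenome(g: Genome, i: int, j: int) -> Genome:
    """Portion of g between genes i and j (1-based, inclusive)."""
    genes, regions = g
    return (genes[i - 1:j], regions[i - 1:j - 1])


def contains(g: Genome, h: Genome) -> bool:
    """True when h is equal to some subgenome of g."""
    n, m = len(g[0]), len(h[0])
    return any(subgenome(g, i, i + m - 1) == h for i in range(1, n - m + 2))
```

With `g = make_genome("ABCD", [3, 0, 5])`, the call `rev(g)` yields the genes "DCBA" with sizes 5, 0, 3, and `subgenome(g, 2, 3)` yields the genes "BC" with the single size 0.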
Let us define an operation of combination of genomes (exemplified in Fig. 2). We say that a genome $\mathcal{K} = (Q, \breve{Q})$ is a combination of two genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$ if:
• $Q$ is the concatenation of the strings $S$ and $P$.
• $\breve{Q}$ is formed by the list $\breve{S}$ followed by an integer (representing the size of the intergenic region between the two genomes) and then followed by the list $\breve{P}$.
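The combination operation can be sketched in a few lines of Python. The representation of a genome as a `(gene string, list of sizes)` pair and the name `combine` are assumptions made for this illustration; the `gap` argument is the size of the intergenic region placed between the two genomes.

```python
def combine(g, h, gap):
    """Combine genomes g and h, inserting one intergenic region of size gap."""
    if gap < 0:
        raise ValueError("an intergenic region cannot have negative size")
    (s, s_regions), (p, p_regions) = g, h
    return (s + p, list(s_regions) + [gap] + list(p_regions))


def combine_all(blocks, gaps):
    """Combine a sequence of genomes left to right using the given gap sizes."""
    if len(gaps) != len(blocks) - 1:
        raise ValueError("k blocks need k - 1 gap sizes")
    result = blocks[0]
    for block, gap in zip(blocks[1:], gaps):
        result = combine(result, block, gap)
    return result
```

For example, combining the genome with genes "AB" and size list [2] with the genome with genes "CD" and size list [4], using a gap of 7, gives the genes "ABCD" with sizes 2, 7, 4.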
Two genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$ of size $n$ are balanced if:
• The strings $S$ and $P$ are balanced.
• The sums of the integers corresponding to intergenic regions are the same, i.e., $\sum_{i=1}^{n-1} \breve{S}_i = \sum_{i=1}^{n-1} \breve{P}_i$.
Given two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$, an orthologous assignment $\xi$ between them is a mapping between genes, i.e., for each gene $S_i$ of $S$ there is a corresponding gene $\xi(S_i)$ in $P$. We denote the intergenic region after the gene $\xi(S_i)$ by $\xi(\breve{S}_i)$. Each singleton from $S$ is associated with the singleton of same label from $P$. Each replica from $S$ must be associated with a replica of same label from $P$. Note that there are multiple ways to perform the association for a replica. Figure 3 shows an orthologous assignment between two genomes $\mathcal{G}$ and $\mathcal{H}$.
Consider a genome $\mathcal{G} = (S, \breve{S})$ of size $n$ and the numbers $i, j, k, x, y, z$ [...]. The intergenic transposition $\tau^{(i,j,k)}_{(x,y,z)}$ is an operation that transforms $\mathcal{G}$ into a genome $\mathcal{G}.\tau^{(i,j,k)}_{(x,y,z)} = (S', \breve{S}')$, where: [...] Figure 4 shows a generic intergenic transposition and an example of an intergenic transposition applied to a genome $\mathcal{G}$.
Consider a genome $\mathcal{G} = (S, \breve{S})$ of size $n$ and the numbers $i, j, x, y$, with $2 \le i < j \le n-1$, $0 \le x \le \breve{S}_{i-1}$, and $0 \le y \le \breve{S}_j$. The intergenic reversal $\rho^{(i,j)}_{(x,y)}$ is an operation that transforms $\mathcal{G}$ into a genome $\mathcal{G}.\rho^{(i,j)}_{(x,y)}$ [...] Figure 5 shows a generic reversal and an example of a reversal applied to a genome $\mathcal{G}$.
As shown in the following problem statements, we are interested in finding the minimum number of intergenic operations necessary to transform one genome into another. We assume that the genomes come from the extension process and, consequently, they are co-tailed.

Proof
Directly from the fact that the correspondent problems on permutations are in the NP-hard class [23,24].
The minimum number of intergenic transpositions necessary to transform one genome $\mathcal{G}$ into another genome $\mathcal{H}$ is called the intergenic transposition distance, and it is denoted by $d_{IT}(\mathcal{G}, \mathcal{H})$. Similarly, the minimum number of intergenic reversals necessary to transform one genome $\mathcal{G}$ into another genome $\mathcal{H}$ is called the intergenic reversal distance, and it is denoted by $d_{IR}(\mathcal{G}, \mathcal{H})$. Also, the minimum number of operations that are either intergenic reversals or intergenic transpositions necessary to transform one genome $\mathcal{G}$ into another genome $\mathcal{H}$ is called the intergenic reversal and transposition distance, and it is denoted by $d_{IRT}(\mathcal{G}, \mathcal{H})$.

Intergenic Partition
In order to develop a solution for the intergenic transposition distance (ITD), intergenic reversal distance (IRD), and intergenic reversal and transposition distance (IRTD) problems, we studied two related problems called minimum common intergenic string partition (MCISP) and reverse minimum common intergenic string partition (RMCISP). To define those problems, we consider the following two types of intergenic partitions of two balanced genomes.
A direct intergenic partition between two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$ is a pair of genome sequences $(\mathbb{S}, \mathbb{P})$ such that:
1 The genomes of $\mathbb{S}$ when combined correspond to the genome $\mathcal{G}$.
2 The genomes of $\mathbb{P}$ when combined correspond to the genome $\mathcal{H}$.
3 It is possible to change the order of the genomes of $\mathbb{S}$ to obtain the genomes of $\mathbb{P}$ (i.e., there is at least one permutation $\varphi$ of the numbers 1 to $|\mathbb{S}|$ such that $\mathbb{S}_i = \mathbb{P}_{\varphi(i)}$ for $1 \le i \le |\mathbb{S}|$).
A reverse intergenic partition between two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$ is a pair of genome sequences $(\mathbb{S}, \mathbb{P})$ such that:
1 The genomes of $\mathbb{S}$ when combined correspond to the genome $\mathcal{G}$.
2 The genomes of $\mathbb{P}$ when combined correspond to the genome $\mathcal{H}$.
3 It is possible to change the order and orientation of the genomes of $\mathbb{S}$ to obtain the genomes of $\mathbb{P}$ (i.e., there is at least one permutation $\varphi$ of the numbers 1 to $|\mathbb{S}|$ such that $\mathbb{S}_i = \mathbb{P}_{\varphi(i)}$ or $\mathbb{S}_i = rev(\mathbb{P}_{\varphi(i)})$ for $1 \le i \le |\mathbb{S}|$).
In both intergenic partitions, the genomes corresponding to elements of $\mathbb{S}$ and $\mathbb{P}$ are called blocks, and are subgenomes of $\mathcal{G}$ and $\mathcal{H}$, respectively. As the blocks of $\mathbb{S}$ must be combined to form $\mathcal{G}$, the blocks must follow the order in which they appear in $\mathcal{G}$. Additionally, every gene must appear in some block. Some intergenic regions, on the other hand, do not appear in $\mathbb{S}$; those are the regions that must be included during the combination of the blocks. As these regions mark the points where the genome $\mathcal{G}$ is split into blocks, we call them breakpoints of $\mathbb{S}$. The breakpoints of $\mathbb{P}$ have a similar definition. Two breakpoints $\breve{X}_i$ and $\breve{Y}_j$ are called equivalent if the surrounding genes are equal, i.e., $X_i = Y_j$ and $X_{i+1} = Y_{j+1}$. Additionally, two breakpoints $\breve{X}_i$ and $\breve{Y}_j$ are called congruent if they have the same surrounding genes in possibly different positions, i.e., ($X_i = Y_j$ and $X_{i+1} = Y_{j+1}$) or ($X_i = Y_{j+1}$ and $X_{i+1} = Y_j$).
Siqueira et al. Algorithms Mol Biol (2021) 16:21
The cost $cost(\mathbb{S}, \mathbb{P})$ of an intergenic partition $(\mathbb{S}, \mathbb{P})$ is the number of breakpoints of $\mathbb{S}$. The cost can also be calculated as the number of blocks in $\mathbb{S}$ minus one.
Note that, as a consequence of the third condition, both sequences S and P must have the same number of blocks and, consequently, the cost would be the same if we consider P instead of S.
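The three conditions of an intergenic partition can be checked mechanically. The sketch below is a hypothetical validator, not the algorithm of the paper: blocks and genomes are `(gene string, list of sizes)` pairs, the third condition is tested by comparing multisets of blocks, and in the reverse case each block is first replaced by the smaller of itself and its reverse, a canonical form chosen here for convenience.

```python
from collections import Counter


def canonical(block, allow_reverse):
    """Hashable form of a block; orientation is ignored if allow_reverse."""
    genes, regions = block
    forward = (genes, tuple(regions))
    if not allow_reverse:
        return forward
    backward = (genes[::-1], tuple(regions[::-1]))
    return min(forward, backward)


def splits_genome(blocks, genome):
    """True when the blocks, read left to right, tile the genome exactly."""
    genes, regions = genome
    pos = 0
    for b_genes, b_regions in blocks:
        m = len(b_genes)
        if m == 0 or len(b_regions) != m - 1:
            return False
        if genes[pos:pos + m] != b_genes:
            return False
        if list(regions[pos:pos + m - 1]) != list(b_regions):
            return False
        pos += m  # the region right after the block is a breakpoint
    return pos == len(genes)


def is_intergenic_partition(s_blocks, p_blocks, g, h, allow_reverse=False):
    """Check conditions 1, 2 and 3 for a direct (or reverse) partition."""
    same_blocks = (Counter(canonical(b, allow_reverse) for b in s_blocks)
                   == Counter(canonical(b, allow_reverse) for b in p_blocks))
    return same_blocks and splits_genome(s_blocks, g) and splits_genome(p_blocks, h)


def cost(s_blocks):
    """Number of breakpoints, i.e., number of blocks minus one."""
    return len(s_blocks) - 1
```

As a usage example, the genomes with genes "ABCDE" (sizes 1, 2, 3, 4) and "ADBCE" (sizes 4, 1, 2, 3) admit the direct intergenic partition with blocks A, BC (size 2), D, and E, whose cost is 3.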
An intergenic partition is minimal if no two consecutive blocks can be combined to form an intergenic partition with smaller cost. An orthologous assignment between two genomes G and H associates genes of G with genes of H and, consequently, induces a unique minimal intergenic partition between G and H.
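One way to read the induced minimal partition in the direct case is the following: the region between two consecutive genes of the first genome is kept inside a block only when the assignment sends those genes to consecutive positions of the second genome and the two intergenic regions have the same size; every other region is a breakpoint. The Python sketch below implements this reading. The encoding of the assignment as a list `xi` of 0-based positions and the function name are assumptions of this illustration.

```python
def induced_breakpoints(g, h, xi):
    """0-based indices of the regions of g that are breakpoints under xi.

    xi[i] is the 0-based position in h of the gene assigned to position i
    of g (direct case only: orientation is never changed).
    """
    (s, s_regions), (p, p_regions) = g, h
    if sorted(xi) != list(range(len(p))) or len(s) != len(p):
        raise ValueError("xi must be a bijection between gene positions")
    if any(s[i] != p[xi[i]] for i in range(len(s))):
        raise ValueError("xi must associate genes with the same label")
    breakpoints = []
    for i in range(len(s) - 1):
        a, b = xi[i], xi[i + 1]
        kept = (b == a + 1) and s_regions[i] == p_regions[a]
        if not kept:
            breakpoints.append(i)
    return breakpoints
```

For the genomes with genes "ABCDE" (sizes 1, 2, 3, 4) and "ADBCE" (sizes 4, 1, 2, 3) and the only possible assignment `xi = [0, 2, 3, 1, 4]`, the regions with indices 0, 2, and 3 are breakpoints, which gives an induced partition of cost 3.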
Given an orthologous assignment $\xi$ between two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$, and the minimal intergenic partition $(\mathbb{S}, \mathbb{P})$ between $\mathcal{G}$ and $\mathcal{H}$ induced by $\xi$, we can distinguish between two types of breakpoint [...] where $(\mathbb{R}, \mathbb{Q})$ is the partition between $\mathcal{G}.\tau^{(i,j,k)}_{(x,y,z)}$ and $\mathcal{H}$ induced by the assignment $\xi$.

Example 1
A direct intergenic partition $(\mathbb{S}, \mathbb{P})$ of two genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$ of cost 3. Figure 6 shows a graphical representation of the partition $(\mathbb{S}, \mathbb{P})$ and a possible orthologous assignment capable of inducing that partition.

Example 2
A reverse intergenic partition $(\mathbb{S}, \mathbb{P})$ of two genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$ of cost 3. Figure 7 shows a graphical representation of the partition $(\mathbb{S}, \mathbb{P})$ and a possible orthologous assignment capable of inducing that partition.
Fig. 6 A graphical representation of the direct intergenic partition from Example 1. The intergenic regions with dashed lines are the breakpoints and each block is shown in a different color. The superscripts on each gene represent an assignment capable of inducing the partition. The breakpoint between genes $C^1$ and $D^1$ is an undercharged hard breakpoint, and the remaining breakpoints are soft.
We are interested in the minimum cost direct intergenic partition and in the minimum cost reverse intergenic partition, as shown in the following problem statements.
When we do not consider intergenic regions, the genomes may be represented only by the strings. In that case, there are analogous definitions for partitions.
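A direct partition of two balanced strings $S$ and $P$ is a pair of string sequences $(\mathbb{S}, \mathbb{P})$ such that:
1 The strings of $\mathbb{S}$ when concatenated correspond to the string $S$.
2 The strings of $\mathbb{P}$ when concatenated correspond to the string $P$.
3 It is possible to change the order of the strings of $\mathbb{S}$ to obtain the strings of $\mathbb{P}$ (i.e., there is at least one permutation $\varphi$ of the numbers 1 to $|\mathbb{S}|$ such that $\mathbb{S}_i = \mathbb{P}_{\varphi(i)}$ for $1 \le i \le |\mathbb{S}|$).
A reverse partition of two balanced strings $S$ and $P$ is a pair of string sequences $(\mathbb{S}, \mathbb{P})$ such that:
1 The strings of $\mathbb{S}$ when concatenated correspond to the string $S$.
2 The strings of $\mathbb{P}$ when concatenated correspond to the string $P$.
3 It is possible to change the order and orientation of the strings of $\mathbb{S}$ to obtain the strings of $\mathbb{P}$ (i.e., there is at least one permutation $\varphi$ of the numbers 1 to $|\mathbb{S}|$ such that $\mathbb{S}_i = \mathbb{P}_{\varphi(i)}$ or $\mathbb{S}_i = rev(\mathbb{P}_{\varphi(i)})$ for $1 \le i \le |\mathbb{S}|$).
In both cases, the cost of a partition is $|\mathbb{S}| - 1$. The minimum common string partition (MCSP) and reverse minimum common string partition (RMCSP) problems seek a direct partition and a reverse partition, respectively, of minimum cost.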
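For very small strings, the minimum cost of a direct or reverse partition can be found by exhaustive search, which helps when experimenting with the definitions. The sketch below is a brute-force illustration written for this text, with exponential running time; it is not one of the approximation algorithms cited in this work, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def blocks(s, cuts):
    """Split s at the given sorted cut positions."""
    bounds = [0, *cuts, len(s)]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]


def block_key(s, cuts, reverse):
    """Order-independent (and, if reverse, orientation-independent) key."""
    items = (min(b, b[::-1]) if reverse else b for b in blocks(s, cuts))
    return tuple(sorted(items))


def min_partition_cost(s, p, reverse=False):
    """Minimum cost of a direct (or reverse) partition of balanced strings."""
    if Counter(s) != Counter(p):
        raise ValueError("the strings must be balanced")
    n = len(s)
    if n == 0:
        return 0
    for c in range(n):  # try costs 0, 1, ..., n - 1 in increasing order
        p_keys = {block_key(p, cuts, reverse)
                  for cuts in combinations(range(1, n), c)}
        if any(block_key(s, cuts, reverse) in p_keys
               for cuts in combinations(range(1, n), c)):
            return c
    raise AssertionError("unreachable: single-character blocks always match")
```

For the strings "ABCD" and "ADCB", no adjacency is shared in the same orientation, so the direct cost is 3, whereas the reverse cost is 1 because the block "BCD" appears reversed as "DCB".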
The MCSP and RMCSP problems belong to the NP-hard class [28].

Theorem 2
The MCISP problem belongs to the NP-hard class.
Proof Given an integer $p$, the decision versions of the MCSP and MCISP problems ask for a direct partition and a direct intergenic partition, respectively, of cost $p$. Considering the decision versions, let us reduce the MCSP problem to the MCISP problem.
Let the strings $S$ and $P$ be an instance of the MCSP problem. We construct an instance of the MCISP problem by adding the integer lists $\breve{S}$ and $\breve{P}$, each of size $|S| - 1$ and composed only of zeros. Note that there is a direct partition of cost $p$ between $S$ and $P$ if and only if there is a direct intergenic partition of cost $p$ between $(S, \breve{S})$ and $(P, \breve{P})$.
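The construction used in this reduction is simple enough to be written as code. The sketch below is an illustration with a hypothetical function name; it attaches all-zero intergenic lists to the two strings of a string partition instance, so that intergenic sizes never prevent two blocks from matching.

```python
from collections import Counter


def string_instance_to_intergenic(s, p):
    """Turn a pair of balanced strings into a pair of genomes with zero-size regions."""
    if Counter(s) != Counter(p):
        raise ValueError("the strings must be balanced")
    zeros = [0] * max(len(s) - 1, 0)
    return (s, list(zeros)), (p, list(zeros))
```

The two genomes returned are balanced, since the strings are balanced and both lists of intergenic sizes sum to zero.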

Theorem 3
The RMCISP problem belongs to the NP-hard class.
Proof Analogous to the proof of Theorem 2 considering the RMCSP problem instead of MCSP.

Correspondence between partition and distance problems
This section presents a correspondence between the partition and distance problems. Such correspondence allows us to adapt an approximation for the MCISP problem to obtain an approximation for the ITD problem, and to adapt an approximation for the RMCISP problem to obtain approximations for the IRD and IRTD problems. The following lemmas establish lower bounds for the distances based on the cost of the partitions.

Lemma 1
Let $(\mathbb{S}, \mathbb{P})$ be a minimal direct intergenic partition induced by an orthologous assignment between two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$. For any intergenic transposition $\tau^{(i,j,k)}_{(x,y,z)}$, the minimal direct intergenic partition $(\mathbb{R}, \mathbb{Q})$ between the genomes $\mathcal{G}.\tau^{(i,j,k)}_{(x,y,z)}$ and $\mathcal{H}$, induced by the same orthologous assignment, satisfies $cost(\mathbb{R}, \mathbb{Q}) \ge cost(\mathbb{S}, \mathbb{P}) - 3$.
Proof As the direct intergenic partition $(\mathbb{R}, \mathbb{Q})$ must be induced by the same assignment as $(\mathbb{S}, \mathbb{P})$, we can only reduce the cost of the direct intergenic partition by moving the blocks to allow their combination. An intergenic transposition creates only three new adjacencies, so it may be able to combine at most three pairs of blocks: the block ending in $S_{i-1}$ with the block starting in $S_j$; the block ending in $S_{k-1}$ with the block starting in $S_i$; and the block ending in $S_{j-1}$ with the block starting in $S_k$. In the best case, if all three combinations occur, we have $cost(\mathbb{R}, \mathbb{Q}) = cost(\mathbb{S}, \mathbb{P}) - 3$.

Lemma 2
Let $(\mathbb{S}, \mathbb{P})$ be a minimal reverse intergenic partition induced by an orthologous assignment between two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$. For any intergenic transposition $\tau^{(i,j,k)}_{(x,y,z)}$, the minimal reverse intergenic partition $(\mathbb{R}, \mathbb{Q})$ between the genomes $\mathcal{G}.\tau^{(i,j,k)}_{(x,y,z)}$ and $\mathcal{H}$, induced by the same orthologous assignment, satisfies $cost(\mathbb{R}, \mathbb{Q}) \ge cost(\mathbb{S}, \mathbb{P}) - 3$.
Proof Analogous to the proof of Lemma 1.

Lemma 3
Let $(\mathbb{S}, \mathbb{P})$ be a minimal reverse intergenic partition induced by an orthologous assignment between two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$. For any intergenic reversal $\rho^{(i,j)}_{(x,y)}$, the minimal reverse intergenic partition $(\mathbb{R}, \mathbb{Q})$ between the genomes $\mathcal{G}.\rho^{(i,j)}_{(x,y)}$ and $\mathcal{H}$, induced by the same orthologous assignment, satisfies $cost(\mathbb{R}, \mathbb{Q}) \ge cost(\mathbb{S}, \mathbb{P}) - 2$.
Proof Similar to the proof of Lemma 1, considering that the intergenic reversal $\rho^{(i,j)}_{(x,y)}$ can combine up to two pairs of blocks: the block ending in $S_{i-1}$ with the block ending in $S_j$, and the block starting in $S_{j+1}$ with the block starting in $S_i$.

Lemma 4
Let $(\mathbb{S}, \mathbb{P})$ be a direct intergenic partition of minimum cost between two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$. Any sequence of intergenic transpositions that transforms $\mathcal{G}$ into $\mathcal{H}$ must have size at least $\frac{cost(\mathbb{S}, \mathbb{P})}{3}$.
Proof Consider a sequence of $k$ intergenic transpositions capable of transforming $\mathcal{G}$ into $\mathcal{H}$. Such a sequence establishes an orthologous assignment between $\mathcal{G}$ and $\mathcal{H}$. The assignment is recovered by verifying, for each character of $S$, its new position in $P$ after the intergenic transpositions are applied.
Let $(\mathbb{R}, \mathbb{Q})$ be the minimal direct intergenic partition induced by this orthologous assignment. We know that $\frac{cost(\mathbb{R}, \mathbb{Q})}{3} \le k$, because each intergenic transposition can remove at most 3 breakpoints (Lemma 1) and $k$ intergenic transpositions are sufficient to turn $\mathbb{R}$ into $\mathbb{Q}$ (i.e., $k$ intergenic transpositions can remove all breakpoints). As $(\mathbb{S}, \mathbb{P})$ is a minimum cost direct intergenic partition, we have $\frac{cost(\mathbb{S}, \mathbb{P})}{3} \le \frac{cost(\mathbb{R}, \mathbb{Q})}{3} \le k$.

Lemma 5
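Let $(\mathbb{S}, \mathbb{P})$ be a reverse intergenic partition of minimum cost between two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$. Any sequence of intergenic reversals that transforms $\mathcal{G}$ into $\mathcal{H}$ must have size at least $\frac{cost(\mathbb{S}, \mathbb{P})}{2}$.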
Proof Analogous to the proof of Lemma 4, but using Lemma 3 instead of Lemma 1.

Lemma 6
Let $(\mathbb{S}, \mathbb{P})$ be a reverse intergenic partition of minimum cost between two balanced genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$. Any sequence composed of intergenic reversals and intergenic transpositions that transforms $\mathcal{G}$ into $\mathcal{H}$ must have size at least $\frac{cost(\mathbb{S}, \mathbb{P})}{3}$.
Proof Analogous to the proof of Lemma 4, but using lemmas 2 and 3 instead of Lemma 1.
The next lemmas show upper bounds for the distances based on the cost of the partitions.

Lemma 7 (Brito et al. [24])
Let $\mathcal{G} = (S, \breve{S})$ be a genome. Given a sequence of two intergenic transpositions $\tau$ [...], applied in this order, it is [...]
Note that, after the two intergenic transpositions described in Lemma 7, the string $S$ remains the same.

Lemma 8
Consider two genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$ and an orthologous assignment $\xi$ between them, and let $(\mathbb{S}, \mathbb{P})$ be the minimal direct partition derived from the orthologous assignment $\xi$. If $\mathbb{S}$ has a soft breakpoint $\breve{S}_i$ such that $\breve{S}_i \ge \xi(\breve{S}_i)$, then we can apply an intergenic transposition in $\mathcal{G}$ that removes at least one breakpoint from $\mathbb{S}$. Furthermore, if $\mathbb{S}$ has at least 4 soft breakpoints and there is no breakpoint $\breve{S}_r$, $r \ne i$, such that $\breve{S}_r \ge \xi(\breve{S}_r)$, then the intergenic transposition can be chosen so that it does not create overcharged breakpoints.
Proof Consider the gene $S_j$ of $S$ such that the genes $\xi(S_i)$ and $\xi(S_j)$ are adjacent in $P$ and the position of $\xi(S_j)$ in $P$ is greater than the position of $\xi(S_i)$. Note that $S_j \ne S_{i+1}$, otherwise this would be a hard breakpoint. Besides, note that $\breve{S}_{j-1}$ is a breakpoint.
Initially, suppose that $j > i$. Let $\breve{S}_k$ be a breakpoint such that $k < i$ or $k \ge j$. Such a breakpoint must exist, otherwise $(S_1, \ldots, S_i)$ and $(S_j, \ldots, S_n)$ would have no breakpoints and, since $\xi(S_i)$ and $\xi(S_j)$ are adjacent and $S_j \ne S_{i+1}$, there is no valid value for $S_{i+1}$.
If $k < i$, an intergenic transposition $\tau^{(k+1,i+1,j)}_{(x,y,z)}$ makes the pairs $(S_k, S_{i+1})$, $(S_{j-1}, S_{k+1})$, and $(S_i, S_j)$ adjacent in the new genome. Also, we can set $x$, $y$, and $z$ to ensure that the intergenic region between $S_i$ and $S_j$ is not a breakpoint, since $\breve{S}_i \ge \xi(\breve{S}_i)$. Note that no breakpoints are introduced, since the affected regions are all breakpoints. Additionally, let us assume that the region between $S_k$ and $S_{i+1}$ would become an overcharged breakpoint, that $\mathbb{S}$ has at least 4 breakpoints, and that there is no breakpoint $\breve{S}_r$, $r \ne i$, such that $\breve{S}_r \ge \xi(\breve{S}_r)$. In that case, let $\breve{S}_\ell$ be a breakpoint with $\ell \ne i$, $\ell \ne j-1$, and $\ell \ne k$. We can replace the intergenic transposition $\tau^{(k+1,i+1,j)}_{(x,y,z)}$ to ensure that no overcharged breakpoints are added. Each case leads to an intergenic transposition choice as follows:
• If $\ell < k$, we can use the intergenic transposition $\tau^{(\ell+1,i+1,j)}_{(x,y,z)}$ to make the pairs $(S_\ell, S_{i+1})$, $(S_{j-1}, S_{\ell+1})$, and $(S_i, S_j)$ adjacent in the new genome. Note that the region between $S_\ell$ and $S_{i+1}$ is not a hard breakpoint, because $S_k$ already comes before $S_{i+1}$ in $P$.
• If $\ell \ge j$, we can use the intergenic transposition $\tau^{(i+1,j,\ell+1)}_{(x,y,z)}$ to make the pairs $(S_i, S_j)$, $(S_\ell, S_{i+1})$, and $(S_{j-1}, S_{\ell+1})$ adjacent in the new genome.
[...] In that case, we do not have $(S_i, S_j)$, but we can set $x$, $y$, and $z$ to ensure that the intergenic region between $S_k$ and $S_{i+1}$ is not a breakpoint. We also ensure that the region between $S_i$ and $S_{\ell+1}$ is not a hard breakpoint, because $S_j$ already comes after $S_i$ in $P$.
Note that, if the region between $S_{j-1}$ and $S_{k+1}$, $S_{j-1}$ and $S_{\ell+1}$, or $S_\ell$ and $S_{k+1}$ becomes a hard breakpoint, we can choose the values of $x$, $y$, and $z$ to ensure that it becomes an undercharged breakpoint.
If $k \ge j$, an intergenic transposition $\tau^{(i+1,j,k+1)}_{(x,y,z)}$ makes the pairs $(S_i, S_j)$, $(S_k, S_{i+1})$, and $(S_{j-1}, S_{k+1})$ adjacent in the new genome. Also, we can set $x$, $y$, and $z$ to ensure that the intergenic region between $S_i$ and $S_j$ is not a breakpoint, since $\breve{S}_i \ge \xi(\breve{S}_i)$. Additionally, if $\mathbb{S}$ has at least 4 breakpoints and there is no breakpoint $\breve{S}_r$, $r \ne i$, such that $\breve{S}_r \ge \xi(\breve{S}_r)$, we may replace the intergenic transposition, as in the previous case, to ensure that it does not create overcharged breakpoints.
Now, suppose that $i > j$. Let $\breve{S}_k$ be a breakpoint such that $k < i$ and $k \ge j$. Such a breakpoint must exist, otherwise $(S_j, \ldots, S_i)$ would have no breakpoints, which is a contradiction because the position of $\xi(S_j)$ in $P$ is greater than the position of $\xi(S_i)$. An intergenic transposition $\tau^{(j,k+1,i+1)}_{(x,y,z)}$ makes the pairs $(S_{j-1}, S_{k+1})$, $(S_i, S_j)$, and $(S_k, S_{i+1})$ adjacent in the new genome. Also, we can set $x$, $y$, and $z$ to ensure that the intergenic region between $S_i$ and $S_j$ is not a breakpoint, since $\breve{S}_i \ge \xi(\breve{S}_i)$. Additionally, if $\mathbb{S}$ has at least 4 breakpoints and there is no breakpoint $\breve{S}_r$, $r \ne i$, such that $\breve{S}_r \ge \xi(\breve{S}_r)$, we may replace the intergenic transposition, as in the previous case, to ensure that it does not create overcharged breakpoints.
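The adjacency claims in this proof depend only on how a transposition rearranges the gene order, so they can be checked on strings alone. The sketch below ignores intergenic sizes entirely and assumes, consistently with the block description in the proof of Lemma 1, that a transposition with indices (i, j, k) exchanges the consecutive blocks from gene i to gene j − 1 and from gene j to gene k − 1, with 1-based positions. The function names are introduced here for illustration.

```python
def transpose(genes, i, j, k):
    """Exchange the blocks genes[i..j-1] and genes[j..k-1] (1-based positions)."""
    if not 1 <= i < j < k <= len(genes) + 1:
        raise ValueError("indices must satisfy 1 <= i < j < k <= n + 1")
    a, b, c = i - 1, j - 1, k - 1
    return genes[:a] + genes[b:c] + genes[a:b] + genes[c:]


def adjacent(genes, left, right):
    """True when gene `left` is immediately followed by gene `right`."""
    return any(genes[t] == left and genes[t + 1] == right
               for t in range(len(genes) - 1))


def gene(genes, position):
    """Gene at a 1-based position."""
    return genes[position - 1]
```

For instance, with the genes "abcdefgh" and the proof indices k = 2, i = 4, j = 7, the transposition with indices (k + 1, i + 1, j) = (3, 5, 7) produces "abefcdgh", in which the second gene is followed by the fifth, the sixth by the third, and the fourth by the seventh, as stated for the case k < i.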

Lemma 9
Given two genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$, and an orthologous assignment $\xi$ between them, it is possible to turn $\mathcal{G}$ into $\mathcal{H}$ using at most $cost(\mathbb{S}, \mathbb{P}) + 1$ intergenic transpositions, where $(\mathbb{S}, \mathbb{P})$ is the minimal direct partition derived from the orthologous assignment $\xi$.
Proof We will describe how to apply at most $cost(\mathbb{S}, \mathbb{P}) + 1$ intergenic transpositions in $\mathcal{G}$ to remove all breakpoints from $\mathbb{S}$ and, consequently, to turn $\mathcal{G}$ into $\mathcal{H}$. The intergenic transpositions are applied according to the following cases:
1 If there are two or more overcharged breakpoints in $\mathbb{S}$: Let $\breve{S}_i$ and $\breve{S}_j$ be two overcharged breakpoints and let $\breve{S}_k$ be another breakpoint in $\mathbb{S}$ (such a breakpoint must exist since there are overcharged breakpoints). We can use two intergenic transpositions (Lemma 7) to move the exceeding nucleotides from $\breve{S}_i$ and $\breve{S}_j$ to the intergenic region $\breve{S}_k$.
2 [...] this case must occur, otherwise the amount of intergenic region in $\breve{S}$ would be greater than the amount of intergenic region in $\breve{P}$, which is not possible.
3 If there exists only one overcharged breakpoint $\breve{S}_j$ in $\mathbb{S}$ and there exists no soft breakpoint $\breve{S}_i$ in $\mathbb{S}$ such that $\breve{S}_i \ge \xi(\breve{S}_i)$: In that case $\breve{S}_j$ must have $\xi(\breve{S}_j) + \sum_{b \in B} (\xi(b) - b)$ nucleotides, where $B$ is the set of breakpoints distinct from $\breve{S}_j$, otherwise the amount of intergenic region in $\breve{S}$ would be different from the amount of intergenic region in $\breve{P}$. We consider two sub-cases:
(a) If there is an undercharged breakpoint $\breve{S}_k$: From the quantity of nucleotides in $\breve{S}_j$, we have $\breve{S}_j + \breve{S}_k \ge \xi(\breve{S}_j) + \xi(\breve{S}_k)$. If there exists another breakpoint $\breve{S}_\ell$, then we can use two intergenic transpositions (Lemma 7) to move the necessary number of nucleotides from $\breve{S}_j$ to $\breve{S}_k$ and the exceeding number of nucleotides from $\breve{S}_j$ to $\breve{S}_\ell$. Otherwise, since these are the only breakpoints, we have $\breve{S}_j + \breve{S}_k = \xi(\breve{S}_j) + \xi(\breve{S}_k)$. We can use two intergenic transpositions to redistribute the number of nucleotides between these two regions and remove these two breakpoints as well.
(b) If there is no undercharged breakpoint: There exist at least 3 soft breakpoints, because there must exist a soft breakpoint to ensure the correct quantity of nucleotides and there is no direct intergenic partition with only 1 or 2 soft breakpoints. In that case, we can use two intergenic transpositions (Lemma 7) to move the exceeding number of nucleotides from $\breve{S}_j$ to a soft breakpoint. Afterwards, we can apply intergenic transpositions from Lemma 8 to remove all soft breakpoints and ensure that no overcharged breakpoint is inserted while there are at least 4 breakpoints. When there are 3 breakpoints, at least one will be removed and the others will become hard breakpoints. As there are no longer soft breakpoints, the remaining breakpoints will be removed by cases 1 and 3(a).
With one exception, we remove at least one breakpoint per intergenic transposition. In this way, we can transform $\mathcal{G} = (S, \breve{S})$ into $\mathcal{H} = (P, \breve{P})$ using at most $cost(\mathbb{S}, \mathbb{P}) + 1$ intergenic transpositions.

Lemma 10 (Brito et al. [24])
Given two genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$, and an orthologous assignment $\xi$ between them, it is possible to turn $\mathcal{G}$ into $\mathcal{H}$ using at most $2\,cost(\mathbb{S}, \mathbb{P})$ intergenic reversals, where $(\mathbb{S}, \mathbb{P})$ is the minimal reverse partition derived from the orthologous assignment $\xi$.

Lemma 11 (Brito et al. [24])
Given two genomes $\mathcal{G} = (S, \breve{S})$ and $\mathcal{H} = (P, \breve{P})$, and an orthologous assignment $\xi$ between them, it is possible to turn $\mathcal{G}$ into $\mathcal{H}$ using at most $\frac{3}{2} cost(\mathbb{S}, \mathbb{P})$ intergenic reversals or intergenic transpositions, where $(\mathbb{S}, \mathbb{P})$ is the minimal reverse partition derived from the orthologous assignment $\xi$.
With the bounds presented on the previous lemmas, we can establish a relation between partition and distance problems.

Theorem 4
An ℓ-approximation for the MCISP problem ensures an asymptotic 3ℓ-approximation for the ITD problem.
Proof Let G = (S,S) and H = (P,P) be two co-tailed genomes and let p be the size of the minimum direct intergenic partition between G and H . An algorithm for the MCISP problem with approximation factor ℓ returns a direct intergenic partition (S, P) , such that p ≤ cost(S, P) ≤ ℓp.
By Lemma 9, it is always possible to transform G into H with k intergenic transpositions, such that k ≤ cost(S, P) + 1 ≤ ℓp + 1. Additionally, by Lemma 4, we know that $d_{IT}(G, H) \geq \frac{p}{3}$, that is, $p \leq 3d_{IT}(G, H)$. Consequently, we have $d_{IT}(G, H) \leq k \leq \ell p + 1 \leq 3\ell d_{IT}(G, H) + 1$.
As a consequence of lemmas 4 and 9, we have an asymptotic 3-approximation for the intergenic transposition distance when there are no repeated genes. The best approximation factor known in the literature for that problem is 3.5 [23].

Theorem 5 An ℓ-approximation for the RMCISP problem ensures a 4ℓ-approximation for the IRD problem.
Proof Let G = (S,S) and H = (P,P) be two co-tailed genomes and let p be the size of the minimum reverse intergenic partition between G and H . An algorithm for the RMCISP problem with approximation factor ℓ returns a reverse intergenic partition (S, P) , such that p ≤ cost(S, P) ≤ ℓp.
By Lemma 10, it is always possible to transform G into H with k intergenic reversals, such that k ≤ 2cost(S, P) ≤ 2ℓp. Additionally, by Lemma 5, we know that $d_{IR}(G, H) \geq \frac{p}{2}$, that is, $p \leq 2d_{IR}(G, H)$. Consequently, we have $d_{IR}(G, H) \leq k \leq 2\ell p \leq 4\ell d_{IR}(G, H)$.

Theorem 6 An ℓ-approximation for the RMCISP problem ensures a 4.5ℓ-approximation for the IRTD problem.
Proof Let G = (S,S) and H = (P,P) be two co-tailed genomes and let p be the size of the minimum reverse intergenic partition between G and H . An algorithm for the RMCISP problem with approximation factor ℓ returns a reverse intergenic partition (S, P) , such that p ≤ cost(S, P) ≤ ℓp.
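By Lemma 11, it is always possible to transform G into H with k intergenic reversals or intergenic transpositions, such that $k \leq \frac{3}{2}$cost(S, P) $\leq \frac{3}{2}\ell p$. Additionally, by Lemma 6, we know that $d_{IRT}(G, H) \geq \frac{p}{3}$. Consequently, we have $d_{IRT}(G, H) \leq k \leq \frac{3}{2}\ell p \leq 4.5\ell d_{IRT}(G, H)$.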

2k-approximation for MCISP
This section presents an algorithm for the MCISP problem between two genomes G = (S,S) and H = (P,P) with an approximation factor of 2k, where k = occ(S) . The algorithm was partially inspired by the Kolman and Waleń algorithm [20] that does not consider intergenic regions.
In order to describe the algorithm we need two functions:
• subgen(G, X): the number of subgenomes of G equal to X (each of these subgenomes is an occurrence of X).
• weight(G, H, X) = subgen(G, X) − subgen(H, X): a value indicating how many occurrences of X are in excess in G or in H. If the value is positive, G has more occurrences of X than H. If the value is negative, H has more occurrences of X than G.
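The function weight can be generalized to work on two sequences S and P of genomes, by counting only the occurrences of X that lie inside a block:
$$weight(S, P, X) = \sum_{i=1}^{|S|} subgen(S_i, X) - \sum_{i=1}^{|P|} subgen(P_i, X)$$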
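The functions subgen and weight, including the version of weight for sequences of blocks, can be illustrated with a minimal Python sketch. The data representation is an assumption made here for concreteness and is not taken from the paper: a (sub)genome is a pair `(genes, inter)`, where `genes` is a non-empty string and `inter` is a tuple with the sizes of the intergenic regions between consecutive genes, so that `len(inter) == len(genes) - 1`. Overlapping occurrences are counted separately, and the scan is a brute-force one rather than the suffix-tree approach mentioned later in the text.

```python
from typing import Sequence, Tuple

# Assumed representation: (genes, sizes of the intergenic regions between consecutive genes).
Genome = Tuple[str, Tuple[int, ...]]


def subgen(g: Genome, x: Genome) -> int:
    """Number of subgenomes of g equal to x (same genes and same intergenic sizes)."""
    genes, inter = g
    x_genes, x_inter = x
    m = len(x_genes)
    return sum(
        1
        for i in range(len(genes) - m + 1)
        if genes[i:i + m] == x_genes and inter[i:i + m - 1] == x_inter
    )


def weight(s_blocks: Sequence[Genome], p_blocks: Sequence[Genome], x: Genome) -> int:
    """Excess of occurrences of x inside the blocks of S over those inside the blocks of P."""
    return sum(subgen(b, x) for b in s_blocks) - sum(subgen(b, x) for b in p_blocks)


g = ("ABAB", (1, 2, 1))
h = ("ABBA", (1, 0, 3))
x = ("AB", (1,))
print(weight([g], [h], x))  # 1: G has two copies of x, H has one
# Turning the first intergenic region of G into a breakpoint destroys one copy of x.
print(weight([("A", ()), ("BAB", (2, 1))], [h], x))  # 0
```

With single-block sequences `[g]` and `[h]` the generalized function coincides with weight(G, H, X), which is how the algorithm below starts.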

Lemma 12
Given two genomes G = (S,S) , H = (P,P) , and a pair (S, P) of genome sequences, such that it satisfies the conditions 1 and 2 of direct intergenic partition, we have that (S, P) satisfies the condition 3 if and only if weight(S, P, X ) = 0 for all genomes X contained in G or in H.
Proof First, we argue that if the third condition is satisfied then weight(S, P, X) = 0. Assuming the third condition is satisfied, we have a permutation φ of the numbers 1 to |S| such that S i = P φ i for 1 ≤ i ≤ |S|. Let X be a genome such that X ⊂ G or X ⊂ H. In weight(S, P, X), we are only going to count an occurrence of X in G if it is a subgenome of some block of S. Similarly, we are only going to count an occurrence of X in H if it is a subgenome of some block of P.
Note that the counted occurrences of X in G are in a one-to-one correspondence with the counted occurrences of X in H. More precisely, for a subgenome S i,j k of a block S k such that S i,j k = X, there is a subgenome P i,j φ k of the block P φ k such that P i,j φ k = X. Conversely, for a subgenome P i,j φ k of a block P φ k such that P i,j φ k = X, there is a subgenome S i,j k of the block S k such that S i,j k = X. Consequently, weight(S, P, X) = 0 for every genome X such that X ⊂ G or X ⊂ H.
Now we prove that if weight(S, P, X) = 0 then the third condition is satisfied. By contradiction, let us assume that there is no one-to-one correspondence between blocks of S and blocks of P.
The impossibility of a correspondence may happen for four reasons: (i) there is a block in S that is not equal to any block of P; (ii) there is a genome X equal to r blocks of S, but to only ℓ < r blocks of P; (iii) there is a block in P not equal to any block of S; (iv) there is a genome X equal to r blocks of P, but to only ℓ < r blocks of S. Without loss of generality, we consider only the first two cases.
In case (i), assume that S j is the biggest block of S not equal to any block of P. As weight(S, P, S j ) = 0, we have $1 \leq \sum_{i=1}^{|S|} subgen(S_i, S_j) = \sum_{i=1}^{|P|} subgen(P_i, S_j)$. Consequently, P must have a copy of S j in one of its blocks. Let P s be a block with such copy, i.e., S j ⊂ P s. If P s ≠ S j, then S must have a copy of P s, because weight(S, P, P s ) = 0. This means that S has at least two copies of S j and we must have another copy of S j in P. Following that argument, eventually P must have a block equal to S j, contradicting the assumption of case (i).
In case (ii) we can establish a correspondence between the ℓ blocks of P and some of the r blocks of S. We have at least one block of S without a correspondent in P. If we ignore the blocks with correspondences when calculating the weights, the same argument of case (i) leads to a contradiction.
Given two genomes G and H, we can easily construct a pair of genome sequences (S, P) satisfying the first two conditions of direct intergenic partition. We just have to choose which intergenic regions of G and H will be the breakpoints of S and P, respectively. By Lemma 12, to ensure that (S, P) is a direct intergenic partition of G and H, we must choose the breakpoints such that weight(S, P, X) = 0 for all genomes X of G or H.
Let T G,H be the set of all genomes X such that X ⊂ G or X ⊂ H, and weight(G, H, X) ≠ 0, and consider the subset T min G,H = {X ∈ T G,H | Y ⊄ X, ∀Y ∈ T G,H, Y ≠ X}, i.e., the elements of T G,H that contain no other element of T G,H. Note that, to include a breakpoint in some occurrence of a genome Y ∈ T G,H \ T min G,H, it suffices to include a breakpoint in the correspondent occurrence of a genome X ∈ T min G,H, X ⊂ Y. For that reason, we start by including breakpoints in elements of T min G,H. In fact, the following lemma ensures that we must include at least one breakpoint for each element of T min G,H.
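As a rough illustration of these two sets, the following Python sketch enumerates every subgenome of G and H by brute force, keeps those with non-zero weight, and then filters the ones that contain no other element of the set. The genome representation `(genes, inter)` with `len(inter) == len(genes) - 1` and the helper names `subgenomes`, `t_set`, and `t_min` are assumptions introduced for this example; the paper builds the set with a suffix tree instead.

```python
from typing import Set, Tuple

# Assumed representation: (genes, sizes of the intergenic regions between consecutive genes).
Genome = Tuple[str, Tuple[int, ...]]


def subgen(g: Genome, x: Genome) -> int:
    """Number of subgenomes of g equal to x."""
    genes, inter = g
    m = len(x[0])
    return sum(
        1
        for i in range(len(genes) - m + 1)
        if (genes[i:i + m], inter[i:i + m - 1]) == x
    )


def subgenomes(g: Genome) -> Set[Genome]:
    """All distinct non-empty subgenomes of g."""
    genes, inter = g
    n = len(genes)
    return {(genes[i:j], inter[i:j - 1]) for i in range(n) for j in range(i + 1, n + 1)}


def t_set(g: Genome, h: Genome) -> Set[Genome]:
    """Subgenomes of G or H whose weight(G, H, X) is non-zero."""
    return {x for x in subgenomes(g) | subgenomes(h) if subgen(g, x) != subgen(h, x)}


def t_min(g: Genome, h: Genome) -> Set[Genome]:
    """Elements of t_set(g, h) that contain no other element of the set."""
    t = t_set(g, h)
    return {x for x in t if not any(y != x and subgen(x, y) > 0 for y in t)}


g = ("ABCBD", (1, 2, 3, 4))
h = ("ACBBD", (2, 2, 2, 4))
for x in sorted(t_min(g, h)):
    print(x)
```

For this pair the minimal set consists of six two-gene subgenomes; the adjacency `("BD", (4,))` is absent because it occurs once in each genome and therefore has weight zero.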

Lemma 13
In order to construct a direct intergenic partition (S, P) of two genomes G and H , we must include a breakpoint in at least one copy of every element X ∈ T min G,H .
Proof For a genome X ∈ T G,H , let k = weight(G, H, X ) .
To ensure that weight(S, P, X) = 0, if k > 0 then we must include breakpoints in at least k copies of X in G, otherwise, if k < 0, we must include breakpoints in at least −k copies of X in H. As weight(G, H, X) ≠ 0, we must include at least one breakpoint in G or in H, and the lemma follows.
It may be necessary to include a breakpoint in more than one occurrence of a genome X ∈ T min G,H . We define break(X ) as the breakpoint associated with the genome X , and when we include a breakpoint in an occurrence of X we always select a breakpoint equivalent to break(X ).
To include the breakpoints, we must not only know the genomes contained in G or H with initially non-zero weight, but also keep track of genomes that acquire a non-zero weight after the inclusion of a breakpoint. For that, we generalize the sets T G,H and T min G,H to consider genome sequences. Given two genome sequences S and P, the set T S,P comprises the genomes X such that X ⊂ S i, for 1 ≤ i ≤ |S|, or X ⊂ P j, for 1 ≤ j ≤ |P|, and weight(S, P, X) ≠ 0. Additionally, we have the set T min S,P = {X ∈ T S,P | Y ⊄ X, ∀Y ∈ T S,P, Y ≠ X}. Let us define break(X) for a genome X ∈ T min S,P. If X ∈ T min G,H, break(X) is already defined; otherwise, there must be at least one breakpoint included in some occurrence of X in G or H, so break(X) is equivalent to the first breakpoint included in some occurrence of X.
The algorithm that selects the breakpoints (Algorithm 1) works as follows. Initially we consider two sequences S 0 = [G] and P 0 = [H], each with a single block. At the i-th step, we produce the sequences S i and P i including a breakpoint in the sequences S i−1 and P i−1 based on the following rules:
• The breakpoint is included in an occurrence of a genome X ∈ T min S i−1 ,P i−1 .
• If weight(S i−1 , P i−1 , X) > 0, the selected occurrence of X must come from G.
• If weight(S i−1 , P i−1 , X) < 0, the selected occurrence of X must come from H.
• The selected breakpoint must be equivalent to break(X).
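The selection loop can be sketched in Python under simplifying assumptions. This is not Algorithm 1 itself: the sketch follows the first three rules, but it replaces the break(X) bookkeeping of the fourth rule by always cutting the first intergenic region of the first occurrence found, so it does not carry the approximation guarantee proved for Algorithm 1. It stops when every subgenome of a block has weight zero. The representation `(genes, inter)` and all function names are assumptions of this example, and the input genomes are assumed to be balanced.

```python
from typing import List, Optional, Set, Tuple

Genome = Tuple[str, Tuple[int, ...]]  # assumed: (genes, intergenic sizes between genes)


def find(block: Genome, x: Genome) -> Optional[int]:
    """Start index of the first occurrence of x in block, or None."""
    genes, inter = block
    m = len(x[0])
    for i in range(len(genes) - m + 1):
        if (genes[i:i + m], inter[i:i + m - 1]) == x:
            return i
    return None


def subgen(g: Genome, x: Genome) -> int:
    genes, inter = g
    m = len(x[0])
    return sum(1 for i in range(len(genes) - m + 1)
               if (genes[i:i + m], inter[i:i + m - 1]) == x)


def weight(s: List[Genome], p: List[Genome], x: Genome) -> int:
    return sum(subgen(b, x) for b in s) - sum(subgen(b, x) for b in p)


def subgenomes(g: Genome) -> Set[Genome]:
    genes, inter = g
    n = len(genes)
    return {(genes[i:j], inter[i:j - 1]) for i in range(n) for j in range(i + 1, n + 1)}


def t_min(s: List[Genome], p: List[Genome]) -> Set[Genome]:
    """Minimal subgenomes of the current blocks with non-zero weight."""
    candidates = set().union(*(subgenomes(b) for b in s + p))
    t = {x for x in candidates if weight(s, p, x) != 0}
    return {x for x in t if not any(y != x and subgen(x, y) > 0 for y in t)}


def greedy_partition(g: Genome, h: Genome) -> Tuple[List[Genome], List[Genome]]:
    s, p = [g], [h]
    while True:
        minimal = t_min(s, p)
        if not minimal:  # all weights are zero: the blocks of S and P match
            return s, p
        x = min(minimal)  # arbitrary but deterministic choice
        if len(x[0]) < 2:
            raise ValueError("a single gene has non-zero weight: genomes are not balanced")
        side = s if weight(s, p, x) > 0 else p  # cut where x is in excess
        for idx, block in enumerate(side):
            pos = find(block, x)
            if pos is not None:
                genes, inter = block
                # The intergenic region inter[pos] becomes a breakpoint.
                side[idx:idx + 1] = [(genes[:pos + 1], inter[:pos]),
                                     (genes[pos + 1:], inter[pos + 1:])]
                break


s, p = greedy_partition(("ABCBD", (1, 2, 3, 4)), ("ACBBD", (2, 2, 2, 4)))
print(s)
print(p)
```

Each iteration removes one intergenic region from some block, so the loop terminates; for balanced genomes it ends, at the latest, when all blocks are single genes.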
The algorithm continues until weight(S i , P i , X) = 0 for all genomes X of G or H, i.e., until (S i , P i ) becomes a direct intergenic partition.

Let us briefly discuss the time complexity of Algorithm 1. Let n be the size of the input strings. First, we consider the complexity to build T min G,H. Using the suffix tree data structure [29] (constructed in time O(n)), subgen(G, X) is computed in O(n) time, and, consequently, so is weight(G, H, X). Similarly, for a genome Y, we can recover the genomes X contained in G or H, such that Y ⊂ X, in O(n) time. Since there are $2n^2$ subgenomes of G and H, the set T min G,H can be constructed in $O(n^3)$ time. We can also store which subgenomes belong to the set T min S i ,P i in a suffix tree allowing the update of […]

Proof Directly from lemmas 13 and 14.
Corollary 1 Algorithm 1 has an approximation factor of 2k for the MCSP problem between the strings S and P, where k = occ(S).
Proof Using the same reduction presented in Theorem 2, but considering the optimization versions of the problems, we can apply Algorithm 1 to the MCSP problem and ensure the approximation factor 2k.
It is worth noting that we improve the previously known Θ(k) approximation of MCSP [20] from 4k to 2k.

Corollary 2 Algorithm 1, in combination with the algorithm described in Lemma 9, ensures an asymptotic approximation factor of 6k for the ITD problem between the genomes G = (S,S) and H = (P,P), where k = occ(S).
Proof Directly from theorems 4 and 7.

2k-approximation for RMCISP
We can adapt Algorithm 1 to approximate the RMCISP problem. The main point of the adaptation is to use congruence of genomes instead of equality and to substitute the relation X ⊂ G with a new relation X ⊏ G, such that X ⊏ G if X ⊂ G or rev(X) ⊂ G. Using this relation, the functions and sets from the previous section must be adapted:
• subgen(G, X) is now the number of subgenomes of G congruent to X (i.e., equal to X or to rev(X)). Consequently, weight now considers this new subgen function.
• T G,H is now the set of all genomes X such that X ⊏ G or X ⊏ H, and weight(G, H, X) ≠ 0. Additionally, T min G,H = {X ∈ T G,H | ¬(Y ⊏ X), ∀Y ∈ T G,H, Y ≇ X} (T min S,P is adapted in a similar manner).
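The adapted counting function can be sketched in Python as follows. The representation `(genes, inter)`, with `inter` holding the sizes of the intergenic regions between consecutive genes, and the names `rev` and `subgen_congruent` are assumptions of this example. Since genes are unoriented in this model, reversing a genome is taken here to mean reversing both the gene string and the list of intergenic sizes. Each window is tested once against the set {X, rev(X)}, so a subgenome that equals its own reverse is not counted twice.

```python
from typing import Tuple

Genome = Tuple[str, Tuple[int, ...]]  # assumed: (genes, intergenic sizes between genes)


def rev(x: Genome) -> Genome:
    """Reverse of x: gene order and intergenic region order are both reversed."""
    genes, inter = x
    return genes[::-1], inter[::-1]


def subgen_congruent(g: Genome, x: Genome) -> int:
    """Number of subgenomes of g congruent to x, i.e. equal to x or to rev(x)."""
    genes, inter = g
    m = len(x[0])
    targets = {x, rev(x)}
    return sum(
        1
        for i in range(len(genes) - m + 1)
        if (genes[i:i + m], inter[i:i + m - 1]) in targets
    )


g = ("ABCBA", (1, 2, 2, 1))
print(subgen_congruent(g, ("AB", (1,))))     # 2: "AB" at the start and "BA" at the end
print(subgen_congruent(g, ("BCB", (2, 2))))  # 1: equal to its own reverse, counted once
```

Replacing the direct subgen by this function in weight, T G,H, and T min G,H gives the reverse versions of the sets described above.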
Some other adaptations must be made on Algorithm 1. Line 5 must check if (S, P) is a reverse intergenic partition instead of a direct intergenic partition. In lines 9 and 13, the block must contain an occurrence of X or rev(X ) , and the breakpoint in lines 10 and 14 must be congruent to break(X ) instead of equivalent to break(X ) . Next, we show analogous results to the ones presented in the previous section.

Lemma 15
Given two genomes G = (S,S) , H = (P,P) , and a pair (S, P) of genome sequences, such that it satisfies conditions 1 and 2 of reverse intergenic partition, we have that (S, P) satisfies condition 3 if and only if weight(S, P, X ) = 0 for all genomes X , such that X ⊏ G or X ⊏ H.
Proof First, we argue that if the third condition is satisfied then weight(S, P, X) = 0. Assuming the third condition is satisfied, we have a permutation φ of the numbers 1 to |S| such that S i ≅ P φ i for 1 ≤ i ≤ |S|. Let X be a genome such that X ⊏ G or X ⊏ H. In weight(S, P, X), we are only going to count an occurrence of X or rev(X) in G if it is a subgenome of some block of S. Similarly, we are only going to count an occurrence of X or rev(X) in H if it is a subgenome of some block of P.
Note that the counted occurrences of X or rev(X) in G are in a one-to-one correspondence with the counted occurrences in H. More precisely, for a subgenome S i,j k of a block S k such that S i,j k ≅ X, there is a subgenome P i,j φ k of a block P φ k such that P i,j φ k ≅ X. Conversely, for a subgenome P i,j φ k of a block P φ k such that P i,j φ k ≅ X, there is a subgenome S i,j k of a block S k such that S i,j k ≅ X. Consequently, weight(S, P, X) = 0 for every genome X such that X ⊏ G or X ⊏ H.
Now we prove that if weight(S, P, X ) = 0 then the third condition is satisfied. By contradiction let us assume that there is no one-to-one correspondence between blocks of S and blocks of P.
The impossibility of a correspondence may happen for four reasons: (i) there is a block in S that is not congruent to any block of P; (ii) there is a genome X congruent to r blocks of S, but it is congruent to ℓ < r blocks of P; (iii) there is a block in P not congruent to any block of S; (iv) there is a genome X congruent to r blocks of P, but it is congruent to ℓ < r blocks of S. Without loss of generality, we consider only the first two cases.
In case (i), assume that S j is the biggest block of S not congruent to any block of P. As weight(S, P, S j ) = 0, we have $1 \leq \sum_{i=1}^{|S|} subgen(S_i, S_j) = \sum_{i=1}^{|P|} subgen(P_i, S_j)$. Consequently, P must have a copy of S j or rev(S j ) in one of its blocks. Let P s be a block with such copy, i.e., S j ⊏ P s. If P s ≇ S j, then S must have a copy of P s or rev(P s ), because weight(S, P, P s ) = 0. This means that S has at least two copies of S j or rev(S j ) and we must have another copy of S j or rev(S j ) in P. Following that argument, eventually P must have a block equal to S j or rev(S j ), contradicting the assumption of case (i).
In case (ii) we can establish a correspondence between the ℓ blocks of P and some of the r blocks of S . We have at least one block of S without a correspondent in P . If we ignore the blocks with correspondences when calculating the weights, the same argument of case (i) leads to a contradiction.

Lemma 16
In order to construct a reverse intergenic partition (S, P) of two genomes G and H , we must include a breakpoint in at least one copy of every element X ∈ T min G,H .
Proof For a genome X ∈ T G,H , let k = weight(G, H, X ) .
To ensure that weight(S, P, X) = 0, if k > 0, then we must include breakpoints in at least k copies of X or rev(X) in G, otherwise, if k < 0, we must include breakpoints in at least −k copies of X or rev(X) in H. As weight(G, H, X) ≠ 0, we must include at least one breakpoint in G or in H, and the lemma follows.

Lemma 17
The adaptation of Algorithm 1 produces a reverse intergenic partition of two genomes G = (S,S) and H = (P,P), including at most 2k|T min G,H | breakpoints, where k = occ(S).
Proof We know the algorithm stops producing a reverse intergenic partition for the same reason stated in Lemma 14. Additionally, every breakpoint is included in an occurrence of a genome from T min G,H or is congruent to an already included breakpoint. Consequently, every breakpoint is congruent to break(X ) for some X ∈ T min G,H . As there is a maximum of k copies for each gene in G and a maximum of k copies for each gene in H , every breakpoint is congruent to a maximum of 2k − 1 other breakpoints, so we include at most 2k|T min G,H | breakpoints.

Theorem 8
The adaptation of Algorithm 1 has an approximation factor of 2k for the RMCISP problem between the genomes G = (S,S) and H = (P,P) , where k = occ(S).
Proof Directly from lemmas 16 and 17.

Corollary 3
The adaptation of Algorithm 1 has an approximation factor of 2k for the RMCSP problem between the strings S and P, where k = occ(S).
Proof Applying a reduction, as in Corollary 1, we can apply the adaptation of Algorithm 1 to the RMCSP problem and ensure the approximation factor 2k.
It is worth noting that we improve the previously known Θ(k) approximation of RMCSP [20] from 8k to 2k.

Corollary 4
The adaptation of Algorithm 1 combined with the algorithm described by Brito et al. [24] for the Sorting Permutations by Intergenic Reversals problem ensures an approximation factor of 8k for the IRD problem between the genomes G = (S,S) and H = (P,P), where k = occ(S).
Proof Directly from theorems 5 and 8.

Corollary 5
The adaptation of Algorithm 1 combined with the algorithm described by Brito et al. [24] for the Sorting Permutations by Intergenic Reversals and Transpositions problem ensures an approximation factor of 9k for the IRTD problem between the genomes G = (S,S) and H = (P,P) , where k = occ(S).
Proof Directly from theorems 6 and 8.

1 For the source genome G = (S,S), we constructed the string S by selecting 100 characters from a uniform distribution of m characters (correspondent to an alphabet Σ, such that S ⊂ Σ); each character could be selected more than once. Afterwards, we constructed the list S by randomly choosing each intergenic region from integers in the interval [0, 100]; each integer had the same probability of being chosen.
2 For the target genome H = (P,P), we applied o operations to the source genome. The type of operation depends on the database. In the TRANS database, we applied o intergenic transpositions $\tau^{(i,j,k)}_{(x,y,z)}$, where the values of i, j, k, x, y, and z were randomly chosen. In the REV database, we applied o intergenic reversals $\rho^{(i,j)}_{(x,y)}$, where the values i, j, x, and y were randomly chosen. In the REVTRANS database we applied o/2 intergenic reversals and o/2 intergenic transpositions. These operations were applied in a random order and the parameters of each one were randomly chosen.
3 We performed the extension process by adding two extra characters in the extremities of the source and target genomes to ensure that they are co-tailed. Note that both genomes have a final size of 102.
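Steps 1 and 3 of this procedure can be illustrated with a short Python sketch; step 2 is omitted because it depends on the definitions of the intergenic operations given elsewhere in the paper. Several details are assumptions made for this example and are not stated in the text: genes are integer labels 1 to m instead of characters, the source genome has one intergenic region on each side of every gene (101 regions for 100 genes), the extension uses the labels 0 and m + 1, and the function names and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def random_source_genome(m: int, n: int = 100, max_region: int = 100,
                         seed: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Step 1: n genes drawn uniformly from m labels, regions uniform in [0, max_region]."""
    rng = random.Random(seed)
    genes = [rng.randint(1, m) for _ in range(n)]                # repetition allowed
    regions = [rng.randint(0, max_region) for _ in range(n + 1)]  # includes both extremities
    return genes, regions


def extend(genes: List[int], regions: List[int], m: int) -> Tuple[List[int], List[int]]:
    """Step 3: add one extra gene at each extremity so that genomes are co-tailed.

    After the extension every region lies between two genes.
    """
    return [0] + genes + [m + 1], list(regions)


genes, regions = extend(*random_source_genome(m=10, seed=1), m=10)
print(len(genes), len(regions))  # 102 101
```

Because the same two extra labels are added to the source and the target genome, both start and end with the same gene, which is what co-tailed requires.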
In these tests, for each pair of genomes from the TRANS database, we computed the direct intergenic partition from our algorithm, and for each pair of genomes from the REV and REVTRANS databases, we computed the reverse intergenic partition from our algorithm. Afterwards, we produced 100 orthologous assignments capable of inducing each partition. We ensured that each possible assignment had the same probability of being chosen.
For each assignment, we computed the distance between the genomes using the assignment. The distances are computed by a different algorithm for each database: for the TRANS database, we used the algorithm described in Lemma 9 (implemented in C++); for the REV and REVTRANS databases, we used the algorithms for reversals and reversals and transpositions from Brito et al. [24] (implemented in Python), respectively.
To compare with the distances that do not consider the partitions, we also produced, for each genome pair, 100 assignments that do not take into account the partitions. We computed the distances for each of these assignments as well.
Tables 2, 3, and 4 show the distances for the TRANS, REV, and REVTRANS databases, respectively. Each line corresponds to a set of 100 genome pairs; the first two columns indicate, respectively, the number of operations and the size of the alphabet used to generate the set. The following seven columns present the results considering the partitions. For each genome pair, we consider the minimum and average distance from all 100 assignments. For each set, we report the minimum (Min.), average (Avg.), and maximum (Max.) for those two values. We also report the average time, in seconds, necessary to produce the partition and compute the 100 distances. The last seven columns present the same values for the distances that do not consider the partitions. In that case, the time reported refers only to calculating the distances. Figures 8, 9, and 10 show box plots with the average distances for the TRANS, REV, and REVTRANS databases, respectively.
From Table 2 and Fig. 8, we see that in the TRANS database the distances considering the partitions are lower than the distances that do not take the partitions into account. For sets generated with 25 transpositions, the minimum distances without partition are, on average, at least 39% higher than the minimum distances with partition. For the average distance, the difference is at least 60% on average. The difference between the distances decreases as the number of operations or the size of the alphabet increases. For sets generated with 100 transpositions and alphabet of size 10, the minimum and average distances without partition are on average 8% higher than the minimum or average distances with partition. For sets generated with 100 transpositions and alphabet of size 100, the minimum distances without partition are on average 3% higher than the minimum distances with partition. For the average distance, the difference is 5% on average. It is worth mentioning that with 100 operations we have an extreme case, where each source genome is considerably shuffled to produce the corresponding target genome of the pair. It is also interesting that with smaller alphabets, when the number of replicas increases, the advantage of using the partitions also increases. Looking at the running times, we see that, for the transposition model, we must pay a small cost to produce better distances using the partitions.

From Table 3 and Fig. 9, we see that in the REV database the distances considering the partitions are still lower than the distances that do not take the partitions into account, and the differences between distances are higher for this database. For sets generated with 25 reversals, the minimum distances without partition are, on average, at least 149% higher than the minimum distances with partition. For the average distance, the difference is at least 173% on average.
Again, the difference between the distances decreases as the number of operations or the size of the alphabet increases. However, even in sets generated with 100 reversals and alphabet of size 100, the minimum distances without partition are on average 14% higher than the minimum distances with partition. For the average distance, the difference is 16% on average. In the REV database, we see that the running time considering the partition was lower than the running time without the partition. This happened because the 100 runs of the distance algorithm were slower than the partition algorithm, and using assignments that consider the partition tends to reduce the running time of the distance algorithm, as the number of breakpoints tends to be smaller than the number of breakpoints considering a random assignment.
From Table 4 and Fig. 10, we see that in the REVTRANS database the distances considering the partitions are still lower than the distances that do not take the partitions into account. The differences were higher than those from the TRANS database, but smaller than those from the REV database. For sets generated with 25 operations, the minimum distances without partition are, on average, at least 90% higher than the minimum distances with partition. For the average distance, the difference is at least 105% on average. Again, the difference between the distances decreases as the number of operations or the size of the alphabet increases. In sets generated with 100 operations and alphabet of size 100, the minimum distances without partition are on average 7% higher than the minimum distances with partition. For the average distance, the difference is 10% on average. For the sets generated with at most 75 operations, the running time considering the partition was lower than the running time without the partition.

Considering all results, we see that the partitions improve the distances and the improvement is higher for smaller alphabets or closer genomes (genomes that can be turned into one another with fewer operations). We can also see that with partitions, we have either a small cost in the running time, when the distance algorithm takes less time than the partition algorithm, or a large gain in running time, when the distance algorithm takes more time than the partition algorithm.

Conclusion
We defined the intergenic transposition distance (ITD), the intergenic reversal distance (IRD), the intergenic reversal and transposition distance (IRTD), the minimum common intergenic string partition (MCISP), and the reverse minimum common intergenic string partition (RMCISP) problems. Next, we described a relation between the partition and distance problems and a Θ(k)-approximation for the MCISP and RMCISP problems ensuring a Θ(k)-approximation for the ITD, IRD, and IRTD problems. Our algorithm for the MCISP and RMCISP problems may also be applied to the MCSP and RMCSP problems, which do not consider intergenic regions, improving a previously known approximation. We also performed practical tests on simulated genomes, showing that the distances calculated considering the partitions were lower than the distances calculated without taking partitions into account.
As future work, one can extend our approach by considering the orientation of the genes. Additionally, one possible approach to overcome the balanced genome restriction is to consider non-conservative events, such as insertion and deletion, similarly to the work of Alexandrino et al. [30] with the Intergenic Reversal Distance without gene repetition.