An improved approximation algorithm for the reversal and transposition distance considering gene order and intergenic sizes

Background In the comparative genomics field, one of the goals is to estimate a sequence of genetic changes capable of transforming one genome into another. Genome rearrangement events are mutations that can alter the genetic content or the arrangement of elements in the genome. Reversal and transposition are two of the most studied genome rearrangement events. A reversal inverts a segment of a genome, while a transposition swaps two consecutive segments. Initial studies in the area considered only the order of the genes. Recent works have incorporated other genetic information into the model, in particular the sizes of intergenic regions, the structures between each pair of genes and at the extremities of a linear genome. Results and conclusions In this work, we investigate the sorting by intergenic reversals and transpositions problem on genomes sharing the same set of genes, considering the cases where the orientation of genes is known and unknown. We also explore a variant of the problem that generalizes the transposition event. As a result, we present an approximation algorithm that guarantees an approximation factor of 4 for both cases considering the reversal and transposition (classic definition) events, an improvement from the 4.5-approximation previously known for the scenario where the orientation of the genes is unknown. We also present a 3-approximation algorithm by incorporating the generalized transposition event, and we propose a greedy strategy to improve the performance of the algorithms. Practical tests with simulated data indicated that, in both cases, the algorithms tend to outperform the best-known algorithms for the problem. Lastly, we conducted experiments using real genomes to demonstrate the applicability of the algorithms.


Background
In the comparative genomics field, there are many ways to compare genomic features like DNA sequence, gene order, and genomic landmarks from different organisms. Genome rearrangement events are mutations that affect large stretches of the DNA sequence. Determining the shortest sequence of such events that can transform one genome into another is widely used as a metric to study evolutionary relationships among organisms, as well as to explain biological similarities and differences. Reversal and transposition are two of the most studied genome rearrangement events in the literature [1][2][3]. A reversal inverts a segment of a genome, and a transposition moves a segment of a genome to another position.
One way to represent a genome is by using the gene order as the only genomic trait, which can be encoded as a sequence of elements, where each element represents a gene. When the compared genomes share the same set of genes and do not have replicated genes, we model them as permutations of natural numbers, such that each number in the sequence appears once. Furthermore, if the orientation of the genes is known, a plus or a minus sign (+ or −) is assigned to each element of the permutation to indicate its orientation, and we say that the permutation is a signed permutation. Otherwise, signs are omitted and the permutation is called unsigned.
It is always possible to map the target genome to a permutation whose elements are in increasing order. This permutation is called the identity permutation and is denoted by ι = (1 2 . . . n) in the unsigned case and by ι = (+1 +2 . . . +n) in the signed case. Therefore, the transformation from a source genome into a target genome can be seen as a sorting problem.
First studies in the genome rearrangement field considered a single type of rearrangement event, which led to solutions specific to that type. In particular, the reversal event leads to the sorting by reversals problem, which has a polynomial-time algorithm on signed permutations [1], whereas it is NP-hard on unsigned permutations [4] and the best algorithm has an approximation factor of 1.375 [2]. The transposition event leads to the sorting by transpositions problem, which is NP-hard [5] and whose best algorithm has an approximation factor of 1.375 [3]. By allowing both reversals and transpositions we have the sorting by reversals and transpositions problem, which is NP-hard on signed and unsigned permutations [6]. The best algorithms have approximation factors of 2 [7] and 2.8334 + ε [8,9] for signed and unsigned permutations, respectively.
The gene order was fundamental to the initial development of rearrangement distance models. However, recent studies indicate that incorporating genetic information other than the gene order can generate more realistic models [10,11]. In particular, the sizes of intergenic regions (structures with a specific number of nucleotides between each pair of genes and at the extremities of genomes) were incorporated into the mathematical models.
The Double Cut and Join (DCJ) is a rearrangement event that cuts the genome at two points and reassembles the stretches following a predetermined criterion. The problem of sorting by DCJs with intergenic regions is NP-hard [12], but there is a polynomial-time algorithm when DCJs are used together with insertions and deletions on intergenic regions [13]. The block-interchange is a rearrangement event that swaps the positions of two segments (not necessarily consecutive) of the genome. The sorting by intergenic block-interchanges problem has a 2-approximation algorithm [14] and its complexity is unknown. Considering the reversal event, we have the sorting by intergenic reversals problem, which is NP-hard on signed and unsigned permutations [15,16], and the best algorithms have approximation factors of 2 [15] and 4 [16], respectively. The sorting by intergenic transpositions problem is NP-hard, and the best algorithm has an approximation factor of 3.5 [17]. The sorting by intergenic reversals and transpositions (SbIRT) problem is NP-hard on signed and unsigned permutations [16,17], and the best algorithms have approximation factors of 3 [17] and 4.5 [16], respectively. The SbIRT problem with the generalized definition of the transposition event on signed permutations has an approximation algorithm with a factor of 2.5 [17].
The SbIRT problem with an additional constraint that limits the number of genes affected by each operation, called super short operations, was also investigated [18]: a 5-approximation algorithm was proposed for signed permutations and a 3-approximation algorithm for unsigned permutations.
In this work, we investigate the SbIRT problem on signed and unsigned permutations. For the unsigned case, we present an improved algorithm based on intergenic breakpoints that guarantees an approximation factor of 4. We also show a 3-approximation algorithm for the SbIRT problem on unsigned permutations considering the generalized definition of the transposition event.
For the signed case, we show approximation algorithms with the same approximation factors as in the unsigned cases. Although these theoretical factors are higher than the previously known ones for the signed case, tests with simulated data indicated that our algorithms tend to provide better practical results. We propose a greedy strategy to improve the algorithms' performance and tested them using simulated and real data.
This manuscript is organized as follows. "Definitions" section presents concepts and definitions used throughout the paper. "Theoretical results" section shows a lower bound and the approximation algorithms for the SbIRT problem. "Practical results" section shows the experiments using real and simulated data. "Conclusion" section concludes the paper and introduces future directions.

Definitions
The problem we investigate uses information about the source and target genomes. We assume that both genomes share the same set of genes and that there are no replicated genes. Thus, given a linear genome G = (i 1 , g 1 , i 2 , g 2 , . . . , i n , g n , i n+1 ) with n genes and n + 1 intergenic regions, we use (i) a permutation π , representing the order of the genes, and (ii) a list of non-negative integers π , representing the sizes of the intergenic regions. If the orientation of the genes is known, a "+" or "−" sign is associated with each element of the permutation π to indicate its orientation. We use π i , 1 ≤ i ≤ n , to denote the element in position i of π . Similarly, we denote by π i the size of the intergenic region to the left of π i . The intergenic region π n+1 is to the right of π n .
For convenience, we map the genes of the target genome to the identity permutation ι = (1 2 . . . n) for the case where the orientation of the genes is unknown and to ι = (+1 +2 . . . +n) otherwise. The permutation π of the source genome is then mapped according to how we assigned elements to genes while mapping the target genome to the identity permutation, so the source and target genomes are represented as (π, π) and (ι, ι) , respectively. Since the identity permutation is fixed given the size of the genomes, an instance of the SbIRT problem is composed of (π, π, ι) . Figure 1 shows the representation of a genome G as (π, π).
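As a concrete illustration, an instance (π, π, ι) can be encoded with plain Python lists; the helper names below are ours, not the paper's, and the extended-form sentinels follow the convention described in this section:

```python
# A minimal sketch (names are ours) of an SbIRT instance: gene order as a
# list of ints, and intergenic sizes as a list of n + 1 non-negative ints.
def make_instance(genes, intergenic, target_intergenic):
    """Bundle a source genome and the target intergenic sizes."""
    assert len(intergenic) == len(genes) + 1, "n genes need n+1 regions"
    assert len(target_intergenic) == len(genes) + 1
    return {"pi": list(genes), "pi_sizes": list(intergenic),
            "iota_sizes": list(target_intergenic)}

def extend(pi):
    """Extended form: sentinels pi_0 = 0 and pi_{n+1} = n + 1 at the ends."""
    return [0] + list(pi) + [len(pi) + 1]

# Example: 5 genes, 6 intergenic regions on each side (totals match).
inst = make_instance([4, 2, 1, 3, 5], [3, 0, 2, 1, 4, 2], [2, 2, 2, 2, 2, 2])
```

Note that a valid instance additionally requires the source and target intergenic sizes to have the same total, since the operations considered here only redistribute nucleotides.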

Fig. 1
On the top, we have a fictitious genome G with 5 genes, an intergenic region between each pair of genes, and one at each extremity of G . Each intergenic region has a specific number of nucleotides (represented by the letters A, C, G, or T). At the bottom, we have the genome representation for the SbIRT problem, using a permutation π and a list π , representing the order of the genes and the sizes of the intergenic regions, respectively. Observe that for each intergenic region π i we have the information about the number of nucleotides inside it.

We extend a permutation π by adding the elements π 0 = 0 and π n+1 = (n + 1) at the beginning and at the end of π , respectively. We hereafter assume that permutations are in extended form, and we refer to them simply as permutations. In the following, we present concepts and definitions used in previous works [16] regarding the SbIRT problem.

Definition 2.3
Given an unsigned instance I = (π, π, ι) of the SbIRT problem, a pair of elements (π i , π i+1 ) , such that 0 ≤ i ≤ n , is an intergenic breakpoint type one if one of the following cases occurs:

Definition 2.4
Given an unsigned instance I = (π,π ,ι) of the SbIRT problem, a pair of elements (π a , π b ) is an intergenic adjacency if |a − b| = 1 and (π min(a,b) , π max(a,b) ) is not an intergenic breakpoint type one.
In other words, an intergenic breakpoint type one indicates a region that must be affected by a rearrangement event to fix either the order of the genes or the size of an intergenic region to reach the target genome. On the other hand, an intergenic adjacency indicates a pair of genes that are consecutive in the target genome such that the intergenic region between them has the same size as in the target genome. From now on, we refer to intergenic breakpoints and intergenic adjacencies simply as breakpoints and adjacencies, respectively.

Definition 2.5
A breakpoint type one (π i , π i+1 ) , such that |π i+1 − π i | = 1 , is overcharged if π i+1 > ι x , where x = max(π i , π i+1 ) , and undercharged otherwise.

Definition 2.6
A pair of breakpoints type one (π i , π i+1 ) and (π j , π j+1 ) is connected if the following conditions are met:
1 The pair of elements (π i , π i+1 ) , (π j , π j+1 ) , (π i , π j ) , (π i , π j+1 ) , (π i+1 , π j ) , or (π i+1 , π j+1 ) is consecutive in ι and does not form an adjacency in (π, π).
2 π i+1 + π j+1 ≥ ι k , where ι k is the size of the intergenic region between the consecutive elements (from condition 1) in ι.
A pair of connected breakpoints indicates that it is possible to form an adjacency using only the nucleotides from the intergenic regions of the two breakpoints. Note that a pair of connected breakpoints has at least one pair of consecutive elements among π i , π i+1 , π j , and π j+1 . Besides, the number of nucleotides in both breakpoints ( π i+1 + π j+1 ) is at least the size of the intergenic region between the consecutive elements in ι. Now, we introduce new definitions which are used to derive our results.

Definition 2.7
A breakpoint type one (π i , π i+1 ) is called hard if it is overcharged or undercharged, and soft otherwise.
Note that in a hard breakpoint the pair of genes is consecutive in the target genome, but the intergenic region between them does not have the same size as in the target genome.
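Since the case list of Definition 2.3 is omitted in this excerpt, the sketch below assumes the usual convention: a pair is an adjacency exactly when the genes are consecutive and the intergenic region between them matches the target size; hard (overcharged/undercharged) and soft breakpoints then follow definitions 2.5 and 2.7:

```python
def classify(pi_ext, pi_sizes, iota_sizes):
    """Classify each pair (pi[i], pi[i+1]) of an extended unsigned instance.

    Assumption (Definition 2.3's cases are elided above): adjacency when
    the genes are consecutive AND the region size matches the target;
    otherwise a breakpoint, which is hard (over/undercharged) when only
    the size is wrong, and soft when the genes are not consecutive.
    pi_sizes[i] / iota_sizes[i] is the region between positions i and i+1.
    """
    labels = []
    for i in range(len(pi_ext) - 1):
        a, b = pi_ext[i], pi_ext[i + 1]
        target = iota_sizes[max(a, b) - 1]  # region left of gene max(a,b) in iota
        if abs(b - a) == 1 and pi_sizes[i] == target:
            labels.append("adjacency")
        elif abs(b - a) == 1:
            labels.append("overcharged" if pi_sizes[i] > target else "undercharged")
        else:
            labels.append("soft")
    return labels
```

For example, on the extended permutation (0 2 1 3) with all region sizes equal to the target, only the pair (2, 1) is not a breakpoint under this convention.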
Definition 2.8
A pair of breakpoints type one (π i , π i+1 ) and (π j , π j+1 ) is called softly connected if they are connected and both breakpoints are soft.

Definition 2.9
A hard breakpoint (π i , π i+1 ) is called super hard if one of the following cases occurs:
• i = 0 or i = n (the breakpoint is in an extremity of the genome).
• (π i−1 , π i ) or (π i+1 , π i+2 ) is a hard breakpoint or an adjacency.
Note that a super hard breakpoint is in one of the extremities of the genome, or is immediately preceded or followed by a hard breakpoint or an adjacency.

Definition 2.10
Given an unsigned instance I = (π, π, ι) , strips are maximal sequences of consecutive elements of π without soft breakpoints. A strip with only one element π i is called a singleton, and it is increasing if i ∈ {0, (n + 1)} and decreasing otherwise. A strip with more than one element is increasing if its elements form an increasing sequence, and decreasing otherwise.
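Definition 2.10 can be sketched directly; in this illustration (names are ours) `soft_breaks` is assumed to hold the indices i of the soft breakpoints (π i , π i+1 ) computed beforehand:

```python
def strips(pi_ext, soft_breaks):
    """Split the extended permutation into strips: maximal runs of
    consecutive elements not interrupted by a soft breakpoint."""
    out, cur = [], [pi_ext[0]]
    for i in range(len(pi_ext) - 1):
        if i in soft_breaks:          # a soft breakpoint ends the strip
            out.append(cur)
            cur = []
        cur.append(pi_ext[i + 1])
    out.append(cur)
    return out

def strip_kind(strip, pos, n):
    """'increasing' or 'decreasing' per Definition 2.10; pos is the
    position of the strip's first element. Singletons are increasing
    only at the extremities (positions 0 and n + 1)."""
    if len(strip) == 1:
        return "increasing" if pos in (0, n + 1) else "decreasing"
    return "increasing" if strip[1] > strip[0] else "decreasing"
```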

Remark 2.1
The only unsigned instance I of the SbIRT problem such that b 1 (I) = 0 is (ι,ι,ι) . Similarly, the only signed instance I ′ of the SbIRT problem, such that b 2 (I ′ ) = 0 is (ι,ι,ι) . Thus, to transform (π,π) into (ι,ι) it is necessary to remove all the breakpoints of an instance. Figure 2 shows the concepts using a representation of the source and target genomes.

Theoretical results
In this section, we show lower bounds and present approximation algorithms for both cases of the SbIRT problem. We start by showing how many breakpoints a reversal and a transposition can remove in the best scenario.

Lemma 3.1
Given an unsigned instance I 1 = (π, π, ι) and a signed instance I 2 = (π, π, ι) of the SbIRT problem, Δb 1 (I 1 , ρ) ≥ −2 and Δb 2 (I 2 , ρ) ≥ −2 for any reversal ρ (i,j) (x,y) , respectively.
Proof Recall that a reversal affects two pairs of consecutive elements of π . In the best case, both (π i−1 , π i ) and (π j , π j+1 ) are breakpoints, and the reversal ρ (i,j) (x,y) removes them.
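For intuition, an intergenic reversal can be implemented as below. The cut convention used here (x nucleotides kept to the left of the left cut, y to the right of the reversed block's original right end) is one common choice in the intergenic-rearrangement literature and may differ from the paper's exact parameterization:

```python
def reversal(pi, sizes, i, j, x, y):
    """Intergenic reversal rho(i,j)(x,y), 0-based, applied in place.
    sizes[k] is the region between pi[k] and pi[k+1]. Cut x nucleotides
    off the left border region and y off the right one, then invert
    pi[i..j] together with the interior regions (our convention)."""
    assert 1 <= i <= j < len(pi) - 1
    assert 0 <= x <= sizes[i - 1] and 0 <= y <= sizes[j]
    left_rest, right_rest = sizes[i - 1] - x, sizes[j] - y
    pi[i:j + 1] = pi[i:j + 1][::-1]   # invert the gene segment
    sizes[i:j] = sizes[i:j][::-1]     # interior regions flip with it
    sizes[i - 1] = x + y              # pieces rejoined at the cuts
    sizes[j] = left_rest + right_rest

pi, sizes = [0, 1, 2, 3], [5, 5, 5]
reversal(pi, sizes, 1, 2, 2, 1)       # pi becomes (0 2 1 3)
```

Note that the total number of nucleotides is preserved, matching the fact that reversals only redistribute intergenic nucleotides.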

Lemma 3.2
Given an unsigned instance I 1 = (π, π, ι) and a signed instance I 2 = (π, π, ι) of the SbIRT problem, Δb 1 (I 1 , τ ) ≥ −3 and Δb 2 (I 2 , τ ) ≥ −3 for any transposition τ (i,j,k) (x,y,z) , respectively.

Proof The proof is similar to the one described in Lemma 3.1, considering that a transposition can affect up to three pairs of consecutive elements and, therefore, up to three breakpoints.
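Likewise, an intergenic transposition can be sketched as follows; the cut convention (x, y, z nucleotides kept to the left of each of the three cuts) is our assumption and may differ in detail from the paper's definition:

```python
def transposition(pi, sizes, i, j, k, x, y, z):
    """Intergenic transposition tau(i,j,k)(x,y,z), 0-based, in place:
    swap the consecutive blocks pi[i..j-1] and pi[j..k-1], cutting the
    bordering regions sizes[i-1], sizes[j-1], sizes[k-1] at x, y, z
    nucleotides respectively (our convention)."""
    assert 1 <= i < j < k <= len(pi) - 1
    a, b, c = sizes[i - 1], sizes[j - 1], sizes[k - 1]
    assert 0 <= x <= a and 0 <= y <= b and 0 <= z <= c
    block1, in1 = pi[i:j], sizes[i:j - 1]   # first block + interior regions
    block2, in2 = pi[j:k], sizes[j:k - 1]   # second block + interior regions
    pi[i:k] = block2 + block1
    sizes[i - 1:k] = ([x + (b - y)] + in2 + [z + (a - x)]
                      + in1 + [y + (c - z)])

pi, sizes = [0, 1, 2, 3, 4], [10, 20, 30, 40]
transposition(pi, sizes, 1, 2, 3, 5, 8, 12)   # swaps genes 1 and 2
```

As with reversals, the total intergenic size is conserved; only its distribution among the three affected regions changes.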

Proposition 3.2
Given a signed instance I = (π, π, ι) of the SbIRT problem, any sequence of reversals and transpositions that transforms (π, π) into (ι, ι) has at least b 2 (I)/3 operations.

Proof Directly by Remark 2.1 and lemmas 3.1 and 3.2.

Approximation algorithms for the unsigned case of the SbIRT problem
In this section, we investigate the unsigned case of the SbIRT problem and present a 4-approximation algorithm considering the reversal and transposition events. Besides, we show a 3-approximation algorithm incorporating a generalized definition of the transposition event.
We show a sequence of lemmas that will be used by the algorithms as subroutines.

Lemma 3.3 (Lemma 19 [16])
It is possible to perform any redistribution of nucleotides within the intergenic regions π i , π j , and π k using two consecutive transpositions.

In this section, we refer to a breakpoint type one simply as a breakpoint. Now let us show how to remove breakpoints from an unsigned instance depending on how many overcharged breakpoints the instance has.

Lemma 3.4
Given an unsigned instance (π,π,ι) for the SbIRT problem, if there are at least two overcharged breakpoints then there exists a sequence of two transpositions that removes at least two breakpoints.
Proof First note that a third breakpoint must exist in (π, π, ι) ; otherwise, the total number of nucleotides within the intergenic regions of the source genome would be greater than the number within the intergenic regions of the target genome. By Lemma 3.3, it is possible to redistribute nucleotides within three intergenic regions using two consecutive transpositions. Without loss of generality, assume that two of these intergenic regions are those of the two overcharged breakpoints and that the third is the intergenic region of the third breakpoint. In this case, the extra nucleotides from the two overcharged breakpoints are moved to the third breakpoint, removing both overcharged breakpoints, and the lemma follows.

Lemma 3.5 Given an unsigned instance (π ,π,ι) for the SbIRT problem, if there is a pair of softly connected breakpoints then there exists a reversal or a transposition that removes at least one breakpoint.
Proof Brito et al. [16, lemmas 14 and 20] showed how to remove a breakpoint from a pair of connected breakpoints. In particular, when both breakpoints (π i , π i+1 ) and (π j , π j+1 ) are soft, there are three possibilities to form at least one adjacency from them. In each case, a reversal or a transposition can be applied to remove at least one breakpoint, and the lemma follows.

Figure 3 shows, for each case in Lemma 3.5, a reversal or a transposition that can be applied to remove at least one breakpoint. In Case 3, a transposition is applied to the pair of soft breakpoints and to a third breakpoint, which can be located before or after the pair of soft breakpoints.

Remark 3.1
Note that Case 2 of Lemma 3.5 is the only one in which a hard breakpoint can be removed as a result of the applied operation ( k = i + 1 and k + 1 = j ). However, Lemma 3.5 cannot remove a super hard breakpoint.

Lemma 3.6
Given a valid unsigned instance I = (π, π, ι) for the SbIRT problem, if b 1 (I) > 0 and there is no pair of softly connected breakpoints, then there must be at least one overcharged breakpoint.
Proof Assume that there are no overcharged breakpoints in I. We will show that the sum of the intergenic region sizes of π is smaller than the sum of the intergenic region sizes of ι, which contradicts the fact that I is a valid instance. Since there is no pair of softly connected breakpoints, it follows that for each soft breakpoint (π i , π i+1 ) we have π i+1 < ι k , where k = max(π i , π i+1 ) ; otherwise, I would have at least one pair of softly connected breakpoints.
Lemma 3.7
Given a valid unsigned instance I = (π, π, ι) for the SbIRT problem with exactly one overcharged breakpoint (π i , π i+1 ) , at least one undercharged breakpoint (π j , π j+1 ) , and no pair of softly connected breakpoints, then π i+1 + π j+1 ≥ ι x + ι y , where x = max(π i , π i+1 ) and y = max(π j , π j+1 ).

Proof By contradiction, assume that π i+1 + π j+1 < ι x + ι y . Since no pair of softly connected breakpoints exists in I, it follows that there are no soft breakpoints in (π, π, ι) or there are not enough nucleotides in the soft breakpoints to remove them. In both cases, moving the excess of nucleotides from the overcharged breakpoint (π i , π i+1 ) to the undercharged breakpoint (π j , π j+1 ) is not enough to remove both breakpoints ( π i+1 + π j+1 < ι x + ι y ). So the instance (π, π, ι) remains with at least one undercharged breakpoint (π j , π j+1 ) and possibly with soft breakpoints without enough nucleotides to remove them, which contradicts the fact that I is a valid instance.

Lemma 3.8
Given an unsigned instance I = (π,π,ι) for the SbIRT problem, if I has only one overcharged breakpoint (π i , π i+1 ), at least one undercharged breakpoint (π j , π j+1 ), and there is no pair of softly connected breakpoints, then there is a sequence of two operations that removes at least two breakpoints.
Fig. 3
The possibilities that can arise when a pair of softly connected breakpoints exists. In this case, one operation can be applied to remove at least one breakpoint. The pair of elements that are consecutive in the identity permutation is represented with a grayscale color

Proof If π i+1 + π j+1 > ι x + ι y , then at least a third breakpoint must exist. By Lemma 3.3, it is possible to redistribute the nucleotides within the intergenic regions π i , π j , and π k using two consecutive transpositions. Initially, we verify if there is a soft breakpoint to receive the (π i+1 + π j+1 ) − (ι x + ι y ) exceeding nucleotides. Note that adding or removing nucleotides from a soft breakpoint does not turn it into a hard breakpoint. If the soft breakpoint exists, then the overcharged and undercharged breakpoints will be removed and it will receive the exceeding nucleotides after applying two consecutive transpositions. Otherwise, the third breakpoint must be an undercharged breakpoint, which can be removed or turned into an overcharged breakpoint after receiving the exceeding nucleotides.
In the worst case, two breakpoints are removed after applying a sequence of two operations, and the lemma follows.
Note that the sequence of operations from Lemma 3.8 generates at most one overcharged breakpoint after two consecutive transpositions, but if it occurs the instance (π,π,ι) will have no soft breakpoints.

Lemma 3.9
Given an unsigned instance I = (π, π, ι) for the SbIRT problem such that b s (I) > 0 and with no pair of softly connected breakpoints, it is possible, by applying one reversal or transposition, either to create a hard undercharged breakpoint while keeping the instance with no pair of softly connected breakpoints, or to create a super hard undercharged breakpoint.

Lemma 3.10
Given an unsigned instance I = (π,π,ι) for the SbIRT problem such that there is only one overcharged breakpoint, no undercharged breakpoints, and there is no pair of softly connected breakpoints, then there is a sequence of at most three operations that removes at least two breakpoints or a sequence of at most four operations that removes at least three breakpoints.
Proof Note that b s (π, π, ι) ≥ 2 , since it is impossible to create a valid instance with only one overcharged breakpoint and one soft breakpoint. Applying Lemma 3.9, we have two possibilities: (i) a hard undercharged breakpoint is created while the instance keeps no pair of softly connected breakpoints; then Lemma 3.8 can be applied (two breakpoints removed after three operations); (ii) a super hard undercharged breakpoint is created. In this case, if there is no pair of softly connected breakpoints in I, then Lemma 3.8 can be applied (also two breakpoints removed after three operations). Otherwise, Lemma 3.5 can be applied. Note that, by Remark 3.1, the super hard undercharged breakpoint remains untouched, and one of the following cases can occur:
• A new overcharged breakpoint is created, and Lemma 3.4 can be applied (three breakpoints removed after four operations).
• A pair of softly connected breakpoints is created, and Lemma 3.5 can be applied (two breakpoints removed after three operations).
• There is no pair of softly connected breakpoints in I, and Lemma 3.8 can be applied (three breakpoints removed after four operations).

Remark 3.2
Note that if only two breakpoints are removed by Lemma 3.10, then it implies that the resulting genome (π ,π) is different from (ι,ι).
Now consider Algorithm 1, which consists of four cases depending on the number of overcharged breakpoints or the existence of a pair of softly connected breakpoints. Note that at each iteration of Algorithm 1, at least one breakpoint is removed, so eventually (π, π) will be transformed into (ι, ι) and the algorithm stops. Besides, each step is performed in linear time using the auxiliary structures of a breakpoint list and the inverse permutation of π (i.e., a permutation that indicates the position of each element i in π ). Since b(π, π, ι) ≤ n + 1 , the running time of Algorithm 1 is O(n 2 ).

Algorithm 1 transforms an unsigned instance I = (π, π, ι) into (ι, ι) using at most (4/3) b 1 (I) operations, which yields the approximation factor of 4.
Proof Algorithm 1 can be analyzed considering the following cases:
1 I has at least two overcharged breakpoints (lines 3 to 6).
2 I has at least one pair of softly connected breakpoints (lines 7 to 10).
3 I has only one overcharged breakpoint, at least one undercharged breakpoint, and there is no pair of softly connected breakpoints (lines 12 to 15).
4 I has only one overcharged breakpoint, no undercharged breakpoints, and there is no pair of softly connected breakpoints (lines 16 to 19).
Note that, if the algorithm reaches cases 3 or 4, there is exactly one overcharged breakpoint; otherwise, case 1 would be performed first or the instance would not be valid (Lemma 3.6). Cases 1, 2, and 3 remove, on average, at least one breakpoint per operation. If the worst case of Case 4 is performed (two breakpoints removed with three operations), we have by Remark 3.2 that (π, π) ≠ (ι, ι) , and cases 1, 2, or 3 will be applied subsequently, all of which guarantee a sequence of operations that removes, on average, one breakpoint per operation. Thus, on average, each breakpoint is removed using at most 4/3 operations, and the lemma follows.
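The case analysis above can be condensed into a dispatcher; the pseudocode of Algorithm 1 is elided from this excerpt, so the function below only mirrors the order in which the cases are checked:

```python
def algorithm1_case(num_overcharged, num_undercharged, has_softly_connected):
    """Return which case of Algorithm 1 fires, following the proof's
    case order (a sketch; the algorithm's pseudocode is elided)."""
    if num_overcharged >= 2:
        return 1  # Lemma 3.4: two transpositions remove two breakpoints
    if has_softly_connected:
        return 2  # Lemma 3.5: one operation removes one breakpoint
    if num_overcharged == 1 and num_undercharged >= 1:
        return 3  # Lemma 3.8: two operations remove two breakpoints
    if num_overcharged == 1:
        return 4  # Lemma 3.10: <= 3 ops for 2 bps, or <= 4 ops for 3 bps
    return 0      # sorted or invalid instance (Lemma 3.6)
```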

Incorporating generic transpositions
In this section, we use a more general definition of transposition and design a 3-approximation algorithm for the sorting by intergenic reversals and transpositions problem using that definition. We start with a formal definition of generic transpositions, which include intergenic transpositions and intergenic moves.
We note that an intergenic move modifies only two intergenic regions of an instance. Now we show how generic transpositions affect the number of breakpoints from an instance (π ,π ,ι).
Lemma 3.12
Given an unsigned instance I = (π, π, ι) of the SbIRT problem, Δb 1 (I, β) ≥ −3 for any generic transposition β.

Proof The proof is similar to the one described in Lemma 3.1, considering that an intergenic transposition can affect up to three breakpoints and an intergenic move can affect up to two breakpoints.
In the following lemma we explain how to remove an overcharged breakpoint using one intergenic move.

Lemma 3.13
Given an unsigned instance I = (π,π,ι) for the SbIRT problem, if I has one overcharged breakpoint, then it is possible to remove at least one breakpoint using an intergenic move.
Proof Let (π i , π i+1 ) , with 0 ≤ i ≤ n , be the overcharged breakpoint, and let w = π i+1 − ι x , such that x = max(π i , π i+1 ) . We note that another breakpoint (π k , π k+1 ) , with 0 ≤ k ≤ n and k ≠ i , must exist in (π, π, ι) ; otherwise, the instance is not valid. We can use an intergenic move to transfer w nucleotides from π i+1 to π k+1 , and the overcharged breakpoint is removed. If i < k , we can apply the intergenic move τ (i+1,i+1,k) (0,w,0) .

Fig. 4
At the top (Case 1), the intergenic move τ (i+1,i+1,k) (0,w,0) is applied to move the excess of nucleotides from the overcharged breakpoint (π i , π i+1 ) to the breakpoint (π k , π k+1 ) , such that i < k . Similarly, at the bottom (Case 2), the intergenic move τ (k+1,i+1,i+1) (x,y,z) is applied, but i > k

Lemma 3.14
Given a valid unsigned instance I = (π, π, ι) for the SbIRT problem, if I has no overcharged breakpoints and b 1 (I) > 0 , then there is at least one pair of softly connected breakpoints.

Proof Note that, since there is no overcharged breakpoint and b(I) > 0 , at least two soft breakpoints must exist; otherwise, the instance has only undercharged breakpoints and is not valid. We can use a similar argument as in the proof of Lemma 3.6 to show that at least one pair of soft breakpoints must be connected; otherwise, I is not a valid instance.
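An intergenic move, as used in Lemma 3.13, only shifts nucleotides between two intergenic regions; a minimal sketch:

```python
def intergenic_move(sizes, src, dst, w):
    """Move w nucleotides from intergenic region src to region dst.
    As noted above, an intergenic move modifies only these two regions."""
    assert 0 <= w <= sizes[src], "cannot move more nucleotides than exist"
    sizes[src] -= w
    sizes[dst] += w

# Removing an overcharged breakpoint: move its excess w to another region.
sizes = [5, 0, 3]
intergenic_move(sizes, 0, 2, 4)
```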
Algorithm 2 consists of two cases: one applies when there is an overcharged breakpoint, and the other when there are only soft and undercharged breakpoints. At each iteration of Algorithm 2, at least one breakpoint is removed using one reversal or one generic transposition, so eventually (π, π) will be transformed into (ι, ι) and the algorithm ends. The same argument used for Algorithm 1 shows that the running time of Algorithm 2 is O(n 2 ) .
Algorithm 2 transforms an unsigned instance I = (π, π, ι) into (ι, ι) using at most b 1 (I) operations, which yields the approximation factor of 3.

Proof Algorithm 2 has only two cases: (i) I has at least one overcharged breakpoint (lines 3 to 6), and (ii) I has at least one pair of softly connected breakpoints (lines 7 to 10). In both cases, at least one breakpoint is removed per operation, and the lemma follows.

Greedy strategy
To improve the practical performance of algorithms 1 and 2, we search at the beginning of each iteration for one of the following operations.
1 A transposition that removes three breakpoints.
2 A reversal or transposition that removes two breakpoints.
The search is performed in linear time knowing where each element is placed in π . Therefore, it does not increase the asymptotic time complexity of algorithms 1 and 2. Besides, this strategy does not affect the theoretical approximation factors of algorithms 1 and 2, since the applied operations remove at least two breakpoints each.
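The greedy search can be prototyped with a brute-force scan over candidate operations (the paper performs this search in linear time using the breakpoint list and the inverse permutation); the toy breakpoint counter in the usage example ignores intergenic sizes for brevity:

```python
def greedy_pick(state, candidates, apply_op, count_breakpoints):
    """Return the candidate operation removing the most breakpoints,
    together with its gain (a brute-force sketch of the greedy step).
    apply_op(state, op) must return the state after applying op."""
    base = count_breakpoints(state)
    best, best_gain = None, 0
    for op in candidates:
        gain = base - count_breakpoints(apply_op(state, op))
        if gain > best_gain:
            best, best_gain = op, gain
    return best, best_gain

# Toy usage: gene-order-only breakpoints, reversal candidates (i, j).
count = lambda s: sum(1 for i in range(len(s) - 1) if s[i + 1] - s[i] != 1)
apply_rev = lambda s, op: s[:op[0]] + s[op[0]:op[1] + 1][::-1] + s[op[1] + 1:]
best_op, gain = greedy_pick([0, 2, 1, 3], [(1, 2), (1, 3)], apply_rev, count)
```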

Approximation algorithms for the signed case of the SbIRT problem
In this section, we show how to obtain approximation algorithms for the signed case of the SbIRT problem based on a reduction from a signed instance to an unsigned instance. The algorithms follow three steps: (i) first, we describe a polynomial-time function F that maps a signed instance I = (π, π, ι) of the SbIRT problem into a valid unsigned instance I ′ = (π ′ , π ′ , ι ′ ) ; (ii) then, we use Algorithm 1 or 2 to provide a solution S(I ′ ) for the instance I ′ ; and (iii) we show a polynomial-time function G that maps a solution S(I ′ ) into a valid solution S(I) for I. Lastly, we prove the theoretical approximation factor obtained by adopting this process.
Function F works as follows: for each element π i of the source genome (π, π) , we map it into two new elements, whose order depends on the sign of π i . In both cases, a new intergenic region with size zero is inserted between these two new elements. We apply the same procedure to the target genome (ι, ι) . This procedure doubles the size of the instance, but note that b 2 (I) = b 1 (I ′ ) , since each breakpoint type two is mapped into a breakpoint type one. Besides, function F takes linear time to complete the mapping.
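Since the element mapping of F is elided in this excerpt, the sketch below assumes the standard reduction from the literature, sending +x to the pair (2x − 1, 2x) and −x to (2x, 2x − 1), with a size-0 region inserted between the pair:

```python
def F(pi_signed, sizes):
    """Map a signed instance to an unsigned one (assumed standard
    mapping: +x -> (2x-1, 2x), -x -> (2x, 2x-1), size-0 region between).
    sizes[i] is the intergenic region to the left of gene i."""
    pi2, sizes2 = [], []
    for g, s in zip(pi_signed, sizes):
        sizes2.append(s)                      # region left of the gene
        x = abs(g)
        pair = (2 * x - 1, 2 * x) if g > 0 else (2 * x, 2 * x - 1)
        pi2.extend(pair)
        sizes2.append(0)                      # new zero-size region
    sizes2.append(sizes[len(pi_signed)])      # rightmost region
    return pi2, sizes2
```

Under this mapping the signed identity (+1 +2 . . . +n) becomes the unsigned identity (1 2 . . . 2n), as required for b 2 (I) = b 1 (I ′ ).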
Function G uses the fact that algorithms 1 and 2 act only over breakpoints to map a solution S(I ′ ) for I ′ into a valid solution S(I) for I. It maps each reversal ρ (i,j) (x,y) in S(I ′ ) into ρ (i ′ ,j ′ ) (x,y) such that i ′ = (i + 1)/2 and j ′ = j/2 , and each transposition τ (i,j,k) (x,y,z) into τ (i ′ ,j ′ ,k ′ ) (x,y,z) such that i ′ = (i + 1)/2 , j ′ = (j + 1)/2 , and k ′ = (k + 1)/2 . Recall that this mapping is only possible because algorithms 1 and 2 do not create breakpoints of type one during the process that transforms the source genome into the target genome. Furthermore, note that solutions S(I) and S(I ′ ) have the same number of operations. Since the size of solution S(I ′ ) is O(n) , where n is the number of elements of π , function G takes linear time to complete the solution mapping. Figure 5 shows an example using the functions F and G : the signed instance (π, π, ι) of the SbIRT problem (at the top) is mapped into an unsigned instance (π ′ , π ′ , ι ′ ) (at the bottom) using the function F , and the function G is used to map a solution S ′ for (π ′ , π ′ , ι ′ ) into a valid solution S of the same size for (π, π, ι).

Proof Direct by the construction of the function F .
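The index mapping of G can be sketched as follows (1-based indices as in the text; the assertion encodes the fact that the unsigned operations never cut inside an inserted size-0 region, so the mapped indices are always integers):

```python
def G_reversal(i, j):
    """Map a reversal's positions in the unsigned solution back to the
    signed instance: i' = (i + 1)/2 and j' = j/2, as described above."""
    assert (i + 1) % 2 == 0 and j % 2 == 0, "cuts fall between gene pairs"
    return (i + 1) // 2, j // 2

def G_transposition(i, j, k):
    """Likewise for a transposition: each index idx maps to (idx + 1)/2."""
    return (i + 1) // 2, (j + 1) // 2, (k + 1) // 2
```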
Proof The proof is similar to the one described in Lemma 3.17 but considering that a solution S(I ′ ) for the instance I ′ is obtained using at most b 1 (π,π ,ι) operations of reversal and transposition.

Practical results
In this section, we compare the proposed algorithms using simulated datasets. Besides, we perform an experiment using marine and brackish picocyanobacteria genomes from the Cyanorak v2.1 system [20].

Results with unsigned simulated datasets
To assess Algorithms 1 and 2, we compare them with the 4.5-approximation algorithm for the unsigned case of the SbIRT problem presented by Brito et al. [16]. We hereafter refer to the 4.5-approximation algorithm [16], Algorithm 1, and Algorithm 2 as 4.5SbIRT, 4SbIRT, and 3SbIRGT, respectively. We used the following datasets of simulated genomes:

• DS1: This dataset was presented by Brito et al. [16]. It is divided into groups according to the number of random operations (reversals or transpositions) used to create each instance. Each group contains 10,000 instances of size 100. Instances are created as follows: the target genome is composed of the identity permutation ι, and the intergenic region sizes in the target genome are randomly chosen in the range [0..100]. The source genome was obtained by applying a sequence of random operations to the target genome. The number of random operations ranged from 5 up to 100, in intervals of 5. Reversals and transpositions are selected with equal probability when creating each instance. This dataset has a total of 200,000 instances.

• DS2: This dataset contains groups of instances with sizes 100, 200, 300, 400, and 500. Each group contains 10,000 instances. Instances are created as follows: the target genome is again composed of the identity permutation ι, with intergenic region sizes randomly chosen in the range [0..100]. The source genome (π, π̆) was obtained by shuffling the lists of genes and intergenic region sizes from the target genome independently, in order to create instances with a large number of breakpoints. This dataset has a total of 50,000 instances.
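A minimal sketch of how a DS2-style instance could be generated follows; function and parameter names are ours, and the original benchmark generator may differ.

```python
import random

def make_ds2_instance(n, max_ir=100, seed=None):
    """Build a DS2-style instance: the target genome is the identity
    permutation with n + 1 intergenic sizes drawn from [0..max_ir]; the
    source genome shuffles genes and intergenic sizes independently."""
    rng = random.Random(seed)
    iota = list(range(1, n + 1))                              # target gene order
    iota_ir = [rng.randint(0, max_ir) for _ in range(n + 1)]  # target intergenic sizes
    pi, pi_ir = iota[:], iota_ir[:]
    rng.shuffle(pi)      # shuffle the gene order...
    rng.shuffle(pi_ir)   # ...and the intergenic sizes, independently
    return (pi, pi_ir), (iota, iota_ir)
```

Note that shuffling preserves the multiset of intergenic sizes, so source and target genomes keep the same total intergenic length, as required by a model without insertions and deletions.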
The DS1 dataset explores scenarios with instances of the same size, where the number of breakpoints tends to increase as the number of random operations used to generate each instance grows. The DS2 dataset explores scenarios with groups of instances of different sizes that, by the random process of construction, tend to have a higher number of breakpoints. Tables 1, 2, and 3 consider the DS1 dataset and report, respectively, the results of the 4.5SbIRT, 4SbIRT, and 3SbIRGT algorithms. Columns OP, Default Implementation, and Greedy Strategy represent the number of random operations used to create the instances, the result with no greedy strategy, and the result with the greedy strategy, respectively.
From Table 1, we note that the greedy strategy significantly improved the results of the 4.5SbIRT algorithm. The minimum, average, and maximum metrics for the distance and the approximation ratio using the greedy strategy presented lower values when compared with the algorithm's default implementation, except for the minimum distance when OP = 05. The average approximation ratio tends to increase as OP increases. When no greedy strategy is applied, the values ranged from 2.01 (OP = 05) to 2.96 (OP = 100); using the greedy strategy, the values ranged from 1.34 (OP = 05) to 2.11 (OP = 100). Besides, by adopting the greedy strategy we were able to find at least one optimal solution in the groups where OP = 05 and OP = 10, indicated by the value 1.00 in the minimum approximation ratio column.

Table 2 shows a similar behavior for 4SbIRT regarding the increase of the average approximation ratio as OP grows, and the improvement obtained by the greedy strategy. Using no greedy strategy, the average distance of 4SbIRT is better than the average distance of the 4.5SbIRT algorithm when the number of random operations (OP) is greater than or equal to 50. This indicates that the default implementation of the 4SbIRT algorithm tends to provide better results when the instance has many breakpoints. When we compare both algorithms using the greedy strategy, the 4SbIRT algorithm provides better results for the vast majority of the groups and metrics. Considering all groups and using the greedy strategy, the maximum approximation ratio obtained by both algorithms (4.5SbIRT and 4SbIRT) was 3.00, which is considerably less than the theoretical approximation factor proven for them.

Table 3 shows that 3SbIRGT provided results similar to those presented by 4SbIRT. Considering the average distance and average approximation ratio columns, we can see a slight improvement for all values of OP compared with the practical results of 4SbIRT.
This fact results from the inclusion of the intergenic move operation, which can reduce the number of operations needed to transform one genome into another. Besides, considering the versions without and with the greedy strategy, respectively, the maximum approximation ratios over all groups were 2.97 and 2.83. Using the greedy strategy, the average approximation ratio of 3SbIRGT ranged from 1.29 to 2.05, which is significantly less than the theoretical approximation factor.

Table 4 shows the results for the DS2 dataset using the 4.5SbIRT, 4SbIRT, and 3SbIRGT algorithms. The average distances of the algorithms without the greedy strategy were close to the instance sizes in all groups. Computing the absolute difference between the average distance and the instance size, the highest values provided by the 4.5SbIRT, 4SbIRT, and 3SbIRGT algorithms were 4.00 (Size = 500), 0.42 (Size = 500), and 0.08 (Size = 100), respectively. The greedy strategy also led to important improvements in the results for all the algorithms and groups. Both with and without the greedy strategy, the best results regarding the average distance and average approximation ratio metrics were provided by 3SbIRGT, followed by the 4SbIRT and 4.5SbIRT algorithms.

Table 5 shows the average running time, in seconds, of the 4.5SbIRT, 4SbIRT, and 3SbIRGT algorithms per instance, comparing the default implementation (DI) and the greedy strategy (GS) on the DS2 dataset. Note that the greedy strategy is more time-consuming than the default implementation. The maximum average running time of an algorithm without the greedy strategy was less than 0.20 seconds, while with the greedy strategy it was 0.65 seconds. Given the improvement in the results provided by the greedy strategy in Table 4, we consider the additional running time a good trade-off for the gain in solution quality.
Based on the results, the practical approximation ratios provided by the algorithms tend to be better than the theoretical approximation factors. Besides, it is noteworthy that the greedy strategy brought significant improvements on both datasets. Since incorporating this strategy changes neither the asymptotic time complexity nor the theoretical approximation factor of the algorithms, it is an excellent alternative to obtain better results.

Results with signed simulated datasets
To assess Algorithms 3 and 4, we compare them with the 3-approximation and the 2.5-approximation algorithms for the signed case of the SbIRT problem, respectively, which were presented by Oliveira et al. [17]. We hereafter refer to the 3-approximation algorithm [17], the 2.5-approximation algorithm [17], Algorithm 3, and Algorithm 4 as 3SbIRT, 2.5SbIRGT, 4SbIRT, and 3SbIRGT, respectively. The results of the 4SbIRT and 3SbIRGT algorithms were obtained adopting the greedy strategy. We used the DB SIRIT and DB SIRGT datasets presented by Oliveira et al. [17], which have the following characteristics. Each dataset started with 100 target genomes (ι, ῐ), such that ι has 100 elements and each value of ῐ_i, with 1 ≤ i ≤ 101, was chosen randomly and uniformly in the interval [0..100]. After that, from each target genome (ι, ῐ), 100 instances (π, π̆, ῐ) were generated by applying:

• DB SIRIT: d random operations of reversals and transpositions (50% of each) to each target genome (ι, ῐ).
• DB SIRGT: d random operations of reversals and generic transpositions (50% reversals, 40% transpositions, and 10% moves) to each target genome (ι, ῐ).
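As an illustration of how one such random operation acts on a genome, the following sketch applies a signed intergenic reversal. The index conventions and encoding are our assumptions for this example, not the authors' generator.

```python
def intergenic_reversal(pi, ir, i, j, x, y):
    """Apply the intergenic reversal rho(i,j)(x,y) to the genome (pi, ir),
    where pi is a signed permutation (1-based positions, i <= j) and ir holds
    the len(pi) + 1 intergenic region sizes. The segment pi[i..j] is reversed
    with signs flipped, the intergenic regions strictly inside it are
    reversed, and the two cut regions redistribute x and y nucleotides."""
    assert 1 <= i <= j <= len(pi)
    assert 0 <= x <= ir[i - 1] and 0 <= y <= ir[j]
    pi, ir = pi[:], ir[:]                       # work on copies
    pi[i - 1:j] = [-g for g in reversed(pi[i - 1:j])]
    ir[i:j] = list(reversed(ir[i:j]))           # regions strictly inside the segment
    left, right = ir[i - 1], ir[j]
    ir[i - 1], ir[j] = x + y, (left - x) + (right - y)
    return pi, ir
```

Note that the operation preserves the total intergenic size, as required by a conservative model.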
The parameters of each applied operation were randomly generated considering the range of valid values. The value of d ranged from 10 up to 100, in intervals of 10. For each value of d, a group with 10,000 instances was generated; the DB SIRIT and DB SIRGT datasets thus have a total of 100,000 instances each. Tables 6 and 7 show the practical results of the algorithms using the DB SIRIT and DB SIRGT datasets, respectively. The approximation ratio for each instance was computed as the ratio between the size of the solution provided by the algorithm and the lower bound for the distance.

Table 6 compares the results obtained by the 3SbIRT and 4SbIRT algorithms. The columns Small and Small or Equal indicate, for each group, the percentage of solutions provided by the 4SbIRT algorithm with strictly smaller size and with smaller or equal size, respectively, when compared to the solutions provided by the 3SbIRT algorithm.
From Table 6, it is possible to observe that the 4SbIRT algorithm provided better results in all groups considering the average approximation ratio and average distance metrics. Besides, in the groups with d greater than 20, the algorithm provided better solutions in more than 75% of the instances (column Small). In the groups with d greater than 30, the 4SbIRT algorithm provided solutions of smaller or equal size (column Small or Equal) in more than 96% of the instances. It is important to note that, as the value of d increases, the absolute difference between the average distances provided by the 3SbIRT and 4SbIRT algorithms also increases significantly. When d is greater than 50, the absolute difference between the average distances is greater than 10, which indicates that the 4SbIRT algorithm tends to provide better solutions in scenarios where a higher number of operations was used.

Table 7 compares the results obtained by the 2.5SbIRGT and 3SbIRGT algorithms. The columns Small and Small or Equal indicate, for each group, the percentage of solutions provided by the 3SbIRGT algorithm with strictly smaller size and with smaller or equal size, respectively, when compared to the solutions provided by the 2.5SbIRGT algorithm.
From Table 7, we note that the 2.5SbIRGT algorithm, when compared to the 3SbIRGT algorithm, showed slightly better results regarding the average approximation ratio and average distance in the groups with d = 10 and d = 20. Considering these two groups, the absolute difference between the average distances provided by the algorithms was less than 0.61. Besides, column Small shows that in the groups with d = 10 and d = 20 the 3SbIRGT algorithm provided better solutions in 32.30% and 34.77% of the instances, respectively. This shows that the 3SbIRGT algorithm can act in a complementary way to the 2.5SbIRGT algorithm, even in cases where both provide similar results. Since better estimates tend to result in enhanced analyses, selecting the better result of the two algorithms is a good alternative to assist in this task. In the groups where d is greater than 20, the 3SbIRGT algorithm provided better results considering the average approximation ratio and average distance. Furthermore, in the same groups, the 3SbIRGT algorithm provided solutions of smaller or equal size (column Small or Equal) in more than 73% of the instances.
From Tables 6 and 7, it is possible to note that the 4SbIRT and 3SbIRGT algorithms are robust and tend to provide practical results better than the theoretical bounds.

Results with real genomes
To assess the 3SbIRGT algorithm and analyze its behavior on real genomes, we compared it with the 2-approximation algorithm for the problem considering reversals and transpositions on signed permutations (ignoring the intergenic regions), presented by Walter et al. [7]. We hereafter refer to the 2-approximation algorithm [7] as 2SbRT. We used 97 genomes from Cyanorak 2.1 [20], which is a system for the visualization and curation of marine and brackish picocyanobacteria genomes. The system encompasses 51 Synechococcus genomes, 3 Cyanobium genomes, 41 Prochlorococcus genomes, and 2 Prochlorococcus metagenome-assembled genomes. For each genome, the number of genes ranged from 1834 to 4391, and replicated genes correspond, on average, to less than 5% of the total genes.
We performed a preprocessing stage to ensure that the data fit the model constraints, which is divided into two steps: 1. Map the sequence of genes and the intergenic regions into (π, π̆): For each genome, we mapped the first occurrence of each gene into a permutation π and computed the sizes of the intergenic regions to obtain π̆. 2. Pairing: For each pair of genomes, we performed a pairing so that the genes and conserved blocks shared by both genomes were kept, while the remaining genes were removed through a process that simulates a sequence of deletions.
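The two preprocessing steps can be sketched as follows. This is a simplified illustration with our own helper names; the handling of conserved blocks and of the intergenic sizes is omitted.

```python
def first_occurrences(genes):
    """Step 1 (simplified): keep only the first occurrence of each gene,
    preserving the order of appearance (replicated genes are dropped)."""
    seen, kept = set(), []
    for g in genes:
        if g not in seen:
            seen.add(g)
            kept.append(g)
    return kept

def pair_genomes(genes_a, genes_b):
    """Step 2 (simplified): keep only the genes shared by both genomes;
    each removed gene is later accounted for as a deletion."""
    a, b = first_occurrences(genes_a), first_occurrences(genes_b)
    shared = set(a) & set(b)
    return [g for g in a if g in shared], [g for g in b if g in shared]
```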
After the preprocessing stage, we obtained an instance (π, π̆, ῐ) for each pairing. Note that the 2SbRT algorithm requires as input only the permutation π, since it was not designed to consider the intergenic regions. Finally, 3SbIRGT with the greedy strategy and 2SbRT were applied to each pairing. The number of genome rearrangement events for each pairing was computed as the total number of deletions used in the preprocessing stage (step 2) plus the size of the sequence of reversals and (generic) transpositions provided by the algorithm. These numbers were fed into a matrix of pairwise distances.
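The distance computation described above can be sketched as follows; the pair-result encoding is an assumption made for this example.

```python
def build_distance_matrix(n, pair_results):
    """Build the symmetric matrix of pairwise distances for n genomes.
    pair_results maps a pair (a, b), with a < b, to (num_deletions, ops):
    the deletions used in the preprocessing plus the sequence of reversals
    and (generic) transpositions returned by the algorithm."""
    matrix = [[0] * n for _ in range(n)]
    for (a, b), (num_deletions, ops) in pair_results.items():
        matrix[a][b] = matrix[b][a] = num_deletions + len(ops)
    return matrix
```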
We constructed two phylogenetic trees based on the matrices of pairwise distances computed from the algorithms, using the Circular Order Reconstruction method [21]. To analyze the topological characteristics of the phylogenetic trees, we compared them with the phylogenetic tree presented by Laurence et al. [20], using a tool [22] based on maximum agreement subtrees (MAST) to determine the topological congruence between two phylogenetic trees. Table 8 shows the obtained results.

Table 8 indicates that both phylogenetic trees have a high concordance with the phylogenetic tree presented by Laurence et al. [20], with the tree obtained from the 3SbIRGT algorithm providing a MAST with more leaves and, consequently, better values for I_cong and the P-value. It is important to mention that the objective of this experiment with real genomes is to demonstrate the applicability of our algorithm, which considers both the order of the genes and the sizes of the intergenic regions, compared with a similar model that considers only the order and orientation of the genes. We used the same data preprocessing stage and reconstruction method to provide a fair comparison. However, the results may differ, especially for genomes with different characteristics or another reconstruction method. Figure 6 shows a phylogenetic tree constructed using the Circular Order Reconstruction method [21] with the matrix of pairwise distances from the 3SbIRGT algorithm.
From Fig. 6 (created using the treeio R package [23]), we observe that the approach separates the organisms by species and produced good groupings. It is worth mentioning that the tree was based exclusively on rearrangement event information.

Conclusion
We studied the sorting by intergenic reversals and transpositions problem on signed and unsigned permutations. We presented a 4-approximation algorithm for both cases, improving on the 4.5-approximation factor previously known for the unsigned case. Besides, we generalized the transposition event, which is more realistic in scenarios that consider intergenic regions, and presented a 3-approximation algorithm for the resulting problem. We developed a greedy strategy to improve the practical performance of the algorithms and conducted a comparison using datasets with different features. For the signed case of the problem, the tests indicated that our algorithms tend, in the vast majority of cases, to provide better practical results than the previously known algorithms. Moreover, we carried out an experiment using real genomes to verify the applicability of the proposed algorithms.
From the theoretical point of view, the algorithms proposed for the unsigned case of the sorting by intergenic reversals and transpositions problem bring an important improvement in the approximation factor. The results for the signed case, in turn, have the practical potential of enhancing the distance estimates for compared genomes and, consequently, the analyses regarding genome rearrangements.
In future works, one can incorporate non-conservative events (e.g., insertion and deletion of genes or nucleotides) into the model.