Natural family-free genomic distance

Background A classical problem in comparative genomics is to compute the rearrangement distance, that is the minimum number of large-scale rearrangements required to transform a given genome into another given genome. The traditional approaches in this area are family-based, i.e., require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which are able to compute distances in polynomial time, restrict the families to occur at most once in each genome. In contrast, the distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkämper et al. (J Comput Biol 28:410–431, 2021) proposed an ILP formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, and content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but must maximize a matching of the genes in each multifamily, in order to prevent the free lunch artifact that would otherwise let empty or almost empty matchings give smaller distances. Results In this paper, we adopt the alternative family-free setting that, instead of family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model then results in a natural family-free genomic distance, that takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. In spite of its bigger search space, our ILP seems to be boosted by a reduction of the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkämper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results. Supplementary Information The online version contains supplementary material available at 10.1186/s13015-021-00183-8.


Introduction
Genomes are subject to mutations or rearrangements in the course of evolution.A classical problem in comparative genomics is to compute the rearrangement distance, that is the minimum number of large-scale rearrangements required to transform a given genome into another given genome [20].Typical large-scale rearrangements change the number of chromosomes, and/or the positions and orientations of DNA segments.Examples of such structural rearrangements are inversions, translocations, fusions and fissions.One might also need to consider rearrangements that modify the content of a genome, such as insertions and deletions (collectively called indels) of DNA segments.
In order to study the rearrangement distance, one usually adopts a high-level view of genomes, in which only "relevant" fragments of the DNA (e.g., genes) are taken into consideration.Furthermore, a pre-processing of the data is required, so that we can compare the content of the genomes.One popular method, adopted for more than 20 years, is to group the fragments in both genomes into families, so that two fragments in the same family are said to be equivalent.This setting is said to be family-based.Without duplications, that is, with the additional restriction that each family occurs at most once in each genome, 1. DCJ operations modify the organization of a genome: A cut performed on a genome A separates two adjacent markers of A. A double-cut and join or DCJ applied on a genome A is the operation that performs cuts in two different positions of A, creating four open ends, and joins these open ends in a different way [3,23].For example, let A = {6 1 7 8 4, 3 5 2}, and consider a DCJ that cuts between markers 1 and 7 of its first chromosome and between markers 5 and 2 of its second chromosome, creating fragments 6 1•, •7 8 4, 3 5• and •2 (where the symbols • represent the open ends).If we join the first with the third and the second with the fourth open end, we get A = {6 1 2, 3 5 7 8 4}, that is, the described DCJ operation is a translocation transforming A into A .Indeed, a DCJ operation can correspond not only to a translocation but to several structural rearrangements, such as an inversion, a fusion or a fission.
2. Indel operations modify the content of a genome: We can modify the content of a genome with insertions and with deletions of blocks of contiguous markers, collectively called indel operations [6,24].
As an example, consider the deletion of fragment 7 8 from chromosome 6 1 7 8 4, resulting in chromosome 6 1 4.In the model we consider, we do not allow that a marker is deleted and reinserted, nor inserted and then deleted.Furthermore, at most one chromosome can be entirely deleted or inserted at once.
Let A and B be two genomes and let A be the set of markers in genome A and B be the set of markers in genome B. We consider two distinct settings: -In a family-based setting markers are grouped into families and each marker from a genome is represented by its family.Therefore, a marker from A can occur more than once in A, as well as a marker from B can occur more than once in B. Furthermore, genomes A and B may share a set of common markers G = A ∩ B. We also have sets A = A \ G and B = B \ G of markers that occur respectively only in A and only in B and are called exclusive markers.For example, we could have A = {3 1 4 3 2, 3 5 2} and B = {1 2 1 3 2 6}.In this case we have A = {1, 2, 3, 4, 5} and B = {1, 2, 3, 6}.Consequently, G = {1, 2, 3}, A = {4, 5} and B = {6}.-In a family-free setting the markers of A and B are all distinct and unique.In other words, each marker of A occurs exactly once in A, each marker of B occurs exactly once in B and A ∩ B = ∅.Consider, for example, genomes A = {1 3 4 2} and B = {8 7 5, 9 6}.

Relational diagram and distance of family-based singular genomes
Let A and B be two genomes in a family-based setting and assume that both A and B are singular, that is, each marker from G = A ∩ B occurs exactly once in each genome.We will now describe how the DCJ-indel distance can be computed in this case [6].
For a given marker m, denote its two extremities by m t (tail) and m h (head).Given two singular genomes A and B, the relational diagram R(A, B) [4] has a set of vertices V = V (A) ∪ V (B), where V (A) is the set of extremities of markers from A and V (B) is the set of extremities of markers from B. There are three types of edges in R(A, B): -Adjacency edges: for each pair of marker extremities γ 1 and γ 2 that are adjacent in a chromosome of any of the two genomes, we have the adjacency edge γ 1 γ 2 .Denote by E A adj and by E B adj the adjacency edges in A and in B, respectively.Marker extremities located at chromosome ends are called telomeres and are not connected to any adjacency edge.
-Extremity edges, whose set is denoted by E γ : for each common marker m ∈ G, we have two extremity edges, one connecting the vertex m h from V (A) to the vertex m h from V (B) and the other connecting the vertex m t from V (A) to the vertex m t from V (B).-Indel edges: for each occurrence of an exclusive marker m ∈ A ∪ B , we have the indel edge m t m h .
Denote by E A id and by E B id the indel edges in A and in B. Each vertex has degree one or two: it is connected either to an extremity edge or to an indel edge, and to at most one adjacency edge, therefore R(A, B) is a simple collection of cycles and paths.A path that has one endpoint in genome A and the other in genome B is called an AB-path.In the same way, both endpoints of an AA-path are in A and both endpoints of a BB-path are in B. A cycle contains either zero or an even number of extremity edges.When a cycle has at least two extremity edges, it is called an AB-cycle.Moreover, a path (respectively cycle) of R(A, B) composed exclusively of indel and adjacency edges in one of the two genomes corresponds to a whole linear (respectively circular) chromosome and is called a linear (respectively circular) singleton in that genome.Actually, linear singletons are particular cases of AA-or BB-paths.The numbers of telomeres and of AB-paths in R(A, B) are even.An example of a relational diagram is given in Figure 1.
r r r r r r r r r r r r r r r r r r r r r r r r r r r r r r r r  DCJ distance of canonical genomes: When singular genomes A and B have no exclusive markers, that is, A = B = ∅, they are said to be canonical.In this case A can be sorted into B with DCJ operations only and their DCJ distance d dcj can be computed as follows [3]: where c is the number of AB-cycles and i is the number of AB-paths in R(A, B).
Runs and indel-potential: When singular genomes A and B have exclusive markers, it is possible to optimally select DCJ operations that group exclusive markers together for minimizing indels [6], as follows.
Given two genomes A and B and a component C of R(A, B), a run [6] is a maximal subpath of C, in which the first and the last edges are indel edges, and all indel edges belong to the same genome.It can be an A-run when its indel edges are in genome A, or a B-run when its indel edges are in genome B. We denote by Λ(C) the number of runs in component C. If Λ(C) ≥ 1 the component C is said to be indelenclosing, otherwise Λ(C) = 0 and C is said to be indel-free.The indel-potential of a component C, denoted by λ(C), is the optimal number of indels obtained after "sorting" C separately and can be directly computed from Λ(C) [6]: Figure 7 (Appendix A) shows a BB-path with 4 runs, and how its indel-potential can be achieved.With the indel-potential, an upper bound for the DCJ-indel distance was established [6]: DCJ-indel distance of singular circular genomes: For singular circular genomes, the graph R(A, B) is composed of cycles only.In this case the upper bound given by Equation ( 1) is tight and leads to a simplified formula [6]: DCJ-indel distance of singular linear genomes: For singular linear genomes, the upper bound given by Equation (1) is achieved when the components of R(A, B) are sorted separately.However, it can be decreased by recombinations, that are DCJ operations that act on two distinct paths of R(A, B).Such path recombinations are said to be deducting.The total number of types of deducting recombinations is relatively small.By exhaustively exploring the space of recombination types, it is possible to identify groups of chained recombinations [6], so that the sources of each group are the original paths of the graph.In other words, a path that is a resultant of a group is never a source of another group.This results in a greedy approach (detailed in [6]) that optimally finds the value δ ≥ 0 to be deducted.We then have the following exact formula [6]: 3 The family-free setting As already stated, in the family-free setting, each marker in each genome is represented by a distinct symbol, thus A ∩ B = ∅.Observe that the cardinalities |A| and |B| may be distinct.

Marker similarity graph for the family-free setting
Given a threshold 0 ≤ x ≤ 1, we can represent the similarities between the markers of genome A and the markers of genome B in the so called marker similarity graph [5], denoted by S x (A, B).This is a weighted bipartite graph whose partitions A and B are the sets of markers in genomes A and B, respectively.Let ñA be the number of unsaturated markers in A and ñB be the number of unsaturated markers in B. We extend the function s, so that it maps each unsaturated marker a ∈ A to one value in {n+1, n+2, . . ., n+ñ A } and each unsaturated marker b ∈ B to one value in {n + ñA + 1, n + ñA + 2, . . ., n + ñA + ñB }.The sets of M -unsaturated mapped markers are: The mapped genomes A M and B M are then obtained by renaming each marker a ∈ A to s(a, M ) and each marker b ∈ B to s(b, M ), preserving all orientations.
Established distances of mapped genomes: Let the relational graph R(A M , B M ) have c M AB-cycles and i M AB-paths.By simply ignoring the exclusive markers of A (M ) and B (M ), we can compute the DCJ distance: Taking into consideration the weight of the matching M defined as w(M ) = e∈M σ(e), we can also compute the weighted DCJ distance wd dcj (A M , B M ) [16]: Observe that, when all edges of M have the maximum weight 1, we have w(M ) = |M | and wd dcj (A M , B M ) = d dcj (A M , B M ).
Finally, taking into consideration the exclusive markers of A (M ) and B (M ), but not the weight w(M ), we can compute the DCJ-indel distance of mapped genomes A M and B M : where δ M is the deduction given by path recombinations in R(A M , B M ).

The family-free DCJ-indel distance
Let A M and B M be the mapped genomes for a given matching M of S x (A, B).The weighted relational diagram of A M and B M , denoted by WR(A M , B M ), is obtained by constructing the relational diagram of A M and B M and adding weights to the indel edges as follows.For each mapped M -unsaturated marker m ∈ A (M ) ∪ B (M ), the indel edge m h m t receives a weight w(m h m t ) = max{σ(uv)|uv ∈ S x (A, B) and u= s −1 (m, M )}, that is the maximum similarity among the edges incident to the marker u = s −1 (m, M ) in S x (A, B).We denote by M = E A id ∪ E B id the set of indel edges, here also called the complement of M .The weight of M is w( M ) = e∈ M w(e).Examples of diagrams of mapped genomes are shown in Figure 3.
In the computation of the weighted DCJ-indel distance of mapped genomes A M and B M , denoted by wd id dcj (A M , B M ), we should take into consideration the exclusive markers of A (M ) and B (M ), and the weights w(M ) and w( M ).An important condition is that wd id dcj (A M , B M ) must be equal to d id dcj (A M , B M ) if w(M ) = |M | and w( M ) = 0. We can achieve this by extending the formula for computing wd dcj (A M , B M ) as follows: Let us now examine the behaviour of the formula above for the examples given in Figure 3. Matching M 1 is maximal and gives the distance wd id dcj (A M1 , B M1 ) = 8.6.Matching M 2 is also maximal and gives the distance wd id dcj (A M2 , B M2 ) = 5.2.The empty matching M ∅ gives the distance wd id dcj (A M ∅ , B M ∅ ) = 9.7, that is the biggest.And the non-maximal matching M 3 ⊂ M 2 gives the distance wd id dcj (A M3 , B M3 ) = 5.1, that is the smallest.
Given that M is the set of all distinct matchings in S x (A, B), the family-free DCJ-indel distance is defined as follows: Complexity: If two family-based genomes contain the same number of occurrences of each marker, they are said to be balanced.The problem of computing the DCJ distance of balanced genomes (BG-DCJ) is NPhard [22].Since the computation of ffd id dcj can be used to solve BG-DCJ, it is also NP-hard.See Appendix B for details of the reduction.
r r r r r r r r r r (1:1) (2:2) (3:3) (4:5) (5:4) r r r r r r r r r r r r r r r r r r r r r r (1:1) (2:2) (3:3) (4:4) (5:5) r r r r r r r r r r r r 3 0.9 0.9 0.8 0.6 0.3 0.9 0.9 0.7 0.8 r r r r r r r r r r (1:1) (2:2) (3:3) (4:4) (5:5) r r r r r r r r r r r r r r r r r r r r r r r r r r r r r r r r r r  2, let M 1 (red) and M 2 (blue) be two distinct maximal matchings in S 0.1 (A, B).We also represent the non-maximal matching M 3 (cyan) that is a subset of M 2 .In the middle part we show diagrams WR(A M1 , B M1 ) and WR(A M2 , B M2 ), both with two AB-paths and two AB-cycles.In the lower part we show diagrams WR(A M ∅ , B M ∅ ), corresponding to the trivial empty matching M ∅ and with two linear singletons (one AA-path and one BB-path), and WR(A M3 , B M3 ), with two AB-paths and two AB-cycles.The labeling (X:Y) indicates that Y = s(X, M i ).

Family-free relational diagram
An efficient way to solve the family-free DCJ-indel distance is to develop an ILP that searches for its solution in a general graph, that represents all possible diagrams corresponding to all candidate matchings, in a similar way as the approaches given in [4,16,22].Given two genomes A and B and their marker similarity graph S x (A, B), the structure that integrates the properties of all diagrams of mapped genomes is the family-free relational diagram FFR(A, B, S x ), that has a set V (A) with a vertex for each of the two extremities of each marker of genome A and a set V (B) with a vertex for each of the two extremities of each marker of genome B. Again, sets E A adj and E B adj contain adjacency edges connecting adjacent extremities of markers in A and in B. But here the set E γ contains, for each edge ab ∈ S x (A, B), an extremity edge connecting a t to b t , and an extremity edge connecting a h to b h .To both edges a t b t and a h b h , that are called siblings, we assign the same weight, that corresponds to the similarity of the edge ab in S Furthermore, for each marker m there is an indel edge connecting the vertices m h and m t .The indel edge m h m t receives a weight w(m h m t ) = max{σ(mv)|mv ∈ S x (A, B)}, that is, it is the maximum similarity among the edges incident to the marker m in S x (A, B).We denote by E A id the set of indel edges of markers in genome A and by E B id the set of indel edges of markers in genome B. An example of a family-free relational diagram is given in Figure 4.

Consistent decompositions
The diagram FFR(A, B, S x ) may contain vertices of degree larger than two.A decomposition of FFR(A, B, S x ) is a collection of vertex-disjoint components, that can be cycles and/or paths, covering all vertices of FFR(A, B, S x ).There can be multiple ways of selecting a decomposition, and we need to find one that   ).We represent in multiple colors the edges that correspond to multiple matchings.
allows to identify a matching of S x (A, B).A set S ⊆ E γ is a sibling-set if it is exclusively composed of pairs of siblings and does not contain any pair of incident edges.Thus, a sibling-set S of FFR(A, B, S x ) corresponds to a matching of S x (A, B).In other words, there is a clear bijection between matchings of S x (A, B) and sibling-sets of FFR(A, B, S x ) and we denote by M S the matching corresponding to the sibling-set S.
The set of edges D[S] induced by a sibling-set S is said to be a consistent decomposition of FFR(A, B, S x ) and can be obtained as follows.In the beginning, D[S] is the union of S with the sets of adjacency edges E A adj and E B adj .We then need to determine the complement of the sibling-set S, denoted by S, that is composed of the indel-edges of FFR(A, B, S x ) that must be added to D Given that S is the sets of all sibling-sets of FFR(A, B, S x ), we compute the family-free DCJ-indel distance of A and B with the following equation:

Capping
Telomeres produce some difficulties for the decomposition of FFR(A, B, S x ), and a known technique to overcome this problem is called capping [13].It consists of modifying the diagram by adding artificial markers, also called caps, whose extremities should be properly connected to the telomeres of the linear chromosomes of A and B. Therefore, usually the capping depends on the numbers κ A and κ B , that are, respectively, the total numbers of linear chromosomes in genomes A and B.
Family-based singular genomes: First we recall the capping of family-based singular genomes.Here the caps must circularize all linear chromosomes, so that their relational diagram is composed of cycles only, but, if the capping is optimal, the DCJ-indel distance is preserved.
An optimal capping that transforms singular linear genomes A and B into singular circular genomes can be obtained after identifying the recombination groups [6].The DCJ-indel distance is preserved by properly linking the components of each identified recombination group into a single cycle [4].Such a capping may require some artificial adjacencies between caps.The following result is very useful.
Theorem 1 (from [4]).We can obtain an optimal capping of singular genomes A and B with exactly , each one representing a cap extremity.Each of the 2κ A telomeres of A is connected by an adjacency edge to a distinct cap extremity among Similarly, each of the 2κ B telomeres of B is connected by an adjacency edge to a distinct cap extremity among B by an artificial adjacency edge.All these new adjacency edges and artificial adjacency edges are added to E A adj and E B adj , respectively.We also connect each ) is induced by a sibling-set S ⊆ E γ and a (maximal) capping-set P ⊆ E • and is composed of vertex disjoint cycles that cover all vertices of FFR • (A, B, S x ).An example of a capped family-free relational diagram is given in Figure 9 (Appendix A).
that can still be redesigned to a form that can be easier implemented in the ILP [4].First, let a transition in a cycle C be an indel-free segment of C that is between a run in one genome and a run in the other genome and denote by ℵ(C) the number of transitions in C. Observe that, if C is indel-free, then obviously ℵ(C) = 0.If C has a single run, then we also have ℵ(C) = 0. On the other hand, if C has at least 2 runs, then ℵ(C) = Λ(C).The new formula is split into two parts.The first part is the function r(C), defined as r(C) = 1 if Λ(C) ≥ 1, otherwise r(C) = 0, that simply tests whether C is indel-enclosing or indel-free.The second part depends on the number of transitions ℵ(C), and the complete formula stands as follows [4]: Distance formula: Note that the number of indel-enclosing components is C∈Q[S,P ] r(C) = c r Q + s Q , where c r Q and s Q are the number of indel-enclosing AB-cycles and the number of circular singletons in Q[S, P ], respectively.Furthermore, the number of indel-free AB-cycles of Q , and Given that S and P max are, respectively, the sets of all sibling-sets and all maximal capping-sets of FFR • (A, B, S x ), the final version of our optimization problem is ffd id dcj (A, B, S x ) = min S∈S,P ∈Pmax wd id dcj (Q[S, P ]) .

ILP formulation to compute the family-free DCJ-indel distance
Our formulation is an adaptation of the ILP for computing the DCJ-indel distance of family-based natural genomes, by Bohnenkämper et al. [4], that is itself an extension of the ILP for computing the DCJ distance of family-based balanced genomes, by Shao et al. [22].The main differences between our approach and the approach from [4] are the underlying graphs and the objective functions.The general idea is searching for a sibling-set, that, together with a maximal capping-set, gives an optimal consistent cycle decomposition of the capped diagram FFR • (A, B, S x ) = (V, E), where the set of edges comprises all disjoint sets of distinct types: While in the ILP from [4] the search space is restricted to maximal sibling-sets, in the family-free DCJ-indel distance the search space includes all sibling-sets, of any size.
In Algorithm 1 we give the formulation for computing ffd id dcj (A, B, S x ), distributed in three main parts.).The objective function of our ILP minimizes the size of the sibling-set, with sum over variables x e , the number of circular singletons, calculated by the sum over variables s k , half the overall number of transitions in indel-enclosing AB-cycles, calculated by the sum over variables t e , and the weight of all indel edges in the decomposition, given by the sum over their weights w e x e for all e ∈ E id , while maximizing both the number of indel-free cycles, counted by the sum over variables z i , and half of the weights of the edges in the decomposition, given by the sum over their weights w e x e for all edges e ∈ E γ .The minimization is not affected by constant p * , that is included in the objective function to keep the correspondence to Equation (2).
Algorithm 1 ILP for computing the family-free DCJ-indel distance Comparison to related models: Since the pre-requisites of a family-free setting differ substantially from those of a family-based setting, we could not compare our approach to the one from [4].We intend to perform such a comparison in a future work, for example by using pairwise similarities to cluster the genes into families.Comparing our approach to the original family-free DCJ distance was also not possible, because the ILP provided in [16] is only suitable for unichromosomal genomes.Again, we intend to perform such a comparison in a future work, after we implement an ILP that is able to compute the family-free DCJ distance of multichromosomal genomes.

Unweighted version:
In the present work, for comparison purposes, we also implemented a simpler version of the family-free DCJ-indel distance, that simply ignores all weights.This version is called unweighted familyfree DCJ-indel distance, and consists of finding a sibling-set in FFR • (A, B, S x ) that minimizes d id dcj (D[S, P ]).
But here it is important to observe that smaller sibling-sets, that simply discard blocks of contiguous markers, tend to give the smaller distances.Considering the similarity graph S 0.1 (A, B) of Figure 3, the trivial empty matching gives the distance d id dcj (A M ∅ , B M ∅ ) = 2 (deletion of the chromosome of A followed by the insertion of the chromosome of B).For the other matchings we have d id dcj (A M1 , B M1 ) = 4 and d id dcj (A M2 , B M2 ) = d id dcj (A M3 , B M3 ) = 3.We then restrict the search space to maximal sibling-sets only, avoiding that blocks of markers are discarded.However, this could also enforce weak connections.In the example shown in Figure 3, both maximal matchings M 1 and M 2 have weak edges with weights 0.2 and 0.3.Matching M 3 has only edges with weight at least 0.6, but it would be ignored for being non-maximal.Enforcing weak connections can be prevented by removing them form the similarity graph, that is, by assigning a higher value to the cutting threshold x. (see an example in Figure 8 of Appendix A).Given that S max and P max are, respectively, the sets of all maximal sibling-sets and all maximal capping-sets of FFR • (A, B, S x ), the unweighted version of the problem is then: For computing the unweighted unwffd id dcj (A, B, S x ) we need to slightly modify the ILP described in Algorithm 1.The details are given in Appendix C.

Implementation:
The ILPs for computing both the family-free DCJ-indel distance and its unweighted version were implemented and can be downloaded from https://gitlab.ub.uni-bielefeld.de/gi/gen-diff.
Data analysis: For all pairwise comparisons, we obtained gene similarities using the FFGC pipeline3 [11], with the following parameters: (i) 1 for the minimum number of genomes for which each gene must share some similarity in, (ii) 0.1 for the stringency threshold, (iii) 1 for the BLAST e-value, and (iv) default values for the remaining parameters.As an ILP solver, for all experiments we ran CPLEX with 8 2.67GHz cores.
Cutting threshold: Differently from the unweighted version, that requires a cutting threshold of about x = 0.5 to give accurate results, the weighted ffd id dcj was designed to be computed with all given pairwise similarities, i.e., with the cutting threshold x = 0, that leads to a "complete" family-free relational diagram.Such a diagram would be too large to be handled in practice, therefore, if x = 0, we consider only the similarities that are strictly greater than 0. Nevertheless, for bigger instances the diagram with similarities close to 0 might still be too large to be solved in reasonable time.Hence, for some instances it may be necessary to do a small increase of the cutting threshold.Our experiments in real data (described in Section 5.2) show that small similarities have a minor impact on the computed distance, therefore, by adopting a small cutting threshold x up to 0.3, it is possible to reduce the diagram and solve bigger instances, still with good accuracy.

Performance evaluation
We generated simulated genomes using Artificial Life Simulator (ALF) [10] in order to benchmark our algorithm for computing the family-free DCJ-indel distance.We simulated and compared 190 pairs of genomes with different duplication rates, keeping all other parameters fixed (e.g.rearrangement, indel and mutation rates).The extant genomes have around 10,000 genes.We obtained gene similarities between simulated genomes using FFGC.For each genome pair, a threshold of x = 0.1 resulted in up to 8,400 genes with multiple homology relations (i.e.vertices with degree > 1 in S 0. 1 (A, B)) and from 2 to 2.8 relations on average for those genes.In addition, each pair is about 3,000 rearrangement events away from each other.The complete parameter sets used for running ALF, together with additional information on simulated genomes, can be found in Appendix D.
For computing the family-free DCJ-indel distances, we ran CPLEX with maximum CPU time of 1 hour.Results were grouped depending on the number of genes with multiple homology relations in the respective genome pairs.Figure 5 summarizes the performance of our weighted family-free DCJ-indel distance formulation.The running times escalate quickly as the number of genes with multiple homologies increase (Figure 5a, grouped in intervals of 100), reaching the time limit after 2,000 of them (Figure 5b, grouped in intervals of 500).The optimality gap is the relative gap between the best solution found and the upper bound found by the solver, calculated by ( upper bound best solution − 1) × 100, and appears to grow, for our simulated data, linearly in the number of genes with multiple homologies (Figure 5b).
The solution time and the optimality gap of our algorithm clearly depends less on genome sizes and more on the multiplicity of homology relations.In our experiments, we were able to find in 1 hour optimal or near-optimal solutions for genomes with 10,000 genes and up to 4,000 genes with 2.2 homology relations on average.Our formulation should be able to handle, for instance, the complete genomes of bacteria, fungi and insects, or even sets of chromosomes of mammal and plant genomes.

Gap (%)
Optimality gap for simulated data (b) Fig. 5: Results of the weighted family-free DCJ-indel distance given by the solver, (a) shows the average running time for instances grouped by the number of vertices with degree > 1 in S 0.1 (A, B) (in intervals of 100, those greater than 900 are not shown), and (b) for groups of instances that did not finish within the time limit of 1 hour, the average optimality gap and the average number of homology relations for those genes with multiple homologies (in intervals of 500).

Real data analysis
We evaluated the potential of our approach by comparing genomes of fruit flies from the genus Drosophila, including the following species: D. busckii, D. melanogaster, D. pseudoobscura, D. sechellia, D. simulans and D. yakuba [1,9,17,25].A reference phylogenetic tree of these species is shown in Figure 10, in Appendix E, where we also give the sources of the DNA sequences for each analyzed genome, and additional information on the experiments.Each genome has approximately 150Mb, with about 13,000 genes distributed in 5-6 chromosomes.We obtained gene similarities using FFGC and performed two separate experiments, whose computed distances were used to build phylogenetic trees using Neighbor-Joining [14,19].

Pairwise comparison of complete genomes:
In this experiment, genomes in each comparison comprise together ∼ 13,000 genes with multiple homologies (11.2 on average), some of them having about 90 relations considering similarities that are strictly greater than x = 0. Since these instances were too large, we set the threshold to x = 0.3.We then ran CPLEX with maximum CPU time of 3 hours.All ffd id dcj computations finished within the time limit, most of them in less than 10 minutes, whereas the unweighted unwffd id dcj computations, in spite of having a search space of maximal sibling-sets, that is much smaller, surprisingly took from 1 to 3 hours.We conjecture that this is due to a large number of co-optimal solutions in the unweighted version, while in ffd id dcj the co-optimality is considerably minimized by weights, which helps the solver to converge faster.While the tree given by ffd id dcj , shown in Figure 6a, agrees with the reference tree, the tree given by unwffd id dcj , shown in Figure 6b, diverges from the reference in a single branch.Details of the results are given in Appendix E.1.Fig. 6: Based on distance matrices calculated by our ILPs for the pairwise comparisons of complete genomes (Gen) or only X chromosomes (Xchr) of Drosophila, we built phylogenetic trees computed by the Neighbor-Joining method [14,19].The output of this algorithm is an unrooted tree, and we assumed the most distant species D. busckii as the outgroup for rooting the trees.All comparisons converged to exactly two trees, and next to each tree we give a list of comparisons that produced that tree.The tree in (a) agrees with the reference shown in Figure 10 (Appendix E), while the tree in (b) diverges from the reference in a single branch.

Pairwise comparison of X chromosomes:
We also did an experiment with smaller instances, composed of pairwise comparisons of X chromosomes only, so that we could evaluate the impact of the cutting threshold on the accuracy of the approach.In this experiment, considering similarities that are strictly greater than x = 0, each pair comprises 1,000-2,000 genes with multiple homologies (5 on average) with as many as 30 relations.
We computed ffd id dcj with cutting thresholds x = 0, x = 0.1, x = 0.2 and x = 0.3, always obtaining the accurate phylogenetic tree from Figure 6a.These results suggest that a small cutting threshold allows to reduce the size of the instances, without having a big impact in the accuracy of ffd id dcj .In addition, we computed unwffd id dcj with cutting thresholds x = 0 and x = 0.3, both resulting in the slightly inaccurate tree from Figure 6b, and x = 0.5, that also resulted in the accurate tree from Figure 6a.As expected, in the unweighted formulation the cutting threshold plays a major role in the accuracy of the calculated distances.
The analyses were done with maximum CPU time of 1 hour.The comparisons finished within a few seconds for most of instances, except for unwffd id dcj with threshold x = 0, for which the majority of the pairwise comparisons reached the time limit-with an optimality gap of less than 3.5% though (see Appendix E.2).

Length of indel segments:
As a generalization of the singular DCJ-indel model [6], the basic idea behind our approach is that runs can be merged and accumulated with DCJ operations.This is a more parsimonious alternative to the trivial approach of inserting or deleting exclusive markers individually.However, it raises the question of whether the indels then tend to be very long, and whether this makes biological sense.Considering that it is possible to distribute the runs so that each indel is composed of 1-2 runs, we can say that the lengths of the runs play a major role in defining the length of indel segments.In the particular analysis of Drosophila complete genomes, we have an average run length of 5.1, while the maximum run length is 121.We conjecture that the long runs are mostly composed of genes that are part of a contiguous segment from the beginning, and are not really accumulated by DCJ operations.In a future work we intend to have a closer look into the long runs, so that we can characterize their structures and verify this conjecture.

Conclusions and discussion
In this work we proposed a new genomic distance, for the first time integrating DCJ and indel operations in a family-free setting.In this setting the whole analysis requires less pre-processing and no classification of the data, since it can be performed based on the pairwise similarities of markers in both genomes.Based on the positions and orientations of markers in both genomes we build the family-free relational diagram.We then assign weights to the edges of the diagram, according to the given pairwise similarities.A sibling-set of edges corresponds to a set of matched markers in both genomes.Our approach transfers weights from the edges to matched and unmatched markers, so that, again for the first time, an optimal solution does not necessarily need to maximize the number of matched markers.Instead, the search space of our approach allows solutions composed of any number of matched markers.The computation of our new family-free DCJ-indel distance is NP-hard and we provide an efficient ILP formulation to solve it.
The experiments on simulated data show that our ILP can handle not only bacterial genomes, but also complete genomes of fungi and insects, or sets of chromosomes of mammals and plants.We performed a comparison study of six fruit fly genomes, using the obtained distances to reconstruct the phylogenetic tree of the six species, obtaining accurate results.This study was a first validation of the quality of our method and a more rigorous evaluation will be performed in a future work.In particular, we intend to analyze the reasons behind insertions and deletions of long segments and verify the quality of the obtained gene matchings, by comparing them to the annotated orthologies given by public databases.Furthermore, as already mentioned, we plan to compare our ILP to the one given in [4], once we manage to cluster the genes into families, and also to implement an ILP that is able to compute the family-free DCJ distance described in [16] for multichromosomal genomes, so that we can compare it to our ILP.

A Supplementary figures
Figure 7 shows a BB-path with 4 runs, and how its indel-potential can be achieved.(ii) After an optimal DCJ that creates a new cycle, one A-run is accumulated (between edges e 4 and e 3 there is only an adjacency edge) and two B-runs are merged (e 2 is in the same run with e 5 and e 6 ).Indeed the indel-potential of the original BB-path is three.
Figure 8 shows -in contrast to the marker similarity graph S 0.1 (A, B) of Figure 3 -the graph S 0.5 (A, B), two of its matchings and the corresponding weighted relational diagrams.

B Computational complexity of the family-free DCJ-indel distance
In the family-based setting, if two genomes contain the same number of occurrences of each marker, they are said to be balanced.Notice that there are no exclusive markers in this case.The problem of computing the DCJ distance of balanced genomes is NP-hard [22].We use this problem in straightforward reductions to show that computing the family-free DCJ-indel distance, both weighted and unweighted, are NP-hard problems.The first step of the reduction is the creation of a similarity graph and transforming family-based into family-free genomes as follows.Let A and B be two balanced genomes.Rename each occurrence of each marker in each of the two genomes A and B , so that we get two family-free genomes A and B; and build the marker similarity graph S 1 (A, B) by connecting markers a in A and b in B if they correspond to occurrences of the same marker in the original family-based genomes and setting σ(ab) = 1.Note that, for any 0 ≤ x ≤ 1, we have S x (A, B) = S 1 (A, B).
Theorem 3.For given genomes A and B and a marker similarity graph S x (A, B) for any 0 ≤ x ≤ 1, computing the unweighted family-free DCJ-indel distance unwffd id dcj (A, B, S x ) is NP-hard.Proof.Given balanced genomes A and B , we obtain family-free genomes A and B and their marker similarity graph S x (A, B) = S 1 (A, B) as described above.Then, a maximal matching in S x (A, B) that finds the unweighted family-free DCJ-indel distance unwffd id dcj (A, B, Sx) implies immediately in finding the DCJ distance of balanced genomes A and B .Theorem 4. For given genomes A and B and a marker similarity graph S x (A, B) for any 0 ≤ x ≤ 1, computing the weighted family-free DCJ-indel distance ffd id dcj (A, B, S x ) is NP-hard.Proof.Given balanced genomes A and B , we obtain family-free genomes A and B and their marker similarity graph S x (A, B) = S 1 (A, B) as described above.We now show that ffd id dcj (A, B, S 1 ) is given by a matching in S 1 (A, B) of maximal cardinality.
Let n := |A| = |B|.Let M be a matching in S 1 (A, B).First, notice that |M | = w(M ), for any matching M in S 1 (A, B).Thus, we can compute the weighted DCJ-indel distance wd id dcj (A M , B M ) as follows: Notice that w( M ) = 2(n − |M |) for any matching M in S 1 (A, B).Then, if M is maximal, i.e., if |M | = n, no indel operation is performed on the genomes and thus On the other hand, if we take the trivial empty matching M ∅ , no DCJ operation is performed and we have at least 2 indel operations (one per chromosome of A and one per chromosome of B).Since w( M ∅ ) = 2n, we have wd id dcj (A M ∅ , B M ∅ ) = d id dcj (A M ∅ , B M ∅ ) + w( M ∅ ) ≥ 2 + 2n .Therefore, the trivial empty matching is definitely not a candidate for giving the optimal solution.Now let M 1 , M 2 , . . ., M n be a sequence of matchings in S 1 (A, B) such that, for any 1 ≤ i ≤ n, |M i | = i and M i+1 = M i ∪ {e}, with e / ∈ M i , i.e., M i+1 is obtained by adding exactly one edge to the matching M i .We necessarily have w( M i+1 ) = w( M i ) − 2 , and , meaning that, in the worst case, we increase the size of a matching and keep the same distance: wd id dcj (A Mi+1 , B Mi+1 ) ≤ wd id dcj (A Mi , B Mi ) .However, for the last pair of consecutive matchings M n−1 and M n , we know that the number of indels decrease from 2 to 0, while the DCJ part of the formula increases at most +2, that is d id dcj (A Mn , B Mn ) ≤ d id dcj (A Mn−1 , B Mn−1 ).Since we still have w( M n ) = w( M n−1 ) − 2, it is clear that wd id dcj (A Mn , B Mn ) ≤ wd id dcj (A Mn−1 , B Mn−1 ) − 2 .Therefore, the weighted family-free DCJ-indel distance ffd id dcj (A, B, S 1 ) corresponds to a maximal matching of S 1 (A, B).And since all maximum matchings of S 1 (A, B) give mapped genomes without exclusive markers, ffd id dcj (A, B, S 1 ) is exactly the DCJ distance of original balanced genomes A and B .

C ILP for computing the unweighted family-free DCJ distance
For computing the unweighted unwffd id dcj (A, B, S x ) we need to slightly modify the ILP described in Algorithm 1.Besides all its constraints, variables and domains, to ensure that a matching is of maximal cardinality, we add a new constraint as follows:

D Generation of simulated data
Here we describe the process and the parameters used in Artificial Life Simulator (ALF) [10] for generating our simulated data.Each one of the 190 instances generated consists of a pair of simulated genomes.We used the default values for parameters not mentioned.PAM units were used as time scale for simulation, starting with a randomly generated root genome with 10,000 genes, whose lengths where drawn from a Gamma distribution with k = 2.4019 and θ = 133.8063(minimum length 100).We used a custom evolutionary tree defining an speciation event after 25 time units, resulting in two leaf species, which evolved for additional 25 time units.The WAG substitution model was used together with Zipfian indels in DNA sequences with rate 0.0002 (maximum length 50).Such rate varies among sites according to a Gamma distribution with shape 1 and 10 classes.In addition, we set the rate of invariable sites to 0.001.Inversions and translocations of up to 30 genes were allowed at a rate of 0.0025.Finally, for generating instances comprising genes with different numbers of homologies, we varied the gene duplication and the gene loss rates between 1 × 10 −5 and 2 × 10 −3 .

E Analysis of Drosophila genomes
We downloaded the genomes of 6 species of Drosophila [1,9,17,25] from NCBI 4 .In our experiments we used the assemblies listed in Figure 10, with their respective gene annotations.Fig. 10: List of genomes used in our experiments and a reference phylogenetic tree of the respective species of Drosophila given by TimeTree [15], a public knowledge-base for information on the tree-of-life and its evolutionary timescale.

Species
As already mentioned, we obtained pairwise similarities between genes of Drosophila genomes using the FFGC pipeline5 [11] with the following parameters: (i) 1 for the minimum number of genomes for which each gene must share some similarity in, (ii) 0.1 for the stringency threshold, (iii) 1 for the BLAST e-value, and (iv) default values for the remaining parameters.
In the following subsections, in-depth information is provided on the results for experiments using complete genomes and X chromosomes of the listed Drosophila species.

E.1 Complete genomes
The first tables in this section detail the results of the comparison of complete genomes in terms of the BLAST alignment performed for all genes, and the corresponding similarity graphs for each genome pair without cutting threshold.This data was generated using the FFCG pipeline with the parameters described above.Unplaced scaffolds were discarded, decreasing the number of genes, from ∼ 15,000 to ∼ 13,000.Table 1 outlines the number of gene pairs in each similarity range for each pair of genomes.Table 2 shows the number of genes with no homology relations (which induce trivial selections of indel edges in the relational diagram), the number of genes with exactly one homology relation and the number of genes with multiple homologies (which pose a significant challenge to the solver).The computed distances and elapsed time (or gap in % when the solver reaches the time limit) in the pairwise comparisons with cutting threshold 0.3 are shown in Tables 3 and 4. The solver was set to stop after finding a solution with optimality gap smaller than 0.5% or after 3 hours.Table 2: Association between genes in pairwise comparisons of complete genomes, considering pairwise similarities strictly greater than 0. The tables show the number of genes with zero, one and multiple homology relations, respectively.For all of them, the element stored in line i and column j represents the number of genes of the species i in the pairwise comparison of genomes i and j.

E.2 X chromosomes
Unplaced scaffolds of X chromosomes were discarded, decreasing the overall number of genes from ∼ 2,500 to ∼ 2,000.Similarity values in pairwise comparisons are given in Table 5.The number of genes with 0, 1 and multiple homologies are given in Table 6.Results for ffd id dcj and for unwffd id dcj are shown in Tables 7 and 8 (CPLEX was set to stop after finding a solution with optimality gap smaller than 0.1% or after 1 hour).Table 6: Association between genes in pairwise comparisons of the corresponding X chromosomes, considering pairwise similarities strictly greater than 0. For the three tables, the element stored in line i and column j represents the number of genes of the species i in the pairwise comparison of genomes i and j.The X chromosome of D. pseudoobscura was fused with another chromosome during evolution [17], therefore it presents a larger number of unassociated genes when compared to the other species.

Fig. 3 :
Fig.3: Considering the same genomes A = {1 2 3 4 5} and B = {6 7 8 9 10 11} as in Figure2, let M 1 (red) and M 2 (blue) be two distinct maximal matchings in S 0.1 (A, B).We also represent the non-maximal matching M 3 (cyan) that is a subset of M 2 .In the middle part we show diagrams WR(A M1 , B M1 ) and WR(A M2 , B M2 ), both with two AB-paths and two AB-cycles.In the lower part we show diagrams WR(A M ∅ , B M ∅ ), corresponding to the trivial empty matching M ∅ and with two linear singletons (one AA-path and one BB-path), and WR(A M3 , B M3 ), with two AB-paths and two AB-cycles.The labeling (X:Y) indicates that Y = s(X, M i ).

Fig. 4 :
Fig. 4: Given genomes A = {1 2 3 4 5} and B = {6 7 8 9 10 11}, in the left part we represent the marker similarity graph S 0.1 (A, B) and in the right part the family-free relational diagram FFR(A, B, S 0.1 ).We represent in multiple colors the edges that correspond to multiple matchings.
[S]: for each indel edge e, if its two endpoints have degree one or zero in D[S], then e is added to both S and D[S].(Note that S = M S , while |S| = 2|M S | and w(S) = 2w(M S ).)The consistent decomposition D[S] covers all vertices of FFR(A, B, S x ) and is composed of cycles and paths, allowing us to compute the values d id dcj (D[S]) = where c D and i D are the numbers of AB-cycles and AB-paths in D[S], respectively, and δ D is the optimal deduction of recombinations of paths from D[S].
caps and |κ A − κ B | artificial adjacencies between caps.Capped family-free relational diagram: We transform FFR(A, B, S x ) into the capped family-free relational diagram FFR • (A, B, S x ) as follows.Again, let p

Theorem 2 .
Let P max be the set of all distinct (maximal) capping-sets from FFR • (A, B, S x ).For each sibling-set S of FFR(A, B, S x ) and FFR • (A, B, S x ), we have d id dcj (D[S]) = min P ∈Pmax {d id dcj (Q[S, P ])} , and wd id dcj (D[S]) = min P ∈Pmax {wd id dcj (Q[S, P ])} .Proof.Each capping-set corresponds to exactly p * caps.In addition, all adjacencies, including the |κ A − κ B | artificial adjacencies between cap extremities, are part of each consistent decomposition.Recall that each sibling-set S of FFR • (A, B, S x ) corresponds to a matching M S of S x (A, B).The set of consistent decompositions include all possible distinct consistent decompositions induced by S together with one distinct element of P max .Theorem 1 states that the pair of matched genomes A M S and B M S can be optimally capped with p * caps and |κ A − κ B | artificial adjacencies.Therefore, it is clear that d id dcj (D[S]) = min P ∈Pmax {d id dcj (Q[S, P ])}.Since the capping does not change the sizes of the sibling-sets and their weights and complements, it is also clear that wd id dcj (D[S]) = min P ∈Pmax {wd id dcj (Q[S, P ])}.Alternative formula for computing the indel-potential of cycles: The consistent decompositions of the diagram FFR • (A, B, S x ) are composed exclusively of cycles, and the number of runs Λ(C) of a cycle C is always in {0, 1, 2, 4, 6, . ..}.Therefore, the formula to compute the indel-potential of a cycle C can be simplified to Counting indel-free cycles in the decomposition makes up the first part, depicted in constraints (C.01)-(C.06),variables and domains (D.01)-(D.03).The second part is for counting transitions, described in constraints (C.07)-(C.10),variables and domains (D.04)-(D.05).The last part describes how to count the number of circular singletons, with constraint (C.11), variable and domain (D.06

Fig. 7 :
Fig.7: (i) A BB-path with 4 runs.(ii) After an optimal DCJ that creates a new cycle, one A-run is accumulated (between edges e 4 and e 3 there is only an adjacency edge) and two B-runs are merged (e 2 is in the same run with e 5 and e 6 ).Indeed the indel-potential of the original BB-path is three.

Fig. 8 :
Fig. 8: Considering the same genomes A = {1 2 3 4 5} and B = {6 7 8 9 10 11} as in Figure 2, let M 1 (orange) and M 2 (cyan) be two distinct maximal matchings in S 0.5 (A, B).In the middle part we show the diagram R(A M1 , B M1 ), that has two AB-paths and one AB-cycle, giving d id dcj (A M1 , B M1 ) = 4.In the right part we show the diagram R(A M2 , B M2 ), that has two AB-paths and two AB-cycles, giving d id dcj (A M2 , B M2 ) = 3.

Fig. 9 :
Fig. 9: The capped version of the family-free relational diagram from Figure 4.

Table 1 :
Distribution of similarities between genes (and percentage) in pairwise comparisons of complete genomes.

Table 3 :
Computed ffd id dcj and elapsed time (or gap in %) in pairwise comparisons of complete genomes, with cutting threshold x = 0.3.The time limit for execution of the ILP solver is 10800s.

Table 4 :
Computed unwffd id dcj and elapsed time (or gap in %) in pairwise comparisons of complete genomes, with cutting threshold x = 0.3.The time limit for execution of the ILP solver is 10800s.

Table 5 :
Distribution of similarities (and percentage) in pairwise comparisons of X chromosomes.

Table 7 :
Computed ffd id dcj and elapsed time (or gap in %) in pairwise comparisons of X chromosomes, with cutting thresholds ranging between x = 0.0 and x = 0.3.The time limit is 3600s.

Table 8 :
Computed unwffd id dcj and elapsed time (or gap in %) in pairwise comparisons of X chromosomes, with cutting thresholds x = 0.0, x = 0.3 and x = 0.5.The time limit is 3600s.