Recombinations, chains and caps: resolving problems with the DCJ-indel model

One of the most fundamental problems in genome rearrangement studies is the (genomic) distance problem. It is typically formulated as finding the minimum number of rearrangements under a model that are needed to transform one genome into the other. A powerful multi-chromosomal model is the Double Cut and Join (DCJ) model. While the DCJ model is not able to deal with some situations that occur in practice, like duplicated or lost regions, it was extended over time to handle these cases. First, it was extended to the DCJ-indel model, solving the issue of lost markers. Later, ILP solutions for so-called natural genomes, in which each genomic region may occur an arbitrary number of times, were developed, making it possible, in theory, to solve the distance problem for any pair of genomes. However, some theoretical and practical issues remained unsolved. On the theoretical side, there exist two disparate views of the DCJ-indel model, motivated in the same way, but with different conceptualizations that could not be reconciled so far. On the practical side, while ILP solutions for natural genomes typically perform well on telomere-to-telomere resolved genomes, they have been shown in recent years to quickly lose performance on genomes with a large number of contigs or linear chromosomes. This has been linked to a particular technique, namely capping. Simply put, capping circularizes linear chromosomes by concatenating them during solving time, increasing the solution space of the ILP superexponentially. Recently, we introduced a new conceptualization of the DCJ-indel model within the context of another rearrangement problem. In this manuscript, we apply this new conceptualization to the distance problem. In doing so, we uncover the relation between the disparate conceptualizations of the DCJ-indel model. We are also able to derive an ILP solution to the distance problem that does not rely on capping.
This solution significantly improves upon the performance of previous solutions on genomes with high numbers of contigs while still solving the problem exactly and being competitive in performance otherwise. We demonstrate the performance advantage on simulated genomes as well as showing its practical usefulness in an analysis of 11 Drosophila genomes. Supplementary Information The online version contains supplementary material available at 10.1186/s13015-024-00253-7.


Fig. B1: Safe internal operations ((a) to (d)) in a bridge of type P A|b , P a|b , P B|a "accumulate" elements of a run ((a), (b) and (d)) or "cut in between runs", reducing their number by 2 as described in [11]. In the end, only the components P A|b , P a|b and P B|a are not yet handled. Note also that the same sorting can be achieved after a completion as in Figure B2.
Table B2: The indel potential from [11] corresponds to the term P in our distance formula if considered for a single bridge. Bridges with an odd number of extremities per genome (i.e. odd bridges) were padded with an odd viaduct of length 1.

Table B3: Path Recombinations used in Recombination Groups as listed in Tables 1 and 2 of [11] are explained by safe operations between unsaturated components. The concrete effect on the distance as opposed to a separate sorting is explained by the differences of P in two graphs containing the respective sources separately as opposed to one graph containing both. Paddings are administered as in Table B2.

Fig. C3: dingII and ding-cf evaluated on simulated genomes with the same parameters as described in Section 5 with a single linear chromosome as the root genome. On the left, the number of simulated operations is increased from 10,000 to 20,000 in steps of 2500. On the right, the number of markers is increased from 10,000 and 2500 to 100,000 in steps of 25,000.
The further performance test data of Figure C3 indicate that ding-cf behaves similarly to ding regarding the number of operations and markers (compare to [7]), even outperforming ding on harder problem instances. The outliers for the smallest number of markers in Figure C3 indicate that the number of operations alone might not be such a significant factor, but rather the number of operations relative to the size of the genome.

D.1 Evaluating the Distance Matrix
To estimate the strength of the phylogenetic signal in the obtained distance data of Table D6, we compared it to the path metric of the reconstructed tree, which we give here in Newick format for reference:
( ((ananassae:1537.6125,(((grimshawi:1095.78125,(mojavensis:915.77777778, virilis:950.22222222
Calculating the deviation between the distances computed by ding-cf and the tree path metric, we obtain an average of around 0.5% per entry, with the largest relative difference being 2% for the distance between D. melanogaster and D. simulans.
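The comparison described above can be illustrated with a minimal sketch. It assumes that the deviation of each entry is taken relative to the tree path distance, which is our reading and is not stated explicitly; the function names and the three-taxon toy values are hypothetical and are not taken from Table D6.

```python
def relative_deviations(dist, tree_dist):
    """Per-pair relative deviation |d - t| / t.

    Assumption: the deviation is normalized by the tree path distance t.
    Both arguments map frozenset({taxon1, taxon2}) to a positive distance.
    """
    return {pair: abs(dist[pair] - t) / t for pair, t in tree_dist.items()}


def summarize(devs):
    """Return (mean deviation, pair with largest deviation, largest deviation)."""
    mean = sum(devs.values()) / len(devs)
    worst_pair = max(devs, key=devs.get)
    return mean, worst_pair, devs[worst_pair]


# Hypothetical toy data (not the Drosophila distances).
computed = {
    frozenset(("a", "b")): 102.0,
    frozenset(("a", "c")): 200.0,
    frozenset(("b", "c")): 150.0,
}
tree_path = {
    frozenset(("a", "b")): 100.0,
    frozenset(("a", "c")): 200.0,
    frozenset(("b", "c")): 150.0,
}

mean_dev, worst_pair, worst_dev = summarize(relative_deviations(computed, tree_path))
print(f"mean: {mean_dev:.4f}, worst: {sorted(worst_pair)} at {worst_dev:.4f}")
```

Keying the dictionaries by unordered pairs avoids having to keep two symmetric matrices in the same taxon order.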
Another way to analyse how well a distance matrix reflects a tree is to perform a split decomposition. In short, a split is a generalized tree edge in the sense that it separates one subset of taxa from the rest. It is then possible to decompose the distance matrix into a subset of so-called d-splits and a residual matrix containing no further d-splits. SplitsTree displays a d-split as a set of parallel edges separating one part of the graph from the other. Most importantly, if a split decomposition is applied to an additive metric, that is, the path metric of a unique tree, the only d-splits in the resulting network are the edges of that tree. For more information, the interested reader is referred to [15].
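Additivity itself can be tested directly with the classical four-point condition: a metric is additive exactly when, for every four taxa, the two largest of the three pairwise sums d(x,y)+d(z,w), d(x,z)+d(y,w) and d(x,w)+d(y,z) are equal. The following minimal sketch checks only this condition and does not compute a split decomposition as SplitsTree does; the function name, the tolerance and the toy quartet are hypothetical.

```python
from itertools import combinations


def is_additive(taxa, dist, tol=1e-9):
    """Check the four-point condition on every quadruple of taxa.

    dist maps frozenset({x, y}) to the distance between x and y.
    """
    def d(x, y):
        return dist[frozenset((x, y))]

    for x, y, z, w in combinations(taxa, 4):
        sums = sorted([d(x, y) + d(z, w), d(x, z) + d(y, w), d(x, w) + d(y, z)])
        # The two largest sums must coincide (up to the tolerance).
        if sums[2] - sums[1] > tol:
            return False
    return True


# Toy quartet ((a,b),(c,d)): leaf edges of length 1, internal edge of length 2.
taxa = ["a", "b", "c", "d"]
tree_metric = {frozenset(p): 4.0 for p in combinations(taxa, 2)}
tree_metric[frozenset(("a", "b"))] = 2.0
tree_metric[frozenset(("c", "d"))] = 2.0

# Perturbing a single entry breaks additivity.
perturbed = dict(tree_metric)
perturbed[frozenset(("a", "c"))] = 5.0

print(is_additive(taxa, tree_metric), is_additive(taxa, perturbed))
```

On real data the condition rarely holds exactly, which is why a quantitative summary such as the fit reported by SplitsTree is more informative than a yes/no check.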
We used SplitsTree4 to construct such a splits network and give the output in Figure D4. The resulting network is very tree-like in appearance, with all major splits conforming to the tree computed for Figure 14. We provide zoomed-in versions of the minor splits displayed by SplitsTree. We see that, relative to the "correct" splits with which they conflict, they have far smaller weight. In combination with the fit of 97.7% calculated by SplitsTree for this data, we can further confirm that the distances calculated by ding-cf are close to being additive on this dataset.

Fig. B2: Forming bracelets and chains (as in [5]) from the components of the bridge of type P A|b , P a|b , P B|a of Figure B1 leaves only P A|b , P a|b and P B|a in an unsafe chain. Note how bracelets (a), (b) and (c) create exactly the same adjacencies as part of indel groups as operations (a), (b) and (c) in Figure B1 respectively.

Fig. D4: Phylogenetic network generated from the distances in Table D6 by SplitsTree4. The fit calculated by SplitsTree4 was 97.7%.

Table B4: Recombination groups from [11] can be facilitated entirely by safe operations. In all but 4 cases, resultants either contain no further unsaturated components or, if so, no further safe operations are available on the unsaturated components, as a bridge containing the respective partner would have been used in an operation higher in the table (last column). Exceptions are the groups S4, S5, S6, S7 (marked with star). In these cases, the resultant AB AB (ABBA ) could still be safely recombined with AB AB (AB BA ), but this recombination is only neutral (see Table B3).

Table D5: Statistics about Drosophila genomes after processing to unimog format.

Table D6: Statistics of all pairwise ding-cf runs, with lr the number of linear chromosomes, cr the number of circular chromosomes, mxfam the size of the largest family, #ambf the number of ambiguous families, #amb the average number of markers of ambiguous families and N the total number of markers.