An efficient way to solve the family-free DCJ-indel distance is to develop an ILP that searches for its solution in a general graph, that represents all possible diagrams corresponding to all candidate matchings, in a similar way as the approaches given in [12, 15, 16]. Given two genomes *A* and *B* and their marker similarity graph \({\mathcal {S}}_{x}(A,B)\), the structure that integrates the properties of all possible weighted relational diagrams of mapped genomes is the *family-free relational diagram* \(FFR(A,B,{\mathcal {S}}_{x})\), that has a set *V*(*A*) with a vertex for each of the two extremities of each marker of genome *A* and a set *V*(*B*) with a vertex for each of the two extremities of each marker of genome *B*.

Again, sets \(E_{\text {adj}}^{A}\) and \(E_{\text {adj}}^{B}\) contain adjacency edges connecting adjacent extremities of markers in *A* and in *B*. But here the set \(E_{\gamma }\) contains, for each edge \(ab \in {\mathcal {S}}_{x}(A,B)\), an extremity edge connecting \(a^{\,\!t}\) to \(b^{\,\!t}\), and an extremity edge connecting \(a^{\,\!h}\) to \(b^{\,\!h}\). To both edges \(a^{\,\!t}b^{\,\!t}\) and \(a^{\,\!h}b^{\,\!h}\), that are called *siblings*, we assign the same weight, that corresponds to the similarity of the edge *ab* in \({\mathcal {S}}_{x}(A,B)\): \(w(a^{\,\!t}b^{\,\!t})=w(a^{\,\!h}b^{\,\!h})=\sigma (ab)\). Furthermore, for each marker *m* there is an indel edge connecting the vertices \(m^{\,\!h}\) and \(m^{\,\!t}\). The indel edge \(m^{\,\!h}m^{\,\!t}\) receives a weight \(w(m^{\,\!h}m^{\,\!t})=\max \{ \sigma (mv) | mv \in {\mathcal {S}}_{x}(A,B) \}\), that is, it is the maximum similarity among the edges incident to the marker *m* in \({\mathcal {S}}_{x}(A,B)\). We denote by \(E_{\text {id}}^{A}\) the set of indel edges of markers in genome *A* and by \(E_{\text {id}}^{B}\) the set of indel edges of markers in genome *B*. An example of a family-free relational diagram is given in Fig. 4.

### Consistent decompositions

The diagram \(FFR(A,B,{\mathcal {S}}_{x})\) may contain vertices of degree larger than two. A *decomposition* of \(FFR(A,B,{\mathcal {S}}_{x})\) is a collection of vertex-disjoint *components*, that can be cycles and/or paths, covering all vertices of \(FFR(A,B,{\mathcal {S}}_{x})\). There can be multiple ways of selecting a decomposition, and we need to find one that allows to identify a matching of \({\mathcal {S}}_{x}(A,B)\). A set \(S \subseteq E_{\gamma }\) is a *sibling-set* if it is exclusively composed of pairs of siblings and does not contain any pair of incident edges. Thus, a sibling-set *S* of \(FFR(A,B,{\mathcal {S}}_{x})\) corresponds to a matching of \({\mathcal {S}}_{x}(A,B)\). In other words, there is a clear bijection between matchings of \({\mathcal {S}}_{x}(A,B)\) and sibling-sets of \(FFR(A,B,{\mathcal {S}}_{x})\) and we denote by \(M_S\) the matching corresponding to the sibling-set *S*.

The set of edges *D*[*S*] *induced* by a sibling-set *S* is said to be a *consistent decomposition* of \(FFR(A,B,{\mathcal {S}}_{x})\) and can be obtained as follows. In the beginning, *D*[*S*] is the union of *S* with the sets of adjacency edges \(E_{\text {adj}}^{A}\) and \(E_{\text {adj}}^{B}\). We then need to determine the *complement* of the sibling-set *S*, denoted by \({\widetilde{S}}\), that is composed of the indel-edges of \(FFR(A,B,{\mathcal {S}}_{x})\) that must be added to *D*[*S*]: for each indel edge *e*, if its two endpoints have degree one or zero in *D*[*S*], then *e* is added to both \({\widetilde{S}}\) and *D*[*S*]. (Note that \({\widetilde{S}}={\widetilde{M}}_S\), while \(|S| = 2|M_S|\) and \(w(S)=2w(M_S)\).) The consistent decomposition *D*[*S*] covers all vertices of \(FFR(A,B,{\mathcal {S}}_{x})\) and is composed of cycles and paths, allowing us to compute the values

$$\begin{aligned} d_{DCJ}^{~id} (D[S])= & {} \frac{|S|}{2} - c_{D} - \frac{i_{D}}{2} +\sum _{C \in D[S]} \lambda (C) -\delta _{D} \, \text{ and }\\ wd_{DCJ}^{~id}(D[S])= & {} d_{DCJ}^{~id} (D[S]) + \frac{|S|}{2}- \frac{w(S)}{2} + w({\widetilde{S}}) \,, \end{aligned}$$

where \(c_{D}\) and \(i_{D}\) are the numbers of \(AB\)-cycles and \(AB\)-paths in *D*[*S*], respectively, and \(\delta _{D}\) is the optimal deduction of recombinations of paths from *D*[*S*].

Given that \({\mathbb {S}}\) is the sets of all sibling-sets of \(FFR(A,B,{\mathcal {S}}_{x})\), we compute the family-free DCJ-indel distance of *A* and *B* with the following equation:

$$\begin{aligned} ffd_{DCJ}^{~id} (A, B, {\mathcal {S}}_{x}) = \min _{S \in {\mathbb {S}}}\{wd_{DCJ}^{~id}(D[S]) \}. \end{aligned}$$

### Capping

Telomeres produce some difficulties for the decomposition of \(FFR(A,B,{\mathcal {S}}_{x})\), and a known technique to overcome this problem is called *capping* [3]. It consists of modifying the diagram by adding *artificial* markers, also called *caps*, whose extremities should be properly connected to the telomeres of the linear chromosomes of *A* and *B*. Therefore, usually the capping depends on the numbers \(\kappa _{A}\) and \(\kappa _{B}\), that are, respectively, the total numbers of linear chromosomes in genomes *A* and *B*.

#### Family-based singular genomes

First we recall the capping of family-based singular genomes. Here the caps must circularize all linear chromosomes, so that their relational diagram is composed of cycles only, but, if the capping is optimal, the DCJ-indel distance is preserved.

An optimal capping that transforms singular linear genomes *A* and *B* into singular circular genomes can be obtained after identifying the recombination groups [6]. The DCJ-indel distance is preserved by properly linking the components of each identified recombination group into a single cycle [16]. Such a capping may require some artificial adjacencies between caps. The following result is very useful.

###
**Theorem 1**

(from [16]) *We can obtain an optimal capping of singular genomes* *A* *and* *B* *with exactly* \(p_* = \max \{\kappa _{A},\kappa _{B}\}\) *caps and* \(|\kappa _{A}-\kappa _{B}|\) *artificial adjacencies between caps.*

#### Capped family-free relational diagram

The diagram \(FFR(A,B,{\mathcal {S}}_{x})\) is transformed into the *capped family-free relational diagram* \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\) as follows. Add to \(FFR(A,B,{\mathcal {S}}_{x})\) \(4p_*\) new vertices, named \(\circ _{A}^{1},\circ _{A}^{2},\ldots ,\circ _{A}^{2p_*}\) and \(\circ _{B}^{1}, \circ _{B}^{2},\ldots ,\circ _{B}^{2p_*}\), each one representing a *cap extremity*. Connect each of the \(2\kappa _{A}\) telomeres of *A* by an adjacency edge to a distinct cap extremity among \(\circ _{A}^{1},\circ _{A}^{2},\ldots ,\circ _{A}^{2\kappa _{A}}\). Similarly, connect each of the \(2\kappa _{B}\) telomeres of *B* by an adjacency edge to a distinct cap extremity among \(\circ _{B}^{1},\circ _{B}^{2},\ldots ,\circ _{B}^{2\kappa _{B}}\). Moreover, if \(\kappa _{A} < \kappa _{B}\), for \(i = 2\kappa _{A}+1,2\kappa _{A}+3,\ldots ,2\kappa _{B}-1\), connect \(\circ _{A}^{i}\) to \(\circ _{A}^{i+1}\) by an *artificial adjacency edge*. Otherwise, if \(\kappa _{B} < \kappa _{A}\), for \(j = 2\kappa _{B}+1,2\kappa _{B}+3,\ldots ,2\kappa _{A}-1\), connect \(\circ _{B}^{j}\) to \(\circ _{B}^{j+1}\) by an artificial adjacency edge. All these new adjacency edges and artificial adjacency edges are added to \(E_{\text {adj}}^{A}\) and \(E_{\text {adj}}^{B}\), respectively. Finally, connect each \(\circ _{A}^{i}\), \(1\le i\le 2p_*\), by a *cap extremity edge* to each \(\circ _{B}^{j}\), \(1\le j\le 2p_*\), and denote by \(E_{\circ }\) the set of cap extremity edges.

A set \(P \subseteq E_{\circ }\) is a *capping-set* if it does not contain any pair of incident edges and is maximal. Since each cap extremity of *A* is connected to each cap extremity of *B*, the size of any (maximal) capping-set is \(2p_*\). A *capped consistent decomposition* *Q*[*S*, *P*] of \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\) is induced by a sibling-set \(S \subseteq E_\gamma\) and a (maximal) capping-set \(P \subseteq E_\circ\) and is composed of vertex disjoint cycles that cover all vertices of \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\). An example of a capped family-free relational diagram is given in Additional file 1: Figure S1-2, Appendix S1, Section (1A).

###
**Theorem 2**

*Let* \({\mathbb {P}}_{max}\) *be the set of all distinct (maximal) capping-sets from* \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\). *For each sibling-set* *S* *of* \(FFR(A,B,{\mathcal {S}}_{x})\) and \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\), *we have*

$$\begin{aligned} d_{DCJ}^{~id} (D[S])= & {} \min _{P \in {\mathbb {P}}_{max}}\{ d_{DCJ}^{~id} (Q[S,P])\}\,, \text{ and } \\ wd_{DCJ}^{~id}(D[S])= & {} \min _{P \in {\mathbb {P}}_{max}}\{ wd_{DCJ}^{~id}(Q[S,P])\}. \end{aligned}$$

###
*Proof*

Each capping-set corresponds to exactly \(p_*\) caps. In addition, all adjacencies, including the \(|\kappa _A - \kappa _B|\) artificial adjacencies between cap extremities, are part of each capped consistent decomposition. Recall that each sibling-set *S* of \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\) corresponds to a matching \(M_S\) of \({\mathcal {S}}_{x}(A,B)\). The set of capped consistent decompositions include all possible distinct decompositions induced by *S* together with one distinct element of \({\mathbb {P}}_{max}\). Theorem 1 states that the pair of matched genomes \(A^{M_S}\) and \(B^{M_S}\) can be optimally capped with \(p_*\) caps and \(|\kappa _A - \kappa _B|\) artificial adjacencies. Therefore, it is clear that \(d_{DCJ}^{~id} (D[S]) = \min _{P \in {\mathbb {P}}_{max}}\{ d_{DCJ}^{~id} (Q[S,P])\}\). Since the capping does not change the sizes of the sibling-sets and their weights and complements, it is also clear that \(wd_{DCJ}^{~id}(D[S]) = \min _{P \in {\mathbb {P}}_{max}}\{ wd_{DCJ}^{~id}(Q[S,P])\}\). \(\square\)

Given that \({\mathbb {S}}\) and \({\mathbb {P}}_{max}\) are, respectively, the sets of all sibling-sets and all maximal capping-sets of \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\), the final version of our optimization problem is

$$\begin{aligned} ffd_{DCJ}^{~id} (A, B,{\mathcal {S}}_{x}) = \min _{S \in {\mathbb {S}}, P \in {\mathbb {P}}_{max}}\big \{wd_{DCJ}^{~id}(Q[S,P])\big \}. \end{aligned}$$

#### Alternative formula for computing the indel-potential of cycles

The capped consistent decompositions of the diagram \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\) are composed exclusively of cycles, and the number of runs \(\Lambda (C)\) of a cycle *C* is always in \(\{0,1,2,4,6,\ldots \}\). Therefore, the formula to compute the indel-potential of a cycle *C* can be simplified to

$$\begin{aligned} \lambda (C) = {\left\{ \begin{array}{ll} \quad \;\, \Lambda (C)\,, &{} \text{ if } \Lambda (C)\in \{0,1\} \\ 1+\frac{\Lambda (C)}{2}\,, &{} \text{ if } \Lambda (C) \in \{2,4,6,\ldots \} \end{array}\right. } \end{aligned}$$

that can still be redesigned to a form that can be easier implemented in the ILP [16]. First, let a *transition* in a cycle *C* be an indel-free segment of *C* that is between a run in one genome and a run in the other genome and denote by \(\aleph (C)\) the number of transitions in *C*. Observe that, if *C* is indel-free, then obviously \(\aleph (C) = 0\). If *C* has a single run, then we also have \(\aleph (C) = 0\). On the other hand, if *C* has at least 2 runs, then \(\aleph (C)=\Lambda (C)\). The new formula is split into two parts. The first part is the function *r*(*C*), defined as \(r(C)=1\) if \(\Lambda (C)\ge 1\), otherwise \(r(C)=0\), that simply tests whether *C* is indel-enclosing or indel-free. The second part depends on the number of transitions \(\aleph (C)\), and the complete formula stands as follows [16]:

$$\begin{aligned} \lambda (C)=r(C) + \frac{\aleph (C)}{2}. \end{aligned}$$

#### New formula for computing the weighted distance

Note that the number of indel-enclosing components is \(\sum _{C \in Q[S,P]} r(C) = c^{r}_{Q} + s_{Q}\), where \(c^{r}_{Q}\) and \(s_{Q}\) are the number of indel-enclosing \(AB\)-cycles and the number of circular singletons in *Q*[*S*, *P*], respectively. Furthermore, the number of indel-free \(AB\)-cycles of *Q*[*S*, *P*] is \(c^{{\tilde{r}}}_{Q}=c_Q-c^{r}_{Q}\). We can now compute the values

$$\begin{aligned} d_{DCJ}^{~id} (Q[S,P])&= p_* + \frac{|S|}{2} - c_{Q} +\sum _{C \in Q[S,P]} \lambda (C)\nonumber \\&= p_* + \frac{|S|}{2} - c_{Q} +\sum _{C \in Q[S,P]} \left( r(C) + \frac{\aleph (C)}{2}\right) \nonumber \\&= p_* + \frac{|S|}{2} - c^{{\tilde{r}}}_{Q} + s_{Q} + \sum _{C \in Q[S,P]} \frac{\aleph (C)}{2}\,, \text{ and }\nonumber \\ wd_{DCJ}^{~id}(Q[S,P])&= d_{DCJ}^{~id} (Q[S,P]) + \frac{|S|}{2} - \frac{w(S)}{2} + w({\widetilde{S}}) \nonumber \\&= p_* + |S| - c^{{\tilde{r}}}_{Q} + s_{Q} + \sum _{C \in Q[S,P]} \frac{\aleph (C)}{2} - \frac{w(S)}{2}+ w({\widetilde{S}}). \end{aligned}$$

(2)

### Cutting threshold

The family-free DCJ-indel distance \(ffd_{DCJ}^{~id}\) was designed to be computed with all given pairwise similarities, i.e., with the cutting threshold \(x=0\), that leads to a “complete” family-free relational diagram. Such a diagram would be too large to be handled in practice, therefore, if \(x=0\), we consider only the similarities that are strictly greater than 0. Nevertheless, for bigger instances the diagram with similarities close to 0 might still be too large to be solved in reasonable time. Hence, for some instances it may be necessary to do a small increase of the cutting threshold. We usually adopt a small cutting threshold up to 0.3.