An efficient way to solve the family-free DCJ-indel distance is to develop an ILP that searches for its solution in a general graph, that represents all possible diagrams corresponding to all candidate matchings, in a similar way as the approaches given in [12, 15, 16]. Given two genomes A and B and their marker similarity graph \({\mathcal {S}}_{x}(A,B)\), the structure that integrates the properties of all possible weighted relational diagrams of mapped genomes is the family-free relational diagram \(FFR(A,B,{\mathcal {S}}_{x})\), that has a set V(A) with a vertex for each of the two extremities of each marker of genome A and a set V(B) with a vertex for each of the two extremities of each marker of genome B.
Again, sets \(E_{\text {adj}}^{A}\) and \(E_{\text {adj}}^{B}\) contain adjacency edges connecting adjacent extremities of markers in A and in B. But here the set \(E_{\gamma }\) contains, for each edge \(ab \in {\mathcal {S}}_{x}(A,B)\), an extremity edge connecting \(a^{\,\!t}\) to \(b^{\,\!t}\), and an extremity edge connecting \(a^{\,\!h}\) to \(b^{\,\!h}\). To both edges \(a^{\,\!t}b^{\,\!t}\) and \(a^{\,\!h}b^{\,\!h}\), that are called siblings, we assign the same weight, that corresponds to the similarity of the edge ab in \({\mathcal {S}}_{x}(A,B)\): \(w(a^{\,\!t}b^{\,\!t})=w(a^{\,\!h}b^{\,\!h})=\sigma (ab)\). Furthermore, for each marker m there is an indel edge connecting the vertices \(m^{\,\!h}\) and \(m^{\,\!t}\). The indel edge \(m^{\,\!h}m^{\,\!t}\) receives a weight \(w(m^{\,\!h}m^{\,\!t})=\max \{ \sigma (mv) | mv \in {\mathcal {S}}_{x}(A,B) \}\), that is, it is the maximum similarity among the edges incident to the marker m in \({\mathcal {S}}_{x}(A,B)\). We denote by \(E_{\text {id}}^{A}\) the set of indel edges of markers in genome A and by \(E_{\text {id}}^{B}\) the set of indel edges of markers in genome B. An example of a family-free relational diagram is given in Fig. 4.
Consistent decompositions
The diagram \(FFR(A,B,{\mathcal {S}}_{x})\) may contain vertices of degree larger than two. A decomposition of \(FFR(A,B,{\mathcal {S}}_{x})\) is a collection of vertex-disjoint components, that can be cycles and/or paths, covering all vertices of \(FFR(A,B,{\mathcal {S}}_{x})\). There can be multiple ways of selecting a decomposition, and we need to find one that allows to identify a matching of \({\mathcal {S}}_{x}(A,B)\). A set \(S \subseteq E_{\gamma }\) is a sibling-set if it is exclusively composed of pairs of siblings and does not contain any pair of incident edges. Thus, a sibling-set S of \(FFR(A,B,{\mathcal {S}}_{x})\) corresponds to a matching of \({\mathcal {S}}_{x}(A,B)\). In other words, there is a clear bijection between matchings of \({\mathcal {S}}_{x}(A,B)\) and sibling-sets of \(FFR(A,B,{\mathcal {S}}_{x})\) and we denote by \(M_S\) the matching corresponding to the sibling-set S.
The set of edges D[S] induced by a sibling-set S is said to be a consistent decomposition of \(FFR(A,B,{\mathcal {S}}_{x})\) and can be obtained as follows. In the beginning, D[S] is the union of S with the sets of adjacency edges \(E_{\text {adj}}^{A}\) and \(E_{\text {adj}}^{B}\). We then need to determine the complement of the sibling-set S, denoted by \({\widetilde{S}}\), that is composed of the indel-edges of \(FFR(A,B,{\mathcal {S}}_{x})\) that must be added to D[S]: for each indel edge e, if its two endpoints have degree one or zero in D[S], then e is added to both \({\widetilde{S}}\) and D[S]. (Note that \({\widetilde{S}}={\widetilde{M}}_S\), while \(|S| = 2|M_S|\) and \(w(S)=2w(M_S)\).) The consistent decomposition D[S] covers all vertices of \(FFR(A,B,{\mathcal {S}}_{x})\) and is composed of cycles and paths, allowing us to compute the values
$$\begin{aligned} d_{DCJ}^{~id} (D[S])= & {} \frac{|S|}{2} - c_{D} - \frac{i_{D}}{2} +\sum _{C \in D[S]} \lambda (C) -\delta _{D} \, \text{ and }\\ wd_{DCJ}^{~id}(D[S])= & {} d_{DCJ}^{~id} (D[S]) + \frac{|S|}{2}- \frac{w(S)}{2} + w({\widetilde{S}}) \,, \end{aligned}$$
where \(c_{D}\) and \(i_{D}\) are the numbers of \(AB\)-cycles and \(AB\)-paths in D[S], respectively, and \(\delta _{D}\) is the optimal deduction of recombinations of paths from D[S].
Given that \({\mathbb {S}}\) is the sets of all sibling-sets of \(FFR(A,B,{\mathcal {S}}_{x})\), we compute the family-free DCJ-indel distance of A and B with the following equation:
$$\begin{aligned} ffd_{DCJ}^{~id} (A, B, {\mathcal {S}}_{x}) = \min _{S \in {\mathbb {S}}}\{wd_{DCJ}^{~id}(D[S]) \}. \end{aligned}$$
Capping
Telomeres produce some difficulties for the decomposition of \(FFR(A,B,{\mathcal {S}}_{x})\), and a known technique to overcome this problem is called capping [3]. It consists of modifying the diagram by adding artificial markers, also called caps, whose extremities should be properly connected to the telomeres of the linear chromosomes of A and B. Therefore, usually the capping depends on the numbers \(\kappa _{A}\) and \(\kappa _{B}\), that are, respectively, the total numbers of linear chromosomes in genomes A and B.
Family-based singular genomes
First we recall the capping of family-based singular genomes. Here the caps must circularize all linear chromosomes, so that their relational diagram is composed of cycles only, but, if the capping is optimal, the DCJ-indel distance is preserved.
An optimal capping that transforms singular linear genomes A and B into singular circular genomes can be obtained after identifying the recombination groups [6]. The DCJ-indel distance is preserved by properly linking the components of each identified recombination group into a single cycle [16]. Such a capping may require some artificial adjacencies between caps. The following result is very useful.
Theorem 1
(from [16]) We can obtain an optimal capping of singular genomes A and B with exactly \(p_* = \max \{\kappa _{A},\kappa _{B}\}\) caps and \(|\kappa _{A}-\kappa _{B}|\) artificial adjacencies between caps.
Capped family-free relational diagram
The diagram \(FFR(A,B,{\mathcal {S}}_{x})\) is transformed into the capped family-free relational diagram \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\) as follows. Add to \(FFR(A,B,{\mathcal {S}}_{x})\) \(4p_*\) new vertices, named \(\circ _{A}^{1},\circ _{A}^{2},\ldots ,\circ _{A}^{2p_*}\) and \(\circ _{B}^{1}, \circ _{B}^{2},\ldots ,\circ _{B}^{2p_*}\), each one representing a cap extremity. Connect each of the \(2\kappa _{A}\) telomeres of A by an adjacency edge to a distinct cap extremity among \(\circ _{A}^{1},\circ _{A}^{2},\ldots ,\circ _{A}^{2\kappa _{A}}\). Similarly, connect each of the \(2\kappa _{B}\) telomeres of B by an adjacency edge to a distinct cap extremity among \(\circ _{B}^{1},\circ _{B}^{2},\ldots ,\circ _{B}^{2\kappa _{B}}\). Moreover, if \(\kappa _{A} < \kappa _{B}\), for \(i = 2\kappa _{A}+1,2\kappa _{A}+3,\ldots ,2\kappa _{B}-1\), connect \(\circ _{A}^{i}\) to \(\circ _{A}^{i+1}\) by an artificial adjacency edge. Otherwise, if \(\kappa _{B} < \kappa _{A}\), for \(j = 2\kappa _{B}+1,2\kappa _{B}+3,\ldots ,2\kappa _{A}-1\), connect \(\circ _{B}^{j}\) to \(\circ _{B}^{j+1}\) by an artificial adjacency edge. All these new adjacency edges and artificial adjacency edges are added to \(E_{\text {adj}}^{A}\) and \(E_{\text {adj}}^{B}\), respectively. Finally, connect each \(\circ _{A}^{i}\), \(1\le i\le 2p_*\), by a cap extremity edge to each \(\circ _{B}^{j}\), \(1\le j\le 2p_*\), and denote by \(E_{\circ }\) the set of cap extremity edges.
A set \(P \subseteq E_{\circ }\) is a capping-set if it does not contain any pair of incident edges and is maximal. Since each cap extremity of A is connected to each cap extremity of B, the size of any (maximal) capping-set is \(2p_*\). A capped consistent decomposition Q[S, P] of \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\) is induced by a sibling-set \(S \subseteq E_\gamma\) and a (maximal) capping-set \(P \subseteq E_\circ\) and is composed of vertex disjoint cycles that cover all vertices of \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\). An example of a capped family-free relational diagram is given in Additional file 1: Figure S1-2, Appendix S1, Section (1A).
Theorem 2
Let \({\mathbb {P}}_{max}\) be the set of all distinct (maximal) capping-sets from \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\). For each sibling-set S of \(FFR(A,B,{\mathcal {S}}_{x})\) and \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\), we have
$$\begin{aligned} d_{DCJ}^{~id} (D[S])= & {} \min _{P \in {\mathbb {P}}_{max}}\{ d_{DCJ}^{~id} (Q[S,P])\}\,, \text{ and } \\ wd_{DCJ}^{~id}(D[S])= & {} \min _{P \in {\mathbb {P}}_{max}}\{ wd_{DCJ}^{~id}(Q[S,P])\}. \end{aligned}$$
Proof
Each capping-set corresponds to exactly \(p_*\) caps. In addition, all adjacencies, including the \(|\kappa _A - \kappa _B|\) artificial adjacencies between cap extremities, are part of each capped consistent decomposition. Recall that each sibling-set S of \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\) corresponds to a matching \(M_S\) of \({\mathcal {S}}_{x}(A,B)\). The set of capped consistent decompositions include all possible distinct decompositions induced by S together with one distinct element of \({\mathbb {P}}_{max}\). Theorem 1 states that the pair of matched genomes \(A^{M_S}\) and \(B^{M_S}\) can be optimally capped with \(p_*\) caps and \(|\kappa _A - \kappa _B|\) artificial adjacencies. Therefore, it is clear that \(d_{DCJ}^{~id} (D[S]) = \min _{P \in {\mathbb {P}}_{max}}\{ d_{DCJ}^{~id} (Q[S,P])\}\). Since the capping does not change the sizes of the sibling-sets and their weights and complements, it is also clear that \(wd_{DCJ}^{~id}(D[S]) = \min _{P \in {\mathbb {P}}_{max}}\{ wd_{DCJ}^{~id}(Q[S,P])\}\). \(\square\)
Given that \({\mathbb {S}}\) and \({\mathbb {P}}_{max}\) are, respectively, the sets of all sibling-sets and all maximal capping-sets of \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\), the final version of our optimization problem is
$$\begin{aligned} ffd_{DCJ}^{~id} (A, B,{\mathcal {S}}_{x}) = \min _{S \in {\mathbb {S}}, P \in {\mathbb {P}}_{max}}\big \{wd_{DCJ}^{~id}(Q[S,P])\big \}. \end{aligned}$$
Alternative formula for computing the indel-potential of cycles
The capped consistent decompositions of the diagram \(FFR_{\circ }(A,B,{\mathcal {S}}_{x})\) are composed exclusively of cycles, and the number of runs \(\Lambda (C)\) of a cycle C is always in \(\{0,1,2,4,6,\ldots \}\). Therefore, the formula to compute the indel-potential of a cycle C can be simplified to
$$\begin{aligned} \lambda (C) = {\left\{ \begin{array}{ll} \quad \;\, \Lambda (C)\,, &{} \text{ if } \Lambda (C)\in \{0,1\} \\ 1+\frac{\Lambda (C)}{2}\,, &{} \text{ if } \Lambda (C) \in \{2,4,6,\ldots \} \end{array}\right. } \end{aligned}$$
that can still be redesigned to a form that can be easier implemented in the ILP [16]. First, let a transition in a cycle C be an indel-free segment of C that is between a run in one genome and a run in the other genome and denote by \(\aleph (C)\) the number of transitions in C. Observe that, if C is indel-free, then obviously \(\aleph (C) = 0\). If C has a single run, then we also have \(\aleph (C) = 0\). On the other hand, if C has at least 2 runs, then \(\aleph (C)=\Lambda (C)\). The new formula is split into two parts. The first part is the function r(C), defined as \(r(C)=1\) if \(\Lambda (C)\ge 1\), otherwise \(r(C)=0\), that simply tests whether C is indel-enclosing or indel-free. The second part depends on the number of transitions \(\aleph (C)\), and the complete formula stands as follows [16]:
$$\begin{aligned} \lambda (C)=r(C) + \frac{\aleph (C)}{2}. \end{aligned}$$
New formula for computing the weighted distance
Note that the number of indel-enclosing components is \(\sum _{C \in Q[S,P]} r(C) = c^{r}_{Q} + s_{Q}\), where \(c^{r}_{Q}\) and \(s_{Q}\) are the number of indel-enclosing \(AB\)-cycles and the number of circular singletons in Q[S, P], respectively. Furthermore, the number of indel-free \(AB\)-cycles of Q[S, P] is \(c^{{\tilde{r}}}_{Q}=c_Q-c^{r}_{Q}\). We can now compute the values
$$\begin{aligned} d_{DCJ}^{~id} (Q[S,P])&= p_* + \frac{|S|}{2} - c_{Q} +\sum _{C \in Q[S,P]} \lambda (C)\nonumber \\&= p_* + \frac{|S|}{2} - c_{Q} +\sum _{C \in Q[S,P]} \left( r(C) + \frac{\aleph (C)}{2}\right) \nonumber \\&= p_* + \frac{|S|}{2} - c^{{\tilde{r}}}_{Q} + s_{Q} + \sum _{C \in Q[S,P]} \frac{\aleph (C)}{2}\,, \text{ and }\nonumber \\ wd_{DCJ}^{~id}(Q[S,P])&= d_{DCJ}^{~id} (Q[S,P]) + \frac{|S|}{2} - \frac{w(S)}{2} + w({\widetilde{S}}) \nonumber \\&= p_* + |S| - c^{{\tilde{r}}}_{Q} + s_{Q} + \sum _{C \in Q[S,P]} \frac{\aleph (C)}{2} - \frac{w(S)}{2}+ w({\widetilde{S}}). \end{aligned}$$
(2)
Cutting threshold
The family-free DCJ-indel distance \(ffd_{DCJ}^{~id}\) was designed to be computed with all given pairwise similarities, i.e., with the cutting threshold \(x=0\), that leads to a “complete” family-free relational diagram. Such a diagram would be too large to be handled in practice, therefore, if \(x=0\), we consider only the similarities that are strictly greater than 0. Nevertheless, for bigger instances the diagram with similarities close to 0 might still be too large to be solved in reasonable time. Hence, for some instances it may be necessary to do a small increase of the cutting threshold. We usually adopt a small cutting threshold up to 0.3.