Investigating the complexity of the double distance problems

Background Two genomes $\mathbb{A}$ and $\mathbb{B}$ over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by $n_*$ the number of common families of $\mathbb{A}$ and $\mathbb{B}$. Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Let $c_i$ and $p_j$ be respectively the numbers of cycles of length $i$ and of paths of length $j$ in the breakpoint graph of genomes $\mathbb{A}$ and $\mathbb{B}$. Then, the breakpoint distance of $\mathbb{A}$ and $\mathbb{B}$ is equal to $n_*-\left(c_2+\frac{p_0}{2}\right)$. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance of $\mathbb{A}$ and $\mathbb{B}$ is $n_*-\left(c+\frac{p_e}{2}\right)$, where $c$ is the total number of cycles and $p_e$ is the total number of paths of even length.
Motivation The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a $\sigma_k$ distance, defined to be $n_*-\left(c_2+c_4+\ldots+c_k+\frac{p_0+p_2+\ldots+p_{k-2}}{2}\right)$, and successively investigate the complexities of median and double distance for the $\sigma_4$ distance, then the $\sigma_6$ distance, and so on.
Results While much effort has been devoted to the median problem in our and in other research groups, no progress has been obtained even for the $\sigma_4$ distance. For the double distance under the $\sigma_4$ and $\sigma_6$ distances, however, we could devise linear time algorithms, which we present here.
Supplementary Information The online version contains supplementary material available at 10.1186/s13015-023-00246-y.


Introduction
In genome comparison, the most elementary problem is that of computing a distance between two given genomes [11], each one being a set of chromosomes. Usually a high-level view of a chromosome is adopted, in which each chromosome is represented by a sequence of oriented genes and the genes are classified into families. The simplest model in this setting is the breakpoint model, whose distance quantifies the distinct adjacencies between the two genomes, an adjacency in a genome being the oriented neighborhood between two genes in one of its chromosomes [12]. Other models rely on large-scale genome rearrangements, such as inversions, translocations, fusions and fissions, yielding distances that correspond to the minimum number of rearrangements required to transform one genome into another [8,9,13].
Independently of the underlying model, the distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction [12]. The median problem, for example, has three genomes as input and asks for an ancestor genome that minimizes the sum of its distances to the three given genomes. Other models are related to the whole genome duplication (WGD) event [7]. Let the doubling of a genome duplicate each of its chromosomes. The double distance is the problem that has a duplicated genome and a singular genome as input and computes the distance between the former and a doubling of the latter. The halving problem has a duplicated genome as input and asks for a singular genome whose double distance to the given duplicated genome is minimized. Finally, the guided halving problem has a duplicated and a singular genome as input and asks for another singular genome that minimizes the sum of its double distance to the given duplicated genome and its distance to the given singular genome.
Our study relies on the breakpoint graph, a structure that represents the relation between two given genomes [2]. When the two genomes are over the same set of gene families and form a canonical pair, that is, when each of them has exactly one gene from each family, their breakpoint graph is a collection of cycles of even length and paths. Assuming that both genomes have $n_*$ genes, if we call $k$-cycle a cycle of length $k$ and $k$-path a path of length $k$, the corresponding breakpoint distance is equal to $n_* - \left(c_2 + \frac{p_0}{2}\right)$, where $c_2$ is the number of 2-cycles and $p_0$ is the number of 0-paths [12]. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join […]

arXiv:2303.04205v3 [cs.DS] 1 Apr 2023

A chromosome is an oriented DNA molecule and can be either linear or circular. We represent a chromosome by its sequence of genes, where each gene is an oriented DNA fragment. We assume that each gene belongs to a family, which is a set of homologous genes. A gene that belongs to a family $X$ is represented by the symbol $X$ itself if it is read in forward orientation or by the symbol $\overline{X}$ if it is read in reverse orientation. For example, the sequences $[1\ \overline{3}\ 2]$ and $(4)$ represent, respectively, a linear chromosome (flanked by square brackets) and a circular chromosome (flanked by parentheses), both shown in Figure 1, the first composed of three genes and the second composed of a single gene. Note that if a sequence $s$ represents a chromosome $K$, then $K$ can be equally represented by the reverse complement of $s$, denoted by $\overline{s}$, obtained by reversing the order and the orientation of the genes in $s$. Moreover, if $K$ is circular, it can be equally represented by any circular rotation of $s$ and $\overline{s}$. Recall that a gene is an occurrence of a family, therefore distinct genes from the same family are represented by the same symbol. We can also represent a gene from family $X$ by referring to its extremities $X^h$ (head) and $X^t$ (tail). The adjacencies in a chromosome are the neighboring extremities of distinct genes. The remaining extremities, which are at the ends of linear chromosomes, are telomeres. In the linear chromosome $[1\ \overline{3}\ 2]$, the adjacencies are $\{1^h3^h, 3^t2^t\}$ and the telomeres are $\{1^t, 2^h\}$. Note that an adjacency has no orientation, that is, an adjacency between extremities $1^h$ and $3^h$ can be equally represented by $1^h3^h$ and by $3^h1^h$. In the particular case of a single-gene circular chromosome, e.g. $(4)$, an adjacency exceptionally occurs between the extremities of the same gene (here $4^h4^t$).
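The definitions of extremities, adjacencies and telomeres can be illustrated with a minimal Python sketch. The encoding is an assumption made only for this illustration: a gene is a signed integer (negative for reverse orientation), an extremity is a string such as `"3h"`, and the helper names `extremities` and `adjacencies_and_telomeres` are hypothetical.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene as read in the chromosome."""
    family = abs(gene)
    tail, head = f"{family}t", f"{family}h"
    # A forward gene is read tail -> head, a reversed gene head -> tail.
    return (tail, head) if gene > 0 else (head, tail)


def adjacencies_and_telomeres(genes, circular=False):
    """Return (adjacencies, telomeres) of one non-empty chromosome.

    Adjacencies are frozensets because an adjacency has no orientation.
    """
    ends = [extremities(g) for g in genes]
    adjacencies = [
        frozenset((ends[i][1], ends[i + 1][0])) for i in range(len(ends) - 1)
    ]
    if circular:
        # Close the circle: the last right extremity meets the first left one.
        adjacencies.append(frozenset((ends[-1][1], ends[0][0])))
        telomeres = []
    else:
        telomeres = [ends[0][0], ends[-1][1]]
    return adjacencies, telomeres


# The two chromosomes used as examples in the text: [1 -3 2] and (4).
linear = adjacencies_and_telomeres([1, -3, 2])
circular = adjacencies_and_telomeres([4], circular=True)
```

For the linear chromosome this yields the adjacencies $1^h3^h$ and $3^t2^t$ with telomeres $1^t$ and $2^h$, and for the single-gene circular chromosome the adjacency $4^h4^t$.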
A genome is then a multiset of chromosomes and we denote by F(G) the set of gene families that occur in genome G. In addition, we denote by A(G) the multiset of adjacencies and by T(G) the multiset of telomeres that occur in G. A genome S is called singular if each gene family occurs exactly once in S. Similarly, a genome D is called duplicated if each gene family occurs exactly twice in D. The two occurrences of a family in a duplicated genome are called paralogs. A doubled genome is a special type of duplicated genome in which each adjacency or telomere occurs exactly twice. These two copies of the same adjacency (respectively same telomere) in a doubled genome are called paralogous adjacencies (respectively paralogous telomeres). Observe that distinct doubled genomes with circular chromosomes can have exactly the same adjacencies and telomeres, as we show in Table 1, where we also give examples of singular and duplicated genomes.
Table 1. Examples of a singular, a duplicated and two doubled genomes, with their sets of families and their multisets of adjacencies. Note that the doubled genomes $B_1$ and $B_2$ have exactly the same adjacencies and telomeres.
[Table body not recoverable. Surviving row labels: singular genome (each family occurs once); doubled genomes (each adjacency or telomere occurs twice).]

Comparing canonical genomes
Two genomes $S_1$ and $S_2$ are said to be a canonical pair when they are singular and have the same gene families, that is, $F(S_1) = F(S_2)$. Denote by $F_*$ the set of families occurring in canonical genomes $S_1$ and $S_2$, and by $n_* = |F_*|$ its cardinality. For example, genomes […]

Breakpoint graph. The relation between two canonical genomes $S_1$ and $S_2$ can be represented by their breakpoint graph $BG(S_1, S_2) = (V, E)$, which is a multigraph representing the adjacencies of $S_1$ and $S_2$ [2]. The vertex set $V$ comprises, for each family $X$ in $F_*$, one vertex for the extremity $X^h$ and one vertex for the extremity $X^t$. The edge multiset $E$ represents the adjacencies. For each adjacency in $S_1$ there exists one $S_1$-edge in $E$ linking its two extremities. Similarly, for each adjacency in $S_2$ there exists one $S_2$-edge in $E$ linking its two extremities. Clearly, $BG(S_1, S_2)$ can be constructed in $O(n_*)$ time.
The degree of each vertex can be 0, 1 or 2 and each connected component alternates between $S_1$- and $S_2$-edges. As a consequence, the components of the breakpoint graph of canonical genomes can be cycles of even length or paths. An even path has one endpoint in $S_1$ ($S_1$-telomere) and the other in $S_2$ ($S_2$-telomere), while an odd path has either both endpoints in $S_1$ or both endpoints in $S_2$. A vertex that is a telomere neither in $S_1$ nor in $S_2$ is said to be non-telomeric. In the breakpoint graph a non-telomeric vertex has degree 2. We call $i$-cycle a cycle of length $i$ and $j$-path a path of length $j$. We also denote by $c_i$ the number of $i$-cycles, by $p_j$ the number of $j$-paths, by $c$ the total number of cycles and by $p_e$ the total number of even paths. Since the number of telomeres in each genome is even (2 telomeres per linear chromosome), the total number of even paths in the breakpoint graph must be even. An example of a breakpoint graph is given in Figure 2.
Breakpoint distance. For canonical genomes $S_1$ and $S_2$ the breakpoint distance, denoted by $d_{bp}$, is defined as follows [12]:

$$d_{bp}(S_1, S_2) = n_* - |A(S_1) \cap A(S_2)| - \frac{|T(S_1) \cap T(S_2)|}{2}.$$

For […] and the set of common telomeres is $T(S_1) \cap T(S_2) = \{4^t\}$, giving $d_{bp}(S_1, S_2) = 2.5$. Since a common adjacency of $S_1$ and $S_2$ corresponds to a 2-cycle and a common telomere corresponds to a 0-path in $BG(S_1, S_2)$, the breakpoint distance can be rewritten as

$$d_{bp}(S_1, S_2) = n_* - \left(c_2 + \frac{p_0}{2}\right).$$

DCJ distance. Given a genome, a double cut and join (DCJ) is the operation that breaks two of its adjacencies or telomeres and rejoins the open extremities in a different way [13]. For example, consider the chromosome $K = [1\ 2\ 3\ 4]$ and a DCJ that cuts $K$ between genes 1 and 2 and between genes 3 and 4, creating segments $1\bullet$, […]

The class of $\sigma_k$ distances. Given the breakpoint graph of two canonical genomes $S_1$ and $S_2$, for $k \in \{2, 4, 6, \ldots, \infty\}$, we denote by $\sigma_k$ the cumulative sums

$$\sigma_k = c_2 + c_4 + \ldots + c_k + \frac{p_0 + p_2 + \ldots + p_{k-2}}{2}.$$

Then the $\sigma_k$ distance of $S_1$ and $S_2$ is defined to be [6]:

$$d_{\sigma_k}(S_1, S_2) = n_* - \sigma_k.$$

It is easy to see that the $\sigma_2$ distance equals the breakpoint distance and that the $\sigma_\infty$ distance equals the DCJ distance, and that the distance decreases monotonically between these two extremes. Moreover, the $\sigma_k$ distance of two genomes that form a canonical pair can easily be computed in linear time for any $k \geq 2$.
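The computation of a $\sigma_k$ distance from the component counts can be sketched as follows. This is a minimal illustration under assumptions that are not part of the paper: edges are given as pairs of extremity strings, components are found with a simple union-find (near-linear rather than strictly linear time), and the function names `component_counts` and `sigma_k_distance` are hypothetical. The example genomes $[1\ 2\ 3]$ and $[1\ 2\ \overline{3}]$ are chosen here for illustration only.

```python
import math
from collections import Counter


def component_counts(vertices, edges1, edges2):
    """Count cycles and paths of the breakpoint graph by length (number of edges)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    all_edges = list(edges1) + list(edges2)
    for u, v in all_edges:
        parent[find(u)] = find(v)

    n_vertices = Counter(find(v) for v in vertices)
    n_edges = Counter(find(u) for u, _ in all_edges)
    cycles, paths = Counter(), Counter()
    for root, nv in n_vertices.items():
        ne = n_edges[root]
        # With maximum degree 2, a component with as many edges as vertices is a cycle.
        (cycles if ne == nv else paths)[ne] += 1
    return cycles, paths


def sigma_k_distance(n, cycles, paths, k=math.inf):
    """n minus (cycles of length <= k, plus half the even paths of length <= k - 2)."""
    score = sum(c for length, c in cycles.items() if length <= k)
    score += sum(p for length, p in paths.items()
                 if length % 2 == 0 and length <= k - 2) / 2
    return n - score


vertices = ["1t", "1h", "2t", "2h", "3t", "3h"]
s1_edges = [("1h", "2t"), ("2h", "3t")]  # adjacencies of S1 = [1 2 3]
s2_edges = [("1h", "2t"), ("2h", "3h")]  # adjacencies of S2 = [1 2 -3]
cycles, paths = component_counts(vertices, s1_edges, s2_edges)
d_bp = sigma_k_distance(3, cycles, paths, k=2)   # sigma_2 (breakpoint) distance
d_dcj = sigma_k_distance(3, cycles, paths)       # sigma_infinity (DCJ) distance
```

In this toy instance the graph has one 2-cycle, one 0-path and one 2-path, so the breakpoint distance is 1.5 and the DCJ distance is 1, consistent with a single inversion of gene 3.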

Comparing a singular and a duplicated genome
Let S be a singular genome and D be a duplicated genome over the same $n_*$ gene families, that is, $F(S) = F(D)$ and $n_* = |F(S)| = |F(D)|$. The number of genes in D is twice the number of genes in S and we need to equalize the contents of these genomes before searching for common adjacencies and common telomeres of S and D or transforming one genome into the other with DCJ operations. This can be done by doubling S, with a rearrangement operation mimicking a whole genome duplication: it simply consists of doubling each adjacency and each telomere of S. However, when S has one or more circular chromosomes, it is not possible to find a unique layout of its chromosomes after the doubling: indeed, each circular chromosome can be doubled into two identical circular chromosomes, or the two copies can be concatenated to each other in a single circular chromosome. Therefore, in general the doubling of a genome S results in a set of doubled genomes denoted by 2S. Note that $|2S| = 2^r$, where $r$ is the number of circular chromosomes in S. For example, if […] (see Table 1). All genomes in 2S have exactly the same multisets of adjacencies and of telomeres, therefore we can use a special notation for these multisets: $A(2S) = A(S) \cup A(S)$ and $T(2S) = T(S) \cup T(S)$.
Each family in a duplicated genome can be ${}^{a}_{b}$-singularized by adding the index $a$ to one of its occurrences and the index $b$ to the other. A duplicated genome can be entirely singularized if each of its families is singularized. Let $S^{a}_{b}(D)$ be the set of all possible genomes obtained by all distinct ways of ${}^{a}_{b}$-singularizing the duplicated genome D. Similarly, we denote by $S^{a}_{b}(2S)$ the set of all possible genomes obtained by all distinct ways of ${}^{a}_{b}$-singularizing each doubled genome in the set 2S.
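The class of $\sigma_k$ double distances. The class of $\sigma_k$ double distances of a singular genome S and a duplicated genome D for $k = 2, 4, 6, \ldots$ is defined as follows:

$$d^2_{\sigma_k}(S, D) = \min_{B \in S^{a}_{b}(2S)} \{ d_{\sigma_k}(B, \check{D}) \},$$

where $\check{D}$ is any genome in $S^{a}_{b}(D)$.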
$\sigma_2$ (breakpoint) double distance. The breakpoint double distance of S and D, denoted by $d^2_{bp}(S, D)$, is equivalent to the $\sigma_2$ double distance. For this case the solution can be found easily with a greedy algorithm [12]: each adjacency or telomere of D that occurs in S can be fulfilled. If an adjacency or telomere that occurs twice in D also occurs in S, it can be fulfilled twice in any genome from 2S. Then,

$$d^2_{bp}(S, D) = 2n_* - |A(2S) \cap A(D)| - \frac{|T(2S) \cap T(D)|}{2}.$$

$\sigma_\infty$ (DCJ) double distance. For the DCJ double distance, which is equivalent to the $\sigma_\infty$ double distance, the solution space cannot be explored greedily. In fact, computing the DCJ double distance of genomes S and D was proven to be an NP-hard problem [12].
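The greedy computation of the breakpoint double distance amounts to two multiset intersections. The sketch below assumes that the distance has the form $2n_* - |A(2S) \cap A(D)| - |T(2S) \cap T(D)|/2$, which is what the breakpoint distance gives for genomes with $2n_*$ genes; the function name `breakpoint_double_distance` and the encoding of adjacencies as frozensets of extremity strings are hypothetical choices made for this illustration.

```python
from collections import Counter


def breakpoint_double_distance(n, adj_s, tel_s, adj_d, tel_d):
    """Breakpoint double distance of a singular S and a duplicated D.

    n is the number of gene families; adjacencies are frozensets of two
    extremities and telomeres are single extremities (no paralog index).
    """
    # Doubling S doubles every adjacency and every telomere.
    adj_2s = Counter(adj_s) + Counter(adj_s)
    tel_2s = Counter(tel_s) + Counter(tel_s)
    # Counter '&' is the multiset intersection (minimum of multiplicities).
    common_adj = sum((adj_2s & Counter(adj_d)).values())
    common_tel = sum((tel_2s & Counter(tel_d)).values())
    return 2 * n - common_adj - common_tel / 2


# Illustrative instance (an assumption of this sketch): S = [1 2], D = [1 2 1 2].
adj_s = [frozenset(("1h", "2t"))]
tel_s = ["1t", "2h"]
adj_d = [frozenset(("1h", "2t")), frozenset(("2h", "1t")), frozenset(("1h", "2t"))]
tel_d = ["1t", "2h"]
distance = breakpoint_double_distance(2, adj_s, tel_s, adj_d, tel_d)
```

Here both copies of the adjacency $1^h2^t$ and both telomeres of D are fulfilled, so the distance is $4 - 2 - 1 = 1$.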
The complexity of $\sigma_k$ double distances. The exploration of the complexity space between the greedy linear time $\sigma_2$ (breakpoint) double distance and the NP-hard $\sigma_\infty$ (DCJ) double distance is the main motivation of this study. In the remainder of this paper we show that both $\sigma_4$ and $\sigma_6$ double distances can be solved in linear time.

Equivalence of $\sigma_k$ double distance and $\sigma_k$ disambiguation
A convenient way of representing the solution space of the $\sigma_k$ double distance is by using a modified version of the breakpoint graph [12].

Ambiguous breakpoint graph
Given a singular genome S and a duplicated genome D, their ambiguous breakpoint graph ABG(S, Ď) = (V, E) is a multigraph representing the adjacencies of any element in $S^{a}_{b}(2S)$ and a genome $\check{D} \in S^{a}_{b}(D)$. The vertex set V comprises, for each family $X$ in $F(S)$, the two pairs of paralogous vertices $X^h_a, X^h_b$ and $X^t_a, X^t_b$. We can use the notation $\hat{u}$ to refer to the paralogous counterpart of a vertex $u$. For example, if $u = X^h_a$, then $\hat{u} = X^h_b$. The edge set E represents the adjacencies. For each adjacency in Ď there exists one Ď-edge in E linking its two extremities. The S-edges represent all adjacencies occurring in all genomes from $S^{a}_{b}(2S)$: for each adjacency $\gamma\beta$ of S, we have the pair of paralogous edges $E(\gamma\beta) = \{\gamma_a\beta_a, \gamma_b\beta_b\}$ and the complementary pair of paralogous edges $\widetilde{E}(\gamma\beta) = \{\gamma_a\beta_b, \gamma_b\beta_a\}$. Together they form the square $Q(\gamma\beta) = E(\gamma\beta) \cup \widetilde{E}(\gamma\beta)$. The S-edges in the ambiguous breakpoint graph are therefore the squares of all adjacencies in S. Let $a_*$ be the number of squares in ABG(S, Ď). We have $a_* = |A(S)| = n_* - \kappa(S)$, where $\kappa(S)$ is the number of linear chromosomes in S. Again, we can use the notation $\hat{e}$ to refer to the paralogous counterpart of an S-edge $e$. For example, if $e = \gamma_a\beta_a$, then $\hat{e} = \gamma_b\beta_b$. An example of an ambiguous breakpoint graph is shown in Figure 3 (i).
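The S-edges of the ambiguous breakpoint graph can be generated directly from the adjacencies of S. In the following minimal sketch a vertex is a pair `(extremity, index)` with index `"a"` or `"b"`, and each square is returned as a pair of edge lists: the paralogous edges linking equal indices and the complementary edges linking distinct indices. The names `square` and `squares_of` are hypothetical, and the example genome $S = [1\ 2\ 3]$ is chosen for illustration.

```python
def square(gamma, beta):
    """Return the two pairs of paralogous S-edges for the adjacency gamma-beta of S."""
    same_index = [((gamma, "a"), (beta, "a")), ((gamma, "b"), (beta, "b"))]
    mixed_index = [((gamma, "a"), (beta, "b")), ((gamma, "b"), (beta, "a"))]
    return same_index, mixed_index


def squares_of(adjacencies):
    """One square per adjacency of S, hence n* - kappa(S) squares in total."""
    return [square(gamma, beta) for gamma, beta in adjacencies]


# S = [1 2 3] has n* = 3 families and one linear chromosome, hence 2 squares.
squares = squares_of([("1h", "2t"), ("2h", "3t")])
```

Resolving a square then simply means keeping one of its two edge lists.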
Each linear chromosome in S corresponds to four telomeres, called S-telomeres, in any element of 2S. These four vertices are not part of any square. In other words, the number of S-telomeres in ABG(S, Ď) is $4\kappa(S)$. If $\kappa(D)$ is the number of linear chromosomes in D, the number of telomeres in Ď, also called Ď-telomeres, is $2\kappa(D)$.

The class of $\sigma_k$ disambiguations
Resolving a square $Q(\cdot) = E(\cdot) \cup \widetilde{E}(\cdot)$ corresponds to choosing in the ambiguous breakpoint graph either the edges from $E(\cdot)$ or the edges from $\widetilde{E}(\cdot)$, while the complementary pair is masked. Resolving all squares is called disambiguating the ambiguous breakpoint graph. If we number the squares of ABG(S, Ď) from 1 to $a_*$, a solution can be represented by a tuple $\tau = (L_1, L_2, \ldots, L_{a_*})$, where each $L_i$ contains the pair of paralogous edges (either $E_i$ or $\widetilde{E}_i$) that are chosen (kept) in the graph for square $Q_i$. The graph induced by $\tau$ is a simple breakpoint graph, which we denote by BG(τ, Ď). Figure 3 (ii) shows an example.

Fig. 3 (caption fragment). […] resulting in one 2-cycle, two 0-paths, one 2-path and one 4-path. This is also the breakpoint graph of Ď and […]. In both (i) and (ii), vertex types are distinguished by colors: telomeres in S are marked in blue, telomeres in Ď are marked in gray, telomeres in both S and Ď are marked in purple and non-telomeric vertices are white.
Given a solution $\tau$, let $c_i$ and $p_j$ be, respectively, the number of cycles of length $i$ and of paths of length $j$ in BG(τ, Ď). The $k$-score of $\tau$ is then the sum

$$\sigma_k = c_2 + c_4 + \ldots + c_k + \frac{p_0 + p_2 + \ldots + p_{k-2}}{2}.$$

The minimization problem of computing the $\sigma_k$ double distance of S and D is equivalent to finding a solution $\tau$ so that the $k$-score of $\tau$ is maximized [12]. We call the latter (maximization) problem $\sigma_k$ disambiguation. As already mentioned, for $\sigma_2$ the double distance can be solved in linear time and for $\sigma_\infty$ the double distance is NP-hard. Therefore the same is true, respectively, for the $\sigma_2$ and the $\sigma_\infty$ disambiguations. Conversely, if we determine the complexity of solving the $\sigma_k$ disambiguation for any $k \geq 4$, this will automatically determine the complexity of solving the $\sigma_k$ double distance.
An optimal solution for the $\sigma_k$ disambiguation of ABG(S, Ď) gives its $k$-score, denoted by $\sigma_k(ABG(S, \check{D}))$. Note that, since an optimal solution of the $\sigma_k$ disambiguation is also a solution of the $\sigma_{k+2}$ disambiguation, although possibly not an optimal one, the $k$-score of ABG(S, Ď) cannot decrease as $k$ increases.
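For very small instances the $\sigma_k$ disambiguation can be checked by brute force, enumerating all $2^{a_*}$ ways of resolving the squares and keeping the best $k$-score. The sketch below is exponential and is meant only as a reference for testing, not as the linear time approach developed in this paper. The names `k_score` and `best_k_score`, the vertex encoding `(extremity, index)` and the toy instance (S = [1 2], Ď = [1a 2a 1b 2b]) are assumptions of this illustration.

```python
import math
from collections import Counter
from itertools import product


def k_score(vertices, edges, k):
    """k-score of a breakpoint graph given by its vertices and its list of edges."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    n_vertices = Counter(find(v) for v in vertices)
    n_edges = Counter(find(u) for u, _ in edges)
    score = 0.0
    for root, nv in n_vertices.items():
        ne = n_edges[root]
        if ne == nv and ne <= k:                        # cycle of length at most k
            score += 1
        elif ne < nv and ne % 2 == 0 and ne <= k - 2:   # even path of length at most k - 2
            score += 0.5
    return score


def best_k_score(vertices, squares, d_edges, k):
    """Maximum k-score over all ways of resolving the squares (exponential time)."""
    best = 0.0
    for tau in product((0, 1), repeat=len(squares)):
        s_edges = [e for sq, choice in zip(squares, tau) for e in sq[choice]]
        best = max(best, k_score(vertices, s_edges + list(d_edges), k))
    return best


vertices = [(x, i) for x in ("1t", "1h", "2t", "2h") for i in "ab"]
squares = [(
    [(("1h", "a"), ("2t", "a")), (("1h", "b"), ("2t", "b"))],  # paralogous pair
    [(("1h", "a"), ("2t", "b")), (("1h", "b"), ("2t", "a"))],  # complementary pair
)]
d_edges = [(("1h", "a"), ("2t", "a")), (("2h", "a"), ("1t", "b")),
           (("1h", "b"), ("2t", "b"))]
scores = [best_k_score(vertices, squares, d_edges, k) for k in (2, 4, math.inf)]
```

In this instance the first choice induces two 2-cycles and two 0-paths, while the complementary choice induces a single 4-cycle, so the optimal score is the same for every $k$ and, as expected, never decreases as $k$ grows.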
Approach for solving the $\sigma_k$ disambiguation. A player of the $\sigma_k$ disambiguation is either a valid cycle whose length is at most $k$ or a valid even path whose length is at most $k - 2$. In order to solve the $\sigma_k$ disambiguation, a natural approach is to visit ABG(S, Ď) and search for players. To describe how the graph can be screened, we need to introduce the following concepts. Two S-edges in ABG(S, Ď) are incompatible when they belong to the same square and are not paralogous. A component in ABG(S, Ď) is valid when it does not contain any pair of incompatible edges. Note that a valid component necessarily alternates S-edges and Ď-edges. Two valid components $C \neq C'$ in ABG(S, Ď) are either intersecting, when they share at least one vertex, or disjoint. Any solution $\tau$ of ABG(S, Ď) is composed of disjoint valid components.
Given a solution $\tau = (L_1, L_2, \ldots, L_i, \ldots, L_{a_*})$, the switching operation of the $i$-th element of $\tau$ is denoted by $s(\tau, i)$ and replaces the value $L_i$ by its complement $\widetilde{L}_i$, resulting in $\tau' = (L_1, L_2, \ldots, \widetilde{L}_i, \ldots, L_{a_*})$. A choice of paralogous edges resolving a given square $Q_i$ can be fixed for any solution, meaning that $Q_i$ can no longer be switched. In this case, $Q_i$ is itself said to be fixed.

First steps to solve the $\sigma_k$ disambiguation
In this section we describe a greedy linear time algorithm for the $\sigma_4$ disambiguation and give some general results related to any $\sigma_k$ disambiguation.

Common adjacencies and telomeres are conserved
Let $\tau$ be an optimal solution for the $\sigma_k$ disambiguation of ABG(S, Ď). If a player $C \in BG(\tau, \check{D})$ is disjoint from any player distinct from $C$ in any other optimal solution, then $C$ must be part of all optimal solutions and is itself said to be optimal.
Lemma 1. For any $\sigma_k$ disambiguation, all existing 0-paths and 2-cycles in ABG(S, Ď) are optimal.
Proof. While any 0-path is an isolated vertex and obviously optimal, the optimality of every 2-cycle is less obvious but still holds, as illustrated in Figure 4. This lemma is a generalization of the (breakpoint) $\sigma_2$ disambiguation and guarantees that all common adjacencies and telomeres are conserved in any $\sigma_k$ double distance, including the NP-hard (DCJ) $\sigma_\infty$ case. All 0-paths are isolated vertices that are not part of any square, therefore they are selected independently of the choices for resolving the squares. A 2-cycle, in turn, always includes one S-edge from some square (such as square 1 in Figure 3). From now on we assume that squares that have at least one S-edge in a 2-cycle are fixed so that all existing 2-cycles are induced.

Symmetric squares can be fixed arbitrarily
Let a symmetric square in ABG(S, Ď) either (i) have a Ď-edge connecting a pair of paralogous vertices, or (ii) have Ď-telomeres in one pair of paralogous vertices, or (iii) have Ď-edges directly connected to S-telomeres incident to one pair of paralogous vertices, as illustrated in Figure 5. Note that, for any $\sigma_k$ disambiguation, the two ways of resolving each of these squares would lead to solutions with the same score, therefore each of them can be fixed arbitrarily. From now on we assume that ABG(S, Ď) has no symmetric squares.

A linear time greedy algorithm for the $\sigma_4$ disambiguation
Unlike 2-cycles, two valid 4-cycles can intersect with each other. But, since our graph is free of symmetric squares, two valid 2-paths cannot intersect with each other. Moreover, since a 2-path has no Ď-edge connecting squares, a 4-cycle and a 2-path cannot intersect with each other. In this setting, it is clear that, for the $\sigma_4$ disambiguation, any valid 2-path is always optimal. Furthermore, a 4-cycle that does not intersect with another one is always optimal and two intersecting 4-cycles are always part of two co-optimal solutions:
Lemma 2. Any valid 4-cycle that is disjoint from a 2-cycle in ABG(S, Ď) is induced by an optimal solution of the $\sigma_4$ disambiguation.
Proof. All possible patterns are represented in Figure 6: a valid 4-cycle $C$ (in the center) connecting two squares and the three distinct possibilities of linking the four open ends. In all cases the valid 4-cycle $C$ is either optimal or co-optimal.

Fig. 6. Illustration of the co-optimality of every valid 4-cycle not intersecting a 2-cycle in the $\sigma_4$ disambiguation. In each of these pictures, each gray path is necessarily odd with length at least one and alternates Ď- and S-edges. Furthermore, the 4-cycle $C = (u v w z)$ is displayed in the center, induced by blue edges. In (i) it is easy to see that any optimal solution is induced by the blue edges and includes, besides the cycle $C$, cycles $(\hat{u} \ldots v)$ and $(\hat{w} \ldots \hat{z})$. In (ii) an optimal solution includes 4-cycle $C$ and cycle $C' = (\hat{u} v \ldots \hat{w} \hat{z} \ldots)$. If the connection between $v$ and $\hat{w}$ is a single edge, then another optimal solution is induced by the red edges, including 4-cycle $D = (u v \hat{w} z)$ and cycle $D' = (v \hat{u} \ldots \hat{z} w)$. And if additionally the connection between $\hat{u}$ and $\hat{z}$ is a single edge, then both $C'$ and $D'$ are also 4-cycles. In (iii) any optimal solution is induced by the blue edges and includes 4-cycle $C$ and cycle $(\hat{u} v \ldots \hat{z} \hat{w} \ldots)$, which is also a 4-cycle when the connections between $v$ and $\hat{z}$ and between $\hat{u}$ and $\hat{w}$ are single edges.
An optimal solution of the $\sigma_4$ disambiguation can then be obtained greedily: after fixing squares containing edges that are part of 2-cycles, traverse the remainder of the graph and, for each valid 2-path or 4-cycle $C$ that is found, fix the square(s) containing S-edges that are part of $C$, so that $C$ is induced. When this part is accomplished the remaining squares can be fixed arbitrarily.

Pruning ABG(S, Ď) for the $\sigma_6$ disambiguation
A player in the $\sigma_6$ disambiguation can be either a {2,4}-path, that is, a valid 2- or 4-path, or a {4,6}-cycle, that is, a valid 4- or 6-cycle. It is easy to see that players can intersect with each other. Moreover, for the $\sigma_6$ disambiguation, not every player is induced by at least one optimal solution. For that reason, a greedy algorithm does not work here and a more elaborate procedure is required. The first step is a linear time preprocessing of ABG(S, Ď) that first removes all edges that are incompatible with the existing 2-cycles, and then removes all remaining edges that cannot be part of a player. This results in a {6}-pruned ambiguous breakpoint graph PG(S, Ď).
The first step is easily achieved by a simple graph traversal in which for each Ď-edge $uv$ it is tested whether both ends connect to the same S-edge $uv$. If this is the case, the two incident S-edges $u\hat{v}$ and $v\hat{u}$ are removed from the graph, separating the 2-cycle $(uv)$. Then, in the second step, for any remaining edge $e$, its 6-neighborhood (which has constant size in a graph of degree at most three) is exhaustively explored for the existence of a player involving $e$. If no such player is found, $e$ is deleted. Each of these two steps clearly takes linear time $O(|ABG(S, \check{D})|)$, and what remains is exactly the desired graph PG(S, Ď).
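The first pruning step can be sketched in a few lines. The sketch assumes that the S-edges to be removed around a 2-cycle $(uv)$ are those linking $u$ to the paralog of $v$ and $v$ to the paralog of $u$, that is, the two edges of the same square that are incompatible with the S-edge $uv$. Vertices are encoded as pairs `(extremity, index)`, and the names `paralog` and `remove_edges_incompatible_with_2cycles` are hypothetical.

```python
def paralog(vertex):
    """Return the paralogous counterpart of a vertex (extremity, index)."""
    extremity, index = vertex
    return (extremity, "b" if index == "a" else "a")


def remove_edges_incompatible_with_2cycles(s_edges, d_edges):
    """Drop the S-edges that are incompatible with an existing 2-cycle.

    Edges are pairs of vertices; the result is a set of frozensets so that
    the orientation in which an edge was given does not matter.
    """
    kept = {frozenset(e) for e in s_edges}
    for u, v in d_edges:
        if frozenset((u, v)) in kept:          # S-edge uv and D-edge uv: a 2-cycle
            kept.discard(frozenset((u, paralog(v))))
            kept.discard(frozenset((v, paralog(u))))
    return kept


# Toy instance (an assumption of this sketch): S = [1 2], D singularized as [1a 2a 1b 2b].
s_edges = [(("1h", "a"), ("2t", "a")), (("1h", "b"), ("2t", "b")),
           (("1h", "a"), ("2t", "b")), (("1h", "b"), ("2t", "a"))]
d_edges = [(("1h", "a"), ("2t", "a")), (("2h", "a"), ("1t", "b")),
           (("1h", "b"), ("2t", "b"))]
pruned = remove_edges_incompatible_with_2cycles(s_edges, d_edges)
```

Each Ď-edge is inspected once and each test is a constant-time set lookup on average, which is in line with the linear running time stated above.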
The edges that are not pruned and are therefore present in PG(S, Ď) are said to be preserved. As shown in Figure 7, for any given square the pruned graph might preserve either (a1-a2) all edges, or (b1-b4) only three edges, or (c1-c3) only two edges, each one from a distinct pair of paralogous edges, or (d1-d3) only two edges from the same pair of paralogous edges, or (e1-e2) a single edge. While the squares are still ambiguous in cases (a1-a2), (b1-b4) and (c1-c3), in cases (d1-d3) and (e1-e2) they are already resolved and can be fixed according to the preserved paralogous edges. Additionally, if none of its edges is part of a player, a square is completely pruned out and is arbitrarily fixed in ABG(S, Ď).
The smaller pruned graph PG(S, Ď) has all relevant parts required for finding an optimal solution of the $\sigma_6$ disambiguation, therefore the 6-scores of both graphs are the same: $\sigma_6(ABG(S, \check{D})) = \sigma_6(PG(S, \check{D}))$. A clear advantage here is that the pruned graph might be split into smaller connected components, and the disambiguation problem can be solved independently for each one of them. Any square that is still ambiguous in PG(S, Ď) is called a {6}-square. Each connected component $G$ of PG(S, Ď) is of one of the two types: […]

Fig. 7. Possible (partial) squares of PG(S, Ď). Shadowed parts represent the pruned elements (since they do not count for the score, it is not relevant to differentiate whether the pruned vertices are telomeres or not). The top line represents squares whose preserved elements include no telomere. The middle and the bottom line represent squares whose preserved elements include telomeres, marked in gray. Note that all of these are Ď-telomeres (S-telomeres are not part of any square). Cases (a1-a2), (b1-b4) and (c1-c3) are ambiguous, while cases (d1-d3) and (e1-e2) are resolved.
Let C and P be the sets of resolved components, so that C has all resolved cycles and P has all resolved paths. Furthermore, let ℳ be the set of ambiguous components of PG(S, Ď). If we denote by σ_6(M) the 6-score of an ambiguous component M ∈ ℳ, the 6-score of PG(S, Ď) can be computed with the formula:

σ_6(PG(S, Ď)) = |C| + |P|/2 + Σ_{M ∈ ℳ} σ_6(M).

Solving the σ_6 disambiguation then corresponds to finding, for each ambiguous component M ∈ ℳ, an optimal solution including only the {6}-squares of M. From now on, by S-edge, S-telomere, Ď-edge and Ď-telomere, we are referring only to the elements that are preserved in PG(S, Ď).
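The aggregation of the score over the components of the pruned graph can be sketched as follows. This is a minimal illustration, assuming that each resolved cycle contributes 1 and each resolved path contributes 1/2 (the same weights used for the intersection graph), and that the scores of the ambiguous components have already been computed; the function and parameter names are hypothetical.

```python
from fractions import Fraction

def sigma6_pruned_graph(num_resolved_cycles, num_resolved_paths, ambiguous_scores):
    """Add up the contributions of all components of the pruned graph.

    Assumed weights: 1 per resolved cycle, 1/2 per resolved path.
    ambiguous_scores holds the (already known) 6-score of each ambiguous component.
    """
    total = Fraction(num_resolved_cycles) + Fraction(num_resolved_paths, 2)
    for score in ambiguous_scores:
        total += Fraction(score)
    return total

# Hypothetical instance: 3 resolved cycles, 2 resolved paths, two ambiguous components.
example = sigma6_pruned_graph(3, 2, [Fraction(5, 2), 1])
```

Exact fractions are used so that half-integer path contributions are not affected by floating-point rounding.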

Intersection between players of the σ_6 disambiguation
Let a ĎSĎ-path be a subpath of three edges, starting and ending with a Ď-edge. This is the largest segment that can be shared by two players: although there is no room to allow distinct {2,4}-paths and/or valid 4-cycles to share a ĎSĎ-path in a graph free of symmetric squares, a ĎSĎ-path can be shared by at most two valid 6-cycles. Furthermore, if distinct ĎSĎ-paths intersect at the same Ď-edge e and each of them occurs in two distinct 6-cycles, then the Ď-edge e occurs in four distinct valid 6-cycles.
In Figure 8 we characterize this exceptional situation, which consists of the occurrence of a triplet, defined to be an ambiguous component composed of exactly three connected ambiguous squares in which at most two vertices, necessarily in distinct squares, are pruned out. In a saturated triplet, the squares in each pair are connected to each other by two Ď-edges connecting paralogous vertices in both squares; if a single Ď-edge is missing, that is, the corresponding vertices have outer connections, we have an unsaturated triplet. This structure and its score can be easily identified, therefore we will assume that our graph is free from triplets. With this condition, Ď-edges can be shared by at most two players:

Proposition 1. Any Ď-edge is part of either one or two (intersecting) players in a graph free of symmetric squares and triplets.

Proof. Recall that a ĎSĎ-path is a subpath of three edges, starting and ending with a Ď-edge. Without symmetric squares, there is no "room" to allow distinct 4-paths and/or 4-cycles to share a ĎSĎ-path. In contrast, at most two valid 6-cycles can share a ĎSĎ-path, as illustrated in Figure 8. If the S-edge in the middle of the shared ĎSĎ-path is in an ambiguous square, we have the exceptional case of a triplet, where a Ď-edge occurs in more than two players. This case can be treated separately in a preprocessing step, so that we can assume that our graph is free of triplets.
Let an SĎS-path be a subpath of three edges, starting and ending with an S-edge. There is no "room" to allow two players to share an SĎS-path: (i) there are two ways of adding a Ď-edge to an SĎS-path to obtain a valid 4-path, but they are incompatible, therefore at most one can exist; or (ii) the two ends of the SĎS-path must be incident to the same Ď-edge, giving a single way of obtaining a 4-cycle; or (iii) any valid 6-cycle including the given SĎS-path needs two extra Ď-edges, one incident to each end, and then there can be only one way of filling the "gap" with a last S-edge.
Now let an open 2-path be an S-edge adjacent to a Ď-edge such that at most one of the two includes a telomere. Considering the case of paths, in the absence of symmetric squares there is no possibility of having two 4-paths sharing an open 2-path. Considering the case of cycles, two {4,6}-cycles sharing the same open 2-path must share the same ĎSĎ-path, which falls in the particular case of a triplet mentioned before.
Finally, a Ď-edge can occur in more than one player (general cases for cycles are illustrated in Figure 9). However, it can only occur in more than two players if it is part of distinct ĎSĎ-paths such that each of them occurs in distinct players. By construction we can see that this can only happen in a triplet (Figure 8) or if the graph has symmetric squares. It follows that, without symmetric squares and triplets, each Ď-edge occurs in at most two distinct players.

Proposition 2. Any S-edge of a {6}-square is part of exactly one player in a graph free of symmetric squares and triplets.
Proof. If an S-edge e is in a {6}-square Q, it "shares" either the same Ď-edge or the same Ď-telomere d with another S-edge e′ from the same square Q. In this case the Ď-edge/telomere d is part of exactly two players and each of the S-edges e and e′ must be part of exactly one player.
In the next sections we present the most relevant contribution of this work: an algorithm to solve the σ_6 disambiguation in linear time.

Solving the σ_6 disambiguation for circular genomes
For the case of circular genomes, which are those exclusively including circular chromosomes, the ambiguous breakpoint graph has no telomeres, therefore all players are cycles. In this case, we call each ambiguous component a cycle-bubble.
Two {6}-squares Q and Q′ are neighbors when a vertex of Q is connected to a vertex of Q′ by a Ď-edge. Any S-edge e of a {6}-square Q in a cycle-bubble M is part of exactly one {4,6}-cycle (Proposition 2), and both Ď-edges incident to the endpoints of e induce the same {4,6}-cycle. For that reason, the choice of e (and its paralogous edge ê) implies a unique way of resolving all neighbors of Q and, by propagating this to the neighbors of the neighbors and so on, all squares of M are resolved, resulting in what we call the straight solution τ_M (see Algorithms 1 and 2). Then we can immediately obtain the complementary alternative solution τ̃_M by switching all ambiguous squares of τ_M. A cycle-bubble is said to be unbalanced if τ_M and τ̃_M have distinct scores, or balanced if they have the same score. If M is unbalanced, its score is given by whichever of τ_M and τ̃_M has the maximum score. If M is balanced, its score is given by both τ_M and τ̃_M (co-optimality). Examples are given in Figure 10.
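The comparison between the straight solution and its complementary alternative can be sketched as follows. This is only an illustration under strong assumptions: a solution is modeled as a list of booleans (one per {6}-square, telling which of its two pairs of paralogous edges is kept), and `score` is a hypothetical callback returning the number of {4,6}-cycles induced by a solution. Neither the representation nor the names come from the original text.

```python
def complement(tau):
    """Switch every ambiguous square of a solution."""
    return [not choice for choice in tau]

def classify_bubble(tau, score):
    """Compare a straight solution with its complementary alternative.

    Returns (label, best_solution, best_score); for a balanced bubble both
    solutions are co-optimal and the straight one is returned.
    """
    alternative = complement(tau)
    s_straight, s_alt = score(tau), score(alternative)
    if s_straight == s_alt:
        return "balanced", tau, s_straight
    best = tau if s_straight > s_alt else alternative
    return "unbalanced", best, max(s_straight, s_alt)

# Toy scoring function (an assumption for the demo only): count the True choices.
toy_score = lambda tau: sum(tau)
unbalanced_case = classify_bubble([True, True, False], toy_score)
balanced_case = classify_bubble([True, False], toy_score)
```

In a real implementation the score would be obtained by counting the cycles induced in the cycle-bubble, which the toy callback does not attempt to do.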

Solving the σ_6 disambiguation with linear chromosomes
For genomes with linear chromosomes, the ambiguous components might include paths besides cycle-bubbles. In the presence of paths, the straight algorithm unfortunately does not work (see Figure 11). We must then proceed with an additional characterization of each ambiguous component M of PG(S, Ď), splitting the disambiguation of M into smaller subproblems.
As we will present in the following, the solution for arbitrarily large components can be split into two types of problems, which are analogous to solving the maximum independent set of auxiliary subgraphs that are either simple paths or double paths. In both cases, the solutions can be obtained in linear time.

Algorithm 1 StraightBubbleSolution
5:   if ê is an S-edge in M then
6:     ResolveNeighbors(τ_M, ê);
7: if vertex w is not in a resolved or fixed square then
8:   j ← index in τ_M of the square containing w;
9:   f ← S-edge wy of Q_j forming a {4,6}-cycle with uv and vw;
10:  τ_M[j] ← {f, f̂};
11:  if f̂ is an S-edge in M then
12:    ResolveNeighbors(τ_M, f̂);
13: return

Fig. 11. Example showing that the straight algorithm does not work with paths: if we start on the dark blue edge of square number 1, we cannot propagate the effect of this choice to the neighbor square.

Intersection graph of an ambiguous component
The auxiliary intersection graph I(M) of an ambiguous component M has a vertex with weight 1/2 for each {2,4}-path and a vertex with weight 1 for each {4,6}-cycle of M. Furthermore, if two distinct players intersect, we have an edge between the respective vertices. The intersection graphs of all ambiguous components can be built during the pruning procedure without increasing its linear time complexity.
Note that an independent set of maximum weight in I(M) corresponds to an optimal solution of M. Although in general this problem is NP-hard, in our case the underlying ambiguous component M imposes a regular structure on its intersection graph, allowing us to find such an independent set in linear time.
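To make the correspondence with maximum-weight independent sets concrete, the following sketch solves tiny instances by exhaustive search. It is exponential and is not the linear-time procedure of the paper; it is only meant as a reference for checking small intersection graphs by hand. The function name and the toy instance (a 4-cycle `C` intersecting two 4-paths `P1` and `P2`) are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def max_weight_independent_sets(weights, edges):
    """Return (best_weight, optimal_sets) of a small weighted graph by brute force.

    weights: dict vertex -> weight (1 for a cycle, 1/2 for a path)
    edges:   iterable of vertex pairs (intersecting players)
    """
    vertices = list(weights)
    adjacent = {frozenset(e) for e in edges}
    best, optimal = Fraction(-1), []
    for r in range(len(vertices) + 1):
        for subset in combinations(vertices, r):
            # Skip subsets containing two intersecting players.
            if any(frozenset(pair) in adjacent for pair in combinations(subset, 2)):
                continue
            weight = sum((weights[v] for v in subset), Fraction(0))
            if weight > best:
                best, optimal = weight, [set(subset)]
            elif weight == best:
                optimal.append(set(subset))
    return best, optimal

# Toy intersection graph: one cycle intersecting two paths that do not intersect each other.
weights = {"C": Fraction(1), "P1": Fraction(1, 2), "P2": Fraction(1, 2)}
edges = [("C", "P1"), ("C", "P2")]
best, optimal = max_weight_independent_sets(weights, edges)
```

In this toy instance the cycle alone and the two paths together are co-optimal, which is the kind of tie that the weights 1 and 1/2 produce.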
If two {2,4}-paths intersect in their S-telomere, this intersection must include the incident Ď-edge. Therefore, when we say that an intersection occurs at an S-telomere, this automatically means that the intersection is the Ď-edge incident to an S-telomere. A valid 4-cycle has two Ď-edges and a valid 6-cycle has three Ď-edges. Besides the one at the S-telomere, a valid 4-path has one Ď-edge while a valid 2-path has none, therefore the latter cannot intersect with a {4,6}-cycle. When we say that 4-paths and/or {4,6}-cycles intersect with each other in a Ď-edge, we refer to an inner Ď-edge, not one incident to an S-telomere.
Since the contribution of each cycle in the score is twice as much as the contribution of a path, we make a distinction between two types of subgraphs of an intersection graph I(M ), which can correspond to cycle-bubbles or path-flows.

Path-flows in the intersection graph
A path-flow in I(M) is a maximal connected subgraph whose vertices correspond to {2,4}-paths. A path-line of length ℓ in a path-flow is a series of ℓ paths, such that each pair of consecutive paths intersect at a telomere. Assume that the vertices in a path-line are numbered from left to right with integers 1, 2, ..., ℓ. A double-line consists of two parallel path-lines of the same length ℓ, such that vertices with the same number in both lines intersect in a Ď-edge and are therefore connected by an edge. A 2-path has no free Ď-edge, therefore a double-line is exclusively composed of 4-paths. If a path-line composes a double-line, it is saturated, otherwise it is unsaturated. Since each 4-path of a double-line already has a Ď-edge intersection with its counterpart in the other line, and each 4-path can have only one Ď-edge intersection, no vertex of a double-line can be connected to a cycle in I(M). Examples of an unsaturated path-line and a double-line are given in Figure 12.

Let us assume that a double-line is always represented with one upper path-line and one lower path-line. A double-line of length ℓ has 2ℓ vertices and exactly two independent sets of maximum weight, each one with ℓ vertices and weight ℓ/2: one includes the paths with odd numbers in the upper line and the paths with even numbers in the lower line, while the other includes the paths with even numbers in the upper line and the paths with odd numbers in the lower line. Since a double-line cannot intersect with cycles, at least one of these independent sets will be part of a global optimal solution for I(M). In other words, not only are the two possible local optimal solutions and their (common) weight known, but it is guaranteed that at least one of them will be part of a global optimal solution. A maximal double-line can be of three different types:

1. Isolated: the double-line is the whole graph I(M). Here the double-line can be cyclic. If ℓ is even, in both upper and lower lines of a cyclic double-line, the last vertex intersects at a telomere with the first vertex. If ℓ is odd, this connection of a cyclic double-line is "twisted": the last vertex of the upper line intersects at a telomere with the first vertex of the lower line, and the first vertex of the upper line intersects at a telomere with the last vertex of the lower line. Being cyclic or not, any of the two optimal local solutions can be fixed.
2. Terminal: intersects with one unsaturated path-line and, without loss of generality, the intersection involves the vertex v located at the rightmost end of the lower line. Here at least one of the two optimal local solutions would leave v unselected; we can safely fix this option. (See Figure 13.)
3. Link: intersects with unsaturated lines at both ends. The intersections can be: (a) single-sided: both occur at the ends of the same saturated line, or (b) alternate: the left intersection occurs at the end of one saturated line and the right intersection occurs at the end of the other.
Let v′ be the outer vertex connected to a vertex v belonging to the link at the right and u′ be the outer vertex connected to a vertex u belonging to the link at the left. Let a balanced link be alternate of odd length, or single-sided of even length. In contrast, an unbalanced link is alternate of even length, or single-sided of odd length. If the link is unbalanced, one of the two local optimal solutions leaves both u and v unselected; we can safely fix this option. If the link is balanced, we cannot fix the solution beforehand, but we can reduce the problem by removing the connections uu′ and vv′ and adding the connection u′v′. Since both u′ and v′ must be the ends of unsaturated lines, this procedure simply concatenates these two lines into a single unsaturated path-line. (See Figure 13 (v) and (vi).) Finding a maximum independent set of the remaining unsaturated path-lines is a trivial problem that will be solved last; depending on whether one of the vertices u′ and v′ is selected in the end, we can fix the solution of the original balanced link.
Fig. 13. Types of double-line: terminal, balanced and unbalanced links. The yellow solution that in cases (i-ii) leaves v unselected and in cases (iii-iv) leaves u and v unselected can be fixed so that an independent set of the adjacent unsaturated path-line(s) can start at v′ (and u′). In cases (v-vi) either the yellow or the green solution will be fixed later; it will be the one compatible with the selected independent set of the unsaturated path-line ending in u′ concatenated to the one starting in v′.
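The claims about double-lines can be checked by brute force on small lengths. The sketch below assumes a non-cyclic double-line modeled as two rows of vertices `("upper", i)` and `("lower", i)`, where equally numbered vertices are adjacent (shared Ď-edge) and consecutive vertices of the same row are adjacent (shared telomere). It enumerates the largest independent sets and classifies a link by testing whether some optimum leaves both end vertices u (left) and v (right) unselected. All names are hypothetical.

```python
from itertools import product

def double_line_optima(length):
    """Return (size, optima): the largest independent sets of a double-line."""
    best_size, optima = -1, []
    # At most one vertex per position can be selected (same-numbered paths intersect).
    for choice in product((None, "upper", "lower"), repeat=length):
        # Consecutive paths of the same line intersect at a telomere.
        if any(a is not None and a == b for a, b in zip(choice, choice[1:])):
            continue
        selected = {(line, i + 1) for i, line in enumerate(choice) if line is not None}
        if len(selected) > best_size:
            best_size, optima = len(selected), [selected]
        elif len(selected) == best_size:
            optima.append(selected)
    return best_size, optima

def link_is_balanced(length, left_line, right_line):
    """A link is unbalanced when some optimum avoids both end vertices u and v."""
    u, v = (left_line, 1), (right_line, length)
    _, optima = double_line_optima(length)
    return not any(u not in s and v not in s for s in optima)

size3, optima3 = double_line_optima(3)
```

Since every selected vertex is a 4-path of weight 1/2, an optimum with ℓ vertices has weight ℓ/2. With `left_line == right_line` the link is single-sided, otherwise it is alternate.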

Intersection between path-flows and cycle-bubbles
If an ambiguous component has only cycles, its solution can be easily obtained with the straight algorithm presented in the previous section. The situation is more intricate when an ambiguous component M includes cycles and paths. In this case we redefine a cycle-bubble as corresponding to a maximal connected subgraph of I(M) whose vertices correspond to {4,6}-cycles. Let H be the subgraph of M including all edges that compose the cycles of a cycle-bubble. An optimal solution for H is either the straight solution τ_H, given by Algorithm 1, or its alternative τ̃_H. Recall that if both τ_H and τ̃_H have the same score, then H is said to be balanced, otherwise it is said to be unbalanced.
Proposition 3. Let an ambiguous component M have cycle-bubbles H_1, ..., H_q. There is an optimal solution for M including, for each i = 1, ..., q: (1) the optimal solution for H_i, if H_i is unbalanced; or (2) either τ_{H_i} or τ̃_{H_i}, if H_i is balanced.
Proof. We will analyze the cases by increasing the size of the maximal subgraph containing intersecting cycles:

1. A {4,6}-cycle C that does not intersect with any other {4,6}-cycle: (a) if C is a 4-cycle, it can intersect with at most two valid 4-paths; therefore there is an optimal solution including C; (b) if C is a 6-cycle, it can intersect with at most three valid 4-paths, but if it intersects with three valid 4-paths there will be at least one valid 2-path P compatible with C; therefore there is an optimal solution including C and P (see Figure 14 (ii)).

As the size of the bubble grows, there is less space for intersecting paths, and each cycle intersects with at most one path. In general, the best we can get by replacing cycles by paths are co-optimal solutions.

As a consequence of Proposition 3, if a cycle-bubble is unbalanced, its optimal solution can be fixed so that the unsaturated path-lines around it can be treated separately. Similarly, if a balanced cycle-bubble H has a single intersection involving a cycle C and a path P (that can be the first vertex of an unsaturated path-line), then we can immediately fix the solution of H that does not contain C.
Balanced cycle-bubbles intersecting with at least two paths. If a cycle-bubble H is balanced and intersects with at least two paths, then it requires a special treatment. However, as we will see, here the only case that can be arbitrarily large is easy to handle. Let a cycle-bubble be a cycle-line when it consists of a series of valid 6-cycles, such that each pair of consecutive cycles intersect at a Ď-edge (see Figure 15).

Proof of Proposition 4. In Figure 16 (whose steps are elaborated in Figures A1-A6 of Appendix A.1) we show that, if a bubble is not a line, it reaches its "capacity" with at most 8 cycles. Besides having its size limited to 8 cycles, the more complex a non-linear cycle-bubble becomes, the less space it has for paths around it. The solutions for these few exceptional bounded cases are described at the end of this section.
Our focus now is the remaining situation of a balanced cycle-line with intersections involving at least two cycles. Recall that cycles can only intersect with unsaturated path-lines. An intersection between a cycle-line and a path-line is a plug connection when it occurs between vertices that are at the ends of both lines.

Proposition 5. Cycle-lines of length at least 4 can only have plug connections.
Proof. If a cycle-line has length at least four, its underlying graph has only "room" for intersections with 4-paths next to its leftmost or rightmost cycles. See the illustration in Figure 17.
For arbitrarily large instances, the last missing case is that of a balanced cycle-line with plug connections at both sides, called a balanced link. The procedure here is the same as that for double-lines that are balanced links, where the local solution can only be fixed after fixing those of the outer connections (see Figure 17 (ii)).
Exceptional bounded cases. Balanced cycle-lines with two cycles can have connections to path-lines that are not plugs, but the number of cases is again limited. In most of them (shown in Figure A7 of Appendix A.2) the bubble is saturated and the paths around cannot be connected to extendable path-lines. For these bubbles all paths are over the same squares of the cycles, therefore the straight algorithm would give the two overall alternatives including the paths around each of these bubbles, and the best solution can be immediately fixed.
In another case (shown in Figure A8 (i) of Appendix A.2) there is one extendable path-line, but the local solution (including the bubble and the paths that are over the same squares) is unbalanced, therefore also here we can fix the best among the two overall alternatives given by the straight algorithm.
In the last two cases (shown in Figure A8 (ii) and (iii) of Appendix A.2) there are extendable path-lines, and the local solutions (including the bubble and the paths that are over the same squares) are balanced. In the first case, there is only one extendable path-line and we can fix the solution including the cycle that is connected to the last "visible" path of the path-line. The second case is analogous to cycle-lines of type balanced link, with the difference that here the lines are already concatenated; the local solution can then only be fixed after fixing those of the outer connections.
Concerning non-linear cycle-bubbles, there are only four distinct cases that need to be considered: one case of a non-linear bubble with two 6-cycles (Figure A7 (iii) of Appendix A.2) and three cases of non-linear bubbles with four 6-cycles (Figure A9 in Appendix A.2). In all of these four cases, the bubble is saturated and the paths around cannot be connected to extendable path-lines. Indeed, also for these bubbles all paths are over the same squares of the cycles, therefore the straight algorithm would give the two overall alternatives including the paths around each of these bubbles, and the best among these solutions can be immediately fixed.
What remains is a set of independent unsaturated path-lines. If what remains is a single unsaturated path-line of even length, it can even be cyclic. In any case, an optimal solution can be trivially found. First assume that in an unsaturated path-line of length ℓ the paths are numbered from left to right with 1, 2, ..., ℓ. The solution that selects all paths with odd numbers must be optimal. Fix this solution and, depending on the connections between the selected vertices of the unsaturated path-line and vertices from balanced links that are double-lines or cycle-lines, fix the compatible solutions for the latter ones.
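The optimality of the odd-numbered selection on a non-cyclic unsaturated path-line can be checked against a standard dynamic program for the maximum independent set of a simple path. The sketch below assumes that consecutive paths of the line intersect and that every path has weight 1/2; the function names are hypothetical.

```python
from fractions import Fraction

def odd_selection(length):
    """Paths numbered 1, 3, 5, ... of a path-line with the given length."""
    return list(range(1, length + 1, 2))

def max_independent_paths(length):
    """Largest number of pairwise non-consecutive paths, by dynamic programming."""
    take, skip = 0, 0  # best counts when the current path is taken / skipped
    for _ in range(length):
        take, skip = skip + 1, max(take, skip)
    return max(take, skip)

def path_line_weight(length):
    """Weight of the odd-numbered selection (each path weighs 1/2)."""
    return Fraction(len(odd_selection(length)), 2)

weight5 = path_line_weight(5)
```

For every length the odd-numbered selection reaches the optimum computed by the dynamic program, which is why the last step of the procedure needs no search.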

Final remarks and discussion
Given a singular genome S and a duplicated genome D over the same set of gene families, the double distance of S and D aims to find the smallest distance between D and any element from the set 2S, which contains all possible genome configurations obtained by doubling the chromosomes of S. Different underlying genomic distance measures give rise to different double distances: the breakpoint double distance of S and D is an easy problem that can be greedily solved in linear time, while computing the DCJ double distance of S and D is NP-hard. Our study is an exploration of the complexity space between these two extremes.
We considered a class of genomic distance measures called σ_k distances, for k = 2, 4, 6, ..., ∞, which are between the breakpoint (σ_2) and the DCJ (σ_∞) distance. In this work we presented linear time algorithms for computing the double distance under the σ_4 and under the σ_6 distance. Our solution relies on a variation of the breakpoint graph called ambiguous breakpoint graph.
The solutions we found so far are greedy with all players being optimal in σ_2, greedy with all players being co-optimal in σ_4 and non-greedy with non-optimal players in σ_6, all of them running in linear time. More specifically for the σ_6 case, after a pre-processing that fixes symmetric squares and triplets, at most two players share an edge. However, we can already observe that, as k grows, the number of players sharing the same edge also grows. For that reason, we believe that, if for some k ≥ 8 the σ_k double distance is found to be NP-hard, the σ_k′ double distance is also NP-hard for any k′ > k. We expect that when we find the smallest k for which the σ_k double distance is NP-hard we will be able to confirm this conjecture. In any case, the natural next step in our research is to study the σ_8 double distance.
Besides the double distance, other combinatorial problems related to genome evolution and ancestral reconstruction, including median and guided halving, have the distance problem as a basic unit. Analogously to the double distance, these problems can be solved in polynomial time (but, differently from the double distance, not greedily and in linear time) when they are built upon the breakpoint distance, while they are NP-hard when they are built upon the DCJ distance [12]. Therefore, a challenging avenue of research is doing the same exploration for both median and guided halving problems under the class of σ_k distances. In both cases it seems possible to adopt variations of the breakpoint graph. To the best of our knowledge, the guided halving problem has not yet been studied for any σ_k distance except k = 2 and k = ∞, while for the median much effort has been devoted to the σ_4 distance but no progress has been obtained so far. A reason for this difference of progress between double distance and median is probably related to the underlying approaches. While the double distance can be solved by removing paralogous edges from the ambiguous breakpoint graph, solving the median requires adding new edges (representing the adjacencies of the median genome) to an extended (multiple) breakpoint graph, and the combinatorial space of the distinct possibilities of doing that could not yet be described.

A Supplementary figures
All the figures presented here assume a graph free of symmetric squares and triplets. For each case we have the ambiguous component of the pruned graph and its intersection graph. Often small modifications (e.g., by switching the positions of S- and Ď-telomeres) lead to equivalent cases, and here we show only one of these. In the particular situations of an intersection between two cycles being a ĎSĎ-path or intersections between two paths occurring at both telomeres, the respective vertices of the intersection graph are connected by two parallel edges.

A.1 Complex bubbles are limited to 8 cycles
By a complete enumeration of cases, in Figures A1-A6 we show that, if a bubble is not a line, it reaches its "capacity" with at most 8 cycles. In all figures dashed gray edges are pruned out.

Single case of a bubble with eight cycles, obtained by connecting the two free vertices of (6b). All squares are fully connected and the bubble can no longer be extended.

A.2 Balanced cycle-bubbles intersecting with more than one path
In Figures A7-A9 we enumerate all cases of balanced cycle-bubbles that have at most 8 cycles and intersect with more than one path. We omit the general and well described case of a cycle-line with plug connections. In all figures, dotted red edges are exclusively for paths, dashed gray edges are pruned out, blue nodes represent S-telomeres and gray nodes represent Ď-telomeres. Furthermore, green/yellow solutions are co-optimal, while yellow solutions are better than the pink alternatives.

(ii) has two paths and two connections between these paths and cycles from the same independent set. (iii) has two paths forming a path-line and two connections between these paths and cycles from distinct independent sets.

Fig. 2. Breakpoint graph of genomes S1 = { (1 2) [3 4] } and S2 = { (1 3 2) [4] }. Edge types are distinguished by colors: S1-edges are drawn in blue and S2-edges are drawn in black. Similarly, vertex types are distinguished by colors: an S1-telomere is marked in blue, an S2-telomere is marked in gray, a telomere in both S1 and S2 is marked in purple and non-telomeric vertices are white. This graph has one 2-cycle, one 0-path and one 4-path.
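The component counts stated in the caption of Fig. 2 can be reproduced with a short sketch. It assumes a hypothetical encoding in which a genome is a list of `(kind, genes)` chromosomes, a positive gene is read from tail to head, and each gene extremity is a vertex `(gene, "t")` or `(gene, "h")`; the lengths of cycles and paths are counted in edges. The function names are illustrative and the genomes are assumed to form a canonical pair.

```python
from collections import Counter

def adjacencies(genome):
    """List the adjacency edges (pairs of gene extremities) of a genome."""
    edges = []
    for kind, genes in genome:
        ext = []
        for g in genes:
            tail, head = (abs(g), "t"), (abs(g), "h")
            ext += [tail, head] if g > 0 else [head, tail]
        # Right extremity of each gene with left extremity of the next one.
        for i in range(1, len(ext) - 1, 2):
            edges.append((ext[i], ext[i + 1]))
        if kind == "circular":
            edges.append((ext[-1], ext[0]))
    return edges

def component_profile(genome_a, genome_b):
    """Count cycles and paths of the breakpoint graph by length (in edges)."""
    all_edges = adjacencies(genome_a) + adjacencies(genome_b)
    vertices = {(abs(g), e) for _, genes in genome_a for g in genes for e in "th"}
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y in all_edges:
        parent[find(x)] = find(y)
    n_vertices = Counter(find(v) for v in vertices)
    n_edges = Counter(find(x) for x, _ in all_edges)
    profile = Counter()
    for root, nv in n_vertices.items():
        ne = n_edges[root]
        # Every vertex has degree at most 2, so a component with as many
        # edges as vertices is a cycle; otherwise it is a path.
        profile[("cycle" if ne == nv else "path", ne)] += 1
    return dict(profile)

S1 = [("circular", [1, 2]), ("linear", [3, 4])]
S2 = [("circular", [1, 3, 2]), ("linear", [4])]
profile = component_profile(S1, S2)
```

Union-find is used here only for brevity; a plain traversal of the graph would serve equally well, since every vertex has degree at most two.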

2 3 •
and •4 (where the symbols • represent the open ends). If we join the first with the third and the second with the fourth open end, we get K′ = [ 1 3 2 4 ], that is, the described DCJ operation is an inversion transforming K into K′. Besides inversions, DCJ operations can represent several rearrangements, such as translocations, fissions and fusions. The DCJ distance d_dcj is then the minimum number of DCJs that transform one genome into the other and can be easily computed with the help of their breakpoint graph [3]:

Fig. 4. (i) The gray path connecting vertices v̂ to û is necessarily odd with length at least one and alternates Ď- and S-edges. The 2-cycle C = (uv) intersects the longer cycle D = (uv̂ ... ûv). Any solution containing (red edges) E = {uv̂, ûv} induces D and can be improved by switching E to (blue edges) E′ = {uv, ûv̂}, inducing, instead of D, the 2-cycle C and cycle D′ = (v̂ ... û) (which is shorter than D). (ii) The gray paths connecting vertices v̂ to telomere y and û to telomere z alternate Ď- and S-edges. The 2-cycle C = (uv) intersects the longer path P = y ... v̂uvû ... z. Any solution containing (red edges) E = {uv̂, ûv} induces P and can be improved by switching E to (blue edges) E′ = {uv, ûv̂}, inducing, instead of P, the 2-cycle C and path P′ = y ... v̂û ... z (which is of the same type, but 2 edges shorter than P).

Fig. 9. Patterns free of triplets and symmetric squares showing a Ď-edge uv in two distinct intersecting {4,6}-cycles which themselves do not intersect 2-cycles. (i)-(iii) The edge uv connects two distinct squares and is part of two {4,6}-cycles whose intersection is only uv. (iv) The edge uv is part of two 6-cycles whose intersection is a ĎSĎ-path starting in uv. Here one square (marked in blue) is clearly fixed: if this square could be switched, this would merge each of the two existing 6-cycles into a longer cycle.

Fig. 10. Example of execution of Algorithm 1 in cycle-bubbles. In both cases the algorithm starts on the dark blue edge of square 1. In (i) we have a balanced cycle-bubble, for which the resulting straight disambiguation and its complementary alternative have the same score (co-optimality). In (ii) we have an unbalanced cycle-bubble, for which the resulting straight disambiguation and its complementary alternative have distinct scores.

Input: A partially filled solution τ_M and an S-edge uv of cycle-bubble M /* S-edge uv is adjacent to two Ď-edges uz and vw */
1: if vertex z is not in a resolved or fixed square then
2:   i ← index in τ_M of the square containing z;
3:   e ← S-edge zx of Q_i forming a {4,6}-cycle with uv and uz;
4:   τ_M[i] ← {e, ê};

Fig. 12. Examples of an unsaturated path-line, a double-line and the intersection between a double-line and two unsaturated path-lines.

2. Two {4,6}-cycles C and C′ intersecting with each other but not with any other {4,6}-cycle: Since valid 4-cycles have fewer edges for intersection, let us assume without loss of generality that both C and C′ are 6-cycles. Their intersection (illustrated in Figures A7 and A8 of Appendix A.2) can be: (a) a ĎSĎ-path, and in this case each cycle can intersect with at most one valid 4-path, therefore there is an optimal solution including either C or C′; (b) a single Ď-edge, and in this case each cycle can intersect with two valid 4-paths, therefore there is an optimal solution including either C or C′.

Fig. 14. Underlying pruned subgraphs and corresponding intersection graphs of a bubble with a single 6-cycle Y (solid edges). Dotted edges are exclusive to paths and dashed gray edges are pruned out. In (i) and (ii), Y intersects with three valid 4-paths Aαa, Bβb and Cδc. In (i), the yellow solution including Y would also include the three 2-paths Ab, Bc and Ca, being clearly superior. In (ii), the yellow solution including Y would still include the 2-path Ba, having the same score as the green solution with three 4-paths. In either case, the underlying graph cannot be extended. In (iii), Y has plug connections with unsaturated path-lines starting at 4-paths Aαa and Bβb (both can be extended).

Fig. 15.Cycle-bubble of type cycle-line and its intersection graph.

Proposition 4. Cycle-bubbles involving 9 or more cycles must be a cycle-line.

Fig. 16. While a cycle-line can be arbitrarily large, by increasing the complexity of a bubble we quickly saturate the space for adding cycles to it. Starting with (a) a simple cycle-line of length two, we can either (b1) connect the open vertices of squares 2 and 3, obtaining a cyclic cycle-line of length 4 that cannot be extended, or (b2) extend the line so that it achieves length three. From (b2) we can obtain (c1) a cyclic cycle-line of length 4 that can be extended first by adding cycle Y5 next to Y1 and then either adding Y′5 next to Y3 or closing Y6, Y7 and Y8 so that we get (c2). In both cases no further extensions are possible. Note that (c2) can also be obtained by extending a cycle-line of length three and transforming it into a star with three branches, which can still be extended by closing Y3, Y6, Y7 and Y8. (These steps are elaborated in Figures A1-A6 of Appendix A.1.)

Fig. 17. (i) A cycle-line of length 4 or larger only allows plug connections. In contrast, cycle-lines of lengths 1-3 admit other types of connections (see Figures 14 and A7-A8, the last two in Appendix A.2). (ii) If the cycle-line has even length and plug connections at both sides, we have a balanced link: either the yellow or the green solution will be fixed later; it will be the one compatible with the selected independent set of the unsaturated path-line ending in u′ concatenated to the one starting in v′.

Fig. A7. Bubbles with two 6-cycles Y1 and Y2 (solid edges) whose intersections with paths do not allow path-line extensions. (i) is symmetrically surrounded by a cyclic path-line of length 6. (ii) is symmetrically connected to two path-lines of length 2. (iii) is symmetrically connected to a single path-line of length 2, whose paths intersect at both telomeres. (iv) has a path-line of length 3, whose ends are connected to a single cycle. (v) has two paths connected to a single cycle.