Global exact optimisations for chloroplast structural haplotype scaffolding

Background Scaffolding is an intermediate stage of fragment assembly. It consists in orienting and ordering the contigs obtained by the assembly of the sequencing reads. In the general case, the problem has been largely studied with the use of distances data between the contigs. Here we focus on a dedicated scaffolding for the chloroplast genomes. As these genomes are small, circular and with few specific repeats, numerous approaches have been proposed to assemble them. However, their specificities have not been sufficiently exploited. Results We give a new formulation for the scaffolding in the case of chloroplast genomes as a discrete optimisation problem, that we prove the decision version to be \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal{NP}$$\end{document}NP-Complete. We take advantage of the knowledge of chloroplast genomes and succeed in expressing the relationships between a few specific genomic repeats in mathematical constraints. Our approach is independent of the distances and adopts a genomic regions view, with the priority on scaffolding the repeats first. In this way, we encode the structural haplotype issue in order to retrieve several genome forms that coexist in the same chloroplast cell. To solve exactly the optimisation problem, we develop an integer linear program that we implement in Python3 package khloraascaf. We test it on synthetic data to investigate its performance behaviour and its robustness against several chosen difficulties. Conclusions We succeed to model biological knowledge on genomic structures to scaffold chloroplast genomes. Our results suggest that modelling genomic regions is sufficient for scaffolding repeats and is suitable for finding several solutions corresponding to several genome forms.


Background
DNA molecule is a support of living mechanism information found in all the living organisms.Its sequence can be seen as a word on the alphabet nuc = {A, C, G, T } .Genetic, genomic and epigenetic analysis lead to combinatorial problems that need to be solved by computing approaches and thus require to pass from the DNA molecule to a word representation in a computer.This process is described as sequencing the molecule using a sequencing technology.

Current technologies
are not yet able to read an entire DNA molecule.They output a huge amount of small overlapping erroneous DNA sequences, named reads.Moreover, due to the double-strand structure of the DNA, the reads come from either one strand or its complement in reverse order, and the sequencing technologies cannot assure that two reads were sequenced from the same strand.Thus, each read must be considered in two orientations: the one given by the technologies (defined as the forward orientation), and its reverse-complement (defined as the reverse orientation) (e.g.ATGCCA and TGGCAT are each other's reverse-complement).
From the reads, genome assembly methods aim to find the longest true DNA sequences.Note that it is sufficient to only find one strand out of two, as the other is obtained by the reverse-complement transformation of the first one.Assembly methods are often split into three major stages: (i) assembly of reads based on their overlaps to obtain longer sequences (contigs); (ii) scaffolding, that aims to obtain an order of oriented contigs, potentially separated by nucleotide distances (scaffolds); (iii) gap-filling, that aims to complete the assembly by filling the gaps between the contigs in the scaffolds.Here we focus on the scaffolding step uniquely.
The vast majority of proposed scaffolding formulations are based on distances data between the reads (pairedend or mate-pair data) that are adapted for the contigs to obtain scaffolds.Read distances can either be counted to represent a confidence linking information between two contigs [5,14], either be considered precisely for contig nucleotide positioning [1], or combined both approaches as in [12,18,21].

Chloroplast genome architecture and structural haplotypes
In this paper, we address the scaffolding problem for the particular class of chloroplast genomes.Chloroplasts are plants' organelles derived from the integration of a cyanobacterium in an eukaryotic host.They conduct photosynthesis, a process to convert light energy into chemical energy.Over the evolution time, the chloroplast genome has reduced in length and loosed in terms of complexity [22].As a result, chloroplast genomes possess few repeats that are usually identical in nucleotide sequences.
One the most studied forms of chloroplast genome is a circular quadripartite DNA molecule.It consists of four regions: two identical (or highly similar) nucleotide subsequences, separated by two long and short single-copies (LSC, SSC) [4,17].There are two types of repeats: (i) the Direct Repeat (DR), where the sequences are highly similar; (ii) the Inverted Repeat (IR), where one sequence is the reverse-complement of the other.Figure 1 illustrates the common chloroplast genome architectures.
Furthermore, each chloroplast has multiple copies of its genome, and the molecular forms of the copies differ (structural haplotype leading to heteroplasmy, and multigenomic structures-not discussed here [3,16]).This phenomenon can be induced by flip-flop inversion: one subsequence is reverse-complemented (reversed) during the DNA replication.This inversion is provoked by the existence of facing IR on either side of the reversed subsequence.

Chloroplast scaffolding approaches
Although there are chloroplast genome assemblers and scaffolders, they do not fully exploit the chloroplast Fig. 1 The most studied chloroplast genome form is circular and very often quadripartite.For each of the three figures, coloured arrows represent nucleotide sequences.LSC and SSC stand for long and short single copies (purple and red), respectively.They correspond to regions (subsequences) that are not repeated in the genome.On the opposite, IR and DR stand for inverted and direct repeat (green and blue), respectively.a This architecture is the most common one and is defined as a quadripartite architecture.The two green IR arrows face each other and illustrates that one is the reverse-complement sequence of the other; b the two blue DR arrows are in the same direction that illustrates both have the same nucleotide sequence; c the two types of repeat can simultaneously exist in the chloroplast genome, and DRs are shorter than IRs Fig. 2 During the DNA replication of the chloroplast genome, one of the region between the two inverted repeats can be reversed (c.f. the red region SSC ).This provokes the existence of several forms of the genome in the same chloroplast (heteroplasmy) genome's specificities.Some of them are pipelines of generic methods applied on cleaned input dataset [2], or based on locally approaches as seed-and-extend algorithms [7,8].Ref. [13], in GetOrganelle, statistically compute the contigs' multiplicities by minimising the squared distance between them and the mapping coverage by the reads.
Concerning the handling of the distinct genome forms, GetOrganelle returns several solutions and explores in post-process the corresponding architectures.In [1] the flip-flop inversion breakpoints are detected in a postscaffolding-process, and new optimal solutions can be constructed in polynomial time.
We raise the following two questions: (i) how to mathematically model chloroplast genomic biological knowledge?(ii) How to include the multimerism into the scaffolding problem formulation?
To formalise the scaffolding by integrating the structural haplotypes in its core, we postulate on the particularities of the chloroplast genome: (i) repeats are pairs of regions; (ii) the two regions of a repeat have identical (or reversed) nucleotide sequence; (iii) structural haplotypes can be seen as permutations in a sequence of oriented contigs.

Organisation of the paper
We first describe the input data for our approach and provide mathematical definitions (Section "Input data and notation").Based on the above three assumptions, we propose a new formulation for scaffolding chloroplast genomes without requiring any distances (Section "Chloroplast scaffolding problem formulations").Our approach is region-driven and focuses on retrieving the repeats first.We model the optimisation problem on a directed graph (digraph) where we apply several integer linear programming (ILP) strategies (Sections "Graph and repeated fragment sets" and "Integer linear programming (ILP) formulation").We then detail how we combine the ILP solutions (Section "Hierarchical problem succession").
The ILP's solutions and the digraph correspond to an oriented contig sequence representing one genome's form (Section "From an ILP solution to a genome structure").We partition this sequence into genomic regions we express in a region graph.This graph enables us to return multiple genome forms (Section "Multiple genome forms").
We prove the decision version of the chloroplast genome scaffolding problem to be N P-Complete (Sec- tion "N P-completeness").However, we exactly solve the problem by profiting from the small size of the chloroplast instances and providing some numerical results (Section "Numerical results").We finally conclude (Sections "Conclusion" and "Discussion and perspectives").

Set of contigs C
Contigs are words in the nucleotides DNA alphabet nuc + .A contig can occur in the genome up to an integer called multiplicity.Function mult : C → N >0 provides its value.Any of contig's occurrence can appear in one of two possible reverse-complementary and mutually exclusive orientations: forward ( f = 0 ) and reverse ( r = 1).
Each contig is provided with an existence-weight in R ≥0 given by the function wex .The weight is proportional to the number of times a contig aligns with chloroplast sequences from a given set (from related or unrelated species).
Finally, one contig in this set is defined as the starter ( s ) that must uniquely participate in the genome ( mult(s) = 1 ).The starter is a contig whose sequence matches a sequence shared by most chloroplast genomes in a single-copy.Table 1 gives an example of contig set.

Set of links L
Each link is an ordered pair of oriented contigs.We denote by L ⊂ C × {f , r}  all occurrences of c and d respecting the same orienta- tions.Table 1 gives an example of link set.

Mathematically defining genomic regions
We aim to order oriented occurrences of the contigs based on their links.Each genome form corresponds to a sequence of oriented contigs.Not all contigs or their occurrences are included.Indeed, the contig set may contain contigs belonging to the plant genome or other organelles.The link set may also contain artefact links.Definition 1 provides the properties the sequence of oriented contigs must respect.
Definition 1 (Sequence of oriented contigs) Let SOC = (c 0 , c 1 , . . ., c n−1 ) be a sequence of oriented contigs: Based on the biological knowledge, we address the dedicated chloroplast scaffolding problem as a regiondriven scaffolding, such that specific regions fit into a circular structure.We identify three types of regions to scaffold: the directed repeats, the inverted repeats and the single-copies.Definition 2 (Region) A region r = (c 0 , c 1 , . . ., c n−1 ) is a sequence of oriented contigs.Each region is oriented.Let r = (c n−1 , . . ., c 1 , c 0 ) be the reverse region of r .It is composed of the oriented contigs of r , considered in their reverse orientation, and given in the reversed order.According to the reverse symmetry in the links, r also respects Definition 1.
Definition 3 (Direct repeat-DR) A DR is a couple of regions (dr i , dr j ) where dr i = dr j .Definition 4 (Inverted repeat-IR) An IR is a couple of regions (ir k , ir l ) where ir l = ir k .Definition 5 (Repeat) A repeat is the generic term to denote either DR or IR.The length replen(R) of a repeat R = (r i , r j ) equals the lengths of r i and r j ( replen(R) Definition 6 A SC is a region that is not part of a repeat.

Definition 7 (Region weight)
The weight rwex(r) of a region r is defined as rwex(r) = c∈r wex(c).
A chloroplast genome consists of a sequence of oriented regions.A genome form is a result of iterative transformations of an initial one.Section "Multiple genome forms" introduces the region graph to model multiple genome forms (sequences of oriented contigs).

Chloroplast scaffolding problem formulations
Solving the repeats is the most challenging task.A formulation that does not restrict the occurrences can lead to misassemblies where the results are longer than the solution genomes.Therefore, the use of an occurrence is limited to conformity with the biological knowledge of genome forms.A contig should participate in the sequence only if it enables the formation of pairs of repeated regions.In this case, we would be inclined to assemble the minimum number of repeats.
However, in the case of repeat degeneration (e.g. two subsequences inside the two regions of an identified repeat differ, note that some IR losses have been reported in the chloroplast genomes of green algae- [20]) finding the minimum number of repeats is not an appropriate model.Figure 3 illustrates the impact of degenerations on quadripartite structures.Indeed, in Fig. 3b, we cannot guaranty to find both IR 1 and IR 2 , but perhaps only one of them.For each repeat type, we address this issue by maximising the cumulative repeat lengths only if their regions respect a specific order.Definition 9 (Chloroplast scaffolding problem CHSP ) Given a set of contigs with their multiplici- ties and their weights, a starting contig and a link set.The aim is to obtain a circular sequence of oriented regions maximising the cumulative repeat lengths and minimising the number of repeats, with single-copies of maximum-weight.
For instance, let (A) and (B) be two distinct and feasible sequences of oriented contigs: cFor (A) and (B) the cumulative lengths are the same ( replen((i, j)) = replen((k, l)) + replen((m, n)) = 8 ).However, (A) has one less repeat, which we prefer.CHSP involves three subproblems, each one associ- ated with a particular type of region: (i) DRP for the direct repeats (Definition 10); (ii) IRP for the inverted repeats (Definition 11) and (iii) SCP for the single- copies (Definition 12).We tackle CHSP in a hierarchi- cal succession of DRP , IRP and SCP (Definition 14).Any intermediate problem must preserve the regions found by its predecessors (Definition 13).
DRP and IRP constrain the number of occurrences to the structure of pairs of repetitions.Indeed, each repeat type defines a valid repeat structure.The problems consist in maximising the cumulative length of the minimum number of repeats.
Definition 10 (Chloroplast direct repeat scaffolding problem DRP ) Consider a set of contigs, their multi- plicities, a starting contig and a link set.Find a circular sequence of oriented regions SOR , such that: -It maximises the cumulative length of the minimum number of DRs, joined by regions of any kind; -For any couple of DRs (i, j) and (k, l) found in SOR such that their respective positions in SOR given by function σ respect σ (i) < σ (j) , σ (k) < σ (l) , and σ (i) < σ (k) , then: - Definition 11 (Chloroplast inverted repeat scaffolding problem IRP ) Consider a set of contigs, their multiplicities, a starting contig and a link set.Find a circular sequence of oriented regions SOR , such that: -It maximises the cumulative length of the minimum number of IRs, joined by regions of any kind; -For any couple of IRs (i, j) and (k, l) found in SOR such that that their respective positions in SOR given by function σ respect σ (i) < σ (j) , σ (k) < σ (l) and σ (i) < σ (k) , then: - Figure 4 provides examples of valid oriented contig positioning for each common genome structure (Fig. 1).Although Fig. 8 illustrates the authorised and forbidden positions for the latter defined repeated fragments, it is also applicable for the DRP and IRP regions' position cases.
Definition 12 (Chloroplast single-copy scaffolding problem SCP ) Consider a set of contigs, their multiplici- ties, their weights, a starting contig and a link set.Find a circular sequence of oriented regions such that all the single-copies maximise their weights.
Note that in Definition 12, if they are no repeats, the problem is reduced to find the maximum-weighted circuit of oriented contigs.
Definition 13 (Chloroplast scaffolding problem succession) DRP , IRP and SCP (Definitions 10, 11, 12) can also take as input a set of fixed regions that must be preserved in the resulting sequence of oriented regions.3 With time, a repeat may degrade so that its occurrences differ.a and c show two common quadripartite structures with an IR and a DR, respectively.b and d highlight the impact of a degeneration on their structures.We give the bellow region orders according to the LSC arrow's direction.In b, the degeneration results in two IRs: IR 1 = (i, j) and IR 2 = (k, l) , such that i is before k and l precedes j .In d it results in two DRs: DR 1 = (i, j) and IR 2 = (k, l) , such that i is before k and j precedes l Our hierarchical approach prioritises the scaffolding of the repeats previously to the SC regions.Indeed, scaffolding the repeats is the most difficult task as it can lead to misassemblies (wrongly chosen links).Hence, a contig should only be used as many times as possible if its occurrences enable the scaffolding of repeats that most closely represent the architecture of the chloroplast genome, as illustrated in Figs. 1 and 4. We assume that the weights concern events that are less relevant comparing to the genomes architecture.Also, the weights on the contigs are less relevant than if they were weights on the links.Definition 14 (Hierarchical problem succession) The form of each solution of CHSP satisfies one of the two problem successions: DRP-IRP-SCP ( h 1 ) and IRP -DRP-SCP ( h 2 ).
The next question is how to prioritise DRP and IRP ?We propose resolving the order by comparing the scores defined in Sect."Scaffolding problems ILP": if DRP score is better than this of IRP , then the retained succession will be DRP-IRP-SCP , otherwise it will be IRP -DRP-SCP .In the equality case, we discriminate at a further step of the hierarchical successions.The process is detailed in Sect."Hierarchical problem succession".
Finally, each hierarchical problem succession produces a circular sequence of oriented regions.From the obtained sequence it is possible to extract a set of ordered pairs of oriented regions.This procedure allows the building of several circular sequences of oriented regions of the same length.Each of them represents one possible chloroplast genome form.This all-equivalent-form process is described in Sect."Multiple genome forms".

Graph and repeated fragment sets
In order to efficiently handle the multiplicities of the contigs, hence of the links, we need to build adapted data structures.On the one hand, finding a sequence of oriented contigs, when the links correspond to ordered pairs of oriented contigs, justifies the use of a directed graph to represent the oriented contigs and their links.Section "Graph structure" defines such a directed graph structure.On the other hand, scaffolding the repeats requires choosing pairs of contigs occurring several times in the oriented contig sequence.Section "Repeated fragment sets" defines the sets of such repeat contig candidates.

Graph structure
Here we describe a directed graph suitable for further algorithms and the mathematical formulation of the scaffolding problems from Definitions 10, 11, 12.
Definition 15 (Multiplied Doubled Contig Graph-MDCG ) Given a set of contigs C , their multiplicities and the link set L , the multiplied doubled contig graph MDCG = (V , E, vwex) is defined such that: is the set of all the forward an reverse occurrences of all the contigs ( |V | = 2 c∈C mult(c) ).The vertices are asso- ciated with four functions: Figure 5 illustrates the MDCG representing the exam- ple data given in Table 1.

Repeated fragment sets
A repeat is a couple containing two identical (or reverse for IR) sequences of oriented contigs (Definitions 3 and 4).Therefore, a repeat consists of couples containing two identical (or reverse) contigs.In the context of MDCG , this leads to the concept of repeated fragments.Definition 16 A repeated fragment is an unordered pair of vertices such that one of the corresponding oriented contig belongs to the first region of a repeat, while the other belongs to the second region.The vertices are associated with the same contig but their occurrences differ, i.e. for each repeat (r i , r j ) , ∃u, v ∈ V , c ∈ C where contig(u) = contig(v) = c and vocc(u) = vocc(v) such that (c, vor(u)) ∈ r i and (c, vor(v)) ∈ r j .
For example, (c f ,0 , c r,1 ) is a repeated fragment for the IR in Fig. 5.We then precise the set of repeated fragments for each repeat type.Denote by R = c ∈ C | mult(c) > 1 the set of contigs candidate to be part of repeats.For the sake of clarity, for each vertex v ∈ V , we note ) .We also assume there is an arbitrary strict total order on C , i.e. ∀c, a repeated fragment, such that u and v have the same orientation.
Definition 18 An inverted fragment (u, v) ∈ V 2 is a repeated fragment, such that the orientations of u and v differ.
Figure 6 illustrates DirF and InvF sets.In addition, we add two functions to retrieve the repeated fragments from the vertices in �(1) : dirfrag : V ′ ⊂ V → DirF and invfrag : V ′ ⊂ V → InvF (abstracted with the repfrag function, c.f. Appendix A for their definitions).
Furthermore, Definitions 10 and 11 constrain the order between the repeated fragments.Hence, they respectively require to compare pairs of direct/inverted fragments, that must be defined: Each vertex is associated with an occurrence of an oriented contig, and each contig is represented by an even number of vertices.For example, vertex labelled c r,1 means that it comes from contig c = contig(c r,1 ) , in its reverse orientation ( vor(c r,1 ) = r ), and in its second occurrence ( vocc(c r,1 ) = 1 ).The colours are the same as the ones in Fig. 1a to relate the input data with the IR architecture.The bold red edges draw a circuit corresponding to an IR scaffolding where a f ,0 is the starter

Definition 19 (Set of pairs of direct fragments)
Definition 20 (Set of pairs of inverted fragments) Figure 6 illustrates PDirF and PInvF sets.The con- straints defining DirF , InvF , ADirF , AInvF , PDirF and PInvF are explained in details in Appendix B where we proof they are the smallest sets enabling to find all the distinct solutions.
Furthermore, a repeat is a couple of regions (Definition 5), themselves defined as oriented contig sequences (Definition 2).We need to define the edges connecting two repeated fragments.

Definition 21
An adjacent repeated fragment is an edge (u, v) ∈ E such that u and v participate in two dis- tinct repeated fragments.
Edges in ADirF and AInvF play the role of canoni- cal edges between two adjacent repeated fragments, see Fig. 7.In addition, we add two functions to retrieve the adjacent repeated fragments from the edges in �(1) : diradj : E → E and invadj : E → E (abstracted with the repadj function, c.f. Appendix A for their definitions).

Integer linear programming (ILP) formulation
Modelling DRP , IRP and SCP from Definitions 10, 11, 12 requires finding a valid circuit in MDCG.
Definition 24 (Valid circuit in MDCG ) Given a graph MDCG = (V , E) and a starting vertex s , where ctg s is the starting contig, or s = f and occ s = 0 .A circuit cp in MDCG is valid if: -It starts and ends with s; First we describe common constraint blocks in Sects."Circuit constraints", "Repeated regions constraints" and "Fixing regions constraints" for ILPs formulations, and then we give the DRP , IRP and SCP scaffolding prob- lems ILP in Sect."Scaffolding problems ILP".
Let M = c∈C mult(c) be a constant, N − v and N + v be the sets of predecessors and successors of vertex v ∈ V , respectively.

Circuit constraints
The following set of constraints defines a valid circuit of oriented contig in MDCG , and is defined with a flow for- mulation as in [1] instead of using Miller-Tucker-Zemlin constraints to avoid cycles [15].

Binary variables
-x e encodes if the edge e ∈ E participates in the cir- cuit.

Continuous variables
Although it is a continuous variable, it acts as a binary one as proven in [9].
-f e ∈ R ≥0 is the positive flow on the participating edge e ∈ E in the circuit (zero otherwise).
Constraint (1) defines the flow.The circuit must start and end with the starter in its forward orientation (Constraints (2,3,4,5).If a vertex participates, its reverse cannot (Constraint ( 6)).Defining a circuit is equivalent to requiring an edge to exit a vertex if it has an incoming one (Constraint ( 7)).Constraint (8) forces the flow to be monotonically increasing.This property avoids cycles.

Repeated regions constraints
The following constraints are general to define ILPs for DRP and IRP .Definitions 10 and 11 define the repeated regions according to the positions of the oriented contig in them.It follows that some order of the vertices in the pairs of repeated fragments are allowed, and some others are forbidden.We decide to write the (1) w∈N constraints for the forbidden cases because they are fewer than the allowed ones.To model the forbidden orders between 4 vertices, we compare the positions between two.Specifically for IRP , modelling the forbidden orders echoes the approach for the RNA folding problem [11], except that the positions of the RNA's nucleotides are known.
According to Definitions 10 and 11, and given PDirF and PInvF (Definitions 19 and 20), denote by ForbidDR and ForbidIR the sets of forbidden quartet vertices for the DRs and IRs, respectively: Figure 8 illustrates the authorised and forbidden positions for DRP and IRP.
To know if we are in the forbidden cases described in these two sets, we propose to compare the vertices two-bytwo.Denote by AlphaDR and AlphaIR the sets containing the couples of vertices to be compared to determine the forbidden cases respectively associated with ForbidDR and ForbidIR sets, such that: In the following, the sets of repeated fragments and these for the forbidden orders are abstracted to generalise DRP and IRP .Table 2 gives the correspondence of the sets depending on the problem to solve.

Binary variables
-m ij encodes if the repeated fragment (i, j) ∈ RepF is a part of a repeat.-isadj e encodes if two repeated fragments connected by the edge e ∈ ARepF (and repadj(e) ∈ E ) are adjacent in the circuit.-forbid ijkl encodes whether we are in the forbidden ver- tices order (i, j, k, l) ∈ Forbid.-α uv encodes whether the vertex u is before the ver- tex v in the circuit.Since (i, j), (k, l), (i, k), (j, l), (i, l), (j, k)

Continuous variables
pates in the circuit, and acts as a binary variable.-f e ∈ R ≥0 is the positive flow on the participating edge e ∈ E in the circuit (zero otherwise).We use the exiting flow to define the position pos The vertices of participating repeated fragments must be in the circuit (Constraints (9 and 10)).Constraints (11,12,13) implement with linear constraints the α uv definition.Constraints (14,15,16) implement the forbid ijkl definition.Constraints (17,18,19,20) imple- ment the isadj e definition.
CRepeat constraints (9) Add the set of constraints CCircuit

Fixing regions constraints
When repeats are previously scaffolded, the involved regions are fixed as input for the next problems.Let ADirF * , DirF * , AInvF * and InvF * respectively be the sets of (adjacent) direct and (adjacent) inverted fragments composing the direct and inverted repeats that have been scaffolded.

Speed-up constraints
Constraints (26 and 27) prevent the solver to loop on strictly equivalent solutions, e.g.solutions that differ according to a permutation of the occurrences.Denote by ConsOcc the set of occurrence-consecutive vertices, such that: Also, denote by ConsRepF the set of consecutive repeated fragments, such that:

Scaffolding problems ILP
Finally, it is possible to define the ILP formulations for the DRP , IRP and SCP scaffolding problems as an union of the constraints described before.
For DRP and IRP the ILP formulations are the same, and it is sufficient to choose the sets RepF , PRepF , ARepF , Alpha , Forbid and ConsRepF according to the repeats the problems scaffold.We aim to maximise the cumulative length of the minimum number of repeats.The objective value corresponds to: where Repeats is the set of repeats.

Hierarchical problem succession
Finally, here we describe how we combine the DRP , IRP and SCP scaffolding problems.As described in Definition 14, two problem combinations are opposed.The combinations are kept depending on the value of the problems' objective functions.
Note that 0 ≤ |S| ≤ 2 , and Definition 25 is stable for any problem with an objective value equal to zero.For example, Z * h 1 [1] = 0 means that there is no inverted repeat.For an easier interpretation of the architecture, we adopt a problem code combination, summarised in Table 3.

From an ILP solution to a genome structure
From a solution found by the best hierarchical problem succession, we extract the corresponding genome architecture.Let m, n ∈ N be the number of repeats (pair of regions) and the number of single-copies, respectively.A genome contains 2 m + n regions.Two items are suf- ficient to describe a genome with its regions: contains the forward regions, i.e. it is a ( m + n)-uplet of m + n oriented contig sequences.is a linearised circular sequence of oriented regions.For each i ∈ 0, 2 m + n , if ror i = f then SOR[i] represents the forward region COR[rid i ] , else if ror i = r then SOR[i] represents the reverse COR[rid i ] .Figure 9 illustrates the regions extracted by Algorithm 4 in the toy example.
Table 3 Problem code combinations The key idea of such an extraction algorithm is to start from the starting vertex s and walk over the chosen edges.During the walk, for each vertex we need to identify the type of its region, and then add the vertex to the corresponding region (to the corresponding region identifier rid r ).Algorithm 1 aims to determine the region type for a given vertex.The sequence of oriented regions SOR begins by the starter's region that is necessarily a singlecopy (because mult(contig(s)) = 1 ).The first oriented contig may not be the starter.Algorithm 2 gives the initial vertex associated to the first oriented contig of the starter's region.During the walk in the solution circuit from the initial vertex, a new region begins each time the region type changes.When the current vertex participates in a repeat, we must check if the next one participates in the same repeat, although the region type may not change (Algorithm 3).At the end, Algorithm 4 builds COR and SOR from an ILP solution.
To build the repeated regions, we use a First In First Out (FIFO) queue for the DRs, and a Last In First Out (LIFO) queue for the IRs.Each queue rep_queue , is asso- ciated with three methods: rep_queue.put(x)append x to the FIFO/LIFO; rep_queue.is_emptyreturns true if the queue is empty; rep_queue.peekreturns the first/last value in the FIFO/LIFO; rep_queue.popdeletes the first/last value in the FIFO/LIFO and returns it.
In the following, the given time complexities are under the assumption that the belonging test "is x ∈ X ?" for an object x in a set X is in �(1) .Algorithm 1 is in �(1) .Algorithm 2 is in O(|SC s |) , where |SC s | is the number of contigs composing the single-copy region that contains the starting vertex.Algorithm 3 is in �(1) , and so Algo- rithm 4 is in O(|V | + |E|).

Multiple genome forms
The sequence of oriented regions SOR represents one chloroplast genome form.Recall that especially with the LSC-IR-SSC-IR architecture (Fig. 1a), the SSC can be reversed during the DNA replication phase.In the following, our goal is to retrieve other forms from the one obtained by the hierarchical problem succession.
Towards this goal, we introduce a specific assembly graph: the region graph.The discovery of multiple genome forms is associated with the search of Eulerian circuits in this graph.Figure 10  Ereg is the multiset of links between two oriented regions in SOR (including between the last and the first regions).Note that ∀e ∈ Ereg , e ∈ Ereg denotes its reverse where Vreg} is the incident func- tion, such that for two consecutive oriented regions r i and r j in SOR (including the last and the first ones), ∃! e ∈ Ereg | �(e) = (vreg −1 (r i ), vreg −1 (r j )).
Figure 10 illustrates the region graph for the toy example's solution given in Fig. 9. Based on the above graph, we can find different sequences of oriented regions.Each region, independently of its orientation, in each sequence, must participate the same number of times.Proof Let RegGraph = (Vreg, Ereg, �) be a region graph.Denote by euc = (v 0 , v 1 , . . ., v 2 m+n−1 ) an Eulerian circuit in RegGraph .For each two consecutive oriented regions v i , v j in euc , there exists an edge e ∈ Ereg such that �(e) = (v i , v j ) .According to Definition 26, vreg(v i ) = r i and vreg(v j ) = r j are also consecutive in the oriented region sequence that has originally built RegGraph , oth- erwise in its reverse.Thus, according to Algorithm 4, there is an edge

Definition 27 (Eulerian circuit in
We can easily verify that the number of Eulerian circuits is bounded by O(2 m ′ ) , where m ′ ≤ m is the number of inverted repeats.Now, given a region graph RegGraph , finding all the Eulerian circuits is equivalent to retrieve all the possible chloroplast genome forms.Each Eulerian circuit traverses exactly the same regions, but not necessary in the same orientations.Figure 10 gives the resulting region graph obtained from the input data given in Table 1.
We accept all the Eulerian circuits, although they may contradict the repeated region interval constraints given in Definitions 10 and 11.For example, the oriented region sequence (0 f , 1 f , 2 f , 3 f , 1 f , 3 r ) respects the Fig. 10 The toy example's solution given in Fig. 9 results in the graph visualised in Fig. 5 will give a region graph that shows an LSC-IR-SSC-IR architecture as the one given in Fig. 1a.Two oriented region sequences (genome forms) are obtained by finding the eulerian circuits: They correspond to two structural haplotypes commonly found in the chloroplast cells, and described in Fig. 2. a Each arrow represents a region.Each link connects two arrows' extremity.Entering the tail/head and exiting the head/tail of an arrow corresponds to choosing the region in its forward/reverse orientation.b The same information is visualised.Each vertex is a region with a fixed orientation, and each edge connects two oriented regions definitions, where (1 f , 1 f ) is a DR and (3 f , 3 r ) is an IR.One of the Eulerian circuit produces the oriented region sequence (0 f , 1 f , 2 f , 3 f , 1 r , 3 r ) .Region 1 f now evolves in a new IR (1 f , 1 r ) .The order between the regions of the two IRs contradicts Definition 11.

N P-completeness
To prove the N P-completeness of one of the two hier- archical problem successions (decision version), it is sufficient to focus on only one scaffolding problem for the repeat.Here, we will focus on the decision version of IRP ( DIRP).

Definition 28 (IRP decision problem-DIRP )
Given a set of contigs C , their multiplicities, a set of links L , a starting contig s , two integers k, m ′ ∈ N , is there a valid sequence of oriented regions for IRP with ir∈IR replen(ir) ≥ k and |IR| ≤ m ′ , where IR is the solu- tion set of inverted repeats?Proposition 2 (DIRP is in N P ) Given the input of DIRP , a sequence of oriented regions SOR , the sequence of oriented contigs for each region COR , two integers k, m ′ ∈ N .There is a polynomial time algorithm that checks if the given solution is valid and if its accumulative repeat length equals at least k and the number of repeats equals at most m ′ .Proof Algorithm 5 verifies if the sequence of oriented regions is valid.It requires two traversals of SOR : (i) to iden- tify which regions in the sequence form IRs; (ii) to verify if the order between the IRs is valid (thanks to the use of a LIFO).It also checks if the associated accumulative repeat length equals at least k and the number of repeats equals at most m ′ .A LIFO ir_lifo is associated with four methods: ir_lifo.put(x)append x to the LIFO; It is straightforward to see that Algorithm 5 and Algorithm 6 are linear according the size of SOR and COR .Remember we assume that the belonging test "is x ∈ X ?" for an object x in a set X is in �(1) .Proof By reduction from the longest path decision problem from vertex s to vertex t ( LPST P ), known to be N P-Complete [19].Consider an instance I ∈ LPST P composed of a directed graph G = (V , E) , two vertices s, t ∈ V and an integer k ∈ N (the hypothetical number of vertices between s and t in the longest path).We shall build an instance transform function tf such that , and fix parameter m ′ = 1 .Figure 11 illustrates the transformation.
All four subgraphs G ′ or,i in Fig. 11b can be seen as copies of G V \{s,t} = V \{s, t}, E V \{s,t} (the subgraph induced by the vertex set V \{s, t} ) in Fig. 11a.For each or ∈ {f , r} , for each i ∈ {0, 1} , G ′ or,i = V ′ or,i , E ′ or,i such that: -There is a bijective function vtrans or,i : V \{s, t} ֒→ → V or,i , where ∀v ∈ V ′ or,i , vor(v) = or and vocc(v) = i; -There is a bijective function etrans or,i : There is a bijective function vsttrans st : {s, t} ֒→ → {(s f , s ′ f ), (t f , t r )} .There is a function esttrans : {(s, w) ∈ E} ∪{(u, t) ∈ E} → E ′ such that: It is straightforward to see that there exists an algorithm in O(|V | + |E|) that computes this transform function.The sets InvF , PInvF and AInvF are built based on G ′ or,i graphs.Inverted fragments are visualised by green dashed vertical lines in Fig. 11b, where As the adjacent inverted fragments associate only vertices in V f ,0 with those in V r,1 , the path that maximises the number of contiguous inverted fragments exits s f , goes through G ′ f ,0 to t f (or t r , it does not matter), and passes through G ′ r,1 to s ′ f .
Since G ′ f ,0 is a copy of G V \{s,t} , while G ′ r,1 is its reverse graph, there is a bijection between V ′ f ,0 and V ′ r,1 vertices sets.
The way G ′ is built implies only one IR is assembled (so parameter m ′ = 1 is respected).The length of this IR ( k ′ = 2k ) gives the hypothetical length of the longest path in LPST P.
To conclude, as there is a linear time complexity transform function tf such that I ∈ LPST P ⇐⇒ tf (I) ∈ DIRP , DIRP is N P-Hard.
From Propositions 2 and 3 we conclude that DIRP is N P-Complete.

Numerical results
We develop khloraascaf,2 a python package that computes the scaffolding of chloroplast contigs.It can either use Gurobi solver or CBC.All the following runs have been executed on a Linux laptop computer (32GB RAM, Intel ® Core ™ i7-10610U CPU @ 1.80GHz ×8 ).Each time, we select Gurobi solver.

Complexity validation on artificial data
khloraascaf is also accessible as an API, that permits in this section to study the combinatorial behaviour of IRP.
We demonstrate in Sect "N P-completeness" that DIRP is in the general case N P-Complete.Further- more, the heteroplasmy for the chloroplast genome is very often caused by the presence of an inverted repeat, that reverses the region(s) between it.
Thus, we artificially build a contig set with the associated attributes, and a link set, such that the genome architecture behind corresponds to the following circular sequence of oriented regions: SC1 − IR − SC2 − IR .In the following, we run what corresponds to IRP computation in khloraascaf on two types of growing generated data: perfect and noisy artificial ones.To emphasise the effect of the inverted repeats in the complexity of IRP , we fix the length of the single-copies, i.e. |SC1| = |SC2| = |SC| = 20 , and we incrementally raise the length of the inverted repeat |IR| = 20k for k ∈ 1, 10 .

Perfect artificial data
The data generated for this section correspond to the smallest set of contigs, links, and the smallest multiplicities to make sure that IRP is feasible.The multiplied doubled contig graph associated with these perfect data has exactly the same topology as the one illustrated in Fig. 5.For instance, testing perfect artificial data acts as a control for further tests.Table 4 gives some Gurobi metrics. 3bserve that the gap is equal to 0 % and the prob- lem is solved either during the presolving or the linear relaxation.Indeed, for the class of graph that contains only the perfect artificial data, there exist a polynomial algorithm.However, the relaxation time seems to fit an exponential distribution as well as for the B&B time, as shown in Fig. 12, even though the distributions should be treated with caution because of the limited number of points.

Noisy artificial data
Here we test the behaviour of the solver when we add noise to the perfect artificial data.In that case, for each generated contig, the multiplicity has 25 % chance to be overestimated by one (that increases InvF , PInvF and possibly AInvF sets).Similarly, for each contig, there is 25 % of chance to create a new link to another randomly chosen contig.This can generate more loops, and can increase the AInvF set. 4s expected, now the gap is not ensured to be null, and some instances are not solved at the presolving or at the linear relaxation steps.
These numerical results corroborate the N P-Complete demonstration for a more general class of graphs (Table 5).

Synthetic chloroplast input data
In this section, we aim to validate experimentally the relevance of our scaffolding problem definition by running khloraascaf on synthetic data.

Input data generation
Here we briefly describe our protocol for input data generation. 5200 chloroplast genomes (selected in CpGDB6 ) were downloaded from the NCBI 7 .For each of them, a set of reads was generated.The contigs were generated with Minia [6], a de Bruijn graph assembly approach.The links correspond to k-mer paths in the resulting com- pacted de Bruijn graph (cDBG) that connect two contigs.Finally, 31 instances were selected for which various difficulties have been detected, e.g.extra-links in the link set or combination of repeats.The starter is the contig for which the matK gene, usually found in a single-copy region, maps on.
To obtain the multiplicity of a given contig c , we sum the length of the alignments of the reads mapping on c (we denote by MapR c the set containing the reads that map on c ).This sum is defined as the coverage cov c of the contig c by the reads: Where align read c is the sequence representing the alignment of the read read on the contig c .Its length equals the number of nucleotides of read that match on c (iden- tity or substitution).Then the multiplicity mult(c) of c is obtained by normalising its coverage cov c by this of the starter s , cov s : mult(c) = max 1, cov c /cov s − 0.1 .As the multiplicity is an upper-bound of the usage of contig, we prefer to round up the normalisation only if the decimal part is greater than 0.1.
The existence-weight wex(c) for a contig c is computed by counting the number of nucleotides of c that are cov- ered by at least a gene of protein from a chloroplast nearspecies, normalised by the length of c.

The evaluation's metrics
For each synthetic instance, we know the sequence of the oriented contigs and the sequence of the oriented regions we search for.In the sequel we test our scaffolding approach and evaluate the obtained region graph.For each instance, for each optimisation problem combination, we provide the following metrics: -The total number of eulerian circuits in the region graph (genome forms); -How many of them coincide with the sequence of oriented contigs we search for; -How many of them coincide with the sequence of oriented regions we search for.
Evaluating the sequence of the oriented contigs is stringent.On the other hand, although a result can be evaluated as a false one, we can still retrieve the sequence   of the chloroplast genome by applying an alternative sequence.As a consequence, for each instance we use Quast [10] to evaluate the sequences associated to each genome form.As the genome reference is known, Quast tries to find the minimum number of differences (relocation, inversion, indels) between the reference and the sequence we provide.Three metrics are chosen to evaluate the best genome form for each problem succession: -The genome fraction of the reference; -The number of misassemblies; -The number of local misassemblies.
For more detailed descriptions of Quast metrics, you can refer to Appendix C.1.
It would be expected that Quast metrics illustrate wrong assembly for the instances for which the sequence

Table 6 Sequence and Quast metrics for the initial synthetic data version
For each instance in the column Instance: ILPs provides the optimal (at most two) hierarchical problem successions; Total reports the number of eulerian circuits in the region graph (genome forms); SOC is the number of oriented contig sequences that equal to the reference oriented contig sequence; SOR is the number of oriented region sequences in bijection with the reference oriented region sequence; %gnm is the genome fraction of the best sequence produced by one of the genome forms; #mis are the number of misassemblies (left) and of local misassemblies (right) of the contigs, and a fortiori this of the regions, are not retrieved.Analogously, the instance that truthfully retrieves the sequences would have good Quast metric.However, these assertions may be contradicted because of the contig and the link sequences generation.In Sect."Initial version", we provide the two metric sets when khloraascaf is applied to the original synthetic data, while in Sect."Modified version" the metrics are reported for a subset of modified synthetic data.

Initial version
Table 6 provides all the metrics defined above for the 31 instances.The instances are solved very quickly (solver times < 4.5 sec., Table 9).khloraascaf successfully founds the sequence of the oriented contigs in 20 of them and retrieves the sequence of the oriented regions in 28 of them. 8Three categories of failures are identified.
Wrong starting contig and multiplicities estimations This is the category for which our approach is dependent and sensible.In the presented version, we have used a given starting contig, and a wrong one can lead to reduce all the multiplicities.This is the case for Begonia_pulchrifolia and Lamprocapnos_spectabilis for which the starters are contigs that normally participate into an IR, that contradicts our assumption that the starter participates in an SC.
Independently of a right starting contig, the multiplicity computation is sensible from the noise on the contig coverage by the reads.Agathis_dammara and Pel-argonium_nanum both suffer from only one contig under-estimated multiplicity.
IRP 's objective: maximising the cumulative length of the minimum number of repeats The sequence of oriented regions for the Cucumix_hystrix's reference contains the following sub-sequences: . . ., IR, IR ′ , . . ., IR ′ , SC, IR .As IRP also aims to mini- mise the number of repeats, it results in the merge of IR − IR ′ and consequently does not retrieve SC , normally inserted between IR ′ and IR.
Model's robustness While the sequences of oriented contigs for the repeats are retrieved, the ones of the single-copies suffer from extra-links combined with low, sometimes null, weights.On the one hand, in case of null weights, the circuits in Lathyrus_pubescens and Triosteum_pinnatifidum do not pass through contigs that must participate in the SCs.On the other hand, Podocarpus_totara possesses two objective-equivalent subpaths in an SC.Surprisingly, our tool khloraascaf reversed some subparts of single-copies.This is due to extra-links, but they are specifically caused by the existence of very short IRs hidden in them (remember that the links correspond to paths between pairs of contigs in the cDBG).This behaviour is observed in Carpodetus_serratus, Jas-minum_tortuosum and Lophocereus_schottii.
Nucleotide sequences misassemblies To avoid confusion, we always use nucl.seq. to denote "nucleotide sequence" to contrast with sequences of contigs/regions.In the following we analyse the instances presenting an unexpected behaviour for the Quast metrics, regarding if the sequences are found.Appendix Table 8 permits verifying if the contigs that participate in the sequence have been correctly assembled.Except for Lathyrus_pubescens where one local misassembly is due to the missing contig in the sequence, all the (local) missassemblies in Commiphora_foliacea, Eucom-mia_ulmoides, Juniperus_scopulorum, Musa_ornata, Scia-dopitys_verticillata, Taxus_baccata, Welwitschia_mirabilis provided by Quast are found in the nucl.seq. of links.

Modified version
In this section, we present a manually changed synthetic data version to succeed the scaffolding.The goal is to precisely evaluate the robustness of khloraascaf.We detail the modifications bellow9 : Agathis_dammara The multiplicity of contig 1 is raised to be equal to 3. Begonia_pulchrifolia Contig 4 becomes the starter.So according to the multiplicity computation described in Sect."Input data generation", the multiplicities of contigs 1 to 5 become 1, while this one of contig 0 raises to 2. Carpodetus_serratus Link (10 r , 11 f ) is deleted.Jasminum_tortuosum Link (6 f , 4 f ) is added.Lamprocapnos_spectabilis Contig 8 becomes the starter.So the multiplicities of contigs 2 , 4 , 6 , 10 and 11 increase by one, while the ones of contigs 0 , 1 and 3 increase by two.Lathyrus_pubescens The weight of contig 12 raises to 0.01.Lophocereus_schottii Link (10 f , 3 f ) is deleted.Pelargonium_nanum The multiplicity of contig 2 raises from 3 to 4, without respecting the computa-tion of the multiplicity described in Sect."Input data generation".Podocarpus_totara Link (6 f , 11 r ) is deleted.Triosteum_pinnatifidum The weight of contig 7 raises to 0.01.
Table 7 provides all the metrics obtained by running khloraascaf.The instances are also solved very quickly (solver times < 3 sec.,Table 10).It truthfully finds all the sequences for the modified synthetic data except for Agathis_dammara instance.Although all the repeats (both the direct and the inverted ones) have been retrieved, one of the single-copy region has not been found.It can be explained by extra-links that create alternative paths with the same optimal value for SCP .Note that the oriented region sequence has been still retrieved.
All the (local) misassemblies provided by Quast for Carpodetus_serratus, Lathyrus_pubescens, Pelargonium_ nanum and Triosteum_pinnatifidum just concern the nucl.seq. of links that is not a khloraascaf issue. 10

Conclusion
While the scaffolding problem is traditionally defined with distances data between the contigs, we show it is possible to avoid them in the case of the well-studied circular chloroplast genomes.Based on their specificities, we provide a new scaffolding formulation focused on revealing structural haplotypes.
Under the assumption that chloroplast genomes possess few repeats, we formalise their architectures as combinations of direct and inverted repeats, joined by single-copies, where the repeats are couples of identical (or reversed) nucleotide sequences.We tackle the chloroplast genome scaffolding as a discrete optimisation problem that yields three suboptimisation ones.We split the inherent multi-objective problem into one optimisation problem per region type.As a consequence, it is necessary to choose the order of subproblem resolutions as a function of the results of previously solved problems.This is what has been addressed through the hierarchical combination strategy.We model each subproblem with an ILP.
As our dedicated chloroplast scaffolding definition is a region-scaffolding-driven, the region graph is a natural data structure to reveal distinct genome forms that can coexist in a same cell.Indeed, particularly due to an IR flip-flop mechanism, regions between the IRs can be reversed during the genome replication process.
Moreover, we prove the decision version of the Chloroplast Scaffolding Problem ( CHSP ) to be N P-Complete in the general case, even though numerical results on perfect artificial data suggest there is a class of MDCG graphs where the problem is in P .Without surprise, the noisy artificial data benchmark confirms the theoretical complexity.
We have implemented our approach and the ILP formulations in a Python3 package, khloraascaf, 11 that we test on synthetic chloroplast contigs and links data.When the input data permit finding the solution, khloraascaf successfully retrieves the genome forms.Even if the decision problem is N P-Complete, the small size of the input data enables to quickly solve optimally CHSP.
Our results show that the scaffolding-repeat problem formulations DRP and IRP seem to be sufficient to scaffold the repeats.This tends to validate our assumptions on the small number of repeats, and especially on the sufficiency of defining the repeats as pairs of equal (reversed) nucleotide sequences.

Discussion and perspectives
While our scaffolding problem formulation seems to be sufficient to retrieve the repeats, it seems it is not fully suitable for single-copies.If we have applied the maximum-weighted circuit problem to scaffold the singlecopies after having scaffolded the repeats, it was only with the intention of staying in the context of global optimisation.On the one hand, having weights on links may have been more appropriate than just considering weights on the contigs: in some sense, that is the purpose of distances.On the other hand, khloraascaf could initially scaffold the repeats and then propose several solutions to link them, e.g.scored by the weights.
Concerning the tests on synthetic data, we should use a more traditional assembly graph input: in fact, as revealed by comparing the khloraascaf results with the reference genomes, the used link generation suffers from local, or worse, global, misassemblies.As next step, we plan to inject khloraascaf into a state-of-the-art chloroplast genome pipeline, like GetOrganelle, and substitute what can be identified as the scaffolding part by our method.Hence, we should be able to relevantly compare khloraascaf approach with the state-of-the-art.
khloraascaf is sensible to the contig multiplicity computation.For now, a contig multiplicity is obtained by normalising its coverage by this of the starter.A better strategy may be to choose the smallest coverage for the normalisation as we expect the multiplicities to be upperbounds of the contigs use.
Another issue concerns the choice of the starter: while it depends on the result of the mapping of matK gene map on the contigs, DRP , IRP and SCP problems may be adapted for a set of candidate for the starter.
To generalise CHSP on non-equally (reversed) pair of regions for the repeats, two combining ideas are proposed: on the one hand, we can add pairs of contigs to the repeated fragment sets from the user-input.On the other hand, CHSP should be able to handle the case when a sin- gle-copy region is in only one of the regions of a repeat: for now, the contiguity constraint and the objective exclude this case.The contiguity constraint can be adapted to accept only one contiguous region out of the two.
From a user-case perspective, the region graph data structure can be used to determine what genome forms are present in the read set, and in which proportion.Indeed, as the region graph explicitly describes the junctions between the regions, especially between the inverted repeats, one may extract the nucleotide sequences of these junctions to answer the existence of the forms in the read set.

Reduction of the repeated fragment sets
In this section, we give prove the minimality and the sufficiency of the repeated fragment sets ( RepF , PRepF , ARepF ) definitions.

B.1 Repeated fragment set reductions
Here we consider the reduction operations for DirF and InvF , and conclude on their minimality.Only two vertices i, j ∈ V with the same identifier, i.e. ctg i = ctg j can form a repeated fragment.The relative orientations between i and j is determined according to the type of the repeat the fragment is associated to.The occurrences of the two vertices in a repeated fragment must differ.Two combinatorial reductions are necessary-commutative and pairing reductions: i. commutative reduction means that ∀(i, j) ∈ RepF , (j, i) / ∈ RepF; ii.pairing reduction means that it is not necessary to pair all the occurrences cross all the occurrences, and we can group by consecutive occurrences without any intersection.

Reminder of the DirF definition:
Proposition 4 (Commutative reduction for DirF) The direct fragment set contains a sufficient number of direct fragments to form all the solutions.

Proof
The idea is to find the minimum occurrence difference between two direct fragments such that they can both be chosen to form DRs. Let (i, j) ∈ DirF be a direct fragment and sup- pose that it is chosen to participate into a DR.Let k, l ∈ V | ctg k = ctg l = ctg i = ctg j ∧ or k = or l = or i = or j be a candidate for the next direct fragment: -the occurrences of k and l must differ from the ones of i and j , as a vertex can occupy at most one posi- tion, so occ j < occ k ; -according to Proposition 4, occ k < occ l and since occurrences are integers, occ k = occ l − 1.
The smallest possible values for occ k and occ l equal occ i + 2 and occ j + 2 .

Proposition 6
Direct fragment set DirF is a minimal set for DRP.
Proof Firstly, we must keep the direct fragment for both the forward ( or i = or j = 0 ) and the reverse orienta- tions ( or i = or j = 1 ) as the scaffolding problem aims to order and orientate the contigs, so by definition we cannot determine which orientation will participate in the order.Secondly, making step bigger than 2k prevents to

Proof
The idea is to find the minimum occurrence difference between two inverted fragments such that they can both be chosen to form IRs. Let (i, j) ∈ InvF be an inverted fragment and sup- pose that it is chosen to participate into an IR.Let k, l ∈ V | ctg k = ctg l = ctg i = ctg j ∧ or k � = or l be a candi- date for the next direct fragment: -to respect the property given by Proposition 7, or k = 1 − or l = 0 -the occurrences of k and l must differ from the ones of i and j when their orientations are matching, as a vertex can occupy at most one position.Furthermore, for a given occurrence and a given identifier, the two orientations are mutually exclusive, so occ j < occ k ; according to Proposition 7, occ k < occ l and since occurrences are integers, occ k = occ l − 1.
The smallest possible values for occ k and occ l equal occ i + 2 and occ j + 2 .

Proposition 9
Inverted fragment set InvF is a minimal set for IRP.

Proof
Step bigger than 2k prevents to use all the occur- rences of a contig, so it contradicts the contig multiplicity definition.

B.2 Pairs of repeated fragment set reductions
Here we consider the reduction operations for PDirF and PInvF , and conclude on their minimality.Just the com- mutative reduction is necessary.Reminder of pairs of direct fragments set definition: Proposition 10 (Commutative reduction for PDirF) Proof Let ((i, j), (k, l)) ∈ PDirF: -if ctg j < ctg k thus ctg l > ctg i so ((k, l), (i, j)) / ∈ PDirF -else ctg j = ctg k ∧ occ j < occ k thus occ l > occ i so ((k, l), (i, j)) / ∈ PDirF Proposition 11 Pair of direct fragment set PDirF is a minimal set for DRP.
Let build a counter-example: suppose that we have the order i k l j m n .This implies the direct fragments p and q fall into a forbidden case for DRP , but this is not detected by Constraints C14 and C15 because (p, q) / ∈ PDirF ′ .According to Constraint C16, the direct fragments p and q 2 can both match, and direct fragments q and q 2 too.So m p = m q = m q 2 = 1 : absurd because p and q are nested so m p and m q should be equal to 0 .

Reminder of pairs of inverted fragments set definition:
Proof The reciprocal are trivial as ADirF ⊂ E. All the edges using only i , i , k or k belong to ADirF as all the vertex' occurrences are even.While all the edges using j , j , l or l cannot belong to ADirF as one of the ver- tex' occurrence is odd.
Proposition 15 (Minimum ADirF set for different identifiers) When the vertex' identifiers are different, it is not possible to reduce the number of canonical edges based on occurrences comparisons.
Proof In order to reduce ADirF when the vertex' iden- tifiers are different, we have to be more restrictive on the occurrences constraints.To reduce the loops on k and k ′ , we have to find an order between the occurrences of u and v.However, for each order relation between the occurrences ( occ u ≤ occ v and occ u ≥ occ v ), we can find two associated counter-examples.For the sake of clarity, we will find a counter-example for the case occ u ≤ occ v .The reasoning is analogous for the second case.
Let a , b and c be three contigs in C such that all their iden- tifiers are different, mult a = 2 , mult b = mult(c) = 1 , and Let fa 1 = (i, j) , fa 2 = (i ′ , j ′ ) be the two direct frag- ments associated to a and fb = (k, l) , fc = (m, n) the two respectively associated to b and c .The idea behind the proof is to know if the adjacent direct fragments fa 1 fb fa 2 fc can participate into a solution path Path .In such a case, Path is sub-defined by edges i → k → i ′ → m (and j → l → j ′ → n ).Remark that k can permute with l , and m with n.
According to ADirF set definition, (i, k) ∈ ADirF , (k, i ′ ) ∈ ADirF and (i ′ , m) ∈ ADirF .If we constrain the occurrences by the relation occ u ≤ occ v , it implies that: As fa 1 = fa 2 , occ i = occ i ′ − 1 : if occ i < occ i ′ − 1 then when we focus on the first edges path (in the "best" case n permutes with m ) occ k (= 0) ≤ occ i ′ (= 2) ≤ occ n (= 1) , so it is absurd; else occ i > occ i ′ − 1 then when we focus on the first edges path, you can permute i ′ and i and you fall into the same absurd implication as above.
Proposition 16 (Occurences reduction for equal identifiers in ADirF ) There exists an order between the occur- rences of the vertices of an edge between two direct fragments that does not reduce the number of feasible and distinct solutions when the vertex' identifiers are equal.
be same identifiers vertices participating in a solution path for IRP , in this order (i.e.∀k ∈ 1, n , u k is before u k+1 in the solu- tion path).
It can be proven 12 that: regardless the equation set of k − 1 order relation R k,k+1 ∈ {<, >} between two verti- ces ctg uk and ctg uk+1 (i.e.ctg uk R k,k+1 ctg uk+1 , ∀k ∈ 1, n ), there exists a permutation between all distinct occurrences of u such that all the relations R k,k+1 are satis- fied.Thus, let defined the following equation set (used in ADirF): Have we got enough edges to build a path with n 2 adjacent direct fragments?On the one hand, the maximum number of edges we would need is: 2 mult u 2 − 1 (the 2 is caused by direct fragments).On the other hand, ADirF provides: edges, (the 2 is caused by diradj function).
Proof In order to reduce AInvF when the vertex' iden- tifiers are different, we have to be more restrictive on the occurrences constraints.To reduce the loops on k and k ′ , we have to find an order between the occurrences of u and v.However, for each order relation between the occurrences ( occ u ≤ occ v and occ u ≥ occ v ), we can find two associated counter-examples.For the sake of clarity, we will find a counter-example for the case occ u ≤ occ v .The reasoning is analogous for the second case.
Let c and d be two contigs in C such that ctg c < ctg d , mult(c) = 4 and mult d = 2 , and (c f , d f ) ∈ L and (d f , c f ) ∈ L .Let p 1 = (i, j) , p 2 = (i ′ , j ′ ) be the two inverted fragments associated to c and q = (k, l) the one associated to d .The idea behind the proof is to know if the adjacent inverted fragments p 1 q p 2 can participate into a solution path Path .In such a case, Path is sub- defined by edges i → k → i ′ and j ′ → l → j.
Proposition 20 (Occurences reduction for equal identifiers in AInvF ) There exists an order between the occur- rences of the vertices of an edge between two inverted fragments that does not reduce the number of feasible and distinct solutions when the vertex' identifiers are equal.
Proof Let u 1 , u 2 , . . ., u n , n = 2 mult u 2 be same identifiers vertices participating in a solution path for IRP , in this order (i.e.∀k ∈ 1, n , u k is before u k+1 in the solu- tion path).
It can be proven 13 that: regardless the equation set of k − 1 order relation R k,k+1 ∈ {<, >} between two vertices ctg uk and ctg uk+1 (i.e.ctg uk R k,k+1 ctg uk+1 , ∀k ∈ 1, n ), there exists a permutation between all distinct occurrences of u such that all the relations R k,k+1 are satisfied.Thus, let define the following equation set (used in AInvF): Have we got enough edges to build a path with n 2 adjacent inverted fragments?On the one hand, the maximum number of edges we would need is: 2 mult u 2 − 1 (the 2 is caused by inverted fragments).On the other hand, AInvF provides: edges, (the 2 is caused by invadj function).
Proposition 21 (Canonical reductions for AInvF for equal identifiers) AInvF set contains only one of the four edges that exist between two adjacent inverted fragments when the vertex' identifiers are different, i.e. ∀(i, j) ∈ InvF , ∀(k, l) ∈ InvF | ctg i = ctg k ∧ occ i < occ k : Nota bene (k, i) ∈ E case is not here because (i, k) does not change the solution as their sequences are equal.This non-redundancy advantage is allowed by occurrence reduction in Proposition 20.

2
the link set.The nature of the double-strands DNA requires that ∀(c, d) ∈ L, (d, c) ∈ L , where c and d denote the oriented contigs c and d in their reverse orientation, respectively (note that c = c ).The links between two oriented contigs c and d are valid for

Fig.
Fig.3 With time, a repeat may degrade so that its occurrences differ.a and c show two common quadripartite structures with an IR and a DR, respectively.b and d highlight the impact of a degeneration on their structures.We give the bellow region orders according to the LSC arrow's direction.In b, the degeneration results in two IRs: IR 1 = (i, j) and IR 2 = (k, l) , such that i is before k and l precedes j .In d it results in two DRs: DR 1 = (i, j) and IR 2 = (k, l) , such that i is before k and j precedes l

Fig. 4
Fig. 4 Chloroplast repeat scaffolding.Each subfigure is a common chloroplast genome structure with its associated order of oriented contigs (coloured arrows).The green and the blue sequences of arrows are IR and DR, respectively.The purple and the red ones are single-copy regions.Contig s is the starter, and the right side black arrow determines the contigs' order.Contigs a 0 , a 1 , b 0 , b 1 , c 0 , c 1 and d 0 , d 1 are two occurrences of four contigs a , b , c and d , respectively.Each coloured dashed line links two occurrences of the same contig.a The order of the occurrences is reversed, and their arrows are oppositely oriented.Visually, an IR produces parallel dashed lines.b The order and the orientation of the occurrences is preserved, revealing a DR.c A chloroplast genome can contain the two types of repeats.Here, we will retain the hierarchical problem succession IRP-DRP-SCP ( h 2 ) since the IR contains more contigs than the DR

Fig. 8
Fig. 8 Non-exhaustive illustrations for authorised and forbidden order cases for two repeated fragments ((i, j), (k, l)) ∈ PRepF .a and b Authorised and forbidden orders for PDirF ; c and d Authorised and forbidden orders for PInvF Traditionally, SCP finds the maximum weighted circuit.SCP model Both for DRP and IRP , the number of variables and constraints are in O(|V | 2 + |E|) .The number of variables and constraints for SCP are in O(|V | + |E|).

Algorithm 1 Algorithm 2 Algorithm 3
Get the region type for a given vertex Get the initial vertex of the single-copy containing the starting vertex Is the repeat contiguous?

Proposition 1 (
RegGraph ) A cir- cuit in RegGraph = (Vreg, Ereg, �) is defined as Eulerian when: -It begins from and ends with vertex 0 f representing the region containing the starter in forward orientation, i.e. vreg(0 f ) = COR[0]; -It passes through exactly one of the two versions of each edge ( e ∈ Ereg otherwise e ∈ Ereg).An eulerian circuit is a valid oriented contig sequence for RegGraph ) An Eulerian circuit in RegGraph = (Vreg, Ereg, �) provides a valid sequence of oriented contigs (Definition 1).

Algorithm 5 5 Algorithm 6
Verify the validity of the sequence of oriented regions Epain and Andonov Algorithms for Molecular Biology (2024) 19:Verify the validity of the sequence of oriented contigs Proposition 3 DIRP is N P-Hard.

Fig. 11
Fig.11From a digraph G for LPSTP to a digraph G ′ for IRP .Bold red edges in both sub-figures correspond to the solution path for LPSTP and IRP problems, respectively.a G V \{s,t} is the subgraph induced by the vertex set V \{s, t} .As the longest path exits s and enters t , dashed edges do not participate in the solution.b Green dashed line between vertices in G ′ f 0 and G ′ r1 visualise the inverted fragments

Fig. 12
Fig. 12 Solver running time distributions for perfect artificial data.Points are measured times, the red curves correspond to the best ae bx function applied on IR length axis

Table 1
Toy example of input dataLeft: set of contigs C .Right: set of links L .For the sake of space, for no one of the links in the table, its reverse is given, although it belongs to L

Table 2
ILP sets and functions corresponding table Definition 25 (Hierarchical problem succession solutions) Denote by h 1 , h 2 the two hierarchical problem suc- cessions DRP-IRP-SCP and IRP-DRP-SCP .For each h ∈ h 1 , h 2 , denote by Z * h ∈ R 3 ≥0 the vector containing the values of the objective functions for each problem in the order of the problem succession corresponding to h.
ir_lifo.is_empty returns true if the LIFO is empty; SOC from the sequence of oriented regions SOR and the sequence of each forward regions COR .It traverses the oriented contigs in SOC , verify if each two consecutive oriented contigs are linked in the link set, and count the number of times a contig occur.At the end, it checks if each contig does not appear more than its multiplicity number of times.

Table 4
Gurobi solver metrics on perfect artificial growing data |V| , |E| , |SC| , |IR| and |L| respectively stand for the number of vertices, edges, contigs in each single-copy region, contigs in each region of the inverted repeat, and links; Time: the presolve time plus the relaxation time (above) and the B&B time (below); Opt.: the linear relaxation bound UB (above) and the integer optimal value Opt (below); % Gap: the MIP gap equals 100 × (UB−Opt) UB ; Nodes: number of explored B&B nodes; Iter.: number of iterations for the LP relaxation (above) and for the B&B phase (below)

Table 5
Gurobi solver metrics on noisy artificial growing data The column descriptions are the same as the one in Table4.Except that because of the noise on multiplicities, the sum of contig multiplicity for each region can change: this is the value below the number of contig in columns |SC1| , |SC2| and |IR| .Similarly, because of the noise on the number of links, the below value for L corresponds to the number of noisy links

Table 7
Sequence and Quast metrics for the modified synthetic data versionThe caption is the same as in Table6 Proof Let (i, j) ∈ InvF .occ j = occ i + 1 so (j, i) / ∈ InvF Proposition 8 (Pairing reduction for InvF ) The inverted fragment set contains a sufficient number of inverted fragments to form all the solutions.