Co-linear chaining on pangenome graphs

Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. The seed-chain-extend based alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width and how incorporating gap cost in the scoring function improves alignment accuracy. However, it remains open on how to effectively generalize these techniques for general pangenome graphs which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy. Implementation (https://github.com/at-cg/PanAligner).


Introduction
Graph-based representation of genome sequences has emerged as a prominent data structure in genomics, offering a powerful means to represent the genetic variation within a species across multiple individuals [1][2][3][4][5][6][7].A pangenome graph can be represented as a directed graph G(V, E) such that vertices are labeled by characters (or strings) from the alphabet {A,C,G,T}.The topology of the graph is determined by the count and the type of variants included in the graph.For example, inversions, duplications, or copy number variation are best represented as cycles in a pangenome graph [4,5,[8][9][10].As a result, the draft pangenome graphs published by the Human Pangenome Reference Consortium [4] and the Chinese Pangenome Consortium [11] are also cyclic.Aligning reads or assembly contigs to a directed labeled graph is a fundamental problem in computational pangenomics [12,13].Aligning reads to graphs is also useful for other bioinformatics tasks such as long-read de novo assembly [14][15][16] and long-read error correction [17,18].
Formally, the sequence-to-graph alignment problem seeks a walk in the graph that spells a sequence with minimum edit distance from the input sequence.O(|Q||E|) time alignment algorithms for this problem are already known, where Q is the query sequence [19,20].The known conditional lower bound [21] implies that an exact algorithm significantly faster than O(|Q||E|) is unlikely.This lower bound also holds for de Bruijn graphs [22].Therefore, fast heuristics are used to process highthroughput sequencing data.
Seed-chain-extend is a common heuristic used by modern alignment tools [23][24][25].This is a three-step process.First, the seeding stage involves computing exact seed matches, such as k-mer matches, between a query sequence and a reference.These matches are referred to as anchors.The presence of repetitive sequences in genomes often leads to a large number of false positive anchors.Subsequently, the chaining stage is employed to link the subsets of anchors in a coherent manner while optimizing specific criteria (Fig. 1).This procedure also eliminates the false positive anchors.Finally, the extend stage returns a base-to-base alignment along the selected anchors.Efficient generalization of the three stages to pangenome graphs is an active research topic [13].Many sequence-to-graph aligners already exist that differ in terms of implementing these stages [5,[26][27][28][29][30].This paper addresses the generalization of the chaining stage to cyclic pangenome graphs (Figs. 2, 3).

Related work
Co-linear chaining is a mathematically rigorous method to filter anchors after the seeding stage.It has been wellstudied for the sequence-to-sequence alignment case [31][32][33][34][35][36][37].The input to the chaining problem is a set of N weighted anchors.An anchor can be denoted as a pair of intervals in the two sequences corresponding to the exact seed match.A chain is an ordered subset of anchors whose intervals must appear in increasing order in both sequences.The co-linear chaining problem seeks the chain with the highest score, where the score of a chain is calculated by summing the weights of the anchors in the chain and subtracting the penalty for gaps between adjacent anchors.The problem is solvable in O(N log N ) time [31].
The first effort to generalize the co-linear chaining problem to graphs was made by Makinen et al. [38].They addressed the co-linear chaining problem on directed acyclic graphs (DAGs).The authors introduced a sparse dynamic programming algorithm whose Fig. 1 An illustration of co-linear chaining for sequence-to-graph alignment.Assume that the vertices of the graph are labeled with nucleotide sequences.In panel a, short exact matches, i.e., anchors, are illustrated using red blocks joined by dotted lines.In panel b, the anchors corresponding to the best-scoring chain are retained, and the rest are removed.The retained anchors are combined to produce an alignment of the query sequence to the graph (illustrated using a green curved line) Fig. 2 An example illustrating a graph, a query sequence, and multiple anchors as input for co-linear chaining.The sequence of anchors (M [1], M [2], M [4], M [5], M [7], M [8]) forms a valid chain that visits vertex v 4 twice due to a cycle in the graph.The coordinates associated with anchor M [8] are also highlighted as an example runtime complexity is parameterized in terms of the width of the DAG.The width of a DAG is defined as the minimum number of paths in the DAG such that each vertex is included in at least one path.Parameterizing the complexity in terms of the width is helpful because pangenome graphs typically have small width in practice [26,29,38].An optimized version of their algorithm requires O(KN log KN ) time for chaining, where K is the width of the DAG [29].This formulation has been further extended to incorporate gap cost in the scoring function [26], and for solving the longest common subsequence problem between a DAG and a sequence [39].
There is limited work on formulating and solving the co-linear chaining problem for general pangenome graphs which might contain cycles.One way to address this was discussed in [29, Appendix section], but the proposed formulation is oblivious to the coordinates of anchors that lie in a strongly connected component of the graph.Their algorithm works by shrinking every strongly connected component into a single vertex and applying the same algorithm developed for DAGs.With this approach, the high-scoring anchor chains in cyclic regions of the graph may result in low-quality alignments.

Contributions
In this paper, we build on top of the algorithmic techniques developed for DAGs [26,29,38] and propose novel formulations for cyclic pangenome graphs.Our proposed algorithm exploits the small width of pangenome graphs similar to [38].Our approach for defining the gap cost between a pair of anchors is inspired by the corresponding function defined on DAGs [26].
We address the following three challenges that arise on cyclic pangenome graphs.First, the dynamic programming-based chaining algorithms developed for DAGs exploit the topological ordering of vertices [26,29,38], but such an ordering is not available in cyclic graphs.Second, computing the width and a minimum path cover can be solved in polynomial time for DAGs but is NPhard for general directed graphs [40].Third, the walk corresponding to the optimal sequence-to-graph alignment can traverse a vertex multiple times if there are cycles.Accordingly, a chain of anchors should be allowed to loop through vertices.Our proposed problem formulation and the proposed algorithm address the above challenges.
Our approach involves computing a path cover P of the input graph during preprocessing, fol- lowed by chaining of anchors using iterative algorithms.Let Ŵ c , Ŵ l , Ŵ d be the parameters that specify the count of iterations used in our algorithms (formally defined later).Our chaining algorithm solves the stated objective in O(Ŵ c |P|N log N + |P|N log |P|N ) time after a one-time preprocessing of the graph in O((Ŵ l + Ŵ d + log |V |)|P||E|) time.We will show that parameters Ŵ c , Ŵ l , Ŵ d are small in practice to justify the practicality of this approach.The runtime complexity also depends on |P| , which is determined by our path cover finding heuristic.We show that the number of paths in our path cover is small and near-optimal in practice.
We implemented the proposed chaining algorithm as an open-source software PanAligner.We designed PanAligner as an end-to-end sequence-to-graph aligner using seeding and alignment code from Minigraph [28].We evaluated the scalability and alignment accuracy of PanAligner by using a cyclic human pangenome graph constructed from 94 high-quality haplotype-resolved assemblies [4] and CHM13 human genome assembly [41].We achieve the highest long-read mapping accuracy 98.7% using PanAligner when compared to exist- ing methods Minigraph [28] ( 98.1% ) and GraphAligner [30] ( 97.0% ).PanAligner also supports a hybrid method which identifies a subset of reads that are relatively "easyto-align" and utilizes fast Minigraph heuristics [28] for aligning them.This option significantly improves the speed of the algorithm.

Notations and problem formulations
Pangenome graph G(V , E, σ ) is a string labeled graph such that function σ : V → � + labels each vertex v with string σ (v) over alphabet = {A, C, G, T } .Let Q be a query sequence over .Let M [1..N] be an array of anchor tuples (v, [x..y], [c..d]) with the interpretation that Function weight assigns a user-specified weight to each anchor.For example, the weight of an anchor could be proportional to the length of the matching substring.
A path cover is a set P = {P 1 , P 2 , . . ., P |P| } of paths in graph G such that every vertex in V is included in at least one of the |P| paths.We define paths(v) as {i : P i includes v} .If i ∈ paths(v) , then let index(v, i) spec- ify the position of vertex v on path P i .Suppose R − (v) is the set of vertices in V that can reach vertex v through any walk in graph G.We will assume that the set R − (v) always includes the vertex v.The value last2reach(v, i) for v ∈ V , i ∈ [1, |P|] represents the last vertex on path P i that belongs to set R − (v) .Note that last2reach(v, i) does not exist if there is no vertex on path P i that belongs to R − (v) .Let N + (v) and N − (v) be the set of outgoing and incoming neighbor vertices of vertex v, respectively.
We need to calculate character distances between pairs of anchors in the graph while solving the co-linear chaining problem.Assume that edge to specify the length of the shortest proper cycle containing v. D • (v) = ∞ if v is not part of any proper cycle.If P i includes v, let dist2begin(v, i) denote the length of the sub-path of path P i from the start of P i to v.
Our algorithm will use a balanced binary search tree data structure for executing range queries efficiently.It has the following properties.
Lemma 1 (ref.[42]) Let n be the number of leaves in a tree, each storing a (key, value) pair.The following operations can be supported in O(log n) time: • update (k, val): For the leaf w with key = k , value(w) ← − max(value(w), val).• RMQ(l, r): Return max{value(w) | l < key(w) < r} such that w is a leaf.This is the range maximum query.
Given n (key, value) pairs, the tree can be constructed in O(n log n) time and O(n) space.
Next, we define a precedence relation between a pair of anchors, which is a partial order among the input anchors [29].
Definition 2 (Chain) Given the set of anchors {M [1], M[2], . . ., M[N ]} , a chain is an ordered subset of anchors S = s 1 s 2 • • • s q of M, such that s j precedes s j+1 for all 1 ≤ j < q.
Our co-linear chaining problem formulation seeks a chain S = s 1 s 2 • • • s q that maximizes the chain score defined as q j=1 weight(s j ) − q−1 j=1 gap Q (s j , s j+1 ) + q−1 j=1 gap G (s j , s j+1 ) .Functions gap Q and gap G specify the gap cost incurred in the query sequence and the graph, respectively.Although we specifically focus on problem formulations where the gap cost is calculated as the sum of gap G and gap Q , our approach can be extended to other gap definitions such as |gap G − gap Q | , min(gap G , gap Q ) , or max(gap G , gap Q ) , similar to [26].We define gap Q (s j , s j+1 ) as s j+1 .c− s j .d− 1 , which can be interpreted as the count of characters in the query sequence between the endpoints of the two anchors.Next, we will define two versions of the co-linear chaining problem that differ in their definition of gap G .In both versions, gap G (s j , s j+1 ) is calculated by looking at the count of characters spelled along a walk in the graph from s j to s j+1 .In the first version of the problem formulation, we use the shortest path from vertex s j .v to s j+1 .v to calculate gap G (s j , s j+1 ).
Problem 1 Given a query sequence Q, graph G(V , E, σ ) and anchors M [1..N], determine the optimal chaining score by using the following definition of gap G : where (s j , s j+1 ) is a pair of anchors from M such that s j precedes s j+1 .
These computations need to be done only once for a graph.To solve the chaining problem for a given query sequence, sort the input anchor array M [1..N] in non-decreasing order by the component M time because precedence condition can be checked in constant time.Report max j C[j] as the optimal chaining score.
The above algorithm is unlikely to scale to large whole-genome sequencing datasets because it requires �(N 2 ) time for the dynamic programming recursion.Motivated by [26], we will define an alternative definition of gap G .We will approximate the distance between a pair of vertices by using a path cover of the graph.We will later propose an efficient algorithm for the revised problem formulation. Suppose For each path i ∈ paths(v 1 ) , consider the walk starting from v 1 along the edges of path P i till vertex α i , where vertex α i = v 2 if v 2 also lies on path P i anywhere after v 1 , i.e., index(v 2 , i) ≥ index(v 1 , i) , and the rest of the walk till v 2 is completed by using the shortest path from vertex α i to v 2 .Denote D P (v 1 , v 2 ) as the length of the shortest walk among such |paths(v 1 )| possible walks from v 1 to v 2 .Formally, we can write D P (v 1 , v 2 ) as following.
as the length of a specific walk that starts and ends at v, i.e., .N], determine a path cover P of the graph, and the optimal chaining score by using the following definition of gap G : where (s j , s j+1 ) is a pair of anchors from M such that s j precedes s j+1 .

Proposed algorithms
A single experiment typically requires aligning millions of reads to a graph.Therefore, we will do a one-time preprocessing of the graph that will help to reduce the runtime of our chaining algorithm for solving Problem 2.

Algorithms for preprocessing the graph
We compute the following quantities during the preprocessing stage: • A path cover P of G(V , E, σ ).We require the path cover to be small (in the number of paths).However, determining the minimum path cover in a graph with cycles is an NP-hard problem.We will discuss an efficient heuristic for determining a small path cover.Later, we will empirically show that |P| is very close to optimal by comparing it to a lower bound on the size of the minimum path cover.
specifies a linear ordering of vertices.The ordering should satisfy the following property: If vertex v 2 occurs anywhere after v 1 in a path in P , then ordering may not exist for an arbitrary path cover but it will exist for the path cover chosen by us. and . These val- ues will be frequently used by our chaining algorithm to compute gap costs.
We propose the following heuristic for computing a small path cover of graph G(V , E, σ ) .We derive a DAG G ′ (V , E ′ , σ ) from G by removing a small number of edges.Next, we determine the minimum path cover P of G ′ in O(|P||E| log |V |) time by using a known algorithm [38].Our intuition is that removing as few edges as possible will provide a close to optimal path cover of G.One way to compute G ′ is to use standard heuristic-based solvers for minimum feedback arc set (FAS) problem, e.g., [43], but we empirically observed that this approach could sometimes disconnect a weak component of a graph, leading to a large path cover.Therefore, instead of using FAS heuristics, we use a simple idea where we identify all strongly connected components in G and perform a depth-first search within each strong component to remove back edges [44].This approach provides a DAG that has the same number of weak components as G while removing a small number of edges in practice.
It is important to verify that the above heuristic actually results in a path cover whose size is close to optimal because the runtime complexity of our algorithms depends on |P| .Computing the minimum path cover is difficult due to NP-hardness of the problem.Instead, we compute a lower bound on the size of the minimum path cover using a flow-based method.This method is inspired from a known relationship between minimum path cover problem and minimum flow problem in DAGs [38,45].
In the minimum flow problem, the input is a directed graph with a single source, a single sink, and a demand value ∈ Z for every edge.The task is to find a flow of minimum value that satisfies all demands.The value of a flow is the sum of the flow on the edges exiting the source.We compute a new graph G * from G by (i) replacing each vertex v with two vertices (v − , v + ) , (ii) joining all in-neigh- bors of v to v − , and (iii) joining out-neighbors of v from v + .We add a global source with an out-going edge to every vertex and a global sink with an in-coming edge from every vertex.The demand on all edges of type (v − , v + ) is set to one in G * .The demand on all the remaining edges is set to zero.Observe that any path cover P of G can be converted into a feasible flow of value |P| in G * .As a result, the value of minimum flow in G * must be less than or equal to the size of the minimum path cover in G. Thus, we can solve the minimum flow problem to know a lower bound on the size of the minimum path cover.In our experiments, we compute and use the lower bounds to establish the effectiveness of our path cover finding heuristic.
Next, we compute a function rank for all vertices ∈ V by topological sorting of vertices in DAG G ′ .If there is no cycle in G, then last2reach(v, i) and D(last2reach(v, i), v) can be computed in O(|P||E|) time by using dynamic programming algorithms that process vertices in topological order [26,38].We extend these ideas to cyclic graphs by designing iterative algorithms.We will formally prove that as the iterations proceed, the output gets closer to the desired solution.Our approach to computing last2reach(v, i) is outlined in Algorithm 1.If last2reach(v, i) exists, the algorithm determines it in terms of its rank.We maintain an array L2R to save intermediate results.L2R(v, i) is initialised to rank(v) if v lies on path P i .In each iteration, we revise L2R(v, i) by probing L2R(u, i) for all u ∈ N − (v) .In Lemma 3, we prove the correctness of this algorithm by arguing that all |P||V | values in array L2R converge to their optimal values through label propagation in ≤ |V | iterations.Let Ŵ l denote the count of iterations used by the algorithm.
hold the values of current and previous iteration respectively for v ∈ V in the increasing order of rank(v) do 6: end for

Lemma 3 In Algorithm 1, L2R(v, i) converges to the rank of last2reach(v, i) in at most |V| iterations for all v ∈ V and i ∈ [1, |P|].
Proof A vertex v 2 ∈ V is said to be reachable within k hops from vertex v 1 ∈ V if there exists a path with ≤ k edges from v 1 to v 2 .We will prove by induction that Algo- rithm 1 satisfies the following invariant: After j iterations, L2R(v, i) has converged to rank(last2reach(v, i)) if last2reach(v, i) exists and vertex v is reachable within j hops from last2reach(v, i) in G.This argument will prove the lemma because vertex v 2 ∈ V must be reachable within Base case (j = 0) holds due to initialisation of L2R(v, i) in Line 1.If v lies 0-hop from last2reach(v, i), i.e., v = last2reach(v, i) , then v must lie on path P i and rank(last2reach(v, i)) = rank(v) .Next, assume that the invariant is true for j = n .Now consider the situ- ation after n + 1 iterations.Suppose v ∈ V is reach- able within n + 1 hops from last2reach(v, i).Then, at least one neighbor u ∈ N − (v) of vertex v exists which is reachable within n hops from last2reach(v, i) and last2reach(u, i) = last2reach(v, i) .Based on our assumption, L2R(u, i) must have already converged to rank(last2reach(u, i)) before (n + 1) th itera- tion.Therefore, Line 7 in Algorithm 1 ensures that L2R(v, i) ← rank(last2reach(v, i)) after (n + 1) th itera- tion.
It is possible to design an adversarial example where the algorithm uses �(|V |) iterations.However, in prac- tice, we expect the algorithm to converge quickly.Each iteration of Algorithm 1 requires O(|P||E|) time.There- fore, the total worst-case time of Algorithm 1 is bounded by O(Ŵ l |P||E|) .A similar approach is applicable to com- pute D(last2reach(v, i), v) for all v ∈ V and i ∈ [1, |P|] (Algorithm 2).We use Ŵ d to denote the count of itera- tions needed in Algorithm 2. Similar to parameter Ŵ l in Algorithm 1, Ŵ d is also upper bounded by |V|.We will later show empirically that Ŵ l ≪ |V | and Arrays D and D prev will hold the values of the current and previous iteration, respectively for v ∈ V in the increasing order of rank(v) do 6:

Co-linear chaining algorithm
We propose an iterative chaining algorithm to address Problem 2. The proposed algorithm builds on top of the known algorithms for DAGs [26,38].Similar to [38], we maintain one search tree T i for each path P i ∈ P .Given anchors M [1..N], our algorithm will return array C [1..N] such that C[j] corresponds to the optimal score of a chain that ends at anchor M [j].
If there are no cycles in G, then one iteration of Algorithm 3 suffices to compute the optimal chaining scores.For a DAG, a single iteration of Algorithm 3 works equivalently to the known algorithm for DAGs in [26].In this case, Algorithm 3 would essentially visit the vertices of graph G in topological order while ensuring that C[j] is optimally solved after M[j].v is visited.To solve the chaining problem on cyclic graphs, we design an iterative solution where chaining scores C [1..N] get closer to optimal values in each iteration.We will use Ŵ c to specify the total count of iterations.

35:
end for

36:
Reset all values in search tree T i to −∞, for all i ∈ [1, |P|] 37: end while An overview of Algorithm 3 is as follows.At the beginning of each iteration, all search trees T i s are filled with keys {M[j].d| 1 ≤ j ≤ N } and values −∞ .The value of key M[j].d will be updated based on the score C[j] and some other parameters.The range search trees will be used to efficiently identify the optimal preceding anchor for each anchor [26,29,38].
Each iteration of our algorithm processes v ∈ V in the increasing order of rank(v).While processing v, Algorithm 3 performs three types of tasks: 1.The first type of task is to revise chaining scores {C[j] : M[j].v = v} corresponding to the anchors that lie on vertex v.We also revise scores corresponding to those anchors that are located on vertex u = v such that v is the last vertex on a path ∈ P to reach u.This is achieved by querying search trees T i for all i ∈ paths(v) .In all these tasks, we use Lines 4-18 in Algorithm 3 build array Z that contains up to 4N |P| tuples corresponding to all the above type of tasks.Array Z is sorted in O(N |P| log N |P|) time to ensure that all tasks are executed in the proper order (Line 21).Next, we start the iterative procedure.Lines 19-33 form a single iteration of the algorithm.These tasks lead to updates on score array C and the search trees.We update the priority of an anchor M[j] in the relevant search trees using its score C[j] and the coordinates (Lines 28, 33).To update the score of an anchor M[j] based on the scores of its preceding anchors, we use the (i) highest priority value obtained from the search trees, (ii) the coordinates of anchor M[j], and (iii) the precomputed arrays D, dist2begin, D • P (Lines 24,25,31,32).The score calculations are consistent with the gap cost definition in Sect.Notations and problem formulations.
Each iteration requires O(N |P| log N ) time because each task corresponds to either update or RMQ operation on a search tree of size ≤ N .In Lemma 5, we prove that array C [1..N] converges to optimality in at most N iterations.In Lemma 6, we prove that �(N ) iterations are required for convergence in the worst case.
Proof C[j] always specifies the score of a chain of size ≥ 1 that ends at anchor M[j] throughout the execution of the algorithm.Let f i (j) denote the optimal chaining score ending at anchor M[j] over all chains of size ≤ i .We will prove by induction that before i th iteration begins, C[j] ≥ f i (j) for all j ∈ [1, N ] .It suffices to prove this state- ment because the size of a chain must be ≤ N .Base case ( i = 1 ) holds due to the initialization step in Line 2. Next, assume that before x th iteration begins, C[j] ≥ f x (j) holds for all j ∈ [1, N ] .We will prove that the invariant holds for iteration x + 1.

Let C x [j] and C x+1 [j] denote the intermediate values of C[j]
at the start of x th and (x + 1) th iteration, respectively.From Lines 25 and 32, we know ) .Based on our induction hypoth- esis, C[β x ] ≥ f x (β x ) at the start of the x th iteration.Each iteration of Algorithm 3 processes v ∈ V by increasing the order of rank(v).To prove that C x+1 [j] ≥ f x+1 (j) , we have the following four cases to consider: .v is processed during the x th iteration, the value of key M[β x ].d gets updated in search trees (Line 28).C[j] gets updated later.At the end of the x th iteration, There- fore, C[j] gets updated again due to tuples created in Line 12.This will ensure that

Lemma 6
The count of iterations required by Algorithm 3 is �(N ) in the worst-case.
Proof An example where Algorithm 3 requires �(N ) iter- ations is shown in Fig. 4. The graph has two vertices forming a cycle.Assume that weight of all N input anchors is equal and sufficiently high to outweigh the gap cost between any pair of anchors.As M Similarly, the optimal chain ending at anchor Each anchor is assigned a weight of one unit (a unit represents a multiplicative constant).In this example, each iteration of the chaining algorithm can be conceptually divided into two stages.Stage 1 corresponds to the first round of updates to array C in the order These updates are caused due to tuples in Lines 7-8 of Algorithm 3. Stage 2 corresponds to the second round of updates to caused by tuples in Line 12.After the first iteration of the algorithm, the maximum score in array C is N 2 + 1 .In each subsequent iteration, the maximum score increases by 1.The scores converge at iteration N 2 + 1 .

Implementation
We have implemented the proposed algorithm in C++ (https:// github.com/ at-cg/ PanAl igner).We call our software as PanAligner.PanAligner is developed as an endto-end long-read aligner for cyclic pangenome graphs.We borrow open-source code from Minichain [26], Minigraph [28], and GraphChainer [29] for other necessary components besides co-linear chaining.While using PanAligner, a user needs to provide a graph (GFA format) and a set of reads or contigs (fasta or fastq format) as input.We use the standard data structure to store the pangenome graph while accounting for double stranded nature of DNA sequences.For each vertex v ∈ V , we also add another vertex v whose string label is the reverse complement of string σ (v) .For each edge u → v ∈ E , we add the complementary edge v → ū .This enables read alignment irrespective of which strand the read was sequenced from.
For the benchmark, we built pangenome graphs by using Minigraph v0.20 [28].Minigraph augments large insertion, deletion, and inversion variants into the graph while incrementally aligning each input assembly.Inversion variants can introduce cycles in the graph because Minigraph augments them by linking the vertices from opposite strands.The graph contains multiple weakly connected components because the components corresponding to different chromosomes are never linked during graph construction.Similar to [26,29], we consider each weak component independently during both the preprocessing and co-linear chaining stages to enable efficient multithreading and memory optimization.We defined our problem formulation to produce an optimal chain, but we actually compute multiple best chains, similar to [24,26,28].This is because there can be multiple high-scoring alignments of a read on the graph.PanAligner also outputs a mapping quality score between 0 to 60 to indicate the confidence score for each alignment [46].We used seeding and extension code from Minigraph [28].Seeding is done by identifying minimizer matches [47] between vertex labels of the graph and the read.The extension code produces the final base-to-base alignment by joining the chained anchors [48].We used code from GraphChainer [29] to compute the minimum path cover of a DAG and range queries.
Co-linear chaining for sequence-to-graph alignment is generally slower than chaining between two sequences.If the optimal alignment of a read is unlikely to span more than one vertex, then it may be more efficient to use sequenceto-sequence chaining algorithm for that read.Following this intuition, we have also implemented a hybrid method that identifies a subset of reads which are 'easy to align' by first aligning all reads using Minigraph.Only those reads are aligned using PanAligner for which either Minigraph outputs an alignment spanning more than one vertex or Minigraph outputs a split-read alignment.

Benchmark datasets
We constructed four cyclic pangenome graphs by using subsets of publicly available 95 haplotype-resolved human genome assemblies [4,49].These graphs were generated using Minigraph v0.20 [28].We used CHM13 human genome assembly [49] as the starting sequence during graph construction in all four graphs.We refer to these graphs as 10H, 40H, 80H, and 95H, where the prefix integer represents the count of haplotypes in each graph.The properties of these graphs are provided in Table 1.

Evaluation methodology
We simulated long reads using PBSIM2 v2.0.1 [50] from CHM13 assembly with N50 length 10 kb, 0.5× sequenc- ing coverage and 5% error-rate to approximately mimic the properties of long-reads.We labeled the IDs of the simulated reads with their true interval coordinates in the CHM13 assembly for correctness evaluation.To confirm the correctness of a read alignment, we used similar criteria from prior studies [24,26,28].We require that the reported walk corresponding to a correct alignment should only use the vertices corresponding to the CHM13 assembly in the graph, and it should overlap with the true walk.We used paftools [24] to automate this evaluation.By default, it requires the overlapping portion to be at least 10% of the union of the true and the reported walk length.We executed all experiments on a computer with AMD EPYC 7763 64-core processor and 512 GB RAM.We ran each aligner using 32 threads to leverage the multi-threading capabilities of the tested aligners.All aligners process reads in parallel.We used the /usr/bin/time -v command to measure wall clock time and peak memory usage.

Size of path cover and count of iterations
Finding a suitable path cover P of the input graph such that |P| ≪ |V | is a crucial step in our proposed frame- work because the scalability of our algorithms depends on this parameter.We discussed a heuristic to compute path cover in Sect.because determining minimum path cover in general graphs is NP-hard.Table 2 shows the sizes of path covers computed by our heuristic in all four graphs.Recall that our algorithms process the weakly connected components of a graph independently.In each graph, we indicate the size of the path cover as a range because path covers vary per component.For each component, we show a comparison of the size of the computed path cover with the lower bound on the minimum path cover size (Fig. 5).
The results show that our heuristic is effective in finding a path cover whose size is close to optimal.The number of anchors N that were provided as input to the co-linear chaining algorithm varies per read.We report the mean and maximum value in Table 2. Observe that N does not change much with increasing haplotype count.Next, we evaluate the count of iterations Ŵ l , Ŵ d used by our graph preprocessing algorithms Fig. 6 The plots in panels (a), (b) and (c) show the fraction of aligned reads and the accuracy obtained by using all the aligners on graphs 10H, 40H, and 95H, respectively.These plots were generated by varying mapping quality cutoffs from 0 to 60. X-axis in these plots uses a logarithmic scale to indicate the percentage of incorrectly aligned reads (Algorithms 1-2) and also report them as a range for each graph.These algorithms compute last2reach and D arrays.Observe that the iteration count is significantly smaller in practice than the proven upper limit of |V| (Lemma 3).This is because the worst-case situation is not observed in practice.Accordingly, there is minimal time overhead during the preprocessing phase.
The count of iterations Ŵ c required by our chaining algorithm (Algorithm 3) varies per component as well as per read.We collect the iteration count statistics as follows.For a single read, we define the iteration count as the maximum number of iterations used over all components.Based on this definition, we report the average and the maximum count over all reads in Table 2. Observe that the average count is < 2.5 using all four graphs.The maximum count is < 100 .These numbers are again significantly bet- ter compared to the upper bound from Lemma 5.
We highlight the accuracy, runtime, and memory usage of different aligners using graphs 10H and 95H in Tables 3 and 4, respectively.Observe that PanAligner outperformed Minigraph and GraphAligner in terms of accuracy, i.e., the fraction of correctly aligned reads.This advantage is even more apparent if low-confidence alignments with mapping quality < 10 are ignored.Next, the hybrid method offers slightly better accuracy than PanAligner because the hybrid method uses Minigraph heuristics to align the reads which are sampled from a single vertex.Minigraph is built on top of Minimap2 [24], which is a highly optimized sequence-to-sequence aligner.We show the comparison plots in Fig. 6.
PanAligner, Minigraph and the Minigraph-PanAligner hybrid method left a small fraction of reads unaligned.This may be because (i) PanAligner and Minigraph drop high-frequency minimizer matches during the seeding step, and (ii) they do not consider low-scoring chains for the extension stage.In contrast, GraphAligner achieved higher recall by aligning all reads, but this came at the expense of lower accuracy.
Table 2 shows that the size of the path cover computed by our heuristic increases by roughly a factor of three from 10H to 95H.We can see how this parameter proportionally affects PanAligner's runtime in Tables 3  and 4. PanAligner's runtime is significantly higher than Minigraph for both 10H and 95H graphs because it uses an iterative algorithm.The runtimes of PanAligner and GraphAligner are in the same order of magnitude.The Minigraph-PanAligner hybrid method is about 5× faster than using PanAligner alone.This is because, for 95H graph, PanAligner was used to align only 12% of the total reads; the alignments for rest of the 'easy to align' reads were obtained using Minigraph.Overall, the hybrid method produces the best alignment accuracy among the four methods, and its runtime is practical for large whole-genome sequencing data.
We observe a consistent drop in alignment accuracy of all four alignment methods with increasing haplotype count (Fig. 6).This is likely because the number of combinatorial paths to which a read can align grows exponentially with respect to the haplotype count.

Alignment of simulated reads to acyclic graphs
We also tested PanAligner for acyclic pangenome graphs.We followed the same procedure as [26] to generate a DAG from 95 haplotype-phased assemblies and refer to this graph as 95H-DAG.This graph was generated by disabling inversion variants during graph construction in Minigraph [28].95H-DAG has 1.2M vertices and 1.8M edges.We also include Minichain v1.0 [26] and Graph-Chainer v1.0.2 [29] in this comparison.GraphChainer uses a co-linear chaining algorithm for DAGs without penalizing gaps.For DAG inputs, the problem formulation in PanAligner becomes equivalent to the one used in Minichain [26].A single iteration of our algorithms suffices for DAGs.Therefore, we simply check if the input graph is a DAG at the preprocessing stage, and run a single iteration of Algorithms 1-3.PanAligner achieves similar performance as Minichain in terms of speed and accuracy for DAGs (Table 5).It compares favorably to other methods in terms of accuracy.

Discussion
Co-linear chaining is a fundamental technique for scalable sequence alignment.Several classes of structural variants, such as duplications, tandem repeat polymorphism, and inversions, are best represented as cycles in pangenome graphs [4,10].Existing alignment software designed for cyclic graphs are based on heuristics to join anchors [28,30].We proposed the first practical problem formulation and an efficient algorithm for co-linear chaining on pangenome graphs with cycles.We gave a rigorous analysis of the algorithm's time complexity.The proposed algorithm serves as a useful generalization of the previously known ideas for DAGs [26,29,38,52].
We implemented the proposed algorithm as an opensource software PanAligner.We demonstrated that PanAligner scales to large pangenome graphs built by using haplotype-phased human genome assemblies.It offers superior alignment accuracy compared to existing methods.Although PanAligner is slower than heuristic methods, one can use PanAligner for only those fraction of reads that are predicted to have optimal alignments spanning more than one vertex.
In our formulation, we did not allow anchors to overlap with each other.We also did not allow an anchor to span two or more vertices in a graph for simplicity; but the proposed ideas can be generalized.PanAligner software borrows seeding logic from Minigraph [28], which also restricts anchors within a single vertex.This simplification is appropriate if the graph only includes structural variants ( > 50 bp).The current version of PanAligner software may not be suitable for graphs which include substitution and indel variants.Future work will be directed in the following directions.First, we will test the performance of PanAligner on pangenome graphs that are constructed by using alternative methods, e.g., [4,53,54].Second, we will explore formulations for haplotype-constrained co-linear chaining to control the exponential growth of combinatorial search space with the increasing number of haplotypes [51,55,56].Third, we will generalize the proposed techniques for aligning reads to long-read genome assembly graphs which also contain cycles.It will be interesting to understand whether the small width assumption is appropriate for assembly graphs.

Fig. 3
Fig.3 An illustration of the proposed heuristic used to convert a cyclic graph into a DAG.Red-dotted edges represent the removed back edges in each strongly connected component (SCC)

Fig. 4 A
Fig. 4 A worst-case example for Algorithm 3 where it requires �(N) iterations to converge (Lemma 6).We show a step-by-step progress of the algorithm with each iteration.The table shows the values in array C after each iteration

Fig. 5 A
Fig.5A comparison the size of the computed path cover and the lower bound on the size of the minimum path cover for each component of graphs (a) 10H and (b) 95H.Graph 10H has 30 weakly connected components.Graph 95H has 26 weakly connected components (Table1)

Table 1
Properties of four cyclic pangenome graphs used for evaluation

Table 2
All four graphs have multiple weakly connected components Therefore, the size of the identified path cover of each graph is presented as a range.The other columns show the count of iterations used by our iterative algorithms for graph preprocessing and co-linear chaining (Algorithms 1, 2, 3).The iteration count statistics were gathered while aligning simulated long reads to cyclic pangenome graphs

Table 3
A comparison of the performance of long-read aligners using the 10H graph

Table 4
A comparison of the performance of long-read aligners using the 95H graph

Table 5
A comparison of the performance of long-read aligners using the 95H-DAG graph