ESKEMAP: exact sketch-based read mapping

Background Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. Results In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O} (|t| + |p| + \ell ^2)$$\end{document}O(|t|+|p|+ℓ2) time and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O} (\ell \log \ell )$$\end{document}O(ℓlogℓ) space, where |t| is the number of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document}k-mers inside the sketch of the reference, |p| is the number of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document}k-mers inside the read’s sketch and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell$$\end{document}ℓ is the number of times that \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document}k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.


Introduction
Read mapping continues to be one of the most fundamental problems in bioinformatics.Given a read, the broad goal is to find the location(s) in the reference genome that have a "similar sequence".Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem.However, the last few years has seen the community embrace sketch-based mapping methods, best exemplified by minimap2 [1] (see [2] for a survey).These read mappers work not on the original sequences themselves but on their sketches, e.g. the minimizer sketch.As a result, it is no longer clear which exact problem they are trying to solve, as the definition using an alignment score is no longer directly relevant.To the best of our knowledge, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve.
In this work, we provide a problem formulation (Section "Problem definition") and an exact algorithm to find all hits above a given score (Section "Algorithm for the Sketch Read Mapping Problem").More formally, we consider the problem of taking a sketch t of a text T and a sketch p of a query P and identifying all sub-sequences of t that match p with a score above some threshold.A score function could for example be the weighted Jaccard index, though we explore several others in this paper (Section "Score function").We provide both a simulationbased and an analytical-based method for setting the score threshold (Section "Choosing a threshold") 1 .
Other sketch-based mappers are heuristic: they typically find matching elements between the reference and the read sketches (i.e.anchors) and extend these into maps using chaining [2].Our algorithm is more resource intensive than these heuristics, as is typical for exact algorithms.However, a problem formulation and an exact algorithm gives several long-term benefits.First, the exact algorithm could be used in place of a greedy heuristic when the input size is not too large.Second, the formulation can spur development of exact algorithms that are optimized for speed and could thus become competitive with heuristics.Third, the formulation could be used to find the most effective score functions, which can then guide the design of better heuristics.Finally, our exact algorithm can return all hits with a score above a threshold, rather than just the best mapping(s).This is important for tasks such as the detection of copy number variation [3] or detecting variation in multi-copy gene families [4].
We evaluate our algorithm (called eskemap), using simulated long reads from the T2T human Y chromosome (Section "Results").For the same level of precision, the recall of eskemap is 0.88, compared to 0.76 of minimap2.This illustrates the power of eskemap as a method to recover more of the correct hits than a heuristic method.We also compare against Winnowmap2 [5] and edlib [6], which give lower recall but higher precision than eskemap.

Preliminaries
Sequences.Let t be a sequence of elements (e.g.kmers) that may contain duplicates.We let |t| denote the length of the sequence, and we let t[i] refer to the i-th element in t, with t[0] being the first element.For 0 ≤ i ≤ j < |t| , let t[i, j] represent the subsequence (t[i], t[i + 1], . . ., t[j]) .The set of elements in t is denoted by t , e.g. if t = (ACG, TTT, ACG) then t = {ACG, TTT} .We let occ(x, t) represent the number of occurrences of an element x in t, e.g.occ(ACG, t) = 2.
Sketch.Let T be a string and let t be the sequence of k -mers appearing in T. Note that t is a sequence of DNA sequences.For example, if T = ACGAC and k = 2 , then t = (AC, CG, GA, AC) .For the purposes of this paper, a sketch of T is simply a subsequence of t, e.g.(AC, GA) .This type of sketch could for example be a minimizer sketch [7,8], a syncmer sketch [9], or a FracMinHash sketch [10,11].Scoring Scheme.A scoring scheme (sc, thr) is a pair of functions: the score function and the threshold function.The score function sc is a function that takes as input a pair of non-empty sketches and outputs a real number, intuitively representing the degree of similarity.We assume it is symmetric, i.e. sc(p, s) = sc(s, p) for all sketches p and s.If the score function has a parameter, then we write sc(s, p; θ) , where θ is a vector of parame- ter values.The threshold function thr takes the length of a sketch and returns a score cutoff threshold, i.e. scores below this threshold are not considered similar.Note that the scoring scheme is not allowed to depend on the underlying nucleotide sequences besides what is captured in the sketch.Miscellenous.We use U k to denote the universe of all kmers.Given two sequences p and s, the weighted Jaccard is defined as x∈U k max(occ(x,p),occ(x,s)) .It is 0 when s and p do not share any elements, 1 when s is a permutation of p, and strictly between 0 and 1 otherwise.The weighted Jaccard is a natural extension of Jaccard similarity that accounts for multi-occurring elements.

Problem definition
In this section, we first motivate and then define the Sketch Read Mapping Problem.Fix a scoring scheme (sc, thr) .Let p and t be two sketches, which we refer to as the pattern and the text, respectively.Define a candidate mapping as a subinterval t[a, b] of t.A naive problem definition would ask to return all candidate mappings with sc(p, t[a, b]) ≥ thr(|p|). 2However, a lower-scoring candidate mapping could contain a higher-scoring candidate mapping as a subinterval, with both scores above the threshold.This may arise due to a large candidate mapping containing a more conserved small candidate mapping, in which case both candidate mappings are of interest.But it may also arise spuriously, as a candidate mapping with a score sufficiently higher than thr(|p|) can 1 Our algorithm runs in time O(|t| + |p| + ℓ 2 ) and space O(ℓ log ℓ) , where ℓ is the number of times that k-mers from p occur in t 2 Notice that in this framing, the threshold is not a single parameter but can vary depending on the read's (sequence or sketch) length.This gives flexibility to the scoring function, since the scores of candidate mappings of reads of different lengths do not need to be comparable to each other.Moreover, computing the threshold value is not a challenge since it needs to be computed just once for each read.
be extended with non-shared k-mers that decrease the score but not below the threshold.
To eliminate most of these spurious cases, we say that a candidate mapping t ping is reasonable if it is both left-and right-reasonable.In other words, a reasonable candidate mapping must start and end with a k-mer that has a match in the pattern.We also naturally do not wish to report a candidate mapping that is a subinterval of a longer candidate mapping with a larger score.Formally, we call a candidate mapping t[a, b] maximal if there does not exist a candidate mapping t b], p) .We can now formally define t[a, b] to be a final mapping if it is both maximal and reasonable and sc(t[a, b], p) ≥ thr(|p|) .The Sketch Read Mapping Problem is then to report all final mappings.An example is given in Fig. 1.We now restate the problem in a succinct manner: Definition 1 (Sketch Read Mapping Problem).Given a pattern sketch p, a text sketch t, a score function sc, and a threshold function thr, the Sketch Read Mapping Problem is to find all 0 ≤ a ≤ b < |t| such that

Score function
In this section, we explore the design space of score functions and fix two score functions for the rest of the paper.Let p be the sketch of the pattern and let s be a continuous subsequence of the sketch of the text t, i.e. s = t[a, b] for some a ≤ b .For example if p = (ACT, GTA, TAC) and t = (AAC, ACT, CCT, GTA) , we could have s = t[1, 3] = (ACT, CCT, GTA) .In the context of the Sketch Read Mapping Problem, p is fixed and s varies.Therefore, while the score function is symmetric, the threshold function sets the score threshold as a function of |p|.Since p is fixed, the threshold is a single number in the context of a single problem instance.
In the following, we exclusively consider score functions that calculate the similarity of s and p by ignoring the order of k-mers inside the sketches.Taking k-mer order into account would likely make it more complex to compute scores, while not necessarily giving better results on real data.However, score functions that do take order into account are possible and could provide better accuracy in some cases.
A good score function should reflect the number of k -mers shared between s and p.For a given k-mer x, we define Intuitively, x occurs a certain number of times in p and a certain number of times in s; we let x min be the smaller of these two numbers and x max be the larger of these two numbers.Similarly, x diff is the absolute difference between how often x occurs in p and s.We say that the x min := min(occ(x, p), occ(x, s)) x max := max(occ(x, p), occ(x, s)) x diff := x max − x min Fig. 1 An example of the Sketch Read Mapping Problem.We show all candidate mappings t[a, b] for a given pattern p and a text t.Each candidate mapping is represented by its score calculated using sc ℓ (p, t[a, b]; 1) (see Section "Score Function").Reasonable candidate mappings are shown in black (rather than gray) and final mappings are further bolded number of shared occurrences is 2x min and the number of non-shared occurrences is x diff .These quantities are gov- erned by the relationships A good score function should be (1) increasing with respect to the number of shared occurrences and (2) decreasing with respect to the number of non-shared occurrences.There are many candidate score functions within this space.The first score function we consider is the weighted Jaccard.Formally, The above formula includes first the definition but then two algebraically equivalent versions of it, derived using Eq. 1.The weighted Jaccard has the two desired properties of a score function and is a well-known similarity score.However, it has two limitations.First, the use of a ratio makes it challenging to analyze probabilistically, as is the case with the non-weighted Jaccard [12].Second, it does not offer a tuning parameter which would control the relative benefit of a shared occurrence to the cost of a non-shared occurrence.We therefore consider another score function, parameterized by a real-valued tuning parameter w > 0: It is sometimes more useful to use an equivalent formulation, obtained using Eq.1: As with the weighted Jaccard, sc ℓ has the two desired properties of a score function.But, unlike the weighted Jaccard, it is linear and contains a tuning parameter w.
To understand how score functions relate to each other, we introduce the notion of domination and equivalence.Informally, a score function sc 1 dominates another score function sc 2 when sc 1 can always recover the separation between good and bad scores that sc 2 can.In this case, the solution obtained using sc 2 can always be obtained by using sc 1 instead.Formally, let sc 1 and sc 2 be two score functions, parameterized by θ 1 and θ 2 , respectively.We say that sc 1 dominates sc 2 if and only if for any param- eterization θ 2 , threshold function thr 2 , and pattern sketch (1) (2) p there exist a θ 1 and thr 1 such that, for all sequences s, we have that sc 2 (s, p; θ 2 ) ≥ thr 2 (|p|) if and only if sc 1 (s, p; θ 1 ) ≥ thr 1 (|p|) .Furthermore, sc 1 dominates sc 2 strictly if and only if the opposite does not hold, i.e. sc 2 does not dominate sc 1 .Otherwise, sc 1 and sc 2 are said to be equivalent, i.e. if and only if each one dominates the other.
We can now precisely state the relationship between sc ℓ and sc j , i.e. that sc ℓ strictly dominates sc j .In other words, any solution to the Sketch Read Mapping Problem that is obtained by sc j can also be obtained by sc ℓ , but not vice- versa.Formally, Theorem 1 sc ℓ strictly dominates the weighted Jaccard score function sc j .
Proof We start by proving that sc ℓ dominates sc j .Let p be a pattern sketch and let thr j be the threshold func- tion associated with sc j .We will use the shorthand t = thr j (|p|) .First, consider the case that t < 1 .Let w = t 1−t and let thr ℓ evaluate to zero for all inputs.Let s be any sketch.The following is a series of equivalent transformations that proves domination.
Next, consider the case t > 1 .In this case, for all s, sc j (s, p) < t , since the weighted Jaccard can never exceed one.Observe that sc ℓ (s, p; w) ≤ |p| for any non-negative w.Therefore, we can set thr ℓ (|p|) = |p| + 1 and let w be any non-negative number, guaranteeing that for all s, sc ℓ (s, p; w) < thr ℓ (|p|).
Finally consider the case that t = 1 .Then, sc j (s, p) ≥ t if and only if s and p are permutations of each other, i.e. x diff = 0 for all x.Setting thr ℓ (|p|) = |p| and letting w be any strictly positive number guarantees that sc ℓ (s, p; w) ≥ thr ℓ (|p|) if and only if s and p are permuta- tions of each other.
To prove that sc ℓ is not dominated by sc j , we fix w = 1 (though any value could be used) and give a counterexample family to show that sc j cannot recover the sepa- ration that sc ℓ can.Pick an integer i ≥ 1 to control the size of the counterexample.Let p be a pattern sketch of length 4i consisting of arbitrary k-mers.We con- struct two sketches, s 1 and s 2 .The sequence s 1 is an arbitrary subsequence of p of length i. Observe that The sequence s 2 is p appended with arbitrary k-mers to get a length 12i.Observe that x∈p∪s 2 x min = x occ(x, p) = 4i .Using Eq. 3 for sc ℓ and Eq. 2 for sc j , Under sc ℓ , s 1 has a higher score, while under sc j , s 2 has a higher score.If thr ℓ is set to accept s 1 but not s 2 (e.g.thr ℓ = −3i ), then it is impossible to set thr j to achieve the same effect.In other words, since sc j (s 2 ) > sc j (s 1 ) , any threshold that accepts s 1 must also accept s 2 .
Next, we show that many other natural score functions are equivalent to sc ℓ .Consider the following score functions: The conditions on the parameters are there to enforce the two desired properties of a score function.Each of these score functions is natural in its own way, e.g.sc A is similar to sc ℓ but places the weight on x min rather than on x diff .One could also have two separate weights, as in the score sc B .One could then replace x diff with x max , as in sc C , which is the straightforward reformulation of the weighted Jaccard score as a difference instead of a ratio.Or one could replace x diff with the length of s, as in sc D .The following theorem shows that the versions turn out to be equivalent to sc ℓ and to each other.The proof is a straightforward algebraic manipulation and is left for the appendix.

Choosing a threshold
In this section, we propose two ways to set the score threshold.The first is analytical (Section "Analytical analysis") and the second is with simulations (Section "Simulation-based analysis").The analytical approach gives a closed form formula for the expected value of the score under a mutation model.However, it only applies to the FracMinHash sketch, assumes a read has little internal homology, and does not give a confidence interval.The simulation approach can apply to any sketch but does not offer any analytical insight into the behavior of the score.The choice of approach ultimately depends on the use case.
We first need to define a generative mutation model to capture both the sequencing and evolutionary divergence process: Definition 2 (Mutation model).Let S be a circular string 3 with n characters.The mutation model produces a new string S ′ by first setting S ′ = S and then taking the following steps: 1.For every 0 ≤ i < n , draw an action a i ∈ {sub, del, unchanged} with probability of p sub for sub, p del for del, and 1 − p sub − p del for unchanged.Also, draw an insertion length b i from a geometric distribution with mean p ins . 4.Let track be a function mapping from a position in S to its corresponding position in S ′ .Initially, track(i) = i , but as we delete and add characters to S ′ , we assume that track is updated to keep track of the position of S[i] in S ′ .3. For every i such that a i = sub , replace S ′ [i] with one of the three nucleotides not equal to S[i], chosen uniformly at random.4. For every 0 ≤ i < n , insert b i nucleotides (chosen uniformly at random) before S ′ [track(i)]. 5.For every i such that a i = del , remove S ′ [track(i)] from S ′ .

Analytical analysis
To derive an expected score under the mutation model, we need to specify a sketch.We will use the FracMinHash sketch [10], due to its simpliticy of analysis [11].
Definition 3 (FracMinHash).Let h be a hash function that maps a k-mer to a real number between 0 and 1, inclusive.Let 0 < q ≤ 1 be a real-valued number called the sampling rate.Let S be a string.Then the FracMin-Hash sketch of S, denoted by s, is the sequence of all k -mers x of S, ordered as they appear in S, such that h(x) ≤ q.
Consider an example with k = 2 , S = CGGACGGT , and the only k-mers hashing to a value ≤ q being CG and GG.Then, s = (CG, GG, CG, GG).
We make an assumption, which we refer to as the mutation-distinctness assumption, that the mutations on S never create a k-mer that is originally in S. Based on previous work [13], we find this necessary to make the analysis mathematically tractable (for us).The results under this assumption become increasingly inaccurate as the read sequence contains increasingly more internal similarity.For example, reads coming from centromeres might violate this assumption.In such cases, it may be better to choose a threshold using the technique in Section "Simulation-based Analysis".
We can now derive the expected value of the score under the mutation model and FracMinHash.
Theorem 3 Let S be a circular string and let S ′ be gen- erated from S under the mutation model with the mutation-distinctness assumption and with parameters p sub , p del , and p ins .Let s and s ′ be the FracMinHash sketches of S and S ′ , respectively, with sampling rate q.Then, for all real-valued tuning parameters w > 0, Proof Observe that under mutation-distinctness assumption, the number of occurrences of a k-mer that is in s can only decrease after mutation, and a k-mer that is newly created after mutation has an x min of 0. Therefore, applying Eq. 3, (Recall that s is the set of all k-mers in s.)We will first compute the score conditioned on the hash function of the sketch being fixed.Note that when h is fixed, then the sketch s becomes fixed and s ′ becomes only a function of S ′ .By linearity of expectation, Observe that the number of elements in s ′ is the num- ber of elements in s minus the number of deletions plus the sum of all the insertion lengths.By linearity of expectation, Next, consider a k-mer x ∈ s and E[occ(x, s ′ )] .Recall by our mutation model that no new occurrenes of x are introduced during the mutation process.So occ(x, s ′ ) is equal to the number of occurrences of x in S that remain unaffected by mutations.Consider an occurrence of x in s.The probability that it remains is the probability that all actions on the k nucleotides of x were "unchanged" and the length of all insertions in-between the nucleotides was 0. Therefore, Putting it all together, To add the sketching step, we know from [11] that the expected size of a sketch is the size of the original text times q.Then,

Simulation-based analysis
First, we choose the parameters of the mutation model according to the target sequence divergence between the reads and the reference caused by sequencing errors, but also due to the evolutionary distance between the reference and the organism sequenced.If one is also interested in mapping reads to homologous regions within the reference that are related more distantly, e.g. if there exist multiple copies of a gene, the mutation parameters can be increased further.
To generate a threshold for a given read length, we generate sequence pairs (S, S ′ ) , where S is a uniformly ran- dom DNA sequence of the given length and S ′ is mutated from S under the above model.We then calculate the sketch of S and S ′ , which we call s and s ′ , respectively.The sketch can for example be a minimizer sketch, a syncmer sketch, or a FracMinHash sketch.We can then use the desired score function to calculate a score for each pair (s, s ′ ) .For a sufficiently large number of pairs, their scores will form an estimate of the underlying score distribution for sequences that evolved according to the used model.It is then possible to choose a threshold such that the desired percentage of mappings would be reported by our algorithm.For example, one could choose a threshold to cover a one sided 95% confidence interval of the score.
In order to be able to adjust thresholds according to the variable length of reads produced from a sequencing run, the whole process may be repeated several times for different lengths of S. Thresholds can then be interpolated dynamically for dataset reads whose lengths were not part of the simulation.

Algorithm for the Sketch Read Mapping Problem
In this section, 5  We present our algorithm as two parts.In the first part (Section "Computing S"), we compute a matrix S with ℓ rows and ℓ columns so that S(i, j) = x min(occ(x, p), occ(x, t[L[i], L[j]]) .S is only defined for j ≥ i .We also store an index i * for each column j indicating the smallest index for which t[L[i * ], L[j]] is a right-reasonable candidate mapping.In the second part (Section "Computing maximality"), we scan through S and output the candidate mapping t[i, j] if and only if it is reasonable, maximal and has a score above the threshold.
The reason that S(i, j) is not defined to store the score of the candidate mapping t[L[i], L [j]] is that the score can be computed from S(i, j) in constant time, for both sc j and sc ℓ .To see this, let x min := min(occ(x, p), occ(x, t[L[i], L[j]]) .Recall that Eq. 2 allows us to express sc j (t[i, j], p) as a function of x min , |p|, and the length of the candidate mapping, i.e. j − i + 1 .Similarly, we can apply Eq. 1 to express sc ℓ as Thus, once x x min is computed, either of the scores can be computed trivially.

Computing S
We compute S using dynamic programming.We will give a recurrence for S so that it can be filled in from right-to-left.For the base case of the last column, we start by setting S(ℓ − 1, ℓ − 1) = 1 .This is trivially correct since L[ℓ − 1] ∈ p by definition.We then fill in the last column, starting from i = ℓ − 2 and going down to . For the remaining cells of S, i.e. for all i and j such that 0 ≤ i ≤ j < ℓ − 1 , we use the recursive formula: To see the correctness of this formula, observe that all the elements of t[L[j] + 1, L[j + 1] − 1] are, by definition, not in p and hence their minimum occurrence value is 0. If the element x = t[L[j + 1]] helps to increase min(occ(x, t[L[i], L[j + 1]]), occ(x, p)) , then excluding it decreases the minimum count by one, otherwise the minimum occurrence does not change.To maintain the smallest value of i for which S(i, j + 1) is right-reasonable, it is enough to observe that S(i, j + 1) is right-reasonable if and only if the top case is used.
To design an efficient algorithm based on Eq. 5, we need two auxiliary data structures.The first is a hash table P cnt that stores, for every k-mer in p , how often it occurs in p.A second hash table T cnt is used to store a count for every k-mer in p .T cnt sc ℓ (t[i, j], p; w) := maintains the invariant that at the start of processing column j, ) .This invariant can be maintained by 1) initializing ) for every k-mer x ∈ p , and 2) when starting to process column j, decrementing To compute S, the P cnt hash table is constructed via a scan through p and T cnt [x] is set to 0 for every k -mer x ∈ p .Next, we simultaneously compute the last column of S and initialize T cnt so that it holds the counts of all k-mers from p in t[L[0], L[ℓ − 1] , as fol- lows.We trivially initialize S(ℓ − 1, ℓ − 1) and then compute S(i, ℓ − 1) starting from i = ℓ − 2 down to i = 0 .Before computing S(i, ℓ − 1) , we increment the ) and can be used to distinguish between the two cases necessary for computing S(i, ℓ − 1).
Algorithm 1 details the rest of the algorithm for computing S, column-by-column using Eq. 5.The non-trivial part is to determine which case of Eq. 5 to use in constant-time (i.e. to compute occ(t[L , so it only remains to compute c 1 .When computing a column j < ℓ − 1 , we are processing all the rows starting from 0 up to ℓ − 1 .Thus we can update c 1 by initially setting c 1 = 0 (Line 8) and then, for each new row i, incrementing c

Computing maximality
In the second step, we identify which of the candidate mappings in S are final.Our algorithm is shown in Algorithm 2. We traverse S column-by-column starting with the last column and then row-by-row, starting from the first row.While traversing S, we maintain a list M of all maximal, reasonable candidate mappings above the threshold found so far and their scores.M has the invariant that the candidate mappings are increasingly ordered by their start positions.
To maintain the invariant that M is sorted by start position, we maintain a pointer cur to a location in M (Lines 7-11).At the start of a new column traversal, when the row i = 0 , cur points to the start of M. As the row is increased, we move cur forward until it hits the first value in M with a start larger than i.When a new final mapping is added to M, we do so at cur, which guarantees the order invariant of M (Lines 16-20).
Due to the order cells in S are processed during our traversal, a candidate mapping t[L[i], L [j]] is maximal if and only if its score is larger than the score of all other final mappings in M with position i ′ ≤ i .For a given column, since we are processing the candidate mappings in increasing order of i, we can simultenously maintain a running variable maxSoFar that holds the maximum value in M up to cur (Line 8).We can then determine if a candidate mapping is maximal by simply checking its score against maxSoFar (Line 14).Note that as long as we have not yet seen any final mapping at some position i ′ ≤ i , a candidate mapping is already maximal if its score equals thr(|p|) .This is ensured via a flag supMpFnd and an additional satisfiable subclause (Line 14).As soon as maxSoFar is updated, supMpFnd is set (Line 10).
To check for the reasonability of a candidate mapping corresponding to S(i, j), we need to verify that i ≥ i * and S(i, j) = S(i + 1, j) + 1 , for 0 ≤ i < j .To see that this is correct, first observe that i * is the smallest index of a cell in column j of S belonging to a right-reasonable candidate mapping that could be found during the execution of Algorithm 1.Thus, there is no candidate mapping belonging to a cell S(i, j) with i < i * .Vice versa, every candidate mapping t must hold.The requirement of the second condition ( S(i, j) = S(i + 1, j) + 1 , for 0 ≤ i < j ) to ensure left- reasonability can be justified by a similar argument as given for the correctness of Eq. 5.All candidate mappings of cells S(i, j) where i = j are always reasonable, because they contain only 1 k-mer x ∈ p.

Runtime and memory analysis
The runtime for Algorithm 1 is �(ℓ 2 + |t| + |p|) .Note that this time bound refers to an expected time if assuming a hash table with constant time for insertion and lookup.The P cnt table can be constructed in a straightforward manner in O(|p|) time, the L array is constructed in O(|t|) .Algorithm 2 runs two for loops with constant time internal operations, with the exception of the while loop to fast forward the cur pointer.The total time for the loop is amortized to O(ℓ) for each column.Therefore, the total time for Algorithm 2 is �(ℓ 2 ) .This gives the total running time for our algorithm as O(|t| The total space used by the algorithm is the sum of the space used by P cnt , T cnt , L, S and M. The P cnt table stores |p| integers with values up to |p|.However, notice that when |p| > ℓ , we can limit the table to only store k-mers that are in t , i.e.only ℓ k-mers.We can also replace integer values greater than ℓ with ℓ , as it would not affect the algorithm.Therefore, the P cnt table uses O(ℓ log ℓ) space.The T cnt table stores at most ℓ entries with values at most ℓ and therefore takes O(ℓ log ℓ) space.To store S completely would require �(ℓ 2 ) space.However, as S is filled column by column and since the recursion in Eq. 5 only depends on the previously computed column, we can dynamically compute S by only storing two columns of S in memory.Further, we can apply Algorithm 2 separately on each column after it has been computed by Algorithm 1 as long as M is retained throughout the whole processing of S. Hence, we do not need to spend more than �(ℓ) space for storing S. To limit the space required for storing M to O(ℓ log ℓ) , we can observe that, in fact, it is not necessary to keep all final mappings stored in M to check for maximality during the execution of our algorithm.Instead, it is enough to retain the last candidate mapping found for each start position.As soon as we find a final candidate mapping starting at the same position as a previously found one, we can output the latter and overwrite its entry in M with the newly found candidate mapping.Thus our algorithm uses a total of O(ℓ log ℓ) space.

Results
We implemented the eskemap algorithm described in "Algorithm for the Sketch Read Mapping Problem" section using sc ℓ as score function and compared it to other methods in a read mapping scenario.For better comparability, we implemented it with the exact same minimizer sketching approach as used by minimap2.Source code of our implementation as well as a detailed documentation of our comparison described below including exact program calls is available from https:// github.com/ medve devgr oup/ eskem ap.

Datasets
For our evaluation, we used the T2T reference assembly of human chromosome Y (T2T-CHM13v2.0;[GenBank:NC_060948.1])[15].The chromosome contains many ampliconic regions with duplicated genes from several gene families.Identifying a single best hit for reads from such regions is not helpful and instead it is necessary to find all good mappings [16].Such a reference poses a challenge to heuristic algorithms and presents an opportunity for an all-hits mapper like eskemap to be worth the added compute.
We simulated a read dataset on this assembly imitating characteristics of a PacBio Hifi sequencing run [17].For each read, we randomly determined its length r according to a gamma distribution with a 9000bp mean and a standard deviation of 7000bp.Afterwards, a random integer i ∈ [1, n − r + 1] was drawn as the read's start position, where n refers to the length of the chromosome.Sequencing errors were simulated by introducing mutations into each read's sequence using the mutation model described in Definition 2 and a total mutation rate of 0.2% distributed with a ratio of 6:50:54 between substitu- tion/insertion/deletion, as suggested in [18].Aiming for a sequencing depth of 10x, we simulated 69401 reads.
The T2T assembly of the human chromosome Y contains long centromeric and telomeric regions which consist of short tandem and higher order repeats.Mapping reads in such regions results in thousands of hits that are meaningless for many downstream analyses and significantly increases the runtime of mapping.Therefore, we excluded all reads from the initially simulated set which could be aligned to more than 20 different, non-overlapping positions using edlib (see below).After filtering, a set of 32295 reads remained.

Tools
We compared eskemap to two other sketch-based approaches and an exact alignment approach.The sketch-based approaches were minimap2 (version 2.24-r1122) and Winnowmap2 (version 2.03), run using default parameters.In order to be able to compare our results also to an exact, alignment-based mapping approach, we used the C/C++ library of Edlib [6] (version 1.2.7) to implement a small script that finds all nonoverlapping substrings of the reference sequence a read could be aligned to with an edit distance of at most T. We tried values T ∈ {0.01r, 0.02r, 0.03r} , where recall that r is the read length.We refer to this script as simply edlib.
For eskemap, we aimed to make the results as comparable as possible to minimap2.We therefore used a minimizer sketch with the same k-mer and window size as minimap2 ( k = 15 , w = 10 ).However, we excluded mini- mizers that occurred > 100 times inside the reference sketch, to limit the O(ℓ 2 ) memory use of eskemap, even as this exclusion may potentially hurt eskemap 's accuracy.We used the default w = 1 as the tuning parameter in the linear score.To set the score threshold, we used the dynamic procedure described in Section "Simulation-based analysis".In particular, we used five different sequence lengths for simulations and used a divergence of 1%.We used the same sequencing error profile as for read simulation.Four thresholds were then chosen so at to cover the one-sided confidence interval of 70%, 80%, 90%, and 95%, respectively.

Accuracy measure
We compared the reference substrings corresponding to each reported mapping location of any tool to the mapped read's sequence using BLAST [19].If a pairwise comparison of both sequences resulted either in a single BLAST hit with an E-value not exceeding 0.016 and covering at least 90% of the substring or the read sequence or if a set of non-overlapping BLAST hits was found of which none had an E-value above 0.01 and their lengths summed up to at least 90% of either the reference sub- string's or the read sequence's length, we considered the reference mapping location as homologous.
For each read, we combine all the homologous reference substrings found across all tools into a ground truth set for that read.We then measure the accuracy of a mapping as follows.We determined for each k-mer of the reference sequence's sketch whether it is either a true positive (TP), false positive (FP), true negative (TN) or false negative (FN).A k-mer was considered a TP if it was covered by both a mapping and a ground truth substring.It was considered a FP if it was covered by a mapping, but not by any ground truth substring.Conversely, it was considered a TN if it was covered by neither a mapping nor a ground truth substring and considered a FN if it was covered by a substring of the ground truth exclusively.The determination was performed for each read independently and results were accumulated per tool to calculate precision and recall measures.

Accuracy results
The precision and recall of the various tools is shown in Fig. 2. The most controlled comparison can be made with respect to minimap2, since the sketch used by eskemap is a subset of the one used by minimap2.At a score threshold corresponding to 70% recovery, eskemap achieves the same precision (0.999) as minimap2.However, the recall of eskemap is 0.88, compared to 0.76 of minimap2.We see that the extra effort of minimap2 to take k-mer order inside the sketches into account does not seem to pay off in this experiment compared to the approach of eskemap which avoids this.The increased recall of eskemap can be explained by the fact that eskemap is able to find all promising but slightly worse suboptimal mapping positions for a read originating, e.g., from the sequence of a duplicated gene on the chromosome.Many of these are not reported by minimap2 as it only reports a fixed number of suboptimal mappings per read.This illustrates the potential of eskemap as a method to recover more of the correct hits than a heuristic method.More generally, eskemap achieves a recall around 90%, while all other tools have a recall of at most 76%.However, both edlib and Winnowmap2 achieve a slightly higher precision (by 0.001).

Time and memory results
We compared the runtimes and memory usage of all sketch-based methods (Table 1).Calculations were performed on a virtual machine with 28 cores and 256 GB of RAM.We did not include edlib in this alignment since, as an exact alignment-based method, it took much longer to complete (i.e.running highly parallelized for many days on a system with many cores).We see that both heuristics are significantly faster than our exact algorithm.However, they also find many fewer mapping positions per read.E.g., only one mapping position is reported for 67% and 75% of all reads by minimap2 and Winnow-map2, respectively.In comparison, eskemap finds more than one mapping position for almost every second read (49%).When the runtime is normalized per output mapping, eskemap is actually more than an order of magnitude faster than the other tools.
eskemap 's memory usage was also found to be smaller than that of minimap2 and Winnowmap2 in our experiment. 7

Conclusion
In this work, we formally defined the Sketch Read Mapping Problem, i.e. to find all positions inside a reference sketch with a certain minimum similarity to a read sketch under a given similarity score function.We also proposed an exact dynamic programming algorithm called eskemap to solve the problem, running in O(|t| + |p| + ℓ 2 ) time and �(ℓ 2 ) space.We evalu- ated eskemap 's performance by mapping a simulated long read dataset to the T2T assembly of human chromosome Y and found it to have a superior recall for a similar level of precision compared to minimap2, while offering precision/recall tradeoffs compared with edlib or Winnowmap2.
In order to further improve on eskemap 's runtime, a strategy could be to develop filters that prune the result's search space.This could be established, e.g., by terminating score calculations for a column once it is clear an optimal solution would not make use of the rest of that column.Our prototype implementation of eskemap would also benefit from additional Fig. 2 Mapping accuracies of all tools.For edlib, the color of the cross encodes the various edit distance thresholds (0.01, 0.02, 0.03).For eskemap, the color of the circles indicate the score threshold used, in terms of the target confidence interval used (0.7, 0.8, 0.9, 0.95).The ground truth is determined by combining the mappings from all tools and filtering out those with bad BLAST scores.The most lenient thresholds for edlib and eskemap were used  engineering of the code base, potentially leading to substantial improvements of runtime and memory in practice.
Having an exact sketch-based mapping algorithm at hand also opens the door for the exploration of novel score functions to determine sequence similarity on the level of sketches.Using our algorithm, combinations of different sketching approaches and score functions may be easily tested.Eventually, this may lead to a better understanding of which sketching methods and similarity measures are most efficient considering sequences with certain properties like high repetitiveness or evolutionary distance.

Appendix A: Proofs
Proof of Theorem 2 Observe that domination is a transitive property, i.e. if sc 1 dominates sc 2 and sc 2 domi- nates sc 3 , then sc 1 dominates sc 3 .To prove equivalence, we will prove the following circular chain of domination: First, observe that sc B trivially dominates sc ℓ by keeping the threshold function the same and setting Finally, we prove that sc ℓ dominates sc A .Let p be a pattern and let t = thr A (|p|) .Set w = 1 a 1 and thr ℓ (i) = thr A (i) a 1 .Then, for all s, the following series of equivalent transformations proves that sc ℓ dominates sc A .
The tools were called to map 32295 simulated PacBio Hifi sequencing reads on the T2T assembly of human chromosome Y.Runtimes are shown both as total values and normalized by the number of reported mapping positions b 1 = 1 and b 2 = w.Next, we prove that sc C dominates sc B .Let p be a pattern and let t = thr B (|p|) .Set thr C = thr B and c 1 = b 1 + b 2 and c 2 = b 2 .Then, for all s, the following series of equivalent transformations proves that sc C dom- inates sc B .Next, we prove that sc D dominates sc C .Let p be a pat- tern and let t = thr C (|p|) .Setd 1 = c 1 + c 2 , d 2 = c 2 , and thr D (i) = thr C (i) + ic 2 .Then, for all s, the following series of equivalent transformations proves that sc D dom- inates sc C .scB (s, p; b 1 , b 2 ) ≥ t x b 1 x min − b 2 x diff ≥ t x b 1 x min − b 2 (x max − x min ) ≥ t x (b 1 + b 2 )x min − b 2 x max ≥ t sc C (s, p; c 1 , c 2 ) ≥ thr C (|p|)Next, we prove that sc A dominates sc D .Let p be a pattern and let t = thr D (|p|) .Seta 1 = d 1 d 2 − 2 and thr A (i) = thr D (i) d 2 − i .Then, for all s, the following series of equivalent transformations proves that sc D dominates sc C .
2d 2 and d 2 > 0 Theorem 2 Schulz and Medvedev Algorithms for Molecular Biology (2024) 19:19 we describe a dynamic programming algorithm for the Sketch Read Mapping Problem under both the weighted Jaccard and the linear scores ( sc j and sc ℓ , respectively).Let t be the sketch of the text, let p be the sketch of the pattern, let L be the sequence of positions in t that have a k-mer that is in p , in increasing order, and let ℓ = |L| .Our algorithm takes advantage of the fact that p is typically much shorter than t and hence the number of elements of t that are shared with p is much smaller than |t| (i.e.ℓ ≪ |t| ).In particular, it suf- fices to consider only candidate mappings that begin and end in positions listed in L, since by definition, if t[a, b] is a reasonable candidate mapping, then t[a] ∈ p and t[b] ∈ p.

Table 1
Runtime and memory usage comparison of all sketchbased methods