Locality-sensitive bucketing functions for the edit distance

Background Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Results In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(d_1, d_2)$$\end{document}(d1,d2)-sensitive if any two sequences within an edit distance of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d_1$$\end{document}d1 are mapped into at least one shared bucket, and any two sequences with distance at least \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d_2$$\end{document}d2 are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(d_1,d_2)$$\end{document}(d1,d2) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. Conclusion These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions.


Introduction
Comparing a set of given sequences is a common task involved in many bioinformatics applications, such as homology detection [6], overlap detection and the construction of overlap graphs [10,4,24], phylogenetic tree reconstruction, and isoform detection from circular consensus sequence (CCS) reads [22], to name a few.The naive all-vs-all comparison gives the most comprehensive information but does not scale well.An efficient and widely-used approach that avoids unnecessary comparisons is bucketing: a linear scan is employed to assign each sequence into one or multiple buckets, followed by pairwise comparisons within each bucket.The procedure of assigning sequences into buckets, which we refer to as a bucketing function, is desired to be both "sensitive", i.e., two similar sequences ideally appear in at least one shared bucket so that they can be compared, and "specific", i.e., two dissimilar sequences ideally appear in disjoint buckets so that they can be exempt from comparison.The criteria of similar/dissimilar sequences are application-dependent; in this work we study bucketing functions for the edit distance (Levenshtein distance).
A simple yet popular bucketing function is to put a sequence into buckets labeled with its own k-mers.The popular seed-and-extend strategy [1,2] implicitly uses this approach.Various sketching methods such as minimizer [19,23,20,13] and universal hitting set [16,7] reduce the number of buckets a sequence is assigned to by only considering a subset of representative k-mers.These bucketing methods based on exact k-mer matching enjoyed tremendous success in analyzing next-generation sequencing (NGS) data, but are challenged by the third-generation long-reads sequencing data represented by PacBio [18] and Oxford Nanopore [8] technologies; due to the high error rate, sequences that should be assigned to the same buckets hardly share any identical k-mers (for a reasonably large k such as k = 21 with 15% error rate), and therefore results in poor sensitivity.
To address this issue, it is required to be able to recognize similar but not necessarily identical sequences.A general solution is locality-sensitive hashing (LSH) [14,15] where with high probability, similar sequences are sent into the same bucket (i.e., there is a hash collision), and with high probability dissimilar sequences are sent into different buckets.However, designing locality-sensitive hashing functions for the edit distance is hard; the state-of-the-art method Order Min Hash (OMH) is proved to be a gapped LSH but admits a large gap [14].Another related approach is embedding the metric space induced by the edit distance into more well-studied normed spaces [3,17,24].However, such an embedding is also hard; for example, it is known that the embedding into L 1 cannot be distortion-free [9].In addition, there are seeding/sketching methods such as spaced k-mer [5,11], indel seeds [12], and the more recent strobemer [21] that allow gaps in the extracted seeds to accommodate some edits, but an edit that happens within the chosen seed can still cause mismatches.
It is worth noting that locality-sensitive hashing functions, when interpreted as bucketing functions, assign a sequence into exactly one bucket: buckets are labeled with hash values, and a sequence is put into the single bucket where it is hashed to.In this work, we propose the concept of locality-sensitive bucketing (LSB) functions as a generalization of LSH functions by allowing it to assign a sequence into multiple buckets.Formally, a bucketing function, which maps a sequence (of fixed length) into one or more buckets, is defined to be (d 1 , d 2 )-sensitive if any two sequences within an edit distance of d 1 are mapped into at least one shared bucket, and any two sequences with an edit distance at least d 2 are mapped into disjoint subsets of buckets.While a stochastic definition by introducing a distribution on a family of bucketing functions can be made in a similar way as the definition of LSH functions, here we focus on this basic, deterministic definition.We design several LSB functions for a variety of values of (d 1 , d 2 ) including both ungapped (d 2 = d 1 + 1) and gapped (d 2 > d 1 + 1) ones.This demonstrates that allowing one sequence to appear in multiple buckets makes the locality-sensitive properties easier to satisfy.Moreover, our lower bound proof shows that any (1, 2)-sensitive bucketing function must put each sequence (of length n) into at least n buckets (see Lemma 2), suggesting that certain ungapped locality-sensitive hashing functions, where each sequence is sent to a single bucket, may not exist.
The rest of this paper is organized as follows.In Section 2, we give the precise definition of LSB functions and propose criteria to measure them.In Sections 3 and 4, we design LSB functions using two different approaches, the results are summarized in Section 5. We show experimental studies in Section 6, with a focus on demonstrating the performance of gapped LSB functions.Future directions are discussed in Section 7.

Basics of locality-sensitive bucketing (LSB) functions
Given an alphabet Σ with |Σ| > 1 and a natural number n, let S n = (Σ n , edit) be the metric space of all length-n sequences equipped with the Levenshtein (edit) distance.Given a set B of buckets, a bucketing function f maps S n to P(B), the power set of B. This can be viewed as assigning a sequence s of length n to a subset of buckets f (s) ⊂ B. Let d 1 < d 2 be two non-negative integers, we say a bucketing function We refer to the above two conditions as LSB-properties (1) and ( 2) respectively.Intuitively, the LSB-properties state that, if two length-n sequences are within an edit distance of d 1 , then the bucketing function f guarantees assigning them to at least one same bucket, and if two length-n sequences have an edit distance at least d 2 , then the bucketing function f guarantees not assigning them to any shared bucket.In other words, (d 1 , d 2 )-sensitive bucketing functions perfectly distinguish length-n sequences within distance d 1 from those with distances at least d 2 .It is easy to show that if f : S n → P(B) is a (d 1 , d 2 )-sensitive bucketing function, then f (s) = ∅ for all s ∈ S n .In fact, since edit (s, s) = 0 ≤ d 1 , the LSB-property (1) 1 then we say the bucketing function is ungapped; otherwise it is called gapped.
We note that the above definition of LSB functions generalize the (deterministic) LSH functions: if we require that |f (s)| = 1 for every sequence s ∈ S n , i.e., f maps a sequence to a single bucket, then Two related parameters can be used to measure an LSB function: |B|, the total number of buckets, and |f (s)|, the number of different buckets that contain a specific sequence s.From a practical perspective, it is desirable to keep both parameters small.We therefore aim to design LSB functions that minimize |B| and |f (s)|.Specifically, in the following sections, we will construct (d 1 , d 2 )-sensitive bucketing functions with a variety of values of (d 1 , d 2 ), and analyze their corresponding |B| and |f (s)|; we will also prove lower bounds of |B| and |f (s)| in different settings and show that some of our constructed LSB functions are optimal, in terms of minimizing these two parameters.
The bounds of |B| and |f (s)| are closely related to the structure of the metric space S n .For a sequence s ∈ S n , its d-neighborhood, denoted by N d n (s), is the subspace of all sequences of length n with edit distance at most d from s; formally The following simple fact demonstrates the connection between the bound of |f (s)| and the structure of S n , which will be used later.
Lemma 1.Let s be a sequence of length n.If N d1 n (s) contains a subset X with |X| = x such that every two sequences in X have an edit distance at least d 2 , then for any Proof.Let f be an arbitrary (d 1 , d 2 )-sensitive bucketing function.By the LSB-property (2), these x sequences must be assigned to distinct buckets by f .On the other hand, since they are all in N d1 n (s), the LSB-property (1) requires that f (s) overlaps with f (t) for each sequence t ∈ X. Combined, we have |f (s)| ≥ x.

An optimal (1, 2)-sensitive bucketing function
In the most general setting of LSB functions, the labels of buckets in B are just symbols that are irrelevant to the construction of the bucketing function.Hence we can let B = {1, . . ., |B|}.The remaining of this section studies (1, 2)-sensitive bucketing functions in this general case.We first prove lower bounds of |B| and |f (s)| in this setting; we then give an algorithm to construct an optimal (1, 2)-sensitive bucketing function f that matches these bounds.
Proof.According to Lemma 1 with d 1 = 1 and d 2 = 2, we only need to show that N 1 n (s) contains n different sequences with pairwise edit distances at least 2. For i = 1, . . ., n, let t i be a sequence obtained from s by a single substitution at position i.If i = j, then t i differs from t j at two positions, namely i and j.Then we must have edit t i , t j ≥ 2 as t i cannot be transformed into t j with a single substitution or a single insertion or deletion.Hence, t 1 , . . ., t n forms the required set.We now construct a bucketing function f : S n → P(B) that is (1, 2)-sensitive using the algorithm given below.It has exponential running time with respect to n but primarily serves as a constructive proof that (1, 2)-sensitive bucketing functions exist.Assign to the alphabet Σ an arbitrary order σ : {1, . . ., |Σ|} → Σ.The following algorithm defines the function f : A toy example of the bucketing function f with n = 2 and Σ = {σ(1) = A, σ(2) = C, σ(3) = G, σ(4) = T} constructed using the above algorithm (where the sequences are processed in the lexicographical order induced by σ) is given below, followed by the contained sequences in the resulting buckets.Proof.Claim (i) follows directly from the construction (the most inner for-loop).In the algorithm, each sequence s ∈ S n is added to n different buckets, one for each position.Specifically, let s = s 1 s 2 • • • s n , then s is added to a new bucket when we process the sequence Hence, |f (s)| = n.To calculate |B|, observe that a new bucket is used whenever we encounter the smallest character σ(1) in some sequence s.So |B| is the same as the number of occurrences of σ(1) among all sequences in S n .The total number of characters in S n is n|Σ| n .By symmetry, σ(1) appears n|Σ| n−1 times.
Proof.We show that for s, t ∈ S n , edit (s, t) ≤ 1 if and only if f (s)∩f (t) = ∅.For the forward direction, edit (s, t) ≤ 1 implies that s and t can differ by at most one substitution at some position i.Let r be the sequence that is identical to s except at the i-th position where it is substituted by σ(1) (it is possible that r = s).According to the algorithm, when processing r, both s and t are added to a same bucket m.Therefore, m ∈ f (s) ∩ f (t).
For the backward direction, let m be an integer from f (s)∩f (t).By construction, all the |Σ| sequences in the bucket m differ by a single substitution.Hence, edit (s, t) ≤ 1.
Combining Lemmas 2-5, we have shown that the above (1, 2)-sensitive bucketing function is optimal in the sense of minimizing |B| and |f (s)|.This is summarized below.

Mapping to sequences of length n
We continue to explore LSB functions with different values of d 1 and d 2 .Here we focus on a special case where B ⊂ S n , namely, each bucket in B is labeled by a length-n sequence.The idea of designing such LSB functions is to map a sequence s to its neighboring sequences that are in B. Formally, given a subset B ⊂ S n and an integer r ≥ 1, we define the bucketing function f r B : S n → P(B) by ≤ r} for each s ∈ S n .We now derive the conditions for f r B to be an LSB function.For any sequence s, all the buckets in f r B (s) are labeled by its neighboring sequences within radius r.Therefore, if two sequences s and t share a bucket labeled by v, then edit (s, v) ≤ r and edit (t, v) ≤ r.Recall that S n is a metric space, in particular, the triangle inequality holds.So edit (s, t) ≤ edit (s, v) + edit (t, v) ≤ 2r.In other words, if s and t are 2r +1 edits apart, then they will be mapped to disjoint buckets.Formally, if edit (s, t) ≥ 2r +1, then f r B (s) ∩ f r B (t) = ∅.This implies that f r B satisfies the LSB-property (2) with d 2 = 2r + 1.We note that this statement holds regardless of the choice of B.
Hence, to make f r B a (d 1 , 2r + 1)-sensitive bucketing function for some integer d 1 , we only need to determine a subset B so that f r B satisfies the LSB-property (1).Specifically, B should be picked such that for any two length-n sequences s and t within an edit distance of d 1 , we always have For the sake of simplicity, we say a set of buckets B ⊂ S n is (d 1 , r)-guaranteed if and only if N r n (s) ∩ N r n (t) ∩ B = ∅ for every pair of sequences s and t with edit (s, t) ≤ d 1 .Equivalently, following the above arguments, B is (d 1 , r)-guaranteed if and only if the corresponding bucketing function f r B is (d 1 , 2r + 1)sensitive.Note that the (d 1 , r)-guaranteed set is not a new concept, but rather an abbreviation to avoid repeating the long phrase "a set whose corresponding bucketing function is (d 1 , 2r + 1)-sensitive".In the following sections, we show several (d 1 , r)-guaranteed subsets B ⊂ S n for different values of d 1 .

(2r, r)-guaranteed and (2r − 1, r)-guaranteed subsets
We first consider an extreme case where B = S n .Lemma 6.Let B = S n .Then B is (2r, r)-guaranteed if r is even, and B is (2r − 1, r)-guaranteed if r is odd.
Proof.First consider the case that r is even.Let s and t be two length-n sequences with edit (s, t) ≤ 2r.Then there are 2r edits that transforms s to t. (If edit (s, t) < 2r, we can add in trivial edits that substitute a character with itself.)Because s and t have the same length, these 2r edits must contain the same number of insertions and deletions.Reorder the edits so that each insertion is followed immediately by a deletion (i.e., a pair of indels) and all the indels come before substitutions.Because r is even, in this new order, the first r edits contain an equal number of insertions and deletions.Namely, applying the first r edits on s produces a length-n sequence v. Clearly, edit (s, v) ≤ r and edit (t, v) ≤ r, i.e., For the case that r is odd.Let s and t be two length-n sequences with edit (s, t) ≤ 2r − 1.By the same argument as above, s can be transformed to t by 2r − 1 edits and we can assume that all the indels appear in pairs and they come before all the substitutions.Because r is odd, r − 1 is even.So applying the first r − 1 edits on s produces a length-n sequence v such that edit (s, v) ≤ r − 1 < r and edit By definition, setting B = S n makes f r B (2r, 2r + 1)-sensitive if r is even and (2r − 1, 2r + 1)-sensitive if r is odd.This provides nearly optimal bucketing performance in the sense that there is no gap (when r is even) or the gap is just one (when r is odd).It is evident from the proof that the gap at 2r indeed exists when r is odd because if s can only be transformed to t by r pairs of indels, then there is no length-n sequence v with edit (s, v) = edit (t, v) = r.

Properties of (r, r)-guaranteed subsets
In the above section all sequences in S n are used as buckets.A natural question is, can we use a proper subset of S n to achieve (gapped) LSB functions?This can be viewed as down-sampling S n such that if two length-n sequences s and t are similar, then a length-n sequence is always sampled from their common neighborhood N r n (s) ∩ N r n (t).Here we focus on the case that d 1 = r, i.e., we aim to construct B that is (r, r)-guaranteed.Recall that this means for any s, t ∈ S n with edit (s, t) ≤ r, we have N r n (s) ∩ N r n (t) ∩ B = ∅.In other words, f r B is (r, 2r + 1)-sensitive.To prepare the construction, we first investigate some structural properties of (r, r)-guaranteed subsets.We propose a conjecture that such sets form a hierarchical structure with decreasing r: We prove a weaker statement: Proof.Let s and t be two length-n sequences with edit (s, t) ≤ r + 2; we want to show that Consider a list of edits that transforms s to t: skipping a pair of indels or two substitutions gives a length-n sequence m such that edit (s, m) ≤ r and edit (t, m) = 2.Because s and m are within a distance of r and B is (r, r)-guaranteed, we have that N r n (s) ∩ N r n (m) ∩ B = ∅, i.e., there exists a length-n sequence v ∈ B such that edit (s, v) ≤ r and edit (m, v) ≤ r.By triangle inequality, The next lemma shows that (1, 1)-guaranteed subsets have the strongest condition.
Proof.According to the previous lemma, we only need to show that B is (2, 2)-guaranteed.Given two length-n sequences s and t with edit (s, t) = 2, consider a list Q of two edits that transforms s to t.There are two possibilities: • If both edits in Q are substitutions, let i be the position of the first substitution.
• If Q consists of one insertion and one deletion, let i be the position of the character that is going to be deleted from s.
In either case, let m be a length-n sequence obtained by replacing the i-th character of s with another character in Σ.Then edit (s, m) = 1.Because B is (1, 1)-guaranteed, there is a length-n sequence v ∈ B such that edit (s, v) ≤ 1 and edit (m, v) ≤ 1. Observe that either s = v or v is obtained from s by one substitution at position i.So applying the two edits in Q on v also produces t, i.e., edit (t, v) ≤ 2. Therefore, v ∈ N 2 n (s) ∩ N 2 n (t) ∩ B. Now we bound the size of a (1, 1)-guaranteed subset from below.
Proof.Let B ⊂ S n be an arbitrary (1, 1)-guaranteed subset.For part (i), because . .s n ∈ B, then it must have at least n 1-neighbors v i ∈ B, one for each position 1 ≤ i ≤ n, where v i = s 1 . . .s i−1 v i s i+1 . . .s n , v i = s i .Suppose conversely that this is not the case for a particular i.Let t = s 1 . . .s i−1 t i s i+1 . . .s n where t i = s i .We have edit (s, t) = 1.Also,    1).function f , for the gap at d, define categories 0, . . ., d/2 corresponding to the types of edits: a pair of length-n sequences with edit distance d is in the i-th category if they can be transformed to each other with i pairs of indels (and d − 2i substitutions) but not i − 1 pairs of indels (and d − 2i + 2 substitutions).Figure 2 shows the results for the three LSB functions in Figure 1 at their respective gaps with respect to different types of edits.Observe that the result for f 1 Sn (in red) agrees with our analysis above.where a is the number of substitutions and b is the number of pairs of indels.Left: two (1, 3)-sensitive bucketing functions (rows 2 and 3 of Table 1).Right: the (3, 5)-sensitive bucketing function (row 4 of Table 1).

Conclusion and Discussion
We introduce locality-sensitive bucketing (LSB) functions, that generalize locality-sensitive hashing (LSH) functions by allowing it to map a sequence into multiple buckets.This generalization makes the LSB functions easier to construct, while guaranteeing high sensitivity and specificity in a deterministic manner.We construct such functions, prove their properties, and show that some of them are optimal under proposed criteria.We also reveal several properties and structures of the metric space S n , which are of independent interests for studying LSH functions and the edit distance.
Our results for LSB functions can be improved in several aspects.An obvious open problem is to design (d 1 , d 2 )-sensitive functions that are not covered here.For this purpose, one direction is to construct optimal (r, r)-guaranteed subsets for r > 1.As an implication of Lemma 11, it is worth noting that the optimal (1, 1)-guaranteed subset is a maximal independent set in the undirected graph G 1 n whose vertex set is S n and each sequence is connected to all its 1-neighbors.It is natural to suspect that similar results hold for (r, r)-guaranteed subsets with larger r.Another approach is to use other more well-studied sets as buckets and define LSB functions based on their connections with S n .This is closely related to the problem of embedding S n which is difficult as noted in the introduction.Our results in Section 3 suggest a new angle to this challenging problem: instead of restricting our attention to embedding S n into metric spaces, it may be beneficial to consider a broader category of spaces that are equipped with a non-transitive relation (here in LSB functions we used subsets of integers with the "have a nonempty intersection" relation).Yet another interesting future research direction would be to explore the possibility of improving the practical time and space efficiency of computing and applying LSB functions.
A technique commonly used to boost the sensitivity of an LSH function is known as the ORamplification.It combines multiple LSH functions in parallel, which can be viewed as sending each sequence into multiple buckets such that the probability of having similar sequences in one bucket is higher than using the individual functions separately.However, as a side effect, the OR-amplification hurts specificity: the chance that dissimilar sequences share a bucket also increases.It is therefore necessary to combine it with other techniques and choosing parameters to balance sensitivity and specificity is a delicate work.On contrast, the LSB function introduced in this paper achieves a provably optimal separation of similar and dissimilar sequences.In addition, the OR-amplification approach can also be applied on top of the LSB functions as needed.

Lemma 3 .
If f : S n → P(B) is (1, 2)-sensitive, then |B| ≥ n|Σ| n−1 .Proof.Consider the collection of pairs H = {(s, b) | s ∈ S n and b ∈ f (s)}.We bound the size of H from above and below.For an arbitrary sequence s, let b ∈ f (s) be a bucket that contains s.According to the LSB-property (2), any other sequence in b has edit distance 1 from s, i.e., a substitution.Suppose that the bucket b contains two sequences u and v that are obtained from s by a single substitution at different positions.Then edit (u, v) = 2 and f (u) ∩ f (v) = ∅, which contradicts the LSB-property (2).Therefore, all the sequences in b can only differ from s at some fixed position i.There are |Σ| such sequences (including s itself).So each bucket b ∈ B can appear in at most |Σ| pairs in H. Thus |H| ≤ |Σ| • |B|.On the other hand, for a length-n sequence s, its 1-neighborhood N 1 n (s) contains n(|Σ| − 1) other length-n sequences, corresponding to the |Σ| − 1 possible substitutions at each of the n positions.The LSB-property (1) requires that s shares at least one bucket with each of them.As argued above, each bucket b ∈ f (s) can contain at most |Σ| − 1 sequences other than s.Therefore, s needs to appear in at least n(|Σ| − 1)/(|Σ| − 1) = n different buckets, and hence at least n pairs in H.So |H| ≥ n|S n | = n|Σ| n .Together, we have |Σ| • |B| ≥ n|Σ| n , or |B| ≥ n|Σ| n−1 .

Figure 1 :
Figure1: Probabilities (estimated by frequencies) that two sequences share a bucket with respect to their edit distance under three gapped LSB functions (red, green, and blue bars correspond to the rows 2-4 of Table1).
, t) = 2 Frequency of s and t sharing a bucket under the function f r B r = 1, B = S n r = 1, B = (1, t) = 4 Frequency of s and t sharing a bucket under the function f r B r = 2, B = (1, 1)-guaranteed subset

Figure 2 :
Figure2: Probabilities (estimated frequencies) that two sequences share a bucket with respect to their edit type under three gapped LSB functions.The types of edits are labeled in the format a + b × 2 where a is the number of substitutions and b is the number of pairs of indels.Left: two (1, 3)-sensitive bucketing functions (rows 2 and 3 of Table1).Right: the (3, 5)-sensitive bucketing function (row 4 of Table1).