Context-aware seeds for read mapping

Motivation Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows, this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers. Results We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mappings but uses fewer (and longer) seeds, which reduces seed frequencies and increases efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed in the reference. We prove that all valid mappings can be found if the sum of confidence radii of seeds are greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We experiment CAS with E. coli genome and show that CAS significantly reduces seed frequencies when compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver. Availability https://github.com/Kingsford-Group/CAS_code


Introduction
Read mapping is used ubiquitously in bioinformatics. With | · | denoting the length of a string; T [·, ·] denoting the substring of a string T at a fixed interval; and D(·, ·) as the edit distance measurement between a string pair, commonly, read mapping is defined as follows: To efficiently map reads, modern mappers usually employ the seed-and-extend mapping strategy [1][2][3][4]: a mapper extracts a substring of R as a seed, s; iterates through all seed locations of s in T; at each seed location, performs sequence alignment of R against the surrounding text in T; reports alignments that have edit distances below t as valid mappings.
For mappers that use non-overlapping seeds, the number of seeds to extract from a read R is governed by the pigeonhole principle: to find all valid mappings of R, the mapper must divide R into at least t non-overlapping seeds. Otherwise, the mapper will not be able to consistently find all valid mappings of R in T. As t grows, the length of seeds is reduced. Using short seeds significantly increases the workload of a mapper [5,6]. Shorter seeds appear more frequently in T, hence increasing the number of alignments while mapping a read. To improve the performance of mappers, it is desirable to use fewer nonoverlapping seeds under a fixed t, which lets a mapper not only use fewer seeds, but also use longer seeds.
In this paper, we focus on improving seed-and-extend mappers that use non-overlapping seeds. We propose a

Open Access
Algorithms for Molecular Biology *Correspondence: carlk@cs.cmu.edu Xin et al. Algorithms Mol Biol (2020) 15:10 novel seeding scheme, called context-aware seeds (CAS). CAS enables a mapper to use fewer than t seeds without missing any valid mappings. CAS attaches each seed s with a confidence radius score, c s , with c s ≥ 1 . Let S be a set of non-overlapping seeds from R. CAS ensures that as long as s∈S c s ≥ t , then S is sufficient to find all valid mappings of R under an error tolerance threshold of t. When S includes any seed s with c s > 1 , then |S| < t and all valid mappings are secured with fewer-than-t seeds (|S| denotes the number of seeds in S). In the worst case where c s = 1 for all s ∈ S , CAS degenerates into the case governed by pigeonhole principle with |S| = t. Fig. 1 Illustration of CAS. The upper part shows a read and a reference. Suppose that t = 4 , i.e., we want to find all alignments of the read in the reference with fewer than 4 edits. There is only one such locally optimal alignment (marked as red). The middle part shows the seed extraction result with the pigeonhole principle, which splits the read into t = 4 seeds. This gives many seed locations and thus many alignments. With CAS (in the lower part), we can split the read into 2 long seeds while still guarantee to find all valid mappings. The two long seeds together have a total seed frequency of 2, drastically reducing the number of alignments Figure 1 compares CAS and the pigeonhole-principlebased seeds. In this example, we want to find all mappings of a read AAC CTT GG under an error tolerance threhsold of t = 4 . Assume that we have verified that each of the two CAS seeds AACC and TTGG has confidence radii of c s = 2 . Therefore CAS can be guaranteed to find all valid mappings with just these two seeds, as s c s = 4 ≥ t . Using the pigeonhole principle, however, a mapper needs to select t = 4 non-overlapping seeds. It forces the mapper to pick short and repetitive seeds, making the mapper perform more local alignments.
We establish the theoretical foundation of CAS and demonstrate that with CAS future mappers can map reads more efficiently using fewer, longer and less frequent seeds without losing valid mappings. We also propose a suffix-trie-based CAS database construction algorithm that builds a CAS database from T in linear time, based on which we design a greedy CAS seeding algorithm that extracts CAS from reads. We test the greedy CAS seeding algorithm against a state-of-the-art pigeonhole-principle-based seeding algorithm, Optimal Seed Solver (OSS), on an E. coli dataset.

Context-aware seeds
CAS reduces seed usage in read mapping by introducing a novel metric for seeds in T, the confidence radius. A seed s in T has a confidence radius c s if all substrings in T whose edit distance is smaller than c s must occur in T within a small window where s occurs. The size of the window equals to extending the length of s by c s − 1 letter(s) at both ends. For example, seed AACC in T from Fig.1 has a confidence radius of 2. Any substring in T whose edit distance to AACC equals 1 (e.g., AAC , ACC , GAACC , AACCG ) locates within the 1-letter extended window of AACC (GAA CCG ). The confidence radius of each possible seed in T can be computed by profiling T (see the next Section). CAS guarantees that all valid mappings of a read R can be located, as long as the seeds s extracted from R collectively have a confidence radius of s c s ≥ t . Below, we give the formal definition of CAS and prove the correctness of CAS.
Let s be a string in T and [l 1 , l 2 ] be a pair of locations in T. We say string T [l 1 , l 2 ] is in the vicinity of s under an integer c, if s appears in an interval Furthermore, let seed s be a substring of R at [l r1 , l r2 ] ( s = R[l r1 , l r2 ] ) and let T [l 1 , l 2 ] be a valid mapping of R. We say T [l 1 , l 2 ] is in the vicinity of s with regard to R under c, if string T [l 1 + l r1 , l 1 + l r2 ] is in the vicinity of s under c. If a valid mapping T [l 1 , l 2 ] is in the vicinity of s with respect to R under t, then T [l 1 , l 2 ] can be discovered by locally aligning R against the surrounding text in T at each seed location of s.
The pigeonhole principle states that by dividing R into a set of t non-overlapping seeds, denoted by S, then for all intervals [l 1 , l 2 ] where T [l 1 , l 2 ] is a valid mapping of R, there must be s ∈ S where T [l 1 , l 2 ] is in the vicinity of s with regard to R. CAS seeks to retain the seed vicinity guarantee of the pigeonhole principle, where all valid mappings of a read R are in the vicinity of its seeds with regard to R under t, with fewer than t seeds. Given two substrings s and s ′ of T and an edit-distance threshold t, we say s ′ is a neighbor of s if D(s, s ′ ) < t . Assume that s ′ is a neighbor of s under t, CAS defines s ′ as a trivial neighbor of s, if and only if for any location interval [l 1 , l 2 ] where T [l 1 , l 2 ] = s ′ and T [l 1 , l 2 ] is in the vicinity of s under D(s, s ′ ) . Otherwise CAS defines s ′ as a nontrivial neighbor of s. Finally, CAS defines the confidence radius c s of s as the minimum editdistance between s and all nontrivial neighbors of s. Since a seed is trivial to itself and is at least 1-edit-distance away from any other string, we have c s ≥ 1 for any seed s.
We now give the central theorem of CAS, the theoretical foundation that enables seed-and-extend mappers to find all valid mappings using fewer than t seeds.
Theorem 1 Let S be a set of non-overlapping seeds of a read R. Assume s∈S c s ≥ t . Then for any reference string is not in the vicinity, with regard to R under t, of any s ∈ S . In the minimum-edit-distance alignment between R and T [l 1 ′ , l 2 ′ ] , assume that the non-overlapping seeds s 1 , s 2 , ..., s n of R are aligned to the non-overlapping segments s T 1 , s T 2 , ..., s Tn of T [l 1 ′ , l 2 ′ ] , with n = |S| . Since T [l 1 ′ , l 2 ′ ] is not in the vicinity, with regard to R under t, of any s ∈ S , none of the seeds s i matches exactly to its counterpart sequence S Ti ; and none of the sequences s Ti is a trivial neighbor under the confidence radius c si of its counterpart seed s i . Because c s i is the minimum editdistance between s i and any of its nontrivial neighbors, We call Theorem 1 the weighted Pigeonhole principle. The confidence radius of each seed serves as the weight of the seed in computing the overall error tolerance of the seed set.

Construction of confidence radius database
The confidence radius c s of each seed s is stored in a table, called the confidence radius database. The confidence radius database only needs to be constructed once offline for a reference T. Computing c s of seed s involves finding the minimum edit distance to its nontrivial neighbors. Below we propose an algorithm that constructs the confidence radius database in O(|�| 2 · M) time, where is the alphabet set of T and M is the total number of neighbors of all strings in T (up to length P). Note that c s need not be a tight lower bound. Ordinary seeds can be perceived as CAS with confidence radii of c s = 1 . As will be discussed later, tighter bounds (greater confidence radii) take longer to compute. We limit the search space in computing the confidence radii of seeds by introducing a maximum allowed confidence radius threshold θ . Seeds whose exact confidence radii (tight bounds) that are greater than θ are set to c s = θ . In this section, we assume both P and θ are constants.
The confidence radius database is constructed in two steps: first, we construct a neighbor database, which stores all neighbors of all seeds (up to length P) under the confidence radius threshold θ ; then, we find the confidence radius of each from its neighbors. We prove that both steps can be done in O(|�| 2 · M) time.

Construction of the neighbor database
To find all neighbors of all substrings in T (up to a maximum length P), we first build a P-level suffix trie of T, then find all neighbors of each seed in Trie by systematically traversing the suffix trie in a top-down manner. Formally, let Trie = (V , E) be a suffix trie of T of a maximum depth of P + θ . Let r ∈ V be the root of Trie. Each node represents a substring in T, i.e., the string obtained by concatenating the letters on edges along the path from r to v. We denote the edit distance between these two substrings corresponding to u and v as D (u, v). We aim to solve the following problem: Problem 2 Given a suffix trie Trie = (V , E) and an integer θ , compute all pairs of nodes u, v ∈ V such that We have the following lemmas.
Proof Proved in Landau and Vishkin [7] by enumerating and validating all possible scenarios.

Proof
This follows from the dynamic programming algorithm for the edit distance problem.
Lemma 1 shows that nodes are neighbors only if their parents are neighbors. Hence the neighbors of a child node must be the children of the neighbors of its parent node. Lemma 2 further shows that the edit distance between two children nodes can be computed in constant time, given the edit distances between one child and the parent node of the other child, as well as the edit distance between the two parent nodes.
We construct the neighbor database by traversing Trie as follows: First, assign each node in V an integral rank from {1, 2, · · · , |V |} following a top-down, left-to-right order. The root r of Trie has rank of 1, and then the children of children of r have ranks of 2, 3, · · · , from the leftmost child to the rightmost child, and so on. Nodes that are deeper in Trie rank higher. Among nodes of the same depth, children of a higher ranking parent node rank higher. A breadth-first-search traversal of Trie ranks all nodes.
For any v ∈ V , we define X v := {u ∈ V | D(u, v) ≤ θ} as the set of neighbors of v, including v itself, and define Y v := {D(u, v) | u ∈ X v } as the accompanying edit-distance set of X v . For every neighbor node u in X v , Y v provides the edit distance between u and v. We compute X v and Y v for each node v ∈ V from low ranking nodes to high ranking nodes. Both X v and Y v are implemented as arrays.
The algorithm for constructing the neighbor database is summarized in Algorithm 1. We iterate through all nodes by rank from low to high. For each node v ∈ V , we iterate through all children of v. For each children node Figure 2 illustrates the process of validating a candidate neighbor u ′ of another node v ′ , based on the information of its parent node v and the neighbor u of v, where u is also the parent of u ′ (lines 4-25 in Algorithm 1). We prove that this algorithm maintains the following three invariants: 1 For any node v ∈ V , array X v is always sorted according to their ranks, i.e., nodes that are added to X v are always in ascending order w.r.t. their ranks. 2 Right before processing node v (i.e., before line 4 of Algorithm 1), X v and Y v are already computed and sorted w.r.t. their ranks. 3 Right after processing node v (i.e., after line 17 of Algorithm 1), X v ′ and Y v ′ are computed and sorted w.r.t. their ranks for each child v ′ of v.
The initialization step Algorithm 1 (line 2) computes X r and Y r for root node r. Its neighbors include all nodes whose depth in Trie is no greater than θ . The edit distance between r to a neighbor node u is simply the depth of u minus 1 (we assume that root r is at depth 1). Root r is also in X r with D(r, r) = 0 . Clearly, the first and the second invariant hold for root r.
In the main loop (lines 3-26), for a node v ∈ V , Algorithm 1 iterates through all of its children. For a child v ′ of v, lines 4-25 compute X v ′ and Y v ′ of v ′ . Line 4-6 initialize the pointers that will be used to fetch the edit distances D(v, u ′ ) and D(v ′ , u) , which are stored in Y v and the partially computed Y v ′ , respectively. Because u ranks higher than u ′ , by the time of computing is then computed according to Lemma 2. Specifically, pointer k tracks the position of u in array X v (i.e., the index of u in array X v ); pointer k v tracks the position of u ′ in array X v ; and pointer k v ′ tracks the position of u in array X v ′ . Line 11 computes Algorithm 1 maintains the first invariant. For each child v ′ of v, assuming X v is sorted, then neighbors are also added to X v ′ in a sorted manner, as Algorithm 1 iterates through neighbors ordered by X v . Since X r is sorted for root r, given the inductive nature of Algorithm 1, we conclude that X v must be sorted for any v ∈ Trie.
Algorithm 1 maintains the third invariant. According to Lemma 1, a node u ′ ∈ X v ′ requires u ∈ X v for their parents u and v. Any node ū ′ whose parent ū � ∈ X v results in ū ′ � ∈ X v ′ . Algorithm 1 iterates through all u in X v . Therefore, after line 17, all neighbors of child v ′ must have been found, assuming the second invariant holds. The second invariant holds because all neighbors of r are correctly defined during initialization. As the algorithm propagates, because of the inductive nature of Algorithm 1, the second invariant holds.
Let M denote the member size of set Since pointers of k v and k v ′ can only move forward, lines [14][15][16][17][18][19] With | | being a small constant (for example = {A, C, G, T } for DNA analysis), Algorithm 1 finds all M neighbor pairs in Trie in O(M) time.

Computing the confidence radius among nontrivial neighbors
The neighbor database stores both the trivial and nontrivial neighbors of each seed. However, CAS only requires the minimum edit distance to the nontrivial neighbors of each seed. In order to derive the confidence radius of each seed, we propose an augmentation to Algorithm 1, such that it computes the minimum edit distance to nontrivial neighbors while constructing the neighbor database. We prove that the augmentation does not increase the time complexity of Algorithm 1.
Within the neighbor array X v of a node v, let the sub-array X 0 v store all trivial neighbors and X 1 v store all nontrivial neighbors, where X v = X 0 v ∪ X 1 v . By definition, the confidence radius of v is computed as . Let u be a neighbor of v; we say u is an immediate neighbor of v if u is a substring, or a superstring, or an overlapping string of v; otherwise we say u is a nonimmediate neighbor of v (see Fig. 3 for examples). Immediate neighbors are not necessarily trivial neighbors. If u is a trivial neighbor of v, by definition, then u must be an immediate neighbor of v. However, the opposite is not necessarily true, i.e., u could be a substring of v (an immediate neighbor) yet u is nontrivial to v. Substring u may appear at more locations in T than v does. It is easier Xin et al. Algorithms Mol Biol (2020) 15:10 to determine whether u is an immediate neighbor to v than whether u is a trivial neighbor to v.
Let X 2 v be the set of non-immediate neighbors of a node v. The minimum edit distance from v to nontrivial neighbors of v equals to the minimum edit distance between v to neighbors in X 2 v . We prove this in Theorem 4. To prove Theorem 4, we first prepare the following two lemmas.   We find the immediate neighbors, X 3 v , of each node v ∈ Trie , by checking if a neighbor u ∈ X v is a immediate substring, superstring or overlapping string of v. We associate with each node v a new vector the updated workflow is illustrated in Fig. 5.
Computation of F(v, u) can be piggybacked on top of computing D(v, u) in Algorithm 1 . Given u and v, F(v, u) stores whether v and u possess any of the below immediate conditions: 4 v is a suffix of u. 5 u is neither a prefix nor a suffix but a substring of v. 6 v is neither a prefix nor a suffix but a substring of u. 7 A prefix of u is a suffix of v. 8 A suffix of u is a prefix of v.
From above immediate conditions, we deduce the immediate relationship between v and u. With conditions 1-6, we can infer the superstring-substring relationship. With Condition 7-8, we can infer the overlapping relationship.
If v and u qualifies none of the above immediate conditions, then they must be non-immediate neighbors.
For simplicity, we initialize each node as satisfying immediate conditions 1, 2, 3 and 4 to itself. We initialize the root node r as a prefix to any of its neighbors; and any neighbors of r as a suffix to r. Finally, r is not an overlapping string or a substring of any neighbor.

F(u, v) can be computed in constant time if F(p(v), p(u)), F(p(v), u)
and F(v, p(u)) are known. For example, in Fig. 5, . Conditions 3 and 4 are mirror cases of conditions 1 and 2, respectively with v and u, v ′ and u ′ trading places. F (v ′ , u ′ ) satisfies condition 5, only if (a) F (v ′ , u) satisfies condition 5 or (b) F (v ′ , u) satisfies condition 2, while v ′ � = u ′ and v is not root. Condition 6 is a mirror case of condition 5. F (v ′ , u ′ ) satisfies condition 7 only if (a) F (v, u ′ ) satisfies condition 7 or (b) F (v, u ′ ) satisfies condition 2, while v = u ′ and v is not root. Condition 8 is a mirror case of condition 7.
The computation of F (·, ·) is piggybacked on top of the computation of D(·, ·) , as both methods use dynamic programming. Both computations share the same structure: To compute D(·, ·) and F (·, ·) between two child nodes, we must aquire D(·, ·) and F (·, ·) between all child-parent and parent-parent node pairs; and then derive D(·, ·) and F (·, ·) between the two child nodes in constant time. As a result, piggybacking the computation of immediateness does not increase the complexity of Algorithm 1.
Finally, the confidence radius of node v equals u) does not satisfy any of the immediate conditions. The confidence radius of a node can be found by simply scanning its neighbor array, which finishes in linear time. The overall complexity of constructing the confidence radius database is still O(|�| 2 · M).

Maintaining the confidence radius metadata
The confidence radii of seeds is stored in a |T|-by-P table, where P is user-provided. Fig. 6 demonstrates an example confidence radius table. The [x, y] entry of the table stores the confidence radius of seed T [x, x + y] . Overall, the confidence radius table has a space complexity of O(|T | · P) , with |T | ≫ P in practice.
In read mapping, it is not always necessary to maintain the confidence radii for every seed. Instead, it is often sufficient to just maintain the confidence radii of seeds at fixed-length intervals. For instance, instead of recording the confidence radii for seeds of all lengths between 1 to P at every location x in T, the confidence radius database can selectively record the confidence radii just for seeds of lengths at every 10-bp interval: T [x, x + 10] , T [x, x + 20] , ..., T [x, x + 10 × ⌈ P 10 ⌉] . We call this method interval sampling. Let I denote the interval length for seed sampling, interval sampling reduces the space complexity of the CAS database to O(|T | · ⌈ P I ⌉) ] . An example confidence radius database with a length of I = 4 is provided in Fig. 7.
For seeds that are not sampled by the confidence radius database, their confidence radii may still be indirectly inferred during mapping. Given a seed u and its substring v, we have the following theorem:

Proof
Assume c u < c v . Then there exists a string w ∈ T where D(u, w) = c u while w is not in the vicinity of u. Since v is a substring of u, there must exist w ′ , a substring of w, where D(w ′ , v) ≤ c u . However, since , w ′ must not be a non-trivial neighbor of v. Hence w ′ must be in the vicinity of v. Because w is not in the vicinity of u but w ′ is in the vicinity of v, we conclude that v occurs in at least one more location in T than u does (in the vicinity of w ′ ). This contradicts the fact that occ(u) = occ(v) .
Theorem 4 suggests that for a seed s that is not sampled by the confidence radius database, as long as there exists a confidence-radius-database-covered substring s ′ of s that shares the same occurrence frequency with s, then c s ≥ c s ′ . Since the confidence radius is just a lower bound, and need not be an exact measurement of the minimum edit distance towards the non-trivial neighbors of a seed, we can set c s = c s ′.
In mapping, upon encountering a seed s with a length that is not a multiple of the sampling interval I, the mapper may enumerate all substrings of s with lengths of I · ⌊ |s| I ⌋ . Unlike s, these substrings are supported by the confidence radius database. The mapper then checks the frequency of each substring in T and removes the substrings whose occurrence frequencies do not match s. The confidence radius of s is set to the maximum confidence radius among the remaining frequency-matching substrings. This method works only if there exists at least one frequency-matching substring of s. When there is none, the mapper is recommended to give up s, and directly use a database-supported substring of s instead, preferably the one with the minimum occurrence frequency or the maximum confidence radius, which maximizes the mapping efficiency.

A seeding scheme with context-aware seeds
While the major goal of this paper is to establish the theoretical framework of CAS, to test the effectiveness of CAS, we propose a greedy seed selection method, referred to as greedy CAS seeding. Greedy CAS seeding selects consecutive Maximum Exact Matching substrings (MEMs, which are seeds that cannot be further extended without bumping into errors) from a read as seeds. At the end of each MEM, greedy CAS seeding heuristically skips the next two base pairs, in an effort to skip potential errors. Greedy CAS seeding sorts seeds by their frequency from low to high, into S raw . Then selects the minimum number of seeds S from S raw in sequential order such that s∈S c s ≥ t . In the rare cases where there is insufficient number of CAS seeds such that there does not exist a set of seeds S with s∈S c s ≥ t , greedy CAS  seeding reverts back to using the pigeonhole principle, by dividing the read into t non-overlapping seeds. Figure 8 compares the seed extraction results of greedy CAS seeding against the state-of-the-art, pigeonholeprinciple-based seeding method, the Optimal Seed Solver (OSS) [8]. OSS has been previously shown to generate the least frequent seeds, when compared to other pigeonhole-principle-based seeding methods, such as flexible-placement k-mers or spaced seeds. Figure 8 demonstrates both seeding methods in action under t = 4 . Greedy CAS seeding is shown in the upper half while OSS is shown in the lower half. Compared to OSS, which uses a total of t = 4 seeds, greedy CAS seeding uses only two seeds. As a result, greedy CAS seeding can afford longer and less frequent seeds.
When interval sampling is enabled, greedy CAS seeding follows the seeding mechanism introduced above. If a seed s has a length that is not supported by the confidence radius database and it has no frequency-matching substrings, then s is withdrawn and the database-supported substring of s with the maximum confidence radius is selected as a substitute.
Greedy CAS seeding has a maximum complexity of O(|R| + |S| log(|S|)) . We use a Burrows-Wheeler Transformation (BWT) array to index seeds. With the BWT array, it takes O(|s|) operations to access the seed database for seed s and locate all seed locations of s. Given that s∈S |s| ≤ |R| , and |S| ≤ t ≪ |R| , we conclude that the maximum complexity of greedy CAS seeding is O(|R| + t log(t)).

Experiments
We benchmark greedy CAS seeding against OSS and naïve seeding on the E. coli genome. Naïve seeding selects consecutive 12-bp seeds following the Pigeonhole principle. We benchmark both seeding schemes on a 22-million, 100-bp E. coli read set from EMBL-EBI, ERX008638-1. We build a confidence radius database for E. coli genome with a maximum confidence radius threshold θ = 5 and a max seed length P = 60 . We measure the effectiveness of both approaches by comparing the average total seed frequency of selected seeds under different edit distance thresholds t = {1, 2, 3, 4, 5} . The average total frequency is the sum of seed frequencies extracted from each read, averaged over all reads in the read set. Figure 9 shows the average total seed frequency comparison between the two approaches. OSS has slightly smaller total seed frequency (averaged over all reads) under t = 1 , but it quickly increases, exceeding CAS at t > 1 . OSS outperforms CAS under t = 1 because greedy CAS seeding extracts seeds sequentially; while OSS scans through all possible MEM placements in a read and picks the least frequent placement. CAS achieves significantly lower seed frequencies with t > 1 . When t gets larger, OSS is pressured to use more seeds, which leads to using shorter and more frequent seeds. To the contrary, greedy  CAS seeding often uses fewer than t seeds, as shown in Fig. 10, which lets it use longer and less frequent seeds. We also benchmarked the performance of CAS with variate sampling intervals. The results are summarized in Figs. 11, 12. As shown in both figures, while the average seed frequency increases as the interval increases, the magnitude of the increase is very small. Compared to CAS without interval sampling, CAS with a sampling interval of I = 8 only increases the average seed frequency by 6%, while reducing the storage footprint by 8 × . Overall, interval sampling significantly reduces the space complexity of the CAS database, with minor performance loss.
CAS is expected to perform better on larger genomes. The E. coli genome is a small genome, which has only around 4.6 million base pairs. In comparison, the human genome has more than 3 billion base pairs. For small genomes, seeds become less frequent by nature. Therefore short seeds become acceptable as they are not as frequent as they are in larger genomes. We therefore expect CAS to perform better in larger genomes. However, due to practical (not theoretical) limitations in scaling up the construction of the confidence radius database on larger genomes (further elaborated in the Discussion section), we only demonstrate CAS on the E. coli genome.
While the focus of this paper is to establish the theoretical foundation of CAS, instead of providing a complete read mapping solution, it is worth mentioning that greedy CAS seeding (only the seeding mechanism) is more practical than OSS. OSS requires scanning through all substrings of R, which has a total size of O(|R| 2 ) , for seed frequencies. Combined with BWT, it takes at least O(|R| 2 ) operations to collect all seed frequencies with OSS. Greedy CAS seeding, to the contrary, finishes in O(|R| + t log(t)) time with t ≪ |R|.

Discussion
Although Algorithm 1 finishes in O(|�| 2 · M) time, in practice, M could be on the scale of trillions or more, for large and complex genomes. This is because for large genomes, the suffix trie is close to full in the first ten to twenty levels, where almost every permutation of letters exists. Nodes in these levels have large numbers of neighbors: the number of neighbors of a node v, equals to the number of unique strings formed by editing the string of v with up to t edits. After each edit, the resulting string is guaranteed to appear in Trie. Figure 13 demonstrates the average number of neighbors (both trivial and non-trivial) per seed at different depths in a Trie. The Trie in Fig. 13 is built with E. coli genome with θ = 4 . As the depth grows, the average number of neighbors per node first increases, then peaks at 9 and then quickly decreases. Such a trajectory is closely related to the trend of the average number of children per node in Trie. Figure 14 presents the average number of children per node at variate depths in the E. coli Trie. Between depths 1-9, the E. coli Trie is close to a full 4-nary tree, with most nodes having 4 children. Between depths 10-14, the average number of children per node quickly decreases and stabilizes at slightly above 1. As a result, Trie becomes increasingly sparse after depth 15. Between depths 1-9, as Trie is dense, the number of neighbors of a seed approaches to the number of unique permutations of the seed, with up to t edits. As Trie quickly becomes sparse after depth 9, the number of neighbors of a seed quickly diminishes. Finally, the average count of neighbors stabilizes at around 25, which is roughly the average number of immediate neighbors of a seed. As the depth increases, fewer nodes have non-immediate neighbors. Figure 15 illustrates the average number of non-immediate neighbors of seeds at different depths in Trie. As shown in the figure, the average number of non-immediate neighbors per node decreases to slightly above 0 after depth 20. Hence, the majority of seeds in E. coli with lengths greater than 20 have no non-immediate neighbors and, therefore, have confidence radii of 4. The suffix trie maintains a similar structure in complex genomes. At top, the trie is very dense, close to a full 4-nary tree and the suffix trie expands exponentially. After a certain depth (9 for E. coli), the suffix trie quickly becomes sparse, with the average number of children per node stabilizing at a little bit over 1. Similarly, the average number of neighbors first expands super-linearly. Then it peaks at around the depth where the suffix trie turns sparse and quickly decreases to just the average number of trivial neighbors per node.
However, unlike E.coli, the suffix tries in complex genomes have much greater scales. In human genomes, there are more than one billion unique 15-base-pair suffixes, compared to around one million in E. coli. This means that for human genomes, with θ = 4 , there could be more than 1 trillion total neighbors just for 15-basepair suffixes. Maintaining metadata at such scale vastly exceeds the capacity of our currently available computational power. From our experiment, it takes around 300 CPU hours to compute the confidence radius database for the E. coli genome with θ = 4 and P = 60 on a multi-CPU, mechanical hard drive system. However, it is worth noting that as a theoretical study, the database construction program is not fully optimized for speed and is currently I/O-bound due to frequently reading and writing neighbor information into neighbor arrays of nodes in Trie.  While there are many nodes (long suffixes) with fewer neighbors, given that Algorithm 1 traverses Trie in a top-down manner, it is unavoidable to track the massive number of neighbors for short suffixes. This is an interesting algorithmic problem for future work.
CAS may be applied to situations other than NGS read mapping. For example, the idea of context-aware seeds may improve long-read mapping. Long reads suffer from high error rates [9][10][11]. Finding error-free seeds for long reads is very challenging [12]. CAS can serve as a metric measuring the likelihood of seeds having errors: if there exists a seed, s, with high confidence radius, it is highly likely that s is free of errors. The likelihood of obtaining a reference-matching seed through many accidental errors is small.
Finally, CAS can be applied to develop probes for DNA and RNA identification. When designing probe sequences, it is important to make certain that the target sequence is unique in the genome [13][14][15]. It prevents probes from accidentally annealing to a similar sequences. CAS checks the existence of similar sequences by consulting the confidence radius database.

Conclusion
In this work, we proposed a new seeding framework, context-aware seeds (CAS). CAS extends the pigeonhole principle and guarantees finding all valid mappings with fewer seeds. CAS associates each seed s with a confidence radius c s , defined as a lower bound of edit distances towards nontrivial neighbors of s. We proved that the CAS can find all valid mappings of any read R, as long as its seeds s satisfy c s ≥ t.
We proposed a linear-time algorithm for constructing the confidence radius database. It computes the confidence radii of seeds by traversing the suffix trie of a reference. We experimented with CAS on the E. coli genome and compared it against the state-of-the-art pigeonholeprinciple-based seeding scheme, OSS, and showed that CAS outperforms OSS by significantly reducing the sum of seed frequencies.
This paper focuses on the theoretical aspects of CAS, especially how it extends the pigeonhole principle into using fewer seeds. Composing a practical solution of Algorithm 1 on larger genomes is an interesting-yet-separate problem for future work.