Fast, parallel, and cache-friendly suffix array construction
Algorithms for Molecular Biology volume 19, Article number: 16 (2024)
Abstract
Purpose
String indexes such as the suffix array (sa) and the closely related longest common prefix (lcp) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize.
Methods
In this paper we present capssa, a simple and scalable parallel algorithm for constructing these string indexes, inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, capssa has excellent memory locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies.
Results
We show that despite its simple design, capssa outperforms existing state-of-the-art parallel sa and lcp-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context sa and show that capssa can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://github.com/jamshed/CaPSSA.
Introduction
Methods for aligning sequencing reads to reference genomes underlie some of the most well-developed and widely-used tools in bioinformatics [2]. Modern read-to-reference aligners typically employ an index over the reference text. A classic index for strings is the suffix array (sa) [40], which is an array of indices of the lexicographically sorted suffixes of a string. In alignment, the sa index is used by the popular STAR aligner [14] as well as in other tools [53, 55]. The sa has also been used in short-read error correction [24] and sequence clustering [23]. A related object frequently used in conjunction with the sa is the Longest Common Prefix (lcp) array, which contains the lengths of the longest shared prefixes between pairs of successive indices in the sa. For instance, the sa can be used in concert with the lcp-array (and other auxiliary tables derived from these) in a data structure called an enhanced suffix array [1] to mimic the functionality of a suffix tree [54], but often more efficiently and using less space. An account of the pervasiveness of the sa and the lcp-array in computational genomics is best left to a dedicated review (see e.g. [51]).
Because of the utility of the sa (and the lcp-array) in string indexing, significant work has been dedicated to developing practical algorithms for its construction. It is well-established that sa and lcp-array construction can be performed sequentially in time linear in the size of the string. However, as modern genomics pipelines produce ever more data, including more complete reference genomes and pangenomes, there has been a concerted effort to improve the practical efficiency and reduce the runtime of sa and lcp-array construction. A host of efficient serial algorithms have been developed [18, 30, 34, 35, 38, 42, 44]. Likewise, in an effort to take advantage of the increased parallelism of modern computer hardware, a number of parallel algorithms have also been proposed; e.g. parallel DivSufSort [37], parallel DC3 [36], and parallel divide-and-conquer based sa-construction [28]. External-memory algorithms [26, 27, 29] have also been a focus of recent research because of the memory bottlenecks that arise when building the sa and the lcp-array on genomic datasets. Besides, algorithms for GPU settings [39] and distributed memory [19, 20] have been developed. We refer the interested reader to [5, 6, 48] for a comprehensive review. For our purposes, we note that these increasingly advanced methods introduce new algorithmic techniques to enable parallelism or improve the worst-case time complexity (so that it is sublinear). The tradeoff, often, is that these more complex algorithms may be more difficult to implement, to optimize for modern hardware and cache layouts, and to maintain.
In this work, we address these issues by introducing capssa [33], a highly parallel method for constructing the sa and the lcp-array. A core principle behind capssa is simplicity. Our approach draws on several existing algorithms and techniques, and focuses on their efficient combination for the problem of highly parallel sa construction. The algorithm builds upon the parallel samplesort algorithm [21], and is easy to implement and optimize for modern hardware.
A potential downside of our approach is that it is output-sensitive, and as a result its worst-case time complexity on adversarial inputs is quadratic. However, in practice we find that the shared-memory implementation of capssa outperforms state-of-the-art methods (specifically parallel-divsufsort [37] and parallel-dc3 [3, 36]) in terms of runtime and scalability (although parallel-divsufsort uses less memory). For example, capssa can build the sa and the lcp-array for the telomere-to-telomere human genome assembly (CHM13 v2) [45] in 106 s using 48 GB of memory with 32 threads on a typical shared-memory machine, whereas the leading method parallel-dc3 requires 119 s using 116 GB of memory for the sa alone. Our experimental study demonstrates that this superior performance of capssa can largely be attributed to two causes. First, capssa achieves better memory locality (fewer cache misses) than the other methods, likely thanks to its straightforward approach; and second, real-world use cases typically do not manifest the properties that cause the algorithm to exhibit its worst-case complexity. Nevertheless, our experimental results also include performance on a dataset adversarial to the algorithm. Overall, our work demonstrates that, as parallel resources increase, combining domain-specific optimizations (i.e. LCP-informed merging) with highly efficient general sorting strategies (i.e. samplesort [21]) can outperform more sophisticated but complex algorithms. capssa is implemented in C++17 and is available under an open-source license at https://github.com/jamshed/CaPSSA.
The remainder of this manuscript is organized as follows. We discuss the preliminary concepts required for a formal treatment of the algorithm, as well as the most relevant prior work and the methods against which we compare capssa, in Sec. 2. Then we discuss capssa in Sec. 3, and provide an analysis of its asymptotic behavior. Sec. 4 describes the experimental study for the proposed algorithm, and reports the results. We conclude with a discussion of the potential of the method and prospective future directions for building on top of it.
Preliminaries
A string (or text) \(T=a_0a_1\dots a_{n-1}\) is a finite ordered sequence of \(n\) symbols drawn from a finite ordered alphabet \(\Sigma\). \(\Sigma\) contains a special terminator symbol \(\$\), which terminates a string and is the smallest symbol in the ordering of \(\Sigma\). \(|T|\) denotes the length \(n\) of \(T\). The half-open interval \([i, j)\) is a shorthand for the closed interval \([i \ldots j-1]\). \(T_i\) denotes the \(i\)'th symbol in \(T\). The substring \(T_{[i, j)}\) of \(T\) is the sequence of symbols of \(T\) in the half-open interval \([i, j)\). We call a substring \(T_{[i, j)}\) with \(i = 0\) a prefix of \(T\). Likewise, a substring \(T_{[i, j)}\) with \(j = |T|\) is a suffix of \(T\), denoted by \(T_{[i:]}\).
The ordering of \(\Sigma\) induces a lexicographical ordering of all possible strings over \(\Sigma\). The Suffix Array (sa) of a string \(T\) is an array of the starting indices of all suffixes of \(T\), ordered by the suffixes' lexicographical order. The Longest Common Prefix lcp(\(T_1\), \(T_2\)) of two strings \(T_1\) and \(T_2\) is the longest prefix \(P\) of both \(T_1\) and \(T_2\): if \(|P| = k\), then for all \(0 \le i < k\), \({T_1}_i = {T_2}_i\), and \({T_1}_k \ne {T_2}_k\). Given the suffix array \(SA\) of a string \(T\), its \(LCP\)-array is the array \(L\) such that \(L_i = {\textsc {LCP}}(T_{[SA_i:]}, T_{[SA_{i-1}:]})\). For instance, given the string \(T=AACTGCGGAT\) (with the terminator \(\$\) appended), the sa and the \(LCP\)-array are as follows:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| \(T\) | A | A | C | T | G | C | G | G | A | T | $ |
| \(SA\) | 10 | 0 | 1 | 8 | 5 | 2 | 7 | 4 | 6 | 9 | 3 |
| \(LCP\)-array | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
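As a concrete check of these definitions, the table above can be reproduced with a naive quadratic construction. This is a didactic sketch only (`naive_sa` and `naive_lcp` are our own illustrative names); capssa itself does not build the sa this way.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Naive O(n^2 log n) suffix array construction: sort the suffix start
// positions by direct lexicographic comparison of the suffixes.
std::vector<int> naive_sa(const std::string& t) {
    std::vector<int> sa(t.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}

// LCP array by definition: L[i] = LCP of the suffixes at SA[i-1] and SA[i].
std::vector<int> naive_lcp(const std::string& t, const std::vector<int>& sa) {
    std::vector<int> lcp(sa.size(), 0);  // lcp[0] = 0 by convention
    for (size_t i = 1; i < sa.size(); ++i) {
        int a = sa[i - 1], b = sa[i], k = 0;
        while (a + k < (int)t.size() && b + k < (int)t.size() && t[a + k] == t[b + k]) ++k;
        lcp[i] = k;
    }
    return lcp;
}
```

Running it on the example string yields exactly the table above.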
The work of an algorithm is the total number of operations it performs to compute the result. The depth (or span) of an algorithm is the longest sequence of dependent computations in its execution. In pseudocode, we will use \(\Vert\) as an infix operator to specify the parallel execution of statements, so \(f(x) \Vert g(y)\) denotes the parallel execution of \(f(x)\) and \(g(y)\). We use \(\mathcal {O}\big (f(n) \big )\) with high probability (whp) in \(n\) to mean \(\mathcal {O}\big (cf(n) \big )\) with probability at least \(1 - n^{-c}\) for some constant \(c \ge 1\).
Prior Work. The sa can be constructed naively in \(\mathcal {O}(n^2\log n)\) work for an \(n\)-length text. Efficient algorithms can operate in \(\mathcal {O}(n)\) work and \(\mathcal {O}(n)\) space, which is the theoretical optimum, as this is the time and space required to record the sa and the \(LCP\)-array themselves. A comprehensive discussion of work on sa construction is well beyond the scope of this manuscript. As such, we here focus on several sequential and parallel algorithms which are of particular interest due to their speed and wide use.
The state-of-the-art sequential program for sa construction is divsufsort [18, 42]. Subsequent work has shown the algorithm to be an efficient implementation of a two-stage algorithm [18, 25].
divsufsort was parallelized by Labeit et al. in 2017 [37]. It has recently been used in several computational genomics tools, including the CAMMiQ method for microbial abundance quantification [56] and the macle tool for computing match complexity [47].
Another well-known algorithm for sa construction is the Difference Cover modulo 3 (DC3) method [31], which has also been effectively parallelized [36] and has a state-of-the-art implementation [3].
Relevant String Sorting Methods. String sorting is a well-studied algorithmic problem. The main difficulty in string sorting is that comparing two strings \(T_1\) and \(T_2\) requires \(\mathcal {O}(\min (|T_1|,|T_2|))\) comparisons, which renders many traditional sorting algorithms for atomic objects costly. Of particular relevance to our work is the problem of merging two sorted lists of strings. Farach-Colton used an efficient merge routine for building suffix trees in linear time [17] (though the space overhead of suffix trees renders them impractical for most modern applications). Ng and Kakehi analyzed an efficient merging strategy for sorted lists of strings with associated lcp-information [43]. They show that, given random strings with a uniform distribution of symbols, an lcp-informed mergesort algorithm has an expected running time of \(\mathcal {O}(n \log n)\) to sort \(n\) strings. The same merge procedure was used by Bingmann and Sanders in several samplesort-based algorithms for sorting collections of strings [8].
Bingmann and Sanders propose two merge-based string sorting algorithms of particular interest to us here: Parallel Super Scalar String Sample Sort (\(pS^5\)) and Parallel Multiway LCP-Mergesort. \(pS^5\) makes use of the merge routine in a samplesort framework, much like capssa. The algorithms differ in their inputs (a set of strings vs a single string) as well as their approach to partitioning the data. \(pS^5\) uses machine-word-sized pivot keys to create a binary search tree which can fit into the cache of each core, then divides up the input set of strings evenly across the cores and bins them accordingly. As described in Sect. 3.1, capssa divides up the input into evenly sized partitions, then samples pivots using a two-step process and places them into each partition. Parallel Multiway LCP-Mergesort generalizes the merge algorithm to \(k\)-way merges [7].
Methods
The proposed algorithm, capssa, is based on the samplesort [21] algorithm. Samplesort is a popular generalization of quicksort that achieves excellent performance on both shared-memory and distributed-memory architectures [4, 49]. Instead of partitioning the input array into two parts around a single pivot as in quicksort, it chooses a number of pivots \(z_1, z_2, \ldots , z_{p-1}\) along with two sentinel pivots \(z_0 = -\infty\) and \(z_p = +\infty\), and partitions the data into \(p\) partitions such that an input element \(x_i\) is assigned to partition \(j\) iff \(z_{j-1} < x_i \le z_j\). It then sorts each partition using another (usually sequential) sorting algorithm (e.g., quicksort).
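The partition rule \(z_{j-1} < x_i \le z_j\) can be sketched on atomic keys as follows. This is a hypothetical illustration (`partition_of` is our own name, with integers standing in for suffixes), not capssa code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Samplesort partition assignment: given sorted pivots z_1 <= ... <= z_{p-1}
// (the sentinels z_0 = -inf and z_p = +inf are implicit), element x belongs
// to partition j iff z_{j-1} < x <= z_j.
int partition_of(int x, const std::vector<int>& pivots) {
    // lower_bound returns the first pivot >= x, i.e. exactly z_j.
    return (int)(std::lower_bound(pivots.begin(), pivots.end(), x) - pivots.begin());
}
```

With pivots \(\{3, 7\}\), elements \(\le 3\) go to partition 0, elements in \((3, 7]\) to partition 1, and the rest to partition 2.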
For constructing a suffix array, simply applying samplesort is costly, since string comparisons in general require super-constant time. In more detail, first, each suffix needs to be assigned to its partition by binary searching over the pivots. Second, sorting the suffixes in each partition may cost substantially more than linearithmic time due to string comparisons.
capssa addresses these issues using the following key idea of jointly leveraging mergesort and \(LCP\)-arrays. Whenever two suffixes are compared, the comparison is always done inside the operation of merging two sorted arrays of suffixes. Each sorted array is augmented with its lcp-array, and the merge operations avoid repeated comparisons of common prefixes among suffixes by exploiting these lcp-arrays. This approach has previously been used in general string sorting algorithms [7, 8, 43] and merging-based Burrows-Wheeler Transform construction algorithms [11, 15, 16]. The partitioning strategy for the suffixes is modified to make better use of the merge operation and achieve good parallelism. In particular, instead of randomly sampling pivots at the beginning of the algorithm, capssa partitions the suffixes uniformly into \(p\) subarrays, sorts the subarrays locally, and only then selects the pivots using oversampling. Once the \(p - 1\) pivots are located within each sorted subarray, each subarray is further subdivided into \(p\) sub-subarrays, for a total of \(p \cdot p\) sub-subarrays. Since each sub-subarray is flanked by two pivots, the partition that it should go to is known. Each partition is thus a collection of sorted sub-subarrays, which can be merged efficiently. The initial sorting of the uniform-sized subarrays is done using mergesort to exploit the merge operation. Thus capssa ends up exploiting an efficient merging procedure with associated lcp-information to reduce expensive comparisons of suffixes, while not having to merge large subarrays due to its pivoting strategy. We discuss the algorithm in more detail in the following sections.
Parallel SA and LCParray construction: capssa
Next, we provide a high-level overview of the \({\textsc {CaPSSA}}(T, p)\) algorithm. The input to the algorithm is a string \(T\) and a partition count (or, subproblem count) \(p\), and as output it produces the sa and the lcp-array of \(T\). Conceptually, the algorithm executes in four high-level steps, which we illustrate in Fig. 1.
First, it populates an unsorted sa. Then this initial sa is broken into \(p\) subarrays of uniform size \(|T|/p\), and each subarray is sorted with mergesort, in parallel. Next, \(p - 1\) global pivots are sampled from the sorted subarrays together. Then, in each sorted subarray, in parallel, each pivot is located with a binary search. The locations of the \(p - 1\) pivots thus found in each sorted subarray break the subarray into \(p\) sorted sub-subarrays. Moreover, the position of each pivot in the final sa is now defined by its locations in the \(p\) subarrays. The local ordering of the suffixes in each sorted subarray and the global positions of the pivots thus define \(p\) partitions for the final sa, each of which is a collection of \(p\) sub-subarrays: one from each of the \(p\) sorted subarrays. Then, for each partition, in parallel, its \(p\) sorted sub-subarrays are merged recursively into a fully sorted partition. Together, these sorted partitions, in order, produce the final sa and the final lcp-array. The lcp-values for pairs that cross partition boundaries are computed at the end.
The major steps of the algorithm are detailed in the following subsections, after which we analyze its asymptotic characteristics.
The merge operation. For efficient suffix comparisons, capssa utilizes the merge operation. A pair of suffixes is compared only when merging two sorted lists of suffixes, with the only exception being the case when the algorithm performs a binary search using a pivot suffix. Merging without any extra information about the suffixes in the input lists can be costly due to super-constant time string comparisons. To avoid comparing repeated prefixes of suffixes, the merge procedure in capssa utilizes the lcp-arrays of the input suffix lists, generated recursively in the mergesort procedure.
The \({\textsc {Merge}}(X, Y, L_X, L_Y, Z, L_Z, T)\) procedure takes two sorted arrays \(X\) and \(Y\) of suffixes and their respective lcp-arrays \(L_X\) and \(L_Y\), and populates the array \(Z\) as the merged output of \(X\) and \(Y\). Also, the lcp-array of \(Z\) is produced in \(L_Z\). The procedure works exactly like the classic merge routine, with the following modifications.
At a given moment, let \(X_i\) and \(Y_j\) be the two suffixes being compared, and \(Z_k\) be the output of the comparison. Without loss of generality, say that \(X_i < Y_j\) is found, i.e. \(Z_k = X_i\). Let \(m\) denote the lcp-length of the last compared pair in each step of the merge. Then after the current step finishes comparing \(X_i, Y_j\), we have that \(m = {\textsc {LCP}}(X_i, Y_j)\). \(X_i < Y_j\) implies that \(T_{X_i + m} < T_{Y_j + m}\). The next suffixes to compare are \(X_{i + 1}\) and \(Y_j\). Let \(l_x = {L_X}_{i + 1} = {\textsc {LCP}}(X_{i + 1}, X_i)\). \(X_{i + 1} > X_i\) implies that \(T_{X_{i + 1} + l_x} > T_{X_i + l_x}\). There are three possible outcomes when comparing \(l_x\) and \(m\) (illustrated in Fig. 2):

1.
\(l_x > m\): It implies that \(T_{X_{i + 1} + m} = T_{X_i + m}\). Combining with \(T_{X_i + m} < T_{Y_j + m}\), we get \(T_{X_{i + 1} + m} < T_{Y_j + m}\). It follows that
$$\begin{aligned} Z_{k + 1} = X_{i + 1},\quad {L_Z}_{k + 1} = {\textsc {LCP}}(Z_{k + 1}, Z_k) = {\textsc {LCP}}(X_{i + 1}, X_i) = l_x,\quad m = {\textsc {LCP}}(X_{i + 1}, Y_j) = m \end{aligned}$$
2.
\(l_x < m\): It implies that \(T_{X_i + l_x} = T_{Y_j + l_x}\). Combining with \(T_{X_{i + 1} + l_x} > T_{X_i + l_x}\), we get \(T_{X_{i + 1} + l_x} > T_{Y_j + l_x}\). It follows that
$$\begin{aligned} Z_{k + 1} = Y_j,\quad {L_Z}_{k + 1} = {\textsc {LCP}}(Z_{k + 1}, Z_k) = {\textsc {LCP}}(Y_j, X_i) = m,\quad m = {\textsc {LCP}}(X_{i + 1}, Y_j) = l_x \end{aligned}$$
3.
\(l_x = m\): We compute \(n = m + {\textsc {LCP}}(T_{[X_{i + 1} + m:]}, T_{[Y_j + m:]})\), and set the following:
$$\begin{aligned} Z_{k + 1} = {\left\{ \begin{array}{ll} X_{i + 1} &{} \text {if } T_{X_{i + 1} + n} < T_{Y_j + n}\\ Y_j &{} \text {otherwise} \end{array}\right. } \end{aligned}$$$$\begin{aligned} {L_Z}_{k + 1} = {\textsc {LCP}}(Z_{k + 1}, Z_k) = {\left\{ \begin{array}{ll} {\textsc {LCP}}(X_{i + 1}, X_i) = l_x &{} \text {if } Z_{k + 1} = X_{i + 1} \\ {\textsc {LCP}}(Y_j, X_i) = m &{} \text {otherwise} \end{array}\right. } \end{aligned}$$$$\begin{aligned} m = {\textsc {LCP}}(X_{i + 1}, Y_j) = n \end{aligned}$$
The merge procedure continues this way through \(X\) and \(Y\). Finally, when either \(X\) or \(Y\) has been depleted, the remaining entries of the other are copied to the end of \(Z\) and \(L_Z\).
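The three cases above can be sketched in C++ as follows. This is an illustrative reimplementation, not the authors' code: instead of the variable \(m\), it maintains \(h_x = {\textsc {LCP}}(X_i, Z_{k-1})\) and \(h_y = {\textsc {LCP}}(Y_j, Z_{k-1})\). After a comparison won by \(X\), \(h_x\) plays the role of \(l_x\) and \(h_y\) the role of \(m\) (and symmetrically), so the three branches below mirror cases (1)-(3), and symbols are compared only in the \(h_x = h_y\) branch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// LCP-informed merge (sketch). X, Y: sorted suffix start positions in T;
// LX, LY: their lcp-arrays (LX[0] = LY[0] = 0). Produces the merged order Z
// and its lcp-array LZ.
void lcp_merge(const std::string& T,
               const std::vector<int>& X, const std::vector<int>& LX,
               const std::vector<int>& Y, const std::vector<int>& LY,
               std::vector<int>& Z, std::vector<int>& LZ) {
    const int n = (int)T.size();
    size_t i = 0, j = 0;
    int hx = 0, hy = 0;  // LCPs of the two current heads with the last output
    Z.clear(); LZ.clear();
    while (i < X.size() && j < Y.size()) {
        bool take_x;
        if (hx != hy) {
            // cases (1) and (2): the head sharing the longer prefix with the
            // last output is the smaller suffix; no symbols are compared
            take_x = hx > hy;
        } else {
            // case (3): extend the known common prefix by direct comparison
            int k = hx;
            while (X[i] + k < n && Y[j] + k < n && T[X[i] + k] == T[Y[j] + k]) ++k;
            take_x = (X[i] + k == n) || (Y[j] + k < n && T[X[i] + k] < T[Y[j] + k]);
            if (take_x) hy = k; else hx = k;  // new LCP of loser with winner
        }
        if (take_x) {
            Z.push_back(X[i]); LZ.push_back(hx);
            ++i;
            hx = (i < X.size()) ? LX[i] : 0;
        } else {
            Z.push_back(Y[j]); LZ.push_back(hy);
            ++j;
            hy = (j < Y.size()) ? LY[j] : 0;
        }
    }
    // one list depleted: copy the rest, carrying over the pending lcp-values
    for (; i < X.size(); ++i) { Z.push_back(X[i]); LZ.push_back(hx); hx = (i + 1 < X.size()) ? LX[i + 1] : 0; }
    for (; j < Y.size(); ++j) { Z.push_back(Y[j]); LZ.push_back(hy); hy = (j + 1 < Y.size()) ? LY[j + 1] : 0; }
}
```

Merging two interleaved halves of the example string's suffixes reproduces the sa and \(LCP\)-array of the table in the Preliminaries.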
Local sorting. capssa starts out with some permutation of \([0, |T|)\), and sorts its \(p\) disjoint subarrays, each of size \(|T|/p\), in parallel using mergesort. The \({\textsc {MergeSort}}(X, Y, L, L', T)\) procedure takes as input an array \(X\) of suffixes, and sorts it into \(Y\). Besides, the lcp-array of the sorted suffixes is produced in \(L\), using \(L'\) as working space. As a typical mergesort implementation requires linear extra space in each invocation, capssa uses the arrays \(X\) and \(Y\) in a back-and-forth manner to reuse the extra space across the invocations. For this, \(Y\) needs to be equal to \(X\) before an invocation. The merge step in the sort uses the merge procedure described earlier.
Pivot selection. capssa deviates from samplesort in its pivot selection strategy. In a typical samplesort, pivots are sampled from the initial array, and the partitioning is based on the intervals they induce. Instead, the \({\textsc {SamplePivots}}(SA, T, p)\) procedure (see Suppl.) in capssa samples \(s\) suffixes from each of the \(p\) subarrays in parallel, where \(s\) is the sampling factor. Then these \(s\times p\) sample suffixes are sorted using mergesort. Subsequently, \(p - 1\) evenly-spaced pivots are selected from the sorted output to form the pivot set \(V\).
These pivots define the ranges of the samplesort partitions, and are used to split each of the subarrays in the next collation step. We show in Theorem 1 that with a sufficient sampling factor \(s\), the size of each partition is within a constant factor of \(|T|/p\) with high probability, which ensures a balanced load for processing each partition in the last step of the algorithm.
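The oversampling step can be sketched on integer keys as follows (a hypothetical illustration; `sample_pivots` and its signature are our own, and integers stand in for suffixes):

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Oversampling pivot selection: draw s samples from each of the p locally
// sorted subarrays, sort all s*p samples, and keep every s'th one as a
// global pivot, yielding p - 1 pivots.
std::vector<int> sample_pivots(const std::vector<std::vector<int>>& subarrays,
                               int s, std::mt19937& rng) {
    std::vector<int> samples;
    for (const auto& sub : subarrays) {
        std::uniform_int_distribution<size_t> pick(0, sub.size() - 1);
        for (int t = 0; t < s; ++t) samples.push_back(sub[pick(rng)]);
    }
    std::sort(samples.begin(), samples.end());
    std::vector<int> pivots;
    const int p = (int)subarrays.size();
    for (int j = 1; j < p; ++j) pivots.push_back(samples[j * s - 1]);  // every s'th sample
    return pivots;
}
```

Larger \(s\) makes the pivots better approximations of the true quantiles, which is what Theorem 1 quantifies.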
Collating partitions. Having finalized the pivot set \(V\), the algorithm locates each pivot suffix \(v \in V\) in each sorted subarray. Each subarray is searched for the \(p - 1\) pivots in parallel.
Consider a pivot \(v \in V\) and some sorted subarray \(A\). The position of \(v\) in \(A\) is the last index where \(v\) can be inserted without breaking the sorted order of \(A\). This index is computed using a binary search for the suffix \(v\) in \(A\). During a binary search, suffixes are compared without any associated \(LCP\)-array, contrary to the merge procedure. As a practical speedup, we skip some repeated character comparisons between \(v\) and the suffixes in \(A\) using the simple accelerant idea [22].
After placing each pivot \(v\) into \(A\), the index \(i\) of \(v\) in \(A\) implies that all the suffixes in \(A_{[0: i)}\) are \(\le v\). Hence the sum \(C_v\) of these indices of \(v\) across all the \(p\) sorted subarrays provides the total count of suffixes in the sa that are not lexicographically larger than \(v\): the index of \(v\) in the final sa is \(C_v - 1\). Along with the sentinel pivot positions \(C_0 = 0\) and \(C_p = |T|\), these \(p - 1\) pivots divide the final sa into \(p\) partitions. Consider two successive pivots \(v_{j-1}\) and \(v_j\). In each sorted subarray \(A\), all the suffixes \(k\) such that \(T_{[v_{j-1}:]} < T_{[k:]} \le T_{[v_j:]}\) will be present in the index-range \([C_{j-1}, C_j)\) of the final sa. That is, all the suffixes between the locations for \(v_{j-1}\) and \(v_j\) belong to the \((j-1)\)'th partition.
Thus the pivot locations in a sorted subarray \(A\) break \(A\) into \(p\) sub-subarrays, where the \(j\)'th sub-subarray is known to be present in the \(j\)'th partition of the final sa. After the binary searches, capssa moves these sub-subarrays in parallel to collate all the sub-subarrays for the same partition. The lcp-arrays of these sub-subarrays are also collated together. The \({\textsc {CollatePartitions}}(SA, SA', L, L', V, T, p)\) procedure (see Suppl.) describes this in more detail.
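The computation of \(C_v\) can be sketched on atomic keys (a hypothetical illustration; `global_rank` is our own name, with integers standing in for suffixes):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Global rank of a pivot v: in each sorted subarray, a binary search finds
// the last index where v could be inserted without breaking the order
// (upper_bound); the sum C_v of these indices over all subarrays counts the
// elements <= v, so v's rank in the fully sorted order is C_v - 1.
long global_rank(int v, const std::vector<std::vector<int>>& subarrays) {
    long c = 0;
    for (const auto& sub : subarrays)
        c += std::upper_bound(sub.begin(), sub.end(), v) - sub.begin();
    return c - 1;
}
```

For example, with the sorted subarrays \(\{1,3,5\}\), \(\{2,4,6\}\), \(\{0,7,8\}\), the key 4 has rank 4 in the merged order \(0,1,\ldots,8\).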
Merging partitions. Having grouped together the corresponding sub-subarrays for every partition, capssa merges together the sorted sub-subarrays in each partition, in parallel. A partition consists of \(p\) sorted collections of suffixes, with all of the collections stored contiguously. The \({\textsc {MergePartition}}(X, Y, n, R, L_X, L_Y, T)\) procedure takes this collection \(X\) of \(p\) sorted sub-subarrays, and recursively produces the merged output in the same contiguous region of memory \(Y\). \(L_X\) is the collection of the lcp-arrays of the sorted groups in \(X\), and the merged lcp-array is produced in \(L_Y\). The sorted groups in \(X\) (and \(L_X\)) are delineated by \(R\).
The mergepartition procedure is the same as the mergesort procedure, except that it is more general: the sorted units where mergesort bottoms out are single suffixes, whereas mergepartition bottoms out earlier, at sorted groups of suffixes. As noted earlier, mergepartition also uses the space in \(X\) and \(Y\) back-and-forth to reuse the extra space required.
Asymptotics
In this section, we analyze the computational complexity of the \({\textsc {CaPSSA}}(T, p)\) algorithm executed on a text \(T\) of length \(n = |T|\), given a subproblem count \(p\).
Work analysis
We start by analyzing the overall work of the algorithm and providing selfcontained proofs on the total work due to symbol comparisons made by our algorithm.
Local sorting. This step executes the classic mergesort on each subarray. For a subarray \(A\) with \(m\) suffixes, this amounts to a total work of \(T(m) = 2 T({m}/{2}) + \mathcal {O}(m) + C(A)\), where \(C(A)\) denotes the number of symbol comparisons made in the third case of the merge procedure. We analyze the total amortized cost of these \(C(A)\) values across all the recursion-trees of all the subarrays in Theorem 3. Omitting \(C(A)\) from \(T(m)\), each local sort has \(\mathcal {O}\big (({n}/{p})\log ({n}/{p})\big )\) work.
Pivot selection. With a sampling factor \(s\), there are \(s\times p\) pivots sampled in total across all the subarrays. capssa sorts these pivots with mergesort and picks the \(p - 1\) equidistant pivots from these as the global pivots. The mergesort amounts to a total work of \(\mathcal {O}(sp \log {sp} + \sum {L_p}_i)\), where \(L_p\) is the output lcp-array of the sort. This follows from Theorem 3.
Collating partitions. The collation step first locates each of the \(p - 1\) pivots in each of the sorted subarrays using binary search. The length of a pivot suffix is \(\mathcal {O}(n)\), and the sorted subarrays are of size \({n}/{p}\). The work of each binary search is \(\mathcal {O}(n + \log ({n}/{p}))\) in practice (\(n = |T|\)) with the simple-accelerant [22] strategy. For adversarial inputs, however, the work can still be \(\mathcal {O}(n \log ({n}/{p}))\) in the worst case. Then the suffix indices are moved into their appropriate final partitions. This step simply reorders the elements across the sorted subarrays, and thus requires \(\mathcal {O}(n)\) total work.
Merging partitions. The mergepartition procedure works similarly to the mergesort procedure, except that the recursion bottoms out at a sorted group of suffixes instead of at a single suffix. Unlike the mergesort instances, however, each of which operates on \({n}/{p}\)-sized subarrays, the mergepartition instances may work on partitions of various sizes. Theorem 1 provides a bound on the partition sizes.
Theorem 1
With a sampling factor \(s\), every partition has size at most \({cn}/{p}\) for some constant \(c\) with high probability.
Proof
The algorithm samples \(s\) pivots from each subarray, for a total of \(sp\) samples.
It then picks \(p - 1\) evenly spaced pivots (every \(s\)'th sample) from the sorted samples to use as global pivots.
Consider the final location of these \(sp\) samples in the final suffix array.
Every \(s\)'th of them marks the boundary of a partition.
Thus, a partition has size \(d \ge {cn}/{p}\) only if fewer than \(s\) of the samples fall into these \(d\) suffixes.
Otherwise, at least one sample would be picked as a final pivot and would thus break this partition.
Let \(SA\) be the final suffix array, and \(X_i\) be a random variable indicating whether \({SA}_i\) is one of the \(sp\) samples.
Then \(Pr[X_i = 1] = {sp}/{n}\).
Thus the random variable denoting the number of samples picked from a region of size \({cn}/{p}\) is \(X = \sum _{i = 1}^{{cn}/{p}} X_i\). By linearity of expectation, we get \(E[X] = \sum _{i = 1}^{{cn}/{p}} E[X_i] = ({cn}/{p}) \cdot Pr[X_i = 1] = ({cn}/{p}) \cdot ({sp}/{n}) = cs\). Applying the Chernoff bound \(Pr\big [X < (1 - \delta ) E[X]\big ] \le e^{-\delta ^2 E[X]/2}\) with \(\delta = 1 - {1}/{c}\), so that \((1 - \delta ) E[X] = s\), we have:
\(Pr[X < s] \le e^{-(1 - {1}/{c})^2 cs/2}.\)
With \(s = 32 \ln n\) and letting \(c' = c (1 - {1}/{c})^2\), this gives \(Pr[X < s] \le {1}/{n^{16c'}}\).
Since the event of a partition having size \(\ge {cn}/{p}\) implies the event \(X < s\), we get:
\(Pr[\text {a given partition has size} \ge {cn}/{p}] \le Pr[X < s] \le {1}/{n^{16c'}}\).
\(\Rightarrow Pr[\text {at least one partition has size } \ge {cn}/{p}] \le \sum _{i = 1}^p {1}/{n^{16c'}} = {p}/{n^{16c'}}\) (by the union bound).
\(\Rightarrow Pr[\text {no partition has size} \ge {cn}/{p}] \ge 1  {p}/{n^{16c'}}\).
Since \(p\) is at most \(\mathcal {O}(n)\), no partition has size \(\ge {cn}/{p}\) whp. \(\square\)
Thus each partition has size at most \({cn}/{p}\) for some constant \(c\) whp. Merging a partition \(A\) with \(m\) sorted subgroups has work \(T(m) = 2T({m}/{2}) + \mathcal {O}(m) + C(A)\), where \(C(A)\) is the number of symbol comparisons made in merge. Omitting \(C(A)\) from \(T(m)\), this solves to \(\mathcal {O}\big (({cn}/{p}) \log {p}\big ) = \mathcal {O}\big (({n}/{p})\log {p}\big )\) whp.
Total symbol comparisons. A subproblem in the algorithm is some subarray of the sa that can be processed independently of the other subarrays in a given step. Let \(X\) be some subproblem in either of the two steps: local sorting and partition merging. For the local-sorting case, sorting \(X\) with mergesort consists of \(\log ({n}/{p})\) recursion levels. For the partition-merging case, the mergepartition procedure for \(X\) executes in \(\log {p}\) recursion levels. We label the bottommost level as level 0, and count the levels upwards in the recursion-tree.
Let \(x \in X\) be a suffix. At any given level \(i\), \(x\) is present in exactly one mergesort (or mergepartition) instance executing on \(X\). Let \(x'_i\) be the suffix that immediately precedes \(x\) in the output of that mergesort (or mergepartition) instance, and let \(L_i(x) = {\textsc {LCP}}(x, x'_i)\). If \(x\) is the first suffix in the output, then \(x'_i\) is the empty suffix. We prove the following.
Theorem 2
In mergesort and mergepartition, \(L_i (x) \le L_{i + 1}(x)\) for a suffix \(x\) at each recursion level \(i \in [0, d - 1)\), where \(d\) is the depth of the recursion-tree.
Proof
Let \(x\) be present in the mergesort (or mergepartition) instance \(M\) at level \(i + 1\), and say \(M\) spawns the two instances \(M_l\) and \(M_r\). \(M_l\) and \(M_r\) are at level \(i\), and \(x\) is present in exactly one of them. Let it be \(M_l\).
Now, \(x'_{i + 1}\) is either \(x'_i\), i.e. the same suffix preceding \(x\) in \(M_l\), or some other suffix \(y\) from \(M_r\). If \(x'_{i + 1} = x'_i\), then \(L_{i + 1}(x) = L_i(x)\), and the claim holds.
In the other case, the output array of \(M\) has the following form: \([\ldots , x'_i, \ldots , y, x, \ldots ]\). Suppose that the claim is false, i.e. \(L_i(x) > L_{i + 1}(x)\), that is, \({\textsc {LCP}}(x, x'_i) > {\textsc {LCP}}(x, y)\). \({\textsc {LCP}}(x, y) < {\textsc {LCP}}(x, x'_i)\) implies that \(x'_i\) and \(y\) share the same prefix of length \(l = {\textsc {LCP}}(x, y)\), and mismatch first at index \(l\). Let \(c_x\), \(c_{x'_i}\), and \(c_y\) be the \(l\)'th symbols in \(x\), \(x'_i\), and \(y\) resp. As \(y > x'_i\) in the output, \(c_y > c_{x'_i}\). Besides, since \(x\) and \(y\) first mismatch at the \(l\)'th symbol and \(x > y\), \(c_x > c_y\). Thus \(c_x > c_{x'_i}\). But \({\textsc {LCP}}(x, x'_i) > l\) implies that the \(l\)'th symbols in \(x\) and \(x'_i\) are the same, i.e. \(c_x = c_{x'_i}\). Thus we get both \(c_x > c_{x'_i}\) and \(c_x = c_{x'_i}\), resulting in a contradiction. Hence, \({\textsc {LCP}}(x, y) \le {\textsc {LCP}}(x, x'_i)\). \(\square\)
Theorem 3 provides a bound on the number of total comparisons made across all the mergesort and mergepartition instances in the algorithm's execution.
Theorem 3
The total number of symbol comparisons made across all the mergesort and mergepartition instances in capssa for an \(n\)-length text is \(\mathcal {O}\big ( n \log {n} + \sum _{i = 1}^{n - 1} L_i \big )\) whp, where \(L\) is the output lcp-array.
Proof
Symbol comparisons occur only as part of the merge procedure in both merge-sort and merge-partition. Given two sorted lists of suffixes \(X\) and \(Y\) along with their lcp-arrays, the \({\textsc {Merge}}(X, Y, L_X, L_Y, Z, L_Z, T)\) procedure iterates through the \(X_i\)'s and \(Y_j\)'s and fills in the \(Z_k\)'s in sorted order, along with their lcp-values in \({L_Z}_k\). Without loss of generality, suppose that \(X_i < Y_j\) is found at some iteration. Then the next iteration compares \(X_{i + 1}\) and \(Y_j\). Let \(l_x = {\textsc {LCP}}(X_{i + 1}, X_i)\), and let \(m = {\textsc {LCP}}(X_i, Y_j)\), the lcp of the last compared pair. Symbols from the suffixes \(X_{i + 1}\) and \(Y_j\) are compared iff \(l_x = m\) holds. In this case, we compute \(m' = {\textsc {LCP}}(X_{i + 1}, Y_j)\) with exactly \(m' - m + 1\) symbol comparisons; the \(+1\) term accounts for the first mismatching symbol pair. \(m\) is set to \(m'\) for the next iteration. We argue that the new value of \(m\) remains unchanged until it is assigned as an lcp-value in the output lcp-array \(L_Z\) at some future iteration.
In the next iteration, if case (1), i.e. \(l_x > m\), holds, then \(m\) remains unchanged. If case (2), i.e. \(l_x < m\), holds, then \(m\) is assigned to the output \({L_Z}_{k + 1}\). In the event of case (3), i.e. \(l_x = m\), either \(l_x\) or \(m\) is assigned to \({L_Z}_{k + 1}\), and the two are equal in this case. If \(X\) is depleted during the merge while \(Y\) still has remaining elements, the current \(m\) is assigned as the lcp-value for the first of the remaining elements of \(Y\).
Thus, whenever symbol comparisons are performed in case (3) of merge, they result in a new value \(m' \ge m\) for the variable \(m\). \(m = m'\) persists until \(m'\) has been assigned as the lcp-value for some merged output. Thus the number of matching symbol comparisons made in case (3) accumulates in the lcp-values at the output.
All the lcp-values start out as \(0\) in merge-sort. Theorem 2 states that the lcp-value associated with a given suffix can never decrease while winding up the recursion-trees of merge-sort and merge-partition. Thus the sum \(\sum _{i = 1}^{n - 1} L_i\) of the final lcp-values in the sa is the total number of matching symbol comparisons made across all the merge-sort and merge-partition executions.
The extra mismatching comparison in case (3) of merge costs \(\mathcal {O}(1)\). In the worst case, this case occurs in each iteration of merge. Omitting the matching symbol comparisons, a merge-sort or a merge-partition instance working on \(m\) elements incurs \(T(m) = 2T({m}/{2}) + \mathcal {O}(m)\) mismatches in the worst case. This solves to \(p \times \mathcal {O}({n}/{p} \log {{n}/{p}})\) and \(p \times \mathcal {O}({n}/{p} \log {p})\) whp for the \(p\) merge-sorts and merge-partitions, resp. Thus \(\mathcal {O}(n\log {n})\) mismatching symbol comparisons are made whp. \(\square\)
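To make the case analysis above concrete, the following Python sketch (our illustration; function and variable names are ours, and the actual implementation differs) merges two sorted, non-empty suffix lists together with their lcp-arrays. The variable `m` tracks the lcp of the last output with the head of the other list, and symbols are compared only in case (3), i.e. when the lcp-values tie:

```python
def extend_lcp(T, a, b, start=0):
    # Extend a known common prefix of length `start` of suffixes T[a:] and T[b:].
    n, l = len(T), start
    while a + l < n and b + l < n and T[a + l] == T[b + l]:
        l += 1
    return l

def suffix_less(T, a, b, l):
    # Compare suffixes T[a:] and T[b:] whose LCP is exactly l.
    if a + l == len(T):
        return True                      # T[a:] is a proper prefix of T[b:]
    if b + l == len(T):
        return False
    return T[a + l] < T[b + l]

def lcp_merge(T, X, LX, Y, LY):
    """Merge sorted, non-empty suffix lists X and Y (indices into T), given
    their lcp-arrays (LX[i] = LCP of X[i] with X[i-1]; LX[0] = 0).  Returns
    the merged list Z and its lcp-array LZ."""
    i = j = 0
    m = extend_lcp(T, X[0], Y[0])        # LCP of the two initial heads
    Z, LZ, from_x = [], [], None
    while i < len(X) and j < len(Y):
        if from_x is None:               # very first comparison: direct
            take_x, out_lcp = suffix_less(T, X[i], Y[j], m), 0
        else:
            l_own = LX[i] if from_x else LY[j]  # lcp of new head with last output
            if l_own > m:                # case (1): the same list continues
                take_x, out_lcp = from_x, l_own
            elif l_own < m:              # case (2): the other head comes first
                take_x, out_lcp, m = not from_x, m, l_own
            else:                        # case (3): symbols must be compared
                m = extend_lcp(T, X[i], Y[j], m)
                take_x, out_lcp = suffix_less(T, X[i], Y[j], m), l_own
        if take_x:
            Z.append(X[i]); i += 1
        else:
            Z.append(Y[j]); j += 1
        LZ.append(out_lcp)
        from_x = take_x
    # Drain the non-empty list; its first element has lcp m with the last output.
    rest, Lrest, k = (X, LX, i) if i < len(X) else (Y, LY, j)
    for t in range(k, len(rest)):
        Z.append(rest[t])
        LZ.append(m if t == k else Lrest[t])
    return Z, LZ
```

Note how character comparisons (`extend_lcp`) happen only in the final branch, matching the accounting in the proof: every matching comparison raises `m` and is later charged to an output lcp-value.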
Total work. Locally sorting the \(p\) subarrays costs \(p \times \mathcal {O}({n}/{p} \log {{n}/{p}}) = \mathcal {O}(n\log {{n}/{p}})\) work without the symbol comparisons. Omitting the symbol comparisons in sorting the sampled pivots, the pivot-selection step takes \(\mathcal {O}(sp \log {sp})\) work. In the collation step, there are \(p(p - 1)\) binary searches, costing \(\mathcal {O}(p^2 n\log {{n}/{p}})\) work in the worst case, and \(\mathcal {O}\big (p^2 (n + \log {{n}/{p}}) \big )\) in practice. Merging the \(p\) partitions separately costs \(p \times \mathcal {O}({n}/{p}\log {p}) = \mathcal {O}(n\log {p})\) whp without the symbol comparisons.
The total number of symbol comparisons in the local-sort and the partitions-merge steps is \(\mathcal {O}(n \log n + \sum _{i = 1}^{n - 1} L_i)\) whp as per Theorem 3, where \(L\) is the output lcp-array. In sorting the sampled suffixes, the number of symbol comparisons is also bounded by this.^{Footnote 2} Thus the total work for the algorithm is \(\mathcal {O}(n \log n + \sum _{i = 1}^{n - 1} L_i)\) whp.^{Footnote 3}
Working space
The working space of the algorithm is the total space required by all the merge-procedure instances at any given recursion-level of merge-sort or merge-partition. The merge procedure produces two merged outputs for a given input, the sorted suffixes and their LCPs, and thus has \(2 \times 2n\) input and output entries in total at any level. So the working space for the algorithm is \(4n\) entries, which is \(\mathcal {O}(n\log {n})\) bits. Our implementation of the algorithm requires \(4wn\) bytes of working space, where \(w\in \{4,8\}\) is the number of bytes used to store sa and lcp values. We discuss an external-memory scheme to reduce the working space in Sec. 5.
Parallelization
Our implementation fully parallelizes the work across the different partitions. Within a partition, we perform the recursive calls to merge-sort in parallel, but perform the merge procedure serially. We show the following theorem about the depth of our algorithm:
Theorem 4
The overall depth of the algorithm is \(\mathcal {O}\big ((n/p) \log n \big )\) whp.
Proof
The dominant factor is the depth of the merge routine, which performs a number of comparisons linear in the input size. A comparison has depth \(\mathcal {O}(1)\) in cases (1) and (2) of Fig. 2, and requires a string comparison in the final case.
The string comparison can be parallelized work-efficiently (i.e., in the same work as a serial character-by-character comparison) by using a simple prefix-doubling strategy. In more detail, the comparison algorithm works in rounds, comparing \(2^i\) characters in the \(i\)-th round until a mismatch occurs. Clearly, for strings of length \(\mathcal {O}(n)\), only \(\mathcal {O}(\log n)\) rounds are required; thus the overall work is asymptotically the same as that of the serial algorithm, and the depth is \(\mathcal {O}(\log n)\). Thus, merging two sorted arrays in the algorithm, each of size \(k\), requires \(\mathcal {O}(k \log n)\) depth.
Putting these facts together, a single call to the merge-sort routine has a recurrence of the form \(D(k) = D(k/2) + \mathcal {O}(k \log n)\), which is root-dominated and solves to \(\mathcal {O}(k\log n)\). Since our algorithm is parallelized across the different partitions, and by Theorem 1 each partition has size at most \(\mathcal {O}(n/p)\), the overall depth of the algorithm is \(\mathcal {O}((n/p) \log n)\) whp. \(\square\)
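The round structure of this comparison can be sketched as follows, a serial Python illustration of the doubling schedule (in the parallel setting, the all-equal test of round \(i\) is what runs with \(2^i\)-way parallelism):

```python
def lcp_by_doubling(x, y):
    """Length of the longest common prefix of x and y, found by comparing
    blocks of doubling size: round i examines the next 2**i characters.
    Each block's all-equal test is the part that parallelizes."""
    pos, width = 0, 1
    n = min(len(x), len(y))
    while pos < n:
        end = min(pos + width, n)
        if x[pos:end] != y[pos:end]:     # round i: test 2**i characters at once
            for k in range(pos, end):    # locate the first mismatch in the block
                if x[k] != y[k]:
                    return k
        pos, width = end, 2 * width
    return n                             # one string is a prefix of the other
```

Since the block sizes \(1, 2, 4, \ldots , 2^i\) sum to less than \(2 \cdot 2^i\), and the mismatch lies beyond position \(2^i - 1\), the total characters examined stay within a constant factor of the serial scan, which is the work-efficiency claim above.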
We note that the depth is not polylogarithmic, as it is in the classic parallel merge-sort. However, the amount of parallelism generated by our algorithm is more than enough to keep all the processors busy in practice. Indeed, many samplesort implementations use a similar strategy, employing a serial sort within each partition, and thus also do not have polylogarithmic depth. In our implementation, we exploit parallelism using the parallel primitives and the work-stealing scheduler from ParlayLib [10].
Optimizations
We applied a number of optimizations to the implementation of the algorithm that provide practical speedups. We make use of the AVX vectorization support of modern processors to speed up the \({\textsc {LCP}}(X, Y)\) routine used in the merge procedure and in the binary searches that locate the pivots.
The proposed merge-sort and merge-partition procedures use nested parallelism for their recursive invocations. In the implementation, this is applied only up to some fixed granularity, due to the overhead associated with scheduling small tasks.
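The granularity cutoff follows a standard pattern, sketched below with hypothetical constants (the actual implementation uses ParlayLib's fork-join primitives in C++; `GRAIN` and the thread-pool mechanics here are ours, for illustration only):

```python
import concurrent.futures

GRAIN = 1 << 14   # hypothetical cutoff; below it, recursion stays serial

def merge_sort(a, pool):
    """Plain merge-sort whose recursive calls are spawned as parallel tasks
    only above GRAIN elements, avoiding the cost of scheduling tiny tasks."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    if len(a) >= GRAIN:                  # large input: fork one half as a task
        left_future = pool.submit(merge_sort, a[:mid], pool)
        right = merge_sort(a[mid:], pool)
        left = left_future.result()
    else:                                # small input: plain serial recursion
        left, right = merge_sort(a[:mid], pool), merge_sort(a[mid:], pool)
    out, i, j = [], 0, 0                 # serial two-way merge
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

# usage
with concurrent.futures.ThreadPoolExecutor() as pool:
    result = merge_sort(list(range(40000, 0, -1)), pool)
```

In CPython the thread pool only illustrates the control structure; true parallelism in the actual implementation comes from ParlayLib's work-stealing scheduler.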
In the binary searches for the sampled pivots in each sorted subarray, instead of searching for the appropriate position of an entire pivot suffix, we look for a fixed-length prefix of the pivot. This reduces the total work associated with locating the pivots, at the cost of a tradeoff in the final partition sizes. We observe that, with sufficiently large prefix lengths, the partition sizes are not significantly affected.
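A sketch of this prefix-truncated search (illustrative only; the real implementation works over suffix indices with vectorized comparisons, and the function name is ours):

```python
def locate_pivot(T, sorted_suffixes, pivot, prefix_len):
    """Lower-bound binary search for the suffix T[pivot:] in a sorted suffix
    subarray, comparing only length-`prefix_len` prefixes.  A longer prefix
    gives more accurate (but costlier) partition boundaries."""
    key = T[pivot:pivot + prefix_len]
    lo, hi = 0, len(sorted_suffixes)
    while lo < hi:
        mid = (lo + hi) // 2
        s = sorted_suffixes[mid]
        if T[s:s + prefix_len] < key:    # compare truncated suffixes only
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Suffixes that tie with the pivot on the truncated prefix may land on either side of the returned position, which is exactly the partition-size tradeoff described above.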
Results
We performed a number of experiments to characterize the performance of the CaPS-SA algorithm and its implementation. We evaluated its performance against the available implementations of two leading methods for sa construction: parallel-divsufsort [37] and parallel-DC3 [3]. We assessed its ability to construct the sa and lcp-arrays of a number of genomic datasets.
Next, we evaluated the parallel scaling of the algorithm. Finally, we explore the idea of bounded-context suffix arrays, and the performance of CaPS-SA for various prefix-context lengths.
A varied collection of datasets has been used in the experiments; Table 1 delineates their pertinent characteristics. We follow [51] in removing N-repeats, which occur when the sequence underlying a region of the assembly cannot be resolved. We also unmask soft-masked regions of the genomes. We verified the correctness of the implementation by cross-checking its output against that of parallel-divsufsort.
Computation system. The experiments were performed on a server with 4 Intel(R) Xeon(R) Platinum 8160 processors, having 192 cores in total, and 1.5 TB of 2.66 GHz DDR4 RAM. The system runs Ubuntu 22.04.2 LTS (GNU/Linux 6.2.0-33-generic x86_64). The sa and lcp-array construction times and the maximum memory usage of the tools were measured with the GNU time command.
Dataset characteristics
Table 1 provides some pertinent characteristics of the datasets used. The GRCh38 dataset is the Human Build 38 patch release 13 version of the human genome reference from the Genome Reference Consortium,^{Footnote 4} which is a chromosome-level assembly of the full genome. The T2T dataset is the latest T2T CHM13v2.0 Telomere-to-Telomere assembly of the CHM13 cell line, with chromosome Y from NA24385, from the T2T consortium, which is a complete assembly of the genome [45]. Together, these two human datasets represent what we imagine may be a typical use-case for genome index construction in the context of a tool like STAR [14]. Though largely similar, the CHM13 assembly has resolved telomeric and centromeric regions, and more complete coverage, specifically in highly-repetitive regions. Thus, we expect it to represent a more challenging problem instance for suffix array construction.
The CdBG (Compacted de Bruijn Graph) dataset is the collection of the maximal unitigs extracted from the de Bruijn graph (with k-mer size 27) of the human sequencing read set NIST HG004 (SRA3440461–95) [57] by cuttlefish 2 [32]. This dataset represents a potential use-case where one may wish to build an index for the sequence stored in the CdBG data structure. The ability to index the CdBG (i.e. contig-based indexing) has proven useful in many contexts [12], and the sa can provide one possible index for efficient lookup over the sequence contained in the CdBG.
The great white shark dataset is the genome reference of Carcharodon carchariasÂ [41] and the axolotl dataset is the genome reference sequence of Ambystoma mexicanumÂ [52]. These represent large problem instances, where one may wish to build the sa on large reference genomes.
The bacteria (1K) dataset is a collection of 1000 genomes sampled randomly from 661,405 bacterial genomes [9] from the European Nucleotide Archive. Almost all the genomes in this dataset are Salmonella enterica; being extremely repetitive, this represents a pathological dataset for CaPS-SA.
SA and LCParray construction
We evaluate the performance of CaPS-SA in constructing the sa and the lcp-array of a number of genomic datasets, compared to the sa construction performance of: (1) parallel-divsufsort [37] and (2) parallel-DC3 [3]. Table 2 contains the results of the benchmarking. As the state-of-the-art sequential baseline, we note the performance of the divsufsort implementation from libdivsufsort [42].
We note that CaPS-SA executes significantly faster than the other parallel algorithms in all instances except for the pathological dataset of 1K similar bacteria, whereas parallel-divsufsort uses the least amount of memory. Interestingly, for the smaller datasets, CaPS-SA does not require much more memory than parallel-divsufsort, despite constructing both the sa and the lcp-array. Memory usage could be improved by bit-packing the indexes, or through the extension to an external-memory algorithm.
Appendix Table 4 provides sequential timing results for the methods, i.e. with 1 thread, to compare their total amounts of work. We note that CaPS-SA tends to do more work than the other parallel methods, which is expected: the other methods have \(\mathcal {O}(n \log n)\) work [36, 37], whereas CaPS-SA has an additional \(\mathcal {O}(\sum _{i = 1}^{n - 1} L_i)\) output-dependent factor, but it benefits from better parallelization with more workers.
Parallel scaling
In order to assess how sensitive the runtime is to parallelism, we evaluated CaPS-SA against parallel-DC3 and parallel-divsufsort as the number of threads increased. We report the results in Fig. 3, which illustrate that CaPS-SA exploits parallelism better, becoming the fastest method at high thread counts despite doing asymptotically more work.
On the GRCh38 and T2T datasets, all the parallel methods become faster than divsufsort at around the same number of threads, after which CaPS-SA becomes the fastest.
Cache performance
The samplesort-based design of CaPS-SA optimizes cache performance. In order to evaluate the empirical cache behavior of CaPS-SA compared to other algorithms for sa construction, we profiled the programs on the GRCh38 and the T2T reference genomes. Because cache behavior can degenerate as parallelism increases, we evaluate it across 1, 32, and 64 threads. The results in Table 3 show that CaPS-SA outperforms the other parallel sa indexing programs by large margins. All measurements were taken with the Linux perf command.
Boundedcontext SA Construction
By virtue of organizing all suffixes of the underlying text \(T\), the suffix array provides the powerful ability to efficiently search for query patterns of any length in the text. While this capability arises naturally from the definition of the sa, such flexibility is rarely needed in the sa's most common applications in genomics. Specifically, when used to efficiently find seed sequences from a genomic read, the maximum length of the query is often very short. Many modern aligners use seed lengths in the range of 15–31, and even with the maximum mappable prefix concept used by STAR [14], the query length is bounded above by the error-free prefix length of the remainder of the read (rarely more than \(\sim 100\) nucleotides).
As such, indices that can provide efficient lookup and locate queries for patterns up to some maximum length, say \(k\), are often very useful in this context. For example, the \(k\)-BWT data structure [13, 46, 50] builds a transform of the text that organizes character occurrences by their bounded context (in this case, their right context of length \(k\)). This allows the index to be built efficiently, since rotations of the text need not have their relative orders resolved beyond their initial length-\(k\) contexts, while simultaneously allowing efficient and correct queries for any pattern of length \(\le k\).
Here, we experiment with an analogous version of the sa: the bounded-context sa. Specifically, the bounded-context sa of order \(k\) resolves the lexicographic order of all suffixes of the text up to (and including) their prefixes of length \(k\). If a pair of suffixes share a prefix of length \(\ge k\), then they may appear in an arbitrary relative order within the bounded-context sa of order \(k\). Without any meaningful modifications to the query algorithms, this variant of the sa allows locating all occurrences of queries of any length \(\le k\) in the text. Such a variant of the sa is very straightforward to construct using CaPS-SA, as we simply declare equal any suffixes that are equal up to (or beyond) their length-\(k\) prefixes. At the same time, this variant can be more efficient to construct using our algorithm, as the context length \(k\) places a strict upper bound on the number of comparisons we must perform when attempting to determine the relative order of a pair of suffixes. Specifically, it follows directly from Theorem 3 that CaPS-SA performs at most \(\mathcal {O}\left( n \log n + nk\right)\) character comparisons, in the worst case, when constructing the bounded-context sa of order \(k\).
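Semantically, the order-\(k\) bounded-context sa admits a one-line reference specification, shown below as a Python sketch for checking purposes (a quadratic-space construction; this is not the CaPS-SA algorithm itself, and the function name is ours):

```python
def bounded_context_sa(T, k):
    """Reference construction of the order-k bounded-context sa: suffixes are
    ordered by their length-k prefixes only.  Suffixes sharing a full length-k
    prefix may appear in any relative order (here: text order, as the sort
    is stable)."""
    return sorted(range(len(T)), key=lambda i: T[i:i + k])
```

For \(k\) at least the text length, this coincides with the full sa; for smaller \(k\), any permutation within a tied group is an equally valid answer, which is exactly the freedom CaPS-SA exploits to cap comparison lengths.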
In Fig. 4, we report the time required to construct the bounded-context sa of varying orders, in powers of \(2\) from \(64\) to \(65{,}536\), over the Bacteria (1K) dataset. As expected, the bounded-context sas can be constructed substantially faster than the full-context sa.
Conclusion
In this manuscript, we introduced a new method, CaPS-SA, for parallel sa and lcp-array construction. CaPS-SA displays very good cache performance (i.e. a very low cache-miss rate) and scales well to many threads. As a result, CaPS-SA is able to outperform existing state-of-the-art parallel sa construction algorithms like parallel-divsufsort and parallel-DC3 on genomic datasets. At the same time, CaPS-SA is substantially simpler than existing state-of-the-art algorithms. This simplicity eases implementation and leads to many opportunities for future improvements. Likewise, CaPS-SA provides the lcp-array directly as a byproduct of sa construction, and does not require a separate algorithm to produce this useful auxiliary data structure. We hope that it will prove useful in utilities where parallel sa construction is a core subproblem, and also hope that the relatively straightforward algorithm will benefit from further optimizations, enhancements, and alternative implementations within the community.
As CaPS-SA scales well with the level of available parallelism and performs well for large references, we expect that it will provide a useful option for tools that seek to build the sa in parallel environments. In addition to the time taken to construct the sa or the lcp-array, another consideration is the memory (specifically, the RAM) required for construction. One approach to improving the memory-scalability of sa construction algorithms is to develop external-memory construction algorithms; for example, pSAscan [28] is a state-of-the-art external-memory algorithm for sa construction. Such approaches make use of external memory (i.e. disk), and algorithms that access and construct the sa in a structured way are likely amenable to external-memory variants.
We note that, though we have not explored it in this manuscript, CaPS-SA is highly amenable to an external-memory implementation. This is because the initial partitioning generates many small subproblems that can be solved independently; i.e. some subproblems can be paged into RAM while others remain on disk. Pivot sampling from the subproblems can be done through a similar paging process. Likewise, after pivot selection, many approximately equal-sized partitions will be created, and these subproblems, which target specific output intervals of the final suffix array, can be solved independently and in parallel, with the relevant data for only a working subset of partitions paged into RAM and the remaining partitions residing on disk. Further, given a sufficiently fine-grained partitioning, the algorithm can likely provide tight controls on the required working memory. As more RAM use is allowed, a larger number of partitions can reside in RAM at once, and our algorithm will be better able to make use of the available parallelism. On the other hand, as the maximum allowed RAM usage is restricted, fewer partitions will be present in memory at once, potentially limiting parallelism, but adhering to the requested RAM constraints. In practice, we believe that, so long as a sufficiently fine-grained partitioning is used, external-memory variants of our algorithm will still be able to efficiently make use of many threads while substantially reducing the required working memory. We leave the efficient implementation of an external-memory variant of CaPS-SA to future work.
Code availability
The implementation is available under an opensource license at https://github.com/jamshed/CaPSSA.
Data availability
All the datasets used in this manuscript are publicly available, and have been cited appropriately.
Notes
1. With the special case of \(L_0 = 0\).
2. \(\mathcal {O}(sp \log {sp} + \sum _{i = 1}^{sp} {L_p}_i)\) comparisons are performed, where \(L_p\) is the lcp-array of the sorted samples.
3. The worst-case work is \(\mathcal {O}(n^2)\) due to the second factor. The first factor does not change.
References
Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discrete Algorithm. 2004;2(1):53–86.
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol. 2021;22(1):249. https://doi.org/10.1186/s13059-021-02443-7.
Anderson D, Blelloch GE, Dhulipala L, Dobson M, Sun Y. The problem-based benchmark suite (PBBS), v2. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '22), p. 445–447. New York, NY, USA: Association for Computing Machinery; 2022. https://doi.org/10.1145/3503221.3508422.
Axtmann M, Witt S, Ferizovic D, Sanders P. In-place parallel super scalar samplesort (IPSSSSo). In 25th Annual European Symposium on Algorithms (ESA 2017), volume 87 of Leibniz International Proceedings in Informatics (LIPIcs), p. 9:1–9:14. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik; 2017. https://doi.org/10.4230/LIPIcs.ESA.2017.9.
Bingmann T. Scalable string and suffix sorting: algorithms, techniques, and tools. arXiv preprint arXiv:1808.00963, 2018.
Bingmann T, Dinklage P, Fischer J, Kurpicz F, Ohlebusch E, Sanders P. Scalable text index construction. p. 252–284. Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-21534-6_14.
Bingmann T, Eberle A, Sanders P. Engineering parallel string sorting. Algorithmica. 2017;77:235–86.
Bingmann T, Sanders P. Parallel string sample sort. In Algorithms - ESA 2013: 21st Annual European Symposium, Sophia Antipolis, France, September 2–4, 2013. Proceedings 21, p. 169–180. Springer, 2013.
Blackwell GA, Hunt M, Malone KM, Lima L, Horesh G, Alako BTF, Thomson NR, Iqbal Z. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol. 2021;19(11):1–16. https://doi.org/10.1371/journal.pbio.3001421.
Blelloch GE, Anderson D, Dhulipala L. ParlayLib: a toolkit for parallel algorithms on shared-memory multicore machines. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, p. 507–509. 2020.
Bonizzoni P, Vedova GD, Pirola Y, Previtali M, Rizzi R. Computing the multi-string BWT and LCP array in external memory. Theor Comput Sci. 2021;862:42–58. https://doi.org/10.1016/j.tcs.2020.11.041.
Chikhi R, Holub J, Medvedev P. Data structures to represent a set of k-long DNA sequences. ACM Comput Surv. 2021. https://doi.org/10.1145/3445967.
Culpepper JS, Petri M, Puglisi SJ. Revisiting bounded context block-sorting transformations. Softw Pract Exper. 2012;42(8):1037–54.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.
Egidi L, Louza FA, Manzini G, Telles GP. External memory BWT and LCP computation for sequence collections with applications. Algorithm Mol Biol. 2019;14(1):6. https://doi.org/10.1186/s13015-019-0140-0.
Egidi L, Manzini G. Lightweight BWT and LCP merging via the gap algorithm. In: Fici G, Sciortino M, Venturini R, editors. String Processing and Information Retrieval. Berlin: Springer International Publishing; 2017.
Farach M. Optimal suffix tree construction with large alphabets. In Proceedings 38th Annual Symposium on Foundations of Computer Science. 1997. https://doi.org/10.1109/SFCS.1997.646102.
Fischer J, Kurpicz F. Dismantling divsufsort. In Prague Stringology Conference 2017, p. 62, 2017.
Fischer J, Kurpicz F. Lightweight distributed suffix array construction. Soc Indust Appl Math. 2019. https://doi.org/10.1137/1.9781611975499.3.
Flick P, Aluru S. Parallel distributed memory construction of suffix and longest common prefix arrays. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). New York, NY, USA: Association for Computing Machinery; 2015. https://doi.org/10.1145/2807591.2807609.
Frazer WD, McKellar AC. Samplesort: a sampling approach to minimal storage tree sorting. J ACM. 1970;17(3):496–507.
Gusfield D. Algorithms on strings, trees, and sequences: computer science and computational biology. 1997. https://doi.org/10.1017/CBO9780511574931.
Hazelhurst S, Lipták Z. KABOOM! A new suffix array based algorithm for clustering expression data. Bioinformatics. 2011;27(24):3348–55.
Ilie L, Fazayeli F, Ilie S. HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics. 2011;27(3):295–302.
Itoh H, Tanaka H. An efficient method for in memory construction of suffix arrays. In 6th International Symposium on String Processing and Information Retrieval / 5th International Workshop on Groupware (Cat. No. PR00268), p. 81–88. IEEE, 1999.
Kärkkäinen J, Kempa D. Engineering a lightweight external memory suffix array construction algorithm. Math Comput Sci. 2017;11:137–49.
Kärkkäinen J, Kempa D. Engineering external memory LCP array construction: parallel, in-place and large alphabet. In 16th International Symposium on Experimental Algorithms (SEA 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
Kärkkäinen J, Kempa D, Puglisi SJ. Parallel external memory suffix sorting. In: Cicalese F, Porat E, Vaccaro U, editors. Combinatorial Pattern Matching. Berlin: Springer International Publishing; 2015.
Kärkkäinen J, Kempa D, Puglisi SJ, Zhukova B. Engineering external memory induced suffix sorting. In 2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX), p. 98–108. SIAM, 2017.
Kärkkäinen J, Sanders P. Simple linear work suffix array construction. In Automata, Languages and Programming: 30th International Colloquium, ICALP 2003, Eindhoven, The Netherlands, June 30–July 4, 2003, Proceedings 30, p. 943–955, 2003.
Kärkkäinen J, Sanders P, Burkhardt S. Linear work suffix array construction. J ACM (JACM). 2006;53(6):918–36.
Khan J, Kokot M, Deorowicz S, Patro R. Scalable, ultrafast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol. 2022;23(1):190. https://doi.org/10.1186/s13059-022-02743-6.
Khan J, Rubel T, Dhulipala L, Molloy E, Patro R. Fast, parallel, and cache-friendly suffix array construction. In: Belazzougui D, Ouangraoua A, editors. 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023), volume 273 of Leibniz International Proceedings in Informatics (LIPIcs), p. 16:1–16:21. Dagstuhl, Germany: Schloss Dagstuhl - Leibniz-Zentrum für Informatik; 2023. https://doi.org/10.4230/LIPIcs.WABI.2023.16.
Kim DK, Sim JS, Park H, Park K. Linear-time construction of suffix arrays. In Combinatorial Pattern Matching: 14th Annual Symposium, CPM 2003, Morelia, Michoacán, Mexico, June 25–27, 2003, Proceedings 14, p. 186–199. Springer, 2003.
Ko P, Aluru S. Space efficient linear time construction of suffix arrays. In Combinatorial Pattern Matching: 14th Annual Symposium, CPM 2003, Morelia, Michoacán, Mexico, June 25–27, 2003, Proceedings, p. 200–210. Springer, 2003.
Kulla F, Sanders P. Scalable parallel suffix array construction. Parallel Comput. 2007;33(9):605–12.
Labeit J, Shun J, Blelloch GE. Parallel lightweight wavelet tree, suffix array and FM-index construction. J Discrete Algorithm. 2017;43:2–17.
Li Z, Li J, Huo H. Optimal in-place suffix sorting. In String Processing and Information Retrieval: 25th International Symposium, SPIRE 2018, Lima, Peru, October 9–11, 2018, Proceedings, p. 268–284. Springer, 2018.
Liao G, Ma L, Zang G, Tang L. Parallel DC3 algorithm for suffix array construction on many-core accelerators. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, p. 1155–1158, 2015. https://doi.org/10.1109/CCGrid.2015.56.
Manber U, Myers G. Suffix arrays: a new method for on-line string searches. SIAM J Comput. 1993;22(5):935–48.
Marra NJ, Stanhope MJ, Jue NK, Wang M, Sun Q, Pavinski Bitar P, Richards VP, Komissarov A, Rayko M, Kliver S, Stanhope BJ, Winkler C, O'Brien SJ, Antunes A, Jorgensen S, Shivji MS. White shark genome reveals ancient elasmobranch adaptations associated with wound healing and the maintenance of genome stability. Proc Natl Acad Sci. 2019;116(10):4446–55. https://doi.org/10.1073/pnas.1819778116.
Mori Y. libdivsufsort. https://github.com/y-256/libdivsufsort. 2015. (Accessed 1 May 2023).
Ng W, Kakehi K. Merging string sequences by longest common prefixes. IPSJ Digital Courier. 2008;4:69–78.
Nong G, Zhang S, Chan WH. Two efficient algorithms for linear time suffix array construction. IEEE Trans Comput. 2010;60(10):1471–84.
Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53.
Petri M, Navarro G, Culpepper JS, Puglisi SJ. Backwards search in context bound text transformations. In 2011 First International Conference on Data Compression, Communications and Processing, p. 82–91. IEEE, 2011.
Pirogov A, Pfaffelhuber P, Börsch-Haubold A, Haubold B. High-complexity regions in mammalian genomes are enriched for developmental genes. Bioinformatics. 2019;35(11):1813–9.
Puglisi SJ, Smyth WF, Turpin AH. A taxonomy of suffix array construction algorithms. ACM Comput Surv (CSUR). 2007;39(2):4-es.
Sanders P, Winkel S. Super scalar sample sort. In Algorithms - ESA 2004: 12th Annual European Symposium, Bergen, Norway, September 14–17, 2004. Proceedings 12, p. 784–796. Springer, 2004.
Schindler M. A fast block-sorting algorithm for lossless data compression. In Proceedings DCC '97. Data Compression Conference. 1997. p. 469. https://doi.org/10.1109/DCC.1997.582137.
Shrestha AMS, Frith MC, Horton P. A bioinformatician's guide to the forefront of suffix array construction algorithms. Brief Bioinform. 2014;15(2):138–54.
Smith JJ, Timoshevskaya N, Timoshevskiy VA, Keinath MC, Hardy D, Voss RS. A chromosome-scale assembly of the axolotl genome. Genome Res. 2019;29(2):317–24.
Vyverman M, De Baets B, Fack V, Dawyndt P. essaMEM: finding maximal exact matches using enhanced sparse suffix arrays. Bioinformatics. 2013;29(6):802–4.
Weiner P. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), p. 1–11. 1973. https://doi.org/10.1109/SWAT.1973.13.
Ye Y, Choi JH, Tang H. RAPSearch: a fast protein similarity search tool for short reads. BMC Bioinform. 2011;12(1):159.
Zhu K, Schäffer AA, Robinson W, Xu J, Ruppin E, Ergun AF, Ye Y, Sahinalp SC. Strain level microbial detection and quantification with applications to single cell metagenomics. Nat Commun. 2022;13(1):6430.
Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, Weng Z, Liu Y, Mason CE, Alexander N, Henaff E, McIntyre ABR, Chandramohan D, Chen F, Jaeger E, Moshrefi A, Pham K, Stedman W, Liang T, Saghbini M, Dzakula Z, Hastie A, Cao H, Deikus G, Schadt E, Sebra R, Bashir A, Truty RM, Chang CC, Gulbahce N, Zhao K, Ghosh S, Hyland F, Fu Y, Chaisson M, Xiao C, Trow J, Sherry ST, Zaranek AW, Ball M, Bobe J, Estep P, Church GM, Marks P, Kyriazopoulou-Panagiotopoulou S, Zheng GXY, Schnall-Levin M, Ordonez HS, Mudivarti PA, Giorda K, Sheng Y, Rypdal KB, Salit M. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3(1):160025. https://doi.org/10.1038/sdata.2016.25.
Acknowledgements
J.K. acknowledges Harihara Subrahmaniam Muralidharan for his input on the proofs.
Funding
This work is supported by the NIH under Grant Award Number R01HG009937, by NSF awards CCF-1750472 and CNS-176368 to R.P., and by NSF SaTC-2317194 to L.D.
Author information
Authors and Affiliations
Contributions
This work was conceived by J.K. and R.P. This work was supervised by E.M., L.D., and R.P. The manuscript was prepared by all authors. Experiments were designed, prepared, and carried out by T.R. and J.K., with input from L.D. and R.P. The implementation of CaPSSA was primarily developed by J.K., with contributions from T.R. and R.P. The analysis of the algorithm was performed by J.K. and L.D. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
L.D. is a visiting researcher at Google Research. R.P. is a cofounder of Ocean Genomics Inc.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
Methods
Pseudocode for the procedures omitted from the main text (sample-pivots, collate-partitions, and compute-boundary-lcps) is provided in the following.
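As background for readers unfamiliar with samplesort-style pivot selection (the strategy CaPS-SA builds on, per the abstract), the sketch below shows a generic, minimal version of the idea in Python. It is an illustrative assumption, not the authors' sample-pivots procedure: the function name, the oversampling factor, and the use of full suffix comparisons via slicing are all choices made here for clarity, whereas the actual implementation operates on the text in place and in parallel.

```python
import random

def sample_pivots(text, num_parts, oversample=8):
    """Generic sample-sort pivot selection over suffixes (illustrative
    sketch, not CaPS-SA's actual procedure): draw an oversampled subset
    of suffix start positions, sort them lexicographically, and keep
    num_parts - 1 evenly spaced entries as partition pivots."""
    n = len(text)
    k = min(n, num_parts * oversample)
    sample = random.sample(range(n), k)
    # Sorting by the full suffix text[i:] is simple but O(n) per
    # comparison; a real implementation would avoid this cost.
    sample.sort(key=lambda i: text[i:])
    step = len(sample) // num_parts
    return [sample[(j + 1) * step] for j in range(num_parts - 1)]
```

With these pivots, each suffix can be routed to one of `num_parts` buckets by lexicographic comparison against the pivot suffixes, after which the buckets can be sorted independently (and hence in parallel).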
Results
See Table 4.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Khan, J., Rubel, T., Molloy, E. et al. Fast, parallel, and cache-friendly suffix array construction. Algorithms Mol Biol 19, 16 (2024). https://doi.org/10.1186/s13015-024-00263-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13015-024-00263-5