Suffix sorting via matching statistics

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Experiments with a prototype implementation (a tool we call sacamats) show that, on collections of highly similar strings, we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.


Introduction
Suffix sorting, the process of ordering all the suffixes of a string into lexicographic order, is the key step in the construction of suffix arrays and the Burrows-Wheeler transform, two of the most important structures in text indexing and biological sequence analysis [21,15,1]. As such, algorithms for efficient suffix sorting have been the focus of intense research since the early 1990s [16,23].
With the rise of pangenomics, there is an increased demand for indexes that support fast pattern matching over collections of genomes of individuals of the same species (see, e.g., [8,24,25]). With pangenomic collections constantly growing and changing, construction of these indexes, and in particular suffix sorting, is a computational bottleneck in many bioinformatics pipelines. While traditional and well-established suffix sorting tools such as divsufsort [17,7] and sais [18,20] can be applied to these collections, specialised algorithms for collections of similar sequences, perhaps most notably the so-called BigBWT program [3], are beginning to emerge.
In this paper we describe a suffix sorting algorithm specifically targeted at collections of highly similar genomes that makes use of the matching statistics, a data structure due to Chang and Lawler, originally used in the context of approximate pattern matching [5]. The core device in our suffix sorting algorithm is a novel compressed representation of the matching statistics of every genome in the collection with respect to a designated reference genome, which allows determining the relative order of two arbitrary suffixes, from any of the genomes, efficiently. We use this data structure to drive a suffix sorting algorithm that has a small working set relative to the size of the whole collection, with the aim of increasing locality of memory reference. Experimental results with a prototype implementation show the new approach to be faster than or competitive with state-of-the-art methods for suffix array construction, including those targeted at highly repetitive data. We also provide a fast, practical algorithm for matching statistics computation, which is of independent interest.
The remainder of this paper is structured as follows. The next section sets notation and defines basic concepts. In Section 3 we describe a compressed representation of the matching statistics and a fast algorithm for constructing it. Section 4 then describes how to use the compressed matching statistics to determine the relative lexicographic order of two arbitrary suffixes of the collection. Section 5 describes a complete suffix sorting algorithm. We touch on several implementation details in Section 6, before describing experimental results in Section 7. Reflections and avenues for future work are then offered.

Basics
A string T over an ordered alphabet Σ, of size |Σ| = σ, is a finite sequence T = T[1..n] of characters from Σ. We use the notation T[i] for the ith character of T, |T| for its length n, and T[i..j] for the substring T[i]T[i+1]···T[j]; for i > j, T[i..j] = ε, where ε is the empty string. The substring (or factor) T[i..] = T[i..n] is called the ith suffix, and T[..i] = T[1..i] the ith prefix of T. We assume throughout that the last character of each string is a special character $, not occurring elsewhere in T, which is set to be smaller than every character in Σ.
Given a string T of length n, the suffix array SA is a permutation of the index set {1, . . ., n} defined by: SA[i] = j if the jth suffix of T is the ith in lexicographic order among all suffixes of T. The inverse suffix array ISA is the inverse permutation of SA. The LCP-array is given by: LCP[1] = 0, and for i ≥ 2, LCP[i] is the length of the longest common prefix (lcp) of the two suffixes T[SA[i−1]..] and T[SA[i]..] (which are consecutive in lexicographic order). A variant of the LCP array is the permuted LCP-array, PLCP, defined as PLCP[i] = LCP[ISA[i]], i.e. the lcp values are stored in text order, rather than in SA order. We further define LCPsum(T) = Σ_{i=1}^{|T|} LCP[i]. LCPsum can be used as a measure of repetitiveness of strings, since the number of distinct substrings of T equals (|T|² + |T|)/2 − LCPsum(T). All these arrays can be computed in time linear in |T|, see e.g. [20,11].
Given the suffix array SA of T and a substring U of T, the indices of all suffixes which have U as a prefix appear consecutively in SA. We refer to this interval as the U-interval. Let C = {S_1, . . ., S_m} be a collection of strings (a set or multiset). The generalized suffix array GSA of C is defined as GSA[i] = (d, j) if S_d[j..] is the ith suffix in lexicographic order among all suffixes of the strings from C, where ties are broken by the document index d. The GSA can be computed in time O(N), where N is the total length of the strings in C [21].
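As a concrete illustration of this definition, the following quadratic-time Python sketch builds the GSA directly from the definition (function name and conventions are ours; it is for illustration only and bears no resemblance to the efficient algorithm developed in this paper):

```python
# Naive GSA construction: collect all suffixes of all documents, sort them
# lexicographically, and break ties by document index d (1-based d and j,
# as in the paper's definition).

def generalized_suffix_array(collection):
    """collection: list of strings, each ending in a terminator such as '$'."""
    suffixes = []
    for d, s in enumerate(collection, start=1):
        for j in range(1, len(s) + 1):
            suffixes.append((s[j - 1:], d, j))
    # sort by suffix first, then by document index d to break ties
    suffixes.sort(key=lambda t: (t[0], t[1]))
    return [(d, j) for _, d, j in suffixes]

print(generalized_suffix_array(["banana$", "ananas$"]))
```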
Let R and S be two strings. The matching statistics of S with respect to R is an array MS of length |S|, defined as follows. Let U be the longest prefix of suffix S[i..] which occurs in R as a substring, where the end-of-string character # of R is assumed to be different from, and smaller than, that of S. Then MS[i] = (p_i, ℓ_i), where p_i = −1 if U = ε, and p_i is an occurrence of U in R otherwise, and ℓ_i = |U|. (Note that p_i is not unique in general.) We refer to U as the matching factor, and to the character c immediately following U in S as the mismatch character, of position i. For a collection C = {S_1, . . ., S_m} and a string R, the matching statistics of C w.r.t. R is simply the concatenation of the MS_i's, where MS_i is the matching statistics of S_i w.r.t. R. We will discuss matching statistics in more detail in Section 3.
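This definition can be rendered directly as a quadratic-time sketch (our naming; 0-based positions, unlike the paper's 1-based convention; Section 3 gives a fast algorithm):

```python
# Naive matching statistics: for each position i of S, MS[i] = (p_i, l_i),
# where l_i is the length of the longest prefix U of S[i..] occurring in R,
# and p_i is some (not necessarily unique) occurrence of U, or -1 if U = ε.

def matching_statistics(S, R):
    MS = []
    for i in range(len(S)):
        p, l = -1, 0
        # extend the match while S[i..i+l] still occurs somewhere in R
        while i + l < len(S):
            occ = R.find(S[i:i + l + 1])
            if occ < 0:
                break
            p, l = occ, l + 1
        MS.append((p, l))
    return MS

print(matching_statistics("CACT$", "ACAC#"))
```

Note how the lengths (3, 2, 1, 0, 0) in this small example already form the concatenation of decrement runs exploited by the compressed representation of Section 3.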
For an integer array A of length n and an index i, the previous and next smaller values, PSV resp. NSV, are defined as PSV_A(i) = max{j < i | A[j] < A[i]} and NSV_A(i) = min{j > i | A[j] < A[i]}. A data structure occupying a linear number of bits can be built that supports answering arbitrary PSV and NSV queries in constant time per query [6].
Let X be a finite set of integers. Given an integer x, the predecessor of x is defined as the largest element of X not exceeding x, i.e. pred_X(x) = max{y ∈ X | y ≤ x}. Using the y-fast trie data structure of Willard [27] allows answering predecessor queries in O(log log |X|) time using O(|X|) space.
We are now ready to state our problem.

Problem Statement: Given a string collection C = {S_1, . . ., S_m} and a reference string R, compute the generalized suffix array GSA of C.
We will denote the length of R by n and the total length of the strings in the collection by N. As before, we assume that the end-of-string character # of R is strictly smaller than those of the strings in the collection C. We are interested in those cases where LCPsum(R) is small and the strings in C are very similar to R. If no reference string is given in input, we take S_1 to be the reference string by default.

Efficient suffix array construction
Currently, the best known and conceptually simplest linear-time suffix array construction algorithm is the SAIS algorithm by Nong et al. [20]. It cleverly combines, and further develops, several ideas used by previous suffix array construction algorithms, among these induced sorting and the use of a so-called type array, already used in [9,12] (see also [23]).
Nong et al.'s approach can be summarized as follows: assign a type to each suffix, sort a specific subset of suffixes, and compute the complete suffix array by inducing the order of the remaining suffixes from the sorted subset. There are three types of suffixes, one of which constitutes the subset to be sorted first.
The definition of types is as follows (originally from [12], extended in [20]): suffix T[i..] is S-type if it is lexicographically smaller than T[i+1..], and L-type otherwise; an S-type suffix T[i..] is additionally S*-type if i > 1 and T[i−1..] is L-type. It is well known that assigning a type to each suffix can be done with a back-to-front scan of the text in linear time. Now, if the relative order of the S*-suffixes is known, then that of the remaining suffixes can be induced with two linear scans over the partially filled-in suffix array: the first scan to induce L-type suffixes, and the second to induce S-type suffixes. For details, see [20] or [21].
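The back-to-front type scan can be sketched as follows (our naming; 0-based indices, with the final terminator's suffix taken as S-type by convention):

```python
# Single right-to-left scan assigning S/L types, then extraction of the
# S*-positions (S-type positions immediately preceded by an L-type one).

def classify(T):
    n = len(T)
    t = ["S"] * n                     # last suffix is S-type by convention
    for i in range(n - 2, -1, -1):
        # T[i..] > T[i+1..] iff T[i] > T[i+1], or they are equal and
        # T[i+1..] is itself L-type
        if T[i] > T[i + 1] or (T[i] == T[i + 1] and t[i + 1] == "L"):
            t[i] = "L"
    sstar = [i for i in range(1, n) if t[i] == "S" and t[i - 1] == "L"]
    return t, sstar

print(classify("banana$"))
```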
Another ingredient of SAIS, and of several other suffix array construction algorithms, is what we term the metacharacter method. Subdivide the string T into overlapping substrings; show that if two suffixes start with the same substring, then their relative order depends only on the remaining part; assign metacharacters to these substrings according to their rank (w.r.t. the lexicographic order, or some other order, depending on the algorithm); and define a new string on these metacharacters. Then the relative order of the suffixes of the new string and the corresponding suffixes starting with these specific substrings will coincide. In SAIS [20], so-called LMS-substrings are used, while a similar method is applied in prefix-free parsing (PFP) [3]. Here we will apply this method using substrings starting in special positions which we term insert-heads; see Sections 4 and 5 for details.

Compressed matching statistics
Let R, S be two strings over Σ and MS be the matching statistics of S w.r.t. R. Let MS[i] = (p_i, ℓ_i). It is a well-known fact that if ℓ_i > 0, then ℓ_{i+1} ≥ ℓ_i − 1. This can be seen as follows. Let U be the matching factor of position i, and p_i an occurrence of U in R. Then U[2..] is a prefix of S[i+1..] and occurs at position p_i + 1 in R, hence ℓ_{i+1} ≥ |U| − 1 = ℓ_i − 1. Let us call a position j a head if ℓ_j > ℓ_{j−1} − 1, and a sequence of the form (x, x−1, x−2, . . .), i.e. in which each element is one less than the previous one, a decrement run. Using this terminology, we thus have that the sequence L = (ℓ_1, ℓ_2, . . ., ℓ_{|S|}) is a concatenation of decrement runs, i.e. L has the form (ℓ_{j_1}, ℓ_{j_1}−1, . . ., ℓ_{j_2}, ℓ_{j_2}−1, . . .), where each run starts at some head j. We can therefore store the matching statistics in compressed form as follows:

Definition 1 (Compressed matching statistics). Let R, S be two strings over Σ, and MS be the matching statistics of S w.r.t. R. The compressed matching statistics (CMS) of S w.r.t. R is a data structure storing (j, MS[j]) for each head j, and a predecessor data structure on the set of heads H.
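Definition 1 can be illustrated with a short sketch that extracts the heads from a full MS array (our naming; 1-based head positions as in the paper; the predecessor structure over H is left implicit here):

```python
# Keep only the heads of MS, i.e. positions j with l_j > l_{j-1} - 1
# (position 1 is always a head). Each head is stored as (j, p_j, l_j).

def compress_ms(MS):
    heads = []
    prev_l = None
    for j, (p, l) in enumerate(MS, start=1):
        if prev_l is None or l > prev_l - 1:
            heads.append((j, p, l))
        prev_l = l
    return heads

# MS of S = "CACT$" w.r.t. R = "ACAC#" (0-based occurrences)
print(compress_ms([(1, 3), (0, 2), (1, 1), (-1, 0), (-1, 0)]))
```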

We can use the CMS to recover all values of MS:

Lemma 2. Let 1 ≤ i ≤ |S|, let j = pred_H(i) be the nearest head to the left of i (or i itself), and let k = i − j. Then MS[i] = (p_j + k, ℓ_j − k).

Proof. Let ℓ_i be the length of the matching factor of i. Since there is a matching factor of length ℓ_j starting in position j in S, this implies that ℓ_i ≥ max(0, ℓ_j − k). If ℓ_i were strictly greater than ℓ_j − k, this would imply the presence of another head between j and i, in contradiction to j = pred_H(i). Since an occurrence of the matching factor U_j of j starts in position p_j of R, the matching factor U = U_j[k + 1..ℓ_j] of i has an occurrence at position p_j + k.

Figure 1 (caption): In row 3, we mark the heads (for the CMS). In row 4, we give the position q_i = SA_R[ip(i)], where ip(i) is the insert point of suffix S[i..] in the suffix array of R. In row 5, we mark the insert-heads (for the eCMS). For some statistics on the number χ of heads, see the end of Sec. 3.1.
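The recovery of Lemma 2 can be sketched in code, with the predecessor query over the sorted head positions simulated by binary search (bisect) instead of the y-fast trie of the paper (names and the example CMS are ours; recall that the occurrence p_i reported for a position is not unique in general):

```python
# Recover MS[i] from a CMS given as a list of (j, p_j, l_j), sorted by the
# 1-based head position j. Lemma 2: with j = pred_H(i) and k = i - j,
# MS[i] = (p_j + k, l_j - k).
from bisect import bisect_right

def ms_from_cms(cms, i):
    idx = bisect_right([j for j, _, _ in cms], i) - 1   # nearest head j <= i
    j, p, l = cms[idx]
    k = i - j
    return (p + k, l - k)

cms = [(1, 1, 3), (5, -1, 0)]   # CMS of S = "CACT$" w.r.t. R = "ACAC#"
print(ms_from_cms(cms, 3))
```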

Enhancing the CMS
Let R, S be two strings over Σ, and MS the matching statistics of S w.r.t. R. We now assume that all characters that occur in S also occur in R (see Sec. 6). Let SA_R be the suffix array of R. For a position i of S, let U ≠ ε be the matching factor and c the mismatch character of i. We want to compute the position that the suffix S[i..] would have in SA_R if it were present. To this end, we define the insert point ip(i) of i: the insert point is the lexicographic rank, among all suffixes of R, of the next smaller occurrence of U in R if such an occurrence exists, and of the smallest occurrence of U in R otherwise. Note that the remaining case (where U = ε) only happens for end-of-string characters. The insert point is well-defined for every i because # is smaller than all other characters, including other end-of-string characters. Observe that the insert point of i always lies within the U-interval of SA_R. For an example, see Fig. 2.
We will later use the insert points to bucket suffixes. First we need to slightly change the definition of our compressed matching statistics. We will add more information to the heads: we add the mismatch character and replace the position entry p_i, which gives just some occurrence of the matching factor, by the specific occurrence q_i given by the insert point. This will imply adding more heads, so our data structure may increase in size.
To this end, we define j to be an insert-head if j = 1 or if its entry cannot be derived from that of the preceding position, i.e. if ems(j) ≠ (q_{j−1} + 1, ℓ_{j−1} − 1, x_{j−1}, c_{j−1}) (see Definition 5 below). Note that, in particular, all heads are also insert-heads, but it is possible to have insert-heads j which are not heads, namely where ℓ_j = ℓ_{j−1} − 1.
Definition 5 (Enhanced compressed matching statistics). Let R, S be two strings over Σ. Define the enhanced matching statistics as follows: for each 1 ≤ i ≤ |S|, let ems(i) = (q_i, ℓ_i, x_i, c_i), where q_i = SA_R[ip(i)], ℓ_i is the length of the matching factor U of i, c_i is the mismatch character, and x_i ∈ {S, L} indicates whether U c_i is smaller (S) or greater (L) than R[q_i..]. The enhanced compressed matching statistics (eCMS) of S w.r.t. R is a data structure storing (j, ems(j)) for each insert-head j, and a predecessor data structure on the set of insert-heads H′.

Example 6. Continuing with Example 3, the enhanced CMS of S w.r.t. R consists of the entries (j, q_j, ℓ_j, x_j, c_j): (1, 2, 9, L, T), (9, 3, 2, L, T), (11, 1, 6, S, $), (16, 11, 1, S, $), (17, 17, 0, L, $); see Figure 1.
We will need some properties of the insert point in the following:

Observation 7. Let ip(i) be the insert point of i, and ems(i) = (q_i, ℓ_i, x_i, c_i). Since SA_R is a permutation and q_i = SA_R[ip(i)], we have ip(i) = ip(i′) if and only if q_i = q_{i′}.
The enhanced CMS can be used in a similar way as the CMS to recover the enhanced matching statistics (including the matching statistics) of each i. Denote by i-head(i) the next insert-head to the left of i, i.e. i-head(i) = max{j ≤ i | j is an insert-head}. Note that i-head(i) = pred_{H′}(i).

Lemma 8. Let 1 ≤ i ≤ |S|, and let eCMS be the enhanced CMS of S w.r.t. R. Let j = i-head(i), k = i − j, and ems(j) = (q_j, ℓ_j, x_j, c_j). Then ems(i) = (q_j + k, ℓ_j − k, x_j, c_j), and ip(i) = ISA_R[q_j + k]. In particular, q_j + k is an occurrence, and ℓ_j − k the length, of the matching factor of i (in other words, the matching statistics entry MS[i]).
Proof. Analogous to Lemma 2, resp. straightforward from the definitions.
Similarly to the CMS (cp. Prop. 4), the enhanced CMS allows access to all values for every index i, using space O(χ′) and time O(log log χ′), where χ′ = |H′| is the number of insert-heads. Again, this is due to the fact that the predecessor data structure on the set H′ of insert-heads allows retrieving pred_{H′}(i) in O(log log χ′) time. We close this subsection by remarking that for a collection of similar genomes, one can expect the number of heads to be small. Indeed, on a 500MB viral genome data set (see Section 7) containing approximately 10,000 SARS-CoV-2 genomes, we observed the number of heads to be 5,326,226 (100x less than the input size) and the number of insert-heads to be 6,537,294.

Computing the CMS
It is well known that the matching statistics of S w.r.t. R can be computed in time O(|R| + |S| log σ) and O(|R|) space by using, for example, the suffix tree of R, as described in Chang and Lawler's original paper [5]. Since then, several authors have described similar algorithms for computing matching statistics, all focussed on reducing space requirements via the use of compressed indexes instead of the suffix tree [1,22,2]. These algorithms all incur the slowdowns typical of compressed data structures.
In our setting, where end-to-end runtime is the priority, it is the speed at which the matching statistics can be computed (rather than working space) that is paramount. Moreover, because the size of the reference is generally small relative to the total length of all the strings S_i ∈ C, we have some freedom to use large index data structures on R to compute the matching statistics, without overall memory usage getting out of hand. With these factors in mind, we take the following approach to computing the CMS. The algorithm is similar to that of Chang and Lawler, but makes use of array-based data structures rather than the suffix tree.
Recall that, given the suffix array SA_R of string R and a substring Y of R, the Y-interval is the interval SA_R[s..e] that contains all suffixes having Y as a prefix.

Definition 9 (Right extension and left contraction). For a character c and a string Y, the computation of the Yc-interval from the Y-interval is called a right extension, and the computation of the Y-interval from the cY-interval is called a left contraction.
We remark that a left contraction is equivalent to following a (possibly implicit) suffix link in the suffix tree of R and a right extension is a downward movement (either to a child or along an edge) in the suffix tree of R.
Given a Y-interval, because of the lexicographic ordering of SA_R, we can implement a right extension to the Yc-interval in O(log |R|) time by using a pair of binary searches (with c as the search key), one to find the left-hand end of the Yc-interval and another to find the right-hand end. If the right extension is empty then there are no occurrences of Yc in R, but we can have the binary search return the position in SA_R where Yc would have been inserted.
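A minimal sketch of such a binary-search right extension (our naming; 0-based indices; in the real algorithm the empty case would additionally report the left boundary as the insert position):

```python
# Right extension: given the inclusive Y-interval [s, e] in SA with
# |Y| = depth, find the Yc-interval by two binary searches over the
# character at string depth |Y| of each suffix in the interval.

def right_extension(R, SA, s, e, depth, c):
    def ch(idx):                       # character following Y in suffix SA[idx]
        pos = SA[idx] + depth
        return R[pos] if pos < len(R) else ""   # past-the-end sorts first
    lo, hi = s, e + 1                  # leftmost index with ch >= c
    while lo < hi:
        mid = (lo + hi) // 2
        if ch(mid) < c:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = s, e + 1                  # leftmost index with ch > c
    while lo < hi:
        mid = (lo + hi) // 2
        if ch(mid) <= c:
            lo = mid + 1
        else:
            hi = mid
    right = lo - 1
    return (left, right) if left <= right else None   # None: Yc not in R
```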
On the other hand, given a cY-interval SA_R[s..e], we can compute the Y-interval (i.e. perform a left contraction) in the following way. Let the target Y-interval be SA_R[x..y]. Since Y occurs at position SA_R[s] + 1 of R, the index ISA_R[SA_R[s] + 1] lies inside the target interval; the interval itself is the maximal interval around this index in which all suffixes share a prefix of length |Y|, and its boundaries can be recovered with NSV/PSV queries on LCP_R. With these ideas in place, we are ready to describe the matching statistics algorithm. We first compute SA_R, ISA_R, and LCP_R for R and preprocess LCP_R for NSV/PSV queries. The elements of MS will be computed in left-to-right order, MS[1], MS[2], . . ., MS[|S|].

Observe that both SA_R[s] + 1 and SA_R[e] + 1 must be inside the target Y-interval.
To find MS[1] we use successive right extensions starting with the interval SA_R[1..|R|], searching with successive characters of S[1..] until the right extension is empty, at which point we know ℓ_1 and p_1. At a generic step in the algorithm, immediately after computing MS[i], we know the interval SA_R[s_i..e_i] containing all the occurrences of R[p_i..p_i + ℓ_i − 1]. To compute MS[i + 1] we first compute the left contraction of SA_R[s_i..e_i], followed by as many right extensions as possible, until ℓ_{i+1} and p_{i+1} are known. Note that this scheme makes it trivial to save only the heads (or insert-heads) and so compute the CMS (or eCMS) instead of the full MS.
When profiling an implementation of the above algorithm, we noticed that very often the sequence of right extensions ended with a singleton interval (i.e., an interval of size one), as was the interval reached by the left contraction that followed. In terms of the suffix tree, this corresponds to the match between R and the current suffix of S_i lying inside a leaf branch. This frequently happens on genome collections, because each sequence is likely to have much longer matches with other sequences (in this case with R) than it does with itself (a single genome tends to look fairly random, at least by string complexity measures).
A simple heuristic to exploit this phenomenon is to compare ℓ_i to the maximum value in the entire LCP_R array of R immediately after MS[i] has been computed. If ℓ_i − 1 > max(LCP_R), then ISA_R[p_i + 1] will also be inside a leaf branch (i.e., the left contraction will also be a singleton interval), and so the left contraction can be computed trivially as ISA_R[p_i + 1], with no subsequent NSV/PSV queries or accesses to LCP_R required to expand the interval. Although this gives no asymptotic improvement, there is potential gain from the probable cache miss(es) avoided by not making random accesses to those large data structures.
On a viral genome data set (see Section 7), max(LCP_R) was 14, compared to an average ℓ_i value of over 1,100, and this heuristic saved a great deal of computation. On a human chromosome data set, however, max(LCP_R) was in the hundreds of thousands, and so we generalized the trick in the following way. We divide the LCP array up into blocks of size b and compute the maximum of each block. These maxima are stored in an array M of size |R|/b, and b is chosen so that M is small enough to comfortably fit in cache. Now, when transitioning from MS[i] to MS[i + 1], if ℓ_i − 1 is greater than the relevant block maxima, then there is a single match corresponding to MS[i + 1], which we compute with right extensions. This generalized form of the heuristic has a consistent and noticeable effect in practice. For a 500MB viral genome data set its use reduced CMS computation from 12.23 seconds to 2.34 seconds. On the human chromosome data set the effect is even more dramatic: from 76.50 seconds down to 7.14 seconds.
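The block-wise filter can be sketched as follows (our naming; a conservative test: if the cache-resident block maxima covering the LCP entries adjacent to a position are all below ℓ, every LCP value there is smaller, so the corresponding SA interval must be a singleton):

```python
# Precompute per-block maxima of the LCP array; M has size ceil(len(LCP)/b)
# and is intended to fit comfortably in cache.

def build_block_maxima(LCP, b):
    return [max(LCP[i:i + b]) for i in range(0, len(LCP), b)]

def certainly_singleton(M, b, j, l):
    """Conservatively test whether LCP positions j and j+1 are both < l,
    touching only the small array M (never the LCP array itself)."""
    return all(M[p // b] < l for p in (j, j + 1) if p // b < len(M))
```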

Comparing two suffixes via the enhanced CMS
We will now show how to use the enhanced CMS of the collection C w.r.t. R to define a partial order on the set of suffixes of strings in C (Prop. 12), and how to break ties when the entries are identical (Lemma 13). These results can then be used either directly to determine the relative order of any two of the suffixes (Prop. 14), or as a way of inducing the complete order once that of the subset of the insert-heads has been determined (Prop. 15). We will prove Prop. 12 via two lemmas. Recall that the eCMS only stores the entries referring to the insert-heads; however, Lemma 8 tells us how to compute them for any position.

If ip(d′, i′) − ip(d, i) > 1, then there exists an index j s.t. ip(d, i) < j < ip(d′, i′), and therefore the suffix R[SA_R[j]..] lies lexicographically between the two suffixes. Let U be the matching factor of (d, i), U′ that of (d′, i′), and V = lcp(U, U′), the longest common prefix of the two. V cannot be equal to U′, because then U′ would be a proper prefix of U, but ip(d′, i′) is the smallest occurrence in R of U′. If V = U, then U is a proper prefix of U′, and by definition of ip(d′, i′), the character following U in U′ is strictly greater than the mismatch character c_i of (d, i). Finally, if V is a proper prefix both of U and of U′, then the character following V in U is smaller than the one following V in U′, and therefore S_d[i..] < S_{d′}[i′..].

First note that j > i. This is because the matching factor of position i ends in position i + ℓ_{d,i} − 1, so there must be a new insert-head after i and at most at i + ℓ_{d,i}, the position of the mismatch character. Similarly, j′ > i′. The fact that j = i-head(i + ℓ_{d,i}) implies that there is a matching factor starting in position j which spans the mismatch character c = c_{d,i} = c_{d′,i′}. Let us write V c for the prefix of length i + ℓ_{d,i} − j of this matching factor. V is a suffix of the matching factor U of position i, but V c is not. However, V c is also a prefix of S_{d′}[i′..]. Therefore, j′ = i′ + (j − i) is also an insert-head in S_{d′}. An analogous argument shows that any insert-head between i′ and i′ + ℓ_{d′,i′} in S_{d′} is also an insert-head in S_d, in the same relative position.
From Lemma 8 we get the four eCMS-entries of (d, i) and (d′, i′), namely the insert positions q_i resp. q_{i′}, the length of the matching factor, whether the mismatch character is smaller or larger, and the mismatch character itself. If any of these differ for the two suffixes, then Lemmas 10 and 11 tell us their relative order. This check takes O(1) time. Otherwise, Lemma 13 shows that the relative order is determined by the next relevant heads. Iteratively applying the three lemmas, in the worst case, takes us through all heads of the strings S_d and S_{d′}.
Instead of using Prop. 14, we will use these lemmas in the following way. We will first sort only the insert-heads. The following proposition states that this suffices to determine the order of any two suffixes efficiently.

Proposition 15.
Given the insert-heads in sorted order, the relative order of any two suffixes can be determined in O(log log χ′) time, where χ′ is the number of insert-heads.
Proof. Follows from Lemmas 10, 11, and 13, since all checks take constant time, and each of the two predecessor queries takes O(log log χ′) time.

Putting it all together
A high-level view of our algorithm is as follows. We first partially sort the insert-heads, then use this partial sort to generate a new string, whose suffixes we sort with an existing suffix sorting algorithm. This gives us a full sort of the insert-heads. We then use this to sort the S*-suffixes of the collection. Finally, we induce the remaining suffixes of the collection from the S*-suffixes. A schematic description is given in Algorithm 1; we next give a worst-case asymptotic analysis of the algorithm.

Proposition 16. Algorithm 1 computes the GSA of a string collection C of total length N in worst-case time O(N log N ).
Proof. Let |R| = n. Phase 1 takes O(n + N) time, since constructing all data structures on R can be done in time linear in n, and scanning the collection C takes time O(N). Phase 2 takes time O(N log n), using the algorithm from Sec. 3.2. In Phase 3, identifying the S*-suffixes takes time O(N); since at this point the eCMS is in text order, identifying i-head(i) takes constant time, and computing the insert point also takes constant time, so altogether O(N) time. In Phase 4, all steps are linear in χ′, the number of insert-heads, including the partial sort of the buckets, since this can be done with radix sort (three passes over each bucket), so this phase takes time O(χ′). Phase 5 takes time O(|B| log |B|) for each bucket B, thus O(N log |B_max|) for the entire collection, where B_max is a largest bucket. Since all strings in the collection are assumed to be highly similar to the reference, the sizes of the buckets can be expected to vary around the number of strings in the collection, m; in the worst case, however, the largest bucket can be Θ(N). Finally, Phase 6 takes linear time O(N). Altogether, the running time is dominated by Phase 5: O(N log N).

Implementation details
In Phase 1, the augmentation step involves, for every character c not occurring in R but occurring in C, appending c^{n_c} to R, where n_c is the length of the longest run of c in C. This avoids having 0-length entries in the matching statistics and is necessary in order to have a well-defined ip.
To compute SA_R in Phase 1, we use sais [18], Yuta Mori's well-engineered implementation of SAIS [20], chosen for its consistent speed on many different inputs. For the computation of PLCP_R and LCP_R we use the Φ method [11], the fastest method to compute the LCP array that we are aware of. For NSV/PSV queries on the LCP array we use the data structure of Cánovas and Navarro [4], as it has low space overhead and is fast to query and initialize.
For the predecessor data structure, we use the following two-layered approach in practice (rather than [27]). We sample every bth head starting position and store these in an array. In a separate array we store a differential encoding of all head positions, at 32 bits per entry. Predecessor search for a position x proceeds by first binary searching in the sampled array to find the predecessor sample, at index i of that array. We then access the differentially encoded array starting at index ib and scan, summing values until the cumulative sum exceeds x, at which point we know the predecessor. This takes O(log(χ′/b) + b) time, where χ′ is the number of insert-heads.
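The two-layered scheme can be sketched as follows (class and method names are ours; for simplicity the gap codes are plain Python integers rather than packed 32-bit entries):

```python
# Two-layer predecessor structure: every b-th head position is sampled into
# a plain array for binary search; the full set is kept as differential
# (gap) codes that are scanned sequentially after the jump.
from bisect import bisect_right

class TwoLayerPred:
    def __init__(self, heads, b):
        """heads: sorted list of head positions; assumes queries x >= heads[0]."""
        self.b = b
        self.samples = heads[::b]                       # every b-th head
        self.gaps = [heads[0]] + [heads[i] - heads[i - 1]
                                  for i in range(1, len(heads))]

    def pred(self, x):
        i = bisect_right(self.samples, x) - 1           # binary-search layer
        pos, cur = i * self.b, self.samples[i]
        # sequential-scan layer: sum gaps while the next head is still <= x
        while pos + 1 < len(self.gaps) and cur + self.gaps[pos + 1] <= x:
            pos += 1
            cur += self.gaps[pos]
        return cur
```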
For Phase 4, when we have to sort the string of metacharacters representing the partially sorted heads, we use a SACA-K implementation that handles integer alphabets [13]. This choice was made because of the algorithm's low space requirement, in particular O(K), where K is the number of distinct ems-entries of insert-heads in H′ (note K = O(χ′)).

Experiments
We implemented our algorithm for computing the generalized suffix array in C++. Our prototype implementation, sacamats, is available at https://github.com/fmasillo/sacamats. The experiments were conducted on a laptop equipped with 16GB of DDR4-2400 RAM and an Intel(R) Core(TM) i5-8250U @ 3.4GHz with 6MB of cache. The operating system is Ubuntu 20.04 LTS, and the compiler is g++ version 9.4.0 with the options -std=c++17 -O3 -funroll-loops enabled.
In the following experiments, we compare sacamats to two well-known suffix array construction tools, both implemented by Yuta Mori [18,17]. The first, sais, is an implementation of the well-known SAIS algorithm by Nong et al. [20]; the second, divsufsort [7], is perhaps the most widely used tool for suffix array construction. We also compare against gsufsort [14], an extension of the SACA-K algorithm [19] to collections of strings, and against bigBWT [3], a tool computing the BWT and the suffix array that is designed specifically for highly repetitive data.

Datasets
For our tests, we used two publicly available datasets, one consisting of copies of human chromosome 19 from the 1000 Genomes Project [26], and another of copies of SARS-CoV-2 genomes taken from NCBI Datasets. For both datasets we selected subsets of different sizes (250MB, 500MB, 800MB, and 1GB) in order to study the scalability of our algorithm. More information can be found in Table 1.
We observe that on both datasets the number of i-heads is around 100x less than the input size, and on chr19 it is 8x less than the number of BWT runs.

Table 1: Datasets used in the experiments. Column 3 specifies the alphabet size σ, column 4 the number r of runs of the BWT, column 5 the number of S*-suffixes, and column 6 the number of insert-heads. In our experiments we use prefixes of each dataset up to 1GB; the last three columns refer to the 500MB prefix.

Results
Figures 3 and 4 display running-time information for both datasets. The line plots give a direct comparison of the different algorithms, whereas the stacked bar plots visualize how much time each phase of sacamats takes relative to the total running time (cp. Sec. 5).
These tools produce slightly different outputs: sais and divsufsort output the SA, gsufsort and sacamats the GSA, and bigBWT both the BWT and the SA. Because of these differences, if each result were written to disk, the running time would be affected accordingly by the size of the output. Therefore, we compare only the building time, i.e. the time spent constructing the SA and storing it in a single array in memory, without the time spent writing it to disk. For this reason, we made slight changes to the bigBWT code to enable storing the SA in main memory.
Looking at the line plots, one can see that sacamats is competitive in both scenarios: on sars-cov2 it is faster than all tools except bigBWT, while on chr19 it is the fastest method, especially on larger inputs, with divsufsort its main competitor. More precisely, on the first dataset (chr19) at 1GB of data, sacamats takes less than a third of the time of gsufsort, is 20% faster than sais, 12% faster than bigBWT, and 5% faster than divsufsort. On the second dataset (covid), sacamats again takes less than a third of the time of gsufsort, is 37% faster than divsufsort, 16% faster than sais, and 30% slower than bigBWT.
Shifting our attention to the stacked bar plots, Figure 3b indicates that a lot of time is spent in the first phase, which consists of the augmentation of R and the construction of various data structures for the augmented version of R. For DNA strings, the augmentation process will often not lengthen R at all, owing to the very restricted alphabet. If the application lends itself to it, one could precompute all the data structures listed in Phase 1, saving roughly 20 seconds of running time; in our experiment on chr19, sacamats would then clearly be the fastest method, further increasing its lead. Alternatively, the common practice of replacing N symbols with random nucleotide symbols would be another way to speed up this phase.
Finally, we comment on memory usage, which is highest for sacamats and gsufsort, at 8 bytes per input symbol, compared to 4 bytes per input symbol for divsufsort, sais, and bigBWT (for bigBWT, including the 4 bytes per input symbol of the SA when it is kept in memory, see above). We have not yet optimized for memory usage, and note that a semi-external implementation of our approach, in which buckets reside on disk, presents itself as an effective way to reduce main memory usage. In all phases, the actual working set (the amount of data active in main memory) is small, for the most part proportional to the number of i-heads, and other authors have shown that the inducing phase is amenable to external memory, too [10]. We leave these optimizations as future work.

Conclusion
We have presented a new algorithm for computing the generalized suffix array of a collection of highly similar strings. It is based on a compressed representation of the matching statistics and on efficient handling of string comparisons. Our experiments show that a relatively straightforward implementation of the new algorithm is competitive with the fastest existing suffix array construction algorithms on datasets of highly similar strings, as are common in computational biology applications. A byproduct of our suffix sorting algorithm is a heuristic for fast computation of the matching statistics of a collection of highly similar genomes w.r.t. a reference sequence, which is of independent interest. We also envisage uses for our compressed matching statistics (CMS) data structure beyond the present paper, for example as a tool for sparse suffix sorting, or for distributed suffix sorting, in which the CMS is distributed to all sorting nodes together with the lexicographic range of the suffixes that each particular node is responsible for sorting. From the CMS alone, each node can extract the positions of its suffixes and then sort them with the aid of the CMS.
We believe there is a great deal of room for further practical improvements, both through algorithm engineering and through parallelism. Interestingly, in an initial attempt along the second line, simply assigning S*-suffix buckets to one of four different sorting threads reduced the runtime significantly, for example from 122 to 89 seconds on the 1GB sars-cov2 dataset.
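A toy sketch of this bucket-per-thread idea follows (hypothetical code, not the sacamats implementation; CPU-bound Python threads are limited by the GIL, so a real speedup requires native threads). Since the relative order of the buckets is already fixed, the independently sorted buckets concatenate directly into the final order.

```python
# Hypothetical sketch: hand pre-partitioned suffix buckets to a pool of
# four worker threads and sort each bucket independently.
from concurrent.futures import ThreadPoolExecutor

def sort_buckets_parallel(buckets, workers=4):
    # pool.map preserves the order of the buckets, so the result can be
    # concatenated directly into the globally sorted sequence.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sorted, buckets))

# Example: three range-partitioned buckets, sorted in parallel.
buckets = [[3, 1, 2], [7, 5, 6], [9, 8]]
result = sort_buckets_parallel(buckets)  # [[1, 2, 3], [5, 6, 7], [8, 9]]
```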
Further studies will investigate how the size of the eCMS impacts the competitiveness of our tool.
LCP[1] = 0, and for i ≥ 2, LCP[i] is the length of the longest common prefix (lcp) of the two suffixes T[SA[i−1]..] and T[SA[i]..] (which are consecutive in lexicographic order). A variant of the LCP array is the permuted LCP array, PLCP, defined as PLCP[i] = LCP[ISA[i]], i.e. the lcp values are stored in text order rather than in SA order. We further define LCPsum(T) = Σ_{i=1}^{|T|} LCP[i]. LCPsum can be used as a measure of the repetitiveness of strings, since the number of distinct substrings of T equals (|T|² + |T|)/2 − LCPsum(T). All these arrays can be computed in time linear in |T|, see e.g.
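As a small illustration of these definitions, the following sketch (not part of the paper's implementation) computes SA, LCP, PLCP and LCPsum naively on a toy string; note the arrays are 0-based here, whereas the text indexes LCP from 1.

```python
# Naive, quadratic-time constructions; fine for illustrating the definitions.

def suffix_array(t):
    # Sort suffix start positions by the suffixes they denote.
    return sorted(range(len(t)), key=lambda i: t[i:])

def lcp_array(t, sa):
    # LCP[r] = lcp of the suffixes at ranks r-1 and r; LCP[0] = 0 by convention.
    def lcp(a, b):
        k = 0
        while a + k < len(t) and b + k < len(t) and t[a + k] == t[b + k]:
            k += 1
        return k
    return [0] + [lcp(sa[r - 1], sa[r]) for r in range(1, len(sa))]

t = "banana"
sa = suffix_array(t)          # [5, 3, 1, 0, 4, 2]
lcp = lcp_array(t, sa)        # [0, 1, 3, 0, 0, 2]
isa = [0] * len(t)
for r, i in enumerate(sa):
    isa[i] = r
plcp = [lcp[isa[i]] for i in range(len(t))]  # lcp values in text order
lcpsum = sum(lcp)
# Number of distinct substrings of t equals (|t|^2 + |t|)/2 - LCPsum(t).
distinct = (len(t) ** 2 + len(t)) // 2 - lcpsum
```

For t = banana, LCPsum is 6 and the identity yields 15 distinct substrings, matching a direct enumeration.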

Figure 1 Example for the matching statistics and the data for the CMS and the eCMS. In the first two rows, we give MS of S w.r.t. R, where MS[i] = (p_i, ℓ_i). In row 3, we mark the heads (for the CMS). In row 4, we give the position q_i defined by ip(i), i.e. q_i = SA_R[ip(i)], where ip(i) is the insert-point of suffix S[i..] in the suffix array of R. In row 5, we mark the insert-heads (for the eCMS).
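For concreteness, a naive quadratic-time computation of the matching statistics can be sketched as follows (a hypothetical illustration of the definition, not the fast heuristic of the paper): MS[i] = (p_i, ℓ_i), where ℓ_i is the length of the longest prefix of S[i..] that occurs in R, and p_i is a position of one such occurrence.

```python
# Naive O(|S| * |R|) matching statistics of S with respect to R.
def matching_statistics(S, R):
    ms = []
    for i in range(len(S)):
        best_pos, best_len = -1, 0
        for p in range(len(R)):
            # Length of the match between S[i..] and R[p..].
            k = 0
            while i + k < len(S) and p + k < len(R) and S[i + k] == R[p + k]:
                k += 1
            if k > best_len:
                best_pos, best_len = p, k
        ms.append((best_pos, best_len))
    return ms
```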

Figure 2 Details of the computation of the matching statistics from Figure 1. We highlight in blue the matching factors for the indices i = 9 (matching factor AT, mismatch character T) and i = 11 (matching factor TGATGG, mismatch character $). The arrows represent the insert-points.
and the values of ems(i) can then be computed in O(1) time (Lemma 8).
y] and e′ = ISA_R[SA_R[e] + 1] ∈ [x..y]. To finish computing SA_R[x..y] from SA_R[s′..e′] there are two cases to consider. Firstly, if s′ = e′ and |Y| > LCP_R[s′], then SA_R[s′] is the only occurrence of Y and we are done (the Y-interval is a singleton). Otherwise, we compute SA_R[x..y] using NSV/PSV queries on LCP_R, in particular SA_R[x..y] = SA_R[PSV(LCP_R, s′)..NSV(LCP_R, e′)].
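The extension step can be pictured with a small sketch (illustrative only; linear scans stand in here for the constant-time PSV/NSV queries used above): given one rank r inside the Y-interval and ℓ = |Y|, the full interval is obtained by extending in both directions while the lcp with the neighbouring rank is at least ℓ.

```python
# Extend a known occurrence at SA rank r of a pattern of length ell to
# the full pattern interval [x..y]. Here lcp is 0-based with lcp[0] = 0,
# and lcp[i] is the lcp of the suffixes at ranks i-1 and i.
def pattern_interval(lcp, r, ell):
    x = r
    while x > 0 and lcp[x] >= ell:          # PSV-style scan to the left
        x -= 1
    y = r
    while y + 1 < len(lcp) and lcp[y + 1] >= ell:  # NSV-style scan right
        y += 1
    return x, y

# Example with t = "banana": SA = [5, 3, 1, 0, 4, 2], LCP = [0, 1, 3, 0, 0, 2].
# "ana" (ell = 3) occurs at rank 2; its full interval is ranks [1..2].
interval = pattern_interval([0, 1, 3, 0, 0, 2], 2, 3)  # (1, 2)
```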

Input: string collection C, reference string R. Output: the GSA of C.
Phase 1 - Augmenting and constructing data structures on R: preprocess R ("augmenting", see Sec. 6); compute the data structures SA_R, ISA_R, PLCP_R, LCP_R, and the RMQ data structure for PSV- and NSV-queries on LCP_R.
Phase 2 - Computing the eCMS: compute the eCMS of C, as described in Sec. 3.2.
Phase 3 - Bucketing: identify the S*-suffixes in C via a backward linear scan of C; bucket S*-suffixes i according to ip(i), computed using the eCMS (Lemma 8).
Phase 4 - Sorting the insert-heads: bucket the insert-heads according to their insert-point; for each bucket B, partially sort B according to Lemmas 10 and 11; rename insert-heads according to the lexicographic rank of the substring stretching up to the mismatch character (metacharacters are S_d[j..j + ℓ_{d,j}]); generate a new string C′ as the concatenation of these metacharacters; compute the suffix array of C′; map back to the corresponding suffixes of C.
Phase 5 - Fully sorting the S*-suffixes: for each bucket B from Phase 3, sort B according to Lemmas 11 and 13.
Phase 6 - Inducing the GSA: with two scans, induce the L-suffixes, then induce the S-suffixes.
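The backward classification scan at the start of Phase 3 is the standard SAIS-style one. A minimal sketch (not the sacamats code; for simplicity, a single string with the convention that the last suffix is type S):

```python
# Classify each suffix of t as L or S with one backward scan, and collect
# the S*-positions: S-suffixes whose left neighbour is an L-suffix.
def s_star_positions(t):
    n = len(t)
    typ = ['S'] * n                       # last suffix is S by convention
    for i in range(n - 2, -1, -1):
        if t[i] < t[i + 1] or (t[i] == t[i + 1] and typ[i + 1] == 'S'):
            typ[i] = 'S'
        else:
            typ[i] = 'L'
    return [i for i in range(1, n) if typ[i] == 'S' and typ[i - 1] == 'L']

# Example: for "banana" the S*-suffixes start at positions 1, 3 and 5.
positions = s_star_positions("banana")  # [1, 3, 5]
```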

(a) Running time comparison. (b) Phases breakdown; see Sec. 5 for details.

Figure 3 Experiments on different subsets of copies of Human Chromosome 19.

Figure 4 Experiments on different subsets of SARS-CoV2 genomes.