Generalized enhanced suffix array construction in external memory

Background Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory. Results In this article we present and analyze \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathsf {eGSA}$$\end{document}eGSA [introduced in CPM (External memory generalized suffix and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathsf {LCP}$$\end{document}LCP arrays construction. In: Proceedings of CPM. pp 201–10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathsf {eGSA}$$\end{document}eGSA showed a competitive performance when compared to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathsf {eSAIS}$$\end{document}eSAIS and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathsf {SAscan}$$\end{document}SAscan, which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm. Conclusions The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time.


Introduction
Suffix arrays [40] (also known as PAT arrays [23]) may be used for the solution of string processing problems in several areas, including pattern matching, data compression and information retrieval [24,39,47]. Combining a suffix array with the longest common prefix (LCP) array and with the Burrows-Wheeler transform (BWT) [12] provides a data structure, an enhanced suffix array (ESA) [2], that enables solving many string processing problems in optimal time and space.
Using such structures in the solution of problems involving strings is usually done in two steps: the structure is first constructed and then it is queried. This article is about the construction of generalized enhanced suffix arrays for a collection of strings using external memory. This is motivated by the rising number of applications that deal with huge sets of strings, such as those in Bioinformatics and Internet searching. Moreover, recent advancements in non-volatile storage technologies have substantially narrowed the gap between internal and external memory access times, making the querying of external suffix arrays significantly faster.
Different algorithms have been proposed for internal memory suffix array construction (see [17,49]), including algorithms with linear running time [30,33,46]. Gonnet et al. [23] proposed the first external memory algorithm for constructing suffix arrays. Later, Crauser and Ferragina [14] adapted internal memory algorithms to work in external memory. Dementiev et al. [16] observed that these algorithms do not scale well and presented a pipelined version of the internal memory algorithm DC3 [30] to external memory. Nong et al. [44,45] adapted the internal memory algorithms SA-DS and SA-IS [46] to external memory, and Liu et al. [34] presented an enhanced version of SA-IS to external memory. Kärkkäinen and Kempa [25] presented the SAscan algorithm, improving on the earlier proposal by Gonnet et al. [23], and later, Kärkkäinen et al. presented a parallel external version of SAscan algorithm [27]. BWT can be either obtained from the suffix array or constructed directly in internal memory in linear time [48]. Ferragina et al. [18] proposed an external memory algorithm to construct the BWT for a single string, and Bauer et al. [5] presented external memory algorithms to compute and decode the BWT for a string collection.
LCP construction in internal memory is also possible in linear time during the suffix array construction [19,35] or afterwards, given the suffix array [29,31,41] or the BWT as input [7,22]. Kärkkäinen and Kempa [26] presented the LCPscan, an external memory algorithm to construct LCP arrays given the suffix array as input, and Bauer et al. [6] proposed the extLCP algorithm to construct both BWT and LCP arrays for large collections of equally sized strings in external memory, and later, Cox et al. [13] presented an extended version of extLCP to deal with strings with different sizes.
The suffix and the LCP arrays are constructed together in external memory by eSAIS, proposed by Bingmann et al. [10], one of the most efficient external memory algorithm to date. There exists alternatives to compute suffix and LCP arrays in parallel [20] and using small space [37,42].
In this article we present and analyze the algorithm eGSA (introduced in [38]) in depth. To our knowledge this is the first algorithm to construct generalized enhanced suffix arrays in external memory. We compared eGSA with the most efficient related algorithms in the literature, eSAIS [10] and SAscan [25]. Although eSAIS and SAscan can easily be applied to the concatenation of a string collection, our method is shown to run faster in practice. In addition to the LCP array, our method also constructs the BWT for the collection. eGSA uses a heap and a combination of optimization procedures that are shown to be very effective in practice. The optimizing strategies that we propose in this article are based on nice properties of strings and their relation with the LCP array, and are applied across the nodes of a heap.
Theoreticallly, eSAIS runs in O(n log M/B (n/B)) time and O((n/B)log M/B (n/B)) I/Os, where n is the length of the input string, B is the disk block size and M is the RAM size. SAscan runs in O((n 2 /M) log(2 + log σ / log log n)) time and O(n 2 log σ/(MB log n) + (n/B) log M/B (n/B)) I/ Os. Our algorithm runs in O((N log m)maxlcp) time and O(N log m|T ℓ |) I/Os, where N is the sum of the m string lengths in the input, maxlcp is the length of the longest common prefix between suffixes of the input strings, |T ℓ | is the length of the longest string in the collection.
The rest of the article is organized as follows. "Background" section introduces concepts and notation, "eGSA" section describes the algorithm and presents a theoretical analysis, "Performance evaluation" section details the experiments, results and investigates limitations of the algorithm. "Conclusions" section concludes the article.

Background
Let be an ordered alphabet of symbols. We denote the set of every string of symbols in by * and the concatenation of strings or symbols by the dot operator ( · ). Let $ be a symbol not in that precedes every symbol in with respect to the alphabetical order. We define $ = {T · $|T ∈ * }. We use the symbol < for the lexicographic order relation between strings.
The ith symbol in a string T of length n is denoted A suffix array for a string T ∈ $ of size n, denoted SA, is an array of integers SA = [i 1 , i 2 , . . . , i n ] such that T [i 1 , n] < T [i 2 , n] < · · · < T [i n , n]. Thus, a suffix array provides the lexicographic order for all suffixes of a string.
Let pos(T [k, n]) denote the mapping of suffix T[k, n] to its position in SA, i.e. the reverse suffix array, and let suff (j) denote the mapping of position j of SA to the suffix represented at j, namely T [SA[j], n].
Let lcp(S, T ) be the length of the longest common prefix of two strings S and T in $ . The LCP array for T is an array of integers such that and The BWT is a reversible transformation obtained through cyclic rotations of a string, and results in another string that is easier to compress [12]. The BWT has a close relationship to the suffix array and can be trivially obtained from it. Let the BWT of a string T be denoted BWT and defined as BWT We will refer to the structure formed by SA, LCP, BWT as an enhanced suffix array, denoted ESA [2]. Table 1 shows the enhanced suffix array for T 1 = GATAGA$ and for T 2 = TAGAGA$.
Let T be a collection of m strings {T 1 , . . . , T m } from $ having lengths n 1 , . . . , n m . We extend the lexicographic relation among strings to deal with unit length suffixes of T : let < be augmented for pairs of suffixes of length 1 of strings in T by The generalized suffix array of T , denoted GSA, is an array of pairs of integers (a, b) that specifies the lexicographic order of all suffixes T a [b, n a ] of strings in T . We denote the first component of The generalized LCP of T is defined as and LCP [1] = 0, and the generalized BWT of T is defined as The generalized suffix array of T together with its corresponding LCP array and BWT will be called generalized enhanced suffix array and denoted GESA. Table 2 shows the generalized enhanced suffix array for T = {T 1 , T 2 }, where T 1 = GATAGA$ and T 2 = TAGAGA$.

eGSA
The External Generalized Enhanced Suffix Array Construction Algorithm (eGSA) resembles a two-phase multiway merge-sort [32]. Algorithm 1 illustrates eGSA without the otimizing strategies introduced in Phase 2. Phase 1 builds the enhanced suffix arrays for the input strings and Phase 2 merges the respective arrays using an improved string comparison method on memory buffers. We detail each phase below.

Phase 1: internal sorting
The input for eGSA is a collection T of m strings T = {T 1 , . . . , T m } having lengths n 1 , . . . , n m with total length N and stored in external memory.
In Phase 1 the suffix array SA i , the LCP array LCP i , the Burrows-Wheeler transform BWT i and the auxiliary array PREFIX i are built for each T i and stored in external memory (lines 1-9 of Algorithm 1). Any internal or external memory suffix and LCP array construction algorithm may  The suff column illustrates, in bold, prefixes shared between consecutive positions in the array be used by eGSA to build SA i and LCP i (line 2). As they are constructed, both BWT i (line 5) and PREFIX i (lines 6) can be computed and written sequentially to external memory with no need to store in internal memory.
PREFIX arrays are used to reduce external memory accesses in Phase 2: starting from a position j and concatenating PREFIX i successively for adjacent preceding positions will render a prefix of T i [j, n i ], up to a position with lcp equal to zero. In other words, PREFIX i [j] will store symbols from the string, such that the contents of PREFIX i [j] concatenated to parts of preceding positions of PREFIX is equal to a starting portion of the suffix at position SA i [j]. More formally, let p be a given integer constant.
As a boundary condition, whenever the length of T i is exceeded, sufficient $ symbols are added to the right of PREFIX i [j]. An example for the ESAs from Table 1 with p = 3 is shown in Table 3. Notice that it is possible to recall the strings with the aid of PREFIX. Our construction is similar to the left-justified approach by Sinha et al. [50] and relates to the work of Barsky et al. [4].
We will denote a tuple of elements in the same position of an ESA augmented with the PREFIX array by and we will use a dot to refer to a component, for instance ESA i [j].SA i . The product of Phase 1 is ESA 1 , . . . , ESA m .

Phase 2: external merging
Phase 2 merges the enhanced suffix arrays computed in Phase 1 to obtain a GESA for T .
Each ESA is partitioned into consecutive blocks having e consecutive elements, except perhaps for the last block. For each ESA i the algorithm uses two internal memory buffers: a string buffer S i , with capacity for at most s symbols of T i , and an enhanced-array buffer E i , large enough to store a block of ESA i . It also uses two other buffers: an output buffer Buffer out for at most o elements of the GESA, and an induced buffer I, of size | | × c pair of integers, which stores data needed by the inducing strategy discussed below. The values of s, e, o and c are constants that determine the amount of internal memory used in this phase.
The overall strategy used in Phase 2 (lines 10-20 of Algorithm 1) is the following. The first block of each ESA i is loaded into the respective enhanced-array buffer E i (line 11). Then the heading element of each E i is inserted into a lexicographic minimum binary heap (line 12). Assume that the smallest suffix in the heap originates from E k (line 15). Then the suffix is moved to the output buffer (line 16), which is written to disk as it gets full (line [17][18][19], and the heap is filled with the next element in the buffer E k . Recall that during such comparisons the suffixes themselves are stored in external memory. Comparing suffixes in the heap may then require many random external memory accesses. To reduce external memory accesses, we propose an enhanced comparison method composed by three strategies: (a) prefix assembly, (b) lcp comparison, and (c) suffix induction.

Prefix assembly
Prefix assembly uses PREFIX arrays to retrieve portions of strings with no external memory accesses. These characters are those more likely to be needed to compare suffixes. Let j be the index of the smallest element in the enhanced-array buffer , h j−1 + p), h 0 = 0, and # is an end-ofbuffer marker not in . Thus, if a string comparison does not involve more than h j + p symbols, an external memory access is not necessary. Otherwise # is reached and a portion of T i must be retrieved from the external memory. However, the part of T i that can be reconstructed from PREFIX is often long enough such that the first distinct characters can be accessed without I/O operations. In addition, the string buffer can easily and without great costs be adjusted to accommodate the relevant parts of PREFIX, i.e. h j + p. Algorithm 2 illustrates prefix assembling applied to reconstruct the initial part of

Table 3 Prefix array examples
The suff column illustrates, in bold, the prefixes recovered without external access during the merging phase of our algorithm, as detailed in "Phase 2: external merging" section Column suff in Table 3 illustrates the prefixes recovered by prefix assembly in bold. For example, for ESA 1 shown in Table 4, when j = 5 then h 5 = 0 and, since LCP 1 [5] = 0, S 1 stores GA$. When j = 6 then h 6 = min(LCP 1 [6],

LCP comparison
lcp values can be used to speed up suffix comparisons [9,43] and to avoid external memory accesses in heap insertions. The following lemma formalizes the idea. The proof is simple, based on the cases illustrated in Fig. 1, and will be omitted.
Let X, Y and Z be nodes in the binary heap storing , respectively. Let X, Y and Z be also the suffixes stored by such heap nodes. Suppose that node X is the parent of Y and Z. Because X < Y and Assume that the heap also stores lcp values between a node and its children and between a node and its sibling.
As X is removed from the heap, E a [i] is moved to the output buffer and X is replaced by another node W storing E a [i + 1]. The order of W with respect to its children Y and Z can be determined without character comparisons when case 1 or case 2 of Lemma 1 applies, and if case 3 applies then the character comparison can be started In the same way the order between Y and Z can be determined using Lemma 1. Algorithm 3 illustrates this procedure to compare the nodes W, Y and Z in the heap.
lcp values between nodes in the heap are updated as nodes are compared and swapped. Suppose that node W is swapped with Y (meaning Y < W and Y < Z). The lcp of W with respect to its new children are also determined using Lemma 1, taking the minimum lcp between two suffixes (in cases 1 and 2) or through direct character comparisons (case 3). Hence, by using lcp values many direct comparisons of strings that are in external memory are avoided.  For instance, consider merging ESA 1 and ESA 2 in Table 1. First, comparing the elements ESA 1 [4] and ESA 2 [3] we conclude that suff 2 (3) = AGA$ is less than suff 1 (4) = ATAGA$. The next comparison involves ESA 1 [4] and ESA 2 [4]. As already stated, without comparing any symbols we see that lcp(suff 2 (3), suff 2 (4)) > lcp(suff 2 (3), suff 1 (4)) and that suff 2 (4) = AGAGA$ is less than suff 1 (4) = ATAGA$.

Suffix induction
The induced sorting principle corresponds to deduce the order of unsorted suffixes from already sorted suffixes. This strategy is used by many suffix array construction algorithms [49]. We apply an induced sorting approach that relies on the following lemma. Let a suffix starting with a symbol α be denoted α-suffix and let suff T be the set of all suffixes of strings in T .
Lemma 2 can be used for sorting the suffixes of a string T of length n as follows. Let an α-bucket be a block of a partition of SA that contains only α-suffixes. suff T is initialized with every suffix of T and an empty bucket for each symbol in is created. While suff T is not empty, the smallest suffix T [j, n] = α · T [j + 1, n] in suff T is moved to the leftmost available position in the α-bucket and, if α < β then T [j − 1, n] = β · T [j, n] is added to the leftmost available position in the β-bucket (it is induced). The induced suffix T [j − 1, n] cannot be removed from suff T yet because it may induce T [j − 2, n i ] as well. When a suffix that is already in a bucket is also the smallest in suff T , the suffix itself and those that succeed it in the bucket are used to induce another suffix and are removed from suff T at once. Note that if α > β then the suffix T [j − 1, n i ] was already sorted and if α = β then reading induced suffixes from the β-bucket can cause the induction of already induced suffixes. So no induction is done when α ≥ β.
This approach is not efficient to sort the suffixes of a single string T, since it is often necessary to find a smallest suffix. But in merging previously sorted suffixes the smallest one can be determined efficiently using the heap. Suppose that E i [k] is at the root of the heap. Then . This later test may be performed using BWT i and, as a consequence, to determine whether T i [j − 1, n i ] can be induced or not.
Induced suffixes are added to the induced buffer I, partitioned into buckets I α , one for each α ∈ �. When an α -suffix from string T i is induced, the value i is inserted into the first available position of I α , which is written to an external memory file F α as it gets full. When the smallest α-suffix is at the root of the heap, F α is read sequentially to retrieve string indexes. Each string index i indicates that the smallest suffix in E i may be written to the output directly, since such suffix has been induced, bypassing operations in the heap and saving many comparisons. When every index in F α has been processed the heap must be reconstructed. Algorithm 4 illustrates Phase 2 (see Algorithm 1) augmented for suffix induction. Whenever the first suffix starting with α = T a [b] is returned from the heap, eGSA induces the output buffer the suffixes in F α .
lcp values for induced suffixes must also be induced, since induced suffixes are not compared in the heap. Suppose that T a [i, n a ] induces an α-suffix and suppose that T b [j, n b ] induces the next α-suffix. Then Therefore, when a suffix T i [j, n i ] is induced in the second phase, its corresponding LCP is also induced. As induced suffixes may also induce further suffixes, the corresponding LCP must be stored in the induced buffer I α and in the respective file as well. As induced suffixes are recovered from external memory, LCP values are recovered to update the rmq computation.
For instance, suppose that T 1 [6, n 1 ] = A$ is the smallest suffix in the heap during the merge of ESA 1 and ESA 2 in Table 1. Because ESA 1 [2].BWT = G > A, Using prefix assembly together with induction requires additional care. Since induced suffixes are not compared in the heap, they do not participate in the prefix assembly. Thus during the evaluation of PREFIX in Phase 1, h j must be equal to 0 for every last α-suffix that will be induced, then the prefix of the first non-induced α-suffix will start at its initial position. To this end, we set 0 as the LCP[pos(T i [j, n i ])] of every suffix T i [j, n i ] that will be induced, i.e. when T i [j] > T i [j + 1]. Recall that all such lcp values will be also induced.

Theoretical costs
Phase 1 of eGSA is dominated by the algorithms used to construct SA and LCP. The other columns of the generalized suffix array are evaluated when the output is written to disk, using constant time and memory per item. The construction of SA and LCP may be done in linear time and space [29,46]. Thus, for m input strings with total length N and T ℓ the longest string, Phase 1 is O(m|T ℓ |) time plus O(N) I/O operations using O(|T ℓ |) memory.
In Phase 2, the number of node swaps in the heap is bounded by N log m. Each node swap requires comparing a number of characters that is at most the maximum value of lcp for T (maxlcp). The time cost of this phase is then O((N log m)maxlcp). I/O operations in Phase 2 include loading portions of suffix arrays and of strings from disk, and writing output buffers to disk. Suffix arrays are loaded in blocks to the enhanced-array buffers. In the worst case each comparison in the heap will trigger a character comparison, and the string buffers will be loaded when exhausted. Provided that the string buffer is at least as large as maxlcp, each suffix will cause at most one I/O operation and the worst case for the number of string buffer load operations is O(N). The number of I/O operations on enhanced-array and output buffers is limited by N divided by the respective buffer sizes. Then the number of I/O operations in Phase 2 is bounded by O(N). The memory usage in Phase 2 is bounded by the sum of buffer sizes, which can be tailored as necessary.
Such bounds for I/O operations are prohibitive, but it is much lower in practice due to the optimizing strategies, as shown in the next sections. An easy to devise limitation of eGSA is the case of datasets whose strings are large and highly repetitive, for instance, a dataset composed by human genomes of different individuals. For these datasets the practical performance will approach the theoretical bound. Another limitation is when maxlcp is larger than the string buffer size, when the number of I/O operations is as bad as O(N log m(|T ℓ |/s)), where s is the string buffer size.

Performance evaluation
We used four real datasets of different domains, including DNA and protein sequences, and natural language texts as described in Table 6. The table includes the total size of each dataset in GB, the number of strings, the average string length, and the average and maximum lcp values, which provide an approximation of suffix sorting difficulty [16].
The experiments were conducted on Debian GNU/ Linux 6.0.3/64 bits operating system using an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30 GHz processor 8 MB cache, with 32 GB of internal memory and a 2.0 TB

Table 5 Prefix assembly and inducing suffixes
Symbols in bold illustrate the substring of suffix T 1 [SA [3], n 1 ] stored in PREFIX SATA hard disk with 7200 RPM and 64 MB cache (Seagate Desktop HDD ST2000DM001). Our algorithm was implemented in ANSI/C and compiled by GNU GCC version 4.6.3, with optimizing option -O3. The source code is freely available at https://github.com/felipelouza/ egsa/.
In Phase 1 we partitioned the collection of strings T into k groups, such that when the strings in each group are concatenated the resulting string T cat may be given to internal memory SA and LCP construction algorithms. After concatenating the strings in a group a new terminator symbol # that is smaller than $ is added to the end of T cat . For the first phase we used gSACA-K [36] combined with -algorithm [29]. gSACA-K guarantees that the order of equal suffixes from different strings in a group will be defined by the rank of their strings in T . Given the SA of T cat , we compute GSA for the string group using an additional integer array DA of size The -algorithm was adapted to stop the comparison in T cat when it reaches $ symbols, thus correctly evaluating the LCP between suffixes in the same group. Together, these algorithms use 13 × |T cat | bytes. In this experiments, when T cat is composed by only one string T ℓ and 13 × |T cat | is larger than the available internal memory, the algorithm truncates T ℓ , such that 13 × |T cat | fits in memory. The sizes reported in Table 6 refer to the datasets after truncations, that happened only with dna.
In Phase 2 we used p = 10 for the prefix array size, which provided a good tradeoff between time and disk usage space, as shown in "eGSA internals" section. Each buffer S i were set to use 20 KB of internal memory, whereas all buffers B, Buffer out and I were set to use 1 GB, 64 MB and 16 MB, respectively, in total. We remark that eGSA uses 1 byte to store each character in memory. The output produced by eGSA was validated using a trivial checking algorithm.
In "Relative performance" section we investigate the behavior of eGSA with respect to eSAIS [10] and SAscan [27]. In "eGSA internals" section we evaluate eGSA in detail, showing the influence of each phase and of the improving strategies used in Phase 2 on the total running time. In "Limitations" section we investigate limitations of our algorithm related to the effect of disk cache managed by the operating system when the internal memory (RAM) size is restricted at boot time.

Relative performance
To assess the performance of eGSA we compared it to eSAIS [11], which is the fastest algorithm to date to compute both suffix and LCP arrays in external memory. We also compared eGSA to SAscan [28], which computes only the suffix array with small peak disk usage. We configured the algorithms to use the same disk for input and output. We are aware of the existence of the algorithms by Bauer et al. [5,6] and by Cox et al. [13] that aim at indexing collections of small strings in external memory. However, we did not consider comparing them with eGSA because they were designed to solve a different problem, namely building the BWT and the LCP array with small memory footprint. Moreover, a comparison in the article [13] have shown that eGSA is faster and uses more space in external memory.
Although eSAIS and SAscan are aimed at indexing only one string, we can concatenate all strings and use eSAIS or SAscan to construct the generalized suffix arrays. All strings in T were concatenated and a final terminator # was added, such that # < $. This concatenation strategy will not guarantee that equal suffixes will be sorted by string rank and the values in LCP may be larger than the actual lcp of consecutive suffixes in GESA,   but will not impose the growth of the alphabet size and still allows eSAIS and SAscan to use 1 byte per input character. We remark that the results presented in this section depends on the RAM size available in the experiments, that is, 32 GB. As we show in "Limitations" section, the performance and efficiency of eGSA degrades as the total RAM size is reduced. Figure 2 shows the running time in microseconds per input byte and the efficiency of eGSA, eSAIS and SAscan. Efficiency is the proportion of time for which the CPU is busy, not waiting for I/O. Except for dna, eSAIS was interrupted for datasets with more than 12 GB due to the large amount of time to process these instances. For example, eSAIS took 9 days to run on enwiki with 12 GB. The experiments took about 70 days of computing to finish.

Running time and efficiency
The amount of internal memory used by the algorithms is an input parameter. We configured them to use 2 GB. Although the comparison is not totally fair because eSAIS and SAscan were not designed for multiple strings, eGSA have outperformed eSAIS and presented a competitive performance compared to SAscan, which computes only the SA. Moreover, eGSA can also construct the generalized BWT of the collection T with no additional cost except by the output time.
The long running times of eSAIS prevented the analysis of its efficiency trend. In the extreme case, enwiki with 12 GB, the running time of eSAIS is almost 35 times larger than the time spent by eGSA. The running times of eGSA and SAscan are very close, with larger differences only for the dna dataset. SAscan presents the best efficiency, which is mostly unaffected by the size of the dataset. The efficiency of eGSA is comparable to SAscan for small datasets and better than the efficiency of eSAIS. The efficiency of eGSA drops with the size of the dataset. For larger datasets it becomes apparent that the efficiency of eGSA is strongly affected by the effect of the disk cache managed by the operating system, since the size of the available internal memory decreases as the dataset increases (we evaluate this issue in "Limitations" section).

I/O volume and peak disk usage
The I/O volume (in bytes per input byte) and the peak disk usage (in GB) of each algorithm are reported in Fig. 3. eGSA makes a larger volume of I/O transfer. In the extreme case, protein with 12 GB, eGSA transfer more than 6 times data than eSAIS and eGSA transfers 150 times more data than SAscan. eGSA uses 39n bytes (8n bytes for GSA, 4n bytes for LCP, and 27n bytes for auxiliary structures) plus by the size of the temporary files used to store induced suffixes. As can be seen in Fig. 5, the average number of induced suffixes is about 43%, and is almost constant for all dataset sizes. eSAIS uses 54n bytes to compute SA and LCP arrays, whereas SAscan uses 7.5n bytes to compute SA. Overall, the peak disk usage is much smaller for SAscan.
Although eSAIS and SAscan do not take care of the peculiarities of a generalized suffix array, eGSA still shows faster or comparable running times. Therefore, eGSA is a good alternative for the construction of the generalized enhanced suffix array in external memory.

eGSA internals
We have evaluated the behavior of eGSA in terms of the performance of each phase and the effect of each heap strategy used in Phase 2. Figure 4 shows the percentage of time spent by each phase of eGSA and its efficiency. We can see that the percentage of the time spent by Phase 2 increases as the dataset increases and dominates the time of eGSA. We can see that the efficiency of Phase 1 is almost constant and the efficiency of Phase 2 is better for small alphabets (dna). Figure 5 shows the percentage of induced suffixes and the number of partitions created by eGSA in the preprocessing step. In the average, 42% of the suffixes were induced. This indicates that the algorithm is avoiding many string comparisons. The number of partitions grows linearly with the dataset size, and the figure shows Phase 1 using less than 2 GB.

Prefix array size
We have analyzed the effect of the value of the parameter p on the running time. We used the first 8 GB of each dataset for these experiments. Recall that p is the number of symbols in each position of PREFIX arrays and has a major impact on external memory usage and access. As the value of p grows the external memory access decreases but the peak disk space usage increases. We evaluated some values for p with fixed memory usage, that is, increasing p implied an reduction of the number of elements in the partition buffers B i , guaranteeing that all versions use the same amount of internal memory. Table 7 shows the effect of p on the total running time and the efficiency of eGSA, for p varying between 0 and 25. The value p = 10 resulted in a good tradeoff between the peak disk space used by the algorithm and the running time.

Effect of optimizations
In order to evaluate the effect of strategies that help to avoid character comparisons in eGSA, namely (a) prefix Louza et al. Algorithms Mol Biol (2017) Table 8. We can see that the complete version of eGSA (column {a, b, c}) was the best in the majority of the cases. The dataset gutenberg.8GB was faster using {a, b} with a difference of about 10%. Comparing with the ∅ version, optimization strategies reduced the time by a factor of 1.9−2.22. Note that all strategies individually improved performance with respect to the ∅ version. Thus, we may conclude that the use of heap strategies heavily improves the performance of eGSA. As a final remark, we note that, except for dna.8GB, the ∅ version reduced the time by a factor of up to 6.10 compared with eSAIS (Fig. 2).

Limitations
We have investigated the effect of the disk caching in the performance of eGSA, eSAIS and SAscan. We restricted both the internal memory available to our algorithm (2 GB) as well as the total system memory at boot time (8−24 GB). Table 9 shows the running time and the efficiency of each algorithm set to use 2 GB of internal memory to process the first 8 GB of the dataset dna in a machine whose total RAM was restricted to 24, 16, 12, 10 and 8 GB at boot time. The values in the last column (32 GB) are the same presented in "Relative performance" and "eGSA internals" sections. We also tested the datasets protein.8GB, gutenberg.8GB and enwiki.8GB and we obtained very close results, which were omitted.
The running time and efficiency of each algorithm degrade as the total RAM size reduces. This happens as an effect of the reduction of free memory available for disk caching managed by the operating system, which reduces the number of disk accesses. Comparing to the original setting (32 GB), the running time of eGSA was about 25 times larger with the RAM size restricted to 8 GB, whereas for eSAIS and SAscan their running times were about 1.3 larger.
For eGSA, disk caching reduces disk accesses to the input strings as suffixes are moved along the heap, which displays a "random" pattern. This is where the worst case complexity stated in "Theoretical costs" section shows its claws. On the other hand, eGSA takes advantage of the disk cache system, which might be a favorable aspect in practical setups. Recall that the total size of the output data structure is 12 times the dataset size, which is 96 GB for the dataset dna.8GB.
The experiments show that eGSA depends on the availability of a large amount of free RAM to be efficient, which can be seen as a feature of a semi-external algorithm [8]. However, eGSA works purely in external memory. We believe that the optimizing strategies applied on the heap are interesting per se, and, as the disk access pattern is not actually random, may be there is still room for improving the overall strategy based on a heap, what could improve the performance of eGSA with less support of disk caching.  Table 8 Effect of each heap strategies on time

Conclusions
In this article we proposed eGSA, which is the first external memory algorithm to construct generalized suffix arrays enhanced with the longest common prefix array (LCP) and the Burrows-Wheeler transform (BWT) for a collection of strings. The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations. Compared to eSAIS and SAscan, eGSA showed a competitive performance. Moreover, our algorithm can also constructs the generalized BWT of a collection of strings with no additional cost except by the output time.
Another advantage of eGSA is that it may be employed to build generalized enhanced suffix arrays from arrays that have already been computed individually for strings in a dataset. Moreover, eGSA may be used to construct the core data structures used by LOF-SA [50] and ROSA search algorithms [21], or to build generalized suffix trees in external memory [4]. Furthermore, it may be applied to solve the longest common substring problem [1,3] and to construct the Longest Previous Factor array, which is used in text compression and for detecting motifs and repeats [15].