 Research
 Open Access
External memory BWT and LCP computation for sequence collections with applications
 Lavinia Egidi^{1},
 Felipe A. Louza^{2}Email authorView ORCID ID profile,
 Giovanni Manzini^{1, 3} and
 Guilherme P. Telles^{4}
https://doi.org/10.1186/s1301501901400
© The Author(s) 2019
 Received: 6 November 2018
 Accepted: 23 February 2019
 Published: 8 March 2019
Abstract
Background
Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM.
Results
We propose a spaceefficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semiexternal memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semiexternal memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scanbased, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs.
Conclusions
We prove that our algorithm performs \({\mathcal {O}}(n\, \mathsf {maxlcp})\) sequential I/Os, where n is the total length of the collection and \(\mathsf {maxlcp}\) is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
Keywords
 Burrows–Wheeler Transform
 Longest common prefix array
 Maximal repeats
 All pairs suffix–prefix overlaps
 Succinct de Bruijn graph
 External memory algorithms
Introduction
A fundamental problem in bioinformatics is the ability to efficiently search into the billions of DNA sequences produced by NGS studies. The Burrows Wheeler transform (BWT) [1] is a well known structure which is the starting point for the construction of compressed indices for collections of sequences [2]. The BWT is often complemented with the longest common prefix (LCP) array [3] since the latter makes it possible to efficiently emulate Suffix Tree algorithms [4, 5]. The construction of such data structures is a challenging problem. Although the final outcome is a compressed index, construction algorithms can be memory hungry and the necessity of developing lightweight algorithms was recognized since the very beginning of the field [6–8]. In lightweight algorithms it is assumed that the input and the output fit in main memory but the amount of additional working memory is sublinear with the size of the input.
When even lightweight algorithms do not fit in RAM, one has to resort to external or semiexternal memory construction algorithms (see [9–14] and references therein). In the external memory model [15] it is assumed that the main memory grows at most polylogarithmically with the size of the input. In the semiexternal model the main memory can grow linearly with the size of the input but part of the working data has to reside on disk. In both models the complexity of the algorithms is usually measured in terms of disk I/Os, since data transfer is much slower than CPU operations.
The space efficient computation of the BWT in main memory for a single sequence is well studied, and remarkable advances have been recently obtained [16, 17]. Unfortunately, for external memory computation the situation is less satisfactory. For collections of sequences, the first external memory algorithm is the BCR algorithm described in [18] that computes the multistring BWT for a collection of total size n, performing a number of sequential I/Os proportional to nK, where K is the length of the longest sequence in the collection. This approach is clearly not competitive when the sequences have nonhomogeneous lengths, and it is far from the theoretical optimal even for sequences of equal length. Nevertheless, the simplicity of the algorithm makes it very effective for collections of relatively short sequences, and it has become the reference tool for this problem. This approach was later extended [19] to compute also the LCP values with the same asymptotic number of I/Os. When computing also the LCP values, or when the input strings have different lengths, the algorithm uses \({\mathcal {O}}(m)\) words of RAM, where m is the number of input sequences.
In this paper, we present a new spaceefficient algorithm for the computation of the BWT and LCP array for a collection of sequences in external or semiexternal memory. Our algorithm takes the amount of available RAM as an input parameter, and tries to make the best use of it by splitting the input into subcollections sufficiently small so that it can compute their BWT in internal memory using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semiexternal memory and in the process it also computes the LCP values. Since the LCP values are computed in a nonstandard order, the algorithm is completed by an external memory mergesort procedure that computes the final LCP array. We show that our algorithm performs a number of sequential I/Os between \({\mathcal {O}}(n\, \mathsf {avelcp})\) and \({\mathcal {O}}(n\, \mathsf {maxlcp})\), where \(\mathsf {avelcp}\) and \(\mathsf {maxlcp}\) are respectively the average and the maximum longest common prefix of the input sequences. To our knowledge, the only other known external memory algorithm for computing the BWT and LCP arrays of a collection of sequences is bwtlcpem, recently proposed in [20] that performs \({\mathcal {O}}(n\, \mathsf {maxlcp})\) sequential I/Os and uses \({\mathcal {O}}(m+K)\) words of RAM, where K is the fixed string length.
In “Related approaches” section we compare our approach with the ideas behind these previous works, and in “Experiments” section we compare their performance in practice. The experimental results show that BCR is the fastest algorithm for relatively short sequences while our algorithm is the fastest when the average LCP of the collection is relatively small compared to the lengths of the sequences. Both our algorithm and BCR appear to be faster than the available implementation of bwtlcpem, which is however only a prototype implementation with some limitations on the admissible inputs.
Another contribution of the paper, which follows from our first result, is the design of simple external memory algorithms for three well known problems related to genomic sequences, namely: (i) the computation of maximal repeats [21, 22], (ii) the computation of the all pairs suffix–prefix overlaps [23–25], and (iii) the construction of succinct de Bruijn graphs [26–28]. Our external memory algorithms for these problems are derived from known internal memory algorithms, but they process the input data in a single sequential scan. In addition, for the problem of computing the all pairs suffix–prefix, we go beyond the recent solutions that compute all the overlaps [24, 25, 29, 30], and we compute only the overlaps above a certain length, still spending constant time per reported overlap. Our algorithms take as input the BWT and LCP array, together with two additional arrays that our BWT construction algorithm can compute without any asymptotic slowdown.
Since problems on genomic sequences often involve huge datasets, it is certainly important to provide efficient external memory algorithms for the three problems described above. To our knowledge, only for the all pairs suffix–prefix problem there exists an external memory algorithm, namely the algorithm [30, Algorithm 2] that computes all the overlaps given the BWT, LCP, and Generalized Suffix Array of the input collection.
Background
Let \({\mathsf {s}}_{}[1,n]\) denote a string of length n over an alphabet \(\Sigma\) of constant size \(\sigma\). As usual, we assume \({\mathsf {s}}_{}[n]\) is a special symbol (endmarker) not appearing elsewhere in \({\mathsf {s}}_{}\) and lexicographically smaller than any other symbol. We write \({\mathsf {s}}_{}[i,j]\) to denote the substring \({\mathsf {s}}_{}[i] {\mathsf {s}}_{}[i+1] \cdots {\mathsf {s}}_{}[j]\). If \(j\ge n\) we assume \({\mathsf {s}}_{}[i,j] = {\mathsf {s}}_{}[i,n]\). If \(i>j\) or \(i > n\) then \({\mathsf {s}}_{}[i,j]\) is the empty string. Given two strings \({\mathsf {s}}_{1}\) and \({\mathsf {s}}_{2}\) we write \({\mathsf {s}}_{1} \preceq {\mathsf {s}}_{2}\) (\({\mathsf {s}}_{1} \prec {\mathsf {s}}_{2}\)) to denote that \({\mathsf {s}}_{1}\) is lexicographically (strictly) smaller than \({\mathsf {s}}_{2}\). We denote by \(\mathsf {LCP}({\mathsf {s}}_{1},{\mathsf {s}}_{2})\) the length of the longest common prefix between \({\mathsf {s}}_{1}\) and \({\mathsf {s}}_{2}\).
The suffix array \({{\mathsf {s}}}{{\mathsf {a}}}[1,n]\) associated to \({\mathsf {s}}_{}\) is the permutation of [1, n] giving the lexicographic order of \({\mathsf {s}}_{}\)’s suffixes, that is, for \(i=1,\ldots ,n1\), \({\mathsf {s}}_{}[{{\mathsf {s}}}{{\mathsf {a}}}[i],n] \prec {\mathsf {s}}_{}[{{\mathsf {s}}}{{\mathsf {a}}}[i+1],n]\).
Essentially \(\mathsf {bwt}_{1\cdots k}\) is a permutation of the symbols in \({\mathsf {s}}_1,\ldots ,{\mathsf {s}}_k\) such that the position in \(\mathsf {bwt}_{1\cdots k}\) of \({\mathsf {s}}_{i}[j]\) is given by the lexicographic rank of its context \({\mathsf {s}}_{i}[j+1,n_i]\) (or \({\mathsf {s}}_{i}[1,n_i]\) if \(j=n_i\)). Figure 1 shows an example with \(k=2\). Notice that for \(k=1\), this is the usual Burrows–Wheeler Transform [1].
Given the suffix array \({{\mathsf {s}}}{{\mathsf {a}}}_{1\cdots k}[1,n]\) of the concatenation \({\mathsf {s}}_1\cdots {\mathsf {s}}_k\), we consider the corresponding LCP array \(\mathsf {lcp}_{1\cdots k}[1,n]\) defined as in (1) (see again Fig. 1). Note that, for \(i=2,\ldots ,n\), \(\mathsf {lcp}_{1\cdots k}[i]\) gives the length of the longest common prefix between the contexts of \(\mathsf {bwt}_{1\cdots k}[i]\) and \(\mathsf {bwt}_{1\cdots k}[i1]\). We stress that all practical implementations use a single $ symbol as endmarker for all strings to avoid alphabet explosion, but endmarkers from different strings are then sorted as described, i.e., on the basis of the index of the strings they belong to.
Computing multistring BWTs
The gSACAK algorithm [32], based on algorithm SACAK [33], computes the suffix array for a string collection. Given a collection of strings of total length n, gSACAK computes the suffix array for their concatenation in O(n) time using \((\sigma +1) \log n\) additional bits (in practice, only 2KB are used for ASCII alphabets). gSACAK is time and space optimal for alphabets of constant size \(\sigma =O(1)\). The multistring \(\mathsf {bwt}_{1\cdots k}\) of \({\mathsf {s}}_1,\ldots ,{\mathsf {s}}_k\) can be easily obtained from the suffix array as in (2). gSACAK can also compute the \(\mathsf {lcp}\) array \(\mathsf {lcp}_{1\cdots k}\) still in linear time using only the additional space for the \(\mathsf {lcp}\) values.
Merging multistring BWTs
The Gap algorithm [34], based on an earlier algorithm by Holt and McMillan [35], is a simple procedure for merging multistring BWTs. In its original formulation the Gap algorithm can also merge LCP arrays, but in this paper we compute LCP values using a different approach more suitable for external memory execution. We describe here only the main idea behind Gap and refer the reader to [34] for further details.
For simplicity in the following we assume we are merging k singlestring BWTs \(\mathsf {bwt}_1=\mathsf {bwt} ({\mathsf {s}}_1),\ldots ,\mathsf {bwt}_{k}=\mathsf {bwt} ({\mathsf {s}}_k)\); the algorithm does not change in the general case where the inputs are multistring BWTs. Computing \(\mathsf {bwt}_{1\cdots k}\) amounts to sorting the symbols of \(\mathsf {bwt}_1,\ldots ,\mathsf {bwt}_{k}\) according to the lexicographic order of their contexts, where the context of symbol \(\mathsf {bwt}_{j}[i]\) is \({\mathsf {s}}_{j}[{{\mathsf {s}}}{{\mathsf {a}}}_j[i],n_j]\). By construction, the symbols in each \(\mathsf {bwt} _j\) are already sorted by context, hence to compute \(\mathsf {bwt}_{1\cdots k}\) we only need to merge \(\mathsf {bwt}_1,\ldots ,\mathsf {bwt}_{k}\) without changing the relative order of the symbols within the input sequences.
The Gap algorithm works in successive iterations. During the hth iteration it computes a vector \(Z^{(h)}\) specifying how the entries of \(\mathsf {bwt}_1,\ldots ,\mathsf {bwt}_{k}\) should be merged to have them sorted according to the first h symbols of their context. Formally, for \(j=1,\ldots ,k\) the vector \(Z^{(h)}\) contains \(\mathsf {bwt} _j\) copies of the value j arranged so that the following property holds.
Property 1
For \({j_{1}},{j_{2}}\in \{1,\ldots ,k\}\), the \(i_1\)th occurrence of \(j_1\) precedes the \(i_2\)th occurrence of \(j_2\) in \(Z^{(h)}\) if and only if the lengthh context of \(\mathsf {bwt} _{j_1}[i_1]\) is lexicographically smaller than the lengthh context of \(\mathsf {bwt} _{j_2}[i_2]\), or the two contexts are equal and \(j_1 < j_2\). \(\square\)
Related approaches
The strategy used by Gap to build multistring BWTs follows the idea, introduced by [35, 36], of merging partial BWTs, i.e. BWTs of subsets of the input collection. Interestingly, both previous algorithms for computing the BWT and LCP in external memory [19, 20] are also based on a merging strategy but instead of merging partial BWTs, they merge the arrays \(L_1\), \(L_2\), \(L_3\), …, where \(L_i\) consists of the symbols which are at distance i from the end of their respective strings. The symbols inside each \(L_i\) are sorted as usual by context. In the example of Fig. 1, we would have \(L_1 = \mathsf{bc}\) (since \({\mathsf {s}}_{1}\) ends with \(\mathsf{b}\$_1\) and \({\mathsf {s}}_{2}\) ends with \(\mathsf{c}\$_{2}\)), \(L_2 = \mathsf{ab}\), (since \({\mathsf {s}}_{1}\) ends with \(\mathsf{ab}\$_1\) and \({\mathsf {s}}_{2}\) ends with \(\mathsf{bc}\$_{2}\)), \(L_3 = \mathsf{ca}\) and so on. Note that in \(L_3\) \(\mathsf{c}\) precedes \(\mathsf{a}\) since \(\mathsf{c}\)’s context \(\mathsf{ab\$_1}\) is lexicographically smaller than \(\mathsf{a}\)’s context \(\mathsf{bc\$_{2}}\). Clearly, merging the arrays \(L_i\) yields the desired multistring BWT and the authors of [19, 20] show how to compute also the LCP array. The algorithms in [19, 20] differ in how the merging is done: [19] uses a refinement of a technique introduced in [9, 10], while [20] uses a refinement of Holt and McMillan merging strategy [35, 36].
The eGap algorithm
The eGap algorithm for computing the multistring BWT and LCP array in external memory works in three phases. First it builds multistring BWTs for subcollections in internal memory, then it merges these BWTs in external memory and generates the LCP values. Finally, it sorts the LCP values in external memory.
Phase 1: BWT computation
Given a collection of sequences \({\mathsf {s}}_{1}, {\mathsf {s}}_{2}, \ldots , {\mathsf {s}}_{k}\), we split it into subcollections sufficiently small that we can compute the multistring SA for each one of them in internal memory using the gSACAK algorithm. After computing each SA, we compute the corresponding multistring BWT and write it to disk in uncompressed form using one byte per character.
Phase 2: BWT merging and LCP computation
This part is based on the Gap algorithm previously described but suitably modified to work efficiently in external memory (or in semiexternal memory if there are at least n bytes of RAM). In the following we assume that the input consists of k BWTs \(\mathsf {bwt} _1,\ldots ,\mathsf {bwt} _k\) of total length n over an alphabet of size \(\sigma\). The input BWTs are read from disk and never moved to internal memory.
eGap computes LCP values exploiting the bitvector B used by Gap to mark the beginning of blocks (see Eq. 3) within each \(Z^{(h)}\) (for simplicity the computation of B is not shown in Fig. 2). We observe that if B[i] is set to \({\mathbf {1}}\) during iteration h then \(\mathsf {lcp}_{1\cdots k}[i] = h1\), since the algorithm has determined that the contexts of \(\mathsf {bwt}_{1\cdots k}[i]\) and \(\mathsf {bwt}_{1\cdots k}[i1]\) have a common prefix of length exactly \(h1\). We introduce an additional bit array \(B_{x}[1,n+1]\) such that, at the beginning of iteration h, \(B_{x}[i]={\mathbf {1}}\) iff B[i] has been set to \({\mathbf {1}}\) at iteration \(h2\) or earlier. During iteration h, if \(B[i]={\mathbf {1}}\) we look at \(B_{x}[i]\). If \(B_{x}[i]={\mathbf {0}}\) then we know that B[i] has been set at iteration \(h1\): thus we output to a temporary file \(F_{h2}\) the pair \(\langle i,h2\rangle\) to record that \(\mathsf {lcp}_{1\cdots k}[i]=h2\), and we set \(B_{x}[i]={\mathbf {1}}\) so no pair for position i will be produced in the following iterations. At the end of iteration h, file \(F_{h2}\) contains all pairs \(\langle i,\mathsf {lcp}_{1\cdots k}[i]\rangle\) with \(\mathsf {lcp} [i]=h2\); the pairs are written in increasing order of their first component, since B and \(B_{x}\) are scanned sequentially. These temporary files will be merged in Phase 3 to produce the LCP array.
As proven in [34, Lemma 7], if at iteration h of the Gap algorithm we set \(B[i]={\mathbf {1}}\), then at any iteration \(g \ge h+2\) processing the entry \(Z^{(g)}[i]\) will not change the arrays \(Z^{(g+1)}\) and B. Since the roles of the \(Z^\mathsf {old}\) and \(Z^{\mathsf {new}}\) files are swapped at each iteration, and at iteration h we scan \(Z^\mathsf {old}= Z^{(h1)}\) to update \(Z^{\mathsf {new}}\) from \(Z^{(h2)}\) to \(Z^{(h)}\), we can compute only the entries \(Z^{(h)}[j]\) that are different from \(Z^{(h2)}[j]\). In particular, any range \([\ell ,m]\) such that \(B_{x}[\ell ] = B_{x}[\ell +1] = \cdots = B_{x}[m] = {\mathbf {1}}\) can be added to a set of irrelevant ranges that the algorithm may skip in successive iterations (irrelevant ranges are defined in terms of the array \(B_{x}\) as opposed to the array B, since before skipping an irrelevant range we need to update both \(Z^\mathsf {old}\) and \(Z^{\mathsf {new}}\)). We read from one file the ranges to be skipped at the current iteration and simultaneously write to another file the ranges to be skipped at the next iteration (note that irrelevant ranges are created and consumed sequentially). Since skipping a single irrelevant range takes \({\mathcal {O}}(k+\sigma )\) time, an irrelevant range is stored only if its size is larger than a given threshold t and we merge consecutive irrelevant ranges whenever possible. In our experiments we used \(t=\max (256,k+\sigma )\). In the worst case the space for storing irrelevant ranges could be \({\mathcal {O}}(n)\) but in actual experiments it was always less than 0.1n bytes.
As in the Gap algorithm, when all entries in B are nonzero, \(Z^\mathsf {old}\) describes how the BWTs \(\mathsf {bwt}_{j}\) (\(j=1,\ldots ,k\)) should be merged to get \(\mathsf {bwt}_{1\cdots k}\), and a final sequential scan of the input BWTs along with \(Z^\mathsf {old}\) allows to write \(\mathsf {bwt}_{1\cdots k}\) to disk, in sequential order. Our implementation can merge at most \(2^7 = 128\) BWTs at a time because we use 7 bits to store each entry of \(Z^\mathsf {old}\) and \(Z^{\mathsf {new}}\). These arrays are maintained on disk in two separate files; the additional bit of each byte are used to keep the current and the next copy of B. The bit array \(B_{x}\) is stored separately in a file of size n/8 bytes. To merge a set of \(k>128\) we split the input in subsets of cardinality 128 and merge them in successive rounds. In practice, the algorithm merges the multistring BWTs produced by Phase 1. In our experiments the maximum number of subcollections was 21.
Semiexternal version We have also implemented a semiexternal version of the merge algorithm that uses n bytes of RAM. The ith byte is used to store \(Z^\mathsf {old}[i]\) and \(Z^{\mathsf {new}}[i]\) (3 bits each), B[i] and \(B_{x}[i]\). This version can sort at most \(2^3 = 8\) BWTs simultaneously; to sort k BWTs it performs \(\log _8 k\) merging rounds. Although performing more rounds is clearly more expensive, this version stores in RAM all the arrays that are modified and reads from disk only the input BWTs and is therefore significantly faster.
Phase 3: LCP merging
At the end of Phase 2 all \(\mathsf {LCP}\)values have been written to the temporary files \(F_h\) on disk as pairs \(\langle i,\mathsf {lcp} [i]\rangle\). Each file \(F_h\) contains all pairs with second component equal to h in order of increasing first component. The computation of the LCP array is completed using a standard external memory multiway merge [37, Chap. 5.4.1] of \(\mathsf {maxlcp}\) sorted files, where \(\mathsf {maxlcp}=\max _i(\mathsf {lcp}_{1\cdots k}[i])\) is the largest LCP value.
Analysis
During Phase 1, gSACAK computes the suffix array for a subcollection of total length m using 9m bytes (8m bytes for \({{\mathsf {s}}}{{\mathsf {a}}}\) and 1m bytes for the text). If the available RAM is M, the input is split into subcollections of size \(\approx M/9\). Since gSACAK runs in linear time, if the input collection has total size n, Phase 1 takes \({\mathcal {O}}(n)\) time overall.
A single iteration of Phase 2 consists of a complete scan of \(Z^{(h1)}\) except for the irrelevant ranges. Since the algorithm requires \(\mathsf {maxlcp}\) iterations, without skipping the irrelevant ranges the algorithm would require \(\mathsf {maxlcp}\) sequential scans of \({\mathcal {O}}(n)\) items. Reasoning as in [34, Theorem 8] we get that by skipping irrelevant ranges the overall amount of data directly read/written by the algorithm is \({\mathcal {O}}( n\, \mathsf {avelcp})\) items where \(\mathsf {avelcp}\) is the arithmetic average of the entries in the final LCP array. However, if we reason in terms of disk blocks, every time we skip an irrelevant range we discard the current block and load a new one (unless the beginning of the new relevant range is inside the same block; in that case or if the beginning of the new relevant range is in the block immediately following the current one, skipping the irrelevant range does not save any I/O). We can upper bound this extra cost, with an overhead of \({\mathcal {O}}(1)\) blocks for each irrelevant range skipped. Summing up, if the total number of skipped ranges is Ir and each disk block consists of \({\mathcal B}\) words, the I/O complexity of Phase 2 according to the theoretical model in [15] is \({\mathcal {O}}( Ir + n\, \mathsf {avelcp}/({{\mathcal {B}}}\log n))\) block I/Os (under the reasonable assumptions that the alphabet is constant, each entry in Z takes constant space, and we need a constant number of merge rounds). Although the experiments in “Experiments” section suggest that in practice Ir is small, for simplicity and uniformity with the previous literature we upper bound the cost of Phase 2 with \({\mathcal {O}}(n\, \mathsf {maxlcp})\) sequential I/Os (corresponding to \({\mathcal {O}}(n\, \mathsf {maxlcp}/({{\mathcal {B}}}\log n))\) block I/Os).
Phase 3 takes \({\mathcal {O}}(\lceil \log _\lambda \mathsf {maxlcp}\rceil )\) rounds; each round merges \(\lambda\) LCP files by sequentially reading and writing \({\mathcal {O}}(n)\) bytes of data. The overall cost of Phase 3 is therefore \({\mathcal {O}}(n \log _\lambda \mathsf {maxlcp})\) sequential I/Os. In our experiments we used \(\lambda =256\); since in our tests \(\mathsf {maxlcp}< 2^{16}\) two merging rounds were always sufficient.
Experiments
Datasets used in our experiments
Name  Size GB  N. of strings  Max Len  Ave Len  Max LCP  Ave LCP 

short  8.0  85,899,345  100  100  99  27.90 
long  8.0  28,633,115  300  300  299  90.28 
pacbio.1000  8.0  8,589,934  1000  1000  876  18.05 
pacbio  8.0  942,248  71,561  9116  3084  18.32 
Datasets We used four real DNA datasets reported in Table 1 containing sequences of different lengths and structure. The sequences of the first three datasets were trimmed to make them of the same length, while the fourth dataset contains sequences of widely different lengths. short are Illumina reads from human genome (ftp://ftp.sra.ebi.ac.uk/vol1/ERA015/ERA015743/srf/). long are Illumina HiSeq 4000 pairedend RNAseq reads from plant Setaria viridis (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1942989). pacbio.1000 and pacbio are PacBio RS II reads from Triticum aestivum (wheat) genome (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5816161). All datasets contain sequences over the A, C, G, T alphabet plus a string terminator symbol.
Memory setting To make a realistic external memory experimental setting one has to use an amount of RAM smaller than the size of the data. Indeed, if more RAM is available, even if the algorithm is supposedly not using it, the operating system will use it to temporarily store disk data and the algorithm will be no longer really working in external memory. This phenomenon will be apparent also from our experiments. For these reasons we reduced the available RAM to simulate three different scenarios: (i) input data 4 times larger than the available RAM, (ii) input data of approximately the same size as the RAM, and (iii) input data 4 times smaller than the RAM. We evaluated these scenarios with the complete 8 GB datasets from Table 1 (with 2 GB, 8 GB, and 32 GB RAM), and with the datasets trimmed to 1 GB (hence with 256 MB, 1 GB, and 4 GB RAM). The RAM was limited at boot time to a value equal to the amount assigned to the algorithm plus a small extra amount for the operating system (14 MB for the 256 MB instance and 64 MB for the others).
Comparison with the existing algorithms
We compared eGap with the algorithm BCR [19] which is the current state of the art for BWT/LCP computation for collections of sequences. We used the bcrlcp implementation from [38] since the previous implementation mentioned in [19] did not compute the LCP values correctly. We tested also the recently proposed algorithm bwtlcpem [20] using the code from [39]. As a reference we also tested the algorithm eGSA [14] using the code from [40]. eGSA computes the Suffix and LCP Arrays for collections of sequences in external memory: the disadvantage of this algorithm is that working with the Suffix Array could involve transferring to/from disk a much larger amount of data.
Results: The results of our experiments are summarized in Fig. 3. The bar plots on the left are for the 1 GB datasets showing the running time as function of the available RAM; the diagrams on the right are for the 8 GB datasets. The results show that for memory scenarios (i) and (ii) eGap and bcrlcp have the better performance, whereas for scenario (iii) eGap and eGSA are the best options. The performance of bwtlcpem improves with the RAM size, but it is still 12 times slower than eGap for the short datasets with 4 GB of RAM.
The above results are in good accordance with the theoretical analysis. bcrlcp complexity is \({\mathcal {O}}(n\, \mathsf {maxlen})\) sequential I/Os while eGap and bwtlcpem both take \({\mathcal {O}}(n\, \mathsf {maxlcp})\) sequential I/Os. For the short and long datasets the maximum length and the maximum LCP coincide and we see that when the available memory is only one fourth of the input size bcrlcp is clearly the fastest option: indeed it is up to a factor 2.6 faster than eGap. This is no longer true when the available memory is equal or larger than the input size: in this case eGap is the fastest, probably because of its ability to exploit all the available memory using a semiexternal strategy whenever possible. When the available memory is larger than the input size or for the pacbio.1000 dataset which has a very large \(\mathsf {maxlen}\) then eGap is up to 40 times faster than bcrlcp. Note that, in accordance with our heuristic analysis, eGap ’s running time per input byte appears to be roughly proportional to the average LCP of the collection. If we look at the datasets pacbio and pacbio.1000 we see that they have widely different maximum LCPs, yet their running times are very close similarly to their average LCPs.
Note that in the scenario (iii) eGSA is often the fastest algorithm and its running time appears to be less influenced by the size of the average or maximum LCP. Another advantage is that it also computes the Suffix Array, but it has the drawback of using a large amount of disk working space: 340 GB for a 8 GB input vs 56 GB used by eGap.
Relative performance of eGap’s building blocks
We evaluated the percentage of time spent by each phase of eGap and their efficiency (percentage the CPU was busy) on the 8 GB datasets in the memory scenarios considered above, thus with RAM limited to (i) 1 GB, (ii) 8 GB, and (iii) 32 GB.
The results in Fig. 4 show that Phase 2 of eGap dominates the algorithm in general. The second phase took about \(95\%\), \(85\%\) and \(50\%\) of the total time in scenarios (i), (ii), and (iii) respectively. If we look at the efficiency of the single phases, we see that they all improve with the RAM size. However, we notice that for any given memory scenario the efficiency of Phases 1 and 3 was almost the same for the different datasets, while Phase 2 has a different behavior. For the short and long datasets with 8 GB and 32 GB RAM, we see that Phase 2 efficiency is very close to Phase 1’s, while there is a sharp drop when using 2 GB RAM. For the pacbio datasets, the drop in Phase 2 efficiency is significant already when we use 8 GB RAM.
Applications
In this section we show that the eGap algorithm, in addition to the BWT and LCP arrays, can output additional information useful to design efficient external memory algorithms for three well known problems on sequence collections: (i) the computation of maximal repeats, (ii) the all pairs suffix–prefix overlaps, and (iii) the construction of succinct de Bruijn graphs. For these problems we describe algorithms which are derived from known (internal memory) algorithms suitably modified so that they process the input data in a single sequential scan.
Our first observation is that eGap can also output the array which provides, for each \(\mathsf {bwt}\) entry, the id of the sequence to which that entry belongs. In information retrieval this is usually called the Document Array, so in the following we will denote it by \({{\mathsf {d}}}{{\mathsf {a}}}\). In Phase 1 the gSACAK algorithm can compute the \({{\mathsf {d}}}{{\mathsf {a}}}\) together with the \(\mathsf {lcp}\) and \(\mathsf {bwt}\) using only additional 4n bytes of space to store the \({{\mathsf {d}}}{{\mathsf {a}}}\) entries. These partial \({{\mathsf {d}}}{{\mathsf {a}}}\) ’s can be merged in Phase 2 using the \(Z^{\mathsf {new}}\) array in the same way as the BWT entries. In the following we use \(\mathsf {bwt}\), \(\mathsf {lcp}\), and \({{\mathsf {d}}}{{\mathsf {a}}}\) to denote the multistring BWT, LCP and Document Array of a collection of m sequences of total length n. We write \({\mathsf {s}}\) to denote the concatenation \({\mathsf {s}}_{1} \cdots {\mathsf {s}}_{m}\) and \({{\mathsf {s}}}{{\mathsf {a}}}\) to denote the suffix array of \({\mathsf {s}}\). We will use \({\mathsf {s}}\) and \({{\mathsf {s}}}{{\mathsf {a}}}\) to describe and prove the correctness of our algorithms, but neither \({\mathsf {s}}\) nor \({{\mathsf {s}}}{{\mathsf {a}}}\) are used in the computations.
Computation of maximal repeats
Different notions of maximal repeats have been used in the bioinformatics literature to model different notions of repetitive structure (see for example [21, 22]). We use a notion of maximal repeat from [41, Chap. 7]: we say that a string \(\alpha\) is a Type 1 maximal repeat if \(\alpha\) occurs in the collection at least twice and every extension, i.e. \(c\alpha\) or \(\alpha c\) with \(c\in \Sigma\), occurs fewer times. We consider also a more restrictive notion: we say that a string \(\alpha\) is a Type 2 maximal repeat if \(\alpha\) occurs in the collection at least twice and every extension of \(\alpha\) occurs at most once.
To compute Type 1 maximal repeats the crucial observation is that there is a substring of length \(\ell\) that prefixes \({{\mathsf {s}}}{{\mathsf {a}}}\) entries \(j, j+1, \ldots , i\) (and no others) iff \(\mathsf {lcp} [k] \ge \ell\) for \(k=j+1,\ldots ,i\), and both \(\mathsf {lcp} [j]\) and \(\mathsf {lcp} [i+1]\) are smaller than \(\ell\). To ensure that the repeat is Type 1 maximal, we also require that there exists \(h \in [j+1,i]\) such that \(\mathsf {lcp} [h]=\ell\) and that the substring \(\mathsf {bwt} [j,i]\) contains at least two distinct characters.
Our algorithm consists of a single sequential scan of \(\mathsf {bwt}\) and \(\mathsf {lcp}\). During the scan, we maintain a stack containing pairs \({\langle j, \mathsf {lcp} [h] \rangle }\) with \(j \le h\) such that if \({\langle j', \mathsf {lcp} [h'] \rangle }\) is below \({\langle j, \mathsf {lcp} [h] \rangle }\) then \(j'<j\) and \(\mathsf {lcp} [h']<\mathsf {lcp} [h]\). In addition, when the scanning reaches position i, for every entry \({\langle j, \mathsf {lcp} [h] \rangle }\) in the stack it is \(\mathsf {lcp} [h]=\min _{j\le k < i}\mathsf {lcp} [k]\), that is, \(\mathsf {lcp} [h]\) is the smallest value in the range \(\mathsf {lcp} [j,i1]\).
We maintain the stack as follows. When we reach position i, if the entry \({\langle j, \mathsf {lcp} [h] \rangle }\) at the top of the stack has \(\mathsf {lcp} [h]<\mathsf {lcp} [i]\) we push \({\langle i, \mathsf {lcp} [i] \rangle }\) on the stack. If \(\mathsf {lcp} [h]=\mathsf {lcp} [i]\) we do nothing. If \(\mathsf {lcp} [h]>\mathsf {lcp} [i]\) we pop from the stack all entries \({\langle j, \mathsf {lcp} [h] \rangle }\) with \(\mathsf {lcp} [h]> \mathsf {lcp} [i]\); if the removal leaves at the top of the stack an entry \({\langle j', \mathsf {lcp} [h'] \rangle }\) with \(\mathsf {lcp} [h'] < \mathsf {lcp} [i]\) we push on the stack a new entry \({\langle {\hat{\jmath }}, \mathsf {lcp} [i] \rangle }\) where \({\hat{\jmath }}\) is the first component of the last entry just removed from the stack. Note that in any case when we have completed the processing of position i the entry at the top of the stack has second component equal to \(\mathsf {lcp} [i]\), and for each stack entry \({\langle j, \mathsf {lcp} [h] \rangle }\) it is \(\mathsf {lcp} [h]=\min _{j\le k \le i} \mathsf {lcp} [k]\) as claimed.
We now prove that if \({\langle j', \mathsf {lcp} [h'] \rangle }\) is immediately below \({\langle j, \mathsf {lcp} [h] \rangle }\) then \(\mathsf {lcp} [j1]=\mathsf {lcp} [h']\). As we observed above, if at step i we push \({\langle i, \mathsf {lcp} [i] \rangle }\) on the stack, the previous top entry has second component equal to \(\mathsf {lcp} [i1]\) so the property holds for the first insertion of an entry \({\langle i, \mathsf {lcp} [\cdot ] \rangle }\). During the following steps it is possible that \({\langle i, \mathsf {lcp} [x] \rangle }\) is removed and immediately reinserted as \({\langle i, \mathsf {lcp} [y] \rangle }\) (with \(\mathsf {lcp} [y]<\mathsf {lcp} [x]\)), but since the preceding stack element does not change, is still holds true that \(\mathsf {lcp} [i1]\) is equal to the second component of the preceding element. Note that, since \(\mathsf {lcp}\) values on the stack are strictly increasing, we conclude that for each stack entry \({\langle j, \mathsf {lcp} [h] \rangle }\) it is \(\mathsf {lcp} [j1] < \mathsf {lcp} [h]\).
Our algorithm outputs Type 1 maximal repeats when elements are popped from the stack. At step \(i+1\) we pop from the stack all entries \({\langle j, \mathsf {lcp} [h] \rangle }\) such that \(\mathsf {lcp} [h] > \mathsf {lcp} [i+1]\). Recall that by construction \(\mathsf {lcp} [h] = \min _{j \le k\le i} \mathsf {lcp} [k]\). In addition \(\mathsf {lcp} [j1] < \mathsf {lcp} [h]\) and \(\mathsf {lcp} [i+1]<\mathsf {lcp} [h]\). Thus, to ensure that we have found a Type 1 maximal repeat we only need to check that \(\mathsf {bwt} [j1,i]\) contains at least two distinct characters. To efficiently check this latter condition, for each stack entry \({\langle j, \mathsf {lcp} [h] \rangle }\) we maintain a bit vector \(b_j\) of size \(\sigma\) keeping track of the distinct characters in the array \(\mathsf {bwt}\) from position \(j1\) to the next stack entry, or to the last seen position for the entry at the top of the stack. When \({\langle j, \mathsf {lcp} [h] \rangle }\) is popped from the stack its bit vector is ored to the previous stack entry in constant time; if \({\langle j, \mathsf {lcp} [h] \rangle }\) is popped from the stack and immediately replaced with \({\langle j, \mathsf {lcp} [i] \rangle }\) its bit vector survives as it is (essentially because it is associated with an index, not with a stack entry). Clearly, maintaining the bit vector does not increase the asymptotic cost of the algorithm.
Since at each step we insert at most one entry on the stack, the overall cost of our algorithm is \({\mathcal {O}}(n)\) time. The algorithm uses a stack of size bounded by \({\mathcal {O}}(\mathsf {maxlcp})\) words. For most applications \(\mathsf {maxlcp}\ll n\) so it should be feasible to keep the stack in RAM. However, since a stack can also be implemented in external memory in \({\mathcal {O}}(1)\) amortized time per operation [42], we can state the following result.
Theorem 1
We can compute all Type 1 maximal repeats in \({\mathcal {O}}(n)\) time executing a single scan of the arrays \(\mathsf {bwt}\) and \(\mathsf {lcp}\) using \({\mathcal {O}}(1)\) words of RAM. \(\square\)
To find Type 2 maximal repeats, we are interested in consecutive LCP entries \(\mathsf {lcp} [j], \mathsf {lcp} [j+1], \ldots , \mathsf {lcp} [i], \mathsf {lcp} [i+1]\), such that \(\mathsf {lcp} [j]< \mathsf {lcp} [j+1] = \mathsf {lcp} [j+2] = \cdots = \mathsf {lcp} [i] > \mathsf {lcp} [i+1].\) Indeed, this implies that for \(h=j, \ldots , i\) all suffixes \({\mathsf {s}}[{{\mathsf {s}}}{{\mathsf {a}}}[h], n]\) are prefixed by the same string \(\alpha\) of length \(\mathsf {lcp} [j+1]\) and every extension \(\alpha c\) occurs at most once. If this is the case, then \(\alpha\) is a Type 2 maximal repeat if all characters in \(\mathsf {bwt} [j,i]\) are distinct since this ensures that also every extension \(c\alpha\) occurs at most once. In order to detect this situation, as we scan the \(\mathsf {lcp}\) array we maintain a candidate pair \({\langle j+1, \mathsf {lcp} [j+1] \rangle }\) such that \(j+1\) is the largest index seen so far for which \(\mathsf {lcp} [j]<\mathsf {lcp} [j+1]\). When we establish a candidate at \(j+1\) as above, we initialize to zero a bit vector b of size \(\sigma\) setting to 1 only entries \(\mathsf {bwt} [j]\) and \(\mathsf {bwt} [j+1]\). As long as the following values \(\mathsf {lcp} [j+2], \mathsf {lcp} [j+3], \ldots\) are equal to \(\mathsf {lcp} [j+1]\) we go on updating b and if the same position is marked twice we discard \({\langle j+1, \mathsf {lcp} [j+1] \rangle }\). If we reach an index \(i+1\) such that \(\mathsf {lcp} [i+1]>\mathsf {lcp} [j+1]\), we update the candidate to \({\langle i+1, \mathsf {lcp} [i+1] \rangle }\) and reinitialize b. If we reach \(i+1\) such that \(\mathsf {lcp} [i+1]<\mathsf {lcp} [j+1]\) and \({\langle j+1, \mathsf {lcp} [j+1] \rangle }\) has not been discarded, then a repeat of Type 2 (with \(ij+1\) repetitions) has been located.
Theorem 2
We can compute all Type 2 maximal repeats in \({\mathcal {O}}(n)\) time executing a single scan of the arrays \(\mathsf {bwt}\) and \(\mathsf {lcp}\) using \({\mathcal {O}}(1)\) words of RAM. \(\square\)
Note that when our algorithms discover Type 1 or Type 2 maximal repeats we know the repeat length and the number of occurrences so one can easily filter out noninteresting repeats (too short or too frequent). In some applications, for example the MUMmer tool [43], one is interested in repeats that occur in at least r distinct input sequences, maybe exactly once for each sequence. Since for these applications the number of input sequences is relatively small, we can handle these requirements by simply scanning the \({{\mathsf {d}}}{{\mathsf {a}}}\) array simultaneously with the \(\mathsf {lcp}\) and \(\mathsf {bwt}\) arrays and keeping track of the sequences associated to a maximal repeat using a bit vector (or a unionfind structure) as we do with characters in the \(\mathsf {bwt}\).
All pairs suffix–prefix overlaps
In this problem we want to compute, for each pair of sequences \({\mathsf {s}}_{i}\) \({\mathsf {s}}_{j}\), the longest overlap between a suffix of \({\mathsf {s}}_{i}\) and a prefix of \({\mathsf {s}}_{j}\). Our solution is inspired by the algorithm in [24] which in turn was derived by an earlier Suffixtree based algorithm [23]. The algorithm in [24] solves the problem using a Generalized Enhanced Suffix array (consisting of the arrays \({{\mathsf {s}}}{{\mathsf {a}}}\), \(\mathsf {lcp}\), and \({{\mathsf {d}}}{{\mathsf {a}}}\)) in \({\mathcal {O}}(n+m^2)\) time, which is optimal since n is the size of the input and there are \(m^2\) longest overlaps. However, for large collections it is natural to consider the problem of reporting only the overlaps larger than a given threshold \(\tau\) still spending \({\mathcal {O}}(n)\) time plus constant time per reported overlap. Our algorithm solves this more challenging problem.
In the following we say that a suffix starting at \({{\mathsf {s}}}{{\mathsf {a}}}[i]\) is special iff it is a prefix of the suffix starting at \({{\mathsf {s}}}{{\mathsf {a}}}[i+1]\), not considering the endmarker. This is equivalent to state that \({\mathsf {s}}[{{\mathsf {s}}}{{\mathsf {a}}}[i]+\mathsf {lcp} [i+1]]={\$}\). For example, in Fig. 1 (right) the special suffixes are \(\mathsf{ab}\$_1\), \(\mathsf{abc}\$_{2}\), \(\mathsf{abcab}\$_1\) \(\mathsf{b}\$_1\), \(\mathsf{bc}\$_2\), \(\mathsf{bcab}\$_1\), \(\mathsf{c}\$_2\), \(\mathsf{cab}\$_1\). Notice that a special suffix starting at \({{\mathsf {s}}}{{\mathsf {a}}}[i]\) has the form v$ with \(v= \mathsf {lcp} [i+1]\); clearly only if \({{\mathsf {s}}}{{\mathsf {a}}}[i]\) is special then v can be a suffix–prefix overlap. Note also that any suffix \({\$}\) is always trivially special.
To efficiently solve the suffix–prefix overlaps problem, we modify Phase 2 of our algorithm so that it outputs also the bit array \(\mathsf {xlcp}\) such that \(\mathsf {xlcp} [i]={\mathbf {1}}\) iff the suffix starting at \({{\mathsf {s}}}{{\mathsf {a}}}[i]\) is special. To this end, we maintain an additional lengthn bit array S such that, at the end of iteration h, \(S[i] = \mathbf{1}\) if and only if the suffix starting at \({{\mathsf {s}}}{{\mathsf {a}}}[i]\) is special and it has length less than h, again not considering the endmarker symbol. The array S is initialized at the end of iteration \(h=1\) as \(S = \mathbf{1}^k\mathbf{0}^{nk}\), consistently with the fact that in the final suffix array the first k contexts are strings consisting of just an endmarker, that are special suffixes and the only suffixes of length 0.
During iteration h, we update S as follows. With reference to the code in Fig. 2, whenever we use entry \(Z^{(h1)}[i]\) to compute \(Z^{(h)}[j]\) for some j, if \(S[i]=\mathbf{1}\) and \(B[j+1]=\mathbf{0}\) then we set \(S[j]=\mathbf{1}\).
Lemma 1
The above procedure correctly updates the array S.
Proof
We prove by induction that at the end of iteration h: (1) \(S[i]=\mathbf{1}\) iff the suffix starting at \({{\mathsf {s}}}{{\mathsf {a}}}[i]\) is special and has length less than h, and (2) if \(S[i]=\mathbf{1}\) the lengthh context currently in position i is in the correct lexicographic position with respect to the final suffix array ordering (in other words, it is a prefix for \({\mathsf {s}}[{{\mathsf {s}}}{{\mathsf {a}}}[i],n]\)).
For \(h=1\) the result is true by construction. During iteration \(h>1\), if we reach a position i such that \(S[i]=\mathbf{1}\), then by inductive hypothesis the context in position i has the form v$ with \(v \le h2\). If c is the symbol we read at Step 5 of Fig. 2, then the context corresponding to position j is cv$. Since the context contains the endmarker, j is the correct lexicographic position of cv$ which is therefore the suffix corresponding to \({{\mathsf {s}}}{{\mathsf {a}}}[j]\). If \(B[j+1]=\mathbf{0}\), then \(\mathsf {lcp} [j+1]\ge h1\). Since \(\mathsf {lcp} [j+1] \le c v \le h1\), it follows that \(cv=\mathsf {lcp} [j+1]=h1\) and S[j] is special as claimed.
On the other hand, if at the end of iteration h it is \(S[j]=\mathbf{0}\), then either it was \(S[i]=\mathbf{0}\) or \(B[j+1]=\mathbf{1}\) which implies \(\mathsf {lcp} [j+1]< h1\). In both cases the suffix starting at \({{\mathsf {s}}}{{\mathsf {a}}}[j]\) cannot be special and of length less than h. \(\square\)
Having established the properties of S, we can now show how to compute \(\mathsf {xlcp}\). Recall that LCP values are computed as follows. In Phase 2, during iteration \(h+1\) if \(B[i+1]=\mathbf{1}\) and \(B_{x}[i+1]=\mathbf{0}\) we output the pair \(\langle i+1,h1\rangle\) recording the fact that \(\mathsf {lcp} [i+1]=h1\). Such pairs are later sorted by their first component during Phase 3 to retrieve the LCP array. If \({{\mathsf {s}}}{{\mathsf {a}}}[i]\) is special, its corresponding suffix has length \(\mathsf {lcp} [i+1]=h1\) so, by the properties of S, at the beginning of iteration \(h+1\) it is \(S[i]=\mathbf{1}\). Thus, to compute \(\mathsf {xlcp}\), instead of the pair \(\langle i+1,h1\rangle\) we output the triplet \(\langle i+1,h1,S[i]\rangle = \langle i+1,\mathsf {lcp} [i+1],\mathsf {xlcp} [i]\rangle\). After the merging is completed we sort the triplets by their first component and we derive both arrays \(\mathsf {lcp}\) and \(\mathsf {xlcp}\).
Our algorithm for computing the suffix–prefix overlaps longer than a threshold \(\tau\), consists of a sequential scan of the arrays \(\mathsf {bwt}\), \(\mathsf {lcp}\), \({{\mathsf {d}}}{{\mathsf {a}}}\), and \(\mathsf {xlcp}\). We maintain m distinct stacks, \(\mathsf {stack}[1],\ldots ,\mathsf {stack}[m]\), one for each input sequence; \(\mathsf {stack}[k]\) stores pairs \(\langle j, \mathsf {lcp} [j+1]\rangle\) only if \({{\mathsf {s}}}{{\mathsf {a}}}[j]\) is a special suffix belonging to sequence k such that \(\mathsf {lcp} [j+1]>\tau\). During the scan we maintain the invariant that for all stack entries \({\langle j, \mathsf {lcp} [j+1] \rangle }\), \(\mathsf {lcp} [j+1]\) is the length of the longest common prefix (longer than \(\tau\)) between \({\mathsf {s}}[{{\mathsf {s}}}{{\mathsf {a}}}[j],n]\) and \({\mathsf {s}}[{{\mathsf {s}}}{{\mathsf {a}}}[i],n]\), where i is the position just scanned.

A stack \(\mathsf {lcpStack}\) containing, in increasing order, the values \(\ell\) such that some \(\mathsf {stack}[k]\) contains an entry with LCP component equal to \(\ell\);

An array of lists \(\mathsf {top}\) such that \(\mathsf {top}[\ell ]\) contains the indexes k for which the entry at the top of \(\mathsf {stack}[k]\) has LCP component equal to \(\ell\);

An array \(\mathsf {daPtr}[1,m]\) such that \(\mathsf {daPtr}[k]\) points to the entry k in the list \(\mathsf {top}[\ell _k]\) containing it (\(\mathsf {daPtr}[k]\) is used to remove such entry k from \(\mathsf {top}[\ell _k]\) in constant time).
We maintain the above data structures as follows. When we reach position \(i+1\) we remove all entries \({\langle j, \mathsf {lcp} [j+1] \rangle }\) such that \(\mathsf {lcp} [j+1]>\mathsf {lcp} [i+1]\). We use \(\mathsf {lcpStack}\) to determine which are the values \(\ell\) such that some stack contains an entry \(\langle j,\ell \rangle\) with \(\ell >\mathsf {lcp} [i+1]\). For the value \(\ell\) at the top of \(\mathsf {lcpStack}\) we locate through \(\mathsf {top}[\ell ]\) all stacks that contain an \(\ell\)entry at the top. For each one of these stacks we remove the top entry \(\langle j,\ell \rangle\) so that a new entry \(\langle j',\ell '\rangle\), with \(\ell ' <\ell\), becomes the new top of the stack. Then, if k is the stack that is being updated, we add k to \(\mathsf {top}[\ell ']\), and a pointer to the new entry is saved in \(\mathsf {daPtr}[k]\) (overwriting the previous pointer). When all entries of \(\mathsf {top}[\ell ]\) have been processed, \(\mathsf {top}[\ell ]\) is emptied and \(\ell\) is popped from \(\mathsf {lcpStack}\). The whole procedure is repeated until a value \(\ell \le \mathsf {lcp} [i+1]\) is left at the top of \(\mathsf {lcpStack}\).
Finally, if \(\mathsf {xlcp} [i]={\mathbf {1}}\) and \(\mathsf {lcp} [i+1]>\tau\), \({\langle i, \mathsf {lcp} [i+1] \rangle }\) is added to \(\mathsf {stack}[{{\mathsf {d}}}{{\mathsf {a}}} [i]]\); this requires removing \({{\mathsf {d}}}{{\mathsf {a}}} [i]\) from the list \(\mathsf {top}[\ell ]\) where \(\ell\) is the previous top LCP value in \(\mathsf {stack}[{{\mathsf {d}}}{{\mathsf {a}}} [i]]\); the position of \({{\mathsf {d}}}{{\mathsf {a}}} [i]\) in \(\mathsf {top}[\ell ]\) is retrieved through \(\mathsf {daPtr}[{{\mathsf {d}}}{{\mathsf {a}}} [i]]\). Also we add \({{\mathsf {d}}}{{\mathsf {a}}} [i]\) to \(\mathsf {top}[\mathsf {lcp} [i+1]]\), and the pointer to this new element of \(\mathsf {top}[\mathsf {lcp} [i+1]]\) is written to \(\mathsf {daPtr}[{{\mathsf {d}}}{{\mathsf {a}}} [i]]\). Since the algorithm performs an amortized constant number of operations per entry \({\langle i, \mathsf {lcp} [i+1] \rangle }\), maintaining the above data structures takes \({\mathcal {O}}(n)\) time overall.
The computation of the overlaps is done as in [24]. When the scan reaches position i, we check whether \(\mathsf {bwt} [i]=\)$. If this is the case, then \({\mathsf {s}}[{{\mathsf {s}}}{{\mathsf {a}}}[i],n]\) is prefixed by the whole sequence \({\mathsf {s}}_{{{\mathsf {d}}}{{\mathsf {a}}} [i]}\), hence the longest overlap between a prefix of \({\mathsf {s}}_{{{\mathsf {d}}}{{\mathsf {a}}} [i]}\) and a suffix of \({\mathsf {s}}_{k}\) is given by the element currently at the top of \(\mathsf {stack}[k]\), since by construction these stacks only contain special suffixes whose overlap with \({\mathsf {s}}[{{\mathsf {s}}}{{\mathsf {a}}}[i],n]\) is larger than \(\tau\). Note that using \(\mathsf {lcpStack}\) and \(\mathsf {top}\) we can directly access the stacks whose top element corresponds to an overlap with \({\mathsf {s}}_{{{\mathsf {d}}}{{\mathsf {a}}} [i]}\) larger than \(\tau\), hence the time spent in this phase is proportional to the number of reported overlaps. As in [24] some care is required to handle the case in which the whole string \({\mathsf {s}}_{{{\mathsf {d}}}{{\mathsf {a}}} [i]}\) is a suffix of another sequence, but this can be done without increasing the overall complexity as in [24]. Since we spend constant time for reported overlap and amortized constant time for scanned position the overall cost of the algorithm, in addition to the scanning of the \(\mathsf {bwt}\)/\(\mathsf {lcp}\)/\(\mathsf {xlcp}\)/\({{\mathsf {d}}}{{\mathsf {a}}}\) arrays, is \({\mathcal {O}}(n+E_\tau )\), where \(E_\tau\) is the number of suffix–prefix overlaps greater than \(\tau\). Since all stacks can be implemented in external memory spending amortized constant time per operation, we only need to store in RAM \(\mathsf {top}\) and \(\mathsf {daPtr}\) that overall take \({\mathcal {O}}(m+\mathsf {maxlcp})\) words.
Theorem 3
Our algorithm computes all suffix–prefix overlaps longer than \(\tau\) in time \({\mathcal {O}}(n+E_\tau )\), where \(E_\tau\) is the number of reported overlaps, using \({\mathcal {O}}(m+\mathsf {maxlcp})\) words of RAM and executing a single scan of the arrays \(\mathsf {bwt}\), \(\mathsf {lcp}\), \({{\mathsf {d}}}{{\mathsf {a}}}\) and \(\mathsf {xlcp}\). \(\square\)
Construction of succinct de Bruijn graphs
A recent remarkable application of compressed data structures is the design of efficiently navigable succinct representations of de Bruijn graphs [26–28]. Formally, a de Bruijn graph for a collection of strings consists of a set of vertices representing the distinct kmers appearing in the collection, with a directed edge (u, v) iff there exists a \((k+1)\)mer \(\alpha\) in the collection such that \(\alpha [1,k]\) is the kmer associated to u and \(\alpha [2,k+1]\) is the kmer associated to v.
The starting point of all de Bruijn graphs succinct representation is the BOSS representation [28], so called from the authors’ initials. For simplicity we now describe the BOSS representation of a korder de Bruijn graph using the lexicographic order of kmers, instead of the colexicographic order as in [28], which means we are building the graph with the direction of the arcs reversed. This is not a limitation since arcs can be traversed in both directions (or we can apply our construction to the input sequences reversed).
Consider the N kmers appearing in the collection sorted in lexicographic order. For each kmer \(\alpha _i\) consider the array \(C_i\) of distinct characters \(c\in \Sigma \cup \{\$\}\) such that \(c\alpha _i\) appears in the collection. The concatenation \({W}= C_1 C_2 \cdots C_N\) is the first component of the BOSS representation. The second component is a binary array \(\mathsf {last}\), with \(\mathsf {last} = {W}\), such that \(\mathsf {last}[j]={\mathbf {1}}\) iff \({W}[j]\) is the last entry of some array \(C_i\). Clearly, there is a bijection between entries in \({W}\) and graph edges; in the array \(\mathsf {last}\) each sequence \({\mathbf {0}}^{i}{\mathbf {1}}\) (\(i\ge 0\)) corresponds to the outgoing edges of a single vertex with outdegree \(i+1\). Finally, the third component is a binary array \({W}^{}\), with \({W}^{} = {W}\), such that \({W}^{}[j]={\mathbf {1}}\) iff \({W}[j]\) comes from the array \(C_i\), where \(\alpha _i\) is the lexicographically smallest kmer prefixed by \(\alpha _i[1,k1]\) and preceded by W[j] in the collection. This means that \(\alpha _i\) is the lexicographically smallest kmer with an outgoing edge reaching the node associated to kmer \(W[j]\alpha _i[1,k1]\). Note that the number of \({\mathbf {1}}\)’s in \(\mathsf {last}\) and \({W}^{}\) is exactly N, i.e. the number of nodes in the de Bruijn graph.
We now show how to compute \({W}\), \(\mathsf {last}\) and \({W}^{}\) by a sequential scan of the \(\mathsf {bwt}\) and \(\mathsf {lcp}\) array. The crucial observation is that the suffix array range prefixed by the same kmer \(\alpha _i\) is identified by a range \([b_i,e_i]\) of LCP values satisfying \(\mathsf {lcp} [b_i]<k\), \(\mathsf {lcp} [\ell ] \ge k\) for \(\ell =b_i+1,\ldots ,e_i\) and \(\mathsf {lcp} [e_i+1]<k\). Since kmers are scanned in lexicographic order, by keeping track of the corresponding characters in the array \(\mathsf {bwt} [b_i,e_i]\) we can build the array \(C_i\) and consequently \({W}\) and \(\mathsf {last}\). To compute \({W}^{}\) we simply need to keep track also of suffix array ranges corresponding to \((k1)\)mers. Every time we set an entry \({W}[j]=c\) we set \({W}^{}[j]={\mathbf {1}}\) iff this is the first occurrence of c in the range corresponding to the current \((k1)\)mers.
Theorem 4
Our algorithm computes the BOSS representation of a de Bruijn graph in \({\mathcal {O}}(n)\) time using \({\mathcal {O}}(1)\) words of RAM, and executing a single scan of the arrays \(\mathsf {bwt}\) and \(\mathsf {lcp}\). \(\square\)
If, in addition to the \(\mathsf {bwt}\) and \(\mathsf {lcp}\) arrays, we also scan the \({{\mathsf {d}}}{{\mathsf {a}}}\) array, then we can keep track of which sequences contain any given graph edge and therefore obtain a succinct representation of the colored de Bruijn graph [44]. Finally, we observe that if our only objective is to build the korder de Bruijn graph, then we can stop the phase 2 of our algorithm after the kth iteration. Indeed, we do not need to compute the exact values of LCP entries greater than k, and also we do not need the exact BWT but only the BWT characters sorted by their length k context.
Conclusions
In this paper we have described eGap, a new algorithm for the computation of the BWT and LCP arrays of large collection of sequences. Depending on the amount of available memory, eGap uses an external or semiexternal strategy for computing the BWT and LCP values. An experimental comparison of the available tools for BWT and LCP arrays computation shows that eGap is the fastest tool in many scenarios and was the only tool capable of completing the computation within a reasonable time frame for all kind of input data.
Another important feature of eGap is that, in addition to the BWT and LCP array, it can compute, without any asymptotic slowdown, two additional arrays that provide important information about the substrings of the input collection. We show how to use such information to design efficient external memory algorithms for three important problems for biosequences, namely the computation of maximal repeats, the computation of the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. Overall our results confirm the importance of the BWT and LCP arrays beyond their use for the construction of compressed full text indexes. This is in accordance with other recent results that have shown of they can be used directly to discover structural information on the underlying collection (see [45–47] and references therein).
Declarations
Authors’ contributions
LE and GM devised the main algorithmic ideas. All authors contributed to improve the algorithms and participated to their implementations. FAL and GPT designed and performed the experiments. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Availability
The source code of the proposed algorithm is available at https://github.com/felipelouza/egap.
Funding
L.E. was partially supported by the University of Eastern Piedmont project Behavioural Types for Dependability Analysis with Bayesian Networks. F.A.L. was supported by the Grants \(\#\)2017/091050 and \(\#\)2018/215092 from the São Paulo Research Foundation (FAPESP). G.M. was partially supported by PRIN grant 201534HNXC and INdAMGNCS Project 2019 Innovative methods for the solution of medical and biological big data. G.P.T. acknowledges the support of Brazilian agencies Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Burrows M, Wheeler DJ. A blocksorting lossless data compression algorithm. Technical report, Digital SRC Research Report; 1994.Google Scholar
 Mäkinen V, Belazzougui D, Cunial F, Tomescu AI. GenomeScale Algorithm Design: biological sequence analysis in the era of highthroughput sequencing. Cambridge: Cambridge University Press; 2015.View ArticleGoogle Scholar
 Manber U, Myers G. Suffix arrays: a new method for online string searches. SIAM J Comput. 1993;22(5):935–48.View ArticleGoogle Scholar
 Gog S, Ohlebusch E. Compressed suffix trees: efficient computation and storage of LCPvalues. ACM J Exp Algorith. 2013;18:2.Google Scholar
 Navarro G, Mäkinen V. Compressed fulltext indexes. ACM Comput Surv. 2007;39:1.View ArticleGoogle Scholar
 Burkhardt S, Kärkkäinen J. Fast lightweight suffix array construction and checking. In: Proc. 14th symposium on combinatorial pattern matching (CPM ’03). Springer, Morelia, Michocän, Mexico; 2003. p. 55–69.Google Scholar
 Manzini G. Two space saving tricks for linear time LCP computation. In: Proc. of 9th Scandinavian workshop on algorithm theory (SWAT ’04). Humlebæk: Springer; 2004. p. 372–83.Google Scholar
 Manzini G, Ferragina P. Engineering a lightweight suffix array construction algorithm. In: Proc. 10th European symposium on algorithms (ESA). Rome: Springer; 2002. p. 698–710.Google Scholar
 Ferragina P, Gagie T, Manzini G. Lightweight data indexing and compression in external memory. In: Proc. 9th Latin American theoretical informatics symposium (LATIN ’10). Lecture Notes in Computer Science vol. 6034; 2010. p. 698–711.Google Scholar
 Ferragina P, Gagie T, Manzini G. Lightweight data indexing and compression in external memory. Algorithmica. 2011.Google Scholar
 Kärkkäinen J, Kempa D. LCP array construction in external memory. ACM J Exp Algorith. 2016;21(1):1–711722.Google Scholar
 Beller T, Zwerger M, Gog S, Ohlebusch E. Spaceefficient construction of the Burrows–Wheeler transform. In: SPIRE. Lecture Notes in Computer Science, vol. 8214. Jerusalem: Springer; 2013. p. 5–16.Google Scholar
 Kärkkäinen J, Kempa D. Engineering a lightweight external memory suffix array construction algorithm. Math Comput Sci. 2017;11(2):137–49.View ArticleGoogle Scholar
 Louza FA, Telles GP, Hoffmann S, Ciferri CDA. Generalized Enhanced Suffix array construction in external memory. Algorith Mol Biol. 2017;12(1):26–12616.View ArticleGoogle Scholar
 Vitter J. External memory algorithms and data structures: dealing with massive data. ACM Comput Surv. 2001;33(2):209–71.View ArticleGoogle Scholar
 Belazzougui D. Linear time construction of compressed text indices in compact space. In: STOC. New York: ACM; 2014. p. 148–93.Google Scholar
 Munro JI, Navarro G, Nekrich Y. Spaceefficient construction of compressed indexes in deterministic linear time. In: SODA. Barcelona: SIAM; 2017. p. 408–24.Google Scholar
 Bauer MJ, Cox AJ, Rosone G. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor Comput Sci. 2013;483:134–48.View ArticleGoogle Scholar
 Cox AX, Garofalo F, Rosone G, Sciortino M. Lightweight LCP construction for very large collections of strings. J Discrete Algorith. 2016;37:17–33.View ArticleGoogle Scholar
 Bonizzoni P, Della Vedova G, Pirola Y, Previtali M, Rizzi R. Computing the BWT and LCP array of a set of strings in external memory. CoRR: arXiv:1705.07756. 2017.
 Külekci MO, Vitter JS, Xu B. Efficient maximal repeat finding using the Burrows–Wheeler transform and wavelet tree. IEEE/ACM Trans Comput Biol Bioinform. 2012;9(2):421–9.View ArticleGoogle Scholar
 Ohlebusch E, Gog S, Kügel A. Computing matching statistics and maximal exact matches on compressed fulltext indexes. In: SPIRE. Lecture Notes in Computer Science, vol. 6393. Los Cabos: Springer; 2010. p. 347–58.Google Scholar
 Gusfield D, Landau GM, Schieber B. An efficient algorithm for the all pairs suffix–prefix problem. Inform Process Lett. 1992;41(4):181–5.View ArticleGoogle Scholar
 Ohlebusch E, Gog S. Efficient algorithms for the allpairs suffix–prefix problem and the allpairs substringprefix problem. Inform Process Lett. 2010;110(3):123–8.View ArticleGoogle Scholar
 Tustumi WHA, Gog S, Telles GP, Louza FA. An improved algorithm for the allpairs suffix–prefix problem. J Discrete Algorith. 2016;37:34–43.View ArticleGoogle Scholar
 Belazzougui D, Gagie T, Mäkinen V, Previtali M, Puglisi SJ. Bidirectional variableorder de Bruijn graphs. In: LATIN. Lecture Notes in Computer Science, vol. 9644. Ensenada: Springer; 2016. p. 164–78.Google Scholar
 Boucher C, Bowe A, Gagie T, Puglisi SJ, Sadakane K. Variableorder de Bruijn graphs. In: DCC. IEEE, Snowbird, Utah, USA; 2015. p. 383–392Google Scholar
 Bowe A, Onodera T, Sadakane K, Shibuya T. Succinct de Bruijn graphs. In: WABI. Lecture Notes in Computer Science, vol. 7534. Ljubljana: Springer; 2012. p. 225–35.Google Scholar
 Bonizzoni P, Della Vedova G, Pirola Y, Previtali M, Rizzi R. Constructing string graphs in external memory. In: WABI. Lecture Notes in Computer Science, vol. 8701. Berlin: Springer; 2014. p. 311–25.Google Scholar
 Bonizzoni P, Della Vedova G, Pirola Y, Previtali M, Rizzi R. An externalmemory algorithm for string graph construction. Algorithmica. 2017;78(2):394–424. https://doi.org/10.1007/s0045301601654.View ArticleGoogle Scholar
 Mantaci S, Restivo A, Rosone G, Sciortino M. An extension of the Burrows–Wheeler transform. Theor Comput Sci. 2007;387(3):298–312.View ArticleGoogle Scholar
 Louza FA, Gog S, Telles GP. Inducing enhanced suffix arrays for string collections. Theor Comput Sci. 2017;678:22–39.View ArticleGoogle Scholar
 Nong G. Practical lineartime O(1)workspace suffix sorting for constant alphabets. ACM Trans Inform Syst. 2013;31(3):15.View ArticleGoogle Scholar
 Egidi L, Manzini G. Lightweight BWT and LCP merging via the Gap algorithm. In: SPIRE. Lecture Notes in Computer Science, vol. 10508. Palermo: Springer; 2017. p. 176–90.Google Scholar
 Holt J, McMillan L. Merging of multistring BWTs with applications. Bioinformatics. 2014;30(24):3524–31.View ArticleGoogle Scholar
 Holt J, McMillan L. Constructing Burrows–Wheeler transforms of large string collections via merging. In: BCB. New York: ACM; 2014. p. 464–71.Google Scholar
 Knuth DE. Sorting and searching, 2nd edn. In: The art of computer programming, vol. 3. Reading: AddisonWesley; 1998. p. 780.Google Scholar
 Cox AJ, Garofalo F, Rosone G, Sciortino M. Multistring eBWT/LCP/GSA computation (commit no. 6c6a1b38bc084d35330295800f9d4a6882052c51). GitHub; 2018. https://github.com/giovannarosone/BCR_LCP_GSA.
 Bonizzoni P, Della Vedova G, Nicosia S, Previtali M, Rizzi R. bwtlcpem (commit no. a6f0144b203e5bda7af8480e9ea3a1d781ad7ba0). GitHub; 2018. https://github.com/AlgoLab/bwtlcpem.
 Louza FA, Telles GP, Hoffmann S, Ciferri CDA. egsa (commit no. 1790094e010040bef3be11e393a4f1d5408debb0). GitHub; 2018. https://github.com/felipelouza/egsa.
 Gusfield D. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge: Cambridge University Press; 1997.View ArticleGoogle Scholar
 Dementiev R, Kettner L, Sanders P. STXXL: standard template library for XXL data sets. Softw Pract Exper. 2008;38(6):589–637. https://doi.org/10.1002/spe.844.View ArticleGoogle Scholar
 Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin AV. Mummer4: a fast and versatile genome alignment system. PLoS Comput Biol. 2018;14(1):e1005944.View ArticleGoogle Scholar
 Muggli MD, Bowe A, Noyes NR, Morley PS, Belk KE, Raymond R, Gagie T, Puglisi SJ, Boucher C. Succinct colored de Bruijn graphs. Bioinformatics. 2017;33(20):3181–7.View ArticleGoogle Scholar
 Louza FA, Telles GP, Gog S, Zhao L. Computing Burrows–Wheeler similarity distributions for string collections. SPIRE. Lecture Notes in Computer Science, vol. 11147. Lima: Springer; 2018. p. 285–96.Google Scholar
 Prezza N, Pisanti N, Sciortino M, Rosone G. Detecting mutations by ebwt. In: WABI. LIPIcs, vol. 113. Schloss Dagstuhl  LeibnizZentrum fuer Informatik, Helsinki, Finland; 2018. p. 3–1315.Google Scholar
 Garofalo F, Rosone G, Sciortino M, Verzotto D. The colored longest common prefix array computed via sequential scans. SPIRE. Lecture Notes in Computer Science, vol. 11147. Lima: Springer; 2018. p. 153–67.Google Scholar