gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections

Table 1 Collections

Collection	Size (GB)	\(\sigma\)	N. of strings (millions)	Max. len.	Avg. len.	Max. \({\mathsf {lcp}}\)	Avg. \({\mathsf {lcp}}\)
shortreads	16.00	5	171.8	100	100	100	32.87
reads	16.00	6	57.3	300	300	300	91.29
pacbio	16.00	5	1.9	71,561	9,117	3,084	19.08
pacbio.1000	16.00	5	17.2	1,000	1,000	876	18.67
uniprot	16.04	25	46.1	74,488	374	74,293	99.24
gutenberg	15.88	255	334.3	757,936	50	9,060	18.97
random.dna	16.00	4	16.1	1,048,576	1,048,576	33	16.18
random.protein	16.00	25	16.1	1,048,576	1,048,576	13	6.89

Columns 2 and 3 show the collection size (in GB) and the alphabet size. Column 4 shows the number of strings (in millions). Columns 5 and 6 show the maximum and average lengths of strings in a collection. Columns 7 and 8 show the maximum and average \({\mathsf {lcp}}\) of strings in a collection.
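To make the column definitions concrete, the following is a minimal sketch of how such statistics could be computed for a tiny in-memory collection. It is not the tool's implementation. It assumes that \({\mathsf {lcp}}\) refers to the longest common prefix between lexicographically adjacent suffixes of the collection, that comparisons stop at the end of each string, and that the average is taken over all suffixes; the function name `collection_stats` is hypothetical.

```python
def collection_stats(strings):
    """Table-style statistics for a small in-memory collection of strings."""
    # Every suffix of every string, tagged with the index of its source string.
    # The empty suffix stands in for the end-of-string terminator.
    suffixes = [(s[j:], i) for i, s in enumerate(strings) for j in range(len(s) + 1)]
    # Python orders a proper prefix before any of its extensions, which mimics a
    # terminator smaller than every symbol; ties are broken by string index.
    suffixes.sort()

    def common_prefix(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    # lcp[0] is 0 by convention; lcp[k] compares suffix k with suffix k - 1.
    lcp = [0] + [
        common_prefix(suffixes[k - 1][0], suffixes[k][0])
        for k in range(1, len(suffixes))
    ]

    return {
        "sigma": len(set("".join(strings))),  # alphabet size, terminator excluded
        "n_strings": len(strings),
        "max_len": max(len(s) for s in strings),
        "avg_len": sum(len(s) for s in strings) / len(strings),
        "max_lcp": max(lcp),
        "avg_lcp": sum(lcp) / len(lcp),
    }


if __name__ == "__main__":
    print(collection_stats(["banana", "ananas"]))
```

This naive version materializes every suffix, so its memory use is quadratic in the string lengths; it is only meant to illustrate the definitions on toy inputs, not to process collections of the sizes listed in the table.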
Collections
shortreads are Illumina reads from the human genome trimmed to 100 nucleotides (http://ftp.sra.ebi.ac.uk/vol1/ERA015/ERA015743/srf);
reads are Illumina HiSeq 4000 paired-end RNA-seq reads from the plant Setaria viridis trimmed to 300 nucleotides (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1942989);
pacbio are PacBio RS II reads from the Triticum aestivum (wheat) genome (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5816161);
pacbio.1000 are strings from pacbio trimmed to length 1,000;
uniprot are protein sequences from TrEMBL downloaded on May 28, 2019 (http://www.ebi.ac.uk/uniprot/download-center);
gutenberg are ASCII books in English from Project Gutenberg (http://www.gutenberg.org);
random.dna was generated with even sampling probability on the standard 4-letter alphabet;
random.protein was generated with even sampling probability on the IUPAC 25-letter alphabet.
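The random collections described above can be imitated at small scale with uniform sampling. The sketch below is an illustration only, not the generator used for the experiments: the function name `random_collection`, its parameters, the fixed seed, and the choice of "ACGT" as the 4-letter alphabet are all assumptions.

```python
import random


def random_collection(n_strings, length, alphabet="ACGT", seed=0):
    """Generate n_strings strings of the given length, sampling symbols uniformly."""
    # A seeded generator keeps the output reproducible between runs.
    rng = random.Random(seed)
    return [
        "".join(rng.choices(alphabet, k=length))  # each symbol is equally likely
        for _ in range(n_strings)
    ]


if __name__ == "__main__":
    for s in random_collection(n_strings=3, length=20):
        print(s)
```

Passing a different `alphabet` string, for example a 25-letter protein alphabet, gives the analogue of random.protein; the exact letters used in the original collection are not stated here, so none are assumed.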

ISSN: 1748-7188