Table 1 Collections

From: gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections

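| Collection | Size (GB) | \(\sigma\) | N. of strings (millions) | Max. len. | Avg. len. | Max. \({\mathsf {lcp}}\) | Avg. \({\mathsf {lcp}}\) |
|---|---|---|---|---|---|---|---|
| shortreads | 16.00 | 5 | 171.8 | 100 | 100 | 100 | 32.87 |
| reads | 16.00 | 6 | 57.3 | 300 | 300 | 300 | 91.29 |
| pacbio | 16.00 | 5 | 1.9 | 71,561 | 9,117 | 3,084 | 19.08 |
| pacbio.1000 | 16.00 | 5 | 17.2 | 1,000 | 1,000 | 876 | 18.67 |
| uniprot | 16.04 | 25 | 46.1 | 74,488 | 374 | 74,293 | 99.24 |
| gutenberg | 15.88 | 255 | 334.3 | 757,936 | 50 | 9,060 | 18.97 |
| random.dna | 16.00 | 4 | 16.1 | 1,048,576 | 1,048,576 | 33 | 16.18 |
| random.protein | 16.00 | 25 | 16.1 | 1,048,576 | 1,048,576 | 13 | 6.89 |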
  1. Columns 2 and 3 show the collection size (in GB) and the alphabet size \(\sigma\). Column 4 shows the number of strings (in millions). Columns 5 and 6 show the maximum and average lengths of the strings in a collection. Columns 7 and 8 show the maximum and average \({\mathsf {lcp}}\) of the strings in a collection.
  2. Collections
  3. shortreads are Illumina reads from the human genome trimmed to 100 nucleotides (http://ftp.sra.ebi.ac.uk/vol1/ERA015/ERA015743/srf);
  4. reads are Illumina HiSeq 4000 paired-end RNA-seq reads from the plant Setaria viridis trimmed to 300 nucleotides (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1942989);
  5. pacbio are PacBio RS II reads from the Triticum aestivum (wheat) genome (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5816161);
  6. pacbio.1000 are strings from pacbio trimmed to length 1,000;
  7. uniprot are protein sequences from TrEMBL downloaded on May 28, 2019 (http://www.ebi.ac.uk/uniprot/download-center);
  8. gutenberg are ASCII books in English from Project Gutenberg (http://www.gutenberg.org);
  9. random.dna was generated by sampling uniformly from the standard 4-letter DNA alphabet;
  10. random.protein was generated by sampling uniformly from the IUPAC 25-letter protein alphabet.
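
The two random collections are described only as uniform samples over a fixed alphabet. The following is a minimal Python sketch of that idea, not the authors' generator. The function names `random_collection` and `summarize`, the seed, and the tiny sizes are hypothetical, and the exact letters of the 25-symbol protein alphabet are an assumption, since the table notes do not list them. The summary covers only the alphabet size, string count, and length columns of the table; \({\mathsf {lcp}}\) statistics are not computed here.

```python
import random

DNA = "ACGT"  # standard 4-letter DNA alphabet

# Assumption: one plausible 25-letter protein alphabet
# (20 standard residues plus B, Z, X, U, O); the source does not list the letters.
PROTEIN_25 = "ACDEFGHIKLMNPQRSTVWY" + "BZXUO"


def random_collection(alphabet, n_strings, length, seed=0):
    """Return n_strings strings of the given length, each symbol drawn
    independently and uniformly from alphabet."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(n_strings)]


def summarize(strings):
    """Compute the alphabet size, string count, and length statistics."""
    lengths = [len(s) for s in strings]
    return {
        "sigma": len(set("".join(strings))),  # distinct symbols actually used
        "n_strings": len(strings),
        "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
    }


if __name__ == "__main__":
    # Toy sizes only; the collections in the table are about 16 GB each.
    collection = random_collection(DNA, n_strings=4, length=32)
    print(summarize(collection))
```

Passing `PROTEIN_25` instead of `DNA` gives the protein variant. At the scale of the table, the strings would be streamed to disk rather than held in a list.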