Table 1 Collections

From: gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections

| Collection | Size (GB) | \(\sigma\) | N. of strings (millions) | Max. len. | Avg. len. | Max. \({\mathsf {lcp}}\) | Avg. \({\mathsf {lcp}}\) |
|---|---|---|---|---|---|---|---|
| shortreads | 16.00 | 5 | 171.8 | 100 | 100 | 100 | 32.87 |
| reads | 16.00 | 6 | 57.3 | 300 | 300 | 300 | 91.29 |
| pacbio | 16.00 | 5 | 1.9 | 71,561 | 9117 | 3084 | 19.08 |
| pacbio.1000 | 16.00 | 5 | 17.2 | 1,000 | 1000 | 876 | 18.67 |
| uniprot | 16.04 | 25 | 46.1 | 74,488 | 374 | 74,293 | 99.24 |
| gutenberg | 15.88 | 255 | 334.3 | 757,936 | 50 | 9060 | 18.97 |
| random.dna | 16.00 | 4 | 16.1 | 1,048,576 | 1,048,576 | 33 | 16.18 |
| random.protein | 16.00 | 25 | 16.1 | 1,048,576 | 1,048,576 | 13 | 6.89 |

  1. Columns 2 and 3 show the collection size (in GB) and the alphabet size \(\sigma\). Column 4 shows the number of strings (in millions). Columns 5 and 6 show the maximum and average lengths of the strings in a collection. Columns 7 and 8 show the maximum and average \({\mathsf {lcp}}\) of the strings in a collection.
  2. Collections
  3. shortreads are Illumina reads from the human genome trimmed to 100 nucleotides (http://ftp.sra.ebi.ac.uk/vol1/ERA015/ERA015743/srf);
  4. reads are Illumina HiSeq 4000 paired-end RNA-seq reads from the plant Setaria viridis trimmed to 300 nucleotides (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1942989);
  5. pacbio are PacBio RS II reads from the Triticum aestivum (wheat) genome (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5816161);
  6. pacbio.1000 are strings from pacbio trimmed to length 1,000;
  7. uniprot are protein sequences from TrEMBL downloaded on May 28, 2019 (http://www.ebi.ac.uk/uniprot/download-center);
  8. gutenberg are ASCII books in English from Project Gutenberg (http://www.gutenberg.org);
  9. random.dna was generated with uniform sampling probability on the standard 4-letter alphabet;
  10. random.protein was generated with uniform sampling probability on the IUPAC 25-letter alphabet.
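
As an illustration of how the columns of Table 1 could be computed, the following is a minimal Python sketch for a toy collection. It is not the procedure used by gsufsort: the function names (`random_collection`, `collection_stats`) are hypothetical, the suffixes are sorted naively, and the lcp statistics are assumed to be taken over adjacent pairs of sorted non-empty suffixes, ignoring separators and terminators. The naive sort is only practical for very small inputs.

```python
import random


def random_collection(n_strings, length, alphabet="ACGT", seed=0):
    """Draw every symbol independently and uniformly from `alphabet`."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet) for _ in range(length))
            for _ in range(n_strings)]


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def collection_stats(strings):
    """Toy version of the Table 1 columns (assumes a non-empty collection).

    Assumption: lcp values are measured between lexicographically adjacent
    non-empty suffixes of all strings; separators are not modelled.
    """
    # Naive generalized suffix sorting: materialize and sort every suffix.
    suffixes = sorted(s[i:] for s in strings for i in range(len(s)))
    lcps = [common_prefix_len(a, b) for a, b in zip(suffixes, suffixes[1:])]
    lengths = [len(s) for s in strings]
    return {
        "sigma": len(set("".join(strings))),  # distinct symbols observed
        "n_strings": len(strings),
        "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_lcp": max(lcps, default=0),
        "avg_lcp": sum(lcps) / len(lcps) if lcps else 0.0,
    }


if __name__ == "__main__":
    toy = random_collection(n_strings=4, length=8)
    print(collection_stats(toy))
```

Passing a 25-symbol string as `alphabet` gives the analogous sketch for a protein-like collection. For collections of the sizes listed in Table 1, a linear-time suffix sorting tool is needed instead of the naive sort.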