| Collection | Size (GB) | \(\sigma\) | N. of strings (millions) | Max. len. | Avg. len. | Max. \({\mathsf {lcp}}\) | Avg. \({\mathsf {lcp}}\) |
|---|---|---|---|---|---|---|---|
| shortreads | 16.00 | 5 | 171.8 | 100 | 100 | 100 | 32.87 |
| reads | 16.00 | 6 | 57.3 | 300 | 300 | 300 | 91.29 |
| pacbio | 16.00 | 5 | 1.9 | 71,561 | 9,117 | 3,084 | 19.08 |
| pacbio.1000 | 16.00 | 5 | 17.2 | 1,000 | 1,000 | 876 | 18.67 |
| uniprot | 16.04 | 25 | 46.1 | 74,488 | 374 | 74,293 | 99.24 |
| gutenberg | 15.88 | 255 | 334.3 | 757,936 | 50 | 9,060 | 18.97 |
| random.dna | 16.00 | 4 | 16.1 | 1,048,576 | 1,048,576 | 33 | 16.18 |
| random.protein | 16.00 | 25 | 16.1 | 1,048,576 | 1,048,576 | 13 | 6.89 |

- Columns 2 and 3 show the collection size (in GB) and the alphabet size \(\sigma\). Column 4 shows the number of strings (in millions). Columns 5 and 6 show the maximum and average string lengths in each collection. Columns 7 and 8 show the maximum and average \({\mathsf {lcp}}\) values in each collection.
- Collections
- shortreads are Illumina reads from a human genome, trimmed to 100 nucleotides (http://ftp.sra.ebi.ac.uk/vol1/ERA015/ERA015743/srf);
- reads are Illumina HiSeq 4000 paired-end RNA-seq reads from the plant Setaria viridis, trimmed to 300 nucleotides (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1942989);
- pacbio are PacBio RS II reads from the Triticum aestivum (wheat) genome (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5816161);
- pacbio.1000 are strings from pacbio trimmed to length 1,000;
- uniprot are protein sequences from TrEMBL downloaded on May 28, 2019 (http://www.ebi.ac.uk/uniprot/download-center);
- gutenberg are ASCII books in English from Project Gutenberg (http://www.gutenberg.org);
- random.dna was generated by sampling uniformly at random from the standard 4-letter alphabet;
- random.protein was generated by sampling uniformly at random from the IUPAC 25-letter alphabet.
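
To make the table columns concrete, the following is a minimal sketch of how such statistics could be computed for a toy collection. It is not the procedure used to produce the table. It assumes that the \({\mathsf {lcp}}\) values are taken between lexicographically adjacent suffixes of the strings in the collection, which the text above does not state explicitly, and it counts only the distinct symbols that occur in the strings as the alphabet size. The function names `collection_stats` and `random_collection` are hypothetical, and the naive suffix sort is only practical for very small inputs.

```python
import random
from statistics import mean


def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def collection_stats(strings):
    """Toy version of the table columns for a small, non-empty collection.

    Assumption: lcp values are measured between lexicographically
    adjacent suffixes of the collection (naive sort, small inputs only).
    """
    lengths = [len(s) for s in strings]
    # Materialize every non-empty suffix of every string and sort them.
    suffixes = sorted(s[i:] for s in strings for i in range(len(s)))
    lcps = [lcp(a, b) for a, b in zip(suffixes, suffixes[1:])]
    return {
        "n_strings": len(strings),
        # Assumption: alphabet size = number of distinct symbols seen.
        "alphabet_size": len(set("".join(strings))),
        "max_len": max(lengths),
        "avg_len": mean(lengths),
        "max_lcp": max(lcps, default=0),
        "avg_lcp": mean(lcps) if lcps else 0.0,
    }


def random_collection(n_strings, length, alphabet="ACGT", seed=0):
    """Strings whose symbols are drawn uniformly at random from alphabet."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(alphabet) for _ in range(length))
        for _ in range(n_strings)
    ]


if __name__ == "__main__":
    print(collection_stats(["banana", "ban"]))
    print(collection_stats(random_collection(n_strings=4, length=32)))
```

For collections of the sizes listed in the table, materializing and sorting all suffixes in memory is infeasible; the sketch only illustrates what each column measures under the stated assumptions.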