Collection | Size (GB) | \(\sigma\) | N. of strings (millions) | Max. len. | Avg. len. | Max. \({\mathsf {lcp}}\) | Avg. \({\mathsf {lcp}}\) |
---|---|---|---|---|---|---|---|
shortreads | 16.00 | 5 | 171.8 | 100 | 100 | 100 | 32.87 |
reads | 16.00 | 6 | 57.3 | 300 | 300 | 300 | 91.29 |
pacbio | 16.00 | 5 | 1.9 | 71,561 | 9,117 | 3,084 | 19.08 |
pacbio.1000 | 16.00 | 5 | 17.2 | 1,000 | 1,000 | 876 | 18.67 |
uniprot | 16.04 | 25 | 46.1 | 74,488 | 374 | 74,293 | 99.24 |
gutenberg | 15.88 | 255 | 334.3 | 757,936 | 50 | 9,060 | 18.97 |
random.dna | 16.00 | 4 | 16.1 | 1,048,576 | 1,048,576 | 33 | 16.18 |
random.protein | 16.00 | 25 | 16.1 | 1,048,576 | 1,048,576 | 13 | 6.89 |
- Columns 2 and 3 show the collection size (in GB) and the alphabet size \(\sigma\). Column 4 shows the number of strings (in millions). Columns 5 and 6 show the maximum and average string lengths in a collection. Columns 7 and 8 show the maximum and average \({\mathsf {lcp}}\) values in a collection.
- Collections
- shortreads are Illumina reads from the human genome trimmed to 100 nucleotides (http://ftp.sra.ebi.ac.uk/vol1/ERA015/ERA015743/srf);
- reads are Illumina HiSeq 4000 paired-end RNA-seq reads from the plant Setaria viridis trimmed to 300 nucleotides (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1942989);
- pacbio are PacBio RS II reads from the Triticum aestivum (wheat) genome (http://www.trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5816161);
- pacbio.1000 are strings from pacbio trimmed to length 1,000;
- uniprot are protein sequences from TrEMBL downloaded on May 28, 2019 (http://www.ebi.ac.uk/uniprot/download-center);
- gutenberg are ASCII books in English from Project Gutenberg (http://www.gutenberg.org);
- random.dna was generated by sampling symbols uniformly from the standard 4-letter DNA alphabet;
- random.protein was generated by sampling symbols uniformly from the IUPAC 25-letter protein alphabet.
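
To make the table columns concrete, the following is a minimal Python sketch that builds a tiny uniformly sampled collection (in the spirit of random.dna) and computes the same statistics on it. It rests on assumptions not stated in the notes above: \({\mathsf {lcp}}\) is taken to mean the longest common prefix between lexicographically adjacent suffixes of the collection, with no suffix extending past the end of its own string. The function names `random_collection` and `collection_stats` are hypothetical, and the naive suffix sort is only suitable for toy inputs, not for the 16 GB collections listed here.

```python
import random
from statistics import mean


def random_collection(alphabet, n_strings, length, seed=0):
    """Draw n_strings strings, each symbol sampled uniformly from alphabet."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(n_strings)]


def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def collection_stats(strings):
    """Toy version of the table columns (assumed lcp definition, see lead-in)."""
    lengths = [len(s) for s in strings]
    # Every suffix of every string, sorted lexicographically.
    # This uses quadratic space, so it is for illustration only.
    suffixes = sorted(s[i:] for s in strings for i in range(len(s)))
    lcps = [common_prefix_len(a, b) for a, b in zip(suffixes, suffixes[1:])]
    return {
        "sigma": len(set("".join(strings))),  # distinct symbols actually used
        "n_strings": len(strings),
        "max_len": max(lengths),
        "avg_len": mean(lengths),
        "max_lcp": max(lcps, default=0),
        "avg_lcp": mean(lcps) if lcps else 0.0,
    }


if __name__ == "__main__":
    toy = random_collection("ACGT", n_strings=4, length=50, seed=1)
    print(collection_stats(toy))
```

On uniformly random input the adjacent-suffix \({\mathsf {lcp}}\) values stay small relative to the string length, which matches the pattern in the random.dna and random.protein rows; repetitive collections such as uniprot behave very differently.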