Table 6 Datasets used in the experiments

From: Generalized enhanced suffix array construction in external memory

| Dataset | Size (GB) | Number of strings | Total length | Avg. length | Max. lcp | Avg. lcp |
|---|---:|---:|---:|---:|---:|---:|
| dna | 9.85 | 153 | 10,580,043,054 | 69,150,608 | 2,282,187 | 1122 |
| protein | 18.68 | 62,148,086 | 20,056,474,339 | 323 | 31,815 | 88 |
| gutenberg | 22.32 | 407,864,056 | 23,962,356,903 | 59 | 11,946 | 18 |
| enwiki | 24.50 | 351,363,467 | 25,648,226,940 | 75 | 111,273 | 33 |

  1. dna: a collection of large DNA chromosomes from organisms (Homo sapiens, Oryzias latipes, Danio rerio, Bos taurus, Mus musculus and Gallus gallus) from the Ensembl dataset (ftp://ftp.ensembl.org/pub/release-84/fasta/). We removed all occurrences of the character N (unknown) from the strings.
  2. protein: the collection of protein sequences from Uniprot/TrEMBL, release 2016_5 (http://www.ebi.ac.uk/uniprot/download-center/).
  3. gutenberg: a collection of documents from the Gutenberg Project, release 2012_09 (http://algo2.iti.kit.edu/bingmann/esais-corpus/). We processed each line of the input as a single string.
  4. enwiki: a collection of pages from a snapshot of the English language edition of Wikipedia, release 2016_05 (https://dumps.wikimedia.org/enwiki/20160501/). We processed each line of the input as a single string.
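
The preprocessing steps described in the notes above (dropping the character N from DNA strings, treating each input line as one string) and the basic columns of the table can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names `strip_unknown`, `lines_as_strings` and `summarize` are hypothetical, and the sketch assumes that only uppercase N is removed, that empty lines are skipped, and that string terminators are not counted in the total length.

```python
from typing import Dict, Iterable, List


def strip_unknown(sequence: str, unknown: str = "N") -> str:
    """Remove every occurrence of the unknown-base character.

    Assumption: only the exact character given (uppercase N by default)
    is removed; lowercase 'n' is left untouched.
    """
    return sequence.replace(unknown, "")


def lines_as_strings(text: str) -> List[str]:
    """Treat each line of the input as a single string.

    Assumption: empty lines are skipped rather than kept as empty strings.
    """
    return [line for line in text.splitlines() if line]


def summarize(strings: Iterable[str]) -> Dict[str, float]:
    """Compute number of strings, total length and average length.

    Assumption: lengths count characters only, with no terminator symbols.
    """
    lengths = [len(s) for s in strings]
    count = len(lengths)
    total = sum(lengths)
    return {
        "number_of_strings": count,
        "total_length": total,
        "avg_length": total / count if count else 0.0,
    }


if __name__ == "__main__":
    dna = strip_unknown("ACGNNTNA")
    docs = lines_as_strings("first line\n\nsecond\n")
    print(dna, docs, summarize(docs))
```

As a sanity check on the table itself, the average length of the dna row follows from the other two columns: `round(10_580_043_054 / 153)` gives 69,150,608.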