Table 6 Datasets used in the experiments

From: Generalized enhanced suffix array construction in external memory

| Dataset | Size (GB) | Number of strings | Total length | Avg. length | Max. lcp | Avg. lcp |
|---|---|---|---|---|---|---|
| dna | 9.85 | 153 | 10,580,043,054 | 69,150,608 | 2,282,187 | 1,122 |
| protein | 18.68 | 62,148,086 | 20,056,474,339 | 323 | 31,815 | 88 |
| gutenberg | 22.32 | 407,864,056 | 23,962,356,903 | 59 | 11,946 | 18 |
| enwiki | 24.50 | 351,363,467 | 25,648,226,940 | 75 | 111,273 | 33 |

  1. dna: a collection of large DNA chromosomes from six organisms (Homo sapiens, Oryzias latipes, Danio rerio, Bos taurus, Mus musculus and Gallus gallus) taken from the Ensembl dataset (ftp://ftp.ensembl.org/pub/release-84/fasta/). We removed all occurrences of the character N (unknown base) from the strings.
  2. protein: the collection of protein sequences from Uniprot/TrEMBL, release 2016_5 (http://www.ebi.ac.uk/uniprot/download-center/).
  3. gutenberg: a collection of documents from Project Gutenberg, release 2012_09 (http://algo2.iti.kit.edu/bingmann/esais-corpus/). We processed each line of the input as a single string.
  4. enwiki: a collection of pages from a snapshot of the English-language edition of Wikipedia, release 2016_05 (https://dumps.wikimedia.org/enwiki/20160501/). We processed each line of the input as a single string.
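
The preprocessing described in the notes above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' actual tooling: the function names `fasta_strings`, `line_strings` and `collection_stats` are hypothetical, only uppercase `N` is removed, empty strings are skipped, and "Total length" is taken to be the plain sum of string lengths without terminator symbols.

```python
from typing import Iterable, Iterator, List, Tuple


def fasta_strings(lines: Iterable[str]) -> Iterator[str]:
    """Yield one string per FASTA record with every 'N' removed (dna note).

    Assumption: records that become empty after removing 'N' are skipped.
    """
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A header line closes the previous record, if there was one.
            seq = "".join(chunks)
            if seq:
                yield seq
            chunks = []
        elif line:
            chunks.append(line.replace("N", ""))
    # Emit the last record, which has no header after it.
    seq = "".join(chunks)
    if seq:
        yield seq


def line_strings(lines: Iterable[str]) -> Iterator[str]:
    """Yield each input line as a single string (gutenberg and enwiki notes).

    Assumption: empty lines are skipped.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if line:
            yield line


def collection_stats(strings: Iterable[str]) -> Tuple[int, int, float]:
    """Return (number of strings, total length, average length)."""
    count = 0
    total = 0
    for s in strings:
        count += 1
        total += len(s)
    avg = total / count if count else 0.0
    return count, total, avg


# Tiny in-memory example; real inputs would be streamed from files.
example_fasta = [">chr1", "ACGTNN", "NACG", ">chr2", "GGN"]
print(collection_stats(fasta_strings(example_fasta)))
```

Because all three functions consume their input as a stream, the same sketch would work on inputs larger than main memory when the lines are read lazily from a file object.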