Generalized enhanced suffix array construction in external memory

Table 6 Datasets used in the experiments

Dataset	Size (GB)	Number of strings	Total length	Avg. length	Max. lcp	Avg. lcp
dna	9.85	153	10,580,043,054	69,150,608	2,282,187	1122
protein	18.68	62,148,086	20,056,474,339	323	31,815	88
gutenberg	22.32	407,864,056	23,962,356,903	59	11,946	18
enwiki	24.50	351,363,467	25,648,226,940	75	111,273	33

dna: a collection of large DNA chromosomes from organisms (Homo sapiens, Oryzias latipes, Danio rerio, Bos taurus, Mus musculus and Gallus gallus) from the Ensembl dataset (ftp://ftp.ensembl.org/pub/release-84/fasta/). We removed all occurrences of the character N (unknown) from the strings.
protein: the collection of protein sequences from UniProt/TrEMBL, release 2016_5 (http://www.ebi.ac.uk/uniprot/download-center/).
gutenberg: a collection of documents from Project Gutenberg, release 2012_09 (http://algo2.iti.kit.edu/bingmann/esais-corpus/). We processed each line of the input as a single string.
enwiki: a collection of pages from a snapshot of the English-language edition of Wikipedia, release 2016_05 (https://dumps.wikimedia.org/enwiki/20160501/). We processed each line of the input as a single string.
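
The preprocessing described in the notes above (removing N from the DNA strings, and treating each input line as one string) can be illustrated with a minimal Python sketch. This is not the authors' tooling: the function names are hypothetical, and two details are assumptions, namely that only the uppercase character N is removed and that empty lines are skipped. The sketch also does not count any string terminators, so its totals need not match the Total length column exactly.

```python
def strip_unknown_bases(sequence: str) -> str:
    """Drop every 'N' (unknown base) from a DNA sequence.

    Assumption: only the uppercase character is removed.
    """
    return sequence.replace("N", "")


def lines_as_strings(text: str) -> list[str]:
    """Treat each input line as one string of the collection.

    Assumption: empty lines are skipped.
    """
    return [line for line in text.splitlines() if line]


def collection_stats(strings: list[str]) -> dict[str, float]:
    """Number of strings, total length, and average length (no terminators)."""
    total = sum(len(s) for s in strings)
    return {
        "number_of_strings": len(strings),
        "total_length": total,
        "avg_length": total / len(strings) if strings else 0.0,
    }
```

For example, collection_stats(lines_as_strings(text)) applied to a line-oriented input such as gutenberg or enwiki would yield counterparts of the Number of strings, Total length and Avg. length columns under the assumptions stated above.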

ISSN: 1748-7188