Table 1 Summary of the most important compressors of sequencing data

From: Data compression for sequencing data

| Software name | Implementation availability (src code / binaries / libs) | Website | Lossless / lossy | Ambig. codes | Var. length reads | Speed of compr. / decompr. | Ratio | Random access | Methods | Remarks |
|---|---|---|---|---|---|---|---|---|---|---|
| Compressors of raw sequencing data | | | | | | | | | | |
| gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | yes | moderate / very high | low | no | LZ, Huf | |
| bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | yes | low / high | low | no | BWT, Huf | |
| 7zip | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | yes | low / very high | moderate | no | LZ, AC | |
| BWT-SAP [21] | C++ / – / C++ | http://github.com/BEETL/BEETL | yes / no | yes | no | low / low | moderate | no | BWT, PPM | FASTA only |
| DSRC [17] | C++ / Lin, Win / C++, Pyt | http://sun.aei.polsl.pl/dsrc | yes / no | yes | yes | high / high | moderate | yes | LZ, Huf | |
| Fqzcomp [30] | C / Lin / – | http://sourceforge.net/projects/fqzcomp/ | yes / yes | no | yes | high / moderate | high | no | CM | |
| G-SQZ [31] | C++ / Lin / – | http://public.tgen.org/sqz | yes / no | no | no | high / moderate | low | yes | Huf | |
| Kung-FQ [19] | C# / Win / – | http://quicktsaf.sourceforge.net | yes / yes | no | no | moderate / moderate | moderate | no | AC, LZ, RLE | |
| Quip [29] | C / – / – | http://cs.washington.edu/homes/dcjones/quip | yes / no | no | no | high / high | high | no | M. models, AC | |
| ReCoil [20] | C++ / – / C++ | http://github.com/BEETL/BEETL | yes / no | no | no | very low / high | moderate | no | BWT, PPM | FASTA only |
| SCALCE+gzip [22] | C++ / – / – | http://scalce.sourceforge.net | yes / yes | no | no | moderate / high | moderate | no | AC, LZ, Huf | |
| Seq-DB [28] | C++ / – / – | https://bitbucket.org/mhowison/seqdb | yes / yes | no | no | very high / very high | low | yes | AC, LZ, RLE | |
| SeqSqueeze1 [30] | C / Lin / – | http://sourceforge.net/p/ieetaseqsqueeze/wiki/Home/ | yes / no | no | yes | very low / very low | high | no | CM | |
| Compressors of reference genome alignment data | | | | | | | | | | |
| gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | N/A | low / very high | low | no | LZ, Huf | |
| bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | N/A | low / high | low | no | BWT, Huf | |
| 7z | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | N/A | low / very high | moderate | no | LZ, AC | |
| BAM [32] | C++ / many / many | http://samtools.sourceforge.net | yes / no | yes | N/A | moderate / high | moderate | yes | LZ, Huf | |
| CRAM [33] | Java / many / Java | http://www.ebi.ac.uk/ena/about/cram_toolkit | yes / yes | yes | N/A | moderate / moderate | moderate | yes | Huf, Gol, diff. | |
| Quip [29] | C / – / – | http://cs.washington.edu/homes/dcjones/quip | yes / no | no | N/A | high / high | high | no | M. models, AC | |
| SAMZIP+rar [34] | C / – / – | http://www.plosone.org/article/info:doi/10.1371/journal.pone.0028251 | yes / no | yes | N/A | moderate / high | moderate | no | RLE, LZ, Huf | |
| Compressors of single genome sequences | | | | | | | | | | |
| gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | N/A | moderate / very high | low | no | LZ, Huf | |
| bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | N/A | low / high | low | no | BWT, Huf | |
| 7z | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | N/A | low / very high | moderate | no | LZ, AC | |
| dna3 [35] | C / – / – | http://people.unipmn.it/manzini/dnacorpus/ | yes / no | no | N/A | low / low | moderate | no | LZ, PPM | |
| FCM-M [36] | C / – / – | | yes / no | no | N/A | very low / very low | moderate | no | M. models | |
| XM [37] | Java / many / Java | http://ftp.infotech.monash.edu.au/software/DNAcompress-XM | yes / no | yes | N/A | very low / very low | moderate | no | M. models, AC | |
| Compressors of genome collections | | | | | | | | | | |
| gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | N/A | low / very high | very low | no | LZ, Huf | |
| bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | N/A | low / high | very low | no | BWT, Huf | |
| 7z | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | N/A | low / very high | high | no | LZ, AC | chr-ordered |
| ABRC [38] | C++ / Lin, Win / C++ | http://www2.informatik.hu-berlin.de/~wandelt/blockcompression/ | yes / no | yes | N/A | high / very high | very high | yes | LZ, Huf | |
| GDC [39] | C++ / Lin, Win / C++ | http://sun.aei.polsl.pl/gdc | yes / no | yes | N/A | high / very high | very high | yes | LZ, Huf | |
| GReEn [40] | C / – / – | | yes / no | yes | N/A | high / high | high | no | M. models, AC | |
| GRS [41] | C / Lin / – | http://gmdd.shgmo.org/Computational-Biology/GRS/ | yes / no | yes | N/A | moderate / low | high | no | LCS, Huf | |
| RLZ [42] | C++ / – / – | http://www.genomics.csse.unimelb.edu.au/product-rlz.php | yes / no | yes | N/A | moderate / very high | high | no | LZ, Gol | |
 
  1. Abbreviations used in the table: src = source code; libs = libraries; Lin = Linux; Win = Windows; Pyt = Python; exe = binary executables; AC = arithmetic coding (a statistical coding method [12]); CM = context mixing for arithmetic coding [12]; diff = differential coding (paradigm: store only changes between sequences); Gol = Golomb (a statistical coding method [12]); Huf = Huffman; LCS = longest common subsequence (a measure of similarity of sequences [43]); LZ = an algorithm from the Ziv–Lempel family; M. models = Markov models [12]; PPM = prediction by partial matching (an efficient general-purpose compressor [12]). “Ambig. codes” means the ability to compress DNA symbols other than {A, C, G, T, N}. “chr-ordered” for 7z and genome collections means that the input (human) genomes were split into chromosomes and ordered according to them before the actual compression. In this way several chromosomes fit into the 7z LZ-buffer, which is beneficial for the compression.
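
The effect of the “chr-ordered” layout can be illustrated with a small sketch. It is not the procedure used for the table: it assumes synthetic toy genomes, uses Python's standard `lzma` module as a stand-in for 7z, and deliberately sets a small dictionary (`dict_size`) to play the role of the LZ-buffer. All function names (`make_collection`, `genome_ordered`, `chr_ordered`, `compressed_size`) are hypothetical and introduced only for illustration.

```python
import lzma
import random


def make_collection(n_genomes=4, n_chroms=3, chrom_len=20000, seed=0):
    """Build a toy collection: every genome is a lightly mutated copy of one base genome.

    Returns a list of genomes; each genome is a list of
    (genome_index, chromosome_index, sequence) records.
    """
    rng = random.Random(seed)
    base = ["".join(rng.choice("ACGT") for _ in range(chrom_len))
            for _ in range(n_chroms)]
    collection = []
    for g in range(n_genomes):
        genome = []
        for c, seq in enumerate(base):
            bases = list(seq)
            # Overwrite about 0.1% of positions with a random base.
            for _ in range(chrom_len // 1000):
                bases[rng.randrange(chrom_len)] = rng.choice("ACGT")
            genome.append((g, c, "".join(bases)))
        collection.append(genome)
    return collection


def genome_ordered(collection):
    """Concatenate genome by genome: copies of a chromosome lie one whole genome apart."""
    return "".join(seq for genome in collection for (_, _, seq) in genome).encode()


def chr_ordered(collection):
    """Concatenate chromosome by chromosome: copies of a chromosome are adjacent."""
    records = [rec for genome in collection for rec in genome]
    records.sort(key=lambda r: (r[1], r[0]))  # chromosome first, then genome
    return "".join(seq for (_, _, seq) in records).encode()


def compressed_size(data, dict_size):
    """Compress with LZMA2 using an explicit dictionary (LZ window) size in bytes."""
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


if __name__ == "__main__":
    collection = make_collection()
    window = 1 << 15  # 32 KiB: smaller than one toy genome (60 kB), larger than one chromosome (20 kB)
    print("genome-ordered:", compressed_size(genome_ordered(collection), window), "bytes")
    print("chr-ordered:   ", compressed_size(chr_ordered(collection), window), "bytes")
```

In this toy setting the window is chosen so that the previous copy of a chromosome lies within reach only in the chromosome-ordered layout; real genomes and the real 7z buffer are several orders of magnitude larger, but the same reasoning applies.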