Table 1 Summary of the most important compressors of sequencing data

From: Data compression for sequencing data

Software name | Implementation availability (src code / binaries / libs) | Website | Lossless / lossy | Ambig. codes | Var. length reads | Speed of compr. / decompr. | Ratio | Random access | Methods | Remarks

Compressors of raw sequencing data
gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | yes | moderate / very high | low | no | LZ, Huf |
bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | yes | low / high | low | no | BWT, Huf |
7zip | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | yes | low / very high | moderate | no | LZ, AC |
BWT-SAP [21] | C++ / – / C++ | http://github.com/BEETL/BEETL | yes / no | yes | no | low / low | moderate | no | BWT, PPM | FASTA only
DSRC [17] | C++ / Lin, Win / C++, Pyt | http://sun.aei.polsl.pl/dsrc | yes / no | yes | yes | high / high | moderate | yes | LZ, Huf |
Fqzcomp [30] | C / Lin / – | http://sourceforge.net/projects/fqzcomp/ | yes / yes | no | yes | high / moderate | high | no | CM |
G-SQZ [31] | C++ / Lin / – | http://public.tgen.org/sqz | yes / no | no | no | high / moderate | low | yes | Huf |
Kung-FQ [19] | C# / Win / – | http://quicktsaf.sourceforge.net | yes / yes | no | no | moderate / moderate | moderate | no | AC, LZ, RLE |
Quip [29] | C / – / – | http://cs.washington.edu/homes/dcjones/quip | yes / no | no | no | high / high | high | no | M. models, AC |
ReCoil [20] | C++ / – / C++ | http://github.com/BEETL/BEETL | yes / no | no | no | very low / high | moderate | no | BWT, PPM | FASTA only
SCALCE+gzip [22] | C++ / – / – | http://scalce.sourceforge.net | yes / yes | no | no | moderate / high | moderate | no | AC, LZ, Huf |
Seq-DB [28] | C++ / – / – | https://bitbucket.org/mhowison/seqdb | yes / yes | no | no | very high / very high | low | yes | AC, LZ, RLE |
SeqSqueeze1 [30] | C / Lin / – | http://sourceforge.net/p/ieetaseqsqueeze/wiki/Home/ | yes / no | no | yes | very low / very low | high | no | CM |

Compressors of reference genome alignment data
gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | N/A | low / very high | low | no | LZ, Huf |
bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | N/A | low / high | low | no | BWT, Huf |
7z | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | N/A | low / very high | moderate | no | LZ, AC |
BAM [32] | C++ / many / many | http://samtools.sourceforge.net | yes / no | yes | N/A | moderate / high | moderate | yes | LZ, Huf |
CRAM [33] | Java / many / Java | http://www.ebi.ac.uk/ena/about/cram_toolkit | yes / yes | yes | N/A | moderate / moderate | moderate | yes | Huf, Gol, diff. |
Quip [29] | C / – / – | http://cs.washington.edu/homes/dcjones/quip | yes / no | no | N/A | high / high | high | no | M. models, AC |
SAMZIP+rar [34] | C / – / – | http://www.plosone.org/article/info:doi/10.1371/journal.pone.0028251 | yes / no | yes | N/A | moderate / high | moderate | no | RLE, LZ, Huf |

Compressors of single genome sequences
gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | N/A | moderate / very high | low | no | LZ, Huf |
bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | N/A | low / high | low | no | BWT, Huf |
7z | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | N/A | low / very high | moderate | no | LZ, AC |
dna3 [35] | C / – / – | http://people.unipmn.it/manzini/dnacorpus/ | yes / no | no | N/A | low / low | moderate | no | LZ, PPM |
FCM-M [36] | C / – / – |  | yes / no | no | N/A | very low / very low | moderate | no | M. models |
XM [37] | Java / many / Java | http://ftp.infotech.monash.edu.au/software/DNAcompress-XM | yes / no | yes | N/A | very low / very low | moderate | no | M. models, AC |

Compressors of genome collections
gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | N/A | low / very high | very low | no | LZ, Huf |
bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | N/A | low / high | very low | no | BWT, Huf |
7z | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | N/A | low / very high | high | no | LZ, AC | chr-ordered
ABRC [38] | C++ / Lin, Win / C++ | http://www2.informatik.hu-berlin.de/~wandelt/blockcompression/ | yes / no | yes | N/A | high / very high | very high | yes | LZ, Huf |
GDC [39] | C++ / Lin, Win / C++ | http://sun.aei.polsl.pl/gdc | yes / no | yes | N/A | high / very high | very high | yes | LZ, Huf |
GReEn [40] | C / – / – |  | yes / no | yes | N/A | high / high | high | no | M. models, AC |
GRS [41] | C / Lin / – | http://gmdd.shgmo.org/Computational-Biology/GRS/ | yes / no | yes | N/A | moderate / low | high | no | LCS, Huf |
RLZ [42] | C++ / – / – | http://www.genomics.csse.unimelb.edu.au/product-rlz.php | yes / no | yes | N/A | moderate / very high | high | no | LZ, Gol |

  1. Abbreviations used in the table: src = source codes, libs = libraries, Lin = Linux, Win = Windows, Pyt = Python, exe = binary executables, AC = arithmetic coding (a statistical coding method [12]), CM = context mixing for arithmetic coding [12], diff = differential coding (paradigm: store only changes between sequences), Gol = Golomb (a statistical coding method [12]), Huf = Huffman, LCS = longest common subsequence (a measure of similarity of sequences [43]), LZ = an algorithm from the Ziv–Lempel family, M. models = Markov models [12], PPM = prediction by partial matching (an efficient general-purpose compressor [12]). “Ambig. codes” means the ability to compress DNA symbols other than {A, C, G, T, N}. “chr-ordered” for 7z and genome collections means that the input (human) genomes were split into chromosomes and ordered according to them before the actual compression. In this way several chromosomes fit the 7z LZ-buffer, which is beneficial for the compression.
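
The “chr-ordered” remark can be illustrated with a small experiment. The sketch below is not the procedure used to produce the table: it uses Python's zlib (whose LZ window is 32 KiB) as a stand-in for the much larger 7z LZ-buffer, and it works on synthetic random sequences with substitution-only variation. The function names, sequence lengths, and mutation rate are all hypothetical choices made for illustration. The point it tries to show is that when a whole genome is longer than the LZ window, placing the same chromosome of different individuals next to each other keeps similar data within reach of the window.

```python
import random
import zlib

BASES = "ACGT"


def random_chromosome(length, rng):
    """Generate a synthetic chromosome as a random string over A, C, G, T."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(seq, rate, rng):
    """Return a copy of seq with point substitutions at roughly `rate` of positions."""
    out = list(seq)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.choice(BASES)
    return "".join(out)


def compressed_size(chunks):
    """Concatenate the chunks and return the zlib-compressed size in bytes."""
    data = "".join(chunks).encode("ascii")
    return len(zlib.compress(data, 9))


def compare_orderings(n_genomes=4, n_chromosomes=3, chrom_len=20_000, seed=0):
    """Compress the same synthetic collection in genome order and in chromosome order.

    Assumption: one genome (3 x 20,000 bases) exceeds zlib's 32 KiB window,
    while a single chromosome (20,000 bases) fits inside it.
    """
    rng = random.Random(seed)
    reference = [random_chromosome(chrom_len, rng) for _ in range(n_chromosomes)]
    # Each "individual" is the reference with a few random substitutions.
    genomes = [[mutate(c, 0.001, rng) for c in reference] for _ in range(n_genomes)]

    # Genome order: g1c1 g1c2 g1c3 g2c1 ... (similar chromosomes are far apart).
    genome_ordered = [c for g in genomes for c in g]
    # Chromosome order: g1c1 g2c1 g3c1 g4c1 g1c2 ... (similar chromosomes are adjacent).
    chr_ordered = [g[k] for k in range(n_chromosomes) for g in genomes]

    return compressed_size(genome_ordered), compressed_size(chr_ordered)


if __name__ == "__main__":
    by_genome, by_chromosome = compare_orderings()
    print("genome-ordered bytes:    ", by_genome)
    print("chromosome-ordered bytes:", by_chromosome)
```

On this synthetic input the chromosome-ordered stream should come out smaller, because each copy of a chromosome lies within one window length of the previous copy. Real genomes contain insertions, deletions, and rearrangements, so the size of the effect there depends on the data and on the compressor's buffer size.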