Data compression for sequencing data

      Sebastian Deorowicz¹ and Szymon Grabowski² (email author; contributed equally)

      Algorithms for Molecular Biology 2013, 8:25

      DOI: 10.1186/1748-7188-8-25

      Received: 25 April 2013

      Accepted: 25 September 2013

      Published: 19 November 2013

      Abstract

      Post-Sanger sequencing methods produce tons of data, and there is a general agreement that the challenge to store and process them must be addressed with data compression. In this review we first answer the question “why compression” in a quantitative manner. Then we also answer the questions “what” and “how”, by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we go back to the question “why compression” and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.

      Background

      In the first decade of the century, the cost of sequencing a single human genome fell from about 30 million to about 10 thousand dollars (here and later we mean U.S. dollars). The second generation sequencing platforms by 454 Life Sciences, Illumina, and Applied Biosystems cost less than one million dollars [1] and are available in many institutes. The promised third generation (3G) technology (including Ion Torrent Systems, Oxford Nanopore Technologies, and Pacific Biosciences equipment) should be even cheaper, which rapidly moves us closer to the day of personalized medicine available to the masses.

      In mid-2013, the world's biggest sequencing institute, Beijing Genomics Institute, used 188 sequencers, of which 139 were the top-of-the-line Illumina HiSeq 2000 and 2500 sequencing machines. Their total theoretical throughput is over 1.2 Pbases per year, which is equivalent to about 3 PB of raw sequencing read files. Including the additional output space for mapping to the reference genomes, the total amount of necessary storage is on the order of 10 PB per year. The statistics of high-throughput sequencers in the world (http://omicsmaps.com) show that the storage necessary for the instruments’ output is in the range 50–100 PB per year. Kahn [2] presents the genomic data growth until 2010, pointing out that the progress in computer hardware lags behind.

      Interestingly, the Million Veteran Program (MVP), led by the US Department of Veterans Affairs, was recently announced. With at least 30-fold coverage (100 bp reads) the number of reads per genome sample will be about 1 billion [3]. This means about 250 PB of raw data (in FASTQ format) in total, when the sequencing program is finished (the enrollment of volunteers is expected to last 5 to 7 years).

      Those numbers are very large, but we need to remember that they refer to July of 2013. As can be seen in Figure 1, the cost of sequencing a single base has been halving roughly every 8 months in 2008–2013, while the cost of hard disk space has been halving every 25 months in 2004–2013. Even if the most recent NHGRI data suggest some stagnation (see also the comment [4]), it may be a temporary slowdown, as 3G instruments are becoming available (PacBio RS and Heliscope).
      http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-8-25/MediaObjects/13015_2013_Article_202_Fig1_HTML.jpg
      Figure 1

      Trends in storage, transfer, and sequencing costs. The historic costs of low-end hard disk drives were taken from http://www.jcmit.com/diskprice.htm. They had been halving every 12 months in the 1990s and around 2000–2004. Then, the halving time lengthened suddenly, to about 25 months. The real costs of sequencing, taken from the NHGRI Web page [5], reflect not only reagent costs, as some studies show, but also include labor, administration, amortization of sequencing instruments, submission of data to a public database, etc. The significant change in sequencing costs around 2008 was caused by the popularization of the second generation technologies. The prices of the Amazon storage and transfer reflect the real market offers from the top data centers. It is interesting that the storage costs at data centers drop very slowly, mainly because the costs of blank hard disks are only a part of the total costs of maintenance. The curves were not corrected for inflation.

      On the other hand, the prices of low-end hard disks are rather misleading, since the amount of data necessary in sequencing projects is so huge that data centers are much better places to store the files. The growing popularity of cloud storage can be attributed to several reasons. Generally, data management may be cheaper if run by IT professionals (not always available in smaller centers, like hospitals), centralization may reduce data replication costs, and access to really huge repositories (at least of terabyte scale) is easier or even only possible with large arrays of disks. The disk drives in large storage arrays are usually enterprise-class ones of increased reliability and better performance, so they are several times more expensive than standard (SATA) HD drives. Taking these factors together we should not be surprised that storing a file in a large data center may cost about 5–10 times more than on a plain HD; this is the price we pay for ubiquitous access to reliable data collections.

      What is more, these files must be transferred over the Internet, which is not free. Among the largest data centers and cloud computing centers are Amazon S3 and Amazon EC2. Their charges for storage have been halving every 84 months and the charges for transfer have been halving every 44 months in 2006–2013. In January 2013 the real cost of sequencing a human-size genome (according to NHGRI data) was about 5,700 dollars, while the cost of one-year storage at Amazon S3 and 15 downloads of raw reads and mapping results (225 GB of reads with 30-fold coverage and 500 GB of mapped data) was close to 1,500 dollarsᵃ.

      The trends show clearly that the costs of storage and transfer will become comparable to the costs of sequencing soon, and the IT costs will be a significant obstacle for personalized medicine, if we do not face this problem seriously.

      When choosing the compression algorithm to apply (if any), one should consider the specific use of the data. Two, in a sense extreme, scenarios are: (i) the data are to be transferred over a network (and then decompressed), (ii) the data are to be accessed in real-time, either in a sequential or random fashion. Let us take a closer look at both. In case (i), a reasonable cost measure is the time to compress the given data (preferably using some “standard” computer), transfer it, and decompress on the receiver side. Concerning sequencing reads (FASTQ), it is possible to obtain (at least) 4.5-fold compression at about 30 MB/s both compression and decompression speed, on a single CPU core. Let us assume a link speed of 50 Mbit/s and that the input files total 450 GB. Without any compression the files will be transferred in exactly 20 hours. On the other hand, the compression, transfer and decompression times sum up to about 10.5 hours. Even if a faster connection is available (when the gain diminishes), we note that the compression and decompression may also be sped up (using threads or separate processes to utilize multiple CPU cores, and also by overlapping the compression, transfer and decompression phases). We point out that the ratio of 4.5, used in this example, is rather modest; significantly more is possible with lossy compression and also for some other biological data types (like genome collections).

      Let us now estimate the monetary savings for transferring (downloading) these files from Amazon S3. The transfer charge depends on the volume of data, but let us assume 9 cents/GB. Without compression, we will pay 40.5 dollars for downloading these 450 GB of data, but only 9 dollars if compression is applied.
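
      The transfer-time and download-cost figures above follow from simple arithmetic, which can be sketched as below. This is only an illustration under the stated assumptions (1 GB taken as 10^9 bytes, a flat per-gigabyte charge); the function names are ours and hypothetical.

```python
def transfer_hours(size_gb: float, link_mbit_s: float) -> float:
    """Hours needed to move size_gb gigabytes (assumed 10^9 bytes each)."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbit_s * 1e6)
    return seconds / 3600


def download_cost(size_gb: float, dollars_per_gb: float) -> float:
    """Download charge at a flat per-gigabyte rate (an assumed pricing model)."""
    return size_gb * dollars_per_gb


raw_gb = 450.0               # total size of the input files
ratio = 4.5                  # compression ratio used in the text
packed_gb = raw_gb / ratio   # size after compression

print("uncompressed transfer [h]:", round(transfer_hours(raw_gb, 50), 2))
print("compressed transfer   [h]:", round(transfer_hours(packed_gb, 50), 2))
print("uncompressed download [$]:", round(download_cost(raw_gb, 0.09), 2))
print("compressed download   [$]:", round(download_cost(packed_gb, 0.09), 2))
```

      The sketch covers only the transfer phase and the download charge; compression and decompression times depend on the tool and hardware and are not modelled here.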

      Case (ii) is somewhat different, as the choice of the compression method is rather constrained. To our knowledge, the most successful, in terms of efficiency and flexibility, storage system for biosequences was presented by Steinbiss and Kurtz [6]. Their GtEncseq solution accepts several common sequence formats (FASTA, GenBank, EMBL, FASTQ), compresses the input, provides fast random access to FASTA data, stores metadata, etc. Accessing single bases from random locations is not much slower than from the uncompressed equivalent representation, with over 3 million queries per second. In case of FASTQ, however, extracting a single read from an arbitrary location is not that fast, as it takes, e.g., more than 100 μs, which is less than the access time of a hard disk drive, but not necessarily faster than the SSD. A space-time tradeoff exists, so this time may be reduced, but consequently compression ratio drops quite severely. Anyway, if only sequential access to FASTQ is needed (which is often the case), GtEncseq may be a good choice.

      As new large-scale sequencing projects are frequently announced (the mentioned MVP project is probably the most ambitious at the moment), the storage concerns are of high priority. It is no wonder then that many ideas of selective storage (or lossy compression) are discussed in the community (cf. [7, 8]). While we agree that some forms of lossy compression seem to be a necessity, the radical approach of discarding old data (in the hope of reproducing them later, when needed) raises major methodological doubts, as far as research applications are concerned. More precisely, discarding raw data is hazardous, since their reproduction later will not be exact, due to inherent “randomness” in the sequencing process, even if the same hardware and possibly identical procedure are used. This clashes with one of the main principles of the scientific method, which is reproducibility.

      The area of data compression techniques in computational biology has been surveyed by Giancarlo et al. [9, 10], with more focus on the theory and data compression applications in sequence analysis than storage and indexing of data from high-throughput technologies. One aspect of data compression in genomics, index structures for sequencing data, is thoroughly discussed by Vyverman et al. [11]. We hope our review complements earlier efforts by paying attention to versatile applications of data compression in the age of data flood in bioinformatics.

      The paper is organized as follows. The next section contains a brief introduction to widely used data compression algorithms. The section “Sequencing data” outlines the popular compression techniques and file formats to store various bioinformatics data: base calls annotated with quality scores, genome alignment data, single and multiple genome data. Making data smaller is not only for reducing their space and facilitating their distribution, as reflected in the name of the next section, “Beyond storage and transfer”. Here, applications of data compression ideas in indexing, read alignment and other vital problems are discussed. The last section concludes.

      Data compression in brief

      Compression techniques are the traditional means of handling huge data. Those methods reduce the space for storage and speed up the data circulation (e.g., among research institutes). With origins dating back to the 19th century and the first theoretical works appearing just after World War II, data compression is nowadays used almost everywhere, as the amount of stored and transmitted data is huge. There are several major concepts that are used in compression programs, but the ones mentioned below are the most important in bioinformatics. Here we give only a very short description of them, with more details in the Additional file 1 or the monograph [12].

      The Huffman coding [12, 13], invented in 1952, is a statistical method, which assigns a sequence of bits (a codeword) to each alphabet symbol. The codewords are of different length, and in accordance with the golden rule of data compression, rarer symbols are represented by longer codewords. The given sequence is then encoded by replacing each symbol with its corresponding codeword. What makes Huffman coding important is its optimality: among codes that assign a fixed codeword to each symbol, no other code leads to a shorter encoded sequence.
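
      To make the codeword assignment concrete, a minimal sketch of Huffman code construction follows. It is an illustration written for this text, not code from any of the tools discussed later; the function names and the toy input are our assumptions.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a Huffman code (symbol -> bit string) for the symbols of text."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate input: a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap items: (weight, tie_breaker, {symbol: codeword built so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; their codewords grow by one bit.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, code):
    return "".join(code[s] for s in text)


def decode(bits, code):
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:  # the code is prefix-free, so a match is final
            out.append(inverse[current])
            current = ""
    return "".join(out)


sample = "AAAAAAAACCCCGGT"  # toy input: A is frequent, T is rare
code = huffman_code(sample)
bits = encode(sample, code)
print(code, len(bits), "bits instead of", 2 * len(sample))
```

      On this skewed toy input the rare symbols G and T receive the longest codewords, and the total length falls below the 2 bits per symbol of a fixed-length encoding.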

      In 1977–78 Ziv and Lempel [12, 14] invented dictionary methods. They process the sequence from left to right and encode possibly long repetitions of consecutive symbols as references to the already compressed part of data. Such an approach allows for higher compression ratios than sole Huffman coding as it looks for another type of redundancy, common not only in natural language (e.g., repeating word phrases), but also in multiple genome sequences or overlapping sequencing reads. Even better results are possible with combining dictionary methods and Huffman coding. The very popular gzip program serves as a successful example.
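
      The dictionary idea can be sketched with a toy LZ77-style coder. This is a deliberately naive illustration (quadratic-time match search, triples kept as Python tuples rather than packed bits), not the algorithm as implemented in gzip; the names and the window size are our assumptions.

```python
def lz77_encode(data, window=4096):
    """Greedy LZ77 sketch: emit (offset, length, next_symbol) triples."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # A match may overlap the part being encoded, but the last symbol
            # of the input is always kept back as an explicit literal.
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decode(triples):
    buf = []
    for offset, length, symbol in triples:
        start = len(buf) - offset
        for k in range(length):
            buf.append(buf[start + k])  # symbol-by-symbol copy handles overlaps
        buf.append(symbol)
    return "".join(buf)


text = "ACGTACGTACGTTTTTTTTT"
triples = lz77_encode(text)
print(len(text), "symbols ->", len(triples), "triples")
```

      Repeated substrings collapse into single references, which is the redundancy that multiple similar genomes or overlapping reads offer in abundance.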

      The Burrows–Wheeler transform (BWT) [12, 15] is a more recent compression idea that has become highly popular in bioinformatics. The pure BWT is not a compression method, it only permutes the input sequence, but it can be used to construct highly efficient compressors (bzip2, combining the BWT with Huffman coding and other techniques, is a well-known representative). This transform is also a basis of some sequence indexing techniques that are used to search a genome in many tools (more on this will be covered in the “Beyond …” section). The key idea of BWT is to permute the input sequence in such a way that symbols are grouped by their neighborhood, i.e., the symbols that are followed by the same symbols are close after the permutation, even if they were far in the original sequence (see also the Additional file 1 for a BWT example).
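
      A naive sketch of the transform and its inverse is given below. It sorts all rotations explicitly, which is fine for a short example but far too slow and memory-hungry for genomes; practical tools rely on suffix-array based constructions. The end marker "$" is an assumption of this sketch.

```python
def bwt(text, end="$"):
    """Naive BWT: last column of the sorted rotations of text + end marker."""
    assert end not in text
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last, end="$"):
    """Invert by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


transformed = bwt("banana")
print(transformed)               # equal symbols tend to end up next to each other
print(inverse_bwt(transformed))  # the permutation is reversible
```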

      Sequencing data

      Raw sequencing data

      The raw reads from sequencing are usually stored as records in textual FASTQ format. Each record is composed of a read ID, base calls, and quality scores for all base calls [16]. These files are often compressed with gzip, to obtain about 3-fold size reduction. While significant, such a gain is not quite satisfactory. A better choice is to use a FASTQ-dedicated compressor, e.g., DSRC [17], which shrinks the data about 5 times. The algorithm finds overlaps in the base calls from the reads placed at very long distances in the file (in Ziv–Lempel fashion), as well as takes care of various ID format conventions. It also uses multiple statistical models to better compress quality values by the Huffman coding. In particular, quality scores in different read positions are compressed in different models (i.e., based on different statistics) because it is well-known that the quality deteriorates towards the read end, hence this design improves the prediction of the scores. DSRC is quite fast, as it compresses at 30–40 MB/s speed, and handles alphabets of size beyond 5 (IUPAC ambiguity codes) as well as variable-length reads.

      Most newer proposals [18, 19] focus mainly on compression ratio and either the speeds are of secondary importance or are not examined at all. For example, the solution by Bhola et al. [18] follows essentially the same direction as DSRC, but handles also approximate and palindromic repeats. The byproducts are compressed with an adaptive arithmetic coder [12]. While the reported compression ratios are usually higher than the DSRC’s by a few percent, it is unclear if the solution can be scalable. Remarkably, doubts are expressed by the authors themselves.

      A more radical attempt to improve compression ratio is to group the reads with suffix-prefix overlaps close together. Unfortunately, the algorithms following this line [20, 21] are not fully functional tools, e.g., they ignore the read IDs and quality scores, and achieve processing speeds on the order of 1 MB/s or less. A more interesting solution along these lines, SCALCE [22], is a FASTQ preprocessor that helps to roughly double the compression ratio obtained with gzip.

      As argued earlier, the rapid growth of data from sequencing experiments demands even better compression ratios, and switching to a lossy mode seems to be the only chance for a breakthrough [23]. The natural candidates for lossy compression are quality scores. It has been shown [22, 24–26] that rounding the quality scores to a few values (instead of about 60) can be acceptable. E.g., the fraction of discrepant SNPs grows slowly with diminishing number of quality scores in Illumina’s CASAVA package [27] while the benefit in compression is clearᵇ. On the other hand, some (re)assemblers ignore these data [23], so if the reads are to be processed by such tools, the quality scores can be removed. Another option is to use the scores to prefilter the reads and discard the ones with unacceptably low quality.
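
      Rounding quality scores to a few values can be sketched as follows. The bin edges and representative values are hypothetical, chosen only for illustration (they are not the mapping of any cited tool), the scores are synthetic uniform random numbers rather than real sequencer output, and zlib stands in for a gzip-like back end.

```python
import bisect
import random
import zlib

# Hypothetical bins on the Phred scale: <10, 10-19, 20-29, >=30.
EDGES = [10, 20, 30]
REPRESENTATIVES = [5, 15, 25, 35]


def bin_quality(phred_scores):
    """Replace every score by the representative value of its bin (lossy)."""
    return [REPRESENTATIVES[bisect.bisect_right(EDGES, q)] for q in phred_scores]


def to_fastq_string(phred_scores, offset=33):
    """Encode scores as FASTQ-style characters (Phred+33 assumed)."""
    return "".join(chr(q + offset) for q in phred_scores)


random.seed(0)
scores = [random.randint(0, 40) for _ in range(20000)]  # synthetic scores
original = to_fastq_string(scores).encode()
binned = to_fastq_string(bin_quality(scores)).encode()
print("compressed size, original scores:", len(zlib.compress(original)))
print("compressed size, binned scores:  ", len(zlib.compress(binned)))
```

      With only four distinct symbols left, the binned stream carries much less information per score, which any statistical or dictionary coder can exploit; whether the loss is acceptable depends on the downstream analysis.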

      SeqDB [28] is a notable exception from the shown tendency. Here the focus was put on compression and decompression speed, not compression ratio. On a machine with 12 CPU cores SeqDB reaches the speeds of about 500–600 MB/s ([28], Section 4.4), but the compression ratio is at best at gzip’s level.

      Recently, two very interesting algorithms were presented. One of them is Quip [29], which uses higher-order modelling [12] with arithmetic coding in its more “traditional” mode, and also an assembly-based approach in its stronger mode. The idea is to form contigs from the first (by default 2.5 million) reads, which then are used as a reference for the following reads. The compression improvement due to this idea is moderate however, so the standard mode (faster and using half of the memory) seems to be more practical, with compression significantly better than DSRC and 10 MB/s to 20 MB/s compression/decompression speed.

      In [30] two FASTQ compressors were presented. Now we briefly describe the more interesting of them, Fqzcomp. It achieves better compression ratios thanks to a carefully chosen context mixing modelᶜ and other original ideas. Fqzcomp belongs also to faster solutions, (partly) due to multi-threading. In the same paper, yet another strong FASTQ compressor was tested, SeqSqueeze1 (described only in [30]). This algorithm, also based on context mixing, is sometimes even better than Fqzcomp in respect of compression ratio, but its compression and decompression speed is less than 1 MB/s.

      In Table 1 we compare the existing tools for compressing sequencing data. They belong to four different kinds described in the following four subsections. We did our best not to miss any relevant tool for which the sources or (at least) binary executables for a popular operating system were available. There are several reasons why we do not present experimental (comparative) tests of the presented tools. They are, in the order of decreasing importance:
      (i) limitations of many tools (e.g., accepting sequences with only the ACGT or only ACGTN alphabet, support for fixed-width reads only, assumptions on the ID format in FASTQ files, restriction to only selected fields of the SAM format);

      (ii) significant problems with running some of the existing tools (which we have experienced in earlier work, on FASTQ and genome collection compression);

      (iii) not quite compatible outputs (e.g., DSRC for FASTQ supports random access, which cannot be turned off, while many others do not, hence comparing compression ratios of these tools cannot be fully fair);

      (iv) different targets (by design, some of the tools are supposed to be run on a commodity PC, while others require powerful servers, or never were tested on mammalian-size data, since it would take days or even weeks. For example, many top single genome sequence compressors work at the rate of several Kbase/s only, on a standard computer).

         
      Table 1. Summary of the most important compressors of sequencing data

      Software name | Implementation availability (src code / binaries / libs) | Website | Lossless / lossy | Ambig. codes | Var. length reads | Speed of compr. / decompr. | Ratio | Random access | Methods | Remarks

      Compressors of raw sequencing data
      gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | yes | moderate / very high | low | no | LZ, Huf |
      bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | yes | low / high | low | no | BWT, Huf |
      7zip | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | yes | low / very high | moderate | no | LZ, AC |
      BWT-SAP [21] | C++ / – / C++ | http://github.com/BEETL/BEETL | yes / no | yes | no | low / low | moderate | no | BWT, PPM | FASTA only
      DSRC [17] | C++ / Lin, Win / C++, Pyt | http://sun.aei.polsl.pl/dsrc | yes / no | yes | yes | high / high | moderate | yes | LZ, Huf |
      Fqzcomp [30] | C / Lin / – | http://sourceforge.net/projects/fqzcomp/ | yes / yes | no | yes | high / moderate | high | no | CM |
      G-SQZ [31] | C++ / Lin / – | http://public.tgen.org/sqz | yes / no | no | no | high / moderate | low | yes | Huf |
      Kung-FQ [19] | C# / Win / – | http://quicktsaf.sourceforge.net | yes / yes | no | no | moderate / moderate | moderate | no | AC, LZ, RLE |
      Quip [29] | C / – / – | http://cs.washington.edu/homes/dcjones/quip | yes / no | no | no | high / high | high | no | M. models, AC |
      ReCoil [20] | C++ / – / C++ | http://github.com/BEETL/BEETL | yes / no | no | no | very low / high | moderate | no | BWT, PPM | FASTA only
      SCALCE+gzip [22] | C++ / – / – | http://scalce.sourceforge.net | yes / yes | no | no | moderate / high | moderate | no | AC, LZ, Huf |
      Seq-DB [28] | C++ / – / – | https://bitbucket.org/mhowison/seqdb | yes / yes | no | no | very high / very high | low | yes | AC, LZ, RLE |
      SeqSqueeze1 [30] | C / Lin / – | http://sourceforge.net/p/ieetaseqsqueeze/wiki/Home/ | yes / no | no | yes | very low / very low | high | no | CM |

      Compressors of reference genome alignment data
      gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | N/A | low / very high | low | no | LZ, Huf |
      bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | N/A | low / high | low | no | BWT, Huf |
      7z | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | N/A | low / very high | moderate | no | LZ, AC |
      BAM [32] | C++ / many / many | http://samtools.sourceforge.net | yes / no | yes | N/A | moderate / high | moderate | yes | LZ, Huf |
      CRAM [33] | Java / many / Java | http://www.ebi.ac.uk/ena/about/cram_toolkit | yes / yes | yes | N/A | moderate / moderate | moderate | yes | Huf, Gol, diff. |
      Quip [29] | C / – / – | http://cs.washington.edu/homes/dcjones/quip | yes / no | no | N/A | high / high | high | no | M. models, AC |
      SAMZIP+rar [34] | C / – / – | http://www.plosone.org/article/info:doi/10.1371/journal.pone.0028251 | yes / no | yes | N/A | moderate / high | moderate | no | RLE, LZ, Huf |

      Compressors of single genome sequences
      gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | N/A | moderate / very high | low | no | LZ, Huf |
      bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | N/A | low / high | low | no | BWT, Huf |
      7z | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | N/A | low / very high | moderate | no | LZ, AC |
      dna3 [35] | C / – / – | http://people.unipmn.it/manzini/dnacorpus/ | yes / no | no | N/A | low / low | moderate | no | LZ, PPM |
      FCM-M [36] | C / – / – | | yes / no | no | N/A | very low / very low | moderate | no | M. models |
      XM [37] | Java / many / Java | http://ftp.infotech.monash.edu.au/software/DNAcompress-XM | yes / no | yes | N/A | very low / very low | moderate | no | M. models, AC |

      Compressors of genome collections
      gzip | C++ / many / many | http://www.gzip.org | yes / no | yes | N/A | low / very high | very low | no | LZ, Huf |
      bzip2 | C / many / many | http://www.bzip.org | yes / no | yes | N/A | low / high | very low | no | BWT, Huf |
      7z | C, C++ / many / many | http://www.7-zip.org | yes / no | yes | N/A | low / very high | high | no | LZ, AC | chr-ordered
      ABRC [38] | C++ / Lin, Win / C++ | http://www2.informatik.hu-berlin.de/~wandelt/blockcompression/ | yes / no | yes | N/A | high / very high | very high | yes | LZ, Huf |
      GDC [39] | C++ / Lin, Win / C++ | http://sun.aei.polsl.pl/gdc | yes / no | yes | N/A | high / very high | very high | yes | LZ, Huf |
      GReEn [40] | C / – / – | | yes / no | yes | N/A | high / high | high | no | M. models, AC |
      GRS [41] | C / Lin / – | http://gmdd.shgmo.org/Computational-Biology/GRS/ | yes / no | yes | N/A | moderate / low | high | no | LCS, Huf |
      RLZ [42] | C++ / – / – | http://www.genomics.csse.unimelb.edu.au/product-rlz.php | yes / no | yes | N/A | moderate / very high | high | no | LZ, Gol |

      Abbreviations used in the table: src = source codes, libs = libraries, Lin = Linux, Win = Windows, Pyt = Python, exe = binary executables, AC = arithmetic coding (a statistical coding method [12]), CM = context mixing for arithmetic coding [12], diff = differential coding (paradigm: store only changes between sequences), Gol = Golomb (a statistical coding method [12]), Huf = Huffman, LCS = longest common subsequence (a measure of similarity of sequences [43]), LZ = an algorithm from Ziv–Lempel family, M. models = Markov models [12], PPM = prediction by partial matching (an efficient general-purpose compressor [12]). “Ambig. codes” means the ability to compress DNA symbols other than {A, C, G, T, N}. “chr-ordered” for 7z and genome collections means that the input (human) genomes were split into chromosomes and ordered according to them before the actual compression. In this way several chromosomes fit the 7z LZ-buffer which is beneficial for the compression.

      Reference genome alignment data

      The reads are usually assembled or reassembled. De novo assembly is the most challenging, but resequencing, in which the reads are aligned to some reference genome, is much cheaper and thus widely used.

      The results of mapping the reads onto the reference genome are usually stored in the SAM/BAM format [32]. SAM files augment the reads data with mapping quality and several other fields. BAM is a gzip-like compressed binary equivalent of textual SAM and is about 3–4 times smaller. Due to the additional data, SAM files are more than twice as large as FASTQ files.

      The reads in SAM files are mapped to a known reference genome and the differences between the reads and the reference sequence, resulting from variation and sequencing errors, are small. Thus, it is efficient to represent the base calls of a read as the mapping coordinate and the differences. These reads are usually ordered by the mapping coordinate and thus the coordinates can be stored in a differential manner, which results in a sequence of small and thus well compressible numbers. The oldest scheme [44] for compressing mapping data with the described idea cannot however be considered mature, since quality scores are ignored and there is no support for unaligned reads (i.e., those that failed to map onto a reference).
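
      Both ideas, storing a read as its coordinate plus differences and storing sorted coordinates differentially, can be sketched as below. This is a simplified illustration assuming substitutions only (no indels, clipping, or quality scores); the function names are hypothetical and do not come from SAM tooling.

```python
from itertools import accumulate


def delta_encode(sorted_positions):
    """Replace each coordinate by its distance from the previous one."""
    previous, deltas = 0, []
    for position in sorted_positions:
        deltas.append(position - previous)
        previous = position
    return deltas


def delta_decode(deltas):
    """Running sums restore the original coordinates."""
    return list(accumulate(deltas))


def read_to_diff(reference, position, read):
    """Mismatches of an aligned read as (offset, base) pairs; no indels assumed."""
    window = reference[position:position + len(read)]
    return [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]


def diff_to_read(reference, position, length, diff):
    """Rebuild the read from the reference and its list of mismatches."""
    bases = list(reference[position:position + length])
    for offset, base in diff:
        bases[offset] = base
    return "".join(bases)


positions = [1000003, 1000010, 1000010, 1000047]
print(delta_encode(positions))          # one large value, then small gaps

reference = "ACGTACGTAC"
diff = read_to_diff(reference, 2, "GTACCT")
print(diff)                             # a single substitution
print(diff_to_read(reference, 2, 6, diff))
```

      After the first coordinate, the stream consists of small non-negative gaps, which statistical coders such as Huffman or Golomb codes represent in few bits.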

      Fritz et al. [33] handle both aligned and unaligned reads. The aligned reads are stored basically as described above together with the quality scores. To obtain better compression ratios, the authors advocate using a lossy mode and refrain from storing some quality scores, e.g., the ones related to perfectly matched positions. To compress the unaligned reads (usually 10–40% of raw reads) better, they propose to build some artificial reference sequences. To this end, unmapped reads from many similar experiments are processed by an assembler to obtain contigs, built only for the compression process. Finally, the remaining sequences are matched to the bacterial and viral databases. Some of the byproducts in the algorithm are encoded with the Huffman (or other) codes. Most of the described techniques, excluding artificial reference as well as bacteria and viral sequences, are implemented in the CRAM compressor. In a highly lossy setting, it can produce archives smaller by an order of magnitude than corresponding BAM files. Similar approaches for compressing reads by mapping them onto a reference genome are used in SlimGene [25] and SAMZIP [34] algorithms.

      Generally the use of a reference sequence can help a lot for the compression ratio but we should remember that a reference might not be available in some cases, e.g., for metagenomic datasets or for organisms with high polymorphism [20].

      The most recent SAM compressor [45], apart from highly configurable lossy compression settings, introduces a novel idea to exploit common features of reads mapped to the same genomic positions. Quip [29], published slightly earlier, is not as flexible as CRAM and works only losslessly. However, if aligned reads in the SAM or BAM format and a reference sequence are given, it beats CRAM in compression ratio and needs less memory to operate.

      The tabix program [46] is a more general solution, popular in sequencing centers. It was designed to allow fast random access to compressed (gzip-like) textual files, in which the data are stored in rows containing tab-delimited values. The basic idea is to sort the input rows according to the sequence name and coordinate. Then the file is split into a series of blocks of maximum size of 64 KB. These blocks are compressed. Finally, an index to support random access queries is built.
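
      The sort, split, compress, and index steps can be sketched in a few lines. This is a toy illustration of the general idea only: it does not reproduce the actual tabix file format or index structure, and it assumes that the first two tab-separated columns of each row hold the sequence name and an integer coordinate.

```python
import bisect
import zlib

BLOCK_LIMIT = 64 * 1024  # maximum uncompressed block size in bytes


def row_key(line):
    """Sort key: (sequence name, coordinate) from the first two columns."""
    name, position = line.split("\t")[:2]
    return name, int(position)


def build_blocks(lines, limit=BLOCK_LIMIT):
    """Sort rows, cut them into blocks, compress each block separately."""
    blocks, index = [], []  # index[i] is the key of the first row in blocks[i]
    chunk, size = [], 0

    def flush():
        blocks.append(zlib.compress(b"".join(data for _, data in chunk)))
        index.append(chunk[0][0])

    for line in sorted(lines, key=row_key):
        data = (line + "\n").encode()
        if chunk and size + len(data) > limit:
            flush()
            chunk, size = [], 0
        chunk.append((row_key(line), data))
        size += len(data)
    if chunk:
        flush()
    return blocks, index


def query(blocks, index, name, start, end):
    """Rows on `name` with start <= coordinate <= end; decompresses few blocks."""
    # Begin one block before the first whose first key is >= (name, start),
    # because matching rows may sit at the end of that earlier block.
    i = max(bisect.bisect_left(index, (name, start)) - 1, 0)
    hits = []
    while i < len(blocks) and index[i] <= (name, end):
        for line in zlib.decompress(blocks[i]).decode().splitlines():
            row_name, position = row_key(line)
            if row_name == name and start <= position <= end:
                hits.append(line)
        i += 1
    return hits


rows = ["chr%d\t%d\tx" % (c, p) for c in (2, 1) for p in range(2000, 0, -1)]
blocks, index = build_blocks(rows, limit=1024)  # tiny limit to force many blocks
print(len(blocks), "blocks;", query(blocks, index, "chr1", 500, 502))
```

      Because each block is compressed independently, a range query touches only the blocks whose key range can overlap the request, at the price of a somewhat lower compression ratio than compressing the file as a whole.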

      Single genome sequences

      Raw and annotated sequencing data are nowadays the greatest challenge forstorage and transfer today. Nevertheless, consensus DNA sequences (e.g.,complete bacterial genomes) used to be historically the first object ofcompression in bioinformatics. In a sense, however, the genome sequences fora single individual are almost incompressible. If the sequence contains onlythe symbols A, C, G, and T, then the trivial 2 bits per symbol encoding isoften more efficient than a general-purpose compressor, like gzip!
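
      The trivial 2 bits per symbol encoding can be sketched as follows. The comparison uses a synthetic, uniformly random ACGT string as a stand-in for a genome (an assumption of this sketch; real genomes contain some repeats) and zlib as a gzip-like general-purpose compressor.

```python
import random
import zlib

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq):
    """Pack four bases per byte; a short final group is left-aligned."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed, length):
    """Inverse of pack_2bit; `length` trims the padding of the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


random.seed(1)
seq = "".join(random.choice(BASES) for _ in range(100000))  # synthetic sequence
print("2-bit packing:", len(pack_2bit(seq)), "bytes")
print("zlib level 9: ", len(zlib.compress(seq.encode(), 9)), "bytes")
```

      The sequence length has to be stored alongside the packed bytes, and symbols outside {A, C, G, T} (such as N) need separate handling, which this sketch omits.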

      Specialized DNA compressors appeared in the mid-1990s, but most solutions from the literature are impractically slow. For example, one of the strongest of them, the highly acclaimed XM [37], can squeeze a genome up to about 5 times, but a compression speed of the order of 20 KB/s on a modern machine is clearly disappointing. Some other notable compressors in this area are dna3 [35] and FCM-M [36]. The standard input format for genome sequences is FASTA, in which the file starts with a single-line description, followed by lines of sequence data.

      Collections of genome sequences

      As said, a single genome in its compact encoding (2 bits per base) seems almost incompressible. However, large repositories with thousands of individual genomes of the same species are just around the corner. These genomes are highly similar to each other (e.g., human genomes have more than 99% of their content in common [47]), so a collection can be very efficiently compressed. Dictionary methods, from the LZ family, constitute the most obvious and actually most successful approach, considering high-speed decompression with moderate memory requirements. The compression phase is more demanding, both in time and space, because the repetitions (matches) in a collection of genomes are typically gigabytes apart. For these reasons, most general-purpose LZ-style compressors (e.g., gzip, rar) are useless for those data, and a few years ago the first specialized algorithms emerged.

      In their seminal work, Christley et al. [48] compressed a single human (James Watson’s) genome, but with the variation data relative to a reference genome being provided. Variation data comprised single nucleotide polymorphisms (SNPs) and so-called indels, i.e., insertions or deletions of multiple nucleotides. Additionally, they used a readily available SNP database. The assumed scenario, augmented with standard compression techniques, made it possible to represent the whole human genome in about 4 Mbytes only. Recently, Pavlichin et al. [49] followed the lines pioneered in [48], compressing the JW genome to 2.5 MB, with very similar average results on 1092 human genomes from the 1000 Genomes Project. The introduced novelties are partly biologically inspired, e.g., making use of the tag SNPs characterizing haplotypes. In another recent achievement, Deorowicz et al. [50] compressed the variant call files from 1092 diploid H. sapiens (1000 Genomes Project) and 775 A. thaliana (1001 Genomes Project) individuals, together with a variant database, squeezing the former collection to about 432 MB and the latter to 110 MB (which translates to 395 KB and 142 KB per individual, respectively).

      Other studies did not assume access to a knowledge base (i.e., a reference genome), hence most of them have not yet achieved the mentioned levels of compression. Most of these works [39, 40, 42, 51] encode one of the genomes in a collection with simple means, spending about 2 bits per base, and then apply very efficient differential LZ-like encoding for the remaining genomes. The pioneering algorithm of this kind, RLZ [42, 52], looks for LZ-matches in the reference genome (the one encoded naïvely, with 2 bits per base), and encodes their positions compactly, with reference to the previous match. This handles typical differences in genomes of the same species (short insertions or deletions, SNPs) in an efficient manner. Such an approach is successful: the related compressor, GDC [39], obtains in its strongest mode the compression ratio of 1,000 for relatively encoded genomes in a collection of 70 human individuals, with decompression speed of 150 MB/s. The key ideas in GDC include looking for long approximate matches in the whole collection and the Huffman coding.
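
      To make the differential LZ-like encoding more concrete, the sketch below greedily factorizes a target sequence into matches against a reference. It only illustrates the general idea: the function names and the ("lit", symbol) fallback are assumptions of this example, the match search is a naive quadratic scan rather than an index-based one, and the compact coding of match positions relative to the previous match used by RLZ is omitted.

```python
def rlz_factorize(reference, target):
    """Greedily split `target` into (position, length) matches found in `reference`.

    A symbol that does not occur in the reference at all is emitted as ("lit", symbol).
    """
    factors = []
    i = 0
    while i < len(target):
        pos, length = -1, 0
        # Grow the match while the extended substring still occurs in the reference.
        while i + length < len(target):
            p = reference.find(target[i:i + length + 1])
            if p < 0:
                break
            pos, length = p, length + 1
        if length == 0:
            factors.append(("lit", target[i]))
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors

def rlz_decode(reference, factors):
    """Rebuild the target from the reference and the list of factors."""
    parts = []
    for first, second in factors:
        parts.append(second if first == "lit" else reference[first:first + second])
    return "".join(parts)
```

      For a target that differs from the reference by a single SNP, the factorization consists of a long match, a very short factor covering the changed base, and another long match, which is why such differences are cheap to encode.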

      Some of the specialized compressors (GDC, LZ-End [53]) allow for fast access to an arbitrary substring of the compressed collection. Unfortunately, this comes at a price: some loss in compression ratio (which still remains competitive, though).

      Yet another differential genome compressor was presented by Wandelt and Leser [38]. Implementation-wise, its original trait is searching for matches via a compressed suffix tree [54] (in blocks). This solution reaches the compression ratio of about 400 for the collection of 1000 human genomes. The match-finding speed of a parallel implementation is rather high (85 MB/s with large blocks). The real bottleneck is, however, building the compressed suffix tree.

      Interestingly, 7zip, a well-known advanced general-purpose LZ compressor, also achieves quite competitive results, but for mammalian-size genomes the collections must be reordered by chromosomes; otherwise it cannot find inter-genome LZ-matches and its compression is then not much better than gzip’s. This behavior can be explained by its LZ-buffer limited to 1 GB, which is less than the size of, e.g., the human genome. On the other hand, chromosomes are small enough that several of them fit in its buffer.

      Currently, public repositories often store large genomes as variant databases (e.g., in VCF format). This tendency should help significantly in developing new efficient compression algorithms, since all input sequences are perfectly aligned and the (otherwise hard and resource-consuming) task of finding repetitions in data becomes almost trivial. Thus, we expect the work by Christley et al. [48], Pavlichin et al. [49] and Deorowicz et al. [50] to be only the first steps in this direction.

      Beyond storage and transfer

      So far, we have discussed how compression can alleviate the burden of storing and transmitting various genomic data. It can, however, help in less obvious aspects as well.

      One prominent example is de novo assembly for second-generation sequencing technology, based on the de Bruijn graph [55], where hundreds of GB of RAM could be needed with standard methods. Applying succinct data structures made it possible to decrease the memory usage by an order of magnitude, and a whole assembly pipeline of a human individual was run in about 36 GB of RAM [56]. The nodes in the de Bruijn assembly graph are all distinct strings of some length k (e.g., 25) from given data, and edges between them exist if two nodes have a suffix-prefix overlap of length k-1. The idea of Conway and Bromage [56] was to perceive the set of edges for given data as a subset of the full graph and to use (existing) succinct subset encoding techniques, supporting fast access. Representing this subset with naïve means would be impossible because of the typically quadrillion-edge scale of the full graph. On the other hand, standard approaches to the de Bruijn assembly graph construction are plagued with pointers (location addresses in the memory), which are the dominant part of the data structure in RAM.
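
      The graph definition given above can be written down directly. The following sketch is a plain, pointer-heavy dictionary representation, i.e., the kind of standard structure that succinct encodings are meant to replace; the function name de_bruijn and the toy input are illustrative only.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build the de Bruijn assembly graph of the reads.

    Nodes are the distinct k-mers of the reads; an edge u -> v exists when
    the (k-1)-suffix of u equals the (k-1)-prefix of v.
    """
    nodes = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    edges = {u: sorted(by_prefix.get(u[1:], [])) for u in nodes}
    return nodes, edges

# Toy example with k = 3
nodes, edges = de_bruijn(["ACGTAC"], 3)
```

      Each edge corresponds to a string of length k+1, so the full graph over the DNA alphabet has 4^(k+1) possible edges; for k = 25 this is 4^26, roughly 4.5 × 10^15, which is the quadrillion-edge scale referred to above.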

      Compact data structures do not always make use of “typical” compression techniques (like statistical coding or LZ-matches), yet they serve the same purpose, which is (in the currently discussed application) reducing the memory requirements for building or querying a large graph. The Bloom filters [57], the well-known idea for compact approximate subset representation, were recently used with success for the de Bruijn graph construction [58, 59]. In particular, the work of Salikhov et al. [59] is a refinement of the technique of Chikhi and Rizk [58]. Their graph built for 564M human reads of length 100 bp using k=23 occupies only about 2.5 GB. Another succinct (but not exactly compression) technique was proposed by Ye et al. [60]. They store only a small subset of the observed k-mers as nodes, with their neighboring chains of bases as edges. Some care is taken to remove low-coverage edges, which normally result in tips, loops or bubbles in the assembly graph, undesirable in the graph for a few reasons, including very high memory requirements. The compact structure of SparseAssembler from the cited work was built in less than 2 hours for the human chromosome 14 with the peak memory use of 3 GB. The memory requirement for the whole human genome was below 30 GB. While the result from [59], cited above, seems clearly better, the two are unfortunately not directly comparable, due to different coverages and used values of k.
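
      As a reminder of how a Bloom filter represents a set approximately, here is a minimal sketch storing k-mers. The class layout, the parameter names and the use of SHA-256 with a numeric seed as the hash family are assumptions of this illustration; the additional machinery that [58, 59] use to deal with false positives is not reproduced.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def kmers(read, k):
    """All k-mers of a read, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]
```

      The memory use is fixed by num_bits regardless of the k-mer length, which is what makes the structure attractive for large graphs; the price is that a membership query may occasionally answer "present" for a k-mer that was never inserted.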

      An alternative to the de Bruijn graph is the assembly string graph [61], not working on k-mers, but requiring fast and memory-efficient algorithms for the computation of suffix-prefix overlaps of arbitrary length among reads. In a string graph, as opposed to the de Bruijn graph, each path represents a valid assembly of the reads (because the reads are not “decomposed” into many independent k-mers). Although this approach seems harder, in the SGA string assembler [62] error correction and assembling were performed with satisfactory accuracy on 125 Gbp of human genome reads using 54 GB of memory. This was achieved thanks to a compressed data structure, the FM-index [63], which will be mentioned also later. Interestingly, another practical string graph assembler, Readjoiner [64], which can process a 115 Gbp short-read dataset in 52 GB of RAM, does not make use of compressed data structures; its space effectiveness comes from an ingenious partitioning approach applied to the array of a relevant subset of all suffixes of all reads. Readjoiner also confirms that compact data structures may be fast because of locality of accesses to data.

      Another compression example concerns data indexing. Computational biology is mostly about data analysis, which in turn involves pattern search. If the data over which patterns are sought do not change over time, we talk about a static scenario. This is quite common, e.g., an already sequenced genome of a given individual is usually not updated for a long period. In such a case it may be worth building an index structure for given data since its construction time, even if significant, is likely to be paid off during multiple subsequent pattern searches. A classic text indexing data structure is the suffix tree (ST) [43]. It is powerful and useful, but unfortunately requires up to 28n bytes of space, where n is the sequence length. Thus, it is hard to store an ST in the main memory even for a single mammalian genome. The compressed index idea is to support all (or main) functionalities of its classic counterpart (e.g., returning the locations of all occurrences of the pattern in the text), but using much less space. The area of compressed indexes, initiated only around 2000, has been marked by tens of significant papers [65]. Unfortunately, those “general-purpose” text indexes are not a good choice for a collection of genomes of individuals of the same species. In these cases, LZ-based indexes are much more efficient in removing the specific (and very large) redundancy. Several works with LZ-style indexes designed for genomic data appeared in the recent years. Most of the solutions for the exact [66–68] and approximate pattern search [69] are rather theoretical, as implementations are available for only some of them.

      The Burrows–Wheeler transform (BWT) is used with huge success for mapping sequencing reads onto a reference genome, almost making the classic, q-gram-based approach obsolete (for example, an interesting representative of the latter approach, Hobbes [70], is very fast but also uses a large amount of memory). Some of the most important genome alignment algorithms, Bowtie [71, 72], BWA [73], BWA-SW [74], SOAP2 [75], and GEM [76], make use of the FM-index [63] or another compressed index based on the BWT, occupying as little as about 2 GB for a human genome (with the exception of GEM, requiring usually from 3 GB to 6 GB). For more information on BWT and FM-index, see the Additional file 1. These aligners also belong to the fastest ones. All of them (Bowtie only in version 2) support ungapped and gapped alignments and all of them are multi-threaded to make use of multi-core CPUs. Using BWT for gapped alignment is cumbersome, and this is why Bowtie2 [72] and GEM combine it with dynamic programming, a classic computation technique boasting its flexibility and tolerance for large gaps and affine gap penalties. An important issue for compressed indexes is the working space needed during their construction, as standard BWT computation algorithms require at least 5n bytes of memory. Lightweight algorithms for BWT computation [77, 78] appeared relatively late, yet Kärkkäinen’s method [77] is already implemented in Bowtie.

      A special case of mapping sequence reads to genomes concerns RNA-Seq experiments, in which a ‘snapshot’ of RNA molecules in the cell is sequenced. RNA-Seq [79] is a relatively new approach that proved highly successful especially for determination of gene expression. The main problem here is that we must deal with reads at exon-exon boundaries (without the introns present in the reference genome), so spliced mappings must be looked for, which is unusual for standard DNA reads mapping. Some mappers for this problem also make use of the FM-index, e.g., TopHat [80] and CRAC [81]. A comprehensive list of RNA-Seq mapping approaches can be found in [82].

      The FM-index searches for a pattern by finding its successive letters from right to left, which can be called backward extension of a string. Recently, Li [83] presented a simple modification of the FM-index for forward-backward extension of DNA strings. His de novo assembler, fermi, shows that assembly-based variant calling can achieve an SNP accuracy close to the standard mapping approach, being particularly strong in indel calling.
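
      The right-to-left search can be illustrated with a small, uncompressed sketch. It builds the BWT by sorting suffixes naively and keeps full occurrence tables, so it shows only the backward extension logic and none of the space savings of a real FM-index; the function names and the "$" sentinel are assumptions of this example.

```python
from collections import Counter

def build_index(text):
    """Return (C, occ, n) for text + "$", where "$" is assumed smaller than all symbols."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix sorting
    bwt = "".join(text[i - 1] for i in sa)                 # symbol preceding each suffix
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total        # number of symbols in the text smaller than c
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(bwt)

def count(index, pattern):
    """Number of occurrences of pattern, found by backward extension."""
    C, occ, n = index
    lo, hi = 0, n  # current interval [lo, hi) of sorted suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

      Each step prepends one letter to the already matched suffix of the pattern and narrows the interval of suffixes that start with it, so the query cost depends on the pattern length rather than on the text length.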

      It is worth noting that BWT-based alignment can be implemented on a massively parallel graphics processing unit (GPU). Recent tools SOAP3 [84] and SOAP3-dp [85], being about an order of magnitude faster than their CPU-based counterparts, are prominent examples.

      One could ask if the FM-index, or another compressed index, is useful for searching DNA strings in a genome, e.g., given as FASTA input. The answer is positive; thanks to the small alphabet the search times (the count query, in which we return the number of matches only) of the FM-index, in the best current implementation, may be comparable with the suffix array ([86], Tables 6 and 7). The space use, however, is only about 0.3n (in contrast to 5n needed by the suffix array). On the other hand, the locate query, in which the positions of all matching substrings are returned, is at least an order of magnitude slower, and needs some extra space.
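
      For reference, the suffix array baseline answers both query types with two binary searches. The sketch below is a naive illustration (construction by direct suffix sorting, hypothetical function names), not an implementation from the cited comparison.

```python
def suffix_array(text):
    """Starting positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def locate(text, sa, pattern):
    """Sorted list of positions where pattern occurs; its length is the count."""
    m = len(pattern)
    # Lower bound: first suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

      Here the positions are read directly from the array once the interval is known, which is why locate is cheap for the suffix array, whereas a compressed index has to reconstruct them.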

      In some cases, however, searching directly in the compressed data may be faster than in the straightforward representation. Loh et al. [87] compress a sequence database so that if an inserted sequence is similar enough to one from the database, it is represented as the reference plus a list of differences (edit script). The search algorithm they propose, based on BLAST, takes care of the differentially encoded sequences, and only rarely needs to bring them back to their “full” form. Their Compressive BLAST / BLAT algorithm was found to be about 4 times faster than classic BLAST / BLAT tools [88, 89]. Other examples where data compression reportedly speeds up processing concern the k-mer counting task, especially in I/O-constrained scenarios ([90], Tables 2–4). We note that reading compressed input is nowadays a convenient feature of many tools (e.g., de novo assemblers Velvet [91] and ABySS [92]), but it does not always bring improvements in speed.

      Compression methods were used also for other purposes, in which the goal was not the reduction of space or processing speed-up, but rather better understanding of genomic data. Cao et al. [93] used the XM algorithm [37] to align eukaryotic-size genomes in a few hours on a workstation. The idea is to train the expert models on one of the sequences and use the knowledge to properly align the second one by measuring the information content and the mutual information content of the sequences. The resulting aligner is shown experimentally to be superior (at least in quality, not in speed) to conventional alignment methods based on character matching.

      Bhaduri et al. [94] proposed a somewhat related idea of using a compression algorithm from the LZ family to filter low-complexity reads in a project on identification of nonhuman sequences, such as viruses, in deep sequencing datasets.

      A measure of sequence similarity that is both accurate and rapidly computable is highly desirable. Ferragina et al. [95] argued that classic alignment methods do not scale well for huge data. They focus on the Universal Similarity Measure (USM) [96]. As USM is rather a theoretical concept, the authors experiment with its three approximations based on data compression. They validate the possibility of using these approximations for classification of sequences by UPGMA and NJ methods.
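
      One widely used compression-based similarity estimate is the normalized compression distance (NCD). The sketch below computes it with zlib as the compressor; it is given only as an illustration of the principle, and no claim is made that it coincides with any of the exact variants or compressors evaluated in [95].

```python
import zlib

def compressed_size(data):
    """Length in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))

def ncd(x, y):
    """Normalized compression distance between two byte strings.

    Values near 0 suggest that one input helps to compress the other;
    values near 1 suggest that the inputs share little structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

      A matrix of such pairwise distances can then be passed to a hierarchical clustering method such as UPGMA or NJ. With zlib the estimate is only meaningful for inputs that fit in its 32 KB window, so longer sequences would need a compressor with a larger context.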

      Freschi and Bogliolo [97] proposed a lossy compression scheme to eliminate tandem repeats from a sequence. Thanks to that, no repeat masking is necessary before performing pairwise alignment of sequences.

      Conclusions

      Data deluge in computational biology has become a fact. A vast majority of gathered data is “temporary” in nature and could be discarded as soon as the analysis is done. The problem is, however, that current sequence analysis algorithms are imperfect, and storing lots of data only in the hope of squeezing more out of them in the future is a reasonable strategy. To put it in other words, lossy storage is an interesting option for bioinformatics, but it should be used judiciously.

      The variety of genomic data formats implies the need for specialized compression algorithms better than the general-purpose standards, like gzip and bzip2. Succinct representation is not everything; decompression time or rapid access to arbitrary data snippets may matter even more, so they should be taken into account in algorithmic design. Sometimes even more enhanced functionalities are welcome. Fast search directly in the compressed data is an example. More efficient compression diminishes the costs of not only local data storage and transfer, but also of data center services. The latter should bring the vision of ubiquitous cloud computing closer.

      Let us make two predictions at the end. First, we note that some objects of interest in computational biology, like a human genome, do not grow. Hence, with the growing amount of memory even in our home laptops, it perhaps no longer pays to apply strong compression for some tasks, if less compact but faster solutions are known. Read alignment onto a reference genome is a prominent example of this sort. We anticipate that in 1–2 years solutions processing a collection of 1 billion 100 bp reads in a few hours on a PC will appear, but their main data structure may be the good old suffix array rather than, e.g., the FM-index.

      Second, we predict that the turbulent period of new compression ideas for sequencing data representations will slowly give way to industry-oriented solutions, with more stress on robustness, flexibility, ease of use, and compression and decompression speed (in sequential and parallel/distributed regimes). Ideas are exciting, but routine jobs require standards. We believe that powerful, versatile and thus widely used formats in bioinformatics will emerge soon, proving the maturity of the field.

      Endnotes

      a There are, of course, many alternative cloud storage solutions and it is hard to tell “typical” fees for storage and transfer, as opposed to retail disk media prices which can be monitored rather easily. As a reference, however, we note that Microsoft Windows Azure and Google Cloud Storage charges in the same scenario are similar (about 1,350–1,500 dollars), and all these providers charge more for the assumed 15 downloads than for one-year storage.

      b Illumina software for their HiSeq 2500 equipment contains an option to reduce the number of quality scores [98]. Its effect on overall sequencing is shown in a technical support note [99].

      c Statistical methods often encode symbols with regard to the gathered statistics of occurrences in their respective contexts, which are formed with, e.g., several preceding symbols. This approach can be made even more sophisticated by considering several contextual models running in parallel, in order to improve the estimation of symbols’ probability and, as a result, the obtained compression ratio. The name “context mixing” refers to this approach, in which the statistics from different contexts are “mixed” (weighted, blended).
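
      The idea of context statistics and their mixing can be illustrated by computing the ideal code length of a DNA string under adaptive order-k models. This is a deliberately simplified sketch: the add-one smoothing, the fixed equal weights and the function name code_length are assumptions made for the example, whereas practical context-mixing compressors adapt the weights as they go.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def code_length(seq, orders, weights=None):
    """Ideal code length in bits when mixing adaptive order-k context models.

    Each model predicts the next symbol from counts seen so far after the
    same k preceding symbols; the predictions are blended with fixed weights.
    """
    weights = weights or [1 / len(orders)] * len(orders)
    tables = [defaultdict(lambda: defaultdict(int)) for _ in orders]
    bits = 0.0
    for i, sym in enumerate(seq):
        p = 0.0
        for k, w, table in zip(orders, weights, tables):
            stats = table[seq[max(0, i - k):i]]
            total = sum(stats.values())
            p += w * (stats[sym] + 1) / (total + len(ALPHABET))  # add-one smoothing
        bits += -math.log2(p)
        # Update every model only after the symbol has been "encoded".
        for k, table in zip(orders, tables):
            table[seq[max(0, i - k):i]][sym] += 1
    return bits
```

      On a strongly repetitive input such as "ACGT" repeated many times, the order-2 model needs far fewer bits than the order-0 model, which illustrates why longer contexts pay off when the data are predictable.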

      Declarations

      Acknowledgments

      This work was supported by the Polish National Science Centre under the project DEC-2012/05/B/ST6/03148. We wish to thank Agnieszka Debudaj-Grabysz and Witold Grabysz for their constructive remarks after reading the preliminary version of the paper.

      Authors’ Affiliations

      (1)
      Institute of Informatics, Silesian University of Technology
      (2)
      Institute of Applied Computer Science, Lodz University of Technology

      References

      1. Metzker ML: Sequencing technologies–the next generation. Nat Rev Genet. 2010, 11: 31-46.
      2. Kahn SD: On the future of genomic data. Science. 2011, 331: 728-729.
      3. Roberts JP: Million veterans sequenced. Nat Biotechnol. 2013, 31 (6): 470. 10.1038/nbt0613-470.
      4. Hall N: After the gold rush. Genome Biol. 2013, 14 (5): 115.
      5. National Human Genome Research Institute, DNA Sequencing Costs. [http://www.genome.gov/sequencingcosts/] (accessed February 14, 2013)
      6. Steinbiss S, Kurtz S: A new efficient data structure for storage and retrieval of multiple biosequences. IEEE/ACM Trans Comput Biol Bioinformatics. 2012, 9 (2): 345-357.
      7. Kodama Y, Shumway M, Leinonen R: The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 2012, 40 (Database issue): 54-56.
      8. Cochrane G, Cook CE, Birney E: The future of DNA sequence archiving. GigaScience. 2012, 1 (1): article no. 2.
      9. Giancarlo R, Scaturro D, Utro F: Textual data compression in computational biology: A synopsis. Bioinformatics. 2009, 25 (13): 1575-1586.
      10. Giancarlo R, Scaturro D, Utro F: Textual data compression in computational biology: Algorithmic techniques. Comput Sci Rev. 2012, 6 (1): 1-25. 10.1016/j.cosrev.2011.11.001.
      11. Vyverman M, De Baets B, Fack V, Dawyndt P: Prospects and limitations of full-text index structures in genome analysis. Nucleic Acids Res. 2012, 40 (15): 6993-7015.
      12. Salomon D, Motta G: Handbook of data compression. 2010, London: Springer
      13. Huffman D: A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers. 1952, 1098-1101.
      14. Ziv J, Lempel A: A universal algorithm for sequential data compression. IEEE Trans Inf Theory. 1977, IT-23: 337-343.
      15. Burrows M, Wheeler D: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation 1994, http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf.
      16. Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010, 38 (6): 1767-1771.
      17. Deorowicz S, Grabowski Sz: Compression of DNA sequence reads in FASTQ format. Bioinformatics. 2011, 27 (6): 860-862.
      18. Bhola V, Bopardikar AS, Narayanan R, Lee K, Ahn T: No-reference compression of genomic data stored in FASTQ format. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine. Edited by: Wu F-X, Zaki M, Morishita S, Pan Y, Wong S, Christianson A, Hu X. 2011, 147-150. Atlanta, USA: IEEE Computer Society
      19. Grassi E, Di Gregorio F, Molineris I: KungFQ: A Simple and Powerful Approach to Compress Fastq Files. IEEE/ACM Trans Comput Biol Bioinformatics. 2012, 9 (6): 1837-1842.
      20. Yanovsky V: ReCoil - an algorithm for compression of extremely large datasets of DNA data. Algo Mol Biol. 2011, 6: 23. 10.1186/1748-7188-6-23.
      21. Cox AJ, Bauer MJ, Jakobi T, Rosone G: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 2012, 28 (11): 1415-1419.
      22. Hach F, Numanagić I, Alkan C, Sahinalp SC: SCALCE: boosting Sequence Compression Algorithms using Locally Consistent Encoding. Bioinformatics. 2012, 28 (23): 3051-3057.
      23. Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95 (6): 315-327.
      24. Wan R, Anh VN, Asai K: Transformations for the compression of FASTQ quality scores of next-generation sequencing data. Bioinformatics. 2011, 28 (5): 628-635.
      25. Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G: Compressing genomic sequence fragments using SlimGene. J Comput Biol. 2011, 18 (3): 401-413.
      26. Ochoa I, Asnani H, Bharadia D, Chowdhury M, Weissman T, Yona G: QualComp: a new lossy compressor for quality scores based on rate distortion theory. BMC Bioinformatics. 2013, 14: 187.
      27. Casava v. 1.8.2 Documentation. 2013, [http://support.illumina.com/sequencing/sequencing_software/casava.ilmn].
      28. Howison M: High-throughput compression of FASTQ data with SeqDB. IEEE/ACM Trans Comput Biol Bioinformatics. 2013, 10 (1): 213-218.
      29. Jones DC, Ruzzo WL, Peng X, Katze MG: Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 2012, 40 (22): e171.
      30. Bonfield JK, Mahoney MV: Compression of FASTQ and SAM format sequencing data. PLoS ONE. 2013, 8 (3): e59190.
      31. Tembe W, Lowey J, Suh E: G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics. 2010, 26 (17): 2192-2194.
      32. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079.
      33. Fritz MH-Y, Leinonen R, Cochrane G, Birney E: Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 2011, 21: 734-740.
      34. Sakib MN, Tang J, Zheng WJ, Huang C-T: Improving transmission efficiency of large sequence alignment/map (SAM) files. PLoS ONE. 2011, 6 (12): e28251.
      35. Manzini G, Rastero M: A simple and fast DNA compressor. Softw Pract Exp. 2004, 34 (14): 1397-1411. 10.1002/spe.619.
      36. Pinho AJ, Ferreira PJSG, Neves AJR, Bastos CAC: On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS ONE. 2011, 6 (6): e21588.
      37. Cao MD, Dix TI, Allison L, Mears C: A simple statistical algorithm for biological sequence compression. Proceedings of the Data Compression Conference. Washington, DC, USA: IEEE Computer Society Press, 2007, 43-52.
      38. Wandelt S, Leser U: Adaptive efficient compression of genomes. Algo Mol Biol. 2012, 7: 30. 10.1186/1748-7188-7-30.
      39. Deorowicz S, Grabowski Sz: Robust relative compression of genomes with random access. Bioinformatics. 2011, 27 (11): 2979-2986.
      40. Pinho AJ, Pratas D, Garcia SP: GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res. 2012, 40 (4): e27.
      41. Wang C, Zhang D: A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res. 2011, 39 (7): e45.
      42. Kuruppu S, Puglisi SJ, Zobel J: Optimized relative Lempel-Ziv compression of genomes. Proceedings of the ACSC Australasian Computer Science Conference. Edited by: Reynolds M. 2011, 91-98. Sydney, Australia: Australian Computer Society, Inc.
      43. Gusfield D: Algorithms on strings, trees and sequences: Computer science and computational biology. 1997, Cambridge, UK: Cambridge University Press
      44. Daily K, Rigor P, Christley S, Xie X, Baldi P: Data structures and compression algorithms for high-throughput sequencing technologies. BMC Bioinformatics. 2010, 11: 514.
      45. Popitsch N, von Haeseler A: NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Res. 2013, 41 (1): e27.
      46. Li H: Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011, 27 (5): 718-719.
      47. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AWC, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5 (10): e254.
      48. Christley S, Lu Y, Li C, Xie X: Human genomes as email attachments. Bioinformatics. 2009, 25 (2): 274-275.
      49. Pavlichin D, Weissman T, Yona G: The human genome contracts again. Bioinformatics. 2013, 29 (17): 2199-2202.
      50. Deorowicz S, Danek A, Grabowski Sz: Genome compression: a novel approach for large collections. Bioinformatics. 2013, 29 (20): 2572-2578.
      51. Chern BG, Ochoa I, Manolakos A, No A, Venkat K, Weissman T: Reference based genome compression. Publicly available preprint arXiv:1204.1912v1 2012
      52. Kuruppu S, Puglisi SJ, Zobel J: Relative Lempel–Ziv compression of genomes for large-scale storage andretrieval. Proceedings of the 17th International Symposium on String Matching andInformation Retrieval (SPIRE). Edited by: Chávez E, Lonardi S. 2010, 201-206. Springer-Verlag, Berlin-Heidelberg: Springer, LNCS 6393
      53. Kreft S, Navarro G: LZ77-like compression with fast random access. Proceedings of the Data Compression Conference. 2010, 239-248. Washington, DC, USA: IEEE Computer Society
      54. Ohlebusch E, Fischer J, Gog S: CST++. Proceedings of the 17th International Symposium on String Matching and Information Retrieval (SPIRE). Edited by: Chávez E, Lonardi S. 2010, 322-333. Springer-Verlag, Berlin-Heidelberg: Springer, LNCS 6393
      55. Compeau PE, Pevzner PA, Tesler G: How to apply de Bruijn graphs to genome assembly. Nat Biotechnol. 2011, 29 (11): 987-991.
      56. Conway TC, Bromage AJ: Succinct data structures for assembling large genomes. Bioinformatics. 2011, 27 (4): 479-486.
      57. Bloom BH: Space/time trade-offs in hash coding with allowable errors. Commun ACM. 1970, 13 (7): 422-426. 10.1145/362686.362692.
      58. Chikhi R, Rizk G: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Proceedings of the 12th International Workshop on Algorithms in Bioinformatics (WABI). Edited by: Raphael BJ, Tang J. 2012, 236-248. Springer-Verlag, Berlin-Heidelberg: Springer, LNCS 7534
      59. Salikhov K, Sacomoto G, Kucherov G: Using cascading Bloom filters to improve the memory usage for de Brujin graphs. Proceedings of the 13th International Workshop on Algorithms in Bioinformatics (WABI). Edited by: Darling AE, Stoye J. 2013, 364-376. Springer-Verlag, Berlin-Heidelberg: Springer, LNCS 8126
      60. Ye C, Ma ZS, Cannon CH, Pop M, Yu DW: Exploiting sparseness in de novo genome assembly. BMC Bioinformatics. 2012, 13 (Suppl 6): S1. 10.1186/1471-2105-13-S6-S1.
      61. Myers EW: The fragment assembly string graph. Bioinformatics. 2005, 21 (suppl 2): ii79-ii85.
      62. Simpson JT, Durbin R: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012, 22: 549-556.
      63. Ferragina P, Manzini G: Opportunistic data structures with applications. Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS). 2000, 390-398. Redondo Beach, California, USA: IEEE Computer Society
      64. Gonnella G, Kurtz S: Readjoiner: a fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics. 2012, 13: 82.
      65. Navarro G, Mäkinen V: Compressed full-text indexes. ACM Computing Surv. 2007, 39: 2. 10.1145/1216370.1216372.
      66. Kreft S, Navarro G: On compressing and indexing repetitive sequences. Theor Comput Sci. 2013, 483: 115-133.
      67. Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ: A faster grammar-based self-index. Proceedings of the 6th International Conference on Language and Automata Theory and Applications (LATA). 2012, 240-251. Springer-Verlag, Berlin-Heidelberg: LNCS 7183
      68. Do HH, Jansson J, Sadakane K, Sung W-K: Fast relative Lempel-Ziv self-index for similar sequences. Proceedings of the Joint International Conference on Frontiers in Algorithmics and Algorithmic Aspects in Information and Management (FAW-AAIM). 2012, 291-302. Springer-Verlag, Berlin-Heidelberg: LNCS 7285
      69. Gagie T, Gawrychowski P, Puglisi SJ: Faster approximate pattern matching in compressed repetitive texts. Proceedings of the 22nd International Symposium on Algorithms and Computation (ISAAC). 2011, 653-662. Springer-Verlag, Berlin-Heidelberg: LNCS 7074
      70. Ahmadi A, Behm A, Honnalli N, Li C, Weng L, Xie X: Hobbes: optimized gram-based methods for efficient read alignment. Nucleic Acids Res. 2012, 40 (6): e41.
      71. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25.
      72. Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie. Nat Methods. 2012, 9: 357-359.
      73. Li H, Durbin R: Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760.
      74. Li H, Durbin R: Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595.
      75. Li R, Yu C, Li Y, Lam T-W, Yiu S-M, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (15): 1966-1967.
      76. Marco-Sola S, Sammeth M, Guigó R, Ribeca P: The GEM mapper: fast, accurate and versatile alignment by filtration. Nat Methods. 2012, 9 (12): 1185-1188.
      77. Kärkkäinen J: Fast BWT in small space by blockwise suffix sorting. Theor Comput Sci. 2007, 387: 249-257. 10.1016/j.tcs.2007.07.018.
      78. Ferragina P, Gagie T, Manzini G: Lightweight data indexing and compression in external memory. Algorithmica. 2012, 63 (3): 707-730. 10.1007/s00453-011-9535-0.
      79. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10 (1): 57-63.
      80. Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111.
      81. Rivals E: CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biol. 2013, 14 (3): R30.
      82. Alamancos GP, Agirre E, Eyras E: Methods to study splicing from high-throughput RNA Sequencing data. Publicly available preprint arXiv:1304.5952v1
      83. Li H: Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics. 2012, 28 (14): 1838-1844.
      84. Liu C-M, Wong TKF, Wu E, Luo R, Yiu S-M, Li Y, Wang B, Yu C, Chu X, Zhao K, Li R, Lam TW: SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics. 2012, 28 (6): 878-879.
      85. Luo R, Wong T, Zhu J, Liu C-M, Zhu X, Wu E, Lee L-K, Lin H, Zhu W, Cheung DW, Ting H-F, Yiu S-M, Peng S, Yu C, Li Y, Li R, Lam TW: SOAP3-dp: Fast, accurate and sensitive GPU-based short read aligner. PLoS ONE. 2013, 8 (5): e65632.
      86. Gog S, Petri M: Optimized succinct data structures for massive data. Softw Pract Exp. 2013, doi: 10.1002/spe.2198
      87. Loh P-R, Baym M, Berger B: Compressive genomics. Nat Biotechnol. 2012, 30 (7): 627-630.
      88. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
      89. Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664.
      90. Deorowicz S, Debudaj-Grabysz A, Grabowski Sz: Disk-based k-mer counting on a PC. BMC Bioinformatics. 2013, 14: Article no. 160. 10.1186/1471-2105-14-160.
      91. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829.
      92. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I: ABySS: A parallel assembler for short read sequence data. Genome Res. 2009, 19 (6): 1117-1123.
      93. Cao MD, Dix TI, Allison L: A genome alignment algorithm based on compression. BMC Bioinformatics. 2010, 11 (1): 599.
      94. Bhaduri A, Qu K, Lee CS, Ungewickell A, Khavari P: Rapid identification of nonhuman sequences in high throughput sequencing datasets. Bioinformatics. 2012, 28 (8): 1174-1175.
      95. Ferragina P, Giancarlo R, Greco V, Manzini G, Valiente G: Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment. BMC Bioinformatics. 2007, 8: 252.
      96. Li M, Chen X, Li X, Ma B, Vitányi PMB: The similarity metric. IEEE Trans Inf Theory. 2004, 50 (12): 3250-3264. 10.1109/TIT.2004.838101.
      97. Freschi V, Bogliolo A: A lossy compression technique enabling duplication-aware sequence alignment. Evol Bioinformatics. 2012, 8: 171-180.
      98. Illumina: HiSeq 2500 system user guide. 2012. [http://supportres.illumina.com/documents/myillumina/223bf628-0b46-409f-aa3d-4f3495fe4f69/hiseq2500_ug_15035786_a_public.pdf]
      99. Illumina: New algorithms increase computing efficiency for IGN whole-genome analysis. 2013. [http://res.illumina.com/documents/products/technotes/technote_ign_isaac_software.pdf]

      Copyright

      © Deorowicz and Grabowski; licensee BioMed Central Ltd. 2013

      This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
