Index Construction
In order to find matches of the input in reference blocks, we need an index structure over the reference blocks. Suffix trees[17] are tree data structures which allow for fast access to all suffixes of a given string. Each suffix is usually represented by a path from the root of the tree to a leaf. Please note that the longest prefix-suffix match problem of s in t can be be easily solved by depth-first search given a suffix tree for the string t. The focus recently shifted towards so-called compressed suffix trees. One example of a compressed suffix tree implementation is CST++[18], which uses a new approach to compress the longest common prefix-array and achieves, depending on the sampling rates of the (inverse) suffix array, an usual space overhead of (4−6)∗n bits. In the following, the compressed suffix tree of a string s is denoted with CST(s).
In addition, we need a reference genome for our compression algorithm. The reference sequence is given as a set of compressed FASTA-files, one for each chromosome. The index generation algorithm iterates over all chromosomes of the reference genome. Each chromosome is split up into blocks of a maximum size BS. For instance, given the textual representation s1 of Chromosome 1, we compute m substrings b1,1,…,b1,m, such that
We do the same for all other chromosomes of the reference genome. For each reference block a compressed suffix tree is computed and stored on hard disk together with the raw reference block. In addition, we keep track of the order of the blocks, which is induced by their position on the chromosomes. This meta information is used below for optimizing our compression algorithm.
The memory consumption during index creation is limited as follows: at each step of the index generation we have one raw reference block of size at most BS bytes in main memory plus (roughly) 4∗BS bytes for its compressed suffix tree. For example, given a BS of 4 MB, the main memory usage can be restricted to approximately 20 MB. The maximum value for BS is 300 MB, since the largest human chromosome (Chromosome 1) has about 247 million nucleotide base pairs, and each base is encoded as one byte.
Please note that we could have encoded bases of the reference genome with 3 bits (five entries: A,C,G,N,T), in order to further reduce memory usage. However, our tests indicate that the space saving yields a significant time overhead for accessing bases. Furthermore, the 3-Bit-encoding would not allow us to compress sequences against references with other symbols than these five.
Compression and Decompression
In the following, we present our compression algorithm for genome sequences in detail. Algorithms do not show range checks for the sake of readability. Algorithm 2 assumes the input genome in Input (as a string of bases). The input string is traversed from left to right, and depending on the current characters in the input and in the reference block, different subroutines are executed.
Algorithm 2: Compression Algorithm
1: P
in
←0
2: P
raw
←0
3: FIND-MATCH
4: while P
in
≠ ∣Input∣do
5: if Input(P
in
) = B(P
raw
) then
6: ENCODE-REF
7: else if Input(P
in
)∉{A,C,G,T} then
8: ENCODE-RAW
9: else
10: FIND-MATCH
11: end if
12: end while
In the beginning of the compression algorithm, a match for the current input position in the reference is searched, using function FIND-MATCH, Algorithm 3, explained in detail below. If the current reference block B matches the input at the current position, then we try to generate a (as long as possible) reference match in function ENCODE-REF, Algorithm 4. If the current input base does not equal the current reference block base, and the input is not a normal base, i.e. neither A, C, G, or T, then the base and all the following non-normal bases are added as raw entries to the compressed genome file in function ENCODE-RAW, Algorithm 5. Finally, if neither of the conditions is satisfied, then the algorithm tries to find a new match (either in the current block or in another reference block) using function FIND-MATCH.
Algorithm 3: FIND-MATCH Function
1: Max←25
2: if not HAS-LOCAL-ALTERNATIVE() then
3: for all B
j
∈Blocks do
4: if m(Input,P
in
,B
j
)≥Max then
5: Max←m(Input,P
in
,B
j
)
6: B←B
j
7: end if
8: end for
9: if Max≤25then
10: Raws←Input(P
in
,Max)
11: Add R(Raws) to output
12: P
in
←P
in
+ Max
13: else
14: P
raw
← beginning of the longest match in B
15: Add BC(B) to output
16: end if
17: end if
Algorithm 4: ENCODE-REF Function
1: M←0
2: while Input(P
in
)=B(P
raw
)do
3: M←M + 1
4: P
in
←P
in
+ 1
5: P
raw
←P
raw
+ 1
6: end while
7: Add RM(P
raw
-M,M) to output
Algorithm 5: ENCODE-RAW Function
1: Raws←""
2: while Input(P
in
)∉{A,C,G,T}do
3: Raws←Raws∘Input[P
in
];
4: P
in
←P
in
+ 1;
5: end while
6: Add R(Raws) to output
In Algorithm 4 the number of matching characters between current input position and current reference position are determined and stored in variable M. A reference entry RM(P
raw
-M,M) is added to the compressed output. The encoding of a raw sequence in Algorithm 5 is straight-forward: The string Raws is filled with bases from the input until a normal base is found and R(Raws) is added to the output.
Algorithm 3 requires more thoughts. The function FIND-MATCH is called, whenever there is a mismatch between the input and the reference. In this case, we need to find an alternative match for the current input position. First, we check whether there exists a match in the neighbourhood (mutations caused by few single nucleotide polymorphisms) of the current position of the reference block. The process is explained in detail later. If we cannot find a local match, then all the reference blocks are checked for a better match. The expression m(Input,P
in
,B
j
) returns the position of the longest prefix-suffix match of the current input in B
j
.
In our implementation of Algorithm 3, we care about the order in which all reference blocks are traversed. This is necessary, since loading blocks from the disk is a time consuming effort and should be avoided. Reference blocks are traversed with the following heuristic:
-
1.
The block left and right of the current reference block
-
2.
All other blocks on the same chromosome as the current reference block
-
3.
All other blocks of the reference genome
If there is no long enough match in any reference block, then we just encode few raw bases. Matches in other blocks are required to be longer than 25 characters, in order to avoid random matches. Our experiments have shown that the mere length of the human DNA causes a lot of unrelated matches with less than 20-25 characters.
Algorithm 3 uses the function HasLocalAlternative. The intuition is that we want to avoid searching all possible reference blocks all the time. Our experiments showed that one longest prefix-suffix-match lookup-operation in a single compressed suffix tree can take up to few milliseconds depending on the size. This is basically caused by the high bit-wise compression of the reference genome. Furthermore, potentially we have to load all the other reference block’s CTSs from the hard disk. Therefore, the naive strategy to search for a new reference block on each mismatch does not scale.
Instead we use a local search in the neighbourhood of the current input position and of the current reference position. This strategy has a biological foundation: Often two parts of a genome might only be different by few bases (base insertion, base removal, of base mutation). Whenever we find an appropriate match in the neighbourhood, we avoid checking other reference blocks, although they might contain better matches. In fact, if we searched for the best reference blocks each time, we could increase the compression rate slightly for the price of being orders of magnitude slower.
The algorithm for local neighbourhood search is shown in Algorithm 6. If a match of at least 20 characters can be found near the current input and reference positions, then the difference until the match is encoded raw, and the match is encoded referentially.
Decompression of the relative genome data is straightforward. Basically, all entries are processed according to their type. For decompression of genome data we do not need the compressed suffix trees any more, but only the raw blocks of the reference genome. The decompression of genome data is the least CPU-intensive task. Our evaluation in the next section shows that index generation and compression are CPU-bound, while the decompression phase is I/O-bound.
Algorithm 6: HAS-LOCAL-ALTERNATIVE Function
1: for i∈{0,1,…,6,7}do
2: for j∈{−7,−6,…,6,7}do
3: if Input(P
in
+ i,20)=B(P
raw
+ j,20)then
4: if i≥1then
5: Raws←Input(P
in
,i)
6: Add R(Raws) to output
7: P
in
←P
in
+ i
8: end if
9: P
raw
←P
raw
+ j
10: ENCODE-REF
11: RETURN
12: end if
13: end for
14: end for
We emphasize that our compression scheme is a heuristic, which mainly works when compressing a sequence with respect to a reference from the same species. The efficiency of this heuristic is evaluated in Section “Evaluation”. The worst-case time complexity of our compression algorithm is O(—Input—), since we need to traverse the whole input sequence one time from left to right and for each character we perform (in the worst case) one lookup in the reference. The complexity of the lookup depends on the length of the substring being looked up (we assume a fixed maximal match length) and is therefore constant. The worst-case space complexity is O(—Input—). Each character of the input is stored at most one time: either inside a raw block or as part of a referential block.