Fast local fragment chaining using sum-of-pair gap costs

Background Fast seed-based alignment heuristics such as BLAST and BLAT have become indispensable tools in comparative genomics for all studies aiming at the evolutionary relations of proteins, genes, and non-coding RNAs. This is true in particular for the large mammalian genomes. The sensitivity and specificity of these tools, however, crucially depend on parameters such as seed sizes or maximum expectation values. In settings that require high sensitivity the amount of short local match fragments easily becomes intractable. Then, fragment chaining is a powerful leverage to quickly connect, score, and rank the fragments to improve the specificity. Results Here we present a fast and flexible fragment chainer that for the first time also supports a sum-of-pair gap cost model. This model has proven to achieve a higher accuracy and sensitivity in its own field of application. Due to a highly time-efficient index structure our method outperforms the only existing tool for fragment chaining under the linear gap cost model. It can easily be applied to the output generated by alignment tools such as segemehl or BLAST. As an example we consider homology-based searches for human and mouse snoRNAs demonstrating that a highly sensitive BLAST search with subsequent chaining is an attractive option. The sum-of-pair gap costs provide a substantial advantage is this context. Conclusions Chaining of short match fragments helps to quickly and accurately identify regions of homology that may not be found using local alignment heuristics alone. By providing both the linear and the sum-of-pair gap cost model, a wider range of application can be covered. The software clasp is available at http://www.bioinf.uni-leipzig.de/Software/clasp/.


Background
The detection of (potentially) homologous sequence fragments is a basic task in computational biology that underlies all comparative approaches from molecular phylogenetics to gene finding, from detailed analysis of evolutionary patterns of individual genes to global comparisons of genome structure. On genome-wide scales, BLAST [1] has become the bioinformatician's work horse for homology search, with a sensitivity and specificity that is sufficient for most applications in comparative genomics. It is in particular the basis for the currently available genome-wide alignments, which in turn underlie a wide variety of subsequent analyses.
Some specialized tasks such as the search for distant homologs of short structured RNAs [2], require more sensitive techniques. In particular, sequence families exhibiting only short conserved blocks interspersed with highly variable regions are difficult for BLAST or BLAT [3] because the seeds have to be very short in this case. This typically leads to a huge number of short match fragments that require sophisticated post-processing to discriminate single random hits from sets of adjacent hits potentially indicating true homologs.
The objective of fragment chaining is to efficiently find sets of consistent fragments with a maximal score [4]. The order of fragments is assumed to be congruent in both query and database sequences. While the case of overlapping fragments is explicitly excluded, gaps between fragments are allowed and may be penalized according to different scoring models. In the case of a local fragment chaining, the score of any fragment within a chain must not be smaller than the penalty that is assigned to the gap to the successive fragment. Thus, a chain is a sequence of non-overlapping, i.e., disjoint, ordered fragments and its score is the sum of their fragment scores minus the penalties for any gaps between them. Introduced in sequence alignments [5], fragment chaining may be used in several comparative tasks such as whole genome comparison, cDNA/EST mapping, or identifying regions with conserved synteny as described in [6].
Let f beg .x, f end .x denote the start and end position of a fragment f in the database sequence x. The start and end positions in the query y are denoted by f beg .y and f end .y, respectively. Let f and f' be two non-overlapping ordered fragments, i.e., assume f end .x < f beg .x and f end .y < f beg .y. Linear gap costs g 1 (f', f) between the fragments f and f' are calculated by: .y − f end .y − 1 | , and weighting parameters λ g 1 , ε g 1 0. Note that the use of weighting parameters in the gap cost model is equivalent to linear weights on fragment scores. A graphical illustration of fragments and chaining connections is shown in Figure 1. For λ g 1 , ε g 1 > 0 linear gap costs penalize any distance between fragments on query and database sequence. This scoring system may not be suitable, however, when scattered blocks of local sequence conservation are expected.
The more flexible sum-of-pair gap cost model introduced by Myers and Miller [7] allows to penalize differences of the distances between adjacent fragments on query and database only. The sum-of-pair gap costs g sop (f', f) between non-overlapping ordered fragments f and f' is given by with parameters λ g sop , ε g sop 0 . Intuitively, λ g sop expresses the penalty to align an anonymous character with a gap position while ε g sop is the penalty to align two anonymous characters. With e g sop = 0 , the chaining only minimizes the distance difference between fragments.
The software tool CHAINER, a part of CoCoNUT [8,9], implements fragment chaining with linear gap costs. AXTCHAIN, part of the UCSC genome browser pipeline, also uses the linear gap model [10,11]. The tool expects pairwise alignments alignments as input and hence cannot be used "as is" with plain fragment files produced from external applications. The SeqAn library provides algorithms for fragment chaining with different gap cost models [12]. A running tool that implements these models, however, is not available at present.

Implementation
We implemented the local fragment chaining algorithm, introduced by [4,6]. In addition to the linear gap cost model in CHAINER, the more flexible sum-of-pair gap cost model has been incorporated for the first time in a standalone tool.
The chaining algorithm is based on sparse dynamic programming [13], since for any fragment only a small set of possible predecessors needs to be considered in order to find the optimal one. More precisely, the optimal predecessor is a non-overlapping chain preceding the fragment in both database and query sequence that leads to the maximal combined score considering the gap cost penalty between them. In the case of local fragment chaining, the fragment is chained to the optimal predecessor only if its score is equal to or higher than the necessary gap costs. Using theoretical results on both gap cost models [4], priorities can be assigned to chains in such a way that the optimal predecessor has the maximal priority. Using the line-sweep paradigm, the algorithm scans through the list of fragment start and end points ordered by their database position. For any start point, the optimal predecessor is identified by means of range maximum queries (RMQs) over the set of active chains, i.e., chains only comprised of fragments with already processed end points. The RMQ reports the element with maximal priority within a given range that involves only non-overlapping chains preceding the current fragment in both database and query sequence. For any end point, a novel chain is generated by connecting the optimal predecessor to the current fragment position on database sequence x position on query sequence y Figure 1 Graphical representation of fragments and chaining connections. Graphical representation of fragments as blocks with their respective database and query positions. All valid chaining connections are depicted as edges including their distance on database x and query sequence y. Note that f 1 and f 3 can not be chained due to their overlap on the query sequence y. and is marked as active. In the end, the algorithm groups together chains with common first fragment and reports the best-scoring chain of each group. Note that a fragment does not necessarily have to be the first fragment of any best-scoring chain.
In contrast to CHAINER, we implemented Johnson priority queues [14] and range trees padded with Johnson priority queues instead of simple kd-trees to support RMQs. One-dimensional RMQs are answered using Johnson priority queues, i.e., semi-dynamic tree structures permitting non-recursive binary searches on tree paths. The priority domain, i.e., the range of possible priorities, is defined at the point of initialization. Hence, the balanced tree structure provides binary search information at tree nodes. In order to condense the priority domain, we linked the priorities to the sorting order of all potential elements. Let n be the length of the priority domain. Johnson priority queues support predecessor, successor, insert, and delete operations in O(log(log(n))) time. To efficiently implement sum-of-pair gap costs we need to consider two distinct sorting dimensions [4]. For the two-dimensional RMQs, range trees were padded with Johnson queues (see Figure 2). More precisely, the range tree is a primary binary search tree for all elements sorted by their firstdimension order. Additionally, each node v stores a Johnson priority queue containing all elements in the subtree beneath v, referred to as the canonical subset CS(v). Elements in Johnson priority queues are sorted by the second-dimension order. In summary, the implemented fragment chaining algorithm requires O(n(log(n)) in time with linear gap costs and O(n(log(n)(log(n))) in time with sum-of-pair gap costs.
Because the database is typically much larger than the query sequence, we introduced a novel clustering approach to facilitate local fragment chaining. The basic idea is to improve the running time by assigning fragments to clusters that can be chained separately from each other without resulting in different chaining outcome. It first pools neighboring fragments in a single linear scan using the following observation: Let f and f' be two adjacent non-overlapping fragments on the database sequence. Clearly, f' and f may never be chained and can be assigned to different clusters if (3) where max score is the highest possible chain score and max y is the maximal distance of fragments on the query sequence. Note that max score is bounded from above by the length of the query multiplied by the maximal score per fragment position. Estimates of max score and max y are calculated and updated during the linear scan. Hence, the clustering is accomplished with only one linear scan consuming only a negligible amount of additional memory. Subsequently, rather than applying the chaining algorithm to the entire list of fragments, each of the clusters can be chained separately, improving both running time and memory consumption. In the worst case, all fragments are in the same cluster leading to the same performance as without clustering. We incorporated clustering in local fragment chaining with linear gap costs using an analogous condition. Note that fragments from different queries or database sequences (e.g., chromosomes) can be processed in a single pass by our tool but are generally chained separately from each other (even without use of clustering).
More details on the implemented data structures, their worst-case time complexities, and the chaining algorithm can be found in the Additional file 1. Note that the algorithm is implemented for two-dimensional fragments only, i.e., fragments with position information on one query and one database sequence, due to its intended area of application.

Performance Tests
In order to evaluate the performance of clasp using linear gap costs with ε g 1 = 1 and λ g 1 = 1, we compared it to CHAINER v3.0 with options -l -lw 1 producing comparable scores. Each simulated data set contained fragments of length 100 covering 1 KB query sequences, uniformly sampled from a virtual 100 KB large database. Scores were sampled from a normal distribution. Both programs were executed singlethreaded on the same 64-Bit machine with equal data sets. Moreover, the performance of clasp was analyzed with and without the use of our clustering method. The results for different numbers of sampled fragments are shown in Figure 3 and 4. We measured the performance in terms of running time in user mode and peak virtual memory consumption. If not disabled, the clustering procedure as an integral part  of our algorithm is naturally included in all measurements of running time and memory consumption.
In terms of running time, clasp (with and without clustering) outperforms CHAINER in any tested setting at the expense of a three-fold increased memory consumption during execution. Due to the uniform distribution of query sequences the use of clustering only leads to a minor performance improvement. In each test case, the quality of the chains was assessed by comparing the distributions of chain scores reported by both programs. In a few cases, only marginal differences between clasp and CHAINER were observed. These differences do not require further attention from our side.

Homology searches with Human box H/ACA snoRNAs
To assess the performance of clasp in real-life applications, a sequence-based homology search was carried out. Human box H/ACA snoRNA families, an important class of structured RNAs, were selected to identify potentially homologous regions in entire genome of Mus musculus. BLAST fails to report sufficiently long hits but, e.g., in the case of the 134 nt long Human H/ACA snoRNA 42 (SNORA42 in the snoRNABase [15]), dumps more than 10 millions short hits in the mouse genome when executed in a very sensitive mode with small word sizes and high expectation values (options: -W 8 -e 1e+20 -F F).
We executed clasp using the sum-of-pair cost model with ε g sop = 0 , λ g sop = 0.5 (only punish for distance differences with half of the match score) fragment scores according to the length of the BLAST hit, and a minimal required chain score of 30. The use of clustering greatly reduced the memory requirements: Instead of more than 100 GB, the fragment chaining on the 1.2 GB BLAST output file consumed only 1.6 GB and took less than 5 minutes on a single 2.33 GHz 64-Bit Intel Xeon CPU. In the end, clasp reported 17 chains in disjoint regions of the mouse genome. In order to check for conservation of H-box and the ACA-motif, the mouse candidates were aligned to the initial Human H/ACA snoRNA 42 sequence using the multiple alignment tool ClustalW [16]. We further checked the secondary structure conservation and stability by folding each candidate using RNAsubopt [17] with constraints, i.e., demanding single-stranded regions at the H-box and ACA-motif. In total, we identified 7 of the 17 regions as H/ACA snoRNA candidates homologous to the Human H/ACA snoRNA 42 (see Additional file 2). The sequence alignment of the final candidates and the Human H/ACA snoRNA 42 including consensus secondary structure and sequence conservation is shown in Figure 5. By checking with previous annotations, all of the final candidates were confirmed as snoRNA orthologs by the Ensembl database [18,19]. However, ncRNAs in the Ensembl database were annotated using extensive Infernal screens with Rfam covariance models [20], i.e., profile stochastic context-free grammars comprising primary sequence and secondary structure information. To illustrate the benefits of the sum-of-pair gap cost model, we additionally compared the performance of clasp using both models in a snoRNA homology search experiment. We selected the entire set of 19   For each parameter setting, the true positive rate (i.e., the fraction of SNORA42 that was covered by at least one chain) was recorded with respect to the total number of reported chains, a function of the minimal required chain score. In the average as well as the best case of parameter selection the linear gap cost is outperformed by the sum-of-pair model ( Figure 6). Using sum-of-pair with ε g sop = 0 and λ g sop = 0.5 , 11 out of 19 annotated snoRNAs are among the 19 best chains. With linear gap costs and optimal parameter settings (εg 1 = λ g 1 = 0.1), a list of 900 best scoring chains has to be scanned to find the same number of annotated snoRNAs (49-fold increase). With suboptimal parameters, about 6000 chains (314-fold increase) need to be screened on average to retrieve the same amount of snoRNAs. Note that alternative weighting functions of fragment scores or in the linear gap cost model, e.g., affine or non-linear functions, are currently not implemented but are subject to further research.

Conclusions
Commonly used local alignment heuristics may fail to retrieve sequence families with scattered conservation. Chaining of short match fragments can overcome this limitation, thereby substantially enhancing the effective sensitivity of BLAST and similar approaches in homology search. The clasp tool implements a fast local fragment chaining algorithm supporting the linear and the sum-of-pair gap model. The latter is available for the first time in a running tool and is particularly suitable to cope with scattered sequence conservation, e. g., evolutionary conserved structured ncRNAs. In this field of application, it outperforms optimized linear gap models in terms of accuracy and sensitivity. We showed that the usage of Johnson priority queues greatly improves the runtime performance in comparison to the only existing fragment chaining tool CHAI-NER. The presented clustering approach facilitates clasp to tackle large amounts of short match data by alignment heuristics such as segemehl or BLAST. In a simple homology search with H/ACA snoRNAs, we were able to identify 7 H/ACA snoRNA candidates in mouse, all confirmed by the annotation in the Ensembl database. A large-scale survey for Human H/ACA snoRNA paralogs yielded 295 candidates with more than 70% sequence identity to Human H/ACA snoR-NAs from the snoRNABase. More than 98% of the Summary of H/ACA snoRNA candidates in Homo sapiens including their fragment counts and their overlap with previous annotations, i.e., the snoRNABase, the set of snoRNAs and snoRNA pseudogenes from the Ensembl database and the Eddy-BLAST-snornalib in the UCSC RNAGenes track.