Fast local fragment chaining using sum-of-pair gap costs
© Otto et al; licensee BioMed Central Ltd. 2011
Received: 29 October 2010
Accepted: 18 March 2011
Published: 18 March 2011
Fast seed-based alignment heuristics such as BLAST and BLAT have become indispensable tools in comparative genomics for all studies aiming at the evolutionary relations of proteins, genes, and non-coding RNAs. This is true in particular for the large mammalian genomes. The sensitivity and specificity of these tools, however, crucially depend on parameters such as seed sizes or maximum expectation values. In settings that require high sensitivity the amount of short local match fragments easily becomes intractable. Then, fragment chaining is a powerful leverage to quickly connect, score, and rank the fragments to improve the specificity.
Here we present a fast and flexible fragment chainer that for the first time also supports a sum-of-pair gap cost model. This model has proven to achieve a higher accuracy and sensitivity in its own field of application. Due to a highly time-efficient index structure our method outperforms the only existing tool for fragment chaining under the linear gap cost model. It can easily be applied to the output generated by alignment tools such as segemehl or BLAST. As an example we consider homology-based searches for human and mouse snoRNAs demonstrating that a highly sensitive BLAST search with subsequent chaining is an attractive option. The sum-of-pair gap costs provide a substantial advantage is this context.
Chaining of short match fragments helps to quickly and accurately identify regions of homology that may not be found using local alignment heuristics alone. By providing both the linear and the sum-of-pair gap cost model, a wider range of application can be covered. The software clasp is available at http://www.bioinf.uni-leipzig.de/Software/clasp/.
The detection of (potentially) homologous sequence fragments is a basic task in computational biology that underlies all comparative approaches from molecular phylogenetics to gene finding, from detailed analysis of evolutionary patterns of individual genes to global comparisons of genome structure. On genome-wide scales, BLAST has become the bioinformatician's work horse for homology search, with a sensitivity and specificity that is sufficient for most applications in comparative genomics. It is in particular the basis for the currently available genome-wide alignments, which in turn underlie a wide variety of subsequent analyses.
Some specialized tasks such as the search for distant homologs of short structured RNAs , require more sensitive techniques. In particular, sequence families exhibiting only short conserved blocks interspersed with highly variable regions are difficult for BLAST or BLAT because the seeds have to be very short in this case. This typically leads to a huge number of short match fragments that require sophisticated post-processing to discriminate single random hits from sets of adjacent hits potentially indicating true homologs.
The objective of fragment chaining is to efficiently find sets of consistent fragments with a maximal score . The order of fragments is assumed to be congruent in both query and database sequences. While the case of overlapping fragments is explicitly excluded, gaps between fragments are allowed and may be penalized according to different scoring models. In the case of a local fragment chaining, the score of any fragment within a chain must not be smaller than the penalty that is assigned to the gap to the successive fragment. Thus, a chain is a sequence of non-overlapping, i.e., disjoint, ordered fragments and its score is the sum of their fragment scores minus the penalties for any gaps between them. Introduced in sequence alignments , fragment chaining may be used in several comparative tasks such as whole genome comparison, cDNA/EST mapping, or identifying regions with conserved synteny as described in .
with parameters . Intuitively, expresses the penalty to align an anonymous character with a gap position while is the penalty to align two anonymous characters. With , the chaining only minimizes the distance difference between fragments.
The software tool CHAINER, a part of CoCoNUT[8, 9], implements fragment chaining with linear gap costs. AXTCHAIN, part of the UCSC genome browser pipeline, also uses the linear gap model [10, 11]. The tool expects pairwise alignments alignments as input and hence cannot be used "as is" with plain fragment files produced from external applications. The SeqAn library provides algorithms for fragment chaining with different gap cost models . A running tool that implements these models, however, is not available at present.
We implemented the local fragment chaining algorithm, introduced by [4, 6]. In addition to the linear gap cost model in CHAINER, the more flexible sum-of-pair gap cost model has been incorporated for the first time in a standalone tool.
The chaining algorithm is based on sparse dynamic programming , since for any fragment only a small set of possible predecessors needs to be considered in order to find the optimal one. More precisely, the optimal predecessor is a non-overlapping chain preceding the fragment in both database and query sequence that leads to the maximal combined score considering the gap cost penalty between them. In the case of local fragment chaining, the fragment is chained to the optimal predecessor only if its score is equal to or higher than the necessary gap costs. Using theoretical results on both gap cost models , priorities can be assigned to chains in such a way that the optimal predecessor has the maximal priority. Using the line-sweep paradigm, the algorithm scans through the list of fragment start and end points ordered by their database position. For any start point, the optimal predecessor is identified by means of range maximum queries (RMQs) over the set of active chains, i.e., chains only comprised of fragments with already processed end points. The RMQ reports the element with maximal priority within a given range that involves only non-overlapping chains preceding the current fragment in both database and query sequence. For any end point, a novel chain is generated by connecting the optimal predecessor to the current fragment and is marked as active. In the end, the algorithm groups together chains with common first fragment and reports the best-scoring chain of each group. Note that a fragment does not necessarily have to be the first fragment of any best-scoring chain.
where max score is the highest possible chain score and max y is the maximal distance of fragments on the query sequence. Note that max score is bounded from above by the length of the query multiplied by the maximal score per fragment position. Estimates of max score and max y are calculated and updated during the linear scan. Hence, the clustering is accomplished with only one linear scan consuming only a negligible amount of additional memory. Subsequently, rather than applying the chaining algorithm to the entire list of fragments, each of the clusters can be chained separately, improving both running time and memory consumption. In the worst case, all fragments are in the same cluster leading to the same performance as without clustering. We incorporated clustering in local fragment chaining with linear gap costs using an analogous condition. Note that fragments from different queries or database sequences (e.g., chromosomes) can be processed in a single pass by our tool but are generally chained separately from each other (even without use of clustering).
More details on the implemented data structures, their worst-case time complexities, and the chaining algorithm can be found in the Additional file 1. Note that the algorithm is implemented for two-dimensional fragments only, i.e., fragments with position information on one query and one database sequence, due to its intended area of application.
Results and Discussion
In terms of running time, clasp (with and without clustering) outperforms CHAINER in any tested setting at the expense of a three-fold increased memory consumption during execution. Due to the uniform distribution of query sequences the use of clustering only leads to a minor performance improvement. In each test case, the quality of the chains was assessed by comparing the distributions of chain scores reported by both programs. In a few cases, only marginal differences between clasp and CHAINER were observed. These differences do not require further attention from our side.
Homology searches with Human box H/ACA snoRNAs
To assess the performance of clasp in real-life applications, a sequence-based homology search was carried out. Human box H/ACA snoRNA families, an important class of structured RNAs, were selected to identify potentially homologous regions in entire genome of Mus musculus. BLAST fails to report sufficiently long hits but, e.g., in the case of the 134 nt long Human H/ACA snoRNA 42 (SNORA42 in the snoRNABase ), dumps more than 10 millions short hits in the mouse genome when executed in a very sensitive mode with small word sizes and high expectation values (options: -W 8 -e 1e+20 -F F).
Novel candidates of Human H/ACA snoRNA paralogs
annotated candidate regions in %
fragments per chain
number of chains
Commonly used local alignment heuristics may fail to retrieve sequence families with scattered conservation. Chaining of short match fragments can overcome this limitation, thereby substantially enhancing the effective sensitivity of BLAST and similar approaches in homology search. The clasp tool implements a fast local fragment chaining algorithm supporting the linear and the sum-of-pair gap model. The latter is available for the first time in a running tool and is particularly suitable to cope with scattered sequence conservation, e.g., evolutionary conserved structured ncRNAs. In this field of application, it outperforms optimized linear gap models in terms of accuracy and sensitivity. We showed that the usage of Johnson priority queues greatly improves the runtime performance in comparison to the only existing fragment chaining tool CHAINER. The presented clustering approach facilitates clasp to tackle large amounts of short match data by alignment heuristics such as segemehl or BLAST. In a simple homology search with H/ACA snoRNAs, we were able to identify 7 H/ACA snoRNA candidates in mouse, all confirmed by the annotation in the Ensembl database. A large-scale survey for Human H/ACA snoRNA paralogs yielded 295 candidates with more than 70% sequence identity to Human H/ACA snoRNAs from the snoRNABase. More than 98% of the candidates have been annotated previously, in particular with respect to the extensive Ensembl ncRNA screens, emphasizing the high specificity of this rather simple homology search.
Availability and requirements
Project name: clasp
Project home page: http://www.bioinf.uni-leipzig.de/Software/clasp/
Operating system(s): platform independent
Programming language: C
Other requirements: none
License: GNU GPL
Any restrictions to use by non-academics: Note that a license is needed to include the source code from the clasp in commercial software projects.
We thank Christian Anthon for contributing to the tests at running clasp.
This publication is supported by LIFE - Leipzig Research Center for Civilization Diseases, Universität Leipzig. LIFE is funded by means of the European Union, by the European Regional Development Fund (ERFD) and by means of the Free State of Saxony within the framework of the excellence initiative. JG is supported by the Danish Strategic Research Council, the Danish Research council for Technology and Production, and Danish Center for Scientific Computation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-10.PubMedView ArticleGoogle Scholar
- Mosig A, Zhu L, Stadler PF: Customized strategies for discovering distant ncRNA homologs. Brief Funct Genomics Proteomics. 2009, 8: 451-460. 10.1093/bfgp/elp035View ArticleGoogle Scholar
- Kent WJ: BLAT -the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-64.PubMedPubMed CentralView ArticleGoogle Scholar
- Abouelhoda MI, Ohlebusch E: Multiple Genome Alignment: Chaining Algorithms Revisited. Combinatorial Pattern Matching: 14th Annual Symposium, CPM 2003, Morelia, MichoacÃ¡n, Mexico, June 25-27, 2003. Proceedings, Volume 2676/2003 of Lecture Notes in Computer Science. 2003, Springer Berlin/Heidelberg,Google Scholar
- Morgenstern B: A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences. Applied Mathematics Letters. 2002, 15: 11-16. 10.1016/S0893-9659(01)00085-4View ArticleGoogle Scholar
- Abouelhoda MI, Ohlebusch E: Chaining algorithms for multiple genome comparison. Journal of Discrete Algorithms. 2005, 3 (2-4): 321-341. 10.1016/j.jda.2004.08.011View ArticleGoogle Scholar
- Myers G, Miller W: Chaining multiple-alignment fragments in sub-quadratic time. SODA '95: Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms. 1995, 38-47. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics,Google Scholar
- Abouelhoda MI, Ohlebusch E: CHAINER: Software for Comparing Genomes. Proceedings of the 12th International Conference on Intelligent Systems for Molecular Biology + 3rd European Conference on Computational Biology. 2004,Google Scholar
- Abouelhoda MI, Kurtz S, Ohlebusch E: CoCoNUT: an efficient system for the comparison and analysis of genomes. BMC Bioinformatics. 2008, 9: 476- 10.1186/1471-2105-9-476PubMedPubMed CentralView ArticleGoogle Scholar
- Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D: Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA. 2003, 100 (20): 11484-9. 10.1073/pnas.1932072100PubMedPubMed CentralView ArticleGoogle Scholar
- Karolchik D, Hinrichs AS, Kent WJ: The UCSC Genome Browser. Curr Protoc Bioinformatics. 2009, Chapter 1: Unit1.4-PubMedGoogle Scholar
- Döring A, Weese D, Rausch T, Reinert K: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics. 2008, 9: 11-PubMedPubMed CentralView ArticleGoogle Scholar
- Eppstein D, Galil Z, Giancarlo R, Italiano GF: Sparse dynamic programming I: linear cost functions. J ACM. 1992, 39 (3): 519-545. 10.1145/146637.146650View ArticleGoogle Scholar
- Johnson DB: A Priority Queue in Which Initialization and Queue Operations Take O(log log D) Time. Mathematical Systems Theory. 1982, 15 (4): 295-309.Google Scholar
- Lestrade L, Weber MJ: snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res. 2006, D158-62. 34 Database,Google Scholar
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-8. 10.1093/bioinformatics/btm404PubMedView ArticleGoogle Scholar
- Wuchty S, Fontana W, Hofacker IL, Schuster P: Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers. 1999, 49 (2): 145-65. 10.1002/(SICI)1097-0282(199902)49:2<145::AID-BIP4>3.0.CO;2-GPubMedView ArticleGoogle Scholar
- Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M: The Ensembl genome database project. Nucleic Acids Res. 2002, 30: 38-41. 10.1093/nar/30.1.38PubMedPubMed CentralView ArticleGoogle Scholar
- Flicek P, Aken BL, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Gräf S, Haider S, Hammond M, Howe K, Jenkinson A, Johnson N, Kähäri A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Koscielny G, Kulesha E, Lawson D, Longden I, Massingham T, McLaren W, Megy K, Overduin B, Pritchard B, Rios D, Ruffier M, Schuster M, Slater G, Smedley D, Spudich G, Tang YA, Trevanion S, Vilella A, Vogel J, White S, Wilder SP, Zadissa A, Birney E, Cunningham F, Dunham I, Durbin R, Fernández-Suarez XM, Herrero J, Hubbard TJ, Parker A, Proctor G, Smith J, Searle SM: Ensembl's 10th year. Nucleic Acids Res. 2010, D557-62. 38 Database,Google Scholar
- Gardner PP: The use of covariance models to annotate RNAs in whole genomes. Brief Funct Genomic Proteomic. 2009, 8 (6): 444-50. 10.1093/bfgp/elp042PubMedView ArticleGoogle Scholar
- Eddy-BLAST-snornalib in the UCSC RNAGenes track. http://genome.ucsc.edu/cgi-bin/hgTables?db=hg18&hgta_group=genes&hgta_track=rnaGene&hgta_table=rnaGene&hgta_doSchema=describe+table+schema
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.