CHSMiner: a GUI tool to identify chromosomal homologous segments
© Wang et al; licensee BioMed Central Ltd. 2009
Received: 21 September 2008
Accepted: 15 January 2009
Published: 15 January 2009
The identification of chromosomal homologous segments (CHSs) within and between genomes is essential for comparative genomics. Various processes, including insertion/deletion and inversion, can cause the degeneration of CHSs.
Here we present CHSMiner, a Java program that detects CHSs based on shared gene content alone. It implements a fast greedy search algorithm and rigorous statistical validation, and its user-friendly graphical interface allows interactive visualization of the results. We tested the software on both simulated and real biological data and compared its performance with similar existing software and data sources.
CHSMiner is characterized by its integrated workflow, fast speed and convenient usage. It will be useful for both experimentalists and bioinformaticians interested in the structure and evolution of genomes.
The identification of chromosomal homologous segments (CHSs) within and between genomes (known as paralogons and syntenies, respectively) is essential for comparative genomics. It not only helps evolutionary biologists study genome evolution, such as genome duplication and rearrangement [1, 2], but also helps experimental biologists transfer gene function information from one genome to another. Although extensive gene mutation, deletion, and insertion mean that CHSs are not always obvious from primary sequences, chromosomal homology can still be revealed by a pair of segments sharing a group of homologous genes. Most existing programs, including ADHoRe, FISH and LineUp, look for CHSs based on the conservation of both gene content and order (colinearity). While this approach is sensitive enough for moderate divergence, it has been pointed out that requiring conserved gene order may be too strict for more ancient divergence, as inversion is another dominant force in the degeneration of CHSs. For example, the whole genome duplication in early vertebrate evolution can only be inferred by discarding gene order and considering gene content alone. A pioneering implementation of this strategy was CloseUp, but some limitations remain, especially given the rapid increase of genomic data. First, it uses Monte Carlo simulation to estimate the statistical significance of identified CHSs, which may no longer be suitable for whole genome sequence analysis, as thousands of annotated genes make it quite time-consuming. Second, previous tools were mainly developed for computational biologists, which restricted their use among experimental biologists.
In our recent project to build the paralog/paralogon database EPGD, we found it necessary to develop new software that could overcome those weaknesses. Here, we publish it as a complete Java package named CHSMiner. Its core algorithm has been used successfully to construct our database, and several improvements were added later. In short, it can not only rapidly identify and evaluate CHSs from whole genome comparisons, but also provides a convenient graphical interface for end users to visualize the results.
Fast greedy search algorithm
Formal statistical evaluation
Finally, we multiply the probabilities that the cluster is observed in each of the two genomes under comparison, with parameters (n1, F1) and (n2, F2) respectively:
Q(n1, m, g, F1)Q(n2, m, g, F2)
The value reflects the probability that a given CHS with maximal gap size g or smaller is observed in two independently and randomly ordered genomes. When the size m of the CHS is fixed, the smaller the maximal gap size is, the less likely the CHS is to be observed by chance. Therefore, the value can be treated as the p-value for the CHS. As many CHSs are assessed in a whole genome comparison, we recommend applying an extra multiple testing correction (e.g. Bonferroni correction) to the raw p-values in order to control false positives.
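The combination and correction steps can be illustrated with a minimal Java sketch. It assumes the two per-genome probabilities Q(n1, m, g, F1) and Q(n2, m, g, F2) have already been computed and are passed in as plain numbers; the class and method names (ChsPValues, rawPValue, bonferroni) and the example values are hypothetical and are not taken from the CHSMiner source code.

```java
/** Illustrative sketch only: names and inputs are assumptions, not CHSMiner's API. */
final class ChsPValues {

    /** Raw p-value of one CHS: the product of the two per-genome probabilities. */
    static double rawPValue(double q1, double q2) {
        if (!(q1 >= 0 && q1 <= 1 && q2 >= 0 && q2 <= 1)) {
            throw new IllegalArgumentException("probabilities must lie in [0, 1]");
        }
        return q1 * q2;
    }

    /** Bonferroni correction: multiply each raw p-value by the number of tests, capped at 1. */
    static double[] bonferroni(double[] raw) {
        double[] corrected = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            corrected[i] = Math.min(1.0, raw[i] * raw.length);
        }
        return corrected;
    }

    public static void main(String[] args) {
        // Hypothetical Q values for three candidate CHSs.
        double[] raw = {
            rawPValue(0.01, 0.02),
            rawPValue(0.2, 0.3),
            rawPValue(0.001, 0.001)
        };
        double[] corrected = bonferroni(raw);
        for (int i = 0; i < raw.length; i++) {
            boolean kept = corrected[i] < 0.05; // same threshold as used in the evaluation
            System.out.printf("CHS %d: raw p = %.2e, corrected p = %.2e, kept = %b%n",
                    i, raw[i], corrected[i], kept);
        }
    }
}
```

Bonferroni is the simplest choice and is conservative; with thousands of candidate CHSs it trades some sensitivity for strict control of false positives.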
Java package and GUI for visualization
- Automatic data download from the Ensembl database for well-assembled genomes.
- Interactive operations and flexible parameter settings.
- Visual display of CHSs, from an individual segment to the whole-genome pattern (Figure 2B, C).
- Useful graphics functions: images can be saved in a vector graphics format for further editing.
The application is written entirely in Java and distributed as an executable jar package. It runs on any platform supporting the Java Runtime Environment (1.5 or higher). Full source code and documentation are also provided at our web site, and users can access them under the GNU General Public License v2.0.
Results and discussion
Comparison on simulated data
[Table: Summary of the four programmes for comparison]
It is clear that the sensitivity of the algorithm based on colinearity gradually declines as the number of inversions increases, whereas the algorithms based on gene content alone are quite robust to such disorder. HomologyTeams has the advantage of finding nonnested regions, but its gain in sensitivity is not evident until inversions are extremely frequent (>10^5). In addition, as statistical validation is not implemented in HomologyTeams, its specificity becomes much lower when the background similarity is increased. CHSMiner and CloseUp consistently show similar and satisfactory sensitivity and specificity for different R and F, suggesting that the analytical method of CHSMiner works as well as Monte Carlo simulation on empirical data. Nonetheless, CHSMiner is much faster than CloseUp. On a single Pentium processor, CloseUp required more than one hour to run the simulated data set (1000 permutations for each CHS to get a reliable assessment), whereas CHSMiner took less than one minute. In our experience, time is an important factor in genome comparison, as the program parameters usually need to be adjusted repeatedly. Thus, our tool greatly improves efficiency and usability.
Comparison on human-mouse synteny map
In order to show its performance on real biological data, we used CHSMiner to construct the synteny map for human and mouse. We downloaded homolog information from the Ensembl database and ran the program with different maximal gap sizes. Each synteny detected was evaluated by its corrected p-value (Bonferroni method), and only those with values smaller than 0.05 were retained. The results were compared with the synteny map provided by Ensembl (release 47), which was generated from primary DNA sequence alignments.
[Table 2: Number of orthologs covered by Ensembl synteny map and CHSMiner result. Columns: CHSMiner result (by maximal gap size); Ensembl synteny map]
When we increase the maximal gap size to five genes, the coverage of detected syntenies becomes larger (Table 2). Not only are nearly all orthologs present in the Ensembl map recovered (18135 of 18753, 97%), but a substantial number of those absent from it (1209 of 3518, 34%) are also discovered. The result does not change much when the gap size is increased further (up to 30, data not shown). Since a strict statistical criterion has been applied for filtering, the newly obtained CHSs are unlikely to be false positives. A reasonable interpretation is that those degraded CHSs cannot be recognized from the primary sequence by the strategy used by Ensembl. Therefore, CHSMiner is more flexible and can reveal more complete CHSs when proper parameters are selected.
CHSMiner is designed to identify chromosomal homologous segments based on gene content alone, which enables it to discover highly degenerated homology. Compared with previous tools, it has at least three significant advantages: (1) it combines the search algorithm, statistical validation and result display in a single platform; (2) it improves both accuracy and efficiency; (3) its graphical and interactive interface makes it easy to use. We hope it will be helpful for biologists who are interested in the structure and evolution of genomes.
First, two artificial chromosomes were created, each containing 1000 genes. The background similarity was simulated by assigning a gene to be the homolog of some other gene with probability R, regardless of their locations. Then the middle 20% of the two chromosomes were specified as a known CHS. Within the region, a gene in one chromosome would have a corresponding homolog in the other chromosome with probability 0.3. Finally, the inversions were simulated by exchanging two randomly chosen neighbouring gene pairs.
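A minimal Java sketch of this simulation procedure is given below, under several stated assumptions: a background homolog is drawn uniformly from the other chromosome, the planted CHS pairs genes at the same position, and each inversion swaps one randomly chosen pair of adjacent genes on a randomly chosen chromosome. The class name ChsSimulation, its fields, and the example values of R and the inversion count are hypothetical and are not taken from the original simulation code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only: one possible reading of the simulation described above. */
final class ChsSimulation {
    final int[] orderA;                              // gene ids along chromosome A
    final int[] orderB;                              // gene ids along chromosome B
    final List<int[]> homologs = new ArrayList<>();  // pairs {geneIdInA, geneIdInB}

    ChsSimulation(int genes, double r, double pInChs, int inversions, long seed) {
        if (genes < 2) {
            throw new IllegalArgumentException("need at least two genes");
        }
        Random rng = new Random(seed);
        orderA = identity(genes);
        orderB = identity(genes);

        // The middle 20% of each chromosome is the predefined CHS.
        int chsStart = genes * 2 / 5;
        int chsEnd = genes * 3 / 5;

        for (int i = 0; i < genes; i++) {
            // Background similarity: homolog at a random location with probability r (assumption).
            if (rng.nextDouble() < r) {
                homologs.add(new int[] {i, rng.nextInt(genes)});
            }
            // Planted CHS: gene i keeps its positional counterpart with probability pInChs.
            if (i >= chsStart && i < chsEnd && rng.nextDouble() < pInChs) {
                homologs.add(new int[] {i, i});
            }
        }

        // Inversions: swap two neighbouring genes on a randomly chosen chromosome (assumption).
        for (int k = 0; k < inversions; k++) {
            int[] target = rng.nextBoolean() ? orderA : orderB;
            int pos = rng.nextInt(genes - 1);
            int tmp = target[pos];
            target[pos] = target[pos + 1];
            target[pos + 1] = tmp;
        }
    }

    private static int[] identity(int n) {
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        return order;
    }

    public static void main(String[] args) {
        // 1000 genes and probability 0.3 follow the text; R = 0.01 and 100 inversions are examples.
        ChsSimulation sim = new ChsSimulation(1000, 0.01, 0.3, 100, 1L);
        System.out.println("homolog pairs generated: " + sim.homologs.size());
    }
}
```

Fixing the random seed makes each simulated data set reproducible, so that all programs under comparison can be run on identical input.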
All four software packages were tested on the simulated data set with the same parameter settings, i.e. the gap size should be less than 20 genes and each CHS should have at least 3 matched genes. LineUp was run with inversions forbidden. If a statistical test was available, each CHS detected was further assessed by its corrected p-value (Bonferroni method), and only those with values smaller than 0.05 were retained. The sensitivity was calculated as P/TP, where TP was the number of genes in the predefined CHS (TP = 200) and P was the number detected among them. The specificity was calculated as N/TN, where TN was the number of genes not in the predefined CHS (TN = 800) and N was the number remaining undetected in TN.
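The two measures can be computed as in the following Java sketch. It assumes that a program's output has been reduced to the set of gene positions it reports as belonging to a CHS, and that the predefined CHS occupies the half-open position range [chsStart, chsEnd); the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only: sensitivity = P/TP and specificity = N/TN as defined above. */
final class ChsAccuracy {

    /** P/TP: fraction of genes inside the predefined CHS that were detected. */
    static double sensitivity(Set<Integer> detected, int chsStart, int chsEnd) {
        int tp = chsEnd - chsStart;
        int p = 0;
        for (int gene : detected) {
            if (gene >= chsStart && gene < chsEnd) {
                p++;
            }
        }
        return (double) p / tp;
    }

    /** N/TN: fraction of genes outside the predefined CHS that remained undetected. */
    static double specificity(Set<Integer> detected, int chsStart, int chsEnd, int totalGenes) {
        int tn = totalGenes - (chsEnd - chsStart);
        int falselyDetected = 0;
        for (int gene : detected) {
            if (gene < chsStart || gene >= chsEnd) {
                falselyDetected++;
            }
        }
        return (double) (tn - falselyDetected) / tn;
    }

    public static void main(String[] args) {
        // Hypothetical result: 150 genes found inside the CHS, 40 reported outside it.
        Set<Integer> detected = new HashSet<>();
        for (int g = 400; g < 550; g++) {
            detected.add(g);
        }
        for (int g = 0; g < 40; g++) {
            detected.add(g);
        }
        System.out.printf("sensitivity = %.2f%n", sensitivity(detected, 400, 600));
        System.out.printf("specificity = %.2f%n", specificity(detected, 400, 600, 1000));
    }
}
```

For this made-up input, the sensitivity is 150/200 = 0.75 and the specificity is 760/800 = 0.95.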
Availability and requirements
Project name: CHSMiner
Project home page: http://www.biosino.org/papers/CHSMiner/
Operating system(s): Platform independent
Programming language: Java
Other requirements: JRE 1.5 or higher
License: GNU GPL
This research was supported by grants from National High-Tech R&D Program (863): 2006AA02Z334, State key basic research program (973): 2006CB910705, 2003CB715901, and Research Program of CAS (KSCX2-YW-R-112).
1. Murphy WJ, Larkin DM, Everts-van der Wind A, Bourque G, Tesler G, Auvil L, Beever JE, Chowdhary BP, Galibert F, Gatzke L: Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science. 2005, 309: 613-617.
2. Van de Peer Y: Computational approaches to unveiling ancient genome duplications. Nat Rev Genet. 2004, 5: 752-763.
3. Simillion C, Vandepoele K, Van de Peer Y: Recent developments in computational approaches for uncovering genomic homology. Bioessays. 2004, 26: 1225-1235.
4. Vandepoele K, Saeys Y, Simillion C, Raes J, Van de Peer Y: The automatic detection of homologous regions (ADHoRe) and its application to microcolinearity between Arabidopsis and rice. Genome Res. 2002, 12: 1792-1801.
5. Calabrese PP, Chakravarty S, Vision TJ: Fast identification and statistical evaluation of segmental homologies in comparative maps. Bioinformatics. 2003, 19 (Suppl 1): i74-80.
6. Hampson S, McLysaght A, Gaut B, Baldi P: LineUp: statistical detection of chromosomal homology with application to plant comparative genomics. Genome Res. 2003, 13: 999-1010.
7. McLysaght A, Hokamp K, Wolfe KH: Extensive genomic duplication during early chordate evolution. Nat Genet. 2002, 31: 200-204.
8. Hampson SE, Gaut BS, Baldi P: Statistical detection of chromosomal homology using shared-gene density alone. Bioinformatics. 2005, 21: 1339-1348.
9. Ding G, Sun Y, Li H, Wang Z, Fan H, Wang C, Yang D, Li Y: EPGD: a comprehensive web resource for integrating and displaying eukaryotic paralog/paralogon information. Nucleic Acids Res. 2008, 36: D255-262.
10. He X, Goldwasser MH: Identifying conserved gene clusters in the presence of homology families. J Comput Biol. 2005, 12: 638-656.
11. Hoberman R, Sankoff D, Durand D: The statistical analysis of spatially clustered genes under the maximum gap criterion. J Comput Biol. 2005, 12: 1083-1102.
12. Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T: Ensembl 2007. Nucleic Acids Res. 2007, 35: D610-617.
13. Gu X, Wang Y, Gu J: Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat Genet. 2002, 31: 205-209.
14. Durand D, Hoberman R: Diagnosing duplications–can it be done? Trends Genet. 2006, 22: 156-164.
15. Clamp M, Andrews D, Barker D, Bevan P, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V: Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res. 2003, 31: 38-42.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.