A fast and accurate enumeration-based algorithm for haplotyping a triploid individual

Background Haplotype assembly, reconstructing haplotypes from sequence data, is one of the major computational problems in bioinformatics. Most of the current methodologies for haplotype assembly are designed for diploid individuals. In recent years, genomes having more than two sets of homologous chromosomes have attracted many research groups that are interested in the genomics of disease, phylogenetics, botany and evolution. However, there is still a lack of methods for reconstructing polyploid haplotypes. Results In this work, the minimum error correction with genotype information (MEC/GI) model, an important combinatorial model for haplotyping a single individual, is used to study the triploid individual haplotype reconstruction problem. A fast and accurate enumeration-based algorithm enumeration haplotyping triploid with least difference (EHTLD) is proposed for solving the MEC/GI model. The EHTLD algorithm tries to reconstruct the three haplotypes according to the order of single nucleotide polymorphism (SNP) loci along them. When reconstructing a given SNP site, the EHTLD algorithm enumerates three kinds of SNP values in terms of the corresponding site’s genotype value, and chooses the one, which leads to the minimum difference between the reconstructed haplotypes and the sequenced fragments covering that SNP site, to fill the SNP loci being reconstructed. Conclusion Extensive experimental comparisons were performed between the EHTLD algorithm and the well known HapCompass and HapTree. Compared with algorithms HapCompass and HapTree, the EHTLD algorithm can reconstruct more accurate haplotypes, which were proven by a number of experiments.


Background
As a large number of sequencing data are available, the investigation of genetic variations has become one of the main topics in bioinformatics. Single nucleotide polymorphism (SNP), the most widespread form of variation, is believed to be the major genetic cause to phenotypic variability. A sequence of SNPs along a chromosome is referred to as a haplotype, which is more important for complete comprehending the complex genetic polymorphisms than isolated SNPs. Increasing evidence shows that haplotypes play a crucial role in studying the variations relating to diseases prediction and gene expression [1]. Therefore, computational methods to infer haplotypes are needed, for determining haplotypes is both time consuming and expensive by direct using biological experiments. In recent decade, the presented computational haplotyping algorithms generally fall into three categories [2]: (1) population haplotyping with genotype data [3,4]; (2) population haplotyping with fragment data [5]; (3) individual haplotyping with fragment data [6]. In this paper, individual haplotyping problem is studied for a triploid individual.
The problem of individual haplotyping is also called as haplotype assembly problem or haplotype reconstruction problem. It has received extensive study in the recent decade. Most of the existing research results are regarding diploid individuals [1,7,8], and there is still a lack of research studies for reconstructing triploid ones. Several algorithms for assembling K-individual haplotypes

Open Access
Algorithms for Molecular Biology *Correspondence: wjlhappy@mailbox.gxnu.edu.cn 1 Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin 541004, China Full list of author information is available at the end of the article were proposed. Based on the minimum error correction (MEC) model and the minimum error correction with genotype information (MEC/GI) model, Wang et al. [9] and Qian et al. [10] respectively proposed a genetic algorithm and a particle swarm optimization algorithm to reconstruct diploid individual haplotypes, both of which can be adapted to reconstructing the K-individual ones. The code length of the two algorithms is very long in practical applications, for it is equal to the number of sequencing SNP fragments. This brings huge solution space to these two algorithms and negatively affects the performance of them. Based on the minimum fragment removal (MFR) model [11], an exact exponential algorithm was introduced by Li et al. [11]. The time complexity of which is O(2 2t m 2 n + 2 (K +1)t m K +1 ) , where m denotes the number of SNP fragments, n denotes the number of SNP sites, and t is the max number of holes covered by a fragment. The algorithm can not perform well with large m, n and t. In 2013, Aguiar et al. [12] introduced the HapCompass model and the minimum weighted edge removal (MWER) optimization for haplotyping polyploid genomes. Algorithm HapCompass aims to remove a minimal weighted set of edges from the compass graph such that a unique phasing may be constructed. The HapCompass algorithm performs on the spanning-tree cycle basis of the compass graph to iteratively delete errors. However, in the same conflict cycle basis, there may be more than one edge having the same absolute value of weight. It may lead to the wrong SNP phasing to select the removed edge randomly. In 2014, Berger et al. [13] described a maximum-likelihood estimation framework HapTree for haplotyping a single polyploid. It can obtain better performance than the HapCompass algorithm [13]. In 2014, based on the MEC model, Wu et al. [14] presented a genetic algorithm GTIHR for reconstructing triploid haplotypes. Since the code length of algorithm GTIHR equals to the number of heterozygous sites in haplotype, the performance of the GTIHR algorithm is negatively affected by haplotype length and heterozygous rate. In this paper, the triploid individual haplotype assembly problem is studied based on the MEC/GI model. An enumeration-based algorithm enumeration haplotyping triploid with least difference (EHTLD) is proposed for solving it. Algorithm EHTLD reconstructs the three haplotypes according to the order of SNP loci along them. For reconstructing the three alleles of a given site, it enumerates three kinds of SNP values by using the site's genotype, and chooses the kind of value resulting in the minimum difference between the reconstructed haplotypes and the sequenced fragments covering that SNP site. The experimental comparisons were performed between the EHTLD, the HapCompass and the HapTree algorithms. The results proved that the performance of algorithm EHTLD was superior to those of algorithms HapCompass and HapTree. The rest of this paper is arranged as follows. "Definitions and notations" section provides definitions and notations used later. "Algorithm EHTLD" section introduces the EHTLD algorithm. "Experimental results" section presents the experimental results of the EHTLD, the HapCompass and the HapTree algorithms. Some conclusions are drawn in the last section.

Definitions and notations
Triploid somatic cells contain three sets of chromosomes, i.e., a triploid organism has three copies of each chromosome. Since haplotype consists of the sequence of all SNPs along a chromosome, a triploid individual owns three haplotypes. It is commonly regarded that a SNP locus shows merely two possible alleles, hence the major allele can be represented as '0' and the minor one can be represented as '1' . A haplotype can be encoded as a string over a 2-letter alphabet {0, 1} instead of four real bases {A,T,C,G}. A genotype is the conflation of three haplotypes on the homologous chromosomes. When three alleles at a SNP site have identical values, this SNP site is called a homozygous site, otherwise it is called a heterozygous site. For example, (000) T or (111) T represents the genotype value at a homozygous SNP site, while (001) T or (011) T represents the genotype value at a heterozygous SNP site. Suppose that m aligned SNP fragments, coming from three haplotypes of length n, are generated by DNA sequencing experiments. Let M denote an m × n SNP matrix over the alphabet {0, 1, −} (− denotes the value is null). As shown in Fig. 1a, each row represents a SNP fragment, each column represents a SNP site, and each entry M[i, j] denotes the SNP allele of the ith fragment at the jth SNP site. Let G = (g 1 , g 2 , . . . , g n ) denote the genotype matrix corresponding to M, where where Let the strings X and Y be regarded as two SNP fragments, they are said to compatible if HD(X, Y, 1, n) = 0. The larger HD(X, Y, 1, n) is, the greater the probability of fragments X and Y coming from different chromosome copies or having sequencing errors is. If there are no errors in the data, the rows of M can be divided into three classes of compatible fragments. Three haplotypes can be reconstructed by assembling the fragments in the three classes. In this situation, the SNP matrix M is called feasible or error-free. Given haplotype h = (h 1 , h 2 , h 3 ) (h k = (h k1 , h k2 , . . . , h kn ), k = 1, 2, 3) and genotype G = (g 1 , g 2 , . . . , g n )(g j = (g j1 , g j2 , g j3 ) T , j = 1, 2, . . . , n) , if 3 k=1 h kj = 3 k=1 g jk (j = 1, 2, . . . , n), h and G are regarded as compatible.
Based on the above mentioned concepts for the haplotype reconstruction problem, the MEC/GI model can be described as follows [9]: MEC/GI: Given a SNP matrix M and a genotype matrix G, correct the minimum number of entries in M (0 into 1 and vice versa) so that the resulting matrix is feasible, and the three reconstructed haplotypes are compatible with the genotype G.

Algorithm EHTLD
In this section, the EHTLD algorithm is described. The input consists of a SNP matrix M and a genotype matrix G. The output is three assembled haplotypes h = (h 1 , h 2 , h 3 ) of length n. In the first step of this algorithm, the matrices M and G are preprocessed by removing the homozygous SNPs, which do not play a role in assembling haplotypes. Subsequently, enumerates three kinds of values for the jth , and x j � = y j 0 otherwise.
(j = 2, 3,…,n) SNP site in terms of its genotype, and chooses the one leading to the minimum difference between the reconstructed haplotypes and the fragments covering the jth site. After this iteration process is completed, three having only heterozygous SNP sites are built, for only heterozygous SNPs are remained in the preprocessed matrices. Finally, h ′ is augmented by inserting the SNPs discarded in preprocessing step and the final haplotypes h is obtained. Some key steps of the EHTLD algorithm will be introduced in detail as follows.

Preprocessing
Since homozygous sites do not contribute to haplotype reconstruction, they are deleted from matrices M and G to improve the efficiency of assembly. Drop column j(j = 1, 2, . . . , n) from G where g j1 = g j2 = g j3 , and the corresponding column is also dropped from matrix M. The deleted column j ( j = 1, 2, . . . , n ) is recorded as g j1 . After dropping columns from matrix M, some SNP fragments with only-elements are also deleted, for they are also redundant information. The remained SNPs are all heterozygous sites. For convenience of description, the preprocessed matrices are still denoted by M and G. Sort the rows of M by their l(.) values in ascending order. For each column j ( j = 1, 2, . . . , n ) of M, calculate set r(j) which contains the rows covering the jth column.

Enumerating and computing
The EHTLD algorithm iteratively reconstructs each hete- . Each step concerns reconstructing the current empty site, starting from the left first site. Suppose that the first j − 1 sites of the three haplotypes h ′ have already been filled, i.e., ( h ′ k1 , h ′ k2 ,…, h ′ kj−1 ) (k = 1, 2, 3, j = 2, 3,…, n) has been assembled, and the jth site is under consideration. The calculating method comprises the following two steps.  (3). From the three kinds of values enumerated in step (1), choose the one with the minimum D(.) value.
In the following, we give an example for enumerating and computing by using the matrices in Fig. 1. As shown in Fig. 2, assume that the first three sites of the three haplo- , and the fourth site is under reconstruction. The genotype of the fourth SNP site is have the following three kinds of possible values on the fourth SNP site: . The values of D(0,1,1), D(1,0,1) and D (1,1,0) are computed respectively according to the fragments in Fig. 1a and the three haplotypes

Augmenting
The homozygous SNPs that are deleted by preprocessing must be reinserted. The reconstructed haplotypes are augmented by the bits of the columns removed, and h = (h 1 , h 2 , h 3 ) are built. For a given position j, haplotypes h ′ 1 , h ′ 2 and h ′ 3 are inserted with g j1 when the discarded column j is recorded as g j1 . Based on the above mentioned steps, the EHTLD algorithm for assembling triploid haplotypes is depicted in Fig. 3.
Now the time complexity of the EHTLD algorithm is discussed. In preprocessing, dropping redundant information and calculating set r(·) take time O(m × n) , sorting the rows of M takes time O(m × logm) . During enumerating and computing, three haplotypes with only heterozygous SNP sites are reconstructed, which takes time O(c × n × len) , here c denotes the fragments coverage, and len represents the average length of fragments. In augmenting, the discarded columns can be reinserted by scanning the columns only once, which takes time O(n). In summary, the time complexity of the algorithm is O(m × n + m × logm + c × n × len).

Experimental results
In this section, the EHTLD algorithm is compared with two state-of-the-art algorithms, i.e., the HapCompass [12] and the HapTree [13] algorithms. Algorithms EHTLD and HapCompass were implemented on a Windows 7 and the compiler was Microsoft Visual C# 2012. The Python program HapTree (v0.1), downloaded from http://group s.csail .mit.edu/cb/haptr ee/, was implemented on a Linux system. All the tests below are conducted on a 64 bit PC with Intel Core i5 2.50GHz CPU and 6GB RAM. One hundred data sets were generated for each parameter setting. The average over 100 runs at each parameter setting was calculated and presented.
The vector error (VE) [13], the reconstruction rate (RR) [1,9,15] and the minimum error correction (MEC) score [12] were used to measure the performance of the algorithms. The vector error, generalized from switch error, is a special kind of measurement for evaluating the accuracy of polyploid phasing. Given three reconstructed   j=2,3,. . . ,n do 4.
if (D(0, 0, 1 haplotypes, the vector error is equal to the minimum number of segments on them for which a switch must occur to correspond with the three true haplotypes, i.e., the minimum number of segments a reconstructed phase and the true phase have in common [13]. The reconstruction rate (RR), which measures the similarity degree between the pair of true haplotypes and the pair of reconstructed ones, is a widely adopted index to evaluate diploid phasing [1,9,15]. For triploid phasing, we generalized it to calculate the similarity degree between the three true haplotypes and the three reconstructed haplotypes. Assuming that h = (h 1 , h 2 , h 3 ) are the original haplotypes, and ĥ = (ĥ 1 , ĥ 2 , ĥ 3 ) are the reconstructed haplotypes. RR is defined as the proportion of nucleotides that are reconstructed correctly, as shown in Formula (4).
where r i k j k = HD(h i k ,ĥ j k , 1, n).
The minimum error correction (MEC) score measures the minimum number of mismatches between the reconstructed haplotypes ĥ = (ĥ 1 , ĥ 2 , ĥ 3 ) and the SNP matrix M, as shown in Formula (5).
To the best of our knowledge, the real triploid haplotype data are not available in the public domain, Aguiar et al. [12] and Berger et al. [13] used computer-generated simulated data. Therefore, simulated data were also used in our experiments. Three simulation haplotypes h = (h 1 , h 2 , h 3 ) of length n were created by using the following method. h 1 was generated at random firstly. h 2 was generated by flipping each bit of h 1 randomly so that the hamming distance between h 1 and h 2 was equal to a given parameter d. h 3 also had the same length and h 3j was set to h 1j or h 2j (j = 1, 2, . . . , n) with uniform probability. With regard to fragment data, two kinds of sequencing simulators, CELSIM [16] and MetaSim [17], were adopted to generate simulation fragments, and the testing datasets were called as CELSIM instances and Meta-Sim instances, respectively.

CELSIM instances
In this section, the evaluation of the EHTLD, the Hap-Compass and the HapTree algorithms is described by using CELSIM instances. CELSIM was invoked to simulate shotgun sequencing platform. m 1 single SNP fragments and m 2 mate-pair SNP fragments were generated. A single fragment had a length ranging from f min to f max , and a mate-pair fragment had a length of n/10. The coverage was c/2 for both kinds of fragments, and the total coverage was c. Reading errors were planted into the fragments with probability p s . In practical applications of shotgun sequencing, the values of f min and f max are 3 and 7, respectively, c ranges from 5 to 10, and p s ranges between 2 and 5% [2,18]. In the following tables, algorithms EHTLD, HapCompass and HapTree are abbreviated to EH, HC and HT, respectively. In Table 1, 12 sets of parameters were set in dealing with error rate p s , where c = 10, f min = 3, f max = 7, n = 100 and d = 0.3. It can be seen from this table that algorithm EHTLD can achieve much higher reconstruction rates, smaller vector errors and MEC scores than the HapCompass and the HapTree algorithms in every p s setting. When p s = 0, algorithm EHTLD achieves reconstruction rate of 0.97, which is higher than both Hap-Compass and HapTree algorithms by about 9.0%, and vector error of 3, which is less than them by 8 times or so. In particular, the MEC score obtained by algorithm EHTLD reaches zero, while those of the other two algorithms are 126 and 57. Although the increase of p s plays stronger negative effect on algorithm EHTLD than on algorithms HapCompass and HapTree, the EHTLD algorithm still obtains better performance than algorithms HapCompass and HapTree with high error rate. When p s = 0.2, the RRs of algorithms EHTLD, HapCompass and HapTree are 0.92, 0.89 and 0.88, the vector errors of them are 14, 31 and 26, and the MEC scores of them are 335, 407 and 364, respectively. The three algorithms all execute very efficiently when p s ranges from 0 to 0.2.
In Table 2, nine sets of parameters were set in dealing with coverage c, where n = 100, f min = 3, f max = 7, p s = 0.05 and d = 0.3. From Table 2 we observe that algorithm EHTLD still obtains the highest reconstruction rate and the smallest vector error and MEC score under different coverage settings. When the coverage is 2, the RRs of algorithms EHTLD, HapCompass and HapTree are 0.94, 0.89 and 0.86, the vector errors of them are 10, 30 and 29, and the MEC scores of them are 16, 40 and 19. When the coverage increases, the RR of algorithm EHTLD increases gradually, while that of algorithm Hap-Compass fluctuates between 0.89 and 0.90, and that of algorithm HapTree varies between 0.85 and 0.91. Generally, the increase of coverage plays a positive role in the improvement of algorithm performance, for much more original fragment information can be utilized. However, it is not apparent for algorithms HapCompass and HapTree. Table 3 compares the performance of the three algorithms with different haplotype lengths n, where c = 10, f min = 3, f max = 7, p s = 0.05 and d = 0.3. As can be seen from this table, algorithm EHTLD still obtains superior results to the other two algorithms under each parameter setting. With the increase of haplotype length, the three algorithms experience a gradual degradation in the performance. When n is 100, the RR of algorithm EHTLD is 0.97, which is higher than both HapCompass   and HapTree algorithms by about 7.8%, the vector error of algorithm EHTLD is 4, which is less than algorithms HapCompass and HapTree by about 86 and 85%, the MEC score of algorithm EHTLD is 85, which is less than algorithms HapCompass and HapTree by about 56 and 21%, respectively. When n is 1000, the RRs of them decrease to 0.92, 0.88 and 0.86, the vector errors of them increase to 136, 322 and 314, and the MEC scores of them go up to 479, 1029 and 595, respectively. The running time of the three algorithms increases significantly with the increase of n. When n = 100, the running time of algorithms EHTLD, HapCompass and HapTree is 0.01, 0.03 and 0.01 s, respectively, while n = 1000, it increases to 4.36, 4.67 and 4.51 s, respectively. In Table 4, three groups of parameters were set in dealing with single fragment length range [f min , f max ] , where c = 10, p s = 0.05, n = 100 and d = 0.3. As shown in Table 4, algorithm EHTLD still performs the best under different parameter settings. When [f min , f max ] = [3,7] , the RRs of algorithms EHTLD, HapCompass and Hap-Tree are 0.97, 0.90 and 0.89, the vector errors of them are 4, 29 and 26, and the MEC scores of them are 85, 194 and 117, respectively. With the decrease of the length of single fragment, the decline of fragments overlap might be disadvantageous for haplotype reconstruction. When [f min , f max ] = [1,2] , the RRs decrease to 0.94, 0.90 and 0.88, the vector errors increase to 14, 30 and 28, and the MEC scores drop to 65, 86 and 76, respectively. The decrease of the MEC scores explain the shorter the fragments are, the more probability the fragments agree with the reconstructed haplotypes. The change of single fragment length plays little effect on the running time of the three algorithms. Table 5 compares the three algorithms with different hamming distances d, where f min = 3, f max = 7, c = 10, p s = 0.05 and n = 100. It can be seen from this table that the performance of algorithm EHTLD remains relatively stable under different d, while that of algorithms HapCompass and HapTree suffers strong negative influence with the increase of hamming distance. For example, when d varies from 0.1 to 1.0, the RR of the EHTLD algorithm fluctuates between 0.97 and 1.0, while those of the HapCompass and the HapTree algorithms achieve decrease rate up to 35 and 20%, respectively.

MetaSim instances
MetaSim was used to simulate 454 sequencing platform. m SNP fragments, including m 1 = (1 − p m ) × m single ones and m 2 = p m × m mate-pair ones, were generated, where p m denoted the probability of matepair fragments and was set to 0.25 in the experiments. A single fragment had an expected length of f _len , and a mate-pair fragment had a length of 3 × f _len . Since each mate-pair fragment consists of two single   Table 6 gives the comparisons with coverage ranging from 5 to 50, where n = 100, f _len = 5, and d = 0.3. In Table 7, six sets of experimental results under different haplotype length settings are displayed, where c = 20, f _len = 5, and d = 0.3. In Table 8, three instances were generated in dealing with single fragment length f _len , where n = 100, c = 20, and d = 0.3. In Table 9, the test results under different parameter d are shown, where n = 100, c = 20, and f _len = 5. The experimental results obtained from MetaSim instances indicate that algorithm EHTLD still obtain much higher reconstruction rates, smaller vector errors and MEC scores than the HapCompass and the HapTree algorithms under different c, n, f _len and d settings.

Conclusion
The minimum error correction with genotype information (MEC/GI) model is one of the important computational models for solving single individual SNP haplotyping problem. In this paper, an enumerationbased algorithm EHTLD is presented for haplotyping a triploid single individual by using this model. Algorithm EHTLD reconstructs the three haplotypes according to the order of SNP loci along them. For a SNP site being reconstructed, the EHTLD algorithm enumerates three possible values in terms of the site's   genotype, and chooses the one leading to the minimum difference between the reconstructed haplotypes and the fragments covering that SNP site. The reconstructed alleles of a SNP site mainly depend on the fragments which cover the site, and are little affected by other former reconstructed alleles. Therefore, the former wrongly reconstructed SNP alleles would not affect the latter reconstructed SNP alleles, i.e., reconstructed errors on the former SNP alleles would not spread to the latter ones. The kind of enumeration strategy can also be apply to reconstruct haplotypes of other ploidy, which will be studied in the future. Compared with algorithms HapCompass and Hap-Tree by using two kinds of simulated sequencing data, the EHTLD algorithm can get the highest reconstruction rates, the smallest vector errors and MEC scores, which was tested by a number of experiments. In addition, algorithm EHTLD still achieves satisfying performance even with high error rate, low fragment coverage, or long haplotype length. All of these advantages may contribute to the practical application of the EHTLD algorithm when haplotyping a triploid single individual.