Grammatical-Restrained Hidden Conditional Random Fields for Bioinformatics applications
Piero Fariselli^{1}, Castrense Savojardo^{1}, Pier Luigi Martelli^{1} and Rita Casadio^{1}
DOI: 10.1186/1748-7188-4-13
© Fariselli et al. 2009
Received: 12 June 2009
Accepted: 22 October 2009
Published: 22 October 2009
Abstract
Background
Discriminative models are designed to naturally address classification tasks. However, some applications require the inclusion of grammar rules, and in these cases generative models, such as Hidden Markov Models (HMMs) and Stochastic Grammars, are routinely applied.
Results
We introduce Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs) as an extension of Hidden Conditional Random Fields (HCRFs). GRHCRFs, while preserving the discriminative character of HCRFs, can assign labels in agreement with the production rules of a defined grammar. The main novelty of GRHCRFs is the possibility of including prior knowledge of the problem in HCRFs by means of a defined grammar. Our current implementation allows regular grammar rules. We test our GRHCRF on a typical biosequence labeling problem: the prediction of the topology of Prokaryotic outer-membrane proteins.
Conclusion
We show that in a typical biosequence labeling problem the GRHCRF performs better than CRF models of the same complexity, indicating that GRHCRFs can be useful tools for biosequence analysis applications.
Availability
GRHCRF software is available under GPLv3 licence at the website
http://www.biocomp.unibo.it/~savojard/biocrf0.9.tar.gz.
Background
Sequence labeling is a general task addressed in many different scientific fields, including Bioinformatics and Computational Linguistics [1–3]. Recently, Conditional Random Fields (CRFs) have been introduced as a promising new framework for solving sequence labeling problems [4]. CRFs offer several advantages over Hidden Markov Models (HMMs), including the ability to relax the strong independence assumptions made in HMMs [4]. CRFs have been successfully applied in biosequence analysis and structural prediction [5–11]. However, several problems of sequence analysis can be successfully addressed only by designing a grammar that guarantees meaningful results. For instance, in gene prediction tasks exons must be linked in such a way that the donor and acceptor junctions define regions whose length is a multiple of three (according to the genetic code), and in protein structure prediction helical segments shorter than 4 residues should be considered meaningless, since this is the shortest allowed length for a protein helical motif [1, 2]. In problems of this kind, the training sets generally consist of pairs of observed and label sequences, and very often the number of different labels representing the experimental evidence is small compared to what is needed to model the grammar requirements and the length distribution of the segments for the different labels. A direct mapping of one label to one state then results in poor predictive performance, and HMMs trained for these applications routinely separate labels from state names. The separation of state names and labels makes it possible to model a huge number of concurring paths compatible with the grammar and with the experimental labels without increasing the time and space computational complexity [1].
In analogy with the HMM approach, in this paper we develop a discriminative model that incorporates regular-grammar production rules, with the aim of integrating the different capabilities of generative and discriminative models. In order to model labels and states disjointly, the regular grammar has to be included in the structure of a Hidden Conditional Random Field (HCRF) [12–14]. Previously, McCallum et al. [13] introduced a special HCRF that exploits a specific automaton to align sequences.
The model introduced here, the Grammatical-Restrained Hidden Conditional Random Field (GRHCRF), separates the states from the labels and restricts the accepted predictions to those allowed by a predefined grammar. In this way, it is possible to cast into the model prior knowledge of the problem at hand that may not be captured directly from the learning associations, and to ensure that only meaningful solutions are provided.
In principle, CRFs can directly model the same GRHCRF grammar. However, given the fully-observable nature of CRFs [12], the observed sequences must be relabelled to obtain a bijection between states and labels. This implies that only one specific and unique state path must be selected for each observed sequence. On the contrary, with GRHCRFs, which allow the separation between labels and states, an arbitrarily large number of different state paths corresponding to the same experimentally observed labels can be counted at the same time. In order to fully exploit this path degeneracy in the prediction phase, the decoding algorithm must take into account all possible paths, and the posterior-Viterbi (instead of the Viterbi) should be adopted [15].
In this paper we define the model as an extension of an HCRF, we provide the basic inference equations and we introduce a new decoding algorithm for CRF models. We then compare the new GRHCRF with CRFs of the same complexity on a Bioinformatics task whose solution must comply with a given grammar: the prediction of the topological models of Prokaryotic outer-membrane proteins. We show that in this task the GRHCRF performance is higher than that achieved by CRF and HMM models of the same complexity.
Methods
We further restrict our model to linear HCRFs, so that the computational complexity of the inference algorithms remains linear with respect to the sequence length. This choice implies that the embedded grammar will be regular. Our implementation and tests are based on first-order HCRFs with explicit transition functions t_{k}(s_{j-1}, s_{j}, x) and state functions g_{k}(s_{j}, x) unrolled over each sequence position j.
Parameter estimation
The different sequences are assumed to be independent and identically distributed random variables.
Here E_{p(s|y, x)}[f_{k}] and E_{p(s, y|x)}[f_{k}] are the expected values of the feature function f_{k} computed in the clamped and free phases, respectively. In contrast to the standard CRF, both expectations have to be computed using the Forward and Backward algorithms. These algorithms must take into consideration the grammar restraints.
Alternatively, the Expectation Maximization procedure can be adopted [16].
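As a point of reference, the objective and gradient described in this section can be written in the usual hidden-CRF form. This is a hedged reconstruction from the surrounding definitions rather than a verbatim restatement: the symbols θ (all model parameters), λ_k (the weight of feature f_k) and N (the number of training sequences) are notational assumptions, and any regularization term is omitted.

```latex
\begin{align}
\mathcal{L}(\theta) &= \sum_{i=1}^{N} \log p\left(y^{(i)} \mid x^{(i)}, \theta\right), \\
\frac{\partial \mathcal{L}(\theta)}{\partial \lambda_k} &=
  \sum_{i=1}^{N} \left( E_{p(s \mid y^{(i)}, x^{(i)})}[f_k]
                      - E_{p(s, y \mid x^{(i)})}[f_k] \right).
\end{align}
```

The first expectation runs over the state paths compatible with the observed labels (clamped phase), the second over all state paths allowed by the grammar (free phase).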
Computing the expectations
Here, for simplicity, we dropped the sequence superscript (i).
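The role of the grammar and of the labels in the two phases can be illustrated with a minimal forward recursion in log space. This is a sketch under stated assumptions, not the authors' C++ implementation: the function name `forward_log_z`, the dense score tables and the boolean `allowed` matrix are hypothetical, and dedicated begin and end states are left out for brevity.

```python
import math

NEG_INF = float("-inf")

def logsumexp(values):
    """Stable log(sum(exp(v))) over a non-empty list; -inf if every term is -inf."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_z(state_scores, trans_scores, allowed, state_label, labels=None):
    """Log partition function of a linear chain restricted by a grammar.

    state_scores[j][s] : log-potential of state s at position j
    trans_scores[r][s] : log-potential of the transition r -> s
    allowed[r][s]      : True if the grammar permits the transition r -> s
    state_label[s]     : label attached to state s
    labels             : None for the free phase, or the observed label
                         sequence for the clamped phase
    """
    n_states = len(state_label)

    def compatible(j, s):
        # In the clamped phase a state is usable only if it carries the observed label.
        return labels is None or state_label[s] == labels[j]

    alpha = [state_scores[0][s] if compatible(0, s) else NEG_INF
             for s in range(n_states)]
    for j in range(1, len(state_scores)):
        new_alpha = []
        for s in range(n_states):
            if not compatible(j, s):
                new_alpha.append(NEG_INF)
                continue
            # Only predecessors allowed by the grammar contribute to the sum.
            terms = [alpha[r] + trans_scores[r][s]
                     for r in range(n_states) if allowed[r][s]]
            new_alpha.append(state_scores[j][s] + logsumexp(terms)
                             if terms else NEG_INF)
        alpha = new_alpha
    return logsumexp(alpha)
```

With these two quantities, the conditional log-probability of a label sequence is the clamped value minus the free value, that is `forward_log_z(..., labels=y) - forward_log_z(...)`.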
Decoding
Decoding is the task of assigning labels (y) to an unknown observation sequence x. The Viterbi algorithm is routinely applied for decoding CRFs, since it finds the most probable path of an observation sequence given a CRF model [4]. The Viterbi algorithm is particularly effective when there is a single, highly probable path; when several paths compete (have similar probabilities), posterior decoding may perform significantly better. However, the state path selected by posterior decoding may not be allowed by the grammar. A simple solution to this problem is provided by the posterior-Viterbi decoding, which was previously introduced for HMMs [15]. Posterior-Viterbi exploits the posterior probabilities and at the same time preserves the grammatical constraints. This algorithm consists of three steps:

for each position j and each state s, compute the posterior probability p(s_{j} = s | x)

find the allowed state path
s* = argmax_{s} Π_{j} p(s_{j} | x)

assign to x a label sequence y so that y_{j} = Λ(s*_{j}) for each position j
The first step can be accomplished using the Forward-Backward algorithm as described for the free phase of parameter estimation. In order to find the best allowed state path, a Viterbi search is performed over the posterior probabilities. In what follows, ρ_{j}(s|x) is the score of the most probable allowed path of length j ending in state s and π_{j}(s) is a traceback pointer. The search is a standard Viterbi recursion in which the posterior probabilities p(s_{j} = s | x) are used as position scores and only the transitions permitted by the grammar are considered.
The labels are assigned to the observed sequence according to the state path s*. It is also possible to consider a slightly modified version of the algorithm in which, for each position, the posterior probability of the labels is considered, so that all states sharing the same label are assigned the same posterior probability. The rationale is to use the aggregate probability of all state paths corresponding to the same sequence of labels in order to improve the overall per-label accuracy. In many applications this variant of the algorithm might perform better.
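The three steps above can be sketched as follows. This is a minimal illustration that assumes the posteriors have already been computed by the Forward-Backward algorithm; the function name `posterior_viterbi`, the boolean `allowed` matrix and the list `state_label` (standing in for the label map Λ) are hypothetical, and begin and end states are omitted.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Logarithm that maps a zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF

def posterior_viterbi(post, allowed, state_label):
    """Best grammar-compliant state path under the product of posteriors.

    post[j][s]     : posterior probability p(s_j = s | x)
    allowed[r][s]  : True if the grammar permits the transition r -> s
    state_label[s] : label attached to state s
    Returns (state_path, label_sequence).
    """
    n_states = len(state_label)
    rho = [safe_log(p) for p in post[0]]          # log-score of paths of length 1
    pointers = []                                 # traceback pointers per position
    for j in range(1, len(post)):
        new_rho, ptr = [], []
        for s in range(n_states):
            preds = [r for r in range(n_states) if allowed[r][s]]
            if not preds:                         # state unreachable under the grammar
                new_rho.append(NEG_INF)
                ptr.append(-1)
                continue
            best = max(preds, key=lambda r: rho[r])
            new_rho.append(rho[best] + safe_log(post[j][s]))
            ptr.append(best)
        rho = new_rho
        pointers.append(ptr)
    last = max(range(n_states), key=lambda s: rho[s])
    if rho[last] == NEG_INF:
        raise ValueError("no state path is compatible with the grammar")
    path = [last]
    for ptr in reversed(pointers):                # follow the pointers backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return path, [state_label[s] for s in path]
```

In the toy case below, plain posterior decoding would pick the states i, o, o, which uses the forbidden transition i -> o, whereas the sketch returns a path that the grammar accepts.

```python
allowed = [[True, True, False],    # i -> i, i -> t
           [True, True, True],     # t -> i, t -> t, t -> o
           [False, True, True]]    # o -> t, o -> o
post = [[0.9, 0.05, 0.05], [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]]
path, labels = posterior_viterbi(post, allowed, ["i", "t", "o"])
```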
Implementation
We implemented the GRHCRF as a linear HCRF in C++. Our GRHCRF can deal with sequences of symbols as well as with sequence profiles. A sequence profile of a protein p is a matrix X whose rows represent the sequence positions and whose columns are the 20 possible amino acids. Each element X[i][a] of the sequence profile represents the frequency of the residue type a in the aligned position i. The profile positions are normalized so that Σ_{a} X[i][a] = 1 for each i.
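The normalization of a profile can be illustrated with a short sketch. The function name `normalize_profile` and the handling of empty alignment columns are assumptions made for the example and are not taken from the released software.

```python
def normalize_profile(counts):
    """Turn per-position residue counts into frequencies that sum to 1 per row.

    counts[i][a] is the number of times residue type a occurs at aligned
    position i (one row per position, one column per residue type).
    """
    profile = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a position with no aligned residues gets a uniform row.
            profile.append([1.0 / len(row)] * len(row))
        else:
            profile.append([c / total for c in row])
    return profile
```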
Measures of performance
To evaluate the average standard deviation of our predictions, we performed a bootstrapping procedure with 100 runs over 60% of the predicted data sets.
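A resampling procedure of this kind can be sketched as follows. The sketch assumes that each run draws 60% of the predicted proteins at random without replacement and that the score of a run is the mean of per-protein scores; both choices, and the function name `bootstrap_std`, are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_std(per_protein_scores, runs=100, fraction=0.6, seed=0):
    """Mean and standard deviation of a score over repeated random subsets."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    k = max(1, round(fraction * len(per_protein_scores)))
    run_means = []
    for _ in range(runs):
        subset = rng.sample(per_protein_scores, k)  # sampling without replacement
        run_means.append(statistics.mean(subset))
    return statistics.mean(run_means), statistics.stdev(run_means)
```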
Results and Discussion
Problem definition
The prediction of the topology of outer-membrane proteins in Prokaryotic organisms is a challenging task that has been addressed several times given its biological relevance [18–20]. The problem can be defined as follows: given a protein sequence that is known to be inserted in the outer membrane of a Prokaryotic cell, we want to predict the number of membrane-spanning segments and their location with respect to the membrane plane. From experimental results, we know that the outer membrane of Prokaryotes imposes some constraints on the topological models, such as:

both the C and N termini of the protein chain lie on the periplasmic side of the cell (inside), which implies that the number of spanning segments is even;

membrane-spanning segments have a minimal segment length (≥ 3 residues);

the transmembrane-segment lengths are distributed according to a probability density distribution that can be experimentally determined and must be taken into account.
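The first two constraints are regular, so they can be written as a regular expression over the labels i (inner loop), o (outer loop) and t (transmembrane). The sketch below is a toy check of a label string and is not the automaton used in the paper; the function names are hypothetical, and the length-distribution constraint, which is probabilistic, is not captured.

```python
import re

def topology_pattern(min_tm_len=3):
    """Regular expression for label strings that start and end inside,
    cross the membrane an even number of times, and contain only
    transmembrane segments of at least min_tm_len residues."""
    tm = "t{%d,}" % min_tm_len
    # inside loop, then zero or more (strand out, outside loop, strand in, inside loop)
    return re.compile("i+(?:%so+%si+)*" % (tm, tm))

def is_valid_topology(labels, min_tm_len=3):
    """True if the whole label string is accepted by the pattern."""
    return topology_pattern(min_tm_len).fullmatch(labels) is not None
```

For example, `is_valid_topology("iitttoootttii")` is accepted, while `"iittoootttii"` (a strand of two residues) and `"iitttooo"` (C terminus outside) are rejected.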
Table 1. Prediction of the topology of the Prokaryotic outer membrane proteins.
Method       POV          Q2           C(t)         Sn(t)        Sp(t)

CRF1 (Vit)   0.26 ± 0.05  0.72 ± 0.01  0.47 ± 0.02  0.59 ± 0.01  0.80 ± 0.01
CRF1 (Pvit)  0.39 ± 0.05  0.77 ± 0.01  0.54 ± 0.02  0.71 ± 0.01  0.80 ± 0.01
CRF2 (Vit)   0.34 ± 0.05  0.76 ± 0.01  0.52 ± 0.03  0.63 ± 0.02  0.82 ± 0.02
CRF2 (Pvit)  0.47 ± 0.05  0.80 ± 0.01  0.60 ± 0.03  0.74 ± 0.02  0.82 ± 0.02
CRF3 (Vit)   0.29 ± 0.04  0.72 ± 0.01  0.45 ± 0.02  0.60 ± 0.02  0.79 ± 0.01
CRF3 (Pvit)  0.45 ± 0.04  0.76 ± 0.01  0.52 ± 0.02  0.70 ± 0.02  0.79 ± 0.01
GRHCRF       0.66 ± 0.04  0.85 ± 0.01  0.70 ± 0.03  0.83 ± 0.01  0.84 ± 0.01
HMM-B2TMR    0.58 ± 0.04  0.80 ± 0.01  0.62 ± 0.02  0.82 ± 0.02  0.83 ± 0.01
Outer-membrane protein data set
The training set consists of 38 high-resolution experimentally determined outer-membrane proteins of Prokaryotes, with a sequence identity between each pair of less than 40%. We then generated 19 subsets for the cross-validation experiments, such that there is no sequence identity greater than 25% and no functional similarity between two elements belonging to disjoint sets. The annotation consists of three different labels that correspond to: inner loop (i), outer loop (o) and transmembrane (t). This assignment was obtained using the DSSP program [21] by selecting the β-strands that span the outer membrane. The dataset with the annotations and the cross-validation sets are available with the program at http://www.biocomp.unibo.it/~savojard/biocrf0.9.tar.gz.
For each protein in the dataset, a profile based on a multiple sequence alignment was created using the PSI-BLAST program on the non-redundant dataset of sequences (uniref90 as described in http://www.uniprot.org/help/uniref). PSI-BLAST runs were performed using a fixed number of cycles set to 3 and an e-value of 0.001.
Prediction of the topology of Prokaryotic outer membrane proteins
The automaton described in Figure 2 assigns labels to observed sequences that can be obtained using different state paths. This ambiguity leads to an ensemble of paths that must be taken into account during the likelihood maximization by summing up all possible trajectories compliant with the experimentally assigned labels (see Methods section).
However, this ambiguity does not permit the adoption of the automaton of Figure 2 for CRF learning, since to train CRFs a bijective mapping between states and labels is required. On the contrary, with the automaton of Figure 2, several different state paths can be obtained (in theory a factorial number) that are in agreement with the automaton and with the experimental labels.
Table 2. Confidence level of the results reported in Table 1.
Methods              POV    Q2     C(t)   Sn(t)  Sp(t)

GRHCRF vs CRF1       98.0%  99.5%  99.5%  99.8%  99.5%
GRHCRF vs CRF2       96.0%  99.5%  99.5%  99.5%  99.5%
GRHCRF vs CRF3       96.0%  99.5%  99.5%  99.5%  99.5%
GRHCRF vs HMM-B2TMR  80.0%  96.0%  99.0%  98.0%  99.5%
Conclusion
In this paper we presented a new class of conditional random fields that assigns labels in agreement with the production rules of a defined regular grammar. The main novelty of the GRHCRF is the introduction of an explicit regular grammar that defines the prior knowledge of the problem at hand, eliminating the need to relabel the observed sequences. The GRHCRF predictions satisfy the grammar production rules by construction, so that only meaningful solutions are provided. In [13], an automaton was included to restrain the solutions of an HCRF. However, in that case it was hard-coded in the model in order to train a finite-state string edit distance. On the contrary, GRHCRFs are designed to provide solutions in agreement with defined regular grammars that are provided as further input to the model. To the best of our knowledge, this is the first time that such a model has been described. In principle, the grammar may be very complex; however, to maintain the tractability of the inference algorithms, we restrict our implementation to regular grammars. Extensions to context-free grammars can be designed by modifying the inference algorithms, at the expense of the computational complexity of the final models. Since the Grammatical-Restrained HCRF can be seen as an extension of the linear HCRF [13, 14], the GRHCRF is also related to the models that deal with latent variables, such as Dynamic CRFs [22].
In this paper we also tested GRHCRFs on a real biological problem that requires grammatical constraints: the prediction of the topology of Prokaryotic outer-membrane proteins. When applied to this biosequence analysis problem, GRHCRFs perform similarly to or better than the corresponding CRFs and HMMs, indicating that GRHCRFs can be profitably applied when a discriminative problem requires grammatical constraints.
Finally, we also presented the posterior-Viterbi decoding algorithm for CRFs, which was previously designed for HMMs and which can be of general interest and application, since in many cases posterior-Viterbi can perform significantly better than the classical Viterbi algorithm.
Declarations
Acknowledgements
We thank MIUR for the PNR 2003 project (FIRB art.8) termed LIBI-Laboratorio Internazionale di BioInformatica delivered to R. Casadio. This work was also supported by the Biosapiens Network of Excellence project (a grant of the European Union's VI Framework Programme).
References
1. Durbin R: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ Pr, reprint edition 1999.
2. Baldi P, Brunak S: Bioinformatics: The Machine Learning Approach. 2nd edition. MIT Press 2001.
3. Manning C, Schütze H: Foundations of Statistical Natural Language Processing. MIT Press 1999.
4. Lafferty J, McCallum A, Pereira F: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML01 2001, 282–289.
5. Liu Y, Carbonell J, Weigele P, Gopalakrishnan V: Protein fold recognition using segmentation conditional random fields (SCRFs). Journal of Computational Biology 2006, 13(2):394–406.
6. Sato K, Sakakibara Y: RNA secondary structural alignment with conditional random fields. Bioinformatics 2005, 21(2):237–242.
7. Wang L, Sauer UH: OnD-CRF: predicting order and disorder in proteins using conditional random fields. Bioinformatics 2008, 24(11):1401–1402.
8. Li CT, Yuan Y, Wilson R: An unsupervised conditional random fields approach for clustering gene expression time series. Bioinformatics 2008, 24(21):2467–2473.
9. Li MH, Lin L, Wang XL, Liu T: Protein-protein interaction site prediction based on conditional random fields. Bioinformatics 2007, 23(5):597–604.
10. Dang TH, Van Leemput K, Verschoren A, Laukens K: Prediction of kinase-specific phosphorylation sites using conditional random fields. Bioinformatics 2008, 24(24):2857–2864.
11. Xia X, Zhang S, Su Y, Sun Z: MICAlign: a sequence-to-structure alignment tool integrating multiple sources of information in conditional random fields. Bioinformatics 2009, 25(11):1433–1434.
12. Wang S, Quattoni A, Morency L, Demirdjian D: Hidden Conditional Random Fields for Gesture Recognition. CVPR 2006, II:1521–1527.
13. McCallum A, Bellare K, Pereira F: A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance. Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI05). Arlington, Virginia: AUAI Press 2005, 388.
14. Quattoni A, Collins M, Darrell T: Conditional Random Fields for Object Recognition. Advances in Neural Information Processing Systems 17 (Edited by: Saul LK, Weiss Y, Bottou L). Cambridge, MA: MIT Press 2005, 1097–1104.
15. Fariselli P, Martelli P, Casadio R: A new decoding algorithm for hidden Markov models improves the prediction of the topology of all-beta membrane proteins. BMC Bioinformatics 2005, 6(Suppl 4):S12.
16. Sutton C, McCallum A: An Introduction to Conditional Random Fields for Relational Learning. MIT Press 2006.
17. Krogh A: Hidden Markov Models for Labeled Sequences. Proceedings of the 12th IAPR ICPR'94. IEEE Computer Society Press 1994, 140–144.
18. Martelli P, Fariselli P, Krogh A, Casadio R: A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics 2002, 18(Suppl 1):46–53.
19. Bigelow H, Petrey D, Liu J, Przybylski D, Rost B: Predicting transmembrane beta-barrels in proteomes. Nucleic Acids Res 2004, 32:2566–2577.
20. Bagos P, Liakopoulos T, Hamodrakas S: Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics 2005, 6:7–20.
21. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577–2637.
22. Sutton C, McCallum A, Rohanimanesh K: Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. J Mach Learn Res 2007, 8:693–723.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.