Gene Ontology consistent protein function prediction: the FALCON algorithm applied to six eukaryotic genomes
© Kourmpetis et al.; licensee BioMed Central Ltd. 2013
Received: 10 August 2011
Accepted: 4 March 2013
Published: 27 March 2013
Gene Ontology (GO) is a hierarchical vocabulary for the description of biological functions and locations, often employed by computational methods for protein function prediction. Due to the structure of GO, function predictions can be self- contradictory. For example, a protein may be predicted to belong to a detailed functional class, but not in a broader class that, due to the vocabulary structure, includes the predicted one.
We present a novel discrete optimization algorithm called Functional Annotation with Labeling CONsistency (FALCON) that resolves such contradictions. The GO is modeled as a discrete Bayesian Network. For any given input of GO term membership probabilities, the algorithm returns the most probable GO term assignments that are in accordance with the Gene Ontology structure. The optimization is done using the Differential Evolution algorithm. Performance is evaluated on simulated and also real data from Arabidopsis thaliana showing improvement compared to related approaches. We finally applied the FALCON algorithm to obtain genome-wide function predictions for six eukaryotic species based on data provided by the CAFA (Critical Assessment of Function Annotation) project.
KeywordsProtein function prediction Gene Ontology Evolutionary optimization
Central aim of computational protein function prediction methods is to provide reliable and interpretable results, in order to be useful for the biological community. For this reason, prediction methods often make use of the Gene Ontology (GO) controlled vocabulary  to describe functional properties of proteins. GO terms are organized in three separate domains that describe different aspects of gene and protein function: Molecular Function (MF), Biological Process (BP) and Cellular Component (CC). Within each domain the terms are arranged in a Directed Acyclic Graph (DAG). Due to the hierarchical structure of the GO-DAG, a protein that is assigned to a particular term is by definition assigned to all of its predecessors, which are more general GO terms. On the other hand, if a protein does not perform a particular function, it is not assigned to the corresponding GO term, nor to any of the successors (more detailed terms) of that term. This constraint of the GO-DAG is referred to as the True Path Rule (TPR) and provides a framework to ensure that functional descriptions of proteins are not self-contradictory. Computational methods often neglect the TPR in their predictions, making their interpretations problematic. Taking the GO DAG (and thus TPR) into account in protein function prediction may lead to improvement of the performance and interpretation.
Violation of TPR can be described in a continuous or in a discrete manner. In the former, the probability (or confidence) of membership to a GO term does not decrease monotonically when moving from more general GO terms to the more detailed ones. Therefore the space of probability vectors (where a vector denotes the joint set of per-GO term probabilities of memberships) can be divided in two sets: one set (C) that contains the probability vectors that satisfy the monotonicity constraint and another set (V) that contains those that violate the constraint. The challenge from the continuous point of view is, given a vector V to find an optimal corresponding vector in C, according to a criterion.
Obozinski et al.  developed different “reconciliation” approaches to infer consistent probability vectors from Support Vector Machines (SVM) outputs transformed to probabilities. Performance comparisons between methods based on Belief Propagation, Kullback-Leibner minimization and Isotonic Regression (IR), showed that the last outperformed the rest. In IR [3, 4] predictors are the ranks in the ordering of terms in the GO-DAG from general to detailed and the responses are the membership probabilities. The aim is to identify the probability vector that minimizes the squared error with the original input vector and that is monotonic for the predictors and thus belongs to C.
Here, we take a discrete approach to the problem of TPR violations and we develop an algorithm for the inference of most probable TPR consistent assignments using per-GO term probabilities as input. To the best of our knowledge there is no other algorithm for this task that is suitable for large DAGs. We model the GO DAG as a Bayesian Network and we infer the most probable assignments employing the global optimization method of Differential Evolution , which is adapted for discrete space. We test our algorithm on small graphs of size 6 and 15 nodes, for which we can perform exact inference. We show that our algorithm consistently finds the correct optimal configuration. Further we evaluate the performance of the algorithm on probabilistic outputs of Bayesian Markov Random Fields (BMRF)  as applied previously in Arabidopsis thaliana protein function prediction . Our algorithm is applied to a graph that contains 1024 GO terms. We show that besides providing consistent predictions, our algorithm improves the performance of the predictions compared to a supervised method used in a previous study. We finally applied our algorithm in large scale and we provide function predictions for 32,000 unannotated proteins from six eukaryotic species.
Materials and methods
Parent-child relationship in a GO-DAG
Naturally, methods that treat GO terms independently and neglect the DAG structure of the GO can make predictions that are inconsistent. In particular for probabilistic methods those inconsistencies may appear in the form of p i >p j in which the term j is an ancestor of term i, and thus more general. In this study, we aim to find the most probable consistent GO term assignments, using such probability vectors as input. We first describe the general probabilistic setting, then derive two likelihood based objective functions and finally an evolutionary algorithm for the optimization.
Bayesian network modelling of GO
Consider a Directed Acyclic Graph (DAG) G=(V,E) with nodes V (denoting the set of GO terms) and E directed edges (the set of parent-child relationships). Vector θ denotes the input probability vector which is |V| - dimensional and x is the corresponding binary labeling, where x g =1 denotes membership for a particular protein to the g-th GO term in V GO term.
where p a(g) denotes the parent set of node g and xp a(g) the set of labels that correspond to those parents.
Conditional probability table, under the DAG constraints
min (xp a(g))
1 - θ g
Given equation 1 and conditional probability tables with parameters = (θ1,⋯,θ∣V∣), one wishes to identify the most probable labelings vector x. There are two challenges in this. The first is how to choose the parameter vector θ, discussed in this section, and the second is how to search for the most probable labelings vector x, which is discussed in the next section.
Most computational methods for GO term prediction are developed under a multi-class classification framework, where each GO term denotes a class and for each protein the probability of being member of that class is evaluated by the method. Classes are arranged according to a DAG hierarchy and further each protein may belong to one or multiple classes. In GOStruct  a SVM approach was developed to perform multi-class classification in a single step. However, the vast majority of the methods split the multi-class problem in multiple binary classification ones (i.e. one versus all) and therefore act per GO term and disregard the GO hierarchy. GeneMANIA , Kernel Logistic Regression  and BMRF  propagate function information through networks of protein associations and this operation is performed per GO term. Blast2GO , GO-At  and Aranet  perform overrepresentation analysis for each GO term separately. Such methods do not return the conditional probabilities in the sense of equation 1. The membership probabilities that they return are perhaps best viewed as marginal probabilities, i.e. summed over all configurations for GO terms other than the specific term g. We might have tried to retrieve θ from the relation between marginal and conditional probabilities, but this is certainly not an easy way. We attempted other ways.
Methods such as BMRF return low probabilities for detailed GO terms and high ones for general terms. Prioritization of the proteins in a particular GO term can then be achieved by simply sorting them. By contrast, prioritization of GO terms for a particular protein (a more important task) is not simple as the sets of probabilities for different GO terms are not directly comparable. To make them comparable, the probabilities need to be calibrated. We derive two approaches.
The first, called DeltaL is based on the maximization of the difference of the likelihood and prior probability of the labelings as they are defined in equation (1). The second, called LogitR, is based on explicit calibration of the input probability vector.
Note that when the input probabilities are very close to the priors, the objective function of DeltaL becomes multimodal.
Optimization by differential evolution: The FALCON algorithm
The DeltaL in equation(3) and LogitR in equation(4) approaches do not involve directly the TPR constraints. We develop an optimization algorithm inspired from Differential Evolution (DE)  that by construction is restricted to the subspace of consistent labelings. We call our algorithm Functional Annotation with Labeling CONsistency (FALCON). In general, DE works by evolving a population of candidate solutions to explore the search space and retrieve the maximum. Because DE is derivative free, it has appealing global optimization properties. Also, it is suitable for optimization in discrete spaces (like the labelings space in our problem).
Logical operations OR and AND for all the combinations of labels
Operations between consistent graphs (labelings) result in consistent graphs (labelings) as well, because the edge set of the last is the union or the intersection of the operands and therefore a particular edge has to pre-exist in at least one of the operands without violating the TPR. This property can be seen as follows: For any parent-child pair of nodes there are three types of configurations that are consistent (Table 1). Graph union and intersection between any combination of those pairs leads to locally consistent labeling. This holds for all the parent-child pairs, so it holds for the full labeling. Therefore the outcome of graph union and intersection will be consistent as well. Further, operations between more than two labelings will be consistent as well due to the associativity property. The FALCON optimization algorithm is based on the generation and evolution of a population of subgraphs i.e., with N=2∣V∣. The population is first initialized with consistent labelings (graphs) and evolved exploiting the graph-union and graph-intersection operations between individuals. Through the generations, all the constructed labelings will be consistent due to the abovementioned property. In our optimization problem we used four strategies to propose a new candidate solution (labelings) for the i-th graph , that is member of the population:
S1: Global Union
S2: Global Intersection
S3: Local Union
S4: Local Intersection
The first two types are called global because they do not involve while the latter two are local moves. Graph e is a random subgraph of the original full graph (i.e. GO-DAG), constructed by sampling a random node and all its ancestors. e ensures that all consistent configurations can be eventually proposed and reached.
With f the objective function i.e. being DeltaL or LogitR, the scheme of the FALCON algorithm is as follows: Initialize Population of size N=2∣V∣ by picking random consistent vectors (see below): while Convergence or Maximum generations not reached do for i=1 to N do Sample two labelings from the population Construct using the a randomly picked strategy S 1,S 2,S 3,S 4 if then, end for end while
Initialization of the population for DeltaL is done by random sampling GO terms according to their individual score (log ratio of the input and prior probability), while LogitR by sampling from the binomial distribution with probability equal to the calibrated one. In both cases the nodes were up-propagated in order to construct a consistent labeling. The computation was terminated after 10,000 generations or after reaching a plateau (i.e. there is no improvement in the objective function for 100 generations). Finally we point that a valid Markov Chain Monte Carlo algorithm cannot be derived using those proposal strategies because they do not represent reversible moves. The bitwise exclusive OR move proposed by Sterns in  is reversible but does not lead to consistent labelings. Implementation of the algorithm was done in R language for Statistical Computing and using the igraph R package .
We evaluated the performance of the FALCON algorithm on the DeltaL and LogitR objective function using Precision, Recall and F-score. Precision is defined as the percentage of correct GO terms in the list of the GO predictions. Recall is equal to the percentage of the GO assignments that were identified and F-score is the geometric mean of the Precision and Recall.
First, we tested the capability of FALCON to retrieve the most probable graph using the full graphs in Figure 1 with hundred simulated probability vectors. The first contains six nodes and the second fifteen. Because the graphs are small, exhaustive search of the most probable labeling was computationally tractable. We generated a hundred random probability vectors, by sampling probabilities for each node from the uniform distribution. Then we identified the most probable labeling for each simulated probability vector and the one returned by FALCON using equation (1) as objective function. Performance measures were calculated by comparing the vectors obtained by FALCON with the most probable ones as calculated from the exhaustive search.
The performance of FALCON was further evaluated using as input the GO membership probabilities of the Arabidopsis proteins as computed by BMRF in . This method provides membership probabilities per GO term independently. We constructed two evaluation datasets from those data. First, we randomly picked 100 Arabidopsis proteins that were already annotated at the time of computing the BMRF posterior probabilities. One constraint was that they should have at least fifty annotations (after up-propagating their original annotations). In this way we ensured that they were annotated in rather detailed GO terms, and therefore the attempt to get GO-DAG consistent predictions would be sensible. Although these proteins had a fixed labeling in the computations, BMRF can calculate membership probabilities for them, by reconstitution, i.e. as if they were unknown. The second dataset consisted of 387 proteins that were annotated later than the date of the BMRF computations. Thus, at the time of the computation the proteins were not annotated. We used this second set of proteins to evaluate the performance of FALCON in realistic conditions. In addition, we obtained a further list of predictions using the supervised approach proposed in . In this approach, from the posterior probabilities of the annotated proteins, an F-score based optimal threshold was calculated per GO term. Using this approach, called maxF, we derived a set of predictions for each evaluation dataset. Note that those lists are not guaranteed to be GO-DAG consistent.
Results and discussion
Performance of FALCON on simulated data
We initially evaluated the performance of FALCON in the two small graphs of Figure 1. For each graph we simulated 100 probability vectors by drawing from the uniform. Because the graphs are small we could identify the most probable labeling by exhaustive searching. Using equation 1 as objective function and setting all the prior probabilities to 0.5, LogitR retrieved the 98/100 of the labelings for the 6-node graph and 92/100 of the labelings for the 15-node graph. The DeltaL approach also retrieved 98/100 labelings for the small graph (using priors = 0.5 for all the nodes).
Performance of FALCON on real data
Mean performance measures for the evaluation dataset consisting of 100 Arabidopsis proteins
Per GO term
Mean performance measures for the newly annotated proteins
We performed protein function predictions using the FALCON algorithm on the unannotated parts of the genomes of 6 eukaryotes (human, mouse, rat, slime mold, frog and arabidopsis). This dataset includes the eukaryotic targets used in the Critical Assessment of protein Function Annotation (CAFA) experiment of 2011  and consists of 32,201 proteins. Function predictions were made for 1,917 GO terms from the Biological Process and Molecular Function compartments of Gene Ontology. The input probabilities were computed during CAFA’11 by BMRF integrating protein networks constructed from the STRING database  with orthology information obtained from ProgMap . The BMRF and FALCON predictions are available in the BMRF website: http://www.ab.wur.nl/bmrf_yk/FALCON_CAFA.tab.gz.
Overall, we examined the performance of FALCON for two objective functions, but FALCON is in principle suitable for optimization of a wide range of objective functions. The main purpose of FALCON is to provide GO DAG consistent predictions. We showed that this comes with no loss of the prediction performance. In fact LogitR outperforms the maxF method. The predictions of FALCON are GO-DAG consistent and therefore biologically much easier interpreted by the curators of protein function annotations. In this study an estimate of the calibration parameter α for LogitR was obtained using a yeast data set and the input probabilities were obtained from a semi-supervised method (BMRF), but, thereafter, FALCON is unsupervised; it infers the optimal GO term assignment using only the input probability vectors and the prior probabilities per GO term (computed from a set of predictions or using external Gene Ontology information). In contrast, in maxF, a training set is necessary in order to obtain the optimal cutoffs per GO term. In this study both approaches were applicable but the FALCON algorithm is expected to have broader applicability.
We thank Roeland van Ham, James Holzwarth and the two reviewers for constructive remarks on the manuscript. YK was supported by the Biorange grant SP3.2.1 from the Netherlands Bioinformatics Centre. ADJvD was supported by the transPLANT project (funded by the European Commission within its 7th Framework Programme under the thematic area ‘Infrastructures’, contract number 283496)
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: Tool for the unification of biology. Nat Genet. 2000, 25: 25-29.PubMed CentralView ArticlePubMed
- Obozinski G, Lanckriet G, Grant C, Jordan MI, Noble WS: Consistent probabilistic outputs for protein function prediction. Genome Biol. 2008, 9 (Suppl 1): S2(SUPPL. 1).View Article
- Burdakov O, Grimvall A, Sysoev O: Data preordering in generalized PAV algorithm for monotonic regression. J Comput Math. 2006, 24 (6): 771-790.
- Burdakov O, Sysoev O, Grimvall A, Hussian M: An O (n 2) algorithm for isotonic regression. Large-Scale Nonlinear Optimization. 2006, 83: 25-33. 10.1007/0-387-30065-1_3.View Article
- Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory. 1967, 13 (2): 260-269.View Article
- MacKay DJC: Information Theory, Inference & Learning Algorithms. New York: Cambridge University Press 2002.
- Kourmpetis Y, van der Burgt A, Bink M, ter Braak C, van Ham R: The use of multiple hierarchically independent Gene Ontology terms in gene function prediction and genome annotation. In Silico Biol. 2007, 7 (6): 575-582.PubMed
- Vens C, Struyf J, Schietgat L, Džeroski S, Blockeel H: Decision trees for hierarchical multi-label classification. Mach Learn. 2008, 73 (2): 185-214. 10.1007/s10994-008-5077-3.View Article
- Glez-Peña D, Álvarez R, Díaz F: DFP: A Bioconductor package for fuzzy profile identification and gene reduction of microarray data. BMC Bioinformatics. 2009, 10: 37.PubMed CentralView ArticlePubMed
- Jiang X, Nariai N, Steffen M, Kasif S, Kolaczyk E: Integration of relational and hierarchical network information for protein function prediction. BMC Bioinformatics. 2008, 9: 350.PubMed CentralView ArticlePubMed
- Lauritzen S, Spiegelhalter D: Local computations with probabilities on graphical structures and their application to expert systems. J R Stat Soc Ser B (Methodological). 1988, 50 (2): 157-224.
- Barutcuoglu Z, Schapire RE, Troyanskaya OG: Hierarchical multi-label prediction of gene function. Bioinformatics. 2006, 22 (7): 830-836.View ArticlePubMed
- Sokolov A, Ben-Hur A: Hierarchical classification of gene ontology terms using the Gostruct method. J Bioinformatics Comput Biol. 2010, 8 (2): 357-376. 10.1142/S0219720010004744.View Article
- Valentini G: True Path Rule hierarchical ensembles for genome-wide gene function prediction. Comput Biol Bioinformatics, IEEE/ACM Trans. 2011, 8 (3): 832-847.View Article
- Cesa-Bianchi N, Re M, Valentini G: Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Mach Learn. 2011, 88: 1-33.
- Storn R, Price K: Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J Glob Optimization. 1997, 11 (4): 341-359. 10.1023/A:1008202821328.View Article
- Kourmpetis Y, van Dijk A, Bink M, van Ham R, Ter Braak C: Bayesian Markov Random Field analysis for protein function prediction based on network data. PloS ONE. 2010, 5 (2): e9293.PubMed CentralView ArticlePubMed
- Kourmpetis Y, van Dijk A, van Ham R, ter Braak C: Genome-wide computational function prediction of Arabidopsis proteins by integration of multiple data sources. Plant Physiol. 2011, 155: 271-281.PubMed CentralView ArticlePubMed
- Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q: GeneMANIA: A real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 2008, 9 (Suppl 1): S4(SUPPL. 1).View Article
- Lee H, Tu Z, Deng M, Sun F, Chen T: Diffusion kernel-based logistic regression models for protein function prediction. OMICS. 2006, 10: 40-55.View ArticlePubMed
- Conesa A, Gotz S, Garcia-Gomez J, Terol J, Talón M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005, 21 (18): 3674.View ArticlePubMed
- Bradford J, Needham C, Tedder P, Care M, Bulpitt A, Westhead D: GO-At: in silico prediction of gene function in Arabidopsis thaliana by combining heterogeneous data. Plant J. 2010, 61 (4): 713-721.View ArticlePubMed
- Lee I, Ambaru B, Thakkar P, Marcotte E, Rhee S: Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana. Nat Biotechnol. 2010, 28 (2): 149-156.PubMed CentralView ArticlePubMed
- Strens M: Evolutionary MCMC sampling and optimization in discrete spaces. Proceedings of the Twentieth International Conference on Machine Learning ICML. 2003,http://www.aaai.org/Papers/ICML/2003/ICML03-096.pdf,
- Csardi G, Nepusz T: The igraph software package for complex network research. InterJournal. 2006, Complex Systems: 1695-http://igraph.sf.net,
- Radivojac P, Clark W, Oron T, Schnoes A, Wittkop T, : A large-scale evaluation of computational protein function prediction. Nature Methods. 2013, 10: 221-227.PubMed CentralView ArticlePubMed
- Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011, 39 (suppl 1): D561—D568.PubMed CentralPubMed
- Kuzniar A, Lin K, He Y, Nijveen H, Pongor S, Leunissen J: ProGMap: an integrated annotation resource for protein orthology. Nucleic Acids Res. 2009, 37 (suppl 2): W428—W434.PubMed CentralPubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.