BicNET: Flexible module discovery in large-scale biological networks using biclustering
 Rui Henriques^{1} and
 Sara C. Madeira^{1}
DOI: 10.1186/s13015-016-0074-8
© Henriques and Madeira. 2016
Received: 11 December 2015
Accepted: 22 April 2016
Published: 20 May 2016
Abstract
Background
Despite the recognized importance of module discovery in biological networks to enhance our understanding of complex biological systems, existing methods generally suffer from two major drawbacks. First, there is a focus on modules where biological entities are strongly connected, leading to the discovery of trivial/well-known modules and to the inaccurate exclusion of biological entities with subtler yet relevant roles. Second, there is a generalized intolerance towards different forms of noise, including uncertainty associated with less-studied biological entities (in the context of literature-driven networks) and experimental noise (in the context of data-driven networks). Although state-of-the-art biclustering algorithms are able to discover modules with varying coherency and robustness to noise, their application for the discovery of non-dense modules in biological networks has been poorly explored and is further challenged by efficiency bottlenecks.
Methods
This work proposes Biclustering NETworks (BicNET), a biclustering algorithm to discover non-trivial yet coherent modules in weighted biological networks with heightened efficiency. Three major contributions are provided. First, we motivate the relevance of discovering network modules given by constant, symmetric, plaid and order-preserving biclustering models. Second, we propose an algorithm to discover these modules and to robustly handle noisy and missing interactions. Finally, we provide new searches to tackle time and memory bottlenecks by effectively exploring the inherent structural sparsity of network data.
Results
Results in synthetic network data confirm the soundness, efficiency and superiority of BicNET. The application of BicNET on protein interaction and gene interaction networks from yeast, E. coli and Human reveals new modules with heightened biological significance.
Conclusions
BicNET is, to our knowledge, the first method enabling the efficient unsupervised analysis of large-scale network data for the discovery of coherent modules with parameterizable homogeneity.
Keywords
Flexible module discovery, Large-scale biological networks, Biclustering
Introduction
The increasing availability of precise and complete biological networks from diverse organisms provides an unprecedented opportunity to understand the organization and dynamics of cell functions [1]. In particular, the discovery of modules in biological networks has been largely proposed to characterize, discriminate and predict such biological functions [1–6]. The task of discovering modules can be mapped as the discovery of coherent regions in weighted graphs, where nodes represent the molecular units (typically genes, proteins or metabolites) and the scored edges represent the strength of interactions between the biological entities. In this context, a large focus has been placed on the identification of dense regions [7–10], where each region is given by a statistically significant set of highly interconnected nodes. In recent years, several biclustering algorithms have been proposed to discover dense regions from (bipartite) graphs by mapping them as adjacency matrices and searching for dense submatrices [8, 10–13]. A bicluster is then given by two subsets of strongly connected nodes.
Despite the relevance of biclustering to model local interactions [14, 15], the focus on dense regions comes with key drawbacks. First, such regions are associated with either trivial or well-known (putative) modules. Second, the scores of the interactions associated with less studied genes, proteins and metabolites have lower confidence (the severity of these penalizations being highly dependent on the studied organism) and may not reflect the true role of these molecular interactions in certain cellular processes [16]. In particular, the presence of (well-studied) regular/background cellular processes may mask the discovery of sporadic or less-trivial processes, preventing the discovery of new putative functional modules.
Although biclustering has proved to be an effective tool to retrieve exhaustive structures of dense regions in a network [8, 11–13, 17], it has not yet been effectively applied to the discovery of modules with alternative forms of coherency due to two major challenges. First, despite the hypothesized importance of discovering biclusters associated with non-dense regions (characterized for instance by constant, order-preserving or plaid coherencies), there are not yet mappings enabling the understanding of their biological meaning. Second, the hard combinatorial nature of biclustering data when considering non-dense forms of coherency, together with the high dimensionality of the adjacency matrices derived from biological networks, is often associated with memory and time bottlenecks, and/or undesirable restrictions on the structure and quality of biclusters.
This work addresses these challenges through the following contributions:

Principles for the discovery of modules in weighted graphs given by parameterizable forms of coherency (including constant, order-preserving, symmetric assumptions) with non-dense yet meaningful interactions, and given by plaid structures to accommodate weight variations explained by the network topology;

Principles for the discovery of modules robust to missing and noisy interactions;

New biclustering algorithm (BicNET) able to accommodate the proposed principles and adequately discover modules from data with arbitrarily high sparsity;

Adequate data structures and searches to guarantee BicNET’s applicability over large networks;

Principles for biclustering different types of networks, including homogeneous and heterogeneous networks, and networks with either weighted or labeled interactions;

Theoretical and empirical evidence of the biological relevance of the modules discovered using non-dense coherency assumptions.
Figure 1 provides a structured view on the challenges and proposed contributions. Accordingly, this work is organized as follows. First, we provide background on the target task. The "BicNET: solution" and "BicNET: algorithmic aspects" sections describe the principles used by BicNET and its algorithmic details. The "Results and discussion" section provides empirical evidence for the relevance of BicNET to unravel non-trivial yet relevant modules in synthetic and real biological networks. Finally, we draw conclusions and highlight directions for future work.
Background
In this section, we provide the basics on biological networks, background on biclustering network data, and a discussion on the importance and open challenges of biclustering non-dense network modules. Finally, the opportunities and limitations of pattern-based biclustering for this end are surveyed.
Biological networks
A biological network is a linked collection of biological entities (proteins, protein complexes, genes, metabolites, etc.). Biological networks are typically classified according to the observed type of biological entities and their homogeneity. Homogeneous networks are given, for instance, by protein-protein interactions (PPI) and gene interactions (GI). Heterogeneous networks capture interactions between two distinct data sources, such as proteins and protein complexes, host and viral molecules, biological entities and certain functions, among others. Biological networks can be further classified according to the type of interactions: weighted interactions (determining the degree of either physical or functional association) or qualitative/labeled interactions (such as ’binding’, ’activation’ and ’repression’). The methods targeted by this work aim to analyze both homogeneous and heterogeneous biological networks with either weighted or qualitative interactions.
Biclustering network data
The introduced types of biological networks can be mapped as bipartite graphs for the subsequent discovery of modules.
Definition 1
A graph is defined by a set of nodes X = \(\{x_1,\ldots,x_n\}\), and interactions \(a_{ij}\) relating nodes \(x_i\) and \(x_j\), either numeric (\(a_{ij}\in \mathbb {R}\)) or categoric (\(a_{ij}\in \mathcal {L}\), where \(\mathcal {L}\) is a set of symbols). A bipartite graph is defined by two sets of nodes X = \(\{x_1,\ldots,x_n\}\) and Y = \(\{y_1,\ldots,y_m\}\) with interactions \(a_{ij}\) between nodes \(x_i\) and \(y_j\).
Definition 2
Given a bipartite graph (X, Y), the biclustering task aims to identify a set of biclusters \(\mathcal {B}\) = \(\{B_1,\ldots,B_p\}\), where each bicluster \(B_k\) = \((I_k,J_k)\) is a module (or subgraph) in the graph given by two subsets of nodes, \(I_k\subseteq X\wedge J_k\subseteq Y\), satisfying specific criteria of homogeneity and statistical significance.
Under the previous definitions, both homogeneous networks (Y = X) and heterogeneous networks are candidates for biclustering. The task of biclustering network data can be reduced to the traditional task of biclustering real-valued matrices by mapping the bipartite graph as a matrix (with rows and columns given by the nodes and values given by the scored interactions). In this case, subsets of rows and columns define a bicluster. A bicluster is associated with a module in the network with coherent interactions (see Figs. 2, 3).
The paradigmatic assumption when biclustering network data is to rely on the dense coherency [20] (Definition 3). Definitions 4 and 5 formalize for the first time the meaning of distinct coherency assumptions in the context of weighted network data. The constant assumption (Definition 4) introduces the possibility of accommodating biological entities with (possibly) distinct strengths/types of interactions yet coherent behavior. This already represents an improvement in terms of flexibility against the dense assumption. Alternative coherency assumptions can be given by symmetric, order-preserving and plaid models (Definition 5).
Definition 3
Let the elements in a bicluster \(a_{ij}\in (I,J)\) have a specific coherency. A bicluster is dense when the average of its values is significantly high (deviates from expectations), where the average value is given by \(\frac{1}{|I||J|}\sum _{i\in I}\sum _{j\in J} a_{ij}.\)
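The average in Definition 3 can be illustrated with a minimal sketch. The dictionary-based network representation, the function name `bicluster_density` and the toy weights are assumptions made for illustration; absent interactions are counted as 0, and the statistical significance test mentioned in the definition is not covered.

```python
def bicluster_density(weights, rows, cols):
    """Average weight over the |I| x |J| cells of a bicluster.

    `weights` maps (x_i, y_j) pairs to interaction scores; pairs that are
    absent from the mapping are treated as 0 (an assumption of this sketch).
    """
    total = sum(weights.get((i, j), 0.0) for i in rows for j in cols)
    return total / (len(rows) * len(cols))


# Hypothetical toy network with three scored interactions.
weights = {("x1", "y1"): 0.9, ("x1", "y2"): 0.8, ("x2", "y1"): 0.7}
density = bicluster_density(weights, ["x1", "x2"], ["y1", "y2"])
print(round(density, 3))
```

Whether such an average is "significantly high" depends on the expected weight distribution of the network, which this sketch leaves out.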
Definition 4
A constant coherency assumption is observed when \(a_{ij}=k_j+\eta _{ij}\), where \(k_j\) is the expected strength of interactions between nodes in X and the \(y_j\) node from Y, and \(\eta _{ij}\) is the noise factor. In other words, constant biclusters have similarly scored interactions for each node from one of the two subsets of nodes. The coherency strength of a constant module is defined by the \(\delta\) range, where \(\eta _{ij}\in [-\delta /2,\delta /2]\).
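As a hedged illustration of Definition 4, the sketch below tests whether a candidate bicluster is constant for a given \(\delta\). It relies on the observation that a value \(k_j\) with all noise terms inside a window of width \(\delta\) exists exactly when the spread of each column (maximum minus minimum) does not exceed \(\delta\). The function name and the toy weights are hypothetical, and all interactions inside the bicluster are assumed to be present.

```python
def is_constant_bicluster(weights, rows, cols, delta):
    """Return True when every column node y_j has scores within a delta-wide window.

    Assumes every (i, j) pair of the candidate bicluster is present in `weights`.
    """
    for j in cols:
        column = [weights[(i, j)] for i in rows]
        if max(column) - min(column) > delta:
            return False
    return True


# Hypothetical weights: y1 is scored around 0.3, y2 around 0.8.
weights = {
    ("x1", "y1"): 0.30, ("x2", "y1"): 0.34,
    ("x1", "y2"): 0.80, ("x2", "y2"): 0.78,
}
print(is_constant_bicluster(weights, ["x1", "x2"], ["y1", "y2"], delta=0.1))
print(is_constant_bicluster(weights, ["x1", "x2"], ["y1", "y2"], delta=0.01))
```

Note that the two columns have very different expected strengths, so the module is constant without being dense in the sense of Definition 3.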
Definition 5
The symmetric assumption considers the (possible) presence of symmetries within a constant bicluster, \(a_{ij}=k_jc_i\)+\(\eta _{ij}\) where \(c_i\in \{-1,1\}\). An order-preserving assumption is verified when the values for each node in one subset of nodes of a bicluster induce the same linear ordering across the other subset of nodes. A plaid assumption [21] considers cumulative contributions on the elements where biclusters/subgraphs overlap.
Pattern-based biclustering
Related work
A large number of algorithms has been proposed to find modules in unweighted graphs (binary interactions) and weighted graphs (realvalued interactions) mapped from biological networks. In the context of unweighted graphs, clique detection with Monte Carlo optimization [25], probabilistic motif discovery [26] and clustering on graphs [27] have been, respectively, applied to discover modules in PPIs (yeast), GIs (E. coli) and metabolic networks.
In unweighted bipartite graphs, the densest regions correspond to bicliques. Bicliques have been efficiently discovered using Motzkin-Straus optimization [9], density-constrained biclustering [28], formal concepts and pattern-based biclustering [11, 12, 17]. In the context of weighted graphs, the density of a module is given by the average weight of the interactions within the module. Different scores have been proposed to determine the weight of an interaction, including: the functional correlation between biological entities (when interactions are predicted from literature or other knowledge-based sources); or the physical association (when interactions are derived from experimental data based for instance on the correlated variation of the expression of genes or concentration of molecular compounds). Modules given by densely connected subgraphs have been discovered from PPIs using betweenness-based partitioning [27] and flow-based clustering algorithms in graphs [29]. Biclustering has been largely applied for this end^{2} using SAMBA [20], multi-objective searches [34] and pattern-based biclustering [6, 8, 10]. The application of these methods over both homogeneous and viral-host PPIs shows that protein complexes largely match the found modules [27, 29, 34].
Pattern-based biclustering has been largely applied for the discovery of dense network modules [6, 8, 10–13, 17] due to its intrinsic ability to exhaustively discover flexible structures of biclusters. In unweighted graphs, closed frequent itemset mining and association rule mining were applied to study interactions between proteins and protein complexes in the yeast proteome network [12, 17] and between HIV-1 and human proteins to predict and characterize host-cellular functions and their perturbations [12, 13]. More recently, association rules were also used to obtain a modular decomposition of GI networks with positive and negative interactions (\(a_{ij}\in\){−1,0,1}) [11] for understanding between-pathway and within-pathway models of GIs. In weighted graphs, Dao et al. [6] and Atluri et al. [10] relied on the loose anti-monotone property of density to propose weight-sensitive pattern mining searches. DECOB [8], originally applied to PPIs and GIs from human and yeast, uses an additional filtering step to output dissimilar modules only.
Some of the surveyed contributions have been used or extended for classification tasks such as function prediction [2, 12, 13]. Discriminative modules, often referred to as multigenic markers, are critical to surpass the limitations of single gene markers and topological markers [2, 6, 35, 36]. Network-based (bi)clustering methods for function prediction have been comprehensively reviewed by Sharan et al. [2].
The problem with the surveyed contributions is their inability to discover modules with parameterizable coherency assumption and strength.
Some simple variants of the dense coherency assumption have been reviewed by Dittrich et al. [37], Ideker et al. [4] and Sharan et al. [2]. Yet, the studied algorithms do not support the coherency assumptions explored in this work (Definitions 4 and 5). A first attempt to apply biclustering algorithms with non-dense coherency over biological networks was presented by Tomaino et al. [40]. Despite its disruptive nature, this work suffers from two drawbacks. First, it only considers very small PPIs (human and yeast PPIs with less than 200 interactions) due to the scalability limits of the surveyed biclustering algorithms when handling high-dimensional adjacency matrices. Second, although enriched biological terms have been identified for the discovered modules (pointing out the importance of using non-dense forms of coherency), an in-depth analysis of the modules with enriched terms as well as an explanation of the meaning of their coherency in the assessed networks is absent.
Research questions
Although biclustering can be easily applied over biological networks to discover biclusters with varying coherency criteria, three major challenges have been preventing this possibility up to date. First, state-of-the-art biclustering algorithms are not able to scale for the majority of the available biological networks due to the high dimensionality of the mapped matrices [41]. Second, non-dense forms of coherency often come with the cost of undesirable restrictions on the number, positioning (e.g. non-overlapping condition) and quality of biclusters [15]. Finally, there is a generalized lack of understanding of the relevance and biological meaning associated with non-dense modules [41]. Although pattern-based biclustering can be used to address the second challenge [15], it still presents efficiency bottlenecks and further knowledge is required for the correct interpretation of these regions.
Accordingly, this work targets two research questions:

Discussion on whether biclustering can be efficiently and consistently applied over large-scale biological networks for the discovery of non-dense modules;

Assessment of the biological relevance of discovering network modules with varying coherency criteria.
BicNET: solution
In this section, we first introduce principles to enable the sound application of (pattern-based) biclustering over network data. Second, we motivate the relevance of discovering coherent modules following constant, symmetric and plaid models. Third, we show how to discover modules robust to noisy and missing interactions. Fourth, we extend pattern-based searches to seize efficiency gains from the inherent structural sparsity of biological networks. Fifth, we see how module discovery can be guided in the presence of domain knowledge. Finally, we overview the opportunities of pattern-based biclustering of biological networks.
Biclustering network data
For an effective application of state-of-the-art biclustering algorithms towards (weighted) graphs derived from network data, two principles should be satisfied. First, the weighted graph should be mapped into a minimal bipartite graph. In heterogeneous networks, multiple bipartite graphs can be created (each with two disjoint sets of nodes with heterogeneous interactions). The minimality requirement can be satisfied by identifying subsets of nodes with cross-set interactions but without intra-set interactions to avoid unnecessary duplicated nodes in the disjoint sets of nodes (see Fig. 4). This is essential to avoid the generation of large bipartite graphs and subsequent very large matrices. Second, when targeting non-dense coherencies from homogeneous networks, a real-valued adjacency matrix is derived from the bipartite graph by filling both \(a_{ij}\) and \(a_{ji}\) elements with the value of the interaction between \(x_i\) and \(x_j\) nodes. In the context of a heterogeneous network, two real-valued adjacency matrices are derived: one matrix with rows and columns mapped from the disjoint sets of nodes and its transpose. Despite the relevance of this second principle, some of the few attempts to find non-dense biclusters in biological networks fail to satisfy it [40], thus delivering incomplete and often inconsistent solutions.
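The second principle can be sketched as follows. This is a minimal illustration rather than BicNET's implementation: the function names, the dictionary input format and the use of `None` for absent interactions are assumptions of the sketch.

```python
def symmetric_adjacency(nodes, interactions):
    """Homogeneous network: fill both a_ij and a_ji with the interaction score."""
    index = {node: k for k, node in enumerate(nodes)}
    matrix = [[None] * len(nodes) for _ in nodes]
    for (u, v), score in interactions.items():
        matrix[index[u]][index[v]] = score
        matrix[index[v]][index[u]] = score
    return matrix


def bipartite_matrices(xs, ys, interactions):
    """Heterogeneous network: return the |X| x |Y| matrix and its transpose."""
    x_index = {x: k for k, x in enumerate(xs)}
    y_index = {y: k for k, y in enumerate(ys)}
    matrix = [[None] * len(ys) for _ in xs]
    for (x, y), score in interactions.items():
        matrix[x_index[x]][y_index[y]] = score
    transpose = [list(column) for column in zip(*matrix)]
    return matrix, transpose


# Hypothetical gene interaction network and protein/complex network.
homo = symmetric_adjacency(["g1", "g2", "g3"], {("g1", "g2"): 0.5, ("g2", "g3"): 0.2})
m, t = bipartite_matrices(["p1", "p2"], ["c1", "c2", "c3"], {("p1", "c3"): 1.0})
```

Keeping absent interactions as `None` rather than 0 preserves the distinction between a missing interaction and a zero-scored one, which matters for the later handling of missing interactions.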
Under the satisfaction of the previous two principles, a wide range of biclustering algorithms can be applied to discover modules with varying forms of coherency [14]. Yet, only pattern-based biclustering [15, 18, 42] is able to guarantee the discovery of flexible structures of biclusters with parameterizable coherency and quality criteria. Additionally, pattern-based biclustering provides an environment to easily measure the relevance and impact of discovering modules with varying coherency and tolerance to noise.
In particular, we rely on BicPAM, BiP and BicSPAM algorithms [15, 21, 22], which respectively use frequent itemset mining, association rule mining and sequential pattern mining to find biclusters with constant, plaid and order-preserving coherencies (in both the absence and presence of symmetries). These algorithms integrate the dispersed contributions from previous pattern-based algorithms and address some of their limitations, providing key principles to: (1) surpass discretization problems by introducing the possibility to assign multiple discrete values to a single element; (2) accommodate meaningful constraints and relaxations, while seizing their efficiency gains; and (3) robustly handle noise and missing values.
Modules with non-dense forms of coherency using pattern-based biclustering
Constant model
The proposed constant model can be directly applied to networks with qualitative interactions capturing distinct types of regulatory relations, such as binding, activation or enhancement associations. Qualitative interactions are commonly observed for a wide variety of PPIs [12, 13].
The constant model is essential to guarantee that biological entities with a not necessarily high (yet coherent) influence on another set of entities are not excluded. Typically, the constant coherency leads to the discovery of larger modules than the dense coherency. The exception is when the dense coherency is not given by highly weighted interactions, but instead by all interactions independently of their weight (extent of interconnected nodes). In this context, dense modules can be larger than constant modules.
Symmetric model
Plaid model
The plaid assumption [21] is essential to describe overlapping regulatory influence associated with cumulative effects in the interactions between the nodes in a biological network. To illustrate, when two genes interact in the context of multiple biological processes, a plaid model can consider their cumulative effect on the score of their interaction based on the expected score associated with each active process. The same observation remains valid to explain the regulatory influence between proteins. The use of the plaid assumption for the analysis of GIs and PPIs can also provide insights on the network topology and molecular functions, revealing: (1) hubs and core interactions (based on the amount of overlapping interactions), and (2) between- and within-pathway interactions (based on the interactions inside and outside of the overlapping areas). Figure 6 (right) illustrates a plaid model associated with two simple modules with overlapping interactions. These illustrative modules could not be discovered without a plaid assumption.
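The cumulative effect described above can be sketched with a small example. The sketch assumes a purely additive composition and a single expected score per module, both of which are simplifications chosen for illustration; the function name and the toy modules are hypothetical.

```python
from collections import defaultdict


def plaid_expected_weights(modules):
    """Sum the expected contribution of every module covering an interaction.

    Each module is (rows, cols, strength); additive composition is an
    assumption of this sketch, other composition functions are possible.
    """
    expected = defaultdict(float)
    for rows, cols, strength in modules:
        for i in rows:
            for j in cols:
                expected[(i, j)] += strength
    return dict(expected)


# Two hypothetical processes that both involve the interaction (g2, g4).
modules = [
    ({"g1", "g2"}, {"g3", "g4"}, 1.0),
    ({"g2", "g5"}, {"g4", "g6"}, 2.0),
]
expected = plaid_expected_weights(modules)
print(expected[("g2", "g4")])  # overlapping interaction: 1.0 + 2.0
```

Interactions covered by a single module keep that module's expected score, while the overlapping interaction accumulates both contributions, which is why neither module would look constant if considered in isolation.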
Order-preserving model
An order-preserving module/bicluster is defined by a set of nodes with a preserved relative degree of influence on another set of nodes [22]. To illustrate, given a bicluster (I, J) with I = \(\{x_3,x_5\}\) and J = \(\{y_2,y_6,y_7\}\), if \(a_{32}\le a_{36}\le a_{37}\) then \(a_{52}\) \(\le\) \(a_{56}\) \(\le\) \(a_{57}\). Assuming that an order-preserving module is observed with two proteins acting as transcription factors of a set of genes/proteins/metabolites, then these proteins show the same ordering of regulatory influence on the target set of biological entities. Order-preserving modules may contain interactions according to the constant model (as well as modules with shifting and scaling factors [15]), leading to more inclusive solutions associated with larger and less noise-susceptible modules. The order-preserving model is thus critical to accommodate non-fixed yet coherent influence of a node on another set of nodes, tackling the problem of scores’ uncertainty on less-researched regions in the network.
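The ordering condition in the example above can be checked with a short sketch. It tests every pair of column nodes and rejects the bicluster as soon as two row nodes rank the pair in strictly opposite directions; ties are treated as compatible with either direction, matching the non-strict inequalities of the example. The function name and the weights are hypothetical.

```python
from itertools import combinations


def is_order_preserving(weights, rows, cols):
    """True when all row nodes agree on the relative order of every column pair."""
    for j, k in combinations(cols, 2):
        directions = set()
        for i in rows:
            diff = weights[(i, j)] - weights[(i, k)]
            if diff != 0:  # a tie does not contradict any ordering
                directions.add(diff > 0)
        if len(directions) > 1:
            return False
    return True


# Hypothetical scores for I = {x3, x5} and J = {y2, y6, y7}.
weights = {
    ("x3", "y2"): 0.1, ("x3", "y6"): 0.4, ("x3", "y7"): 0.9,
    ("x5", "y2"): 0.3, ("x5", "y6"): 0.5, ("x5", "y7"): 0.6,
}
rows, cols = ["x3", "x5"], ["y2", "y6", "y7"]
print(is_order_preserving(weights, rows, cols))
```

The two rows have different absolute scores and different gaps between them, so the module is neither dense nor constant, yet the shared ordering y2, y6, y7 is preserved.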
Handling noisy and missing interactions
An undesirable restriction of existing methods for the discovery of dense modules is that they require almost every node within a module to be connected, thus possibly excluding relevant nodes in the presence of some missing interactions. Understandably, meaningful modules with missing interactions are common since the majority of existing biological networks are still largely incomplete.
Pattern-based biclustering is able to recover missing interactions by resorting to well-established and efficient postprocessing procedures [44]. These procedures commonly rely on the merging and extension of the discovered modules. Merging is driven by the observation that when two modules share a significant amount of interactions it is probable that their merging composes a larger module still respecting some homogeneity criteria [44]. Extension procedures identify candidate nodes to enlarge a given module (yet still satisfying a certain homogeneity) by changing the minimum support threshold of the pattern-based searches [15]. Furthermore, the scoring scheme of interactions might be prone to experimental noise (bias introduced by the applied measurement and preprocessing) and structural noise (particularly common in the presence of less researched genes or proteins), not always reflecting the true interactions.
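A merging procedure of this kind can be sketched as below. The overlap measure (shared interactions divided by the size of the smaller module), the greedy pairwise strategy and the function names are assumptions of this sketch, not the procedure of [44]; the 0.8 default mirrors the overlapping threshold that BicNET uses by default, and modules are assumed to be non-empty.

```python
def overlap(m1, m2):
    """Fraction of the smaller module's interactions that both modules share."""
    cells1 = {(i, j) for i in m1[0] for j in m1[1]}
    cells2 = {(i, j) for i in m2[0] for j in m2[1]}
    return len(cells1 & cells2) / min(len(cells1), len(cells2))


def merge_modules(modules, threshold=0.8):
    """Greedily merge pairs of (rows, cols) modules until no pair overlaps enough."""
    merged = [(set(rows), set(cols)) for rows, cols in modules]
    changed = True
    while changed:
        changed = False
        for a in range(len(merged)):
            for b in range(a + 1, len(merged)):
                if overlap(merged[a], merged[b]) >= threshold:
                    # The union may include interactions absent from both
                    # inputs: these are the recovered (missing) interactions.
                    merged[a] = (merged[a][0] | merged[b][0],
                                 merged[a][1] | merged[b][1])
                    del merged[b]
                    changed = True
                    break
            if changed:
                break
    return merged


# Hypothetical modules: the first two share most of their interactions.
modules = [
    ({"a", "b", "c"}, {"x", "y"}),
    ({"a", "b", "c"}, {"x", "y", "z"}),
    ({"q"}, {"r"}),
]
result = merge_modules(modules)
```

A merged module should still be checked against the target homogeneity criterion before being accepted; that check is omitted here.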
Recent breakthroughs in pattern-based biclustering show the possibility to assign multiple ranges of values on specific interactions (see Fig. 4) to reduce the propensity of excluding interactions due to score deviations. Since pattern mining searches are inherently able to learn from transactions or sequences with an arbitrary number of items, this enables the possibility to assign multiple items to a single element of the mapped matrix. As such, elements with values near a boundary of discretization (or cut-off threshold) can be assigned with two items corresponding to the closest ranges of values. Under this procedure, pattern-based biclustering is able to effectively address different forms of noise based on parameterizable distances for the assignment of additional items.
According to the previous strategies, the level of sparsity and noise of the discovered modules can be parametrically controlled. Illustrating, to strengthen the quality of a given module (reducing its tolerance to noise), the overlapping thresholds for merging procedures can be reduced. Figure 5 provides an illustrative constant module with missing interactions (red dashed lines) and noisy interactions (red continuous lines).
By default, BicNET relies on a merging procedure with an 80 % overlapping threshold (with the computation of similarities pushed into the mining step according to [44]) and on the assignment of multiple items for interactions with scores closer to a boundary of discretization (allocation of 2 items for interactions in a range \(a_{ij}\in [c_1,c_2]\) when \(\frac{\min (c_2-a_{ij},\,a_{ij}-c_1)}{c_2-c_1}<25\, \%\) according to [22]).
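The multi-item rule can be sketched as follows, reading the criterion as the relative distance of a score to the nearest boundary of its range \([c_1,c_2]\) being below 25 %. The function name, the list of cut-off points and the integer item encoding are assumptions of this sketch; scores are assumed to lie within the outermost cut-offs.

```python
import bisect


def assign_items(value, cutoffs, noise_ratio=0.25):
    """Return the item of `value`, plus the neighbouring item when near a boundary.

    `cutoffs` is a sorted list c_0 < c_1 < ... < c_k; item t covers [c_t, c_{t+1}].
    """
    n_items = len(cutoffs) - 1
    item = min(max(bisect.bisect_right(cutoffs, value) - 1, 0), n_items - 1)
    c1, c2 = cutoffs[item], cutoffs[item + 1]
    items = [item]
    if min(c2 - value, value - c1) / (c2 - c1) < noise_ratio:
        # Pick the range on the other side of the closest boundary.
        neighbour = item - 1 if (value - c1) < (c2 - value) else item + 1
        if 0 <= neighbour < n_items:
            items.append(neighbour)
    return items


cutoffs = [0.0, 1.0, 2.0, 3.0]  # hypothetical discretization into three ranges
print(assign_items(1.5, cutoffs))  # centre of a range: one item
print(assign_items(1.9, cutoffs))  # near the upper boundary: two items
```

Scores near the outer limits of the discretization keep a single item, since there is no neighbouring range to borrow from.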
BicNET: efficient biclustering of biological networks
Understandably, the task of biclustering modules with the introduced coherencies is computationally harder than biclustering dense modules (the complexity of biclustering non-dense models is discussed in [15, 22]). Empirical evidence using state-of-the-art biclustering algorithms shows that this task in its current form is only scalable for biological networks up to a few hundreds of nodes [41]. Nevertheless, a key property distinguishing biological networks from gene expression or clinical data is their underlying sparsity. To illustrate, some of the densest PPI and GI networks from well-studied organisms still have a density below 5 % (ratio of interconnected nodes after excluding nodes without interactions) [16].
While traditional biclustering depends on operations over matrices, pattern-based biclustering algorithms are prepared to mine transactions of varying length. This property makes pattern-based biclustering algorithms able to exclude missing interactions from searches and thus surpass memory and efficiency bottlenecks. To understand the impact of this option, given a homogeneous network with n nodes, the complexity of traditional biclustering algorithms is bounded by \(\Theta (f(n^2))\) (where f is the biclustering function), while the target approach is bounded by \(\Theta (f(p))\) (where p is the number of pairwise interactions) and \(p\ll n^2\) for biological network data.
Based on these observations, we propose BicNET (BiClustering Biological NETworks), a pattern-based biclustering algorithm for the discovery of modules with parameterizable forms of coherency and robustness to noise in biological networks. BicNET relies on the following principles to explore efficiency gains from the analysis of biological networks.
We first propose a new data structure to efficiently preprocess data: an array, where each position (node from a disjoint set in the bipartite graph) has a list of pairs, each pair representing an interaction (corresponding node and the interaction weight). Discretization and itemization procedures are performed by linearly scanning this structure. In this context, the time and memory complexity of these procedures is linear on the number of interactions. Sequential and transactional databases are mapped from this preprocessed data structure without time and memory overhead.
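A minimal sketch of such a structure is given below, assuming nodes are identified by integer positions and that each item is encoded as a (node, discrete level) pair. The function names and the two-level discretization are hypothetical choices for illustration.

```python
def build_structure(n_rows, interactions):
    """Array indexed by row node; each position lists (column node, weight) pairs."""
    structure = [[] for _ in range(n_rows)]
    for i, j, weight in interactions:
        structure[i].append((j, weight))
    return structure


def to_transactions(structure, discretize):
    """One linear scan: each interaction becomes one (column node, level) item."""
    return [[(j, discretize(w)) for j, w in pairs] for pairs in structure]


# Hypothetical network with three row nodes and three scored interactions.
interactions = [(0, 1, 0.9), (0, 2, 0.2), (2, 1, 0.8)]
structure = build_structure(3, interactions)
transactions = to_transactions(structure, lambda w: 1 if w >= 0.5 else 0)
print(transactions)
```

Both passes touch each interaction once, so time and memory grow with the number of interactions p rather than with the n x n size of the full adjacency matrix; nodes without interactions simply yield empty transactions.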
Pattern-based searches commonly rely on bitset vectors due to the need to retrieve not only the frequent patterns but also their supporting transactions in order to compose biclusters. Pattern-based searches for biclustering commonly rely on variants of AprioriTID methods [45] or vertical methods (such as Eclat [46]). However, Apriori-based methods suffer from the costs associated with the generation of a huge number of candidate modules for dense networks or networks with modules of varying size [41], while vertical-based methods incur the expensive memory and time costs of intersecting (arbitrarily large) bitsets [47]. These observations can be experimentally tested by parameterizing BicNET with these searches (used for instance in the BiModule [23], GenMiner [48] and DeBi [24] biclustering algorithms). For this reason, we rely on the recently proposed F2G miner [47] and on revised implementations of Eclat and Charm miners where diffsets are used to address the bottlenecks of bitsets in order to efficiently discover constant/symmetric/plaid models, as well as on the IndexSpan [22] miner to efficiently discover order-preserving models.
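The general idea behind diffsets can be illustrated with a short sketch. It shows the standard identity used by diffset-based vertical miners (the support of an extended pattern equals the support of its prefix minus the size of the diffset) and is not the revised Eclat/Charm implementation used by BicNET; the function names and the toy transactions are hypothetical.

```python
def tidsets(transactions):
    """Vertical format: map each item to the set of transaction ids containing it."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def diffset(tids_x, tids_y):
    """Transactions that support X but not Y (the diffset of XY relative to X)."""
    return tids_x - tids_y


transactions = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"b"}]
vertical = tidsets(transactions)

d_ab = diffset(vertical["a"], vertical["b"])
support_ab = len(vertical["a"]) - len(d_ab)  # support(XY) = support(X) - |d(XY)|
supporting_ab = vertical["a"] - d_ab         # rows of the corresponding bicluster
print(support_ab, sorted(supporting_ab))
```

When patterns are frequent, the diffset is typically much smaller than the full set of supporting transactions, which is where the memory and time savings over bitset intersections come from.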
Furthermore, the underlying pattern mining searches of BicNET are dynamically selected based on the properties of the network to optimize their efficiency. Horizontal versus vertical data formats [15] are selected based on the ratio of rows and columns from the mapped matrix. Apriori (candidate generation) versus pattern-growth (tree projection) searches [15] are selected based on the network density (pattern-growth searches are preferable for dense networks). We also push the computation of similarities between all pairs of biclusters (the most expensive postprocessing procedure) into the mining step by checking similarities with distance operators on a compact data structure to store the frequent patterns.
Scalability
Additional principles from the research on pattern mining can be used to guarantee the scalability of BicNET.
Multiple parallelization and distribution principles are directly applicable by enhancing the underlying pattern mining searches [49, 50]. Alternatively, data partitioning principles can be considered under certain optimality guarantees [50, 51]. Finally, BicNET can additionally benefit from efficiency gains associated with searches for approximate patterns [22, 50].
BicNET: incorporating available domain knowledge
As previously discussed, pattern-based biclustering algorithms show the unprecedented ability to efficiently discover exhaustive structures of biclusters with parameterizable coherency and quality. In this context, two valuable synergies can be identified. First, the optimality and flexibility of pattern-based biclustering solutions provide an adequate basis upon which knowledge-driven constraints can be incorporated [39]. Second, the effective use of domain knowledge to guide the underlying pattern mining searches has been largely researched in the context of domain-driven pattern mining [52, 53].
Constraint-guided biclustering
In previous work [42], pattern-based biclustering algorithms were extended to optimally explore efficiency gains from constraints with succinct, (anti-)monotone and convertible properties. To this end, the F2G and IndexSpan pattern mining searches were revised (and respectively termed F2G-Bonsai and IndexSpanPG [42]) to be able to effectively incorporate and satisfy such constraints for the final task of biclustering expression data. BicNET can be seen as a wrapper over existing pattern mining searches, adding new principles to guarantee that they are consistently, robustly and efficiently applied over biological networks. As such, BicNET’s behavior complies with domain-driven pattern mining searches. In fact, domain-driven pattern mining searches, such as F2G-Bonsai and IndexSpanPG, simply provide mechanisms to interpret constraints and guarantee that they are used to guide the pruning of the search space.
Succinct constraints can be used to remove ranges of uninformative interactions from the network [remove(S) where \(S\subseteq \mathbb {R}^+\) or \(S\subseteq \mathcal {L}\)]. Illustrating, some labels may not be relevant when mining biological networks with qualitative interactions, while low scores (denoting weak associations) can be promptly disregarded from biological networks with weighted interactions. Despite the structural simplicity of this behavior, this possibility cannot be supported by peer state-of-the-art biclustering algorithms [42].
Succinct constraints can alternatively be used to discover biological entities interacting according to specific patterns of interest. Illustrating, \(\{-2, 2\}\subseteq \varphi _B\) implies an interest in non-dense network modules (interactions without strong weights) to disclose non-trivial regulatory activity, and \(min(\varphi _B)= -3\wedge max(\varphi _B)= 3\) implies a focus on modules with interactions delineating strong activation and repression.
Monotone and antimonotone constraints are key to discover modules with distinct yet coherent regulatory interactions. Illustrating, the non-succinct monotone constraint countVal\((\varphi _B)\ge 3\) implies that at least three different types of interaction strengths must be present within a module. Assuming a network with {a,b,c} types of biological interactions, then \(|\varphi _B\cap \{a,b\}|\le 1\) is antimonotone.
Finally, convertible constraints are useful to fix pattern expectations while still accommodating deviations from those expectations. Illustrating, \(avg(\varphi _B)\le 0\) indicates a preference for network modules with negative interactions without a strict exclusion of positive interactions.
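The constraint families discussed above can be read as simple predicates over the items of a bicluster pattern. The following is a minimal sketch under stated assumptions: a pattern \(\varphi _B\) is modelled as a Python list of items, and every function name is hypothetical. It does not reproduce the interface of F2GBonsai or IndexSpanPG, which push such constraints into the search instead of checking them afterwards.

```python
from statistics import mean

# Hypothetical helpers: a pattern (phi_B) is modelled as a list of items.

def remove_items(row, uninformative):
    """Succinct constraint remove(S): drop the items that belong to S."""
    return [item for item in row if item not in uninformative]

def contains_items(pattern, required):
    """Succinct constraint: every required item occurs in the pattern."""
    return set(required) <= set(pattern)

def count_val_at_least(pattern, k):
    """Monotone constraint countVal(phi_B) >= k (distinct items)."""
    return len(set(pattern)) >= k

def at_most_one_of(pattern, items):
    """Antimonotone constraint |phi_B ∩ items| <= 1."""
    return len(set(pattern) & set(items)) <= 1

def avg_at_most(pattern, bound):
    """Convertible constraint avg(phi_B) <= bound."""
    return mean(pattern) <= bound

# Illustrative pattern over an assumed alphabet {-3, ..., 3}.
pattern = [-2, 2, -1]
checks = {
    "contains {-2, 2}": contains_items(pattern, {-2, 2}),
    "countVal >= 3": count_val_at_least(pattern, 3),
    "avg <= 0": avg_at_most(pattern, 0),
    "at most one of {a, b}": at_most_one_of(["a", "c"], {"a", "b"}),
}
for name, satisfied in checks.items():
    print(name, satisfied)
```

Checking constraints after mining, as done here, only filters the output; the efficiency gains reported in [42] come from using the same properties to prune the search space.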
Integration of external knowledge
BicNET is also able to benefit from network data contexts where nodes can be annotated. These annotations are often retrieved from knowledge repositories, semantic sources and/or literature. Annotations can be either directly derived from the properties of the biological entity (such as functional terms from ontologies) or be implicitly predicted based on the observed interactions (such as topological properties). Illustrating, consider a gene interaction network where genes are annotated with functional terms from Gene Ontology (GO) [54]. Since a gene can participate in multiple biological processes or, alternatively, have a yet unknown function, genes can have an arbitrary number of functional annotations.
Since pattern mining is able to rely on observations with an arbitrary length, BicNET consistently supports the integrated analysis of network data and annotations. For this aim, annotations are associated with a new dedicated symbol and appended to the respective row in the mapped adjacency matrix (see Fig. 8). Illustrating, consider \(T_1\) and \(T_2\) terms to be respectively associated with genes \(\{x_1,x_3,x_4\}\) and \(\{x_3,x_5\}\). An illustrative transactional database for this scenario would be \(\{x_1=\{a_{11},\ldots,a_{1m},T_1\},x_2=\{a_{21},\ldots,a_{2m}\},x_3=\{a_{31},\ldots,a_{3m},T_1,T_2\},\ldots\}\). Sequential databases can be composed by appending terms either at the end or the beginning of each sequence.
Given these enriched databases, pattern mining can then be applied with succinct, (anti)monotone and convertible constraints. Succinct constraints can be incorporated to guarantee the inclusion of certain terms (such as \(\varphi _B\cap \{T_1,T_2\} \ne \emptyset\)). Alternatively, (anti)monotone and convertible constraints can be incorporated to guarantee that, for instance, a bicluster is functionally consistent, meaning that it can be mapped to a single annotation. The \(|\varphi _B\cap \{T_1,T_2\}|\le 1\) constraint is antimonotone and satisfies the convertible condition: if \(\varphi _B\) satisfies the constraint, the suffixes of \(\varphi _B\) also satisfy the constraint.
Benefits of BicNET against its peers
This section introduced principles to guarantee the consistency, flexibility, robustness and efficiency of BicNET, as well as its ability to benefit from guidance in the presence of domain knowledge. Figure 9 illustrates the positioning of BicNET on each of these qualities against alternative state-of-the-art biclustering algorithms.

possibility to analyze not only biological networks but also sparse biological matrices, such as expression data (where non-differential expression is removed) and genome structural variations (where entries without mutations or single-nucleotide polymorphisms are ignored);

easy extension of BicNET for the discovery of discriminative modules for labeled or class-conditional biological networks by parameterizing BicNET with discriminative pattern mining searches [55, 56];

incorporation of statistical principles from pattern mining research [57–59] to assess the statistical significance of modules given by pattern-based biclusters, thus guaranteeing the absence of false positive discoveries [18].
BicNET: algorithmic aspects
The algorithmic basis of BicNET is described in Algorithm 1. BicNET’s behavior can be synthesized in three major steps: mapping, mining and post-processing. First, the input network is mapped into one or more minimal (sparse) adjacency matrices. The number of generated matrices is \(\binom{\max(\kappa ,2)}{2}\), where \(\kappa\) is the number of distinct types of nodes in the inputted network. For example, 6 adjacency matrices would be generated for a biological network capturing interactions between genes, proteins, protein complexes and metabolites (\(\kappa = 4\)). Each adjacency matrix is efficiently represented using an array of lists of pairs, where each position in the array stores the index/ID of every node interacting with a given node together with the value of that interaction. If the inputted interactions are labeled or unweighted, BicNET proceeds directly with the mining step. If the inputted interactions have real-valued weights, they are discretized (after proper normalization and exclusion of outliers) under a given coherency strength, which determines the length of the alphabet for discretization. Multiple items can be assigned (according to "Handling noisy and missing interactions" section) to mitigate the drawbacks associated with the discretization needs. Due to the assignment of multiple items, each list from the array may have duplicated indexes/IDs. In the absence of a prespecified coherency strength, BicNET iteratively discretizes the adjacency matrices using several alphabets. The modules discovered under each coherency strength are jointly post-processed.
Domain knowledge and user expectations can be declaratively specified as a set of constraints and inputted as a parameter to BicNET. For this aim, BicNET simply replaces the underlying pattern mining searches by F2GBonsai (for the constant/symmetric/plaid model) or IndexSpanPG (for the order-preserving model) [42].
Third and finally, post-processing procedures to merge, filter, extend or reduce modules are applied according to the principles respectively introduced in "Handling noisy and missing interactions" and "BicNET: efficient biclustering of biological networks" sections.
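The mapping step described above can be sketched in a few lines. This is a simplified illustration with hypothetical function names: interactions are stored as given (one direction per input triple), and the discretization uses fixed cut points, which stands in for the normalization, outlier exclusion and distribution-based discretization that BicNET applies and does not assign multiple items.

```python
from bisect import bisect_right
from math import comb

def n_adjacency_matrices(n_node_types):
    """Number of adjacency matrices: C(max(kappa, 2), 2)."""
    return comb(max(n_node_types, 2), 2)

def to_adjacency_lists(n_nodes, interactions):
    """Array of lists of (neighbour index, weight) pairs, one list per node.

    `interactions` is an iterable of (source, target, weight) triples.
    """
    adjacency = [[] for _ in range(n_nodes)]
    for source, target, weight in interactions:
        adjacency[source].append((target, weight))
    return adjacency

def discretize(adjacency, cut_points):
    """Replace each weight by an item index in {0, ..., len(cut_points)}.

    Assumption of this sketch: sorted, fixed cut points; the item is the
    number of cut points that are <= the weight.
    """
    return [[(target, bisect_right(cut_points, weight))
             for target, weight in row]
            for row in adjacency]

# Genes, proteins, protein complexes and metabolites: kappa = 4.
print(n_adjacency_matrices(4))

toy = to_adjacency_lists(3, [(0, 1, 0.9), (0, 2, 0.2), (1, 2, 0.5)])
print(discretize(toy, cut_points=[0.3, 0.6]))
```

Only existing interactions are stored, so memory grows with the number of interactions and not with the square of the number of nodes.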
Computational complexity
The computational complexity of BicNET is bounded by the pattern mining task and the computation of similarities among biclusters. For this analysis, we discuss the major computational bottlenecks associated with each of the three introduced steps. The discretization (including outlier detection and normalization) and noise correction procedures (for the assignment of multiple items) within the mapping step are linear in the size of the matrix, \(\Theta (p)\), where p is the number of interactions and typically \(p\ll n^2\). To dynamically select an adequate discretization procedure, distribution fitting tests and parameter estimations^{3} are performed in \(\Theta (p)\). The complexity of the mining step depends on two factors: the complexity of the pattern miner and the number of iterations needed for the discovery of modules with varying coherency assumptions. The cost of the pattern mining task depends essentially on the number and size of transactions/sequences (essentially defined by the size and sparsity of the inputted network), the selected mining procedures (FIM, SPM or association/sequential rules defined by the desired coherency assumption) and respective algorithmic implementations, the frequency distribution of items (essentially defined by the target coherency strength), the selected pattern representation (closed by default), and the presence of scalability enhancements (listed throughout "BicNET: efficient biclustering of biological networks" section). Empirical evidence shows that the complexity of the mining step, when iteratively applied with a decreasing support threshold, is bounded by the search with the lowest support. A detailed analysis of the complexity of the pattern mining task has been attempted in the literature [60] and is out of the scope of this paper. Let \(\Theta (\wp )\) be the complexity of the pattern mining task.
For the discovery of symmetric and plaid effects, the previous mining procedure is iteratively applied, with the final search bounded by \(\Theta (d\times \wp )\), where \(d\approx \binom{n}{2}\). Finally, the complexity of the post-processing step depends essentially on two factors: (1) the complexity of computing similarities among biclusters to merge and filter modules (bounded by \(\Theta (\binom{k}{k/2}\bar{r}\bar{s})\) based on [15], where k is the number of modules and \(\bar{r}\bar{s}\) is the average number of interactions per module), and (2) the complexity of extending and reducing modules (bounded by \(k'(\bar{r}n+n\bar{s})\), where \(k'\) is the number of biclusters after merging and filtering). Summing up, the complexity of BicNET is bounded by \(\Theta (d\wp +\binom{k}{k/2}\bar{r}\bar{s}+k'(\bar{r}n+n\bar{s}))\), which for large-scale networks (where typically \(k\gg k'\)) is approximately given by \(\Theta (d\wp +\binom{k}{k/2}\bar{r}\bar{s})\).
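To make the similarity-based post-processing cost concrete, the sketch below computes a Jaccard similarity between two biclusters and merges those above a threshold. It is a naive illustration with hypothetical names: a bicluster is a pair (set of row nodes, set of column nodes), similarity is measured over the interactions (cells) they span, and candidates are compared greedily against already merged modules. BicNET's actual procedure, which can push these computations into the mining step, is not reproduced here.

```python
def jaccard(a, b):
    """Jaccard similarity over the cells spanned by two biclusters."""
    (rows_a, cols_a), (rows_b, cols_b) = a, b
    shared = len(rows_a & rows_b) * len(cols_a & cols_b)
    total = len(rows_a) * len(cols_a) + len(rows_b) * len(cols_b) - shared
    return shared / total if total else 0.0

def merge_similar(biclusters, threshold):
    """Greedily merge each bicluster into the first similar merged module."""
    merged = []
    for rows, cols in biclusters:
        for i, (m_rows, m_cols) in enumerate(merged):
            if jaccard((rows, cols), (m_rows, m_cols)) >= threshold:
                # The merged module may span cells outside both inputs,
                # which is where tolerance to noise comes from.
                merged[i] = (m_rows | rows, m_cols | cols)
                break
        else:
            merged.append((set(rows), set(cols)))
    return merged

b1 = (set(range(10)), {0, 1})   # 10 x 2 module
b2 = (set(range(9)), {0, 1})    # 9 x 2 module nested in b1
b3 = ({20, 21}, {5})            # unrelated module

print(jaccard(b1, b2))
print(len(merge_similar([b1, b2, b3], threshold=0.8)))
```

Each similarity costs time proportional to the sizes of the row and column sets involved, so the number of compared pairs dominates when many modules are discovered.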
Default and dynamic parameterizations
As BicNET makes available a high number of options and thus fine-tunable parameters, there is the need to guarantee that it provides a robust and friendly environment for users without expertise in network module discovery and pattern-based biclustering.
For this aim, BicNET makes available: (1) default parameterizations (data-independent setting) and (2) dynamic parameterizations based on the properties of the input dataset (data-dependent setting). Default parameterizations include: (1) zero-mean row-oriented normalization followed by overall Gaussian discretization with n/4 items for order-preserving coherencies (for an adequate tradeoff of precedences vs. co-occurrences) and a number of items in the set \(\{3,5,7\}\) for the remaining coherencies; (2) iterative discovery of modules with distinct coherencies (dense, constant, symmetric, plaid and order-preserving); (3) F2G search for closed FIM and association rule mining, and IndexSpan search for SPM; (4) multi-item assignment (according to criteria introduced in section “Handling noisy and missing interactions”); (5) merging procedure with the computation of Jaccard-based similarities pushed into the mining step and an 80 % overlapping threshold; (6) filtering procedure for biclusters without statistical significance (according to [44]) and a 70 % Jaccard-based similarity against a larger bicluster; and (7) no extension or reduction procedures. For the default setting, BicNET iteratively decreases the support threshold by 10 % (starting with \(\theta\) = 80 %) until the output solution contains 50 dissimilar modules or covers at least 10 % of the interactions in the inputted network.
The dynamic parameterizations differ with regard to the following aspects: (1) the fit of different distributions is tested to select adequate normalization and discretization procedures, (2) the size and sparsity of the biological network are used to affect the pattern mining search (according to [18]), and (3) data partitioning procedures are considered for large-scale networks with over 100 million interactions for dense and constant module discovery and over 1 million interactions for the discovery of modules with alternative coherency assumptions.
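The default stopping criterion for the support threshold can be sketched as a simple loop. Several details here are assumptions and not statements about BicNET: the 10 % decrease is read as relative (multiplying \(\theta\) by 0.9), the `mine` callable is a hypothetical stand-in for the pattern-based search that returns each module as a set of (row, column) interactions, and the `min_theta` guard is added only so the loop always terminates.

```python
def discover_with_decreasing_support(mine, n_interactions, theta=0.80,
                                     step=0.10, target_modules=50,
                                     target_coverage=0.10, min_theta=0.01):
    """Lower the support threshold until enough modules or coverage is found.

    `mine(theta)` is a hypothetical callable returning a list of modules,
    each given as a set of (row, column) interactions.
    """
    while True:
        modules = mine(theta)
        covered = set().union(*modules) if modules else set()
        coverage = len(covered) / n_interactions
        if (len(modules) >= target_modules
                or coverage >= target_coverage
                or theta <= min_theta):
            return theta, modules
        theta *= 1 - step  # assumed relative decrease of 10 %

def toy_mine(theta):
    """Toy miner: one module at high support, three at lower support."""
    if theta > 0.75:
        return [{(0, 0)}]
    return [{(0, 0)}, {(1, 1)}, {(2, 2)}]

final_theta, found = discover_with_decreasing_support(toy_mine, n_interactions=20)
print(round(final_theta, 2), len(found))
```

In the toy run the first iteration covers 5 % of the interactions, so the threshold is lowered once before the 10 % coverage target is met.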
Software
BicNET is provided within both graphical and programmatic interfaces^{4} to offer a supportive environment for the analysis of biological networks. BicNET supports the loading of input data and the exportation of results according to a wide variety of formats.
Alternatively, BicNET is made available through a programmatic interface based on a Java API with the respective source code and accompanying documentation. This interface can be used to: extend pattern-based biclustering algorithms for alternative tasks, such as classification and indexation, and easily adapt its behavior in the presence of biological networks with very specific regularities. Illustrative cases are provided in the webpage of the authors.
Results and discussion
Results are organized as follows. First, we describe the selected data settings, metrics and algorithms. Second, we compare the performance of BicNET against state-of-the-art algorithms for biclustering and network module discovery, using synthetic networks with varying properties. Finally, we use BicNET for the analysis of large-scale PPI and GI networks to show the relevance of discovering modules with varying forms of coherency and parameterizable levels of noise and sparsity. BicNET is implemented in Java (JVM v1.6.024). Experiments were run using an Intel Core i5 2.30 GHz with 6 GB of RAM.
Experimental settings
Synthetic data

Size of networks: number of nodes and density;

Distribution of the weight of interactions for real-valued networks (Uniform or Gaussian assignment of positive and negative ranges of values) and of labels for symbolic networks;

Number, size (Uniform distribution on the number of nodes to plant biclusters with dissimilar size), overlapping degree, and shape (imbalance on the distribution of nodes per disjoint set) of modules;

Modules’ coherency: dense, constant, symmetric, plaid (according to [21]) and order-preserving assumptions, with the respective 1.2, 1, 1.2, 1.1 and 1.5 scale adjustments to the expected size (to guarantee their statistical significance, as the different coherency assumptions impact the probability of a module unexpectedly occurring by chance);

Planted degree of noisy and missing interactions (from 0 to 20 %).
Default synthetic data benchmarks for network data analyses
 | Network nodes (10 % density) |  |  |  |  | Network density (2000 nodes) |  |  | 
 | 200 | 500 | 1000 | 2000 | 10,000 | 1 % | 5 % | 10 % | 25 %
Nr. of hidden modules | 5 | 10 | 15 | 20 | 30 | 3 | 5 | 10 | 20
Nr. of nodes per module | [20, 30] | [30, 40] | [40, 50] | [50, 70] | [100, 140] | [50, 70] | [50, 70] | [50, 70] | [50, 70]
% interactions in modules | 19.5 | 12.2 | 7.6 | 4.5 | 1.1 | 22.5 | 9.0 | 4.5 | 2.3
Table 1 summarizes the default data settings for some of these variables when assuming that the generated network is homogeneous. The generation of heterogeneous networks is also made available through the specification of the size of each disjoint set of nodes and pairwise density between the sets of distinct types of nodes. For a sound evaluation of the target algorithms, 30 data instances were generated for each data setting.
Real data
Biological networks used to assess the relevance and efficiency of BicNET
Type | Organism | \(\sharp\)Nodes | \(\sharp\)Interactions | Density (%) | Notes
GI | Yeast | 4455 | 191,309 | 1.0 | Links (65 % negative) from double-mutant arrays [19]
GI | Yeast | 6314 | 423,335 | 1.1 | Known and predicted associations benchmarked from multiple data sources and text mining, and combined through an integrative score [16]
PPI | E. coli | 8428 | 3,293,416 | 4.6 | 
PPI | Human | 19,247 | 8,548,002 | 2.3 | 
Performance metrics
Introductory notes on tools for network data analysis
As surveyed, a wide diversity of algorithms and tools has been proposed for the modular analysis of biological networks. To this end, three major options have been considered: (1) exhaustive clustering (discovery of sets of nodes C such that \(\cup _{k}C_k= X \wedge \cap _{k}C_k =\emptyset\)) using different algorithms; (2) non-exhaustive clustering with the allowance of overlapping nodes between clusters (\(\cup _{k}C_k\subseteq X\)); and (3) biclustering (discovery of bisets of nodes (I, J) coherently related). Table 3 provides a compact view of the differences between the solutions gathered by the different techniques, disclosing their intrinsic limitations for the discovery of coherent modules within the target synthetic and biological networks. For this comparison, k-Means, affinity propagation and spectral clustering algorithms [63] for weighted networks were tested using the MEDUSA software [64]; the CPMw (clique percolation method for weighted networks) algorithm [65] was applied for non-exhaustive clustering using the CFinder software; and traditional algorithms for biclustering dense network modules (based on the discovery of hypercliques from unweighted and/or weighted networks [6, 8, 11, 12]) were applied using the BicNET software.
Comparison of widely used tasks for modular analysis of networks using the introduced synthetic and real datasets
Approach | Method | Solution aspects and concerns | Efficiency
Clustering (exhaustive and non-overlapping node coverage) | k-Means | Majority of clusters show loose connectedness; High variation in the size of modules (1 to 3 clusters covering almost all nodes and the remaining clusters being statistically non-significant [66]) | Efficiency problems for networks with >100,000 interactions
 | Spectral | Able to isolate modules where the degree of connectedness is approximately constant per module; Only a small subset of clusters is relevant (medium-to-high degree of connectedness) | MEDUSA implementation only scales for networks with <10,000 interactions
 | Affinity propagation | The clusters collected from (small samples of) the target biological networks show a generalized lack of biological relevance | Time and memory bottlenecks for small nets (<1000 interactions)
Clustering (non-exhaustive and possibly overlapping node coverage) | CPMw (weighted k-clique percolation) | Intolerance to noise; Intractably large solutions (explosion of similar clusters) with strict coherency criterion (k-clique); Dependence on parameters (e.g. k, intensity level) | Only scales for nets with <5000 nodes (5–10 % density). Bottlenecks for the target biological data even when removing >95 % interactions
Biclustering (bisets of nodes) | Hypercliques (unweighted) | Intolerant to missing interactions; Large number of highly similar modules; Dense coherency only | BicNET implementation efficient for large networks (>10,000 nodes) with density up to 25 %
 | Hypercliques (differential) | Intolerant to noise and prone to the item-boundaries problem during the selection of differential weights; Dense coherency only | BicNET implementation scales for large dense networks
 | BicNET (dense assumption) | Focus on dissimilar modules robust to noise and missings, with possibly distinct forms of coherency strength (L \(\in\){1,2,3,5}) | Efficiency bounded by the search for unweighted hypercliques (L=1)
Algorithms for comparisons
For the purpose of establishing fair comparisons, we select 7 state-of-the-art biclustering algorithms that, similarly to BicNET, are prepared to find biclusters with non-dense coherencies^{5}: FABIA^{6} [67], ISA [69], xMotifs [70] and Cheng and Church [71] (all able to discover variants of the introduced constant model); OPSM [72] and OPClustering [43] (able to discover order-preserving models); and SAMBA [20] (inherently prepared to discover dense biclusters). The number of seeds for FABIA and ISA was set to 10 and the number of iterations for OPSM was varied from 10 to 100. The remaining parameters of the selected methods were set by default.
Results on synthetic data
In Fig. 13, we compare the efficiency of BicNET with state-of-the-art biclustering algorithms with non-dense coherency criteria for the analysis of networks with varying size and density and planted modules following a constant coherency assumption.
Results on real data
Results gathered from the application of BicNET over real biological networks are provided in three parts. First, we show basic statistics that motivate the relevance of using BicNET against peer algorithms. Second, we explore the biological relevance of the retrieved modules when considering varying levels of tolerance to noise and different forms of coherency. Finally, we make use of some of the meaningful constraints provided in "BicNET: incorporating available domain knowledge" section in order to discover less-trivial modules (such as modules characterized by the presence of plaid effects, flexible constant patterns or symmetries), and provide a brief analysis of their enriched terms and transcription factors.
The biological significance of the modules retrieved from real data is here computed by assessing the over-representation of Gene Ontology (GO) terms with a hypergeometric test using GOrilla [73]. A module is significant when its genes or proteins show enrichment for one or more of the “biological process” terms with a (Bonferroni-corrected) p value below 0.01.
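The significance criterion can be illustrated with a plain hypergeometric tail probability followed by a Bonferroni correction. This is a textbook sketch with hypothetical function names and made-up toy counts; it is not GOrilla's implementation, whose exact procedure and corrections may differ.

```python
from math import comb

def hypergeom_pvalue(population, annotated, module_size, observed):
    """P(X >= observed) when drawing module_size genes without replacement.

    population  : number of genes in the background
    annotated   : background genes annotated with the GO term
    module_size : number of genes in the module
    observed    : module genes annotated with the GO term
    """
    upper = min(annotated, module_size)
    tail = sum(comb(annotated, i) * comb(population - annotated, module_size - i)
               for i in range(observed, upper + 1))
    return tail / comb(population, module_size)

def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of tested terms."""
    return min(1.0, p_value * n_tests)

def is_significant(p_value, n_tests, alpha=0.01):
    return bonferroni(p_value, n_tests) < alpha

# Toy counts (not taken from the paper): 40 of 6000 background genes carry
# the term, and 12 of the 50 module genes carry it.
p = hypergeom_pvalue(6000, 40, 50, 12)
print(p, is_significant(p, n_tests=1000))
```

The correction is conservative: with many tested GO terms, only strongly enriched terms remain below the 0.01 threshold.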
Modules with varying coherency
Description of the biological role of an illustrative set of BicNET’s modules with varying properties
 | ID | Homogeneity | \(\sharp\)Nodes \(I\times J\) | Putative functionality: group of enriched terms (\(p<\)1E−10)
STRING (yeast) | Y1 | Dense (high noise-tolerance) | 231 × 14 | Metabolic processes with incidence on protein, peptide and amide metabolism and biosynthesis
 | Y2 | Dense (medium noise-tolerance) | 217 × 9 | Metabolism of nitrogen compounds and some organic substances
 | Y3 | Constant (few high \(a_{ij}\)) | 103 × 8 | Amino acid activation and tRNA metabolism for tRNA aminoacylation
 | Y4 | Constant (few high \(a_{ij}\)) | 206 × 6 | Organic acid metabolic process and its subterms
 | Y5 | Constant (few high or low \(a_{ij}\)) | 55 × 7 | Signal transduction and its subterms
 | Y6 | Constant (few high or low \(a_{ij}\)) | 43 × 6 | Phosphorylation related terms (with incidence on protein phosphorylation)
 | Y7 | Order-preserving | 176 × 12 | Transport of organic acids (with incidence on amino acid transmembrane transport)
 | Y8 | Order-preserving | 235 × 9 | Oxidation-reduction process and metabolism of amino acids. Assembly of ribonucleoprotein
 | Y9 | Order-pres. (few high \(a_{ij}\)) | 146 × 8 | Transport of molecules (highest enrichment found for drug transmembrane)
STRING (human) | H1 | Dense (high noise-tolerance) | 811 × 28 | Multiple metabolic processes with incidence on transcription activity
 | H2 | Dense (high noise-tolerance) | 787 × 25 | Regulation of metabolic processes (both positive and negative regulation)
 | H3 | Constant (few high \(a_{ij}\)) | 693 × 14 | Regulation of intracellular signal transduction (over 20 highly enriched terms)
 | H4 | Constant (few high \(a_{ij}\)) | 645 × 10 | Regulation of molecular functions (incidence on catalytic activity)
 | H5 | Order-preserving | 720 × 24 | Establishment of protein localization (protein targeting to ER and membrane)
 | H6 | Order-preserving | 733 × 29 | Protein phosphorylation and its subterms
DryGIN | D1 | Dense (high noise-tolerance) | 28 × 17 | Organelle localization (establishment of spindle and nuclear localization)
 | D2 | Constant (with pos&neg \(a_{ij}\)) | 22 × 10 | Chromatin remodeling and nucleosome organization
 | D3 | Constant (with pos&neg \(a_{ij}\)) | 21 × 7 | Transport processes for the establishment of protein localization
 | D4 | Constant (with pos&neg \(a_{ij}\)) | 19 × 9 | Regulation of growth (incidence on filamentous growth)
 | D5 | Order-preserving | 39 × 7 | Organelle and nucleus organization
 | D6 | Order-preserving | 54 × 6 | Regulation of cellular metabolic processes (both positive and negative regulation)
Three major observations are retrieved from the conducted analyses. First, the combination of the dense model with the provided procedures to foster robustness leads to higher enrichment factors, as key genes/proteins with subtler yet functional relevance were not excluded from the modules. Nevertheless, this form of coherency is mainly associated with broader biological processes, such as general metabolic and regulatory processes (see \(Y_1\), \(Y_2\), \(H_1\) and \(H_2\) modules). Second, the constant model is indicated to guarantee a focus on less trivial modules associated with a compact set of more specific biological processes. Modules \(Y_3\)–\(Y_6\), \(H_3\)–\(H_4\) and \(D_2\)–\(D_4\) are examples of the relevance of considering non-dense interactions, since these interactions are often related with latent or secondary (yet critical) cellular functions. Third, the order-preserving coherency is associated with modules as large as the ones provided under the noise-tolerant dense coherency, yet with the additional benefit of enabling the presence of weaker interactions as long as their coherency among the nodes is respected.
Non-trivial modules
Exclusivity and relevance of BicNET solutions: properties of found modules
 | ID | Type | \(\sharp\)Nodes \(I\times J\) | Items | \(\sharp\)Terms p \(<\)1E−15 | Notes
DryGIN | G1 | Constant | 18 × 9 | {−4,..,−1} | 27 | Module with coherent strong (−4) and soft (−1) negative interactions
 | G2 | Symmetric | 4 × 9 | {−3,..,3} | 13 | Varying levels of strong (mainly positive) interactions ({\(\pm\)3,\(\pm\)2})
 | G3 | Symmetric | 5 × 6 | {−2,−1,1,2} | 12 | Module with either all positive or negative interactions per “row”-node ({\(\pm\)1,\(\pm\)2})
 | G4 | Constant | 7 × 5 | {1,2} | 12 | Module with coherent strong (2) and soft (1) positive interactions
 | G5 | Symmetric | 7 × 5 | {−2,−1,1,2} | 11 | Module with either all positive or negative interactions per “row”-node ({\(\pm\)1,\(\pm\)2})
 | G6 | Order | 14 × 11 | {−3,..,3} | 25 | Preserved precedences and co-occurrences per “row”-node before post-processing
 | G7 | Order | 42 × 8 | {−2,−1,1,2} | 50 | Noise-tolerant module with mostly preserved orderings per “row”-node
STRING | S1 | Order | 155 × 14 | {1,2,3,4} | 169 | Preserved precedences and co-occurrences per “row”-node before post-processing
 | S2 | Constant | 80 × 18 | {1,2,3} | 98 | Module with mostly non-dense interactions ({1,2})
 | S3 | Constant | 83 × 10 | {1,2} | 93 | Module with non-dense positive interactions before post-processing ({1})
 | S4 | Constant | 50 × 20 | {1,2,3} | 70 | Module with non-dense positive interactions ({1,2}) before post-processing
 | S5 | Constant | 45 × 31 | {1,2,3} | 76 | Module with mostly dense interactions (scores in {2,3})
 | S6 | Constant | 55 × 85 | {1,2} | 143 | Module with mostly dense interactions ({2})
 | ID | Terms description (\(\sharp\)) | \(\sharp\)Terms p \(<\)1E−15 | \(\sharp\)Nodes
DryGIN | G1 | Histone modification; regulation of histone H3K79 methylation, histone H2B ubiquitination, H2B conserved C-terminal lysine ubiquitination, H3K4 methylation (4) | 6 | 27
 | G2 | Regulation of gluconeogenesis; glutamate metabolic and catabolic processes (2); nicotinamide riboside metabolic process; nicotinamide nucleotide biosynthetic process | 6 | 13
 | G3 | Positive and negative regulation of transcription from RNA polymerase II; Invasive growth response to glucose limitation and hyperosmotic salinity response by regulating RNA polymerase II (5) | 5 | 12
 | G4 | Meiotic anaphase I; activation of anaphase-promoting complex activity involved in meiotic cell cycle | 4 | 12
 | G5 | Negative reg. of phospholipid biosynthesis; lipid homeostasis; isopropylmalate and oxaloacetate transport | 4 | 11
 | G6 | Cotranslational protein targeting to membrane; protein insertion into mitochondrial membrane; protein import into peroxisome membrane; reg. sporulation; actin filament bundle assembly involved in cytokinesis | 5 | 25
 | G7 | Acetate fermentation, acetyl-CoA biosynthesis (from acetate), reg. transcription on exit from mitosis | 7 | 50
STRING | S1 | Response to hypoxia; oxidation-dependent protein catabolic process; anaerobic respiration; age-dependent response to reactive oxygen species; cellular response to oxidative stress | 36 | 169
 | S2 | Positive and negative reg. of mitotic and nuclear cell cycle, DNA replication, budding cell apical bud growth | 16 | 98
 | S3 | Transport of aerobic electron, acetyl-CoA, vacuolar transmembrane, amine, transport (5); ribose phosphate metabolic process; D-ribose metabolic and catabolic processes (2) | 22 | 93
 | S4 | Heterochromatin maintenance involved in chromatin silencing; sister chromatid segregation | 6 | 70
 | S5 | Cytoplasmic and mitochondrial translation (4); regulation of translational fidelity; ADP biosynthesis | 6 | 76
 | S6 | rRNA processing; separation, cleavage and maturation of SSU-rRNA (5); ribosomal (large subunit) biogenesis | 14 | 143
Sets of modules with meaningful overlapping areas (satisfying the in-between plaid assumption [21])
ID | Modules with meaningful overlapping regions | Pattern | \(\sharp\)Nodes \(I\times J\) | % Overlapping interactions
G6 | G7 from Table 6 (orders preserved in overlapping regions before cumulative effect) | Order | 42 × 8 | 21
 | G8: tRNA re-export from nucleus; nuclear mRNA surveillance of mRNP export | Constant | 12 × 10 | 62
 | G9: More general module (background) including cellular responses to pH | Constant | 41 × 6 | 16
S4 | S2 from Table 6 (satisfying the relaxed additive model proposed in [21]) | Constant | 80 × 18 | 42
 | S7: Telomere maintenance; translocation; protein import into nucleus | Constant | 104 × 20 | 37
 | S8: Response to ionizing radiation; ribose phosphate metabolic process | Constant | 59 × 31 | 45
 | S9: Positive regulation of mitochondrial translation in response to stress | Constant | 50 × 20 | 89
The analysis of the enriched transcription factors (TFs) for each putative biological process in Table 6 further supports the previous functional enrichment analyses. To this end, we retrieved the TFs that are more representative (high coverage of the genes in the module) and significant (high functional enrichment: p value\(<\)1E−3). Illustrating, \(G_1\) has diverse TFs regulating different families of histones, such as Jhd1p [74]; in \(G_4\) we found regulators of meiosis, including Sin3p [74]; the TFs of \(G_7\) activate genes required for cytokinesis (exit from mitosis); in \(S_1\) we found TFs associated with responses to oxygen-related stress, such as the activation of beta-oxidation genes by Pip2p [74]; proteins regulating \(S_2\) respond to DNA damage, such as Plm2p and Abf1p [75]; membrane sensors, such as Ure2p, are active in the regulation of genes in \(S_3\); \(S_4\) has proteins promoting the organization and remodeling of chromatin, including Abf1p, Plm2p and Rsc1p [75]; regulators of ribosomal biogenesis, such as Sfp1p (100 % representativity), and of its subunits, such as Cse2p [74], are core TFs for \(S_6\).
Concluding note
When analyzing networks derived from knowledge-based repositories and literature (such as the networks from STRING [16]), the flexibility of coherence and noise-robustness is critical to deal with uncertainty and with the regions of the network where scores may be affected due to the unbalanced focus of research studies. When analyzing networks derived from data experiments (such as the GIs from DRYGIN [19]), the discovery of modules with non-necessarily strong interactions (e.g. given by the constant model) is critical to model less-predominant (yet key) biological processes, such as the ones associated with early stages of stimulation or disease.
Conclusions and future work
This work tackles the task of biclustering large-scale network data to discover modules with non-dense yet meaningful coherency and robustness to noise. In particular, we explore the relevance of mining non-trivial modules in homogeneous and heterogeneous networks with quantitative and qualitative interactions. We proposed the BicNET algorithm to extend state-of-the-art contributions on pattern-based biclustering with efficient searches on networks, thus enabling the exhaustive discovery of constant, symmetric and plaid models in biological networks. Additional strategies were further incorporated to retrieve modules robust to noisy and missing interactions, thus addressing the limitations of the existing exhaustive searches on networks. Finally, we have shown that BicNET can be assisted in the presence of background knowledge and user expectations.
Empirical evidence confirms the superiority of BicNET against peer biclustering algorithms able to discover non-dense regions. Contrasting with their efficiency bottlenecks, BicNET enables the analysis of dense networks with up to 50,000 nodes. Results on biological networks reveal its critical relevance to discover non-trivial yet coherent and biologically significant modules.
Five major directions are identified for upcoming research: (1) to gather missing and noisy interactions within the discovered modules to predict unknown interactions and to test the confidence (or adjust the score) of the weighted interactions within available biological networks; (2) to enlarge the conducted biological analysis to further establish relationships between modules and functions to support the characterization of biological molecules with yet unclear roles; (3) to explore the plaid model to identify and characterize hubs based on the overlapping interactions between modules, as well as the interactions within each of the two sets of interacting nodes per bicluster to further assess the connectivity, coherence and significance of modules; (4) to study the relevance of alternative forms of coherency given by biclustering algorithms with distinct homogeneity/merit functions [15]; and (5) to extend BicNET for the integrative analysis of GI and PPI networks and expression data in order to validate results and combine these complementary views either at the input, mining or output levels.
Availability
The BicNET software (graphical and programmatic interfaces) and datasets can be accessed at https://web.ist.utl.pt/rmch/bicnet/.
Consider the specific case where pattern-based biclustering is given by frequent itemset mining. Let \(\mathcal {L}\) be a finite set of items, and P an itemset \(P\subseteq \mathcal {L}\). A discrete matrix D is a finite set of transactions in \(\mathcal {L}\), \(\{P_1,\ldots ,P_n\}\). Let the coverage \(\Phi _{P}\) of an itemset P be the set of transactions in D in which P occurs, \(\{P_i \in D\mid P\subseteq P_i\}\), and its support \(sup_P\) be the coverage size, \(\mid \Phi _{P}\mid\). Given D and a minimum support threshold \(\theta\), the frequent itemset mining (FIM) problem consists of computing: \(\{P \mid P \subseteq \mathcal {L}, sup_P \ge \theta \}\).
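As a minimal sketch of the FIM problem just defined, the following Python fragment enumerates all itemsets with support at least \(\theta\) using a level-wise (Apriori-style [45]) search. It is an illustration of the definition only, not the search used by BicNET; the function names are hypothetical and \(\theta \ge 1\) is assumed.

```python
from itertools import combinations

def coverage(itemset, transactions):
    """Phi_P: indexes of the transactions that contain every item of `itemset`."""
    return {i for i, t in enumerate(transactions) if itemset <= t}

def frequent_itemsets(transactions, theta):
    """Return {itemset: support} for every itemset with support >= theta."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    # Level 1: frequent single items.
    level = {frozenset([x]) for x in items
             if len(coverage(frozenset([x]), transactions)) >= theta}
    result = {}
    while level:
        for p in level:
            result[p] = len(coverage(p, transactions))
        # Candidates with one more item, joined from the current level.
        size = len(next(iter(level))) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Anti-monotonicity: every (size-1)-subset must already be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in result for s in combinations(c, size - 1))
                 and len(coverage(c, transactions)) >= theta}
    return result
```

For example, with D = [{a, b, c}, {a, b}, {a, c}, {b, c}] and \(\theta = 2\), the three single items and the three pairs are frequent, whereas {a, b, c} occurs in a single transaction and is discarded.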
Given D, let a matrix A be the concatenation of the elements of D with their column (or row) indexes. Let \(\Psi _P\) of an itemset P in A be its indexes, and \(\Upsilon _P\) be its original items in \(\mathcal {L}\). A set of biclusters \(\cup _k (I_k,J_k)\) can be derived from a set of frequent itemsets \(\cup _k P_k\) by mapping \((I_k,J_k)\)=\(B_k\), where \(B_k\)=\((\Phi _{P_k},\Psi _{P_k})\), to compose constant biclusters with coherency across rows (or \((I_k,J_k)\)=\((\Psi _{P_k},\Phi _{P_k})\) for column coherency) with pattern \(\Upsilon _{P_k}\).
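The mapping from frequent itemsets to biclusters can be sketched in Python as follows. Each row of D becomes a transaction of (column index, value) items, which plays the role of matrix A; an itemset P then yields the rows \(\Phi _P\), the columns \(\Psi _P\) and the pattern \(\Upsilon _P\). The brute-force enumeration, the function names, the `min_cols` parameter and the use of `None` for missing entries are assumptions made for illustration; an efficient miner would replace the exhaustive loop.

```python
from itertools import combinations

def to_transactions(D):
    """Matrix A: each row as a set of (column index, value) items; None = missing."""
    return [frozenset((j, v) for j, v in enumerate(row) if v is not None) for row in D]

def bicluster_from_itemset(P, A):
    """Map an itemset P to (Phi_P, Psi_P, Upsilon_P)."""
    rows = sorted(i for i, t in enumerate(A) if P <= t)   # Phi_P: supporting rows
    cols = sorted(j for j, _ in P)                        # Psi_P: column indexes
    pattern = [v for _, v in sorted(P)]                   # Upsilon_P, ordered by column
    return rows, cols, pattern

def constant_biclusters(D, theta, min_cols=2):
    """Brute-force illustration: all row-coherent constant biclusters with >= theta rows."""
    A = to_transactions(D)
    items = sorted(set().union(*A))
    found = []
    for k in range(min_cols, len(D[0]) + 1):
        for combo in combinations(items, k):
            if len({j for j, _ in combo}) < k:
                continue  # two values of the same column never co-occur in a row
            rows, cols, pattern = bicluster_from_itemset(frozenset(combo), A)
            if len(rows) >= theta:
                found.append((rows, cols, pattern))
    return found
```

Swapping the roles of rows and columns (i.e. passing the transpose of D) gives the column-coherent variant mentioned above.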
In the context of biological networks, biclustering has also been used to either validate or extract molecular interactions from biclusters discovered in gene expression and proteomic data [30–33]. This is a rather distinct task from the one targeted in this paper and is thus out of scope.
Tests and estimations based on the calculus of approximated statistical ratios described in http://www.pitt.edu/super1/ResearchMethods/Riccidistributionsen.pdf (accessed January 2016).
To run the experiments, we used the fabia package [67] for R, BicAT [68], BicPAM [15] and Expander [20].
Sparse prior equation, with sparsity decreased until a non-empty set of biclusters is retrieved.
Declarations
Authors’ contributions
RH designed and implemented the algorithms under close supervision by SCM. RH drafted the manuscript. Both authors revised and approved the final manuscript.
Acknowledgements
This work is an extension of previous work [41]. It was supported by national funds through Fundação para a Ciência e Tecnologia with reference UID/CEC/50021/2013, the Ph.D. grant SFRH/BD/75924/2011 to RH and the sabbatical leave grant SFRH/BSAB/1427/2014 to SCM. SCM was also partially funded by the EURIAS Fellowship Programme and the European Commission (Marie-Sklodowska-Curie actions CoFUND Programme, FP7) through a grant for a junior fellowship position at Istituto di Studi Avanzati, University of Bologna, Italy.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5(2):101–13.
2. Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Mol Syst Biol. 2007;3(1):88.
3. Mukhopadhyay A, Ray S, Maulik U. Incorporating the type and direction information in predicting novel regulatory interactions between HIV-1 and human proteins using a biclustering approach. BMC Bioinform. 2014;15:26.
4. Ideker T, Ozier O, Schwikowski B, Siegel AF. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 2002;18(suppl 1):S233–40.
5. Segal E, Wang H, Koller D. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics. 2003;19(suppl 1):i264–72.
6. Dao P, Colak R, Salari R, Moser F, Davicioni E, Schönhuth A, Ester M. Inferring cancer subnetwork markers using density-constrained biclustering. Bioinformatics. 2010;26(18):i625–31.
7. Georgii E, Dietmann S, Uno T, Pagel P, Tsuda K. Enumeration of condition-dependent dense modules in protein interaction networks. Bioinformatics. 2009;25(7):933–40.
8. Colak R, Moser F, Chu JSC, Schönhuth A, Chen N, Ester M. Module discovery by exhaustive search for densely connected, co-expressed regions in biomolecular interaction networks. PLoS ONE. 2010;5(10):e13348.
9. Ding C, Zhang Y, Li T, Holbrook S. Biclustering protein complex interactions with a biclique finding algorithm. In: Sixth international conference on data mining, 2006. ICDM ’06; 2006. p. 178–87.
10. Atluri G, Bellay J, Pandey G, Myers C, Kumar V. Discovering coherent value bicliques in genetic interaction data. In: IW on data mining in bioinformatics (BIOKDD); 2010.
11. Bellay J, Atluri G, Sing TL, Toufighi K, Costanzo M, Ribeiro PSM, Pandey G, Baller J, VanderSluis B, Michaut M, Han S, Kim P, Brown GW, Andrews BJ, Boone C, Kumar V, Myers CL. Putting genetic interactions in context through a global modular decomposition. Genome Res. 2011;21(8):1375–87.
12. Mukhopadhyay A, Maulik U, Bandyopadhyay S. A novel biclustering approach to association rule mining for predicting HIV-1–human protein interactions. PLoS ONE. 2012;7(4):e32289.
13. MacPherson JI, Dickerson JE, Pinney JW, Robertson DL. Patterns of HIV-1 protein interaction identify perturbed host-cellular subsystems. PLoS Comput Biol. 2010;6(7):e1000863.
14. Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform. 2004;1:24–45.
15. Henriques R, Madeira S. BicPAM: Pattern-based biclustering for biomedical data analysis. Algorit Mol Biol. 2014;9:27.
16. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, Simonovic M, Roth A, Santos A, Tsafou KP, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2014;43:1003.
17. Xiong H, He XF, Ding C, Zhang Y, Kumar V, Holbrook SR. Identification of functional modules in protein complexes via hyperclique pattern discovery. In: Pacific Symposium on Biocomputing. 2005; p. 221–32.
18. Henriques R, Antunes C, Madeira SC. A structured view on pattern mining-based biclustering. Pattern Recognit. 2015;48(12):3941–58.
19. Koh JLY, Ding H, Costanzo M, Baryshnikova A, Toufighi K, Bader GD, Myers CL, Andrews BJ, Boone C. DRYGIN: a database of quantitative genetic interaction networks in yeast. Nucleic Acids Res. 2010;38(suppl 1):D502–7.
20. Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002;18:136–44.
21. Henriques R, Madeira S. Biclustering with flexible plaid models to unravel interactions between biological processes. IEEE/ACM Trans Comput Biol Bioinform. 2015. doi:10.1109/TCBB.2014.2388206.
22. Henriques R, Madeira S. BicSPAM: Flexible biclustering using sequential patterns. BMC Bioinform. 2014;15:130.
23. Okada Y, Fujibuchi W, Horton P. A biclustering method for gene expression module discovery using closed item set enumeration algorithm. IPSJ Trans Bioinform. 2007;48(SIG5):39–48.
24. Serin A, Vingron M. DeBi: Discovering differentially expressed biclusters using a frequent itemset approach. Algorit Mol Biol. 2011;6:1–12.
25. Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci. 2003;100(21):12123–8.
26. Berg J, Lässig M. Local graph alignment and motif search in biological networks. Proc Natl Acad Sci USA. 2004;101(41):14689–94.
27. Chen J, Yuan B. Detecting functional modules in the yeast protein–protein interaction network. Bioinformatics. 2006;18:2283–90.
28. Colak R. Towards finding the complete modulome: density constrained biclustering. PhD thesis, Simon Fraser University; 2008.
29. Pereira-Leal JB, Enright AJ, Ouzounis CA. Detection of functional modules from protein interaction networks. Proteins Struct Func Bioinform. 2004;54:49–57.
30. Bo V, Curtis T, Lysenko A, Saqi M, Swift S, Tucker A. Discovering Study-Specific Gene Regulatory Networks. PLoS ONE. 2014;9(9):e106524.
31. Mitra S, Das R, Banka H, Mukhopadhyay S. Gene interaction – an evolutionary biclustering approach. Informat Fusion. 2009;10(3):242–9 (Special Issue on Natural Computing Methods in Bioinformatics).
32. Das R, Mitra S, Banka H, Mukhopadhyay S. Evolutionary Biclustering with Correlation for Gene Interaction Networks. In: Ghosh A, De R, Pal S, editors. Pattern recognition and machine intelligence, vol. 4815, lecture notes in computer science. Berlin: Springer; 2007. p. 416–24.
33. Reiss DJ, Baliga NS, Bonneau R. Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks. BMC Bioinform. 2006;7:280.
34. Maulik U, Mukhopadhyay A, Bhattacharyya M, Kaderali L, Brors B, Bandyopadhyay S, Eils R. Mining quasi-bicliques from HIV-1–human protein interaction network: a multi-objective biclustering approach. IEEE/ACM Trans Comput Biol Bioinform. 2013;10(2):423–35.
35. Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. doi:10.1038/msb4100180.
36. Chowdhury SA, Koyuturk M. Identification of coordinately dysregulated subnetworks in complex phenotypes. In: Pacific symposium on biocomputing. World Scientific. 2010;15:133–44.
37. Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Muller T. Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics. 2008;24(13):i223–31.
38. Ideker T, Ozier O, Schwikowski B, Siegel AF. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 2002;18(suppl 1):S233–40.
39. Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Mol Syst Biol. 2007;3:88.
40. Tomaino V, Guzzi PH, Cannataro M, Veltri P. Experimental comparison of biclustering algorithms for PPI networks. In: Proceedings of the first ACM international conference on bioinformatics and computational biology, BCB ’10. New York: ACM; 2010. p. 671–76.
41. Henriques R, Madeira SC. BicNET: Efficient biclustering of biological networks to unravel non-trivial modules. In: Algorithms in bioinformatics (WABI), lecture notes in computer science. Berlin: Springer; 2015.
42. Henriques R, Madeira SC. Pattern-based biclustering with constraints for gene expression data analysis. In: Computational methods in bioinformatics and systems biology (EPIACMBSB), LNAI. Berlin: Springer; 2015.
43. Liu J, Wang W. OPCluster: clustering by tendency in high dimensional space. In: ICDM. Washington: IEEE Computer Society; 2003.
44. Henriques R, Antunes C, Madeira S. Methods for the efficient discovery of large item-indexable sequential patterns. LNAI 2014, 7765.
45. Agrawal R, Srikant R. Fast Algorithms for Mining Association Rules in Large Databases. In: VLDB. San Francisco: Morgan Kaufmann; 1994. p. 487–99.
46. Zaki MJ, Gouda K. Fast vertical mining using diffsets. New York: ACM; 2003. p. 326–35.
47. Henriques R, Madeira SC, Antunes C. F2G: Efficient discovery of full-patterns. In: ECML/PKDD IW on new frontiers to mine complex patterns, Prague, Czech Republic. Berlin: Springer; 2013.
48. Martinez R, Pasquier C, Pasquier N. GenMiner: mining informative association rules from genomic data. In: BIBM. Washington: IEEE CS; 2007. p. 15–22.
49. Chen D, Lai C, Hu W, Chen W, Zhang Y, Zheng W. Tree partition based parallel frequent pattern mining on shared memory systems. In: 20th International Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. IEEE. 2006; p. 1–8.
50. Han J, Cheng H, Xin D, Yan X. Frequent pattern mining: current status and future directions. Data Min Knowl Discov. 2007;15:55–86.
51. Javed A, Khokhar A. Frequent pattern mining on message passing multiprocessor systems. Distributed Parallel Databases. 2004;16(3):321–34.
52. Pei J, Han J. Can we push more constraints into frequent pattern mining? In: KDD. New York: ACM; 2000. p. 350–4.
53. Bonchi F, Lucchese C. Extending the state-of-the-art of constraint-based pattern discovery. Data Knowl Eng. 2007;2:377–99.
54. Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B. GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol. 2004;12:101.
55. Fang G, Kuang R, Pandey G, Steinbach M, Myers CL, Kumar V. Subspace differential coexpression analysis: problem definition and a general approach. In: Pacific symposium on biocomputing. Singapore: World Scientific Publishing; 2010. p. 145–56.
56. Odibat O, Reddy C. Efficient mining of discriminative co-clusters from gene expression data. Knowl Informat Syst. 2013;41(3):667–96.
57. Kirsch A, Mitzenmacher M, Pietracaprina A, Pucci G, Upfal E, Vandin F. An efficient rigorous approach for identifying statistically significant frequent itemsets. In: PODS. New York: ACM; 2009. p. 117–26.
58. DuMouchel W, Pregibon D. Empirical bayes screening for multi-item associations. In: KDD. New York: ACM; 2001. p. 67–76.
59. DuMouchel W. Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. Am Statist. 1999;53(3):177–90.
60. Ramesh G, Maniatty WA, Zaki MJ. Feasible itemset distributions in data mining: theory and application. In: Symposium on Princ. of data sys. New York: ACM Press; 2003.
61. Prelic A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006;22(9):1122–9.
62. Bozdag D, Kumar AS, Catalyurek UV. Comparative analysis of biclustering algorithms. In: BCB. New York: ACM; 2010.
63. Aggarwal CC, Reddy CK. Data clustering: algorithms and applications. Boca Raton: CRC Press; 2013.
64. Pavlopoulos GA, Hooper SD, Sifrim A, Schneider R, Aerts J. Medusa: a tool for exploring and clustering biological networks. BMC Res Notes. 2011;4:1–6.
65. Farkas I, Abel D, Palla G, Vicsek T. Weighted network modules. New J Phys. 2007;9(6):180.
66. Henriques R. Learning from high-dimensional data using local descriptive models. PhD thesis, Instituto Superior Tecnico, Lisboa: Universidade de Lisboa; 2016.
67. Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, Bijnens L, Gohlmann HWH, Shkedy Z, Clevert DA. FABIA: factor analysis for bicluster acquisition. Bioinformatics. 2010;26(12):1520–7.
68. Barkow S, Bleuler S, Prelic A, Zimmermann P, Zitzler E. BicAT: a biclustering analysis toolbox. Bioinformatics. 2006;10:1282–3.
69. Ihmels J, Bergmann S, Barkai N. Defining transcription modules using large-scale gene expression data. Bioinformatics. 2004;20(13):1993–2003.
70. Murali TM, Kasif S. Extracting conserved gene expression motifs from gene expression data. Pacific Symp Biocomput. 2003;8:77–88.
71. Cheng Y, Church GM. Biclustering of expression data. In: Intelligent systems for molecular biology. Menlo Park: AAAI Press; 2000. p. 93–103.
72. Ben-Dor A, Chor B, Karp R, Yakhini Z. Discovering local structure in gene expression data: the order-preserving submatrix problem. In: RECOMB. New York: ACM; 2002. p. 49–57.
73. Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics. 2009;10:48.
74. Teixeira M, Monteiro P, Guerreiro J, Goncalves J, Mira N, dos Santos S, Cabrito T, Palma M, Costa C, Francisco A, Madeira S, Oliveira A, Freitas A, Sa-Correia I. The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae. Nucleic Acids Res. 2014;42(Database issue):D161–6.
75. Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk D, Hirschman J, Hitz B, Karra K, Krieger C, Miyasato S, Nash R, Park J, Skrzypek M, Simison M, Weng S, Wong E. Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Res. 2012;40:D700–5.