A graph modification approach for finding core–periphery structures in protein interaction networks
 Sharon Bruckner^{1},
 Falk Hüffner^{2} and
 Christian Komusiewicz^{2}Email author
https://doi.org/10.1186/s1301501500437
© Bruckner et al.; licensee BioMed Central. 2015
Received: 23 December 2014
Accepted: 31 March 2015
Published: 2 May 2015
Abstract
The core–periphery model for protein interaction (PPI) networks assumes that protein complexes in these networks consist of a dense core and a possibly sparse periphery that is adjacent to vertices in the core of the complex. In this work, we aim at uncovering a global core–periphery structure for a given PPI network. We propose two exact graphtheoretic formulations for this task, which aim to fit the input network to a hypothetical ground truth network by a minimum number of edge modifications. In one model each cluster has its own periphery, and in the other the periphery is shared. We first analyze both models from a theoretical point of view, showing their NPhardness. Then, we devise efficient exact and heuristic algorithms for both models and finally perform an evaluation on subnetworks of the S. cerevisiae PPI network.
Keywords
Protein complexes Graph classes NPhard problemsBackground
A fundamental task in the analysis of PPI networks is the identification of protein complexes and functional modules. Herein, a basic assumption is that complexes in a PPI network are strongly connected among themselves and weakly connected to other complexes [1]. This assumption is usually too strict. To obtain a more realistic network model of protein complexes, several approaches incorporate the core–attachment model of protein complexes [2]. In this model, a complex is conjectured to consist of a stable core plus some attachment proteins, which have only transient interactions with the core. In graphtheoretic terms, the core thus is a dense subnetwork of the PPI network. The attachment (or: periphery) is less dense, but has edges to one or more cores.
Current methods employing this type of modeling are based on seed growing [35]. Here, an initial set of promising small subgraphs is chosen as cores. Then, each core is separately greedily expanded by adding vertices to its core or its attachment (in each step, a vertex maximizing some specific objective function is chosen). The aim of these approaches was to predict protein complexes [4,5] or to reveal biological features that are correlated with topological properties of core–periphery structures in networks [3]. In this work, we use core–periphery modeling in a different context. Instead of searching for local core–periphery structures, we attempt to unravel a global core–periphery structure in PPI networks.
To this end, we hypothesize that the true network consists of several core–periphery structures. We propose two precise models to describe this. In the first model, the core–periphery structures are disjoint. In the second model, the peripheries may interact with different cores, but the cores are disjoint. Then, we fit the input data to each formal model and evaluate the results on several PPI networks.
Our approach. In spirit, our approach is related to the cliquecorruption model of the CAST algorithm for gene expression data clustering [6]. In this model, the input is a similarity graph where edges between vertices indicate similarity. The hypothesis is that the objects corresponding to the vertices belong to disjoint biological groups of similar objects, the clusters. In the case of gene expression data, these are assumed to be groups of genes with the same function. Assuming perfect measurements, the similarity graph is a cluster graph.
Definition 1.
A graph G is a cluster graph if each connected component of G is a clique.
Because of stochastic measurement noise, the input graph is not a cluster graph. The task is to recover the underlying cluster graph from the input graph. Under the assumption that the errors are independent, the most likely cluster graph is one that disagrees with the input graph on a minimum number of edges. Such a graph can be found by applying a minimum number of edge modifications (that is, edge insertions or edge deletions) to the input graph. This paradigm directly leads to the optimization problem CLUSTER EDITING [79].
We now apply this approach to our hypothesis that there is a global core–periphery structure in the PPI networks. In both models detailed here, we assume that all proteins of each core interact with each other; this implies that each core is a clique. We also assume that the proteins in the periphery interact only with the cores but not with each other. Hence, the peripheries are independent sets.
In the first model, we assume that ideally the protein interactions give rise to vertexdisjoint core–periphery structures, that is, there are no interactions between different cores and no interactions between cores and peripheries of other cores. Then each connected component has at most one core which is a clique and at most one periphery which is an independent set. This is precisely the definition of a split graph.
Definition 2.
A graph G=(V,E) is a split graph if V can be partitioned into V _{1} and V _{2} such that G[ V _{1}] is an independent set and G[ V _{2}] is a clique.
SPLIT CLUSTER EDITING
Input: An undirected graph G=(V,E).
Task: Transform G into a split cluster graph by applying a minimum number of edge modifications.
In our second model, we allow the vertices in the periphery to be attached to an arbitrary number of cores, thereby connecting the cores. In this model, we thus assume that the cores are disjoint cliques and the vertices of the periphery are an independent set. Such graphs are called monopolar [11].
Definition 3.
A graph is monopolar if its vertex set can be twopartitioned into V _{1} and V _{2} such that G[ V _{1}] is an independent set and G[ V _{2}] is a cluster graph. The partition (V _{1},V _{2}) is called monopolar partition.
MONOPOLAR EDITING
Input: An undirected graph G=(V,E).
Task: Transform G into a monopolar graph by applying a minimum number of edge modifications and output a monopolar partition.
Further related work. In the following, we point to some related work in the literature that is not directly relevant for our algorithms and their evaluation but either considers models of core–periphery structure or optimization problems that are related to SPLIT CLUSTER EDITING or MONOPOLAR EDITING.
Della Rossa et al. [12] proposed to compute core–periphery profiles that assign to each vertex a numerical coreness value. The computation of these values is based on a heuristic randomwalk model. Their evaluation showed that the S. cerevisiae PPI network exhibits a clear core–periphery structure which significantly deviates from random networks with the same degree distribution. An adaption of the Markov Clustering algorithm MCL that incorporates the coreattachment model for protein complexes was presented by Srihari et al. [13].
The SPLIT EDITING problem is closely related to SPLIT CLUSTER EDITING as it asks to transform a graph into a (single) split graph by at most k edge modifications. SPLIT EDITING is, somewhat surprisingly, solvable in linear time [14]; in fact, the number of required modifications depends only on the degree sequence. Thus, split graphs are recognizable by their degree sequence. Another problem that is related to CLUSTER EDITING is COGRAPH EDITING which asks to destroy induced P _{4}’s by modifying at most k edges [15]. COGRAPH EDITING has applications in the computation of phylogenies [16]. In a cograph, every connected component has diameter at most two; in split cluster graphs every connected component has diameter at most three.
Finally, a further approach of fitting PPI networks to specific graph classes was proposed by Zotenko et al. [17] who find for a given PPI network a close chordal graph, that is, a graph without induced cycles of length four or more. The modification operation is insertion of edges. One notable difference is that the algorithm may be unable to construct a chordal graph from the input network [17].
Preliminaries. We consider undirected simple graphs G=(V,E) where n:=V denotes the number of vertices and m:=E denotes the number of edges. The open neighborhood of a vertex u is defined as N(u):={v∣{u,v}∈E}. We denote the neighborhood of a set U by \(N(U):= \bigcup _{u\in U} N(u)\setminus U\). The subgraph induced by a vertex set S is defined as G[S]:=(S,{{u,v}∈E∣u,v∈S}).
One approach to solving NPhard problems is based on the concept of fixedparameter tractability [18,19]. Herein, instances I of a problem come along with a parameter k, for example the size of a solution. The aim is to obtain a fixedparameter algorithm, that is, an algorithm with running time f(k)·n ^{ O(1)} where f depends only on k. Such an algorithm is efficient if k is small and f does not grow too rapidly.
The exponentialtime hypothesis (ETH) states that there is a constant c>1 such that 3SAT cannot be solved in (c−ε)^{ n } time for any ε>0 [20]. Assuming the ETH, tight running time lower bounds can be shown; a survey on ETHbased running time lower bounds is given by Lokshtanov et al. [21]. If it is known that a parameterized problem L does not admit a 2^{ o(k)}·n ^{ O(1)}time algorithm (assuming the ETH), then a polynomialtime reduction from this problem to a problem L ^{′} with parameter k ^{′}=O(k) implies that L ^{′} cannot be solved in \(\phantom {\dot {i}\!}2^{o(k^{\prime })}\cdot n^{O(1)}\) time (assuming the ETH). Note that the parameter in this case may also be the number of vertices or the number of edges of a graph.
Combinatorial properties and complexity
Before presenting concrete algorithmic approaches for the two optimization problems, we show some properties of split cluster graphs and monopolar graphs which will be useful in the various algorithms. Furthermore, we present computational hardness results for the problems which will justify the use of integer linear programming (ILP) and heuristic approaches.
Split cluster editing
Lemma 1 ([22]).
A graph G is a split graph if and only if G does not contain an induced subgraph that is a pair of disjoint edges or a cycle of four or five edges, that is, G is (2K _{2},C _{4},C _{5})free.
To obtain a characterization for split cluster graphs, we need to characterize the existence of 2K _{2}’s within connected components. The following lemma will be useful for this purpose.
Lemma 2.
If a connected graph contains a 2K _{2} as induced subgraph, then it contains a 2K _{2}=(V ^{′},E ^{′}) such that there is a vertex v∉V ^{′} that is adjacent to at least one vertex of each K _{2} of (V ^{′},E ^{′}).
Proof.
Let G contain the 2K _{2}{x _{1},x _{2}},{y _{1},y _{2}} as induced subgraph. Without loss of generality, let the shortest path between any x _{ i },y _{ j } be P=(x _{1}=p _{1},p _{2},…,p _{ k }=y _{1}). Clearly, k>2. If k=3, then x _{1} and y _{1} are both adjacent to p _{2}. Otherwise, if k=4, then {x _{2},x _{1}=p _{1}},{p _{3},p _{4}=y _{1}} is a 2K _{2} and x _{1} and p _{3} are both adjacent to p _{2}. Finally, if k>4, then P contains a P _{5} as induced subgraph. The four outer vertices of this P _{5} induce a 2K _{2} whose K _{2}’s each contain a neighbor of the middle vertex.
We can now provide a characterization of split cluster graphs (see Figure 2).
Theorem 1.
A graph G is a split cluster graph if and only if G is a (C _{4},C _{5},P _{5},necktie,bowtie)free graph.
Proof.
Let G be a split cluster graph, that is, every connected component is a split graph. Clearly, G does not contain a C _{4} or C _{5}. If a connected component of G contains a P _{5}, then omitting the middle vertex of the P _{5} yields a 2K _{2}, which contradicts the assumption that the connected component is a split graph. The same argument shows that the graph cannot contain a necktie or bowtie.
Conversely, let G be (C _{4},C _{5},P _{5},necktie,bowtie)free. Clearly, no connected component contains a C _{4} or C _{5}. Assume towards a contradiction that a connected component contains a 2K _{2} consisting of the K _{2}’s {a,b} and {c,d}. Then according to Lemma 2 there is a vertex v which is, without loss of generality, adjacent to a and c. If no other edges between the 2K _{2} and v exist, then {a,b,v,c,d} is a P _{5}. Adding exactly one of {b,v} or {d,v} creates a necktie, and adding both edges results in a bowtie. No other edges are possible, since there are no edges between {a,b} and {c,d}.
This leads to a lineartime algorithm for checking whether a graph is a split cluster graph.
Theorem 2.
There is an algorithm that determines in O(n+m) time whether a graph is a split cluster graph and outputs a forbidden induced subgraph if this is not the case.
Proof.
For each connected component, we run an algorithm by Heggernes and Kratsch [23] that checks in linear time whether a graph is a split graph, and if not, produces a 2K _{2}, C _{4}, or C _{5}. If the forbidden subgraph is a C _{4} or C _{5}, we are done. If it is a 2K _{2}, then we find, using the method described in the proof of Lemma 2, in linear time an induced 2K _{2} such that there is a vertex v that is adjacent to at least one vertex in each K _{2}. The subgraph induced by this 2K _{2} plus v is either a P _{5}, necktie, or bowtie, as shown in the proof of Theorem 1.
In contrast, SPLIT CLUSTER EDITING is NPhard even in restricted cases. Before proving the hardness, we make the following observation that follows from a simple local improvement argument. It will be used in our hardness proof and also in our algorithms.
Observation 1.

every degreeone vertex whose neighbor has degree at least two is a periphery vertex, and

no inserted edge is incident with a periphery vertex.
Theorem 3.
SPLIT CLUSTER EDITING is NPhard even on graphs with maximum degree 10. Further, it cannot be solved in 2^{ o(k)}·n ^{ O(1)} or 2^{ o(n)}·n ^{ O(1)} time if the exponentialtime hypothesis (ETH) is true.
Proof.
Input: An undirected graph G=(V,E) and an integer k.
Question: Can G be transformed into a cluster graph by applying at most k edge modifications?
CLUSTER EDITING is NPhard [24], even if the maximum degree of the input graph is five [25] and it cannot be solved in 2^{ o(k)}·n ^{ O(1)} time assuming ETH [25,26].
The reduction works as follows; we assume that the original instance does not contain isolated vertices. Given an instance (G,k) of CLUSTER EDITING, build a graph G ^{′}=(V ^{′},E ^{′}) that has the same vertices and edges as G and degG(v) additional degreeone vertices attached to each v∈V.
We show that G can be transformed by at most k edge modifications into a cluster graph if and only if G ^{′} has a split cluster editing set of size at most k. First, if a set S of at most k edge modifications transforms G into a cluster graph \(\tilde {G}\), then performing the same modifications on G ^{′} transforms G ^{′} into a split cluster graph \(\tilde {G'}\): Each connected component of \(\tilde {G'}\) contains a clique K of \(\tilde {G}\) plus degG(v) degreeone vertices adjacent to each v∈K. The set of these degreeone vertices is an independent set.
For the other direction, we show that there is a minimumcardinality edge modification set S ^{′} that transforms G ^{′} into a split cluster graph \(\tilde {G}'\), such that performing S ^{′} on G transforms G into a cluster graph. By Observation 1 and the fact that each vertex in G has degree at least one, we can assume that every vertex in V ^{′}∖V is a periphery vertex in \(\tilde {G}'\). Consider some vertex v∈V. If v is a periphery vertex in \(\tilde {G}'\), then all degG(v) edges between v and V ^{′}∖V are deleted (there are no edges between periphery vertices). Then, however, a solution with the same cost is to delete all degG(v) edges between v and V instead. This solution makes v a core vertex with neighbors in V ^{′} only. Hence, we can assume that S ^{′} makes every vertex in V a core vertex. Since \(\tilde {G'}\) is a split cluster graph, each core is a clique and different cores are disjoint. Hence, S ^{′} transforms G into a cluster graph.
This shows the correctness of the reduction. The hardness results follow from the previous hardness results and the fact that the solution size remains the same and that the maximum degree of the constructed graph G ^{′} is exactly twice the maximum degree of G.
This hardness result motivates the study of algorithmic approaches such as fixedparameter algorithms or ILP formulations. For example, SPLIT CLUSTER EDITING is fixedparameter tractable for the parameter number of edge modifications k by the following search tree algorithm: Check whether the graph contains a forbidden subgraph. If this is the case, branch into the possibilities to destroy this subgraph. In each recursive branch, the number of allowed edge modifications decreases by one. Furthermore, since the largest forbidden subgraph has five vertices, at most ten possibilities for edge insertions or deletions have to be considered to destroy a forbidden subgraph. By Theorem 2, forbidden subgraphs can be found in O(n+m) time. Altogether, this implies the following.
Theorem 4.
SPLIT CLUSTER EDITING can be solved in O(10^{ k }·(n+m)) time.
This result is purely of theoretical interest. With further improvements of the search tree algorithm, practical running times might be achievable.
For example, one could focus on improving the base of the exponential factor by a more elaborate case distinction, either designed manually (e. g. [27]) or automatically [28]. Another approach could be to study parameterized data reduction known as kernelization [18,19].
Monopolar graphs
The class of monopolar graphs is hereditary, and thus it is characterized by forbidden induced subgraphs. The set of minimal forbidden induced subgraphs, however, is infinite [29]; for example among graphs with five or fewer vertices, only the wheel W _{4} is forbidden, but there are 11 minimal forbidden subgraphs with six vertices. In contrast to the recognition of split cluster graphs, which is possible in linear time by Theorem 2, deciding whether a graph is monopolar is NPhard [30]. Algorithmic research is focused on the recognition problem for special graph classes. A fairly general such approach uses a 2SAT formulation [31,32]. Thus MONOPOLAR EDITING is NPhard already for k=0 edge modifications. As a consequence, it is not fixedparameter tractable with respect to the number of edge modifications k unless P=NP (in contrast to SPLIT CLUSTER EDITING).
Solution approaches
To evaluate our model, it is helpful to obtain optimal solutions to eliminate or at least estimate the systematic bias that might be introduced by heuristics. We use an integer linear programming (ILP) formulation for this. Since it is not able to solve the hardest instances, we also present a heuristic based on simulated annealing.
Forbidden subgraph ILP
where \(\mathcal F\) is the set of forbidden induced subgraphs on V. A constraint of type (2) forces that at least one edge differs from the forbidden subgraph. Since an nvertex graph may contain Ω(n ^{5}) forbidden subgraphs, in practice we use row generation (lazy constraints) and add in a callback only the constraints that are violated; by Theorem 2, we can find a violated constraint in linear time.
A factor \(\frac {1}{2}\) is permissible for edits that require at least one more edit; for example inserting {u,w} produces a necktie. The summand e _{ uy } is omitted, since this insertion produces a C _{5}, which needs at least two more edits. Similar strengthenings are possible for neckties and bowties.
Partition variable ILP
Since monopolar graphs have infinitely many forbidden subgraphs, which are NPhard to find, the forbidden subgraph ILP formulation is not feasible for MONOPOLAR EDITING. We show an alternative formulation based on the observation that if we correctly guess the partition into core and independent set vertices, we can get a simpler forbidden subgraph characterization for both split cluster graphs and monopolar graphs.
Lemma 3.
Let G=(V,E) be a graph and \(C \dot {\cup } I = V\) a partition of the vertices. Then G is a split cluster graph with core vertices C and periphery vertices I if and only if G does not contain an edge with both endpoints in I, nor an induced P _{3} with both endpoints in C.
Proof.
“ ⇒”: We show the contraposition. Thus assume that there is an edge with both endpoints in I or an induced P _{3} with both endpoints in C. Then I is not an independent set or C does not form a clique in each connected component, respectively.
“ ⇐”: We again show the contraposition. If G is not a split cluster graph with core vertices C and periphery vertices I, then it must contain an edge with both endpoints in I, or C∩H does not induce a clique for some connected component H of G. In the first case we are done; in the second case, there are two vertices u,v∈C in the same connected component with {u,v}∉E. Consider a shortest path (u=p _{1},…,p _{ l }=v) from u to v. If it contains a periphery vertex p _{ i }∈I, then p _{ i−1},p _{ i },p _{ i+1} forms a forbidden subgraph. Otherwise, p _{1},p _{2},p _{3} is one.
For annotated monopolar graphs, the situation is even simpler. By Definition 3, the twopartition into C and I exactly demands that I is an independent set and G[C] is a cluster graph or, equivalently, P _{3}free.
Lemma 4.
Let G=(V,E) be a graph and \(C \dot {\cup } I = V\) a partition of the vertices. Then G is a monopolar graph with core vertices C and periphery vertices I if and only if it does not contain an edge with both endpoints in I, nor an induced P _{3} whose vertices are contained in C.
Proof.
“ ⇒”: This follows directly from Definition 3.
“ ⇐”: If G is not monopolar with core vertices C and periphery vertices I, then it must contain an edge with both endpoints in I, or G[C] is not a cluster graph. In the first case we are done; in the second case, there is a P _{3} with all vertices in C, since that is the forbidden induced subgraph for cluster graphs.
which forces that the graph induced by the core vertices is a cluster graph.
Data reduction
Data reduction (preprocessing) proved very effective for solving CLUSTER EDITING optimally [8,9]. Indeed, any instance can be reduced to one of at most 2k vertices [33,34], where k is the number of edge modifications. Unfortunately, the data reduction rules we devised for SPLIT CLUSTER EDITING were not applicable to our realworld test instances. However, Observation 1 allows us to fix the values of some variables of Constraints (4) to (6) in the partition variable ILP for SPLIT CLUSTER EDITING: if a vertex u has only one vertex v as neighbor and deg(v)>1, then set c _{ u }=0 and e _{ uw }=0 for all w≠v. Since our instances have many degreeone vertices, this considerably reduces the size of the ILPs.
Heuristics
The integer linear programming approach is not able to solve the hardest of our instances. Thus, we employ the wellknown simulated annealing heuristic. This is a local search method, where we try a random modification of our current solution, and accept it if it improves the objective; but to escape local minima, we also accept it with a small probability if it makes the objective worse. More precisely, a change in the objective of Δ is accepted with probability exp(−Δ/T), where the factor T is reduced over the course of the algorithm down to zero, such that the algorithm initially explores a larger part of the search space, but eventually settles in a local minimum. We restart the simulated annealing algorithm, where each repetition has a fixed number of steps.
For SPLIT CLUSTER EDITING, we start with a clustering where each vertex is a singleton. As random modification, we move a vertex to a cluster that contains one of its neighbors. Since this allows only a decrease in the number of clusters, we also allow moving a vertex into an empty cluster. For a fixed clustering, the optimal number of modifications can be computed in linear time by counting the edges between clusters and computing for each cluster a solution for SPLIT EDITING in linear time [14]. For MONOPOLAR EDITING, we additionally have a set representing the shared periphery. Accordingly, we allow moving a vertex into another cluster or into the independent set. Here, the optimal number of modifications for a fixed clustering can also be calculated in linear time: all edges in the independent set are deleted, all edges between clusters are deleted, and all missing edges within clusters are added.
Experimental results

The algorithm by Luo et al. [3] (“Luo” for short) produces clusters with core and periphery, like SCE, but the clusters may overlap and might not cover the whole graph. Luo produces two types of core–periphery structures, those with a dense core, called kplex core, and those with a star core. In the comparison, we consider only the structures with kplex cores, since this model is closer to our models. For periphery, we consider only neighbors of the core (called 1periphery by Luo et al. [3]) and not vertices with distance two to the core (called 2periphery).

The SCAN algorithm [35], like ME, partitions the graph vertices into “clusters”, which we interpret as cores, and “hubs” and “outliers”, which we interpret as periphery. SCAN is run with several parameter combinations, obtaining different results. For consistency, we select the results where the clusters have the highest modularity, as reported by the SCAN program itself.

In addition, we compare the solutions of SCE and ME with optimal solutions of Cluster Editing (CE) (see Section ‘Split cluster editing’ for a formal problem definition). The result of such a solution is a cluster graph and the size1 clusters of this cluster graph are an independent set. Accordingly, we interpret the size1 clusters as periphery. We solve CE by a simple ILP with row generation, using the characterization by the forbidden subgraph P _{3}.
Experimental setup
Implementation details. The ILPs and the simulated annealing heuristic were implemented in C++ and compiled with the GNU g++ 4.7.2 compiler. As ILP solver, we used CPLEX 12.6.0. For both formulations, we use the heuristic solution found after 10 rounds as MIP start. For the forbidden subgraph formulation (Section ‘Forbidden subgraph ILP’), in a lazy constraint callback, we find a forbidden subgraph using Theorem 2 and add the corresponding inequality of type (2) to the model. We then delete one of its vertices and try to find another forbidden subgraph, adding up to n inequalities per callback.
For the partition variable formulation (Section ‘Partition variable ILP’), we initially add all independent set constraints (5) and those P _{3} constraints ((6), (7)) for which the vertices u,v,w induce a P _{3} in the input graph. In a lazy constraint callback, we add violated P _{3} constraints (usually only a few are needed). These constraints are also used as cutting planes, that is, we already add them in a cutting plane callback when they are violated by the fractional solution. In addition, we use the forbidden subgraphs C _{4} and P _{5} for SCE and the forbidden subgraph W _{4} for ME as cutting planes (Eq. 2). In the cutting plane callbacks, we add the 500 inequalities which are violated the most, if the violation is at least 0.3 (these parameters were heuristically determined).
In the simulated annealing heuristic, we use 20,000 steps and an initial T _{0}=1, and restart the procedure 100 times.
The test machine is a 4core 3.6 GHz Intel Xeon E51620 (Sandy BridgeE) with 10 MB L3 cache and 64 GB main memory, running Debian GNU/Linux 7.0. CPLEX was allowed to use up to 8 threads, and we report wall clock times.
Data. For comparison of the algorithms, we first use random graphs, where each possible edge is present with probability p, to examine variability of running times and limits of feasibility. For more realistic data, we generate subnetworks of the S. cerevisiae (yeast) protein interaction network from BioGRID [36]. Our networks contain only physical interactions. For each Gene Ontology (GO) term in the annotations of the Saccharomyces Genome Database (SGD) [37], we extract the subnetwork induced by only those proteins that are annotated with this term. We omit networks with fewer than 30 vertices (these can all be solved in less than one second). This yields 178 graphs with up to 2198 vertices, with a median of 66 vertices and 226 edges.
Input properties of the process networks
n  m  n _{ lcc }  m _{ lcc }  C  p  A _{ C }  i _{ g }  

Cell cycle  196  797  192  795  7  148  6.3  1151 
Transcription  215  786  198  776  11  54  7.5  1479 
Translation  188  2352  186  2351  5  88  27.4  174 
Biological evaluation. We evaluate our results using the following measures. First, we examine the coherence of the GO terms in our modules using the semantic similarity score calculated by GSESAME [38]. We use this score to test the hypothesis that the cores are more stable than the peripheries. If the hypothesis is true, then the GO terms within a core should be more similar than the GO terms in the periphery. Hence, the pairwise similarity score within the core should be higher than in the periphery. We test only terms relating to process, not function, since proteins in the same complex play a role in the same biological process. Since ME, SCAN, and CE return multiple cores and only a single periphery, we assign to each cluster C its neighborhood N(C) as periphery. We consider only clusters with at least two core vertices and one periphery vertex.
Next, we compare the resulting clusters with known protein complexes from the CYC2008 database [39]. Since the networks we analyze are subnetworks of the larger yeast network, we discard for each network the CYC2008 complexes that have less than 50% of their vertices in the current subnetwork, restrict them to proteins contained in the current subnetwork, and then discard those with fewer than three proteins. We test the overlap between the algorithm results and these complexes, treating the complexes as the “ground truth”. We expect that the cores mostly correspond to complexes and that the periphery may contain complex vertices plus further vertices.
Results
Random networks
The heuristic optimally solves SCE for all instances with known optimal solution; for ME, it is off by one for five instances.
PPI subnetworks
For ME, we note that instances can be solved slightly quicker in general, consistent with the observations on sparse random networks. Using W _{4} (the smallest forbidden subgraph for monopolar graphs) as a cutting plane also helps little, solving one more instance, but might be useful for difficult instances with long running time.
The heuristic for SCE finds an optimal solution for all 126 instances for which the optimal solution size is known. The ME heuristic optimally solves 104 of 129 instances for which the optimal solution size is known. The average error is very small (0.61), but for one instance the heuristic produces a solution size too high by 27. Possibly the independent set, which interacts with all clusters, makes local search approaches less effective here compared to SCE.
Process networks
Solution statistics and average GO term coherence for the process networks
Cellcycle  Transcription  Translation  

k  K  \(\bar c\)  \(\bar p\)  c _{ t }  c _{ c }  c _{ p }  k  K  \(\bar c\)  \(\bar p\)  c _{ t }  c _{ c }  c _{ p }  k  K  \(\bar c\)  \(\bar p\)  c _{ t }  c _{ c }  c _{ p }  
SCE  321  14  5  7  0.60  0.64  0.40  273  14  6  6  0.54  0.56  0.57  308  6  13  14  0.70  0.73  0.69 
ME  126  24  3  9  0.45  0.57  0.40  106  26  3  7  0.50  0.60  0.54  240  11  8  12  0.59  0.61  0.54 
SCAN  —  28  5  4  0.41  0.62  0.34  —  29  4  3  0.48  0.59  0.47  —  5  30  4  0.66  0.66  0.76 
Luo  —  16  9  63  0.34  0.50  0.31  —  12  8  41  0.40  0.52  0.38  —  4  24  24  0.72  0.84  0.67 
CE  461  28  4  1  0.51  0.51  0.38  392  28  4  1  0.56  0.57  0.68  937  10  11  4  0.71  0.73  0.71 
Complex detection statistics for the process networks
Cellcycle  Transcription  Translation  

D  core _{ % }  comp _{ % }  extra _{ % }  D  core _{ % }  comp _{ % }  extra _{ % }  D  core _{ % }  comp _{ % }  extra _{ % }  
SCE  4  93/100  100/100  71/ 67  8  87/ 88  100/100  73/ 85  4  89/100  96/ 96  50/ 50 
ME  5  89/100  99/100  90/ 89  11  89/100  93/100  76/ 79  4  89/100  95/ 96  55/ 64 
SCAN  4  87/ 91  98/100  100/100  8  81/ 84  100/100  96/100  0  —/—  —/—  —/— 
Luo  5  76/ 81  100/100  100/100  6  89/ 87  100/100  100/100  4  84/ 92  96/ 96  72/ 73 
CE  5  80/ 87  94/100  40/ 0  9  87/ 91  94/100  89/100  4  84/ 92  90/ 94  55/ 60 
Running times and objective function.
For SCE, the ILP approach failed to solve the cell cycle and transcription network, and for ME, it failed to solve the transcription network, with CPLEX running out of memory in each case. Thus, consistent with the previous types of instances, the theoretically harder ME problem was easier to solve in practice. This could be explained by the fact that the number k of necessary modifications is much lower, which could reduce the size of the branchandbound tree. For two of the three optimally solved instances, the heuristic finds the optimal solution within one minute. For the third instance (ME transcription) it finds the optimal solution only after several hours; after one minute, it is 2.9% too large. This indicates the heuristic gives good results, and in the following, we use the heuristic solution for the three instances not solvable by ILP. From experiments with other networks, we conjecture that the heuristic SCE solutions are optimal; we are less sure about the heuristic solutions for ME.
As for the PPI subnetworks, the SCAN algorithm runs very fast, finishing within seconds on all three networks; the LUO algorithm is considerably slower as it needs several minutes on the translation network. CE is again slower than LUO but still considerably faster than SCE and ME.
Cluster statistics and GO term coherence. Table 2 gives an overview of the number and average sizes of the output clusters and of their average GO term coherence in core and periphery. We say that a cluster is nontrivial if it has at least three vertices and at least two core vertices. We describe the results for the cell cycle network in more detail since the results here are the most representative of the three networks. Then, we summarize our findings for the transcription and translation network.
The SCE solution identifies 14 nontrivial clusters; all other clusters are singletons. Only for one of the 14 nontrivial clusters, the GO term coherence is lower in the core than in the periphery (for two clusters the scoring tool does not return a result, four clusters have empty peripheries). This is in line with the hypothesis that cores have higher GO term coherence than peripheries.
The ME result contains more nontrivial clusters than SCE (24). Compared to SCE, clusters have on average about the same size, but a slightly smaller core and a slightly larger periphery (recall that a periphery vertex may occur in more than one cluster). The average coherence in the cores is 0.58, lower than for SCE (0.64), this might be due to the fact that the cores are smaller for ME. On average, coherence in the periphery is much lower than in the cores, but for six clusters it is higher than in the core.
SCAN identifies 7 hubs and 41 outliers, which then comprise the periphery. There are even more nontrivial clusters than for ME. Clusters are smaller than for SCE or ME, in particular the periphery has on average only 4.4 vertices as opposed to 7.3 for SCE or 9.8 for ME. Coherence on cores is similar to SCE and ME, and also lower for the periphery.
LUO outputs only large clusters (this is true for all subnetworks we tested). For the cell cycle network, 16 clusters are identified, each having at least 5 proteins in the cores, and 3 in the periphery, and the largest having 15 proteins in the core and 126 in the periphery (for SCE, one cluster has 10 proteins in the core and 56 in the periphery, and all other clusters for the three other methods have cores of at most 16 and peripheries of at most 30 vertices). The cores have much lower coherence on average than the other methods, but again coherence in the periphery is even lower.
CE outputs many nontrivial clusters, on average the cores and periphery are smaller than for SCE and ME. The average coherence is lower than for SCE and ME, but again the average coherence is higher in the core than in the periphery.
We now describe the results for the transcription network. Again, ME outputs the smallest cores, followed by SCAN and CE. LUO again finds the largest cores and also the largest peripheries. Concerning GO term analysis, we see a similar pattern here that LUO has worse coherence. The average core coherence is the highest for ME and, unlike CE and SCE, the average coherence is higher in the cores than in the periphery for ME.
In the translation network, ME outputs the most nontrivial clusters, followed by CE and SCE. SCAN and LUO output the fewest nontrivial clusters (5 and 4, respectively). LUO has the best coherence values here. The average coherence is higher for CE than for ME but the difference between the average core and periphery coherence is less pronounced in CE than in ME.
Complex detection. Table 3 gives an overview of the number of detected complexes. Again, we describe the results for the cell cycle network in more detail and then summarize our findings for the transcription and translation network.
Following our hypothesis, we say that a complex is detected by a cluster if at least 50% of the core belongs to the complex and at least 50% of the complex belongs to the cluster. Out of the seven complexes, three are detected without any error (anaphasepromoting, DASH, and Far3p/Far7p/Far8p/Far9p/Far10p/Far11p complex), and one (Mcm27) is detected with an error of two additional proteins in the core that are not in the complex. The periphery contains between one and eight extra proteins that are not in the complex (which is allowed by our hypothesis).
ME detects the same complexes as SCE, and additionally the mitotic checkpoint complex. For the anaphasepromoting complex, it misses one protein; all other complexes are detected without error.
SCAN detects almost the same complexes as ME (it misses the Mcm27 complex). It also has slightly more errors, for example having three extra protein in the core for the anaphasepromoting complex plus one missing. LUO detects the same complexes as ME without missing any complex proteins but it also finds more extra vertices in the cores. CE detects the same clusters as ME with a slightly higher number of missed complex proteins and extra core proteins.
In the transcription network, the ME method comes out a clear winner: it detects all 11 complexes and has fewer errors than the other methods. CE detects more complexes than SCAN and SCE; LUO detects only 6 complexes for this network.
In the translation network, SCE, ME, LUO, and CE detect the same four complexes. The SCAN algorithm does not seem to deal well with this network, since it does not detect any complex. LUO finds only four nontrivial clusters, corresponding to the four complexes also detected by SCE and ME; this might also explain why it has the best coherence values here.
Conclusions
Experiment evaluation
The coherence values for cores and peripheries indicate that a division of clusters into core and periphery makes sense. Under the assumption that cores should be more coherent than peripheries, ME and LUO do best with respect to separating cores from periphery.
In detecting complexes, the ME method does best (20 detected), followed by CE (18), followed by SCE and LUO (15 each), and finally SCAN (12). This indicates that the model that peripheries are shared is superior (note that in CE the size1 clusters are also a shared periphery). One advantage of ME compared to CE is that the cores are smaller and thus contain fewer extra proteins which are not in the complex. Note that when comparing the number of detected complexes, then SCE is at a disadvantage, since it can use each protein as periphery only once, while having large peripheries makes it easier to count a complex as detected. One approach here could be to consider clusters of size one as shared periphery (as we did for CE). The graph modificationbased methods showed a more consistent behavior across the three test networks than LUO (which performs not so well on the transcription network) and SCAN (which performs not so well on the translation network).
A further notable difference between the algorithms is that LUO outputs much larger peripheries for each cluster. Thus, the peripheries of the detected complexes contain many proteins which are not known to be in the complex (by our initial hypothesis, these extra proteins are not necessarily errors). The other four methods are much more conservative in this regard.
Outlook
Concerning the theoretical analysis of SPLIT CLUSTER EDITING the following questions are open: Is SPLIT CLUSTER EDITING amenable to parameterized data reduction? That is, does SPLIT CLUSTER EDITING admit a polynomialtime reduction to a polynomialsize problem kernel (see [18] for a definition of problem kernel)? Does SPLIT CLUSTER EDITING admit a constantfactor approximation? It would be also interesting to study the SPLIT CLUSTER DELETION problem in which only edge deletions are allowed to transform the input graph into a split cluster graph. This variant is also NPhard by a reduction that is similar to the one presented for SPLIT CLUSTER EDITING.
For MONOPOLAR EDITING it would be interesting to obtain any tractability results, for example by considering combinations of parameters. A first step here could be to study the problem of recognizing monopolar graphs more closely.
There are many further variants of our models that could possibly yield better biological results or have algorithmic advantages. For instance, one could restrict the cores to have a certain minimum size. Also, instead of using split graphs as a core–periphery model, one could resort to dense split graphs [10] in which every periphery vertex is adjacent to all core vertices. Finally, one could allow some limited amount of interaction between periphery vertices.
Further evaluation of the biological properties of the computed core–periphery structures seems also worthwhile. For example, it would be interesting to examine the peripheries more closely in order to determine whether SPLIT CLUSTER EDITING and MONOPOLAR EDITING are too conservative when determining the periphery of a cluster. Finally, one could explore the biological properties of those clusters that were identified by SPLIT CLUSTER EDITING or MONOPOLAR EDITING but that do not correspond to known protein complexes from the CYC2008 database (all output clusters are listed in the Additional file 1: Supplemental material).
Endnote
^{a} To determine the protein subsets corresponding to each process, we queried BioMart [40] for all yeast genes annotated with the relevant GO terms: GO:0007049 (cell cycle), GO:0006412 (translation), and GO:0006351 (DNAtemplated transcription). Note that this gives somewhat different results than using the SGD GO annotations.
Declarations
Acknowledgments
An extended abstract of this article appeared in the Proceedings of the 14th Workshop on Algorithms in Bioinformatics (WABI ’14), volume 8701 of LNCS, Springer, pages 340–351.
Authors’ Affiliations
References
 Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. PNAS. 2003; 100(21):12123–8.View ArticlePubMed CentralPubMedGoogle Scholar
 Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006; 440(7084):631–6.View ArticlePubMedGoogle Scholar
 Luo F, Li B, Wan XF, Scheuermann R. Core and periphery structures in protein interaction networks. BMC Bioinformatics. 2009; 10(Suppl 4):8.View ArticleGoogle Scholar
 Leung HC, Xiang Q, Yiu SM, Chin FY. Predicting protein complexes from PPI data: a coreattachment approach. J Comput Biol. 2009; 16(2):133–44.View ArticlePubMedGoogle Scholar
 Wu M, Li X, Kwoh CK, Ng SK. A coreattachment based method to detect protein complexes in PPI networks. BMC Bioinformatics. 2009; 10(1):169.View ArticlePubMed CentralPubMedGoogle Scholar
 BenDor A, Shamir R, Yakhini Z. Clustering gene expression patterns. J Comput Biol. 1999; 6(34):281–97.View ArticlePubMedGoogle Scholar
 Shamir R, Sharan R, Tsur D. Cluster graph modification problems. Discrete Appl Math. 2004; 144(1–2):173–82.View ArticleGoogle Scholar
 Böcker S, Briesemeister S, Klau GW. Exact algorithms for cluster editing: Evaluation and experiments. Algorithmica. 2011; 60(2):316–34.View ArticleGoogle Scholar
 Böcker S, Baumbach J. Cluster editing. In: Proceedings of the 9th conference on computability in Europe (CiE ’13). LNCS. Berlin, Heidelberg: Springer: 2013. p. 33–44.Google Scholar
 Borgatti SP, Everett MG. Models of core/periphery structures. Soc Netw. 1999; 21(4):375–95.View ArticleGoogle Scholar
 Chernyak ZA, Chernyak AA. About recognizing (α,β) classes of polar graphs. Discrete Math. 1986; 62(2):133–8.View ArticleGoogle Scholar
 Della Rossa F, Dercole F, Piccardi C. Profiling coreperiphery network structure by random walkers. Sci Rep. 2013. Article no. 3.Google Scholar
 Srihari S, Ning K, Leong H. MCLCAw: a refinement of MCL for detecting yeast complexes from weighted PPI networks by incorporating coreattachment structure. BMC Bioinformatics. 2010; 11:504.View ArticlePubMed CentralPubMedGoogle Scholar
 Hammer PL, Simeone B. The splittance of a graph. Combinatorica. 1981; 1(3):275–84.View ArticleGoogle Scholar
 Liu Y, Wang J, Guo J, Chen J. Complexity and parameterized algorithms for cograph editing. Theor Comput Sci. 2012; 461:45–54.View ArticleGoogle Scholar
 Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. PNAS. 2015; 112(7):2058–63.View ArticlePubMedGoogle Scholar
 Zotenko E, Guimarães KS, Jothi R, Przytycka TM. Decomposition of overlapping protein complexes: a graph theoretical method for analyzing static and dynamic protein associations. Algorithms Mol Biol. 2006; 1(7).Google Scholar
 Downey RG, Fellows MR. Fundamentals of Parameterized Complexity. Texts in Computer Sci. Berlin, Heidelberg: Springer; 2013.View ArticleGoogle Scholar
 Niedermeier R. Invitation to fixedparameter algorithms. Oxford: Oxford University Press; 2006.View ArticleGoogle Scholar
 Impagliazzo R, Paturi R, Zane F. Which problems have strongly exponential complexity?J Comput Syst Sci. 2001; 63(4):512–30.View ArticleGoogle Scholar
 Lokshtanov D, Marx D, Saurabh S. Lower bounds based on the exponential time hypothesis. Bull EATCS. 2011; 105:41–72.Google Scholar
 Foldes S, Hammer PL. Split graphs. Congressus Numerantium. 1977; 19:311–5.Google Scholar
 Heggernes P, Kratsch D. Lineartime certifying recognition algorithms and forbidden induced subgraphs. Nord J Comput. 2007; 14(1–2):87–108.Google Scholar
 Křivánek M, Morávek J. NPhard problems in hierarchicaltree clustering. Acta Informatica. 1986; 23(3):311–23.View ArticleGoogle Scholar
 Fomin FV, Kratsch S, Pilipczuk M, Pilipczuk M, Villanger Y. Subexponential fixedparameter tractability of cluster editing. CoRR. 2011:abs/1112.4419.Google Scholar
 Komusiewicz C, Uhlmann J. Cluster editing with locally bounded modifications. Discrete Appl Math. 2012; 160(15):2259–70.View ArticleGoogle Scholar
 Liu Y, Wang J, Xu C, Guo J, Chen J. An effective branching strategy based on structural relationship among multiple forbidden induced subgraphs. J Comb Optimization. 2015; 29(1):257–75.View ArticleGoogle Scholar
 Gramm J, Guo J, Hüffner F, Niedermeier R. Automated generation of search tree algorithms for hard graph modification problems. Algorithmica. 2004; 39(4):321–47.View ArticleGoogle Scholar
 Berger AJ. Minimal forbidden subgraphs of reducible graph properties. Discussiones Mathematicae Graph Theory. 2001; 21(1):111–7.View ArticleGoogle Scholar
 Farrugia A. Vertexpartitioning into fixed additive inducedhereditary properties is NPhard. Electron J Combinatorics. 2004; 11(1):46.Google Scholar
 Le VB, Nevries R. Complexity and algorithms for recognizing polar and monopolar graphs. Theor Comput Sci. 2014; 528:1–11.View ArticleGoogle Scholar
 Churchley R, Huang J. Solving partition problems with colourbipartitions. Graphs Combinatorics. 2014; 30(2):353–64.View ArticleGoogle Scholar
 Cao Y, Chen J. Cluster editing: Kernelization based on edge cuts. Algorithmica. 2012; 64(1):152–69.View ArticleGoogle Scholar
 Chen J, Meng J. A 2k kernel for the cluster editing problem. J Comput Syst Sci. 2012; 78(1):211–20.View ArticleGoogle Scholar
 Xu X, Yuruk N, Feng Z, Schweiger TAJ. SCAN: a structural clustering algorithm for networks. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge Discovery and Data mining (KDD ‘07). New York: ACM: 2007. p. 824–33.Google Scholar
 Chatraryamontri A, Breitkreutz BJ, Heinicke S, Boucher L, Winter AG, Stark C, et al. The BioGRID interaction database: 2013 update. Nucleic Acids Res. 2013; 41(D1):816–23.View ArticleGoogle Scholar
 Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, et al. Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Res. 2012; 40(Database issue):700–5.View ArticleGoogle Scholar
 Du Z, Li L, Chen CF, Yu PS, Wang JZ. GSESAME: web tools for GOtermbased gene similarity analysis and knowledge discovery. Nucleic Acids Res. 2009; 37(suppl. 2):345–9.View ArticleGoogle Scholar
 Pu S, Wong J, Turner B, Cho E, Wodak SJ. Uptodate catalogues of yeast protein complexes. Nucleic Acids Res. 2009; 37(3):825–31.View ArticlePubMed CentralPubMedGoogle Scholar
 Kasprzyk A. BioMart: driving a paradigm change in biological data management. Database. 2011;2011.Google Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.