One way to represent a set of overlapping functional groups is to construct a graph with nodes representing functional groups and edges representing overlaps, i.e., there exists an edge between two functional groups if and only if they share at least one protein. This approach has two shortcomings. First, it is not obvious how to correctly identify functional groups, and second, such a representation does not provide any information about the dynamics of proteins in the network. We propose a graph-theoretical approach, which, under the assumption that the protein interaction network satisfies certain mathematical properties, identifies functional groups and provides a representation of overlaps between functional groups in the form of the Tree of Complexes.

Here, we first describe the COD method. Then, we demonstrate the utility of our approach by applying the COD method to several examples, derived from high-throughput experiments, TNF*α*/NF-*κ* B and pheromone signaling pathway interaction networks.

### Complex overlap decomposition

Our method of representing overlapping functional groups, which is depicted in Figure 1, builds on chordal and cograph graph theories. Chordal graphs constitute an important and well studied graph family [18, 19]. A *chord* in a graph is any edge that connects two non-consecutive nodes of a cycle. A *chordal graph* is a graph which does not contain chordless cycles of length greater than three.

An important property of chordal graphs, which is explored directly in this paper, is that every chordal graph has a corresponding *clique tree representation or clique tree* [17]. The nodes in the tree are maximal cliques. Moreover, for every node in the graph, a set of maximal cliques that contain this node form a connected subgraph of the clique tree. Thus, there is a mapping between the nodes in the graph and subtrees in the clique tree. The "connected subgraph" requirement puts constraints on the topology of the clique tree. In fact, the topology of the tree is determined by the structure of overlaps between the maximal cliques in the graph. Thus, the clique tree captures information about the structure of the overlaps, which is lost in a simple clique intersection graph as shown in the example below.

**Example** Consider a hypothetical protein interaction network in Figure 2(a). This network is chordal and its maximal cliques are listed in Figure 2(b). We want to contrast the clique tree representation in Figure 2(d) to a naive representation in Figure 2(c), where every pair of maximal cliques that contain a protein in common is connected by an edge.

While both representations show the overlap between maximal cliques, the interconnection pattern of cliques in the naive representation carries little additional information about the structure of this overlap. On the other hand, a very specific tree-like interconnection pattern in the clique tree representation can expose a special structure of such overlap. For example, consider maximal cliques *B* through *F*. In the naive representation, the overlap between these maximal cliques is collapsed to a clique. Thus, the representation treats the maximal cliques and overlaps between them equally. In particular, there is no way to tell that, for example, *D* occupies a more central position in the network than *B*. In the clique tree representation this information can be extracted from the relative position of cliques in the tree. For example, *B* is connected to *F* by a path that passes through *C* and *D*, which means that any protein shared by *B* and *F* is also contained in *C* and *D*. In other words, the overlap between *B* and *F* is entirely contained in the overlap between *B* and *D*, which in turn is entirely contained in the overlap between *B* and *C*. Thus, there is a correlation between the amount of overlap between maximal cliques and their distance in the clique tree.

Nice properties of the clique tree mentioned above make it a good choice for representation of overlaps between functional groups. However, not every protein interaction network is chordal and maximal cliques may not always be the best way to represent functional groups. For example, in Figure 1, cliques {1, 2, 3} and {1, 2, 4} may correspond to two variants of one complex, where proteins 3 and 4 replace each other, forming one rather than two functional groups. Therefore, in the COD decomposition we relax the assumption that every functional group is a single protein complex (maximal clique) and allow it to contain several protein complexes (maximal cliques). In doing so we have to ensure that a functional group is not just any collection of protein complexes but rather a set of closely related protein complexes which represent possible variants of one complex (such as complexes {1, 2, 3} and {1, 2, 4} in the above example). To capture this systematically we model a functional group with a graph from a family of graphs known as cographs.

Cographs are another well-studied graph family [20]. A *cograph* can be characterized by an absence of an induced subgraph which is a path of length four (*P*_{4}), where the length is the number of nodes in the path. Thus, the diameter of a connected cograph is at most two. Subsequently, connected cographs are dense and cliquish, consistently with the assumption made by algorithms that delineate protein complexes. What makes cographs even more attractive is that for every cograph there exists a Boolean expression which describes all the maximal cliques in the graph. (In terms of modular decomposition used in [16] it means that a cograph can be decomposed by modular decomposition without leaving non-trivial non-decomposable prime module.) This Booolean expression describes in a compact and hierarchical way all the possible variants of protein complex within a functional group.

The main idea behind COD method is to provide a representation of a functional module, which is analogous to the clique tree, in which nodes are cographs (representing variants of protein complex within a functional group) rather than maximal cliques. If we knew in advance all the functional groups in the module, we could simply connect the proteins within each functional group turning it into a clique and, under the assumption that the resulting graph is chordal, apply clique tree construction algorithm to the graph. Since we do not have predefined functional groups, our algorithm identifies them by adding edges to the graph in such a way that each added edge connects a pair of nodes that putatively belong to the same functional group.

The COD method's edge addition strategy and its biological motivation builds on a concept of *weak siblings*. We call a pair of nodes weak siblings if and only if they are connected to the exactly the same set of neighbors, but are not connected to each other. In terms of protein interaction networks, weak siblings are proteins which interact with the same set of proteins but do not interact with each other. In particular, proteins that can substitute each other in a protein interaction network may have this property. Similarly, a pair of proteins that belong to the same complex but are not connected due to missing data or an experimental error will be represented as weak siblings. Since the weak siblings relationship suggests functional similarity, the COD method takes a first step towards delineation of functional groups by connecting every pair of weak siblings by an edge. As this modification may also eliminate some of the chordless cycles of length four (*squares*) in the graph, functional group delineation happens simultaneously with transformation of the protein interaction graph into a chordal graph.

If, after connecting all pairs of weak siblings, the resulting graph is not chordal, the COD method attempts to transform it to chordal by adding some additional edges. Consistently with our assumption that we connect only nodes corresponding to proteins that could be put in the same functional group, we impose restrictions on this "fill-in" process. Namely, we require that, each introduced edge connects a pair of nodes which are close to being weak siblings. In such case the new edge is a diagonal of one or more squares in the graph. We emphasize that adding edges between nodes of longer cycles has no such justification. To summarize our edge addition procedure, our method attempts to eliminate all the squares in the protein interaction network by adding a limited set of diagonals that satisfies following conditions (i) connects potentially functionally equivalent proteins, as measured by the overlap in neighborhoods or distance from being a pair of weak siblings; (ii) ensures that functional groups correspond to cographs; we argue that this condition is guaranteed if the set of added edges does not form a *P*_{4} in a maximal clique of the modified graph (cf. Methods);

If the modification step succeeds, i.e., the modified graph is chordal, all the clique tree representations of the modified graph are computed and then each clique tree is extended to a Tree of Complexes representation of the original graph. The COD algorithm keeps track of all the edge additions and uses this information to delineate functional groups by projecting each maximal clique onto original network and removing all introduced edges contained in the clique. For example, in the modified graph of Figure 1 a maximal clique with four nodes, {1, 2, 5, 8}, is projected to a functional group by removing an edge connecting proteins 5 and 8. This functional group contains two variants of a protein complex, {1, 2, 5} and {1, 2, 8}, which are compactly represented by a (1 ∧ 2) ∧ (5 ∨ 8) Boolean expression. If, on the other hand, the modified graph is not chordal, the COD method stops without producing the representation. Since the clique tree representation for a chordal graph is not unique, the Tree of Complexes representation that derives from it is not unique either. As all clique trees of a chordal graph have the same set of nodes (the nodes are the maximal cliques in the graph), the difference between clique trees comes from the topology of the tree. The clique tree topology is determined by the "connected subgraph" constraints and restriction power of these constraints depends on the structure of the underlying graph, i.e., there are graphs with a unique clique tree representation and there are graphs for which almost any tree that spans all the maximal cliques in the graph is a valid clique tree. As a result a protein interaction network may have several Tree of Complexes representations; all such representations will have the same functional groups but will differ in the way these functional groups are interconnected. For every protein interaction network analyzed bellow we explicitly state all the possible Tree of Complexes representations.

### TNF*α*/NF-*κ* B signaling pathway

To illustrate the power of COD in elucidating the dynamics behind protein complexes, we consider the TNF*α*/NF-*κ* B signaling pathway. The Nuclear Factor *κ* B (NF-*κ* B) family of transcription factors is activated in response to a diverse set of stress stimuli, which includes pro-inflammatory cytokines, *e.g.*, TNF*α*. In vertebrates, this family includes p50, p52, Rel A, c-Rel, and Rel B, which bind to the DNA in a homo or heterodimeric fashion. The NF-*κ* B activity is regulated by the I*κ* B family of proteins via inhibitory ankyrin repeat domains. This family includes I*κ* B*α*, I*κ* B*β*, and I*κ* B*∈*. The precursors of p50 (p105) and p52 (pl00) also possess ankyrin repeat domains and thus act as inhibitors. These precursors can also form dimers with other members of the NF-*κ* B family. The activation with the pro-inflammatory cytokine tumor necrosis factor TNF*α* triggers a signaling cascade, which, in particular, stimulates the activation of the IKK*α*, IKK*β*, and IKK*γ* functional groups. The IKKs initiate a signal induced degradation of the inhibitors (I*κ* Bs), and subsequent nuclear translocation of the transcription factor. Recent TAP experiments [5] provide a wealth of new information regarding this important signaling pathway. Bouwmeester *et al.* identified 221 molecular associations, out of which only 80 where previously known. Gagneur *et al.* [16] applied modular decomposition to the network of these associations but the decomposition halted quickly at large non-decomposable modules.

We used the COD method to analyze the subnetwork spanning all the paths from NIK (NF*κ* B-inducing kinase phosphorylating IKK*α* and IKK*β*) with at most three edges. For the purpose of the analysis, we contracted all five members of NF-*κ* B family into one node. As the resulting protein interaction network, shown in Figure 3(a), is chordal without weak siblings, functional groups correspond to maximal cliques in the network.

For this network, there are two alternative Tree of Complexes representations: functional group *E* can be connected to either *D* or *C*. The representation that maximizes the number of leaves is shown in Figure 3(b). One can clearly see the interplay between the activators and inhibitors. Proteins p105 and NF-*κ* B participate in the same functional groups and thus follow the same path in the tree. The same is true for the pair of proteins IkB*α* and IkB*β*. The Tree of Complexes captures this by grouping p105 and NF-*κ* B, and IkB*α* and IkB*β*.

### Pheromone signaling pathway

The yeast *Saccharomyces cerevisiae* may be present in one of two haploid cell types, which are able to mate. Pheromones released by one type of cell bind to a specific receptor of the other type. This triggers the activation of a scaffold protein-bound mitogen-activated protein kinase (MAPK) cascade and subsequent activation of nuclear proteins that control subsequent cellular events. In a recent paper, Spirin *et al.* [13] identified a subnetwork of proteins involved in this process within a yeast protein interaction network [21]. We analyzed this subnetwork using the COD to see if our method can extract elements of temporal ordering. The subnetwork identified by Spirin *et al.* and its Tree of Complexes representation is given in Figure 4. In this case, the protein network is not chordal. First, the COD method identifies and connects a pair of weak siblings, *MKKl* and *MKK2* .Then, to transform the network to a chordal graph, three additional edges are added: (*SPH* 1, *SPA* 2), (*FUS* 3, *KSS* 1), and (*STE* 11, *STE* 7). In this case, some functional groups will contain more than one protein complex.

This network admits six different Tree of Complexes representations: (i) functional group *H* can be connected to either *B* or *C*; (ii) any interconnection pattern that spans groups *E*, F, and *G* can be chosen. If we ask for a tree with maximum number of leaves, the number of tree variants is reduced to two (option (i)).

The MAPK cascade module consists of three sequentially acting protein kinases: MAP kinase kinase kinase (STE11) MAP kinase kinase (STE7) and MAP kinase (KSS1, FUS3) [22]. MKKl and MKK2 are two redundant protein kinase kinases (most similar to STE7) [23]. Their redundancy is properly captured by the ∨ (*OR*) in their functional group (H). The MAP kinases KSS1 and FUS3 are two separate kinases both activated by STE7 each of which is essential for a different program: FUS3 – for mating; KSS1 – for the filamentous growth [24]. Once again this is correctly captured by ∨ (*OR*) in groups F and G. STE5 is a scaffold protein of the MAPK module. It recruits MAPK module kinases (STE11, STE7, FUS3). This is consistent with the central position of a functional group containing STE5 in the tree and relative to the paths of STE7, STE11 and FUS3. Finally, nuclear proteins DIG1 and DIG2 (necessary for transcription inhibition, which are regulated by both FUS3 and KSS1) enter at the endpoint (node *F*) in the tree.