Detecting conserved protein complexes using a dividing-and-matching algorithm and unequally lenient criteria for network comparison

The growing volume of protein–protein interaction (PPI) data for different species makes it possible to identify common subnetworks (conserved protein complexes) across species via local alignment of their PPI networks, which helps us study biological evolution. Local alignment algorithms compare the PPI networks of different species at both the protein sequence level and the network structure level. For computational and biological reasons, it is hard to find common subnetworks with strictly similar topology in two input PPI networks. Consequently, some methods introduce less strict criteria for topological similarity. However, those methods fail to consider the differences between the two input networks and apply equally lenient criteria to both. In this work, a new dividing-and-matching-based method, named UEDAMAlign, is proposed to detect conserved protein complexes. The method first uses known protein complexes or computational methods to divide one of the two input PPI networks into subnetworks and then maps the proteins in these subnetworks to the other PPI network to obtain their homologous proteins. After that, UEDAMAlign applies unequally lenient criteria to the two input networks to find common connected components among the proteins in the subnetworks and their homologous proteins in the other network. We carry out network alignments between S. cerevisiae and D. melanogaster and between H. sapiens and D. melanogaster. UEDAMAlign is compared with six existing methods. The experimental results show that UEDAMAlign outperforms the existing methods in recovering conserved protein complexes that both match well with known protein complexes and have similar functions.


Background
The majority of biological processes are carried out not by a single protein but by groups of proteins that physically interact with each other to form protein complexes. Protein complexes are believed to be the building blocks of the cellular machinery, and protein-protein interaction (PPI) networks are believed to evolve at the module level [1]. Consequently, identifying the protein complexes of a single species plays a significant role in understanding the underlying mechanisms of cellular function, and identifying protein complexes conserved across different species is helpful for studying biological evolution. Recently, a number of computational methods have been proposed to identify protein complexes from a single PPI network [2][3][4][5][6][7][8][9]. The hypothesis underlying these methods is that a protein complex corresponds to a dense subgraph or cluster of a single PPI network. Meanwhile, other computational methods have been introduced to identify common subnetworks (conserved functional modules) across species by comparatively analyzing the PPI networks of different species.
In contrast to traditional sequence-comparison-based methods, network-comparison-based methods provide a new view of biological evolution: two proteins are considered conserved across species if they have both similar sequences and similar interaction patterns. Two proteins from different PPI networks that have similar sequences (a homologous protein pair) are believed to have similar interaction patterns if their neighbors in the corresponding PPI networks also have similar sequences. Network-comparison-based methods formulate this problem as network alignment. From a biological perspective, PPI network alignment faces two challenges. The first is that many-to-many mappings exist between the proteins of different species as a result of evolutionary events such as gene duplication [10]. The second is that interaction patterns are rarely conserved in a strict sense, because interactions emerge and disappear in the course of evolution.
According to how they deal with many-to-many mappings, network alignment methods can be classified into two categories: global alignment and local alignment [11]. The aim of global alignment is to find optimal one-to-one mappings between the proteins of two PPI networks. Global alignment helps us understand variations between species and can be used to predict the functions of orthologs and to construct phylogenetic relationships. Some global alignment methods [12] also apply clustering methods to detect conserved subnetworks based on the best mappings between the nodes of different PPI networks. However, these methods ignore the fact that interacting proteins, and even whole complexes, are duplicated within a single species. Previous studies have observed that a significant fraction of complexes in S. cerevisiae (yeast) share strong similarity with each other [13]. By contrast, local alignment is used to detect pathways or protein complexes that are conserved across multiple species, and it allows many-to-many mappings between the proteins of two PPI networks. Note that other global or local alignment methods [14][15][16][17] incorporate additional biological information, such as functional annotation, protein structure information and protein domain information, to find truly homologous proteins and to reduce the impact of many-to-many mappings. This work focuses on local alignment, whose goal is to find complexes conserved across different species depending only on sequence and topological similarity.
To date, many local alignment methods have been proposed to detect conserved protein complexes. Generally, there are two types of local alignment methods: alignment-graph-based methods and dividing-and-matching-based methods. The basic idea of alignment-graph-based methods is that false positive protein interactions are unlikely to be duplicated in other species, so merging the two PPI networks being compared according to the homologous mappings between their proteins can filter out false positive interactions. Alignment-graph-based methods [18][19][20][21] usually take two steps to identify conserved complexes. First, a weighted alignment graph is built from the two input PPI networks. Each node of the graph is composed of a pair of homologous proteins, one from each network. Each edge of the graph is weighted according to the degree to which an interaction in one PPI network is conserved across species. After that, clustering methods are applied to detect conserved protein complexes in the weighted alignment graph. Existing alignment-graph-based methods differ in the strategies they take to construct the alignment graph and to cluster it. Dividing-and-matching-based methods are an alternative way of finding conserved complexes: they first use known protein complexes or computational methods to divide one of the two input PPI networks into subnetworks and then map the proteins in these subnetworks to the other PPI network [22][23][24][25]. The motivation underlying this kind of method is to investigate how protein complexes that are experimentally or computationally identified in a single species are conserved across species. In recent years, known protein complexes have become available for some species, such as yeast and human, together with computational methods that perform well in detecting protein complexes from the PPI network of a single species [2][26][27][28][29][30]. All of this makes it pressing to design an effective dividing-and-matching-based method to identify conserved protein complexes.
To overcome the challenge that interactions are rarely conserved in a strict sense across species, both alignment-graph-based methods and dividing-and-matching-based methods introduce less restrictive definitions of conserved interactions in the course of comparison. Among alignment-graph-based methods, some, such as NetworkBlast [18], Network-Path [31] and Mawish [19], introduce an edge in the alignment graph if a pair of proteins in one network is directly connected while their homologous proteins in the other network are indirectly connected. PHUNKEE [20] drops the requirement of indirect connection between the homologous proteins in the other network and connects two nodes in the alignment graph if at least one pair of proteins in one network is directly connected. AlignNemo [17] adopts a less strict criterion and constructs an edge in the alignment graph if at least one pair of proteins in one PPI network is directly or indirectly connected. NetAligner [32] adds edges between node pairs in the alignment graph at a distance greater than 2 and tolerates gaps and mismatches of any length. Among dividing-and-matching-based methods, Narayanan and Karp [33] proposed the Match-and-Split algorithm, which matches the proteins of two networks according to a local matching criterion and splits the whole networks into connected components. This process is applied recursively to those components and finally outputs conserved complexes. Hodgkinson and Karp [24] introduced Produles, which uses the PageRank-Nibble algorithm [34] to partition one of the two input networks and maps the resulting subnetworks to the other network. After that, a local extension is performed to detect the connected components formed by the homologous proteins in the other network. According to those connected components, the subnetworks are refined and their connected parts are extracted as conserved protein complexes. Match-and-Split and Produles thus do not require the two networks to match exactly in graph structure. However, they take only direct neighbors into account when performing local alignment, which is so rigid that very few conserved protein complexes are identified. To address this, we proposed DAMAlign [25] in our previous work; it combines the dividing-and-matching strategy with the same lenient criteria as AlignNemo to locally extend a pair of homologous proteins. That is, in the course of finding common connected components, DAMAlign recruits a pair of homologous proteins if at least one protein of the pair is connected, within its own network, to the complex under construction by a path of length no larger than 2. Comparisons made in previous studies show that AlignNemo, AlignMCL [21] and DAMAlign detect more conserved complexes than earlier methods such as Mawish [19], NetworkBlast [18], PHUNKEE and Produles. A possible reason is that considering indirectly connected node pairs in one network is robust against interactions missing from the original network. Although NetAligner employs even laxer criteria to introduce conserved interactions, it also yields many false positive conserved interactions, which reduces its performance in detecting conserved complexes.
Although previous researchers have made great efforts to improve the performance of their methods by introducing less strict criteria for finding conserved interactions, few of them consider the differences between the two input networks; they apply equally lenient criteria to both. In fact, the PPI networks of different species differ in structure and topology. The distance between proteins that have homologous proteins in the other PPI network may vary from species to species. Therefore, in this work, we propose a new dividing-and-matching method named UEDAMAlign to detect conserved protein complexes via local network alignment. Like previous dividing-and-matching methods such as Produles and DAMAlign, UEDAMAlign partitions one PPI network into subnetworks and then maps these subnetworks to the other PPI network to find common connected components. In contrast to previous dividing-and-matching methods, UEDAMAlign applies unequal criteria to the two networks when finding common connected components, to account for their structural and topological differences. That is, UEDAMAlign locally extends a pair of homologous proteins if there is a path of length no larger than l connecting the homologous protein in PPI network one, or a path of length no larger than r connecting the homologous protein in the other network. To evaluate the effectiveness of UEDAMAlign, we carry out network alignments between S. cerevisiae and D. melanogaster and between H. sapiens and D. melanogaster. UEDAMAlign, with parameters l and r both set to 2, is compared with six existing methods. The experimental results show that when UEDAMAlign applies the same lenient criteria as AlignNemo and DAMAlign, it is superior to the other existing methods because it detects conserved protein complexes that both match well with known protein complexes and have similar functions. Finally, we discuss the effect of parameters l and r on the performance of UEDAMAlign.

Methods
The detection process of UEDAMAlign is broadly divided into four steps. First, random walks with an unequal number of steps are taken on the two input PPI networks to detect potential mappings between the proteins of the two networks. Second, one of the two PPI networks is divided into subnetworks using known protein complexes or computational methods. Third, the proteins in those subnetworks are mapped to the other PPI network to find their homologous proteins, and the connected components of those homologous proteins are extracted from the other PPI network using a heuristic approach. Finally, predicted conserved complexes that overlap heavily with others are filtered out.

Exploring potential mappings between proteins of two species
In network alignment, the homologous mappings between the proteins of two different PPI networks can be inferred from their sequence-based similarity. Proteins with similar sequences are likely to have evolved from a common ancestor and thus to have similar functions. Moreover, interacting proteins of a single species tend to share common functions. Therefore, we assume that a protein and its neighbors in a single PPI network should map to a common protein in the other PPI network. Since proteins are likely to share functions not only with their direct neighbors but also with their indirect neighbors, and even with their level-k neighbors, some potential mappings between the proteins of two species can be inferred from their direct, indirect or level-k neighbors. Furthermore, the level of neighbors with which a protein tends to share functions varies from species to species because of the structural and topological differences between their PPI networks. Hence, potential protein-protein mappings should be inferred from different levels of neighbors for different species. In this work, we adopt an unbalanced Bi-random walk algorithm to find potential mappings between the proteins of two species. This method was also used in our previous study [35], which obtains protein-function associations by walking different numbers of steps in a PPI network and a functional interrelationship network. To formally define our method, some variables are introduced first.
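Let P(N*N) and H(M*M) be the adjacency matrices of the two input PPI networks, respectively. P(N*N) is row-normalized and H(M*M) is column-normalized. The elements p(i, j) of matrix P(N*N) and h(i, j) of matrix H(M*M) are defined as follows.
$$p(i,j) = \begin{cases} \dfrac{1}{\mathrm{degree}(i)} & \text{if proteins } i \text{ and } j \text{ interact and } \mathrm{degree}(i) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
$$h(i,j) = \begin{cases} \dfrac{1}{\mathrm{degree}(j)} & \text{if proteins } i \text{ and } j \text{ interact and } \mathrm{degree}(j) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
where degree(i) denotes the number of interactions of node i.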
Let matrix A(N*M) represent the known protein-protein mappings measured by sequence-based similarity. Its element a(i,j) is 1 if there exists a mapping between protein i of one species and protein j of the other, and 0 otherwise. R(N*M) denotes the final protein-protein mappings. The value of its element r(i,j) represents the probability that protein i will be mapped to protein j.
Given matrices P, H and A, we want to calculate matrix R. Since proteins and their level-k neighbors in one PPI network may map to the same proteins in the counterpart network, several random walk steps are taken on each of the two PPI networks. At each walking step, multiplying by P on the left and by H on the right detects potential protein-protein mappings (Eqs. 3, 4). The weighted average of the two products then updates matrix R (Eq. 5). Because the two input networks differ, the level of neighbors from which proteins infer mapping information should differ as well. To address this, two parameters (l and r) are adopted to control the maximal number of iteration steps in the two networks. Mathematically, the process can be expressed as Algorithm 1.
In Algorithm 1, t (= 1, 2, ...) represents the walking step. Matrix A, which stores the known protein-protein mappings, regulates the iteration process. The parameter α (0 < α < 1) adjusts the relative weight of the network-based regulation and of the prior knowledge stored in matrix A (in this work, α is set to 0.5). p and h are indicators that are 1 if the number of walk steps taken on PPI network One or Two is less than the corresponding threshold (l or r), and 0 otherwise. IsoRank [11] adopts a similar strategy to obtain potential mappings between the proteins of two different PPI networks and computes their global network alignment. In IsoRank, however, random walks are taken simultaneously on the two networks until convergence over the global networks, so IsoRank treats the two networks equally. Our work instead takes random walks on the two networks separately and walks only a few steps (t is set to 1, 2, ...), which makes it convenient to control the number of steps taken on each network according to their differences in topology and structure. Consequently, our method is more flexible in obtaining protein-protein mappings between two PPI networks.
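The walk described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' Algorithm 1, which is not reproduced in the text. It assumes the left update is α·P·R + (1 − α)·A, the right update is α·R·H + (1 − α)·A, and the new R is the average of the updates whose indicator is 1. It also assumes that exactly l steps are taken on network one and r steps on network two. The function names are hypothetical, and NumPy would be the natural choice for networks of realistic size.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of lists."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

def normalize(adj, by_row):
    """Row-normalize (p(i,j) = a/degree(i)) or column-normalize (h(i,j) = a/degree(j))."""
    deg = [sum(row) for row in adj]  # undirected network: row sum equals degree
    out = []
    for i, row in enumerate(adj):
        out.append([
            (x / deg[i if by_row else j]) if deg[i if by_row else j] > 0 else 0.0
            for j, x in enumerate(row)
        ])
    return out

def unbalanced_birandom_walk(adj_p, adj_h, A, l, r, alpha=0.5):
    """Assumed update rule: average of the left and right walks that are still active."""
    P = normalize(adj_p, by_row=True)
    H = normalize(adj_h, by_row=False)
    R = [row[:] for row in A]  # start from the known mappings
    for t in range(1, max(l, r) + 1):
        products = []
        if t <= l:  # indicator p: network one still walks
            products.append(matmul(P, R))
        if t <= r:  # indicator h: network two still walks
            products.append(matmul(R, H))
        R = [[sum(alpha * u[i][j] + (1 - alpha) * A[i][j] for u in products) / len(products)
              for j in range(len(A[0]))] for i in range(len(A))]
    return R

# Toy data: network one is the path 0-1-2, network two is the edge 0-1,
# and the only known mapping is protein 0 (network one) to protein 0 (network two).
adj_p = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
adj_h = [[0, 1], [1, 0]]
A = [[1, 0], [0, 0], [0, 0]]
R_equal = unbalanced_birandom_walk(adj_p, adj_h, A, l=1, r=1)
R_unequal = unbalanced_birandom_walk(adj_p, adj_h, A, l=2, r=1)
```

With l = 1, protein 2 of network one (two hops from the mapped protein) receives no mapping score; with l = 2 it does, which is the effect the unequal step counts are meant to control.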

Detecting conserved protein complexes from PPI networks
The basic idea of UEDAMAlign is to first divide a PPI network into small subnetworks and then map the proteins of these subnetworks to the other PPI network. Many computational methods, such as Coach [36], MCL [37,38], CMC [39] and CFinder [40], have been proposed to detect protein complexes from a single PPI network and achieve good performance. Moreover, biological experiments have been carried out on several species, and data on known protein complexes are available. Consequently, known protein complexes or those predicted by computational methods can conveniently be used as a partition of a PPI network.
The main challenge of UEDAMAlign lies in mapping the proteins in the subnetworks of one PPI network to the other network in order to find common connected components. In the course of finding common connected components, UEDAMAlign adopts unequally lenient criteria to extend a pair of homologous proteins. The span distance allowed for a protein pair within a single network differs between the two input PPI networks and is determined by the input parameters l and r. For example, when taking the same lenient criteria as AlignNemo and DAMAlign, UEDAMAlign absorbs a pair of homologous proteins into a predicted conserved protein complex if at least one protein of the pair is connected to the proteins already in the predicted complex through a path of length no larger than 2. In this case, parameters l and r are both set to 2. When parameters l and r are set to 2 and 3, respectively, UEDAMAlign locally extends a pair of homologous proteins if there exists a path of length no larger than 2 connecting the node of the pair in PPI network one, or a path of length no larger than 3 connecting the node of the pair in PPI network two. Figure 1 shows eight cases of connectivity in conserved protein complexes from two different PPI networks when l and r are both set to 2. Figure 2 shows eleven cases of connectivity in conserved protein complexes from two different PPI networks when l and r are set to 2 and 3, respectively. Nodes of different colors come from different PPI networks. Solid lines connecting two nodes of different colors represent known homologous mappings. Dotted lines represent artificial homologous mappings detected by the unbalanced Bi-random walk algorithm. Solid lines connecting nodes of the same color represent interactions.
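The extension criterion can be illustrated with a bounded breadth-first search. The sketch below assumes each network is stored as an adjacency dictionary and that a candidate pair (u, v) is accepted when u is within l edges of the current module in network one or v is within r edges of the current module in network two. The function names `within_distance` and `can_extend` are hypothetical, and the handling of real versus artificial mappings is left out.

```python
from collections import deque

def within_distance(graph, source, targets, max_len):
    """True if a node of `targets` other than `source` lies within max_len edges of source."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node != source and node in targets:
            return True
        if dist == max_len:
            continue  # do not expand beyond the allowed path length
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return False

def can_extend(pair, module_one, module_two, net_one, net_two, l, r):
    """Unequally lenient test: path length <= l in network one OR <= r in network two."""
    u, v = pair
    return (within_distance(net_one, u, module_one, l)
            or within_distance(net_two, v, module_two, r))

# Toy networks: a-x-b (distance 2) in network one, A-y-z-B (distance 3) in network two.
net_one = {"a": ["x"], "x": ["a", "b"], "b": ["x"]}
net_two = {"A": ["y"], "y": ["A", "z"], "z": ["y", "B"], "B": ["z"]}
strict = can_extend(("b", "B"), {"a"}, {"A"}, net_one, net_two, l=1, r=2)
unequal = can_extend(("b", "B"), {"a"}, {"A"}, net_one, net_two, l=1, r=3)
equal = can_extend(("b", "B"), {"a"}, {"A"}, net_one, net_two, l=2, r=2)
```

In this toy case the pair is rejected for (l, r) = (1, 2) and accepted for (1, 3) and (2, 2), showing how loosening the criterion on only one network can change the outcome.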
Given k subnetworks p_1, p_2, ..., p_k extracted from PPI network P(N*N), the other PPI network H(M*M), the known protein-protein mapping matrix A(N*M), parameters l and r, and the constructed mapping matrix R(N*M), UEDAMAlign proceeds as follows.
Step 1: In this step, we aim to extract from an input subnetwork the proteins that both have homologs in the other PPI network and are connected through at least one path of length no larger than a threshold. The threshold is set to l for network P and to r for network H. Let ModuleOne and ModuleTwo store the conserved protein complexes induced from PPI networks P and H, respectively. Starting from an arbitrary node of subnetwork p_i (i = 1, 2, ..., k), find the proteins in H that, according to matrix R, are homologous to both the node and its neighbors in the input subnetwork p_i. Put the neighbors into ModuleOne if they satisfy one of the following conditions:
1. There exists at least one real homologous mapping between the shared homologous proteins and the node or its neighbors.
2. The node and its neighbors share two different homologous proteins, and these two homologous proteins are really matched with two proteins in the input subnetwork p_i other than the node and its neighbors.
Since matrix R contains some artificial mappings, only real homologous proteins, that is, those whose mappings are stored in matrix A, are put into ModuleTwo. Then start again from the neighbors and repeat the process of step 1 until no more nodes of subnetwork p_i can be put into ModuleOne.
Step 2: The aim of step 2 is to refine ModuleTwo by reducing many-to-many homologous mappings. Each node in ModuleTwo is assigned a weight, defined as the sum of the mapping values in matrix R between the node and its counterparts in ModuleOne. Connected components of the proteins in ModuleTwo are derived by searching not only their direct neighbors but also their neighbors up to level l or r (level l for subnetworks from network P, level r for subnetworks from network H). For components that consist of at least two nodes, their counterparts in ModuleOne are regarded as covered. Single-node components are excluded from ModuleTwo if their counterparts in ModuleOne have already been covered. Otherwise, the one with the highest weight is kept.
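The component search in step 2 can be sketched as follows, under the assumption that two module proteins belong to the same component when one lies within max_len edges of the other in the full network, so that intermediate nodes on the path may fall outside the module. The function names are hypothetical, and the weighting and covering rules of step 2 are not modeled.

```python
from collections import deque

def nodes_within(graph, source, max_len):
    """All nodes reachable from source by a path of at most max_len edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_len:
            continue
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)

def lenient_components(graph, module, max_len):
    """Group module proteins that are chained together by paths of length <= max_len."""
    unvisited = set(module)
    components = []
    while unvisited:
        start = unvisited.pop()
        comp = {start}
        frontier = [start]
        while frontier:
            node = frontier.pop()
            reachable = nodes_within(graph, node, max_len) & unvisited
            unvisited -= reachable
            comp |= reachable
            frontier.extend(reachable)
        components.append(comp)
    return components

# Toy network: A-y-B (distance 2) and C-z-w-D (distance 3); y, z, w are not in the module.
graph = {"A": ["y"], "y": ["A", "B"], "B": ["y"],
         "C": ["z"], "z": ["C", "w"], "w": ["z", "D"], "D": ["w"]}
level2 = sorted(sorted(c) for c in lenient_components(graph, {"A", "B", "C", "D"}, 2))
level3 = sorted(sorted(c) for c in lenient_components(graph, {"A", "B", "C", "D"}, 3))
```

At level 2 the proteins C and D remain single-node components, while at level 3 they merge into one component.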
Step 3: In this step, we handle the case in which nodes of the input subnetwork are isolated but their homologous proteins are connected to proteins in ModuleTwo. For example, when parameters l and r are both set to 2, steps 1 and 2 cover the cases of Figure 1a-f, and step 3 considers the cases of Figure 1g, h. When parameters l and r are set to 2 and 3, respectively, steps 1 and 2 cover the cases of Figure 2a-h [...] proteins in the other PPI network) satisfy one of the following conditions.
Step 4: In this step, highly overlapping conserved protein complexes are filtered out. Two factors may contribute to overlap. The first is that the input subnetworks themselves overlap. The second is that the homologous mappings between different PPI networks may generate multiple overlapping conserved protein complexes. Comparing two input PPI networks produces solutions that each consist of two conserved protein complexes, one from each PPI network. The overlap between a pair of solutions is quantified by the overlapping score of their two protein complexes (B and C) from PPI network One. The overlapping score of B and C is defined as follows.
$$OS(B, C) = \frac{|V_B \cap V_C|^2}{|V_B| \times |V_C|} \tag{3}$$
where V_B and V_C denote the node sets of protein complexes B and C, respectively. A solution is filtered out if there exists another solution that contains a larger complex from PPI network One and their overlapping score is larger than a threshold t (in this work, t = 0.8). In summary, Algorithm UEDAMAlign outlines the overall framework for detecting conserved protein complexes with our method.
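The filtering in step 4 can be sketched as below. The sketch assumes the overlapping score is the squared intersection size divided by the product of the two complex sizes, and it represents each solution as a hypothetical pair (complex from network One, complex from network Two). Complexes of equal size are never removed by this rule, because the text only mentions a larger complex.

```python
def overlap_score(b, c):
    """OS(B, C) = |B ∩ C|^2 / (|B| * |C|), with 0 for empty inputs."""
    b, c = set(b), set(c)
    if not b or not c:
        return 0.0
    shared = len(b & c)
    return shared * shared / (len(b) * len(c))

def filter_overlapping(solutions, threshold=0.8):
    """Drop a solution if a larger network-One complex overlaps it above the threshold."""
    kept = []
    for i, (b, _) in enumerate(solutions):
        dominated = any(
            j != i and len(c) > len(b) and overlap_score(b, c) > threshold
            for j, (c, _) in enumerate(solutions)
        )
        if not dominated:
            kept.append(solutions[i])
    return kept

# Toy solutions: the first complex is nearly contained in the second (OS = 25/30).
solutions = [
    ({1, 2, 3, 4, 5}, {"a", "b"}),
    ({1, 2, 3, 4, 5, 6}, {"a", "b", "c"}),
    ({7, 8}, {"d"}),
]
kept = filter_overlapping(solutions, threshold=0.8)
```

Here the five-protein complex is removed because the six-protein complex overlaps it with a score of about 0.83, while the unrelated two-protein complex is kept.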

Results
To investigate the effectiveness of our method, we first evaluate the dividing-and-matching strategy of UEDAMAlign. We compare it with existing methods: Mawish [19], NetworkBlast [18], Match-and-Split [33], Produles [24], AlignNemo [17] and AlignMCL [21]. Mawish and NetworkBlast are two typical alignment-graph-based methods. AlignNemo and AlignMCL are two more recent alignment-graph-based methods with good performance. Match-and-Split and Produles are two dividing-and-matching-based methods. For a fair comparison, the parameters l and r of UEDAMAlign are both set to 2, which means that UEDAMAlign adopts the same lenient criteria as AlignNemo and locally extends a pair of homologous proteins if at least one protein of the pair is connected, within its own network, by a path of length no larger than 2. The parameters "a", "b", "c", "d" and "e" of Produles are set to "2", "100", "2", "0.05" and "50", respectively, as recommended by the authors. The BLAST E-value threshold used by all compared methods is set to $10^{-9}$. The parameters of the other methods are set to the default values chosen by their authors. UEDAMAlign uses known protein complexes or existing computational methods, such as Coach [36], MCL [37,38], CMC [39] and CFinder [40], to partition the PPI networks. The corresponding results are named UEDAMAlignKnown, UEDAMAlignCoach, UEDAMAlignMCL, UEDAMAlignCMC and UEDAMAlignCFinder, respectively. Among these computational methods for detecting protein complexes in a single PPI network, Coach is a successful clustering algorithm that considers the core-attachment structure of protein complexes [2]. MCL is a fast and highly scalable clustering algorithm that partitions a PPI network into non-overlapping subnetworks by simulating a random walker in it. CMC is a clustering method based on maximal cliques. CFinder detects the k-cliques in a PPI network and joins two adjacent k-cliques if they share (k - 1) common nodes. In this work, the parameter k of CFinder is set to 4. The parameter values of the other methods are selected from those recommended by their authors.
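The k-clique joining rule described for CFinder can be illustrated with a brute-force sketch for small graphs. It is not the CFinder implementation: it enumerates all k-node subsets, keeps those that are cliques, and merges cliques that share k - 1 nodes with a union-find structure. The function name `k_clique_communities` is chosen here for illustration only.

```python
from itertools import combinations

def k_clique_communities(edges, k):
    """Union of k-cliques that are chained together by sharing k - 1 nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Brute-force clique enumeration; only practical for small toy graphs.
    cliques = [
        frozenset(c) for c in combinations(sorted(adj), k)
        if all(v in adj[u] for u, v in combinations(c, 2))
    ]
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return list(communities.values())

# Toy graph: triangles {1,2,3} and {2,3,4} share an edge; triangle {5,6,7} hangs off node 4.
edges = [(1, 2), (2, 3), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 7), (5, 7)]
communities = sorted(sorted(c) for c in k_clique_communities(edges, k=3))
```

With k = 3 the first two triangles merge into one subnetwork because they share two nodes, while the third triangle stays separate because it shares none with them.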
In this section, we first introduce the experimental data used in this work. Then the performance of the compared methods is evaluated by matching their predictions with known protein complexes. In addition, we show the biological relevance of the conserved protein complexes detected by the compared methods. After that, UEDAMAlign is compared with AlignNemo on AlignNemo's experimental dataset. Finally, we demonstrate the ability of UEDAMAlign to apply unequally lenient criteria when comparing two networks, and we discuss the effect of its parameters on its performance.

Experimental data
We carry out alignments between two pairs of PPI networks: S. cerevisiae (yeast) with D. melanogaster (fruit fly), and H. sapiens (human) with D. melanogaster. The PPI network data of yeast and fruit fly are downloaded from the DIP database [41] (release of Oct. 10, 2010), with self-interactions and repeated interactions removed. There are a total of 5,093 proteins and 22,570 interactions in the yeast dataset, and 7,916 proteins and 20,289 interactions in the fruit fly dataset. The PPI network data of human are obtained from HIPPIE [42] and include 13,398 proteins and 86,307 interactions, again excluding self-interactions and repeated interactions. The protein sequence data of yeast, fruit fly and human are all downloaded from NCBI. The homologous protein pairs of the two input networks are inferred from the sequence-based similarity between proteins from different PPI networks. The sequence-based similarity of two proteins a and b is calculated from their BLAST E-values as follows.
where E(a,b) is the minimum BLAST E-value when aligning a against b. Here, sequence-based similarities are [...]
The list of known yeast protein complexes is obtained from the CYC2008 catalogue published in Nucleic Acids Research [43], which consists of 408 protein complexes. The list of human protein complexes is obtained from CORUM [44], which consists of 1613 distinct protein complexes, each composed of at least two proteins.

Matching with known protein complexes
To evaluate the performance of each method, we match the predicted conserved protein complexes with known ones. The better the predicted protein complexes match the known ones, the better the method performs. A predicted conserved protein complex is considered to match a known protein complex if their overlapping score OS (see Eq. 3) is equal to or larger than a threshold (in this work, threshold = 0.2) [18]. Three statistical measures are widely used to evaluate a result: Precision, Recall and F-measure. Precision measures the percentage of predicted protein complexes that match the known complexes. Recall measures the fraction of known complexes that are matched by the predicted conserved protein complexes. F-measure is the harmonic mean of precision and recall. Formally, they are defined as follows.
$$Precision = \frac{TP}{TP + FP} \tag{5}$$
$$Recall = \frac{TP}{TP + FN} \tag{6}$$
$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{7}$$
where TP (true positive) is the number of predicted conserved protein complexes that are matched by known protein complexes, FP (false positive) is the number of predicted conserved complexes that fail to match any known protein complex, and FN (false negative) is the number of known protein complexes that are not matched by predicted conserved protein complexes. In addition, the coverage rate is introduced to measure how many proteins in the known complexes are covered by the predicted conserved complexes. Let m be the number of known protein complexes and T_ij be the number of proteins in common between the ith known protein complex and the jth predicted conserved protein complex. The coverage rate (CR) is defined as follows.
$$CR = \frac{\sum_{i=1}^{m} \max_{j} T_{ij}}{\sum_{i=1}^{m} |KC_i|} \tag{8}$$
where |KC_i| denotes the number of proteins in the ith known complex.
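As a worked check of these measures, the sketch below computes precision, recall and F-measure from TP, FP and FN, assuming Precision = TP/(TP + FP), Recall = TP/(TP + FN) and the harmonic-mean F-measure. It plugs in the counts reported for UEDAMAlignKnownComplex on the yeast-fly alignment: 148 predicted complexes, 145 of them matched, and 172 of the 408 known yeast complexes matched, so that FN = 408 - 172. The coverage rate is computed as the summed best overlap per known complex divided by the total size of the known complexes, which is an assumed reading of the definition. Function names are hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and their harmonic mean from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def coverage_rate(known, predicted):
    """Sum over known complexes of the best overlap with any prediction, over total size."""
    covered = sum(
        max((len(set(kc) & set(pc)) for pc in predicted), default=0)
        for kc in known
    )
    total = sum(len(set(kc)) for kc in known)
    return covered / total if total else 0.0

# Counts quoted in the text for the yeast-fly alignment with known complexes.
precision, recall, f_measure = precision_recall_f(tp=145, fp=148 - 145, fn=408 - 172)

# Toy coverage example: best overlaps are 2 and 1 out of 5 known proteins.
toy_cr = coverage_rate([{"a", "b", "c"}, {"d", "e"}], [{"a", "b"}, {"c"}, {"e", "x"}])
```

Under these assumptions the yeast-fly F-measure comes out at roughly 0.55, in line with the value reported for UEDAMAlignKnownComplex, and the same calculation with the human counts (515 predicted, 508 matched, 821 of 1613 known complexes matched) gives roughly 0.56.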
Table 1 shows the basic information on the results of the different methods on our experimental dataset. Column "conserved pairs" refers to the number of conserved protein complex pairs generated by aligning two different PPI networks. Since many-to-many mappings exist between the proteins of different PPI networks, a conserved protein complex in one network may be repeated and matched with different complexes in the other network. Additionally, a conserved protein complex in one network may include repeated proteins that are mapped to different proteins in the other network. Column "distinct complexes (size ≥2)" refers to the number of conserved protein complexes in one PPI network after filtering out repeated proteins within a complex, repeated complexes, and complexes that consist of only one protein. For example, AlignMCL yields 933 pairs of conserved protein complexes when comparing the yeast PPI network against the fly PPI network. Of these, 915 out of 933 conserved protein complexes in the yeast PPI network and 927 out of 933 conserved protein complexes in the fly PPI network are distinct, and each of them includes at least two distinct proteins.
Table 2 compares the methods by matching the predicted conserved protein complexes with known protein complexes. When known complexes are used to partition the PPI network, our method (UEDAMAlignKnownComplex) detects 148 yeast conserved protein complexes (PC) when comparing yeast against fly and 515 human conserved protein complexes (PC) when comparing human against fly. Of the 148 yeast conserved protein complexes, 145 match at least one known yeast complex (MPC), and 172 known yeast protein complexes match at least one predicted complex (MKC). Of the 515 human conserved protein complexes, 508 match at least one known human complex (MPC), and 821 known human protein complexes match at least one predicted complex (MKC). Moreover, UEDAMAlignKnownComplex detects 45 yeast and 158 human conserved protein complexes that share identical proteins with known yeast and human protein complexes, respectively (PM). The F-measure of UEDAMAlignKnownComplex is about 0.55 in the alignment of yeast and fly and 0.56 in the alignment of human and fly, the highest among all compared methods. When computational methods are used to partition the PPI network, the performance of our method varies with how well each of them detects protein complexes in a single PPI network. UEDAMAlignCoach has the second-best performance: its F-measure is 0.34 when aligning yeast with fly, which is 0.11, 0.28, 0.27, 0.29 and 0.21 higher than that of AlignMCL, Match-and-Split, Mawish, NetworkBlast and Produles, respectively. When aligning human with fly, the F-measure of UEDAMAlignCoach reaches 0.28, which is 0.17, 0.25, 0.25, 0.23 and 0.24 higher than that of AlignMCL, Match-and-Split, Mawish, NetworkBlast and Produles, respectively. As for coverage rate (CR), UEDAMAlignKnownComplex and UEDAMAlignCoach also have the best and second-best values in the two alignments. We do not compare our methods with AlignNemo here because AlignNemo cannot output results on our experimental dataset. AlignMCL uses the same strategy for constructing the alignment graph as AlignNemo but is more scalable, and it has the best performance among the other existing methods (Match-and-Split, Mawish, NetworkBlast and Produles) in terms of F-measure and coverage rate. Both AlignMCL and UEDAMAlignMCL employ the MCL method to partition the PPI network. The difference is that the former applies MCL after constructing the alignment graph, while the latter applies it before aligning with the other PPI network. On the whole, UEDAMAlignMCL is slightly better than AlignMCL because its F-measure is slightly higher in both alignments. The CR value of UEDAMAlignMCL is higher than that of AlignMCL when comparing human against fly and almost the same when comparing yeast against fly.
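For orientation, the sketch below shows one common way the Table 2 counts are combined into an F-measure. It assumes the usual definitions precision = MPC/PC and recall = MKC/(number of known complexes), which this section does not restate; the function name `f_measure` and all numbers in the example are hypothetical and are not taken from Table 2.

```python
def f_measure(pc, mpc, mkc, known_total):
    """Harmonic mean of precision (mpc / pc) and recall (mkc / known_total).

    pc:          number of predicted conserved complexes
    mpc:         predicted complexes matching at least one known complex
    mkc:         known complexes matched by at least one predicted complex
    known_total: number of known complexes (assumed denominator for recall)
    """
    if pc == 0 or known_total == 0:
        return 0.0
    precision = mpc / pc
    recall = mkc / known_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy counts only: precision 0.8, recall 0.5
print(round(f_measure(pc=10, mpc=8, mkc=6, known_total=12), 3))  # 0.615
```

Under these assumptions a method that predicts few complexes can have high precision and still a modest F-measure, because recall depends on how many known complexes are recovered.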

Biological relevance of conserved protein complex pairs
To further validate our method, we investigate the biological relevance between the conserved protein complexes from the two different PPI networks, which is measured by the average functional similarity among all proteins in them. The functional similarity of two proteins refers to the semantic similarity of their GO annotations [45]. Given two proteins $p_1$ and $p_2$ with GO annotations $GO(p_1)$ and $GO(p_2)$, the functional similarity between $p_1$ and $p_2$ is defined as follows:

$$\mathrm{sim}(p_1, p_2) = \max_{go_i \in GO(p_1),\; go_j \in GO(p_2)} \mathrm{Resniksim}(go_i, go_j) \qquad (9)$$

where $\mathrm{Resniksim}(go_i, go_j)$ is the semantic similarity score of the GO pair $(go_i, go_j)$ measured by the Resnik method [46]. In this work, we use Resniksim to measure the similarity between GO terms because both AlignNemo [17] and AlignMCL [21] use it.
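Equation (9) can be read as a maximum over all cross pairs of GO terms. The sketch below illustrates this with a pluggable term-level score; the function name `protein_similarity`, the score of 0 for unannotated proteins, and the toy GO identifiers and scores are all assumptions made for illustration.

```python
from itertools import product


def protein_similarity(go_p1, go_p2, term_sim):
    """Equation (9): best term-to-term score over all GO pairs of two proteins."""
    if not go_p1 or not go_p2:
        return 0.0  # assumption: a protein without annotations scores 0
    return max(term_sim(a, b) for a, b in product(go_p1, go_p2))


# Made-up symmetric score table standing in for a real term similarity measure
toy_scores = {("GO:1", "GO:3"): 2.5, ("GO:2", "GO:3"): 4.0}


def toy_term_sim(a, b):
    return toy_scores.get((a, b), toy_scores.get((b, a), 0.0))


print(protein_similarity({"GO:1", "GO:2"}, {"GO:3"}, toy_term_sim))  # 4.0
```

Passing the term-level score as a callable keeps the sketch independent of any particular semantic similarity tool.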
Based on the Resnik method, the free tool FastSemSim (http://sourceforge.net/projects/fastsemsim/) is adopted to calculate the similarity of two proteins. The GO system consists of three separate categories of annotations, namely Molecular Function (MF), Biological Process (BP) and Cellular Component (CC). In this work, we focus mainly on biological process (BP). Table 3 compares the methods in terms of the functional similarity of conserved protein complex pairs when comparing yeast against fly and human against fly. Columns "avg_yeast" and "avg_fly" give the average functional similarity of the conserved yeast protein complexes and the conserved fly protein complexes, respectively, when comparing yeast against fly. Column "avg_intra" lists the average functional similarity of a conserved protein complex pair when only the functional similarity between proteins from different species is considered. Column "avg_mixed" lists the average functional similarity of a conserved protein complex pair when the functional similarity among all proteins, both inter-species and intra-species, is considered. The results for the two alignments show that UEDAMAlignKnownComplex yields conserved protein complex pairs that are highly functionally related, as indicated by the highest avg_mixed values. When our method uses computational methods such as Coach, CMC and CFinder to partition the PPI networks, it also produces conserved protein complex pairs with similar functions: their avg_mixed values for the two alignments are higher than those of AlignMCL and NetworkBlast, comparable to that of Produles, and slightly lower than those of Match-and-Split and Mawish. UEDAMAlignMCL has relatively low avg_mixed values. However, its avg_mixed value is higher than that of AlignMCL for the alignment between yeast and fly and comparable to that of AlignMCL for the alignment between human and fly.
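The difference between the cross-species average (the column the text calls "avg_intra") and "avg_mixed" can be sketched as below. How exactly the averages are taken is not stated here, so the sketch assumes a plain mean over protein pairs; the function names and the toy similarity function are hypothetical.

```python
from itertools import combinations, product


def average(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def cross_species_average(complex_a, complex_b, sim):
    """Mean similarity over pairs with one protein from each species."""
    return average(sim(p, q) for p, q in product(complex_a, complex_b))


def mixed_average(complex_a, complex_b, sim):
    """Mean similarity over all unordered pairs in the union of both complexes."""
    proteins = list(complex_a) + list(complex_b)
    return average(sim(p, q) for p, q in combinations(proteins, 2))


# Toy similarity: 1.0 within a species, 0.25 across species (made-up values)
def toy_sim(p, q):
    return 1.0 if p[0] == q[0] else 0.25


yeast_complex, fly_complex = ["y1", "y2"], ["f1"]
print(cross_species_average(yeast_complex, fly_complex, toy_sim))  # 0.25
print(mixed_average(yeast_complex, fly_complex, toy_sim))          # 0.5
```

In this toy case the mixed average is higher than the cross-species average because it also counts the within-species pair.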
The above results show that, although previous methods such as Mawish, Produles and Match-and-Split can yield a small number of conserved protein complexes that both match well with known protein complexes and are highly functionally related, UEDAMAlign is able to detect more high-quality conserved protein complexes that are functionally related, provided that an effective strategy is used to partition the PPI network, i.e. inputting known protein complexes or those predicted by effective computational methods such as Coach.

Validation based on experimental data of AlignNemo
Our UEDAMAlign method uses the same lenient criteria as AlignNemo to align two PPI networks. The main difference between the two methods is whether the PPI networks are divided before alignment. However, AlignNemo cannot produce results on our experimental data. For a fair comparison, we compare our method with AlignNemo and AlignMCL on AlignNemo's experimental data [17]. Table 4 shows basic information about their results. The results of the two alignments in Tables 5 and 6 show that UEDAMAlignKnownComplex outperforms all compared methods in terms of F-measure, coverage rate and Avg_mixed value, which suggests that it can yield high-quality conserved protein complexes that not only match well with known protein complexes but are also highly functionally related to their counterparts. UEDAMAlignCoach has the second-best performance among all compared methods in terms of F-measure and coverage rate. Its Avg_mixed value is comparable to those of UEDAMAlignCFinder (k = 4) and UEDAMAlignCMC. As for UEDAMAlignCFinder and AlignNemo, UEDAMAlignCFinder (k = 4) divides the PPI network by using CFinder to detect the 4-cliques in it, whereas AlignNemo detects conserved protein complexes from the alignment graph by extracting 4-subgraphs. UEDAMAlignCFinder (k = 4) has a higher F-measure and Avg_mixed value than AlignNemo and a comparable coverage rate. As for UEDAMAlignMCL and AlignMCL, both use the MCL method to partition the network, before or after aligning the two networks, respectively. UEDAMAlignMCL has a higher F-measure and Avg_mixed value than AlignMCL and a comparable coverage rate. All of these facts verify the effectiveness of our methods, which take the dividing-and-matching strategy to align two networks.

Effect of parameters on performance
The other contribution of UEDAMAlign is that it can apply unequally lenient criteria when comparing two PPI networks. It uses two parameters, l and r, to control the number of walking steps taken in the two input PPI networks and therefore the distance that a protein pair can span in the corresponding network. For example, when aligning the networks of yeast and fruit fly, setting l and r to 2 and 3, respectively, means that UEDAMAlign locally extends a pair of homologous proteins if there exists a path of length at most 2 connecting the yeast node in the homologous protein pair or a path of length at most 3 connecting the fruit fly node in the homologous protein pair. In particular, when l and r are both set to 2, UEDAMAlign achieves the same performance as DAMAlign in detecting conserved protein complexes.
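The role of l and r as per-network path-length thresholds can be illustrated with a bounded breadth-first search. This is a sketch of the distance test only, under the assumption that networks are given as adjacency dictionaries; the helper name `within_distance` and the toy networks are hypothetical, and the sketch does not reproduce the full extension procedure.

```python
from collections import deque


def within_distance(adj, source, target, max_len):
    """True if a path of at most max_len edges joins source and target."""
    if source == target:
        return True
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_len:
            continue  # do not walk further than the threshold
        for nxt in adj.get(node, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False


# Toy networks: a 3-node path for "yeast", a 4-node path for "fly"
yeast = {"y1": ["y2"], "y2": ["y1", "y3"], "y3": ["y2"]}
fly = {"f1": ["f2"], "f2": ["f1", "f3"], "f3": ["f2", "f4"], "f4": ["f3"]}

print(within_distance(yeast, "y1", "y3", 2))  # True  (threshold l = 2)
print(within_distance(fly, "f1", "f4", 3))    # True  (threshold r = 3)
print(within_distance(fly, "f1", "f4", 2))    # False (r = 2 is too strict)
```

The last two calls show how raising the threshold on the sparser network from 2 to 3 admits a connection that the stricter setting rejects.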
To investigate the effect of the unequally lenient strategy on the detection of conserved protein complexes, we vary the two parameters between 2 and 3 and evaluate the prediction accuracy of UEDAMAlign when known protein complexes or Coach are used to partition the input PPI networks. Tables 7 and 8 show that, in the two alignments, UEDAMAlign does not always perform best in terms of F-measure and Avg_mixed values when l and r are both set to 2. For example, when aligning human and fruit fly, UEDAMAlignKnownComplex with l and r set to 3 and 2 outperforms the configuration in which l and r are both set to 2. As Table 7 shows, whichever of the two partition methods UEDAMAlign uses, its highest F-measure for the alignment of yeast and fruit fly is achieved when l and r are set to unequal values. In particular, both UEDAMAlignKnownComplex and UEDAMAlignCoach achieve their highest F-measures when l and r are set to 2 and 3. For the alignment of human and fruit fly, UEDAMAlign has its highest F-measure values when l and r are set to equal values. In particular, UEDAMAlignKnownComplex has its highest F-measure when both parameters are set to 2, and UEDAMAlignCoach achieves its highest F-measure when both parameters are set to 3. Analyzing the structure and topology of the three PPI networks, we find that the yeast PPI network contains 5,093 proteins and 22,570 interactions, with an average path length of about 3.84; the fruit fly PPI network contains 7,916 proteins and 20,289 interactions, with an average path length of about 4.5; and the human PPI network contains 13,398 proteins and 86,307 interactions, with an average path length of about 4.2. These statistics suggest that the fruit fly PPI network is sparser than that of yeast and similar in density to that of human, which may explain the difference in criteria for comparing the two pairs of PPI networks.
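The text does not say how the average path length was computed. One common definition, sketched below, averages shortest-path lengths over all reachable node pairs using breadth-first search; the function name `average_path_length` and the toy graph are assumptions.

```python
from collections import deque


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Toy undirected path a - b - c: distances are 1, 2, 1, 1, 2, 1 over ordered pairs
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(round(average_path_length(toy), 3))  # 1.333
```

Running one search per node is quadratic in network size, which is acceptable for a sketch but may be slow on networks with tens of thousands of proteins.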
Table 8 shows that the conserved protein complexes that match well with known protein complexes are less biologically relevant.
For example, when l and r are set to 2 and 3, UEDAMAlignKnownComplex and UEDAMAlignCoach achieve their highest F-measures when aligning yeast and fruit fly. However, the conserved protein complexes they detect under this setting have lower biological relevance, as shown by the lowest Avg_mixed values. There are two possible reasons. First, homologous protein pairs with low functional similarity may have been introduced into the identified conserved protein complexes. Second, the proteins in the conserved protein complexes may share functions with their homologous proteins that have not yet been discovered by biologists.
The results in Tables 7 and 8 verify that UEDAMAlign can apply unequally lenient criteria to the two compared PPI networks by setting the parameters l and r. However, choosing suitable values for l and r with respect to the differences between the two input networks remains a major challenge.

Conclusion
The aim of this work is to detect protein complexes conserved across species by locally aligning a pair of PPI networks. Most previous methods adopt equally lenient criteria on the two compared networks and fail to consider the differences between them. Considering that PPI networks are modular and that an increasing amount of known protein complex data is available, we propose a new dividing-and-matching-based method named UEDAMAlign to detect conserved protein complexes. UEDAMAlign detects subnetworks in one of the PPI networks and maps these subnetworks to the other one. After that, UEDAMAlign uses a heuristic strategy to find the common connected components from the subnetworks and their homologous proteins in the other network. In the course of finding common connected components, UEDAMAlign applies lenient criteria that may vary with its parameters according to the topological features of the input PPI networks.
To assess the effectiveness of UEDAMAlign, we carry out two alignments, yeast with fruit fly and human with fruit fly. Comparisons are made between other existing methods and UEDAMAlign when it uses the same lenient criteria as AlignNemo and DAMAlign to locally extend a pair of homologous proteins (parameters l and r are set to 2). (1) The experimental results show that UEDAMAlign is superior to all other methods in recovering conserved protein complexes that both match known protein complexes well and have similar functions, provided that it uses an effective strategy to partition the PPI networks, for example known protein complexes or Coach. (2) The fact that UEDAMAlignMCL outperforms AlignMCL and UEDAMAlignCFinder outperforms AlignNemo confirms the effectiveness of the dividing-and-matching strategy of UEDAMAlign. (3) The experimental results obtained with various values of the parameters l and r verify that UEDAMAlign can apply unequally lenient criteria to the two compared PPI networks. However, choosing suitable values for l and r with respect to the differences between the two input networks remains a major challenge.

Figure 1 Eight cases (a-h) of connectivity in conserved protein complexes from two different PPI networks when UEDAMAlign adopts the same lenient criteria as AlignNemo to extend a pair of homologous proteins. Nodes of different colors come from different PPI networks. Solid lines connecting nodes of different colors represent their known homologous mappings. Dotted lines represent artificial homologous mappings produced by an unbalanced Bi-random walk algorithm. Solid lines connecting nodes of the same color represent their interactions.

Figure 2 Eleven cases (a-k) of connectivity in conserved protein complexes from two different PPI networks when parameters l and r are set to 2 and 3 in the course of extending a pair of homologous proteins. Nodes of different colors come from different PPI networks. Solid lines connecting nodes of different colors represent their known homologous mappings. Dotted lines represent artificial homologous mappings produced by an unbalanced Bi-random walk algorithm. Solid lines connecting nodes of the same color represent their interactions.
1. Exist in ModuleTwo.
2. Connect to a node in ModuleTwo through a path of length not more than the threshold (l for network P, r for network H).
In this case, put these counterparts into ModuleTwo. Since conserved complexes consist of homologous proteins, discard the proteins in ModuleOne or ModuleTwo that have no homologous protein. When all subnetworks in PPI network P have been considered, reverse the roles of PPI networks P and H: input the z subnetworks (h_1, h_2, ..., h_z) extracted from PPI network H and repeat steps 1 to 3.

Algorithm 2 UEDAMAlign
Input: P (N × N), H (M × M), A (N, M), subnetworks (p_1, p_2, ..., p_k) of P, subnetworks (h_1, h_2, ..., h_z) of H, parameters l, r;
Output: predicted conserved complex lists moduleOneList for P and moduleTwoList for H;
According to matrices P, H and A and parameters l, r, build matrix R by using Algorithm 1;
for each subnetwork p_i of P do