On the family-free DCJ distance and similarity
 Fábio V Martinez^{1, 2},
 Pedro Feijão^{2},
 Marília DV Braga^{3} and
 Jens Stoye^{2}
https://doi.org/10.1186/s1301501500419
© Martinez et al.; licensee BioMed Central. 2015
Received: 10 February 2015
Accepted: 13 March 2015
Published: 1 April 2015
Abstract
Structural variation in genomes can be revealed by many (dis)similarity measures. Rearrangement operations, such as the so-called double-cut-and-join (DCJ), are large-scale mutations that can create complex changes and produce such variations in genomes. A basic task in comparative genomics is to find the rearrangement distance between two given genomes, i.e., the minimum number of rearrangement operations that transform one given genome into another one. In a family-based setting, genes are grouped into gene families and efficient algorithms have already been presented to compute the DCJ distance between two given genomes. In this work we propose the problem of computing the DCJ distance of two given genomes without prior gene family assignment, directly using the pairwise similarities between genes. We prove that this new family-free DCJ distance problem is APX-hard and provide an integer linear program for its solution. We also study a family-free DCJ similarity and prove that its computation is NP-hard.
Keywords
Genome rearrangement, DCJ, Family-free genome comparison
Background
Genomes are subject to mutations or rearrangements in the course of evolution. Typical large-scale rearrangements change the number of chromosomes and/or the positions and orientations of genes. Examples of such rearrangements are inversions, translocations, fusions and fissions. A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of rearrangements required to transform a given genome into another given genome [1].
In order to study this problem, one usually adopts a high-level view of genomes, in which only “relevant” fragments of the DNA (e.g., genes) are taken into consideration. Furthermore, a preprocessing of the data is required, so that we can compare the content of the genomes.
One popular method, adopted for more than 20 years, is to group the genes in both genomes into gene families, so that two genes in the same family are said to be equivalent. This setting is said to be family-based. Without gene duplications, that is, with the additional restriction that each family occurs exactly once in each genome, many polynomial-time models have been proposed to compute the genomic distance [2-5]. However, when gene duplications are allowed, the problem is more intricate and all approaches proposed so far are NP-hard, see for instance [6-10].
It is not always possible to classify each gene unambiguously into a single gene family. For this reason, an alternative to the family-based setting was proposed recently, which consists of studying the rearrangement distance without prior family assignment. Instead of families, the pairwise similarity between genes is used directly [11,12]. This approach is said to be family-free. Although the family-free setting seems to be at least as difficult as the family-based setting with duplications, its complexity is still unknown for various distance models.
In this work we are interested in the problem of computing the distance of two given genomes in a family-free setting, using the double cut and join (DCJ) model [5]. The DCJ operation, which consists of cutting a genome in two distinct positions and joining the four resultant open ends in a different way, represents most large-scale rearrangements that modify genomes. After preliminaries and a formal definition of the family-free DCJ distance, we present a hardness result, before giving an integer linear programming solution and showing its feasibility for practical problem instances. Finally, we also study the problem of computing the similarity (a counterpart of the distance function) of two given genomes in a family-free setting using the DCJ model and show its NP-hardness.
This paper is an extended version of [13], which was presented at the 14th Workshop on Algorithms in Bioinformatics, WABI 2014.
Preliminaries
Each gene g in a genome is an oriented DNA fragment that can be represented by the symbol g itself, if it has direct orientation, or by the symbol −g, if it has reverse orientation. Furthermore, each one of the two extremities of a linear chromosome is called a telomere, represented by the symbol ∘. Each chromosome in a genome can be represented by a string that can be circular, if the chromosome is circular, or linear and flanked by the symbols ∘ if the chromosome is linear. For the sake of clarity, each chromosome is also flanked by parentheses. As an example, consider the genome A={(∘ 3 −1 4 2 ∘),(∘ 5 −6 −7 ∘)} that is composed of two linear chromosomes.
Since a gene g has an orientation, we can distinguish its two ends, also called its extremities, and denote them by g^{t} (tail) and g^{h} (head). An adjacency in a genome is either the extremity of a gene that is adjacent to one of its telomeres, or a pair of consecutive gene extremities in one of its chromosomes. If we consider again the genome A above, the adjacencies in its first chromosome are 3^{t}, 3^{h}1^{h}, 1^{t}4^{t}, 4^{h}2^{t} and 2^{h}.
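The enumeration of adjacencies can be sketched in a few lines of Python. This is only an illustration: the function name `adjacencies`, the encoding of a linear chromosome as a list of signed integers, and the string form of the extremities (`"3t"` for 3^{t}) are assumptions made here, not notation from the paper.

```python
def adjacencies(chromosome):
    """List the adjacencies of a linear chromosome given as signed integers.

    A gene g > 0 is read tail-to-head, a gene g < 0 head-to-tail.
    Telomeric adjacencies consist of a single extremity.
    """
    def ends(g):
        name = str(abs(g))
        tail, head = name + "t", name + "h"
        # (left extremity, right extremity) in reading direction
        return (tail, head) if g > 0 else (head, tail)

    result = []
    prev_right = None
    for g in chromosome:
        left, right = ends(g)
        # first gene: its left extremity is next to a telomere
        result.append(left if prev_right is None else prev_right + left)
        prev_right = right
    if prev_right is not None:
        # last gene: its right extremity is next to a telomere
        result.append(prev_right)
    return result


# First chromosome of the example genome A: (∘ 3 −1 4 2 ∘)
print(adjacencies([3, -1, 4, 2]))
# ['3t', '3h1h', '1t4t', '4h2t', '2h']
```

The output reproduces the five adjacencies listed in the text for the first chromosome of A.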
Throughout this paper, let A and B be two distinct genomes and let \(\mathcal{A}\) be the set of genes in genome A and \(\mathcal{B}\) be the set of genes in genome B.
Adjacency graph and family-based DCJ distance
Observe that, in Figure 1, the number of genes is n=4 and AG(A,B) has one cycle and two odd paths. Consequently, the DCJ distance is d_{DCJ}(A,B)=4−1−2/2=2.
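The arithmetic above follows the standard family-based DCJ distance formula of [2], d_{DCJ} = n − c − i/2, where c is the number of cycles and i the number of odd paths in the adjacency graph. A minimal sketch, assuming the counts n, c and i have already been obtained (the function name `dcj_distance` is hypothetical):

```python
from fractions import Fraction


def dcj_distance(n_genes, n_cycles, n_odd_paths):
    """Family-based DCJ distance d = n - c - i/2 from adjacency-graph counts."""
    return Fraction(n_genes) - n_cycles - Fraction(n_odd_paths, 2)


# Figure 1 example: n = 4 genes, one cycle, two odd paths
print(dcj_distance(4, 1, 2))  # 2
```

Exact rational arithmetic is used so that the half-contribution of odd paths is not subject to rounding.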
Gene similarity graph for the family-free model
In the family-free setting, each gene in each genome is represented by a distinct symbol, thus \(\mathcal{A} \cap \mathcal{B} = \emptyset\) and the cardinalities \(|\mathcal{A}|\) and \(|\mathcal{B}|\) may be distinct. Let a be a gene in A and b be a gene in B; then their normalized similarity is given by the value σ(a,b), which ranges in the interval [0,1].
Reduced genomes and their weighted adjacency graph
Let A and B be two genomes and let GS_{σ}(A,B) be their gene similarity graph. Now let M={e_{1},e_{2},…,e_{n}} be a matching in GS_{σ}(A,B) and denote by \(w(M) = \sum_{e_{i} \in M} \sigma(e_{i})\) the weight of M, that is, the sum of its edge weights. Since the endpoints of each edge e_{i}=(a,b) in M are not saturated by any other edge of M, we can unambiguously define the function ℓ^{M}(a)=ℓ^{M}(b)=i. The reduced genome A^{M} is obtained by deleting from A all genes that are not saturated by M, and renaming each saturated gene a to ℓ^{M}(a), preserving its orientation. Similarly, the reduced genome B^{M} is obtained by deleting from B all genes that are not saturated by M, and renaming each saturated gene b to ℓ^{M}(b), preserving its orientation. Observe that the set of genes in A^{M} and in B^{M} is \(\mathcal{G}(M) = \{ \ell^{M}(g) \colon g\ \text{is saturated by the matching}\ M \} = \{1,2,\ldots,n\}\).
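The construction of the reduced genomes can be illustrated with a short sketch. The encoding is an assumption made for this example only: a genome is a list of chromosomes, each a list of signed integers, and a matching is a list of pairs (a, b) of unsigned gene names, where the i-th pair receives the label i. The function name `reduce_genomes` and the sample data are hypothetical.

```python
def reduce_genomes(genome_a, genome_b, matching):
    """Build the reduced genomes A^M and B^M for a matching M.

    Unsaturated genes are deleted; a gene saturated by the i-th edge
    is renamed to i and keeps its orientation (sign).
    """
    label_a = {a: i for i, (a, _) in enumerate(matching, start=1)}
    label_b = {b: i for i, (_, b) in enumerate(matching, start=1)}
    if len(label_a) != len(matching) or len(label_b) != len(matching):
        raise ValueError("not a matching: some gene is saturated twice")

    def reduce(genome, label):
        return [
            [(1 if g > 0 else -1) * label[abs(g)] for g in chrom if abs(g) in label]
            for chrom in genome
        ]

    return reduce(genome_a, label_a), reduce(genome_b, label_b)


# Hypothetical data: genes 1..3 in A, genes 4..7 in B
A = [[1, -2, 3]]
B = [[4, 5, -6, 7]]
M = [(1, 5), (3, 4)]  # e_1 = (1, 5), e_2 = (3, 4)
print(reduce_genomes(A, B, M))  # ([[1, 2]], [[2, 1]])
```

Both reduced genomes are over the same gene set {1, 2}, as stated for \(\mathcal{G}(M)\) in the text.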
The family-free DCJ distance
Based on the weighted adjacency graph, in [12] a family-free DCJ similarity measure has been proposed. We will come back to this measure later in this paper. Before that, to be more consistent with the comparative genomics literature, where distance measures are more common than similarities, here we also propose a family-free DCJ distance. This family-free distance is based on the weighted DCJ distance of reduced genomes. An important design criterion for this definition is that it must be the same as the (unweighted) family-based DCJ distance when all weights are equal to 1.
The first step in our definition is to consider the components of the graph AG_{σ}(A^{M},B^{M}) separately, similarly to the approach described previously for the family-based model. Here the contribution of each component C is denoted by d_{σ}(C) and must include not only the length |C| of the component, but also information about the weights of the edges in C. Basically, we need a function f(C) to use instead of |C| in the contribution function d_{σ}(C), such that: (i) when all edges in C have weight 1, f(C)=|C|, that is, the contribution of C is the same as in the family-based version; (ii) when the weights decrease, f should increase, because smaller weights mean less similarity, or increased distance between the genomes.
since the number of genes in \(\mathcal{G}(M)\) is equal to the size of M.
Observe that not only the components of the graph, but also the size and the weight of the matching influence the distance above. For example, in Figure 3, matching M_{1} gives the weighted adjacency graph with more components, but whose distance \(d_{\sigma}(A^{M_{1}},B^{M_{1}}) = 1 + 5 - 2.7 = 3.3\) is larger. On the other hand, M_{2} gives the weighted adjacency graph with fewer components, but whose distance \(d_{\sigma}(A^{M_{2}},B^{M_{2}}) = 2 + 5 - 3.9 = 3.1\) is smaller.
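The two values above can be reproduced with a small sketch. It assumes that the weighted distance has the form d_{σ}(A^{M},B^{M}) = d_{DCJ}(A^{M},B^{M}) + |M| − w(M); this form is inferred here from the worked numbers and is an assumption of the example, as is the function name `weighted_dcj_distance`.

```python
from fractions import Fraction


def weighted_dcj_distance(d_dcj, matching_size, matching_weight):
    """Assumed form: d_sigma = d_DCJ + |M| - w(M)."""
    return Fraction(d_dcj) + matching_size - Fraction(matching_weight)


# Values read off the Figure 3 example in the text
d1 = weighted_dcj_distance(1, 5, "2.7")  # matching M_1
d2 = weighted_dcj_distance(2, 5, "3.9")  # matching M_2
print(float(d1), float(d2))  # 3.3 3.1
```

Under this form, a matching whose edges all have weight 1 satisfies w(M) = |M|, so the weighted distance reduces to the unweighted DCJ distance, in line with the design criterion stated earlier.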
Problem FFDCJ-DISTANCE(A,B): Given genomes A and B and their gene similarities σ, calculate their family-free DCJ distance
$$ \textup{d}_{\textup{\textsc{ffdcj}}}(A, B) = \min_{M \in \mathbb{M}}\{ d_{\sigma}(A^{M},B^{M}) \}\:, \qquad (2) $$
where \(\mathbb{M}\) is the set of all maximal matchings in GS_{σ}(A,B).
Complexity of the family-free DCJ distance
Problem (s,t)-EXDCJ-DISTANCE(A,B): Given genomes A and B, where each family occurs at most s times in A and at most t times in B, obtain exemplar genomes A^{′} and B^{′} by removing all but one copy of each family in each genome, so that the DCJ distance d_{DCJ}(A^{′},B^{′}) is minimized.
We establish the computational complexity of the FFDCJ-DISTANCE problem by means of a polynomial-time and approximation-preserving (AP) reduction from the problem (1,2)-EXDCJ-DISTANCE, which is NP-hard [8]. Note that the authors of [8] only consider unichromosomal genomes, but the reduction can be extended to multichromosomal genomes, since an algorithm that solves the multichromosomal case also solves the unichromosomal case.
Theorem 1.
Problem FFDCJ-DISTANCE(A,B) is APX-hard, even if the maximum degrees in the two partitions of GS_{σ}(A,B) are respectively one and two.
Before proving the result, we need some definitions and particularly a formal definition of an AP-reduction. These definitions are based on [16].
An optimization problem is defined by three main elements: a set of instances, a set Sol(I) of feasible solutions for each instance I, and a function val that relates a nonnegative rational number val(I,S) to each instance I and solution S in Sol(I). Thus, in a minimization problem, the aim is to find a feasible solution of minimum value. That is, if Π is an optimization problem with an instance I, then we want to find S∈Sol(I) that minimizes val(I,S), called an optimal solution to the optimization problem. For an instance I, the value of an optimal solution is denoted by opt(I).

Let Π and Π^{′} be two optimization problems. An AP-reduction from Π to Π^{′} is a triple (f,g,β), where f and g are algorithms and β is a positive rational number, such that:
(AP1) f receives as input a positive rational number δ and an instance I of Π, and returns an instance f(δ,I) of Π^{′};
(AP2) g receives as input a positive rational number δ, an instance I of Π and an element S^{′} in Sol(f(δ,I)), and returns a solution g(δ,I,S^{′}) in Sol(I);
(AP3) for any positive rational number δ, f(δ,·) and g(δ,·,·) are polynomial-time algorithms;
(AP4) for any instance I of Π, any positive rational number δ, and any S^{′} in Sol(f(δ,I)), if
$$ \text{val}(f(\delta, I), S') \leq (1 + \delta) \: \text{opt}(f(\delta, I))\:, $$
then
$$ \text{val}(I, g(\delta, I, S')) \leq (1 + \beta\delta) \, \text{opt}(I)\:. $$
An AP-reduction from Π to Π^{′} is frequently denoted by Π ≤_{AP} Π^{′}, and we say that Π is AP-reduced to Π^{′}. An AP-reduction is a special type of reduction which preserves both the polynomiality property and the approximation factor.
Now, we can proceed with the proof of Theorem 1.
Proof 1.
(of Theorem 1). We give an AP-reduction (f,g,β) from (1,2)-EXDCJ-DISTANCE to FFDCJ-DISTANCE.
(AP2) Algorithm g receives as input a positive rational number δ, an instance (A,B) of (1,2)-EXDCJ-DISTANCE and a solution M of FFDCJ-DISTANCE, and transforms M into a solution (A_{X},B_{X}) of (1,2)-EXDCJ-DISTANCE. This is a simple construction: for each edge (i,k) in M, we add symbols a_{i} to A_{X} and b_{j} to B_{X}, where j=k−|A|. The value of δ does not influence the construction. In the example of Figure 4, a matching M={(1,7),(2,8),(−3,−10),(4,6)}, which is a solution to FFDCJ-DISTANCE(A_{F},B_{F}), is transformed by g into the genomes A_{X}={(∘ a_{1} a_{2} a_{3} a_{4} ∘)}={(∘ a c −b d ∘)} and B_{X}={(∘ b_{2} b_{3} b_{4} b_{6} ∘)}={(∘ d a c −b ∘)}, which is a solution to (1,2)-EXDCJ-DISTANCE(A,B).
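The index bookkeeping performed by g can be sketched as follows. The function name `matching_to_exemplar_indices` is hypothetical, signs on matching edges are treated as orientation markers and ignored, and |A|=4 is an assumption read off the example (A_{X} contains a_{1},…,a_{4} and the index shifts are consistent with it); Figure 4 itself is not reproduced here.

```python
def matching_to_exemplar_indices(matching, size_a):
    """For each edge (i, k) of M, keep a_i in A_X and b_j in B_X, j = |k| - |A|.

    Returns the sorted indices of the retained genes in A and in B.
    """
    kept_a = sorted(abs(i) for i, _ in matching)
    kept_b = sorted(abs(k) - size_a for _, k in matching)
    return kept_a, kept_b


M = [(1, 7), (2, 8), (-3, -10), (4, 6)]
print(matching_to_exemplar_indices(M, 4))
# ([1, 2, 3, 4], [2, 3, 4, 6])
```

The result matches the example: A_{X} retains a_{1}, a_{2}, a_{3}, a_{4} and B_{X} retains b_{2}, b_{3}, b_{4}, b_{6}.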
for some fixed positive rational number β.
Denote by c_{AG} and i_{AG} the number of cycles and odd paths, respectively, in the adjacency graph AG(A_{X},B_{X}), and by \(c_{{AG}_{\sigma}}\) and \(i_{{AG}_{\sigma}}\) the number of cycles and odd paths, respectively, in the weighted adjacency graph \({AG}_{\sigma}(A_{\mathrm{F}}^{M}, B_{\mathrm{F}}^{M})\).
Corollary 2.
There exists no polynomial-time algorithm for FFDCJ-DISTANCE with approximation factor better than 1237/1236, unless P = NP.
Proof.
As shown in [8], (1,2)-EXDCJ-DISTANCE is NP-hard to approximate within a factor of 1237/1236−ε for any ε>0. Therefore, the result follows immediately from [8] and from the AP-reduction in the proof of Theorem 1.
Since the weight plays an important role in d_{σ}, a matching with maximum weight, that is obviously maximal, could be a candidate for the design of an approximation algorithm for FFDCJ-DISTANCE. However, we can demonstrate that it is not possible to obtain such an approximation, with the following example.

M^{∗} is composed of all edges that have weight 1−ε. It has weight w(M^{∗})=(1−ε)|M^{∗}|=(1−ε)k/2. Its corresponding weighted adjacency graph \({AG}_{\sigma}(A^{M^{*}},B^{M^{*}})\) has |M^{∗}|−1 cycles and two odd paths, thus \(\text{d}_{\text{\textsc{dcj}}}(A^{M^{*}},B^{M^{*}})=0\). Consequently, we have \(d_{\sigma}(A^{M^{*}},B^{M^{*}}) = |M^{*}| - (1-\varepsilon)|M^{*}| = \varepsilon |M^{*}|\).
M is composed of all edges that have weight 1. It is the only matching with the maximum weight w(M)=|M|=k. Its corresponding weighted adjacency graph AG_{σ}(A^{M},B^{M}) has two even paths, but no cycles or odd paths, giving d_{dcj}(A^{M},B^{M})=|M|. Hence, d_{σ}(A^{M},B^{M})=2|M|−|M|=|M|.
This shows that, in general, a matching of maximum weight in GS_{σ}(A,B) can have d_{σ} arbitrarily far from the optimal solution: here the ratio d_{σ}(A^{M},B^{M})/d_{σ}(A^{M^{∗}},B^{M^{∗}}) = k/(εk/2) = 2/ε grows without bound as ε tends to 0. Hence such a matching cannot give an approximation for FFDCJ-DISTANCE(A,B).
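To make the gap concrete, the following sketch evaluates the two distances stated in the example, ε|M^{∗}| with |M^{∗}|=k/2 and |M|=k, and their ratio. The function name `max_weight_gap` is hypothetical, and k is assumed to be even.

```python
from fractions import Fraction


def max_weight_gap(k, eps):
    """Ratio d_sigma(M) / d_sigma(M*) for the example with |M| = k, |M*| = k/2."""
    d_star = eps * Fraction(k, 2)  # d_sigma for M*: eps * |M*|
    d_max = Fraction(k)            # d_sigma for the maximum weight matching M: |M|
    return d_max / d_star


for eps in (Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)):
    print(eps, max_weight_gap(10, eps))
# 1/10 20
# 1/100 200
# 1/1000 2000
```

The ratio equals 2/ε independently of k, so no constant approximation factor can be guaranteed by always choosing a maximum weight matching.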
ILP to compute the family-free DCJ distance
We propose an integer linear program (ILP) formulation to compute the family-free DCJ distance between two given genomes. This formulation is a slightly different version of the ILP for the maximum cycle decomposition problem given by Shao et al. [10] to compute the DCJ distance between two given genomes with duplicate genes. Besides the cycle decomposition in a graph, as was done in [10], we also have to take into account maximal matchings in the gene similarity graph and their weights.
Let A and B be two genomes with extremity sets X_{A} and X_{B}, respectively, and let G=GS_{σ}(A,B) be their gene similarity graph. The weight w(e) of an edge e in G is also denoted by w_{e}. Let M be a maximal matching in G. For the ILP formulation, a weighted adjacency graph H=AG_{σ}(A^{M},B^{M}) is such that V(H)=X_{A}∪X_{B} and E(H) has three types of edges: (i) matching edges that connect two extremities in different extremity sets, one in X_{A} and the other in X_{B}, if there exists one edge in M connecting these genes in G; the set of matching edges is denoted by E_{m}; (ii) adjacency edges that connect two extremities in the same extremity set if they are an adjacency; the set of adjacency edges is denoted by E_{a}; and (iii) self edges that connect two extremities of the same gene in an extremity set; the set of self edges is denoted by E_{s}. All edges in H are in E_{m}∪E_{a}∪E_{s}=E(H). Matching edges have weights defined by the normalized similarity σ, all adjacency edges have weight 1, and all self edges have weight 0. Notice that any edge in G corresponds to two matching edges in H.
Since all the y_{i} variables in the same cycle have the same label but a different upper bound, only one of the y_{i} can be equal to its upper bound i. This means that for each cycle there can be only one z_{i} equal to 1, and the sum of all z_{i} variables is the total number of cycles in the adjacency graph.
Simulations and experimental results
We performed some initial benchmarking experiments of the proposed ILP formulation. For this purpose, we produced datasets using the Artificial Life Simulator (ALF) [17]. Genome sizes varied from 1000 to 3000 genes, where the gene lengths were generated according to a gamma distribution with shape parameter k=3 and scale parameter θ=133. A birth-death tree with 10 leaves was generated, with PAM distance of 100 from the root to the deepest leaf. For the amino acid evolution, the WAG substitution model with default parameters was used, with Zipfian indels at a rate of 0.000005. For structural evolution, gene duplications and gene losses were applied with a rate of 0.001 and reversals and translocations with a rate of 0.0025. To test different proportions of rearrangement events, we also simulated datasets where the structural evolution rates had a 2- and 5-fold increase.
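As a rough illustration of the gene length model only, the following sketch draws lengths from a gamma distribution with the stated shape and scale using Python's standard library. It does not use ALF and is not the simulation pipeline of the paper; the function name `sample_gene_lengths` and the seed are assumptions of this example.

```python
import random


def sample_gene_lengths(n, shape=3.0, scale=133.0, seed=0):
    """Draw n gene lengths from Gamma(shape, scale); mean is shape * scale."""
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta): alpha is the shape, beta the scale
    return [rng.gammavariate(shape, scale) for _ in range(n)]


lengths = sample_gene_lengths(20000)
mean = sum(lengths) / len(lengths)
print(round(mean))  # close to the theoretical mean 3 * 133 = 399
```

With these parameters the expected gene length is kθ = 399, which gives a sense of the scale of the simulated genes.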
ILP running-time results for datasets with different genome sizes and evolutionary rates
                 1000 genes               2000 genes               3000 genes
                 r=1     r=2     r=5      r=1     r=2     r=5      r=1     r=2     r=5
Finished         35/45   10/45   2/45     45/45   9/45    1/45     45/45   7/45    3/45
Avg. Time (s)    99.66   6.97    0.53     0.47    0.70    3.31     0.45    2.03    213.15
Avg. Gap (%)     0.3     3.0     4.3      0       3.6     6.5      0       5.3     4.8
The family-free DCJ similarity
Here our goal is to study the problem of computing the family-free DCJ similarity, i.e., to find a matching in GS_{σ}(A,B) that maximizes s_{σ}. Similarly to the distance, the behaviour of the similarity does not correlate with the size of the matching. In other words, smaller matchings, that possibly discard gene assignments, can lead to higher similarities.
Observe that the parameter α can be adjusted in favor of gene similarity when α is closer to 0, or in favor of genome organization similarity, when α is closer to 1. The closer the parameter α is to 0, the closer we are to the problem of finding a maximum weighted matching in the gene similarity graph GS_{σ}(A,B). On the other hand, the closer α is to 1, the closer we are to the problem of computing s_{σ}(A^{M},B^{M}). A drawback of this model is that the weights of edges actually appear in both terms of the equation. Furthermore, there remains the problem of finding the “best” value for α.
Complexity of the family-free DCJ similarity
We have the following result for the family-free DCJ similarity.
Theorem 3.
Problem FFDCJ-SIMILARITY is NP-hard, even if the maximum degrees in the two partitions of the gene similarity graph are respectively one and two.
Proof.
We use a Cook reduction, which is a polynomial-time transformation, from (1,2)-EXDCJ-DISTANCE to FFDCJ-SIMILARITY.
Let A and B be any instance of (1,2)-EXDCJ-DISTANCE and let k be a positive integer, with k≤|A|, where |A| is the number of genes of a genome A. We suppose, without loss of generality, that A and B are circular multichromosomal genomes. We must construct a pair of circular genomes A_{F} and B_{F}, a normalized similarity measure σ for genes in A_{F} and B_{F}, and a positive integer k^{′}≤|A_{F}| such that the family-free DCJ similarity of A_{F} and B_{F} is at least k^{′} if and only if the exemplar DCJ distance of genomes A and B is at most k.
The construction of A_{F}, B_{F}, σ, and k^{′} is similar to the transformation f in (AP1) of the proof of Theorem 1. Assume that each gene family of the underlying gene set occurs at most once in A and at most twice in B. Let the genes of A be denoted a_{1},a_{2},…,a_{|A|} and the genes of B be denoted b_{1},b_{2},…,b_{|B|}. Then A_{F} and B_{F} are copies of A and B, respectively, except that symbol a_{i} in A_{F} is relabeled by i, keeping its orientation, and b_{j} in B_{F} is relabeled by j+|A|, also keeping its orientation. The normalized similarity measure σ for genes in A_{F} and B_{F} is defined as σ(i,k)=1 for i in A_{F} and k in B_{F}, such that a_{i} is in A, b_{j} is in B, a_{i} and b_{j} are in the same gene family, and k=j+|A|. Otherwise, σ(i,k)=0. It is easy to see that this construction can be accomplished in polynomial time.
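The relabeling and the definition of σ can be sketched as follows. The encoding is an assumption of this example: each gene is a pair (sign, family) and a genome is a list of chromosomes; the function name `family_free_instance` and the sample genomes are hypothetical. Pairs that do not appear in the returned dictionary have similarity 0.

```python
def family_free_instance(genome_a, genome_b):
    """Relabel a_i -> i and b_j -> j + |A|, and set sigma = 1 within a family."""
    size_a = sum(len(chrom) for chrom in genome_a)

    def relabel(genome, offset):
        relabeled, family_of = [], {}
        index = 0
        for chrom in genome:
            new_chrom = []
            for sign, family in chrom:
                index += 1
                family_of[index + offset] = family
                new_chrom.append(sign * (index + offset))  # keep orientation
            relabeled.append(new_chrom)
        return relabeled, family_of

    a_f, fam_a = relabel(genome_a, 0)
    b_f, fam_b = relabel(genome_b, size_a)
    sigma = {(i, k): 1 for i in fam_a for k in fam_b if fam_a[i] == fam_b[k]}
    return a_f, b_f, sigma


# Hypothetical instance: family "b" occurs once in A and twice in B
A = [[(1, "a"), (-1, "b")]]
B = [[(1, "b"), (1, "a"), (-1, "b")]]
print(family_free_instance(A, B))
# ([[1, -2]], [[3, 4, -5]], {(1, 4): 1, (2, 3): 1, (2, 5): 1})
```

Each gene is visited once and each pair of genes is compared once, so the sketch runs in time quadratic in the number of genes, consistent with the polynomial-time claim.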
Now we must show that the family-free DCJ similarity of A_{F} and B_{F} is at least k^{′} if and only if the exemplar DCJ distance of genomes A and B is at most k. Let n=|A|.
Conclusion
In this paper, we have defined a new distance measure for two genomes that is motivated by the double cut and join model, while not relying on gene annotations in the form of gene families. In case gene families are known and each family has exactly one member in each of the two genomes, this distance equals the family-based DCJ distance and thus can be computed in linear time. In the general case, however, it is NP-hard and even hard to approximate. Nevertheless, we could give an integer linear program for the exact computation of the distance that is fast enough to be applied to realistic problem instances. Similar theoretical results hold for the family-free DCJ similarity measure, which is NP-hard.
The family-free model has considerable potential when gene family assignments are not available or ambiguous; in fact, it can even be used to improve family assignments [18]. The work presented in this paper is another step in this direction.
Declarations
Acknowledgements
We would like to thank Tomáš Vinař who suggested that the NP-hardness of FFDCJ-DISTANCE could be proven via a reduction from the exemplar distance problem. FVM and MDVB are funded by the Brazilian research agency CNPq grants Ciência sem Fronteiras Postdoctoral Scholarship 245267/20123 and PROMETRO 563087/20102, respectively.
References
1. Sankoff D. Edit distance for genome comparison based on non-local operations. In: Proc. of CPM 1992. LNCS, vol. 644. Heidelberg: Springer Verlag: 1992. p. 121–35.
2. Bergeron A, Mixtacki J, Stoye J. A unifying view of genome rearrangements. In: Proc. of WABI 2006. LNBI, vol. 4175. Heidelberg: Springer Verlag: 2006. p. 163–73.
3. Bafna V, Pevzner P. Genome rearrangements and sorting by reversals. In: Proc. of FOCS 1993: 1993. p. 148–57.
4. Hannenhalli S, Pevzner P. Transforming men into mice (polynomial algorithm for genomic distance problem). In: Proc. of FOCS 1995: 1995. p. 581–92.
5. Yancopoulos S, Attie O, Friedberg R. Efficient sorting of genomic permutations by translocation, inversion and block interchanges. Bioinformatics. 2005; 21(16):3340–6.
6. Sankoff D. Genome rearrangement with gene families. Bioinformatics. 1999; 15(11):909–17.
7. Bryant D. The complexity of calculating exemplar distances. In: Sankoff D, Nadeau JH, editors. Comparative Genomics. Dordrecht: Kluwer Academic Publishers: 2000. p. 207–11.
8. Bulteau L, Jiang M. Inapproximability of (1,2)-exemplar distance. IEEE/ACM Trans Comput Biol Bioinf. 2013; 10(6):1384–90.
9. Angibaud S, Fertin G, Rusu I, Thévenin A, Vialette S. On the approximability of comparing genomes with duplicates. J Graph Algorithms Appl. 2009; 13(1):19–53.
10. Shao M, Lin Y, Moret B. An exact algorithm to compute the DCJ distance for genomes with duplicate genes. In: Proc. of RECOMB 2014. LNBI, vol. 8394. Heidelberg: Springer Verlag: 2014. p. 280–92.
11. Dörr D, Thévenin A, Stoye J. Gene family assignment-free comparative genomics. BMC Bioinformatics. 2012; 13(Suppl 19):3.
12. Braga MDV, Chauve C, Dörr D, Jahn K, Stoye J, Thévenin A, et al. The potential of family-free genome comparison. In: Chauve C, El-Mabrouk N, Tannier E, editors. Models and Algorithms for Genome Evolution, Chap. 13. London: Springer: 2013. p. 287–307.
13. Martinez FV, Feijão P, Braga MDV, Stoye J. On the family-free DCJ distance. In: Proc. of WABI 2014. LNBI, vol. 8701. Heidelberg: Springer Verlag: 2014. p. 174–86.
14. Braga MDV, Stoye J. The solution space of sorting by DCJ. J Comp Biol. 2010; 17(9):1145–65.
15. Feijão P, Meidanis J. SCJ: A breakpoint-like distance that simplifies several rearrangement problems. IEEE/ACM Trans Comput Biol Bioinf. 2011; 8(5):1318–29.
16. Ausiello G, Protasi M, Marchetti-Spaccamela A, Gambosi G, Crescenzi P, Kann V. Complexity and approximation: combinatorial optimization problems and their approximability properties. Heidelberg: Springer; 1999.
17. Dalquen DA, Anisimova M, Gonnet GH, Dessimoz C. ALF–a simulation framework for genome evolution. Mol Biol Evol. 2012; 29(4):1115–23.
18. Lechner M, Hernandez-Rosales M, Doerr D, Wieseke N, Thévenin A, Stoye J, et al. Orthology detection combining clustering and synteny for very large datasets. PLOS ONE. 2014; 9(8):107014.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.