EUCALYPT: efficient tree reconciliation enumerator
 Beatrice Donati†^{1, 2, 3},
 Christian Baudet†^{1, 2}Email author,
 Blerina Sinaimeri†^{1, 2},
 Pierluigi Crescenzi^{3} and
 MarieFrance Sagot^{1, 2}Email author
https://doi.org/10.1186/s1301501400313
© Donati et al.; licensee BioMed Central. 2015
Received: 29 November 2013
Accepted: 4 July 2014
Published: 23 January 2015
Abstract
Background
Phylogenetic tree reconciliation is the approach of choice for investigating the coevolution of sets of organisms such as hosts and parasites. It consists in a mapping between the parasite tree and the host tree using eventbased maximum parsimony. Given a cost model for the events, many optimal reconciliations are however possible. Any further biological interpretation of them must therefore take this into account, making the capacity to enumerate all optimal solutions a crucial point. Only two algorithms currently exist that attempt such enumeration; in one case not all possible solutions are produced while in the other not all cost vectors are currently handled. The objective of this paper is twofold. The first is to fill this gap, and the second is to test whether the number of solutions generally observed can be an issue in terms of interpretation.
Results
We present a polynomialdelay algorithm for enumerating all optimal reconciliations. We show that in general many solutions exist. We give an example where, for two pairs of hostparasite trees having each less than 41 leaves, the number of solutions is 5120, even when only timefeasible ones are kept. To facilitate their interpretation, those solutions are also classified in terms of how many of each event they contain. The number of different classes of solutions may thus be notably smaller than the number of solutions, yet they may remain high enough, in particular for the cases where losses have cost 0. In fact, depending on the cost vector, both numbers of solutions and of classes thereof may increase considerably. To further deal with this problem, we introduce and analyse a restricted version where host switches are allowed to happen only between species that are within some fixed distance along the host tree. This restriction allows us to reduce the number of timefeasible solutions while preserving the same optimal cost, as well as to find timefeasible solutions with a cost close to the optimal in the cases where no timefeasible solution is found.
Conclusions
We present Eucalypt, a polynomialdelay algorithm for enumerating all optimal reconciliations which is freely available at http://eucalypt.gforge.inria.fr/.
Keywords
Cophylogeny Reconciliation Enumeration algorithm Polynomial delay Hostparasite systemsBackground
Phylogenetic tree reconciliation has been the approach of choice for investigating the coevolution of sets of organisms such as hosts and parasites [13]. Besides the increasingly important role that reconciliation methods are likely to play in the study of coevolution, they have the advantage of being applicable to different types of data. For instance, they are extensively used for analysing the associations between genes and species [46], and between species and geological history [7]. The similarity between all three classes of problems was pointed out by Page already in 1994 [8] and further considered in [9,10]. More recently, a unique generalised formal model appeared in [11]. In this paper, we focus on the host/parasite associations but we want to call attention to the fact that, due to the similarity of the models, our algorithm can be straightforwardly applied to the other problems as well.
Reconciliation is modelled as a tree mapping problem – of the parasite tree onto the host one – with constraints. During the mapping process, four types of events are considered [1,12]. These are: cospeciation (this happens when both host and parasite speciate), duplication (when the parasite speciates but not the host, both new parasite species remaining associated with the host), loss (when the host speciates but not the parasite, leading to the loss of the parasite in one of the two new host species), and host switch (when the parasite speciates, one species remaining with its current host while the other switches, that is jumps to another). In the context of genespecies associations, this model is known as the DTL (for “Duplication, Transfer, and Loss”) model for the reconciliation problem and has been extensively studied (see, for example, [46,13,14]).
In the reconciliation problem, we are given a host tree H, a parasite tree P, and a mapping of the leaves of P to the leaves of H which reflects current knowledge on which existing parasites inhabit which hosts. By assigning a cost to each of the four types of events, we can obtain a parsimonious solution (or simply a reconciliation) according to the DTL model which minimises the total cost of the mapping. Additionally, if timing information is available, i.e. if we happen to know the order in which speciation events occurred in the host phylogeny, then any proposed reconciliation must also respect the temporal constraints imposed by the available timing information. In this case, the reconciliation problem can easily be solved using dynamic programming, in time polynomial in the size of the trees [15]. However, timing information may not be available or may be insufficiently reliable to be used with enough confidence. In such case, the reconciliation problem is NPhard [5,6,16]. A number of algorithms have been developed that allow for solutions that are biologically unfeasible, that is for solutions where some of the switches induce a contradictory timing ordering of the internal vertices of the host tree [13,14,17,18]. In this case, the algorithms are able to generate optimal solutions in polynomial time without guaranteeing the time feasibility constraint.
This is the situation we address in this paper. We treat the reconciliation problem in the absence of any timing information, i.e. the two phylogenetic trees are provided undated. In this context, there are two main issues that must be taken into account in the reconciliation approach. First, providing a single optimal solution is not a good option as it can be biologically unfeasible (observe however that this is what the majority of the existing reconciliation algorithms do). Second, given a cost model for the events, an exponential number of optimal reconciliations is possible. Thus, even when restricting to timefeasible solutions, this number can remain huge. For these reasons, the capacity to enumerate all optimal solutions becomes a crucial point.
To the best of our knowledge, the reconciliation algorithms that try to deal with more than one optimal solution are the following: COREPA [19], MOWGLI [18], JANE 4 [16], NOTUNG [14], and RANGERDTL [20]. However, MOWGLI assumes that the host and parasite trees are fully dated and computes just the number of optimal reconciliations without generating them. JANE 4 uses a heuristic based on a genetic algorithm for finding one or a number of solutions (not all and not necessarily of optimal cost). RANGERDTL can handle both dated and undated trees and can compute the total number of optimal reconciliations. However, the currently available version of RANGERDTL outputs only one optimal reconciliation. EUCALYPT and NOTUNG are the only publicly available algorithms that claim to generate all optimal reconciliations. However for most instances, COREPA enumerates only a proper subset of all optimal solutions (see the Results and discussion Section). NOTUNG was designed for a more general event model that includes duplications, losses, transfers, and incomplete lineage sorting (ILS). In particular, the DTL model is a special case when the species tree is binary. However, the algorithm imposes some restrictions on the cost values. Indeed, the cost of a cospeciation is always assumed to be equal to 0 and the cost of a loss positive.
We provide an algorithm that, given a cost model for the events, efficiently generates all the optimal solutions for the reconciliation problem. It is also possible for the algorithm to generate only optimal reconciliations that are timefeasible. EUCALYPT requires no assumption to be made concerning the cost values: it thus allows negative ones while cospeciation and loss may have any arbitrary cost. In addition, the algorithm can efficiently handle distancebounded host switches, i.e. cases where the host switches are allowed to happen only between species that are within some fixed distance along the host tree. Observe that this is not an artificial requirement. The significance of a host switch distance has already been pointed out in several studies [21,22]. Indeed, if parasites switched only between closely related hosts, this would lead to a higher degree of congruence between the parasite and host trees. When this information is available, it should thus be taken into account in the reconciliation process. Moreover, it can happen that for some datasets and cost vectors, there is no optimal timefeasible solution. One way to overcome this problem can be by varying the length of the farthest allowed switch until at least one timefeasible solution is obtained. On the contrary, in the case where the number of optimal timefeasible reconciliations is high, one can decrease their number by selecting a subset of them. This can be done by decreasing the value of the maximum allowed distance of a switch while maintaining the same optimal total cost. Finally, the complexity of the boundedswitch problem remains open, and it could be that this constraint makes the reconciliation problem solvable in polynomial time, which in turn is of both theoretical and practical interest.
We show that the algorithm we developed and describe in this paper, EUCALYPT (for “EnUmerator of Coevolutionary Associations in PoLYnomialTime delay” with one switch, of “P” before “LY”), loses no optimal solution, and is able to list all of them in lineartime delay: the time required for getting from one solution to the next one is indeed O(m) for m the number of vertices in the parasite tree, while finding the first solution requires O(n ^{3} m) time for n the number of vertices in the host tree. The space complexity for the whole enumeration process is also O(n ^{3} m).
We applied EUCALYPT to a number of hostparasite trees available in the literature, and to our own set of interest. We show that in general many optimal solutions exist. Indeed, as already noticed in other studies (see e.g. [20]) the number can sometimes be huge. We give an example where, for two pairs of hostparasite trees having each less than 41 leaves, the number of solutions is 5120, even when only timefeasible ones are kept. Depending on the cost vector, this may increase considerably (for the same example, to 4080384). EUCALYPT indeed comes with a procedure (in O(n ^{3}) time) for testing the time feasibility of a solution. The possibility to calculate the number of solutions without explicitly listing them all was also integrated in EUCALYPT. This has the same complexity as to enumerate a single solution. Finally, to facilitate interpreting the results even when a huge number of solutions is observed, the latter are classified in terms of the number of each event (cospeciation, duplication, loss, or host switch) that they contain. The number of different classes of solutions that must be examined further is often considerably smaller, but may remain high enough (for instance 275) depending on the cost vector.
Implementation
Model
A rooted phylogenetic tree is a leaflabelled tree that models the evolution of a set of taxa from their most recent common ancestor (placed at the root). The internal vertices of the tree correspond to the speciation events. The tree is rooted so a direction is intrinsically assumed that corresponds to the direction of evolutionary time. Henceforth, by a phylogenetic tree T, we thus mean a rooted tree with labelled leaves and where the root has indegree 0 and outdegree 2, the leaves have outdegree 0 and indegree 1 and every other vertex has indegree 1 and outdegree 2.
The model of hostparasite evolution we rely on in this paper is the eventbased one presented by Tofigh et al. [6], and later further analysed by Bansal et al. [13]. Let H,P be the phylogenetic trees for the host and parasite species respectively. We define ϕ as a function from the leaves of P to the leaves of H that represents the association between currently living host species and parasites. Such association is an input of our algorithm, together with the trees themselves. In this model, we allow each parasite to be related to one and only one host, while a host can be related to zero, one, or more than one parasite.
A reconciliation is a function γ from the set of nodes of P to the nodes of H that extends the mapping of the leaves ϕ. Besides this, γ must satisfy some logical constraints as for example: (a) a node of P cannot be mapped in an ancestor of the mapping of its father; and (b) one of the two children of an internal node v of P must be mapped in the subtree of H rooted at γ(v), the image of v in H.
A reconciliation can be uniquely associated to a multiset of events from cospeciation, duplication and switches. Indeed, for each node of the parasite tree, one can determine the event associated to that node by looking at the mappings of its children. For instance, a node v is associated with a cospeciation if its children are mapped in different subtrees rooted at the two children of γ(v). Finally, losses are identified by a multiset containing all the vertices h∈V(H) that are in the path from the image of a vertex in V(P) and the image of one of its children. A detailed and formal description of the model can be found in Additional file 1.
The triple S=〈H,P,γ〉 is said to be a scenario or simply a reconciliation. Given a vector 〈c _{ c },c _{ d },c _{ s },c _{ l }〉 of real values that correspond to the cost of each type of event, the optimal reconciliations are the ones that minimise the total cost.
Finally, some combinations of host switches can introduce an incompatibility due to the temporal constraints imposed by the host and parasite trees, as well as by the reconciliation itself. Determining whether a reconciliation is timefeasible can be done in polynomial time [14]. It is common to refer to a timefeasible (unfeasible) solution as acyclic (cyclic), and in this paper we will use these terms interchangeably.
The All Most Parsimonious Reconciliation problem (AllMPR problem) consists in generating all reconciliations of minimum cost. Finally, we consider the problem when the host switches are distancebounded, i.e. there exists a bound on the maximum distance k along the host tree to which a parasite can jump. We call this the kboundedAllMPR problem. Clearly this is a generalisation in the sense that we trivially obtain the unbounded version by setting k equal to the longest path in the host tree.
We describe next the algorithm EUCALYPT that solves both problems in polynomial delay. As the set of optimal reconciliations produced may contain both cyclic and acyclic solutions, we also include an acyclicity test based on [14], for selecting only the timefeasible reconciliations.
Algorithm
Finding one solution
In the same way as the algorithms which find one single reconciliation (possibly cyclic) of minimum cost, that is, which solve the socalled DTL problem [6,13,14], EUCALYPT uses a dynamic programming approach to find one or to enumerate all optimal reconciliations. In this approach, each (p,h) cell of the m by n dynamic programming matrix, let us denote it by D, contains a single (real or integer) number that represents the cost of an optimal (sub)reconciliation mapping vertex p in the parasite tree to vertex h in the host tree. The matrix is filled following a postorder traversal of P and H. Bansal et al. provided an algorithm that finds the cost of one optimal reconciliation in time and space O(n m) [13]. We adapted the algorithm for solving the more general kboundedAllMPR problem. More precisely, let k be the maximum allowed host switch distance. Adding to the dynamic programming procedure a test for checking the distance of a host switch, we obtain an algorithm whose time complexity is \(O(nm\dot {2^{k}})\). However, for a constant value of k, this complexity remains in O(n m). Observe that if we do not require the k bound on the distance of the switches, one could easily replace the algorithm by the theoretically faster method of Bansal et al.
Finding one optimal solution requires keeping trace of a path in the matrix leading to the minimum cost. This is easily done by keeping in each D(p,h) cell not only the cost associated to it but a pair of pointers to one mapping for the children p _{1},p _{2} of p having led to such cost.
Enumerating all optimal solutions
Complexity analysis
The space complexity of EUCALYPT is O(n ^{3} m). For each of the mn steps of the dynamic programming procedure, we create at most n ^{2} objects. All the additional structures used to iterate over this matrix have size O(n). To evaluate the time complexity of the whole enumeration process, we separate the time needed for filling matrix D, and the time for traversing it in order to produce a single solution, or to enumerate all of them. The number of steps needed for filling the matrix the first time is O(n ^{3} m) because each cell may contain, in the worst case, a list of n ^{2} pairs of pointers. Since the height of the DAG is bounded by 2m, it can be traversed using only a linear size support structure. Moreover, at each iteration that leads to one solution, a subtree of size 2m of the DAG is visited. In particular, each time we are visiting a parent vertex, we need to add its mapping to the current solution and then follow the DAG looking for the mappings of its two children. An entire solution (which is composed by m mappings) is complete when at most 2 vertices (one parent and one subsolution) have been visited for each vertex of P. This guarantees that once we produce the first solution, obtaining each one of the others in turn requires only linear time and linear space.
Finally, it is possible to enumerate only timefeasible solutions. To this purpose we have implemented a timefeasibility test defined in [14] which has a time complexity of O(n ^{2}).
Results and discussion
Datasets
To test EUCALYPT, we selected 12 datasets from the literature. As we are mostly interested in hostparasite systems, the first 10 datasets concern such relations: EC  Encyrtidae & Coccidae [23], GL  Gopher & Lice [24], SC  Seabirds & Chewing Lice [25], RP  Rodents & Pinworms [26], SCF  Smut Fungi & Caryophillaceus plants [27], PLML  Pelican Lice ML [28] (the trees are generated through a maximum likelihood approach), PLMP  Pelican Lice MP [28] (the trees are generated through a maximum parsimony approach), RH  Rodents & Hantaviruses [29], PP  Primates & Pinworm [30], and FD  Fishs and Dactylogyrus [31].
In addition, we used a dataset of our own which corresponds to arthropod hosts and a bacterium genus, Wolbachia, living inside the cells of their hosts [32,33]. The datasets were chosen to provide a variety in terms of size of the host and parasite trees: those from the literature are relatively small (from 7 to 100 leaves), while our own data provide an example of much bigger host and parasite trees, each having 387 leaves. Moreover, we were careful that the selected datasets cover, as much as possible, a range of situations in terms of coevolution and of the expected frequencies of each event. Finally, since EUCALYPT can be applied to any type of datasets compatible with the model, we also tested it on a genesspecies dataset from [17] that had previously been used by [20]. The dataset has 3983 unrooted gene trees.
To be able to run our algorithm on this dataset, we rooted the trees using an approach similar to [20]: for each possible rooted version of the unrooted gene tree, we consider the optimal cost of the reconciliation and choose the rooting that has minimum cost among all. In this paper, we only show the results concerning 4 datasets: COG2085, COG3715, COG4964, COG4965. The choice of these datasets is motivated by the fact that they show different behaviour, in particular as concerns the kboundedAllMPR problem.
Number of solutions found by each one of the programs COREPA, NOTUNG and EUCALYPT
Dataset  Leaves  Cost vector  Reconciliations  

H  P  COREPA  EUCALYPT  
#T  #C  #A  #T  #C  #A  #CA  
EC  7  10  〈0,1,1,1〉  16  6  10  16  6  10  5 
〈0,1,2,1〉  14  0  14  18  0  18  6  
〈0,2,3,1〉  12  0  12  16  0  16  4  
GL  8  10  〈0,1,1,1〉  2  0  2  2  0  2  1 
〈0,1,2,1〉  2  0  2  2  0  2  1  
〈0,2,3,1〉  2  0  2  2  0  2  1  
SC  11  14  〈0,1,1,1〉  1  0  1  1  0  1  1 
〈0,1,2,1〉  1  0  1  1  0  1  1  
〈0,2,3,1〉  1  0  1  1  0  1  1  
RP  13  13  〈0,1,1,1〉  18  2  16  18  2  16  3 
〈0,1,2,1〉  3  1  2  3  1  2  1  
〈0,2,3,1〉  3  1  2  3  1  2  1  
SFC  15  16  〈0,1,1,1〉  184  40  144  184  40  144  1 
〈0,1,2,1〉  40  40  0  40  40  0  0  
〈0,2,3,1〉  40  40  0  40  40  0  0  
PLML  18  18  〈0,1,1,1〉  158  0  158  180  0  180  4 
〈0,1,2,1〉  2  0  2  2  0  2  1  
〈0,2,3,1〉  11  0  11  11  0  11  2  
PLMP  18  18  〈0,1,1,1〉  2  0  2  2  0  2  1 
〈0,1,2,1〉  2  0  2  2  0  2  1  
〈0,2,3,1〉  17  0  17  18  0  18  2  
RH  34  42  〈0,1,1,1〉  32  0  32  42  0  42  4 
〈0,1,2,1〉  158  158  0  2208  2208  0  0  
〈0,2,3,1〉  22  22  0  288  288  0  0  
PP  36  41  〈0,1,1,1〉  1000*  0  1000  5120  0  5120  4 
〈0,1,2,1〉  11  0  11  72  0  72  2  
〈0,2,3,1〉  11  0  11  72  0  72  2  
FD  20  51  〈0,1,1,1〉  1000*  282  718  25184  1792  23392  11 
〈0,1,2,1〉  108  44  64  408  132  276  5  
〈0,2,3,1〉  22  22  0  80  80  0  0  
COG2085  100  44  〈0,1,1,1〉  1000*  0  1000  44544  2304  42240  3 
〈0,1,2,1〉  1000*  0  1000  37568  480  37088  7  
〈0,2,3,1〉  888  0  888  46656  0  46656  4  
COG3715  100  40  〈0,1,1,1〉  1000*  1000  0  1172598  1155958  16640  6 
〈0,1,2,1〉  9  9  0  9  9  0  0  
〈0,2,3,1〉  13  13  0  33  33  0  0  
COG4964  100  27  〈0,1,1,1〉  85  85  0  224  224  0  0 
〈0,1,2,1〉  13  13  0  36  36  0  0  
〈0,2,3,1〉  17  17  0  54  54  0  0  
COG4965  100  30  〈0,1,1,1〉  1000*  408  592  17408  5632  11776  2 
〈0,1,2,1〉  141  0  141  640  0  640  2  
〈0,2,3,1〉  1000*  276  724  6528  1408  5120  2 
Comparison with COREPA and NOTUNG
In Table 1, we compare EUCALYPT to COREPA and NOTUNG for some cost vectors that commonly appear in the literature.
As mentioned in the introduction, COREPA does not always enumerate all solutions, as shown in Table 1. In the Additional file 1, we give an explicit example from one of the datasets used (“Smut Fungi & Caryophillaceus plants” [27] with cost vector 〈0,1,1,1〉) where indeed EUCALYPT finds more (correct) solutions, some of which are acyclic, that are timefeasible. The same result is observed for other datasets (examples not shown).
Number of solutions found by the programs COREPA and EUCALYPT
Dataset  Leaves  Cost vector  Reconciliations  

H  P  COREPA  EUCALYPT  
#T  #C  #A  #T  #C  #A  #CA  
EC  7  10  〈−1,1,1,1〉  2  0  2  2  0  2  1 
〈0,1,1,0〉  18  0  18  24  0  24  8  
GL  8  10  〈−1,1,1,1〉  2  0  2  2  0  2  1 
〈0,1,1,0〉  12  0  12  12  0  12  5  
SC  11  14  〈−1,1,1,1〉  1  0  0  1  0  1  1 
〈0,1,1,0〉  82  2  80  113  3  110  18  
RP  13  13  〈−1,1,1,1〉  3  1  2  3  1  2  1 
〈0,1,1,0〉  69  25  44  117  45  72  29  
SFC  15  16  〈−1,1,1,1〉  40  40  0  40  40  0  0 
〈0,1,1,0〉  1000*  741  259  6332  5069  1263  81  
PLML  18  18  〈−1,1,1,1〉  2  0  0  2  0  2  1 
〈0,1,1,0〉  45  2  43  448  28  420  16  
PLMP  18  18  〈−1,1,1,1〉  2  0  0  2  0  2  1 
〈0,1,1,0〉  147  0  147  262  0  262  34  
RH  34  42  〈−1,1,1,1〉  197  197  0  1056  1056  0  0 
〈0,1,1,0〉  1000*  0  1000  4080384  310284  3770100  275  
PP  36  41  〈−1,1,1,1〉  17  0  17  144  0  144  2 
〈0,1,1,0〉  182  8  174  498960  55440  443520  129  
FD  20  51  〈−1,1,1,1〉  196  86  110  944  368  576  7 
〈0,1,1,0〉  1000*  1000  0  1.5×10^{15}  *  *  *  
COG2085  100  44  〈−1,1,1,1〉  1000*  0  1000  109056  26496  82560  3 
〈0,1,1,0〉  1000*  0  1000  3.5×10^{11}  *  *  *  
COG3715  100  40  〈−1,1,1,1〉  869  869  0  63360  63360  0  0 
〈0,1,1,0〉  1000*  0  1000  1.2×10^{12}  *  *  *  
COG4964  100  27  〈−1,1,1,1〉  13  13  0  36  36  0  0 
〈0,1,1,0〉  1000*  0  1000  8586842  2603598  5983244  300  
COG4965  100  30  〈−1,1,1,1〉  1000*  335  665  44800  13312  31488  5 
〈0,1,1,0〉  1000*  0  1000  907176  387192  519984  208 
NOTUNG generates only timefeasible reconciliations and their number coincides with the result of EUCALYPT for all the datasets used. However, NOTUNG imposes some restrictions on the cost values. Indeed, the cost of a cospeciation is always assumed to be equal to 0 and the cost of a loss positive.
Non positive cost vectors
No assumption needs to be made by EUCALYPT concerning the cost values: it thus allows negative ones while cospeciation and loss may have any arbitrary cost. As already mentioned, in this case we can compare it only with COREPA. The results of these experiments are presented in Table 2. In almost all of the cases, COREPA does correctly determine the total number of (un)feasible optimal solutions.
Results of EUCALYPT and discussion
The results obtained by EUCALYPT and presented in Tables 1 and 2 are striking for various reasons.
First, we observe that when the size of the tree increases, in most cases the total number of optimal solutions also increases. However, this does not hold for the number of timefeasible optimal solutions. For instance, according to the results given in Table 1, for the dataset EC having 7 and 10 leaves, the number of timefeasible optimal solutions is much higher than for the case of dataset COG4964 (having 100 and 27 leaves). Even for the same dataset, this number can be reduced significantly depending on the cost vector. In particular when the cost of the losses is 0, the number of optimal solutions can be huge even for relatively small datasets, such as for example FD (2051 leaves). This makes it practically impossible to check the timefeasibility of each of them (this explains the presence of a * in some cells of Tables 1 and 2). Thus, it seems that the cost vector and the topology of the trees together with the mapping of the leaves play a more important role in the total number of timefeasible solutions.
We also tested EUCALYPT on the much bigger trees of Wolbachia and the arthropods. Due to limitations in space and time, we could not enumerate all optimal solutions because their number is huge: 1.01×10^{47} for cost vector 〈−1,1,1,1〉, 3.87×10^{136} for cost vector 〈0,1,1,0〉, 3.19×10^{48} for cost vector 〈0,1,1,1〉, and finally 1.01×10^{47} for cost vector 〈0,1,2,1〉. We did however enumerate optimal solutions until one was produced that was found acyclic.
For the cost vector 〈0,1,1,1〉, the first produced optimal solution was already acyclic as were those that were enumerated next, hinting to the possibility that the proportion of acyclic solutions is high among all optimal ones. For the remaining cost vectors, the initial optimal solutions enumerated by EUCALYPT were indeed all cyclic, and given their number and the time required by each acyclicity test, we stopped the process of checking after one week. Two cases are then possible: either there are no acyclic solutions meaning optimal ones have a higher cost; or the proportion of acyclic solutions is low among all optimal ones.
The results confirm once again that the number of optimal reconciliations can be huge. Moreover, the problem remains even if we restrict the results to only timefeasible solutions. In order to deal with this huge set of solutions, we propose to group them in classes depending on the number of events observed. As shown in Tables 1 and 2, the number of classes in a set of timefeasible optimal solutions is significantly smaller compared to the size of the set itself.
For instance, for dataset EC and vector 〈0,1,1,1〉, the 10 optimal timefeasible solutions are split in 5 classes 〈# _{ c },# _{ d },# _{ s },# _{ l }〉 as follows: 1 for 〈4,1,4,1〉, 3 for 〈4,0,5,1〉, 2 for 〈5,0,4,2〉, 2 for 〈3,1,5,0〉 and, 2 for 〈3,0,6,0〉. For dataset RP and vector 〈0,1,1,1〉, the 16 optimal timefeasible solutions are split in 3 classes: 4 for 〈7,0,5,3〉, 2 for 〈4,0,5,1〉 and, 10 for 〈6,0,6,2〉. Even more interestingly, for dataset SFC and vector 〈0,1,1,1〉, the 144 optimal timefeasible solutions belong to a unique class: 〈4,0,11,0〉. For more details on this type of analysis, we refer to Additional file 1.
Finally, we want to call attention to the fact that, in many cases of datasets and cost vectors, there is no optimal solution that is timefeasible. Indeed, in the case of the datasets from [17], from 3983 trees we choose 429 with between 20 and 50 leaves for which the total number of optimal reconciliations is less than 10000. Among these 429 trees (rooted according to the one that leads to a minimum total cost of the reconciliation) and using the vector 〈0,2,3,1〉, 233 (i.e. more than half) have no timefeasible solutions. To deal with this problem, we consider the restriction when the host switches are constrained to have bounded distance.
kboundedAllMPR problem
Searching for timefeasible solutions by varying k
Dataset  Costvector  

〈−1,1,1,1〉  〈0,1,2,1〉  〈0,2,3,1〉  
k _{ start } →k _{ A }  \(\boldsymbol {o \rightarrow o_{k_{A}}}\)  #A  k _{ start } →k _{ A }  \(\boldsymbol {o \rightarrow o_{k_{A}}}\)  #A  k _{ start } →k _{ A }  \(\boldsymbol {o \rightarrow o_{k_{A}}}\)  #A  
SFC  7→6  6→7  16  7→6  21→22  16  7→5  31→35  12 
RH  6→5  8→12  16  6→5  43→48  192  6→5  62→68  48 
COG3715  13→12  10→11  288  22→6  51→176  6  22→6  80→206  2 
COG4964  22→4  20→208  30  13→12  33→34  288  13→12  49→50  288 
Reducing the number of optimal timefeasible solutions by bounding k
Dataset  Costvector  

〈−1,1,1,1〉  〈0,1,2,1〉  〈0,2,3,1〉  
k / k ^{ ′ }  # A C / # A C ^{ ′ }  # T / # T ^{ ′ }  k / k ^{ ′ }  # A C / # A C ^{ ′ }  # T / # T ^{ ′ }  k / k ^{ ′ }  # A C / # A C ^{ ′ }  # T / # T ^{ ′ }  
EC  3/3  2/2  2/2  3/3  18/16  18/16  3/3  16/16  16/16 
GL  4/4  2/2  2/2  4/4  2/2  2/2  4/4  2/2  2/2 
SC  6/6  1/1  1/1  6/6  1/1  1/1  6/6  1/1  1/1 
RP  9/9  2/2  3/3  8/8  2/2  8/8  2/2  2/2  3/3 
PMP  6/6  2/2  2/2  6/6  2/2  2/2  5/5  11/4  11/4 
PML  5/5  2/2  2/2  5/5  2/2  2/2  3/3  18/6  18/6 
PP  4/4  144/96  144/96  4/4  72/48  72/48  4/4  72/48  72/48 
FD  9/10  576/240  944/512  9/9  276/4  408/8  −  −  − 
COG2085  14/14  82560/9408  109056/9408  14/14  37088/4032  37568/4032  14/14  46656/5184  46656/5184 
COG4965  16/16  31448/15744  44800/22400  16/16  640/320  640/320  13/16  5120/2560  6528/3328 
Conclusions
We presented in this paper a software, EUCALYPT, that can find one optimal reconciliation of a pair of host and parasite trees, can compute the number of all optimal solutions, and can enumerate them all. The first two problems are handled in polynomial time, while the enumeration has a polynomial delay complexity. EUCALYPT also displays the classes of solutions observed, where two solutions are in a same class if the number of each event in the two is the same. We show that although the number of classes of solutions may be considerably smaller than the total number of optimal reconciliations, it nevertheless may remain very high even for relatively small trees. Finally, we introduced the kboundedAllMPR problem and showed how it could be applied either to find optimal timefeasible solutions when the parsimonious method found none, or to reduce their number if this is too large to be handled in practice for further analysis. EUCALYPT takes a nexus file as input and generates all the information related to the reconciliations described in the paper. The datasets used in this paper are also available on the website of the software.
All the results of this study point to the necessity of introducing new criteria besides parsimony in the model for an optimal reconciliation. The idea of imposing an evolutionary distance to the host switches is one possible criterion when it can be justified from a biological point of view. Other types of information, such as for instance geographic, might enable also to indicate that certain mappings of internal vertices are impossible. One next improvement of EUCALYPT will therefore be to allow the user to indicate that certain associations of internal vertices should be forbidden. Finally, it remains an open question whether some such additional criteria could change the complexity of the reconciliation problem when only feasible (acyclic) solutions are to be found or enumerated.
Availability and requirements

Project name: Eucalypt

Project home page: http://eucalypt.gforge.inria.fr/

Operating system(s): Any

Programming languages: Java 1.6

Other requirements: None

License: CeCILL

Any restrictions to use by nonacademics: None.
Declarations
Acknowledgements
The authors would like to thank Valentino Bruni who was the first to notice that CoRePa may miss some optimal solutions when enumerating all. This result was presented in his bachelor thesis [34].
The research leading to the results presented in this paper was funded by the European Research Council under the European Community’s Seventh Framework Programme (FP7/20072013)/ERC grant agreement no. [247073]10, and by the French project ANR MIRI BLAN081335497. It was supported by the ANR funded LabEx ECOFECT. The fourth author was partially supported by the Italian Ministry of Education, University, and Research (MIUR) under PRIN 2012C4E3KT national research project “AMANDA – Algorithmics for MAssive and Networked DAta”.
Authors’ Affiliations
References
 Charleston MA. Jungles: a new solution to the host/parasite phylogeny reconciliation problem. Math Biosci. 1998; 149(2):191–223.PubMedView ArticleGoogle Scholar
 Charleston MA. Recent results in cophylogeny mapping. Adv Parasit. 2003; 54:303–30.View ArticleGoogle Scholar
 Merkle D, Middendorf M. Reconstruction of the cophylogenetic history of related phylogenetic trees with divergence timing information. Theor Biosci. 2005; 123(4):277–299.View ArticleGoogle Scholar
 Doyon JP, Ranwez V, Daubin V, Berry V. Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform. 2011; 12(5):392–400.PubMedView ArticleGoogle Scholar
 Hallett MT, Lagergren J. Efficient algorithms for lateral gene transfer problems In: Lengauer T, editor. Proceedings of the fifth annual international conference on computational biology (RECOMB 2001). New York, USA: ACM: 2001. p. 149–56.Google Scholar
 Tofigh A, Hallett M, Lagergren J. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinf. 2011; 8(2):517–35.View ArticleGoogle Scholar
 Rosen DE. Vicariant patterns and historical explanation in biogeography. Syst Biol. 1978; 27(2):159–88.Google Scholar
 Page RDM. Maps between trees and Cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol. 1994; 43:58–77.Google Scholar
 Maddison PW. Gene trees in species trees. Syst Biol. 1997; 46(3):523–36.View ArticleGoogle Scholar
 Ronquist F. Parsimony analysis of coevolving species associations. In: Tangled trees: Phylogeny, cospeciation and coevolution. Chicago: University of Chicago Press: 2002. p. 22–64.Google Scholar
 Wieseke N, Bernt M, Middendorf M. Unifying parsimonious tree reconciliation. In: 13th workshop on algorithms in bioinformatics (WABI 2013), Volume 8126 of lecture notes in computer science Sophia Antipolis. France: SpringVerlag Berlin Heidelberg: 2013. p. 200–14.Google Scholar
 Page RD, Charleston MA. Trees within trees: phylogeny and historical associations. Trends Ecol Evol. 1998; 13(9):356–9.PubMedView ArticleGoogle Scholar
 Bansal MS, Alm E, Kellis M. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics. 2012; 28(12):i283–91.PubMed CentralPubMedView ArticleGoogle Scholar
 Stolzer ML, Lai H, Xu M, Sathaye D, Vernot B, Durand D. Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics. 2012; 28(18):i409–15.PubMed CentralPubMedView ArticleGoogle Scholar
 LibeskindHadas R, Charleston MA. On the computational complexity of the reticulate cophylogeny reconstruction problem. J Comput Biol. 2009; 16:105–17.PubMedView ArticleGoogle Scholar
 Conow C, Fielder D, Ovadia Y, LibeskindHadas R. Jane: a new tool for the cophylogeny reconstruction problem. Algorithm Mol Biol. 2010; 5:16.View ArticleGoogle Scholar
 David LA, Alm EJ. Rapid evolutionary innovation during an Archaean Genetic Expansion. Nature. 2011; 469:93–6.PubMedView ArticleGoogle Scholar
 Doyon JP, Scornavacca C, Gorbunov KY, Szöllősi GJ, Ranwez V, Berry V, Tannier E. An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. In: Proceedings of the 8th annual RECOMB satellite workshop on comparative genomics (RECOMBCG 2010), Volume 6398 of lecture notes in bioinformatics. Ottawa, Canada: SpringVerlag Berlin Heidelberg: 2011. p. 93–108.Google Scholar
 Merkle D, Middendorf M, Wieseke N. A parameteradaptive dynamic programming approach for inferring cophylogenies. BMC Bioinformatics. 2010; 11(Supplementary 1):10.Google Scholar
 Bansal MS, Alm EJ, Kellis M. Reconciliation revisited: handling multiple optima when reconciling with duplication, transfer, and loss. In: Proceedings of the 17th international conference on research in computational molecular biology, RECOMB’13. Berlin: Heidelberg SpringerVerlag: 2013. p. 1–13.Google Scholar
 Vienne DMD, Giraud T, Skyhoff JA. When can host shifts produce congruent host and parasite phylogenies? A simulation approach. JEvol Biol. 2007; 20(4):1428–38.Google Scholar
 Poulin R, Mouillot D. Parasite specialization from a phylogenetic perspective: a new index of host specificity. Parasitology. 2003; 126:473–80.PubMedView ArticleGoogle Scholar
 Deng J, Yu F, Li HB, Gebiola M, Desdevises Y, Wu SA, et al.Cophylogenetic relationships between Anicetus parasitoids (Hymenoptera: Encyrtidae) and their scale insect hosts (Hemiptera Coccidae). BMC Evol Biol. 2013; 13:275.PubMed CentralPubMedView ArticleGoogle Scholar
 Hafner M, Nadler S. Phylogenetic trees support the coevolution of parasites and their hosts. Nature. 1988; 332:258–9.PubMedView ArticleGoogle Scholar
 Paterson A, Palma R, Gray R. Drowning on arrival, missing the boat, and xevents: How likely are sorting events? In: Page R D M, editor. Tangled trees: Phylogeny, cospeciation, and coevolution. Chicago: USA: UC Press: 2003. p. 287–307.Google Scholar
 Hugot JP. New evidence for hystricognath rodent monophyly from the phylogeny of their pinworms In: Page R D M, editor. Tangled trees: Phylogeny, cospeciation, and coevolution. Chicago: USA: UC Press: 2003. p. 144–73.Google Scholar
 Refrégier G, Gac M, Jabbour F, Widmer A, Shykoff J, Yockteng R, et al.Cophylogeny of the anther smut fungi and their caryophyllaceous hosts: Prevalence of host shifts and importance of delimiting parasite species for inferring cospeciation. BMC Evol Biol. 2008; 8:100.PubMed CentralPubMedView ArticleGoogle Scholar
 Hughes J, Kennedy M, Johnson KP, Palma RL, Page RDM. Multiple cophylogenetic analyses reveal frequent cospeciation between pelecaniform birds and pectinopygus lice. Syst Biol. 2007; 56(2):232–51.PubMedView ArticleGoogle Scholar
 Ramsden C, Holmes E, Charleston M. Hantavirus evolution in relation to its rodent and insectivore hosts. Mol Biol Evol. 2009; 26:143–53.PubMedView ArticleGoogle Scholar
 Hugot JP. Primates and their pinworm parasites: the Cameron hypothesis revisited. Syst Biol. 1999; 48(3):523–546.PubMedView ArticleGoogle Scholar
 Balbuena JA, MíguezLozano R, BlascoCosta I. PACo: a novel procrustes application to cophylogenetic analysis. PLoS ONE. 2013; 8(4):e61048. http://dx.doi.org/10.1371%2Fjournal.pone.0061048.PubMed CentralPubMedView ArticleGoogle Scholar
 Simões PM, Mialdea G, Reiss D, Sagot MF, Charlat S. Wolbachia detection: an assessment of standard PCR protocols. Mol Ecol Resour. 2011; 11(3):567–72.PubMedView ArticleGoogle Scholar
 Simões PM. Diversity and dynamics of Wolbachiahost associations in arthropods from the society archipelago, French Polynesia, PhD thesis. France: University of Lyon 1; 2012.Google Scholar
 Bruni V. Algoritmi per la ricostruzione cofilogenetica, Master’s thesis. Florence, Italy: University of Florence, Faculty of Mathematical, Physical and Natural Sciences; 2013. [In Italian].Google Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.