A reconstruction problem for a class of phylogenetic networks with lateral gene transfers

Background Lateral, or Horizontal, Gene Transfers are a type of asymmetric evolutionary event where genetic material is transferred from one species to another. In this paper we consider LGT networks, a general model of phylogenetic networks with lateral gene transfers which consist, roughly, of a principal rooted tree with its leaves labelled on a set of taxa, and a set of extra secondary arcs between nodes in this tree representing lateral gene transfers. An LGT network gives rise in a natural way to a principal phylogenetic subtree and a set of secondary phylogenetic subtrees, which, roughly, represent, respectively, the main line of evolution of most genes and the secondary lines of evolution through lateral gene transfers.
Results We introduce a set of simple conditions on an LGT network that guarantee that its principal and secondary phylogenetic subtrees are pairwise different and that these subtrees determine, up to isomorphism, the LGT network. We then give an algorithm that, given a set of pairwise different phylogenetic trees $T_0,T_1,\ldots ,T_k$ on the same set of taxa, outputs, when it exists, the LGT network that satisfies these conditions and such that its principal phylogenetic tree is $T_0$ and its secondary phylogenetic trees are $T_1,\ldots ,T_k$.
Electronic supplementary material The online version of this article (doi:10.1186/s13015-015-0059-z) contains supplementary material, which is available to authorized users.


Background
In the traditional view of evolution, species evolve in a pattern ideally represented by a series of bifurcations in a tree. However, it is well known that many relevant evolutionary processes cannot be properly represented in a tree [1,2]. This has motivated the adoption, since as early as the second half of the XVIIIth century, of more general models to represent phylogenies [3]. One specific type of non-tree-like event is the Lateral, or Horizontal, Gene Transfer: the transfer of genetic material from one species to a different and, usually, taxonomically distant one [4]. Although these kinds of phenomena have been known since the 1950s [5,6], the current explosion of genomic and metagenomic data has revealed that they are much more frequent and important than previously thought, not only among unicellular species [7] but also, for instance, among plants [8] or from parasites to hosts [9].
Evolutionary histories including non-tree-like events are usually modelled by means of (evolutionary) phylogenetic networks [10,11]: rooted directed acyclic graphs with leaves bijectively labelled by a set of taxa. The study of phylogenetic networks has been an active field of research during recent years, as witnessed in [12], and many papers on the computational inference of phylogenetic networks with lateral gene transfer events from incongruent gene trees have been published: see, for instance, [13-17].
Although lateral gene transfers are modeled in these papers as arcs added to a tree, and hence the resulting phylogenetic networks are tree-based in the sense of [18], in most cases the mathematical model under consideration makes no reference to the base tree and all parents of a node are treated symmetrically. This is not accurate, because in lateral gene transfers, the resulting species acquires its DNA mostly from one, and only one, of its parents, which should be understood as its "principal" parent, in contrast to the other parents, which contribute to a much lesser extent and should be considered as "secondary" parents. This asymmetry is usually emphasized in graphical representations of phylogenetic networks with lateral gene transfers, like for instance those depicted in [19, Fig. 3] (which, according to Morrison [20], are the first published in the literature), but again seldom in the mathematical model. Actually, to the best of our knowledge, the only types of phylogenetic networks studied in the literature that explicitly distinguish between the primary, tree-like, line of evolution and the secondary lateral gene transfers are those in [18] and those in [21,22]. In [18] the primary line of evolution is given by choosing a base tree, but the authors are interested not in a reconstruction problem from a set of trees but in deciding whether such a base tree exists for a given phylogenetic network. Also, Górecki introduces species graphs in [21,22], although this author was not interested in the reconstruction of phylogenies but in modelling the evolution of genes in the context of the evolution of species.
In this paper we consider a general model of phylogenetic network with lateral gene transfers similar to the species graphs' approach: LGT networks, which consist roughly of a principal rooted tree with its leaves labelled on a set of taxa (and possibly with elementary, that is, out-degree 1, nodes) and a set of secondary arcs between nodes in this tree, representing lateral gene transfers, such that the resulting directed graph turns out to be rooted, acyclic, with its leaves labelled and its internal nodes unlabelled. Any such LGT network gives rise to a principal phylogenetic subtree (by suppressing out-degree 1 nodes in the principal subtree) and a set of secondary phylogenetic subtrees, each one of them obtained by replacing one arc in the principal subtree by one secondary arc with the same target node (and then recursively removing non-labelled leaves and out-degree 1 nodes). These phylogenetic subtrees can be understood, respectively, as representing the primary line of evolution and the secondary histories, involving one lateral gene transfer event.
We then introduce the subclass of restricted LGT networks, which are characterized by a set of conditions that guarantee that their principal and secondary phylogenetic subtrees are pairwise different and that these trees determine, up to isomorphism, the LGT network. We also give an algorithm that solves the corresponding reconstruction problem from incongruent trees: given a set of pairwise different phylogenetic trees $T_0, T_1, \ldots, T_k$ on the same set of taxa, to find, when it exists, the unique restricted LGT network such that its principal phylogenetic tree is $T_0$ and its secondary phylogenetic trees are $T_1, \ldots, T_k$. In order to test the models and algorithms introduced in this paper, we include a computational experiment on the database of phylogenetic trees given in [23].

Preliminaries
Let N = (V , E) be a directed acyclic graph. A node u ∈ V is a tree node if indeg(u) ≤ 1, and it is a reticulation otherwise. A node u is a root if indeg(u) = 0, and N is rooted (it is an rDAG, for short) if it has a single root. A node u is a leaf if outdeg(u) = 0, internal if it is not a leaf, and elementary if outdeg(u) = 1.
For every u, v ∈ V, if (u, v) ∈ E, we say that u is a parent of v and that v is a child of u. Whenever there exists a (directed) path from u to v, in symbols $u \rightsquigarrow v$, we say that u is an ancestor of v and that v is a descendant of u: notice in particular that every node is both an ancestor and a descendant of itself. A path $u \rightsquigarrow v$ is proper when $u \neq v$ (and then u is a proper ancestor of v and v is a proper descendant of u). A path $u \rightsquigarrow v$ is elementary when all its nodes, except at most v (but including its origin u), are elementary.
A tree is an rDAG without reticulations. In particular, trees may contain elementary nodes. Given an elementary node u in a tree T, in order to suppress it we perform the following operation: if u is the root, we remove it together with its incident arc; if, otherwise, u has parent w and child v, we remove u together with the arcs (w, u) and (u, v), and we replace them by an arc (w, v).
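As an illustration only, the suppression of elementary nodes described above can be sketched in Python. The representation used here (a dictionary `children` mapping each node to the list of its children, with leaves simply absent from the keys) and the function name `suppress_elementary` are assumptions made for this example; they are not part of the formal development.

```python
def suppress_elementary(children, root):
    """Return (children, root) after suppressing every out-degree-1 node.

    `children` maps a node to the list of its children (hypothetical
    representation); nodes without an entry are leaves.
    """
    # An elementary root is removed together with its incident arc,
    # so its unique child becomes the new root.
    while len(children.get(root, [])) == 1:
        root = children[root][0]

    def first_non_elementary(v):
        # Follow an elementary path downwards until it ends.
        while len(children.get(v, [])) == 1:
            v = children[v][0]
        return v

    new_children = {}
    stack = [root]
    while stack:
        u = stack.pop()
        # Each arc (u, c) is replaced by an arc to the end of the
        # elementary path starting at c.
        kids = [first_non_elementary(c) for c in children.get(u, [])]
        if kids:
            new_children[u] = kids
        stack.extend(kids)
    return new_children, root
```

For instance, in the tree with arcs r→a, a→b, a→3, b→c, c→1, c→2, the nodes r and b are elementary, and the sketch returns the tree rooted at a with arcs a→c, a→3, c→1, c→2.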
Two paths $u \rightsquigarrow v_1$ and $u \rightsquigarrow v_2$ in a tree T are bifurcating when they have the same origin and it is their only node in common. Given two nodes u, v in a tree T, their lowest common ancestor $\mathrm{LCA}_T(u, v)$ is their common ancestor that is a descendant of every other common ancestor of them. If u, v are not connected by a directed path, then $\mathrm{LCA}_T(u, v)$ is characterized by the fact that there exist bifurcating paths $\mathrm{LCA}_T(u, v) \rightsquigarrow u$ and $\mathrm{LCA}_T(u, v) \rightsquigarrow v$.
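The lowest common ancestor can be computed directly from this definition. The following minimal sketch assumes, for the sake of the example, that the tree is stored as a dictionary `parent` mapping every non-root node to its parent; the name `lca` is likewise hypothetical.

```python
def lca(parent, u, v):
    """Lowest common ancestor of u and v in a tree.

    `parent` maps each non-root node to its parent (assumed
    representation); the root has no entry.
    """
    # Collect all ancestors of u (every node is an ancestor of itself).
    ancestors = set()
    while u is not None:
        ancestors.add(u)
        u = parent.get(u)
    # Climb from v until an ancestor of u is reached; the first one
    # found is a descendant of every other common ancestor.
    while v not in ancestors:
        v = parent[v]
    return v
```

Since every node counts as its own ancestor, the sketch returns u itself when u is an ancestor of v.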
Let S be henceforth a finite, non-empty set of labels; in order to avoid unnecessary discussions of trivial cases, we shall always assume that S has more than one element. An S-rDAG is an rDAG endowed with a bijection between its set of leaves and S. We shall always identify, usually without further notice, each leaf in an S-rDAG with its label.
In this paper, by a phylogenetic network on S we mean an S-rDAG without elementary nodes. Notice, in particular, that we forbid in our phylogenetic networks the existence of reticulations with out-degree 1. The reason is that, unlike other interpretations [10,24,25,26], we understand that all nodes in a phylogenetic network represent species: each tree node represents a species produced by mutations from its immediate ancestor, while reticulations represent species that have appeared through "reticulate" events involving the interaction of more than one species. Therefore, an elementary node would represent a species that has only one descendant, and it is impossible to distinguish this ancestor species from its unique descendant through evolutive information only.
An S -tree is an S-rDAG without reticulations, that is, a tree endowed with a bijection between its set of leaves and S. A phylogenetic tree on S is a phylogenetic network on S without reticulations, or, equivalently, an S-tree without elementary nodes. Every S-tree gives rise to a phylogenetic tree on S by suppressing all its elementary nodes.
Given a phylogenetic tree T on S and a subset $S_0 \subseteq S$, the restriction of T to $S_0$ is the phylogenetic tree $T|_{S_0}$ on $S_0$ obtained by first taking the subtree of T supported on all ancestors of the leaves in $S_0$ and then suppressing elementary nodes.
Given an S-tree T = (V, E), the cluster of a node u ∈ V is the set $C_T(u) \subseteq S$ of labels of leaves that are descendants of u.
A triple on three different labels x, y, z ∈ S is a phylogenetic tree on {x, y, z}. Figure 1 depicts the only four possible triples on x, y, z, together with their Newick notation. The triple defined by a phylogenetic tree T on x, y, z ∈ S is the restriction of T to {x, y, z}; we shall denote it by $T_{x,y,z}$, and the set of all triples defined by T by $\Gamma(T)$.
Two S-rDAGs on the same set S are isomorphic if there exists an isomorphism of directed graphs between them that preserves the leaves' labels. Recall that two phylogenetic trees on S are isomorphic if, and only if, they have the same set of clusters, and also if, and only if, they define the same set of triples. We shall often abuse language by saying that two S-rDAGs are equal when they are actually isomorphic.
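To make the notions of cluster and triple concrete, here is a small Python sketch. It assumes a tree stored as a dictionary `children` (node to list of children) in which each leaf is identified with its label; the function names `clusters` and `triple` are hypothetical. The triple on x, y, z is read off from the clusters: it is ((a,b),c) exactly when some cluster contains a and b but not c, and it is the unresolved triple (x,y,z) otherwise.

```python
def clusters(children, root):
    """Set of clusters (frozensets of leaf labels) of all nodes of a tree."""
    found = {}

    def visit(u):
        kids = children.get(u, [])
        if not kids:
            # A leaf is identified with its label.
            found[u] = frozenset([u])
        else:
            found[u] = frozenset().union(*(visit(c) for c in kids))
        return found[u]

    visit(root)
    return set(found.values())


def triple(cluster_set, x, y, z):
    """Newick string of the triple defined on x, y, z by a set of clusters."""
    for a, b, c in ((x, y, z), (x, z, y), (y, z, x)):
        # a and b form a cherry iff some cluster separates them from c.
        if any(a in C and b in C and c not in C for C in cluster_set):
            return f"(({a},{b}),{c})"
    return f"({x},{y},{z})"
```

With this representation, the isomorphism test for phylogenetic trees recalled above amounts to comparing the two sets returned by `clusters`, regardless of how internal nodes are named.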

LGT networks
In [21,22], Górecki defined a species graph on a set of labels S as an S-tree endowed with a set of extra arcs, representing lateral gene transfers, that satisfies a set of restrictions motivated by their use in the representation of common evolutionary histories of species and genes. In this section we consider phylogenetic networks with lateral gene transfers more general than species graphs, by imposing only that the graph obtained by adding arcs to the tree is a phylogenetic network. In the next section we shall impose a new set of restrictions that will ensure the uniqueness of the solution of the reconstruction problem considered therein.

Definition 1 An LGT network on a set S is a phylogenetic network N = (V, E) on S together with a partition $E = E_p \sqcup E_s$ of its set of arcs such that $T_0(N) = (V, E_p)$ is an S-tree. The arcs in $E_p$ are called principal, and those in $E_s$, secondary. We shall call $T_0(N)$ the principal subtree of N.

Figure 2 depicts an LGT network and its principal subtree $T_0(N)$. It is easy to check that any species graph defines an LGT network. Using some other notations that appear in the literature, we also have that $T_0(N)$ is a switching of N [29] (or $T_0(N)$ is displayed by N [10]); also, N is tree-based and $T_0(N)$ is a distinguished base tree [18].
Let N be an LGT network. Since T 0 (N ) = (V , E p ) is an S-tree, every arc in N ending in a tree node is principal and the set of arcs ending in each reticulation h contains exactly one principal arc: we call its origin the principal parent of h, and its other parents, secondary parents. To ease the notations, we shall also say that the single parent of a tree node is its principal parent. We also split the children of every node v into principal and secondary, depending on the type of the arcs going from v to them. These definitions can be illustrated in Fig. 2; for instance, the node a is the principal parent of h, and the nodes c and d are its secondary parents; also, the leaf 4 is the principal child of c and the nodes h and k are its secondary children.
The rationale behind these definitions is as follows. In an LGT network, nodes represent species. The principal subtree represents the main line of evolution of these species; that is, the genetic material of a species comes mainly from its principal parent, possibly including mutations, while its secondary parents have introduced some genes in the species through lateral gene transfers. In this way, a secondary arc models a lateral gene transfer from its source to the principal parent of its target.
The fact that T 0 (N ) is an S-tree also implies that every internal node of N has some principal child. A node v is principally elementary when it has exactly one principal child, i.e., when it is elementary in T 0 (N ). Since N cannot contain elementary nodes, this implies that every principally elementary node is the source of some secondary arc. A principally elementary path in N is an elementary path in T 0 (N ).
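The relation between principally elementary nodes and secondary arcs can be checked mechanically. The sketch below assumes an LGT network given as two lists of arcs, `principal` and `secondary`, each arc being a pair (source, target); this encoding, the function names and the small example network are hypothetical.

```python
from collections import Counter


def principally_elementary(principal):
    """Nodes with exactly one principal child (elementary in T_0(N))."""
    out = Counter(u for u, _ in principal)
    return {u for u, d in out.items() if d == 1}


def elementary_in_network(principal, secondary):
    """Nodes of N with total out-degree 1, which a phylogenetic
    network is not allowed to contain."""
    out = Counter(u for u, _ in [*principal, *secondary])
    return {u for u, d in out.items() if d == 1}


# Hypothetical example: w has the single principal child 4 and is the
# source of the secondary arc (w, h), so it is not elementary in N.
principal = [('r', 'x'), ('r', 'w'), ('x', 1), ('x', 'h'),
             ('h', 2), ('h', 3), ('w', 4)]
secondary = [('w', 'h')]
```

In this example, `principally_elementary(principal)` contains only w, while `elementary_in_network(principal, secondary)` is empty, as required.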
A path in an LGT network N is principal when it consists only of principal arcs. The principal cluster of a node u is the set C T 0 (N ) (u) of leaves that are principal descendants of u; that is, that can be reached from u through principal paths.
For each secondary arc e = (u, h) in N, the secondary subtree T e (N ) of N associated to e is the tree obtained from T 0 (N ) by removing the principal arc ending in h and replacing it by e; cf. Fig. 3. Notice that the tree T e (N ) is also a switching of N, and this switching can be obtained from the one associated to T 0 (N ) by switching-off the principal arc ending in h and switching-on the arc e.
Although $T_0(N)$ is always an S-tree, a secondary subtree of N may have non-labelled leaves: we shall say that it is partially leaf-labelled in S. To obtain phylogenetic trees on S from the principal and secondary subtrees of N, we reduce them: we recursively remove (in secondary subtrees) all their non-labelled leaves together with the arcs ending in them, and then we recursively suppress all their elementary nodes. We shall generically denote by $\widetilde{T}$ the reduced phylogenetic tree on S obtained by reducing a partially leaf-labelled tree T on S. Notice that $\widetilde{T}$ is a homeomorphic subtree of T, in the sense that they have the same set of labels, the set of nodes of $\widetilde{T}$ is contained in the set of nodes of T, this inclusion preserves the leaves' labelling, and every arc in $\widetilde{T}$ corresponds to a path in T. In particular, for every node v in $\widetilde{T}$, $C_{\widetilde{T}}(v) = C_T(v)$; we shall often use this equality without any further mention. The construction of the reduced principal and secondary subtrees of an LGT network is illustrated by Figs. 3 and 4.
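The following sketch illustrates the construction of a secondary subtree and of its reduction. It relies on two assumptions that are ours: the network is encoded by its list of principal arcs plus one secondary arc, and a reduced tree is represented by its set of clusters (which determines it up to isomorphism). Under this encoding, removing non-labelled leaves and suppressing elementary nodes amounts to keeping the non-empty clusters, because suppressed nodes have the same cluster as a surviving descendant and removed nodes have an empty cluster. The function names and the example are hypothetical.

```python
def secondary_subtree_arcs(principal, e):
    """Arcs of T_e(N): drop the principal arc ending in h, add e = (u, h)."""
    u, h = e
    return [(a, b) for a, b in principal if b != h] + [(u, h)]


def reduced_clusters(arcs, labels):
    """Clusters of the reduction of a partially leaf-labelled tree."""
    children, targets = {}, set()
    for a, b in arcs:
        children.setdefault(a, []).append(b)
        targets.add(b)
    (root,) = set(children) - targets  # a tree has exactly one root
    found = set()

    def visit(v):
        kids = children.get(v, [])
        if not kids:
            # Non-labelled leaves contribute the empty cluster.
            c = frozenset([v]) & labels
        else:
            c = frozenset().union(*map(visit, kids))
        if c:
            found.add(c)
        return c

    visit(root)
    return found


# Hypothetical LGT network with a single secondary arc (w, h).
principal = [('r', 'x'), ('r', 'w'), ('x', 1), ('x', 'h'),
             ('h', 2), ('h', 3), ('w', 4)]
labels = {1, 2, 3, 4}
t0 = reduced_clusters(principal, labels)
te = reduced_clusters(secondary_subtree_arcs(principal, ('w', 'h')), labels)
```

In this example the two reduced trees differ in exactly one cluster each: {1, 2, 3} belongs only to the reduced principal subtree and {2, 3, 4} only to the reduced secondary subtree.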
The following result is a direct consequence of the fact that the set of triples defined by a phylogenetic tree characterizes it, and that the triple defined on a set of three labels by a partially leaf-labelled tree with, possibly, elementary nodes, is the same as the triple defined by its reduction.
Proposition 1 Let $T_1, T_2$ be two partially leaf-labelled trees on a set S. Then, their reductions $\widetilde{T}_1$ and $\widetilde{T}_2$ are equal if, and only if, $T_1$ and $T_2$ define the same triple on each set of three different labels of S.
Intuitively, the difference between the reduced principal subtree $\widetilde{T}_0(N)$ and any reduced secondary subtree $\widetilde{T}_e(N)$ is that some rooted subtree of the former is pruned (by removing the principal arc ending in the end of e) and regrafted (through the secondary arc e) in the latter. This fact motivates considering rooted subtree prune and regraft (rSPR, for short) operations [30] to analyze the differences between the reduced principal subtree of an LGT network and its reduced secondary subtrees. However, since these trees need not be binary, we slightly generalize the rSPR operations defined in [30] to allow for the pruned subtree to be regrafted not only to an arc but also to a node. More precisely, we define an rSPR operation of a tree T as the following procedure:
1. Choose a non-root node v, and let u be its parent.
2. Remove the arc (u, v).
3. Choose a node w that is not a descendant of v.
4. If w is an internal node other than u, then apply either (a) or (b) below. If w is a leaf or w = u, apply (b).
(a) Add a new arc (w, v).
(b) Add a new node $\overline{w}$ and new arcs $(\overline{w}, v)$ and $(\overline{w}, w)$. If w was not the root of T and w′ was its parent, then remove the arc (w′, w) and add a new arc $(w', \overline{w})$. If w was the root, then $\overline{w}$ becomes the root of the resulting tree.
5. Suppress u if it has become elementary.
We shall denote such an rSPR operation by $v \xleftarrow{\text{node}} w$ (a node rSPR operation) if step (4a) is applied, and $v \xleftarrow{\text{arc}} w$ (an arc rSPR operation) if step (4b) is applied; cf. Fig. 5. When it is not necessary to specify whether it is a node or an arc rSPR operation, we shall denote it by $v \xleftarrow{\text{spr}} w$. Given any pair of phylogenetic trees T, T′ on the same set of labels, their rSPR distance $d_{rSPR}(T, T')$ is the least number of rSPR operations that transform one into the other (cf. [30] in the binary case). In particular, since a reduced secondary subtree $\widetilde{T}_e(N)$ of an LGT network is obtained from its reduced principal subtree $\widetilde{T}_0(N)$ by means of an rSPR operation, we have that $d_{rSPR}(\widetilde{T}_0(N), \widetilde{T}_e(N)) \leq 1$, and $d_{rSPR}(\widetilde{T}_0(N), \widetilde{T}_e(N)) = 1$ if, and only if, $\widetilde{T}_0(N) \neq \widetilde{T}_e(N)$.
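An isomorphism of LGT networks is an isomorphism of S-rDAGs that preserves and reflects the partitions of the sets of arcs into principal and secondary. More formally, given two LGT networks N = (V, E) and N′ = (V′, E′) on S, with $E = E_p \sqcup E_s$ and $E' = E'_p \sqcup E'_s$, an isomorphism between them is a bijection φ : V → V′ that preserves the leaves' labels and such that, for every u, v ∈ V, $(u, v) \in E_p$ if, and only if, $(\varphi(u), \varphi(v)) \in E'_p$, and $(u, v) \in E_s$ if, and only if, $(\varphi(u), \varphi(v)) \in E'_s$.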
The isomorphism of LGT networks can be easily checked in linear time in their sizes. Indeed, two LGT networks N and N ′ are isomorphic if, and only if, T 0 (N ) = T 0 (N ′ ) -which can be checked in linear time in the number of principal arcs of the networks-and this isomorphism preserves and reflects the sets of secondary arcs. As we do with S-rDAG in general, we shall usually say that two LGT networks are equal when they are actually isomorphic.

A reconstruction problem for a restricted class of LGT networks
Let us consider the problem of reconstructing an LGT network from its reduced principal subtree $T_0$ and its set of reduced secondary subtrees $T_1, \ldots, T_k$. We shall take into account only the case when $T_1, \ldots, T_k$ are pairwise different, because if $T_i = T_j$, they can be defined by the same secondary arc. Moreover, we shall restrict ourselves to the case when $T_0 \neq T_i$ for every i = 1, ..., k, because when a reduced secondary subtree is equal to the reduced principal subtree, it only means that we are not able to "distinguish" the secondary line of evolution from the principal one. This leads us to the following general problem:
LGT Network Reconstruction Problem. Given a family of pairwise different phylogenetic trees $T_0, T_1, \ldots, T_k$ on the same set of labels, find an LGT network whose reduced principal subtree is $T_0$ and whose reduced secondary subtrees are $T_1, \ldots, T_k$.
Of course, this problem may have no solution for certain input trees. Consider, for instance, the trees $T_0, T_1, T_2$ depicted in Fig. 6. A simple inspection shows that if there exists an LGT network N with reduced principal subtree $T_0$ and two secondary arcs $e_1, e_2$ whose reduced secondary subtrees are $\widetilde{T}_{e_1}(N) = T_1$ and $\widetilde{T}_{e_2}(N) = T_2$, then $e_1$ must go from an elementary node added in the arc ending in 4 to a (or to an elementary node added in the arc ending in a), and $e_2$ must go from an elementary node added in the arc ending in 3 to c (or to an elementary node added in the arc ending in c). But then, the resulting directed graph contains a cycle: see, for instance, the graph N in Fig. 6.
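The obstruction described above is easy to test computationally: after adding the candidate secondary arcs to the principal tree, one checks whether the resulting directed graph is still acyclic. The sketch below uses Kahn's topological-sort algorithm; the function name `is_acyclic` and the example arcs are hypothetical and do not reproduce the trees of Fig. 6, only the same kind of situation (two secondary arcs that are individually harmless but together close a cycle).

```python
from collections import defaultdict, deque


def is_acyclic(arcs):
    """True iff the directed graph given by `arcs` has no cycle."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in arcs:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    # Repeatedly remove nodes without incoming arcs (Kahn's algorithm).
    queue = deque(v for v in nodes if indeg[v] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # Some node is never removed exactly when the graph has a cycle.
    return removed == len(nodes)


# Hypothetical principal tree; u3 and u4 are elementary nodes added in
# the arcs ending in the leaves 3 and 4.
principal = [('r', 'a'), ('r', 'c'), ('a', 1), ('a', 'u3'), ('u3', 3),
             ('c', 2), ('c', 'u4'), ('u4', 4)]
e1, e2 = ('u4', 'a'), ('u3', 'c')
```

Here `principal + [e1]` is acyclic, but `principal + [e1, e2]` contains the cycle u4 → a → u3 → c → u4.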
On the other hand, as it was already hinted in the discussion above, if the LGT network reconstruction problem has a solution for a specific input, it need not be unique: see, for instance, Fig. 7. And, as we mentioned at the beginning of this section, there may be repetitions in the family of reduced principal and secondary subtrees of a general LGT network, and therefore not every LGT network can be obtained as an output of this problem.
This motivates us to restrict ourselves to a class of LGT networks satisfying a set of conditions that guarantee, on the one hand, that their reduced principal and secondary subtrees are pairwise different and, on the other hand, the uniqueness of the restricted LGT network with given reduced principal and secondary subtrees, if one exists.

Definition 2 An LGT network is restricted when it satisfies the following properties:
As far as the other two conditions go, condition (c) prevents the existence of a lateral gene transfer from a species to a principal descendant of it, and condition (d) prevents the existence of a lateral gene transfer from a species to a species represented by an ancestor of it in the reduced principal subtree.
Except for (c), which is shared by both definitions, the conditions that define our restricted LGT networks are transversal to those defining species graphs.
We shall prove now that the reduced principal and secondary subtrees of a restricted LGT network form a family of pairwise different phylogenetic trees.

Proposition 2 If N is a restricted LGT network and e is a secondary arc in it, then its reduced principal subtree and the reduced secondary subtree associated to e are different: $\widetilde{T}_0(N) \neq \widetilde{T}_e(N)$.
Proof Let $e = (u, h) \in E_s$; to simplify the notations, we shall denote the reduced subtrees $\widetilde{T}_0(N)$ and $\widetilde{T}_e(N)$ by $\widetilde{T}_0$ and $\widetilde{T}_e$, respectively. We shall prove that these trees define different sets of triples; by Proposition 1, this will imply that $\widetilde{T}_0 \neq \widetilde{T}_e$.
By condition (c) in Definition 2, there exists no principal path connecting u and h, and therefore $C_{T_0(N)}(h) \cap C_{T_0(N)}(u) = \emptyset$. Let $x_1 \in C_{T_0(N)}(u)$ and $x_2 \in C_{T_0(N)}(h)$. On the other hand, if $z = \mathrm{LCA}_{T_0(N)}(u, h)$, condition (d) in Definition 2 implies that the principal path $z \rightsquigarrow h$ contains some intermediate node w with a principal child $w_1$ outside this path; let $x_3 \in C_{T_0(N)}(w_1)$ (see Fig. 8). It is straightforward to check now that $\widetilde{T}_0$ defines the triple $((x_2, x_3), x_1)$ and $\widetilde{T}_e$ defines the triple $((x_1, x_2), x_3)$. Therefore, $\Gamma(\widetilde{T}_0) \neq \Gamma(\widetilde{T}_e)$, as we claimed.

Proposition 3 If N is a restricted LGT network and e, e′ are two different secondary arcs in it, then the corresponding reduced secondary subtrees are different: $\widetilde{T}_e(N) \neq \widetilde{T}_{e'}(N)$.
The proof of this proposition is similar to that of Proposition 2, but much longer because we must distinguish many cases, depending on the relative positions of the source and the target nodes of e and e′ in $T_0(N)$. Therefore, and in order not to lose the thread of the paper, we postpone it until the Additional file 1: Appendix.
The problem we are actually going to solve in this section is, then, the following special case of the LGT Network Reconstruction Problem: given a family of pairwise different phylogenetic trees $T_0, T_1, \ldots, T_k$ on the same set of labels, find, when it exists, the restricted LGT network whose reduced principal subtree is $T_0$ and whose reduced secondary subtrees are $T_1, \ldots, T_k$.
Our next goal is to establish a set of necessary and sufficient conditions for the existence of a restricted LGT network N with a given reduced principal subtree T and a given reduced secondary subtree T′. First, we give these conditions in terms of rSPR operations (Proposition 4). Next, we translate the resulting characterization in terms of triples (Proposition 5) and clusters (Proposition 6).

Proposition 4 Let T , T ′ be two phylogenetic trees on the same set of labels. There exists a restricted LGT network N with a secondary arc e such that
We say that two trees T, T′ on the same set of labels S, given by their respective sets of triples $\{T_{x,y,z} \mid \{x, y, z\} \subseteq S\}$ and $\{T'_{x,y,z} \mid \{x, y, z\} \subseteq S\}$, satisfy the principal-secondary condition on triples if there exist k, l, m ≥ 1 and a family $A_1, \ldots, A_k, B, C_1, \ldots, C_{l-1}, C_{l,1}, \ldots, C_{l,m}$ of non-empty, pairwise disjoint subsets of S (and to ease notations, let $C_l = \bigcup_{i=1}^{m} C_{l,i}$) such that, for every x, y, z ∈ S:
1. If $x \in \bigcup_{i=1}^{k} A_i$, $y \in B$, and $z \in \bigcup_{i=1}^{l} C_i$, then $T_{x,y,z} = ((x, y), z)$ and $T'_{x,y,z} = ((y, z), x)$.
2. If $x \in B$, $y \in A_j$ and $z \in A_i$, for some $1 \leq i < j \leq k$, then $T_{x,y,z} = ((x, y), z)$ and $T'_{x,y,z} = ((y, z), x)$.
3. If $x \in C_i$, $y \in C_j$ and $z \in B$, for some $1 \leq i < j \leq l$, then $T_{x,y,z} = ((x, y), z)$ and $T'_{x,y,z} = ((y, z), x)$.
4. If $x \in C_{l,i}$, $y \in C_{l,j}$ and $z \in B$, for some $1 \leq i < j \leq m$, then $T_{x,y,z} = ((x, y), z)$ and $T'_{x,y,z} = (x, y, z)$.
5. If x, y, z do not satisfy any of the previous conditions, then $T_{x,y,z} = T'_{x,y,z}$.
Proposition 5 Let T, T′ be two phylogenetic trees on the same set of labels. There exists a restricted LGT network N with a secondary arc e such that T is its reduced principal subtree, $T = \widetilde{T}_0(N)$, and T′ is its reduced secondary subtree associated to e, $T' = \widetilde{T}_e(N)$, if, and only if, they satisfy the principal-secondary condition on triples.
Proof As far as the "only if " implication goes, assume that e = (w, h) and let v = LCA T 0 (N ) (w, h) = LCA T 0 (N ) (w, h). Let w ∈ T 0 (N ) be the first non principally elementary principal descendant of w: that is, w = w if w is not principally elementary, and its principal child otherwise. Now: (Cf. Fig. 9). It is straightforward to check that the triples defined by T 0 (N ) and T e (N ) are the same except for those in the statement.
Let us consider now the "if " implication. In order not to overload the text, we shall outline here the proof, and fill in the details in a series of Claims proved in the Additional file 1: Appendix.
Assuming that the symmetric difference $\Gamma(T) \mathbin{\triangle} \Gamma(T')$ consists of those triples described in the statement, we have that B is a cluster of both T and T′ (this is Claim 1 in the Appendix, where it is proved). Since every triple in $\Gamma(T) \mathbin{\triangle} \Gamma(T')$ involves one, and only one, leaf in B, it is clear that $\Gamma(T|_B) = \Gamma(T'|_B)$ and $\Gamma(T|_{S \setminus B}) = \Gamma(T'|_{S \setminus B})$, and hence $T|_B = T'|_B$ and $T|_{S \setminus B} = T'|_{S \setminus B}$. So, $T|_B$ and $T|_{S \setminus B}$ form a maximum-agreement forest for T and T′ in the sense of [31], which implies that $d_{rSPR}(T, T') = 1$ [30, Theorem 2.1]. Then, the rSPR operation that transforms T into T′ must have the form $h \xleftarrow{\text{spr}} x$, with h the root of $T|_B$, that is, the node in T with $C_T(h) = B$. In order to prove that this rSPR operation satisfies condition (2) in Proposition 4, we must identify the node x and the type of rSPR operation. To do that, we use that each $C_{l,i}$ is a cluster in T and T′ (cf. Claim 2 in the Appendix) and that $B \cup C_l$ is a cluster in T′ but not in T (cf. Claim 3). Then:
• If m = 1, so that $C_l = C_{l,1} \in C(T) \cap C(T')$, this entails that the nodes with clusters B and $C_l$ are siblings in T′ but not in T, and therefore that x is the node in T with cluster $C_l$ and that the rSPR operation is of type arc.
• If m > 1, since $C_l$ is a cluster in T but not in T′ (this is Claim 4 in the Appendix) and $B \cup C_{l,i_1} \cup \cdots \cup C_{l,i_k} \notin C(T')$ for every $\emptyset \neq \{i_1, \ldots, i_k\} \subsetneq \{1, \ldots, m\}$ (cf. Claim 5), we have that the nodes with clusters $B, C_{l,1}, \ldots, C_{l,m}$ are siblings in T′ but not in T, and therefore that x is the node in T with cluster $C_l$ and that the rSPR operation is of type node.
In both cases, it is easy to see that x is not connected in T with h (because $B \cap C_l = \emptyset$) and that $\mathrm{LCA}_T(x, h)$ is not the parent of h (because if $a \in A_1$, $b \in B$ and $c \in C_l$, then $((a, b), c) \in \Gamma(T)$).
Fig. 9 The local structure of T 0 (N) and T e (N) around a secondary arc e = (w, h), when w is not principally elementary (a) and when it is principally elementary (b)

Corollary 1 Let N and N′ be two restricted LGT networks on the same set of labels S, each with a single secondary arc: say, e and e′, respectively. If they have the same reduced principal subtree, $\widetilde{T}_0(N) = \widetilde{T}_0(N')$, and the same reduced secondary subtree, $\widetilde{T}_e(N) = \widetilde{T}_{e'}(N')$, then N = N′.
Proof Let us denote $\widetilde{T}_0(N) = \widetilde{T}_0(N')$ simply by T. Since N and N′ are restricted LGT networks, the proof of the last proposition shows that if $\widetilde{T}_e(N) = \widetilde{T}_{e'}(N')$, then e and e′ must have the same source and target nodes: with the notations therein, their target node is the node in T with cluster B, and their source node is either a principally elementary node added in the arc ending in the node in T with cluster $C_l$ (if m = 1) or the node in T with cluster $C_l$ (if m > 1). Therefore, N = N′.
Notice that the naïve implementation of the procedure given by Proposition 5, which computes and writes all the $O(n^3)$ triples defined by T and T′ and then checks whether the symmetric difference of the corresponding sets of triples has the form described therein, takes at least $O(n^4)$ time. Although this cost can possibly be reduced by using the strategy in [32], we found it simpler to translate this condition on triples into an equivalent condition on clusters that is faster to check. To this end we first give a set of conditions written in terms of the clusters of the trees and their structure as a partially ordered set, where we consider the natural ordering given by inclusion of sets. In the context of posets, a segment is a chain such that every element in the poset lying between the ends of the chain also belongs to the chain.
We say that two trees T, T′ on the same set of labels S and given by their respective sets of clusters C(T) and C(T′) satisfy the principal-secondary condition on clusters if: $U_k \subsetneq \cdots \subsetneq U_1$, $W_{l_0} \subsetneq \cdots \subsetneq W_1$, with $k - 1 \leq k_0 \leq k$.
• If l = 1 and $l_0 = l - 1$ (respectively, if k = 1 and $k_0 = k - 1$), the chain $W_{l_0} \subsetneq \cdots \subsetneq W_1$ (respectively, $U'_{k_0} \subsetneq \cdots \subsetneq U'_1$) does not exist, and then $C(T) \setminus C(T')$ (respectively, $C(T') \setminus C(T)$) consists only of the other segment.
• If $C(T) \setminus C(T')$ (respectively, $C(T') \setminus C(T)$) consists of two maximal disjoint segments of clusters,
The minimal elements in the chains above satisfy that $U_k \cap W'_l \in C(T) \cap C(T')$. Let B denote this cluster.
(c) The difference between the first element in the first segment and the common cluster B, say $A_k = U_k \setminus B$, satisfies: Analogously, the difference between the first element in the last segment and the common cluster B, say $C_l = W'_l \setminus B$, satisfies:
(e) If k > 1, the differences between consecutive sets in the segments above satisfy:
(f) And analogously, if l > 1, then:
• $C_l \subsetneq W_{l-1}$;
• Setting (even when $l_0 = l - 1$) $W_l = C_l$, we have that $W_i \setminus W_{i+1} = W'_i \setminus W'_{i+1}$ for every i = 1, ..., l − 1.

Proposition 6 Let T, T′ be two different phylogenetic trees on the same set of labels. There exists a restricted LGT network N with a secondary arc e such that T is its reduced principal subtree, $T = \widetilde{T}_0(N)$, and T′ is its reduced secondary subtree associated to e, $T' = \widetilde{T}_e(N)$, if, and only if, they satisfy the principal-secondary condition on clusters.
The principal-secondary condition on clusters can be checked in $O(n^2)$ time. Indeed, conditions (b) to (f) can be checked in linear time, since they only involve testing if certain sets are clusters of the trees or subsets of some specific sets of leaves. As for condition (a), one only needs to compute all the clusters of both trees, which can be done in $O(n^2)$ time, and then compute the symmetric difference of those sets and arrange this symmetric difference in chains, which can be done in linear time in the size of the clusters.
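As a hedged illustration of the first ingredient of this check, the sketch below computes the two differences C(T)\C(T′) and C(T′)\C(T) from the cluster sets of two trees and tests whether a family of clusters, sorted by size, forms a chain under proper inclusion. It does not implement the full principal-secondary condition on clusters; the function names and the example cluster sets are hypothetical.

```python
def cluster_differences(clusters_t, clusters_t2):
    """Clusters only in T and only in T', each sorted by increasing size."""
    only_t = sorted(clusters_t - clusters_t2, key=len)
    only_t2 = sorted(clusters_t2 - clusters_t, key=len)
    return only_t, only_t2


def is_chain(sorted_clusters):
    """True iff consecutive clusters are properly nested."""
    return all(a < b for a, b in zip(sorted_clusters, sorted_clusters[1:]))


# Hypothetical cluster sets of two trees on the labels {1, 2, 3, 4}.
ct = {frozenset(c) for c in
      [{1}, {2}, {3}, {4}, {2, 3}, {1, 2, 3}, {1, 2, 3, 4}]}
ct2 = {frozenset(c) for c in
       [{1}, {2}, {3}, {4}, {2, 3}, {2, 3, 4}, {1, 2, 3, 4}]}
only_t, only_t2 = cluster_differences(ct, ct2)
```

In this example each difference is a single cluster, {1, 2, 3} and {2, 3, 4} respectively, and their intersection {2, 3} is a cluster of both trees, which plays the role of the common cluster B.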
Proposition 6 allows us to easily detect the secondary arc that must be added to T in order to obtain a network that has T′ as the corresponding reduced secondary tree, when it exists, by means of the following algorithm:
It turns out that N(T, T′) is contained in every restricted LGT network with reduced principal subtree T and having T′ as a reduced secondary subtree.
Proof Let N be a restricted LGT network with T_0(N) = T and reduced secondary subtrees T′_1, ..., T′_k. Without any loss of generality, we rename these reduced secondary subtrees as T′_{1,1}, ..., T′_{1,k_1}, T′_{2,1}, ..., T′_{l,k_l} (k_1 + ··· + k_l = k) in such a way that, for every i = 1, ..., l, the secondary arcs ē_{i,1}, ..., ē_{i,k_i} producing the reduced secondary subtrees T′_{i,1}, ..., T′_{i,k_i} have the same origin u_i, and u_i ≠ u_j if i ≠ j. For every i = 1, ..., l, let u*_i be equal to u_i if this node is not principally elementary, and to the principal child of u_i in N if it is principally elementary; in both cases, u*_i is a node in T. Finally, for every i = 1, ..., l and j = 1, ..., k_i, let h_{i,j} be the target of ē_{i,j}, which is also a node in T.
We know from Proposition 6 (and its proof) that the clusters of each u*_i and each h_{i,j}, and the equality, or not, between u_i and u*_i, are uniquely determined by the pair (T, T′_i). Indeed, in each case the clusters of the aforementioned nodes are found in the proof of Proposition 6, and the statement of this proposition shows how these clusters are determined by T and T′_i. Then, we can understand that Algorithm 2 first splits the arc in T ending in each u*_i for which u_i ≠ u*_i into two arcs connected by a new elementary node ū_i and next, for every i = 1, ..., l and j = 1, ..., k_i, adds to the resulting S-tree a secondary arc from ū_i or from u*_i to h_{i,j}. It is clear then that the resulting graph is isomorphic to N by means of an isomorphism that preserves labels, principal arcs and secondary arcs.
This proposition entails, on the one hand, that if there exists some restricted LGT network with reduced principal subtree T and reduced secondary subtrees T′_1, ..., T′_k, then it is unique (up to isomorphisms), and, on the other hand, that Algorithm 2 is correct (and also independent of the ordering of the trees T′_1, ..., T′_k), in the sense that such a restricted LGT network exists if, and only if, the algorithm finds it: notice that if the algorithm detects a cycle in step 5, then this proposition implies that no restricted LGT network can have T and T′_1, ..., T′_k as reduced principal and reduced secondary subtrees. Another consequence is the stability of the reconstructed network: if a new tree is added to the input of the algorithm, then a new secondary arc is added to the network, without altering the other secondary arcs (notice, however, that this last secondary arc could create a cycle in the network, and hence the problem would have no solution).
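The last two stages described above (splitting arcs, adding secondary arcs, and rejecting the result if a cycle appears) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' program: the function name `build_network`, the encoding of each transfer as a triple `(u_star, split, target)` and the naming of the inserted elementary nodes are all hypothetical, and the sketch assumes that the origins and targets have already been determined from the clusters. Acyclicity is tested here with Kahn's topological sorting.

```python
from collections import defaultdict, deque


def build_network(tree_arcs, transfers):
    """Add secondary arcs to a principal tree; return None if a cycle appears.

    tree_arcs: iterable of (parent, child) principal arcs.
    transfers: iterable of (u_star, split, target) triples (hypothetical
        encoding). The secondary arc leaves u_star itself when split is
        False, and a new elementary node inserted on the arc ending in
        u_star when split is True (u_star is then assumed not to be the root).
    Returns the pair (principal_arcs, secondary_arcs) when the graph is acyclic.
    """
    principal = set(tree_arcs)
    parent = {child: par for par, child in principal}
    secondary = set()
    inserted = {}  # u_star -> elementary node inserted just above it

    for u_star, split, target in transfers:
        origin = u_star
        if split:
            if u_star not in inserted:
                above = parent[u_star]
                mid = ("elem", u_star)  # hypothetical name for the new node
                principal.remove((above, u_star))
                principal.update({(above, mid), (mid, u_star)})
                inserted[u_star] = mid
            origin = inserted[u_star]
        secondary.add((origin, target))

    # Kahn's algorithm on the union of principal and secondary arcs.
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for a, b in principal | secondary:
        successors[a].append(b)
        indegree[b] += 1
        nodes.update((a, b))
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # Every node is visited exactly when the graph has no directed cycle.
    return (principal, secondary) if visited == len(nodes) else None
```

Because each transfer only adds one secondary arc (and possibly one elementary node), adding a further triple to `transfers` leaves the previously added arcs unchanged, which mirrors the stability property noted above.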
The following examples show two simple applications of Algorithm 2.
Example 1 Applying Algorithm 2 to the trees T, T′_1, T′_2, T′_3 in Fig. 10, we obtain the directed graph depicted in Fig. 11, which is acyclic and therefore a restricted LGT network with reduced principal subtree T and reduced secondary subtrees T′_1, T′_2, T′_3.
Example 2 Consider the trees depicted in Fig. 12. We obtain the directed graph depicted in Fig. 13, which contains a directed cycle. Therefore, there does not exist any restricted LGT network with T as reduced principal subtree and T′_1, T′_2 as reduced secondary subtrees.
Of course, it is possible that, on a given input, the LGT network Reconstruction Problem has a solution and the Restricted LGT network Reconstruction Problem does not, as the following example shows. In that example, the trees T and T′_1 do not satisfy conditions (a) to (f) in Proposition 6: from C(T)\C(T′_1) we have that k = 2, and from C(T′_1)\C(T) that l = 2, but then both differences should consist of a pair of segments, instead of a single segment. This means that there does not exist any restricted LGT network with reduced principal subtree T and reduced secondary subtree T′_1. But there actually exists an LGT network with reduced principal subtree T and reduced secondary subtree T′_1: the network N depicted in the same figure, which is not restricted.
In order to test the models and algorithms introduced in this paper, we have performed a computational experiment. Our goal was to find an example of trees in a database of phylogenetic trees obtained from biological data where our algorithms can be applied. The general strategy for this search was as follows: We first chose a database with many phylogenetic trees; among these trees we exhaustively searched for a "central" tree sharing many leaves with a large set of "companion" trees in the database.
Then, we exhaustively looked for pairs formed by a subtree of this central tree and a companion tree such that their topological restrictions to their common set of leaves satisfy the principal-secondary condition on clusters.
With all pairs satisfying this condition we looked for a maximal example: with as many leaves as possible and as many secondary trees as possible.
Finally, this maximal set of trees is used as an input to Algorithm 2.
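The topological restriction of a tree to a set of leaves, used in the search strategy above, can be sketched as follows. The sketch assumes the same hypothetical nested-tuple encoding of trees (leaves are non-tuple labels other than `None`), and the function names `leaves` and `restrict` are illustrative rather than taken from the authors' program.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Restrict a nested-tuple tree to the leaves in `keep`.

    Subtrees without any kept leaf are dropped, and nodes left with a
    single child are suppressed. Returns None when no leaf survives.
    """
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)


# Hypothetical central tree and companion tree sharing the leaves 1, 2, 3.
central = (((1, 2), 3), (4, 5))
companion = ((1, (2, 3)), 6)
common = leaves(central) & leaves(companion)
print(restrict(central, common))    # ((1, 2), 3)
print(restrict(companion, common))  # (1, (2, 3))
```

The two restricted trees are the ones that would then be compared through the principal-secondary condition on clusters.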
We have taken as our data source the database of phylogenetic trees in [23]. That database contains 159,905 phylogenetic trees, but in order to make the computations feasible we have restricted our experiment to a random sample of 15,000 trees. Within this sample, we have found a "central" tree T with 100 leaves and 200 other "companion" trees sharing at least 30 labels with T. We have then kept these 201 trees and discarded the others. Following the strategy described above, we have found the subtree T_0 of T described by the Newick string where the numbers correspond to the organisms given in Table 1, and the following three subtrees of some of the remaining set of 200 trees: such that each pair of trees (T_0, T′_i), i = 1, 2, 3, satisfies the conditions in Proposition 6. Applying Algorithm 2 to T_0, T′_1, T′_2, T′_3, we obtain the restricted LGT network depicted in Fig. 15, which contains T_0 as reduced principal subtree and T′_1, T′_2, T′_3 as reduced secondary subtrees. This network suggests the existence of three lateral gene transfer events that explain the differences between T_0 and T′_1, T′_2, T′_3. Although there is no reference in the literature to these specific events, several lateral gene transfer events involving Rhodobacter sp., Ruegeria pom. and Ruegeria sp. have been reported in the literature [33][34][35].

Conclusions
In this paper we have considered LGT networks: a general model of phylogenetic networks with lateral gene transfers that capture the asymmetry of these evolutionary events. An LGT network allows one to distinguish between the principal line of evolution of the species under study and the secondary lines determined by the lateral gene transfers, by defining, in a natural way, a principal phylogenetic subtree and a family of secondary phylogenetic subtrees. We have defined a subclass of "restricted" LGT networks such that (a) the principal and secondary phylogenetic subtrees of a restricted LGT network are pairwise different; and (b) the principal and secondary phylogenetic subtrees of a restricted LGT network single it out, up to isomorphisms. Then, we have given an algorithm that solves the problem of reconstructing a restricted LGT network from a given principal phylogenetic subtree and a given family of secondary phylogenetic subtrees, when it exists.
We have implemented the algorithms in this paper using Python. The program can be downloaded from the URL http://bioinfo.uib.es/~recerca/LGTnetworks/reconstruction.zip, and the only requirements are the libraries networkx and pyparsing, which are included in most of the standard distributions of Python for scientific computation (e.g. Anaconda). The zip file contains a README file with specific instructions on how to use the program.
As future work, we plan to relax the conditions on the restricted LGT networks in order to be able to reconstruct a broader class of networks and to discover new algorithms for reconstructing such networks from biologically significant data.