 Research
 Open access
 Published:
Statistically consistent divideandconquer pipelines for phylogeny estimation using NJMerge
Algorithms for Molecular Biology volume 14, Article number: 14 (2019)
Abstract
Background
Divideandconquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NPhard optimization problems, limits the scalability of such approaches.
Results
In this paper, we introduce a divideandconquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomialtime extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divideandconquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multilocus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRALIII, SVDquartets, and “concatenation” using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases.
Conclusions
Theoretical and empirical results suggest that NJMerge is a valuable technique for largescale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on Github (http://github.com/ekmolloy/njmerge).
Introduction
Estimating evolutionary trees, called phylogenies, from molecular sequence data is a fundamental problem in computational biology, and building the Tree of Life is a scientific grand challenge. It is also a computational grand challenge, as many of the most accurate phylogeny estimation methods are heuristics for NPhard optimization problems. Species tree estimation can be further complicated by biological processes (e.g., incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer) that create heterogeneous evolutionary histories across genomes or “gene tree discordance” [1].
Incomplete lineage sorting (ILS), which is modeled by the MultiSpecies Coalescent (MSC) model [2, 3], has been shown to present challenges for phylogenomic analyses [4]. In addition, while the standard approach for multilocus species tree estimation uses maximum likelihood methods (e.g., RAxML) on the concatenated multiple sequence alignment, recent studies have established that even exact algorithms for maximum likelihood are not statistically consistent methods for multilocus species tree estimation under the MSC model (see [5] for a proof for unpartitioned maximum likelihood and [6] for fully partitioned maximum likelihood).
Because concatenation analyses using maximum likelihood are provably not statistically consistent in the presence of incomplete lineage sorting, new methods have been developed that are provably statistically consistent under the MSC model. Bayesian methods that coestimate gene trees and species trees (e.g., [7, 8]) are statistically consistent and expected to be the highly accurate; however, such methods are also prohibitively expensive on large datasets. More efficient approaches have been developed that are statistically consistent under the MSC model, including “gene tree summary methods”, which take a collection of gene trees as input and then compute a species tree from the gene trees using only the gene tree topologies. For example, NJst [9] runs Neighbor Joining (NJ) [10] on the “average gene tree internode distance” (AGID) matrix, and ASTRAL [11] finds a quartetmedian tree (i.e. a species tree that maximizes the total quartet tree similarity to the input gene trees) within a constrained search space. However, gene tree summary methods can have reduced accuracy when gene tree estimation error is high, which is a problem for many phylogenomic datasets (see discussion in [12]).
Because of the impact of gene tree estimation error, alternative approaches that bypass gene tree estimation, called “sitebased” methods, have been proposed. Perhaps the best known sitebased method is SVDquartets [13], which estimates quartet trees from the concatenated sequence alignments (using statistical properties of the MSC model and the sequence evolution model) and then combines the quartet trees into a tree on the full set of species using quartet amalgamation methods that are heuristics for the Maximum Quartet Consistency problem [14]. Other examples of sitebased methods include computing JukesCantor [15] or logdet [16] distances from the concatenated alignment and then running NJ on the resulting distance matrix. Such approaches can be statistically consistent under the MSC model when the sequence evolution models across genes satisfy some additional assumptions (e.g., a relaxed molecular clock) [17, 18].
Many of these methods (e.g., ASTRAL, SVDquartets, and concatenation using RAxML) are heuristics for NPhard optimization problems. Such methods can have difficulties scaling to datasets with large numbers of species, and divideandconquer approaches have been developed to scale methods to larger datasets (e.g., the family of disk covering methods [19,20,21,22,23,24]). Such methods operate by dividing the species set into overlapping subsets, constructing trees on the subsets, and then merging the subset trees into a tree on the entire species set. The last step of this process, called “supertree estimation”, can provide good accuracy (i.e., retain much of the accuracy in the subset trees) if good supertree methods are used. Notably, the supertree compatibility problem is NPcomplete [25], and the preferred supertree methods attempt to solve NPhard optimization problems (e.g., the Robinson–Foulds supertree problem [26], the Maximum Quartet Consistency problem [14], the Matrix Representation with Parsimony problem [27], and the Matrix Representation with Likelihood problem [28]). In summary, none of the current supertree methods provide both accuracy and scalability to datasets with large numbers of species (see [29] for further discussion).
In this paper, we introduce a new divideandconquer approach to scaling phylogeny estimation methods to large datasets: we divide the species (or leaf) set into pairwise disjoint subsets, construct a tree on each of the subsets, and then assemble the subset trees into a tree on the entire species set. Supertree methods cannot be used to combine trees on pairwise disjoint leaf sets, and we present a new polynomialtime method, called NJMerge, for this task. We prove that NJMerge can be used in statistically consistent divideandconquer pipelines for both gene tree and species tree estimation and evaluate the effectiveness of using NJMerge in the context of multilocus species tree estimation. We found, using an extensive simulation study, that NJMerge sometimes improved the accuracy of traditional NJ and that NJMerge provided substantial improvements in the running time for three methods (ASTRALIII [30], SVDquartets [13], and concatenation using RAxML [31]) without sacrificing accuracy. Furthermore, NJMerge enabled SVDquartets and RAxML to run on large datasets (e.g., 1000 taxa and 1000 genes), on which SVDquartets and RAxML would otherwise fail to run when limited to 64 GB of memory. While NJMerge is not guaranteed to return a tree; the failure rate in our experiments was low (less than 1% of tests). In addition, NJMerge failed on fewer datasets than either ASTRALIII, SVDquartets, or RAxML—when given the same computational resources: a single compute node with 64 GB of physical memory, 16 cores, and a maximum wallclock time of 48 h. Together, these results suggest that NJMerge is a valuable technique for largescale phylogeny estimation, especially when computational resources are limited.
NJMerge
Neighbor Joining (NJ) [10], perhaps the most widely used polynomialtime method for phylogeny estimation, estimates a tree T from a dissimilarity matrix D; NJMerge is a polynomialtime extension of NJ to impose a set of constraints on the output tree T (Fig. 1). More formally, NJMerge takes as input a dissimilarity matrix D on leaf set \(S = \{s_1, s_2, \ldots , s_n\}\) and a set \({\mathcal {T}} = \{T_1, T_2, \dots , T_k\}\) of unrooted binary trees on pairwise disjoint subsets of the leaf set S and returns a tree T that agrees with every tree in \({\mathcal {T}}\) (Definition 1). Note that the output tree T is a compatibility supertree for \({\mathcal {T}}\) and that because the trees in \({\mathcal {T}}\) are on pairwise disjoint subsets of the leaf set S, a compatibility supertree always exists. NJMerge does not require that the input constraint trees \({\mathcal {T}}\) to form clades in T. For example, the caterpillar tree on \(\{A,B,C,D,E,F,G,H\}\) obtained by making a path with the leaves hanging off it in alphabetical order is a compatibility supertree for \({\mathcal {T}} = \{ACEG, \; BDFH\}\), and yet the trees in \({\mathcal {T}}\) do not form clades within the caterpillar tree (Fig. 2). Of course, other compatibility supertrees exist for \({\mathcal {T}}\), and, in some of them, the input constraint trees will form clades. The objective is to find a tree that is close to the true (but unknown) tree from the set of all compatibility supertrees for \({\mathcal {T}}\), and NJMerge tries to achieve this objective by using the dissimilarity matrix D.
Definition 1
Let T be a tree on leaf set S, and let \(T'\) be a tree on leaf set \(R \subseteq S\). We say that \(T'\) agrees with T if restricting T to leaf set R induces a binary tree that (after suppressing the internal nodes of degree 2) is isomorphic to \(T'\).
Here we briefly describe the NJ algorithm by Saitou and Nei [10]. NJ has an iterative design that builds the tree from the bottom up, producing a rooted tree that is then unrooted. Initially, all n leaves are in separate components. When a pair of leaves is selected to be siblings, the pair of leaves is effectively replaced by a rooted tree on two leaves, and the number of components is reduced by one. This process repeats until there is only one component: a tree on the full leaf set. At each iteration, NJ updates D based on the new sibling pair, derives a new matrix Q from D, and uses Q to determine which pair of the remaining nodes to join. Specifically, NJ accepts siblinghood proposal (i, j) such that Q[i, j] is minimized. The same formulas used by NJ [10] to update D and compute Q are also used by NJMerge; however, NJMerge can make different siblinghood decisions than NJ—based on the input constraint trees.
After each siblinghood decision, NJMerge updates the constraint trees. Specifically, when two leaves are made siblings, they are replaced by a new leaf, and the constraint trees are relabeled. For example, if x is a leaf in \(T_i\) and y is a leaf in \(T_j\), then the siblinghood proposal \(z = (x,y)\) requires that x and y are replaced with z in \(T_i\) and \(T_j\), respectively. Because siblinghood decisions change the set of leaves in the constraint trees, they can result in the constraint trees no longer being disjoint (Fig. 3). Thus, siblinghood decisions have the potential to make the set of constraint trees incompatible. Determining whether or not a set of unrooted phylogenetic trees is compatible is an NPcomplete problem [32, 33], so NJMerge uses a polynomialtime heuristic. In each iteration, NJMerge sorts the entries of the Q from least to greatest and accepts the first siblinghood proposal (x, y) that satisfies the following properties:

1.
If x and y are both in some constraint tree \(T_i\), then they are siblings in \(T_i\).

2.
If x or y are in more than one constraint trees, then replacing x and y with a new leaf \(z = (x,y)\) in all constraint trees does not make any pair of constraint trees incompatible, i.e., a compatibility supertree exists for every pair of updated constraint trees.
Because pairwise compatibility of unrooted trees does not guarantee that the entire set of constraint trees is compatible, it is possible for NJMerge to accept a siblinghood decision that will eventually cause the algorithm to fail when none of the remaining leaves can be joined without violating the pairwise compatibility of constraint trees. Although the “pairwise compatibility heuristic” can fail, it is easy to see that if NJMerge returns a tree, then it is a compatibility supertree for the input set \({\mathcal {T}}\) of constraint trees.
To determine if some pair of constraint trees becomes incompatible after making x and y siblings, it suffices to check only those pairs of constraint trees that contain at least one of x and y; all other pairs of trees are unchanged by accepting the siblinghood proposal and are pairwise compatible by induction. Because the leaves in the two trees labeled x or y have been relabeled by the new leaf \(z = (x, y)\), they can be treated as rooted trees by rooting them at z. Testing the compatibility of rooted trees is easily accomplished in polynomial time using [34]. In fact, instead of testing pairs of constraint trees, the entire set of trees in \({\mathcal {T}}\) containing the new leaf \(z = (x, y)\) can be tested for compatibility in polynomial time using [34]. Furthermore, if at least one leaf exists in all constraint trees, then the compatibility of \({\mathcal {T}}\) can be determined in polynomial time. Finally, note the input matrix was referred to as a dissimilarity matrix (and not a distance matrix), because estimated distances between species may not satisfy the triangle inequality [24]; however, this matrix is more commonly referred to as a distance matrix, and we use this term henceforth.
Divideandconquer pipelines for phylogeny estimation
NJMerge can be used in divideandconquer pipelines for phylogeny estimation as shown in Fig. 4 and described below. In order to run this pipeline, the user must select a method for decomposing the leaf set into pairwise disjoint subsets (step 2), a maximum subset size (step 2), a method for computing a distance matrix \(M_D\) (step 1), and a method \(M_T\) for computing subset trees (step 3); thus, the user can select \(M_D\) and \(M_T\) to be appropriate for gene tree estimation or species tree estimation. The pipeline then operates as follows.

1.
Estimate distances between pairs of leaves using method \(M_D\).

2.
Decompose the leaf set into pairwise disjoint subsets.

2a.
Compute a starting tree by running NJ on the distance matrix computed in Step 1.

2b.
Decompose the starting tree into pairwise disjoint subsets of leaves with a predefined maximum subset size (e.g., using the centroid tree decomposition described in PASTA [35]).

2a.

3.
Build a tree on each subset using method \(M_T\), thus producing the set \({\mathcal {T}}\) of constraint trees. Note that constraint trees can be estimated in serial or in parallel, depending on the computational resources available.

4.
Run NJMerge on the input pair (\({\mathcal {T}}\), D).
Finally, although not explored in this study, this pipeline can be run in an iterative fashion by using the tree produced in step 4 to define the next subset decomposition.
Statistical consistency
Neighbor Joining (NJ) has been proven to be statistically consistent [36,37,38] under models of evolution for which pairwise distances can be estimated in a statistically consistent manner. This includes standard models of sequence evolution (e.g., the Generalized Time Reversible (GTR) model [39], which contains other models of sequence evolution, including JukesCantor [15]). More recently, NJ has been used on multilocus datasets to estimate species trees under the MultiSpecies Coalescent (MSC) model; specifically, the method, NJst [9] estimates a species tree by running NJ on the average gene tree internode distance (AGID) matrix, calculated by averaging the topological distances between pairs of species in the input set of gene trees. Allman et al. [40] showed that the AGID matrix converges to an additive matrix for the species tree, and so NJst and some other methods (e.g., ASTRID [41]) that estimate species trees from the AGID matrix are statistically consistent under the MSC model.
We now prove that NJMerge can be used in statistically consistent divideandconquer pipelines for estimating gene trees and species trees. These results follow from Theorem 3 that shows NJMerge will return the tree \(T^*\) when given a nearly additive distance matrix (Definition 2) for \(T^*\) and a set \({\mathcal {T}}\) of constraint trees that agree with \(T^*\) (Definition 1).
Definition 2
Let T be a tree with positive weights on the edges and leaves labeled \(1, 2, \dots , n\). We say that an \(n \times n\) matrix M is nearly additive for T if each entry M[i, j] differs from the distance between leaf i and leaf j in T by less than one half of the shortest branch length in T.
Theorem 3
Let \({\mathcal {T}}=\{T_1, T_2, \ldots , T_k\}\) be a set of trees, and let D be a distance matrix on \(S = \bigcup _i S_i\), where \(S_i\) is the set of leaves in \(T_i\). Let \(T^*\) be a tree on leaf set S. If D is a nearly additive matrix for \(T^*\) and if \(T_i\) agrees with \(T^*\) for all \(i \in \{1, \dots , k\}\), then NJMerge applied to input \(({\mathcal {T}}, D)\) returns \(T^*\).
Proof
NJ applied to a nearly additive distance matrix for \(T^*\) will return \(T^*\) [37]. Because all trees in \({\mathcal {T}}\) agree with \(T^*\), the siblinghood proposals suggested by NJ will never violate the trees in \({\mathcal {T}}\) or the compatibility of \({\mathcal {T}}\). Thus, NJMerge applied to \(({\mathcal {T}}, D)\) will return the same output as NJ applied to D, which is \(T^*\). \(\square \)
We now define statistical consistency in the context of gene tree estimation (Definition 4) and show that NJMerge can be used to create statistically consistent divideandconquer pipelines for gene tree estimation (Corollary 5).
Definition 4
Let \((T,\Theta )\) be a GTR model tree with topology T and numerical parameters \(\Theta \) (e.g., substitution rate matrix, branch lengths, etc). A method M for constructing gene trees from DNA sequences is statistically consistent under the GTR model if, for all \(\epsilon > 0\), there exists a constant \(l>0\) such that, given sequences of length at least l, M returns T with probability at least \(1  \epsilon \).
Corollary 5
NJMerge can be used in a gene tree estimation pipeline that is statistically consistent under the GTR model of sequence evolution.
Proof
Let \((T^*, \Theta )\) be a GTR model tree, let \(M_D\) be a method for calculating distances between pairs of sequences, and let \(M_T\) be a method for constructing trees from DNA sequences. Suppose that

the divideandconquer pipeline produces k pairwise disjoint subsets of sequences

Neighbor Joining (NJ) applied to a matrix of pairwise distances calculated using \(M_D\) is a statistically consistent method for constructing gene trees under the GTR model (e.g., the logdet distance [16])

\(M_T\) is statistically consistent under the GTR model (e.g., maximum likelihood [42, 43])
Now let \(\epsilon > 0\), and select \(\epsilon _D, \epsilon _T > 0\) such that \(\epsilon _D + k \epsilon _T < \epsilon \). By Definition 4, there exists a constant \(l_D\) such that NJ applied to matrix D computed from sequences of length at least \(l_D\) returns \(T^*\) with probability at least \(1  \epsilon _D\), and there exists a constant \(l_T\) such that \(M_T\) given DNA sequences of length at least \(l_T\) returns \(T^*\) with probability at least \(1  \epsilon _T\). If a distance matrix D is calculated using \(M_D\) and a set \({\mathcal {T}}\) of k constraint trees are constructed using \(M_T\), given sequences of length at least \(\max \{l_D, l_T\}\), then the probability that NJ applied to D returns \(T^*\) and that \(M_T\) returns a tree that agrees with \(T^*\) for all k constraint trees in \({\mathcal {T}}\) is at least \(1  \epsilon \), as
Then, by Theorem 3, NJMerge applied to the input \(({\mathcal {T}}, D)\) will return the \(T^*\) with probability at least \(1  \epsilon \), and by Definition 4, NJMerge is statistically consistent under the GTR model. \(\square \)
Finally, we define statistical consistency in the context of species tree estimation (Definition 7) and show that NJMerge can be used to create statistically consistent divideandconquer pipelines for species estimation (Corollary 7).
Definition 6
Let \((T,\Theta )\) be an MSC model tree with topology T and numerical parameters \(\Theta \) (e.g., substitution rate matrix, branch lengths, etc). A method M for constructing species trees from true gene trees is statistically consistent under the MSC model if, for all \(\epsilon > 0\), there exists a constant \(m>0\) such that, given at least m true gene trees, M returns T with probability at least \(1  \epsilon \).
Corollary 7
NJMerge can be used in a species tree estimation pipeline that is statistically consistent under the MSC model.
Proof
Let \((T^*, \Theta )\) be an MSC model tree, let \(M_D\) be a method for calculating distances between pairs of species from a set of gene trees, and let \(M_T\) be a method for constructing species trees from a set of gene trees. Suppose that

the divideandconquer pipeline produces k pairwise disjoint subsets of sequences

Neighbor Joining (NJ) applied to a matrix of pairwise distances calculated using \(M_D\) is a statistically consistent method for constructing species trees under the MSC model (e.g., the average topological distance between species in the input set of gene trees [40])

\(M_T\) is statistically consistent under the MSC model (e.g., ASTRAL [11, 45])
Now let \(\epsilon > 0\), and select \(\epsilon _D, \epsilon _T > 0\) such that \(\epsilon _D + k \epsilon _T < \epsilon \). By Definition 6, there exists a constant \(m_D\) such that NJ applied to matrix D computed from at least \(m_D\) gene trees returns \(T^*\) with probability at least \(1  \epsilon _D\), and there exists a constant \(m_T\) such that \(M_T\) given at least \(m_T\) gene trees returns \(T^*\) with probability at least \(1  \epsilon _T\). If a distance matrix D is calculated using \(M_D\) and a set \({\mathcal {T}}\) of k constraint trees are constructed using \(M_T\), both given at least \(\max \{m_D, m_T\}\) gene trees, then the probability that NJ applied to D returns \(T^*\) and that \(M_T\) returns a tree that agree with \(T^*\) for all k constraint trees in \({\mathcal {T}}\) is at least \(1  \epsilon \). Then, by Theorem 3, NJMerge applied to the input \(({\mathcal {T}}, D)\) will return the \(T^*\) with probability at least \(1  \epsilon \), and by Definition 6, NJMerge is statistically consistent under the MSC model. \(\square \)
Performance study
Our study evaluated the effectiveness of using NJMerge to estimate species trees on large multilocus datasets, simulated for this study using the protocol presented in [45]. Our simulation produced model conditions, described by two numbers of taxa (100 and 1000) and two levels of ILS (low/moderate and very high), each with 20 replicate datasets. Datasets included both exonlike sequences and intronlike sequences with exonlike sequences (“exons”) characterized by slower rates of evolution across sites (less phylogenetic signal) and intronlike sequences (“introns”) characterized by faster rates of evolution across sites (greater phylogenetic signal). The 100taxon datasets were analyzed using 25, 100, and 1000 genes, and the 1000taxon datasets were analyzed using 1000 genes; note that exons and introns were always analyzed separately. For each of these 320 datasets, we constructed distance matrices using two different methods and constraint trees using four different methods. This provided 2560 different tests on which to evaluate NJMerge. NJMerge failed on 11/2560 tests, so the failure rate (in our experiments) was less than 1%. Species tree methods were evaluated in terms of species tree estimation error (computed using normalized Robinson–Foulds (RF) distances [46]) and running time. All software commands are provided in Additional file 1.
Simulated datasets
True species and true gene trees
Datasets, each with a true species tree and 2000 true gene trees, were simulated using SimPhy version 1.0.2 [47]. All model conditions had deep speciation (towards the root) and 20 replicate datasets. By holding the effective population size constant (200K) and varying the species tree height (in generations), model conditions with different levels of ILS were generated. For species tree heights of 10M and 500K generations, the average distance between the true species tree and the true gene trees (as measured by the normalized RF distance) was 8–10% and 68–69% respectively. Thus, we referred to these levels of ILS as “low/moderate” and “very high” respectively.
True sequence alignments
Sequence alignments were simulated for each true gene tree using INDELible version 1.03 [48] under the GTR+\(\Gamma \) model of evolution without insertions or deletions. For each gene, the parameters for the GTR+\(\Gamma \) model of evolution (base frequencies, substitution rates, and alpha) were drawn from distributions based on estimates of these parameters from the Avian Phylogenomics Dataset [49]; distributions were fitted for exons and introns, separately (Additional file 1: Table S1). For each dataset (with 2000 genes), 1000 gene sequences were simulated with parameters drawn from the exon distributions, and 1000 gene sequences were simulated with parameters drawn from the intron distributions. Note that exons and introns were analyzed separately. The sequence lengths were also drawn from a distribution (varying from 300 to 1500 bp).
Estimated gene trees
Maximum likelihood gene trees were estimated using FastTree2 [50] under the GTR+CAT model of evolution. The average gene tree estimation error across all replicate datasets ranged from 26 to 51% for introns and 38 to 64% for exons and thus was higher for exon datasets (Additional file 1: Table S2). Note that gene tree estimation error was computed by the normalized symmetric difference between true and estimated gene trees, averaged across all gene trees (the normalized symmetric difference equals the normalized RF distance when both input trees are binary).
Estimated species trees
For each model condition (described by number of taxa and level of ILS), species trees estimation methods were run on the exonlike genes and the intronlike genes, separately. Species trees were estimated on 25, 100, or 1000 genes for the 100taxon datasets and 1000 genes for the 1000taxon datasets using three species tree estimation methods: ASTRALIII [11, 30, 45] (as implemented in version 5.6.1), SVDquartets [13] (as implemented in PAUP* version 4a161 [51]), and concatenation using unpartitioned maximum likelihood under the GTR+\(\Gamma \) model of evolution (as implemented in RAxML [31] version 8.2.12 with pthreads and SSE3).
NJMerge
Distance matrices
Distance matrices were created using two different approaches.

\(D_{AGID}\) refers to the average gene tree internode distance (AGID) matrix [9], computed from estimated gene trees using ASTRID [41] version 1.1.

\(D_{LD}\) refers to the logdet distance matrix [16], computed from concatenated alignment using PAUP* [51] version 4a163.
Recall that NJ applied to the AGID matrix (i.e., NJst [9]) was proven to be statistically consistent method under the MSC model [40] and that NJ applied to the logdet distance matrix was proven to be statistically consistent under the MSC model when the sequence evolution models across genes satisfy some additional assumptions (e.g., a relaxed molecular clock) [18].
Subset decomposition
We decomposed the species set into subsets as indicated by the blue dashed arrows in Fig. 4. Specifically, the NJ tree was computed for each distance matrix using FastME [52] version 2.1.5 and then the centroid tree decomposition (described in PASTA [35]) was used to create disjoint subsets of taxa from the NJ tree. Datasets with 100 species were decomposed into 4–6 subsets with a maximum subset size of 30 taxa, and datasets with 1000 species were decomposed into 10–15 subsets with a maximum subset size of 120 taxa.
Constraint trees
Constraint trees were created using four different approaches.

\({\mathcal {T}}_{true}\) refers to constraint trees computed by restricting the true species tree to each subset of species.

\({\mathcal {T}}_{AST}\) refers to constraint trees computed by running ASTRALIII on each subset, i.e., on the estimated gene trees restricted to each subset of species.

\({\mathcal {T}}_{SVD}\) refers to constraint trees computed by running SVDquartets on each subset, i.e., on the concatenated alignment restricted to each subset of species.

\({\mathcal {T}}_{RAX}\) refers to constraint trees computed by running RAxML on each subset, i.e., on the concatenated alignment restricted to each subset of species.
Notation
We often specify the inputs to NJ and NJMerge using the following notation: NJ(D) and NJMerge(\({\mathcal {T}}\), D). For example, NJMerge(\({\mathcal {T}}_{RAX}\), \(D_{LD}\)) refers to NJMerge given the RAxML constraint trees and the logdet distance matrix as input, whereas NJMerge(\({\mathcal {T}}_{RAX}\), D) refers to NJMerge given the RAxML constraint trees and either the AGID or the logdet distance matrix as input.
Evaluation
Species tree estimation error
Species tree estimation error was measured as the RF error rate, i.e., the normalized RF distance between the true and the estimated species trees both on the full species set. Since both trees were fully resolved or binary, the RF error rate is the proportion of edges in the true tree that are missing in the estimated tree. RF error rates were computed using Dendropy [53].
Running time
All computational experiments were run on the Blue Waters supercomputer, specifically, the XE6 dualsocket nodes with 64 GB of physical memory and two AMD Interlagos model 6276 CPU processors (i.e., one per socket each with 8 floatingpoint cores). All methods were given access to 16 threads with 1 thread per bulldozer (floatingpoint) core. SVDquartets and RAxML were explicitly run with 16 threads; however, ASTRALIII and NJMerge were not implemented with multithreading at the time of this study. All methods were restricted to a maximum wallclock time of 48 h.
Running time was measured as the wallclock time and recorded in seconds for all methods. For ASTRAL, SVDquartets, and RAxML, the timing data was recorded for running the method on the full dataset as well as running the method on subsets of the dataset (to produce constraint trees for NJMerge). RAxML did not complete within the maximum wallclock time of 48 h on datasets with 1000 taxa, so we used the last checkpoint file to evaluate species tree estimation error and running time. Specifically, running time was measured as the time between the info file being written and the last checkpoint file being written.
We approximated total running time of the NJMerge pipeline by combining the running timing data for estimating the distance matrix, estimating the subset trees, and combining the subset trees using NJMerge. If a user only had access to one compute node, then subset trees would need to be estimated in serial. In this case, the running time of the NJMerge pipeline \(t_P\) would be approximated as
where k is the number of subsets, \(t_D\) is time to estimate a distance matrix with method \(M_D\), \(t_T(i)\) is the time to estimate a species tree on subset i with method \(M_T\), and \(t_M\) is the time to run NJMerge given the distance matrix and the subset trees as input. The average running times for \(t_T\) and \(t_M\) are shown in Additional file 1: Tables S9, S10. The time to estimate the NJ tree from the distance matrix is not included, as this took less than a minute even for datasets with 1000 species. Note that given access to multiple compute nodes (at least 6 for the 100taxon datasets and at least 15 for the 1000species datasets), the subset trees could be estimated in parallel, as shown in [54].
It is worth noting that running ASTRALIII and computing the AGID matrix requires gene trees to be estimated. Using the same experimental setup (a single Blue Waters compute node with 64 GB of memory and 16 floatingpoint cores), FastTree2 took on average \(18 \pm 2\) min to estimate 1000 gene trees for datasets with 100 species and on average \(217 \pm 20\) min to estimate 1000 gene trees for datasets with 1000 species (Additional file 1: Tables S4, S5). The amount of time for gene tree estimation can vary greatly, depending on the method used and the analysis performed (e.g., model of sequence evolution, bootstrapping, etc.); we did not include the time to estimate gene trees in the reported running times.
Results
Pipelines using NJMerge can be thought of in two ways: (1) as techniques for potentially improving the accuracy of NJ (hopefully without a large increase in running time) or (2) as techniques for potentially improving the scalability or speed of the method \(M_T\) used to compute constraint trees (hopefully without sacrificing accuracy). When distancebased species tree estimation is not as accurate as some other species tree methods, we would predict that NJMerge (when given constraint trees estimated using highly accurate species tree methods) would be more accurate than traditional NJ. Because NJMerge, like NJ, is typically faster than other species tree methods, we would predict that NJMerge would improve the running time of more computationally intensive methods (such as RAxML) used to estimate constraint trees, hopefully without sacrificing accuracy.
Thus, we compared the accuracy of the NJMerge pipeline to traditional NJ, and we also compared the accuracy and running time of the NJMerge pipeline to running \(M_T\) on the full dataset, where \(M_T\) is the method used to estimate the constraint trees for NJMerge. Results are shown here for intronlike datasets; results for exonlike datasets are shown in Additional file 1. Unless otherwise noted, results were similar for both sequence types; however, species trees estimated on the exon datasets had slightly higher error rates than those estimated on the intron datasets. This is expected, as the exons had slower rates of evolution (and thus less phylogenetic signal) than the introns.
How do pipelines using NJMerge compare to Neighbor Joining (NJ)?
In this section, we report results on the effectiveness of using NJMerge as compared to NJ in terms of accuracy.
Impact of estimated distance matrix
We compared the accuracy of the NJMerge pipeline to traditional NJ on distance matrices estimated from datasets with 100 taxa and varying numbers of genes (Fig. 5; Additional file 1: Figure S1). Because the accuracy of NJMerge also depends on error in the input constraint trees, we considered an idealized case where NJMerge was given true constraint trees (i.e., constraint trees that agree with the true species tree). We found that NJMerge(\({\mathcal {T}}_{true}\), D) was more accurate than NJ(D) for all model conditions and that the difference in error was especially large when the number of genes was small and the level of ILS was very high (e.g., the difference in mean error was greater than 15% when matrices were estimated from 25 introns but was closer to 5% when matrices were estimated from 1000 introns). A similar trend was observed for matrices computed using the logdet distance. Interestingly, both NJ(D) and NJMerge(\({\mathcal {T}}_{true}\), D) were more accurate when given the AGID matrix rather than the logdet distance matrix as input—even when the level of ILS was low/moderate. In summary, NJMerge(\({\mathcal {T}}_{true}\), D) was always more accurate than NJ(D), but the improvement in accuracy was greater under challenging model conditions, suggesting that NJMerge(\({\mathcal {T}}_{true}\), D) was more robust to error in the distance matrix than NJ(D).
Impact of estimated constraint trees
We compared traditional NJ to the NJMerge pipeline given estimated constraint trees on datasets with 1000 taxa and 1000 genes (Fig. 6; Additional file 1: Figure S2). When the level of ILS was low/moderate, NJMerge outperformed NJ regardless of the method used to estimate species trees. For intronlike datasets with low/moderate ILS, the use of constraint trees reduced the median species tree error from 11–14% (NJ) to less than 3–6% (NJMerge); however, when the level of ILS was very high, the performance of NJMerge varied greatly with the species tree method. Specifically, NJMerge(\({\mathcal {T}}_{SVD}\), D) and NJMerge(\({\mathcal {T}}_{RAX}\), D) were less accurate than NJ(D) by 0–4% on average, whereas NJMerge(\({\mathcal {T}}_{AST}\), D) was more accurate than NJ(D) by 0–1% on average (Additional file 1: Tables S7, S8). These trends were consistent with the relative performance of methods on the 100taxon datasets (Fig. 7 and Additional file 1: Figure S3); specifically, when the level of ILS was very high, SVDquartets and RAxML performed worse than running NJ on either the AGID matrix or the logdet distance matrix. In summary, NJMerge was highly impacted by the quality of the constraint trees—so that accurate constraint trees resulted in NJMerge being more accurate than NJ, but inaccurate constraint trees resulted in NJMerge being less accurate than NJ.
How do pipelines using NJMerge compare to ASTRALIII, SVDquartets, and RAxML?
In this section, we compare the running time and the accuracy of the NJMerge pipeline to running \(M_T\) on the full dataset, where \(M_T\) is the method used to estimate constraint trees for NJMerge. Because NJMerge was more accurate when given the AGID matrix (Fig. 5; Additional file 1: Figure S1), results for NJMerge given the AGID distance matrix are shown here, and results for NJMerge given the logdet distance matrix are shown in Additional file 1.
ASTRALIII vs. NJMerge
Both NJMerge(\({\mathcal {T}}_{AST}\), \(D_{AGID}\)) and NJMerge(\({\mathcal {T}}_{AST}\), \(D_{LD}\)) provided running time advantages over ASTRALIII under some model conditions. While ASTRALIII completed on all the low/moderate ILS datasets with 1000 taxa and 1000 genes in less than 9 h on average, ASTRALIII failed to complete within the maximum wallclock time of 48 h on 23/40 datasets with 1000 taxa, 1000 genes, and very high ILS (Table 1). On the other 17/40 datasets, ASTRALIII ran for more than 2000 min (approximately 33 h). This difference between the low/moderate ILS and the very high ILS datasets is noteworthy (see discussion). In contrast, NJMerge(\({\mathcal {T}}_{AST}\), \(D_{AGID}\)) completed in under 300 min (approximately 5 h) on average, including the time it took to estimate the distance matrix and the ASTRALIII subset trees in serial (Fig. 8, Additional file 1: Figure S4). Note that NJMerge(\({\mathcal {T}}_{AST}\), \(D_{AGID}\)) failed on 0 datasets, and NJMerge(\({\mathcal {T}}_{AST}\), \(D_{LD}\)) failed on 2 datasets (Table 1). In summary, NJMerge substantially reduced the running time of ASTRALIII on the 1000taxon, 1000gene datasets with very high ILS.
ASTRALIII and NJMerge(\({\mathcal {T}}_{AST}\), \(D_{AGID}\)) achieved similar levels of accuracy with the mean species tree error within 0–2% for both intron and exon datasets (Fig. 8; Additional file 1: Figure S4, Table S7). Trends were similar for NJMerge(\({\mathcal {T}}_{AST}\), \(D_{LD}\)) except when the level of ILS was very high; under these conditions, the mean error of NJMerge(\({\mathcal {T}}_{AST}\), \(D_{LD}\)) was 2–6% greater than that of ASTRALIII (Additional file 1: Figures S7 and S8, Table S8).
NJMerge vs. SVDquartets
Species trees can be estimated with SVDquartets using the full set of \(n \atopwithdelims ()4\) quartet trees or a subset of quartet trees. Based on a prior study [55], which showed that the best accuracy was obtained when using all quartet trees, we computed all \(n \atopwithdelims ()4\) quartet trees for 100taxon datasets. However, on datasets with 1000 taxa, SVDquartets was run using a random subset of quartet trees (without replacement), because the maximum number of quartets allowed by SVDquartets (as implemented by PAUP*) was \(4.15833 \times 10^{10}\). Running PAUP* resulted in a segmentation fault for all 1000taxon datasets, i.e., SVDquartets failed on 40/40 datasets with 1000 taxa and 1000 genes. In contrast, NJMerge(\({\mathcal {T}}_{SVD}\), \(D_{AGID}\)) failed on 0 datasets, and NJMerge(\({\mathcal {T}}_{SVD}\), \(D_{LD}\)) failed on 3 datasets (Table 1).
NJMerge also improved running time on datasets with 100 taxa; for example, SVDquartets completed in 19–81 min on average, whereas NJMerge(\({\mathcal {T}}_{SVD}\), \(D_{AGID}\)) completed in less than 2 min on average for datasets with 100 taxa and 1000 genes (Fig. 9; Additional file 1: Figure S5). This running time comparison does not take into account the time needed to estimate gene trees, which required on average 18 min using FastTree2 on datasets with 100 taxa and 1000 genes.
NJMerge(\({\mathcal {T}}_{SVD}\), \(D_{AGID}\)) typically produced species trees with less error than SVDquartets. The difference between methods was typically small (between 0 and 2%) when the level of ILS was low/moderate but could be larger than 10% when the level of ILS was very high. Similar trends were observed for NJMerge(\({\mathcal {T}}_{SVD}\), \(D_{LD}\)) (Additional file 1: Figures S9, S10).
NJMerge vs. RAxML
NJMerge(\({\mathcal {T}}_{RAX}\), \(D_{AGID}\)) and NJMerge(\({\mathcal {T}}_{RAX}\), \(D_{LD}\)) reduced the running time of RAxML by more than half—even though RAxML was run on the subset trees in serial (Fig. 10 and Additional file 1: Figure S6). For the 1000taxon datasets, the final checkpoint was written by RAxML after more than 2250 min (\(\sim \) 37.5 h) on average. In comparison, when RAxML was run on subsets in serial, the average running time of NJMerge(\({\mathcal {T}}_{RAX}\), \(D_{AGID}\)) was between 500 (approximately 8.5 h) and 1500 min (approximately 25 h). Although these running times for NJMerge do not include the time to estimate gene trees, recall that it took on average 217 min (less than 4 h) to estimate 1000 gene trees on datasets with 1000 species using FastTree2.
While NJMerge can fail to return a tree, NJMerge failed less frequently than RAxML—when both methods were given the same computational resources. NJMerge(\({\mathcal {T}}_{RAX}\), \(D_{AGID}\)) failed on 1 dataset, and NJMerge(\({\mathcal {T}}_{RAX}\), \(D_{LD}\)) failed on 2 datasets. In contrast, for datasets with 1000 taxa, RAxML failed to run on 38 intronlike datasets and 3 exonlike datasets due to “Out of Memory” (OOM) errors (Table 1); the difference between the number of intronlike versus the number of exonlike datasets is noteworthy (see discussion).
For datasets with low/moderate levels of ILS, RAxML produced species trees with less error (0–3% on average) than NJMerge(\({\mathcal {T}}_{RAX}\), \(D_{AGID}\)); however, for datasets with very high levels of ILS, NJMerge(\({\mathcal {T}}_{RAX}\), \(D_{AGID}\)) produced species trees with less error (0–4% on average) than RAxML (Fig. 10; Additional file 1: Figure S6). Similar trends were observed for NJMerge(\({\mathcal {T}}_{RAX}\), \(D_{LD}\)) (Additional file 1: Figures S11, S12).
Discussion
Remarks on the utility of pipelines using NJMerge
Pipelines using NJMerge can be viewed either as techniques for improving traditional NJ or as techniques for scaling a computationallyintensive base method (previously referred to as \(M_T\)) to larger datasets. Thus, in order to maximize the utility of NJMerge, users should select a base method that is both more accurate and more computationallyintensive than NJ. Our results show that selecting base methods for NJMerge may not be trivial when analyzing phylogenomic datasets—because both accuracy and running time were impacted by the level of ILS. For example, ASTRALIII was very fast when the level of ILS was low/moderate but was substantially slower when the level of ILS was very high. Similarly, SVDquartets and RAxML were both more accurate than NJ(\(D_{AGID}\)), i.e., NJst, when the level of ILS was low/moderate but were less accurate than these methods when the level of ILS was very high; note that this trend is consistent with results from [12] (also see the review paper by [56]). Overall, our results suggest that constraint trees should be estimated using RAxML when the level of ILS is low/moderate and using ASTRALIII when the level of ILS is very high, and thus, determining the level of ILS in a given phylogenomic datasets is an important area of future research. Finally, we note that NJMerge, when given constraint trees that agreed with the true species tree, was very accurate (less than 2% error on average) even when the level of ILS was very high, suggesting that NJMerge is a promising technique for scaling Bayesian methods (e.g., Starbeast2 [8]) and future species tree methods to larger datasets.
Although NJMerge can fail, this should not discourage potential users, as NJMerge failed on fewer datasets than ASTRALIII, SVDquartets, or RAxML—when all methods were given the same computational resources, including a maximum wallclock time of 48 h. In our experiments, NJMerge failed on only 11/2560 test cases from running NJMerge on 320 datasets with two different types of distance matrices and four different types of constraint trees (Table 1).
Importantly, in all our experiments, NJMerge was run within the divideandconquer pipeline shown in Fig. 4, specifically, with subsets of taxa derived from decomposing the NJ tree (blue dashed lines). Because NJMerge was always given inputs generated by this pipeline, our results on the accuracy, the failure rate, and the running time of NJMerge may not generalize to arbitrary inputs.
Remarks on other results
Impact of distance matrix on NJ
Our results showed that on average NJ(\(D_{AGID}\)) was either as accurate or else more accurate than NJ(\(D_{LD}\)). Notably, there was a clear difference between these two methods on datasets with 100 taxa and low/moderate levels of ILS; specifically NJ(\(D_{AGID}\)) produced trees with less than 5% error on average, whereas NJ(\(D_{LD}\)) produced trees with greater than 10% error on average). However, on the exact same model condition but with 1000 taxa, NJ(\(D_{AGID}\)) and NJ(\(D_{LD}\)) produced trees with similar levels of accuracy. This may be due to the difference between the median branch length between low/moderate ILS datasets with 100 taxa and 1000 taxa (Additional file 1: Table S3); furthermore, it is possible that branch length and other factors that limit the accuracy of NJ(\(D_{LD}\)) in the context of gene tree estimation would also apply in the context of species tree estimation. However, it is interesting to note that NJ(\(D_{LD}\)) was more accurate than either SVDquartets or RAxML when the level of ILS was very high, providing support for Allman et al.’s statement, “The simplicity and speed of distancebased inference suggests logdet based methods should serve as benchmarks for judging more elaborate and computationallyintensive species trees inference methods” [18].
Impact of ILS and sequence type on ASTRALIII
Our results showed that ASTRALIII was much faster on the low/moderate ILS datasets than on the very high ILS datasets. This finding makes sense in light of ASTRALIII’s algorithm design. ASTRALIII operates by searching for an optimal solution to its search problem within a constrained search space that is defined by the set \({\mathcal {X}}\) of bipartitions in the estimated gene trees, and in particular, ASTRALIII’s running time scales with \({\mathcal {X}}^{1.726}\) [30]. The set of gene trees will become more heterogeneous for higher levels of ILS, and thus, the size of \({\mathcal {X}}\) will increase, as every gene tree could be different when the level of ILS is very high. In addition, gene tree estimation error can also increase the size of \({\mathcal {X}}\), explaining why ASTRALIII failed to complete on exon datasets more often than on intron datasets (Table 1, Additional file 1: Table S2).
Impact of sequence type on RAxML
Our results showed that RAxML failed on more intronlike datasets than exonlike datasets. This finding makes sense in light of RAxML’s implementation. RAxML uses redundancy in site patterns to store the input alignment compactly, so that the memory scales with the number of unique site patterns. The intron datasets had more unique site patterns than the exon datasets (i.e., greater phylogenetic signal and lower gene tree estimation error), which explains why RAxML required more memory when analyzing introns.
Remarks on the statistical consistency of pipelines using NJMerge
Although NJMerge can fail to return a tree, by statistical consistency under the MSC model (Corollary 7), the probability that NJMerge fails goes to zero as the number of true gene trees goes to infinity. In fact, NJMerge was designed to have this theoretical guarantee via the selection of the heuristic for determining whether or not to accept a siblinghood proposal. It is easy to think of other heuristics that prevent NJMerge from failing but do not have the guarantee of correctness (Theorem 3) and thus do not have the guarantee of statistical consistency (Corollary 7). Designing heuristics that prevent NJMerge from failing but have good theoretical properties is an area of future research.
As mentioned previously, our proof of statistical consistency under the MSC model requires that the number of true gene trees goes to infinity, which is the equivalent of requiring that both the number of gene trees and the sequence length per gene tree go to infinity. Roch et al. [6] recently showed that essentially all gene tree summary methods (e.g., NJst [40], and ASTRAL [11]) are not statistically consistent under the MSC if the sequence length per gene is fixed—and these theoretical results apply to NJMerge as well. The failure to be statistically consistent when the sequence length per gene is bounded is not unique to gene tree summary methods or NJMerge, as Roch et al. also showed that fully partitioned maximum likelihood is not consistent under these conditions, and [5] had shown that unpartitioned maximum likelihood is also not consistent.
Conclusions
In this paper, we introduced a divideandconquer approach to phylogeny estimation that (1) decomposes a set of species into pairwise disjoint subsets, (2) builds trees on each subset of species using a base method, and (3) merges the subsets trees together using a distance matrix. For the merger step, we presented a new method, called NJMerge, and proved that some divideandconquer pipelines using NJMerge are statistically consistent under some models of evolution. We then evaluated pipelines using NJMerge in the context of species tree estimation, specifically using simulated multilocus datasets with up to 1000 species and two levels of ILS. We found that pipelines using NJMerge provided several benefits to largescale species tree estimation. Specifically, under some model conditions, pipelines using NJMerge improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRALIII, SVDquartets, and “concatenation” using RAxML) without sacrificing accuracy (see discussion for details as the results depended on the level of ILS). Finally, although NJMerge can fail to return a tree, in our experiments, pipelines using NJMerge failed on only 11 out of 2560 test cases. Together these results suggest that NJMerge is a promising approach for scaling highly accurate but computationallyintensive methods to larger datasets.
This study also suggests several different directions for future research. Since NJMerge uses a heuristic (which can fail) to test for tree compatibility (in deciding whether to accept a siblinghood proposal), a modification to NJMerge to use an exact method for this problem would reduce the failure rate and—if sufficiently fast—would still enable scalability to large datasets. In addition, all aspects of the divideandconquer pipeline could be modified and tested; for example, the robustness of NJMerge to the starting tree and initial subset decomposition could be evaluated. Finally, divideandconquer pipelines using NJMerge could be compared to traditional divideandconquer pipelines (e.g., Disk Covering Methods) when robust implementations become publicly available for species tree estimation. Other agglomerative techniques for merging disjoint subset trees are being developed (e.g., the agglomerative technique described in [57] for gene tree estimation has good theoretical properties but has not yet been implemented), and NJMerge should be compared to such techniques when they become publicly available.
Availability of data and materials
The datasets supporting the conclusions of this article are available in the following Illinois Data Bank repositories: https://doi.org/10.13012/B2IDB1424746_V1 and https://doi.org/10.13012/B2IDB0569467_V2.
Abbreviations
 GTR:

Generalized Time Reversible
 ILS:

incomplete lineage sorting
 MSC:

MultiSpecies Coalescent
 NJ:

Neighbor Joining
 RF:

Robinson–Foulds
References
Maddison WP. Gene trees in species trees. Syst Biol. 1997;46(3):523–36. https://doi.org/10.1093/sysbio/46.3.523.
Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988;5(5):568–83.
Rannala B, Yang Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics. 2003;164(4):1645–56.
Edwards SV. Is a new and general theory of molecular systematics emerging? Evolution. 2009;63(1):1–19. https://doi.org/10.1111/j.15585646.2008.00549.x.
Roch S, Steel M. Likelihoodbased tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 2015;100:56–62. https://doi.org/10.1016/j.tpb.2014.12.005.
Roch S, Nute M, Warnow T. Longbranch attraction in species tree estimation: inconsistency of partitioned likelihood and topologybased summary methods. Syst Biol. 2018;68:281–97. https://doi.org/10.1093/sysbio/syy061.
Heled J, Drummond AJ. Bayesian inference of species trees from multilocus data. Mol Biol Evol. 2010;27(3):570–80. https://doi.org/10.1093/molbev/msp274.
Ogilvie HA, Bouckaert RR, Drummond AJ. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Mol Biol Evol. 2017;34(8):2101–14. https://doi.org/10.1093/molbev/msx126.
Liu L, Yu L. Estimating species trees from unrooted gene trees. Syst Biol. 2011;60(5):661–7. https://doi.org/10.1093/sysbio/syr027.
Saitou N, Nei M. The neighborjoining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–25. https://doi.org/10.1093/oxfordjournals.molbev.a040454.
Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genomescale coalescentbased species tree estimation. Bioinformatics. 2014;30(17):541–8. https://doi.org/10.1093/bioinformatics/btu462.
Molloy EK, Warnow T. To include or not to include: the impact of gene filtering on species tree estimation methods. Syst Biol. 2018;67(2):285–303. https://doi.org/10.1093/sysbio/syx077.
Chifman J, Kubatko L. Quartet inference from SNP data under the coalescent model. Bioinformatics. 2014;30(23):3317–24. https://doi.org/10.1093/bioinformatics/btu530.
Jiang T, Kearney P, Li M. A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application. SIAM J Comput. 2001;30(6):1942–61. https://doi.org/10.1137/S0097539799361683.
Jukes TH, Cantor CR. Evolution of protein molecules. In: Munro HN, editor. Mammalian protein metabolism, vol. 3. New York: Academic Press; 1969. p. 21–132.
Steel MA. Recovering a tree from the leaf colourations it generates under a Markov model. Appl Math Lett. 1994;7(2):19–24.
Dasarathy G, Nowak R, Roch S. Data requirement for phylogenetic inference from multiple loci: a new distance method. IEEE/ACM Trans Comput Biol Bioinform. 2015;12(2):422–32. https://doi.org/10.1109/TCBB.2014.2361685.
Allman ES, Long C, Rhodes JA. Species tree inference from genomic sequences using the logdet distance. 2018. arXiv:1806.04974.
Warnow T, Moret BME, St. John K. Absolute convergence: true trees from short sequences. In: Proceedings of the twelfth annual ACMSIAM symposium on discrete algorithms. SODA ’01. Philadelphia: Society for Industrial and Applied Mathematics; 2001. p. 186–95.
Huson DH, Vawter L, Warnow T. Solving large scale phylogenetic problems using DCM2. In: Proceedings of the seventh international conference on intelligent systems for molecular biology. Palo Alto: AAAI Press; 1999. p. 118–29.
Lagergren J. Combining polynomial running time and fast convergence for the diskcovering method. J Comput Syst Sci. 2002;65(3):481–93. https://doi.org/10.1016/S00220000(02)000053.
Nelesen S, Liu K, Wang LS, Linder CR, Warnow T. DACTAL: divideandconquer trees (almost) without alignments. Bioinformatics. 2012;28(12):274–82. https://doi.org/10.1093/bioinformatics/bts218.
Bayzid MS, Hunt T, Warnow T. Disk covering methods improve phylogenomic analyses. BMC Genom. 2014;15(6):7. https://doi.org/10.1186/1471216415S6S7.
Warnow T. Computational phylogenetics: an introduction to designing methods for phylogeny estimation. Cambridge: Cambridge University Press; 2017.
Bodlaender HL, Fellows MR, Warnow TJ. Two strikes against perfect phylogeny. In: Automata, languages and programming: 19th international colloquium Wien, Austria, July 13–17, 1992 proceedings. Berlin: Springer; 1992. p. 273–83. https://doi.org/10.1007/3540557199_80.
Bansal MS, Burleigh JG, Eulenstein O, FernándezBaca D. Robinson–Foulds supertrees. Algorithms Mol Biol. 2010;5(1):18. https://doi.org/10.1186/17487188518.
Ragan MA. Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol. 1992;1(1):53–8. https://doi.org/10.1016/10557903(92)90035F.
Nguyen N, Mirarab S, Warnow T. MRL and SuperFine+MRL: new supertree methods. Algorithms Mol Biol. 2012;7(1):3. https://doi.org/10.1186/1748718873.
Warnow T. Supertree construction: opportunities and challenges. ArXiv eprints; 2018. arXiv:1805.03530
Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRALIII: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinform. 2018;19(6):153. https://doi.org/10.1186/s128590182129y.
Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and postanalysis of large phylogenies. Bioinformatics. 2014;30(9):1312–3. https://doi.org/10.1093/bioinformatics/btu033.
Steel M. The complexity of reconstructing trees from qualitative characters and subtrees. J Classif. 1992;9(1):91–116. https://doi.org/10.1007/BF02618470.
Warnow TJ. Tree compatibility and inferring evolutionary history. J Algorithms. 1994;16(3):388–407. https://doi.org/10.1006/jagm.1994.1018.
Aho AV, Sagiv Y, Szymanski TG, Ullman JD. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981;10(3):405–21. https://doi.org/10.1137/0210030.
Mirarab S, Nguyen N, Guo S, Wang LS, Kim J, Warnow T. PASTA: ultralarge multiple sequence alignment for nucleotide and aminoacid sequences. J Comput Biol. 2015;22(5):377–86. https://doi.org/10.1089/cmb.2014.0156.
Gascuel O. Concerning the NJ algorithm and its unweighted version, UNJ. In: Roberts FS, Rzhetsky A, editors. Mathematical hierarchies and biology. Providence: American Mathematical Society; 1997. p. 149–70.
Atteson K. The performance of neighborjoining methods of phylogenetic reconstruction. Algorithmica. 1999;25(2–3):251–78. https://doi.org/10.1007/PL00008277.
Bryant D. On the uniqueness of the selection criterion in neighborjoining. J Classif. 2005;22:3–15. https://doi.org/10.1007/s003570050003x.
Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect Math Life Sci. 1986;17(2):57–86.
Allman ES, Degnan JH, Rhodes JA. Species tree inference from gene splits by unrooted STAR methods. IEEE/ACM Trans Comput Biol Bioinform. 2018;15(1):337–42. https://doi.org/10.1109/TCBB.2016.2604812.
Vachaspati P, Warnow T. ASTRID: accurate species trees from internode distances. BMC Genom. 2015;16(10):3. https://doi.org/10.1186/1471216416S10S3.
Neyman J. Molecular studies of evolution: a source of novel statistical problems. In: Gupta SS, Yackel J, editors. Statistical decision theory and related topics. Cambridge: Academic Press; 1971. p. 1–27. https://doi.org/10.1016/B9780123075505.500058.
Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17(6):368–76. https://doi.org/10.1007/BF01734359.
Mitrinović DS. Analytic inequalities. New York: Springer; 1970.
Mirarab S, Warnow T. ASTRALII: coalescentbased species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015;31(12):44–52. https://doi.org/10.1093/bioinformatics/btv234.
Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53(1):131–47. https://doi.org/10.1016/00255564(81)900432.
Mallo D, De Oliveira Martins L, Posada D. SimPhy: phylogenomic simulation of gene, locus, and species trees. Syst Biol. 2016;65(2):334–44. https://doi.org/10.1093/sysbio/syv082.
Fletcher W, Yang Z. INDELible: a flexible simulator of biological sequence evolution. Mol Biol Evol. 2009;26(8):1879–88. https://doi.org/10.1093/molbev/msp098.
Jarvis ED, Mirarab S, et al. Wholegenome analyses resolve early branches in the tree of life of modern birds. Science. 2014;346(6215):1320–31. https://doi.org/10.1126/science.1253451.
Price MN, Dehal PS, Arkin AP. FastTree 2—approximately maximumlikelihood trees for large alignments. PLoS ONE. 2010;5(3):1–10. https://doi.org/10.1371/journal.pone.0009490.
Swofford DL. PAUP* (*Phylogenetic Analysis using PAUP); 2018. http://phylosolutions.com/pauptest/.
Lefort V, Desper R, Gascuel O. FastME 2.0: a comprehensive, accurate, and fast distancebased phylogeny inference program. Mol Biol Evol. 2015;32(10):2798–800. https://doi.org/10.1093/molbev/msv150.
Sukumaran J, Holder MT. DendroPy: a Python library for phylogenetic computing. Bioinformatics. 2010;26(12):1569–71. https://doi.org/10.1093/bioinformatics/btq228.
Molloy EK, Warnow T. NJMerge: a generic technique for scaling phylogeny estimation methods and its application to species trees. In: Blanchette M, Ouangraoua A, editors. Comparative genomics. RECOMBCG 2018. Lecture notes in computer science, vol. 11183. Cham: Springer; 2018. https://doi.org/10.1007/9783030008345_15.
Swenson MS, Suri R, Linder CR, Warnow T. An experimental study of Quartets MaxCut and other supertree methods. Algorithms Mol Biol. 2011;6(1):7. https://doi.org/10.1186/1748718867.
Xu B, Yang Z. Challenges in species tree estimation under the multispecies coalescent model. Genetics. 2016;204(4):1353–68. https://doi.org/10.1534/genetics.116.190173.
Zhang QR, Rao S, Warnow TJ. New absolute fast converging phylogeny estimation methods with improved scalability and accuracy. In: 18th international workshop on algorithms in bioinformatics, WABI 2018, August 20–22, 2018, Helsinki, Finland. 2018. pp. 8–1812. https://doi.org/10.4230/LIPIcs.WABI.2018.8
Acknowledgements
We thank the anonymous reviewers of our RECOMBCG paper, whose feedback led to improvements in the quality of this paper.
Availability and requirements
Project name: NJMerge; Project home page: https://github.com/ekmolloy/njmerge; Operating systems: Platform independent; Programming language: Python version 2.7; Other requirements: Dendropy version 4.3.0; License: BSD 3Clause; Any restrictions to use by nonacademics: None.
Funding
This work was supported by the U.S. National Science Foundation (grants 1535977 and 1513629) to TW. EKM was supported by the NSF Graduate Research Fellowship (Award DGE1144245) and the Ira and Debra Cohen Graduate Fellowship in Computer Science. Computational experiments were performed on Blue Waters. This research is part of the Blue Waters sustainedpetascale computing project, which is supported by the NSF (Awards OCI0725070 and ACI1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at UrbanaChampaign and its National Center for Supercomputing Applications.
Author information
Authors and Affiliations
Contributions
Both authors designed the study, proved the theorems, and wrote the paper. EKM implemented the algorithm and performed the simulation study. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1.
Detailed methods section, including software commands. 12 figures and 10 tables describing additional results, including those for exonlike datasets and those for analyses using logdet distance matrices.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Molloy, E.K., Warnow, T. Statistically consistent divideandconquer pipelines for phylogeny estimation using NJMerge. Algorithms Mol Biol 14, 14 (2019). https://doi.org/10.1186/s130150190151x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s130150190151x