Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge

Background Divide-and-conquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of such approaches. Results In this paper, we introduce a divide-and-conquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divide-and-conquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multi-locus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and “concatenation” using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases. Conclusions Theoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on Github (http://github.com/ekmolloy/njmerge). Electronic supplementary material The online version of this article (10.1186/s13015-019-0151-x) contains supplementary material, which is available to authorized users.


Introduction
Estimating evolutionary trees, called phylogenies, from molecular sequence data is a fundamental problem in computational biology, and building the Tree of Life is a scientific grand challenge. It is also a computational grand challenge, as many of the most accurate phylogeny estimation methods are heuristics for NP-hard optimization problems. Species tree estimation can be further complicated by biological processes (e.g., incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer) that create heterogeneous evolutionary histories across genomes or "gene tree discordance" [1].
Incomplete lineage sorting (ILS), which is modeled by the Multi-Species Coalescent (MSC) model [2,3], has been shown to present challenges for phylogenomic analyses [4]. In addition, while the standard approach for multi-locus species tree estimation uses maximum likelihood methods (e.g., RAxML) on the concatenated multiple sequence alignment, recent studies have established that even exact algorithms for maximum likelihood are not statistically consistent methods for multi-locus species tree estimation under the MSC model (see [5] for a  14:14 proof for unpartitioned maximum likelihood and [6] for fully partitioned maximum likelihood). Because concatenation analyses using maximum likelihood are provably not statistically consistent in the presence of incomplete lineage sorting, new methods have been developed that are provably statistically consistent under the MSC model. Bayesian methods that co-estimate gene trees and species trees (e.g., [7,8]) are statistically consistent and expected to be the highly accurate; however, such methods are also prohibitively expensive on large datasets. More efficient approaches have been developed that are statistically consistent under the MSC model, including "gene tree summary methods", which take a collection of gene trees as input and then compute a species tree from the gene trees using only the gene tree topologies. For example, NJst [9] runs Neighbor Joining (NJ) [10] on the "average gene tree internode distance" (AGID) matrix, and ASTRAL [11] finds a quartetmedian tree (i.e. a species tree that maximizes the total quartet tree similarity to the input gene trees) within a constrained search space. However, gene tree summary methods can have reduced accuracy when gene tree estimation error is high, which is a problem for many phylogenomic datasets (see discussion in [12]).
Because of the impact of gene tree estimation error, alternative approaches that bypass gene tree estimation, called "site-based" methods, have been proposed. Perhaps the best known site-based method is SVDquartets [13], which estimates quartet trees from the concatenated sequence alignments (using statistical properties of the MSC model and the sequence evolution model) and then combines the quartet trees into a tree on the full set of species using quartet amalgamation methods that are heuristics for the Maximum Quartet Consistency problem [14]. Other examples of site-based methods include computing Jukes-Cantor [15] or log-det [16] distances from the concatenated alignment and then running NJ on the resulting distance matrix. Such approaches can be statistically consistent under the MSC model when the sequence evolution models across genes satisfy some additional assumptions (e.g., a relaxed molecular clock) [17,18].
Many of these methods (e.g., ASTRAL, SVDquartets, and concatenation using RAxML) are heuristics for NPhard optimization problems. Such methods can have difficulties scaling to datasets with large numbers of species, and divide-and-conquer approaches have been developed to scale methods to larger datasets (e.g., the family of disk covering methods [19][20][21][22][23][24]). Such methods operate by dividing the species set into overlapping subsets, constructing trees on the subsets, and then merging the subset trees into a tree on the entire species set. The last step of this process, called "supertree estimation", can provide good accuracy (i.e., retain much of the accuracy in the subset trees) if good supertree methods are used. Notably, the supertree compatibility problem is NP-complete [25], and the preferred supertree methods attempt to solve NP-hard optimization problems (e.g., the Robinson-Foulds supertree problem [26], the Maximum Quartet Consistency problem [14], the Matrix Representation with Parsimony problem [27], and the Matrix Representation with Likelihood problem [28]). In summary, none of the current supertree methods provide both accuracy and scalability to datasets with large numbers of species (see [29] for further discussion).
In this paper, we introduce a new divide-and-conquer approach to scaling phylogeny estimation methods to large datasets: we divide the species (or leaf ) set into pairwise disjoint subsets, construct a tree on each of the subsets, and then assemble the subset trees into a tree on the entire species set. Supertree methods cannot be used to combine trees on pairwise disjoint leaf sets, and we present a new polynomial-time method, called NJMerge, for this task. We prove that NJMerge can be used in statistically consistent divide-and-conquer pipelines for both gene tree and species tree estimation and evaluate the effectiveness of using NJMerge in the context of multilocus species tree estimation. We found, using an extensive simulation study, that NJMerge sometimes improved the accuracy of traditional NJ and that NJMerge provided substantial improvements in the running time for three methods (ASTRAL-III [30], SVDquartets [13], and concatenation using RAxML [31]) without sacrificing accuracy. Furthermore, NJMerge enabled SVDquartets and RAxML to run on large datasets (e.g., 1000 taxa and 1000 genes), on which SVDquartets and RAxML would otherwise fail to run when limited to 64 GB of memory. While NJMerge is not guaranteed to return a tree; the failure rate in our experiments was low (less than 1% of tests). In addition, NJMerge failed on fewer datasets than either ASTRAL-III, SVDquartets, or RAxML-when given the same computational resources: a single compute node with 64 GB of physical memory, 16 cores, and a maximum wall-clock time of 48 h. Together, these results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited.

NJMerge
Neighbor Joining (NJ) [10], perhaps the most widely used polynomial-time method for phylogeny estimation, estimates a tree T from a dissimilarity matrix D; NJMerge is a polynomial-time extension of NJ to impose a set of constraints on the output tree T (Fig. 1). More formally, NJMerge takes as input a dissimilarity matrix D on leaf set S = {s 1 , s 2 , . . . , s n } and a set T = {T 1 , T 2 , . . . , T k } of unrooted binary trees on pairwise disjoint subsets of the leaf set S and returns a tree T that agrees with every tree in T (Definition 1). Note that the output tree T is a compatibility supertree for T and that because the trees in T are on pairwise disjoint subsets of the leaf set S, a compatibility supertree always exists. NJMerge does not require that the input constraint trees T to form clades in T. For example, the caterpillar tree on {A, B, C, D, E, F , G, H} obtained by making a path with the leaves hanging off it in alphabetical order is a compatibility supertree for T = {AC|EG, BD|FH } , and yet the trees in T do not form clades within the caterpillar tree (Fig. 2). Of course, other compatibility supertrees exist for T , and, in some of them, the input constraint trees will form clades. The objective is to find a tree that is close to the true (but unknown) tree from the set of all compatibility supertrees for T , and NJMerge tries to achieve this objective by using the dissimilarity matrix D.

Definition 1
Let T be a tree on leaf set S, and let T ′ be a tree on leaf set R ⊆ S . We say that T ′ agrees with T if restricting T to leaf set R induces a binary tree that (after suppressing the internal nodes of degree 2) is isomorphic to T ′ .
Here we briefly describe the NJ algorithm by Saitou and Nei [10]. NJ has an iterative design that builds the tree from the bottom up, producing a rooted tree that is then unrooted. Initially, all n leaves are in separate components. When a pair of leaves is selected to be siblings, the pair of leaves is effectively replaced by a rooted tree on two leaves, and the number of components is reduced by one. This process repeats until there is only one component: a tree on the full leaf set. At each iteration, NJ updates D based on the new sibling pair, derives a new matrix Q from D, and uses Q to determine which pair of the remaining nodes to join. Specifically, NJ accepts siblinghood proposal (i, j) such that Q[i, j] is minimized. The same formulas used by NJ [10] to update D and compute Q are also used by NJMerge; however, NJMerge can make different siblinghood decisions than NJ-based on the input constraint trees.
After each siblinghood decision, NJMerge updates the constraint trees. Specifically, when two leaves are made siblings, they are replaced by a new leaf, and the constraint trees are relabeled. For example, if x is a leaf in T i and y is a leaf in T j , then the siblinghood proposal z = (x, y) requires that x and y are replaced with z in T i and T j , respectively. Because siblinghood decisions change the set of leaves in the constraint trees, they can result in the constraint trees no longer being disjoint (Fig. 3). Thus, siblinghood decisions have the potential to make the set of constraint trees incompatible. Determining whether or not a set of unrooted phylogenetic NJMerge input/output example. In this example, NJMerge is given two constraint trees ( T i and T j ) and a distance matrix D ij that is additive for the tree (((A, B), (C, D)), E, (F, (G, H))). NJMerge returns a compatibility supertree, called T ij , for the two constraint trees ( T i and T j ). Note that Neighbor Joining (NJ) applied to the distance matrix D ij would return (((A, B), (C, D)), E, (F, (G, H))) [37]; however, NJMerge rejects the siblinghood proposal (G, H), because it violates constraint tree T j . Instead, NJMerge makes G and F siblings Fig. 2 Compatibility supertree example. In this example, two compatibility supertrees for T = {T i , T j } are shown. Note that the trees in T form clades in T ′ but do not form clades in T. Other compatibility supertrees for T exist Fig. 3 NJMerge siblinghood proposal example. In this example, NJMerge evaluates the siblinghood proposal (C, D). Because C ∈ T i and D ∈ T j , NJMerge first updates the constraint trees T i and T j based on the proposed siblinghood to get T ′ i and T ′ j . Specifically, both C ∈ T i and D ∈ T j are replaced by X, representing the siblinghood (C, D). The compatibility of the updated constraint trees can be tested by rooting the trees at leaf X and using the algorithm proposed in [34]. Because the updated constraint trees ( T ′ i and T ′ j ) are indeed compatible, NJMerge will accept siblinghood proposal (C, D). Importantly, when NJMerge evaluates the next siblinghood proposal, the two constraint trees will no longer be on disjoint leaf sets trees is compatible is an NP-complete problem [32,33], so NJMerge uses a polynomial-time heuristic. In each iteration, NJMerge sorts the entries of the Q from least to greatest and accepts the first siblinghood proposal (x, y) that satisfies the following properties: 1. If x and y are both in some constraint tree T i , then they are siblings in T i . 2. If x or y are in more than one constraint trees, then replacing x and y with a new leaf z = (x, y) in all constraint trees does not make any pair of constraint trees incompatible, i.e., a compatibility supertree exists for every pair of updated constraint trees.
Because pairwise compatibility of unrooted trees does not guarantee that the entire set of constraint trees is compatible, it is possible for NJMerge to accept a siblinghood decision that will eventually cause the algorithm to fail when none of the remaining leaves can be joined without violating the pairwise compatibility of constraint trees. Although the "pairwise compatibility heuristic" can fail, it is easy to see that if NJMerge returns a tree, then it is a compatibility supertree for the input set T of constraint trees.
To determine if some pair of constraint trees becomes incompatible after making x and y siblings, it suffices to check only those pairs of constraint trees that contain at least one of x and y; all other pairs of trees are unchanged by accepting the siblinghood proposal and are pairwise compatible by induction. Because the leaves in the two trees labeled x or y have been relabeled by the new leaf z = (x, y) , they can be treated as rooted trees by rooting them at z. Testing the compatibility of rooted trees is easily accomplished in polynomial time using [34]. In fact, instead of testing pairs of constraint trees, the entire set of trees in T containing the new leaf z = (x, y) can be tested for compatibility in polynomial time using [34]. Furthermore, if at least one leaf exists in all constraint trees, then the compatibility of T can be determined in polynomial time. Finally, note the input matrix was referred to as a dissimilarity matrix (and not a distance matrix), because estimated distances between species may not satisfy the triangle inequality [24]; however, this matrix is more commonly referred to as a distance matrix, and we use this term henceforth.

Divide-and-conquer pipelines for phylogeny estimation
NJMerge can be used in divide-and-conquer pipelines for phylogeny estimation as shown in Fig. 4 and described below. In order to run this pipeline, the user must select a method for decomposing the leaf set into pairwise disjoint subsets (step 2), a maximum subset size (step 2), a method for computing a distance matrix M D (step 1), and a method M T for computing subset trees (step 3); thus, the user can select M D and M T to be appropriate for gene tree estimation or species tree estimation. The pipeline then operates as follows.
1. Estimate distances between pairs of leaves using method M D . 2. Decompose the leaf set into pairwise disjoint subsets.
2a. Compute a starting tree by running NJ on the distance matrix computed in Step 1. 2b. Decompose the starting tree into pairwise disjoint subsets of leaves with a predefined maximum subset size (e.g., using the centroid tree decomposition described in PASTA [35]).
3. Build a tree on each subset using method M T , thus producing the set T of constraint trees. Note that constraint trees can be estimated in serial or in parallel, depending on the computational resources available.

Run NJMerge on the input pair ( T , D).
Finally, although not explored in this study, this pipeline can be run in an iterative fashion by using the tree produced in step 4 to define the next subset decomposition.

Statistical consistency
Neighbor Joining (NJ) has been proven to be statistically consistent [36][37][38] under models of evolution for which pairwise distances can be estimated in a statistically consistent manner. This includes standard models of sequence evolution (e.g., the Generalized Time Reversible (GTR) model [39], which contains other models of sequence evolution, including Jukes-Cantor [15]). More recently, NJ has been used on multi-locus datasets to estimate species trees under the Multi-Species Coalescent (MSC) model; specifically, the method, NJst [9] estimates a species tree by running NJ on the average gene tree internode distance (AGID) matrix, calculated by averaging the topological distances between pairs of species in the input set of gene trees. Allman et al. [40] showed that the AGID matrix converges to an additive matrix for the species tree, and so NJst and some other methods (e.g., ASTRID [41]) that estimate species trees from the AGID matrix are statistically consistent under the MSC model.
We now prove that NJMerge can be used in statistically consistent divide-and-conquer pipelines for estimating gene trees and species trees. These results follow from Theorem 3 that shows NJMerge will return the tree T * when given a nearly additive distance matrix (Definition 2) for T * and a set T of constraint trees that agree with T * (Definition 1). Definition 2 Let T be a tree with positive weights on the edges and leaves labeled 1, 2, . . . , n . We say that an n × n matrix M is nearly additive for T if each entry M[i, j] differs from the distance between leaf i and leaf j in T by less than one half of the shortest branch length in T.
. . , T k } be a set of trees, and let D be a distance matrix on S = i S i , where S i is the set of leaves in T i . Let T * be a tree on leaf set S. If D is a nearly additive matrix for T * and if T i agrees with T * for all i ∈ {1, . . . , k} , then NJMerge applied to input (T , D) returns T * .
Proof NJ applied to a nearly additive distance matrix for T * will return T * [37]. Because all trees in T agree with T * , the siblinghood proposals suggested by NJ will never violate the trees in T or the compatibility of T . Thus, NJMerge applied to (T , D) will return the same output as NJ applied to D, which is T * .
We now define statistical consistency in the context of gene tree estimation (Definition 4) and show that NJMerge can be used to create statistically consistent divide-and-conquer pipelines for gene tree estimation (Corollary 5).

Definition 4
Let (T , �) be a GTR model tree with topology T and numerical parameters (e.g., substitution rate matrix, branch lengths, etc). A method M for constructing gene trees from DNA sequences is statistically consistent under the GTR model if, for all ǫ > 0 , there exists a constant l > 0 such that, given sequences of length at least l, M returns T with probability at least 1 − ǫ.

Corollary 5 NJMerge can be used in a gene tree estimation pipeline that is statistically consistent under the GTR model of sequence evolution.
Proof Let (T * , �) be a GTR model tree, let M D be a method for calculating distances between pairs of sequences, and let M T be a method for constructing trees from DNA sequences. Suppose that • the divide-and-conquer pipeline produces k pairwise disjoint subsets of sequences • Neighbor Joining (NJ) applied to a matrix of pairwise distances calculated using M D is a statistically consistent method for constructing gene trees under the GTR model (e.g., the log-det distance [16])

Fig. 4
Divide-and-conquer pipeline using NJMerge. We present a divide-and-conquer pipeline that operates by (1) estimating distances between pairs of species using method M D , (2) decomposing the species set into pairwise disjoint subsets, (3) building a tree on each subset using method M T , and (4) merging trees together using the distance matrix using NJMerge.
Step 2 can be performed by estimating a tree from the distance matrix (e.g., using NJ) and then decomposing this tree into pairwise disjoint subsets of species (shown in blue). Although not explored in this study, this pipeline can be run in an iterative fashion by using the tree produced in Step 4 to define the next subset decomposition. In this schematic, sets of species are represented by circles, distance matrices are represented by squares, and trees are represented by triangles • M T is statistically consistent under the GTR model (e.g., maximum likelihood [42,43]) Now let ǫ > 0 , and select ǫ D , ǫ T > 0 such that ǫ D + kǫ T < ǫ . By Definition 4, there exists a constant l D such that NJ applied to matrix D computed from sequences of length at least l D returns T * with probability at least 1 − ǫ D , and there exists a constant l T such that M T given DNA sequences of length at least l T returns T * with probability at least 1 − ǫ T . If a distance matrix D is calculated using M D and a set T of k constraint trees are constructed using M T , given sequences of length at least max{l D , l T } , then the probability that NJ applied to D returns T * and that M T returns a tree that agrees with T * for all k constraint trees in T is at least 1 − ǫ , as Then, by Theorem 3, NJMerge applied to the input (T , D) will return the T * with probability at least 1 − ǫ , and by Definition 4, NJMerge is statistically consistent under the GTR model.
Finally, we define statistical consistency in the context of species tree estimation (Definition 7) and show that NJMerge can be used to create statistically consistent divide-and-conquer pipelines for species estimation (Corollary 7). Definition 6 Let (T , �) be an MSC model tree with topology T and numerical parameters (e.g., substitution rate matrix, branch lengths, etc). A method M for constructing species trees from true gene trees is statistically consistent under the MSC model if, for all ǫ > 0 , there exists a constant m > 0 such that, given at least m true gene trees, M returns T with probability at least 1 − ǫ.

Corollary 7 NJMerge can be used in a species tree estimation pipeline that is statistically consistent under the MSC model.
Proof Let (T * , �) be an MSC model tree, let M D be a method for calculating distances between pairs of species from a set of gene trees, and let M T be a method for constructing species trees from a set of gene trees. Suppose that • the divide-and-conquer pipeline produces k pairwise disjoint subsets of sequences • Neighbor Joining (NJ) applied to a matrix of pairwise distances calculated using M D is a statistically consistent method for constructing species trees under the MSC model (e.g., the average topological distance between species in the input set of gene trees [40]) • M T is statistically consistent under the MSC model (e.g., ASTRAL [11,45]) Now let ǫ > 0 , and select ǫ D , ǫ T > 0 such that ǫ D + kǫ T < ǫ . By Definition 6, there exists a constant m D such that NJ applied to matrix D computed from at least m D gene trees returns T * with probability at least 1 − ǫ D , and there exists a constant m T such that M T given at least m T gene trees returns T * with probability at least 1 − ǫ T . If a distance matrix D is calculated using M D and a set T of k constraint trees are constructed using M T , both given at least max{m D , m T } gene trees, then the probability that NJ applied to D returns T * and that M T returns a tree that agree with T * for all k constraint trees in T is at least 1 − ǫ . Then, by Theorem 3, NJMerge applied to the input (T , D) will return the T * with probability at least 1 − ǫ , and by Definition 6, NJMerge is statistically consistent under the MSC model.

Performance study
Our study evaluated the effectiveness of using NJMerge to estimate species trees on large multi-locus datasets, simulated for this study using the protocol presented in [45]. Our simulation produced model conditions, described by two numbers of taxa (100 and 1000) and two levels of ILS (low/moderate and very high), each with 20 replicate datasets. Datasets included both exon-like sequences and intron-like sequences with exon-like sequences ("exons") characterized by slower rates of evolution across sites (less phylogenetic signal) and intron-like sequences ("introns") characterized by faster rates of evolution across sites (greater phylogenetic signal). The 100-taxon datasets were analyzed using 25, 100, and 1000 genes, and the 1000-taxon datasets were analyzed using 1000 genes; note that exons and introns were always analyzed separately. For each of these 320 datasets, we constructed distance matrices using two different methods and constraint trees using four different methods. This provided 2560 different tests on which to evaluate NJMerge. NJMerge failed on 11/2560 tests, so the failure rate (in our experiments) was less than 1%. Species tree methods were evaluated in terms of species tree estimation error (computed using normalized Robinson-Foulds (RF) distances [46]) and running time. All software commands are provided in Additional file 1.

Simulated datasets True species and true gene trees
Datasets, each with a true species tree and 2000 true gene trees, were simulated using SimPhy version 1.0.2 [47]. All model conditions had deep speciation (towards the root) and 20 replicate datasets. By holding the effective population size constant (200K) and varying the species tree height (in generations), model conditions with different levels of ILS were generated. For species tree heights of 10M and 500K generations, the average distance between the true species tree and the true gene trees (as measured by the normalized RF distance) was 8-10% and 68-69% respectively. Thus, we referred to these levels of ILS as "low/moderate" and "very high" respectively.

True sequence alignments
Sequence alignments were simulated for each true gene tree using INDELible version 1.03 [48] under the GTR+Ŵ model of evolution without insertions or deletions. For each gene, the parameters for the GTR+Ŵ model of evolution (base frequencies, substitution rates, and alpha) were drawn from distributions based on estimates of these parameters from the Avian Phylogenomics Dataset [49]; distributions were fitted for exons and introns, separately (Additional file 1: Table S1). For each dataset (with 2000 genes), 1000 gene sequences were simulated with parameters drawn from the exon distributions, and 1000 gene sequences were simulated with parameters drawn from the intron distributions. Note that exons and introns were analyzed separately. The sequence lengths were also drawn from a distribution (varying from 300 to 1500 bp).

Estimated gene trees
Maximum likelihood gene trees were estimated using FastTree-2 [50] under the GTR+CAT model of evolution. The average gene tree estimation error across all replicate datasets ranged from 26 to 51% for introns and 38 to 64% for exons and thus was higher for exon datasets (Additional file 1: Table S2). Note that gene tree estimation error was computed by the normalized symmetric difference between true and estimated gene trees, averaged across all gene trees (the normalized symmetric difference equals the normalized RF distance when both input trees are binary).

Estimated species trees
For each model condition (described by number of taxa and level of ILS), species trees estimation methods were run on the exon-like genes and the intron-like genes, separately. Species trees were estimated on 25, 100, or 1000 genes for the 100-taxon datasets and 1000 genes for the 1000-taxon datasets using three species tree estimation methods: ASTRAL-III [11,30,45] (as implemented in version 5.6.1), SVDquartets [13] (as implemented in PAUP* version 4a161 [51]), and concatenation using unpartitioned maximum likelihood under the GTR+Ŵ model of evolution (as implemented in RAxML [31] version 8.2.12 with pthreads and SSE3).

Distance matrices
Distance matrices were created using two different approaches.
Recall that NJ applied to the AGID matrix (i.e., NJst [9]) was proven to be statistically consistent method under the MSC model [40] and that NJ applied to the log-det distance matrix was proven to be statistically consistent under the MSC model when the sequence evolution models across genes satisfy some additional assumptions (e.g., a relaxed molecular clock) [18].

Subset decomposition
We decomposed the species set into subsets as indicated by the blue dashed arrows in Fig. 4. Specifically, the NJ tree was computed for each distance matrix using FastME [52] version 2.1.5 and then the centroid tree decomposition (described in PASTA [35]) was used to create disjoint subsets of taxa from the NJ tree. Datasets with 100 species were decomposed into 4-6 subsets with a maximum subset size of 30 taxa, and datasets with 1000 species were decomposed into 10-15 subsets with a maximum subset size of 120 taxa.

Constraint trees
Constraint trees were created using four different approaches.
• T true refers to constraint trees computed by restricting the true species tree to each subset of species. • T AST refers to constraint trees computed by running ASTRAL-III on each subset, i.e., on the estimated gene trees restricted to each subset of species. • T SVD refers to constraint trees computed by running SVDquartets on each subset, i.e., on the concatenated alignment restricted to each subset of species. • T RAX refers to constraint trees computed by running RAxML on each subset, i.e., on the concatenated alignment restricted to each subset of species.

Notation
We often specify the inputs to NJ and NJMerge using the following notation: NJ(D) and NJMerge(T , D). For example, NJMerge(T RAX , D LD ) refers to NJMerge given the RAxML constraint trees and the log-det distance matrix as input, whereas NJMerge(T RAX , D) refers to NJMerge given the RAxML constraint trees and either the AGID or the log-det distance matrix as input.

Species tree estimation error
Species tree estimation error was measured as the RF error rate, i.e., the normalized RF distance between the true and the estimated species trees both on the full species set. Since both trees were fully resolved or binary, the RF error rate is the proportion of edges in the true tree that are missing in the estimated tree. RF error rates were computed using Dendropy [53].

Running time
All computational experiments were run on the Blue Waters supercomputer, specifically, the XE6 dualsocket nodes with 64 GB of physical memory and two AMD Interlagos model 6276 CPU processors (i.e., one per socket each with 8 floating-point cores). All methods were given access to 16 threads with 1 thread per bulldozer (floating-point) core. SVDquartets and RAxML were explicitly run with 16 threads; however, ASTRAL-III and NJMerge were not implemented with multi-threading at the time of this study. All methods were restricted to a maximum wall-clock time of 48 h. Running time was measured as the wall-clock time and recorded in seconds for all methods. For ASTRAL, SVDquartets, and RAxML, the timing data was recorded for running the method on the full dataset as well as running the method on subsets of the dataset (to produce constraint trees for NJMerge). RAxML did not complete within the maximum wall-clock time of 48 h on datasets with 1000 taxa, so we used the last checkpoint file to evaluate species tree estimation error and running time. Specifically, running time was measured as the time between the info file being written and the last checkpoint file being written.
We approximated total running time of the NJMerge pipeline by combining the running timing data for estimating the distance matrix, estimating the subset trees, and combining the subset trees using NJMerge. If a user only had access to one compute node, then subset trees would need to be estimated in serial. In this case, the running time of the NJMerge pipeline t P would be approximated as where k is the number of subsets, t D is time to estimate a distance matrix with method M D , t T (i) is the time to estimate a species tree on subset i with method M T , and t M is the time to run NJMerge given the distance matrix and the subset trees as input. The average running times for t T and t M are shown in Additional file 1: Tables S9, S10. The time to estimate the NJ tree from the distance matrix is not included, as this took less than a minute even for datasets with 1000 species. Note that given access to multiple compute nodes (at least 6 for the 100-taxon datasets and at least 15 for the 1000-species datasets), the subset trees could be estimated in parallel, as shown in [54].
It is worth noting that running ASTRAL-III and computing the AGID matrix requires gene trees to be estimated. Using the same experimental set-up (a single Blue Waters compute node with 64 GB of memory and 16 floating-point cores), FastTree-2 took on average 18 ± 2 min to estimate 1000 gene trees for datasets with 100 species and on average 217 ± 20 min to estimate 1000 gene trees for datasets with 1000 species (Additional file 1: Tables S4, S5). The amount of time for gene tree estimation can vary greatly, depending on the method used and the analysis performed (e.g., model of sequence evolution, bootstrapping, etc.); we did not include the time to estimate gene trees in the reported running times.

Results
Pipelines using NJMerge can be thought of in two ways: (1) as techniques for potentially improving the accuracy of NJ (hopefully without a large increase in running time) or (2) as techniques for potentially improving the scalability or speed of the method M T used to compute constraint trees (hopefully without sacrificing accuracy). When distance-based species tree estimation is not as accurate as some other species tree methods, we would predict that NJMerge (when given constraint trees estimated using highly accurate species tree methods) would be more accurate than traditional NJ. Because NJMerge, like NJ, is typically faster than other species tree methods, we would predict that NJMerge would improve the running time of more computationally intensive methods (such as RAxML) used to estimate constraint trees, hopefully without sacrificing accuracy.
(1) Thus, we compared the accuracy of the NJMerge pipeline to traditional NJ, and we also compared the accuracy and running time of the NJMerge pipeline to running M T on the full dataset, where M T is the method used to estimate the constraint trees for NJMerge. Results are shown here for intron-like datasets; results for exon-like datasets are shown in Additional file 1. Unless otherwise noted, results were similar for both sequence types; however, species trees estimated on the exon datasets had slightly higher error rates than those estimated on the intron datasets. This is expected, as the exons had slower rates of evolution (and thus less phylogenetic signal) than the introns.

How do pipelines using NJMerge compare to Neighbor Joining (NJ)?
In this section, we report results on the effectiveness of using NJMerge as compared to NJ in terms of accuracy.

Impact of estimated distance matrix
We compared the accuracy of the NJMerge pipeline to traditional NJ on distance matrices estimated from datasets with 100 taxa and varying numbers of genes ( Fig. 5; Additional file 1: Figure S1). Because the accuracy of NJMerge also depends on error in the input constraint trees, we considered an idealized case where NJMerge was given true constraint trees (i.e., constraint trees that agree with the true species tree). We found that NJMerge(T true , D) was more accurate than NJ(D) for all model conditions and that the difference in error was especially large when the number of genes was small and the level of ILS was very high (e.g., the difference in mean error was greater than 15% when matrices were estimated from 25 introns but was closer to 5% when matrices were estimated from 1000 introns). A similar trend was observed for matrices computed using the log-det distance. Interestingly, both NJ(D) and NJMerge(T true , D) were more accurate when given the AGID matrix rather than the log-det distance matrix as input-even when the level of ILS was low/moderate. In summary, NJMerge(T true , D) was always more accurate than NJ(D), but the improvement in accuracy was greater under challenging model conditions, suggesting that NJMerge(T true , D) was more robust to error in the distance matrix than NJ(D).

Fig. 5
Impact of estimated distance matrix on Neighbor Joining (NJ) and NJMerge. Neighbor Joining (NJ) was run with two different distance matrices, and NJMerge was run with two different distance matrices and constraint trees that agreed with the true species tree (see "Performance study" section for more information on the notation). Datasets had two different levels of incomplete lineage sorting (ILS) and numbers of genes varying from 25 to 1000. Species tree estimation error is defined as the normalized Robinson-Foulds (RF) distance between true and estimated species trees. Lines represent the average over replicate datasets, and filled regions indicate the standard error Molloy and Warnow Algorithms Mol Biol (2019) 14:14

Impact of estimated constraint trees
We compared traditional NJ to the NJMerge pipeline given estimated constraint trees on datasets with 1000 taxa and 1000 genes ( Fig. 6; Additional file 1: Figure S2). When the level of ILS was low/moderate, NJMerge outperformed NJ regardless of the method used to estimate species trees. For intron-like datasets with low/moderate ILS, the use of constraint trees reduced the median species tree error from 11-14% (NJ) to less than 3-6% (NJMerge); however, when the level of ILS was very high, the performance of NJMerge varied greatly with the species tree method. Specifically, NJMerge(T SVD , D) and NJMerge(T RAX , D) were less accurate than NJ(D) by 0-4% on average, whereas NJMerge(T AST , D) was more accurate than NJ(D) by 0-1% on average (Additional file 1: Tables S7, S8). These trends were consistent with the relative performance of methods on the 100-taxon datasets ( Fig. 7 and Additional file 1: Figure S3); specifically, when the level of ILS was very high, SVDquartets and RAxML performed worse than running NJ on either the AGID matrix or the log-det distance matrix. In summary, NJMerge was highly impacted by the quality of the constraint trees-so that accurate constraint trees resulted in NJMerge being more accurate than NJ, but inaccurate constraint trees resulted in NJMerge being less accurate than NJ.

How do pipelines using NJMerge compare to ASTRAL-III, SVDquartets, and RAxML?
In this section, we compare the running time and the accuracy of the NJMerge pipeline to running M T on the full dataset, where M T is the method used to estimate constraint trees for NJMerge. Because NJMerge was more accurate when given the AGID matrix ( Fig. 5; Additional file 1: Figure S1), results for NJMerge given the AGID distance matrix are shown here, and results for NJMerge given the log-det distance matrix are shown in Additional file 1.

ASTRAL-III vs. NJMerge
Both NJMerge(T AST , D AGID ) and NJMerge(T AST , D LD ) provided running time advantages over ASTRAL-III under some model conditions. While ASTRAL-III completed on all the low/moderate ILS datasets with 1000 taxa and 1000 genes in less than 9 h on average, ASTRAL-III failed to complete within the maximum wall-clock time of 48 h on 23/40 datasets with 1000 taxa, 1000 genes, and very high ILS (Table 1). On the other 17/40 datasets, ASTRAL-III ran for more than 2000 min (approximately 33 h). This difference between the low/ moderate ILS and the very high ILS datasets is noteworthy (see discussion). In contrast, NJMerge(T AST , D AGID ) completed in under 300 min (approximately 5 h) on average, including the time it took to estimate the distance matrix and the ASTRAL-III subset trees in serial (Fig. 8, Additional file 1: Figure S4). Note that NJMerge(T AST , D AGID ) failed on 0 datasets, and NJMerge(T AST , D LD ) failed on 2 datasets (Table 1). In summary, NJMerge substantially reduced the running time of ASTRAL-III on the 1000-taxon, 1000-gene datasets with very high ILS. ASTRAL-III and NJMerge(T AST , D AGID ) achieved similar levels of accuracy with the mean species tree error within 0-2% for both intron and exon datasets ( Fig. 8; Additional file 1: Figure S4, Table S7). Trends were similar for NJMerge(T AST , D LD ) except when the level of ILS was very high; under these conditions, the mean error of NJMerge(T AST , D LD ) was 2-6% greater than that of ASTRAL-III (Additional file 1: Figures S7 and S8, Table S8). Fig. 6 Impact of estimated constraint trees on NJMerge. Neighbor Joining (NJ) was run with two different distance matrices, and NJMerge was run with two different distance matrices and four different sets of constraint trees (see "Performance study" section for more information on the notation). Species tree estimation error is defined as the normalized Robinson-Foulds (RF) distance between true and estimated species trees. Note that gray bars represent medians, gray squares represent means, gray circles represent outliers, box plots are defined by quartiles (extending from the first to the third quartiles), and whiskers extend to plus/minus 1.5 times the interquartile distance (unless greater/less than the maximum/ minimum value)

NJMerge vs. SVDquartets
Species trees can be estimated with SVDquartets using the full set of n 4 quartet trees or a subset of quartet trees. Based on a prior study [55], which showed that the best accuracy was obtained when using all quartet trees, we computed all n 4 quartet trees for 100-taxon datasets. However, on datasets with 1000 taxa, SVDquartets was run using a random subset of quartet trees (without replacement), because the maximum number of quartets allowed by SVDquartets (as implemented by PAUP*) was 4.15833 × 10 10 . Running PAUP* resulted in a segmentation fault for all 1000-taxon datasets, i.e., SVDquartets failed on 40/40 datasets with 1000 taxa and 1000 genes. In contrast, NJMerge(T SVD , D AGID ) failed on 0 datasets, and NJMerge(T SVD , D LD ) failed on 3 datasets (Table 1).
NJMerge also improved running time on datasets with 100 taxa; for example, SVDquartets completed in 19-81 min on average, whereas NJMerge(T SVD , D AGID ) completed in less than 2 min on average for datasets with 100 taxa and 1000 genes ( Fig. 9; Additional file 1: Figure S5). This running time comparison does not take into account the time needed to estimate gene trees, which required on average 18 min using FastTree-2 on datasets with 100 taxa and 1000 genes.
NJMerge(T SVD , D AGID ) typically produced species trees with less error than SVDquartets. The difference between methods was typically small (between 0 and 2%) when the level of ILS was low/moderate but could be larger than 10% when the level of ILS was very high. Similar trends were observed for NJMerge(T SVD , D LD ) (Additional file 1: Figures S9, S10).

NJMerge vs. RAxML
NJMerge(T RAX , D AGID ) and NJMerge(T RAX , D LD ) reduced the running time of RAxML by more than half-even though RAxML was run on the subset trees in serial ( Fig. 10 and Additional file 1: Figure S6). For the 1000-taxon datasets, the final checkpoint was written by RAxML after more than 2250 min ( ∼ 37.5 h) on average. In comparison, when RAxML was run on subsets in serial, the average running time of NJMerge(T RAX , D AGID ) was between 500 (approximately 8.5 h) and 1500 Fig. 7 Comparison of species tree methods. All methods were run on the full dataset (i.e., not subsets) with 100 species. Neighbor Joining (NJ) was run with two different distance matrices ("Performance study" section for more information on the notation). Species tree estimation error is defined as the normalized Robinson-Foulds (RF) distance between true and estimated species trees. Note that gray bars represent medians, gray squares represent means, gray circles represent outliers, box plots are defined by quartiles (extending from the first to the third quartiles), and whiskers extend to plus/minus 1.5 times the interquartile distance (unless greater/less than the maximum/minimum value) Molloy and Warnow Algorithms Mol Biol (2019) 14:14 min (approximately 25 h). Although these running times for NJMerge do not include the time to estimate gene trees, recall that it took on average 217 min (less than 4 h) to estimate 1000 gene trees on datasets with 1000 species using FastTree-2. While NJMerge can fail to return a tree, NJMerge failed less frequently than RAxML-when both methods were given the same computational resources. NJMerge(T RAX , D AGID ) failed on 1 dataset, and NJMerge(T RAX , D LD ) failed on 2 datasets. In contrast, for datasets with 1000 taxa, RAxML failed to run on 38 intron-like datasets and 3 exon-like datasets due to "Out of Memory" (OOM) errors (Table 1); the difference between the number of intron-like versus the number of exon-like datasets is noteworthy (see discussion).

Remarks on the utility of pipelines using NJMerge
Pipelines using NJMerge can be viewed either as techniques for improving traditional NJ or as techniques for scaling a computationally-intensive base method (previously referred to as M T ) to larger datasets. Thus, in order to maximize the utility of NJMerge, users should select a base method that is both more accurate and more computationally-intensive than NJ. Our results show that selecting base methods for NJMerge may not be trivial when analyzing phylogenomic datasets-because both accuracy and running time were impacted by the level of ILS. For example, ASTRAL-III was very fast when the level of ILS was low/moderate but was substantially slower when the level of ILS was very high. Similarly, SVDquartets and RAxML were both more accurate than NJ(D AGID ), i.e., NJst, when the level of ILS was low/moderate but were less accurate than these methods when the level of ILS was very high; note that this trend is consistent with results from [12] (also see the review paper by [56]). Overall, our results suggest that constraint trees should be estimated using RAxML when the level of ILS is low/moderate and using ASTRAL-III when the level of  Table 1 The number of datasets on which methods failed is indicated below by model condition ASTRAL-III failed due to running beyond the maximum wall clock time of 48 h; SVDquartets failed due to segmentation faults; RAxML failed due to running out of memory; NJMerge failed due to being unable to find a legal siblinghood. Note that NJMerge is described by the input set T of constraint trees and input distance matrix D; see "Performance study" section for more information on the notation  . 9 SVDquartets vs. NJMerge given SVDquartet constraint trees and average gene tree internode distance (AGID) matrix. Subplots on top row show species tree estimation error (defined as the normalized RF distance between true and estimated species trees); note that gray bars represent medians, gray squares represent means, gray circles represent outliers, box plots are defined by quartiles (extending from the first to the third quartiles), and whiskers extend to plus/minus 1.5 times the interquartile distance (unless greater/less than the maximum/minimum value). Subplots on bottom row show running time (in minutes); bars represent means and error bars represent standard deviations across replicate datasets. NJMerge running times are for computing the subset trees "in serial"; see Eq. (1) in the main text for more information. The numbers of replicates on which the methods completed is shown on the x-axis, e.g., N = X , Y indicates that SVDquartets completed on X out of 20 replicates and that NJMerge(T SVD , D AGID ) completed on Y out of 20 replicates. SVDquartets did not run any datasets with 1000 taxa due to segmentation faults ILS is very high, and thus, determining the level of ILS in a given phylogenomic datasets is an important area of future research. Finally, we note that NJMerge, when given constraint trees that agreed with the true species tree, was very accurate (less than 2% error on average) even when the level of ILS was very high, suggesting that NJMerge is a promising technique for scaling Bayesian methods (e.g., Starbeast2 [8]) and future species tree methods to larger datasets. Although NJMerge can fail, this should not discourage potential users, as NJMerge failed on fewer datasets than ASTRAL-III, SVDquartets, or RAxML-when all methods were given the same computational resources, including a maximum wall-clock time of 48 h. In our experiments, NJMerge failed on only 11/2560 test cases from running NJMerge on 320 datasets with two different types of distance matrices and four different types of constraint trees (Table 1).
Importantly, in all our experiments, NJMerge was run within the divide-and-conquer pipeline shown in Fig. 4, specifically, with subsets of taxa derived from decomposing the NJ tree (blue dashed lines). Because NJMerge was always given inputs generated by this pipeline, our results on the accuracy, the failure rate, and the running time of NJMerge may not generalize to arbitrary inputs.

Impact of distance matrix on NJ
Our results showed that on average NJ(D AGID ) was either as accurate or else more accurate than NJ(D LD ). Notably, there was a clear difference between these two methods on datasets with 100 taxa and low/moderate levels of ILS; specifically NJ(D AGID ) produced trees with less than 5% error on average, whereas NJ(D LD ) produced trees with greater than 10% error on average). However, on the exact same model condition but with 1000 taxa, NJ(D AGID ) and NJ(D LD ) produced trees with similar levels of accuracy. This may be due to the difference between the median branch length between low/moderate ILS datasets with 100 taxa and 1000 taxa (Additional file 1: Table S3); furthermore, it is possible that branch length and other factors that limit the accuracy of NJ(D LD ) in the context of gene tree estimation would also apply in the context of species tree estimation. However, it is interesting to note that NJ(D LD ) was more accurate than either SVDquartets or RAxML when the level of ILS was very high, providing Fig. 10 RAxML vs. NJMerge given RAxML constraint trees and and average gene tree internode distance (AGID) matrix. Subplots on top row show species tree estimation error (defined as the normalized RF distance between true and estimated species trees); note that gray bars represent medians, gray squares represent means, gray circles represent outliers, box plots are defined by quartiles (extending from the first to the third quartiles), and whiskers extend to plus/minus 1.5 times the interquartile distance (unless greater/less than the maximum/minimum value). Subplots on bottom row show running time (in minutes); bars represent means and error bars represent standard deviations across replicate datasets. NJMerge running times are for computing the subset trees "in serial"; see Eq. (1) in the main text for more information. The numbers of replicates on which the methods completed is shown on the x-axis, e.g., N = X , Y indicates that RAxML completed on X out of 20 replicates and that NJMerge(T RAX , D AGID ) completed on Y out of 20 replicates. RAxML was only able to run on 1/40 intron-like datasets with 1000 taxa due to "Out of Memory" errors The simplicity and speed of distance-based inference suggests log-det based methods should serve as benchmarks for judging more elaborate and computationally-intensive species trees inference methods" [18].

Impact of ILS and sequence type on ASTRAL-III
Our results showed that ASTRAL-III was much faster on the low/moderate ILS datasets than on the very high ILS datasets. This finding makes sense in light of ASTRAL-III's algorithm design. ASTRAL-III operates by searching for an optimal solution to its search problem within a constrained search space that is defined by the set X of bipartitions in the estimated gene trees, and in particular, ASTRAL-III's running time scales with |X | 1.726 [30]. The set of gene trees will become more heterogeneous for higher levels of ILS, and thus, the size of X will increase, as every gene tree could be different when the level of ILS is very high. In addition, gene tree estimation error can also increase the size of X , explaining why ASTRAL-III failed to complete on exon datasets more often than on intron datasets (Table 1, Additional file 1: Table S2).

Impact of sequence type on RAxML
Our results showed that RAxML failed on more intronlike datasets than exon-like datasets. This finding makes sense in light of RAxML's implementation. RAxML uses redundancy in site patterns to store the input alignment compactly, so that the memory scales with the number of unique site patterns. The intron datasets had more unique site patterns than the exon datasets (i.e., greater phylogenetic signal and lower gene tree estimation error), which explains why RAxML required more memory when analyzing introns.

Remarks on the statistical consistency of pipelines using NJMerge
Although NJMerge can fail to return a tree, by statistical consistency under the MSC model (Corollary 7), the probability that NJMerge fails goes to zero as the number of true gene trees goes to infinity. In fact, NJMerge was designed to have this theoretical guarantee via the selection of the heuristic for determining whether or not to accept a siblinghood proposal. It is easy to think of other heuristics that prevent NJMerge from failing but do not have the guarantee of correctness (Theorem 3) and thus do not have the guarantee of statistical consistency (Corollary 7). Designing heuristics that prevent NJMerge from failing but have good theoretical properties is an area of future research. As mentioned previously, our proof of statistical consistency under the MSC model requires that the number of true gene trees goes to infinity, which is the equivalent of requiring that both the number of gene trees and the sequence length per gene tree go to infinity. Roch et al. [6] recently showed that essentially all gene tree summary methods (e.g., NJst [40], and ASTRAL [11]) are not statistically consistent under the MSC if the sequence length per gene is fixed-and these theoretical results apply to NJMerge as well. The failure to be statistically consistent when the sequence length per gene is bounded is not unique to gene tree summary methods or NJMerge, as Roch et al. also showed that fully partitioned maximum likelihood is not consistent under these conditions, and [5] had shown that unpartitioned maximum likelihood is also not consistent.

Conclusions
In this paper, we introduced a divide-and-conquer approach to phylogeny estimation that (1) decomposes a set of species into pairwise disjoint subsets, (2) builds trees on each subset of species using a base method, and (3) merges the subsets trees together using a distance matrix. For the merger step, we presented a new method, called NJMerge, and proved that some divide-and-conquer pipelines using NJMerge are statistically consistent under some models of evolution. We then evaluated pipelines using NJMerge in the context of species tree estimation, specifically using simulated multi-locus datasets with up to 1000 species and two levels of ILS. We found that pipelines using NJMerge provided several benefits to large-scale species tree estimation. Specifically, under some model conditions, pipelines using NJMerge improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and "concatenation" using RAxML) without sacrificing accuracy (see discussion for details as the results depended on the level of ILS). Finally, although NJMerge can fail to return a tree, in our experiments, pipelines using NJMerge failed on only 11 out of 2560 test cases. Together these results suggest that NJMerge is a promising approach for scaling highly accurate but computationally-intensive methods to larger datasets. This study also suggests several different directions for future research. Since NJMerge uses a heuristic (which can fail) to test for tree compatibility (in deciding whether to accept a siblinghood proposal), a modification to NJMerge to use an exact method for this problem would reduce the failure rate and-if sufficiently fastwould still enable scalability to large datasets. In addition, all aspects of the divide-and-conquer pipeline could be modified and tested; for example, the robustness of