 Research
 Open access
 Published:
Quartets enable statistically consistent estimation of cell lineage trees under an unbiased error and missingness model
Algorithms for Molecular Biology volume 18, Article number: 19 (2023)
Abstract
Cancer progression and treatment can be informed by reconstructing its evolutionary history from tumor cells. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are errorfree and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because singlecell sequencing produces sparse, errorridden data and because tumors evolve clonally. Here, we study the theoretical utility of methods based on quartets (fourleaf, unrooted phylogenetic trees) in light of these barriers. We consider a popular tumor phylogenetics model, in which mutations arise on a (highly unresolved) tree and then (unbiased) errors and missing values are introduced. Quartets are then implied by mutations present in two cells and absent from two cells. Our main result is that the most probable quartet identifies the unrooted model tree on four cells. This motivates seeking a tree such that the number of quartets shared between it and the input mutations is maximized. We prove an optimal solution to this problem is a consistent estimator of the unrooted cell lineage tree; this guarantee includes the case where the model tree is highly unresolved, with error defined as the number of false negative branches. Lastly, we outline how quartetbased methods might be employed when there are copy number aberrations and other challenges specific to tumor phylogenetics.
Introduction
Cancer progression and treatment can be informed by reconstructing the evolutionary history of tumor cells [1]. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are errorfree and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because singlecell sequencing produces sparse, errorridden data and because tumors evolve clonally so the underlying tree is highly unresolved [2, 3]. Here, we study the theoretical utility of methods based on quartets (fourleaf, unrooted phylogenetic trees) and triplets (threeleaf rooted phylogenetic trees) in light of these barriers.
A quartet is an unrooted, phylogenetic tree with four leaves. Quartets have long been used as the building blocks for reconstructing the evolutionary history of species [4]. The reason quartetbased methods have garnered such success in species phylogenetics is their good statistical properties under the MultiSpecies Coalescent (\(\texttt {MSC}\)) model [5, 6]. An \(\texttt {MSC}\) model species tree generates gene trees (note that a gene tree reflects the genealogical history of a gene, which is passed down from ancestor to descendant, whereas the species tree governs the pool of potential ancestors). Arguably, one of the most important theoretical results from the last decade of systematics is that the most probable unrooted gene tree under the \(\texttt {MSC}\) is topologically equivalent to the unrooted model species tree when considering four species [7]. For trees with more than four leaves, the most probable unrooted gene tree can be topologically discordant with the unrooted model species tree [8]. In such situations, the model species tree is said to be in the anomaly zone or the offending gene tree is said to be anomalous. It is now widely recognized that anomalous gene trees can challenge traditional species tree estimation methods [9, 10].
The statistical theory described above has motivated the development of quartetbased methods (e.g., [11, 12]) and is central to their proofs of statistical consistency under the MSC. ASTRAL [12], in particular, has become a gold standard approach to multilocus species tree estimation. Moreover, new and improved quartetbased methods are continually being developed [13,14,15,16,17,18]. Similar theory and methodology has been given for triplets: threeleaf, rooted, phylogenetic trees [19,20,21].
Inspired by these efforts, we study the utility of quartets and triplets for estimating cell lineage trees under a popular tumor phylogenetics model [2, 22,23,24], in which mutations arise on a (highly unresolved) cell lineage tree according to the infinite sites model and then errors and missing values are introduced to the resulting mutation data in an unbiased fashion. The idea is that deviations from a perfect phylogeny can be attributed to sequencing errors, as data produced by singlecell protocols are notoriously errorprone and sparse. Although the infinite sites plus unbiased error and missingness (\(\texttt {IS+UEM}\)) model generates mutations rather than gene trees, quartets (or triplets) are implied by mutations that are present in two cells and absent from two cells (or one cell).
Our main result is that there are no anomalous quartets under the \(\texttt {IS+UEM}\) model; this motivates seeking a cell lineage tree such that the number of quartets shared between it and the input mutations is maximized. We prove an optimal solution to this problem is a consistent estimator of the unrooted cell lineage tree; this guarantee extends to the case of highly unresolved model trees, with error defined as the number of false negative branches. Somewhat surprisingly, our positive finding for quartets does not extend to triplets, as there can be anomalous triplets under the \(\texttt {IS+UEM}\) model under reasonable conditions. These results generalize to any model of 2state character evolution for which there are no anomalous quartets or triplets. An example of such a model is the infinite sites plus neutral WrightFisher (\(\texttt {IS+nWF}\)) model [25, 26] and its approximations [46]. Under \(\texttt {IS+nWF}\), mutations follow the IS assumption but evolve within a species tree, so deviations from a perfect phylogeny are due to genetic drift. Nevertheless, there are no anomalous triplets (see Additional file 1 of [47]) and no anomalous quartets (Theorem 1 in [48]; also see [49]), motivating the application of quartetbased methods to estimate species trees from lowhomoplasy retroelement insertion presence/absence patterns [48, 50]. However, our work is largely motivated by tumor phylogenetics, so we conclude by outlining how quartetbased methods might be employed in this setting, given other important challenges like copy number aberrations (CNAs) and doublets.
Background
We now provide some background on phylogenetic trees, models of evolution, and statistical consistency.
Phylogenetic trees
A phylogenetic tree is defined by the triple \((g, X, \phi )\), where g is a connected acyclic graph, X is a set of labels (often representing species or cells), and \(\phi\) is a bijection between the labels in X and leaves (i.e., vertices with degree 1) of g. Phylogenetic trees can be either unrooted or rooted, and we use u(T) to denote the unrooted version of a rooted tree T. Edges in an unrooted tree are undirected, whereas edges in a rooted tree are directed away from the root, a special vertex with indegree 0 (all other vertices have indegree 1). Vertices that are neither leaves nor the root are called internal vertices, and edges incident to only internal vertices are called internal edges (otherwise they are referred to as terminal edges). An interval vertex with degree greater than 3 (called a polytomy) can be introduced to a tree by contracting one of its edges (i.e., deleting the edge and identifying its endpoints). A refinement of a polytomy is the opposite of a contraction. If there are no polytomies in T, we say that T is binary or fully resolved; otherwise, we say that T is nonbinary. We use the phrase highly unresolved to indicate that T contains many polytomies and/or that the polytomies in T have high degrees.
As previously mentioned, many methods for species tree estimation are based on quartets. A quartet is an unrooted, binary tree with four leaves. We denote the three possible quartets on \(X = \{A, B, C, D\}\) as \(q_1 = A,BC,D\), \(q_2 = A,CB,D\), and \(q_3 = A,DB,C\). A set of quartets can be created from an unrooted tree T by restricting T to every possible subset of four leaves (i.e., deleting the other leaves from T and then suppressing vertices of degree 2). The resulting set Q(T) is called the quartet encoding of T, and we say that T displays quartet q if \(q \in Q(T)\). Importantly, if T contains polytomies, restricting T to some subsets of four labels will not produce a (binary) quartet. Some selections will produce star trees, which do not provide any topological information. We use \(T_S\) to denote T restricted to label set S (note that if branch parameters are associated with T, they are added together when suppressing vertices of degree 2). The concepts described above for quartets extend to triplets. A triplet is a rooted, binary tree with three leaves, and we denote the three possible triplets on \(X = \{A, B, C\}\) as \(t_A = AB,C\), \(t_B = BA,C\), and \(t_C = CA,B\). Lastly, a bipartition or split of label set X partitions it into two disjoint subsets. It is easy to see that each edge in an unrooted tree induces a bipartition, and we use Bip(T) to denote the set of bipartitions induced by all edges in T.
Mutations and models of evolution
A mutation matrix M is an \(n \times k\) matrix, where n is the number of rows (representing cells or species) and k is the number of columns (representing mutations). Columns are also referred to as characters or site patterns. Our focus here is on 2state characters, with \(M_{i,j} = 0\) indicating that mutation j is absent from cell i and \(M_{i,j} = 1\) indicating that mutation j is present in cell i. In tumor phylogenetics, mutations are called in reference to a healthy cell, which is the root of the cell lineage tree; thus, 0 represents the ancestral state and 1 represents mutant/derived state (note that this interpretation of states 0 and 1 will only be important when looking at triplets and not quartets).
Throughout this paper, we assume the mutation matrix D is generated under a hierarchical model with two steps (Fig. 1).

1.
A mutation matrix G is generated under some model \({\mathcal {M}}\), parameterized by a rooted phylogenetic tree topology \(\sigma\) and a set \(\Theta\) of associated numeric parameters. Importantly, model \({\mathcal {M}}\) given \((\sigma , \Theta )\) defines a probability distribution on mutation patterns, and we assume mutations in G are independent and identically distributed (i.i.d.) according to this model. For simplicity of notation, we typically omit the dependence on the numeric parameters in \(\Theta\).

2.
Errors and/or missing values are introduced to the ground truth matrix G according to the UEM model (described below). This result of this process is the observed matrix D.
Hierarchical models of this form, denoted \({\mathcal {M}}{} \texttt {+UEM}\), define a probability distribution on mutation patterns given their parameters. Thus, if we say that D is generated under the \({\mathcal {M}}{} \texttt {+UEM}\) model, then we assume the mutations in D are i.i.d. according to this model. We now describe the data generation steps in greater detail for a popular tumor phylogenetics model [2, 22,23,24].
Step 1: Infinite Sites (IS) model. For tumor phylogenetics, we take \({\mathcal {M}}\) to be the infinite sites (IS) model, so the mutation matrix G is generated under the IS model given a rooted cell lineage tree \(\sigma\) and a set \(\Theta\) of edge probabilities that sum to 1. Specifically, every edge e in \(\sigma\) is associated with a numeric value \(p(e) \in \Theta\), indicating the probability that a mutation occurs on e. When a mutation occurs on e, all cells on a directed path from e to any of the leaves of \(\sigma\) are set to state 1; all other cells are set to state 0. Thus, a mutation corresponds to the bipartition induced by the branch on which it occurred. Internal edges on which mutations cannot occur are contracted, so that the probability of a mutation on any edge in \(\sigma\) is strictly greater than zero. Terminal edges on which mutations cannot occur are not contracted; however, we refer to these edges and (the leaves incident to them) as “fake”.
Step 2: Unbiased Error and Missingness (UEM) model. If mutation matrix G is generated under the IS model given \((\sigma , \Theta )\), then reconstructing \(\sigma\) is trivial. However, for tumor phylogenetics, false positives and false negatives are introduced to G, producing the observed matrix D. This is done according to Eq. 1:
where \(0 \le \alpha < 1\) and \(0 \le \beta < 1\) are the probability of false positives and false negatives, respectively. Simultaneously, missing values are introduced to G with probability \(0 \le \gamma < 1\); this can be incorporated into the model by multiplying each of the cases in Eq. 1 by \((1  \gamma )\).
Our goal is to estimate cell lineage trees under the \(\texttt {IS+UEM}\) model. An important property for phylogeny estimation methods is whether they are statistically consistent under the model of interest.
Definition 1
(Statistical Consistency; see Section 1.1 of [27]) Let \({\mathcal {A}}\) be some model that generates mutations, and let D be a mutation matrix, with n rows (cells or species) and k columns (mutations), generated under \({\mathcal {A}}\) given rooted tree \(\sigma\) and numerical parameters \(\Theta\). We say that an estimation method is statistically consistent under \({\mathcal {A}}\) if for any \(\epsilon > 0\), there exists a constant \(K > 0\) such that when D contains at least K mutations, the method given D returns (the unrooted version of) \(\sigma\) with probability at least \(1\epsilon\). Alternatively, we might say that the error in the tree estimated from D is zero with probability at least \(1\epsilon\).
The idea is that as the number k of mutations goes towards infinity, the error in the estimated tree is zero with high probability. Tree error is typically defined as the number of false negative branches (i.e., branches in \(\sigma\) that are missing from the estimated tree) plus the number of false positive branches (i.e., branches in the estimated tree that are missing from \(\sigma\)).
No anomalous quartets under an unbiased error and missingness model
To begin, we assume that the rooted cell lineage tree \(\sigma\) has four leaves; therefore, it must have one of five tree shapes shown in Fig. 2. Two of them display a star when unrooted, and the other three correspond to a quartet when unrooted. If mutations are generated i.i.d. under some model \({\mathcal {A}}\) given \(\sigma\), there are 16 possible patterns on four cells, denoted \(\{A, B, C, D\}\). A quartet is implied by two cells being in state 1 and two cells being in state 0. Therefore, two patterns (\(ABCD = 0011\) and 1100) support quartet \(q_1 = A,BC,D\), two patterns (0101 and 1010) support quartet \(q_2 = A,CB,D\), two patterns (0110 and 1001) support quartet \(q_3 = A,DB,C\), and the other 10 patterns do not provide topological information. Henceforth, we denote the probability of quartets under model \({\mathcal {A}}\) given \(\sigma\) as \(\mathbb {P}_{{\mathcal {A}}}(q_1  \sigma ) = \mathbb {P}_{{\mathcal {A}}}(1100  \sigma ) + \mathbb {P}_{{\mathcal {A}}}(0011  \sigma )\), \(\mathbb {P}_{{\mathcal {A}}}(q_2  \sigma ) = \mathbb {P}_{{\mathcal {A}}}(1010  \sigma ) + \mathbb {P}_{{\mathcal {A}}}(0101  \sigma )\), and \(\mathbb {P}_{{\mathcal {A}}}(q_3  \sigma ) = \mathbb {P}_{{\mathcal {A}}}(1001  \sigma ) + \mathbb {P}_{{\mathcal {A}}}(0110  \sigma )\). Now we consider quartetinformative patterns generated from a model tree with more than four leaves.
Definition 2
(No anomalous quartets) We say that there are no anomalous quartets under model \({\mathcal {A}}\) given rooted tree \(\sigma\) if the following inequalities hold for every subset S of four species in \(\sigma\). Let \(q_1, q_2, q_3\) denote the three quartets on S, and let i index \(\{1, 2, 3\}\).

1.
If \(u(\sigma )_S = q_i\), \(\mathbb {P}_{{\mathcal {A}}}(q_i\sigma ) > \mathbb {P}_{{\mathcal {A}}}(q_j\sigma )\) for all \(j \in \{1, 2, 3\}\) such that \(i \ne j\).

2.
If \(u(\sigma )_S\) is a star, \(\mathbb {P}_{{\mathcal {A}}}(q_1\sigma ) = \mathbb {P}_{{\mathcal {A}}}(q_2\sigma ) = \mathbb {P}_{{\mathcal {A}}}(q_3\sigma )\).
This brings us to the main result of this section.
Theorem 1
There are no anomalous quartets under the \(\texttt {IS+UEM}\) model, assuming \(\alpha + \beta \ne 1\).
The statement above directly follows from Lemma 1 and Corollary 1.
Lemma 1
There are no anomalous quartets under the \(\texttt {IS}\) model. Moreover, all quartetinformative patterns have zero probability except for one or both of the patterns corresponding to \(u(\sigma )\) when \(u(\sigma )\) is not a star.
If \(\sigma\) has more than four leaves, we can restrict \(\sigma\) to any subset of four leaves and get a valid submodel (i.e., a submodel for which the probability of the mutation patterns on four cells is the same as under the larger model tree). For the IS model, the submodel is formed by deleting the other leaves and adding branch parameters together when suppressing vertices of degree 2. The mutation pattern probabilities for the four cells under this submodel will be the same as the larger tree because addition represents an or condition (i.e., a mutation occurring on this branch or on that branch will produce the same pattern when looking at only a subset of cells). Thus, it suffices to verify that there are no anomalous quartets for \(\sigma\) with four leaves. This can be done by considering a mutation occurring on each of the internal branches of all possible rooted tree shapes with four leaves (Fig. 2) and comparing the resulting pattern to the unrooted tree shape; see Additional file 1 for details.
The following two lemmas will also be useful later.
Lemma 2
Let \(0 \le \alpha < 1\) and \(0 \le \beta < 1\). Then,
for \(\alpha + \beta \ne 1\). If \(\alpha + \beta = 1\), the inequality in Eq. 2 becomes an equality.
The statement above follows from expanding the polynomials; see the Additional file 1 for details.
Lemma 3
If there are no anomalous quartets under model \({\mathcal {M}}\), then there are no anomalous quartets under the \({\mathcal {M}}{+}\texttt {UE}\) model, assuming \(\alpha + \beta \ne 1\).
Proof
Taking any subset of four leaves, there are 16 possible mutation patterns that may occur under model \({\mathcal {M}}\). These are the two invariant patterns (0000 and 1111), the eight variant but quartetuninformative patterns (1000, 0100, 0010, 0001, 0111, 1011, 1101, 1110), and the six quartetinformative patterns (1100, 0011, 0101, 1010, 0110, 1001). For each pattern g listed above, we enumerate all possible ways of introducing errors (false positives and false negatives); this gives us the probability of each of the 16 mutation patterns under the \(\texttt {UE}\) model given \((\alpha , \beta )\). Now we need to put this information together to get the probability of quartets under the \({\mathcal {M}}{} \texttt {+UE}\) model. First, we compute the probability of observing any quartet q from errors (false positives and false negatives) being introduced to the invariant and variant but quartetuninformative characters; see Eq. 3.
see Additional file 1: Tables S1–S4 for details. Second, we repeat this calculation for the quartetinformative patterns; see Table 1 and Additional file 1: Tables S5–S6 for details. Putting it all together gives us the probability of each quartet under the \({\mathcal {M}}{} \texttt {+UE}\) model
for \(i, j, k \in \{1, 2, 3\}\) such that \(i \ne j \ne k\). Now we can compute the difference in probabilities between quartets \(q_i\) and \(q_j\) under the \({\mathcal {M}}{} \texttt {+UE}\) model for any \(i,j \in \{1, 2, 3\}\) such that \(i \ne j\). By Lemma 1, we have
Assuming \(\alpha + \beta \ne 1\), this quantity is zero if \(\mathbb {P}_{{\mathcal {M}}}(q_i\sigma ) = \mathbb {P}_{{\mathcal {M}}}(q_j\sigma )\) and greater than zero if \(\mathbb {P}_{{\mathcal {M}}}(q_i\sigma ) > \mathbb {P}_{{\mathcal {M}}}(q_j\sigma )\). Because there are no anomalous quartets under model \({\mathcal {M}}\), the former will be the case if \(u(\sigma )\) is a star; the latter will be the case if \(u(\sigma ) = q_i\). It follows there are no anomalous quartets under the \({\mathcal {M}}{} \texttt {+UE}\) model. \(\square\)
Note that the quantity \(\alpha + \beta\) is unlikely to equal 1 in practice, as both probabilities should be less than 0.5. We now extend the result above to address unbiased missing values, in addition to errors.
Corollary 1
If there are no anomalous quartets under model \({\mathcal {M}}\), then there are no anomalous quartets under the \({\mathcal {M}}{} \texttt {+UEM}\) model, assuming that \(\alpha + \beta \ne 1\).
Proof
If one or more of the values in a mutation pattern is missing, then no quartet is displayed. Thus, unbiased missingness can be accounted for in the proof of Lemma 3 simply by updating \(\mathbb {P}_{{\mathcal {M}}}(x\sigma )\) to \((1\gamma )^4 \cdot \mathbb {P}_{{\mathcal {M}}}(x\sigma )\). In this case, Eq. 5 becomes
which does not change our argument. \(\square\)
This concludes the derivation of the main result of this section (Theorem 1). Before moving on to triplets, we note that the difference in quartet probabilities (Eq. 6) depends on (1) the probability of false positives and negatives (specifically how close their sum is to one), (2) the probability of missing values, and (3) the probability of observing quartetinformative patterns under model \({\mathcal {M}}\) given \(\sigma\). In the simulations performed by [24], the largest values of \(\alpha\), \(\beta\), and \(\gamma\) were 0.001, 0.2, and 0.05, respectively. In this scenario with \(u(\sigma ) = q_i\), we have \(\mathbb {P}_{\texttt {IS+UEM}}(q_i\alpha , \beta , \gamma , \sigma )  \mathbb {P}_{\texttt {IS+UEM}}(q_j\alpha , \beta , \gamma , \sigma ) = 0.52 \cdot \mathbb {P}_{\texttt {IS}}(q_i  \sigma )\) for any \(i, j \in \{1, 2, 3\}\) such that \(i \ne j\). The magnitude of \(\mathbb {P}_{\texttt {IS}}(q_i  \sigma )\) depends on the lineage tree and the four cells sampled from it. We discuss sampling further in the context of triplets; for now, we note that, in practical settings, \(\mathbb {P}_{\texttt {IS}}(q_i  \sigma )\) may have a greater impact on the difference in quartet probabilities under the \(\texttt {IS+UEM}\) model than \(\alpha\), \(\beta\), or \(\gamma\).
Anomalous triplets under an unbiased error and missingness model
We now derive related results for triplets. To begin, we assume the rooted cell lineage tree \(\sigma\) has three leaves; therefore, it must have one of two topologies: binary or nonbinary (Additional File 1: Fig S1). If mutations are generated i.i.d. under some model \({\mathcal {A}}\) given \(\sigma\), there are 8 possible patterns on three cells, denoted \(\{A, B, C\}\). A triplet is implied by two cells being in state 1 (i.e., the mutant/derived state) and one cell being in state 0 (i.e., the ancestral state) because the two cells harboring the mutation must have descended from a common ancestor cell also harboring the mutation. One pattern (\(ABC = 110\)) supports triplet \(t_C = CA,B\), one pattern (101) supports triplet \(t_B = BA,C\), one pattern (011) supports triplet \(t_A = AB,C\), and the other five patterns do not provide topological information. Henceforth, we denote the probability of triplets under model \({\mathcal {A}}\) given \(\sigma\) as \(\mathbb {P}_{{\mathcal {A}}}(t_A\sigma ) = \mathbb {P}_{{\mathcal {A}}}(011\sigma )\), \(\mathbb {P}_{{\mathcal {A}}}(t_B\sigma ) = \mathbb {P}_{{\mathcal {A}}}(101\sigma )\), and \(\mathbb {P}_{{\mathcal {A}}}(t_C\sigma ) = \mathbb {P}_{{\mathcal {A}}}(110\sigma )\). Now we consider tripletinformative patterns generated from a model tree with more than three leaves.
Definition 3
(No anomalous triplets) We say that there are no anomalous triplets under model \({\mathcal {A}}\) if the following inequalities hold for every subset S = {X, Y, Z} of three species in \(\sigma\). Let \(t_X, t_Y, t_Z\) denote the three triplets on S, and let i index {X, Y, Z}.

1.
If \(\sigma _S = t_i\), \(\mathbb {P}_{{\mathcal {A}}}(t_i\sigma ) > \mathbb {P}_{{\mathcal {A}}}(t_j\sigma )\) for all \(j \in \{X, Y, Z\}\) such that \(i \ne j\).

2.
If \(\sigma _S\) is nonbinary, \(\mathbb {P}_{{\mathcal {A}}}(t_X\sigma ) = \mathbb {P}_{{\mathcal {A}}}(t_Y\sigma ) = \mathbb {P}_{{\mathcal {A}}}(t_Z\sigma )\).
This brings us to the main result of this section.
Theorem 2
There are no anomalous triplets under the \(\texttt {IS+UEM}\) model, assuming one of two conditions: (1) \(\alpha = 0\) or (2) \(\alpha + \beta \ne 1\) and \(\mathbb {P}_{\texttt {IS}}(100\sigma ) = \mathbb {P}_{\texttt {IS}}(010\sigma ) = \mathbb {P}_{\texttt {IS}}(001\sigma )\). Otherwise, there can be anomalous triplets under the \(\texttt {IS+UEM}\) model.
The statement above directly follows from Lemma 4 and Corollary 2.
Lemma 4
There are no anomalous triplets under the \(\texttt {IS}\) model. Moreover, all tripletinformative patterns have zero probability except for the pattern corresponding to \(\sigma\) when \(\sigma\) is not nonbinary.
If \(\sigma\) has more than three leaves, we can restrict \(\sigma\) to any subset of three leaves and get a valid submodel (i.e., a submodel for which the probability of the mutation patterns on three cells is the same as under the larger model tree, as discussed for quartets). Thus, it suffices to verify that there are no anomalous triplets for \(\sigma\) with three leaves. This can be done by considering a mutation occurring on each of the internal branches of all possible rooted tree shapes with three leaves (Additional File 1: Fig S1) and comparing the resulting pattern to the tree shape; see Additional File 1 for details.
Lemma 5
If there are no anomalous triplets under model \({\mathcal {M}}\), then there are no anomalous triplets under the \({\mathcal {M}}{} \texttt {+UE}\) model, assuming one of two conditions: (1) \(\alpha = 0\) or (2) \(\alpha + \beta \ne 1\) and \(\mathbb {P}_{{\mathcal {M}}}(100\sigma ) = \mathbb {P}_{{\mathcal {M}}}(010\sigma ) = \mathbb {P}_{{\mathcal {M}}}(001\sigma )\). Otherwise, there can be anomalous triplets under the \({\mathcal {M}}{} \texttt {+UE}\) model.
Proof
Taking any subset of three leaves, there are 8 possible mutation patterns that may occur under model \({\mathcal {M}}\). These are the two invariant patterns (\(ABC = 000\) and 111), the three variant but tripletuninformative patterns (100, 010, 001), and the three tripletinformative patterns (110, 101, 011). For each pattern g listed above, we enumerate all possible ways of introducing errors (false positives and false negatives); this gives us the probability of mutation patterns under the UE model given \((\alpha , \beta , g)\); see Additional file 1: Tables S7–S14. Putting everything together, we find the probability of triplet \(t_C\) under the \({\mathcal {M}}{} \texttt {+UE}\) model is
Similar probabilities can be computed for \(t_B\) and \(t_A\). To provide a general formula, we define
and set \(x_A = 100\), \(x_B = 010\), and \(x_C = 001\). This allows us to write the probability of any triplet as
where \(i, j, k \in \{A, B, C\}\) with \(i \ne j \ne k\). Now we can compute the differences in probabilities between \(t_i\) and \(t_j\) under the \({\mathcal {M}}{} \texttt {+UE}\) for any \(i,j \in \{A, B, C\}\) such that \(i \ne j\). We find
Assuming that either condition (1) \(\alpha = 0\) or condition (2) \(\alpha + \beta \ne 1\) and \(\mathbb {P}_{{\mathcal {M}}}(x_i  \sigma ) = \mathbb {P}_{{\mathcal {M}}}(x_j  \sigma )\), this quantity is zero if \(\mathbb {P}_{{\mathcal {M}}}(t_i\sigma ) = \mathbb {P}_{{\mathcal {M}}}(t_j\sigma )\) and is greater than zero if \(\mathbb {P}_{{\mathcal {M}}}(t_i\sigma ) > \mathbb {P}_{{\mathcal {M}}}(t_j\sigma )\). Because there are no anomalous triplets under \({\mathcal {M}}\), the former will be the case if \(\sigma\) is nonbinary; the latter will be the case if \(\sigma = t_i\). It follows there are no anomalous triplets under the \({\mathcal {M}}{} \texttt {+UE}\) model, provided one of the two conditions hold. If these conditions do not hold and \(\sigma \ne t_j\), triplet \(t_j\) is anomalous when
for \(i,j \in \{A,B,C\}\) such that \(i \ne j\). \(\square\)
The result above extends easily to the case with unbiased missing values, in addition to unbiased error (see Corollary 1).
Corollary 2
If there are no anomalous triplets under model \({\mathcal {M}}\), then there are no anomalous triplets under the \({\mathcal {M}}{} \texttt {+UEM}\) model, assuming that one of the two conditions: (1) \(\alpha = 0\) or (2) \(\alpha + \beta \ne 1\) and \(\mathbb {P}_{{\mathcal {M}}}(100\sigma ) = \mathbb {P}_{{\mathcal {M}}}(010\sigma ) = \mathbb {P}_{{\mathcal {M}}}(001\sigma )\). Otherwise, there can be anomalous triplets under the \({\mathcal {M}}{} \texttt {+UEM}\) model.
This concludes the derivation of the main result of this section (Theorem 2). Before moving onto phylogeny estimation, we consider whether anomalous triplets should be expected in the context of tumor phylogenetics, so setting \({\mathcal {M}}\) to be the \(\texttt {IS}\) model. As previously mentioned, in the simulations performed by Kizilkale et al. [24], the largest values of \(\alpha\) and \(\beta\) were 0.001 and 0.2, respectively. This would put \(\alpha /(1  \beta ) = 0.00125\) in Eq. 11, so \(\mathbb {P}_{{\mathcal {M}}}(x_i  \sigma )  \mathbb {P}_{{\mathcal {M}}}(x_j  \sigma )\) would need to be 800 times greater than \(\mathbb {P}_{{\mathcal {M}}}(t_i  \sigma )  \mathbb {P}_{{\mathcal {M}}}(t_j  \sigma )\) for \(t_j\) to be anomalous under the \({\mathcal {M}}{} \texttt {+UE}\) model (note the terms for missing values will cancel out in Eq. 11). Although this seems drastic, it could occur when \(\sigma\) is created by restricting a larger model tree to a subset three leaves. Consider the cell lineage tree in Fig. 1 but with 1000 additional cells added between cell 9 and cell 10, and suppose that we sample cells \(\{1, 4, 10\}\), so the resulting submodel, \(\sigma\) has rooted topology \(t_{10} = 101, 4\). This submodel defines the following probability distribution on mutations, assuming mutations occur on all nonfake edges with equal probability. For the variant but tripletuninformative patterns, we have \(\mathbb {P}_{\texttt {IS}}(x_1\sigma ) = 0\), \(\mathbb {P}_{\texttt {IS}}(x_4\sigma ) = 1/1012\), and \(\mathbb {P}_{\texttt {IS}}(x_{10} \sigma ) = 1009/1012\), and for tripletinformative patterns, we have \(\mathbb {P}_{\texttt {IS}}(t_1\sigma ) = 0\), \(\mathbb {P}_{\texttt {IS}}(t_4\sigma ) = 0\), and \(\mathbb {P}_{\texttt {IS}}(t_{10}\sigma ) = 1/1012\). Now looking at Eq. 11, we find that triplets \(t_1\) and \(t_4\) are anomalous under the IS + UEM model. Based on this analysis, we conjecture that triplets are less robust to error than quartets when sampling cells from a larger cell lineage tree.
Phylogeny estimation from quartets
Because there are no anomalous quartets under the \(\texttt {IS+UEM}\) model under reasonable assumptions (i.e., \(\alpha + \beta \ne 1\)), we now consider the utility of quartetbased methods for estimating cell lineage trees from mutation data. By quartetbased methods, we mean heuristics for the Maximum Quartet Support Supertree (MQSS) problem ( [4]; also see Section 7.7 in [27]).
Definition 4
(Maximum Quartet Support Supertrees) The MQSS problem is defined by
Input: A set of unrooted trees \({\mathcal {P}} = \{T_1, T_2, \dots , T_k\}\), with tree \(T_i\) on leaf label set \(X_i\)
Output: A unrooted, binary tree B on leaf label set \(\cup _{i=1}^k X_i\) that maximizes \(QS_D(B) = \sum _{q \in Q(B)} w_{{\mathcal {P}}}(q)\), where \(w_{\mathcal {P}}(q)\) is the number of input trees in \({\mathcal {P}}\) that display q
When all input trees are on the same leaf label set, MQSS becomes weighted quartet consensus (WQC). An optimal solution to WQC is a consistent estimator of the unrooted species tree topology under the MSC model (Theorem 2 in [12]). Although MQSS and WQC are NPhard [28, 29], fast and accurate heuristics have been developed, with the most wellknown being ASTRAL [12]. Since version 2 [13, 14], ASTRAL allows the input trees to be incomplete and is statistically consistent under the MSC model, provided some assumptions on missing data (see [30] for details).
There are two notable differences when using quartetbased methods to reconstruct cell lineage trees, rather than species trees. The first difference is that input are mutations rather than unrooted (gene) trees. This issue was addressed by Springer et al. [31], who treat mutations as unrooted trees with at most one internal branch (Fig. 3). Given this transformation of the input, it is possible to run ASTRAL and other quartetbased methods on mutation data. The second difference is that ASTRAL outputs a binary tree. Our next result suggests that MQSS/WQC are useful problem formulations, even when data are generated from nonbinary trees.
Theorem 3
Let \(\sigma\) be a rooted tree with at least one internal branch when unrooted (so it can be nonbinary), and let D be an \(n \times k\) mutation matrix generated under the \(\texttt {IS+UEM}\) model given \((\alpha , \beta , \gamma , \sigma )\) with \(\alpha + \beta \ne 1\) and \(0 \le \alpha , \beta , \gamma < 1\). Then, an optimal solution to MQSS given D is a consistent estimator of \(u(\sigma )\) under the \(\texttt{IS+UEM}\) model, with tree error defined as the number of false negative branches. If in addition \(0< \alpha , \beta , < 1\), ASTRAL given D is statistically consistent under the \(\texttt {IS+UEM}\) model, with tree error defined as the number of false negative branches.
The first statement above follows from Theorem 1 and Lemma 6. The second statement follows from Theorem 1, Corollary 3, and the observation that every complete mutation pattern is possible under the \(\texttt {IS+UEM}\) model when \(0< \alpha , \beta < 1\).
Lemma 6
Let \(\sigma\) be a rooted tree with at least one internal branch when unrooted (so it can be nonbinary), and let D be an \(n \times k\) mutation matrix generated under model \({\mathcal {A}}\) given \(\sigma\). If there are no anomalous quartets under \({\mathcal {A}}\), an optimal solution to MQSS given D is a consistent estimator of \(u(\sigma )\) under model \({\mathcal {A}}\), with tree error defined as the number of false negative branches.
Proof
Let B be an unrooted, binary tree on the same label set as \(u(\sigma )\). The number of false negative branches between B and \(u(\sigma )\) is zero if B is a refinement of \(u(\sigma )\), meaning that B can be obtained from \(u(\sigma )\) in a sequence of refinement operations (this sequence has length zero if \(\sigma\) is binary). Thus, to prove consistency with tree error defined as the number of false negative branches, we revise Definition 1 to say that for any \(\epsilon > 0\), there exists a constant \(K > 0\) such that when D contains at least K mutations, an optimal solution to MQSS given D is a refinement of \(u(\sigma )\) with probability at least \(1  \epsilon\). The remainder of the proof follows from Lemma 7. \(\square\)
Lemma 7
Suppose the conditions of Lemma 6 hold. Let \(L(\sigma )\) be the label set of \(\sigma\), and let B and T be unrooted, binary trees on \(L(\sigma )\). Suppose that B is a refinement of \(u(\sigma )\) and that T is NOT. Then, for any pair B and T and for any \(\epsilon > 0\), there exists a constant \(K > 0\) such that when D contains at least K mutations, \(QS_D(B) > QS_D(T)\) with probability at least \(1  \epsilon\).
Proof
To begin, we restate the inequality as
where \({\mathcal {X}}\) is the set of all possible subsets of four elements from \(L(\sigma )\) and \(w_D(q)\) is the number of mutations in D that imply quartet q.
Claim 1: First, we claim that as \(k \rightarrow \infty\), \(w_D/k\) converges to its expectation \(F^*\) under model \({\mathcal {A}}\) given \(\sigma\) with probability 1. Claim 1 holds by the strong law of large numbers, as noted in the proofs of consistency for quartet and tripletbased methods under the \(\texttt {MSC}\) model (see [20] for an example). Let \(q_1^S\), \(q_2^S\), \(q_3^S\) denote the three possible quartets on S. Then, we can restate claim 1 as follows. As \(k \rightarrow \infty\), \(w_D(q_i^S)/k\) converges to \(F^*(q_i^S) = \mathbb {P}_{{\mathcal {A}}}(q_i^S  \sigma )\) with probability 1 for all \(i \in \{1, 2, 3\}\) and for all \(S \in {\mathcal {X}}\).
Claim 2: Second, we claim there exists a \(\delta\) such that whenever \(\Vert w_D/k  F_*\Vert _{\infty } < \delta\), Eq. 12 holds. We show claim 2 is true for \(\delta = \pi / 2 {\mathcal {X}}\), where
There are three cases to consider.
Case 1: Let \({\mathcal {W}} \subset {\mathcal {X}}\) include all \(S \in {\mathcal {X}}\) such that \(B_S\) and \(T_S\) are the same quartet. Then, \(w_D(B_S) = w_D(T_S)\) for all \(S \in {\mathcal {W}}\), giving us
Now suppose that B and T restricted to S display different quartets, denoted \(q_1^S\) and \(q_2^S\), respectively. If \(q_2^S \in Q(u(\sigma))\), it must also be in Q(B) as \(Q(u(\sigma)) \subseteq Q(B)\). Therefore, \(q_2^S \notin Q(u(\sigma))\). This gives us two additional cases to consider.
Case 2: Let \({\mathcal {Y}} \subset {\mathcal {X}}\) include all \(S \in {\mathcal {X}}\) such that \(B_S\) and \(T_S\) are different quartets, denoted \(q_1^S\) and \(q_2^S\), respectively, with \(q_1^S \in Q(u(\sigma))\) and \(q_2^S \notin Q(u(\sigma))\). Then, for all \(S \in {\mathcal {Y}}\), \(F^*(q_1^S) > F^*(q_2^S)\) and \(F^*(q_1^S) > F^*(q_3^S)\) because there are no anomalous quartets under \({\mathcal {A}}\) (Definition 2). Therefore, whenever \(\Vert w_D/k  F_*\Vert _{\infty } < \pi /2\), \(w_D(B_S) > w_D(T_S)\) for all \(S \in {\mathcal {Y}}\), giving us
Case 3 (only needed for nonbinary \(\sigma\)): Let \({\mathcal {Z}} \subset {\mathcal {X}}\) include all \(S \in {\mathcal {X}}\) such that \(B_S\) and \(T_S\) are different quartets, denoted \(q_1^S\) and \(q_2^S\), respectively, with \(q_1^S, q_2^S \notin Q(u(\sigma))\). By our assumptions on B, \(\sigma_S\) is a star. Then, because there are no anomalous quartets under \({\mathcal {A}}\) (Definition 2), \(F^*(q_1^S) = F^*(q_2^S) = F^*(q_3^S)\). However, even as \(k \rightarrow \infty\), we are not guaranteed to get an exact equality \(w_D(q_1^S) = w_D(q_2^S) = w_D(q_3^S)\). Thus, we need to put an upper bound \(\delta\) on \(\Vert w_D/k  F_*\Vert _{\infty }\) so that Eq. 12 holds even when \(w_D(B_S) < w_D(T_S)\) for all \(S \in {\mathcal {Z}}\). This happens for \(\delta = \pi / 2 {\mathcal {X}}\). When \(\Vert w_D/k  F_*\Vert _{\infty } < \pi / 2 {\mathcal {X}}\), we have
because \(({\mathcal {Y}} + {\mathcal {Z}})/ {\mathcal {X}} \le 1\) and \(1 \le {\mathcal {Y}}\). If not, either T is a refinement of \(u(\sigma )\) or \(u(\sigma )\) has no internal branches, contradicting our assumptions.
Putting the cases together: If \(u(\sigma )\) is binary, then \({\mathcal {W}}{\mathcal {Y}}\) is a partition \({\mathcal {X}}\), so we can combine equations 14 and 15 to get Eq. 12. If \(u(\sigma )\) is nonbinary, then \({\mathcal {W}}{\mathcal {Y}}{\mathcal {Z}}\) is a partition of \({\mathcal {X}}\), so we can combine Eqs. 14 and 16 to get Eq. 12.
Wrap up. By claim 1, for any \(\epsilon > 0\), there exists a constant \(K > 0\) such that when D contains at least K mutations, \(\Vert w_D/k  F^*\Vert _{\infty } < \pi / 2 {\mathcal {X}}\) with probability at least \(1  \epsilon\). Then, by claim 2, for any \(\epsilon > 0\), when D contains at least K mutations, \(QS_D(B) > QS_D(T)\) with probability at least \(1  \epsilon\). \(\square\)
Corollary 3
Suppose the conditions of Lemma 6 hold. If there are no anomalous quartets under model \({\mathcal {A}}\) and every complete mutation pattern occurs with nonzero probability under \({\mathcal {A}}\), then ASTRAL given D is statistically consistent under \({\mathcal {A}}\), with tree error defined as the number of false negative branches.
Proof
ASTRAL solves MQSS exactly within a constrained version of the solution space, denoted \(\Sigma\). Its algorithm has two main steps: first, form \(\Sigma\) so that it contains all bipartitions induced by the input “trees” (i.e., mutations in D), and second find a solution B to MQSS such that \(Bip(B) \subseteq \Sigma\). Because all complete mutation patterns have nonzero probability under the model \({\mathcal {A}}\), for any \(\epsilon _1 > 0\), there exists a constant \(K_1> 0\) such that when \(k \ge K_1\), \(\Sigma\) will contain all bipartitions induced by at least one refinement of \(u(\sigma )\) with probability \(1  \epsilon _1\). By Lemma 7, for any \(\epsilon _2 > 0\), there exists a constant \(K_2 > 0\) such that when \(k \ge K_2\), \(QS_D(B) > QS_D(T)\) for any pair B and T of unrooted binary trees on \(L(\sigma )\) such that B is a refinement of \(u(\sigma )\) and T is NOT. Now let \(\epsilon > 0\) and select \(\epsilon _1, \epsilon _2 > 0\) such that \(\epsilon _1 + \epsilon _2 < \epsilon\). Then, when D contains at least \(\max \{K_1, K_2\}\) mutations, ASTRAL given D returns a refinement of \(u(\sigma )\) with probability at least \((1  \epsilon _1)(1\epsilon _2) > (1  \epsilon )\). It follows the number of false negative branches is zero with probability at least \(1  \epsilon\). \(\square\)
Lastly, we note that related results can be derived for triplets by viewing mutations as a rooted trees with at most one internal branch (Additional file 1: Fig. S2).
Theorem 4
Suppose that the conditions of Theorem 3 hold but that \(\alpha = 0\) (instead of \(\alpha + \beta \ne 1\)). Then, an optimal solution to Maximum Triplet Support Supertree (MTSS) problem given D is a consistent estimator of \(\sigma\) under the \(\texttt {IS+UEM}\) model, with tree error defined as the number of false negative branches.
This result follows from Theorem 2 and Lemma 6 but replacing “quartet” with “triplet” and “rooted” with “unrooted”. Code for transforming ordered 2state characters into rooted trees (as shown in Additional file 1: Fig. S2) is available as part of DolloCDP [32]; see the − k option on Github: https://github.com/molloylab/DolloCDP. These “rooted trees” can be given as input to quartetbased methods, like ASTRAL or ASTER [18], which will effectively ignore the root (note that any “tree” on fewer than four cells must be removed prior to running ASTRAL).
Discussion
Quartetbased approaches have garnered much success for estimating species phylogenies under the MultiSpecies Coalescent [12,13,14]. Here, we considered their application for estimating cell lineage trees, focusing on two differences between estimating cell lineage trees compared to species trees. First, errors and missing values can arise from singlecell sequencing and thus are typically modeled. Second, the model cell lineage tree may be highly unresolved because tumors evolve clonally. To address these issues, we first show that there are no anomalous quartets under the infinite sites (\(\texttt {IS}\)) plus unbiased error and missingness (\(\texttt {UEM}\)) model, which is widely used in tumor phylogenetics (this is an identifiability result). We then show that under the \(\texttt {IS+UEM}\) model, an optimal solution to the Maximum Quartet Support Supertree (MQSS) problem is a refinement of the model cell lineage tree (this is a consistency result when tree error is defined as the number of false negative branches). Lastly, we consider the case of triplets, showing that there can be anomalous triplets when the probability of false positive errors is greater than zero. Our result suggests that quartets may be more robust to error than triplets when reconstructing cell lineage trees.
Overall, our results suggest the potential of quartetbased methods for reconstructing trees from noisy mutation data, provided that the tree can be rooted and that false positive branches in the output tree can be effectively handled. The former is often doable because the tree can be rooted on the edge incident to the healthy cell with no mutations. The latter is related to mapping mutations onto branches in the cell lineage tree (see [33]) as well as identifying which cells are members of the same clone or subclone (see [34]). These tasks are also relevant to likelihoodbased methods designed for cell lineage tree reconstruction, as such methods also return binary trees. Examples include ScisTree [23], SiFit [35], and CellPhy [33] (note that of these methods ScisTree makes the \(\texttt {IS}\) assumption but the other two do not).
In general, likelihoodbased methods require explicitly estimating numeric parameters, like \(\alpha\) and \(\beta\), as well as exploring the space of cell lineage trees, which grows exponentially in the numbers of cells. In contrast, quartetbased approaches allow for error implicitly (without explicit estimation of \(\alpha\) and \(\beta\)) and are often based on algorithmic techniques, like divideandconquer, that are quite fast in practice. That being said, quartetbased methods have been designed for species phylogenetics, where the number of leaves (species) is typically much less than the number of gene trees or characters. In tumor phylogenetics, the number of leaves (cells) can be much greater than than characters. This will likely to have consequences for runtime and accuracy (just consider that our consistency guarantees is in the limit of infinite mutations). Corollary 3 sheds light on a potential issue when using ASTRAL, namely the construction of the constrained solution space \(\Sigma\) may not be very successful for mutation data, especially if the number of mutations is small. However, there are other high quality heuristics for MQSS, including wQMC [36,37,38], wQFM [16, 39], and TREEQMC [17]. Moreover, even when the number of characters is small compared to the number of cells, the underlying model tree is likely to be highly unresolved. In this case, sampling different cells around the same branch may be a means of providing more data for estimation (this observation has already been leveraged by Kizilkale et al. [24]).
Although quartetbased methods, as presented here, may be robust to noise, they fail to address doublets and copy number aberrations (CNAs), which also challenge cell lineage tree reconstruction. A doublet is a sequencing artifact where data provided for a single cell is really a mixture of two cells. This “hybrid” cell challenges the notion of treelike evolution, motivating the development of methods for correcting doublets [40]. If doublets can be effectively corrected, then their impact on quartetbased methods would be minimal. Alternatively, quartets may be useful for detecting doublets. CNAs include duplications and losses of large sections of chromosomes (see review on methods for detecting CNAs by Mallory et al. [41]). CNA losses, in particular, have motivated the development of many new methods for reconstructing tumor phylogenies [42,43,44]. Some of these methods view CNA losses as false negatives (although these false negatives will be biased towards particular cells and mutations). In contrast, SCARLET [44] reconstructs a CNA tree and then uses it to constrain phylogeny reconstruction with the mutation data. Constraints have also been leveraged in species phylogenetics, including with ASTRAL [45]. Thus, the output of quartetbased methods could similarly be forced to obey the constraints of a CNA tree. To summarize, there are practical limitations to quartetbased methods for tumor phylogenetics, several of which apply to existing methods that do not handle CNAs and doublets, for example.
Looking beyond cell lineage tree reconstruction, our results generalize beyond the \(\texttt {IS}\) model to any model of 2state character evolution for which there are no anomalous quartets or triplets. A consequence of our study is that quartetbased methods, like ASTRAL, are consistent under the \(\texttt {IS+nWF}\) model, even when unbiased errors and missing values are introduced. This statement follows from combining Corollaries 1 and 3 with Theorem 1 in [48]. Thus, our work addresses an open question from [48] about the utility of such methods on imperfect data and gives a positive result for recent systematic studies leveraging quartetbased methods on retroelement insertion presence/absence patterns for placental mammals [51], bats [52], and birds [53]. Missing values, in particular, are prevalent in these data sets (e.g., the data set from [54] has 18% missing/ambiguous values). Future work should investigate this issue further, looking at error and missingness biased towards particular species or (orthologous) positions of the genome (related questions would also be of interest in the context of cell lineage tree estimation). Similarly, while unbiased errors (false positives and false negatives) may be appropriate for modeling sequencing error, it may not be appropriate in this other setting if error is biased towards particular species or genes when calling retroelement insertions. Lastly, species trees are typically assumed to be binary; however, there could be hard polytomies, in which case the model tree would be nonbinary. Our results for consistency with error defined as the number of false negatives (Lemma 6 and Corollary 3) extend to the other models, like \(\texttt {MSC}\), suggesting the utility of quartetbased methods in the case of hard polytomies.
Data availability
Not applicable.
Abbreviations
 CNA:

Copy number abberation
 IS:

Infinite sites
 MQSS:

Maximum quartet support supertree
 MSC:

Multispecies coalescent
 MTSS:

Maximum triplet support supertree
 nWF:

Neutral WrightFisher
 UEM:

Unbiased error and missingness
References
Lim B, Lin Y, Navin N. Advancing cancer research and medicine with singlecell genomics. Cancer Cell. 2020;37(4):456–70. https://doi.org/10.1016/j.ccell.2020.03.008.
Jahn K, Kuipers J, Beerenwinkel N. Tree inference for singlecell data. Genome Biol. 2016;17:86. https://doi.org/10.1186/s130590160936x.
Schwartz R, Schäffer AA. The evolution of tumour phylogenetics: principles and practice. Nat Rev Genet. 2017;18(4):213–29. https://doi.org/10.1038/nrg.2016.170.
Wilkinson M, Cotton JA, Creevey C, Eulenstein O, Harris SR, Lapointe FJ, Levasseur C, Mcinerney JO, Pisani D, Thorley JL. The shape of supertrees to come: tree shape related properties of fourteen supertree methods. Syst Biol. 2005;54(3):419–31. https://doi.org/10.1080/10635150590949832.
Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988;5(5):568–83. https://doi.org/10.1093/oxfordjournals.molbev.a040517.
Rannala B, Yang Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics. 2003;164(4):1645–56. https://doi.org/10.1093/genetics/164.4.1645.
Allman ES, Degnan JH, Rhodes JA. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol. 2011;62(6):833–62. https://doi.org/10.1007/s0028501003557.
Degnan JH. Anomalous unrooted gene trees. Syst Biol. 2013;62(4):574–90. https://doi.org/10.1093/sysbio/syt023.
Kubatko LS, Degnan JH. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 2007;56(1):17–24. https://doi.org/10.1080/10635150601146041.
Roch S, Steel M. Likelihoodbased tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 2015;100:56–62. https://doi.org/10.1016/j.tpb.2014.12.005.
Larget BR, Kotha SK, Dewey CN, Ané C. BUCKy: gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics. 2010;26(22):2910–1. https://doi.org/10.1093/bioinformatics/btq539.
Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genomescale coalescentbased species tree estimation. Bioinformatics. 2014;30(17):541–8. https://doi.org/10.1093/bioinformatics/btu462.
Mirarab S, Warnow T. ASTRALII: coalescentbased species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015;31(12):44–52. https://doi.org/10.1093/bioinformatics/btv234.
Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRALIII: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinf. 2018;19(6):153. https://doi.org/10.1186/s128590182129y.
Dibaeinia P, TabeBordbar S, Warnow T. FASTRAL: improving scalability of phylogenomic analysis. Bioinformatics. 2021;37(16):2317–24. https://doi.org/10.1093/bioinformatics/btab093.
Mahbub M, Wahab Z, Reaz R, Rahman MS, Bayzid MS. wQFM: highly accurate genomescale species tree estimation from weighted quartets. Bioinformatics. 2021;37(21):3734–43. https://doi.org/10.1093/bioinformatics/btab428.
Han Y, Molloy EK. Improving quartet graph construction for scalable and accurate species tree estimation from gene trees. Genome Res. 2023. https://doi.org/10.1101/gr.277629.122.
Zhang C, Mirarab S. Weighting by gene tree uncertainty improves accuracy of quartetbased species trees. Mol Biol Evol. 2022;39(12):215. https://doi.org/10.1093/molbev/msac215.
Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet. 2006;2(5):1–7. https://doi.org/10.1371/journal.pgen.0020068.
Liu L, Yu L, Edwards SV. A maximum pseudolikelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 2010;10:302. https://doi.org/10.1186/1471214810302.
Islam M, Sarker K, Das T, Reaz R, Bayzid MS. STELAR: a statistically consistent coalescentbased species tree estimation method by maximizing triplet consistency. BMC Genomics. 2020;21(1):136. https://doi.org/10.1186/s128640206519y.
Ross EM, Markowetz F. OncoNEM: inferring tumor evolution from singlecell sequencing data. Genome Biol. 2016;17(1):69. https://doi.org/10.1186/s1305901609299.
Wu Y. Accurate and efficient cell lineage tree inference from noisy single cell data: the maximum likelihood perfect phylogeny approach. Bioinformatics. 2019;36(3):742–50. https://doi.org/10.1093/bioinformatics/btz676.
Kizilkale C, Mehrabadi FR, Azer ES, PérezGuijarro E, Marie KL, Lee MP, Day CP, Merlino G, Ergün F, Buluç A, Sahinalp SC, Malikić S. Fast intratumor heterogeneity inference from singlecell sequencing data. Nat Comput Sci. 2022;2:577–83. https://doi.org/10.1038/s4358802200298x.
Fisher RA. On the dominance ratio. Proc R Soc Edinb. 1923;42:321–41. https://doi.org/10.1017/S0370164600023993.
Wright S. Evolution in mendelian populations. Genetics. 1931;16(2):97–159. https://doi.org/10.1093/genetics/16.2.97.
Warnow T. Computational phylogenetics: an introduction to designing methods for phylogeny estimation. Cambridge: Cambridge University Press; 2017.
Jiang T, Kearney P, Li M. A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application. SIAM J Comput. 2001;30(6):1942–61. https://doi.org/10.1137/S0097539799361683.
Lafond M, Scornavacca C. On the weighted quartet consensus problem. Theor Comput Sci. 2019;769:1–17. https://doi.org/10.1016/j.tcs.2018.10.005.
Nute M, Chou J, Molloy EK, Warnow T. The performance of coalescentbased species tree estimation methods under models of missing data. BMC Genomics. 2018;19(Suppl 5):286. https://doi.org/10.1186/s1286401846198.
Springer MS, Molloy EK, Sloan DB, Simmons MP, Gatesy J. ILSaware analysis of lowhomoplasy retroelement insertions: inference of species trees and introgression using quartets. J Hered. 2019;111(2):147–68. https://doi.org/10.1093/jhered/esz076.
Dai J, Rubel T, Han Y, Molloy EK. Leveraging Constraints plus dynamic programming for the large dollo parsimony problem. In: Belazzougui D, Ouangraoua A, editors. 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), vol. 273. Schloss Dagstuhl–LeibnizZentrum für Informatik, Dagstuhl, Germany. 2023. pp. 5–1523. https://doi.org/10.4230/LIPIcs.WABI.2023.5
Kozlov A, Alves JM, Stamatakis A, Posada D. Cell Phy: accurate and fast probabilistic inference of singlecell phylogenies from scDNAseq data. Genome Biol. 2022;23:37. https://doi.org/10.1186/s1305902102583w.
Zafar H, Navin N, Chen K, Nakhleh L. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from singlecell genome sequencing data. Genome Res. 2019;29(11):1847–59. https://doi.org/10.1101/gr.243121.118.
Zafar H, Tzen A, Navin N, Chen K, Nakhleh L. SiFit: inferring tumor trees from singlecell sequencing data under finitesites models. Genome Biol. 2017;18:178. https://doi.org/10.1186/s1305901713112.
Snir S, Rao S. Quartets MaxCut: a divide and conquer quartets algorithm. IEEE/ACM Trans on Comput Biol Bioinf. 2010;7(4):704–18. https://doi.org/10.1109/TCBB.2008.133.
Snir S, Rao S. Quartet MaxCut: a fast algorithm for amalgamating quartet trees. Mol Phylogenet Evol. 2012;62(1):1–8. https://doi.org/10.1016/j.ympev.2011.06.021.
Avni E, Cohen R, Snir S. Weighted quartets phylogenetics. Syst Biol. 2014;64(2):233–42. https://doi.org/10.1093/sysbio/syu087.
Reaz R, Bayzid MS, Rahman MS. Accurate phylogenetic tree reconstruction from quartets: a heuristic approach. PLoS ONE. 2014;9(8):1–13. https://doi.org/10.1371/journal.pone.0104008.
Weber LL, Sashittal P, ElKebir M. doubletD: detecting doublets in singlecell DNA sequencing data. Bioinformatics. 2021;37(Suppl–1):214–21. https://doi.org/10.1093/bioinformatics/btab266.
Mallory XF, Edrisi M, Navin N, Nakhleh L. Methods for copy number aberration detection from singlecell DNAsequencing data. Genome Biol. 2020;21:208. https://doi.org/10.1186/s13059020021198.
ElKebir M. SPhyR: tumor phylogeny estimation from singlecell sequencing data under loss and error. Bioinformatics. 2018;34(17):671–9. https://doi.org/10.1093/bioinformatics/bty589.
Malikic S, Mehrabadi FR, Ciccolella S, Rahman MK, Ricketts C, Haghshenas E, Seidman D, Hach F, Hajirasouliha I, Sahinalp SC. PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of singlecell and bulk sequencing data. Genome Res. 2019;29(11):1860–77. https://doi.org/10.1101/gr.234435.118.
Satas G, Zaccaria S, Mon G, Raphael BJ. SCARLET: singlecell tumor phylogeny inference with copynumber constrained mutation losses. Cell Syst. 2020;10(4):323–3328. https://doi.org/10.1016/j.cels.2020.04.001.
Rabiee M, Mirarab S. Forcing external constraints on tree inference using astral. BMC Genomics. 2020;21(Suppl 2):218. https://doi.org/10.1186/s128640206607z.
Hudson RR. Generating samples under a WrightFisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–8. https://doi.org/10.1093/bioinformatics/18.2.337.
Kuritzin A, Kischka T, Schmitz J, Churakov G. Incomplete lineage sorting and hybridization statistics for largescale retroposon insertion data. PLOS Comput Biol. 2016;12(3):1–20. https://doi.org/10.1371/journal.pcbi.1004812.
Molloy EK, Gatesy J, Springer MS. Theoretical and practical considerations when using retroelement insertions to estimate species trees in the anomaly zone. Syst Biol. 2021;71(3):721–40. https://doi.org/10.1093/sysbio/syab086.
Mendes FK, Hahn MW. Why concatenation fails near the anomaly zone. Syst Biol. 2017;67(1):158–69. https://doi.org/10.1093/sysbio/syx063.
Springer MS, Gatesy J. The gene tree delusion. Mol Phylogenet Evol. 2016;94:1–33. https://doi.org/10.1016/j.ympev.2015.07.018.
Doronina L, Hughes GM, MorenoSantillan D, Lawless C, Lonergan T, Ryan L, Jebb D, Kirilenko BM, Korstian JM, Dávalos LM, Vernes SC, Myers EW, Teeling EC, Hiller M, Jermiin LS, Schmitz J, Springer MS, Ray DA. Contradictory phylogenetic signals in the laurasiatheria anomaly zone. Genes. 2022. https://doi.org/10.3390/genes13050766.
Korstian J, Paulat N, Platt R II, Stevens R, Ray D. Sinebased phylogenomics reveal extensive introgression and incomplete lineage sorting in myotis. Genes. 2022;13(3):399. https://doi.org/10.3390/genes13030399.
Gatesy J, Springer MS. Phylogenomic coalescent analyses of avian retroelements infer zerolength branches at the base of neoaves, emergent support for controversial clades, and ancient introgressive hybridization in afroaves. Genes. 2022. https://doi.org/10.3390/genes13071167.
Cloutier A, Sackton TB, Grayson P, Clamp M, Baker AJ, Edwards SV. Wholegenome analyses resolve the phylogeny of flightless birds (Palaeognathae) in the presence of an empirical anomaly zone. Syst Biol. 2019;68(6):937–55. https://doi.org/10.1093/sysbio/syz019.
Acknowledgements
This work was conducted in preparation for a talk given at the 2023 U.S. National Cancer Institute’s Spring School in Algorithmic Cancer Biologyove; EKM thanks the organizers for the invitation, the funders of the event, and the audience for their questions and engagement. YH and EKM thank Dr. Michael Nute for feedback on a preliminary version our manuscript that greatly improved our paper, especially the clarity of our notation. YH and EKM also thank the PC chairs of WABI, Dr. Aïda Ouangraoua and Dr. Djamal Belazzougui, as well as the anonymous reviewers for constructive feedback.
Funding
This work was financially supported by the State of Maryland.
Author information
Authors and Affiliations
Contributions
EKM and YH proved the theorems and wrote the paper. EKM conceptualized and supervised the project.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Proofs of Lemmas 1, 3, and 5, supplemental figures S1–S2, and supplemental tables S1–S16.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Han, Y., Molloy, E.K. Quartets enable statistically consistent estimation of cell lineage trees under an unbiased error and missingness model. Algorithms Mol Biol 18, 19 (2023). https://doi.org/10.1186/s1301502300248w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1301502300248w