Parsimonious Clone Tree Integration in cancer

Background: Every tumor is composed of heterogeneous clones, each corresponding to a distinct subpopulation of cells that accumulated different types of somatic mutations, ranging from single-nucleotide variants (SNVs) to copy-number aberrations (CNAs). As the analysis of this intra-tumor heterogeneity has important clinical applications, several computational methods have been introduced to identify clones from DNA sequencing data. However, due to technological and methodological limitations, current analyses are restricted to identifying tumor clones based on either SNVs or CNAs alone, preventing a comprehensive characterization of a tumor's clonal composition.

Results: To overcome these challenges, we formulate the identification of clones in terms of both SNVs and CNAs as an integration problem while accounting for uncertainty in the input SNV and CNA proportions. We characterize the computational complexity of this problem and introduce PACTION (PArsimonious Clone Tree integratION), an algorithm that solves the problem using a mixed integer linear programming formulation. On simulated data, we show that tumor clones can be identified reliably, especially when further taking into account the ancestral relationships that can be inferred from the input SNVs and CNAs. On 49 tumor samples from 10 prostate cancer patients, our integration approach provides a higher resolution view of tumor evolution than previous studies.

Conclusion: PACTION is an accurate and fast method that reconstructs the clonal architecture of tumors by integrating SNV and CNA clones inferred using existing methods.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13015-022-00209-9.


[…] $Bq\,u^{(1)}_{1,s} = \sum_{(i,j) \in \Pi \,:\, \pi_2((i,j)) = j}$ […], where the second equality follows from construction with $u_{1,i} = a_i/(Bq)$, the third equality uses Equation (1), the fourth equality uses consistency of proportion matrix $U$ with respect to $U_2$ given projection function $\pi_2$ and the fifth equality uses the construction $u$ […].

[…] be a solution to the 3-PARTITION problem instance $(A, B)$. We claim that […] $u_{1,i} = a_i/(Bq)$ is a solution to the corresponding PCI problem. To see why, recall that $\pi_1((i, \sigma(i))) = s$ and $\pi_2((i, \sigma(i))) = \sigma(i)$. Given these projection functions, we need to show that $U$ is consistent with $U_1(A, B)$ and $U_2(A, B)$. The consistency with respect to the $\Pi_1$-clones is trivial, as for each $i \in \Pi_1 = [3q]$ there exists exactly one pair $(i, j) \in \Pi$, i.e., the pair $(i, j)$ where $j = \sigma(i)$, with proportion $u_{1,(i,\sigma(i))} = u^{(1)}_{1,i}$. To see the consistency with respect to the $\Pi_2$-clones, consider for any $j \in \Pi_2 = [q]$, […] where the second to last equality uses Equation (1). Since $u$ […] which is the required condition for consistency.

Proof. ($\Rightarrow$) Let clones $\Pi$, clone tree $T$ and proportion matrix $U$ be a solution to the PCTI problem instance […]. By the premise, we have that $J(U, U_1, U_2) = 0$, which implies that $U$ is consistent with $U_1$ and $U_2$. […]$_{1,0} = 0$ and since the proportion matrix $U$ is consistent with $U_1$ and $U_2$, we have $\pi^{-1}$ […]. We show that $\sigma$ defined above satisfies Equation (1) in the main text. Recall that $\pi_1((i, j)) = i$ and $\pi_2((i, j)) = j$. For any […] where the second equality follows from construction with $u_{1,i} = a_i/(Bq)$, the third equality uses Equation (1), the fourth equality uses consistency of proportion matrix $U$ with respect to $U_2$ given projection function $\pi_2$ and the fifth equality uses the construction $u$ […].

We first show that $U$ is consistent with respect to $U_1$ and $U_2$, which is equivalent to the condition $J(U, U_1, U_2) = 0$. Recall the projection functions $\pi_1((i, j)) = i$ and $\pi_2((i, j)) = j$. We show consistency […]. We show consistency with respect to $U_2$ as follows. For $j = 0$, since $\pi_2^{-1}(0) = 0$, we have […] where the third equality uses the premise that $\sigma$ is a solution of the 3-PARTITION problem instance $(A, B)$.

Now we show that $T$ is a refinement of $T_1$ and $T_2$. We address the three conditions in Definition 3 in the main text as follows.

B Checking Consistency of a Proportion Matrix for a Given Instance of the PCI Problem
In this section, given an instance $(\Pi_1, U_1, \Pi_2, U_2)$ of the PCI problem and a set $\Pi \subseteq \Pi_1 \times \Pi_2$, we describe a polynomial-time procedure to check whether there exists a proportion matrix $U$ that is consistent with the given proportion matrix $U_1$ for clones $\Pi_1$ and $U_2$ for clones $\Pi_2$. We do this by reduction to the maximum flow problem.

Recall that the maximum flow problem is as follows. Given a directed graph $G = (V, E)$ with a source $s \in V$, a sink $t \in V$ and capacities on the edges described by $c : E \to \mathbb{R}_+$, find a flow $f : E \to \mathbb{R}$ that maximizes the total flow from the source to the sink, defined by

$$\sum_{e \in \delta^+(s)} f(e),$$

where $\delta^+(s)$ denotes the set of outgoing edges from $s$ in $G$. Briefly, a function $f : E \to \mathbb{R}$ is a flow provided that it respects the capacities, i.e., $0 \le f(e) \le c(e)$ for every edge $e \in E$, and that it is conserved at every vertex other than $s$ and $t$, i.e., the total flow entering such a vertex equals the total flow leaving it.

We solve one instance of the maximum flow problem for each sample $p \in [m]$. The graph has vertex set $\{s, t\} \cup \Pi_1 \cup \Pi_2$, with an edge $(s, i)$ for each $i \in \Pi_1$, an edge $(i, j)$ for each pair $(i, j) \in \Pi$ and an edge $(j, t)$ for each $j \in \Pi_2$. The capacity function $c_p$ for the $p$-th instance is defined such that $c_p((i, j)) = 1$ for all $(i, j) \in \Pi$, $c_p((s, i)) = u^{(1)}_{p,i}$ and $c_p((j, t)) = u^{(2)}_{p,j}$. Clearly, the maximum possible flow out of the source $s$ in the $p$-th instance is bounded from above by the total capacity of the edges in $\delta^+(s)$, given by $\sum_{i=1}^{n_1} u^{(1)}_{p,i} = 1$. In fact, there exists a proportion matrix $U$ that is consistent with the given proportion matrix $U_1$ for clones $\Pi_1$ and $U_2$ for clones $\Pi_2$ if and only if the maximum flow is equal to 1 for all $m$ instances of the maximum flow problem. In such a case, let $f_p$ be the flow for the $p$-th instance. The proportion matrix $U \in [0, 1]^{m \times |\Pi|}$ is given by $u_{p,(i,j)} = f_p((i, j))$. See Figure 2b in the main text for an example.
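The consistency check described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the PACTION implementation: the function names `max_flow` and `consistent_proportions`, the 0-based clone indices and the vertex labels are all assumptions made for the example, and the flow is computed with a plain Edmonds-Karp routine over exact fractions so that the test "maximum flow equals 1" is not affected by rounding.

```python
from collections import deque
from fractions import Fraction


def max_flow(capacity, s, t):
    """Edmonds-Karp maximum flow. `capacity` maps edges (u, v) to capacities.

    Assumes the graph has no pair of antiparallel edges, which holds for
    the layered graph built below.
    """
    residual, adj = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        # Walk back from the sink to recover the path, then augment.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        total += bottleneck
    flow = {e: c - residual[e] for e, c in capacity.items()}
    return total, flow


def consistent_proportions(U1, U2, pairs):
    """Return one dict {(i, j): proportion} per sample, or None if no
    proportion matrix on `pairs` is consistent with U1 and U2.

    U1 and U2 are lists of rows (one per sample) that each sum to 1;
    pass Fraction values to keep the arithmetic exact.
    """
    s, t = ("s",), ("t",)
    result = []
    for row1, row2 in zip(U1, U2):
        cap = {}
        for i, u in enumerate(row1):
            cap[(s, ("P1", i))] = Fraction(u)
        for j, u in enumerate(row2):
            cap[(("P2", j), t)] = Fraction(u)
        for i, j in pairs:
            cap[(("P1", i), ("P2", j))] = Fraction(1)
        value, flow = max_flow(cap, s, t)
        if value != 1:
            return None
        result.append({(i, j): flow[(("P1", i), ("P2", j))] for i, j in pairs})
    return result
```

For instance, with one sample, `U1 = [[Fraction(1, 2), Fraction(1, 2)]]`, `U2 = [[Fraction(1, 4), Fraction(3, 4)]]` and `pairs = [(0, 0), (0, 1), (1, 1)]`, the only consistent assignment gives the three pairs proportions 1/4, 1/4 and 1/2. Removing the pair `(0, 1)` caps the flow at 3/4, so the function returns `None`.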

Mixture Deconvolution Problems
In this section, we show that the error-free version of the PCTI problem is equivalent to a special case of the previously posed Cladistic Multi-state Perfect Phylogeny Mixture Deconvolution (CMPPD) problem [2]. To state the latter problem, we recall the definition of a multi-state perfect phylogeny from References [3,4]. Note that we keep the same notation as in the original paper [2], and thus there will be some overlap in notation with the main text of this paper.

Definition 1 ([3, 4]). A rooted tree $T$ is a multi-state perfect phylogeny on $n$ characters provided that (1) each vertex is labeled by a state vector $a \in \mathbb{N}^n$, which denotes the state for each character; (2) […]

Similarly to above, we write $v_i \preceq_S v_j$ if vertex $v_i \in V(S)$ occurs on the unique path from the root vertex $r(S)$ to vertex $v_j \in V(S)$. This enables us to define consistency as follows. […]

Note that multiple character-state pairs may label the same edge in a multi-state perfect phylogeny. To prevent this from happening, we impose additional constraints that ensure that each non-root vertex $v \neq r(T)$ of $T$ corresponds to a unique character-state pair $(c, i)$ that indicates the single change that happened on its incoming edge. Specifically, we call such constrained trees complete multi-state perfect phylogenies consistent with $\mathcal{S} = \{S_1, \ldots, S_n\}$.

Rather than observing such a tree, in practice, we observe an $m \times |V(S_c)|$ frequency matrix $F_c$ with entries $f$ […]. Given frequency matrices $F_1, \ldots, F_n$ on $m$ samples for $n$ cladistic characters with state trees $\mathcal{S} = \{S_1, \ldots, S_n\}$, we seek a complete multi-state perfect phylogeny $T$ consistent with $\mathcal{S}$ and an $m \times |V(T)|$ proportion matrix $U$ (following Definition 1 in the main text) that explains the given frequency matrices.
More formally, the problem is posed as follows [2]. […]

Here, we show that the PCTI problem subject to the additional constraint that $J(U, U_1, U_2) = 0$ is equivalent to the CMPPD problem with $n = 2$ characters. Let $(F_1, F_2, S_1, S_2)$ be a CMPPD problem instance. The corresponding PCTI instance $(T_1, U_1, T_2, U_2)$ has $T_1 = S_1$, $U_1 = F_1$, $T_2 = S_2$ and $U_2 = F_2$.

Conversely, for a PCTI instance $(T_1, U_1, T_2, U_2)$, the corresponding CMPPD instance $(F_1, F_2, S_1, S_2)$ has $S_1 = T_1$, $F_1 = U_1$, $S_2 = T_2$ and $F_2 = U_2$. To see why this works, observe that any clone tree $T$ that is a refinement of clone trees $T_1$ and $T_2$ (as in Definition 3 in the main text) is also a complete multi-state perfect phylogeny consistent with state trees $T_1$ and $T_2$, and vice versa. Thus, we can use SPRUCE [2] to enumerate all error-free solutions (if they exist) to any PCTI instance.

Previously, the CMPPD problem was shown to be NP-complete for $m = 2$ samples and state trees $\mathcal{S} = \{S_1, \ldots, S_n\}$ with two states each [2]. Here, we have shown that the decision version of the PCTI problem (i.e., deciding whether there exists a solution with $J(U, U_1, U_2) = 0$) is NP-complete (Theorem 5 in the main text). Based on the equivalence between the problems, an alternative hardness result follows for CMPPD.

F Computation of SNV Clone Proportions
Each edge of the SNV clone tree $T_1$ reported by Gundem et al. [5] represents a set of mutations, also known as a mutation cluster. As such, for an SNV clone tree $T_1$ with $n_1$ vertices, there are $n_1 - 1$ mutation clusters.

The authors provide the cancer cell fraction (CCF) of each mutation cluster in each sample of the ten patients. They used the pigeonhole principle (PPH) to construct the SNV clone tree manually. For a given patient, let $F = [f_{p,k}] \in [0, 1]^{m \times (n_1 - 1)}$ be the CCF matrix, where $f_{p,k}$ is the CCF of mutation cluster $k$ in sample $p$. The SNV clone tree $T_1$, excluding the root vertex, which represents the normal cell, is used to construct a perfect phylogeny matrix $B$ [6]. We use the perfect phylogeny matrix $B$ and the CCF matrix $F$ to obtain the proportions $U$ of the SNV clones, excluding the normal clone, in each sample of the ten patients by solving the following linear program […] where $|\cdot|_1$ is the entry-wise $L_1$ norm.

Finally, we correct the proportion matrix $U$ for the purity of the tumor samples (also known as tumor cellularity), which is the proportion of cancer cells in the tumor sample. We use the proportion of normal cells in each sample, inferred by HATCHet [7], to compute the purity of the tumor samples. Let $\gamma \in [0, 1]^{m \times 1}$ be a vector such that $\gamma_{p,1}$ is the purity of sample $p \in [m]$ inferred using HATCHet. The proportion matrix $U \in [0, 1]^{m \times n_1}$ of the SNV clones is given by […]

Table S2: Statistics of the metastatic prostate cancer data [5]. Number $m$ of samples, number $n_1$ of SNV clones and number $n_2$ of CNA clones for the 10 patients from Gundem et al. [5]. The CNA clones were identified using HATCHet [7].
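To make the relationship between CCFs and SNV clone proportions in Section F concrete, the following Python sketch handles the error-free special case only. It is not the linear program used in the paper. It assumes that the CCF of a mutation cluster equals the summed proportion, among cancer cells, of all clones in the subtree below the corresponding edge, so that a clone's proportion is its cluster's CCF minus the CCFs of its child clusters. The purity correction shown (scale tumor clones by the purity and assign the remainder to the normal clone) is likewise an assumption, and the name `clone_proportions` and the input layout are hypothetical.

```python
from fractions import Fraction


def clone_proportions(ccf_rows, parent, purity):
    """Error-free conversion of cluster CCFs to clone proportions.

    parent[v] is the parent of vertex v in the clone tree; vertex 0 is the
    normal root and parent[0] is None. ccf_rows[p][v] is the CCF in sample p
    of the mutation cluster on the edge entering vertex v (v >= 1).
    purity[p] is the fraction of cancer cells in sample p.
    """
    children = {v: [] for v in range(len(parent))}
    for v, par in enumerate(parent):
        if par is not None:
            children[par].append(v)
    result = []
    for ccf, gamma in zip(ccf_rows, purity):
        # Assumption: the normal clone takes the non-cancer fraction.
        row = [1 - gamma]
        for v in range(1, len(parent)):
            # Proportion among cancer cells: own CCF minus child CCFs.
            among_tumor = ccf[v] - sum(ccf[c] for c in children[v])
            row.append(gamma * among_tumor)
        result.append(row)
    return result


# One sample: founder clone 1 with two child clones 2 and 3.
rows = clone_proportions(
    [{1: Fraction(1), 2: Fraction(3, 10), 3: Fraction(1, 2)}],
    parent=[None, 0, 1, 1],
    purity=[Fraction(4, 5)],
)
```

Here `rows[0]` is `[1/5, 4/25, 6/25, 2/5]`, which sums to 1. With noisy CCFs the subtraction can produce negative values, which is one reason to fit the proportions by optimization, as the section does, rather than by direct subtraction.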