The following section describes Phyolin, which solves the lppf problem and computes the Bayes factor for distinguishing linear and branched evolution. We implemented Phyolin in C++ utilizing IBM ILOG CP OPTIMIZERFootnote 1. Phyolin is publicly available at https://github.com/elkebir-group/phyolin.
Model
To solve the lppf, we formulate a constraint optimization problem (COP) [20]. A COP is a constraint satisfaction problem (CSP) with an objective function that specifies which feasible solutions are preferred based on an optimization criteria. A CSP is defined by a tuple \(({\mathcal {X}},{\mathcal {D}}, {\mathcal {C}})\), where \({\mathcal {X}} = \{x_1, \ldots ,x_n\}\) is the set of decision variables, \({\mathcal {D}} = \{d_1, \ldots d_n\}\) is the set of domains for \({\mathcal {X}}\) and \({\mathcal {C}}\) is a set of constraints that must be satisfied. A solution \(a \in {\mathcal {A}}({\mathcal {X}},{\mathcal {D}},{\mathcal {C}})\) to a CSP is an assignment of values \(\{x_1 \mapsto v_1, \ldots , x_n \mapsto v_n\}\) such that \(v_i \in d_i\) for all \(i \in [n]\) and all constraints C are satisfied. To facilitate an objective function \(f: {\mathcal {A}}({\mathcal {X}},{\mathcal {D}},{\mathcal {C}}) \rightarrow {\mathbb {R}}\), an initial assignment \({\hat{a}} \in {\mathcal {A}}({\mathcal {X}},{\mathcal {D}},{\mathcal {C}})\) is found. Then, a preference constraint is added to C, such that \(f(a) \le f({\hat{a}})\) for a minimization problem or \(f(a) \ge f({\hat{a}})\) for a maximization problem. The search is continued and the preference constraint updated each time a feasible assignment \({\hat{a}}\) is found until no more feasible assignments exist. When this occurs, the assignment \({\hat{a}}\) is returned and \(f({\hat{a}})\) is the objective value. We note that in problems where multiple optimal solutions exist, it is possible to return all such valid solutions. But even though multiple solutions may exist, our focus is on assessing the plausibility of the null hypothesis of linear evolution. Therefore, it is sufficient to consider any optimal solution even if the respective assignments yield different linear perfect phylogenies.
First, we describe the set \({\mathcal {X}}\) of decision variables and the associated domains \({\mathcal {D}}\) used in the formulation. The set \({\mathcal {X}}\) contains the variables \(\mathbf{x}\) and \(\mathbf{c}\). Intuitively, the values taken by \(\mathbf{x}\) represent a binary matrix \(B^{\prime}\) used to represent a linear perfect phylogeny after flipping. More formally, given a set n of cells and a set m of mutations, let \(x_{ij}\) = 1 if cell i has mutation j in the linear perfect phylogeny \(B^{\prime}\) after flipping and 0 otherwise for each cell \(i \in [n]\), and mutation \(j \in [m]\). Then, \({\mathcal {D}}(x_{ij}) = \{0,1\}\), for all \(i \in [n], j \in [m]\). The variables \(\mathbf{c}\), are used to define a permutation of the columns in \(B^{\prime}\), such that after flipping is completed, \(B^{\prime}\) will adhere to the definition of a linear perfect phylogeny. Recall that in order to represent a linear perfect phylogeny, there must exist permutation \(\pi : [m] \rightarrow [m]\) such that \(O_{\pi (1)} \subseteq O_{\pi (2)} \subseteq \cdots \subseteq O_{\pi (m)}\). Let \(c_j = \pi (j)\) for all \(j \in [m]\). Then \({\mathcal {D}}(c_j) = [m]\) for all \(j \in [m]\).
Since, our goal is to find the linear perfect phylogeny that requires as few flips as possible, that is minimizing the number of false negatives we infer, we define an objective function that minimizes the number of flips from 0 to 1.
$$\min \sum _{i=1}^n \sum _{j=1}^m \mathbb {1}(b_{ij}=0, x_{ij} =1).$$
(1)
The set \({\mathcal {C}}\) ensures that the outputted binary matrix \(B^{\prime}\) meets the conditions of representing a linear perfect phylogeny and also that the number of false positives, or flips from 1 to 0, is bounded above by \(z^*\). The set \({\mathcal {C}}\) consists of the following three constraints.
$$\textsc {ALLDIFFERENT}(c),$$
(2)
$$\sum _{i=1}^{n}\sum _{j=1}^m\mathbb {1}(b_{ij}=1, x_{ij} =0) \le z^*$$
(3)
$$(c_k < c_j) \Rightarrow (x_{ij} \le x_{ik}) \quad \forall k,j \in [m], \forall i \in [n].$$
(4)
Equation (2) is a global constraint that ensures that every mutation is assigned a unique ordering in the permutation. Equation (3) ensures that the number of false positives or flips from 1 to 0 is bounded above by the number of allowable false positives \(z^*\), as determined by a given false positive rate \(\alpha\). Finally, Eq. (4) ensures the defining property of a linear perfect phylogeny is met by ensuring that if \(\pi (k)\) is less than \(\pi (j)\) then it must hold that \(O_k \subseteq O_j\) for all \(k,j \in [m]\).
Two alternative hypotheses for the pattern of evolution
We have two competing hypotheses to explain the observed data B: (i) linear evolution (\(H_{\text {linear}}\)) and (ii) branched evolution (\(H_{\text {branched}}\)). The solution to the lppf is the total number y of flips from 0 to 1 in order to represent linear evolution. In addition, we obtain a matrix \(B^{\prime}\) containing the resultant bits after flipping. Let N be the total number of ones in \(B^{\prime}\) and \({\hat{\beta }} = y/N\) represent the fraction of false negatives in the solution to the lppf. We then hypothesize y was drawn from one of two different beta-binomial distributions.
Under the null hypothesis \(H_{\text {linear}}\) of linear evolution, we have
$$p_\text {linear} \sim \text {beta}(\mu _{\text {linear}}, s_{\text {linear}}),$$
(5)
$$y \sim \text {binomial}(N, p_\text {linear}),$$
(6)
where the mean \(\mu _\text {linear}\) equals the expected false negative rate \(\beta ^*\) and \(s_{\text {linear}}\) is the beta-binomial precision parameter (controlling dispersion) of the used sequencing technology.
In the worst case, every 0 can be flipped to a 1. This results in a binary matrix with all values equal to 1, suggesting a linear perfect phylogeny with a single clone harboring all of the mutations. The number of flips required to achieve such a solution may be implausible under a model of linear evolution, which seeks to characterize the true distribution of false negatives for the sequencing. Thus, we consider an alternative hypothesis that characterizes the distribution of flips when a branched phylogeny is forced to be linear.
Under the alternative hypothesis \(H_\text {branched}\) of branched evolution, \(H_{\phi }\), we have
$$p_{\text {branched}} \sim \text {beta}( \mu _{\text {branched}}, s_{\text {branched}}),$$
(7)
$$y \sim \text {binomial}(N, p_\text {branched}),$$
(8)
where \(\mu _\text {branched}\) and \(s_\text {branched}\) parameterize a distribution that results from converting data generated under branched evolution into one that represents a linear perfect phylogeny. In other words, we expect the estimated false negative rate \({\hat{\beta }}_\phi = y/N\) to follow a different distribution from that of the false negative rate \(\beta ^*\) of the sequencing technology since the number of flips for branched evolution will be greater for data generated under linear evolution.
While \(\mu _\text {linear}\) and \(s_\text {linear}\) are typically known, parameters \(\mu _\text {branched}\) and \(s_\text {branched}\) are more challenging to estimate. One approach is to generate simulated data under branched evolution and the given sequencing profile and utilize \({\hat{\beta }}_\phi\) to fit this distribution. As this is time consuming in practice, we offer two heuristics. First, given some hard threshold \(\beta ^*\), which could be based on knowledge of the system estimated false negative rate or conservatively set at 0.4 [12], we reject \(H_{\text {linear}}\) that the phylogeny is linear whenever \(\hat{\beta } > \beta ^*\). Unlike the Bayes factor, a hard threshold does not provide any evidence in support of \(H_{\text {linear}}\), only allowing us to conclude the pattern of evolution is branched. Second, as a more robust alternative, one can establish a hypothetical threshold beta-binomial distribution with parameters \(\mu _\text {branched}\) and \(s_\text {branched}\) representing an implausible distribution for the false negative rate of the sequencing technology.
Bayes factor
To determine the hypothesis that best explains the observed data B, we utilize the Bayes factor. The Bayes factor
$$K= \frac{P(B \mid H_{\text {linear}})}{P(B \mid H_{\text {branched}})}$$
(9)
quantifies the likelihood ratio of the two computing hypotheses (\(H_{\text {linear}}\) and \(H_{\text {branched}}\)) for the evolutionary pattern of the observed binary matrix B. A Bayes factor \(K > 1\) is evidence in support of \(H_{\text {linear}}\) and \(K < 1\) is evidence in support of \(H_{\text {branched}}\). Furthermore, Bayes factor K expresses how strongly the observed data supports one of the two alternative hypotheses.
Given two shape parameters \(\theta _1\) and \(\theta _2\), the likelihood of the beta-binomial distribution modeling each of our two hypotheses is defined as
$$f(y \mid , N \theta _1, \theta _2) = \left( {\begin{array}{c}N\\ y\end{array}}\right) \frac{{\mathcal {B}}(\theta _1 + y, N - y + \theta _2)}{{\mathcal {B}}(\theta _1, \theta _2)},$$
(10)
where \({\mathcal {B}}\) is the beta function.
We parameterize the statistical distribution of our hypotheses in terms of \(\theta _1\) and \(\theta _2\) via the conversion
$$\theta _{i,1}= \mu _i s_i,$$
(11)
$$\theta _{i,2}= s_i - \theta _{i1},$$
(12)
for \(i \in \{\text {linear}, \text {branched}\}\). Finally, substituting Eq. (10) into Eq. (9) for both hypotheses \(H_{\text {linear}}\) and \(H_{\text {branched}}\) yields Bayes factor K equal to
$$\frac{{\mathcal {B}}(\theta _{\text {linear},1} + y, \theta _{\text {linear},2} + N -y){\mathcal {B}}(\theta _{\text {branched},1}, \theta _{\text {branched},2})}{{\mathcal {B}}(\theta _{\text {branched},1} + y, \theta _{\text {branched},2} + N -y){\mathcal {B}}(\theta _{\text {linear},1}, \theta _{\text {linear},2})}.$$
(13)