Auto-validating von Neumann rejection sampling from small phylogenetic tree spaces

Sainudiin, Raazesh; York, Thomas

doi:10.1186/1748-7188-4-1

Research
Open access
Published: 07 January 2009


Algorithms for Molecular Biology volume 4, Article number: 1 (2009)


Abstract

Background

In phylogenetic inference one is interested in obtaining samples from the posterior distribution over the tree space on the basis of some observed DNA sequence data. One of the simplest sampling methods is the rejection sampler due to von Neumann. Here we introduce an auto-validating version of the rejection sampler, via interval analysis, to rigorously draw samples from posterior distributions over small phylogenetic tree spaces.

Results

The posterior samples from the auto-validating sampler are used to rigorously (i) estimate posterior probabilities for different rooted topologies based on mitochondrial DNA from human, chimpanzee and gorilla, (ii) conduct a non-parametric test of rate variation between protein-coding and tRNA-coding sites from three primates and (iii) obtain a posterior estimate of the human-neanderthal divergence time.

Conclusion

This solves the open problem of rigorously drawing independent and identically distributed samples from the posterior distribution over rooted and unrooted small tree spaces (3 or 4 taxa) based on any multiply-aligned sequence data.

Background

Obtaining samples from a real-valued target density f^• (t) is a basic problem in statistical estimation. The target f^• (t): $T$ ↦ ℝ maps n-dimensional real points t ∈ $T$ ⊂ ℝⁿ to real numbers in ℝ. In Bayesian phylogenetic estimation, we want to draw independent and identically distributed samples from a target posterior density on the space of phylogenetic trees. The standard point-valued or punctual Monte Carlo methods via conventional floating-point arithmetic are typically non-rigorous as they do not account for all sources of numerical errors and are limited to evaluating the target at finitely many points. The standard approaches to sampling from the posterior density, especially over phylogenetic trees, rely on Markov chain Monte Carlo (MCMC) methods. Despite their asymptotic validity, it is nontrivial to guarantee that an MCMC algorithm has converged to stationarity [1], and thus MCMC convergence diagnostics on phylogenetic tree spaces are heuristic [2].

A more direct sampler that is capable of producing independent and identically distributed samples from the target density f^• (t):= f(t)/(N_f), by only evaluating the target shape f(t) without knowing the normalizing constant $N_{f} : = \int_{T} f (t) d t$ , is the von Neumann rejection sampler [3]. However, the limiting step in the rejection sampler is the construction of an envelope function $\hat{g}$ (t) that is not only greater than the target shape f(t):= N_ff^• (t) at every t ∈ $T$ , but also easy to normalize and draw samples from. Moreover, a practical and efficient envelope function has to be as close to the target shape as possible from above. When an envelope function is constructed using point-valued methods, except for simple classes of targets, one cannot guarantee that the envelope function dominates the target shape globally.

None of the available samplers can rigorously produce independent and identically distributed samples from the posterior distribution over phylogenetic tree spaces, even for 3 or 4 taxa. We describe a new approach for rigorously drawing samples from a target posterior distribution over small phylogenetic tree spaces using the theory of interval analysis. This method can circumvent the problems associated with (i) heuristic convergence diagnostics in MCMC samplers and (ii) pseudo-envelopes constructed via non-rigorous point-valued methods in rejection samplers.

Informally, our method partitions the domain into boxes and uses interval analysis to rigorously bound the target shape in each box; the envelope is then the simple function that takes, in each box, the upper bound obtained for that box. It is easy to draw samples from the density corresponding to this step-function envelope. More formally, the method employs an interval extension of the target posterior shape f(t): $T$ ↦ ℝ to produce rigorous enclosures of the range of f over each interval vector or box in an adaptive partition $T : = {t^{(1)}, t^{(2)}, ..., t^{(| T |)}}$ of the tree space $T$ = ∪_i t⁽ⁱ⁾. This partition is adaptively constructed by a priority queue. The interval extended target shape maps boxes in $T$ to intervals in ℝ. This image interval provides an upper bound for the global maximum and a lower bound for the global minimum of f over each element of the partition of $T$ . We use this information to construct an envelope as a simple function over the partition $T$ . Using the Alias method [4] we efficiently propose samples from this normalized step-function envelope for von Neumann rejection sampling.
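The informal description above can be sketched in one dimension. The following is only an illustration under stated assumptions, not the MRS library: the target shape `f`, the uniform (non-adaptive) partition, the helper names `f_upper`, `build_envelope` and `draw`, and the use of `random.choices` in place of the Alias method are all hypothetical choices made here. Ordinary floating-point arithmetic is used without directed rounding, so the bounds are not rigorous in the sense of the paper.

```python
import math
import random

def f(t):
    """Target shape (unnormalized); a Gaussian bump chosen only for illustration."""
    return math.exp(-0.5 * t * t)

def f_upper(lo, hi):
    """Upper bound of f over [lo, hi]: f is largest where |t| is smallest."""
    mig = 0.0 if lo <= 0.0 <= hi else min(abs(lo), abs(hi))
    return math.exp(-0.5 * mig * mig)

def build_envelope(lo, hi, n_boxes):
    """Uniform partition of [lo, hi] with one upper bound (step height) per box."""
    width = (hi - lo) / n_boxes
    boxes = [(lo + i * width, lo + (i + 1) * width) for i in range(n_boxes)]
    heights = [f_upper(a, b) for a, b in boxes]
    weights = [h * (b - a) for h, (a, b) in zip(heights, boxes)]  # area of each step
    return boxes, heights, weights

def draw(boxes, heights, weights, rng):
    """Propose from the step-function envelope until one proposal is accepted."""
    trials = 0
    while True:
        trials += 1
        i = rng.choices(range(len(boxes)), weights=weights)[0]  # pick a box by area
        a, b = boxes[i]
        v = rng.uniform(a, b)           # uniform proposal inside the chosen box
        u = heights[i] * rng.random()   # uniform height under the envelope
        if u <= f(v):
            return v, trials

rng = random.Random(2009)  # fixed seed so the sketch is repeatable
boxes, heights, weights = build_envelope(-6.0, 6.0, 64)
sample, trials = draw(boxes, heights, weights, rng)
```

Refining the partition lowers the total envelope area `sum(weights)` towards the integral of `f`, which raises the acceptance probability; an adaptive partition spends boxes where the bound is loosest.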

We call our method auto-validating because we employ interval methods to rigorously construct the envelope for a large class of target densities. The method was described in a more rudimentary form in [5]. Unlike many conventional samplers, each sample produced by our method is equivalent to a computer-assisted proof that it is drawn from the desired target, up to the pseudo-randomness of the underlying, deterministic, pseudo-random number generator. MRS 0.1.2, a C++ class library for statistical set processing, is available from http://www.math.canterbury.ac.nz/~r.sainudiin/codes/mrs under the terms of the GNU General Public License.

The rest of the paper is organized as follows. In the Methods Section, we introduce (i) the von Neumann rejection sampler (RS), (ii) the phylogenetic estimation problem, (iii) interval analysis and (iv) an interval extension of the rejection sampler called the Moore rejection sampler (MRS) in honor of Ramon E. Moore. Moore was one of the influential founders of interval analysis [6]. In the Results Section, we employ MRS to rigorously draw samples from the posterior density over small tree spaces. Using one of the earliest primate mitochondrial DNA data sets, we use the posterior samples to estimate the posterior probability of each rooted tree topology and conduct a non-parametric test of rate variation between protein-coding and tRNA-coding sites. Using one of the latest data sets, we obtain a rigorous posterior estimate of the human-neanderthal divergence time. We can also draw samples from the space of unrooted triplet and quartet trees. We conclude after a discussion of the method.

Methods

In the following sections, we first introduce the rejection sampler (RS) due to von Neumann [3]. Secondly, we describe the basic phylogenetic inference problem (e.g. [7–9]). Then, we introduce the basic principles of interval methods (e.g. [6, 10–13]). Finally, we construct interval extensions of RS to rigorously draw independent and identically distributed samples from small phylogenetic tree spaces. We leave the formal proofs to the Appendix for completeness.

Rejection sampler (RS)

Rejection sampling [3] is a Monte Carlo method to draw independent samples from a target random variable or random vector T with density f^• (t):= f(t)/N_f, where t ∈ $T$ ⊂ ℝⁿ, i.e. T ~ f^•. The challenge is to draw the samples without any knowledge of the normalizing constant $N_{f} : = \int_{T} f (t) d t$ . Typically the target f^• (t) is any density that is absolutely continuous with respect to the Lebesgue measure. The von Neumann rejection sampler (RS) can produce samples from T ~ f^• according to Algorithm 1 when provided with (i) a fundamental sampler that can produce independent samples from the Uniform [0, 1] random variable M with density given by the indicator function 1_[0,1](m): ℝ ↦ ℝ, (ii) a target shape f(t): $T$ ↦ ℝ, (iii) an envelope function $\hat{g} (t) : T \mapsto ℝ$ , such that,

\begin{matrix} \hat{g} (t) \geq f (t) & for all & t \in T, \end{matrix}

(1)

(iv) a normalizing constant $N_{\hat{g}} : = \int_{T} \hat{g} (t) d t$ , (v) a proposal density $g (t) : = {(N_{\hat{g}})}^{- 1} \hat{g} (t)$ over $T$ from which independent samples can be drawn and finally (vi) f(t) and $\hat{g}$ (t) must be computable for any t ∈ $T$ .

input : (i) f; (ii) samplers for V ~ g and M ~ 1_[0,1]; (iii) $\hat{g}$ ; (iv) integer MaxTrials;

output : (i) possibly one sample t from T ~ f^• and (ii) Trials

initialize: Trials ← 0; Success ← false; t ← ∅;

repeat //propose at most MaxTrials times until acceptance

v ← sample(g); //draw a sample v from RV V with density g

u ← $\hat{g}$ (v) sample(1_[0,1]); //draw a sample u from RV U with density $1_{[0, \hat{g} (v)]}$

if u ≤ f(v) then //accept the proposed v and flag Success

t ← v; Success ← true

end

Trials ← Trials +1; //track the number of proposal trials so far

until Trials ≥ MaxTrials or Success = true;

return t and Trials

Algorithm 1: von Neumann RS

We use the Mersenne Twister pseudo-random number generator [14] to imitate independent samples from M ~ 1_[0,1]. The random variable T, if generated by Algorithm 1, is distributed according to f^• (e.g. [15]). Let A( $\hat{g}$ ) be the probability that a point proposed according to g gets accepted as an independent sample from f^• through the envelope function $\hat{g}$ . Observe that the envelope-specific acceptance probability A( $\hat{g}$ ) is the ratio of the integrals

A (\hat{g}) = \frac{N_{f}}{N_{\hat{g}}} : = \frac{\int_{T} f (t) d t}{\int_{T} \hat{g} (t) d t},

and the number of proposals from g needed to obtain one sample from f^• is geometrically distributed with mean 1/A( $\hat{g}$ ) (e.g. [15]).
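Algorithm 1 and the acceptance probability can be illustrated with a minimal sketch. The target shape t(1 - t) on [0, 1], the constant envelope 1/4 and the function names below are assumptions made for this example only. For this choice N_f = 1/6 and the envelope integrates to 1/4, so the acceptance probability is 2/3 and the expected number of proposals per sample is 3/2.

```python
import random

def rejection_sample(f, g_hat, sample_g, max_trials, rng):
    """Algorithm 1: return (t, trials); t is None if every proposal was rejected."""
    trials = 0
    while trials < max_trials:
        v = sample_g(rng)             # proposal v from the density g = g_hat / N_g_hat
        u = g_hat(v) * rng.random()   # u is uniform on [0, g_hat(v)]
        trials += 1
        if u <= f(v):                 # accept v as a sample from f / N_f
            return v, trials
    return None, trials

def beta_shape(t):
    """Hypothetical target shape on [0, 1] with normalizing constant 1/6."""
    return t * (1.0 - t)

def flat_envelope(t):
    """Constant envelope: beta_shape never exceeds 1/4."""
    return 0.25

def uniform_proposal(rng):
    """The normalized flat envelope is the Uniform[0, 1] density."""
    return rng.random()

rng = random.Random(4)  # fixed seed so the sketch is repeatable
t, trials = rejection_sample(beta_shape, flat_envelope, uniform_proposal, 1000, rng)
```

Averaging `trials` over many calls gives an empirical estimate of 1/A for the chosen envelope, which is a convenient way to compare candidate envelopes.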

Phylogenetic estimation

In this section we briefly review phylogenetic estimation. A more detailed account can be found in [7–9]. Inferring the ancestral relationship among a set of extant species based on their DNA sequences is a basic problem in phylogenetic estimation. One can obtain the likelihood of a particular phylogenetic tree that relates the extant species of interest at its leaves by superimposing a continuous time Markov chain model of DNA substitution upon that tree. The length of an edge (branch length) connecting two nodes (species) in the tree represents the amount of evolutionary time (divergence) between the two species. The internal nodes represent ancestral species. During the likelihood computation, one needs to integrate over all possible states at the unobserved ancestral nodes.

Next we give a brief introduction to some phylogenetic nomenclature. A phylogenetic tree is said to be rooted if one of the internal nodes, say node r, is identified as the root of the tree, otherwise it is said to be unrooted. The rooted tree is conventionally depicted with the root node r at the top. The four topology-labeled, three-leaved, rooted trees, namely, ⁰t, ¹t, ²t and ³t, with leaf label set {1, 2, 3}, are depicted in Figure 1(i)–(iv). The unrooted, three-leaved tree with topology label 4 or the unrooted triplet ⁴t is shown in Figure 1(v). For each tree, the terminal branch lengths, i.e. the branch lengths leading to the leaf nodes, have to be strictly positive and the internal branch lengths have to be non-negative. Our rooted triplets (Figure 1(i)–(iv)) are said to satisfy the molecular clock, since the branch lengths of each ^kt, where k ∈ {0, 1, 2, 3}, satisfy the constraint that the distance from the root node r to each of the leaf nodes is equal to ^kt₀ + ^kt₁ with ^kt₁ > 0 and ^kt₀ ≥ 0.

Likelihood of a tree

Let d denote a homologous set of sequences of length v with character set $U = {a_{1}, a_{2}, ..., a_{| U |}}$ from n taxa. We think of d as an n × v matrix with entries from $U$ . We are interested in estimating the branch lengths and topologies of the tree underlying our observed d. Let b_k denote the number of branches and s_k denote the number of nodes of a tree with a specific topology or branching order labeled by k. Thus, for a given topology label k, n labeled leaves and b_k many branches, the labeled tree ^kt is the topology-labeled vector of branch lengths (^kt₁,..., ${{}^{k}t}_{b_{k}}$ ) contained in the topology-labeled tree space ${}^{k}T$ .

Any subset of the tree space with |K| many topologies, indexed by a set K of topology labels, is denoted by ${}^{K}T$ . To compute the likelihood, a probability model of character mutation is superimposed on the tree. Such a model prescribes $P_{a_{i}, a_{j}} (t)$ , the probability of mutation from a character a_i ∈ $U$ to another character a_j ∈ $U$ in time t. Using such a transition probability we may compute ℓ_q(^kt), the log-likelihood of the data d at site q ∈ {1,..., v} or the q-th column of d, via the post-order traversal over the labeled tree with branch lengths ^kt := (^kt₁, ^kt₂,..., ${{}^{k}t}_{b_{k}}$ ). This amounts to the sum-product Algorithm 2 [16] that associates with each node h ∈ {1,..., s_k} of ^kt subtending ℏ many descendants, a partial likelihood vector, $l_{h} : = (l_{h}^{(a_{1})}, l_{h}^{(a_{2})}, ..., l_{h}^{(a_{| U |})}) \in ℝ^{| U |}$ , and specifies the length of the branch leading to its ancestor as ^kt_h.

input : (i) a labeled tree with branch lengths ^kt := (^kt₁, ^kt₂,..., ${{}^{k}t}_{b_{k}}$ ), (ii) transition probability $P_{a_{i}, a_{j}} (t)$ for any a_i, a_j ∈ $U$ , (iii) stationary distribution π(a_i) over each character a_i ∈ $U$ , (iv) site pattern or data d_{•, q} at site q

output : $l_{d_{\cdot, q}}$ (^kt), the likelihood at site q with pattern d_{•, q}

initialize: For a leaf node h with observed character a_i = d_{h, q} at site q, set $l_{h}^{(a_{i})}$ = 1 and $l_{h}^{(a_{j})}$ = 0 for all j ≠ i. For any internal node h, set l_h := (1, 1,...,1).

recurse : compute l_h for each sub-terminal node h, then those of their ancestors recursively to finally compute l_r for the root node r to obtain the likelihood for site q,

l_{d_{\cdot, q}} ({}^{k}t) = l_{r} = \sum_{a_{i} \in U} (π (a_{i}) \cdot l_{r}^{(a_{i})}) .

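For an internal node h with descendants s₁, s₂,..., s_ℏ, the partial likelihood for each character a_i ∈ $U$ is the product, over the descendants, of the transition-weighted sums of their partial likelihoods:

l_{h}^{(a_{i})} = \prod_{j = 1}^{ℏ} ( \sum_{a_{m} \in U} P_{a_{i}, a_{m}} ({}^{k}t_{s_{j}}) \cdot l_{s_{j}}^{(a_{m})} ) .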

Algorithm 2: Likelihood by post-order traversal
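The post-order traversal of Algorithm 2 can be sketched as a short recursion. The dictionary-based tree representation, the helper names `leaf`, `node`, `partial_likelihood` and `site_likelihood`, and the example branch lengths are assumptions of this sketch; the two-state symmetric transition probability used for the demonstration is the CFN form that the paper introduces later.

```python
import math

def leaf(taxon, length):
    """Leaf node: index of its taxon in the site pattern and branch length to its ancestor."""
    return {"taxon": taxon, "length": length, "children": []}

def node(children, length=0.0):
    """Internal node; the branch length of the root is never used."""
    return {"taxon": None, "length": length, "children": children}

def partial_likelihood(h, site, states, trans_prob):
    """Post-order pass returning the partial likelihood vector l_h as a dict."""
    if not h["children"]:
        return {a: 1.0 if a == site[h["taxon"]] else 0.0 for a in states}
    l_h = {a: 1.0 for a in states}
    for s in h["children"]:
        l_s = partial_likelihood(s, site, states, trans_prob)
        for a in states:
            # sum over the unobserved state b at the descendant s
            l_h[a] *= sum(trans_prob(a, b, s["length"]) * l_s[b] for b in states)
    return l_h

def site_likelihood(root, site, states, trans_prob, pi):
    """Likelihood of one site pattern: stationary-weighted sum at the root."""
    l_r = partial_likelihood(root, site, states, trans_prob)
    return sum(pi[a] * l_r[a] for a in states)

def two_state_prob(a, b, t):
    """Two-state symmetric transition probability (CFN form)."""
    change = (1.0 - math.exp(-2.0 * t)) / 2.0
    return change if a != b else 1.0 - change

# Rooted triplet ((taxon 0, taxon 1), taxon 2) with hypothetical branch lengths.
tree = node([node([leaf(0, 0.1), leaf(1, 0.1)], 0.2), leaf(2, 0.3)])
pi = {"R": 0.5, "Y": 0.5}
lik = site_likelihood(tree, ("R", "R", "Y"), ("R", "Y"), two_state_prob, pi)
```

Because the recursion only touches each branch once per character pair, the cost per site is linear in the number of nodes for a fixed alphabet.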

Assuming independence across all v sites we obtain the likelihood function for the given data d, by multiplying the site-specific likelihoods

l_{d} ({}^{k}t) = \prod_{q = 1}^{v} l_{d_{\cdot, q}} ({}^{k}t) .

(2)

The maximum likelihood estimate is a point estimate (single best guess) of the unknown phylogenetic tree on the basis of the observed data d and it is

\underset{{}^{k}t \in {}^{K}T}{\arg \max} l_{d} ({}^{k}t) .

The simplest probability models for character mutation are continuous time Markov chains with finite state space $U$ . Next, we introduce the three such models employed in this study. We only derive the likelihood functions for the simplest model with just two characters as it is thought to well-represent the core problems in phylogenetic estimation (see e.g. [17]).

Posterior density of a tree

The posterior density f^• (^kt) conditional on data d at tree ^kt is the normalized product of the likelihood l_d(^kt) and the prior density p(^kt) over a given tree space ${}^{K}T$ :

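f^{\cdot} ({}^{k}t) = \frac{l_{d} ({}^{k}t) p ({}^{k}t)}{\int_{{}^{K}T} l_{d} ({}^{k}t) p ({}^{k}t) \partial ({}^{k}t)} .

(3)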

We assume a uniform prior density over a large box or a union of large boxes in a given tree space ${}^{K}T$ . Typically, the sides of the box giving the range of branch lengths are extremely long, say, [0, 10] or [10^{-10}, 10]. The branch lengths are measured in units of expected number of DNA substitutions per site and therefore the support of our uniform prior density over ${}^{K}T$ contains the biologically relevant branch lengths. If ${}^{K}T$ is a union of distinct topologies then we let our prior be an equally weighted finite mixture of uniform densities over large boxes in each topology. Naturally, other prior densities are possible especially in the presence of additional information. We choose flat priors for the convenient interpretation of the target posterior shape $f ({}^{k}t) = f^{\cdot} ({}^{k}t) \int_{{}^{K}T} l_{d} ({}^{k}t) p ({}^{k}t) \partial ({}^{k}t)$ to be the likelihood function in the absence of prior information beyond a compact support specification.

Likelihood of a triplet under Cavender-Farris-Neyman (CFN) model

We now describe the simplest model for the evolution of binary sequences under a symmetric transition matrix over all branches of a tree. This model has been used by authors in various fields including molecular biology, information theory, operations research and statistical physics; for references see [7, 18]. This model is referred to as the Cavender-Farris-Neyman (CFN) model in molecular biology, although in other fields it has been referred to as 'the on-off machine', 'symmetric binary channel' and the 'symmetric two-state Poisson model'. Although the relatively tractable CFN model itself is not popular in applied molecular evolution, the lessons learned under the CFN model often extend to more realistic models of DNA mutation (e.g. [17]). Thus, our first stop is the CFN model.

Model 1 (Cavender-Farris-Neyman (CFN) model) Under the CFN mutation model, only pyrimidines and purines, denoted respectively by Y:= {C, T} and R:= {A, G}, are distinguished as evolutionary states among the four nucleotides {A, G, C, T}, i.e. $U$ = {Y, R}. Time t is measured by the expected number of substitutions in this homogeneous continuous time Markov chain with rate matrix:

Q = (\begin{matrix} - 1 & 1 \\ 1 & - 1 \end{matrix}),

and transition probability matrix P(t) = e^{Qt}:

P (t) = (\begin{matrix} 1 - (1 - e^{- 2 t}) / 2 & (1 - e^{- 2 t}) / 2 \\ (1 - e^{- 2 t}) / 2 & 1 - (1 - e^{- 2 t}) / 2 \end{matrix}) .

Thus, the probability that Y mutates to R, or vice versa, in time t is a(t) := (1 - e^{-2t})/2. The stationary distribution is uniform on $U$ , i.e. π(R) = π(Y) = 1/2.
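The closed form for P(t) can be checked numerically against the matrix exponential of Qt. The sketch below uses a truncated Taylor series written in plain Python; the helper names `mat_mul`, `mat_exp` and `cfn_closed_form` and the number of series terms are assumptions of this example, and the series is adequate only for moderate values of t.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(q, t, terms=40):
    """Truncated Taylor series for exp(q * t): sum of (q t)^k / k! for k < terms."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        scaled = [[q[i][j] * t / k for j in range(n)] for i in range(n)]
        term = mat_mul(term, scaled)  # term now equals (q t)^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

Q_CFN = [[-1.0, 1.0], [1.0, -1.0]]

def cfn_closed_form(t):
    """P(t) under the CFN model, with a(t) = (1 - exp(-2 t)) / 2."""
    a = (1.0 - math.exp(-2.0 * t)) / 2.0
    return [[1.0 - a, a], [a, 1.0 - a]]

numeric = mat_exp(Q_CFN, 0.5)
closed = cfn_closed_form(0.5)
max_gap = max(abs(numeric[i][j] - closed[i][j]) for i in range(2) for j in range(2))
```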

When there are only three taxa, there are five tree topologies of interest as depicted in Figure 1. There are 2³ = 8 possible site patterns, i.e. for each site q ∈ {1, 2,..., v}, the q-th column of the data d, denoted by d_{•, q}, is one of eight possibilities, numbered 0, 1,...,7 for convenience:

d_{\cdot, q} \in \left\{ \begin{matrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ R & Y & R & Y & R & Y & R & Y \\ R & Y & R & Y & Y & R & Y & R \\ R & Y & Y & R & Y & R & R & Y \end{matrix} \right\} .

(4)

Given a multiple sequence alignment data d from 3 taxa at v homologous sites, i.e. d ∈ {Y, R}^{3 × v}, the likelihood function over the tree space ${}^{k}T$ is simplified from (2) as follows:

l_{d} ({}^{k}t) = \prod_{q = 1}^{v} l_{d_{\cdot, q}} ({}^{k}t) = \prod_{i = 0}^{7} {(l_{i} ({}^{k}t))}^{c_{i}},

(5)

where l_i(^kt) is the likelihood of the i-th site pattern as in (4) and c_i is the count of sites with pattern i. In fact, l_i(^kt) = P(i|^kt) is the probability of observing site pattern i given topology label k and branch lengths t and similarly l_d(^kt) = P(d|^kt).

Consider the unrooted tree-space with a single topology labeled 4 and three non-negative terminal branch lengths ⁴t = (⁴t₁, ⁴t₂, ⁴t₃) ∈ $ℝ_{+}^{3}$ as shown in Figure 1(v). An application of Algorithm 2 to compute the likelihoods l₀(⁴t), l₁(⁴t),..., l₇(⁴t), as derived in (19)-(25), reveals symmetry. There are in fact four minimally sufficient site pattern classes, namely, xxx, xxy, yxx and xyx, where x and y simply denote distinct characters in the alphabet set $U$ = {R, Y}. The corresponding likelihoods are:

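\begin{array}{lcl} l_{xxx} ({}^{4}t) & = & \frac{1}{8} (1 + e^{- 2 ({}^{4}t_{1} + {}^{4}t_{2})} + e^{- 2 ({}^{4}t_{2} + {}^{4}t_{3})} + e^{- 2 ({}^{4}t_{1} + {}^{4}t_{3})}) \\ l_{xxy} ({}^{4}t) & = & \frac{1}{8} (1 + e^{- 2 ({}^{4}t_{1} + {}^{4}t_{2})} - e^{- 2 ({}^{4}t_{2} + {}^{4}t_{3})} - e^{- 2 ({}^{4}t_{1} + {}^{4}t_{3})}) \\ l_{yxx} ({}^{4}t) & = & \frac{1}{8} (1 - e^{- 2 ({}^{4}t_{1} + {}^{4}t_{2})} + e^{- 2 ({}^{4}t_{2} + {}^{4}t_{3})} - e^{- 2 ({}^{4}t_{1} + {}^{4}t_{3})}) \\ l_{xyx} ({}^{4}t) & = & \frac{1}{8} (1 - e^{- 2 ({}^{4}t_{1} + {}^{4}t_{2})} - e^{- 2 ({}^{4}t_{2} + {}^{4}t_{3})} + e^{- 2 ({}^{4}t_{1} + {}^{4}t_{3})}) \end{array}

(6)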

Therefore, the multiple sequence alignment data d from three taxa evolving under Model 1 can be summarized by the minimal sufficient site pattern counts

(c_xxx, c_xxy, c_yxx, c_xyx):= (c₀ + c₁, c₂ + c₃, c₄ + c₅, c₆ + c₇),

which simplifies (5) to:

l_{d} ({}^{k}t) = \prod_{q = 1}^{v} l_{d_{\cdot, q}} ({}^{k}t) = \prod_{i = 0}^{7} {(l_{i} ({}^{k}t))}^{c_{i}} = \prod_{s = xxx, xxy, yxx, xyx} {(l_{s} ({}^{k}t))}^{c_{s}} .

(7)

Note that the probability of our sample space with eight patterns given in (4) is $\sum_{i = 0}^{7} l_{i} ({}^{4}t) = 1$ . Our likelihoods are half of those in [17] that are prescribed over a sample space of only four classes of patterns: {0, 1}, {2, 3}, {4, 5} and {6, 7}. This is because we distinguish between the sample space of data from that of the minimal sufficient statistics. We compute the rooted topology-specific likelihood functions, i.e. l(^kt) for k ∈ {0, 1, 2, 3} (Figure 1) by substituting the appropriate constraints on branch lengths in ${}^{4}T = ℝ_{+}^{3}$ , the space of unrooted triplets.
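The reduction from eight site patterns to four classes, and the resulting log-likelihood from pattern counts as in (7), can be sketched directly for the unrooted triplet by summing over the state of the single internal node. The function names, the branch lengths and the pattern counts below are hypothetical values chosen for illustration, not data from the paper.

```python
import math
from itertools import product

STATES = ("R", "Y")

def cfn_prob(a, b, t):
    """CFN transition probability from state a to state b in time t."""
    change = (1.0 - math.exp(-2.0 * t)) / 2.0
    return change if a != b else 1.0 - change

def pattern_likelihood(pattern, t):
    """Likelihood of one site pattern on the unrooted triplet with branch lengths t."""
    return sum(
        0.5 * math.prod(cfn_prob(z, x, ti) for x, ti in zip(pattern, t))
        for z in STATES  # unobserved state at the internal node, weight pi(z) = 1/2
    )

def pattern_class(pattern):
    """Map a site pattern such as 'RYY' to its class xxx, xxy, yxx or xyx."""
    p1, p2, p3 = pattern
    if p1 == p2 == p3:
        return "xxx"
    if p1 == p2:
        return "xxy"
    if p2 == p3:
        return "yxx"
    return "xyx"

def log_likelihood(counts, t):
    """Log of (7): sum over classes of count times log class likelihood."""
    representative = {"xxx": "RRR", "xxy": "RRY", "yxx": "YRR", "xyx": "RYR"}
    return sum(c * math.log(pattern_likelihood(representative[s], t)) for s, c in counts.items())

t = (0.1, 0.2, 0.3)                                      # hypothetical branch lengths
counts = {"xxx": 10, "xxy": 2, "yxx": 1, "xyx": 1}       # hypothetical pattern counts
total = sum(pattern_likelihood(p, t) for p in product(STATES, repeat=3))
loglik = log_likelihood(counts, t)
```

Working with the log-likelihood avoids underflow when the number of sites v is large, since the product in (7) shrinks geometrically in v.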

Likelihood of a triplet under Jukes-Cantor (JC) model

The r-state symmetric model introduced in [19] is specified by the r × r rate matrix with equal off-diagonal entries over an alphabet set $U$ of size r. The stationary distribution under this model is the uniform distribution on $U$ . Thus, the CFN model is the 2-state symmetric model over $U$ = {Y, R}. The Jukes-Cantor (JC) model [20] is the 4-state symmetric model over $U$ = {A, C, G, T}. This is perhaps the simplest model on four characters.

Model 2 (Jukes-Cantor (JC) model) All four nucleotides form the state space for this mutation model, i.e. $U$ = {A, C, G, T}. Once again, evolutionary time t is measured by the expected number of substitutions in the homogeneous continuous time Markov chain with rate matrix:

Q = (\begin{matrix} - 1 & 1 / 3 & 1 / 3 & 1 / 3 \\ 1 / 3 & - 1 & 1 / 3 & 1 / 3 \\ 1 / 3 & 1 / 3 & - 1 & 1 / 3 \\ 1 / 3 & 1 / 3 & 1 / 3 & - 1 \end{matrix}) .

The transition probability matrix P(t) = e^{Qt} is also symmetric. The probability that any given nucleotide mutates to any other nucleotide in time t is P_{x, y}(t) and that it is found in the same state is P_{x, x}(t). These transition probabilities are:

\begin{matrix} a (t) : = P_{x, y} (t) = \frac{1}{4} - \frac{1}{4} \exp (- \frac{4}{3} t), & b (t) : = P_{x, x} (t) = \frac{1}{4} + \frac{3}{4} \exp (- \frac{4}{3} t) . \end{matrix}

The stationary distribution is uniform, i.e. π(A) = π(C) = π(G) = π(T) = 1/4.

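Consider the three non-negative terminal branch lengths ⁴t = (⁴t₁, ⁴t₂, ⁴t₃) ∈ $ℝ_{+}^{3}$ of an unrooted tree ⁴t of Figure 1(v). An application of Algorithm 2 to compute the likelihoods of the 64 possible site patterns (see e.g. [21–24]) reveals five minimally sufficient site pattern classes. Let x, y and z simply denote distinct characters from the alphabet set $U$ = {A, C, G, T} at taxon 1, 2 and 3, respectively. The minimally sufficient site pattern classes xxx, xyz, xxy, yxx and xyx encode 4, 24, 12, 12 and 12 nucleotide site patterns, respectively. By a computation similar to that in (19)-(25), and writing a_i := a(⁴t_i) and b_i := b(⁴t_i) for i ∈ {1, 2, 3}, the likelihoods are:

\begin{array}{lcl} l_{xxx} ({}^{4}t) & = & \frac{1}{4} (b_{1} b_{2} b_{3} + 3 a_{1} a_{2} a_{3}) \\ l_{xyz} ({}^{4}t) & = & \frac{1}{4} (b_{1} a_{2} a_{3} + a_{1} b_{2} a_{3} + a_{1} a_{2} b_{3} + a_{1} a_{2} a_{3}) \\ l_{xxy} ({}^{4}t) & = & \frac{1}{4} (b_{1} b_{2} a_{3} + a_{1} a_{2} b_{3} + 2 a_{1} a_{2} a_{3}) \\ l_{yxx} ({}^{4}t) & = & \frac{1}{4} (a_{1} b_{2} b_{3} + b_{1} a_{2} a_{3} + 2 a_{1} a_{2} a_{3}) \\ l_{xyx} ({}^{4}t) & = & \frac{1}{4} (b_{1} a_{2} b_{3} + a_{1} b_{2} a_{3} + 2 a_{1} a_{2} a_{3}) \end{array}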

Notice that the probability of observing one of the 64 possible site patterns is 1 for any ⁴t ∈ (0, ∞)³ :

4l_xxx(⁴t) + 24l_xyz(⁴t) + 12l_xxy(⁴t) + 12l_yxx(⁴t) + 12l_xyx(⁴t) = 1.

Let c_ijk denote the number of sites with the site pattern ijk ∈ {xxx, xyz, xxy, yxx, xyx}. Then, under the assumption of independence across sites, we obtain the likelihood of a given data d by multiplying the site-specific likelihoods:

l_{d} ({}^{4}t) = {(l_{xyz} ({}^{4}t))}^{c_{xyz}} {(l_{xxy} ({}^{4}t))}^{c_{xxy}} {(l_{xyx} ({}^{4}t))}^{c_{xyx}} {(l_{yxx} ({}^{4}t))}^{c_{yxx}} {(l_{xxx} ({}^{4}t))}^{c_{xxx}} .

Once again, the likelihood of a rooted tree or the star tree can be obtained from that of the unrooted tree by substituting the appropriate constraints on branch lengths in the above equations or by directly applying Algorithm 2 with the appropriate input tree with its topology and branch lengths.
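The class sizes 4, 24, 12, 12 and 12 and the fact that the 64 pattern probabilities sum to one can be checked by brute-force enumeration. This is a sketch under assumptions: the function names and the branch lengths are hypothetical, and the likelihood of each pattern is obtained by summing over the state at the single internal node of the unrooted triplet rather than from any closed form.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor transition probability: b(t) if x == y, otherwise a(t)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def pattern_likelihood(pattern, t):
    """Likelihood of one nucleotide site pattern on the unrooted triplet."""
    return sum(
        0.25 * math.prod(jc_prob(z, x, ti) for x, ti in zip(pattern, t))
        for z in NUCLEOTIDES  # unobserved state at the internal node, weight 1/4
    )

def pattern_class(pattern):
    """Map a pattern such as ('A', 'A', 'G') to xxx, xxy, yxx, xyx or xyz."""
    p1, p2, p3 = pattern
    if p1 == p2 == p3:
        return "xxx"
    if p1 == p2:
        return "xxy"
    if p2 == p3:
        return "yxx"
    if p1 == p3:
        return "xyx"
    return "xyz"

def summarize(t):
    """Class sizes, per-class likelihood values and total probability over 64 patterns."""
    sizes, values, total = {}, {}, 0.0
    for pattern in product(NUCLEOTIDES, repeat=3):
        s = pattern_class(pattern)
        lik = pattern_likelihood(pattern, t)
        sizes[s] = sizes.get(s, 0) + 1
        values.setdefault(s, []).append(lik)
        total += lik
    return sizes, values, total

sizes, values, total = summarize((0.1, 0.25, 0.4))  # hypothetical branch lengths
```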

Model 3 (Hasegawa-Kishino-Yano (HKY) model) The Hasegawa-Kishino-Yano or HKY model [25] has all four nucleotides in the state space, i.e. $U$ = {A, C, G, T}. There are five parameters in this more flexible model. Transitions are changes within the purine {A, G} or pyrimidine {C, T} state subsets, while transversions are changes from purine to pyrimidine or from pyrimidine to purine. In this model, we have a mutational parameter κ that allows for transition:transversion bias and four additional parameters π_A, π_C, π_G and π_T that explicitly control the stationary distribution. The entries of the rate matrix are:

q_{x, y} = \begin{cases} κ π_{y} & \text{for transitions} \\ π_{y} & \text{for transversions} \\ - \sum_{z \in U, z \neq x} q_{x, z} & \text{if } x = y . \end{cases}

The transition probabilities are known analytically for this model (see e.g. [8, p. 203]). We can use these expressions when evaluating the likelihood of a rooted or unrooted tree along with the five mutational parameters via Algorithm 2. For simplicity we set the stationary distribution parameters to the empirical nucleotide frequencies and κ to be 2.0 in this study.
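The HKY rate matrix entries can be assembled and sanity-checked in a few lines. In this sketch the function names and the nucleotide frequencies are hypothetical (the paper uses empirical frequencies that are not reproduced here), κ = 2.0 follows the text, and the matrix is left unscaled exactly as in the definition above.

```python
PURINES = {"A", "G"}
NUCLEOTIDES = "ACGT"

def is_transition(x, y):
    """True when x and y are distinct and both purines or both pyrimidines."""
    return x != y and (x in PURINES) == (y in PURINES)

def hky_rate_matrix(kappa, pi):
    """Rate matrix q[x][y] following the case definition of the HKY model."""
    q = {x: {} for x in NUCLEOTIDES}
    for x in NUCLEOTIDES:
        for y in NUCLEOTIDES:
            if x != y:
                q[x][y] = kappa * pi[y] if is_transition(x, y) else pi[y]
        # diagonal entry makes each row sum to zero
        q[x][x] = -sum(q[x][z] for z in NUCLEOTIDES if z != x)
    return q

pi = {"A": 0.3, "C": 0.2, "G": 0.25, "T": 0.25}  # hypothetical frequencies
q = hky_rate_matrix(2.0, pi)

# pi is stationary when pi Q = 0, i.e. every column sum weighted by pi vanishes
stationarity_gap = max(abs(sum(pi[x] * q[x][y] for x in NUCLEOTIDES)) for y in NUCLEOTIDES)
```

The check that π Q vanishes confirms that the four frequency parameters are indeed the stationary distribution of the chain defined by these rates.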

Interval analysis

Let $I ℝ$ denote the set of closed and bounded real intervals. Let any element of $I ℝ$ be denoted by $\mathbf{x} := [\underline{x}, \bar{x}]$ , where $\underline{x}$ ≤ $\bar{x}$ and $\underline{x}$ , $\bar{x}$ ∈ ℝ. Next we define arithmetic over $I ℝ$ .

Definition 1 (Interval Operation) If the binary operator ⋆ is one of +, -, ×, /, then we define an arithmetic on operands in $I ℝ$ by

\mathbf{x} ⋆ \mathbf{y} := \{x ⋆ y : x \in \mathbf{x}, y \in \mathbf{y}\},

with the exception that $\mathbf{x} / \mathbf{y}$ is undefined if 0 ∈ $\mathbf{y}$ .

Theorem 1 (Interval arithmetic) Arithmetic on the pair $\mathbf{x}$ , $\mathbf{y}$ ∈ $I ℝ$ is given by:

\begin{array}{lcl} \mathbf{x} + \mathbf{y} & = & [\underline{x} + \underline{y}, \bar{x} + \bar{y}] \\ \mathbf{x} - \mathbf{y} & = & [\underline{x} - \bar{y}, \bar{x} - \underline{y}] \\ \mathbf{x} \times \mathbf{y} & = & [\min \{\underline{x} \underline{y}, \underline{x} \bar{y}, \bar{x} \underline{y}, \bar{x} \bar{y}\}, \max \{\underline{x} \underline{y}, \underline{x} \bar{y}, \bar{x} \underline{y}, \bar{x} \bar{y}\}] \\ \mathbf{x} / \mathbf{y} & = & \mathbf{x} \times [1 / \bar{y}, 1 / \underline{y}], \text{ provided } 0 \notin \mathbf{y} . \end{array}

When computing with finite precision, say in floating-point arithmetic, directed rounding must be taken into account (see e.g., [6, 10]) to contain the solution. Interval multiplication is branched into nine cases, on the basis of the signs of the boundaries of the operands, such that only one case entails more than two real multiplications. Therefore, a rigorous computer implementation of an interval operation mostly requires two directed rounding floating-point operations. Interval addition and multiplication are both commutative and associative, but multiplication is not distributive over addition. For example,

\begin{matrix} [- 1, 2] \times ([1, 2] + [- 2, 1]) = [- 1, 2] \times [- 1, 3] = [- 3, 6], \\ \text{but} \quad [- 1, 2] \times [1, 2] + [- 1, 2] \times [- 2, 1] = [- 2, 4] + [- 4, 2] = [- 6, 6] . \end{matrix}

Interval arithmetic satisfies a weaker rule than distributivity called sub-distributivity:

$\mathbf{x} (\mathbf{y} + \mathbf{z}) \subseteq \mathbf{x} \mathbf{y} + \mathbf{x} \mathbf{z}$ .
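The four interval operations and the sub-distributivity example can be reproduced with a small class. This is a sketch only: the `Interval` class and its method names are assumptions of this example, and it uses ordinary round-to-nearest floating-point arithmetic, so unlike a rigorous implementation it performs no directed rounding. The example endpoints are small integers, for which the arithmetic is exact.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # min and max over the four endpoint products
        p = [self.lo * other.lo, self.lo * other.hi, self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def __truediv__(self, other):
        if other.lo <= 0.0 <= other.hi:
            raise ZeroDivisionError("interval division is undefined when 0 is in the divisor")
        return self * Interval(1.0 / other.hi, 1.0 / other.lo)

    def contains(self, other):
        """True when other is a subset of self."""
        return self.lo <= other.lo and other.hi <= self.hi

x, y, z = Interval(-1.0, 2.0), Interval(1.0, 2.0), Interval(-2.0, 1.0)
factored = x * (y + z)       # [-3, 6]
expanded = x * y + x * z     # [-6, 6], wider than the factored form
```

The factored expression gives the tighter enclosure, which is why the way a target shape is written down affects the quality of the bounds obtained from its interval extension.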

An extremely useful property of interval arithmetic that is a direct consequence of Definition 1 is summarized by the following theorem.

Theorem 2 (Fundamental property of interval arithmetic) If $\mathbf{x} \subseteq \mathbf{x}'$ and $\mathbf{y} \subseteq \mathbf{y}'$ and ⋆ ∈ {+, -, ×, /}, then

\mathbf{x} ⋆ \mathbf{y} \subseteq \mathbf{x}' ⋆ \mathbf{y}',

where we require that 0 ∉ $\mathbf{y}'$ when ⋆ = /.

Note that an immediate implication of Theorem 2 is that when $\mathbf{x}$ = [x, x] and $\mathbf{y}$ = [y, y] are thin intervals, i.e. $\underline{x}$ = $\bar{x}$ = x and $\underline{y}$ = $\bar{y}$ = y are real numbers, then $\mathbf{x}' ⋆ \mathbf{y}'$ will contain the result of the real arithmetic operation x ⋆ y.

Let $\underline{x}$ , $\bar{x}$ ∈ ℝⁿ be real vectors such that $\underline{x}_{i} \leq \bar{x}_{i}$ for all i = 1, 2,..., n; then $\mathbf{x} := [\underline{x}, \bar{x}]$ is an interval vector or a box. The set of all such boxes is $I ℝ^{n}$ . The i-th component of the box $\mathbf{x} = (\mathbf{x}_{1}, ..., \mathbf{x}_{n})$ is the interval $\mathbf{x}_{i} = [\underline{x}_{i}, \bar{x}_{i}]$ and the interval extension of a set $D \subseteq ℝ^{n}$ is $I D := \{\mathbf{x} \in I ℝ^{n} : \underline{x}, \bar{x} \in D\}$ . We write inf $\mathbf{x}$ := $\underline{x}$ for the lower bound and sup $\mathbf{x}$ := $\bar{x}$ for the upper bound. Let the maximum norm of a vector x ∈ ℝⁿ be ∥x∥_∞ := max_k |x_k|. Let the vector-valued hyper-metric between boxes $\mathbf{x}$ and $\mathbf{y}$ be

dist (\mathbf{x}, \mathbf{y}) = \sup \{| \underline{x} - \underline{y} |, | \bar{x} - \bar{y} |\},

and the Hausdorff distance between the boxes $\mathbf{x}$ and $\mathbf{y}$ in the metric given by the maximum norm is then

dist_∞ ( $\mathbf{x}$ , $\mathbf{y}$ ) = ∥dist( $\mathbf{x}$ , $\mathbf{y}$ )∥_∞.

We can make $I ℝ^{n}$ a metric space by equipping it with the Hausdorff distance.

Our main motivation for the extension to intervals is to enclose the range:

range(f; S):= {f(x): x ∈ S},

of a real-valued function f : ℝⁿ↦ ℝ over a set S ⊆ ℝⁿ. Except for trivial cases, few tools are available to obtain the range.

Definition 2 (Directed acyclic graph (DAG) expression of a function) One can think of the process by which a function f : ℝ^m ↦ ℝ is computed as the result of a sequence of recursive operations with the sub-expressions f_i of its expression f where, i = 1,..., n < ∞. This involves the evaluation of the sub-expression f_i at node i with operands $s_{i_{1}}$ , $s_{i_{2}}$ from the sub-terminal nodes of i given by the directed acyclic graph (DAG) for f

s_{i} = ⊙ f_{i} : = \begin{cases} f_{i} (s_{i_{1}}, s_{i_{2}}) & \text{if node } i \text{ has 2 sub-terminal nodes } s_{i_{1}}, s_{i_{2}} \\ f_{i} (s_{i_{1}}) & \text{if node } i \text{ has 1 sub-terminal node } s_{i_{1}} \\ I (s_{i}) & \text{if node } i \text{ is a leaf or terminal node, } I (x) = x . \end{cases}

(8)

The leaf or terminal node of the DAG is a constant or a variable and thus the f_i for a leaf i is set equal to the respective constant or variable. The recursion starts at the leaves and terminates at the root of the DAG. The DAG for an elementary f is simply its expression f with n sub-expressions f₁, f₂,..., f_n:

\begin{matrix} \{⊙ f_{i}\}_{i = 1}^{n} & ↣ & ⊙ f_{n} = f (x), \end{matrix}

(9)

where each ⊙f_i is computed according to (8).

We look at some DAGs for the zero function to illustrate these ideas concretely.

Example 1 Consider the constant zero function f(x) = 0 expressed as (i) f(x) = 0, (ii) f'(x) = x × 0 and (iii) f"(x) = x - x. The corresponding DAG expressions are shown in Figure 2.

Definition 3 (The natural interval extension) Consider a real-valued function f(x): ℝⁿ ↦ ℝ^m given by a formula or a DAG expression f(x). If real constants, variables, and operations in f(x) are replaced by their interval counterparts, then one obtains

\mathbf{f} (\mathbf{x}) : I ℝ^{n} \mapsto I ℝ^{m} .

$\mathbf{f} (\mathbf{x})$ is known as the natural interval extension of the expression f(x) for the function f. This extension is well-defined if we do not run into division by zero.

Although the three distinct expressions f(x), f'(x) and f"(x) of the real function f: ℝ ↦ ℝ of Example 1 are equivalent upon evaluation in the reals, their respective interval extensions $\mathbf{f} (\mathbf{x}) = [0, 0]$ , $\mathbf{f}' (\mathbf{x}) = \mathbf{x} \times [0, 0]$ and $\mathbf{f}'' (\mathbf{x}) = \mathbf{x} - \mathbf{x}$ are not. For instance, if $\mathbf{x}$ = [1, 2],

\begin{array}{l} f ([1, 2]) & = & [0, 0], \\ f^{'} ([1, 2]) & = & [1, 2] \times [0, 0] = [\min {1 \times 0, 1 \times 0, 2 \times 0, 2 \times 0}, \max {1 \times 0, 1 \times 0, 2 \times 0, 2 \times 0}] = [0, 0] \\ f^{″} ([1, 2]) & = & [1, 2] - [1, 2] = [1 - 2, 2 - 1] = [- 1, 1], \end{array}

and in general for any $\mathbf{x} := [\underline{x}, \bar{x}]$ ∈ $I ℝ$ ,

\begin{array}{l} f ([\underline{x}, \bar{x}]) & = & [0, 0], \\ f^{'} ([\underline{x}, \bar{x}]) & = & [\underline{x}, \bar{x}] \times [0, 0] = [\min {\underline{x} \times 0, \underline{x} \times 0, \bar{x} \times 0, \bar{x} \times 0}, \max {\underline{x} \times 0, \underline{x} \times 0, \bar{x} \times 0, \bar{x} \times 0}] = [0, 0] \\ f^{″} ([\underline{x}, \bar{x}]) & = & \begin{matrix} [\underline{x}, \bar{x}] - [\underline{x}, \bar{x}] = [\underline{x} - \bar{x}, \bar{x} - \underline{x}] \neq [0, 0], & unless \underline{x} = \bar{x} . \end{matrix} \end{array}

Thus, $\mathbf{f} (\mathbf{x}) = \mathbf{f}' (\mathbf{x}) \neq \mathbf{f}'' (\mathbf{x})$ for any non-thin $\mathbf{x}$ ∈ $I ℝ$ , although f(x) = f'(x) = f"(x) for any x ∈ ℝ.
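The dependence of the natural interval extension on the chosen expression is easy to reproduce. In this sketch intervals are plain (lower, upper) tuples, the helper names are hypothetical, and no directed rounding is performed; the three functions mirror the three expressions of the zero function in Example 1.

```python
def i_sub(x, y):
    """Interval subtraction on (lower, upper) tuples."""
    return (x[0] - y[1], x[1] - y[0])

def i_mul(x, y):
    """Interval multiplication via the four endpoint products."""
    p = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(p), max(p))

ZERO = (0.0, 0.0)

def natural_f(x):
    """Extension of the expression f(x) = 0."""
    return ZERO

def natural_f_prime(x):
    """Extension of the expression f'(x) = x * 0."""
    return i_mul(x, ZERO)

def natural_f_double_prime(x):
    """Extension of the expression f''(x) = x - x; both occurrences of x vary independently."""
    return i_sub(x, x)

box = (1.0, 2.0)
results = (natural_f(box), natural_f_prime(box), natural_f_double_prime(box))
```

The third enclosure has twice the width of the input box because interval subtraction treats the two occurrences of x as unrelated, which is the usual reason to rewrite an expression so that each variable appears as few times as possible.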

Theorem 3 (Interval rational functions) Consider the rational function f(x) = p(x)/q(x), where p and q are polynomials. Let $\mathbf{f}$ be the natural interval extension of its DAG expression f such that $\mathbf{f} (\mathbf{y})$ is well-defined for some $\mathbf{y}$ ∈ $I ℝ$ and let $\mathbf{x}$ , $\mathbf{x}'$ ∈ $I ℝ$ . Then we have

\begin{array}{lll} (i) & \text{Inclusion isotony:} & \forall \mathbf{x} \subseteq \mathbf{x}' \subseteq \mathbf{y} \Rightarrow \mathbf{f} (\mathbf{x}) \subseteq \mathbf{f} (\mathbf{x}'), \text{ and} \\ (ii) & \text{Range enclosure:} & \forall \mathbf{x} \subseteq \mathbf{y} \Rightarrow range (f; \mathbf{x}) \subseteq \mathbf{f} (\mathbf{x}) . \end{array}

Definition 4 (Standard functions) Piece-wise monotone functions, including exponential, logarithm, rational power, absolute value, and trigonometric functions, constitute the set of standard functions

$S$ = {a^x, log_b(x), x^p/q, |x|, sin(x), cos(x), tan(x), sinh(x),...arcsin(x),...}.

Such functions have well-defined interval extensions that satisfy inclusion isotony and exact range enclosure, i.e. range(f; x) = f(x). Consider the following definitions for the interval extensions for some monotone functions in $S$ with x∈ $I ℝ$ ,

\begin{array}{lcll} \exp (x) & = & [\exp (\underline{x}), \exp (\bar{x})] & \\ \arctan (x) & = & [\arctan (\underline{x}), \arctan (\bar{x})] & \\ \sqrt{(x)} & = & [\sqrt{(\underline{x})}, \sqrt{(\bar{x})}] & \text{if } 0 \leq \underline{x} \\ \log (x) & = & [\log (\underline{x}), \log (\bar{x})] & \text{if } 0 < \underline{x}, \end{array}

and a piece-wise monotone function in $S$ ; with ℤ₊ and ℤ_- representing the set of positive and negative integers, respectively. Let the mignitude of an interval x be the number $〈 x 〉$ = min{|x|:x ∈ x} and the absolute value of x be the number |x| = max{|x|:x ∈ x} = sup{- $\underline{x}$ , $\bar{x}$ }. Then, the interval-extended power function that plays a basic role in product likelihood functions is:

x^{n} = \begin{cases} [{\underline{x}}^{n}, {\bar{x}}^{n}] & : \text{if } n \in ℤ_{+} \text{ is odd}, \\ [{〈 x 〉}^{n}, | x |^{n}] & : \text{if } n \in ℤ_{+} \text{ is even}, \\ [1, 1] & : \text{if } n = 0, \\ {[1 / \bar{x}, 1 / \underline{x}]}^{- n} & : \text{if } n \in ℤ_{-}; 0 \notin x . \end{cases}
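The case analysis for the interval power function translates directly into code. The sketch below is illustrative only: intervals are `(lo, hi)` tuples, the names `mig`, `mag` and `i_pow` are hypothetical, and no directed rounding is applied, so the computed end points are not guaranteed bounds in the way a validated interval library would provide.

```python
def mig(x):
    """Mignitude: the smallest absolute value attained in x = (lo, hi)."""
    lo, hi = x
    return 0.0 if lo <= 0.0 <= hi else min(abs(lo), abs(hi))

def mag(x):
    """Absolute value of an interval: the largest absolute value attained in x."""
    return max(abs(x[0]), abs(x[1]))

def i_pow(x, n):
    """Interval extension of the integer power x**n, following the four cases."""
    lo, hi = x
    if n == 0:
        return (1.0, 1.0)
    if n > 0 and n % 2 == 1:          # odd positive power is monotone increasing
        return (lo ** n, hi ** n)
    if n > 0:                         # even positive power: use mignitude and magnitude
        return (mig(x) ** n, mag(x) ** n)
    if lo <= 0.0 <= hi:               # negative power requires 0 not in x
        raise ZeroDivisionError("negative power of an interval containing 0")
    return i_pow((1.0 / hi, 1.0 / lo), -n)

print(i_pow((-1.0, 2.0), 2))   # prints: (0.0, 4.0)
print(i_pow((-2.0, 1.0), 3))   # prints: (-8.0, 1.0)
print(i_pow((2.0, 4.0), -1))   # prints: (0.25, 0.5)
```

Note that the even case yields [0, 4] for [-1, 2]², whereas multiplying the interval by itself would give the wider [-2, 4]; this is the reason for defining the power function separately.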

Definition 5 (Elementary functions) A real-valued function that can be expressed as a finite combination of constants, variables, arithmetic operations, standard functions and compositions is called an elementary function. The set of all such elementary functions is referred to as $E$ .

Example 2 (Probability of the pattern xxx under CFN star tree ⁰t ) The trifurcating star-tree ⁰t := (⁰t₁) has topology label 0 and common branch length parameter ⁰t₁ as shown in Figure 1(i). Either a direct application of Algorithm 2 with input as ⁰t := (⁰t₁) or a substitution of ⁰t₁ for ⁴t₁, ⁴t₂ and ⁴t₃ in (6), yields the likelihood for pattern xxx as:

The probability of the pattern xxx under CFN star tree ⁰t given by l_xxx(⁰t) with the corresponding DAG expression shown in Figure 3 is an elementary function.

It would be convenient if guaranteed enclosures of the range of an elementary f can be obtained by the natural interval extension f of one of its expressions f. The following Theorem 4 is the work-horse of interval Monte Carlo algorithms.

Theorem 4 (The fundamental theorem of interval analysis) Consider any elementary function f ∈ $E$ with expression f. Let f : y↦ $I ℝ$ be its natural interval extension such that f(y) is well-defined for some y ∈ $I ℝ$ and let x, x'∈ $I ℝ$ . Then we have

\begin{array}{lll} (i) & \text{Inclusion isotony:} & \forall x \subseteq x^{'} \subseteq y \Rightarrow f (x) \subseteq f (x^{'}), \text{ and} \\ (ii) & \text{Range enclosure:} & \forall x \subseteq y \Rightarrow range (f; x) \subseteq f (x) . \end{array}

The fundamental implication of the above theorem is that it allows us to enclose the range of any elementary function and thereby produces an upper bound for the global maximum and a lower bound for the global minimum over any compact subset of the domain upon which the function is well-defined. This is the work-horse for rigorously constructing an envelope for rejection sampling.

Unlike the natural interval extension of an f ∈ $S$ that produces exact range enclosures, the natural interval extension f(x) of an f ∈ $E$ often overestimates range(f; x), but can be shown under mild conditions to linearly approach the range as the maximal width of the box x goes to zero. This implies that a partition of x into smaller boxes {x⁽¹⁾,⋯, x^(m)} gives better enclosures of range(f; x) through the union $\cup_{i = 1}^{m} f (x^{(i)})$ as illustrated in Figure 4. Next we make the above statements precise in terms of the width and radius of a box x defined by wid x:= $\bar{x}$ - $\underline{x}$ and rad x:= ( $\bar{x}$ - $\underline{x}$ )/2, respectively.

Definition 6 A function f: $D$ ↦ ℝ is Lipschitz if there exists a Lipschitz constant K such that, for all x, y ∈ $D$ , we have |f(x) - f(y)| ≤ K|x - y|. We define $E_{L}$ to be the set of elementary functions whose sub-expressions f_i, i = 1,..., n at the nodes of its corresponding DAG f are all Lipschitz:

E_{L} : = \{f \in E : \text{each sub-expression } f_{i} \text{ in the DAG expression } f \text{ for } f \text{ is Lipschitz}\} .

Theorem 5 (Range enclosure tightens linearly with mesh) Consider a function f : $D$ ↦ ℝ with f ∈ $E_{L}$ . Let f be an inclusion isotonic interval extension of the DAG expression f of f such that f(x) is well-defined for some x ∈ $I ℝ$ . Then there exists a positive real number K, depending on f and x, such that if $x = \cup_{i = 1}^{k} x^{(i)}$ , then

range (f; x) \subseteq \cup_{i = 1}^{k} f (x^{(i)}) \subseteq f (x),

and

rad (\cup_{i = 1}^{k} f (x^{(i)})) \leq rad (range (f; x)) + K \max_{i = 1, ..., k} rad (x^{(i)}) .
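A small numerical experiment illustrates the linear tightening. The sketch below uses a toy target that is not from this paper, f(x) = x(1 - x) on [0, 1], whose exact range is [0, 1/4] with radius 1/8. The function names are hypothetical and plain floating-point arithmetic is used, so this is an illustration of the statement rather than a validated computation.

```python
def i_mul(x, y):
    """Interval product of x = (lo, hi) and y = (lo, hi)."""
    p = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(p), max(p))

def natural_extension(x):
    """Natural interval extension of the expression x * (1 - x)."""
    one_minus_x = (1.0 - x[1], 1.0 - x[0])
    return i_mul(x, one_minus_x)

def hull_over_partition(lo, hi, k):
    """Hull of the enclosures over a uniform partition of [lo, hi] into k pieces."""
    w = (hi - lo) / k
    encl = [natural_extension((lo + i * w, lo + (i + 1) * w)) for i in range(k)]
    return (min(e[0] for e in encl), max(e[1] for e in encl))

def rad(x):
    """Radius of an interval."""
    return (x[1] - x[0]) / 2.0

TRUE_RADIUS = 0.125  # radius of the exact range [0, 1/4]
for k in (1, 2, 4, 8, 16, 32):
    h = hull_over_partition(0.0, 1.0, k)
    # The excess radius roughly halves each time the mesh is halved.
    print(k, h, rad(h) - TRUE_RADIUS)
```

For this toy target the hull over four pieces is [0, 0.375], compared with [0, 1] from a single evaluation over the whole domain.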

Likelihood of a box of trees

The likelihood function (2) over trees with a DAG expression that is directly or indirectly obtained via Algorithm 2 has a natural interval extension over boxes of trees [5, 26]. This interval extension of the likelihood function allows us to produce rigorous enclosures of the likelihood over a box in the tree space. Next we give a concrete example of the natural interval extension of the likelihood function over an interval of trees ⁰t in the star-tree space . The same ideas extend to any labeled box of trees ^kt when the number of branch lengths is greater than one and more generally to a finite union of labeled boxes with possibly distinct labels.

Example 3 (Posterior density over the CFN star-tree space ) The trifurcating star-tree ⁰t := (⁰t₁) has topology label 0 and common branch length ⁰t₁ > 0. Either a direct application of Algorithm 2 with input triplet ⁰t or a substitution of ⁴t₁, ⁴t₂ and ⁴t₃ in (6) by ⁰t₁ yields the following -specific likelihoods:

(10)

Therefore, on the basis of (4), (5), (6) and (7), the likelihood of the data at the star-tree ⁰t ∈ is

(11)

the posterior density (3) based on a uniform prior p(⁰t₁) = 1/10 over = (0, 10] is

(12)

Thus, under our conveniently chosen uniform prior, the target posterior shape (without the normalizing constant) is simply the likelihood function, i.e.

f ({}^{0}t) = f^{\cdot} ({}^{0}t) \int_{0}^{10} l_{d} ({}^{0}t) \partial ({}^{0}t) = l_{d} ({}^{0}t) .

Observe that the minimal sufficient statistics over are the number of sites with the same character c_xxx := c₀ + c₁ and the total number of sites v. Let the natural interval extension of the DAG expression for the posterior shape f(⁰t): ↦ ℝ be:

Thus, f maps an interval ⁰t in the tree space to an interval in $I ℝ$ that encloses the target shape or likelihood of ⁰t.
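The following sketch shows how such an enclosure of the log-likelihood over an interval of star-trees might be computed. It rests on two assumptions that are not taken from the displayed equations: the pattern probability is taken to be the standard CFN form l_xxx(t) = (1 + 3e^{-4t})/4, and the log-likelihood is written up to an additive constant as c_xxx log(l_xxx) + (v - c_xxx) log(1 - l_xxx). The counts 762 and 895 are those reported for the human, chimpanzee and gorilla data of this example; the interval [0.02, 0.12] is an arbitrary illustrative domain, and no directed rounding is performed.

```python
import math

def l_xxx(t):
    """ASSUMED CFN star-tree probability of pattern xxx at branch length t."""
    return (1.0 + 3.0 * math.exp(-4.0 * t)) / 4.0

def loglik(t, c_xxx, v):
    """Point-valued log-likelihood, up to an additive constant."""
    p = l_xxx(t)
    return c_xxx * math.log(p) + (v - c_xxx) * math.log(1.0 - p)

def loglik_enclosure(box, c_xxx, v):
    """Natural interval extension of loglik over box = (a, b) with 0 < a <= b."""
    a, b = box
    p = (l_xxx(b), l_xxx(a))              # l_xxx is decreasing in t
    q = (1.0 - p[1], 1.0 - p[0])          # interval for 1 - l_xxx
    same = (c_xxx * math.log(p[0]), c_xxx * math.log(p[1]))
    diff = ((v - c_xxx) * math.log(q[0]), (v - c_xxx) * math.log(q[1]))
    return (same[0] + diff[0], same[1] + diff[1])

def uniform_partition(lo, hi, k):
    """Uniform partition of [lo, hi] into k intervals."""
    w = (hi - lo) / k
    return [(lo + i * w, lo + (i + 1) * w) for i in range(k)]

C_XXX, V = 762, 895
for k in (3, 7, 19):
    boxes = uniform_partition(0.02, 0.12, k)
    widest = max(e[1] - e[0] for e in (loglik_enclosure(b, C_XXX, V) for b in boxes))
    print(k, widest)   # the widest enclosure shrinks as the partition is refined
```

The two terms of the log-likelihood move in opposite directions as the branch length grows, so adding their enclosures overestimates the range over each interval; refinement reduces this overestimation.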

For the human, chimpanzee and gorilla mitochondrial sequence data [27] analyzed in [17], c_xxx = 762 and v = 895. Figure 4 shows log(f(⁰t)) or the log-likelihood function for this data set as the white line. Evaluations of its interval extension over partitions by 3, 7 and 19 intervals are depicted by colored rectangles in Figure 4. Notice how the range enclosure by the interval extension of the log-likelihood function, our target shape, tightens with domain refinement as per Theorem 5. The maximum likelihood estimate derived in [17] (the red dot in Figure 4) is

Moore rejection sampler (MRS)

Moore rejection sampler (MRS) is an auto-validating rejection sampler (RS). MRS is said to be auto-validating because it automatically obtains a proposal g that is easy to simulate from, and an envelope that is guaranteed to satisfy the envelope condition (1). MRS can produce independent samples from any target shape f whose DAG expression f has a well-defined natural interval extension f over a compact domain $T$ . In summary, the defining characteristics and notations of MRS are:

\begin{array}{ll} \text{Compact domain} & T = [\underline{t}, \bar{t}] \\ \text{Target shape} & f (t) : T \mapsto ℝ \\ \text{Target integral} & N_{f} : = \int_{T} f (t) d t \\ \text{Target density} & f^{\cdot} (t) : = {(N_{f})}^{- 1} f (t) : T \mapsto ℝ \\ \text{DAG expression of } f & f (t) : T \mapsto ℝ \\ \text{Interval extension of } f & f (t) : I T \mapsto I ℝ \\ \text{Envelope function} & \hat{g} (t) : T \mapsto ℝ \\ \text{Envelope integral} & N_{\hat{g}} : = \int_{T} \hat{g} (t) d t \\ \text{Proposal density} & g (t) : = {(N_{\hat{g}})}^{- 1} \hat{g} (t) : T \mapsto ℝ \\ \text{Acceptance probability} & A (\hat{g}) = N_{f} / N_{\hat{g}} \\ \text{Partition of } T & T : = \{t^{(1)}, t^{(2)}, ..., t^{(| T |)}\} . \end{array}

Suppose f is an elementary function and its DAG expression f has a well-defined interval extension f on $T$ . If $T : = {t^{(1)}, t^{(2)}, ..., t^{(| T |)}}$ is a finite partition of $T$ , then by Theorem 4 we can enclose range(f; t⁽ⁱ⁾), i.e. the range of f over the i-th element of $T$ , with the interval extension f of f:

range (f; t^{(i)}) \subseteq f (t^{(i)}) : = [\underline{f} (t^{(i)}), \bar{f} (t^{(i)})], \forall i \in \{1, 2, ..., | T |\} .

(13)

For a given partition $T$ , we can construct a partition-specific envelope function:

\begin{matrix} {\hat{g}}^{T} (t) = \sum_{i = 1}^{| T |} \bar{f} (t^{(i)}) 1_{{t \in t^{(i)}}}, & 1_{{t \in t^{(i)}}} = \begin{cases} 1 & \text{if } t \in t^{(i)} \\ 0 & \text{otherwise} . \end{cases} \end{matrix}

(14)

The necessary envelope condition (1) is satisfied by ${\hat{g}}^{T}$ (t) because of (13). We can obtain the corresponding proposal $g^{T}$ (t) as a normalized simple function over $T$ :

g^{T} (t) = {(N_{{\hat{g}}^{T}})}^{- 1} {\hat{g}}^{T} (t) = {(N_{{\hat{g}}^{T}})}^{- 1} \sum_{i = 1}^{| T |} \bar{f} (t^{(i)}) 1_{{t \in t^{(i)}}},

(15)

where the normalizing constant $N_{{\hat{g}}^{T}} : = \sum_{i = 1}^{| T |} (vol t^{(i)} \cdot \bar{f} (t^{(i)}))$ and $vol t : = \prod_{i = 1}^{n} wid t_{i}$ is the volume of the box t. The volume of an interval x is simply its width, i.e. vol x= wid x, if x ∈ $I ℝ$ . Now, we have all the ingredients to perform a more efficient, partition-specific, auto-validating von Neumann rejection sampling or simply Moore rejection sampling.
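The construction in (13), (14) and (15) can be sketched in one dimension as follows. This is a toy illustration under stated assumptions, not the MRSampler class: the target shape f(t) = t(1 - t) on [0, 1] is chosen only because its natural interval extension is easy to write down, the partition is uniform rather than adaptive, the function names are hypothetical, and the upper bounds are computed without directed rounding.

```python
import random

def f(t):
    """Toy target shape (an unnormalized Beta(2, 2) density)."""
    return t * (1.0 - t)

def enclosure(box):
    """Natural interval extension of t * (1 - t) over box = (a, b) in [0, 1]."""
    a, b = box
    return (a * (1.0 - b), b * (1.0 - a))

def build_envelope(lo, hi, W):
    """Boxes, upper bounds (the envelope heights) and box weights vol * upper."""
    w = (hi - lo) / W
    boxes = [(lo + i * w, lo + (i + 1) * w) for i in range(W)]
    upper = [enclosure(b)[1] for b in boxes]
    weights = [(b[1] - b[0]) * u for b, u in zip(boxes, upper)]
    return boxes, upper, weights

def rejection_sample(n, boxes, upper, weights, rng):
    """Draw n samples; also return the number of proposals that were needed."""
    samples, proposals = [], 0
    while len(samples) < n:
        i = rng.choices(range(len(boxes)), weights=weights)[0]  # pick a box
        t = rng.uniform(*boxes[i])                              # point in the box
        u = rng.uniform(0.0, upper[i])                          # height under envelope
        proposals += 1
        if u <= f(t):                                           # von Neumann test
            samples.append(t)
    return samples, proposals

rng = random.Random(2009)
boxes, upper, weights = build_envelope(0.0, 1.0, 16)
samples, proposals = rejection_sample(2000, boxes, upper, weights, rng)
print(len(samples) / proposals)   # empirical acceptance rate
```

Here `random.choices` performs the discrete draw in time linear in the number of boxes, which is adequate for a small partition but not for the very large partitions considered later.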

Before making formal statements about our sampler, let us gain geometric insight into it from Example 3 and Figure 4. The upper boundary of the rectangles of a given color in Figure 4 depicts a simple function that is a partition-specific envelope function (14) for the logarithm of the posterior shape, i.e. the log-likelihood function of Example 3, over the prior-specified support [10^-10, 10] ⊂ . In Figure 4 only a small interval about the maximum likelihood estimate (red dot) that contains the posterior samples (gray '+' markers) is depicted, since the likelihood falls sharply outside this range. Normalization of the envelope gives the corresponding proposal function (15). As the refinement of the domain proceeds through adaptive bisections (described later), the partition size increases. We show partitions of size 3, 7 and 19 over an interval containing the posterior samples. These samples were obtained from the partition with 19 intervals. Each of the corresponding envelope functions (upper boundaries of rectangles of a given color) can be used to draw independent and identically distributed samples from the target posterior density. Note how the acceptance probability (the ratio of the area below the target shape to that below the envelope) increases with refinement.

Theorem 6 shows that Moore rejection sampler (MRS) indeed produces independent samples from the desired target and Theorem 7 describes the asymptotics of the acceptance probability as the partition of the domain is refined. Proofs for both Theorems are included in the Appendix for completeness.

Theorem 6 Suppose that the DAG expression f of the target shape f has a well-defined natural interval extension f over . If T is generated according to Algorithm 1, and if the envelope function ${\hat{g}}^{T}$ (t) and the proposal density $g^{T}$ (t) are given by (14) and (15), respectively, then T is distributed according to the target density f^• : $T$ ↦ ℝ.

Next we bound the partition-specific acceptance probability $A (T) : = A ({\hat{g}}^{T})$ for this sampler. For simplicity, let the domain $T$ of the target shape f be an interval. Due to the linearity of the integral operator and (13),

\begin{array}{rcl} N_{f} & : = & \int_{T} f (t) d t \\ & = & \sum_{i = 1}^{| T |} \int_{t^{(i)}} f (t) d t \\ & \in & \sum_{i = 1}^{| T |} (wid (t^{(i)}) \cdot f (t^{(i)})) \\ & = & [\sum_{i = 1}^{| T |} (wid (t^{(i)}) \cdot \underline{f} (t^{(i)})), \sum_{i = 1}^{| T |} (wid (t^{(i)}) \cdot \bar{f} (t^{(i)}))] . \end{array}

Therefore,

A (T) : = A ({\hat{g}}^{T}) = \frac{N_{f}}{N_{{\hat{g}}^{T}}} = \frac{N_{f}}{\sum_{i = 1}^{| T |} (wid (t^{(i)}) \cdot \bar{f} (t^{(i)}))} \geq \frac{\sum_{i = 1}^{| T |} (wid (t^{(i)}) \cdot \underline{f} (t^{(i)}))}{\sum_{i = 1}^{| T |} (wid (t^{(i)}) \cdot \bar{f} (t^{(i)}))} .

(16)

If f ∈ $E_{L}$ , the Lipschitz class of elementary functions (Definition 6), then we might expect the width of the enclosure of N_f to be proportional to the mesh $w : = \max_{i \in \{1, ..., | T |\}} wid (t^{(i)})$ of the partition $T$ .

Theorem 7 Let $U_{W}$ be the uniform partition of $T$ = [ $\underline{t}$ , $\bar{t}$ ] into W intervals each of width w

\begin{array}{lcl} w & = & \frac{(\bar{t} - \underline{t})}{W} \\ t_{W}^{(i)} & = & [\underline{t} + (i - 1) w, \underline{t} + i w], i = 1, ..., W \\ U_{W} & = & \{t_{W}^{(i)}, i = 1, ..., W\} . \end{array}

and let f ∈ $E_{L}$ , then

A (U_{W}) = 1 - O (1 / W)

Theorem 7 shows that if f ∈ $E_{L}$ and $U_{W}$ is a uniform partition of $T$ into W intervals, then the acceptance probability $A (U_{W}) = 1 - O (1 / W)$ . Thus, the acceptance probability approaches 1 at a rate that is no slower than linearly with the mesh.
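The rate in Theorem 7 and the lower bound on the acceptance probability in (16) can be checked numerically on a toy target. The sketch assumes the shape f(t) = t(1 - t) on [0, 1], which is not one of the targets of this paper but has the known integral N_f = 1/6, so the exact acceptance probability of each uniform partition is available for comparison. Function names are hypothetical and ordinary floating-point arithmetic is used.

```python
def enclosure(box):
    """Natural interval extension of t * (1 - t) over box = (a, b) in [0, 1]."""
    a, b = box
    return (a * (1.0 - b), b * (1.0 - a))

def acceptance(W):
    """Return (lower bound on A, exact A) for the uniform partition into W boxes."""
    w = 1.0 / W
    encl = [enclosure((i * w, (i + 1) * w)) for i in range(W)]
    n_lower = sum(w * e[0] for e in encl)   # sum of wid * lower bound
    n_upper = sum(w * e[1] for e in encl)   # envelope integral
    exact = (1.0 / 6.0) / n_upper           # N_f / envelope integral
    return n_lower / n_upper, exact

for W in (4, 16, 64, 256):
    lower, exact = acceptance(W)
    # W * (1 - A) stays bounded, i.e. A = 1 - O(1/W).
    print(W, lower, exact, W * (1.0 - exact))
```

For this toy target the sums can be evaluated in closed form, giving A(U_W) = W²/((W + 1)(W + 2)), so that W(1 - A(U_W)) increases towards 3 as W grows.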

Prioritized partitions and pre-processed proposals

We studied the efficiency of uniform partitions for their mathematical tractability. In practice, we may further increase the acceptance probability for a given partition size by adaptively partitioning $T$ . In our context, adaptive means the possible exploitation of any current information about the target. We can refine the current partition $T_{α}$ and obtain a finer partition $T_{α^{'}}$ with an additional box by bisecting a box $t^{(*)}$ ∈ $T_{α}$ along the midpoint of its side with the maximal width into a left box $t_{L}^{(*)}$ and a right box $t_{R}^{(*)}$ . There are several ways to choose a box $t^{(*)}$ ∈ $T_{α}$ for bisection. For instance, a relatively optimal choice is

t^{(*)} = \underset{t^{(i)} \in T_{α}}{\arg \max} (vol (t^{(i)}) \cdot wid (f (t^{(i)}))) .

(17)

We employ a priority queue to conduct sequential refinements of $T$ under this partitioning scheme. This approach avoids an exhaustive argmax computation to obtain the t^(*) for bisection at each refinement step. Thus, the current partition is represented by a queue of boxes that are prioritized in descending order by the priority function vol (t⁽ⁱ⁾) · wid (f(t⁽ⁱ⁾)) in (17). Therefore, the box with the largest uncertainty in the enclosure of the integral over it gets bisected first. There are several ways to decide when to stop refining the partition. A simple strategy is to stop when the number of boxes reaches a number that is well within the memory constraints of the computer, say 10⁶, or when the lower bound of the acceptance probability given by (16) is above a desired threshold, say 0.1.
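A prioritized refinement of this kind might be sketched with a binary heap. The example below is an assumption-laden illustration rather than the MRS implementation: it is one-dimensional, uses the toy shape f(t) = t(1 - t) on [0, 1], relies on Python's `heapq` (a min-heap, hence the negated priority), and stops on either a box budget or a threshold on the lower bound (16). The names `refine`, `max_boxes` and `min_accept` are hypothetical.

```python
import heapq

def enclosure(box):
    """Natural interval extension of t * (1 - t) over box = (a, b) in [0, 1]."""
    a, b = box
    return (a * (1.0 - b), b * (1.0 - a))

def refine(lo, hi, max_boxes=10_000, min_accept=0.9):
    """Bisect the highest-priority box until a stopping rule is met."""
    heap = []
    sums = [0.0, 0.0]   # running sums of wid * lower bound and wid * upper bound

    def push(box):
        e = enclosure(box)
        w = box[1] - box[0]
        sums[0] += w * e[0]
        sums[1] += w * e[1]
        # Priority vol * wid(enclosure), negated because heapq is a min-heap.
        heapq.heappush(heap, (-(w * (e[1] - e[0])), box, e))

    push((lo, hi))
    while len(heap) < max_boxes and sums[0] / sums[1] < min_accept:
        _, box, e = heapq.heappop(heap)
        w = box[1] - box[0]
        sums[0] -= w * e[0]
        sums[1] -= w * e[1]
        mid = 0.5 * (box[0] + box[1])
        push((box[0], mid))   # left box
        push((mid, box[1]))   # right box
    partition = sorted(box for _, box, _ in heap)
    return partition, sums[0] / sums[1]

partition, bound = refine(0.0, 1.0)
print(len(partition), bound)   # partition size and lower bound on acceptance
```

Each refinement step costs a logarithmic number of heap operations instead of a scan over all boxes, which is the point of using a priority queue.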

Once we have a partition $T$ of $T$ , we can sample t from the proposal density $g^{T}$ given by (15) in two steps:

1. Sample a box t⁽ⁱ⁾ ∈ $T$ according to the discrete distribution:
$\begin{matrix} {\ddot{g}}^{T} (t^{(i)}) = \frac{vol t^{(i)} \bar{f} (t^{(i)})}{\sum_{i = 1}^{| T |} (vol t^{(i)} \bar{f} (t^{(i)}))}, & t^{(i)} \in T, \end{matrix}$
(18)
2. Sample a point t uniformly at random from the box t⁽ⁱ⁾.

Sampling from large discrete distributions (with a million states or more) can be made faster by pre-processing the probabilities and saving the result in a convenient look-up table. This basic idea [28] allows samples to be drawn rapidly. We employ an efficient pre-processing strategy known as the Alias Method [4], as implemented in the GNU Scientific Library [29], that allows samples to be drawn in constant time even for very large discrete distributions. We also minimize the number of evaluations of the target shape f by saving the box-specific computations of $\underline{f}$ (t⁽ⁱ⁾) and $\bar{f}$ (t⁽ⁱ⁾) and exploiting the so-called "squeeze principle", i.e. immediately accepting those points proposed in the box t⁽ⁱ⁾ that fall below $\underline{f}$ (t⁽ⁱ⁾) when uniformly stretched toward $\bar{f}$ (t⁽ⁱ⁾).
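The two-step proposal, the alias look-up table and the squeeze test can be combined as in the following sketch. It is written in Python for illustration and does not call the GNU Scientific Library; the table construction follows Vose's variant of the alias method, which is an assumption about the flavor of the method rather than a description of the library's internals. The toy shape f(t) = t(1 - t) on [0, 1], the uniform partition and all function names are likewise hypothetical.

```python
import random

def build_alias(weights):
    """Pre-process weights into alias tables (Vose's variant); O(n) set-up."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:          # leftovers are 1 up to rounding error
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng):
    """Draw one index in constant time."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

def f(t):
    return t * (1.0 - t)

def enclosure(box):
    """Natural interval extension of t * (1 - t) over box = (a, b) in [0, 1]."""
    a, b = box
    return (a * (1.0 - b), b * (1.0 - a))

def sample(n, W, rng):
    """Return n samples and the number of target evaluations that were needed."""
    w = 1.0 / W
    boxes = [(i * w, (i + 1) * w) for i in range(W)]
    encl = [enclosure(b) for b in boxes]            # saved once per box
    prob, alias = build_alias([w * e[1] for e in encl])
    out, evaluations = [], 0
    while len(out) < n:
        i = alias_draw(prob, alias, rng)            # step 1: pick a box
        t = rng.uniform(*boxes[i])                  # step 2: point in the box
        u = rng.uniform(0.0, encl[i][1])
        if u <= encl[i][0]:                         # squeeze: accept without f
            out.append(t)
            continue
        evaluations += 1
        if u <= f(t):
            out.append(t)
    return out, evaluations

samples, evaluations = sample(5000, 64, random.Random(4))
print(len(samples), evaluations)
```

With a fine partition most proposals fall below the stored lower bound, so the target shape is evaluated for only a fraction of the accepted points.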

Thus, by means of priority queues and look-up tables we can efficiently manage our adaptive partitioning of the domain for envelope construction, and rapidly draw samples from the proposal distribution. Our sampler class MRSampler implemented in MRS 0.1.2, a C++ class library for statistical set processing, builds on C-XSC 2.0, a C++ class library for extended scientific computing using interval methods [30]. All computations were done on a 2.8 GHz Pentium IV machine with 1 GB RAM. Having given theoretical and practical considerations to our Moore rejection sampler, we are ready to draw samples from various targets over small tree spaces.

Results

The natural interval extension of the likelihood function over labeled boxes in the tree space allows us to employ the Moore rejection sampler to rigorously draw independent and identically distributed samples from the posterior distribution over a compact box in the tree space given by our prior distribution. We draw samples from the posterior distribution based on two mitochondrial DNA data sets and use these samples (i) to estimate the posterior probabilities of each of the three rooted topologies, (ii) to conduct a nonparametric test of rate homogeneity between protein-coding and tRNA-coding sites and (iii) to estimate the human-neanderthal divergence time.

Human, chimpanzee and gorilla

We revisit the data from a segment of the mitochondrial DNA of human, chimpanzee and gorilla [27] that was analyzed under the CFN model of DNA mutation (Model 1) within a point estimation setting [17]. The sufficient statistics of pattern counts for this data with total number of sites v = 895 under the CFN model over the space of all three-leaved phylogenetic trees are:

(c_xxx, c_xxy, c_yxx, c_xyx) = (762, 54, 41, 38)

Let human, chimpanzee and gorilla be denoted by leaf labels 1, 2 and 3, or H, C and G, respectively. Let the set of rooted tree labels corresponding to (ii),(iii) and (iv) of Figure 1 be $K$ = {1, 2, 3}. The maximum likelihood estimate over ${}^{K}T : = {}^{1}T \cup {}^{2}T \cup {}^{3}T$ , the rooted and clocked three-leaved phylogenetic tree space, is derived in [17] as

Recall that due to our flat priors, our posterior shape f(ⁱt):= f(ⁱt₀, ⁱt₁) with i ∈ $K$ = {1, 2, 3} is our likelihood function over ${}^{K}T$ . Now, suppose are n independent and identically distributed samples from the posterior density f^• over ${}^{K}T$ . We can obtain asymptotically consistent estimates of the posterior probabilities of ${}^{1}T$ , ${}^{2}T$ and ${}^{3}T$ from Monte Carlo integration of the indicator function of each of the three topology labels using

The 95% confidence interval for ^jP, based on asymptotic normality of the Monte Carlo estimator, is

{\hat{{}^{j}P}}_{n} \pm 1.96 \sqrt{{\hat{{}^{j}P}}_{n} (1 - {\hat{{}^{j}P}}_{n}) / n} .

Point estimate and a symmetric 95% confidence interval for the posterior probability of each of the three topologies from n = 10⁶ posterior samples are

\begin{matrix} {\hat{{}^{1}P}}_{10^{6}} = 0.8875 \pm 0.0006, \\ {\hat{{}^{2}P}}_{10^{6}} = 0.0646 \pm 0.0005, \\ {\hat{{}^{3}P}}_{10^{6}} = 0.0479 \pm 0.0004. \end{matrix}
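The reported half-widths follow directly from the normal-approximation formula above. A short check, using only the point estimates and the sample size n = 10⁶ quoted in the text (the function name `mc_interval` is hypothetical):

```python
import math

def mc_interval(p_hat, n, z=1.96):
    """Normal-approximation interval for a Monte Carlo estimate of a probability."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width, half_width

for p_hat in (0.8875, 0.0646, 0.0479):
    print(p_hat, round(mc_interval(p_hat, 10**6)[2], 4))
# prints:
# 0.8875 0.0006
# 0.0646 0.0005
# 0.0479 0.0004
```

This interval is valid here because the posterior samples are independent; with correlated MCMC output the variance term would need to be inflated by the autocorrelation of the chain.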

These point estimates are in agreement with estimates obtained in [31, 32] through quadrature routines in Mathematica. The first 10,000 of these samples are shown in Figure 5 upon transforming the rooted and clocked trees, ⁱt := (ⁱt₀, ⁱt₁), i ∈ {1, 2, 3}, into constrained unrooted trees, ⁴t := (⁴t₁, ⁴t₂, ⁴t₃), according to Table 1.

Table 1 Rooted triplets as constrained unrooted triplets


Obtaining confidence intervals from dependent MCMC samples requires nontrivial computations for the burn-in period and the thinning rate [1]. These are not readily available for phylogenetic MCMC samplers. Thus, the independent and identically distributed samples from our rejection sampler have the advantage of producing valid confidence intervals for our integrals of interest. The point estimate of the posterior mean $E ({}^{1}T) : = \int_{{}^{1}T} {}^{1}t f ({}^{1}t) \partial ({}^{1}t)$ for topology label 1 is (0.010863, 0.048994). This posterior mean is close to (0.010036, 0.048559), the mode of our target shape or the maximum likelihood estimate derived in [17].

Chimpanzee, gorilla and orangutan

We focus here on the 895 bp long homologous segment of mitochondrial DNA from chimpanzee, gorilla and orangutan [27]. This gives us a greater phylogenetic depth than the human, chimpanzee and gorilla sequences that were just analyzed. These sequences encode the genes for three transfer RNAs and parts of two proteins. Under the assumption of independence across sites, the sufficient statistics, under the JC model of DNA mutation (Model 2) over triplets, are given in Table 2 for all of the data as well as a partition of the data into tRNA-coding and protein-coding sites.

Table 2 Minimal sufficient statistics for the chimpanzee, gorilla and orangutan data


Ten thousand independent and identically distributed samples were drawn in 942 CPU seconds from the posterior distribution over JC triplets, i.e. unrooted trees with three edges corresponding to the three primates. Figure 6 shows these samples (blue dots) scattered about the verified global maximum likelihood estimate (MLE) of the triplet obtained in [5, 26] and subsequently confirmed algebraically in [23]. We also drew ten thousand independent and identically distributed samples from the posterior based on the 198 tRNA-coding DNA sites (green dots in Figure 6) as well as from that based on the remaining 697 protein-coding sites (red dots in Figure 6). The former posterior samples, corresponding to the tRNA-coding sites, are more dispersed than the posterior samples based on the entire sequence. This is due to the smaller number of tRNA-coding sites making the posterior less concentrated. Moreover, the cluster of samples from the posterior based on tRNA-coding sites seems to lie away from that based on protein-coding sites. Such a separation of the two sets of posterior samples is a signal of mutational rate heterogeneity between the two types of sites. Hotelling's trace statistic, being a natural measure of distance between two clusters of points, can be used as a test statistic to assess the significance of this separation. On the basis of 100 random permutations of the sites, we obtain the null distribution of Hotelling's trace statistic. We were able to reject the null hypothesis of rate homogeneity between the posterior samples based on the tRNA-coding sites and those based on the protein-coding sites at the 10% significance level using this permutation test (P-value = 0.06). Any biological interpretation of this test must be done cautiously since the JC model employed here forbids any transition:transversion bias that is reportedly relevant for this data [27].

Neanderthal, human and chimpanzee

We used the 15 site patterns and their counts in Table 3 to infer the human-neanderthal divergence time. These counts are obtained from a multiple sequence alignment of the data made available in [33]. Our alignment procedure is more robust at the ends of each locus than that of [33]. We do an ordered concatenation of all the loci for each species prior to a multiple sequence alignment. The alignment was further edited by hand to obtain the locus-specific alignments. Under the assumption of independence across sites, the sufficient statistic, under any Markov model of DNA mutation, is the set of distinct site patterns and their respective counts. These are given in Table 3 for this data set.

Table 3 Minimal sufficient statistics for the neanderthal, human and chimpanzee data


We drew 10,000 independent and identically distributed samples from each of three posterior densities: (i) over the space of unrooted triplets under the JC model in 312 CPU seconds, (ii) over the clocked and rooted triplets under the JC model in 375 CPU seconds and (iii) over the clocked and rooted triplets under the HKY model in 1.2 CPU hours. In the HKY model we used the empirical nucleotide frequencies from the data (π(T) = 0.2588, π(C) = 0.2571, π(A) = 0.2916, π(G) = 0.1925) and a hominid-specific transition:transversion rate of 2.0. Unlike the JC model with five sufficient statistics (c_xxx, c_xxy, c_yxx, c_xyx, c_xyz) = (2343, 56, 2, 4, 0), all 15 distinct site patterns are required for the likelihood computations under the HKY model and this is reflected in its longer CPU time. Both models gave similar posterior samples over rooted triplets, as shown in Figure 7.

We transformed the three posterior distributions over the triplet spaces, namely (i) unrooted JC triplets that were rooted using the mid-point rooting method, (ii) rooted JC triplets and (iii) rooted HKY triplets, into three posterior distributions over the human-neanderthal divergence time relative to the human-chimp divergence time. The corresponding posterior quantiles ({5%, 50%, 95%}) for the human-neanderthal divergence time in units of human-chimp divergence time are {0.0643, 0.125, 0.214}, {0.0694, 0.142, 0.263} and {0.0682, 0.143, 0.268}, respectively. We constrained the neanderthal lineage to be a fraction of the human lineage in branch length in order to estimate the age of the neanderthal fossil from the rooted HKY triplets. The posterior quantiles of the fossil date in units of human-chimp divergence are {0.00685, 0.0666, 0.195}. The estimate of 38,310 years based on carbon-14 accelerator mass spectrometry [33] is within our [5%, 95%] posterior quantile interval for the fossil date, provided the human-chimp divergence estimate ranges in [196103, 5.6 × 10⁶]. Thus, reasonable bounds for the human-chimp divergence are 4 × 10⁶ and 5.6 × 10⁶ years, under the assumption that 4 × 10⁶ is an acceptable lower bound. Based on these two calendar year estimates, we transformed the posterior quantiles of the human-neanderthal divergence times from the rooted HKY triplets into {272680, 571124, 1073375} and {381752, 799574, 1502724} years, respectively. Our [5%, 95%] posterior intervals contain the interval estimate of [461000, 825000] years reported in [33]. However, our confidence intervals are from perfectly independent samples from the posterior and account for the finite number of neanderthal sites that were successfully sequenced, unlike those obtained on the basis of a bootstrap of site patterns [34] or heuristic MCMC [1].
Unfortunately, our human-neanderthal divergence estimates are overestimates as they ignore the non-negligible time to coalescence of the human and neanderthal homologs within the human-neanderthal ancestral population. Improvements to our estimates based on the other 310 human and 4 chimpanzee homologs reported in [33] may be possible with more sophisticated models of populations within a phylogeny, and this needs further investigation.
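The conversion from relative to calendar time used above is a simple rescaling, sketched here with the rounded quantiles quoted in the text. Because the inputs are rounded to three significant figures, the results agree with the reported year values only to within a fraction of a percent. The function names are hypothetical.

```python
HKY_QUANTILES = (0.0682, 0.143, 0.268)       # 5%, 50%, 95%, rooted HKY triplets
FOSSIL_QUANTILES = (0.00685, 0.0666, 0.195)  # fossil date, same relative units
FOSSIL_AGE_YEARS = 38310                     # carbon-14 estimate cited in the text

def to_years(relative_quantiles, human_chimp_years):
    """Rescale quantiles given in units of the human-chimp divergence time."""
    return tuple(round(q * human_chimp_years) for q in relative_quantiles)

def compatible_divergence_range(age_years, relative_quantiles):
    """Human-chimp divergence times for which the 5%-95% interval covers age_years."""
    q05, _, q95 = relative_quantiles
    return age_years / q95, age_years / q05

for human_chimp_years in (4.0e6, 5.6e6):
    print(human_chimp_years, to_years(HKY_QUANTILES, human_chimp_years))

print(compatible_divergence_range(FOSSIL_AGE_YEARS, FOSSIL_QUANTILES))
```

The second function reproduces the reasoning behind the stated range for the human-chimp divergence: the fossil age lies inside the posterior interval exactly when the divergence time lies between the age divided by the 95% quantile and the age divided by the 5% quantile.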

Chimpanzee, gorilla, orangutan and gibbon

We were able to draw samples from JC quartets on the basis of the mitochondrial DNA of chimpanzee, gorilla, orangutan and gibbon [27]. The data for all four primates can be summarized by 61 distinct site patterns [5]. Now, the problem is more challenging because there are three distinct tree topologies in the unrooted, bifurcating, quartet tree space, and each of these topologies has five edges. Thus, the domain of quartets is a piecewise Euclidean space that arises from a fusion of 3 distinct five-dimensional orthants. Since the post-order traversals (Algorithm 2) specifying the likelihood function are topology-specific, we extended the likelihood over a compact box of quartets in a topology-specific manner. It took about a day and a half of computation to draw 10,000 samples from the quartet target, due to the low acceptance probability of the naive likelihood function based on the 61 distinct site patterns. All the samples had the topology that groups chimpanzee and gorilla together, i.e. ((chimpanzee, gorilla), (orangutan, gibbon)). The samples (results not shown) were again scattered about the verified global MLE of the quartet [5]. This quartet likelihood function has an elaborate DAG with numerous operations. When the data are compressed into sufficient statistics, the efficiency increases tremendously (e.g. for triplets the efficiency increases by a factor of 3.7). This is because the number of leaf nodes in the target DAG, which encode the distinct site patterns of the observed data into the likelihood function, is reduced from 29 to 5 for the triplet target and from 61 to 15 for the quartet target [24].

Discussion

Interval methods provide for a rigorous sampling from posterior target densities over small phylogenetic tree spaces. When one substitutes conventional floating-point arithmetic for real arithmetic in a computer and uses discrete lattices to construct the envelope and/or proposal, it is generally not possible to guarantee the envelope property, and thereby ensure that samples are drawn from the desired target density, except in special cases [35]. Thus, the construction of the Moore rejection sampler through interval methods, that enclose the target shape over the entire real continuum in any box of the domain with machine-representable bounds, in a manner that rigorously accounts for all sources of numerical errors (see [36] for a discussion on error control), naturally guarantees that the Moore rejection samples are independent draws from the desired target. Moreover, the target is allowed to be multivariate and/or non-log-concave with possibly 'pathological' behavior, as long as it has a well-defined interval extension. The efficiency of MRS is not immune to the curse of dimensionality and target DAG complexity. When the DAG expression for the likelihood gets large, its natural interval extension can have terrible over-enclosures of the true range, which in turn forces the adaptive refinement of the domain to be extremely fine for efficient envelope construction. Thus, a naive application of interval methods to targets with large DAGs can be terribly inefficient. In such cases, sampler efficiency rather than rigor is the issue. Thus, one may fail to obtain samples in a reasonable time, rather than (as may happen with non-rigorous methods) produce samples from some unknown and undesired target.

There are several ways in which efficiency can be improved for such cases. First, the particular structure of the target DAG should be exploited to avoid any redundant computations. For example, sufficient statistics must be used to dissolve symmetries in the DAG. Second, we can further improve efficiency by limiting ourselves to differentiable targets in Cⁿ. Tighter enclosures of the range of f(t⁽ⁱ⁾) with f(t⁽ⁱ⁾) can come from the enclosures of Taylor expansions of f around the midpoint mid (t⁽ⁱ⁾) through interval-extended automatic differentiation (e.g. [36]) that can then yield tighter estimates of the integral enclosures [37]. Third, we can employ pre-processing to improve efficiency. For example, we can pre-enclose the range of a possibly rescaled f over a partition of the domain and then obtain the enclosure of f over some arbitrary t through a combination of hash access and hull operations on the pre-enclosures. Such a pre-enclosing technique reduces not only the overestimation of target shapes with large DAGs but also the computational cost incurred while performing interval operations with processors that are optimized for floating-point arithmetic. In the next version of the MRS library we plan to extend interval arithmetic beyond $I ℝ^{n}$ to a class of multi-dimensional data-structures related to regular sub-pavings (e.g. [38]) to improve the efficiency of our sampler. Fourth, various contractors can be used to improve the range enclosure in polynomial time (e.g. [38]). The most promising contractors employ interval constraint propagation. Finally, efficiency at the possible cost of rigor can also be gained (up to 30%) by foregoing directed rounding during envelope construction.

Poor sampler efficiency currently makes it impractical to sample from trees with five leaves and 15 topologies. However, one could use triplets and quartets drawn from the posterior distribution to stochastically amalgamate and produce estimates of larger trees via fast amalgamating algorithms (e.g. [39, 40]), which may then be used to combat the slow mixing in MCMC methods [2] by providing a good set of initial trees. A collection of large trees obtained through such stochastic amalgamations would account for the effect of finite sample sizes (sequence length) as well as the sensitivity of the amalgamating algorithm itself to variation in the input vector of small tree estimates. It would be interesting to investigate whether such stochastic amalgamations can help improve mixing of MCMC algorithms on large tree spaces, although auto-validating rejection sampling via the natural interval extension of the likelihood function may not be practical for trees with more than four leaves.

Conclusion

None of the currently available punctual samplers can rigorously produce independent and identically distributed samples from the posterior distribution over phylogenetic tree spaces, even for 3 or 4 taxa. We describe a new approach for rigorously drawing samples from a target posterior distribution over small phylogenetic tree spaces using the theory of interval analysis. Our Moore rejection sampler (MRS), being an auto-validating von Neumann rejection sampler (RS), can produce independent samples from any target shape f whose DAG expression has a well-defined natural interval extension $\boldsymbol{f}$ over a compact domain $T$. MRS is said to be auto-validating because it automatically obtains a proposal g that is easy to simulate from, and an envelope $\hat{g}$ that is guaranteed to satisfy the envelope condition (1). MRS can circumvent the problems associated with (i) heuristic convergence diagnostics in MCMC samplers and (ii) pseudo-envelopes constructed via non-rigorous punctual methods in rejection samplers. When the target DAG is large, MRS becomes inefficient and may fail to produce the desired samples in a reasonable time, rather than (as may happen with non-rigorous methods) produce samples from some unknown and undesired target. MRS solves the open problem of rigorously drawing independent and identically distributed samples from the posterior distribution over small rooted and unrooted phylogenetic tree spaces (3 or 4 taxa) based on any multiply-aligned sequence data.

Appendix

Likelihoods for the CFN model on unrooted triplets

Recall that the probability that Y mutates to R, or vice versa, in time t is $a(t) := (1 - e^{-2t})/2$ and that the stationary distribution is π(R) = π(Y) = 1/2. Next we apply Algorithm 2 to compute the likelihood $l_{d_{\cdot, q}}$ at a given site q, which could be one of l₀(⁴t), l₁(⁴t),..., l₇(⁴t).

(19)

(20)

(21)

(22)

(23)

(24)

(25)

(26)
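The site-pattern likelihoods for an unrooted triplet can be sketched numerically as follows. The code sums over the two possible states of the single internal node, weighting each by the stationary probability 1/2 and using a(t) as the probability of a state change along a branch of length t. The function names are hypothetical, and patterns are labelled by their states (e.g. "RRY") because the correspondence with the indices l₀,..., l₇ is not assumed here.

```python
import math
from itertools import product


def a(t: float) -> float:
    """Probability that R mutates to Y (or Y to R) in time t under CFN."""
    return (1.0 - math.exp(-2.0 * t)) / 2.0


def site_pattern_likelihood(pattern, branch_lengths) -> float:
    """Likelihood of one site pattern on the unrooted triplet (star tree).

    pattern        : three leaf states, each 'R' or 'Y'
    branch_lengths : the three branch lengths joining the leaves to the
                     internal node
    """
    total = 0.0
    for internal_state in "RY":
        term = 0.5  # stationary probability of the internal state
        for leaf_state, t in zip(pattern, branch_lengths):
            change = a(t)
            term *= change if leaf_state != internal_state else 1.0 - change
        total += term
    return total


def all_pattern_likelihoods(branch_lengths):
    """Likelihoods of all eight site patterns, keyed by pattern string."""
    return {
        "".join(p): site_pattern_likelihood(p, branch_lengths)
        for p in product("RY", repeat=3)
    }


likelihoods = all_pattern_likelihoods((0.1, 0.2, 0.3))
```

Because the model is symmetric in R and Y, complementary patterns such as "RRY" and "YYR" receive equal likelihood, and the eight values sum to one.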

Proof of Theorem 1 (cf. [37])

Any real arithmetic operation x ⋆ y, where ⋆ ∈ {+, -, ×, /} and x, y ∈ ℝ, is a continuous function x ⋆ y := ⋆(x, y): ℝ ⊗ ℝ ↦ ℝ, except when y = 0 under the / operation. Since $\boldsymbol{x}$ and $\boldsymbol{y}$ are simply connected compact intervals, so is their Cartesian product $\boldsymbol{x} ⊗ \boldsymbol{y}$. On such a domain $\boldsymbol{x} ⊗ \boldsymbol{y}$, the continuity of ⋆(x, y) (except when ⋆ = / and 0 ∈ $\boldsymbol{y}$) ensures the attainment of a minimum, a maximum and all intermediate values. Therefore, with the exception of the case when ⋆ = / and 0 ∈ $\boldsymbol{y}$, the range $\boldsymbol{x} ⋆ \boldsymbol{y}$ has an interval form [min (x ⋆ y), max (x ⋆ y)], where the min and max are taken over all pairs (x, y) ∈ $\boldsymbol{x} ⊗ \boldsymbol{y}$. Fortunately, we do not have to evaluate x ⋆ y over every (x, y) ∈ $\boldsymbol{x} ⊗ \boldsymbol{y}$ to find the global min and global max of ⋆(x, y) over $\boldsymbol{x} ⊗ \boldsymbol{y}$, because the monotonicity of ⋆(x, y*) in terms of x ∈ $\boldsymbol{x}$ for any fixed y* ∈ $\boldsymbol{y}$ implies that the extremal values are attained on the boundary of $\boldsymbol{x} ⊗ \boldsymbol{y}$, i.e. at combinations of the endpoints $\underline{x}$, $\underline{y}$, $\bar{x}$ and $\bar{y}$. Thus the theorem can be verified by examining the finitely many boundary cases.
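The boundary argument can be made concrete with a short sketch that evaluates an operation only at the four corner pairs of the box and takes their minimum and maximum. The function name `interval_op` and the tuple representation of intervals are assumptions for illustration. Ordinary floating-point arithmetic is used here, whereas a rigorous implementation would round lower bounds down and upper bounds up.

```python
import operator
from itertools import product

# Real arithmetic operations, keyed by their usual symbols.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def interval_op(symbol, x, y):
    """Range of `symbol` over the box x ⊗ y, with x = (x_lo, x_hi) and
    y = (y_lo, y_hi), obtained from the four corner evaluations."""
    if symbol == "/" and y[0] <= 0.0 <= y[1]:
        # Division is excluded when the divisor interval contains zero.
        raise ZeroDivisionError("0 lies in the divisor interval")
    op = OPERATIONS[symbol]
    corners = [op(a, b) for a, b in product(x, y)]
    return (min(corners), max(corners))


sum_xy = interval_op("+", (1.0, 2.0), (3.0, 5.0))       # (4.0, 7.0)
diff_xy = interval_op("-", (1.0, 2.0), (3.0, 5.0))      # (-4.0, -1.0)
prod_xy = interval_op("*", (-1.0, 2.0), (3.0, 5.0))     # (-5.0, 10.0)
quot_xy = interval_op("/", (1.0, 2.0), (4.0, 8.0))      # (0.125, 0.5)
```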

Proof of Theorem 2

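$\boldsymbol{x} \star \boldsymbol{y} = \{x \star y : x \in \boldsymbol{x}, y \in \boldsymbol{y}\} \subseteq \{x \star y : x \in \boldsymbol{x}', y \in \boldsymbol{y}'\} = \boldsymbol{x}' \star \boldsymbol{y}'$.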

Proof of Theorem 3 (cf. [37])

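Since $\boldsymbol{f}(\boldsymbol{y})$ is well-defined, we will not run into division by zero, and therefore (i) follows from the repeated invocation of Theorem 2. We can prove (ii) by contradiction. Suppose range(f; $\boldsymbol{x}$) ⊈ $\boldsymbol{f}(\boldsymbol{x})$. Then there exists x ∈ $\boldsymbol{x}$ such that f(x) ∈ range(f; $\boldsymbol{x}$) but f(x) ∉ $\boldsymbol{f}(\boldsymbol{x})$. This in turn implies that f(x) = $\boldsymbol{f}$([x, x]) ∉ $\boldsymbol{f}(\boldsymbol{x})$, which contradicts (i) because [x, x] ⊆ $\boldsymbol{x}$. Therefore, our supposition cannot be true and we have proved (ii) range(f; $\boldsymbol{x}$) ⊆ $\boldsymbol{f}(\boldsymbol{x})$.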
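Both properties can be checked numerically on a small rational function. The sketch below defines a minimal interval type and evaluates the same expression f(x) = x/(1 + x²) once on intervals and once on points. The class `Interval`, the helper `_as_interval` and the choice of f are assumptions made for illustration, and the arithmetic is plain floating point without directed rounding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        other = _as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    __radd__ = __add__  # allows `1 + interval`

    def __mul__(self, other):
        other = _as_interval(other)
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(min(corners), max(corners))

    def __truediv__(self, other):
        other = _as_interval(other)
        if other.lo <= 0.0 <= other.hi:
            raise ZeroDivisionError("0 lies in the divisor interval")
        return self * Interval(1.0 / other.hi, 1.0 / other.lo)

    def contains(self, other) -> bool:
        other = _as_interval(other)
        return self.lo <= other.lo and other.hi <= self.hi


def _as_interval(value) -> Interval:
    """Treat a real number x as the thin interval [x, x]."""
    if isinstance(value, Interval):
        return value
    return Interval(float(value), float(value))


def f(x):
    """The same expression works for floats and for Intervals."""
    return x / (1 + x * x)


big = f(Interval(1.0, 2.0))    # natural interval extension over [1, 2]
small = f(Interval(1.2, 1.5))  # ... and over a sub-box of [1, 2]

isotonic = big.contains(small)
encloses_range = all(big.contains(f(1.0 + k / 100.0)) for k in range(101))
```

Here the true range of f over [1, 2] is [0.4, 0.5], whereas the natural interval extension returns the much wider [0.2, 1.0]. The enclosure is valid but overestimated, which is the behaviour that makes large DAGs expensive.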

Proof of Theorem 4 (cf. [37])

Any elementary function f ∈ $E$ with expression f is defined by the recursion (9) on its sub-expressions f_i, where i ∈ {1,..., n}, according to its DAG. If f(x) = p(x)/q(x) is a rational function, then the theorem already holds by Theorem 3, and if f ∈ $S$ then the theorem holds because the range enclosure is exact for standard functions. Thus it suffices to show that if the theorem holds for f₁, f₂ ∈ $E$, then the theorem also holds for f₁ ⋆ f₂, where ⋆ ∈ {+, -, /, ×, ◦}. By ◦ we mean the composition operator. Since the proof is analogous for all five operators, we only focus on the ◦ operator. Since $\boldsymbol{f}$ is well-defined on its domain $\boldsymbol{y}$, neither the real-valued f nor any of its sub-expressions f_i has singularities in its respective domain $\boldsymbol{y}_i$ induced by $\boldsymbol{y}$. In particular f₂ is continuous on any $\boldsymbol{x}_2$ and $\boldsymbol{x}'_2$ such that $\boldsymbol{x}_2 \subseteq \boldsymbol{x}'_2 \subseteq \boldsymbol{y}_2$, implying the compactness of $\boldsymbol{f}_2(\boldsymbol{x}_2) =: \boldsymbol{w}_2$ and $\boldsymbol{f}_2(\boldsymbol{x}'_2) =: \boldsymbol{w}'_2$, respectively. By our assumption that $\boldsymbol{f}_1$ and $\boldsymbol{f}_2$ are inclusion isotonic we have that $\boldsymbol{w}_2 \subseteq \boldsymbol{w}'_2$ and also that

\boldsymbol{f}_{1} \circ \boldsymbol{f}_{2} (\boldsymbol{x}_{2}) = \boldsymbol{f}_{1} (\boldsymbol{f}_{2} (\boldsymbol{x}_{2})) = \boldsymbol{f}_{1} (\boldsymbol{w}_{2}) \subseteq \boldsymbol{f}_{1} (\boldsymbol{w}'_{2}) = \boldsymbol{f}_{1} (\boldsymbol{f}_{2} (\boldsymbol{x}'_{2})) = \boldsymbol{f}_{1} \circ \boldsymbol{f}_{2} (\boldsymbol{x}'_{2}) .

The range enclosure is a consequence of inclusion isotony by an argument identical to that given in the proof for Theorem 3.

Proof of Theorem 5 (cf. [37])

The proof is by induction on the DAG for f, similar to the proof of Theorem 4 (see [37]).

Proof of Theorem 6

Let the domain $T$ of the target f^• be an element of $I ℝ^{n}$ . From (15) and (14) observe that ${\hat{g}}^{T} (t) = g^{T} (t) N_{{\hat{g}}^{T}}$ . Let us define the following two subsets of ℝⁿ⁺¹,

\begin{matrix} ℬ ({\hat{g}}^{T}) = \{(v, u) : v \in T, 0 \leq u \leq {\hat{g}}^{T} (v)\}, & \text{and} & ℬ (f) = \{(v, u) : v \in T, 0 \leq u \leq f (v)\} . \end{matrix}

Algorithm 1 first produces a sample from the random vector (V, U) that is uniformly distributed in $ℬ ({\hat{g}}^{T})$ . We can see this by letting h(v, u) denote the joint density of (V, U) and h(u|v) denote the conditional density of U given V = v. Then,

h (v, u) = \begin{cases} g^{T} (v) h (u | v) & \text{if } (v, u) \in ℬ ({\hat{g}}^{T}) \\ 0 & \text{otherwise.} \end{cases}

Since we sample a height u for a given v from the Uniform[0, ${\hat{g}}^{T} (v)$] distribution,

h (u | v) = \begin{cases} {({\hat{g}}^{T} (v))}^{- 1} = {(g^{T} (v) N_{{\hat{g}}^{T}})}^{- 1} & \text{if } u \in [0, {\hat{g}}^{T} (v)] \\ 0 & \text{otherwise.} \end{cases}

Therefore,

h (v, u) = \begin{cases} g^{T} (v) h (u | v) = g^{T} (v) {(g^{T} (v) N_{{\hat{g}}^{T}})}^{- 1} = {(N_{{\hat{g}}^{T}})}^{- 1} & \text{if } (v, u) \in ℬ ({\hat{g}}^{T}) \\ 0 & \text{otherwise.} \end{cases}

Thus, we have shown that the joint density of the random vector (V, U) initially produced by Algorithm 1 is uniformly distributed on $ℬ ({\hat{g}}^{T})$ . The above relationship also makes geometric sense since the volume of $ℬ ({\hat{g}}^{T})$ is exactly $N_{{\hat{g}}^{T}}$ .

Now, let (T, S) be the accepted random vector during the accept/reject step of Algorithm 1, i.e.

(T, S) = (V, U) \Leftrightarrow (V, U) \in ℬ (f) \subseteq ℬ ({\hat{g}}^{T}) .

Then, the uniform distribution of (V, U) on $ℬ ({\hat{g}}^{T})$ implies the uniform distribution of (T, S) on $ℬ (f)$. Since the volume of $ℬ (f)$ is N_f, the density of (T, S) is identically 1/N_f on $ℬ (f)$ and 0 elsewhere. Hence, the marginal density of T on $T$ is

\begin{aligned} \int_{0}^{f (t)} \frac{1}{N_{f}} \, d h &= \frac{1}{N_{f}} \int_{0}^{f (t)} 1 \, d h \\ &= \frac{1}{N_{f}} \int_{0}^{N_{f} f^{\bullet} (t)} 1 \, d h, \quad \because f^{\bullet} (t) = f (t) / N_{f} \\ &= f^{\bullet} (t) . \end{aligned}

Thus, we have shown that the accepted random vector T has the desired density f^•.
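The two-stage construction in this proof can be sketched for a one-dimensional toy target. The code below uses the unnormalized shape f(t) = t(1 - t) on [0, 1], a uniform partition into W boxes, and box-wise upper bounds hi·(1 - lo) obtained from the natural interval extension of t(1 - t) on [lo, hi] ⊆ [0, 1]. The target, the partition and every function name are assumptions for illustration. The sketch uses ordinary floating-point arithmetic and Python's pseudo-random generator, and it is not the MRS implementation.

```python
import random


def f(t: float) -> float:
    """Unnormalized target shape on [0, 1]; here N_f = 1/6."""
    return t * (1.0 - t)


def build_envelope(W: int):
    """Boxes of a uniform partition of [0, 1] and a constant upper bound
    on f over each box (the piecewise-constant envelope)."""
    boxes = [(i / W, (i + 1) / W) for i in range(W)]
    heights = [hi * (1.0 - lo) for lo, hi in boxes]
    return boxes, heights


def rejection_sample(n: int, W: int = 16, rng=None):
    """Draw n samples with density proportional to f.

    Returns the samples and the observed acceptance rate."""
    rng = rng or random.Random()
    boxes, heights = build_envelope(W)
    pieces = list(zip(boxes, heights))
    # Proposal g: choose a box with probability proportional to its
    # envelope area, then a point uniformly inside the box.
    areas = [(hi - lo) * h for (lo, hi), h in pieces]
    samples, proposals = [], 0
    while len(samples) < n:
        (lo, hi), h = rng.choices(pieces, weights=areas)[0]
        v = rng.uniform(lo, hi)   # V drawn from the proposal
        u = rng.uniform(0.0, h)   # U | V = v uniform on [0, envelope(v)]
        proposals += 1
        if u <= f(v):             # accept iff (v, u) lies under f
            samples.append(v)
    return samples, len(samples) / proposals


samples, acceptance_rate = rejection_sample(2000, W=16, rng=random.Random(1))
sample_mean = sum(samples) / len(samples)
```

For this target the normalized density is that of a Beta(2, 2) random variable, so the sample mean should be close to 1/2. The observed acceptance rate estimates the ratio of the area under f to the area under the envelope.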

Proof of Theorem 7

Due to Theorem 5,

\begin{aligned} wid (\boldsymbol{t}_{W}^{(i)}) = O (1 / W) &\Rightarrow {dist}_{\infty} (range (f; \boldsymbol{t}_{W}^{(i)}), \boldsymbol{f} (\boldsymbol{t}_{W}^{(i)})) = O (1 / W) \\ &\Rightarrow wid (\boldsymbol{f} (\boldsymbol{t}_{W}^{(i)})) = O (1 / W), \quad \because f \in E_{ℒ} . \end{aligned}

Therefore

\sum_{i = 1}^{| U_{W} |} (wid (\boldsymbol{t}_{W}^{(i)}) \cdot \boldsymbol{f} (\boldsymbol{t}_{W}^{(i)})) = w \sum_{i = 1}^{W} \boldsymbol{f} ([\underline{t} + (i - 1) w, \underline{t} + i w]),

and we have

\begin{matrix} wid (w \sum_{i = 1}^{W} \boldsymbol{f} (\boldsymbol{t}_{W}^{(i)})) = O (1 / W) & \Rightarrow & A (U_{W}) = 1 - O (1 / W) . \end{matrix}

Therefore the lower bound for the acceptance probability A( $U_{W}$ ) of MRS approaches 1 no slower than linearly with the refinement of $T$ by $U_{W}$ . Note that this should hold for a general nonuniform partition with w replaced by the mesh.
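The linear rate can be observed on a toy example. The sketch below takes f(t) = t(1 - t) on [0, 1] with a uniform partition into W boxes and assumes that the lower bound on the acceptance probability is the ratio of the lower to the upper interval Riemann-type sums, using the natural interval extension [lo·(1 - hi), hi·(1 - lo)] on each box [lo, hi]. The target, that form of the bound and the function name are assumptions for illustration, and the arithmetic is ordinary floating point.

```python
def acceptance_lower_bound(W: int) -> float:
    """Ratio of lower to upper enclosure sums for f(t) = t * (1 - t)
    over a uniform partition of [0, 1] into W boxes."""
    w = 1.0 / W
    lower_sum = 0.0
    upper_sum = 0.0
    for i in range(W):
        lo, hi = i / W, (i + 1) / W
        # Natural interval extension of t * (1 - t) on [lo, hi] within [0, 1].
        lower_sum += w * (lo * (1.0 - hi))
        upper_sum += w * (hi * (1.0 - lo))
    return lower_sum / upper_sum


bounds = {W: acceptance_lower_bound(W) for W in (10, 100, 1000)}
# Scaled gaps W * (1 - bound) stay bounded, consistent with 1 - O(1/W).
scaled_gaps = {W: W * (1.0 - b) for W, b in bounds.items()}
```

For this example the bound works out to (W - 1)(W - 2)/((W + 1)(W + 2)), giving about 0.545, 0.942 and 0.994 for W = 10, 100 and 1000, while the scaled gap W(1 - bound) remains below 6.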

References

[1] Jones G, Hobert J: Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statistical Science. 2001, 16 (4): 312-334. 10.1214/ss/1015346317.
[2] Mossel E, Vigoda E: Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science. 2005, 309: 2207-2209.
[3] von Neumann J: Various techniques used in connection with random digits. John Von Neumann, Collected Works. 1963, V: Oxford University Press
[4] Walker A: An efficient method for generating discrete random variables with general distributions. ACM Trans on Mathematical Software. 1977, 3: 253-256. 10.1145/355744.355749.
[5] Sainudiin R: Machine interval experiments. PhD dissertation. 2005, Cornell University, Ithaca, New York
[6] Moore R: Interval analysis. 1967, Prentice-Hall
[7] Semple C, Steel M: Phylogenetics. 2003, Oxford University Press
[8] Felsenstein J: Inferring phylogenies. 2003, Sunderland, MA: Sinauer Associates
[9] Yang Z: Computational molecular evolution. 2006, UK: Oxford University Press
[10] Moore R: Methods and applications of interval analysis. 1979, Philadelphia, Pennsylvania: SIAM
[11] Alefeld G, Herzberger J: An introduction to interval computations. 1983, Academic Press
[12] Hammer R, Hocks M, Kulisch U, Ratz D: C++ toolbox for verified computing: basic numerical problems. 1995, Springer-Verlag
[13] Kulisch U, Lohner R, Facius A: Perspectives on enclosure methods. 2001, Springer-Verlag
[14] Matsumoto M, Nishimura T: Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans Model Comput Simul. 1998, 8: 3-30. 10.1145/272991.272995.
[15] Williams D: Weighing the Odds: A Course in Probability and Statistics. 2001, Cambridge University Press
[16] Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. Jnl Mol Evol. 1981, 17: 368-376. 10.1007/BF01734359.
[17] Yang Z: Complexity of the simplest phylogenetic estimation problem. Proceedings Royal Soc London B Biol Sci. 2000, 267: 109-119. 10.1098/rspb.2000.0974.
[18] Evans W, Kenyon C, Peres Y, Schulman L: Broadcasting on trees and the Ising model. Advances in Applied Probability. 2000, 10: 410-433. 10.1214/aoap/1019487349.
[19] Neyman J: Molecular studies of evolution: a source of novel statistical problems. Statistical decision theory and related topics. Edited by: Gupta S, Yackel J. 1971, 1-27. New York Academy Press
[20] Jukes T, Cantor C: Evolution of protein molecules. Mammalian Protein Metabolism. Edited by: Munro H. 1969, 21-32. New York Academic Press
[21] Saitou N: Property and efficiency of the maximum likelihood method for molecular phylogeny. Jnl Mol Evol. 1988, 27: 261-273. 10.1007/BF02100082.
[22] Yang Z: Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Syst Biol. 1994, 43: 329-342. 10.2307/2413672.
[23] Hosten S, Khetan A, Sturmfels B: Solving the likelihood equations. Found Comput Math. 2005, 5 (4): 389-407. 10.1007/s10208-004-0156-8.
[24] Casanellas M, Garcia L, Sullivant S: Catalog of small trees. Algebraic statistics for computational biology. Edited by: Pachter L, Sturmfels B. 2005, 291-304. Cambridge University Press
[25] Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Jnl Mol Evol. 1985, 22: 160-174. 10.1007/BF02101694.
[26] Sainudiin R, Yoshida R: Applications of interval methods to phylogenetic trees. Algebraic statistics for computational biology. Edited by: Pachter L, Sturmfels B. 2005, 359-374. Cambridge University Press
[27] Brown W, Prager E, Wang A, Wilson A: Mitochondrial DNA sequences of primates, tempo and mode of evolution. Jnl Mol Evol. 1982, 18: 225-239. 10.1007/BF01734101.
[28] Marsaglia G: Generating discrete random numbers in a computer. Comm ACM. 1963, 6: 37-38. 10.1145/366193.366228.
[29] Galassi M, Davies J, Theiler J, Gough B, Jungman G, Booth M, Rossi F: GNU Scientific Library Reference Manual. 2003, Network Theory Ltd, 2, http://www.gnu.org/software/gsl/
[30] Hofschuster, Krämer: C-XSC 2.0: A C++ library for extended scientific computing. Numerical software with result verification, Lecture notes in computer science. Edited by: Alt R, Frommer A, Kearfott R, Luther W. 2004, 2991: 15-35. Springer-Verlag
[31] Rannala B, Yang Z: Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Jnl Mol Evol. 1996, 43: 304-311. 10.1007/BF02338839.
[32] Yang Z, Rannala B: Branch-length prior influences Bayesian posterior probability of phylogeny. Syst Biol. 2005, 54: 455-470.
[33] Green R, Krause J, Ptak S, Briggs A, Ronan M, Simons J, Du L, Egholm M, Rothberg J, Paunovic M, Pääbo S: Analysis of one million base pairs of Neandertal DNA. Nature. 2006, 444: 330-336.
[34] Efron B, Halloran E, Holmes S: Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci. 1996, 93: 13429-13429.
[35] Gilks W, Wild P: Adaptive rejection sampling for Gibbs sampling. Applied Statistics. 1992, 41: 337-348. 10.2307/2347565.
[36] Kulisch U: Advanced arithmetic for the digital computer, interval arithmetic revisited. Perspectives on enclosure methods. Edited by: Kulisch U, Lohner R, Facius A. 2001, 50-70. Springer-Verlag
[37] Tucker W: Auto-validating numerical methods. 2004, Lecture notes, Uppsala University
[38] Jaulin L, Kieffer M, Didrit O, Walter E: Applied interval analysis: with examples in parameter and state estimation, robust control and robotics. 2004, Springer-Verlag
[39] Strimmer K, von Haeseler A: Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol. 1996, 13: 964-969.
[40] Levy D, Yoshida R, Pachter L: Beyond pairwise distances: neighbor joining with phylogenetic diversity estimates. Mol Biol Evol. 2006, 23: 491-498.

Acknowledgements

R.S. is a Research Fellow of the Royal Commission for the Exhibition of 1851. This work was partly supported by a joint NSF/NIGMS grant DMS-02-01037. Many thanks to Rob Strawderman, Warwick Tucker and Stephane Aris-Brosou for constructive comments, Joe Felsenstein for clarifying the transition probabilities under the HKY model and Ziheng Yang for various clarifications and encouragement.

Author information

Authors and Affiliations

Department of Statistics, University of Oxford, Oxford, OX1 3TG, UK
Raazesh Sainudiin
Biomathematics Research Centre, Department of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
Raazesh Sainudiin
Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, 14853, USA
Thomas York
Boyce Thompson Institute for Plant Research, Cornell University, Ithaca, New York, 14853, USA
Thomas York

Corresponding author

Correspondence to Raazesh Sainudiin.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RS developed the basic algorithm, analyzed the data and wrote the first draft. TY improved the object-oriented interface and refined the final implementation of the algorithm. Both authors edited the manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Sainudiin, R., York, T. Auto-validating von Neumann rejection sampling from small phylogenetic tree spaces. Algorithms Mol Biol 4, 1 (2009). https://doi.org/10.1186/1748-7188-4-1

Received: 05 June 2007
Accepted: 07 January 2009
Published: 07 January 2009
DOI: https://doi.org/10.1186/1748-7188-4-1
