Choosing a prior distribution for likelihood (18) is not trivial, especially since there is no universal notion of "complete prior ignorance" [9, 37, 40]. The choice of prior is therefore governed by whichever criteria the investigator deems important. One nearly universally-accepted criterion is that of *reparameterization-invariance*, which requires that the inference not depend on the units of either the parameters or the observations. The importance of such invariance has been detailed by Jeffreys [39], Wallace and Freeman [41], Jermyn [42], and many others. In our context, such invariance ensures that parameterizing the likelihood by, for example, either "the expected number of observed misincorporations per PCR cycle" or its reciprocal, "the expected number of PCR cycles before a misincorporation is observed", yields equivalent inferences. Since there is no meaningful physical difference between these two parameterizations, it is essential that each yield equivalent results.

Even under highly-mutagenic conditions, the error probabilities given by the columns of *T* are known *a priori* to be very close to the extremes of either zero or one. *Taq* polymerase is one of the best-studied polymerases in molecular biology and its error rate is known to be heavily influenced by the precise experimental conditions under which it is used [6, 43]. Therefore, since the precise *relative scales* of the different types of polymerase errors are not known, we derived a prior for *T* that is 'objective' in the sense of Berger *et al*. [38]. An 'objective' prior is one that formally minimizes, in an information-theoretic sense, the influence of the prior on the posterior and is constructed to always be invariant to reparameterization.

#### An Objective Polymerase Prior

Fortunately, there is a common case where selection of the prior is almost universally accepted, and that is when the model parameters form an Abelian Lie group [9], or in other words, when the parameter space is a commutative group on a differentiable manifold. To see that the 16 parameters of *T* conform to this special structure, consider the 16-dimensional parameter column-vector

$\tilde{T}=\mathrm{ln}\left\{\text{vec}T\right\},$

(19)

where 'vec' is the standard matrix-vectorization operator. Each column-constraint of *T*, namely that it sum to one, is equivalent to removing a particular one-dimensional subspace from span $\left\{\tilde{T}\right\}$.

Removing these constrained subspaces results in the quotient-space

$\text{span}\left\{\tilde{T}\right\}/\text{span}\left\{{e}_{1},{e}_{2},{e}_{3},{e}_{4}\right\},$

(20)

where

$\begin{array}{l}{e}_{\text{1}}=\text{vec}\left[{1}_{\text{4}},{0}_{\text{4}},{0}_{\text{4}},{0}_{\text{4}}\right],\\ {e}_{\text{2}}=\text{vec}\left[{0}_{\text{4}},{1}_{\text{4}},{0}_{\text{4}},{0}_{\text{4}}\right],\\ {e}_{\text{3}}=\text{vec}\left[{0}_{\text{4}},{0}_{\text{4}},{1}_{\text{4}},{0}_{\text{4}}\right],\text{and}\\ {e}_{\text{4}}=\text{vec}\left[{0}_{\text{4}},{0}_{\text{4}},{0}_{\text{4}},{1}_{\text{4}}\right].\end{array}$

The quotient-space (20), which fully describes the 16-parameter *T*, is therefore isomorphic to the 12-dimensional additive Lie group ℝ^{12}, which is itself a Hilbert space [27].
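To make this quotient structure concrete, the following Python sketch (with purely illustrative values for *T*) verifies that translating $\tilde{T}$ along any $e_k$ merely rescales one column of *T*, a degree of freedom the column-sum constraint removes:

```python
import numpy as np

# Hypothetical 4x4 column-stochastic error matrix T (columns sum to one);
# the numeric values are illustrative only.
T = np.array([[0.97, 0.01, 0.01, 0.02],
              [0.01, 0.96, 0.02, 0.01],
              [0.01, 0.02, 0.96, 0.01],
              [0.01, 0.01, 0.01, 0.96]])

T_tilde = np.log(T.flatten(order="F"))  # vec stacks columns; ln applied entrywise

# Basis of the removed subspace: e_k has ones in the k-th column block.
E = np.zeros((4, 16))
for k in range(4):
    E[k, 4*k:4*(k+1)] = 1.0

# Adding a multiple of e_k to T_tilde rescales column k of T; renormalizing
# the columns recovers the original T, so the direction carries no information.
shifted = T_tilde + 0.5 * E[2]
T_shifted = np.exp(shifted).reshape(4, 4, order="F")
T_renorm = T_shifted / T_shifted.sum(axis=0)
assert np.allclose(T_renorm, T)
```

The four directions $e_1,\dots,e_4$ are exactly the zero modes quotiented out in (20), leaving 12 free parameters.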

Further justification for assuming that ℝ^{12} is the 'correct' structure with which to understand *T* comes from realizing that although the columns of *T* are parameters in our likelihood function, they are also discrete and finite probability densities in their own right. Each column of *T* represents the *relative* frequency of concrete events, namely *Taq* misincorporations. When normalized, they represent multinomial probabilities; when non-normalized they represent relative Poisson frequencies.

Our model is only interested in relative rates as given by the multinomial model, and it is well-known that the natural parameter space of the multinomial distribution, as a member of the exponential family, is the logarithm of the multinomial parameters.

Given the special Lie group structure of *T*, the appropriate prior is generally considered to be that of Jeffreys [39]. The Jeffreys prior *J* is defined as

$J\left(T\right)=\sqrt{\left|\mathrm{det}F\left(T\right)\right|},$

(21)

where *F*(*T*) denotes the Fisher Information Matrix (FIM) of the given likelihood model, in this case given by (18). However, it is not entirely obvious *which* likelihood model should be utilized. The likelihoods of *codon* mutation, as given by (5) and (6), condition on the codon-mutation probabilities. However, our null-hypothesis is computed at the *nucleotide* level, because it states that the selected-clone mutation probabilities are consistent with the unselected-clone mutation probabilities.

Incidentally, we note that it is widely reported in the literature that Jeffreys' prior fails for 'simple' distributions such as the univariate normal with unknown mean and variance. In fact, Jeffreys' prior yields results compatible with frequentist methods if a parameterization with *Abelian* Lie group structure is enforced.

The parameters of *T* are therefore estimated from the 4 × 4 matrix of observed nucleotide point-mutation counts *c*_{ij}. These counts enumerate the number of times a wild-type nucleotide of type *j* was observed to be of type *i* in a sequenced clone. Since mutations are relatively rare, the diagonal *c*_{ii} elements are expected to be several orders of magnitude greater than the off-diagonals. The likelihood function (18) can be more compactly written as

$\mathrm{ln}L\propto {\displaystyle \sum _{ij}{c}_{ij}\mathrm{ln}\left({p}_{ij}(T)\right)},$

(22)

which, again, is the log-sum of four independent multinomial processes. The parameters *p*_{ij} are the probabilities that wild-type nucleotide *j* will be mutated to nucleotide *i* after *k* cycles of mutagenic PCR, with *implicit* conditioning on the wild-type nucleotide being given as *j*. For (22), the corresponding 16 × 16 matrix *F* is defined entry-wise as

${\left[F(T)\right]}_{kl,mn}=-\mathbb{E}\left[\frac{{\partial}^{2}\mathrm{ln}L}{\partial {\tau}_{kl}\partial {\tau}_{mn}}|T\right],$

(23)

with the expectation being taken over all possible observations.

Explicit computation of the FIM is straightforward and can be accomplished by noting that

$\frac{\partial \mathrm{ln}L}{\partial {\tau}_{kl}}={\displaystyle \sum _{ij}\frac{{c}_{ij}}{{p}_{ij}}\frac{\partial {p}_{ij}}{\partial {\tau}_{kl}}}$

(24)

which implies that

$\begin{array}{c}\frac{{\partial}^{2}\mathrm{ln}L}{\partial {\tau}_{kl}\partial {\tau}_{mn}}={\displaystyle \sum _{ij}\frac{\partial}{\partial {\tau}_{mn}}\left[\frac{{c}_{ij}}{{p}_{ij}}\frac{\partial {p}_{ij}}{\partial {\tau}_{kl}}\right]}\\ ={\displaystyle \sum _{ij}}{c}_{ij}\left[\frac{{\partial}^{2}{p}_{ij}}{\partial {\tau}_{kl}\partial {\tau}_{mn}}\frac{1}{{p}_{ij}}-\frac{\partial {p}_{ij}}{\partial {\tau}_{kl}}\frac{\partial {p}_{ij}}{\partial {\tau}_{mn}}\frac{1}{{p}_{ij}^{2}}\right].\end{array}$

(25)
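As a quick consistency check on (25), consider a single multinomial column in the simplest $k=1$ case, where the probabilities are the softmax of their log parameters. The Hessian assembled term-by-term from (25) must then agree with the closed form $-N\,p_k(\delta_{km}-p_m)$ obtained by differentiating $\mathrm{ln}L$ directly. A Python sketch with illustrative counts and probabilities:

```python
import numpy as np

# One multinomial column in its natural (log) parameterization: p = softmax(tau).
c = np.array([970.0, 10.0, 12.0, 8.0])   # illustrative counts c_ij for fixed j
p = np.array([0.90, 0.03, 0.04, 0.03])   # corresponding probabilities p_ij
N = c.sum()
I = np.eye(4)

# Softmax derivatives: dp[i,k] = p_i (delta_ik - p_k), and its tau-derivative
# d2p[i,k,m] = p_i [(delta_ik - p_k)(delta_im - p_m) - p_k (delta_km - p_m)].
dp = p[:, None] * (I - p[None, :])
d2p = np.empty((4, 4, 4))
for i in range(4):
    for k in range(4):
        for m in range(4):
            d2p[i, k, m] = p[i] * ((I[i, k] - p[k]) * (I[i, m] - p[m])
                                   - p[k] * (I[k, m] - p[m]))

# Hessian of ln L assembled exactly as in (25).
H = np.zeros((4, 4))
for k in range(4):
    for m in range(4):
        H[k, m] = sum(c[i] * (d2p[i, k, m] / p[i] - dp[i, k] * dp[i, m] / p[i] ** 2)
                      for i in range(4))

# Direct differentiation of ln L gives H[k,m] = -N p_k (delta_km - p_m).
H_direct = -N * p[:, None] * (I - p[None, :])
assert np.allclose(H, H_direct)
```

Both routes are the same Hessian by the chain rule, so the check holds for any counts *c*.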

Taking the expectation of (25) as per (23) requires only evaluating $\mathbb{E}[{c}_{ij}]$. Since (22) is the log-sum of four multinomials, we can evaluate

$\mathbb{E}[{c}_{ij}]={c}_{++}{p}_{ij}{p}_{j},$

(26)

where *c*_{++} is the total number of observed wild-type-to-mutant-clone nucleotide pairs and *p*_{j} is the probability that the wild-type nucleotide is of type *j*. Estimation of *p*_{j} is equivalent to determining the (G+C)/(A+T) content ratio of the wild-type sequence.

The nature of the genetic code, be it standard or otherwise, makes it extremely unusual for an organism to code for a protein using a nucleotide sequence with (G+C)-content approaching either 0% or 100%. The fraction (G+C)/(A+T) should then be very well approximated by the ratio of the (G+C) to (A+T) counts. This approximation allows us to assume that the expected (G+C)-content is equal to the observed (G+C)-content, with the result that (26) can be simplified to

$\mathbb{E}[{c}_{ij}]={c}_{+j}{p}_{ij},$

(27)

where *c*_{+j} is the row-vector of column-totals of *c*_{ij} and can be taken as a given value. Note that in the pedagogical or extremely rare case that (27) is not accurate, the (G+C)-content can always be estimated via standard Bayesian methods at the expense of the simplified computation that we utilize. It is also worth noting that mutagenic PCR never incorporates enough nucleotide changes to appreciably change the (G+C)-content.
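The algebraic step from (26) to (27) can be checked with illustrative numbers: once the observed column totals are taken to equal their expectations $c_{++}p_j$, the two expressions for $\mathbb{E}[c_{ij}]$ coincide exactly. All values below are illustrative:

```python
import numpy as np

# Illustrative base composition p_j and conditional mutation probabilities p_ij
# (columns of P play the role of the per-wild-type multinomial probabilities).
p_j = np.array([0.22, 0.28, 0.27, 0.23])   # P(wild-type nucleotide = j)
P = np.array([[0.97, 0.01, 0.01, 0.02],
              [0.01, 0.96, 0.02, 0.01],
              [0.01, 0.02, 0.96, 0.01],
              [0.01, 0.01, 0.01, 0.96]])
c_pp = 4000.0                              # total nucleotide pairs, c_++

# Assume observed column totals equal their expectations: c_{+j} = c_{++} p_j.
c_plus_j = c_pp * p_j

E26 = c_pp * P * p_j[None, :]              # (26): E[c_ij] = c_++ p_ij p_j
E27 = P * c_plus_j[None, :]                # (27): E[c_ij] = c_{+j} p_ij
assert np.allclose(E26, E27)
```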

The final expression for each entry of the Fisher Information Matrix [*F*(*T*)]_{kl,mn} is therefore proportional to the negative of

$\sum _{ij}{c}_{+j}{p}_{ij}\left[\frac{{\partial}^{2}{p}_{ij}}{\partial {\tau}_{kl}\partial {\tau}_{mn}}\frac{1}{{p}_{ij}}-\frac{\partial {p}_{ij}}{\partial {\tau}_{kl}}\frac{\partial {p}_{ij}}{\partial {\tau}_{mn}}\frac{1}{{p}_{ij}^{2}}\right],$

(28)

where each parameter *p*_{ij} is a function of *T*, and Jeffreys' prior is easily computed via the 12-dimensional pseudo-determinant. All that remains is to compute the first- and second-order derivatives of *p*_{ij} with respect to the 16 entries of *T*.
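The pseudo-determinant computation itself is simple. As an illustrative sketch, take the single-cycle case ($k=1$), where the FIM in the log parameterization reduces to a block-diagonal matrix with one rank-3 multinomial block $c_{+j}(\mathrm{diag}\,p_j - p_jp_j^{\mathsf T})$ per column; the four zero eigenvalues are exactly the quotiented column-rescaling directions $e_1,\dots,e_4$. The matrix and counts below are illustrative:

```python
import numpy as np

# Illustrative column-stochastic T and per-column totals c_{+j}.
T = np.array([[0.97, 0.01, 0.01, 0.02],
              [0.01, 0.96, 0.02, 0.01],
              [0.01, 0.02, 0.96, 0.01],
              [0.01, 0.01, 0.01, 0.96]])
c_plus = np.array([1000.0, 800.0, 900.0, 1100.0])

# Single-cycle FIM in log parameters: block-diagonal, one multinomial
# block c_{+j} (diag(p) - p p^T) per column, each of rank 3.
F = np.zeros((16, 16))
for j in range(4):
    p = T[:, j]
    F[4*j:4*j+4, 4*j:4*j+4] = c_plus[j] * (np.diag(p) - np.outer(p, p))

# Jeffreys' prior via the pseudo-determinant: the product of the 12
# non-zero eigenvalues (four zero modes are the column-rescaling directions).
eig = np.linalg.eigvalsh(F)
nonzero = eig[eig > 1e-9 * eig.max()]
assert nonzero.size == 12
J = np.sqrt(np.prod(nonzero))
```

For the full $k$-cycle model, the blocks are instead assembled from the derivatives of *p*_{ij} with respect to *T* described above, but the pseudo-determinant step is identical.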

These derivatives could be computed analytically. However, since typical experiments use on the order of *k* ≈ 30 PCR cycles, analytic derivatives of *P* with respect to *T* result in unwieldy and numerically-unstable expressions. Instead, second-order differentiation arithmetic [44, 45] is used, via operator overloading in Fortran, to compute all required derivatives of *p*_{ij} with respect to *T*. Briefly, differentiation arithmetic computes a function and its gradient and Hessian simultaneously by utilizing the algebra of differential operators. The simple structure of all relevant equations makes them particularly amenable to straightforward implementation.
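Our implementation is in Fortran, but the idea behind second-order differentiation arithmetic is compact enough to sketch in Python: a number type carries (value, gradient, Hessian) and its overloaded operators propagate all three simultaneously via the sum and product rules. The class and the test function below are illustrative, not our production code:

```python
import numpy as np

class D2:
    """Second-order forward-mode number: value, gradient, and Hessian."""
    def __init__(self, val, grad, hess):
        self.val, self.grad, self.hess = val, grad, hess

    @staticmethod
    def var(val, i, n):
        # Independent variable number i of n: unit gradient, zero Hessian.
        g = np.zeros(n); g[i] = 1.0
        return D2(val, g, np.zeros((n, n)))

    def __add__(self, other):
        # Sum rule: derivatives add componentwise.
        return D2(self.val + other.val,
                  self.grad + other.grad,
                  self.hess + other.hess)

    def __mul__(self, other):
        # Product rule: (uv)'' = u''v + uv'' + u' (x) v' + v' (x) u'.
        return D2(self.val * other.val,
                  self.val * other.grad + other.val * self.grad,
                  self.val * other.hess + other.val * self.hess
                  + np.outer(self.grad, other.grad)
                  + np.outer(other.grad, self.grad))

# Example: f(x, y) = x*y + x*x at (3, 5).
x = D2.var(3.0, 0, 2)
y = D2.var(5.0, 1, 2)
f = x * y + x * x
assert f.val == 24.0
assert np.allclose(f.grad, [11.0, 3.0])          # (y + 2x, x)
assert np.allclose(f.hess, [[2.0, 1.0], [1.0, 0.0]])
```

Extending the class with `exp`, `log`, and division gives everything needed to push the derivatives of *p*_{ij} through *k* repeated matrix products.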

#### Sampling from the Posterior

Computing *F*(*T*) has the secondary benefit of simplifying the procedure of realizing samples from the posterior distribution of *T* as well. This simplification arises from two sources. First, under Jeffreys' prior the maximum *a posteriori* (MAP) parameter estimate of *T* can be used as a central point-estimate of *T* that is invariant to reparameterization [41, 42]. The MAP point-estimate is similar, conceptually, to the point-estimate given by maximum-likelihood methods. Second, given the central MAP estimate $\widehat{T}(C)$, it is well-known that, asymptotically with the sample-size,

$\widehat{T}\left(C\right)\to \mathcal{N}\left(T,{\tilde{F}}^{-1}\left(T\right)\right),$

(29)

where $\mathcal{N}$ denotes the multivariate normal and ${\tilde{F}}^{-1}\left(T\right)$ denotes the inverse of the FIM divided by the sample-size used to estimate $\widehat{T}$. Frequencies *P* computed from the MAP estimate of *T* are shown in Table 2 and appear very similar to those estimated via relative frequencies and natural parameters. Even though (29) describes an asymptotic relationship, it can be exploited to sample the *exact* posterior of *T* via the Metropolis algorithm [9]. Specifically, the posterior mean of *T*, computed via (13), can be combined with our estimate of *F* via (29) to yield a viable proposal function for Metropolis sampling. With Metropolis sampling, differences between the true and approximate posterior of *T* are eliminated by the accept-reject step. The approximation only affects sampling efficiency, not accuracy: the poorer the approximation, the larger the proportion of rejected samples. In practice, we find that acceptance ratios even as low as 1-10% yield posterior samples more than rapidly enough for practical analysis of large proteins.
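The sampling scheme can be sketched generically: a random-walk Metropolis kernel whose fixed Gaussian proposal covariance would, in our setting, be the scaled inverse FIM from (29). The function and the toy standard-normal target below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_post, theta0, prop_cov, n_samples):
    """Random-walk Metropolis with a fixed Gaussian proposal covariance.

    In our setting prop_cov would be the (pseudo-)inverse FIM divided by
    the sample size; here it is an arbitrary positive-definite matrix."""
    L = np.linalg.cholesky(prop_cov)
    theta, lp = theta0, log_post(theta0)
    samples, accepted = [], 0
    for _ in range(n_samples):
        cand = theta + L @ rng.standard_normal(theta.size)
        lp_cand = log_post(cand)
        # Accept-reject step: a poor proposal only lowers the acceptance
        # rate; the stationary distribution is still the exact posterior.
        if np.log(rng.random()) < lp_cand - lp:
            theta, lp = cand, lp_cand
            accepted += 1
        samples.append(theta)
    return np.array(samples), accepted / n_samples

# Toy usage: sample a 2-d standard normal "posterior".
log_post = lambda t: -0.5 * t @ t
samples, acc = metropolis(log_post, np.zeros(2), 0.5 * np.eye(2), 2000)
assert samples.shape == (2000, 2)
assert 0.0 < acc < 1.0
```

Because only the log-posterior ratio enters the acceptance test, unnormalized posteriors such as likelihood × Jeffreys prior can be sampled directly.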