# Stochastic errors vs. modeling errors in distance based phylogenetic reconstructions

Daniel Doerr^{1}, Ilan Gronau^{2}, Shlomo Moran^{3} (corresponding author) and Irad Yavneh^{3}

**7**:22

https://doi.org/10.1186/1748-7188-7-22

© Doerr et al.; licensee BioMed Central Ltd. 2012

**Received:** 23 December 2011

**Accepted:** 28 June 2012

**Published:** 31 August 2012

## Abstract

### Background

Distance-based phylogenetic reconstruction methods use evolutionary distances between species in order to reconstruct the phylogenetic tree spanning them. There are many different methods for estimating distances from sequence data. These methods assume different substitution models and have different statistical properties. Since the true substitution model is typically unknown, it is important to consider the effect of model misspecification on the performance of a distance estimation method.

### Results

This paper continues the line of research which attempts to adjust to each given set of input sequences a distance function which maximizes the expected topological accuracy of the reconstructed tree. We focus here on the effect of systematic error caused by assuming an inadequate model, but consider also the stochastic error caused by using short sequences. We introduce a theoretical framework for analyzing both sources of error based on the notion of *deviation from additivity*, which quantifies the contribution of model misspecification to the estimation error. We demonstrate this framework by studying the behavior of the Jukes-Cantor distance function when applied to data generated according to Kimura’s two-parameter model with a transition-transversion bias. We provide both a theoretical derivation for this case, and a detailed simulation study on quartet trees.

### Conclusions

We demonstrate both analytically and experimentally that by deliberately assuming an oversimplified evolutionary model, it is possible to increase the topological accuracy of reconstruction. Our theoretical framework provides new insights into the mechanisms that enable statistically inconsistent reconstruction methods to outperform consistent methods.

### Keywords

Phylogenetic reconstructions; Substitution models; Additive substitution rate functions

## Introduction

Phylogenetic reconstruction is the task of determining the topology of an evolutionary tree underlying a given set of samples (species) using sequence data extracted from them. This is typically done by assuming some simplified model for DNA sequence evolution, in most cases modeling it as a homogeneous continuous-time Markov process[1–3]. Distance-based reconstruction algorithms tackle this task by first computing a set of $\binom{n}{2}$ pairwise distances between the *n* input samples and then finding a tree which fits these distances. The distance measures used for this purpose typically reflect the rates of certain substitution events along the evolutionary paths in question. We thus refer to these distance measures as *substitution rate (SR) functions*. The distance-based approach is based on the fact that if the SR function used is *additive* for the underlying substitution model, and the input sequences are sufficiently long, then the topology of the true tree can be efficiently recovered with high probability. However, since the underlying evolutionary model is usually unknown, this assumption is rarely satisfied in practice.

Substitution models used for phylogenetic reconstruction range from the simplest Jukes-Cantor (JC) model[4], through slightly more complex and flexible models, such as Kimura’s two-parameter (K2P) model[5] and the Hasegawa-Kishino-Yano model (HKY)[6], to the General Time-Reversible (GTR) model[7, 8]. In previous works[9, 10] we observed that substitution models which are not too restrictive or too general have many inherently different additive SR functions. We used this basic observation to demonstrate that it is possible to adjust for each given set of DNA sequences a “good” additive SR function, which leads to significantly increased phylogenetic reconstruction accuracy, compared to other additive SR functions. This exploits our ability to predict the *stochastic noise* associated with each SR function. When the SR function used for distance estimation is additive for the underlying substitution model, this stochastic noise is the only cause for inaccurate reconstruction. However, in the scenario, which is very common in practice, where the SR function in use is not additive for the model, an additional systematic bias is introduced in the distance estimates. This systematic bias in distance estimation results in a phylogenetic reconstruction method that might be statistically inconsistent in some cases. In this paper^{a}, we extend our previous line of research to this scenario, by removing the constraint of additivity. We do this by considering both the stochastic noise and systematic error.

Several previous studies have demonstrated the utility of phylogenetic reconstruction methods that are not generally statistically consistent. The maximum parsimony method has been long known to be inconsistent in some cases[11, 12]. However, in other cases it was shown to be more likely to produce accurate reconstructions, compared with the maximum likelihood method[13–15]. More recently, it has been demonstrated that reconstruction accuracy can be improved by deliberately assuming an oversimplified substitution model, when reconstructing a tree using maximum likelihood[16, 17]. In the context of distance-based reconstruction, non-additive distance measures have been shown in several cases to lead to improved accuracy when compared with additive measures[18, 19]. Overall, these studies provide convincing evidence for the need to consider inconsistent phylogenetic reconstruction methods. However, none of them provide a rigorous framework for characterizing the cases in which inconsistent methods outperform consistent ones.

In this paper we develop a theoretical framework which provides a practical and systematic way to quantify the effect of distance-estimation-bias on the accuracy of distance-based reconstruction. This framework is based on a novel method for measuring the *deviation from additivity* of SR functions. Coupled with the results in[9], this method enables evaluation of both the systematic bias and stochastic noise of SR functions. Such evaluation is important, because there is often a tradeoff between these two sources of error, stemming from the fact that simpler models with fewer parameters (such as JC) have smaller stochastic noise at the expense of greater estimation bias. Our framework allows us to consider this tradeoff when deciding which SR function to use for a given data set. This allows us to characterize a wide range of cases in which an SR function associated with an oversimplified evolutionary model results in increased reconstruction accuracy.

This finding falls in line with previous studies demonstrating the usefulness of phylogenetic reconstruction methods that are not generally consistent. Previous studies have attributed the increased accuracy of inconsistent methods mainly to the fact that these methods have a bias toward reconstructing certain topologies, leading to increased accuracy in cases where the phylogeny being reconstructed has the “favored topology”. We notice a similar behavior using our theoretical characterization of non-additive SR functions. However, somewhat surprisingly, we find that non-additive SR functions often have an advantage even when the phylogeny being reconstructed has an “unfavorable topology”. This is due to the reduced stochastic noise of the non-additive SR function (compared with its additive alternatives), which compensates for its topological bias.

Our paper is organized as follows. Section “Background” outlines some of the required background and introduces several new concepts that are central in our analysis. Section “Deviation from additivity in homogeneous substitution models” provides the main analytic results in the paper, and introduces *deviation from additivity* as a measure of distance estimation bias. In that section we prove a general upper bound for this deviation and establish a connection with reconstruction accuracy. We then study deviation from additivity and stochastic error of the JC distance formula when applied to data generated under the K2P model. In Section “Performance of Non affine-additive SR functions in quartet resolution” we study the effect of deviation from additivity and stochastic error on the accuracy of quartet reconstruction. In the case of quartets we can draw a tight connection between the different sources of error in distance estimation and inaccuracy of reconstruction. We present a useful heuristic, based on the so-called Fisher criterion ([20, 21]), for comparing the expected accuracy of two SR functions in this context. In Section “Simulations on Hasegawa’s Tree” we extend our study to larger trees using experiments on simulated data based on the tree obtained by Hasegawa in[6]. Finally, in Section “Inferring trees from genomic sequences” we demonstrate our approach through a series of experiments reconstructing trees from bacterial gene sequences.

## Background

In this section we provide a brief exposition of DNA substitution models and substitution rate functions used for distance estimation. We concentrate on details essential to this study and refer the reader to a previous paper[9] and standard textbooks[1, 2] for a more complete survey.

### Substitution Models

In this work, a DNA substitution model $\mathcal{M}$ is simply a set of stochastic 4×4 *transition matrices* closed under matrix product (i.e., $\mathbf{P},\mathbf{Q}\in\mathcal{M}\Rightarrow\mathbf{P}\mathbf{Q}\in\mathcal{M}$). These matrices serve to describe the substitution process along evolutionary paths in a phylogenetic tree. All substitution models addressed in this paper are time-reversible[7]. A *model tree* in a time-reversible substitution model $\mathcal{M}$, or an $\mathcal{M}$-*tree*, is an undirected tree $T=(V,E)$ in which each edge $e\in E$ is associated with a transition matrix $\mathbf{P}_e\in\mathcal{M}$. An $\mathcal{M}$-tree $T$ implies an inter-leaf transition matrix $\mathbf{P}_{ij}\in\mathcal{M}$ for each pair of leaves $\{i,j\}\subset L(T)$, namely $\mathbf{P}_{ij}=\prod_{e\in\mathrm{path}_T(i,j)}\mathbf{P}_e$. Most common models are defined using *rate matrices*, which are 4×4 matrices whose off-diagonal elements are non-negative *substitution rates*, and whose rows sum to 0. A stochastic transition matrix $\mathbf{P}$ is obtained from a rate matrix $\mathbf{R}$ through matrix exponentiation: $\mathbf{P}=e^{\mathbf{R}}$.

A common assumption made on the substitution process is that it is *homogeneous* throughout time. This means that all rate matrices in the model are proportional to each other. Such a substitution model is thus termed *homogeneous*, and it is defined by a *unit rate matrix* $\mathbf{R}$ as follows: $\mathcal{M}_{\mathbf{R}}=\{e^{t\mathbf{R}}:t>0\}$. Note that the definition of the unit rate matrix associated with a given homogeneous model is somewhat arbitrary^{b}, but once the unit $\mathbf{R}$ is defined, it implies a bijection (or equivalence) between rate matrices in $\mathcal{M}_{\mathbf{R}}$ and the parameter $t$, which corresponds to evolutionary time. We will make use of this equivalence extensively throughout this paper.

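Kimura’s two-parameter (K2P) model[5] consists of rate matrices of the form

$$\mathbf{R}_{\text{K2P}}=\begin{pmatrix}-\alpha-2\beta&\alpha&\beta&\beta\\ \alpha&-\alpha-2\beta&\beta&\beta\\ \beta&\beta&-\alpha-2\beta&\alpha\\ \beta&\beta&\alpha&-\alpha-2\beta\end{pmatrix}\qquad(1)$$

(rows and columns ordered A, G, C, T), with two parameters: $\alpha$, which is the rate of *transition*-type (ti) substitutions ($\mathtt{A}\leftrightarrow\mathtt{G}$, $\mathtt{C}\leftrightarrow\mathtt{T}$), and $\beta$, which is the rate of *transversion*-type (tv) substitutions ($\{\mathtt{A,G}\}\leftrightarrow\{\mathtt{C,T}\}$). Each K2P rate matrix can be represented as a product of a unit rate matrix, in which $\alpha+2\beta=1$, and a scalar $t$ corresponding to *evolutionary time*.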

Each unit rate matrix of the K2P model defines a homogeneous sub-model, which is identified by its unique transition-transversion (ti-tv) ratio$R=\frac{\alpha}{2\beta}\ge \frac{1}{2}$. The Jukes-Cantor (JC) model[4] is a special homogeneous sub-model of K2P, in which$R=\frac{1}{2}$ (i.e., *α* =*β* ). Although the K2P model is defined in (1) as a union of its homogeneous sub-models, it is important to note that this union is closed under matrix product, implying that K2P adheres to our definition of a proper substitution model. Conversely, some commonly used substitution models, such as GTR and HKY, are defined as a union of homogeneous models, but are not themselves closed under matrix product[22].
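The closure of K2P under matrix product can be checked numerically. The following is a minimal sketch in plain Python, assuming the standard closed-form K2P substitution probabilities and the nucleotide order A, G, C, T; the helper names `k2p_probs`, `k2p_matrix`, `matmul` and `has_k2p_pattern` are illustrative and do not come from the paper.

```python
import math

def k2p_probs(alpha, beta, t):
    """Return (p_ti, p_tv): probability of the transition and of each
    specific transversion after time t (standard K2P closed form)."""
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    return 0.25 + 0.25 * e1 - 0.5 * e2, 0.25 - 0.25 * e1

def k2p_matrix(p_ti, p_tv):
    """4x4 transition matrix; rows/columns ordered A, G, C, T (assumed)."""
    s = 1.0 - p_ti - 2.0 * p_tv  # probability of no substitution
    return [[s, p_ti, p_tv, p_tv],
            [p_ti, s, p_tv, p_tv],
            [p_tv, p_tv, s, p_ti],
            [p_tv, p_tv, p_ti, s]]

def matmul(X, Y):
    """Plain 4x4 matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def has_k2p_pattern(M, tol=1e-12):
    """True if M equals the K2P matrix built from its own (0,1) and (0,2)
    entries, which also forces every row to sum to 1."""
    ref = k2p_matrix(M[0][1], M[0][2])
    return all(abs(M[i][j] - ref[i][j]) <= tol
               for i in range(4) for j in range(4))

# Two matrices from different homogeneous sub-models (R = 2 and R = 5).
P = k2p_matrix(*k2p_probs(2 / 3, 1 / 6, 0.3))
Q = k2p_matrix(*k2p_probs(5 / 6, 1 / 12, 0.7))
print(has_k2p_pattern(matmul(P, Q)))
```

The product of two matrices taken from different homogeneous sub-models keeps the two-parameter pattern, which is the sense in which the union of sub-models forms a proper substitution model here.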

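A transition matrix in K2P is determined by two parameters: $p_\alpha$, the probability of a transition-type substitution, and $p_\beta$, the probability of a transversion-type substitution (to each of the two possible target nucleotides). The transformations between $(\alpha,\beta,t)$ and $(p_\alpha,p_\beta)$ are given by the following equations:

$$\alpha t=-\frac{1}{2}\ln(1-2p_\alpha-2p_\beta)+\frac{1}{4}\ln(1-4p_\beta),\qquad\beta t=-\frac{1}{4}\ln(1-4p_\beta)\qquad(2)$$

$$p_\alpha=\frac{1}{4}+\frac{1}{4}e^{-4\beta t}-\frac{1}{2}e^{-2(\alpha+\beta)t},\qquad p_\beta=\frac{1}{4}-\frac{1}{4}e^{-4\beta t}\qquad(3)$$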

### Substitution rate functions

A *substitution rate (SR) function* for a model $\mathcal{M}$ is a non-negative continuous function $\Delta:\mathcal{M}\to\mathbb{R}^+$ that maps each transition matrix onto a numerical value of “substitution rate”. An SR function $\Delta$ induces the following *dissimilarity mapping* over the leaves of an $\mathcal{M}$-tree $T$: $D_\Delta^T(i,j)=\Delta(\mathbf{P}_{ij})$, for all $\{i,j\}\subset L(T)$. Of particular interest in phylogenetic reconstruction are *additive* SR functions.

#### Definition 2.1 (Additive SR function)

An SR function *Δ* is said to be *additive* for a substitution model$\mathcal{\mathcal{M}}$ if for all$\mathbf{\text{P}},\mathbf{\text{Q}}\in \mathcal{\mathcal{M}}$, *Δ* (**P** **Q**)=*Δ* (**P**) + *Δ* (**Q**).

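It is often explicitly required that an SR function be additive for the assumed model (see[9]). The evolutionary time, $t$, typically serves as the standard additive measure in most common substitution models. Throughout this study we follow the special case of K2P, focusing on the following two SR functions:

$$\Delta_{\text{K2P}}(p_\alpha,p_\beta)=-\frac{1}{2}\ln(1-2p_\alpha-2p_\beta)-\frac{1}{4}\ln(1-4p_\beta)\qquad(4)$$

$$\Delta_{\text{JC}}(p_\alpha,p_\beta)=-\frac{3}{4}\ln\left(1-\frac{4}{3}(p_\alpha+2p_\beta)\right)\qquad(5)$$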

The first SR function, $\Delta_{\text{K2P}}$, is the common SR function suggested for the K2P model in[5], and it is clearly additive, as it maps the transition probabilities onto evolutionary time $t$. The second SR function, $\Delta_{\text{JC}}$, maps the transition probabilities onto evolutionary time only in the special case of the JC model where $\alpha=\beta$. Under other homogeneous sub-models of K2P, it is non-additive. This non-additivity is analyzed in detail in Section “Deviation from additivity in homogeneous substitution models”.
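The contrast between the two SR functions can be illustrated numerically. This is a small sketch, assuming the standard Kimura and Jukes-Cantor distance formulas and a unit rate matrix with $\alpha+2\beta=1$; the function names and the `additivity_gap` helper are hypothetical and introduced only for illustration.

```python
import math

def k2p_probs(R, t):
    """(p_alpha, p_beta) at time t under the homogeneous K2P sub-model with
    ti-tv ratio R, using unit rates alpha = R/(R+1), beta = 1/(2(R+1))."""
    e1 = math.exp(-2.0 * t / (R + 1.0))
    e2 = math.exp(-(2.0 * R + 1.0) * t / (R + 1.0))
    return 0.25 + 0.25 * e1 - 0.5 * e2, 0.25 - 0.25 * e1

def delta_k2p(p_alpha, p_beta):
    """Kimura two-parameter distance (recovers evolutionary time t)."""
    return (-0.5 * math.log(1.0 - 2.0 * p_alpha - 2.0 * p_beta)
            - 0.25 * math.log(1.0 - 4.0 * p_beta))

def delta_jc(p_alpha, p_beta):
    """Jukes-Cantor distance, which only uses p = p_alpha + 2 * p_beta."""
    p = p_alpha + 2.0 * p_beta
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def additivity_gap(delta, R, s, t):
    """delta(s + t) - delta(s) - delta(t) along one path; 0 if additive."""
    d = lambda x: delta(*k2p_probs(R, x))
    return d(s + t) - d(s) - d(t)

for R in (0.5, 2.0, 10.0):
    print(R, additivity_gap(delta_k2p, R, 1.0, 1.0),
          additivity_gap(delta_jc, R, 1.0, 1.0))
```

Under these assumptions the gap of `delta_k2p` vanishes up to rounding for every ratio, while the gap of `delta_jc` vanishes only at $R=\frac{1}{2}$ and is negative otherwise.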

### Additive metrics, Affine-additive mappings, and Near-additivity

The core idea behind distance-based phylogenetic reconstruction is that a phylogenetic tree $T$ can be accurately and efficiently reconstructed from pairwise distances which are *additive with respect to* $T$[23, 24].

#### Definition 2.2 (Additive metric)

A metric $D$ defined over the leaf-set $L(T)$ of a tree $T$ is $T$-*additive* (or *additive w.r.t.* $T$), if there exists a positive edge-weighting function $w:E(T)\to\mathbb{R}^+$, such that for each $i,j\in L(T)$, $D(i,j)=\sum_{e\in\mathrm{path}_T(i,j)}w(e)$. $D$ is *additive* for a set $S$ if it is $T$-additive for some tree $T$ where $L(T)=S$.

It is well known that additive SR functions imply additive metrics: if $\Delta$ is an additive SR function for a model $\mathcal{M}$, then for any $\mathcal{M}$-tree $T$, $D_\Delta^T$ (the dissimilarity mapping induced by $\Delta$ on $T$) is a $T$-additive metric. The inherent difficulty in reconstructing phylogenies using additive SR functions is that computing the implied $T$-additive metric requires the *exact* values of the inter-taxon transition matrices $\{\mathbf{P}_{ij}\}$, and getting these exact values from alignments of finite length is practically impossible. Therefore, a distance-based reconstruction algorithm is useful in a realistic setting only if it has some robustness to error in distance estimation. In[25], Atteson observed that the topology of a phylogenetic tree $T$ can be accurately (and efficiently) reconstructed from any dissimilarity mapping $D$ which is sufficiently close to a $T$-additive metric, using certain “robust” distance-based algorithms^{c}. Formally, “sufficiently close” is defined by the following relation:

#### Definition 2.3 (Near-additive mapping)

A dissimilarity mapping $D$ on $L(T)$ is said to be *near-additive* w.r.t. $T$ iff there exists a $T$-additive mapping $D^\star$ s.t.

$$\|D,D^\star\|_\infty<\frac{1}{2}w_{\min}(D^\star),\qquad(6)$$

where $\|D,D^\star\|_\infty=\max_{i,j}|D(i,j)-D^\star(i,j)|$ and $w_{\min}(D^\star)$ is the minimal weight assigned to an internal edge^{d} by the edge-weighting function corresponding to the additive metric $D^\star$.
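The near-additivity condition is simple to test for a concrete quartet. Below is a minimal sketch, assuming the condition compares the largest absolute difference between two dissimilarity maps against half the minimal internal edge weight; `quartet_metric` and `is_near_additive` are hypothetical helper names, and the edge weights are arbitrary illustrative values.

```python
from itertools import combinations

def quartet_metric(ext, internal):
    """Additive metric of the quartet (12|34). `ext` maps each leaf to the
    weight of its external edge; `internal` is the internal edge weight."""
    cherries = {(1, 2), (3, 4)}
    return {(i, j): ext[i] + ext[j] + (0.0 if (i, j) in cherries else internal)
            for i, j in combinations((1, 2, 3, 4), 2)}

def is_near_additive(D, D_star, w_min):
    """Check max |D - D*| < w_min / 2 over all leaf pairs."""
    return max(abs(D[pair] - D_star[pair]) for pair in D_star) < 0.5 * w_min

D_star = quartet_metric({1: 0.2, 2: 0.3, 3: 0.25, 4: 0.4}, internal=0.1)
signs = {(1, 2): 1, (1, 3): -1, (1, 4): 1, (2, 3): -1, (2, 4): 1, (3, 4): -1}

# Perturb every distance by a fixed amount with alternating signs.
for eps in (0.04, 0.06):
    noisy = {pair: D_star[pair] + eps * signs[pair] for pair in D_star}
    print(eps, is_near_additive(noisy, D_star, w_min=0.1))
```

With an internal edge of weight 0.1, a perturbation of 0.04 per distance stays within the threshold of 0.05, whereas 0.06 does not.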

For our results we will be using a generalization of this criterion, in which the mapping $D^\star$ can be any *affine-additive* mapping, defined below.

#### Definition 2.4 (Affine-additive mapping)

A dissimilarity mapping $D'$ is said to be *affine-additive* w.r.t. a phylogenetic tree $T$, if there is a $T$-additive metric $D$, and scalars $a>0$, $b$ s.t. $D'=aD+b$ (i.e., $D'(i,j)=aD(i,j)+b$ for all $\{i,j\}\subset L(T)$).

As with additive metrics, affine-additive mappings are also associated with edge weights. Let $D$ be a $T$-additive mapping corresponding to the edge-weighting function $w(\cdot)$. Then the edge-weighting function $w'(\cdot)$ corresponding to the affine-additive mapping $D'=aD+b$ is given by: $w'(e)=aw(e)$ for all internal edges, and $w'(e)=aw(e)+\frac{1}{2}b$ for all external edges. When $b$ is positive, $D'$ is actually an additive metric, but when $b$ is negative, the weights of external edges implied by $w'(\cdot)$ might be negative, and $D'$ might even yield negative dissimilarities. The generalization of Atteson’s theorem to cases where $D^\star$ is affine-additive follows from the observation that the robust distance-based reconstruction algorithms considered by Atteson are invariant to affine transformations of their input distances. From this point on, when we say a dissimilarity mapping $D$ is *near-additive*, we mean it satisfies (6) with respect to some affine-additive mapping $D^\star$.

### Local consistency

The statistical consistency of distance-based reconstruction is typically argued for an SR function $\Delta$ which is additive for the underlying substitution model $\mathcal{M}$, as follows:

1. If $\Delta$ is additive for $\mathcal{M}$, then for each $\mathcal{M}$-tree $T$ the mapping $D_\Delta^T$ defined by $D_\Delta^T(i,j)=\Delta(\mathbf{P}_{ij})$ for all $i,j\in L(T)$ is a $T$-additive metric.
2. As the length of the input sequences grows, the estimated transition matrices $\{\hat{\mathbf{P}}_{ij}\}$ converge (w.h.p.) to the true matrices $\{\mathbf{P}_{ij}\}$.
3. When $\{\hat{\mathbf{P}}_{ij}\}$ are sufficiently close to $\{\mathbf{P}_{ij}\}$, the estimated dissimilarity map $\hat{D}$ defined by $\hat{D}(i,j)=\Delta(\hat{\mathbf{P}}_{ij})$ is sufficiently close to $D_\Delta^T$, and is thus near-additive.
4. The near-additivity of the estimated dissimilarity map $\hat{D}$ implies accurate topological reconstruction, assuming a robust distance-based algorithm is used.
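Steps 2 and 3 of this argument can be illustrated with a small simulation. The sketch below is not the paper's experimental setup: it assumes sites evolve independently under a homogeneous K2P sub-model, samples the substitution class of each site directly (rather than simulating full sequences), and applies the standard Kimura distance to the observed fractions. The names `k2p_probs` and `estimate_k2p_distance` are hypothetical.

```python
import math
import random

def k2p_probs(R, t):
    """(p_alpha, p_beta) at time t for ti-tv ratio R (unit rate matrix)."""
    e1 = math.exp(-2.0 * t / (R + 1.0))
    e2 = math.exp(-(2.0 * R + 1.0) * t / (R + 1.0))
    return 0.25 + 0.25 * e1 - 0.5 * e2, 0.25 - 0.25 * e1

def estimate_k2p_distance(R, t, k, rng):
    """Simulate k independent sites and return the K2P distance estimate."""
    p_alpha, p_beta = k2p_probs(R, t)
    n_ti = n_tv = 0
    for _ in range(k):
        u = rng.random()
        if u < p_alpha:                      # site shows a transition
            n_ti += 1
        elif u < p_alpha + 2.0 * p_beta:     # site shows a transversion
            n_tv += 1
    P, Q = n_ti / k, n_tv / k  # observed transition / transversion fractions
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)

rng = random.Random(0)
for k in (1_000, 10_000, 100_000):
    estimate = estimate_k2p_distance(2.0, 0.5, k, rng)
    print(k, abs(estimate - 0.5))
```

As the number of sites grows, the estimate is expected to concentrate around the true evolutionary time, which is what makes the estimated dissimilarity map near-additive for long enough sequences.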

This line of argument has been used in numerous works studying statistical consistency of distance-based algorithms (e.g.,[25–27]), and in all these cases an additive SR function is assumed. Notice, however, that this line of argument remains valid when $D_\Delta^T$ is *near-additive* w.r.t. $T$. For instance, consistent reconstruction of any $\mathcal{M}$-tree is guaranteed by using an *affine-additive* SR function $\Delta'$, which is an affine transformation of some additive SR function $\Delta$: $\Delta'=a\Delta+b$ (with $a>0$). An SR function that is not affine-additive in a given substitution model $\mathcal{M}$ does not guarantee consistency across all $\mathcal{M}$-trees, but it can still be consistent for specific $\mathcal{M}$-trees.

#### Definition 2.5 (Consistent SR function)

An SR function $\Delta$ of a substitution model $\mathcal{M}$ is said to be *consistent* w.r.t. an $\mathcal{M}$-tree $T$ if $D_\Delta^T$ is near-additive w.r.t. $T$.

The main idea endorsed in this paper is that if an SR function only deviates slightly from some SR function which is affine-additive for$\mathcal{\mathcal{M}}$, then it might be consistent with respect to many$\mathcal{\mathcal{M}}$-trees of interest, and as such should be considered for use in distance based reconstructions.

## Deviation from additivity in homogeneous substitution models

In order to assess whether a given SR function $\Delta$ is consistent w.r.t. a given model tree $T$, one has to find an affine-additive mapping $D^\star$ which minimizes the ratio $\frac{\|D_\Delta^T,D^\star\|_\infty}{w_{\min}(D^\star)}$ (see Definition 2.3). This task seems hard in a general setting, but in the special case of homogeneous substitution models it is tractable. Consider a homogeneous substitution model $\mathcal{M}_{\mathbf{R}}$. The unit rate matrix $\mathbf{R}$ implies a 1-1 mapping between evolutionary time $t$ and rate matrices in $\mathcal{M}_{\mathbf{R}}$. It is thus useful to view an SR function for $\mathcal{M}_{\mathbf{R}}$ as a function $\Delta:\mathbb{R}^+\to\mathbb{R}^+$ which maps the *evolutionary time* $t$ to a dissimilarity measure $\Delta(t)$.

It can be shown that such $\Delta$ is affine-additive in the model if and only if $\Delta(t)=at+b$ for some $a\in\mathbb{R}^+,b\in\mathbb{R}$. We define the *deviation* of an SR function $\Delta$ from a given affine-additive function $at+b$ in an interval $[t_0,t_1]$ as $\frac{1}{a}\max\{|\Delta(t)-at-b|:t\in[t_0,t_1]\}$ (the factor $\frac{1}{a}$ normalizes the deviation to units of evolutionary time). The *deviation from additivity* of $\Delta$ within $[t_0,t_1]$ is defined as the minimum deviation of $\Delta$ from any affine-additive function in that interval.

### Definition 2.6 (Deviation from additivity)

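Let $\Delta$ be an SR function for a homogeneous substitution model. The *deviation from additivity* of $\Delta$ in an interval $[t_0,t_1]$ is defined by:

$$\mathrm{dev}(\Delta,[t_0,t_1])=\min_{a\in\mathbb{R}^+,\,b\in\mathbb{R}}\left\{\frac{1}{a}\max_{t\in[t_0,t_1]}|\Delta(t)-at-b|\right\}\qquad(7)$$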

Lemma 2.7 below presents the basic relation between deviation from additivity and consistency. In Section “Performance of Non affine-additive SR functions in quartet resolution” we demonstrate the tightness of this relation.

### Lemma 2.7

Let $\mathcal{M}$ be a homogeneous model, and let $T$ be an $\mathcal{M}$-tree with edge lengths (measured in time units) denoted by $\{t_e\}$. Let $t_{\min}=\min\{t_e:e\in T\}$, and assume that all inter-leaf distances in $T$ fall within the interval $[t_0,t_1]$. Then any SR function $\Delta$ in $\mathcal{M}$ for which $\mathrm{dev}(\Delta,[t_0,t_1])<\frac{1}{2}t_{\min}$ is consistent w.r.t. $T$.

### Proof

Let $T$ be a tree as in the statement of the lemma. Since $\mathrm{dev}(\Delta,[t_0,t_1])<\frac{1}{2}t_{\min}$, there are $a\in\mathbb{R}^+,b\in\mathbb{R}$ which satisfy

$$\max_{t\in[t_0,t_1]}|\Delta(t)-at-b|<\frac{1}{2}a\,t_{\min}.\qquad(8)$$

For $i,j\in L(T)$, denote $t_{ij}=\sum_{e\in\mathrm{path}_T(i,j)}t_e$, and let $D$ be the dissimilarity map associated with evolutionary time: $D(i,j)=t_{ij}$. Clearly, $D$ is an additive metric, and the dissimilarity mapping $D'=aD+b$ is an affine-additive mapping. The internal-edge weights associated with $D'$ are given by $w'(e)=a\,t_e$ (see the discussion following Definition 2.4), implying that $w_{\min}(D')=a\,t_{\min}$. We thus have:

$$\|D_\Delta^T,D'\|_\infty=\max_{i,j\in L(T)}|\Delta(t_{ij})-a\,t_{ij}-b|\le\max_{t\in[t_0,t_1]}|\Delta(t)-at-b|<\frac{1}{2}a\,t_{\min}=\frac{1}{2}w_{\min}(D').$$

Hence $D_\Delta^T$ is near-additive w.r.t. $T$, and $\Delta$ is consistent w.r.t. $T$. □

A bound on the deviation of $\Delta$ from additivity in a given interval $[t_0,t_1]$ is implied by the error associated with its linear interpolation $At+B$ within that interval ($A=\frac{\Delta(t_1)-\Delta(t_0)}{t_1-t_0}$ and $B=\frac{t_1\Delta(t_0)-t_0\Delta(t_1)}{t_1-t_0}$). Figure 1a demonstrates this for $\Delta_{\text{JC}}$ under a homogeneous sub-model of K2P, and Lemma 2.8 below presents a general upper bound on the deviation from additivity. For this purpose, we assume that the SR function $\Delta$ is a monotone increasing continuous function of $t$ with continuous first and second derivatives.

### Lemma 2.8

Let $\Delta$ be an SR function satisfying the above assumptions, and let $[t_0,t_1]$ be an interval. Let $\Delta_{\text{int}}(t)=At+B$ be the linear interpolation of $\Delta$ in $[t_0,t_1]$ defined above, and let $F\triangleq\max_{t\in[t_0,t_1]}\{|\Delta''(t)|\}$. Then

$$\mathrm{dev}(\Delta,[t_0,t_1])\le\frac{(t_1-t_0)^2F}{16A}.$$

### Proof

Write $\psi(a,b,t)=\Delta(t)-at-b$ and $\psi(a,b)=\max_{t\in[t_0,t_1]}|\psi(a,b,t)|$. We are looking for $a\in\mathbb{R}^+$ and $b\in\mathbb{R}$ which minimize $\frac{1}{a}\psi(a,b)$. Let $\psi_{\min}=\min_{t\in[t_0,t_1]}\{\psi(A,B,t)\}$, $\psi_{\max}=\max_{t\in[t_0,t_1]}\{\psi(A,B,t)\}$, and let $b^{\ast}=B+\frac{1}{2}(\psi_{\max}+\psi_{\min})$. Then $\psi(A,b^{\ast})=\frac{1}{2}(\psi_{\max}-\psi_{\min})$. A bound for $\mathrm{dev}(\Delta,[t_0,t_1])$ will thus follow by showing that $\psi_{\max}-\psi_{\min}\le\frac{(t_1-t_0)^2F}{8}$.

Since $\Delta_{\text{int}}(t)=At+B$ is a linear interpolation of $\Delta$ in $[t_0,t_1]$, we have $\psi(A,B,t_0)=\psi(A,B,t_1)=0$. Let $t_{\min}$ be an arbitrary point in the interval $[t_0,t_1]$ s.t. $\psi(A,B,t_{\min})=\psi_{\min}\le0$ and let $(t_2,t_3)$ be the maximal open interval in $[t_0,t_1]$ containing $t_{\min}$ in which $\psi(A,B,t)<0$ (this interval can be empty if $\psi_{\min}=0$). We define a similar interval $(t_4,t_5)$ in which $\psi(A,B,t)>0$ around some arbitrary $t_{\max}$ s.t. $\psi(A,B,t_{\max})=\psi_{\max}$. Note that the intervals $(t_2,t_3)$ and $(t_4,t_5)$ are disjoint, and that $\Delta_{\text{int}}$ is the linear interpolation of $\Delta$ in both these intervals (since $\psi(A,B,t_2)=\psi(A,B,t_3)=\psi(A,B,t_4)=\psi(A,B,t_5)=0$). Therefore, the bound on the error of polynomial interpolation (see, e.g.,[28], p. 187) implies that

$$\mathrm{dev}(\Delta,[t_0,t_1])\le\frac{1}{A}\psi(A,b^{\ast})=\frac{\psi_{\max}-\psi_{\min}}{2A}\le\frac{\left((t_5-t_4)^2+(t_3-t_2)^2\right)F}{16A}\le\frac{(t_1-t_0)^2F}{16A}.\qquad(9)$$

□

### Note

In Appendix 3 we prove that if $\Delta$ does not intersect its linear interpolation $\Delta_{\text{int}}=At+B$ within the interval $(t_0,t_1)$, then the function $At+b^{\ast}$ mentioned in the proof above is, in fact, the affine-additive function which minimizes the deviation from additivity of $\Delta$ in $[t_0,t_1]$. This means that, in such cases, the first inequality in (9) holds with equality. The last inequality in (9) also holds with equality in such cases, because we are guaranteed to have either $[t_2,t_3]=[t_0,t_1]$ (when $\Delta$ is bounded from above by its linear interpolation) or $[t_4,t_5]=[t_0,t_1]$ (when $\Delta$ is bounded from below by its linear interpolation). Thus, in such a case, the bound of Lemma 2.8 reduces to the bound on interpolation error (middle inequality in (9)). Cases where $\Delta$ does not intersect its linear interpolation are frequent among many SR functions of interest, as this condition holds when $\Delta$ is either convex or concave.

### Deviation of $\Delta_{\text{JC}}$ from Additivity in K2P

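In this section we study the deviation of $\Delta_{\text{JC}}$ from additivity in homogeneous sub-models of K2P with ti-tv ratio $R>\frac{1}{2}$. First, we express $\Delta_{\text{JC}}$ as a function of the ti-tv ratio $R$ and the time $t$, using (5) and the relations $\frac{\alpha}{2\beta}=R$ and $\alpha+2\beta=1$:

$$\Delta_{\text{JC}}(R,t)=-\frac{3}{4}\ln\left(\frac{1}{3}e^{-\frac{2t}{R+1}}+\frac{2}{3}e^{-\frac{(2R+1)t}{R+1}}\right).$$

When $R>\frac{1}{2}$, $\Delta_{\text{JC}}$ is not affine-additive (i.e., not of the form $at+b$ for $a>0$), and we can use the result in Lemma 2.8 to bound the deviation of $\Delta_{\text{JC}}$ from additivity. Denoting $\rho=\frac{2R-1}{R+1}$, we get

$$\frac{\partial\Delta_{\text{JC}}}{\partial t}=\frac{3}{2(R+1)}+\frac{3\rho}{2\left(e^{\rho t}+2\right)},\qquad\frac{\partial^2\Delta_{\text{JC}}}{\partial t^2}=-\frac{3\rho^2e^{\rho t}}{2\left(e^{\rho t}+2\right)^2}.$$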

We get that for any given ti-tv ratio $R>\frac{1}{2}$, $\Delta_{\text{JC}}(R,t)$ is a concave monotone increasing function, and its second derivative attains a global minimum of $-\frac{3}{16}\rho^2$ at $t=\frac{\ln(2)}{\rho}$. By the note following Lemma 2.8, the deviation of $\Delta_{\text{JC}}$ from additivity in an interval $[t_0,t_1]$ can be evaluated by computing the linear interpolation $\Delta_{\text{int}}=At+B$ of $\Delta_{\text{JC}}$ in $[t_0,t_1]$, and finding $t\in[t_0,t_1]$ which maximizes $\Delta_{\text{JC}}(t)-\Delta_{\text{int}}(t)$ (see Figure 1a). A bound on this deviation from additivity can be obtained through Lemma 2.8 by plugging in the slope of the linear interpolation, $A$, and the maximum absolute value, $F$, attained by the second derivative of $\Delta_{\text{JC}}$ in $[t_0,t_1]$. Using Lemma 2.7 and an expression for $\mathrm{dev}(\Delta_{\text{JC}}(R,t),[t_0,t_1])$, it is possible to map out coherent collections of homogeneous K2P-trees for which $\Delta_{\text{JC}}$ is guaranteed to be consistent. Each collection is defined by a range of ti-tv ratios $[0.5,R_{\max}]$, a range of inter-leaf distances $[t_0,t_1]$, and a lower bound on the weights of internal edges in the tree, given by $t_{\min}=2\,\mathrm{dev}(\Delta_{\text{JC}}(R_{\max},t),[t_0,t_1])$.
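The evaluation described above can be sketched numerically. The code below is an illustration under stated assumptions: it uses the closed form of $\Delta_{\text{JC}}(R,t)$ obtained by substituting the K2P probabilities into the Jukes-Cantor formula, approximates the maximum over the interval on a uniform grid, and relies on concavity so that the deviation equals half the maximal interpolation error divided by the slope $A$. The function names and the grid size are arbitrary choices, not part of the paper.

```python
import math

def delta_jc(R, t):
    """Delta_JC as a function of ti-tv ratio R and evolutionary time t."""
    return -0.75 * math.log(math.exp(-2.0 * t / (R + 1.0)) / 3.0
                            + 2.0 * math.exp(-(2.0 * R + 1.0) * t / (R + 1.0)) / 3.0)

def delta_jc_second_derivative(R, t):
    """Closed-form second derivative of Delta_JC with respect to t."""
    rho = (2.0 * R - 1.0) / (R + 1.0)
    x = math.exp(rho * t)
    return -1.5 * rho * rho * x / (x + 2.0) ** 2

def deviation_and_bound(R, t0, t1, steps=10000):
    """Grid approximation of dev(Delta_JC, [t0, t1]) = X / (2A) for concave
    Delta_JC, together with the bound (t1 - t0)^2 * F / (16 * A)."""
    grid = [t0 + (t1 - t0) * i / steps for i in range(steps + 1)]
    A = (delta_jc(R, t1) - delta_jc(R, t0)) / (t1 - t0)
    B = (t1 * delta_jc(R, t0) - t0 * delta_jc(R, t1)) / (t1 - t0)
    X = max(delta_jc(R, t) - (A * t + B) for t in grid)
    F = max(abs(delta_jc_second_derivative(R, t)) for t in grid)
    return X / (2.0 * A), (t1 - t0) ** 2 * F / (16.0 * A)

dev, bound = deviation_and_bound(10.0, 0.8, 2.0)
print("deviation from additivity:", dev)
print("upper bound:", bound)
print("internal edges longer than", 2.0 * dev, "guarantee consistency")
```

Repeating the last call for several values of $R_{\max}$ and several intervals gives a table of thresholds $t_{\min}$, one per collection of trees.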

In cases where a non-additive SR function, $\Delta$, is consistent, one can compare the performance of $\Delta$ with additive alternatives. In our case, we compare $\Delta=\Delta_{\text{JC}}$, which is not affine-additive when $R>\frac{1}{2}$, to the standard additive SR function $\Delta_{\text{K2P}}$. The potential advantage of $\Delta_{\text{JC}}$ over $\Delta_{\text{K2P}}$ lies in its reduced *stochastic noise*. Informally, this occurs because $\Delta_{\text{JC}}$ relies on the accuracy of estimating a single parameter, the sum $p=p_\alpha+2p_\beta$, while $\Delta_{\text{K2P}}$ relies on the accuracy of estimating each of the two parameters $p_\alpha$ and $p_\beta$ separately. The stochastic noise of an SR function is measured by the *standard deviation* of the statistical estimator associated with it, denoted $\sigma(\Delta_{\text{JC}})$ and $\sigma(\Delta_{\text{K2P}})$, respectively. We use the result in[9] to get a first order approximation (based on the delta method[29]) of $\sigma(\Delta_{\text{K2P}})$ for sequences of length $k$ and model parameters $R,t$:

$$\sigma(\Delta_{\text{K2P}})\approx\sqrt{\frac{u^2p_\alpha+2v^2p_\beta-(u\,p_\alpha+2v\,p_\beta)^2}{k}},\qquad(14)$$

where $p_\alpha,p_\beta$ are given by (3), $u=e^{\frac{(2R+1)t}{R+1}}$ and $v=\frac{1}{2}\left(u+e^{\frac{2t}{R+1}}\right)$. Similarly, for $\Delta_{\text{JC}}$, we obtain:

$$\sigma(\Delta_{\text{JC}})\approx\frac{\sqrt{p(1-p)/k}}{1-\frac{4}{3}p},\qquad(15)$$

where $k$ is the sequence length and $p(t,R)=p_\alpha+2p_\beta=\frac{3}{4}-\frac{1}{4}e^{-\frac{2t}{R+1}}-\frac{1}{2}e^{-\frac{(2R+1)t}{R+1}}$ (see (3)).
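The two noise levels can be compared with a short calculation. The sketch below assumes a first-order (delta method) approximation with multinomial sampling of the three site classes (transition, transversion, no change); these are the standard variance formulas for the Jukes-Cantor and Kimura distances, written here by hand rather than taken from the paper's equations, and the function names are illustrative.

```python
import math

def k2p_probs(R, t):
    """(p_alpha, p_beta) at time t for ti-tv ratio R (unit rate matrix)."""
    e1 = math.exp(-2.0 * t / (R + 1.0))
    e2 = math.exp(-(2.0 * R + 1.0) * t / (R + 1.0))
    return 0.25 + 0.25 * e1 - 0.5 * e2, 0.25 - 0.25 * e1

def sigma_jc(R, t, k):
    """Delta-method standard deviation of the JC distance estimate."""
    p_alpha, p_beta = k2p_probs(R, t)
    p = p_alpha + 2.0 * p_beta
    return math.sqrt(p * (1.0 - p) / k) / (1.0 - 4.0 * p / 3.0)

def sigma_k2p(R, t, k):
    """Delta-method standard deviation of the K2P distance estimate."""
    p_alpha, p_beta = k2p_probs(R, t)
    P, Q = p_alpha, 2.0 * p_beta           # transition / transversion fractions
    u = 1.0 / (1.0 - 2.0 * P - Q)           # derivative w.r.t. P
    v = 0.5 * (u + 1.0 / (1.0 - 2.0 * Q))   # derivative w.r.t. Q
    return math.sqrt((u * u * P + v * v * Q - (u * P + v * Q) ** 2) / k)

for t in (0.8, 1.4, 2.0):
    print(t, sigma_jc(10.0, t, 500), sigma_k2p(10.0, t, 500))
```

Both quantities scale as $1/\sqrt{k}$, so the ratio between them depends only on $R$ and $t$.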

Figure 1 provides an illustrative comparison of $\Delta_{\text{JC}}$ and $\Delta_{\text{K2P}}$ under the homogeneous sub-model of K2P with ti-tv ratio $R=10$, and within the inter-leaf time interval of $[0.8,2]$. Figure 1a shows the deviation of $\Delta_{\text{JC}}$ from additivity in that setting, using its linear interpolation $\Delta_{\text{int}}=At+B$. Note that Lemma 2.8 and the subsequent note imply that $\mathrm{dev}(\Delta_{\text{JC}},[0.8,2])=\frac{X}{2A}$, where $X=\max_{t\in[0.8,2]}\{\Delta_{\text{JC}}(t)-\Delta_{\text{int}}(t)\}$. Figure 1b depicts $\Delta_{\text{JC}}$ in the same setting with its stochastic error margins ($\Delta_{\text{JC}}\pm\sigma(\Delta_{\text{JC}})$), alongside its closest affine-additive function $\Delta^{\ast}=\Delta_{\text{int}}+\frac{1}{2}X$ and its stochastic error margins ($\Delta^{\ast}\pm\sigma(\Delta^{\ast})$). These stochastic error margins are determined by assuming a sequence length of 500 bp in the first-order approximations given in (14) and (15), where $\sigma(\Delta^{\ast})$ is given by scaling $\sigma(\Delta_{\text{K2P}})$ by the slope $A$ of the linear interpolation. Note how the margins of $\Delta_{\text{JC}}$ are actually more tightly concentrated around its affine-additive approximation $\Delta^{\ast}$ than the margins of $\Delta^{\ast}$. This implies that, despite its deviation from additivity in this setting, distances obtained using $\Delta_{\text{JC}}$ are actually more likely to be near-additive than distances obtained using $\Delta_{\text{K2P}}$.

## Performance of Non affine-additive SR functions in quartet resolution

The quartet tree is the smallest phylogenetic tree with non-trivial topology. Focusing on quartets enables a close study of the effects of deviation from additivity and stochastic noise on reconstruction accuracy. The topology of a quartet spanning four taxa {1,2,3,4} can be represented by the split notation (*ij*|*kl*) (where {*i*, *j*, *k*, *l*} = {1,2,3,4}), indicating that the internal edge of the quartet separates *i*, *j* from *k*, *l*. All distance based quartet resolution algorithms essentially reduce to the four-point method (FPM) [26, 30], which resolves this split using the six observed pairwise distances $\{\hat{d}_{ij}:\{i,j\}\subset\{1,2,3,4\}\}$: it first partitions the six observed distances into three sums $\hat{d}_{12}+\hat{d}_{34}$, $\hat{d}_{13}+\hat{d}_{24}$, and $\hat{d}_{14}+\hat{d}_{23}$, and then determines the quartet split according to the minimal sum (the sum $\hat{d}_{ij}+\hat{d}_{kl}$ corresponds to the split (*ij*|*kl*)). We will focus on the task of reconstructing homogeneous K2P quartets using FPM with distances $\{\hat{d}_{ij}\}$ estimated using either *Δ*_{JC} or *Δ*_{K2P}. We note that most of our findings easily generalize to more sophisticated homogeneous substitution models, replacing *Δ*_{JC} by any concave distance function and *Δ*_{K2P} by some SR function corresponding to the evolutionary time *t*.
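The four-point method described above fits in a few lines. This is a minimal sketch; the dictionary-based input format and the function name `four_point_method` are illustrative choices rather than anything prescribed by the paper.

```python
def four_point_method(d):
    """Resolve a quartet on taxa 1..4 from pairwise distances.

    d maps a pair (i, j) with i < j to the estimated distance between i and j.
    Returns the split whose distance sum is minimal, e.g. "12|34".
    """
    sums = {
        "12|34": d[(1, 2)] + d[(3, 4)],
        "13|24": d[(1, 3)] + d[(2, 4)],
        "14|23": d[(1, 4)] + d[(2, 3)],
    }
    return min(sums, key=sums.get)

# Additive distances on a quartet with split 12|34:
# all external edges have length 1.0 and the internal edge has length 0.2.
additive = {(1, 2): 2.0, (3, 4): 2.0,
            (1, 3): 2.2, (1, 4): 2.2, (2, 3): 2.2, (2, 4): 2.2}
print(four_point_method(additive))
```

With exact additive distances the true split always has the minimal sum; estimation noise or a non-additive SR function can make a competing sum smaller.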

*t*_{12} + *t*_{34} is minimal. We start by analyzing the impact of the deviation from additivity of *Δ*_{JC} on the consistency of quartet resolutions. First, observe that *any* monotone distance function is consistent for quartets in which *t*_{12} and *t*_{34} are the smallest interleaf distances, as is the case with symmetric quartets, in which all external edges are of the same length. Therefore, we study two prototypes of asymmetric quartets. The length of the internal edge in both types is *t*_{i}, and each type has two long external edges of length *t*_{l} and two short external edges of length *t*_{s}. In type A quartets (Figure 2a), the short edges are on one side of the split and the long edges are on the other side. In this case *d*_{12} and *d*_{34} are the smallest and largest interleaf distances (resp.). Hence, the concavity of *Δ*_{JC} increases the separation between the sum *d*_{12} + *d*_{34} and the other two competing sums, leading to an expected *improvement* in reconstruction accuracy. The other quartet configuration (type B; Figure 2b) has a short edge and a long edge on both sides of the split. In this case, the interval of interpolation is [*d*_{13}, *d*_{24}], and the distance *d*_{12} = *d*_{34} is near the center of this interval. Thus the concavity of *Δ*_{JC} decreases the separation between the sums *d*_{13} + *d*_{24} and *d*_{12} + *d*_{34} by approximately twice the deviation from additivity of *Δ*_{JC} in that range.

When the deviation from additivity exceeds half the length of the internal edge, the sum *d*_{13} + *d*_{24} becomes the minimal sum, and *Δ*_{JC} becomes inconsistent. Note that this demonstrates the tightness of the condition stated in Lemma 2.7, and in this sense, type B quartets provide a worst case scenario for quartet resolution by a concave SR function^{e}.
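The loss of consistency on type B quartets can be checked on noise-free distances. The sketch below assumes the standard Jukes-Cantor correction composed with *p*(*t*,*R*), and a labeling in which taxa 1 and 3 hang on the short edges and taxa 2 and 4 on the long edges, so that the true split is (12|34). The parameter values in the two calls are illustrative and are not taken from the paper's experiments.

```python
import math

def delta_jc_of_t(t, R):
    """Expected JC distance at time t under homogeneous K2P with ratio R (assumed form)."""
    p = (0.75 - 0.25 * math.exp(-2.0 * t / (R + 1.0))
         - 0.5 * math.exp(-(2.0 * R + 1.0) * t / (R + 1.0)))
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def expected_jc_split_type_b(t_i, t_l, t_s, R):
    """Split chosen by the four-point method on noise-free JC distances."""
    times = {
        (1, 2): t_s + t_l, (3, 4): t_s + t_l,          # same side of the split
        (1, 3): 2 * t_s + t_i, (2, 4): 2 * t_l + t_i,  # shortest and longest paths
        (1, 4): t_s + t_i + t_l, (2, 3): t_l + t_i + t_s,
    }
    d = {pair: delta_jc_of_t(t, R) for pair, t in times.items()}
    sums = {
        "12|34": d[(1, 2)] + d[(3, 4)],
        "13|24": d[(1, 3)] + d[(2, 4)],
        "14|23": d[(1, 4)] + d[(2, 3)],
    }
    return min(sums, key=sums.get)

print(expected_jc_split_type_b(t_i=0.2, t_l=1.0, t_s=0.2, R=5.0))    # moderate imbalance
print(expected_jc_split_type_b(t_i=0.05, t_l=1.5, t_s=0.1, R=10.0))  # short internal edge
```

In the first setting the true split still has the minimal expected sum. In the second, the internal edge is short relative to the deviation from additivity, and the sum for (13|24) becomes minimal.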

*Δ*_{JC} with that of *Δ*_{K2P} when used to reconstruct its “worst case scenario” quartets of type B. Interestingly, *Δ*_{JC} ends up outperforming *Δ*_{K2P} on many of these quartets, due to its reduced stochastic noise (as predicted in our discussion revolving around Figure 1b). For example, consider a series of homogeneous K2P quartets of type B with ti-tv ratio *R* = 5, whose edge lengths were set as follows: *t*_{i} = 0.2, *t*_{l} = 1.0, and *t*_{s} ∈ [0.2, 1.0]. We assessed reconstruction accuracy for both SR functions (*Δ*_{JC} and *Δ*_{K2P}) across this series of quartets, by generating 100,000 simulations of the substitution process using 1,000 bp long sequences for each quartet (Figure 3a). Despite its deviation from additivity, *Δ*_{JC} outperforms the additive SR function *Δ*_{K2P} on many of these quartets (as long as *t*_{l}/*t*_{s} < 3.6). Note that as *t*_{s} shrinks, the deviation of *Δ*_{JC} from additivity increases, since the interval [*t*_{0}, *t*_{1}] expands. This experiment appears to indicate that the deviation of *Δ*_{JC} from additivity has to be quite large for *Δ*_{K2P} to outperform it.

### Fisher’s criterion for separability

*X* ∼ *N*(*μ*_{1}, *σ*_{1}) and *Y* ∼ *N*(*μ*_{2}, *σ*_{2}) using the following measure^{f} ([20, 21]):

We use FC to measure the separability of the distance sum corresponding to the true split (which should be the minimal sum for consistent SR functions) from the two remaining sums. For the expectation *μ* of each sum we use the true distances as computed by the SR function on the actual model parameters. For the variance *σ*^{2}, we use the sum of the approximate variances of the two distances involved in the sum. We expect that an SR function which provides a larger separation of the smallest sum from the two other sums will imply a better reconstruction probability.
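As a concrete reading of this procedure, the sketch below assumes that FC takes the form |*μ*_{1} − *μ*_{2}| / √(*σ*_{1}² + *σ*_{2}²), that is, the square root of the commonly used Fisher criterion, as endnote f suggests. The helper `sum_statistics` is hypothetical; it combines two distances into one sum by adding means and adding variances, which ignores any correlation between them.

```python
import math

def fisher_criterion(mu1, sigma1, mu2, sigma2):
    """Assumed form of FC: |mu1 - mu2| / sqrt(sigma1^2 + sigma2^2)."""
    return abs(mu1 - mu2) / math.sqrt(sigma1 ** 2 + sigma2 ** 2)

def sum_statistics(means, variances):
    """Mean and standard deviation of a sum of two estimated distances,
    treating the two estimates as uncorrelated."""
    return sum(means), math.sqrt(sum(variances))

# Hypothetical numbers: the true-split sum versus one competing sum.
mu_true, sd_true = sum_statistics([1.0, 1.0], [0.02, 0.02])
mu_rival, sd_rival = sum_statistics([0.6, 1.8], [0.01, 0.05])
print(fisher_criterion(mu_true, sd_true, mu_rival, sd_rival))
```

A larger value indicates that the two sums are easier to tell apart given their noise, which is the property the text uses to compare SR functions.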

We note that FC is not an exact indicator of the separability in our case, because the necessary criteria for this are not satisfied in our model. Namely, the two distance sums are not normally distributed, and they are correlated through the substitution process along the external edges of the quartet. Nevertheless, as Figure 3b suggests, FC turns out to provide a quite reliable comparison of the expected performance of *Δ*_{JC} and *Δ*_{K2P} for the quartet series considered in the aforementioned experiment. Figure 3b exhibits for each quartet the FC of *Δ*_{JC} alongside that of *Δ*_{K2P}, both associated with the comparison of the true split (12|34) and the “*Δ*_{JC} favored split” (13|24). As shown, the trends observed in both FC plots closely resemble the trends observed in the reconstruction accuracy plot (Figure 3a), and the equilibrium point of the FC values of *Δ*_{JC} and *Δ*_{K2P} is very close to the equilibrium point of the accuracy of reconstructions of these two functions (near *t*_{l}/*t*_{s} = 3.6).

*SEP* (for “separation”) and its denominator by *NOISE*, then a comparison of FC estimates between two SR functions *Δ*_{1}, *Δ*_{2} can be represented as a ratio of ratios:

*Δ*_{JC} and that of *Δ*_{K2P} can be carried out by tracing the *SEP* and *NOISE* ratios along four series of homogeneous K2P quartets: the bottom-left plot corresponds to the quartet series considered in Figure 3; the plot above it corresponds to the same series with ti-tv ratio *R* = 2; the two plots on the right describe two quartet series in which the weight of the short edges is constant, *t*_{s} = 0.2, and the weight of the long edges ranges in [0.2, 1]. These four series demonstrate several typical trends in the behavior of the *SEP* and *NOISE* ratios. First, we observe that the *NOISE* ratio decreases (favoring *Δ*_{JC}) as the diameter of the quartet (*t*_{24}) increases (it is almost constant in the two series on the left, and monotone decreasing in the series on the right). This is because the diameter provides the major contribution to the stochastic noise (for both *Δ*_{JC} and *Δ*_{K2P}), and as it increases, the ratio between the stochastic noise of *Δ*_{K2P} and *Δ*_{JC} increases as well. We also observe a natural decrease in the *NOISE* ratio with an increase in the ti-tv ratio (the *NOISE* ratio for *R* = 5 is consistently smaller than for *R* = 2). Concerning the *SEP* ratio, we see it becomes smaller (favoring *Δ*_{K2P}) as the quartet becomes more unbalanced (the *SEP* ratio decreases along the X axis in each of the four plots). This is because the deviation of *Δ*_{JC} from additivity increases as the inter-leaf distance interval [*t*_{0}, *t*_{1}] = [*t*_{13}, *t*_{24}] expands. Deviation of *Δ*_{JC} from additivity also increases with the ti-tv ratio, as the substitution model further departs from the assumptions of JC (the *SEP* ratio for *R* = 5 is consistently smaller than for *R* = 2).

The two series on the right side of Figure 4 demonstrate well the tradeoff between the effects of stochastic noise and deviation from additivity. In both series, the *SEP* and *NOISE* ratios decrease as the quartets become more unbalanced (due to the trends listed above). However, the rates of decrease of these two ratios are different due to the different ti-tv ratios, and this determines the expected relative performance of the two SR functions across the series. When *R* = 2, the *SEP* ratio decreases at a slower rate than the *NOISE* ratio, and *Δ*_{JC} is expected to outperform *Δ*_{K2P} across the entire series. When *R* = 5, the *SEP* ratio decreases at a faster rate than the *NOISE* ratio, and when the quartets are sufficiently unbalanced (*t*_{l}/*t*_{s} > 4) *Δ*_{K2P} is expected to outperform *Δ*_{JC}.
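The *SEP* and *NOISE* ratios can be approximated for a single type B quartet as follows. This is a rough sketch under several assumptions that are not taken from the paper: *SEP* and *NOISE* are read as the numerator and denominator of FC, with FC = |*μ*_{1} − *μ*_{2}| / √(*σ*_{1}² + *σ*_{2}²); the JC variance uses the binomial delta-method form; the K2P variance uses Kimura's classical delta-method formula as a stand-in for the paper's equation (14); and taxa 1, 3 are placed on the short edges. All function names are hypothetical.

```python
import math

def k2p_probs(t, R):
    """Transition probability P and total transversion probability Q at time t."""
    e1 = math.exp(-2.0 * t / (R + 1.0))
    e2 = math.exp(-(2.0 * R + 1.0) * t / (R + 1.0))
    return 0.25 + 0.25 * e1 - 0.5 * e2, 0.5 - 0.5 * e1

def jc_mean_var(t, R, k):
    """Expected JC distance and its delta-method variance for k sites."""
    P, Q = k2p_probs(t, R)
    p = P + Q
    g = 1.0 - 4.0 * p / 3.0
    return -0.75 * math.log(g), p * (1.0 - p) / (k * g * g)

def k2p_mean_var(t, R, k):
    """Expected K2P distance and Kimura's delta-method variance for k sites."""
    P, Q = k2p_probs(t, R)
    c1 = 1.0 / (1.0 - 2.0 * P - Q)
    c2 = 1.0 / (1.0 - 2.0 * Q)
    c3 = 0.5 * (c1 + c2)
    mean = 0.5 * math.log(c1) + 0.25 * math.log(c2)
    var = (c1 * c1 * P + c3 * c3 * Q - (c1 * P + c3 * Q) ** 2) / k
    return mean, var

def sep_and_noise(mean_var, t_i, t_l, t_s, R, k):
    """SEP and NOISE for the true split 12|34 versus 13|24 on a type B quartet."""
    true_pair = [mean_var(t_s + t_l, R, k)] * 2
    rival_pair = [mean_var(2 * t_s + t_i, R, k), mean_var(2 * t_l + t_i, R, k)]
    sep = abs(sum(m for m, _ in rival_pair) - sum(m for m, _ in true_pair))
    noise = math.sqrt(sum(v for _, v in true_pair + rival_pair))
    return sep, noise

def fc_comparison(t_i, t_l, t_s, R, k=1000):
    """Return (SEP ratio, NOISE ratio, FC ratio) of JC relative to K2P."""
    sep_jc, noise_jc = sep_and_noise(jc_mean_var, t_i, t_l, t_s, R, k)
    sep_k2p, noise_k2p = sep_and_noise(k2p_mean_var, t_i, t_l, t_s, R, k)
    sep_ratio = sep_jc / sep_k2p
    noise_ratio = noise_jc / noise_k2p
    return sep_ratio, noise_ratio, sep_ratio / noise_ratio

print(fc_comparison(t_i=0.2, t_l=1.0, t_s=1.0, R=5.0))
print(fc_comparison(t_i=0.2, t_l=1.0, t_s=0.2, R=5.0))
```

An FC ratio above 1 favors the JC distance and a ratio below 1 favors the K2P distance. The sequence length *k* cancels in both ratios under these approximations, so it affects only the absolute FC values.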

## Simulations on Hasegawa’s Tree

In our study we used the tree structure and edge lengths to generate simulated data sets. We considered the tree in various scales, by setting the tree diameter (largest inter-taxon path length) to values in the interval [0.1, 2.0]. For each scale considered, 10,000 simulations were carried out, where in each simulation 500 bp sequences were evolved along the tree according to a homogeneous K2P substitution model with ti-tv ratio of *R* = 2. For each simulated data set, estimated values of the K2P statistics *p*_{α} and *p*_{β}, denoted by $\hat{p}_{\alpha}$ and $\hat{p}_{\beta}$, were extracted for all $\binom{7}{2}$ pairs of taxa. Subsequently, several distance matrices were computed for each data set by applying different SR functions to these estimated statistics. Reconstruction accuracy was evaluated by applying the Neighbor Joining (NJ) algorithm [31, 32] to these distance matrices and recording the *Robinson-Foulds topological distance* (RF) [33] between the reconstructed tree and the Hasegawa tree. Sequence simulation was performed using Seq-Gen [34] (by choosing the HKY model with uniform base frequencies), and tree reconstruction was performed using the version of NJ implemented in the PHYLIP package [35].

We studied the reconstruction accuracy associated with four different SR functions: *Δ*_{JC}, *Δ*_{K2P}, *Δ*_{tv}, and *Δ*_{R=2}. The first two are as described in Equations (5) and (4), respectively. The third SR function, *Δ*_{tv}, considers only tv-type substitutions: $\Delta_{\mathrm{tv}}(p_{\alpha},p_{\beta}) = -\frac{1}{4}\log(1-4p_{\beta}(t)) = \beta t$, and the fourth SR function, *Δ*_{R=2}, is based on a maximum likelihood (ML) estimator^{g} of the time *t* from the estimated transition probabilities $\hat{p}_{\alpha},\hat{p}_{\beta}$, given that *R* = 2. Informally, this function, which uses knowledge of the true value of *R* (which is typically unknown to the user), is optimal in our setting, because its stochastic noise is similar to that of *Δ*_{JC}, and it is additive since it coincides with *Δ*_{K2P} when applied to transition probabilities $\hat{p}_{\alpha},\hat{p}_{\beta}$ that are consistent with a ti-tv ratio of *R* = 2.
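The step from a pair of aligned sequences to distance estimates can be sketched as follows. The formula for `delta_tv` is the one given above; `delta_k2p` and `delta_jc` use the standard Kimura and Jukes-Cantor corrections written in terms of *p*_{α} and *p*_{β}, which is an assumption here since Equations (4) and (5) are not reproduced in this section. The transversion count is divided by 2*k* because *p*_{β} is taken to be the probability of one specific transversion type, in line with *p* = *p*_{α} + 2*p*_{β}. The ML-based function *Δ*_{R=2} is not covered.

```python
import math

PURINES = frozenset("AG")

def k2p_statistics(seq1, seq2):
    """Estimate (p_alpha_hat, p_beta_hat) from two aligned, gap-free DNA sequences."""
    k = len(seq1)
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions / k, transversions / (2.0 * k)

def delta_tv(p_alpha, p_beta):
    """Transversion-only distance from the text: -(1/4) log(1 - 4 p_beta)."""
    return -0.25 * math.log(1.0 - 4.0 * p_beta)

def delta_k2p(p_alpha, p_beta):
    """Assumed K2P distance in this notation."""
    return (-0.5 * math.log(1.0 - 2.0 * p_alpha - 2.0 * p_beta)
            - 0.25 * math.log(1.0 - 4.0 * p_beta))

def delta_jc(p_alpha, p_beta):
    """Assumed JC distance applied to the total substitution probability."""
    return -0.75 * math.log(1.0 - 4.0 * (p_alpha + 2.0 * p_beta) / 3.0)

pa, pb = k2p_statistics("AAAACCCCGG", "AGAACTCAGG")
print(pa, pb, delta_jc(pa, pb), delta_k2p(pa, pb), delta_tv(pa, pb))
```

In a full pipeline these functions would be applied to every pair of taxa to fill a distance matrix, which is then passed to a tree-building method such as NJ.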

The performance of these four SR functions is traced across the different tree scales in Figure 5a. For each SR function *Δ* and scale *s*, we recorded the average normalized RF distance from the true tree to each of the 10,000 trees reconstructed using *Δ*. The RF distance was normalized by its maximum value, which is twice the number of internal edges in the tree (in our case 2×4=8). As observed previously in [9], *Δ*_{K2P} performed well in shorter scales, and *Δ*_{tv} performed well in longer scales. However, both additive SR functions were significantly outperformed in nearly all cases by *Δ*_{JC}. Surprisingly, *Δ*_{JC} even slightly outperformed *Δ*_{R=2}. We speculate that this happened due to a bias similar to the one observed in type A quartets in Section Performance of Non affine-additive SR functions in quartet resolution, improving the performance of concave SR functions such as *Δ*_{JC} on certain K2P-trees.
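The normalization of the RF distance can be made concrete with a small example. In this sketch each tree is represented by its set of non-trivial splits, and each split is encoded by the set of taxa on the side that excludes a fixed reference taxon; this encoding and the function name are illustrative choices.

```python
def normalized_rf(splits_true, splits_reconstructed, internal_edges):
    """Robinson-Foulds distance divided by its maximum, 2 * internal_edges.

    The RF distance counts the splits present in exactly one of the two trees.
    """
    rf = len(splits_true ^ splits_reconstructed)
    return rf / (2.0 * internal_edges)

# Seven-taxon caterpillar ((((((1,2),3),4),5),6),7); taxon 7 is the reference.
true_tree = {frozenset({1, 2}), frozenset({1, 2, 3}),
             frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 4, 5})}
# The same caterpillar with taxa 2 and 3 exchanged.
other_tree = {frozenset({1, 3}), frozenset({1, 2, 3}),
              frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 4, 5})}

print(normalized_rf(true_tree, other_tree, internal_edges=4))  # 0.25
```

One misplaced split contributes 2 to the RF distance (one split missing, one split extra), so the normalized value is 2/8 for a seven-taxon binary tree.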

To test this hypothesis, we went through a similar experiment with a more symmetric seven-taxon caterpillar tree, with internal edges of uniform length *t*_{int}, and external edges of uniform length *t*_{ext} = 5*t*_{int} (Figure 5b). The symmetry of this tree was expected to reduce the effect of the reconstruction bias observed in Hasegawa’s tree, and indeed, *Δ*_{JC} performed much more poorly on this tree. Despite this fact, *Δ*_{JC} still outperformed *Δ*_{K2P} in all scales and *Δ*_{tv} in the smaller scales (*s* < 1.1).

## Inferring trees from genomic sequences

In this section we describe our study comparing various SR functions on genomic DNA sequences. In addition to *Δ*_{JC} and *Δ*_{K2P}, we also considered the well-known LogDet SR function [36, 37], denoted here as *Δ*_{LogDet}. Extending our study to this setting is challenging in two respects. First, unlike the simulated case, the true tree is not known with complete confidence, and accuracy of reconstruction can only be determined by using a well-accepted reference tree that may contain some errors. Second, the true substitution model is also unknown and is likely to violate the assumptions of both JC and K2P models and even the relaxed assumptions of the general time-reversible model (in which *Δ*_{LogDet} is additive). Hence, we have to assume in this case that *Δ*_{JC}, *Δ*_{K2P}, and *Δ*_{LogDet} are all non affine-additive, where *Δ*_{JC} and *Δ*_{K2P} are still likely to exhibit higher deviation from additivity than *Δ*_{LogDet}, since they make stronger assumptions on the substitution model.

### The genomic data set

In building the genomic data set, we made use of a set of 31 *clusters of orthologous groups* (COGs) which was compiled by Ciccarelli et al. and used for inferring phylogenetic relationships amongst a large number of species in [38, 39]. These 31 gene families were selected to capture the evolutionary history of the species containing them. This was done in [38] by making sure that the genes in these families have the following properties: (1) they are highly conserved across species, (2) they have a small number of paralogs, and (3) they are weakly affected by horizontal gene transfer. We scanned the NCBI genome database and found 199 bacterial genomes that contained all annotated COGs. For each of the 31 COGs, we extracted the appropriate protein sequence in each of the 199 bacterial species, choosing an arbitrary paralog in cases of multiple hits. We followed a procedure similar to the one described in [38, 39] to obtain reliable multiple-sequence alignments for each COG: we computed a 199-way multiple alignment of the protein sequences of each COG using HMMalign [40] and then mapped each protein sequence back to its coding DNA sequence. The conserved parts of each of the 31 DNA alignments were extracted using GBLOCKS [41] to filter out alignment columns with 50% or more gap symbols. The alignments were manually scanned, and 36 species which contributed a large number of gaps to the alignments were removed from the subsequent analysis. The 31 different alignments were concatenated to form one long 163-way multiple sequence DNA alignment.

For the reference tree we used the phylogenetic tree of microbial species provided by the ARB-SILVA Living Tree Project [42]. This tree, spanning 8,029 species at the time of writing, is based on a widely accepted analysis of the small subunit (SSU) 16S rRNA. A subtree spanning our 163 bacterial species was extracted from this tree and treated as the true phylogenetic tree in our analysis.

### Reconstruction accuracy for ten-species subsets

We used the base set of 163 species to generate 40,000 random 10-species sub-alignments. The random selection process was guided to generate species subsets corresponding to a wide range of diameter scales (a blind random selection process is biased toward subsets with large diameters). For each of the 40,000 subsets, a 10-way sub-alignment was extracted from the original 163-way alignment, and in this alignment we extracted only columns corresponding to four-fold degenerate sites that do not have any gap symbol. This was done to make sure the sites used for distance estimation have undergone a substitution process that is as uniform as possible along the different lineages and across the different sites. Each sub-alignment was used to compute three distance matrices: one under *Δ*_{JC}, one under *Δ*_{K2P}, and one under *Δ*_{LogDet}. The latter was calculated by the version that is implemented in the PHYLIP package. The NJ algorithm was then applied to the three matrices and the resulting trees were compared to the true tree (as depicted by the appropriate LTP subtree) according to the RF distance.
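The column selection step can be illustrated with a simplified filter. The paper does not spell out its exact criterion, so the sketch below makes several assumptions: the alignment is in frame (codons start at columns 0, 3, 6, ...), the standard genetic code applies, and a third-codon position is kept only if every sequence has a gap-free codon whose first two bases belong to one of the eight four-fold degenerate codon families. The function names are hypothetical.

```python
# First two bases of the codon families whose third position is four-fold
# degenerate in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = frozenset({"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"})

def fourfold_gapfree_columns(alignment):
    """Return indices of third-codon columns that are four-fold degenerate
    and gap-free in every sequence of an in-frame coding alignment."""
    length = len(alignment[0])
    keep = []
    for start in range(0, length - 2, 3):
        codons = [seq[start:start + 3].upper() for seq in alignment]
        if all("-" not in c and c[:2] in FOURFOLD_PREFIXES for c in codons):
            keep.append(start + 2)
    return keep

def take_columns(alignment, columns):
    """Restrict every sequence of the alignment to the given columns."""
    return ["".join(seq[c] for c in columns) for seq in alignment]

aln = ["GCTAAACC-",
       "GCCAAGCCA"]
cols = fourfold_gapfree_columns(aln)
print(cols, take_columns(aln, cols))
```

In this toy alignment only the first codon qualifies: the second codon (AAA/AAG) is not four-fold degenerate, and the third contains a gap.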

As an additional comparison, we used a fourth reconstruction technique. This method (termed BIONJ-GTR) used the BIONJ reconstruction algorithm [43] on distances obtained under the general time-reversible model with invariant sites and Gamma distribution of rates across variant sites (GTR+*Γ*+I) [8, 44].

*Γ*+I model since it was found by the MEGA5 software [46] to provide the best fit to the sequence data. The 40,000 sampled instances were partitioned into eight bins according to the RF distance observed between the BIONJ-GTR tree and the true (LTP) tree, and average RF distances were recorded for each of the three SR functions in each bin. This allowed us to observe trends throughout these 40,000 samples (Figure 6). Of the 40,000 trees inferred under *Δ*_{JC}, 83.1% showed an equal or lower RF distance than those reconstructed by the BIONJ-GTR method. Moreover, *Δ*_{JC} outperformed *Δ*_{K2P} and *Δ*_{LogDet} on average in all partitions, and *Δ*_{LogDet} showed by far the worst performance with 48.7% of all reconstructed trees achieving higher RF distances to the reference tree than those inferred by BIONJ-GTR. As with our results on simulated data sets, we see that the SR functions with lower stochastic error but inferior model fit performed best. Unsurprisingly, the GTR+*Γ*+I model itself, which was predicted to have the best fit to the sequence data, was often outperformed by the simpler JC and K2P models. Note that the difference in performance between *Δ*_{JC} and the two other SR functions is greater for subsets that are more accurately reconstructed by the BIONJ-GTR approach (the lower bins). This appears to indicate that over-simplified distance methods are particularly beneficial when the sequence data conveys a stronger phylogenetic signal.

## Conclusions

In this paper we explored the basic properties of methods for estimating evolutionary distances, and studied how these properties affect the accuracy of distance-based phylogenetic reconstruction. We considered both the systematic bias and the stochastic noise (variance) of the distance estimators, and examined the tradeoff between these two factors. We focused on the common task of phylogenetic reconstruction under homogeneous substitution models. Assuming homogeneous models simplifies the analytical framework, since in such models each SR function is reduced to a univariate function of the evolutionary time *t*. However, obtaining accurate estimates of *t* is still a hard task in this setting, since the unit rate matrix is unknown. An SR function *Δ* is guaranteed to yield consistent reconstruction across *all* trees in a homogeneous model only if it is additive, meaning that it is a linear function of *t*. When *Δ* is not additive, it introduces a systematic bias in distance estimates, which we denoted here as *deviation from additivity*. Some SR functions are only additive in one homogeneous model, whereas others are additive across a wider collection of homogeneous models. This less constrained additivity is typically achieved at a price of increased estimation noise. We studied the tradeoff between “deviation from additivity” and “estimation noise” via a case study where the model tree is a homogeneous K2P-tree with an unknown ti-tv ratio *R*. In this case, Kimura’s distance formula *Δ*_{K2P} is always additive, while the less noisy Jukes-Cantor formula, *Δ*_{JC}, is additive only when $R=\frac{1}{2}$.

A study of this type requires a way to measure the deviation from additivity of a non-additive SR function *Δ* in a given range of distances [*t*_{0},*t*_{1}]. To this end, we introduced the concept of affine-additive distance functions, and defined the deviation from additivity of *Δ* in [*t*_{0},*t*_{1}] as the distance of *Δ* from its closest affine-additive function in [*t*_{0},*t*_{1}]. We established a tight connection between this measure and statistical consistency of reconstruction (Lemma 2.7) and derived an upper bound for deviation from additivity in homogeneous models (Lemma 2.8). We applied these results in analyzing the deviation from additivity of *Δ*_{JC}, and its effect on the accuracy of reconstructing homogeneous K2P-trees. We then showed, both analytically (in Section Deviation from additivity in homogeneous substitution models) and through experiments on simulated data sets (in Sections Performance of Non affine-additive SR functions in quartet resolution and Simulations on Hasegawa’s Tree), that, compared to *Δ*_{K2P}, it is often better to use the non-additive but less noisy estimates of *Δ*_{JC}, even when *R* is quite high. Somewhat surprisingly, we found this to be the case even when the tree being reconstructed has an “unfavorable” topology. Our experiments on bacterial gene sequences (Section Inferring trees from genomic sequences) also indicate that the simple and less noisy SR functions perform better on average than ones that are expected to better fit the true substitution process.

The framework presented in this paper implies a practical way for selecting SR functions which are likely to increase the accuracy of distance estimation. The practicality of the method is drawn from the fact that the criteria by which we select an SR function depend only on relatively crude information about the tree being reconstructed. For instance, in the case of a homogeneous K2P-tree, one can easily obtain from the input sequences rough estimates of both the ti-tv ratio *R* and the range of inter-leaf times [*t*_{0}, *t*_{1}]. These estimates can then be used to compare the expected accuracies of *Δ*_{JC} and *Δ*_{K2P} on the given input, and determine which of them is more likely to yield an accurate phylogeny. For quartets, a tight comparison can be made using the FC-based approach suggested in Section Fisher’s criterion for separability, and for larger trees, a cruder comparison can be made using a plot like the one presented in Figure 1b. A promising avenue of further research is to extend the FC-based approach to allow tighter prediction of reconstruction accuracy of trees spanning more than four taxa.

## Endnotes

^{a} This is a WABI 2011 special issue invited paper. An extended abstract of this paper appeared in [47].

^{b} Typically, the unit rate matrix is assumed to be the one corresponding to one substitution per site.

^{c} Many common distance-based algorithms, such as the Neighbor Joining (NJ) algorithm [31, 32], are known to be robust in this sense.

^{d} In a tree, edges which touch leaves are *external*, and all other edges are *internal*.

^{e} Types A and B quartets represent the *Farris zone* and *Felsenstein zone*, resp. (see, e.g., [1], Chapter 9).

^{f} We use here the square root of the criterion commonly used in the literature, because we prefer to think in terms of distances rather than squares of distances. This has no practical influence, since we use FC only for comparing between different choices, not for assessing the quality of a given choice.

^{g} This ML estimate is obtained by a simple numerical method for maximizing the likelihood function (see, e.g., [1]).

## Appendix

## Tightness of Lemma 2.8

Let *f*(*t*) be a (continuous) function on some interval [*t*_{0}, *t*_{1}]. We prove below that if *f* does not intersect its linear interpolation *At* + *B* in that interval, then $\mathrm{dev}(f,[t_{0},t_{1}])=\frac{1}{A}\max_{t\in[t_{0},t_{1}]}\{|f(t)-At-b^{\ast}|\}$. We use the following notations, conforming to the notations in the proof of Lemma 2.8:

### Lemma 2.9

Let *f*(*t*) be a monotone increasing function in the interval [*t*_{0}, *t*_{1}] and let *At* + *B* be its linear interpolation in [*t*_{0}, *t*_{1}]. If either *f*(*t*) ≥ *At* + *B* for all *t* ∈ [*t*_{0}, *t*_{1}] or *f*(*t*) ≤ *At* + *B* for all *t* ∈ [*t*_{0}, *t*_{1}], then for all *a* > 0, we have $\frac{1}{a}\psi(a)\ge\frac{1}{A}\psi(A)$.

### Proof

We prove the minimality of $\frac{1}{A}\psi(A)$ in the case where *f*(*t*) ≥ *At* + *B* for all *t* ∈ [*t*_{0}, *t*_{1}]. The other case (where *f*(*t*) ≤ *At* + *B* for all *t* ∈ [*t*_{0}, *t*_{1}]) can be proven in an identical fashion.

For *a* > 0, let *b*_{a} be the maximum value of *b*′ s.t. $\psi(a,b^{\prime},t)\ge 0$ for all *t* ∈ [*t*_{0}, *t*_{1}]. Evidently, $\psi(a)=\frac{1}{2}\psi(a,b_{a})$. If the linear interpolation of *f*(*t*) in [*t*_{0}, *t*_{1}] is given by *At* + *B*, then *b*_{A} = *B*. We need to show that for every *a* > 0, it holds that *Aψ*(*a*, *b*_{a}) > *aψ*(*A*, *b*_{A}). Let *t*_{A} be a point in [*t*_{0}, *t*_{1}] s.t. *ψ*(*A*, *b*_{A}, *t*_{A}) = *ψ*(*A*, *b*_{A}). Note that if *a* < *A*, then the two linear functions *At* + *b*_{A} and *at* + *b*_{a} intersect at (*t*_{0}, *f*(*t*_{0})), and if *a* > *A*, then they intersect at (*t*_{1}, *f*(*t*_{1})) (see Figure 7).

For *a* < *A*, we get the following equality (Figure 7; right):

*ψ*(*a*, *b*_{a}) ≥ *ψ*(*a*, *b*_{a}, *t*) for every *t* ∈ [*t*_{0}, *t*_{1}], and since *a* < *A*, we get

For *a* > *A*, we get the following equality (Figure 7; left)

*a* > *A* implies that

□
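The statement of Lemma 2.9 can be checked numerically on an example. The sketch below assumes that *ψ*(*a*) denotes the smallest achievable maximal distance between *f* and a line of slope *a* over the interval, which equals half the range of *f*(*t*) − *at*; this reading is inferred from the relation *ψ*(*a*) = ½*ψ*(*a*, *b*_{a}) and is not a definition quoted from the paper. The test function √*t* on [1, 4] is an arbitrary concave, increasing example.

```python
import math

def psi(f, a, t0, t1, steps=2000):
    """Half the range of f(t) - a*t on a grid over [t0, t1]: the best
    sup-norm distance between f and any line of slope a (assumed meaning of psi)."""
    residuals = []
    for i in range(steps + 1):
        t = t0 + (t1 - t0) * i / steps
        residuals.append(f(t) - a * t)
    return 0.5 * (max(residuals) - min(residuals))

def check_lemma(f, t0, t1, slopes):
    """Check psi(a)/a >= psi(A)/A for every slope a, where A is the
    slope of the linear interpolation of f on [t0, t1]."""
    A = (f(t1) - f(t0)) / (t1 - t0)
    bound = psi(f, A, t0, t1) / A
    return all(psi(f, a, t0, t1) / a >= bound - 1e-12 for a in slopes)

slopes = [0.05 * i for i in range(1, 41)]   # candidate slopes 0.05 .. 2.0
print(check_lemma(math.sqrt, 1.0, 4.0, slopes))
```

For this example the interpolation slope is *A* = 1/3 and the minimal value *ψ*(*A*)/*A* is approximately 0.125, which under the reading above is the deviation of √*t* from additivity on [1, 4].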

## Declarations

### Acknowledgements

This research was supported by the Israel Science Foundation (ISF) grant No. 509/11. We also acknowledge the support for the publication fee by the Deutsche Forschungsgemeinschaft and the Open Access Publication Funds of Bielefeld University. The third author would also like to thank the Max Planck Institute for Informatics for supporting a visit under which part of this research was carried out.

## References

- Felsenstein J: Inferring Phylogenies. Sunderland: MA Sinauer Associated Inc, 2004.Google Scholar
- Semple C, Steel M: Phylogenetics. Oxford University Press, 2003.Google Scholar
- Papoulis A, Pillali SU: Probability, Random Variables and Stochastic Processes. 2002, New York: McGraw Hill Higher Education,Google Scholar
- Jukes T, Cantor C: Evolution of Protein Molecules. Mammalian Protein Metab. Edited by: Munro H. New York: Academic Press, 1969, 21-132.View ArticleGoogle Scholar
- A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16 (2): 111-120. 10.1007/BF01731581PubMedView ArticleGoogle Scholar
- Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22 (2): 160-174. 10.1007/BF02101694PubMedView ArticleGoogle Scholar
- Tavaré S: Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. Lectures on Mathematics in the Life Sci. 1986, 17: 57-86.Google Scholar
- Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates. J Mol Evol. 1984, 20: 86-93. 10.1007/BF02101990PubMedView ArticleGoogle Scholar
- Gronau I, Moran S, Yavneh I: Towards Optimal Distance Functions for Stochastic Substitution Models. J Theor Biol. 2009, 260 (2): 294-307. 10.1016/j.jtbi.2009.05.028PubMedView ArticleGoogle Scholar
- Gronau I, Moran S, Yavneh I: Adaptive Distance Measures for Resolving K2P Quartets: Metric Separation versus Stochastic Noise. J Comp Biol. 2010, 17 (11): 1391-1400.View ArticleGoogle Scholar
- Felsenstein J: Cases in which parsimony or compatability methods will be positively misleading. Syst Zool. 1978, 27: 401-410. 10.2307/2412923View ArticleGoogle Scholar
- Cavender J: Taxonomy with confidence. Math Biosci. 1978, 40: 271-280. 10.1016/0025-5564(78)90089-5View ArticleGoogle Scholar
- Steel M, Penny D: Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol Biol Evol. 2000, 17: 839-850. 10.1093/oxfordjournals.molbev.a026364PubMedView ArticleGoogle Scholar
- Sober E: A likelihood justification of parsimony. Cladistics. 1985, 1: 209-233. 10.1111/j.1096-0031.1985.tb00424.xView ArticleGoogle Scholar
- Felstenstein J, Sober E: Parsimony and likelihood: an exchange. Syst Zool. 1986, 35: 617-626. 10.2307/2413121View ArticleGoogle Scholar
- Yang Z: How often do wrong models produce better phylogenies?. Mol Biol Evol. 1997, 14: 105-108. 10.1093/oxfordjournals.molbev.a025695PubMedView ArticleGoogle Scholar
- Bruno WJ, Halpern AL: Topological bias and inconsistency of maximum likelihood using wrong models. Mol Biol Evol. 1999, 16 (4): 564-566. http://www-t10.lanl.gov/billb/BrunoHalpern99.pdf 10.1093/oxfordjournals.molbev.a026137PubMedView ArticleGoogle Scholar
- Zharkikh A: Estimation of evolutionary distances between nucleotide sequences. J Mol Evol. 1994, 39 (3): 315-329. 10.1007/BF00160155PubMedView ArticleGoogle Scholar
- Gascuel O, Guindon S: Efficient Biased Estimation of Evolutionary Distances When Substitution Rates Vary Across Sites. Mol Biol Evol. 2002, 19 (4): 534-543. 10.1093/oxfordjournals.molbev.a004109PubMedView ArticleGoogle Scholar
- Fisher R: The use of multiple measurements in taxonomic problems. Ann of Eugenics. 1936, 7: 177-188.Google Scholar
- Duda R, Hart P: Pattern Classification and Scene Analysis. Hoboken: John Wiley and Sons, 1973.Google Scholar
- Sumner J, Fernandez-Sanchez J, Jarvis P: Lie Markov Models. J Theor Biol. 2012, 298: 16-31.PubMedView ArticleGoogle Scholar
- Buneman P: The recovery of trees from measures of dissimilarity. Mathematics in the Archeological and Historical Sciences. Edited by: Hodson F, Kendall D, Tautu P. Edinburgh University Press, 1971, 387-395.Google Scholar
- Sattath S, Tversky A: Additive similarity trees. Psychometrica. 1977, 42 (3): 319-345. 10.1007/BF02293654View ArticleGoogle Scholar
- Atteson K: The Performance of Neighbor-Joining Methods of Phylogenetic Reconstruction. Algorithmica. 1999, 25: 251-278. 10.1007/PL00008277
- Erdos P, Steel M, Szekely L, Warnow T: A few logs suffice to build (almost) all trees (I). Random Struct Algorithms. 1999, 14: 153-184. 10.1002/(SICI)1098-2418(199903)14:2<153::AID-RSA3>3.0.CO;2-R
- Erdos P, Steel M, Szekely L, Warnow T: A few logs suffice to build (almost) all trees (II). Theoret Comput Sci. 1999, 221: 77-118. 10.1016/S0304-3975(99)00028-6
- Johnson L, Riess R: Numerical Analysis. Boston: Addison Wesley, 1977.
- Oehlert G: A note on the delta method. Am Statistician. 1992, 46: 27-29.
- Zaretskii K: Constructing a tree on the basis of a set of distances between the hanging vertices. Uspekhi Mat Nauk. 1965, 20 (6): 90-92. [In Russian].
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.
- Studier J, Keppler K: A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol. 1988, 5 (6): 729-731.
- Robinson F, Foulds R: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131-147. 10.1016/0025-5564(81)90043-2
- Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997, 13 (3): 235-238.
- Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5: 164-166.
- Steel M: Recovering a tree from the leaf colourations it generates under a Markov model. Appl Math Lett. 1994, 7 (2): 19-24. 10.1016/0893-9659(94)90024-8
- Lockhart P, Steel M, Hendy M, Penny D: Recovering evolutionary trees under a more realistic model of sequence evolution. Mol Biol Evol. 1994, 11 (4): 605-612.
- Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward Automatic Reconstruction of a Highly Resolved Tree of Life. Science. 2006, 311 (5765): 1283-1287. 10.1126/science.1123061
- von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, Jensen LJ, Ward N, Bork P: Quantitative Phylogenetic Assessment of Microbial Communities in Diverse Environments. Science. 2007, 315 (5815): 1126-1130. 10.1126/science.1133420
- Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1999.
- Talavera G, Castresana J: Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol. 2007, 56: 564-577. 10.1080/10635150701472164
- Yarza P, Ludwig W, Euzeby J, Amann R, Schleifer KH, Glockner FO, Rossello-Mora R: Update of the All-Species Living Tree Project based on 16S and 23S rRNA sequence analyses. Syst Appl Microbiol. 2010, 33: 291-299. 10.1016/j.syapm.2010.08.001
- Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol. 1997, 14 (7): 685-695. 10.1093/oxfordjournals.molbev.a025808
- Rodriguez F, Oliver JL, Marin A, Medina JR: The general stochastic model of nucleotide substitution. J Theor Biol. 1990, 142: 485-501. 10.1016/S0022-5193(05)80104-3
- Guindon S, Gascuel O: A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52: 696-704. 10.1080/10635150390235520
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Mol Biol Evol. 2011, 28: 2731-2739. 10.1093/molbev/msr121
- Doerr D, Gronau I, Moran S, Yavneh I: Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions. Algorithms in Bioinformatics, Volume 6833 of Lecture Notes in Computer Science. Edited by: Przytycka T, Sagot MF. Berlin / Heidelberg: Springer, 2011, 49-60.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.