We begin by describing the \(\texttt {INC}\) method in the unconstrained condition (i.e., when there are no constraint trees), and prove that it is AFC.
High-level description of \(\texttt {INC}\)
The input to \(\texttt {INC}\) is a dissimilarity matrix d. We present a high-level description of how tree \(t = \texttt {INC}(d)\) is computed.
-
Find the insertion ordering \(\sigma = x_1, x_2, \ldots , x_n.\)
-
Initialize t as the three-leaf tree on the first three taxa in the ordering.
-
For \(i=4\) up to n.
-
Determine the set of valid quartets to use for placing \(x_i\) into tree t.
-
Compute quartet trees for each valid quartet, and let them vote for where to place \(x_i.\)
-
Pick the edge e in t that has the most votes; if this is a tie, pick an edge at random from the set of edges with the most votes.
-
Insert \(x_i\) into edge e.
-
Return the resulting tree t.
Thus, at a high-level, the \(\texttt {INC}\) algorithm operates by greedily growing a tree t based on a computed sequence addition ordering. Yet, many of the details of the algorithm are unspecified (e.g., we do not say how we calculate the insertion ordering, how we determine the set of valid quartets, how we compute trees on valid quartets, and how the quartet trees vote). Below, we provide details for each of the steps for this algorithm.
Computing the sequence addition ordering
We begin by constructing a minimum spanning tree S of \(TG(d,\infty ),\) where d is the input dissimilarity matrix. Once the spanning tree S is computed, we choose an arbitrary leaf in S to be the starting vertex in the ordering. Then we order the vertices according to order of traversal in a BFS (or DFS) from the starting vertex. See Fig. 1 for an example.
How we compute the growing tree t
We initialize our tree t by choosing the first three taxa according to our insertion ordering and forming the three-leaf tree with an internal node that connects to all three taxa. In order to define how we insert the remaining taxa into t, we need to formally define the “valid quartets” and how we compute quartet trees for valid quartets.
Definition 4
Let \(q_0\) be the maximum weight of an edge in S and let \(q = 8q_0,\) A quartet of leaves is valid if its maximum pairwise distance (i.e., diameter) is at most q. The quartet tree for the valid quartet is computed using the Four Point Method.
As we will show later, the restriction to just the “valid quartets” allows us to develop a tree construction method that runs in polynomial time and that is AFC. Note also that restricting the diameter of the valid quartets to small values has a mixed effect: if the maximum permitted diameter is too small then even correct quartet trees will not be sufficient to reconstruct the tree, but if the maximum permitted diameter is too large then some computed quartet trees are more likely to be incorrect. We show that this setting for the maximum diameter q to be \(8q_0\) is sufficient to allow us to prove that the algorithm is AFC. However, we did not try to optimize this constant, and hence the choice of the constant 8 is likely not optimal (i.e., smaller constants might give better theoretical results).
There may be many valid quartets, but we only need to examine a linear number of these, as we now show. Suppose we wish to add a vertex x into t. Given an internal node u of t, because t is binary the removal of u splits t into three non-empty components which we will refer to as \(t_1,t_2,t_3\) (with internal nodes included). Because the leaves of t form a connected induced subtree of S, we can find taxa \(u_i\) in \(t_i\) and \(v_i \in V(t)\setminus t_i\) such that \((u_i,v_i)\) is an edge in S, for \(i=1,2,3.\) We will associate the triplet \(u_1,u_2,u_3\) to node u, so that the addition of leaf x defines a quartet, and we can check to see if the quartet is valid. We summarize this discussion with the following definition.
Definition 5
Let u be an internal node of a binary tree t and let \(u_1,u_2,u_3\) be leaves in the three components of t upon removing u such that there exists \(v_i\) with \((u_i, v_i)\) an edge in S where \(v_i\) is not in the same component as \(u_i.\) Then, for some choice of \(q \in \mathbb {R}^+,\) a quartet query on \(\{u_1,u_2,u_3,x\}\) is q-valid iff the d-diameter (maximum pairwise distance with respect to the input matrix d) of \(\{u_1,u_2,u_3,x\}\) is less than q. We use the term valid when q is clear from context.
To determine how to place leaf x into the growing tree t, we compute a tree for each valid quartet query (which includes x) using the Four Point Method. For example, if the tree computed on valid quartet query \(u_1,u_2,u_3,x\) (with \(u_i \in t_i,\) and \(t_1,t_2,t_3\) the three subtrees off internal node u) returns \(xu_1|u_2 u_3,\) then this implies that x should be placed in the subtree of t induced on \(t_1 \cup {u}\) hence, each edge in that subtree will receive a vote (Fig. 2).
Thus, each valid quartet adds a vote to each edge in some non-empty subtree of the tree. We define the support of an edge in the tree t to be the number of valid quartet trees that voted for that edge. We then choose an edge e in the tree t that has the largest total support (and if there is a tie, we pick an edge e at random with the maximum support). We then subdivide e and then make x adjacent to the node created. See Fig. 3 for an illustration of how this voting procedure operates.
Furthermore, as we will shortly show, when the sequences are long enough, then there will be a unique edge on which all queries agree, and so the algorithm will correctly add x into t (in such a way that it agrees with the model tree T), and so inductively the greedy algorithm will construct the model tree.
Theoretical properties of \(\texttt {INC}\) under the CFN model
We now establish the theoretical properties of \(\texttt {AFC}\) under the CFN model. Throughout this section, recall that \(d_{ij}\) is the empirical CFN dissimilarity between taxa i, j and \(D_{ij}\) is the underlying CFN model distance between taxa i, j defined by the model tree \((T,\Theta )\) with edge weights given by \(w(e)=-\frac{1}{2}\log (1-2p(e)).\) Given matrices d and D and positive real q, we define \(\epsilon (q) = \max \{|d_{ij}-D_{ij}|: d_{ij} \le q \text { or } D_{ij} \le q\}.\)
Theorem 2
Let d be a dissimilarity matrix, D an additive matrix defining a model tree T with edge weights \(w: E(T) \rightarrow \mathbb {R}^+,\) and \(q = 8q_0,\) where \(q_0\) is the maximum distance in the minimum spanning tree of \(TG(d,\infty ).\) Let f be the weight of the shortest internal edge in T and suppose \(\epsilon (q) < f/2\) and \(q_0 \ge f/2,\) then INC(d) returns T.
Proof
We proceed by induction on the number of leaves in T. Our claim holds when T has three leaves, since there is only one such topology. Let t be the tree maintained by INC before the insertion of the last taxon x according to our insertion ordering. By induction, t must have the correct topology. It suffices to show that valid queries can determine the accurate placing of x. Assume that the correct location on which to place x is \(e = (u,v)\) of t. Note that when \(\epsilon (q) < f/2,\) the Four Point Method is guaranteed to correctly construct quartet trees for all quartets with diameter at most q (and hence for all valid quartets), so that all valid node queries will return the correct quartet tree. Therefore, e will receive all possible votes.
To show that all other edges will miss at least one vote, it suffices to show that node queries are valid at u and v because all other edges will miss a vote from one of the two such queries, confirming that x is inserted at e (if one of u, v is a leaf vertex, the same conclusion is achieved).
We will show that a node query at u is valid; a similar argument is done for v. Let \(u_1, u_2, u_3\) be selected upon the deletion of u with their corresponding \(v_i.\) First, for all i, \(D(u_i,v_i) \le q_0 + f/2\) since \(d(u_i,v_i)\le q_0\) and \(\epsilon (q) \le f/2.\) Then, since \(v_i \in V\setminus t_i\) and t is the correct tree topology, \(D(u_i,u) \le D(u_i,v_i) \le q_0 + f/2,\) where D is the true distance matrix and u is the corresponding internal node in T.
Next, we claim that \(D(x,u) \le 2q_0+f\) and together we will have concluded that \(\{u_1,u_2,u_3,x\}\) is a quartet of d-diameter at most \(3q_0+2f\) since by triangle inequality, we deduce \(\{u_1,u_2,u_3,x\}\) is of D-diameter \(3q_0+3f/2\) and so the d-diameter is bounded at \(3q_0+2f \le 7q_0 \le q,\) proving validity.
For the last claim, since x is within \(q_0\) of some leaf \(x'\) in t, \(D(x,x') \le d(x,x') + f/2 \le q_0 + f/2.\) Since the path x to \(x'\) must pass through u or v in T and \(e = (u,v)\) is the correct edge insertion location, we conclude that \(\min (D(x,u),D(x,v)) \le q_0 + f/2.\)
If \(D(x,u) \le q_0+f/2,\) then our claim follows (we automatically get \(D(x,u) \le 2q_0 + f\)). Else, \(D(x,v) \le q_0+f/2.\) Since \(v \in V(t_3),\) \(D(x,u) \le D(x,v) + D(u,v) \le q_0+f/2 +D(u,v) \le q_0 +f/2 + D(u,u_3) \le q_0 +f/2 + q_0 + f/2 \le 2q_0+f.\)
Therefore, if the correct edge is (u, v), then the query at u votes against all edge placements for x in \(t_1 \cup t_2,\) and symmetrically for v. Hence all other edges will miss at least one vote. \(\square\)
Theorem 3
Given the
\(n \times n\)
dissimilarity matrix
d
and a given
q,
\(\texttt {INC}(d)\)
can be implemented in
\(O(n^2)\)
time and space.
Proof
It suffices to show that as we grow our tree t, each insertion step can be implemented to take O(n) time. First, each internal node, when initialized due to an inserted taxon, can store the necessary vertices \(u_1,u_2,u_3\) and that can be done in O(n) time. Each node query is O(1), so in total the |V(t)| node queries take O(n) time.
The only possible difficulty is efficiently finding the edge with the most votes. A naive implementation will lead to an \(O(n^3)\) runtime. For any edge \(e = (u,v)\) on the tree, let n(e) denote the total number of votes that e has. Note that if \(e, e'\) are adjacent edges in t with common vertex u, then \(n(e) - n(e')\) can be determined by simply looking at the short quartet query at u. Specifically, let \(e = (u,v)\) and \(e' = (u,v')\) be adjacent edges at vertex u. If the query at u was invalid or it showed that x should be in \(t_j\) with \(v,v' \not \in t_j,\) then \(n(e) - n(e') = 0.\) Otherwise, if the query returned \(t_j\) with \(v \in t_j,\) then \(n(e) - n(e') = 1;\) a similar argument shows that if \(v' \in t_j\) then \(n(e) - n(e') = -1.\) Therefore, by using this local property, we can calculate \(n(e) + C\) for some constant C in O(n) time by performing \(\texttt {BFS}\) starting from any leaf of the tree. The \(\texttt {BFS}\) then simply returns the edge with the highest score.
Finally, throughout the algorithm, we need to store the dissimilarity matrix, spanning tree, insertion ordering, and O(1) vertices at each node of the tree. Together, this requires \(O(n^2)\) space. \(\square\)
Theorem 4
(Azuma’s inequality [23]) Suppose \(X = (X_1, X_2, ..., X_k)\) are independent random variables taking values in any set S, and let \(L:S^k \rightarrow \mathbb {R}\) be any function that satisfies the condition: \(|L(u)-L(v)| \le t\) whenever u and v differ in just one coordinate. Then,
$$\begin{aligned} P(|L(X) - E[L(X)]| \ge \lambda ) \le 2 \exp \left( -\frac{\lambda ^2}{2t^2k} \right) \end{aligned}$$
Theorem 5
For \(k =\Omega \left( \frac{\ln (n/\epsilon )e^{4q}}{f^2} \right),\) with probability \(\ge 1-\epsilon,\) we have \(\epsilon (q) < f/2.\) Furthermore, if \(q_0\) is the minimum value of q such that TG(d, q) is connected, then \(q_0 = O(g\log n)\) and \(q_0 \ge f/2.\)
Proof
To show that \(\epsilon (q) < f/2,\) we must show that \(|D_{ij} - d_{ij}| < f/2\) when \(D_{ij} < q\) or \(d_{ij} < q.\) First, if \(D_{ij} < q\) we show that \(k =\Omega \left( \frac{\ln (n/\epsilon )e^{2q}}{f^2} \right)\) suffices to show \(|D_{ij} - d_{ij}| < f/2\) holds with probability \(\ge 1-\epsilon /n^2.\) We express our probability of failure as:
$$\begin{aligned} P(|D_{ij} - d_{ij}| \ge f/2)&= P\left( \left| \log \left( \frac{1 - 2H_{ij}}{1-2E_{ij}}\right) \right| \ge f \right) \\&\le P((1- 2 E_{ij})e^{-f} \ge 1 - 2 H_{ij}) + P(1 - 2 H_{ij} \ge (1-2E_{ij})e^f) \end{aligned}$$
The first expression can be written as:
$$\begin{aligned} P(H_{ij} - E_{ij} \ge \frac{1}{2}(1-e^{-f})(1-2E_{ij}))&\le P(|H_{ij} - E_{ij}| \ge \frac{1}{2}(1-e^{-f})e^{-2D_{ij}})\\&\le 2\exp (-\Omega (k(1-e^{-f})^2e^{-4q}))\\&\le 2 \exp (-\Omega ( \ln n/\epsilon )) \end{aligned}$$
The second line follows from Azuma’s (Theorem 4) with \(t = 1/k.\) Similarly, the second expression is equivalent to:
$$\begin{aligned} P(E_{ij} - H_{ij} \ge \frac{1}{2}(e^{f}-1)(1-2E_{ij}))&\le P(|E_{ij} - H_{ij}| \ge \frac{1}{2}(e^{f}-1)e^{-2D_{ij}})\\&\le 2\exp (-\Omega (k(e^{f}-1)^2e^{-4q}))\\&\le 2 \exp (-\Omega ( \ln n/\epsilon )) \end{aligned}$$
Next, if \(d_{ij} < q,\) then we show that \(D_{ij} < 2q+1\) with high probability and then apply the previous result with \(q' = 2q+1.\) First, if we let \(r_{ij} = D_{ij} - q,\) then by simple algebra our probability that \(d_{ij} < q\) when \(r_{ij} > q+1\) is bounded by
$$\begin{aligned} P(d_{ij} < q )&= P(D_{ij} - d_{ij}> D_{ij}-q)\\&= P\left( -\frac{1}{2}\log \left( \frac{1-2E_{ij}}{1-2H_{ij}} \right)> D_{ij}-q \right) \\&= P(1-2H_{ij}> e^{2r_{ij}}(1-2E_{ij}))\\&= P\left( E_{ij} - H_{ij}> \frac{1}{2}(e^{2r_{ij}}-1)e^{-2D_{ij}}\right) \\&= P\left( |E_{ij} - H_{ij}| > \frac{1}{4}e^{2r_{ij}-2D_{ij}}\right) \\&\le 2\exp (-\Omega (ke^{-4q}))\\&\le 2 \exp (-\Omega ( \ln n/\epsilon )) \end{aligned}$$
The fourth to fifth line follows since \(e^{2r_{ij}} - 1 > \frac{1}{2}e^{2r_{ij}}\) whenever \(r_{ij} > 1.\) Therefore, by a union bound, we conclude our claim that \(\epsilon (q) < f/2\) with probability \(\ge 1-\epsilon.\)
We now show that \(q_0 = O(g\log n).\) Note that in our model tree, since g is the maximum weight of an edge in a binary tree with n leaves, it follows that \(TG(D,O(g\log n))\) is connected. By our previous part, we know that \(|d_{ij} - D_{ij}| < f/2\) for all edges in \(TG(D,O(g\log n)).\) Therefore, we conclude that \(TG(d, O(g\log n))\) is also connected, and so \(q_0 = O(g\log n).\)
Lastly, we show \(q_0 \ge f/2.\) Consider all edges in the minimum spanning tree of \(TG(d,\infty ).\) These edges have true weight (in D) at least f and by our previous part the weight of the edges in the minimum spanning tree deviate from the true weight (as defined by the model tree) by at most f / 2. Thus, we conclude that \(q_0\ge f/2.\) \(\square\)
Theorem 6
\(\texttt {INC}\)
is absolute fast converging for the CFN model and takes
\(O(n^2)\)
time and space (assuming distances are precomputed).
Proof
All we need to establish is that for every triplet \(\epsilon ,f,g\) with \(0<f<g<\infty\) and \(\epsilon >0,\) there is a polynomial p(n) such that for all model trees \((T,\Theta )\) in \(CFN_{f,g},\) given sequences of length at least p(n) generated by \((T,\Theta )\) and empirical CFN dissimilarity matrix d computed on these sequences, the tree returned by \(\texttt {INC}(d)\) is the model tree T with probability at least \(1-\epsilon.\)
Let q be the maximum edge weight on the minimum spanning tree in the underlying model tree. We know \(q = O(g \log n).\) By Theorem 5, with probability \(\ge 1-\epsilon,\) when \(k = \Omega (\ln (n/\epsilon )e^{4q}/f^2) = poly(n),\) we have \(\epsilon (q) < f/2\) Furthermore, \(q_0 \ge f/2.\) Therefore, by Theorem 2, the insertion algorithm will return the model tree T. Hence, \(\texttt {INC}\) is absolute fast converging for the CFN model.
Finally, given the dissimilarity matrix, the running time and space complexity to compute and store the minimum spanning tree and to run INC are all \(O(n^2)\) by Theorem 3.\(\square\)
The method we described runs in \(O(n^2)\) time and is AFC. As shown by [17], any algorithm that reconstructs the true tree with high probability and uses distance calculations as its only source of information about the phylogeny on sequences of length \(O(poly\log n)\) will have \(\Omega (n^2)\) runtime, up to logarithmic factors. Hence, our runtime is optimal.