A multi-labeled tree dissimilarity measure for comparing “clonal trees” of tumor progression

We introduce a new dissimilarity measure between a pair of “clonal trees”, each representing the progression and mutational heterogeneity of a tumor sample, constructed by the use of single cell or bulk high throughput sequencing data. In a clonal tree, each vertex represents a specific tumor clone, and is labeled with one or more mutations in a way that each mutation is assigned to the oldest clone that harbors it. Given two clonal trees, our multi-labeled tree dissimilarity (MLTD) measure is defined as the minimum number of mutation/label deletions, (empty) leaf deletions, and vertex (clonal) expansions, applied in any order, to convert each of the two trees to the maximum common tree. We show that the MLTD measure can be computed efficiently in polynomial time and it captures the similarity between trees of different clonal granularity well.


Introduction
According to the clonal theory of cancer evolution [1], cancer originates from a single cell which had acquired a set of mutations that provide it proliferative advantage compared to the neighboring healthy cells. As tumor grows, cancer cells acquire new mutations and some of them might accumulate a set of mutations conferring further selective advantage or disadvantage compared to the other cells. This continues over a period of time and at the time of the clinical diagnosis, tumors are usually heterogeneous consisting of multiple cellular populations, harboring distinct sets of mutations, leading to different phenotypes. Each such cellular population is considered to be a clone.
The whole process of tumor initiation and growth is illustrated in Fig. 1 (left panel).
One of the most widely used ways of depicting mutational heterogeneity and tumor progression over time is by the use of a clonal tree of tumor evolution. Here, each individual vertex represents a distinct clone and each mutation (i.e. its label) is placed as part of the label of clone where it occurs for the first time in evolutionary history. In this work we focus on trees built by the use of single nucleotide variants (SNVs), which represent the most widely used type of mutations in reconstructing trees of tumor evolution [2]. We also assume that each SNV occurs exactly once during the course of tumor evolution and never gets lost (infinite sites assumption, usually abbreviated as ISA). Some recently introduced methods (e.g. SiFit [3]) allow for the violations of ISA and, in such cases, we expect that labels corresponding to mutations violating ISA are removed from the trees prior to dissimilarity calculation. In order to simplify our figures, in each figure in this work we omit the vertex representing population of healthy cells. Namely, such vertex would be non-informative as it would always be label-free (since healthy cells are assumed to contain none of the mutations relevant to cancer progression) and attached as the parent of root vertex in each of the figures presented in this work. See Fig. 1 for an illustration of tumor growth (left panel) and the corresponding clonal tree of tumor evolution (right panel). Note that the children of a vertex in a clonal tree are unordered.
A popular alternative to the clonal tree is the mutation tree, a special case of the clonal tree, where the label of each vertex consists of exactly one mutation [4,5]-thus a mutation tree is a clonal tree with the highest possible granularity. As can be expected, any clonal tree can be easily converted to the mutation tree as follows. Consider an arbitrary edge (u, v) and assume without loss of generality (WLOG) that a set of all mutations assigned to it is {M 1 , M 2 , . . . , M k } . Now replace edge (u, v) by a path with vertices {w 0 = u, w 1 , w 2 , . . . , w k−1 , w k = v} and edges {(w 0 , w 1 ), (w 1 , w 2 ), . . . , (w k−1 , w k )} , such that exactly one mutation, WLOG M i , is assigned to the edge (w i−1 , w i ) for each i ∈ {1, 2, . . . , k} . Note that from a given clonal tree which is not mutation tree (i.e. contains at least one vertex with two or more labels), multiple different mutation trees can be obtained. More precisely, from the above, it is obvious that any vertex with k mutations assigned to it can be expanded to a chain of k vertices, each having exactly one mutation as its label, in k! different ways. Consequently, considering the numbers of mutations assigned to vertices of the clonal tree T and taking the product of factorials of these numbers gives a formula for computing the exact number of different mutation trees that can be obtained from T.
There are additional tree representations [5] for tumor evolution but in this work we focus on clonal trees only.
In order to assess the accuracy of the proposed method, many of these studies use simulated HTS data extracted from synthetic tumor compositions. The inferred tree is then compared against the (synthetic) ground truth. We will call the ground truth tree the true tree. Other studies, such as the Pan Cancer Analysis of Whole Genomes Project (PCAWG) compare trees inferred by participating methods on real tumor samples to reach a consensus tree. In order to compare clonal trees with varying granularity (granularity can be measured in terms of the average number of mutations assigned to a clone) the measure(s) used should be versatile enough to discriminate real topological differences between trees from those differences due to the type and coverage of the HTS data used by a method; e.g. such a "dissimilarity" measure should be equal to 0 between any clonal tree and its corresponding mutation tree (obtained using the procedure described above).
Unfortunately, comparing trees of tumor evolution is a challenging problem and available measures fail to fully capture (dis)similarities between inferred and true trees. Many of existing measures only aim to compare the relative placement of pairs of mutations across two trees, e.g. whether the two mutations maintain an ancestordescendant relationship in both trees (we discuss several of the existing measures in more detail in "The existing measures and their limitations" section). Such measures can not capture topological differences between distinct trees, e.g. a simple topology with two vertices, where all but one of the mutations is assigned to the non-root vertex, v.s. a star topology where each vertex is assigned a single mutation. Thus measures of tree similarity that not only consider the relative placement of mutations but also the topological structure of the trees are of high demand.
The standard measure to compare combinatorial objects-such as strings, especially in bioinformatics, is the edit distance. This measure has numerous applications and a large number of variants, not only for strings but also for labeled trees, have been considered in the past. The classical Levenstein edit distance between two strings is defined as the minimum number of single symbol deletions on the two strings so that what remain from the strings are identical (in fact the longest common subsequence of the two strings). As such, it has a well established dynamic programming algorithm (e.g. [16]). The running time of this algorithm is proportional to the product of the lengths of the two input strings and the existence of a sub-quadratic algorithm is unlikely [17]. In general, the complexity of computing an edit distance strictly depends on the set of allowed edit operations. E.g. if we consider a variant of the problem where only single character mismatches and block reversals are allowed, then the running time reduces to O(n log 2 n) [18]-here n is the total length of the strings; on the other hand, the variant where only mismatches, block deletion and move operations are allowed is NP-hard [19].
Extensions of edit distance measures for rooted trees have typically been defined for trees with ordered vertices, each with a single label, where the goal is to transform one tree to the other by the use of vertex deletions (or, equivalently, vertex insertions) and vertex label replacements [20]. Based on such tree edit distance measures, a notion of tree alignment has also been introduced, both for vertex ordered as well as unordered trees [21]. For many of the vertex ordered cases, there are polynomial time algorithms that can solve the distance/ alignment problem [20][21][22][23][24][25][26][27][28][29], whereas for several unordered cases, the both the alignment and the corresponding tree edit distance problems are NP-hard [30,31] or MAX SNP-hard [21,32].
Motivated by the Levenshtein edit distance between strings, edit distances for trees with unordered vertices are defined in relation to the largest common subtree [32] between the input trees: here the goal is to perform the minimum number of label deletions (and eliminate the resulting empty nodes) from the two input trees so that the remaining subtrees are identical. The notion of the largest common subtree of two trees and the implied edit distance can be generalized to clonal (multi-label) trees. Unfortunately, just like other edit distances for unordered trees [33], this distance would be NP-hard (in fact MAX SNP-hard) to compute. Moreover, none of the results in the literature deal with trees where vertices may have more than a single (mutational) label-as is the case for the clonal tree comparison problem.
In this paper we consider a restricted version of the above notion of tree edit distance by allowing label (and implied node) deletions for leaves. This notion of distance can naturally be generalized to multi-labeled trees and the resulting "dissimilarity" measure (multi-labeled tree dissimilarity, MLTD) can be computed in polynomial time. More importantly, it successfully captures the differences between clonal trees: for example it satisfies a key condition that two clonal trees from which it is possible to produce two identical mutation trees have a dissimilarity of 0.
Multi-labeled tree dissimilarity is the first polynomial time computable dissimilarity measure for vertex unordered trees. 1 We have devised and implemented an algorithm to compute MLTD and applied it to a number of synthetic and real data sets to compare trees inferred by some of the available tumor history reconstruction methods with success.

Definitions
While this work is motivated and currently has the main application in the comparison of clonal trees of tumor evolution, possible novel applications may arise in the future. In order to minimize the background knowledge of cancer evolution and related terminology required to follow description of the presented algorithms, in this section we first provide formal definition of multi-labeled tree and use this term throughout the sections containing algorithms description ("Definitions", "Set alignment problem" and "Computing a maximum common tree in2 the general case" sections). Second, we describe how the dissimilarity measure between two arbitrary multi-labeled trees is computed. Finally, for the readers interested in the presented practical application, we also provide motivation for the introduced multi-labeled tree and edit operations.

Multi-labeled tree
A rooted tree T = (V , E) is a connected, acyclic, undirected graph with set of vertices V (also denoted as V(T)) and edges E (also denoted as E(T)), with a particular vertex r identified as the root. For each non-root vertex v, any vertex u that lies on the simple path between v and the root is considered to be its ancestor; in particular, the vertex u = p(v) on this path which has an edge to v is considered to be its parent. The depth of vertex v denoted d(v), is thus defined as the number of its ancestors. The lowest common ancestor of any pair of vertices u and v, denoted lca(u, v) , is defined as a common ancestor of both u and v whose depth is maximum possible. The structure of a tree induces partial order on its vertices: u v denotes that u is an ancestor of v.
Multi-labeled tree T is a rooted tree in which each vertex v other than root has a subset L v of labels from a universe L and each label is unique to a vertex, i.e. L u ∩ L v = ∅ for each pair of distinct vertices u and v. We denote the set of all labels assigned to the vertices of T as L(T). In other words, L(T ) = v∈V (T ) L v .

MLTD measure between two multi-labeled trees
Consider the following types of edit operations on multilabeled tree: • deleting a label where one of the labels is removed from some set L v , • deleting an unlabeled leaf where a vertex is removed from the tree. This operation is allowed to be performed only for unlabeled leaves, i.e. vertices with no labels and no children, • expanding a vertex where vertex v is replaced by two vertices v 1 and v 2 such that all children of v after this operation are children of v 2 , and the parent of v is the parent of v 1 , and v 1 is the parent of v 2 . Each of the labels from L v is assigned to exactly one of the L v 1 and L v 2 .
A Common tree of arbitrary multi-labeled trees T 1 and T 2 is any multi-labeled tree which can be obtained from each of T 1 and T 2 by the use of edit operations defined above. A maximum common tree of T 1 and T 2 is a common tree of T 1 and T 2 having the largest number of labels among all common trees of T 1 and T 2 . We define MLTD measure between T 1 and T 2 as the difference between the total number of labels in T 1 and T 2 and twice the number of labels in their maximum common tree. In other words, MLTD is defined as the total number of labels required to be removed from the two trees in the process of obtaining their maximum common tree. 2 For two trees given as an input, finding their maximum common tree obviously suffices to compute MLTD and will therefore be the main focus of our algorithms described below. As mentioned earlier MLTD defined above is not a metric since it is akin to the "inverse set intersection" and thus does not satisfy the triangle inequality. For example, given L = {A, B} and the following trees: (i) tree T 1 consisting of two vertices, labeled by A (root vertex) and B (non-root vertex) (ii) tree T 2 consisting of two vertices, labeled by B (root vertex) and A (non-root vertex) and (iii) a single vertex tree T 3 where vertex label consists of both, A and B, MLTD between T 1 and T 3 , as well as T 2 and T 3 , equals 0, whereas MLTD between T 1 and T 2 equals 2.

(Dis)similarity between multi-labeled trees in the context of tumor evolution
Formal definition of multi-labeled tree presented above is motivated by the clonal tree of tumor evolution discussed in "Introduction" section. In a clonal tree, root vertex r represents population of healthy cells and each non-root vertex represents tumor clone. Universe L represents set of mutations detected in a given tumor and L v denotes the set of mutations appearing for the first time at vertex (clone) v. The constraint L u ∩ L v = ∅ for each pair of distinct vertices u and v, ensures that each mutation appears at most once during the course of tumor evolution (this follows directly from the ISA).
The main difference between multi-labeled and clonal tree is that in the latter we have constraint that the set of labels assigned to the root vertex r is empty (since this vertex represents population of healthy cells which is assumed to be mutation-free) and L v = ∅ for each v ∈ V (T )\{r} . Namely, if v is non-root vertex such that L v is empty then clone v would be, with respect to the set of mutations it harbors, identical to its parent which is atypical for clonal trees as it introduces unnecessary redundancy in representation of the process of tumor evolution. For the simplicity, in the figures of clonal trees presented in this work we do not show a root since its set of labels is empty hence such vertex would be non-informative.
Note that any multi-labeled tree can be converted to a unique clonal tree using the following steps: (i) merging Karpov et al. Algorithms Mol Biol (2019) 14:17 each of non-root vertices having empty set of labels with its parent and repeating this until each non-root vertex has non-empty set of labels and (ii) in the case that root of the tree obtained after the first step has non-empty set of labels, add a new vertex without any mutational labels and connect it to the root of the modified tree (so that it becomes new root). In the applications, we first consider each clonal tree as a mutation tree and, once a common tree is obtained, it is converted to a clonal tree using the two of these steps.
While the notion for the edit operation of label deletion is intuitively clear as in general case one would be unable to obtain a common tree without allowing this or any similar operation which removes some labels, the edit operation of expanding a vertex at no cost is directly motivated by the existing different ways of representing clonal trees of tumor evolution. More precisely, we introduce it in order to be able to capture differences between two clonal trees which are due to different levels of granularity in tree representation. An example of such trees is shown in Fig. 6 where tree of tumor evolution is shown in (a) and its more refined versions are shown in (c) and (d) (more detailed discussion of Fig. 6 is provided in "The existing measures and their limitations" section). Finally, the operation of deleting an unlabelled leaf is introduced in order to allow obtaining common tree of trees having certain topological differences, mostly in terms of branching. For example, if we are given a linear and non-linear clonal tree as two input trees, they can not be reduced to a common tree using solely the label deletion and vertex expansion operations. Also, note that deletion of unlabeled leaf requires deletion of all of its labels prior to applying this edit operation which is usually costly. However, this is desired when computing a "dissimilarity" between clonal trees of tumor evolution since the placement of mutations on vertices from different branches (i.e. to the clones from different lineages) in one clonal tree and to the vertices that in the ancestor-descendant relation (i.e. to the clones from the same lineage) in the second clonal tree represents fundamental dissimilarity between the two trees and needs to have an appropriate contribution to their "dissimilarity".

Set alignment problem
We first demonstrate how maximum common tree is computed for a pair of trees where each tree is a path. Obviously in this case any common tree between the input trees is also a path. Let the ordered sequence of vertices of the first tree/path be v 1 , v 2 , . . . , v n with respective label sets S 1 , S 2 , . . . , S n , and the ordered sequence of vertices of the second tree/path be w 1 , w 2 , . . . , w m with respective label sets P 1 , P 2 , . . . , P m . (Assume that S i , P j are subsets of L and that any label u ∈ L occurs exactly in one of S 1 , S 2 , . . . , S n and exactly in one of P 1 , P 2 , . . . , P m .) Let f : L → {1, 2, . . . , n} and g : L → {1, 2, . . . , m} be the functions that map labels to vertex indices, respectively in the first and the second tree such that v f (a) denotes the vertex of label a in the first tree and w g(a) denotes the vertex of the label a in the second tree.
It is easy to see that computing a maximum common tree in this special case is equivalent to the following generalized version of the string edit distance problem for a pair of ordered sets.
The following lemma offers an efficient algorithm for solving the Set Alignment Problem. Our approach for computing dissimilarity between two arbitrary trees (presented in "Computing a maximum common tree in the general case" section) uses this algorithm as a subroutine.

Lemma 1
Let D(i, j) be the size of the set which is answer of the Set Alignment Problem for the instance where input sequences are (S 1 , . . . , S i ) and (P 1 , . . . , P j ) (i.e. according to the notation from the above D(i, j) = A(i, j) ). Then the following hold: all positive integers i and j.
Proof The first equation easily follows from the fact that For the second equation, we first prove that In order to prove this, observe that each of A(i, j − 1) ∪ (S i ∩ P j ) and A(i − 1, j) ∪ (S i ∩ P j ) represent a valid candidate solution for the instance of Set Alignment Problem with the input sequences (S 1 , . . . , S i ) and (P 1 , . . . , P j ) . Namely, in the case of set A(i, j − 1) ∪ (S i ∩ P j ) (analogous applies to the set A(i − 1, j) ∪ (S i ∩ P j ) ), if we consider two arbitrary labels a and b of this set, then: holds in this case as well.
In order to prove this, consider the partition of A(i, j) into A(i, j)\(S i ∩ P j ) and S i ∩ P j . We claim that at most one of the sets S i and P j has non-empty intersection with the set A(i, j)\(S i ∩ P j ) .
To prove this, assume on contrary that there exists which is, by definition of A(i, j) satisfied for all of its labels. This contradiction directly implies our latest claim. To finalize the proof of inequality Lemma 1 provides a dynamic programming formulation for calculating "dissimilarity" D(n, m) between trees T 1 and T 2 .

Observation 1 Total time and total space required for calculating number of labels in each of the sets
Proof For each label from u ∈ L we can store two indices f(u) and g(u). This can be implemented in the above time and space by using a hash table. If we know these indices, we can fill the table I ij , where I ij = |S i ∩ P j | , by iterating through elements of L and increasing the value of I f (x)g(x) by one for each x ∈ L .

Lemma 2 The Set Alignment Problem is solvable in
Proof Follows straightforwardly from Lemma 1 and Observation 1.

Computing a maximum common tree in the general case
We now describe an efficient algorithm for computing a maximum common tree. Note that in the remainder of the paper we call all vertices in a tree with exactly one child as non-crucial vertices and all other vertices, i.e. leaves, and vertices with two or more children, as crucial vertices. Now consider the sequence of edit operations applied to a tree T 1 in the process to reaching a common tree T with another tree T 2 .

Observation 2 Each edit operation applied to any vertex creates at most one (new) crucial vertex; no edit operation can increase the total number of crucial vertices.
Proof The proof is based on analyzing the effect that application of a given edit operation might have on the set of crucial vertices.
• The edit operation of deleting a label does not change the topology of the tree or the set of crucial vertices in the tree. • The edit operation of deleting a leaf u does change the topology of a tree, but with respect to the set of crucial vertices, the only update is that u is lost, and, (i) provided that u was the only child of p(u), p(u) becomes crucial, or (ii) provided that u was one of the two children of p(u), p(u) becomes non-crucial, or (iii) provided that u was one of more than two children of p(u), p(u) stays crucial. All other vertices remain unaltered. See Fig. 2a for detailed examples. • Finally, the edit operation of expanding, i.e., splitting a vertex v into v 1 and v 2 does change the topology of the tree (i) but it does not create a new crucial vertex if v is non-crucial; however, (ii) if a vertex v is crucial, then v 2 becomes crucial after the edit operation, but v 1 stays non-crucial. See Fig. 2b for examples.
In summary, after an arbitrary edit operation, at most one new vertex is added to the set of crucial vertices. However, in the case that new crucial vertex is added, at least one of such vertices is deleted implying that the total number of crucial vertices never increases.
The observation above indicates that an edit operation applied to a crucial vertex u may create a new crucial vertex v. In that case, we say that the crucial vertex u in T 1 corresponds to a crucial vertex v in T ′ 1 (if latter was created). In case of an expansion of vertex u in T 1 to two vertices u 1 and u 2 , we say that u corresponds to u 2 in T ′ 1 . In case of a deletion of a leaf u, if p(u) which was originally non-crucial, became crucial, then we say that u in T 1 corresponds to p(u) in T ′ 1 . For any vertex v which remains unedited and crucial in T ′ 1 , we say that v in tree T 1 corresponds to v in the tree T ′ 1 . Finally, we say that v in T 1 corresponds to v ′ in T if for the sequence of trees 1 ) ) such that v i corresponds to v i+1 for all i. We extend the notion of correspondence to T 2 in a similar manner.
Thus we notice the following fact.

Observation 3
We can construct the correspondence between a subset of crucial vertices in T 1 and T 2 and crucial vertices in the common tree. Such that each crucial vertex in the common tree corresponds to some vertex in T 1 and T 2 .
Given trees T 1 and T 2 , their common tree T and the vertices in T 1 and T 2 that correspond to every crucial vertex in T, it is straightforward to establish the edit operations to transform T 1 and T 2 to T. The algorithm to compute T makes use of this observation.

Observation 4
Given two sets of crucial vertices u 1 , . . . , u l and v 1 , . . . , v l in T 1 and T 2 respectively such that u i and v i correspond to same crucial vertex in the common tree T for each i, we can reconstruct a common tree T ′ such that the number of labels in T ′ is at least that in T.
Proof Here we describe the procedure of reconstructing the tree T ′ in two steps (see Figs. 3 and 4 as illustrations).
In the first step we delete each label which cannot belong to T in a trivial manner: let S 1 ( S 2 ) be the set of vertices which do not lie on a path from the root of T 1 ( T 2 ) to some u i ( v i ). Then we delete all vertices from (and S 2 ) together with their labels. Note that no label which is present in tree T will be deleted: if a vertex v does not belong to a path from the root to some crucial vertex in T, then any label from L v cannot be present in T. However, if any label in T that is in L v for some vertex v which lies on a path from the root to a leaf w (which is necessarily crucial) then there must exist a pair of vertices u i , v i which correspond to the leaf w. Thus, starting from the leaf level, we can delete all vertices which do not belong to a path from the root to any u i (and v i ). It is easy to see that this first step transforms T 1 and T 2 into isomorphic trees. Let r i denote the root of tree T i ; the isomorphism φ on r 1 , u 1 , . . . , u l which transforms T 1 into T 2 is φ(r 1 ) = r 2 , φ(u 1 ) = v 1 , . . . , φ(u l ) = v l .
Let T ′ 1 and T ′ 2 denote the trees respectively produced from T 1 and T 2 after applying the first step. Notice that, T ′ 1 and T ′ 2 are also topologically isomorphic to T and T ′ .
In the second step, for each pair of vertices v i and u i we consider the pair of "maximum" paths from v i and u i to the associated root, which do not contain other vertices from v 1 , . . . , v l and u 1 , . . . , u l . For this pair of paths we apply a sequence of edit operations that expand vertices and delete labels, such that the resulting paths will be identical with the maximum possible number of labels.
T ′ is the tree produced as a result of the second step. Note that on any pair of paths from the vertex pair u i and v i to the respective root, the set of labels observed will be identical. This implies that T ′ is a common tree with number of labels necessarily lower bounded by that of T.
The above observation implies that we can reduce the problem of computing a maximum common tree between two multi-labeled trees to the problem of finding an optimal pair of sequences of vertices u 1 , . . . , u l and v 1 , . . . , v l corresponding to the maximum common tree.
Our general algorithm for computing the "dissimilarity" between two multi-labeled trees requires constant time access to the solutions to many instances of the Set Alignment Problem, which we compute in a preprocessing step.
Solving Set Alignment Problem for all pairs of sequences u 1 , . . . , u l and v 1 , . . . , v l is impractical. Fortunately, special conditions with respect to the structure of these sequences help us develop an efficient algorithm for finding an optimal pair of sequences as explained below.
The algorithm for computing an optimal pair of sequences will need the solutions to Set Alignment Problem for all possible downward paths; we call this auxiliary problem Pairwise Alignments on a Tree.
Given a pair of vertices u, v such that u v , let the following sequence of sets of vertex labels be denoted as P(u, v) = (L w 1 , . . . , L w k ) where w 1 (= u), w 2 , . . . , w k (= v) is called the downward path between u and v. Then we can define Pairwise Alignments on a Tree problem formally as follows.

Pairwise Alignments on a Tree
Instance: Two rooted unordered multi-labeled trees T 1 = (V 1 , E 1 ) and T 2 = (V 2 , T 2 ) with associated sets of labels for each vertex. Task: For each 4- tuple (a, b, c, d) such that a, b ∈ V 1 , c, d ∈ V 2 , a b and c d, compute and store the answer for Set Alignment Problem on P(a, b), P(c, d).
In the next lemma, we introduce equations for computing Pairwise Alignments on a Tree which forms the basis of our dynamic programming algorithm.  Fig. 3 Illustrates how to obtain a maximum common tree of trees T 1 and T 2 . We used dashed lines to denote pairs of vertices u i , v i from the proof of Observation 4. After the first step of proof we delete all vertices which do not belong to paths from roots to some crucial vertex and obtain from trees T 1 and T 2 trees T ′ 1 and T ′ 2 which are topologically isomorphic to each other. After applying the step two from proof we obtain by applying sequence of optimal operations to pairs of paths ((a), (a ′ )), ((c, g), (c ′ , g ′ )), ((b), (b ′ )), ((d), (d ′ )), ((e), (e ′ )) from T ′ 1 and T ′ 2 trees T 1 and T 2 which are equal to each other and contain a maximum number of labels  If a = b and c = d then D(a, c, b, d) = D(a, c, p( Proof Each of the cases above holds true as a direct consequence of Lemma 1.
Through a straightforward application of the above lemma, we obtain the following.

Lemma 4
If I 1 and I 2 denote the heights of T 1 and T 2 , respectively, Pairwise Alignments on a Tree is solvable in O(|V 1 ||V 2 |I 1 I 2 + |L(T 1 )| + |L(T 2 )|) time and space.
Proof The algorithm is a straightforward implementation of Observation 1 and Lemma 3. Namely, from Observation 1 it follows that the values of |L a ∩ L b | , for all a ∈ V 1 and b ∈ V 2 , can be computed by the use of algorithm having time and space complexity O(|V 1 ||V 2 | + |L(T 1 )| + |L(T 2 )|) . After computing these values, all entries in D can be computed in the time and space that are proportional to the number of all possible combinations of a, b, c, d, which is bounded by |V 1 ||V 2 |I 1 I 2 . Now, combining the above with the obvious inequality |V 1 ||V 2 |I 1 I 2 ≥ |V 1 ||V 2 | , we have that the overall time and space complexity of the proposed algorithm is O(|V 1 ||V 2 |I 1 I 2 + |L(T 1 )| + |L(T 2 )|) . Fig. 4 a T 1 and T 2 before applying the first step from Observation 4, b T ′ 1 and T ′ 2 obtained from T 1 and T 2 from a after first step of deleting vertices which do not belong to paths between root and crucial vertices, c the resulting tree T 1 =T 2 after applying second step Karpov et al. Algorithms Mol Biol (2019) 14:17 Given a common tree T for T 1 and T 2 , let M : Proof The observation follows straightforwardly from the construction of correspondence. For that notice that the least common ancestor of vertices can correspond only the least common ancestor in the common tree because we may apply only operations of expanding for internal vertices.
We now present our algorithm for computing the size of a maximum common tree, which is a combination of dynamic programming and an algorithm for finding a maximum cost matching.

Theorem 1 The mapping which corresponds to a maximum common tree can be computed in time
Proof For i ∈ {1, 2} and x ∈ V i , let T i (x) be the subtree of T i rooted at vertex x and let T ′ i (x) be the multi-labeled tree that is identical to T i (x) except that no labels are assigned to its root x. Let G(a, b) be the size of the maximum common tree of T 1 (a) and T 2 (b) . We now define for those vertices a ∈ V 1 , b ∈ V 2 , such that M(a) = b , the function G ′ : V 1 × V 2 → N as the size of the maximum common tree between subtrees T ′ 1 (a) and T ′ 2 (b) (more specifically the number of common labels between T ′ 1 (a) and T ′ 2 (b)-by definition excluding the labels of a and b themselves). Notice that G(a, b) is not necessarily equal to G ′ (a, b) , since (i) if a and b do not correspond to each other G ′ (a, b) is undefined, and (ii) L a or L b are not necessarily empty. Rather, as will be shown below, D(a, b, x, y)] . The choice of vertices x and y corresponds to the choice of vertices which are mapped to each other and has the minimal depth among all such vertices in T 1 and T 2 .
The key observation of our algorithm is that the computation of G ′ (a, b) can be reduced to finding a maximum "cost" matching for an auxiliary graph. Let a 1 , . . . , a n be the children of a, and b 1 , . . . , b m be the children of b. The structure conditions on mapping provide the guarantee that all vertices which are leaves of downward paths from a without internal crucial vertices, lie in distinct subtrees. Using the Observation 5 this implies that each such vertex lies in distinct subtrees with roots a 1 , . . . , a n and b 1 , . . . , b m . We know inductively that G(a i , b j ) = max c∈V (T 1 (a i )),d∈V (T 2 (b j )) (G ′ (c, d) + D(a i , b j , c, d)).
Consider now all possible bijections N between equal sized subsets of {a 1 , . . . , a n } and {b 1 , . . . , b m } . Then G ′ (a, b) = max N (x,y)∈N G(x, y) . The problem of choosing an optimal N thus trivially reduces to the well known maximum weighted bipartite matching problem, which can be solved in a polynomial time [34]. For that we can construct a bipartite graph on the set of vertices a 1 , . . . , a n and b 1 , . . . , b m with the cost of an edge (a i , b j ) equal to G(a i , b j ) and return the score of an optimal assignment in this graph (with n + m vertices and nm edges) as G ′ (a, b) . Note that if one or both of a or b are leaves then G ′ (a, b) = 0 . See Fig. 5 as an illustration of constructing graph Q. We provide an example of how our algorithm works in Appendix 2.
The time to construct auxiliary graphs is bounded by O(|V 1 ||V 2 |I 1 I 2 ) . The computational bottleneck of this algorithm is however the bipartite matching procedure: for a graph with n vertices and m edges it takes O(nm log n) time. Let n a be the number of children of any vertex a in T 1 and n b the number of children of any vertex b in T 2 ; then the total time of our algorithm is O( a,b (n a + n b )n a n b log(n a + n b )) which is O(|V 1 ||V 2 |(|V 1 | + |V 2 |) log(|V 1 | + |V 2 |)) or O((|V 1 | b n 2 b + |V 2 | a n 2 a ) log(|V 1 | + |V 2 |)) . The second bound is significantly better if the maximum degree of a vertex is bounded by a small value.

The existing measures and their limitations
There are number of measures in the literature that are being used to compare clonal trees. Two of the most widely used measures include: (1) Ancestor-Descendant Accuracy (ADA), measure which considers only mutations originating at vertices (clones) which are in ancestor-descendant relationship in the true tree and returns the fraction of pairs of such mutations for which the In addition to these two measures, used in [10][11][12]35] and elsewhere, (3) Clustering Accuracy (CA) [10] and (4) Co-Clustering Accuracy (CCA) [35] were also introduced in order to measure the accuracy in the placement of mutations originating from the same clone in true tree. CA measures the fraction of label pairs that are both co-located in the same vertex in both trees, whereas CCA measures the proximity in the inferred tree of pairs of mutations originating from the same clone in true tree (see [10] and [35] for definitions of CA and CCA). Finally, (5) Pair-wise Marker Shortest Path "dissimilarity" (PMSPD) [13] is (symmetric) "dissimilarity" measure calculated as the sum, over all label pairs, of the absolute difference of path length between the two labels in true tree with the equivalent length calculated in the inferred tree. All of the above mentioned are designed to compare inferred tree against the given true tree and no single measure can capture the overall similarity/difference between two arbitrary trees. Furthermore, for each of the measures there exist cases where it returns high similarity for topologically very different true and inferred trees. We will illustrate this below by presenting several examples using trees from Fig. 6 where true tree and four trees inferred by (hypothetical) methods are shown. Each vertex in any one of these trees have one or more labels (corresponding to mutations in clonal trees) represented by A, B, C, . . . , J.
For ADA measure, one needs to consider all pairs of labels in the true tree: Clustering of mutations in 'Inferred tree 4' is in the perfect agreement with the clustering in the 'True tree' hence both CA and CCA measures will return maximum score for this tree, even though it is also topologically very different from 'True tree' . Finally, the calculation of the PMSPD measure between the 'True tree' and 'Inferred tree 1' , as well as 'Inferred tree 2' , is shown in Fig. 7. This measure assigns the same score to these two inferred trees, despite the fact that 'Inferred tree 2' is, from the perspective of interpreting tumor evolution, much closer to 'True tree' .

Applications of MLTD
In order to facilitate the interpretation of results, for two arbitrary trees T 1 and T 2 , in addition to the MLTD similarity measure which returns the number of mutations in common tree of T 1 and T 2 and is denoted here as MLTD(T 1 , T 2 ) , we also introduce MLTD-normalized(T 1 , T 2 ) defined as MLTD(T 1 ,T 2 ) max(a,b) , where a and b denote number of mutations in T 1 and T 2 . MLTDnormalized can be interpreted as similarity measure which takes values from [0, 1], with higher values denoting higher similarity among trees. In the discussion of results below, all presented scores represent MLTDnormalized similarity measure, although it is obviously equivalent to MLTD (assuming that the sets of vertex labels are known for both trees, which is true in all of our comparisons). a a 1 a 2

Application to the synthetic examples with the available ground truth
In this section we discuss similarity between true and inferred trees shown in Fig. 6. 'Inferred tree 1' has relatively low score equal to 0.3 which rewards the proper placement of mutation A and correctly inferred phylogenetic relations for pairs of mutations originating from different clones, but penalizes for extensive branching which leads to the inaccurate placement to different branches of mutations originating from the same clone, as well as to significant topological differences between this and true tree. In contrast, and a True tree b Inferred tree 1 c Inferred tree 2 d Inferred tree 3 e Inferred tree 4 Fig. 6 a True clonal tree depicting the evolution of hypothetical tumor. b-e Hypothetical trees inferred by methods for reconstructing history of tumor evolution (input data to these methods is assumed to be obtained from the hypothetical tumor mentioned in the description of 'True tree'). These trees are used as examples which demonstrate limitations of the existing measures for calculating similarity/"dissimilarity" between true and each of the four inferred trees (details provided in "The existing measures and their limitations" section). In "Application to the synthetic examples with the 56available ground truth" section we discuss the application of MLTD in calculating similarities between these pairs of trees  Fig. 7 "Dissimilarities" between pairs of labels required for calculating Pair-wise Marker Shortest Path "dissimilarity" (PMSPD) for trees from Fig. 6. Entries in each matrix represent length of path between labels (note that labels are shown in the first row and the first column of each matrix). "Dissimilarity" is calculated as the sum of absolute values of differences between pairs of entries which are at the same position in both matrices. Red colored entries in labels pairwise "dissimilarity" matrix shown in b, c differ from the corresponding entries in matrix for true tree shown in a and therefore contribute to the overall "dissimilarity". PMSPD assigns the same score to 'Inferred tree 1' and 'Inferred tree 2' , despite the fact that 'Inferred tree 2' is, from the perspective of interpreting tumor evolution, much closer to 'True tree' a b Fig. 8 Clonal trees of tumor evolution, inferred by SiFit and PhISCS, for triple-negative breast cancer (TNBC) dataset originally published in [37] and consisting of the binary presence/absence profile of 22 mutations across 16 single cells. Names of the clones are assumed not to be included as part of the vertex label. Trees are very similar to each other in placement of the vast majority of mutations: (i) Clone 1 in the SiFit tree is almost identical (with respect to the set of mutations assigned to its label) to Clone 1 in PhISCS tree (ii) Clone 2 in SiFit tree is split into two adjacent clones, namely Clone 2 and Clone 3, in PhISCS tree. Analogous applies to Clone 7. (iii) The order of mutations in genes CHRM5 and TGFB2, as well as in most other pairs of mutations (including the pairs where both mutations are at the same vertex), is same among the trees. Notable exceptions leading to some dissimilarities between the trees include mutations in genes MAP3K4 and ECM1. In addition, mutations in genes CBX4 and TNC are absent in tree reported by SiFit. Removing these four mutations and their corresponding vertices from each tree (if present) and assigning each of the Clone 4 and Clone 7 in SiFit tree as child of Clone 2, and Clone 7 as child of Clone 3 in PhISCS tree, we obtain trees which are same up to the existence of splits of single into two adjacent clones belonging to the same lineage (see (ii) from above). MLTD-normalized score for the two trees equals 0.82, which well reflects the overall high topological similarity and concordance in ordering pairs of mutations as expected based on our discussion from the introduction, 'Inferred tree 2' (which represents slightly refined version of 'True tree' where green and yellow clones are each split into two adjacent clones belonging to the same branch) and 'Inferred tree 3' (which represents fully resolved mutation tree that can be obtained from 'True tree') both have score 1. 'Inferred tree 4' , having score 0.6, is rewarded for the proper placement of mutation A and large cluster of mutations appearing for the first time at green clone, but is penalized for inaccurate placement of yellow clone from where 4 out of 10 mutations originate.

Application to real data
In order to demonstrate the application of measure developed in this work in real settings where true tree is usually not available, we analyzed two datasets obtained by sequencing real samples of triple-negative breast cancer (TNBC) and acute lymphoblastic leukemia (ALL). For each sample, we inferred trees of tumor evolution by the use of SCITE [5], SiFit [3] and PhISCS [36]. We provide more details about these methods and parameters used in running them, as well as details of obtaining real data, in Appendix 1. Inferred trees and very detailed discussion of the calculated MLTD-normalized scores for pairs of inferred trees are shown in Figs. 8, 9 (for the TNBC sample) and Fig. 10 (for the ALL sample). We show that MLTD-normalized score recognizes high similarity in the placement of vast majority of mutations between two trees (as demonstrated for trees inferred by PhISCS and SiFit for TNBC sample where score equals 0.82), but also penalizes for topological differences and different sorting of mutations along linear chains (as demonstrated for trees inferred by SCITE and SiFit for ALL sample where the score equals 0.69).

Fig. 9
Mutation tree for TNBC dataset (see Fig. 8 for details) inferred by SCITE. This tree can be obtained from PhISCS tree by expanding vertices having more than one label, hence MLTD-normalized score between the two trees is maximum possible (i.e. equals 1). Compared with tree inferred by SiFit, SCITE tree has analogous topological similarities and differences as tree inferred by PhISCS, and MLTD-normalized score for these two trees is also equal to 0.82 a b Fig. 10 Trees inferred by SCITE and SiFit for acute lymphoblastic leukemia (ALL) patient dataset from [38] consisting of 115 single cells and 16 mutations. Unsurprisingly, due to large number of single-cells in this dataset, sequencing noise and similarities in the scoring schemes used in PhISCS and SCITE (see Appendix 1: Details of obtaining trees of tumor evolution for the real data sets) both methods report the same mutation tree so we only focus on SCITE in this discussion. The most notable difference among the two trees is in the placement and ordering of mutations in genes ZC3H3, XPO7 and BRD7P3 as well as in the ordering of mutations in genes FGD, RRP8, FAM105A, BDNF-AS and PCDH7. Furthermore, the relative order also differs for mutations in genes TRRAP and ATRNL1. However, in contrast to these important differences, the trees still share most of the major branching events in tumor evolution and have consistent ancestor-descendant order for most of the pairs of mutations. All these are reflected in MLTD-normalized score of 0.69 assigned to this pair of trees Karpov et al. Algorithms Mol Biol (2019) 14:17 underlying scoring used in PhISCS is analogous to that in SCITE and the major difference among the two methods is in the type of tree returned in the output. While SCITE searches for the maximum likelihood mutation tree, PhISCS reports the maximum likelihood clonal tree. Due to the equivalence in tree scoring, assuming that both methods find the optimal solution, clonal tree reported by PhISCS is expected to represent compressed version of mutation tree reported by SCITE (i.e. we expect that tree reported by SCITE belongs to the set of mutation trees which can be obtained from clonal tree reported by PhISCS). Similarly as PhISCS, SiFit also returns clonal tree of tumor evolution but uses different tree search methodology and does not necessarily yield the same output as PhISCS nor SCITE (as demonstrated in [3]).

Details of obtaining input data and running SCITE, SiFit and PhISCS
We obtained binary SCS data mutation matrix for TNBC patient sample from [39] and for ALL patient sample from [38]. For each sample, false positive and false negative rates of sequencing experiment were estimated in the original studies [37,38] and provided as the input to the methods used in the analysis. In order to obtain better convergence, we run MCMC based methods SiFit and SCITE for very large number of iterations. For SiFit, we set number of iterations to 5,000,000. For SCITE we set number of repetitions of the MCMC to 3 and chain length of each MCMC repetition to 1,000,000. PhISCS is combinatorial optimization based method which provided guarantee of the optimality for each solution.

Appendix 2: Demonstration of algorithm with an example
In this section, we will illustrate how the maximum common tree of trees from Fig. 11 is found using our algorithm. For the convenience of notation, since each label appears at exactly one vertex, we will use a string of concatenated labels as a unique identifier of a vertex. For example, vertex having labels c, d, e (blue vertex in the second tree) will be denoted as cde.
The first part of our algorithm consists of computing 4-dimensional table D (see Lemma 3 for a definition). As computation of D is a straightforward application of the dynamic programming, here we will skip details of this step and proceed directly to demonstrating how table G ′ is computed. As defined earlier, for any pair of vertices x ∈ V 1 , y ∈ V 2 , such that M(x) = y , G ′ (x, y) represents the size of the maximum common tree between subtrees rooted at x and y, namely T ′ 1 (x) and T ′ 2 (y)-excluding the root labels. Since our algorithm for computing G ′ is recursive, we present how G ′ is computed in several steps, starting from the leaves of the two trees and then propagating towards the roots.
The first step for computing G ′ is shown in Table 1. In this step, all entries G ′ (u, v) of G ′ , where either u is a leaf in T 1 or v is a leaf in T 2 , are computed (as 0 by definition).
Next, we compute G ′ for a pair of vertices each with a single child (specifically, vertices {h} , {g} in tree T 1 , and {g} , {hi} in T 2 ) easily-since the optimal matching contains only one edge (see Table 2). Two multi-labeled trees, T 1 (on the left) and T 2 (on the right) to be compared Table 1 The table G ′ with entries computed for the cases when vertex from the first tree is a leaf (such vertices are de, f and ij) or vertex from the second tree is a leaf (such vertices are cde, f, and j) Rows