Notation and Preliminaries
Partitions. \({\mathcal {V}}=\{V_1,V_2,\dots ,V_k\}\) is a partition of a set V if (i) \(V_i\ne \emptyset \), (ii) \(\bigcup _{i=1}^k V_i = V\) and (iii) \(V_i\cap V_j=\emptyset \) for \(i\ne j\). A partition is non-trivial if \(|{\mathcal {V}}|\ge 2\). Consider two partitions \({\mathcal {V}}=\{V_1,\dots ,V_k\}\) and \({\mathcal {V}}'=\{V'_1,\dots ,V'_l\}\) of V. If for every \(1\le j'\le l\) there is a j such that \(V'_{j'}\subseteq V_j\), i.e., if every set in \({\mathcal {V}}'\) is completely contained in a set in \({\mathcal {V}}\), then \({\mathcal {V}}'\) is a refinement of \({\mathcal {V}}\), and \({\mathcal {V}}\) is a coarse-graining of \({\mathcal {V}}'\).
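The three partition axioms and the refinement relation can be checked directly from the definitions. The following Python sketch is purely illustrative: the function names `is_partition` and `is_refinement`, and the encoding of a partition as a list of sets, are assumptions made for this example.

```python
def is_partition(blocks, ground_set):
    """Check conditions (i)-(iii): non-empty blocks, union equals V, pairwise disjoint."""
    blocks = [set(b) for b in blocks]
    if any(not b for b in blocks):                      # (i) no empty block
        return False
    union = set().union(*blocks) if blocks else set()
    if union != set(ground_set):                        # (ii) blocks cover V
        return False
    # (iii) blocks are pairwise disjoint iff their sizes add up to |V|
    return sum(len(b) for b in blocks) == len(union)


def is_refinement(fine, coarse):
    """True if every block of `fine` is contained in some block of `coarse`."""
    return all(any(set(f) <= set(c) for c in coarse) for f in fine)


V = {1, 2, 3}
print(is_partition([{1, 2}, {3}], V))
print(is_refinement([{1}, {2}, {3}], [{1, 2}, {3}]))
```

Under these conventions, `fine` being a refinement of `coarse` is the same as `coarse` being a coarse-graining of `fine`.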
Graphs. Mostly, we consider simple directed graphs (digraphs) \(\vec {G}=(V,E)\) with vertex set V and arc set \(E\subseteq V\times V \setminus \{(v,v)\mid v\in V\}\). We will frequently write \(V(\vec {G})\) and \(E(\vec {G})\) to explicitly refer to the graph \(\vec {G}\). For a vertex \(x\in V\), we say that (y, x) is an in-arc and (x, z) is an out-arc. The subgraph induced by a subset \(W\subseteq V\) is denoted by \(\vec {G}[W]\). Undirected graphs can be identified with symmetric digraphs, i.e., the undirected graph G underlying a digraph \(\vec {G}\) is obtained by dropping the direction of all arcs, or by symmetrizing the digraph, i.e., adding the arc (y, x) to \(E(\vec {G})\) for every arc \((x,y)\in E(\vec {G})\). When referring to an undirected graph G, we write xy for \((x,y),(y,x)\in E(G)\) and call xy an edge. The (weakly) connected components of \(\vec {G}\) are the maximal connected subgraphs of the undirected graph underlying \(\vec {G}\) or, equivalently, the maximal strongly connected components of the symmetrized digraph. Whenever the context is clear, we will also refer to the partition of V formed by the vertex sets of the maximal connected subgraphs as the set of connected components.
A vertex coloring is a map \(\sigma :V\rightarrow S\), where S is a non-empty set of colors. A vertex coloring of \(\vec {G}\) is proper if \(\sigma (x)\ne \sigma (y)\) whenever \((x,y)\in E(\vec {G})\). We write \((\vec {G},\sigma )\) for a vertex-colored digraph and denote by V[r] the subset of vertices of a graph \((\vec {G}=(V,E),\sigma )\) that have color r. Moreover, we define \(\sigma (W){:}{=}\{\sigma (x) \mid x\in W\}\) for the subset of colors present in a set \(W\subseteq V\).
We write N(x) for the set of out-neighbors of a vertex \(x\in V(\vec {G})\) and \(N^-(x)\) for the set of in-neighbors of x. A digraph \(\vec {G}\) is called sink-free if \(N(x)\ne \emptyset \) holds for all \(x\in V(\vec {G})\). We write \(A{{\,\mathrm{\triangle }\,}}B {:}{=}(A\setminus B)\cup (B\setminus A)\) for the symmetric difference of two sets A and B, and define, for a digraph \(\vec {G}=(V,E)\) and arc set \(F\subseteq (V\times V)\setminus \{(v,v)\mid v\in V\}\), the digraph \(\vec {G}{{\,\mathrm{\triangle }\,}}F{:}{=}(V, E{{\,\mathrm{\triangle }\,}}F)\). Analogously, we write \(\vec {G}+ F{:}{=}(V, E\cup F)\) and \(\vec {G}- F{:}{=}(V, E\setminus F)\).
Phylogenetic trees. Consider an undirected, rooted tree T with leaf set \(L(T)\subseteq V(T)\) and root \(\rho _T\in V(T)\). Its inner vertices are given by the set \(V^0(T) = V(T) \setminus L(T)\). The ancestor order on V(T) is defined such that \(u\preceq _T v\) if v lies on the unique path from u to the root \(\rho _T\), i.e., if v is an ancestor of u. For brevity we set \(u \prec _T v\) if \(u \preceq _{T} v\) and \(u \ne v\). If xy is an edge in T such that \(y \prec _{T} x\), then x is the parent of y and y the child of x. The set of children of a vertex \(x\in V(T)\) is denoted by \(\mathsf {child}_T(x)\). A tree is phylogenetic if all of its inner vertices have at least two children. All trees considered in this contribution will be phylogenetic. For a non-empty subset \(A\subseteq V(T)\), we define \({{\,\mathrm{lca}\,}}_T(A)\), the last common ancestor of A, to be the unique \(\preceq _T\)-minimal vertex of T that is an ancestor of every \(u\in A\). Following e.g. [11], we denote by \(T_{L'}\) the restriction of T to a subset \(L'\subseteq L(T)\), i.e. \(T_{L'}\) is obtained by identifying the (unique) minimal subtree of T that connects all leaves in \(L'\), and suppressing all vertices with degree two except possibly the root \(\rho _{T_{L'}}={{\,\mathrm{lca}\,}}_T(L')\). We say that T displays or is a refinement of a tree \(T'\), in symbols \(T'\le T\), if \(T'\) can be obtained from a restriction \(T_{L'}\) of T by a series of inner edge contractions. \((T,\sigma )\) is a leaf-colored tree if \(\sigma : L(T)\rightarrow S\) is a map from the leaves of T into a non-empty set of colors. We say that \((T',\sigma ')\) is displayed by \((T,\sigma )\) if \(T'\le T\) and \(\sigma (v)=\sigma '(v)\) for all \(v\in L(T')\).
Rooted triples. A (rooted) triple is a tree on three leaves and with two inner vertices, and thus, it has a topology as the tree in Fig. 3(D). We write xy|z for the triple on the leaves x, y and z if the path from x to y does not intersect the path from z to the root in T, i.e., if \({{\,\mathrm{lca}\,}}_T(x,y)\prec _T {{\,\mathrm{lca}\,}}_T(x,z)={{\,\mathrm{lca}\,}}_T(y,z)\). In this case we say that T displays xy|z. We write \({\mathcal {R}}_{|L'} {:}{=}\left\{ xy|z \in {\mathcal {R}} \,:x,y,z\in L' \right\} \) for the restriction of a triple set \({\mathcal {R}}\) to a set \(L'\) of leaves. A set \({\mathcal {R}}\) of triples is consistent if there is a tree T with leaf set \(L:=\bigcup _{T'\in {\mathcal {R}}} L(T')\) that displays every triple in \({\mathcal {R}}\). The polynomial-time algorithm BUILD decides for every triple set \({\mathcal {R}}\) whether it is consistent, and if so, constructs a particular tree, the Aho tree \({{\,\mathrm{Aho}\,}}({\mathcal {R}}, L)\), that displays every triple in \({\mathcal {R}}\) [10]. The algorithm relies on the construction of an (undirected) auxiliary graph, the Aho graph, for a given triple set \({\mathcal {R}}\) on a set of leaves L. This graph, denoted by \([{\mathcal {R}}, L]\), contains an edge xy if and only if \(xy|z\in {\mathcal {R}}\) for some \(z\in L\).
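The recursion behind BUILD can be sketched as follows: compute the connected components of the Aho graph, report inconsistency if there is only one component on more than one leaf, and otherwise recurse on each component with the restricted triple set. This is a minimal sketch in the spirit of BUILD, not a reference implementation. The encoding of a triple xy|z as a Python tuple `(x, y, z)`, the representation of trees as nested tuples with string leaf labels, and the names `aho_components` and `build` are assumptions of this example, and a non-empty leaf set is assumed.

```python
def aho_components(triples, leaves):
    """Connected components of the Aho graph [R restricted to L', L']."""
    leaves = set(leaves)
    adj = {v: set() for v in leaves}
    for x, y, z in triples:                 # triple xy|z encoded as (x, y, z)
        if {x, y, z} <= leaves:             # keep only triples restricted to L'
            adj[x].add(y)
            adj[y].add(x)
    comps, seen = [], set()
    for v in sorted(leaves):                # sorted only for reproducible output
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:                        # depth-first search from v
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def build(triples, leaves):
    """Return a tree (nested tuples) displaying all triples, or None if inconsistent."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    comps = aho_components(triples, leaves)
    if len(comps) == 1:                     # connected Aho graph: inconsistent
        return None
    children = []
    for comp in comps:
        subtree = build(triples, comp)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)


print(build([("a", "b", "c")], "abc"))
```

For the single triple ab|c the sketch returns the tree in which a and b form a cherry below the root, while the contradictory pair ab|c, ac|b yields `None`.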
Best match graphs
Best matches formalize the notion of the evolutionarily closest relative(s) of a gene x in another species. Relatedness in this context is thought of as a phylogenetic concept and thus expressed in terms of last common ancestors in the gene tree T that describes the evolutionary relationships among a family of genes.
Definition 1
Let \((T,\sigma )\) be a leaf-colored tree. A leaf \(y\in L(T)\) is a best match of the leaf \(x\in L(T)\) if \(\sigma (x)\ne \sigma (y)\) and \({{\,\mathrm{lca}\,}}(x,y)\preceq _T {{\,\mathrm{lca}\,}}(x,y')\) holds for all leaves \(y'\) of color \(\sigma (y')=\sigma (y)\).
As a consequence, best matches in a pair of species in general form a many-to-many relationship and are not necessarily symmetric. Given \((T,\sigma )\), the digraph \(\vec {G}(T,\sigma ) = (V,E)\) with vertex set \(V=L(T)\), vertex-coloring \(\sigma \), and with arcs \((x,y)\in E\) if and only if y is a best match of x w.r.t. \((T,\sigma )\) is called the best match graph (BMG) of \((T,\sigma )\) [6], see Fig. 2 for an illustrative example.
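Def. 1 translates into a simple procedure: for a leaf x and a color s different from the color of x, walk from x towards the root and stop at the first ancestor whose subtree contains leaves of color s; exactly these leaves are the best matches of x of color s. The sketch below illustrates this under assumptions made only for this example: trees are nested tuples with string leaf labels, `sigma` is a dictionary from leaves to colors, and the helper names are hypothetical. It is written for clarity rather than efficiency.

```python
def leaves(t):
    """Leaf labels of a tree given as nested tuples."""
    return [t] if not isinstance(t, tuple) else [x for c in t for x in leaves(c)]


def ancestor_leaf_sets(t, path=()):
    """Yield (leaf, leaf sets of its ancestors, ordered from parent up to root)."""
    if not isinstance(t, tuple):
        yield t, path
    else:
        here = (frozenset(leaves(t)),) + path
        for child in t:
            yield from ancestor_leaf_sets(child, here)


def best_match_graph(tree, sigma):
    """Arc set of the BMG of the leaf-colored tree (tree, sigma)."""
    colors = {sigma[x] for x in leaves(tree)}
    arcs = set()
    for x, ancestors in ancestor_leaf_sets(tree):
        for s in colors - {sigma[x]}:
            for leaf_set in ancestors:      # lowest ancestor first
                hits = {y for y in leaf_set if sigma[y] == s}
                if hits:                    # minimal lca(x, y) over color s
                    arcs |= {(x, y) for y in hits}
                    break
    return arcs


sigma = {"a1": "A", "b1": "B", "b2": "B"}
print(sorted(best_match_graph((("a1", "b1"), "b2"), sigma)))
```

In this small example b1 is the unique best match of a1, whereas a1 is a best match of both b1 and b2, which illustrates that best matches need not be symmetric.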
Definition 2
An arbitrary vertex-colored digraph \((\vec {G},\sigma )\) is a best match graph (BMG) if there exists a leaf-colored tree \((T,\sigma )\) such that \((\vec {G},\sigma ) = \vec {G}(T,\sigma )\). In this case, we say that \((T,\sigma )\) explains \((\vec {G},\sigma )\).
We say that \((\vec {G}=(V,E),\sigma )\) is an \(\ell \)-BMG if \(|\sigma (V)|=\ell \). By construction, there is at least one best match of x for every color \(s\in \sigma (V)\setminus \{\sigma (x)\}\):
Observation 3
For every vertex x and every color \(s\ne \sigma (x)\) in a BMG \((\vec {G},\sigma )\) there is some vertex \(y\in N(x)\) with \(\sigma (y)=s\). Equivalently, the subgraph induced by every pair of colors is sink-free.
In particular, therefore, BMGs are sink-free whenever they contain at least two colors. We formalize this basic property of BMGs for colored digraphs in general:
Definition 4
Let \((\vec {G}=(V,E),\sigma )\) be a colored digraph. The coloring \(\sigma \) is sink-free if it is proper and, for every vertex \(x\in V\) and every color \(s\ne \sigma (x)\) in \(\sigma (V)\), there is a vertex \(y\in N(x)\) with \(\sigma (y)=s\). A digraph with a sink-free coloring is called sf-colored.
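Def. 4 can be checked by a direct scan over vertices and arcs. The following sketch assumes, for illustration only, that the digraph is given as a collection of vertices, a set of arcs (pairs of vertices), and a dictionary `sigma` of colors; the function name is hypothetical.

```python
def is_sink_free_coloring(vertices, arcs, sigma):
    """True if sigma is proper and every vertex has an out-neighbor of every other color."""
    arcs = set(arcs)
    if any(sigma[x] == sigma[y] for x, y in arcs):
        return False                                   # coloring is not proper
    colors = {sigma[v] for v in vertices}
    for x in vertices:
        out_colors = {sigma[y] for (u, y) in arcs if u == x}
        if not (colors - {sigma[x]}) <= out_colors:    # some color is missing in N(x)
            return False
    return True


sigma = {"a1": "A", "b1": "B", "b2": "B"}
arcs = {("a1", "b1"), ("b1", "a1"), ("b2", "a1")}
print(is_sink_free_coloring(sigma, arcs, sigma))
```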
Given a tree T and an edge e, we denote by \(T_e\) the tree obtained from T by contracting the edge e. We call an edge e in \((T,\sigma )\) redundant (w.r.t. \((\vec {G},\sigma )\)) if both \((T,\sigma )\) and \((T_e,\sigma )\) explain \((\vec {G},\sigma )\).
Definition 5
A tree \((T,\sigma )\) is least resolved for a BMG \((\vec {G},\sigma )\) if (i) it explains \((\vec {G},\sigma )\) and (ii) it does not contain any redundant edges w.r.t. \((\vec {G},\sigma )\).
By [6, Thm. 8], every BMG has a unique least resolved tree (LRT). Moreover, a characterization of BMGs was given in [6] that makes use of a set of informative triples, which can be defined compactly as follows [12]:
Definition 6
Let \((\vec {G},\sigma )\) be a vertex-colored digraph. Then the set of informative triples is
$$\begin{aligned} {} & {\mathcal{R}} ({\vec{G}},\sigma ) \, {:=} \, \{ ab|b' :\sigma (a)\ne \sigma (b)=\sigma (b'), \\ &\quad (a,b)\in E({\vec{G}}), \ {\text{and}} \ (a,b')\notin E({\vec{G}}) \}, \end{aligned}$$
and the set of forbidden triples is
$$\begin{aligned} {} & {\mathcal{F}}({\vec{G}},\sigma ) \, {:=} \,\{ ab|b' :\sigma (a)\ne \sigma (b)=\sigma (b'), \\ &\quad b\ne b',\ {\text{and}} \ (a,b),(a,b')\in E({\vec{G}}) \}. \end{aligned}$$
For the subclass of BMGs that can be explained by binary trees, we will furthermore need
$$\begin{aligned} {} & \mathop{{\mathcal{R}}^{\text {B}}} ({\vec{G}},\sigma ) \, {:=} \, {\mathcal{R}} ({\vec{G}},\sigma ) \cup \\&\quad \{ bb'|a:ab|b'\in {\mathcal{F}}({\vec{G}},\sigma ), \sigma (b)=\sigma (b')\}. \end{aligned}$$
By definition, \(a,b,b'\) must be pairwise distinct whenever \(ab|b'\in {\mathcal {R}}(\vec {G},\sigma )\), \(ab|b'\in {\mathcal {F}}(\vec {G},\sigma )\), or \(ab|b'\in \mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma )\).
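The sets of informative and forbidden triples of Def. 6 can be enumerated by brute force over all ordered triples of pairwise distinct vertices. In the sketch below, a triple ab|b' is stored as the tuple `(a, b, b2)` with the differently colored vertex a first; this encoding, the function name, and the input format (vertex collection, arc set, color dictionary) are assumptions of the example.

```python
from itertools import permutations


def informative_and_forbidden(vertices, arcs, sigma):
    """Return (R, F): informative and forbidden triples, each ab|b' stored as (a, b, b')."""
    arcs = set(arcs)
    informative, forbidden = set(), set()
    for a, b, b2 in permutations(vertices, 3):          # pairwise distinct a, b, b'
        if sigma[a] != sigma[b] and sigma[b] == sigma[b2] and (a, b) in arcs:
            if (a, b2) in arcs:
                forbidden.add((a, b, b2))               # both arcs present
            else:
                informative.add((a, b, b2))             # (a,b) present, (a,b') absent
    return informative, forbidden


sigma = {"a1": "A", "b1": "B", "b2": "B"}
R, F = informative_and_forbidden(sigma, {("a1", "b1"), ("b1", "a1"), ("b2", "a1")}, sigma)
print(R, F)
```

Note that forbidden triples come in pairs: if ab|b' is forbidden, then so is ab'|b, since the defining condition is symmetric in b and b'.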
We extend the notion of consistency to pairs of triple sets in
Definition 7
Let \({\mathcal {R}}\) and \({\mathcal {F}}\) be sets of triples. The pair \(({\mathcal {R}},{\mathcal {F}})\) is called consistent if there is a tree T that displays all triples in \({\mathcal {R}}\) but none of the triples in \({\mathcal {F}}\). In this case, we also say that T agrees with \(({\mathcal {R}},{\mathcal {F}})\).
It can be decided in polynomial time whether such a pair \(({\mathcal {R}},{\mathcal {F}})\) is consistent using the algorithm MTT, which was named after the corresponding mixed triplets problem restricted to trees and described in [13].
We continue with two simple observations concerning the restriction of triple sets. Since informative and forbidden triples xy|z are only defined by the presence and absence of arcs in the subgraph of \(\vec {G}\) induced by \(\{x,y,z\}\), this leads to the following
Observation 8
[14] Let \((\vec {G},\sigma )\) be a vertex-colored digraph and \(V'\subseteq V(\vec {G})\). Then \(R(\vec {G},\sigma )_{|V'}=R(\vec {G}[V'],\sigma _{|V'})\) holds for every \(R\in \{{\mathcal {R}},{\mathcal {F}}, \mathop {{\mathcal {R}}^{\text {B}}}\}\).
Moreover, any pair of triple sets \(({\mathcal {R}}',{\mathcal {F}}')\) such that \({\mathcal {R}}'\subseteq {\mathcal {R}}\) and \({\mathcal {F}}'\subseteq {\mathcal {F}}\) for a consistent pair \(({\mathcal {R}},{\mathcal {F}})\) remains consistent since any tree that agrees with \(({\mathcal {R}},{\mathcal {F}})\) clearly displays all triples in \({\mathcal {R}}'\) and none of the triples in \({\mathcal {F}}'\). Hence, we have
Observation 9
Let \({\mathcal {R}}'\subseteq {\mathcal {R}}\) and \({\mathcal {F}}'\subseteq {\mathcal {F}}\) for a consistent pair of triple sets \(({\mathcal {R}},{\mathcal {F}})\). Then \(({\mathcal {R}}',{\mathcal {F}}')\) is consistent.
We summarize two characterizations of BMGs given in [7, Thm. 15] and [8, Lemma 3.4 and Thm. 3.5] in the following
Proposition 10
Let \((\vec {G},\sigma )\) be a properly colored digraph with vertex set L. Then the following three statements are equivalent:
1. \((\vec {G},\sigma )\) is a BMG.
2. \({\mathcal {R}}(\vec {G},\sigma )\) is consistent and \(\vec {G}({{\,\mathrm{Aho}\,}}({\mathcal {R}}(\vec {G},\sigma ),L), \sigma ) = (\vec {G},\sigma )\).
3. \((\vec {G},\sigma )\) is sf-colored and \(({\mathcal {R}}(\vec {G},\sigma ),{\mathcal {F}}(\vec {G},\sigma ))\) is consistent.
In this case, \(({{\,\mathrm{Aho}\,}}({\mathcal {R}}(\vec {G},\sigma ),L),\sigma )\) is the unique LRT for \((\vec {G},\sigma )\), and a leaf-colored tree \((T,\sigma )\) on L explains \((\vec {G},\sigma )\) if and only if it agrees with \(({\mathcal {R}}(\vec {G},\sigma ), {\mathcal {F}}(\vec {G},\sigma ))\).
Prop. 10 states that the set of informative triples \({\mathcal {R}}(\vec {G},\sigma )\) of a BMG \((\vec {G},\sigma )\) is consistent. Therefore, it can be used to construct its LRT by means of the BUILD algorithm, see Fig. 1 for an example.
It is important to note that both arc insertions and deletions may lead to creation and loss of informative triples. In particular, when starting from a BMG, both types of modifications have the potential to make the triple set inconsistent as the example in Fig. 2 shows. This is indeed often the case even for moderate disturbances of a BMG as we shall see later.
We expect that empirically estimated best match relations will typically contain errors that correspond to both arc insertions and deletions w.r.t. the unknown underlying “true” best match graph. This motivates the problem of editing a given vertex-colored digraph to a BMG:
Problem 1 (\(\ell \)-BMG Editing)
Input: A properly \(\ell \)-colored digraph \((\vec {G}=(V,E),\sigma )\) and an integer k.
Question: Is there a subset \(F\subseteq V\times V \setminus \{(v,v)\mid v\in V\}\) such that \(|F|\le k\) and \((\vec {G}{{\,\mathrm{\triangle }\,}}F,\sigma )\) is an \(\ell \)-BMG?
Natural variants are \(\ell \)-BMG Completion and \(\ell \)-BMG Deletion where \(\vec {G}{{\,\mathrm{\triangle }\,}}F\) is replaced by \(\vec {G}+F\) and \(\vec {G}-F\), respectively, i.e., only addition or deletion of arcs is allowed. Both \(\ell \)-BMG Editing and its variants are NP-complete [8].
The heuristic algorithms considered in this contribution can be thought of as maps \({\mathbb {A}}\) on the set of finite vertex-colored digraphs such that \({\mathbb {A}}(\vec {G},\sigma )\) is a BMG for every vertex-colored input digraph \((\vec {G},\sigma )\). In particular, the following property of such algorithms is desirable:
Definition 11
A (BMG-editing) algorithm is consistent if \({\mathbb {A}}(\vec {G},\sigma )=(\vec {G},\sigma )\) whenever \((\vec {G},\sigma )\) is a BMG.
A simple, triple-based heuristic
The triple-based characterization summarized by Prop. 10 suggests a simple heuristic for BMG editing that relies on replacing the consistency checks for triple sets by the extraction of maximal sets of consistent triples (see Alg. 1). Both MaxRTC, the problem of extracting from a given set \({\mathcal {R}}\) of rooted triples a maximum-size consistent subset, and MinRTI, the problem of finding a minimum-size subset \({\mathcal {I}}\) such that \({\mathcal {R}}\setminus {\mathcal {I}}\) is consistent, are NP-hard [15]. Furthermore, MaxRTC is APX-hard and MinRTI is \(\Omega (\ln n)\)-inapproximable [16]. However, because of their practical importance in phylogenetics, a large number of practically useful heuristics have been devised, see e.g. [17,18,19].
As a consequence of Prop. 10, Alg. 1 is consistent, i.e., \((\vec {G}^*,\sigma )=(\vec {G},\sigma )\) if and only if the input digraph \((\vec {G},\sigma )\) is a BMG, provided that a consistent heuristic is employed to solve MaxRTC/MinRTI, i.e., provided that consistent triple sets remain unchanged by the method approximating MaxRTC/MinRTI.
The heuristic Alg. 1 is not always optimal, even if MaxRTC/MinRTI is solved optimally. Fig. 3 shows an unconnected 2-colored digraph \((\vec {G},\sigma )\) on three vertices that is not a BMG and does not contain informative triples. The BMG \((\vec {G}^*,\sigma )\) produced by Alg. 1 introduces two arcs into \((\vec {G},\sigma )\). However, \((\vec {G},\sigma )\) can also be edited to a BMG by inserting only one arc.
A simple improvement is to start by enforcing obvious arcs: If v is the only vertex with color \(\sigma (v)\), then by definition there must be an arc (x, v) for every vertex \(x\ne v\). The computation then starts from the sets of informative triples of the modified digraph. We shall see below that these are the only arcs that can safely be added to \(\vec {G}\) without other additional knowledge or constraints (cf. Thm. 19 below).
Locally optimal splits
Finding an optimal BMG editing of a digraph \((\vec {G}=(V,E),\sigma )\) is equivalent to finding a tree \((T,\sigma )\) on V that minimizes the cardinality of
$$\begin{aligned} U(\vec {G},T) {:}{=}\{ (x,y)\in V\times V \mid \ &(x,y)\in E \text { and }\\ &(x,y)\notin E(\vec {G}(T,\sigma )) \, \text {, or} \\& (x,y)\notin E \text { and }\\& (x,y)\in E(\vec {G}(T,\sigma )) \}. \end{aligned}$$
(1)
Clearly, \(U(\vec {G},T) =\emptyset \) implies that \((\vec {G},\sigma ) = \vec {G}(T,\sigma )\) is a BMG. However, finding a tree \((T,\sigma )\) that minimizes \(|U(\vec {G},T)|\) is intractable (unless \(P=NP\)) since \(\ell \)-BMG Editing, Problem 1 above, is NP-complete [8].
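As a brief reformulation that may be helpful for the arguments below, Eq. (1) can be restated in terms of the symmetric difference introduced in the preliminaries. This is only a rewriting of the definition, not an additional result:

$$\begin{aligned} U(\vec {G},T) = E {{\,\mathrm{\triangle }\,}} E(\vec {G}(T,\sigma )), \qquad \text {and hence}\qquad (\vec {G}{{\,\mathrm{\triangle }\,}}U(\vec {G},T),\sigma ) = \vec {G}(T,\sigma ). \end{aligned}$$

In other words, \(|U(\vec {G},T)|\) is exactly the number of arc insertions and deletions needed to turn \((\vec {G},\sigma )\) into the BMG explained by \((T,\sigma )\).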
We may ask, nevertheless, if trees \((T,\sigma )\) on V contain information about arcs and non-arcs in \((\vec {G},\sigma )\) that are “unambiguously false” in the sense that they are contained in every edit set that converts \((\vec {G},\sigma )\) into a BMG. Denote by \({\mathcal {T}}_V\) the set of all phylogenetic trees on V. The set of these “unambiguously false” (non-)arcs can then be expressed as
$$\begin{aligned} U^*(\vec {G}){:}{=}\bigcap _{T\in {\mathcal {T}}_V} U(\vec {G},T). \end{aligned}$$
(2)
In general, there are exponentially many trees on V, and thus the problem of determining \(U^*(\vec {G})\) seems quite challenging at first glance. We shall see in Thm. 19, however, that \(U^*(\vec {G})\) can be computed efficiently. We start with a conceptually simpler construction, and consider the set of trees \({\mathcal {T}}({\mathcal {V}})\subseteq {\mathcal {T}}_V\) for which the set of leaf sets of the children of the root equals the partition \({\mathcal {V}}\). In other words, given \({\mathcal {V}}=\{V_1,\dots ,V_k\}\), the root \(\rho _T\) of every \(T\in {\mathcal {T}}({\mathcal {V}})\) has exactly k children \(v_1,\dots ,v_k\) such that \(V_i=L(T(v_i))\) for all \(1\le i\le k\).
Definition 12
Let \((\vec {G}=(V,E),\sigma )\) be a properly vertex-colored digraph and \({\mathcal {V}}\) a partition of V with \(|{\mathcal {V}}|\ge 2\). Moreover, let \({\mathcal {T}}({\mathcal {V}})\) be the set of trees T on V that satisfy \({\mathcal {V}} = \{L(T(v)) \mid v\in \mathsf {child}_{T}(\rho _T) \}\). The set of unsatisfiable relations (UR), denoted by \(U(\vec {G},{\mathcal {V}})\), is defined as
$$\begin{aligned} U(\vec {G},{\mathcal {V}}) {:}{=}\bigcap _{T\in {\mathcal {T}}({\mathcal {V}})} U(\vec {G},T). \end{aligned}$$
(3)
The associated UR-cost is \(c(\vec {G},{\mathcal {V}}){:}{=}|U(\vec {G},{\mathcal {V}})|\).
The set of (phylogenetic) trees \({\mathcal {T}}({\mathcal {V}})\) is non-empty since \(|{\mathcal {V}}|\ge 2\) in Def. 12. Moreover, by construction, \((x,y) \in U(\vec {G},{\mathcal {V}})\) if and only if
$$\begin{aligned} {} &(x,y)\in E \text { and } (x,y)\notin E(\vec {G}(T,\sigma )) \text { for all } T\in {\mathcal {T}}({\mathcal {V}}),\\& \text {or}\\& (x,y)\notin E \text { and } (x,y)\in E(\vec {G}(T,\sigma )) \text { for all } T\in {\mathcal {T}}({\mathcal {V}}). \end{aligned}$$
Intriguingly, the set \(U(\vec {G},{\mathcal {V}})\), and thus the UR-cost \(c(\vec {G},{\mathcal {V}})\), can be computed in polynomial time without any explicit knowledge of the trees in \({\mathcal {T}}({\mathcal {V}})\). To this end, we define the three sets
$$\begin{aligned} {} & U_1(\vec {G},{\mathcal {V}}) = \bigcup _{V_{i}\in {\mathcal {V}}} \{(x,y) \mid (x,y)\in E,\; x\in V_{i},\; y\in V\setminus V_{i},\\&\quad \sigma (y)\in \sigma (V_{i})\},\\& U_2(\vec {G},{\mathcal {V}}) = \bigcup _{V_{i}\in {\mathcal {V}}} \{(x,y) \mid (x,y)\notin E,\; x\in V_{i},\; y\in V\setminus V_{i},\\&\quad \sigma (y)\notin \sigma (V_{i})\}, \text { and}\\& U_3(\vec {G},{\mathcal {V}}) = \bigcup _{V_{i}\in {\mathcal {V}}} \{(x,y) \mid (x,y)\notin E,\; \text {distinct }x,y\in V_{i},\\&\quad V_{i}[\sigma (y)]=\{y\}\}. \end{aligned}$$
Lemma 13
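Let \((\vec {G}=(V,E),\sigma )\) be a properly vertex-colored digraph and let \({\mathcal {V}}=\{V_1,\dots ,V_k\}\) be a partition of V with \(|{\mathcal {V}}|=k\ge 2\). Then \(U(\vec {G},{\mathcal {V}}) = U_1(\vec {G},{\mathcal {V}}) \cup U_2(\vec {G},{\mathcal {V}}) \cup U_3(\vec {G},{\mathcal {V}})\).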
The proof of Lemma 13 relates the possible cases between \({\mathcal {V}}\) and the tree set \({\mathcal {T}}({\mathcal {V}})\) in a straightforward manner. Since it is rather lengthy, it is relegated to the Appendix. Fig. 4 gives examples for all three types of unsatisfiable relations, i.e., for \(U_1(\vec {G},{\mathcal {V}})\), \(U_2(\vec {G},{\mathcal {V}})\), and \(U_3(\vec {G},{\mathcal {V}})\). In particular, we have \((b', a)\in U_1(\vec {G},{\mathcal {V}})\) since it is an arc in \(\vec {G}\) but \(V_2\) contains another red vertex \(a'\). Moreover, \((b,c)\in U_2(\vec {G},{\mathcal {V}})\) since it is not an arc in \(\vec {G}\) but \(V_1\) does not contain another green vertex. Finally, we have \((a,b)\in U_3(\vec {G},{\mathcal {V}})\) since it is not an arc in \(\vec {G}\) but b is the only blue vertex in \(V_1\). In the example, the digraph \((\vec {G}{{\,\mathrm{\triangle }\,}}U(\vec {G},{\mathcal {V}}))\) is already a BMG; this, however, is not true in general.
Corollary 14
The set \(U(\vec {G},{\mathcal {V}})\) can be computed in quadratic time.
Proof
We first compute all numbers \(n_{i,A}\) of vertices in \(V_i\) with a given color A. This can be done in O(|V|) if we do not explicitly store the zero-entries. Now, \(\sigma (y)\in \sigma (V_i)\), i.e. \(n_{i,\sigma (y)}>0\), can be checked in constant time, and thus, it can also be decided in constant time whether or not a pair (x, y) is contained in \(U_1(\vec {G},{\mathcal {V}})\) or \(U_2(\vec {G},{\mathcal {V}})\). Since, given \(y\in V_i\), the condition \(V_i[\sigma (y)]=\{y\}\) is equivalent to \(n_{i,\sigma (y)}=1\), membership in \(U_3(\vec {G},{\mathcal {V}})\) can also be decided in constant time. Checking all ordered pairs \(x,y\in V\) thus requires a total effort of \(O(|V|^2)\). \(\square \)
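The counting argument of this proof can be sketched in a few lines. The code below computes the three sets \(U_1\), \(U_2\), and \(U_3\) as defined above by first tabulating the numbers \(n_{i,A}\) and then testing every ordered pair once. The function name, the input format (vertex collection, arc set, color dictionary, partition as a list of sets), and the use of hash-based lookups (constant time only on average) are assumptions of this illustration.

```python
from collections import Counter


def unsatisfiable_relations(vertices, arcs, sigma, partition):
    """Return (U1, U2, U3) for a properly colored digraph and a partition of its vertices."""
    arcs = set(arcs)
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    # n[(i, A)] = number of vertices of color A in block V_i (zero entries not stored)
    n = Counter((block_of[v], sigma[v]) for v in vertices)
    u1, u2, u3 = set(), set(), set()
    for x in vertices:
        i = block_of[x]
        for y in vertices:
            if x == y:
                continue
            count = n[(i, sigma[y])]          # vertices of color sigma(y) in V_i
            is_arc = (x, y) in arcs
            if block_of[y] != i:              # y lies outside V_i
                if is_arc and count > 0:
                    u1.add((x, y))            # arc, but sigma(y) occurs in V_i
                elif not is_arc and count == 0:
                    u2.add((x, y))            # non-arc, but sigma(y) absent from V_i
            elif not is_arc and count == 1:
                u3.add((x, y))                # non-arc, y unique of its color in V_i
    return u1, u2, u3


sigma = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
partition = [{"a1", "b1"}, {"a2", "b2"}]
print(unsatisfiable_relations(sigma, {("a1", "b2")}, sigma, partition))
```

In this example the arc (a1, b2) must be deleted since b1 lies in the same block as a1, and the four arcs between the two vertices within each block must be inserted.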
Our discussion so far suggests a recursive top-down approach, made precise in Alg. 2. In each step, one determines a “suitably chosen” partition \({\mathcal {V}}\) and then recurses on the subgraphs of the edited digraph \(\vec {G}^* \triangle U(\vec {G}^*[V'],{\mathcal {V}})\). More details on such suitable partitions \({\mathcal {V}}\) will be given in Thm. 23 below. The parts in the algorithm highlighted in color can be omitted. They are useful, however, if one is also interested in a tree \((T,\sigma )\) that explains the editing result \((\vec {G}^*,\sigma )\) and to show that \((\vec {G}^*,\sigma )\) is indeed a BMG (see below). Alg. 2 is designed to accumulate the edit sets in each step, Line 5. In particular, the total edit cost and the scores \(c(\vec {G}^*[V'],{\mathcal {V}})\) are closely tied together, which follows from the following result:
Lemma 15
All edit sets \(U(\vec {G}^*[V'],{\mathcal {V}})\) constructed in Alg. 2 are pairwise disjoint.
The proof of Lemma 15 and a technical result on which it relies can be found in the Appendix. As an immediate consequence of Lemma 15, we have
Corollary 16
The edit cost of Alg. 2 is the sum of the UR-costs \(c(\vec {G}^*[V'],{\mathcal {V}})\) in each recursion step.
It is important to note that the edits \(U(\vec {G}^*[V'],{\mathcal {V}})\) must be applied immediately in each step (cf. Line 5 in Alg. 2). In particular, Lemma 15 and Cor. 16 pertain to the partitioning of the edited digraph \(\vec {G}^*\), not to the original digraph \(\vec {G}\). We continue by proving the correctness of Alg. 2, i.e., that it returns a valid BMG and a corresponding explaining tree.
Theorem 17
Every pair of edited digraph \((\vec {G}^*,\sigma )\) and tree T produced as output by Alg. 2 satisfies \((\vec {G}^*,\sigma )=\vec {G}(T,\sigma )\). In particular, \((\vec {G}^*,\sigma )\) is a BMG.
Proof
By construction, the tree T is phylogenetic and there is a one-to-one correspondence between the vertices \(u\in V(T)\) and the recursion steps, which operate on the sets \(V'=L(T(u))\). If \(|V'|\ge 2\) (or, equivalently, u is an inner vertex of T), we furthermore have \({\mathcal {V}}=\{L(T(v)) \mid v\in \mathsf {child}_{T}(u)\}\) for the partition \({\mathcal {V}}\) of \(V'\) chosen in that recursion step. In the following, we denote by \((\vec {G}^*,\sigma )\) the digraph during the editing process, and by \((\vec {G},\sigma )\) the input digraph, i.e., as in Alg. 2. For brevity, we write \(E^*\) for the arc set of the final edited digraph and \(E^T{:}{=}E(\vec {G}(T,\sigma ))\).
Let us assume, for contradiction, that there exists (a) \((x,y)\in E^*\setminus E^T\ne \emptyset \), or (b) \((x,y)\in E^T\setminus E^*\ne \emptyset \). In either case, we set \(u{:}{=}{{\,\mathrm{lca}\,}}_T(x,y)\) and consider the recursion step on \(V'{:}{=}L(T(u))\) with the corresponding partition \({\mathcal {V}}{:}{=}\{L(T(v)) \mid v\in \mathsf {child}_{T}(u)\}\) chosen for \(V'\). Note that \(x\ne y\), and thus \(u\in V^0(T)\). Moreover, let \(v_x\) be the child of u such that \(x\preceq _{T} v_x\), and \(V_x{:}{=}L(T(v_x))\in {\mathcal {V}}\).
Case (a): \((x,y)\in E^*\setminus E^T\ne \emptyset \). Since \((x,y)\notin E^T\) and by the definition of best matches, there must be a vertex \(y'\in V_x\) of color \(\sigma (y)\) such that \({{\,\mathrm{lca}\,}}_T(x,y')\prec _T{{\,\mathrm{lca}\,}}_{T}(x,y)=u\), and thus \(\sigma (y)\in \sigma (V_x)\). Moreover, we have \(V_x\in {\mathcal {V}}\), \(x\in V_x\) and \(y\in V'\setminus V_x\). Two subcases need to be considered, depending on whether or not (x, y) is an arc in \(\vec {G}^*\) at the beginning of the recursion step. In the first case, the arguments above imply that \((x,y)\in U_1(\vec {G}^*[V'], {\mathcal {V}})\), and thus, \((x,y)\in U(\vec {G}^*[V'], {\mathcal {V}})\) by Lemma 13. Hence, we delete the arc (x, y) in this step. In the second case, it is an easy task to verify that none of the definitions of \(U_1(\vec {G}^*[V'], {\mathcal {V}})\), \(U_2(\vec {G}^*[V'], {\mathcal {V}})\), and \(U_3(\vec {G}^*[V'], {\mathcal {V}})\) matches for (x, y). Since this step is clearly the last one in the recursion hierarchy that can affect the (non-)arc (x, y), it follows for both subcases that \((x,y)\notin E^*\); a contradiction.
Case (b): \((x,y)\in E^T\setminus E^*\ne \emptyset \). Since \((x,y)\in E^T\) and by the definition of best matches, there cannot be a vertex \(y'\in V_x\) of color \(\sigma (y)\) such that \({{\,\mathrm{lca}\,}}_T(x,y')\prec _T{{\,\mathrm{lca}\,}}_{T}(x,y)=u\), and thus \(\sigma (y)\notin \sigma (V_x)\). Moreover, we have \(V_x\in {\mathcal {V}}\), \(x\in V_x\) and \(y\in V'\setminus V_x\). Again, two subcases need to be distinguished depending on whether or not (x, y) is an arc in \(\vec {G}^*\) at the beginning of the recursion step. In the first case, the arguments above make it easy to verify that none of the definitions of \(U_1(\vec {G}^*[V'], {\mathcal {V}})\), \(U_2(\vec {G}^*[V'], {\mathcal {V}})\), and \(U_3(\vec {G}^*[V'], {\mathcal {V}})\) matches for (x, y). In the second case, we obtain \((x,y)\in U_2(\vec {G}^*[V'], {\mathcal {V}})\), and thus, \((x,y)\in U(\vec {G}^*[V'], {\mathcal {V}})\) by Lemma 13. Hence, we insert the arc (x, y) in this step. As before, the (non-)arc (x, y) remains unaffected in any deeper recursion step. Therefore, we have \((x,y)\in E^*\) in both subcases; a contradiction.
Finally, \((\vec {G}^*,\sigma )=\vec {G}(T,\sigma )\) immediately implies that \((\vec {G}^*,\sigma )\) is a BMG. \(\square \)
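The top-down recursion discussed above can be sketched as follows. This is an illustrative outline under stated assumptions and not a transcription of Alg. 2: the partition rule is passed in as a hypothetical callback `choose_partition`, trees are returned as nested tuples, and the helper `unsat` computes the union of the sets \(U_1\), \(U_2\), and \(U_3\) for the induced subgraph. The edits are applied to the arc set immediately in each step before recursing on the blocks.

```python
from collections import Counter


def unsat(part, arcs, sigma, partition):
    """Union of U1, U2, U3 for the subgraph induced by `part` and the given partition."""
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    n = Counter((block_of[v], sigma[v]) for v in part)
    edits = set()
    for x in part:
        for y in part:
            if x == y:
                continue
            count = n[(block_of[x], sigma[y])]
            is_arc = (x, y) in arcs
            if block_of[x] != block_of[y]:
                if (is_arc and count > 0) or (not is_arc and count == 0):
                    edits.add((x, y))
            elif not is_arc and count == 1:
                edits.add((x, y))
    return edits


def edit_top_down(vertices, arcs, sigma, choose_partition):
    """Return (edited arc set, tree); `choose_partition` is a user-supplied rule."""
    arcs = set(arcs)                                   # working copy, edited in place

    def recurse(part):
        if len(part) == 1:
            return next(iter(part))                    # leaf of the tree
        partition = choose_partition(part, arcs, sigma)  # must have >= 2 blocks
        arcs.symmetric_difference_update(unsat(part, arcs, sigma, partition))
        return tuple(recurse(block) for block in partition)

    tree = recurse(frozenset(vertices))
    return arcs, tree


# Hypothetical partition rule used only for demonstration: split into singletons.
def star(part, arcs, sigma):
    return [{v} for v in sorted(part)]


sigma = {"a1": "A", "b1": "B", "b2": "B"}
print(edit_top_down(sigma, {("a1", "b1")}, sigma, star))
```

With the singleton rule the resulting tree is a star, so the edited digraph contains all arcs between differently colored vertices; any other rule that returns a partition with at least two blocks can be substituted.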
Cor. 16 suggests a greedy-like “local” approach. In each step, the partition \({\mathcal {V}}\) is chosen to minimize the score \(c(\vec {G}, {\mathcal {V}})\) in Line 4. The example in Fig. 5 shows, however, that the greedy-like choice of \({\mathcal {V}}\) does not necessarily yield a globally optimal edit set.
In order to identify arcs that must be contained in every edit set, we first clarify the relationship between the partitions \({\mathfrak {P}}_{\ge 2}\) on V and the partitions defined by the phylogenetic trees on V.
Lemma 18
Let V be a set with \(|V|\ge 2\). Let \({\mathfrak {P}}_{\ge 2}\) be the set of all partitions \({\mathcal {V}}\) of V with \(|{\mathcal {V}}|\ge 2\). Then the set \({\mathcal {T}}_V\) of all phylogenetic trees with leaf set V satisfies \({\mathcal {T}}_V= \bigcup _{{\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}} {\mathcal {T}}({\mathcal {V}})\).
Proof
For every \({\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}\), \({\mathcal {T}}({\mathcal {V}})\) is a set of phylogenetic trees on V. Hence, we conclude \(\bigcup _{{\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}} {\mathcal {T}}({\mathcal {V}}) \subseteq {\mathcal {T}}_V\). Conversely, assume that \(T\in {\mathcal {T}}_V\). Since T (with root \(\rho _T\)) is a phylogenetic tree and has at least two leaves, we have \(|\mathsf {child}_{T}(\rho _T)|\ge 2\). Together with \(L(T(\rho _T))=L(T)=V\), this implies \({\mathcal {V}}^*{:}{=}\{L(T(v)) \mid v\in \mathsf {child}_{T}(\rho _T)\} \in {\mathfrak {P}}_{\ge 2}\). In particular, T satisfies \(T\in {\mathcal {T}}({\mathcal {V}}^*)\) for some \({\mathcal {V}}^* \in {\mathfrak {P}}_{\ge 2}\), and is therefore contained in \(\bigcup _{{\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}} {\mathcal {T}}({\mathcal {V}})\). \(\square \)
Using Lemma 18 and given that \(|V|\ge 2\), we can express the set of relations that are unsatisfiable for every partition as follows
$$\begin{aligned} \begin{aligned} \bigcap _{{\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}} U(\vec {G},{\mathcal {V}})&= \bigcap _{{\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}} \left( \bigcap _{T\in {\mathcal {T}}({\mathcal {V}})} U(\vec {G},T) \right) \\[5pt]&= \bigcap _{T\in \bigcup _{{\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}} {\mathcal {T}}({\mathcal {V}})} U(\vec {G},T)\\&= \bigcap _{T\in {\mathcal {T}}_V} U(\vec {G},T) = U^*(\vec {G})\;, \end{aligned} \end{aligned}$$
(4)
i.e., it coincides with the set of relations that are unsatisfiable for every phylogenetic tree, and thus part of every edit set. Note that \(U^*(\vec {G})\) is trivially empty if \(|V|<2\). We next show that \(U^*(\vec {G})\) can be computed without considering the partitions of V explicitly.
Theorem 19
Let \((\vec {G}=(V,E),\sigma )\) be a properly vertex-colored digraph with \(|V|\ge 2\). Then
$$\begin{aligned} U^*(\vec {G}) = \left\{ (x,y) \mid (x,y)\notin E,\; x\ne y,\; V[\sigma (y)]=\{y\} \right\} . \end{aligned}$$
(5)
Proof
First note that \(|V|\ge 2\) ensures that \({\mathfrak {P}}_{\ge 2}\ne \emptyset \). Moreover, since \(|{\mathcal {V}}|\ge 2\) for any \({\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}\), the sets \({\mathcal {T}}({\mathcal {V}})\) are all non-empty as well. With the abbreviation \({\hat{U}}(\vec {G})\) for the right-hand side of Eq. (5), we show that \({\hat{U}}(\vec {G})= \bigcap _{{\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}} U(\vec {G},{\mathcal {V}})\) which by Eq. (4) equals \(U^*(\vec {G})\).
Suppose that \((x,y)\in {\hat{U}}(\vec {G})\). Then \(x\ne y\) and \(V[\sigma (y)]=\{y\}\) imply that \(\sigma (x)\ne \sigma (y)\). This together with the facts that (i) y is the only vertex of its color in V, and (ii) \(L(T)=V\) for each \(T\in {\mathcal {T}}({\mathcal {V}})\) and any \({\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}\) implies that y is a best match of x in every such tree T, i.e. \((x,y)\in E(\vec {G}(T,\sigma ))\). Since in addition \((x,y)\notin E\) by assumption, we conclude that \((x,y)\in U^{*}(\vec {G})\).
Now suppose that \((x,y)\in U^{*}(\vec {G})\). Observe that \(\sigma (x)\ne \sigma (y)\) (and thus \(x\ne y\)) as a consequence of Def. 12 and the fact that \((\vec {G},\sigma )\) and all BMGs are properly colored. If \(V=\{x,y\}\) and thus \(\{\{x\},\{y\}\}\) is the only partition in \({\mathfrak {P}}_{\ge 2}\), the corresponding unique tree T consists of x and y connected to the root. In this case, we clearly have \((x,y)\in E(\vec {G}(T,\sigma ))\) since \(\sigma (x)\ne \sigma (y)\). On the other hand, if \(\{x,y\}\subsetneq V\), then we can find a partition \({\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}\) such that \(V_{i}=\{x,y\}\) for some \(V_{i}\in {\mathcal {V}}\). In this case, every tree \(T\in {\mathcal {T}}({\mathcal {V}})\) has a vertex \(v_{i}\in \mathsf {child}_{T}(\rho _T)\) with the leaves x and y as its only two children. Clearly, \((x,y)\in E(\vec {G}(T,\sigma ))\) holds for any such tree. In summary, there always exists a partition \({\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}\) such that \((x,y)\in E(\vec {G}(T,\sigma ))\) for some tree \(T\in {\mathcal {T}}({\mathcal {V}})\). Therefore, by \((x,y)\in \bigcap _{{\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}} U(\vec {G},{\mathcal {V}})\) and Def. 12, we conclude that \((x,y)\notin E\). In order to obtain \((x,y)\in {\hat{U}}(\vec {G})\), it remains to show that \(V[\sigma (y)]=\{y\}\). Since \((x,y)\notin E\) and \((x,y)\in \bigcap _{{\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}} U(\vec {G},{\mathcal {V}})\), it must hold that \((x,y)\in E(\vec {G}(T,\sigma ))\) for all \(T\in {\mathcal {T}}({\mathcal {V}})\) and all \({\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}\). Now assume, for contradiction, that there is a vertex \(y'\ne y\) of color \(\sigma (y')=\sigma (y)\). Since \(\sigma (x)\ne \sigma (y)\), the vertices \(x,y,y'\) must be pairwise distinct. Hence, we can find a partition \({\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}\) such that \(V_{i}=\{x,y'\}\) for some \(V_{i}\in {\mathcal {V}}\).
In this case, every tree \(T\in {\mathcal {T}}({\mathcal {V}})\) has a vertex \(v_{i}\in \mathsf {child}_{T}(\rho _T)\) with only the leaves x and \(y'\) as its children. Clearly, \({{\,\mathrm{lca}\,}}_T(x,y')=v_{i}\prec _{T}\rho _{T}={{\,\mathrm{lca}\,}}_{T}(x,y)\), and thus \((x,y)\notin E(\vec {G}(T,\sigma ))\); a contradiction. Therefore, we conclude that y is the only vertex of its color in V, and hence, \((x,y)\in {\hat{U}}(\vec {G})\). In summary, therefore, we have \(U^{*}(\vec {G})={\hat{U}}(\vec {G})\). \(\square \)
As a consequence of Thm. 19 and by similar arguments as in the proof of Cor. 14, we observe
Corollary 20
The set \(U^*(\vec {G})\) can be computed in quadratic time.
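The characterization in Eq. (5) translates almost directly into code. The following Python sketch is only an illustration under assumed conventions: the function name `unsatisfiable_core`, the representation of arcs as a set of ordered pairs, and the coloring as a dictionary are hypothetical choices, not part of the original algorithms. It first counts how often each color occurs and then collects all missing arcs that point to a vertex that is the only one of its color, which takes at most quadratic time in \(|V|\).

```python
from collections import Counter


def unsatisfiable_core(vertices, arcs, color):
    """Return the right-hand side of Eq. (5) for a properly colored digraph.

    vertices: iterable of hashable vertex labels (assumed representation)
    arcs:     iterable of ordered pairs (x, y)
    color:    dict mapping each vertex to its color
    """
    vertices = list(vertices)
    arcs = set(arcs)
    count = Counter(color[v] for v in vertices)
    # Vertices y with V[sigma(y)] = {y}, i.e. the sole vertex of their color.
    unique = [y for y in vertices if count[color[y]] == 1]
    # Every missing arc (x, y) with x != y towards such a vertex belongs to U*.
    return {(x, y) for y in unique for x in vertices
            if x != y and (x, y) not in arcs}


# Small example: y is the only blue vertex, and neither x nor z has an arc to y.
vertices = ["x", "y", "z"]
color = {"x": "red", "y": "blue", "z": "red"}
arcs = {("y", "x")}
u_star = unsatisfiable_core(vertices, arcs, color)
print(sorted(u_star))  # [('x', 'y'), ('z', 'y')]
```

In this toy instance both missing arcs towards y are forced, so every edit set for the example graph contains at least these two arcs.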
By Thm. 19, \(U^*(\vec {G})\) contains only non-arcs, more precisely, missing arcs pointing towards a vertex that is the only one of its color and thus, by definition, a best match of every other vertex irrespective of the details of the gene tree. By definition, furthermore, \(U^*(\vec {G})\) is a subset of every edit set for \((\vec {G},\sigma )\). We therefore have the lower bound
$$\begin{aligned} |U^*(\vec {G})| \le c(\vec {G},{\mathcal {V}}) \end{aligned}$$
(6)
for every \({\mathcal {V}}\in {\mathfrak {P}}_{\ge 2}\).
The following result shows that if \((\vec {G},\sigma )\) is a BMG, then a suitable partition \({\mathcal {V}}\) can be chosen such that \(c(\vec {G},{\mathcal {V}})=|U^*(\vec {G})|=0\).
Lemma 21
Let \((\vec {G}=(V,E),\sigma )\) be a BMG with \(|V|\ge 2\) and \({\mathcal {V}}\) be the connected components of the Aho graph \([{\mathcal {R}}(\vec {G},\sigma ), V]\). Then the partition \({\mathcal {V}}\) of V satisfies \(|{\mathcal {V}}|\ge 2\) and \(c(\vec {G},{\mathcal {V}})=0\).
Proof
Since \((\vec {G},\sigma )\) is a BMG, we can apply Prop. 10 to conclude that \({\mathcal {R}}{:}{=}{\mathcal {R}}(\vec {G},\sigma )\) is consistent and that \((T,\sigma ){:}{=}({{\,\mathrm{Aho}\,}}({\mathcal {R}}, V),\sigma )\) explains \((\vec {G},\sigma )\), i.e., \(\vec {G}(T,\sigma )=(\vec {G},\sigma )\). Hence, \(U(\vec {G},T)=\emptyset \). From \(|V|\ge 2\) and consistency of \({\mathcal {R}}\), it follows that \([{\mathcal {R}}, V]\) has at least two connected components [10], and thus, by construction, \(|{\mathcal {V}}|\ge 2\). Moreover, we clearly have \(T\in {\mathcal {T}}({\mathcal {V}})\) by the construction of T via BUILD. Together with \(U(\vec {G},T)=\emptyset \), the latter implies \(U(\vec {G},{\mathcal {V}})=\emptyset \), and thus \(c(\vec {G},{\mathcal {V}})=0\). \(\square \)
Lemma 22
Let \((\vec {G}=(V,E),\sigma )\) be a BMG, and \({\mathcal {V}}\) a partition of V such that \(c(\vec {G},{\mathcal {V}})=0\). Then the induced subgraph \((\vec {G}[V'],\sigma _{|V'})\) is a BMG for every \(V'\in {\mathcal {V}}\).
Proof
Set \({\mathcal {R}}{:}{=}{\mathcal {R}}(\vec {G},\sigma )\) and \({\mathcal {F}}{:}{=}{\mathcal {F}}(\vec {G},\sigma )\) for the sets of informative and forbidden triples of \((\vec {G},\sigma )\), respectively. Since \((\vec {G},\sigma )\) is a BMG, we can apply Prop. 10 to conclude that \(({\mathcal {R}},{\mathcal {F}})\) is consistent. Now we choose an arbitrary set \(V'\in {\mathcal {V}}\) and set \((\vec {G}',\sigma '){:}{=}(\vec {G}[V'],\sigma _{|V'})\). By Obs. 8, we obtain \({\mathcal {R}}(\vec {G}',\sigma ')={\mathcal {R}}_{|V'}\) and \({\mathcal {F}}(\vec {G}',\sigma ')={\mathcal {F}}_{|V'}\). This together with the fact that \({\mathcal {R}}_{|V'}\subseteq {\mathcal {R}}\) and \({\mathcal {F}}_{|V'}\subseteq {\mathcal {F}}\) and Obs. 9 implies that \(({\mathcal {R}}_{|V'},{\mathcal {F}}_{|V'})=({\mathcal {R}}(\vec {G}',\sigma '), {\mathcal {F}}(\vec {G}',\sigma '))\) is consistent.
By Prop. 10, it remains to show that \((\vec {G}',\sigma ')\) is sf-colored to prove that it is a BMG. To this end, assume for contradiction that there is a vertex \(x\in V'\) and a color \(s\in \sigma (V')\) such that x has no out-neighbor of color \(s\ne \sigma (x)\) in \(V'\). However, since the color s is contained in \(\sigma (V)\) and \((\vec {G},\sigma )\) is a BMG, and thus sf-colored, we conclude that there must be a vertex \(y\in V\setminus V'\) of color s such that \((x,y)\in E\). In summary, we obtain \((x,y)\in E\), \(x\in V'\), \(y\in V\setminus V'\) and \(\sigma (y)=s\in \sigma (V')\). Thus, we have \((x,y)\in U_1(\vec {G},{\mathcal {V}})\). Hence, Lemma 13 implies that \((x,y)\in U(\vec {G},{\mathcal {V}})\) and, hence, \(c(\vec {G},{\mathcal {V}})>0\); a contradiction. Therefore, \((\vec {G}',\sigma ')\) must be sf-colored, which concludes the proof. \(\square \)
Lemmas 21 and 22 allow us to choose the partition \({\mathcal {V}}\) in each step of Alg. 2 in such a way that Alg. 2 is consistent, i.e., BMGs remain unchanged.
Theorem 23
Alg. 2 is consistent if, in each step on \(V'\) with \(|V'|\ge 2\), the partition \({\mathcal {V}}\) in Line 4 is chosen according to one of the following rules:
1. \({\mathcal {V}}\) has minimal UR-cost among all possible partitions \({\mathcal {V}}'\) of \(V'\) with \(|{\mathcal {V}}'|\ge 2\).
2. If the Aho graph \([{\mathcal {R}}(\vec {G}^*[V'],\sigma _{|V'}),V']\) is disconnected with the set of connected components \({\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\), and moreover \(c(\vec {G}^*[V'],{\mathcal {V}}_{{{\,\mathrm{Aho}\,}}})=0\), then \({\mathcal {V}}={\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\).
Proof
We have to show that the final edited digraph \((\vec {G}^*,\sigma )\) returned in Line 13 equals the input digraph \((\vec {G}=(V,E),\sigma )\) whenever \((\vec {G},\sigma )\) already is a BMG, i.e., nothing is edited. Thus suppose that \((\vec {G},\sigma )\) is a BMG and first consider the top-level recursion step on V (where initially \(\vec {G}^*=\vec {G}\) still holds at Line 1). If \(|V|=1\), neither \((\vec {G},\sigma )\) nor \((\vec {G}^*,\sigma )\) contain any arcs, and thus, the edit cost is trivially zero. Now suppose \(|V|\ge 2\). Since \((\vec {G},\sigma )\) is a BMG, Lemma 21 guarantees the existence of a partition \({\mathcal {V}}\) satisfying \(c(\vec {G},{\mathcal {V}})=0\), in particular, the connected components \({\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\) of the Aho graph \([{\mathcal {R}}(\vec {G},\sigma ), V]\) form such a partition. Hence, for both rules (1) and (2), we choose a partition \({\mathcal {V}}\) with (minimal) UR-cost \(c(\vec {G},{\mathcal {V}})=0\). Now, Lemma 22 implies that the induced subgraph \((\vec {G}[V'],\sigma _{|V'})\) is a BMG for every \(V'\in {\mathcal {V}}\). Since we recurse on these subgraphs, we can repeat the arguments above along the recursion hierarchy to conclude that the UR-cost \(c(\vec {G}^*[V'],{\mathcal {V}}')\) vanishes in every recursion step. By Cor. 16, the total edit cost of Alg. 2 is the sum of the UR-costs \(c(\vec {G}^*[V'],{\mathcal {V}}')\) in each recursion step, and thus, also zero. Therefore, we conclude that we still have \((\vec {G}^*,\sigma )=(\vec {G},\sigma )\) in Line 13. \(\square \)
By Thm. 23, Alg. 2 is consistent whenever the choice of \({\mathcal {V}}\) minimizes the UR-cost of \({\mathcal {V}}\) in each step. We shall see below that minimizing \(c(\vec {G},{\mathcal {V}})\) is a difficult optimization problem in general. Therefore, a good heuristic will be required for this step. This, however, may not guarantee consistency of Alg. 2 in general. The second rule in Thm. 23 provides a remedy: the Aho graph \([{\mathcal {R}}(\vec {G}^*[V'],\sigma _{|V'}), V']\) can be computed efficiently. Whenever \([{\mathcal {R}}(\vec {G}^*[V'],\sigma _{|V'}), V']\) is not connected, the partition \({\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\) defined by the connected components of \([{\mathcal {R}}(\vec {G}^*[V'],\sigma _{|V'}), V']\) is chosen provided it has UR-cost zero. This procedure is effectively a generalization of the algorithm BUILD using as input the set of informative triples \({\mathcal {R}}(\vec {G},\sigma )\) of a properly vertex-colored digraph \((\vec {G},\sigma )\). If \((\vec {G},\sigma )\) is already a BMG, then the recursion in Alg. 2 is exactly the same as in BUILD: it recurses on the connected components of the Aho graph (cf. Prop. 10). We can summarize this discussion as
Corollary 24
\((\vec {G},\sigma )\) is a BMG if and only if, in every step of the BUILD algorithm operating on \({\mathcal {R}}(\vec {G},\sigma )_{|V'}\) and \(V'\), either \(|V'|=1\), or \(c(\vec {G}^*[V'], {\mathcal {V}}_{{{\,\mathrm{Aho}\,}}})=0\) for the connected component partition \({\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\) of the disconnected Aho graph \([{\mathcal {R}}(\vec {G}^*[V'],\sigma _{|V'}), V']\).
For recursion steps in which the Aho graph is connected, and possibly also in steps with non-zero UR-cost, another (heuristic) rule has to be employed. As a by-product, we obtain an approach for the case that \({\mathcal {R}}(\vec {G},\sigma )\) is consistent: Following BUILD yields the approximation \(\vec {G}({{\,\mathrm{Aho}\,}}({\mathcal {R}}(\vec {G},\sigma ),V(\vec {G})),\sigma )\) as a natural choice.
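For orientation, the recursion of BUILD on a generic triple set can be sketched as follows. This is a minimal Python sketch of the standard BUILD scheme, not of Alg. 2 itself. It assumes that a triple \(ab|c\) is encoded as a tuple `(a, b, c)`, that leaf labels are sortable, and that trees are returned as nested tuples; the function names are hypothetical. The computation of the informative triples \({\mathcal {R}}(\vec {G},\sigma )\) from a colored digraph is not shown.

```python
def aho_components(triples, leaves):
    """Connected components of the Aho graph [R, L].

    The graph has vertex set L and an edge ab for every triple ab|c in R
    whose three leaves all lie in L.
    """
    leaves = set(leaves)
    adj = {v: set() for v in leaves}
    for a, b, c in triples:
        if a in leaves and b in leaves and c in leaves:
            adj[a].add(b)
            adj[b].add(a)
    seen, comps = set(), []
    for start in sorted(leaves):
        if start in seen:
            continue
        # Depth-first search collecting the component of `start`.
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def build(triples, leaves):
    """Return a tree (nested tuples) displaying all triples, or None if inconsistent."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    comps = aho_components(triples, leaves)
    if len(comps) == 1:
        # A connected Aho graph on at least two leaves signals inconsistency.
        return None
    children = []
    for comp in comps:
        subtree = build(triples, comp)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)


print(build([("a", "b", "c")], "abc"))                   # (('a', 'b'), 'c')
print(build([("a", "b", "c"), ("a", "c", "b")], "abc"))  # None
```

The second call fails because the two triples contradict each other, which makes the Aho graph connected at the top level. This is exactly the situation in which a heuristic rule is needed in the editing setting.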
Binary-explainable BMGs
Phylogenetic trees are often binary. Multifurcations are in many cases – but not always – the consequence of insufficient data [14, 20, 21]. It is therefore of practical interest to consider BMGs that can be explained by a binary tree:
Definition 25
A properly colored digraph \((\vec {G},\sigma )\) is a binary-explainable best match graph (beBMG) if there is a binary tree T such that \(\vec {G}(T,\sigma )=(\vec {G},\sigma )\).
Correspondingly, it is of interest to edit a properly colored digraph to a beBMG, which translates to the following decision problem:
Problem 2
(\(\ell \)-BMG Editing restricted to Binary-Explainable Graphs (EBEG))
Input: A properly \(\ell \)-colored digraph \((\vec {G}=(V,E),\sigma )\) and an integer k.
Question: Is there a subset \(F\subseteq V\times V \setminus \{(v,v)\mid v\in V\}\) such that \(|F|\le k\) and \((\vec {G}{{\,\mathrm{\triangle }\,}}F,\sigma )\) is a binary-explainable \(\ell \)-BMG?
We call the corresponding completion and deletion problems \(\ell \)-BMG CBEG and \(\ell \)-BMG DBEG, respectively. Like their more general counterparts, all three variants are NP-complete, cf. [8, Cor. 6.2] and [14, Thm. 5].
Since the recursive partitioning in Alg. 2 defines a tree that explains the edited BMG, see Thm. 17, it is reasonable to restrict the optimization of \({\mathcal {V}}\) in Line 4 to bipartitions. The problem still remains hard, however, since the corresponding decision problem (problem BPURC) is NP-complete as shown in Thm. 30 below. Similar to BMGs in general, beBMGs have a characterization in terms of informative triples:
Proposition 26
[14, Thm. 3.5] A properly vertex-colored digraph \((\vec {G},\sigma )\) with vertex set V is binary-explainable if and only if (i) \((\vec {G},\sigma )\) is sf-colored, and (ii) the triple set \(\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma )\) is consistent. In this case, the BMG \((\vec {G},\sigma )\) is explained by every refinement of the binary-refinable tree \(({{\,\mathrm{Aho}\,}}(\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma ), V), \sigma )\).
Using Prop. 26, we can apply analogous arguments as in the proof of Lemma 21 for \(\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma )\) instead of \({\mathcal {R}}(\vec {G},\sigma )\) to obtain
Corollary 27
Let \((\vec {G}=(V,E),\sigma )\) be a beBMG with \(|V|\ge 2\) and \({\mathcal {V}}\) be the connected components of the Aho graph \([\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma ), V]\). Then the partition \({\mathcal {V}}\) of V satisfies \(|{\mathcal {V}}|\ge 2\) and \(c(\vec {G},{\mathcal {V}})=0\).
Since a beBMG \((\vec {G},\sigma )\) is explained by every refinement of the Aho tree constructed from \(\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma )\) (cf. Prop. 26), we can obtain a slightly more general result.
Lemma 28
Let \((\vec {G}=(V,E),\sigma )\) be a beBMG with \(|V|\ge 2\) and \({\mathcal {V}}\) be the connected components of the Aho graph \([\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma ),V]\). Then, every coarse-graining \({\mathcal {V}}'\) of \({\mathcal {V}}\) with \(|{\mathcal {V}}'|\ge 2\) satisfies \(c(\vec {G},{\mathcal {V}}')=0\).
Proof
First note that \(\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma )\) is consistent by Prop. 26 since \((\vec {G},\sigma )\) is a beBMG. Therefore, \(|V|\ge 2\) implies \(|{\mathcal {V}}|\ge 2\) [10]. For the trivial coarse-graining \({\mathcal {V}}'={\mathcal {V}}\), Cor. 27 already implies the statement. Now assume \({\mathcal {V}}'\ne {\mathcal {V}}\). Observe that the tree \((T,\sigma ) {:}{=}({{\,\mathrm{Aho}\,}}(\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma ), V), \sigma )\) exists and explains \((\vec {G},\sigma )\) by Prop. 26. Moreover, there is, by construction, a one-to-one correspondence between the children \(v_i\) of its root \(\rho \) and the elements in \(V_i\in {\mathcal {V}}\) given by \(L(T(v_i))=V_i\). We construct a refinement (tree) \(T'\) of T as follows: Whenever we have multiple sets \(V_i\in {\mathcal {V}}\) that are subsets of the same set \(V_j\in {\mathcal {V}}'\), we remove the edges \(\rho v_i\) to the corresponding vertices \(v_i\in \mathsf {child}_{T}(\rho )\) in T, and collectively connect these \(v_i\) to a newly created vertex \(w_j\). These vertices \(w_j\) are then reattached to the root \(\rho \). Since \(|{\mathcal {V}}'|\ge 2\) by assumption, the so-constructed tree \(T'\) is still phylogenetic. Moreover, it satisfies \({\mathcal {V}}'=\{L(T'(v)) \mid v\in \mathsf {child}_{T'}(\rho )\}\), and thus, \(T'\in {\mathcal {T}}({\mathcal {V}}')\). It is a refinement of T since contraction of the edges \(\rho w_j\) again yields T. Hence, we can apply Prop. 26 to conclude that \((T',\sigma )\) also explains \((\vec {G},\sigma )\). It follows immediately that \(U(\vec {G},T')=\emptyset \). The latter together with \(T'\in {\mathcal {T}}({\mathcal {V}}')\) implies \(U(\vec {G},{\mathcal {V}}')=\emptyset \), and thus \(c(\vec {G},{\mathcal {V}}')=0\). \(\square \)
We are now in the position to formulate an analogue of Thm. 23 for variants of Alg. 2 that aim to edit a properly-colored digraph \((\vec {G},\sigma )\) to a beBMG.
Theorem 29
Alg. 2 is consistent for beBMGs \((\vec {G},\sigma )\) if, in each step on \(V'\) with \(|V'|\ge 2\), a bipartition \({\mathcal {V}}\) in Line 4 is chosen according to one of the following rules:
-
1
\({\mathcal {V}}\) has minimal UR-cost among all possible bipartitions \({\mathcal {V}}'\) of \(V'\).
-
2
If the Aho graph \([\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G}^*[V'],\sigma _{|V'}),V']\) is disconnected with the set of connected components \({\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\), and moreover \(c(\vec {G}^*[V'],{\mathcal {V}}_{{{\,\mathrm{Aho}\,}}})=0\), then \({\mathcal {V}}\) is a coarse-graining of \({\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\).
Proof
We have to show that the final edited digraph \((\vec {G}^*,\sigma )\) returned in Line 13 equals the input digraph \((\vec {G}=(V,E),\sigma )\) whenever \((\vec {G},\sigma )\) already is a beBMG, i.e., nothing is edited. Thus suppose that \((\vec {G},\sigma )\) is a beBMG and first consider the top-level recursion step on V (where initially \(\vec {G}^*=\vec {G}\) still holds at Line 1). If \(|V|=1\), neither \((\vec {G},\sigma )\) nor \((\vec {G}^*,\sigma )\) contain any arcs, and thus, the edit cost is trivially zero. Now suppose \(|V|\ge 2\). Since \((\vec {G},\sigma )\) is a beBMG, \(\mathop {{\mathcal {R}}^{\text {B}}}{:}{=}\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G},\sigma )\) is consistent, and thus, the set of connected components \({\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\) of the Aho graph \([\mathop {{\mathcal {R}}^{\text {B}}}, V]\) has a cardinality of at least two. If \(|{\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}|=2\), \({\mathcal {V}}{:}{=}{\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\) is a bipartition satisfying \(c(\vec {G},{\mathcal {V}})=0\) by Cor. 27. If \(|{\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}|>2\), we can find an arbitrary bipartition \({\mathcal {V}}\) that is a coarse-graining of \({\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\). By Lemma 28, \({\mathcal {V}}\) also satisfies \(c(\vec {G},{\mathcal {V}})=0\) in this case. Hence, for both rules (1) and (2), we choose a bipartition \({\mathcal {V}}\) with (minimal) UR-cost \(c(\vec {G},{\mathcal {V}})=0\). Now, Lemma 22 implies that the induced subgraph \((\vec {G}[V'],\sigma _{|V'})\) is a BMG for every \(V'\in {\mathcal {V}}\). To see that \((\vec {G}[V'],\sigma _{|V'})\) is also binary-explainable, first note that \(\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G}[V'],\sigma _{|V'})=\mathop {{\mathcal {R}}^{\text {B}}}_{|V'}\) by Obs. 8. This together with the fact that \(\mathop {{\mathcal {R}}^{\text {B}}}_{|V'}\subseteq \mathop {{\mathcal {R}}^{\text {B}}}\) and Obs. 
9 implies that \(\mathop {{\mathcal {R}}^{\text {B}}}(\vec {G}[V'],\sigma _{|V'})\) is consistent. Moreover, Prop. 10 and \((\vec {G}[V'],\sigma _{|V'})\) being a BMG together imply that \((\vec {G}[V'],\sigma _{|V'})\) is sf-colored. Hence, we can apply Prop. 26 to conclude that \((\vec {G}[V'],\sigma _{|V'})\) is a beBMG.
Since we recurse on the subgraphs \((\vec {G}[V'],\sigma _{|V'})\), which are again beBMGs, we can repeat the arguments above along the recursion hierarchy to conclude that the UR-cost \(c(\vec {G}^*[V'],{\mathcal {V}}')\) vanishes in every recursion step. By Cor. 16, the total edit cost of Alg. 2 is the sum of the UR-costs \(c(\vec {G}^*[V'],{\mathcal {V}}')\) in each recursion step, and thus, also zero. Therefore, we conclude that we still have \((\vec {G}^*,\sigma )=(\vec {G},\sigma )\) in Line 13. \(\square \)
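Rule 2 of Thm. 29 asks for a bipartition that is a coarse-graining of \({\mathcal {V}}_{{{\,\mathrm{Aho}\,}}}\). By Lemma 28, any such bipartition has UR-cost zero if the input is a beBMG, so a single arbitrary choice suffices in that case. The following Python sketch merely illustrates the space of candidates by enumerating all of them. The function name and the representation of a partition as a list of sets are assumptions made for this example.

```python
from itertools import combinations


def bipartition_coarse_grainings(blocks):
    """Yield all bipartitions (A, B) in which every block lies entirely in A or in B."""
    blocks = [frozenset(b) for b in blocks]
    if len(blocks) < 2:
        return  # no bipartition can coarse-grain fewer than two blocks
    first, rest = blocks[0], blocks[1:]
    # The first block always goes to side A, so (A, B) and (B, A) are not both produced.
    # At most len(rest) - 1 further blocks join A, which keeps side B non-empty.
    for r in range(len(rest)):
        for chosen in combinations(range(len(rest)), r):
            side_a = first.union(*(rest[i] for i in chosen))
            side_b = frozenset().union(
                *(rest[i] for i in range(len(rest)) if i not in chosen)
            )
            yield side_a, side_b


candidates = list(bipartition_coarse_grainings([{"a"}, {"b", "c"}, {"d"}]))
for side_a, side_b in candidates:
    print(sorted(side_a), sorted(side_b))
```

For a partition with k blocks this produces \(2^{k-1}-1\) bipartitions, so exhaustive enumeration is only sensible for small k. When the input is not a beBMG, the candidates may differ in UR-cost and a selection criterion is needed.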
Minimizing the UR-cost \(c(\vec {G},{\mathcal {V}})\)
The problem of minimizing \(c(\vec {G},{\mathcal {V}})\) for a given properly colored digraph \((\vec {G},\sigma )\) corresponds to the following decision problem.
Problem 3
((Bi)Partition with UR-Cost ((B)PURC))
Input: A properly \(\ell \)-colored digraph \((\vec {G}=(V,E),\sigma )\) and an integer \(k\ge 0\).
Question: Is there a (bi)partition \({\mathcal {V}}\) of V such that \(c(\vec {G},{\mathcal {V}})\le k\)?
In the Appendix, we show that (B)PURC is NP-hard by reduction from Set Splitting, one of Garey and Johnson’s [22] classical NP-complete problems.
Theorem 30
BPURC is NP-complete.
Thm. 23 suggests considering heuristics for (B)PURC that make use of the Aho graph in the following manner:
1. Construct the Aho graph \(H{:}{=}[{\mathcal {R}}(\vec {G},\sigma ),V]\) based on the set of informative triples \({\mathcal {R}}(\vec {G},\sigma )\).
2. If H has more than one connected component, we use the set of connected components as the partition \({\mathcal {V}}\).
3. If H is connected, a heuristic that operates on the Aho graph H is used to find a partition \({\mathcal {V}}\) with small UR-cost \(c(\vec {G},{\mathcal {V}})\).
Plugging any algorithm of this type into Line 4 of Alg. 2 reduces the algorithm to BUILD if a BMG is used as input and thus guarantees consistency (cf. Prop. 10). We note, however, that the connected components of a disconnected Aho graph are not guaranteed to correspond to an optimal solution for (B)PURC in the general case.
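The three-step scheme above can be sketched as a small dispatcher that takes the heuristic for the connected case as a parameter. The Python code below is a hypothetical skeleton: triples \(ab|c\) are assumed to be given as tuples `(a, b, c)`, the names `choose_partition` and `split_off_first` are invented for this example, and `split_off_first` is only a placeholder that stands in for a real heuristic and makes no attempt to keep the UR-cost small.

```python
def aho_components(triples, vertices):
    """Connected components of the Aho graph, computed with a union-find structure."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the representative, halving paths on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b, c in triples:
        # A triple ab|c contributes the edge ab only if all three leaves are present.
        if a in parent and b in parent and c in parent:
            parent[find(a)] = find(b)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def choose_partition(triples, vertices, fallback):
    """Steps 1-3: use the Aho components if there are several, else call the heuristic."""
    components = aho_components(triples, vertices)
    if len(components) > 1:
        return components
    return fallback(triples, vertices)


def split_off_first(triples, vertices):
    """Placeholder heuristic (assumes at least two sortable vertices): isolate one vertex."""
    ordered = sorted(vertices)
    return [{ordered[0]}, set(ordered[1:])]


disconnected = choose_partition([("a", "b", "c")], ["a", "b", "c"], split_off_first)
connected = choose_partition(
    [("a", "b", "c"), ("b", "c", "a")], ["a", "b", "c"], split_off_first
)
print(sorted(map(sorted, disconnected)))  # [['a', 'b'], ['c']]
print(sorted(map(sorted, connected)))     # [['a'], ['b', 'c']]
```

Passing the heuristic as an argument keeps the Aho-graph step separate from the choice of method for connected Aho graphs, so different heuristics can be compared without touching the rest of the recursion.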