Heuristic algorithms for best match graph editing

Background Best match graphs (BMGs) are a class of colored digraphs that naturally appear in mathematical phylogenetics as a representation of the pairwise most closely related genes among multiple species. An arc connects a gene x with a gene y from another species (vertex color) Y whenever it is one of the phylogenetically closest relatives of x. BMGs can be approximated with the help of similarity measures between gene sequences, albeit not without errors. Empirical estimates thus will usually violate the theoretical properties of BMGs. The corresponding graph editing problem can be used to guide error correction for best match data. Since the arc set modification problems for BMGs are NP-complete, efficient heuristics are needed if BMGs are to be used for the practical analysis of biological sequence data. Results Since BMGs have a characterization in terms of consistency of a certain set of rooted triples (binary trees on three vertices) defined on the set of genes, we consider heuristics that operate on triple sets. As an alternative, we show that there is a close connection to a set partitioning problem that leads to a class of top-down recursive algorithms that are similar to Aho’s supertree algorithm and give rise to BMG editing algorithms that are consistent in the sense that they leave BMGs invariant. Extensive benchmarking shows that community detection algorithms for the partitioning steps perform best for BMG editing. Conclusion Noisy BMG data can be corrected with sufficient accuracy and efficiency to make BMGs an attractive alternative to classical phylogenetic methods.


Background
A wide range of tasks in computational biology start by determining, for a given "query gene" x in a species A, one or all genes y in another species B that are most similar to x. Conceptually, these best matches y of the query gene x are meant to approximate the set of genes in B that are evolutionary most closely related to x. Best matches can be identified by comparing evolutionary distances [1], which in turn are usually obtained from sequence alignments [2]. In practice, fast approximation algorithms such as blast and its successors are often used for this purpose [3,4]. Even if sequence similarity is measured perfectly, deviations from a common molecular clock, i.e., differences in the evolutionary rates of different genes, cause discrepancies between best hits (most similar sequences) and best matches (evolutionary most closely related sequences), see [5] for a detailed discussion.
The idea of best matches in the sense of closest evolutionary relatedness pre-supposes an underlying tree T that describes the phylogenetic relationships among the genes, which correspond to the leaves of T, and a map σ assigning to each gene x the species σ (x) in which it resides. Given such a leaf-colored tree (T , σ ) , the best match graph G(T , σ ) has as its vertex set the leaves of T, i.e., the set of genes, and as (directed) arcs the best matches. The latter are defined as the pairs (x, y) for which the last common ancestor of x and y is at least as close to x as the last common ancestor of x and any other gene y ′ from the same species σ (y ′ ) = σ (y) . Best match graphs (BMGs), i.e., digraphs that are derived from a leaf-labeled tree (T , σ ) in this manner (cf. Defs. 1 and 2 below), form a very restrictive class of colored digraphs [6,7]. Empirically determined best hit data therefore will in general not satisfy the defining properties of BMGs. They can be corrected in part, however, by considering the problem of editing a given digraph to the closest BMG. We refer to this problem as BMG editing in line with the usual terminology to describe graph editing problems by the target class of graphs. Importantly, the input digraph will not be a BMG in general. In [8], it was shown that the arc modification problems for BMGs are NP-hard, but can be formulated as integer linear programs (ILPs) allowing practical solutions for small instances. However, in computational biology, applications to large gene families would be of particular interest, creating the need for faster, approximate solutions for BMG editing. Before embarking to develop software for a BMG-based analysis of large sequence data sets, we need to understand whether the editing problem for BMGs is tractable in practice with satisfactory accuracy and for interestingly large instances. The purpose of this contribution is to establish that this is indeed the case.
Motivated by both theoretical and practical considerations, we are mainly interested in heuristics that are consistent in the following sense: Let A be an algorithm that takes an arbitrary vertex-colored digraph ( G, σ ) as input and outputs a BMG A( G, σ ) . Then A is consistent if A( � G, σ ) = ( � G, σ ) whenever the input digraph ( G, σ ) is a BMG. Such an algorithm can be obtained trivially by adding an initial check whether the input is a BMG or not. All algorithms described here, however, are designed such that their edit-operations leave BMGs unchanged.
BMGs can be characterized in terms of their set of socalled informative (rooted) triples R( G, σ ) defined on the set of genes. These are binary trees with three leaves that can be easily extracted from a vertex-colored digraph (cf. Def. 6). In particular, R( G, σ ) is consistent for a BMG ( G, σ ) , i.e., there is a common supertree for all triples in R( G, σ ) . A formal definition of triple consistency will be provided in the next section. In [7], it was shown that a vertex-colored digraph ( � G = (V , E), σ ) is a BMG if and only if (a) the set of informative triples R( G, σ ) is consistent and (b) the BMG � G(T , σ ) of the corresponding so-called Aho tree T := Aho (R( � G, σ ), V ) coincides with ( G, σ ) . In general, the Aho tree Aho (R, V ) of a consistent set of triples R on a set V is a least resolved supertree of all the triples in R . However, there are wellknown examples for which the Aho tree is neither the unique least resolved supertree for R nor the tree with a minimal number of vertices [9]. For a BMG, on the other hand, T := Aho (R( � G, σ ), V ) is the unique least resolved tree (LRT) that explains ( � G, σ ) = � G(T , σ ) [6]. See Fig. 1 for an illustrative example of the construction of the LRT Aho (R( G, σ ), V ) of a BMG ( G, σ ) using the BUILD algorithm for supertree construction [10]. These close connections between recognizing BMGs and constructing supertrees suggest to adapt ideas from heuristic algorithms for triple consistency problems and supertree construction for BMG editing.
The simplest approach, therefore, is to extract a maximal consistent subset R * from R( G, σ ) and to use the BMG � G( Aho (R * , V ), σ ) as an approximation, see Alg. 1 below. A more detailed analysis of arcs in ( G, σ ) that violate the property of being a BMG, however, will lead to a notion of "unsatisfiable relations" (UR), which can be used to count the arc modifications associated with a partition V of the vertex set V of G . It also gives rise to a top-down algorithm in which the vertex set of G is recursively edited and partitioned. A large class of heuristics for BMG editing can be constructed depending on the construction of the partition V in each recursion step. We shall see that the arc edit sets in different steps of the recursion are disjoint. A main result of this contribution, Thm. 23, links the partitions V appearing in BMG editing algorithms to the auxiliary graphs appearing in the BUILD algorithm. This provides a guarantee that the BMG editing algorithms are consistent provided the choice of V is such that it does not enforce edits whenever an alternative partition with an empty UR is available. For BMGs, this is in particular the case for the partitions appearing in the BUILD algorithm. We proceed to show by reduction from Set Splitting that finding a partition with a minimal number of unsatisfiable relations is NP-hard.
The theoretical results are complemented by computational experiments on BMGs with randomly perturbed arc sets. To this end, we compare the arcs sets of the editing results with those of the perturbed digraphs and, since they are known in simulations, of the original BMGs. We focus on a comparison of different algorithms to construct the partitions V . Somewhat surprisingly, we find that minimizing the cardinality of the UR alone is not the best approach, since this tends to produce very unbalanced partitions and thus requires a large number of steps in the recursions whose costs add up. Instead, certain types of clustering or community detection approaches that favor more balanced partitions tend to perform well in terms of the usual measures for digraph comparison such as the (absolute) symmetric difference, connected components of G are the maximal connected subgraphs of the undirected graph underlying G or, equivalently, the maximal strongly connected components of the symmetrized digraph. Whenever the context is clear, we will also refer to the partition of V formed by the vertex sets of the maximal connected subgraphs as the set of connected components.
A vertex coloring is a map σ : V → S , where S is a nonempty set of colors. A vertex coloring of G is proper if σ (x) = σ (y) whenever (x, y) ∈ E( � G) . We write ( G, σ ) for a vertex-colored digraph and denote by V[r] the subset of vertices of a graph ( � G = (V , E), σ ) that have color r. Moreover, we define σ (W ):={σ (x) | x ∈ W } for the subset of colors present in a set W ⊆ V .
We write N(x) for the set of out-neighbors of a vertex x ∈ V ( � G) and N − (x) for the set of in- for the symmetric difference of two sets A and B, and define, for a digraph Phylogenetic trees. Consider an undirected, rooted tree T with leaf set L(T ) ⊆ V (T ) and root ρ T ∈ V (T ) . Its inner vertices are given by the set V 0 (T ) = V (T ) \ L(T ) . The ancestor order on V(T) is defined such that u T v if v lies on the unique path from u to the root ρ T , i.e., if v is If xy is an edge in T such that y ≺ T x , then x is the parent of y and y the child of x. The set of children of a vertex x ∈ V (T ) is denoted by child T (x) . A tree is phylogenetic if all of its inner vertices have at least two children. All trees considered in this contribution will be phylogenetic. For a non-empty subset A ⊆ V (T ) , we define lca T (A) , the last common ancestor of A, to be the unique T -minimal vertex of T that is an ancestor of every u ∈ A . Following e.g. [11], we denote by T L ′ the restriction of T to a subset L ′ ⊆ L(T ) , i.e. T L ′ is obtained by identifying the (unique) minimal subtree of T that connects all leaves in L ′ , and suppressing all vertices with degree two except possibly the root ρ T L ′ = lca T (L ′ ) . We say that T displays or is a refinement of a tree T ′ , in symbols T ′ ≤ T , if T ′ can be obtained from a restriction T L ′ of T by a series of inner edge contractions. (T , σ ) is a leaf-colored tree if σ : L(T ) → S is a map from the leaves of T into a nonempty set of colors. We say that Rooted triples. A (rooted) triple is a tree on three leaves and with two inner vertices, and thus, it has a topology as the tree in Fig. 3(D). We write xy|z for the triple on the leaves x, y and z if the path from x to y does not intersect the path from z to the root in T, i.e., if lca T (x, y) ≺ T lca T (x, z) = lca T (y, z) . In this case we say that T displays xy|z. We write R |L ′ := xy|z ∈ R : x, y, z ∈ L ′ for the restriction of a triple set R to a set L ′ of leaves. A set R of triples is consistent if there is a tree T with leaf set L := T ′ ∈R L(T ′ ) that displays every triple in R . The polynomial-time algorithm BUILD decides for every triple set R whether it is consistent, and if so, constructs a particular tree, the Aho tree Aho (R, L) , that displays every triple in R [10]. The algorithm relies on the construction of an (undirected) auxiliary graph, the Aho graph, for a given triple set R on a set of leaves L. This graph, denoted by [R, L] , contains an edge xy if and only if xy|z ∈ R for some z ∈ L.
Bottom left: Insertion of the arc (a, b 2 ) creates a new informative triple ab 2 |b 3 ( ab 1 |b 2 gets lost) resulting in a connected Aho graph H ′ . Bottom right: Deletion of the arc (a, c 1 ) creates a new triple ac 2 |c 1 resulting in a connected Aho graph H ′′ Fig. 3 Example for a digraph (A) where Alg. 1 does not lead to an optimal BMG editing. The set R( G, σ ) is empty and thus consistent.
) and (C) its corresponding BMG. The two arcs (b, a) and (b, a ′ ) have been inserted. D A tree (T ′ , σ ) and (E) its corresponding BMG � G(T ′ , σ ) in which only the arc (b, a) has been inserted

Best match graphs
Best matches formalize the notion of the evolutionarily closest relative(s) of a gene x in another species. Relatedness in this context is thought of as a phylogenetic concept and thus expressed in terms of last common ancestors in the gene tree T that describes the evolutionary relationships among a family of genes.
As a consequence, best matches in a pair of species in general form a many-to-many relationship and are not necessarily symmetric. Given (T , σ ) , the digraph � G(T , σ ) = (V , E) with vertex set V = L(T ) , vertex-coloring σ , and with arcs (x, y) ∈ E if and only if y is a best match of x w.r.t. (T , σ ) is called the best match graph (BMG) of (T , σ ) [6], see Fig. 2 for an illustrative example.
We say that ( � G = (V , E), σ ) is an ℓ-BMG if |σ (V )| = ℓ . By construction, there is at least one best match of x for every color s ∈ σ (V ) \ {σ (x)}: Observation 3 For every vertex x and every color s = σ (x) in a BMG ( G, σ ) there is some vertex y ∈ N (x) with σ (y) = s . Equivalently, the subgraph induced by every pair of colors is sink-free.
In particular, therefore, BMGs are sink-free whenever they contain at least two colors. We formalize this basic property of BMGs for colored digraphs in general: be a colored digraph. The coloring σ is sink-free if it is proper and, for every vertex x ∈ V and every color s = σ (x) in σ (V ) , there is a vertex y ∈ N (x) with σ (y) = s . A digraph with a sink-free coloring is called sf-colored.
Given a tree T and an edge e, we denote by T e the tree obtained from T by contracting the edge e. We call an edge in (T , σ ) redundant (w.r.t. ( G, σ ) ) if both (T , σ ) and (T e , σ ) explain ( G, σ ). By [6,Thm. 8], every BMG has a unique least resolved tree (LRT). Moreover, a characterization of BMGs was given in [6] that makes use of a set of informative triples, which can be defined compactly as follows [12]: Then the set of informative triples is and the set of forbidden triples is For the subclass of BMGs that can be explained by binary trees, we will furthermore need By definition, a, b, b ′ must be pairwise distinct whenever . We extend the notion of consistency to pairs of triple sets in Definition 7 Let R and F be sets of triples. The pair (R, F) is called consistent if there is a tree T that displays all triples in R but none of the triples in F . In this case, we also say that T agrees with (R, F).
It can be decided in polynomial time whether such a pair (R, F) is consistent using the algorithm MTT, which was named after the corresponding mixed triplets problem restricted to trees and described in [13].
We continue with two simple observations concerning the restriction of triple sets. Since informative and forbidden triples xy|z are only defined by the presence and absence of arcs in the subgraph of G induced by {x, y, z} , this leads to the following Moreover, any pair of triples (R ′ , F ′ ) such that R ′ ⊆ R and F ′ ⊆ F for a consistent pair (R, F) remains consistent since any tree that agrees with (R, F) clearly displays all triples in R ′ and none of the triples in F ′ . Hence, we have Observation 9 Let R ′ ⊆ R and F ′ ⊆ F for a consistent pair of triple sets (R, F) . Then (R ′ , F ′ ) is consistent.
We summarize two characterizations of BMGs given in [7,Thm. 15] and [8,Lemma 3.4 and Thm. 3.5] in the following Proposition 10 Let ( G, σ ) be a properly colored digraph with vertex set L. Then the following three statements are equivalent: In this case, ( Aho (R( G, σ ), L), σ ) is the unique LRT for ( G, σ ) , and a leaf-colored tree (T , σ ) on L explains ( G, σ ) if and only if it agrees with (R( G, σ ), F( G, σ )).
Prop. 10 states that the set of informative triples R( G, σ ) of a BMG ( G, σ ) is consistent. Therefore, it can be used to construct its LRT by means of the BUILD algorithm, see Fig. 1 for an example.
It is important to note that both arc insertions and deletions may lead to creation and loss of informative triples. In particular, when starting from a BMG, both types of modifications have the potential to make the triple set inconsistent as the example in Fig. 2 shows. This is indeed often the case even for moderate disturbances of a BMG as we shall see later.
We expect that empirically estimated best match relations will typically contain errors that correspond to both arc insertions and deletions w.r.t. the unknown underlying "true" best match graph. This motivates the problem of editing a given vertex-colored digraph to a BMG:

Question:
Is there a subset Natural variants are ℓ-BMG Completion and ℓ-BMG Deletion where � G △ F is replaced by � G + F and � G − F , respectively, i.e., only addition or deletion of arcs is allowed. Both ℓ-BMG Editing and its variants are NPcomplete [8].
The heuristic algorithms considered in this contribution can be thought of as maps A on the set of finite vertex-colored digraphs such that A( G, σ ) is a BMG for every vertex-colored input digraph ( G, σ ) . In particular, the following property of such algorithms is desirable:

A simple, triple-based heuristic
The triple-based characterization summarized by Prop. 10 suggests a simple heuristic for BMG editing that relies on replacing the consistency checks for triple sets by the extraction of maximal sets of consistent triples (see Alg. 1). Both MaxRTC , the problem of extracting from a given set R of rooted triples a maximum-size consistent subset, and MinRTI, the problem of finding a minimum-size subset I such that R \ I is consistent, are NP-hard [15]. Furthermore, MaxRTC is APX-hard and MinRTI is �(ln n)-inapproximable [16]. However, because of their practical importance in phylogenetics, a large number of practically useful heuristics have been devised, see e.g. [17][18][19].
The heuristic Alg. 1 is not always optimal, even if MaxRTC /MinRTI is solved optimally. Fig. 3 shows an unconnected 2-colored digraph ( G, σ ) on three vertices that is not a BMG and does not contain informative triples. The BMG ( � G * , σ ) produced by Alg. 1 introduces two arcs into ( G, σ ) . However, ( G, σ ) can also be edited to a BMG by inserting only one arc.
A simple improvement is to start by enforcing obvious arcs: If v is the only vertex with color σ (v) , then by definition there must be an arc (x, v) for every vertex x = v . The computation then starts from the sets of informative triples of the modified digraph. We shall see below that these are the only arcs that can safely be added to G without other additional knowledge or constraints (cf. Thm. 19 below).

Locally optimal splits
Finding an optimal BMG editing of a digraph ( � G = (V , E), σ ) is equivalent to finding a tree (T , σ ) on V that minimizes the cardinality of is a BMG. However, finding a tree (T , σ ) that minimizes |U ( � G, T )| is intractable (unless P = NP ) since ℓ-BMG Editing, Problem 1 above, is NP-complete [8].
We may ask, nevertheless, if trees (T , σ ) on V contain information about arcs and non-arcs in ( G, σ ) that are "unambiguously false" in the sense that they are contained in every edit set that converts ( G, σ ) into a BMG. Denote by T V the set of all phylogenetic trees on V. The set of these "unambiguously false" (non-)arcs can then be expressed as Since there are in general exponentially many trees on V and thus, the problem of determining U * ( � G) seems to be quite challenging at first glance. We shall see in Thm. 19, however, that U * ( � G) can be computed efficiently. We start with a conceptually simpler construction, and consider the set of trees T (V) ⊆ T V for which the set of leaf sets of the children of the root equals the partition V . In other words, given V = {V 1 , . . . , V k } , then the root ρ T of every T ∈ T (V) has exactly k children v 1 , . . . , v k such that The set of (phylogenetic) trees T (V) is non-empty since |V| ≥ 2 in Def. 12. Moreover, by construction, Intriguingly, the set U ( G, V) , and thus the UR-cost c( G, V) , can be computed in polynomial time without any explicit knowledge of the possible trees to determine the set U ( G, V) . To this end, we define the three sets The proof of Lemma 13 relates the possible cases between V and the tree set T (V) in a straightforward manner. Since it is rather lengthy it is relegated to Appendix. Fig. 4 gives examples for all three types of unsatisfiable relations, i.e., for is already a BMG which, however, is not true in general.

Corollary 14
The set U ( G, V) can be computed in quadratic time.
Proof We first compute all numbers n i,A of vertices in V i with a given color A. This can be done in O(|V|) if we do not explicitly store the zero-entries. Now, σ (y) ∈ σ (V i ) , i.e. n i,σ (y) > 0 , can be checked in constant time, and thus, it can also be decided in constant time whether or not a pair (x, y) is contained in can also be decided in constant time. Checking all ordered pairs x, y ∈ V thus requires a total effort of O(|V | 2 ) .
Our discussion so far suggests a recursive top-down approach, made precise in Alg. 2. In each step, one determines a "suitably chosen" partition V and then recurses on the subgraphs of the edited digraph � G * △U ( � G * [V ′ ], V) . More details on such suitable partitions V will be given in Thm. 23 below. The parts in the algorithm highlighted in color can be omitted. They are useful, however, if one is also interested in a tree (T , σ ) that explains the editing result ( � G * , σ ) and to show that ( � G * , σ ) is indeed a BMG (see below). Alg. 2 is designed to accumulate the edit sets in each step, Line 5. In particular, the total edit cost and the scores c( � G * [V ′ ], V) are closely tied together, which follows from the following result: The proof of Lemma 15 and a technical result on which it relies can be found in the Appendix. As an immediate consequence of Lemma 15, we have

Corollary 16 The edit cost of Alg. 2 is the sum of the UR
It is important to note that the edits U ( � G * [V ′ ], V) must be applied immediately in each step (cf. Line 5 in Alg. 2). In particular, Lemma 15 and Cor. 16 pertain to the partitioning of the edited digraph � G * , not to the original digraph G . We continue by proving the correctness of Alg. 2, i.e., that it returns a valid BMG and a corresponding explaining tree.

Theorem 17
Every pair of edited digraph ( � G * , σ ) and tree T produced as output by Alg. 2 satisfies Proof By construction, the tree T is phylogenetic and there is a one-to-one correspondence between the vertices u ∈ V (T ) and the recursion steps, which operate on the sets V ′ = L(T (u)) . If |V ′ | ≥ 2 (or, equivalently, u is an inner vertex of T), we furthermore have chosen in that recursion step. In the following, we denote by ( � G * , σ ) the digraph during the editing process, and by ( G, σ ) the input digraph, i.e., as in Alg. 2. For brevity, we write E * for the arc set of the final edited digraph and E T :=E( � G(T , σ )).
Let us assume, for contradiction, that there exists (a) In either case, we set u:= lca T (x, y) and consider the recursion step on V ′ :=L(T (u)) with the corresponding partition Since (x, y) / ∈ E T and by the definition of best matches, there must be a vertex y ′ ∈ V x of color σ (y) such that lca T (x, y ′ ) ≺ T lca T (x, y) = u , and thus σ (y) ∈ σ (V x ) . Moreover, we have V x ∈ V , x ∈ V x and y ∈ V ′ \ V x . Two subcases need to be considered, depending on whether or not (x, y) is an arc in � G * at the beginning of the recursion step. In the first case, the arguments above imply that (x, y) ∈ U 1 ( � G * [V ′ ], V) , and thus, (x, y) ∈ U ( � G * [V ′ ], V) by Lemma 13. Hence, we delete the by the gray boxes). In the middle, the set of trees T (V) is illustrated, i.e., the triangles represent all possible phylogenetic trees on the respective subset of leaves. On the right, the arc modifications implied by V (i.e., U( G, V) ) are illustrated where U 1 , U 2 , and U 3 indicate the type according to Lemma 13 arc (x, y) in this step. In the second case, it is an easy task to verify that none of the definitions of , V) matches for (x, y). Since this step is clearly the last one in the recursion hierarchy that can affect the (non-)arc (x, y), it follows for both subcases that (x, y) / ∈ E * ; a contradiction.
and by the definition of best matches, there cannot be a vertex y ′ ∈ V x of color σ (y) such that Again, two subcases need to be distinguished depending on whether or not (x, y) is an arc in � G * at the beginning of the recursion step. In the first case, the arguments above make it easy to verify that none of the definitions Lemma 13. Hence, we insert the arc (x, y) in this step. As before, the (non-)arc (x, y) remains unaffected in any deeper recursion step. Therefore, we have (x, y) ∈ E * in both subcases; a contradiction.
Cor. 16 suggests a greedy-like "local" approach. In each step, the partition V is chosen to minimize the score c( G, V) in Line 4. The example in Fig. 5 shows, however, that the greedy-like choice of V does not necessarily yield a globally optimal edit set.
In order to identify arcs that must be contained in every edit set, we first clarify the relationship between the partitions P ≥2 on V and the partitions defined by the phylogenetic trees on V.
is a phylogenetic tree and has at least two leaves, we have In particular, T satisfies T ∈ T (V * ) for some V * ∈ P ≥2 , and is therefore contained in V∈P ≥2 T (V) .
Using Lemma 18 and given that |V | ≥ 2 , we can express the set of relations that are unsatisfiable for every partition as follows i.e., it coincides with the set of relations that are unsatisfiable for every phylogenetic tree, and thus part of every edit set. Note that U * ( � G) is trivially empty if |V | < 2 . We next show that U * ( � G) can be computed without considering the partitions of V explicitly.
Proof First note that |V | ≥ 2 ensures that P ≥2 � = ∅ . Moreover, since |V| ≥ 2 for any V ∈ P ≥2 , the sets T (V) are all non-empty as well. With the abbreviation Û ( � G) for the right-hand side of Eq. (5), we show that Suppose that (x, y) ∈Û ( � G) . Then x = y and V [σ (y)] = {y} imply that σ (x) = σ (y) . This together with the facts that (i) y is the only vertex of its color in V, and (ii) L(T ) = V for each T ∈ T (V) and any V ∈ P ≥2 implies that y is a best match of x in every such tree T, i.e. (x, y) ∈ E( � G(T , σ )) . Since in addition (x, y) / ∈ E by assumption, we conclude that (x, y) ∈ U * ( � G).
Observe that σ (x) = σ (y) (and thus x = y ) as a consequence of Def. 12 and the fact that ( G, σ ) and all BMGs are properly colored. If V = {x, y} and thus {{x}, {y}} is the only partition in P ≥2 , the corresponding unique tree T consists of x and y connected to the root. In this case, we clearly have (x, y) ∈ E( � G(T , σ )) since σ (x) = σ (y) . On the other hand, if {x, y} V , then we can find a partition V ∈ P ≥2 such that V i = {x, y} for some V i ∈ V . In this case, every tree T ∈ T (V) has a vertex v i ∈ child T (ρ T ) with the leaves x and y as its single two children. Clearly, (x, y) ∈ E( � G(T , σ )) holds for any such tree. In summary, there always exists a partition V ∈ P ≥2 such that (x, y) ∈ E( � G(T , σ )) for some tree T ∈ T (V) . Therefore, by (x, y) ∈ V∈P ≥2 U ( � G, V) and Def. 12, we conclude that (x, y) / ∈ E . In order to obtain (x, y) ) for all T ∈ T (V) and all V ∈ P ≥2 . Now assume, for contradiction, that there is a vertex y ′ � = y of color σ (y ′ ) = σ (y) . Since σ (x) = σ (y) , the vertices x, y, y ′ must be pairwise distinct. Hence, we can find a partition V ∈ P ≥2 such that V i = {x, y ′ } for some V i ∈ V . In this case, every tree T ∈ T (V) has a vertex v i ∈ child T (ρ T ) with only the leaves x and y ′ as its chil- σ )) ; a contradiction. Therefore, we conclude that y is the only vertex of its color in V, and hence, (x, y) ∈Û ( � G) . In summary, therefore, we have As a consequence of Thm. 19 and by similar arguments as in the proof of Cor. 14, we observe Corollary 20 The set U * ( � G) can be computed in quadratic time.
By Thm. 19, U * ( � G) contains only non-arcs, more precisely, missing arcs pointing towards a vertex that is the only one of its color and thus, by definition, a best match of every other vertex irrespective of the details of the gene tree. By definition, furthermore, U * ( � G) is a subset of every edit set for ( G, σ ) . We therefore have the lower bound for every V ∈ P ≥2 .
The following result shows that if ( G, σ ) is a BMG, then a suitable partition V can be chosen such that From |V | ≥ 2 and consistency of R , it follows that [R, V ] has at least two connected components [10], and thus, by construction, |V| ≥ 2 . Moreover, we clearly have T ∈ T (V) by the The two (isomorphic) induced subgraphs obtained by applying the locally optimal partition V . Each of them has a (global) optimal BMG editing cost of 4. Therefore, the overall symmetric difference of an edited digraph (using the initial split V as specified) comprises at least c( � G, V) + 2 · 4 = 11 arcs. C An optimal editing removes the 8 green arcs and results in a digraph that is explained by the tree in D. The optimality of this solution was verified using an implementation of the ILP formulation for BMG editing given in [8]  construction of T via BUILD. Together with U ( � G, T ) = ∅ , the latter implies U ( � G, V) = ∅ , and thus c( � G, V) = 0 .
Proof Set R:=R( � G, σ ) and F:=F( � G, σ ) for the sets of informative and forbidden triples of ( G, σ ) , respectively. Since ( G, σ ) is a BMG, we can apply Prop. 10 to conclude that (R, F) is consistent. Now we choose an arbitrary set V ′ ∈ V and set ( � This together with the fact that R |V ′ ⊆ R and F |V ′ ⊆ F and Obs. 9 implies that (R |V ′ , By Prop. 10, it remains to show that ( � G ′ , σ ′ ) is sf-colored to prove that it is a BMG. To this end, assume for contradiction that there is a vertex x ∈ V ′ and a color s ∈ σ (V ′ ) such that x has no out-neighbor of color s = σ (x) in V ′ . However, since the color s is contained in σ (V ) and ( G, σ ) is a BMG, and thus sf-colored, we conclude that there must be a vertex y ∈ V \ V ′ of color s such that (x, y) ∈ E . In summary, we obtain ( . Hence, Lemma 13 implies that (x, y) ∈ U ( � G, V) and, hence, c( G, V) > 0 ; a contradiction. Therefore, ( � G ′ , σ ′ ) must be sf-colored, which concludes the proof.
Lemma 21 and 22 allow us to choose the partition V in each step of Alg. 2 in such a way that Alg. 2 is consistent, i.e., BMGs remain unchanged.

Theorem 23
Alg. 2 is consistent if, in each step on V ′ with |V ′ | ≥ 2 , the partition V in Line 4 is chosen according to one of the following rules: Proof We have to show that the final edited digraph ( � G * , σ ) returned in Line 13 equals the input digraph ( � G = (V , E), σ ) whenever ( G, σ ) already is a BMG, i.e., nothing is edited. Thus suppose that ( G, σ ) is a BMG and first consider the top-level recursion step on V (where initially � G * = � G still holds at Line 1). If |V | = 1 , neither ( G, σ ) nor ( � G * , σ ) contain any arcs, and thus, the edit cost is trivially zero. Now suppose |V | ≥ 2 . Since ( G, σ ) is a BMG, Lemma 21 guarantees the existence of a partition V satisfying c( � G, V) = 0 , in particular, the connected components V Aho of the Aho graph [R( � G, σ ), V ] form such a partition. Hence, for both rules (1) and (2), we choose a partition V with (minimal) UR-cost Since we recurse on these subgraphs, we can repeat the arguments above along the recursion hierarchy to conclude that the UR-cost c( � G * [V ′ ], V ′ ) vanishes in every recursion step. By Cor. 16, the total edit cost of Alg. 2 is the sum of the UR-costs c( � G * [V ′ ], V ′ ) in each recursion step, and thus, also zero. Therefore, we conclude that we still have ( � G * , σ ) = ( � G, σ ) in Line 13.
By Thm. 23, Alg. 2 is consistent whenever the choice of V minimizes the UR-cost of V in each step. We shall see below that minimizing c( G, V) is a difficult optimization problem in general. Therefore, a good heuristic will be required for this step. This, however, may not guarantee consistency of Alg. 2 in general. The second rule in Thm. 23 provides a remedy: the Aho graph is chosen provided it has UR-cost zero. This procedure is effectively a generalization of the algorithm BUILD using as input the set of informative triples R( G, σ ) of a properly vertex-colored digraph ( G, σ ) . If ( G, σ ) is already a BMG, then the recursion in Alg. 2 is exactly the same as in BUILD: it recurses on the connected components of the Aho graph (cf. Prop. 10). We can summarize this discussion as

Corollary 24 ( G, σ ) is a BMG if and only if, in every step of the BUILD algorithm operating on
For recursion steps in which the Aho graph is connected, and possibly also in steps with non-zero UR-cost, another (heuristic) rule has to be employed. As a byproduct, we obtain an approach for the case that R( G, σ ) is consistent: Following BUILD yields the approximation G( Aho (R( G, σ ), V ( G)), σ ) as a natural choice.

Binary-explainable BMGs
Phylogenetic trees are often binary. Multifurcations are in many cases -but not always -the consequence of insufficient data [14,20,21]. It is therefore of practical interest to consider BMGs that can be explained by a binary tree: Correspondingly, it is of interest to edit a properly colored digraph to a beBMG, which translates to the following decision problem:

Question:
Is there a subset We call the corresponding completion and deletion problem ℓ-BMG CBEG and ℓ-BMG DBEG, respectively. As their more general counterparts, all three variants are NP-complete as well, cf. [8, Cor. 6.2] and [14,Thm. 5].
Since the recursive partitioning in Alg. 2 defines a tree that explains the edited BMG, see Thm. 17, it is reasonable to restrict the optimization of V in Line 17 to bipartitions. The problem still remains hard, however, since the corresponding decision problem (problem BPURC ) is NP-complete as shown in Thm. 30 below. Similar to BMGs in general, beBMGs have a characterization in terms of informative triples: σ ) is consistent. In this case, the BMG ( G, σ ) is explained by every refinement of the binary refinable tree ( Aho (R B ( G, σ ), V ), σ ). Since a beBMG ( G, σ ) is explained by every refinement of the Aho tree constructed from R B ( G, σ ) (cf. Prop. 26), we can obtain a slightly more general result. E), σ ) be a beBMG with |V | ≥ 2 and V be the connected components of the Aho
Observe that the tree (T , σ ):=( Aho (R B ( � G, σ ), V ), σ ) exists and explains ( G, σ ) by Prop. 26. Moreover, there is, by construction, a one-to-one correspondence between the children v i of its root ρ and the elements in V i ∈ V given by L(T (v i )) = V i . We construct a refinement (tree) T ′ of T as follows: Whenever we have multiple sets V i ∈ V that are subsets of the same set V j ∈ V ′ , we remove the edges ρv i to the corresponding vertices v i ∈ child T (ρ) in T, and collectively connect these v i to a newly created vertex w j . These vertices w j are then reattached to the root ρ . Since |V ′ | ≥ 2 by assumption, the so-constructed tree T ′ is still phylogenetic. Moreover, it satisfies V ′ = {L(T ′ (v)) | v ∈ child T ′ (ρ)} , and thus, T ′ ∈ T (V ′ ) . It is a refinement of T since contraction of the edges ρw j again yields T. Hence, we can apply Prop. 26 to conclude that (T ′ , σ ) also explains ( G, σ ) . It follows immediately that U ( � G, T ′ ) = ∅ . The latter together with T ′ ∈ T (V ′ ) implies U ( � G, V ′ ) = ∅ , and thus c( � G, V ′ ) = 0 .
We are now in the position to formulate an analogue of Thm. 23 for variants of Alg. 2 that aim to edit a properlycolored digraph ( G, σ ) to a beBMG. and moreover c( � G * [V ′ ], V Aho ) = 0 , then V is a coarse-graining of V Aho .

Theorem 29 Alg. 2 is consistent for beBMGs
Proof We have to show that the final edited digraph ( � G * , σ ) returned in Line 13 equals the input digraph ( � G = (V , E), σ ) whenever ( G, σ ) already is a beBMG, i.e., nothing is edited. Thus suppose that ( G, σ ) is a beBMG and first consider the top-level recursion step on V (where initially � G * = � G still holds at Line 1). If |V | = 1 , neither ( G, σ ) nor ( � G * , σ ) contain any arcs, and thus, the edit cost is trivially zero. Now suppose |V | ≥ 2 . Since ( G, σ ) is a beBMG, R B := R B ( � G, σ ) is consistent, and thus, the set of connected components V Aho of the Aho graph [R B , V ] has a cardinality of at least two. If |V Aho | = 2 , V:=V Aho is a bipartition satisfying c( � G, V) = 0 by Cor. 27. If |V Aho | > 2 , we can find an arbitrary bipartition V that is a coarse-graining of V Aho . By Lemma 28, V also satisfies c( � G, V) = 0 in this case. Hence, for both rules (1) and (2) Since we recurse on the subgraphs ( � G[V ′ ], σ |V ′ ) , which are again beBMGs, we can repeat the arguments above along the recursion hierarchy to conclude that the URcost c( � G * [V ′ ], V ′ ) vanishes in every recursion step. By Cor. 16, the total edit cost of Alg. 2 is the sum of the UR-costs c( � G * [V ′ ], V ′ ) in each recursion step, and thus, also zero. Therefore, we conclude that we still have ( � G * , σ ) = ( � G, σ ) in Line 13.

Minimizing the UR-cost c( G, V)
The problem of minimizing c( G, V) for a given properly colored digraph ( G, σ ) corresponds to the following decision problem.

Question:
Is there a (bi)partition V of V such that c( � G, V) ≤ k?
In the Appendix, we show that (B)PURC is NP-hard by reduction from Set Splitting, one of Garey and Johnson's [22] classical NP-complete problems.

Theorem 30 BPURC is NP-complete.
Thm. 23 suggests to consider heuristics for (B)PURC that make use of the Aho graph in the following manner: Plugging any algorithm of this type into Line 4 of Alg. 2 reduces the algorithm to BUILD if a BMG is used as input and thus guarantees consistency (cf. Prop. 10). We note, however, that the connected components of a disconnected Aho graph are not guaranteed to correspond to an optimal solution for (B)PURC in the general case.

Construction of test instances
We test the heuristics described below on ensembles of perturbed BMGs that are constructed as follows: We first generate leaf-colored trees (T , σ ) with a predefined number of vertices N and colors ℓ and then compute their BMGs G(T , σ ) . To construct the tree T, we start from a single vertex. We then repeatedly choose one of the existing vertices v randomly, and, depending on whether v is currently an inner vertex or a leaf, attach either a single or two new leaves to it, respectively. Hence, the number of leaves increases by exactly one and the tree remains phylogenetic in each step. We stop when the desired number N of leaves is reached. In the next step, colors are assigned randomly to the leaves under the constraint that each of the ℓ colors appears at least once. We note that trees created in this manner are usually not least resolved, and their BMGs are in general not binaryexplainable. Finally, we disturb these BMGs by inserting and deleting arcs according to a specified insertion and deletion probability, respectively. Since arcs between vertices of the same color trivially cannot correspond to best matches, we do not insert arcs between such vertices, i.e., the input digraphs for the editing are all properly vertexcolored digraphs. For the purpose of benchmarking the heuristics for the (B)PURC problem, we only retain perturbed BMGs ( G, σ ) with a connected Aho graph H :=[R( � G, σ ), V ( � G)] because the heuristics are not applied to instances with a disconnected Aho graph H. Depending on the insertion and deletion probabilities, we retained 93% to 100% of the initial sample, except in the case where arcs were only inserted to obtain a disturbed digraph. Here, the Aho graph H was connected in 60% of the initial sample. Thus, even moderate perturbation of a BMG introduces inconsistencies into the triple set R( G, σ ) and results in a connected Aho graph H in the majority of cases. As shown in Fig. 2, both arc insertions and deletions can cause triple inconsistencies. For better comparison, the same set of test instances is used for all of the methods described below.

Heuristics for (B)PURC
(B)PURC is a variation on graph partitioning problems. It seems reasonable, therefore, to adapt graph partitioning algorithms for our purposes.
MinCut. We solve the minimum edge cut problem for the connected undirected graph H, i.e., we want to find a bipartition V = {V 1 , V 2 } such that the number of edges between V 1 and V 2 is minimal in H. The problem can be solved exactly in polynomial time using the Stoer-Wagner algorithm [23]. Note, however, that the minimum edge cut in H will in general not deliver an optimal solution of (B)PURC .
Karger's algorithm is a randomized algorithm that, in its original form, also aims to find a minimum edge cut [24]. In brief, it merges vertices of the graph by randomly choosing and contracting edges, until only two vertices remain, which induce a bipartition V according to the vertices that were merged into them. By repeating this process a sufficient number of times, a minimum edge cut can be found with high probability. Here, we use the UR-cost c( G, V) instead of the size of the edge cut as objective function to select the best solution over multiple runs.
A simple greedy approach starts with Ties are broken at random. This produces |V | − 1 "locally optimal" bipartitions, from which the best one is selected.
Gradient walks. The space of all bipartitions V = {V 1 , V 2 } endowed with a "move set" and the objective function c( G, V) forms a fitness landscape. Here, we consider adjacency between bipartitions by moving one vertex from V 1 to V 2 or vice versa. Gradient walks [25], also called "gradient adaptive walk" [26] or "greedy adaptive walks" [27], form the discrete analog of gradient descent methods. We start with a random but balanced bipartition V = {V 1 , V 2 } and then repeatedly execute a move to an adjacent bipartition that maximally improves the objective function; a gradient walk stops when a local optimum is reached.
Louvain method. This method for community detection in graphs greedily optimizes the so-called modularity of a vertex partition V [28]. Its objective function is q(V) = W ∈V u,v∈W (a uv − d u d v /(2m)) , where a uv are the entries of the (possibly weighted) adjacency matrix of a graph H, d u = v a uv the vertex degrees, and m is the sum of all edge weights in the graph. This favors so-called communities or modules W that are highly connected internally but have only few edges between them. The Louvain method operates in two phases starting from the discrete partition V = {{u} | u ∈ V } . In the first phase, it repeatedly iterates over all vertices x and moves x into the community of one of its neighbors that leads to the highest gain in modularity as long as a move that increases q(V) can be found. The second phase repeats the first one on the weighted quotient graph H/V whose vertices are the sets of V and whose edge weights are the sum of the original weights between the communities. In addition to maximizing the modularity, we also investigate a variant of the Louvain method that moves vertices into the community of one of their neighbors if this results in a lower UR-cost c( G, V) , and otherwise proceeds analogously. We exclude the merging of the last two vertices to ensure that a non-trivial partition is returned. Since the Louvain method is sensitive to the order in which the vertex set is traversed, we randomly permute the order of vertices to allow multiple runs on the same input.
With the exception of the Stoer-Wagner algorithm for solving the minimum edge cut problem, all of these partitioning methods include random decisions. One may therefore run them multiple times and use the partition corresponding to the best objective value, i.e., the lowest UR-cost c( G, V) or the highest modularity. If not stated otherwise, we apply five runs for each of these methods in each recursion step (with a connected Aho graph) in the following analyses.

Heuristics for BMG editing
We will explore the performance of several variants of Alg. 1 and 2 for BMG editing. The variants of Alg. 2 correspond to using the heuristics for (B)PURC discussed above for processing a connected Aho graph , σ |V ′ ) in each step of the recursion. We note that Alg. 2 in combination with any of the heuristics for (B)PURC also serves as a heuristic for MaxRTC because the choice of the partition V in each recursion step determines a set of included triples xy|z, namely those for which x • In the Aho graphs, R B ( G, σ ) is used instead of R( G, σ ). • If we encounter a partition V of cardinality greater than two in some recursion step, we use a coarsegraining V ′ of V such that |V ′ | = 2 instead. This modification is necessary whenever itself has more than two connected components, and for the partitions with |V| ≥ 3 returned by the Louvain method.
By Thm. 29, this procedure is consistent for binaryexplainable BMGs. Thm. 29, moreover, guarantees some freedom in the choice of a coarse-graining V ′ = {V 1 , V 2 } whenever V is not a bipartition. We therefore aim to produce (locally) balanced trees in such situations, i.e., we seek to minimize the difference of |V 1 | and |V 2 | . Formally, this corresponds to the well-known Number Partitioning problem with the multiset {|V i | | V i ∈ V} as input. We use the efficient heuristic described in [30], which in general appears to yield very good solutions of the Number Partitioning problem [31].
To construct the second binary tree T * based on subset of triples R * ⊆ R B ( � G, σ ) that are displayed by T, we employ an analogous coarse-graining in an otherwise unmodified BUILD algorithm. We note, however, that one could incorporate more sophisticated approaches which e.g. use some greedy coarse-graining method based on the UR-cost.

Computational results
In this section, we compare different heuristics for the (B)PURC Problem and their performance in the context of BMG editing. Somewhat unexpectedly, but in accordance with Fig. 5, our results suggest that a good (or bad) performance of (B)PURC is not directly linked to a good (or bad) performance for BMG editing. Moreover, we find that, even for noisy data, all analyzed methods are able to capture the tree structure of the underlying "true" BMG at least to some extent. As we shall see, community detection approaches in combination with the UR-cost appear to be more promising for BMG editing than optimal solutions of (B)PURC alone.
In order to better understand the behavior of the repeated application of the partitioning heuristics of Alg. 2, it is instructive to consider not only the score but also the structure of partitions. We observe a strong tendency of some of the partitioning methods to produce single-leaf splits, i.e., (bi)partitions V in which at least one set W ∈ V is a singleton (i.e., |W | = 1 ). Single-leaf splits in general seem to have relatively low UR-costs. Further details on the propensity of the partitioning heuristics to produce single-leaf splits are given in Appendix C. and y are contained in one set of V while z is contained in another. Another way of expressing that same fact is that an approximation to MaxRTC is given by the subset R * ⊆ R( � G, σ ) of the informative triples of the input digraph ( G, σ ) that are displayed by the tree T constructed in Alg. 2. In particular, Alg. 2 together with the MinCut method has been described as a heuristic for MaxRTC in earlier work [16,17]. For comparison, we will also consider the following bottom-up approach as a component of Alg. 1: Best-Pair-Merge-First (BPMF) was described in [18], and constructs a tree from a set of triples R in a bottomup fashion. We use here a modified version introduced in [16]. BPMF operates similar to the well-known UPGMA clustering algorithm [29]. Starting with each vertex x ∈ V as its own cluster, pairs of clusters are merged iteratively, thereby defining a rooted binary tree with leaf set V. The choice of the two clusters to merge depends on a similarity score with the property that any triple xy|z with x, y, and z lying in distinct clusters S x , S y , and S z contributes positively to score(S x , S y ) and negatively to score(S x , S z ) and score(S y , S z ) . Since BPMF constructs the tree T from the bottom, it does not imply a vertex partitioning scheme that could be plugged into the top-down procedure of Alg. 2. Importantly, BPMF is not a consistent heuristic for MaxRTC , i.e. it does not necessarily recognize consistent triples sets. Hence, consistency in the application to BMG editing is also not guaranteed, see Fig. 15 in Appendix B for an example.
In summary, we have two distinct ways to obtain an edited BMG: We may take either 1 G(T , σ ) , where T is the output tree of Alg. 2 or BPMF, respectively, or 2 � G(T * , σ ) , where T * = Aho (R * , V ( � G)) is constructed from the consistent triple subset of triples R * . This corresponds to Alg. 1.
Somewhat surprisingly, the results in Fig. 7 below suggest that it is in general beneficial to extract the triple set R * and rerun the BUILD algorithm, i.e., to use � G(T * , σ ).

Heuristics for binary-explainable BMG Editing
In order to test the heuristics for the slightly different task of obtaining a binary-explainable BMG ( � G * , σ ) , we constructed a similar set of test instances. The only difference is that we ensured that T orig is binary by modifying the attachment procedure above such that in each growth step we only choose among the vertices that are currently leaves for attaching two new leaves. Thus, ( � G orig , σ ) = � G(T orig , σ ) is binary-explainable. The editing heuristics are analogous, with two straightforward modifications: Results for (B)PURC Fig. 6 suggests that the Simple Greedy approach is best suitable for the minimization of the UR-cost c( G, V) for any of the considered parameters for BMG disturbance. The Louvain method based on graph modularity (Louvain (m)) appears to have by far the worst performance which, moreover, quickly produces higher UR-costs with an increasing intensity of the perturbations.

Results for BMG editing
To assess the performance of the various heuristics for BMG editing, we consider the differences between the editing result (G * , σ ) from both the original BMG (G orig , σ ) and the perturbed input digraphs (G, σ ) . In Fig. 7, we summarize the absolute values of the symmetric differences of the arc sets d orig :=|E(G * ) △ E(G orig )| and d:=|E(G * ) △ E(G)| , respectively. These results are translated to usual normalized performance indicators (recall, precision, specificity, and accuracy; all defined in terms of the arc sets) in Fig. 8.
Comparing the distances d orig (blue boxplots) and d (green boxplots) of the editing result (G * , σ ) to original unperturbed BMG and the input digraph, resp., we find that, for the methods investigated here, on average d orig is smaller than d. This indicates that all methods are able to capture the underlying tree structure of the original BMG at least to some extent. The discrepancy between d orig and d tends to increase with the level of perturbation, a trend that is most pronounced for Louvain (c). This result is encouraging for practical applications of BMG modification to correcting noisy best match data, where the eventual goal is to obtain a good estimate of the underlying true BMG.
Intriguingly, the extraction of consistent informative triples R * from the reconstructed tree T and rerunning BUILD, i.e., using � G(T * , σ ) , in general improves the estimation results for the majority of methods. In particular, this increases the recall without a notable negative impact on precision and specificity (cf. Fig. 8). A better recall, corresponding to a higher proportion of correctly inferred arcs, is not surprising in this context, since this additional step in essence reduces the number of triples. We therefore expect the tree T * = Aho (R * , V ( � G)) to be on average less resolved than T. The BMGs of less resolved trees tend to have more arcs than BMGs of highly resolved tree (cf. [12,Lemma 8]). In good accordance with this prediction, BPMF, which shows a strong increase of recall, always constructs a binary, i.e., fullyresolved, tree T -whereas the corresponding tree T * in general is much less resolved.
Somewhat surprisingly, a good or bad performance for minimizing the UR-cost in individual steps apparently does not directly translate to the performance in the overall editing procedure. In particular, the modularitybased Louvain (m) method seems to be a better choice than the Simple Greedy approach. The methods MinCut and Karger do not seem to be suitable components for Alg. 2, with the exception of the case where perturbations are arc deletions only (Fig. 7, bottom row). Here, MinCut produces reasonable estimates that compare well with other methods. The bottom-up method for the MaxRTC problem BPMF also produces relatively good results. It appears to be robust at high levels of perturbation. For most of the parameter combinations, we obtain the best results with the UR-cost-based Louvain method (Louvain (c)). Here, we often observe a symmetric difference (w.r.t. the arcs sets) that is better than the difference between the original and the perturbed digraph. This trend is illustrated by the red median lines in Fig. 7 and 8 . Hence, we achieve two goals of BMG editing: (i) the resulting digraph ( � G * , σ ) is a BMG, i.e., it satisfies Def. 1, and (ii) it is closer to the original BMG than the perturbed digraph. We note that we observed similar trends across all investigated combinations for the numbers of leaves N (ranging from 10 to 40) and of colors ℓ ( ℓ < N ranging from 2 to at most 20).
Our results show that minimization of the UR-cost in each step is not the best approach to BMG editing because this often produces very unbalanced partitions. As a consequence, more recursion steps are needed in Alg. 2 resulting in higher accumulated number of arc edits. Figure 9 shows that better solutions to the BMG editing problem are not necessarily composed of vertex partitions with minimal UR-cost in each step. The perturbed digraph ( G, σ ) in Fig. 9 was obtained from the randomly simulated BMG ( G orig , σ ) as described above using equal insertion and deletion probabilities of 0.1. As an example, the partitions V 1 and V 2 as constructed by the MinCut and the Louvain (c) method in the first iteration step of Alg. 2 are shown as pink and green frames, respectively. MinCut produces a single-leaf split V 1 with an isolated vertex b 2 and UR-cost c( � G, a 1 ), (c 2 , a 1 ), (c 2 , b 1 )} , which corresponds to the connected components of the Aho graph H orig of the unperturbed BMG and thus identifies the split in the original tree (T , σ ) . Here, the correct partition V 2 has a strictly larger UR-cost than the misleading choice of V 1 . However, MinCut results in a higher total edit cost than Louvain (c) for ( G, σ ).
In order to account for the issue of unbalanced partitions, we performed a cursory analysis on maximizing a gain function rather than minimizing the UR-cost. In analogy to c( G, V) , we defined g( G, V) as the number of arcs and non-arcs that are satisfied by the BMGs of all trees in T (V) . Recapitulating the arguments in the proof of Lemma 13, one can show that these relations can also be determined as the union of three sets by replacing " (x, y) ∈ E " with " (x, y) / ∈ E " and vice versa in the definitions of U 1 ( G, V) , U 2 ( G, V) , and U 3 ( G, V) . The gain function g( G, V) can be used instead of the UR-cost with Karger, Simple Greedy, Gradient Walk, and in a gainfunction-based Louvain method. For all these algorithms, however, maximizing g( G, V) leads to partitions that appear to be too balanced, and a performance for BMG editing that is worse than the use of the UR-cost. A possible explanation for both unbalanced and too balanced partitions as produced with a cost and gain function, resp., is the fact that U 1 ( G, V) and U 2 ( G, V) (and their gain function counterparts) contain pairs of vertices (x, y) that lie in distinct sets of V . Hence, both single-leaf splits and perfectly balanced partitions minimize (maximize, resp.) the number of pairs that could potentially be contained in these arc sets.
All methods for BMG editing were implemented and compared using Python on an off-the-shelf laptop (Intel Core TM i7-4702MQ processor, 16 GB RAM, Ubuntu 20.04, Python 3.7). They are available as a Python library at https:// github. com/ david-schal ler/ bmg-edit. Figure 10 summarizes the running times. The right panel shows that all methods appear to scale polynomially in the size |V| of the vertex set of the input digraph. The methods that explicitly rely on the UR-cost are much slower than the other methods. We suspect that this is largely due to the repeated O(|V ′ | 2 )-computation of c( G, V) whenever a vertex is moved between the sets/communities in V . This could possibly be improved by an incremental algorithm.

Results for binary-explainable BMG Editing
The results for beBMG editing in essence recapitulate the observations for general BMG editing, see Fig. 11. Alg. 2 in combination with Louvain (c) appears to be the best choice for the majority of parameter combinations. However, it is outperformed by the BPMF heuristic at high levels of perturbation (insertion and deletion probability 0.2). As in the general case, construction of T * and using ( � G * , σ ):= � G(T * , σ ) as editing result appears to be advantageous. Moreover, the difference of the editing result ( � G * , σ ) to the original beBMG ( G orig , σ ) is on average smaller than the difference of ( � G * , σ ) to the perturbed digraph ( G, σ ).

Real-life data
Assessing the performance of BMG editing for real-life data is not a trivial task because no reliable gold standard data sets are available. Moreover, we expect that software pipelines for best match inference would benefit from a pre-processing step that eliminates systematic errors arising as a consequence of unequal evolution rates in different branches of a gene family [5]. The implementation of such a pipeline is beyond the scope of this contribution. Nevertheless, we include a brief analysis of a small set of eubacterial genomes to obtain a first impression of the practical applicability of the methods described in this contribution. In particular, we provide an empirical justification for the level of error introduced in the simulated digraphs.
We consider the genomes of the eleven species of Aquificales species that have been studied previously in [32]. Starting from the protein sequences of these species, we use ProteinOrtho [33] to obtain estimates for best matches. ProteinOrtho is a tool for orthology inference that, in a first step, constructs a digraph ϒ on the set of genes/proteins from all species. The digraph contains an arc (x, y) whenever x and y stem from distinct species X and Y, respectively, and the sequence of y is among the most similar sequences in Y to that of x. Sequence similarity is measured here in terms of bitscores obtained from the hits in an all-versus-all comparison of the sequences using blast or a fast local alignment tool. Moreover, cutoffs for the E-value and the sequence identify are set to avoid an overabundance of spurious hits. Here, we use the default settings of the current ProteinOrtho version, i.e., DIAMOND [34] is employed for the all-versus-all comparison and we only include hits with an E-value smaller than 10 −5 and a pairwise sequence identity exceeding 25%. To better account for the fact that multiple best matches of x in species Y are possible, we choose a more inclusive relative threshold f = 0.8 for the bitscores (as compared to the default value f = 0.95 ), i.e., an arc (x, y) is included in ϒ whenever the respective hit reaches a bitscore of at least 0.8 times the bitscore of the best hit of x in species Y. To infer orthologous genes, i.e., related genes that arose from speciation events, ProteinOrtho proceeds to construct the symmetric part ϒ of ϒ followed by spectral clustering to eliminate false orthology edges [33]. Since we are interested in the directed best match graph rather than orthology, we use ϒ as an estimate of the best match relation and forego the rest of the pipeline. The resulting digraph ϒ for the Aquificales data set comprises 16630 vertices and 2001 (weakly) connected components. The distribution of the order of these components is shown in Fig. 12. We obtained 9 components with more than 100 vertices (with a maximum of 775 vertices), which are not included in the plot. We use each connected component ( G, σ ) of ϒ , where σ is determined by the species to which a gene/protein belongs, as input for our editing heuristics. We distinguish three size classes (bins) as indicated in Fig. 12 to discuss the results. It is worth noting that in particular the large components may still be composites of genes that are not true homologs but only share certain protein domains. For routine applications, additional data preprocessing steps are advisable. Gene family classification, i.e., the clustering of proteins in families [36], is itself by no means a completely solved problem in computational biology. Figure 13 summarizes the number of arc differences between the editing results and the input digraphs. As in the simulations, the input digraphs are in general not valid BMGs. Indeed, we found that only 5.9% in bin (I) comprising small instances and none of the digraphs in bins (II) and (III) were BMGs. The comparison of the different methods recapitulates the results on simulated data. Differences in the performance are on average more prominent for input digraphs with more vertices. The methods BPMF and Louvain (c) again show good results. However, here, the simple MinCut heuristic is also among the best-performing methods. As before, the editing can be slightly improved by additionally constructing T * and its BMG � G(T * , σ ) . Comparing e.g. the top panel in Fig. 7 ( |V ( � G)| = 30 for all digraphs) and bin (II) in Fig. 13 shows that the number of arc edits performed by the heuristics is comparable for simulated and real-life data. Even though the true BMG and thus the amount of errors is not known in the Aquificales data set, this suggests that simulation results provide a realistic view of BMG editing in real-life applications.

Summary and discussion
In this contribution, we have described a large class of heuristics for BMG editing that operate in a recursive top-down fashion to (at least implicitly) construct a tree (T , σ ) capturing the underlying BMG-structure of an arbitrary input digraph ( G, σ ) . We have shown that this is closely related to a specific notion of locally good edits, which we assess using the UR-cost. The UR-cost counts the minimum number of arc insertions and deletions of the BMG-editing for ( G, σ ) that are linked to each inner node (and thus to their corresponding leaf partitions) in (T , σ ) and cannot be reversed in subsequent recursion steps. In particular, we showed that an optimal solution among all possible partitions guarantees consistency of this class of heuristics (cf. Thm. 23 and 29). Unfortunately, the corresponding problem BPURC is itself NP-complete.
We therefore suggested a number of approximation methods for finding suitable partitions, and compared their performances in the context of Alg. 2. We find that, even though good solutions for (B)PURC alone do not seem to be the most adequate approach, the value of the UR-costs appears most clearly in a combination with a method for community detection, more precisely, a modification of the Louvain method [28].  For all of the methods investigated here, we found that the Aho graph H :=[R( � G, σ )[V ′ ], V ′ ] serves as a useful starting point for finding a suitable partition. This choice is based on the idea that, due to the properties of BMGs and in particular the construction of the tree (T , σ ) from informative triples of the BMG ( � G, σ ) = � G(T , σ ) , arc insertions and deletions in ( G, σ ) should not add too many new edges between the connected components of the originally disconnected Aho graph of R( G, σ ) (cf. Fig. 2). Therefore, we suggest that there is a correlation between good partitions V of V ′ , i.e. partitions linked to few edits, and the minimization of the number of edges in H connecting vertices in distinct sets of V.
For the general BMG editing problem, we did not make use of the information contained in the set of forbidden triples F( G, σ ) of the input digraph ( G, σ ) . It might be possible to adapt the algorithm MTT [13], which identifies consistent pairs (R, F) , instead of BUILD. MTT constructs a coarse-graining V MTT of the set of connected components of the Aho graph (on R ) in order to account for the forbidden triples in F in each recursion step. Possibly, V MTT (or some suitable graph representation) yields a further improvement. However, in case of beBMG editing, the extended triple set R B ( G, σ ) and thus the corresponding Aho graphs by construction already cover the information contained in F( G, σ ) . Since no substantial improvement over the general case was observed in this case (cf. Fig. 11), we opted against more detailed benchmarking of V MTT in comparison to partitions based on the Aho graph.
Another source of information that we have not considered here are values of confidence for best matches  Performance of the BMG editing heuristics for the Aquificales data, measured as the number of arc differences between input digraphs and editing results. As above, the light green indicates the "direct" performance of each method, i.e., the digraph G(T , σ ) where T is the tree that is directly constructed by each method, whereas the dark green indicates the results if the methods are used as heuristic for MaxRTC in Alg. e.g. derived from pairwise sequence alignment. These can naturally be provided as weights assigned to the arcs of an input digraph ( � G = (V , E), σ ) . Since every informative triple ab|b ′ ∈ R( � G, σ ) stems from an arc (a, b) ∈ E (and a "non-arc" (a, b ′ ) / ∈ E ), such arc weights can be propagated to the triples in R( G, σ ) . Assuming that larger weights mean higher confidence, a simple approach to weighting the edges xy in an Aho graph H :=[R( � G, σ )[V ′ ], V ′ ] is to sum up the weights of all triples xy|z ∈ R( � G, σ )[V ′ ] that induce the edge xy ∈ E(H) . Now the heuristics for finding a (bi)partition V in Alg. 2 may operate on this weighted graph H. Methods such as the Stoer-Wagner algorithm [23] and the (modularity-based) Louvain method [28] natively support weighted graphs. If the UR are used to find V , the situation appears more complicated. The cost contributed by an arc (x, y) ∈ U 1 ( � G, V) ⊆ E can be set to the weight of (x, y) in ( G, σ ) . However, since U 2 ( G, V) and U 3 ( G, V) contain elements that are not in E, one also would require weights for these "non-arcs". Since, in the simplest case, "non-arcs" correspond to pairs of sequences that do not reach a certain similarity threshold (see e.g. [35]), the task of assigning weights to them does not seem trivial. In any case, whether incorporating such values of confidence is helpful needs to be assessed in practice.
The purpose of this contribution is to establish a sound theoretical foundation for practical approaches to BMG editing and to demonstrate that the problem can be solved for interestingly large instances at reasonable accuracy. In computational biology, however, much larger problems than the ones considered here would also be of interest. Less emphasis has been placed here on computational efficiency and scalability of different algorithmic variants. We leave this as topic for future research. Given the performance advantage of community detection over minimization of the UR-cost in each step, it seems most promising to focus on community detection methods that scale well for very large systems. The Louvain method seems to be a promising candidate, since it has been applied successfully to large networks in the past [28]. This is largely due to the fact that the change of modularity in response to moving a vertex between modules can be computed efficiently. We suspect that a comparably fast computation of the UR-cost may also be possible; this does not appear to be trivial, however. Moreover, the method could probably be accelerated by moving vertices into the community of the first neighbor such that this results in a (not necessarily optimal) improvement of the UR-cost. A similar randomization approach has already shown to only slightly affect the clustering quality in terms of modularity [37].
Since the restriction of a (be)BMG to a subset of colors is again a (be)BMG, it may also be possible to remove large parts of the noise by editing induced subgraph on a moderate number of colors, possibly using information of the phylogeny of the species to select species (= color) sets. Presumably, color sets with sufficient overlaps will need to be considered. A systematic analysis of this idea, however, depends on scalable BMG editing for large instances and goes beyond the scope of this contribution.
A potential shortcoming of the empirical analysis above is the simplistic error model, i.e., the independent perturbation of arcs (and non-arcs). Better models will depend on the investigation of BMGs derived from reallife sequence data. Such data is often burdened with systematic errors arising e.g. from the fact that a common ancestry often cannot be detected for very large evolutionary distances and from unequal mutation rates during the evolution of gene families, see e.g. [5,38,39] for more in-depth discussions of these issues. Benchmarking using real-life data, however, is a difficult task because the ground truth is unknown and large, well-curated data sets are not available. Our results so far suggest that a good performance w.r.t. the input digraph is also an indicator for a good performance w.r.t. the true digraph (cf. Figs. 7 and 8, green vs. blue boxplots). Moreover, they at least suggest that realistic BMG data can be processed with sufficient accuracy and efficiency to make BMGs an attractive alternative to classical phylogenetic methods. We indeed obtained promising results in a first application of our editing heuristics to the protein-coding genes of eleven Aquificales species. The construction of bioinformatics workflows to process best hit data, e.g. at the first processing stage of ProteinOrtho [35], is a logical next step.