Algorithms for the quantitative Lock/Key model of cytoplasmic incompatibility

Cytoplasmic incompatibility (CI) relates to the manipulation by the parasite Wolbachia of the reproduction of its host. Despite its widespread occurrence, the molecular basis of CI remains unclear and theoretical models have been proposed to understand the phenomenon. We consider in this paper the quantitative Lock-Key model, which currently represents a good hypothesis that is consistent with the data available. CI is in this case modelled as the problem of covering the edges of a bipartite graph with the minimum number of chain subgraphs. This problem is already known to be NP-hard, and we provide an exponential algorithm with a non-trivial complexity. Depending on the dataset, there may be many optimal solutions, which can be biologically quite different from one another. Relying on a single optimal solution may therefore be problematic. For this reason, we address the problem of enumerating (listing) all minimal chain subgraph covers of a bipartite graph and show that it can be solved in quasi-polynomial time. Interestingly, in order to solve the above problems, we also considered the problem of enumerating all the maximal chain subgraphs of a bipartite graph and improved on the current results in the literature for the latter. Finally, to demonstrate the usefulness of our methods, we show an application on a real dataset.


Introduction
Wolbachia is an intracellular bacterium that infects numerous arthropod species. It is transmitted vertically through the host's eggs and is known for frequently influencing the reproductive development and behaviour of its host. In particular, the transmission of Wolbachia is promoted via a mechanism known as cytoplasmic incompatibility (CI). CI occurs when a Wolbachia infected male host crosses with a female that is either uninfected, or is infected by another Wolbachia strain. In this case, the cross is unsuccessful and the offspring does not survive. In this way, CI gives a reproductive advantage to the infected females by reducing the reproductive success of uninfected females (for a review on this phenomenon, see for example [1]). An example illustrating CI is provided in Fig. 1. It is a mechanism induced not only by Wolbachia but it is also observed in other unrelated bacteria such as for example Cardinium hertigii [2,3]. CI has attracted much attention also for its potential use in biological control, i.e. the introduction of parasites, predators, and pathogens with the purpose to reduce or suppress pest populations [4].
Despite the widespread occurrence of CI, its molecular basis remains unclear and theoretical models have been proposed to understand the phenomenon. The general model assumes the existence of a toxin, deposited by the bacterium in the sperm, which leads to the death of the zygote unless it is neutralised by an antitoxin deposited by the bacteria present in the egg [5]. A more concrete model is the quantitative Lock-Key model which assumes that the toxin and antitoxin are distinct molecules and that the success of the cross depends on the quantity of each of them present in the eggs and sperm (see for example [6]). This model currently represents the best hypothesis and is consistent with the data available (see e.g. [7,8]). In [6,9], the cytoplasmic compatibility relationships that are observed in a given dataset are modelled as a bipartite graph with males and females in different partitions of the graph and edges representing an unsuccessful crossing. The aim is to find the minimum number of different Lock/Key molecules that explain the observed data. This is modelled as finding the minimum number of chain subgraphs (i.e. graphs that do not contain a $2K_2$ as induced subgraph) that cover the edges of a given bipartite graph [6,9]. Moreover, as different minimum (resp. minimal) covers may correspond to solutions that differ in terms of their biological interpretation, the capacity to enumerate all such minimal chain covers becomes crucial. More formally, in this paper, we address the problem of enumerating without repetitions all maximal edge induced chain subgraphs of a bipartite graph. If there is no ambiguity, from now on we will refer to them simply as chain subgraphs, omitting the wording "edge induced".
Open Access. Algorithms for Molecular Biology. Calamoneri et al. Algorithms Mol Biol (2020) 15:14. *Correspondence: blerina.sinaimeri@inria.fr, Inria Grenoble, 655, Avenue de l'Europe, 38334 Montbonnot, France.
The problem of enumerating in bipartite graphs all subgraphs with certain properties has already been considered in the literature. These concern for instance maximal bicliques for which polynomial delay enumeration algorithms in bipartite [10,11] as well as in general graphs [11,12] were provided. In the case of maximal node induced chain subgraphs, their enumeration can be done in total polynomial time as it can be reduced to the enumeration of a particular case of the minimal hitting set problem [13] (where the sets in the family are of cardinality 4). However, the existence of a polynomial delay algorithm for this problem remains open. We recall that an enumeration algorithm is said to be output polynomial or total polynomial if the total running time is polynomial in the size of the input and the output. It is said to be polynomial delay if the time between the output of any solution and the next one is bounded by a polynomial function of the input size [14].
Regarding the problem of enumerating maximal edge induced chain subgraphs in bipartite graphs, in [15] the authors deal with it in the form of enumerating minimal interval order extensions of interval orders (see "Chain graphs and interval orders" section for the relation between these two problems). In this paper, we improve this result by proposing a polynomial space and polynomial delay algorithm to enumerate all maximal chain subgraphs of a bipartite graph. We also provide an analysis of the time complexity of this algorithm in terms of the input size. In order to do this, we prove some upper bounds on the maximum number of maximal chain subgraphs of a bipartite graph G with n nodes and m edges. This is also of intrinsic interest as combinatorial bounds on the maximum number of specific subgraphs in a graph are difficult to obtain and have received a lot of attention (see e.g. [16,17]).
We then address a second related problem called the minimum chain subgraph cover problem that, for a given graph G, asks to determine the minimum number of chain subgraphs that cover all the edges of G. This has already been investigated in the literature as it is related to other well-known problems such as the maximum induced matching (see e.g. [18,19]). For bipartite graphs, the minimum chain subgraph cover problem is shown to be NP-hard [20]; it is also known that for some special subclasses of bipartite graphs (e.g. convex bipartite graphs or chordal bipartite graphs), the problem can be solved in polynomial time in the size of the graph [19,21]. Nevertheless, bipartite graphs that represent cytoplasmic incompatibility are in general neither convex nor chordal (see for example the graph defined by the incompatibility matrix in [6]). Moreover, to the best of our knowledge, no special structural properties are known for CI bipartite graphs and hence we cannot apply these types of results.
Calling m the number of edges in the graph, we provide an exact exponential algorithm which runs in time $O^*((2+\varepsilon)^m)$ (by $O^*$, we denote the standard big O notation but omitting polynomial factors) by combining our results on the enumeration of maximal chain subgraphs with the inclusion-exclusion technique [22]. Notice that, since a chain subgraph cover is a family of subsets of edges, the existence of an algorithm whose complexity is close to $2^m$ is not obvious. Indeed, the basic search space would have size $2^{2^m}$, which corresponds to all families of subsets of edges of a graph on m edges. Finally, we approach the problem of enumerating all minimal covers by chain subgraphs. To this purpose, we provide a total quasi-polynomial time algorithm to enumerate all minimal covers by maximal chain subgraphs of a bipartite graph. To do so, we prove that this can be polynomially reduced to the enumeration of the minimal set covers of a hypergraph.
To show the usefulness of our algorithms, we implemented Algorithm 1 and applied it to the Culex pipiens dataset [6,9]. We show that our method allows us to identify solutions that are better than the ones presented in the literature, in the sense that they require fewer Lock and Key molecules to explain the data.
The remainder of the paper is organised as follows. In "Preliminaries" section, we give some definitions and preliminary results that will be used throughout the paper. In "Modeling cytoplasmic incompatibility" section, we explain in more detail the CI model in terms of a graph problem. "Enumerating all maximal chain subgraphs" section provides a polynomial delay algorithm to enumerate all maximal chain subgraphs in a bipartite graph G and "Upper bounds on the number of maximal chain subgraphs" section presents an upper bound on their maximum number. We use the latter result to further establish the input-sensitive complexity of the enumeration algorithm. In "Minimum chain subgraph cover" section, we detail the exact algorithm for finding the size of a minimum chain cover in bipartite graphs, and in "Enumeration of minimal chain subgraph covers" section we exploit the connection of this problem with the minimal set cover of a hypergraph to show that it is possible to enumerate in quasi-polynomial time all minimal covers by maximal chain subgraphs of a bipartite graph. "Chain graphs and interval orders" section deals with the interpretation of the results in "Enumerating all maximal chain subgraphs" and "Minimum chain subgraph cover" sections in the context of poset and interval poset dimension, two problems which are deeply related. In "A case study" section we show an application of our method to a real dataset. Finally, we conclude with some open problems in "Conclusions and open problems" section.

Preliminaries
Throughout the paper, we assume that the reader is familiar with the standard graph terminology, as contained for instance in [23]. We consider finite undirected graphs without loops or multiple edges. For each of the graph problems in this paper, we let n denote the number of nodes and m the number of edges of the input graph.
Given a bipartite graph $G = (U \cup W, E)$ and a node $u \in U$, we denote by $N_G(u)$ the set of nodes adjacent to u in G and by $E_G(u)$ the set of edges incident to u in G. Moreover, given $U' \subseteq U$ and $W' \subseteq W$, we denote by $G[U', W']$ the subgraph of G induced by $U' \cup W'$.
A bipartite graph is a chain graph if it does not contain a $2K_2$ as an induced subgraph. Equivalently, a bipartite graph is a chain graph if and only if for each two nodes $v_1$ and $v_2$ both in U (resp. in W), it holds that either $N_G(v_1) \subseteq N_G(v_2)$ or $N_G(v_2) \subseteq N_G(v_1)$. Note that this means that the nodes of U (resp. of W) can be linearly ordered, say $v_1, \dots, v_n$, such that $N_G(v_1) \subseteq N_G(v_2) \subseteq \dots \subseteq N_G(v_n)$. Given a chain subgraph $C = (X \cup Y, F)$ of G, we say that a permutation $\pi$ of the nodes of U is a neighbourhood ordering of C if $N_C(u_{\pi(1)}) \subseteq N_C(u_{\pi(2)}) \subseteq \dots \subseteq N_C(u_{\pi(|U|)})$.
Observe that if $X \subset U$, the sets $N_C(u_{\pi(1)}), \dots, N_C(u_{\pi(l)})$ for some integer $l \le |U|$ may be empty and, in case C is connected, $l = |U| - |X|$. By the largest neighbourhood of C, we mean the neighbourhood of a node x in X for which the set $N_C(x)$ has maximum cardinality.
In this paper, we always consider edge induced chain subgraphs of a graph G. Hence, here a chain subgraph C of G is identified with its edges $E(C) \subseteq E(G)$ while its set of nodes is constituted by all the nodes of G incident to at least one edge in C. Since the edges of a chain graph characterize it, we sometimes abuse the notation by writing: $C \setminus E(D)$, with D a subgraph of G, to denote the chain graph induced by the edges $E(C) \setminus E(D)$; $C \subseteq E(D)$ or equivalently $C \subseteq D$ to say that C is an edge-induced subgraph of D; and $e \in C$ to mean that $e \in E(C)$.
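As an illustration of the definitions above, the following Python sketch tests whether an edge-induced subgraph is a chain graph by checking that the neighbourhoods of the nodes on one side are nested. The function name `is_chain_graph` and the representation of a graph as a collection of (u, w) pairs are assumptions made here for illustration only.

```python
def is_chain_graph(edges):
    """Return True if the bipartite graph induced by `edges` has no induced 2K_2.

    `edges` is an iterable of (u, w) pairs with u in U and w in W
    (a hypothetical representation chosen for this sketch).
    """
    neighbours = {}
    for u, w in edges:
        neighbours.setdefault(u, set()).add(w)
    # Neighbourhoods are pairwise comparable exactly when, sorted by size,
    # each one is contained in the next one.
    ordered = sorted(neighbours.values(), key=len)
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


path = {("u1", "w1"), ("u2", "w1"), ("u2", "w2")}  # a path on four nodes
two_k2 = {("u1", "w1"), ("u2", "w2")}              # two disjoint edges
print(is_chain_graph(path), is_chain_graph(two_k2))
```

The path is a chain graph because the neighbourhood of u1 is contained in that of u2, whereas the two disjoint edges form exactly the forbidden $2K_2$.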
A maximal chain subgraph C of a given bipartite graph G is a connected chain subgraph such that no superset of E(C) is a chain subgraph. We denote by $\mathcal{C}(G)$ the set of all maximal chain subgraphs in G.
A set of chain subgraphs $C_1, \dots, C_k$ is a cover for G if $\bigcup_{1 \le i \le k} E(C_i) = E(G)$. Observe that, given any cover of G by chain subgraphs $\mathcal{C} = \{C_1, \dots, C_k\}$, there exists another cover of the same size $\mathcal{C}' = \{C'_1, \dots, C'_k\}$ whose chain subgraphs are all maximal; more precisely, for each $i = 1, \dots, k$, $C'_i$ is a maximal chain subgraph of G and $C'_i$ admits $C_i$ as subgraph. In order to avoid redundancies, from now on, although not explicitly highlighted, we will restrict our attention to the covers by maximal chain subgraphs.
We denote by $\mathcal{S}(G)$ the set of all minimal chain covers of a bipartite graph G.

Modeling cytoplasmic incompatibility
In [6,9], the cytoplasmic compatibility relationships that are observed in a given dataset are represented as a binary $n_1 \times n_2$ matrix C with the males in rows and females in columns. Entry $C[i,j] = 0$ represents a compatible cross (offspring survival) between male i and female j, while $C[i,j] = 1$ represents an incompatible cross. Under the quantitative Lock/Key model (see Section Quantitative Model in [9]) the unknown infections with Wolbachia strains are represented as an $n_1 \times k$ matrix L and an $n_2 \times k$ matrix K that describe the Lock and Key factors carried by the host males and females, respectively. Matrices L and K contain integer values and for each entry $L[i,l] = q$ (resp. $K[j,t] = q$), the value q indicates that the Lock molecule l is found in quantity q in male i (resp. the Key molecule t is found in quantity q in female j). A value of q equal to 0 indicates the absence of the molecule in the host. The pattern observed in C can be explained by the matrices K and L in the following way: $C[i,j] = 0$, i.e. the crossing between male i and female j is successful, if and only if female j has enough Key molecules to "open" all the Lock molecules. More formally,
Definition 1 Given an incompatibility $n_1 \times n_2$ matrix C, an $n_1 \times k$ matrix L and an $n_2 \times k$ matrix K, we say that L, K explain C if the following holds: $C[i,j] = 0 \iff L[i,l] \le K[j,l]$ for all $l \in \{1, \dots, k\}$.
In a parsimonious context, the goal is, given a matrix C, to find two matrices L, K that explain all the crosses in C and have a minimum number of columns k.
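The condition of Definition 1 can be checked mechanically. The Python sketch below assumes that "enough Key molecules" means that, for every Lock/Key pair l, the quantity of Key in the female is at least the quantity of Lock in the male; this reading, the function name `explains` and the list-of-lists representation are assumptions made for illustration.

```python
def explains(C, L, K):
    """Check whether the Lock matrix L and the Key matrix K explain the matrix C.

    C is n1 x n2 (0 = compatible, 1 = incompatible), L is n1 x k, K is n2 x k.
    Assumption of this sketch: a cross is compatible exactly when
    L[i][l] <= K[j][l] for every molecule index l.
    """
    k = len(L[0])
    for i, row in enumerate(C):
        for j, entry in enumerate(row):
            compatible = all(L[i][l] <= K[j][l] for l in range(k))
            if compatible != (entry == 0):
                return False
    return True


C = [[0, 1],
     [0, 0]]
print(explains(C, [[1], [0]], [[1], [0]]))  # one Lock/Key pair suffices here
print(explains(C, [[0], [0]], [[1], [0]]))  # no Lock at all cannot explain C[0][1] = 1
```

In the first call, male 0 carries one unit of Lock, which female 0 (one unit of Key) can open and female 1 (no Key) cannot, matching the matrix C.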
This problem has been formulated in [24] in terms of graphs. Matrix C can be seen as the adjacency matrix of a bipartite graph B(C) = (U ∪ W , E) , with males in U and females in W and edges representing the incompatible crosses. We include this formulation here for the completeness of the paper.

Lemma 1 Given an incompatibility $n_1 \times n_2$ matrix C, the bipartite graph B(C) is a chain graph if and only if there exist an $n_1 \times 1$ matrix L and an $n_2 \times 1$ matrix K that explain C.
Proof We start by showing the reduction between these two representations. Let C be an incompatibility $n_1 \times n_2$ matrix and assume first that $B(C) = (U \cup W, E)$ is a chain graph. Let $U = \{u_1, \dots, u_{n_1}\}$ and $W = \{w_1, \dots, w_{n_2}\}$. By definition of a chain graph, we can assume that the nodes of U are linearly ordered such that $N_G(u_1) \subseteq N_G(u_2) \subseteq \dots \subseteq N_G(u_{n_1})$. Note that for all $i < j$ it holds that $|N_G(u_i)| \le |N_G(u_j)|$, hence it is possible to group the nodes of U into classes $B_0, B_1, \dots, B_d$ such that a node $u_i \in U$ belongs to $B_r$ (with $r \le d$) if and only if $|N_G(u_i)| = r$. Note that some of the $B_i$ can be empty and that $B_0$ contains all the isolated nodes (if any) of U. If $u_i, u_j \in B_r$ then $N(u_i) = N(u_j)$, hence we extend the notion of neighbourhood to classes and write $N(B_r) = N(u)$ for some $u \in B_r$.
We show that one pair Lock/Key is sufficient to explain the matrix C observed. To this purpose we construct the matrices L and K that explain C as follows: for all u i ∈ B j we assign Intuitively, we assign to a node u ∈ U a quantity of the Lock molecule that depends on its degree. Also all the nodes of W in the neighbourhood of u should have a smaller quantity such that the cross results incompatible. Notice that the same reduction works in the opposite direction.
The lemma follows by the chain of equivalences:
An example that illustrates the connection between incompatibility matrix, bipartite graph and chain graph is depicted in Fig. 2. Given the incompatibility matrix in Fig. 2a, the corresponding bipartite graph is constructed in Fig. 2b. Recall that an edge corresponds to an incompatible cross. Applying the procedure described in the proof of Lemma 1, we can explain the dataset with only one pair of Lock/Key molecules. The assignment of the quantities of the Lock and Key molecules to the males and females is done as described before and is depicted in Fig. 2d. It is not difficult to check that the cross between male i and female j is successful if and only if the female has enough of the Key molecule for the Lock molecule of the male.
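The idea behind Lemma 1 can be sketched in Python as follows. The quantities chosen below (Lock equal to the degree of the male, Key equal to one less than the smallest degree among the males incompatible with the female) are one assignment consistent with the intuition given in the proof, not necessarily the exact values used there. The compatibility criterion $L[i] \le K[j]$ and the names `one_pair_lock_key` and `explains_one_pair` are likewise assumptions of this sketch.

```python
def one_pair_lock_key(C):
    """Build single-column matrices L and K from an incompatibility matrix C.

    Illustrative assignment (an assumption of this sketch):
    L[i] = number of females incompatible with male i (its degree in B(C)),
    K[j] = (smallest degree among males incompatible with female j) - 1,
           or n2 if female j is compatible with every male.
    """
    n1, n2 = len(C), len(C[0])
    degree = [sum(row) for row in C]
    L = [[degree[i]] for i in range(n1)]
    K = []
    for j in range(n2):
        incompatible = [degree[i] for i in range(n1) if C[i][j] == 1]
        K.append([min(incompatible) - 1 if incompatible else n2])
    return L, K


def explains_one_pair(C, L, K):
    """Compatible (entry 0) exactly when the Lock quantity does not exceed the Key quantity."""
    return all((L[i][0] <= K[j][0]) == (C[i][j] == 0)
               for i in range(len(C)) for j in range(len(C[0])))


chain_matrix = [[1, 1, 1],
                [1, 1, 0],
                [1, 0, 0],
                [0, 0, 0]]   # rows have nested sets of 1s, so B(C) is a chain graph
two_k2_matrix = [[1, 0],
                 [0, 1]]     # B(C) is a 2K_2, not a chain graph

L, K = one_pair_lock_key(chain_matrix)
print(L, K, explains_one_pair(chain_matrix, L, K))
L2, K2 = one_pair_lock_key(two_k2_matrix)
print(explains_one_pair(two_k2_matrix, L2, K2))
```

On the chain matrix the single pair explains every cross, while on the $2K_2$ matrix this assignment fails, in line with the statement that one Lock/Key pair corresponds to a chain graph.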
The straightforward outcome of the lemma is that the incompatibility matrix corresponding to any chain subgraph can be explained by exactly one pair of Lock/Key molecules. Hence, the next theorem follows.

Theorem 1
Given an incompatibility n 1 × n 2 matrix C, there exist an n 1 × k matrix L and an n 2 × k matrix K that explain C if and only if the bipartite graph B(C) has an edge cover with k chain subgraphs.
The previous theorem motivates the problems we study in this paper.

Enumerating all maximal chain subgraphs
In this section, we provide a polynomial delay algorithm for enumerating all the maximal chain subgraphs of a given bipartite graph. We start by proving the following result.
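Proposition 1 Let $C = (X \cup Y, F)$ be a chain subgraph of $G = (U \cup W, E)$ and let $x \in X$ be a node with largest neighbourhood in C. Then C is a maximal chain subgraph of G if and only if both the following conditions hold:
(i) $N_C(x) = N_G(x)$ is a maximal neighbourhood of G;
(ii) $C \setminus E_G(x)$ is a maximal chain subgraph of $G[U \setminus \{x\}, N_G(x)]$.
Proof (⇒) Let C be a maximal chain subgraph of G. If (i) did not hold, there would exist a node $x' \in U$ (possibly x itself) such that $N_C(x) \subset N_G(x')$, so we can add to C all the edges incident to $x'$ and still obtain a chain subgraph, thereby contradicting the maximality of C.
Suppose now that (i) holds but (ii) does not. Then there exists a chain subgraph $C'$ such that $C \setminus E_G(x) \subset C' \subseteq G[U \setminus \{x\}, N_G(x)]$. By adding to each one of the previous graphs the edges in $E_G(x)$, we have that the strict inclusion is preserved because the added edges were not present in any one of the three graphs. Since $C'$ with the addition of $E_G(x)$ is still a chain subgraph with $N_G(x)$ as its largest neighbourhood, we reach a contradiction with the hypothesis that C is maximal in G.
(⇐) We show that if both (i) and (ii) hold, then the chain subgraph C of G is maximal. Suppose by contradiction that C is not maximal in G, and let $C'$ be a chain subgraph of G such that $C \subset C'$. Let x be the node with the largest neighbourhood in C. It follows from (i) that $N_C(x) = N_G(x)$ is a maximal neighbourhood of G, hence the largest neighbourhood of $C'$ (and of C by the hypothesis). This also implies that C and $C'$ differ in some node different from x, so that $C \setminus E_G(x) \subset C' \setminus E_G(x)$, where $C' \setminus E_G(x)$ is still a chain subgraph because we simply removed node x and all its incident edges. We then get a contradiction with (ii).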
Proposition 1 leads us to design Algorithm 1, which efficiently enumerates all maximal chain subgraphs of G. It exploits the fact that, in each maximal chain subgraph, a node u whose neighbourhood is largest also has a maximal neighbourhood in G (part (i) of Proposition 1) and that this holds recursively in the chain subgraph obtained by removing node u and restricting the graph to $N_C(u)$ (part (ii) of Proposition 1). To compute the maximal neighbourhood nodes, the algorithm uses function computeCandidates that, given sets U and W, for each maximal neighbourhood $Y \subseteq W$, returns a unique node u, called candidate, for which $N_G(u) = Y$. This means that in case of twins, function computeCandidates extracts only one representative node according to some fixed order on the nodes (e.g. the node with the smallest label). If the graph has no edges, the function returns the empty set.
Proposition 2 Algorithm 1 outputs all the maximal chain subgraphs of the input bipartite graph G, without repetitions.
Proof Let $G = (U \cup W, E)$ be a bipartite graph. We prove the correctness of Algorithm 1 by induction on |U|, i.e. we show that all the solutions are output, without repetitions. When $|U| = 1$, let u be the only node in U. We have that $N_G(u)$ is the only neighbourhood in W, and line 3 returns {u} as unique candidate. In line 9, the algorithm reduces the graph of interest. In line 10, the whole $E_G(u)$ is added to the current chain subgraph C. Then the function is called recursively with $U' = \emptyset$, so the condition at line 4 is true and C is printed; it is in fact the only chain subgraph of G, it is trivially maximal and there are no repetitions. Correctness then follows when $|U| = 1$. Assume now that $|U| = k$ with $k > 1$. As inductive hypothesis, let the algorithm work correctly when $|U| \le k - 1$.
For each candidate u, the algorithm recursively calls the same function on a reduced subgraph and, by the inductive hypothesis, outputs all maximal chain subgraphs of this reduced subgraph without repetitions. By Proposition 1, if we add to each one of these chain subgraphs the node u and all the edges incident to u in G[U, W], we get a different maximal chain subgraph of G since each maximal chain subgraph has one and only one maximal neighbourhood and the function computeCandidates returns only one representative node. Recall that in the case of twin nodes the algorithm will always consider the nodes in a precise order and so no repetition occurs. Moreover, iterating this process for all candidates guarantees that all maximal chain subgraphs are enumerated and none is missed.
Proposition 3 Algorithm 1 has delay $O(n^2 m)$ and total running time $O(|\mathcal{C}(G)|\, n^2 m)$.
Proof Represent the computation of Algorithm 1 as a tree of the recursive calls of enumerateMaximalChain, each node of which stores the current graph on which the recursion is called at line 11. Of course, the root stores G and on each leaf the condition Candidates == ∅ is true and a new solution is output. Observe that each leaf contains a feasible solution, and that no repetitions occur in view of Proposition 2, so the number of leaves is exactly $|\mathcal{C}(G)|$. Since at each call the size of U is reduced by one, the tree height is necessarily bounded by $|U| = O(n)$; moreover, on each tree node, $O(nm)$ time is spent for running function computeCandidates.
It follows that, since the algorithm explores the tree in DFS fashion starting from the root, between two solutions the running time is at most $O(n^2 m)$ and the total running time is $O(|\mathcal{C}(G)|\, n^2 m)$.
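The recursion suggested by Proposition 1 can be sketched in Python as follows. This is a minimal illustration written from the description above, not a transcription of Algorithm 1: the names `maximal_chain_subgraphs`, `compute_candidates` and `recurse`, the adjacency-dictionary input and the use of a generator are all assumptions, and no attempt is made to match the stated delay bound.

```python
def maximal_chain_subgraphs(adj):
    """Yield the maximal chain subgraphs of a bipartite graph as frozensets of edges.

    `adj` maps each node of U to the set of its neighbours in W
    (hypothetical input format chosen for this sketch).
    """
    adj = {u: frozenset(ws) for u, ws in adj.items()}

    def compute_candidates(U, W):
        # One representative (smallest label) for each maximal non-empty
        # neighbourhood in the current graph G[U, W].
        nbr = {u: adj[u] & W for u in U}
        seen, candidates = set(), []
        for u in sorted(U):
            if not nbr[u] or nbr[u] in seen:
                continue
            if any(nbr[u] < nbr[v] for v in U):  # strictly contained: not maximal
                continue
            seen.add(nbr[u])
            candidates.append(u)
        return candidates, nbr

    def recurse(U, W, chain):
        candidates, nbr = compute_candidates(U, W)
        if not candidates:
            yield chain
            return
        for u in candidates:
            # Add all edges of u, remove u, and restrict W to the neighbourhood of u.
            yield from recurse(U - {u}, nbr[u], chain | {(u, w) for w in nbr[u]})

    all_w = frozenset().union(*adj.values())
    yield from recurse(frozenset(adj), all_w, frozenset())


# Complement of a perfect matching on 3 + 3 nodes: u_i is adjacent to w_j iff i != j.
antimatching = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
solutions = list(maximal_chain_subgraphs(antimatching))
print(len(solutions))
```

Each recursive call fixes the node with the largest neighbourhood of the chain subgraph under construction, which is why twins need only one representative per call: the other twin becomes the unique candidate at the next level.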

Upper bounds on the number of maximal chain subgraphs
In this section, we give two upper bounds on the maximum number of maximal chain subgraphs of a bipartite graph G with n nodes and m edges. The first bound is given in terms of n while the second depends on m. These bounds are of independent interest; however, we will use them in two directions. First, they will allow us to determine an (input-sensitive) time complexity of Algorithm 1. Indeed, in Proposition 3, we proved that the total running time of Algorithm 1 is of the form $O(D(n) \cdot |\mathcal{C}(G)|)$, where D(n) is the delay of the algorithm and $|\mathcal{C}(G)|$ is the number of maximal chain subgraphs of G. Thus, a bound on $|\mathcal{C}(G)|$ leads to a bound on the running time of Algorithm 1 depending on the size of the input. Second, the bound on $|\mathcal{C}(G)|$ in terms of edges allows us to compute the time complexity of an exact exponential algorithm for the minimum chain subgraph cover problem in "Minimum chain subgraph cover" section.

Bound in terms of nodes
The following lemma claims that a given permutation is the neighbourhood ordering of at most one maximal chain subgraph.
Lemma 2 Let $C_1$ and $C_2$ be two maximal chain subgraphs of $G = (U \cup W, E)$ and let $\pi_1$ (resp. $\pi_2$) be a neighbourhood ordering of $C_1$ (resp. $C_2$). Then, $\pi_1 = \pi_2$ implies $C_1 = C_2$.
Proof The proof proceeds by induction on the number of nodes of U.
If |U | = 1 then G has only one maximal chain subgraph and the result trivially holds.
As a corollary, the maximum number of maximal chain subgraphs of a graph $G = (U \cup W, E)$ is bounded by $|U|!$. Since the same reasoning can be applied on W, we have that $|\mathcal{C}(G)| \le |W|!$ and hence: $|\mathcal{C}(G)| \le \min(|U|, |W|)! \le \lfloor n/2 \rfloor!$. This bound is tight as shown by the following family of graphs that reaches it.
Consider the antimatching graph with n nodes $A_n = (U \cup W, E)$ defined as the complement of an n/2 edge perfect matching, i.e.: $U = \{u_1, \dots, u_{n/2}\}$, $W = \{w_1, \dots, w_{n/2}\}$ and $E = \{(u_i, w_j) : i \neq j\}$. It is not difficult to convince oneself that $A_n$ has exactly $(n/2)!$ maximal chain subgraphs and that a different permutation corresponds to each of them. In particular, for each permutation $\pi$ of the nodes of U, the corresponding maximal chain subgraph $C_\pi$ of $A_n$ can be defined by means of the set of neighbourhoods as follows: $N_{C_\pi}(u_{\pi(i)}) = \{w_{\pi(1)}, \dots, w_{\pi(i-1)}\}$ for $i = 1, \dots, n/2$. $C_\pi$ is a chain subgraph since all the neighbourhoods form a chain of inclusions. Moreover, it is maximal since if we added to the neighbourhood of $u_i$ any one of the missing edges $(u_i, w_j)$ with $\pi^{-1}(j) > \pi^{-1}(i)$, we would introduce a $2K_2$ with the existing edge $(u_j, w_i)$ as $(u_j, w_j)$ and $(u_i, w_i)$ are not in E.
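The correspondence between permutations and chain subgraphs of the antimatching can be checked on a small instance. The Python sketch below assumes that $C_\pi$ gives the node in position i of the permutation the neighbours $w_j$ for all j placed before it in the permutation; this reading of the construction, and the names `is_chain` and `chain_from_permutation`, are assumptions. The check only confirms that the subgraphs are chain subgraphs of $A_n$ and pairwise distinct; it does not verify maximality.

```python
from itertools import permutations
from math import factorial


def is_chain(edges):
    """True if the neighbourhoods of the left-hand nodes are nested."""
    nbr = {}
    for u, w in edges:
        nbr.setdefault(u, set()).add(w)
    ordered = sorted(nbr.values(), key=len)
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


def chain_from_permutation(pi):
    """Edge (i, j) stands for (u_i, w_j); u_i sees every w_j placed before it in pi."""
    position = {node: p for p, node in enumerate(pi)}
    return frozenset((i, j) for i in pi for j in pi if position[j] < position[i])


half = 4  # n/2, i.e. the antimatching A_8
subgraphs = {chain_from_permutation(pi) for pi in permutations(range(half))}
print(len(subgraphs) == factorial(half), all(is_chain(s) for s in subgraphs))
```

Distinct permutations give distinct edge sets because the degree of $u_i$ in $C_\pi$ equals the position of i in the permutation.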

Bound in terms of edges
Let T(m) be the maximum number of maximal chain subgraphs over all bipartite graphs with m edges. After two preliminary lemmas, we prove that $T(m) \le 2^{\sqrt{m}\log m}$.

Lemma 3 Let $G = (U \cup W, E)$ be a connected bipartite graph with m edges. Then $|\mathcal{C}(G)| \le |U| \cdot T(m - |W|)$.
Proof In view of how the algorithm works and of Proposition 1, at the beginning, there are at most |U| candidates. For each candidate x, we can build as many chain subgraphs as there are in $G[U \setminus \{x\}, N_G(x)]$. We claim that this latter graph has at most $m - |W|$ edges. Indeed, in order to construct $G[U \setminus \{x\}, N_G(x)]$, we remove from G exactly $|E_G(x)|$ edges when deleting x from U, and $|W| - |N_G(x)|$ nodes (each one connected to at least a different edge as G is connected) when reducing W to $N_G(x)$. Observing that $|E_G(x)| = |N_G(x)|$, in total we remove at least |W| edges. It is not difficult to see that T(m) is increasing with m. Hence, the proof follows.
By the next lemma, we have that the maximum over n of the auxiliary function $\frac{n}{2} \cdot 2^{\sqrt{m - \frac{n}{2}}\,\log(m - \frac{n}{2})}$ is reached when n/2 is minimum (note that trivially for a bipartite graph we have $n/2 \ge \sqrt{m}$).

Lemma 4 The real-valued function
Proof The derivative of F(x) is given by:

while for
x ≥ 0 we have: and: We are now able to prove the main theorem of this subsection:

Theorem 2 Let $G = (U \cup W, E)$ be a bipartite graph with n nodes and m edges; then $|\mathcal{C}(G)| \le 2^{\sqrt{m}\log m}$.
Proof Assume w.l.o.g that |U | ≤ |W | . The proof is by induction on m. Note that for m = 1 the theorem trivially holds. Applying the inductive hypothesis and Lemma 3, we have: considering that B < 1 and 1/2 < A ≤ 1 since: By this bound on the number of maximal chain subgraphs we trivially obtain an input-sensitive bound on the time complexity for Algorithm 1:

Minimum chain subgraph cover
In this section, we show how to find in polynomial space the minimum size of a chain subgraph cover in time $O^*((2+\varepsilon)^m)$, for every $\varepsilon > 0$. Since a chain subgraph cover is a family of subsets of edges, the existence of an algorithm whose complexity is close to $2^m$ is not obvious. Indeed the basic search space has size $2^{2^m}$, as it corresponds to a family of subsets of edges. To obtain this result, we exploit Algorithm 1, the bound obtained in Theorem 2 and the inclusion/exclusion method [16,22] that has already been successfully applied to exact exponential algorithms for many partitioning and covering problems.
We first express the problem as an inclusion-exclusion formula over the subsets of edges of G. Exploiting this result, we can design an exact algorithm which counts the number of chain subgraph covers of size k with a time complexity given in the following theorem: Where the last step follows by recalling that G is connected and thus n = O(m) .
We conclude by observing that the size of a minimum chain cover is given by the smallest value of k for which $c_k(G) \neq 0$.
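The inclusion-exclusion idea can be illustrated on an explicit family of sets. The Python sketch below uses the standard formula for counting covers, $\sum_{A \subseteq E} (-1)^{|A|} a(A)^k$, where $a(A)$ is the number of sets of the family that avoid A. It counts ordered k-tuples (repetitions allowed), takes the family as an explicit list and is not polynomial-space in the sense of this section; the names `count_covers` and `minimum_cover_size` are hypothetical and the code is not the algorithm of the paper.

```python
from itertools import combinations


def count_covers(universe, family, k):
    """Number of ordered k-tuples of sets from `family` whose union is `universe`."""
    universe = list(universe)
    total = 0
    for size in range(len(universe) + 1):
        for avoided in combinations(universe, size):
            avoided = set(avoided)
            # a(A): how many sets of the family contain no element of A
            a = sum(1 for s in family if not (s & avoided))
            total += (-1) ** size * a ** k
    return total


def minimum_cover_size(universe, family):
    """Smallest k with a non-zero number of covers, or None if no cover exists."""
    for k in range(1, len(family) + 1):
        if count_covers(universe, family, k) != 0:
            return k
    return None


universe = {1, 2, 3}
family = [frozenset({1, 2}), frozenset({2, 3}), frozenset({3})]
print(count_covers(universe, family, 1), count_covers(universe, family, 2))
print(minimum_cover_size(universe, family))
```

In the chain cover setting, the universe would be the edge set of G and the family would be its maximal chain subgraphs, so the outer sum ranges over the $2^m$ subsets of edges.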

Enumeration of minimal chain subgraph covers
In this section, we prove that the enumeration of all minimal chain subgraph covers can be polynomially reduced to the enumeration of the minimal set covers of a hypergraph. This reduction implies that there is a quasi-polynomial time algorithm to enumerate all minimal chain subgraph covers. Indeed, the result in [25] implies that all the minimal set covers of a hypergraph can be enumerated in time $N^{\log N}$ where N is the sum of the input size (i.e. $n + m$) and of the output size (i.e. the number of minimal set covers).
Let $G = (U \cup W, E)$ be a bipartite graph, $\mathcal{C} = \mathcal{C}(G)$ the set of all maximal chain subgraphs of G and $\mathcal{S} = \mathcal{S}(G)$ the set of minimal chain subgraph covers of G. Notice that the minimal chain subgraph covers of G are the minimal set covers of the hypergraph $\mathcal{H} = (V, \mathcal{E})$ where $V = E$ and $\mathcal{E} = \mathcal{C}$. Unfortunately, the size of $\mathcal{H}$ might be exponential in the size of G plus the size of $\mathcal{S}$. Indeed, not every maximal chain subgraph in $\mathcal{C}$ will necessarily be part of some minimal chain subgraph cover. To obtain a quasi-polynomial time algorithm to enumerate all minimal chain subgraph covers, we need to enumerate only those maximal chain subgraphs that belong to a minimal chain subgraph cover.
Given an edge $e \in E$, let $\mathcal{C}_e$ be the set of all maximal chain subgraphs of G containing e.
We call an edge $e \in E$ non-essential if there exists another edge $e' \in E$ such that $\mathcal{C}_{e'} \subset \mathcal{C}_e$. An edge which is not non-essential is said to be essential. Note that for every non-essential edge e, there exists an essential edge $e_1$ such that $\mathcal{C}_{e_1} \subset \mathcal{C}_e$. Indeed, by applying iteratively the definition of a non-essential edge, we obtain a list of inclusions $\mathcal{C}_e \supset \mathcal{C}_{e_1} \supset \mathcal{C}_{e_2} \supset \dots$, where no $\mathcal{C}_{e_i}$ is repeated as the inclusions are strict. The last element of the list will correspond to an essential edge.
The following lemma claims that if a maximal chain subgraph C contains at least one essential edge, then it belongs to at least one minimal chain subgraph cover.
Lemma 5 Let C be a maximal chain subgraph of a bipartite graph G = (U ∪ W , E) . Then C belongs to a minimal chain subgraph cover of G if and only if C contains an essential edge.
Proof (⇒) Let C belong to a minimal chain subgraph cover M and assume that C contains no essential edge. Given $e \in C$, e therefore being non-essential, there exists an essential edge $e'$ such that $\mathcal{C}_{e'} \subset \mathcal{C}_e$. Moreover, $e' \notin C$. As M is a cover, there exists $C' \in M$ such that $e' \in C'$. Thus, $C' \neq C$, $C' \in \mathcal{C}_{e'} \subset \mathcal{C}_e$, hence $e \in C'$. Since for every edge $e \in C$, there exists $C' \in M$ containing it, we have that $M \setminus \{C\}$ is a cover, contradicting the minimality of M.
(⇐) Assume C contains an essential edge e. Let $\mathcal{C}' = \{D \in \mathcal{C}(G) : e \notin D\}$. Note that $\mathcal{C}' = \mathcal{C} \setminus \mathcal{C}_e$. We show that $\mathcal{C}' \cup \{C\}$ is a cover. Suppose on the contrary that there exists $e' \in E \setminus E(C)$ and $e'$ is not covered by $\mathcal{C}'$ and thus $\mathcal{C}_{e'} \cap \mathcal{C}' = \emptyset$. This implies that $\mathcal{C}_{e'} \subseteq \mathcal{C} \setminus \mathcal{C}' = \mathcal{C}_e$ and as e is essential, we obtain $\mathcal{C}_{e'} = \mathcal{C}_e$ from which we deduce that $e' \in C$. Thus, $M = \mathcal{C}' \cup \{C\}$ is a cover and clearly it contains a minimal one. Finally, we conclude by observing that, since by construction C is the only chain subgraph of M that contains e, it belongs to any minimal cover contained in M.
It follows that the set of maximal chain subgraphs that can contribute to a minimal chain cover is $\bigcup_e \mathcal{C}_e$, where the index e runs over all the essential edges of G.
In the following, we show how to detect essential edges. This problem then consists in detecting all the couples e 1 , e 2 such that C e 1 ⊆ C e 2 before enumerating all useful maximal chain subgraphs.
Theorem 4 later in this section provides an efficient way to detect these couples. In order to prove it, we need first some preliminary results.
Let $M_e$ be the set of all edges $e' \in E$ inducing a $2K_2$ in G together with e.
Fact 1 Let $C = (X \cup Y, F)$ be a maximal chain subgraph of a bipartite graph $G = (U \cup W, E)$, and let $z \in X$, $e = \{u, w\} \in E$ be such that for every $e' \in E_C(z)$, we have $e \notin M_{e'}$. Then at least one of the following holds: (a) $w \in N_G(z)$; (b) for every $y \in N_C(z)$, $u \in N_G(y)$.
Proof The proof follows straightforwardly by observing that for any $e' = \{z, y\} \in C$, as $e \notin M_{e'}$, either $\{z, w\} \in E(G)$ or $\{u, y\} \in E(G)$.
Observe that in the previous claim, we can re-write (b) in the form: (b') $N_C(z) \subseteq N_G(u)$.
Lemma 6 Let C be a maximal chain subgraph of a bipartite graph $G = (U \cup W, E)$ and let $e \in E$ be such that for all $e' \in E(C)$, it holds that $e \notin M_{e'}$. Then $e \in C$.
Proof Let $e = \{u, w\} \in E$ be as in the hypothesis, i.e. $e \notin M_{e'}$ for all $e' \in E(C)$, and assume that $e \notin C$.
The proof runs by contradiction: we will show that
$$w \in \bigcap_{k=1}^{|X|} N_G(u_k) \qquad (1)$$
must hold. However, this contradicts the maximality of $C$, as in this way we could add $e$ and all the other edges in $E_G(w)$ to $C$ and still obtain a chain subgraph (with $N_G(w)$ as the largest neighbourhood of $C$).
In order to prove (1), using Fact 1 with $z = u_{|X|}$, we have that at least one among (a) and (b) must hold. We may assume that (b) does not hold, as otherwise we would obtain (1) straightaway (interchanging the roles of $Y$ and $X$, and of $w$ and $u$), observing that $N_C(u_{|X|}) = N_G(u_{|X|}) = Y$ by point (i) of Proposition 1. Thus, (a) must hold, i.e. $w \in N_G(u_{|X|})$.
If we now show that $w \in \bigcap_{k=j}^{|X|} N_G(u_k) \Rightarrow w \in N_G(u_{j-1})$, we prove the claim, since together with the just proved $w \in N_G(u_{|X|})$ this leads to (1) by descending induction on $j$. We conclude the proof by showing the validity of this implication.
Assume then that $w \in \bigcap_{k=j}^{|X|} N_G(u_k)$; we deduce $w \in N_G(u_{j-1})$ by applying Fact 1 again, with $z = u_{j-1}$, and showing that (b'), hence (b), cannot hold. Indeed, suppose by contradiction that (b') holds, i.e. $N_C(u_{j-1}) \subseteq N_G(u)$. By this assumption and the maximality of $C$, we deduce that $u \in X$ with the following argument: $N_C(u)$ has to contain at least $N_C(u_{j-1})$, and hence there exists $\bar{k} \geq j - 1$ for which $u = u_{\bar{k}}$, as otherwise we could add the related edges.
However, $u \in X$ implies that we could extend $C$ to a chain subgraph $C'$ by adding at least $e$, a contradiction: $C'$ is a chain graph since $N_C(u_{\bar{k}}) \cup \{w\} \subseteq N_C(u_k)$ for all $k > \bar{k} \geq j - 1$, by $w \in \bigcap_{k=j}^{|X|} N_G(u_k)$ and the maximality of $C$.
Using Lemma 6 we can now prove the following result.
Theorem 4 Given two edges $e, e' \in E$, we have $\mathcal{C}_e \subseteq \mathcal{C}_{e'}$ if and only if $M_e \supseteq M_{e'}$.
Proof (⇒) Suppose that $\mathcal{C}_e \subseteq \mathcal{C}_{e'}$, and assume on the contrary that there exists $f \in M_{e'}$ with $f \notin M_e$. Then there exists a maximal chain subgraph $C'$ containing $e$ and $f$ (as they do not form a $2K_2$ in $G$) but not $e'$ (as $f \in M_{e'}$). Hence $C' \in \mathcal{C}_e$ but $C' \notin \mathcal{C}_{e'}$, contradicting the assumption that $\mathcal{C}_e \subseteq \mathcal{C}_{e'}$.
(⇐) Suppose now that $M_e \supseteq M_{e'}$, and let $C \in \mathcal{C}_e$. By definition, none of the edges of $M_e$ appears in $C$. Hence $e'$ does not form a $2K_2$ in $G$ with any edge of $C$ (as $M_e \supseteq M_{e'}$). By Lemma 6, $e' \in C$. Thus $\mathcal{C}_e \subseteq \mathcal{C}_{e'}$.
Notice that, given an edge $e = (u, w) \in E$ with $u \in U$ and $w \in W$, it is easy to determine the set $M_e$. We just need to start from $E$ and delete all edges that are incident either to $u$ or to $w$, as well as all edges at distance 2 from $e$ (that is, all edges $e' = (u', w')$ such that either $u'$ is adjacent to $w$ or $w'$ is adjacent to $u$). Checking whether $M_e \supseteq M_{e'}$ is also easy: it suffices to sort the edges in each set in lexicographic order, and then each inclusion can be checked in time linear in the sizes of the two sets, that is in $O(m)$. It is thus possible to enumerate with polynomial delay only those maximal chain subgraphs that contain at least one essential edge, by slightly modifying Algorithm 1 as shown in the pseudo-code in Algorithm 2.
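The detection procedure just described can be sketched in a few lines of Python. This is a minimal illustration and not the paper's Algorithm 2: the function names `conflict_sets` and `essential_edges` are hypothetical, edges are assumed to be given as (u, w) pairs with u in U and w in W, and an edge e is assumed to be essential when its family of maximal chain subgraphs is minimal by inclusion, which under the equivalence between inclusion of these families and reverse inclusion of the sets M_e means that M_e is maximal by inclusion. Python set comparison is used here in place of the sorted-list inclusion test.

```python
def conflict_sets(edges):
    """Map each edge e to M_e, the set of edges inducing a 2K2 with e in G."""
    E = set(edges)
    M = {}
    for (u, w) in E:
        M[(u, w)] = frozenset(
            (u2, w2)
            for (u2, w2) in E
            # drop edges incident to u or w, and edges at distance 2 from e
            if u2 != u and w2 != w
            and (u, w2) not in E and (u2, w) not in E
        )
    return M


def essential_edges(edges):
    """Edges whose set M_e is not strictly contained in another M_f.

    Assumption: this is the reading of 'essential' used in this sketch.
    """
    M = conflict_sets(edges)
    return {e for e in M if not any(M[e] < M[f] for f in M)}


# Toy graph: the path x - a - y - b - z (U = {a, b}, W = {x, y, z}).
path = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "z")]
M = conflict_sets(path)
essential = essential_edges(path)
```

On this path the only pair of edges inducing a 2K2 is (a, x) and (b, z), so these two edges are reported as essential, while (a, y) and (b, y), which have empty conflict sets, are not.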

Chain graphs and interval orders
There is an interesting connection between chain graphs and interval orders. In this section, we look at the results presented in this paper in the light of this relation, in particular with respect to the computation of the interval order dimension of a poset and the enumeration of minimal interval order extensions and maximal interval order reductions of bipartite posets. We first briefly recall these notions and this relation.
A partially ordered set (poset for short) is a pair $(P, \leq_P)$ where $P$ is a set, called the ground set, and $\leq_P \subseteq P \times P$ is a binary, reflexive, transitive and anti-symmetric relation on $P$, referred to as a partial order on $P$. A partial order is an interval order on $P$ if there exist two functions $l, r : P \to \mathbb{R}$ such that $x \leq_P y$ iff $r(x) \leq l(y)$, while it is said to be a total or linear order iff either $x \leq_P y$ or $y \leq_P x$ for all $x, y \in P$. A partial order $Q = (Q, \leq_Q)$ is said to extend $P$, or to be an extension of $P$, if $x \leq_P y$ implies $x \leq_Q y$. A linear extension of $P$ is an extension of $P$ which is also a linear order. A bipartite poset is a poset $H = (U \cup V, \leq_H)$ in which $u <_H v$ implies $u \in U$ and $v \in V$. The interval order dimension of a bipartite poset $H$, denoted by $\mathrm{Idim}(H)$, is the minimum number $k$ of interval order extensions whose intersection gives $H$.
We can view $H$ as a bipartite undirected graph, called the comparability graph $G(H)$ of $H$, with node set $U \cup V$ and edge set $E(G(H)) = \{(u, v) : u \in U, v \in V, \text{ and } u \leq_H v\}$. We have that a bipartite poset is an interval order if and only if its comparability graph is a chain graph [20]. Hence each interval order extension of $H$ can be viewed as a chain graph (edge) completion of $G(H)$, i.e. a chain graph with the same node set as $G(H)$ which has $G(H)$ as a subgraph. Thus $\mathrm{Idim}(H)$ coincides with the minimum number of chain graph completions of $G(H)$ whose intersection gives $G(H)$.
The bipartite complement of a bipartite graph $D = (U \cup V, E)$ is the bipartite graph on the same node set whose edge set $E'$ consists of all the non-edges of $D$ across the two partitions.
Now, if $C$ is a chain subgraph of $G(H)$, its bipartite complement is also a chain graph (as the bipartite complement of a $2K_2$ is a $2K_2$). Then $\mathrm{Idim}(H)$ coincides with the size of a minimum chain subgraph cover of the bipartite complement of $G(H)$, which is our second problem. All this is contained in a result of [20] where, by abuse of notation, the bipartite complement of $G(H)$ is denoted by $B(H)$.
In the same way, enumerating all the maximal chain subgraphs of a bipartite graph $G(H)$ (i.e. our first problem) is equivalent to listing all the maximal interval order reductions of the bipartite order $H$, and enumerating all maximal chain subgraphs of $B(H)$ is equivalent to listing all minimal interval order extensions of $H$.
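The remark that bipartite complementation preserves chain graphs can be checked on small instances with the following Python sketch. The helper names (`is_chain_graph`, `bipartite_complement`) and the representation of a graph as two node lists plus a set of (u, v) pairs are assumptions made for illustration only. A chain graph is tested through the nested-neighbourhood characterisation, which is equivalent to the absence of an induced 2K_2.

```python
def is_chain_graph(U, edges):
    """True if the neighbourhoods of the nodes in U are nested (no induced 2K2)."""
    nb = {u: set() for u in U}
    for u, v in edges:
        nb[u].add(v)
    ordered = sorted(nb.values(), key=len)
    # For sets sorted by size, nestedness reduces to consecutive inclusions.
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


def bipartite_complement(U, V, edges):
    """All pairs (u, v) across the two partitions that are not edges."""
    present = set(edges)
    return {(u, v) for u in U for v in V if (u, v) not in present}


U, V = ["u1", "u2"], ["v1", "v2"]
chain = {("u1", "v1"), ("u2", "v1"), ("u2", "v2")}   # nested neighbourhoods
two_k2 = {("u1", "v1"), ("u2", "v2")}                # two independent edges

chain_complement = bipartite_complement(U, V, chain)
two_k2_complement = bipartite_complement(U, V, two_k2)
```

Here the complement of the chain graph is again a chain graph, while the complement of the 2K_2 is the other 2K_2 on the same four nodes.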
We can then interpret the results on the enumeration of maximal chain subgraphs (Propositions 2 and 3) in this context as follows:
Proof The first result is a straightforward interpretation of Propositions 2 and 3 in this context, while the second result comes from the fact that we have to run our algorithm on $B(H)$ instead of $G(H)$; hence the number of edges goes from $m$ to $|U| \cdot |V| - m$, giving $O(n^2 (|U| \cdot |V| - m)) = O(n^2 \cdot |U| \cdot |V|)$. Finally, in both results, we apply the observation that we can run our algorithm on the smaller of the two partitions, substituting $n^2$ by $\min\{|U|, |V|\}^2$.
Observe that it is not surprising that enumerating minimal extensions is more complicated than enumerating maximal reductions, as the same happens in [15] for counting minimal extensions and maximal reductions of N-free orders (i.e. the posets $(P, \leq_P)$ such that there do not exist $x, y, z, w \in P$ with $x \leq_P y$, $x \leq_P w$ and $z \leq_P w$).
Furthermore, recall that for general posets $P$ it was already proved in [15] that the number of minimal interval order extensions and maximal reductions can be computed in time polynomial in their number, but the proposed dynamic programming algorithm requires exponential space to prevent duplications, as it stores all the solutions already found and compares each new solution with them, unlike the Algorithm 1 we proposed.

A case study
In this section we show an application of our methods to a real dataset. We implemented Algorithm 1 and ran it on the graph $G$ representing the CI of Wolbachia in Culex pipiens [6], stored by means of the incidence matrix in Fig. 3. The code is available at https://github.com/sinaimeri/ChainEnumeration. On this dataset, the algorithm to list all the maximal chain subgraphs took only 1 second on a single core of a 6-core MacBook Pro 2.2 GHz i7. The graph $G$ turned out to have 16 maximal chain subgraphs and, with a simple exhaustive search, we found a chain cover of $G$ consisting of 4 chain subgraphs (represented in colours in Fig. 4 on the matrix storing $G$, where rows and columns have been conveniently permuted in order to highlight the triangular shapes of chain subgraphs, which are called quantitative shapes in [6]). Notice that this cover is minimum, since it is known [18] that the size of a maximum induced matching is upper bounded by the size of a minimum chain cover, and it is not difficult to find by hand a maximum induced matching of $G$ with 4 edges (e.g. the edges connecting Bifa-A and Istanbul, Keo-B and Bifa-B, Istanbul and Aus, and Aus and Slab, all corresponding to 1s in the incidence matrix); this improves the result claimed in [6] (page E19) that 5 chain subgraphs would be necessary.
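The optimality argument above combines an exhaustive search over the maximal chain subgraphs with a lower bound given by a set of edges that pairwise induce a 2K_2 (an induced matching). A minimal Python sketch of both steps follows. It is not the code from the repository, the function names are hypothetical, and the toy graph (a path with four edges) stands in for the Culex pipiens matrix, which is not reproduced here.

```python
from itertools import combinations


def is_induced_matching(matching, edges):
    """True if the given edges belong to the graph and pairwise induce a 2K2."""
    E = set(edges)
    if not set(matching) <= E:
        return False
    return all(
        u1 != u2 and w1 != w2 and (u1, w2) not in E and (u2, w1) not in E
        for (u1, w1), (u2, w2) in combinations(matching, 2)
    )


def minimum_chain_cover(edges, maximal_chains):
    """Smallest family of the given chain subgraphs covering every edge.

    Exhaustive search by increasing family size; exponential in general.
    """
    E = set(edges)
    for k in range(1, len(maximal_chains) + 1):
        for family in combinations(maximal_chains, k):
            if E <= set().union(*family):
                return list(family)
    return None


# Toy instance: path x - a - y - b - z, with its two maximal chain subgraphs.
edges = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "z")]
maximal_chains = [
    frozenset({("a", "x"), ("a", "y"), ("b", "y")}),
    frozenset({("a", "y"), ("b", "y"), ("b", "z")}),
]
cover = minimum_chain_cover(edges, maximal_chains)
witness = [("a", "x"), ("b", "z")]  # candidate induced matching
```

Since no chain subgraph can contain two edges of an induced matching, a cover whose size equals that of such a witness is necessarily minimum. In the toy instance both have size 2.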
Fig. 3 The Culex pipiens dataset in [6]. Rows represent the males and columns the females. A value of 1 in cell [i, j] represents an incompatibility between male i and female j.
Fig. 4 The Culex pipiens dataset in [6]. A minimum chain cover for the graph G, constituted by four chain subgraphs.
Fig. 5 The Lock and Key matrices for the Culex pipiens dataset in [6]. The Lock matrix and the Key matrix resulting from the solution of the quantitative model. In each cell the relative amount of lock and key molecules is indicated. A value 0 indicates that no molecule is inferred from the analysis. Each colour symbolizes a single Lock-Key pair.
In Fig. 5, a value 0 indicates that no lock or key molecule is inferred from the analysis, and each colour symbolizes a single Lock-Key pair. The values of the quantities are only indicative. Indeed, Lemma 1 provides a possible assignment that explains the incompatibilities, but what matters is not the values themselves but the relative order among them. Thus, for instance, we can safely substitute the value 17 in the table by 5 and the relative order will not change.
It is also interesting to observe that the solution provided in Fig. 5 is significantly different from the one in [6]. For instance, in [6] the Lock matrices inferred under the quantitative model contain many more empty cells than the Key matrices. This is clearly not the case in the solution we presented, where both Lock and Key matrices have roughly the same number of non-empty cells, indicating a more uniform level of infection in the population considered. In our solution, each male or female individual is infected by 2 or 3 strains of Wolbachia.

Conclusions and open problems
In this paper, we studied the problem of finding the minimum number of different Lock/Key molecules that explains the CI for the observed data. This problem was already modelled as finding the minimum number of chain subgraphs that cover the edges of a given bipartite graph [6,9].
Motivated by this, we studied different problems related to maximal chain subgraphs and chain subgraph covers in bipartite graphs. First, for the NP-hard problem of finding the minimum number of chain subgraphs that cover the edges of a bipartite graph, we provided an exponential algorithm with a non-trivial complexity. Although we improved the complexity in theory, a simple implementation may not be efficient for large and dense graphs. A future direction would be to use this algorithm as a base for more efficient implementations that are fast in practice.
Second, for the problem of enumerating all minimal chain subgraph covers of a bipartite graph, we showed that it can be solved in quasi-polynomial time. It remains an open problem to understand whether it is possible to enumerate the minimal chain covers of a graph in polynomial delay.
Interestingly, in order to solve the above problems, we also considered the problem of enumerating all the maximal chain subgraphs of a bipartite graph and improved on the current results in the literature for the latter. It is worth exploring the different nature of the problems considered here in the case where we deal with a hereditary property (induced chain subgraphs) instead of a non-hereditary one (edge-induced chain subgraphs).
Finally, in this paper we assumed the data are correct and complete. It is certainly interesting to deal with the case of missing data.