Genome-wide multiple sequence alignments
Our starting point is a set of genome assemblies. For our purposes an assembly is simply a set of sequences representing chromosomes, scaffolds, reftigs, contigs, etc. In the following, we will use contig to refer to any such sequence. On each of these constituent sequences we assume the usual coordinate system defining sequence positions. Since DNA is double stranded, a piece of genomic sequence is either contained directly (\(\sigma = +\;1\)) in the assembly or it is represented by its reverse complement (\(\sigma = -\;1\)). We write \((\mathcal {G},c,i,j,\sigma)\) to identify the sequence interval from positions i to j on contig c of genome assembly \(\mathcal {G}\) with reading direction \(\sigma\). We assume, w.l.o.g., \(i\le j\).
Most comparative methods require multiple sequence alignments (MSAs) as input. An MSA \(\mathfrak{A}\) is composed of alignment blocks, each of which consists of an alignment of sequence intervals. For the purposes of this paper it its sufficient to characterize an alignment block by the coordinates of its constituent sequence intervals. That is, a block \(B\in \mathfrak{A}\) has the form \(B=\{ (\mathcal {G}_u,c_u,i_u,j_u,\sigma _u) | u \in \text {rows of }B\}\) where the index u runs over the rows of the alignment block. It will be convenient to allow alignment blocks also to consist of a single interval only, thus referring to a piece of sequence that has not been aligned. Note that at this stage we do not assume that an alignment block contains only one interval from each assembly.
The projection \(\pi _\mathcal {G}(B)\) extracts from an alignment block the union of its constituent sequence intervals belonging to assembly \(\mathcal {G}\). If the assembly \(\mathcal {G}\) is not represented in the alignment block B we set \(\pi _\mathcal {G}(B)=\emptyset\).
The projection operation collapses pairs of overlapping sequence intervals (\(i \le i' \le j \le j'\)) in to a single interval: \((\mathcal {G},c,i,j,\sigma)\cup (\mathcal {G},c,i',j',\sigma ')= (\mathcal {G},c,i,j',+1)\) without regard for the orientation, which is set to \(+1\) and will have no bearing on the algorithm we develop further down.
The projection \(\pi _\mathcal {G}\) of \(\mathfrak{A}\) onto one of its constituent assemblies \(\mathcal {G}\) is the union of the sequence intervals from \(\mathcal {G}\) that are contained in its alignment blocks, i.e., \(\pi _\mathcal {G}(\mathfrak{A})=\bigcup _{B\in \mathfrak{A}} \pi _\mathcal {G}(B)\).
Definition 1
Let \(\mathfrak{A}\) be an MSA.
-
\(\mathfrak{A}\) is complete if \(\pi _\mathcal {G}(\mathfrak{A})=\mathcal {G}\), i.e., if each position in each assembly is represented in at least one alignment block.
-
\(\mathfrak{A}\) is irredundant \(\pi _\mathcal {G}(B')\cap \pi _\mathcal {G}(B'')=\emptyset\) for any two distinct blocks \(B'\) and \(B''\), i.e., if every sequence interval from assembly \(\mathcal {G}\) is contained in at most one alignment block.
-
\(\mathfrak{A}\) is injective if no alignment block comprises more than one interval from each of its constituent assemblies.
Clearly, every given MSA can easily be completed by simply adding all unaligned sequence intervals as additional blocks.
Just like a contig c in a (genome) assembly \(\mathcal {G}\), each block \(B\in \mathfrak{A}\) has an internal coordinate system defined by its columns. We write (B, k) for column k in block B. We write \(\ell (B)\) for the number of columns in block B. If \(\mathfrak{A}\) is irredundant, then there are functions \(f_{\mathcal {G},c}\) that map position i within \((\mathcal {G},c)\) to a corresponding MSA coordinate (B, k). If \(\mathfrak{A}\) is complete, the individual \(f_{\mathcal {G},c}\) can be combined to a single function \(f:(\mathcal {G},c,i)\mapsto (B,k)\). Completeness implies that every position \((\mathcal {G},c,i)\) is represented in the MSA, and irredundancy guarantees that the relation between assembly and alignment coordinates is a function by ensuring that \((\mathcal {G},c,i)\) corresponds to at most one alignment column. The following definition is therefore equivalent to the notion of a supergenome introduced in [17].
Definition 2
An MSA \(\mathfrak{A}\) is a supergenome if it is complete, irredundant, and injective.
The most commonly used genome-wide MSAs cannot be completed to supergenomes. The MSAs produced by the multiz pipeline are usually not irredundant: different intervals of the “reference sequence” may be aligned to the same interval of another assembly. While multiz [11] alignments are injective this is in general not the case with the EPO [12] alignments. In these, multiple paralogous sequences from the same genome may appear in one alignment block.
Now consider an MSA \(\mathfrak{A}\) and an arbitrary order < of the alignment blocks of \(\mathfrak{A}\). Then there is a (unique) function \(\phi\) that maps the pairs (B, k) injectively to the interval [1, n], where \(n=\sum _{B\in \mathfrak{A}} \ell (B)\) is the total number of columns in \(\mathfrak{A}\) such that \(\phi (B,i)<\phi (B',i')\) whenever \(B<B'\) or \(B=B'\) and \(i<i'\). If \(\mathfrak{A}\) is a supergenome, then \(\phi (f)\) is clearly an injective function from a genome assembly \(\mathcal {G}\) to [1, n]. We call \(\phi (f(\mathcal {G},c,i))\) the coordinate of position i of contig c of assembly \(\mathcal {G}\) in the ordered supergenome \((\mathfrak{A},<)\).
As pointed out in [17], the existence of a coordinate system for the supergenome \(\mathfrak{A}\) is independent of the block order <. However, the order < is crucial for the practical use of the coordinate system.
Adjacency and betweenness of MSA blocks
The natural starting point for considering adjacency and betweenness of alignment blocks are their constituent intervals \((\mathcal {G},c,i,j,\sigma)\) on a fixed assembly \(\mathcal {G}\) and contig c. Intervals have a natural partial order defined by \((\mathcal {G},c,i,j,\sigma)\prec (\mathcal {G},c,k,l,\sigma)\) whenever \(i<k\) and \(j<l\). Two intervals are incomparable in this interval order if and only if one is contained in the other. Note that the interval order allows comparable intervals to overlap. We also consider intervals incomparable that belong to different contigs and/or assemblies.
Given three intervals \(\alpha =(\mathcal {G},c,i',j',\sigma ')\), \(\beta =(\mathcal {G},c,i'',j'',\sigma '')\), and \(\gamma =(\mathcal {G},c,i,j,\sigma)\) (on the same genome assembly and contig), we say that \(\gamma\) is between the two distinct intervals \(\alpha\) and \(\beta\) if \(\alpha \prec \gamma \prec \beta\) or \(\beta \prec \gamma \prec \alpha\).
Given a collection \(\mathcal {Q}\) of intervals on the same assembly \(\mathcal {G}\) and contig c, we say that \(\alpha =(\mathcal {G},c,i',j',\sigma ')\) and \(\beta =(\mathcal {G},c,i'',j'',\sigma '')\) are adjacent if there is no interval \(\gamma\) between \(\alpha\) and \(\beta\). We say that \(\alpha\) is a predecessor of \(\beta\) if \(\alpha\) and \(\beta\) are adjacent and \(\alpha \prec \beta\). Analogously, \(\alpha\) is a a successor of \(\beta\) if \(\alpha\) and \(\beta\) are adjacent and \(\beta \prec \alpha\).
Lemma 1
Let \(\mathfrak{A}\) be a supergenome and consider the the collection \(\{\pi _{\mathcal {G}}(B)|B\in \mathfrak{A}\}\) of intervals on a given \(\mathcal {G}\). Then (i) no two intervals overlap, (ii) the interval order \(\prec\) is a total order on every contig c, (iii) every interval has at most one predecessor and one successor, and hence is adjacent to at most two intervals, and (iv) if \(\gamma\) is adjacent to both \(\alpha\) and \(\beta\), then \(\gamma\) is between \(\alpha\) and \(\beta\).
Proof
Property (i) follows directly from the condition. (ii) As a consequence, any two intervals on a fixed contig c are comparable, i.e., the restriction of \(\prec\) to c is a total order. Since only intervals on the same contig can be adjacent, (iii) is an immediate consequence of (ii) and the fact that the number intervals is finite. Property (iv) is now a trivial consequence of the fact that by (iii) \(\alpha\) and \(\beta\) must be the predecessor and successor of \(\gamma\).
A key construction in this contribution is the notion of betweenness relations for alignment blocks.
Definition 3
Given three blocks \(A,B,C \in \mathfrak{A}\), we say that C is between A and B with respect to \(\mathcal {G}\) if \(\pi _{\mathcal {G}}(C)\) is between \(\pi _{\mathcal {G}}(A)\) and \(\pi _{\mathcal {G}}(B)\). The ternary relation \({{\mathscr{C}}}{{(\mathfrak{A}}})\) is the defined by \((A,C,B)\in {{\mathscr{C}}}({{\mathfrak{A}}})\) whenever C is between A and B for some assembly \({\mathcal {G}}\).
We note that contradicting betweenness relations resulting from different genome assemblies \({\mathcal {G}}\) are to be expected, i.e., the relation \({{\mathscr{C}}}{{(\mathfrak{A})}}\) will in general not satisfy the properties of a well-formed betweenness relation. We will return to this issue below.
Definition 4
Two blocks \(A,B\in \mathfrak{A}\) are adjacent if there is an assembly \(\mathcal {G}\) such that \(\pi _{G}(A)\) and \(\pi _{G}(B)\) are adjacent w.r.t. \(\{\pi _{\mathcal {G}}(C)|C\in \mathcal {A}\}\).
It is useful to regard \(\mathfrak{A}\) with its adjacency relation as a graph. In order to keep track of the individual contigs, we use an edge-colored multigraph, with \(\mathcal {G}\) serving as edge color.
Definition 5
The supergenome graph \(\Gamma (\mathfrak{A})\) of an MSA \(\mathfrak{A}\) is the directed, edge-colored multigraph whose vertices are the alignment blocks of \(\mathfrak{A}\) and whose directed edges (A, B) connect an alignment block A to an alignment block B with color \(\mathcal {G}\) whenever there are sequence intervals \(\alpha \in A\) is a predecessor of \(\beta \in B\) in assembly \(\mathcal {G}\).
An example how the prejection \(\Gamma (\mathfrak{A})\) is done is shown in Fig. 1. The projection of \(\Gamma (\mathfrak{A})\) to a constituent assembly \(\mathcal {G}\) is a (not necessarily induced) subgraph. As an immediate consequence of Lemma 1, each projection is a disjoint union of directed paths, each of which represents a contig. Conversely, every colored directed multigraph whose restriction to a single color is a set of vertex-disjoint directed paths is a supergenome graph. It therefore makes sense to talk about a supergenome graph \(\Gamma\) without explicit reference to an underlying alignment \(\mathfrak{A}\).
The structure of the supergenome graph strongly depends on the evolutionary history of the genomes that it represents. In the absence of genome rearrangements (i.e., when the only genetic changes are substitutions, insertions (including duplications), and deletions) then all genomes remain colinear with their common ancestor. In other words, a single, canonical [30] global alignment describes a common coordinate system that is unique up to the (arbitrary) order of contigs and each trace [31] of insertions and deletions. In terms of the block adjacency relation, each block has at most two adjacent neighbors in this scenario.
Genome rearrangements are by no means infrequent events [32,33,34,35], and thus cannot be neglected. Every breakpoint introduced by a genome rearrangement operation, be it a local reversal or a cut-and-join type dislocation, introduces an ambiguous adjacency, i.e., a block that has two or more predecessors or successors. The task of identifying an appropriate ordering of the MSA blocks therefore is a non-trivial one for realistic data, even in the absence of alignment errors.
Modeling the “Supergenome Sorting Problem”
Informally, we may consider the “supergenome sorting problem” (SSP) as the task of finding an order < (or, equivalently, a permutation \(\rho\)) of the alignment blocks of \(\mathfrak{A}\) such that the orders of the constituent assemblies are preserved as much as possible. Somewhat more precisely, we are to find an order < on the vertex set of the supergenome graph \(\Gamma (\mathfrak{A})\) that as many of its directed edges as possible are “consistent” with the order <. It is not clear from the outset, however, how “consistency” should be defined for our application. A large number of related models have been proposed and analyzed in the literature that make this condition precise in different ways, leading to different combinatorial optimization problems. We therefore proceed with a brief review of some paradigmatic approaches. A reader primarily interested in our proposed approach to the problem might want to skip this section.
Hamiltonian paths
A plausible attempt is to view the SSP as a variant of the Hamiltonian path problem on the supergenome graph \(\Gamma\). A Hamiltonian path defines a total order of the vertices and therefore a solution to the SSP. This is idea is similar to the use of Hamiltonian graphs for genome assembly from read overlap graphs [36]. There are several quite obvious difficulties, however. First, it is not sufficient to consider only paths that are entirely confined to pass through the adjacencies. The simplest counterexample consists of only 4 MSA blocks \(B_1\), \(B_2\), \(B_3\) ,\(B_4\) and three assemblies \(\mathcal {G}_1\), \(\mathcal {G}_2\), \(\mathcal {G}_3\). Given eight sequence intervals \(\beta _{k,l}=(\mathcal {G}_l,c_{k,l},i_{k,l},j_{k,l},+1) \in B_k\) we construct the following alignment:
$$\begin{aligned} \begin{array}{*{20}{c}} {}&{}{{B_1}}&{}{{B_2}}&{}{{B_3}}&{}{{B_4}}\\ {{g_1} = }&{}{{\beta _{1,1}}}&{}{}&{}{}&{}{{\beta _{4,1}}}\\ {{g_2} = }&{}{{\beta _{1,2}}}&{}{{\beta _{2,2}}}&{}{}&{}{{\beta _{4,2}}}\\ {{g_3} = }&{}{{\beta _{1,3}}}&{}{}&{}{{\beta _{3,3}}}&{}{{\beta _{4,3}}} \end{array} \end{aligned}$$
This situation arises in practice e.g. when \(B_2\) and \(B_3\) are two independent, unrelated inserts between \(B_1\) and \(B_4\). The block adjacency graph is the graph
$$\begin{aligned} {B_2-B_1-B_4-B_3,} \end{aligned}$$
which violates the desired betweenness relation \((\beta _{1,3},\beta _{3,3},\beta _{4,3})\).
In this case there are only two biologically correct solutions: \(B_1<B_2<B_3<B_4\) (or the inverse order) and \(B_1<B_3<B_2<B_4\) (or its inverse). In either case, the solution contains two consecutive blocks (\(B_2\) and \(B_3\)) that are not adjacent in the block graph. This example also serves to demonstrate that the block graph alone does not contain the complete information on the supergenome. It appears that in addition we will need to know the betweenness relation among the blocks i.e., that both \(B_2\) and \(B_3\) are between \(B_1\) and \(B_4\).
Feedback arc sets and topological sorting
An other possibility to determine a well-defined order of the vertices of the supergenome graph \(\Gamma\) is to first extract a maximum acyclic subgraph and then to compute a topological sorting of this subgraph. An equivalent formulation asks for the removal of a minimum set of edges that close cycles. This Maximum Acyclic Subgraph or Minimum Feedback Arc Set problem (MFAS) is well-known to be NP-hard [37]. Nevertheless fast, practicable heuristics have been devised, see e.g. [38, 39]. From the resulting directed acyclic graph (DAG) an admissible ordering of blocks can be obtained efficiently by topological sorting [40]. A closely related approach is the Linear Ordering Problem (LOP): given a complete weighted directed graph, find a tournament with maximum total edge weights [41]. It yields essentially the same model since LOP and MFAS can be transformed into each other quite easily [42]. Cost functions designed to define consensus orderings for sets of total and partial orders have been considered in different fields starting with the work of Spearman [43] and Kendall [44], see also e.g. [45,46,47]. The reconciliation of partial orders in investigated in detail e.g. in [48].
The key problem of modeling the SSP in terms of MFAS is highlighted in Fig. 2. It shows that even when undirected adjacencies would allow for a perfect solution, it may not be uncovered directly by the MFAS approach.
Simultaneous consecutive ones and matrix banding
Instead of adjacencies we may consider the incidence matrix \(\mathbf {A}\) of the supergenome graph \(\Gamma\) and try to sort both the alignment blocks and their adjacencies in such a way that, to the extent that this is possible, (i) adjacent blocks are consecutive and (ii) adjacencies that have a block in common are consecutive. In more formal terms, we wish to sort both the rows (alignment blocks) and columns (adjacencies) of the incidence matrix in such a way that rows and columns show all non-zero entries consecutively. A rectangular matrix \(\mathbf {A}\) that admits such a pair of row and column permutations is said to have the simultaneous consecutive ones property (C1S) [49]. This is possible if \(\Gamma\) is a union of paths. Note that instead of adjacencies we could also cover the graph with short paths \(\wp _k\). In this case column k identifies the vertices incident with path k. Again, if \(\Gamma\) is a union of paths, the path-incidence matrix satisfies (C1S). It is not difficult to see [49] that \(\mathbf {A}\) satisfies (C1S) if and only if \(\mathbf {A}\) has the well-studied consecutive ones property [50, 51] for both its rows and columns. Thus (C1S) can be checked in linear time [50]. Furthermore, Tucker’s characterization of (C1S) in terms of forbidden sub-matrices [52] also carries over. Some direct connection between the consecutive ones property and the Betweenness Problem, which we will consider below, are discussed in [53].
In general, \(\mathbf {A}\) will now have the consecutive ones property. The problem of identifying a minimal number of columns (adjacencies) whose removal leaves a (C1S) matrix is NP-complete [49]. In practise it may be desirable to quantify the extent of the violation of (C1S) in terms of intervals of consecutive zeros enclosed by the two 1s. For instance, one may want to use \(\omega =\sum _i h(\ell _i)\), where the sum runs over all intervals i of consecutive zeros enclosed by the two 1s, and \(h(\ell _i)\ge 0\) is some contribution that monotonically grows with the length \(\ell _i\) of the 0-interval. For a given ordering of the rows and columns, the total violation is quantified as the sum of the \(\omega\) values. It should be noted, however, that (C1S) does not imply \(\Gamma\) is a union of disjoint paths, i.e., that \(\Gamma\) is a valid supergenome graph.
A related set of optimization problems is concerned with reducing the bandwidth of matrices, i.e., the maximal distance of non-zero entries from the diagonal (in a symmetric case) or the parameter \(\min (l,u)+l+u\) (for rectangular matrices); here \(u=\max _{\mathbf {A}_{ij}\ne 0} (i-j)\) and \(l=\max _{\mathbf {A}_{ij}\ne 0} (j-i)\) [54]. In the symmetric case, several good heuristics are known, starting with the Cuthill-McKee [55] and GPS [56] algorithms even though the problem is NP-hard [57], while the general case has received much less attention [54]. Bandwidth reduction methods do not eliminate “bad” adjacencies that eventually determine bandwidth. The resulting ordering of rows and columns thus may be very inaccurate locally.
Bidirected graphs
Several specialized graph structures have been introduced recently to tackle the problem of ordering sequence blocks, which is a problem very similar to SSP. Among these constructions are A-Bruijn graphs, Enredo graphs, and Cactus graphs, see [58] for a review. A key insight of [58] is that these graph representations are equivalent in the sense that they can be transformed into each other. They differ, however, in additional information extracted from the input alignment that is stored as vertex and edge labels, see Fig. 3. For instance, the bidirected graphs of [28] have the same underlying graph as our supergenome graph. While the supergenome graph used a standard directed graph structure, bidirected graphs encode directional information independently for the endpoint of each edge, distinguishing three cases: (i) adjacent alignment blocks have the same orientation, (ii) the connected blocks switch from minus to plus orientation, or (iii) vice versa. The latter two cases indicate a change of orientation between two changes. In [28], this basic structure is extended by additional transitive edges given by a legal path through the graph with an exponential weight function. The task is then to find a consistent (non-conflicting) set of edges with maximal weight. This form of the weight function takes into account that, due to genetic linkage, the more closely positioned on a genome two blocks are, the less likely it is that the edge between these blocks is broken. This gives higher weight to locally correct block positions and orientations.
While this sorting problem is (NP-) hard in general, the sets of genomes considered by the authors are particularly suitable for this kind of calculations. The genomes considered for pangenome construction are typically closely related, or even of the same species. Thus one can expect many paths with high weights of the conserved consensus [29]. This approach also fits well to the analysis of genomic regions that are under constraint to maintain syntenic order for functional reasons, such as the MHC locus used as an example in [28]. In distantly related genomes, however, synteny tends not to be well preserved. In addition, we are interested in particular in data sets that contain genomes in preliminary draft forms, i.e., linkage information that is at least partially limited to short contigs or scaffolds. As a consequence we have less confidence in linkage and orientation information than one can expect in a typical pangenome scenario.
Sequence graphs
The sequence graphs of Haussler et al. [29] are also closely related to the supergenome graph representation \(\Gamma (\mathfrak{A})\) of a multiple alignment \(\mathfrak{A}\). The key difference is that the the orientation of the blocks is used to determine the direction of the edge. Two adjacent intervals with negative orientation thus imply an edge that is reversed compared to the supergenome graph. This situation is problematic in the case where the orientation of the intervals is switched. In [29] a preprocessing step is performed to minimize the number of such edges. The sequence graph approach was designed for the comparison of human genomes of different individuals. In such a scenario, the resulting information loss is small and does not present a practical problem. It is likely to become an issue, however, for large phylogenetic distances with frequent genome rearrangements.
The natural formulation of SSP on a sequence graph is to find a vertex ordering that minimizes weighted feedback edges and the average cut width. These optimization criteria ensure that the successor relations are mostly kept intact and at the same time successors are placed close to each other in the solution. Both problems the Minimum Feedback Arc Set Problem [37] and the Average Cut-Width Minimization Problem [59,60,61] are known to be NP-hard. Cutwidth minimization problems ask for a linear ordering of the vertices of a graph such that the average or maximum number of edges spanning across the gap between a pair of consecutive vertices is minimized. Conceptually, cutwidth problems are quite similar to bandwidth problems [62]. In [29] a heuristic is presented that first extracts a totally ordered “backbone” and then inserts the remaining vertices into the backbone order that is kept intact in the process. While the presence of a global common backbone order is a well founded assumption for pangenomes of a single or very closely related species, it is violated substantially for phylogenetically diverse data.
Betweenness problems
In this work we interpret SSP as a betweenness (ordering) problem rather than a vertex ordering problem on a directed graph. Instead of (oriented) adjacencies, which are defined on pairs of blocks, one considers the relative order of three blocks:
Betweenness Problem [63, 64]: Given a finite set X and a collection \({{\mathscr{C}}}\) of triples from X, is there a total order on X such that \(\forall (i,j,k)\in {{\mathscr{C}}}\) either \(i<j<k\) or \(i>j>k\)?
This decision problem is NP-complete [63, 64]. A triple \((i,j,k)\in {{\mathscr{C}}}\) is called a betweenness triple.
The Betweenness Problem can be adapted to model the SSP by means of a suitable cost function b designed to penalize violations of the betweenness relation. Consider an order \(\rho\) used to coordinatize the supergenome. We may think of \(\rho\) as a bijective function from \([1,\ldots ,|\mathfrak{A}|]\rightarrow \mathfrak{A}\). In other words, we number the blocks contained in \(\mathfrak{A}\) and work on this set of block indices.
For \(i<j<k\) we set \(b_{\rho ,{\mathcal {G}}}(i,j,k)=1\) if the projections of the three alignment blocks \(\rho (i)\), \(\rho (j)\), and \(\rho (k)\) exist and violate the betweenness relation for a given assembly \({\mathcal {G}}\), i.e., if \(\pi _{\mathcal {G}}(\rho (i))\), \(\pi _{\mathcal {G}}(\rho (j))\) and \(\pi _{\mathcal {G}}(\rho (k))\) are located on the same contig and \(\pi _{\mathcal {G}}(\rho (j))\) is not located between \(\pi _{\mathcal {G}}(\rho (i))\) and \(\pi _{\mathcal {G}}(\rho (k))\). Otherwise we set \(b_{\rho ,{\mathcal {G}}}(i,j,k)=0\). A natural cost function is now the total number of betweenness violations
$$\begin{aligned} b(\rho) := \sum _{\mathcal {G}\in G(\mathfrak{A})} \sum _{i<j<k} b_{\rho ,\mathcal {G}}(i,j,k)\,, \end{aligned}$$
(1)
where \(G({\mathfrak{A}})\) is the set of genome assemblies that are contained in \(\mathfrak{A}\). If genome evolution were to preserve gene order, i.e., only local duplications and deletions are allowed, the betweenness relation of the ancestral state would be preserved, guaranteeing a perfect solution \(\rho\) with \(b(\rho)=0\).
Since this decision problem is NP-complete [63, 64], so is the problem of optimizing \(b(\rho)\) NP-hard. The proof is shown in Additional file 1: Section 1 The cost function \(b(\rho)\) involves the sum over all triples of alignment blocks and thus is fairly expensive to evaluate. It is interesting in practice, therefore, to consider a modified cost function that restricts the sum in Eq. (1) to local information. This idea leads us to the rather natural extension of the Betweenness Problem to colored multigraphs.
Definition 6
Given an (undirected) colored multigraph \(\hat{\Gamma }=(V,E)\), the triple (i, j, k) is part of the collection \({{\mathscr{C}}}(\hat{\Gamma })\), iff there are edges \(\{i,j\} \in E\) and \(\{j,k\} \in E\) with color \(\mathcal {G}\).
Colored Multigraph Betweenness Decision Problem: given the colored multigraph \(\hat{\Gamma }=(V,E)\), is there a total order on V such that \(\forall (i,j,k)\in {{\mathscr{C}}}(\hat{\Gamma })\) either \(i<j<k\) or \(i>j>k\)?
The reformulation as an optimization problem that maximizes the number of edges is straightforward: Colored Multigraph Betweenness Problem: Given a colored multigraph \(\hat{\Gamma }=(V,E)\), find a total order on V such that \(E^* \subseteq E\) is maximal under the condition that \(\forall (i,j,k)\in {{\mathscr{C}}}(V,E^*)\) either \(i<j<k\) or \(i>j>k\).
This problem can be viewed as an analog of the Minimum Feedback Arc Set problem [38] for betweenness data. To our knowledge is has not been studied so far.
Lemma 2
The (decision version of the) Colored Multigraph Betweenness Problem is NP-complete.
Proof
Every set \({{\mathscr{C}}}(\hat{\Gamma })\) of triples can be obtained from an edge-colored multigraph \(\hat{\Gamma }\) (with vertices corresponding to alignment blocks and colored edges corresponding to adjacencies deriving from a genome identified by the color). Thus, the total order on the vertices of \(\hat{\Gamma }\) is a solution of the Colored Multigraph Betweenness Problem if and only if the answer to the NP-complete Betweenness Problem is positive. In Additional file 1: Section 1 the reduction in both directions is shown.
In the example of Fig. 2 the optimal solution of the Colored Multigraph Betweenness Problem retains all unordered adjacencies and creates a unique coordinatization (up to orientation) that leaves all alignment blocks ordered as drawn.
Seriation
An alternative framework for solving the SSP by construction of a preferred ordering is seriation. The Robinson seriation problem [65] starts from a dissimilarity measure \(d:X\times X\rightarrow \mathbb {R}\), and seeks a linear order \(\rho\) on X that satisfies the inequality
$$\begin{aligned} \max \{d(\rho (i),\rho (j)),d(\rho (j),\rho (k))\}\le d(\rho (i),\rho (k))\,. \end{aligned}$$
(2)
A dissimilarity d for which an ordering \(\rho\) exists that satisfies Eq. (2) for all \(\rho (i)<\rho (j)<\rho (k)\) is called Robinsonian. It is worth noting that Robinsonian dissimilarities are intimately related with pyramidal clustering problems [66, 67].
The seriation problem [65, 68] consists of finding a total order for which the given pairwise distances violates the Robinson conditions as little as possible. To link this seriation problem with the Colored Multigraph Betweenness Problem or Betweenness Problem we consider a collections \({{\mathscr{C}}}\) of triples such that
$$\begin{aligned} (\rho (i),\rho (j),\rho (k))\in {{\mathscr{C}}} \text { implies} \max \{d(\rho (i),\rho (j)),d(\rho (j),\rho (k))\}< d(\rho (i),\rho (k)) \end{aligned}$$
(3)
Clearly, if the dissimilarity is Robinsonian, then \(\rho\) defines a total order on X that solves the Betweenness Problem for \((X,{{\mathscr{C}}})\).
The relevant optimization task in our context is therefore to minimize the number of ordered triples that violate Eq. (2). A variety of heuristics for this problem have been developed, see e.g. [69]. It is important to note, however, that in our setting the distance between alignment blocks is not defined directly. In order to obtain a seriation problem that approximates the Supergenome Sorting Problem we will have to resort to a heuristic that summarizes the distances between two blocks in all genomes and reflects the betweenness relationships. For pangenome-like models, the cost function advocated in [28] is a very plausible choice.
Graph simplification
Each of the plausible models for the “Supergenome Sorting Problem” discussed in the previous sections leads to NP-hard computational problems. The size of typical genome-wide alignments by far exceeds the range where exact solutions can be hoped for, except possibly for the smallest and most benign examples such as the ones used as examples in [17]. We therefore will have to resort to fast heuristics. In this section we focus on the conceptual ideas behind the simplification steps. More detailed implementation details are given in the "Methods" section.
Nevertheless it is possible to isolate certain sub-problems that can be solved exactly and independently of the remainder of the input graph. Since “linearized” portions of the vertex set can be contracted to a single vertex set, this leads to a reduction of problem size.
Lemma 3
If the supergenome graph \(\Gamma\) is a directed acyclic graph then topological sorting of \(\Gamma\) solves the Colored Multigraph Betweenness Problem.
Proof
In this case betweenness is established exactly by the directed paths in the DAG. Hence any topological sorting preserves all betweenness triples of G and thus presents a perfect solution to the Colored Multigraph Betweenness Problem as well.
This simple observation suggests to identify subgraphs with DAG structure and to replace them with a representative for each replaced DAG. These can later be replaced by the solution that is created with topological sorting. We note that this does not necessarily preserve optimality. It is conceivable that a local DAG structure has to be broken up into two disjoint subsets that are integrated in larger surrounding structures in a way that requires reversal of the edge directions in one or even both parts. Nevertheless, if the local DAG structures are sufficiently isolated they are likely to be part of the optimal solution as a unit. The motif that describes such a local DAG structure has been introduced as “superbubble” in the context of graph structures arising in sequence assembly problems [70, 71].
Source and sink vertices s in the supergenome graph with only a single neighbor t are conceptually a special case of superbubbles. These can be sorted together with their unique neighbor t. \(\Gamma\) is thus simplified by contracting s and t, i.e., placing the source s immediately before t and sink s immediate after t. An example of such a source and a sink can be seen in Fig. 4.
In some cases it is helpful to reverse the direction of the coordinate system of a single species. This is in particular the case when a single genome is reversed compared to all others. The inversion of an entire path does not change the solution of the Colored Multigraph Betweenness Problem but can make it easier to apply some of the reduction heuristics discussed above. In particular, if the relative orientation of the coordinatizations could be fixed in an optimal manner, the betweenness problem reduces to a much easier topological sorting problem. Finding this optimum, however, is equivalent to the Betweenness Problem, which is a NP-hard. Hence, we again have to resort to local heuristics.
Definition 7
Let \(\Gamma = (V,E)\) be a supergenome graph. A pair of vertices \(v,w\in V\) such that there are edges (v, w) and (w, v) in E is a mini cycle.
Mini-cycles are naturally removed by removing one of the two edge directions between v and w. More precisely, the less supported direction of an edge is dropped. The estimate for support is evaluated in a region around a mini-cycle since adjacent mini-cycles may yield contradictory majority votes.
Definition 8
Two mini-cycles are connected with each other if they share a vertex. A mini-cycle complex \(\mathfrak{C}\) is a maximal connected set of mini-cycles.
Lemma 4
The mini-cycle complexes of a supergenome graph
\(\Gamma\)
form a unique partition of the set of all minicycles. Any two classes of this partition are vertex and edge disjoint.
Proof
Consider the the graph H whose vertices are the mini-cycles, and there are edges between any two mini-cycles that share at least one vertex. Then every mini-cycle complex \(\mathfrak{C}\) is a connected component of H. Since the connected components of a graph are uniquely defined, disjoint, and form a cover, they partition the vertex set of H. Every minicycle, furthermore, forms a connected subgraph of \(\Gamma\) by construction. Since any two minicycles that contain a common vertex belong to the same mini-cycle complex, two mini-cycle complexes cannot have a vertex in common. This implies that they are also edge disjoint.
The mini-cycle complexes therefore can be resolved independently of each other. The target is to remove edges that create cycles in order to obtain a DAG that can then be subjected to topological sorting. However, this topological sorting is a solution of the Colored Multigraph Betweenness Problem for a subgraph. This is still a hard problem, so that we again utilize a heuristic approach. In this step we only attempt to remove mini-cycles. Cycles that connect mini-cycle complexes with each other or with other vertices in the graph are therefore left untouched and have to be dealt with in a subsequent step.
The local sorting within a complex \(\mathfrak{C}\) is achieved by considering adjacencies. To this end we annotate each adjacency with the number of edges and the ratio of the edges in the two directions. We identify the best supported edges as those with a high multiplicity and a strong bias for one direction over the other. This choice of a direction is then propagated. If a directed edge has more than one possible successor, we first propagate along the one with the largest support for the proposed direction. The issue now is when exactly to stop propagating this information. Clearly, it is forbidden to orient an edge that would close a directed cycle. Any such edge is instead seeded with the reverse directional information.
As part of this procedure it is possible that parts of a directed path from a given genome received contradictory orientations in different regions. If this is the case, the edge crossing the boundary between the differently oriented regions must be removed. Finally, the heuristic may terminate and still leave some edges unoriented. This indicates that the orientations are contradictory and need to be reversed. An example of the mini-cycle resolution process is shown in Fig. 5.