 Research
 Open Access
Coordinate systems for supergenomes
 Fabian Gärtner^{1, 2}Email authorView ORCID ID profile,
 Christian Höner zu Siederdissen^{2, 3},
 Lydia Müller^{1, 3, 4} and
 Peter F. Stadler^{1, 2, 3, 5, 6, 7, 8}
https://doi.org/10.1186/s1301501801334
© The Author(s) 2018
 Received: 14 June 2017
 Accepted: 7 September 2018
 Published: 24 September 2018
Abstract
Background
Genome sequences and genome annotation data have become available at ever increasing rates in response to the rapid progress in sequencing technologies. As a consequence the demand for methods supporting comparative, evolutionary analysis is also growing. In particular, efficient tools to visualizeomics data simultaneously for multiple species are sorely lacking. A first and crucial step in this direction is the construction of a common coordinate system. Since genomes not only differ by rearrangements but also by large insertions, deletions, and duplications, the use of a single reference genome is insufficient, in particular when the number of species becomes large.
Results
The computational problem then becomes to determine an order and orientations of optimal local alignments that are as colinear as possible with all the genome sequences. We first review the most prominent approaches to model the problem formally and then proceed to showing that it can be phrased as a particular variant of the Betweenness Problem. It is NP hard in general. As exact solutions are beyond reach for the problem sizes of practical interest, we introduce a collection of heuristic simplifiers to resolve ordering conflicts.
Conclusion
Benchmarks on reallife data ranging from bacterial to fly genomes demonstrate the feasibility of computing good common coordinate systems.
Keywords
 Comparative genomics
 Comparative transcriptomics
 Big data
 Graph theory
 Betweenness ordering
 Colored multigraph
 Combinatorial optimization
Background
The past decade has seen rapid progress of sequencing technologies [1]. The dramatic decrease of sequencing costs has enabled an everaccelerating flood of genomic and transcriptomic data [2] that in turn have lead to the development of a wide array of methods for data analysis. Despite recent efforts to study transcriptome evolution at large scales [3–7] the capability to analyze and integrate omics data in largescale phylogenetic comparisons lags far behind data generation. One key aspect of this shortcoming is the current lack of powerful tools for visualizing comparative omics data. Available tools such as [8, 9] have been designed with closely related species or strains in mind. The visualizations become difficult to read for multiple species and larger evolutionary distances, where homologous genomic regions may differ substantially in their lengths, an issue that becomes more pressing the larger regions of interest become. A common coordinate system for multiple genomes is not only a convenience for graphical representations of omics data, however. It would also greatly facilitate the systematic analysis of all those genomic features that are not sufficiently local to be completely contained within individual blocks of a genomewide multiple sequence alignment (gMSA).
Still, gMSAs are the natural starting point. Several pipelines to construct such alignments have been deployed over the past two decades, most prominently the tba/multiz pipeline [10, 11] employed by the UCSC genome browser and the Enredo/Pecan/Ortheus (EPO) pipeline [12] featured in the ensembl system. For the ENCODE project data, in addition alignments generated with MAVID [13] and MLAGAN [13] have become available, see [14] for a comparative assessment. A common feature of gMSAs is that they are composed of a large number of alignment blocks. At least in the case of MSAs of higher animals and plants the individual blocks are typically (much) smaller than individual genes. As a consequence, they are not readytouse for detailed comparative studies e.g. of transcriptome or epigenome [15] structure. In the gMSAbased splice site maps of [16], for example, it is easy to follow the evolution of individual splice junctions as they are localized within a block. At the same time it is difficult to collate the global differences of extended transcripts, which may span hundreds of blocks and to relate changes in transcript structure with genomic rearrangements, insertions of repetitive elements or deletion of chunks of sequence.
To a certain extent this problem is alleviated by considering the blocks arranged w.r.t. a reference genome. For many applications, however, this does not appear to be sufficient. For sufficiently similar genomes with only few rearrangements gMSA blocks are large or can at least be arranged so that large syntenic regions can be represented as a single aligned block. Any ordering of these large syntenic blocks, termed a supergenome in [17], then yields an informative common coordinate system. So far, this approach has been applied only to closely related procaryotic genomes. Prime examples are a detailed comparative analysis of the transcriptome of multiple isolates of Campylobacter jejuni [18] or the reconstruction of the phylogeny of mosses from the “nucleotide pangenome” of mitogenomic sequences [19]. We remark that some approaches to “pangenomes” are concerned with gMSAs of (usually large numbers of) closely related isolates; most of this literature, however, treats pangenomes as sets of orthologous genes [20].
Here we are concerned with the coordinatization of supergenomes, i.e., the question how gMSA blocks can be ordered in a way that facilitates comparative studies of genome annotation data. In contrast to previous work on supergenomes we are in particular interested in large animal and plant genomes and in large phylogenetic ranges. We therefore assume that we have short alignment blocks and abundant genome rearrangement, leaving only short sequences of alignment blocks that are perfectly syntenic between all genomes involved. The problem of optimally sorting the MSA blocks can, as we shall see, be regarded as a quite particular variant of a vertex ordering problem, a class of combinatorial problems that recently has received increasing attention in computer science [21–24]. In the computational biology literature, furthermore, several graphbased methods have been proposed to solve the problem of sorting sequence blocks for supergenomes, see e.g. [12, 25–29].
This contribution is organized as follows: In the following section we first analyze the concept of the supergenome and its relationship to gMSAs in detail. We then review combinatorial optimization problems that are closely related to the “supergenome sorting problem”, and argue that the most appropriate modeling leads to a special type of betweenness ordering problem. Next, we introduce a heuristic solution that is geared towards very large input alignments and proceeds by stepwise simplification of the supergenome multigraph. Finally, we outline a few computational results.
Theory
Genomewide multiple sequence alignments
Our starting point is a set of genome assemblies. For our purposes an assembly is simply a set of sequences representing chromosomes, scaffolds, reftigs, contigs, etc. In the following, we will use contig to refer to any such sequence. On each of these constituent sequences we assume the usual coordinate system defining sequence positions. Since DNA is double stranded, a piece of genomic sequence is either contained directly (\(\sigma = +\;1\)) in the assembly or it is represented by its reverse complement (\(\sigma = \;1\)). We write \((\mathcal {G},c,i,j,\sigma)\) to identify the sequence interval from positions i to j on contig c of genome assembly \(\mathcal {G}\) with reading direction \(\sigma\). We assume, w.l.o.g., \(i\le j\).
Most comparative methods require multiple sequence alignments (MSAs) as input. An MSA \(\mathfrak{A}\) is composed of alignment blocks, each of which consists of an alignment of sequence intervals. For the purposes of this paper it its sufficient to characterize an alignment block by the coordinates of its constituent sequence intervals. That is, a block \(B\in \mathfrak{A}\) has the form \(B=\{ (\mathcal {G}_u,c_u,i_u,j_u,\sigma _u)  u \in \text {rows of }B\}\) where the index u runs over the rows of the alignment block. It will be convenient to allow alignment blocks also to consist of a single interval only, thus referring to a piece of sequence that has not been aligned. Note that at this stage we do not assume that an alignment block contains only one interval from each assembly.
The projection \(\pi _\mathcal {G}(B)\) extracts from an alignment block the union of its constituent sequence intervals belonging to assembly \(\mathcal {G}\). If the assembly \(\mathcal {G}\) is not represented in the alignment block B we set \(\pi _\mathcal {G}(B)=\emptyset\).
The projection operation collapses pairs of overlapping sequence intervals (\(i \le i' \le j \le j'\)) in to a single interval: \((\mathcal {G},c,i,j,\sigma)\cup (\mathcal {G},c,i',j',\sigma ')= (\mathcal {G},c,i,j',+1)\) without regard for the orientation, which is set to \(+1\) and will have no bearing on the algorithm we develop further down.
The projection \(\pi _\mathcal {G}\) of \(\mathfrak{A}\) onto one of its constituent assemblies \(\mathcal {G}\) is the union of the sequence intervals from \(\mathcal {G}\) that are contained in its alignment blocks, i.e., \(\pi _\mathcal {G}(\mathfrak{A})=\bigcup _{B\in \mathfrak{A}} \pi _\mathcal {G}(B)\).
Definition 1

\(\mathfrak{A}\) is complete if \(\pi _\mathcal {G}(\mathfrak{A})=\mathcal {G}\), i.e., if each position in each assembly is represented in at least one alignment block.

\(\mathfrak{A}\) is irredundant \(\pi _\mathcal {G}(B')\cap \pi _\mathcal {G}(B'')=\emptyset\) for any two distinct blocks \(B'\) and \(B''\), i.e., if every sequence interval from assembly \(\mathcal {G}\) is contained in at most one alignment block.

\(\mathfrak{A}\) is injective if no alignment block comprises more than one interval from each of its constituent assemblies.
Clearly, every given MSA can easily be completed by simply adding all unaligned sequence intervals as additional blocks.
Just like a contig c in a (genome) assembly \(\mathcal {G}\), each block \(B\in \mathfrak{A}\) has an internal coordinate system defined by its columns. We write (B, k) for column k in block B. We write \(\ell (B)\) for the number of columns in block B. If \(\mathfrak{A}\) is irredundant, then there are functions \(f_{\mathcal {G},c}\) that map position i within \((\mathcal {G},c)\) to a corresponding MSA coordinate (B, k). If \(\mathfrak{A}\) is complete, the individual \(f_{\mathcal {G},c}\) can be combined to a single function \(f:(\mathcal {G},c,i)\mapsto (B,k)\). Completeness implies that every position \((\mathcal {G},c,i)\) is represented in the MSA, and irredundancy guarantees that the relation between assembly and alignment coordinates is a function by ensuring that \((\mathcal {G},c,i)\) corresponds to at most one alignment column. The following definition is therefore equivalent to the notion of a supergenome introduced in [17].
Definition 2
An MSA \(\mathfrak{A}\) is a supergenome if it is complete, irredundant, and injective.
The most commonly used genomewide MSAs cannot be completed to supergenomes. The MSAs produced by the multiz pipeline are usually not irredundant: different intervals of the “reference sequence” may be aligned to the same interval of another assembly. While multiz [11] alignments are injective this is in general not the case with the EPO [12] alignments. In these, multiple paralogous sequences from the same genome may appear in one alignment block.
Now consider an MSA \(\mathfrak{A}\) and an arbitrary order < of the alignment blocks of \(\mathfrak{A}\). Then there is a (unique) function \(\phi\) that maps the pairs (B, k) injectively to the interval [1, n], where \(n=\sum _{B\in \mathfrak{A}} \ell (B)\) is the total number of columns in \(\mathfrak{A}\) such that \(\phi (B,i)<\phi (B',i')\) whenever \(B<B'\) or \(B=B'\) and \(i<i'\). If \(\mathfrak{A}\) is a supergenome, then \(\phi (f)\) is clearly an injective function from a genome assembly \(\mathcal {G}\) to [1, n]. We call \(\phi (f(\mathcal {G},c,i))\) the coordinate of position i of contig c of assembly \(\mathcal {G}\) in the ordered supergenome \((\mathfrak{A},<)\).
As pointed out in [17], the existence of a coordinate system for the supergenome \(\mathfrak{A}\) is independent of the block order <. However, the order < is crucial for the practical use of the coordinate system.
Adjacency and betweenness of MSA blocks
The natural starting point for considering adjacency and betweenness of alignment blocks are their constituent intervals \((\mathcal {G},c,i,j,\sigma)\) on a fixed assembly \(\mathcal {G}\) and contig c. Intervals have a natural partial order defined by \((\mathcal {G},c,i,j,\sigma)\prec (\mathcal {G},c,k,l,\sigma)\) whenever \(i<k\) and \(j<l\). Two intervals are incomparable in this interval order if and only if one is contained in the other. Note that the interval order allows comparable intervals to overlap. We also consider intervals incomparable that belong to different contigs and/or assemblies.
Given three intervals \(\alpha =(\mathcal {G},c,i',j',\sigma ')\), \(\beta =(\mathcal {G},c,i'',j'',\sigma '')\), and \(\gamma =(\mathcal {G},c,i,j,\sigma)\) (on the same genome assembly and contig), we say that \(\gamma\) is between the two distinct intervals \(\alpha\) and \(\beta\) if \(\alpha \prec \gamma \prec \beta\) or \(\beta \prec \gamma \prec \alpha\).
Given a collection \(\mathcal {Q}\) of intervals on the same assembly \(\mathcal {G}\) and contig c, we say that \(\alpha =(\mathcal {G},c,i',j',\sigma ')\) and \(\beta =(\mathcal {G},c,i'',j'',\sigma '')\) are adjacent if there is no interval \(\gamma\) between \(\alpha\) and \(\beta\). We say that \(\alpha\) is a predecessor of \(\beta\) if \(\alpha\) and \(\beta\) are adjacent and \(\alpha \prec \beta\). Analogously, \(\alpha\) is a a successor of \(\beta\) if \(\alpha\) and \(\beta\) are adjacent and \(\beta \prec \alpha\).
Lemma 1
Let \(\mathfrak{A}\) be a supergenome and consider the the collection \(\{\pi _{\mathcal {G}}(B)B\in \mathfrak{A}\}\) of intervals on a given \(\mathcal {G}\). Then (i) no two intervals overlap, (ii) the interval order \(\prec\) is a total order on every contig c, (iii) every interval has at most one predecessor and one successor, and hence is adjacent to at most two intervals, and (iv) if \(\gamma\) is adjacent to both \(\alpha\) and \(\beta\), then \(\gamma\) is between \(\alpha\) and \(\beta\).
Proof
Property (i) follows directly from the condition. (ii) As a consequence, any two intervals on a fixed contig c are comparable, i.e., the restriction of \(\prec\) to c is a total order. Since only intervals on the same contig can be adjacent, (iii) is an immediate consequence of (ii) and the fact that the number intervals is finite. Property (iv) is now a trivial consequence of the fact that by (iii) \(\alpha\) and \(\beta\) must be the predecessor and successor of \(\gamma\).
A key construction in this contribution is the notion of betweenness relations for alignment blocks.
Definition 3
Given three blocks \(A,B,C \in \mathfrak{A}\), we say that C is between A and B with respect to \(\mathcal {G}\) if \(\pi _{\mathcal {G}}(C)\) is between \(\pi _{\mathcal {G}}(A)\) and \(\pi _{\mathcal {G}}(B)\). The ternary relation \({{\mathscr{C}}}{{(\mathfrak{A}}})\) is the defined by \((A,C,B)\in {{\mathscr{C}}}({{\mathfrak{A}}})\) whenever C is between A and B for some assembly \({\mathcal {G}}\).
We note that contradicting betweenness relations resulting from different genome assemblies \({\mathcal {G}}\) are to be expected, i.e., the relation \({{\mathscr{C}}}{{(\mathfrak{A})}}\) will in general not satisfy the properties of a wellformed betweenness relation. We will return to this issue below.
Definition 4
Two blocks \(A,B\in \mathfrak{A}\) are adjacent if there is an assembly \(\mathcal {G}\) such that \(\pi _{G}(A)\) and \(\pi _{G}(B)\) are adjacent w.r.t. \(\{\pi _{\mathcal {G}}(C)C\in \mathcal {A}\}\).
It is useful to regard \(\mathfrak{A}\) with its adjacency relation as a graph. In order to keep track of the individual contigs, we use an edgecolored multigraph, with \(\mathcal {G}\) serving as edge color.
Definition 5
The supergenome graph \(\Gamma (\mathfrak{A})\) of an MSA \(\mathfrak{A}\) is the directed, edgecolored multigraph whose vertices are the alignment blocks of \(\mathfrak{A}\) and whose directed edges (A, B) connect an alignment block A to an alignment block B with color \(\mathcal {G}\) whenever there are sequence intervals \(\alpha \in A\) is a predecessor of \(\beta \in B\) in assembly \(\mathcal {G}\).
The structure of the supergenome graph strongly depends on the evolutionary history of the genomes that it represents. In the absence of genome rearrangements (i.e., when the only genetic changes are substitutions, insertions (including duplications), and deletions) then all genomes remain colinear with their common ancestor. In other words, a single, canonical [30] global alignment describes a common coordinate system that is unique up to the (arbitrary) order of contigs and each trace [31] of insertions and deletions. In terms of the block adjacency relation, each block has at most two adjacent neighbors in this scenario.
Genome rearrangements are by no means infrequent events [32–35], and thus cannot be neglected. Every breakpoint introduced by a genome rearrangement operation, be it a local reversal or a cutandjoin type dislocation, introduces an ambiguous adjacency, i.e., a block that has two or more predecessors or successors. The task of identifying an appropriate ordering of the MSA blocks therefore is a nontrivial one for realistic data, even in the absence of alignment errors.
Modeling the “Supergenome Sorting Problem”
Informally, we may consider the “supergenome sorting problem” (SSP) as the task of finding an order < (or, equivalently, a permutation \(\rho\)) of the alignment blocks of \(\mathfrak{A}\) such that the orders of the constituent assemblies are preserved as much as possible. Somewhat more precisely, we are to find an order < on the vertex set of the supergenome graph \(\Gamma (\mathfrak{A})\) that as many of its directed edges as possible are “consistent” with the order <. It is not clear from the outset, however, how “consistency” should be defined for our application. A large number of related models have been proposed and analyzed in the literature that make this condition precise in different ways, leading to different combinatorial optimization problems. We therefore proceed with a brief review of some paradigmatic approaches. A reader primarily interested in our proposed approach to the problem might want to skip this section.
Hamiltonian paths
In this case there are only two biologically correct solutions: \(B_1<B_2<B_3<B_4\) (or the inverse order) and \(B_1<B_3<B_2<B_4\) (or its inverse). In either case, the solution contains two consecutive blocks (\(B_2\) and \(B_3\)) that are not adjacent in the block graph. This example also serves to demonstrate that the block graph alone does not contain the complete information on the supergenome. It appears that in addition we will need to know the betweenness relation among the blocks i.e., that both \(B_2\) and \(B_3\) are between \(B_1\) and \(B_4\).
Feedback arc sets and topological sorting
The key problem of modeling the SSP in terms of MFAS is highlighted in Fig. 2. It shows that even when undirected adjacencies would allow for a perfect solution, it may not be uncovered directly by the MFAS approach.
Simultaneous consecutive ones and matrix banding
Instead of adjacencies we may consider the incidence matrix \(\mathbf {A}\) of the supergenome graph \(\Gamma\) and try to sort both the alignment blocks and their adjacencies in such a way that, to the extent that this is possible, (i) adjacent blocks are consecutive and (ii) adjacencies that have a block in common are consecutive. In more formal terms, we wish to sort both the rows (alignment blocks) and columns (adjacencies) of the incidence matrix in such a way that rows and columns show all nonzero entries consecutively. A rectangular matrix \(\mathbf {A}\) that admits such a pair of row and column permutations is said to have the simultaneous consecutive ones property (C1S) [49]. This is possible if \(\Gamma\) is a union of paths. Note that instead of adjacencies we could also cover the graph with short paths \(\wp _k\). In this case column k identifies the vertices incident with path k. Again, if \(\Gamma\) is a union of paths, the pathincidence matrix satisfies (C1S). It is not difficult to see [49] that \(\mathbf {A}\) satisfies (C1S) if and only if \(\mathbf {A}\) has the wellstudied consecutive ones property [50, 51] for both its rows and columns. Thus (C1S) can be checked in linear time [50]. Furthermore, Tucker’s characterization of (C1S) in terms of forbidden submatrices [52] also carries over. Some direct connection between the consecutive ones property and the Betweenness Problem, which we will consider below, are discussed in [53].
In general, \(\mathbf {A}\) will now have the consecutive ones property. The problem of identifying a minimal number of columns (adjacencies) whose removal leaves a (C1S) matrix is NPcomplete [49]. In practise it may be desirable to quantify the extent of the violation of (C1S) in terms of intervals of consecutive zeros enclosed by the two 1s. For instance, one may want to use \(\omega =\sum _i h(\ell _i)\), where the sum runs over all intervals i of consecutive zeros enclosed by the two 1s, and \(h(\ell _i)\ge 0\) is some contribution that monotonically grows with the length \(\ell _i\) of the 0interval. For a given ordering of the rows and columns, the total violation is quantified as the sum of the \(\omega\) values. It should be noted, however, that (C1S) does not imply \(\Gamma\) is a union of disjoint paths, i.e., that \(\Gamma\) is a valid supergenome graph.
A related set of optimization problems is concerned with reducing the bandwidth of matrices, i.e., the maximal distance of nonzero entries from the diagonal (in a symmetric case) or the parameter \(\min (l,u)+l+u\) (for rectangular matrices); here \(u=\max _{\mathbf {A}_{ij}\ne 0} (ij)\) and \(l=\max _{\mathbf {A}_{ij}\ne 0} (ji)\) [54]. In the symmetric case, several good heuristics are known, starting with the CuthillMcKee [55] and GPS [56] algorithms even though the problem is NPhard [57], while the general case has received much less attention [54]. Bandwidth reduction methods do not eliminate “bad” adjacencies that eventually determine bandwidth. The resulting ordering of rows and columns thus may be very inaccurate locally.
Bidirected graphs
Several specialized graph structures have been introduced recently to tackle the problem of ordering sequence blocks, which is a problem very similar to SSP. Among these constructions are ABruijn graphs, Enredo graphs, and Cactus graphs, see [58] for a review. A key insight of [58] is that these graph representations are equivalent in the sense that they can be transformed into each other. They differ, however, in additional information extracted from the input alignment that is stored as vertex and edge labels, see Fig. 3. For instance, the bidirected graphs of [28] have the same underlying graph as our supergenome graph. While the supergenome graph used a standard directed graph structure, bidirected graphs encode directional information independently for the endpoint of each edge, distinguishing three cases: (i) adjacent alignment blocks have the same orientation, (ii) the connected blocks switch from minus to plus orientation, or (iii) vice versa. The latter two cases indicate a change of orientation between two changes. In [28], this basic structure is extended by additional transitive edges given by a legal path through the graph with an exponential weight function. The task is then to find a consistent (nonconflicting) set of edges with maximal weight. This form of the weight function takes into account that, due to genetic linkage, the more closely positioned on a genome two blocks are, the less likely it is that the edge between these blocks is broken. This gives higher weight to locally correct block positions and orientations.
While this sorting problem is (NP) hard in general, the sets of genomes considered by the authors are particularly suitable for this kind of calculations. The genomes considered for pangenome construction are typically closely related, or even of the same species. Thus one can expect many paths with high weights of the conserved consensus [29]. This approach also fits well to the analysis of genomic regions that are under constraint to maintain syntenic order for functional reasons, such as the MHC locus used as an example in [28]. In distantly related genomes, however, synteny tends not to be well preserved. In addition, we are interested in particular in data sets that contain genomes in preliminary draft forms, i.e., linkage information that is at least partially limited to short contigs or scaffolds. As a consequence we have less confidence in linkage and orientation information than one can expect in a typical pangenome scenario.
Sequence graphs
The sequence graphs of Haussler et al. [29] are also closely related to the supergenome graph representation \(\Gamma (\mathfrak{A})\) of a multiple alignment \(\mathfrak{A}\). The key difference is that the the orientation of the blocks is used to determine the direction of the edge. Two adjacent intervals with negative orientation thus imply an edge that is reversed compared to the supergenome graph. This situation is problematic in the case where the orientation of the intervals is switched. In [29] a preprocessing step is performed to minimize the number of such edges. The sequence graph approach was designed for the comparison of human genomes of different individuals. In such a scenario, the resulting information loss is small and does not present a practical problem. It is likely to become an issue, however, for large phylogenetic distances with frequent genome rearrangements.
The natural formulation of SSP on a sequence graph is to find a vertex ordering that minimizes weighted feedback edges and the average cut width. These optimization criteria ensure that the successor relations are mostly kept intact and at the same time successors are placed close to each other in the solution. Both problems the Minimum Feedback Arc Set Problem [37] and the Average CutWidth Minimization Problem [59–61] are known to be NPhard. Cutwidth minimization problems ask for a linear ordering of the vertices of a graph such that the average or maximum number of edges spanning across the gap between a pair of consecutive vertices is minimized. Conceptually, cutwidth problems are quite similar to bandwidth problems [62]. In [29] a heuristic is presented that first extracts a totally ordered “backbone” and then inserts the remaining vertices into the backbone order that is kept intact in the process. While the presence of a global common backbone order is a well founded assumption for pangenomes of a single or very closely related species, it is violated substantially for phylogenetically diverse data.
Betweenness problems
In this work we interpret SSP as a betweenness (ordering) problem rather than a vertex ordering problem on a directed graph. Instead of (oriented) adjacencies, which are defined on pairs of blocks, one considers the relative order of three blocks:
Betweenness Problem [63, 64]: Given a finite set X and a collection \({{\mathscr{C}}}\) of triples from X, is there a total order on X such that \(\forall (i,j,k)\in {{\mathscr{C}}}\) either \(i<j<k\) or \(i>j>k\)?
This decision problem is NPcomplete [63, 64]. A triple \((i,j,k)\in {{\mathscr{C}}}\) is called a betweenness triple.
The Betweenness Problem can be adapted to model the SSP by means of a suitable cost function b designed to penalize violations of the betweenness relation. Consider an order \(\rho\) used to coordinatize the supergenome. We may think of \(\rho\) as a bijective function from \([1,\ldots ,\mathfrak{A}]\rightarrow \mathfrak{A}\). In other words, we number the blocks contained in \(\mathfrak{A}\) and work on this set of block indices.
Since this decision problem is NPcomplete [63, 64], so is the problem of optimizing \(b(\rho)\) NPhard. The proof is shown in Additional file 1: Section 1 The cost function \(b(\rho)\) involves the sum over all triples of alignment blocks and thus is fairly expensive to evaluate. It is interesting in practice, therefore, to consider a modified cost function that restricts the sum in Eq. (1) to local information. This idea leads us to the rather natural extension of the Betweenness Problem to colored multigraphs.
Definition 6
Given an (undirected) colored multigraph \(\hat{\Gamma }=(V,E)\), the triple (i, j, k) is part of the collection \({{\mathscr{C}}}(\hat{\Gamma })\), iff there are edges \(\{i,j\} \in E\) and \(\{j,k\} \in E\) with color \(\mathcal {G}\).
Colored Multigraph Betweenness Decision Problem: given the colored multigraph \(\hat{\Gamma }=(V,E)\), is there a total order on V such that \(\forall (i,j,k)\in {{\mathscr{C}}}(\hat{\Gamma })\) either \(i<j<k\) or \(i>j>k\)?
The reformulation as an optimization problem that maximizes the number of edges is straightforward: Colored Multigraph Betweenness Problem: Given a colored multigraph \(\hat{\Gamma }=(V,E)\), find a total order on V such that \(E^* \subseteq E\) is maximal under the condition that \(\forall (i,j,k)\in {{\mathscr{C}}}(V,E^*)\) either \(i<j<k\) or \(i>j>k\).
This problem can be viewed as an analog of the Minimum Feedback Arc Set problem [38] for betweenness data. To our knowledge is has not been studied so far.
Lemma 2
The (decision version of the) Colored Multigraph Betweenness Problem is NPcomplete.
Proof
Every set \({{\mathscr{C}}}(\hat{\Gamma })\) of triples can be obtained from an edgecolored multigraph \(\hat{\Gamma }\) (with vertices corresponding to alignment blocks and colored edges corresponding to adjacencies deriving from a genome identified by the color). Thus, the total order on the vertices of \(\hat{\Gamma }\) is a solution of the Colored Multigraph Betweenness Problem if and only if the answer to the NPcomplete Betweenness Problem is positive. In Additional file 1: Section 1 the reduction in both directions is shown.
In the example of Fig. 2 the optimal solution of the Colored Multigraph Betweenness Problem retains all unordered adjacencies and creates a unique coordinatization (up to orientation) that leaves all alignment blocks ordered as drawn.
Seriation
The relevant optimization task in our context is therefore to minimize the number of ordered triples that violate Eq. (2). A variety of heuristics for this problem have been developed, see e.g. [69]. It is important to note, however, that in our setting the distance between alignment blocks is not defined directly. In order to obtain a seriation problem that approximates the Supergenome Sorting Problem we will have to resort to a heuristic that summarizes the distances between two blocks in all genomes and reflects the betweenness relationships. For pangenomelike models, the cost function advocated in [28] is a very plausible choice.
Graph simplification
Each of the plausible models for the “Supergenome Sorting Problem” discussed in the previous sections leads to NPhard computational problems. The size of typical genomewide alignments by far exceeds the range where exact solutions can be hoped for, except possibly for the smallest and most benign examples such as the ones used as examples in [17]. We therefore will have to resort to fast heuristics. In this section we focus on the conceptual ideas behind the simplification steps. More detailed implementation details are given in the "Methods" section.
Nevertheless it is possible to isolate certain subproblems that can be solved exactly and independently of the remainder of the input graph. Since “linearized” portions of the vertex set can be contracted to a single vertex set, this leads to a reduction of problem size.
Lemma 3
If the supergenome graph \(\Gamma\) is a directed acyclic graph then topological sorting of \(\Gamma\) solves the Colored Multigraph Betweenness Problem.
Proof
In this case betweenness is established exactly by the directed paths in the DAG. Hence any topological sorting preserves all betweenness triples of G and thus presents a perfect solution to the Colored Multigraph Betweenness Problem as well.
This simple observation suggests to identify subgraphs with DAG structure and to replace them with a representative for each replaced DAG. These can later be replaced by the solution that is created with topological sorting. We note that this does not necessarily preserve optimality. It is conceivable that a local DAG structure has to be broken up into two disjoint subsets that are integrated in larger surrounding structures in a way that requires reversal of the edge directions in one or even both parts. Nevertheless, if the local DAG structures are sufficiently isolated they are likely to be part of the optimal solution as a unit. The motif that describes such a local DAG structure has been introduced as “superbubble” in the context of graph structures arising in sequence assembly problems [70, 71].
In some cases it is helpful to reverse the direction of the coordinate system of a single species. This is in particular the case when a single genome is reversed compared to all others. The inversion of an entire path does not change the solution of the Colored Multigraph Betweenness Problem but can make it easier to apply some of the reduction heuristics discussed above. In particular, if the relative orientation of the coordinatizations could be fixed in an optimal manner, the betweenness problem reduces to a much easier topological sorting problem. Finding this optimum, however, is equivalent to the Betweenness Problem, which is a NPhard. Hence, we again have to resort to local heuristics.
Definition 7
Let \(\Gamma = (V,E)\) be a supergenome graph. A pair of vertices \(v,w\in V\) such that there are edges (v, w) and (w, v) in E is a mini cycle.
Minicycles are naturally removed by removing one of the two edge directions between v and w. More precisely, the less supported direction of an edge is dropped. The estimate for support is evaluated in a region around a minicycle since adjacent minicycles may yield contradictory majority votes.
Definition 8
Two minicycles are connected with each other if they share a vertex. A minicycle complex \(\mathfrak{C}\) is a maximal connected set of minicycles.
Lemma 4
The minicycle complexes of a supergenome graph \(\Gamma\) form a unique partition of the set of all minicycles. Any two classes of this partition are vertex and edge disjoint.
Proof
Consider the the graph H whose vertices are the minicycles, and there are edges between any two minicycles that share at least one vertex. Then every minicycle complex \(\mathfrak{C}\) is a connected component of H. Since the connected components of a graph are uniquely defined, disjoint, and form a cover, they partition the vertex set of H. Every minicycle, furthermore, forms a connected subgraph of \(\Gamma\) by construction. Since any two minicycles that contain a common vertex belong to the same minicycle complex, two minicycle complexes cannot have a vertex in common. This implies that they are also edge disjoint.
The local sorting within a complex \(\mathfrak{C}\) is achieved by considering adjacencies. To this end we annotate each adjacency with the number of edges and the ratio of the edges in the two directions. We identify the best supported edges as those with a high multiplicity and a strong bias for one direction over the other. This choice of a direction is then propagated. If a directed edge has more than one possible successor, we first propagate along the one with the largest support for the proposed direction. The issue now is when exactly to stop propagating this information. Clearly, it is forbidden to orient an edge that would close a directed cycle. Any such edge is instead seeded with the reverse directional information.
As part of this procedure it is possible that parts of a directed path from a given genome received contradictory orientations in different regions. If this is the case, the edge crossing the boundary between the differently oriented regions must be removed. Finally, the heuristic may terminate and still leave some edges unoriented. This indicates that the orientations are contradictory and need to be reversed. An example of the minicycle resolution process is shown in Fig. 5.
Methods
Curation of input data sets
We investigate here three genomewide data sets. The smallest set, referred to as \(\mathbf {B}\) (bacteria) below, is an alignment of four Salmonella enterica serovars. This alignment was produced with Cactus [27] using the Salmonella enterica Newport genome as reference and comprises 13, 416 blocks, 50, 932 sequence fragments, and 18, 047, 456 nucleotides. The mediumsize set, termed \(\mathbf {Y}\) (yeast), is an alignment of seven yeast species that uses the Saccharomyces cerevisiae genome as references. It comprises 49, 795 alignment blocks composed of 275, 484 sequences fragments that contains 71, 517, 259 nucleotides. The third, much larger set \(\mathbf {F}\) (fly) is a alignment of 27 insect species that uses the Drosophila melanogaster genome as references. It comprises 2, 112, 962 blocks composed of 36, 139, 620 sequence fragments hat contains 2, 172, 959, 429 nucleotides. For more detailed information of the data sets refer to Additional file 1: Section 2.
The two large genomewide multiple sequence alignments were produced by the multiz pipeline and were downloaded from the UCSC genome browser [72]. They are, as discussed above, injective but not irredundant. In order to remove spurious alignment blocks we filter the input blocks with respect to first length, then score, and finally mutual overlap. Very short alignment blocks are almost certainly either spurious matches or they were inserted to bridge gaps between larger blocks. Consequently, they convey little or no useful information for our purposes. We therefore remove all blocks with a length \(\le 10\) nt.
Since genomewide alignments tend to contain also very poorly aligned regions we require a minimum similarity, expressed here in the form of sumofpairs blastz scores [73]. Since these scale linearly with the number of columns \(\ell (B)\) of the alignment block B and the number \(\left( {\begin{array}{c}r\\ 2\end{array}}\right)\) of pairwise alignments formed by the r rows in B, we normalize with \(\left( {\begin{array}{c}r\\ 2\end{array}}\right) \ell (B)\) to obtain a similarity measure that is independent of the size of the alignment block. Based on the parametrization of blastz, we set the threshold at a normalized score of \(\;30\), which corresponds to the gap extension penalty.
The coordinatization of the supergenome depends on the uniqueness of coordinate projections. There are three major reasons to observe overlaps, i.e., genomic regions that appear in more than one alignment: (i) the sequence is duplicated in some species. Then multiz tends to align the corresponding unduplicated sequence to both duplicates. (ii) Spurious similarities in particular in poorly conserved regions may lead to alignments containing a sequence element twice at the expense of the second copy. (iii) Short overlaps at the end of blocks may appear due to difficulties in determining the exact ends of alignable regions. The first two causes introduce undesirable noise and uncertainties. Therefore, we remove all such overlapping blocks in which sequences from the species overlap. Since there is no easy way to determine which one of two overlapping blocks is likely correct, we opt to remove both copies. The third case, in contrast, does not disturb the relative order of alignment blocks and thus can be ignored. The overlap filter is applied after low quality alignments already have been removed from the data set.
We tolerate an overlap of 20 nt at the borders of alignment blocks. This cutoff is designed to remove ambiguous alignments, while avoiding the removal of alignment blocks that overlap by a few nucleotides owing to overlapping extensions of local blastz seeds. In addition we remove sequences that completely overlap other sequences regardless of their size to further reduce the noise introduced by spurious alignments. We opt here for a stringent procedure and remove all alignment blocks that contain sequences tagged for removal. In practice, this step removes only a tiny fraction of the blocks and thus does not significantly influence the coverage of the retained data.
The initial data filtering steps removed almost \(35\%\) (\(40\%\), \(30\%\)) of the blocks from data set \(\mathbf {F}\), (\(\mathbf {Y}\), \(\mathbf {B}\)). The majority were eliminated because of their minimal length. About \(8.5\%\) (\(27\%\), \(0\%\)) of the blocks were removed because they contained nonunique sequences. The sequences in the blocks that are removed with all filters contain less then \(15\%\) (\(26\%\), \(0.4\%\)) of the nucleotides in the alignment. Hence more than \(85\%\) (\(74\%\), \(99\%\)) of the sequence information of the alignment is intact and the quality of the data is significant better. A more detailed summary of the filtering is compiled in Additional file 1: Section 2.
Graph simplification and DAG construction
The algorithmic ideas and their justifications for the graph reduction steps have already been discussed in "Theory" section. Here we briefly address implementation issues as well as particular choices of cost functions and parameters that were discussed in a more general setting above.
The filtered data is used to create an initial supergenome graph. Then we iterate the three different graph simplifiers until no further reduction steps can be applied: the minicycle remover, the source/sink simplifier, and the superbubble simplifier. The individual simplifiers are straightforward implementations of the basic ideas outlined above. The minicycle remover first identifies the minicycles, aggregates them into nonoverlapping complexes, and then proceeds to remove contradictory edges in a greedy manner. The other two simplifiers first check for each vertex in the input graph whether it is a valid sink, source, or starting vertex of a superbubble. Pseudocode of the simplifiers is given in Additional file 1: Section 4.
The minicycle remover works more effectively on a single big complex than on many small ones separated by narrow gaps. The other two simplifiers therefore are applied until a fixed point is reached to close some of these gaps. The entire procedure is then iterated until the minicycle remover cannot change the graph any further.
Once a fixed point is reached we attempt to remove directed cycles. This amounts to solving the Minimum Feedback Arc Set Problem, which is known to be NPhard [37]. Given the size of our input graphs we have to resort to lineartime heuristics. We use Algorithm GR [38] because it is known to work particularly well on sparse graphs. Cycle removal typically creates new possibilities to simplify the graph. For instance, a sink is created whenever the last outgoing edge of a vertex is removed. The new sink can then be simplified further. The graph simplifiers are therefore applied again after the cycle removal step.
The minicycle remover is not used in this second pass because it is not applicable to DAGs by construction. Instead, we use a generalized version of the source/sink simplifier in which a source s may have more than a single successor v, provided v is a predecessor of all other successors of s. The position of source s in the DAG is determined by v and thus s can be placed immediately before v. The corresponding arrangements for a sink and its predecessor is treated analogously.
The running time and the minimization of the data set while this process is applied is shown in Figs. 6, 7 and 8.
Seriation
Finally, the common coordinate system is created by seriation of the DAG. The resulting supergenome, i.e. linear order of the vertices of the graphs corresponds to a linear order of all blocks. In particular, vertices resulting from a simplifier may contain more than one block. Those blocks, however, are already sorted and thus are inserted as a single block. Seriation is naturally divided into two steps. First, topological sorting is used to calculate an initial linear ordering from the DAG. Kahn’s algorithm [40] is a classical solution to the topological sorting problem. For our purposes it is desirable that, if possible, two nodes v and w are placed consecutively whenever there is an edge (v, w) in the final DAG. To this end we modify Kahn’s algorithm by sorting the successors of a node in the order of evidence for their adjacency in the original data.
We use a gradient descentlike optimization algorithm to minimize \(\tau\). We say that two nodes are siblings if they either share a predecessor in the DAG or if they are both sources. The move set for the gradient descent consists of swaps of siblings only. In addition, we allow to move a node directly in front of its sibling. The discrete gradient is computed exhaustively by generating and evaluating each potential move. Since nonoverlapping swaps do not influence each other, we greedily execute a maximal set of nonoverlapping swaps in a single optimization step.
Assessment of the quality of supergenome coordinate systems
Since no ground truth is available for this problem and the construction of simulated benchmarks for genome wide multiple sequence alignments would be a research project in its own right, we have to resort to measuring quantities that are informative about the final choice of the coordinate system.
A straightforward measure is the distribution of distances in the output coordinate system of alignment blocks that are contiguous in at least one input genome. Since we are not interested in the length of alignment blocks, distance is measured here not in terms of sequence length but in terms of the number of alignment blocks, so that adjacent blocks have distance 0. It is important here to keep track of the reading directions: contiguity with the same reading direction corresponds to preservation of the original genomic coordinates, while a change in reading direction indicates change of the order. Thus we distinguish preserved and inverted reading direction in our quantitative analysis.
Open reading frames (ORFs) are among the bestconserved features in the genome due to the strong selection pressures acting to preserve the corresponding proteins. As an immediate consequence we expect that ORFs are almost always preserved. This should be reflected also by the supergenome coordinates, i.e., blocks belonging to the same ORF should have only a small number (smaller then 5) of other blocks between them and retain their relative order. For higher eukaryotes, we cannot expect near perfect adjacency of coding blocks, however, because larger introns are subject to local rearrangements. To quantify the proximity of blocks of an ORF, the distances between all adjacent blocks are determined as described above and their absolute values are added up to yield a single characteristic value. In addition we count the number of exons that are “broken up” in the sense that consecutive pieces do not have consecutive coordinates or are placed in reverse order in the supergenome. Coding genes and exons are taken as annotated for the corresponding genomes. We note that in particular for large, intronrich genomes such as the insect data set \(\mathbf {F}\) this is an additional source of errors.
Results and discussion
We have devised a heuristic algorithm to extract a common coordinate system for a supergenome from a genomewide multiple sequence alignment. The procedure has been tested on three alignments of very different size and difficulty: an easy instance comprising four closely related bacterial species, an intermediate size problem composed of seven yeast genomes, and the alignment of 27 insect genomes as the most difficult instance.
Performance of individual components
The heuristic algorithm outlined above is composed of several largely independent components. It is of interest, therefore to consider their relative impact on the final results. We find that most edges are removed by the minicycle remover, with a small contribution of Algorithm GR. On the other hand, the largest reduction of the vertex set is due to the merges identified by the closed DAG simplifier. More quantitative information is compiled in Figs. 6, 7 and 8 and in Additional file 1: Section 8. The simplifiers reduce the graph size by about an order of magnitude in both the number of vertices and edges, reducing it in size and complexity to a point where the seriation heuristic operates efficiently. The relative improvement is smallest for the bacterial data set.
Since the Colored Multigraph Betweenness Problem cannot be solved exactly in reasonable time for instances with sizes that are of interest for our application at hand, we cannot measure performance relative to the exact solution. The multigraphs obtained from reallife alignments contain a large number of conflicting edges. In the most difficult data set, \(\mathbf {F}\), for instance, the final order keeps more than 95% of the initial edges.
Quality of supergenome coordinate systems
The quality of the coordinate systems strongly depends on the quality of the input alignments. A detailed discussion of issues with the input alignments can be found in Additional file 1: Section 8. Here, we focus on an assessment of the coordinate systems themselves.
The distribution of blockwise distances in the supergenome of alignment blocks that are consecutive in the original genome serves as a simple measure of preserved synteny. The results are summarized in Fig. 9 and presented in full detail in Additional file 1: Section 8. Another measure is how many of the input orders are preserved. To measure this we consider every alignment block and all successors from the different genomes. For the bacterial data set \(\mathbf {B}\) \(89\%\) of the successors preserve the order and \(80\%\) also preserve the adjacency. For the yeast data set \(\mathbf {Y}\) we observe that \(93\%\) of the successors preserve the order and \(84\%\) also preserve adjacency. This is a very encouraging result, taking in mind that every true genome rearrangement necessarily introduces at least one nonadjacency. Even in the much larger and more difficult insect set \(\mathbf {F}\) we still find \(70\%\) of the successors preserve the order and \(66\%\) also preserve adjacency. The overwhelming majority of noncontiguous successors are placed in the adjacent but orderreversed position, reflecting the level of local rearrangements in the insect data set. This is entirely reasonable given the much larger number of species and their larger phylogenetic depth compared to the yeast data. Taken together, these numbers already indicate that the supergenome coordinates are meaningful and indeed are likely a useful starting point for largescale multigenome comparisons.
Restricting our attention to coding sequences yields a more stringent quality measure, as shown in Fig. 10. As bacteria have essentially no introns, we expect that nearly all blocks belonging to the same ORF retain both adjacency and order. In the bacterial data set \(\mathbf {B}\) \(96\%\) of the ORFs are in one stretch with no interruption and less then \(1\%\) of the ORFs are broken. Since yeasts have few and short introns [74] we expect that data set \(\mathbf {Y}\) is also very wellbehaved in this respect. It contains 6062 ORFs annotated for Saccharomyces cerevisiae. Of these, 5474, i.e., \(90\%\), are consistently represented in the coordinate system. An additional 272 ORFs, about \(5\%\), have a distance of less then 100 blocks between them. Only 73, i.e., a bit more than \(1\%\) of the ORFs are broken. For Drosophila melanogaster are 167, 051 exons annotated, and part of the alignment \(\mathbf {F}\). Due to large and abundant introns the analysis is based on individual exons rather than complete ORFs for set \(\mathbf {F}\). We observe that \(95\%\) of the ORFs/exons are consistently represented. Only 779, about \(0.5\%\), are broken. Overall, thus, the supergenome coordinates behave very well for all three data sets.
As a specific example we consider the genes of the yeast TCA cycle [75] in more detail. It is one of the beststudied enzyme systems and known to be essential in S. cerevisiae. There, it comprises 20 genes [76–79], all of which are contained at least partially in the initial set of alignment blocks in the yeast data set \(\mathbf {Y}\). Only nine genes are included in their entirety, however. Seven of these nine are represented colinearly in single blocks. The alignments for KGD2 and SDH2 cover multiple MSA blocks and there is intervening genomic sequence in the input alignment, leading to noncontiguous placement in the final coordinate system. The alignment blocks referring to the remaining 11 genes are difficult to analyze and may contain misassigned sequences. This example, similar to several other loci, suggests that the quality of the input alignment rather than the complexity of the betweenness problem is the limiting factor for the construction of supergenome coordinate systems.
Conclusion
In this contribution we have shown that the problem of computing a common coordinate system for supergenomes with the Colored Multigraph Betweenness Problem is NPhard. It belongs to a class of relatively poorly studied betweenness problems for which few efficient heuristics have been developed so far. We introduced here several local simplification rules that can be applied iteratively to reduce the problem. It is important to note these reduction steps are only heuristics and do not guarantee optimal solutions. In conjunction with a simple serialization approach for the residual graph, they nevertheless yield practically useful results with acceptable computational efforts.
The most immediate application of the supergenome sorting problem is the direct comparison of genome annotations for multiple genomes. Hence it constitutes a prerequisite for comparative genome browsers. We have applied our approach to three reallife data sets of different sizes and difficulties. Our results indicate that practically useful coordinatizations can be computed. The computational requirement of the method scales favorably so that in principle even the largest genomewide multiple sequence alignments could be processed.
The present study, however, also highlights the shortcomings of currently available genomewide multiple sequence alignments [80, 81]. The issue is not only the relatively moderate coverage with alignment blocks that contain at least most of the species under consideration, but also the substantial fractions of alignment blocks that have been removed from our data set due to likely artefactual sequences. We have therefore not attempted to analyze the UCSC 100way vertebrate alignments, since these data are even more complex than the insect data due to the very large number of paralogs introduced by genome duplications.
Synteny, i.e., the preservation of relative genome order, is in general a good predictor for homology. This fact suggests to use the common coordinate system to identify likely homologous regions that are not included in the initial alignment blocks. These could then be (re)aligned at sequence level and included in a revised multiple sequence alignment. This, in turn, could yield an improved common coordinate system. The systematic improvement of genomewide alignments, albeit an interesting and extremely useful endeavor, is beyond the scope of this contribution.
Possible improvements of the approach taken here are conceivable in at least two directions. First, one may consider a hybrid algorithm that solves subgraphs with a dominant backbone use the method discussed in [29]. As discussed, we assume that large parts of the global graph structure are not amenable to such a solution, but it is also reasonable to assume that gene regions under strong conservation pressures can be solved fairly easily using a local backbonebased approach. A second venue of research is concerned with the determination of the final backbone order. Depending on the phylogenetic range under investigation, the ancestral gene order would provide a useful backbone based on the phylogeny of the species involved in the alignment.
Declarations
Authors' contributions
CHzS and PFS designed the study, FG and LM implemented the software and performed the computational analysis. All authors collaborated on the design of the algorithms and the overall work flow, the interpretation of results, and the writing of the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work was funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B). We acknowledge support for APCs from the German Research Foundation (DFG) and Universität Leipzig within the program of Open Access Publishing.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Gawad C, Koh W, Quake SR. Singlecell genome sequencing: current state of the science. Nat Rev Genet. 2016;17(3):175–88.View ArticleGoogle Scholar
 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.View ArticleGoogle Scholar
 Hezroni H, Koppstein D, Schwartz MG, Avrutin A, Bartel DP, Ulitsky I. Principles of long noncoding RNA evolution derived from direct comparison of transcriptomes in 17 species. Cell Rep. 2015;11:1110–22. https://doi.org/10.1016/j.celrep.2015.04.023.View ArticlePubMedPubMed CentralGoogle Scholar
 Lin S, Lin Y, Nery JR, Urich MA, Breschi A, Davis CA, Dobin A, Zaleski C, Beer MA, Chapman WC, Gingeras TR, Ecker JR, Snyder MP. Comparison of the transcriptional landscapes between human and mouse tissues. Proc Natl Acad Sci USA. 2014;111:17224–9. https://doi.org/10.1073/pnas.1413624111.View ArticlePubMedGoogle Scholar
 Necsulea A, Kaessmann H. Evolutionary dynamics of coding and noncoding transcriptomes. Nat Rev Genet. 2014;15:734–48. https://doi.org/10.1038/nrg3802.View ArticlePubMedGoogle Scholar
 Neme R, Tautz D. Fast turnover of genome transcription across evolutionary time exposes entire noncoding DNA to de novo gene emergence. Elife. 2016;5:e09977.View ArticleGoogle Scholar
 Washietl S, Kellis M, Garber M. Evolutionary dynamics and tissue specificity of human long noncodingRNAs in six mammals. Genome Res. 2014;24:616–28.View ArticleGoogle Scholar
 Nguyen N, Hickey G, Raney BJ, Armstrong J, Clawson H, Zweig A, Karolchik D, Kent WJ, Haussler D, Paten B. Comparative assembly hubs: webaccessible browsers for comparative genomics. Bioinformatics. 2014;30(23):3293–301.View ArticleGoogle Scholar
 Darling AE, Mau B, Perna NT. progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PloS ONE. 2010;5(6):11147.View ArticleGoogle Scholar
 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Humanmouse alignments with blastz. Genome Res. 2003;13(1):103–7.View ArticleGoogle Scholar
 Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED. Aligning multiple genomic sequences with the threaded blockse aligner. Genome Res. 2004;14(4):708–15.View ArticleGoogle Scholar
 Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. Enredo and pecan: genomewide mammalian consistencybased multiple alignment with paralogs. Genome Res. 2008;18:1814–28.View ArticleGoogle Scholar
 Bray N, Pachter L. MAVID: constrained ancestral alignment of multiple sequences. Genome Res. 2004;14:693–9. https://doi.org/10.1101/gr.1960404.View ArticlePubMedPubMed CentralGoogle Scholar
 Chen X, Tompa M. Comparative assessment of methods for aligning multiple genome sequences. Nat Biotech. 2010;28:567–72. https://doi.org/10.1038/nbt.1637.View ArticleGoogle Scholar
 Xiao S, Cao X, Zhong S. Comparative epigenomics: defining and utilizing epigenomic variations across species, timecourse, and individuals. Wiley Interdiscip Rev Syst Biol Med. 2014;6:345–52. https://doi.org/10.1002/wsbm.1274.View ArticlePubMedPubMed CentralGoogle Scholar
 Nitsche A, Rose D, Fasold M, Reiche K, Stadler PF. Comparison of splice sites reveals that long noncoding RNAs are evolutionarily well conserved. RNA. 2015;21:801–12. https://doi.org/10.1261/rna.046342.114.View ArticlePubMedPubMed CentralGoogle Scholar
 Herbig A, Jäger G, Battke F, Nieselt K. GenomeRing: alignment visualization based on SuperGenome coordinates. Bioinformatics. 2012;28:7–15.View ArticleGoogle Scholar
 Dugar G, Herbig A, Förstner KU, Heidrich N, Reinhardt R, Nieselt K, Sharma CM. Highresolution transcriptome maps reveal strainspecific regulatory features of multiple Campylobacter jejuni isolates. PLoS Genet. 2013;9:1003495. https://doi.org/10.1371/journal.pgen.1003495.View ArticleGoogle Scholar
 Goryunov DV, Nagaev BE, Nikolaev MY, Alexeevski AV, Troitsky AV. Moss phylogeny reconstruction using nucleotide pangenome of complete mitogenome sequences. Biochemistry (Mosc). 2015;80:1522–7. https://doi.org/10.1134/S0006297915110152.View ArticleGoogle Scholar
 Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. The microbial pangenome. Curr Op Genet Dev. 2005;15:589–94. https://doi.org/10.1016/j.gde.2005.09.006.View ArticleGoogle Scholar
 Bodlaender HL, Fomin FV, Koster AMCA, Kratsch D, Thilikos DM. A note on exact algorithms for vertex ordering problems on graphs. Theory Comput Syst. 2012;50:420–32.View ArticleGoogle Scholar
 Li K, Tang X, Veeravalli B, Li K. Scheduling precedence constrained stochastic tasks on heterogeneous cluster systems. IEEE Trans Comput. 2015;64(1):191–204.View ArticleGoogle Scholar
 Fellows MR, Hermelin D, Rosamond F, Shachnai H. Tractable parameterizations for the minimum linear arrangement problem. ACM Trans Comput Theory. 2016;8(2):6.View ArticleGoogle Scholar
 Pardo EG, Martí R, Duarte A. Linear layout problems. Berlin: Springer; 2016. p. 1–25.Google Scholar
 Kececioglu J. The maximum weight trace problem in multiple sequence alignment. Combinatorial pattern matching. Berlin: Springer; 1993. p. 106–19.Google Scholar
 Pevzner PA, Tang H, Tesler G. De novo repeat classification and fragment assembly. Genome Res. 2004;14(9):1786–96.View ArticleGoogle Scholar
 Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D. Cactus: algorithms for genome multiple sequence alignment. Genome Res. 2011;21(9):1512–28.View ArticleGoogle Scholar
 Nguyen N, Hickey G, Zerbino DR, Raney B, Earl D, Armstrong J, Kent WJ, Haussler D, Paten B. Building a pangenome reference for a population. J Comput Biol. 2015;22(5):387–401.View ArticleGoogle Scholar
 Haussler D, SmugaOtto M, Paten B, Novak AM, Nikitin S, Zueva M, Miagkov D. A flow procedure for the linearization of genome sequence graphs. In: International conference on research in computational molecular biology. Berlin: Springer; 2017, p. 34–49.Google Scholar
 Giegerich R. Explaining and controlling ambiguity in dynamic programming. Annual symposium on combinatorial pattern matching. Berlin: Springer; 2000. p. 46–59.Google Scholar
 Sankoff D. Time warps, string edits, and macromolecules. The theory and practice of sequence comparison, reading. Boston: AddisonWesley; 1983.Google Scholar
 Belda E, Moya A, Silva FJ. Genome rearrangement distances and gene order phylogeny in \(\gamma\)proteobacteria. Mol Biol Evol. 2005;22:1456–67. https://doi.org/10.1093/molbev/msi134.View ArticlePubMedGoogle Scholar
 Drillon G, Fischer G. Comparative study on synteny between yeasts and vertebrates. C R Biol. 2011;334:629–38. https://doi.org/10.1016/j.crvi.2011.05.011.View ArticlePubMedGoogle Scholar
 Fischer G, Rocha EPC, Brunet F, Vergassola M, Dujon B. Highly variable rates of genome rearrangements between hemiascomycetous yeast lineages. PLoS Genet. 2006;2:32. https://doi.org/10.1371/journal.pgen.0020032.View ArticleGoogle Scholar
 Friedberg R, Darling AE, Yancopoulos S. Genome rearrangement by the double cut and join operation. Methods Mol Biol. 2008;452:385–416.View ArticleGoogle Scholar
 ElMetwally S, Hamza T, Zakaria M, Helmy M. Nextgeneration sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol. 2013;9:1003345. https://doi.org/10.1371/journal.pcbi.1003345.View ArticleGoogle Scholar
 Karp RM. Reducibility among combinatorial problems. Complexity of computer computations. Berlin: Springer; 1972. p. 85–103.Google Scholar
 Eades P, Lin X, Smyth WF. A fast and effective heuristic for the feedback arc set problem. Inf Process Lett. 1993;47:319–23.View ArticleGoogle Scholar
 Saab Y. A fast and effective algorithm for the feedback arc set problem. J Heuristics. 2001;7:235–50. https://doi.org/10.1023/A:1011315014322.View ArticleGoogle Scholar
 Kahn AB. Topological sorting of large networks. Commun ACM. 1962;5(11):558–62.View ArticleGoogle Scholar
 Martí R, Reinelt G. The linear ordering problem: exact and heuristic methods in combinatorial optimization, vol. 175. Berlin: Springer; 2011.Google Scholar
 Grötschel M, Jünger M, Reinelt G. A cutting plane algorithm for the linear ordering problem. Oper Res. 1984;32:1195–220.View ArticleGoogle Scholar
 Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904;15:72–101.View ArticleGoogle Scholar
 Kendall MG. A new measure of rank correlation. Biometrika. 1938;30:81–93.View ArticleGoogle Scholar
 Fagin R, Kumar R, Sivakumar D. Comparing top \(k\) lists. SIAM J Discrete Math. 2003;17:134–60. https://doi.org/10.1137/S0895480102412856.View ArticleGoogle Scholar
 Fried C, Hordijk W, Prohaska SJ, Stadler CR, Stadler PF. The footprint sorting problem. J Chem Inf Comput Sci. 2004;44:332–8.View ArticleGoogle Scholar
 Collier JH, Konagurthu AS. An information measure for comparing top k lists. In: 2014 IEEE 10th international conference on escience, vol. 1. 2014, p. 127–34. https://doi.org/10.1109/eScience.2014.39.
 Bertrand D, Blanchette M, ElMabrouk N. Genetic map refinement using a comparative genomic approach. J Comput Biol. 2009;16:1475–86.View ArticleGoogle Scholar
 Oswald M, Reinelt G. The simultaneous consecutive ones problem. Theor Comput Sci. 2009;410:21–3.View ArticleGoogle Scholar
 Booth KS, Lueker GS. Testing for the consecutive ones property, interval graphs, and graph planarity using PQtree algorithms. J Comput Syst Sci. 1976;13:335–79.View ArticleGoogle Scholar
 Meidanis J, Porto O, Telles GP. On the consecutive ones property. Discrete Appl Math. 1998;88:325–54.View ArticleGoogle Scholar
 Tucker A. A structure theorem for the consecutive 1’s property. J Comb Theory B. 1972;12:153–62.View ArticleGoogle Scholar
 Christof T, Oswald M, Reinelt G. Consecutive ones and a betweenness problem in computational biology. In: Bixby RE, Boyd EA, RíosMercado RZ, eds. Integer programming and combinatorial optimization, vol. 1412. 1998, p. 213–28.View ArticleGoogle Scholar
 Reid JK, Scott JA. Reducing the total bandwidth of a sparse unsymmetric matrix. SIAM J Matrix Anal Appl. 2006;28:805–21.View ArticleGoogle Scholar
 Cuthill E, McKee J. Reducing the bandwidth of sparse symmetric matrices. In: Proceedings of 24th National Conference ACM. New York: ACM; 1969, p. 157–72. https://doi.org/10.1145/800195.805928.
 Gibbs NE, Poole WG Jr, Stockmeyer PK. An algorithm for reducing bandwidth and profile reduction algorithms. SIAM J Numer Anal. 1976;13:236–50.View ArticleGoogle Scholar
 Feige U. Coping with the NPhardness of the graph bandwidth problem. In: Algorithm Theory—SWAT 2000, vol. 1851. 2000, p. 129–45.View ArticleGoogle Scholar
 Kehr B, Trappe K, Holtgrewe M, Reinert K. Genome alignment with graph data structures: a comparison. BMC Bioinf. 2014;15(1):99.View ArticleGoogle Scholar
 Gavril F. Some NPcomplete problems on graphs. In: Proceedings of the 11th Conference on Information Sciences and Systems. Baltimore: Johns Hopkins University; 1977, p. 91–5.Google Scholar
 Makedon FS, Papadimitriou CH, Sudbourough IH. Topological bandwidth. SIAM J Algebraic Discrete Methods. 1985;6:418–44.View ArticleGoogle Scholar
 Martí R, Pantrigo JJ, Duarte A, Pardo EG. Branch and bound for the cutwidth minimization problem. Comput Oper Res. 2013;40:137–49.View ArticleGoogle Scholar
 Barth D, Pellegrini F, Raspaud A, Roman J. On bandwidth, cutwidth, and quotient graphs. Informatique théorique et applications. 1995;29:487–508.View ArticleGoogle Scholar
 Opatrny J. Total ordering problem. SIAM J Comput. 1979;8:111–4.View ArticleGoogle Scholar
 Chor B, Sudan M. A geometric approach to betweenness. SIAM J Discr Math. 1998;11:511–23.View ArticleGoogle Scholar
 Robinson WS. A method for chronologically ordering archaeological deposits. Amer Antiquity. 1951;16:293–301.View ArticleGoogle Scholar
 Bertrand P. Systems of sets such that each set properly intersects at most one other set—application to cluster analysis. Discrete Appl Math. 2008;156:1220–36.View ArticleGoogle Scholar
 Bertrand P, Diatta J. Multilevel clustering models and interval convexities. Discrete Appl Math. 2017;222:54–66. https://doi.org/10.1016/j.dam.2016.12.019.View ArticleGoogle Scholar
 Liiv I. Seriation and matrix reordering methods: an historical overview. Stat Anal Data Min. 2010;3:70–91.Google Scholar
 Hahsler M, Hornik K, Buchta C. Getting things in order: an introduction to the R package seriation. J Stat Softw. 2008;25:3.View ArticleGoogle Scholar
 Onodera T, Sadakane K, Shibuya T. Detecting superbubbles in assembly graphs. In: International workshop on algorithms in bioinformatics. Berlin: Springer; 2013, p. 338–48.Google Scholar
 Paten B, Novak AM, Garrison E, Hickey G. Superbubbles, ultrabubbles and cacti. In: International conference on research in computational molecular biology. Berlin: Springer; 2017, p. 173–89.Google Scholar
 Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at ucsc. Genome Res. 2002;12(6):996–1006.View ArticleGoogle Scholar
 Chiaromonte F, Yap V, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2001;7:115.Google Scholar
 Spingola M, Grate L, Haussler D, Ares M Jr. Genomewide bioinformatic and molecular analysis of introns in Saccharomyces cerevisiae. RNA. 1999;5:221–34.View ArticleGoogle Scholar
 Krebs H, Gurin S, Eggleston L. The pathway of oxidation of acetate in Baker’s yeast. Biochem J. 1952;51(5):614.View ArticleGoogle Scholar
 Saccharomyces Genome Database Community: SGD Yeast Pathway: Saccharomyces cerevisiae TCA cycle, aerobic respiration. http://pathway.yeastgenome.org/YEAST/NEWIMAGE?object=TCAEUKPWY. Accessed 18 May 2017.
 Haselbeck RJ, McAlisterHenn L. Function and expression of yeast mitochondrial nadand nadpspecific isocitrate dehydrogenases. J Biol Chem. 1993;268(16):12116–22.PubMedGoogle Scholar
 Oyedotun KS, Lemire BD. The carboxyl terminus of the Saccharomyces cerevisiae succinate dehydrogenase membrane subunit, sdh4p, is necessary for ubiquinone reduction and enzyme stability. J Biol Chem. 1997;272(50):31382–8.View ArticleGoogle Scholar
 Yasutake Y, Watanabe S, Yao M, Takada Y, Fukunaga N, Tanaka I. Crystal structure of the monomeric isocitrate dehydrogenase in the presence of nadp+ insight into the cofactor recognition, catalysis, and evolution. J Biol Chem. 2003;278(38):36897–904.View ArticleGoogle Scholar
 Earl D, Nguyen N, Hickey G, Harris RS, Fitzgerald S, Beal K, Seledtsov I, Molodtsov V, Raney BJ, Clawson H, Kim J, Kemena C, Chang JM, Erb I, Poliakov A, Hou M, Herrero J, Kent WJ, Solovyev V, Darling AE, Ma J, Notredame C, Brudno M, Dubchak I, Haussler D, Paten B. Alignathon: a competitive assessment of wholegenome alignment methods. Genome Res. 2014;24:2077–89. https://doi.org/10.1101/gr.174920.114.View ArticlePubMedPubMed CentralGoogle Scholar
 Ezawa K. Characterization of multiple sequence alignment errors using completelikelihood score and positionshift map. BMC Bioinf. 2016;17:133. https://doi.org/10.1186/s1285901609455.View ArticleGoogle Scholar