Coordinate systems for supergenomes

Background: Genome sequences and genome annotation data have become available at ever increasing rates in response to the rapid progress in sequencing technologies. As a consequence, the demand for methods supporting comparative, evolutionary analysis is also growing. In particular, efficient tools to visualize omics data simultaneously for multiple species are sorely lacking. A first and crucial step in this direction is the construction of a common coordinate system. Since genomes not only differ by rearrangements but also by large insertions, deletions, and duplications, the use of a single reference genome is insufficient, in particular when the number of species becomes large. Results: The computational problem then becomes to determine an order and orientations of optimal local alignments that are as co-linear as possible with all the genome sequences. We first review the most prominent approaches to model the problem formally and then proceed to show that it can be phrased as a particular variant of the Betweenness Problem, which is NP-hard in general. As exact solutions are beyond reach for the problem sizes of practical interest, we introduce a collection of heuristic simplifiers to resolve ordering conflicts. Conclusion: Benchmarks on real-life data ranging from bacterial to fly genomes demonstrate the feasibility of computing good common coordinate systems. Electronic supplementary material: The online version of this article (10.1186/s13015-018-0133-4) contains supplementary material, which is available to authorized users.


Betweenness Problem
Betweenness Decision Problem: Given a finite set X and a collection C of triples from X, is there a total order on X such that ∀(i, j, k) ∈ C either i < j < k or i > j > k?
Algorithm 2: The algorithm that reduces the Betweenness Decision Problem to the Colored Multigraph Betweenness Decision Problem, where G(Γ) are all colors in Γ, E_g is the set of all edges with color g, and "CMBDP(Γ)" applies the Colored Multigraph Betweenness Decision Problem.
Input : X and C
2 Collection E = Collection();
foreach g = (i, j, k) ∈ C do
3   Add edge (i, j) with color g to E;
4   Add edge (j, k) with color g to E;
5 end
6 Γ = (X, E);
7 return CMBDP(Γ)

Colored Multigraph Betweenness Problem: Given a colored multigraph Γ = (V, E), find a total order on V such that E* ⊆ E is maximal under the condition that ∀(i, j, k) ∈ C(V, E*) either i < j < k or i > j > k.
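To make the decision version concrete, a total order can be checked against a collection of betweenness triples in a few lines of Python (an illustrative helper, not part of the original implementation):

```python
def satisfies_betweenness(order, triples):
    """Check whether a total order satisfies all betweenness constraints.

    order:   list of the elements of X, in the proposed total order
    triples: iterable of constraints (i, j, k)
    """
    pos = {x: n for n, x in enumerate(order)}  # element -> rank
    return all(
        pos[i] < pos[j] < pos[k] or pos[i] > pos[j] > pos[k]
        for i, j, k in triples
    )
```

Note that for a triple (i, j, k) only the position of j between i and k matters; reversing the whole order leaves all constraints satisfied.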

Datasets
One small data set of different Salmonella strains is created with Cactus [2]. Two larger data sets are downloaded from the UCSC genome browser website. They can be found here: http://hgdownload.soe.ucsc.edu/downloads.html

B Data Set: 4way Salmonella
This data set is created with Cactus [2]. The genomes of four Salmonella strains are aligned. They are described in Table 1. Salmonella enterica Newport SL254 is used as reference genome for the alignment. The Salmonella alignment has 13,416 blocks. Those blocks contain 50,932 sequences which have a combined total length of 18,047,456 nucleotides. This corresponds to 91% of the combined genome length of the genomes in the alignment.
The filters described in Section 3 are applied to the alignment. Statistical results are shown in Table 2. The results differ significantly from the other data sets due to the alignment method used for this data set. Cactus creates no overlaps and assigns no scores to the blocks, so these two filters have no effect. On the other hand, Cactus creates many small blocks. Thus, the length filter discards 30% of all blocks. However, the lengths of the discarded blocks sum up to only 0.4% of the nucleotides in the alignment.

Y Data Set: 7way Yeast
The data set can be found here: http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/multiz7way. It consists of genome assemblies of seven yeast species. They are described in Table 3. S. cerevisiae is used as reference genome for the alignment. The yeast alignment has 49,795 blocks. Those blocks contain 275,484 sequences which have a combined total length of 71,517,259 nucleotides. This corresponds to 89.01% of the combined genome length of the genomes in the alignment.
The filters described in Section 3 are applied to the alignment. Statistical results are shown in Table 4.

F Data Set: 27way Insects
The genome assemblies are described in Table 5. D. melanogaster is used as reference genome for the alignment. The insect alignment has 2,112,962 blocks which contain 36,139,620 sequences. They have a combined length of 2,172,959,429 nucleotides, which corresponds to 38.45% of the combined genome length of the genomes in the alignment.
The filters described in Section 3 are applied to the alignment. The statistical results are shown in Table 6.

Table 6 An overview of the filters for the Insect alignment.

Filter    Blocks (b)    Sequence fragments (f)    Nucleotides (n)    b [%]    f [%]    n [%]
(none)    2,112,962     36,139,620                2,172,959,429      100      100      100
length    531,676       8,362,669                 30,024,535         25       23       1
score     25,557        384,853                   4,074,333          1        1        0
overlap   178,882       3,134,544                 285,684,545        8        9        13
all       736,115       11,882,066                319,783,413        35       33       15

3 Filter
Noisy data and the alignment procedure make it necessary to curate the input data before calculating the common coordinate system. As part of the curation of the input data sets, some blocks are filtered out. Here, we give the pseudocode of the three filters already described in the main manuscript. Additionally, we show some results that explain the chosen parameters. The filters are used to reduce the noise in the data. The noise has three main sources. Firstly, sequences may align to each other by pure chance. Secondly, multiz may produce small blocks that carry little or no information. Thirdly, duplication events lead to sequences which are not different enough to be aligned unambiguously and thus lead to multiple alignments containing the same sequence for at least one species.
Filters do not discard blocks completely but mark the corresponding blocks and sequences as thrown out using the function setThrowout(). The graph generation algorithms and simplifiers (see below) check the thrown-out flag using the function isThrowout(). Blocks and sequences for which the flag is set are skipped.

Length Filter
This filter targets the first and the second source of noisy data. Short blocks created by the aligner can be the result of random matches. We filter out blocks with a length of at most 10 nucleotides. This cutoff is chosen to discard random alignments but no useful information. Since DNA consists of 4 nucleotides, there are 4^10 = 1,048,576 possible sequences of length 10. Even the small yeast data set is 77 times longer than this number. Thus, such short alignments are likely the result of random matches.
Each block has an attribute that stores the length of the corresponding alignment. Each sequence in these blocks is at most as long as the alignment itself. Sequences might be shorter due to alignment gaps.
The pseudocode of the length filter is shown in Algorithm 5.
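For illustration, the filter can be sketched in a few lines of Python (blocks are modelled as plain dicts with hypothetical keys "length" and "thrown_out"; the actual pseudocode is Algorithm 5):

```python
def length_filter(blocks, min_length=10):
    """Mark all blocks whose alignment length is at most min_length
    as thrown out. Returns the number of discarded blocks."""
    discarded = 0
    for block in blocks:
        if block["length"] <= min_length:
            block["thrown_out"] = True  # mirrors setThrowout()
            discarded += 1
    return discarded
```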

Score Filter
Multiz may create alignments with bad scores as a side effect of the way the block borders are determined. These blocks likely contain no trustworthy information for the common coordinate system. Badly scored blocks are removed with the score filter. In the current implementation, scores are normalized by alignment length and number of aligned sequences, and a threshold of −30 is used for filtering out a block. The alignment scores given by the UCSC alignments are calculated based on the BlastZ scoring scheme, so the score filter is designed with this scheme in mind. If another scheme is used, the constants have to be adapted or the scores have to be recalculated using the BlastZ scoring scheme.

Table 7 The HOXD70 scoring scheme after [3]. In the BlastZ score it is used with affine gap scores, where the opening penalty is 400 and the extension penalty is 30.

The BlastZ scoring scheme is based on the HOXD70 scoring matrix [3] with affine gap scores. The scoring scheme is shown in Table 7. The gap opening penalty is 400 and the gap extension penalty is 30. It is used to compute all pairwise alignment scores. For the scoring of the multiple alignments of the MAF blocks, the sum-of-pairs score is applied, i.e. all pairwise scores are computed and then summed up to form the MAF block score.
A pairwise alignment that contains no information is an alignment where both sequences are aligned completely to gaps. The score of such an alignment with sequence lengths n and m would be

score(n, m) = −(400 + 30n) − (400 + 30m) = −800 − 30(n + m).

The alignment length then is n + m. This means that the normalized score is

score(n, m) / (n + m) = −30 − 800 / (n + m).

It is easy to see that for long sequences such an alignment has a normalized score of around −30. This is the reason why every block with a score of less than −30 most likely contains no useful information. Thus, blocks with a normalized score of at most −30 are discarded. The score is saved as an attribute of the block. The pseudocode of the score filter is shown in Algorithm 6.
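Assuming a per-gap cost of opening penalty plus extension penalty times gap length (the constants mirror the BlastZ values from the text), the limit can be checked numerically with a small sketch:

```python
GAP_OPEN = 400    # BlastZ gap opening penalty
GAP_EXTEND = 30   # BlastZ gap extension penalty

def normalized_all_gap_score(n, m):
    """Normalized score of a pairwise alignment in which two sequences
    of lengths n and m are aligned entirely against gaps."""
    score = -(GAP_OPEN + GAP_EXTEND * n) - (GAP_OPEN + GAP_EXTEND * m)
    return score / (n + m)  # normalize by alignment length n + m
```

For short sequences the normalized score is far below −30; as n + m grows it approaches −30 from below.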

Overlap Filter
One assumption of the golden genome is that the alignment is injective. However, multiz alignments are not injective. Thus, this property has to be actively created. This is done by filtering all overlapping sequences. Again, random alignments can be a problem. In particular, the exact start and stop positions of an alignment on a genome are often not clear. Therefore, a small number of nucleotides may be used in two alignments. Discarding all overlaps would thus mean a big data loss. In particular, small overlaps are not a problem for the general idea and thus should be kept.
To allow these short overlapping ends, an overlap of up to 20 nucleotides is allowed on both ends. The explanation for this cutoff is again the combinatorial argument. There are 4^20 possible sequences of 20 nucleotides. This number is larger than the size of the whole insect alignment. Thus, overlaps larger than 20 nucleotides indicate truly duplicated sequences.
Note that sequences in blocks shorter than 20 nucleotides may still overlap completely with other sequences without being filtered out. To avoid this, sequences completely overlapped by other sequences are filtered out. Hereby, a sequence may be completely covered by the sequence from the previous block, the next block, or by the previous and next block together.
Due to the fact that the sequences on one chromosome are sorted by their start positions, the calculation of the overlap is a local problem. Overlaps with blocks before and after the current block are calculated independently. The overlap filter marks the sequences as overlapping but not as thrown out, to be able to distinguish them from sequences thrown out by previous filtering steps. In a later step, the marked sequences are thrown out. Algorithm 7 shows the pseudocode of the overlap filter.
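The core overlap test can be sketched as follows (a simplified sketch: intervals on one chromosome, sorted by start position; only the overlap with the direct predecessor is checked, whereas Algorithm 7 also handles the following block):

```python
def overlap_filter(intervals, max_overlap=20):
    """intervals: (start, end) pairs sorted by start position on one
    chromosome (end exclusive). Flags the index of a sequence if it is
    completely covered by its predecessor or if the overlap with the
    predecessor exceeds max_overlap nucleotides."""
    flagged = set()
    for i in range(1, len(intervals)):
        prev_start, prev_end = intervals[i - 1]
        start, end = intervals[i]
        if end <= prev_end:                    # completely covered
            flagged.add(i)
        elif prev_end - start > max_overlap:   # overlap too large
            flagged.add(i)
    return flagged
```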

Simplifier
For the graph simplification, three simplifiers are used. They are shown here as pseudocode.
The simplifiers use three non-standard objects: the graph, the vertex, and the edge. The graph consists of vertices and edges and has the following functions:
• successor(vertex) Get the successors of the vertex.
• predecessor(vertex) Get the predecessors of the vertex.
• outDegree(vertex) Get the out-degree of the vertex.
• inDegree(vertex) Get the in-degree of the vertex.
• Check if there is an edge from the first to the second vertex.
• Create the induced subgraph that contains all vertices in a list.
• Remove a vertex and all corresponding edges from the graph.
• Remove an edge from the graph.
• outgoingEdges(vertex) Get all edges that go out of a vertex.
• incomingEdges(vertex) Get all edges that go into a vertex.
• add(vertex, vertex, edge) Add an edge from the first to the second vertex to the graph.
Furthermore, one can iterate over the vertices and edges of the graph.
A vertex can contain one or more alignment blocks. To manage this, the following functions are given:
• clear() Remove all blocks from the vertex.
• addAll(list) Add all blocks in the list to the vertex.
• Add all blocks of another vertex to the front of the vertex.
• Add all blocks of another vertex to the back of the vertex.
The edge stores, among others, a source and a target vertex. Source and target vertex are read-only attributes of the edge:
• source Get the source of the edge.
• target Get the target of the edge.
Additionally, each edge has a color. This attribute links to the chromosome associated with the edge.
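A minimal Python stand-in for the graph object could look as follows (method names not given in the list above, such as hasEdge and removeVertex, are assumptions):

```python
from collections import defaultdict

class Graph:
    """Minimal directed multigraph-style stand-in for the graph object
    used by the simplifiers (one edge per vertex pair for simplicity)."""

    def __init__(self):
        self._succ = defaultdict(dict)   # u -> {v: edge}
        self._pred = defaultdict(dict)   # v -> {u: edge}

    def add(self, u, v, edge):
        self._succ[u][v] = edge
        self._pred[v][u] = edge

    def successor(self, v):   return list(self._succ[v])
    def predecessor(self, v): return list(self._pred[v])
    def outDegree(self, v):   return len(self._succ[v])
    def inDegree(self, v):    return len(self._pred[v])
    def hasEdge(self, u, v):  return v in self._succ[u]
    def outgoingEdges(self, v): return list(self._succ[v].values())
    def incomingEdges(self, v): return list(self._pred[v].values())

    def removeVertex(self, v):
        """Remove the vertex and all corresponding edges."""
        for w in list(self._succ[v]):
            del self._pred[w][v]
        for w in list(self._pred[v]):
            del self._succ[w][v]
        self._succ.pop(v, None)
        self._pred.pop(v, None)
```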

Superbubble Simplifier
The pseudocode of the superbubble simplifier is shown in Algorithm 8. It is structured into three sub-tasks: searchDAG (Algorithm 9), sortDAG (Algorithm 11), and removeDAG (Algorithm 12). The searchDAG algorithm uses the function checkPredecessor (Algorithm 10), which checks whether all predecessors of a node are contained in a given DAG.

Algorithm 8:
The algorithm that simplifies the superbubbles in the graph.
1 function simplify (g);
Input : g the graph that should be simplified
Output : The number of removed vertices
sum += removeDAG (v, dag, g); /* remove obsolete vertices */
13 end
14 return sum
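The core idea of searchDAG, growing a candidate superbubble from a source and admitting a vertex only once all of its predecessors are inside (the role of checkPredecessor), can be sketched with dict-based adjacency instead of the graph object:

```python
def search_dag(source, succ, pred):
    """Grow a sub-DAG from `source`.

    succ/pred: dicts mapping a vertex to its set of successors and
    predecessors. A frontier vertex is absorbed only when all of its
    predecessors are already inside the DAG. Returns the vertex set
    of the candidate superbubble (simplified sketch)."""
    dag = {source}
    frontier = set(succ.get(source, ()))
    changed = True
    while changed:
        changed = False
        for v in list(frontier):
            if pred.get(v, set()) <= dag:      # all predecessors inside
                dag.add(v)
                frontier.discard(v)
                frontier |= set(succ.get(v, ())) - dag
                changed = True
    return dag
```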

Sink and Source Simplifier
The optimal positions of sinks and sources are directly behind their predecessors and directly before their successors, respectively. Algorithm 13 shows the complete implementation.

Mini-Cycle Complex Remover
The mini-cycle complex remover uses the following three objects:
• Complex
The complex contains a set of all connected mini-cycles. One can iterate over the vertices in the complex and check whether a vertex is in the complex.
- add(vertex) Add a new vertex to the complex.
- length The number of vertices in the complex.
- getBestAdjacency() Get the adjacency that has the best value.
- getBefore(vertex) Get all incoming adjacencies of the vertex.
- getAfter(vertex) Get all outgoing adjacencies of the vertex.
- remove(adjacency) Remove an adjacency from the complex.

• Order
A partial order that contains all decisions that are made in a complex.
- addSuccessor(vertex, vertex) Add the order information that the first vertex is smaller than the second vertex.
- Check if the direction of an adjacency is valid with respect to the order information.
- Check if the first vertex is a predecessor of the second vertex.

Algorithm 9: The algorithm that finds a superbubble with source v in a graph.
1 function searchDAG (v, g);
Input : v the vertex that is the designated source of the superbubble. g the graph that should be simplified.
Output : The found superbubble
2 list next = g.successor (v); /* get the successor vertices of v */
3 list dag; /* init new list of vertices */
4 dag.add (v);
5 boolean change = true; /* true as long as something changed */
6 while change do

• Adjacency
An adjacency in a complex. It bundles the information about all edges (both directions) of a mini-cycle and also contains direction information. It is possible to test whether an edge with a specific chromosome exists in the adjacency.
- source The source with respect to the direction information.
- target The target with respect to the direction information.
- getSwitch() Get the chromosomes of the edges that contradict the direction information.
- Switch the direction of the edge that belongs to a given chromosome.
The structure of the simplifier is shown in Algorithm 14. This function calls three sub-functions: findComplexe (Algorithm 15), decideComplex (Algorithm 16), and …

Algorithm 11: The algorithm that sorts the DAG. If a cycle is found, null is returned and nothing is changed in the graph. Otherwise, all vertices in v are added and new edges are added from the sink of the DAG.
Input : v the vertex that is the source of the DAG that is given as graph. Also the complete graph is given.
Output : True if a cycle is in the DAG, false otherwise.

Algorithm 12:
1 function removeDAG (v, dag, g);
Input : v the vertex that is the source of the DAG that is given as graph. Also the complete graph is given.
foreach Edge e ∈ g.outgoingEdges (v2) do /* all outgoing edges */

Algorithm 16: The algorithm that creates an order of the complex and thereby decides which edge is kept.

Minimum Feedback Arc Set problem
The order creation assumes a cycle-free graph. This is achieved by removing a minimum set of arcs that cause cycles. This Maximum Acyclic Subgraph or Minimum Feedback Arc Set problem is well known to be NP-hard [4]. It is fixed-parameter tractable (FPT) [5] but APX-hard [6]. Nevertheless, fast, practicable heuristics have been devised, see e.g. [7,8].
Given the size of our input graphs, we have to resort to linear-time heuristics. We use a modified version of Algorithm GR [7] because it is known to work particularly well on sparse graphs.
The idea is to create an order π from the graph and then remove every edge that goes from a vertex i to a vertex j with π(j) < π(i), where π(j) is the position of j in the order π. This guarantees a cycle-free output graph. The tricky part is to create this order in a way that removes as few edges as possible.
For this, the order is split into two parts, a front part π1 and a back part π2, where π = π1 π2. A vertex is removed from the graph and is either appended to π1 or added to the front of π2. A simple observation is that all sources can be appended to π1 and all sinks can be added to π2 without losing any edge.
Edges are only lost when no sources or sinks are left in the graph. Then a vertex v with a maximal value of d_out(v) − d_in(v) is removed, where d_out is the out-degree and d_in is the in-degree of v. In other words, it has many outgoing and few incoming edges. When v is appended to π1, only few edges are removed (the incoming edges) and many edges are saved (the outgoing edges).
We made only one modification to the algorithm. In the original Algorithm GR, a random vertex with the highest value is used. We instead prefer a vertex with a predecessor that is already appended to π1. This minimizes the number of sources that are created without influencing the performance of the algorithm. In fact, it improved the results of the heuristic on our data sets (Table 8).

Table 8 An overview of the heuristics that solve the FAS problem. The columns are the different data sets; the rows are the different algorithms. The first row provides the information of the original graphs of the data sets. The absolute numbers of edges are shown. The percentage of edges kept from the original graph is shown in brackets.
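The GR heuristic itself can be sketched in Python as follows (a plain version without the modification described above; ties in d_out − d_in are broken arbitrarily, and the efficient VertexCategorizer bookkeeping is replaced by recomputation for clarity, so this sketch is not linear-time):

```python
def gr_order(vertices, edges):
    """Greedy-Removal (GR) heuristic for the minimum feedback arc set
    problem, after Eades et al. [7] (simplified sketch).
    Returns (order, kept_edges); removed edges are the feedback set."""
    succ = {v: set() for v in vertices}
    pred = {v: set() for v in vertices}
    for u, v in edges:
        if u != v:                      # ignore self-loops
            succ[u].add(v)
            pred[v].add(u)
    remaining = set(vertices)
    front, back = [], []                # pi_1 and pi_2

    def detach(v):
        remaining.remove(v)
        for w in succ[v]:
            pred[w].discard(v)
        for w in pred[v]:
            succ[w].discard(v)

    while remaining:
        changed = True
        while changed:
            changed = False
            for v in list(remaining):   # peel sinks -> front of pi_2
                if not (succ[v] & remaining):
                    detach(v)
                    back.insert(0, v)
                    changed = True
            for v in list(remaining):   # peel sources -> end of pi_1
                if not (pred[v] & remaining):
                    detach(v)
                    front.append(v)
                    changed = True
        if remaining:                   # no source/sink left: greedy pick
            v = max(remaining, key=lambda x:
                    len(succ[x] & remaining) - len(pred[x] & remaining))
            detach(v)
            front.append(v)
    order = {v: i for i, v in enumerate(front + back)}
    kept = [(u, v) for u, v in edges if order[u] < order[v]]
    return front + back, kept
```

On a directed acyclic input, the greedy branch is never reached and no edge is removed; on a 3-cycle, exactly one edge is sacrificed.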

This modification is implemented in the function Eades Serialization shown in Algorithm 19. It has a time complexity of O(|E| + |V|). The efficient organization of the vertex degrees is done by a class "VertexCategorizer". This class keeps all vertices sorted in lists by their value and has special functions that handle the removal of a vertex efficiently. This removal handling is also described in the original paper [7]. The only interesting and changed function is the getMax() function, which returns the next vertex to be removed when no sink or source exists in the graph.
The getMax() function is shown in Algorithm 20 and uses the function getMaxBucket(), which returns the list of vertices with the highest value. If no such list exists, it returns null. Additionally, the set "nosource" is used. It is a class member of "VertexCategorizer" and contains all vertices that can be removed without creating a source in the graph, i.e. a predecessor of such a vertex is already in π1.
Algorithm 19: The algorithm that creates an order. This order is used to remove edges from the graph, which destroys all cycles in the graph.
1 function serialize (g);
Input : g the graph from which the cycles should be removed.
Output : The order which can be used to remove the cycles.

Algorithm 20: The function from VertexCategorizer that decides which vertex is removed next from the graph.
1 function getMax ();
Output : The next vertex to remove from the graph.

Topological Sorting
Topological sorting is performed using an adaptation of a method presented by A. B. Kahn [9]. There are two variants. One is the direct implementation of the method: it holds newly found sources in a queue that operates according to the first-in, first-out (FIFO) scheme (Algorithm 21). The second variant uses a stack for the newly found sources, i.e. a queue operating according to the last-in, first-out (LIFO) scheme (Algorithm 22). Both lead to valid topological orders, differing in the parts of the order where the order information is ambiguous. The second one creates the order that is better for our purpose. We generated results with both variants to show the differences. Both queue and stack have two functions and one attribute: • add(vertex) Add the vertex to the queue/stack. • pop() Get and remove the next outgoing element according to the used scheme.

• length
Get the number of elements in the queue/stack.

Algorithm 21:
The simple Kahn algorithm that creates a topological sorting of a graph.
1 function toposort (g);
Input : g the graph that should be ordered.
Output : The order as a list.
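The two variants differ only in how newly found sources are taken from the frontier; a compact Python sketch of this FIFO/LIFO distinction (dict-based adjacency instead of the graph object):

```python
from collections import deque, defaultdict

def kahn(vertices, edges, lifo=False):
    """Kahn's topological sort. With lifo=False newly found sources are
    consumed first-in, first-out (Algorithm 21 style); with lifo=True
    the frontier behaves like a stack (Algorithm 22 style)."""
    indeg = {v: 0 for v in vertices}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    frontier = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while frontier:
        v = frontier.pop() if lifo else frontier.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:           # w became a new source
                frontier.append(w)
    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle")
    return order
```

On a diamond-shaped DAG the two variants interleave the ambiguous middle vertices differently while both remain valid topological orders.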

Optimization
The optimization has two main components: the validation of the order and the re-ordering. The validation is done by the class RobinsonQuality. This can be done in a simple way by checking the neighborhood of every vertex and counting the number of violations of the Robinson inequality (see main manuscript). The neighborhood is defined by the number of hops (h). This means that a vertex is seen as a neighbor if a path from one to the other exists that has at most length h. Consequently, the maximum neighborhood size is k^h with k being the maximal out-degree. Thus, this validation has linear time complexity with respect to the number of vertices. For simplicity, this class is not shown in more detail.
Besides the validation of an existing order, the evaluation of potential changes is required for an efficient optimization. Therefore, the class RobinsonQuality provides the function predictValue that takes the order and a change object as input and returns the number of violations of the Robinson inequality, taking the changes in the change object into account. This makes it possible to evaluate a movement before it is applied to the order.
The re-ordering is done in a loop until no changes that optimize the order are found. This procedure is shown in Algorithm 23. It calls three other functions. First, a list of candidates is created by the function "initCandidates". These candidates are created only once. A candidate consists of two vertices. The algorithm iterates through the candidates and tries two changes for each of them. The changes are found with the functions "moveToFront" and "swapSiblings". The best non-overlapping changes are finally applied to the order.
The candidates are created by finding pairs of siblings. Two vertices are siblings if they have a direct predecessor in common or if they are both sources. The function that creates the candidates is shown in Algorithm 24.
Finally, we describe the two move functions: moveToFront and swapSiblings. These two functions use three helper functions. The first one is called "checkPreIsNotBetween". It checks if a predecessor is in the same block and if a predecessor is between the blocks. This is shown in Algorithm 25. The second function is "checkPreInBlock" and checks if a predecessor is in the same block (Algorithm 26). The last function is "checkSucIsNotBetween". It checks if a successor is between the blocks and is shown in Algorithm 27.
The function "moveToFront" performs a move-to-front change on the order. This means that a block of vertices after the source position is moved to the target position. The moved block is as large as possible, i.e. all vertices after the source position are added to the block as long as no predecessor is between the target position and the source position. The function is shown in Algorithm 28.
The second function, "swapSiblings", performs a swap on the order. This means that a block of vertices after the source position is swapped with a block of vertices after the target position. The blocks are as large as possible but connected. Connected means that, besides the first vertex, every other vertex has a predecessor in the block. The function is shown in Algorithm 29.
Algorithm 23: The basic sort algorithm that re-orders until no further optimization is found.

Data distribution
As a simple statistical quality assessment, we calculated the coverage of the reference species of the alignments with alignment blocks. To quantify positional effects, the coverage is assessed for sliding windows along the genome. We furthermore weight the coverage with the number of different species in the respective blocks. This can be done before and after the filtering. Plotting the resulting data into the same coordinate system enables a direct comparison of the coverage by the genome graph before and after filtering. It allows one to conclude whether filtering affects specific genome positions more than others.
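This windowed, species-weighted coverage can be sketched as follows (the interval representation of the blocks is an assumption, and where blocks overlap the maximum species count per position is taken, which is a simplification of the original tooling):

```python
def windowed_coverage(blocks, genome_length, window):
    """blocks: (start, end, n_species) intervals on the reference genome
    (0-based, end exclusive). Returns one value per window: the average
    species count over the positions of that window."""
    per_base = [0] * genome_length
    for start, end, n_species in blocks:
        for pos in range(start, min(end, genome_length)):
            per_base[pos] = max(per_base[pos], n_species)
    return [
        sum(chunk) / len(chunk)
        for chunk in (per_base[i:i + window]
                      for i in range(0, genome_length, window))
    ]
```

Running it on the blocks before and on the blocks after filtering yields the two curves that are plotted into the same coordinate system.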
The plots are generated for data set Y using Saccharomyces cerevisiae as reference and for data set F using Drosophila melanogaster as reference. The results show that for data set Y the coverage is generally high. The filtering leads to local loss of coverage but does not change the generally high coverage. In the case of data set F, the situation is different. Here, six genomic regions already have a low coverage before filtering (end of 2L, beginning of 2R, end of 3L, beginning of 3R, end of X, and the complete Y chromosome). Those genomic regions overlap with repeat regions of the chromosomes. Since repeat regions are very similar to each other and therefore hard to align, it is not surprising to observe a low coverage for those regions when using an alignment technique as done for this data set. For the remaining genome, the result is the same as for data set Y: the coverage is high before filtering and the filtering only causes local loss of coverage.
After the filtering, the input alignments are transformed into the supergenome graph and all further calculations only make use of the supergenome. This graph is then modified in three steps: a first run of simplifiers, the application of the FAS algorithm, and a second run of simplifiers. Three simplifiers are applied to the graph in the first run: the mini-cycle remover, the source/sink simplifier, and the superbubble simplifier. In the second run, the mini-cycle remover is replaced by the advanced source/sink simplifier.
Figure 1 The data distribution on the reference genomes for data set Y (a) and data set F (b). The black line is the data distribution before any filter is used and the red line after all filters are applied to the data set. The y-axis shows how many different species are contained in the alignment blocks at this position; it is thus a value between 0 and 7 in (a) and between 0 and 27 in (b). To make the information more readable, it is given for regions rather than for every nucleotide: in case of data set Y, 100 nt form one data point, and in case of data set F, 10,000 nt form one data point. In (b), the repeats according to RepeatMasker are additionally shown in green.

Graph edit statistic
For every graph, three edit steps are performed: the first run of simplifiers, the removal of all cycles (FAS), and the second run of simplifiers. All these steps remove edges from the graph. The mini-cycle remover and FAS remove edges to break cycles in the graph. In the remaining cases, edges are lost as a side effect of the merging of connected vertices and the translation of the edges that connect them. The different steps are repeated by the edit tools several times. To give an idea of the time consumption of each tool in every step, an example time consumption is given. This is the result of an example run on a machine with an Intel(R) Xeon(R) CPU with 2.27 GHz and enough RAM to fit the problem size. However, the exact numbers are not important. They rather provide an impression of how long each step takes compared with the other steps. Moreover, they show that the simplifiers can be computed even for big data sets in a feasible time on a normal machine.
The results are shown in Tables 9, 10, and 11. Here, the three steps are indicated by horizontal lines. For every tool, we list how many vertices and edges have been removed along with the used time and the number of applications (the number of distinct positions at which the tool could be used). For data set F, it is remarkable that most of the compute time is used in one big mini-cycle complex. The complex has 896,082 vertices and it took 10,144 seconds to decompose it.

Graph properties
The graph editing steps create different temporary graphs. After each step, some core information of the temporary graphs is collected. This comprises the number of vertices, edges, and connected components (C. C.), the median in-degree (In-Degree) and median out-degree (Out-Degree), the mean number of blocks per vertex (Mean B.), and the maximal number of blocks. As a reference, we calculated these values also for the original graph. The three temporary graphs are: the graph after the first run of simplifiers (Simplifier 1), the graph after the FAS algorithm is applied (FAS), and the graph after the second run of simplifiers (Simplifier 2). The data is shown in Tables 12, 13, and 14. In case of data set B, the second connected component is the third plasmid of Salmonella enterica Newport SL254, which is, with fewer than 4k nucleotides, very small and does not align to anything. In case of data set F, the second connected component is created by the FAS algorithm. This can happen since the FAS algorithm is a heuristic. However, it contains only a handful of vertices and is simplified to one vertex in the next step.

Successor statistic
Most sequences have one successor sequence in the data set. This successor sequence is the sequence that follows the sequence on the contig. Note that the successor depends on the reading direction of the sequence. If it is negative, the successor sequence lies directly before the sequence in contig reading direction. If the reading direction is positive, the successor is the sequence that lies directly behind the sequence in contig reading direction. These sequences belong to blocks that are contained in the resulting order. Thus, the distance in blocks between the sequence itself and its successor can be calculated using the order. The distance is calculated as the number of blocks between them, so adjacent blocks have a distance of zero. The data is split into positive distances, where the successor block is behind the sequence, and negative distances, where the successor is before the sequence. Since the contig reading direction is arbitrary, we use the reading direction creating more positive distances for a contig. The data is shown in Tables 15, 16, and 17. The data is binned into different distance ranges. Different ranges and all successors are evaluated for each direction. The absolute numbers as well as the fraction of all successors in the data set are given.
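The block distance between a sequence and its successor can be computed from the final order as follows (a sketch with hypothetical data structures; adjacent blocks yield distance 0, successors lying earlier in the order yield negative values):

```python
def successor_distances(order, successor):
    """order: list of block ids in the final coordinate system.
    successor: dict mapping a block to the block that holds the
    successor sequence. The distance counts the blocks between the
    two; the sign indicates whether the successor lies behind (+)
    or before (-) the block."""
    pos = {b: i for i, b in enumerate(order)}
    dists = []
    for b, s in successor.items():
        d = pos[s] - pos[b]
        dists.append(d - 1 if d > 0 else d + 1)  # adjacent -> 0
    return dists
```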

ORF statistic
ORFs provide an estimate of how many biological entities are retained in the correct order. Thus, we calculate how many ORFs are kept together in the final order. This is motivated by the idea that ORFs should not be fragmented in the process as long as the genes have the same biological functions in the species in the alignment. The ORFs are obtained from the NCBI annotations of the reference species. For data set F, not the complete ORFs but exons are used.
To measure how fragmented an ORF is, the distances between all adjacent blocks in the ORF are added up to the total size of the gaps between the blocks of the ORF. The gap sizes of the ORFs are then binned. Furthermore, for each bin, the number of broken ORFs and the number of ORFs with more than one block are counted. Broken means that the relative order of the blocks is not kept in the genome order. Thus, a block exists in the ORF where both adjacent blocks are before it or both are after it in the order.
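The two quantities defined above, total gap size and brokenness, can be sketched as follows (positions in the final order as plain integers; an assumed representation, not the original tooling):

```python
def orf_stats(block_positions):
    """block_positions: positions (in the final genome order) of the
    blocks of one ORF, listed along the ORF. Returns the total gap size
    (blocks between adjacent ORF blocks) and whether the ORF is broken,
    i.e. its blocks change direction in the genome order."""
    pairs = list(zip(block_positions, block_positions[1:]))
    gaps = sum(abs(b - a) - 1 for a, b in pairs)
    broken = any(
        (b - a) * (c - b) < 0          # sign change: a middle block has
        for (a, b), (_, c) in zip(pairs, pairs[1:])
    )                                  # both neighbours on one side
    return gaps, broken
```

An ORF whose blocks run strictly forwards or strictly backwards is not broken; a direction change in the middle is.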
The results are shown in Tables 18, 19, and 20.

Comparative analysis of the tricarboxylic acid cycle in yeast
The tricarboxylic acid (TCA) cycle for aerobic respiration is well studied and discussed in yeast (e.g. [10]). In S. cerevisiae, 20 genes belong to the TCA cycle [11,12,13,14]. All of them are part of the initial set of alignment blocks in the yeast data set (Y). The TCA cycle is, at least for S. cerevisiae, essential. Consequently, we expect the genes of the TCA cycle to be conserved in all yeast species in the yeast data set. All genes are distributed over at least two blocks in S. cerevisiae. For all genes, we found evidence that they are contained in the initial alignment also for the other species. Note that we used BLAST to search the protein sequences in the genome assemblies since no annotations were available. Many genes of the TCA cycle are conserved in large parts or completely in all species, i.e. the sequences from different species belong to the same alignment block. Nevertheless, there are several cases with low conservation for one species. For example, SDH1 in S. kudriavzevii and S. cerevisiae do not share any nucleotide between the annotations (based on BLAST), but in the alignment blocks for SDH1 in S. cerevisiae, a sequence for S. kudriavzevii can be found. However, all those issues are likely artefacts of the assemblies and the alignment procedure.
After filtering, some blocks are lost. Five genes lose one block each due to the length and score filters. However, the information loss is negligible since in each case less than 1% of the gene is lost. The overlap filter discards blocks of 11 genes such that large parts or almost complete genes are lost. We investigated those cases more deeply and found that those genes exhibit a high resemblance to either other genes in the TCA cycle or pseudogenes that result from whole genome duplications [15]. Except for S. cerevisiae, all genome assemblies in the yeast data set consist of contigs and thus have a rather low quality. Hence, very similar nucleotide sequences might be merged into one sequence. As a result, very similar genes (or parts of genes) may not be assembled as separate genes but occur only once in the assembly. A reliable assignment of those misassemblies to genes is therefore difficult.
In the following cases, the overlap filter discards parts of genes or complete genes. Large parts of IDH1 are lost in most of the species due to high similarity to IDH2; a fifth of IDH2 is also lost due to this similarity. The same holds for CIT1 and CIT3, where CIT1 is lost in most of the species and a fifth of CIT3. PYC1 and PYC2 are very similar in sequence; after filtering, almost all blocks for those genes are discarded. At least 50% of ACO1 is lost due to the similarity to ACO2, and two species lose ACO1 completely, while large parts of ACO2 are retained. SDH1 and its homolog YJL045W are lost almost completely in all species due to their homology. SDH3 also has a homolog that is a pseudogene, SHH3, which leads to the loss of large parts or all of SDH3 after filtering. Another gene of which a part is filtered is MDH1. The filtered part overlaps with the MDH3 gene, a very similar gene with a function in the glyoxylate cycle, another variation of the TCA cycle, that is comparable to the function of MDH1.
To summarize the results after filtering, the low assembly quality and the alignment method for the whole genome alignment cause the removal of blocks for 11 genes. For further analysis, we use only the 9 genes that are kept completely after filtering. These 9 genes are contained almost completely in the same blocks for all species.
For 7 genes, the blocks of the gene are placed as successive blocks with successive genomic coordinates in the optimized order. The blocks for KGD2 as well as the blocks for SDH2 are broken apart by the MFAS heuristic. Since the information that they should be successive is lost, they are placed in different locations in the optimized order. However, at least within the two parts, blocks and genomic coordinates are successive.
To summarize, the common coordinate system for yeast shows the conservation of 9 of the 20 genes in the TCA cycle. The filtering statistics indicate that the remaining 11 genes cannot be analyzed with respect to conservation: they have homologs or share sequence similarity with other genes such that it is unclear whether the sequence is conserved or not. The raw multiple alignments used as input thus suggest conservation that is questionable. For conserved genes, the method mostly produces the correct order of the alignment blocks, allowing easy access to the corresponding alignment parts for comparative analysis.