Figure 1 denotes the workflow of GraphBin2. The preprocessing steps of GraphBin2 assemble reads into contigs using the assembly graph and then bin the contigs (i.e., assign coloured labels to contigs) using existing contig-binning tools. GraphBin2 takes this labelled assembly graph as input, removes unsupported labels, corrects the labels of inconsistent vertices, propagates labels to unlabelled vertices and finally infers vertices with multiple labels (colours).
Preprocessing
In this step, we assemble the next generation reads (e.g., Illumina reads with length ranging from 75 to 300 bp) into contigs using the assembly graph. There are two dominant paradigms for genome assembly: overlap-layout-consensus (or string graphs) [30] and de Bruijn graphs [31]. We select one representative assembler from each paradigm, SGA [32] and metaSPAdes [27] respectively, to demonstrate the adaptability of GraphBin2. In order to show that GraphBin2 could be in principle applied to long-read assemblies, we also considered a simulated dataset which was assembled using metaFlye [33], a popular metagenomics long-read assembler.
In the assembly graph, each vertex represents a contig with coverage denoting the average number of reads that map to each base of the contig and each edge indicates a significant overlap between a pair of contigs. In an ideal case, a genome corresponds to a path in the assembly graph and its genomic sequence corresponds to the concatenation of contigs along this path. Hence, if two contigs are connected by an edge in the assembly graph, they are more likely to belong to the same genome. Previous studies [24, 25] have shown that the connectivity information between contigs can be used to refine and improve binning results. In the assembly graph of metagenomic datasets, different genomes usually correspond to different paths in the assembly graph. If two genomes share a common contig (e.g., unresolved “interspecies repeat” [27]), the corresponding vertex would be shared by two genomic paths in the assembly graph.
After assembling reads into contigs using assembly graphs, GraphBin2 uses an existing contig-binning tool to derive an initial binning result. Note that most of the existing tools for binning contigs require a minimum length for the contigs (e.g., 1000 bp for MaxBin2 [23] and SolidBin [22], 500 bp for BusyBee Web [16] and 1500 bp for MetaBAT2 [19]). Therefore, many short contigs in the assembly graph will be discarded, resulting in low recall values as a common limitation of existing binning tools. For example, 65% of the contigs in the metaSPAdes assembly of the Sharon-All dataset were discarded by MaxBin2 due to their short length.
Step 1: Remove labels of unsupported vertices
A linear (or circular) chromosome usually corresponds to a path (or a cycle) that traverses multiple vertices in the assembly graph. If two contigs belong to the same chromosome, they are likely to be connected by a path which consists of other contigs from the same chromosome. Therefore, a labelled vertex is defined as supported if and only if one of the following conditions hold:
Otherwise, a labelled vertex is defined as unsupported. Note that the definition of unsupported vertices in GraphBin2 is more strict than ambiguous vertices in GraphBin.Footnote 1 For example, in the initial labelled assembly graph of Fig. 1, vertex 2 in red is supported by vertex 6 in red as they are directly connected. Note that vertex 18 in green is also supported by vertex 15 in green as there exists a path (i.e., \(18 \rightarrow 19 \rightarrow 14 \rightarrow 15\)) between them that traverses only unlabelled vertices (i.e., 19 and 14). However, vertex 1 in blue is unsupported as it cannot reach another blue vertex through a path consisting of only unlabelled (white coloured) vertices.
To check whether a labelled vertex is supported or unsupported, a naive approach is to perform a breadth-first-search from each labelled vertex. A refined algorithm first initialises all labelled vertices as unsupported and scans the graph to identify all labelled vertices that are either isolated or directly connected to a vertex of the same label and classifies them as supported vertices. This refined algorithm then uses breadth-first-search to find all connected components that consist of only unlabelled vertices and for each component Component stores a set of labelled vertices N(Component) that are connected to vertices in Component. If multiple labelled vertices in N(Component) have the same label, these vertices are supported because they connect to each other through a path that consists of only unlabelled vertices in Component. GraphBin2 removes the labels for all unsupported vertices because these labels may not be reliable. For example, the label of the unsupported vertex 1 is removed by GraphBin2 in Step 1 of Fig. 1.
Step 2: Correct labels of inconsistent vertices
After Step 1, each non-isolated labelled vertex v is supported by at least one vertex with the same label. The closer two vertices are in the assembly graph, the more likely they have the same label. For each vertex v, we introduce a labelled score, S(v, x), for each label x by considering all vertices of label x that are directly connected to v or connected to v through a path that consists of only unlabelled vertices. A vertex t of label x contributes to S(v, x) by \(2^{-D(v,t)}\) where D(v, t) is the shortest distance between v and t using only unlabelled vertices. This distance is measured by the number of edges in a path and \(D(v,t)=1\) if v and t are directly connected. Therefore, the labelled score S(v, x) is the sum of contributions from all vertices of label x that are directly connected to v or connected to v through a path that consists of only unlabelled vertices. In Step 1 of Fig. 1, vertex 17 contributes 1/2 to S(18, blue) because \(D(17,18)=1\) and vertex 8 contributes 1/8 to S(18, green) because \(D(8,17)=3\). The labelled score of S(18, blue) is 2 to which all four blue vertices 17, 20, 23 and 24 contribute 1/2 respectively while \(S(18, green)=5/16\) to which vertex 8 contributes 1/8, vertex 15 contributes 1/8 and vertex 26 contributes 1/16.
A labelled vertex v of label x is defined as inconsistent if and only if the labelled score of its current label x times \(\alpha\) is less than or equal to the labelled score of another label y where \(\alpha\) is a parameter, i.e., \(\alpha \times S(v,x) \leqslant S(v,y)\). We have set \(\alpha =1.5\) in the default settings of GraphBin2. In Step 1 of Fig. 1, vertex 18 in green is an inconsistent vertex because \(1.5 \times S(18, green)= 1.5 \times 5/16 = 0.47\) is less than \(S(18, blue)=2\).
Again, GraphBin2 uses the breadth-first-search to check if a labelled vertex is inconsistent. GraphBin2 corrects the label of an inconsistent vertex v to another label that maximises the labelled score. For example, GraphBin2 corrects the label of vertex 18 from green to blue and corrects the label of vertex 22 from red to green (refer from Step 1 to Step 2 in Fig. 1).
Step 3: Propagate labels to unlabelled vertices
As existing contig-binning tools discard contigs due to their short lengths in the initial binning, many vertices are still unlabelled in the current assembly graph. In this step, we will propagate existing labels to the remaining unlabelled vertices using the assembly graph. There are two intuitions behind this label propagation process. Firstly, vertices that are closer to each other in the assembly graph are more likely to have the same label. Secondly, vertices with similar coverages are more likely to have the same label because contigs from the same genome usually have similar coverages [18, 34]. GraphBin2 uses both the connectivity and coverage information of the assembly graph to propagate the labels.
For each unlabelled vertex v with coverage c(v) (i.e., coverage of the contig that corresponds to the vertex), a candidate propagation action \((D(v,t),|c(v)-c(t)|, t, v)\) is recorded as a tuple where t is the nearest labelled vertex to v, c(t) is the coverage of t and D(v, t) is the shortest distance between v and t (as defined in Step 2). Given two candidate propagation actions, \((d_1,c_1,t_1,v_1)\) and \((d_2,c_2,t_2,v_2)\), GraphBin2 will execute \((d_1,c_1,t_1,v_1)\) before \((d_2,c_2,t_2,v_2)\), i.e., propagating the label of \(t_1\) to \(v_1\) before propagating the label of \(t_2\) to \(v_2\), if (\(d_1 < d_2\)) or (\(c_1 < c_2\) and \(d_1 = d_2\)). In other words, GraphBin2 puts more emphasis on the connectivity information than the coverage information because the edges in the assembly graph are expected to be more reliable than the coverage information on vertices, especially for vertices corresponding to short contigs (which are discarded by initial binning tools).
GraphBin2 first uses the breadth-first-search to compute all candidate propagation actions for unlabelled vertices and sort them into a ranked list according to the order defined above. At each iteration, GraphBin2 executes the first candidate propagation action and then updates the ranked list of candidate propagation actions. Note that one unlabelled vertex receives its label at each iteration and updating the ranked list of candidate propagation actions can be done efficiently by breadth-first-search from this unlabelled vertex.
Figure 2 shows how GraphBin2 propagates labels from Step 2 to Step 3 in Fig. 1. Figure 2a denotes the assembly graph after correcting labels of inconsistent vertices (after Step 2). The step-by-step label propagation process is explained in the remaining figures in Fig. 2.
Note that this label propagation process in GraphBin2 improves on the label propagation algorithm in GraphBin by incorporating both the connectivity and coverage information in the assembly graph. So far, GraphBin2 does not generate multi-labelled vertices. In the next step, we will show how GraphBin2 uses the labelling, connectivity and coverage information together on the assembly graph to infer multi-labelled vertices.
Step 4: Infer multi-labelled vertices
Contigs belonging to multiple genomes correspond to multi-labelled vertices in the assembly graph. What are the characteristics of shared contigs between multiple species? Firstly, a contig shared by multiple genomes may connect other contigs in these genomes. Secondly, the coverage of a contig shared by multiple genomes should be equal to the sum of coverages of these genomes in the ideal case. After label propagation, vertices of the same label are likely to form connected components in the assembly graph and multi-labelled vertices are likely to be located along the borders between multiple connected components where distinct labels meet and have a coverage similar to the sum of the average coverages of multiple components that they belong to.
GraphBin2 checks labelled vertices that are connected to vertices of multiple different labels. The average coverage of a connected component P is calculated by \(\frac{\sum {}{} c(i) \times L(i)}{\sum {}{} L(i)}\) for each vertex i in the connected component P, where c(i) is the coverage of the vertex i and L(i) is the length of the contig corresponding to vertex i. Assume v is a labelled vertex v from a component P, the coverage of v is c(v) and the average coverage of P is c(P). When c(v) is larger than c(P) and v is connected to other components \(P_1,P_2,\ldots ,P_k\) with different labels, it is possible that v also belongs to one or more components (in addition to P). For example, if v belongs to P, \(P_i\) and \(P_j\) in the ground-truth, the coverage of v, c(v), is expected to be close to the sum of average coverages of the above three components, \(c(P)+c(P_i)+c(P_j)\). In fact, finding which components in \(\{P_1,P_2,\ldots ,P_k\}\) that v also belongs to (in addition to P) can be modelled as the following subset sum problem [35]. Given a set of positive numbers \(\{c(P_1),c(P_2),\ldots ,c(P_k)\}\), find a subset whose sum is or is closest to \(c(v)-c(P)\). Then v will be assigned to the corresponding components in this subset as well as to P. Note that it is possible that the selected subset is empty and thus v only belongs to P.
In all of our experiments, the maximum number of different components that a vertex connects to in the assembly graph is less than 5. We use a brute-force way to enumerate all possible combinations of components and find out the combinations that best explain the observed coverages. For example, after Step 3 in Fig. 1, vertex 3 in green connects to another red component. The coverage of vertex 3 is 108 while the average coverage of the green component is 95 and the average coverage of the red components is 19. Because the coverage of vertex 3 (108) is closer to the sum of average coverages of green and red components (95+19=114) compared to the average coverage of the green component (95), vertex 3 is assigned both green and red labels. Similarly, the coverage of vertex 25 (142) is closer to the sum of average coverages of green and blue components (95 + 49 = 144) compared to the average coverage of the green component (95). Hence, vertex 25 is assigned both green and blue labels. In the same assembly graph after Step 3 in Fig. 1, vertex 14 in red does not gain any other labels because its own coverage is closest to the average coverage of the red component (19) compared to other possible combinations (i.e., red+blue, red+green, green+blue and red+green+blue).