LRBinner consists of three main steps: (1) learning lower-dimensional latent representations of composition and coverage, (2) clustering the latent representations and (3) obtaining complete clusters. The complete workflow of LRBinner is shown in Fig. 1. In the following sections, we explain these three steps in detail.
Step 1
LRBinner uses two typical binning features of metagenomic sequences, composition and coverage. The composition and coverage of each long read are represented as a trimer frequency vector and a k-mer coverage histogram [15], respectively.
Computing composition vectors
Previous studies show that different species exhibit unique genomic patterns [18, 19], which can therefore be exploited in composition-based metagenomics binning. Oligonucleotide frequency vectors are one such genomic representation used in metagenomics binning. Small k-mer sizes (k varying from 3 to 7) have been used in the past to discriminate assembled contigs of different origins [6, 8, 10, 20, 21], and 3-mers have been used in metagenomics binning of error-prone long reads [15], showing that trinucleotide frequency vectors provide stable binning despite the noise levels present in TGS reads. Therefore, LRBinner uses k=3 by default, which results in trinucleotide composition vectors. For each long read, we count the frequencies of all 64 3-mers in the read and merge reverse complements to form a vector of 32 dimensions. The resulting vector is then normalised by the total number of 3-mers observed in the read. We refer to this trimer frequency vector as \(V^T\).
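The computation of \(V^T\) can be sketched in Python as follows. This is an illustrative snippet rather than LRBinner's actual code; the helper names are our own.

```python
from itertools import product

import numpy as np

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

# 32 canonical 3-mers: the 64 trimers merged with their reverse complements
CANONICAL_TRIMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=3)})
TRIMER_INDEX = {t: i for i, t in enumerate(CANONICAL_TRIMERS)}

def trimer_vector(read):
    """Normalised canonical trimer frequency vector (V^T) of a read."""
    counts = np.zeros(len(CANONICAL_TRIMERS), dtype=float)
    read = read.upper()
    total = 0
    for i in range(len(read) - 2):
        trimer = read[i:i + 3]
        if set(trimer) <= set("ACGT"):   # skip windows containing ambiguous bases
            counts[TRIMER_INDEX[canonical(trimer)]] += 1
            total += 1
    return counts / total if total > 0 else counts
```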
Computing coverage vectors
While an all-vs-all alignment of long reads can provide coverage information for each long read, performing the quadratic number of pairwise alignments on large-scale long-read datasets is usually too time-consuming. Given a sufficiently large k, the frequency of a k-mer is defined as the number of occurrences of this k-mer in the entire dataset. Long reads from high-abundance species tend to contain k-mers with higher frequencies than long reads from low-abundance species. Hence, a k-mer frequency histogram can be computed for each long read to represent its coverage without performing alignments [15]. To obtain such coverage histograms, we first compute the k-mer counts of all long reads in the entire dataset using DSK [22] (with a default value of k=15). The counts are then indexed in memory by encoding each nucleotide in 2 bits (i.e., A=00, C=01, T=10 and G=11) [22]. The resulting index takes the form of key-value pairs \((k_i, coverage(k_i))\), where \(coverage(k_i)\) is the number of occurrences of the k-mer \(k_i\) in the entire dataset. For each k-mer \(k_i\) of a read, we then obtain its frequency from the index. These frequencies are used to build a normalised histogram \(V^C\): we choose a preset bin width (\(bin\_width\)) and obtain a vector of \(bins\) dimensions. By default we set \(bin\_width = 10\) and \(bins = 32\). All k-mers with counts exceeding the histogram limits are added to the last bin of the histogram. Finally, the histogram is normalised by the total number of k-mers observed in the read.
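As an illustration, a coverage histogram could be built as follows, assuming the dataset-wide k-mer counts (e.g., parsed from DSK output) have already been loaded into a Python dictionary; the function and parameter names are hypothetical, and LRBinner itself indexes 2-bit-encoded k-mers in memory rather than strings.

```python
import numpy as np

def coverage_histogram(read, kmer_counts, k=15, bins=32, bin_width=10):
    """Normalised k-mer coverage histogram (V^C) of a read.

    `kmer_counts` maps each k-mer to its number of occurrences in the
    entire dataset, e.g. parsed from DSK output.
    """
    hist = np.zeros(bins, dtype=float)
    total = 0
    read = read.upper()
    for i in range(len(read) - k + 1):
        count = kmer_counts.get(read[i:i + k], 0)
        # k-mers whose counts exceed the histogram range go to the last bin
        idx = min(count // bin_width, bins - 1)
        hist[idx] += 1
        total += 1
    return hist / total if total > 0 else hist
```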
Obtaining latent representations
For each long read, its coverage (\(V^C\)) and composition (\(V^T\)) vectors are concatenated to form a single vector V of 64 dimensions. We use a variational auto-encoder to obtain lower-dimensional latent representations. The key motivation for using a variational auto-encoder is to combine coverage and composition features in an effective way. Previous work shows that a simple concatenation of coverage and composition vectors made t-SNE less effective [15], mainly because t-SNE does not learn how to combine composition and coverage features effectively, but rather relies on the spatial distances over the concatenated features. In contrast, the variational auto-encoder learns lower-dimensional representations by combining both composition and coverage features through a deep neural network.
Our implementation of the variational auto-encoder consists of two hidden layers in both the encoder and the decoder, each containing 128 neurons. Each layer uses batch normalisation and dropout with \(p = 0.1\) during the training phase. For each input vector V, the auto-encoder learns a latent representation \(V^{L}_{i}\) of 8 dimensions, where \(V^{L}_{i} \sim {\mathcal {N}}(\mu _i, \sigma _i)\). Similar to previous studies [14], we use LeakyReLU (leaky rectified linear unit) activation for the \(\mu\) layer and the softplus function for the \(\sigma\) layer. Note that \(\mu\) and \(\sigma\) denote the neural network layers that learn the lower-dimensional means and standard deviations of each read's distribution. As the loss function, we use the weighted sum of the reconstruction error E (Eq. 1, where \(V_{in}\) and \(V_{out}\) denote the input and output of the auto-encoder; note that the reconstruction target equals the input for an auto-encoder) and the Kullback-Leibler divergence [14, 23] \(D_{KL}\) (Eq. 2). \(E_{cov}\) and \(E_{com}\) denote the reconstruction errors of coverage and composition, respectively. Eq. 3 shows the complete loss function.
$$\begin{aligned} E = \sum (V_{in} - V_{out})^2 \end{aligned}$$
(1)
$$\begin{aligned} D_{KL}(latent \,\Vert\, prior) = -\frac{1}{2} \sum \left( 1 + \ln (\sigma ) - \mu ^2 - \sigma \right) \end{aligned}$$
(2)
$$\begin{aligned} \textit{Total Loss} = w_{cov}E_{cov} + w_{com}E_{com} + w_{kld}D_{KL} \end{aligned}$$
(3)
Here we set \(w_{cov} = 0.1\), \(w_{com} = 1\) and \(w_{kld} = 1/500\), determined empirically using simulated data. The decoder output is obtained through LeakyReLU activation in order to reconstruct the scaled positive inputs. We train the auto-encoder with read batches of size 10,240 for 200 epochs. Finally, we obtain the predicted latent means of the input data from the encoder for clustering. Each latent mean point corresponds to a read in the original input.
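A minimal PyTorch sketch of such a variational auto-encoder and its loss is given below. It follows the layer sizes, activations and loss weights described above, but it is only an illustration of the architecture, not LRBinner's actual implementation; in particular, placing the coverage histogram in the first 32 dimensions of V follows the concatenation order stated earlier and is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class SketchVAE(nn.Module):
    """Illustrative VAE: 64-d input (32-d coverage + 32-d composition) -> 8-d latent."""

    def __init__(self, in_dim=64, hidden=128, latent=8, dropout=0.1):
        super().__init__()
        def block(i, o):
            return nn.Sequential(nn.Linear(i, o), nn.BatchNorm1d(o),
                                 nn.LeakyReLU(), nn.Dropout(dropout))
        self.encoder = nn.Sequential(block(in_dim, hidden), block(hidden, hidden))
        self.mu = nn.Sequential(nn.Linear(hidden, latent), nn.LeakyReLU())
        self.sigma = nn.Sequential(nn.Linear(hidden, latent), nn.Softplus())
        self.decoder = nn.Sequential(block(latent, hidden), block(hidden, hidden),
                                     nn.Linear(hidden, in_dim), nn.LeakyReLU())

    def forward(self, v):
        h = self.encoder(v)
        mu, sigma = self.mu(h), self.sigma(h)
        z = mu + sigma * torch.randn_like(sigma)   # reparameterisation trick
        return self.decoder(z), mu, sigma

def loss_fn(v_in, v_out, mu, sigma, w_cov=0.1, w_com=1.0, w_kld=1 / 500):
    # squared reconstruction errors, split into coverage (first 32 dims, assumed)
    # and composition (last 32 dims), weighted as in Eq. 3
    e_cov = ((v_in[:, :32] - v_out[:, :32]) ** 2).sum(dim=1)
    e_com = ((v_in[:, 32:] - v_out[:, 32:]) ** 2).sum(dim=1)
    d_kl = -0.5 * (1 + torch.log(sigma) - mu ** 2 - sigma).sum(dim=1)  # Eq. 2
    return (w_cov * e_cov + w_com * e_com + w_kld * d_kl).mean()
```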
Step 2
In this step, we cluster the latent means learnt by the variational auto-encoder. The complete clustering algorithm of LRBinner is illustrated in Fig. 2. Similar to previous studies [14], we use the cosine distance as the distance measure for clustering. The cosine distance between points a and b in the latent space \(V^L\) is defined as \(d(a, b) = 1 - \frac{V^L_a\cdot V^L_b}{||V^L_a||\,||V^L_b||}\). Given a point a, a distance histogram \(H_a\) is generated by computing the pairwise distances between a and all other points, using a bin width of \(\Delta\) (\(\Delta = 0.005\) in our experiments). We define the peak as the index of the first local maximum of the distance histogram \(H_a\), and the valley as the index of the first local minimum after the peak in \(H_a\). Refer to the top-right panel of Fig. 2 for an example of the peak and valley in a distance histogram.
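The following sketch illustrates how a distance histogram and its peak and valley could be computed for a latent point. The simple local extremum search is our own illustration and may differ from LRBinner's exact implementation.

```python
import numpy as np

def distance_histogram(latent, a, delta=0.005):
    """Histogram of cosine distances between point `a` and all latent points."""
    x = latent / np.linalg.norm(latent, axis=1, keepdims=True)
    dists = 1.0 - x @ x[a]                    # cosine distances to point a
    n_bins = int(np.ceil(2.0 / delta))        # cosine distance lies in [0, 2]
    hist, _ = np.histogram(dists, bins=n_bins, range=(0.0, 2.0))
    return hist

def peak_and_valley(hist):
    """Index of the first local maximum and the first local minimum after it."""
    peak = valley = None
    for i in range(1, len(hist) - 1):
        if peak is None and hist[i] >= hist[i - 1] and hist[i] > hist[i + 1]:
            peak = i
        elif peak is not None and hist[i] <= hist[i - 1] and hist[i] < hist[i + 1]:
            valley = i
            break
    return peak, valley
```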
As shown in VAMB [14], a point with a smaller valley-to-peak ratio H[valley]/H[peak] is more likely to be the medoid of a cluster, where H[valley] and H[peak] are the numbers of points corresponding to the valley and peak in the distance histogram H, respectively. Therefore, VAMB randomly samples points, searches within a distance of 0.05 (up to 25 neighbouring points) and moves to another point if H[valley]/H[peak] can be further reduced. This step is iterated until a local minimum of H[valley]/H[peak] is reached; that point is inferred as a proper cluster medoid and the corresponding cluster is extracted by removing the points within a distance of \(\Delta \times valley\) in the distance histogram. However, clusters of long reads are orders of magnitude larger than clusters of contigs, so a mere local search for a cluster medoid may be inefficient. Furthermore, while most contig clusters consist of hundreds of points per species [16], long-read clusters vary drastically in size (from hundreds to millions of points), which demands a more flexible search strategy than sampling points within a fixed radius and up to a fixed number of neighbours. Hence, we design the following strategy to dynamically extract clusters of varying sizes. Our algorithm consists of two steps: (1) from a seed point to a candidate cluster and (2) from a candidate cluster to a confident cluster.
From a seed point to a candidate cluster
A point s is called a seed point if the valley-to-peak ratio \(H_s[valley]/H_s[peak]\) of its distance histogram \(H_s\) is below 0.5. Initially, LRBinner randomly picks a seed point s from the entire dataset and obtains its distance histogram \(H_s\). The distance histogram of a seed point defines a candidate cluster, which consists of the points within the distance \(\Delta \times valley\) of \(H_s\), referred to as candidate cluster points. Compared to the seed point, some candidate cluster points may have lower valley-to-peak ratios and thus yield more confident clusters. However, the number of candidate cluster points may vary significantly depending on the size of the ground-truth clusters. In the next section, we show how to use sampling strategies to find a confident cluster from a candidate cluster.
From a candidate cluster to a confident cluster
Given a candidate cluster, we sample 10% of its candidate cluster points (up to 1,000 points) and compare their corresponding distance histograms. For each sampled point p, we compute the valley-to-peak ratio \(H_p[valley]/H_p[peak]\) of its distance histogram \(H_p\). We choose the point x from the sample with the minimum H[valley]/H[peak] value and extract a confident cluster consisting of the points within a distance of \(\Delta \times valley\) in the distance histogram \(H_x\). In contrast to the iterative medoid search in VAMB [14], this approach takes advantage of the rough estimate of the candidate cluster obtained from a seed point and thus allows us to dynamically and efficiently discover clusters of varying sizes. This process is iterated until no further candidate clusters or seed points are found. Please refer to the Implementation section for details. The resulting clusters are depicted as detected clusters in Fig. 1. Note that a few reads still remain unclustered due to the noise present in the composition and coverage vectors of error-prone long reads; we show how to assign them to the existing bins in the next section.
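The two-step extraction of a single cluster can be sketched as follows, reusing the `distance_histogram` and `peak_and_valley` helpers from the previous sketch. The thresholds follow the text, while the control flow is a simplified illustration rather than LRBinner's actual code.

```python
import numpy as np

def valley_to_peak_ratio(hist):
    # reuses peak_and_valley() from the previous sketch
    peak, valley = peak_and_valley(hist)
    if peak is None or valley is None or hist[peak] == 0:
        return None, None
    return hist[valley] / hist[peak], valley

def extract_cluster(latent, unassigned, rng, delta=0.005, seed_threshold=0.5):
    """One round of extraction: seed point -> candidate cluster -> confident cluster."""
    # 1. pick a random point and accept it as a seed only if its ratio is below 0.5
    seed = rng.choice(list(unassigned))
    ratio, valley = valley_to_peak_ratio(distance_histogram(latent, seed, delta))
    if ratio is None or ratio >= seed_threshold:
        return None                                # not a seed point; caller resamples
    # 2. candidate cluster: points within delta * valley of the seed
    x = latent / np.linalg.norm(latent, axis=1, keepdims=True)
    dists = 1.0 - x @ x[seed]
    candidate = [p for p in unassigned if dists[p] <= delta * valley]
    # 3. sample 10% of candidate points (up to 1,000) and keep the point with
    #    the smallest valley-to-peak ratio
    sample_size = min(max(1, len(candidate) // 10), 1000)
    best, best_ratio, best_valley = seed, ratio, valley
    for p in rng.choice(candidate, size=sample_size, replace=False):
        r, v = valley_to_peak_ratio(distance_histogram(latent, p, delta))
        if r is not None and r < best_ratio:
            best, best_ratio, best_valley = p, r, v
    # 4. confident cluster: points within delta * valley of the best point
    dists = 1.0 - x @ x[best]
    return {p for p in unassigned if dists[p] <= delta * best_valley}
```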
Step 3
Obtaining final bins
Once all the clusters have been extracted, sparsely located points are left unassigned. However, such points still have the potential to improve downstream assembly processes. Hence, we assign them to the detected clusters using the statistical model from MetaBCC-LR [15]. For each cluster \(C_k\), the means \(\mu ^{C}_k\), \(\mu ^{T}_k\) and standard deviations \(\sigma ^{C}_k\), \(\sigma ^{T}_k\) are computed from the coverage and composition vectors \(V^C\) and \(V^T\), respectively.
$$\begin{aligned} PDF({\bar{v}},{\bar{\mu }},{\bar{\sigma }}) = \prod _{j}^{|{\bar{v}}|} \frac{1}{\sqrt{2\pi }\sigma _j} e^{-\frac{(v_j-\mu _j)^2}{2\sigma _j^2}} \end{aligned}$$
(4)
Finally, each unclustered read is assigned to the cluster \(C_i\) that maximises the likelihood computed using Eq. 4, i.e., the assignment is performed such that Eq. 5 is maximised. \(V^C_l\) and \(V^T_l\) denote the coverage histogram and trimer frequency vector of the unclustered read l.
$$\begin{aligned} C_i = \mathop {\mathrm{argmax}}\limits _{i} \bigg \{PDF(V^{C}_l,\mu ^{C}_i,\sigma ^{C}_i)\times PDF(V^T_l,\mu ^T_i,\sigma ^T_i)\bigg \} \end{aligned}$$
(5)
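As an illustration, the assignment of an unclustered read according to Eqs. 4 and 5 could be implemented as follows. Log-densities are used for numerical stability, which is equivalent to maximising the product in Eq. 5; the data structure holding the per-cluster statistics is hypothetical.

```python
import numpy as np

def assign_unclustered(v_cov, v_com, clusters):
    """Return the index of the cluster maximising Eq. 5 for one unclustered read.

    `clusters` is a list of dicts with per-cluster means and standard deviations
    of the coverage and composition vectors, e.g.
    {"mu_cov": ..., "sd_cov": ..., "mu_com": ..., "sd_com": ...}.
    """
    def log_pdf(v, mu, sd):
        # log of the product of per-dimension Gaussian densities (Eq. 4)
        return np.sum(-np.log(np.sqrt(2 * np.pi) * sd) - (v - mu) ** 2 / (2 * sd ** 2))

    scores = [log_pdf(v_cov, c["mu_cov"], c["sd_cov"]) +
              log_pdf(v_com, c["mu_com"], c["sd_com"]) for c in clusters]
    return int(np.argmax(scores))   # index of the chosen cluster C_i
```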