Network architecture
Our new software tool DeepGRP uses a network architecture similar to that of dna-brnn, but with an additional attention layer. This layer improves the network's ability to learn motifs that contain gaps and span long ranges of the DNA sequence.
The input of the neural network is the DNA sequence for which repeats are to be predicted. In the first step, initial and terminal stretches of Ns are eliminated. In the second step, the forward strand and the reverse complement strand of the remaining sequence are both one-hot encoded with \(5\) integers per position, where the fifth integer handles occurrences of Ns. The one-hot encoding is fed into a bidirectional GRU. While bidirectional inputs are widely used for sequential tasks such as text translation [45], here the reverse complement strand is fed into the network to learn orientation-independent features. To let the network focus on short-range relations and to obtain multiple predictions per sequence position, the DNA sequence is processed in a sliding-window approach: windows of length 342 are slid over the DNA sequence with a step size of 50. For each nucleotide in a window, the GRU model predicts a probability for each of the considered repeat classes. As the windows overlap, the model produces multiple predictions for each position; the maximum of these yields the final prediction (see next subsection). The window length was determined by hyperparameter optimization and is fixed for training and evaluation. The step size can be chosen at prediction time, which enables a trade-off between running time and prediction accuracy.
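The described preprocessing can be sketched as follows. This is a minimal illustration of the 5-dimensional one-hot encoding (A, C, G, T, N), the reverse complement strand, and the overlapping windows of length 342 with step 50; the function names and the toy sequence are our own and not part of DeepGRP's API.

```python
import numpy as np

# 5-dimensional one-hot alphabet; the fifth channel handles Ns.
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def one_hot(seq):
    """Encode a DNA sequence as a (len, 5) one-hot matrix."""
    idx = np.fromiter((ALPHABET[c] for c in seq), dtype=np.int64)
    return np.eye(5, dtype=np.float32)[idx]

def sliding_windows(encoded, window=342, step=50):
    """Yield start offsets and window slices over the encoded sequence."""
    for start in range(0, len(encoded) - window + 1, step):
        yield start, encoded[start:start + window]

seq = "ACGTN" * 100                             # toy sequence of length 500
fwd = one_hot(seq)                              # forward strand
rev = one_hot(seq.translate(COMPLEMENT)[::-1])  # reverse complement strand
windows = list(sliding_windows(fwd))
```

With a sequence of length 500, this yields four overlapping windows starting at positions 0, 50, 100 and 150.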
To reduce the number of parameters to be trained and to allow the network to learn features independently of the orientation of the sequence, the forward and the reverse GRU layers share weights. In [46] it was shown that sharing weights in such a bidirectional model improves prediction performance and considerably reduces the number of learnable parameters. The output of the GRU layers is averaged and, together with the last hidden state, fed into the attention layer. The hidden state is likewise averaged over the forward and reverse layers.
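Weight sharing here simply means that one and the same set of GRU parameters processes both strands. The following NumPy sketch of a single GRU cell makes this concrete; it is an illustration under our own simplifying assumptions (random weights, no biases, toy dimensions), not DeepGRP's TensorFlow implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, INPUT = 8, 5

# One set of GRU parameters, reused for both strands (weight sharing).
W_z, U_z = rng.normal(size=(HIDDEN, INPUT)), rng.normal(size=(HIDDEN, HIDDEN))
W_r, U_r = rng.normal(size=(HIDDEN, INPUT)), rng.normal(size=(HIDDEN, HIDDEN))
W_h, U_h = rng.normal(size=(HIDDEN, INPUT)), rng.normal(size=(HIDDEN, HIDDEN))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru(inputs):
    """Run the shared GRU over (T, INPUT) inputs; return (T, HIDDEN) outputs and last state."""
    h = np.zeros(HIDDEN)
    outputs = []
    for x in inputs:
        z = sigmoid(W_z @ x + U_z @ h)             # update gate
        r = sigmoid(W_r @ x + U_r @ h)             # reset gate
        h_tilde = np.tanh(W_h @ x + U_h @ (r * h))
        h = (1 - z) * h + z * h_tilde
        outputs.append(h)
    return np.stack(outputs), h

T = 10
fwd_in = rng.normal(size=(T, INPUT))
rev_in = rng.normal(size=(T, INPUT))  # stands in for the reverse-complement encoding

out_f, state_f = gru(fwd_in)          # same weights for both calls
out_r, state_r = gru(rev_in)

# Average outputs (reverse-strand outputs re-aligned to forward orientation)
# and last hidden states, as described in the text.
merged = (out_f + out_r[::-1]) / 2.0
merged_state = (state_f + state_r) / 2.0
```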
The attention mechanism is implemented as proposed in [41]. It enables the network to utilize information from the complete sequence of a window, compressed into a hidden-state representation, at every sequence position in that window. This improves the prediction of repetitive elements at the beginning of the genomic sequence and supports the learning of more complex features. The output of the attention layer is fed, position by position, into a feed-forward network (Dense layer) with the Softmax activation function to obtain probabilities for each class of repetitive elements. The complete network is depicted in Fig. 1. One advantage of this approach is that all repetitive element classes are handled at once, so the number of repetitive elements has only a marginal influence on the running time.
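A generic dot-product attention step of this kind can be sketched as follows. This is a simplified stand-in for the mechanism of [41], not its exact formulation: the way the context vector is combined with the per-position outputs (here, simple addition) and all dimensions are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, CLASSES = 10, 8, 5

outputs = rng.normal(size=(T, H))   # merged GRU outputs for one window
state = rng.normal(size=H)          # merged last hidden state, used as query

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Score each position against the last hidden state, then mix the resulting
# context vector into every position's representation.
scores = outputs @ state            # (T,)
weights = softmax(scores)           # attention distribution over positions
context = weights @ outputs         # (H,) summary of the whole window
attended = outputs + context        # context broadcast to all positions (assumed combination)

# Per-position Dense layer with Softmax yields class probabilities.
W_out = rng.normal(size=(H, CLASSES))
probs = softmax(attended @ W_out, axis=-1)
```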
From probabilities to segments
Due to the window-based approach with overlapping windows, one obtains multiple class-membership probabilities for each nucleotide position and each repetitive class (including the no-repeat class). For each position and each repetitive class, we keep only the maximum of all such probabilities. This strategy of aggregating the information over different sliding windows leads to shorter running times and reduced space requirements, while still allowing for accurate predictions.
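The max-aggregation over overlapping windows can be sketched as follows; the random matrices stand in for the model's per-window output, and the sequence length is a toy value.

```python
import numpy as np

n, classes, window, step = 500, 5, 342, 50

rng = np.random.default_rng(2)
final = np.zeros((n, classes))
for start in range(0, n - window + 1, step):
    window_probs = rng.random((window, classes))  # model output for this window
    # Keep, per position and class, the maximum over all overlapping windows.
    final[start:start + window] = np.maximum(final[start:start + window],
                                             window_probs)
```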
Let \(C\) be the set of possible labels (each representing a repeat class) plus the label 0 for the no-repeat class. Let \(p_{i}^{c}\) be the probability derived by the model that position \(i\), \(1\le i\le n\), has label \(c\in C\), where \(n\) is the length of the sequence subject to the prediction of repeats. To derive predictions for segments of the input sequence, we first convert these probabilities to scores as follows: calculate \(q_{i} = \min \{\max \{p_{i}^{c}\mid c\in C\},0.99\}\) and determine \(c_{i}\in C\) such that \(q_{i}=\min \{p_{i}^{c_{i}},0.99\}\). Finally, convert \(q_{i}\) to a score \(s_{i}\) defined as follows [11]:
$$s_{i}= \begin{cases} -10 \cdot \log \frac{q_{i}}{1-q_{i}} & \quad \text{if } c_{i}=0 \\ \log \frac{q_{i}}{1-q_{i}} & \quad \text{otherwise.} \end{cases}$$
This scoring is a modified version of the \({\text {logit}}\)-function, well known from its application in logistic regression [47]. Bounding \(q_{i}\) from above by 0.99 prevents division by zero, and the factor \(-10\) in the case \(c_{i}=0\) penalizes large extensions of segments. For this scoring approach we consider only the maximum probability per position over all repeat classes, since in the gold standard there are only few positions (814 648 of 716 335 847 bp) for which the assignment to one of the four repeat classes considered here is not unique, see "Data sets" section. Applying the MSS algorithm to scores obtained from the probabilities of all classes of repetitive elements would therefore likely lead only to small improvements in accuracy, but would considerably increase the computational effort. Hence the MSS algorithm is applied to the sequence \(s_{1}\), \(s_{2}\), ..., \(s_{n}\) of scores obtained from the maximum probabilities. It delivers non-overlapping segments of consecutive positions, and for each position we have tracked the corresponding repeat class of maximum probability. No-repeat positions inherit the repeat class which occurs most often in the segment. In this way, all positions of a segment are considered part of an instance of one of the four considered repeat classes.
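The probability-to-score conversion above can be written compactly; this sketch follows the formula as printed (natural logarithm assumed, since no base is given) and uses made-up probabilities for illustration.

```python
import numpy as np

def to_scores(probs):
    """Convert per-position class probabilities (n, |C|) to MSS input scores.

    Class 0 is the no-repeat class; q_i is capped at 0.99 to keep the
    logit finite. Base of the logarithm follows the printed formula
    (natural log assumed here).
    """
    q = np.minimum(probs.max(axis=1), 0.99)
    c = probs.argmax(axis=1)
    logit = np.log(q / (1.0 - q))
    return np.where(c == 0, -10.0 * logit, logit), c

probs = np.array([
    [0.90, 0.05, 0.03, 0.01, 0.01],  # confident no-repeat -> large negative score
    [0.10, 0.80, 0.05, 0.03, 0.02],  # confident repeat class 1 -> positive score
    [0.01, 0.01, 0.01, 0.97, 0.00],  # confident repeat class 3 -> positive score
])
scores, classes = to_scores(probs)
```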
Afterwards, segments containing more than one repetitive element class are split into several segments, each containing only one class. Finally, the coordinates of all segments with a length greater than some user-defined threshold (default 50, as also used in dna-brnn) are output.
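The splitting and length filtering can be sketched as follows; the function name, the toy segment, and the small threshold are our own and chosen only for illustration.

```python
from itertools import groupby

def split_segment(start, labels):
    """Split one MSS segment into runs of a single repeat class.

    `labels` holds the tracked repeat class (class of maximum probability)
    per position; no-repeat positions are assumed to have already been
    reassigned to the segment's majority class.
    """
    segments, offset = [], start
    for cls, run in groupby(labels):
        length = len(list(run))
        segments.append((offset, offset + length, cls))
        offset += length
    return segments

# A segment covering positions 100..109 that contains two repeat classes:
parts = split_segment(100, [1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
min_len = 5  # user-defined length threshold (DeepGRP's default is 50)
output = [(s, e, c) for s, e, c in parts if e - s >= min_len]
```

Here the four-position run of class 1 falls below the toy threshold, so only the class-2 segment is output.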
Implementation
The base network was implemented in Python (version 3) using TensorFlow 2.1 [48] and is available as the Python package deepgrp (https://github.com/fhausmann/deepgrp). Several time-consuming steps are written in C. All C code, including the implementation of the MSS algorithm of [11], is integrated with DeepGRP via Cython.
Data sets
Our evaluation is based on the reference assemblies GRCh37 and GRCh38 of the human genome (abbreviated as hg19 and hg38 in the following) and GRCm38 of the mouse genome (abbreviated as mm10). Details are given in Additional file 1: Table S2. To simplify notation, a specific chromosome \(c\) of an assembly \(a\) is denoted by \(a/c\).
As the pre-annotated human genomes from the RepeatMasker website (http://www.repeatmasker.org/) are outdated, we computed the annotations of hg19 (all autosomal chromosomes) and hg38/chr1 ourselves, using Repbase 24.01 and RepeatMasker 4.1.0 (git-hash: 41848148) [2], which uses the cross_match [49] software to compute the local alignments. The source code of cross_match was obtained from the author [50]. The annotations obtained in this way are considered the gold standard for human. We found that 716 335 847 bp of the gold standard (i.e. 24.86% of hg19) are covered by a repeat of one of the four considered repeat classes. Of these 716 335 847 bp, only 0.11% (814 648 bp) cannot uniquely be assigned to a repeat class.
As gold standard for mouse we used a pre-annotation of mm10/chr1 (cf. Additional file 1: Table S2). The main purpose of using mouse data was to evaluate whether DeepGRP can exploit features learned from human sequences for annotating a different, but not too distant, species with its own specific repetitive elements. If not stated otherwise, the term gold standard in the singular refers to the gold standard for human.
From both gold standards we extracted all annotations of repeats of class HSAT2, Alphoid, Alu and LINE-1. A complete list of the repeat IDs of these classes (for human) can be found in Additional file 1: Table S3. Annotations for class HSAT3 were computed as described in [11]. As in [11] and [13], HSAT2 and HSAT3 were combined due to their similarity. All models were trained on hg19/chr11. For validation and early stopping hg19/chr20 was used. The true class labels were derived from the gold standard.
We selected hg19/chr11 for training for two reasons: on the one hand it was (besides other chromosomes) also used in [11], and on the other hand it is a medium-sized chromosome (\(\approx 4\%\) of the bp of hg19) and therefore represents a setting in which the size of the training data is small compared to the test data.
We selected hg19/chr20 for validation, since it is small (\(\approx 2\%\) of the bp of hg19), allowing hyperparameter optimization in reasonable time. Early stopping was implemented such that training stops once the validation loss has not decreased in the last 10 epochs.
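The early-stopping criterion can be sketched as a simple check on the history of validation losses; the function name and patience handling are our own illustration of the rule stated above.

```python
def should_stop(val_losses, patience=10):
    """Stop once the validation loss has not improved in the last `patience` epochs.

    `val_losses` is the per-epoch validation loss history so far.
    """
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # No loss in the last `patience` epochs undercut the earlier best.
    return min(val_losses[-patience:]) >= best_before
```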
A Dfam-based annotation of repeats of the four repeat classes was obtained by extracting the related repetitive element classes from the pre-annotated hg38 genome of Dfam release 3.3, November 2020. When determining the running time of HMMER, we used a subset of the Dfam HMM library (release 3.3) comprising all classes of repetitive elements which were also present in the subset of the RepeatMasker annotations used in this study (cf. Additional file 1: Table S3). We applied dfamscan using HMMER (v3.2.1) [29], faithfully following the recipe of https://dfam.org/help/tools.
Training details
While the basic network architecture we used is similar to the one proposed in [11], our training largely differs from the training of dna-brnn.
Training, retraining and evaluation were performed on Linux-based computers. The hyperparameter optimization was run on an Intel® Xeon® Gold 6142 CPU with 12 inter-operation and 12 intra-operation threads, using Hyperopt with the Tree of Parzen Estimators algorithm [51]. Model performance is measured by the multi-class Matthews correlation coefficient, denoted by \(MCC _{k}\), where \(k\) is the number of classes [52]. \(MCC _{k}\) is a single value characterizing a complete confusion table [52], even if the classes are very imbalanced [53]. In our application we have four different classes of repeats and the no-repeat class, so \(k\) is 5. We chose the model maximizing \(MCC _{5}\). This model is termed the best model.
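The multi-class Matthews correlation coefficient of [52] can be computed directly from the confusion matrix; the following sketch implements the standard formula with NumPy (the function name and toy labels are ours).

```python
import numpy as np

def mcc_k(y_true, y_pred, k):
    """Multi-class Matthews correlation coefficient over k classes."""
    C = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1                      # rows: true class, columns: predicted
    s = C.sum()                           # number of samples
    c = np.trace(C)                       # correctly classified samples
    t = C.sum(axis=1)                     # true occurrences per class
    p = C.sum(axis=0)                     # predicted occurrences per class
    num = c * s - (t * p).sum()
    den = np.sqrt((s**2 - (p**2).sum()) * (s**2 - (t**2).sum()))
    return num / den if den else 0.0

perfect = mcc_k([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], k=5)
```

A perfect prediction yields 1, a completely inverted one \(-1\), matching the binary MCC.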
The retraining of the best model and all evaluations, including training and evaluation of both tools, were performed on an Intel® Core™ i7-5820K CPU using 10 threads. DeepGRP additionally utilized the GPU of an NVIDIA GeForce GTX 960 graphics card using the XLA domain-specific compiler, see [54] for details. Training of DeepGRP required 6 to 15 minutes (median 11 min) per model.
During training, the composition of the batches is adjusted by a hyperparameter \(r \in [0,1]\). A batch of size \(n\) is constructed such that it contains at least \(\lfloor n \cdot r\rfloor\) sequence windows, each of which has at least one position annotated as a repetitive element. This is further restricted such that, for the set of all repetitive element classes \(C\), these \(\lfloor n \cdot r\rfloor\) sequence windows contain \(\left\lfloor \frac{n \cdot r}{|C|}\right\rfloor\) windows per repetitive element class. In this way, repeat classes which are less frequent in the training data are sampled more often during training, to account for the imbalance in the occurrence of the different repeat classes.
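The batch composition rule can be sketched as follows; the function name, the string window labels, and the fill strategy for the remaining slots are our own simplifying assumptions, used only to illustrate the \(r\)-controlled class balancing.

```python
import random

def compose_batch(windows_by_class, background, n, r):
    """Assemble one training batch of n windows.

    At least floor(n*r) windows contain repeat annotations, spread evenly
    over the repeat classes; the remaining slots are filled from the pool
    of other (mostly non-repeat) windows.
    """
    n_repeat = int(n * r)
    per_class = n_repeat // len(windows_by_class)
    batch = []
    for windows in windows_by_class.values():
        batch.extend(random.choices(windows, k=per_class))  # oversample rare classes
    batch.extend(random.choices(background, k=n - len(batch)))
    random.shuffle(batch)
    return batch

# Four repeat classes with 20 candidate windows each, plus background windows.
windows_by_class = {c: [f"rep{c}_{i}" for i in range(20)] for c in range(4)}
background = [f"bg_{i}" for i in range(100)]
batch = compose_batch(windows_by_class, background, n=32, r=0.5)
```

With \(n=32\) and \(r=0.5\), the batch holds 16 repeat-containing windows (4 per class) and 16 background windows.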
DeepGRP was compared to dna-brnn [11], which is implemented in C using its own neural network framework. Both programs use, as far as possible, the same hyperparameter values. dna-brnn does not provide an interface to perform hyperparameter tuning. Moreover, several parameters and flags of dna-brnn are hard coded or are ignored even if they have been specified on the command line. We concluded that hyperparameter tuning for dna-brnn would require modifications of its source code and the development of wrappers. As a consequence, we did not perform hyperparameter tuning for dna-brnn. For training and prediction the same hyperparameters were used.
dna-brnn uses a fixed number of epochs and a user-defined random seed, whereas DeepGRP uses early stopping and a varying random seed. To the best of our knowledge, dna-brnn cannot use validation data, and its training cannot be resumed from previously saved states. It therefore does not seem possible to apply early stopping to prevent overfitting of parameters in dna-brnn. For DeepGRP we trained five models. Using different random seeds, we also trained five models for dna-brnn, in contrast to [11], in which only one model was trained.