DeepGRP: engineering a software tool for predicting genomic repetitive elements using Recurrent Neural Networks with attention

Background
Repetitive elements make up a large part of eukaryotic genomes. For example, about 40 to 50% of the human, mouse and rat genomes are repetitive. Identifying and classifying repeats is therefore an important step in genome annotation. This annotation step is traditionally performed using alignment-based methods, either in a de novo approach or by aligning the genome sequence to a species-specific set of repetitive sequences. Recently, Li (Bioinformatics 35:4408–4410, 2019) developed a novel software tool, dna-brnn, to annotate repetitive sequences using a recurrent neural network trained on sample annotations of repetitive elements.

Results
We have developed the methods of dna-brnn further and engineered a new software tool, DeepGRP. It combines the basic concepts of Li (2019) with the attention mechanism, a technique developed for neural machine translation, for the task of nucleotide-level annotation of repetitive elements. An evaluation on the human genome shows a 20% improvement of the Matthews correlation coefficient for the predictions delivered by DeepGRP compared to dna-brnn. DeepGRP predicts two additional classes of repeats (compared to dna-brnn) and is able to transfer repeat annotations, using RepeatMasker-based training data, to a different species (mouse). Additionally, we show that DeepGRP predicts repeats annotated in the Dfam database but not annotated by RepeatMasker. DeepGRP is highly scalable due to its implementation in the TensorFlow framework. For example, the GPU-accelerated version of DeepGRP is approx. 1.8 times faster than dna-brnn, approx. 8.6 times faster than RepeatMasker, and over 100 times faster than HMMER searching for models of the Dfam database.

Conclusions
By incorporating methods from neural machine translation, DeepGRP achieves a consistent improvement of the quality of the predictions compared to dna-brnn. Improved running times are obtained by employing TensorFlow as implementation framework and by the use of GPUs. By incorporating two additional classes of repeats, DeepGRP provides more complete annotations, which were evaluated against three state-of-the-art tools for repeat annotation.

Supplementary Information
The online version contains supplementary material available at 10.1186/s13015-021-00199-0.

The algorithm is applied to a sequence of scores and can be described as follows [4]: the scores in S are processed from left to right and the cumulative score of the scores read so far is maintained. Additionally, an ordered list I = (I_1, I_2, ..., I_m) of disjoint segments is maintained.
For each q, 1 ≤ q ≤ m, the cumulative total score L_q of all scores up to but not including the leftmost score of I_q and the total score R_q up to and including the rightmost score of I_q is recorded. Formally,

L_q = S_1 + S_2 + ... + S_{I_q.start − 1}   and   R_q = S_1 + S_2 + ... + S_{I_q.end}.

Here I_q.start is the start and I_q.end the end position of the segment represented by I_q, and I.length = m is the number of segments stored in the list I, namely the list length.
Initially these lists, I, R and L, are empty. Scores ≤ 0 are skipped. A positive score at the first position, or a positive score following a score ≤ 0, forms a new segment I_k of length one, which is inserted into the lists using the following process:
1. Search for the maximum value of j, 1 ≤ j ≤ m, satisfying L_j < L_k.
2. If there is no such j, or R_j ≥ R_k, then add I_k to the end of I.
3. Otherwise, extend I_k to the leftmost position of segment I_j and update L_k accordingly. Delete segments I_j, I_{j+1}, ..., I_m from I and continue with step 1, with m = |I|.
If the end of the input is reached, I is exactly the set of all MSS of S. A complete proof for the correctness of the algorithm can be found in [4].
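The three steps above can be sketched in Python as follows. This is an illustrative reimplementation of the algorithm of [4] using a plain list scan for step 1 (without the linked-list shortcut or the run-compression optimization), not the code of [3]; the function name and the representation of segments as (start, end, score) triples with exclusive end position are our own choices.

```python
def maximal_scoring_segments(scores):
    """Return all maximal scoring segments (MSS) of a sequence of
    scores, following Ruzzo & Tompa [4], as (start, end, score)
    triples with exclusive end."""
    I = []      # candidate segments, each stored as [start, end, L, R]
    total = 0   # cumulative score of all scores read so far
    for i, s in enumerate(scores):
        if s <= 0:          # scores <= 0 are skipped
            total += s
            continue
        # a positive score forms a new segment I_k of length one,
        # with L_k = total before it and R_k = total including it
        seg = [i, i + 1, total, total + s]
        total += s
        while True:
            # step 1: maximum j with L_j < L_k (scan from the right)
            j = None
            for idx in range(len(I) - 1, -1, -1):
                if I[idx][2] < seg[2]:
                    j = idx
                    break
            # step 2: no such j, or R_j >= R_k -> append I_k to I
            if j is None or I[j][3] >= seg[3]:
                I.append(seg)
                break
            # step 3: extend I_k to I_j.start, take over L_j,
            # delete I_j, ..., I_m and repeat from step 1
            seg = [I[j][0], seg[1], I[j][2], seg[3]]
            del I[j:]
    return [(start, end, R - L) for start, end, L, R in I]
```

On the example sequence (4, −5, 3, −3, 1, 2, −2, 2, −2, 1, 5) from [4], this yields the three maximal scoring segments with scores 4, 3 and 7.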
The algorithm by [4], described above, is extended in [3] by a minimum score and an xdrop parameter. If one of these parameters is set to a negative value in the implementation of [3], the corresponding extension is disabled. The minimum score parameter is used to filter the calculated MSSs: MSSs with a score smaller than the minimum score are dropped. The xdrop parameter controls how long a run of negative scores may be considered for extension: if the current score drops too much relative to the maximum score reached so far, all previous segments are excluded from the extension of subsequent segments. The variables idx and max are used for the xdrop criterion: segments in I with an index smaller than idx are not considered for extension, and idx is updated by comparing the current maximum max with the current extended segment. The algorithm is shown in algorithm S1 and an example in table S1.
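The minimum score extension amounts to a simple post-processing step on the computed MSSs. A minimal sketch (the name filter_by_min_score and the representation of segments as (start, end, score) triples are our own choices, not the code of [3]):

```python
def filter_by_min_score(segments, min_score):
    """Keep only MSSs whose score reaches min_score; a negative
    min_score disables the filter, mirroring the behaviour described
    for the implementation of [3]. Segments are (start, end, score)."""
    if min_score < 0:
        return list(segments)
    return [seg for seg in segments if seg[2] >= min_score]
```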

The algorithm, as shown in algorithm S1 and described textually, does not run in linear time, but several improvements to the implementation of the list structures can be made to accelerate the list searches. Some of them are already shown in algorithm S1 and are described below. These and further optimizations achieve linear running time in practice [4] and are implemented in [3]. The space requirement is clearly linear, since at most O(n) non-overlapping segments can exist in the list, if n is the length of the sequence of scores. In practice these lists are much shorter, and further improvements can be made to the storage of the lists [4]. One of these improvements is shown in lines 3-7, where not each positive score separately, but a complete run of positive scores is taken as the initial segment Γ = I_k. Another possible improvement for step 1 (lines 12-13) is, instead of searching the full list I, to maintain for each Γ = I_k a pointer to the previously discovered I_j and to search this linked list of monotonically decreasing L_j values.
Evaluation of DeepGRP for hg38/chr1

Figure S1: Confusion matrix comparing predictions by DeepGRP for hg38/chr1 with the gold standard; left: all of hg38/chr1; right: comparison restricted to sequences in chains of colinear approximate matches (length ≥ 50 bp, mean length 542 799 bp, error rate ≤ 12.10%, overall error rate 0.0123%) in hg19/chr1 and hg38/chr1. These chains were computed as described in Section 5.

Evaluation of RM/RM and RM/Dfam on chr1

Figure S3: Confusion matrix resulting from the comparison of the repeat annotations predicted by RepeatMasker on hg38/chr1 with those of RepeatMasker on hg19/chr1 (left) and of Dfam [5] with RepeatMasker on hg38/chr1 (right). For the comparison of RepeatMasker on hg38 vs. hg19, hgLiftOver [7] was used to transfer the annotations from hg38 to hg19; only annotations which could be transferred using hgLiftOver were considered.

Evaluation of DeepGRP for masking performance

Figure S10: Confusion matrix resulting from the comparison of the repeats predicted by DeepGRP with the gold standard for hg19/chr1 (left) and hg38/chr1 (right), ignoring which specific repeat class was annotated. That is, for every position it is only considered whether it lies in a substring annotated as any repeat or as no repeat. Note that DeepGRP was still trained to predict the four considered repeat classes, which were collapsed into a single 'any repeat' class for evaluation.

Sensitivity, specificity and the δ-parameter in the evaluation of repeat boundaries
The parameter δ (see Figure S11) specifies the maximum difference allowed between the position of a boundary of overlapping repeats of the same type from the gold standard and from the prediction to qualify as a correct boundary. For a given value of δ, the sensitivity is the percentage of repeats of the gold standard whose boundaries are correct in the prediction. The specificity is the percentage of repeats of the prediction whose boundaries are correct in the gold standard. These quality measures do not tolerate any misclassification at distance larger than δ from the boundaries of repeats of the gold standard. Consider, for example, an interval [ℓ..r] representing a repetitive element of class c that is correctly predicted for all positions except for a single position i, ℓ ≤ i ≤ r, which is assigned to a class different from c. If the distance of position i to both ℓ and r is larger than δ, this predicted repetitive element contributes 0 to the sensitivity and specificity measures. Thus, for constant δ, the larger the repeats, the more unlikely it is to achieve reasonable sensitivity and specificity values.
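The boundary-based part of these measures can be sketched as follows. This is a simplified illustration under our own assumptions: repeats are (start, end, class) triples, only the two boundary positions are checked against the δ-tolerance, and the additional interior-misclassification condition described above is omitted; the function and parameter names are hypothetical.

```python
def boundary_sensitivity_specificity(gold, pred, delta):
    """Sensitivity: percentage of gold repeats whose boundaries are
    matched, within distance delta, by a predicted repeat of the same
    class. Specificity: the same with the roles of gold and prediction
    exchanged. Repeats are (start, end, cls) triples."""
    def matched(query, reference):
        hits = 0
        for (qs, qe, qc) in query:
            # a repeat counts if some same-class repeat in the
            # reference has both boundaries within distance delta
            if any(rc == qc and abs(rs - qs) <= delta and abs(re - qe) <= delta
                   for (rs, re, rc) in reference):
                hits += 1
        return 100.0 * hits / len(query) if query else 0.0
    return matched(gold, pred), matched(pred, gold)
```

For example, a predicted repeat whose boundaries deviate by 5 bp and 2 bp from the gold standard is counted as correct for δ = 10 but not for δ = 3.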

Identifying chains of colinear exact and of approximate matches in pairs of chromosomes of hg19 and hg38
To evaluate the difference between the training and test data, chromosomes from hg19 and hg38 were compared at the base-pair level. Since the chromosomes are too large to compare using greedy alignment methods, the following approach was used: 1. Remove from the sequences the maximal prefix and suffix consisting of N's only.
2. Construct an enhanced suffix array for the concatenation of each chromosome from hg19 and hg38 using gt suffixerator from the GenomeTools [1].
3. Compute all maximal exact matches of length ≥ ℓ, for a given minimum length ℓ, where one instance of the match occurs in the sequence of hg19 and the other in hg38. This is done by gt repfind from the GenomeTools [1]. 4. Compute global co-linear chains of maximum score from all matches. This is done by gt chain2dim from the GenomeTools [1].

5. For each chain, compute the positions of the pairs of substrings before the first chained match, between two consecutive chained matches, and after the last chained match. In this way, all pairs of regions between the chained matches are determined.
6. For each pair of gaps, extract the corresponding pair of sequences from hg19 and hg38 and try to align them globally using a greedy alignment method with sensitive parameter settings. We either obtain an upper bound of the unit edit distance of the pair of sequences of the gap or an undefined value in case the alignment method was not able to compute such an upper bound.
7. Merge adjacent gaps (with defined upper bounds) and matches to obtain supermatches. A supermatch thus consists of a maximal sequence of such gaps, including the matches between two consecutive gaps as well as the match before the first gap and after the last gap. The sum of the distance values gives an upper bound on the unit edit distance of the sequences of the supermatch. Note that the supermatches form a co-linear chain of non-overlapping approximate matches with no guaranteed upper bound on the error rate; in our application we see error rates of up to 50%. Two supermatches are separated by gaps for which the edit distance is too large to be computed by the greedy alignment method.
8. Align the entire sequences of each supermatch to compute the unit edit distance, using a greedy alignment method. With very few exceptions, the long supermatches of more than 10 000 bp have a very low error rate while the supermatches with higher error rates are shorter. So the alignment can be computed very quickly.
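The merging of step 7 can be sketched as follows. This is an illustrative sketch, not the pipeline's code: we assume the chain is given as an alternating sequence of ('match', start, end) and ('gap', dist) entries in the coordinates of one genome, where dist is the greedy-alignment upper bound on the unit edit distance of the gap, or None if no bound could be computed; all names and the data layout are our own choices.

```python
def build_supermatches(items):
    """Merge chained matches and bounded gaps into supermatches.
    A gap with dist=None separates two supermatches. Returns
    (start, end, distance_bound) triples, where start/end come from
    the first/last match of the run and distance_bound is the sum of
    the gap bounds (exact matches contribute 0)."""
    supermatches = []
    current = None              # [start, end, distance_bound]
    for item in items:
        if item[0] == 'match':
            _, start, end = item
            if current is None:
                current = [start, end, 0]   # supermatch begins at a match
            else:
                current[1] = end            # extend to the end of this match
        else:                               # ('gap', dist)
            _, dist = item
            if dist is None:
                # unbounded gap: close the current supermatch, if any
                if current is not None:
                    supermatches.append(tuple(current))
                    current = None
            elif current is not None:
                current[2] += dist          # bounded gap joins the run
    if current is not None:
        supermatches.append(tuple(current))
    return supermatches
```

For example, two matches joined by a gap with bound 5 form one supermatch with distance bound 5, while a gap without a bound starts a new supermatch.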