Given a de Bruijn graph \(G_k(R)\) generated by a set of reads R for which we do not have any prior information, our goal is to identify whether there are subgraphs of \(G_k(R)\) that correspond each to a set of high copy-number repeats in R. To this end, we identify and then exploit some of the topological properties of the subgraphs that are induced by repeats. Starting with a formal model for representing repeats with high-copy number, we show that the number of compressible arcs, which we denote by \(\gamma\), is a relevant parameter for such a characterisation. This parameter will play an important role in the algorithm of “Bubbles “drowned” in repeats” section. However, we also prove that, for an arbitrary de Bruijn graph, identifying a subgraph \(G'\) with bounded \(\gamma (G')\) is NP-complete.
Simple uniform model for repeats
We now present the model we adopted for representing high copy-number repeats, e.g. transposable elements, in a genome or transcriptome. First, we would like to clarify that our model is a simple one and, as such, should be seen as only a first approximation, yet realistic enough, of what may happen in reality. We consider here that sequencing errors can be successfully removed. Indeed, there are several techniques to remove the big majority of the sequencing errors in RNA-seq data. In KisSplice, for example, we prune the de Bruijn graph using an absolute and a relative cut-off based on the k-mer coverage. The absolute cut-off enables us to remove sequencing errors in general, and the relative one is tailored to deal with highly-expressed genes (more details can be found in [14]). Furthermore, while we realise that there is room for improvement, in practice, the sequencing-error-removal procedure in KisSplice seems to be effective, as most sequencing errors are removed at the expense of losing some rare genomic variants [14].
Basically, our model consists of several “similar” sequences, each generated by uniformly mutating a fixed initial sequence. In particular, it enables to model well recent invasions of transposable elements which often involve high copy-number and low divergence rate (i.e. divergence from their consensus sequence). Consider indeed as an example the recent subfamilies AluYa5 and AluYb8 with 2640 and 1852 copies respectively, which both present a divergence rate below \(1\%\) [15] (see [16] for other subfamilies with high copy-number and low divergence).
The model is as follows. First, due to mutations, the sequences \(s_1, \ldots , s_m\) that represent the repeats are not identical. However, provided that the number of such mutations is not high (otherwise the concept of repeats would not apply), the repeats are considered “similar” in the sense of having a small pairwise Hamming distance between them. We recall that, given two equal length sequences \(s\) and \(s'\) in \(\Sigma ^n\), their Hamming distance, denoted by \(d_H(s,s')\), is the number of positions i for which \(s[i] \ne s'[i]\). Indels are thus not considered in this model.
The model has then the following parameters: \(\Sigma\), the length n of the repeat, the number m of copies of the repeat, an integer k (for the length of the k-mers considered), and the mutation rate, \(\alpha\), i.e. the probability that a mutation happens in a particular position. The sequences \(s_1, \ldots , s_m\) are then generated by the following process. We first choose uniformly at random a sequence \(s_0 \in \Sigma ^n\). At step \(i \le m\), we create a sequence \(s_i\) as follows: for each position j, \(s_i[j]=s_0[j]\) with probability \(1-\alpha\), whereas with probability \(\alpha\) a value different from s[j] is chosen uniformly at random for \(s_i[j]\). We repeat the whole process m times and thus create a set \(S(m,n,\alpha )\) of m such sequences from \(s_0\) (see Fig. 2 for a small example). The generated sequences thus have an expected Hamming distance of \(\alpha n\) from \(s_0\).
Topological characterisation of the subgraphs generated by repeats
Given a de Bruijn graph \(G_k(R)\), if a is a compressible arc labelled by the sequence \(s=s_1 \ldots s_{k+1}\) then, by definition, a is the only outgoing arc of the vertex labelled by the sequence s[1, k] and the only incoming arc of the vertex labelled by the sequence \(s[2,k+1]\). Hence the \((k-1)\)-mer s[2, k] appears as a substring in R, always preceded by the symbol s[1] and followed by the symbol \(s[k+1]\). We refer to such \((k-1)\)-mers as being boundary rigid. It is not difficult to see that the set of compressible arcs in a de Bruijn graph \(G_k(R)\) stands in a one-to-one correspondence with the set of boundary rigid \((k-1)\)-mers in R.
We now calculate and compare among them the expected number of compressible arcs in \(G=G_k(R)\) when R corresponds to a set of sequences that are generated: (1) uniformly at random, and (2) according to our model. We show that \(\gamma\) is “small” in the cases where the induced graph corresponds to similar sequences, which provides evidence for the relevance of this parameter.
Claim 1
Let R
be a set of m sequences randomly chosen from
\(\Sigma ^n\). Then the expected number of compressible arcs in
\(G_k(R)\)
is
\(\Theta (mn)\).
Proof
The probability that a sequence of length \(k-1\) occurs in a fixed position in a randomly chosen sequence of length n is \((1/4)^{k-1}\). Thus the expected number of appearances of a sequence of length \(k-1\) in a set of m randomly chosen sequences of length n is given by \(m(n-k+2) (1/4)^{k-1}\). If \(m(n-k+2) \le 4^{k-1}\), then this value is upper bounded by 1, and all the sequences of length \(k-1\) are expected to be boundary rigid (as a sequence is expected to appear once). The claim follows by observing that there are \(m(n-k+2)\) different \((k-1)\)-mers.□
We consider now \(\gamma (G_k(R))\) for \(R=S(m,n,\alpha )\). We upper bound the expected number of compressible arcs by upper bounding the number of boundary rigid \((k-1)\)-mers.
Theorem 1
Given integers
k, n, m
with
\(k<n\)
and a real number
\(0\le \alpha \le 3/4\), the de Bruijn graph
\(G_k(S(m,n,\alpha ))\)
has
o(nm) expected compressible arcs.
Proof
Let \(s_0\) be a sequence chosen randomly from \(\Sigma ^n\). Let \(S(m,n,\alpha )\) be the set \(\{s_1, \ldots , s_m\}\) of m repeats generated according to our model starting from \(s_0\). Consider now the de Bruijn graph \(G=G_k(S(m,n,\alpha ))\). Recall that the number of compressible arcs in this graph is equal to the number of boundary rigid \((k-1)\)-mers in \(S(m,n,\alpha )\). Let X be a random variable representing the number of boundary rigid \((k-1)\)-mers in G. Consider the repeats in \(S(m,n,\alpha )\) in a matrix-like ordering as in Fig. 2 and observe that the mutations from one column to another are independent. Due to the symmetry and the linearity of the expectation, E[X] is given by \(m(n-k+2)\) (the total number of \((k-1)\)-mers) multiplied by the probability that a given \((k-1)\)-mer is boundary rigid.
The probability that the \((k-1)\)-mer \(\hat{s}=s[i,i+k-2]\) is boundary rigid clearly depends on the distance from the starting sequence \(\hat{s}_0=s_0[i,i+k-2]\). Let d be the distance \(d_H(\hat{s} ,\hat{s}_0)\).
Observe that if the \((k-1)\)-mer \(s[i] \ldots s[i+k-2]\) is not boundary rigid then there exists a sequence y in \(S(m,n,\alpha )\) such that \(y[j]= s[j]\) for all \(i \le j \le i+k-2\) and either \(y[i+k-1] \ne s[i+k-1]\) or \(y[i-1] \ne s[i-1]\). It is not difficult to see that the probability that this happens is lower bounded by \((2\alpha -4/3 \alpha ^2) (1-\alpha )^{k-1-d} (\alpha /3)^d\). Hence we have:
$$\begin{aligned} Pr[\hat{s} \text { is boundary rigid} | d_H(\hat{s} ,\hat{s}_0)=d ] \le \Bigl ( 1- (2\alpha -4/3 \alpha ^2) (1-\alpha )^{k-1-d} (\alpha /3)^d\Bigr )^{m-1}. \end{aligned}$$
By approximating the above expression we therefore have that:
$$\begin{aligned} \displaystyle E[X]&\le (n-k-1)m \sum _{d=0}^{k-1} Pr[\hat{s} \text { is boundary rigid} | d_H(\hat{s} ,\hat{s}_0)=d ] \\ \nonumber&\le (n-k-1) m e^{- (m-1)(2\alpha -4/3 \alpha ^2)/(\frac{\alpha }{3})^{k-1}}. \end{aligned}$$
(1)
For a sufficiently large number of copies \(\left({\text{e.g}}. m=\left( {\begin{array}{c}k\\ \alpha k\end{array}}\right)\right)\) and using the fact that \(\left( {\begin{array}{c}k\\ \alpha k\end{array}}\right) \ge (1/ \alpha )^{\alpha k}\), we have that E[X] is o(mn). This concludes the proof.□
The previous result shows that the number of compressible arcs is a good parameter for characterising a repeat-associated subgraph.
Identifying a repeat-associated subgraph
As we showed, a subgraph due to repeated elements has a distinctive feature: it contains few compressible arcs. Based on this, a natural formulation to the repeat identification problem in RNA-seq data is to search for large enough subgraphs that do not contain many compressible arcs. This is formally stated in Problem 1. In order to disregard trivial solutions, it is necessary to require a large enough connected subgraph, otherwise any set of disconnected vertices or any small subgraph would be a solution. Unfortunately, we show that this problem is NP-complete, so an efficient algorithm for the repeat identification problem based on this formulation is unlikely.
Problem 1
[Repeat Subgraph] INSTANCE: A directed graph G and two positive integers m, t.
DECIDE: If there exists a connected subgraph \(G'=(V', E')\), with \(|V'| \ge m\) and having at most t compressible arcs.
In Theorem 2, we prove that this problem is NP-complete for all directed graphs with (total) degree, i.e. sum of in and out-degree bounded by 3. The reduction is from the Steiner tree problem which requires finding a minimum weight subgraph spanning a given subset of vertices. It remains NP-hard even when all arc weights are 1 or 2 (see [17]). This version of the problem is denoted by STEINER(1, 2). More formally, given a complete undirected graph \(G = (V,E)\) with arc weights in \(\{1,2\}\), a set of terminal vertices \(N \subseteq V\) and an integer B, it is NP-complete to decide if there exists a subgraph of G spanning N with weight at most B, i.e. a connected subgraph of G containing all vertices of N.
We specify next a family of directed graphs that we use in the reduction. Given an integer x, we define the directed graph R(x) as a cycle on 2x vertices numbered in a clockwise order and where the arcs have alternating directions, i.e. for any \(i \le x\), \((v_{2i},v_{2i+1})\) is an arc. Observe that in R(x), all vertices in even positions, i.e. all vertices \(v_{2i}\), have out-degree 2 and in-degree 0, while all vertices \(v_{2i+1}\) have out-degree 0 and in-degree 2. Clearly, none of the arcs of R(x) is compressible.
Theorem 2
The Repeat Subgraph Problem is NP-complete even for directed graphs with degree bounded by d, for any \(d \ge 3\).
Proof
Given a complete graph \(G = (V,E)\), a set of terminal vertices N and an upper bound B, i.e. an instance of STEINER(1, 2), we transform it into an instance of the Repeat Subgraph Problem for a graph \(G'\) with degree bounded by 3. Let us first build the graph \(G' = (V', E')\). For each vertex v in \(V \setminus N\), add a corresponding subgraph \(r(v) = R(|V|)\) in \(G'\) and for each vertex v in N, add a corresponding subgraph \(r(v) = R(|E|+|V|^2 + 1)\) in \(G'\). For each arc (u, v) in E with weight \(w \in \{1,2\}\), add a simple directed path composed by w compressible arcs connecting r(u) to r(v) in \(G'\); these are the subgraphs corresponding to u and v. The first vertex of the path should be in a sink of r(u) and the last vertex in a source of r(v). By construction, there are at least |V| vertices with in-degree 2 and out-degree 0 (sink) and |V| vertices with out-degree 2 and in-degree 0 (source) in both r(v) and r(u). It is clear that \(G'\) has degree bounded by 3. Moreover, the size of \(G'\) is polynomial in the size of G and it can be constructed in polynomial time.
In this way, the graph \(G'\) has one subgraph for each vertex of G and a path with one or two (depending on the weight of the corresponding arc) compressible arcs for each arc of G. Thus, there exists a subgraph spanning N in G with weight at most B if and only if there exists a subgraph in \(G'\) with at least \(m =2|N| + 2|E||N| + 2|V|^2|N|\) vertices and at most \(t = |B|\) compressible arcs. This follows from the fact that any subgraph of \(G'\) with at least m vertices necessarily contains all the subgraphs r(v), where \(v \in N\), since the number of vertices in all r(v), with \(v \in V\setminus N\), is at most \(|E| + 2|V|^2\) and the only compressible arcs of \(G'\) are in the paths corresponding to the arcs of G.□
We can obtain the same result for the specific case of subgraphs of de Bruijn graphs. The reduction is more technical but follows similarly.
Theorem 3
The Repeat Subgraph Problem is NP-complete even for subgraphs of de Bruijn graphs on
\(|\Sigma | = 4\)
symbols.