 Research
 Open Access
 Published:
Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNAseq reads
Algorithms for Molecular Biology volume 12, Article number: 2 (2017)
Abstract
Background
The main challenge in de novo genome assembly of DNAseq data is certainly to deal with repeats that are longer than the reads. In de novo transcriptome assembly of RNAseq reads, on the other hand, this problem has been underestimated so far. Even though we have fewer and shorter repeated sequences in transcriptomics, they do create ambiguities and confuse assemblers if not addressed properly. Most transcriptome assemblers of short reads are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNAseq data, relying instead on heuristics to deal with them.
Results
The results of this work are threefold. First, we introduce a formal model for representing high copynumber and lowdivergence repeats in RNAseq data and exploit its properties to infer a combinatorial characteristic of repeatassociated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NPcomplete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99–111, 1), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644–652, 2), and Oases (Schulz et al. in Bioinformatics 28(8):1086–1092, 3), for the specific task of calling AS events. Third, we turn our focus to fulllength transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Based on the observation that repeats create complicated regions in a DBG, and when assemblers try to traverse these regions, they can infer erroneous transcripts, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and not read nor coverage information. We show that our simple method gives better results than RsemEval (Li et al. in Genome Biol 15(12):553, 4) and TransRate (SmithUnna et al. in Genome Res 26(8):1134–1144, 5) on both real and simulated datasets for detecting chimeras, and therefore is able to capture assembly errors missed by these methods.
Background
Transcriptomes can now be studied through sequencing. However, in the absence of a reference genome, de novo assembly remains a challenging task. The main difficulty certainly comes from the fact that sequencing reads are short, and repeated sequences within transcriptomes could be longer than the reads. This short read/long repeat issue is of course not specific to transcriptome sequencing. It is an old problem that has been around since the first algorithms for genome assembly. Even though the problems repeats cause in both contexts are similar, they have also some characteristics that are specific to each. In genome assembly, repeats tend to be longer and present in more copies. In transcriptome assembly, repeats are located within genes and tend to be shorter and in fewer copies. However, in this last case, coverage cannot be applied to discriminate contigs that correspond to repeats, as it can be in genomics by using e.g. Myers’ Astatistics [6, 7], since the coverage of a gene does not only reflect its copynumber in the genome, but also and mostly its expression level. Some genes are highly expressed and therefore highly covered, while most genes are poorly expressed and therefore poorly covered. Such specificities complicate the application of a genomic repeatsolving strategy to the transcriptomic context.
Initially, it was thought that repeats would not be a major issue in RNAseq, since they are mostly in introns and intergenic regions. However, the truth is that many regions which are thought to be intergenic are transcribed [8] and introns are not always already spliced out when mRNA is collected to be sequenced [9]. Repeats, especially transposable elements, are therefore very present in real samples and cause major problems in transcriptome assembly, if not addressed properly.
Most, if not all current shortread transcriptome assemblers are based on de Bruijn graphs. Among the best known are Oases [3], Trinity [2], and to a lesser degree TransAbyss [10] and IDBAtran [11]. Common to all of them is the lack of a clear and explicit model for repeats in RNAseq data. Heuristics are thus used to try and cope efficiently with repeats. For instance, in Oases short vertices are thought to correspond to repeats and are therefore not used for assembling genes. They are added in a second step, which hopefully causes genes sharing repeats not to be assembled together. In Trinity, there is no attempt to deal with repeats by explicitly modelling them. The first module of Trinity, Inchworm, will try and assemble the most covered contig which hopefully corresponds to the most abundant alternative transcript. Then alternative exons are glued to this major transcript to form a splicing graph. The last step is to enumerate all alternative transcripts. If repeats are present, their high coverage may be interpreted as a highly expressed link between two unrelated transcripts. Overall, assembled transcripts may be chimeric or spliced into many subtranscripts.
In the method we had previously developed, KisSplice, which is a local transcriptome assembler [12], repeats are less problematic since the goal is not to assemble fulllength transcripts. KisSplice instead aims at finding variants in transcriptomes (SNPs, indels and alternative splicings). However, as we reported in [12], KisSplice was not able to deal with large portions of a de Bruijn graph containing subgraphs associated to highly repeated sequences, e.g. transposable elements, the socalled complex Biconnected Components.
Here, we try and achieve three goals: (1) give a clear formalisation of the notion of repeats with high copynumber in RNAseq data, (2) apply it on local transcriptome assembly by giving a practical way to enumerate bubbles that are lost because of such repeats, and (3) apply it on global transcriptome assembly by showing that the topology of the subgraph around a transcript can give some hints about its confidence level. Recall that we are in a de novo context, so we assume that neither a reference genome/transcriptome nor a database of known repeats, e.g. RepBase [13], are available.
First, we formally introduce a model for representing high copynumber repeats and exploit its properties to infer that repeatassociated subgraphs in a de Bruijn graph contain few compressible arcs. However, we show that the problem of identifying, in a de Bruijn graph, a subgraph corresponding to repeats according to such characterisation is NPcomplete. A polynomial time algorithm is therefore unlikely to exist.
Second, we show that in the specific case of a local assembly of alternative splicing (AS) events, by using a strategy based on the compressiblearc characterization, we can implicitly avoid such subgraphs. More precisely, it is possible to find the structures (i.e. bubbles) corresponding to AS events in a de Bruijn graph that are not contained in a repeatassociated subgraph (see Fig. 3 for an example). While there has been great efforts in the literature to solve repeats, there has been almost no exploration on how to avoid them. This is explained by the fact that most efforts in assembly concentrate on fulllength genome and transcriptome assembly, in which avoiding repeats is not an option, and the performance of an assembler can be narrowed down to how well it solves repeats. However, in our case, repeatavoidance can be an effective technique. Indeed, this fact was confirmed by our experiments, where using human simulated RNAseq data, we show that the new algorithm improves significantly the sensitivity of KisSplice, while also improving its precision. We further compared our algorithm to two of the best transcriptome assemblers, namely Trinity [2] and Oases [3], in the specific task of calling AS events, and we show that our algorithm is more sensitive than both tools, while also being more precise. In addition, our results show that the advantage of using the new algorithm proposed in this work is more evident when the input data contains high premRNA content or the AS events of interest stem from highlyexpressed genes. Moreover, we give an indication of the usefulness of our method on real data.
Third, we show that the method described can also be applied in the context of fulllength transcriptome assembly. We introduce a measure based on the proposed model to identify lowconfidence transcripts, which are the ones that traverse complex regions in the de Bruijn Graph. Within these complex parts of the graph generated by repeats, any assembler will have to choose the “right” path(s) among the many present. This choice is not simple and may lead to incorrect solutions (e.g. chimeric or truncated transcripts). It is therefore important to be able to identify the transcripts coming from such complex regions in order to know that the solution presented is not the only one, and furthermore may not be the right one. We compared our measure against two stateoftheart methods for de novo transcriptome evaluation, namely RsemEval [4] and TransRate [5], for the specific task of identifying chimeric transcripts in both real and simulated datasets. We show that our measure provides good results despite the fact that it uses only the graph topology, and not coverage, nor read information. The results obtained thus suggest that exploring the topology of the subgraph around a transcript, an information that is currently disregarded by transcriptome evaluation methods, can be useful to infer some of the transcript’s properties, such as confidence level, quality, assembly hardness, etc. Therefore, our measure can improve the stateoftheart methods for de novo transcriptome evaluation, since it is able to capture assembly errors missed by these tools.
Preliminaries
Let \(\Sigma\) be an alphabet of fixed size \(\sigma\). Here we always assume \(\Sigma =\{A,C,T,G\}\). Given a sequence (string) \(s \in \Sigma ^*\), let s denote its length, s[i] the ith element of s, and s[i, j] the substring \(s[i] s[i+1] \ldots s[j]\) for any \(1 \le i<j \le s\).
A kmer is a sequence \(s \in \Sigma ^k\). Given an integer k and a set S of sequences each of length \(n \ge k\), we define span(S, k) as the set of all distinct kmers that appear as a substring in S.
Definition 1
Given a set of sequences (reads) \(R \subseteq \Sigma ^*\) and an integer k, we define the directed de Bruijn graph \(G_k(R)=(V,A)\) where \(V=span(R,k)\) and \((u,v) \in A\) if and only if \(u[2,k]=v[1,k1]\).
Given a directed graph \(G = (V,A)\) and a vertex \(v \in V\), we denote its outneighbourhood (resp. inneighbourhood) by \(N^+(v)=\{ u \in V \mid (v,u) \in A \}\) (resp. \(N^(v)=\{ u \in V \mid (u,v) \in A \}\)), and its outdegree (resp. indegree by \(d^+(v)=N^+(v)\) (\(d^(v)=N^(v)\)). A (simple) path \(\pi = s \leadsto t\) in G is a sequence of distinct vertices \(s = v_0, \ldots , v_l = t\) such that, for each \(0 \le i < l\), \((v_i, v_{i+1})\) is an arc of G. If the graph is weighted, i.e. there is a function \(w : A \rightarrow Q_{\ge 0}\) associating a weight to every arc in the graph, then the length of a path \(\pi\) is the sum of the weights of the traversed arcs, and is denoted by \(\pi \).
An arc \((u,v) \in A\) is called compressible if \(d^+(u)=1\) and \(d^(v)=1\). The intuition behind this definition comes from the fact that every path passing through u should also pass through v. It should therefore be possible to “compress” or contract this arc without losing any information. Note that the compressed de Bruijn graph [2, 3] commonly used by transcriptomic assemblers is obtained from a de Bruijn graph by replacing, for each compressible arc (u, v), the vertices u, v by a new vertex x, where \(N^(x) = N^(u)\), \(N^+(x) = N^+(v)\) and the label is the concatenation of the kmer of u and the kmer of v without the overlapping part (see Fig. 1).
Repeats in de Bruijn graphs
Given a de Bruijn graph \(G_k(R)\) generated by a set of reads R for which we do not have any prior information, our goal is to identify whether there are subgraphs of \(G_k(R)\) that correspond each to a set of high copynumber repeats in R. To this end, we identify and then exploit some of the topological properties of the subgraphs that are induced by repeats. Starting with a formal model for representing repeats with highcopy number, we show that the number of compressible arcs, which we denote by \(\gamma\), is a relevant parameter for such a characterisation. This parameter will play an important role in the algorithm of “Bubbles “drowned” in repeats” section. However, we also prove that, for an arbitrary de Bruijn graph, identifying a subgraph \(G'\) with bounded \(\gamma (G')\) is NPcomplete.
Simple uniform model for repeats
We now present the model we adopted for representing high copynumber repeats, e.g. transposable elements, in a genome or transcriptome. First, we would like to clarify that our model is a simple one and, as such, should be seen as only a first approximation, yet realistic enough, of what may happen in reality. We consider here that sequencing errors can be successfully removed. Indeed, there are several techniques to remove the big majority of the sequencing errors in RNAseq data. In KisSplice, for example, we prune the de Bruijn graph using an absolute and a relative cutoff based on the kmer coverage. The absolute cutoff enables us to remove sequencing errors in general, and the relative one is tailored to deal with highlyexpressed genes (more details can be found in [14]). Furthermore, while we realise that there is room for improvement, in practice, the sequencingerrorremoval procedure in KisSplice seems to be effective, as most sequencing errors are removed at the expense of losing some rare genomic variants [14].
Basically, our model consists of several “similar” sequences, each generated by uniformly mutating a fixed initial sequence. In particular, it enables to model well recent invasions of transposable elements which often involve high copynumber and low divergence rate (i.e. divergence from their consensus sequence). Consider indeed as an example the recent subfamilies AluYa5 and AluYb8 with 2640 and 1852 copies respectively, which both present a divergence rate below \(1\%\) [15] (see [16] for other subfamilies with high copynumber and low divergence).
The model is as follows. First, due to mutations, the sequences \(s_1, \ldots , s_m\) that represent the repeats are not identical. However, provided that the number of such mutations is not high (otherwise the concept of repeats would not apply), the repeats are considered “similar” in the sense of having a small pairwise Hamming distance between them. We recall that, given two equal length sequences \(s\) and \(s'\) in \(\Sigma ^n\), their Hamming distance, denoted by \(d_H(s,s')\), is the number of positions i for which \(s[i] \ne s'[i]\). Indels are thus not considered in this model.
The model has then the following parameters: \(\Sigma\), the length n of the repeat, the number m of copies of the repeat, an integer k (for the length of the kmers considered), and the mutation rate, \(\alpha\), i.e. the probability that a mutation happens in a particular position. The sequences \(s_1, \ldots , s_m\) are then generated by the following process. We first choose uniformly at random a sequence \(s_0 \in \Sigma ^n\). At step \(i \le m\), we create a sequence \(s_i\) as follows: for each position j, \(s_i[j]=s_0[j]\) with probability \(1\alpha\), whereas with probability \(\alpha\) a value different from s[j] is chosen uniformly at random for \(s_i[j]\). We repeat the whole process m times and thus create a set \(S(m,n,\alpha )\) of m such sequences from \(s_0\) (see Fig. 2 for a small example). The generated sequences thus have an expected Hamming distance of \(\alpha n\) from \(s_0\).
Topological characterisation of the subgraphs generated by repeats
Given a de Bruijn graph \(G_k(R)\), if a is a compressible arc labelled by the sequence \(s=s_1 \ldots s_{k+1}\) then, by definition, a is the only outgoing arc of the vertex labelled by the sequence s[1, k] and the only incoming arc of the vertex labelled by the sequence \(s[2,k+1]\). Hence the \((k1)\)mer s[2, k] appears as a substring in R, always preceded by the symbol s[1] and followed by the symbol \(s[k+1]\). We refer to such \((k1)\)mers as being boundary rigid. It is not difficult to see that the set of compressible arcs in a de Bruijn graph \(G_k(R)\) stands in a onetoone correspondence with the set of boundary rigid \((k1)\)mers in R.
We now calculate and compare among them the expected number of compressible arcs in \(G=G_k(R)\) when R corresponds to a set of sequences that are generated: (1) uniformly at random, and (2) according to our model. We show that \(\gamma\) is “small” in the cases where the induced graph corresponds to similar sequences, which provides evidence for the relevance of this parameter.
Claim 1
Let R be a set of m sequences randomly chosen from \(\Sigma ^n\). Then the expected number of compressible arcs in \(G_k(R)\) is \(\Theta (mn)\).
Proof
The probability that a sequence of length \(k1\) occurs in a fixed position in a randomly chosen sequence of length n is \((1/4)^{k1}\). Thus the expected number of appearances of a sequence of length \(k1\) in a set of m randomly chosen sequences of length n is given by \(m(nk+2) (1/4)^{k1}\). If \(m(nk+2) \le 4^{k1}\), then this value is upper bounded by 1, and all the sequences of length \(k1\) are expected to be boundary rigid (as a sequence is expected to appear once). The claim follows by observing that there are \(m(nk+2)\) different \((k1)\)mers.□
We consider now \(\gamma (G_k(R))\) for \(R=S(m,n,\alpha )\). We upper bound the expected number of compressible arcs by upper bounding the number of boundary rigid \((k1)\)mers.
Theorem 1
Given integers k, n, m with \(k<n\) and a real number \(0\le \alpha \le 3/4\), the de Bruijn graph \(G_k(S(m,n,\alpha ))\) has o(nm) expected compressible arcs.
Proof
Let \(s_0\) be a sequence chosen randomly from \(\Sigma ^n\). Let \(S(m,n,\alpha )\) be the set \(\{s_1, \ldots , s_m\}\) of m repeats generated according to our model starting from \(s_0\). Consider now the de Bruijn graph \(G=G_k(S(m,n,\alpha ))\). Recall that the number of compressible arcs in this graph is equal to the number of boundary rigid \((k1)\)mers in \(S(m,n,\alpha )\). Let X be a random variable representing the number of boundary rigid \((k1)\)mers in G. Consider the repeats in \(S(m,n,\alpha )\) in a matrixlike ordering as in Fig. 2 and observe that the mutations from one column to another are independent. Due to the symmetry and the linearity of the expectation, E[X] is given by \(m(nk+2)\) (the total number of \((k1)\)mers) multiplied by the probability that a given \((k1)\)mer is boundary rigid.
The probability that the \((k1)\)mer \(\hat{s}=s[i,i+k2]\) is boundary rigid clearly depends on the distance from the starting sequence \(\hat{s}_0=s_0[i,i+k2]\). Let d be the distance \(d_H(\hat{s} ,\hat{s}_0)\).
Observe that if the \((k1)\)mer \(s[i] \ldots s[i+k2]\) is not boundary rigid then there exists a sequence y in \(S(m,n,\alpha )\) such that \(y[j]= s[j]\) for all \(i \le j \le i+k2\) and either \(y[i+k1] \ne s[i+k1]\) or \(y[i1] \ne s[i1]\). It is not difficult to see that the probability that this happens is lower bounded by \((2\alpha 4/3 \alpha ^2) (1\alpha )^{k1d} (\alpha /3)^d\). Hence we have:
By approximating the above expression we therefore have that:
For a sufficiently large number of copies \(\left({\text{e.g}}. m=\left( {\begin{array}{c}k\\ \alpha k\end{array}}\right)\right)\) and using the fact that \(\left( {\begin{array}{c}k\\ \alpha k\end{array}}\right) \ge (1/ \alpha )^{\alpha k}\), we have that E[X] is o(mn). This concludes the proof.□
The previous result shows that the number of compressible arcs is a good parameter for characterising a repeatassociated subgraph.
Identifying a repeatassociated subgraph
As we showed, a subgraph due to repeated elements has a distinctive feature: it contains few compressible arcs. Based on this, a natural formulation to the repeat identification problem in RNAseq data is to search for large enough subgraphs that do not contain many compressible arcs. This is formally stated in Problem 1. In order to disregard trivial solutions, it is necessary to require a large enough connected subgraph, otherwise any set of disconnected vertices or any small subgraph would be a solution. Unfortunately, we show that this problem is NPcomplete, so an efficient algorithm for the repeat identification problem based on this formulation is unlikely.
Problem 1
[Repeat Subgraph] INSTANCE: A directed graph G and two positive integers m, t.
DECIDE: If there exists a connected subgraph \(G'=(V', E')\), with \(V' \ge m\) and having at most t compressible arcs.
In Theorem 2, we prove that this problem is NPcomplete for all directed graphs with (total) degree, i.e. sum of in and outdegree bounded by 3. The reduction is from the Steiner tree problem which requires finding a minimum weight subgraph spanning a given subset of vertices. It remains NPhard even when all arc weights are 1 or 2 (see [17]). This version of the problem is denoted by STEINER(1, 2). More formally, given a complete undirected graph \(G = (V,E)\) with arc weights in \(\{1,2\}\), a set of terminal vertices \(N \subseteq V\) and an integer B, it is NPcomplete to decide if there exists a subgraph of G spanning N with weight at most B, i.e. a connected subgraph of G containing all vertices of N.
We specify next a family of directed graphs that we use in the reduction. Given an integer x, we define the directed graph R(x) as a cycle on 2x vertices numbered in a clockwise order and where the arcs have alternating directions, i.e. for any \(i \le x\), \((v_{2i},v_{2i+1})\) is an arc. Observe that in R(x), all vertices in even positions, i.e. all vertices \(v_{2i}\), have outdegree 2 and indegree 0, while all vertices \(v_{2i+1}\) have outdegree 0 and indegree 2. Clearly, none of the arcs of R(x) is compressible.
Theorem 2
The Repeat Subgraph Problem is NPcomplete even for directed graphs with degree bounded by d, for any \(d \ge 3\).
Proof
Given a complete graph \(G = (V,E)\), a set of terminal vertices N and an upper bound B, i.e. an instance of STEINER(1, 2), we transform it into an instance of the Repeat Subgraph Problem for a graph \(G'\) with degree bounded by 3. Let us first build the graph \(G' = (V', E')\). For each vertex v in \(V \setminus N\), add a corresponding subgraph \(r(v) = R(V)\) in \(G'\) and for each vertex v in N, add a corresponding subgraph \(r(v) = R(E+V^2 + 1)\) in \(G'\). For each arc (u, v) in E with weight \(w \in \{1,2\}\), add a simple directed path composed by w compressible arcs connecting r(u) to r(v) in \(G'\); these are the subgraphs corresponding to u and v. The first vertex of the path should be in a sink of r(u) and the last vertex in a source of r(v). By construction, there are at least V vertices with indegree 2 and outdegree 0 (sink) and V vertices with outdegree 2 and indegree 0 (source) in both r(v) and r(u). It is clear that \(G'\) has degree bounded by 3. Moreover, the size of \(G'\) is polynomial in the size of G and it can be constructed in polynomial time.
In this way, the graph \(G'\) has one subgraph for each vertex of G and a path with one or two (depending on the weight of the corresponding arc) compressible arcs for each arc of G. Thus, there exists a subgraph spanning N in G with weight at most B if and only if there exists a subgraph in \(G'\) with at least \(m =2N + 2EN + 2V^2N\) vertices and at most \(t = B\) compressible arcs. This follows from the fact that any subgraph of \(G'\) with at least m vertices necessarily contains all the subgraphs r(v), where \(v \in N\), since the number of vertices in all r(v), with \(v \in V\setminus N\), is at most \(E + 2V^2\) and the only compressible arcs of \(G'\) are in the paths corresponding to the arcs of G.□
We can obtain the same result for the specific case of subgraphs of de Bruijn graphs. The reduction is more technical but follows similarly.
Theorem 3
The Repeat Subgraph Problem is NPcomplete even for subgraphs of de Bruijn graphs on \(\Sigma  = 4\) symbols.
Bubbles “drowned” in repeats
In the previous section, we showed that an efficient algorithm to directly identify the subgraphs of a de Bruijn graph corresponding to repeated elements according to our model (i.e. containing few compressible arcs), is unlikely to exist since the problem is NPcomplete. However, in this section we show that in the specific case of a local assembly of alternative splicing (AS) events based on the compressiblearc characterisation of “Topological characterisation of the subgraphs generated by repeats” section, we can implicitly avoid such subgraphs. More precisely, it is possible to find the structures (i.e. bubbles) corresponding to AS events in a de Bruijn graph that are not contained in a repeatassociated subgraph, thus answering to the main open question of [12].
KisSplice [12] is a method for de novo calling of AS events through the enumeration of socalled bubbles, that correspond to pairs of vertexdisjoint paths in a de Bruijn graph. The bubble enumeration algorithm proposed in [12] was later improved in [1]. However, even the improved algorithm is not able to enumerate all bubbles corresponding to AS events in a de Bruijn graph. There are certain complex regions in the graph, likely containing repeatassociated subgraphs but also real AS events [12], where both algorithms take a huge amount of time. Figure 3 shows an example of a complex region with a bubble corresponding to an AS event. In practice, the enumeration is halted after a given timeout. The bubbles drowned (or trapped) inside these regions are thus missed by KisSplice.
In “Repeats in de Bruijn graphs” section, the repeatassociated subgraphs are characterised by the presence of few compressible arcs. This suggests that in order to avoid repeatassociated subgraphs, we should restrict the search to bubbles containing many compressible arcs. Equivalently, in a compressed de Bruijn graph (see “Preliminaries” section), we should restrict the search to bubbles with few branching vertices. We recall that a branching vertex is a vertex of indegree or outdegree strictly at least 2. Indeed, in a compressed de Bruijn graph, given a fixed sequence length, the number of branching vertices in a path is inversely proportional to the number of compressible arcs of the corresponding path in the noncompressed de Bruijn graph. We thus modify the definition of \((s,t,\alpha _1,\alpha _2)\)bubbles in compressed de Bruijn graphs (Def. 1 in [1]) by adding the extra constraint that each path should have at most b branching vertices.
Definition 2
Given a weighted directed graph \(G = (V,E)\) and two vertices \(s,t \in V\), an \((s,t,\alpha _1,\alpha _2,b)\)bubble is a pair of vertexdisjoint stpaths \(\pi _1\), \(\pi _2\) with lengths bounded by \(\alpha _1,\alpha _2\), each containing at most b branching vertices.
By restricting the search to bubbles with few branching vertices, we are able to enumerate them in complex regions implicitly avoiding repeatassociated subgraphs. Indeed, in “Experimental results” section we show that by considering bubbles with at most b branching vertices in KisSplice, we increase both its sensitivity and precision. This supports our claim that by focusing on \((s,t,\alpha _1,\alpha _2,b)\)bubbles, we avoid repeatassociated subgraphs and recover at least part of the bubbles trapped in complex regions.
Enumerating bubbles avoiding repeats
In this section, we modify the algorithm of [1] to enumerate all bubbles with at most b branching vertices in each path. Given a weighted directed graph \(G = (V,E)\) and a vertex \(s \in V\), let \(\mathcal {B}_s(G)\) denote the set of \((s,*,\alpha _1,\alpha _2,b)\)bubbles of G. The algorithm recursively partitions the solution space \(\mathcal {B}_s(G)\) at every call until the considered subspace is a singleton (contains only one solution), and in that case it outputs the corresponding solution. In order to avoid unnecessary recursive calls, it maintains the invariant that the current partition contains at least one solution. The algorithm proceeds as follows.
Invariant
At a generic recursive step on vertices \(u_1,u_2\) (initially, \(u_1 = u_2 =s\)), let \(\pi _1 = s \leadsto u_1, \pi _2 = s \leadsto u_2\) be the paths discovered so far (initially, \(\pi _1,\pi _2\) are empty). Let \(G'\) be the current graph (initially, \(G' := G\)). More precisely, \(G'\) is defined as follows: remove from G all the vertices in \(\pi _1\) and \(\pi _2\) but \(u_1\) and \(u_2\). Moreover, we also maintain the following invariant (INV): there exists at least one pair of paths \(\bar{\pi }_1\) and \(\bar{\pi }_2\) in \(G'\) that extend \(\pi _1\) and \(\pi _2\) so that \(\pi _1 \cdot \bar{\pi }_1\) and \(\pi _2 \cdot \bar{\pi }_2\) belong to \(\mathcal {B}_s(G)\).
Base case
When \(u_1 = u_2 = u\), output the \((s,u,\alpha _1,\alpha _2,b)\)bubble given by \(\pi _1\) and \(\pi _2\).
Recursive rule
Let \(\mathcal {B}_{s}(\pi _1,\pi _2, G')\) denote the set of \((s,*,\alpha _1,\alpha _2,b)\)bubbles to be listed by the current recursive call, i.e. the subset of \(\mathcal {B}_s(G)\) with prefixes \(\pi _1, \pi _2\). It is the union of the following disjoint sets:

The bubbles of \(\mathcal {B}_{s}(\pi _1,\pi _2, G')\) that use e, for each arc \(e = (u_1,v)\) outgoing from \(u_1\), that is \(\mathcal {B}_{s}(\pi _1 \cdot e, \pi _2, G'  u_1)\), where \(G'u_1\) is the subgraph of \(G'\) after the removal of \(u_1\) and all its incident arcs.

The bubbles that do not use any arc from \(u_1\), that is \(\mathcal {B}_{s}(\pi _1,\pi _2, G'')\), where \(G''\) is the subgraph of \(G'\) after the removal of all arcs outgoing from \(u_1\).
The same holds for \(u_2\) instead of \(u_1\).
In order to maintain the invariant (INV), we only perform the recursive calls when \(\mathcal {B}_{s}(\pi _1 \cdot e, \pi _2, G'  u)\) or \(\mathcal {B}_{s}(\pi _1,\pi _2, G'')\) are nonempty. In both cases, we have to decide if there exists a pair of (internally) vertexdisjoint paths \(\bar{\pi }_1 = u_1 \leadsto t_1\) and \(\bar{\pi }_2 = u_2 \leadsto t_2\), such that \(\bar{\pi }_1 \le \alpha _1'\), \(\bar{\pi }_2 \le \alpha _2'\), and \(\bar{\pi }_1,\bar{\pi }_2\) have at most \(b_1,b_2\) branching vertices, respectively. Since both the length and the number of branching vertices are monotonic properties, i.e. both are smaller for a prefix instead of for the full path, we can drop the vertexdisjoint condition. Indeed, let \(\bar{\pi }_1\) and \(\bar{\pi }_2\) be a pair of paths satisfying all conditions but the vertexdisjoint one. The prefixes \(\bar{\pi }^*_1 = u_1 \leadsto t^*\) and \(\bar{\pi }^*_2 = u_2 \leadsto t^*\), where \(t^*\) is the first intersection of the paths, satisfy all conditions and are internally vertexdisjoint.
Moreover, using a dynamic programming algorithm, we can obtain the following result.
Lemma 1
Given a nonnegatively weighted directed graph \(G = (V,E)\) and a source \(s \in V\), we can compute the shortest paths from s using at most b branching vertices in O(bE) time.
Proof
Let \(d[\beta , t]\) denote the distance from s to t using at most \(\beta\) branching vertices (s is never counted as a branching vertex, even if it is branching). The recurrence to calculate \(d[\beta ,t]\), for \(0 \le \beta \le b\) and \(t \in V\) is:
Initialisation step:
Main recurrence:
This recurrence works only on compressed graphs, i.e. it requires that the neighbours of simple vertices are branching. However, since the graph compression procedure described in “Preliminaries” section can be applied to general graphs, this recurrence is also applicable to general graphs. The calculation order for \(d[\beta ,t]\) in the main recurrence must be by increasing value of \(\beta\) and, for a fixed \(\beta\), the branching vertices must be processed before the nonbranching ones. Moreover, the shortest paths themselves can be constructed by a traceback procedure.
Finally, since the calculation of each value \(d[\beta ,t]\) takes \(O(N^(t))\) time, the algorithm runs in \(O(b \sum _{t \in V}N^(t)) =O(b E)\) time. We can guarantee that this algorithm runs in time polynomial in the length of the input by upperbounding b by V (if \(b > V\), we simply set \(b=V\)).□
As a corollary of Lemma 1, we can decide if \(\mathcal {B}_{s}(\pi _1,\pi _2, G)\) is nonempty in O(bE) time. Now, using an argument similar to [1], i.e. the leaves of the recursion tree and the solutions are in onetoone correspondence and the height of the recursion tree is bounded by 4b, we obtain the following theorem.
Theorem 4
The \((s,*,\alpha _1,\alpha _2,b)\)bubbles can be enumerated in \(O(b^2E\mathcal {B}_s(G))\) time. Moreover, the time elapsed between the output of any two consecutive solutions (i.e. the delay) is \(O(b^2E)\).
Measuring the confidence of a transcript in fulllength transcriptome assemblers
Reconstructing fulllength transcripts from reads is a challenging task because two transcripts, even from different genes, may very well share subsequences that are longer than the sequenced reads, or even longer than the fragments in case of pairedend sequencing. This is specially true when genes host transposable elements within their introns, and less frequently but still present, within their UTRs and also exons (e.g. exonised repeats). Even if a repeatcontaining intron is always spliced out in the splicing phase, this intron, and consequently the repeat, can still be present in RNAseq data. The fraction of introns present in the sequenced data depends on the cell compartment that is sampled (nucleus, cytoplasm or both) and the protocol to remove rRNA (ribo0 or polydT primers). As estimated in [9], the level of premRNA can be assumed to vary between 2 and 22%. The true level of premRNA may however be in practice higher, because the methods used for estimating it are mappingbased and therefore deal poorly with reads stemming from repeated regions. Besides, the upper bound given in [9] corresponds to extraction protocols which are harder to obtain. In this work, we considered the most commonly used extraction protocol to extract RNA, and assumed that they yielded premRNA fractions between 5 and 15%. Thus, more introns than expected are sequenced, generating problems to transcriptome assemblers, particularly when they span several members of a specific repeat family.
Most transcriptome assemblers are based on de Bruijn graphs and have no clear and explicit model for repeats in RNAseq data, relying instead on heuristics to deal with them. Within the complex parts of the graph generated by repeats, any assembler will have to choose the “right” path(s) among the many present. Even with hints given by (pairedend) reads, assemblers can still have several arguable options to extend a contig (see Fig. 4). This problem gets harder if the (pairedend) reads do not span the repeat entirely, thereby not giving the assembler any reliable information on how to connect the unique regions. If the assembler decides to guess a path, it may erroneously extend a contig and create a chimeric transcript. It can also choose to be conservative by not choosing any path in complicated regions of the de Bruijn graph, and instead truncating the transcript. Although this strategy can lead to an accurate assembly, it will produce a very fragmented one, which is not desired. Whatever the strategy (conservative or permissive), the resulting assembled transcript may be erroneous (chimeric or truncated).
It is hence important to be able to identify lowconfidence transcripts, which are the ones traversing complex regions of a de Bruijn graph, in order to know that the solution presented is the result of a “difficult” choice and therefore may not be the right one. To identify such transcripts, we introduce the concept of Branching Measure of a transcript. Consider the set of transcripts \(\mathcal {T}\) output by a fulllength transcriptome assembler starting from a set of reads \(\mathcal {R}\). We construct the de Bruijn graph \(G_k(R)\), and map back each transcript \(t \in \mathcal {T}\) to the graph by identifying each of its kmers. Given a positive integer w, let W be a wsized window (or substring) with the largest number of branching kmers in t. We define the Branching Measure of a transcript t, B(t), as the proportion of branching kmers in W. By looking at B(t), it is possible to infer if t traversed a hardtoassemble region in the de Bruijn graph, and this can be used as a measure of its confidence, i.e. the higher B(t) is, the lower is the confidence of t.
As a proof of concept, in the following we show two examples of the application of the Branching Measure to transcripts assembled by Trinity on RNAseq data from the GEUVADIS project [18].
The first example (Fig. 5) is the chimeric transcript c12400_g1_i1 that aligns to the gene MOB1A in chromosome 2 and also to the gene PEBP1 in chromosome 12, in which the fusion of these genes is due to a small identical region shared between two different repeats present in their UTR regions. Figure 5a shows the alignment of the transcript c12400_g1_i1 to reference hg38, visualised using the UCSC Genome Browser. The alignment on the top shows that the built transcript aligns almost perfectly to an isoform of gene MOB1A in chromosome 2. Due to the repeats inside the red circles, the alignment is truncated in the 3′UTR of MOB1A, and continued on the 5′UTR of gene PEBP1 in chromosome 12 (alignment on the bottom). Thus, here we have a chimeric transcript. Figure 5b zooms in the regions where both alignments intersect the repeats that cause the chimerism. The main reason of the junction between the two genes is due to a stretch of 18 As shared between the Atail of a SINE AluY in the 3′UTR of MOB1A and a Simple Repeat A(n) in the 5′UTR of PEBP1. Even though this repeated region is short, it was enough to cause problems to Trinity, which had access to 76bp pairedend reads, with an average insert size of 158 bp. In Fig. 5c we mapped all reads back to transcript c12400_g1_i1 and visualised them using IGV [19]. As we can see, there are no single or pairedend reads traversing the small repeat. This shows that this chimera is not an in vitro or a biological one, but indeed an assembly mistake by Trinity. Figure 5d conveys a local visualisation of the subgraph induced by the kmers of transcript c12400_g1_i1 at the junction point which causes the chimerism (the full graph can be accessed at http://kissplice.prabi.fr/bm/graph_chimera.html). We can see that this is a complex region since the transcript (red path) traverses a region having 11 branching kmers in a window of 12, and could thus be flagged by the Branching Measure. There is no other such complex region in this transcript, i.e. this is the only hardtoassemble region that this transcript goes through. We can also see in the picture the correct extension which should have been followed as the reference transcripts (the green and blue paths). Observe that even the reference transcripts could also have been flagged by our method since they traverse regions containing a concentration of branching vertices due to the repeated elements presented in Fig. 5a, b.
The second case, depicted in Fig. 6, shows a misassembly of the last exon of gene SLC35F2, in which Trinity assembled several truncated transcripts instead of the full exon. Figure 6a shows, on the 3’ \(\rightarrow\) 5’ orientation (reverse strand), the three truncated short transcripts: c65590_g1_i1, c64_g1_i1, and c14482_g2_i1. The truncation points were cause caused by repeats, where the first split is due to a simple repeat (A(n)) and the second is due to 2 consecutive Alus (AluJo and AluSz). Figure 6b displays a schematic global view on how the last exon of gene SLC35F2 was assembled by Trinity and how the three next figures are connected in the full graph drawing. This figure and the next assume the \(5^{\prime }\rightarrow 3^{\prime }\) orientation. Figure 6c conveys a local visualisation of the truncation point between c65590_g1_i1 and c64_g1_i1 due to a simple repeat. We can see that Trinity misassembled the very end of c65590_g1_i1 (only the last base) and truncated the transcript. The yellow path is accurate although truncated and does not go through a complicated region (one having a concentration of branching vertices). Even though the reference exon path in this region has 11 consecutive branching vertices and would be flagged by the Branching Measure, this method is unable to flag c65590_g1_i1 since it is truncated too early, before entering the complex region. Figure 6d shows a local view of the region that traverses the repeat AluJo, and where the assembler has chosen to truncate the transcript c64_g1_i1. We can see that Trinity misassembled the last 29 bases of c64_g1_i1 and truncated it. At the end of c64_g1_i1, we have 23 branching vertices in a window of 34 vertices, so this truncated transcript can be flagged by our method, as it is deeply enough plunged into a complex region. Finally, Fig. 6e displays a local view of the region that traverses the repeat AluSz, and where the assembler has chosen to truncate the transcript c14482_g2_i1. Again, the Branching Measure is not able to flag this transcript since it is not deeply enough plunged into a complex region. The full graph of Fig. 6b–e can be accessed at http://kissplice.prabi.fr/bm/graph_truncated.html.
Experimental results
Local assembly: experimental setup
To evaluate the performance of our method, we simulated RNAseq data using the FluxSimulator version 1.2.1 [20]. We generated 100 million reads of 75 bp using its default error model. We used the RefSeq annotated Human transcriptome (hg19 coordinates) as a reference and we performed a twostep pipeline to obtain a mixture of mRNA and premRNA (i.e. with introns not yet spliced). To achieve this, we first ran the FluxSimulator with the Refseq annotations. We then modified the annotations to include the introns and reran it on this modified version. In this second run, we additionally constrained the expression values of the premRNAs to be correlated to the expression values of their corresponding mRNAs, as simulated in the first run. Finally, we mixed the two sets of reads to obtain a total of 100M reads. We tested two values, namely 5 and 15% for the proportion of reads from premRNAs. Those values were chosen so as to correspond to realistic ones (see “Measuring the confidence of a transcript in fulllength transcriptome assemblers” section).
On these simulated datasets, we ran KisSplice [12] versions 2.1.0 (Ks_2.1.0) and 2.2.0 (Ks_2.2.0, with a maximum number of branching vertices set to 5) and obtained lists of detected bubbles that are putative alternative splicing (AS) events. We also ran the fulllength transcriptome assemblers Trinity version r2013_08_14 and Oases version 0.2.09 on both datasets, obtaining a list of predicted transcripts, from which we then extracted a list of putative AS events. For Oases, there was one locus in each dataset for which we were not able to extract the putative AS events. A manual inspection revealed that they were mostly composed of subparts of introns or UTRs flanked by repeats, usually copies of ALUs. The presence of such high copynumber repeats in these transcripts induced the clustering of all these unrelated sequences into one complex locus. More precisely, in the dataset containing 5% of the reads from premRNAs, the largest locus that Oases predicted comprised 20,769 transcripts, while the second largest contained only 139 transcripts. In the other simulated dataset, the largest locus comprised 39,389 transcripts, and the second largest contained only 205 transcripts. This indicates that Oases faces similar issues to Ks_2.1.0. For fairness of comparison, we did not postprocess these complex loci, and we were then unable to retrieve the potential AS events that they could describe. It is worth mentioning though, that the majority of the transcripts inside these loci corresponded to subparts of introns or UTRs, and not to full introns or exons, and therefore could not describe AS events.
In order to assess the precision and the sensitivity of our method, we compared our set of found AS events to the set of true AS events. Following the definition of Astalavista, an AS event is composed of two sets of transcripts, the inclusion/exclusion isoforms respectively. We consider that an AS event is true if at least one transcript among the inclusion isoforms and one among the exclusion isoforms is present in the simulated dataset with at least 5 reads per kilobase (RPK). The rationale for adding this threshold is that very rare events are considerably hard, or even impossible, to detect by all methods.
To compare the results of KisSplice with the true AS events, we propose that a true AS event is a true positive (TP) if there is a bubble such that one path matches the inclusion isoform and the other the exclusion isoform. If there is no such bubble among the results of KisSplice, the event is counted as a false negative (FN). If a bubble does not correspond to any true AS event, it is counted as a false positive (FP). To align the paths of the bubbles to transcript sequences, we used the Blat aligner [21] with 95% identity and a constraint of 95% of each bubble path length to be aligned (to account for the sequencing errors simulated by FluxSimulator). We computed the sensitivity TP/(TP+FN) and precision TP/(TP+FP) for each simulation case and we report their values for various classes of expression of the minor isoform. Expression values are measured in RPK.
Local assembly: results
The overall sensitivity and precision of Ks_2.2.0, Ks_2.1.0, Trinity and Oases can be found in Fig. 7a, b, respectively. A first look reveals that Ks_2.2.0 outperforms the other three methods in both measures and datasets.
A closer look at Fig. 7a shows that both versions of KisSplice had better sensitivity than both transcriptome assemblers in the 5% premRNA dataset. However, due to its inability to deal with repeatassociated regions, the performance of Ks_2.1.0 drops substantially, from 46 to 33%, when a higher rate of 15% of premRNA is present in the data. The same happened with Oases. Ks_2.2.0 and Trinity, on the other hand, were able to slightly improve their sensitivity from the 5 to the 15% premRNA dataset. It is however worth mentioning that the sensitivity of Ks_2.2.0 is substantially higher than the one of Trinity in the 15% premRNA dataset. In summary, we can say that Ks_2.2.0 is substantially more sensitive than all the other three methods. This reflects the fact that most problematic repeats are in introns. A small unspliced mRNA rate leads to few repeatassociated subgraphs, so there are not many AS events drowned in them. In this case, the advantage of using Ks_2.2.0 is less obvious, whereas a large proportion of premRNA leads to more AS events drowned in repeatassociated subgraphs which are identified by Ks_2.2.0 and usually missed by the other methods.
In Fig. 7b we can see that Ks_2.2.0 and Trinity presented the highest precision (98%) of all methods in the 5% premRNA dataset. Although Ks_2.1.0 is ranked third, it still presents a very high precision (95%), while Oases presented a moderate value (80%). Nevertheless, the most important aspect to be observed in Fig. 7b is that Ks_2.2.0 kept the same high precision even when a higher rate of 15% of premRNA is present in the data. Trinity, on the other hand, dropped significantly from 98 to 79%. This drop in precision is actually mostly due to the prediction of a large number of intron retentions, since Trinity assembles both the mRNA and the premRNA. Ks_2.1.0 dropped slightly from 95 to 91%, and Oases dropped moderately, from 80 to 70%. In summary, we can say that both versions of KisSplice are more precise than both transcriptome assemblers, except that Trinity shows comparable precision if a small rate of premRNA is present in the data, and, more specifically, that Ks_2.2.0 outperformed all the other three methods. The high precision we obtain indicates that we have very few false positives. Those mostly correspond to repeatinduced bubbles mistakenly identified as AS events.
Finally, Fig. 7c, d present the detailed sensitivity by expression levels of the four methods on both datasets, allowing for a better understanding of their performance. As we can see, Ks_2.2.0 presented the best sensitivity in practically all expression levels in both datasets, while the other three methods were worse, but comparable between themselves. We can also observe that the gap between the sensitivity of Ks_2.2.0 and the sensitivity of the other methods tends to increase as the expression levels of the genes increase, especially in the 15% premRNA dataset. Since highlyexpressed genes tend to present higher levels of premRNA content, they bring several repeat copies in their introns, and thus some parts of their associated graphs are complex and repeatinduced. Therefore, AS events inside such genes tend to be trapped in troublesome regions of the graph, making them harder to find. As Ks_2.2.0 is able to avoid such complex regions and retrieve the AS events deeply plunged into them, it presents better sensitivity than the other methods, especially in highlyexpressed genes and datasets with high rate of premRNAs.
As was already reported in [12], KisSplice (i.e. both Ks_2.2.0 and Ks_2.1.0) is faster and uses considerably less memory than Trinity and Oases. For instance, on these datasets, KisSplice uses around 5 GB of RAM, while Trinity uses more than 20 GB, and Oases, around 18 GB. However, it should be noted that both these latter methods try to solve a more general problem than KisSplice, that is reconstructing the fulllength transcripts.
To conclude, our results show that Ks_2.2.0 is significantly more sensitive and precise than Ks_2.1.0, Trinity and Oases for the task of identifying AS events. The advantage of using Ks_2.2.0 over the other three methods is more evident when the input data contains high premRNA content or the AS events of interest stem from highlyexpressed genes.
On the usefulness of Ks_2.2.0 on real data
In order to give an indication of the usefulness of our repeatavoiding bubble enumeration algorithm with real data, we also ran Ks_2.2.0 and Ks_2.1.0 on the SKNSH Human neuroblastoma cell line RNAseq dataset (wgEncodeEH000169, total RNA). In Fig. 8, we have an example of a nonannotated exon skipping event not found by Ks_2.1.0. Observe that the intronic region contains several transposable elements (many of which are Alu sequences), while the exons contain none. This is a good example of a bubble (exon skipping event) drowned in a complex region of the de Bruijn graph. The bubble (composed by the two alternative paths) itself contains no repeated elements, but is surrounded by them. In other words, this is a bubble with few branching vertices that is surrounded by repeatassociated subgraphs. Since Ks_2.1.0 is unable to differentiate between repeatassociated subgraphs and the bubble, it spends a prohibitive amount of time in the repeatassociated subgraph and fails to find the bubble.
Global assembly
To test our hypothesis that the Branching Measure is able to identify problematic transcripts, we evaluated it on the transcripts assembled by Trinity on the two simulated RNAseq datasets described in “Local assembly: results” section, and on two other real RNAseq datasets: one from the GEUVADIS project [18]^{Footnote 1} and one from a neuroblastoma SKNSH cell line (ENCODE) differentiated or not using retinoic acid.^{Footnote 2} Even though our method is referencefree, we have chosen to evaluate it under a model species so that we could make use of annotated reference genomes to assess if our predictions are correct. We compared our measure against two stateoftheart methods for de novo transcriptome evaluation, RsemEval [4] and TransRate [5], on the specific task of identifying chimeric transcripts in Trinity’s assemblies on all four described datasets. In all our tests, we used the contig impact score of RsemEval as a measure of contig correctness. Formally, the contig impact score is a statistical measure that compares the hypothesis that a particular contig (i.e. transcript) is a true contig with the null hypothesis that the reads composing the contig actually represent the background noise [4]. In other words, the contig impact score determines the relative contribution of each transcript to explaining the assembly. Wellassembled transcripts should therefore have a high contig impact score, and badly assembled transcripts, including chimeras, should have a low contig impact score. TransRate [5], on the other hand, allowed us to work with a specific metric representing the probability that a contig is derived from a single transcript. This metric denotes the probability that the read coverage of a transcript is best modelled by a single Dirichlet distribution, rather than two or more distributions, and it corresponds to the field sCseq of TransRate’s output file contigs.csv.
As was shown before, one of the main errors that transcriptome assemblers do is to build chimeric transcripts. We compared the performances of the Branching Measure, RsemEval, and TransRate on identifying chimeric transcripts. In order to have our ground truth, we first identified which assembled transcripts are chimeric with respect to a reference genome by using Algorithm 1. In total, 253 out of 18,706 transcripts (1.3%) in the 5% premRNA dataset, 376 out of 26,407 transcripts (1.4%) in the 15% premRNA dataset, 375 out of 99,591 transcripts (0.3%) in the GEUVADIS dataset, and 2830 out of 457,383 transcripts (0.6%) in the SKNSH dataset were classified as chimeric. Figure 9 depicts four ROC curves showing the performance of the three methods on all datasets. We can observe that the Branching Measure outperforms both RsemEval and TransRate by a large margin in all tests and, with a highvalue threshold, is also able to flag a majority of the chimeric transcripts while keeping a low false positive rate. These experiments show that, in the provided datasets, chimeric transcripts could be well captured by the Branching Measure. Our false positives include wellassembled transcripts traversing high copynumber low divergence repeats, and our false negatives include chimeric transcripts that did not go through a complex region. The main issue with RsemEval and TransRate, on the other hand, is that both methods failed to find chimeric transcripts assembled from genes with similar expression levels. These chimeras had low scores and corresponded to the false negatives at the end of the ROC curves for RsemEval and TransRate. As a side effect of this misclassification, many wellassembled transcripts had higher scores than several real chimeras, and were mistakenly flagged as chimeras.
Concluding remarks and perspectives
Although transcriptome assemblers are now commonly used, their way to handle repeats is not satisfactory, arguably because the presence of repeats in transcriptomes has been underestimated so far. Given that most RNAseq datasets correspond to total mRNA extractions, many introns are still present in the data and their repeat content cannot be simply ignored. Although repeats in transcriptomic and genomic data cause similar problems to assemblers, the specificities of each mean that a successful strategy in one context may fail in the other. It is thus essential for transcriptome assemblers to formally address the repeats problem.
In this paper, we first proposed a simple formal model for representing high copynumber repeats in RNAseq data. Exploiting the properties of this model, we established that the number of compressible arcs is a relevant quantitative characteristic of repeatassociated subgraphs. We proved that the problem of identifying in a de Bruijn graph a subgraph with this characteristic is NPcomplete. However, this characteristic drove the design of an algorithm for efficiently identifying AS events that are not included in repeated regions. The new algorithm was implemented in KisSplice version 2.2.0, and by using simulated RNAseq data, we showed that it improves significantly the sensitivity of the previous version of KisSplice, while also improving its precision. In addition, we compared our algorithm to Trinity and Oases, and showed that for the specific task of calling AS events, our algorithm is significantly more sensitive while also being more precise. Our results also showed that the advantage of using KisSplice version 2.2.0 is more evident when the input data contains high premRNA content or the AS events of interest stem from highlyexpressed genes. Moreover, we gave an indication of the usefulness of our method on real data. Finally, we explored the proposed model in the context of fulllength transcriptome assembly by introducing the Branching Measure, which is able to flag the transcripts that go through a complex region in the de Bruijn graph. Even though one should not directly consider lowconfidence transcripts as erroneous ones since they could have been correctly assembled despite the hardness, the described measure is useful from a user’s pointofview since it enables to flag the transcripts that result from a “difficult” choice during the assembly, no matter which assembler is used. We showed that this measure can indeed capture assembly errors in real cases and, when compared to RsemEval [4] and to TransRate [5] on the specific task of identifying chimeric transcripts, the measure we propose outperformed the ones used by RsemEval and TransRate by a large margin. The originality of our work, when compared to other methods for transcriptome assembly evaluation, is that we use only the topology of the graph. Despite the successful application of the Branching Measure in global transcriptome assembly, it remains a simple method, and in particular, we would like to emphasise that it must be seen as a proof of concept that exploring the topology of the subgraph around a transcript can give some hints about its confidence level, quality, assembly hardness, etc. The method proposed is not a fullfledged one for assessing transcripts in a de novo context. It could however be a promising direction to improve transcriptome assembly evaluation, especially when combined with statistical and readmapping approaches (e.g. RsemEval [4] or TransRate [5]).
As concerns the local transcriptome assembly of variations, the most interesting open problem which currently remains is how to efficiently enumerate AS events whose variable region (e.g. skipped exon, retained intron) traverses repeats. Although the application of the proposed model enabled to retrieve several AS events that were previously missed, the current algorithm is still only able to avoid repeats, not to solve them. The presence of repeats in RNAseq data shows that transcriptome assemblers should formally address the repeats issue, as is generally the case of genome assemblers, preferably by solving them. Even if repeats are less frequent in transcriptomic data and are thus easier to solve than in the genomic context, the complexity and ambiguity they add are enough to cause problems if not addressed properly. If this is not done, it will impact the assembly of fulllength transcripts or variants, leading to either erroneous or fragmented ones, especially in regions that are more prone to contain repeats, such as introns, UTRs, and exonised repeats.
As concerns future works, our repeats model could be improved. One direction would be to employ a treelike structure to take into account the evolutionary nature of repeat (sub)families. Variability in the sizes of the copies of a repeat family would also enable to model more realistically the true nature of families of transposable elements (the type of repeats which cause most trouble in assembly). Another example would be to explicitly model sequencing errors in Theorem 1. Although, in practice, assemblers like KisSplice [1] employ a sequencing error removal module, it remains unclear how to distinguish the structures created by sequencing errors from the ones induced by a lowlyexpressed member of a highlyexpressed family of repeats, or by infrequent allelic differences in poolseq. The difficulty increases in regions that are highly expressed or that present sequencing error bias. In practice, error removal strategies may be too stringent and erroneously remove SNPs and repeats. Taking into account the sequencing errors in the model would make it applicable even without any preprocessing of the data, and would thus be more sensitive for finding repeats if such errors are correctly modeled. Finally, the Branching Measure could also be extended to identify truncated transcripts and isoforms stemming from paralogous genes.
Notes
This dataset can be found at the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress/) under the accession number EGEUV6, and we used the individual named NA06994, extract name “NA06994.2.M_111215_7 extract”.
This dataset can be found at http://genome.crg.es/encode_RNA_dashboard/hg19/, and is also accessible with the following accession numbers: ENCSR000CPN—SRA: SRR315315, SRR315316 and ENCSR000CTT—SRA: SRR534309, SRR534310. For cell lines treated by retinoic acid, the reads were 76nt long, while they were 100nt long for the non treated cells. Hence we trimmed all reads to 76nt.
References
Sacomoto G, Lacroix V, Sagot MF. A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs and its application to the detection of alternative splicing in RNAseq data. In: WABI, pp. 99–111 (2013).
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, LindbladToh K, Friedman N, Regev A. Fulllength transcriptome assembly from RNASeq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52.
Schulz M, Zerbino D, Vingron M, Birney E. Oases: robust de novo RNAseq assembly across the dynamic range of expression levels. Bioinformatics. 2012;28(8):1086–92.
Li B, Fillmore N, Bai Y, Collins M, Thomson J, Stewart R, Dewey C. Evaluation of de novo transcriptome assemblies from RNASeq data. Genome Biol. 2014;15(12):553.
SmithUnna R, Boursnell C, Patro R, Hibberd J, Kelly S. TransRate: reference free quality assessment of de novo transcriptome assemblies. Genome Res. 2016;26(8):1134–44.
Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC. A wholegenome assembly of Drosophila. Science. 2000;287(5461):2196–204.
Novák P, Neumann P, Macas J. Graphbased clustering and characterization of repetitive sequences in nextgeneration sequencing data. BMC Bioinform. 2010;11(1):378.
Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J, Derrien T, Drenkow J, Dumais E, Dumais J, Duttagupta R, Falconnet E, Fastuca M, FejesToth K, Ferreira P, Foissac S, Fullwood MJ, Gao H, Gonzalez D, Gordon A, Gunawardena H, Howald C, Jha S, Johnson R, Kapranov P, King B, Kingswood C, Luo OJ, Park E, Persaud K, Preall JB, Ribeca P, Risk B, Robyr D, Sammeth M, Schaffer L, See LHH, Shahab A, Skancke J, Suzuki AMM, Takahashi H, Tilgner H, Trout D, Walters N, Wang H, Wrobel J, Yu Y, Ruan X, Hayashizaki Y, Harrow J, Gerstein M, Hubbard T, Reymond A, Antonarakis SE, Hannon G, Giddings MC, Ruan Y, Wold B, Carninci P, Guigó R, Gingeras TR. Landscape of transcription in human cells. Nature. 2012;489(7414):101–8.
Tilgner H, Knowles D, Johnson R, Davis C, Chakrabortty S, Djebali S, Curado JA, Snyder M, Gingeras T, Guigó R. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly cotranscriptional in the human genome but inefficient for lncRNAs. Genome Res. 2012;22:1616–25.
Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu AL, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJM, Hoodless PA, Birol I. De novo assembly and analysis of RNAseq data. Nat Methods. 2010;7(11):909–12.
Peng Y, Leung H, Yiu S, Lv M, Zhu X, Chin F. IDBAtran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels. Bioinformatics. 2013;29(13):i326–34.
Sacomoto G, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot MF, Peterlongo P, Lacroix V. KISSPLICE: denovo calling alternative splicing events from RNAseq data. BMC Bioinform. 2012;13(S–6):5.
Bao W, Kojima KK, Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mobile DNA. 2015;6(1):11.
LopezMaestre H, Brinza L, Marchet C, Kielbassa J, Bastien S, Boutigny M, Monnin D, El Filali A, Carareto CM, Vieira C, et al. SNP calling from RNAseq data without a reference genome: identification, quantification, differential analysis and impact on the protein sequence. Nucl Acids Res. 2016;44(19):148.
Carroll ML, RoyEngel AM, Nguyen SV, Salem AH, Vogel E, Vincent B, Myers J, Ahmad Z, Nguyen L, Sammarco M, Watkins WS, Henke J, Makalowski W, Jorde LB, Deininger PL, Batzer MA. Largescale analysis of the Alu Ya5 and Yb8 subfamilies and their contribution to human genomic diversity. J Mol Biol. 2001;311(1):17–40.
Jurka J, Bao W, Kojima K. Families of transposable elements, population structure and the origin of species. Biol Direct. 2011;6(1):44.
Bern M, Plassmann P. The steiner problem with edge lengths 1 and 2. Inf Process Lett. 1989;32(4):171–6.
Lappalainen T, Sammeth M, Friedlander MR. /‘t Hoen, PAC, Monlong J, Rivas MA, GonzalezPorta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlof J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A, Sultan M, Bertier G, MacArthur DG, Lek M, Lizano E, Buermans HPJ, Padioleau I, Schwarzmayr T, Karlberg O, Ongen H, Kilpinen H, Beltran S, Gut M, Kahlem K, Amstislavskiy V, Stegle O, Pirinen M, Montgomery SB, Donnelly P, McCarthy MI, Flicek P, Strom TM, Consortium TG, Lehrach H, Schreiber S, Sudbrak R, Carracedo A, Antonarakis SE, Hasler R, Syvanen AC, van Ommen GJ, Brazma A, Meitinger T, Rosenstiel P, Guigo R, Gut IG, Estivill X, Dermitzakis ET. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501(7468):506–11.
Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–6.
Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, Sammeth M. Modelling and simulating generic RNAseq experiments with the flux simulator. Nucl Acids Res. 2012;40(20):10073.
Kent WJ. BLAT—the BLASTlike alignment tool. Genome Res. 2002;12:656–64.
Freyermuth F, Rau F, Kokunai Y, Linke T, Sellier C, Nakamori M, Kino Y, Arandel L, Jollet A, Thibault C, Philipps M, Vicaire S, Jost B, Udd B, Day JW, Duboc D, Wahbi K, Matsumura T, Fujimura H, Mochizuki H, Deryckere F, Kimura T, Nukina N, Ishiura S, Lacroix V, CampanFournier A, Navratil V, Chautard E, Auboeuf D, Horie M, Imoto K, Lee KY, Swanson MS, de Munain AL, Inada S, Itoh H, Nakazawa K, Ashihara T, Wang E, Zimmer T, Furling D, Takahashi MP, CharletBerguerand N. Splicing misregulation of SCN5A contributes to cardiacconduction delay and heart arrhythmia in myotonic dystrophy. Nat Commun. 2016;7:11067.
Authors’ contributions
BS, GS, MFS and VL developed the model for repeats. LL, BS, GS, MFS developed and implemented the algorithms. LL, HLM, CM, VM performed the experiments. Part of this material appeared in WABI 2014. All authors read and approved the manuscript.
Acknowledgements
LL acknowledges CNPq/Brazil for the financial support. This work was performed using the computing facilities of the CC LBBE/PRABI.
Competing interests
The authors declare that they have no competing interests.
Funding
LL was funded by the Brazilian Ministry of Science, Technology and Innovation (in portuguese, Ministério da Ciência, Tecnologia e Inovação  MCTI) through the National Counsel of Technological and Scientific Development (in portuguese, Conselho Nacional de Desenvolvimento Científico e Tecnológico  CNPq), under the Science Without Borders (in portuguese, Ciências Sem Fronteiras) scholarship grant process number 203362/20144. VL was funded by the Agence Nationale de la Recherche ABS4NGS ANR project (ANR11BINF000106) and Action n3.6 Plan Cancer 2009–2013. This work was also funded by the Agence Nationale de la Recherche ANR12BS020008 (Colib’read) with grants to LL, BS, GS, HLM, CM, MFS, and VL. This work was also funded by the European Research Council under the European Community’s Seventh Framework Programme (FP7 /2007–2013)/ERC Grant Agreement No. [247073]10. with grants to BS, GS, MFS, and VL.
Author information
Authors and Affiliations
Corresponding author
Additional information
Leandro Lima and Blerina Sinaimeri contributed equally to this work
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lima, L., Sinaimeri, B., Sacomoto, G. et al. Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNAseq reads. Algorithms Mol Biol 12, 2 (2017). https://doi.org/10.1186/s1301501700912
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1301501700912
Keywords
 Transcriptome assembly
 RNAseq
 Repeats
 Alternative splicing
 Formal model for representing repeats
 Enumeration algorithm
 De Bruijn graph topology
 Assembly evaluation