Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNAseq reads
 Leandro Lima†^{1, 2}Email author,
 Blerina Sinaimeri†^{1, 2},
 Gustavo Sacomoto^{1, 2},
 Helene LopezMaestre^{1, 2},
 Camille Marchet^{3},
 Vincent Miele^{2},
 MarieFrance Sagot^{1, 2} and
 Vincent Lacroix^{1, 2}
https://doi.org/10.1186/s1301501700912
© The Author(s) 2017
Received: 27 July 2016
Accepted: 27 January 2017
Published: 22 February 2017
Abstract
Background
The main challenge in de novo genome assembly of DNAseq data is certainly to deal with repeats that are longer than the reads. In de novo transcriptome assembly of RNAseq reads, on the other hand, this problem has been underestimated so far. Even though we have fewer and shorter repeated sequences in transcriptomics, they do create ambiguities and confuse assemblers if not addressed properly. Most transcriptome assemblers of short reads are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNAseq data, relying instead on heuristics to deal with them.
Results
The results of this work are threefold. First, we introduce a formal model for representing high copynumber and lowdivergence repeats in RNAseq data and exploit its properties to infer a combinatorial characteristic of repeatassociated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NPcomplete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99–111, 1), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644–652, 2), and Oases (Schulz et al. in Bioinformatics 28(8):1086–1092, 3), for the specific task of calling AS events. Third, we turn our focus to fulllength transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Based on the observation that repeats create complicated regions in a DBG, and when assemblers try to traverse these regions, they can infer erroneous transcripts, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and not read nor coverage information. We show that our simple method gives better results than RsemEval (Li et al. in Genome Biol 15(12):553, 4) and TransRate (SmithUnna et al. in Genome Res 26(8):1134–1144, 5) on both real and simulated datasets for detecting chimeras, and therefore is able to capture assembly errors missed by these methods.
Keywords
Background
Transcriptomes can now be studied through sequencing. However, in the absence of a reference genome, de novo assembly remains a challenging task. The main difficulty certainly comes from the fact that sequencing reads are short, and repeated sequences within transcriptomes could be longer than the reads. This short read/long repeat issue is of course not specific to transcriptome sequencing. It is an old problem that has been around since the first algorithms for genome assembly. Even though the problems repeats cause in both contexts are similar, they have also some characteristics that are specific to each. In genome assembly, repeats tend to be longer and present in more copies. In transcriptome assembly, repeats are located within genes and tend to be shorter and in fewer copies. However, in this last case, coverage cannot be applied to discriminate contigs that correspond to repeats, as it can be in genomics by using e.g. Myers’ Astatistics [6, 7], since the coverage of a gene does not only reflect its copynumber in the genome, but also and mostly its expression level. Some genes are highly expressed and therefore highly covered, while most genes are poorly expressed and therefore poorly covered. Such specificities complicate the application of a genomic repeatsolving strategy to the transcriptomic context.
Initially, it was thought that repeats would not be a major issue in RNAseq, since they are mostly in introns and intergenic regions. However, the truth is that many regions which are thought to be intergenic are transcribed [8] and introns are not always already spliced out when mRNA is collected to be sequenced [9]. Repeats, especially transposable elements, are therefore very present in real samples and cause major problems in transcriptome assembly, if not addressed properly.
Most, if not all current shortread transcriptome assemblers are based on de Bruijn graphs. Among the best known are Oases [3], Trinity [2], and to a lesser degree TransAbyss [10] and IDBAtran [11]. Common to all of them is the lack of a clear and explicit model for repeats in RNAseq data. Heuristics are thus used to try and cope efficiently with repeats. For instance, in Oases short vertices are thought to correspond to repeats and are therefore not used for assembling genes. They are added in a second step, which hopefully causes genes sharing repeats not to be assembled together. In Trinity, there is no attempt to deal with repeats by explicitly modelling them. The first module of Trinity, Inchworm, will try and assemble the most covered contig which hopefully corresponds to the most abundant alternative transcript. Then alternative exons are glued to this major transcript to form a splicing graph. The last step is to enumerate all alternative transcripts. If repeats are present, their high coverage may be interpreted as a highly expressed link between two unrelated transcripts. Overall, assembled transcripts may be chimeric or spliced into many subtranscripts.
In the method we had previously developed, KisSplice, which is a local transcriptome assembler [12], repeats are less problematic since the goal is not to assemble fulllength transcripts. KisSplice instead aims at finding variants in transcriptomes (SNPs, indels and alternative splicings). However, as we reported in [12], KisSplice was not able to deal with large portions of a de Bruijn graph containing subgraphs associated to highly repeated sequences, e.g. transposable elements, the socalled complex Biconnected Components.
Here, we try and achieve three goals: (1) give a clear formalisation of the notion of repeats with high copynumber in RNAseq data, (2) apply it on local transcriptome assembly by giving a practical way to enumerate bubbles that are lost because of such repeats, and (3) apply it on global transcriptome assembly by showing that the topology of the subgraph around a transcript can give some hints about its confidence level. Recall that we are in a de novo context, so we assume that neither a reference genome/transcriptome nor a database of known repeats, e.g. RepBase [13], are available.
First, we formally introduce a model for representing high copynumber repeats and exploit its properties to infer that repeatassociated subgraphs in a de Bruijn graph contain few compressible arcs. However, we show that the problem of identifying, in a de Bruijn graph, a subgraph corresponding to repeats according to such characterisation is NPcomplete. A polynomial time algorithm is therefore unlikely to exist.
Second, we show that in the specific case of a local assembly of alternative splicing (AS) events, by using a strategy based on the compressiblearc characterization, we can implicitly avoid such subgraphs. More precisely, it is possible to find the structures (i.e. bubbles) corresponding to AS events in a de Bruijn graph that are not contained in a repeatassociated subgraph (see Fig. 3 for an example). While there has been great efforts in the literature to solve repeats, there has been almost no exploration on how to avoid them. This is explained by the fact that most efforts in assembly concentrate on fulllength genome and transcriptome assembly, in which avoiding repeats is not an option, and the performance of an assembler can be narrowed down to how well it solves repeats. However, in our case, repeatavoidance can be an effective technique. Indeed, this fact was confirmed by our experiments, where using human simulated RNAseq data, we show that the new algorithm improves significantly the sensitivity of KisSplice, while also improving its precision. We further compared our algorithm to two of the best transcriptome assemblers, namely Trinity [2] and Oases [3], in the specific task of calling AS events, and we show that our algorithm is more sensitive than both tools, while also being more precise. In addition, our results show that the advantage of using the new algorithm proposed in this work is more evident when the input data contains high premRNA content or the AS events of interest stem from highlyexpressed genes. Moreover, we give an indication of the usefulness of our method on real data.
Third, we show that the method described can also be applied in the context of fulllength transcriptome assembly. We introduce a measure based on the proposed model to identify lowconfidence transcripts, which are the ones that traverse complex regions in the de Bruijn Graph. Within these complex parts of the graph generated by repeats, any assembler will have to choose the “right” path(s) among the many present. This choice is not simple and may lead to incorrect solutions (e.g. chimeric or truncated transcripts). It is therefore important to be able to identify the transcripts coming from such complex regions in order to know that the solution presented is not the only one, and furthermore may not be the right one. We compared our measure against two stateoftheart methods for de novo transcriptome evaluation, namely RsemEval [4] and TransRate [5], for the specific task of identifying chimeric transcripts in both real and simulated datasets. We show that our measure provides good results despite the fact that it uses only the graph topology, and not coverage, nor read information. The results obtained thus suggest that exploring the topology of the subgraph around a transcript, an information that is currently disregarded by transcriptome evaluation methods, can be useful to infer some of the transcript’s properties, such as confidence level, quality, assembly hardness, etc. Therefore, our measure can improve the stateoftheart methods for de novo transcriptome evaluation, since it is able to capture assembly errors missed by these tools.
Preliminaries
Let \(\Sigma\) be an alphabet of fixed size \(\sigma\). Here we always assume \(\Sigma =\{A,C,T,G\}\). Given a sequence (string) \(s \in \Sigma ^*\), let s denote its length, s[i] the ith element of s, and s[i, j] the substring \(s[i] s[i+1] \ldots s[j]\) for any \(1 \le i<j \le s\).
A kmer is a sequence \(s \in \Sigma ^k\). Given an integer k and a set S of sequences each of length \(n \ge k\), we define span(S, k) as the set of all distinct kmers that appear as a substring in S.
Definition 1
Given a set of sequences (reads) \(R \subseteq \Sigma ^*\) and an integer k, we define the directed de Bruijn graph \(G_k(R)=(V,A)\) where \(V=span(R,k)\) and \((u,v) \in A\) if and only if \(u[2,k]=v[1,k1]\).
Given a directed graph \(G = (V,A)\) and a vertex \(v \in V\), we denote its outneighbourhood (resp. inneighbourhood) by \(N^+(v)=\{ u \in V \mid (v,u) \in A \}\) (resp. \(N^(v)=\{ u \in V \mid (u,v) \in A \}\)), and its outdegree (resp. indegree by \(d^+(v)=N^+(v)\) (\(d^(v)=N^(v)\)). A (simple) path \(\pi = s \leadsto t\) in G is a sequence of distinct vertices \(s = v_0, \ldots , v_l = t\) such that, for each \(0 \le i < l\), \((v_i, v_{i+1})\) is an arc of G. If the graph is weighted, i.e. there is a function \(w : A \rightarrow Q_{\ge 0}\) associating a weight to every arc in the graph, then the length of a path \(\pi\) is the sum of the weights of the traversed arcs, and is denoted by \(\pi \).
Repeats in de Bruijn graphs
Given a de Bruijn graph \(G_k(R)\) generated by a set of reads R for which we do not have any prior information, our goal is to identify whether there are subgraphs of \(G_k(R)\) that correspond each to a set of high copynumber repeats in R. To this end, we identify and then exploit some of the topological properties of the subgraphs that are induced by repeats. Starting with a formal model for representing repeats with highcopy number, we show that the number of compressible arcs, which we denote by \(\gamma\), is a relevant parameter for such a characterisation. This parameter will play an important role in the algorithm of “Bubbles “drowned” in repeats” section. However, we also prove that, for an arbitrary de Bruijn graph, identifying a subgraph \(G'\) with bounded \(\gamma (G')\) is NPcomplete.
Simple uniform model for repeats
We now present the model we adopted for representing high copynumber repeats, e.g. transposable elements, in a genome or transcriptome. First, we would like to clarify that our model is a simple one and, as such, should be seen as only a first approximation, yet realistic enough, of what may happen in reality. We consider here that sequencing errors can be successfully removed. Indeed, there are several techniques to remove the big majority of the sequencing errors in RNAseq data. In KisSplice, for example, we prune the de Bruijn graph using an absolute and a relative cutoff based on the kmer coverage. The absolute cutoff enables us to remove sequencing errors in general, and the relative one is tailored to deal with highlyexpressed genes (more details can be found in [14]). Furthermore, while we realise that there is room for improvement, in practice, the sequencingerrorremoval procedure in KisSplice seems to be effective, as most sequencing errors are removed at the expense of losing some rare genomic variants [14].
Basically, our model consists of several “similar” sequences, each generated by uniformly mutating a fixed initial sequence. In particular, it enables to model well recent invasions of transposable elements which often involve high copynumber and low divergence rate (i.e. divergence from their consensus sequence). Consider indeed as an example the recent subfamilies AluYa5 and AluYb8 with 2640 and 1852 copies respectively, which both present a divergence rate below \(1\%\) [15] (see [16] for other subfamilies with high copynumber and low divergence).
The model is as follows. First, due to mutations, the sequences \(s_1, \ldots , s_m\) that represent the repeats are not identical. However, provided that the number of such mutations is not high (otherwise the concept of repeats would not apply), the repeats are considered “similar” in the sense of having a small pairwise Hamming distance between them. We recall that, given two equal length sequences \(s\) and \(s'\) in \(\Sigma ^n\), their Hamming distance, denoted by \(d_H(s,s')\), is the number of positions i for which \(s[i] \ne s'[i]\). Indels are thus not considered in this model.
Topological characterisation of the subgraphs generated by repeats
Given a de Bruijn graph \(G_k(R)\), if a is a compressible arc labelled by the sequence \(s=s_1 \ldots s_{k+1}\) then, by definition, a is the only outgoing arc of the vertex labelled by the sequence s[1, k] and the only incoming arc of the vertex labelled by the sequence \(s[2,k+1]\). Hence the \((k1)\)mer s[2, k] appears as a substring in R, always preceded by the symbol s[1] and followed by the symbol \(s[k+1]\). We refer to such \((k1)\)mers as being boundary rigid. It is not difficult to see that the set of compressible arcs in a de Bruijn graph \(G_k(R)\) stands in a onetoone correspondence with the set of boundary rigid \((k1)\)mers in R.
We now calculate and compare among them the expected number of compressible arcs in \(G=G_k(R)\) when R corresponds to a set of sequences that are generated: (1) uniformly at random, and (2) according to our model. We show that \(\gamma\) is “small” in the cases where the induced graph corresponds to similar sequences, which provides evidence for the relevance of this parameter.
Claim 1
Let R be a set of m sequences randomly chosen from \(\Sigma ^n\). Then the expected number of compressible arcs in \(G_k(R)\) is \(\Theta (mn)\).
Proof
The probability that a sequence of length \(k1\) occurs in a fixed position in a randomly chosen sequence of length n is \((1/4)^{k1}\). Thus the expected number of appearances of a sequence of length \(k1\) in a set of m randomly chosen sequences of length n is given by \(m(nk+2) (1/4)^{k1}\). If \(m(nk+2) \le 4^{k1}\), then this value is upper bounded by 1, and all the sequences of length \(k1\) are expected to be boundary rigid (as a sequence is expected to appear once). The claim follows by observing that there are \(m(nk+2)\) different \((k1)\)mers.□
We consider now \(\gamma (G_k(R))\) for \(R=S(m,n,\alpha )\). We upper bound the expected number of compressible arcs by upper bounding the number of boundary rigid \((k1)\)mers.
Theorem 1
Given integers k, n, m with \(k<n\) and a real number \(0\le \alpha \le 3/4\), the de Bruijn graph \(G_k(S(m,n,\alpha ))\) has o(nm) expected compressible arcs.
Proof
Let \(s_0\) be a sequence chosen randomly from \(\Sigma ^n\). Let \(S(m,n,\alpha )\) be the set \(\{s_1, \ldots , s_m\}\) of m repeats generated according to our model starting from \(s_0\). Consider now the de Bruijn graph \(G=G_k(S(m,n,\alpha ))\). Recall that the number of compressible arcs in this graph is equal to the number of boundary rigid \((k1)\)mers in \(S(m,n,\alpha )\). Let X be a random variable representing the number of boundary rigid \((k1)\)mers in G. Consider the repeats in \(S(m,n,\alpha )\) in a matrixlike ordering as in Fig. 2 and observe that the mutations from one column to another are independent. Due to the symmetry and the linearity of the expectation, E[X] is given by \(m(nk+2)\) (the total number of \((k1)\)mers) multiplied by the probability that a given \((k1)\)mer is boundary rigid.
The probability that the \((k1)\)mer \(\hat{s}=s[i,i+k2]\) is boundary rigid clearly depends on the distance from the starting sequence \(\hat{s}_0=s_0[i,i+k2]\). Let d be the distance \(d_H(\hat{s} ,\hat{s}_0)\).
The previous result shows that the number of compressible arcs is a good parameter for characterising a repeatassociated subgraph.
Identifying a repeatassociated subgraph
As we showed, a subgraph due to repeated elements has a distinctive feature: it contains few compressible arcs. Based on this, a natural formulation to the repeat identification problem in RNAseq data is to search for large enough subgraphs that do not contain many compressible arcs. This is formally stated in Problem 1. In order to disregard trivial solutions, it is necessary to require a large enough connected subgraph, otherwise any set of disconnected vertices or any small subgraph would be a solution. Unfortunately, we show that this problem is NPcomplete, so an efficient algorithm for the repeat identification problem based on this formulation is unlikely.
Problem 1
[Repeat Subgraph] INSTANCE: A directed graph G and two positive integers m, t.
DECIDE: If there exists a connected subgraph \(G'=(V', E')\), with \(V' \ge m\) and having at most t compressible arcs.
In Theorem 2, we prove that this problem is NPcomplete for all directed graphs with (total) degree, i.e. sum of in and outdegree bounded by 3. The reduction is from the Steiner tree problem which requires finding a minimum weight subgraph spanning a given subset of vertices. It remains NPhard even when all arc weights are 1 or 2 (see [17]). This version of the problem is denoted by STEINER(1, 2). More formally, given a complete undirected graph \(G = (V,E)\) with arc weights in \(\{1,2\}\), a set of terminal vertices \(N \subseteq V\) and an integer B, it is NPcomplete to decide if there exists a subgraph of G spanning N with weight at most B, i.e. a connected subgraph of G containing all vertices of N.
We specify next a family of directed graphs that we use in the reduction. Given an integer x, we define the directed graph R(x) as a cycle on 2x vertices numbered in a clockwise order and where the arcs have alternating directions, i.e. for any \(i \le x\), \((v_{2i},v_{2i+1})\) is an arc. Observe that in R(x), all vertices in even positions, i.e. all vertices \(v_{2i}\), have outdegree 2 and indegree 0, while all vertices \(v_{2i+1}\) have outdegree 0 and indegree 2. Clearly, none of the arcs of R(x) is compressible.
Theorem 2
The Repeat Subgraph Problem is NPcomplete even for directed graphs with degree bounded by d, for any \(d \ge 3\).
Proof
Given a complete graph \(G = (V,E)\), a set of terminal vertices N and an upper bound B, i.e. an instance of STEINER(1, 2), we transform it into an instance of the Repeat Subgraph Problem for a graph \(G'\) with degree bounded by 3. Let us first build the graph \(G' = (V', E')\). For each vertex v in \(V \setminus N\), add a corresponding subgraph \(r(v) = R(V)\) in \(G'\) and for each vertex v in N, add a corresponding subgraph \(r(v) = R(E+V^2 + 1)\) in \(G'\). For each arc (u, v) in E with weight \(w \in \{1,2\}\), add a simple directed path composed by w compressible arcs connecting r(u) to r(v) in \(G'\); these are the subgraphs corresponding to u and v. The first vertex of the path should be in a sink of r(u) and the last vertex in a source of r(v). By construction, there are at least V vertices with indegree 2 and outdegree 0 (sink) and V vertices with outdegree 2 and indegree 0 (source) in both r(v) and r(u). It is clear that \(G'\) has degree bounded by 3. Moreover, the size of \(G'\) is polynomial in the size of G and it can be constructed in polynomial time.
In this way, the graph \(G'\) has one subgraph for each vertex of G and a path with one or two (depending on the weight of the corresponding arc) compressible arcs for each arc of G. Thus, there exists a subgraph spanning N in G with weight at most B if and only if there exists a subgraph in \(G'\) with at least \(m =2N + 2EN + 2V^2N\) vertices and at most \(t = B\) compressible arcs. This follows from the fact that any subgraph of \(G'\) with at least m vertices necessarily contains all the subgraphs r(v), where \(v \in N\), since the number of vertices in all r(v), with \(v \in V\setminus N\), is at most \(E + 2V^2\) and the only compressible arcs of \(G'\) are in the paths corresponding to the arcs of G.□
We can obtain the same result for the specific case of subgraphs of de Bruijn graphs. The reduction is more technical but follows similarly.
Theorem 3
The Repeat Subgraph Problem is NPcomplete even for subgraphs of de Bruijn graphs on \(\Sigma  = 4\) symbols.
Bubbles “drowned” in repeats
In the previous section, we showed that an efficient algorithm to directly identify the subgraphs of a de Bruijn graph corresponding to repeated elements according to our model (i.e. containing few compressible arcs), is unlikely to exist since the problem is NPcomplete. However, in this section we show that in the specific case of a local assembly of alternative splicing (AS) events based on the compressiblearc characterisation of “Topological characterisation of the subgraphs generated by repeats” section, we can implicitly avoid such subgraphs. More precisely, it is possible to find the structures (i.e. bubbles) corresponding to AS events in a de Bruijn graph that are not contained in a repeatassociated subgraph, thus answering to the main open question of [12].
In “Repeats in de Bruijn graphs” section, the repeatassociated subgraphs are characterised by the presence of few compressible arcs. This suggests that in order to avoid repeatassociated subgraphs, we should restrict the search to bubbles containing many compressible arcs. Equivalently, in a compressed de Bruijn graph (see “Preliminaries” section), we should restrict the search to bubbles with few branching vertices. We recall that a branching vertex is a vertex of indegree or outdegree strictly at least 2. Indeed, in a compressed de Bruijn graph, given a fixed sequence length, the number of branching vertices in a path is inversely proportional to the number of compressible arcs of the corresponding path in the noncompressed de Bruijn graph. We thus modify the definition of \((s,t,\alpha _1,\alpha _2)\)bubbles in compressed de Bruijn graphs (Def. 1 in [1]) by adding the extra constraint that each path should have at most b branching vertices.
Definition 2
Given a weighted directed graph \(G = (V,E)\) and two vertices \(s,t \in V\), an \((s,t,\alpha _1,\alpha _2,b)\)bubble is a pair of vertexdisjoint stpaths \(\pi _1\), \(\pi _2\) with lengths bounded by \(\alpha _1,\alpha _2\), each containing at most b branching vertices.
By restricting the search to bubbles with few branching vertices, we are able to enumerate them in complex regions implicitly avoiding repeatassociated subgraphs. Indeed, in “Experimental results” section we show that by considering bubbles with at most b branching vertices in KisSplice, we increase both its sensitivity and precision. This supports our claim that by focusing on \((s,t,\alpha _1,\alpha _2,b)\)bubbles, we avoid repeatassociated subgraphs and recover at least part of the bubbles trapped in complex regions.
Enumerating bubbles avoiding repeats
In this section, we modify the algorithm of [1] to enumerate all bubbles with at most b branching vertices in each path. Given a weighted directed graph \(G = (V,E)\) and a vertex \(s \in V\), let \(\mathcal {B}_s(G)\) denote the set of \((s,*,\alpha _1,\alpha _2,b)\)bubbles of G. The algorithm recursively partitions the solution space \(\mathcal {B}_s(G)\) at every call until the considered subspace is a singleton (contains only one solution), and in that case it outputs the corresponding solution. In order to avoid unnecessary recursive calls, it maintains the invariant that the current partition contains at least one solution. The algorithm proceeds as follows.
Invariant
At a generic recursive step on vertices \(u_1,u_2\) (initially, \(u_1 = u_2 =s\)), let \(\pi _1 = s \leadsto u_1, \pi _2 = s \leadsto u_2\) be the paths discovered so far (initially, \(\pi _1,\pi _2\) are empty). Let \(G'\) be the current graph (initially, \(G' := G\)). More precisely, \(G'\) is defined as follows: remove from G all the vertices in \(\pi _1\) and \(\pi _2\) but \(u_1\) and \(u_2\). Moreover, we also maintain the following invariant (INV): there exists at least one pair of paths \(\bar{\pi }_1\) and \(\bar{\pi }_2\) in \(G'\) that extend \(\pi _1\) and \(\pi _2\) so that \(\pi _1 \cdot \bar{\pi }_1\) and \(\pi _2 \cdot \bar{\pi }_2\) belong to \(\mathcal {B}_s(G)\).
Base case
When \(u_1 = u_2 = u\), output the \((s,u,\alpha _1,\alpha _2,b)\)bubble given by \(\pi _1\) and \(\pi _2\).
Recursive rule

The bubbles of \(\mathcal {B}_{s}(\pi _1,\pi _2, G')\) that use e, for each arc \(e = (u_1,v)\) outgoing from \(u_1\), that is \(\mathcal {B}_{s}(\pi _1 \cdot e, \pi _2, G'  u_1)\), where \(G'u_1\) is the subgraph of \(G'\) after the removal of \(u_1\) and all its incident arcs.

The bubbles that do not use any arc from \(u_1\), that is \(\mathcal {B}_{s}(\pi _1,\pi _2, G'')\), where \(G''\) is the subgraph of \(G'\) after the removal of all arcs outgoing from \(u_1\).
In order to maintain the invariant (INV), we only perform the recursive calls when \(\mathcal {B}_{s}(\pi _1 \cdot e, \pi _2, G'  u)\) or \(\mathcal {B}_{s}(\pi _1,\pi _2, G'')\) are nonempty. In both cases, we have to decide if there exists a pair of (internally) vertexdisjoint paths \(\bar{\pi }_1 = u_1 \leadsto t_1\) and \(\bar{\pi }_2 = u_2 \leadsto t_2\), such that \(\bar{\pi }_1 \le \alpha _1'\), \(\bar{\pi }_2 \le \alpha _2'\), and \(\bar{\pi }_1,\bar{\pi }_2\) have at most \(b_1,b_2\) branching vertices, respectively. Since both the length and the number of branching vertices are monotonic properties, i.e. both are smaller for a prefix instead of for the full path, we can drop the vertexdisjoint condition. Indeed, let \(\bar{\pi }_1\) and \(\bar{\pi }_2\) be a pair of paths satisfying all conditions but the vertexdisjoint one. The prefixes \(\bar{\pi }^*_1 = u_1 \leadsto t^*\) and \(\bar{\pi }^*_2 = u_2 \leadsto t^*\), where \(t^*\) is the first intersection of the paths, satisfy all conditions and are internally vertexdisjoint.
Moreover, using a dynamic programming algorithm, we can obtain the following result.
Lemma 1
Given a nonnegatively weighted directed graph \(G = (V,E)\) and a source \(s \in V\), we can compute the shortest paths from s using at most b branching vertices in O(bE) time.
Proof
Let \(d[\beta , t]\) denote the distance from s to t using at most \(\beta\) branching vertices (s is never counted as a branching vertex, even if it is branching). The recurrence to calculate \(d[\beta ,t]\), for \(0 \le \beta \le b\) and \(t \in V\) is:
Finally, since the calculation of each value \(d[\beta ,t]\) takes \(O(N^(t))\) time, the algorithm runs in \(O(b \sum _{t \in V}N^(t)) =O(b E)\) time. We can guarantee that this algorithm runs in time polynomial in the length of the input by upperbounding b by V (if \(b > V\), we simply set \(b=V\)).□
As a corollary of Lemma 1, we can decide if \(\mathcal {B}_{s}(\pi _1,\pi _2, G)\) is nonempty in O(bE) time. Now, using an argument similar to [1], i.e. the leaves of the recursion tree and the solutions are in onetoone correspondence and the height of the recursion tree is bounded by 4b, we obtain the following theorem.
Theorem 4
The \((s,*,\alpha _1,\alpha _2,b)\)bubbles can be enumerated in \(O(b^2E\mathcal {B}_s(G))\) time. Moreover, the time elapsed between the output of any two consecutive solutions (i.e. the delay) is \(O(b^2E)\).
Measuring the confidence of a transcript in fulllength transcriptome assemblers
Reconstructing fulllength transcripts from reads is a challenging task because two transcripts, even from different genes, may very well share subsequences that are longer than the sequenced reads, or even longer than the fragments in case of pairedend sequencing. This is specially true when genes host transposable elements within their introns, and less frequently but still present, within their UTRs and also exons (e.g. exonised repeats). Even if a repeatcontaining intron is always spliced out in the splicing phase, this intron, and consequently the repeat, can still be present in RNAseq data. The fraction of introns present in the sequenced data depends on the cell compartment that is sampled (nucleus, cytoplasm or both) and the protocol to remove rRNA (ribo0 or polydT primers). As estimated in [9], the level of premRNA can be assumed to vary between 2 and 22%. The true level of premRNA may however be in practice higher, because the methods used for estimating it are mappingbased and therefore deal poorly with reads stemming from repeated regions. Besides, the upper bound given in [9] corresponds to extraction protocols which are harder to obtain. In this work, we considered the most commonly used extraction protocol to extract RNA, and assumed that they yielded premRNA fractions between 5 and 15%. Thus, more introns than expected are sequenced, generating problems to transcriptome assemblers, particularly when they span several members of a specific repeat family.
It is hence important to be able to identify lowconfidence transcripts, which are the ones traversing complex regions of a de Bruijn graph, in order to know that the solution presented is the result of a “difficult” choice and therefore may not be the right one. To identify such transcripts, we introduce the concept of Branching Measure of a transcript. Consider the set of transcripts \(\mathcal {T}\) output by a fulllength transcriptome assembler starting from a set of reads \(\mathcal {R}\). We construct the de Bruijn graph \(G_k(R)\), and map back each transcript \(t \in \mathcal {T}\) to the graph by identifying each of its kmers. Given a positive integer w, let W be a wsized window (or substring) with the largest number of branching kmers in t. We define the Branching Measure of a transcript t, B(t), as the proportion of branching kmers in W. By looking at B(t), it is possible to infer if t traversed a hardtoassemble region in the de Bruijn graph, and this can be used as a measure of its confidence, i.e. the higher B(t) is, the lower is the confidence of t.
As a proof of concept, in the following we show two examples of the application of the Branching Measure to transcripts assembled by Trinity on RNAseq data from the GEUVADIS project [18].
Experimental results
Local assembly: experimental setup
To evaluate the performance of our method, we simulated RNAseq data using the FluxSimulator version 1.2.1 [20]. We generated 100 million reads of 75 bp using its default error model. We used the RefSeq annotated Human transcriptome (hg19 coordinates) as a reference and we performed a twostep pipeline to obtain a mixture of mRNA and premRNA (i.e. with introns not yet spliced). To achieve this, we first ran the FluxSimulator with the Refseq annotations. We then modified the annotations to include the introns and reran it on this modified version. In this second run, we additionally constrained the expression values of the premRNAs to be correlated to the expression values of their corresponding mRNAs, as simulated in the first run. Finally, we mixed the two sets of reads to obtain a total of 100M reads. We tested two values, namely 5 and 15% for the proportion of reads from premRNAs. Those values were chosen so as to correspond to realistic ones (see “Measuring the confidence of a transcript in fulllength transcriptome assemblers” section).
On these simulated datasets, we ran KisSplice [12] versions 2.1.0 (Ks_2.1.0) and 2.2.0 (Ks_2.2.0, with a maximum number of branching vertices set to 5) and obtained lists of detected bubbles that are putative alternative splicing (AS) events. We also ran the fulllength transcriptome assemblers Trinity version r2013_08_14 and Oases version 0.2.09 on both datasets, obtaining a list of predicted transcripts, from which we then extracted a list of putative AS events. For Oases, there was one locus in each dataset for which we were not able to extract the putative AS events. A manual inspection revealed that they were mostly composed of subparts of introns or UTRs flanked by repeats, usually copies of ALUs. The presence of such high copynumber repeats in these transcripts induced the clustering of all these unrelated sequences into one complex locus. More precisely, in the dataset containing 5% of the reads from premRNAs, the largest locus that Oases predicted comprised 20,769 transcripts, while the second largest contained only 139 transcripts. In the other simulated dataset, the largest locus comprised 39,389 transcripts, and the second largest contained only 205 transcripts. This indicates that Oases faces similar issues to Ks_2.1.0. For fairness of comparison, we did not postprocess these complex loci, and we were then unable to retrieve the potential AS events that they could describe. It is worth mentioning though, that the majority of the transcripts inside these loci corresponded to subparts of introns or UTRs, and not to full introns or exons, and therefore could not describe AS events.
In order to assess the precision and the sensitivity of our method, we compared our set of found AS events to the set of true AS events. Following the definition of Astalavista, an AS event is composed of two sets of transcripts, the inclusion/exclusion isoforms respectively. We consider that an AS event is true if at least one transcript among the inclusion isoforms and one among the exclusion isoforms is present in the simulated dataset with at least 5 reads per kilobase (RPK). The rationale for adding this threshold is that very rare events are considerably hard, or even impossible, to detect by all methods.
To compare the results of KisSplice with the true AS events, we propose that a true AS event is a true positive (TP) if there is a bubble such that one path matches the inclusion isoform and the other the exclusion isoform. If there is no such bubble among the results of KisSplice, the event is counted as a false negative (FN). If a bubble does not correspond to any true AS event, it is counted as a false positive (FP). To align the paths of the bubbles to transcript sequences, we used the Blat aligner [21] with 95% identity and a constraint of 95% of each bubble path length to be aligned (to account for the sequencing errors simulated by FluxSimulator). We computed the sensitivity TP/(TP+FN) and precision TP/(TP+FP) for each simulation case and we report their values for various classes of expression of the minor isoform. Expression values are measured in RPK.
Local assembly: results
A closer look at Fig. 7a shows that both versions of KisSplice had better sensitivity than both transcriptome assemblers in the 5% premRNA dataset. However, due to its inability to deal with repeatassociated regions, the performance of Ks_2.1.0 drops substantially, from 46 to 33%, when a higher rate of 15% of premRNA is present in the data. The same happened with Oases. Ks_2.2.0 and Trinity, on the other hand, were able to slightly improve their sensitivity from the 5 to the 15% premRNA dataset. It is however worth mentioning that the sensitivity of Ks_2.2.0 is substantially higher than the one of Trinity in the 15% premRNA dataset. In summary, we can say that Ks_2.2.0 is substantially more sensitive than all the other three methods. This reflects the fact that most problematic repeats are in introns. A small unspliced mRNA rate leads to few repeatassociated subgraphs, so there are not many AS events drowned in them. In this case, the advantage of using Ks_2.2.0 is less obvious, whereas a large proportion of premRNA leads to more AS events drowned in repeatassociated subgraphs which are identified by Ks_2.2.0 and usually missed by the other methods.
In Fig. 7b we can see that Ks_2.2.0 and Trinity presented the highest precision (98%) of all methods in the 5% premRNA dataset. Although Ks_2.1.0 is ranked third, it still presents a very high precision (95%), while Oases presented a moderate value (80%). Nevertheless, the most important aspect to be observed in Fig. 7b is that Ks_2.2.0 kept the same high precision even when a higher rate of 15% of premRNA is present in the data. Trinity, on the other hand, dropped significantly from 98 to 79%. This drop in precision is actually mostly due to the prediction of a large number of intron retentions, since Trinity assembles both the mRNA and the premRNA. Ks_2.1.0 dropped slightly from 95 to 91%, and Oases dropped moderately, from 80 to 70%. In summary, we can say that both versions of KisSplice are more precise than both transcriptome assemblers, except that Trinity shows comparable precision if a small rate of premRNA is present in the data, and, more specifically, that Ks_2.2.0 outperformed all the other three methods. The high precision we obtain indicates that we have very few false positives. Those mostly correspond to repeatinduced bubbles mistakenly identified as AS events.
Finally, Fig. 7c, d present the detailed sensitivity by expression levels of the four methods on both datasets, allowing for a better understanding of their performance. As we can see, Ks_2.2.0 presented the best sensitivity in practically all expression levels in both datasets, while the other three methods were worse, but comparable between themselves. We can also observe that the gap between the sensitivity of Ks_2.2.0 and the sensitivity of the other methods tends to increase as the expression levels of the genes increase, especially in the 15% premRNA dataset. Since highlyexpressed genes tend to present higher levels of premRNA content, they bring several repeat copies in their introns, and thus some parts of their associated graphs are complex and repeatinduced. Therefore, AS events inside such genes tend to be trapped in troublesome regions of the graph, making them harder to find. As Ks_2.2.0 is able to avoid such complex regions and retrieve the AS events deeply plunged into them, it presents better sensitivity than the other methods, especially in highlyexpressed genes and datasets with high rate of premRNAs.
As was already reported in [12], KisSplice (i.e. both Ks_2.2.0 and Ks_2.1.0) is faster and uses considerably less memory than Trinity and Oases. For instance, on these datasets, KisSplice uses around 5 GB of RAM, while Trinity uses more than 20 GB, and Oases, around 18 GB. However, it should be noted that both these latter methods try to solve a more general problem than KisSplice, that is reconstructing the fulllength transcripts.
To conclude, our results show that Ks_2.2.0 is significantly more sensitive and precise than Ks_2.1.0, Trinity and Oases for the task of identifying AS events. The advantage of using Ks_2.2.0 over the other three methods is more evident when the input data contains high premRNA content or the AS events of interest stem from highlyexpressed genes.
On the usefulness of Ks_2.2.0 on real data
Global assembly
To test our hypothesis that the Branching Measure is able to identify problematic transcripts, we evaluated it on the transcripts assembled by Trinity on the two simulated RNAseq datasets described in “Local assembly: results” section, and on two other real RNAseq datasets: one from the GEUVADIS project [18]^{1} and one from a neuroblastoma SKNSH cell line (ENCODE) differentiated or not using retinoic acid.^{2} Even though our method is referencefree, we have chosen to evaluate it under a model species so that we could make use of annotated reference genomes to assess if our predictions are correct. We compared our measure against two stateoftheart methods for de novo transcriptome evaluation, RsemEval [4] and TransRate [5], on the specific task of identifying chimeric transcripts in Trinity’s assemblies on all four described datasets. In all our tests, we used the contig impact score of RsemEval as a measure of contig correctness. Formally, the contig impact score is a statistical measure that compares the hypothesis that a particular contig (i.e. transcript) is a true contig with the null hypothesis that the reads composing the contig actually represent the background noise [4]. In other words, the contig impact score determines the relative contribution of each transcript to explaining the assembly. Wellassembled transcripts should therefore have a high contig impact score, and badly assembled transcripts, including chimeras, should have a low contig impact score. TransRate [5], on the other hand, allowed us to work with a specific metric representing the probability that a contig is derived from a single transcript. This metric denotes the probability that the read coverage of a transcript is best modelled by a single Dirichlet distribution, rather than two or more distributions, and it corresponds to the field sCseq of TransRate’s output file contigs.csv.
Concluding remarks and perspectives
Although transcriptome assemblers are now commonly used, their way to handle repeats is not satisfactory, arguably because the presence of repeats in transcriptomes has been underestimated so far. Given that most RNAseq datasets correspond to total mRNA extractions, many introns are still present in the data and their repeat content cannot be simply ignored. Although repeats in transcriptomic and genomic data cause similar problems to assemblers, the specificities of each mean that a successful strategy in one context may fail in the other. It is thus essential for transcriptome assemblers to formally address the repeats problem.
In this paper, we first proposed a simple formal model for representing high copynumber repeats in RNAseq data. Exploiting the properties of this model, we established that the number of compressible arcs is a relevant quantitative characteristic of repeatassociated subgraphs. We proved that the problem of identifying in a de Bruijn graph a subgraph with this characteristic is NPcomplete. However, this characteristic drove the design of an algorithm for efficiently identifying AS events that are not included in repeated regions. The new algorithm was implemented in KisSplice version 2.2.0, and by using simulated RNAseq data, we showed that it improves significantly the sensitivity of the previous version of KisSplice, while also improving its precision. In addition, we compared our algorithm to Trinity and Oases, and showed that for the specific task of calling AS events, our algorithm is significantly more sensitive while also being more precise. Our results also showed that the advantage of using KisSplice version 2.2.0 is more evident when the input data contains high premRNA content or the AS events of interest stem from highlyexpressed genes. Moreover, we gave an indication of the usefulness of our method on real data. Finally, we explored the proposed model in the context of fulllength transcriptome assembly by introducing the Branching Measure, which is able to flag the transcripts that go through a complex region in the de Bruijn graph. Even though one should not directly consider lowconfidence transcripts as erroneous ones since they could have been correctly assembled despite the hardness, the described measure is useful from a user’s pointofview since it enables to flag the transcripts that result from a “difficult” choice during the assembly, no matter which assembler is used. We showed that this measure can indeed capture assembly errors in real cases and, when compared to RsemEval [4] and to TransRate [5] on the specific task of identifying chimeric transcripts, the measure we propose outperformed the ones used by RsemEval and TransRate by a large margin. The originality of our work, when compared to other methods for transcriptome assembly evaluation, is that we use only the topology of the graph. Despite the successful application of the Branching Measure in global transcriptome assembly, it remains a simple method, and in particular, we would like to emphasise that it must be seen as a proof of concept that exploring the topology of the subgraph around a transcript can give some hints about its confidence level, quality, assembly hardness, etc. The method proposed is not a fullfledged one for assessing transcripts in a de novo context. It could however be a promising direction to improve transcriptome assembly evaluation, especially when combined with statistical and readmapping approaches (e.g. RsemEval [4] or TransRate [5]).
As concerns the local transcriptome assembly of variations, the most interesting open problem which currently remains is how to efficiently enumerate AS events whose variable region (e.g. skipped exon, retained intron) traverses repeats. Although the application of the proposed model enabled to retrieve several AS events that were previously missed, the current algorithm is still only able to avoid repeats, not to solve them. The presence of repeats in RNAseq data shows that transcriptome assemblers should formally address the repeats issue, as is generally the case of genome assemblers, preferably by solving them. Even if repeats are less frequent in transcriptomic data and are thus easier to solve than in the genomic context, the complexity and ambiguity they add are enough to cause problems if not addressed properly. If this is not done, it will impact the assembly of fulllength transcripts or variants, leading to either erroneous or fragmented ones, especially in regions that are more prone to contain repeats, such as introns, UTRs, and exonised repeats.
As concerns future works, our repeats model could be improved. One direction would be to employ a treelike structure to take into account the evolutionary nature of repeat (sub)families. Variability in the sizes of the copies of a repeat family would also enable to model more realistically the true nature of families of transposable elements (the type of repeats which cause most trouble in assembly). Another example would be to explicitly model sequencing errors in Theorem 1. Although, in practice, assemblers like KisSplice [1] employ a sequencing error removal module, it remains unclear how to distinguish the structures created by sequencing errors from the ones induced by a lowlyexpressed member of a highlyexpressed family of repeats, or by infrequent allelic differences in poolseq. The difficulty increases in regions that are highly expressed or that present sequencing error bias. In practice, error removal strategies may be too stringent and erroneously remove SNPs and repeats. Taking into account the sequencing errors in the model would make it applicable even without any preprocessing of the data, and would thus be more sensitive for finding repeats if such errors are correctly modeled. Finally, the Branching Measure could also be extended to identify truncated transcripts and isoforms stemming from paralogous genes.
This dataset can be found at the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress/) under the accession number EGEUV6, and we used the individual named NA06994, extract name “NA06994.2.M_111215_7 extract”.
This dataset can be found at http://genome.crg.es/encode_RNA_dashboard/hg19/, and is also accessible with the following accession numbers: ENCSR000CPN—SRA: SRR315315, SRR315316 and ENCSR000CTT—SRA: SRR534309, SRR534310. For cell lines treated by retinoic acid, the reads were 76nt long, while they were 100nt long for the non treated cells. Hence we trimmed all reads to 76nt.
Notes
Declarations
Authors’ contributions
BS, GS, MFS and VL developed the model for repeats. LL, BS, GS, MFS developed and implemented the algorithms. LL, HLM, CM, VM performed the experiments. Part of this material appeared in WABI 2014. All authors read and approved the manuscript.
Acknowledgements
LL acknowledges CNPq/Brazil for the financial support. This work was performed using the computing facilities of the CC LBBE/PRABI.
Competing interests
The authors declare that they have no competing interests.
Funding
LL was funded by the Brazilian Ministry of Science, Technology and Innovation (in portuguese, Ministério da Ciência, Tecnologia e Inovação  MCTI) through the National Counsel of Technological and Scientific Development (in portuguese, Conselho Nacional de Desenvolvimento Científico e Tecnológico  CNPq), under the Science Without Borders (in portuguese, Ciências Sem Fronteiras) scholarship grant process number 203362/20144. VL was funded by the Agence Nationale de la Recherche ABS4NGS ANR project (ANR11BINF000106) and Action n3.6 Plan Cancer 2009–2013. This work was also funded by the Agence Nationale de la Recherche ANR12BS020008 (Colib’read) with grants to LL, BS, GS, HLM, CM, MFS, and VL. This work was also funded by the European Research Council under the European Community’s Seventh Framework Programme (FP7 /2007–2013)/ERC Grant Agreement No. [247073]10. with grants to BS, GS, MFS, and VL.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Sacomoto G, Lacroix V, Sagot MF. A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs and its application to the detection of alternative splicing in RNAseq data. In: WABI, pp. 99–111 (2013).Google Scholar
 Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, LindbladToh K, Friedman N, Regev A. Fulllength transcriptome assembly from RNASeq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52.View ArticlePubMedPubMed CentralGoogle Scholar
 Schulz M, Zerbino D, Vingron M, Birney E. Oases: robust de novo RNAseq assembly across the dynamic range of expression levels. Bioinformatics. 2012;28(8):1086–92.View ArticlePubMedPubMed CentralGoogle Scholar
 Li B, Fillmore N, Bai Y, Collins M, Thomson J, Stewart R, Dewey C. Evaluation of de novo transcriptome assemblies from RNASeq data. Genome Biol. 2014;15(12):553.View ArticlePubMedPubMed CentralGoogle Scholar
 SmithUnna R, Boursnell C, Patro R, Hibberd J, Kelly S. TransRate: reference free quality assessment of de novo transcriptome assemblies. Genome Res. 2016;26(8):1134–44.View ArticlePubMedPubMed CentralGoogle Scholar
 Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC. A wholegenome assembly of Drosophila. Science. 2000;287(5461):2196–204.View ArticlePubMedGoogle Scholar
 Novák P, Neumann P, Macas J. Graphbased clustering and characterization of repetitive sequences in nextgeneration sequencing data. BMC Bioinform. 2010;11(1):378.View ArticleGoogle Scholar
 Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J, Derrien T, Drenkow J, Dumais E, Dumais J, Duttagupta R, Falconnet E, Fastuca M, FejesToth K, Ferreira P, Foissac S, Fullwood MJ, Gao H, Gonzalez D, Gordon A, Gunawardena H, Howald C, Jha S, Johnson R, Kapranov P, King B, Kingswood C, Luo OJ, Park E, Persaud K, Preall JB, Ribeca P, Risk B, Robyr D, Sammeth M, Schaffer L, See LHH, Shahab A, Skancke J, Suzuki AMM, Takahashi H, Tilgner H, Trout D, Walters N, Wang H, Wrobel J, Yu Y, Ruan X, Hayashizaki Y, Harrow J, Gerstein M, Hubbard T, Reymond A, Antonarakis SE, Hannon G, Giddings MC, Ruan Y, Wold B, Carninci P, Guigó R, Gingeras TR. Landscape of transcription in human cells. Nature. 2012;489(7414):101–8.View ArticlePubMedPubMed CentralGoogle Scholar
 Tilgner H, Knowles D, Johnson R, Davis C, Chakrabortty S, Djebali S, Curado JA, Snyder M, Gingeras T, Guigó R. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly cotranscriptional in the human genome but inefficient for lncRNAs. Genome Res. 2012;22:1616–25.View ArticlePubMedPubMed CentralGoogle Scholar
 Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu AL, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJM, Hoodless PA, Birol I. De novo assembly and analysis of RNAseq data. Nat Methods. 2010;7(11):909–12.View ArticlePubMedGoogle Scholar
 Peng Y, Leung H, Yiu S, Lv M, Zhu X, Chin F. IDBAtran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels. Bioinformatics. 2013;29(13):i326–34.View ArticlePubMedPubMed CentralGoogle Scholar
 Sacomoto G, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot MF, Peterlongo P, Lacroix V. KISSPLICE: denovo calling alternative splicing events from RNAseq data. BMC Bioinform. 2012;13(S–6):5.Google Scholar
 Bao W, Kojima KK, Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mobile DNA. 2015;6(1):11.View ArticlePubMedPubMed CentralGoogle Scholar
 LopezMaestre H, Brinza L, Marchet C, Kielbassa J, Bastien S, Boutigny M, Monnin D, El Filali A, Carareto CM, Vieira C, et al. SNP calling from RNAseq data without a reference genome: identification, quantification, differential analysis and impact on the protein sequence. Nucl Acids Res. 2016;44(19):148.Google Scholar
 Carroll ML, RoyEngel AM, Nguyen SV, Salem AH, Vogel E, Vincent B, Myers J, Ahmad Z, Nguyen L, Sammarco M, Watkins WS, Henke J, Makalowski W, Jorde LB, Deininger PL, Batzer MA. Largescale analysis of the Alu Ya5 and Yb8 subfamilies and their contribution to human genomic diversity. J Mol Biol. 2001;311(1):17–40.View ArticlePubMedGoogle Scholar
 Jurka J, Bao W, Kojima K. Families of transposable elements, population structure and the origin of species. Biol Direct. 2011;6(1):44.View ArticlePubMedPubMed CentralGoogle Scholar
 Bern M, Plassmann P. The steiner problem with edge lengths 1 and 2. Inf Process Lett. 1989;32(4):171–6.View ArticleGoogle Scholar
 Lappalainen T, Sammeth M, Friedlander MR. /‘t Hoen, PAC, Monlong J, Rivas MA, GonzalezPorta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlof J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A, Sultan M, Bertier G, MacArthur DG, Lek M, Lizano E, Buermans HPJ, Padioleau I, Schwarzmayr T, Karlberg O, Ongen H, Kilpinen H, Beltran S, Gut M, Kahlem K, Amstislavskiy V, Stegle O, Pirinen M, Montgomery SB, Donnelly P, McCarthy MI, Flicek P, Strom TM, Consortium TG, Lehrach H, Schreiber S, Sudbrak R, Carracedo A, Antonarakis SE, Hasler R, Syvanen AC, van Ommen GJ, Brazma A, Meitinger T, Rosenstiel P, Guigo R, Gut IG, Estivill X, Dermitzakis ET. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501(7468):506–11.Google Scholar
 Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–6.View ArticlePubMedPubMed CentralGoogle Scholar
 Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, Sammeth M. Modelling and simulating generic RNAseq experiments with the flux simulator. Nucl Acids Res. 2012;40(20):10073.View ArticlePubMedPubMed CentralGoogle Scholar
 Kent WJ. BLAT—the BLASTlike alignment tool. Genome Res. 2002;12:656–64.View ArticlePubMedPubMed CentralGoogle Scholar
 Freyermuth F, Rau F, Kokunai Y, Linke T, Sellier C, Nakamori M, Kino Y, Arandel L, Jollet A, Thibault C, Philipps M, Vicaire S, Jost B, Udd B, Day JW, Duboc D, Wahbi K, Matsumura T, Fujimura H, Mochizuki H, Deryckere F, Kimura T, Nukina N, Ishiura S, Lacroix V, CampanFournier A, Navratil V, Chautard E, Auboeuf D, Horie M, Imoto K, Lee KY, Swanson MS, de Munain AL, Inada S, Itoh H, Nakazawa K, Ashihara T, Wang E, Zimmer T, Furling D, Takahashi MP, CharletBerguerand N. Splicing misregulation of SCN5A contributes to cardiacconduction delay and heart arrhythmia in myotonic dystrophy. Nat Commun. 2016;7:11067.View ArticlePubMedPubMed CentralGoogle Scholar