 Software article
 Open Access
Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data
Algorithms for Molecular Biology volume 9, Article number: 6 (2014)
Abstract
Background
Change-point problems arise in many genomic analyses, such as the detection of copy number variations or of transcribed regions. Expanding Next-Generation Sequencing technologies now make it possible to locate change-points at nucleotide resolution.
Results
Because its complexity is almost linear in the sequence length when the maximal number of segments is constant, and because its performance on microarrays has been acknowledged, we propose to use the Pruned Dynamic Programming (PDP) algorithm on Seq-experiment outputs. This requires adapting the algorithm to the negative binomial distribution, with which we model the data. We show that the PDP algorithm can be used provided the dispersion of the signal is known, and we provide an estimator for this dispersion. We describe a compression framework which reduces the time complexity without modifying the accuracy of the segmentation. We propose to estimate the number of segments via a penalized likelihood criterion. We illustrate the performance of the proposed methodology on RNA-Seq data.
Conclusions
We illustrate the results of our approach on a real dataset and show its good performance. Our algorithm is available as an R package on the CRAN repository.
Background
Change-point detection methods have long been used in the analysis of genetic data as efficient tools in the study of DNA sequences for various purposes. For instance, segmentation methods have been developed for categorical variables with the aim of identifying patterns for gene prediction [1, 2], while SNPs have been detected using sequence segmentation [3]. In the last two decades, with the wide spread of microarrays, change-point methods have been widely used for the analysis of DNA copy number variations and the identification of amplified or deleted genomic regions in pathologies such as cancer [4–8].
The recent development of Next-Generation Sequencing technologies gives rise to new applications along with new difficulties: (a) the increased size of profiles (up to 10^{8} data points, where microarray signals were closer to 10^{5}), and (b) the discrete nature of the output (the number of reads starting at each position of the genome). Yet applying segmentation methods to DNA-Seq data, with their greater resolution, should allow the analysis of copy-number variations with much improved precision compared to CGH arrays. Moreover, in the case of poly(A) RNA-Seq data on lower organisms, since coding regions of the genome are well separated from non-coding regions with lower activity, segmentation methods should allow the identification of transcribed genes as well as address the issue of new transcript discovery. Our objective is therefore to develop a segmentation method that tackles both (a) and (b) with some specific requirements: the number of reads falling within a segment should be representative of the associated biological information (relative copy number of the region, relative expression level of the gene), and comparison to neighboring regions should be sufficient to label the segment (for instance, normal or deleted region of the chromosome in DNA-Seq data, exon or non-transcribed region in RNA-Seq), so that no comparison profile is needed. This also suppresses the need for normalization, and consequently we wish to analyze the raw count profile.
Up to now, most methods addressing the analysis of these datasets require some normalization process to allow the use of algorithms which rely on Gaussian-distributed data or which were previously developed for microarrays [9–12]. Indeed, methods adapted to count datasets are not numerous and focus mostly on the Poisson distribution. Alterations of genomic sequences can be detected based on the comparison of the Poisson processes associated with the read counts of a case and a control sample [13], but this cannot be applied to the detection of transcribed regions in a single condition.
Still, a likelihood ratio statistic was proposed for the localization of a shift in the intensity of a Poisson process [14], and a test statistic was proposed for the existence of a change-point in a Poisson autoregression of order 1 [15].
These last two methods do not require a comparison profile, but they only allow for the detection of a single change-point and have too high a time complexity to be applied to RNA-Seq profiles. Binary Segmentation [6], a fast heuristic, and Pruned Exact Linear Time (PELT) [16], an exact algorithm for optimal segmentation with respect to the likelihood, are both implemented for the Poisson distribution in the changepoint package. Even though both are extremely fast, do not require a comparison profile, and analyze count data, the Poisson distribution is not suited to our kind of datasets.
A recent study [17] compared 13 segmentation methods for the analysis of chromosomal copy number profiles and showed the excellent performance of the Pruned Dynamic Programming (PDP) algorithm [18], initially implemented for the analysis of Gaussian data in the R package cghseg. We propose to use this algorithm, which we have implemented for the Poisson and negative binomial distributions.
In the next section we recall the general segmentation framework and the definition and requirements of the PDP algorithm. Our contributions are given in the third section, where we define the negative binomial model and show that it satisfies the requirements of the PDP algorithm. We also provide a theoretical result allowing the data to be compressed, and finally we give a model selection criterion with theoretical guarantees, which makes the whole approach complete. We conclude with a simulation study illustrating the performance of the proposed method.
Segmentation model and algorithm
General segmentation model
The general segmentation problem consists in partitioning a signal of n data points {y_{t}}_{t∈[[1,n]]} into a given number K of pieces or segments. The model can be written as follows: the observed data {y_{t}}_{t=1,…,n} are supposed to be a realization of an independent random process Y={Y_{t}}_{t=1,…,n}. This process is drawn from a probability distribution $\mathcal{G}$ which depends on a set of parameters, among which one parameter θ is assumed to be affected by K−1 abrupt changes, called change-points, such that

$$Y_t \sim \mathcal{G}(\theta_r, \phi) \quad \forall\, t \in r,\; r \in m,$$

where m is a partition of [[1,n]] into segments r, θ_{r} stands for the parameter of segment r, and ϕ is constant. The objective is to estimate the change-points (or the positions of the segments) and the parameters θ_{r}, both resulting from the segmentation. More precisely, we define $\mathcal{M}_{k,t}$ as the set of all possible partitions into k>0 regions of the sequence up to point t. We recall that the number of possible partitions is

$$|\mathcal{M}_{k,t}| = \binom{t-1}{k-1}.$$
We aim at choosing the partition in $\mathcal{M}_{K,n}$ of minimal loss γ, where the loss is usually taken as the negative log-likelihood of the model. We define the point-additive loss of a segment with given parameter θ as $c(r,\theta)=\sum_{i \in r}\gamma(y_i,\theta)$, so that its optimal cost is $c(r)=\min_\theta\{c(r,\theta)\}$. This allows us to define the cost of a segmentation m as $\sum_{r \in m} c(r)$, and our goal is to recover the optimal segmentation M_{K,n} and its cost C_{K,n}, which are particular cases of the generic optimal segmentation of the signal up to point t in k segments and its cost, defined as:

$$C_{k,t} = \min_{m \in \mathcal{M}_{k,t}} \sum_{r \in m} c(r), \qquad M_{k,t} = \underset{m \in \mathcal{M}_{k,t}}{\arg\min} \sum_{r \in m} c(r).$$
Quick overview of the PDP algorithm
Like the original DP algorithm, the pruned DP algorithm is an iterative algorithm based on the minimization of a cost function C_{k,t}, which is traditionally decomposed as:

$$C_{k,t} = \min_{k-1 \le \tau < t} \left\{ C_{k-1,\tau} + \min_{\theta} c([\tau+1,t], \theta) \right\}, \qquad (1)$$
where θ is the parameter of the last segment, the constraints on its possible values being directly related to the support of the loss function γ (for instance, θ takes its values in $\mathbb{R}$ in the case of the Gaussian loss, but in [0,1] in the case of the binomial loss). In what follows we denote by I_{s} the set of possible values of the parameter θ.
The specificity of the PDP algorithm is that it compares candidates for the last change-point position τ by permuting the two minimizations in (1) and introducing the functions:

$$H_{k,t}(\theta) = \min_{k-1 \le \tau < t} \left\{ C_{k-1,\tau} + \sum_{j=\tau+1}^{t} \gamma(y_j, \theta) \right\},$$

which give the cost of the best partition in k regions up to t when the parameter of the last segment is θ. C_{k,t} is then obtained as $\min_\theta\{H_{k,t}(\theta)\}$.
Then at each iteration k, the PDP algorithm works on a list of last change-point candidates: ListCandidate_{k}. For each of these τ's and each value of t, it updates the set of θ's, denoted $S_{k,t}^{\tau}$, for which this candidate is optimal. If this set is empty, the candidate is discarded; this pruning is what lowers the complexity of the algorithm.
The foundations of the algorithm can be written as follows.

Defining $H_{k,t}^{\tau}(\theta) = C_{k-1,\tau} + \sum_{j=\tau+1}^{t} \gamma(y_j, \theta)$, the optimal cost if the last change-point is τ and the last parameter is θ, then

(i)
${H}_{k,t+1}^{\tau}\left(\theta \right)$ is obtained from ${H}_{k,t}^{\tau}\left(\theta \right)$ using:
$$H_{k,t+1}^{\tau}(\theta) = H_{k,t}^{\tau}(\theta) + \gamma(y_{t+1}, \theta);$$

Defining $I_{k,t}^{\tau} = \left\{\theta \mid H_{k,t}^{\tau}(\theta) \le H_{k,t}^{t}(\theta)\right\} = \left\{\theta \mid H_{k,t}^{\tau}(\theta) \le C_{k-1,t}\right\}$, the set of θ for which candidate τ has a better cost than candidate t (with τ < t), then

(ii)
if all $\sum_{j=\tau+1}^{t} \gamma(y_j, \theta)$ are unimodal in θ, then the $I_{k,t}^{\tau}$ are intervals. Indeed, since by definition $H_{k,t}^{t}(\theta) = C_{k-1,t}$ and this cost does not depend on θ, $I_{k,t}^{\tau}$ is the set of values on which a unimodal function is smaller than a constant.

Finally, we introduce $S_{k,t}^{\tau} = \left\{\theta \mid H_{k,t}^{\tau}(\theta) \le H_{k,t}(\theta)\right\}$, the set of θ for which τ is optimal. Since $H_{k,t}(\theta) = \min_{\tau \le t}\left\{H_{k,t}^{\tau}(\theta)\right\}$, $S_{k,t}^{\tau}$ can also be written as $\left\{\theta \mid H_{k,t}^{\tau}(\theta) = H_{k,t}(\theta)\right\}$, and we obtain that

(iii)
$S_{k,t+1}^{\tau}$ can be updated using: ⋆ $S_{k,t+1}^{\tau} = S_{k,t}^{\tau} \cap I_{k,t+1}^{\tau}$
⋆ $S_{k,t}^{t} = I_s \setminus \left( \bigcup_{\tau \in \text{ListCandidate}_k} I_{k,t}^{\tau} \right)$
The first assertion follows from the fact that $S_{k,t+1}^{\tau} = \left\{\theta \mid H_{k,t+1}^{\tau}(\theta) \le \min_{k-1 \le \tau' \le t+1} H_{k,t+1}^{\tau'}(\theta)\right\} = \left\{\theta \mid H_{k,t+1}^{\tau}(\theta) \le \min\left(H_{k,t+1}^{t+1}(\theta),\; \min_{k-1 \le \tau' \le t} H_{k,t+1}^{\tau'}(\theta)\right)\right\}$, the first term in the minimum giving $I_{k,t+1}^{\tau}$ and the second giving $S_{k,t}^{\tau}$. The second assertion follows from the fact that candidate t is optimal on the set of values where no other candidate is optimal.

(iv)
once it has been determined that $S_{k,t}^{\tau}$ is empty, it follows from the update equation (iii) that the change-point candidate τ can be discarded from ListCandidate_{k}:

$$S_{k,t}^{\tau} = \varnothing \;\Rightarrow\; \forall\, t' \ge t,\; S_{k,t'}^{\tau} = \varnothing.$$
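For reference, the recursion (1) can also be implemented directly, without any pruning, in O(K n²) time; the PDP algorithm improves on this by maintaining the functions H and discarding candidates. The sketch below (Python, with a Gaussian loss for simplicity; all names are ours) shows the unpruned recursion that the PDP algorithm accelerates:

```python
import math

def segment_cost(prefix, prefix2, s, e):
    """Gaussian loss of segment y[s:e] (0-based, e exclusive):
    sum of squared deviations from the segment mean."""
    n = e - s
    tot = prefix[e] - prefix[s]
    tot2 = prefix2[e] - prefix2[s]
    return tot2 - tot * tot / n

def dp_segment(y, K):
    """Optimal segmentation of y into K segments via the plain
    O(K n^2) dynamic-programming recursion (1), without pruning."""
    n = len(y)
    prefix = [0.0] * (n + 1)
    prefix2 = [0.0] * (n + 1)
    for i, v in enumerate(y):
        prefix[i + 1] = prefix[i] + v
        prefix2[i + 1] = prefix2[i] + v * v
    C = [[math.inf] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    C[0][0] = 0.0
    for k in range(1, K + 1):
        for t in range(k, n + 1):
            # C[k][t] = min over tau of C[k-1][tau] + c([tau+1, t])
            for tau in range(k - 1, t):
                cost = C[k - 1][tau] + segment_cost(prefix, prefix2, tau, t)
                if cost < C[k][t]:
                    C[k][t] = cost
                    back[k][t] = tau
    # trace back the change-points
    cps, t = [], n
    for k in range(K, 0, -1):
        t = back[k][t]
        cps.append(t)
    return C[K][n], sorted(cps)[1:]  # drop the leading 0
```

The inner loop over τ is exactly what the candidate list and the sets $S_{k,t}^{\tau}$ prune away.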
Requirements of the pruned dynamic programming algorithm
Proposition 0.1
Properties (i) to (iv) are satisfied as soon as the following conditions on the loss c(r,θ) are met:

(a)
It is point additive,

(b)
It is convex with respect to its parameter θ,

(c)
It can be stored and updated efficiently.
The proof of those claims can be found in [18]. A pseudocode of the PDP algorithm is given in the appendix.
It is possible to include an additional penalty term, denoted g as in the pseudo-code, in the loss function. To preserve the point-additivity requirement of the loss, this penalty can only depend on the value of the segment parameter θ and not on any other characteristic, such as segment length. This is equivalent to minimizing $C_{k,t} = \min_{k-1 \le \tau < t}\left\{C_{k-1,\tau} + \min_\theta\{c([\tau+1,t],\theta) + g(\theta)\}\right\}$ and can be achieved by adding the penalty value g(θ) in the initialization of $H_{k,t}^{\tau}(\theta)$. For example, in the case of RNA-Seq data one could add a lasso (λθ) or ridge penalty (λθ^{2}) to encode that, a priori, the coverage in most regions should be close to 0. Our C++ implementation of the PDP algorithm includes the possibility of adding such a penalty term; however, we do not provide an R interface to this functionality in our package, one reason being that choosing an appropriate value of λ is not a simple problem.
Contribution
Pruned dynamic programming algorithm for count data
We now show that the PDP algorithm can be applied to the segmentation of RNA-Seq data using a negative binomial model, and we propose a criterion for the choice of K. Though not discussed here, our results also hold for the Poisson segmentation model.
Negative binomial model
We consider that within each segment r, all y_{t} are realizations of independent random variables Y_{t} following the same negative binomial distribution. Assuming the dispersion parameter ϕ to be known, we use the natural parametrization of the exponential family (also classically used in R), so that the parameter θ_{r} is the probability of success. In this framework, θ_{r} is specific to segment r whereas ϕ is common to all segments.
We have $\mathbb{E}(Y_t) = \phi(1-\theta)/\theta$ and $\mathrm{Var}(Y_t) = \phi(1-\theta)/\theta^2$. We choose the loss as the negative log-likelihood associated with data point t belonging to segment r: $-\phi\log(\theta_r) - y_t\log(1-\theta_r) + A(\phi, y_t)$, or more simply $\gamma(y_t, \theta_r) = -\phi\log(\theta_r) - y_t\log(1-\theta_r)$, since A is a function that does not depend on θ_{r}.
Validity of the pruned dynamic programming algorithm for the negative binomial model
Proposition 0.2
Assuming parameter ϕ to be known, the negative binomial model satisfies (a), (b) and (c):

(a)
As we assume the Y_{t} to be independent, the loss is indeed point-additive: $c(r,\theta) = \sum_{t \in r} \gamma(y_t, \theta)$.

(b)
As $\gamma(y_t, \theta) = -\phi\log(\theta) - y_t\log(1-\theta)$ is convex with respect to θ, c(r,θ) is also convex as a sum of convex functions.

(c)
Finally, we have $c(r,\theta) = -n_r\phi\log(\theta) - \left(\sum_{t \in r} y_t\right)\log(1-\theta)$ (where n_{r} is the length of segment r). This function can be stored and updated using only two doubles: one for $n_r\phi$, say d_{1}, and the other for $\sum_{t \in r} y_t$, say d_{2}. At step t+1, as the new data point y_{t+1} is considered, these doubles are simply updated as d_{1} ← d_{1} + ϕ and d_{2} ← d_{2} + y_{t+1}.
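Point (c) can be sketched as follows (Python; the class and method names are ours, and ϕ is assumed known). The running cost of a growing segment needs only the two accumulators described above:

```python
import math

class NegBinSegmentCost:
    """Cost of a growing segment under the negative binomial loss
    gamma(y, theta) = -phi*log(theta) - y*log(1 - theta),
    stored with only two accumulators (cf. point (c))."""
    def __init__(self, phi):
        self.phi = phi
        self.d1 = 0.0   # n_r * phi
        self.d2 = 0.0   # sum of y_t over the segment
    def add(self, y):
        """Extend the segment by one data point."""
        self.d1 += self.phi
        self.d2 += y
    def cost(self, theta):
        """c(r, theta) = -n_r*phi*log(theta) - (sum y_t)*log(1-theta)."""
        return -self.d1 * math.log(theta) - self.d2 * math.log(1 - theta)
    def optimal_theta(self):
        """Minimizer of cost: theta = n_r*phi / (n_r*phi + sum y_t),
        obtained by setting the derivative in theta to zero."""
        return self.d1 / (self.d1 + self.d2)
```

The convexity of point (b) guarantees that `optimal_theta` is the unique minimizer.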
Estimation of the overdispersion parameter
We propose to estimate ϕ with a modified version of Johnson et al.'s estimator [19]: compute the moment estimator of ϕ on each sliding window of size h using the formula $\phi = \mathbb{E}(Y)^2 / \left(\mathrm{Var}(Y) - \mathbb{E}(Y)\right)$ and keep the median $\hat{\phi}$.
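A minimal sketch of this estimator (Python; the function name and the handling of degenerate windows are our assumptions):

```python
import statistics

def estimate_dispersion(y, h=15):
    """Sliding-window moment estimator of the dispersion phi:
    on each window of size h, compute phi = E(Y)^2 / (Var(Y) - E(Y)),
    then keep the median over all windows (a sketch of the modified
    Johnson et al. estimator described in the text)."""
    estimates = []
    for start in range(len(y) - h + 1):
        w = y[start:start + h]
        m = statistics.fmean(w)
        v = statistics.variance(w)  # unbiased sample variance
        if v != m:  # moment estimator undefined when Var(Y) = E(Y)
            estimates.append(m * m / (v - m))
    if not estimates:
        return None
    med = statistics.median(estimates)
    # a negative median signals insufficient over-dispersion in the
    # windows: the caller doubles h and retries, as described later
    return med if med > 0 else None
```

The doubling of h on failure mirrors the procedure used in the simulation study below.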
Taking into account a positional bias
The assumption that the counts share the same distribution within a segment might not hold. For instance, in the case of RNA-Seq data the number of reads can be affected by the location within the transcribed region or by the GC-content of the fragment. The pruned dynamic programming algorithm only requires a vector of integers as input; it is therefore possible to apply, prior to segmentation, any normalization process that preserves the count nature of the data. For instance, the method that produced the data used in the illustration [20] can be applied. A comparison of the main normalization methods can be found, for example, in Bullard et al.'s paper [21].
C++ implementation of the PDP algorithm
We implemented the PDP algorithm in C++ with the possibility of adding new loss functions in mind for potential future applications. The main difficulties were the versatility of the program and the design of the objects it operates on. Indeed, the use of full templates required stable sets of objects for the operations to be performed.
Namely:

The sets were to be chosen in a tribe: they all belong to a set $\mathcal{S}$ of sets such that any set $s \in \mathcal{S}$ can be conveniently handled and stored in the computer. A set of sets $\mathcal{S}$ is said to be acceptable if it satisfies the following:

1.
If s belongs to$\mathcal{S}$, $\mathbb{R}\setminus s\in \mathcal{S}$

2.
If ${s}_{1},{s}_{2}\in \mathcal{S},\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{s}_{1}\cap {s}_{2}\in \mathcal{S}$

3.
If ${s}_{1},{s}_{2}\in \mathcal{S},\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{s}_{1}\cup {s}_{2}\in \mathcal{S}$
For instance, the set $\mathcal{S}$ of finite unions of intervals is a tribe, since the complement, the union and the intersection of unions of intervals remain unions of intervals. This property ensures that the sets $I_{k,t}^{\tau}$ and $S_{k,t}^{\tau}$ can be updated and stored efficiently (only two doubles are required to store an interval), so as to take full advantage of the pruning process.

The cost functions were chosen in a set$\mathcal{F}$ such that

1.
Each function may be conveniently handled and stored by the software.
For instance, for the Gaussian loss it suffices to store the three coefficients of a second order polynomial.

2.
For any $f \in \mathcal{F}$ and any constant c, the inequality f(x) ≤ c can be easily solved, and the set of solutions belongs to an acceptable set of sets.

3.
For any $f,g\in \mathcal{F},\phantom{\rule{0.3em}{0ex}}f+g\in \mathcal{F}$.
These two points ensure that the cost (and penalty) functions can be easily updated and compared so that the sets ${I}_{k,t}^{\tau}$ of each candidate τ can be updated and candidates eventually discarded.
We thus defined two collections for the sets of sets $\mathcal{S}$, intervals and parallelepipeds, and implemented the loss functions corresponding to the negative binomial, Poisson and normal distributions. The program is designed so that any user can add their own cost function or acceptable set of sets and use it without rewriting any existing code.
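For the interval collection, the acceptable-set operations reduce to comparisons on pairs of doubles; for instance, the intersection used in update (iii) can be sketched as follows (Python; names ours):

```python
def intersect(a, b):
    """Intersection of two closed intervals, each stored as a pair of
    doubles (the interval collection of acceptable sets). Returns None
    when the intersection is empty, which is exactly the condition that
    triggers the pruning of rule (iv)."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None
```

An empty result here is what removes a candidate τ from ListCandidate_k.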
Compression of the signal
In the case of count data, and in particular in the analysis of RNA-Seq data, it is very likely that we observe plateaux, that is, regions between two arbitrary positions t_{1} and t_{2} (> t_{1}) where the signal is constant:

$$y_t = y_{t_1} \quad \text{for all } t_1 \le t \le t_2.$$
Then we have the following proposition, the proof of which is given in the appendix.
Proposition 0.3
There exists a segmentation m in K or fewer segments without any changepoint in the plateaux such that the optimal cost of m is equal to C_{K,n}.
This proposition proves the arguably intuitive idea that placing a change-point between t_{1} and t_{2} is never beneficial in terms of cost. When searching for the best segmentation of the data, it is therefore unnecessary to look for change-points inside plateaux. In other words, a plateau starting at position t_{1} and ending at position t_{2} can be treated as a single data point with value ${y}_{t_1}$ and weight t_{2}−t_{1}+1. In the worst case, the size of the compressed signal equals the minimum of twice the number of reads and the length of the chromosome arm. Thus, if the number of reads is very large, the two-step algorithm (compression followed by pruned dynamic programming) does not change the worst-case complexity. However, in most cases the number of reads is much smaller than the size of the considered chromosome, so the compression is efficient and allows a significant reduction of the overall runtime. Furthermore, in the case of RNA-Seq data we do not expect reads to be evenly scattered: on the contrary, they are concentrated in transcribed regions, and between those regions we expect large plateaux of zeros allowing for an efficient compression (for instance, only about 2% of the human genome is coding).
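The compression step itself amounts to a run-length encoding; a minimal sketch (Python; names ours):

```python
def compress(y):
    """Run-length compression of a count profile: each plateau becomes
    a single weighted data point (value, weight). By Proposition 0.3,
    segmenting the weighted profile (with the loss of each point counted
    weight times) yields the same optimal cost as the raw profile."""
    runs = []
    for v in y:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current plateau
        else:
            runs.append([v, 1])   # start a new plateau
    return [(v, w) for v, w in runs]
```

The segmentation then runs on the (value, weight) pairs, multiplying each point's loss by its weight.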
Model selection
The last issue concerns the estimation of the number of segments K. This model selection problem can be solved using a penalized log-likelihood criterion, for which the choice of a good penalty function is crucial. This kind of procedure typically requires computing the optimal segmentations in all k=1,…,K_{max} segments, where K_{max} is generally chosen smaller than n. The most popular criteria (AIC [22] and BIC [23]) fail in the segmentation context because of the discrete nature of the change-points. Indeed, besides being an asymptotic criterion in a framework where the collection of possible models grows polynomially with n, the BIC criterion uses a Laplace approximation requiring differentiability conditions on the likelihood function which are not satisfied by the segmentation model [24]. From a non-asymptotic point of view and for the negative binomial model, the following criterion was proposed [25]: denoting ${\hat{m}}_{K}$ the optimal segmentation of the data in K segments, choose $\hat{K}$ minimizing

$$\mathrm{crit}(K) = \sum_{r \in \hat{m}_K} c(r) + \beta\, K \left(1 + 4\sqrt{1.1 + \log\frac{n}{K}}\right)^{2},$$

where the optimal cost c(r) is attained at the segment mean $\bar{y}_r = \left(\sum_{t \in r} y_t\right)/\hat{n}_r$ and $\hat{n}_r$ is the size of segment r. The first term corresponds to the cost of the optimal segmentation, while the second is a penalty term which depends on the dimension K and on a constant β that has to be tuned according to the data (see the next section). With this choice of penalty, a so-called oracle penalty, the resulting estimator satisfies an oracle-type inequality. A more complete performance study is carried out in [25] and shows that the proposed criterion outperforms existing ones.
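The selection step can be sketched as follows (Python; the function name is ours, and the penalty shape pen(K) = K(1 + 4√(1.1 + log(n/K)))² is one classical oracle form used for illustration — the exact constants should follow [25]):

```python
import math

def select_K(costs, n, beta):
    """Select the number of segments K minimizing the penalized
    criterion cost_K + beta * pen(K), with the illustrative oracle
    penalty shape pen(K) = K * (1 + 4*sqrt(1.1 + log(n/K)))**2.
    `costs` maps K to the optimal segmentation cost in K segments;
    beta must be tuned to the data (e.g. by the slope heuristic)."""
    best_K, best_val = None, math.inf
    for K, cost in costs.items():
        pen = K * (1 + 4 * math.sqrt(1.1 + math.log(n / K))) ** 2
        val = cost + beta * pen
        if val < best_val:
            best_K, best_val = K, val
    return best_K
```

With β = 0 the criterion degenerates to picking the largest model; a large β forces a small K, which is why β has to be calibrated.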
Implementation
The Pruned Dynamic Programming algorithm is available through the function Segmentor of the R package Segmentor3IsBack. Version 1.7 of this package includes the compression process, which is performed by default in the case of count data. The user can choose the distribution with the slot model (1 for Poisson, 2 for Gaussian homoscedastic, 3 for negative binomial and 4 for segmentation of the variance). The function returns an S4 object of class Segmentor which can later be processed for other purposes. The function SelectModel provides four criteria for choosing the optimal number of segments: AIC [22], BIC [23], the modified BIC [24] (available for the Gaussian and Poisson distributions) and oracle penalties (available for the Gaussian distribution [26] and for the Poisson and negative binomial distributions [25], as described previously). The latter kind of penalty requires tuning a constant according to the data, which is done using the slope heuristic [27].
Figure 1 (which is detailed in the Results and discussion Section) was obtained with the following 4 lines of code (assuming the data was contained in vector x):
Seg <- Segmentor(x, model = 3, Kmax = 200)
Kchoose <- SelectModel(Seg, penalty = "oracle")
plot(sqrt(x), col = "dark red")
abline(v = getBreaks(Seg)[Kchoose, 1:Kchoose], col = "blue")
The function BestSegmentation allows us, for a given K, to find the optimal segmentation with a change-point at location t (slot $bestSeg). It also provides, through the slot $bestCost, the cost of the optimal segmentation that has t as its j-th change-point. Figure 2 (left) illustrates this result for the optimal segmentations in 4 segments of a signal simulated with only 3 segments. We can see, for instance, that any choice of first change-point location between 1 and 2000 yields almost the same cost (the minimum is obtained for t=1481), and thus the optimal segmentation is not clearly better than the next-best segmentations. On the contrary, the same function with 3 segments shows that the optimal segmentation outperforms all other segmentations in 3 segments.
Performance study
We designed a simulation study on the negative binomial distribution to assess the performance of the PDP algorithm in terms of computational efficiency without the compression option, while studying the impact of the overdispersion parameter ϕ by comparing the results for two different values of this parameter. After running different estimators (median on sliding windows of maximum, quasi-maximum likelihood and moment estimators) on several real RNA-Seq datasets (whole chromosomes and genes of various sizes), we fixed ϕ_{1}=0.3 as a typical value for highly dispersed data, as observed in real RNA-Seq data, and chose ϕ_{2}=2.3 for comparison with a reasonably dispersed dataset. For each value, we simulated datasets of size n with various numbers of segments K, and only two possible values of the success probability p: 0.8 on even segments (corresponding to a low signal) and 0.2 on odd segments (a higher signal). We let n vary on a logarithmic scale between 10^{3} and 10^{6} and K between $\sqrt{n}/6$ and $\sqrt{n}/3$. For each configuration, we segmented the signal up to ${K}_{max}=\sqrt{n}$ twice: once with the known value of ϕ and once with our estimator $\hat{\phi}$ described above. We started with a window width h=15; when the estimate was negative, we doubled h and repeated the procedure until the median was positive.
Each configuration was simulated 100 times.
For our analysis we checked the runtime on a standard laptop and assessed the quality of the segmentation using the Rand index. Specifically, let C_{t} be the true index of the segment to which base t belongs and let ${\hat{C}}_{t}$ be the index estimated by the method; then

$$\mathrm{RI} = \frac{2 \sum_{t > s} \left( \mathbb{1}_{\{C_t = C_s,\, \hat{C}_t = \hat{C}_s\}} + \mathbb{1}_{\{C_t \ne C_s,\, \hat{C}_t \ne \hat{C}_s\}} \right)}{n(n-1)}.$$
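A direct quadratic-time computation of this index can be sketched as follows (Python; names ours, suitable for short signals):

```python
def rand_index(true_labels, est_labels):
    """Rand index between two segmentations given as per-base segment
    indices: the fraction of base pairs (s, t) on which the two
    labelings agree (both bases in the same segment, or both bases in
    different segments)."""
    n = len(true_labels)
    agree = 0
    for t in range(n):
        for s in range(t + 1, n):
            same_true = (true_labels[t] == true_labels[s])
            same_est = (est_labels[t] == est_labels[s])
            agree += (same_true == same_est)
    return agree / (n * (n - 1) / 2)
```

An index of 1 means the two segmentations induce the same partition of the bases.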
Figure 3 shows, for the particular case $K=\sqrt{n}/3$, the almost linear complexity of the algorithm in the signal size n. As the maximal number of segments K_{max} considered increases with n, we normalized the runtime to allow comparison. This underlines an empirical complexity smaller than $\mathcal{O}({K}_{max}\,n\log n)$, independent of the value of ϕ and of whether it is known. Moreover, the algorithm, and therefore the pruning, is faster when the overdispersion is high, a phenomenon already encountered with the L^{2} loss when the distribution of errors is Cauchy. The knowledge of the true value of ϕ does not affect the runtime of the algorithm. Figure 4 illustrates, through the Rand index, the quality of the proposed segmentation for a few values of n. Even though the indices are slightly lower for ϕ_{1} than for ϕ_{2} (see left panel), they range between 0.94 and 1, indicating high-quality results. Moreover, knowing ϕ does not increase the quality (see right panel), which validates the use of our estimator. We can therefore conclude that the runtime of our algorithm without compression is roughly 40 × K_{max} × n / 10^{6} seconds.
Yeast RNAseq experiment
We applied our algorithm to the segmentation of chromosome 1 of S. cerevisiae (yeast) using RNA-Seq data from the Sherlock laboratory at Stanford University [20], publicly available from the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra, accession number SRA048710). We selected the number of segments using the oracle penalty described in the previous section. An existing annotation of translated regions (i.e. excluding untranslated regions (UTRs)) is available from the Saccharomyces Genome Database (SGD) at http://www.yeastgenome.org, which allows us to validate our results.
With a runtime of 27 minutes without compression and 5.4 minutes with compression (for a signal of length 230,218), we selected 125 segments with the negative binomial distribution. Most of those segments (all but 3) can be related to the official annotation; however, as expected, segments corresponding to transcribed regions (as opposed to intergenic regions) were found to surround known genes from the SGD, due to the difference between transcribed and translated regions. Figure 1 illustrates the result.
We compared our segmentation with that corresponding to the SGD annotation through the Hellinger distance, by fitting a negative binomial distribution on each segment, and repeated this comparison with the other two algorithms able to process long count datasets: PELT [16] and Binary Segmentation [6], both implemented in the R package changepoint for the Poisson distribution. For a fair comparison, we also used the PDP algorithm with the Poisson loss. Figure 5, together with Table 1 (which gives the estimated number of segments, the overall Hellinger score $\left(\sum_t H_t\right)/n$ and the number of change-points falling within annotated translated regions), illustrates the result and shows that our approach outperforms the others. Moreover, most of the observed Hellinger peaks can be explained by the fact that we are comparing the annotation of transcribed regions with that of translated regions.
Analysis of complex organisms
The issues raised in the analysis of RNA-Seq and DNA-Seq data differ. In the first case, the number of segments that we hope to select is roughly twice the number of expressed exons, so the order of K_{max} varies from 10^{2} (small chromosomes of lower organisms, e.g. yeast) to 10^{4} (large chromosomes of higher organisms, e.g. human). However, when aligned to a reference genome, RNA-Seq data are expected to present large plateaux of zeros in non-coding regions (for instance, 98% of the human genome) and in non-expressed regions. The compression option of our algorithm then allows us to reduce the size of the profile by a factor of 10 to 10^{3}. Moreover, it is well known that centromeric regions are large non-coding regions where no change-point is expected, and we therefore propose to divide the profile into two parts at such regions. As a proof of concept, we ran our algorithm with compression on an RNA-Seq profile of the short arm of the 4th chromosome of Arabidopsis thaliana (n = 4×10^{6}, K_{max} = 6×10^{3}) and selected 4289 segments, after a compression factor of 10 and a runtime of 19 hours on a 2.4 GHz computer. The data was kindly provided by some of our collaborators.
DNA-Seq data, on the other hand, will present much smaller plateaux. While this implies that the compression will be less efficient, the profile can still be summarized into a dataset whose length is smaller than the total number of mapped reads. Most importantly, in these experiments the expected number of segments is drastically smaller, as the number of chromosomal aberrations is generally limited to fewer than one hundred per chromosome, even in pathologies such as cancer.
Conclusion
Segmentation has been a useful tool for the analysis of biological datasets for a few decades. We propose to extend its application with the use of the Pruned Dynamic Programming algorithm on count datasets such as the outputs of sequencing experiments. We show that the negative binomial distribution can be used to model such datasets provided the overdispersion parameter is known, and we propose an estimator of this parameter that performs well in our segmentation framework.
We propose to choose the number of segments using our oracle penalty criterion, which makes the package fully operational. The package also allows the use of other criteria, such as AIC or BIC. Similarly, the algorithm is not restricted to the negative binomial distribution: it also allows the use of Poisson and Gaussian losses, for instance, and could easily be adapted to other convex one-parameter losses.
With its empirical complexity of $\mathcal{O}(K_{max}\, n \log n)$, it can be applied to large signals such as read alignments of whole chromosomes, and we illustrated its results on a real dataset from the yeast genome. Moreover, the algorithm can serve as a basis for further analyses. For example, [28] use it to initialize their hidden Markov model to compute changepoint location probabilities.
Availability and requirements

Project name: Segmentor3IsBack

Project home page: http://cran.r-project.org/web/packages/Segmentor3IsBack/index.html

Operating systems: Platform independent

Programming language: C++ code embedded in R package

License: GNU GPL

Any restrictions to use by non-academics: none
Appendix
Pseudocode of the PDP algorithm
Algorithm 1 The PDP algorithm
Proof of Proposition 0.3
Searching for one changepoint
Let us first consider a segmentation in two segments with a breakpoint at t. We define P_{ t }(θ_{1},θ_{2}), the cost of this segmentation given some parameter θ_{1} for the first segment and θ_{2} for the second segment:
The optimal cost P_{ t } is:
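Explicitly, writing $\gamma(y;\theta)$ for the loss of a single observation $y$ under parameter $\theta$ (so that, e.g., $C_{1,n} = \min_\theta \sum_{i=1}^n \gamma(y_i;\theta)$), these two quantities can be written as:

```latex
P_t(\theta_1,\theta_2) \;=\; \sum_{i=1}^{t} \gamma(y_i;\theta_1) \;+\; \sum_{i=t+1}^{n} \gamma(y_i;\theta_2),
\qquad
P_t \;=\; \min_{\theta_1,\theta_2} \, P_t(\theta_1,\theta_2).
```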
Having these notations, let us prove the following lemma:
Lemma 0.4.

If t_{1}=1 and t_{2}=n, then ∀ t, P_{ t } ≥ C_{1,n}

If t_{1}=1 and t_{2}<n, then ∀ t such that t_{1}−1 ≤ t ≤ t_{2}, we have P_{ t } ≥ P_{t_{2}}

If t_{1}>1 and t_{2}=n, then ∀ t such that t_{1}−1 ≤ t ≤ t_{2}, we have P_{ t } ≥ P_{t_{1}−1}

If t_{1}>1 and t_{2}<n, then ∀ t such that t_{1}−1 ≤ t ≤ t_{2}, we have P_{ t } ≥ min{P_{t_{1}−1}, P_{t_{2}}}
Proof
First scenario [ t_{1}=1 and t_{2}=n]
We have:
Thus we get: P_{ t }≥C_{1,n}.
Second scenario [ t_{1}=1 and t_{2}<n]
For any t such that t≤t_{2} we have:
Thus we have:
And we get ∀ t ≤ t_{2}, P_{ t } ≥ P_{t_{2}}.
Third scenario [ t_{1}>1 and t_{2}=n]
We get ∀ t ≥ t_{1}−1, P_{ t } ≥ P_{t_{1}−1} by reversing the indices and using scenario 2.
Fourth scenario [ t_{1}>1 and t_{2}<n]
For any t such that t_{1}−1 ≤ t ≤ t_{2} we obtain:
Thus, for fixed θ_{1} and θ_{2} and for t ∈ [t_{1}−1, t_{2}], P_{ t }(θ_{1},θ_{2}) is a linear function of t, and we obtain that for any θ_{1} and θ_{2}:
As this is true for any θ_{1} and θ_{2}, we get P_{ t } ≥ min{P_{t_{1}−1}, P_{t_{2}}}.
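The last step can be spelled out: a function affine in $t$ on the interval $[t_1-1,\,t_2]$ attains its minimum at an endpoint, and each endpoint value is bounded below by the corresponding optimal cost, so

```latex
P_t(\theta_1,\theta_2)
\;\ge\; \min\bigl\{ P_{t_1-1}(\theta_1,\theta_2),\; P_{t_2}(\theta_1,\theta_2) \bigr\}
\;\ge\; \min\bigl\{ P_{t_1-1},\; P_{t_2} \bigr\},
```

and minimizing the left-hand side over $(\theta_1,\theta_2)$ gives the claim.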
Proof of the main proposition
Assume that we have a segmentation m in $\mathcal{M}_{K,n}$ with a breakpoint τ_{ k } in a plateau. Then, applying Lemma 0.4 to the sequence $\{y_i\}_{i\in\{\tau_{k-1},\dots,\tau_{k+1}\}}$, we see that τ_{ k } can either be discarded or moved to t_{1}−1 or t_{2} without increasing the cost. Thus there exists a segmentation in K or fewer segments without any changepoint in the plateau whose optimal cost is C_{K,n}. ■
This result is more subtle than intuition might suggest. It does not mean that a changepoint in a plateau is never optimal, only that changepoints in plateaux are not necessary to achieve optimality.
Abbreviations
 PELT:

Pruned exact linear time
 PDP:

Pruned dynamic programming
 AIC:

Akaike information criterion
 BIC:

Bayesian information criterion
 NCBI:

National Center for Biotechnology Information
 SGD:

Saccharomyces Genome Database.
References
 1.
Braun JV, Müller HG: Statistical methods for DNA sequence segmentation. Stat Sci. 1998, 13 (2): 142-162.
 2.
Durot C, Lebarbier E, Tocquet AS: Estimating the joint distribution of independent categorical variables via model selection. Bernoulli. 2009, 15: 475-507. 10.3150/08-BEJ155.
 3.
Bockhorst J, Jojic N: Discovering patterns in biological sequences by optimal segmentation. Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence. 2007, AUAI Press
 4.
Zhang Z, Lange K, Sabatti C: Reconstructing DNA copy number by joint segmentation of multiple sequences. BMC Bioinformatics. 2012, 13: 205.
 5.
Erdman C, Emerson JW: A fast Bayesian change point analysis for the segmentation of microarray data. Bioinformatics. 2008, 24 (19): 2143-2148.
 6.
Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5 (4): 557-572. 10.1093/biostatistics/kxh008.
 7.
Picard F, Robin S, Lavielle M, Vaisse C, Daudin J: A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005, 6: 27.
 8.
Picard F, Lebarbier E, Hoebeke M, Rigaill G, Thiam B, Robin S: Joint segmentation, calling and normalization of multiple CGH profiles. Biostatistics. 2011, 12 (3): 413-428.
 9.
Chiang DY, Getz G, Jaffe DB, O’Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods. 2009, 6: 99-103.
 10.
Xie C, Tammi MT: CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics. 2009, 10: 80.
 11.
Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19: 1586-1592.
 12.
Boeva V, Zinovyev A, Bleakley K, Vert JP, Janoueix-Lerosey I, Delattre O, Barillot E: Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics. 2011, 27 (2): 268-269. 10.1093/bioinformatics/btq635.
 13.
Shen JJ, Zhang NR: Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing. Ann Appl Stat. 2012, 6 (2): 476-496. 10.1214/11-AOAS517.
 14.
Rivera C, Walther G: Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics. Scand J Stat. 2013, 40 (4): 752-769. 10.1111/sjos.12027.
 15.
Franke J, Kirch C, Kamgaing JT: Changepoints in times series of counts. J Time Series Anal. 2012, 33 (5): 757-770. 10.1111/j.1467-9892.2011.00778.x.
 16.
Killick R, Fearnhead P, Eckley I: Optimal detection of changepoints with a linear computational cost. J Am Stat Assoc. 2012, 107 (500): 1590-1598. 10.1080/01621459.2012.737745.
 17.
Hocking TD, Schleiermacher G, Janoueix-Lerosey I, Boeva V, Cappo J, Delattre O, Bach F, Vert JP: Learning smoothing models of copy number profiles using breakpoint annotations. BMC Bioinformatics. 2013, 14 (1): 164.
 18.
Rigaill G: Pruned dynamic programming for optimal multiple change-point detection. arXiv:1004.0887. 2010, [http://arxiv.org/abs/1004.0887]
 19.
Johnson N, Kemp A, Kotz S: Univariate Discrete Distributions. 2005, John Wiley & Sons Inc.
 20.
Risso D, Schwartz K, Sherlock G, Dudoit S: GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011, 12: 480.
 21.
Bullard J, Purdom E, Hansen K, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010, 11: 94.
 22.
Akaike H: A new look at the statistical model identification. IEEE Trans Automat Control. 1974, 19 (6): 716-723. 10.1109/TAC.1974.1100705.
 23.
Yao Y: Estimation of a noisy discrete-time step function: Bayes and empirical Bayes approaches. Ann Stat. 1984, 12 (4): 1434-1447. 10.1214/aos/1176346802.
 24.
Zhang NR, Siegmund DO: A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics. 2007, 63: 22-32. [PMID: 17447926]
 25.
Cleynen A, Lebarbier E: Segmentation of the Poisson and negative binomial rate models: a penalized estimator. ESAIM: P&S. 2014, arXiv preprint arXiv:1301.2534
 26.
Lebarbier E: Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Process. 2005, 85 (4): 717-736. 10.1016/j.sigpro.2004.11.012.
 27.
Arlot S, Massart P: Data-driven calibration of penalties for least-squares regression. J Mach Learn Res. 2009, 10: 245-279. (electronic)
 28.
Luong TM, Rozenholc Y, Nuel G: Fast estimation of posterior probabilities in change-point analysis through a constrained hidden Markov model. Comput Stat Data Anal. 2013.
Acknowledgements
We thank Véronique Brunaud for providing the RNA-Seq profile of Arabidopsis thaliana. We also thank our anonymous referee for helpful comments on the presentation of the algorithm.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AC co-wrote the C++ code, wrote the R package, performed the data analysis and co-wrote the manuscript. MK co-wrote the C++ code. EL co-supervised the work and co-wrote the manuscript. GR co-wrote the C++ code and co-wrote the manuscript. SR co-wrote the manuscript and co-supervised the work. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Cleynen, A., Koskas, M., Lebarbier, E. et al. Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data. Algorithms Mol Biol 9, 6 (2014). https://doi.org/10.1186/1748-7188-9-6
Received:
Accepted:
Published:
Keywords
 Segmentation algorithm
 Exact algorithm
 Fast algorithm
 RNASeq data
 Genome annotation
 Count data
 Data compression