A strand specific high resolution normalization method for chip-sequencing data employing multiple experimental control measurements
- Stefan Enroth†1, 5,
- Claes R Andersson†2,
- Robin Andersson1, 6,
- Claes Wadelius3,
- Mats G Gustafsson1, 2 and
- Jan Komorowski1, 4Email author
© Enroth et al; licensee BioMed Central Ltd. 2012
Received: 18 August 2011
Accepted: 16 January 2012
Published: 16 January 2012
High-throughput sequencing is becoming the standard tool for investigating protein-DNA interactions or epigenetic modifications. However, the data generated will always contain noise due to e.g. repetitive regions or non-specific antibody interactions. The noise will appear in the form of a background distribution of reads that must be taken into account in the downstream analysis, for example when detecting enriched regions (peak-calling). Several reported peak-callers can take experimental measurements of background tag distribution into account when analysing a data set. Unfortunately, the background is only used to adjust peak calling and not as a pre-processing step that aims at discerning the signal from the background noise. A normalization procedure that extracts the signal of interest would be of universal use when investigating genomic patterns.
We formulated such a normalization method based on linear regression and made a proof-of-concept implementation in R and C++. It was tested on simulated as well as on publicly available ChIP-seq data on binding sites for two transcription factors, MAX and FOXA1 and two control samples, Input and IgG. We applied three different peak-callers to (i) raw (un-normalized) data using statistical background models and (ii) raw data with control samples as background and (iii) normalized data without additional control samples as background. The fraction of called regions containing the expected transcription factor binding motif was largest for the normalized data and evaluation with qPCR data for FOXA1 suggested higher sensitivity and specificity using normalized data over raw data with experimental background.
The proposed method can handle several control samples allowing for correction of multiple sources of bias simultaneously. Our evaluation on both synthetic and experimental data suggests that the method is successful in removing background noise.
High-throughput sequencing of chromatin immunoprecipitated DNA, or ChIP-seq , has replaced microarray-based techniques as the standard tool for investigating protein-DNA interactions in the cell. However, the data generated will always contain noise due to sequencing biases, PCR-artefacts, low complexity regions/mappability, chromatin structure or non-specific antibody interactions in the ChIP-step. The noise appears as a background distribution of reads, or tags, which must be taken into account in downstream analyses such as peak-calling.
Experimental assessments of the background read distribution is favoured over purely theoretical and therefore not experimentally validated background models . One such assessment is to sequence the sonicated sample prior to immunoprecipitation (IP). The resulting read distribution is commonly referred to as 'input'. Ideally this distribution would be uniform but Kharchenko et al  identifies three types of repeatable anomalies that arise in input: singular peaks with very high pile-up, non-uniform wide clusters of increased tag density and, lastly, small clusters of tag densities resembling real peaks but typically with small strand separation where aligned reads pile up in non-meaningful ways. The latter anomaly is difficult to distinguish from a true pile-up. These anomalies are significant to the analysis of ChIP data because the precipitate is a mixture of protein-DNA complexes and bare DNA; ChIP only enriches the protein target and typically only a few percent of the sequenced reads fall within identified peaks . Another source of false positives in ChIP-seq analysis is non-specific binding in the immunoprecipitate. To control for that, the sample can be precipitated using non-specific antiserum, i.e. immunoglobulin G (IgG) that does not have a known antigen in the organism under study. It should be noted that since the degree of enrichment will vary between different antisera an input control experiment adds information to the IgG control. The observed distribution in a specific IP is a mixture of reads due to input anomalies, non-specific and specific IP.
Many of the recently reported peak-callers for ChIP-seq data can make use of control-data to improve predictions of enriched regions. The strategies for correction of background densities vary but are, for instance, performed by simple contrast approaches such as subtraction of the background read distribution from the ChIP-signal or by calculating fold-changes. Other more sophisticated ways of filtering the peaks have been proposed such as using the background read densities as priors in a statistical framework or estimating the false discovery rate. See [2, 4] for a discussion of techniques and overview of peak-callers. However, none of the peak-callers offers a way to export the transformed (normalized) raw signal (e.g. ChIP-seq pile-up) actually used for inferring binding sites in the same format as the raw ChIP-seq data. Consequently, there is no way to visualize or compute statistics on the processed signal used internally in the peak-callers to detect enriched regions. The only normalization method published so far seems to be the one introduced by Taslim et al . This method yields an output signal with limited resolution due to its use of summary statistics in sliding windows of typically length 1 kb along the genome. This resolution might be sufficient in the application of main interest to Taslim et al, which was detection of regions with differential enrichment of RNA polymerase II between conditions, where the exact location of sequenced reads is not required. However, it is an important limitation in applications where fine resolution mappings of for example protein-DNA interactions are studied.
However, synthetically extending sequence reads relies not only on an accurate estimate of the fragment length but also on that the estimate is representative of the distribution of fragments. Hence we focused on a method that produces normalised 5'- and 3'-read counts. Here we present, to our best knowledge, for the first time a normalization algorithm for ChIP-seq data that preserves the high resolution needed to fine map protein-DNA interactions. Since the fragment lengths will vary between experiments we apply an averaging (see Methods) of the 5' and 3' coordinates. The algorithm is based on regression modelling that uses sufficiently small windows (5 bp default) to retain high resolution whilst correcting for one or multiple experimental control measurements simultaneously. We present a demonstration of the strategy on a simulated example data set as well as an in depth evaluation of the normalisation procedure when applied to experimental transcription factor ChIP-sequencing data.
In order to demonstrate the different components of the strategy, we constructed a small synthetic example data set consisting of a short hypothetical genomic region of 2000 bp (Methods). Our scenario has one binding site (peak) to be inferred, and one added anomaly in input that creates a pile-up of reads that is also observed in IP. The resulting simulated signal was sampled and smoothed (see Methods for details) by averaging over 11 bp (+/- 5 bp) windows every 5 bp, giving an 'overlapping' design as shown in Figure 1B.
The synthetic data and results of the normalization steps are shown in Figure 1A. The top three panels represent (from top to bottom): the synthetic IP signal (sense, anti-sense and combined (XSET)), the read starts (5' and 3') and then the re-sampled smoothed read start signal. This signal is intended to simulate the actual measurement that in practice would be used as input to the normalization procedure. The three panels below are the corresponding results for the synthetic control signals. Finally, the two bottom panels represent the resulting output signal after normalization. Apparently, the only remaining signal after normalization corresponds to the region in the real signal that does not coincide with the peak-region in the control signal. Thus the peak in the real signal that overlapped with the peak in the "control" signal was effectively removed even though the synthetic reads were randomly added in different sized intervals. Note also that the smoothing step effectively removes all the low amplitude noise throughout the region.
Number of detected peaks
% with motif
% with motif
% with motif
MAX (IgG norm)
MAX (Input norm)
MAX (IgG/Input norm)
FOXA1 (Input norm)
The number of detected peaks in the normalized data was found to be 1.4 - 6.7 times less for MAX and 3.5 - 6.2 times less for FOXA1 compared to experimental background although with increase percentages of the expected motifs (see below). The latter also holds when investigating a more stringent peak-set consisting of the top 20% of peaks in each category (Table 1) suggesting that the normalization strategy is efficient in reducing false positives among the called peaks.
The use of a sequence specific transcription factors allowed us to estimate the fraction of detected peaks that contained exact matches to the expected binding motif. For MAX this is the E-box, 5'- CACGTG  and for FOXA1 5'-TGTTT[AG] [9, 13]. Since the peak regions reported vary greatly in length, we used a fixed size for all peaks. For peaks detected using SISSRs, the centre coordinates were prolonged with 75 bp in each direction and the same was done for peaks found with FindPeaks and MACS using the point with the highest score as the centre. The fraction of peaks containing the desired motif for the different data sets is reported in Table 1. Since a very large number of peaks were reported by the peak-finders when no control background data was used, we repeated the analysis using only the top 20% regions in each data set ranked by score reported by the peak-callers. Both analyses resulted in the highest percentages in the peaks called using the normalized data sets.
For the FOXA1 data, the original investigator  performed 22 qPCR validations of 15 positive regions and 7 negative. We extracted and normalized the ChIP and Input signal +/-250 kb around these sites and ran SISSRs and findPeaks on the raw (statistical and experimental background) and on the normalized data (Additional file 1, Figure S1). The performance in terms of sensitivity and specificity on this small sample set (n = 22) was higher for the normalized data for both peak-callers (0.87/0.57 and 0.67/0.86 for findPeaks and SISSRs respectively) compared to the second best (experimental background) performance (0/1 and 0.60/0.43).
Normalizing is a vital part of any next generation sequencing study. For microarray based techniques there exist many different types of normalizing methods directed at different sources of bias (e.g. dye effects or background noise). To the best of our knowledge, up until now, there has not been any normalization method for ChIP-sequencing data that globally addresses effects, such as non-specific antibody interactions or background noise, which can be suppressed using control experiments. Many of the existing peak-callers are tailor-made for ChIP-sequencing data and can make use of a background model based on experimental control data, rather than purely theoretical statistical assumptions, to filter out regions that are also enriched in the control data. However, these approaches are inherently designed to be used for peak calling and are therefore not easily transformed into universal normalization methods. In order to be fully compliant with any type of analysis performed on ChIP-seq data it is also imperative that the resulting normalized signal is reported in the same format as the raw ChIP-seq data.
Here we present for the first time such a universal normalization strategy based on a simple regression framework. The resulting method does not destroy the fine resolution obtained in next generation sequencing data and relies on re-sampling of the fragment starting points in small intervals, typically 5 base pairs long. At least for the data examined here this gave a reasonable trade-off between keeping the high resolution and the underlying biological variance between samples. Since the linear regression modelling used by this new method may be fitted using standard software libraries for ordinary least squares regression, it is very easy to include in any software library for analysis of ChIP-seq data.
Finally, the options to include more than one control data sets allows an investigator to for instance account for technical error sources such as unspecific interactions of the antibody and for biologically less likely active sites as defined by e.g. nucleosome occupancy or any histone modification data set.
The ENCODE data sets where downloaded from the repositories using the UCSC genome browser . The FOXA1 data was collected from a previous in-house project and is available from the European Nucleotide Archive under accession number ERP000005.
Synthetic Data Generation
The synthetic data was generated in a small hypothetical region of 2000 bp containing two peaks in the ChIP-signal one of which had a similar peak in the Control data. The pile-ups were generated by placing reads at random inside pre-defined short intervals (peaks). The endpoints of the intervals are intended to represent the extreme borders of sonicated fragments, and sense (5') and anti-sense (3') reads were placed within the intervals corresponding to the start and end of the sonicated fragment respectively. The script for generating the simulated data is included in the Additional files. The interval lengths were empirically set to 20 bp for the synthetic IP data and to 25 for the synthetic control in order to simulate less variation in the IP data compared to the control. In total 40 reads were assigned to the real signal and 25 to the control in this 2000 bp region. In addition, 20 noise fragments were added on each strand at positions drawn at random from a uniform distribution over the whole region.
The underlying model of our normalization method assumes that the measured raw signal may be accurately described as a linear combination of three following components: (i) the de facto interaction sites of the investigated protein, (ii) non-specific anti-body interaction and (iii) background resulting from sequencing biases, low complexity regions/mappability, chromatin structure of other so far uncharacterized effects. An input-measurement, which is taken from the chromatin sample before any antibody pull down is done, is assumed not to be specifically enriched for any of the three components listed above. Both the non-specific and the specific antibody experiments are assumed to be enriched for components (i) and (ii) compared to (iii).
The case when only (2) or (3) is available follows simply from the description below.
Consequently, the least squares estimate for n i is obtained for the least squares estimate for . The values and that minimize (7) is the ordinary least squares regression solution when predicting from and. Moreover the estimate of αt i in each position i is obtained as the residual of the regression for that positions value, i.e. and the estimates are easily calculated using any software library that offers least squares regression modelling. Furthermore, we note that the methodology can be extended to use more than two control experiments and that the basic idea of removing uncorrelated noise from a signal by using measurements of sources correlated to the noise has previously been applied in adaptive noise cancelling .
In our implementation, the normalization is done locally, in sections of 100 k. The section size is basically limited by the memory capacity of the system. In our experience, however, the section size does not generally affect the results (Additional file 1, Table S1). This allows for local usage of the algorithm in specific subsections of a genome and therefore simultaneous (parallel) processing of different regions. If a promoter specific transcription factor is investigated the normalization can be applied to promoter regions alone, reducing computational time. In each section, all analyzed signals (e.g. ChIP-data and controls) are smoothed reporting the mean over small windows (typically +/- 5 bp). This smoothed signal is then sampled at given intervals (typically 5 bp) that in fact serve as a size-reduction step where we only retain the information of the centre position of the averaging window. The smoothed and sampled signals (ChIP and controls) are then used as starting point for the regression. Note that the average-windows can overlap depending on how the window size and centre-to-centre distance is chosen. In such case, the read count on a given base pair will contribute to several windows contributing to an even smoother signal. After the normalization a per-bp-signal output signal can be rebuilt from the centre-averages filling in missing values between centres with e.g. the value of the centres or the averages between adjacent centres. Since the analysis is split on read aligning to the sense or anti-sense strand, the resulting per-bp-signal can easily be written out in the same format as aligned reads, retaining the strand specificity. Here we have chosen to work with the BED-format and write out dummy reads with the same length as the original sequenced reads. The major steps of the algorithm are outlined in Figure 1A. A major strength of this new normalization method is that there is not really any need to account for different sequencing depths in the different signals as this is handled by the models created in the regression step. Specifically, differences in sequencing depths will reflect as different scaling of the coefficients of the regression model and any prior scaling of the signals will only amount to other, scaled, coefficients.
For demonstration purposes we implemented the algorithm using R . In order to be compliant with the already established downstream analysis (peak finders) the program outputs dummy reads in the BED-format located at all positions with a residual after regression greater than 1. The number of such dummy reads at each position was taken as the largest whole number portion of the residual at that position. The R-source code needed to reproduce the low level analysis of this work is available in Additional file 2. The R-script requires R version 2.10 or higher, additional packages and software [10, 11, 18, 19]. The generation of the Hilbert-curves required a 64-bit system with proper version of R and additional packages. The overlaps between regions and extraction of sequences were done using BEDTools . Signal footprints were produced using the SICTIN  software suite. The algorithm has also been implemented as a command line program in C++ using the GNU Scientific Library  for performing the regression. The source code is publicly available at https://github.com/ project name "Strand-Specific-Normalization-of-ChIP-seq-Data".
The three peak-finders used here were SISSRs (version 1.4), FindPeaks 4.0 (version 4.0.15) and MACS (version 1.4.0rc2). The peak-finders where run with the following parameters in effect (only non-default settings are reported here):
SISSRs, "-s 3093120360". FindPeaks, statistical background, "-dist_type 0 < fraglength as reported > -subpeaks 0.5 -landerwaterman 0.001". FindPeaks, experimental background, "-control < file > -dist_type 1 < fraglength as reported > -subpeaks 0.5". MACS, " -g hs --bw < fraglength as reported > --shiftsize < fraglength as reported > --call-subpeaks --wig"
SE, RA and JK were supported by the Swedish Foundation for Strategic Research, the Knut and Alice Wallenberg Foundation, Uppsala University and the Swedish University of Agricultural Sciences. CW was supported by the Swedish Research Council, grants 521-2007-3276 and 621-2008-3571. JK was supported by the Polish Ministry of Science and Higher Education, grant number N301 239536 and by the MPD programme of the Foundation for Polish Science, co-financed from European Union, Regional Development Fund grant number MPD/2009/5/styp5.
- Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007, 316: 1497-1502. 10.1126/science.1141319PubMedView Article
- Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008, 26: 1351-1359. 10.1038/nbt.1508PubMedPubMed CentralView Article
- Hoffman BG, Jones SJ: Genome-wide identification of DNA-protein interactions using chromatin immunoprecipitation coupled with flow cell sequencing. J Endocrinol. 2009, 201: 1-13. 10.1677/JOE-08-0526PubMedView Article
- Laajala TD, Raghav S, Tuomela S, Lahesmaa R, Aittokallio T, Elo LL: A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments. BMC Genomics. 2009, 10: 618- 10.1186/1471-2164-10-618PubMedPubMed CentralView Article
- Taslim C, Wu J, Yan P, Singer G, Parvin J, Huang T, Lin S, Huang K: Comparative study on ChIP-seq data: normalization and binding pattern characterization. Bioinformatics. 2009, 25: 2334-2340. 10.1093/bioinformatics/btp384PubMedPubMed CentralView Article
- Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A: Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007, 4: 651-657. 10.1038/nmeth1068PubMedView Article
- ENCODE Data Coordination Center at UCSC, Yale data.http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeYaleChIPseq/
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874PubMedView Article
- Motallebipour M, Ameur A, Reddy Bysani MS, Patra K, Wallerman O, Mangion J, Barker MA, McKernan KJ, Komorowski J, Wadelius C: Differential binding and co-binding pattern of FOXA1 and FOXA3 and their relation to H3K4me3 in HepG2 cells revealed by ChIP-seq. Genome Biol. 2009, 10: R129- 10.1186/gb-2009-10-11-r129PubMedPubMed CentralView Article
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36: 5221-5231. 10.1093/nar/gkn488PubMedPubMed CentralView Article
- Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJ: FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics. 2008, 24: 1729-1730. 10.1093/bioinformatics/btn305PubMedPubMed CentralView Article
- Findpeaks 4.0.http://sourceforge.net/apps/mediawiki/vancouvershortr/index.php?title=FindPeaks#FindPeaks_4.0
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9: R137- 10.1186/gb-2008-9-9-r137PubMedPubMed CentralView Article
- Anders S: Visualization of genomic data with the Hilbert curve. Bioinformatics. 2009, 25: 1231-1235. 10.1093/bioinformatics/btp152PubMedPubMed CentralView Article
- Luscher B: Function and regulation of the transcription factors of the Myc/Max/Mad network. Gene. 2001, 277: 1-14. 10.1016/S0378-1119(01)00697-7PubMedView Article
- Rosenbloom KR, Dreszer TR, Pheasant M, Barber GP, Meyer LR, Pohl A, Raney BJ, Wang T, Hinrichs AS, Zweig AS: ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Res. 2010, 38: D620-625. 10.1093/nar/gkp961PubMedPubMed CentralView Article
- Widrow B, Glover JR, McCool JM, Kaunitz J, Williams CS, Hearn RH, Zeidler JR, Dong E, Goodlin RC: ADAPTIVE NOISE CANCELLING - PRINCIPLES AND APPLICATIONS. Proc IEEE. 1975, 63: 1692-1716.View Article
- R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. 2009, http://www.R-project.org
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26: 841-842. 10.1093/bioinformatics/btq033PubMedPubMed CentralView Article
- Enroth S, Andersson R, Wadelius C, Komorowski J: SICTIN: Rapid footprinting of massively parallel sequencing data. BioData Min. 2010, 3: 4- 10.1186/1756-0381-3-4PubMedPubMed CentralView Article
- M Galassi JD, Theiler J, Gough B, Jungman G, Alken P, Booth M, Rossi F: GNU Scientific Library Reference Manual. 3
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.