Getting DNA copy numbers without control samples
© Ortiz-Estevez et al.; licensee BioMed Central Ltd. 2012
Received: 2 December 2011
Accepted: 15 June 2012
Published: 16 August 2012
Skip to main content
© Ortiz-Estevez et al.; licensee BioMed Central Ltd. 2012
Received: 2 December 2011
Accepted: 15 June 2012
Published: 16 August 2012
The selection of the reference to scale the data in a copy number analysis has paramount importance to achieve accurate estimates. Usually this reference is generated using control samples included in the study. However, these control samples are not always available and in these cases, an artificial reference must be created. A proper generation of this signal is crucial in terms of both noise and bias.
We propose NSA (Normality Search Algorithm), a scaling method that works with and without control samples. It is based on the assumption that genomic regions enriched in SNPs with identical copy numbers in both alleles are likely to be normal. These normal regions are predicted for each sample individually and used to calculate the final reference signal. NSA can be applied to any CN data regardless the microarray technology and preprocessing method. It also finds an optimal weighting of the samples minimizing possible batch effects.
Five human datasets (a subset of HapMap samples, Glioblastoma Multiforme (GBM), Ovarian, Prostate and Lung Cancer experiments) have been analyzed. It is shown that using only tumoral samples, NSA is able to remove the bias in the copy number estimation, to reduce the noise and therefore, to increase the ability to detect copy number aberrations (CNAs). These improvements allow NSA to also detect recurrent aberrations more accurately than other state of the art methods.
NSA provides a robust and accurate reference for scaling probe signals data to CN values without the need of control samples. It minimizes the problems of bias, noise and batch effects in the estimation of CNs. Therefore, NSA scaling approach helps to better detect recurrent CNAs than current methods. The automatic selection of references makes it useful to perform bulk analysis of many GEO or ArrayExpress experiments without the need of developing a parser to find the normal samples or possible batches within the data. The method is available in the open-source R package NSA, which is an add-on to the aroma.cn framework.http://www.aroma-project.org/addons.
A DNA copy number aberration (CNA) is a pathological amplification or deletion of a part of the genome (a chromosome, one of their arms or a segment) which has been related to cancer development. In CNAs, DNA copy numbers (CNs) may be larger (gains and amplifications) or smaller (deletions and homozygous deletions) than the normal state (CN = 2).
CNAs can be measured using single nucleotide polymorphism (SNP) arrays. Although the initial application of these arrays was genotyping, they can also be used to calculate the CN estimates. Besides, regions with LOH (Loss of heterozygosity), which are zones of the genome that show no heterozygous SNPs, can be found using these arrays.
The number of SNPs in the arrays range from the initial ones which interrogate around 10,000 SNPs, to the newest ones which interrogate several millions of SNPs. The GWS arrays from Affymetrix, in addition to SNP probes, include non-polymorphic probes (known as CN probes) for analysis of Copy Number Variations (CNVs). In order to deal with Affymetrix SNP arrays it is required to apply several low level processes, namely, background removal, calibration, normalization and summarization. The final results from these steps are two values (θ A and θ B ) for each SNP probeset which are approximately proportional to the number of copies of each allele. On the other hand, the CN probes of the latest arrays have a single value (θ T ) proportional to the total number of copies. The proportionality constant is unknown and different for each SNP probeset and CN probe.
As previously stated, CNAs occur in segments of the genome. In order to find these aberrated regions, it is necessary to compute the scale factor which relates summarized SNP signals (θ A and θ B ) and CN values (C N A and C N B ) for each SNP. If there are control samples in the study, the computation of the scale factor is straighforward: two over a robust average of (θ A + θ B ) in the control samples[2–6] .
In this work it is assumed that a control sample has neutral copy number with no LOH in its whole genome. Of course, there can be CNVs in a control sample but, for sake of clarity, they are not considered here.
Unfortunately, there are many experiments that do not include control samples due to the difficulties to find them or simply to reduce experimental costs. In these cases, researchers opt for either using control samples from a public dataset or calculating a robust reference using the tumoral samples available in the experiment (implicitly assuming that for each SNP most of the tumoral samples have neutral CNs). However, as it will be shown, using samples from different labs can increase the noise in the CN estimations and assuming that SNPs have neutral CNs in most of the samples, although usually works, can introduce bias in the copy number estimations hiding real CNAs or even creating false ones, mostly when there is a recurrent aberration.
We propose an algorithm termed NSA (Normality Search Algorithm) that generates for each sample the corresponding reference without the need of control samples. Within each of the samples (control or tumoral) NSA detects regions with neutral CNs (CN = 2) and no LOH. Using these normal regions, the reference is calculated and used to scale the data through both SNPs and samples. We will show that NSA is able to correctly scale the CN estimates even if there are no control samples in the experiment. In addition, NSA is able to infer the batches within an experiment and find a proper weighting (different for each sample) to calculate the references.
The core of the NSA algorithm is the detection of normal regions in the genome. The developed method is based on the comparison of the signals for both alleles in each SNP. Heterozygous neutral copy number SNPs (HNCNs) have similar signals for both alleles () and their total copy number is two. For the majority of aberrations (amplifications, deletions or LOH), one of the alleles will have a larger signal than the other since the number of copies is different for each of them. AABB (or AAABBB) genotypes are very unlikely (except in multiploid cells) since it implies an amplification of both chromosomes and this occurs much less often than aberrations that involve only one of the chromosomes. The nuclear assumption of NSA is that a SNP that have similar signals in both alleles () is probably a HNCN. NSA assumes that regions enriched in SNPs with similar signal for both alleles have neutral total CNs and no LOH.
Once NSA has inferred the normal regions, it also computes the optimal weights to calculate the reference using a weighted median. These weights are different for each sample and strongly related with the batches (set of samples that share some characteristics such day of hybridization, lab, person, etc) in which the samples were hybridized. Finally, the reference is calculated and the data scaled across both SNPs and samples.
NSA has been implemented in the aroma.cn framework and memory requirements are modest even for a large number of samples. Moreover, it is independent of both the preprocessing method or the microarray technology.
NSA is a population-based multi-array method for scaling any SNP & CN array technology, e.g. Affymetrix and Illumina. It identifies the normal regions within the samples, finds optimal weights to account for hybridization batches, calculates the corresponding references and, finally, performs a two-dimensional scaling.
We applied NSA to five different datasets. The first one is a subset of a Gliobastoma Multiforme (GBM) experiment that includes 64 tumoral samples hybridized to Affymetrix Mapping 50K_Xba array. The second one is a Prostate Cancer analysis with 20 tumoral and 20 control samples hybridized to Affymetrix Mapping 250K_Nsp array, the third one is a subset of 50 Lung Cancer and 20 control samples from hybridized to Affymetrix GWS 6.0 array. The fourth one is a subset of an Ovarian Cancer experiment that includes 72 tumoral samples and 57 control samples and, finally, the fifth one is a subset of HapMap samples. It is shown here that, for these datasets, NSA provides more accurate and precise CN estimates than other state of the art scaling methods.
The input data of NSA are the summarized probe signals (θ A and θ B ) calculated using any summarization method such as dChip, RMA, CRMA v2, ACNE, CalMaTe for Affymetrix arrays or the one developed by for Illumina technology. From these allele specific probe signals the fractions of the B-allele (β = θ B /(θ A + θ B )) are obtained. CRMA v2 pre-processing methods have been applied to all the datasets. In the summarization step, ACNE (for Affymetrix Mapping 50 and 250K arrays) and CalMaTe (for the GWS6.0 arrays) were chosen because provide more accurate allele specific CNs.
The main assumption of NSA is that gains of both chromosomes in a tumor sample are very unlikely to occur. It is shown in that (for GBM) only 3% of the aberrations occur in both chromosomes. Therefore, we consider that if both θ A and θ B are approximately equal, the SNP is likely to be heterozygous with CN equal to 2 (i.e. it is a HNCN). Since CNAs occur in segments of the genome, the homozygous SNPs within a region enriched in HNCNs will probably also have neutral CNs. In order to quantify how similar the signals of both alleles are, we use the term level of heterozygosity (LH) which stands for a continuous approximation to the heterozygosity of a SNP (notice that this level of heterozygosity does not have anything to do with the same term in population genetics).
where are the fractions of the B-allele.
Level of Heterozygisity
Using Table1, a threshold L H th has been selected to discern whether the corresponding SNP is HNCN or not. The most critical case to discern from a normal region (using the LH value) is a zone with 3 copies. The expected LH value is around 2/3 for a perfect summarization model with a pure tumor sample. If there is contamination from the surrounding normal tissue, this value will be larger. Because of this, a suitable threshold is 5/6 (the middle point between the expected value for normal heterozygous calls and the value for 3 copies).
Once the threshold is set, the SNPs with LH value over it are labeled as HNCN (LH = 1) and the others as non-HNCNs (LH = 0). A segmentation algorithm is applied to these binary data to find zones enriched in HNCNs. We used a variant of CBS in which the input are binary data[18, 19], although other methods could be applied (using for example Hidden Markov Models).
In the aforementioned Figure1, the thick gray lines in the bottom panel represent the different segments predicted by CBS. The y axis represents the proportion of HNCNs SNPs in the segment. It can be observed that CBS detects 2 different segments, corresponding to the two states (normal and aberrated). In other samples, segmentation also differentiates between aberrations (deletions and gains) assigning them a different value since the proportion of SNPs whose LH is above the threshold can be different, but this fact does not affect the method.
On average, 27% of the SNPs in a HapMap sample are heterozygous. Therefore, ideally, the proportion of HNCNs in a normal region would be around 27%. On the other hand, the proportion of HNCNs should be ideally close to 0% in an aberrated region. We have selected a threshold in the middle point (13.5%). The larger the threshold, the more likely the selected segments have neutral CNs. However, some normal regions can be missed because of noise or simply because the regions include many SNPs with one rare variant.
For example, consider the initial part of the p arm from Figure1, which is normal, and the q arm, which is amplified. The density function for the normal zone has more SNPs with LH closer to 1 than the amplified region. In this particular case, 18.3% of the SNPs are above the threshold set in the previous section. For the 3 copies region, only 8.6% of the SNPs have LH above the threshold. The difference between both regions is large enough, so that, the algorithm is not sensitive to a particular selection of the threshold.
If the array of the experiment includes SNP and CN probes (GWS 5.0 or GWS 6.0), then the corresponding status (neutral copy number or not) of the CN probes is inferred from the status of the segment where they are located (that has been computed using the SNP probes).
The final processing of NSA includes two scaling steps, one by SNPs and another by samples.
NSA implements two methods (user-selectable) to compute the references: the first one uses standard medians and the second one uses weighted medians to minimize batch effects. In this second method each sample has a different computed reference.
For the first method, the reference is computed by using the median of the signals of the samples labeled as normal for each SNP. In the second case, a different algorithm (described in the following section) estimates some weights that are used to compute the reference for each SNP and sample using weighted medians. It also uses only the samples labeled as normal for each SNP to compute the reference.
The algorithm computes for each sample the median of the CNs of the SNPs assumed to be normal and re-scales all the data so that the median of the normal zones for each sample is 2.
The two scaling steps of the proposed normalization is similar to a median polish in which only the SNPs which are labeled as normals are included in the computation.
The values of are the final estimates of NSA for the total CN values.
These steps should be repeated until convergence but improvement is negligible after the first iteration (average copy number changed about 0.001 copies in the examples).
Batch effects have proved to have paramount importance in the analysis of SNP arrays[22, 23]. The particular characteristics of NSA algorithm helps to develop an algorithm to minimize them. The overall idea is that, since NSA identifies normal zones that are expected to have neutral CNs, the reference can be selected so that, for these normal zones, the estimated copy number is close to 2.
The procedure is the following. First of all, a set “S” of SNPs is selected. This set must include a sufficient number of normal SNPs to capture the relationships between the arrays. These SNPs are selected from normal regions in most of the samples. Although for some studies there are no normal SNPs for all the samples, the SNPs in “S” are selected so that they appear as normal in as many samples as possible.
implicitly assuming that the reference for is, i.e. a linear combination of the logarithm of the signals for each sample. The restrictions impose that 1) sample i is not used to compute its own reference and 2) only positive weights are allowed, i.e. if samples i and k are so different that the corresponding coefficient γi,k is negative, sample k is not used to build the reference of sample i instead of giving it a counter-intuitive negative weight.
This optimization is a quadratic programming (QP) problem. Instead of using a standard QP algorithm (quite time consuming) we have iteratively solved the minimum squares problem. In each step, the algorithm removes the samples whose weights are negative in the solution.
Using this formula, each of the computed γi,kis the weight of the normal regions of sample k to compute reference for sample i using a weighted median. For each SNP j, only the samples that are expected to be normal are included in calculation of the median using the corresponding weights.
In this part, it is shown that the results using NSA outperforms the use of control samples from a different lab or using a robust median of the tumoral samples (which are the most used methods). This improvement in performance appears both in noise and bias. Since there is not a ground truth to compare against, we have used three indirect aspects to state the performance: the ability to find CNAs along the genome, the quality of the estimated CNs in regions that are known to be normal and the ability to find recurrently aberrated regions.
Eckel-Passow et al. shows a comparison among different summarization algorithms. It describes four summarization methods[2–5]. Within these methods, there are only two different algorithms for scaling: using the median (of some of the samples) and a linear model. The summarization methods used here (ACNE and CalMaTe), internally implement a callibration that is in fact a linear model. In addition to the methods described in, dChip implements a trimmed mean of the samples if no references are provided, LaFramboise fits a non-linear model using the information from control samples, Nannya et al propose to use the m control samples most similar to the sample under study. In the end, the suggested methods to scale the samples (except) are simply the computation of a robust average (i.e. the median or a trimmed mean) of some of the samples. As will be shown later, our approach to remove batch effects resembles Nannya’s but focusing only on the normal zones within the tumor samples.
We have compared four different possibilities to select the samples to build the references. These scaling methods depend on whether control samples are available or not and on the laboratory where the control samples (if any) were hybridized. The first algorithm under study can be applied to a dataset which includes control samples from the same lab. In this case, for each SNP, the median of the values of the control samples is used as reference i.e, the median of ( +) where C is a subset of control samples within the experiment. We named this algorithm MCS (median of control samples). This is the method of choice suggested by all the referred methods if control samples are available.
The next three algorithms are used when no control samples are available. The second method (MHS: median of HapMap samples) computes the reference using the median of external control samples (in this case they are from HapMap). This is the method suggested by[3, 6]. The third method (MTS: median of tumoral samples) builds the reference using the median of all the tumoral samples (implicitly assuming that most of the samples have neutral copy number at a given locus). It is suggested in some vignettes of the package aroma.affymetrix. And, finally NSA, which calculates the reference based on the predicted normal regions within the tumoral samples.
We have focused this section on the detection of recurrent alterations since it is the aim of many CN analysis.
The analyzed experiment is the GBM dataset (64 tumoral samples hybridized to Affymetrix Mapping50K_240Xba). The data were analyzed using CRMAv2 pre-processing and ACNE summarization method. Once the θ values were obtained, the data were scaled using both MTS and NSA. In addition, we performed the same analysis taking HapMap samples as references (MHS).
On the other hand, in both chromosomes 7q and 10, NSA and MHS provide more significant values for the aberrations (the amplified and deleted regions have higher q-values) and no artificial aberrations appear. The performance in terms of identifying recurrent aberrations is similar for both algorithms. Nevertheless, we have to fine-tune the low-level analysis using MHS since a very strong bias appears in the CN estimates (that NSA does not present because of the second stage of the algorithm -scale by samples).
We have performed this comparison in two cancer datasets, Prostate Cancer and Lung Cancer. In addition, we have analyzed the chromosome X in some HapMap samples to check if NSA is able to uncover which samples have two copies (female samples) and, using the first autosomic region of this chromosome, compare the ability to find a copy number change.
The Prostate Cancer dataset is hybridized to Affymetrix Mapping250K_Nsp. We compared the NSA results with MTS, MCS and MHS. It is reminded that MCS needs more hybridizations than the other methods. This analysis is focused on finding aberrant regions in order to show which method detects CNAs more accurately.
For this analysis, we have included the SNPs that surround the CN change (specifically from 20 to 45Mb). Since the exact position of the change is difficult to locate, the SNPs within a safety zone from 30Mb to 34Mb are not considered. This figure includes some gray zones that illustrate which is the region under study. It can be seen in Figure4 that MCS and NSA are the ones that best identify this CN change (better TPR for the same FPR). MTS gives intermediate results and MHS performs worse than the rest. We have selected this locus since it is the most prominent recurrent aberration for this dataset and has been previously shown to be a frequent aberration in Prostate Cancer[29, 30].
We have also validated NSA with the dataset from a Lung Cancer study, hybridized to Affymetrix GenomeWideSnp 6.0. It should be noted that in this study there are 291 samples where 59 are control samples from a different lab (one of the authors informed us in a personal communication that they are from HapMap).
Moreover, there is also another bias, which exists even using control samples from the same lab. This bias is generated in the pre-processing step where the samples are normalized. This step performs a transformation on probe level data to make them comparable across different samples. This transformation can be a quantile scaling or simply a multiplication of the data by a constant. Nevertheless, if a large part of the genome in a sample is deleted (amplified), the normalization procedure tends to compensate for this effect and the normal regions of the genome (that ideally have CN=2) are shifted to a larger (smaller) value to compensate for the aberration. This fact induces a bias in the CN estimation. Therefore, if there is a sample with many deletions (or amplifications), the normal regions of these samples have a slightly higher (lower) value than 2.
We have also analyzed the ability to remove batch effects using NSA. For this we have used the Ovarian Cancer dataset. This dataset includes 129 samples (72 tumoral and 57 references, some of them matched) that were hybridized in 13 batches. The number of samples for each of the batches is very different to each other ranging from 2 samples in one batch to more than 20 samples in another batch. In the case of small batches, the advantage of computing the references by batches or taking the experiment as a whole is unclear.
We have applied NSA with batch effect removal on this dataset. NSA’s procedure to estimate the weights is completely blind, i.e., it does not require any information from the user on which are the samples within each batch.
This paper describes NSA, an algorithm to scale the summarized SNP signals to CN values by finding normal regions within tumoral samples. NSA is platform (Illumina or Affymetrix) and pre-processing method (dChip, CRMAv2, ACNE, CalMaTe…) independent. The synthetic reference generated by NSA using only tumoral samples gives more accurate results than either using control samples from different labs or using all the tumoral samples. Indeed, NSA results are close to the ones obtained using control samples from the same lab within the dataset. In addition, NSA includes an algorithm to deal with batch effects. It automatically computes an optimal reference for each sample (that in our tests is strongly related to the hybridization batches). Batch information is not required to run NSA; the algorithm automatically identifies the proper samples to compute the reference using only the signals of the microarrays. NSA minimizes the problem of bias for samples with a large number of similar aberrations (i.e. most of them are gains or deletions). For these samples, the predicted CNs for normal regions tends to compensate the aberration including a bias. This is a potential problem that also occurs using MCS (control samples from the same lab). NSA is able to effectively discover the normal regions and uses them to scale the data diminishing any bias that appears in the normalization step.
We have compared NSA with other scaling methods. On one hand, it is important to pinpoint that the number of hybridized samples needed for NSA, MTS and MHS is much smaller than using MCS, since these methods do not need to hybridize control samples from the same lab to create the reference. MHS method seems to be noisier which can be an effect of the difference between the protocols or conditions of the labs where the arrays were hybridized. MTS provides biased estimates especially in the case of recurrent aberrations. On the contrary, NSA estimates are similar to the ones from MCS. In addition, NSA provides a way to deal with the unavoidable batch effects of large experiments.
NSA also presents some limitations that are summarized here. Ultimately, the identification of normal regions relies on the fact that gains in both chromosomes are very unlikely to occur. If this is not the case and a few samples do present amplifications on both chromosomes, NSA is still reliable since a robust estimator of the reference (such as the median) can withstand the presence of a small percentage of outliers.
The prediction of normal regions can be affected by samples with low tumor purity that can be wrongly included within the normal ones. Again, since the median withstand the presence of errors, it will only affect if most of the samples have very low tumor purity which usually is not the case.
The number of samples used by NSA to build the reference is different for each locus. Therefore, the variance of the reference varies on each position. In particular, recurrently aberrated regions have a reference with larger variance than regions that show seldom aberrations.
Since NSA is based on finding regions enriched in HNCNs, if an aberration occurs in a region with no SNPs (it includes only CN probes), NSA cannot provide an accurate reference for this region.
Similarly, NSA estimates of chromosomes X and Y are not reliable for experiments that include only male samples. It is not useful either for experiments that include polyploid samples. These samples still pose a challenge for their analysis.
Finally, one potential situation in which NSA could fail is if all the samples present an aberration in the same locus. This is very unlikely event but it could potentially occur. Even in this putative case, a few samples hybridized in the same lab can be included in the study to avoid these “emergency cases”.
In spite of these limitations (most of them also present in other scaling methods), NSA is able to fix the problem of finding recurrent regions and copy number changes when no control samples are available.
If NSA is applied to a dataset with both tumoral and control samples, the detection of CN changes could be even more accurate than using MCS (especially if NSA removes the batch effects) because of having more and (what is more important) more appropriate samples to calculate the references. This fact makes NSA especially convenient to apply to many experiments (in GEO or ArrayExpress) without the need to explicitly state which are the control samples or the batches in the datasets.
In conclusion, NSA can be used to accurately scale summarized SNP signals to CNs. It presents less bias and noise than MTS or MHS and needs no control sample hybridization.
The proposed NSA method is available in the NSA package implemented in R (R Development Core Team, 2010). This package includes an add on to the high-level aroma.affymetrix framework, which allows NSA to be applied to very large SNP data sets. It is publicly available at CRAN repository in a package called “NSA”.
MO conceived the idea and jointly with AA developed the add-on to the aroma.cn framework. AR developed the algorithm to account for batch effects. MO, AA and AR wrote the manuscript and developed the software to compare NSA against other algorithms. All authors read and approved the final manuscript.
Level of Heterozygosity
Heterozygous Neutral Copy Number 775 SNPs
Copy Number Aberration
DNA Copy Numbers
Single 776 Nucleotide Polymorphism
Loss of Heterozygosity
Copy Number 777 Variations
Normality Search Algorithm
Batch Effect Removal
778 Quadratic Programming
Median of Control samples
Median of 779 HapMap samples
Median of tumoral samples.
The authors would like to thank Luis Montuenga and Maribel Zudaire (“Centro para la Investigacion Medica Aplicada”, CIMA, University of Navarra) for their support and advice in the analysis of the biological consequences of the copy number alterations.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.