Skip to main content
  • ORIGINAL PAPER
  • Open access
  • Published:

CNARA: reliability assessment for genomic copy number profiles

Abstract

Background

DNA copy number profiles from microarray and sequencing experiments sometimes contain wave artefacts which may be introduced during sample preparation and cannot be removed completely by existing preprocessing methods. Besides, large derivative log ratio spread (DLRS) of the probes correlating with poor DNA quality is sometimes observed in genome screening experiments and may lead to unreliable copy number profiles. Depending on the extent of these artefacts and the resulting misidentification of copy number alterations/variations (CNA/CNV), it may be desirable to exclude such samples from analyses or to adapt the downstream data analysis strategy accordingly.

Results

Here, we propose a method to distinguish reliable genomic copy number profiles from those containing heavy wave artefacts and/or large DLRS. We define four features that adequately summarize the copy number profiles for reliability assessment, and train a classifier on a dataset of 1522 copy number profiles from various microarray platforms. The method can be applied to predict the reliability of copy number profiles irrespective of the underlying microarray platform and may be adapted for those sequencing platforms from which copy number estimates could be computed as a piecewise constant signal. Further details can be found at https://github.com/baudisgroup/CNARA.

Conclusions

We have developed a method for the assessment of genomic copy number profiling data, and suggest to apply the method in addition to and after other state-of-the-art noise correction and quality control procedures. CNARA could be instrumental in improving the assessment of data used for genomic data mining experiments and support the reliable functional attribution of copy number aberrations especially in cancer research.

Background

Since the introduction of molecular-cytogenetic technologies for whole genome copy number aberration screening [1, 2], considerable advances have been made to work with a variety of sub-optimal material (e.g. micro dissected samples, aspiration biopsies, paraffin embedded tissue), both in the areas of DNA preparation, labeling and platform technologies as well as in bioinformatic processing of the experimental read-out. However, DNA copy number profiles from current microarray and sequencing experiments sometimes suffer from the presence of systematical “wave patterns” [3] throughout the whole genome, where within each genomic segment the estimated copy number deviates from the true value which is supposed to be a constant. These wave artefacts disrupt the piecewise constant signal of the copy number data and may lead to false positives or negatives in identifying CNAs.

One of the known causes to the wave artefacts is differential DNA retrieval across chromosomal regions, which may be due to GC-content bias [4], DNA replication timing [5], differences in chromatin organization during DNA isolation [6] and damages to the DNA by fixation procedures [7]. Copy number profiles with heavy wave artefacts sometimes can be corrected if certain requirements are met. Marioni et al. developed a method to remove wave artefacts in copy number profiles for normal samples without obvious CNAs [8]. Wiel et al. suggested to eliminate waves in tumor profiles with many CNAs using calibration profiles [3]. Some methods correcting GC-content bias have also been implemented for microarray [4] and sequencing experiments [9]. However, the methods proposed in these studies have a limited ability to remove wave artefacts or put many restrictions on the type and variability of the input data itself. In addition to wave artefacts, large derivative log ratio spread (DLRS) [10] correlating with poor DNA quality also leads to unreliable copy number profiles.

The limited ability of existing experimental and bioinformatic methods to remove wave artefacts or to correct for source dependent DNA quality variations motivated us to devise a method for assessing if genomic copy number can be reliably estimated from a pre-processed, technology agnostic copy number profiling dataset. Rather than developing a method for improving the sample derived copy number profiles themselves, our primary intention here is to provide measures for the contamination of copy number profiles through artefacts and thereby to support decisions regarding the suitability of these copy number profiles for downstream data analysis and interpretation.

Zhang and Zhang previously designed such a measure in an explorative study [7] where they proposed using autocorrelation scanning profile (ASP) to evaluate data quality, and demonstrated on simulated data that the median of ASP (medASP) can be used as a discriminative metric. However, it will be shown in the “Discussion” section that medASP is not an adequate measure for real-world data of different scenarios.

In this paper, we assess the reliability of copy number profiles using a machine learning approach. From our experience and by experiment, we selected four features which are able to adequately represent the copy number profiles for reliability assessment, i.e. the number of steps that can be detected by the step-fitting algorithm in the copy number profile, a quantitative value indicating how much the copy number data is step-like, the number of segments induced by both CNAs and wave artefacts detected by circular binary segmentation [11, 12], and the mean of DLRS within segments. We will explain in detail about these features in the “Results and discussion” section and discribe the step-fitting algorithm by which the first two features are generated in the “Methods” section. Based on these features, we trained a classifier to predict the reliability of the copy number profile. We will describe the way of classifying reliable and unreliable profiles, and subsequently assigning them into one of the five subcategories each having a biological or experimental correspondence. Our reliability assessment method can also be adapted to assess copy number profiles generated by those sequencing platforms from which copy number estimates could be computed as a piecewise constant signal.

Results

Piecewise constant model and segmentation

Tumor samples often contain CNAs, in which chromosomal segments are found gained or lost in copy number, deviating from the normal diploid status. Genome-wide copy number can be depicted as a piecewise constant signal, where the change-points are the boundaries of the chromosomal segments that differ in copy number, and the constant value between a pair of change-points is the copy number of the corresponding segment. Copy number profiling by microarray and sequencing techniques gives noisy estimates of the true copy number at specific genomic positions, which can be modeled as follows:

Assume a series of n log ratio copy number estimates x={x i :i=1,2,,n} ordered by genomic position. The piecewise constant model for the series is

$$ x_{i} = \mu_{i} + \varepsilon_{i}, \quad i=1,2,\cdots,n, $$
(1)

where μ={μ i :i=1,2,,n} is a piecewise constant function and ε={ε i :i=1,2,,n} is a sequence of independent and identically distributed errors. Assuming a series of m+1 change-points τ={τ j :j=0,1,2,,m} where 1=τ 0<τ 1<<τ m =n+1 delimit m segments with copy number level θ={θ j :j=1,2,,m} such that

$$ \mu_{i} = \theta_{j}, \quad i\in[\tau_{j-1},\tau_{j}), \quad j=1,2,\cdots,m. $$
(2)

The errors are usually assumed to be Gaussian εN(0,σ 2) and supported by experimental data on self-self hybridizations [13], although this assumption is not crucial if the distances between successive τ j ’s are large [14]. The DLRS is denoted by σ and estimated by the standard deviation of the error ε.

Segmentation is applied to recover the genomic position of the boundaries and the underlying copy number for chromosomal segments from the noisy copy number estimates. Under the piecewise constant model, the segmentation problem is to find the change-points τ delimiting the segments and the copy number levels θ for each segment. If τ is known, θ j can be estimated by the mean of the copy number estimates that fall in the j-th segment, that is

$$ \hat{\theta_{j}} = \frac{\sum_{i=\tau_{j-1}}^{\tau_{j}-1} x_{i}}{\tau_{j}-\tau_{j-1}}, \quad j=1,2,\cdots,m. $$
(3)

Many segmentation algorithms have been proposed (see [15] and [16] for excellent review), among which the popular circular binary segmentation (CBS) algorithm [11, 12] was found to be one of the most accurate methods. Starting with the whole chromosome, the CBS algorithm detects change-points delimiting a sub-segment with a different copy number level in the middle of a larger segment, and does it recursively until no more change-points can be found in any of the segments. For any interval [a,b) and 1≤a<bn, let the null hypothesis be that the log ratio copy number estimates x a ,x a+1,,x b are independent and identically distributed Gaussian and the alternative be that there is a sub-segment with different mean and same variance. CBS calculates the maximal t-statistic T=max ai<jb |T ij |,

$$ T_{ij} = \frac{\bar{Y}_{ij}-\bar{Z}_{ij}}{s_{ij}\sqrt{(j-i)^{-1}+(b-a+i-j+1)^{-1}}}, $$
(4)

where \(\bar {Y}_{ij} = (x_{i+1}+\cdots +x_{j})/(j-i)\), \(\bar {Z}_{ij} = (x_{a}+\cdots +x_{i}+x_{j+1}+\cdots +x_{b})/(b-a+i-j+1)\), and \(s^{2}_{ij}\) is the corresponding mean squared error.

Step detection and step-likeness quantification by step-fitting

Ideally, DNA is uniformly retrieved from the chromosomes during sample preparation, that is, the amount of DNA retrieved is proportional to the true copy number of the chromosomal segments present in the cells. In this case, the log ratio copy number estimates contain only abrupt jumps which satisfy the piecewise constant assumption. However, in practice, due to GC-content bias [4], DNA replication timing [5] and other biological phenomena such as differences in chromatin organization during DNA isolation [6] or damages to the DNA caused by formalin fixation [7], DNA is sometimes retrieved differentially across the genome, which adds artefacts in the form of waves to the otherwise piecewise constant signal. These wave artefacts adversely affect detecting change-points which truly delimit the gained and lost chromosomal segments.

Existing segmentation algorithms such as CBS are poorly suited to discriminate the change-points that are boundaries of the CNA segments from those introduced by wave artefacts. When the magnitude of the waves are less than that of the CNA signal, CBS leads to “hyper-segmentation”, in which many change-points caused by waves are detected in a single CNA segment. In the worst case where the true copy number signal is buried in the waves with comparable or even greater magnitude, change-points detected by CBS largely depend on the wave artefacts and are no longer boundaries of the CNA segments.

Fortunately, in practice, CNAs induce abrupt jumps which are visualized as steps and wave artefacts only induce gradual changes. As a result, for copy number profiles containing CNAs, reliable samples are usually step-like, with few waves in each CNA segment, whereas unreliable ones contain a lot of waves or the overall shape of the signal does not resemble steps any more. Based on these observations we implemented a fast step-fitting algorithm of time complexity O(nlogn) adapted from the method originally proposed by Kerssemakers et al. (refer to Supplementary Methods 3 in [17]) to capture the step signal which are mostly CNAs regardless of the change-points introduced by wave artefacts, and assess how much the copy number profile is step-like. See “Methods” section for a detailed explanation about the step-fitting algorithm and the adaptation.

Computer simulation: CBS versus the step-fitting algorithm

To demonstrate the differences between CBS and the step-fitting algorithm in detecting change-points, 3 groups of copy number profiles each containing 200 samples of 10,000 dimensions were generated (Figs. 1 and 2). Group A are reliable copy number profiles containing many CNAs; Group B are unreliable copy number profiles in which the piecewise constant CNA signal is buried in waves; Group C are hyper-segmented copy number profiles in which many change-points caused by waves are present in a single copy number segment. See Additional file 1 for details on how simulations were performed.

Fig. 1
figure 1

An example of simulated copy number profiles. Samples of 10,000 dimensions (n=10,000) were generated for 3 reliability groups, i.e. A: reliable copy number profiles containing many CNAs; B: unreliable copy number profiles having indiscernible CNAs due to wave artefacts; C: hyper-segmented copy number profiles. In group A, both CBS and step-fitting recovered majority of copy number segments well; In group B, both methods fitted noise and CBS detected more change-points than step-fitting; In group C, CBS detected too many change-points whereas step-fitting recovered majority of segments well

Fig. 2
figure 2

CBS versus step-fitting for 200 set of simulated copy number profiles. Scatter plot of the number of segments detected by CBS (purple) and step-fitting (green), plotted against the number of true CNA segments. In group A, the number of segments recovered by CBS approximates true CNA segments very well; the number of segments recovered by CBS is close to that of step-fitting. In group B, both methods fitted noise in which CBS found more change-points than step-fitting in general. In group C, CBS recovered much more segments than the number of true segments, whereas step-fitting found majority of CNA segments well

Figure 1 shows a set of simulated copy number profiles having the same true CNA segments. A1 to A3 demonstrate the segmentation on the same reliable copy number profile, where the red lines in A1 are the predefined true CNA segments. Red lines in A2 and A3 are copy number segments recovered by CBS and step-fitting respectively, from which it can be concluded that both methods recovered the majority of copy number segments. While CBS performs better in terms of detecting small segments of low signal to noise ratio, step-fitting is capable of capturing the overall step-like signal. B1 to B3 show the same unreliable copy number profile with heavy wave artefacts generated by introducing large auto-correlation to the reliable copy number profile in group A, for which neither CBS nor step-fitting can locate the true CNA segments correctly. In B2, CBS found greater amount of change-points most of which are noise, whereas in B3 step-fitting also fitted noise but only those with more abrupt changes and therefore the number of change-points detected are less than CBS; Nevertheless, change-points detected by both methods mainly depend on noise and are no longer boundaries of the true CNA segments. C1 to C3 show the same hyper-segmented copy number profile created by passing the reliable copy number profile in group A through a median filter to add small waves at the same time preserving the overall step-like structure. While in C2 CBS detected too many change-points within individual true CNA segments, step-fitting significantly outperformed by recovering majority of copy number segments well. Note here the possibility of merging multiple segments detected by CBS using additional methods is not considered, as our main goal is to find a good proxy for the extent of wave artefacts which is represented by the number of change-points that CBS can detect under its standard setting.

The observation from Fig. 1 can be generalized to the set of 200 simulated copy number profiles as shown in Fig. 2. For reliable copy number profiles in group A, the number of segments recovered by CBS is a very good approximation to the true CNA segments (Spearman correlation coefficient ρ=0.89; Regressing the number of recovered segments on the number of true segments by robust linear regression results in slope w=0.93 and intercept b=0.28), while step-fitting recovered majority of CNA segments well except for those very short segments of low signal to noise ratio (ρ=0.41; w=0.67 and b=0.79). For copy number profiles in group B containing heavy wave artefacts, both methods fitted noise. CBS recovered more segments than the number of true segments (w=1.86 and b=−2.26) and although in our simulation there is still correlation (ρ=0.70) as the waves introduced in different samples are of comparable size, the correlation is not expected in real data where the waves are of various size coming from different sources. Furthermore, in group B, the number of segments found by step-fitting is largely less than CBS and independent of the number of true segments (ρ=0.04; w=0.13 and b=19.56). Last, for hyper-segmented copy number profiles in group C, CBS recovered far more segments than the number of true segments (ρ=0.36; w=2.84 and b=222.80), whereas step-fitting found majority of CNA segments well (ρ=0.38, w=0.79 and b=2.21).

Reliability assessment metrics

In this section we further develop the reliability assessment method on 1522 previously published copy number profiles from different experimental batches and microarray platforms (See “Methods” section for detail).

Figure 3 shows five typical cases each corresponding to one of the reliability groups that can often be observed among genomic copy number profiles. Apart from those containing many CNAs which can be classified as hyper-segmented, reliable and unreliable with heavy wave artefacts (Fig. 3 a to c), there exist copy number profiles that have no or very few CNAs (e.g. control samples from normal tissues; Fig. 3 d); but also copy number profiles with extraordinarily large DLRS suggesting poor DNA quality prohibiting the detection of any CNA (Fig. 3 e). The upper panel of each subgraph in Fig. 3 a to e shows copy number profile of a particular case segmented by CBS, and the lower panel shows the same copy number profile segmented by step-fitting in the optimal iteration when S peak , the maximum value of S generated by the step-fitting algorithm was attained. The corresponding S values throughout iterations are plotted in Fig. 3 f (See The step-fitting algorithm in “Methods” section for definition of S and S peak ).

Fig. 3
figure 3

Five example specimen of copy number profiles for each of the reliability groups. 3a: Case 1, hyper-segmented, discernible CNAs with some waves; 3b: Case 2, reliable, discernible CNAs with few waves; 3c: Case 3, unreliable, indiscernible CNAs with heavy waves; 3d: Case 4, unreliable, large DLRS, undetectable CNAs; 3e: Case 5, reliable, control sample or without many CNAs. In each subgraph 3a to 3e, the upper panel shows the copy number profile segmented by the CBS algorithm, and the lower panel displays the same copy number profile segmented by step-fitting in the optimal iteration when S peak was attained, where the red line is the fit and the blue line is the counter-fit. In Fig. 3 f, S values are shown for the same five copy number profiles. For each curve the S-values for 120 iterations are shown. The GEO accession numbers [29] for the five cases are: Case 1, GSM360756 [30]; Case 2, GSM491138 [31]; Case 3, GSM360643 [30]; Case 4, GSM187938 [32]; and Case 5, GSM182894 [33]

In Fig. 3 a to c, the performance of CBS versus step-fitting is consistent with that on simulated data in the previous section: In reliable copy number profiles (Fig. 3 b), segments recovered by CBS are comparable to those from step-fitting; in unreliable copy number profiles with heavy wave artefacts (Fig. 3 c) both methods fitted noise in which CBS found more change-points than step-fitting; in hyper-segmented copy number profiles (Fig. 3 a) CBS detected a great many change-points within individual steps recovered by step-fitting, resulting in a higher number of CBS derived segments. An additional feature associated with reliability can be discovered by looking closer into the three cases in which both the reliable and the hyper-segmented copy number profile have clear step-like structures, while in the unreliable case the boundaries of the CNA segments are much less defined. This property is reflected by the peak value of S: As shown in Fig. 3 f, the hyper-segmented (Case 1) and the reliable copy number profile (Case 2) both attained relatively high S peak whereas S peak for the unreliable copy number profile (Case 3) is much lower.

In Fig. 3 d and e, no apparent CNA exists or can be detected. The former is unreliable due to large DLRS so that the CNAs, even if are present, cannot be detected properly, whereas the latter is a reliable control sample diploid across the genome. Nevertheless, in both cases step-fitting merely fitted noise and the corresponding S values for Case 4 and 5 in Fig. 3 f increase throughout iterations yet remain close to 1 and no apparent peak can be observed.

So far we have discussed 4 features related to the reliability of copy number profiles, i.e. S peak , an indicator of how much the copy number profile is step-like; l, the number of steps detected by step-fitting; v, the number of segments detected by CBS which could be induced by both CNAs and wave artefacts; and σ, which is the DLRS estimated by the standard deviation of the error ε in Eq. (1).

The full set of reliability assessment metrics is summarized in Table 1, where l and v combined represent the wave density.

Table 1 Reliability assessment metrics for copy number profiles. Eight qualitative combinations lead to five reliability cases

CNARA: reliability assessment for copy number profiles

To turn the qualitative metrics into quantitative ones and therefore allow to predict the reliability of a given copy number profile, a support vector machine (SVM) classifier was trained on the 1522 previously published copy number profiles labeled as reliable or unreliable by experts (see “Methods” section).

The 4 features S peak , l, v and σ were extracted for each of the 1522 samples and were used with the SVM classifier. The svm function in the e1701 R package [18] which is the interface to the C++ implementation libsvm [19] was called, where the cost was set to 1 and the radical basis function (RBF) kernel e x p(−0.4|uv|2) was chosen in which the parameters were set by cross validation. Tenfold cross-validation on the 1522 training samples resulted in a total prediction accuracy of 99.08 % with standard deviation 0.0083 (a 3D visualization can be found in Additional file 1: Figure S5). The extremely high accuracy suggests that the 4 features chosen can represent the copy number profile very well for reliability assessment. The resulted classifier can predict any new copy number profiles as reliable or unreliable in terms of whole-genome CNA evaluation. For fine tuning of the methodology to particular sample compositions or evaluation goals, we have included the option to use user provided training sets (see respective subsection below).

For a copy number profile predicted as reliable, the value of S peak tells if it has many CNAs or not (case 2 and 5 in Table 1). Usually control samples from normal tissues or samples without many CNAs have S peak t h r 1, while tumor samples containing CNAs that we are interested in have S peak >t h r 1, where t h r 1 is a threshold with default value 1.5.

Copy number profiles predicted as unreliable can be further divided into 3 subcategories (case 1, 3 and 4 in Table 1), by carrying out the following procedure: Denote samples in the training set by \((S_{peak}^{(i)},\ l^{(i)},\ v^{(i)},\ \sigma ^{(i)})\) where i=1, 2, , 1522. Given any unreliable sample b=(S peak , l, v, σ), create two dummy samples d 1=(S peak , l, m i n(v (i)), σ) and d 2=(S peak , l, v, m i n(σ (i))) by substituting v or σ by the minimum of the training set correspondingly, and predict the reliability of d 1 and d 2. If d 1 is reliable, b contains wave artefacts and therefore can be subcategorized as either Case 1 or 3; if d 2 is reliable, b has large DLRS and thus belongs to Case 4; otherwise b suffers from both wave artefacts and large DLRS and can be either Case 1 or 3. As hyper-segmented samples in Case 1 may still be of interest for certain tasks such as looking for CNA regions but not counting the number or keeping track of the size of copy number gains or losses, a more stringent threshold t h r 2 greater than t h r 1 for S peak could be set, for example, to accept those hyper-segmented samples having S peak >t h r 2 with caution where the default value of t h r 2 is 2.5.

Discussion

A comparison between CNARA and medASP

The performance of CNARA and medASP [7] was compared and shown in Fig. 4. The 1522 samples were randomly split into training and validation sets at the proportion of 50:50 % and the SVM classifier of CNARA was trained on the training set where the parameters for the cost and the RBF kernel were set the same as in the previous subsection. The probability of being predicted as unreliable for each sample in the validation set was computed by the libsvm implementation [19] and the median of ASP was computed for the same sample in the validation set. The receiver operating characteristic (ROC) curves for the two values were then generated against the true class labels for the validation set.

Fig. 4
figure 4

CNARA versus medASP. ROC plots of CNARA and medASP on the validation set. The 1522 samples were split into training and validation sets (50:50 %) at random, and the SVM classifier of CNARA was trained on the training set. The AUC of CNARA is 0.9994, compared to the AUC of medASP (0.7372)

As shown in Fig. 4, the area under the ROC curve (AUC) for medASP is 0.7372, which suggests that the assumptions made in the medASP study [7], i.e. simulating copy number profiles with a gain and a loss region of fixed size, does not comform well with feature distribution in real-world data. By contrast, the AUC of CNARA is 0.9994 which is consistent with the extremely high prediction accuracy of the SVM classifier stated in the previous subsection. The value approaching 1 once again implies that the 4 features chosen summarize the copy number profile very well in terms of its reliability.

Custom training set

Apart from the training set consists of absolutely reliable and unreliable copy number profiles, CNARA is flexible to take in additional average quality samples labeled by experts to the training set to accomodate to their own need. For example, the boundaries of the CNA segments for formalin-fixed paraffin embedded (FFPE) samples are known to be less well defined than fresh-frozen samples in general [7, 20]. Therefore, when working with a large set of FFPE samples, one may wish to keep those slightly contaminated samples and reject others that are highly contaminated. To achieve this, people can create their own training set by including several slightly contaminated samples as reliable and highly contaminated as unreliable according to expert knowledge to the original training set (See Additional file 1 for more details on building custom training set).

Conclusions

During our efforts on the curation of human cancer copy number data from genomic screening experiments [21, 22], we have frequently encountered data sets challenging the state-of-the-art quality assessment procedures. While several methods have been proposed to improve the readout of genomic copy number profiling both through improvements of experimental design [6] as well as application of advanced bioinformatic methods [3, 4, 8, 9], many data sets still contain artefacts that can not be removed sufficiently by these existing methods. As a result, when working with tens of thousands of genomic copy number profiles derived from a multitude of platforms and different pre-processing methods, a robust method capable of identifying the low quality data sets based on extractable features is needed.

Previously, some downstream quality control measures such as genotyping call rate [23] and derivative log-ratio spread (DLRS) [10] have been reported for microarrays to describe the noise induced during sample hybridization onto the arrays. However, existing evaluation methods are limited in controlling for inherent noise due to differential DNA retrieval across the genome during sample preparation, which is a severe problem in the generation of genomic copy number profiles, and therefore reliability assessment for copy number profiles remained an open problem. Zhang and Zhang previously did an explorative study [7] where they proposed medASP as a discriminative metric for evaluating the quality of copy number profiles and achieved good accuracy on simulated data. However, it has been shown in the previous section that medASP is not an adequate measure for real copy number data which contains different amount of copy number gains and losses of various sizes at different genomic positions.

In this paper, we proposed a method for assessing the reliability of DNA copy number profiles. We showed five typical cases each corresponding to one of the reliability groups that can often be observed in practice and discussed in detail about the 4 features which can represent the copy number profiles well in terms of reliability, namely the number of steps detected by step-fitting, an indicator of how much the copy number data is step-like, the number of segments induced by both CNAs and wave artefacts detected by CBS and the mean of DLRS within segments. To obtain the first two features we proposed a fast step-fitting algorithm of time complexity O(nlogn) which is scalable to high-throughput copy number data. Taking these 4 features as input, an SVM classifer was trained on 1522 samples labeled as reliable or unreliable according to expert knowledge. By tenfold cross-validation on the whole dataset, the resulting classifier achieved a total accuracy of 99.08 % with standard deviation 0.0083. Predicted samples can be further subcategorized into the five reliablility classes, each having a biological or experimental correspondence. To the best of our knowledge, this is the first application developed for real-world data filling the gap of controlling the quality problem regarding differential DNA retrieval for copy number profiles, which is non-overlapping with and complementary to the objectives of any state-of-the-art quality control measures.

Our method can be applied to log ratio copy number data before or after normalization. Applied before normalization, it helps to judge if the log ratio copy number data needs normalization and noise correction; applied to the normalized data, it assesses the utility of the data in downstream analysis. Nonetheless, since it evaluates an aspect of the quality different from any existing quality control measures, we suggest our reliability assessment method to be the last step after any possible downstream quality control and bias correction methods, to decide if the normalized sample can be finally included in data mining experiments for biomedical knowledge generation. We have to emphasize that, while our method can deliver information about the quality of the signal derived from whole genome copy number screening experiments, by itself it does not address problems arising from the possible clonal heterogeneity, e.g. in biosamples derived from cancer tissues. Also, the impact of reliability assessment will depend on the intended downstream analyses; for instance, the reliability of whole-genome CNA profiles, as determined by our method, may be of less concern when using statistical CNA peak finding tools like GISTIC [24].

Thanks to the competing efforts on estimating genomic copy number from exome and whole-genome sequencing [25] especially recent development of CopywriteR [26] which is capable of extracting uniformly distributed copy number information from sequencing data, we are optimistic that in the near future piecewise constant signal for copy number estimates could be computed reliably with comparable acurracy to that of microarray platforms. Due to the universality of our method in dealing with samples from different platforms and the flexibility in taking new training samples, at that time our reliability assessment method can also be easily adapted to those copy number profiles generated by sequencing platforms.

Methods

The copy number profile dataset

1522 previously published copy number profiles (Additional file 2: Table S1) were used in our study. The samples were obtained from arrayMap [21, 22] with the original data retrieved from NCBI’s GEO repository [27], with preprocessing and noise correction performed through the standard arrayMap data processing pipeline [21]. Probe-level plots with added segmentation markers were visually inspected and selected by experts with respect to the empirically assessed copy-number calling reliability, and with the goal of a balanced representation of reliable and unreliable copy number profiles as well as a sufficient coverage of cases from different reliability groups. This selection resulted in 804 absolutely reliable copy number profiles (cf. cases 2 and 5 in Table 1) and 718 absolutely unreliable copy number profiles including hyper-segmented ones (cf. cases 1, 3 and 4 in Table 1). See Additional file 1 for data preprocessing and Additional file 2: Table S1 for a complete list of the arrays being analyzed and their reliability labels. The data is available at arraymap.org [21, 22].

The step-fitting algorithm

To recover the set of change-points τ={τ j :j=0,1,2,,m} delimiting the CNA segments from the log ratio copy number estimates x={x i :i=1,2,,n}, the algorithm updates τ iteratively, starting from τ (0)={1,n+1}, in each iteration k introduces an additional change-point τ (k) to τ (k−1), called the best-fit, which minimizes the cost function

$$ H = \sum_{i=1}^{n}{(x_{i}-\mu_{i})^{2}} $$
(5)

by scanning through all possible locations i=2,,n, iτ (k−1) and adding i to τ (k−1) as the temporary change-point set \(\boldsymbol {\tau }_{temp}^{(k)}\) such that \(\boldsymbol {\tau }_{temp}^{(k)} = \text {sort}(\boldsymbol {\tau }^{(k-1)} \cup i)\), in which the elements are ordered from the smallest to the largest; μ i in Eq. (5) is computed from Eqs. (2) and (3) where \(\{\tau _{j-1}, \tau _{j}\}\subseteq \boldsymbol {\tau }_{temp}^{(k)}\), j=1,2,,k+1. The location i that minimizes Eq. (5) is therefore τ (k), and τ in the kth iteration is updated as τ (k)=sort(τ (k−1)τ (k)).

Next, in between each pair of τ j−1 and τ j in τ (k) where j=1,2,,k+1, find the change-point c j which is the best-fit for x i , i[τ j−1,τ j ) such that it minimizes

$$ F_{j} = \sum_{i=\tau_{j-1}}^{\tau_{j}-1}{(x_{i}-\psi_{i})^{2}}, $$
(6)

where c j (τ j−1,τ j ) and

$$ \psi_{i} =\left\{ \begin{array}{ll} \sum_{i=\tau_{j-1}}^{c_{j}-1} x_{i}/(c_{j}-\tau_{j-1}) & \text{for}\ i\in[\tau_{j-1},c_{j})\\ \sum_{i=c_{j}}^{\tau_{j}-1} x_{i}/(\tau_{j}-c_{j}) & \text{for}\ i\in[c_{j},\tau_{j})\\ \end{array}\right. $$
(7)

The set of change-points c j ’s plus the boundary denoted by c (k)={c j :j=1,2,,k+1}{c 0=1, c k+2=n+1} is called the counter-fit change-points for iteration k. The cost function Q for the counter-fit is defined as

$$ Q = \sum_{i=1}^{n}{(x_{i}-\nu_{i})^{2}}. $$
(8)

where

$$ \nu_{i} = \frac{\sum_{i=c_{j-1}}^{c_{j}-1} x_{i}}{c_{j}-c_{j-1}}, \quad i\in[c_{j-1},c_{j}), \quad j=1,2,\cdots,k+2. $$
(9)

The algorithm proceeds iteratively adding the best-fit change-point τ (k) each time to the change-point set τ (k−1) in the previous iteration and finding the set of counter-fit change-points c (k) correspondingly, until the number of iterations reaches a predefined threshold K. This results in a set of best-fit change-points and a set of counter-fit change-points located in between one another.

The step-fitting algorithm stated above (refer to Supplementary Methods 3 in [17] for more information) has time complexity of 2n K (for definition of the tilde notation see [28]) where n is the dimension of the copy number estimates x and K is the total number of predefined iterations usually greater than 100. An important observation which helps to improve the computational efficiency is that, of all the counter-fit change-points in c (k) in the kth iteration, the most prominent change-point c j if added to τ (k) decreasing the cost function H the most is always the best-fit change-point τ (k+1) to be included in the next iteration, specifically,

$$ \tau^{(k+1)} = \underset{{c_{j}} \in \boldsymbol{c}^{(k)}}{\text{arg}\,\text{max}\,d_{j}} $$
(10)

and

$$ d_{j} = \sum_{i=\tau_{j-1}}^{\tau_{j}-1}{(x_{i}-\mu_{i})^{2}-(x_{i}-\psi_{i})^{2}}, $$
(11)

where i[τ j−1,τ j ), {τ j−1,τ j }τ (k), c j (τ j−1,τ j ) and j=1,2,,k+1; μ i and ψ i are computed as in Eqs. (2), (3) and (7) respectively. Here we keep track of the set of d j ’s for each corresponding c j as the difference set d (k). Furthermore, after the best-fit change-point τ (k+1) being included in τ (k+1), to update the corresponding counter-fit set c (k+1) and difference set d (k+1) we only need to exclude τ (k+1)=c t from c (k) and the corresponding d t from d (k) first, and then scan the region delimited by the two best-fit break-points directly next to τ (k+1) in τ (k+1) for two additional counter-fit break-points and the corresponding two differences. To be specific, given τ r =τ (k+1), {τ r−1,τ r ,τ r+1}τ (k+1), find c r [τ r−1,τ r ) and c r+1[τ r ,τ r+1) as computed in Eqs. (6) and (7), and compute d r and d r+1 as in Eq. (11); then update c (k+1)=(c (k)τ (k+1))c r c r+1 and d (k+1)=(d (k)d t )d r d r+1. In this way, the time complexity is reduced to O(nlogn).

To estimate the number of steps in the copy number profile and assess how much the data is step-like, the same model selection criterion is adopted as in Kerssemakers’ method, where the step-indicator S is introduced and defined as

$$ S=\frac{Q}{H}. $$
(12)

For reliable copy number profiles or hyper-segmented profiles containing many steps which are mostly CNAs, in early iterations change-points in τ (k) and c (k) both locate significant steps such that Q is close to H and therefore S is close to 1. S increases until it peaks in the optimal iteration when change-points in τ (k) cover all significant steps and all change-points in c (k) merely fit noise so that Q and H differ the most. After that S decreases as change-points in both τ (k) and c (k) begin to fit noise and Q and H become closer again. As a result, the number of iterations it takes for S to reach the peak, denoted by l, is an estimation of the number of steps in the copy number profile.

In Kerssemakers’ literature they also mentioned the peak value of S denoted by S peak approximates quadratic of signal to noise ratio given by \(S_{peak} \approx 1+\frac {\Delta ^{2}}{4\sigma ^{2}}\), where Δ is the mean of the absolute difference of μ i ’s around each best-fit change-point τ j weighted by \(\sqrt {\tau _{j+1}-\tau _{j-1}}\), where τ j τ (l), j=1,2,,l, and σ is the standard deviation of ε in Eq. (1). For unreliable copy number profiles of which the steps are drowned by wave artefacts of large magnitude, S peak is significantly smaller than that of reliable copy number profiles containing comparable amount of CNAs. For those profiles containing few significant CNAs or control samples which are diploid across genome, S is close to 1 throughout iterations and no apparent peak can be observed so that S peak is close to 1. Therefore S peak is an indicator of how much the copy number profile is step-like.

The adapted step-fitting algorithm is summarized in Algorithm 1.

Software

The CNARA software and a tutorial is available at https://github.com/baudisgroup/CNARA.

Abbreviations

ASP:

Autocorrelation scanning profile

AUC:

Area under the ROC curve

CBS:

Circular binary segmentation

CNA/CNV:

Copy number alterations/variations

CNARA:

Reliability assessment for copy number profiles

DLRS:

Derivative log ratio spread

FFPE:

Formalin-fixed paraffin embedded

medASP:

Median of ASP

RBF:

Radical basis function

ROC:

Receiver operating characteristic

SVM:

Support vector machine

References

  1. Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science. 1992; 258(5083):818–21.

    Article  CAS  PubMed  Google Scholar 

  2. du Manoir S, Speicher MR, Joos S, Schröck E, Popp S, Döhner H, Kovacs G, Robert-Nicoud M, Lichter P, Cremer T. Detection of complete and partial chromosome gains and losses by comparative genomic in situ hybridization. Hum Genet. 1993; 90(6):590–610.

    Article  CAS  PubMed  Google Scholar 

  3. van de Wiel MA, Brosens R, Eilers PH, Kumps C, Meijer GA, Menten B, Sistermans E, Speleman F, Timmerman ME, Ylstra B. Smoothing waves in array CGH tumor profiles. Bioinformatics. 2009; 25(9):1099–104.

    Article  CAS  PubMed  Google Scholar 

  4. Redon R, Carter NP. Comparative genomic hybridization: microarray design and data interpretation. DNA Microarrays for Biomedical Research: Methods and Protocols. 2009:37–49.

  5. Koren A, Handsaker RE, Kamitaki N, Karlić R, Ghosh S, Polak P, Eggan K, McCarroll SA. Genetic variation in human DNA replication timing. Cell. 2014; 159(5):1015–26.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. van Heesch S, Mokry M, Boskova V, Junker W, Mehon R, Toonen P, de Bruijn E, Shull JD, Aitman TJ, Cuppen E, et al. Systematic biases in DNA copy number originate from isolation procedures. Genome Biol. 2013; 14(4):33.

    Article  Google Scholar 

  7. Zhang L, Zhang L. Use of autocorrelation scanning in DNA copy number analysis. Bioinformatics. 2013; 29(21):2678–82.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Marioni JC, Thorne NP, Valsesia A, Fitzgerald T, Redon R, Fiegler H, Andrews TD, Stranger BE, Lynch AG, Dermitzakis ET, et al. Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 2007; 8(10):228.

    Article  Google Scholar 

  9. Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011; 12(1):480.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Largo C, Saéz B, Alvarez S, Suela J, Ferreira B, Blesa D, Prosper F, Calasanz MJ, Cigudosa JC. Multiple myeloma primary cells show a highly rearranged unbalanced genome with amplifications and homozygous deletions irrespective of the presence of immunoglobulin-related chromosome translocations. Haematologica. 2007; 92(6):795–802.

    Article  CAS  PubMed  Google Scholar 

  11. Olshen AB, Venkatraman E, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004; 5(4):557–72.

    Article  PubMed  Google Scholar 

  12. Venkatraman E, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007; 23(6):657–63.

    Article  CAS  PubMed  Google Scholar 

  13. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN. Hidden Markov models approach to the analysis of array CGH data. J Multivariate Anal. 2004; 90(1):132–53.

    Article  Google Scholar 

  14. Zhang NR. DNA Copy Number Profiling in Normal and Tumor Genomes. Frontiers in Computational and Systems Biology. London: Springer: 2010. p. 259–81.

  15. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005; 21(19):3763–70.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005; 21(22):4084–91.

    Article  CAS  PubMed  Google Scholar 

  17. Kerssemakers JW, Munteanu EL, Laan L, Noetzel TL, Janson ME, Dogterom M. Assembly dynamics of microtubules at molecular resolution. Nature. 2006; 442(7103):709–12.

    Article  CAS  PubMed  Google Scholar 

  18. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. 2015. R package version 1.6-7. http://CRAN.R-project.org/package=e1071.

  19. Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011; 2:27–12727. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

    Article  Google Scholar 

  20. Mc Sherry EA, Mc Goldrick A, Kay EW, Hopkins AM, Gallagher WM, Dervan PA. Formalin-fixed paraffin-embedded clinical tissues show spurious copy number changes in array-CGH profiles. Clin Genet. 2007; 72(5):441–7.

    Article  CAS  PubMed  Google Scholar 

  21. Cai H, Kumar N, Baudis M. arrayMap: a reference resource for genomic copy number imbalances in human malignancies. PLoS One. 2012; 7(5):36944.

    Article  Google Scholar 

  22. Cai H, Gupta S, Rath P, Ai N, Baudis M. arrayMap 2014: an updated cancer genome resource. Nucleic acids res. 2014; gku1123.

  23. Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, Boehm F, Caporaso NE, Cornelis MC, Edenberg HJ, et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol. 2010; 34(6):591–602.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Beroukhim R, Getz G, Nghiemphu L, Barretina J, Hsueh T, Linhart D, Vivanco I, Lee J, Huang J, Alexander S, Du J, Kau T, Thomas R, Shah K, Soto H, Perner S, Prensner J, Debiasi R, Demichelis F, Hatton C, Rubin M, Garraway L, Nelson S, Liau L, Mischel P, Cloughesy T, Meyerson M, Golub T, Lander E, Mellinghoff I, Sellers W. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci U S A. 2007; 104(50):20007–12.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics. 2013; 14(Suppl 11):1.

    Article  Google Scholar 

  26. Kuilman T, Velds A, Kemper K, Ranzani M, Bombardelli L, Hoogstraat M, Nevedomskaya E, Xu G, de Ruiter J, Lolkema MP, et al. CopywriteR: DNA copy number detection from off-target sequence data. Genome Biol. 2015; 16(1):49.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Edgar R, Domrachev M, AE L. Gene expression omnibus: Ncbi gene expression and hybridization array data repository. Nucleic Acids Res. 2002; 30(1):207–10.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Sedgewick R, Wayne K. Algorithms, 4th edition: Addison-Wesley; 2011. ISBN 978-0-321-57351-3, pp. I-XII, 1-955.

  29. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 2013; 41(D1):991–5.

    Article  Google Scholar 

  30. Northcott PA, Nakahara Y, Wu X, Feuk L, Ellison DW, Croul S, Mack S, Kongkham PN, Peacock J, Dubuc A, et al. Multiple recurrent genetic events converge on control of histone lysine methylation in medulloblastoma. Nat Genet. 2009; 41(4):465–72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Chen R, Nishimura MC, Bumbaca SM, Kharbanda S, Forrest WF, Kasman IM, Greve JM, Soriano RH, Gilmour LL, Rivers CS, et al. A hierarchy of self-renewing tumor-initiating cell types in glioblastoma. Cancer Cell. 2010; 17(4):362–75.

    Article  CAS  PubMed  Google Scholar 

  32. Liu L, Greger J, Shi H, Liu Y, Greshock J, Annan R, Halsey W, Sathe GM, Martin AM, Gilmer TM. Novel mechanism of lapatinib resistance in HER2-positive breast tumor cells: activation of AXL. Cancer Res. 2009; 69(17):6871–878.

    Article  CAS  PubMed  Google Scholar 

  33. Haverty PM, Fridlyand J, Li L, Getz G, Beroukhim R, Lohr S, Wu TD, Cavet G, Zhang Z, Chant J. High-resolution genomic and expression analyses of copy number alterations in breast tumors. Genes, Chromosomes Cancer. 2008; 47(6):530–42.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

The authors want to thank Henrik Bengtsson, Reinhard Furrer and Mark Robinson for helpful discussions. This study was supported by a grant from the Swiss Federation through the Swiss Contribution to the enlarged European Union.

Availability of supporting data

The GEO accession numbers of the arrays used in the study are provided in the Additional files and can be used to access GEO datasets as well as the corresponding processed data through the arrayMap resource. The CNARA software and a tutorial is available at https://github.com/baudisgroup/CNARA.

Authors’ contributions

NA conceived the study and developed computational methods; HC collected and labeled the data; CS provided additional test samples; NA and MB wrote the paper; All authors contributed to the biological insight, read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Ethics approval and consent to participate

The human array datasets utilized in this study represent publicly available datasets that were derived from the Gene Expression Onmibus (GEO [27]) resource at the National Center for Biotechnology Information (NCBI), with the possibly required informed consent lying with the orginal data submitters. For additional samples used for evaluation purposes (data not shown), informed consent was acquired from the patients with approval of the study design through the ethics committee of the University of Medicine and Pharmacology Timisoara.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Ni Ai or Michael Baudis.

Additional files

Additional file 1

Supplementary Methods. This document describes the computer simulation procedure for the 3 groups of copy number profiles in Figs. 1 and 2, the preprocessing procedure for the 1522 copy number profile dataset and the supporting Figures S1-S5, and the procedure of building custom training set. (PDF 933 kb)

Additional file 2

Supplementary Table S1. A complete list of the 1522 copy number profiles, including GEO accession number and reliability label. (TSV 58 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ai, N., Cai, H., Solovan, C. et al. CNARA: reliability assessment for genomic copy number profiles. BMC Genomics 17, 799 (2016). https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-016-3074-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-016-3074-7

Keywords