 Methodology article
 Open Access
 Published:
Statistical measures of transcriptional diversity capture genomic heterogeneity of cancer
BMC Genomics volume 15, Article number: 876 (2014)
Abstract
Background
Molecular heterogeneity of tumors suggests the presence of multiple different subclones that may limit response to targeted therapies and contribute to acquisition of drug resistance, but its quantification has remained challenging.
Results
We performed simulations to evaluate statistical measures that best capture the molecular diversity within a group of tumors for either continuous (gene expression) or discrete (mutations, copy number alterations) molecular data. Dispersion based metrics in the principal component space best captured the underlying heterogeneity. To demonstrate utility of these measures, we characterized the diversity in transcriptional and genomic profiles of different breast tumor subtypes, and showed that basallike or triplenegative breast cancers (TNBC) are significantly more heterogeneous molecularly than other subtypes. Our analysis also suggests that transcriptional diversity is a global characteristic of the tumors observed across the majority of molecular pathways. Among basallike tumors, those that were resistant to multiagent chemotherapy showed greater transcriptional diversity compared to chemotherapysensitive tumors, suggesting that potentially multiple mechanisms may be contributing to chemotherapy resistance.
Conclusions
We proposed and validated measures of transcriptional and genomic diversity that can quantify the molecular diversity of tumors. We applied the new measures to genomic data from breast tumors and demonstrated that basallike breast cancers are significantly more diverse than other breast cancers. The observation that chemoresistant tumors are significantly more diverse molecularly than chemosensitive tumors implies that multiple resistance mechanisms may be active, thus limiting the sensitivity and accuracy of predictive markers of chemotherapy response.
Background
Pathologists have long recognized that tumors are highly heterogeneous consisting of neoplastic cells with distinct morphological and molecular features that are associated with clinically distinct phenotypes [1]. In the past few years, largescale nextgeneration sequencing provided evidence of extensive genomic diversity in histopathologically similar cancers, with each tumor effectively harboring a unique repertoire of genomic abnormalities [2–5]. Epigenetic changes that occur throughout tumorigenesis and the environmental cues conveyed through a tumor’s microenvironment contribute to the observed phenotypic heterogeneity, but genetic alterations are thought to drive the majority of tumor phenotypic variation [6].
The early large scale transcriptional profiling efforts provided the initial impetus for characterizing the phenotypic heterogeneity of cancer that allowed to discover transcriptionallyuniform cancer subtypes that were associated with distinct clinical outcomes [7, 8]. These subtypes were subsequently shown by next generation sequencing to be molecularly and genomically quite heterogeneous [9]. Although the causal drivers of intertumor heterogeneity are not clearly understood [10], genomic instability appears to be a unifying theme in many genetically diverse malignancies, acting to deregulate the control of DNA replication thus promoting proliferation and destabilizing the genome [11]. The increased clonal diversity that results from genetic instability is associated with a higher risk of progression [12], worst survival [13] and resistance to chemotherapy [14]. A recent study reported that the pretreatment genomic diversity among cells from each of the four breast cancer subtypes is preserved after chemotherapy, but tumors with lower genetic diversity before treatment are more likely to achieve complete pathologic response (pCR), irrespective of phenotypic subtype [15].
Breast tumors of the basallike or triple negative breast cancer (TNBC) subtype exhibit a broad range of complex structural DNA alterations [4, 16] that render them particularly challenging to treat effectively as a group with standard cytotoxic chemotherapy or with targeted therapies. Because a tumor’s transcriptional profile is an amalgam of the functional genetic and epigenetic variations that it harbors, higher intratumor genetic heterogeneity would be reflected in greater transcriptional diversity or dissimilarity between tumors of the same phenotype. Such transcriptional diversity could be the reason why developing geneexpression based predictors of chemotherapy sensitivity for basallike cancer has been unexpectedly difficult [17, 18]. The term diversity is used here to denote the dissimilarity in gene expression or mutational profiles between individuals, which is a characteristic of a group of tumors, while the term heterogeneity here refers to the nonuniformity in genetic and cellular composition of an individual tumor.
In this paper we evaluated through extensive simulations a number of statistical metrics based on the pairwise similarity of transcriptional and genomic profiles and selected mean dispersion distance as the best measure of transcriptional diversity. Using this measure, we observed that among basallike breast cancers, those that were chemotherapy resistant were significantly more diverse than chemotherapy sensitive cancers. These results suggest that multiple resistance mechanisms may be active among cancers of the same subtype and such heterogeneity would need to be accounted for when developing predictive models of response to chemotherapy.
Results
Evaluation of transcriptional diversity metrics in simulated datasets
We used two different strategies to assess transcriptional diversity among a group of profiles. In the first strategy, we considered all possible pairwise distances among the profiles the group using Pearson or cosine correlation for pairwise similarity of transcriptional profiles. In the second strategy, we used a dispersion distance metric to summarize the diversity as distance from the population centroid [19] (details in Methods). We considered either the mean or the median of the distribution of pairwise or dispersion distances as a summary of the central tendency in the group. To evaluate the performance of the different metrics, we generated artificial gene expression datasets representing scenarios that differed in the number of latent subgroups present and in the within (σ_{ g }) and between (σ_{ p }) tumor sample variance using the R package Umpire [20]. In each scenario, the log expression level for each gene was generated by a hierarchical model, in which σ_{ g } controls the withintumor variance and σ_{ p } controls the variance across patients in the cohort [21].
Figure 1A shows eight simulated scenarios (50 genes per sample, 40 samples per cohort) corresponding to 1, 2, 4, and 40 evensized latent intracohort subgroups, and two levels of relative between and withintumor variation (σ_{ p }/σ_{ g } = 0.5/1.5 and 0.5/0.5). A smaller σ_{ p }/σ_{ g } ratio indicates lower relative variation across tumors and therefore higher similarity in their expression profiles or lower overall diversity, whereas a larger σ_{ p }/σ_{ g } ratio or a greater number of latent subgroups implies greater within cohort diversity. The mean of all withingroup pairwise distances was calculated for each scenario repeatedly over 500 independent simulations from the same distributions. All metrics consistently increased with increasing number of latent subgroups and σ_{ p }/σ_{ g } ratio, except for the cosine metric that exhibited lower sensitivity when the relative between and withintumor variation was high (Figure 1B). The mean dispersion distance tracked almost linearly the increase in heterogeneity due to increasing proportion in a two latent subgroup scenario (Figure 1C). The median pairwise distance was generally less sensitive than the mean (Additional file 1: Figure S1), and the results were robust over a broad range of genes used to compute the distances (Additional file 1: Figure S2). Based on these simulations, we selected the mean withincohort dispersion distance as the best metric to describe intracohort transcriptional diversity.
Transcriptional and genomic diversity within TCGA breast cancer subtypes
We assessed the transcriptional heterogeneity within the four breast cancer subtypes as assigned by the PAM50 classifier using gene expression data from The Cancer Genome Atlas (TCGA) breast cancer project (N = 547; Table 1) [4]. Gene expression data were used as provided and probe sets with low expression or low variation across the dataset were filtered as described in the Methods. Mean dispersion distance was computed in groups of 100 cases from each subtype, repeatedly sampled with replacement over 500 bootstrap iterations. Overall, tumors of the luminal A subtype were significantly less heterogeneous than other subtypes (permutation test P < 10^{5}), while basallike tumors had the highest transcriptionally diversity (permutation test P < 10^{5}) (Figure 2A). Estimation of withinpatient and betweenpatient variance (see Methods) showed that luminal A tumors had the highest withinpatient gene expression variance but the lowest betweenpatient variance (smaller σ_{ p }/σ_{ g } ratio), whereas the reverse was true for basallike tumors that show a relatively narrow range of expression within each tumor but more diverse expression across tumors (Figure 2B). Comparing the luminal subtypes, luminal B tumors appear transcriptionally significantly more diverse (permutation test P < 10^{5}), exhibiting a higher estimated σ_{ p }/σ_{ g } ratio, similar that of HER2enriched tumors (Figure 2A, B).
To examine whether the higher transcriptional diversity observed in basallike tumors is associated with greater genetic heterogeneity, we evaluated the diversity in somatic mutation profiles and DNA copy number variation as categorical data using the mean pairwise Hamming distance [22] to assess intracohort diversity of genomic profiles (see Methods). The mean pairwise Hamming distance was able to track the population diversity in the different simulation scenarios (Additional file 1: Figure S3). As observed at the transcriptome level, basallike cancers exhibited the highest diversity in the distribution of copy number aberrations and the second highest mutational heterogeneity, whereas luminal A tumors were genomically the least diverse (Figures 2C, 2D). Our results are consistent with the assessment in the original TCGA publication that reported the highest overall rate of genome alteration in basallike tumors and the lowest in luminal A tumors [4]. Moreover, the almost indistinguishable patterns of molecular diversity that we observed in the genomic and transcriptomic profiles of breast tumors suggest that a measure of the transcriptional diversity could be a effective surrogate measure of the underlying genomic heterogeneity.
Validation of transcriptional diversity patterns within breast cancer subtypes using microarray datasets
We evaluated the generalizability of the results obtained on the TCGA data using publicly available gene expression datasets (Table 2) to assess the transcriptional diversity of basallike (n = 403), HER2enriched (n = 171), luminal A (n = 515) and luminal B (n = 210) breast cancers. We calculated the mean dispersion distance of transcriptional profiles within each subtype over 500 bootstrap iterations in groups of 100 cases per subtype. The pattern of transcriptional diversity was very similar to that observed in the TCGA dataset (Figure 2E), with basallike cancers being the most diverse and luminal A cancers being the most uniform (permutation test P < 10^{5}), except that HER2enriched cancers in this cohort appeared less diverse.
Basallike breast cancers that respond to chemotherapy are transcriptionally homogeneous
Among the 403 basallike cases in the Affymetrix dataset, 255 cases received preoperative combination chemotherapy (paclitaxel, 5fluoracil, cyclophosphamide and doxorubicin), which allowed us to compare the transcriptional diversity of exceptionally chemotherapy sensitive cancers that achieved pathological complete response (pCR, n = 96) to those who had partial or no response as evidenced by viable amounts or residual cancer after preoperative chemotherapy (RD, n = 159)(Table 2). We compared the transcriptional diversity of cancers from the two response groups by calculating the mean dispersion distance in sets of 100 cases per group over 500 bootstrap iterations. Chemotherapy resistant basallike cancers showed significantly greater transcriptional diversity compared to chemotherapy sensitive cancers (permutation test P < 10^{5}; Figure 2 F). These resistant cancers had narrower within tumor expression range but a greater variation between cancers (Figure 2G), and also a greater number of latent transcriptionally homogeneous subtypes (Figure 2H). Our results provide evidence that pathological complete response is associated with significantly lower pretreatment transcriptional diversity, corroborating previous reports of a significantly lower genetic divergence in tumors achieving pCR [15].
Transcriptional diversity extends at the level of individual pathways
The varying degree of transcriptional diversity within different breast cancer subtypes raises the question whether the observed heterogeneity is restricted to only genes from a few pathways or whether it is more of a global transcriptional phenomenon. We defined 50 pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [23] that represent all biological processes but have minimal overlap in gene membership. The number of genes in these pathways ranged from 11 (folate biosynthesis) to 389 (olfactory transduction) (see Methods).We applied the same bootstrapping procedure to calculate the mean dispersion distance within breast cancer subtypes, but now using only genes from each of the 50 pathways. Most pathways showed the same order of diversity across the four subtypes as that observed globally, suggesting that transcriptional diversity is a global characteristic of the tumors and not primarily driven by heterogeneity in a few biological processes (Figure 3A, C). Certain pathways showed remarkably low diversity in all subtypes indicating a tight transcriptional coordination of genes involved in biological processes that are vital for all cells (e.g. ribosome metabolism, protein export, glyoxylate/dicarboxylate metabolism; Figure 3A, B). Only a few pathways were distinctively diverse within each subtype (Figure 3C), and these were the same in all breast cancer subtypes (linoleic acid metabolism, renin/angiotensin system, and neuroactive ligand/receptor interaction) (Figure 3A, B).We also compared the pathwaylevel transcriptional diversity of chemotherapy sensitive (pCR) or resistant (RD) basallike cancers. Almost all pathways (48/50) were transcriptionally more diverse in the chemotherapy resistant group (Figure 3D), suggesting that the greater transcriptional diversity in these tumors is global and not restricted to certain pathways (slope of regression line = 0.917, 95% confidence interval 0.877 to 0.956). Pathways with greater diversity in RD cancers relative to pCR cancers included basalcellcarcinoma, which includes genes from the Hedgehog, Wnt, TGF and p53 signaling pathways, folate biosynthesis, and dorsoventral axis formation pathway involving genes from several key signaling pathways including the MAPK and NOTCH (Figure 3D).
Discussion
In this study, we showed that the mean dispersion and the mean pairwise Hamming distance provide quantitative metrics that capture the transcriptional and genomic diversity of breast cancer subtypes. We demonstrated that each subtype is characterized by a different degree of transcriptional and genomic diversity and provided statistical evidence that basallike breast cancers are the most and luminal A cancers are the least heterogeneous. While this has been suggested in the literature [4, 16, 24] it has never been previously demonstrated in statistical terms. The difference in transcriptional diversity between subtypes appears to concern the entire transcriptome with the exception of a few biological processes that are critical for cell survival, suggesting that basallike and luminal A cancers differ in fundamental ways in how tightly gene coexpression is regulated within cells. These results are also consistent with greater intratumor cellular heterogeneity in basallike cancers due to coexistence of either genomically different clonal populations or genomically similar cells exhibiting greater diversity in cellular states. But since we also observed identical trends in the relative heterogeneity of DNA copy number alterations and somatic mutations between the subtypes, the results strongly suggest that genomic alterations leave a major imprint on transcriptional profiles [25] and likely drive the higher transcriptomic diversity of basallike cancers. Transcriptional diversity can thus be considered as a surrogate measure of the underlying genomic heterogeneity of cancers. To further elaborate, we would expect repeat biopsies from a tumor with high intratumor heterogeneity to show greater transcriptional diversity, as biopsies may consist of different mixtures of clonal cells. High intratumor clonal variation would also imply higher variation between two such tumors and therefore a greater transcriptional diversity.
We also demonstrated that two clinically very distinct subtypes of basallike cancers, chemotherapy resistant and highly chemotherapy sensitive cancers, which cannot not be reliably separated using gene expression based multivariate prediction models [17, 18], show significant differences in transcriptional diversity (permutation test P < 10^{5}). Chemotherapy resistant cancers had much greater transcriptional diversity across most biological processes. Greater diversity in regulatory and metabolic pathways could confer greater resilience against cytotoxic insults [26]. Pathways that are expressed more diversely in chemotherapy resistant basallike cancers may provide biological insights into the mechanisms of chemotherapy resistance.
Finally, our results generate new insights as to why it may be difficult to predict chemotherapy response in basallike cancers. Due to the greater transcriptional diversity in chemotherapy resistant tumors, a single gene expression signature of resistance would fail to represent adequately the spectrum of resistance mechanisms present and thus limit the sensitivity and accuracy of such predictors. This has been demonstrated to be the case recently through insightful simulations [27]. Using genomic DNA copy number alterations to substratify basallike cancers into genomically more uniform groups could perhaps improve predictability, but this strategy will require considerably larger cohorts. Alternatively, using a single metric that captures the withinpatient molecular heterogeneity may prove a more effective strategy to predict general chemotherapy sensitivity of a given cancer.
Conclusion
We presented and systematically evaluated a quantitative measure to capture transcriptional and genomic diversity among cancers. Results from different types of genomic data consistently demonstrated that basallike cancers are the most heterogeneous while luminal A the least. Such diversity appears to be a global characteristic reflected in most biological process. Finally, we also showed that greater transcriptional diversity is also associated with basallike cancers that are resistant to chemotherapy, suggesting that molecular heterogeneity is a manifestation, potentially a causal one, of treatment resistance. This also explains the difficulty in developing generalizable gene signatures that are predictive of chemotherapy response in the highly heterogeneous triplenegative breast cancers.
Methods
Breast cancer datasets and processing
The Cancer Genome Atlas (TCGA) Dataset
Datasets were downloaded from the TCGA breast carcinoma web site (https://tcgadata.nci.nih.gov/docs/publications/brca_2012/) [4] (Table 1). Gene expression data by Agilent 244 K arrays were available for 547 cases. DNA copy number alteration data by Affymetrix 6.0 SNP arrays were available for 466 cases. Somatic mutation data by whole exome sequencing were available on 463 cases. The TCGA PAM50 subtype classification was downloaded also from the same site. A sample list is provided in Additional file 2: (TCGA sample ID.xlsx).
Affymetrix U133A dataset
A total of 1460 breast cancer profiles from eight breast cancer datasets that were uniformly profiled on the Affymetrix U133A microarray were obtained form the Gene Expression Omnibus (http://0www.ncbi.nlm.nih.gov.brum.beds.ac.uk/geo/). Because of partial overlap in several of these datasets, the selected unique samples from each GEO dataset are provided in Additional file 3: (Affy_sample_info1460.txt). A summary of the combined data set is given in Table 2.
Data preprocessing
The TCGA gene expression data were used as provided without any additional processing. The segmented log2 DNA copy number values were trichotomized as deletion (score < 0.3), amplification (score > 0.3) or no effect. We aggregated the segmentlevel alterations at the gene level using RefSeq gene models. A gene was labeled as deleted or amplified if the entire transcribed region was in a deleted or amplified segment. Of the 45,918 RefSeq genes, 22,271 were amplified or deleted in at least one of the patients in the cohort and were used in the calculation of the Hamming distance. The somatic mutation data were provided at the gene level in the TCGA dataset (any mutation vs no mutations) and did not require any further processing.
The Affymetrix U133A data files were compiled in a single dataset, normalized by MAS5, scaled to a target intensity of 600 and log2 transformed. Principal component analysis (PCA) was applied to detect potential batch effects across the datasets. We used the empirical Bayes framework procedure ComBat [28] as implemented in the R package sva [29] to remove batch effects due to different datasets. We used PCA after batch effect removal to verify lack of batch effects after correction (Additional file 1: Figure S4). A previously validated PAM50 algorithm for Affymetrixbased expression data was used to generate breast cancer subtype classification for each cancer in the dataset [30]. Table 2 provides a summary of all the datasets used in this study and the subtype distribution within each dataset. The 161 normallike cases were excluded from the analysis. Prior to analysis, the combined, batchadjusted dataset was filtered to remove probe sets in the lowest expression quartile and those in the lowest median absolute deviation (MAD) decile. The remaining 14,505 probe sets were used in subsequent analyses.
Collection and processing of KEGG pathways
The 186 pathways used in this study were collected from the Molecular Signatures Database [31] (MSigDB; http://www.broadinstitute.org/gsea/msigdb/index.jsp) and correspond to gene sets derived from the KEGG pathway database [23] (http://www.genome.jp/kegg/pathway.html). We processed the original list of 186 pathways to produce a reduced set of pathways that had minimal overlap in gene membership. We broadly followed the steps described previously [32]. First, we calculated a matrix of pairwise pathway similarities derived from the hypergeometric distribution as follows. If N, N_{ A } and N_{ B } represent the genes in all pathways, in pathway A and in pathway B respectively, and these two pathways have n genes in common, then the degree of enrichment of pathway B in genes also included in pathway A is given by the hypergeometric distribution h(N, N_{ A }, N_{ B }) [33]. This can be visualized as a 2x2 contingency table crosstabulating the number of genes in the two pathways (pathway A vs not pathway A by pathway B vs not pathway B). The objective is to test whether genes in pathway B are overrepresented among genes in pathway A compared to genes not in pathway A. Thus a rejection of the null hypothesis of independence between the two dimensions of the 2x2 table is evidence for significant overrepresentation or overlap between the two pathways. If we let X denote the number of genes in common, then the probability of two unrelated pathways (under the null hypothesis) having at least n genes in common is given by
Pathways for which the null hypothesis was rejected at the 0.05 level, i.e P(X ≥ n) < 0.05 were labeled as similar (value of 1) and the remaining as dissimilar (value of 0) in the pathway similarity matrix. Pathways with the highest degree of overlap were identified as those with the highest row or column sum in the similarity matrix and were removed recursively by removing both the corresponding column and the row from the pathway similarity matrix. The resulting 50 pathways consisted of almost disjoint genes sets and were used in subsequent analyses.
Distance measures
Pearson distance
If the pairwise Pearson’s correlation coefficient between two expression vectors (profiles) is denoted by ρ, the Pearson’s pairwise distance was calculated as
All the unique pairwise distances within a set of profiles are computed and the distribution of pairwise distances is summarized by its mean or its median.
Cosine distance
Given expression vectors X_{ i }, X_{ j }, their cosine distance is given by
where ∥X∥ represents the L2norm of vector X. The common definition of the cosine similarity measure is the cosine of the angle between two vectors. Here, we define the cosine dissimilarity or distance measure as the normalized angle between the two vectors. The above two dissimilarity measures are related, since the Pearson’s correlation is essentially equivalent to a centered cosine similarity measure:
As above, all the unique pairwise distances within a set of profiles are computed and the distribution of pairwise distances is summarized by its mean or its median.
Dispersion distance
For the calculation of dispersion distances, the distance matrix A of all pairwise Pearson distances is centered by subtracting the row and column means and then adding the overall mean to each element of A. The spectral decomposition of the resulting centered matrix defines the principal coordinates. Vectors are mapped to the full principal coordinate space and their Euclidian distance from the overall centroid in this space defines the distance or dissimilarity from each vector to its centroid [34]. The overall dispersion distance for a set of vectors is then calculated as the mean or median dispersion over all vectors. Computations were performed using the R language package vegan [35].
Hamming distance
For categorical vectors we used the pairwise Hamming distance for character strings (not ordinal) to assess dissimilarity [22]. Specifically, for categorical vectors X_{ i }, X_{ j } of length L,
where δ(a, b) = 1 when a is the same as b and zero otherwise. The overall Hamming dispersion distance for the set of categorical profiles was computed as the mean or median pairwise Hamming distance.
Permutation test for assessing significance in difference of mean distances
We used a permutation procedure to compare the bootstrap estimates of the mean group distances between two groups of samples. Two vectors, each containing 500 bootstrap estimates of the mean distance within each group, were concatenated to form a 1000 element vector. This vector was resampled without replacement 10^{5} times and for each permutation the average of the first 500 elements was compared to the average of the last 500 elements. This difference was then compared to the original difference in the mean distances between the two groups. The pvalue, which represents the probability of observing by chance a difference between the bootstrapped mean distances at least as high as the observed, was then computed as (1 + #(D > d))/(1 + R), where D is the difference in means from the permuted vectors, d is the difference in means from the original vector, and R = 10^{5} is the number of random permutations.
Generation of simulated datasets
We used the Umpire R language package to simulate realistic gene expression datasets with controlled within and between patient variation in gene expression [20]. In each simulated scenario we considered 50 genes per case and 40 cases. We assumed that all the genes are expressed in all patients and that the expression of each gene across patients follows a lognormal distribution with parameters μ_{ g } and σ_{ P }, i.e. log(X_{ i }) ∼ Normal(μ_{ g }, σ_{ P }). The mean log expression μ_{ g } is assumed to have a normal distribution across genes with mean μ_{0} and standard deviation σ_{ g }, i.e. μ_{ g } ∼ Normal(μ_{0}, σ_{ g }). It was further assumed that the betweensample gene expression variance σ_{ P } follows an inverse Gamma distribution with hyperparameters α = 15 and β = 7, which correspond to a mean σ_{ P } equal to 0.5. The mean μ_{0} was set equal to 8 for all simulations. In the first scenario shown in Figure 1A, the within sample standard deviation σ_{ g } was set to 1.5, whereas in the second scenario it was set to 0.5. Samples from different latent groups were generated by resampling with the same set of hyperparameters. Additional file 1: Figure S5 shows the effect of the various parameters of the hierarchical model on the estimated diversity of the resulting set of gene expression profiles. Binarized profiles were simulated in the same way and then thresholded using the function Z > 1.5, where Z = (X – M)/SD, X is the continuous gene expression value and M, SD are the genewide and samplewide (overall) mean and standard deviation of expression.
Estimation of gene expression statistics from profiles
If X_{ ij } is the expression of gene i in sample j then m_{ i } = E(X_{ ij }) is the centroid of gene expression levels, which is also expressed more compactly as m = E_{ P }(X). The mean of the centroid vector is the overall mean expression or μ_{0} = E_{ g }(m) = E_{ g }(E_{ P }(X)). The within sample variance is estimated from the variance of the centroid vector as ${\sigma}_{g}^{2}={V}_{g}\left(\mathbf{m}\right)={V}_{g}\left({E}_{P}\left(\mathbf{X}\right)\right)$. Finally, the between sample variance is estimated from the mean of the variance vector s^{2} = V_{ P }(X), or ${\sigma}_{P}^{2}={E}_{g}\left({\mathbf{s}}^{2}\right)={E}_{g}\left({V}_{P}\left(\mathbf{X}\right)\right)$.
Additional file 1: Figure S6 shows the parameters estimated from the simulated scenarios shown in Figure 1A. The between sample standard deviation (SD) was set to 0.5 in all these scenarios. As shown in Additional file 1: Figure S6A, the between sample SD estimate is nearly unbiased when there is one latent group, with the variance of the estimate being very small in the low betweentowithin SD scenario (S) but considerably larger in the high betweentowithin SD scenario (U). Furthermore, an increasing number of latent groups results in greater between sample SD, reflecting the greater population diversity, but the effect is more moderate in the high betweentowithin SD scenario (U). Similarly, the estimate of the mean within sample SD is nearly unbiased (1.5 for S and 0.5 for U) when there is one latent group, but an increasing number of latent groups reduces the estimate of the within sample SD. Therefore, a greater number of latent groups in a cohort will be bias upwards the estimated between sample variance and bias downwards the estimated within sample variance.
Scripts/Code
All the R scripts required to generate the datasets and run the analyses presented in this manuscript are provided in Additional file 4.
Abbreviations
 TNBC:

Triplenegative breast cancer
 TCGA:

The Cancer Genome Atlas
 PAM50:

Minimal gene set for classifying intrinsic subtypes for breast cancer
 pCR:

Pathologic complete response
 RD:

Residual disease
 PCA:

Principle component analysis
 MAD:

Mean absolute deviation
 KEGG:

Kyoto Encyclopedia of Genes and Genomes.
References
 1.
Elston CW, Ellis IO: Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with longterm followup. Histopathol. 1991, 19: 403410.
 2.
Cancer Genome Atlas Research Network: Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012, 489: 519525.
 3.
Chin L, Meyerson M, Aldape K, Bigner D, Mikkelsen T, VandenBerg S, Kahn A, Penny R, Ferguson ML, Gerhard DS, Getz G, Brennan C, Taylor BS, Winckler W, Park P, Ladanyi M, Hoadley KA, Verhaak RGW, Hayes DN, Spellman PT, Absher D, Weir BA, Ding L, Wheeler D, Lawrence MS, Cibulskis K, Mardis E, Zhang JH, Wilson RK, Donehower L, et al: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008, 455: 10611068.
 4.
Koboldt DC, Fulton RS, McLellan MD, Schmidt H, KalickiVeizer J, McMichael JF, Fulton LL, Dooling DJ, Ding L, Mardis ER, Wilson RK, Ally A, Balasundaram M, Butterfield YSN, Carlsen R, Carter C, Chu A, Chuah E, Chun HJE, Coope RJN, Dhalla N, Guin R, Hirst C, Hirst M, Holt RA, Lee D, Li HYI, Mayo M, Moore RA, Mungall AJ, et al: Comprehensive molecular portraits of human breast tumours. Nature. 2012, 490: 6170.
 5.
Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, Stern HM, Yue P, Haverty PM, Bourgon R, Zheng J, Moorhead M, Chaudhuri S, Tomsho LP, Peters BA, Pujara K, Cordes S, Davis DP, Carlton VE, Yuan W, Li L, Wang W, Eigenbrot C, Kaminker JS, Eberhard DA, Waring P, Schuster SC, Modrusan Z, Zhang Z, Stokoe D, de Sauvage FJ, et al: Diverse somatic mutation patterns and pathway alterations in human cancers. Nature. 2010, 466: 869873.
 6.
De Sousa EMF, Vermeulen L, Fessler E, Medema JP: Cancer heterogeneity–a multifaceted view. EMBO Rep. 2013, 14: 686695.
 7.
Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, BorresenDale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature. 2000, 406: 747752.
 8.
Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Lonning PE, BorresenDale AL: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A. 2001, 98: 1086910874.
 9.
Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, Graf S, Ha G, Haffari G, Bashashati A, Russell R, McKinney S, Group M, Langerod A, Green A, Provenzano E, Wishart G, Pinder S, Watson P, Markowetz F, Murphy L, Ellis I, Purushotham A, BorresenDale AL, Brenton JD, Tavare S, et al: The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012, 486: 346352.
 10.
Polyak K: Breast cancer: origins and evolution. J Clin Invest. 2007, 117: 31553163.
 11.
Burrell RA, McGranahan N, Bartek J, Swanton C: The causes and consequences of genetic heterogeneity in cancer evolution. Nature. 2013, 501: 338345.
 12.
Maley CC, Galipeau PC, Finley JC, Wongsurawat VJ, Li X, Sanchez CA, Paulson TG, Blount PL, Risques RA, Rabinovitch PS, Reid BJ: Genetic clonal diversity predicts progression to esophageal adenocarcinoma. Nat Genet. 2006, 38: 468473.
 13.
Mroz EA, Tward AD, Pickering CR, Myers JN, Ferris RL, Rocco JW: High intratumor genetic heterogeneity is related to worse outcome in patients with head and neck squamous cell carcinoma. Cancer. 2013, 119: 30343042.
 14.
Turner NC, ReisFilho JS: Genetic heterogeneity and cancer drug resistance. Lancet Oncol. 2012, 13: e178185.
 15.
Almendro V, Cheng YK, Randles A, Itzkovitz S, Marusyk A, Ametller E, GonzalezFarre X, Munoz M, Russnes HG, Helland A, Rye IH, BorresenDale AL, Maruyama R, van Oudenaarden A, Dowsett M, Jones RL, ReisFilho J, Gascon P, Gonen M, Michor F, Polyak K: Inference of tumor evolution during chemotherapy by computational modeling and in situ analysis of genetic and phenotypic cellular diversity. Cell Rep. 2014, 6: 514527.
 16.
Russnes HG, Navin N, Hicks J, BorresenDale AL: Insight into the heterogeneity of breast cancer through nextgeneration sequencing. J Clin Invest. 2011, 121: 38103818.
 17.
Hatzis C, Pusztai L, Valero V, Booser DJ, Esserman L, Lluch A, Vidaurre T, Holmes F, Souchon E, Wang HK, Martin M, Cotrina J, Gomez H, Hubbard R, Chacon JI, FerrerLozano J, Dyer R, Buxton M, Gong Y, Wu Y, Ibrahim N, Andreopoulou E, Ueno NT, Hunt K, Yang W, Nazario A, DeMichele A, O'Shaughnessy J, Hortobagyi GN, Symmans WF: A Genomic Predictor of Response and Survival Following TaxaneAnthracycline Chemotherapy for Invasive Breast Cancer. JamaJournal of the American Medical Association. 2011, 305: 18731881.
 18.
Popovici V, Chen W, Gallas BG, Hatzis C, Shi W, Samuelson FW, Nikolsky Y, Tsyganova M, Ishkin A, Nikolskaya T, Hess KR, Valero V, Booser D, Delorenzi M, Hortobagyi GN, Shi L, Symmans WF, Pusztai L: Effect of trainingsample size and classification difficulty on the accuracy of genomic predictors. Breast Cancer Res. 2010, 12: R5
 19.
Anderson MJ, Ellingsen KE, McArdle BH: Multivariate dispersion as a measure of beta diversity. Ecol Lett. 2006, 9: 683693.
 20.
Zhang J, Roebuck PL, Coombes KR: Simulating gene expression data to estimate sample size for class and biomarker discovery. Int J Advances Life Sci. 2012, 4: 4451.
 21.
Zhang J, Coombes KR: Sources of variation in false discovery rate estimation include sample size, correlation, and inherent differences between groups. BMC Bioinformatics. 2012, 13 (Suppl 13): S1
 22.
Hamming RW: Error detecting and error correcting codes. Bell Sys Tech J. 1950, 29: 147160.
 23.
Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M: Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 2014, 42: D199205.
 24.
Almendro V, Kim H, Cheng YK, Gonen M, Itzkovitz S, Argani P, van Oudenaarden A, Sukumar S, Michor F, Polyak K: Genetic and phenotypic diversity in breast tumor metastases. Cancer Res. 2014, 75: 13381348.
 25.
Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, BorresenDale AL, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A. 2002, 99: 1296312968.
 26.
Saunders NA, Simpson F, Thompson EW, Hill MM, EndoMunoz L, Leggatt G, Minchin RF, Guminski A: Role of intratumoural heterogeneity in cancer drug resistance: molecular and clinical perspectives. EMBO Mol Med. 2012, 4: 675684.
 27.
Ng CK, Weigelt B, A’Hern R, Bidard FC, Lemetre C, Swanton C, Shen R, ReisFilho JS: Predictive Performance of Microarray Gene Signatures: Impact of Tumor Heterogeneity and Multiple Mechanisms of Drug Resistance. Cancer Res. 2014, 74: 29462961.
 28.
Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007, 8: 118127.
 29.
Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD: The sva package for removing batch effects and other unwanted variation in highthroughput experiments. Bioinformatics. 2012, 28: 882883.
 30.
Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, Quackenbush JF, Stijleman IJ, Palazzo J, Marron JS, Nobel AB, Mardis E, Nielsen TO, Ellis MJ, Perou CM, Bernard PS: Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009, 27: 11601167.
 31.
Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP: Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011, 27: 17391740.
 32.
Shi W, Balazs B, Gyorffy B, Jiang T, Symmans WF, Hatzis C, Pusztai L: Combined analysis of gene expression, DNA copy number, and mutation profiling data to display biological process anomalies in individual breast cancers. Breast Cancer Res Treat. 2014, 144: 561568.
 33.
Rivals I, Personnaz L, Taing L, Potier MC: Enrichment or depletion of a GO category within a class of genes: which test?. Bioinformatics. 2007, 23: 401407.
 34.
Anderson MJ: Distancebased tests for homogeneity of multivariate dispersions. Biometrics. 2006, 62: 245253.
 35.
Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Wagner H: vegan: Community ecology package. R package version 2010. 2013
Acknowledgments
This work was supported by Breast Cancer Research Foundation grants to LP and a Yale Lion Heart Foundation pilot grant to CH.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
TJ contributed in study design, carried out all the simulations and statistical analyses and drafted the manuscript. WS helped with compiling the microarray datasets, evaluated the KEGG pathways and reviewed the manuscript. YK provided input on distance measures and helped draft the manuscript. LP proposed the concept of heterogeneity as a measure of response, participated in its designed and coordination and helped draft the manuscript. RN, SNO, VBW helped draft the manuscript. CH conceived the study design and simulation strategy, coordinated the work and drafted the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
12864_2014_6541_MOESM2_ESM.txt
Additional file 2: TCGA_sample ID.txt: tabdelimited text file listing the sample, PAM50subtype classification, and whether gene expression, somatic mutation or copy number variation data was available for each sample.(TXT 14 KB)
Affy_sample_info1460.txt:
Additional file 3: tabdelimited text file listing the GEO sample id, GEO dataset series id, the response category (pCR/RD) or NA if not available, and the PAM50subtype classification.(TXT 39 KB)
Jiang_RScripts.zip: Archive containing R scripts for the analyses presented in this manuscript. Scripts included:
Additional file 4: supplycodesimulation. R Code for generating simulated expression datasets and heterogeneity calculation. supplycodeTCGA.R Code for TCGA data import, heterogeneity calculation and comparisons. supplycodehetaffy.R Code for Affymetrix data normalization, heterogeneity calculation and comparisons. KEGG_Path_NonOverlap.R Code for generation of 50 KEGG pathways with minimum degree of overlap in gene membership. File “186KEGG.Rdata” with KEGG pathway gene lists needed by this script is also provided. (ZIP 57 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Jiang, T., Shi, W., Natowicz, R. et al. Statistical measures of transcriptional diversity capture genomic heterogeneity of cancer. BMC Genomics 15, 876 (2014). https://0doiorg.brum.beds.ac.uk/10.1186/1471216415876
Received:
Accepted:
Published:
DOI: https://0doiorg.brum.beds.ac.uk/10.1186/1471216415876
Keywords
 Tumor diversity
 Breast cancer
 Basallike subtype
 Chemotherapy resistance
 Dispersion distance
 Hamming distance
 Chemotherapy