Skip to main content


A gene sets approach for identifying prognostic gene signatures for outcome prediction

Article metrics



Gene expression profiling is a promising approach to better estimate patient prognosis; however, there are still unresolved problems, including little overlap among similarly developed gene sets and poor performance of a developed gene set in other datasets.


We applied a gene sets approach to develop a prognostic gene set from multiple gene expression datasets. By analyzing 12 independent breast cancer gene expression datasets comprising 1,756 tissues with 2,411 pre-defined gene sets including gene ontology categories and pathways, we found many gene sets that were prognostic in most of the analyzed datasets. Those prognostic gene sets were related to biological processes such as cell cycle and proliferation and had additional prognostic values over conventional clinical parameters such as tumor grade, lymph node status, estrogen receptor (ER) status, and tumor size. We then estimated the prediction accuracy of each gene set by performing external validation using six large datasets and identified a gene set with an average prediction accuracy of 67.55%.


A gene sets approach is an effective method to develop prognostic gene sets to predict patient outcome and to understand the underlying biology of the developed gene set. Using the gene sets approach we identified many prognostic gene sets in breast cancer.


Many researchers have studied the feasibility of gene expression profiling to improve the prognosis of cancer patients and have shown that gene expression signatures can better predict the outcome of cancer patients than conventional clinical criteria in many cancer types [14]. A few of the discovered signatures are now in large clinical trials to confirm their prognostic value [5, 6]. However, there are also concerns about the usefulness of the gene expression signatures because several problems remain unresolved [79]. These problems include poor overlap among discovered gene signatures, the unstable nature of gene expression signatures, and poor performance of signatures when applied to other datasets [7, 911].

Researchers have applied either top-down or bottom-up approaches to discover prognostic gene signatures [12]. Most researchers have used the top-down approach in which samples are split into training and testing sets and gene signatures are developed by discovering genes that show a high correlation between expression and clinical information [2, 1319]. In the bottom-up approach, gene signatures developed from other biological models are applied to gene expression datasets to classify patients into clinically distinct groups [12, 20]. One advantage of the bottom-up approach is that it affords a straightforward understanding of the underlying biological process behind the discovered gene signature [12]. Similarly, the recently developed gene set enrichment analysis (GSEA) and similar methods are promising tools for high-throughput data analysis. These methods enable researchers to identify significantly changed biological themes and pathways from gene expression data by observing changes in expression using pre-defined gene sets [21, 22]. Another method, named globaltest, was recently developed to test the association of a pathway with survival using gene expression data [23].

A gene signature is useless if it works well only on the dataset from which it was developed. Thus, recent work includes external validation of developed signatures as a necessary step that will reinforce the applicability of gene signatures to other datasets [14, 15, 24]. Here, we suggest a simple but very effective approach to identify gene signatures that are prognostic in multiple datasets. Rather than developing a signature from one dataset and validating it in other datasets, we suggest simultaneously testing multiple pre-defined gene signatures on multiple datasets to identify signatures that are prognostic in as many independent datasets as possible. By exhaustively testing all combinations of gene sets and datasets, our approach guarantees that the best gene signature will be identified among a pool of pre-defined gene sets. Moreover, our approach will enable better understanding of the underlying biology of disease by observing the patterns of association between gene expression and clinical parameters at multiple gene set levels.

In this work, we applied a bottom-up, gene sets approach to multiple datasets to determine gene signatures for prognosis of breast cancer patients. We chose breast cancer because there are several high-quality breast cancer gene expression datasets with survival or recurrence information. Our goal was to identify prognostic gene signatures useful in as many independent datasets as possible. For this, we collected 12 different datasets comprising 1,756 tumor samples and prepared 2,411 gene sets from diverse sources including gene ontology, biological pathways, and previously identified prognostic gene signatures for breast cancer. For each gene set, we performed survival analysis to test if the gene set could classify patients into clinically distinct groups. We also evaluated each gene set for the accuracy of outcome prediction.


Selection of gene sets for prognosis of survival or recurrence

Analysis of 12 datasets (Table 1) with 2,411 gene sets (Table 2) including 32 gene sets previously identified as prognostic in breast and other cancers (Table 3) revealed that many of the gene sets related to cell cycle or proliferation were best discriminating between good and poor prognosis groups. Table 4 presents the 20 most highly prognostic gene sets identified by two-means clustering of samples. Most of these top gene sets were related to cell cycle, mitosis, proliferation, and DNA replication as well as gene sets previously identified as prognostic in breast cancer such as 11823860_ST2, 17076897_ADF3, and 16478745_ST1 (Table 4). Kaplan-Meier plots of 12 datasets showed that the 11823860_ST2 gene set classified patients into two groups (poor or good prognosis) according to differences in survival or recurrence in eight of 12 datasets (Figure 1). Because breast cancers are heterogeneous and may comprise three to six subtypes [2527], we also applied k-means clustering with k = 3, 4, 5, and 6 to each dataset to divide samples into three, four, five, and six subtypes respectively and performed log-rank test to infer the significance of differences in survival between the groups. Again, we found that gene sets related to cell cycle or proliferation were best discriminating between groups with different clinical outcomes (Additional data file 1, Supplementary Table 1, 2, 3, and 4). The 11823860_ST2 gene set, which was ranked as the first in two-means clustering analysis (Table 4), was ranked as the first in four (Supplementary Table 2) and the fifth in three (Supplementary Table 1) and the tenth in five and six-means clustering (Supplementary Table 3 and 4).

Table 1 Breast cancer datasets analyzed in this study
Table 2 Number of gene sets in each category
Table 3 Thirty-two prognostic gene sets prepared from published reports
Table 4 Top 20 prognostic gene sets identified by two-means clustering in breast cancer gene expression datasets
Figure 1

Kaplan-Meier survival curves for the two prognostic classes of breast cancers. In each dataset, patients were divided into two groups (poor and good prognostic groups) based on the gene expression pattern in the 11823860_ST2 gene set, and their survival or recurrence proportions were then plotted. The log-rank test was used to infer the statistical significance of survival or recurrence differences between the two groups. In each graph, the x-axis represents overall or relapse-free survival years and the y-axis represents the proportion of overall survival (A, B, C, D, E, F, I, and K) or relapse-free survival (G, H, J, and L). Black indicates poor prognosis and red indicates good prognosis.

Unadjusted and adjusted hazard ratios

We then calculated unadjusted hazard ratios for three selected gene sets within the 12 datasets (Table 5). These three gene sets showed significant (P < 0.05) unadjusted hazard ratios in six or seven of the 12 datasets irrespective of microarray platforms. For example, the Sotiriou_2, Wang, and Pawitan datasets used the Affymetrix U133A platform, the van de Vijver dataset used Agilent oligomers, and the Sorlie_1 dataset used cDNA arrays. This confirms that many gene sets related to cell cycle and proliferation are prognostic irrespective of the microarray platform. We also calculated adjusted hazard ratios for the 11823860_ST2 gene set in the three datasets (Sotiriou_2, van de Vijver, and Sorlie_1) for available clinical parameters such as grade, lymph node status, tumor size, age, and estrogen receptor (ER) status (Additional data file 2, Supplementary Table 5, 6 and 7). The 11823860_ST2 gene set proved significant even after adjustment for other clinical parameters in the three datasets, verifying that the 11823860_ST2 gene set contains additional prognostic value over existing prognostic clinical parameters.

Table 5 Hazard ratios and P values for the top three gene signatures in 12 datasets

Accuracy of outcome prediction

We then analyzed the accuracy of patient outcome prediction for each of the 2,411 gene sets. Initially, we tested five algorithms – nearest centroid, diagonal linear discriminant analysis (DLDA), compound covariate predictor, one-nearest and three-nearest neighbor predictor [28] and found that in our datasets nearest centroid and DLDA methods performed better than the others (data not shown) with similar performance to each other. For convenience, we used the nearest centroid method in subsequent analysis. With six large datasets containing more than 100 samples, we estimated the prediction accuracy of each gene set by external validation. We measured prediction accuracy for each pair of 30 training-testing datasets and for a total of 30 predictions (Table 6). The best gene set was the gene set 11823860_ST2, with prediction accuracy, sensitivity, and specificity of 67.55%, 70.56%, and 57.16%, respectively (Tables 6 and 7). The individual prediction accuracy with the 11823860_ST2 gene set was as high as 0.7464 when the training-testing pair was Pawitan-van de Vijver and as low as 0.54 74 when the training-testing pair was Wang-Bild (Table 6). The individual prediction accuracy was not related to the differences in microarray platforms or patient characteristics (data not shown). We also analyzed the accuracy of patient outcome prediction with nine datasets with more than ten samples for each of the two groups. Again, the gene set 11823860_ST2 was the best with a prediction accuracy, sensitivity, and specificity of 0.6578, 0.6895, and 0.566, respectively (Additional data file 3, Supplementary Table 8).

Table 6 Prediction accuracy of the 11823860_ST2 gene set in external validation
Table 7 Top 20 gene sets with high prediction accuracy (analysis with six datasets)

Best gene sets for prediction accuracy differ from those for prognosis

Comparison of the top 20 prognostic gene sets for breast cancer survival (Table 4) with the top 20 gene sets with high prediction accuracy (Table 7) showed only three common gene sets (11823860_ST2, 14737219_USR, and 14737219_CSR). Interestingly, the gene sets shown in Table 7 were, in general, from higher categories in the gene ontology hierarchy, including transferase activity (MF), transcription factor activity (MF), transport (BP), and transcription (BP). Because gene sets in higher categories have more genes than those in lower categories, we reasoned that there might be a significant difference in gene set size between the gene sets in Table 4 and Table 7. Thus, we compared the distribution of gene set sizes between the top 20 prognostic gene sets for survival (designated as prognosis gene sets, Table 4) and the top 20 gene sets with high prediction accuracy (designated as predictor gene sets, Table 7) and found a significant difference in sizes between prognosis and predictor gene sets (Figure 2; P = 1.34 × 10-5 by unpaired t-test). The sizes of the top 20 prognosis gene sets ranged from 6 to 530 with a mean of 155.5 and a median of 72.5, whereas the sizes of the top 20 predictor gene sets ranged from 125 to 1,817 with a mean of 674.15 and a median of 502.5 (Figure 2). The trend was repeatedly observed when we varied the number of top n prognosis and predictor gene sets (n = 10, 50, 100, 150, and 200) for comparison. The P-values by unpaired t-test to compare the difference in sizes between the two gene sets were 2.42 × 10-3 (n = 10), 6.46 × 10-8 (n = 50), 3.34 × 10-7 (n = 100), 3.02 × 10-8 (n = 150), and 4.55 × 10-8 (n = 200), respectively

Figure 2

Comparison of gene set sizes between best prognostic gene sets (group 1) and best gene predictive sets (group 2). The number of genes in top 20 gene sets for group discrimination (PROG) and top 20 gene sets for prediction accuracy (PRED) is box plotted. P-value was inferred from an unpaired t-test.


We have shown that a gene sets approach is effective in identifying prognostic gene sets over multiple gene expression datasets. We identified 11823860_ST2 gene set as the best prognostic gene set for breast cancer patients.

Our gene sets approach is fundamentally different from previous methods in that our method doesn't try to build a single gene set from gene expression and clinical data as previous methods did [2, 3, 13]. Instead, our method begins from multiple gene sets and datasets and exhaustively searches for the best gene set among the given gene sets. As more gene sets and datasets accumulate, our method always finds out a better gene set than before. Another advantage of our gene sets approach is that it assists us to understand the underlying biology of the clinical outcome because many gene sets are prepared using biological knowledge such as pathways, gene ontology, and protein domains [12, 21, 22]. In the analysis of breast cancer datasets, cell cycle or proliferation gene sets were the best for prognosis of survival or recurrence as judged by the log-rank test (Table 4). This result is in agreement with many previous studies showing that cell proliferation signatures are the best predictors of prognosis of breast cancer patients [1, 2, 1216, 18, 24, 29, 30].

Because poor overlap among independently developed prognostic gene sets has raised concerns over this type of diagnostic tool [10, 11], we examined the degree of overlap among the top 20 prognostic gene sets identified in our study. Again, we found relatively poor overlap among them, thus confirming previous results (data not shown). However, poor overlap among gene sets may not be as serious a problem as previously thought if different gene sets represent similar biological pathways and are congruent on outcome prediction [3032]. This point was recently emphasized by Fan et al. [26] who showed congruence among four different gene expression-based predictors for breast cancer.

Pepe et al. [33] emphasized that strong statistical associations between prognostic markers and clinical outcomes do not necessarily imply good discriminative power of the marker. Thus, instead of reporting odds ratios or hazards ratios, one should report an objective prediction accuracy to prove the usefulness of the marker as a diagnostic, prognostic, or screening tool [3335]. As such, we calculated the prediction accuracy of each gene set using six datasets containing over 100 samples. We emphasize that we performed only external validation to avoid over-fitted estimation of prediction accuracy. While Michiels et al. [7] showed that five of the seven datasets they analyzed did not classify patients better than by chance, at least for breast cancer, all six datasets that we analyzed classified patients even though we only used external validation.

When we prepared 2,411 gene sets, we included 32 gene sets previously identified as prognostic in breast and other cancers to evaluate their performance in multiple gene expression datasets. Among the included gene sets are the 70-gene signature (12490681_70 in Table 3) [1, 13], 76-gene signature (15721473_T3 in Table 3) [2], 21-gene signature (15591335_F1 in Table 3) [6], and wound healing signature (14737219_CSR in Table 3) [3, 12] (Table 3). Through various analyses, we identified the 11823860_ST2 gene set as the best prognostic gene set in breast cancer. The 11823860_ST gene set was the best in two and four-means clustering and also in outcome prediction (Table 4, 6, Supplementary Table 2 in Additional data file 1, and Supplementary Table 8 in Additional data file 3). The 11823860_ST2 gene set was also ranked high in three, five, and six-means clustering (Supplementary Table 1, 3, and 4 in Additional data file 1). The 11823860_ST2 gene set was originally identified as 231 genes significantly associated with clinical outcomes of 78 node-negative, untreated, and young patients with an age at diagnosis less than 55 years in a supervised analysis [13]. But, in our analysis with 12 datasets, the 11823860_ST2 gene set was also prognostic in independent patients with diverse clinical characteristics (both node-negative and positive, both treated and untreated patients of all ages), which was previously confirmed [1, 18]. Also, the 11823860_ST2 gene set was prognostic in most datasets irrespective of the used microarray platforms.

In van't Veer et al. [13]'s work, 11823860_ST2 gene set was reduced to the famous 70-gene signature by optimizing the number of genes for maximum accuracy in leave-one-out cross validation [13]. The 70-gene signature has been validated in subsequent works and now undergoes a large scale prospective clinical trial [1]. But, our results indicate that using 231 genes might be better than using the 70-gene signature. Then, why 11823860_ST2 gene set performed better than the 70-gene signature? One reason is because we included in our analysis 12 different datasets produced using diverse microarray platforms with different gene contents. In this situation, gene sets containing many genes are likely to perform better than gene sets with a small number of genes because a greater proportion of prognostic genes are consistently present across all platforms. Indeed, the 11823860_ST2 gene set contains many genes (for example, cyclin E2, MCM6, MMP9, MP1, RAB6B, PK429, ESM1, and FLT1), in addition to 70 genes, involved in processes such as cell cycle, invasion and metastasis, angiogenesis, and signal transduction, processes up-regulated in poor prognosis group [13]. The tendency of gene sets with high prediction accuracy (Table 7) having more genes than prognostic gene sets identified by log-rank test (Table 4) may be explained in the same way (Figure 2).

One concern in our strategy is that by taking a certain number of pre-defined gene sets, it may just happen that one gene set will turn out significant. However, because the two procedures we perform, log-rank test and the estimation of prediction accuracy, evaluate at individual gene-set level whether a gene set is prognostic or not, we suppose that our method can effectively handle false positive predictions. Thus, even if a gene set is identified as the best among pre-defined gene sets, the two procedures, log-rank test and prediction accuracy, will evaluate if the identified gene set is significant or not.

Many microarray-based molecular studies have been criticized as noisy discovery due to problems such as small sample size, inappropriate statistical analysis leading to over-fitting of data, lack of independent validation, or validation with too small set [9, 36, 37]. In this regard, our work sets a good example for microarray-based discovery of prognostic gene sets. We included more than 1,700 samples in the analysis and applied complete external validation to avoid data over-fitting. Thus, we believe that gene sets found in our analysis are truly prognostic in breast cancer and not just a noisy discovery. Finally, although we focused only on breast cancer datasets in this work, our gene sets approach is equally applicable to other types of cancer or to studies that develop molecular signatures for predicting drug sensitivity of each patient to cancer drugs. We expect that, like gene set enrichment analysis and similar tools that have become useful for gene expression data analysis [21, 22], a gene sets approach will be useful for developing prognostic signatures for outcome prediction [23].


The gene sets approach is an effective tool for selecting a prognostic gene set as well as for understanding the underlying biology for different patients' outcomes. By applying a bottom-up approach with many gene sets, we could identify the biological processes and pathways that are important for prognosis of breast cancer patients. The importance of cell proliferation signatures in breast cancer prognosis has been repeatedly discovered, but our approach reinforces these previous findings [1, 2, 13, 15, 16, 18, 24, 30, 38]. Additionally, our approach is applicable to other types of cancer in which prognostic gene sets are less developed than breast cancer.



We downloaded breast cancer gene expression datasets with clinical information from the gene expression omnibus [39], Stanford microarray database [40], or author's individual web pages [1, 2, 1520, 26, 27, 38, 41]. See Table 1 for a complete list of datasets and their sources. We analyzed 12 datasets comprising 1,756 tissue samples.

Gene sets

We prepared gene sets from diverse sources including gene ontology (GO) terms [42], GenMAPP [43] and KEGG pathways [44], and InterPro protein domain information [45] using the Affymetrix annotation file (2006 November version) downloaded from the Affymetrix web site (Table 2)[46]. We limited the gene set size between five and two thousands. We also included 32 well-known prognostic gene sets for breast and other cancers (Table 3). For those 32 gene sets, we created a nomenclature for each gene set by combining the PubMed id of the reference and the source of the gene set in the reference. For example, 11823860_ST2 represents a gene set from the Supplementary Table 2 from van't Veer et al. [13] with a PubMed id of 118323860. The number of gene sets in each category is shown in Table 2.

Preprocessing of microarray data

The datasets that we analyzed included both single-channel Affymetrix and dual-channel cDNA microarray platforms. We used a gene symbol as a common identifier to map probe IDs across different platforms. When we mapped a gene set between two arrays, we used only genes common to both arrays. To analyze Affymetrix datasets, we consistently used expression values computed by MAS5 algorithms to ensure similar processing, normalized each sample by a global mean method to a target density of 1,000, floored low expression values to 100, log-transformed each value by base two, merged replicate probes for the same gene by an average value, and finally mean-centered each gene within a dataset [47]. To analyze cDNA datasets, we initially filtered out missing values when the percentage of missing values was greater than 30%, imputed missing values by the k-nearest neighbor method, merged replicate probes by an average value, and finally mean-centered each gene. We used the GEPAS web service [48] to filter and impute missing values [49].

Statistical analysis

For each dataset and gene set, we applied k-means clustering with k = 2, 3, 4, 5, and 6 to divide each sample into two, three, four, five, or six groups based on the gene expression pattern of the gene set and applied the log-rank test to infer the statistical significance of differences in survival between the groups. We used a Kaplan-Meier plot to show the differences in survival. We applied the nearest-centroid prediction rule, one of the simplest class prediction methods, to estimate the accuracy of prediction for each patient's outcome [7]. To briefly describe, the nearest-centroid prediction rule first calculates a centroid for each group. The centroid is the average gene expression for each gene in each group. Then, with a new sample, the method calculates two distances between the gene expression value of the new sample and each of the two centroids and assigns the new sample to the group with the smaller distance. For each gene set, we defined two average profiles (good and poor) as vectors of the average expression values of genes in a gene set in patients with good and poor prognoses. Good prognosis patients were defined as relapse-free or overall survival over five years, whereas poor prognosis patients were deceased within five years. We classified each patient in the validation set according to the Euclidean distance between the gene expression of the patient and the two average profiles. We performed external validation using six large datasets containing more than 100 samples. For external validation, we calculated two average (good and poor) profiles using only samples in one dataset and predicted patient outcomes in the other five datasets and performed external validation for all training-testing pairs of six datasets (30 pairs). We used R language [50] for statistical analysis and python programming language [51] for data processing.



Gene expression omnibus


Gene expression Series


Gene ontology


Biological processes


Molecular functions


pathway, BR: Breast cancer prognostic signature


estrogen receptor.


  1. 1.

    van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ: A gene-expression signature as a predictor of survival in breast cancer. The New England journal of medicine. 2002, 347 (25): 1999-2009. 10.1056/NEJMoa021967.

  2. 2.

    Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365 (9460): 671-679.

  3. 3.

    Chang HY, Sneddon JB, Alizadeh AA, Sood R, West RB, Montgomery K, Chi JT, van de Rijn M, Botstein D, Brown PO: Gene expression signature of fibroblast serum response predicts human cancer progression: similarities between tumors and wounds. PLoS biology. 2004, 2 (2): E7-10.1371/journal.pbio.0020007.

  4. 4.

    Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403 (6769): 503-511. 10.1038/35000501.

  5. 5.

    Buyse M, Loi S, van't Veer L, Viale G, Delorenzi M, Glas AM, d'Assignies MS, Bergh J, Lidereau R, Ellis P: Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. Journal of the National Cancer Institute. 2006, 98 (17): 1183-1192.

  6. 6.

    Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T: A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. The New England journal of medicine. 2004, 351 (27): 2817-2826. 10.1056/NEJMoa041588.

  7. 7.

    Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet. 2005, 365 (9458): 488-492. 10.1016/S0140-6736(05)17866-0.

  8. 8.

    Eden P, Ritz C, Rose C, Ferno M, Peterson C: "Good Old" clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers. Eur J Cancer. 2004, 40 (12): 1837-1841. 10.1016/j.ejca.2004.02.025.

  9. 9.

    Ransohoff DF: Bias as a threat to the validity of cancer molecular-marker research. Nature reviews. 2005, 5 (2): 142-149. 10.1038/nrc1550.

  10. 10.

    Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set?. Bioinformatics (Oxford, England). 2005, 21 (2): 171-178. 10.1093/bioinformatics/bth469.

  11. 11.

    Ein-Dor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences of the United States of America. 2006, 103 (15): 5923-5928. 10.1073/pnas.0601231103.

  12. 12.

    Chang HY, Nuyten DS, Sneddon JB, Hastie T, Tibshirani R, Sorlie T, Dai H, He YD, van't Veer LJ, Bartelink H: Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proceedings of the National Academy of Sciences of the United States of America. 2005, 102 (10): 3738-3743. 10.1073/pnas.0409462102.

  13. 13.

    van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415 (6871): 530-536. 10.1038/415530a.

  14. 14.

    Naderi A, Teschendorff AE, Barbosa-Morais NL, Pinder SE, Green AR, Powe DG, Robertson JF, Aparicio S, Ellis IO, Brenton JD: A gene-expression signature to predict survival in breast cancer across independent data sets. Oncogene. 2007, 26 (10): 1507-1516. 10.1038/sj.onc.1209920.

  15. 15.

    Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. Journal of the National Cancer Institute. 2006, 98 (4): 262-272.

  16. 16.

    Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 2005, 7 (6): R953-964. 10.1186/bcr1325.

  17. 17.

    West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2001, 98 (20): 11462-11467. 10.1073/pnas.201162998.

  18. 18.

    Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET: Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proceedings of the National Academy of Sciences of the United States of America. 2003, 100 (18): 10393-10398. 10.1073/pnas.1732912100.

  19. 19.

    Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proceedings of the National Academy of Sciences of the United States of America. 2005, 102 (38): 13550-13555. 10.1073/pnas.0506230102.

  20. 20.

    Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A: Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2006, 439 (7074): 353-357. 10.1038/nature04296.

  21. 21.

    Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005, 102 (43): 15545-15550. 10.1073/pnas.0506580102.

  22. 22.

    Kim SY, Volsky DJ: PAGE: parametric analysis of gene set enrichment. BMC bioinformatics. 2005, 6: 144-10.1186/1471-2105-6-144.

  23. 23.

    Goeman JJ, Oosting J, Cleton-Jansen AM, Anninga JK, van Houwelingen HC: Testing association of a pathway with survival using gene expression data. Bioinformatics (Oxford, England). 2005, 21 (9): 1950-1957. 10.1093/bioinformatics/bti267.

  24. 24.

    Teschendorff AE, Naderi A, Barbosa-Morais NL, Pinder SE, Ellis IO, Aparicio S, Brenton JD, Caldas C: A consensus prognostic gene expression classifier for ER positive breast cancer. Genome biology. 2006, 7 (10): R101-10.1186/gb-2006-7-10-r101.

  25. 25.

    Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA: Molecular portraits of human breast tumours. Nature. 2000, 406 (6797): 747-752. 10.1038/35021093.

  26. 26.

    Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences of the United States of America. 2001, 98 (19): 10869-10874. 10.1073/pnas.191367098.

  27. 27.

    Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S: Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences of the United States of America. 2003, 100 (14): 8418-8423. 10.1073/pnas.0932692100.

  28. 28.

    Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002, 97 (457): 77-87. 10.1198/016214502753479248.

  29. 29.

    Whitfield ML, George LK, Grant GD, Perou CM: Common markers of proliferation. Nature Reviews Cancer. 2006, 6 (2): 99-106. 10.1038/nrc1802.

  30. 30.

    Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DS, Nobel AB, van't Veer LJ, Perou CM: Concordance among gene-expression-based predictors for breast cancer. The New England journal of medicine. 2006, 355 (6): 560-569. 10.1056/NEJMoa052933.

  31. 31.

    Chang JT, Nevins JR: GATHER: a systems approach to interpreting genomic signatures. Bioinformatics (Oxford, England). 2006, 22 (23): 2926-2933. 10.1093/bioinformatics/btl483.

  32. 32.

    Simon R: Development and evaluation of therapeutically relevant predictive classifiers using gene expression profiling. Journal of the National Cancer Institute. 2006, 98 (17): 1169-1171.

  33. 33.

    Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P: Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. American journal of epidemiology. 2004, 159 (9): 882-890. 10.1093/aje/kwh101.

  34. 34.

    Ioannidis JP: Is molecular profiling ready for use in clinical decision making?. The oncologist. 2007, 12 (3): 301-311. 10.1634/theoncologist.12-3-301.

  35. 35.

    Pepe MS: Evaluating technologies for classification and prediction in medicine. Statistics in medicine. 2005, 24 (24): 3687-3696. 10.1002/sim.2431.

  36. 36.

    Ioannidis JP: Microarrays and molecular research: noise discovery?. Lancet. 2005, 365 (9458): 454-455.

  37. 37.

    Reis-Filho JS, Westbury C, Pierga JY: The impact of expression profiling on prognostic and predictive testing in breast cancer. J Clin Pathol. 2006, 59 (3): 225-231. 10.1136/jcp.2005.028324.

  38. 38.

    Weigelt B, Hu Z, He X, Livasy C, Carey LA, Ewend MG, Glas AM, Perou CM, Van't Veer LJ: Molecular portraits and 70-gene prognosis signature are preserved throughout the metastatic process of breast cancer. Cancer research. 2005, 65 (20): 9155-9158. 10.1158/0008-5472.CAN-05-2553.

  39. 39.

    Gene expression omnibus (GEO). []

  40. 40.

    Stanford Microarray Database. []

  41. 41.

    Oh DS, Troester MA, Usary J, Hu Z, He X, Fan C, Wu J, Carey LA, Perou CM: Estrogen-regulated genes predict survival in hormone receptor-positive breast cancers. J Clin Oncol. 2006, 24 (11): 1656-1664. 10.1200/JCO.2005.03.2755.

  42. 42.

    The Gene Ontology. []

  43. 43.

    GenMAPP. []

  44. 44.

    KEGG: Kyoto Encyclopedia of Genes and Genomes. [http://www.genome.up/kegg/]

  45. 45.

    InterPro. []

  46. 46.


  47. 47.

    Kim SY, Kim Y: Genome-wide prediction of transcriptional regulatory elements of human promoters using gene expression and promoter analysis data. BMC bioinformatics. 2006, 7: 330-10.1186/1471-2105-7-330.

  48. 48.

    GEPAS: Gene Expression Pattern Analysis Suite. []

  49. 49.

    Vaquerizas JM, Conde L, Yankilevich P, Cabezon A, Minguez P, Diaz-Uriarte R, Al-Shahrour F, Herrero J, Dopazo J: GEPAS, an experiment-oriented pipeline for the analysis of microarray gene expression data. Nucleic acids research. 2005, W616-620. 10.1093/nar/gki500. 33 Web Server

  50. 50.

    The R Project for Statistical Computing. []

  51. 51.

    Python Programming Language. []

  52. 52.

    Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, Bild A, Iversen ES, Liao M, Chen CM: Gene expression predictors of breast cancer outcomes. Lancet. 2003, 361 (9369): 1590-1596. 10.1016/S0140-6736(03)13308-9.

  53. 53.

    Zhao H, Langerod A, Ji Y, Nowels KW, Nesland JM, Tibshirani R, Bukholm IK, Karesen R, Botstein D, Borresen-Dale AL: Different gene expression patterns in invasive lobular and ductal carcinomas of the breast. Molecular biology of the cell. 2004, 15 (6): 2523-2536. 10.1091/mbc.E03-11-0786.

  54. 54.

    Glinsky GV, Higashiyama T, Glinskii AB: Classification of human breast cancer using gene expression profiling as a component of the survival predictor algorithm. Clin Cancer Res. 2004, 10 (7): 2272-2283. 10.1158/1078-0432.CCR-03-0522.

  55. 55.

    Glinsky GV, Berezovska O, Glinskii AB: Microarray analysis identifies a death-from-cancer signature predicting therapy failure in patients with multiple types of cancer. The Journal of clinical investigation. 2005, 115 (6): 1503-1521. 10.1172/JCI23412.

  56. 56.

    Schuetz CS, Bonin M, Clare SE, Nieselt K, Sotlar K, Walter M, Fehm T, Solomayer E, Riess O, Wallwiener D: Progression-specific genes identified by expression profiling of matched ductal carcinomas in situ and invasive breast tumors, combining laser capture microdissection and oligonucleotide microarray analysis. Cancer research. 2006, 66 (10): 5278-5286. 10.1158/0008-5472.CAN-05-4610.

Download references


We thank Drs. YI Yeom, IS Chu, SB Kim, WH Choi, and SK Kim for their helpful discussion as well as their careful reading and valuable comments on the manuscript. We also thank anonymous reviewers for their helpful and constructive comments that significantly improved our manuscript. This work was supported by grants NTC700711 from the Korea Research Council for Fundamental Science & Technology (to SYK) and FG06-11-01 from the Ministry of Science and Technology of Korea (to YSK).

Author information

Correspondence to Seon-Young Kim or Yong Sung Kim.

Additional information

Author's contributions

SYK designed the study, collected datasets, performed bioinformatics analyses, and drafted the manuscript. YSK designed the study and wrote the manuscript. Both authors read and approved the final manuscript.

Electronic supplementary material

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Kim, S., Kim, Y.S. A gene sets approach for identifying prognostic gene signatures for outcome prediction. BMC Genomics 9, 177 (2008) doi:10.1186/1471-2164-9-177

Download citation


  • Prediction Accuracy
  • Additional Data File
  • Gene Expression Dataset
  • Prognostic Gene Signature
  • Prognostic Gene