- Open Access
Discovering monotonic stemness marker genes from time-series stem cell microarray data
- Hsei-Wei Wang†1, 2, 3, 4,
- Hsing-Jen Sun†1,
- Ting-Yu Chang†3,
- Hung-Hao Lo5,
- Wei-Chung Cheng6,
- George C Tseng7,
- Chin-Teng Lin8, 9,
- Shing-Jyh Chang10,
- Nikhil Ranjan Pal11 and
- I-Fang Chung1, 2Email author
© Wang et al.; licensee BioMed Central Ltd. 2015
- Published: 21 January 2015
Identification of genes with ascending or descending monotonic expression patterns over time or stages of stem cells is an important issue in time-series microarray data analysis. We propose a method named Monotonic Feature Selector (MFSelector) based on a concept of total discriminating error (DEtotal) to identify monotonic genes. MFSelector considers various time stages in stage order (i.e., Stage One vs. other stages, Stages One and Two vs. remaining stages and so on) and computes DEtotal of each gene. MFSelector can successfully identify genes with monotonic characteristics.
We have demonstrated the effectiveness of MFSelector on two synthetic data sets and two stem cell differentiation data sets: embryonic stem cell neurogenesis (ESCN) and embryonic stem cell vasculogenesis (ESCV) data sets. We have also performed extensive quantitative comparisons of the three monotonic gene selection approaches. Some of the monotonic marker genes such as OCT4, NANOG, BLBP, discovered from the ESCN dataset exhibit consistent behavior with that reported in other studies. The role of monotonic genes found by MFSelector in either stemness or differentiation is validated using information obtained from Gene Ontology analysis and other literature. We justify and demonstrate that descending genes are involved in the proliferation or self-renewal activity of stem cells, while ascending genes are involved in differentiation of stem cells into variant cell lineages.
We have developed a novel system, easy to use even with no pre-existing knowledge, to identify gene sets with monotonic expression patterns in multi-stage as well as in time-series genomics matrices. The case studies on ESCN and ESCV have helped to get a better understanding of stemness and differentiation. The novel monotonic marker genes discovered from a data set are found to exhibit consistent behavior in another independent data set, demonstrating the utility of the proposed method. The MFSelector R function and data sets can be downloaded from: http://microarray.ym.edu.tw/tools/MFSelector/.
- Gene List
- Human Embryonic Stem Cell
- Embryoid Body
- Stem Cell Differentiation
- Monotonic Trend
In many biological experiments, we analyze regulation of gene or protein expression over time and in some other cases we examine how the expression pattern changes as the grade/stage of a disease progresses, including over time.
These biomarkers are "Monotonic" because their pattern ascends or descends with time (or stage/grade), important in time-series studies (i.e., disease progression, stem cell differentiation, aging and drug kinetics experiments)[1, 2]. For example, HNF4α is a well-known factor promoting stem cell hepatogenesis, and its levels increase with a monotonic pattern during differentiation. Genes in such studies may exhibit a natural temporal ordering in the samples, distinguishing them from others dealing with issues like discriminating tumor from non-tumor cases. A time series model generally shows that samples close in time are more closely related. For example, the expression of a stemness gene or microRNA should be more abundant in stem cells and precursors than in fully differentiated and mature progenies. Here we identify those genes monotonically ascending or descending over a given period of time.
Many approaches are used to find differentially expressed biomarkers from microarray data [5–8]. By statistical tests and other computational methods, the two-class and multi-class array data are usually analyzed focusing on the class-specific signature genes (i.e., genes that are expressed for one/some class(es) and unexpressed for the other/rest of the class(es)). In order to identify differentially expressed genes, many approaches adopt Student's t-test for two-class data, and few approaches adopt statistical tests such as ANOVA F-test for multi-class data  or a nonparametric rank-based statistical test such as Kruskal-Wallis test  also for multi-class data [11–14]. However, distribution-based and rank-based statistical tests may not be appropriate because they find differentially expressed genes, not monotonically ascending or descending genes.
Some methods to discover monotonic expression patterns are correlation-based, where a seed feature or template is chosen to find genes with good correlation. But the choice of the template plays a critical role, and this is time-consuming, requiring experienced bioinformaticians. In addition, the Cuzick-test  can be used to discover a specific trend/pattern from microarray data. Also, the modified M statistic test  can be used to detect monotonic trend through a combination of the means ordering strategy and a distribution-based statistical test. However, outliers affect the monotonic trend there because of weighting used in the Cuzick-test and the means ordering strategy used in the modified M statistic test. Furthermore, the Cuzick-test does not provide any information about the extent a gene expression pattern is monotonic (i.e., the degree of "monotonicity") for genes with the same z-score/p-value. Therefore, an approach without statistical assumptions and not affected by outliers, and able to identify the degree of monotonicity between genes with the same z-score/p-value, is needed. Our method proposed herein, MFSelector (Monotonic Feature Selector) based on the total discriminating error (DEtotal), is quite effective in discovering genes with monotonic attributes. It identifies gradually increasing or decreasing genes irrespective of any heterogeneity in the array data. Furthermore, MFSelector, eliminates the subjectivity required in seed/template selection and also provides assessments of confidence (i.e., p-/q-value and sample variance for discriminating error). This new approach is user-friendly as we integrate our algorithm into a simple R function. Users with no proficiency in programming can generate a monotonic gene list and related scatter plots.
All microarray data sets were downloaded from the NCBI GEO public archive, generated in Affymetrix Human Genome U133 Plus 2.0 platform with 54675 probe sets on chips. The raw data (CEL file) were normalized by RMA algorithm using the 'affy' package of the Bioconductor http://www.bioconductor.org software suite in the R Project for statistical Computing http://www.r-project.org. See Materials S1 in Additional file 1 for details.
Embryonic Stem Cell Neurogenesis data set (ESCN)
This data set contains 27 samples over five periods of human embryonic development: three embryonic stem cell (ESC) samples, three embryoid body (EB) samples, six primitive ectoderm cell (PEL) samples, six neural tube-like rosette cell samples, and nine post-natal neural stem cell (NSC) samples.
Embryonic Stem Cell Vasculogenesis data set (ESCV)
In this data set, there are 13 samples over four periods of human embryonic stem cell differentiation into human mature (vascular) endothelial cells: three undifferentiated embryonic stem cell (ESC) samples, three mesodermal progenitor cell (MPC) samples, four embryoid body (EB) samples, and three human mature vascular endothelial cell (VEC) samples. The results and discussion of application of MFSelector to this data set are exhibited in Materials S1 (in Additional file 1).
Synthetic data sets
We generated two sets of synthetic data. These data sets contain 50 samples each spread over five equal sized stages. There is a set with descending trends (denoted 'Des') and a set with ascending trends (denoted 'Asc'). These synthetic data sets are named 's50_Asc' and 's50_Des', respectively. In addition, each data set has 120 genes which are classified into 9 types of monotonic genes (or monotonic-like genes): 'Good (distinct)', 'Good (close)', 'Slightly', 'Outliers (slight)', 'Outliers (severe)', 'Moderately', 'Severely', 'Partially ordered (far)', and 'Partially ordered (close)'. There are 20 genes for each of 'Slightly', 'Moderately', and 'Severely' types and 10 genes for each of 'Good (distinct)', 'Good (close)', 'Outliers (slight)', 'Outliers (severe)', 'Partially ordered (far)', and 'Partially ordered (close)'. If most of the samples follow a trend over time/stages but only a few do not, we call those few samples as outliers. In this study the number of outliers for the synthetic data sets is restricted to less than 6% of total samples.
The 'Good' category has two subcategories, 'Good (distinct)' and 'Good (close)'. 'Good (distinct)' genes indicate they have the strongest monotonic ascending trend and samples have distinct separation between stages; while 'Good (close)' genes indicate they still have the strongest monotonic trend but samples between stages are close. 'Partially ordered' type genes are also further separated into 'Partially ordered (far)', and 'Partially ordered (close)'. The 'Partially ordered (far)' means that the samples in one of the stages are significantly far from the monotonic trend in a gene expression profile while 'Partially ordered (close)' means that the samples in one of the stages are close to the monotonic trend but it is still out of monotonic order in the gene expression profile. The genes of type 'Outliers (severe)' represent genes where at most six percent of the samples are relatively away from the monotonic trend, and type 'Outliers (slight)' indicates that at most six percent of the samples are slightly away from the monotonic trend. The results on these synthetic data sets by our proposed method (MFSelector) and other two methods are presented in the Results section.
Monotonic Feature Selector (MFSelector)
Computation of discriminating error for each level of a gene
Here we use several steps to examine genes with monotonically ascending or descending profiles, which are distributed over N stages. For example, when a monotonic gene has an ascending profile, first we assume samples in Stage One have the lower expression values and samples in other stages have higher expression values. We use a sample in Stage One to draw a horizontal discriminating line to determine how many samples in Stage One have higher expression values than the discriminating line and how many samples in other stages have lower expression values than the discriminating line. The number of samples on the wrong side of the discriminating line is referred to as "discriminating errors". Distinguishing Stage One from all other stages is called the "Level One" process. In addition, since every sample in Stage One can be used to draw a discriminating line, we select the discriminating line with the smallest number of discriminating errors for this level. Note that, when more than one discriminating line for Stage One have the same number of discriminating errors, we can choose any one of the lines; here we use the discriminating line corresponding to the lowest sample number. Secondly, again for the same gene, as a "Level Two" process, we assume samples in Stages One and Two have lower expression patterns than samples in other stages. Therefore, we use each sample from Stages One and Two taken together to draw a discriminating line and determine discriminating error as in Level One. When more than one sample from Stages One and Two together result in the same smallest number of discriminating errors, we consider: (a) when the samples are from the same stage, then we select the line (hence the sample) corresponding to the lowest sample number; (b) when the samples are from different stages, we give priority to the lowest sample number from the highest stage, in this case, Stage Two. This entire process is repeated N-1 times, each time adding one stage (e.g., for the "Level Three" process the union of Stages One, Two and Three versus all other stages; for the "Level Four" process the union of Stages One, Two, Three and Four against all other stages, and so on until the "Level N-1"). The discriminating line is selected in the same manner for each of the N-1 levels (in total, N-1 discriminating lines are considered for a gene). It is interesting to note that, if the number of distinct discriminating lines is equal to N-1, the corresponding gene is more likely to be a good monotonic gene.
Our outlier strategy is different from that used by rank-based statistical tests (e.g., Cuzick-test), where an outlier's deviation in expression value influences its rank, often adversely affecting the statistical calculation. MFSelector does not count samples on the wrong side of the discriminating line more than once. Suppose an outlier adds a discriminating error in the Level M process. This outlier is very likely to add a discriminating error in later levels, i.e., "Level M+1" to "Level N-1"). But we count the outlier's discriminating error only once at Level M. We compare the results for the Cuzick-test and ours in the Results section.
For a monotonic gene with a descending profile, samples in Stage One should have the higher expression values than the samples in other stages. Thus, we can get the "discriminating errors" for "Level One" process by drawing a discriminating line through one of the samples in Stage One to determine the number of samples in Stage One that have lower expression values than the discriminating line and the number of samples in other stages having higher expression values than the discriminating line. As with an ascending profile, this process is repeated N-1 times, though in this case, in descending direction. The discriminating line for each level is also selected in the same manner as that for the ascending case.
Computation of total Discriminating Error (DEtotal) for a gene
In order to evaluate monotonicity of a gene (e.g., genes with weak or strong monotonic features), we use the sum of its discriminating errors over all N-1 level processes (total discriminating error) denoted by "DEtotal". For instance, when DEtotal=k, k discriminating errors exist over N-1 level processes. Here, the smaller DEtotal of a gene, the fewer the number of discriminating errors and the stronger the monotonicity of the gene. DEtotal is an evaluation index to determine monotonic genes. We sort the DEtotal values for all genes in increasing order to select genes with stronger monotonic features for multi-stage and time-series array data. Note that, in this study we separately consider the DEtotal for monotonically ascending or descending cases. In addition, we also develop another strategy using sample variance for discriminating error (SVDE) to further determine the gene with stronger monotonic feature when several genes have the same DEtotal value. This is explained later.
Calculating DEtotalfor a gene: an illustration
First, for the "Level One" process, every sample in Stage One is used to draw a discriminating line to determine its associated number of discriminating errors (the number of samples in Stage One that have higher expression values than the discriminating line and the number of samples in other stages having lower expression values than the discriminating line). In this case, we obtain 7, 2, and 2 discriminating errors taking each of three samples in order from Stage One to draw the discriminating lines. For example, one sample from Stage One and another sample from Stage Four are on the wrong side of the discriminating line when considering the 2nd sample in Stage One to draw the discriminating line, resulting in two discriminating errors. In addition, since two discriminating lines (i.e., corresponding to the 2nd and 3rd samples) from Stage One have the same number of discriminating errors, we choose the sample with the lowest sample number (i.e., 2nd sample) and get two discriminating errors in this level.
Second, for the "Level Two" process, every sample in Stages One and Two is used to draw a discriminating line to determine its associated number of discriminating errors. In this level, we obtain 3, 3, 4, 0, 2, and 1 discriminating errors taking each of six samples in order from Stages One and Two to draw the discriminating lines. Note that, to compute the discriminating errors for the "Level Two" process, we do not consider the samples which have already caused discriminating errors in "Level One". Such samples are masked pink in Figure 3(B), one sample from Stage One and another sample from Stage Four. The extent of outliers may strongly influence results in other methods, but in our method we consider "outliers" only once to contribute to the discriminating errors and hence it allows us to determine monotonicity without the results being skewed by the position of "outliers". This is why our method has some advantage over other approaches. Finally, we choose the sample with the lowest number of discriminating errors (i.e., the 4th sample from Stage One and Two) and get no discriminating errors in this level.
Third, for the "Level Three" process, every sample in Stages One, Two, and Three is used to draw a discriminating line to determine its associated number of discriminating errors. We get 3, 9, 10, 6, 8, 7, 2, 4, 5, 1, 3, and 1 discriminating errors taking each of 12 samples in order from Stage One to Stage Three. The 10th sample is selected as the discriminating line because its discriminating error is the lowest, and it has the lowest sample number of the two Level Three samples with the same discriminating error of one.
Similarly, the process is continued for the "Level Four". In this case we determine the 16th sample with no discriminating error. Finally, we add up all discriminating errors for each selected discriminating line from every level and get the DEtotal = 2 + 0 + 1 + 0 = 3.
Statistical analysis of monotonic genes
Computation of p-value and q-value
To assess the statistical significance of the DEtotal index associated with the identified genes in ascending and descending profiles, a permutation test has been performed in Figure 2. Both unadjusted p-values and adjusted q-values for multiple comparisons are computed as in our previous studies [7, 8]. Let G be the total number of genes and S be the total number of sample points. The procedure followed is summarized below.
Step 1. Given a gene expression matrix X (x gs is the expression intensity of gene g in sample unit s; 1 ≤ g ≤ G, 1 ≤ s ≤ S) with class labels (y s , 1 ≤ s ≤ S), we compute the DEtotal,g for ascending case and denotes it by a g for notational simplicity. Similarly, for the descending case also we compute DEtotal,g and denote it by d g .
Step 2. Randomly permute the class labels y s B times (500 times in this study). In the b th permutation (1 ≤ b ≤ B), compute and for gene g using the gene expression matrix X and the permuted labels .
Computation of sample variance for discriminating error (SVDE)
Statistical tests are unable to distinguish which genes with the same level of statistical significance are better. Two genes with the same DEtotal may exhibit different monotonicities. Therefore, we develop SVDE to address the degree of monotonicity of a gene. This can also help to differentiate between genes with the same DEtotal value.
In order to assess the degree of monotonicity (particularly when more than one gene have the same DEtotal), all samples for each of the genes are slightly altered in expression values to examine whether the DEtotal of the altered expression values has changed significantly or not. To evaluate this, we propose an index, called Sample Variance for Discriminating Error (SVDE). We perturb a dataset and evaluate the extent of confidence on the monotonicity properly by adding noise to all samples (as shown in Additional file 2: Fig. S1), and apply the same method to the perturbed a dataset to calculate the DEtotal of genes.
The SVDE is a measure of how far the set of the new DEtotal values are spread around the original DEtotal (DEorg). If each DE i is equal to or close to DE org , the SVDE will be very small. Figure 4(A) shows how hard it is to change the DEtotal of a highly confident monotonic gene even after randomly altering the expression values of all samples. The smaller the SVDE, the stronger the monotonicity of the gene. Figure 4(B) depicts the monotonic profile of another gene where DEtotal is equal to zero, but the SVDE is high. The reason is that separation between adjacent stages for Stages One to Four is not sufficiently distinct. Consequently, genes with a smaller SVDE can be considered more strongly monotonic than genes with a larger SVDE.
One may think that the time interval between observations should be given importance in defining any index for finding such stemness markers. This is not so because here our objective is to discover monotonic stemness marker genes which are associated with stages of stem cell differentiation. These stages are clinical stages which depend on pathology/clinical phenotypes and are not defined taking observations with arbitrary/equal time interval, but the biological process here dictates the time interval between observations. However, our method would be equally applicable to other experimental scenario where the researchers design experiments to take observations with equal time interval for some scientific reasons. It does not matter if the required time interval is large or small.
To find ascending genes, we use w i = i, i = 1, ..., N; while for descending genes, we use w i = N-i+1, i = 1, ..., N. The Cuzick statistic above is obtained as a z-score, hence, we can determine a p-value for each gene and identify whether a gene is monotonic and the extent of its monotonicity (a stronger monotonic gene with a smaller p-value) by referring to the z-score table. In this study, we compare the performance of the Cuzick-test approach and our MFSelector approach (shown in the Results section).
The modified M statistic test
We have also performed a comparison with the modified M statistic test . The modified M statistic test (a distribution-based statistical test) is used to analyze dose-response studies in microarray experiments and investigate a trend in the response level expression with respect to doses in multi-class array data. In order to effectively find monotonic genes, the modified M statistic test adopts the means ordering strategy that considers the order restriction of mean expressions (responses) regarding the increasing or decreasing doses through K stages, respectively. The modified M statistic test and several other statistical tests are further integrated into an R package called IsoGene . However, the modified M statistic test is less effective because the existence of outliers in a stage strongly influences the order of mean responses (as shown in the Results section).
After getting the list of monotonic genes, we can sort them by DEtotal or p-value in ascending order to identify those genes with stronger monotonic patterns, and decide the threshold for the gene list based on monotonicity. In addition, we further sort these monotonic genes with the same DEtotal values according to their associated SVDE. The resulting gene list is fed into the web service tool DAVID Bioinformatics Resources http://david.abcc.ncifcrf.gov/ to identify correspondingly enriched biological processes by cross-referencing to the gene ontology database http://www.geneontology.org/.
Results on the ESCN data set by MFSelector
As mentioned in the Materials and Methods section, we use two microarray data sets, i.e., embryonic stem cell neurogenesis (ESCN) and embryonic stem cell vasculogenesis (ESCV), to demonstrate the effectiveness of our algorithm in identifying monotonic genes during different stages of stem cell differentiation. The results of application of MFSelector to the ESCV data set are exhibited in Materials S1 (in Additional file 1).
Figure S2 (in Additional file 3) reveals that principal component analysis (PCA) based on 9,285 probe sets (obtained using t-test, q<0.01) clearly distinguishes embryonic stem cells (ESCs) from post-natal neural stem cells (NSCs). It also shows that embryoid bodies (EB), primitive ectoderm cells (PEL) and neural rosette cells are differentiated along the neural lineage as evidenced by moving toward post-natal neural stem cells. Figure S2 (in Additional file 3) also shows that the addition of basic fibroblast growth factor (bFGF or FGF2)  does not make dramatic transcriptome changes on either day 10 primitive ectoderm cells (d10 and d10+) or day 17 neural rosette cells (d17 and d17+). So samples in d10 and d10+ are considered to belong to the same stage in this experiment. Similarly, samples in d17 and d17+ are considered to be in the same stage also.
Similarly, there are another kind of stemness markers which exhibit high expression at early stages of development and are monotonically descending during embryonic stem cell differentiation into the other cells/tissues (here, NSCs in the ESCN data set and vascular endothelial cells in the ESCV data set). Figure 6(B) displays DNMT3B, one of the top 12 descending genes with DEtotal = 0. Here, samples from all stages are also well separated by discriminating lines for each level. DNMT3B is associated with embryonic stem cell development . Some of other top 12 descending genes are correlated with developmental pluripotency and are usually expressed in undifferentiated pluripotent stem cells (i.e., EPCAM and DPPA4) [25, 26]. Thus, some of these top descending monotonic genes may play important roles in maintaining pluripotentiality. We shall further discuss some well-known embryonic stem cell related genes in Discussion section. Note that, we also identify some other previously unreported (or reported) genes of unknown function (or related to some biological function) with monotonic features.
In addition, we have checked one by one whether the top monotonic genes with DEtotal <= 3 (136 descending genes and 98 ascending genes with formal gene symbols) in the ESCN dataset are reported in the literature to be related to some stem cell characteristics, including stemness, differentiation, reprogramming, or induced pluripotent stem cell (iPS). Here we find that 54 (40%) monotonically descending genes related to some stem cell characteristics: stemness (44 genes), reprogramming (6 genes), and iPS (4 genes). Thirty-four (25%) genes are related to differentiation and some of them are related to embryogenesis. We could regard some of identified descending genes as stemness or embryo biomarkers. On the other hand, 30 (31%) monotonically ascending genes are found to be related to differentiation and most of them are related to neurogenesis (e.g., glia, nerve or other neural elements). We could also regard some of identified ascending genes as biomarkers of specific cell types in the differentiated cells. Please see Table S2 (in Additional file 5).
Furthermore, we have also checked one by one whether the top monotonic genes with DEtotal <= 3 in the ESCN dataset have a bi-stable profile with a significant drop or rise in expression level between two successive stages. We subjectively find that about 11 (11%) ascending genes and 17 (12.5%) descending genes are likely to have bi-stable switches. In the Table S2 (in Additional file 5), we have discussed some of those genes that are reported in the literature to be related to some stem cell characteristics.
Comparison among three methods based on the ESCN data
Since the Cuzick-test can find many monotonic genes with the same rank, for preliminary comparison we have decided to use ascending genes with rank up to 20 as determined by the Cuzick-test as an example. For Cuzick-test, the number of genes selected with ranks 1~20 is 145. Then we use the top 145 monotonic ascending genes identified by each of the other methods, to find the number of common and unique identified genes. From Table S3 (in Additional file 6), we find that a large number of common genes (110 genes; 76%) are discovered by the Cuzick-test and MFSelector. On the other hand, a large number of unique genes are discovered by the modified M statistic test (only 50 (34%) common genes found with MFSelector and 54 (37%) common genes found with Cuzick-test). The detail results are shown in Table S3 and S4 (in Additional files 6 and 7) for ascending and descending cases, respectively. These findings suggest that the performances of the Cuzick-test and MFSelector are more similar than the performances of the modified M statistic test and MFSelector. However, there are some drawbacks in the Cuzick-test, which will be revealed when we further evaluate the three methods on the ESCN data set and the synthetic data sets.
Comparison between results by the Cuzick-test and MFSelector
For comparison with MFSelector, we have also applied the Cuzick-test to the ESCN data set and the ESCV data set, and determined the trends of the gene expression profiles. For illustration we have used only the ESCN data set. The results demonstrate several superior characteristics of MFSelector.
First, although the Cuzick-test can identify monotonic genes, the Cuzick-test results contain some non-monotonic genes (false positives) at or near the top of the gene lists. Figure S3 (in Additional file 8) shows four genes identified by the Cuzick-test as strong monotonic genes in terms of z-score/p-value because only samples from one stage (Stage Two) are out of the monotonic trend. Figure S3(A) and (B) (in Additional file 8) display two genes with only roughly ascending features as a result of the deviation from the trend in the expressions of samples from Stage Two. Figure S3(C) and (D) (in Additional file 8) display two other genes with only roughly descending features as a result of the deviation from the trend in the expressions of samples also from Stage Two. Note that, MFSelector discards most of this kind of genes with a partial monotonic trend (even though those genes have small DEtotal values) by examining whether we can determine their N-1 distinct discriminating lines.
Second, MFSelector uses SVDE to further determine different levels of monotonicity of genes with the same DEtotal values (p-values). Actually, many monotonic genes ranked by the Cuzick-test have the same p-value. For example, all of those genes with DEtotal = 0 always belong to the Cuzick-test top level. Figures S4(A) and (B) (in Additional file 9) show two genes with DEtotal = 0 that should have different levels of monotonicity based on their respective SVDE values. Similarly, Figs. S4(C) and (D) (in Additional file 9) also indicate different monotonicity under MFSelector SVDE analysis for two other genes which have the same DEtotal values (here DEtotal = 2) and both are ranked 10th in the list generated by the Cuzick-test.
In addition, the Cusick-test may assign a low rank to a highly monotonic gene that contains some outlier samples which in fact do not adversely affect monotonicity. Figure S5(B) (in Additional file 10) shows that the gene expression value for ZNF302 is 164th in the list generated by MFSelector, but is 105th (in fact, its effective rank is 1279 because there are at least 1278 genes before the gene in the ranked list) in the list generated by the Cuzick-test because two samples from Stage One and Four, respectively, are significantly distant from its stage (they totally cross all samples of Stage Two and Three in expression value). This situation arises because this gene's ranking is significantly affected by the weight assigned to the anomalous expression value for these samples from Stage One and Four, causing the poor ranking for this gene under the Cuzick-test. Figure S5(A) (in Additional file 10) also shows the same characteristic influenced by outlier samples. On the other hand, the Cusick-test may also give a high rank to a gene which is insufficiently monotonic as a result of too much overlapping, where the lack of monotonicity is quite clear. For example, those monotonic genes in Figs. S6 and S7 (in Additional files 11 and 12) have too much overlapping among stages.
Comparison between results by the modified M statistic test and MFSelector
By applying the modified M statistic test, we create two gene lists, one for the ascending nature and the other for the descending nature respectively, and sort these genes by their corresponding M values. We illustrate only the top hundred ranked ascending and descending genes by using two heatmaps in Figs. S8 and S9 (in Additional files 13 and 14). Figure S8(A) (in Additional files 13) displays the heatmap of one hundred ascending monotonic genes produced by MFSelector and Fig. S8(B) (in Additional files 13) displays the same produced by the modified M statistic test. Similarly, the gene expressions of the top hundred descending monotonic genes chosen by MFSelector and the modified M statistic test are delineated in heatmaps in Figs. S9(A) and (B) (in Additional files 14), respectively. Note that, genes chosen from the top hundred are further clustered using hierarchical clustering with the average linkage strategy. It is evident from the yellow rectangular area on the heatmaps created from the gene lists (Figs. S8(B) and S9(B)), the genes found by the modified M selector method are not monotonic. This is a consequence of the existence of outliers in a stage (expressed by few different colors in a stage) satisfying the means ordering constraint. On the other hand, the genes found by MFSelector actually have monotonic trend (Figs. S8(A) and S9(A)).
Results on the synthetic data
Nine pseudo genes are chosen randomly from the 9 types of monotonic genes as shown in Figure 1. We also use synthetic data sets to perform extensive comparisons, including quantitative and qualitative evaluations, between MFSelector, the Cuzick-test, and the modified M statistic test.
Comparison among three methods based on the synthetic data
In the ideal ranking, the top 60 pseudo genes should belong to the type 'Good (distinct/close)' (20 genes), 'Slightly' (20 genes), and 'Outliers (slight/severe)' (20 genes), and that is what MFSelector does. The results using the Cuzick-test are the approximately the same as that by MFSelector, but the modified M statistic test always selects about fifteen percent of genes in the type 'Partially ordered' which are worse cases and are out of monotonic trend (as shown in Additional files 15 and 16: Table S5 and S6). Nevertheless, the Cuzick-test and the modified M statistic test are seriously affected by the variation in the nature of outliers such as "slight" versus "severe" and the nature of partially ordered trends such as "far" versus "close". We think that both 'Outliers' type genes, slight or severe, should have similar monotonic trends. These outlier samples should not influence the extent of monotonic trends. Similarly, in the 'Partially ordered' type genes, no matter how significantly far the samples in one of the stages are from the monotonic trend in a gene expression profile, most of those genes should not influence the degree of monotonicity. In addition, as mentioned earlier, MFSelector discards most of this kind of genes with a partial monotonic trend (even though those genes have a small DEtotal values) by examining the number of distinct discriminating lines. Therefore, genes having slight/severe outliers or having partially ordered patterns with the level 'far' or 'close' should be given the similar significance/rank. For the synthetic data set, we find that the ranks of the pseudo genes g46 (with slight outlier characteristics) and g53 (with severe outlier characteristics) discovered by MFSelector are not different (similar DEtotal value) but their ranks are significantly different for the Cuzick-test and the modified M statistic test. The ranks of the pseudo genes g102 (with a close partially ordered stage) and g111 (with a far partially ordered stage) identified by MFSelector are also not different (same DEtotal value) but the ranks suggested by the Cuzick-test and the modified M statistic test are significantly different (as shown in Figure 1(D), (E), (H), and 1(I) and Additional file 15: Table S5).
Biological relevance of other monotonic genes
Expressions of monotonic genes in different stages of cell lineage are closely related to the function of specific cell stage. Monotonicity of genes over stages is a good rationale to identify biomarkers responsible for differentiation of stem or precursor cells. For example, biomarkers for stemness exhibit high expression at early stages and those are responsible for maintaining proliferation and self-renewal activities. By contrast, biomarkers of specific cell types exhibit high expression in later stages after differentiation. And those genes also contribute to the specific phenotype and function of differentiated cells such as development of neural system which is mentioned earlier.
Another research  has also reported a gene list comprised of known ESC-specific genes and some new candidates (such as DNMT3B, LIN28A, BTF3 and ERH) that can serve as markers for human embryonic stem cells and may also contribute to the stemness phenotype. Of particular interest in that study is the strong monotonic descending pattern of DNMT3B (DEtotal = 0) in our study during embryonic stem cell differentiation. LIN28A is without N-1 distinct discriminating lines because its expression is tightly grouping from Stage One to Stage Three. It has a small number of discriminating errors (DEtotal = 4) though. Additional transcription factors that were also highly expressed in human embryonic stem cells include BTF3 and ERH. These genes are marker genes for embryonic stem cell, which show high expressions only in human embryonic stem cells . We depict the expression profile of these four genes in the ESCN data set in Fig. S10 (in Additional file 17). These scatter plots (as shown in Figs. S10(A), (B) and (C)) fit within the scenario for embryonic stem cell development explained by that study . However, the expression profile of ERH over different stage is not clearly separated and is not monotonically descending during the neurogenesis process (as shown in Fig. S10(D)). We make a hypothesis that ERH does not clearly exhibit monotonic expression during neurogenesis even though it is an embryonic stem cell specific marker gene. This new biomarker needs to be surveyed and studied further both in wet- and dry- laboratory experiments.
Biological processes involved in monotonic genes
The genes expressed in an early stage of stem cells or in undifferentiated cells (or precursor cells) of stem cells are often regarded as stem cell biomarkers, which are thought to maintain the pluripotency of embryonic stem cells. Based on the concept of pluripotency, cell growth or proliferation and the self-renewal activity are the most representative characteristics of stem cells which are known to be sustained by multiple genes in both stem cells and precursor cells . Differentiation of stem cells or precursor cells is accompanied by the loss of pluripotency, which suggests that the expression of stemness related genes will decrease as well. This provides a rationale for why those genes should be included in the descending gene list generated by MFSelector.
Logically, the differentiation program of stem cells and precursor cells is not only driven by gene regulation but also manipulated by the circumstances of differentiation . Combination of multiple stimulations in distinct circumstances provides the specific niche for differentiating into certain cell lineage, which is triggered by growth factors, transcription factors, and the specific cell surface receptors. Different growth factors or transcription factors are accompanied by different signals for the activation of specific functional genes, which are responsible for the generation of specific cell lineage. Therefore, we believe that those functional proteins related to the specialized cell type, structures, and functions will be involved in the ascending gene list generated by MFSelector.
To validate our inferences, we corroborate the biological processes for those monotonic genes identified from the two data sets by Gene Ontology analysis on the DAVID web tool http://david.abcc.ncifcrf.gov/. This helps us find distinct genes expressed in certain organogenesis with different processes.
For the ESCN data set, the top 857 ascending monotonic genes (DEtotal = 0~7) found by MFSelector are subjected to Gene Ontology analysis. Some of the biological processes those genes are involved in are neuron differentiation (41 genes, p-value = 8.1E-9), regulation of neurogenesis (23 genes, p-value = 3.5E-8), regulation of nervous system development (24 genes, p-value = 1.2E-7), regulation of neuron differentiation (19 genes, p-value = 4.6E-7), neuron development (31 genes, p-value = 1.2E-6), neuron projection development (24 genes, p-value = 1.7E-5), cell morphogenesis involved in neuron differentiation (21 genes, p-value = 2.5E-5), axonogenesis (20 genes, p-value = 2.6E-5), neuron projection morphogenesis (21 genes, p-value = 3.2E-5), and regulation of axonogenesis (11 genes, p-value = 1.7E-5). These biological processes very likely play important roles in differentiation of ESC neurogenesis and the development of the organs or the tissues of the nervous system. There are other ascending genes involved in neurotransmission, such as regulation of axon extension (7 genes, p-value = 3.5E-5), axon guidance (14 genes, p-value = 5.8E-5), transmission of nerve impulse (26 genes, p-value = 3.2E-4), and neuron migration (9 genes, p-value = 1.3E-3). These monotonic genes directly indicate their importance during the development of the nervous system and the process of neurogenesis.
Similarly, some of the top 1,117 descending monotonic genes (DEtotal = 0~7) found by MFSelector are involved in basic DNA/RNA biological processes, such as translation (42 genes, p-value = 4.3E-9), RNA processing (57 genes, p-value = 7.8E-9), ribosome biogenesis (23 genes, p-value = 2.4E-8), ribonucleoprotein complex biogenesis (28 genes, p-value = 3.8E-8), rRNA processing (19 genes, p-value = 1.3E-7), and ncRNA metabolic process (29 genes, p-value = 1.8E-6). According to the common understanding about the characteristics of undifferentiated or early stage stem cells, basic cellular processes are crucial for maintaining the capacity of self-renewal and proliferation. In addition, these monotonic genes may be responsible for the basic metabolisms and the stability of chromosome structures in cells. Proliferation of cells must undergo the process of cell cycle, and genes involved in the mechanism of cell cycle are able to imply the progress of proliferation. In conformity with this, we notice that genes for M phase (32 genes, p-value = 9.3E-5), DNA replication (22 genes, p-value = 1.4E-4), M phase of mitotic cell cycle (23 genes, p-value = 5.3E-4), mitotic cell cycle (32 genes, p-value = 7.3E-4), nuclear division (22 genes, p-value = 1.0E-3), mitosis (22 genes, p-value = 1.0E-3), cell cycle process (43 genes, p-value = 1.1E-3), and cell cycle phase (34 genes, p-value = 1.2E-3) are in the descending monotonic list. This evidence suggests that the stem cell features in the early stage exist during neurogenesis.
We have developed a novel and robust scheme to identify genes with monotonic patterns in multi-stage as well as in time-series genomic data. Our method is based on a concept of total discriminating error practically without requiring any assumption. Using our proposed scheme we have been able to reveal genes with gradually increasing or decreasing expression patterns during stem cell differentiation. Some of these genes are known stemness genes (such as POU5F1 (Oct4)), while many other genes have not been linked to stem cell neurogenesis or vasculogenesis before. In the ESCN and ESCV data sets monotonic features could be found even from heterogeneous gene expression profiles. In addition, we have used monotonic markers discovered from one data set to analyze another data set and obtained very interesting results. An R package implementing our algorithm and data sets can be found at: http://microarray.ym.edu.tw/tools/MFSelector/.
We acknowledge the efforts of Zhang et al. at NIH Neuroscience Microarray Consortium and Mrugala et al. at Inserm, France for their valuable microarray data sets.
This work is supported by Ministry of Science and Technology (MOST) (MOST103-2221-E-010-015) and National Yang-Ming University (a grant from Ministry of Education, Aim for the Top University Plan). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This article has been published as part of BMC Genomics Volume 16 Supplement 2, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Genomics. The full contents of the supplement are available online at http://0-www.biomedcentral.com.brum.beds.ac.uk/bmcgenomics/supplements/16/S2
- Abranches E, Silva M, Pradier L, Schulz H, Hummel O, Henrique D, Bekman E: Neural differentiation of embryonic stem cells in vitro: a road map to neurogenesis in the embryo. PLoS One. 2009, 4 (7): e6286-10.1371/journal.pone.0006286.PubMed CentralView ArticlePubMedGoogle Scholar
- Wilson PG, Stice SS: Development and differentiation of neural rosettes derived from human embryonic stem cells. Stem cell reviews. 2006, 2 (1): 67-77. 10.1007/s12015-006-0011-1.View ArticlePubMedGoogle Scholar
- Hayhurst GP, Lee YH, Lambert G, Ward JM, Gonzalez FJ: Hepatocyte nuclear factor 4alpha (nuclear receptor 2A1) is essential for maintenance of hepatic gene expression and lipid homeostasis. Molecular and cellular biology. 2001, 21 (4): 1393-1403. 10.1128/MCB.21.4.1393-1403.2001.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen KT, Pernelle K, Tsai YH, Wu YH, Hsieh JY, Liao KH, Guguen-Guillouzo C, Wang HW: Liver × receptor alpha (LXRalpha/NR1H3) regulates differentiation of hepatocyte-like cells via reciprocal regulation of HNF4alpha. Journal of hepatology. 2014Google Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.View ArticlePubMedGoogle Scholar
- Kim KJ, Cho SB: Ensemble classifiers based on correlation analysis for DNA microarray classification. Neurocomputing. 2006, 70 (1-3): 187-199. 10.1016/j.neucom.2006.03.002.View ArticleGoogle Scholar
- Tsai YS, Aguan K, Pal NR, Chung IF: Identification of single- and multiple-class specific signature genes from gene expression profiles by group marker index. PLoS One. 2011, 6 (9): e24259-10.1371/journal.pone.0024259.PubMed CentralView ArticlePubMedGoogle Scholar
- Tsai YS, Lin CT, Tseng GC, Chung IF, Pal NR: Discovery of dominant and dormant genes from expression data using a novel generalization of SNR for multi-class problems. BMC Bioinformatics. 2008, 9: 425-10.1186/1471-2105-9-425.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Wood CL, Liu Y, Getchell TV, Getchell ML, Stromberg AJ: Identification of gene expression patterns using planned linear contrasts. BMC Bioinformatics. 2006, 7: 245-10.1186/1471-2105-7-245.PubMed CentralView ArticlePubMedGoogle Scholar
- Kruskal WH, Wallis WA: Use of Ranks in One-Criterion Variance Analysis. Journal of the American Statistical Association. 1952, 47 (260): 583-621. 10.1080/01621459.1952.10483441.View ArticleGoogle Scholar
- Bonner AE, Lemon WJ, You M: Gene expression signatures identify novel regulatory pathways during murine lung development: implications for lung tumorigenesis. Journal of medical genetics. 2003, 40 (6): 408-417. 10.1136/jmg.40.6.408.PubMed CentralView ArticlePubMedGoogle Scholar
- Kleer CG, Griffith KA, Sabel MS, Gallagher G, van Golen KL, Wu ZF, Merajver SD: RhoC-GTPase is a novel tissue biomarker associated with biologically aggressive carcinomas of the breast. Breast Cancer Res Treat. 2005, 93 (2): 101-110. 10.1007/s10549-005-4170-6.View ArticlePubMedGoogle Scholar
- Lee ML, Whitmore GA, Bjorkbacka H, Freeman MW: Nonparametric methods for microarray data based on exchangeability and borrowed power. J Biopharm Stat. 2005, 15 (5): 783-797. 10.1081/BIP-200067778.View ArticlePubMedGoogle Scholar
- Reinhold WC, Kouros-Mehr H, Kohn KW, Maunakea AK, Lababidi S, Roschke A, Stover K, Alexander J, Pantazis P, Miller L, et al: Apoptotic susceptibility of cancer cells selected for camptothecin resistance: gene expression profiling, functional analysis, and molecular interaction mapping. Cancer Res. 2003, 63 (5): 1000-1011.PubMedGoogle Scholar
- Cuzick J: A Wilcoxon-type test for trend. Statistics in medicine. 1985, 4 (1): 87-90. 10.1002/sim.4780040112.View ArticlePubMedGoogle Scholar
- Lin D, Shkedy Z, Yekutieli D, Burzykowski T, Gohlmann HW, De Bondt A, Perera T, Geerts T, Bijnens L: Testing for trends in dose-response microarray experiments: a comparison of several testing procedures, multiplicity and resampling-based inference. Statistical applications in genetics and molecular biology. 2007, 6: Article26-View ArticlePubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al: Bioconductor: open software development for computational biology and bioinformatics. Genome biology. 2004, 5 (10): R80-10.1186/gb-2004-5-10-r80.PubMed CentralView ArticlePubMedGoogle Scholar
- Pramana S, Lin D, Haldermans P, Shkedy Z, Verbeke T, Göhlmann H, Bondt AD, Talloen W, B L: IsoGene: An R Package for Analyzing Dose-response Studies in Microarray Experiments. The R Journal. 2010, 2 (1): 5-12.Google Scholar
- Eckenstein FP: Fibroblast growth factors in the nervous system. Journal of neurobiology. 1994, 25 (11): 1467-1480. 10.1002/neu.480251112.View ArticlePubMedGoogle Scholar
- Voltz R, Gultekin SH, Rosenfeld MR, Gerstner E, Eichen J, Posner JB, Dalmau J: A serologic marker of paraneoplastic limbic and brain-stem encephalitis in patients with testicular cancer. N Engl J Med. 1999, 340 (23): 1788-1795. 10.1056/NEJM199906103402303.View ArticlePubMedGoogle Scholar
- Kamnasaran D, Muir WJ, Ferguson-Smith MA, Cox DW: Disruption of the neuronal PAS3 gene in a family affected with schizophrenia. Journal of medical genetics. 2003, 40 (5): 325-332. 10.1136/jmg.40.5.325.PubMed CentralView ArticlePubMedGoogle Scholar
- Strehl S, Glatt K, Liu QM, Glatt H, Lalande M: Characterization of two novel protocadherins (PCDH8 and PCDH9) localized on human chromosome 13 and mouse chromosome 14. Genomics. 1998, 53 (1): 81-89. 10.1006/geno.1998.5467.View ArticlePubMedGoogle Scholar
- Suter DM, Tirefort D, Julien S, Krause KH: A Sox1 to Pax6 switch drives neuroectoderm to radial glia progression during differentiation of mouse embryonic stem cells. Stem Cells. 2009, 27 (1): 49-58. 10.1634/stemcells.2008-0319.View ArticlePubMedGoogle Scholar
- Xie S, Wang Z, Okano M, Nogami M, Li Y, He WW, Okumura K, Li E: Cloning, expression and chromosome locations of the human DNMT3 gene family. Gene. 1999, 236 (1): 87-95. 10.1016/S0378-1119(99)00252-8.View ArticlePubMedGoogle Scholar
- Masaki H, Nishida T, Kitajima S, Asahina K, Teraoka H: Developmental pluripotency-associated 4 (DPPA4) localized in active chromatin inhibits mouse embryonic stem cell differentiation into a primitive ectoderm lineage. J Biol Chem. 2007, 282 (45): 33034-33042. 10.1074/jbc.M703245200.View ArticlePubMedGoogle Scholar
- Munz M, Kieu C, Mack B, Schmitt B, Zeidler R, Gires O: The carcinoma-associated antigen EpCAM upregulates c-myc and induces cell proliferation. Oncogene. 2004, 23 (34): 5748-5758. 10.1038/sj.onc.1207610.View ArticlePubMedGoogle Scholar
- Nichols J, Zevnik B, Anastassiadis K, Niwa H, Klewe-Nebenius D, Chambers I, Scholer H, Smith A: Formation of pluripotent stem cells in the mammalian embryo depends on the POU transcription factor Oct4. Cell. 1998, 95 (3): 379-391. 10.1016/S0092-8674(00)81769-9.View ArticlePubMedGoogle Scholar
- Anthony TE, Mason HA, Gridley T, Fishell G, Heintz N: Brain lipid-binding protein is a direct target of Notch signaling in radial glial cells. Genes & development. 2005, 19 (9): 1028-1033. 10.1101/gad.1302105.View ArticleGoogle Scholar
- Keilani S, Sugaya K: Reelin induces a radial glial phenotype in human neural progenitor cells by activation of Notch-1. BMC developmental biology. 2008, 8: 69-10.1186/1471-213X-8-69.PubMed CentralView ArticlePubMedGoogle Scholar
- Hatakeyama J, Bessho Y, Katoh K, Ookawara S, Fujioka M, Guillemot F, Kageyama R: Hes genes regulate size, shape and histogenesis of the nervous system by control of the timing of neural stem cell differentiation. Development. 2004, 131 (22): 5539-5550. 10.1242/dev.01436.View ArticlePubMedGoogle Scholar
- Richards M, Tan SP, Tan JH, Chan WK, Bongso A: The transcriptome profile of human embryonic stem cells as defined by SAGE. Stem Cells. 2004, 22 (1): 51-64. 10.1634/stemcells.22-1-51.View ArticlePubMedGoogle Scholar
- Chambers I, Colby D, Robertson M, Nichols J, Lee S, Tweedie S, Smith A: Functional expression cloning of Nanog, a pluripotency sustaining factor in embryonic stem cells. Cell. 2003, 113 (5): 643-655. 10.1016/S0092-8674(03)00392-1.View ArticlePubMedGoogle Scholar
- Trounson A: The production and directed differentiation of human embryonic stem cells. Endocrine reviews. 2006, 27 (2): 208-219. 10.1210/er.2005-0016.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.