- Methodology article
- Open Access
Systematic identification of cell cycle regulated transcription factors from microarray time series data
BMC Genomics volume 9, Article number: 116 (2008)
The cell cycle has long been an important model to study the genome-wide transcriptional regulation. Although several methods have been introduced to identify cell cycle regulated genes from microarray data, they can not be directly used to investigate cell cycle regulated transcription factors (CCRTFs), because for many transcription factors (TFs) it is their activities instead of expressions that are periodically regulated across the cell cycle. To overcome this problem, it is useful to infer TF activities across the cell cycle by integrating microarray expression data with ChIP-chip data, and then examine the periodicity of the inferred activities. For most species, however, large-scale ChIP-chip data are still not available.
We propose a two-step method to identify the CCRTFs by integrating microarray cell cycle data with ChIP-chip data or motif discovery data. In S. cerevisiae, we identify 42 CCRTFs, among which 23 have been verified experimentally. The cell cycle related behaviors (e.g. at which cell cycle phase a TF achieves the highest activity) predicted by our method are consistent with the well established knowledge about them. We also find that the periodical activity fluctuation of some TFs can be perturbed by the cell synchronization treatment. Moreover, by integrating expression data with in-silico motif discovery data, we identify 8 cell cycle associated regulatory motifs, among which 7 are binding sites for well-known cell cycle related TFs.
Our method is effective to identify CCRTFs by integrating microarray cell cycle data with TF-gene binding information. In S. cerevisiae, the TF-gene binding information is provided by the systematic ChIP-chip experiments. In other species where systematic ChIP-chip data is not available, in-silico motif discovery and analysis provide us with an alternative method. Therefore, our method is ready to be implemented to the microarray cell cycle data sets from different species. The C++ program for AC score calculation is available for download from URL http://leili-lab.cmb.usc.edu/yeastaging/projects/project-base/.
Eukaryotic cell cycle is precisely controlled and regulated at the transcriptional, post-transcriptional, and post-translational level. To identify cell cycle regulated genes, several genome-wide analysis have been performed using microarray technologies [1–5]. In these studies, expression levels across the cell cycle were measured simultaneously for thousands of mRNA transcripts. In order to identify the subset of periodically expressed genes in the resulting microarray gene expression time series data, a number of computational approaches have been proposed, including Fourier analysis [2, 6], partial least squares regression , Fisher's G-test , model-based method , and methods using some threshold criteria . These approaches provided useful tools for periodicity analysis in microarray time series data and have led to the identification of hundreds of cell cycle regulated genes. For example, Spellman et al. found that about 800 genes are periodically expressed across the cell cycle in S. cerevisiae.
Transcription factors (TFs) play critical roles in gene expression regulation. To understand how the cell cycle is regulated and how cell cycle regulates other biological processes, such as DNA replication and amino acids biosynthesis, it is useful to identify the cell cycle regulated transcription factors (CCRTFs). We note that in this paper we use the term "cell cycle regulated" rather than "cell cycle regulator" as used in previous studies, because it is often difficult to infer the direction of regulation only from the microarray cell cycle data. The transcription factors whose regulatory activities fluctuate periodically across the cell cycle could be either cell cycle regulator or effector of the cell cycle regulation. Moreover, the expression levels of TFs in microarray data may not accurately reflect their activities in transcription regulation. First, TFs are often subject to various post-transcriptional and post-translational modifications, which abolish the significant correlations between their activities and expression levels. Second, TFs are usually expressed in relatively low levels [11, 12] and therefore expression changes measured by microarray hybridization may not be accurate. Therefore, identification of CCRTFs should be based on the activities of TFs rather than their expression levels. Considering this issue and by integrating the microarray cell cycle data with TF-gene connectivity data from ChIP-chip experiment, several methods have been suggested to identify yeast cell cycle transcription factors [13–15], to infer cooperativity among the transcription factors controlling the cell cycle in yeast , to model the network of yeast cell cycle transcription factors , or to reconstruct the transcriptional regulatory modules of the yeast cell cycle .
In this paper, we suggest a two-step method to identify the CCRTFs in yeast. First, for each TF we infer its activity in each time point of the microarray data, resulting in an activity profile for it. This is achieved by integrating microarray expression data with systematic ChIP-chip data  or motif discovery data  using the BASE method . In the second step we use the Fisher's G-test to examine the periodicity of these TF activity profiles to identify the CCRTFs. Based on the combination of microarray data and ChIP-chip data, we identify 42 CCRTFs at 4% false discovery rate (FDR), including 23 experimentally validated cell cycle TFs. More importantly, by combining microarray expression data with in-silico motif discovery data, we identify 8 motifs with known associated TFs at 3.5% FDR, among which 7 are associated with established cell cycle TFs. Thus, our method is able to identify CCRTFs in species other than S. cerevisiae, in which microarray cell cycle data have been generated whereas large-scale ChIP-chip data are still not available.
Periodic activity fluctuation of TFs across the cell cycle
We integrate the microarray cell cycle gene expression data with the ChIP-chip data to infer TF activities at 18 time points of the microarray data, resulting in 203 activity change score (AC score) profiles each corresponding to a TF. These AC score profiles measure the activity fluctuation of TFs at each of the 18 time points in the cell cycle. The AC score for a TF reflects the relative activity of this TF in the synchronized sample with respect to the non-synchronized control at a time point. For a transcription activator, a positive AC score indicates its activity enhancement and a negative AC score indicates its activity reduction. For a transcription repressor, the inverse conclusion should be made. See section "Methods" for detail definition and calculation of AC scores. Then average periodogram is applied to these AC score profiles and the original gene expression data (only the expressions for TF coding genes are selected from the microarray data) to investigate the existences of TFs that fluctuate periodically in the activity and in the expression level, respectively.
As shown in Figure 1, the average periodogram exhibits a dominate peak at the Fourier frequency of 0.11 in both the inferred AC score data (Figure 1A) and the gene expression data (Figure 1B), suggesting the existences of periodical components. The Fourier frequency 0.11 corresponds to the genuine frequency of the cell cycle, since the microarray data covers exactly two cell cycles by 18 time points. These results indicate that the existences of the cell cycle regulated TFs can be detected at both the activity level and the expression level. Nevertheles, when we test the significance of periodicity for each of the TFs using the Fisher's G-test, we find that much more TFs show periodic fluctuation in the activity level (Figure 1C) than in the gene expression level (Figure 1D). At the 0.01 significance level, we identify 42 CCRTFs based on the inferred TF activity profiles, whereas only 4 TFs are found to be cell cycle regulated based on the original expression data. These results confirm our hypothesis that cell cycle regulation of TFs takes place mainly at the activity level rather than at the gene expression level and as a consequence the CCRTFs should be be identified based on their activities.
We note that the original microarray cell cycle data contains four separate experiments, corresponding to four different cell cycle synchronization method: α-factor arrest, temperature arrest by cdc15 mutant, temperature arrest by cdc28 mutant, and elutriation, respectively. Our analysis indicates that the α-factor arrest data achieves the highest quality, since the average periodogram for the inferred TF activity profiles from the other three data sets [see Additional File 1] exhibit one or multiple un-expected peaks, which may result from noise or the effect of synchronization treatment to the TF activities. Therefore, in this paper we focus our analysis on the microarray cell cycle data synchronized by α-factor arrest.
Cell cycle regulated TFs
We identify 42 CCRTFs that are periodically activated in a cell cycle dependent manner at the 0.01 significance level, among which 23 have been experimentally verified according to previous studies [2, 22–24]. The detail information for these CCRTFs is listed in Table 1. As shown, these CCRTFs include the well established cell cycle regulatory transcription factors: Mbp1, Swi4, Swi6, Mcm1, Fkh2 and Ndd1 .
Figure 2 demonstrates the inferred activity profiles and the original gene expression profiles of four CCRTFs identified by our method. In these examples, the activity profiles exhibit an apparent signature of periodicity across the cell cycle, whereas at the gene expression level the periodicity can not be detected. According to our permutation results, an AC score greater than 10 or less than -10 suggests a strong evidence of TF activation or repression. As shown, the activities of these four TFs vary dramatically during the cell cycle, indicating that they may play important regulatory roles in certain stages of the cell cycle. However, their expression changes in the original microarray data are often neglectable and do not show substantial variation across the cell cycle.
In the 42 CCRTFs identified by our method, 19 have not been experimentally validated to be cell cycle regulated. But several of them have been reported as putative cell cycle TFs in several computational studies. For example, Bas1 and Gat3 have been identified as cell cycle TFs by Tsai et al. ; Abf1, Gat3, and Nrg1 have been identified by Wu et al. ; Hap2 has been identified by Yang et al. ; Bas1 and Spt2 have been identified by Cokus et al. . Further investigation of these TFs might provide new insight into the mechanisms about how the cell cycle is regulated or how the cell cycle regulates other cellular processes. The yeast cell cycle can be divided into five different phases: M/G1, G1, S, G2 and M. In the microarray cell cycle data, each of the five stages corresponds to 4 (M/G1, G1, S, and M) or 2 (G2) time points. We calculate the average AC scores of each TF over the time points corresponding to a common phase as shown in Table 1. The phase-specific AC scores reflect the cell cycle behavior of a TF. For example, our results indicate that Fkh2 achieves the highest activity at M phase (average AC score is 18.1) and the lowest activity at G1 phase (average AC score is -11.4); Hir2 achieves the highest activity at S phase (average AC score is 16.5) and the lowest activity at M/G1 phase (average AC score is -12.5); Mbp1 achieves the highest activity at G1 phase (average AC score is 22.2) and the lowest activity at M phase (average AC score is -17.4). All these inferences are consistent with the well established knowledge that Fhk2, Hir2 and MBP1 are M phase, S phase and G1, M phase specific transcription factor, respectively. Moreover, we also estimate the active phase of each CCRTF using the statistical method described in subsection "Phase estimation" of the "Method". These 42 CCRTFs are categorized into 9 groups: k = 0, 1,...,8. The groups k = 0, 1, the groups k = 2, 3, the groups k = 4, 5, the group k = 6, and the groups k = 7, 8 achieve maximum activity at phase M/G1, G1, S, G2 and M, respectively. The statistically estimated phases for these 42 CCRTFs are also consistent with the established knowledge about these transcription factors. We apply clustering analysis to the inferred activity profiles of the 42 CCRTFs using the hierarchical clustering method . As shown in Figure 3, the TFs that are activated in common phases of the cell cycle tend to be clustered together. For example, previous studies indicate that the complex SBF formed by Swi4 and Swi6 and the complex MBF formed by Swi6 and Mbp1 regulate the expression of late G1 genes [25, 27, 28]; Mcm1, together with Fkh1 or Fkh2, recruits the Ndd1 protein in late G2, and controls the transcription of G2/M genes [25, 28, 29]. Consistently, as shown in Figure 3, Swi4, Swi6 and Mbp1 follow into the same cluster that exhibits the highest activity at phase G1; Mcm1, Fhk2 and Ndd1 follows into the same cluster that exhibit the highest activity at phase M. Other than the hierarchical clustering, the above described phase estimation inherently provides a clustering method for the 42 CCRTFs, which categorize them into 9 groups: k = 0, 1,...,8. The image presentation of these 9 groups is shown in the supplementary documents [see Additional File 2].
Our method fails to identify several known cell cycle TFs, including Ace2 and Fkh1. It turns out that the activities of Ace2 does fluctuate periodically across the cell cycle (q-value = 0.010), but the period calculated by the periodogram is different from the genuine period of the cell cycle. For the Fkh1, although its activity profile shows an obvious fluctuation in each of the two cell cycles [See Additional File 3], the activities in the two cell cycles are not consistent with each other and thereby it fails to pass the Fisher's G-test (q-value = 0.19). This may be caused by noise or by the influence of synchronization treatment to the activity of transcription factors, which we will discuss in more detail in the next section.
Influence of the cell cycle synchronization methods
In order to measure gene expression during the cell cycle, the yeast cells must be synchronized using certain techniques, such as α-factor arrest and temperature arrest. However, these synchronization techniques may perturb the cell status and result in activity modification of TFs . As a consequence, the periodic activity fluctuation of some CCRTFs may be perturbed and can not be detected. Figure 4 shows the effect of the α-factor to the activity of four different TFs. As shown in Figure 4A–C, Dig1, Ste12 and Tec1 exhibit extraordinary high activities at the initiation of the time series after releasing from the α-factor arrest (AC scores are 22.3, 31.9 and 16.1, respectively). For Ace2, although its activity is only moderately up-regulated at the initiation of the time series (AC scores is 6.8), the periodicity of its activity profiles is perturbed by the α-factor treatment. As shown in Figure 4D, the activity profile of Ace2 exhibits quite different patterns in the two consecutive cell cycles. In fact, Dig1, Ste12 and Tec1 are transcription factors that are activated by the MAP kinase signaling cascade and involved in the regulation of genes in mating or pseudohyphal/invasive growth pathways [31–33]. α-factor pheromone is the activator of the MAP kinase pathway , so it is not surprising to see the up-regulation of Dig1, Ste12, and Tec1 by the α-factor treatment.
According to previous studies, Ste12, Tec1, and Ace2 are all regulated in a cell cycle dependent manner. However, the α-factor treatment used to synchronize the cell cycle perturbs the periodic activity fluctuation of these TFs in the cell cycle time series, and as a consequence these CCRTFs can not be identified by our method. The statistical method proposed by Tsai et al. may avoid this problem, since it is not based on the test of periodicity . According to their method, a TF is reported to be cell cycle regulator if the activity of the TF does not keep constant in the cell cycle and there exists at least one phase in the cell cycle where the TF is significantly up- or down-regulated. This method may result in some false positive findings. For example, the cell cycle regulator Dig1 identified by Tsai et al is obvious not a CCRTF according to its activity profile shown in Figure 4A.
Cell cycle associated regulatory motifs
TF activities in the cell cycle can be inferred by integrating the microarray expression data with systematic ChIP-chip data. For most species, however, the large-scale ChIP-chip data is still not available. In this case, how to identify the CCRTFs? Here we provide an alternative strategy by identifying the cell cycle associated regulatory motifs. We first identify all the putative motifs that are significantly enriched in the promoter regions of all yeast genes. By integrating the systematic motif discovery data with the microarray cell cycle data, we calculate the AC score for each motif at each time point of the microarray cell cycle data. Suppose that there exists a DNA binding protein (e.g. a TF) for each motif, the AC score of a motif reflects the activity of its DNA binding protein at a time point. We define a motif as cell cycle associated regulatory motif, if its AC score profile fluctuates periodically across the cell cycle.
We identify 537 putative motifs and calculate their AC score profiles in the cell cycle. Figure 5A shows the average periodogram for the AC score profiles of these 537 motifs. As shown, it exhibits a dominate peak at the Fourier frequency 0.11, the genuine cell cycle frequency. We test the significance of periodicity for these profiles and the distribution of their p-values is shown in Figure 5B. Among these 537 putative motifs, 97 are found to be cell cycle associated at the 0.01 significance level according to our analysis (FDR = 0.04). In these 537 putative motifs, 46 can be associated with known TFs according to previous literatures, including 12 cell cycle TFs: Ace2, Fkh1, Hir2, Mbp1, Mcm1, Ndt80, Rpn4, Skn7, Ste12, Swi4, Ume6, and Xbp1. In the 46 motifs with known TFs, 8 are found to be cell cycle associated motifs according to our results. These 8 motifs are known to be regulated by Mbp1, Swi4, Hir2, Ndt80, Rpn4, Skn7, Abf1, and Mcm1, respectively. Among them, 7 are experimental verified cell cycle TFs and the other one, Abf1, is reported to be cell cycle TF in the computational analysis performed by Wu et al. . Detail information for these motifs is listed in the supplementary documents [see Additional File 4] and the inferred activity profiles of them are shown in Figure 6. Note that we miss 5 cell cycle TFs: Ace2, Fkh1, Ste12, Ume5, and Xbp1. Our analysis indicates that the activity of Ace2 fluctuates periodically across the cell cycle (p-value = 0.0011), but the estimated Fourier frequency is not equal to the cell cycle frequency. This is consistent with the result obtained using ChIP-chip data. For the other four missed cell cycle TFs: Fkh2 (p-value = 0.025), Xbp1 (p-value = 0.015), Ste12 (p-value = 0.076), and Ume6 (p-value = 0.067), the p-values from periodicity test are relatively small, although they do not pass our criteria. Overall, these results suggest that our method is able to identify CCRTFs with high accuracy by integrating microarray expression data with the motif discovery data from pure in-silico sequence analysis.
To test whether a TF is cell cycle regulated, Tsai et al. examine if there exists at least one phase in the cell cycle where the TF is significantly activated by comparing the expression levels of its target genes with those of non-target genes . In this paper, we propose a two-step method: infer the activity profiles of a TF in the cell cycle and then test the significance of its periodicity. In comparison with the method suggested by Tsai et al., our method has the following advantages: (1) it need not to pre-define the target and non-target gene sets for each TF based on more or less arbitrarily selected threshold setting and is thereby easier to be implemented; (2) it deals with TF activity perturbation caused by synchronization treatment in a more conservative manner, and thereby avoid some false positive findings; (3) it can detect the CCRTFs that are periodically but moderately activated during the cell cycle. On the other hand, there is also a limitation for this method: synchronization treatments may perturb the periodic activity fluctuation of some TFs and result in false negatives. In order to achieve more accurate results, it may be useful to combine these two methods. For example, after the implementation of our method, we can further examine the AC scores of those non-CCRTFs at the initial time point. If the initial AC score of a TF is significantly high or low, its activity may be perturbed by the synchronization treatment. We therefore exclude one or more initial time points and apply the Tsai et al.'s method to the remaining time points to examine again whether it is cell cycle regulated.
Several methods have proposed to infer the TF activities from microarray expression data by integrating it with ChIP-chip data or motif information [35–40]. All these methods assume a linear relationship between gene expressions and TF-gene binding affinities (ChIP-chip data), motif occurrences, or motif matching-scores, which may not be valid considering the high complexity of gene transcriptional regulation. Moreover, some of these methods, such as gNCA, require a complicated pre-processing step and some constraints. In contrast, our method does not require the linear assumption and easy to be implemented. It is exciting to see that the CCRTFs can be identified with high accuracy by integrating microarray cell cycle data with the in-silico motif discovery data. Microarray cell cycle data sets are available for several species, including fission yeast, mouse and human. On the other hand, systematic ChIP-chip data are still not available for most species other than the budding yeast. Our method may be applied to the microarray cell cycle data in these species to identify the all the cell cycle associated regulatory motifs. Then experimental techniques or computational methods can be used to associate these motifs with TFs. This may lead to systematic identification of the CCRTFs in these species.
In conclusion, we present a method to identify CCRTFs by integrating microarray cell cycle data with systematic ChIP-chip data or motif discovery data. For each TF, the method infers its activity profile across the cell cycle and then tests the significance of periodicity of the profile. Application of this method to the yeast microarray cell cycle data and the ChIP-chip data results in identification of 42 CCRTFs, among which 23 have been experimentally verified. Importantly, reliable results are also achieved by integrating the microarray expression data with the in-silico motif discovery data. The method provides a useful tool to investigate the cell cycle transcriptional regulation, especially in those species where large-scale ChIP-chip data are not available.
Microarray cell cycle data and ChIP-chip data
In this study, we utilize the microarray cell cycle data published by Spellman et al., which contains expression profiles of 6178 yeast genes during the cell cycle . RNA samples were collected every 7 minutes after α-factor synchronization and hybridized with cDNA microarray, resulting in the microarray time series data with 18 time points covering two cell cycles. During hybridization, RNA from asynchronous cultures was used as control. We represent the microarray cell cycle data by a matrix E (referred to as expression matrix) with 6178 rows and 18 columns, each row corresponding to a gene and each column corresponding to a time point. The values in the expression matrix E are log ratios. In the systematic ChIP-chip experiment performed by Harbison et al., in vivo binding sites across the whole genome for 203 yeast TFs were identified under YPD condition . Each TF-gene pair was assigned an occupancy ratio which reflects binding affinity of the TF to the promoter region of the gene. Roughly, larger occupancy ratios suggest strong binding affinities. We also represent the ChIP-chip occupancy ratio data by a matrix B (referred to as binding score matrix) with 6229 rows and 203 columns, each row corresponding to a gene and each column corresponding to a TF. Note that the occupancy ratios in the binding score matrix B are in their original scale without log transformation.
Inferring TF activities based on the ChIP-chip data
We use the BASE method proposed by Cheng et al. to infer the activities of transcription factors at each time point of the cell cycle data . Given an expression profile e = (e1, e2,...,e N ) (e.g. a column corresponding to a specific time point in the expression matrix E) and a binding profile b = (b1, b2,...,b N ) (e.g. a column corresponding to a specfic TF in the binding score matrix B), where N is the number of common genes in E and B, we infer the activity of this TF in this time point using the following method. First, we sort the expression vector e in the decreasing order to obtain a sorted vector e' = (e(1), e(2),...,e(N)). We also rearrange the binding vector b into b' = (b(1), b(2),...,b(N)) according to the gene orders in the sorted expression vector e'. Second, we combine the two vectors e' and b' into an increasing function f(i) defined as following:
where 1 ≤ i ≤ N. Meanwhile we calculate another increasing function g(i) for e' itself as
Third, we find out the i max that achieves the maximum deviation between f (i) and g(i), that is, . Then a pre-score is defined as
ps* = f(i max ) - g(i max ).
Fourth, we permute the reordered binding vector b' K times (K = 10,000), and recalculate the pre-scores by replacing b' in Formula (1) with each of the permutated binding vectors, resulting in a permutated pre-score vector psperm= (ps1, ps2,...,psK). Finally, we define an activity change score (AC score) as
where MEAN(psperm) is the mean of pspermand SD(|psperm|) is the standard deviation of the absolute values of psperm. The AC score reflects the relative activity of the considered TF in the synchronized sample with respect to the non-synchronized control at this time point.
We calculate the AC scores for all the 203 TFs available from the ChIP-chip data in each of the 18 time points of the cell cycle data. As a result, for each TF we obtain an activity profile across the cell cycle, and we then perform periodicity analysis for these activity profiles. The C++ program for AC score calculation is available for download at .
We use two criteria to determine whether a TF is cell cycle regulated: (1) whether the TF activities fluctuate periodically across the cell cycle; (2) whether the period of its activity fluctuation matches the cell cycle period. To implement these two criteria, we apply the periodicity analysis techniques introduced by Wichert et al. to the inferred AC score data containing activity profiles for 203 yeast TFs. .
First, we use the average periodogram to assess the presence of TFs that are periodically expressed in the original microarray expression data or periodically fluctuate in activities in the inferred TF AC score data. The average periodogram is a simple extension of the standard periodogram. The standard periodogram is a tool to analyze a single time series: if a time series (e.g. a TF expression profile or an activity profile across the cell cycle) contains significant sinusoidal component, then the periodogram would exhibit a peak at a specific frequency. Average periodogram is the average of multiple time series with a common block size, e.g. the average of periodograms for all genes in a microarray time series data (each gene corresponds to a time series). We use average periodogram to assess the existence of periodically expressed genes in the microarray cell cycle data. If there are a few genes exhibiting strong periodicity, the periodograms of them would dominate the average periodogram and exhibit a visible peak. Since all the periodically expressed genes should have the same frequency, i.e. the cell cycle frequency, we would expect to see a single peak in the average periodogram for microarray cell cycle data. In practice, however, we may observe one or more small peaks other than the main peak, which are caused by the noise in the data. Therefore, the pattern of peaks in average periodogram can reflects the quality of microarray cell cycle data.
Second, we use the Fisher's G-test to examine whether the expression or activity of a TF behaves like a pure random process or whether it shows a periodical pattern. The significance of periodicity was tested for all the 203 inferred TF activity profiles and a p-value is assigned to each TF. Third, we calculate the q-values for those p-values using a method of FDR to correct for multiple testings. Finally, for each significant TF, we use the standard periodogram to estimate its period and compare the estimated period with the known cell cycle period. With the above described procedure, we identify 42 CCRTFs at the 0.01 significance level (FDR < 0.04) based on the above inferred AC score data. To implement these techniques, the R package GeneTS is used.
Let us model the periodic activity profile of a TF as following:
where β is a positive constant, T represents the period, and k represents the phase. Since the microarray data covers exactly two cell cycles by 18 time points, T = 9 and the phase k takes an integral value from 0 to 8. The phase for a TF can be estimated using the following formula:
We estimate the phase for each of the 42 identified CCRTFs, which basically categorized them into 9 groups. For example, according to our results, the cell cycle TF, Fkh2, belongs to the group with phase k = 7, which achieves the maximum activity at M phase in the cell cycle.
Motif discovery data
In a systematic analysis performed by Beer et al. , 666 potential regulatory motifs that are significantly enriched in the promoter regions (the DNA sequences from translation initiation site up to 800 bp upstream) of all yeast genes using the AlignACE software [42, 43]. The occurrences of each motif in the promoter region of each gene are then determined as those with matching-scores larger than 0.5. We download this motif discovery data from . After removing redundancy, we select 539 motifs from these 666 putative motifs, in which 46 are associated with known transcriptional factors according to literatures. Based on this data, we define a matching-score matrix M, which contains 6328 rows each corresponding to a yeast gene, and 539 columns each corresponding to a putative motif. The element of M is the aggregated matching-score of a motif in the up-stream region of a gene, that is, the matching-scores of all the occurrences for the same motif are aggregated in case of multiple occurrences. When no occurrence is found in the upstream region of a gene, the score is set to 0. The aggregated matching-score, to some extent, reflects the binding affinity of a motif to the promoter of a gene.
Identification of cell cycle associated motifs
For each putative motif we assume there is a TF that binds it to regulate gene expression. By integrating the microarray gene expression matrix E with the matching-score matrix M, we can infer the activity profile of the TF corresponding to a putative motif in the cell cycle. The method is similar to the one used to infer TF activity profile by integrating microarray cell cycle data with ChIP-chip data, except that the binding score matrix B obtained from ChIP-chip experiment is replaced by the matching-score matrix M obtained from in-silico sequence analysis.
We then perform the above described periodicity analysis and estimate the active phase for each imaginary TF of the putative motifs. According to the results from this analysis, we identify the cell cycle associated motifs which are potential regulatory sites for some CCRTFs.
Availability and requirements
Project name: BASE
Project homepage: http://leili-lab.cmb.usc.edu/yeastaging/projects/project-base/
Operating system: Microsoft Windows XP system
Programming language: Visual C++ 6.0
Other requirements: None
License: The tool is available free of charge.
Any restrictions to use by non-academics: None
Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJea: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell. 1998, 2: 65-73. 10.1016/S1097-2765(00)80114-8.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273-3297.
Rustici G, Mata J, Kivinen K, Lio P, Penkett CJ, Burns G, Hayles J, Brazma A, Nurse P, Bahler J: Periodic gene expression program of the fission yeast cell cycle. Nat Genet. 2004, 36: 809-817. 10.1038/ng1377.
Oliva A, Rosebrock A, Ferrezuelo F, Pyne S, Chen H, Skiena S, Futcher B, Leatherwood J: The cell cycle-regulated genes of Schizosaccharomyces pombe. PLoS Biol. 2005, 3: e225-10.1371/journal.pbio.0030225.
Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002, 13: 1977-2000. 10.1091/mbc.02-02-0030..
Claridge-Chang A, Wijnen H, Naef F, Boothroyd C, Rajewsky N, Young MW: Circadian regulation of gene expression systems in the Drosophila head. Neuron. 2001, 32: 657-671. 10.1016/S0896-6273(01)00515-3.
Johansson D, Lindgren P, Berglund A: A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics. 2003, 19: 467-473. 10.1093/bioinformatics/btg017.
Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics. 2004, 20: 5-20. 10.1093/bioinformatics/btg364.
Luan Y, Li H: Model-based methods for identifying periodically expressed genes based on time course microarray gene expression data. Bioinformatics. 2004, 20: 332-339. 10.1093/bioinformatics/btg413.
Storch KF, Lipan O, Leykin I, Viswanathan N, Davis FC, Wong WH, Weitz CJ: Extensive and divergent circadian gene expression in liver and heart. Nature. 2002, 417: 78-83. 10.1038/nature744.
Yang VW: Eukaryotic transcription factors: identification, characterization and functions. J Nutr. 1998, 128: 2045-2051.
de Lichtenberg U, Wernersson R, Jensen TS, Nielsen HB, Fausboll A, Schmidt P, Hansen FB, Knudsen S, Brunak S: New weakly expressed cell cycle-regulated genes in yeast. Yeast. 2005, 22: 1191-1201. 10.1002/yea.1302.
Tsai HK, Lu HH, Li WH: Statistical methods for identifying yeast cell cycle transcription factors. Proc Natl Acad Sci USA. 2005, 102: 13532-13537. 10.1073/pnas.0505874102.
Yang YL, Suen J, Brynildsen MP, Galbraith SJ, Liao JC: Inferring yeast cell cycle regulators and interactions using transcription factor activities. BMC Genomics. 2005, 6: 90-10.1186/1471-2164-6-90.
Wang L, Chen G, Li H: Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007, 23: 1486-1494. 10.1093/bioinformatics/btm125.
Banerjee N, Zhang MQ: Identifying cooperativity among transcription factors controlling the cell cycle in yeast. Nucleic Acids Res. 2003, 31: 7024-7031. 10.1093/nar/gkg894.
Cokus S, Rose S, Haynor D, Grønbech-Jensen N, Pellegrini M: Modelling the network of cell cycle transcription factors in the yeast Saccharomyces cerevisiae. BMC Bioinformatics. 2006, 7: 381-10.1186/1471-2105-7-381.
Wu WS, Li WH, Chen BS: Computational reconstruction of transcriptional regulatory modules of the yeast cell cycle. BMC Bioinformatics. 2006, 7: 421-10.1186/1471-2105-7-421.
Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431: 99-104. 10.1038/nature02800.
Beer MA, Tavazoie S: Predicting gene expression from sequence. Cell. 2004, 117: 185-198. 10.1016/S0092-8674(04)00304-6.
Cheng C, Yan X, Sun F, Li LM: Inferring activity changes of transcription factors by binding association with sorted expression profiles. BMC Bioinformatics. 2007, 8: 452-10.1186/1471-2105-8-452.
Zhao LP, Prentice R, Breeden L: Statistical modeling of large microarray data sets to identify stimulus-response profiles. Proc Natl Acad Sci USA. 2001, 98: 5631-5636. 10.1073/pnas.101013198.
Breeden LL: Periodic transcription: a cycle within a cycle. Curr Biol. 2003, 13: R31-R38. 10.1016/S0960-9822(02)01386-6.
Cherry JM, Ball C, Weng S, Juvik G, Schmidt R, Adler C, Dunn B, Dwight S, Riles L, Mortimer RK, Botstein D: Genetic and physical maps of Saccharomyces cerevisiae. Nature. 1997, 387: 67-73. 10.1038/43025.
Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Wyrick JJ, Zeitlinger J, Gifford DK, Jaakkola TS, Young RA: Serial regulation of transcriptional regulators in the yeast cell cycle. Cell. 2001, 106: 697-708. 10.1016/S0092-8674(01)00494-9.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.
Sidorova J, Breeden L: Analysis of the SWI4/SWI6 protein complex, which directs G1/S-specific transcription in Saccharomyces cerevisiae. Mol Cell Biol. 1993, 13: 1069-1077.
Bahler J: Cell-cycle control of gene expression in budding and fission yeast. Annu Rev Genet. 2005, 39: 69-94. 10.1146/annurev.genet.39.110304.095808.
Koranda M, Schleiffer A, Endler L, Ammerer G: Forkhead-like transcription factors recruit Ndd1 to the chromatin of G2/M-specific promoters. Nature. 2000, 406: 94-98. 10.1038/35017589.
Shedden K, Cooper S: Analysis of cell-cycle-specific gene expression in human cells as determined by microarrays and double-thymidine block synchronization. Proc Natl Acad Sci USA. 2002, 99: 4379-4384. 10.1073/pnas.062569899.
Olson KA, Nelson C, Tai G, Hung W, Yong C, Astell C, Sadowski I: Two regulators of Ste12p inhibit pheromone-responsive transcription by separate mechanisms. Mol Cell Biol. 2000, 20: 4199-4209. 10.1128/MCB.20.12.4199-4209.2000.
Bardwell L, Cook JG, Voora D, Baggott DM, Martinez AR, Thorner J: Repression of yeast Ste12 transcription factor by direct binding of unphosphorylated Kss1 MAPK and its regulation by the Ste7 MEK. Genes Dev. 1998, 12: 2887-2898. 10.1101/gad.12.18.2887.
Roberts RL, Fink GR: Elements of a single MAP kinase cascade in Saccharomyces cerevisiae mediate two developmental programs in the same cell type: mating and invasive growth. Genes Dev. 1994, 8: 2974-2985. 10.1101/gad.8.24.2974.
Herskowitz I: MAP kinase pathways in yeast: for mating and more. Cell. 1995, 80: 187-197. 10.1016/0092-8674(95)90402-6.
Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP: Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci USA. 2003, 100: 15522-15527. 10.1073/pnas.2136632100.
Alter O, Golub GH: Integrative analysis of genome-scale data by using pseudoinverse projection predicts novel correlation between DNA replication and RNA transcription. Proc Natl Acad Sci USA. 2004, 101: 16577-16582. 10.1073/pnas.0406767101.
Gao F, Foat BC, Bussemaker HJ: Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics. 2004, 5: 31-10.1186/1471-2105-5-31.
Boulesteix AL, Strimmer K: Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor Biol Med Model. 2005, 2: 23-10.1186/1742-4682-2-23.
Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using correlation with expression. Nat Genet. 2001, 27: 167-171. 10.1038/84792.
Conlon EM, Liu XS, Lieb JD, Liu JS: Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci USA. 2003, 100: 3339-3344. 10.1073/pnas.0630591100.
Program for AC score calculation using the BASE method. [http://leili-lab.cmb.usc.edu/yeastaging/projects/project-base/]
Roth FP, EPCG Hughes JD: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol. 1998, 16: 939-945. 10.1038/nbt1098-939.
Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296: 1205-1214. 10.1006/jmbi.2000.3519.
Motif occurrence in yeast genes. [http://genomics.princeton.edu/tavazoie/Supplementary%20Data.htm]
The work is supported by the grant R01 GM75308-01 from NIH. This work is partially supported by the Center of Excellence in Genome Science at University of Southern California, the NIH P50 HG002790 grant. We thank Dr Luan, Yihui and Dr Ma, Xiaotu for discussions and suggestions.
CC designed the method, wrote the code, carried out the analysis, and drafted the manuscript. LL participated in design and coordination of the study. Both authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Average periodograms for 203 inferred TF AC score profiles in four yeast microarray cell cycle data sets. In these data sets, yeast cells are synchronized by α-factor arrest, temperature arrest by cdc15 mutant, temperature arrest by cdc28 mutant, and elutriation, respectively. (EPS 7 KB)
Additional file 2: Visualization of inferred AC score profiles for 42 CCRTFs according to their statistical phase k. (EPS 225 KB)
Authors’ original submitted files for images
About this article
Cite this article
Cheng, C., Li, L.M. Systematic identification of cell cycle regulated transcription factors from microarray time series data. BMC Genomics 9, 116 (2008). https://0-doi-org.brum.beds.ac.uk/10.1186/1471-2164-9-116
- Cell Cycle
- Putative Motif
- Original Gene Expression
- Microarray Time Series Data
- Synchronization Treatment