- Research article
- Open Access
GC-compositional strand bias around transcription start sites in plants and fungi
© Fujimori et al; licensee BioMed Central Ltd. 2005
- Received: 27 August 2004
- Accepted: 28 February 2005
- Published: 28 February 2005
A GC-compositional strand bias or GC-skew (=(C-G)/(C+G)), where C and G denote the numbers of cytosine and guanine residues, was recently reported near the transcription start sites (TSS) of Arabidopsis genes. However, it is unclear whether other eukaryotic species have equally prominent GC-skews, and the biological meaning of this trait remains unknown.
Our study confirmed a significant GC-skew (C > G) in the TSS of Oryza sativa (rice) genes. The full-length cDNAs and genomic sequences from Arabidopsis and rice were compared using statistical analyses. Despite marked differences in the G+C content around the TSS in the two plants, the degrees of bias were almost identical. Although slight GC-skew peaks, including opposite skews (C < G), were detected around the TSS of genes in human and Drosophila, they were qualitatively and quantitatively different from those identified in plants. However, plant-like GC-skew in regions upstream of the translation initiation sites (TIS) in some fungi was identified following analyses of the expressed sequence tags and/or genomic sequences from other species. On the basis of our dataset, we estimated that >70 and 68% of Arabidopsis and rice genes, respectively, had a strong GC-skew (>0.33) in a 100-bp window (that is, the number of C residues was more than double the number of G residues in a +/-100-bp window around the TSS). The mean GC-skew value in the TSS of highly-expressed genes in Arabidopsis was significantly greater than that of genes with low expression levels. Many of the GC-skew peaks were preferentially located near the TSS, so we examined the potential value of GC-skew as an index for TSS identification. Our results confirm that the GC-skew can be used to assist the TSS prediction in plant genomes.
The GC-skew (C > G) around the TSS is strictly conserved between monocot and eudicot plants (i.e. angiosperms in general), and a similar skew has been observed in some fungi. Overall, highly-expressed Arabidopsis genes had a more marked GC-skew in the TSS than genes with low expression levels. We therefore propose that the GC-skew around the TSS in some plants and fungi is related to transcription. It might be caused by mutations during transcription initiation or the frequent use of transcription factor-binding sites having a strand preference. In addition, GC-skew is a good candidate index for TSS prediction in plant genomes, where there is a lack of correlation among CpG islands and genes.
- Transcription Start Site
- Fungal Genomic Sequence
- Eudicot Plant
- Actual Transcription Start Site
- Transcription Start Site Prediction
A prominent GC-compositional strand bias or GC-skew (=(C-G)/(C+G)), where C and G denote the numbers of cytosine and guanine residues, was reported recently around the transcription start sites (TSS) of Arabidopsis genes. It is well known that GC-skews occur bi-directionally in circular bacterial genomes along the direction of replication, and that GC-skew is an effective index for predicting the replication origin in some bacteria [2, 3]. In this case, the numbers of G and T (thymine) residues in the leading strand of these genomes exceed those of C and A (adenine). Several models have been proposed to explain this bias. A similar strand bias related to the direction of replication has also been observed in human mitochondrial genomes. Also, mammalian and enterobacterial genomes have been reported to show a strand bias associated with transcribed regions [6–8]. An excess of G+T over A+C was observed in mammals within the sense strand of genes. A transcription-coupled DNA-repair system might be involved in this bias. However, existing models cannot explain the excess of C over G in the sense strand around the TSS in Arabidopsis. Although slight GC-skews (regardless of the direction) were reported recently around the TSS in metazoans [10, 11], it remains unclear whether the strong GC-skew of Arabidopsis is similar to that observed in metazoans.
Although large amounts of genomic and full-length cDNA sequence data from plants are now publicly available, knowledge of the promoters and TSS in plants is still limited compared to mammals, such as human and mouse. It has been reported that the CpG island is the most effective index for predicting the promoter regions or TSS in mammals [13, 14]. However, the CpG islands are not specifically located in the promoter regions in Arabidopsis, so they cannot be used for the prediction of TSS or promoters. Identifying another, more suitable, index for the prediction of plant-specific TSS has therefore become a priority.
Answers are required for two key issues. First, which eukaryotic species, phyla or kingdoms have Arabidopsis-like GC-skews around the TSS? Second, what is the biological significance of these regions? In the present study, we used sequences from various animal, fungus, protist and plant species to conduct comparative analyses. We explored the potential value of GC-skew as an index for TSS prediction in plants. Finally, we considered the biological meaning of the GC-skew around the TSS of plant genes.
GC-skew around the TSS of plant genes
We also determined the frequencies of the four nucleotides in the regions up- and downstream of the TSS in plants (see Fig. S1 in Additional file 1). High C-residue frequencies were observed in both Arabidopsis and rice (approximately 50 to 100 bp from the TSS). In contrast, G-residue frequencies decreased slightly around the TSS (+/-10 bp). These findings suggest that the peaks of GC-skew values observed near the plant TSS were caused by an increased frequency of C residues and a slight reduction in the G-residue frequency. No significant biases were observed in the frequencies of A and T residues, indicating a lack of AT-skew at the TSS (see Fig. S2 in Additional file 1).
Our results suggest that many plant (Arabidopsis and rice) genes have strong GC-skews around the TSS. Furthermore, this characteristic is common among both monocot and eudicot plants. In contrast, the GC-skews (including the C < G skew) that were observed in human and Drosophila were qualitatively and quantitatively different from those identified in plants.
GC-skew in eukaryotes
Table: GC-skew in various eukaryotes (recovered column heading: No. of sequences; table body not available)
Correlation between the two nucleotide frequencies in plants
GC-skew and gene-expression level
To examine the relationship between GC-skew at the TSS and gene expression in plants, we conducted a statistical test using serial analysis of gene expression (SAGE) data for Arabidopsis (10-day-old seedlings). The mean GC-skew value in the TSS of highly-expressed genes (see Methods section for details) was significantly higher (P = 0.0003, paired t-test) than in genes with low expression in Arabidopsis: the mean values were 0.25 (standard deviation = 0.31) and 0.19 (standard deviation = 0.31), respectively.
Potential value of GC-skew as an index for TSS prediction
Many of the GC-skew peaks were preferentially located near the TSS; therefore, we assessed the potential value of GC-skew as a predictive index for TSS in plants. Sequences 1-kb upstream of the predicted ORF start positions in plant genomic sequences were used in the following analysis.
First, the GC-skew values were computed using the sliding-window technique, in which the window size and the shift size were set to 100 and 1 bp, respectively. GC-skew peaks satisfying a particular cut-off value were identified as primitive TSS candidates from the noisy GC-skew spectrum using the Savitzky-Golay (S-G) filter, which simultaneously smooths and differentiates. Next, TSS candidates that were located within 50 bp of another candidate were considered to be identical and were merged into the position with the highest GC-skew. TSS prediction was validated by counting as true positives (TP) the candidates that were located within 100-bp up- or downstream of the actual TSS. If more than two candidates coexisted in the appropriate region, they were regarded as one TP.
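The sliding-window computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`gc_skew`, `gc_content`, `sliding_gc_skew`) are hypothetical, and returning `None` for a window without any C or G residue is an assumption, since the text does not say how such windows were handled.

```python
def gc_skew(window):
    """GC-skew as defined in the text: (C - G) / (C + G).

    Returns None when the window contains no C or G (assumed behaviour).
    """
    c = window.count("C")
    g = window.count("G")
    return (c - g) / (c + g) if (c + g) else None


def gc_content(window):
    """Fraction of G+C residues in the window."""
    if not window:
        return 0.0
    return (window.count("C") + window.count("G")) / len(window)


def sliding_gc_skew(seq, window=100, step=1):
    """Return (start, GC-skew, G+C content) for every window along seq.

    Defaults follow the text: window size 100 bp, shift size 1 bp.
    """
    seq = seq.upper()
    return [
        (start, gc_skew(seq[start:start + window]), gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

A window in which C residues are exactly twice as frequent as G residues gives a skew of 1/3, which is why a GC-skew above 0.33 corresponds to more than twice as many C as G residues.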
TSS prediction results for a stepwise increase of cut-off values
This study confirmed the presence of significant GC-skews (C > G) around the TSS (or upstream regions of the TIS) of genes in some species of plants and fungi. In contrast, our analysis revealed no significant excess of C residues in either animals or protists. However, an opposite GC-skew (C < G) has been reported previously in several animal species (Mus musculus, Rattus norvegicus, Fugu rubripes, Danio rerio and human). Although small skews were detected for these species in our present analysis (Table 1; Fig. 1), they were not significant compared to those observed in plants. Aerts et al. reported a GC-skew (C > G) close to the TSS in two nematode species (Caenorhabditis elegans and Caenorhabditis briggsae); however, no significant GC-skew was detected in C. elegans in our analysis. This inconsistency was probably due to the fact that our analysis targeted only regions downstream of the TSS in C. elegans (thus, regions upstream were not examined); alternatively, the skew might have been too small to be detected using our method. In either case, it is difficult to compare GC-skews between nematodes and plants, since trans-splicing at the 5'-ends of genes has been reported in nematodes [20, 21], as pointed out by Aerts et al. As noted above, the GC-skews near the TSS in plants and fungi differed from those of other species (or kingdoms) in both quality and quantity. To our knowledge, this is the first report to describe the prominent GC-skew (C > G) around the TSS specific to plants and fungi.
We propose two possible explanations for the GC-skew peaks found close to the TSS. First, regulatory elements, such as transcription factor-binding sites (TFBS), which are present in regions both up- and downstream of the TSS, might contribute to (or even cause) this phenomenon. Moreover, some TFBS have a strand preference (see for example ). Therefore, if these types of TFBS are preferentially located around the TSS of plant and fungal genes, they might influence the local GC-compositional strand bias. Second, the GC-skew might be involved in transcription-coupled events, such as transcription-associated mutational asymmetry [6–9]. Tatarinova and colleagues mentioned such transcription-associated mutational asymmetry and suggested that the GC-skew around the TSS of genes might be caused by the substitution of C residues with T residues, due to C deamination in the template strand. However, this hypothesis cannot fully explain all of the findings of our present study. If the C-to-T transition occurred preferentially around the TSS in the template strand, an AT-skew would also be expected in these regions. However, no significant AT-skew (see Fig. S2 in Additional file 1) was observed at the TSS of either Arabidopsis or rice genes, indicating that the C-to-T transition was not the main cause of the GC-skew. Furthermore, in our analyses, significant changes in the correlation coefficient (decreased A-T/G-C and increased A-G/T-C) were observed in both Arabidopsis and rice (Fig. 4). Assuming that the GC-skew is caused by mutations during transcription initiation, changes in the correlation coefficient might be interpreted as an increase in the transversion ratio around the TSS. The increased negative correlation between C and G, coupled with the high GC-skew value in the same region, led us to speculate that G-to-C transversion occurred at relatively high rates in this region in plants and fungi, but not in animals and protists.
If mutations that yield a GC-skew occur mainly around the TSS of single-stranded DNA, highly-expressed genes would be expected to have a high GC-skew around the TSS. In fact, the mean GC-skew value in the TSS of highly-expressed genes was significantly higher than in genes with low expression in Arabidopsis. This indicates that the GC-skew is associated with the level of gene expression, at least in Arabidopsis. Strand-specific mutational rates are believed to be a by-product of transcription-coupled DNA repair in mammals; therefore, the GC-skew observed around the TSS of plant genes might result from a plant-specific DNA lesion and repair system. Alternatively, if the GC-skew was caused by the higher frequency of strand-specific TFBS, the higher mean GC-skew value in highly-expressed genes might be interpreted as a greater effect of strand-specific TFBS on transcription efficiency. In either case, whether through strand-specific TFBS or mutation, highly-expressed genes appear to have a high GC-skew around the TSS. As a GC-skew was also detected in the upstream regions of the TIS in fungal genes, this feature might be generated by a common mechanism in both groups. However, additional sequence data, and investigations into the patterns of nucleotide substitutions and their rates around the TSS, both between and within groups, will be necessary to elucidate the origin and mechanism of GC-skew around the TSS of plant and fungal genes.
Significant GC-skew (C > G) around the TSS is strictly conserved among monocot and eudicot plants (that is, angiosperms), and a similar skew is also seen in some fungi. The mean GC-skew at the TSS in the highly expressed genes was greater than that in the group with low expression. We therefore propose that the GC-skew around the TSS in some species of plants and fungi is associated with transcriptional activity. This is probably a result of DNA mutations during transcription initiation or the frequent use of strand-specific TFBS. Our findings also confirm that GC-skew has the potential to assist TSS prediction in plant genomes, where there is a lack of correlation among CpG islands and genes.
Data for 13,095 full-length cDNAs from Arabidopsis thaliana were downloaded from The Institute for Genomic Research (TIGR) (as of March 2, 2001) and the RIKEN Arabidopsis Genome Encyclopedia (as of May 9, 2002). Genomic sequences for Arabidopsis were downloaded from The National Center for Biotechnology Information (NCBI) (as of January 31, 2003). ORF information for Arabidopsis genes was obtained from the NCBI Entrez database, based on each of the original sequence IDs. For rice (O. sativa ssp. japonica cv. Nipponbare), 28,469 full-length cDNA sequences were retrieved from the Knowledge-Based Oryza Molecular Biological Encyclopedia (KOME) with ORF information, and the genomic sequences were obtained from the Rice Genome Research Program (as of October 16, 2002) and Syngenta Biotechnology Inc. (SBI). For analysis of human (Homo sapiens) DNA, 21,245 full-length cDNAs and genomic sequences were downloaded from the DNA Data Bank of Japan (DDBJ) and Ensembl (Ver. 9.30), respectively. Data for 9,872 full-length cDNA sequences (as of July 17, 2002) and genomic sequences of Drosophila melanogaster were obtained from the Berkeley Drosophila Genome Project [33, 34]. Fungal genomic sequence data, including ORF information, were downloaded from the NCBI [35, 36] for S. cerevisiae and S. pombe, and from the Fungal Genome Initiative (FGI) for A. nidulans, F. graminearum, M. grisea, and N. crassa. Virtually assembled transcripts from 10 animal species, nine plant species, six species of fungi and 11 protist species were retrieved from the TIGR Gene Indices (TGI) [16, 17].
Mapping of cDNA to the genomic sequences
Redundant cDNA sequences (those showing at least 95% similarity in at least 95% of the regions when compared with other sequences) were excluded from the full-length cDNA dataset for four species: Arabidopsis, rice, human and Drosophila. They were identified using BLASTN homology searching and CAP3. Also, any poly(A) tracts at the 3'-end of the cDNA sequences were eliminated. Finally, by mapping the cDNA sequences to the corresponding genomic sequences using the BLASTN program and SIM4, sequences both up- and downstream of the TSS were determined.
SAGE data for Arabidopsis (10-day-old seedlings) were used to examine the relationship between GC-skew in the TSS and gene-expression levels. In assigning the SAGE tags to genes, only data that showed one-to-one correspondence between the genes and tags were included. Genes with <100 counts per million were defined as the low-expression group (504 genes) and those with >100 per million the high-expression group (689 genes). The mean GC-skew values in a 100-bp window at the TSS were calculated for both groups. Genes with a G+C content at the TSS of <0.3 were eliminated from the dataset in advance.
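The grouping step can be illustrated with a short Python sketch. It assumes each gene is represented by a dictionary with hypothetical keys `tags_per_million`, `gc_skew` and `gc_content`; these names, and the function name `split_by_expression`, are illustrative and do not come from the original analysis. The thresholds are applied literally as stated in the text, so a gene with exactly 100 counts per million falls into neither group.

```python
from statistics import mean, stdev


def split_by_expression(genes, threshold=100.0, min_gc=0.3):
    """Split per-gene GC-skew values into low- and high-expression groups.

    genes: iterable of dicts with hypothetical keys 'tags_per_million',
    'gc_skew' (100-bp window at the TSS) and 'gc_content'.
    Genes with G+C content below min_gc are dropped first.
    """
    low, high = [], []
    for gene in genes:
        if gene["gc_content"] < min_gc:
            continue  # eliminated in advance, as described in the text
        if gene["tags_per_million"] < threshold:
            low.append(gene["gc_skew"])
        elif gene["tags_per_million"] > threshold:
            high.append(gene["gc_skew"])
    return low, high


def summarize(values):
    """Mean and sample standard deviation (needs at least two values)."""
    return mean(values), stdev(values)
```

The two lists returned by `split_by_expression` can then be passed to `summarize` or to any statistical test of choice.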
GC-skew peak detection
As an initial step towards detecting the GC-skew peaks, GC-skew values at each position in the sequences were calculated using the sliding-window technique (window size = 100 bp; shift size = 1 bp). Next, we attempted to smooth the spectrum, to reduce the noise and to simultaneously determine the peaks, using the S-G filter. The first-order derivative of the smoothed GC-skew value, $g_i$, at position $i$ is given by the following equation:
$$g_i = \sum_{n=-n_l}^{n_r} C_n \, f_{i+n}$$
Here, $f_{i+n}$ is the original GC-skew value in the position $i + n$. $n_l$ and $n_r$ represent the lengths of the filter window to the left and right of position $i$, and were set to 20 in this analysis. $C_n$ denotes the set of weight coefficients and corresponds to the first-order derivatives of the quartic polynomial. The position of the GC-skew peak $i$ was defined as the zero-crossing point, which is the point satisfying the following condition:
$$g_i \cdot g_{i+1} \le 0 \;\land\; g_{i+1} < 0$$
The peak was only counted as a TSS candidate if a GC-skew value at that peak satisfied the particular cut-off value, and the G+C content was ≥ 0.3 in the window. When more than one peak occurred within 50 bp, they were regarded as identical and the peak with the highest GC-skew value was accepted. Using this procedure, TSS prediction was conducted with stepwise changes in the cut-off values of the GC-skew (-0.9 to 0.9) at the peak.
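The peak-detection procedure can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the original code: the S-G derivative weights are obtained here by solving the least-squares normal equations for a quartic fit over a window of 20 positions on each side; "satisfying the cut-off" is taken to mean a skew greater than or equal to the cut-off; peaks are merged greedily from left to right; and the G+C content filter is omitted for brevity. All function names are hypothetical.

```python
from fractions import Fraction


def savgol_derivative_coeffs(n_left=20, n_right=20, degree=4):
    """Weights C_n such that sum(C_n * f[i+n]) is the first derivative at i
    of the least-squares polynomial (of the given degree) fitted to the window."""
    offsets = range(-n_left, n_right + 1)
    size = degree + 1
    # Normal-equation matrix sum_n n^(r+c), augmented with the unit vector
    # that selects the linear coefficient (the derivative at n = 0).
    aug = [[Fraction(sum(n ** (r + c) for n in offsets)) for c in range(size)]
           + [Fraction(1 if r == 1 else 0)] for r in range(size)]
    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    x = [aug[r][size] for r in range(size)]
    return [float(sum(x[j] * n ** j for j in range(size))) for n in offsets]


def smoothed_derivative(values, coeffs, n_left=20):
    """g_i = sum_n C_n * f_{i+n}; positions too close to either end stay None."""
    n_right = len(coeffs) - n_left - 1
    out = [None] * len(values)
    for i in range(n_left, len(values) - n_right):
        out[i] = sum(c * values[i - n_left + k] for k, c in enumerate(coeffs))
    return out


def find_peaks(skew, deriv, cutoff, merge_dist=50):
    """Positions i with g_i * g_{i+1} <= 0 and g_{i+1} < 0 whose skew passes
    the cut-off; a peak within merge_dist of the previously kept peak is
    merged into whichever of the two has the higher skew."""
    peaks = []
    for i in range(len(deriv) - 1):
        g0, g1 = deriv[i], deriv[i + 1]
        if g0 is None or g1 is None:
            continue
        if g0 * g1 <= 0 and g1 < 0 and skew[i] >= cutoff:
            if peaks and i - peaks[-1] <= merge_dist:
                if skew[i] > skew[peaks[-1]]:
                    peaks[-1] = i
            else:
                peaks.append(i)
    return peaks
```

Because the fit is quartic, the derivative weights reproduce the exact slope of any polynomial signal up to degree four, which gives a convenient sanity check before applying the filter to real GC-skew profiles.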
The predictive accuracy was verified by counting TSS candidates located within 100-bp up- and downstream of the actual TSS as true positives (TP). Where more than two TSS candidates coexisted within 100-bp either up- or downstream of the TSS, they were counted as one TSS candidate. The rest of the candidates, which were located in inappropriate regions, were counted as false positives (FP). To compare results for different cut-offs, the correlation coefficient (φ), which is one of the measures used to compare predictive performance, was calculated as follows:
$$\varphi = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Here, TN and FN denote the number of true and false negatives, respectively. In this analysis, the total number of negatives (N) was defined as the maximum number of possible TSS candidates in the non-TSS regions. Thus, TN was calculated as N-FP.
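The validation step can be sketched as below, assuming that φ is the standard phi (Matthews-type) correlation coefficient computed from TP, FP, FN and TN, with TN = N - FP as stated above. The function names are hypothetical, and returning 0.0 when the denominator is zero is an assumption made here to avoid a division error.

```python
from math import sqrt


def classify_candidates(candidates, true_tss, tolerance=100):
    """Return (TP, FP) for one gene.

    All candidates within `tolerance` bp of the actual TSS are counted
    together as a single true positive; every other candidate is a
    false positive.
    """
    near = [c for c in candidates if abs(c - true_tss) <= tolerance]
    tp = 1 if near else 0
    fp = len(candidates) - len(near)
    return tp, fp


def phi(tp, fp, fn, n_negatives):
    """Phi correlation coefficient with TN = N - FP."""
    tn = n_negatives - fp
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for a degenerate table
    return (tp * tn - fp * fn) / denom
```

Summing the per-gene TP and FP counts over the dataset, and counting genes without a TP as false negatives, gives the totals needed to evaluate `phi` at each cut-off.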
Correlation coefficient between the two nucleotide frequencies
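The correlation coefficient $r_{ij}(p)$ between nucleotide $i$ and $j$ at a position $p$ up- and downstream of the TSS was defined as follows:
$$r_{ij}(p) = \frac{\sum_{k} \left(i_k(p) - \bar{i}(p)\right)\left(j_k(p) - \bar{j}(p)\right)}{\sqrt{\sum_{k} \left(i_k(p) - \bar{i}(p)\right)^2 \, \sum_{k} \left(j_k(p) - \bar{j}(p)\right)^2}}$$
Here, $i_k(p)$ and $j_k(p)$ are the frequencies of $i$ and $j$ at a position $p$ in the sequence $k$, respectively. $\bar{i}(p)$ and $\bar{j}(p)$ are the mean frequencies of $i$ and $j$ at $p$. The nucleotide frequency in each position was the number of each nucleotide in a 100-bp window.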
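A minimal Python sketch of this calculation follows, assuming the coefficient is the ordinary Pearson correlation taken across sequences at a fixed position. The helper names are hypothetical, centring the 100-bp window on the position is an assumption (the text does not state how the window was placed), and returning 0.0 when either nucleotide has zero variance is a convention chosen here.

```python
from math import sqrt


def window_counts(seq, center, base, window=100):
    """Number of occurrences of `base` in a window centred on `center`
    (assumed placement: window // 2 positions on each side)."""
    half = window // 2
    return seq[max(0, center - half):center + half].count(base)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of counts."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention when one nucleotide count is constant
    return cov / sqrt(var_x * var_y)


def nucleotide_correlation(seqs, position, base_i, base_j, window=100):
    """r_ij(p): correlation of base_i and base_j counts across sequences."""
    xs = [window_counts(s, position, base_i, window) for s in seqs]
    ys = [window_counts(s, position, base_j, window) for s in seqs]
    return pearson(xs, ys)
```

Calling `nucleotide_correlation(seqs, p, "C", "G")` for each position `p` relative to the TSS would trace how the C-G correlation changes along the aligned sequences.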
We would like to thank Keitaro Senda, Sayaka Kitamura, Haruo Suzuki, Hitomi Ito and members of the Institute for Advanced Biosciences for helpful discussions during the course of this work. We also thank the Rice Full-Length cDNA Consortium (the National Institute of Agrobiological Sciences Rice Full-Length cDNA Project Team, the Foundation of Advancement of International Science Genome Sequencing & Analysis Group and the RIKEN Institute) for providing the full-length cDNA data on rice. This work was supported by the Ministry of Agriculture, Forestry and Fisheries of Japan (Rice Genome Project SY-1104). This work was also supported by the Ministry of Education, Culture, Sports, Science and Technology of Japan, through "the 21st Century COE Program" and "Special Coordination Funds Promoting Science and Technology".
- Tatarinova T, Brover V, Troukhan M, Alexandrov N: Skew in CG content near the transcription start site in Arabidopsis thaliana. Bioinformatics. 2003, 19 Suppl 1: I313-I314. 10.1093/bioinformatics/btg1043.
- Freeman JM, Plasterer TN, Smith TF, Mohr SC: Patterns of Genome Organization in Bacteria. Science. 1998, 279: 1827a. 10.1126/science.279.5358.1827a.
- Lobry JR: Asymmetric substitution patterns in the two DNA strands of bacteria. Mol Biol Evol. 1996, 13: 660-665.
- Frank AC, Lobry JR: Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms. Gene. 1999, 238: 65-77. 10.1016/S0378-1119(99)00297-8.
- Tanaka M, Ozawa T: Strand asymmetry in human mitochondrial DNA mutations. Genomics. 1994, 22: 327-335. 10.1006/geno.1994.1391.
- Francino MP, Chao L, Riley MA, Ochman H: Asymmetries generated by transcription-coupled repair in enterobacterial genes. Science. 1996, 272: 107-109.
- Francino MP, Ochman H: Deamination as the basis of strand-asymmetric evolution in transcribed Escherichia coli sequences. Mol Biol Evol. 2001, 18: 1147-1150.
- Green P, Ewing B, Miller W, Thomas PJ, Green ED: Transcription-associated mutational asymmetry in mammalian evolution. Nat Genet. 2003, 33: 514-517. 10.1038/ng1103.
- Svejstrup JQ: Mechanisms of transcription-coupled DNA repair. Nat Rev Mol Cell Biol. 2002, 3: 21-29. 10.1038/nrm703.
- Aerts S, Thijs G, Dabrowski M, Moreau Y, De Moor B: Comprehensive analysis of the base composition around the transcription start site in Metazoa. BMC Genomics. 2004, 5: 34. 10.1186/1471-2164-5-34.
- Louie E, Ott J, Majewski J: Nucleotide frequency variation across human genes. Genome Res. 2003, 13: 2594-2601. 10.1101/gr.1317703.
- Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. J Mol Biol. 1987, 196: 261-282. 10.1016/0022-2836(87)90689-9.
- Hannenhalli S, Levy S: Promoter prediction in the human genome. Bioinformatics. 2001, 17 Suppl 1: S90-6.
- Bajic VB, Tan SL, Suzuki Y, Sugano S: Promoter prediction analysis on the whole human genome. Nat Biotechnol. 2004, 22: 1467-1473. 10.1038/nbt1032.
- Rombauts S, Florquin K, Lescot M, Marchal K, Rouze P, van de Peer Y: Computational approaches to identify promoters and cis-regulatory elements in plant genomes. Plant Physiol. 2003, 132: 1162-1176. 10.1104/pp.102.017715.
- TIGR Gene Indices (TGI). [ftp://ftp.tigr.org/pub/data/tgi]
- Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J: The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res. 2001, 29: 159-164. 10.1093/nar/29.1.159.
- NCBI Gene Expression Omnibus (for Arabidopsis SAGE data). [http://www.ncbi.nih.gov/geo/query/acc.cgi?acc=GSM769]
- Savitzky A, Golay MJE: Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal Chem. 1964, 36: 1627-1639. 10.1021/ac60214a047.
- von Mering C, Bork P: Teamed up for transcription. Nature. 2002, 417: 797-798. 10.1038/417797a.
- Blumenthal T, Evans D, Link CD, Guffanti A, Lawson D, Thierry-Mieg J, Thierry-Mieg D, Chiu WL, Duke K, Kiraly M, Kim SK: A global analysis of Caenorhabditis elegans operons. Nature. 2002, 417: 851-854. 10.1038/nature00831.
- Tucker ML, Whitelaw CA, Lyssenko NN, Nath P: Functional analysis of regulatory elements in the gene promoter for an abscission-specific cellulase from bean and isolation, expression, and binding affinity of three TGA-type basic leucine zipper transcription factors. Plant Physiol. 2002, 130: 1487-1496. 10.1104/pp.007971.
- TIGR (for Arabidopsis Full-length cDNAs). [ftp://ftp.tigr.org/pub/data/a_thaliana/ceres/CeresTigr]
- RIKEN Arabidopsis Genome Encyclopedia. [ftp://pfgweb.gsc.riken.go.jp/rafl/sequence/]
- NCBI (for Arabidopsis genomic sequences). [ftp://ftp.ncbi.nih.gov/genomes/Arabidopsis_thaliana/]
- NCBI Entrez. [http://www.ncbi.nlm.nih.gov/entrez/]
- Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H, Hotta I, Kojima K, Namiki T, Ohneda E, Yahagi W, Suzuki K, Li CJ, Ohtsuki K, Shishiki T, Otomo Y, Murakami K, Iida Y, Sugano S, Fujimura T, Suzuki Y, Tsunoda Y, Kurosaki T, Kodama T, Masuda H, Kobayashi M, Xie Q, Lu M, Narikawa R, Sugiyama A, Mizuno K, Yokomizo S, Niikura J, Ikeda R, Ishibiki J, Kawamata M, Yoshimura A, Miura J, Kusumegi T, Oka M, Ryu R, Ueda M, Matsubara K, Kawai J, Carninci P, Adachi J, Aizawa K, Arakawa T, Fukuda S, Hara A, Hashizume W, Hayatsu N, Imotani K, Ishii Y, Itoh M, Kagawa I, Kondo S, Konno H, Miyazaki A, Osato N, Ota Y, Saito R, Sasaki D, Sato K, Shibata K, Shinagawa A, Shiraki T, Yoshino M, Hayashizaki Y, Yasunishi A: Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science. 2003, 301: 376-379. 10.1126/science.1081288.
- Knowledge-based Oryza Molecular biological Encyclopedia (KOME). [ftp://cdna01.dna.affrc.go.jp/pub/data/]
- Rice Genome Research Program (RGP). [http://rgp.dna.affrc.go.jp/]
- Syngenta Biotechnology Inc. (SBI). [http://www.tmri.org/]
- DNA Data Bank of Japan (DDBJ). [http://www.ddbj.nig.ac.jp/]
- Ensembl. [ftp://ftp.ensembl.org/pub/]
- Berkeley Drosophila Genome Project (for Drosophila Full-length cDNAs). [http://www.fruitfly.org/sequence/]
- Berkeley Drosophila Genome Project (for Drosophila genomic sequences). [ftp://ftp.fruitfly.org/pub/download/compressed/]
- NCBI (for S. cerevisiae genomic sequences). [ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/]
- NCBI (for S. pombe genomic sequences). [ftp://ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/]
- Fungal Genome Initiative (FGI). [http://www.broad.mit.edu/annotation/fungi/fgi/]
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9: 868-877. 10.1101/gr.9.9.868.
- Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8: 967-974.
- Bajic VB: Comparing the success of different prediction software in sequence analysis: a review. Brief Bioinform. 2000, 1: 214-228.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.