- Research article
- Open Access
Large-scale sequencing based on full-length-enriched cDNA libraries in pigs: contribution to annotation of the pig genome draft sequence
BMC Genomics volume 13, Article number: 581 (2012)
Along with the draft sequencing of the pig genome, which has been completed by an international consortium, collection of the nucleotide sequences of genes expressed in various tissues and determination of entire cDNA sequences are necessary for investigations of gene function. The sequences of expressed genes are also useful for genome annotation, which is important for isolating the genes responsible for particular traits.
We performed a large-scale expressed sequence tag (EST) analysis in pigs by using 32 full-length-enriched cDNA libraries derived from 28 kinds of tissues and cells, including seven tissues (brain, cerebellum, colon, hypothalamus, inguinal lymph node, ovary, and spleen) derived from pigs that were cloned from a sow subjected to genome sequencing. We obtained more than 330,000 EST reads from the 5′-ends of the cDNA clones. Comparison with human and bovine gene catalogs revealed that the ESTs corresponded to at least 15,000 genes. cDNA clones representing contigs and singlets generated by assembly of the EST reads were subjected to full-length determination of inserts. We have finished sequencing 31,079 cDNA clones corresponding to more than 12,000 genes. Mapping of the sequences of these cDNA clones on the draft sequence of the pig genome has indicated that the clones are derived from about 15,000 independent loci on the pig genome.
ESTs and cDNA sequences derived from full-length-enriched libraries are valuable for annotation of the draft sequence of the pig genome. This information will also contribute to the exploration of promoter sequences on the genome and to molecular biology-based analyses in pigs.
The pig is the world's most frequently consumed meat animal, and its genetic improvement, particularly in terms of productivity and meat quality, is of interest to livestock science. To date, intensive genetic improvement of livestock animals has been conducted by using classical selection and mating, but genomic information is required for further improvement. Improvements in aspects of the rearing management of pigs, such as feeding and hygiene control, have to be based on knowledge obtained from physiological studies. Moreover, the pig is unique among livestock in that it is very useful in biomedical research because of the structural and size similarities of its organs (particularly cardiovascular and dermal) to those of humans[2–4]. Improvement in the breeding and rearing of pigs with the help of molecular genetics and physiology, as well as the use of pigs as biomedical model animals, requires fundamental information on pig molecular biology, particularly in terms of the genome and genes.
Recently, the International Swine Genome Sequencing Consortium (SGSC) completed its draft sequencing of the pig genome; these sequences will form the basis of further investigations of pig molecular biology[5–7]. Sequencing of the pig genome will accelerate the development of genetic markers to improve breeds and populations and give basic information on the genes encoded on the genome. However, genes cannot be precisely localized on the genome solely from information on the genome sequence. Determination of precise sequences, structures, and locations requires information on the sequences of expressed genes per se. The locations and structures of genes on the pig genome are now being explored by using automated systems or by manual inspection by annotators using the Otterlace system of the Wellcome Trust Sanger Institute to add information to databases such as Vertebrate Genome Annotation (VEGA)[10–12]. The sequences of expressed genes are also useful for genome annotation, which is important for isolating the genes responsible for particular traits.
Expressed sequence tag (EST) analyses have been conducted by many research groups in pigs and other organisms. More than 1,600,000 pig ESTs have been accumulated and registered in the public nucleotide databases, and several attempts at transcriptome analysis using next-generation DNA sequencing (NGS) have been made[13–15]. Most of the cDNA libraries constructed by using traditional methods do not cover the transcription start sites, because the limitations of cloning techniques can cause incomplete synthesis of full-length cDNA. On the other hand, the sequences of transcripts obtained by using NGS alone are reconstructed by the compilation of short reads and do not directly reflect the actual structure of the mRNA; this may be problematic in considering the alternative splicing products that are actually expressed in the tissues. As far as possible, it is therefore important to clone full-length mRNA transcripts in order to collect gene expression data and use these data in further analyses of expressed genes and of genome annotation in pigs[17–19]. cDNA clones carrying full-length transcripts have additional benefits—they can be used for protein production in vitro and for exploring promoter sequences on the genome.
So far we have conducted EST analysis and sequencing of entire mRNA transcripts in pigs by using full-length-enriched cDNA libraries. Here, we outline the data we have collected and the advantages of their use, particularly in genome annotation of the draft sequence of the pig genome.
Results and discussion
Pig ESTs based on full-length-enriched cDNA libraries
We have constructed 32 cDNA libraries for 28 different tissues and cell populations by using cloning of cap-structured mRNA[20, 21] or the SMART method, and we have accumulated 330,707 ESTs from 5′-ends (Table 1), including 162,631 previously reported ESTs in our pig expressed gene database[18, 19]. The ESTs thus obtained were assembled into 17,183 contigs consisting of 209,779 ESTs, with 120,928 singlets remaining. The contig containing the largest number of ESTs carried 4111 ESTs with marked similarity to tubulin α1 genes. The genes encoding α-tubulin are extremely similar to each other; therefore, the ESTs encoding α-tubulins were assembled into the same contigs. On the other hand, about 90% of the contigs contained fewer than 20 ESTs (Figure 1), showing that the majority of the ESTs were based on virtually unbiased mRNA sequences. High-quality EST reads have been registered in the public nucleotide database [DDBJ:BP137499–BP173623, BP433030–BP464980, BW954997–BW985219, CJ000001–CJ039835, DB781565–DB806061, FS639971–FS656010, FS656544–FS722296, and HX201766–HX247044].
Genes and chromosomal locations corresponding to assemblies generated from pig ESTs
BLAST similarity analysis of the nucleotide sequences of the assemblies (contigs and singlets) against the mRNA sequences of RefSeq (release 49; 13 September 2011) at the National Center for Biotechnology Information (NCBI) revealed that the cDNA sequences corresponded to 13,691 unique human genes (Table 2). Correspondence of the assemblies to the functions of human genes according to the Gene Ontology terms demonstrated that the EST collection covered a broad range of porcine expressed genes (Figures 2 and 3). About three-fourths of the assemblies showing obvious similarity to protein sequences were estimated to contain start codons, indicating the high efficiency of cloning of entire mRNA molecules by the construction of full-length-enriched cDNA libraries (Table 2). Ideally, almost all of the clones in the full-length-enriched cDNA libraries would contain entire coding sequences (CDSs). However, at the cloning step of cDNA library construction, short transcripts that do not cover entire CDSs may be cloned preferentially to those containing functional CDSs. Degradation of RNA before library construction may also hamper the cloning of intact cDNA sequences in the libraries. Even though there are some incomplete transcripts in the ESTs, full-length-enriched cDNA libraries are much more valuable for determining the correct structures of functional mRNA than are cDNA libraries constructed by using conventional methods, because the latter rarely yield intact full-length cDNAs.
Among the human genes matched to the assemblies, 12,937 corresponded to 12,911 unique NCBI HomoloGene IDs, which are indices of homologs among genomes of different eukaryote species. This covered about two-thirds of all HomoloGene groups in humans (18,431 IDs; release 65). Furthermore, about 2000 additional genes were also included in the ESTs, as estimated from the numbers of genes without HomoloGene IDs (Table 2). In total, we estimated that more than 15,000 different genes were included in the ESTs thus obtained. However, the number of genes included in the EST assemblies might in fact be higher because of gene duplication specifically occurring in the Sus genus.
The EST assemblies were mapped on the latest build of the draft sequence of the pig genome, Sscrofa10.2. The total length of the chromosomes in Sscrofa10.2 is 2596.6 Mb, and the 4562 scaffolds not placed on any chromosome are a total of 211.9 Mb long. Among the 17,183 contigs, 16,121 were mapped on pig chromosomes. Among the singlets, 86,464 of 120,928 were mapped on the chromosomes (Table 3). The assemblies were mapped to 39,445 different loci on pig autosomes and 1221 different loci on the pig sex chromosomes. In addition, 4463 loci corresponding to the assemblies were detected in scaffold sequences not placed on any chromosome (Figure 3A and Table 3). Mapping of loci corresponding to the EST assemblies on the draft sequence of the pig genome demonstrated that the ESTs were derived from regions throughout the chromosomes, and that the density of loci showed periodic change within the chromosomes (Additional file 1), possibly corresponding to the G- and R-bands of the chromosomes. Although genes expressed ubiquitously in tissues (e.g., the gene encoding tubulin) were frequently observed, we were able to clone pig transcripts in the EST collection derived from more than 40,000 unique loci distributed throughout the whole genome. This shows that our EST resource is valuable for exploring pig gene sequences of interest in various areas of veterinary research.
Generation of the collection of pig cDNA clones, and complete sequencing of their inserts
The pig cDNA libraries used for the EST analysis were full-length–enriched libraries, which are ideal for determining the entire sequences of transcripts functioning as protein-encoding mRNA. In parallel with the EST analysis described above, we selected cDNA clones for the sequencing of entire inserts. The cDNA clones located at the forefront position in contigs generated by the assembly were selected for sequencing of the entire inserts, because we considered that they were the best candidates for clones carrying entire transcripts. As the EST analysis progressed, if cDNA clones located upstream of the clones already selected in the contigs appeared, we added these clones into the pipeline for sequencing of the entire inserts. On the other hand, among the singlets that did not join the contigs, there were many clones corresponding to human genes that had no counterparts among the clones selected from the contigs. We also selected these clones to ensure that the cDNA collection included a broad variety of genes. In total, we selected 42,047 clones as candidates for complete sequencing (31,545 clones from contigs and 10,502 clones from singlets).
Selected cDNA clones were sequenced by the primer-walking (PW) method. cDNA clones that were difficult to sequence by PW were subjected to transposon-shotgun sequencing (TPS) with clone pooling[30–32]. To date, we have completed the sequencing of the entire inserts of 31,232 clones [DDBJ:AK230469–AK240615, AK343227–AK344223, AK344276–AK352511, AK389169–AK399513, and AK399520–AK401026]. We excluded the clones that had only repetitive sequences such as short or long interspersed nucleotide elements, and obtained 31,079 clones as a result (Table 1; 22,853 clones from contigs and 8226 clones from singlets), including 10,147 clones that have already been reported and presented in our cDNA database. Among the candidate clones, 2169 were completely sequenced by using only universal sequencing primers (forward and reverse primers aligned to the vector sequence, and the T25V primer). There were 25,629 cDNA clones sequenced by the PW method and 3500 by the TPS method (Table 4). The average length of the inserts of the cDNA clones thus sequenced was 1.63 kb. The inserts sequenced by the TPS method were longer than those sequenced by the PW method (Table 4), which is expected in part because length was one of the reasons why clones were difficult to sequence by PW and were therefore passed to TPS. The longest inserts of the clones sequenced by the PW and TPS methods were 4653 and 7293 bp, respectively. The TPS method is highly compatible with automation, because it does not require primer design; however, the total amount of sequencing work is much greater than with the PW method[30, 31]. Therefore, application of TPS may be limited to clones with long inserts or clones for which many sequencing primers would need to be designed.
Genes corresponding to cDNA clones
BLAST similarity analysis of the inserts of the cDNA clones sequenced against the mRNA sequences of the NCBI RefSeq (release 49) revealed that the cDNA sequences corresponded to 11,298 human genes and more than 10,000 genes in each of mice, cattle, and dogs (Table 5). The functions of genes corresponding to more than 20,000 clones could be estimated by their similarity to human genes and classification according to Gene Ontology terms (Figure 3D). Among the human genes matched to the cDNA clones, 10,901 corresponded to 10,889 HomoloGene IDs (Table 5A); this accounted for 59% of all of the HomoloGene groups of human genes. Notably, more than 1000 genes corresponding to cDNA clones have not been classified in HomoloGene, and there may be duplication of gene loci originally occurring in the pig. Furthermore, it cannot be ruled out that the cDNA clones that showed no similarity to genes of other species encode genes that are functional in pigs. In total, we estimated by a comparative approach using mRNA sequences that the cDNA thus sequenced represented more than 12,000 pig genes. In addition, cDNA clones corresponding to the same HomoloGene ID may be duplicated specifically on the pig genome. Although clarification of the correct number of genes included in our cDNA collection requires the completed genome sequence of the pig, we estimated the numbers of loci by using the currently available draft pig genome sequence (Sscrofa10.2).
Distribution of cDNA clones on the draft sequence of the pig genome
As mentioned above in the Background, the genome annotation process is greatly accelerated by the mapping of cDNA clones on the genomic sequence. We evaluated the usefulness of our collection of pig cDNA clones in genome annotation by mapping the clones onto the latest draft sequence of the pig genome (Sscrofa10.2). Among the 31,079 cDNA clones sequenced, 26,159 were mapped to 12,441 independent loci on the pig chromosomes. There were 12,042 loci mapped on the autosomes and 399 on the sex chromosomes (Figure 3B and Table 6). In addition, 3271 clones were mapped to 1453 loci on scaffolds that have not been localized on any chromosomes in Sscrofa10.2. Among the 1649 cDNA clones that were mapped neither to pig chromosomes nor to the unplaced scaffold sequences, 1114 clones corresponded to 604 unique human genes. Inclusion of those unmapped cDNA clones that corresponded to the genes of other mammals yielded a total of more than 600 unique genes corresponding to unmapped cDNA clones (Table 5). Taking these findings together, we estimated that our pig cDNA collection includes transcripts derived from about 15,000 different loci on the pig genome.
Similar to the human genome, the pig genome is estimated to include 20,000 to 25,000 genes, because the genome size of pig is comparable to that of human. Our sequencing of the cDNA clones therefore covered slightly more than half of the entire gene set of the pig. The reason why thousands of genes were not included in our cDNA resources may be that the libraries were constructed with tissues of animals that were healthy and not subject to stressors such as infection or starvation. Genes that are highly expressed only during acute responses to pathogens or nutritional exhaustion might be difficult to clone in such libraries. In addition, if a gene is rarely expressed in a particular tissue (e.g., if its expression frequency among all the transcripts is less than 0.006%), then the probability that it will fail to be detected in the cloning of 10,000 transcripts will be more than 50%. To increase the number of cloned genes it would be necessary to normalize the libraries or use tissues derived from animals stimulated by particular stressors.
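The detection probability quoted above can be checked with a short calculation. The following is a minimal sketch that assumes each of the 10,000 clones is an independent draw from the transcript pool (a simple binomial model, which the text implies but does not state); the function names are illustrative only.

```python
def prob_missed(freq: float, n_clones: int) -> float:
    """Probability that a transcript with relative frequency `freq`
    (a fraction, not a percentage) is absent from `n_clones`
    independently sampled clones."""
    return (1.0 - freq) ** n_clones


def max_freq_for_miss_prob(p_miss: float, n_clones: int) -> float:
    """Largest relative frequency at which the transcript is still
    missed with probability >= `p_miss` among `n_clones` clones."""
    return 1.0 - p_miss ** (1.0 / n_clones)


# 0.006% expressed as a fraction is 0.00006.
p = prob_missed(0.00006, 10_000)
f = max_freq_for_miss_prob(0.5, 10_000)
print(f"P(missed) at 0.006% with 10,000 clones: {p:.3f}")
print(f"Frequency giving a 50% miss probability: {f * 100:.4f}%")
```

Under this model the miss probability at a frequency of 0.006% is about 0.55, and the 50% threshold lies near 0.007%, which is consistent with the statement in the text.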
Genes encoded on the genome may be duplicated specifically in particular species but not in others. To detect duplication specifically occurring in the pig genome, we extracted the cDNA sequences with the longest open reading frames (ORFs) from among the clones that we sequenced here for the 12,441 putative loci on the autosomes and sex chromosomes in Sscrofa10.2. The extracted cDNA sequences were compared with human and cattle protein sequences deduced from the NCBI RefSeq (release 49). We estimated that 635 human protein sequences and 709 cattle protein sequences matched more than one putative locus on Sscrofa10.2. Conversely, the total numbers of loci estimated to be duplicated on Sscrofa10.2 in comparison with humans and cattle were 1358 and 1505, respectively. However, most of the potentially duplicated loci encoded shorter ORFs than their counterparts, implying that the loci were only pseudogenes or that they had arisen from the remaining sequencing errors in the genome sequence. Further refinement of the draft sequences of the pig genome will elucidate the duplication of genes occurring in the Sus genus.
The cDNAs thus analyzed were synthesized by reverse transcription using a poly-dT primer; therefore, most of the clones showed canonical mRNA features and had ORFs. Among the 13,894 loci mapped on the chromosomes and scaffolds, the representative (the longest) cDNA clones for 4144 loci showed no correspondence to the genes of humans, cattle, dogs, or mice. However, most of the cDNA clones with no obvious correspondence to the genes of other animals had apparent ORFs, and only 160 clones did not have ORFs for sequences more than 30 amino acids long. The average insert length of these 160 clones was 665 bp, whereas the average for all of the clones was much longer (1631 bp). These “non-coding” transcripts may be transcribed randomly and may have no function, although they may have a certain regulatory function on other protein-coding transcripts.
The draft sequence of the pig genome in its latest build (Sscrofa10.2) corresponds to the bacterial artificial chromosome (BAC) clones covering 98% of the physical map of the entire chromosome[5, 6]. In our analysis, about 2% of the unique human genes corresponding to the cDNA sequences were not mapped to any chromosomes or unplaced scaffolds, showing that our estimate of the coverage of the whole genome by the draft sequence was correct. Refinement of the draft sequence of the pig genome will reveal the precise locations on the pig genome of the loci generating those transcripts that we cloned but that were not mapped, or that we mapped only on unplaced scaffolds.
The cDNA libraries that we used included seven libraries constructed by using animals that were cloned from a sow used for draft sequencing of the pig genome (Duroc 2–14)[5–7]. A total of 4010 cDNA clones from four libraries were completely sequenced. Among them, 3729 were mapped to 2993 loci on the pig genome (Table 6B). These cDNA clones will be valuable for genome annotation of the draft sequence generated by the SGSC. In addition, comparison of the cDNA sequences of Duroc 2–14 clones with the draft sequences of the pig genome can be used to estimate the sequencing accuracy of the draft sequence and the frequency of RNA editing, although polymorphisms among the chromosomes of Duroc 2–14 hinder precise discrimination of such errors and edited bases. We roughly estimated such base changes by aligning cDNA sequences derived from Duroc 2–14 clones with the genome sequences. The best-aligned region of each cDNA sequence was extracted, and the adenosine (A) to guanosine (G) base changes, which reflect the most representative type of RNA editing, A to I (inosine), were counted. To simplify the estimation, we investigated only A-to-G changes flanked by 5-base matches on both sides. Among the cDNAs of Duroc 2–14 clone pigs, 124 carried only A-to-G base changes, which totaled 142. In contrast, 91 cDNAs carried only G-to-A base changes, which totaled 97. Therefore, we estimated that about one-fourth of the inconsistency between G and A in the cDNAs and the draft sequence, respectively, is caused by RNA editing in the pig. Alignment of the 3′-UTR sequences of the cDNAs of Duroc 2–14 clone pigs showed differences of less than 0.4% from the draft genome sequence (data not shown). The differences thus detected included polymorphisms between different chromosomes and bases subjected to RNA editing. Furthermore, more than half of the aligned 3166 3′-UTRs (1883) were completely matched to the draft genome sequence.
We therefore estimated that the actual error rate in the draft sequences of the pig genome was much less than 0.4%; the draft sequence was thus reliable.
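The counting rule used for the A-to-G estimate (a change is counted only when it is flanked by five matching bases on each side) can be illustrated as follows. This is a minimal sketch, not the scripts used in the study: it assumes the genome and cDNA sequences are supplied as two pre-aligned strings of equal length with "-" marking gaps, and the function name and parameters are hypothetical.

```python
def count_flanked_changes(genome: str, cdna: str,
                          ref: str = "A", alt: str = "G",
                          flank: int = 5) -> int:
    """Count aligned positions where the genome has `ref` and the cDNA
    has `alt`, with `flank` perfectly matching bases on both sides.

    Assumption: `genome` and `cdna` are rows of one pairwise alignment
    (equal length, '-' for gaps)."""
    if len(genome) != len(cdna):
        raise ValueError("aligned sequences must have equal length")

    def is_match(j: int) -> bool:
        # A gap never counts as a match.
        return genome[j] == cdna[j] and genome[j] != "-"

    count = 0
    # Positions too close to either end lack a full flank and are skipped.
    for i in range(flank, len(genome) - flank):
        if genome[i] != ref or cdna[i] != alt:
            continue
        left_ok = all(is_match(j) for j in range(i - flank, i))
        right_ok = all(is_match(j) for j in range(i + 1, i + flank + 1))
        if left_ok and right_ok:
            count += 1
    return count


genome_row = "CCTGTACGTTC"
cdna_row = "CCTGTGCGTTC"   # genomic A read as G in the cDNA at position 6
a_to_g = count_flanked_changes(genome_row, cdna_row)
g_to_a = count_flanked_changes(cdna_row, genome_row, ref="G", alt="A")
print(a_to_g, g_to_a)
```

Calling the same function with `ref="G"` and `alt="A"` gives the G-to-A count used as the comparison class in the text.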
Coverage of coding sequences by cDNA clones on the pig genome
We expected that the collection of pig cDNA clones that we sequenced would include sequences covering the entire CDSs of pig genes. To estimate the numbers of cDNA clones covering entire CDSs, we investigated the coverage of those protein sequences of humans, mice, cattle, and dogs that showed the greatest similarity to the amino acid sequences deduced from the cDNAs. We also examined the distribution of the cDNA clones considered to cover entire CDSs on the pig chromosomes in the draft genome sequence (Sscrofa10.2). Among the cDNA clones sequenced completely, 14,616 were estimated to contain entire CDSs in their inserts. We estimated that these clones corresponded to 6466 different loci on the pig chromosomes (Table 5).
Usefulness of the pig cDNA collection in genome annotation and other applications
The cDNA clones sequenced here were derived from libraries by methods that preferentially cloned intact RNA transcripts. About three-fourths of the EST assemblies showing considerable similarity to known genes carried the beginning of CDSs, and we estimated that about half of the cDNA clones that were completely sequenced contained entire CDSs. An outline of the pig genome sequence is currently available, and use of the sequences of these expressed genes should help in precisely identifying the locations of genes on the genome and in determining the exon–intron structures of the genes. Along with the progress made in draft sequencing of the pig genome, automated annotation of the pig genome sequence has been conducted by the pipelines in Pre-Ensembl/Ensembl and publicized through the Pre-Ensembl/Ensembl database[35, 36]. In the automated pipelines, about 30,000 pig cDNA clones were utilized; most of these were derived from our pig cDNA sequencing project. Our data resources on pig-expressed genes have greatly contributed to prediction of the structures of genes on the draft sequence of the pig genome. In addition, the use of pig cDNA sequences that have been completely sequenced accelerates the process of manual refinement of automated genome annotation. Until now, many projects for full-length cDNA sequencing have been conducted in parallel with genome sequencing in eukaryotic species, and the results of these studies have contributed to our knowledge of gene locations and structures in target species such as humans and mice[9, 38]. In fact, the pig cDNA sequences presented here have contributed greatly to the process of annotation of immune-related genes in the draft sequence of the pig genome by the Immune Response Annotation Group. We expect that additional efforts to annotate other groups of pig genes will be accelerated by the use of our pig cDNA sequences.
One of the characteristics of the ESTs and cDNA sequences presented here is that the majority of the sequences were derived from intact mRNA with transcription start sites. This has great merit for exploring promoter sequences on the genome sequence. The consensus sequences bound by transcription factors in the promoter sequences are generally well conserved among species; however, there are many variations in the binding-site sequences of transcription factors, and precise determination of the genomic region of the promoter sequence of each gene is essential for clarifying the efficiency of transcription in cells in response to stimuli. Extraction of the upstream regions of the EST assemblies and cDNA sequences presented here, combined with direct evidence from, for example, ChIP-Seq studies, which will be accelerated by using the cDNA sequences for transcription factors in pigs, will enable the construction of a valuable database for understanding transcriptional regulation of pig genes. Notably, we were able to completely sequence 1340 pig cDNA clones associated with “nucleic acid binding transcription factor activity,” as classified according to Gene Ontology (GO:0001071) (Figure 3D).
Another advantage of this collection of cDNA sequences is its usefulness for investigating alternative splicing events in pig genes. We mapped 29,430 cDNA clones to 13,894 different loci on the pig genome—that is, on average more than two different cDNA clones were derived from a single locus. Future studies should include a detailed exploration of splicing variants by using the cDNA sequences we have sequenced, together with the pig gene sequences presented by other groups.
The cDNA sequences and the ESTs themselves will also be useful in other studies, such as in gene expression analysis and in detecting polymorphisms in pig genes. A number of polymorphisms have been reported in mRNA sequences (particularly in CDSs), and it should be emphasized that there are many polymorphisms in CDSs that affect the functions of the molecules encoded by the genes that carry the polymorphisms. Our explorations of polymorphisms using the cDNA sequences and ESTs presented here have been useful in characterizing the genetic features of pig breeds and populations[18, 42]. We have also investigated polymorphisms in genes encoding pattern-recognition receptors[43–46] and have demonstrated that some of the polymorphisms observed so far in commercial pig and wild boar populations truly affect the ligand-recognition ability of the molecules encoded by the genes[47–49]. In addition, many ongoing studies are revealing the potential associations of gene polymorphisms with economically important traits in pigs. The use of pig gene sequences, including those presented here, will help greatly in promoting the exploration of polymorphisms that may be candidates for markers for selecting or breeding pigs with distinguished traits. Furthermore, the gene sequences can be used directly to design probes for microarrays. We have developed oligomer microarrays by using sequences derived mainly from the ESTs and cDNA sequences presented here, and we have successfully elucidated the characteristics of changes in gene expression in pig subcutaneous preadipocytes. Designing microarray probes with full-length cDNA sequences has benefits in terms of reliability, because the probes are highly specific to the target genes and there is clear evidence of correspondence between the probes and fully annotated genes. 
Full-length cDNA sequences will even be valuable in transcriptome analysis with NGS, which will become the mainstream method of expression analysis; these sequences will be useful in determining which short NGS reads belong to gene sequences that truly exert functions in organisms.
Here, we demonstrated our attempts to collect pig-expressed genes by EST analysis and sequencing of entire cDNA clones using full-length-enriched cDNA libraries. We have so far accumulated 330,707 ESTs and 31,079 cDNA sequences. The ESTs and cDNA clones thus sequenced were respectively mapped to 40,666 and 13,894 different loci on the latest pig genome sequence Sscrofa10.2; they corresponded to more than 15,000 and 12,000 different genes of other species, respectively. The cDNA resource presented here is valuable for annotation of the draft sequence of the pig genome and for exploring promoter sequences on the genome. It will also be valuable for molecular biology–based analyses in pigs, for example for analyses of protein production in vitro.
Construction of cDNA libraries
Tissues for construction of the cDNA libraries were prepared from 10 crossbred [(Landrace × Large White) × Duroc] pigs, which are representative of those used for the Japanese pork market, and a Meishan animal, a breed representative of those used in China. The pigs were housed at the National Institute of Livestock and Grassland Science (Tsukuba, Ibaraki, Japan)[18, 19]. We also used tissues from two Landrace and one Berkshire breed pig and one NIBS miniature pig. In addition, we used a pig cloned from an animal of the Duroc breed (2–14) that was subjected to genome sequencing by the SGSC[5–7].
Using the collected tissues and cell populations, cDNA libraries were constructed by one of the following three methods. About two-thirds of the libraries (23) were constructed by using the oligo-capping method. Fifteen of the oligo-capped cDNA libraries were constructed by using the Gateway-compatible pCMVFL3 vector (Invitrogen, Carlsbad, CA, USA), whereas the vector for the rest was pME18SFL3 (Toyobo, Osaka, Japan). Five cDNA libraries were constructed by using another method for constructing full-length cDNA libraries, namely the vector-capping method. In total, 28 libraries were generated by methods using the 5′ cap structure, which is characteristic of intact mRNA. To compile the remaining four libraries, because only small amounts of RNA could be prepared from the tissues or cell populations, we used the SMART method and the pDNR-LIB vector (Clontech, Palo Alto, CA, USA); this method selectively clones cDNAs that are synthesized as far as the 5′-end of the mRNA molecule. All of the cDNAs were cloned into the vector unidirectionally. The library construction methods used for each tissue or cell population are shown in Table 1.
EST analysis and clustering/assembling
The cDNA libraries thus constructed were subjected to EST analysis by single-pass sequencing from the 5′-ends of the respective clones. The EST reads obtained underwent basecalling using Phred; the vector sequences were screened by using the crossmatch program in the Phrap package[54, 55]. Repetitive sequences and low-complexity regions (e.g., poly(A) tracts) in the chromatograms thus generated were screened by using RepeatMasker with RepBase and in-house-generated Perl scripts. Clustering and assembly of sequences were performed with the TGICL package with CAP3. Chromatograms that were not included in the contigs or that did not have regions containing more than 100 bases with Phred quality values ≥10 were discarded.
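The read-quality criterion can be sketched in a few lines. The example below is an illustration only, not the in-house scripts: it assumes that "regions containing more than 100 bases with Phred quality values ≥10" means an uninterrupted run of such bases, and the function names are hypothetical.

```python
from typing import Sequence


def longest_high_quality_run(quals: Sequence[int], min_q: int = 10) -> int:
    """Length of the longest uninterrupted run of bases whose Phred
    quality value is at least `min_q`."""
    best = run = 0
    for q in quals:
        run = run + 1 if q >= min_q else 0
        best = max(best, run)
    return best


def keep_read(quals: Sequence[int], min_q: int = 10, min_len: int = 100) -> bool:
    """True if the read has a high-quality run of MORE than `min_len`
    bases (assumed interpretation of the criterion in the text)."""
    return longest_high_quality_run(quals, min_q) > min_len


good_read = [20] * 101
borderline_read = [20] * 100
interrupted_read = [20] * 60 + [5] + [20] * 60
print(keep_read(good_read), keep_read(borderline_read), keep_read(interrupted_read))
```

If the criterion was instead applied to the total number of high-quality bases in a trimmed window, the run-length test would be replaced by a simple count.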
Sequencing of entire inserts of cDNA clones
cDNA clones located at the most 5′ position in contigs generated by the assembling were selected for sequencing of the inserts. We also chose clones in singlets that did not join the contigs, provided that the clones corresponded to human genes with no counterparts among the clones selected from the contigs.
First, we sequenced the selected clones from the 5′-end with a primer annealing to the vector sequence to confirm that the correct clones had been selected. We also sequenced from the 3′-end with a primer annealing to the vector sequence and with a T25V primer. The chromatograms that were obtained from the EST analysis and generated by sequencing from the 5′- and 3′-ends were subjected to basecalling with Phred and assembly by using Phrap. The contigs thus generated were screened with in-house Perl scripts to check for low-quality regions (Phred quality values ≤ 25), and they were manually inspected for sequencing errors by using the Consed program. Regions with low-quality or ambiguous bases were re-sequenced with primers designed by using the Consed program with the “autofinish” option (primer walking; PW method). The procedure of inspection and sequencing of the remaining low-quality regions was performed twice or until no low-quality or ambiguous bases were observed. In addition to the PW method, we adopted another approach based on TPS, particularly for cDNA clones that could be difficult to sequence with the PW method [30–32, 61]. A large majority of the cDNA clones constructed by using the Gateway-compatible cloning vector (pCMVFL3; Toyobo, Osaka, Japan) were sequenced by using a combination of TPS and insert transfer with Gateway technology to reduce the number of shotgun clones with transposons in the vector sequence. TPS was conducted with pooled DNA from two to 12 cDNA clones, as described previously [30, 31].
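The screening for low-quality regions described above can be sketched as follows. This is an illustrative Python version, not the in-house Perl scripts; it assumes that per-base Phred values for a contig consensus are available as a list of integers, and the function name `low_quality_regions` is hypothetical.

```python
from typing import List, Tuple

LOW_QUALITY_MAX = 25  # bases with Phred quality <= 25 are treated as low quality


def low_quality_regions(qualities: List[int],
                        threshold: int = LOW_QUALITY_MAX) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges in which every base
    has a quality value <= threshold."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, q in enumerate(qualities):
        if q <= threshold:
            if start is None:
                start = i            # open a new low-quality region
        elif start is not None:
            regions.append((start, i))  # close the region at the first good base
            start = None
    if start is not None:
        # The sequence ended while still inside a low-quality region.
        regions.append((start, len(qualities)))
    return regions
```

Each returned range marks a stretch that would be a candidate for re-sequencing; a clone would be regarded as finished under this sketch when the function returns an empty list and no ambiguous bases remain.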
The EST assemblies and cDNA clones were used in similarity analyses after interspersed repetitive sequences and low-complexity regions (such as polynucleotides and microsatellites) had been masked with the RepeatMasker program. EST assemblies and cDNA clones were mapped on the pig genome sequence by using a BLAST similarity search against the latest pig genome sequence, Sscrofa10.2. The best alignment of each query sequence, with a BLAST similarity score above 100 and identity above 90%, was anchored on the genome sequence. Overlapping alignments of cDNA sequences on the pig genome in opposite directions were regarded as different loci. The region of the locus was extended from the anchored alignment toward both ends by using other alignments of the same query sequence that met the following criteria: (1) the direction of the alignment was identical to that of the anchored alignment; and (2) the distance of the alignment from the region of the locus (which is a possible intron) was less than 1 Mb. Loci were regarded as identical if the locus regions after extension overlapped and were mapped in the same direction.
Correspondence of the pig cDNA sequences to genes was investigated by BLAST similarity search against the mRNA or protein sequences in the NCBI RefSeq databases of humans, mice, cattle, dogs, and pigs. The similarity was considered positive if the BLAST score was more than 50. The presence of a full-length CDS in each pig cDNA sequence was estimated by BLAST similarity analysis, using the protein sequence showing the highest similarity in the NCBI RefSeq database. The two sequences were aligned without any filtering or masking. If the cDNA sequence aligned with the specified protein sequence with fewer than 10 amino acids trimmed at both ends, we considered the cDNA to contain a full-length CDS.
EST assemblies and cDNA clones were classified according to the Gene Ontology terms by using the ontology file.
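The locus-building rules described in this section (anchoring the best alignment, extending the locus with nearby same-direction alignments, and merging overlapping loci) can be sketched in Python as below. This is only an illustration under stated assumptions: the `Alignment` record and all function names are hypothetical, extension is assumed to be restricted to alignments on the same chromosome or scaffold as the anchor, and the distance rule is applied iteratively against the growing locus region.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_SCORE = 100.0     # anchor must have a BLAST score above 100
MIN_IDENTITY = 90.0   # and identity above 90%
MAX_GAP = 1_000_000   # possible intron: distance to the locus below 1 Mb


@dataclass(frozen=True)
class Alignment:
    query: str
    chrom: str
    start: int      # genomic coordinates with start <= end
    end: int
    strand: str     # "+" or "-"
    score: float
    identity: float


Locus = Tuple[str, str, int, int]  # (chrom, strand, start, end)


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance between two intervals (0 if they overlap)."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def locus_for_query(alignments: List[Alignment]) -> Optional[Locus]:
    """Anchor the best alignment of one query and extend it into a locus."""
    candidates = [a for a in alignments
                  if a.score > MIN_SCORE and a.identity > MIN_IDENTITY]
    if not candidates:
        return None
    anchor = max(candidates, key=lambda a: a.score)
    start, end = anchor.start, anchor.end
    # Only same-direction alignments (assumed: on the same chromosome) can extend.
    others = [a for a in alignments
              if a is not anchor and a.chrom == anchor.chrom
              and a.strand == anchor.strand]
    extended = True
    while extended:  # repeat until no further alignment lies within 1 Mb
        extended = False
        for a in list(others):
            if gap(start, end, a.start, a.end) < MAX_GAP:
                start, end = min(start, a.start), max(end, a.end)
                others.remove(a)
                extended = True
    return (anchor.chrom, anchor.strand, start, end)


def merge_loci(loci: List[Locus]) -> List[Locus]:
    """Merge loci that overlap on the same chromosome and in the same direction."""
    merged: List[Locus] = []
    for chrom, strand, start, end in sorted(loci):
        if merged:
            m_chrom, m_strand, m_start, m_end = merged[-1]
            if (m_chrom, m_strand) == (chrom, strand) and start <= m_end:
                merged[-1] = (m_chrom, m_strand, m_start, max(m_end, end))
                continue
        merged.append((chrom, strand, start, end))
    return merged
```

Because sorting groups loci by chromosome and strand before comparing coordinates, overlapping regions in opposite directions remain separate loci, in line with the rule stated in the text.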
Classification according to Gene Ontology was conducted by using the similarity of cDNA clones to the mRNA sequences of human genes in the NCBI RefSeq and the correspondence between genes and Gene Ontology terms provided in NCBI Gene.
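The transfer of Gene Ontology terms from human genes to pig clones described above amounts to chaining two lookups. The Python sketch below is illustrative only: the three input dictionaries are hypothetical stand-ins for the best human RefSeq mRNA hit of each clone, the RefSeq-to-gene correspondence, and the gene-to-term correspondence, and the identifiers in the usage note are toy values rather than real accessions.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def assign_go_terms(clone_to_refseq: Dict[str, str],
                    refseq_to_gene: Dict[str, str],
                    gene_to_go: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Give each clone the GO terms of the human gene behind its best mRNA hit.

    Clones whose hit has no known gene, or whose gene has no terms,
    receive an empty set."""
    assignments: Dict[str, Set[str]] = {}
    for clone, accession in clone_to_refseq.items():
        gene = refseq_to_gene.get(accession)
        terms = gene_to_go.get(gene, ()) if gene is not None else ()
        assignments[clone] = set(terms)
    return assignments


def count_terms(assignments: Dict[str, Set[str]]) -> Counter:
    """Count how many clones were assigned each GO term."""
    counts: Counter = Counter()
    for terms in assignments.values():
        counts.update(terms)
    return counts
```

With toy inputs such as `{"clone1": "NM_TOY1"}`, `{"NM_TOY1": "GENE_A"}`, and `{"GENE_A": ["TERM_X", "TERM_Y"]}`, `assign_go_terms` returns both terms for `clone1`, and `count_terms` then tallies each term once.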
Dekkers JCM, Mathur PK, Knol EF: Genetic improvement of the pig. The Genetics of the Pig. Edited by: Rothschild MF, Ruvinsky A. 2011, CAB International, Wallingford, 390-425. 2nd edition.
Vodicka P, Smetana K, Dvorankova B, Emerick T, Xu YZ, Ourednik J, Ourednik V, Motlik J: The miniature pig as an animal model in biomedical research. Ann N Y Acad Sci. 2005, 1049: 161-171. 10.1196/annals.1334.015.
Kuzmuk KN, Schook LB: Pigs as a model for biomedical sciences. The Genetics of the Pig. Edited by: Rothschild MF, Ruvinsky A. 2011, CAB International, Wallingford, 426-444. 2nd edition.
Lunney JK: Advances in swine biomedical model genomics. Int J Biol Sci. 2007, 3 (3): 179-184.
Archibald AL, Bolund L, Churcher C, Fredholm M, Groenen MA, Harlizius B, Lee KT, Milan D, Rogers J, Rothschild MF, et al: Pig genome sequence–analysis and publication strategy. BMC Genomics. 2010, 11: 438. 10.1186/1471-2164-11-438.
International Swine Genome Sequencing Consortium: Draft sequence of the pig genome. in preparation
Schook LB, Beever JE, Rogers J, Humphray S, Archibald A, Chardon P, Milan D, Rohrer G, Eversole K: Swine Genome Sequencing Consortium (SGSC): a strategic roadmap for sequencing the pig genome. Comp Func Genomics. 2005, 6: 251-255. 10.1002/cfg.479.
Groenen MAM, Schook LB, Archibald AL: Pig genomics. The Genetics of the Pig. Edited by: Rothschild MF, Ruvinsky A. 2011, CAB International, Wallingford, 179-199. 2nd edition.
Ota T, Suzuki Y, Nishikawa T, Otsuki T, Sugiyama T, Irie R, Wakamatsu A, Hayashi K, Sato H, Nagai K, et al: Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat Genet. 2004, 36 (1): 40-45. 10.1038/ng1285.
Immune Response Annotation Group: Structural and functional annotation of the porcine immunome. in preparation
Loveland JE, Gilbert JG, Griffiths E, Harrow JL: Community gene annotation in practice. Database: the journal of biological databases and curation. 2012, 2012: bas009.
Searle SM, Gilbert J, Iyer V, Clamp M: The otter annotation system. Genome Res. 2004, 14 (5): 963-970. 10.1101/gr.1864804.
Fahrenkrug SC, Smith TP, Freking BA, Cho J, White J, Vallet J, Wise T, Rohrer G, Pertea G, Sultana R, et al: Porcine gene discovery by normalized cDNA-library sequencing and EST cluster assembly. Mamm Genome. 2002, 13 (8): 475-478. 10.1007/s00335-001-2072-4.
Gorodkin J, Cirera S, Hedegaard J, Gilchrist MJ, Panitz F, Jorgensen C, Scheibye-Knudsen K, Arvin T, Lumholdt S, Sawera M, et al: Porcine transcriptome analysis based on 97 non-normalized cDNA libraries and assembly of 1,021,891 expressed sequence tags. Genome Biol. 2007, 8 (4): R45. 10.1186/gb-2007-8-4-r45.
Isom SC, Spollen WG, Blake SM, Bauer BK, Springer GK, Prather RS: Transcriptional profiling of day 12 porcine embryonic disc and trophectoderm samples using ultra-deep sequencing technologies. Mol Reprod Dev. 2010, 77 (9): 812-819. 10.1002/mrd.21226.
Wall PK, Leebens-Mack J, Chanderbali AS, Barakat A, Wolcott E, Liang H, Landherr L, Tomsho LP, Hu Y, Carlson JE, et al: Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics. 2009, 10: 347. 10.1186/1471-2164-10-347.
Kim TH, Kim NS, Lim D, Lee KT, Oh JH, Park HS, Jang GW, Kim HY, Jeon M, Choi BH, et al: Generation and analysis of large-scale expressed sequence tags (ESTs) from a full-length enriched cDNA library of porcine backfat tissue. BMC Genomics. 2006, 7: 36. 10.1186/1471-2164-7-36.
Uenishi H, Eguchi T, Suzuki K, Sawazaki T, Toki D, Shinkai H, Okumura N, Hamasima N, Awata T: PEDE (Pig EST Data Explorer): construction of a database for ESTs derived from porcine full-length cDNA libraries. Nucleic Acids Res. 2004, 32: D484-D488. 10.1093/nar/gkh037. (Database issue)
Uenishi H, Eguchi-Ogawa T, Shinkai H, Okumura N, Suzuki K, Toki D, Hamasima N, Awata T: PEDE (Pig EST Data Explorer) has been expanded into Pig Expression Data Explorer, including 10 147 porcine full-length cDNA sequences. Nucleic Acids Res. 2007, 35: D650-D653. 10.1093/nar/gkl954. (Database issue)
Kato S, Oshikawa M, Ohtoko K: Full-length transcriptome analysis using a bias-free cDNA library prepared with the vector-capping method. Methods Mol Biol. 2011, 729: 53-70. 10.1007/978-1-61779-065-2_4.
Suzuki Y, Yoshitomo-Nakagawa K, Maruyama K, Suyama A, Sugano S: Construction and characterization of a full length-enriched and a 5'-end-enriched cDNA library. Gene. 1997, 200 (1–2): 149-156.
Zhu YY, Machleder EM, Chenchik A, Li R, Siebert PD: Reverse transcriptase template switching: a SMART approach for full-length cDNA library construction. BioTechniques. 2001, 30 (4): 892-897.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011, 39: D38-D51. 10.1093/nar/gkq1172. (Database issue)
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
Okayama H, Berg P: High-efficiency cloning of full-length cDNA. Mol Cell Biol. 1982, 2 (2): 161-170.
Pig (Sus scrofa) - Pre-ensembl. http://pre.ensembl.org/Sus_scrofa/
Craig JM, Bickmore WA: Chromosome bands–flavours to savour. BioEssays. 1993, 15 (5): 349-354. 10.1002/bies.950150510.
Wiemann S, Weil B, Wellenreuther R, Gassenhuber J, Glassl S, Ansorge W, Bocher M, Blocker H, Bauersachs S, Blum H, et al: Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res. 2001, 11 (3): 422-435. 10.1101/gr.GR1547R.
Butterfield YS, Marra MA, Asano JK, Chan SY, Guin R, Krzywinski MI, Lee SS, MacDonald KW, Mathewson CA, Olson TE, et al: An efficient strategy for large-scale high-throughput transposon-mediated sequencing of cDNA clones. Nucleic Acids Res. 2002, 30 (11): 2460-2468. 10.1093/nar/30.11.2460.
Morozumi T, Toki D, Eguchi-Ogawa T, Uenishi H: A rapid and cost-effective method for sequencing pooled cDNA clones by using a combination of transposon insertion and Gateway technology. BioTechniques. 2011, 51 (3): 195-197.
Shevchenko Y, Bouffard GG, Butterfield YS, Blakesley RW, Hartley JL, Young AC, Marra MA, Jones SJ, Touchman JW, Green ED: Systematic sequencing of cDNA clones using the transposon Tn5. Nucleic Acids Res. 2002, 30 (11): 2469-2477. 10.1093/nar/30.11.2469.
Farajollahi S, Maas S: Molecular diversity through RNA editing: a balancing act. Trends Genet. 2010, 26 (5): 221-230. 10.1016/j.tig.2010.02.001.
Nishikura K: Functions and regulation of RNA editing by ADAR deaminases. Annu Rev Biochem. 2010, 79: 321-349. 10.1146/annurev-biochem-060208-105251.
Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M: The Ensembl automatic gene annotation system. Genome Res. 2004, 14 (5): 942-950. 10.1101/gr.1858004.
Potter SC, Clarke L, Curwen V, Keenan S, Mongin E, Searle SM, Stabenau A, Storey R, Clamp M: The Ensembl analysis pipeline. Genome Res. 2004, 14 (5): 934-941. 10.1101/gr.1859804.
Ensembl gene annotation project - Sus scrofa (Pig). http://www.ensembl.org/info/docs/genebuild/2012_04_sus_scrofa_genebuild.pdf
Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, et al: The transcriptional landscape of the mammalian genome. Science. 2005, 309 (5740): 1559-1563.
Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, Kodzius R, Watahiki A, Nakamura M, Arakawa T, et al: Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A. 2003, 100 (26): 15776-15781. 10.1073/pnas.2136655100.
Dermitzakis ET, Clark AG: Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol Biol Evol. 2002, 19 (7): 1114-1121. 10.1093/oxfordjournals.molbev.a004169.
Ma W, Wong WH: The analysis of ChIP-Seq data. Methods Enzymol. 2011, 497: 51-73.
Matsumoto T, Okumura N, Uenishi H, Hayashi T, Hamasima N, Awata T: Population structure of pigs determined by single nucleotide polymorphisms observed in assembled expressed sequence tags. Anim Sci J. 2012, 83 (1): 14-22. 10.1111/j.1740-0929.2011.00920.x.
Kojima-Shibata C, Shinkai H, Morozumi T, Jozaki K, Toki D, Matsumoto T, Kadowaki H, Suzuki E, Uenishi H: Differences in distribution of single nucleotide polymorphisms among intracellular pattern recognition receptors in pigs. Immunogenetics. 2009, 61 (2): 153-160. 10.1007/s00251-008-0350-y.
Morozumi T, Uenishi H: Polymorphism distribution and structural conservation in RNA-sensing Toll-like receptors 3, 7, and 8 in pigs. Biochim Biophys Acta. 2009, 1790 (4): 267-274. 10.1016/j.bbagen.2009.01.002.
Shinkai H, Tanaka M, Morozumi T, Eguchi-Ogawa T, Okumura N, Muneta Y, Awata T, Uenishi H: Biased distribution of single nucleotide polymorphisms (SNPs) in porcine Toll-like receptor 1 (TLR1), TLR2, TLR4, TLR5, and TLR6 genes. Immunogenetics. 2006, 58 (4): 324-330. 10.1007/s00251-005-0068-z.
Uenishi H, Shinkai H: Porcine Toll-like receptors: the front line of pathogen monitoring and possible implications for disease resistance. Dev Comp Immunol. 2009, 33 (3): 353-361. 10.1016/j.dci.2008.06.001.
Jozaki K, Shinkai H, Tanaka-Matsuda M, Morozumi T, Matsumoto T, Toki D, Okumura N, Eguchi-Ogawa T, Kojima-Shibata C, Kadowaki H, et al: Influence of polymorphisms in porcine NOD2 on ligand recognition. Mol Immunol. 2009, 47 (2–3): 247-252.
Shinkai H, Okumura N, Suzuki R, Muneta Y, Uenishi H: Toll-Like receptor 4 polymorphism impairing lipopolysaccharide signaling in Sus scrofa, and its restricted distribution among Japanese wild boar populations. DNA Cell Biol. 2012, 31 (4): 575-581. 10.1089/dna.2011.1319.
Shinkai H, Suzuki R, Akiba M, Okumura N, Uenishi H: Porcine Toll-like receptors: recognition of Salmonella enterica serovar Choleraesuis and influence of polymorphisms. Mol Immunol. 2011, 48 (9–10): 1114-1120.
Rothschild MF, Hu ZL, Jiang Z: Advances in QTL mapping in pigs. Int J Biol Sci. 2007, 3 (3): 192-197.
Matsumoto T, Nakajima I, Eguchi-Ogawa T, Nagamura Y, Hamasima N, Uenishi H: Changes in gene expression in a porcine preadipocyte cell line during differentiation. Anim Genet. in press
Kano M, Toyoshi T, Iwasaki S, Kato M, Shimizu M, Ota T: QT PRODACT: usability of miniature pigs in safety pharmacology studies: assessment for drug-induced QT interval prolongation. J Pharmacol Sci. 2005, 99 (5): 501-511. 10.1254/jphs.QT-C13.
Hartley JL, Temple GF, Brasch MA: DNA cloning using in vitro site-specific recombination. Genome Res. 2000, 10 (11): 1788-1795. 10.1101/gr.143000.
Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8 (3): 186-194.
Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8 (3): 175-185.
Jurka J: Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 2000, 16 (9): 418-420. 10.1016/S0168-9525(00)02093-X.
Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B, et al: TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics. 2003, 19 (5): 651-652. 10.1093/bioinformatics/btg034.
Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-877. 10.1101/gr.9.9.868.
Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res. 1998, 8 (3): 195-202.
Devine SE, Chissoe SL, Eby Y, Wilson RK, Boeke JD: A transposon-based strategy for sequencing repetitive DNA in eukaryotic genomes. Genome Res. 1997, 7 (5): 551-563.
Ontology file downloads. http://www.geneontology.org/GO.downloads.ontology.shtml
We thank Maiko Tanaka-Matsuda, Toshie Iioka, Takako Suzuki, and Ikuyo Nakamori (JATAFF) for their technical assistance. This work was supported by the Integrated Research Project for Insect and Animal Using Genome Technology from the Ministry of Agriculture, Forestry and Fisheries, Japan, and Grants-in-Aid from the Japan Racing Association.
The authors declare that they have no competing interests.
HU designed the entire analysis and led the project. TM, DT, TEO, and HU conducted the EST analysis and sequencing of cDNA clones. LAR and LBS prepared the tissues of pigs cloned from the individual subjected to draft genome sequencing. Computational analysis was conducted by HU. HU wrote the draft, and all of the authors approved the manuscript.
Electronic supplementary material
Additional file 1: Density of genes on pig chromosomes, as demonstrated by localization of the EST assemblies. Loci aligned by the expressed sequence tag assemblies within 5-Mb windows on the sequences of pig autosomes and the X chromosome (Sscrofa10.2) were counted. Each window slides by 100 kb. Solid lines indicate loci in the orientation from pter to qter on the chromosomes, and dotted lines indicate loci in the orientation from qter to pter. (PDF 7 MB)
Cite this article
Uenishi, H., Morozumi, T., Toki, D. et al. Large-scale sequencing based on full-length-enriched cDNA libraries in pigs: contribution to annotation of the pig genome draft sequence. BMC Genomics 13, 581 (2012) doi:10.1186/1471-2164-13-581
- Sus scrofa
- Full-length cDNA
- Genome annotation