- Research article
- Open Access
Genome-wide characterization of non-reference transposons in crops suggests non-random insertion
- Bin Wei†1,
- Hanmei Liu†2,
- Xin Liu3,
- Qianlin Xiao1,
- Yongbin Wang1,
- Junjie Zhang2,
- Yufeng Hu4,
- Yinghong Liu1,
- Guowu Yu4 and
- Yubi Huang1, 4
© The Author(s). 2016
- Received: 27 July 2015
- Accepted: 20 June 2016
- Published: 2 August 2016
Transposons (transposable elements, or TEs) are DNA sequences that can change their position within the genome. A large number of TEs have been identified in the reference genome of each crop (termed accumulated TEs), and they are an important part of the genome. However, it is unclear whether resequenced crop accessions carry TEs at insertion positions that differ from those in the reference genome (termed non-reference transposable elements, non-ref TEs), and what the characteristics of such TEs (such as their number, type and distribution) are. To identify and characterize crop non-ref TEs, we analyzed non-ref TEs in more than 125 accessions from rice (Oryza sativa), maize (Zea mays) and sorghum (Sorghum bicolor) using resequencing data with paired-end mapping methods.
We identified 13,066, 23,866 and 35,679 non-ref TEs in rice, maize and sorghum, respectively. Genome-wide characterization shows that most non-ref TEs were unique and that non-ref TE classes differed among rice, maize and sorghum. We found that non-ref TEs have a strong positive correlation with gene number and a bias toward insertion near genes, but with a preference for avoiding coding regions in maize and sorghum. The genes affected by non-ref TE insertion were functionally enriched for stress response mechanisms in all three crops.
These observations suggest that transposon insertion is not a random event and that it generates genomic diversity, which may affect the intraspecific adaptation and evolution of crops.
- Non-reference transposons
- Genome-wide analysis
- Non-random insertion
- Stress response
Transposons are DNA sequences that can change their positions within the genome. Transposons were first discovered in maize by McClintock in the 1940s, and over the following decades they have been found in almost every plant and animal genome. Moreover, transposons are important components of crop genomes. For example, at least 35 % of the rice genome, 62 % of the sorghum genome, and nearly 85 % of the maize genome are made up of transposable elements (TEs).
A scheme for the classification of transposons is based on transposition mechanisms, sequence similarities and structural relationships. Transposons are divided into two classes: DNA transposons and RNA transposons (retrotransposons). Retrotransposons include the following three groups: long terminal repeat (LTR) retrotransposons, which are flanked by long terminal repeats and encode reverse transcriptase; long interspersed elements (LINEs), which lack LTRs and are transcribed by RNA polymerase II; and short interspersed elements (SINEs), which also lack LTRs and are transcribed by RNA polymerase III. In addition, there are the helitrons, which are replicated by the ‘rolling-circle’ mechanism and are therefore also called rolling-circle (RC) transposons. Transposons of these classes are widely distributed and constitute major components of plant genomes. Additionally, TE classes may be subdivided into superfamilies depending on their replication strategies in crops, such as LTR/Copia, LTR/Gypsy, DNA/CMC-EnSpm, DNA/MULE-MuDR, LINE/L1 and RC/Helitron.
In recent years, the importance of transposons in genome structure, function and evolution has gradually been recognized. As fundamental functional elements of genomes, transposons play important roles in the formation and evolution of the DNA “jigsaw puzzle” structure. They are distributed nonrandomly in large genomes, and their distribution correlates with that of other functional elements [7, 8]. Transposons not only affect plant genome structure but also play important roles in the regulation of gene expression. Their activity can inactivate genes. Some transposons preferentially insert into genes or into the regions flanking genes, leading to mutations that affect gene function. This transposon activity can be engineered using appropriate vectors to produce artificial mutations in genes. For example, wrinkled peas result from a 0.8-kb transposon insertion in the SBE1 gene, the mechanism of which is similar to that of the maize Ac/Ds transposon family. Transposon insertion can also positively or negatively alter gene expression levels. A classic example is the transposon insertion into intron 1 of the maize knotted1 gene, which causes the gene to be expressed in the leaves. In addition, transposon insertions can cause gene rearrangement and epigenetic silencing.
With the advance of high-throughput sequencing and data analysis technologies, researchers have been able to identify new transposon insertions in various species. A comparative genome analysis showed that 14 % of the genomic differences between Nipponbare and 9311 are the result of transposon insertion. Naito et al. detected 1664 mPing transposon insertions by analyzing the genome resequencing data of 24 rice accessions. Ewing and Kazazian analyzed data from the 1000 Genomes Project and presented their analysis of LINE-1 insertions in genomes that are not represented in the reference genome assembly. Tian et al. analyzed sequencing data of 31 wild and cultivated soybeans and detected 34,154 new transposon insertions, which revealed the evolutionary trends of transposons in soybean. The above studies demonstrate that transposons differ markedly between accessions of the same species, and these differences may play important roles in the evolution of species.
Rice, maize and sorghum are important cereal crops, and reference genomes are available for all three. Many landrace, improved and wild accessions of these crops have been resequenced using second-generation sequencing technology. Lai et al. resequenced six maize inbred lines, and 103 maize lines of teosintes, landraces and improved varieties were resequenced in the maize HapMap2 project (HapMapV2). Genome resequencing of 40 cultivated lines and ten wild lines of rice was completed, with an average depth of >15X. Mace et al. resequenced 45 sorghum varieties with an average sequencing depth of 16–45X. At the same time, many methods and tools have been used to identify new transposon insertions in resequenced accessions. These insertions lie at genomic locations different from those in the reference genome and are termed non-reference transposable elements (non-ref TEs); that is, non-ref TEs are absent from the reference genome but present in the genomes of other resequenced accessions. RetroSeq introduced a method for this purpose that maps paired-end reads to the reference genome and to a database of accumulated transposons. First, one end of each read pair is mapped to the reference genome, while its mate is mapped to the transposon library; such read pairs therefore overlap with potential transposon insertion sites. Second, transposons that pass aggregation analyses of all possible positions and filtering for depth coverage are designated as non-reference transposons. Although transposons are major components of the genome, their exact functions and relevance in plant genomes have not been fully revealed. Genome resequencing of crop accessions can be used efficiently to identify and characterize non-ref TEs, which, compared with the reference genome, have different insertion positions in the accessions.
In this study, resequencing data of 125 accessions of rice, maize and sorghum were collected, including wild, landrace and improved groups. Non-ref TEs were identified by aligning paired-end reads to the reference genome and to transposon databases separately. To characterize genome-wide non-ref TEs, we compared classes of non-ref TEs between species and between groups, and analyzed the insertion locations and affected genes. We found that the number, classification and distribution of non-ref TEs differed between crop groups and between accessions of the same species. In addition, non-ref TEs had an insertion preference for intergenic regions, avoiding coding regions. These observations suggest that transposon insertion is not a random event. Furthermore, the functional analysis of affected genes suggested that transposon insertion plays an important role in the adaptive evolution of crops.
Identification of non-ref TE insertions
We used RepeatMasker (Version: 3.3.0) with the TE database library extracted from Repbase Update to predict the accumulated TEs. This analysis identified total TE lengths of 142,446,614 bp in rice, 1,585,325,106 bp in maize, and 434,877,678 bp in sorghum, comprising 37.22, 76.72 and 58.88 % of the three reference genomes, respectively.
Summary of the non-ref TEs in rice, maize and sorghum (table body not recovered; column headers include raw data depth, average non-ref TEs, and % in genome)
We used PCR-based validation to examine TE insertion events in the maize inbred line MO17, while B73 served as the reference. The results for the predicted TE insertion positions show different fragment lengths between these two lines, and the sequence results support our prediction (Fig. 1b and c and Additional file 2: Figure S2).
Non-ref TE sharing in the accessions and groups
The non-ref TEs shared by the three groups are shown in Fig. 2b. The number of shared non-ref TEs was highest between improved and landrace groups for rice, maize and sorghum. The unique non-ref TEs were highest in the landrace groups of rice and maize and the wild group of sorghum. Considering the differences in evolutionary history for the reference genomes and the method used to discover non-ref TEs, these results suggest that differences in non-ref TEs between groups are related to the genetic relationships.
Classification of the non-ref TEs
Distribution of non-ref TEs classes between accumulated TEs and non-ref TEs
To further explore differences in the non-ref TEs, we compared superfamilies between accession groups. We used Student’s t-test to identify significantly different superfamilies of non-ref TEs from each group in the three species. The wild group of maize was excluded from this analysis because that group had only one accession. In rice, LINE/L1 and RC/Helitron were significantly different between the improved group and the landrace group (p < 0.01). In maize, DNA/DNA, DNA/PIF-Harbinger, DNA/hAT-Ac, DNA/hAT-Tip100, LINE/RTE-BovB, LTR/Copia and LTR/Gypsy were all significantly different between the improved group and the landrace group. In sorghum, DNA/CMC-EnSpm, DNA/DNA, LTR/Gypsy and RC/Helitron were significantly different between the wild group and the improved group, and 11 superfamilies of non-ref TEs were significantly different between the wild group and the landrace group (Additional file 2: Table S1). The numbers of non-ref TE classes and superfamilies in rice, maize and sorghum are in Additional file 3: Table S5.
When TEs are compared between accessions, longer TEs are expected to attract more reads under random sampling, so read counts need to be normalized by TE length. We therefore calculated the reads per kilobase per million mapped reads (RPKM) for each transposon in all accessions of the three species and then calculated the Pearson correlation coefficients in pairwise comparisons. For example, the RPKM values of the “Gypsy5-ZM_LTR” transposon are 4762 and 4873 in the two maize accessions Mo17 and 148, and the RPKM values of the “LINE1-57_ZM” transposon are 122 and 76. We calculated the RPKM values for each kind of non-ref TE and their correlation coefficient between Mo17 and 148. The Pearson value was 0.98, suggesting that Mo17 and 148 have similar patterns of non-ref TE insertion (Fig. 3b). All other results are given in Additional file 2: Figure S5. The distribution of Pearson values is shown in Fig. 3c. In rice, the average Pearson correlation coefficient (PCC) of the RPKMs between accessions was 0.70, with a minimum of 0.17 and a maximum of 0.99. In maize, the average PCC was 0.77, with a minimum of 0.40 and a maximum of 0.97. In sorghum, each pairwise comparison had a PCC >0.6, with an average of 0.98, a minimum of 0.88 and a maximum of 1. Therefore, the differences in non-ref TEs between sorghum accessions were smaller than those in rice and maize, which suggests that rice, maize and sorghum have different evolutionary histories and that the genetic differences between sorghum accessions are smaller.
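The pairwise comparison described above can be illustrated with a minimal sketch. Only the two Mo17/148 RPKM pairs for Gypsy5-ZM_LTR and LINE1-57_ZM come from the text; the other TE names and values, and the function name `pearson`, are hypothetical and included only so the example has more than two points.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# RPKM per TE type in two accessions.
# First two entries: values quoted in the text. Last two: hypothetical.
rpkm_mo17 = {"Gypsy5-ZM_LTR": 4762.0, "LINE1-57_ZM": 122.0,
             "hypothetical_TE_A": 830.0, "hypothetical_TE_B": 15.0}
rpkm_148 = {"Gypsy5-ZM_LTR": 4873.0, "LINE1-57_ZM": 76.0,
            "hypothetical_TE_A": 900.0, "hypothetical_TE_B": 20.0}

# Compare only TE types present in both accessions, in a fixed order.
te_types = sorted(set(rpkm_mo17) & set(rpkm_148))
pcc = pearson([rpkm_mo17[t] for t in te_types],
              [rpkm_148[t] for t in te_types])
print(f"PCC between the two accessions: {pcc:.3f}")
```

Repeating this for every pair of accessions gives the distribution of PCC values summarized per species.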
Chromosome distribution of non-ref TEs
Large effects of non-ref TEs
For non-ref TE insertion into intergenic regions, we calculated the distance between non-ref TEs and nearby genes. In rice, the average distance between two nearby genes was 9200 bp, and the average distance between non-ref TEs and nearby genes was 4491 bp. The two indices were 18,436 and 4667 bp for maize and 16,542 and 3533 bp for sorghum. The density of distance from non-ref TEs to nearby genes is illustrated in Fig. 5b. The figure clearly shows that most non-ref TEs tend to insert close to gene regions in rice, maize and sorghum, and regions closest to genes contain smaller numbers of new transposon insertions.
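The distance calculation can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: genes on one chromosome are given as a sorted list of non-overlapping `(start, end)` intervals, and the function names and coordinates are hypothetical.

```python
import bisect

def distance_to_nearest_gene(pos, genes):
    """Distance in bp from an insertion position to the nearest gene.

    genes: list of (start, end) tuples on one chromosome, sorted by start
    and assumed not to overlap. Returns 0 if pos lies inside a gene.
    """
    if not genes:
        raise ValueError("no genes on this chromosome")
    starts = [start for start, _ in genes]
    i = bisect.bisect_right(starts, pos)  # genes[i-1] starts at or before pos
    best = None
    if i > 0:
        _, left_end = genes[i - 1]
        best = 0 if pos <= left_end else pos - left_end
    if i < len(genes):
        right = genes[i][0] - pos
        best = right if best is None else min(best, right)
    return best

def mean_distance(insertions, genes):
    """Average nearest-gene distance over a list of insertion positions."""
    dists = [distance_to_nearest_gene(p, genes) for p in insertions]
    return sum(dists) / len(dists)

# Hypothetical coordinates for illustration only.
genes = [(100, 200), (1000, 1500)]
intergenic_insertions = [300, 900]
print(mean_distance(intergenic_insertions, genes))
```

For the intergenic analysis, insertions that fall inside a gene (distance 0) would be excluded before averaging.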
For non-ref TE insertion into genic regions, the ratios of the non-ref TE insertion into 5′ and 3′ untranslated regions (UTRs) are less than 5 % in rice, maize and sorghum. However, insertion into intron regions is greater than 15 %. The ratios of non-ref TE insertion into coding regions were 15.51, 5.45 and 3.76 % in rice, maize and sorghum, respectively (Fig. 5c). The proportion in rice was much higher than the proportions in maize and sorghum, suggesting that non-ref TEs in rice may have greater effects on gene function.
Generally, TE insertions alter gene expression and function. The numbers of genes with non-ref TE insertions were 4062, 4796 and 3141 in rice, maize and sorghum, respectively; the numbers of coding region insertions by non-ref TEs were 1804, 983 and 622, respectively. Additional file 2: Table S3 shows the structures of these genes compared to the reference genome annotation. Overall these results show that genes with non-ref TEs have a longer average transcript length and average CDS length and a higher average number of exons per gene compared to all of the genes in the genome.
To further investigate the effects of non-ref TEs on gene function, we identified all genes with non-ref TEs in the coding region and annotated them using InterProScan. The results of the gene annotation analysis were similar in rice, maize and sorghum. Most of these genes encoded protein kinases, with annotations including protein kinase, catalytic domain; serine/threonine-/dual-specificity protein kinase, catalytic domain; tyrosine-protein kinase, catalytic domain; serine/threonine-protein kinase, active site; protein kinase, ATP binding site; and serine-threonine/tyrosine-protein kinase catalytic domain. In addition to protein kinases, several other annotations were found (Additional file 4: Table S6). Examples include NB-ARC, a motif shared by plant resistance gene products and regulators of cell death in animals, and cytochrome P450, a key player in plant development and defense.
Gene Ontology (GO) analysis of the cellular component ontology found proteins annotated to the envelope, extracellular region and membrane-enclosed lumen in maize and sorghum. Molecular function ontology analysis identified enzyme regulator and molecular transducer in maize and sorghum. Nutrient reservoir proteins were found only in sorghum. The biological process ontology analysis found proteins of multi-organism process, pigmentation and reproduction mainly in maize and sorghum, death only in sorghum, and developmental process only in rice (Fig. 5d and Additional file 5: Table S7).
Biological Networks Gene Ontology (BiNGO) was used to perform the enrichment analysis of GO terms. Terms such as ATP binding, protein amino acid phosphorylation, protein kinase activity and apoptosis were enriched in rice, maize and sorghum. In rice and maize, many proteins involved in defense response were also enriched. In addition, GO analysis in rice found biological process enrichments for proteolysis, RNA-dependent DNA replication and DNA integration, and molecular function enrichments for calcium-transporting ATPase activity, ribonuclease H activity, peptidase activity and RNA-directed DNA polymerase activity. RNA glycosylase activity, isomerase activity and terpenoid metabolic process were enriched in sorghum only. Iron ion binding was enriched in maize (Additional file 2: Figure S7). The results suggest that the genes affected by non-ref TEs are involved in multiple biological functions, and that the functional annotations are similar in rice, maize and sorghum.
Identification of non-ref TEs using resequencing data
Transposons are an important part of the plant genome: they not only regulate gene expression and gene function but also provide important information for studying the evolutionary history of plants. In recent years, with the development of high-throughput sequencing technology, the volume of genome-resequencing data has grown explosively, and with it the discovery of non-ref TEs from resequencing data. Multiple studies have demonstrated the feasibility of this approach [12, 14, 17]. Our study used a modified RetroSeq workflow, adjusting some alignment methods and parameters to suit genome-wide analysis of non-ref TEs in crops. A total of 125 accessions of rice, maize and sorghum were used to identify novel TE insertions relative to a reference genome. The depth coverage was 6–45×, and the average numbers of non-ref TEs identified were 261, 796 and 793 for rice, maize and sorghum, respectively. We did not find a significant correlation between the number of non-ref TEs and the depth coverage of the sequencing data. These results support the use of resequencing data to identify non-ref TEs. We found that non-ref TEs differed between accessions. We assume these differences are consistent with other polymorphic variations, such as SNPs, InDels and SVs, as these DNA-level changes underlie polymorphisms between accessions. The investigation of non-ref TEs increases our understanding of genetic polymorphism and evolution.
Variation of non-ref TEs among crops
Differences in transposons in the genome occur not only between species but also between groups. We divided the accessions of rice, maize and sorghum separately into three groups: wild, improved and landrace. First, we analyzed the numbers of non-ref TEs in the different groups. The average numbers of non-ref TEs in the improved, landrace and wild groups were 249, 307 and 518 in rice, 1641, 679 and 1798 in maize, and 999, 858 and 2563 in sorghum, respectively. These results indicate that there are more non-ref TEs in the wild group than in the improved and landrace groups in rice, maize and sorghum. The results of the NPSPD analysis were similar (Table 1). Because non-ref TEs are defined as TEs that are absent from the reference genome but present in other accession genomes, accessions that are closely related to the reference genome may be identified with fewer non-ref TEs. By contrast, increased genetic distance would result in more non-ref TEs. The reference genomes were generated by sequencing cultivars of rice (Japonica), maize (B73) and sorghum (BTx623), so they are more distantly related to the wild groups and more closely related to accessions that have undergone domestication and improvement. Second, we compared the superfamilies of non-ref TEs. Significant superfamily differences were observed among the groups. Identifying the source of these differences requires further analysis; however, we speculate that these differences are also related to evolutionary history, genetic relationships between accessions and the distance from accessions to the reference genome.
Non-ref TE insertions are not random events
Positive correlation between non-ref TEs and gene density. Other researchers have also examined the relationship between genes and transposons. In Arabidopsis, distribution analysis of accumulated TEs suggests a negative correlation between gene and TE density [28, 29]. This association is also found in rice, where investigation of non-LTR-RTs (non-long terminal repeat retrotransposons) and DNA transposons revealed a negative correlation with gene density [30, 31]. We analyzed the chromosomal distributions of non-ref TEs and genes in sorghum and maize and found that they were strongly correlated, with mean PCCs of 0.88 and 0.67, respectively. Our discovery of this relationship between non-ref TEs and gene number is novel. The results suggest that the TEs in the region near a gene have high activity, whereas accumulated TEs are more stable. Moreover, the position of non-ref TE insertion in sorghum was positively related to SNP loci; this relationship is also clearly shown for accumulated TEs in the human genome [32, 33]. Presumably, these non-ref TEs are an important source of SNPs in sorghum, whereas in rice and maize non-ref TEs make a smaller contribution to SNPs.
The number and distribution of non-ref TEs in rice differ from those in maize and sorghum, and the correlation coefficient between non-ref TEs and gene number in rice is far lower than in maize and sorghum. There are two possible reasons. (1) The total gene numbers of the rice, maize and sorghum genomes are similar, but the rice genome is far smaller than the sorghum and maize genomes, so it contains fewer TEs. (2) Previous reports showed that the rice genome is more conserved [34, 35]. TE activity in rice has been speculated to be lower than in other grasses, such as maize and sorghum, which leads to small TE differences among rice accessions; accordingly, we identified fewer non-ref TEs in rice. Comparing the gene and non-ref TE distributions among rice, maize and sorghum, a similar total gene number combined with fewer accumulated and non-ref TEs may explain the weak correlation between gene number and non-ref TEs in rice.
Non-ref TEs are often located in regions flanking genes. The analysis of the distance between non-ref TEs and nearby genes found that intergenic non-ref TE insertions tended to lie close to genes, while keeping some distance from the upstream and downstream genes themselves. A similar distribution of miniature inverted repeat transposable elements (MITEs) in regions near genes has been confirmed in rice. TE activity in regions flanking genes can result in complex rearrangements that can affect gene regulation [37, 38]. These results suggest that location biases in non-ref TE insertion may play important roles in gene regulation.
Non-ref TEs are often located in introns. TEs that insert into introns generally have a greater chance of survival because these insertions are less visible to natural selection. Moreover, TE insertions into introns can affect gene regulation in surprising ways [11, 37, 39]. In our analysis of non-ref TE insertion position, the ratio of transposons that inserted into intron regions was greater than 15 %, and the ratio of rice non-ref TE insertion into CDS regions was 15.51 %, compared with 5.45 and 3.76 % in maize and sorghum, respectively. In maize and sorghum, the proportion of non-ref TE insertion into intron regions was much higher than the proportion of insertion into CDS regions, and the proportion of non-ref TE insertion into CDS regions in rice was much higher than the proportions in maize and sorghum. These results suggest two possibilities. First, natural selection negatively influences the detection of TE insertions in exon regions. TE insertions often disrupt the structure and function of genes, and over long evolutionary time such insertions diverge so much that they are no longer identifiable. Second, transposon insertion may occur with a preference to avoid coding regions, or mechanisms that protect coding regions may render TE insertion difficult. Additionally, the transcriptional state of DNA influences DNA structure, which may affect TE insertion. Assuming efficient transposon insertion, such insertions likely occur primarily during the process of transcription. In agreement with this mechanism, in rice, maize and sorghum, genes with non-ref TEs have longer average transcript and CDS lengths and higher average exon numbers per gene.
Non-ref TEs respond to stress. The idea that genomes respond to stress through transposons was first suggested by McClintock. Two approaches can be used to test this hypothesis. The first involves exposing genetically controlled organisms to stress [41–44]. The second involves analysis of natural populations of the same species living in different conditions. Here, we analyzed the functions of genes that are affected by non-ref TE insertion. Although the number of non-ref TEs identified in rice is far lower than in maize and sorghum, the results of gene function annotation and classification are consistent. InterPro results showed that most affected genes encoded proteins annotated as protein kinases, which are involved in many aspects of cellular regulation and metabolism. Additionally, some affected genes were annotated as NB-ARC and cytochrome P450, which are involved in plant resistance and defense. GO analysis showed that the affected genes are functionally diverse. The GO enrichment analysis identified affected genes encoding proteins that have ATP-binding sites, amino acid phosphorylation sites and protein kinase activity, along with biological processes related to cell apoptosis in rice, maize and sorghum. In addition, affected genes in rice and maize included functional enrichments for defense response processes. These results demonstrate that the functions of genes affected by non-ref TE insertion are highly similar in the three crops. Protein phosphorylation alters both protein structure and activity to influence the transmission of information in a cell. Through a series of protein phosphorylation and dephosphorylation steps, plant cells transmit intracellular signals to generate an appropriate response to extracellular stimuli. The results of the functional enrichment analysis suggest that plant cells experiencing stressful external stimulation activate intracellular kinase activities to affect protein ATP-binding sites. Autophosphorylation may follow, and phosphate groups are then transferred to other proteins to amplify the signal cascade regulating downstream gene expression, which leads to either cell apoptosis or the promotion of defense reactions that increase crop resilience. Thus, under environmental stress, gene protection mechanisms are activated, and transposons may be inserted into specific gene regions to maintain defensive intracellular signal transduction, improving crop adaptability to adverse environmental conditions. In this way, transposon insertion may play an important role in plant adaptive evolution.
Transposable elements (TEs) are a major component of plant genomes, but their characteristics across accessions are not clear. We present the genome-wide identification and characterization of non-reference transposable elements in rice, maize and sorghum using resequencing data. Our results show that the non-ref TE classes have different preferences in rice, maize and sorghum. The non-ref TEs have a strong positive correlation with gene number and a bias toward insertion near genes, together with a preference for avoiding coding regions. The genes affected by non-ref TE insertion were functionally enriched for stress response mechanisms. These results suggest that transposon insertion is not a random event and that it generates genomic diversity, which may play a major role in the intraspecific adaptation and evolution of crops. This study provides new insight into the evolution of transposons and their role in plant evolution. In the near future, more plant genomics data should be analyzed to improve our understanding of transposon evolution and of how transposon insertion may have influenced the variation between accessions.
Maize resequencing data were acquired from a project deposited in the NCBI Short Read Archive with accession number SRA010130. This project generated resequencing data from a group of six elite maize inbred lines. Another maize resequencing dataset was acquired from the NCBI Short Read Archive under the accession number SRA051245. The data from this maize HapMapV2 study had a depth coverage that ranged from 4X to 30X. We only used landrace lines sequenced at the same facility. The rice resequencing data, acquired from the NCBI Short Read Archive under accession number SRA023116, comprised 50 accessions in total: 40 cultivated accessions and ten accessions of wild progenitors, with >15X raw data coverage. The acquired sorghum resequencing data had 16–45X raw data coverage of 45 sorghum lines.
The transposon data for the three species studied were extracted from the Repbase Update database at www.girinst.org/repbase/. The Repbase Update (RU) database contains prototypic sequences representing repetitive DNA from different eukaryotic species.
The B73 sequences of the maize reference genome were obtained from the Maize Genome Sequencing Project AGPv2, which was the first draft assembly of the maize genome released in 2010. We used the maize gene annotation from the 5b.60 release of the Maize Genome Sequencing Project based on the AGPv2 assembly. The International Rice Genome Sequencing Project (IRGSP) genome sequence (build 5) was used as the rice reference genome. Accordingly, the rice annotation used the 2009-01-MSU gene set. The sorghum reference genome used was the DOE-JGI assembly (sbi1), and the Sbi1.4 gene set was used for gene annotation.
Identifying non-ref TEs
First, we used BWA (0.6.2-r126) to map the next-generation sequencing reads against the reference genome, and then we used ‘bwa sampe’ to generate paired-read alignments in SAM format. Second, we used SAMTOOLS (Version: 0.1.18) to sort the alignments and then used the command ‘samtools merge’ to merge the alignments from different sequencing lanes.
To identify candidate non-ref TEs with read pair support, we used a modified RetroSeq workflow, adjusting some alignment methods and parameters to suit genome-wide analysis of non-ref TEs in crops. First, we checked the BAM file and discarded alignments with duplicate reads or a mapping quality <30. We kept mapped reads and their unmapped mate reads for further analysis. Second, we used the unmapped mate reads as queries in a BLAST search against the TE library. Matches with >80 % identity and an alignment length >36 that fell within the first 500 bp or the last 500 bp of a sequence in the TE library were accepted as candidate non-ref TEs.
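The BLAST filtering criteria can be sketched as below. This is an illustrative reading of the thresholds, not the code used in the study. It assumes the hits are available in the standard 12-column tabular BLAST format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score), that TE lengths are known, and that a hit counts as terminal when it lies entirely within the first or last 500 bp of the TE; the read name, TE name and length in the usage lines are hypothetical.

```python
def parse_tabular_hit(line):
    """Parse one line of 12-column tabular BLAST output into a dict."""
    f = line.rstrip("\n").split("\t")
    return {
        "query": f[0],
        "subject": f[1],
        "pident": float(f[2]),
        "length": int(f[3]),
        "sstart": int(f[8]),
        "send": int(f[9]),
    }

def is_candidate_hit(hit, te_lengths, min_ident=80.0, min_len=36, edge=500):
    """True if a hit passes the identity, length and TE-terminus criteria.

    te_lengths maps TE name -> TE sequence length in bp (1-based coordinates).
    """
    te_len = te_lengths[hit["subject"]]
    # Subject coordinates are reversed for minus-strand hits, so order them.
    lo = min(hit["sstart"], hit["send"])
    hi = max(hit["sstart"], hit["send"])
    in_head = hi <= edge            # entirely within the first `edge` bp
    in_tail = lo > te_len - edge    # entirely within the last `edge` bp
    return (hit["pident"] > min_ident
            and hit["length"] > min_len
            and (in_head or in_tail))

# Hypothetical example: a 50-bp hit near the 5' end of a 5000-bp TE.
te_lengths = {"TE_example": 5000}
line = "read1\tTE_example\t95.00\t50\t2\t0\t1\t50\t10\t59\t1e-20\t80.0"
hit = parse_tabular_hit(line)
print(is_candidate_hit(hit, te_lengths))
```

Reads whose mates pass this filter, together with the genomic positions of their mapped partners, would then feed into the clustering step.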
We clustered the reads into forward (fwd) and reverse (rev) clusters in 4000-bp regions. Only regions with an average length <200 bp and a read count >5 were considered for non-ref TE identification. Finally, we compared candidate read positions against the accumulated TEs and filtered out the results that overlapped with the reference (Additional file 2: Figure S1).
We used PCR to validate a sample of non-ref TE insertions in the MO17 maize inbred line. Primers were designed for the predicted non-ref TE insertion sites. Comparisons of amplicon sizes between the reference B73 and MO17 were used to determine insertions of predicted non-ref TEs. The PCR products were also sequenced for comparison to further establish the presence of an insertion event (Fig. 1b and Additional file 2: Figure S2).
Genome-wide characterization analysis of non-ref TEs
The Pearson correlation coefficient was calculated between sequencing depth and the number of non-ref TEs across all accessions. The number of non-ref TEs in each accession was obtained after the identification steps. To identify accessions or groups sharing a common non-ref TE, we clustered the insertion positions of non-ref TEs from all accessions within 100-bp regions.
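One way to read the 100-bp clustering is sketched below. The exact clustering rule is not specified in the text, so this is an assumption: calls are sorted by position and a new cluster is started whenever a call lies more than 100 bp from the first position of the current cluster. The function name, accession labels and coordinates are hypothetical.

```python
def cluster_insertions(calls, window=100):
    """Group insertion calls from several accessions into shared sites.

    calls: iterable of (chrom, pos, accession) tuples.
    Returns a list of (chrom, start, end, accessions) tuples, where each
    cluster spans at most `window` bp from its first position.
    """
    clusters = []
    for chrom, pos, acc in sorted(calls):
        current = clusters[-1] if clusters else None
        if current and current[0] == chrom and pos - current[1] <= window:
            current[2] = pos          # extend the cluster end
            current[3].add(acc)       # record the supporting accession
        else:
            clusters.append([chrom, pos, pos, {acc}])
    return [tuple(c) for c in clusters]

# Hypothetical calls from three accessions.
calls = [
    ("chr1", 1000, "acc_A"),
    ("chr1", 1050, "acc_B"),
    ("chr1", 1101, "acc_C"),
    ("chr2", 1000, "acc_A"),
]
clusters = cluster_insertions(calls)
shared = [c for c in clusters if len(c[3]) > 1]
unique = [c for c in clusters if len(c[3]) == 1]
print(len(shared), "shared site(s);", len(unique), "unique site(s)")
```

Mapping each accession to its group (wild, landrace or improved) would then give the group-level sharing counts.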
Classification using the Repbase database divided transposons into classes and subclasses. There were six classes: DNA transposon, LINE, SINE, LTR, RC and Other. The subclasses mainly included LTR/Copia, LTR/Gypsy, DNA/CMC-EnSpm, DNA/MULE-MuDR, LINE/L1 and RC/Helitron. To explore differences in superfamilies between groups, Student’s t-test was used to identify superfamilies of non-ref TEs that differed significantly between groups in the three species. RPKM was calculated as the total number of short sequences mapped to a transposon type divided by the product of the total number of short sequences mapped to the transposon database (in millions) and the length of the transposon type (in kb). We used the RPKM values of all these transposons to calculate pairwise Pearson correlation coefficients between all accessions.
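The RPKM definition above translates directly into a small function. The function and parameter names are illustrative and not taken from the original analysis scripts.

```python
def rpkm(reads_on_type: int, total_mapped_reads: int, type_length_bp: int) -> float:
    """Reads per kilobase per million mapped reads for one transposon type.

    reads_on_type:      short sequences mapped to this transposon type
    total_mapped_reads: short sequences mapped to the whole transposon database
    type_length_bp:     length of the transposon type in base pairs
    """
    millions_mapped = total_mapped_reads / 1e6
    length_kb = type_length_bp / 1e3
    return reads_on_type / (millions_mapped * length_kb)
```

For example, 500 reads on a 5-kb transposon type with 2 million reads mapped to the database give 500 / (2 × 5) = 50 RPKM.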
The numbers of genes, SNPs, transposons, and non-ref TEs were calculated along each chromosome of the reference genome. For rice, the window size was 400 kb, with a sliding step of 200 kb. For maize, the window size was 2 Mb, with a sliding step of 1 Mb. For sorghum, the window size was 600 kb, with a sliding step of 300 kb. To calculate gene numbers on each chromosome, the gene start site was used, and the counts were log2-transformed for plotting.
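The windowed counting can be sketched as below. It assumes that the "sliding window size" quoted for each species is the step by which the window advances, that features are represented by a single coordinate (for genes, the start site), and that windows are half-open and truncated at the chromosome end. The handling of empty windows before the log2 transform is not described in the text, so `log2_count` simply returns `None` for them. All names are hypothetical.

```python
import math
from bisect import bisect_left

# Window and step sizes (bp) as stated in the text, reading the second value as the step.
WINDOWS = {
    "rice": (400_000, 200_000),
    "maize": (2_000_000, 1_000_000),
    "sorghum": (600_000, 300_000),
}


def sliding_window_counts(positions, chrom_length, window, step):
    """Count features in half-open windows [start, end) advancing by `step`.

    Returns a list of (start, end, count); the final windows are truncated
    at chrom_length.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    positions = sorted(positions)
    result = []
    start = 0
    while start < chrom_length:
        end = min(start + window, chrom_length)
        count = bisect_left(positions, end) - bisect_left(positions, start)
        result.append((start, end, count))
        start += step
    return result


def log2_count(count):
    """log2 of a window count, or None for empty windows (an assumption)."""
    return math.log2(count) if count > 0 else None
```

Running the same function on gene start sites, SNP positions, reference TEs and non-ref TE insertion sites gives per-window series that can be compared directly.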
We obtained the protein sequences of genes affected by non-ref TEs and used them in BLASTP searches against the TrEMBL, KEGG and SwissProt databases. The e-value cutoff was set to 1e-5, and we retained only the best match for each protein. For structural domain and motif characterization, we used InterProScan to search the Pfam, PRINTS, PROSITE, ProDom and SMART databases. Homologous protein sequences found by the InterProScan analysis were used for GO analysis. We used WEGO for GO mapping and BiNGO for GO enrichment analysis.
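The hit selection for the BLASTP step can be sketched as follows. This assumes the e-value cutoff is 1e-5, that hits are available as (query, subject, e-value, bit score) tuples, and that "best match" means the lowest e-value with ties broken by the higher bit score. The text does not state the ranking rule, so this choice and the function name are assumptions.

```python
def best_hits(rows, max_evalue=1e-5):
    """Select one best hit per query protein.

    rows: iterable of (query_id, subject_id, evalue, bitscore).
    Returns {query_id: (subject_id, evalue, bitscore)} for hits with
    evalue <= max_evalue.
    """
    best = {}
    for query, subject, evalue, bitscore in rows:
        if evalue > max_evalue:
            continue  # fails the e-value cutoff
        current = best.get(query)
        # Lower e-value wins; on a tie, the higher bit score wins.
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return best
```

Proteins with no hit passing the cutoff are simply absent from the returned dictionary.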
CDS, coding sequence; GO, gene ontology; LINEs, long interspersed elements; LTRs, long terminal repeats; non-ref TEs, non-reference transposable elements; NPSPD, average number of non-ref TEs per sample per depth; PCC, Pearson’s correlation coefficient; RC, rolling-circle transposon; RPKM, reads per kilobase per million mapped reads; SINEs, short interspersed elements; SNP, single-nucleotide polymorphism; TE, transposable element; UTRs, untranslated regions
We thank members of our research group and collaborators for technical assistance. We are grateful to Dr. Xiyin Wang from North China University of Science and Technology (China) and Xuewen Wang from Kunming Institute of Botany (Chinese Academy of Sciences) for discussions during the manuscript revision.
The work was supported by the State Key Development Program for Basic Research of China (2014CB138202) and the National Natural Science Foundation of China (91435114).
Availability of data and materials
All the supporting data are included as additional files.
BW, YH, XL and HL designed the research; BW, HL, QX and YW performed research; BW, HL, QX, YW, JZ, YH, YL and GY analyzed data; BW and HL wrote the paper. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- McClintock B. The origin and behavior of mutable loci in maize. Proc Natl Acad Sci U S A. 1950;36:344–55.
- International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature. 2005;436:793–800.
- Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, et al. The Sorghum bicolor genome and the diversification of grasses. Nature. 2009;457:551–6.
- Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009;326:1112–5.
- Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, et al. A unified classification system for eukaryotic transposable elements. Nat Rev Genet. 2007;8:973–82.
- Seberg O, Petersen G. A unified classification system for eukaryotic transposable elements should reflect their phylogeny. Nat Rev Genet. 2009;10:276.
- Liu Y-H, Zhang M, Wu C, Huang JJ, Zhang H-B. DNA is structured as a linear “jigsaw puzzle” in the genomes of Arabidopsis, rice, and budding yeast. Genome. 2013;57:9–19.
- Wu C, Wang S, Zhang H-B. Interactions among genomic structure, function, and evolution revealed by comprehensive analysis of the Arabidopsis thaliana genome. Genomics. 2006;88:394–406.
- Lisch D. How important are transposons for plant evolution? Nat Rev Genet. 2012;14:49–61.
- Bhattacharyya MK, Smith AM, Ellis TH, Hedley C, Martin C. The wrinkled-seed character of pea described by Mendel is caused by a transposon-like insertion in a gene encoding starch-branching enzyme. Cell. 1990;60:115–22.
- Greene B, Walko R, Hake S. Mutator insertions in an intron of the maize knotted1 gene result in dominant suppressible mutations. Genetics. 1994;138:1275–85.
- Huang X, Lu G, Zhao Q, Liu X, Han B. Genome-wide analysis of transposon insertion polymorphisms reveals intraspecific variation in cultivated rice. Plant Physiol. 2008;148:25–40.
- Naito K, Zhang F, Tsukiyama T, Saito H, Hancock CN, Okumoto Y, et al. Unexpected consequences of a sudden and massive transposon amplification on rice gene expression. Genes Genet Syst. 2009;84:439.
- Ewing AD, Kazazian HH. Whole-genome resequencing allows detection of many rare LINE-1 insertion alleles in humans. Genome Res. 2011;21:985–90.
- Tian Z, Zhao M, She M, Du J, Cannon SB, Liu X, et al. Genome-wide characterization of nonreference transposons reveals evolutionary propensities of transposons in soybean. Plant Cell. 2012;24:4422–36.
- Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, et al. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 2012;44:803–7.
- Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, et al. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat Biotechnol. 2012;30:105–11.
- Mace ES, Tai SS, Gilding EK, Li YH, Prentis PJ, Bian LL, et al. Whole-genome sequencing reveals untapped genetic potential in Africa’s indigenous cereal crop sorghum. Nat Commun. 2013;4:2320.
- Keane TM, Wong K, Adams DJ. RetroSeq: transposable element discovery from next-generation sequencing data. Bioinformatics. 2013;29:389–90.
- Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2010 (http://www.repeatmasker.org).
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005;110:462–7.
- Tenaillon MI, Hufford MB, Gaut BS, Ross-Ibarra J. Genome size and transposable element content as determined by high-throughput sequencing in maize and Zea luxurians. Genome Biol Evol. 2011;3:219–29.
- Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37:D211–5.
- van der Biezen EA, Jones JD. The NB-ARC domain: a novel signalling motif shared by plant resistance gene products and regulators of cell death in animals. Curr Biol CB. 1998;8:R226–7.
- Qi X, Bakht S, Qin B, Leggett M, Hemmings A, Mellon F, et al. A different function for a member of an ancient and highly conserved cytochrome P450 family: from essential sterols to plant defense. Proc Natl Acad Sci. 2006;103:18848–53.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
- Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks. Bioinformatics. 2005;21:3448–9.
- Lin X, Kaul S, Rounsley S, Shea TP, Benito MI, Town CD, et al. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature. 1999;402:761–8.
- Mayer K, Schüller C, Wambutt R, Murphy G, Volckaert G, Pohl T, et al. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature. 1999;402:769–77.
- Tian Z, Rizzon C, Du J, Zhu L, Bennetzen JL, Jackson SA, et al. Do genetic recombination and gene density shape the pattern of DNA elimination in rice long terminal repeat retrotransposons? Genome Res. 2009;19:2221–30.
- Turcotte K, Srinivasan S, Bureau T. Survey of transposable elements from rice genomic sequences: Transposable elements in rice. Plant J. 2008;25:169–79.
- Ng S-K, Xue H. Alu-associated enhancement of single nucleotide polymorphisms in the human genome. Gene. 2006;368:110–6.
- Sela N, Mersch B, Hotz-Wagenblatt A, Ast G. Characteristics of transposable element exonization within human and mouse. PLoS One. 2010;5:e10907. Ruvinsky I, editor.
- Wang X, Tang H, Paterson AH. Seventy million years of concerted evolution of a homoeologous chromosome pair, in parallel, in major poaceae lineages. Plant Cell. 2011;23:27–37.
- Xu J-H, Messing J. Diverged copies of the seed regulatory opaque-2 gene by a segmental duplication in the progenitor genome of rice, sorghum, and maize. Mol Plant. 2008;1:760–9.
- Mao L. Rice transposable elements: a survey of 73,000 sequence-tagged-connectors. Genome Res. 2000;10:982–90.
- Kidwell MG, Lisch D. Transposable elements as sources of variation in animals and plants. Proc Natl Acad Sci U S A. 1997;94:7704–11.
- Kloeckener-Gruissem B, Vogel JM, Freeling M. The TATA box promoter region of maize Adh1 affects its organ-specific expression. EMBO J. 1992;11:157–66.
- Umeda M, Ohtsubo H, Ohtsubo E. Diversification of the rice Waxy gene by insertion of mobile DNA elements into introns. Idengaku Zasshi. 1991;66:569–86.
- McClintock B. The significance of responses of the genome to challenge. Science. 1984;226:792–801.
- Finatto T, de Oliveira AC, Chaparro C, da Maia LC, Farias DR, Woyann LG, et al. Abiotic stress and genome dynamics: specific genes and transposable elements response to iron excess in rice. Rice [Internet]. 2015 [cited 2015 May 28];8. Available from: http://www.thericejournal.com/content/8/1/13.
- Makarevitch I, Waters AJ, West PT, Stitzer M, Hirsch CN, Ross-Ibarra J, et al. Transposable elements contribute to activation of maize genes in response to abiotic stress. PLoS Genet. 2015;11:e1004915.
- Wessler SR. Turned on by stress. Plant retrotransposons. Curr Biol CB. 1996;6:959–61.
- Grandbastien M. Activation of plant retrotransposons under stress conditions. Trends Plant Sci. 1998;3:181–7.
- Capy P, Gasperi G, Biémont C, Bazin C. Stress and transposable elements: co-evolution or useful parasites? Heredity. 2000;85:101–6.
- Stone JM, Walker JC. Plant protein kinase families and signal transduction. Plant Physiol. 1995;108:451–7.
- Lai J, Li R, Xu X, Jin W, Xu M, Zhao H, et al. Genome-wide patterns of genetic variation among elite maize inbred lines. Nat Genet. 2010;42:1027–30.
- Goff SA. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica) (April, pg 92, 2002). Science. 2005;309:879.
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
- Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000;28:45–8.
- Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
- Bairoch A, Boeckmann B, Ferro S, Gasteiger E. Swiss-Prot: juggling between evolution and stability. Brief Bioinform. 2004;5:39–55.
- Ye J, Fang L, Zheng HK, Zhang Y, Chen J, Zhang ZJ, et al. WEGO: a web tool for plotting GO annotations. Nucleic Acids Res. 2006;34:W293–7.