- Open Access
IRcall and IRclassifier: two methods for flexible detection of intron retention events from RNA-Seq data
© Bai et al.; licensee BioMed Central Ltd. 2015
- Published: 21 January 2015
The emergence of next-generation RNA sequencing (RNA-Seq) provides tremendous opportunities for researchers to analyze alternative splicing on a genome-wide scale. However, accurate detection of intron retention (IR) events from RNA-Seq data has remained an unresolved challenge in next-generation sequencing (NGS) studies.
We propose two new methods: IRcall and IRclassifier to detect IR events from RNA-Seq data. Our methods combine together gene expression information, read coverage within an intron, and read counts (within introns, within flanking exons, supporting splice junctions, and overlapping with 5' splice site/ 3' splice site), employing ranking strategy and classifiers to detect IR events. We applied our approaches to one published RNA-Seq data on contrasting skip mutant and wild-type in Arabidopsis thaliana. Compared with three state-of-the-art methods, IRcall and IRclassifier could effectively filter out false positives, and predict more accurate IR events.
The data and codes of IRcall and IRclassifier are available at http://mlg.hit.edu.cn/ybai/IR/IRcallAndIRclass.html
- Splice Site
- Random Forest
- Information Gain
- Read Count
- Intron Retention
Alternative splicing of precursor messenger RNA (pre-mRNA) produces different mRNA isoforms from a single genic locus during gene expression, resulting in functional complexity and diversity of proteins for higher eukaryotic organisms [1–3]. Alternative splicing has four main patterns that include cassette exon skipping (ES), alternative 5' and 3' splice site (ASS), mutually exclusive exon splicing (MXE) and intron retention (IR) . Among those four alternative splicing patterns, genome-wide intron retention (IR) detection is a popular research topic in biology.
Traditional methods which analyze microarray data, have provided a rich source of information for IR event detection . However, the hybridization-based technology employed by microarray is largely restricted to existing genome sequence knowledge, with a limited range of quantification . With the development of next-generation sequencing technology, RNA-Seq, which extends the analysis of previous unidentified genes and splicing variants, is rapidly outperforming microarrays for genome-wide studies [6, 7]. With various statistical and computational strategies, many recent studies have analyzed RNA-Seq data for IR event detection, including ExpressionPlot , MATS , Wang's Framework , IRFinder  and etc. ExpressionPlot  cuts off low-density introns and adopts Fisher's exact test or Chi-squared test to quantify IR events, using only read counts (within introns/ flanking exons). Another framework MATS  calculates the Bayesian posterior probabilities of IR to assess their difference between treatment and control conditions, using only read counts (supporting splice junctions/ within introns). Besides, an analysis pipeline based on de novo mapping using BLAT  is introduced by , which detects the complete retention of an intron in a transcript. However, BLAT employs some low quality reads for read alignment, which would bring in noises . Recently, IRFinder  uses read counts within introns to estimate the difference of read coverage between treatment and control conditions by adopting Audic and Claverie Test . However, as for read coverage calculation, besides uniquely mapped reads, IRFinder employs the reads mapped to multiple positions in the genome, which would bring in noises.
In this paper, we propose two efficient and effective methods: Intron Retention call (IRcall) and Intron Retention classifier (IRclassifer) for of IR event detection from RNA-Seq data. Compared with existing IR event detection methods, IRcall and IRclassifier have four major contributions:
First, besides traditional features for IR event detection, IRcall and IRclassifier employ new features, including read counts overlapping with 5' splice site/ 3' splice site, and gene expression between treatment and control conditions, which help to filter out false positives caused by different gene expression; Second, IRcall employs ranking strategy, with a new formula to calculate IR scores, which avoids bias and declines false positives; Third, IRclassifier employs machine learning techniques, based on J48 Decision Tree and Random Forests, for IR event detection. IRclassifier selects training examples by incorporating the predictions from three state-of-the-art approaches, and constructs 17 features between treatment and control conditions to represent introns in higher-dimensional spaces; Last but not least, experimenting the published RNA-Seq data of skip mutant and wild-type in Arabidopsis thaliana , IRcall and IRclassifier could identify both known and novel IR events, with high prediction precisions. Moreover, IRclassifier based on J48 Decision Tree and Random Forests could predict IR events with a precision of 98.5% and 99.0%, respectively.
To overcome limitations of existing approaches, our idea is to utilize more features, employing ranking strategy and machine learning methods for IR event detection.
The inputs of IRcall and IRclassifier include the alignment results of RNA-Seq data and the reference annotation. The raw RNA-Seq reads between treatment and control conditions are aligned to the reference genome sequences by TopHat2 , and the resulting alignments are stored and sorted in BAM files, defined by . The reference annotation provides the locations of introns, exons and genes. As for the raw RNA-Seq read alignment, the parameters of TopHat2 are set as follows: min − anchor − length = 4, min − intron − length = 40, max − intron − length = 2000, min − segment − intron = 40, max − segment − intron = 2000, min − coverage − intron = 40, and max − coverage − intron = 2000.
For N intron , N exon , N junc , N5ssand N3ss, we count the original reads within an intron, within flanking exons, supporting splice junctions, overlapping with 5' splice site and 3' splice site, respectively. As for N coverage , we calculate the percentage of covered positions (with one read at least) within an intron. In addition, we employ Cuffdiff  to calculate gene expression RPKM values (RPKM: Reads per kilo base per million), denoted by N expression .
As shown in Figure 2(b), we will get 7 features , , , , , and for treatment condition, and 7 features , , , , , and for control condition. Those 14 features form the feature space of an intron for IR event detection.
Intron Removal Criteria.
In our experiments, we set equal weights to NDIE, NDIJ and NDIC, thus . The NDIE, NDIJ and NDIC are calculated as follows:
As IRcall defines proper intron removal criteria (in Table 1) and IR score formula, it well considers read counts (within introns, within flanking exons, supporting splice junctions, and overlapping with 5' splice site/ 3' splice site), read coverage within an intron, and gene expression information, between treatment and control conditions. Hence, IRcall could effectively filter out false positives and improve prediction precision.
IRclassifier is a framework drawing on machine learning ideas to facilitate better identification of IR events from RNA-Seq data, which contains three main steps (Figure 3(b)): first, the training set is constructed by incorporating the predictions of ExpressionPlot , MATS  and IRFinder , with 17 significant features to interpret each instance. The instances are saved in feature matrix X, labeled with "yes" and "no", indicating positive and negative ones, respectively; second, the J48 Decision Tree (J48) classifier and Random Forests (RF) classifier are trained respectively, with 10-fold cross-validation to avoid over-fitting; finally, new RNA-Seq data are sent to the two classifiers respectively for IR event prediction.
Our training data set are constructed based on the results of IR event identification, by MATS, ExpressionPlot and IRFinder. The instances hit by at least two of the three predicting algorithms are taken as "positive" ones, labeled by "yes", while those hit by none of the algorithms are taken as "negative" ones, labeled by "no". Thus, 741 positive instances and 3525 random negative instances are selected as the training set. The instance is interpreted by 17 features: 14 features from treatment and control samples (Figure 2) and 3 features DIE, DIJ , and DIC (defined in section Methods).
As for classifiers, we employ J48 Decision Tree and Random Forests algorithms from Weka  to conduct two independent predictions. J48 classifier forms rules from pruned partial decision trees built by C4.5 heuristics , which minimizes the number of tree levels and tree nodes to maximize data generalization. During tree building, information gain is adopted for feature selection, such that feature of largest information gain is chosen for the next splitting. RF classifier  constructs independent decision tree classifiers by bagging and random feature selection, and aggregates the outputs of each decision tree in RF to produce one final prediction by a majority vote of the trees. Here, the RF classifier is constructed with 10 decision trees, each selecting 5 random features. To avoid overfitting, both J48 classifier and RF classifier employ 10-fold cross-validation.
We use real RNA-seq data http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/sra?term=SRP008262 to detect IR events, which are comprise of skip mutant and wild-type in Arabidopsis thaliana , including two replicated skip mutant samples (skip1 and skip2) and two replicated wild-type samples (wt1 and wt2). As for the reference, we take Arabidopsis thaliana TAIR9 genome, with sequences and annotations ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome release/. The performance of IRcall and IRclassifier are studied against state-of-the-art IR prediction methods ExpressionPlot , MATS  and IRFinder .
Data normalization and experimental setup
As the numbers of mapped reads in the two skip mutant samples (skip1 and skip2) and two wild-type samples (wt1 and wt2) are different (28465249 reads in skip1, 20849546 reads in skip2, 44865242 reads in wt1, and 18343654 reads in wt2), we take the minimum read number 18343654 as the standard normalization scale (SNS). Then all read counts that construct features are divided by its sample's overall read number and multiplied by SNS for normalization. The sum of normalized read counts in skip1 and skip2 (wt1 and wt2) is taken as the read count for treatment (control) sample. Then all relevant features and scores are calculated according to the algorithms in Section Methods.
In the first group of experiments, IRcall is taken to detect differential IR events from RNA-Seq data on skip mutant and wild-type. For all of the 158580 introns in Arabidopsis thaliana TAIR9 reference annotation, we firstly filtered out the introns not satisfying the intron removal criteria of IRcall (see Table 1), resulting in 928 IR event candidates. Then the IR scores of those 928 IR event candidates are calculated, according to which the candidates are sorted in descending order.
To evaluate the performance of IRcall, we compared the 928 IR event candidates of IRcall with the positive predictions of MATS, ExpressionPlot and IRFinder. As illustrated in Figure 4(b), the 928 IR event candidates of IRcall have 199 new predictions, and 5, 624, 383 overlaps with MATS, ExpressionPlot, and IRFinder, respectively. From the new predictions and each overlap, we randomly picked one case to display by IGV software  and , to study whether the introns are retained. As shown in Figure 6, our new prediction intron 1 of AT4G26080.1 (d) is retained between treatment and control conditions, in the similar way as those in overlaps with MATS (a), ExpressionPlot (b) and IRFinder (c). That is, our IRcall could predict known and novel IR event candidates.
In addition, the predictions of IRclassifier are evaluated with the reference predictions of MATS, ExpressionPlot and IRFinder. As shown in Figure 4(C) and 4(D), the J48 classifier predicts 1086 new IR events, and 22 MATS, 1898 ExpressionPlot, 1226 IRFinder overlaps; while the RF classifier predicted 850 new IR events, and 18 MATS, 1751 ExpressionPlot, 1407 IRFinder overlaps. Among the new IRclassifier predictions, we randomly choose two cases displayed by IGV software in Figure 6: (e) intron 7 of AT5G61500.1 (predicted by J48 classifier) in skip mutant was retained compared with wild-type; and (f) AT3G45050 (predicted by J48 and RF classifiers) was different spliced between treatment and control conditions, where transcript 1 of AT3G45050 (AT3G45050.1) was expressed under treatment condition, but transcript 2, 3, and 4 of AT3G45050 (AT3G45050.2, AT3G45050.3 and AT3G45050.4) are expressed under control condition.
read counts within an intron
read counts overlapping with 3 splicing site
read counts overlapping with 5 splicing site
read coverage within an intron
read counts supporting splice junction
read counts within flanking exons
gene expression RPKM values
In this paper, we overcome the limitations of traditional statistical and computational methods for IR event identification from RNA-Seq data, by proposing two novel algorithms, IRcall and IRclassifier, to detect the IR events between treatment and control conditions. In our methods, new features are employed to filter out false positives effectively. IRcall takes 14 features and defines a novel formula IRscore for candidate IR event ranking. From experiments, 98.91% of IRcall's top-10% high ranked IR event candidates have been verified by other methods. As a machine learning method for IR event detection, IRclassifier employs 17 features, and gets 98.5% and 99.0% recognition precisions, with J48 Decision Tree classifier and Random Forests classifier, respectively. Therefore, we believe that our IRcall and IRclassifier are effective to detect IR events between treatment and control conditions from RNA-Seq data, delivering valuable information and tools for alternative splicing research.
In future, instead of using TopHat2 for junction alignment, we will consider assessing the influence of different junction alignment methods on the performance of IRcall and IRclassifier. Besides, we plan to use IRcall and IRclassifier to detect IR events from paired-end sequencing data.
Publication of this article was funded by the Natural Science Foundation of China [Nos: 61173085 and 61102149] and the National High-Tech Research and Development Program (863) of China [Nos: 2012AA020404, 2012AA02A602 and 2012AA02A604].
This article has been published as part of BMC Genomics Volume 16 Supplement 2, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Genomics. The full contents of the supplement are available online at http://0-www.biomedcentral.com.brum.beds.ac.uk/bmcgenomics/supplements/16/S2
- Graveley BR: Alternative splicing: increasing diversity in the proteomic world. Trends in Genetics. 2001, 17: 100-107. 10.1016/S0168-9525(00)02176-4.View ArticlePubMedGoogle Scholar
- Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics. 2008, 40: 1413-1415. 10.1038/ng.259.View ArticlePubMedGoogle Scholar
- Nilsen TW, Graveley BR: Expansion of the eukaryotic proteome by alternative splicing. Nature. 2010, 463: 457-463. 10.1038/nature08909.PubMed CentralView ArticlePubMedGoogle Scholar
- Matlin AJ, Clark F, Smith CWJ: Understanding alternative splicing: towards a cellular code. Nature Reviews Molecular Cell Biology. 2005, 6: 386-398. 10.1038/nrm1645.View ArticlePubMedGoogle Scholar
- Ner-Gaon H, Fluhr R: Whole-genome microarray in arabidopsis facilitates global analysis of retained introns. DNA Research. 2006, 13: 111-121. 10.1093/dnares/dsl003.View ArticlePubMedGoogle Scholar
- Wang Z, Gerstein M, Snyder M: Rna-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009, 57-63. 7.10Google Scholar
- Ozsolak F, Milos PM: Rna sequencing: advances, challenges and opportunities. Nature Reviews Genetics. 2011, 12: 87-98. 10.1038/nrg2934.PubMed CentralView ArticlePubMedGoogle Scholar
- Friedman BA, Maniatis T: Expressionplot: a web-based framework for analysis of rna-seq and microarray gene expression data. Genome Biology. 2011, 12: R69-10.1186/gb-2011-12-7-r69.PubMed CentralView ArticlePubMedGoogle Scholar
- Shen S, Park JW, Huang J, Dittmar KA, Lu Z-x, Zhou Q, Carstens RP, Xing Y: Mats: a bayesian framework for flexible detection of differential alternative splicing from rna-seq data. Nucleic Acids Research. 2012, 40 (8): e61-10.1093/nar/gkr1291.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang X, Wu F, Xie Q, Wang H, Wang Y, Yue Y, Gahura O, Ma S, Liu L, Cao Y, Jiao Y, Puta F, McClung CR, Xu X, Ma L: Skip is a component of the spliceosome linking alternative splicing and the circadian clock in arabidopsis. The Plant Cell. 2012, 24: 3278-3295. 10.1105/tpc.112.100081.PubMed CentralView ArticlePubMedGoogle Scholar
- Wong JJ-L, Ritchie W, Ebner OA, Selbach M, Wong JWH, Huang Y, Gao D, Pinello N, Gonzalez M, Baidya K, Thoeng A, Khoo T-L, Bailey CG, Holst J, Rasko JEJ: Orchestrated intron retention regulates normal granulocyte differentiation. Cell. 2013, 154: 583-595. 10.1016/j.cell.2013.06.052.View ArticlePubMedGoogle Scholar
- Kent WJ: Blatthe blast-like alignment tool. Genome Research. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.PubMed CentralView ArticlePubMedGoogle Scholar
- Grant GR, Farkas MH, Pizarro AD, Lahens NF, Schug J, Brunk BP, Stoeckert CJ, Hogenesch JB, Pierce EA: Comparative analysis of rna-seq alignment algorithms and the rna-seq unified mapper (rum). Bioinformatics. 2011, 27: 2518-2528.PubMed CentralPubMedGoogle Scholar
- Audic S, Claverie J-M: The significance of digital gene expression profiles. Genome Research. 1997, 7: 986-995.PubMedGoogle Scholar
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL: Tophat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology. 2013, 14: R36-10.1186/gb-2013-14-4-r36.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup, GPDP: The sequence alignment/map format and samtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by rna-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010, 28: 511-515. 10.1038/nbt.1621.PubMed CentralView ArticlePubMedGoogle Scholar
- Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2005, Morgan Kaufmann Publishers Inc., San Francisco, CA, USAGoogle Scholar
- Quinlan JR: C4.5: Programs for Machine Learning. 1993, Morgan Kaufmann Publishers Inc., San Francisco, CA, USAGoogle Scholar
- Breiman L: Random forests. Machine learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.View ArticleGoogle Scholar
- Thorvaldsdttir H, Robinson JT, Mesirov JP: Integrative genomics viewer (igv): high-performance genomics data visualization and exploration. Briefings in Bioinformatics. 2012, 14: 178-192.View ArticleGoogle Scholar
- Robinson JT, Thorvaldsdttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nature Biotechnology. 2011, 29: 24-26. 10.1038/nbt.1754.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.