- Open Access
QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization
BMC Genomics volume 17, Article number: 39 (2016)
RNA sequencing (RNA-seq), a next-generation sequencing technique for transcriptome profiling, is being increasingly used, in part driven by the decreasing cost of sequencing. Nevertheless, the analysis of the massive amounts of data generated by large-scale RNA-seq remains a challenge. Multiple algorithms pertinent to basic analyses have been developed, and there is an increasing need to automate the use of these tools so as to obtain results in an efficient and user friendly manner. Increased automation and improved visualization of the results will help make the results and findings of the analyses readily available to experimental scientists.
By combing the best open source tools developed for RNA-seq data analyses and the most advanced web 2.0 technologies, we have implemented QuickRNASeq, a pipeline for large-scale RNA-seq data analyses and visualization. The QuickRNASeq workflow consists of three main steps. In Step #1, each individual sample is processed, including mapping RNA-seq reads to a reference genome, counting the numbers of mapped reads, quality control of the aligned reads, and SNP (single nucleotide polymorphism) calling. Step #1 is computationally intensive, and can be processed in parallel. In Step #2, the results from individual samples are merged, and an integrated and interactive project report is generated. All analyses results in the report are accessible via a single HTML entry webpage. Step #3 is the data interpretation and presentation step. The rich visualization features implemented here allow end users to interactively explore the results of RNA-seq data analyses, and to gain more insights into RNA-seq datasets. In addition, we used a real world dataset to demonstrate the simplicity and efficiency of QuickRNASeq in RNA-seq data analyses and interactive visualizations. The seamless integration of automated capabilites with interactive visualizations in QuickRNASeq is not available in other published RNA-seq pipelines.
The high degree of automation and interactivity in QuickRNASeq leads to a substantial reduction in the time and effort required prior to further downstream analyses and interpretation of the analyses findings. QuickRNASeq advances primary RNA-seq data analyses to the next level of automation, and is mature for public release and adoption.
RNA sequencing (RNA-seq) has emerged as a powerful technology in transcriptome profiling [1–3]. Our previous side-by-side comparison between RNA-seq and microarray in investigating T cell activation demonstrated that RNA-seq analysis has many advantages over microarray analysis . In contrast to hybridization-based microarray analyses, RNA-seq has the extra benefits of obtaining transcription start and stop sites, alternative spliced isoforms, and genetic variants in addition to gene expression levels. One apparent shortcoming of early non-stranded (standard) RNA-seq protocols is that a sequence read loses the strand origin information, thus making it difficult to determine accurately the expression levels of overlapping genes transcribed from opposite strands. A comparison of stranded with non-stranded RNA-seq led us to conclude that stranded RNA-seq provides a more accurate estimation of gene expression levels than non-stranded RNA-seq .
Short reads generated by RNA-seq experiments must first be aligned, or mapped, to a reference genome or transcriptome assembly. The general objective of mapping or aligning a collection of sequence reads to a reference is to discover the true location (origin) of each read with respect to that reference. Although a large number of read mapping algorithms have been developed in recent years [6–10], the accurate alignment of RNA-seq reads is still a challenge. Indeed, some features of a reference genome such as repetitive regions, assembly errors, and assembly gaps render this objective impossible for a subset of reads. Furthermore, because RNA-seq libraries are constructed from transcribed RNA, intronic sequences are not present in exon-exon spanning reads. Therefore, when aligning the sequences to a reference genome, reads that span exon-exon junctions have to be split across potentially thousands of bases of intronic sequence. Many of the RNA-seq alignment tools, including STAR , GSNAP , MapSplice , and TopHat , use reference transcriptomes to inform the alignment of junction reads. The benefits of using a reference transcriptome to map RNA-seq reads have been demonstrated clearly in our previous RNA-seq analyses [15, 16].
The second important step in most RNA-seq analyses is gene or isoform quantification. A common method to estimate gene or transcript abundance from RNA-seq data is to count the number of reads that map uniquely to each gene or transcript. RPKM (reads per kilobase per million reads) is widely used to represent the relative abundance of mRNAs for a gene or transcript. A number of algorithms have been developed to infer gene and isoform abundance [17, 18], including RSEM [19, 20], Cufflinks , IsoEM , featureCounts , and HTSeq . A gene can be expressed in one or more transcript isoforms; accordingly, its expression level should be represented as the sum of its isoforms. However, estimating the expression of individual isoforms is intrinsically more difficult because different isoforms of a gene typically have a high proportion of genomic overlap. Accordingly, a simpler union exon-based approach has been proposed, in which all overlapping exons of the same gene are first merged into union exons, and the total length of the union exons is taken to represent the gene length. We carried out a side-by-side comparison between union exon-based approach and transcript-based method in RNA-seq gene quantification , and found that gene expressions were significantly underestimated when the union exon-based approach was used. Therefore, we strongly discourage using the union exon-based approach in gene quantification despite its simplicity.
Although the time and cost for generating RNA-seq data are decreasing, the analysis of massive amounts of RNA-seq data still remains challenging. Numerous software packages and algorithms for basic data quality control (QC) and analyses have been developed, which has led to the need to apply these tools efficiently to obtain results within a reasonable timeframe, especially for large datasets. Based on our own experience with in-house analyses of multiple RNA-seq datasets of varying size using open source tools, the main challenges, gaps, and bottlenecks for large-scale RNA-seq data analyses can be summarized as follows:
Selecting appropriate software packages and setting software-specific parameters. Making the right or best choice can be difficult because many similar tools are available. Setting software parameters is even harder if not impossible, because it often requires both an in-depth understanding of the algorithms and sufficient hands-on experience, which disadvantages researchers new to this field.
Writing scripts to make different components work seamlessly in a pipeline. A variety of algorithms have been designed to perform different tasks, but they have been developed (and/or maintained) independently by different research groups and often use different programming languages. Moreover, those algorithms do not understand each other well, and the output(s) from one algorithm often cannot be used as input(s) for another algorithm. As a result, additional bridging scripts are necessary, which ideally requires a data analyst who is familiar with a number of programming languages, including Shell script, Perl, Python, Java, C/C++, and R.
Integrating and summarizing analyses results from individual samples. In general, most algorithms are implemented to process an individual sample. Consequently, the results of primary RNA-seq data analyses have to be further processed, integrated, and summarized for reporting, presentation, and downstream analysis. Usually, data integration and summarization are tedious and not easy to execute efficiently.
Identifying RNA-seq sample outliers. It is not uncommon that some samples have low quality and often substitute samples are not available, especially for RNA-seq of clinical specimen. RNA-seq is a complicated multistep process that involves sample collection/stabilization, RNA extraction, fragmentation, cDNA synthesis, adapter ligation, amplification, purification, and sequencing. Any mistake in this complex sequence of protocols can result in biased or even unusable data. Therefore, it is necessary to establish stringent RNA-seq data quality metrics to identify outliers that should be excluded from further downstream data analysis.
Detecting sample swapping and mislabeling. For large-scale RNA-seq studies in which hundreds or even thousands of RNA samples are sequenced and analyzed, it is not unusual that some samples are mishandled and appear to be swapped or sequenced more than once. Such errors can become a serious problem for downstream data analyses and interpretation of results, especially for longitudinal sample analyses. It is difficult to identify such mistakes based only on RNA-seq QC metrics and/or gene expression profiles. To confirm whether samples are from the same subject, it is more reliable to compare genetic markers among samples, such as single nucleotide polymorphisms (SNPs).
Sharing the results of RNA-seq data analyses with experimental scientists. Nearly all RNA-seq data analyses are performed using Linux clusters or workstations; however, analyses results in Linux are often inaccessible to most experimental scientists. RNA-seq data analyses typically generate a large number of files and large amounts of data that are difficult to comprehend or digest directly by experimental scientists. Therefore, easily accessible interfaces are needed that not only provide a quick and easy way for non-expert users to obtain high-level visualizations of the main RNA-seq analyses outputs (e.g., QC results), but also allow them to drill down further or export the results into additional analysis applications of their choice. To the best of our knowledge, very few RNA-seq related open source packages provide all these options.
To address these challenges, we have implemented a new pipeline named QuickRNASeq to advance the automation and visualization of RNA-seq data analyses results, and have constantly improved and refined its implementation since its inception. QuickRNASeq significantly reduces data analysts’ hands-on time, which results in a substantial decrease in the time and effort needed for the primary analyses of RNA-seq data before proceeding to further downstream analysis and interpretation. Additionally, QuickRNASeq provides a dynamic data sharing and interactive visualization environment for end users. All the results are accessible from a web browser without the need to set up a web server and database. The rich visualization features implemented in QuickRNASeq enable non-expert end users to interact easily with the RNA-seq data analyses results, and to drill down into specific aspects to gain insights into often complex datasets simply through a point-and-click approach.
QuickRNASeq is designed for simplicity and visual interactivity. A few important principles dictate its implementation. First, all components of the pipeline are freely available in the public domain. Second, it is easy to deploy and use. Third, all analyses results including RNA-seq QC metrics, sample correlations, and gene quantifications are accessible via a web browser and can be further explored interactively. An overview of QuickRNASeq (Fig. 1) illustrates its three main steps. Step #1 performs RNA-seq read mapping, counting, aligned read QC, and SNP calling. Step #1 processes each sample completely independently of each other, and is computationally intensive. Therefore, all samples can be processed in parallel in a high performance computing cluster (HPC), or in a serial fashion on a standalone workstation. Step #2 merges the results from the individual sample and generates an integrated and interactive project report for data interpretation in Step #3.
In addition to raw sequence reads in FASTQ format, the only other required inputs are a reference genome file in FASTA format and a corresponding gene annotation file in GTF (gene transfer format). QuickRNASeq can be applied to any species as long as its genome and gene annotations are available; for example, human, mouse, rat, and cynomolgus or rhesus monkeys. A gene annotation file can exist in many formats, but GTF has become the de facto standard; however, not all tools accept gene annotation files in GTF format as input. For example, RSeQC (RNA-seq quality control package)  accepts gene annotation only in BED (browser extensible display) format, though the majority of gene annotations in the public domain are not available in BED format. To ensure that the exact same annotations are used by the different components in QuickRNASeq, we wrote Perl scripts to convert gene annotation files from GTF to BED format. This avoids any discrepancy or inconsistency among gene annotations that are available in different formats.
Step #1: single sample processing
This step consists mainly of read mapping, counting, aligned read QC, and SNP calling, and the corresponding algorithms used to perform these tasks are STAR [11, 27], featureCounts , RSeQC , and VarScan  respectively. STAR aligns spliced sequences of any length with moderate error rates, provides scalability for emerging sequencing technologies, and generates output files ready for transcript/gene expression quantification . The algorithms featureCounts  and HTSeq  are comparable in terms of counting results, but featureCounts is considerably faster than HTSeq by an order of magnitude for gene-level summarization and requires far less computer memory. Read mapping and counting typically are very time consuming, and we chose STAR and featureCounts in QuickRNASeq mainly because of their high speed and accuracy.
The RSeQC  package provides a number of modules that can comprehensively inspect sequence quality, nucleotide composition bias, PCR bias, GC bias, mapped reads distribution, coverage uniformity, and strand specificity. All such QC metrics are valuable for outlier detection. VarScan  is a platform-independent software tool that can detect variants in RNA-seq data. It employs a robust heuristic/statistic approach to call variants that meet desired thresholds for read depth, base quality, variant allele frequency, and statistical significance. To verify samples from the same subject, it is unnecessary to call SNPs across all chromosomes. In practice, it is sufficient to use only SNPs from the chromosome that contains the major histocompatibility complex (MHC) genes. For human, mouse, and rat, these are chromosomes 6, 17, and 20, respectively. As mentioned earlier, numerous software packages that can perform similar tasks are freely available; however, we found that the combination of STAR, featureCounts, RSeQC, and VarScan represents one of the best toolsets.
Computational algorithms for RNA-seq analyses are continuously being improved, including STAR, featureCounts, RSeQC, and VarScan. Therefore, we designed our pipeline to be independent of its underlying software version and ensured that it can handle RNA-seq samples from a variety of species. To decouple the dependence of QuickRNASeq pipeline upon underlying computational algorithms and species, we introduced a plain text configuration file that can store project, species, and software-specific parameters. This configuration file also improves the reproducibility of RNA-seq data analyses and simplifies the command lines in QuickRNASeq. For the convenience of QuickRNASeq users, a configuration file template has been provided for easy customization.
Step #2: data integration, QC, and summary
Step #2 aims mainly to merge results generated in Step #1 for each individual RNA-seq sample. Additionally, it runs many across-sample calculations, such as correlation-based QC and a SNP correlation matrix. As shown in Fig. 1, the second step performs the following tasks:
Merge mapping, counting summaries, and RSeQC metrics from individual samples.
Generate a read counting table ready for downstream analysis of all annotated genes.
Calculate a SNP (and gene expression) correlation matrix among samples.
Perform correlation-based sample QC, calculation of MADScore (median absolute deviation score), and data normalization.
Produce RNA-seq metrics and correlation plots ready for PowerPoint presentations.
Generate a comprehensive HTML QC report for individual sample.
Produce a dynamic and integrated QC metrics plot for individual samples.
Generate a master HTML entry webpage for data analyses results.
Each individual task listed above is performed by a corresponding Bash, Perl, or R script, and a master script coordinates the execution of all these tasks. The main scripts and their functions are listed in Table 1. As shown in Table 1, the primary RNA-seq data analyses can be performed by as few as two shell command lines (star-fc-qc.sh and star-fc-qc.summary.sh). All the plots generated in Step #2 are ready for presentations, and the gene counting table can feed downstream differential analysis algorithms. The highly automated features in Step #2 make QuickRNASeq an efficient tool for typical standard RNA-seq analyses, and our pipeline substantially reduces the hands-on time (not the computational time) that data analysts have to spend on primary RNA-seq data analyses.
We implemented a correlation-based QC to detect potential outliers in the RNA-seq data by calculating a MADScore for each sample. In general, an outlier appears to deviate markedly from other samples in a RNA-seq study, and thus its correlation with other samples will be relatively low. The MADScore is calculated as follows. For each sample, calculate the correlation difference, which is simply the difference between the average of all the pairwise correlations that involve the sample and the average of all the pairwise correlations that do not involve the sample. If a sample is an outlier, then the difference will be negative. Accordingly, there will be a vector of values (one for each sample). Then this vector of difference is converted to MADScores (robust Z-scores) by subtracting the medians and dividing it by median absolute deviations (MAD). A standard MADScore cutoff (e.g., −5) is set to determine the outliers.
Step #3: interactive data visualization
Results and discussion
Test run of QuickRNASeq on a publicly available dataset
GENCODE annotation [34, 35] is based on Ensembl  but with improved coverage and accuracy, and thus is used by the ENCODE consortium  as well as many other projects (e.g., 1000 Genomes ) as the reference gene set. Therefore, we chose the GENCODE annotation for our test run. GENCODE Release 19 was downloaded from the GENCODE web site . An analysis of RNA-seq data from 1641 samples across 43 tissues of 175 individuals in the Genotype-Tissue Expression (GTEx) project [39, 40] revealed the landscape of gene expression across tissues, and catalogued thousands of tissue-specific genes. For our test run, we selected 48 GTEx samples from five donors. The sample identifiers, annotations, and RNA-seq mapping summaries for all 48 samples are listed in Table 2. Note that a sequence read can be aligned uniquely to a reference genome, or mapped to multiple locations. Some reads cannot be mapped to the reference genome at all. The percentages of reads that were uniquely mapped, mapped to multiple locations, or unmapped are given in Table 2. The complete report for our test run of the GTEx dataset can be downloaded directly from the QuickRNASeq project home page, and is briefly described below.
All analyses results accessible from a single entry webpage
A screenshot of the entry webpage for the results of the test run is shown in Fig. 2. The page uses Noozle’s presentation template, which collates sections into a single neat web page with functionalities to expand or collapse individual or whole sections. In the “QC Metrics” section, both static images and interactive plots are provided for a variety of QC measures, including read mapping summaries, read counting statistics, SNP correlations among samples, number of expressed genes at various RPKM cutoffs, and correlations among gene expression profiles. All static QC plots can be enlarged into a new window by clicking on the iconized image, and the corresponding more dynamic and interactive plots are accessible by clicking the pointing hand icon. The interactive plots of QC measures offer many interactive features over static images, such as zooming in and zooming out. The raw data that was used to generate these figures can be accessed simply by clicking the corresponding hyperlinked text. The “Parallel Plot” and “Expression Table” sections in Fig. 2 are detailed later. Furthermore, all the result files and figures are directly accessible by expanding the “Raw Data Files” section shown at the bottom of Fig. 2. The entry webpage makes data navigation and visualization more convenient and intuitive, especially for experimental scientists.
SNP correlation to detect mishandled samples
SNP correlation plots help to verify whether samples are from the same subject or not. By definition, SNP concordance among samples from the same subject will be much higher than those samples from different subjects. In the first case, typical examples may be samples of different tissues from the same subject or longitudinal samples from the same subject. For simplicity, we selected samples from three donors to illustrate the usefulness of the SNP concordance plot (Fig. 3). As we expected, the SNP correlation plot in Fig. 3a is clustered by the donors. The corresponding correlation plot after the swap of SRR598044 and SRR608096 is shown in Fig. 3b where the correlation pattern indicates that the two samples are wrongly labeled. The true identifiers for the two swapped samples are indicated on the right of the plot. Sample swapping is typically very difficult to detect when it occurs. We have tried different methods to rectify mislabeled or swapped samples and found that a SNP correlation-based approach gave the best results (data not shown).
Integrated QC metrics for individual sample
The parallel plot in Fig. 4 is a common way to visualize high-dimensional data and it is used widely in multivariate data analysis. We implemented the parallel plot to link all related QC measurements for all samples into one plot. Each axis within the plot represents a sample feature or a QC measurement. There are multiple ways users can control the look and feel of the plot, such as selecting a subset of samples to view, changing the order of the axes by drag-and-drop, and removing unwanted axes for a clearer view by dragging them off the plot to either side. The linked table is searchable, and for any selected sample in the table, its corresponding QC measures are highlighted simultaneously on the plot with tooltips showing the measurement values.
MAD, an alternative and more robust measure of dispersion has been proposed to detect outliers . We extended MAD to implement a correlation-based QC to detect potential outliers. The MADScore was calculated as described above, and is listed in the table in Fig. 4. To determine whether a potential outlier identified from the correlation-based QC is a true outlier, we recommend that the corresponding QC report is also checked. The comprehensive QC report for an individual sample can be accessed by clicking the corresponding sample identifier in the table in Fig. 4. For example, some representative RNA-seq QC metrics for SRR603068 (highlighted in Fig. 4) are shown in Fig. 5. The metrics correspond to reads duplication rate, distribution of reads versus percentages of GC content, nucleotide composition bias, distribution of read quality score, plot of junction saturation, and characteristics of the splicing junction sites.
Two strategies are used to determine the read duplication rate, as indicated in Fig. 5a. For the sequence-based strategy, reads with exactly the same sequence content are regarded as duplicated reads, whereas, for the mapping-based strategy, reads mapped to the same genomic location are regarded as duplicated reads. For spliced reads, reads mapped to the same starting position that splice the same way are regarded as duplicated reads. SRR603068 is a brain sample, and its nucleotide composition is biased towards A/T, as indicated in Fig. 5c. For RNA-seq data, we often want to know whether the sequencing depth is enough for the analyses, and the saturation plot shown in Fig. 5d is very valuable for this. For a well annotated organism, the number of expressed genes in a particular tissue is almost fixed so the number of splice junctions is also fixed. These numbers should be rediscovered from saturated RNA-seq data. The plot in Fig. 5d indicates that more reads should be sequenced for performing alternative splicing analyses. In Fig. 5f, all multiple splicing events spanning the same intron have been consolidated into one splicing junction, and a novel junction is considered as complete_novel if neither of the two splice sites can be annotated by a gene model. Otherwise, it is partial_novel, meaning that one of the splice site (5′SS or 3′SS) is new, while the other splice site is annotated (known). While the majority of junctions in Fig. 5f are annotated, over 20 % are either complete_novel or partial_novel.
Interactive visualization of gene expression profiles
One of the most important objectives in many RNA-seq studies is to estimate gene expression levels under certain biological or disease conditions. With the help of the visualization tools shown in Fig. 6, differences in gene expression levels across samples under different conditions can be highlighted easily by a few mouse-clicks either in the boxplot (Fig. 6b) or heat map view (Fig. 6c). A keyword search box at the top of the table (Fig. 6a) provides an easy way to look at related genes such as kinases and interleukins. Gene expression profiles can be grouped and split on the fly according to the sample annotations, such as tissue type, visiting time, and treatment arms. Moreover, the look and feel of a plot, such as font size, color, plot type, and scales for x-axis and y-axis, can be customized by right clicking on the plot and selecting relevant options from the dropdown menu. An annotated heat map (Fig. 6c) is informative in comparing gene expression profiles across different conditions, and can help reveal the relationships between gene expression levels and corresponding biological conditions. Detailed instructions on how to use advanced visualization features of the interactive plot are described in the QuickRNAseq user guide that is bundled with the QuickRNASeq package.
Scalability of QuickRNASeq
Limitations and running of QuickRNASeq
QuickRNASeq is presumed to be executed in a HPC environment, which can process multiple samples in parallel. The out-of-the-box QuickRNASeq pipeline has been fully tested in a HPC computing environment using the IBM Platform’s Load Sharing Facility (LSF) , a powerful workload management platform for demanding, distributed HPC environments. The IBM Platform’s LSF provides a comprehensive set of intelligent, policy-driven scheduling features that enable users to utilize all the computing infrastructure resources and ensure optimal application performance. In addition to LSF, many other notable job scheduling software are available . For a cluster that uses a job scheduler other than LSF, star-fc-qc.sh (implementation of Step #1 in Fig. 1) needs to be customized accordingly. The only required change in the script is the way of job submission, and this command is dependent the job scheduling software. For researchers with no access to a HPC computing environment, we implemented star-fc-qc.ws.sh, a customized script that runs on a standard Linux workstation. Of course, analyzing large RNA-seq datasets on a single workstation is not typical and not recommended.
For gene quantifications, QuickRNAseq requires a complete genome sequence and well-annotated genes as inputs. The pipeline is not intended for the discovery of novel isoforms. QuickRNASeq is designed for use by bioinformaticians, experimental biologists, and geneticists in the fields of genome-scale analysis, functional genomics, and systems biology; however, downloading, installing, and running the QuickRNASeq pipeline in a Linux environment will require some basic computer-based expertise. A README.txt is provided along with the QuickRNASeq package, which explains step-by-step how to run QuickRNASeq. In addition, users can examine the configuration and sample annotation file under the test_run folder in the QuickRNASeq package. QuickRNASeq can be run without a sample annotation file, but it is strongly recommended that users provide meaningful annotations for all samples. A proper annotation file should be tab delimited, and QuickRNASeq requires that the first and second columns correspond to sample and subject identifiers, respectively. Sample names should start with a letter, and should not contain any white spaces.
In QuickRNASeq, we selected FeatureCounts, a union exon based approach, for gene quantification. According to our own most recent research , union exon based approach is discouraged. Unfortunately, there is still a long way to go for the switch from union exon based approach to transcript-based method in estimation of gene expression levels because of the inaccuracy of isoform quantification , especially for those isoforms with low expression, and gene-based annotation databases. Traditionally, functional enrichment analyses rely upon annotation databases such as Gene Ontology (GO) , Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways  and other commercial knowledge systems. All such annotations have been recorded and centered on genes, not transcripts or isoforms. In practical RNA-seq data analyses, the switch from gene to isoform in quantification should ideally go with the switch in annotation hand by hand.
The current version of QuickRNASeq focuses on the automation of primary processing steps in RNA-seq data analyses, and these steps are in general biological question independent. We plan to expand QuickRNASeq to downstream analyses in the future, including differential analysis and pathway enrichment. Downstream analyses are usually driven by biological questions and experimental designs and thus different from project to project. How to automate such analyses in a user friendly manner remains a challenge for our practical implementation.
QuickRNASeq versus QuickNGS
While this paper was in preparation, Wagle et al.  published QuickNGS, a new workflow system to analyze data from multiple next-generation sequencing (NGS) projects at a time. QuickNGS uses parallel computing resources, a comprehensive backend database, and the careful selection of previously published algorithmic approaches to build fully automated data analysis workflows. An overview of our comparison of the QuickRNASeq pipeline with the QuickNGS workflow is provided in Table 3. In summary, compared with QuickNGS, QuickRNASeq is more tailored to RNA-seq data. In QuickRNASeq, we developed scripts to perform RNA-seq-specific data integration and to generate integrated and interactive project reports in a fully automated manner. All the results from QuickRNASeq can be shared easily and further explored from a web browser on a personal computer even without internet access. Our pipeline QuickRNASeq provides a noticeable advancement of RNA-seq data analyses by incorporating a high degree of automation together with interactive visualizations.
By combing the best open source tool sets developed for RNA-seq data analyses and the most advanced web 2.0 technologies, we implemented the QuickRNASeq pipeline, which significantly reduces the efforts involved in primary RNA-seq data analyses and generates an integrated project report for data sharing and interactive visualization. The dynamic visualization features enable end users to explore and digest RNA-seq data analyses results intuitively and interactively, and to gain deep insights into RNA-seq datasets. The configuration file contains project, species, and software related parameters, and thus improves the reproducibly in RNA-seq data analyses. We have already applied QuickRNASeq to in-house large scale RNA-seq projects, and its current version is stable and mature for public release and adoption.
Availability of software and supporting Information
Project name: QuickRNASeq pipeline
Project home page: http://quickrnaseq.sourceforge.net
Operating system: Linux
Dependencies: R packages edgeR, reshape2 and ggplot2
Other requirements: None
License: GNU GPL version 3
Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
Mutz KO, Heilkenbrinker A, Lönne M, Walter JG, Stahl F. Transcriptome analysis using next-generation sequencing. Curr Opin Biotechnol. 2013;24(1):22–30.
Mantione KJ, Kream RM, Kuzelova H, Ptacek R, Raboch J, Samuel JM, et al. Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq. Med Sci Monit Basic Res. 2014;20:138–42.
Zhao S, Fung-Leung W-P, Bittner A, Ngo K, Liu X. Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014;9(1):e78644.
Zhao S, Zhang Y, Gordon W, Quan J, Xi H, Du S, et al. Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap. BMC Genomics. 2015;16:487.
Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011;8(6):469–77.
Capobianco E. RNA-Seq data: a complexity journey. Comput Struct Biotechnol J. 2014;11(19):123–30.
Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Rätsch G, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013;10(12):1185–91.
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91.
Borozan I, Watt SN, Ferretti V. Evaluation of alignment algorithms for discovery and identification of pathogens using RNA-seq. PLoS ONE. 2013;8(10):e76935.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.
Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26(7):873–81.
Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010;38(18):e178.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
Zhao S. Assessment of the impact of using a reference transcriptome in mapping short RNA-seq reads. PLoS ONE. 2014;9(7):e101374.
Zhao S, Zhang B. A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics. 2015;16:97.
Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. Genome Biol. 2015;16:150.
Angelini C, De Canditiis D, De Feis I. Computational approaches for isoform detection and estimation: good and bad news. BMC Bioinformatics. 2014;15:135.
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping inaccuracy. Bioinformatics. 2009;26:493–500.
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511–5.
Nicolae M, Mangul S, Măndoiu II, Zelikovsky A. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol. 2011;6:9.
Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30.
Anders S, Theodor P, Huber W. HTSeq — a Python framework to work with high-throughput sequencing data. Bioinformatics. 2014;31(2):166–9.
Zhao S, Zhang B. Union exon based approach for RNA-seq gene quantification: to be or not to be? PLoS ONE. 2015;10(11):e0141910.
Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012;28(16):2184–5.
Dobin A, Gingeras TR. Mapping RNA-seq reads with STAR. Curr Protoc Bioinformatics. 2015;51:11.14.1–11.14.19.
Koboldt D, Zhang Q, Larson D, Shen D, McLellan M, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–76.
jQuery. https://jquery.com. Accessed 15 November 2015.
Data-Driven Documents. http://d3js.org. Accessed 15 November 2015.
canvasXpress. http://canvasxpress.org. Accessed 15 November 2015.
SlickGrid. https://github.com/mleibman/SlickGrid. Accessed 15 November 2015.
Gehlenborg N, Noble MS, Getz G, Chin L, Park PJ. Nozzle: a report generation toolkit for data analysis pipelines. Bioinformatics. 2013;29:1089–91.
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 2012;22(9):1760–74.
GENCODE. http://www.gencodegenes.org/releases/19.html. Accessed 15 November 2015.
Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic Acids Res. 2014;42(Database issue):D749–55.
The ENCODE Project. http://www.genome.gov/encode/. Accessed 15 November 2015.
1000 Genomes. http://www.1000genomes.org/. Accessed 15 November 2015.
GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–60.
Melé M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, et al. Human genomics. The human transcriptome across tissues and individuals. Science. 2015;348(6235):660–5.
Leys C, Ley C, Klein O, Bernard P, Licata L. Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median. J Exp Soc Psycho. 2013;49(4):764–6.
Pako. https://github.com/nodeca/pako. Accessed 15 November 2015.
IBM Platform LSF. http://www.ibm.com/systems/platformcomputing/products/lsf. Accessed 15 November 2015.
Job scheduler software. https://en.wikipedia.org/wiki/List_of_job_scheduler_software. Accessed 15 November 2015.
The Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucl Acids Res. 2015;43(Database issue):D1049–56.
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004;32(Database issue):D277–80.
Wagle P, Nikolić M, Frommolt P. QuickNGS elevates Next-Generation Sequencing data analysis to a new level of automation. BMC Genomics. 2015;16:487.
The authors thank Alexander Dobin for valuable assistance with running STAR, Shi Wei for assistance with using featureCounts, and Liguo Wang for his help with RSeQC. The development of QuickRNASeq was motivated partially by large scale in-house RNA sequencing projects in which thousands of samples were sequenced and analyzed. We thank Karen Page and Mina Hassan-Zahraee for their support and collaboration.
All authors are employees of Pfizer. The authors declare that they have no competing interests.
SZ and BZ conceived, designed, implemented, tested, validated the workflow, and drafted the manuscript. LX compared QuickRNASeq with QuickNGS. JQ, HX, YZ, DS, and MV participated in the design and requirements specification discussions. All authors read and approved the final manuscript.
About this article
Cite this article
Zhao, S., Xi, L., Quan, J. et al. QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization. BMC Genomics 17, 39 (2016) doi:10.1186/s12864-015-2356-9
- Batch processing
- High-performance computing
- Large-scale data analysis