 PROCEEDINGS
 Open Access
A Bayesian approach for estimating allelespecific expression from RNASeq data with diploid genomes
 Naoki Nariai^{1, 2},
 Kaname Kojima^{2},
 Takahiro Mimori^{2},
 Yosuke Kawai^{2} and
 Masao Nagasaki^{2}Email author
 Published: 11 January 2016
Abstract
Background
RNAsequencing (RNASeq) has become a popular tool for transcriptome profiling in mammals. However, accurate estimation of allelespecific expression (ASE) based on alignments of reads to the reference genome is challenging, because it contains only one allele on a mosaic haploid genome. Even with the information of diploid genome sequences, precise alignment of reads to the correct allele is difficult because of the highsimilarity between the corresponding allele sequences.
Results
We propose a Bayesian approach to estimate ASE from RNASeq data with diploid genome sequences. In the statistical framework, the haploid choice is modeled as a hidden variable and estimated simultaneously with isoform expression levels by variational Bayesian inference. Through the simulation data analysis, we demonstrate the effectiveness of the proposed approach in terms of identifying ASE compared to the existing approach. We also show that our approach enables better quantification of isoform expression levels compared to the existing methods, TIGAR2, RSEM and Cufflinks. In the real data analysis of the human reference lymphoblastoid cell line GM12878, some autosomal genes were identified as ASE genes, and skewed paternal Xchromosome inactivation in GM12878 was identified.
Conclusions
The proposed method, called ASETIGAR, enables accurate estimation of gene expression from RNASeq data in an allelespecific manner. Our results show the effectiveness of utilizing personal genomic information for accurate estimation of ASE. An implementation of our method is available at http://nagasakilab.csml.org/asetigar.
Keywords
 Allelespecific expression
 RNASeq data
 Bayesian inference
Background
Allelespecific expression (ASE) has been traditionally studied in the context of genomic imprinting, in which the expression of genes depends on whether they are paternally or maternally inherited. Xchromosome inactivation is also a form of ASE, in which one of the two alleles of the X chromosome is inactivated in female [1]. Recent studies have revealed that ASE is relatively common [2], and that many cisacting sequence variants can alter gene expression in a highly contextspecific manner [3]. In some cases, differences in the expression of two alleles can be predisposition to diseases, such as colorectal cancer [4]. Importantly, transcript abundances can be used as quantitative traits for identifying susceptibility loci for common diseases, such as diabetes and obesity [5, 6]. Hence, it is of our great interest to identify ASE and characterize genetic variants that are directly associated with phenotypic differences for elucidating causal mechanisms of diseases.
In order to identify allelespecific gene expression, RNAsequencing (RNASeq) has now been widely used. However, there are several difficulties in measuring the amount of expressed isoforms in an allelespecific manner from RNASeq data given genotypes of an individual. First, in many cases, short reads can be aligned to multiple locations of the reference genome, which poses uncertainty in quantifying gene expression levels [7]. Statistical methods that handle ambiguous alignment of reads as hidden variables have been shown to be effective in optimizing read alignments for more accurate quantification of isoforms [8–10], although the approaches do not consider isoforms in an allelespecific manner. Another difficulty is that there is a bias in alignment of reads to the reference genome if a sample has heterozygous SNPs where nucleotides are different from the reference sequence [11–13]. To avoid the bias in alignment of reads to the reference genome, one can prepare the alternative allele that includes genomic variants [14, 15], or construct diploid genomes for a specific sample [16]. Then, the best alignments of reads to the extended reference sequences are used to count the number of the paternally or maternally derived reads based on heterozygous SNP sites. However, these approaches cannot quantify isoform expression levels accurately, since only reads that align heterozygous positions are considered for ASE. To our best knowledge, there is currently no approach that can estimate ASE explicitly as well as isoform abundances in a unified statistical framework, given RNASeq data and diploid genomes.
In this paper, we present a novel method called ASETIGAR, to estimate ASE as well as gene expression levels of isoforms simultaneously from RNASeq data and diploid genome sequences. In the read generative model, a haploid choice is modeled as a hidden variable, and the posterior distribution for the binomial random variable is estimated by variational Bayesian inference. In order to evaluate our approach, we prepare two sets of synthetic pairedend reads (30 million reads, 100 bp × 2) with some sequencing errors, one is generated based on the nullhypothesis where there is no ASE, and the other is generated based on the alternative hypothesis where there is ASE for a certain portion of isoforms. We apply ASETIGAR to the simulation data and show that our method identifies more ASE isoforms than those identified with the existing approach. We also show that our method predicts isoform abundances more accurately compared to TIGAR2, RSEM and Cufflinks, which are widely used software for isoformlevel quantification from RNASeq data. Finally, we apply our method to the RNASeq data obtained from the human lymphoblastoid cell line GM12878 [17] to identify autosomal genes that exhibit ASE, and investigate the balance of Xchromosome inactivation between the paternal and maternal alleles in the cell line.
Methods
ASETIGAR pipeline
Read generative model
p(H _{ n }=h∣T _{ n }=t,ϕ) is the probability that read n is generated from haplotype h (either paternal or maternal), given the isoform choice and ϕ. We calculate this probability as p(H _{ n }=0∣T _{ n }=t,ϕ)=ϕ _{ t } (if read n is generated from the paternal allele), or p(H _{ n }=1∣T _{ n }=t,ϕ)=1−ϕ _{ t } (if read n is generated from the maternal allele).
p(S _{ n }=s∣T _{ n }=t,H _{ n }=h) is the probability that read n is generated from position s, given isoform t and haplotype h. We calculate this probability as p(S _{ n }=s∣T _{ n }=t,H _{ n }=h)=1/(l _{ th }−L+1).
where s u b s t(·,·,·) is the quality score dependent substitution matrix [10], r _{ n }[x] is the nucleotide character of position x of read n, q _{ n }[x] is the quality score of position x of read n, and c _{ th }[x] is the nucleotide character of position x of cDNA sequence of isoform t of haplotype h. The quality score dependent substitution matrix, s u b s t(·,·,·), is either determined according to the Phred base quality score [20], or can be estimated from the best alignments of reads to the reference cDNA sequences from the RNASeq data.
For the cases where RNASeq reads are generated from pairedend libraries, and how indel errors of sequencers can be handled, the previously proposed model [10] can be naturally extended similarly to the case for the singleend data described here.
Variational Bayesian inference
We propose a Bayesian approach, in which model parameters are estimated as posterior distributions, given RNASeq data and prior distributions for the model parameters θ and ϕ. Because full Bayesian inference involves integration over all possible hidden variable Z and is analytically intractable, we use variational Bayesian inference [21], which approximates the posterior joint distributions by assuming the factorization of latent variables and model parameters as q(θ,ϕ,Z)≈q(θ)q(ϕ)q(Z).
where α _{ t }>0 is a hyperparameter, \(G(\boldsymbol {\alpha }) = \frac {\prod _{t} \Gamma (\alpha _{t})}{\Gamma \left (\sum _{t} \alpha _{t}\right)}\), and Γ(·) is the gamma function. In this paper, we use a single hyperparameter α _{0} for all isoforms, based on the assumption that there is no prior knowledge about relative differences in isoform abundance. The single hyperparameter α _{0} controls the complexity of model parameters [22]. When α _{0}≥1,α _{0}−1 can be interpreted as the prior count of reads that are assigned to isoforms, and when α _{0}<1, the prior favors that some of the isoform abundances to be zero [10]. Here, we choose α _{0} that maximizes the lower bound of the log marginal likelihood.
where β _{ t,1}>0 and β _{ t,2}>0 are hyperparameters, and B(·,·) is the Beta function. Here, β _{ t,1} and β _{ t,2} can be interpreted as the prior counts of reads that are assigned to the paternal and maternal allele, respectively, for calculating the paternal/maternal ratio. We use β _{ t,1}=β _{ t,2}=1 for all isoforms as a noninformative prior.

Step 1. Initialization
For each isoform t, set \(\alpha _{t}^{*} = \alpha _{0}\), \(\beta _{t,1}^{*} = \beta _{t,1}\), and \(\beta _{t,2}^{*} = \beta _{t,2}\)

Step 2. Update q ^{∗}(Z)
Compute E _{ Z }[Z _{ nths }] given the current estimate of q ^{∗}(θ) and q ^{∗}(ϕ)

Step 3. Update q ^{∗}(θ) and q ^{∗}(ϕ)
Compute E _{ θ }[θ _{ t }] and E _{ ϕ }[ϕ _{ t }] given the current estimate of q ^{∗}(Z)

Step 4. Check for convergence
If none of the E _{ θ }[θ _{ t }] has been changed more than a prespecified threshold, exit. Otherwise, return to Step 2
where ψ(·) is the digamma function.
Hence, it turns out that q ^{∗}(θ) is also the Dirichlet distribution, and the prior distribution p(θ) is the conjugate prior.
Hence, q ^{∗}(ϕ _{ t }) is also the Beta distribution, and the prior distribution p(ϕ _{ t }) is the conjugate prior.
In Step 4, a relative change of 10^{−3} for isoforms whose abundance parameter E _{ θ }[θ _{ t }]>10^{−7} is used as a convergence criteria.
Variational lower bound
Results and discussion
Simulation data analysis
To evaluate the performance of the proposed method, we prepared synthetic RNASeq data (30 million reads, 100 bp × 2 with the mean fragment size of 400 bp and standard deviation of 40 bp) based on diploid genome sequences of NA12878, which were constructed from hg19 and publicly available from the website (http://sv.gersteinlab.org/NA12878_diploid). First, the paternal and maternal cDNA sequences were generated from the diploid genome sequences based on the UCSC gene annotations file (refFlat.txt) with rSeq (version 0.2.1) as described in Methods section. Second, 10,000 isoforms were randomly chosen and expression levels were assigned so that it follows the lognormal distribution. Then, we prepared two sets of RNASeq data with 0.1 % substitution, deletion, and insertion errors, one was generated based on the null hypothesis that there was no ASE, and the other was generated based on the alternative hypothesis that there were ASE for some portions of isoforms. For the null hypothesis data set, 100 % of the isoforms express the paternal and maternal alleles equally likely (50:50 chance). On the other hand, for the ASE data set, 10 % of the isoforms have the paternalspecific expression (in which the paternal allele was chosen to express with an 80 % probability, whereas the maternal allele was chosen to express with a 20 % probability), 10 % of the isoforms have the maternalspecific expression (in which the maternal allele was chosen to express with an 80 % probability, whereas the paternal allele was chosen to express with a 20 % probability), and the remaining isoforms have no ASE.
To compare with the existing approach [16], reads were aligned to the both paternal and maternal haplotypes, and the best alignments of reads were obtained. Then, for each isoform, the number of heterozygous SNPs was counted to determine the paternal/maternal ratio. On the other hand, our approach aligned reads to the both haplotypes and retained all the possible alignments with Bowtie2 specifying “k” option. Then, ASETIGAR took the BAM file as input and optimized the read alignments between the paternal and maternal alleles, as well as among isoforms by variational Bayesian inference algorithm as described in Methods section. The hyperparameter α _{0} was set to 0.1, which maximized the variational lower bound of the marginal log likelihood of the data.
Real data analysis
We applied ASETIGAR to the RNASeq data (36.5 million reads of 100 bp × 2) that was generated from the lymphoblastoid cell line GM12878 [17], which is publicly available under the NCBI SRA accession number SRX245434. This cell line was derived from the HapMap NA12878 individual, whose diploid genomes were similarly obtained and used as in the simulation data analysis.
Terms enriched in the autosomal ASE genes
Category  Term  Count  Pvalue  Bonferroni 

SP_PIR  Polymorphism  787  9.7E8  5.2E5 
UP_SEQ  Sequence variant  808  1.1E7  3.2E4 
SP_PIR  Alternative splicing  599  1.6E6  8.6E4 
SP_PIR  Glycoprotein  231  1.8E6  9.7E4 
UP_SEQ  Extracellular  159  9.0E7  2.7E3 
SP_PIR  Signal  170  1.0E5  5.6E3 
UP_SEQ  Splice variant  597  2.4E6  7.3E3 
Interestingly, by looking at the paternal/maternal ratio of expressed isoforms on each chromosome, skewed Xinactivation in the paternal allele of the GM12878 cell line was observed (bottomright in Fig. 5). This result is consistent with the findings in previous studies that showed the bias in Xchromosome inactivation by CTCF binding [29] and occupancies of RNA polymerase II [30] from ChIPSeq data. ASETIGAR identified 90 maternal allelespecific isoforms on Xchromosome, whereas the existing approach based on the best alignment to the diploid genome [16] identified 76, based on the same experimental condition in the simulation analysis.
Computational resources
Computational experiments were performed on a computer with Intel Xeon CPU E78837 processors (2.8 GHz) with the Red Hat Enterprise Linux Server release 6.1 operating system. ASETIGAR is implemented in Java and executed on 16 CPU cores with a multithread option. In the experiments for the simulated data sets (30 million pairedend reads), the execution time was 20 hours, and 46 GB memory was used with the Java(TM) SE Runtime Environment (build 1.8.0_45b14).
Conclusions
In this paper, we proposed a novel method called ASETIGAR, a Bayesian approach to estimate ASE from RNASeq data with diploid genomes. Contrary to the popularly used existing methods such as TopHatCufflinks [25], RSEM [8], and TIGAR2 [23], personal diploid genomes are used as reference sequences in the pipeline, instead of the reference genome. Since genetic variants such as SNPs and indels are incorporated in the diploid genome sequences by construction, there will be less bias in alignment of reads compared to the conventional approaches that rely on the reference genome. In the generative model, a haplotype choice is modeled as a latent variable and estimated simultaneously with isoform abundances by variational Bayesian inference.
We showed from the simulation data analysis that ASETIGAR estimated ASE more consistently compared to the existing approach, in part from smoothing effect of the estimated posterior distribution of the binomial random variable that represents the fraction of the expressed paternal and maternal haplotypes. We also showed that ASETIGAR quantified isoform abundances more accurately compared to TIGAR2, RSEM, and Cufflinks, which is an additional benefit of ASETIGAR if genotypes of samples are available. In the real data analysis of human lymphoblastoid cell line GM12878, ASE was identified among relatively lowexpressed genes, and that no functional GO category was found to be significantly enriched. We also observed that the paternal Xchromosome inactivation was dominant in the cell line, which was also confirmed in the previous studies [29, 30].
Although fulllength transcripts can be sequenced with new sequencing technologies, such as the PacBio RS II [31], accurate estimation of ASE is challenging without enough information about isoform abundances. Currently, the Illumina platform is more suitable in quantifying isoform abundances thanks to its capacity of generating short reads in a highthroughput manner. Because the accuracy of the reference sequences is critical for our approach, it will be effective to include the obtained fulllength transcript sequences as reference cDNA sequences in ASETIGAR pipeline combined with short reads.
As more personal wholegenome sequencing data and RNASeq data become available [32], ASETIGAR will be particularly useful to find associations between genetic variants and expression quantitative loci (eQTL). For example, links between genetic variants in transcription factor (TF) binding sites and the level of gene expression can be investigated. Incorporation of other omics data, such as ChIPSeq data measuring CTCF binding, TF occupancies, histone modifications, or chromatin structures will be possible in the similar framework. If only a limited portion of genotypes is available for samples (such as with SNP arrays), genotype imputation with the reference panel can be considered [33]. However, there might exist imputation errors, or switching errors in phased genotypes without a complete parental genotypes, which will affect accuracies in ASE identification and isoform quantification with ASETIGAR. Our future work will include investigating ASE with other cell types, and the topics described above.
Declarations
Acknowledgements
This work was supported (in part) by MEXT Tohoku Medical Megabank Project. Super computer resources were provided by the Supercomputing services, Tohoku Medical Megabank Organization, Tohoku University, and National Institute of Genetics (NIG).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Lyon MF. Gene action in the Xchromosome of the mouse (Mus musculus L). Nature. 1961; 190:372–3.PubMedView ArticleGoogle Scholar
 Knight JC. Allelespecific gene expression uncovered. Trends Genet. 2004; 20:113–6.PubMedView ArticleGoogle Scholar
 Buckland PR. Allelespecific gene expression differences in humans. Hum Mol Genet. 2004; 13:R255–60. Spec No 2.PubMedView ArticleGoogle Scholar
 de la Chapelle A. Genetic predisposition to human disease: allelespecific expression and lowpenetrance regulatory loci. Oncogene. 2009; 28:3345–8.View ArticleGoogle Scholar
 Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V, et al. Genetics of gene expression surveyed in maize, mouse and man. Nature. 2003; 422:297–302.PubMedView ArticleGoogle Scholar
 Fairfax BP, Humburg P, Makino S, Naranbhai V, Wong D, Lau E, et al. Innate immune activity conditions the effect of regulatory variants upon monocyte gene expression. Science. 2014; 343:6175.View ArticleGoogle Scholar
 Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008; 5:621–8.PubMedView ArticleGoogle Scholar
 Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNASeq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010; 26:493–500.PubMedPubMed CentralView ArticleGoogle Scholar
 Glaus P, Honkela A, Rattray M. Identifying differentially expressed transcripts from RNAseq data with biological variation. Bioinformatics. 2012; 28:1721–28.PubMedPubMed CentralView ArticleGoogle Scholar
 Nariai N, Hirose O, Kojima K, Nagasaki M. TIGAR: transcript isoform abundance estimation method with gapped alignment of RNASeq data by variational Bayesian inference. Bioinformatics. 2013; 29:2292–9.PubMedView ArticleGoogle Scholar
 Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, et al. Effect of readmapping biases on detecting allelespecific expression from RNAsequencing data. Bioinformatics. 2009; 25:3207–12.PubMedPubMed CentralView ArticleGoogle Scholar
 Skelly DA, Johansson M, Madeoy J, Wakefield J, Akey JM. A flexible Bayesian method for detecting allelic imbalance in RNAseq data. Genome Res. 2011; 10:1728–37.View ArticleGoogle Scholar
 LeónNovelo LG, McIntyre LM, Fear JM, Graze RM. A flexible Bayesian method for detecting allelic imbalance in RNAseq data. BMC Genomics. 2014; 15:920.PubMedPubMed CentralView ArticleGoogle Scholar
 Satya RV, Zavaljevski N, Reifman J. A new strategy to reduce allelic bias in RNASeq readmapping. Nucleic Acids Res. 2012; 40:e127.PubMedView ArticleGoogle Scholar
 van de Geijn B, McVicker G, Gilad Y, Pritchard JK. WASP: allelespecific software for robust molecular quantitative trait locus discovery. Nat Methods. 2015; 12(11):1061–3. doi: dx.doi.org/10.1038/nmeth.3582. Epub 2015 Sep 14.
 Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, et al. AlleleSeq: analysis of allelespecific expression and binding in a network framework. Mol Syst Biol. 2011; 7:522.PubMedPubMed CentralView ArticleGoogle Scholar
 Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, et al. From singlecell to cellpool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res. 2014; 24:496–510.PubMedPubMed CentralView ArticleGoogle Scholar
 Jiang H, Wong WH. Statistical inferences for isoform expression in RNASeq. Bioinformatics. 2009; 25:1026–32.PubMedPubMed CentralView ArticleGoogle Scholar
 Langmead B, Salzberg SL. Fast gappedread alignment with Bowtie 2. Nat Methods. 2012; 9:357–9.PubMedPubMed CentralView ArticleGoogle Scholar
 Ewing B, Hillier L, Wendl MC, Green P. Basecalling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998; 8:175–85.PubMedView ArticleGoogle Scholar
 Attias H. Inferring parameters and structure of latent variable models by variational bayes. In: Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.: 1999. p. 21–30. http://0dl.acm.org.brum.beds.ac.uk/citation.cfm?id=2073799.Google Scholar
 Bishop CM. Pattern Recognition and Machine Learning. New York, NY, USA: Springer Science + Business Media, LLC; 2006.Google Scholar
 Nariai N, Kojima K, Mimori T, Sato Y, Kawai Y, YamaguchiKabata Y, et al. TIGAR2: sensitive and accurate estimation of transcript isoform expression with longer RNASeq reads. BMC Genomics. 2014; 15((Suppl 10)):S5.PubMedPubMed CentralView ArticleGoogle Scholar
 Li B, Dewey CN. RSEM: accurate transcript quantification from RNASeq data with or without a reference genome. BMC Bioinformatics. 2011; 12:323.PubMedPubMed CentralView ArticleGoogle Scholar
 Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010; 28:511–5.PubMedPubMed CentralView ArticleGoogle Scholar
 Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, et al. DAVID: Database for annotation, visualization, and integrated discovery. Genome Biol. 2003; 4:P3.PubMedView ArticleGoogle Scholar
 Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, et al. The SWISSPROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003; 31(1):365–70.PubMedPubMed CentralView ArticleGoogle Scholar
 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25:25–9.PubMedPubMed CentralView ArticleGoogle Scholar
 McDaniell R, Lee BK, Song L, Liu Z, Boyle AP, Erdos MR, et al. Heritable individualspecific and allelespecific chromatin signatures in humans. Science. 2010; 328:235–9.PubMedPubMed CentralView ArticleGoogle Scholar
 Reddy TE, Gertz J, Pauli F, Kucera KS, Varley KE, Newberry KM, et al. Effects of sequence variation on differential allelic transcription factor occupancy and gene expression. Genome Res. 2012; 22:860–9.PubMedPubMed CentralView ArticleGoogle Scholar
 Tilgner H, Grubert F, Sharon D, Snyder MP. Defining a personal, allelespecific, and singlemolecule longread transcriptome. Proc Natl Acad Sci U S A. 2014; 111:9869–74.PubMedPubMed CentralView ArticleGoogle Scholar
 Chen R, Mias GI, LiPookThan J, Jiang L, Lam HY, Chen R, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell. 2012; 148:1293–1307.PubMedPubMed CentralView ArticleGoogle Scholar
 Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genomewide association studies. PLoS Genet. 2009; 5(10):e10005291728–37.Google Scholar