- Open Access
Conditional entropy in variation-adjusted windows detects selection signatures associated with expression quantitative trait loci (eQTLs)
© Handelman et al.; licensee BioMed Central Ltd. 2015
- Published: 18 June 2015
Over the past 50,000 years, shifts in human-environmental or human-human interactions shaped genetic differences within and among human populations, including variants under positive selection. Shaped by environmental factors, such variants influence the genetics of modern health, disease, and treatment outcome. Because evolutionary processes tend to act on gene regulation, we test whether regulatory variants are under positive selection. We introduce a new approach to enhance detection of genetic markers undergoing positive selection, using conditional entropy to capture recent local selection signals. Results We use conditional logistic regression to compare our Adjusted Haplotype Conditional Entropy (H|H) measure of positive selection to existing positive selection measures. H|H and existing measures were applied to published regulatory variants acting in cis (cis-eQTLs), with conditional logistic regression testing whether regulatory variants undergo stronger positive selection than the surrounding gene.
These cis-eQTLs were drawn from six independent studies of genotype and RNA expression. The conditional logistic regression shows that, overall, H|H is substantially more powerful than existing positive-selection methods in identifying cis-eQTLs against other Single Nucleotide Polymorphisms (SNPs) in the same genes. When broken down by Gene Ontology, H|H predictions are particularly strong in some biological process categories, where regulatory variants are under strong positive selection compared to the bulk of the gene, distinct from those GO categories under overall positive selection. . However, cis-eQTLs in a second group of genes lack positive selection signatures detectable by H|H, consistent with ancient short haplotypes compared to the surrounding gene (for example, in innate immunity GO:0042742); under such other modes of selection, H|H would not be expected to be a strong predictor.. These conditional logistic regression models are adjusted for Minor allele frequency(MAF); otherwise, ascertainment bias is a huge factor in all eQTL data sets. Relationships between Gene Ontology categories, positive selection and eQTL specificity were replicated with H|H in a single larger data set. Our measure, Adjusted Haplotype Conditional Entropy (H|H), was essential in generating all of the results above because it: 1) is a stronger overall predictor for eQTLs than comparable existing approaches, and 2) shows low sequential auto-correlation, overcoming problems with convergence of these conditional regression statistical models.
Our new method, H|H, provides a consistently more robust signal associated with cis-eQTLs compared to existing methods. We interpret this to indicate that some cis-eQTLs are under positive selection compared to their surrounding genes. Conditional entropy indicative of a selective sweep is an especially strong predictor of eQTLs for genes in several biological processes of medical interest. Where conditional entropy is a weak or negative predictor of eQTLs, such as innate immune genes, this would be consistent with balancing selection acting on such eQTLs over long time periods. Different measures of selection may be needed for variant prioritization under other modes of evolutionary selection.
- Positive Selection
- Selective Sweep
- Conditional Entropy
- Conditional Logistic Regression
In the post-genome era, major research initiatives are directed towards identifying non protein-coding genetic variants with an effect on the expression of nearby genes: cis-expression quantitative trait loci (cis-eQTLs) [1, 2]. In parallel, comparative analysis  of humans and other primates, as well as human populations, has revealed categories of genes evolving under different selective pressures [4, 5]. For example, the phylogenetic p-values program phyloP  has been used to show that disease markers in genes involved in central nervous system (CNS) developmental defects  have undergone positive selection in the human lineage. Humans have more fixed genetic differences in CNS development genes, compared to our close primate relatives; therefore, phyloP identifies mutations in these CNS development genes. The interpretation is that these mutations conferred differential fitness in the primate ancestors of modern humans, causing many mutations in such genes to rapidly accumulate and fix. Conversely, other genes and especially regulatory regions show human-lineage specific purifying selection , leading to a loss of ancestral diversity. Therefore, it might be expected that different selection measures would be more or less useful in different genes or regulatory features.
In this report, we explore positive selection signatures not in disease genes/markers arising from our pre-human ancestors, but in the cis-eQTLs within/adjacent to genes. Positive selection in this report is confined to a relatively recent timeframe, and is controlled for the human population positive selection level on the entire gene and sub-gene location (exon, intron, intergenic region, flanking within 1kB of promoter or terminator, or untranscribed region/UTR). The positive selection signals used here are measured using allele frequencies and linkage-related properties within and between human populations via the results of the 1,000 Genomes Project . Detecting a controlled association between positive selection and cis-eQTLs is technically challenging from a statistics standpoint, because it introduces considerable dependencies (co-linearity) between the prediction (positive selection, allele frequency, sub-gene region) and stratification (gene) variables. The effort is justified because eQTLs, both cis  and trans  have shown considerable relevance to human disease. Methods to prioritize such variants have considerable practical consequences, since most disease heritability is driven by regulatory markers outside of coding regions [12, 13].
In order to fit statistical models of cis-eQTLs within genes, we introduce a new measure of positive selection, termed Adjusted Conditional Haplotype Entropy (H|H). Conditional entropy is commonly used in information/linguistics approaches , but also sees use in diverse bioinformatics applications [15, 16]. Among existing measures, this conditional entropy is most closely related to the integrated haplotype score (iHS) of Voight et al. , however, in the results below we will show that H|H provides two crucial advantages. First, H|H is a significantly stronger predictor of cis-eQTLs than other measures, and second, H|H values can vary widely even between nearby markers. H|H is a measure of haplotypes but it depends on an individual marker: for example, a minor variant leading to a selective sweep of the pre-existing major haplotype would result in a very low conditional entropy and thus a very high H|H score, but it would be in linkage to other major haplotype markers that would have very low H|H scores. This divergence between nearby markers and even markers in moderate linkage allows for models incorporating H|H to converge when models utilizing other positive selection measures do not.
Overall research design
In both phases, conditional logistic regression is used to fit statistical models to the different cis-eQTL data sets. GO categories and other measures individually identified in the hypothesis-generation phase are tested as a group in the hypothesis-testing phase.
Hypothesis-testing phase in the Chicago QTL Browser data
Chicago QTL browser data sets.
Number of eQTLs/total
Num. Genes (Regions)
Total GTEx Markers
Mangravite 2012 
LCLs; candidate genes
Montgomery 2010A 
LCLs; RNA-seq; exons
Montgomery 2010B 
LCLs; RNA-seq; transcripts
Schadt 2007 
Stranger 2007 
Veyrieras 2008 
Zeller 2010 
When statistical models were fit to these individual data sets, the results were quite heterogeneous (see below in this section). Therefore, GO biological-process categories were used to test whether the inter-data set heterogeneity might be explained by differences in the underlying biology of the genes for which eQTLs were detected by the different approaches. Alternative, non-exclusive explanations for heterogeneity among these data sets include: source tissue differences, participant population differences, and sample size/ascertainment bias differences, any or all of which might reasonably affect the extent to which eQTLs appear more positively selected than nearby markers. In order to produce a preliminary hypothesis on this question, each gene was assigned to a single biological-process GO category, chosen among the available GO assignments (see methods), with genes assigned to multiple GO categories assigned to the single GO category containing closest-to-125 ENCODE  genes. This is an ad hoc approach that produces single assignments to biologically relevant GO categories; it was driven by inspection of the GO categories of most interest to our research group. Size-filtering of some kind is standard in any study of GO categories. Note that this is not a GO enrichment test, and is neither intended nor expected to give a valid GO enrichment result, but to test for an interaction between GO category and our QTL predictors. So, for example, these results cannot be interpreted to indicate in which GO categories QTLs are especially common. The reduction to a single GO assignment for each gene is both a simplifying assumption and a technical requirement for the type of model fitting that is used in the hypothesis-generation phase; the GO category size of 125 was chosen by inspection - GO categories containing more than 200 genes were generally not biologically informative (e.g. the most extreme case, GO 0008150, biological process), while GO categories below size 50 are unlikely to have enough representatives in our data to detect an association, especially an association which is reproduced in several data sets containing a non-overlapping combination of genes. Small GO categories are often not-represented at all in many of our data sets. This procedure will be biased to detect effects from larger GO categories, but the statistical test should still be valid, in the same way that a non-uniformly distributed error rate (e.g. in a GWAS, assigning 50% of the errors to the 5% of the markers which are in coding regions)
This may introduce a bias into the types of GO categories which are selected in this hypothesis-generation phase; yet another reason why the hypothesis-testing phase was required. The GO assignment reduction is not carried out in the hypothesis-testing phase, so the hypothesis-testing phase should be free of any bias introduced by this favoritism.
Conditional logistic models.
eQTL ~ H|H + |ΔDAF| + ΔDAF + ΔiHH + Fst + iHS + MAF + location
eQTL ~ H|H + |ΔDAF| + ΔDAF + ΔiHH + iHS + MAF + location
eQTL ~ H|H + |ΔDAF| + ΔDAF + ΔiHH + Fst + iHS + MAF × GO + location
eQTL ~ H|H + |ΔDAF| + ΔDAF + Fst + MAF × GO + location
eQTL ~ H|H + |ΔDAF| + ΔDAF + MAF × GO + location
eQTL ~ H|H × GO + ΔDAF + MAF × GO + location
Except for our Adjusted Haplotype Conditional Entropy (H|H), all of the values in the formulas for Table 2 are taken from the 1,000 Genomes Selection Browser, and converted to a sum of logs (P value when reported, but MAF is similarly transformed although it is not a p-value) across thousand genomes populations on a per-SNP basis. Identification as an eQTL is treated as a random variable, with a logistic distribution fit on a per-gene basis to an effect from one or more of the following variables that are included in the formulas in Table 2: Adjusted Haplotype Conditional Entropy (H|H, the new measure introduced in this manuscript), the absolute value of the differential allele frequency (|ΔDAF|), the differential allele frequency (ΔDAF), the change in integrated haplotype heterozygosity (ΔiHH), the coancestry coefficient (F-statistic or Fst ) the integrated haplotype score (iHS), the minor allele frequency (MAF), and the location within the gene of an individual SNP (either flanking, UTR, intronic or exonic; or, for intergenic regions, intergenic); if an individual parameter is fit independently in each singletized GO term, it is appended to indicate such an interaction (with × GO).
Summary of t-statistic Z-scores from conditional logistic model fits
Minor allele frequency is the strongest predictor, and heavily modulates the predictive power of all other predictors (results not shown; this is an expected consequence of sample size constraints/ascertainment bias within the individual source studies.) Otherwise, only one pair of predictors interact (change substantially in predictive power when both are included): |ΔDAF| and Fst. For |ΔDAF|, two Z-scores are given in Table 3: without (before the slash) and with (after the slash) Fst in the model. Fst is a good predictor but does not converge on the Mangravite data set, indicative that the measure is overly smooth (see Figure 1).
Model 6 is underlined in the list above because it is used for hypothesis generation; Model 6 includes only our Adjusted Haplotype Conditional Entropy (fit on a per GO term basis), Minor Allele Frequency (likewise), and Differential Allele Frequency (fit to each data set as a whole). Other positive selection measures either were weak predictors in one or more data sets (ΔiHH and iHS; |ΔDAF| unless Fst is also included); or, caused a failure of the model to converge in some data sets (Fst, ΔDAF if included as an interaction term), because of excessive non-independence (referred to as co-linearity in the logistic regression literature ) with either the other predictors or with the stratification variable. ΔDAF is a strong predictor and we were able to include ΔDAF in models that did converge; ΔDAF serves a crucial purpose in our models because it differentiates the effects of population bottlenecks from selective sweeps. However, unfortunately a model including a ΔDAF interaction with GO terms was tried but did not converge, so ΔDAF was fit globally. Notably, of the positive selection measures, only our Adjusted Haplotype Conditional Entropy is a strong predictor which also converges in all of these data sets when fit on a per-GO term basis.
Other positive selection measures that report values for contiguous blocks of gene sequence are highly useful in gene-prioritization applications; however, they are generally unsuited for the challenge addressed here, which is to prioritize variants within genes, and would be expected to have severe convergence issues, so they were not considered (Figure 1 illustrates this issue with the Composite of Multiple Signals). Future applications of this method would be in a candidate gene context, however in order to assess the predictive power of the measure, whole genome results are used. A positive selection measure that is constant across an individual gene may identify genes that are in turn eQTL rich, but this would tend to produce a failure to converge. In any case, in the currently available data, differences in eQTL richness between genes cannot be distinguished from ascertainment bias. The conditional information haplotype score used here was developed in part to address these technical issues, which are related to the initial observation made by biologists in our group - that a marker in a long haplotype might be only a few hundred bases away from a second marker with which the long haplotype is in weak linkage. In these data, only our new conditional haplotype information has the properties that it is both a strong predictor for eQTLs and allows a convergent statistical fit in relevant cases. Therefore, Model 6 was chosen to carry on to the hypothesis generation step.
GO ID's at high and low extremes of selection score in hypothesis-generation data set
A heuristic score was used to combine the results of the fit to Model 6, for each GO ID in each of the seven data sets. To generate this heuristic score from the fit of Model 6 to each permutation, the effect size is moved one standard error towards 0 (minimum of 0) and then summed across all 7 data sets. There is an overall association between our H|H and eQTLs; by using the effect size rather than the Z-score, we avoid identifying individual GO IDs solely because they contain many genes.
In order to establish preliminary confidence bounds on the range of variability for these prediction by GO category interactions, Model 6 was also fit to 100 randomly permuted versions of the data sets above. In each permutation, assignments between genes assigned to each single, medium-sized GO category, were randomized. In each permutation, each randomized medium-sized GO category contained the same number of member genes as the corresponding original GO category. By generating the same heuristic score (sum of effect sizes) in each permutation, a confidence interval was obtained.
GO IDs where eQTL effect size for Adjusted Haplotype Conditional Entropy outside of the 90% confidence interval from permutation.
Mangravite 2012: βH|H ± σH|H
Montgomer y'10 Exon: βH|H ± σH|H
Montgomer y '10 Trspt: βH|H ± σH|H
Schadt 2007: βH|H ± σH|H
Stranger 2007: βH|H ± σH|H
Veyrieras 2008: βH|H ± σH|H
Zeller 2010: βH|H ± σH|H
Negative Effect Sizes
-0.099 ± 0.055
0.186 ± 0.086
-0.402 ± 0.592
0.137 ± 0.255
-1.600 ± 1.200
-0.001 ± 0.172
-0.048 ± 0.060
-0.112 ± 0.041
-0.066 ± 0.065
-0.207 ± 0.109
-0.0001 ± 0.134
-0.160 ± 0.069
-0.049 ± 0.074
-0.050 ± 0.035
-0.013 ± 0.125
-1.200 ± 0.933
-0.043 ± 0.126
0.032 ± 0.043
-0.373 ± 0.105
-0.140 ± 0.111
-0.006 ± 0.249
-0.055 ± 0.180
-0.107 ± 0.123
-0.068 ± 0.132
0.019 ± 0.056
Large Positive Effect Sizes
0.128 ± 0.047
0.158 ± 0.075
0.290 ± 0.083
0.006 ± 0.127
0.196 ± 0.330
0.139 ± 0.094
0.039 ± 0.039
1.144 ± 0.643
0.119 ± 0.069
-0.037 ± 0.011
-0.074 ± 0.124
0.455 ± 0.676
0.066 ± 0.100
0.018 ± 0.037
0.480 ± 0.076
-0.124 ± 0.052
-0.052 ± 0.116
0.084 ± 0.109
0.629 ± 0.262
0.304 ± 0.082
0.010 ± 0.030
0.657 ± 0.452
0.309 ± 0.209
0.090 ± 0.259
0.771 ± 0.319
-0.143 ± 0.231
-0.067 ± *0.108
0.010 ± 0.033
0.040 ± 0.061
0.414 ± 0.194
0.084 ± 0.100
0.322 ± 0.097
0.060 ± 0.044
0.037 ± 0.035
0.185 ± 0.042
0.040 ± 0.071
-0.031 ± 0.0112
-0.322 ± 0.239
0.297 ± 0.059
0.113 ± 0.047
0.038 ± 0.040
0.277 ± 0.049
0.027 ± 0.086
0.048 ± 0.094
0.031 ± 0.128
0.465 ± 0.119
0.175 ± 0.078
0.026 ± 0.026
0.185 ± 0.044
0.030 ± 0.055
-0.006 ± q0.173
-0.136 ± 0.169
0.218 ± 0.103
0.333 ± 0.111
-0.033 ± 0.039
http://amigo.geneontology.org/amigo/term/GO:0006367 transcription initiation from RNA polymerase II promoter
http://amigo.geneontology.org/amigo/term/GO:0006396 RNA processing
http://amigo.geneontology.org/amigo/term/GO:0008544 epidermis development
http://amigo.geneontology.org/amigo/term/GO:0042742 defense response to bacterium
While the following GO categories, hence Group 2, were above the 90% confidence interval from the permutations (that is, in these categories, overall the Conditional Haplotype Entropy, which is generally higher for eQTLs, was especially-relatively-high for eQTLs compared to other GO categories):
http://amigo.geneontology.org/amigo/term/GO:0001501 skeletal system development
http://amigo.geneontology.org/amigo/term/GO:0006869 lipid transport
http://amigo.geneontology.org/amigo/term/GO:0006936 muscle contraction
http://amigo.geneontology.org/amigo/term/GO:0007186 G-protein coupled receptor signaling pathway
http://amigo.geneontology.org/amigo/term/GO:0009653 anatomical structure morphogenesis
http://amigo.geneontology.org/amigo/term/GO:0016567 protein ubiquitination
http://amigo.geneontology.org/amigo/term/GO:0018108 peptidyl-tyrosine phosphorylation
http://amigo.geneontology.org/amigo/term/GO:0051056 regulation of small GTPase mediated signal transduction
Group 1 includes bacterial defense genes which might be expected to experience balancing selection[35, 36], which would preserve ancient diversity, allowing time for recombination to produce short haplotypes; since the epidermis serves a largely infection-defense role, we speculate that balancing selection also acts there in order to maintain within-population diversity in pathogen environment. For the other categories in Group 1, we propose simply that the last 50,000 years is too short a time frame for many beneficial mutations to arise in the highly-conserved core functions of molecular biology.
Group 2 is of particular interest because it includes so many current and emerging drug targets[38, 39], as well as the lipid transport genes which play such a large role in western lifestyle diseases[40, 41]. Follow-up work will focus on utilizing these predictions to enrich the variant discovery activities of the Pharmacogenomics Research Network (PGRN).
An additional GO term (Cellular Metabolic Process, GO:0044237) was also above the confidence interval but we chose to exclude it from the hypothesis-testing phase, because by our approach it produces a conglomerate taxon of genes that are simply not well-understood enough to be assigned to any GO ID with fewer members. This is not intended to be a rigorous statistical test of these associations; instead, these two categories are tested in a larger, more extensively genotyped hypothesis-testing phase, described next.
Hypothesis-testing phase in the Westra data set/Blood eQTL Browser
This hypothesis-testing differed from Model 6 in the hypothesis-generation phase above in that:
1) intergenic regions (defined as more than 1kB from an adjacent gene) were associated with GO IDs, and thus groups, of both their flanking genes,
2) all genes associated with the GO IDs in Group 1 and/or Group 2 were included (and not just those that were not also assigned to a second GO ID which had closer-to-125 members), in order to avoid any biases that might have been introduced by our size-filtering,
3) the interaction terms were fit to each of Group 1 and Group 2 (and independently to the few genes and intergenic regions that were a member of GO IDs in both),
4) the data set itself includes more eQTLs as a fraction of all markers, including eQTLs having effect sizes too small to be detected in the Chicago eQTL browser data sets used for the hypothesis-generation phase.
Whole-blood cis-eQTL conditional logistic regression results
Number of cis-eQTLs
Num. Genes (Regions)
Total GTEx Markers
βH|H ± σH|H
Westra 2012 - Group 1
GO:0006367, GO:0006396, GO:0008544, GO:0042742
-0.003 ± 0.003
Westra 2012 - Group 2
GO:0001501, GO:0006869, GO:0006936, GO:0007186, GO:0009653, GO:0016567, GO:0018108, GO:0051056,
0.005 ± 0.001
Westra 2012 - Both
In both Group 1 and in Group 2
0.006 ± 0.007
Westra 2012 - TOTAL
RNA extracted from whole blood.
GO enrichment test on normalized likelihoods
The previous results arise from incorporation of interaction terms into regression methods, and do not utilize existing approaches for GO enrichment. In the supplement (additional file 1), we present the results of an alternative approach where positive selection methods are added to the logistic model, the improvement in model prediction is assessed on a per-gene basis, and PantherDB is used to test for an association between model improvement and biological function GO category. This method finds GO categories containing relatively few genes and a significant interaction is not detected in the hypothesis-testing data set. A supplementary report on this alternative approach can be found in additional file 1.
Our Adjusted Conditional Haplotype Entropy (H|H) will be distrubted as UCSC Genome Browser tracks from the Center for Pharmacogenomics webpage (http://pharmacogenomics.osu.edu/). Conditional entropy indicative of a selective sweep is a strong predictor of eQTLs for genes in some medically interesting biological processes but not in others. In categories of genes where H|H is a weak or negative predictor of eQTLs, such as innate immune system genes, this would be consistent with balancing selection acting on such eQTLs. The evolutionary signatures underlying eQTLs in different biological processes, populations and systems vary widely.
GTEx candidate SNPs
The GTEx consortium has filtered human genetic variation for excess linkage disequilibrium, producing a list of 6,820,472 candidate variations for which GTEx plans to perform statistical tests in their tissues, which we downloaded from their ftp site on July 5, 2014.
First, because this list of SNPs has been filtered for D' values, the risk of over-counting in our statistical model fits is reduced. Second, we anticipate that tissue-specific QTLs identified by GTEx will be the most relevant data for future applications that may incorporate selection as a component.
Adjusted conditional entropy in 1,000 genomes data
For these calculations, 1,000 genomes data release 3 (Nov. 23, 2010) was used, including all single nucleotide polymorphisms (but not other types of variation) to which a reference SNP cluster (rs number) had been assigned. Our method for finding positively-selected (long) haplotypes is based on information theoretical considerations, in particular the Khinchin axiom for the Shannon entropy, also known as the (Shannon) entropy decomposition formula; for the purposes of this manuscript, the Khinchin axiom is used only to justify the summation of conditional entropy in different populations of roughly equal size. The formulas below describe the decomposition of the entropy for a population of haplotypes (in a given window), into a total entropy and a conditional entropy associated with a minor variant - the conditional entropy associated with the major variant would also be a part of this decomposition but is not used for our score. Frequency-dependent terms, the total entropy, the conditional entropy for the minor variant, and also any population-specific effects are all combined into an ad-hoc linear Adjusted Conditional Haplotype Entropy (H|H), which has a mean of 0 in each 1,000 genomes population.
The difference between the total entropy and the conditional entropy given the minor variant quantifies the influence of the minor variant on the diversity of the haplotypes in the window. The larger the difference between entropies, the more conserved (longer) the haplotype that carries the minor variant, consistent with a recent selective sweep.
(meaning: bpmin and bpmax are chosen such-that the sum is at least 10 but otherwise as small as possible).
Where Hi is the Shannon entropy at each SNP in the window (here using the natural log as the base) in the corresponding 1,000 genomes population. The sum of Hi in the window would correspond to entropy of the window if the individual positions were all mutually independent. The value of 10 was chosen arbitrarily in order to make the results computationally tractable.
This gives a window that will be shorter around markers in more variable regions of the genome, and for the same markers will be shorter in more genetically diverse populations; this is intended to account for background differences in haplotype lengths between populations and regions.
Empirical adjustments in conditional entropy.
Chicago eQTL browser data sets
These data sets were used only in the hypothesis-generation phase.
Via the blood eQTL browser maintained by the authors (http://genenetwork.nl/bloodeqtlbrowser/) of a study primarily focused on potential disease consequences for trans-eQTLs, we obtained cis-eQTLs from a very large (5,311 participants in their discovery phase, with an additional 2,775 participants in their replication phase) study of expression genetics in whole human blood. These cis-eQTLs are used only in our hypothesis-testing phase.
1,000 Genomes Selection Browser results
A variety of diversity and positive selection genetics measures are made available via the 1,000 Genomes Selection Browser. The entire data set was downloaded on January 10, 2014.
Of these, ΔDAF, |ΔDAF|, ΔiHH, Fst , and iHS, along with minor allele frequency (MAF) were used in at least one model in the hypothesis-generation phase. In the final model and in the hypothesis-testing phase, only ΔDAF and MAF were used. Positive selection measures that report results in windows (including the Composite of Multiple Signals score, distributed elsewhere) were not used because preliminary results indicated that the gene-based Conditional logistic regression would not converge.
Covariates incorporated using Annovar and the UCSC Table Browser
The results from the previous sections were matched to genes and within-gene location using Annovar which in-turn utilizes 2013 ENCODE data. Once mapped to gene symbols, gene ontology IDs are mapped using the UCSC table browser with tables downloaded on or after September 25, 2014.
To account for differential eQTL frequencies in different locations within genes, the within-gene locations included in the Annovar distribution were converted into four distinct location categories as follows:
exonic, ncRNA_exonic exonic
UTR3, UTR5, ncRNA_UTR3, ncRNA_UTR5 UTR
intronic, splicing, ncRNA_intronic, ncRNA_splicing intronic
upstream, downstream flanking
For a SNP in multiple categories, the category higher in the above list is chosen only (e.g. a SNP that is both in an exon and upstream of the promoter for a non-coding gene is classified as exonic only). Upstream and downstream are defined as within 1kB of the most distal promoter and transcriptional terminators defined in ENCODE; this 1kB distance is the default for Annovar.
In the hypothesis generation phase, each gene was assigned to a single GO ID in the biological function category; when a gene is assigned to multiple categories (as is generally the case), it was instead assigned only to the category that was closest to containing exactly 125 genes (with ties broken for smaller categories, so a GO ID with 100 genes would be preferred to a GO ID with 150). Once this was resolved, all GO ids with 50 or fewer assigned genes, along with all intergenic regions, were assigned to a large "OTHER" category.
In the hypothesis testing phase, this mapping to a single GO id is not used. Instead, genes were assigned either to one of the Group 1 GO IDs in a group for which our haplotype score had a positive association with eQTL status in the validation/model selection phase (see Conditional logistic regression below,) to a Group 2 GO IDs for which our haplotype score had a negative association with eQTL status in the validation/model selection phase, to GO categories in both groups, or to neither group; eQTLs in intergenic regions were assigned to the group(s) corresponding to the closest flanking genes.
Conditional logistic regression
Conditional logistic regression, as implemented in the survival package for the statistical programming language R, is used for all statistical model fitting. The notation used in the results closely matches the R syntax, with gene assignments from Annovar (see Covariates incorporated using Annovar and the UCSC Table Browser) used as the stratification variable. Each intergenic region is treated as an individual and distinct gene for stratification purposes.
Meta-analysis of conditional logistic regression results
In the Conditional logistic regression above, gene by gene variability plays a significant role. Therefore, in order to identify GO categories in which the Adjusted Haplotype Conditional Entropy reproducibly reported different extremes, the 1-to-1 mapping of genes to GO IDs was permuted 100 times (with genes in the OTHER category, including intergenic regions and genes mapped to GO IDs with fewer than 50 members after 1-to-1 mapping, not permuted). Within each of the 100 permutations, the statistical model (Model 6, see Results) was fit to the permuted data.
Where β is the effect size, and σ is the standard error in the effect size. The 100 permutations of 94 such GO IDs produced 9,400 such sums - these were used to produce a 90% confidence interval, and GO IDs in the original data for which the sum was outside of this confidence interval, were continued onto the hypothesis-testing phase.
Supplementary enrichment tests
As a supplement to the analysis reported in the main text, an additional round of analysis, testing for enrichment of extreme changes in likelihood when models incorporate additional positive selection predictions according to GO IDs, is reported in additional file 1.
The authors acknowledge that this material is based upon work supported by or in part by NIGMS grant U01 092655 to the XGEN consortium, and by NSF agreement no. 0931642 to the Mathematical Biosciences Institute. Publication of this work was supported in part by an allocation of computing time from the Ohio Supercomputer Center. None of these funding agencies had any part in the research presented here, or in the preparation thereof.
We would like to thank Drs. Bromberg and Capriotti for their work in organizing the ISMB special interest group session.
This article has been published as part of BMC Genomics Volume 16 Supplement 8, 2015: VarI-SIG 2014: Identification and annotation of genetic variants in the context of structure, function and disease. The full contents of the supplement are available online at http://0-www.biomedcentral.com.brum.beds.ac.uk/bmcgenomics/supplements/16/S8.
- Brem RB, Yvert G, Clinton R, Kruglyak L: Genetic Dissection of Transcriptional Regulation in Budding Yeast. Science. 2002, 296 (5568): 752-755. 10.1126/science.1069516.View ArticlePubMedGoogle Scholar
- Bryois J, Buil A, Evans DM, Kemp JP, Montgomery SB, Conrad DF, et al: Cis and Trans Effects of Human Genomic Variants on Gene Expression. PLoS Genet. 2014, 10 (7): e1004461-10.1371/journal.pgen.1004461.PubMed CentralView ArticlePubMedGoogle Scholar
- Felsenstein J: Phylogenies and the Comparative Method. Am Nat. 1985, 125 (1): 1-15. 10.1086/284325.View ArticleGoogle Scholar
- Fagny M, Patin E, Enard D, Barreiro LB, Quintana-Murci L, Laval G: Exploring the Occurrence of Classic Selective Sweeps in Humans Using Whole-Genome Sequencing Data Sets. Mol Biol Evol. 2014, 31 (7): 1850-1868. 10.1093/molbev/msu118.View ArticlePubMedGoogle Scholar
- O'Bleness M, Searles VB, Varki A, Gagneux P, Sikela JM: Evolution of genetic and genomic features unique to the human lineage. Nat Rev Genet. 2012, 13 (12): 853-866. 10.1038/nrg3336.PubMed CentralView ArticlePubMedGoogle Scholar
- Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A: Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010, 20 (1): 110-121. 10.1101/gr.097857.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Vissers LELM, de Ligt J, Gilissen C, Janssen I, Steehouwer M, de Vries P, et al: A de novo paradigm for mental retardation. Nat Genet. 2010, 42 (12): 1109-1112. 10.1038/ng.712.View ArticlePubMedGoogle Scholar
- Ward LD, Kellis M: Evidence of abundant purifying selection in humans for recently acquired regulatory functions. Science. 2012, 337 (6102): 1675-1678. 10.1126/science.1225057.PubMed CentralView ArticlePubMedGoogle Scholar
- 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 56-65. 10.1038/nature11632.View ArticleGoogle Scholar
- Zhang X, Gierman HJ, Levy D, Plump A, Dobrin R, Goring HH, et al: Synthesis of 53 tissue and cell line expression QTL datasets reveals master eQTLs. BMC Genomics. 2014, 15: 532-10.1186/1471-2164-15-532.PubMed CentralView ArticlePubMedGoogle Scholar
- Westra H-J, Peters MJ, Esko T, Yaghootkar H, Schurmann C, Kettunen J, et al: Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat Genet. 2013, 45 (10): 1238-1243. 10.1038/ng.2756.PubMed CentralView ArticlePubMedGoogle Scholar
- Sadee W, Hartmann K, Seweryn M, Pietrzak M, Handelman SK, Rempala GA: Missing heritability of common diseases and treatments outside the protein-coding exome. Hum Genet. 2014, 133 (10): 1199-1215. 10.1007/s00439-014-1476-7.PubMed CentralView ArticlePubMedGoogle Scholar
- Bromberg Y, Capriotti E: SNP-SIG 2013: from coding to non-coding-new approaches for genomic variant interpretation. BMC Genomics. 2014, 15 (Suppl 4): S1-10.1186/1471-2164-15-S4-S1.PubMed CentralView ArticlePubMedGoogle Scholar
- Bratko A, Filipič B, Cormack GV, Lynam TR, Zupan B: Spam Filtering Using Statistical Data Compression Models. J Mach Learn Res. 2006, 7: 2673-2698.Google Scholar
- Kaitchenko A: Algorithms for estimating information distance with application to bioinformatics and linguistics. Canadian Conference on Electrical and Computer Engineering, 2004. 2004, 4: 2255-2258.View ArticleGoogle Scholar
- Nalbantoglu ÖU, Russell DJ, Sayood K: Data Compression Concepts and Algorithms and Their Applications to Bioinformatics. Entropy. 2009, 12 (1): 34-52. 10.3390/e12010034.View ArticleGoogle Scholar
- Voight BF, Kudaravalli S, Wen X, Pritchard JK: A Map of Recent Positive Selection in the Human Genome. PLoS Biol. 2006, 4 (3): e72-10.1371/journal.pbio.0040072.PubMed CentralView ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.PubMed CentralView ArticlePubMedGoogle Scholar
- Mi H, Muruganujan A, Casagrande JT, Thomas PD: Large-scale gene function analysis with the PANTHER classification system. Nat Protoc. 2013, 8 (8): 1551-1566. 10.1038/nprot.2013.092.View ArticlePubMedGoogle Scholar
- Mangravite LM, Engelhardt BE, Medina MW, Smith JD, Brown CD, Chasman DI, et al: A statin-dependent QTL for GATM expression is associated with statin-induced myopathy. Nature. 2013, 502 (7471): 377-380. 10.1038/nature12508.PubMed CentralView ArticlePubMedGoogle Scholar
- Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, et al: Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010, 464 (7289): 773-777. 10.1038/nature08903.View ArticlePubMedGoogle Scholar
- Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, et al: Mapping the Genetic Architecture of Gene Expression in Human Liver. PLoS Biol. 2008, 6 (5): e107-10.1371/journal.pbio.0060107.PubMed CentralView ArticlePubMedGoogle Scholar
- Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, et al: Population genomics of human gene expression. Nat Genet. 2007, 39 (10): 1217-1224. 10.1038/ng2142.PubMed CentralView ArticlePubMedGoogle Scholar
- Veyrieras J-B, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard JK: High-Resolution Mapping of Expression-QTLs Yields Insight into Human Gene Regulation. PLoS Genet. 2008, 4 (10): e1000214-10.1371/journal.pgen.1000214.PubMed CentralView ArticlePubMedGoogle Scholar
- Zeller T, Wild P, Szymczak S, Rotival M, Schillert A, Castagne R, et al: Genetics and Beyond - The Transcriptome of Human Monocytes and Disease Susceptibility. PLoS One. 2010, 5 (5): e10693-10.1371/journal.pone.0010693.PubMed CentralView ArticlePubMedGoogle Scholar
- Pybus M, Dall'Olio GM, Luisi P, Uzkudun M, Carreño-Torres A, Pavlidis P, et al: 1000 Genomes Selection Browser 1.0: a genome browser dedicated to signatures of natural selection in modern humans. Nucleic Acids Res. 2013, 42 (Database issue): D909-D909.Google Scholar
- Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, et al: The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013, 45 (6): 580-585. 10.1038/ng.2653.View ArticleGoogle Scholar
- Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, et al: ENCODE Data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res. 2013, 41 (Database issue): D56-D63.PubMed CentralView ArticlePubMedGoogle Scholar
- Alexa A, Rahnenfuhrer J: topGO: Enrichment Analysis for Gene Ontology. R Package Version 2.18. 0. 2010Google Scholar
- Hofer T, Ray N, Wegmann D, Excoffier L: Large allele frequency differences between human continental groups are more likely to have occurred by drift during range expansions than by selection. Ann Hum Genet. 2009, 73 (1): 95-108. 10.1111/j.1469-1809.2008.00489.x.View ArticlePubMedGoogle Scholar
- Weir BS, Hill WG: Estimating F-statistics. Annu Rev Genet. 2002, 36: 721-750. 10.1146/annurev.genet.36.050802.093940.View ArticlePubMedGoogle Scholar
- Jennrich RI, Ralston ML: Fitting Nonlinear Models to Data. Annu Rev Biophys Bioeng. 1979, 8: 195-238. 10.1146/annurev.bb.08.060179.001211.View ArticlePubMedGoogle Scholar
- Simonson TS, Yang Y, Huff CD, Yun H, Qin G, Witherspoon DJ, et al: Genetic Evidence for High-Altitude Adaptation in Tibet. Science. 2010, 329 (5987): 72-75. 10.1126/science.1189406.View ArticlePubMedGoogle Scholar
- Grossman SR, Shylakhter I, Karlsson EK, Byrne EH, Morales S, Frieden G, et al: A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection. Science. 2010, 327 (5967): 883-886. 10.1126/science.1183863.View ArticlePubMedGoogle Scholar
- Azad AK, Sadee W, Schlesinger LS: Innate immune gene polymorphisms in tuberculosis. Infect Immun. 2012, 80 (10): 3343-3359. 10.1128/IAI.00443-12.PubMed CentralView ArticlePubMedGoogle Scholar
- Van der Hoorn RAL, De Wit PJGM, Joosten MHAJ: Balancing selection favors guarding resistance proteins. Trends Plant Sci. 2002, 7 (2): 67-71. 10.1016/S1360-1385(01)02188-4.View ArticlePubMedGoogle Scholar
- Boube M, Joulia L, Cribbs DL, Bourbon H-M: Evidence for a Mediator of RNA Polymerase II Transcriptional Regulation Conserved from Yeast to Man. Cell. 2002, 110 (2): 143-151. 10.1016/S0092-8674(02)00830-9.View ArticlePubMedGoogle Scholar
- Cohen P: Protein kinases -- the major drug targets of the twenty-first century?. Nat Rev Drug Discov. 2002, 1 (4): 309-315. 10.1038/nrd773.View ArticlePubMedGoogle Scholar
- Neubig RR, Siderovski DP: Regulators of G-Protein signalling as new central nervous system drug targets. Nat Rev Drug Discov. 2002, 1 (3): 187-197. 10.1038/nrd747.View ArticlePubMedGoogle Scholar
- O'Keefe JH, Cordain L: Cardiovascular Disease Resulting From a Diet and Lifestyle at Odds With Our Paleolithic Genome: How to Become a 21st-Century Hunter-Gatherer. Mayo Clin Proc. 2004, 79 (1): 101-108. 10.4065/79.1.101.View ArticlePubMedGoogle Scholar
- Hu FB: Globalization of Diabetes The role of diet, lifestyle, and genes. Diabetes Care. 2011, 34 (6): 1249-1257. 10.2337/dc11-0442.PubMed CentralView ArticlePubMedGoogle Scholar
- Giacomini KM, Brett CM, Altman RB, Benowitz NL, Dolan ME, Flockhart DA, et al: The Pharmacogenetics Research Network: From SNP Discovery to Clinical Drug Response. Clin Pharmacol Ther. 2007, 81 (3): 328-345. 10.1038/sj.clpt.6100087.View ArticlePubMedGoogle Scholar
- Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, et al: The UCSC Genome Browser database: 2014 update. Nucleic Acids Res. 2014, 42 (Database issue): D764-D770.PubMed CentralView ArticlePubMedGoogle Scholar
- Ong RT-H, Wang X, Liu X, Teo Y-Y: Efficiency of trans-ethnic genome-wide meta-analysis and fine-mapping. Eur J Hum Genet. 2012, 20 (12): 1300-1307. 10.1038/ejhg.2012.88.View ArticlePubMedGoogle Scholar
- Lewontin RC: The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics. 1964, 49 (1): 49-67.PubMed CentralPubMedGoogle Scholar
- Consortium T 1000 GP: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.View ArticleGoogle Scholar
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.PubMed CentralView ArticlePubMedGoogle Scholar
- Khinchin AI: Mathematical Foundations of Information Theory. 1957, Courier Dover Publications, 434 ():Google Scholar
- Kolmogorov A: On the Shannon theory of information transmission in the case of continuous signals. Inf Theory IRE Trans On. 1956, 2 (4): 102-108.View ArticleGoogle Scholar
- Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010, 38 (16): e164-e164. 10.1093/nar/gkq603.PubMed CentralView ArticlePubMedGoogle Scholar
- Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004, 32 (Database issue): D493-D496.PubMed CentralView ArticlePubMedGoogle Scholar
- Therneau TM, T L: (original S->R port and maintainer until: 2009). Survival: Survival Analysis. 2014,Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.