- Methodology article
- Open Access
- Published:

# Functional annotation signatures of disease susceptibility loci improve SNP association analysis

*BMC Genomics*
**volume 15**, Article number: 398 (2014)

## Abstract

### Background

Genetic association studies are conducted to discover genetic loci that contribute to an inherited trait, identify the variants behind these associations and ascertain their functional role in determining the phenotype. To date, functional annotations of the genetic variants have rarely played more than an indirect role in assessing evidence for association. Here, we demonstrate how these data can be systematically integrated into an association study’s analysis plan.

### Results

We developed a Bayesian statistical model for the prior probability of phenotype–genotype association that incorporates data from past association studies and publicly available functional annotation data regarding the susceptibility variants under study. The model takes the form of a binary regression of association status on a set of annotation variables whose coefficients were estimated through an analysis of associated SNPs in the GWAS Catalog (GC). The functional predictors examined included measures that have been demonstrated to correlate with the association status of SNPs in the GC and some whose utility in this regard is speculative: summaries of the UCSC Human Genome Browser ENCODE super–track data, dbSNP function class, sequence conservation summaries, proximity to genomic variants in the Database of Genomic Variants and known regulatory elements in the Open Regulatory Annotation database, PolyPhen–2 probabilities and RegulomeDB categories. Because we expected that only a fraction of the annotations would contribute to predicting association, we employed a penalized likelihood method to reduce the impact of non–informative predictors and evaluated the model’s ability to predict GC SNPs not used to construct the model. We show that the functional data alone are predictive of a SNP’s presence in the GC. Further, using data from a genome–wide study of ovarian cancer, we demonstrate that their use as prior data when testing for association is practical at the genome–wide scale and improves power to detect associations.

### Conclusions

We show how diverse functional annotations can be efficiently combined to create ‘functional signatures’ that predict the *a priori* odds of a variant’s association to a trait and how these signatures can be integrated into a standard genome–wide–scale association analysis, resulting in improved power to detect truly associated variants.

## Background

The purpose of genetic association studies is to discover genetic loci that contribute to an inherited trait, identify the variants behind these associations and ascertain their functional role in determining the phenotype [1]. Modern association studies bring to bear on this problem high coverage genotype data, comprehensive databases of genetic variation that allow imputation of most common ungenotyped variants to high accuracy and extensive, publicly available, *in silico* resources housing a growing assortment of genomic data that allow functional characterization of vast regions of the human genome. In the typical genome–wide association study (GWAS), the first two forms of data are combined to reconstruct genotypes to a desired density and these genotypes are then systematically tested for association with the phenotype. The functional annotation data are most frequently used in *post hoc* interpretation of evident associations raised by the analysis [2].

To date, functional annotation data have rarely played more than an indirect role in assessing evidence for association. For example, they may be used to suggest candidate genes and SNPs for study or to support links between candidate SNPs and genes. While methods to incorporate functional annotation data *a priori* in genetic association analyses exist, they are infrequently used. The prevailing approach to this is via a two–staged hierarchical model in which coefficients in the stage I generalized linear model for phenotype given genotype and exposure measurements are regressed, in stage II, on the annotation data [3–6]. This is limited to analysis of a modest number of variants and does not make use of prior data derived from previous association studies to inform the nature of that relationship.

It is becoming increasingly clear that a widening array of annotation data correlates with a variant’s having been associated with a human phenotype [7–14]. In what follows, we describe a formal approach to inference for association that combines functional annotation data (through a prior distribution) with genotype data (through a sampling model for the phenotype given genetic and other covariate data). We construct the prior distribution through careful analysis of SNPs housed in the GWAS Catalog [8]. We refer to the linear combination of the annotation variables defined by this model and evaluated for a given SNP as its ‘functional annotation signature’. We show that functional signatures so derived are predictive of the association status of SNPs not used in their creation and that, when coupled with genetic association data following the method we describe, improve the efficiency of association testing in a GWAS study of ovarian cancer. Data used to construct and evaluate the functional signatures are available at ftp://stat.duke.edu/pub/Users/iversen/FunctionalSignatures/.

## Results and discussion

In association studies statistical analysis is utilized to measure the evidence in favor of association. The data that inform these analyses usually comprise phenotype labels, SNP genotype data and a set of non–genetic covariates in addition to functional annotations of the variants under study. The statistical analysis may take many forms, varying according to choice of modeling approach and inferential paradigm (Frequentist or Bayesian). The approach we develop here relies on Bayesian inference but can also be applied when the genetic association summaries are derived from a Frequentist analysis. In this paradigm, prior data on a quantity of interest (such as the binary association status of a genetic variant) are updated to reflect evidence in the current data set.

A Bayesian analysis of genetic association data returns an estimate of the posterior odds of association of each marker given the available data. When the data take two distinct forms — here subject–level phenotype, genotype and covariate data and variant–level functional annotations — the posterior odds of association may be calculated in two stages, either by incorporating functional data prior to or following evaluation of the genetic data. The latter represents the heuristic typically followed in practice, whereby functional data is evaluated in an informal way (from the probabilistic point of view) conditional on evidence for association. Here we describe a model–based framework for combining functional and association data following the second factorization. We focus on the case–control study design for purposes of illustrating integration of the *a priori* (to association data) models for functional annotation data we describe below into analyses of genetic association data. Details of the models and their assumptions are provided in Methods.

When the functional data are incorporated as prior information, the posterior odds of a SNP’s association given the functional and subject–level data can be written as the product of the Bayes factor (BF) in favor of association and the prior odds of association given the functional data. The BF is the ratio of the integrated likelihood of the phenotype data given the covariate and genotype data assuming the SNP is associated to the integrated likelihood of the phenotype data given the covariate data only (i.e. assuming the SNP is not associated). It is a commonly used Bayesian statistical measure of association and is calculated by the SNPTEST [15] and BIMBAM [16] packages for analysis of GWAS data. Alternately, it can be approximated by the ABF [17, 18], allowing the method to be used in conjunction with standard Frequentist association testing software.

In short, the functional annotation data are incorporated into an analysis by formally updating the prior odds of association given the annotation data by a standard measure of genetic association. This process is depicted schematically in Figure 1. In what follows, we describe the model used to calculate prior odds of association and demonstrate its use in a GWAS of ovarian cancer. In it, the log of a SNP’s prior odds of association equals *α* + *F* *β*, where *F* denotes its functional data and *α* and *β* are parameters. We refer to the linear combination *F* *β* associated with a SNP as its ‘functional signature’.

### Functional signatures of known associations

We constructed the functional annotation signatures by estimating the multivariate relationship between a set of functional annotation variables and the binary association status of a set of SNPs. In this way, we identify a linear combination, *β* *F*, of the functional annotations, denoted *F*. We refer to this linear combination as a SNP’s ‘functional signature’ when it is evaluated using that SNP’s annotations, *F*_{
s
}. Figure 2 provides a schematic of our approach. In brief, we identified a set of associated SNPs and, for each, we chose a matching, unassociated ‘control’ SNP. We divided the matched pairs into ‘training’ and ‘validation’ sets and used the former to construct a series of models to predict association status given the function data and used the latter to compare the performance of these models. We chose the model that demonstrated the best predictive accuracy in the validation data, as measured by concordance index, to define the functional annotation signatures.

We began by constructing a matched case–control study *of SNPs* in which the cases were drawn from the GWAS Catalog [8] and the controls were identified from the HapMap database, Release 27, Phases II and III merged genotypes. We identified 2,093 case SNPs and, for each, identified one control SNP matched on chromosome, minor allele frequency and the genotyping platform(s) it appeared on. Since SNPs in the GWAS Catalog are arguably more frequently tags than the directly associated variant, we identified ‘LD partners’ [8] for each case and control SNP. We grouped each case and control SNP together with its LD partners to form blocks.

Using on–line bioinformatics resources, we assembled a set of functional annotation variables representing a variety of contextual descriptions and empirical measurements with which we annotated each of the 48,889 case, control and LD partner SNPs. We included annotation variables shown to be correlated with presence in the GWAS Catalog or that we believed likely to be so. These were: dbSNP function designation; summaries of ENCODE Project [19, 20] data on transcription levels assayed by RNA–seq [21, 22], measures of signal enrichment for H3K4Me1, H3K27Ac and H3K4Me3 histone modifications associated with enhancer and promoter activity [23, 24], evidence for overlap with a DNaseI hypersensitivity cluster [25, 26] and evidence for transcription factor binding [27–31]; PhyloP evolutionary conservation scores [32]; indicators for whether or not the variant falls in a region of known copy number variation, a region containing insertions or deletions or a region with inversions [33, 34]; PolyPhen–2 [35] probability that a mutation is damaging; and RegulomeDB score [36]. We used principal components analysis to reduce 75 Broad ChIP–seq, 24 Caltech RNA–seq, and 24 PhyloP ENCODE track summary statistics to the 18, 11 and 4 principal components, respectively, that explained at least 90% of variation in the class.

While not a comprehensive set, the functional variables we chose to work with covered the major annotation classes available at the time of analysis and are readily available to individuals executing an association study. The infrastructure and methods described here are easily updated to accommodate new variables as they become generally available. Table 1 lists the 57 variables that we used to construct the functional signatures of association.

The 48,889 SNPs included in the analysis were grouped into 2,093 case and an equal number of control blocks. We randomly selected 1,675 of these matched case–control pairs for development of the model (the ‘training set’) and left the remaining 418 pairs for model evaluation and comparison (the ‘evaluation set’). We modeled the probability that a SNP is associated given its functional data using a logistic regression model in which the association status of the individual SNPs in the case blocks is assumed unknown. The model assumes, however, that each case block contains *at least one* truly associated SNP and that each control block contains none; as in [7, 14], the identities of the associated SNPs are not specified as they are unknown (details of the model are provided in Section “Model”); a prior distribution on the overall fraction of associated SNPs was employed to encourage their number to be close to one on average. In this way, the model is free to identify multivariate patterns in the annotation data (‘signatures’) of individual SNPs that would be diluted by averaging the annotation measurements of all SNPs within a block.

While the assembled list of functional predictors includes measures that have been demonstrated to correlate with the association status of SNPs in the GC, it also includes a number of measures whose utility in this regard was unclear. Hence, we expected that only a fraction of the 57 variables would contribute to predicting phenotype association. We used shrinkage priors [37, 38] to reflect this belief and chose the normal–exponential–gamma (NEG) distribution for its ability to penalize heavily weakly determined predictors and to penalize weakly those that are well determined [39–41]. Further details of the model and the Markov chain Monte Carlo (MCMC) algorithm used for inference can be found in Methods.

Table 2 provides a summary of the coefficient estimates obtained for the binary regression of association status on the 57 functional annotation variables. Because all variables in the model were standardized, coefficients measure the difference in the log–odds of phenotype association attributed to an increase of one standard deviation in the covariate when the others remain fixed. The majority of variation (51%) in the functional scores as measured in the control block SNPs from the validation set, is due to the Broad promoter/enhancer ChIP–seq principal components (PCs) and nearly all (>97%) of this variation is due to PCs 1, 2, 4, 5, 6, 8 and 13. Each PC is a linear combination of the 75 summary statistics of the 25 assays. Additional file 1: Figure S3 depicts the loadings (weights in the linear combinations) for these PCs as they depend on histone modification, cell line and summary statistic. Grossly, PC 1 measures total signal strength across all cell lines and histone modifications, PC 2 contrasts average signal strength of the H3k4me3 assay against average signal strength of the remaining assays, while the remaining PCs each contrast signal in one subset of cell lines with that in another (PC 4: HMEC and NHEK *versus* GM12878 and K562; PC 5: GM12878, HMEC and NHEK *versus* HSMM, HUVEC and NHLF; PC 6: H1–hESC, HepG2 and HSMM *versus* GM12878 and HUVEC; PC 8: K562 *versus* H1-hESC; and PC 13: HepG2 *versus* H1–hESC). Additional file 1: Figure S1 and S2 depict the correlation matrix and a hierarchical clustering of the 75 summary statistics, respectively. For a given combination of cell line and histone modification, the signal strength (‘log(mean)’) and peak signal (‘log(maximum)’) measures tend to be highly correlated with one another but tend to have only modest correlation with the signal consistency measure (‘log(z)’); patterns of correlation across cell lines and histone modifications are more complex.

The sequence conservation PCs collectively make the next largest contribution, explaining 16% of variation in the functional scores; PCs 1 and 3 explain >97% of this. Each PC is a linear combination of the summary statistics of the 28 and 44 species PhyloP scores, each for all species and restricted to placental mammals. Additional file 1: Figure S6 graphs the loadings for these PCs as they depend on number of species, depth of alignment and summary statistic; Additional file 1: Figure S4 and S5 depict the correlation matrix and a hierarchical clustering of these variables, respectively. Briefly, PC 1 measures total signal strength across scores with the scores based on the 28–way alignment weighted more heavily than those based on the 44–way alignment, while PC 3 contrasts the 28–way with 44-way scores.

The CalTech RNA–seq PCs collectively explain 10% of the signature, with PCs 1, 2, 4 and 8 contributing 87% of this. Additional file 1: Figure S9 depicts the loadings for these PCs as they depend on cell line and summary statistic. PC 1 provides a measure of total signal strength across all cell lines, while the remaining PCs each contrast signal in one subset of cell lines with that in another (PC 2: H1-hESC and K562 *versus* GM12878 and NHEK; PC 4: GM12878 and H1-hESC *versus* K562, NHEK and HepG2; PC8: HUVEC *versus* NHEK). Additional file 1: Figure S7 and S8 depict the correlation matrix and a hierarchical clustering of the RNA–seq summary statistics, respectively. These reveal the largely complementary information provided by each of the cell lines as well as high correlations between the signal strength, peak signal and signal indicator measures of a given cell line. Here, as in the Broad ChIP–seq data, the signal consistency measure is only modestly correlated with the remaining signal measures.

RegulomeDB categorizes variants into a seven-level functional score based on a synthesis of regulatory data derived from ENCODE and other sources. Category 7 variants lack evidence of association; categories 4-6 show ‘minimal binding evidence’; categories 2 and 3 are ‘likely’ and ‘less likely to affect binding’, respectively; and category 1 variants are those ‘likely to affect binding and linked to expression of a gene target’. Its score explains the next largest fraction (8%) of variation. It is represented by six variables, each indicating a functional category; category 7 serves as the reference (‘baseline’). Categories 2, 4, 5 and 6 explain 99% of this variation suggesting that other annotation variables in the model better characterize the probability of phenotype association for variants in categories 1 and 3. This is especially notable for category 1 as it is the only variable included in the model explicitly tied to eQTL status. This failure of simple eQTL summaries to provide predictive power once the other annotations are included in the model is consistent with our findings from a preliminary analysis that included lymphoblastoid line–based eQTL summaries derived from the SCAN database (http://www.scandb.org; [10]), but not the RegulomeDB variables. We found no practical or statistically detectable difference between the accuracy, as measured by concordance index (see below and Methods), of out–of–sample predictions made using the model including all annotations and the model including all annotations except the eQTL variables.

Virtually all (96%) of the 2.5% contribution to variation made by the DGV variables is due to the copy number and inversion variables. Finally, at 1.7%, the dbSNP functional class variables are the only remaining that contribute more than 1% of the variation in functional scores. Virtually all (99%) of this contribution is due to the non–synonymous designation and the PolyPhen–2 probability contributes to the model’s ability to predict within this class of variants. This suggests that more direct measures of a variant’s involvement in transcriptional regulation, such as location in or near a DNAse I hypersensitive site (DHS), are better predictors of association than proximity to a nearby transcription start site (TSS) or other gene–centric landmark. Indeed, it has been shown [42] that GWAS variants are concentrated in DHSs, features captured by the Broad promoter/enhancer ChIP–seq principal components and the transcription factor ChIP–seq data. These are more flexible measures that allow the model to distinguish between promoters and more distant enhancers, inactive versus active promoters and local versus more distant enhancer–TSS interactions, each representing an important distinction [43, 44] relevant to predicting function.

We estimated the concordance indexes (equivalent to AUC, area under the ROC curve) for each model using the 418 matched case–control block pairs in the validation set as a tool for comparing the accuracy of their out–of–sample predictions. Table 3 provides the estimates of concordance and associated 95% interval estimates. While the concordance statistics are not discernibly different from one another, the best out–of–sample predictive ability is achieved using the model with the prior distribution having the strongest shrinkage properties, i.e. the ‘NEG3’ model.

### Application to an ovarian cancer multi–GWAS study

We compared the ranks assigned to a group of variants in a GWAS analysis when those ranks are calculated with and without the functional annotation data. Each in the group of variants is assumed to have known association status (associated/unassociated) with epithelial ovarian cancer, where this determination is based on confirmatory studies subsequent to the GWAS. The group is constructed as follows. There are currently 11 published, genome–wide significant loci for epithelial ovarian cancer. Nine of the 11 have come to light through analysis of genome–wide SNP data. These are rs3814113 [45], rs8170 [46], rs2072590, rs2665390, rs7814937, rs9303542 [47], rs11782652, rs7084454, rs757210 [48]. The remaining two (rs10069690 and rs2077606) were identified by candidate gene/pathway investigations [49, 50]; all 11 have been evaluated in very large confirmatory studies. We consider these to be ‘true positive’ variants. Our analysis of data from the large–scale follow–up study of GWAS candidates [48] allowed us to identify a group of 5,155 variants with ‘strong evidence’ (BF >10) in favor of the null model, i.e. the model of *no association*, based on Jeffreys’ scale of evidence [51] that we treat here as ‘true negatives’.

Table 4 summarizes the GWAS results for the true positive and true negative SNPs when the analysis is conducted with (subscript ‘A+F’) and without (subscript ‘A’) the functional signatures. Note that, among the true positives, the SNPs discovered through candidate studies are ranked substantially lower than those identified via GWAS. Indeed, the evidence in the association data for these variants is actually *against* association (both of their Bayes factors are less than 1.0). The GWAS SNPs are all ranked in the top 50,000 (of approximately 2.5 million) by the same measure and all have Bayes factors of at least 3 to 1 in favor of association.

Only two of the truly associated SNPs (rs11782652 and rs9303542) are ranked higher when the functional data are ignored than when they are used, however their respective changes in rank are small. The median (alt. average) rank of the truly associated SNPs was 5,272 (178,246) without and 3,532 (80,143) with the functional data included. If design constraints allowed only for followup of the top 5,000 variants, a larger fraction (7/11) would be discovered with addition of the functional data than without (5/11); with followup of 10,000 variants, these fractions become 8/11 and 7/11. In contrast, when the function data were included, the median (alt. average) rank among a set of ‘true negative’ SNPs increased from 181,116 (438,664) to 244,393 (517,810), while the number selected for followup fell from 244 to 204 under the 5K scenario and from 443 to 373 under the 10K scenario.

### Functional signatures of tag SNPs correlate with function of tagged SNPs

While a few of the functional variables, such as the function class designation ‘nonsynonymous’, incorporated in the signature are base pair specific, most map to contiguous regions of 100’s or 1000’s of base pairs. Hence, the functional signatures associated with nearby SNPs are correlated. Figure 3 is a plot of the correlation between the functional signatures of adjacent SNPs that passed QC in the ovarian cancer GWAS described above as a function of the distance, measured in base pairs (BPs), between the two variants. This correlation is greater than 0.72 (alt 0.68) for more than 80% (alt 97.5%) of adjacent variants, corresponding to those at distances of 1470 (alt 4376) BPs or less. Hence, while there are gains to be realized in doing so, it is not necessary to impute to and annotate at the highest possible density to realize an increase in power to detect association through the use of functional signatures, a fact we demonstrated empirically above. Note that typical BP distances between tagged (not genotyped or imputed) variants and their nearest tag will be on the order of one half of the distances reported here for adjacent tags.

## Conclusions

Using the GWAS Catalog as a sampling frame, we developed a model for the probability that a given polymorphism is associated with an observable human phenotype given a set of functional annotation variables and demonstrated that this model has the ability to predict a set of phenotype associated variants not used in the model building exercise. We demonstrate several methods for incorporating functional annotation signatures defined by this model and evaluated for a SNP’s annotation data as prior data and show through example that by doing so we improve the efficiency of GWAS scale analysis to identify true positive associations for follow–up study.

The approach we describe is computationally tractable and scalable to modern genome–wide analysis. Our use of penalized regression techniques to model the functional data and construct the function signatures allows us to consider a relatively large number of individual annotation variables while controlling for over–fitting. We evaluated sensitivity of the model’s out–of–sample predictions to choice of shrinkage prior and found that the most aggressive choice we examined, the model whose results are summarized herein, resulted in the best out–of–sample concordance estimates. Our approach can be expanded and adapted to incorporate more detailed annotation data such as was recently released by the ENCODE consortium [11] or generated experimentally in individual labs.

In principle, estimates of the parameters in the model for SNP association status given the functional data can be refined via Bayesian updating as part of an association analysis. This requires an additional layer of analysis that is feasible, but computationally demanding to implement on a genome–wide scale. However, the value of this will be limited in settings where there are few truly associated SNPs and/or the case–control data supporting associations are weak, i.e. the vast majority of applications. Here, Bayesian updating will yield estimates equivalent to those using the approach we describe above up to Monte Carlo simulation error. Indeed, we formally compared the two approaches using the ovarian cancer GWAS data and found little change in the median ranks of the true positive (3,532 *versus* 3,705) and true negative SNPs (244,393 *versus* 248,459). This suggests that the added value of Bayesian updating to the functional signatures will typically be limited.

Performance for our integrative approach likely depends on the depth, specificity and density of coverage of the available annotation data. The current study defines a starting point and benchmark in each of these dimensions. In particular, while the depth of annotation considered here is sufficient to noticeably improve inference for association, it is clear from recent ENCODE Project Consortium publications that it reflects only a small fraction of the complexity present in the regulatory landscape. Further, none of the annotation variables are tailored to the outcome phenotype; indeed, the ENCODE super–track data enter the model through linear combinations of the cell–line specific measurements, effectively averaging over cell type. In addition, the associations we observe in our study of the GWAS Catalog SNPs are those that tend to identify regulatory functions important to a range of phenotypes (i.e. those represented in the GWAS Catalog). This may explain, at least in part, the relatively modest nature of the improvements observed in our GWAS analysis and the failure of general summaries of eQTL status to contribute meaningfully to the functional signatures. Indeed, many regulatory processes are cell–type–specific [11, 12] and hence will be more informative for a given phenotype if measured in the appropriate context [14]. However, determining the relevant annotation data, assuming it exists, for a given phenotype requires domain expertise and more careful modeling to create functional signatures. While Bayesian updating did not improve inferences in the ovarian cancer GWAS example using the existing model, it may when the model is generalized to incorporate disease–specific annotations.

Finally, our analyses have been carried out entirely at the HapMap III density. Our approach succeeds at this density because the functional signatures of SNPs nearby, at distances typical of HapMap III, are highly correlated and hence the functional signatures of HapMap III polymorphisms essentially tag function of nearby polymorphisms not in the database. As coverage (genotype/imputation density) of the typical association study becomes more complete, the need to rely on correlations between functional signatures will diminish and their power to assist in identifying and localizing associations is expected to increase. Association analyses at the density of the 1000 Genomes Project database [52] are now possible and will likely become common. The specificity of the functional signatures should improve when reconstructed and applied at this density as we plan to do as we continue to develop this approach.

## Methods

### Association analysis given annotation data

Let **G** be an *n* by *p* matrix of SNP genotypes, *D* be an *n* by 1 vector of disease indicators where *D*_{
i
}=1 if individual *i* has the disease and *D*_{
i
}=0 otherwise, **X** be an *n* by *r* matrix of covariates used in the association model and **F** be a *p* by *m* matrix of SNP–level functional annotation data where *n* is the number of individuals, *p* is the number of SNPs, *r* is the number of covariates and *m* is the number of annotation variables. Finally, let *A* be a *p* by 1 vector of 0-1 indicators of the (unknown) association status of the variants, where *A*_{
s
}=1 if SNP *s* is associated with the phenotype of interest.

In what follows, we specify the likelihood for the association indicator given the association (**X**, *D*, *G*) and function (**F**) data. To this end, we let $Pr\left(A\right|D,\mathbf{X},\mathbf{G},\mathbf{F})\propto \prod _{s=1}^{p}Pr({A}_{s}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}D,\mathbf{X},{G}_{s},{F}_{s}).$ This relies on two assumptions: (1) that the *A*_{
s
}’s are conditionally independent given (**X**, *D*, **G**, **F**) and (2) that the *A*_{
s
}’s are conditionally independent of other variants (**G**_{−s}, **F**_{−s}) given (**X**, *D*, *G*_{
s
}, *F*_{
s
}). The notation **G**_{−s} indicates the matrix obtained by removing column *s* from **G**.

Further, we assume that the disease phenotype data are conditionally independent of the functional data for SNP *s* given its association status, covariate and genotype data and that its association status is conditionally independent of the covariate and genotype data given its functional data. The latter assumption may be violated, for example, if the genotype data *G*_{
s
} carries information about function (e.g. minor allele frequency) not included in **F**. Given this, the posterior odds of association of SNP *s* given its association and functional data can be written as the product of the (prior) odds of its association given its functional data times the (integrated) likelihood ratio or Bayes Factor (BF) of the phenotype given the SNP genotype and other covariate data, i.e.

We describe estimation of the association summary Bayes factor below.

Given the binary, logistic link model developed below for association status given the functional data and the parameters *α* and *β*, odds(*A*_{
s
}|*F*_{
s
}) = exp(*α* + *F*_{
s
}*β*) and hence, given *α* and *β*

where *F*_{
s
}*β* is the ‘functional signature’ of SNP *s*. Provided that estimates of *α* and *β* are available from an external analysis such as described in the next section, one can estimate Pr(*A*_{
s
} = 1|*D*,**X**,*G*_{
s
},*F*_{
s
}) by

where the *α*_{
i
} and *β*_{
i
} are samples from the posterior distribution from an analysis such as described in Section “Functional signatures of known associations”.

The above procedure depends on estimates of the marginal likelihoods,

of the association data for each SNP under H_{o} (*A*_{
s
}=0) and under H_{a} (*A*_{
s
}=1). Pr(*D* | **X**,*G*_{
s
},*A*_{
s
}=*a*,*θ*_{
a
}) is a logistic regression of the disease status indicator, *D*, on the covariates, **X**, and SNP genotype, *G*_{
s
}, and with coefficient vector *θ*_{1} under H_{a} and is a logistic regression *D* on **X** with coefficient vector *θ*_{0} under H_{o}. We place independent normal mean 0, standard deviation 10 prior distributions on all components of *θ*_{0} and *θ*_{1}, with exception of the coefficient of *G*_{
s
}, which is accorded a normal mean 0, standard deviation 0.25 prior distribution, as the majority of log–odds estimates cited in the GWAS catalog are smaller than 0.5 in absolute value. We estimate the SNP–specific marginal likelihoods under each hypothesis of association using the Laplace approximation [53] implemented in software described in Wilson et al. (2010) [54] and available from the authors. In cases where it is not convenient or possible to directly calculate Bayes factors, they can be approximated by the ABF [17, 18], allowing the method to be used in conjunction with Frequentist association testing software.

### Construction of functional signatures

In what follows, we detail the steps we took to assemble the case–control study of SNPs used to build and evaluate the models for a variant’s association status given the functional data. These comprised identification of a representative set of phenotype–associated SNPs to serve as ‘cases’ in the analysis and a matched set of ‘controls’ and the collection of a set of measurements related to function to serve as annotations for the variants. The process is depicted in Additional file 1: Figure S10 and described in detail below.

#### Sampling frame

Many genomewide genotyping arrays were designed to over–sample variants with characteristics related to their ability to explain phenotypic variation, such as proximity to coding regions, type of variant (e.g. missense) and minor allele frequency (MAF). Hence, a comparison of case SNPs identified using such assays to control SNPs drawn randomly from the HapMap or dbSNP, for example, may lead to spurious associations between assay design variables and the SNP’s association with a human phenotype. In order to avoid confounding due to the selection method employed in the design of the genomewide genotyping platform, we constructed a sampling frame of SNPs by combining SNPs on the Affymetrix GeneChip Human Mapping 500K Array Set and the Illumina HumanHap550 Genotyping BeadChip, as this generation of arrays and their predecessors cover most of the reported findings in the GWAS Catalog attributed to an Affymetrix or Illumina product and these products were the most commonly used. We labeled SNPs in the sampling frame according to whether they appeared only on the Affymetrix list, only on the Illumina list or on both and confined attention to those variants appearing in both the Genome Browser’s dbSNP 130 and HapMap Release 27 tables (see Additional file 1: Table S1) and having a MAF estimated in HapMap’s CEU sample to be 0.05 or larger. The sampling frame comprises 803,991 SNPs with 421,072 unique to Illumina, 305,672 to Affymetrix and the remaining 77,247 common to both.

#### Case and control selection

The GWAS Catalog is subject to constant update and versions are available from several locations. We downloaded the GWAS Catalog from the Genome Browser (time stamp and location in Additional file 1: Table S1). We confined attention to non–CNV variants in the GWAS Catalog discovered by association studies utilizing an Affymetrix and/or an Illumina genomewide array and present in the sampling frame. We randomly chose a single representative of each set of SNPs appearing multiple times in the GWAS Catalog or sharing one or more ‘LD partners’ (see below). This left 2093 unique case SNPs, 1306 of which were unique to Illumina, 403 unique to Affymetrix and 384 in common. We randomly matched one control SNP drawn from the sampling frame to each case SNP on chromosome, platform (Illumina only, Affymetrix only, on both) and MAF rounded to the nearest 0.02. We excluded SNPs in the sampling frame in LD (R ^{2}>0) with one or more case SNPs as reported in the HapMap Release #27 LD files (see Additional file 1: Table S1) or sharing an LD partner with another control SNP.

#### LD Partner identification

SNPs in the GWAS Catalog are arguably more likely to tag the variant that is directly associated with the phenotype than to be that variant [8]. For this reason, we identified and annotated each case and control SNP’s ‘LD partners’ [8]. We defined LD partners as those SNPs with R ^{2}≥0.8 with a case or control SNP as reported in the HapMap Release #27 LD files. Hindorff et al. (2009) [8] chose a threshold of 0.9 but noted that their results were nearly the same when using thresholds of 1.0 and 0.8. We identified 20,924 LD partners of the case SNPs and 23,779 LD partners of the control SNPs.

#### Annotation data

All data drawn from the UCSC Genome Browser [55] used the “March 2006 (NCBI36/hg18)” assembly. Additional file 1: Table S1 provides locations, revision dates and references for each of the annotation files referred to below. In what follows, we describe each class of annotation variable, its source and the parameterization we use for it in the models we fit.

Variants described in dbSNP [56] release 130 are classified according to their predicted function as determined by their locations relative to known genes in the reference assembly. Variants that fall within the coding sequence of a known gene are further described as ‘non–synonymous’ if they result in a change to the associated amino acid or ‘synonymous’ if they do not. A variant may have several such designations; for purposes of our analysis, we confine attention to each variant’s primary designation. Those observed among the SNPs included in our analysis are ‘unknown’, ‘coding–synon’, ‘intron’, ‘near–gene–3’, ‘near–gene–5’, ‘nonsense’, ‘missense’, ‘untranslated–3’, and ‘untranslated–5’. Given the small number (*n*=5) of nonsense variants, we created a ‘coding–nonsynonymous’ designation by combining the ‘missense’ and ‘nonsense’ categories; similarly, we combined the ‘untranslated–3’, and ‘untranslated–5’ designations into the category ‘untranslated’.

We investigated the PhyloP evolutionary conservation scores [32] applied to 28– and 44–species alignments, and to those alignments restricted to the placental mammals and human, for their ability to predict the disease association status of common variants. Each of the four relevant Genome Browser tables provides the sum of the score, its sum of squares and the number of nucleotides that contribute to these statistics within ranges of contiguous nucleotides. We calculated a standardized score (mean divided by standard deviation) for each range and alignment and assigned these values to SNPs within the range. The PhyloP conservation scores exhibited pairwise correlations of up to 0.984. The top four, six, and nine PCs explain 90%, 96%, and 99% of the variability, respectively, in the 12 variables. The top four PCs were used. Additional file 1: Figure S4 and S5, respectively, depict a heat map of the correlation matrix and a dendrogram depicting an average linkage, one–minus–absolute–correlation–distance clustering of the 12 PhyloP variables used to construct the PCs.

The Database of Genomic Variants (DGV) [33, 34] is a compilation of reported genomic alterations spanning more than 1000 bases (>100 in the case of indels) observed in healthy subjects. We formed three variables indicating, respectively, whether (=1) or not (=0) each SNP falls in a region of copy number variation, a region containing insertions or deletions or a region with inversions.

The ENCODE Project [19, 20] is an ambitious project to identify and characterize the various functional elements present in the human genome sequence and to facilitate public access to the data it generates; its overarching objective is to improve our knowledge of human disease processes by providing a more comprehensive understanding of human molecular biology. Application of ENCODE functional annotation data to the design, analysis and interpretation of GWAS studies is one way in which ENCODE data can quickly be put to use to shed light on human disease processes [20]. To this end, we examine the utility of the recently released ENCODE regulation super–track data available from, and displayed on, the Genome Browser for *a priori* prediction of functional, disease-associated variants. In particular, we include variables (see below) summarizing: transcription levels assayed in six cell lines by RNA–seq [21, 22] and represented as normalized read density; density of sequence tags for H3K4Me1 (Histone H3 Lysine 4 monomethylation) associated with enhancer and promoter activity measured in eight cell lines, similarly coded measures of promoter– and enhancer–associated H3K27Ac (Histone H3 Lysine 27 acetylation) in eight cell lines and of promoter–associated H3K4Me3 (Histone H3 Lysine 4 tri–methylation) in nine cell lines [23, 24]; evidence for the variant falling within a DNaseI hypersensitivity cluster [25, 26]; and the evidence for transcription factor binding measured via ChIP–seq [27–31].

The Broad ChIP–seq, Caltech RNA–seq, and PhyloP signal tracks are summarized at the level of genomic bins. The ChIP–seq signals are measured within 118,084 contiguous bins of 25,600 bases apiece. The RNA–seq and PhyloP signals are measured in sets of non–overlapping, non–uniform bins. Bins are indexed according to the hierarchical scheme described in [57]. The ENCODE database provides basic summary statistics of the signal densities measured within each bin. These are: the number of valid data points, the minimum, the range, the sum and the sum–of–squares. Sums are non–negative and range across six orders of magnitude; in bins where the sum was zero, we set it to one. From these summaries we derived the mean, the ratio of the mean to the standard deviation (‘z’) and the maximum for each bin. We interpreted these as bin–level measures of overall signal strength, signal consistency and peak signal, respectively. We transformed each of these measures to the log scale prior to analysis. Finally, we include an indicator variable for when a variant falls in a Caltech RNA–seq bin with zero signal.

The three Broad types (signal enrichment for H3K4Me1, H3K27Ac and H3k4Me3 histone modifications) comprise data on eight, eight and nine cell lines, respectively. We found significant pairwise correlations among the 75 variables (25 each of log(mean), log(maximum), and log(z)), ranging as high as 0.949, and therefore conducted a principal components analysis to identify the linear combinations, i.e. principal components (PCs), that explain most of the variability in the data. The top 18, 27 and 44 PCs explain 90%, 95% and 99% of variability, respectively, in the 75 measures. Finally, we mapped each SNP to the appropriate Broad bin and annotated each with the top 18 PCs for purposes of the analysis. Additional file 1: Figure S1 and S2, respectively, depict a heat map of the correlation matrix and a dendrogram depicting an average linkage, one–minus–absolute–correlation–distance clustering of the 75 Broad ChIP–seq variables used to construct the PCs.

The Caltech tables comprise RNA–seq raw signal enrichment data on six cell lines. Pairwise correlations among the 24 variables ranged as high as 0.970. The top 11, 12, and 16 PCs explain 92%, 95%, and 99% of the variability. The top 11 PCs were used in the analysis. Additional file 1: Figure S7 and S8, respectively, depict a heat map of the correlation matrix and a dendrogram depicting an average linkage, one–minus–absolute–correlation–distance clustering of the 24 Caltech RNA–seq variables used to construct the PCs.

The transcription factor ChIP–seq data are summarized by scores, ranging from 6 to 1000, measuring strength of evidence for binding within specified, sometimes overlapping, chromosomal bins (‘clusters’). We summarize these data as they apply to each SNP using two variables: the number of clusters it intersects with (‘TFBSfreq’) and the average loge(score) (‘logTFBS’) assigned to those clusters (coded as 0 if the SNP does not intersect with a cluster). Similarly, the DNaseI hypersensitivity data are summarized by scores, ranging from 16 to 1000, within specified chromosomal bins (‘clusters’). We summarize these data as they apply to each SNP using (1) an indicator for the variant falling within a clusters and (2) the loge(score) assigned to that cluster.

The Open REGulatory ANNOtation database (ORegAnno) [58, 59] is a curated collection of regulatory elements. The Genome Browser ORegAnno table provides start and stop coordinates and annotations for elements in the database. For purposes of our analysis, we summarize these data with an indicator variable for whether or not a variant falls within an ORegAnno regulatory region.

PolyPhen–2 (PPh2) [35] assigns to nonsynonymous SNPs a probability of being damaging based on the sequence, phylogenetic and structural information characterizing the amino acid substitution.

RegulomeDB [36] annotates SNPs with known and predicted regulatory elements in the intergenic regions of the human genome. Each SNP is assigned one of seven categories based on its likelihood of affecting protein binding.

We collected the annotation variables described above for each SNP in the HapMap Release 27 tables into an annotations data base that is available on our web site. In what follows, we refer to the collection of these variables as *F*; when *F* is indexed we are referring to the functional data for a given SNP (*F*_{
s
}) or, as just below, to a given SNP, *s*, in a given LD block, *b* (*F*_{
s
b
}).

#### Model

For purposes of the analysis, we assumed that blocks and the SNPs within the blocks were independent conditional on the functional data. We modeled the probability that a SNP *s* in block *b* was an associated SNP, *π*_{
s
b
}, given the functional data for that SNP, *F*_{
s
b
}, using the logistic regression model logit(*π*_{
s
b
})=*α*+*F*_{
s
b
}*β*. We assumed that there was at least one associated SNP in each case block and that there were no associated SNPs in control blocks. Hence, each case block contributed the factor $\left[1-\prod _{s=1}^{{n}_{b}}(1-{\pi}_{\mathit{\text{sb}}})\right]$ to the likelihood, while each control block contributed $\prod _{s=1}^{{n}_{b}}(1-{\pi}_{\mathit{\text{sb}}}).$ As a result, we expected *at least* 1,675 of the 48,888 SNPs in the training set to be phenotype associated. This corresponds to *α*=−3.34 (columns of *F* are centered); if 10% (alt 20%) of case blocks contain two phenotype associated SNPs, *α*=−3.24 (alt -3.15). Hence the normal mean -3.24, standard deviation 0.1 prior distribution we placed on *α* is consistent with our expectation that there were fewer than 2,178 (=1.3×1,675) true phenotype associated variants among the case blocks.

Our specification of the prior distribution on *β* was guided by the observation that, in the normal model with the normal–exponential–gamma (NEG) distribution as prior on the mean and the variance known, the posterior mode is identically zero when the maximum likelihood estimator (MLE) is in a neighborhood around zero, but rapidly converges to the MLE as the MLE diverges from zero (this setting approximates the more general one in which the NEG distribution is used as the prior distribution for a parameter whose likelihood is approximately normal). The NEG distribution is specified by its shape and scale parameters and the width of the threshold neighborhood is a function of these parameters. For purposes of our analysis, we chose parameter values for which no more than 10% of the coefficients are outside of the threshold region with probability 0.90, *a priori*. We placed independent NEG prior distributions on the components of *β*; in addition, we also considered the model with independent standard normal distributions on the components of *β*. Inference for each of these models was carried out using the training set and were evaluated using the evaluation set.

We used random–walk Markov chain Monte Carlo (MCMC) algorithms [60, 61] to estimate summaries of the posterior distribution under each of the models. We started 10 independent chains per model from starting points drawn from the prior distribution. In each case, step sizes were adjusted so that parameter level acceptance ratios fell between 0.3 and 0.5 during an initial, ‘burn–in’ set of iterations not used for inference. We fixed the step sizes and ran the 10 chains from their leave–off positions for an additional 50,000 iterations per chain. Inspection of trace plots, as well as computation of the Gelman–Rubin [62], Heidelberger–Welch [63], Raftery–Lewis [64], and Geweke [65] diagnostics implemented in the CODA package [66] in R [67], indicated satisfactory convergence. We thinned the 10 chains by 1,000 and combined them to produce a sample of 500 coefficient vectors.

We used the concordance index (CI) to measure the out–of–sample predictive accuracy of the model. We calculated the CI as the fraction of matched pairs in the ‘evaluation set’ in which the average probability of association given the functional data over the *n*_{1} SNPs in the case block (‘ *b*1’) was larger than the corresponding average over the *n*_{0} SNPs in the matched control block (‘ *b*0’); i.e. if

where we estimated Pr(*A*_{s,b n} | *F*_{s,b n}) by

where the *α*_{
i
} and *β*_{
i
} are MCMC samples saved from analysis of the training data.

### Evaluation

We carried out a genome–wide association analysis of serous ovarian cancer using the methods described above. The data for this analysis were drawn from GWAS studies conducted in the US [68] and the UK [45]. The genotype data from these studies were combined and imputed to HapMap III density, resulting in a data set comprising analyzable genotypes at 2,500,004 SNPs for 7,272 subjects of European ancestry. The association analysis was confined to the 2,004 cases with advanced stage serous ovarian cancer and the 3,272 available controls and was adjusted for study site and the first two principal components of the sample genotypes. We calculated Bayes factors (BFs) as described above and set the prior probability of association to be 0.00001 when estimating posterior probabilities; ranks are invariant to this choice.

Several large scale studies conducted to follow up promising associations from these GWAS have identified the eleven genome–wide significant loci listed in Table 4. We treat these as established or ‘true positive’ associations for purposes of evaluating the various association measures. In addition, we identified a set of likely unassociated, ‘true negative’ SNPs from among 22,254 GWAS followup SNPs placed on the iCOGS chip [48]. This analysis included 8,344 cases with advanced stage serous ovarian cancer and 22,913 controls of European ancestry and was adjusted for study site and the first five European ancestry principal components. We identified a subset of 5,155 SNPs with strong evidence *against association* (defined as BF <0.1 on Jeffreys’ scale of evidence [51]) to serve as the ‘true negatives’.

We compared the rankings of these two sets of SNPs in the original GWAS analysis when association was measured using genotype data only to those obtained with incorporation of the functional signatures. We compared the procedures based on their power to identify the truly associated variants for follow–up assuming budgets allowing for evaluation of the top 5,000 or 10,000 SNPs.

In most association studies, genotypes are determined, through a combination of genotyping and imputation, for only a subset of the universe of variants. In this setting, it is standard to rely on correlations between genotyped variants (‘tags’) and those that are ‘tagged’ (not genotyped) to identify and localize associations. Likewise, the utility of functional signatures in a typical study will depend on the degree to which they reflect the likelihood of function of both the tag for which it is calculated and for the set of variants it tags. We evaluated correlations between functional signatures, defined as (*F*_{
s
}*β*), for adjacent pairs of SNPs included in the ovarian cancer GWAS analysis. We identified the quantile of each adjacent variant pair in the overall distribution of distances measured in base pairs (BPs). For purposes of this analysis, we defined quantiles in increments of 0.025, i.e. with each containing 2.5% of the mass of the distance distribution. We estimated the Pearson correlation between the functional signatures of the adjacent SNP pairs within each quantile and plotted these estimates against BP distance, locating the estimates at the midpoints of the quantile bins.

## References

- 1.
Manolio TA: Genomewide association studies and assessment of the risk of disease. N Engl J Med. 2010, 363 (2): 166-176. 10.1056/NEJMra0905980. doi:10.1056/NEJMra0905980. PMID:20647212. http://www.nejm.org/doi/pdf/10.1056/NEJMra0905980,

- 2.
Freedman ML, Monteiro ANA, Gayther SA, Coetzee GA, Risch A, Plass C, Casey G, Biasi MD, Carlson C, Duggan D, James M, Liu P, Tichelaar JW, Vikis HG, You M, Mills IG: Principles for the post–GWAS functional characterization of cancer risk loci. Nat Genet. 2011, 43 (6): 513-518. 10.1038/ng.840. doi:10.1038/ng.840,

- 3.
Witte JS, Greenland S, Haile RW, Bird CL: Hierarchical regression analysis applied to a study of multiple dietary exposures and breast cancer. Epidemiology. 1994, 5 (6): 612-621. 10.1097/00001648-199411000-00009.

- 4.
Aragaki CC, Greenland S, Probst-Hensch N, Haile RW: Hierarchical modeling of gene-environment interactions: estimating NAT2 genotype–specific dietary effects on adenomatous polyps. Cancer Epidemiol Biomarkers & Prev. 1997, 6 (5): 307-314. http://cebp.aacrjournals.org/content/6/5/307.full.pdf+html,

- 5.
Hung RJ, Brennan P, Malaveille C, Porru S, Donato F, Boffetta P, Witte JS: Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiol Biomarkers & Prev. 2004, 13 (6): 1013-1021.

- 6.
Hung RJ, Baragatti M, Thomas D, McKay J, Szeszenia-Dabrowska N, Zaridze D, Lissowska J, Rudnai P, Fabianova E, Mates D, Foretova L, Janout V, Bencko V, Chabrier A, Moullan N, Canzian F, Hall J, Boffetta P, Brennan P: Inherited predisposition of lung cancer: a hierarchical modeling approach to DNA repair and cell cycle control pathways. Cancer Epidemiol Biomarkers & Prev. 2007, 16 (12): 2736-2744. 10.1158/1055-9965.EPI-07-0494.

- 7.
Veyrieras JB, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard JK: High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 2008, 4 (10): 1000214-10.1371/journal.pgen.1000214. doi:10.1371/journal.pgen.1000214,

- 8.
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Nat Acad Sci. 2009, 106 (23): 9362-9367. 10.1073/pnas.0903103106. doi:10.1073/pnas.0903103106. http://www.pnas.org/content/106/23/9362.full.pdf+html,

- 9.
Lee SI, Dudley AM, Drubin D, Silver PA, Krogan NJ, Peér D, Koller D: Learning a prior on regulatory potential from eQTL data. PLoS Genet. 2009, 5 (1): 1000358-10.1371/journal.pgen.1000358. doi:10.1371/journal.pgen.1000358,

- 10.
Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, Cox NJ: Trait–associated SNPs are more likely to be eQTLs: Annotation to enhance discovery from GWAS. PLoS Genet. 2010, 6 (4): 1000888-10.1371/journal.pgen.1000888. doi:10.1371/journal.pgen.1000888,

- 11.
An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489: 57-74. 10.1038/nature11247. doi:10.1038/nature11247,

- 12.
Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M: Linking disease associations with regulatory information in the human genome. Genome Res. 2012, 22 (9): 1748-1759. 10.1101/gr.136127.111. doi:10.1101/gr.136127.111. http://genome.cshlp.org/content/22/9/1748.full.pdf+html,

- 13.
Carbonetto P, Stephens M: Integrated enrichment analysis of variants and pathways in genome-wide association studies indicates central role for IL-2 signaling genes in type 1 diabetes, and cytokine signaling genes in Crohn’s disease. PLoS Genet. 2013, 9 (10): 1003770-10.1371/journal.pgen.1003770. doi:10.1371/journal.pgen.1003770,

- 14.
Pickrell JK: Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. 2014, arXiv 1311.4843 [q-bio.GN],

- 15.
Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genomewide association studies by imputation of genotypes. Nat Genet. 2007, 39: 906-913. 10.1038/ng2088.

- 16.
Servin B, Stephens M: Imputation–based analysis of association studies: Candidate regions and quantitative traits. PLoS Genet. 2007, 3 (7): 114-10.1371/journal.pgen.0030114. doi:10.1371/journal.pgen.0030114,

- 17.
Wakefield J: A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet. 2007, 81 (2): 208-227. 10.1086/519024. doi:10.1086/519024,

- 18.
Wakefield J: Bayes factors for genome–wide association studies: comparison with p-values. Genet Epidemiol. 2009, 33 (1): 79-86. 10.1002/gepi.20359. doi:10.1002/gepi.20359,

- 19.
The ENCODE Project Consortium: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. 10.1038/nature05874.

- 20.
The ENCODE Project Consortium: A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology. 2011, 9 (4): 1001046-10.1371/journal.pbio.1001046. doi:10.1371/journal.pbio.1001046,

- 21.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226. doi:10.1038/nmeth.1226,

- 22.
Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory–efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): 25-10.1186/gb-2009-10-3-r25.

- 23.
Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, Jaenisch R, Wagschal A, Feil R, Schreiber SL, Lander ES: A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006, 125 (2): 315-326. 10.1016/j.cell.2006.02.041.

- 24.
Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim T-K, Koche RP, Lee W, Mendenhall E, O’Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE: Genome–wide maps of chromatin state in pluripotent and lineage–committed cells. Nature. 2007, 448 (7153): 553-560. 10.1038/nature06008. doi:10.1038/nature06008,

- 25.
Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Cao H, Yu M, Rosenzweig E, Goldy J, Haydock A, Weaver M, Shafer A, Lee K, Neri F, Humbert R, Singer MA, Richmond TA, Dorschner MO, McArthur M, Hawrylycz M, Green RD, Navas PA, Noble WS, Stamatoyannopoulos JA: Genome–scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Methods. 2006, 3 (7): 511-518. 10.1038/nmeth890. doi:10.1038/nmeth890,

- 26.
Sabo PJ, Hawrylycz M, Wallace JC, Humbert R, Yu M, Shafer A, Kawamoto J, Hall R, Mack J, Dorschner MO, McArthur M, Stamatoyannopoulos JA: Discovery of functional noncoding elements by digital analysis of chromatin structure. Proc Nat Acad Sci. 2004, 101 (48): 16837-6842. 10.1073/pnas.0407387101.

- 27.
Euskirchen G, Royce TE, Bertone P, Martone R, Rinn JL, Nelson FK, Sayward F, Luscombe NM, Miller P, Gerstein M, Weissman S, Snyder M: CREB binds to multiple loci on human chromosome 22. Mol Cell Biol. 2004, 24 (9): 3804-3814. 10.1128/MCB.24.9.3804-3814.2004.

- 28.
Euskirchen GM, Rozowsky JS, Wei CL, Lee WH, Zhang ZD, Hartman S, Emanuelsson O, Stolc V, Weissman S, Gerstein MB, Ruan Y, Snyder M: Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array– and sequencing–based technologies. Genome Res. 2007, 17 (6): 898-909. 10.1101/gr.5583007.

- 29.
Martone R, Euskirchen G, Bertone P, Hartman S, Royce TE, Luscombe NM, Rinn JL, Nelson FK, Miller P, Gerstein M, Weissman S, Snyder M: Distribution of nf–

*κ*b–binding sites across human chromosome 22. Proc Nat Acad Sci USA. 2003, 100 (21): 12247-12252. 10.1073/pnas.2135255100. - 30.
Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith OL, He A, Marra M, Snyder M, Jones S: Genome–wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007, 4 (8): 651-657. 10.1038/nmeth1068. doi:10.1038/nmeth1068,

- 31.
Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIP–seq experiments relative to controls. Nat Biotech. 2009, 27 (1): 66-75. 10.1038/nbt.1518. doi:10.1038/nbt.1518,

- 32.
Siepel A, Pollard K, Haussler D: New methods for detecting lineage–specific selection. Res Computat Mol Biol. 2006, 3909: 190-205. 10.1007/11732990_17.

- 33.
Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large–scale variation in the human genome. Nat Genet. 2004, 36 (9): 949-951. 10.1038/ng1416.

- 34.
Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW: Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet & Genome Res. 2006, 115 (3/4): 205-214.

- 35.
Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nat Methods. 2010, 7: 248-249. 10.1038/nmeth0410-248.

- 36.
Boyle A, Hong E, Hariharan M, Cheng Y, Schaub M, Kasowski M, Karczewski K, Park J, Hitz B, Weng S, Cherry J, Snyder M: Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 2012, 22 (9): 1790-1797. 10.1101/gr.137323.112. doi:10.1101/gr.137323.112,

- 37.
Hans CM: Bayesian lasso regression. Biometrika. 2009, 96: 835-845. 10.1093/biomet/asp047.

- 38.
Richardson S, Bottolo L, Rosenthal JS: Bayesian models for sparse regression analysis of high dimensional data. Bayesian Statistics 9. Edited by Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM. 2011, Oxford: Oxford University Press,

- 39.
Griffin JE, Brown PJ: Bayesian adaptive lassos with non–convex penalization. Technical report, University of Kent, 2007

- 40.
Hoggart CJ, Whittaker JC, Balding DJ, De Iorio M: Simultaneous analysis of all SNPs in genome–wide and re–sequencing association studies. PLoS Genet. 2008, 4 (7): 1000130-10.1371/journal.pgen.1000130. doi:10.1371/journal.pgen.1000130,

- 41.
Griffin JE, Brown PJ: Inference with normal–gamma prior distributions in regression problems. Bayesian Anal. 2010, 5 (1): 171-188. 10.1214/10-BA507.

- 42.
Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom R, Qu H, Brody J, Shafer A, Neri F, Lee K, Kutyavin T, Stehling-Sun S, Johnson AK, Canfield TK, Giste E, Diegel M, Bates D, Hansen RS, Neph S, Sabo PJ, Heimfeld S, Raubitschek A, Ziegler S, Cotsapas C, Sotoodehnia N, Glass I, Sunyaev SR, et al: Systematic localization of common disease–associated variation in regulatory DNA. Science. 2012, 337 (6099): 1190-1195. 10.1126/science.1222794. doi:10.1126/science.1222794. http://www.sciencemag.org/content/337/6099/1190.full.pdf,

- 43.
Jin F, Li Y, Dixon JR, Selvaraj S, Ye Z, Lee AY, Yen C-A, Schmitt AD, Espinoza CA, Ren B: A high–resolution map of the three–dimensional chromatin interactome in human cells. Nature. 2013, 503: 290-294. doi:10.1038/nature12644,

- 44.
Sanyal A, Lajoie BR, Jain G, Dekker J: The long–range interaction landscape of gene promoters. Nature. 2012, 489: 109-113. 10.1038/nature11279. doi:10.1038/nature11279,

- 45.
Song H, Ramus SJ, Tyrer J, Bolton KL, Gentry-Maharaj A, Wozniak E, Anton-Culver H, Chang-Claude J, Cramer DW, DiCioccio R, Dork T, Goode EL, Goodman MT, Schildkraut JM, Sellers T, Baglietto L, Beckmann MW, Beesley J, Blaakaer J, Carney ME, Chanock S, Chen Z, Cunningham JM, Dicks E, Doherty JA, Durst M, Ekici AB, Fenstermacher D, Fridley BL, Giles G, et al: A genome-wide association study identifies a new ovarian cancer susceptibility locus on 9p22.2. Nature Genet. 2009, 42: 996-1000. doi:10.1038/ng.424,

- 46.
Bolton KL, Tyrer J, Song H, Ramus SJ, Notaridou M, Jones C, Sher T, Gentry-Maharaj A, Wozniak E, Tsai Y-Y, Weidhaas J, Paik D, Van Den Berg DJ, Stram DO, Pearce CL, Wu AH, Brewster W, Anton-Culver H, Ziogas A, Narod SA, Levine DA, Kaye SB, Brown R, Paul J, Flanagan J, Sieh W, McGuire V, Whittemore AS, Campbell I, Gore ME, et al: Common variants at 19p13 are associated with susceptibility to ovarian cancer. Nat Genet. 2010, 42: 880-884. 10.1038/ng.666.

- 47.
Goode EL, Chenevix-Trench G, Song H, Ramus SJ, Notaridou M, Lawrenson K, Widschwendter M, Vierkant RA, Larson MC, Krüger-Kjaer S, Birrer MJ, Berchuck A, Schildkraut J, Tomlinson I, Kiemeney LA, Cook LS, Gronwald J, Garcia-Closas M, Gore ME, Campbell I, Whittemore AS, Sutphen R, Phelan C, Anton-Culver H, Pearce CL, Lambrechts D, Rossing MA, Chang-Claude J, Moysich KB, Goodman MT, et al: A genome-wide association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat Genet. 2010, 42: 874-879. 10.1038/ng.668. doi:10.1038/ng.668,

- 48.
Pharoah PDP, Tsai Y-Y, Ramus SJ, Phelan CM, Goode EL, Lawrenson K, Buckley M, Fridley BL, Tyrer JP, Shen H, Weber R, Karevan R, Larson MC, Song H, Tessier DC, Bacot F, Vincent D, Cunningham JM, Dennis J, Dicks E, Aben KK, Anton-Culver H, Antonenkova N, Armasu SM, Baglietto L, Bandera EV, Beckmann MW, Birrer MJ, Bloom G, Bogdanova N, et al: GWAS meta–analysis and replication identifies three novel susceptibility loci for ovarian cancer. Nat Genet. 2013, 45: 362-370. 10.1038/ng.2564. doi:10.1038/ng.2564,

- 49.
Bojesen SE, Pooley KA, Johnatty SE, Beesley J, Michailidou K, Tyrer JP, Edwards SL, Pickett HA, Shen HC, Smart CE, Hillman KM, Mai PL, Lawrenson K, Stutz MD, Lu Y, Karevan R, Woods N, Johnston RL, French JD, Chen X, Weischer M, Nielsen SF, Maranian MJ, Ghoussaini M, Ahmed S, Baynes C, Bolla MK, Wang Q, Dennis J, McGuffog L: Multiple independent variants at the TERT locus are associated with telomere length and risks of breast and ovarian cancer. Nat Genet. 2013, 45: 371-384. 10.1038/ng.2566. doi:10.1038/ng.2566,

- 50.
Permuth-Wey J, Lawrenson K, Shen HC, Velkova A, Tyrer JP, Chen Z, Lin H-Y, Ann Chen Y, Tsai Y-Y, Qu X, Ramus SJ, Karevan R, Lee J, Lee N, Larson MC, Aben KK, Anton-Culver H, Antonenkova N, Antoniou AC, Armasu SM, Bacot F, Baglietto L, Bandera EV, Barnholtz-Sloan J, Beckmann MW, Birrer MJ, Bloom G, Bogdanova N, Brinton LA, Brooks-Wilson A, et al: Identification and molecular characterization of a new ovarian cancer susceptibility locus at 17q21.31. Nat Commun. 2013, 4: 1627-doi:10.1038/ncomms2613,

- 51.
Jeffreys H: Theory of Probability. 1961, Oxford: Oxford Univ. Press

- 52.
The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491: 56-65. 10.1038/nature11632.

- 53.
Kass RE, Raftery AE: Bayes factors. J Am Stat Assoc. 1995, 90: 773-795. 10.1080/01621459.1995.10476572.

- 54.
Wilson MA, Iversen ES, Clyde MA, Schmidler SC, Schildkraut JM: Supplement to “Bayesian model search and multilevel inference for SNP association studies”. Ann Appl Stat. 2010, 4 (3): 1342-1364. 10.1214/09-AOAS322.

- 55.
Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M, Smith KE, Rosenbloom KR, Raney BJ, Pohl A, Pheasant M, Meyer LR, Learned K, Hsu F, Hillman-Jackson J, Harte RA, Giardine B, Dreszer TR, Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC genome browser database: update 2010. Nucleic Acids Res. 2010, 38 (suppl 1): 613-619. doi:10.1093/nar/gkp939. http://0-nar.oxfordjournals.org.brum.beds.ac.uk/content/38/suppl_1/D613.full.pdf+html,

- 56.
Sherry S, Ward M, Kholodov M, Baker J, Phan L, Smigielski E, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.

- 57.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006. 10.1101/gr.229102. Article published online before print in May 2002.

- 58.
Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, Pleasance ED, Prychyna Y, Zhang X, Jones SJM: ORegAnno: an open access database and curation system for literature–derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics. 2006, 22 (5): 637-640. 10.1093/bioinformatics/btk027.

- 59.
Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, Aerts S, Mahony S, Sleumer MC, Bilenky M, Haeussler M, Griffith M, Gallo SM, Giardine B, Hooghe B, Blanco E, Ticoll A, Lithwick S, Portales–Casamar E, Donaldson IJ, Robertson G, Wadelius C, Vlieghe D, Halfon MS, Wasserman W, Hardison R, Bergman CM, Jones SJM, Van Loo, P: ORegAnno: an open–access community–driven resource for regulatory annotation. Nucleic Acids Res. 2008, 36 (suppl 1): 107-113.

- 60.
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: Equations of state calculations by fast computing machines. J Chem Phys. 1953, 21: 1087-1091. 10.1063/1.1699114.

- 61.
Gilks WR, Richardson S, Spiegelhalter DJ: Introducing Markov chain Monte Carlo. Markov Chain Monte Carlo in Practice. Edited by Gilks WR, Richardson S, Spiegelhalter DJ. 1996, London: Chapman and Hall,

- 62.
Gelman A, Rubin DB: Inference from iterative simulation using multiple sequences (with discussion). Stat Sci. 1992, 7: 457-511. 10.1214/ss/1177011136.

- 63.
Heidelberger P, Welch P: Simulation run length control in the presence of an initial transient. Oper Res. 1983, 31: 1109-1144. 10.1287/opre.31.6.1109.

- 64.
Raftery AE, Lewis SM: Implementing MCMC. Markov Chain Monte Carlo in Practice. Edited by Gilks WR, Richardson S, Spiegelhalter DJ. 1996, London: Chapman and Hall, 115-127.

- 65.
Geweke J: Evaluating the accuracy of sampling–based approaches to calculating posterior moments. Bayesian Statistics 4. Edited by Bernado J, Erger J, AP D, Smith A. 1992, Oxford, UK: Clarendon Press,

- 66.
Plummer M, Best N, Cowles K, Vines K: CODA: Output Analysis and Diagnostics for MCMC. 2010. R package version 0.13-5. http://cran.r-project.org/package=coda,

- 67.
Ihaka R, Gentleman R: R: A language for data analysis and graphics. J Comput Graph Stat. 1996, 5 (3): 299-314.

- 68.
Permuth-Wey J, Kim D, Tsai Y-Y, Lin H-Y, Chen YA, Barnholtz-Sloan J, Birrer MJ, Bloom G, Chanock SJ, Chen Z, Cramer DW, Cunningham JM, Dagne G, Ebbert-Syfrett J, Fenstermacher D, Fridley BL, Garcia-Closas M, Gayther SA, Ge W, Gentry-Maharaj A, Gonzalez-Bosquet J, Goode EL, Iversen E, Jim H, Kong W, McLaughlin J, Menon U, Monteiro ANA, Narod SA, Pharoah PDP, et al: LIN28B polymorphisms influence susceptibility to epithelial ovarian cancer. Cancer Res. 2011, 71 (11): 3896-3903. 10.1158/0008-5472.CAN-10-4167. doi:10.1158/0008-5472.CAN-10-4167. http://cancerres.aacrjournals.org/content/71/11/3896.full.pdf+html,

## Acknowledgements

This work was funded by the National Institutes of Health through its Genes and Environment Initiative, grant number R01–HL090559 and through R21–ES020796 (NIEHS) and by the National Science Foundation via DMS–1106891. The ovarian cancer multi–GWAS analyzed herein were provided by the FOCI project (1U19-CA148112) of NCI’s GAME–ON consortium. The authors wish to thank two anonymous referees for their important contributions to the manuscript.

## Author information

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors’ contributions

EI conceived the study, coordinated its design, the analysis and drafting of the manuscript. GL carried out statistical and bioinformatic analyses and participated in drafting the manuscript. MC contributed to study design, the analysis plan and the manuscript. AM contributed to study design, interpretation of results and the manuscript. All authors have read, critically reviewed and approved its final version.

## Electronic supplementary material

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## About this article

### Cite this article

Iversen, E.S., Lipton, G., Clyde, M.A. *et al.* Functional annotation signatures of disease susceptibility loci improve SNP association analysis.
*BMC Genomics* **15, **398 (2014) doi:10.1186/1471-2164-15-398

#### Received

#### Accepted

#### Published

#### DOI

### Keywords

- Association study
- GWAS
- SNPs
- Functional annotations
- Bayesian analysis
- ENCODE project