A new estimation of protein-level false discovery rate
 Guanying Wu^{1},
 Xiang Wan^{2} and
 Baohua Xu^{1}
 Published: 13 August 2018
Abstract
Background
In mass spectrometry-based proteomics, protein identification is an essential task. Evaluating the statistical significance of the protein identification result is critical to the success of proteomics studies. Controlling the false discovery rate (FDR) is the most common method for assuring the overall quality of the set of identifications. Existing FDR estimation methods either rely on specific assumptions or rely on a two-stage calculation process that first estimates the error rates at the peptide level and then combines them at the protein level. We propose to estimate the FDR in a nonparametric way with fewer assumptions and to avoid the two-stage calculation process.
Results
We propose a new protein-level FDR estimation framework. The framework contains two major components: the Permutation+BH (Benjamini–Hochberg) FDR estimation method and the logistic regression-based null inference method. In Permutation+BH, the null distribution of a sample is generated by searching the data against a large number of permuted random protein databases and therefore does not rely on specific assumptions. Then, p-values of proteins are calculated from the null distribution, and the BH procedure is applied to the p-values to obtain the relationship between the FDR and the number of protein identifications. Because the Permutation+BH method generates the null distribution by permutation, it is inefficient for online identification. The logistic regression model is therefore proposed to infer the null distribution of a new sample from existing null distributions obtained with the Permutation+BH method.
Conclusions
In our experiment based on three publicly available datasets, our Permutation+BH method achieves consistently better performance than MAYU, which is chosen as the benchmark FDR calculation method for this study. The null distribution inference result shows that the logistic regression model achieves a reasonable result both in the shape of the null distribution and in the corresponding FDR estimation result.
Keywords
 FDR
 Proteomics
 Permutation
 Null distribution
Background
In shotgun proteomics, the identification of proteins is a two-stage process: peptide identification and protein inference [1]. In peptide identification, experimental MS/MS spectra are searched against a sequence database to obtain a set of peptide-spectrum matches (PSMs) [2–4]. In protein inference, individual PSMs are assembled to infer the identity of proteins present in the sample [5–7].

Existing approaches for assessing the confidence of identified proteins fall into two categories:

p-value-based approaches provide a single protein-level p-value for each reported protein.

FDR approaches apply a single threshold to all proteins identified from the data rather than generating individual significance values for each protein.
However, existing methods have two main shortcomings:
 1. Reliance on specific assumptions. Most methods depend on particular assumptions regarding the model or the distribution of false positive matches. For instance, the p-value-based PROT_PROBE approach [10] assumes that protein identification by a collection of spectra follows a binomial model. Similarly, MAYU [11], the representative of the FDR approaches, assumes that false positive PSMs are equally likely to map to either the target or the decoy database and that the number of false positive protein identifications is hypergeometrically distributed.
 2. Reliance on the two-stage calculation process. Generally, the protein-level confidence measure is obtained by combining peptide-level p-values (e.g., [8]). Such a process may propagate errors from the peptide level to the protein level in a nontrivial manner [12].
Based on the above observations, we propose a new framework for protein-level FDR estimation that avoids the above-mentioned shortcomings. In this framework, we permute protein sequences and search a dataset against these fake sequences to obtain the corresponding null distribution at the protein level before p-value and FDR calculation. Therefore, our calculation does not rely on the two-stage calculation process. In addition, we do not need to make any assumption about the distribution of protein identification scores, since the permutation procedure is nonparametric. More importantly, once the null/permutation distribution is available, we can calculate p-values and the FDR without searching a decoy database. Experimental results on several real proteomics datasets show that our framework is effective in p-value and FDR calculation and outperforms MAYU consistently.
Although this framework is very appealing, the time required to perform the permutation procedure makes it infeasible to generate the null distribution online until a fast permutation algorithm is available. To alleviate this problem, we suggest performing the permutation offline and storing the null distributions for future use. When null distributions built on existing samples are not applicable to newly acquired data with different features, we propose to use logistic regression to infer the null distribution from existing null distributions.
The rest of the paper is organized as follows: the “Methods” section describes the details of our methods. The “Results and discussion” section presents the experimental results. The “Discussion” section concludes the paper.
Methods
Overview of the methods
Proteins that are present in an experimental sample are true positives; others are false positives. Each protein is associated with a score measuring its confidence. The higher the score, the more confident we are that the protein is in the sample. If we treat the protein score as a test statistic, the distribution formed by the scores of false positives is the null distribution. Given N proteins, we can determine the p-value of each protein from the null distribution. For a given FDR level α, we can determine how many proteins are accepted based on their p-values according to the Benjamini–Hochberg (BH) method: with the p-values sorted in ascending order, \(p_{(1)}\leq \dots \leq p_{(N)}\), the proteins with the k smallest p-values are accepted, where k is the largest index satisfying \(p_{(k)}\leq \frac {k}{N}\alpha \).
Given a subset of proteins obtained by setting a protein score threshold, we can also determine the FDR according to the BH procedure.
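Both directions of the BH procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the standard BH step-up rule, not the authors' implementation; the function names `bh_accepted_count` and `bh_fdr_for_top_k` are hypothetical.

```python
def bh_accepted_count(p_values, alpha):
    """Number of proteins accepted by the BH step-up rule at FDR level alpha."""
    n = len(p_values)
    accepted = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        # Keep the largest rank k whose sorted p-value satisfies p_(k) <= (k / N) * alpha.
        if p <= rank / n * alpha:
            accepted = rank
    return accepted


def bh_fdr_for_top_k(p_values, k):
    """Estimated FDR when the k proteins with the smallest p-values are reported.

    This is the smallest alpha at which the BH rule accepts at least k proteins;
    taking the minimum over ranks j >= k keeps the estimate monotone in k.
    """
    ordered = sorted(p_values)
    n = len(ordered)
    return min(min(1.0, ordered[j - 1] * n / j) for j in range(k, n + 1))
```

Sweeping k from 1 to N with `bh_fdr_for_top_k` gives the relationship between the number of reported proteins and the estimated FDR.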
In this method, the p-value calculation and the BH procedure are well-established statistical routines. The major source of error in estimating the FDR is likely to be the null distribution estimation. Generally, the more data we have, the more accurate the null distribution. Thus, we estimate the null distribution by using a permutation method, which can generate plenty of data for robust data analysis. However, the permutation method is inefficient and becomes computationally expensive when handling large datasets. This motivates us to develop a method that can infer the null distribution of a new dataset from the null distributions of known datasets. We can store the previously estimated null distributions and conduct the protein-level FDR estimation in an offline mode. In this way, both accuracy and efficiency can be achieved. Details are provided in the following sections.
The permutation method
We employ the target-decoy technique to determine null distributions. In the permutation step, each sequence in the original protein database is randomly shuffled [14]. The shuffled proteins are appended to the original protein database to form a concatenated database. Then, proteins are identified from the concatenated target-decoy database. Protein identifications mapping to decoy sequences are false positives, whose scores are used to form the null distribution. When the sample size is small, we may not have enough false positives to form a reliable null distribution. Thus, the shuffling step and the protein inference step are repeated multiple times (e.g. 20 repeats). In each iteration, decoy protein scores are stored.
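The shuffle-and-collect loop described above can be sketched as follows. This is an illustration under stated assumptions: the `identify` callable is a hypothetical stand-in for the database search and protein inference pipeline (it is assumed to return a mapping from protein name to score), and the `DECOY_` name prefix is an assumed convention for marking shuffled sequences.

```python
import random


def shuffle_sequence(sequence, rng):
    """Return a random permutation of the residues of one protein sequence."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def build_target_decoy(proteins, rng, decoy_prefix="DECOY_"):
    """Append one shuffled copy of every target sequence to the target database."""
    concatenated = dict(proteins)
    for name, sequence in proteins.items():
        concatenated[decoy_prefix + name] = shuffle_sequence(sequence, rng)
    return concatenated


def collect_decoy_scores(proteins, identify, repeats=20, seed=0, decoy_prefix="DECOY_"):
    """Repeat shuffling and identification, pooling the scores of decoy hits."""
    rng = random.Random(seed)
    decoy_scores = []
    for _ in range(repeats):
        database = build_target_decoy(proteins, rng, decoy_prefix)
        scores = identify(database)  # hypothetical: {protein_name: score}
        decoy_scores.extend(
            score for name, score in scores.items() if name.startswith(decoy_prefix)
        )
    return decoy_scores
```

Each repeat draws a fresh shuffle, so the pooled decoy scores grow with the number of repeats, which is what makes the null distribution estimate more stable for small samples.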
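Suppose that M decoy protein scores are collected in total. We divide the score range into K bins \(\mathcal {Z}_{1},...,\mathcal {Z}_{K}\), let y_{k} be the number of decoy scores falling into \(\mathcal {Z}_{k}\), and let x_{k} be the center point of \(\mathcal {Z}_{k}\). Note that \(\sum _{k=1}^{K}{y_{k}}=M\). Then, the set of points \(\mathcal {H}=\{(x_{1},y_{1}/M),...,(x_{K},y_{K}/M)\}\) describes the probability density function of the null distribution.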
Given N proteins, we can estimate their p-values from this null distribution. The relationship between the FDR and the number of proteins then follows directly from applying the BH procedure described in the previous section.
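As an illustration, the binned null distribution and a p-value lookup can be sketched as below. The equal-width binning over [0, 1] and the upper-tail definition of the p-value (the null mass in the bin containing the score and in all higher bins) are assumptions of this sketch; the function names are hypothetical.

```python
import math


def bin_index(score, bin_width, n_bins, low=0.0):
    """Index of the bin containing score, clamped to the valid range."""
    return min(max(int((score - low) / bin_width), 0), n_bins - 1)


def null_histogram(decoy_scores, bin_width, low=0.0, high=1.0):
    """Return the points (x_k, y_k / M): bin centers and decoy score fractions."""
    n_bins = math.ceil((high - low) / bin_width - 1e-9)
    counts = [0] * n_bins
    for score in decoy_scores:
        counts[bin_index(score, bin_width, n_bins, low)] += 1
    total = len(decoy_scores)
    return [(low + (k + 0.5) * bin_width, counts[k] / total) for k in range(n_bins)]


def p_value(score, histogram, bin_width, low=0.0):
    """Assumed upper-tail p-value: null mass in the score's bin and all higher bins."""
    k = bin_index(score, bin_width, len(histogram), low)
    return sum(density for _, density in histogram[k:])
```

With a bin width of 0.003 on [0, 1], this binning yields 334 bins, so the resolution of the p-values is limited by both the bin width and the number of pooled decoy scores.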
In the permutation method, we need to shuffle and identify proteins multiple times. Thus, the notable limitation of the permutation method is its low efficiency. Null distributions of different samples can be stored in the protein database for future use in an offline mode.
A general null distribution inference model
The protein identification result can be affected by various factors, such as the tandem MS peak count and the sample complexity. Null distributions built on existing samples may not be applicable to data with different tandem MS peak counts or a different sample complexity. Determining the null distribution of new data with the permutation method is time-consuming. Thus, we design a way to infer the null distribution from existing null distributions for cases where high efficiency is desired.
A raw data file can be described by many features, for instance, tandem MS peak counts and tandem MS spectral quality measured by the mean noise level. Suppose we have I existing samples and each sample can be described by J features. Denote the features of the i-th sample as (r_{i,1},r_{i,2},...,r_{i,J}) and let \(\mathcal {H}_{i}(x)\) be the null probability density function associated with the sample. The features of a new sample are denoted as (r_{0,1},r_{0,2},...,r_{0,J}), and our objective is to infer its null density function \(\mathcal {H}_{0}(x)\).
The probability \(\text {Pr}_{i,k}(x\leq \tilde {x}\mid \tilde {x}\in \mathcal {Z}_{k})\) can be calculated from the null probability density function \(\mathcal {H}_{i}(x)\).
Sample | feature 1 | ... | feature j | ... | feature J | \(\text {Pr}_{i}(x\leq \tilde {x}\mid \tilde {x}\in \mathcal {Z}_{k})\)
Sample 1 | r_{1,1} | ... | r_{1,j} | ... | r_{1,J} | P_{1,k}
... | ... | ... | ... | ... | ... | ...
Sample i | r_{i,1} | ... | r_{i,j} | ... | r_{i,J} | P_{i,k}
... | ... | ... | ... | ... | ... | ...
Sample I | r_{I,1} | ... | r_{I,j} | ... | r_{I,J} | P_{I,k}
The coefficient table for null distribution inference
Bin | Intercept | feature 1 | feature 2 | ... | feature J
1st bin | β_{1,0} | β_{1,1} | β_{1,2} | ... | β_{1,J}
... | ... | ... | ... | ... | ...
k-th bin | β_{k,0} | β_{k,1} | β_{k,2} | ... | β_{k,J}
... | ... | ... | ... | ... | ...
(K−1)-th bin | β_{K−1,0} | β_{K−1,1} | β_{K−1,2} | ... | β_{K−1,J}
When k=1, \(\mathcal {H}_{0}(x\in \mathcal {Z}_{k})\) becomes ill-posed because \(\mathcal {Z}_{0}\) is undefined. In this case, let \(\text {Pr}_{0}(x\leq \tilde {x}\mid \tilde {x}\in \mathcal {Z}_{0})=0\). By using the coefficient table and equation (6), we obtain the null density function \(\mathcal {H}_{0}\).
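The inference step can be sketched in Python under two assumptions that are not spelled out in the surviving text: that each row of the coefficient table defines a logistic function of the sample features giving the cumulative probability up to bin k, and that the density of bin k is the difference between consecutive cumulative probabilities, with the value for bin 0 set to 0 and the value for bin K set to 1. The function names are hypothetical.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def infer_null_density(coefficients, features):
    """Infer per-bin null probabilities for a new sample.

    coefficients: K-1 rows [beta_k0, beta_k1, ..., beta_kJ], one per bin.
    features:     [r_01, ..., r_0J] of the new sample.
    Assumes each row models the cumulative probability up to bin k.
    """
    cumulative = [0.0]  # convention for the undefined bin Z_0
    for row in coefficients:
        z = row[0] + sum(beta * r for beta, r in zip(row[1:], features))
        cumulative.append(sigmoid(z))
    cumulative.append(1.0)  # the K-th cumulative probability needs no regression
    return [cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))]
```

Because each bin is fitted separately in this sketch, the cumulative values are not guaranteed to be monotone, so a practical implementation may need to handle negative differences.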
The feature protein database
When a new raw data file is input, protein inference and feature extraction are performed. We can use existing protein inference algorithms such as ProteinProphet to identify proteins from the protein database in the feature database. Then, we can compare the new sample with the samples in the feature database based on their features. We can measure the similarity of two samples by calculating the correlation of their feature vectors. Similar samples are often encountered when we analyze replicate samples. If the similarity between the new sample and an existing sample i is high (e.g. the correlation of features is above 0.9), we use \(\mathcal {H}_{i}(x)\) as the null distribution to calculate the protein-level FDR. If we cannot find any similar sample in the feature database, we can plug the coefficients in the coefficient table into function (5) and use Eq. (6) to infer a new null distribution for the new sample.
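The lookup logic described above can be sketched as follows, assuming Pearson correlation as the similarity measure and a dictionary as the feature table. The names `pearson` and `select_stored_null` are hypothetical; returning `None` signals that no stored null distribution is similar enough and that the inference route should be taken instead.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length feature vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # correlation is undefined here; treated as "not similar"
    return cov / math.sqrt(var_u * var_v)


def select_stored_null(new_features, feature_table, threshold=0.9):
    """Name of the most similar stored sample above the threshold, else None."""
    best_name, best_corr = None, threshold
    for name, features in feature_table.items():
        corr = pearson(new_features, features)
        if corr > best_corr:
            best_name, best_corr = name, corr
    return best_name
```

Correlation needs at least two features with some variation, so with a single feature per sample a distance-based comparison would be needed instead.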
The permutation method is time-consuming. When the number of bins in the null distribution and the number of features are large, fitting the logistic regression may also take a great amount of time. The offline information (i.e. the feature table and the coefficient table) is used to obtain a new null distribution without the permutation step and the logistic regression fitting step, which makes the protein-level FDR estimation efficient.
Our current implementation of the framework
When applying the proposed protein-level FDR estimation framework, two key points are the choice of features and the similarity measurement. A sample can be described by features. When a new sample is similar to an existing sample according to their features, the null distribution of the existing sample is used. Otherwise, a new null distribution is inferred by applying the logistic model based on the sample features.
The error propagation from the peptide level to the protein level is nontrivial. Features of raw data such as tandem MS peak counts are far removed from the final protein inference result. Thus, these kinds of features may not have a clear connection with protein scores. In our current implementation, we choose to derive features directly from protein scores.
The smaller the value of D_{i,j}, the more similar sample i and sample j are. We choose the KL divergence from each sample to the reference sample as a feature, which is used both to infer the null distribution and to measure the sample similarity.
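A KL divergence feature of this kind can be sketched as below. The sketch assumes that each sample is summarized by a histogram of its protein scores on [0, 1] and that the divergence is computed between two such discrete distributions; the number of bins, the direction of the divergence, and the function names are assumptions made for illustration.

```python
import math


def score_distribution(scores, n_bins=10):
    """Discrete distribution of protein scores in [0, 1] over equal-width bins."""
    counts = [0] * n_bins
    for score in scores:
        counts[min(int(score * n_bins), n_bins - 1)] += 1
    return [count / len(scores) for count in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete distributions."""
    total = 0.0
    for p_k, q_k in zip(p, q):
        if p_k == 0:
            continue  # a zero-probability bin contributes nothing
        if q_k == 0:
            return math.inf  # p puts mass where q has none
        total += p_k * math.log(p_k / q_k)
    return total
```

The divergence of a distribution from itself is zero, and the measure is not symmetric, so the choice of reference sample and direction affects the feature values.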
Overview of the experiments
The whole framework consists of two parts: the first part employs the permutation method and the BH procedure to estimate the FDR (Permutation+BH); the second part provides a logistic regression model to infer the null distribution of a new sample based on existing null distributions. We first conduct an experiment to verify Permutation+BH in FDR estimation. The performance of our method is compared to MAYU based on three datasets with ground truth. Then, we conduct another experiment to illustrate the performance of our null distribution inference method. In the last part of our experiments, we discuss the reference dataset issue in the current implementation of our framework.
The whole framework is implemented in Ruby (v1.9.2p290). Target-decoy concatenated databases are generated from UniProtKB/Swiss-Prot (Release 2011_01) by appending shuffled protein sequences to the original protein database. Peptides are identified by X!Tandem (v2010.10.01.1) [4]. Then, the Trans-Proteomic Pipeline (v4.5 RAPTURE rev 0, Build 201109211427), in which ProteinProphet is embedded, is employed to perform peptide probability calculation and protein inference [5, 15].
Table 3 Names and URLs of data files
Dataset name | File name | URL
ISB | QT20051230_S_18mix_04.mzXML |
ABRF | Lane/060121Yrasprg051025ct5.RAW |
Yeast | YPD_ORBI/061220.zl.mudpit0.1.1/raw/000.RAW |
Yeast_Train | YPD_ORBI/070119zlmudpit071/raw/000.RAW |
Human | YPD_LCQ/060b.RAW |
Human_Test | PAe000281_mzXML_200909301914/B067017_c.mzXML |
Results and discussion
FDR estimation
In this experiment, the first three datasets are used: ISB, ABRF and Yeast. For the ISB dataset, the 18 standard proteins together with 15 contaminants are marked as the ground truth [16]. For the ABRF dataset, the 49 standard proteins and 78 contaminants are used as the ground truth. Readers can refer to the supplementary document for more information [19]. For the Yeast dataset, all proteins in the protein reference set are treated as true proteins.
The permutation method includes two steps to obtain a null distribution: a shuffling step and a protein inference step. In the shuffling step, each protein sequence is shuffled and appended to the original protein database. In the protein inference step, proteins are identified from the target-decoy concatenated database with TPP. The shuffling step and the protein inference step are repeated 20 times. A protein mapping to a decoy sequence is considered to be a false positive. The protein probabilities of all false positives from the 20 protein identification results are collected. Then, the histogram of the protein probabilities is built and used as the null probability density function. In general, the smaller the bin size, the more detail the null distribution contains, but the more data points are needed to build it. In our experiments, the width of each bin is empirically chosen to be 0.003.
After the null distribution has been built, the p-value of each protein is calculated according to Eq. (3). Next, the p-values of all proteins are sorted in ascending order. Then, the BH procedure is applied to the p-values to obtain the relationship between the number of proteins and the FDR.
According to our experimental results, our method and MAYU are comparable in performance on the ISB dataset. On the ABRF dataset, our method is better than MAYU on average. On the Yeast dataset, our method clearly outperforms MAYU.
Null distribution inference
In this experiment, the first five datasets listed in Table 3 are used to fit the logistic regression model (4), and the Human_Test dataset is used to validate the null distribution inference result.
The feature table for five datasets and one test set
Sample | KL divergence
ISB | 0.000
ABRF | 0.0301
Yeast | 0.3697
Human | 0.2206
Yeast_Train | 0.2781
Human_Test | 0.4159
The null distributions of six samples are obtained by the permutation method. The first five null distributions are used to infer the null distribution of Human_Test. The last null distribution will be used as a reference.
Coefficient table. Note that we only need to conduct logistic regression on the first K−1 bins
Bins | Intercept coefficients | KL divergence coefficients
1st bin | 6.0539 | 18.6072
... | ... | ...
333rd bin | 5.1731 | 3.3703
Figure 3a shows the probability density functions of the inferred null and the null obtained by the permutation method. The peak height of the inferred null is overestimated compared with that of the permutation null. In Fig. 3b, Permutation+BH and InferredNull+BH estimate the FDR by applying the BH procedure to the null distribution generated by the permutation method and the inferred null distribution, respectively. According to the result shown in Fig. 3b, the performance of InferredNull+BH is closer to Permutation+BH than that of MAYU. The correlation of the inferred null distribution and the permutation generated null distribution is 0.9052.
The reference dataset

The data used in fitting the logistic regression model may be neither typical nor sufficient. Data representative of different conditions are needed to obtain a robust regression model. When information about some typical conditions is missing, it may be hard to compensate for it with mathematical models.

In our experiment, we only consider one feature in our logistic model. The single feature may not explain all kinds of variation in the protein inference process. A more accurate model can be achieved by using more features and plenty of data in building the feature database.

Even if we obtain an ideal logistic model with perfect coefficients, there is no guarantee that the new data is not an outlier. It is often the case that the new data lies at some distance from the ideal model.
The biggest advantage of the null inference method is its efficiency. Once the coefficient table has been obtained, the inference of the null distribution is just the calculation of the deterministic functions (5) and (6). The whole process of FDR estimation takes just a few seconds. This should benefit large-scale data analysis.
Discussion
In the Permutation+BH method, we use the permutation method to generate the null distribution and apply the BH procedure to estimate the FDR. The method does not rely on specific assumptions and works directly at the protein level. Thus, the problems related to improper assumptions and error propagation are avoided. The flexibility of our method also implies that it can be used with any search method or protein inference method of the user’s choice. According to our experimental results based on three datasets, our method performs better than MAYU. We believe that this is partly due to a more accurate estimation of the null distribution through the increased sampling of our permutation method, compared with the typical 1:1 target-decoy approach used in MAYU.
In the Permutation+BH method, the efficiency is low because we need to shuffle protein sequences and conduct protein inference multiple times. We propose an offline strategy to handle this issue. In the offline strategy, a feature protein database is built beforehand with null distributions obtained from existing samples, a feature table and a coefficient table. When a new sample has no match in the feature table, a new null distribution is inferred by directly plugging the coefficients in the coefficient table into the logistic model. The logistic regression model provides an efficient way to infer the null distribution. Our model is currently trained on only a few samples and includes only one feature, which limits its accuracy. As part of our future work, we will seek to improve this model by adding more features and training it on more datasets.
Conclusions
In this paper, we propose a protein-level FDR estimation framework. The framework includes two major components: the Permutation+BH FDR estimation method and the logistic regression-based null distribution inference method. The Permutation+BH method first applies permutation to generate the null distribution and then uses the BH procedure to estimate the FDR. However, this method is inefficient for online identification. Therefore, we propose the logistic regression-based null distribution inference method to handle this issue. In our experiment based on three publicly available datasets, our Permutation+BH method achieves consistently better performance than MAYU, which is chosen as the benchmark FDR calculation method for this study. The null distribution inference result shows that the logistic regression model achieves a reasonable result both in the shape of the null distribution and in the corresponding FDR estimation result.
Notes
Declarations
Acknowledgements
The abridged abstract of this work was previously published in the Proceedings of the 13th International Symposium on Bioinformatics Research and Applications (ISBRA 2017), Lecture Notes in Computer Science: Bioinformatics Research and Applications [20].
Funding
This work was supported in part by International S&T Cooperation Program of China (ISTCP) under Grant No. 2014DFA31520 and by Shenzhen Fundamental Research Fund under Grant No. QTD2015033114415450.
Availability of data and materials
All the data discussed in this study are publicly available and the sources for downloading these data are listed in Table 3.
About this supplement
This article has been published as part of BMC Genomics Volume 19 Supplement 6, 2018: Selected articles from the 13th International Symposium on Bioinformatics Research and Applications (ISBRA 2017): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-19-supplement-6.
Authors’ contributions
GYW, XW and BHX developed the model, conducted the simulation studies and real data analysis, and wrote the draft of the manuscript. BHX revised the manuscript. GYW and XW conducted the real data analysis and finalized the manuscript. BHX provided the guidance on methodology and finalized the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
 1. Nesvizhskii AI, Vitek O, Aebersold R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods. 2007; 4:787–97.
 2. Eng J, McCormack AL, Yates III JR. An approach to correlate tandem mass spectra data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994; 5:976–89.
 3. Perkins DN, Pappin DJC, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20:3551–67.
 4. Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004; 20:1466–7.
 5. Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem. 2003; 75:4646–58.
 6. Bern M, Goldberg D. Improved ranking functions for protein and modification-site identifications. J Comput Biol. 2008; 15(7):705–19.
 7. Li YF, Arnold RJ, Li Y, Radivojac P, Sheng Q, Tang H. A Bayesian approach to protein inference problem in shotgun proteomics. J Comput Biol. 2009; 16(8):1183–93.
 8. Spirin V, Shpunt A, Seebacher J, Gentzel M, Shevchenko A, Gygi S, Sunyaev S. Assigning spectrum-specific p-values to protein identifications by mass spectrometry. Bioinformatics. 2011; 27:1128–34.
 9. Omenn GS, Blackwell TW, Fermin D, Eng J, Speicher DW, Hanash SM. Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study. Nat Biotechnol. 2006; 24:333–8.
 10. Sadygov RG, Liu H, Yates JR. Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal Chem. 2004; 76:1664–71.
 11. Reiter L, Claassen M, Schrimpf SP, Jovanovic M, Schmidt A, Buhmann JM, Hengartner MO, Aebersold R. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol Cell Proteomics. 2009; 8:2405–17.
 12. Gupta N, Bandeira N, Keich U, Pevzner PA. Target-decoy approach and false discovery rate: when things may go wrong. J Am Soc Mass Spectrom. 2011; 22(7):1111–20.
 13. Friedman J, Tibshirani R, Hastie T. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2009.
 14. Reidegeld KA, Eisenacher M, Kohl M, Chamrad D, Körting G, Blüggel M, Meyer HE, Stephan C. An easy-to-use Decoy Database Builder software tool, implementing different decoy strategies for false discovery rate calculation in automated MS/MS protein identifications. Proteomics. 2008; 8:1129–37.
 15. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002; 74:5383–92.
 16. Klimek J, Eddes JS, Hohmann L, Jackson J, Peterson A, Letarte S, Gafken PR, Katz JE, Mallick P, Lee H, Schmidt A, Ossola R, Eng J, Aebersold R, Martin DB. The standard protein mix database: a diverse data set to assist in the production of improved peptide and protein identification software tools. J Proteome Res. 2008; 7:96–103.
 17. Lu P, Vogel C, Wang R, Yao X, Marcotte EM. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol. 2006; 25:117–24.
 18. Ramakrishnan SR, Vogel C, Kwon T, Penalva LO, Marcotte EM, Miranker DP. Mining gene functional networks to improve mass-spectrometry-based protein identification. Bioinformatics. 2009; 25:2955–61.
 19. Gerster S, Qeli E, Ahrens CH, Bühlmann P. Protein and gene model inference based on statistical modeling in k-partite graphs. Proc Natl Acad Sci. 2010; 107:12101–6.
 20. In: Cai Z, Daescu O, Li M, (eds). Proceedings of the 13th International Symposium on Bioinformatics Research and Applications (ISBRA 2017), Honolulu, Hawaii, May 30 – June 2, 2017. New York City: Springer; 2017.