Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases

Times cited: 104
Authors
Sadygov, RG [1]
Liu, HB [1]
Yates, JR [1]
Affiliations
[1] Scripps Res Inst, Dept Cell Biol, La Jolla, CA 92037 USA
DOI
10.1021/ac035112y
Chinese Library Classification (CLC) number
O65 [Analytical chemistry]
Discipline classification code
070302; 081704
Abstract
The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. We recently presented a probabilistic model for peptide identification that uses the hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. To do so, we formulate the protein identification problem in terms of two independent models, a two-hypothesis binomial model and a multinomial model, which use the hypergeometric probabilities and the cross-correlation scores, respectively. In the binomial model, each database search result is treated as a Bernoulli event with two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from the relative length of the protein in the database (the null hypothesis) or from the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (the number of database matches to its peptides) given the size of the data set (the number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (the maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate a p-value for each score bin. These p-values are then normalized across the score bins to give a normalized probability for every bin. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from the results of database searches using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy of the approach. The shortcomings of the models arise in cases where the protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT-PROBE.
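The two probability calculations named in the abstract can be illustrated with a minimal Python sketch. It is not the PROT-PROBE implementation. The function names are hypothetical, and the abstract does not say how the alternative-hypothesis probability is derived from the hypergeometric peptide scores, so `p_alt` is simply taken as an input here. The numbers in the usage lines are made up for illustration.

```python
import math


def binomial_log_pmf(k, n, p):
    """Log of the binomial probability of k successes in n Bernoulli trials."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)


def null_probability(protein_length, database_length):
    """Null hypothesis: chance of a match is the protein's relative length."""
    return protein_length / database_length


def log_likelihood_ratio(k, n, p_null, p_alt):
    """Log of P(k | alternative) / P(k | null); larger favors the alternative."""
    return binomial_log_pmf(k, n, p_alt) - binomial_log_pmf(k, n, p_null)


def multinomial_log_pmf(counts, probs):
    """Log multinomial probability of per-bin counts given per-bin probabilities."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must sum to 1")
    total = sum(counts)
    log_p = math.lgamma(total + 1)
    for count, prob in zip(counts, probs):
        log_p -= math.lgamma(count + 1)
        if count > 0:
            log_p += count * math.log(prob)
    return log_p


# Illustrative, made-up inputs: 12 matches among 10,000 spectra for a
# 500-residue protein in a database of 3,000,000 residues.
p0 = null_probability(500, 3_000_000)
p1 = 1e-3  # assumed value; stands in for the score-derived probability
llr = log_likelihood_ratio(k=12, n=10_000, p_null=p0, p_alt=p1)

# Illustrative score bins: counts of peptide scores per bin and assumed
# normalized bin probabilities.
log_prob = multinomial_log_pmf([3, 1, 0], [0.7, 0.2, 0.1])
```

Working in log space avoids underflow when the number of spectra is large, and the binomial coefficient cancels in the ratio, so only the two per-event probabilities affect the test statistic.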
Pages: 1664-1671
Page count: 8