Properties of average score distributions of SEQUEST

Cited by: 122
Authors
Martinez-Bartolome, Salvador [1]
Navarro, Pedro [1]
Martin-Maroto, Fernando [2]
Lopez-Ferrer, Daniel [1]
Ramos-Fernandez, Antonio [1]
Villar, Margarita [1]
Garcia-Ruiz, Josefa P. [1]
Vazquez, Jesus [1]
Affiliations
[1] Univ Autonoma Madrid, Ctr Biol Mol Severo Ochoa, Prot Chem & Proteom Lab, CSIC, E-28049 Madrid, Spain
[2] Thermo Electron Corp, San Jose, CA 95134 USA
DOI
10.1074/mcp.M700239-MCP200
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification code
071010; 081704
Abstract
High-throughput identification of peptides in databases from tandem mass spectrometry data is a key technique in modern proteomics. Common approaches to interpreting large-scale peptide identification results are based on the statistical analysis of average score distributions, which are constructed from the set of best scores produced by large collections of MS/MS spectra using search engines such as SEQUEST. Other approaches calculate individual peptide identification probabilities on the basis of theoretical models or from single-spectrum score distributions constructed from the set of scores produced by each MS/MS spectrum. In this work, we study the mathematical properties of average SEQUEST score distributions by introducing the concept of spectrum quality and expressing these average distributions as compositions of single-spectrum distributions. We predict and demonstrate in practice that average score distributions are dominated by the quality distribution in the spectra collection, except in the low-probability region, where it is possible to predict the dependence of average probability on database size. Our analysis leads to a novel indicator, the probability ratio, which makes optimal use of the statistical information provided by the first and second best scores. The probability ratio is a non-parametric and robust indicator that makes classification of spectra according to parameters such as charge state unnecessary and yields a peptide identification performance, on the basis of false discovery rates, that is better than that obtained by other empirical statistical approaches. The probability ratio also compares favorably with statistical probability indicators obtained by the construction of single-spectrum SEQUEST score distributions. The robustness, conceptual simplicity, and ease of automation of the probability ratio algorithm make it a very attractive alternative for determining peptide identification confidences and error rates in high-throughput experiments.
Pages: 1135-1145
Page count: 11
Related papers
36 in total
[1]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[2]   A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: Support vector machine classification of peptide MS/MS spectra and SEQUEST scores [J].
Anderson, DC ;
Li, WQ ;
Payan, DG ;
Noble, WS .
JOURNAL OF PROTEOME RESEARCH, 2003, 2 (02) :137-146
[3]
Bafna, V .
BIOINFORMATICS, 2001, 17 (Suppl 1) :S13
[4]   Protein identification by mass spectrometry - Issues to be considered [J].
Baldwin, MA .
MOLECULAR & CELLULAR PROTEOMICS, 2004, 3 (01) :1-9
[5]   Controlling the false discovery rate in behavior genetics research [J].
Benjamini, Y ;
Drai, D ;
Elmer, G ;
Kafkafi, N ;
Golani, I .
BEHAVIOURAL BRAIN RESEARCH, 2001, 125 (1-2) :279-284
[6]   Automatic Quality Assessment of Peptide Tandem Mass Spectra [J].
Bern, Marshall ;
Goldberg, David ;
McDonald, W. Hayes ;
Yates, John R., III .
BIOINFORMATICS, 2004, 20 :49-54
[7]   The need for guidelines in publication of peptide and protein identification data - Working group on publication guidelines for peptide and protein identification data [J].
Carr, S ;
Aebersold, R ;
Baldwin, M ;
Burlingame, A ;
Clauser, K ;
Nesvizhskii, A .
MOLECULAR & CELLULAR PROTEOMICS, 2004, 3 (06) :531-533
[8]   High-performance peptide identification by tandem mass spectrometry allows reliable automatic data processing in proteomics [J].
Colinge, J ;
Masselot, A ;
Cusin, I ;
Mahé, E ;
Niknejad, A ;
Argoud-Puy, G ;
Reffas, S ;
Bederr, N ;
Gleizes, A ;
Rey, PA ;
Bougueleret, L .
PROTEOMICS, 2004, 4 (07) :1977-1984
[9]   OLAV: Towards high-throughput tandem mass spectrometry data identification [J].
Colinge, J ;
Masselot, A ;
Giron, M ;
Dessingy, T ;
Magnin, J .
PROTEOMICS, 2003, 3 (08) :1454-1463
[10]   Intensity-based protein identification by machine learning from a library of tandem mass spectra [J].
Elias, JE ;
Gibbons, FD ;
King, OD ;
Roth, FP ;
Gygi, SP .
NATURE BIOTECHNOLOGY, 2004, 22 (02) :214-219