Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies

Cited by: 32
Authors
Feng, Jian
Naiman, Daniel Q.
Cooper, Bret [1]
Affiliations
[1] Johns Hopkins Univ, Dept Appl Math & Stat, Baltimore, MD 21218 USA
[2] USDA ARS, Soybean Genom & Improvement Lab, Beltsville, MD USA
DOI
10.1093/bioinformatics/btm267
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: In proteomics, reverse database searching is used to control the false match frequency for tandem mass spectrum/peptide sequence matches, but reversal creates sequences devoid of the patterns that usually challenge database-search software.
Results: We designed an unsupervised pattern recognition algorithm for detecting patterns of various lengths in large sequence datasets. The patterns found in a protein sequence database were used to create decoy databases with a Monte Carlo sampling algorithm. Searching these decoy databases led to the prediction of false positive rates for spectrum/peptide sequence matches. We show examples where this method, independent of instrumentation, database-search software and samples, provides better estimates of false positive identification rates than a prevailing reverse database searching method. The pattern detection algorithm can also be used to analyze sequences for other purposes in biology or cryptology.
Availability: On request from the authors.
Contact: Bret.Cooper@ars.usda.gov
Supplementary information: http://bioinformatics.psb.ugent.be/.
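The contrast drawn in the abstract, between reversed decoys and decoys sampled from patterns found in the real database, can be illustrated with a minimal Python sketch. It is not the authors' algorithm, which the abstract does not specify. A fixed-order k-mer model stands in for the variable-length pattern detector, and all function names, the default `k=2`, and the ratio used in `false_match_rate` are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def reverse_decoy(protein):
    """Reversed-sequence decoy: the baseline approach named in the abstract."""
    return protein[::-1]


def fit_kmer_model(proteins, k=2):
    """Count which residue follows each k-residue context.

    Assumption: a fixed-order k-mer table is used here only as a simple
    stand-in for the paper's variable-length pattern detection.
    """
    model = defaultdict(Counter)
    for seq in proteins:
        for i in range(len(seq) - k):
            model[seq[i:i + k]][seq[i + k]] += 1
    return model


def sample_decoy(model, length, k, rng, alphabet):
    """Monte Carlo sampling of one decoy that keeps short-range patterns.

    Requires a non-empty model (at least one training sequence longer than k).
    """
    seq = list(rng.choice(list(model)))  # start from a random observed context
    while len(seq) < length:
        counts = model.get("".join(seq[-k:]))
        if counts:
            residues, weights = zip(*counts.items())
            seq.append(rng.choices(residues, weights=weights)[0])
        else:
            # Unseen context: fall back to a uniformly random residue.
            seq.append(rng.choice(alphabet))
    return "".join(seq[:length])


def false_match_rate(n_decoy_hits, n_target_hits):
    """Hypothetical estimate: decoy matches divided by target matches
    above the same score cutoff, capped at 1.0."""
    if n_target_hits == 0:
        return 0.0
    return min(1.0, n_decoy_hits / n_target_hits)
```

In use, one would fit the model on the real protein database, sample a decoy database of comparable size, search the spectra against both, and compare the match counts at each score cutoff. Averaging over several independently sampled decoy databases would reduce the sampling noise in the estimate.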
Pages: 2210-2217
Page count: 8