Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies

Cited by: 32

Authors:

Feng, Jian

Naiman, Daniel Q.

Cooper, Bret [1]

Affiliations:

[1] Johns Hopkins Univ, Dept Appl Math & Stat, Baltimore, MD 21218 USA

[2] USDA ARS, Soybean Genom & Improvement Lab, Beltsville, MD USA

Source:

BIOINFORMATICS | 2007 / Vol. 23 / Issue 17

DOI:

10.1093/bioinformatics/btm267

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Motivation: In proteomics, reverse database searching is used to control the false match frequency for tandem mass spectrum/peptide sequence matches, but reversal creates sequences devoid of patterns that usually challenge database-search software.

Results: We designed an unsupervised pattern recognition algorithm for detecting patterns with various lengths from large sequence datasets. The patterns found in a protein sequence database were used to create decoy databases using a Monte Carlo sampling algorithm. Searching these decoy databases led to the prediction of false positive rates for spectrum/peptide sequence matches. We show examples where this method, independent of instrumentation, database-search software and samples, provides better estimation of false positive identification rates than a prevailing reverse database searching method. The pattern detection algorithm can also be used to analyze sequences for other purposes in biology or cryptology.

Availability: On request from the authors.

Contact: Bret.Cooper@ars.usda.gov

Supplementary information: http://bioinformatics.psb.ugent.be/.

Pages: 2210-2217

Page count: 8

References (20 in total):

[1] [Anonymous], 1974, IIE Transactions

[2] Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis [J].