Getting more from less - Algorithms for rapid protein identification with multiple short peptide sequences

Cited by: 176
Authors
Mackey, AJ
Haystead, TAJ
Pearson, WR [1]
Affiliations
[1] Univ Virginia, Dept Biochem, Charlottesville, VA 22908 USA
[2] Univ Virginia, Dept Microbiol, Charlottesville, VA 22908 USA
[3] Univ Virginia, Dept Mol Genet, Charlottesville, VA 22908 USA
[4] Duke Univ, Dept Pharmacol, Durham, NC 27710 USA
Keywords
DOI
10.1074/mcp.M100004-MCP200
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
We describe two novel sequence similarity search algorithms, FASTS and FASTF, that use multiple short peptide sequences to identify homologous sequences in protein or DNA databases. FASTS searches with peptide sequences of unknown order, as obtained by mass spectrometry-based sequencing, evaluating all possible arrangements of the peptides. FASTF searches with mixed peptide sequences, as generated by Edman sequencing of unseparated mixtures of peptides. FASTF deconvolutes the mixture, using a greedy heuristic that allows rapid identification of high scoring alignments while reducing the total number of explored alternatives. Both algorithms use the heuristic FASTA comparison strategy to accelerate the search but use alignment probability, rather than similarity score, as the criterion for alignment optimality. Statistical estimates are calculated using an empirical correction to a theoretical probability. These calculated estimates were accurate within a factor of 10 for FASTS and 1000 for FASTF on our test dataset. FASTS requires only 15-20 total residues in three or four peptides to robustly identify homologues sharing 50% or greater protein sequence identity. FASTF requires about 25% more sequence data than FASTS for equivalent sensitivity, but additional sequence data are usually available from mixed Edman experiments. Thus, both algorithms can identify homologues that diverged 100 to 500 million years ago, allowing proteomic identification from organisms whose genomes have not been sequenced. Molecular & Cellular Proteomics 1:139-147, 2002.
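The idea of searching with peptides of unknown order by evaluating every arrangement can be illustrated with a small sketch. This is not the FASTS implementation: it ranks arrangements by a plain count of identical residues in ungapped placements, whereas the abstract states that FASTS uses alignment probability as its optimality criterion and the FASTA heuristic for speed. The function names (`ungapped_score`, `best_ordered_score`, `best_arrangement`) and the example sequences are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def ungapped_score(peptide, target, start):
    """Count identical residues when the peptide is laid on the target at `start`."""
    window = target[start:start + len(peptide)]
    return sum(1 for a, b in zip(peptide, window) if a == b)


def best_ordered_score(ordered_peptides, target):
    """Best total identity count when the peptides are placed left to right,
    in the given order, without overlapping (toy criterion, not a probability)."""
    n = len(target)
    # best[j]: best score for the peptides placed so far, using only target[:j]
    best = [0] * (n + 1)
    for pep in ordered_peptides:
        m = len(pep)
        new = [float("-inf")] * (n + 1)
        for j in range(m, n + 1):
            # Either this peptide ends exactly at position j, or it ends earlier.
            placed = best[j - m] + ungapped_score(pep, target, j - m)
            new[j] = max(new[j - 1], placed)
        best = new
    return best[n]


def best_arrangement(peptides, target):
    """Try every ordering of the peptides and return (score, ordering) of the best."""
    return max(
        ((best_ordered_score(order, target), order) for order in permutations(peptides)),
        key=lambda pair: pair[0],
    )


if __name__ == "__main__":
    # Hypothetical library sequence and two peptides supplied in the "wrong" order.
    target = "MKTAYIAKQRQISFVKSHFSRQ"
    peptides = ["SFVKS", "TAYIA"]
    print(best_arrangement(peptides, target))
```

Enumerating all orderings grows factorially with the number of peptides, which is tolerable for the three or four peptides mentioned in the abstract but would need pruning for larger sets.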
Pages: 139-147
Page count: 9