Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements

Cited by: 1117
Authors
Schäffer, AA [1]
Aravind, L [1]
Madden, TL [1]
Shavirin, S [1]
Spouge, JL [1]
Wolf, YI [1]
Koonin, EV [1]
Altschul, SF [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Bethesda, MD 20894 USA
DOI
10.1093/nar/29.14.2994
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular biology]
Discipline classification codes
071010; 081704
Abstract
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
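The abstract refers to a ROC measure normalized to lie between 0 and 1. As a rough illustration, the sketch below computes a truncated ROC score (often written ROC_n) for one ranked hit list: for each of the first n false positives, it counts the true positives ranked above it, then divides the sum by n times the total number of true positives. The function name `roc_n`, the choice of n, and the handling of lists with fewer than n false positives are assumptions made for this example; the abstract does not state the exact variant or cutoff the authors used.

```python
def roc_n(ranked_labels, total_true, n):
    """Truncated ROC score in [0, 1] for a ranked list of search hits.

    ranked_labels: booleans ordered best hit first; True marks a true positive.
    total_true:    number of true positives that exist for the query
                   (hypothetical input; it must come from an annotated set).
    n:             number of false positives at which the curve is truncated.
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    tp_seen = 0   # true positives encountered so far
    fp_seen = 0   # false positives encountered so far
    area = 0      # sum over false positives of true positives ranked above

    for is_true in ranked_labels:
        if is_true:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen
            if fp_seen == n:
                break

    # Assumption: if fewer than n false positives were retrieved, the missing
    # ones are treated as if ranked below every retrieved hit.
    area += (n - fp_seen) * tp_seen

    return area / (n * total_true)


if __name__ == "__main__":
    hits = [True, True, False, True, False]
    print(roc_n(hits, total_true=3, n=2))
```

A perfect ranking, with every true positive ahead of the first false positive, scores 1, and a ranking whose first n hits are all false positives scores 0.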
Pages: 2994-3005
Page count: 12