Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements

Cited by: 1117
Authors
Schäffer, AA [1]
Aravind, L [1]
Madden, TL [1]
Shavirin, S [1]
Spouge, JL [1]
Wolf, YI [1]
Koonin, EV [1]
Altschul, SF [1]
Affiliation
[1] NIH, Natl Ctr Biotechnol Informat, Bethesda, MD 20894 USA
DOI
10.1093/nar/29.14.2994
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
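The abstract refers to a ROC measure normalized to lie between 0 and 1 but does not define it here. As a rough illustration only, the sketch below computes a truncated ROC score under one common convention: for each of the first n false positives in a ranked hit list, count the true positives ranked above it, sum those counts, and divide by n times the total number of true positives. The function name `roc_n`, the choice of n, and the handling of lists that end before n false positives are assumptions made for this example, not details taken from the paper.

```python
def roc_n(ranked_labels, total_true, n):
    """Truncated ROC score for one ranked hit list (illustrative sketch).

    ranked_labels: iterable of booleans, best-scoring hit first;
                   True marks a true positive, False a false positive.
    total_true:    number of true positives that exist for this query.
    n:             number of false positives at which to truncate.
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    true_seen = 0    # true positives encountered so far
    false_seen = 0   # false positives encountered so far
    area = 0         # running sum of true positives above each false positive

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list ends before n false positives appear, the
    # missing false positives are treated as ranking below every listed hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    # Toy ranking: two true positives interleaved with two false positives.
    print(roc_n([True, False, True, False], total_true=2, n=2))
```

Under this convention a ranking that places every true positive ahead of the first false positive scores 1, and a ranking that places n false positives ahead of every true positive scores 0.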
Pages: 2994-3005
Page count: 12