PSI-BLAST pseudocounts and the minimum description length principle

Cited by: 108
Authors
Altschul, Stephen F. [1]
Gertz, E. Michael [1]
Agarwala, Richa [1]
Schaffer, Alejandro A. [1]
Yu, Yi-Kuo [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA
Keywords
ACID SUBSTITUTION MATRICES; SEQUENCE WEIGHTS; PROTEIN-SEQUENCE; DATABASE; ALIGNMENT; INFORMATION; HOMOLOGY; SEARCHES; BLOCKS;
DOI
10.1093/nar/gkn981
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
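As a rough illustration of the pseudocount idea described in the abstract, the Python sketch below blends the observed counts of one alignment column with prior frequencies and converts the result to log-odds scores. It is not PSI-BLAST's implementation: the function names, the toy four-letter alphabet, the uniform background, and the use of background frequencies as the pseudocount prior are all assumptions made for illustration only.

```python
import math


def estimate_frequencies(counts, prior, num_pseudocounts):
    """Blend observed column counts with prior frequencies.

    counts: observed (possibly weighted) counts per residue in one column.
    prior: prior frequency per residue; assumed to sum to 1.
    num_pseudocounts: total pseudocount weight added to the column.
    """
    total = sum(counts.values())
    denom = total + num_pseudocounts
    if denom <= 0:
        raise ValueError("need at least one observation or pseudocount")
    return {
        a: (counts.get(a, 0.0) + num_pseudocounts * prior[a]) / denom
        for a in prior
    }


def log_odds_scores(freqs, background):
    """Column scores as log2 ratios of estimated to background frequency."""
    return {a: math.log2(freqs[a] / background[a]) for a in background}


# Hypothetical toy data: a 4-letter alphabet with a uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
column_counts = {"A": 8, "C": 2}

many = estimate_frequencies(column_counts, background, num_pseudocounts=10)
few = estimate_frequencies(column_counts, background, num_pseudocounts=2)

scores_many = log_odds_scores(many, background)
scores_few = log_odds_scores(few, background)
```

In this toy column, lowering the pseudocount total from 10 to 2 raises the score of the dominant residue and lowers the scores of unobserved ones, which is the kind of increased contrast the abstract attributes to using fewer pseudocounts in highly conserved columns.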
Pages: 815-824
Page count: 10