Empirical statistical estimates for sequence similarity searches

Cited by: 214
Author
Pearson, WR [1]
Affiliation
[1] Univ Virginia, Dept Biochem, Charlottesville, VA 22908 USA
Keywords
sequence similarity; statistical estimates; FASTA; Smith-Waterman
DOI
10.1006/jmbi.1997.1525
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
The FASTA package of sequence comparison programs has been modified to provide accurate statistical estimates for local sequence similarity scores with gaps. These estimates are derived using the extreme value distribution from the mean and variance of the local similarity scores of unrelated sequences after the scores have been corrected for the expected effect of library sequence length. This approach allows accurate estimates to be calculated for both FASTA and Smith-Waterman similarity scores for protein/protein, DNA/DNA, and protein/translated-DNA comparisons. The accuracy of the statistical estimates is summarized for 54 protein families using FASTA and Smith-Waterman scores. Probability estimates calculated from the distribution of similarity scores are generally conservative, as are probabilities calculated using the Altschul-Gish lambda, K, and H parameters. The performance of several alternative methods for correcting similarity scores for library-sequence length was evaluated using 54 protein superfamilies from the PIR39 database and 110 protein families from the Prosite/SwissProt rel. 34 database. Both regression-scaled and Altschul-Gish scaled scores perform significantly better than unscaled Smith-Waterman or FASTA similarity scores. When the Prosite/SwissProt test set is used, regression-scaled scores perform slightly better; when the PIR database is used, Altschul-Gish scaled scores perform best. Thus, length-corrected similarity scores improve the sensitivity of database searches. Statistical parameters that are derived from the distribution of similarity scores from the thousands of unrelated sequences typically encountered in a database search provide accurate estimates of statistical significance that can be used to infer sequence homology. (C) 1998 Academic Press Limited.
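The regression-scaled estimate described in the abstract can be illustrated with a minimal Python sketch. It is not the FASTA package's implementation: the function names, the choice of a simple least-squares fit of score against the natural log of library sequence length, and the use of a single pooled residual variance are all assumptions made for illustration. The only fixed ingredient is the extreme value (Gumbel) distribution rescaled to mean 0 and variance 1, for which P(Z >= z) = 1 - exp(-exp(-pi*z/sqrt(6) - gamma)), with gamma denoting Euler's constant.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_length_regression(scores, lengths):
    """Least-squares fit of score = a + b * ln(length) over unrelated sequences.

    Returns (a, b, residual_variance). Assumption: one pooled residual
    variance is used for all lengths; the abstract does not specify this.
    """
    xs = [math.log(n) for n in lengths]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_s = sum(scores) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (s - mean_s) for x, s in zip(xs, scores))
    b = sxy / sxx
    a = mean_s - b * mean_x
    residuals = [s - (a + b * x) for x, s in zip(xs, scores)]
    variance = sum(r * r for r in residuals) / (k - 2)
    return a, b, variance


def z_score(score, length, a, b, variance):
    """Length-corrected score: distance from the regression line in SD units."""
    expected = a + b * math.log(length)
    return (score - expected) / math.sqrt(variance)


def evd_p_value(z):
    """P(Z >= z) for an extreme value distribution with mean 0, variance 1."""
    exponent = -math.pi / math.sqrt(6.0) * z - EULER_GAMMA
    return -math.expm1(-math.exp(exponent))


def expectation(z, library_size):
    """Expected number of unrelated library sequences scoring >= z by chance."""
    return library_size * evd_p_value(z)
```

In use, the regression would be fitted on scores from library sequences presumed unrelated to the query, after which `z_score` and `expectation` are applied to each candidate hit. `expm1` is used so that very small probabilities do not lose precision to rounding.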
Pages: 71-84
Number of pages: 14
Related papers
24 in total
[1]   ISSUES IN SEARCHING MOLECULAR SEQUENCE DATABASES [J].
ALTSCHUL, SF ;
BOGUSKI, MS ;
GISH, W ;
WOOTTON, JC .
NATURE GENETICS, 1994, 6 (02) :119-129
[2]  
Altschul SF, 1996, METHOD ENZYMOL, V266, P460
[3]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[4]  
[Anonymous], METHOD ENZYMOL
[5]  
[Anonymous], 1993, STAT DISTRIBUTIONS
[6]   THE SWISS-PROT PROTEIN-SEQUENCE DATA-BANK [J].
BAIROCH, A ;
BOECKMANN, B .
NUCLEIC ACIDS RESEARCH, 1991, 19 :2247-2248
[7]   PROSITE - A DICTIONARY OF SITES AND PATTERNS IN PROTEINS [J].
BAIROCH, A .
NUCLEIC ACIDS RESEARCH, 1991, 19 :2241-2245
[8]  
BARKER WC, 1990, METHOD ENZYMOL, V183, P31
[9]  
COLLINS JF, 1988, COMPUT APPL BIOSCI, V4, P67
[10]   PERFORMANCE EVALUATION OF AMINO-ACID SUBSTITUTION MATRICES [J].
HENIKOFF, S ;
HENIKOFF, JG .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 1993, 17 (01) :49-61