FASTA-SWAP and FASTA-PAT: Pattern database searches using combinations of aligned amino acids, and a novel scoring theory

Cited by: 10
Authors
Ladunga, I [1]
Wiese, BA [1]
Smith, RF [1]
Affiliations
[1] EOTVOS LORAND UNIV, DEPT GENET, H-1088 BUDAPEST, HUNGARY
Keywords
protein function identification; amino acid sequence pattern; protein database search; scoring theory; FASTA
DOI
10.1006/jmbi.1996.0362
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010 ; 081704 ;
Abstract
We introduce two new pattern database search tools that utilize statistical significance and information theory to improve protein function identification. Both the general pattern scoring theory with the specific matrices introduced here and the low redundancy of pattern databases increase search sensitivity and selectivity. Pattern scoring preferentially rewards matches at conserved positions in a pattern with higher scores than matches at variable positions, and assigns more negative scores to mismatches at conserved positions than to mismatches at variable positions. The theory of pattern scoring can be used to create log-odds pattern scores for patterns derived from any set of multiple alignments. This theoretical framework can be used to adapt existing sequence database search tools to pattern analysis. Our FASTA-SWAP and FASTA-PAT tools are extensions of the FASTA program that search a sequence query against a pattern database. In the first step, FASTA-SWAP searches the diagonals of the query sequence and the library pattern for high-scoring segments, while FASTA-PAT performs an extended version of hashing. In the second step, both methods refine the alignments and the scores using dynamic programming. The tools utilize an extremely compact binary representation of all possible combinations of amino acid residues in aligned positions. Our FASTA-SWAP and FASTA-PAT tools are well suited for functional identification of distant relatives that may be missed by sequence database search methods. FASTA-SWAP and FASTA-PAT searches can be performed using our World-Wide Web Server (http://dot.imgen.bcm.tmc.edu:9331/seq-search/Options/fastapat.htm1). (C) 1996 Academic Press Limited
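The abstract describes two ideas that a small sketch may help make concrete: packing the set of residues observed in one aligned column into a compact binary value, and scoring a query residue so that conserved columns reward matches and penalize mismatches more strongly than variable columns. The Python below is only an illustration under stated assumptions. The 20-bit layout, the names `encode_column`, `column_size`, and `toy_score`, and the uniform 1/20 background frequency are all hypothetical; they are not the paper's actual encoding or its log-odds matrices.

```python
import math

# Assumed residue ordering; the paper's actual bit layout is not given in the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}


def encode_column(residues):
    """Pack the set of residues seen in one aligned column into a 20-bit integer."""
    mask = 0
    for aa in residues:
        mask |= BIT[aa]
    return mask


def column_size(mask):
    """Number of distinct residues allowed at this pattern position."""
    return bin(mask).count("1")


def toy_score(query_residue, mask):
    """Toy log-odds-style score (an assumption, not the published matrices).

    A match is scored as log2((1/k) / (1/20)) for a column with k allowed
    residues, so k = 1 (fully conserved) scores highest. A mismatch gets
    the negated value, so it is most negative at conserved columns.
    """
    k = column_size(mask)
    if k == 0:
        raise ValueError("pattern column must contain at least one residue")
    magnitude = math.log2(20 / k)
    return magnitude if BIT[query_residue] & mask else -magnitude


conserved = encode_column("L")      # only leucine observed
variable = encode_column("ILVM")    # four hydrophobic residues observed

print(toy_score("L", conserved), toy_score("L", variable))  # match scores
print(toy_score("A", conserved), toy_score("A", variable))  # mismatch scores
```

With this toy scheme, a single bitwise AND decides whether the query residue belongs to the column, which is the practical appeal of a binary encoding of residue combinations.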
Pages: 840-854
Number of pages: 15