An artificial intelligence approach to motif discovery in protein sequences: Application to steroid dehydrogenases

Cited by: 54
Authors
Bailey, TL
Baker, ME
Elkan, CP
Affiliations
[1] 0114 UNIV CALIF SAN DIEGO, DEPT COMP SCI & ENGN, LA JOLLA, CA 92093 USA
[2] 0623B UNIV CALIF SAN DIEGO, DEPT MED, LA JOLLA, CA 92093 USA
DOI
10.1016/S0960-0760(97)00013-7
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
MEME (Multiple Expectation-maximization for Motif Elicitation) is a unique new software tool that uses artificial intelligence techniques to discover motifs shared by a set of protein sequences in a fully automated manner. This paper is the first detailed study of the use of MEME to analyse a large, biologically relevant set of sequences, and to evaluate the sensitivity and accuracy of MEME in identifying structurally important motifs. For this purpose, we chose the short-chain alcohol dehydrogenase superfamily because it is large and phylogenetically diverse, providing a test of how well MEME can work on sequences with low amino acid similarity. Moreover, this dataset contains enzymes of biological importance, and because several enzymes have known X-ray crystallographic structures, we can test the usefulness of MEME for structural analysis. The first six motifs from MEME map onto structurally important α-helices and β-strands on Streptomyces hydrogenans 20β-hydroxysteroid dehydrogenase. We also describe MAST (Motif Alignment Search Tool), which conveniently uses output from MEME for searching databases such as SWISS-PROT and Genpept. MAST provides statistical measures that permit a rigorous evaluation of the significance of database searches with individual motifs or groups of motifs. A database search of Genpept90 by MAST with the log-odds matrix of the first six motifs obtained from MEME yields a bimodal output, demonstrating the selectivity of MAST. We show for the first time, using primary sequence analysis, that bacterial sugar epimerases are homologs of short-chain dehydrogenases. MEME and MAST will be increasingly useful as genome sequencing provides large datasets of phylogenetically divergent sequences of biomedical interest. © 1997 Elsevier Science Ltd.
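The abstract refers to searching a database with a log-odds matrix but does not spell out what such a matrix is. The following is a minimal illustrative sketch of the general idea only, not the procedure used by MEME or MAST: it builds a position-specific log-odds matrix from a few aligned motif instances and slides it along a sequence to find the best-scoring window. The function names, the uniform background frequency, the pseudocount, and the toy sequences are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_log_odds(instances, pseudocount=1.0):
    """Build a log-odds matrix from equal-length, ungapped motif instances.

    Assumption: a uniform background (1/20 per residue); a real tool would
    typically estimate background frequencies from the data.
    """
    width = len(instances[0])
    background = 1.0 / len(AMINO_ACIDS)
    matrix = []
    for pos in range(width):
        # Start every residue at the pseudocount so no probability is zero.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for inst in instances:
            counts[inst[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return matrix


def best_match(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    best_score, best_start = float("-inf"), -1
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[i][sequence[start + i]] for i in range(width))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


# Toy, made-up motif instances and query sequence (not data from the paper).
toy_instances = ["GASSGIG", "GAASGIG", "GGSSGIG"]
toy_matrix = build_log_odds(toy_instances)
toy_score, toy_start = best_match(toy_matrix, "MKLTGASSGIGKA")
```

A window that resembles the motif instances scores above zero, while unrelated windows tend to score below zero. Assessing whether a score is statistically significant, as the abstract says MAST does, is a separate step not shown here.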
Pages: 29-44
Number of pages: 16