Efficient functional clustering of protein sequences using the Dirichlet process

Cited by: 12
Author
Brown, Duncan P. [1, 2]
Affiliations
[1] Univ Calif Berkeley, Dept Bioengn, San Francisco, CA 94158 USA
[2] Merck & Co Inc, San Francisco, CA 94158 USA
DOI
10.1093/bioinformatics/btn244
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Automatic clustering of protein sequences is an important problem in computational biology. The recent explosion in genome sequences has given biological researchers a vast number of novel protein sequences. However, the majority of these sequences have no experimental evidence for their molecular function in the cell, and the responsibility for correctly annotating these sequences falls upon the bioinformatics community. Ideally, we would like to be able to group sequences of similar or identical molecular function in an automatic fashion, without relying on experimental evidence.
Results: In this article I present a novel probabilistic framework that models subfamilies within a known protein family. Given a multiple sequence alignment, the model uses Dirichlet mixture densities to estimate amino acid preferences within subfamily clusters, and places a Dirichlet process prior on the overall set of clusters. Based on results from several datasets, the model breaks data accurately into functional subgroups.
Pages: 1765-1771
Page count: 7