Improving the quality of protein similarity network clustering algorithms using the network edge weight distribution

Cited by: 32
Authors
Apeltsin, Leonard [1 ]
Morris, John H. [1 ]
Babbitt, Patricia C. [1 ,2 ]
Ferrin, Thomas E. [1 ,2 ]
Affiliations
[1] Univ Calif San Francisco, Dept Pharmaceut Chem, San Francisco, CA 94143 USA
[2] Univ Calif San Francisco, Dept Bioengn & Therapeut Sci, San Francisco, CA 94143 USA
Funding
National Institutes of Health (USA);
Keywords
EVOLUTION; FORCE;
DOI
10.1093/bioinformatics/btq655
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Clustering protein sequence data into functionally specific families is a difficult but important problem in biological research. One useful approach represents the sequence dataset as a protein similarity network and then clusters the network using advanced graph analysis techniques. Although many such network clustering algorithms have been developed over the past few years, comparing them is often difficult because performance is affected by the specifics of network construction. We investigate an important aspect of network construction used in analyzing protein superfamilies and present a heuristic approach for improving the performance of several algorithms.

Results: We analyzed how the performance of network clustering algorithms relates to thresholding the network prior to clustering. Our results, over four different datasets, show that for each input dataset there exists an optimal threshold range over which an algorithm generates its most accurate clustering output. Our results further show that the optimal threshold range correlates with the shape of the edge weight distribution of the input similarity network. We used this correlation to develop an automated threshold selection heuristic in order to optimally filter a similarity network prior to clustering. This heuristic allows researchers to process their protein datasets with runtime-efficient network clustering algorithms without sacrificing the clustering accuracy of the final results.
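
To make the idea of thresholding a similarity network before clustering concrete, here is a minimal Python sketch. It is not the authors' heuristic, which the abstract does not specify. The function names, the toy edge weights, and the bin width are all assumptions made for illustration, and connected components stand in for a real network clustering algorithm. The sketch only shows how removing edges below a cutoff changes the clusters, and how the edge weight distribution can be tabulated for inspection.

```python
from collections import defaultdict


def threshold_edges(edges, threshold):
    """Keep only edges whose weight is at least `threshold`."""
    return [(u, v, w) for u, v, w in edges if w >= threshold]


def connected_components(nodes, edges):
    """Group nodes into connected components (a stand-in for a clustering algorithm)."""
    adjacency = defaultdict(set)
    for u, v, _ in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal from each unvisited node.
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def weight_histogram(edges, bin_width):
    """Count edges per weight bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    counts = defaultdict(int)
    for _, _, w in edges:
        counts[int(w // bin_width)] += 1
    return dict(sorted(counts.items()))


# Hypothetical toy network: weights are made-up similarity scores.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [
    ("A", "B", 80), ("B", "C", 75), ("A", "C", 70),
    ("D", "E", 90), ("E", "F", 85),
    ("C", "D", 12), ("A", "F", 8),  # weak links between the two groups
]

unfiltered = connected_components(nodes, edges)
filtered = connected_components(nodes, threshold_edges(edges, 50))
histogram = weight_histogram(edges, 20)
```

In this toy network the unfiltered graph forms a single component, while a cutoff of 50 separates it into two groups. The histogram shows a gap between the weak and strong edges, which is the kind of distribution shape a threshold selection procedure could examine.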
Pages: 326-333
Number of pages: 8