Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing

Cited by: 54
Authors
Wittkop, Tobias [1, 2]
Baumbach, Jan [1, 3]
Lobo, Francisco P. [1, 4]
Rahmann, Sven [5]
Affiliations
[1] Univ Bielefeld, Bielefeld, Germany
[2] Univ Bielefeld, DFG Graduiertenkolleg Bioinformat, Bielefeld, Germany
[3] Ctr Biotechnol, Int Grad Sch Bioinformat & Genome Res, Bielefeld, Germany
[4] Univ Fed Minas Gerais, Lab Genet Bioquim, Belo Horizonte, MG, Brazil
[5] Univ Dortmund, D-44221 Dortmund, Germany
DOI
10.1186/1471-2105-8-396
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that have to be clustered efficiently, with constant or increased accuracy and at increased speed. Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well-suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192,187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets, is available at http://gi.cebitec.uni-bielefeld.de/comet/force/.
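To make the weighted cluster editing model named in the abstract concrete, the following is a minimal sketch of how the cost of a candidate clustering could be scored. It is not the FORCE heuristic itself and is not taken from the FORCE source code. The function name `cluster_editing_cost`, the dictionary-based similarity input, the threshold parameter, the convention that a missing pair has similarity 0, and the toy data are all assumptions made for illustration. The sketch assumes the common formulation in which two objects are connected when their similarity exceeds a threshold, deleting an edge costs the similarity minus the threshold, and inserting a missing edge costs the threshold minus the similarity.

```python
from itertools import combinations


def cluster_editing_cost(similarity, threshold, clusters):
    """Score a clustering under an assumed weighted cluster editing objective.

    similarity: dict mapping unordered pairs (u, v) to a similarity value;
                pairs that are absent are treated as similarity 0.0 (assumption).
    threshold:  pairs with similarity above this value count as edges.
    clusters:   iterable of disjoint sets that together cover all objects.
    """
    # Map every object to the index of the cluster it belongs to.
    label = {node: idx for idx, members in enumerate(clusters) for node in members}

    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        s = similarity.get((u, v), similarity.get((v, u), 0.0))
        is_edge = s > threshold
        same_cluster = label[u] == label[v]
        if is_edge and not same_cluster:
            # The edge must be deleted to separate the two clusters.
            cost += s - threshold
        elif not is_edge and same_cluster:
            # A missing edge must be inserted to make the cluster a clique.
            cost += threshold - s
    return cost


# Hypothetical toy data: four objects with pairwise similarity scores.
similarity = {("a", "b"): 9.0, ("b", "c"): 9.0, ("a", "c"): 3.0, ("c", "d"): 6.0}
threshold = 5.0

cost_abc_d = cluster_editing_cost(similarity, threshold, [{"a", "b", "c"}, {"d"}])
cost_ab_cd = cluster_editing_cost(similarity, threshold, [{"a", "b"}, {"c", "d"}])
cost_all = cluster_editing_cost(similarity, threshold, [{"a", "b", "c", "d"}])
```

Under these assumptions, a lower cost means fewer and cheaper edge modifications are needed to turn the thresholded similarity graph into a disjoint union of cliques. A heuristic such as FORCE searches for a clustering with low cost rather than enumerating all of them.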
Pages: 12