Ultra-fast sequence clustering from similarity networks with SiLiX

Cited by: 220
Authors
Miele, Vincent [1]
Penel, Simon [1]
Duret, Laurent [1]
Affiliations
[1] Univ Lyon 1, CNRS, Lab Biometrie & Biol Evolut, INRIA, UMR5558, F-69622 Villeurbanne, France
Source
BMC BIOINFORMATICS | 2011 / Vol. 12
Keywords
PROTEIN; EFFICIENT; ALGORITHMS; FAMILIES;
DOI
10.1186/1471-2105-12-116
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: The number of gene sequences available for comparative genomics is increasing extremely quickly. A current challenge is to handle this huge number of sequences in order to build families of homologous sequences in a reasonable time. Results: We present the software package SiLiX, which implements a novel method that reconsiders single linkage clustering with a graph theoretical approach. A parallel version of the algorithms is also presented. As a demonstration of the ability of our software, we clustered more than 3 million sequences from about 2 billion BLAST hits in 7 minutes, with a high clustering quality, both in terms of sensitivity and specificity. Conclusions: Compared with state-of-the-art software, SiLiX presents the best up-to-date capabilities to face the problem of clustering large collections of sequences. SiLiX is freely available at http://lbbe.univ-lyon1.fr/SiLiX.
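Illustrative note: single linkage clustering on a similarity network amounts to finding the connected components of the graph whose edges are the accepted similarity pairs. The sketch below shows that idea with a generic union-find structure. It is not SiLiX's implementation; the function name `single_linkage_families` and the input format (an iterable of already-filtered sequence id pairs) are assumptions made for the example.

```python
from collections import defaultdict


def single_linkage_families(pairs):
    """Group sequence ids into connected components of the similarity network.

    `pairs` is an iterable of (id_a, id_b) tuples, assumed to be hits that
    already passed whatever similarity thresholds the user chose.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then locate the root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each accepted pair merges the two families its members belong to.
    for a, b in pairs:
        union(a, b)

    families = defaultdict(set)
    for seq in list(parent):
        families[find(seq)].add(seq)
    return sorted(sorted(members) for members in families.values())


if __name__ == "__main__":
    hits = [("seqA", "seqB"), ("seqB", "seqC"), ("seqD", "seqE")]
    for family in single_linkage_families(hits):
        print(family)
```

Because pairs are consumed one at a time and only the `parent` map is kept in memory, this style of processing does not need the full network loaded at once, which is one reason a streaming approach suits very large hit lists.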
Pages: 9
Related papers
32 in total
[1]  
Alsuwaiyel M H., 1998, Algorithms: Design Techniques and Analysis
[2]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[3]   Using Sequence Similarity Networks for Visualization of Relationships Across Diverse Protein Superfamilies [J].
Atkinson, Holly J. ;
Morris, John H. ;
Ferrin, Thomas E. ;
Babbitt, Patricia C. .
PLOS ONE, 2009, 4 (02)
[4]   Animal mitochondrial genomes [J].
Boore, JL .
NUCLEIC ACIDS RESEARCH, 1999, 27 (08) :1767-1780
[5]   Network formation and anti-coordination games [J].
Bramoullé, Y ;
López-Pintado, D ;
Goyal, S ;
Vega-Redondo, F .
INTERNATIONAL JOURNAL OF GAME THEORY, 2004, 33 (01) :1-19
[6]   DIVIDE-AND-CONQUER-BASED OPTIMAL PARALLEL ALGORITHMS FOR SOME GRAPH PROBLEMS ON EREW PRAM MODEL [J].
DAS, SK ;
DEO, N .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, 1988, 35 (03) :312-322
[7]   A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database [J].
Dehal, Paramvir S. ;
Boore, Jeffrey L. .
BMC BIOINFORMATICS, 2006, 7 (1)
[8]   An efficient algorithm for large-scale detection of protein families [J].
Enright, AJ ;
Van Dongen, S ;
Ouzounis, CA .
NUCLEIC ACIDS RESEARCH, 2002, 30 (07) :1575-1584
[9]  
Fiat A., 1998, ONLINE ALGORITHMS ST, P1442
[10]   The Pfam protein families database [J].
Finn, Robert D. ;
Mistry, Jaina ;
Tate, John ;
Coggill, Penny ;
Heger, Andreas ;
Pollington, Joanne E. ;
Gavin, O. Luke ;
Gunasekaran, Prasad ;
Ceric, Goran ;
Forslund, Kristoffer ;
Holm, Liisa ;
Sonnhammer, Erik L. L. ;
Eddy, Sean R. ;
Bateman, Alex .
NUCLEIC ACIDS RESEARCH, 2010, 38 :D211-D222