CD-HIT: accelerated for clustering the next-generation sequencing data

Cited by: 7070
Authors
Fu, Limin [1]
Niu, Beifang [1]
Zhu, Zhengwei [1]
Wu, Sitao [1]
Li, Weizhong [1]
Affiliations
[1] Univ Calif San Diego, Ctr Res Biol Syst, La Jolla, CA 92093 USA
Keywords
PROTEIN; IDENTIFICATION
DOI
10.1093/bioinformatics/bts565
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ~24 cores and a quasi-linear speedup for up to ~8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.
Pages: 3150-3152
Page count: 3