Comparing clustering and pre-processing in taxonomy analysis

Cited by: 65
Authors
Bonder, Marc J. [1 ,2 ]
Abeln, Sanne [2 ]
Zaura, Egija [1 ]
Brandt, Bernd W. [1 ]
Affiliations
[1] Univ Amsterdam, Dept Prevent Dent, Acad Ctr Dent Amsterdam ACTA, NL-1012 WX Amsterdam, Netherlands
[2] Vrije Univ Amsterdam, Ctr Integrat Bioinformat IBIVU, Amsterdam, Netherlands
Keywords
RARE BIOSPHERE; SEQUENCES; MICROBIOME; GENERATION; DIVERSITY; WRINKLES; PROTEIN; SEARCH; BLAST; SETS;
DOI
10.1093/bioinformatics/bts552
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Massively parallel sequencing allows for rapid sequencing of large numbers of sequences in just a single run. Thus, 16S ribosomal RNA (rRNA) amplicon sequencing of complex microbial communities has become possible. The sequenced 16S rRNA fragments (reads) are clustered into operational taxonomic units and taxonomic categories are assigned. Recent reports suggest that data pre-processing should be performed before clustering. We assessed combinations of data pre-processing steps and clustering algorithms on cluster accuracy for oral microbial sequence data.

Results: The number of clusters varied up to two orders of magnitude depending on pre-processing. Pre-processing using both denoising and chimera checking resulted in a number of clusters that was closest to the number of species in the mock dataset (25 versus 15). Based on run time, purity and normalized mutual information, we could not identify a single best clustering algorithm. The differences in clustering accuracy among the algorithms after the same pre-processing were minor compared with the differences in accuracy among different pre-processing steps.
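The abstract names purity and normalized mutual information (NMI) as accuracy measures for comparing a clustering against known species labels. The sketch below shows one common way to compute both. It is not the authors' code: the function names, the toy labels, and the choice to normalize mutual information by the arithmetic mean of the two entropies are all assumptions made here for illustration, and the paper may use a different normalization.

```python
from collections import Counter
from math import log


def purity(true_labels, cluster_ids):
    """Fraction of reads that carry the majority label of their cluster."""
    joint = Counter(zip(cluster_ids, true_labels))
    majority = {}
    for (cluster, _label), count in joint.items():
        majority[cluster] = max(majority.get(cluster, 0), count)
    return sum(majority.values()) / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def nmi(true_labels, cluster_ids):
    """Mutual information divided by the mean of the two entropies.

    Assumption: arithmetic-mean normalization; other variants exist.
    """
    n = len(true_labels)
    joint = Counter(zip(cluster_ids, true_labels))
    cluster_sizes = Counter(cluster_ids)
    label_sizes = Counter(true_labels)
    mi = sum(
        (k / n) * log(k * n / (cluster_sizes[c] * label_sizes[t]))
        for (c, t), k in joint.items()
    )
    denom = (entropy(cluster_ids) + entropy(true_labels)) / 2
    # Both partitions trivial (one group each): treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0


# Hypothetical toy data: six reads, three species, three clusters.
species = ["A", "A", "A", "B", "B", "C"]
clusters = [0, 0, 1, 1, 1, 2]
print(purity(species, clusters))
print(nmi(species, clusters))
```

Purity rewards clusters dominated by one species but is trivially maximized by putting every read in its own cluster, whereas NMI also penalizes over-splitting, which is presumably why the two are reported together.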
Pages: 2891-2897
Page count: 7