DNA barcode analysis: a comparison of phylogenetic and statistical classification methods

Cited by: 120
Authors
Austerlitz, Frederic [2, 3, 4]
David, Olivier [5]
Schaeffer, Brigitte [5]
Bleakley, Kevin [7, 8, 9]
Olteanu, Madalina [5]
Leblois, Raphael [1]
Veuille, Michel [1, 6]
Laredo, Catherine [5, 10, 11]
Affiliations
[1] MNHN, CNRS, UMR 5202, Lab Origine Struct Evolut Biodivers, F-75005 Paris, France
[2] CNRS, Lab Ecol Systemat & Evolut, UMR 8079, F-91405 Orsay, France
[3] Univ Paris Sud, F-91405 Orsay, France
[4] AgroParisTech, F-75231 Paris, France
[5] INRA, UR341, F-78350 Jouy En Josas, France
[6] Ecole Prat Hautes Etud, Lab Biol Integrat Populat, Paris, France
[7] Ctr Rech, Inst Curie, F-75248 Paris, France
[8] INSERM, U900, F-75248 Paris, France
[9] Ecole Mines Paris, Ctr Computat Biol, F-77305 Fontainebleau, France
[10] Univ Paris 06, Lab Probabil & Modeles Aleatoires, CNRS, UMR 7599, F-75005 Paris, France
[11] Univ Paris 07, Lab Probabil & Modeles Aleatoires, CNRS, UMR 7599, F-75005 Paris, France
Source
BMC BIOINFORMATICS | 2009 / Vol. 10
Keywords
LIFE; IDENTIFICATION; RATES;
DOI
10.1186/1471-2105-10-S14-S10
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Background: DNA barcoding aims to assign individuals to given species according to their sequence at a small locus, generally part of the CO1 mitochondrial gene. Amongst other issues, this raises the question of how to deal with within-species genetic variability and potential transpecific polymorphism. In this context, we examine several assignation methods belonging to two main categories: (i) phylogenetic methods (neighbour-joining and PhyML) that attempt to account for the genealogical framework of DNA evolution and (ii) supervised classification methods (k-nearest neighbour, CART, random forest and kernel methods). These methods range from basic to elaborate. We investigated the ability of each method to correctly classify query sequences drawn from samples of related species using both simulated and real data. Simulated data sets were generated using coalescent simulations in which we varied the genealogical history, mutation parameter, sample size and number of species.
Results: No method was found to be the best in all cases. The simplest method of all, "one nearest neighbour", was found to be the most reliable with respect to changes in the parameters of the data sets. The parameter most influencing the performance of the various methods was molecular diversity of the data. Addition of genetically independent loci - nuclear genes - improved the predictive performance of most methods.
Conclusion: The study implies that taxonomists can influence the quality of their analyses either by choosing a method best-adapted to the configuration of their sample, or, given a certain method, increasing the sample size or altering the amount of molecular diversity. This can be achieved either by sequencing more mtDNA or by sequencing additional nuclear genes. In the latter case, they may also have to modify their data analysis method.
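The "one nearest neighbour" method named in the abstract can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length and uses the proportion of mismatched sites (p-distance) as the distance measure; the abstract does not state which distance the authors used, so this choice, the function names and the toy sequences are all hypothetical and this is not the authors' implementation.

```python
def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ.

    Assumption: both sequences are already aligned and of equal length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def one_nearest_neighbour(query, references):
    """Assign the query to the species of its closest reference sequence.

    `references` is a list of (sequence, species_label) pairs.
    Ties are resolved in favour of the first closest reference.
    """
    _, species = min(references, key=lambda ref: p_distance(query, ref[0]))
    return species


# Toy reference set (made-up sequences, for illustration only).
references = [
    ("ACGTACGTAC", "species_A"),
    ("ACGTACGTAT", "species_A"),
    ("TCGAACGGAC", "species_B"),
    ("TCGAACGGAT", "species_B"),
]

print(one_nearest_neighbour("ACGTACGTAA", references))
```

With several independent loci, one simple option under the same assumptions would be to concatenate the aligned loci before computing the distance; whether that matches the approach taken in the paper is not stated in the abstract.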
Pages: 13
Related papers
37 in total
[1]   A step toward barcoding life: A model-based, decision-theoretic method to assign genes to preexisting species groups [J].
Abdo, Zaid ;
Golding, G. Brian .
SYSTEMATIC BIOLOGY, 2007, 56 (01) :44-56
[2]   SmcHD1, containing a structural-maintenance-of-chromosomes hinge domain, has a critical role in X inactivation [J].
Blewitt, Marnie E. ;
Gendrel, Anne-Valerie ;
Pang, Zhenyi ;
Sparrow, Duncan B. ;
Whitelaw, Nadia ;
Craig, Jeffrey M. ;
Apedaile, Anwyn ;
Hilton, Douglas J. ;
Dunwoodie, Sally L. ;
Brockdorff, Neil ;
Kay, Graham F. ;
Whitelaw, Emma .
NATURE GENETICS, 2008, 40 (05) :663-669
[3]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[4]  
CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411
[5]   Profile hidden Markov models [J].
Eddy, SR .
BIOINFORMATICS, 1998, 14 (09) :755-763
[6]   Limited performance of DNA barcoding in a diverse community of tropical butterflies [J].
Elias, Marianne ;
Hill, Ryan I. ;
Willmott, Keith R. ;
Dasmahapatra, Kanchon K. ;
Brower, Andrew V. Z. ;
Mallet, James ;
Jiggins, Chris D. .
PROCEEDINGS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 2007, 274 (1627) :2881-2889
[7]  
Fix E., 1951, Discriminatory analysis: nonparametric discrimination consistency properties, V29, P262, DOI DOI 10.2307/1403797
[8]   A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood [J].
Guindon, S ;
Gascuel, O .
SYSTEMATIC BIOLOGY, 2003, 52 (05) :696-704
[9]   DNA barcodes distinguish species of tropical Lepidoptera [J].
Hajibabaei, M ;
Janzen, DH ;
Burns, JM ;
Hallwachs, W ;
Hebert, PDN .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2006, 103 (04) :968-971
[10]  
Hastie T., 2009, The elements of statistical learning: data mining, inference, and prediction, P9