Microbial comparative pan-genomics using binomial mixture models

Cited by: 65
Authors
Snipen, Lars [1 ]
Almoy, Trygve [1 ]
Ussery, David W. [2 ]
Affiliations
[1] Norwegian Univ Life Sci, Dept Chem Biotechnol & Food Sci, As, Norway
[2] Tech Univ Denmark, Ctr Biol Sequence Anal, DK-2800 Lyngby, Denmark
Source
BMC GENOMICS | 2009 / Vol. 10
Keywords
ESCHERICHIA-COLI;
DOI
10.1186/1471-2164-10-385
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: The size of the core- and pan-genome of bacterial species is a topic of increasing interest due to the growing number of sequenced prokaryote genomes, many from the same species. Attempts to estimate these quantities have been made, using regression methods or mixture models. We extend the latter approach by using statistical ideas developed for capture-recapture problems in ecology and epidemiology. Results: We estimate core- and pan-genome sizes for 16 different bacterial species. The results reveal a complex dependency structure for most species, manifested as heterogeneous detection probabilities. Estimated pan-genome sizes range from small (around 2600 gene families) in Buchnera aphidicola to large (around 43000 gene families) in Escherichia coli. Results for Escherichia coli show that as more data become available, a larger diversity is estimated, indicating an extensive pool of rarely occurring genes in the population. Conclusion: Analyzing pan-genomics data with binomial mixture models is a way to handle dependencies between genomes, which we find are always present. A bottleneck in the estimation procedure is the annotation of rarely occurring genes.
Pages: 8