Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier

Cited by: 138
Authors
Sandberg, R [1]
Winberg, G
Bränden, CI
Kaske, A
Ernberg, I
Cöster, J
Affiliations
[1] Karolinska Inst, Ctr Microbiol & Tumor Biol, S-17177 Stockholm, Sweden
[2] Virtual Genet Lab AB, S-17177 Stockholm, Sweden
DOI
10.1101/gr.186401
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Bacterial genomes have diverged during evolution, resulting in clear-cut differences in their nucleotide composition, such as their GC content. The analysis of complete sequences of bacterial genomes also reveals the presence of nonrandom sequence variation, manifest in the frequency profile of specific short oligonucleotides. These frequency profiles constitute highly specific genomic signatures. Based on these differences in oligonucleotide frequency between bacterial genomes, we investigated the possibility of predicting the genome of origin for a specific genomic sequence. To this end, we developed a naive Bayesian classifier and systematically analyzed 28 eubacterial and archaeal genomes. We found that sequences as short as 400 bases could be correctly classified with an accuracy of 85%. We then applied the classifier to the identification of horizontal gene transfer events in whole-genome sequences and demonstrated the validity of our approach by correctly predicting the transfer of both the superoxide dismutase (sodC) and the bioC gene from Haemophilus influenzae to Neisseria meningitidis, correctly identifying both the donor and recipient species. We believe that this classification methodology could be a valuable tool in biodiversity studies.
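The classification idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the oligonucleotide length, the smoothing scheme, or the prior. The code therefore assumes a uniform prior over genomes, add-one (pseudocount) smoothing, and a word length `k` chosen by the caller. The function names `kmer_counts`, `train`, and `classify`, and the two toy "genomes", are hypothetical and exist only for illustration.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts

def train(genomes, k, pseudocount=1.0):
    """Estimate log P(k-mer | genome) for every genome.

    The pseudocount (an assumption, not taken from the abstract) keeps
    unseen k-mers from getting zero probability.
    """
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    model = {}
    for name, seq in genomes.items():
        counts = kmer_counts(seq, k)
        total = sum(counts.values()) + pseudocount * len(all_kmers)
        model[name] = {
            m: math.log((counts[m] + pseudocount) / total) for m in all_kmers
        }
    return model

def classify(model, fragment, k):
    """Return the best-scoring genome and all log-likelihood scores.

    Naive Bayes treats the k-mers of the fragment as independent, so the
    score is a sum of log probabilities. With a uniform prior (assumed),
    the prior term is the same for every genome and is omitted.
    """
    counts = kmer_counts(fragment, k)
    scores = {
        name: sum(n * logp[m] for m, n in counts.items())
        for name, logp in model.items()
    }
    return max(scores, key=scores.get), scores

# Toy data: two synthetic "genomes" with opposite base composition.
toy_genomes = {
    "at_rich": "ATTATAATTAAT" * 50,
    "gc_rich": "GCCGCGGCCGGC" * 50,
}
toy_model = train(toy_genomes, k=2)
best, _ = classify(toy_model, "TTAATATA", k=2)
print(best)
```

On real data the `genomes` dictionary would hold complete genome sequences, and horizontal transfer candidates could be sought by classifying windows of a genome and flagging those that score higher under another species' profile; that use is only outlined here, not implemented.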
Pages: 1404-1409
Page count: 6