GANN: Genetic algorithm neural networks for the detection of conserved combinations of features in DNA

Cited by: 20
Authors
Beiko, RG [1]
Charlebois, RL
Affiliations
[1] Univ Queensland, Inst Mol Biosci, Brisbane, Qld 4072, Australia
[2] Univ Ottawa, Dept Biol, Ottawa, ON K1N 6N5, Canada
[3] Dalhousie Univ, Dept Biochem & Mol Biol, Halifax, NS B3H 1X5, Canada
DOI
10.1186/1471-2105-6-36
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: The multitude of motif detection algorithms developed to date have largely focused on the detection of patterns in primary sequence. Since sequence-dependent DNA structure and flexibility may also play a role in protein-DNA interactions, the simultaneous exploration of sequence- and structure-based hypotheses about the composition of binding sites and the ordering of features in a regulatory region should be considered as well. The consideration of structural features requires the development of new detection tools that can deal with data types other than primary sequence.
Results: GANN (available at http://bioinformatics.org.au/gann) is a machine learning tool for the detection of conserved features in DNA. The software suite contains programs to extract different regions of genomic DNA from flat files and convert these sequences to indices that reflect sequence and structural composition or the presence of specific protein binding sites. The machine learning component allows the classification of different types of sequences based on subsamples of these indices, and can identify the best combinations of indices and machine learning architecture for sequence discrimination. Another key feature of GANN is the replicated splitting of data into training and test sets, and the implementation of negative controls. In validation experiments, GANN successfully merged important sequence and structural features to yield good predictive models for synthetic and real regulatory regions.
Conclusion: GANN is a flexible tool that can search through large sets of sequence and structural feature combinations to identify those that best characterize a set of sequences.
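The replicated splitting and negative controls mentioned in the abstract can be illustrated with a minimal sketch. This is not GANN's implementation: the function names `replicated_splits` and `shuffled_labels`, the 25% test fraction, and the use of label shuffling as the negative control are all assumptions made for illustration.

```python
import random


def replicated_splits(n_items, n_replicates, test_fraction=0.25, seed=0):
    """Yield (train_idx, test_idx) pairs, one per replicate.

    Hypothetical helper: each replicate uses an independent random shuffle,
    so a model can be scored on several different train/test partitions.
    """
    rng = random.Random(seed)
    n_test = max(1, int(round(n_items * test_fraction)))
    for _ in range(n_replicates):
        order = list(range(n_items))
        rng.shuffle(order)
        # First n_test shuffled indices form the test set, the rest train.
        yield sorted(order[n_test:]), sorted(order[:n_test])


def shuffled_labels(labels, seed=0):
    """Return a permuted copy of the class labels.

    Assumed form of negative control: class counts are preserved, but the
    link between each sequence and its label is broken.
    """
    rng = random.Random(seed)
    permuted = list(labels)
    rng.shuffle(permuted)
    return permuted


if __name__ == "__main__":
    labels = [0] * 10 + [1] * 10  # toy data: 10 sequences per class
    control = shuffled_labels(labels, seed=1)
    for train_idx, test_idx in replicated_splits(len(labels), n_replicates=3):
        print(len(train_idx), "training items,", len(test_idx), "test items")
```

A model trained on the permuted labels should perform near chance on each test set; a model that scores well on the real labels but not on the control is less likely to be fitting noise.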
Pages: 12