Distribution patterns of over-represented k-mers in non-coding yeast DNA

Cited by: 33
Authors
Hampson, S
Kibler, D
Baldi, P [1]
Affiliations
[1] Univ Calif Irvine, Dept Comp & Informat Sci, Inst Genomics & Bioinformat, Irvine, CA 92697 USA
[2] Univ Calif Irvine, Coll Med, Dept Biol Chem, Irvine, CA 92697 USA
DOI
10.1093/bioinformatics/18.4.513
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Over-represented k-mers in genomic DNA regions are often of particular biological interest. For example, over-represented k-mers in co-regulated families of genes are associated with the DNA binding sites of transcription factors. To measure over-representation, we introduce a statistical background model based on single mismatches, and apply it to the pooled 500 bp ORF Upstream Regions (USRs) of yeast. More importantly, we investigate the context and spatial distribution of over-represented k-mers in yeast USRs.

Results: Single- and double-stranded spatial distributions of most over-represented k-mers are highly non-random, and predominantly cluster into a small number of classes that are robust with respect to over-representation measures. Specifically, we show that the three most common distribution patterns can be related to DNA structure, function, and evolution and correspond to: (a) homologous ORF clusters associated with sharply localized distributions; (b) regulatory elements associated with a symmetric broad hill-shaped distribution in the 50-200 bp USR; and (c) runs of As, Ts, and ATs associated with a broad hill-shaped distribution also in the 50-200 bp USR, with extreme structural properties. Analyses of over-representation, homology, localization, and DNA structure are essential components of a general data-mining approach to finding biologically important k-mers in raw genomic DNA and understanding the 'lexicon' of regulatory regions.

Contact: hampson@ics.uci.edu; kibler@ics.uci.edu; pfbaldi@ics.uci.edu.
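
The abstract names two computations without giving their details: an over-representation score measured against a single-mismatch background, and the spatial distribution of a k-mer along the upstream regions. The Python sketch below is only one plausible reading of those ideas, not the authors' statistic. It assumes the score is the observed count divided by the mean count of the 3k single-mismatch neighbours, and it assumes positions are measured from the start of each sequence. The function names, the pseudocount, and the bin width are all hypothetical choices made for illustration.

```python
from collections import Counter

BASES = "ACGT"


def count_kmers(seqs, k):
    """Count every overlapping k-mer across all sequences (single strand)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def single_mismatch_neighbours(kmer):
    """Yield the 3*k k-mers that differ from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for b in BASES:
            if b != base:
                yield kmer[:i] + b + kmer[i + 1:]


def mismatch_ratio(kmer, counts, pseudocount=1.0):
    """Observed count over the mean count of the single-mismatch neighbours.

    Assumption: this ratio stands in for the paper's background model, whose
    exact form is not given in the abstract. The pseudocount only avoids
    division by zero when no neighbour occurs.
    """
    neighbours = list(single_mismatch_neighbours(kmer))
    background = sum(counts[n] for n in neighbours) / len(neighbours)
    return counts[kmer] / (background + pseudocount)


def position_histogram(seqs, kmer, bin_width=50):
    """Bin the start positions of `kmer` occurrences along the sequences."""
    k = len(kmer)
    hist = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            if s[i:i + k] == kmer:
                hist[i // bin_width] += 1
    return dict(hist)


if __name__ == "__main__":
    # Toy sequences, not real yeast upstream regions.
    toy = ["ACGTACGTAC", "TTACGTTT"]
    counts = count_kmers(toy, 4)
    print(mismatch_ratio("ACGT", counts))
    print(position_histogram(toy, "ACGT", bin_width=4))
```

With pooled 500 bp regions, a histogram of this kind is what would reveal whether a k-mer is sharply localized or spread in a broad hill shape, as the Results paragraph describes. A double-stranded variant would also count the reverse complement of each k-mer.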
Pages: 513-528
Page count: 16