Distribution patterns of over-represented k-mers in non-coding yeast DNA

Cited by: 33

Authors:

Hampson, S

Kibler, D

Baldi, P [1]

Affiliations:

[1] Univ Calif Irvine, Dept Comp & Informat Sci, Inst Genomics & Bioinformat, Irvine, CA 92697 USA

[2] Univ Calif Irvine, Coll Med, Dept Biol Chem, Irvine, CA 92697 USA

Source:

BIOINFORMATICS | 2002 / Volume 18 / Issue 4

Keywords:

DOI:

10.1093/bioinformatics/18.4.513

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Motivation: Over-represented k-mers in genomic DNA regions are often of particular biological interest. For example, over-represented k-mers in co-regulated families of genes are associated with the DNA binding sites of transcription factors. To measure over-representation, we introduce a statistical background model based on single-mismatches, and apply it to the pooled 500 bp ORF Upstream Regions (USRs) of yeast. More importantly, we investigate the context and spatial distribution of over-represented k-mers in yeast USRs.

Results: Single and double-stranded spatial distributions of most over-represented k-mers are highly non-random, and predominantly cluster into a small number of classes that are robust with respect to over-representation measures. Specifically, we show that the three most common distribution patterns can be related to DNA structure, function, and evolution and correspond to: (a) homologous ORF clusters associated with sharply localized distributions; (b) regulatory elements associated with a symmetric broad hill-shaped distribution in the 50-200 bp USR; and (c) runs of As, Ts, and ATs associated with a broad hill-shaped distribution also in the 50-200 bp USR, with extreme structural properties. Analysis of over-representation, homology, localization, and DNA structure are essential components of a general data-mining approach to finding biologically important k-mers in raw genomic DNA and understanding the 'lexicon' of regulatory regions.

Contact: hampson@ics.uci.edu; kibler@ics.uci.edu; pfbaldi@ics.uci.edu.
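The abstract does not define its single-mismatch background model, so the following is only a rough sketch of the general idea, not the authors' statistic. It assumes a simple ratio: the observed count of a k-mer divided by the mean count of the 3k k-mers that differ from it at exactly one position. The function names, the pseudocount, and the toy sequences are all hypothetical and chosen for illustration.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count every overlapping k-mer across all sequences (single strand)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def single_mismatch_neighbors(kmer, alphabet="ACGT"):
    """Yield the 3k k-mers that differ from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for b in alphabet:
            if b != base:
                yield kmer[:i] + b + kmer[i + 1:]

def mismatch_ratio(kmer, counts, pseudocount=1.0):
    """Observed count over the mean count of single-mismatch neighbors.

    Assumption: this ratio stands in for an over-representation score;
    the pseudocount only avoids division by zero.
    """
    neighbors = list(single_mismatch_neighbors(kmer))
    background = sum(counts[n] for n in neighbors) / len(neighbors)
    return counts[kmer] / (background + pseudocount)

# Toy sequences (hypothetical, not yeast upstream regions).
toy_usrs = ["TTACGTAAGG", "CCACGTAATT", "GGACGTAACC", "ACGAAATTTC"]
counts = count_kmers(toy_usrs, 5)
ranked = sorted(counts, key=lambda m: mismatch_ratio(m, counts), reverse=True)
```

In this toy setup, a k-mer that recurs across sequences while its one-mismatch variants stay rare ranks near the top of `ranked`. A real analysis would also need to handle both strands and record the position of each occurrence to study the spatial distributions the abstract describes.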


Pages: 513-528

Page count: 16
