Information content of individual genetic sequences

Cited by: 237
Authors
Schneider, TD [1]
Affiliations
[1] NCI, Frederick Canc Res & Dev Ctr, Math Biol Lab, Frederick, MD 21702 USA
DOI
10.1006/jtbi.1997.0540
CLC number
Q [Biological sciences];
Subject classification codes
07; 0710; 09
Abstract
Related genetic sequences having a common function can be described by Shannon's information measure and depicted graphically by a sequence logo. Though useful for many purposes, sequence logos only show the average sequence conservation, and inferring the conservation for individual sequences is difficult. This limitation is overcome by the individual information (R_i) technique described here. The method begins by generating a weight matrix from the frequencies of each nucleotide or amino acid at each position of the aligned sequences. This matrix is then applied to the sequences themselves to determine the sequence conservation of each individual sequence. The matrix is unique because the average of these assignments is the total sequence conservation, and there is only one way to construct such a matrix. For binding sites on polynucleotides, the weight matrix has a natural cut-off that distinguishes functional sequences from other sequences. R_i values are on an absolute scale measured in bits of information, so the conservation of different biological functions can be compared with one another. The matrix can be used to rank-order the sequences, to search for new sequences, to compare sequences to other quantitative data such as binding energy or distance between binding sites, to distinguish mutations from polymorphisms, to design sequences of a given strength, and to detect errors in databases. The R_i method has been used to identify previously undescribed but experimentally verified DNA binding sites. The individual information distribution was determined for E. coli ribosome binding sites, bacterial Fis binding sites, and human donor and acceptor splice junctions, among others. The distributions demonstrate clearly that the consensus sequence is highly unusual, and hence is a poor method to describe naturally occurring binding sites.
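The procedure outlined in the abstract (build a weight matrix from per-position frequencies, apply it back to each aligned sequence, and check that the mean of the individual scores equals the total conservation) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details taken from this record: the weight for base b at position l is taken as 2 + log2 f(b, l) for a four-letter DNA alphabet, the small-sample correction is omitted, unobserved bases are given a weight of negative infinity, and the six-base `sites` list is invented purely for illustration. The function names are hypothetical.

```python
import math

BASES = "ACGT"


def weight_matrix(aligned):
    """Per-position weights in bits: 2 + log2(frequency) for each base.

    Assumption: no small-sample correction; a base never observed at a
    position gets -inf, so any sequence containing it scores -inf.
    """
    n = len(aligned)
    matrix = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        weights = {}
        for base in BASES:
            freq = column.count(base) / n
            weights[base] = 2.0 + math.log2(freq) if freq > 0 else float("-inf")
        matrix.append(weights)
    return matrix


def individual_information(matrix, seq):
    """Individual information of one sequence: sum of its per-position weights."""
    return sum(column[base] for column, base in zip(matrix, seq))


def total_conservation(aligned):
    """Total conservation in bits: sum over positions of 2 - H(position)."""
    n = len(aligned)
    total = 0.0
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        entropy = 0.0
        for base in BASES:
            freq = column.count(base) / n
            if freq > 0:
                entropy -= freq * math.log2(freq)
        total += 2.0 - entropy
    return total


# Hypothetical aligned sites, made up for this sketch.
sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]

matrix = weight_matrix(sites)
scores = [individual_information(matrix, s) for s in sites]
mean_ri = sum(scores) / len(scores)

# Rank-order the sites by individual information, strongest first.
for site, score in sorted(zip(sites, scores), key=lambda pair: -pair[1]):
    print(f"{site}  {score:.3f} bits")
print(f"mean individual information = {mean_ri:.3f} bits")
print(f"total conservation          = {total_conservation(sites):.3f} bits")
```

In this toy alignment the two printed totals agree, which is the averaging property the abstract describes. The same `matrix` can score a sequence outside the alignment by passing it to `individual_information`.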
Pages: 427-441
Page count: 15