Identifying DNA and protein patterns with statistically significant alignments of multiple sequences

Cited by: 868
Authors
Hertz, GZ [1]
Stormo, GD [1]
Affiliations
[1] Univ Colorado, Dept Mol Cellular & Dev Biol, Boulder, CO 80309 USA
DOI
10.1093/bioinformatics/15.7.563
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme.
Results: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and, thus, the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein.
Availability: Programs were developed under the UNIX operating system and are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus.
Contact: hertz@colorado.edu.
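As a rough illustration of the scoring idea in the abstract, the sketch below computes a relative-entropy style information content for an ungapped DNA alignment and multiplies a P value by a count of possible alignments to get an expected frequency. It is not the authors' program: the function names, the uniform background frequencies, the use of natural logarithms, and the absence of any small-sample correction are all assumptions made for this example, and the abstract does not give the paper's exact formula.

```python
import math
from collections import Counter

# Assumed uniform background frequencies for DNA (hypothetical default).
UNIFORM_DNA = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def information_content(alignment, background=None):
    """Sum over columns of sum_b f_b * ln(f_b / p_b).

    alignment: list of equal-length, ungapped sequences.
    background: dict mapping each letter to its prior frequency p_b.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all sequences must have the same length")
    if background is None:
        background = UNIFORM_DNA

    n = len(alignment)
    total = 0.0
    for col in range(width):
        counts = Counter(seq[col] for seq in alignment)
        for letter, count in counts.items():
            f = count / n  # observed frequency of this letter in the column
            total += f * math.log(f / background[letter])
    return total


def expected_frequency(p_value, n_possible_alignments):
    """Expected number of alignments scoring at least this well by chance:
    the P value multiplied by the number of possible alignments."""
    return p_value * n_possible_alignments
```

For example, `information_content(["ACGT", "ACGT"])` gives `4 * ln 4`, since every column is perfectly conserved, while a single column containing one each of A, C, G and T scores 0 because it matches the assumed background exactly.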
Pages: 563-577
Page count: 15