The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment

Cited by: 53
Authors
Altschul, Stephen F. [1]
Wootton, John C. [1]
Zaslavsky, Elena [2, 3]
Yu, Yi-Kuo [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20892 USA
[2] Mt Sinai Hosp, Mt Sinai Sch Med, Ctr Translat Syst Biol, New York, NY 10029 USA
[3] Mt Sinai Hosp, Mt Sinai Sch Med, Dept Neurol, New York, NY 10029 USA
Keywords
HIDDEN MARKOV-MODELS; DNA-BINDING-DOMAINS; PROTEIN SEQUENCES; TRANSCRIPTION FACTORS; STATISTICAL SIGNIFICANCE; PATTERN-RECOGNITION; IMPROVED ALGORITHM; ALIGNED PROTEIN; SCORING MATRIX; ACID SEQUENCES;
DOI
10.1371/journal.pcbi.1000852
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is selecting a set of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct "BILD" ("Bayesian Integral Log-odds") substitution scores from prior distributions describing columns of related letters. This approach has been used previously only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform the decision of whether to include a sequence in a multiple alignment, and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies, and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD score based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum.
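The log-odds idea in the abstract can be illustrated with a minimal sketch. It assumes a single Dirichlet prior over letter frequencies for a column and an independent background model; the priors and efficient calculation described in the paper itself are not reproduced in this record, so the function name `column_log_odds` and its parameters are hypothetical. The score compares the marginal probability of the observed column under the prior with the probability of the same letters drawn independently from the background.

```python
import math

def column_log_odds(counts, alpha, background):
    """Illustrative log-odds score (in bits) for one alignment column.

    counts     -- observed count of each letter in the column
    alpha      -- Dirichlet prior parameters (assumed, one per letter)
    background -- background letter probabilities (assumed, sum to 1)
    """
    n = sum(counts)
    alpha_total = sum(alpha)

    # Log marginal likelihood of the ordered column under Dirichlet(alpha):
    # Gamma(A) / Gamma(A + n) * prod_j Gamma(alpha_j + c_j) / Gamma(alpha_j)
    log_related = math.lgamma(alpha_total) - math.lgamma(alpha_total + n)
    for c, a in zip(counts, alpha):
        log_related += math.lgamma(a + c) - math.lgamma(a)

    # Log probability of the same letters under the independent background.
    log_unrelated = sum(
        c * math.log(p) for c, p in zip(counts, background) if c > 0
    )

    # Convert the natural-log ratio to bits.
    return (log_related - log_unrelated) / math.log(2)

# Hypothetical DNA example: uniform background, weak prior proportional to it.
background = [0.25, 0.25, 0.25, 0.25]
alpha = [0.25, 0.25, 0.25, 0.25]
conserved = column_log_odds([4, 0, 0, 0], alpha, background)
mixed = column_log_odds([1, 1, 1, 1], alpha, background)
```

Under these assumed parameters, a fully conserved column scores positive and a column with four different letters scores negative, which is the behavior a log-odds substitution score is meant to capture.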
Pages: 17