An efficient Z-score algorithm for assessing sequence alignments

Cited by: 11
Authors
Booth, HS [1]
Maindonald, JH
Wilson, SR
Gready, JE
Affiliations
[1] Australian Natl Univ, Ctr Bioinformat Sci, Canberra, ACT 0200, Australia
[2] Australian Natl Univ, Inst Math Sci, Canberra, ACT 0200, Australia
[3] Australian Natl Univ, John Curtin Sch Med Res, Canberra, ACT 0200, Australia
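Keywords
dynamic programming; sequence alignment; sequence composition; similarity search; Z-score
DOI
10.1089/cmb.2004.11.616
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010 [Biochemistry and Molecular Biology]; 081704 [Applied Chemistry]
Abstract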
We describe an alternative method for scoring the pairwise alignment of two biological sequences. Designed to overcome the bias due to the composition of the alignment, it measures the distance (in standard deviations) between the score of the given alignment and the mean score of all other alignments that can be obtained by permuting either sequence. We demonstrate that the standard deviation can be calculated efficiently. In the ungapped case, the mean and standard deviation can be calculated exactly in two steps: the first takes O(N) time, where N is the length of the sequence, and the second takes a fixed number of calculations, i.e., O(1) time. We argue that this statistic is a more consistent measure than a similarity score based upon a standard scoring matrix. Even in the ungapped case, the statistic proves in many cases to be more accurate than the commonly used FASTA gapped Z-score (Pearson and Lipman, 1988), in which the sequence is matched against a random sample of the database. We demonstrate the use of the POZ-score as a secondary filter that screens out several well-known types of false positive, reducing the amount of manual screening to be done by the biologist.
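The abstract gives no formulas, so the following is only a minimal sketch of how a permutation Z-score for an ungapped alignment could be computed in the two steps described. It assumes the standard exact moment formulas for a sum taken over a uniformly random permutation (Hoeffding's combinatorial result) and may differ from the authors' own derivation. The function names and the +1/-1 scoring function are hypothetical placeholders, not taken from the paper; a substitution matrix could be passed in instead.

```python
import math
from collections import Counter
from itertools import permutations


def match_score(x, y):
    # Hypothetical placeholder scoring: +1 for a match, -1 for a mismatch.
    return 1.0 if x == y else -1.0


def permutation_moments(a, b, score=match_score):
    """Exact mean and variance of the ungapped score over all permutations of b."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    # Step 1, O(N): residue frequencies of each sequence.
    pa = {x: k / n for x, k in Counter(a).items()}
    pb = {y: k / n for y, k in Counter(b).items()}
    # Step 2, O(|alphabet|^2), independent of N: moments from the composition.
    row = {x: sum(pb[y] * score(x, y) for y in pb) for x in pa}
    col = {y: sum(pa[x] * score(x, y) for x in pa) for y in pb}
    grand = sum(pa[x] * row[x] for x in pa)
    mean = n * grand
    # Doubly centred scores, weighted by how often each residue pair can occur.
    centred = sum(
        pa[x] * pb[y] * (score(x, y) - row[x] - col[y] + grand) ** 2
        for x in pa
        for y in pb
    )
    variance = n * n * centred / (n - 1)
    return mean, variance


def z_score(a, b, score=match_score):
    """Distance, in standard deviations, of the observed score from the mean."""
    observed = sum(score(x, y) for x, y in zip(a, b))
    mean, variance = permutation_moments(a, b, score)
    if variance < 1e-12:
        # Every permutation scores the same, so the Z-score is undefined.
        return math.nan
    return (observed - mean) / math.sqrt(variance)


def brute_force_moments(a, b, score=match_score):
    """Reference check: enumerate every permutation of b (tiny inputs only)."""
    totals = [sum(score(x, y) for x, y in zip(a, p)) for p in permutations(b)]
    mean = sum(totals) / len(totals)
    variance = sum((t - mean) ** 2 for t in totals) / len(totals)
    return mean, variance
```

For short sequences, `brute_force_moments` can be used to confirm that the composition-based values agree with full enumeration. Because the moments depend only on the residue counts of the two sequences, the cost after counting does not grow with N.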
Pages: 616-625
Page count: 10
Related papers
18 entries in total
[1]
Comparative accuracy of methods for protein sequence similarity search [J].
Agarwal, P ;
States, DJ .
BIOINFORMATICS, 1998, 14 (01) :40-47
[2]
BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[3]
Sequence alignment: an approximation law for the Z-value with applications to databank scanning [J].
Bacro, JN ;
Comet, JP .
COMPUTERS & CHEMISTRY, 2001, 25 (04) :401-410
[4]
BOOTH H, 2003, P APAC C EXH ADV COM
[5]
Significance of Z-value statistics of Smith-Waterman scores for protein alignments [J].
Comet, JP ;
Aude, JC ;
Glémet, E ;
Risler, JL ;
Hénaut, A ;
Slonimski, PP ;
Codani, JJ .
COMPUTERS & CHEMISTRY, 1999, 23 (3-4) :317-331
[6]
Durbin R., 1998, BIOL SEQUENCE ANAL
[7]
PERFORMANCE EVALUATION OF AMINO-ACID SUBSTITUTION MATRICES [J].
HENIKOFF, S ;
HENIKOFF, JG .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 1993, 17 (01) :49-61
[8]
SCOP, structural classification of proteins database: Applications to evaluation of the effectiveness of sequence alignment methods and statistics of protein structural data [J].
Hubbard, TJP ;
Ailey, B ;
Brenner, SE ;
Murzin, AG ;
Chothia, C .
ACTA CRYSTALLOGRAPHICA SECTION D-BIOLOGICAL CRYSTALLOGRAPHY, 1998, 54 :1147-1154
[9]
METHODS FOR ASSESSING THE STATISTICAL SIGNIFICANCE OF MOLECULAR SEQUENCE FEATURES BY USING GENERAL SCORING SCHEMES [J].
KARLIN, S ;
ALTSCHUL, SF .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1990, 87 (06) :2264-2268
[10]
Hidden Markov models for detecting remote protein homologies [J].
Karplus, K ;
Barrett, C ;
Hughey, R .
BIOINFORMATICS, 1998, 14 (10) :846-856