Statistical evaluation of local alignment features predicting allergenicity using supervised classification algorithms

Cited by: 43
Authors
Soeria-Atmadja, D
Zorzet, A
Gustafsson, MG
Hammerling, U
Affiliations
[1] Uppsala Univ, Signal & Syst Grp, SE-75120 Uppsala, Sweden
[2] Natl Food Adm Toxicol Lab, Div Toxicol, Uppsala, Sweden
Keywords
allergy; amino acid sequence; computational toxicology; risk assessment
DOI
10.1159/000076382
Chinese Library Classification (CLC) number
R392 [Medical immunology]
Discipline classification code
100102
Abstract
Background: Recently, two promising alignment-based features predicting food allergenicity using the k nearest neighbor (kNN) classifier were reported. These features are the alignment score and alignment length of the best local alignment obtained in a database of known allergen sequences.

Methods: In the work reported here a much more comprehensive statistical evaluation of the potential of these features was performed, this time for the prediction of allergenicity in general. The evaluation consisted of the following four key components. (1) A new high-quality database consisting of 318 carefully selected, non-redundant allergens and 1,007 sequences carefully selected to be non-allergens. (2) Three different supervised algorithms: the kNN classifier, the Bayesian linear Gaussian classifier, and the Bayesian quadratic Gaussian classifier. (3) A large set of local alignment procedures defined using the FASTA3 alignment program by means of a wide range of different parameter settings. (4) Novel performance curves, alternative to conventional receiver-operating characteristic curves, to display not only average behaviors but also statistical variations due to small data sets.

Results: The linear Gaussian classifier proved most useful among the tested supervised machine learning algorithms, closely followed by the quadratic Gaussian equivalent and kNN. The overall best classification results were obtained with a novel feature vector consisting of the combined alignment scores derived from local alignment procedures using different substitution matrices.

Conclusions: The models reported here should be useful as a part of an integrated assessment scheme for potential protein allergenicity and for future comparisons with alternative bioinformatic approaches. Copyright (C) 2004 S. Karger AG, Basel.
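To make the classification step in the abstract concrete, the following is a minimal sketch of a linear Gaussian classifier applied to the two features it names, alignment score and alignment length. It is not the authors' implementation. It assumes the usual reading of "Bayesian linear Gaussian classifier": Gaussian class-conditional densities with a shared (pooled) covariance matrix and priors taken from class frequencies. The function names and the eight toy feature vectors are hypothetical and invented for illustration; they are not data from the paper.

```python
import math

def mean_vector(rows):
    # Component-wise mean of a list of 2-D feature vectors.
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_covariance(groups, means):
    # 2x2 within-class covariance shared by all classes.
    s = [[0.0, 0.0], [0.0, 0.0]]
    total = 0
    for rows, mu in zip(groups, means):
        for r in rows:
            d = [r[0] - mu[0], r[1] - mu[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
        total += len(rows)
    dof = total - len(groups)
    return [[s[a][b] / dof for b in range(2)] for a in range(2)]

def invert_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def fit_linear_gaussian(allergens, non_allergens):
    # Class 0 = non-allergen, class 1 = allergen.
    groups = [non_allergens, allergens]
    means = [mean_vector(g) for g in groups]
    inv_cov = invert_2x2(pooled_covariance(groups, means))
    n = sum(len(g) for g in groups)
    priors = [len(g) / n for g in groups]
    return means, inv_cov, priors

def discriminant(x, mu, inv_cov, prior):
    # Linear discriminant: x^T S^-1 mu - 0.5 mu^T S^-1 mu + log(prior).
    w = [inv_cov[0][0] * mu[0] + inv_cov[0][1] * mu[1],
         inv_cov[1][0] * mu[0] + inv_cov[1][1] * mu[1]]
    return (x[0] * w[0] + x[1] * w[1]
            - 0.5 * (mu[0] * w[0] + mu[1] * w[1])
            + math.log(prior))

def predict(model, x):
    # Return 1 (allergen) if its discriminant is larger, else 0.
    means, inv_cov, priors = model
    scores = [discriminant(x, means[k], inv_cov, priors[k]) for k in range(2)]
    return 1 if scores[1] > scores[0] else 0

# Invented toy data: (alignment score, alignment length) per query sequence.
toy_allergens = [(220, 150), (310, 200), (180, 120), (260, 170)]
toy_non_allergens = [(40, 30), (55, 38), (35, 33), (60, 52)]

model = fit_linear_gaussian(toy_allergens, toy_non_allergens)
print(predict(model, (250, 160)))
print(predict(model, (45, 35)))
```

Under the same assumption, a quadratic Gaussian classifier would differ only in estimating a separate covariance matrix per class, which makes the decision boundary quadratic rather than linear.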
Pages: 101-112
Number of pages: 12