Support vector inductive logic programming outperforms the naive Bayes classifier and inductive logic programming for the classification of bioactive chemical compounds

Cited by: 40
Authors
Cannon, Edward O.
Amini, Ata
Bender, Andreas
Sternberg, Michael J. E.
Muggleton, Stephen H.
Glen, Robert C.
Mitchell, John B. O.
Affiliations
[1] Univ Cambridge, Unilever Ctr Mol Sci Informat, Dept Chem, Cambridge CB2 1EW, England
[2] Univ London Imperial Coll Sci Technol & Med, Fac Nat Sci, Div Mol Biosci, London SW7 2AZ, England
Funding
Biotechnology and Biological Sciences Research Council (BBSRC), UK; Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
classification; feature selection; machine learning; molecular similarity; screening; descriptors; fingerprints
DOI
10.1007/s10822-007-9113-3
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
We investigate the classification performance of circular fingerprints in combination with the Naive Bayes Classifier (MP2D), Inductive Logic Programming (ILP) and Support Vector Inductive Logic Programming (SVILP) on a standard molecular benchmark dataset comprising 11 activity classes and about 102,000 structures. The Naive Bayes Classifier treats features independently, whereas ILP combines structural fragments and then creates new features with higher predictive power. SVILP is a very recently presented method that adds a support vector machine after common ILP procedures. The performance of the methods is evaluated via a number of statistical measures, namely recall, specificity, precision, F-measure, Matthews Correlation Coefficient, area under the Receiver Operating Characteristic (ROC) curve and enrichment factor (EF). According to the F-measure, which takes both recall and precision into account, SVILP is the superior method for seven of the 11 classes. The results show that the Bayes Classifier gives the best recall for eight of the 11 targets, but has much lower precision, specificity and F-measure. The SVILP model, on the other hand, has the highest recall for only three of the 11 classes, but generally far superior specificity and precision. To evaluate the statistical significance of SVILP's superiority, we employ McNemar's test, which shows that SVILP performs significantly (p < 0.05) better than both other methods for six of the 11 activity classes, while being superior with less significance for three of the remaining classes. Although the Bayes Classifier was previously shown to perform very well in molecular classification studies, these results suggest that SVILP is able to extract additional knowledge from the data, thus improving classification results further.
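The evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the counts passed in, and the enrichment-factor definition (precision divided by the fraction of actives in the whole set) are assumptions made for illustration, and the McNemar statistic shown is the common continuity-corrected chi-square form, which may differ from the variant the authors used.

```python
import math


def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def confusion_metrics(tp, fp, tn, fn):
    """Compute per-class measures from confusion-matrix counts.

    tp, fp, tn, fn are hypothetical counts of true/false positives/negatives
    for one activity class (active = positive).
    """
    recall = _safe_div(tp, tp + fn)
    specificity = _safe_div(tn, tn + fp)
    precision = _safe_div(tp, tp + fp)
    f_measure = _safe_div(2 * precision * recall, precision + recall)
    mcc = _safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    # Assumed EF definition: hit rate among predicted actives
    # relative to the hit rate of the whole data set.
    prevalence = _safe_div(tp + fn, tp + fp + tn + fn)
    enrichment = _safe_div(precision, prevalence)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
        "enrichment_factor": enrichment,
    }


def mcnemar_test(b, c):
    """Continuity-corrected McNemar test for two classifiers on the same set.

    b: compounds method A classified correctly and method B incorrectly.
    c: compounds method A classified incorrectly and method B correctly.
    Returns (chi2, p_value) for a chi-square distribution with 1 degree
    of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Made-up counts, not results from the study.
    print(confusion_metrics(tp=80, fp=20, tn=880, fn=20))
    print(mcnemar_test(b=25, c=10))
```

Only the discordant pairs (b and c) enter McNemar's test, which is why it suits a comparison of two classifiers evaluated on the same compounds.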
Pages: 269-280
Page count: 12