A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking

Cited by: 604
Authors
Ballester, Pedro J. [1]
Mitchell, John B. O. [2]
Affiliations
[1] Univ Cambridge, Dept Chem, Unilever Ctr Mol Sci Informat, Cambridge CB2 1EW, England
[2] Univ St Andrews, Ctr Biomol Sci, St Andrews KY16 9ST, Fife, Scotland
Funding
Biotechnology and Biological Sciences Research Council (BBSRC), UK;
Keywords
EMPIRICAL SCORING FUNCTIONS; GENETIC ALGORITHM; FLEXIBLE DOCKING; DRUG DISCOVERY; RECOGNITION; VALIDATION; POTENTIALS; DATABASE; SEARCH; FOREST;
DOI
10.1093/bioinformatics/btq112
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Motivation: Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for analysing the outputs of molecular docking, which in turn is an important technique for drug discovery, chemical biology and structural biology. Each scoring function assumes a predetermined theory-inspired functional form for the relationship between the variables that characterize the complex, which also include parameters fitted to experimental or simulation data, and its predicted binding affinity. The inherent problem of this rigid approach is that it leads to poor predictivity for those complexes that do not conform to the modelling assumptions. Moreover, resampling strategies, such as cross-validation or bootstrapping, are still not systematically used to guard against the overfitting of calibration data in parameter estimation for scoring functions.
Results: We propose a novel scoring function (RF-Score) that circumvents the need for problematic modelling assumptions via non-parametric machine learning. In particular, Random Forest was used to implicitly capture binding effects that are hard to model explicitly. RF-Score is compared with the state of the art on the demanding PDBbind benchmark. Results show that RF-Score is a very competitive scoring function. Importantly, RF-Score's performance was shown to improve dramatically with training set size and hence the future availability of more high-quality structural and interaction data is expected to lead to improved versions of RF-Score.
Pages: 1169-1175
Page count: 7