Deep Architectures and Deep Learning in Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules

Cited by: 378
Authors
Lusci, Alessandro [1]
Pollastri, Gianluca [1]
Baldi, Pierre [2]
Affiliations
[1] Natl Univ Ireland Univ Coll Dublin, Sch Comp Sci & Informat, Dublin 4, Ireland
[2] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92697 USA
Funding
US National Science Foundation
Keywords
PARTITION-COEFFICIENTS; GRADIENT DESCENT; GRAPH KERNELS; SMALLEST SET; FREE-ENERGY; ALGORITHM; CLASSIFICATION; NONELECTROLYTES; SELECTION;
DOI
10.1021/ci400187y
Chinese Library Classification
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
Shallow machine learning methods have been applied to chemoinformatics problems with some success. As more data becomes available and more complex problems are tackled, deep machine learning methods may also become useful. Here, we present a brief overview of deep learning methods and show in particular how recursive neural network approaches can be applied to the problem of predicting molecular properties. However, molecules are typically described by undirected cyclic graphs, while recursive approaches typically use directed acyclic graphs. Thus, we develop methods to address this discrepancy, essentially by considering an ensemble of recursive neural networks associated with all possible vertex-centered acyclic orientations of the molecular graph. One advantage of this approach is that it relies only minimally on the identification of suitable molecular descriptors because suitable representations are learned automatically from the data. Several variants of this approach are applied to the problem of predicting aqueous solubility and tested on four benchmark data sets. Experimental results show that the performance of the deep learning methods matches or exceeds the performance of other state-of-the-art methods according to several evaluation metrics and expose the fundamental limitations arising from training sets that are too small or too noisy. A Web-based predictor, AquaSol, is available online through the ChemDB portal (cdb.ics.uci.edu) together with additional material.
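The "vertex-centered acyclic orientations" mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a connected molecular graph given as an adjacency dictionary, orients each bond toward a chosen root atom along shortest paths, and (as an assumption of this sketch) drops bonds whose two endpoints are equally far from the root. The function names `bfs_distances`, `orient_toward`, and `all_orientations` are hypothetical.

```python
from collections import deque


def bfs_distances(adj, root):
    """Return the shortest-path distance (in bonds) from root to every atom."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def orient_toward(adj, root):
    """Orient each bond from the farther atom to the nearer one.

    Every directed edge strictly decreases the distance to the root,
    so the result cannot contain a directed cycle.
    Assumption of this sketch: bonds between equidistant atoms are dropped.
    """
    dist = bfs_distances(adj, root)
    edges = set()
    for u in adj:
        for v in adj[u]:
            if dist[u] == dist[v] + 1:
                edges.add((u, v))  # edge points from u toward the root
    return edges


def all_orientations(adj):
    """Build one root-centered DAG per atom; together they form the ensemble."""
    return {root: orient_toward(adj, root) for root in adj}


# A four-membered ring, atoms 0-1-2-3-0 (an undirected cyclic graph).
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
dags = all_orientations(ring)
```

In an ensemble of this kind, each of the resulting DAGs would be processed by a recursive network and the per-root outputs combined into a single molecular representation; how the combination is done is outside this sketch.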
Pages: 1563-1575
Page count: 13