Scalable partitioning and exploration of chemical spaces using geometric hashing

被引:12
作者
Dutta, D
Guha, R
Jurs, PC [1 ]
Chen, T
机构
[1] Penn State Univ, Dept Chem, University Pk, PA 16802 USA
[2] Univ So Calif, Dept Computat Biol, Los Angeles, CA 90089 USA
关键词
D O I
10.1021/ci050403o
中图分类号
R914 [药物化学];
学科分类号
100701 ;
摘要
Virtual screening (VS) has become a preferred tool to augment high-throughput screening(1) and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm(2) called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249 071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
引用
收藏
页码:321 / 333
页数:13
相关论文
共 32 条
[1]   A constant time algorithm for estimating the diversity of large chemical libraries [J].
Agrafiotis, DK .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 2001, 41 (01) :159-167
[2]   METHODS FOR DETECTING CARCINOGENS AND MUTAGENS WITH SALMONELLA-MAMMALIAN-MICROSOME MUTAGENICITY TEST [J].
AMES, BN ;
MCCANN, J ;
YAMASAKI, E .
MUTATION RESEARCH, 1975, 31 (06) :347-363
[3]  
[Anonymous], MOL OP ENV
[4]   HIGHLY DISCRIMINATING DISTANCE-BASED TOPOLOGICAL INDEX [J].
BALABAN, AT .
CHEMICAL PHYSICS LETTERS, 1982, 89 (05) :399-404
[5]   NEAREST NEIGHBOR PATTERN CLASSIFICATION [J].
COVER, TM ;
HART, PE .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1967, 13 (01) :21-+
[6]  
DATAR M, 2004, SCG 04
[7]  
Duda R. O., 1998, PATTERN CLASSIFICATI
[8]   Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties [J].
Ertl, P ;
Rohde, B ;
Selzer, P .
JOURNAL OF MEDICINAL CHEMISTRY, 2000, 43 (20) :3714-3717
[9]  
GAREY R, 1979, COMPUTERS INTRACTIBI
[10]  
GASTEIGER J, 1978, TETRAHEDRON LETT, P3181