Speeding up chemical database searches using a proximity filter based on the logical exclusive OR

被引:29
作者
Baldi, Pierre [1 ,2 ]
Hirschberg, Daniel S. [1 ]
Nasr, Ramzi J. [1 ]
机构
[1] Univ Calif Irvine, Inst Genom & Bioinformat, Sch Informat & Comp Sci, Dept Comp Sci, Irvine, CA 92697 USA
[2] Univ Calif Irvine, Dept Biol Chem, Irvine, CA 92697 USA
关键词
D O I
10.1021/ci800076s
中图分类号
R914 [药物化学];
学科分类号
100701 ;
摘要
In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence in the molecular graphs of particular functional groups or combinatorial features, such as labeled paths or labeled trees. To speed up database searches, we propose to store with each fingerprint a small header vector containing primarily the result of applying the logical exclusive OR (XOR) operator to the fingerprint vector after modulo wrapping to a smaller number of bits, such as 128 bits. From the XOR headers of two molecules, tight bounds on the intersection and union of their fingerprint vectors can be rapidly obtained, yielding fight bounds on derived similarity measures, such as the Tanimoto measure. During a database search, every time these bounds are unfavorable, the corresponding molecule can be rapidly discarded with no need for further inspection. We derive probabilistic models that allow us to estimate precisely the behavior of the XOR headers and the level of pruning under different conditions in terms of similarity threshold and fingerprint density. These theoretical results are corroborated by experimental results on a large set of molecules. For a Tanimoto threshold of 0.5 (respectively 0.9), this approach requires searching less than 50% (respectively 10%) of the database, leading to typical search speedups of 2 to 3 times over the previous state-of-the-art.
引用
收藏
页码:1367 / 1378
页数:12
相关论文
共 18 条
[11]  
JAMES CA, DAYLIGHT THEORY MANU
[12]  
LEACH AR, 2003, INTRO CHEMOINFORMATI, V1, P53
[13]   DEFINITION AND ROLE OF SIMILARITY CONCEPTS IN THE CHEMICAL AND PHYSICAL SCIENCES [J].
ROUVRAY, DH .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 1992, 32 (06) :580-586
[14]   Mathematical correction for fingerprint similarity measures to improve chemical retrieval [J].
Swamidass, S. Joshua ;
Baldi, Pierre .
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2007, 47 (03) :952-964
[15]   Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time [J].
Swamidass, S. Joshua ;
Baldi, Pierre .
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2007, 47 (02) :302-317
[16]  
TVERSKY A, 1977, PSYCHOL REV, V84, P327, DOI 10.1037/h0026750
[17]   Similarity search profiling reveals effects of fingerprint scaling in virtual screening [J].
Xue, L ;
Stahura, FL ;
Bajorath, E .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 2004, 44 (06) :2032-2039
[18]   Profile scaling increases the similarity search performance of molecular fingerprints containing numerical descriptors and structural keys [J].
Xue, L ;
Godden, JW ;
Stahura, FL ;
Bajorath, J .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 2003, 43 (04) :1218-1225