Speeding up chemical database searches using a proximity filter based on the logical exclusive OR

被引：29

作者：

Baldi, Pierre ^{[1
,2
]}

Hirschberg, Daniel S. ^{[1
]}

Nasr, Ramzi J. ^{[1
]}

机构：

[1] Univ Calif Irvine, Inst Genom & Bioinformat, Sch Informat & Comp Sci, Dept Comp Sci, Irvine, CA 92697 USA

[2] Univ Calif Irvine, Dept Biol Chem, Irvine, CA 92697 USA

来源：

JOURNAL OF CHEMICAL INFORMATION AND MODELING | 2008年 / 48卷 / 07期

关键词：

D O I：

10.1021/ci800076s

中图分类号：

R914 [药物化学];

学科分类号：

100701 ;

摘要：

In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence in the molecular graphs of particular functional groups or combinatorial features, such as labeled paths or labeled trees. To speed up database searches, we propose to store with each fingerprint a small header vector containing primarily the result of applying the logical exclusive OR (XOR) operator to the fingerprint vector after modulo wrapping to a smaller number of bits, such as 128 bits. From the XOR headers of two molecules, tight bounds on the intersection and union of their fingerprint vectors can be rapidly obtained, yielding fight bounds on derived similarity measures, such as the Tanimoto measure. During a database search, every time these bounds are unfavorable, the corresponding molecule can be rapidly discarded with no need for further inspection. We derive probabilistic models that allow us to estimate precisely the behavior of the XOR headers and the level of pruning under different conditions in terms of similarity threshold and fingerprint density. These theoretical results are corroborated by experimental results on a large set of molecules. For a Tanimoto threshold of 0.5 (respectively 0.9), this approach requires searching less than 50% (respectively 10%) of the database, leading to typical search speedups of 2 to 3 times over the previous state-of-the-art.

引用

页码：1367 / 1378

页数：12

共 18 条

[1] Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval
Baldi, Pierre
Benz, Ryan W.
Hirschberg, Daniel S.
Swamidass, S. Joshua
[J]. JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2007, 47 (06) : 2098 - 2109
[2] Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): Evaluation of performance
Bender, A
Mussa, HY
Glen, RC
Reiling, S
[J]. JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 2004, 44 (05): : 1708 - 1718
[3] SPACE/TIME TRADE/OFFS IN HASH CODING WITH ALLOWABLE ERRORS
BLOOM, BH
[J]. COMMUNICATIONS OF THE ACM, 1970, 13 (07) : 422 - &
[4] ChemDB: a public database of small molecules and related chemoinformatics resources
Chen, J
Swamidass, SJ
Bruand, J
Baldi, P
[J]. BIOINFORMATICS, 2005, 21 (22) : 4133 - 4139
[5] ChemDB update - full-text search and virtual chemical space
Chen, Jonathan H.
Linstead, Erik
Swamidass, S. Joshua
Wang, Dennis
Baldi, Pierre
[J]. BIOINFORMATICS, 2007, 23 (17) : 2348 - 2351
[6] A modification of the Jaccard-Tanimoto similarity index for diverse selection of chemical compounds using binary strings
Fligner, MA
Verducci, JS
Blower, PE
[J]. TECHNOMETRICS, 2002, 44 (02) : 110 - 119
[7] On the properties of bit string-based measures of chemical similarity
Flower, DR
[J]. JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 1998, 38 (03): : 379 - 386
[8] Cheminformatics analysis and learning in a data pipelining environment
Hassan, Moises
Brown, Robert D.
Varma-O'Brien, Shikha
Rogers, David
[J]. MOLECULAR DIVERSITY, 2006, 10 (03) : 283 - 299
[9] Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures
Hert, J
Willett, P
Wilton, DJ
Acklin, P
Azzaoui, K
Jacoby, E
Schuffenhauer, A
[J]. ORGANIC & BIOMOLECULAR CHEMISTRY, 2004, 2 (22) : 3256 - 3266
[10] Holliday JD, 2002, COMB CHEM HIGH T SCR, V5, P155

← 1 2 →