Speeding up chemical database searches using a proximity filter based on the logical exclusive OR

被引:29
作者
Baldi, Pierre [1 ,2 ]
Hirschberg, Daniel S. [1 ]
Nasr, Ramzi J. [1 ]
机构
[1] Univ Calif Irvine, Inst Genom & Bioinformat, Sch Informat & Comp Sci, Dept Comp Sci, Irvine, CA 92697 USA
[2] Univ Calif Irvine, Dept Biol Chem, Irvine, CA 92697 USA
关键词
D O I
10.1021/ci800076s
中图分类号
R914 [药物化学];
学科分类号
100701 ;
摘要
In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence in the molecular graphs of particular functional groups or combinatorial features, such as labeled paths or labeled trees. To speed up database searches, we propose to store with each fingerprint a small header vector containing primarily the result of applying the logical exclusive OR (XOR) operator to the fingerprint vector after modulo wrapping to a smaller number of bits, such as 128 bits. From the XOR headers of two molecules, tight bounds on the intersection and union of their fingerprint vectors can be rapidly obtained, yielding fight bounds on derived similarity measures, such as the Tanimoto measure. During a database search, every time these bounds are unfavorable, the corresponding molecule can be rapidly discarded with no need for further inspection. We derive probabilistic models that allow us to estimate precisely the behavior of the XOR headers and the level of pruning under different conditions in terms of similarity threshold and fingerprint density. These theoretical results are corroborated by experimental results on a large set of molecules. For a Tanimoto threshold of 0.5 (respectively 0.9), this approach requires searching less than 50% (respectively 10%) of the database, leading to typical search speedups of 2 to 3 times over the previous state-of-the-art.
引用
收藏
页码:1367 / 1378
页数:12
相关论文
共 18 条
  • [1] Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval
    Baldi, Pierre
    Benz, Ryan W.
    Hirschberg, Daniel S.
    Swamidass, S. Joshua
    [J]. JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2007, 47 (06) : 2098 - 2109
  • [2] Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): Evaluation of performance
    Bender, A
    Mussa, HY
    Glen, RC
    Reiling, S
    [J]. JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 2004, 44 (05): : 1708 - 1718
  • [3] SPACE/TIME TRADE/OFFS IN HASH CODING WITH ALLOWABLE ERRORS
    BLOOM, BH
    [J]. COMMUNICATIONS OF THE ACM, 1970, 13 (07) : 422 - &
  • [4] ChemDB: a public database of small molecules and related chemoinformatics resources
    Chen, J
    Swamidass, SJ
    Bruand, J
    Baldi, P
    [J]. BIOINFORMATICS, 2005, 21 (22) : 4133 - 4139
  • [5] ChemDB update - full-text search and virtual chemical space
    Chen, Jonathan H.
    Linstead, Erik
    Swamidass, S. Joshua
    Wang, Dennis
    Baldi, Pierre
    [J]. BIOINFORMATICS, 2007, 23 (17) : 2348 - 2351
  • [6] A modification of the Jaccard-Tanimoto similarity index for diverse selection of chemical compounds using binary strings
    Fligner, MA
    Verducci, JS
    Blower, PE
    [J]. TECHNOMETRICS, 2002, 44 (02) : 110 - 119
  • [7] On the properties of bit string-based measures of chemical similarity
    Flower, DR
    [J]. JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 1998, 38 (03): : 379 - 386
  • [8] Cheminformatics analysis and learning in a data pipelining environment
    Hassan, Moises
    Brown, Robert D.
    Varma-O'Brien, Shikha
    Rogers, David
    [J]. MOLECULAR DIVERSITY, 2006, 10 (03) : 283 - 299
  • [9] Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures
    Hert, J
    Willett, P
    Wilton, DJ
    Acklin, P
    Azzaoui, K
    Jacoby, E
    Schuffenhauer, A
    [J]. ORGANIC & BIOMOLECULAR CHEMISTRY, 2004, 2 (22) : 3256 - 3266
  • [10] Holliday JD, 2002, COMB CHEM HIGH T SCR, V5, P155