Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time

被引:68
作者
Swamidass, S. Joshua [1 ]
Baldi, Pierre [1 ]
机构
[1] Univ Calif Irvine, Sch Informat & Comp Sci, Inst Genom & Bioinformat, Irvine, CA 92697 USA
关键词
D O I
10.1021/ci600358f
中图分类号
R914 [药物化学];
学科分类号
100701 ;
摘要
Chemical fingerprints are used to represent chemical molecules by recording the presence or absence, or by counting the number of occurrences, of particular features or substructures, such as labeled paths in the 2D graph of bonds, of the corresponding molecule. These fingerprint vectors are used to search large databases of small molecules, currently containing millions of entries, using various similarity measures, such as the Tanimoto or Tversky's measures and their variants. Here, we derive simple bounds on these similarity measures and show how these bounds can be used to considerably reduce the subset of molecules that need to be searched. We consider both the case of single-molecule and multiple-molecule queries, as well as queries based on fixed similarity thresholds or aimed at retrieving the top K hits. We study the speedup as a function of query size and distribution, fingerprint length, similarity threshold, and database size parallel to D parallel to and derive analytical formulas that are in excellent agreement with empirical values. The theoretical considerations and experiments show that this approach can provide linear speedups of one or more orders of magnitude in the case of searches with a fixed threshold, and achieve sublinear speedups in the range of O(parallel to D parallel to(0.6)) for the top K hits in current large databases. This pruning approach yields subsecond search times across the 5 million compounds in the ChemDB database, without any loss of accuracy.
引用
收藏
页码:302 / 317
页数:16
相关论文
共 20 条
[1]  
Bohacek RS, 1996, MED RES REV, V16, P3, DOI 10.1002/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.3.CO
[2]  
2-D
[3]   ChemDB: a public database of small molecules and related chemoinformatics resources [J].
Chen, J ;
Swamidass, SJ ;
Bruand, J ;
Baldi, P .
BIOINFORMATICS, 2005, 21 (22) :4133-4139
[4]   COMPUTER-STORAGE AND RETRIEVAL OF GENERIC CHEMICAL STRUCTURES IN PATENTS .10. ASSIGNMENT AND LOGICAL BUBBLE-UP OF RING SCREENS FOR STRUCTURALLY EXPLICIT GENERICS [J].
DOWNS, GM ;
GILLET, VJ ;
HOLLIDAY, JD ;
LYNCH, MF .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 1989, 29 (03) :215-224
[5]   A modification of the Jaccard-Tanimoto similarity index for diverse selection of chemical compounds using binary strings [J].
Fligner, MA ;
Verducci, JS ;
Blower, PE .
TECHNOMETRICS, 2002, 44 (02) :110-119
[6]   On the properties of bit string-based measures of chemical similarity [J].
Flower, DR .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 1998, 38 (03) :379-386
[7]  
Holliday JD, 2002, COMB CHEM HIGH T SCR, V5, P155
[8]   ZINC - A free database of commercially available compounds for virtual screening [J].
Irwin, JJ ;
Shoichet, BK .
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2005, 45 (01) :177-182
[9]  
James C. A., 2004, DAYLIGHT THEORY MANU
[10]   Graph kernels for chemical informatics [J].
Ralaivola, L ;
Swamidass, SJ ;
Saigo, H ;
Baldi, P .
NEURAL NETWORKS, 2005, 18 (08) :1093-1110