Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval

被引:39
作者
Baldi, Pierre [1 ]
Benz, Ryan W. [1 ]
Hirschberg, Daniel S. [1 ]
Swamidass, S. Joshua [1 ]
机构
[1] Univ Calif Irvine, Sch Informat & Comp Sci, Inst Genom & Bioinformat, Irvine, CA 92697 USA
关键词
D O I
10.1021/ci700200n
中图分类号
R914 [药物化学];
学科分类号
100701 ;
摘要
Many modem chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components by decreasing frequency order, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new efficient, lossless, compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrival performance, as shown on six benchmark data sets of druglike molecules.
引用
收藏
页码:2098 / 2109
页数:12
相关论文
共 29 条
[21]   DEFINITION AND ROLE OF SIMILARITY CONCEPTS IN THE CHEMICAL AND PHYSICAL SCIENCES [J].
ROUVRAY, DH .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 1992, 32 (06) :580-586
[22]   Detailed analysis of scoring functions for virtual screening [J].
Stahl, M ;
Rarey, M .
JOURNAL OF MEDICINAL CHEMISTRY, 2001, 44 (07) :1035-1042
[23]   Mathematical correction for fingerprint similarity measures to improve chemical retrieval [J].
Swamidass, S. Joshua ;
Baldi, Pierre .
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2007, 47 (03) :952-964
[24]   Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time [J].
Swamidass, S. Joshua ;
Baldi, Pierre .
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2007, 47 (02) :302-317
[25]   Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity [J].
Swamidass, SJ ;
Chen, J ;
Phung, P ;
Ralaivola, L ;
Baldi, P .
BIOINFORMATICS, 2005, 21 :I359-I368
[26]  
TVERSKY A, 1977, PSYCHOL REV, V84, P327, DOI 10.1037/h0026750
[27]  
Witten Ian H, 1999, MANAGING GIGABYTES C
[28]   Similarity search profiling reveals effects of fingerprint scaling in virtual screening [J].
Xue, L ;
Stahura, FL ;
Bajorath, E .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 2004, 44 (06) :2032-2039
[29]   Profile scaling increases the similarity search performance of molecular fingerprints containing numerical descriptors and structural keys [J].
Xue, L ;
Godden, JW ;
Stahura, FL ;
Bajorath, J .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 2003, 43 (04) :1218-1225