Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval

被引:39
作者
Baldi, Pierre [1 ]
Benz, Ryan W. [1 ]
Hirschberg, Daniel S. [1 ]
Swamidass, S. Joshua [1 ]
机构
[1] Univ Calif Irvine, Sch Informat & Comp Sci, Inst Genom & Bioinformat, Irvine, CA 92697 USA
关键词
D O I
10.1021/ci700200n
中图分类号
R914 [药物化学];
学科分类号
100701 ;
摘要
Many modem chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components by decreasing frequency order, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new efficient, lossless, compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrival performance, as shown on six benchmark data sets of druglike molecules.
引用
收藏
页码:2098 / 2109
页数:12
相关论文
共 29 条
[1]   One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties [J].
Azencott, Chloe-Agathe ;
Ksikes, Alexandre ;
Swamidass, S. Joshua ;
Chen, Jonathan H. ;
Ralaivola, Liva ;
Baldi, Pierre .
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2007, 47 (03) :965-974
[2]   Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): Evaluation of performance [J].
Bender, A ;
Mussa, HY ;
Glen, RC ;
Reiling, S .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 2004, 44 (05) :1708-1718
[3]  
Bohacek RS, 1996, MED RES REV, V16, P3, DOI 10.1002/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.3.CO
[4]  
2-D
[5]   ChemDB: a public database of small molecules and related chemoinformatics resources [J].
Chen, J ;
Swamidass, SJ ;
Bruand, J ;
Baldi, P .
BIOINFORMATICS, 2005, 21 (22) :4133-4139
[6]  
ELIAS P, 1975, IEEE T INFORM THEORY, V21, P194, DOI 10.1109/TIT.1975.1055349
[7]   A modification of the Jaccard-Tanimoto similarity index for diverse selection of chemical compounds using binary strings [J].
Fligner, MA ;
Verducci, JS ;
Blower, PE .
TECHNOMETRICS, 2002, 44 (02) :110-119
[8]   On the properties of bit string-based measures of chemical similarity [J].
Flower, DR .
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 1998, 38 (03) :379-386
[9]  
Gillet V, 2005, INTRO CHEMOINFORMATI
[10]   RUN-LENGTH ENCODINGS [J].
GOLOMB, SW .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1966, 12 (03) :399-+