When is Chemical Similarity Significant? The Statistical Distribution of Chemical Similarity Scores and Its Extreme Values

Cited by: 70

Authors:

Baldi, Pierre [1]

Nasr, Ramzi [1]

Affiliations:

[1] Univ Calif Irvine, Inst Genom & Bioinformat, Sch Informat & Comp Sci, Irvine, CA 92697 USA

Source:

JOURNAL OF CHEMICAL INFORMATION AND MODELING | 2010 / Vol. 50 / Issue 07

Keywords:

SEARCH; FINGERPRINTS; DESCRIPTORS; VARIABLES; DATABASE; RATIO; CHEMDB; 2D;

DOI:

10.1021/ci100010v

Chinese Library Classification (CLC) number:

R914 [Medicinal Chemistry];

Discipline classification code:

100701 ;

Abstract:

As repositories of chemical molecules continue to expand and become more open, it becomes increasingly important to develop tools to search them efficiently and assess the statistical significance of chemical similarity scores. Here, we develop a general framework for understanding, modeling, predicting, and approximating the distribution of chemical similarity scores and its extreme values in large databases. The framework can be applied to different chemical representations and similarity measures but is demonstrated here using the most common binary fingerprints with the Tanimoto similarity measure. After introducing several probabilistic models of fingerprints, including the Conditional Gaussian Uniform model, we show that the distribution of Tanimoto scores can be approximated by the distribution of the ratio of two correlated Normal random variables associated with the corresponding unions and intersections. This remains true also when the distribution of similarity scores is conditioned on the size of the query molecules to derive more fine-grained results and improve chemical retrieval. The corresponding extreme value distributions for the maximum scores are approximated by Weibull distributions. From these various distributions and their analytical forms, Z-scores, E-values, and p-values are derived to assess the significance of similarity scores. In addition, the framework also allows one to predict the value of standard chemical retrieval metrics, such as sensitivity and specificity at fixed thresholds, or receiver operating characteristic (ROC) curves at multiple thresholds, and to detect outliers in the form of atypical molecules. Numerous and diverse experiments that have been performed, in part with large sets of molecules from the ChemDB, show remarkable agreement between theory and empirical results.
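To make the quantities named in the abstract concrete, the following minimal sketch computes the Tanimoto similarity of binary fingerprints, T(A, B) = |A ∩ B| / |A ∪ B|, and then estimates a Z-score and an empirical p-value for one score against a simulated background. It is an illustration under stated assumptions, not the paper's method: fingerprints are drawn with independent bits (a simple Bernoulli model), and significance is estimated by direct sampling rather than by the analytical ratio-of-Normals or Weibull approximations. All names (`tanimoto`, `random_fingerprint`, `empirical_significance`) and the parameter values are hypothetical.

```python
import random
import statistics


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def random_fingerprint(n_bits: int, p_on: float, rng: random.Random) -> set:
    """Assumed background model: each bit is on independently with probability p_on."""
    return {i for i in range(n_bits) if rng.random() < p_on}


def empirical_significance(score: float, background: list) -> tuple:
    """Return (Z-score, empirical p-value) of `score` relative to background scores."""
    mean = statistics.fmean(background)
    std = statistics.pstdev(background)
    z = (score - mean) / std if std > 0 else float("inf")
    # Add-one correction keeps the estimated p-value strictly positive.
    exceed = sum(s >= score for s in background)
    p = (1 + exceed) / (1 + len(background))
    return z, p


# Hypothetical parameters chosen only for illustration.
N_BITS, P_ON, DB_SIZE = 512, 0.1, 2000
rng = random.Random(0)

query = random_fingerprint(N_BITS, P_ON, rng)
database = [random_fingerprint(N_BITS, P_ON, rng) for _ in range(DB_SIZE)]
background = [tanimoto(query, molecule) for molecule in database]

z, p = empirical_significance(0.5, background)
print(f"background mean = {statistics.fmean(background):.3f}")
print(f"score 0.5: Z = {z:.1f}, empirical p = {p:.4f}")
```

Sampling of this kind can resolve p-values only down to about 1 / (database size), which is one practical reason closed-form approximations of the score distribution and its extreme values are useful for large databases.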


Pages: 1205-1222

Page count: 18
