When is Chemical Similarity Significant? The Statistical Distribution of Chemical Similarity Scores and Its Extreme Values

Cited by: 70
Authors
Baldi, Pierre [1 ]
Nasr, Ramzi [1 ]
Affiliations
[1] Univ Calif Irvine, Inst Genom & Bioinformat, Sch Informat & Comp Sci, Irvine, CA 92697 USA
Keywords
SEARCH; FINGERPRINTS; DESCRIPTORS; VARIABLES; DATABASE; RATIO; CHEMDB; 2D;
DOI
10.1021/ci100010v
Chinese Library Classification
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
As repositories of chemical molecules continue to expand and become more open, it becomes increasingly important to develop tools to search them efficiently and assess the statistical significance of chemical similarity scores. Here, we develop a general framework for understanding, modeling, predicting, and approximating the distribution of chemical similarity scores and its extreme values in large databases. The framework can be applied to different chemical representations and similarity measures but is demonstrated here using the most common binary fingerprints with the Tanimoto similarity measure. After introducing several probabilistic models of fingerprints, including the Conditional Gaussian Uniform model, we show that the distribution of Tanimoto scores can be approximated by the distribution of the ratio of two correlated Normal random variables associated with the corresponding unions and intersections. This remains true also when the distribution of similarity scores is conditioned on the size of the query molecules to derive more fine-grained results and improve chemical retrieval. The corresponding extreme value distributions for the maximum scores are approximated by Weibull distributions. From these various distributions and their analytical forms, Z-scores, E-values, and p-values are derived to assess the significance of similarity scores. In addition, the framework also allows one to predict the value of standard chemical retrieval metrics, such as sensitivity and specificity at fixed thresholds, or receiver operating characteristic (ROC) curves at multiple thresholds, and to detect outliers in the form of atypical molecules. Numerous and diverse experiments that have been performed, in part with large sets of molecules from the ChemDB, show remarkable agreement between theory and empirical results.
Pages: 1205-1222
Page count: 18
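
The quantities named in the abstract (Tanimoto scores on binary fingerprints, and Z-scores and p-values for a query score) can be illustrated with a minimal sketch. It is not the authors' method: the paper derives analytical approximations (a ratio of correlated Normal variables, Weibull extreme values), whereas this sketch only estimates significance empirically from a simulated background. The fingerprint length, bit density, independent-bit model, and all function names here are assumptions made for illustration.

```python
import random
import statistics


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def random_fingerprint(n_bits: int, density: float, rng: random.Random) -> set[int]:
    """Toy fingerprint (assumption): each bit is on independently with probability `density`."""
    return {i for i in range(n_bits) if rng.random() < density}


def significance(score: float, background: list[float]) -> tuple[float, float]:
    """Empirical Z-score and one-sided p-value of `score` against background scores."""
    mu = statistics.fmean(background)
    sigma = statistics.stdev(background)
    z = (score - mu) / sigma
    # Add-one correction keeps the estimated p-value strictly positive.
    p = (1 + sum(s >= score for s in background)) / (1 + len(background))
    return z, p


rng = random.Random(0)
query = random_fingerprint(1024, 0.1, rng)
database = [random_fingerprint(1024, 0.1, rng) for _ in range(2000)]

# Background distribution: scores of the query against unrelated fingerprints.
background = [tanimoto(query, fp) for fp in database]

# A near-duplicate of the query, made by switching off 10 of its on-bits.
near_duplicate = set(sorted(query)[10:])
z, p = significance(tanimoto(query, near_duplicate), background)
print(f"Z-score: {z:.1f}, empirical p-value: {p:.4f}")
```

An empirical p-value of this kind cannot resolve probabilities below roughly one over the background size, which is one reason analytical forms such as those described in the abstract are useful for large databases.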