Ties in proximity and clustering compounds

Cited by: 46
Authors
MacCuish, J
Nicolaou, C
MacCuish, NE
Affiliations
[1] Bioreason Inc, Santa Fe, NM 87501 USA
[2] Daylight Chem Informat Syst, Santa Fe, NM 87501 USA
Source
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES | 2001 / Vol. 41 / No. 1
DOI
10.1021/ci000069q
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
Hierarchical clustering algorithms such as Ward's or complete-link are commonly used in compound selection and diversity analysis. Many such applications utilize binary representations of chemical structures, such as MACCS keys or Daylight fingerprints, and dissimilarity measures, such as the Euclidean or the Soergel measure. However, hierarchical clustering algorithms can generate ambiguous results owing to what is known in the cluster analysis literature as the ties in proximity problem, i.e., compounds or clusters of compounds that are equidistant from a compound or cluster in a given collection. Ambiguous ties can occur when clustering only a few hundred compounds, and the larger the number of compounds to be clustered, the greater the chance for significant ambiguity. Namely, as the number of "ties in proximity" increases relative to the total number of proximities, the possibility of ambiguity also increases. To ensure that there are no ambiguous ties, we show by a probabilistic argument that the number of compounds needs to be less than $2n^{1/4}$, where n is the total number of proximities, and the measure used to generate the proximities creates a uniform distribution without statistically preferred values. The common measures do not produce uniformly distributed proximities, but rather statistically preferred values that tend to increase the number of ties in proximity. Hence, the number of possible proximities and the distribution of statistically preferred values of a similarity measure, given a bit vector representation of a specific length, are directly related to the number of ties in proximity for a given data set. We explore the ties in proximity problem, using a number of chemical collections with varying degrees of diversity, given several common similarity measures and clustering algorithms. Our results are consistent with our probabilistic argument and show that this problem is significant for relatively small compound sets.
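As a rough illustration of the ties in proximity problem described in the abstract, the sketch below counts how many pairwise Soergel dissimilarities coincide for a small set of bit-vector fingerprints. It is not the authors' procedure: the function names `soergel` and `count_tied_pairs`, the 166-bit length, and the use of uniformly random fingerprints in place of real chemical data are all assumptions made for this example. It also counts every pair of equal proximities, which is a superset of the ambiguous ties the abstract refers to (those that are equidistant from the same compound or cluster).

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
import random


def soergel(a: int, b: int) -> Fraction:
    """Soergel dissimilarity (1 - Tanimoto) of two bit vectors stored as ints.

    Exact fractions are used so that equal proximities compare equal
    without floating-point rounding.
    """
    union = bin(a | b).count("1")
    if union == 0:
        return Fraction(0)  # convention chosen here: two empty vectors are identical
    common = bin(a & b).count("1")
    return 1 - Fraction(common, union)


def count_tied_pairs(fingerprints):
    """Return (total pairwise proximities, distinct values, proximities that share a value)."""
    values = Counter(soergel(a, b) for a, b in combinations(fingerprints, 2))
    total = sum(values.values())
    tied = sum(count for count in values.values() if count > 1)
    return total, len(values), tied


# Hypothetical data: 200 random fingerprints of an assumed MACCS-like length.
rng = random.Random(0)
n_bits = 166
fps = [rng.getrandbits(n_bits) for _ in range(200)]

total, distinct, tied = count_tied_pairs(fps)
print(f"{total} proximities, {distinct} distinct values, {tied} involved in ties")
```

Because a Soergel value on fixed-length bit vectors is always a ratio of two small integers, the number of attainable values is bounded by the fingerprint length, while the number of pairwise proximities grows quadratically with the number of compounds. Under these assumptions, ties become unavoidable well before the collection is large.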
Pages: 134-146
Page count: 13