Genetic algorithm guided selection: Variable selection and subset selection
Cited by: 117
Authors:
Cho, SJ
Affiliation: Bristol Myers Squibb Co, New Leads, Wallingford, CT 06492 USA
Hermsmeier, MA
Affiliation: Bristol Myers Squibb Co, New Leads, Wallingford, CT 06492 USA
Affiliations:
[1] Bristol Myers Squibb Co, New Leads, Wallingford, CT 06492 USA
[2] Bristol Myers Squibb Co, New Leads, Princeton, NJ 08543 USA
Source:
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES | 2002 / Vol. 42 / Issue 04
DOI: 10.1021/ci010247v
Chinese Library Classification (CLC) number: O6 [Chemistry]
Discipline classification code: 0703
Abstract:
A novel Genetic Algorithm guided Selection method, GAS, is described. The method uses a simple encoding scheme that can represent both the compounds and the variables used to construct a QSAR/QSPR model. A genetic algorithm then simultaneously optimizes the encoded variables, which include both descriptors and compound subsets. The GAS method generates multiple models, each applying to a subset of the compounds; typically the subsets represent clusters with different chemotypes. A procedure based on molecular similarity is also presented to determine which model should be applied to a given test set compound. The variable selection method implemented in GAS was tested and compared using the Selwood data set (n = 31 compounds; v = 53 descriptors), and the results showed that it is comparable to other published methods. The subset selection method implemented in GAS was first tested on an artificial data set (n = 100 points; v = 1 descriptor) to examine its ability to subset data points, and then applied to the XLOGP data set (n = 1831 compounds; v = 126 descriptors). The method correctly identifies the artificial data points belonging to the various subsets. The analysis of the XLOGP data set shows that the subset selection method can be useful in improving a QSAR/QSPR model when the variable selection method fails.
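The subset selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the encoding, fitness function, or GA operators, so everything below is an assumption made for illustration. Here each gene assigns one data point to one of `k` subsets, a separate least-squares line is fitted per subset (one descriptor, as in the artificial data set), and the fitness is the negative total squared error. The names `fit_line`, `fitness`, `make_artificial_data`, and `run_gas_sketch`, the tournament selection, uniform crossover, and all parameter values are hypothetical.

```python
import random

def fit_line(points):
    """Least-squares fit of y = a*x + b; returns (a, b, sum of squared errors)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx if sxx > 0 else 0.0
    b = my - a * mx
    sse = sum((y - (a * x + b)) ** 2 for x, y in points)
    return a, b, sse

def fitness(chrom, data, k, min_size=3):
    """Negative total SSE of one line per subset (assumed fitness; higher is better)."""
    total = 0.0
    for s in range(k):
        pts = [p for p, g in zip(data, chrom) if g == s]
        if len(pts) < min_size:
            return float("-inf")  # reject subsets too small to fit meaningfully
        total += fit_line(pts)[2]
    return -total

def make_artificial_data(n=100, seed=0):
    """Hypothetical data: points alternate between two different noisy lines."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        x = rng.uniform(0.0, 10.0)
        y = 2.0 * x + 1.0 if i % 2 == 0 else -1.0 * x + 10.0
        data.append((x, y + rng.gauss(0.0, 0.1)))
    return data

def run_gas_sketch(data, k=2, pop_size=40, generations=100, p_mut=0.02, seed=0):
    """Evolve subset assignments; returns (best chromosome, best fitness per generation)."""
    rng = random.Random(seed)
    n = len(data)
    # Assumed encoding: gene i is the subset label (0..k-1) of data point i.
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    history = []
    best = pop[0]
    for _ in range(generations):
        scored = [(fitness(c, data, k), c) for c in pop]
        best_fit, best = max(scored, key=lambda t: t[0])
        history.append(best_fit)

        def pick():
            # Tournament selection: best of three randomly sampled chromosomes.
            return max(rng.sample(scored, 3), key=lambda t: t[0])[1]

        nxt = [best[:]]  # elitism: the best chromosome survives unchanged
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            # Uniform crossover followed by per-gene random-reset mutation.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [rng.randrange(k) if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
    return best, history
```

Calling `run_gas_sketch(make_artificial_data())` returns the best assignment found and the per-generation best fitness. Because the best chromosome is carried over unchanged, that fitness never decreases from one generation to the next. In a fuller version, a bit mask over descriptors could be appended to the same chromosome so that variables and subsets are optimized together, as the abstract describes, but how GAS actually combines the two is not stated here.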