VARIABLE SELECTION IN QSAR STUDIES .1. AN EVOLUTIONARY ALGORITHM

Cited by: 278
Author
KUBINYI, H
Affiliation
[1] BASF AG, Ludwigshafen
Source
QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIPS | 1994 / Vol. 13 / No. 03
Keywords
CROSS-VALIDATION; EVOLUTIONARY ALGORITHM; FOR VARIABLE SELECTION; GENETIC ALGORITHM; MUSEUM APPROACH; QSAR STUDIES; VARIABLE SELECTION; REGRESSION ANALYSIS; IN QSAR STUDIES;
DOI
10.1002/qsar.19940130306
Chinese Library Classification number
R914 [Medicinal Chemistry];
Discipline classification code
100701 ;
Abstract
In QSAR studies of large data sets, variable selection and model building is a difficult, time-consuming and ambiguous procedure. While most often stepwise regression procedures are applied for this purpose, other strategies, like neural networks, cluster significance analysis or genetic algorithms have been used. A simple and efficient evolutionary strategy, including iterative mutation and selection, but avoiding crossover of regression models, is described in this work. The MUSEUM (Mutation and Selection Uncover Models) algorithm starts from a model containing any number of randomly chosen variables. Random mutation, first by addition or elimination of only one or very few variables, afterwards by simultaneous random additions, eliminations and/or exchanges of several variables at a time, leads to new models which are evaluated by an appropriate fitness function. In contrast to common genetic algorithm procedures, only the "fittest" model is stored and used for further mutation and selection, leading to better and better models. In the last steps of mutation, all variables inside the model are eliminated and all variables outside the model are added, one by one, to control whether this systematic strategy detects any mutation which still improves the model. After every generation of a better model, a new random mutation procedure starts from this model. In the very last step, variables not significant at the 95% level are eliminated, starting with the least significant variable. In this manner, "stable" models are produced, containing only significant variables. A comparison of the results for the Selwood data set (n = 31 compounds, k = 53 variables) with those obtained by other groups shows that more relevant models are derived by the evolutionary approach than by other methods.
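The mutation-and-selection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the published MUSEUM implementation: the abstract only calls for "an appropriate fitness function", so adjusted R^2 of an ordinary least-squares fit is used here as an assumed stand-in, and the names `fitness`, `mutate`, `select_variables`, `n_random` and `max_flips` are hypothetical. The final step of the abstract (eliminating variables not significant at the 95% level) is omitted.

```python
import numpy as np


def fitness(X, y, subset):
    """Adjusted R^2 of an OLS fit on `subset` (assumed fitness, not the paper's)."""
    n, k = len(y), len(subset)
    if k == 0 or k >= n - 1:
        return -np.inf  # empty or over-parameterised models are rejected
    A = np.column_stack([np.ones(n), X[:, sorted(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def mutate(subset, n_vars, n_flips, rng):
    """Flip membership of n_flips random variables: add, eliminate or exchange."""
    flips = rng.choice(n_vars, size=n_flips, replace=False)
    return subset ^ {int(j) for j in flips}


def select_variables(X, y, n_random=200, max_flips=3, seed=0):
    """Keep only the fittest model; no crossover, as in the abstract."""
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]
    max_flips = min(max_flips, n_vars)
    # Start from a model with a few randomly chosen variables.
    start = rng.choice(n_vars, size=min(3, n_vars), replace=False)
    best = {int(j) for j in start}
    best_fit = fitness(X, y, best)
    improved = True
    while improved:
        improved = False
        # Random phase: single flips first, then several flips at a time.
        for i in range(n_random):
            if i < n_random // 2:
                n_flips = 1
            else:
                n_flips = int(rng.integers(1, max_flips + 1))
            cand = mutate(best, n_vars, n_flips, rng)
            f = fitness(X, y, cand)
            if f > best_fit:
                best, best_fit = cand, f
        # Systematic phase: eliminate each inside variable and add each
        # outside variable, one by one.
        for j in range(n_vars):
            cand = best ^ {j}
            f = fitness(X, y, cand)
            if f > best_fit:
                best, best_fit = cand, f
                improved = True
                break  # a better model restarts the random phase
    return sorted(best), best_fit
```

A call on synthetic data, where only two of eight descriptors carry signal, might look like this:

```python
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = 2.0 * X[:, 1] - 3.0 * X[:, 5] + 0.1 * rng.normal(size=40)
selected, score = select_variables(X, y)
print(selected, round(score, 3))
```

Because adjusted R^2 penalises model size only weakly, the selected model may retain a few noise variables; the significance-based pruning step mentioned in the abstract is what would remove them.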
Pages: 285-294
Number of pages: 10