A New Strategy of Outlier Detection for QSAR/QSPR

Cited by: 137
Authors
Cao, Dong-Sheng [1]
Liang, Yi-Zeng [1]
Xu, Qing-Song [2]
Li, Hong-Dong [1]
Chen, Xian [1]
Affiliations
[1] Cent S Univ, Res Ctr Modernizat Tradit Chinese Med, Changsha 410083, Peoples R China
[2] Cent S Univ, Sch Math Sci & Comp Technol, Changsha 410083, Peoples R China
Keywords
QSAR/QSPR; outliers; Monte-Carlo cross-validation; robust regression; regression diagnostics; principal components regression; organic compounds; aqueous solubility; model selection; prediction; squares; points; error; set
DOI
10.1002/jcc.21351
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
A crucial step in building a high-performance QSAR/QSPR model is the detection of outliers. Detecting outliers in a multivariate point cloud is not trivial, especially when several outliers coexist. The classical identification methods do not always find them, because they are based on the sample mean and covariance matrix, which are themselves influenced by the outliers. Moreover, existing methods emphasize only some types of outliers rather than all of them. To avoid these problems and detect all kinds of outliers simultaneously, we provide a new strategy based on Monte-Carlo cross-validation, termed the MC method. The MC method inherently provides a feasible way to detect different kinds of outliers by establishing many cross-predictive models. With the help of the distribution of predictive residuals thus obtained, it seems able to reduce the risk caused by the masking effect. In addition, a new display is proposed in which the absolute value of the mean of each sample's predictive residuals is plotted against the standard deviation of those residuals. The plot divides the data into normal samples, y-direction outliers, and X-direction outliers. Several examples are used to demonstrate the detection ability of the MC method through comparison with different diagnostic methods. © 2009 Wiley Periodicals, Inc. J Comput Chem 31: 592-602, 2010
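The procedure outlined in the abstract (many random train/validation splits, a distribution of predictive residuals per sample, then the absolute mean plotted against the standard deviation) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the regression model, so ordinary least squares is used as a stand-in, and the function name `mc_residual_stats`, its parameters, and the synthetic data are all hypothetical. The abstract also does not say where the boundaries between normal samples and the two outlier types fall, so no thresholds are drawn.

```python
import numpy as np


def mc_residual_stats(X, y, n_iter=500, train_frac=0.7, seed=0):
    """Per-sample |mean| and standard deviation of Monte-Carlo
    cross-validated predictive residuals.

    Assumption: ordinary least squares with an intercept stands in for
    whatever regression model the paper actually uses.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(round(train_frac * n))
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept

    # resid[i, j] holds the predictive residual of sample j in split i,
    # or NaN if sample j was in the training part of that split.
    resid = np.full((n_iter, n), np.nan)
    for i in range(n_iter):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        resid[i, test] = y[test] - A[test] @ coef

    abs_mean = np.abs(np.nanmean(resid, axis=0))
    std = np.nanstd(resid, axis=0, ddof=1)
    return abs_mean, std


# Synthetic illustration (hypothetical data): 40 samples, 2 descriptors,
# with the response of sample 0 shifted to mimic a y-direction outlier.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=40)
y[0] += 5.0

abs_mean, std = mc_residual_stats(X, y, n_iter=300)
flagged = int(np.argmax(abs_mean))  # sample with the largest |mean residual|
```

Plotting `abs_mean` against `std` (for example with a scatter plot) gives a display of the kind the abstract describes; in this synthetic case the shifted sample stands apart from the rest along the mean-residual axis.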
Pages: 592-602
Page count: 11