Computing LTS regression for large data sets

被引:367
作者
Rousseeuw, PJ
Van Driessen, K
机构
[1] Univ Antwerp, Dept Math & Comp Sci, B-2020 Antwerp, Belgium
[2] Univ Antwerp, Fac Appl Econ, B-2000 Antwerp, Belgium
关键词
breakdown value; linear model; outlier detection; regression; robust estimation;
D O I
10.1007/s10618-005-0024-4
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Here we will focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too much with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call 'selective iteration' and nested extensions'. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.
引用
收藏
页码:29 / 45
页数:17
相关论文
共 29 条
[1]  
Agullo J, 1997, INST MATH S, V31, P133
[2]  
AGULLO J, 1997, THESIS U ALICANTE SP
[3]  
[Anonymous], 1987, ROBUST REGRESSION OU
[4]  
CHORK CJ, 1990, J GEOCHEM EXPLOR, V37, P191
[5]   A BOUNDED INFLUENCE, HIGH BREAKDOWN, EFFICIENT REGRESSION ESTIMATOR [J].
COAKLEY, CW ;
HETTMANSPERGER, TP .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1993, 88 (423) :872-880
[6]   Improved feasible solution algorithms for high breakdown estimation [J].
Hawkins, DM ;
Olive, DJ .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 1999, 30 (01) :1-11
[7]   THE FEASIBLE SOLUTION ALGORITHM FOR LEAST TRIMMED SQUARES REGRESSION [J].
HAWKINS, DM .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 1994, 17 (02) :185-196
[9]   Extensions to the k-means algorithm for clustering large data sets with categorical values [J].
Huang, ZX .
DATA MINING AND KNOWLEDGE DISCOVERY, 1998, 2 (03) :283-304
[10]  
Kaufman L., 1986, PATTERN RECOGN, P425, DOI DOI 10.1016/B978-0-444-87877-9.50039-X