Robust linear clustering

Cited by: 36
Authors
Garcia-Escudero, L. A. [1]
Gordaliza, A.
San Martin, R.
Van Aelst, S. [2]
Zamar, R. [3]
Affiliations
[1] Univ Valladolid, Fac Ciencias, Dept Estadist Invest Operat, E-47011 Valladolid, Spain
[2] Univ Ghent, Ghent, Belgium
[3] Univ British Columbia, Vancouver, BC V5Z 1M9, Canada
Keywords
Affine subspaces; Orthogonal regression; Principal components; Robustness; Trimmed k-means; Trimming; SELF-CONSISTENCY; FAST ALGORITHM; IDENTIFICATION; METHODOLOGY; FEATURES; DATASETS; XGOBI;
DOI
10.1111/j.1467-9868.2008.00682.x
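Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification code
070103 [Probability theory and mathematical statistics]; 140311 [Social design and social innovation];
Abstract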
Non-hierarchical clustering methods are frequently based on the idea of forming groups around 'objects'. The main exponent of this class of methods is the k-means method, where these objects are points. However, clusters in a data set may often be due to certain relationships between the measured variables. For instance, we can find linear structures such as straight lines and planes, around which the observations are grouped in a natural way. These structures are not well represented by points. We present a method that searches for linear groups in the presence of outliers. The method is based on the idea of impartial trimming. We search for the 'best' subsample containing a proportion 1 − α of the data and for the best k affine subspaces fitting those non-discarded observations, measuring discrepancies through orthogonal distances. The population version of the sample problem is also considered. We prove the existence of solutions for the sample and population problems together with their consistency. A feasible algorithm for solving the sample problem is described as well. Finally, some examples showing how the proposed method works in practice are provided.
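The abstract does not spell out the algorithm, so the following is only a minimal sketch of the idea it describes: keep a fraction of roughly 1 − α of the observations, fit k affine subspaces to them, and measure fit by squared orthogonal distances. It uses random starts followed by alternating assign, trim, and refit steps in the style of trimmed k-means, which is an assumption here and not necessarily the authors' procedure. All function names, parameters, and defaults (`trimmed_linear_clustering`, `n_starts`, `n_iter`, the rule for how many points to keep) are hypothetical.

```python
import numpy as np


def fit_affine_subspace(points, dim):
    """Orthogonal-regression fit: centroid plus the top `dim` principal directions."""
    center = points.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    return center, vt[:dim]


def orthogonal_sq_dist(points, center, basis):
    """Squared orthogonal distance from each point to the affine subspace."""
    centered = points - center
    projected = centered @ basis.T @ basis
    return ((centered - projected) ** 2).sum(axis=1)


def assign_and_trim(x, fits, n_keep):
    """Assign each point to its closest subspace, then keep the n_keep closest points."""
    d = np.column_stack([orthogonal_sq_dist(x, c, b) for c, b in fits])
    nearest, dmin = d.argmin(axis=1), d.min(axis=1)
    keep = np.argsort(dmin)[:n_keep]
    labels = np.full(x.shape[0], -1)  # -1 marks a trimmed observation
    labels[keep] = nearest[keep]
    return labels, dmin[keep].sum()


def trimmed_linear_clustering(x, k, alpha, dim=1, n_starts=20, n_iter=50, seed=0):
    """Hypothetical helper: heuristic search for k trimmed affine subspaces."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Assumption: trim floor(n * alpha) points and keep the rest.
    n_keep = n - int(np.floor(n * alpha))
    best_obj, best_labels, best_fits = np.inf, None, None
    for _ in range(n_starts):
        # Initialise each subspace from dim + 1 randomly chosen observations.
        fits = [fit_affine_subspace(x[rng.choice(n, dim + 1, replace=False)], dim)
                for _ in range(k)]
        labels, obj = assign_and_trim(x, fits, n_keep)
        for _ in range(n_iter):
            # Refit every subspace that still has enough members.
            for j in range(k):
                members = x[labels == j]
                if len(members) > dim:
                    fits[j] = fit_affine_subspace(members, dim)
            new_labels, obj = assign_and_trim(x, fits, n_keep)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        if obj < best_obj:
            best_obj, best_labels, best_fits = obj, labels, list(fits)
    return best_labels, best_fits, best_obj
```

Calling `trimmed_linear_clustering(x, k=2, alpha=0.1)` on an `(n, p)` array returns a label per observation (with `-1` for trimmed points), the fitted `(center, basis)` pairs, and the trimmed sum of squared orthogonal distances. Because the search is a local heuristic, the result can depend on `seed` and `n_starts`.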
Pages: 301-318
Number of pages: 18