Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms

Cited by: 117
Authors
Maitra, Ranjan [1, 2]
Melnykov, Volodymyr [3]
Affiliations
[1] Iowa State Univ, Dept Stat, Ames, IA 50011 USA
[2] Iowa State Univ, Stat Lab, Ames, IA 50011 USA
[3] N Dakota State Univ, Dept Stat, Fargo, ND 58102 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Cluster overlap; Eccentricity of ellipsoid; Mclust; MixSim; Mixture distribution; Parallel distribution plots; ARTIFICIAL TEST CLUSTERS; MULTIVARIATE; MEMBERSHIP; SEPARATION; OVERLAP; TESTS;
DOI
10.1198/jcgs.2009.08054
Chinese Library Classification
O21 [Probability and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
A new method is proposed to generate sample Gaussian mixture distributions according to prespecified overlap characteristics. Such methodology is useful for evaluating the performance of clustering algorithms. Our suggested approach involves deriving and calculating the exact overlap between every pair of clusters, measured in terms of their total probability of misclassification, and then guiding the simulation of Gaussian components so that they satisfy the prespecified overlap characteristics. The algorithm is illustrated in two and five dimensions using contour plots and parallel distribution plots, respectively; the latter are introduced and developed here to display mixture distributions in higher dimensions. We also study properties of the algorithm and variability in the simulated mixtures. The utility of the suggested algorithm is demonstrated via a study of initialization strategies in Gaussian clustering. This article has supplementary material online.
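To make the overlap measure in the abstract concrete, here is a minimal sketch that assumes the overlap between two components is the sum of their two misclassification probabilities under the usual rule of assigning a point to the component with the larger weighted density. The abstract says the paper calculates this quantity exactly; the Monte Carlo estimator below is a stand-in for illustration only and is not the authors' method. It is also restricted to spherical covariances to keep it short. All function names are hypothetical, and each component is written as a tuple `(weight, mean, sigma)`.

```python
import math
import random


def log_weighted_density(x, weight, mean, sigma):
    """Log of weight * N(x; mean, sigma^2 I) for a spherical Gaussian."""
    p = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return (math.log(weight)
            - 0.5 * p * math.log(2.0 * math.pi * sigma ** 2)
            - 0.5 * sq_dist / sigma ** 2)


def misclassification_prob(src, other, n_draws, rng):
    """Estimate P(a point drawn from `src` is assigned to `other`)."""
    _, mean_s, sigma_s = src
    errors = 0
    for _ in range(n_draws):
        # Draw one point from the source component.
        x = [rng.gauss(m, sigma_s) for m in mean_s]
        # It is misclassified if the other weighted density is larger.
        if log_weighted_density(x, *other) > log_weighted_density(x, *src):
            errors += 1
    return errors / n_draws


def pairwise_overlap(comp_i, comp_j, n_draws=50_000, seed=0):
    """Monte Carlo estimate of the sum of both misclassification rates."""
    rng = random.Random(seed)
    return (misclassification_prob(comp_i, comp_j, n_draws, rng)
            + misclassification_prob(comp_j, comp_i, n_draws, rng))


def equal_spherical_overlap(mean_i, mean_j, sigma):
    """Closed form for equal weights and a shared spherical covariance:
    2 * Phi(-d / (2 * sigma)), with d the distance between the means."""
    dist = math.dist(mean_i, mean_j)
    z = -dist / (2.0 * sigma)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 2.0 * phi


if __name__ == "__main__":
    comp_a = (0.5, [0.0, 0.0], 1.0)
    comp_b = (0.5, [2.0, 0.0], 1.0)
    print("Monte Carlo:", pairwise_overlap(comp_a, comp_b))
    print("Closed form:", equal_spherical_overlap(comp_a[1], comp_b[1], 1.0))
```

In the equal-weight, shared-variance special case the estimate can be checked against the closed form, which is a convenient sanity test before moving to general covariance matrices, where no such simple expression is available.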
Pages: 354-376
Page count: 23