Improving Markov Chain Monte Carlo model search for data mining

被引:96
作者
Giudici, P
Castelo, R
机构
[1] Univ Pavia, Dept Econ & Quantitat Methods, I-27100 Pavia, Italy
[2] Univ Utrecht, Inst Comp & Informat Sci, NL-3508 TC Utrecht, Netherlands
关键词
Bayesian structural learning; convergence diagnostics; Dirichlet distribution; market basket analysis; Markov chain Monte Carlo;
D O I
10.1023/A:1020202028934
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The motivation of this paper is the application of MCMC model scoring procedures to data mining problems, involving a large number of competing models and other relevant model choice aspects. To achieve this aim we analyze one of the most popular Markov Chain Monte Carlo methods for structural learning in graphical models, namely, the MC3 algorithm proposed by D. Madigan and J. York (International Statistical Review, 63, 215-232, 1995). Our aim is to improve their algorithm to make it an effective and reliable tool in the field of data mining. In such context, typically highly dimensional in the number of variables, little can be known a priori and, therefore, a good model search algorithm is crucial. We present and describe in detail our implementation of the MC3 algorithm, which provides an efficient general framework for computations with both Directed Acyclic Graphical (DAG) models and Undirected Decomposable Models (UDG). We believe that the possibility of commuting easily between the two classes of models constitutes an important asset in data mining, where an a priori knowledge of causal effects is usually difficult to establish. Furthermore, in order to improve the MC3 method we propose provide several graphical monitors which can help extracting results and assessing the goodness of the Markov chain Monte Carlo approximation to the posterior distribution of interest. We apply our proposed methodology first to the well-known coronary heart disease dataset (D. Edwards & T. Havranek, Biometrika, 72:2, 339-351, 1985). We then introduce a novel data mining application which concerns market basket analysis.
引用
收藏
页码:127 / 158
页数:32
相关论文
共 25 条
[1]  
[Anonymous], 1987, AAAI C ARTIFICIAL IN
[2]  
Brooks SP, 1998, J ROY STAT SOC D-STA, V47, P69, DOI 10.1111/1467-9884.00117
[3]  
Buntine W., 1991, P 7 C UNC ART INT, P52
[4]  
CHICKERING DM, 1995, P 11 C UNC ART INT, P87
[5]  
Cowell R.G., 1999, PROBABILISTIC NETWOR
[6]   HYPER MARKOV LAWS IN THE STATISTICAL-ANALYSIS OF DECOMPOSABLE GRAPHICAL MODELS [J].
DAWID, AP ;
LAURITZEN, SL .
ANNALS OF STATISTICS, 1993, 21 (03) :1272-1317
[7]  
DAWID AP, 1979, J ROY STAT SOC B MET, V41, P1
[8]   Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models [J].
Dellaportas, P ;
Forster, JJ .
BIOMETRIKA, 1999, 86 (03) :615-633
[9]   A FAST PROCEDURE FOR MODEL SEARCH IN MULTIDIMENSIONAL CONTINGENCY-TABLES [J].
EDWARDS, D ;
HAVRANEK, T .
BIOMETRIKA, 1985, 72 (02) :339-351
[10]  
FRYDENBERG M, 1989, BIOMETRIKA, V76, P539, DOI 10.2307/2336119