Model-based clustering and visualization of navigation patterns on a web site

Cited by: 129
Authors
Cadez, I
Heckerman, D
Meek, C
Smyth, P
White, S
Affiliations
[1] Sparta Syst Inc, Laguna Hills, CA 92653 USA
[2] Microsoft Res, Redmond, WA 98052 USA
[3] Univ Calif Irvine, Sch Informat & Comp Sci, Irvine, CA 92697 USA
Keywords
model-based clustering; sequence clustering; data visualization; Internet; web
DOI
10.1023/A:1024992613384
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data, and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com.
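The clustering step named in the abstract, a mixture of first-order Markov models fitted with Expectation-Maximization, can be illustrated with a minimal sketch. This is not the authors' implementation and not WebCANVAS. The function name `fit_markov_mixture`, the smoothing constant `alpha`, the random initialization, and the toy sessions are all assumptions made for illustration. Each session is assumed to be a list of integer URL-category indices.

```python
import numpy as np

def fit_markov_mixture(sequences, n_states, n_clusters, n_iter=50, alpha=0.01, seed=0):
    """Illustrative EM for a mixture of first-order Markov chains (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(sequences)

    # Sufficient statistics per session: one-hot first state and transition counts.
    first = np.zeros((n, n_states))
    trans = np.zeros((n, n_states, n_states))
    for i, seq in enumerate(sequences):
        first[i, seq[0]] = 1.0
        for a, b in zip(seq[:-1], seq[1:]):
            trans[i, a, b] += 1.0

    # Assumption: start from random soft cluster memberships.
    resp = rng.dirichlet(np.ones(n_clusters), size=n)  # shape (n, K)

    for _ in range(n_iter):
        # M-step: re-estimate mixture weights, initial-state and transition
        # probabilities from responsibility-weighted counts (alpha smooths zeros).
        weights = resp.sum(axis=0) + alpha
        weights /= weights.sum()
        init = resp.T @ first + alpha                          # shape (K, S)
        init /= init.sum(axis=1, keepdims=True)
        tmat = np.einsum("nk,nab->kab", resp, trans) + alpha   # shape (K, S, S)
        tmat /= tmat.sum(axis=2, keepdims=True)

        # E-step: posterior cluster membership for each session, in log space.
        log_p = (np.log(weights)[None, :]
                 + first @ np.log(init).T
                 + np.einsum("nab,kab->nk", trans, np.log(tmat)))
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)

    return weights, init, tmat, resp

# Toy sessions over 3 hypothetical page categories: one group cycles
# 0 -> 1 -> 2, the other cycles 0 -> 2 -> 1.
forward = [[j % 3 for j in range(6 + i)] for i in range(5)]
backward = [[(-j) % 3 for j in range(6 + i)] for i in range(5)]
weights, init, tmat, resp = fit_markov_mixture(forward + backward, n_states=3, n_clusters=2)
labels = resp.argmax(axis=1)  # hard cluster assignment per session
```

In this sketch each EM iteration touches every session once per cluster, which is in the spirit of the linear scaling the abstract reports, though the paper's actual data structures and initialization may differ.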
Pages: 399-424
Page count: 26