Mining sequential patterns by pattern-growth: The PrefixSpan approach

被引:725
作者
Pei, J [1 ]
Han, JW
Mortazavi-Asl, B
Wang, JY
Pinto, H
Chen, QM
Dayal, U
Hsu, MC
机构
[1] Simon Fraser Univ, Sch Comp Sci, Burnaby, BC V5A 1S6, Canada
[2] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[3] Univ Minnesota, Minneapolis, MN 55455 USA
[4] Simon Fraser Univ, Sch Comp Sci, Burnaby, BC V5A 1S6, Canada
[5] Packetmot Inc, San Mateo, CA 94403 USA
[6] Hewlett Packard Labs, Palo Alto, CA 94303 USA
[7] Commerce One Inc, San Francisco, CA 94105 USA
基金
加拿大自然科学与工程研究理事会; 美国国家科学基金会;
关键词
data mining algorithm; sequential pattern; frequent pattern; transaction database; sequence database; scalability; performance analysis;
D O I
10.1109/TKDE.2004.77
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [1] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of the pattern growth-based sequential pattern mining, FreeSpan [8], we propose a more efficient method, called PSP, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the a priori-based algorithm GSP, FreeSpan, and SPADE [29] ( a sequential pattern mining algorithm that adopts vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.
引用
收藏
页码:1424 / 1440
页数:17
相关论文
共 31 条
  • [1] AGRAWAL R, 1995, PROC INT CONF DATA, P3, DOI 10.1109/ICDE.1995.380415
  • [2] Agrawal R, 1994, P 20 INT C VER LARG, V1215, P487
  • [3] [Anonymous], P 1998 ACM SIGMOD IN
  • [4] [Anonymous], 1996, EDBT, DOI 10.1007/BFb0014140
  • [5] [Anonymous], P 3 ACM SIGMOD WORKS
  • [6] [Anonymous], 2000, P 6 ACM SIGKDD INT C
  • [7] [Anonymous], 2002, EFFICIENTLY MINING F, DOI DOI 10.1145/775047.775058
  • [8] Constraint-based rule mining in large, dense databases
    Bayardo, RJ
    Agrawal, R
    Gunopulos, D
    [J]. 15TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 1999, : 188 - 197
  • [9] Bettini C., 1998, B TECH COMMITTEE DAT, V21, P32
  • [10] ROCK: A robust clustering algorithm for categorical attributes
    Guha, S
    Rastogi, R
    Shim, K
    [J]. 15TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 1999, : 512 - 521