Mining minimal distinguishing subsequence patterns with gap constraints

被引:89
作者
Ji, Xiaonan
Bailey, James [1 ]
Dong, Guozhu
机构
[1] Univ Melbourne, NICTA Victoria Lab, Dept Comp Sci & Software Engn, Melbourne, Vic 3053, Australia
[2] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA
关键词
data mining algorithm; sequential pattern; frequent pattern; emerging pattern; gap constraint; contrast pattern;
D O I
10.1007/s10115-006-0038-2
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the gap constraint. We present an efficient algorithm called ConSGapMiner (Contrast Sequences with Gap Miner), to mine all MDSs satisfying a minimum and maximum gap constraint, plus a maximum length constraint. It employs highly efficient bitset and boolean operations, for powerful gap-based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets, demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high dimensional datasets at low supports.
引用
收藏
页码:259 / 286
页数:28
相关论文
共 34 条
[1]  
AGRAWAL R, 1995, PROC INT CONF DATA, P3, DOI 10.1109/ICDE.1995.380415
[2]  
[Anonymous], 1999, Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining p, DOI [10.1145/312129., DOI 10.1145/312129, 10.1145/312129, 10.1145/312129.312191]
[3]  
[Anonymous], 2000, P 6 ACM SIGKDD INT C
[4]  
[Anonymous], KNOWL INF SYST
[5]  
[Anonymous], 1995, P 1 SIGKDD INT C KNO
[6]  
Antunes C, 2003, LECT NOTES ARTIF INT, V2734, P239
[7]  
Ayres J., 2002, Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining, P429, DOI 10.1145/775047.775109
[8]  
Bailey J, 2003, LECT NOTES COMPUT SC, V2762, P226
[9]   Detecting group differences: Mining contrast sets [J].
Bay, SD ;
Pazzani, MJ .
DATA MINING AND KNOWLEDGE DISCOVERY, 2001, 5 (03) :213-246
[10]   Mining emerging substrings [J].
Chan, S ;
Kao, B ;
Yip, CL ;
Tang, M .
EIGHTH INTERNATIONAL CONFERENCE ON DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, 2003, :119-126