Discovering all most specific sentences

Cited by: 103
Authors
Gunopulos, D [1]
Khardon, R
Mannila, H
Saluja, S
Toivonen, H
Sharma, RS
Affiliations
[1] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92507 USA
[2] Tufts Univ, Dept EECS, Medford, MA 02155 USA
[3] Univ Helsinki, Dept Comp Sci, HIIT Basic Res Unit, SF-00510 Helsinki, Finland
[4] LSI Log, Milpitas, CA 95035 USA
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2003 / Vol. 28 / No. 2
Keywords
algorithms; theory; data mining; association rules; maximal frequent sets; learning with membership queries; minimal keys
DOI
10.1145/777943.777945
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data mining can be viewed, in many instances, as the task of computing a representation of a theory of a model or a database, in particular by finding a set of maximally specific sentences satisfying some property. We prove some hardness results that rule out simple approaches to solving the problem. The a priori algorithm has been successfully applied to many instances of the problem. We analyze this algorithm and prove that it is optimal when the maximally specific sentences are "small". We also point out its limitations. We then present a new algorithm, the Dualize and Advance algorithm, and prove worst-case complexity bounds that are favorable in the general case. Our results use the concept of hypergraph transversals. Our analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case. On the other hand, using results for the general case of the hypergraph transversal enumeration problem, we can show that the Dualize and Advance algorithm has a worst-case running time that is subexponential in the output size (i.e., the number of maximally specific sentences). We further show that the problem of finding maximally specific sentences is closely related to the problem of exact learning with membership queries studied in computational learning theory.
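
For readers unfamiliar with the levelwise search that the abstract analyzes, the following is a minimal Python sketch for one instance of the problem named in the keywords, maximal frequent sets. It is an Apriori-style illustration under stated assumptions, not the paper's pseudocode: the function name `maximal_frequent_sets`, the representation of the database as a list of item sets, and the `min_support` threshold are all hypothetical choices made for this example.

```python
from itertools import combinations


def maximal_frequent_sets(transactions, min_support):
    """Levelwise (Apriori-style) search for maximal frequent itemsets.

    Illustrative sketch only: names and data layout are assumptions.
    The empty itemset is not considered.
    """
    transactions = [frozenset(t) for t in transactions]

    def is_frequent(itemset):
        # An itemset is frequent if at least min_support transactions contain it.
        return sum(1 for t in transactions if itemset <= t) >= min_support

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = set(level)

    while level:
        current = set(level)
        size = len(level[0]) + 1
        # Candidate generation: join two sets of this level into a set one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Pruning: keep a candidate only if all of its subsets one item smaller are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        level = [c for c in candidates if is_frequent(c)]
        frequent.update(level)

    # The maximally specific sentences here are the frequent sets with no frequent proper superset.
    return {s for s in frequent if not any(s < t for t in frequent)}


if __name__ == "__main__":
    data = [{"a", "b"}, {"a", "b"}, {"b", "c"}, {"b", "c"}, {"a", "c"}]
    print(maximal_frequent_sets(data, 2))
```

The sketch evaluates every frequent set on its way up to the maximal ones, so its cost grows with the size of the maximal sets. This is consistent with the abstract's remark that the levelwise approach works well when the maximally specific sentences are "small".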
Pages: 140-174
Page count: 35