Summarization - compressing data into an informative representation

Cited by: 45
Authors
Chandola, Varun [1]
Kumar, Vipin [1]
Affiliation
[1] Univ Minnesota, Dept Comp Sci, Minneapolis, MN 55414 USA
Keywords
summarization; frequent itemsets; categorical attributes
DOI
10.1007/s10115-006-0039-1
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we formulate the problem of summarization of a data set of transactions with categorical attributes as an optimization problem involving two objective functions - compaction gain and information loss. We propose metrics to characterize the output of any summarization algorithm. We investigate two approaches to address this problem. The first approach is an adaptation of clustering and the second approach makes use of frequent itemsets from the association analysis domain. We illustrate one application of summarization in the field of network data where we show how our technique can be effectively used to summarize network traffic into a compact but meaningful representation. Specifically, we evaluate our proposed algorithms on the 1998 DARPA Off-Line Intrusion Detection Evaluation data and network data generated by SKAION Corp for the ARDA information assurance program.
Pages: 355-378
Page count: 24
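
The abstract names two competing objectives, compaction gain and information loss, without defining them. The following is a minimal sketch of how such a trade-off could be measured, not the authors' formulation. It assumes that a summary is a tuple of attribute values in which `None` acts as a wildcard, that compaction gain is the ratio of transactions to summaries, and that information loss is the average fraction of attribute values hidden by the most specific summary covering each transaction. All names and the sample records are hypothetical.

```python
from typing import Optional, Sequence

Row = Sequence[str]
Summary = Sequence[Optional[str]]  # None is an assumed wildcard ("any value")


def covers(summary: Summary, row: Row) -> bool:
    """True if every non-wildcard value in the summary matches the row."""
    return all(s is None or s == v for s, v in zip(summary, row))


def compaction_gain(rows: Sequence[Row], summaries: Sequence[Summary]) -> float:
    """Assumed definition: number of transactions per summary."""
    return len(rows) / len(summaries)


def information_loss(rows: Sequence[Row], summaries: Sequence[Summary]) -> float:
    """Assumed definition: mean fraction of attribute values per row that
    are replaced by a wildcard in the most specific covering summary."""
    total = 0.0
    for row in rows:
        candidates = [s for s in summaries if covers(s, row)]
        if not candidates:
            raise ValueError(f"no summary covers row {row!r}")
        # Most specific summary = fewest wildcards.
        best = min(candidates, key=lambda s: sum(v is None for v in s))
        total += sum(v is None for v in best) / len(row)
    return total / len(rows)


# Hypothetical network-flow records: (protocol, destination port, source host).
rows = [
    ("tcp", "80", "hostA"),
    ("tcp", "80", "hostB"),
    ("udp", "53", "hostC"),
]
summaries = [
    ("tcp", "80", None),      # merges the two web flows, hiding the host
    ("udp", "53", "hostC"),   # keeps the DNS flow unchanged
]

gain = compaction_gain(rows, summaries)
loss = information_loss(rows, summaries)
print(f"compaction gain = {gain:.2f}, information loss = {loss:.3f}")
```

Under these assumed definitions, merging more transactions into fewer, more general summaries raises the compaction gain but also raises the information loss, which is the tension the optimization problem in the abstract has to balance.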