A framework for measuring differences in data characteristics

Cited by: 20
Authors
Ganti, V [1]
Gehrke, J
Ramakrishnan, R
Loh, WY
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
[2] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
[3] Univ Wisconsin, Dept Comp Sci, Madison, WI 53706 USA
[4] Univ Wisconsin, Dept Stat, Madison, WI 53706 USA
DOI
10.1006/jcss.2001.1808
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
A data mining algorithm builds a model that captures interesting aspects of the underlying data. We develop a framework for quantifying the difference, called the deviation, between two datasets in terms of the models they induce. In addition to being a quantitative, intuitively interpretable measure of difference, the deviation between two datasets can also be computed very fast. Our framework covers a wide variety of models, including frequent itemsets, decision tree classifiers, and clusters, and captures standard measures of deviation such as the misclassification rate and the chi-squared metric as special cases. We also show how statistical techniques can be applied to the deviation measure to assess whether the difference between two models is significant (i.e., whether the underlying datasets have statistically significant differences in their characteristics), and discuss several practical applications. © 2002 Elsevier Science (USA).
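As a rough illustration of the idea described in the abstract, the sketch below compares two small transaction datasets through the frequent itemsets each one induces. It is not the authors' implementation: the helper names (`support`, `frequent_itemsets`, `deviation`), the use of an absolute difference summed over the union of both itemset collections, and the brute-force itemset enumeration are all assumptions made for this example.

```python
from itertools import combinations


def support(dataset, itemset):
    """Fraction of transactions in `dataset` that contain every item in `itemset`."""
    if not dataset:
        return 0.0
    hits = sum(1 for transaction in dataset if itemset <= transaction)
    return hits / len(dataset)


def frequent_itemsets(dataset, min_support, max_size=2):
    """Brute-force model: all itemsets up to `max_size` with support >= `min_support`."""
    items = sorted({item for transaction in dataset for item in transaction})
    model = set()
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if support(dataset, candidate) >= min_support:
                model.add(candidate)
    return model


def deviation(d1, d2, min_support=0.5):
    """Assumed deviation: sum of absolute support differences over the union
    of the itemsets that are frequent in either dataset."""
    regions = frequent_itemsets(d1, min_support) | frequent_itemsets(d2, min_support)
    return sum(abs(support(d1, r) - support(d2, r)) for r in regions)


if __name__ == "__main__":
    d1 = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
    d2 = [{"a"}, {"a"}, {"a", "c"}, {"c"}]
    print(deviation(d1, d2))  # larger values mean the induced models differ more
    print(deviation(d1, d1))  # a dataset has zero deviation from itself
```

Measuring both datasets over the same set of itemsets keeps the comparison symmetric. Other difference and aggregate functions could be substituted for the absolute difference and the sum used here.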
Pages: 542-578
Number of pages: 37