A taxonomy of dirty data

Cited by: 165
Authors
Kim, W [1]
Choi, BJ
Hong, EK
Kim, SK
Lee, D
Affiliations
[1] Cyber Database Solut Inc, Austin, TX USA
[2] Ewha Inst Sci & Technol, Dept Comp Sci, Seoul, South Korea
[3] Seoul Natl Univ, AITrc, Seoul, South Korea
[4] Lucent Technol, Seoul, South Korea
[5] Korea Adv Inst Sci & Technol, Dept Biosyst, Taejon, South Korea
Keywords
dirty data; data quality; data mining; data cleansing; data warehousing
DOI
10.1023/A:1021564703268
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Today large corporations are constructing enterprise data warehouses from disparate data sources in order to run enterprise-wide data analysis applications, including decision support systems, multidimensional online analytical applications, data mining, and customer relationship management systems. A major problem that is only beginning to be recognized is that the data in data sources are often "dirty". Broadly, dirty data include missing data, wrong data, and non-standard representations of the same data. The results of analyzing a database/data warehouse of dirty data can be damaging and, at best, unreliable. In this paper, a comprehensive classification of dirty data is developed for use as a framework for understanding how dirty data arise, manifest themselves, and may be cleansed to ensure proper construction of data warehouses and accurate data analysis. The impact of dirty data on data mining is also explored.
Pages: 81-99
Number of pages: 19