Data Cleaning Technique Based on Clustering Patterns

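Cited by: 13
Authors
Tang Yifang (唐懿芳)
Zhong Dafu (钟达夫)
Yan Xiaowei (严小卫)
Affiliation
[1] Department of Computer Science, Guangxi Normal University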
Keywords
Data cleaning; Canopy clustering; duplicate records
DOI
Not available
CLC number
TP311.131
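Abstract
Before any mining takes place, the data source to be mined must be cleaned to remove incorrect data. This paper studies the problem of integrating multiple data sources during data cleaning. To address the shortcomings of existing techniques for detecting duplicate records, it proposes a data cleaning method that uses Canopy clustering to group duplicate records, and experimental results verify the effectiveness and accuracy of the proposed algorithm.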
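The abstract names Canopy clustering but does not spell out the procedure. As a rough illustration only, the sketch below shows the general canopy idea applied to text records: a cheap distance and two thresholds place candidate duplicates into (possibly overlapping) groups, so that a more expensive comparison would only need to run within each group. The token-based Jaccard distance, the threshold values `t1` and `t2`, and the function names are all assumptions made for this example; this is not the authors' implementation.

```python
def tokens(record):
    """Split a record into a set of lowercase whitespace-separated tokens."""
    return set(record.lower().split())


def cheap_distance(a, b):
    """Jaccard distance on token sets (assumed cheap metric for this sketch)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def canopy_clusters(records, t1=0.7, t2=0.4):
    """Group record indices into canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate pool); t2 must not exceed t1.
    Both default values are illustrative assumptions.
    """
    if t2 > t1:
        raise ValueError("t2 must be less than or equal to t1")
    remaining = list(range(len(records)))
    canopies = []
    while remaining:
        center = remaining[0]          # pick the next available record as center
        canopy = [center]
        still_available = []
        for idx in remaining[1:]:
            d = cheap_distance(records[center], records[idx])
            if d <= t1:                # loosely close: joins this canopy
                canopy.append(idx)
            if d > t2:                 # not tightly close: may seed or join others
                still_available.append(idx)
        canopies.append(canopy)
        remaining = still_available
    return canopies


records = [
    "John Smith 12 Main Street",
    "john smith 12 main st",
    "Mary Jones 4 Oak Avenue",
]
print(canopy_clusters(records))
```

In this example the first two records share four of six distinct tokens, so they fall into the same canopy, while the third shares none and forms its own. Records inside one canopy are only candidates; a finer comparison (for example an edit-distance check) would still be needed to confirm that they are duplicates.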
Pages: 116-119
Page count: 4