Data Cleaning Technique Based on Clustering Patterns

Cited by: 13

Authors:

唐懿芳

钟达夫

严小卫

Affiliation:

[1] Department of Computer Science, Guangxi Normal University

Source:

Keywords:

Data cleaning; Canopy clustering; duplicate records

DOI:

Not available

Chinese Library Classification (CLC) number:

TP311.131

Subject classification code:

Abstract:

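Before data mining, the data sources to be mined must be cleaned to remove incorrect data. This paper studies the problem of integrating multiple data sources during data cleaning. To address the shortcomings of existing techniques for detecting duplicate records, it proposes a data cleaning method that uses Canopy clustering to group duplicate records, and experimental results verify the effectiveness and accuracy of the proposed algorithm.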
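
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general canopy idea as described by McCallum et al. (reference [8]), applied to candidate duplicate records. The function names, the token-based Jaccard distance, the sample records, and the threshold values are all assumptions chosen for illustration, not the authors' implementation. A cheap distance and two thresholds (loose `t1`, tight `t2`, with `t1 >= t2`) split the records into possibly overlapping canopies; centers are taken in list order rather than at random so the result is reproducible.

```python
from typing import Callable, List, Sequence


def token_jaccard_distance(a: str, b: str) -> float:
    """Cheap distance (assumed for this sketch): 1 - Jaccard similarity of word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def canopy_clusters(
    records: Sequence[str],
    t1: float,
    t2: float,
    distance: Callable[[str, str], float] = token_jaccard_distance,
) -> List[List[int]]:
    """Return canopies as lists of record indices; canopies may overlap.

    t1 is the loose threshold (canopy membership) and t2 the tight threshold
    (removal from the list of candidate centers).
    """
    if t2 < 0 or t1 < t2:
        raise ValueError("thresholds must satisfy 0 <= t2 <= t1")
    remaining = list(range(len(records)))  # indices still eligible as centers
    canopies: List[List[int]] = []
    while remaining:
        center = remaining[0]  # deterministic choice; the original idea picks at random
        dists = [distance(records[center], r) for r in records]
        # Every record within the loose threshold joins this canopy.
        canopies.append([i for i, d in enumerate(dists) if d <= t1])
        # Records within the tight threshold (including the center, at distance 0)
        # can no longer become centers.
        remaining = [i for i in remaining if dists[i] > t2]
    return canopies


# Hypothetical sample records, made up for this illustration.
records = [
    "John Smith 12 Main Street Boston",
    "John Smith 12 Main St Boston",
    "Mary Jones 5 Oak Avenue Denver",
    "Mary Jones 5 Oak Ave Denver",
]
print(canopy_clusters(records, t1=0.6, t2=0.4))  # [[0, 1], [2, 3]]
```

In a duplicate-detection setting, a more expensive comparison (for example an edit distance, as in reference [6]) would then typically be run only between records that share a canopy, which is where the savings over all-pairs comparison come from.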

Citation

Pages: 116-119

Page count: 4

References (8 in total):

[1] An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. Monge A, Elkan C. Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. 1997
[2] The Merge/Purge Problem for Large Databases. Hernandez M, Stolfo S. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1995
[3] IntelliClean: A Knowledge-based Intelligent Data Cleaner. Lee ML, Ling TW, Low WL. Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. 2000
[4] Term-Weighting Approaches in Automatic Text Retrieval. Salton G, Buckley C. Information Processing Letters. 1988
[5] AlphaSort: A RISC Machine Sort. Nyberg C, Barclay T, Cvetanovic Z, et al. Proceedings of the 1994 ACM-SIGMOD Conference. 1994
[6] Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Levenshtein V. Soviet Physics-Doklady 10. 1966
[7] Dynamic Inverted Indexes for a Distributed Full Text Retrieval System. Clarke CLA, Cormack GV. Technical Report MT-95-01.
[8] Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. McCallum A, Nigam K, Ungar L. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining. 2000