Semi-Supervised Clustering Algorithm Based on Tri-Training and Data Editing

Cited by: 28

Authors:

Deng Chao

Guo Maozu

Affiliation:

[1] School of Computer Science and Technology, Harbin Institute of Technology

Source:

Journal of Software (软件学报) | 2008, Issue 03

Keywords:

semi-supervised clustering; semi-supervised classification; K-means; seeds set; Tri-Training; Depuration data editing

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301.6 [Algorithm theory]

Discipline classification code:

081202

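Abstract:

This paper proposes a semi-supervised clustering algorithm. Before the seeds set is used to initialize the cluster centers, the iterative training process of the semi-supervised classification method Tri-training is used to label unlabeled data, and the newly labeled data are added to the seeds set to enlarge it. During Tri-training, Depuration, a data editing technique based on the nearest-neighbor rule, is applied to correct and purify the mislabeled noisy data produced while the seeds set is enlarged, which improves the quality of the seeds set. Experimental results show that the proposed semi-supervised clustering algorithm, DE-Tri-training, which is based on Tri-training and data editing, effectively improves how well the seeds set initializes the cluster centers and improves clustering performance.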
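The two steps around the seeds set that the abstract describes can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Tri-training labeling loop is omitted, the function names (`depurate`, `seeded_centers`, `kmeans`) and the parameters `k` and `k_prime` are hypothetical, and the editing rule assumes the common form of Depuration, in which a sample takes the label that at least `k_prime` of its `k` nearest neighbors share and is discarded otherwise.

```python
import math  # math.dist requires Python 3.8+
from collections import Counter, defaultdict


def depurate(seeds, k=3, k_prime=2):
    """Edit a seeds set given as a list of (point, label) pairs.

    Assumed rule: relabel each sample to the label shared by at least
    k_prime of its k nearest neighbors; drop it if no label reaches k_prime.
    Votes always use the original labels, not the edited ones.
    """
    cleaned = []
    for i, (point, _) in enumerate(seeds):
        others = [s for j, s in enumerate(seeds) if j != i]
        neighbors = sorted(others, key=lambda s: math.dist(point, s[0]))[:k]
        if not neighbors:
            continue  # a lone sample has no neighbors to vote
        label, votes = Counter(lbl for _, lbl in neighbors).most_common(1)[0]
        if votes >= k_prime:
            cleaned.append((point, label))
    return cleaned


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def seeded_centers(seeds):
    """Initialize one cluster center per label as the mean of its seeds."""
    groups = defaultdict(list)
    for point, label in seeds:
        groups[label].append(point)
    return {label: mean_point(pts) for label, pts in groups.items()}


def kmeans(points, centers, iters=10):
    """Plain K-means (Lloyd iterations) started from the given centers."""
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            nearest = min(centers, key=lambda lbl: math.dist(p, centers[lbl]))
            groups[nearest].append(p)
        # Keep the old center when a cluster receives no points.
        centers = {
            lbl: mean_point(groups[lbl]) if groups[lbl] else centers[lbl]
            for lbl in centers
        }
    return centers


# Toy seeds set; the last sample is deliberately mislabeled as "b".
seeds = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((10, 10), "b"), ((10, 11), "b"), ((11, 10), "b"),
         ((0.5, 0.5), "b")]
unlabeled = [(0.2, 0.1), (0.9, 0.8), (9.5, 10.2), (10.8, 10.1)]

cleaned = depurate(seeds)
centers = kmeans(unlabeled + [p for p, _ in cleaned], seeded_centers(cleaned))
```

In the full method, the seeds set passed to `depurate` would first be enlarged with samples labeled during Tri-training; here a hand-made seeds set stands in for that step.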

Pages: 663-673

Page count: 11
