An Incremental Soft Clustering Algorithm for Text

Cited by: 3

Authors:

Feng Zhonghui (冯中慧)

Bao Junpeng (鲍军鹏)

Shen Junyi (沈钧毅)

Affiliation:

[1] School of Electronic and Information Engineering, Xi'an Jiaotong University

Source:

Journal of Xi'an Jiaotong University | 2007, Issue 04

Keywords:

semantic sequence; incremental clustering; soft clustering; text clustering

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

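Abstract:

Traditional text clustering algorithms have high time complexity, and distance-independent algorithms are unsuitable for dynamic, changing text collections. To address these problems, an incremental soft text clustering algorithm based on semantic sequences is proposed. The algorithm takes the multi-topic nature of long texts into account and uses the similarity relation between semantic sequences to compute the coverage of sets of similar semantic sequences. At each step, the candidate cluster with the smallest entropy overlap value is selected as a result cluster, which greatly reduces the dimensionality of the text vector space throughout the clustering process and shortens computation time. Because the semantic sequences used by the proposed algorithm depend only on the text itself, the algorithm is suitable for incremental clustering. Experimental results show that its clustering accuracy is higher than that of other clustering algorithms under the same conditions, and that it is particularly well suited to soft clustering of long-text collections.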
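The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the entropy overlap formula, so the sketch assumes the definition used in frequent term-based clustering [4], where a document that appears in f remaining candidates contributes -(1/f) ln(1/f). The function names, the candidate keys (`"a"`, `"b"`, `"c"`), and the stopping rule (stop once every document is covered) are all hypothetical.

```python
import math


def entropy_overlap(cluster, candidates):
    """Entropy overlap of one candidate cluster (assumed FTC-style definition).

    For each document, f is the number of remaining candidates containing it;
    the document contributes -(1/f) * ln(1/f), which is 0 when f == 1.
    """
    total = 0.0
    for doc in cluster:
        f = sum(1 for docs in candidates.values() if doc in docs)
        total += -(1.0 / f) * math.log(1.0 / f)
    return total


def greedy_soft_cover(candidates, all_docs):
    """Repeatedly pick the candidate with the smallest entropy overlap.

    Documents are NOT removed from the other candidates after a pick, so a
    document may end up in several result clusters (soft clustering).
    """
    remaining = {key: set(docs) for key, docs in candidates.items() if docs}
    uncovered = set(all_docs)
    selected = []
    while uncovered and remaining:
        # Only consider candidates that still cover at least one new document.
        useful = [key for key, docs in remaining.items() if docs & uncovered]
        if not useful:
            break
        # Ties are broken by key so the result is deterministic.
        best = min(useful, key=lambda k: (entropy_overlap(remaining[k], remaining), k))
        docs = remaining.pop(best)
        selected.append((best, docs))
        uncovered -= docs
    return selected


if __name__ == "__main__":
    # Hypothetical candidates: key = shared semantic sequence, value = document ids.
    candidates = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
    for key, docs in greedy_soft_cover(candidates, {1, 2, 3, 4}):
        print(key, sorted(docs))
```

In this toy input, document 2 belongs to both `"a"` and `"b"`, so it appears in two result clusters, which is the behaviour soft clustering is meant to allow for multi-topic texts.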

References

Pages: 398-401, 411

Page count: 5

7 entries in total

[1]

Document clustering using word clusters via the information bottleneck method. Slonim N, Tishby N. Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval. 2000

[2]

Co-clustering documents and words using bipartite spectral graph partitioning. Dhillon I S, Guan Y, Kogan J. Proceedings of the 7th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2001

[3]

Web document clustering: a feasibility demonstration. Zamir O, Etzioni O. Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval. 1998

[4]

Frequent term-based text clustering. Beil F, Ester M, Xu X W. Proceedings of the 8th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2002

[5]

Elements of information theory. Cover T M, Thomas J A. 1991

[6]

Semantic sequence kin: a method of document copy detection. Bao J P, Shen J Y, Liu X D, et al. Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2004

[7]

Partitioning-based clustering for web document categorization. Boley D, Gini M, Gross R, et al. Decision Support Systems. 1999
