Research and Implementation of a Clustering-Based Microblog Keyword Extraction Method

Cited by: 9

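Authors:

Sun Xingdong (孙兴东)

Li Aiping (李爱平)

Li Shudong (李树栋)

Affiliation:

[1] College of Computer, National University of Defense Technology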

Source:

Netinfo Security (信息网络安全) | 2014, Issue 12

Funding:

China Postdoctoral Science Foundation

Keywords:

clustering algorithm; TF-IDF; TextRank; n-gram; language model

DOI:

Not available

Chinese Library Classification (CLC) number:

TP393.092

Discipline classification code:

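Abstract:

This paper proposes a clustering-based method for extracting keywords from microblog posts. The method proceeds in three steps. First, the microblog text is preprocessed and segmented into words, and word weights are computed with the TF-IDF and TextRank algorithms; to account for the characteristics of short microblog text, the weights are computed with a weighted scheme, and a clustering algorithm is then applied to the weighted words to extract candidate keywords. Second, following n-gram language model theory with n = 2, a maximum left-neighbor probability and a maximum right-neighbor probability are defined and used to extend the candidate keywords. Third, the extended keywords are filtered using the notions of adjacency variation count and semantic unit count from the semantic extension model, yielding the final extraction result. The experimental results show that TextRank outperforms TF-IDF on short text, and that the method extracts keywords from microblogs effectively.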
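The second step (extending candidates with bigram neighbor probabilities) can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so everything here is an assumption: the maximum left-neighbor probability of a word w is taken as the largest value of count(u, w) / count(w) over all preceding words u, the right-neighbor probability is defined symmetrically, and a candidate is extended by one word on a side when that probability reaches a hypothetical `threshold`. The function names and the toy posts are illustrative only.

```python
from collections import Counter


def count_ngrams(tokenized_posts):
    """Count unigrams and adjacent word pairs (bigrams) over tokenized posts."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_posts:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def max_neighbor(word, unigrams, bigrams, side):
    """Return (neighbor, probability) for the most likely left or right neighbor.

    Assumed definition: probability = count(neighbor next to word) / count(word).
    """
    best, best_p = None, 0.0
    for (left, right), pair_count in bigrams.items():
        if side == "left" and right == word:
            neighbor = left
        elif side == "right" and left == word:
            neighbor = right
        else:
            continue
        p = pair_count / unigrams[word]
        if p > best_p:
            best, best_p = neighbor, p
    return best, best_p


def expand_candidate(word, unigrams, bigrams, threshold=0.6):
    """Extend a candidate by one word per side when the neighbor is likely enough.

    The threshold value is an assumption, not a figure from the paper.
    """
    phrase = [word]
    left, p_left = max_neighbor(word, unigrams, bigrams, "left")
    if left is not None and p_left >= threshold:
        phrase.insert(0, left)
    right, p_right = max_neighbor(word, unigrams, bigrams, "right")
    if right is not None and p_right >= threshold:
        phrase.append(right)
    return tuple(phrase)


if __name__ == "__main__":
    posts = [
        ["machine", "learning", "model"],
        ["machine", "learning", "paper"],
        ["deep", "learning", "model"],
    ]
    unigrams, bigrams = count_ngrams(posts)
    print(expand_candidate("learning", unigrams, bigrams, threshold=0.6))
```

In this toy corpus "learning" is preceded by "machine" in two of its three occurrences and followed by "model" in two of three, so a threshold of 0.6 extends it on both sides while a stricter threshold leaves it unchanged. The third step's filtering by adjacency variation count and semantic unit count is not sketched here because the abstract does not define those quantities.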

Citation

Pages: 27-31

Page count: 5

