An Improved KNN Text Classification Algorithm Based on Clustering

Cited by: 16
Authors
Zhou Yong [1]
Li Youwen [1]
Xia Shixiong [1]
Affiliations
[1] China Univ Min & Technol, Sch Comp Sci & Technol, Xuzhou 221116, Jiangsu, Peoples R China
Funding
National Research Foundation, Singapore; National Natural Science Foundation of China; U.S. National Science Foundation;
Keywords
text classification; KNN algorithm; sample austerity; cluster;
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
The traditional KNN text classification algorithm uses all training samples for classification, so it must handle a very large number of training samples and has high computational complexity. It also fails to reflect the differing importance of individual samples. To address these problems, this paper proposes an improved KNN text classification algorithm based on cluster centers. First, the given training sets are compressed and the samples near the class borders are deleted, which eliminates the multi-peak effect of the training sample sets. Second, the training samples of each category are clustered with the k-means algorithm, and all cluster centers are taken as the new training samples. Third, a weight is introduced for each new training sample; it indicates the sample's importance and is determined by the number of samples in the cluster whose center it is. Finally, the modified samples are used to perform KNN text classification. Simulation results show that the proposed algorithm not only effectively reduces the actual number of training samples and lowers the computational complexity, but also improves the accuracy of KNN text classification.
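The clustering, weighting, and classification steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the weight formula, the similarity measure, or the border-deletion rule, so the sketch assumes a weight equal to the cluster's share of its class, a similarity of 1 / (1 + Euclidean distance), and skips border deletion entirely. The names `kmeans`, `condense`, and `classify` and the toy data are hypothetical.

```python
import math
import random
from collections import defaultdict


def assign(points, centers):
    """Group each point with its nearest center (Euclidean distance)."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        groups[nearest].append(p)
    return groups


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, cluster_sizes)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = assign(points, centers)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    sizes = [len(g) for g in assign(points, centers)]
    return centers, sizes


def condense(train, clusters_per_class=2):
    """Replace each class by its cluster centers.

    train is a list of (vector, label) pairs. Returns (center, label, weight)
    triples. ASSUMPTION: weight = cluster size / class size.
    """
    by_label = defaultdict(list)
    for vec, label in train:
        by_label[label].append(tuple(vec))
    prototypes = []
    for label, vecs in by_label.items():
        k = min(clusters_per_class, len(vecs))
        centers, sizes = kmeans(vecs, k)
        for center, n in zip(centers, sizes):
            if n:  # drop clusters that ended up empty
                prototypes.append((center, label, n / len(vecs)))
    return prototypes


def classify(x, prototypes, k=3):
    """Weighted KNN vote over the k nearest cluster centers.

    ASSUMPTION: each vote is weight / (1 + distance).
    """
    nearest = sorted(prototypes, key=lambda p: math.dist(x, p[0]))[:k]
    votes = defaultdict(float)
    for center, label, weight in nearest:
        votes[label] += weight / (1.0 + math.dist(x, center))
    return max(votes, key=votes.get)


# Toy 2-D data standing in for document feature vectors.
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"), ((0.3, 0.3), "a"),
    ((5.0, 5.0), "b"), ((5.1, 5.2), "b"), ((5.2, 5.1), "b"), ((5.3, 5.3), "b"),
]
prototypes = condense(train, clusters_per_class=2)
print(len(train), "training samples condensed to", len(prototypes), "prototypes")
print(classify((0.15, 0.15), prototypes))
```

For real text data the vectors would typically be TF-IDF features compared by cosine similarity; Euclidean distance is used here only to keep the sketch dependency-free.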
Pages: 230-237
Page count: 8