An improved K-nearest-neighbor algorithm for text categorization

Cited by: 160
Authors
Jiang, Shengyi [1 ]
Pang, Guansong [1 ]
Wu, Meiling [1 ]
Kuang, Limin [1 ]
Affiliations
[1] Guangdong Univ Foreign Studies, Sch Informat, Guangzhou 510420, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text categorization; KNN text categorization; One-pass clustering; Spam filtering; CLASSIFICATION; MODEL;
DOI
10.1016/j.eswa.2011.08.040
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text categorization is an important tool for managing and organizing the surging volume of text data. Many text categorization algorithms have been explored in the literature, such as KNN, Naive Bayes and Support Vector Machine. KNN text categorization is effective but computationally inefficient. In this paper, we propose an improved KNN algorithm for text categorization, which builds the classification model by combining a constrained one-pass clustering algorithm with KNN text categorization. Empirical results on three benchmark corpora show that our algorithm substantially reduces the number of text similarity computations and outperforms state-of-the-art KNN, Naive Bayes and Support Vector Machine classifiers. In addition, the classification model constructed by the proposed algorithm can be updated incrementally, and it scales well in many real-world applications. (C) 2011 Elsevier Ltd. All rights reserved.
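The abstract names the two ingredients (a constrained one-pass clustering step and KNN) without giving their details, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that "constrained" means a document may join only a cluster with the same class label, that similarity is cosine over sparse term-weight vectors, and that classification votes over the k most similar cluster centroids. The names `build_model`, `classify`, and the `threshold` parameter are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_model(docs, labels, threshold=0.5, clusters=None):
    """One pass over the training data (assumed constraint: same-label clusters).

    Each document joins the most similar existing cluster that shares its
    label if the similarity reaches `threshold`; otherwise it opens a new
    cluster. Passing an existing `clusters` list updates it incrementally.
    """
    clusters = [] if clusters is None else clusters
    for vec, label in zip(docs, labels):
        best, best_sim = None, threshold
        for c in clusters:
            if c["label"] != label:
                continue
            # Cosine is scale-invariant, so the summed vector acts as a centroid.
            sim = cosine(vec, c["sum"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"label": label, "sum": dict(vec), "n": 1})
        else:
            for t, w in vec.items():
                best["sum"][t] = best["sum"].get(t, 0.0) + w
            best["n"] += 1
    return clusters


def classify(model, vec, k=3):
    """Similarity-weighted vote among the k cluster centroids nearest to vec."""
    scored = sorted(
        ((cosine(vec, c["sum"]), c["label"]) for c in model),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, label in scored:
        votes[label] += sim
    return max(votes, key=votes.get)
```

Under these assumptions, a query is compared against cluster centroids rather than against every training document, which is one way the number of similarity computations could shrink; calling `build_model` again with the existing `clusters` list illustrates an incremental update.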
Pages: 1503-1509
Page count: 7