A hybrid text classification approach with low dependency on parameter by integrating K-nearest neighbor and support vector machine

Cited by: 84

Authors:

Wan, Chin Heng [2]

Lee, Lam Hong [1]

Rajkumar, Rajprasad [1]

Isa, Dino [1]

Affiliations:

[1] Univ Nottingham, Intelligent Syst Res Grp, Fac Engn, Semenyih 43500, Selangor, Malaysia

[2] Univ Tunku Abdul Rahman, Fac Informat & Commun Technol, Kampar 31900, Perak, Malaysia

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2012, Vol. 39, No. 15

Keywords:

Text document classification; K-nearest neighbor; Support vector machine; Euclidean distance function;

DOI:

10.1016/j.eswa.2012.02.068

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This work implements a new text document classifier by integrating the K-nearest neighbor (KNN) classification approach with the support vector machine (SVM) training algorithm. The proposed Nearest Neighbor-Support Vector Machine hybrid classification approach is termed SVM-NN. KNN has been reported as one of the most widely used text classification approaches because of its simplicity and its efficiency in handling various types of text classification tasks. However, a major difficulty with KNN is determining an appropriate value for the parameter K that guarantees high classification effectiveness, because the choice of K strongly affects the accuracy of the KNN classifier. In addition, KNN is a lazy learning method that keeps the entire set of training samples until classification time, so its computation becomes more intensive as the value of K increases. In this paper, we propose the SVM-NN hybrid classification approach with the objective of minimizing the impact of the parameter on classification accuracy. In the training stage, the SVM is used to reduce the training samples of each available category to their support vectors (SVs). The SVs from the different categories are then used as the training data of a nearest neighbor classification algorithm, in which the Euclidean distance function is used to calculate the average distance between the testing data point and each category's set of SVs. The classification decision is the category with the shortest average distance between its SVs and the testing data point. Experiments on several benchmark text datasets show that the classification accuracy of the SVM-NN approach is less affected by the value of the parameter than that of the conventional KNN classification model. (C) 2012 Elsevier Ltd. All rights reserved.
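
The classification stage described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the per-category support vectors have already been extracted by an SVM trainer (not shown here), and the function name, the dictionary layout and the toy vectors are hypothetical.

```python
import math
from statistics import mean


def classify_by_average_sv_distance(support_vectors, x):
    """Pick the category whose support vectors are closest to x on average.

    support_vectors: dict mapping a category label to a list of feature
    vectors (assumed to be the SVs kept after SVM training; hypothetical layout).
    x: feature vector of the document to classify.
    Returns the chosen label and the per-category average distances.
    """
    avg_dist = {
        label: mean(math.dist(sv, x) for sv in svs)  # average Euclidean distance
        for label, svs in support_vectors.items()
    }
    # Decision rule: the category with the shortest average distance wins.
    return min(avg_dist, key=avg_dist.get), avg_dist


# Hypothetical toy data: two categories with two support vectors each.
svs = {
    "sports": [(0.0, 1.0), (0.0, 3.0)],
    "finance": [(4.0, 0.0), (6.0, 0.0)],
}
label, distances = classify_by_average_sv_distance(svs, (0.0, 2.0))
print(label, distances)
```

In this sketch no value of K is chosen at classification time, since every retained support vector of a category contributes to that category's average distance.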


Pages: 11880-11888

Page count: 9

