An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization

Cited by: 115
Authors
Lee, Lam Hong [1]
Wan, Chin Heng [1]
Rajkumar, Rajprasad [2]
Isa, Dino [2]
Affiliations
[1] Univ Tunku Abdul Rahman, Fac Informat & Commun Technol, Kampar 31900, Perak, Malaysia
[2] Univ Nottingham, Fac Engn, Intelligent Syst Res Grp, Semenyih 43500, Selangor, Malaysia
Keywords
Text document classification; Support Vector Machine; Euclidean distance function; Kernel function; Soft margin parameter; Kernel parameters; Learning methods
DOI
10.1007/s10489-011-0314-z
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents the implementation of a new text document classification framework that uses the Support Vector Machine (SVM) approach in the training phase and the Euclidean distance function in the classification phase, coined the Euclidean-SVM. The SVM constructs a classifier by generating a decision surface, the optimal separating hyper-plane, that partitions the different categories of data points in the vector space. The concept of the optimal separating hyper-plane can be generalized to non-linearly separable cases by introducing kernel functions, which map the data points from the input space into a high-dimensional feature space in which they can be separated by a linear hyper-plane. Consequently, the choice of kernel function has a strong impact on the classification accuracy of the SVM. Besides the kernel function, the value of the soft margin parameter, C, is another critical factor in the performance of the SVM classifier. A key problem of the conventional SVM classification framework is therefore the need to determine an appropriate kernel function and an appropriate value of C for datasets of varying characteristics, in order to guarantee high classifier accuracy. In this paper, we introduce a distance measurement technique that uses the Euclidean distance function in place of the optimal separating hyper-plane as the classification decision function of the SVM. In our approach, the support vectors of each category are identified from the training data points by the SVM during the training phase. In the classification phase, when a new data point is mapped into the original vector space, the average distance between the new data point and the support vectors of each category is measured using the Euclidean distance function. The new data point is assigned to the category whose support vectors have the lowest average distance to it, which makes the classification decision independent of the hyper-plane produced by any particular kernel function and soft margin parameter. We tested the proposed framework on several text datasets. The experimental results show that the accuracy of the Euclidean-SVM text classifier is largely insensitive to the choice of kernel function and soft margin parameter C.
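The two-phase decision rule described in the abstract lends itself to a short illustration. The sketch below is a minimal, hypothetical rendering of the Euclidean-SVM idea, not the authors' implementation: it assumes dense NumPy feature vectors (for example, TF-IDF weights), uses scikit-learn's SVC purely to identify the support vectors of each category, and the kernel and C shown are placeholder choices.

```python
# Minimal sketch of the Euclidean-SVM decision rule described in the abstract.
# Assumptions not fixed by the paper: dense NumPy feature vectors (e.g., TF-IDF),
# a linear kernel, and C = 1.0; the SVM here serves only to pick support vectors.
import numpy as np
from sklearn.svm import SVC

def collect_support_vectors(X, y):
    """Training phase: fit an SVM and group its support vectors by category."""
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)
    sv_labels = y[clf.support_]          # category label of each support vector
    return {c: clf.support_vectors_[sv_labels == c] for c in np.unique(y)}

def euclidean_svm_predict(x, svs_by_class):
    """Classification phase: assign x to the category whose support vectors
    have the lowest average Euclidean distance to x."""
    avg_dist = {c: np.linalg.norm(svs - x, axis=1).mean()
                for c, svs in svs_by_class.items()}
    return min(avg_dist, key=avg_dist.get)

# Hypothetical usage with toy 2-D vectors standing in for document features.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y = np.array([0, 0, 1, 1])
svs = collect_support_vectors(X, y)
print(euclidean_svm_predict(np.array([0.85, 0.15]), svs))   # -> 0
```

In this sketch the kernel and C influence only which training points are retained as support vectors; the final label comes from the average-distance comparison, which mirrors the insensitivity to kernel and C reported in the abstract.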
Pages: 80-99
Number of pages: 20