A preprocess algorithm of filtering irrelevant information based on the minimum class difference

Cited by: 8
Authors
Chen, Zhiping
Lu, Kevin [1]
Affiliations
[1] Brunel Univ, Uxbridge UB8 3PH, Middx, England
[2] Fujian Univ Technol, Dept Comp Sci, Fuzhou 350014, Peoples R China
Keywords
classification; text categorization; feature selection; preprocess
DOI
10.1016/j.knosys.2006.03.005
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Whether a word (or feature) should be included or excluded during text classification can depend on several factors, such as the amount of information it carries, its frequency of occurrence, and its meaning. The application context is another important factor. A word may represent the characteristics of a document in one application context but fail to reflect its nature in another. This paper reports an investigation into feature selection for classification that takes into account the application context of the documents to be processed. A new feature selection algorithm for text classification, called the PBMCD algorithm, is proposed. The algorithm has been implemented and tested on three different data sets. The experimental results show that it can not only filter out irrelevant features before classification but also increase classification accuracy. For comparison, experimental results obtained with other methods are also presented. (c) 2006 Elsevier B.V. All rights reserved.
Pages: 422-429
Page count: 8