A novel feature selection algorithm for text categorization

Cited by: 245
Authors
Shang, Wenqian [1]
Huang, Houkuan
Zhu, Haibin
Lin, Yongmin
Qu, Youli
Wang, Zhihai
Affiliations
[1] Beijing Jiaotong Univ, Sch Comp & Informat Technol, Beijing 100044, Peoples R China
[2] Nipissing Univ, Dept Comp Sci, N Bay, ON P1B 8L7, Canada
Keywords
text feature selection; text categorization; Gini index; kNN classifier; text preprocessing
DOI
10.1016/j.eswa.2006.04.001
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the growth of the web, large numbers of documents are now available on the Internet, and digital libraries, news sources and internal company data keep expanding. Automatic text categorization is therefore increasingly important for handling massive data. However, the major problem of text categorization is the high dimensionality of the feature space. Many methods for text feature selection already exist. To improve the performance of text categorization, we present another feature selection method. Our study is based on Gini index theory, and we design a novel Gini index algorithm to reduce the high dimensionality of the feature space. A new Gini index measure function is constructed and adapted to text categorization. Experimental results show that our improved Gini index performs better than other feature selection methods. (c) 2006 Elsevier Ltd. All rights reserved.
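As a rough illustration of Gini-index-based feature selection, the sketch below scores each term by the classical Gini purity, the sum over classes of P(class | term) squared, and keeps the highest-scoring terms. This is an assumption made for illustration: the abstract does not state the paper's new measure function, so this is not the authors' algorithm. The function names `gini_term_scores` and `select_top_k` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def gini_term_scores(docs, labels):
    """Score terms by classical Gini purity: sum over classes of P(class | term)^2.

    docs   -- list of token lists, one per document
    labels -- list of class labels, aligned with docs
    A score of 1.0 means the term occurs in documents of a single class only.
    Assumption: this is the textbook purity form, not the paper's modified measure.
    """
    term_class_df = defaultdict(Counter)  # term -> class -> document count
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term at most once per document
            term_class_df[term][label] += 1

    scores = {}
    for term, per_class in term_class_df.items():
        total = sum(per_class.values())
        scores[term] = sum((n / total) ** 2 for n in per_class.values())
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical): "the" appears in both classes and scores lowest.
docs = [
    ["ball", "goal", "the"],
    ["goal", "team", "the"],
    ["vote", "law", "the"],
    ["law", "senate", "the"],
]
labels = ["sport", "sport", "politics", "politics"]
scores = gini_term_scores(docs, labels)
selected = select_top_k(scores, 3)
```

Note that plain purity gives the maximum score to any term that occurs in a single document, so a sketch like this would probably need a minimum document-frequency cutoff before being used on real data.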
Pages: 1-5
Number of pages: 5