Feature selection on hierarchy of web documents

Cited by: 77
Authors
Mladenic, D
Grobelnik, M
Affiliations
[1] Jozef Stefan Inst, Dept Intelligent Syst, Ljubljana 1000, Slovenia
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
text mining; feature selection; document categorization; maintaining document ontology; machine learning; data mining
DOI
10.1016/S0167-9236(02)00097-0
CLC number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The paper describes feature subset selection used in learning on text data (text learning) and gives a brief overview of feature subset selection as commonly used in machine learning. Several known and some new feature scoring measures appropriate for feature subset selection on large text data are described and related to each other. An experimental comparison of the described measures is given on real-world data collected from the Web. Machine learning techniques are applied to data collected from Yahoo, a large text hierarchy of Web documents. Our approach includes some original ideas for handling a large number of features, categories and documents. The high number of features is reduced by feature subset selection and additionally by applying a stop-list, pruning low-frequency features and using the short description of each document given in the hierarchy instead of the document itself. Documents are represented as feature vectors that include word sequences rather than only single words, as is common when learning on text data, and an efficient approach to generating word sequences is proposed. Based on the hierarchical structure, we propose a way of dividing the problem into subproblems, each representing one of the categories included in the Yahoo hierarchy. In our learning experiments, a naive Bayesian classifier was used on text data for each subproblem. The result of learning is a set of independent classifiers, each used to predict the probability that a new example is a member of the corresponding category. Experimental evaluation on real-world data shows that the proposed approach gives good results. The best performance was achieved by feature selection based on a feature scoring measure known from information retrieval, the Odds ratio, combined with a relatively small number of features. (C) 2002 Elsevier Science B.V. All rights reserved.
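The Odds-ratio scoring the abstract singles out ranks each word by how much more likely it is to appear in positive-class documents than in negative-class ones. As a minimal sketch only: the function name, the use of document frequencies, and the Laplace-style smoothing are assumptions for illustration, not the paper's exact estimator.

```python
import math
from collections import Counter

def odds_ratio_scores(docs, labels):
    """Score each word w by log Odds ratio for the positive class:
    OR(w) = log( P(w|pos) * (1 - P(w|neg)) / ((1 - P(w|pos)) * P(w|neg)) ).
    docs: list of token lists; labels: 1 = positive class, 0 = negative.
    Probabilities are estimated from smoothed document frequencies
    (an illustrative choice, not necessarily the paper's estimator)."""
    pos_docs = [d for d, y in zip(docs, labels) if y == 1]
    neg_docs = [d for d, y in zip(docs, labels) if y == 0]
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    # Document frequency: in how many docs of each class the word occurs.
    pos_df = Counter(w for d in pos_docs for w in set(d))
    neg_df = Counter(w for d in neg_docs for w in set(d))
    vocab = {w for d in docs for w in d}
    scores = {}
    for w in vocab:
        p = (pos_df[w] + 1) / (n_pos + 2)   # smoothed P(w | positive)
        q = (neg_df[w] + 1) / (n_neg + 2)   # smoothed P(w | negative)
        scores[w] = math.log(p * (1 - q) / ((1 - p) * q))
    return scores
```

Feature subset selection would then keep only the top-scoring words; words characteristic of the positive category get large positive scores, words characteristic of the negative category get negative ones.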
Pages: 45-87
Page count: 43