Classification of text documents

Cited by: 154
Authors
Li, YH [1]
Jain, AK [1]
Affiliations
[1] Michigan State Univ, Dept Comp Sci & Engn, E Lansing, MI 48824 USA
DOI
10.1093/comjnl/41.8.537
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
The exponential growth of the Internet has led to considerable interest in developing useful and efficient tools and software to assist users in searching the Web. Document retrieval, categorization, routing and filtering can all be formulated as classification problems. However, the complexity of natural languages and the extremely high dimensionality of the feature space of documents have made this classification problem very difficult. We investigate four different methods for document classification: the naive Bayes classifier, the nearest neighbour classifier, decision trees and a subspace method. These were applied, individually and in combination, to a seven-class set of Yahoo news groups (business, entertainment, health, international, politics, sports and technology). We studied three classifier combination approaches: simple voting, dynamic classifier selection and adaptive classifier combination. Our experimental results indicate that the naive Bayes classifier and the subspace method outperform the other two classifiers on our data sets. Combining multiple classifiers did not always improve the classification accuracy compared to the best individual classifier. Among the three combination approaches, the adaptive classifier combination method introduced here performed best. The best classification accuracy that we were able to achieve on this seven-class problem is approximately 83%, which is comparable to the performance reported in other similar studies. However, the classification problem considered here is more difficult because the pattern classes used in our experiments have a large overlap of words in their corresponding documents.
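As a rough illustration of two of the ideas named in the abstract, a naive Bayes text classifier and simple voting over several classifiers' outputs, the following is a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature extraction, smoothing, or tie-breaking used, so whitespace tokenization, add-one smoothing, and the "earliest label wins ties" rule are all assumptions, and the names NaiveBayesText and majority_vote are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial naive Bayes over word counts (assumed add-one smoothing)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()            # assumption: whitespace tokens
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in doc.lower().split():
                if word in self.vocab:             # words unseen in training are skipped
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def majority_vote(predictions):
    """Simple voting: return the most frequent label.

    Assumption: ties are broken in favour of the label listed first.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label
```

In use, each individual classifier would predict a label for a document and majority_vote would be applied to the list of those labels; the dynamic selection and adaptive combination schemes mentioned in the abstract are not sketched here because the abstract gives no detail on how they work.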
Pages: 537-546
Page count: 10