A new text categorization technique using distributional clustering and learning logic

Cited by: 51
Authors
Al-Mubaid, Hisham [1]
Umair, Syed A. [1]
Affiliation
[1] Univ Houston Clear Lake, Houston, TX 77058 USA
Keywords
text categorization; feature selection; machine learning
DOI
10.1109/TKDE.2006.135
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Text categorization remains one of the most researched NLP problems because of the ever-increasing volume of electronic documents and digital libraries. In this paper, we present a new text categorization method that combines the distributional clustering of words with a learning logic technique, called Lsquare, for constructing text classifiers. The high dimensionality of text documents hinders categorization, and feature clustering has proven to be an effective alternative to feature selection for reducing that dimensionality. We therefore use a distributional clustering method (IB) to generate an efficient representation of documents and apply Lsquare to train text classifiers. The method was extensively tested and evaluated. Under identical experimental settings with a small number of training documents, the proposed method achieves higher or comparable classification accuracy and F-1 results compared with SVM on three benchmark data sets: WebKB, 20Newsgroup, and Reuters21578. The results indicate that the method is a good choice for applications with a limited amount of labeled training data. We also demonstrate the effect of changing the training set size on the classification performance of the learners.
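The feature-clustering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' IB procedure and does not implement Lsquare: it assumes a simplified stand-in in which words are merged greedily by the Jensen-Shannon divergence between their estimated class distributions, and each document is then represented by counts over word clusters instead of over individual words. All function names are hypothetical.

```python
from collections import Counter, defaultdict
import math


def word_class_distributions(docs, labels):
    """Estimate P(class | word) from tokenized, labeled documents."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    return {
        word: [c[k] / sum(c.values()) for k in classes]
        for word, c in counts.items()
    }


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cluster_words(dists, n_clusters):
    """Greedy agglomerative merging (an assumed simplification, not IB):
    repeatedly merge the two clusters with the closest mean distributions."""
    clusters = [([w], d) for w, d in sorted(dists.items())]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = js_divergence(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (wi, di), (wj, dj) = clusters[i], clusters[j]
        ni, nj = len(wi), len(wj)
        mean = [(a * ni + b * nj) / (ni + nj) for a, b in zip(di, dj)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((wi + wj, mean))
    return {w: idx for idx, (words, _) in enumerate(clusters) for w in words}


def to_cluster_features(tokens, word_to_cluster, n_clusters):
    """Represent a document as counts over word clusters; unseen words are skipped."""
    vec = [0] * n_clusters
    for tok in tokens:
        if tok in word_to_cluster:
            vec[word_to_cluster[tok]] += 1
    return vec


# Toy data, invented purely for illustration.
train_docs = [["goal", "match", "team"], ["team", "goal"],
              ["stock", "market"], ["market", "price", "stock"]]
train_labels = ["sport", "sport", "finance", "finance"]

word_to_cluster = cluster_words(
    word_class_distributions(train_docs, train_labels), n_clusters=2)
features = to_cluster_features(["goal", "stock", "stock"], word_to_cluster, 2)
```

The resulting low-dimensional vectors would then be passed to whatever classifier is being trained; the paper uses Lsquare and compares against SVM, neither of which is shown here.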
Pages: 1156-1165
Page count: 10