Text Categorization using the Semi-Supervised Fuzzy c-Means Algorithm

Cited by: 20
Authors
Benkhalifa, M [1]
Bensaid, A [1]
Affiliations
[1] Al Akhawayn Univ, Sch Sci & Engn, Ifrane 53000, Morocco
Source
18TH INTERNATIONAL CONFERENCE OF THE NORTH AMERICAN FUZZY INFORMATION PROCESSING SOCIETY - NAFIPS | 1999
DOI
10.1109/NAFIPS.1999.781756
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. Over the past few years, TC has become very important, especially in the Information Retrieval area, where information needs have increased tremendously with the rapid growth of textual information sources such as the Internet. In this paper, we compare, for text categorization, two partially supervised (or semi-supervised) clustering algorithms: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC) algorithm [1] and the Semi-Supervised Fuzzy c-Means (ssFCM) algorithm [2]. This semi-supervised learning paradigm falls between the fully supervised and the fully unsupervised learning schemes, in the sense that it exploits both class information contained in labeled data (training documents) and structure information possessed by unlabeled data (test documents) in order to produce better partitions for test documents. Our experiments use the Reuters-21578 database of documents and consist of a binary classification for each of the ten most populous categories of the Reuters database. To convert the documents into vector form, we experiment with different numbers of features, which we select based on an information gain criterion. We verify experimentally that ssFCM both outperforms and takes less time than the Fuzzy c-Means (FCM) algorithm. With a smaller number of features, ssFCM's performance is also superior to that of ssAHC [3]. Finally, ssFCM yields improved performance and faster execution time as more weight is given to training documents.
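The abstract says features are selected by an information gain criterion but gives no formula. As a rough illustration only, the sketch below uses the common textbook definition: the reduction in binary class entropy obtained by splitting documents on whether a term occurs. The function names, the set-of-terms document representation, and the toy data are assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List, Sequence, Set


def entropy(pos: int, total: int) -> float:
    """Binary entropy in bits of a group with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(docs: Sequence[Set[str]], labels: Sequence[int], term: str) -> float:
    """Information gain of the feature 'term occurs in the document'.

    Assumption: each document is a set of terms and each label is 0 or 1
    (one binary problem per category, as in the abstract's setup).
    """
    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    h_before = entropy(sum(labels), n)
    h_after = (
        len(with_term) / n * entropy(sum(with_term), len(with_term))
        + len(without_term) / n * entropy(sum(without_term), len(without_term))
    )
    return h_before - h_after


def select_features(docs: Sequence[Set[str]], labels: Sequence[int], k: int) -> List[str]:
    """Return the k terms with the highest information gain (ties broken alphabetically)."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return ranked[:k]


# Hypothetical toy corpus: label 1 marks documents in the target category.
toy_docs = [{"wheat", "export"}, {"wheat", "price"}, {"oil", "price"}, {"oil", "export"}]
toy_labels = [1, 1, 0, 0]
top_terms = select_features(toy_docs, toy_labels, k=2)
```

In this toy corpus, "wheat" and "oil" separate the two classes perfectly while "price" and "export" carry no class information, so the first two are kept. Varying `k` corresponds to the abstract's experiments with different numbers of features.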
Pages: 561-565
Page count: 5
Related papers
14 in total
[1] Amar A, 1997, FR ART INT, V40, P232
[2] [Anonymous], Pattern Recognition With Fuzzy Objective Function Algorithms
[3] Apte, C; Damerau, F; Weiss, SM. Automated learning of decision rules for text categorization [J]. ACM Transactions on Information Systems, 1994, 12 (03): 233-251
[4] Bensaid, AM; Hall, LO; Bezdek, JC; Clarke, LP. Partially supervised clustering for image segmentation [J]. Pattern Recognition, 1996, 29 (05): 859-871
[5] JOACHIMS T, 1997, 23 U DORTM
[6] LEWIS DD, 1994, P 3 ANN S DOC AN INF
[7] LIERE R, IRI9520243 NAT SCI F
[8] NIGAM K, 1998, 15 NAT C ART INT AAA
[9] Salton, G; Buckley, C. Term-weighting approaches in automatic text retrieval [J]. Information Processing & Management, 1988, 24 (05): 513-523
[10] SALTON G, 1983, INTRO MODERN INFORMA