A machine learning approach for Arabic text classification using N-gram frequency statistics

Cited by: 44
Authors
Khreisat, Laila [1]
Affiliations
[1] Fairleigh Dickinson Univ, Dept Comp Sci Math & Phys, Madison, NJ 07940 USA
Keywords
Data mining; Classification; Categorization; Arabic; N-gram; Machine learning
DOI
10.1016/j.joi.2008.11.005
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In this paper, a machine learning approach for classifying Arabic text documents is presented. To handle the high dimensionality of text documents, embeddings are used to map each document (instance) into R (the set of real numbers), representing the tri-gram frequency statistics profile for a document. Classification is achieved by computing a dissimilarity measure, called the Manhattan distance, between the profile of the instance to be classified and the profiles of all the instances in the training set. The class (category) to which an instance (document) belongs is the one with the least computed Manhattan measure. The Dice similarity measure is used to compare the performance of the method. Results show that tri-gram text classification using the Dice measure outperforms classification using the Manhattan measure. © 2008 Elsevier Ltd. All rights reserved.
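The pipeline described in the abstract (tri-gram frequency profile, then nearest training profile under the Manhattan or Dice measure) can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact definitions, so several details here are assumptions: character-level tri-grams, relative-frequency normalization, a set-based Dice coefficient, and the hypothetical helper names `trigram_profile`, `manhattan`, `dice`, and `classify`. The two-document training set is toy data, not the paper's corpus.

```python
from collections import Counter


def trigram_profile(text, n=3):
    """Map a document to relative frequencies of its character n-grams.

    Assumption: character-level n-grams over whitespace-normalized text.
    """
    s = " ".join(text.split())
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def manhattan(p, q):
    """Manhattan (L1) distance between two profiles; lower means more similar."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def dice(p, q):
    """Dice coefficient on the sets of n-grams; higher means more similar.

    Assumption: the set-based form 2|A ∩ B| / (|A| + |B|).
    """
    a, b = set(p), set(q)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(text, training, measure="manhattan"):
    """Return the label of the training profile closest to `text`.

    `training` is a list of (label, profile) pairs.
    """
    profile = trigram_profile(text)
    if measure == "manhattan":
        return min(training, key=lambda item: manhattan(profile, item[1]))[0]
    return max(training, key=lambda item: dice(profile, item[1]))[0]


# Toy training set (illustrative only).
training = [
    ("sports", trigram_profile("كرة القدم مباراة فريق")),
    ("economy", trigram_profile("سوق المال بنك اقتصاد")),
]

print(classify("مباراة كرة القدم", training, measure="manhattan"))
print(classify("مباراة كرة القدم", training, measure="dice"))
```

Both measures compare the query against every training profile, so the cost grows linearly with the size of the training set; the only difference is that the Manhattan measure is minimized while the Dice measure is maximized.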
Pages: 72-77
Number of pages: 6