A Parallel Naive Bayes Algorithm Based on Spark for Large-scale Chinese Text Classification (in English)

Cited: 58
Authors
刘鹏 [1 ,2 ]
赵慧含 [3 ]
滕家雨 [4 ]
仰彦妍 [3 ]
刘亚峰 [1 ,2 ]
朱宗卫 [5 ]
Affiliations
[1] Internet of Things Perception Mine Research Centre, China University of Mining and Technology
[2] National and Local Joint Engineering Laboratory of Internet Application Technology on Mine
[3] School of Information and Control Engineering, China University of Mining and Technology
[4] Communication Division, NARI Technology Co., Ltd.
[5] Suzhou Institute of University of Science and Technology of China
Keywords
Chinese text classification; Naive Bayes; Spark; Hadoop; Resilient Distributed Dataset (RDD); parallelization
DOI
Not available
CLC Number
TP391.1 [Text Information Processing]
Discipline Code
120506 [Digital Humanities]
Abstract
To address the sharply increased processing time of classification over the rapidly growing volume of Chinese text data on the Internet, this paper proposes and implements a parallel Naive Bayes Chinese text classification algorithm based on Spark, an in-memory computing model. Using the Resilient Distributed Dataset (RDD) programming model, both the training and the prediction stages of the Naive Bayes classifier are fully parallelized. For comparison, a parallel Naive Bayes version based on Hadoop MapReduce was also implemented. Experimental results show that, in the same computing environment and on Chinese text corpora of the same size, the Spark-based parallel Naive Bayes algorithm clearly outperforms the Hadoop-based implementation on key metrics such as speedup and scalability, and therefore better meets the demands of large-scale Chinese text data mining.
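The training stage the abstract describes is, at its core, a per-class word-count aggregation followed by smoothed probability estimation. As a rough illustration (not the paper's implementation), the sketch below simulates the RDD-style flatMap/reduceByKey pipeline in plain Python on a toy corpus; all data and names are illustrative assumptions.

```python
from collections import defaultdict
import math

# Toy labeled corpus of (class, tokens) pairs. In the Spark version this
# would be an RDD of documents; a plain list stands in for it here.
corpus = [
    ("sports", ["ball", "team", "win"]),
    ("sports", ["team", "score"]),
    ("tech",   ["cpu", "spark", "cluster"]),
    ("tech",   ["spark", "rdd"]),
]

# "Map" phase: emit ((class, word), 1) pairs, as flatMap would on an RDD.
pairs = [((c, w), 1) for c, tokens in corpus for w in tokens]

# "Reduce" phase: aggregate counts per (class, word), like reduceByKey(add).
counts = defaultdict(int)
for key, n in pairs:
    counts[key] += n

class_totals = defaultdict(int)   # total tokens per class
class_docs = defaultdict(int)     # documents per class (for the priors)
for c, tokens in corpus:
    class_docs[c] += 1
    class_totals[c] += len(tokens)

vocab = {w for _, w in counts}

def predict(tokens):
    """Score each class with a log prior plus Laplace-smoothed log likelihoods."""
    best, best_score = None, -math.inf
    for c in class_docs:
        score = math.log(class_docs[c] / len(corpus))
        for w in tokens:
            score += math.log(
                (counts[(c, w)] + 1) / (class_totals[c] + len(vocab))
            )
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(["spark", "cluster"]))  # → tech
```

In the Spark version, the model parameters produced by the reduce phase would be broadcast to the workers so that the prediction stage can also run in parallel over the test RDD.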
Pages: 1-12
Page count: 12