Cross-Domain Chinese Word Segmentation Based on Context Information and Fragments (in English)

Cited by: 9
Authors
黄德根
佟德琴
Affiliation
[1] Dalian University of Technology
Keywords
cross-domain CWS; Conditional Random Fields (CRFs); joint decoding; context variables; segmentation fragments
DOI
Not available
Chinese Library Classification (CLC) number
TP391.1 [Text information processing]
Discipline classification codes
081203; 0835
Abstract
A new joint decoding strategy is proposed that combines character-based and word-based conditional random field (CRF) models. In this segmentation framework, fragments are used to generate candidate out-of-vocabulary words (OOVs). After the initial segmentation, the segmentation fragments are divided into two classes: "combination" (several fragments are combined into one unknown word) and "segregation" (the fragments are segregated into separate words), so that more OOVs can be recalled. In addition, to suit the characteristics of cross-domain segmentation, context information is used to guide Chinese Word Segmentation (CWS). Experiments on the test data from SIGHAN Bakeoff 2007 and Bakeoff 2010 show that the method is effective: OOV recall improves and overall segmentation performance is good.
Pages: 49-57
Number of pages: 9