Chinese Word Boundary Ambiguity and Unknown Word Resolution Using Unsupervised Methods

Cited by: 1
Author
傅国宏
Institution
Keywords
Word segmentation; Character juncture; Word formation pattern
DOI
Not available
Chinese Library Classification number
TP391.1 [Text information processing]
Discipline classification code
081203; 0835
Abstract
An unsupervised framework is described that partially resolves four issues in developing a robust Chinese word segmentation system: ambiguity, unknown words, knowledge acquisition, and efficient algorithms. We first propose a statistical segmentation model that integrates the simplified character juncture model (SCJM) with word formation power. The advantage of this model is that it uses both the affinity of characters inside or outside a word and word formation power for disambiguation, and all of its parameters can be estimated in an unsupervised way. After investigating the difference between the real and theoretical sizes of the segmentation space, we apply the A* algorithm to perform segmentation without exhaustively searching all potential segmentations. Finally, an unsupervised version of Chinese word formation patterns for detecting unknown words is presented. Experiments show that the proposed methods are efficient.
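The idea of using A* search to avoid enumerating every candidate segmentation can be illustrated with a minimal sketch. It is not the paper's model: the SCJM and word formation power scores are not given in this record, so the sketch substitutes a plain negative log word probability as the edge cost. The function name `astar_segment`, the `toy_lexicon` values, and the `unknown_prob` fallback for out-of-lexicon single characters are all assumptions made for illustration.

```python
import heapq
import math


def astar_segment(text, word_prob, unknown_prob=1e-8):
    """Return the lowest-cost segmentation of `text` found by A* search.

    Assumption: the cost of a word is -log(probability). This stands in for
    the paper's SCJM / word-formation-power score, which is not reproduced here.
    """
    n = len(text)
    if n == 0:
        return []
    max_len = max((len(w) for w in word_prob), default=1)
    cost = {w: -math.log(p) for w, p in word_prob.items()}
    unknown_cost = -math.log(unknown_prob)
    min_cost = min(list(cost.values()) + [unknown_cost])

    def heuristic(i):
        # Admissible lower bound: at least ceil(remaining / max_len) more
        # words are needed, and each costs at least min_cost.
        return math.ceil((n - i) / max_len) * min_cost

    # Heap entries: (g + h, g, position, words chosen so far).
    frontier = [(heuristic(0), 0.0, 0, [])]
    best = {0: 0.0}  # cheapest known cost to reach each position
    while frontier:
        _, g, i, words = heapq.heappop(frontier)
        if i == n:
            return words
        if g > best.get(i, math.inf):
            continue  # stale entry; a cheaper path to i was already found
        for j in range(i + 1, min(n, i + max_len) + 1):
            piece = text[i:j]
            if piece in cost:
                step = cost[piece]
            elif j == i + 1:
                step = unknown_cost  # unseen single character as a fallback
            else:
                continue
            new_g = g + step
            if new_g < best.get(j, math.inf):
                best[j] = new_g
                heapq.heappush(
                    frontier, (new_g + heuristic(j), new_g, j, words + [piece])
                )
    return []


# Hypothetical probabilities, chosen only to show an overlap ambiguity.
toy_lexicon = {"研究": 0.01, "研究生": 0.005, "生命": 0.01, "命": 0.001, "起源": 0.005}
segmentation = astar_segment("研究生命起源", toy_lexicon)
```

With these toy values the search prefers 研究 / 生命 / 起源 over 研究生 / 命 / 起源, and it reaches the goal after expanding only a handful of lattice positions rather than scoring every possible split.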
Pages: 29-39
Page count: 11