Unsupervised models for morpheme segmentation and morphology learning

Cited by: 4
Authors
Creutz, Mathias [1, 2]
Lagus, Krista [1, 2]
Affiliations
[1] Helsinki University of Technology, Adaptive Informatics Research Centre, FIN-02015 HUT
Source
ACM Transactions on Speech and Language Processing | 2007 / Vol. 4 / No. 1
Keywords
Efficient storage; Highly inflecting and compounding languages; Language independent methods; Maximum a posteriori (MAP) estimation; Morpheme lexicon and segmentation; Unsupervised learning
DOI
10.1145/1217098.1217101
Abstract
We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data. © 2007 ACM.