Unsupervised models for morpheme segmentation and morphology learning

Cited by: 4
Authors
Creutz, Mathias [1, 2]
Lagus, Krista [1, 2]
Affiliations
[1] Helsinki University of Technology, Adaptive Informatics Research Centre, FIN-02015 HUT
Source
ACM Transactions on Speech and Language Processing | 2007 / Vol. 4 / No. 1
Keywords
Efficient storage; Highly inflecting and compounding languages; Language independent methods; Maximum a posteriori (MAP) estimation; Morpheme lexicon and segmentation; Unsupervised learning
DOI
10.1145/1217098.1217101
Abstract
We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data. © 2007 ACM.
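The MAP idea in the abstract can be illustrated with a toy two-part cost: the negative log prior of the morph lexicon plus the negative log likelihood of the corpus given that lexicon. The sketch below is not the Morfessor model itself. The function name `map_cost`, the uniform letter prior, the unigram morph likelihood, the alphabet size, and the six-word Finnish sample are all assumptions made for illustration.

```python
import math
from collections import Counter


def map_cost(segmented_words, alphabet_size=27):
    """Toy negative log posterior (in bits) of a segmented corpus.

    Assumptions (not taken from the paper): morphs are spelled with
    uniformly distributed letters plus an end marker, and the corpus is
    a unigram sequence of morphs with maximum-likelihood probabilities.
    """
    tokens = [morph for word in segmented_words for morph in word]
    counts = Counter(tokens)
    total = len(tokens)

    # Prior term: cost of storing each distinct morph once in the lexicon.
    lexicon_cost = sum((len(m) + 1) * math.log2(alphabet_size) for m in counts)

    # Likelihood term: cost of encoding the corpus as a sequence of morphs.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())

    return lexicon_cost + corpus_cost


# Hypothetical sample: three forms each of "talo" (house) and "auto" (car).
whole = [["talo"], ["talossa"], ["talon"], ["auto"], ["autossa"], ["auton"]]
split = [["talo"], ["talo", "ssa"], ["talo", "n"],
         ["auto"], ["auto", "ssa"], ["auto", "n"]]

cost_whole = map_cost(whole)
cost_split = map_cost(split)
print(f"unsegmented: {cost_whole:.1f} bits, segmented: {cost_split:.1f} bits")
```

On this sample the segmented analysis has the lower total cost, because the shared stems and suffixes are stored only once in the lexicon. A search procedure that compares such costs across candidate segmentations is one way to induce a morph lexicon without supervision.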