Unsupervised multilingual sentence boundary detection

Cited by: 179
Authors
Kiss, Tibor [1]
Strunk, Jan [1]
Affiliations
[1] Ruhr Univ Bochum, Sprachwissensch Inst, D-44780 Bochum, Germany
DOI
10.1162/coli.2006.32.4.485
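Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 [Pattern recognition and intelligent systems]; 0812 [Computer science and technology]; 0835 [Software engineering]; 1405 [Intelligent science and technology];
Abstract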
In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: Abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period, abbreviations are usually short, and abbreviations sometimes contain internal periods. We also show the potential of collocational evidence for two other important subtasks of sentence boundary disambiguation, namely, the detection of initials and ordinal numbers. The proposed system has been tested extensively on eleven different languages and on different text genres. It achieves good results without any further amendments or language-specific resources. We evaluate its performance against three different baselines and compare it to other systems for sentence boundary detection proposed in the literature.
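The three context-independent criteria named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names `log_likelihood` and `abbreviation_scores` are hypothetical, the collocation measure is a Dunning-style log-likelihood ratio, and the two scaling factors (exponential length penalty, bonus for internal periods) are assumptions chosen only to mirror the criteria in spirit.

```python
import math
from collections import Counter


def log_likelihood(k1, n1, k2, n2):
    """Dunning-style log-likelihood ratio for two binomial samples."""
    def ll(k, n, p):
        # Clamp p away from 0 and 1 so that log() stays defined.
        p = min(max(p, 1e-12), 1 - 1e-12)
        return k * math.log(p) + (n - k) * math.log(1 - p)

    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))


def abbreviation_scores(text):
    """Score each candidate type seen with a final period (illustrative only)."""
    with_period = Counter()  # occurrences of a type followed by a period
    total = Counter()        # all occurrences of a type
    for tok in text.lower().split():
        if tok.endswith(".") and len(tok) > 1:
            base = tok[:-1]
            with_period[base] += 1
            total[base] += 1
        else:
            total[tok] += 1

    n = sum(total.values())
    n_periods = sum(with_period.values())
    scores = {}
    for word, k1 in with_period.items():
        n1 = total[word]
        k2, n2 = n_periods - k1, n - n1
        # Keep only types that take a period more often than other types do.
        if n2 == 0 or k1 / n1 <= k2 / n2:
            continue
        collocation = log_likelihood(k1, n1, k2, n2)  # criterion 1: tight collocation
        letters = len(word.replace(".", ""))
        length_factor = math.exp(-letters)            # criterion 2: short (assumed form)
        period_factor = word.count(".") + 1           # criterion 3: internal periods (assumed form)
        scores[word] = collocation * length_factor * period_factor
    return scores


sample = ("dr. smith met dr. jones at the house. dr. lee left the house early. "
          "the house was old. dr. kim stayed.")
scores = abbreviation_scores(sample)
best = max(scores, key=scores.get)
```

On this toy sample, `best` is `"dr"`: it always occurs with a final period and is short, whereas `"house"` occurs mostly without one and is filtered out. A real system would need a much larger corpus and a tuned decision threshold.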
Pages: 485-525
Number of pages: 41