An algorithm that learns what's in a name

Cited by: 290
Authors
Bikel, DM [1]
Schwartz, R [1]
Weischedel, RM [1]
Affiliation
[1] BBN Syst & Technol Corp, Cambridge, MA 02138 USA
Keywords
named entity extraction; hidden Markov models
DOI
10.1023/A:1007558221122
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we present IdentiFinder(TM), a hidden Markov model that learns to recognize and classify names, dates, times, and numerical quantities. We have evaluated the model in English (based on data from the Sixth and Seventh Message Understanding Conferences [MUC-6, MUC-7] and broadcast news) and in Spanish (based on data distributed through the First Multilingual Entity Task [MET-1]), and on speech input (based on broadcast news). We report results here on standard materials only to quantify performance on data available to the community, namely, MUC-6 and MET-1. Results have been consistently better than reported by any other learning algorithm. IdentiFinder's performance is competitive with approaches based on handcrafted rules on mixed case text and superior on text where case information is not available. We also present a controlled experiment showing the effect of training set size on performance, demonstrating that as little as 100,000 words of training data is adequate to get performance around 90% on newswire. Although we present our understanding of why this algorithm performs so well on this class of problems, we believe that significant improvement in performance may still be possible.
Pages: 211-231
Page count: 21