Two-stage NER for tweets with clustering

Cited by: 30
Authors
Liu, Xiaohua [1, 2]
Zhou, Ming [2]
Affiliations
[1] Harbin Inst Technol, Harbin 150001, Peoples R China
[2] Microsoft Res Asia, Nat Language Comp Grp, Beijing 100080, Peoples R China
Keywords
Tweet; Named entity recognition; Two-stage labeling; Information extraction
DOI
10.1016/j.ipm.2012.05.006
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
One main challenge of Named Entity Recognition (NER) for tweets is the insufficient information in a single tweet, owing to the noisy and short nature of tweets. We propose a novel system to tackle this challenge, which leverages redundancy in tweets by conducting two-stage NER over multiple similar tweets. Specifically, it first pre-labels each tweet using a sequential labeler based on the linear Conditional Random Fields (CRFs) model. Then it clusters tweets to put tweets with similar content into the same group. Finally, for each cluster it refines the labels of each tweet using an enhanced CRF model that incorporates the cluster-level information, i.e., the labels of the current word and its neighboring words across all tweets in the cluster. We evaluate our method on a manually annotated dataset, and show that our method boosts the F1 of the baseline without collective labeling from 75.4% to 82.5%. (C) 2012 Elsevier Ltd. All rights reserved.
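The cluster-level information mentioned in the abstract can be illustrated with a minimal sketch. It assumes the first-stage labels and the cluster assignment are already available, and it only aggregates labels for the current word (not its neighbors). The function name `cluster_label_features`, the feature names, and the label scheme (`B-PER`, `B-LOC`, `O`) are hypothetical and are not taken from the paper; the resulting features would be inputs to a second-stage labeler, which is not shown.

```python
from collections import Counter, defaultdict


def cluster_label_features(cluster):
    """Aggregate first-stage labels per word across one cluster of tweets.

    cluster: list of tweets; each tweet is a list of (token, label) pairs
    produced by a first-stage labeler (assumed to exist elsewhere).
    Returns, per tweet, a list of feature dicts that a second-stage
    labeler could consume. This is an illustration, not the paper's
    actual feature set.
    """
    # Count how often each (case-folded) word received each label.
    counts = defaultdict(Counter)
    for tweet in cluster:
        for token, label in tweet:
            counts[token.lower()][label] += 1

    features = []
    for tweet in cluster:
        tweet_feats = []
        for token, label in tweet:
            word_counts = counts[token.lower()]
            total = sum(word_counts.values())
            majority, n = word_counts.most_common(1)[0]
            tweet_feats.append({
                "token": token,
                "stage1_label": label,
                # Most frequent first-stage label for this word in the cluster.
                "cluster_majority": majority,
                # Share of occurrences that carry the majority label.
                "cluster_agreement": n / total,
            })
        features.append(tweet_feats)
    return features


if __name__ == "__main__":
    # Hypothetical cluster of three similar tweets with first-stage labels.
    cluster = [
        [("Obama", "B-PER"), ("visits", "O"), ("Beijing", "B-LOC")],
        [("obama", "B-PER"), ("in", "O"), ("beijing", "B-LOC"), ("today", "O")],
        [("Obama", "O"), ("lands", "O"), ("in", "O"), ("Beijing", "B-LOC")],
    ]
    for tweet_feats in cluster_label_features(cluster):
        print(tweet_feats)
```

In the third tweet of this made-up example, "Obama" was pre-labeled `O`, while the cluster majority is `B-PER`; a refinement stage could use that disagreement as evidence to revise the label.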
Pages: 264-273
Page count: 10