Scaling to very very large corpora for natural language disambiguation

Cited by: 285
Authors: Banko, M [1]; Brill, E [1]
Affiliation: [1] Microsoft Research, Redmond, WA 98052 USA
Source: 39th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 2001
DOI: 10.3115/1073012.1073017
CLC Number: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
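
The claim that labeled data is "free" follows from the task itself: in presumed-correct text, every occurrence of a confusion-set word is implicitly labeled by the author's own word choice, so training data grows with corpus size at no annotation cost. A minimal Python sketch of this data-generation step (not taken from the paper; the confusion set, window size, and feature representation are illustrative assumptions):

    # Sketch: harvest "free" labeled examples for confusion set disambiguation.
    # Each occurrence of a confusion-set member in presumed-correct text is a
    # labeled example, with the observed word as the label.
    CONFUSION_SET = {"to", "too", "two"}  # assumed example set
    WINDOW = 3  # context words kept on each side of the target (assumed)

    def labeled_examples(tokens):
        """Yield (context_words, label) pairs from a token list."""
        for i, tok in enumerate(tokens):
            if tok.lower() in CONFUSION_SET:
                left = tokens[max(0, i - WINDOW):i]
                right = tokens[i + 1:i + 1 + WINDOW]
                yield (left + right, tok.lower())  # observed word = label

    if __name__ == "__main__":
        corpus = "I want to buy two tickets , but it is too late .".split()
        for context, label in labeled_examples(corpus):
            print(label, "<-", " ".join(context))

At test time, a classifier trained on such pairs predicts which confusion-set member best fits a context in which the author's choice is withheld.
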
Pages: 26-33
Page count: 8