Using the Web to obtain frequencies for unseen bigrams

Cited by: 131
Authors
Keller, F
Lapata, M
Affiliations
[1] Univ Edinburgh, Sch Informat, Edinburgh EH8 9LW, Midlothian, Scotland
[2] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England
DOI
10.1162/089120103322711604
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article shows that the Web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the Web by querying a search engine. We evaluate this method by demonstrating: (a) a high correlation between Web frequencies and corpus frequencies; (b) a reliable correlation between Web frequencies and plausibility judgments; (c) a reliable correlation between Web frequencies and frequencies recreated using class-based smoothing; and (d) good performance of Web frequencies in a pseudodisambiguation task.
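The correlation-based evaluation in points (a) to (c) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the log10 transform, the add-one offset for zero counts, and the choice of Pearson's r are all assumptions made for illustration, and any counts passed in would be supplied by the reader rather than taken from the article.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def log_count_correlation(web_counts, corpus_counts):
    """Correlate log-transformed counts for bigrams present in both mappings.

    Both arguments map a bigram (a tuple of two words) to a raw count.
    Assumption: counts are offset by one before taking log10 so that
    zero counts remain defined.
    """
    shared = sorted(set(web_counts) & set(corpus_counts))
    web = [math.log10(web_counts[b] + 1) for b in shared]
    corpus = [math.log10(corpus_counts[b] + 1) for b in shared]
    return pearson(web, corpus)
```

A coefficient near 1 would indicate that bigrams frequent in one source also tend to be frequent in the other; the same helper could be applied to plausibility judgments or smoothed frequencies in place of the corpus counts.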
Pages: 459-484
Page count: 26