Unsupervised named-entity extraction from the Web: An experimental study

Cited by: 465

Authors:

Etzioni, O [1]

Cafarella, M [1]

Downey, D [1]

Popescu, AM [1]

Shaked, T [1]

Soderland, S [1]

Weld, DS [1]

Yates, A [1]

Affiliation:

[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA

Source:

ARTIFICIAL INTELLIGENCE | 2005 / Vol. 165 / No. 1

Keywords:

information extraction; pointwise mutual information; unsupervised; question answering

DOI:

10.1016/j.artint.2005.03.001

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 class instances, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.


Pages: 91-134

Page count: 44
