INFORMATION EXTRACTION AS A BASIS FOR HIGH-PRECISION TEXT CLASSIFICATION

Cited by: 76
Authors
RILOFF, E
LEHNERT, W
Affiliation
[1] Univ. of Massachusetts, Amherst
Keywords
INFORMATION EXTRACTION; TEXT CLASSIFICATION;
DOI
10.1145/183422.183428
Chinese Library Classification
TP [Automation and computer technology];
Subject classification code
0812;
Abstract
We describe an approach to text classification that represents a compromise between traditional word-based techniques and in-depth natural language processing. Our approach uses a natural language processing task called "information extraction" as a basis for high-precision text classification. We present three algorithms that use varying amounts of extracted information to classify texts. The relevancy signatures algorithm uses linguistic phrases; the augmented relevancy signatures algorithm uses phrases and local context; and the case-based text classification algorithm uses larger pieces of context. Relevant phrases and contexts are acquired automatically using a training corpus. We evaluate the algorithms on the basis of two test sets from the MUC-4 corpus. All three algorithms achieved high precision on both test sets, with the augmented relevancy signatures algorithm and the case-based algorithm reaching 100% precision with over 60% recall on one set. Additionally, we compare the algorithms on a larger collection of 1700 texts and describe an automated method for empirically deriving appropriate threshold values. The results suggest that information extraction techniques can support high-precision text classification and, in general, that using more extracted information improves performance. As a practical matter, we also explain how the text classification system can be easily ported across domains.
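To make the idea of phrase-based, high-precision classification concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes each text has already been reduced to a set of extracted phrases, keeps only phrases that co-occur with relevant training texts often enough and reliably enough, and labels a new text relevant if it contains any kept phrase. The function names, the `min_count` and `min_prob` thresholds, and the toy phrases are all hypothetical, introduced only for illustration.

```python
from collections import Counter


def train_signatures(texts, min_count=2, min_prob=0.8):
    """Select phrases that reliably indicate relevance.

    texts: list of (set_of_phrases, is_relevant) pairs (assumed input format).
    A phrase is kept if it appears in at least `min_count` training texts
    and at least `min_prob` of those texts are relevant.
    """
    total = Counter()
    relevant = Counter()
    for phrases, is_relevant in texts:
        for phrase in set(phrases):
            total[phrase] += 1
            if is_relevant:
                relevant[phrase] += 1
    return {
        phrase
        for phrase, n in total.items()
        if n >= min_count and relevant[phrase] / n >= min_prob
    }


def classify(phrases, signatures):
    """Label a text relevant if it contains at least one selected phrase."""
    return any(phrase in signatures for phrase in phrases)


def precision_recall(predictions, labels):
    """Compute precision and recall for boolean predictions and labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy training data (hypothetical phrases, not taken from MUC-4).
training = [
    ({"was bombed", "was killed"}, True),
    ({"was bombed"}, True),
    ({"was killed", "said"}, False),
    ({"said"}, False),
]
signatures = train_signatures(training)

# Toy test data: one relevant text has no selected phrase and is missed.
test_texts = [
    ({"was bombed", "said"}, True),
    ({"was killed"}, True),
    ({"said"}, False),
]
predictions = [classify(phrases, signatures) for phrases, _ in test_texts]
labels = [label for _, label in test_texts]
print(precision_recall(predictions, labels))
```

Raising `min_prob` in this sketch trades recall for precision, which is the kind of trade-off a threshold-selection procedure would have to tune on training data.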
Pages: 296-333
Page count: 38