A formal framework for linguistic annotation

Cited by: 109
Authors
Bird, S [1]
Liberman, M [1]
Affiliations
[1] Univ Penn, Linguist Data Consortium, Philadelphia, PA 19104 USA
Keywords
speech markup; speech corpus; general-purpose architecture; directed graph; phonological representation;
DOI
10.1016/S0167-6393(00)00068-6
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions - audio, video and/or physiological recordings - or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, 'named entity' identification, coreference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats. (C) 2001 Elsevier Science B.V. All rights reserved.
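
The abstract names the annotation graph as the common core and the keywords describe it as a directed graph, but this record gives no formal definition. The following is only a rough sketch of what such a structure might look like, assuming nodes that optionally carry a time offset and arcs that carry an annotation type and a label. The class names, the `tier` field, the query method, and the sample times and words are all hypothetical illustrations, not the paper's definitions or data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Arc:
    src: int    # id of the node where the annotation starts
    dst: int    # id of the node where the annotation ends
    tier: str   # hypothetical name for the annotation type, e.g. "word"
    label: str  # the annotation content itself


@dataclass
class AnnotationGraph:
    # node id -> time offset in seconds, or None if the node is untimed
    times: Dict[int, Optional[float]] = field(default_factory=dict)
    arcs: List[Arc] = field(default_factory=list)

    def add_node(self, node_id: int, time: Optional[float] = None) -> None:
        self.times[node_id] = time

    def add_arc(self, src: int, dst: int, tier: str, label: str) -> None:
        if src not in self.times or dst not in self.times:
            raise KeyError("both endpoints must be existing nodes")
        self.arcs.append(Arc(src, dst, tier, label))

    def labels(self, tier: str) -> List[str]:
        """Labels of all arcs of one annotation type, in insertion order."""
        return [a.label for a in self.arcs if a.tier == tier]

    def overlapping(self, start: float, end: float, tier: str) -> List[Arc]:
        """Arcs of one type whose time span overlaps [start, end).

        Arcs with an untimed endpoint are skipped in this sketch.
        """
        hits = []
        for a in self.arcs:
            t0, t1 = self.times[a.src], self.times[a.dst]
            if a.tier != tier or t0 is None or t1 is None:
                continue
            if t0 < end and t1 > start:
                hits.append(a)
        return hits


# Made-up example: two words and one part-of-speech tag over three timed nodes.
g = AnnotationGraph()
g.add_node(0, 0.0)
g.add_node(1, 0.35)
g.add_node(2, 0.80)
g.add_arc(0, 1, "word", "she")
g.add_arc(1, 2, "word", "had")
g.add_arc(1, 2, "pos", "VBD")

print(g.labels("word"))
print([a.label for a in g.overlapping(0.4, 0.5, "word")])
```

Because different annotation types are just differently typed arcs over shared nodes, a sketch like this can hold transcription, tagging and other layers side by side without committing to any particular file format, which is the kind of format independence the abstract describes.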
Pages: 23-60
Page count: 38