Agreement, the F-measure, and reliability in information retrieval

Cited by: 631
Authors
Hripcsak, G [1]
Rothschild, AS [1]
Affiliations
[1] Columbia Univ, Dept Med Informat, Dept Biomed Informat, New York, NY 10032 USA
Keywords
DOI
10.1197/jamia.M1733
Chinese Library Classification (CLC)
TP [automation technology, computer technology];
Discipline classification code
0812;
Abstract
Information retrieval studies that involve searching the Internet or marking phrases usually lack a well-defined number of negative cases. This prevents the use of traditional interrater reliability metrics like the κ (kappa) statistic to assess the quality of expert-generated gold standards. Such studies often quantify system performance as precision, recall, and F-measure, or as agreement. It can be shown that the average F-measure among pairs of experts is numerically identical to the average positive specific agreement among experts and that κ approaches these measures as the number of negative cases grows large. Positive specific agreement, or the equivalent F-measure, may be an appropriate way to quantify interrater reliability and therefore to assess the reliability of a gold standard in these studies.
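The abstract's central identity is easy to check numerically. The following is a minimal Python sketch, not part of the original record, using hypothetical 2x2 cell counts (a = both raters positive, b and c = the two discordant cells, d = both negative). It shows that when one rater is treated as the gold standard, the other rater's F-measure equals the positive specific agreement between them, and that Cohen's κ rises toward that value as d grows large.

# Numeric check of the abstract's claim, with hypothetical counts.
# 2x2 agreement table between raters A and B:
#   a = both positive, b = A positive only, c = B positive only,
#   d = both negative (often ill-defined in information retrieval).

def f_measure(tp, fp, fn):
    # F = harmonic mean of precision and recall = 2*TP / (2*TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)

def positive_specific_agreement(a, b, c):
    # p_pos = 2a / (2a + b + c); note that d never enters the formula
    return 2 * a / (2 * a + b + c)

def cohens_kappa(a, b, c, d):
    # kappa = (p_o - p_e) / (1 - p_e) over the full 2x2 table
    n = a + b + c + d
    p_o = (a + d) / n                                     # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

a, b, c = 40, 10, 5  # hypothetical: 40 joint positives, 15 discordant cases

# Treating rater B as the gold standard maps the cells to TP = a, FP = b, FN = c,
# so the two quantities below are algebraically identical (both print 0.8421...).
print(f_measure(tp=a, fp=b, fn=c))
print(positive_specific_agreement(a, b, c))

# kappa depends on d and climbs toward p_pos as d grows large:
for d in (10, 1_000, 100_000):
    print(d, round(cohens_kappa(a, b, c, d), 4))  # ~0.4179, ~0.8347, ~0.8420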
Pages: 296-298
Page count: 3