Duplicate detection in adverse drug reaction surveillance

Cited by: 104
Authors
Noren, G. Niklas [1]
Orre, Roland
Bate, Andrew
Edwards, I. Ralph
Affiliations
[1] WHO Collaborating Ctr Int Drug Monitoring, Uppsala, Sweden
[2] Stockholm Univ, S-10691 Stockholm, Sweden
[3] Neurolog Sweden AB, Stockholm, Sweden
Keywords
data cleaning; duplicate detection; hit-miss model
DOI
10.1007/s10618-006-0052-8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world's largest database of reports on suspected adverse drug reaction (ADR) incidents that occur after drugs are on the market. The presence of duplicate case reports is an important data quality problem and their detection remains a formidable challenge, especially in the WHO drug safety database where reports are anonymised before submission. In this paper, we propose a duplicate detection method based on the hit-miss model for statistical record linkage described by Copas and Hilton, which handles the limited amount of training data well and is well suited for the available data (categorical and numerical rather than free text). We propose two extensions of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields, and we demonstrate the effectiveness both at identifying the most likely duplicate for a given case report (94.7% accuracy) and at discriminating true duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps other knowledge discovery applications as well.
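As a rough illustration of the hit-miss idea named in the abstract, the sketch below scores one categorical field of a record pair. It is not the authors' implementation. It assumes that each observed value is independently a "hit" (the true value, probability a), a "miss" (a random draw from the field's value distribution beta, probability b), or blank (probability 1 - a - b), and it derives the log likelihood ratio of "duplicate" versus "unrelated pair" from those assumptions alone. The function names, parameter values, and field names are hypothetical, and the mixture model for numerical fields and the handling of correlated fields described in the abstract are not covered.

```python
import math


def field_weight(x, y, beta, a, b):
    """Log2 likelihood ratio (duplicate vs. unrelated pair) for one categorical field.

    x, y : observed values, or None for a blank field
    beta : dict mapping each value to its relative frequency (assumed known)
    a, b : assumed probabilities of a hit and of a miss; blank has probability 1 - a - b
    """
    if not (a > 0 and b > 0 and a + b <= 1):
        raise ValueError("need a > 0, b > 0 and a + b <= 1")
    if x is None or y is None:
        # Under the stated assumptions a blank is equally likely for
        # duplicates and unrelated pairs, so it carries no evidence.
        return 0.0
    # d: probability that both non-blank values are hits.
    d = (a / (a + b)) ** 2
    if x == y:
        # Agreement on a rare value (small beta[x]) earns a larger weight.
        return math.log2(1 - d + d / beta[x])
    # Disagreement between duplicates requires at least one miss.
    return math.log2(1 - d)


def pair_score(rec1, rec2, params):
    """Sum the field weights, treating fields as independent (a simplification)."""
    return sum(
        field_weight(rec1.get(name), rec2.get(name), beta, a, b)
        for name, (beta, a, b) in params.items()
    )


# Hypothetical parameters for two illustrative fields.
params = {
    "sex": ({"F": 0.5, "M": 0.5}, 0.9, 0.1),
    "drug": ({"quinine": 0.01, "other": 0.99}, 0.9, 0.1),
}
report_a = {"sex": "F", "drug": "quinine"}
report_b = {"sex": "F", "drug": "quinine"}
report_c = {"sex": "M", "drug": "other"}

print(pair_score(report_a, report_b, params))  # positive: evidence for a duplicate
print(pair_score(report_a, report_c, params))  # negative: evidence against
```

Summing the field weights relies on the fields being independent, which is the simplification that the paper's extension for correlated record fields is meant to address.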
Pages: 305-328
Page count: 24
Related papers
27 in total
[1]  
[Anonymous], KDD 03 P 9 ACM SIGKD
[2]   A Bayesian neural network method for adverse drug reaction signal generation [J].
Bate, A ;
Lindquist, M ;
Edwards, IR ;
Olsson, S ;
Orre, R ;
Lansner, A ;
De Freitas, RM .
EUROPEAN JOURNAL OF CLINICAL PHARMACOLOGY, 1998, 54 (04) :315-321
[3]  
BELIN TR, 1995, J AM STAT ASSOC, V90, P694
[4]  
Bilenko M., 2003, P KDD 2003 WORKSH DA, P7
[5]   Proactive safety surveillance [J].
Bortnichak, EA ;
Wise, RP ;
Salive, ME ;
Tilson, HH .
PHARMACOEPIDEMIOLOGY AND DRUG SAFETY, 2001, 10 (03) :191-196
[6]   Spontaneous reports of thrombocytopenia in association with quinine: Clinical attributes and timing related to regulatory action [J].
Brinker, AD ;
Beitz, J .
AMERICAN JOURNAL OF HEMATOLOGY, 2002, 70 (04) :313-317
[7]   RECORD LINKAGE - STATISTICAL-MODELS FOR MATCHING COMPUTER RECORDS [J].
COPAS, JB ;
HILTON, FJ .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY, 1990, 153 :287-320
[8]   How to lie with bad data [J].
De Veaux, RD ;
Hand, DJ .
STATISTICAL SCIENCE, 2005, 20 (03) :231-238
[9]   Adverse drug reactions: definitions, diagnosis, and management [J].
Edwards, IR ;
Aronson, JK .
LANCET, 2000, 356 (9237) :1255-1259
[10]   Spontaneous reporting - of what? Clinical concerns about drugs [J].
Edwards, IR .
BRITISH JOURNAL OF CLINICAL PHARMACOLOGY, 1999, 48 (02) :138-141