Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages

Cited by: 48
Authors
Culotta, Aron [1]
Affiliations
[1] SE Louisiana Univ, Dept Comp Sci & Ind Technol, Hammond, LA 70402 USA
Keywords
Social media; Regression; Classification; WEB; SURVEILLANCE; ACCESS; TEXT;
DOI
10.1007/s10579-012-9185-0
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203; 0835;
Abstract
We analyze over 570 million Twitter messages from an eight-month period and find that tracking a small number of keywords allows us to estimate influenza rates and alcohol sales volume with high accuracy. We validate our approach against government statistics and find strong correlations with influenza-like illnesses reported by the U.S. Centers for Disease Control and Prevention (r(14) = .964, p < .001) and with alcohol sales volume reported by the U.S. Census Bureau (r(5) = .932, p < .01). We analyze the robustness of this approach to spurious keyword matches, and we propose a document classification component to filter these misleading messages. We find that this document classifier can reduce error rates by over half in simulated false alarm experiments, though more research is needed to develop methods that are robust in cases of extremely high noise.
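The abstract's core idea, tracking keywords and checking the resulting signal against official statistics, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the keyword list, the messages, and the "official" rates below are all hypothetical toy values, and plain substring matching stands in for whatever matching the author actually used.

```python
import math

def keyword_fraction(messages, keywords):
    """Fraction of messages containing at least one keyword (case-insensitive substring match)."""
    if not messages:
        return 0.0
    hits = sum(any(k in m.lower() for k in keywords) for m in messages)
    return hits / len(messages)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: one list of messages per week.
weekly_messages = [
    ["i have the flu", "nice day", "going to work", "lunch time"],
    ["flu season again", "cough and fever", "watching tv", "flu shot today"],
    ["sunny outside", "reading a book", "coffee", "headache maybe flu"],
]
# Hypothetical official weekly rates (made up for illustration only).
official_rates = [1.0, 3.0, 1.0]
keywords = ["flu", "cough"]  # assumed keyword list, not taken from the paper

fractions = [keyword_fraction(week, keywords) for week in weekly_messages]
r = pearson_r(fractions, official_rates)
print(fractions, r)
```

A message such as "flu shot today" matches the keyword without reporting an illness, which is the kind of spurious match the abstract's document classifier is meant to filter out.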
Pages: 217-238
Page count: 22