A Method of Automated Nonparametric Content Analysis for Social Science

Cited by: 399
Authors
Hopkins, Daniel J. [1]
King, Gary [2]
Affiliations
[1] Georgetown Univ, Intercultural Ctr 681, Washington, DC 20057 USA
[2] Harvard Univ, Inst Quantitat Social Sci, Cambridge, MA 02138 USA
DOI
10.1111/j.1540-5907.2009.00428.x
Chinese Library Classification (CLC) number
D0 [Political science, political theory]
Discipline classification code
0302; 030201
Abstract
The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even a method with a high percent of individual documents correctly classified can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly. We illustrate with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency. We also make available software that implements our methods and large corpora of text for further analysis.
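The abstract's claim that a classifier with high individual-level accuracy can still give biased category proportions can be illustrated with a small arithmetic sketch. The numbers below (a 20% true share and a classifier with 90% sensitivity and specificity) are hypothetical, and the correction shown is the standard two-category misclassification adjustment, not the nonparametric estimator developed in the paper.

```python
def classify_and_count(true_share, sensitivity, specificity):
    """Expected share of documents the classifier labels as positive."""
    false_positive_rate = 1 - specificity
    return true_share * sensitivity + (1 - true_share) * false_positive_rate


def corrected_share(observed_share, sensitivity, specificity):
    """Invert the misclassification relationship to recover the true share."""
    false_positive_rate = 1 - specificity
    return (observed_share - false_positive_rate) / (sensitivity - false_positive_rate)


# Hypothetical values chosen only for illustration.
true_share = 0.2   # true proportion of documents in the category
sensitivity = 0.9  # P(labeled positive | truly positive)
specificity = 0.9  # P(labeled negative | truly negative)

observed = classify_and_count(true_share, sensitivity, specificity)
recovered = corrected_share(observed, sensitivity, specificity)

print(f"classify-and-count estimate: {observed:.2f}")
print(f"bias: {observed - true_share:+.2f}")
print(f"corrected estimate: {recovered:.2f}")
```

Under these assumptions, the classifier labels 90% of documents correctly, yet simply counting its labels overstates the category share, because false positives drawn from the larger category outnumber the false negatives from the smaller one.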
Pages: 229-247
Page count: 19
Related papers
53 in total
[1]  
Adamic Lada A., 2005, P 3 INT WORKSHOP LIN, P36, DOI 10.1145/1134271.1134277
[2]  
[Anonymous], 2003, ANN M MIDW POL SCI A
[3]  
[Anonymous], 2006, P 2006 C EMP METH NA
[4]  
[Anonymous], 1998, Impersonal Influence: How Perceptions of Mass Collectives Affect Political Attitudes
[5]  
Benoit Kenneth, 2003, Irish Political Studies, V18, P97, DOI 10.1080/07907180312331293249
[6]   Detecting Collaboration in Propaganda [J].
Berelson, Bernard ;
De Grazia, Sebastian .
PUBLIC OPINION QUARTERLY, 1947, 11 (02) :244-253
[7]  
Brank J., 2002, FEATURE SELECTION US
[8]  
Cavnar W. B., 1994, N-gram-based text categorization, P161-175
[9]   CONCEPTUAL STRETCHING REVISITED - ADAPTING CATEGORIES IN COMPARATIVE-ANALYSIS [J].
COLLIER, D ;
MAHON, JE .
AMERICAN POLITICAL SCIENCE REVIEW, 1993, 87 (04) :845-855
[10]   SIMULATION-EXTRAPOLATION ESTIMATION IN PARAMETRIC MEASUREMENT ERROR MODELS [J].
COOK, JR ;
STEFANSKI, LA .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1994, 89 (428) :1314-1328