Mining Big Data to Extract Patterns and Predict Real-Life Outcomes

Cited by: 112
Authors
Kosinski, Michal [1]
Wang, Yilun [2]
Lakkaraju, Himabindu [2]
Leskovec, Jure [2]
Affiliations
[1] Stanford Univ, Grad Sch Business, 655 Knight Way, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Funding
National Science Foundation (USA)
Keywords
computational social science; big data; digital footprints; R; personality; singular-value decomposition; science
DOI
10.1037/met0000105
Chinese Library Classification
B84 [Psychology]
Discipline classification codes
04; 0402
Abstract
This article aims to introduce the reader to essential tools that can be used to obtain insights and build predictive models using large data sets. Recent user proliferation in the digital environment has led to the emergence of large samples containing a wealth of traces of human behaviors, communication, and social interactions. Such samples offer the opportunity to greatly improve our understanding of individuals, groups, and societies, but their analysis presents unique methodological challenges. In this tutorial, we discuss potential sources of such data and explain how to efficiently store them. Then, we introduce two methods that are often employed to extract patterns and reduce the dimensionality of large data sets: singular value decomposition and latent Dirichlet allocation. Finally, we demonstrate how to use dimensions or clusters extracted from data to build predictive models in a cross-validated way. The text is accompanied by examples of R code and a sample data set, allowing the reader to practice the methods discussed here. A companion website (http://dataminingtutorial.com) provides additional learning resources.
Pages: 493-506
Number of pages: 14