Learning Stylometric Representations for Authorship Analysis

Cited by: 42
Authors
Ding, Steven H. H. [1]
Fung, Benjamin C. M. [1, 2]
Iqbal, Farkhund [3]
Cheung, William K. [2]
Affiliations
[1] McGill Univ, Sch Informat Studies, Montreal, PQ H3A 1X1, Canada
[2] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Peoples R China
[3] Zayed Univ, Coll Technol Innovat, Abu Dhabi, U Arab Emirates
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Authorship analysis (AA); computational linguistics; representation learning; text mining; ATTRIBUTION; FEATURES; LANGUAGE; STYLE;
DOI
10.1109/TCYB.2017.2766189
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author's identity and sociolinguistic characteristics based on the writing styles reflected in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, and political socialization. However, most previous techniques depend critically on manual feature engineering. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representations of words in order to simultaneously learn writing-style representations from unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification, and authorship verification with Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms static stylometrics, dynamic n-grams, latent Dirichlet allocation, latent semantic analysis, the distributed memory model of paragraph vectors, the distributed bag-of-words version of paragraph vectors, word2vec representations, and other baselines.
Pages: 107-121
Page count: 15