Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace

Cited by: 146
Authors
Abbasi, Ahmed [1]
Chen, Hsinchun [1]
Affiliations
[1] Univ Arizona, Dept Management Informat Syst, Tucson, AZ 85721 USA
Keywords
stylometry; online text; discourse; style classification; text mining
DOI
10.1145/1344411.1344413
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transform-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks, with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.
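The abstract names two ingredients, a sliding window over the text and a Karhunen-Loeve transform of the resulting feature vectors. The sketch below illustrates only that combination and is not the authors' implementation: the six character-level features, the window and step sizes, and the function names `window_features` and `kl_transform` are all assumptions made for illustration, and the pattern disruption step is omitted because the abstract does not describe it.

```python
import numpy as np


def window_features(text, window=200, step=50):
    """Slide a fixed-size character window over `text` and compute a small,
    purely illustrative feature vector for each window position."""
    rows = []
    last_start = max(len(text) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = text[start:start + window]
        n = max(len(chunk), 1)
        words = chunk.split()
        rows.append([
            sum(c.isalpha() for c in chunk) / n,     # share of letters
            sum(c.isdigit() for c in chunk) / n,     # share of digits
            sum(c.isupper() for c in chunk) / n,     # share of uppercase letters
            sum(c in ".,;:!?" for c in chunk) / n,   # share of punctuation
            sum(c.isspace() for c in chunk) / n,     # share of whitespace
            float(np.mean([len(w) for w in words])) if words else 0.0,  # mean word length
        ])
    return np.array(rows, dtype=float)


def kl_transform(X, k=2):
    """Project the rows of X onto the top-k eigenvectors of their covariance
    matrix (a discrete Karhunen-Loeve transform). Needs at least two rows."""
    mean = X.mean(axis=0)
    centered = X - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    basis = eigvecs[:, order]
    return centered @ basis, basis, mean


if __name__ == "__main__":
    sample = "Thanks for the quick reply! I will ship item 42 on Monday. " * 15
    X = window_features(sample)
    points, basis, mean = kl_transform(X, k=2)
    print(X.shape, points.shape)
```

Each row of `points` is one window position in the reduced space, so a text becomes a trail of points rather than a single vector. Comparing two such trails would need a similarity measure, which this sketch leaves out.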
Pages: 29