Comparing intermittency and network measurements of words and their dependence on authorship

Cited by: 38
Authors
Amancio, Diego Raphael [2 ]
Altmann, Eduardo G. [1 ]
Oliveira, Osvaldo N., Jr. [2 ]
Costa, Luciano da Fontoura [2 ]
Affiliations
[1] Max Planck Inst Phys Komplexer Syst, Dresden, Germany
[2] Univ Sao Paulo, Inst Phys Sao Carlos, BR-13560970 Sao Paulo, Brazil
Source
NEW JOURNAL OF PHYSICS | 2011 / Vol. 13
Funding
São Paulo Research Foundation, Brazil;
Keywords
COMPLEX NETWORKS; KEYWORD DETECTION; LEAST EFFORT; SMALL-WORLD; LANGUAGE; CLASSIFICATION; DISTRIBUTIONS;
DOI
10.1088/1367-2630/13/12/123024
Chinese Library Classification number
O4 [Physics];
Discipline classification code
0702;
Abstract
Many features of texts and languages can now be inferred from statistical analyses using concepts from complex networks and dynamical systems. In this paper, we quantify how topological properties of word co-occurrence networks and intermittency (or burstiness) in word distribution depend on the style of authors. Our database contains 40 books by eight authors who lived in the nineteenth and twentieth centuries, for which the following network measurements were obtained: the clustering coefficient, average shortest path lengths and betweenness. We found that the two factors with stronger dependence on authors were skewness in the distribution of word intermittency and the average shortest paths. Other factors such as betweenness and Zipf's law exponent show only weak dependence on authorship. Also assessed was the contribution from each measurement to authorship recognition using three machine learning methods. The best performance was about 65% accuracy upon combining complex networks and intermittency features with the nearest-neighbor algorithm of automatic authorship. From a detailed analysis of the interdependence of the various metrics, it is concluded that the methods used here are complementary for providing short- and long-scale perspectives on texts, which are useful for applications such as the identification of topical words and information retrieval.
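The measurements named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and rests on several assumptions: the network links adjacent tokens only, preprocessing is omitted, the shortest-path value is averaged from a single source word, and intermittency is taken as the coefficient of variation of the gaps between successive occurrences of a word. All function names are hypothetical.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


def cooccurrence_network(tokens):
    """Undirected graph linking each pair of adjacent, distinct tokens."""
    graph = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clustering(graph, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    nbrs = list(graph[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in graph[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def average_shortest_path(graph, source):
    """Mean breadth-first distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for n, d in dist.items() if n != source]
    return mean(others) if others else 0.0


def intermittency(tokens, word):
    """Assumed definition: std / mean of gaps between occurrences of word."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return 0.0
    return pstdev(gaps) / mean(gaps)
```

Under this assumed definition, a word that recurs at perfectly regular intervals scores 0, while a word that appears in clusters separated by long gaps scores higher. Betweenness and the skewness of the intermittency distribution, both mentioned in the abstract, are not covered by the sketch.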
Pages: 17