Modeling Statistical Properties of Written Text

Cited by: 72
Authors
Angeles Serrano, M.
Flammini, Alessandro
Menczer, Filippo
Affiliations
[1] Departament de Química Física, Universitat de Barcelona, Barcelona
[2] School of Informatics, Indiana University, Bloomington, IN
[3] Complex Networks Lagrange Lab., ISI Foundation, Torino
Source
PLOS ONE | 2009 / Vol. 4 / Issue 04
Keywords
LANGUAGES;
DOI
10.1371/journal.pone.0005372
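Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code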
07 ; 0710 ; 09 ;
Abstract
Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the nontrivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
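As a rough illustration of two of the regularities named in the abstract, the sketch below measures them on a toy token list: the sorted word counts that a Zipf-style rank-frequency plot would use, and the running vocabulary size that a Heaps-style growth curve would use. This is not the generative model described in the paper; the function names and the sample sentence are hypothetical and chosen only for illustration.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent.

    Plotting these counts against their rank (1, 2, 3, ...) on log-log
    axes is the usual way to inspect Zipf's law empirically.
    """
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Return the number of distinct words seen after each token.

    Heaps' law says this curve grows sublinearly with text length.
    """
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


# Hypothetical toy input; a real analysis would use a large corpus.
sample = "the cat sat on the mat and the dog sat on the log".split()
freqs = rank_frequency(sample)
growth = vocabulary_growth(sample)
```

On a corpus of realistic size, fitting a line to the log-log versions of these two sequences would give rough estimates of the Zipf and Heaps exponents; the toy sentence here is far too short for such a fit to be meaningful.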
Pages: 8