Hierarchical structures induce long-range dynamical correlations in written texts

Cited by: 54
Authors
Alvarez-Lacalle, E.
Dorow, B.
Eckmann, J. -P.
Moses, E. [1]
Affiliations
[1] Weizmann Inst Sci, Dept Phys Complex Syst, IL-76100 Rehovot, Israel
[2] Weizmann Inst Sci, Albert Einstein Minerva Ctr Theoret Phys, IL-76100 Rehovot, Israel
[3] Univ Stuttgart, Inst Nat Language Proc, D-70174 Stuttgart, Germany
[4] Univ Geneva, Dept Theoret Phys, CH-1211 Geneva, Switzerland
[5] Univ Geneva, Sect Math, CH-1211 Geneva, Switzerland
Keywords
DOI
10.1073/pnas.0510673103
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
Thoughts and ideas are multidimensional and often concurrent, yet they can be expressed surprisingly well sequentially by the translation into language. This reduction of dimensions occurs naturally but requires memory and necessitates the existence of correlations, e.g., in written text. However, correlations in word appearance decay quickly, while previous observations of long-range correlations using random walk approaches yield little insight on memory or on semantic context. Instead, we study combinations of words that a reader is exposed to within a "window of attention" spanning about 100 words. We define a vector space of such word combinations by looking at words that co-occur within the window of attention, and analyze its structure. Singular value decomposition of the co-occurrence matrix identifies a basis whose vectors correspond to specific topics, or "concepts" that are relevant to the text. As the reader follows a text, the "vector of attention" traces out a trajectory of directions in this "concept space." We find that memory of the direction is retained over long times, forming power-law correlations. The appearance of power laws hints at the existence of an underlying hierarchical network. Indeed, imposing a hierarchy similar to that defined by volumes, chapters, paragraphs, etc. succeeds in creating correlations in a surrogate random text that are identical to those of the original text. We conclude that hierarchical structures in text serve to create long-range correlations, and use the reader's memory in reenacting some of the multidimensionality of the thoughts being expressed.
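The abstract's pipeline (count word co-occurrences within a sliding window of attention, then apply singular value decomposition to expose topic-like basis vectors) can be illustrated with a minimal sketch. This is not the authors' code; the toy corpus, the 10-word window (the paper uses roughly 100 words), and the function name `cooccurrence_matrix` are assumptions made for illustration.

```python
import numpy as np

def cooccurrence_matrix(words, window=100):
    """Count co-occurrences of word pairs within a sliding window of attention."""
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(words):
        # Pair each word with the words that follow it inside the window.
        for j in range(i + 1, min(i + window, len(words))):
            a, b = index[w], index[words[j]]
            C[a, b] += 1
            C[b, a] += 1  # keep the matrix symmetric
    return C, vocab

# Toy corpus with two artificial "topics" (a real analysis would use a book-length text).
text = ("cat dog pet fur " * 20 + "star planet orbit sky " * 20).split()
C, vocab = cooccurrence_matrix(text, window=10)

# SVD of the co-occurrence matrix; the leading singular vectors play the role
# of the "concepts" described in the abstract, and projecting successive
# windows onto them would trace out the "vector of attention" trajectory.
U, s, Vt = np.linalg.svd(C)
```

On this toy corpus the two leading singular vectors separate the two word groups, which is the sense in which SVD identifies a basis of topics.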
Pages: 7956-7961
Page count: 6
Related papers
28 records
  • [1] LANGUAGE AND CODIFICATION DEPENDENCE OF LONG-RANGE CORRELATIONS IN TEXTS
    Amit, M.
    Shmerler, Y.
    Eisenberg, E.
    Abraham, M.
    Shnerb, N.
    [J]. FRACTALS-COMPLEX GEOMETRY PATTERNS AND SCALING IN NATURE AND SOCIETY, 1994, 2 (01) : 7 - 13
  • [2] [Anonymous], 1994, STAT LANGUAGE LEARNI
  • [3] Statistical models for text segmentation
    Beeferman, D
    Berger, A
    Lafferty, J
    [J]. MACHINE LEARNING, 1999, 34 (1-3) : 177 - 210
  • [4] BOLZANO B, 1930, WISSENSCHAFTSLEHRE, V3
  • [5] BRUNET E, 1974, TRAITEMENT FAITS LIN, P105
  • [6] The small world of human language
    Cancho, RFI
    Solé, RV
    [J]. PROCEEDINGS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 2001, 268 (1482) : 2261 - 2265
  • [7] Charniak E., 1997, Proc. of the National Conference on Artificial Intelligence, P598
  • [8] Clark J., 1995, An introduction to phonetics and phonology, V2nd
  • [9] Head-driven statistical models for natural language parsing
    Collins, M
    [J]. COMPUTATIONAL LINGUISTICS, 2003, 29 (04) : 589 - 637
  • [10] DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO