Prose and Poetry Classification and Boundary Detection Using Word Adjacency Network Analysis

Cited by: 21
Authors
Roxas, Ranzivelle Marianne [1]
Tapang, Giovanni [1]
Affiliations
[1] Univ Philippines Diliman, Natl Inst Phys, Quezon City 1101, Philippines
Source
INTERNATIONAL JOURNAL OF MODERN PHYSICS C | 2010 / Vol. 21 / No. 04
Keywords
Text genre classification; word adjacency networks; LDA; Structures and organization in complex systems; Complex systems; Computer science and technology;
DOI
10.1142/S0129183110015257
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Word adjacency networks constructed from written works reflect differences in the structure of prose and poetry. We present a method to disambiguate prose and poetry by analyzing network parameters of word adjacency networks, such as the clustering coefficient, average path length and average degree. We determine the relevant parameters for disambiguation using linear discriminant analysis (LDA) and the effect size criterion. The accuracy of the method is 74.9 ± 2.9% for the training set and 73.7 ± 6.4% for the test set, both of which exceed the acceptable classifier requirement of 67.3%. The approach is also useful for locating text boundaries within a single article: a boundary falls within the window in which a significant change in clustering coefficient is observed. Results indicate that an optimal window size of 75 words can detect the text boundaries.
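The three network parameters named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the networks were built, so the tokenizer, the undirected and unweighted edges, the exclusion of self-loops, and the non-overlapping windows are all assumptions, and every function name here is hypothetical. The LDA classification step is left out.

```python
import re
from collections import deque


def tokenize(text):
    """Lowercase the text and split it into words (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def build_adjacency_network(words):
    """Nodes are distinct words; an undirected edge joins consecutive words."""
    graph = {w: set() for w in words}
    for a, b in zip(words, words[1:]):
        if a != b:  # assumption: repeated words do not create self-loops
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_degree(graph):
    """Mean number of neighbours per node (graph must be non-empty)."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def clustering_coefficient(graph):
    """Mean local clustering; nodes with fewer than two neighbours count as 0."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        nbr_list = list(nbrs)
        # Count edges that exist between pairs of this node's neighbours.
        links = sum(
            1
            for i, u in enumerate(nbr_list)
            for v in nbr_list[i + 1:]
            if v in graph[u]
        )
        total += 2 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over reachable ordered pairs, via BFS."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def windowed_clustering(text, window=75):
    """Clustering coefficient of each consecutive, non-overlapping window."""
    words = tokenize(text)
    return [
        clustering_coefficient(build_adjacency_network(words[i:i + window]))
        for i in range(0, len(words) - window + 1, window)
    ]
```

Under these assumptions, a prose-poetry boundary would show up as a large jump between neighbouring values returned by `windowed_clustering`. The default of 75 words follows the window size reported in the abstract; what counts as a significant change is not specified there and would need to be chosen separately.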
Pages: 503-512
Number of pages: 10
Related papers
19 items in total
[1]   Statistical mechanics of complex networks [J].
Albert, R ;
Barabási, AL .
REVIEWS OF MODERN PHYSICS, 2002, 74 (01) :47-97
[2]   Hierarchical structures induce long-range dynamical correlations in written texts [J].
Alvarez-Lacalle, E. ;
Dorow, B. ;
Eckmann, J. -P. ;
Moses, E. .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2006, 103 (21) :7956-7961
[3]   Complex networks analysis of manual and machine translations [J].
Amancio, Diego R. ;
Antiqueira, Lucas ;
Pardo, Thiago A. S. ;
Costa, Luciano da F. ;
Oliveira, Osvaldo N., Jr. ;
Nunes, Maria G. V. .
INTERNATIONAL JOURNAL OF MODERN PHYSICS C, 2008, 19 (04) :583-598
[4]  
[Anonymous], 2003, COLUMBIA ELECT ENCY, VSixth Edition
[5]  
[Anonymous], 1998, MULTIVARIATE DATA AN
[6]  
ANTIQUEIRA L, 2006, P 4 WORKSH INF HUM L
[7]   Zipf's law from a communicative phase transition [J].
Cancho, RFI .
EUROPEAN PHYSICAL JOURNAL B, 2005, 47 (03) :449-457
[8]   Bayesian network model for semi-structured document classification [J].
Denoyer, L ;
Gallinari, P .
INFORMATION PROCESSING & MANAGEMENT, 2004, 40 (05) :807-827
[9]   Entropy of dialogues creates coherent structures in e-mail traffic [J].
Eckmann, JP ;
Moses, E ;
Sergi, D .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2004, 101 (40) :14333-14337
[10]   Curvature of co-links uncovers hidden thematic layers in the World Wide Web [J].
Eckmann, JP ;
Moses, E .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2002, 99 (09) :5825-5829