How variable may a constant be? Measures of lexical richness in perspective

Cited by: 229
Authors
Tweedie, FJ [1]
Baayen, RH [2]
Affiliations
[1] Univ Glasgow, Glasgow G12 8QQ, Lanark, Scotland
[2] Max Planck Inst Psycholinguist, Nijmegen, Netherlands
Source
COMPUTERS AND THE HUMANITIES | 1998 / Vol. 32 / No. 5
Keywords
lexical statistics; Monte Carlo methods; vocabulary richness
DOI
10.1023/A:1001749303137
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
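The dependence on text length described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `richness_profile` and `monte_carlo_band` are hypothetical, and the two "constants" shown (the type-token ratio as a richness measure and Yule's K as a repetition measure) are chosen here as familiar examples; the abstract itself does not name the measures it evaluates. The band computed by shuffling token order is a generic Monte Carlo permutation approach, assumed for illustration and not necessarily the technique the paper proposes.

```python
import random
from collections import Counter


def richness_profile(tokens, steps=5):
    """Return (N, V(N), TTR, Yule's K) at evenly spaced prefix lengths N."""
    rows = []
    for i in range(1, steps + 1):
        n = len(tokens) * i // steps
        if n == 0:
            continue  # text shorter than the number of steps
        counts = Counter(tokens[:n])
        v = len(counts)  # vocabulary size: number of distinct word types
        # Yule's K = 10^4 * (sum over types of freq^2 - N) / N^2
        sum_sq = sum(c * c for c in counts.values())
        k = 1e4 * (sum_sq - n) / (n * n)
        rows.append((n, v, v / n, k))
    return rows


def monte_carlo_band(tokens, n, runs=200, seed=0):
    """Approximate 95% band for V(n) under random permutations of token order.

    Assumption: a plain permutation scheme, used only as an illustration.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(runs):
        sample = list(tokens)
        rng.shuffle(sample)
        sizes.append(len(set(sample[:n])))
    sizes.sort()
    lo = sizes[int(0.025 * (runs - 1))]
    hi = sizes[int(0.975 * (runs - 1))]
    return lo, hi


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the log".split()
    for n, v, ttr, k in richness_profile(text, steps=3):
        print(f"N={n:3d}  V={v:3d}  TTR={ttr:.3f}  K={k:.1f}")
    print("95% band for V(6):", monte_carlo_band(text, 6))
```

Plotting the richness value against the repetition value at each prefix length gives one trajectory per text in the richness-repetition plane, which is the kind of comparison the abstract proposes.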
Pages: 323-352
Page count: 30