The distribution of N-grams

Cited by: 22
Author
Egghe, L
Affiliations
[1] Limburgs Univ Ctr, B-3590 Diepenbeek, Belgium
[2] UIA, Wilrijk, Belgium
DOI
10.1023/A:1005634925734
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
N-grams are generalized words consisting of N consecutive symbols, as they are used in a text. This paper determines the rank-frequency distribution for redundant N-grams. For entire texts this is known to be Zipf's law (i.e., an inverse power law). For N-grams, however, we show that the rank ($r$)-frequency distribution is $P_N(r) = C / (\psi_N(r))^{\beta}$, where $\psi_N$ is the inverse function of $f_N(x) = x \ln^{N-1} x$. Here we assume that the rank-frequency distribution of the symbols follows Zipf's law with exponent $\beta$.
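The function $\psi_N$ in the abstract has no elementary closed form for N > 1, so evaluating the distribution requires a numerical inverse. The following is a minimal sketch of one way to do that, not the paper's own procedure: it inverts $f_N$ by bisection on $x \ge 1$, where $f_N$ is increasing. The names `f_n`, `psi_n`, and `p_n` are hypothetical, and the default `c=1.0` is an assumption, since the abstract does not give the value of the constant C.

```python
import math


def f_n(x: float, n: int) -> float:
    """f_N(x) = x * (ln x)^(N-1), as defined in the abstract."""
    return x * math.log(x) ** (n - 1)


def psi_n(r: float, n: int, tol: float = 1e-12) -> float:
    """Numerical inverse of f_N on x >= 1 (bisection; an illustrative choice)."""
    if n == 1:
        return r  # f_1(x) = x, so the inverse is the identity
    lo, hi = 1.0, 2.0
    # Grow the upper bracket until f_N(hi) >= r.
    while f_n(hi, n) < r:
        hi *= 2.0
    # Shrink the bracket until it is narrow relative to its size.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f_n(mid, n) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def p_n(r: float, n: int, beta: float, c: float = 1.0) -> float:
    """P_N(r) = C / psi_N(r)^beta; c=1.0 is an assumed placeholder for C."""
    return c / psi_n(r, n) ** beta


if __name__ == "__main__":
    # Compare single symbols (N=1, plain Zipf) with bigrams and trigrams.
    for rank in (1, 10, 100, 1000):
        print(rank, [p_n(rank, n, beta=1.0) for n in (1, 2, 3)])
```

For N = 1 the sketch reduces to the ordinary Zipf law $C / r^{\beta}$, which is a quick sanity check on the inversion.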
Pages: 237-252
Number of pages: 16