Accessor variety criteria for Chinese word extraction

Cited by: 76
Authors
Feng, HD [1]
Chen, K
Deng, XT
Zheng, WM
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Jinan 250100, Peoples R China
[2] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
DOI
10.1162/089120104773633394
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, the strings glossed as 'percent' and 'more and more' are not recognized as traditional Chinese words by some people. However, in our work, they are words because they are very widely used and have specific meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string. We count the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents) and use these counts to measure the context independence of the string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.
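The counting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `accessor_counts` and `accessor_variety`, the toy corpus, and the choice to combine the two counts by taking their minimum are all assumptions made for illustration, since the abstract does not say how the counts are combined or how sentence boundaries are treated. Here, occurrences at the start or end of a sentence simply contribute no accessor.

```python
def accessor_counts(sentences, s):
    """Return (distinct predecessors, distinct successors) of string s.

    A predecessor is the character directly before an occurrence of s;
    a successor is the character directly after it. Occurrences at a
    sentence boundary contribute nothing on that side (an assumption).
    """
    predecessors, successors = set(), set()
    for sentence in sentences:
        start = sentence.find(s)
        while start != -1:
            end = start + len(s)
            if start > 0:
                predecessors.add(sentence[start - 1])
            if end < len(sentence):
                successors.add(sentence[end])
            # Advance by one character so overlapping occurrences are seen.
            start = sentence.find(s, start + 1)
    return len(predecessors), len(successors)


def accessor_variety(sentences, s):
    """Combine the two counts with min (an assumption of this sketch)."""
    left, right = accessor_counts(sentences, s)
    return min(left, right)


# Hypothetical toy corpus; Latin letters stand in for Chinese characters.
corpus = ["xaby", "zabw", "xabw", "ab"]
print(accessor_counts(corpus, "ab"))   # distinct predecessors and successors
print(accessor_variety(corpus, "a"))   # "a" is always followed by "b"
```

A string that appears in many different contexts gets high counts on both sides, while a fragment of a longer word tends to have very few distinct neighbors on at least one side, which is why a threshold on such a score can separate word candidates from fragments.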
Pages: 75-93
Page count: 19