A semi-structured document model for text mining

Cited by: 27
Authors
Yang, JW [1]
Chen, XO [1]
Affiliation
[1] Beijing Univ, Inst Comp Sci & Technol, Natl Key Lab Text Proc, Beijing 100871, Peoples R China
Keywords
semi-structured document; XML; text mining; vector space model; structured link vector model
DOI
10.1007/BF02948828
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can be fully exploited. To take advantage of the structure and link information in semi-structured documents for better mining, this paper presents a structured link vector model (SLVM), in which a vector represents a document and the vector's elements are determined by terms, document structure, and neighboring documents. For brevity and clarity, text mining based on SLVM is described through the two steps of the K-means procedure: calculating document similarity and calculating cluster centers. In the experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model: the F value increases from 0.65-0.73 to 0.82-0.86.
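The abstract names the ingredients of SLVM (terms, document structure, neighboring documents) and the two K-means steps, but it does not give the formulas. The following Python sketch is therefore only an illustration under stated assumptions, not the paper's definition: per-element weights (`weights`), a damping factor for linked documents (`link_weight`), cosine similarity, and a mean-vector cluster center are all hypothetical choices made here for concreteness.

```python
import math
from collections import Counter


def doc_vector(elements, weights, neighbors=(), link_weight=0.5):
    """Build a sparse term vector for one semi-structured document.

    elements:    dict mapping an element name (e.g. "title") to its tokens
    weights:     dict mapping an element name to a weight (assumed scheme)
    neighbors:   vectors of linked documents (assumed to be precomputed)
    link_weight: assumed damping factor for terms inherited from neighbors
    """
    vec = Counter()
    # Structure: a term counts more when it occurs in a heavily weighted element.
    for name, tokens in elements.items():
        w = weights.get(name, 1.0)
        for tok in tokens:
            vec[tok] += w
    # Links: neighboring documents contribute a damped copy of their terms.
    for nvec in neighbors:
        for tok, val in nvec.items():
            vec[tok] += link_weight * val
    return dict(vec)


def cosine(a, b):
    """Document similarity as the cosine of two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Cluster center as the component-wise mean of the member vectors."""
    total = Counter()
    for v in vectors:
        for k, val in v.items():
            total[k] += val
    n = len(vectors)
    return {k: val / n for k, val in total.items()}
```

In a K-means loop, `cosine` would assign each document to its most similar center and `centroid` would recompute each center from its members. Setting every element weight to 1.0 and passing no neighbors reduces `doc_vector` to a plain term-frequency vector, which is one way to picture the conventional vector space baseline mentioned in the abstract.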
Pages: 603-610
Number of pages: 8