A semi-structured document model for text mining

Cited by: 27

Authors:

Yang, JW [1]

Chen, XO [1]

Affiliations:

[1] Beijing Univ, Inst Comp Sci & Technol, Natl Key Lab Text Proc, Beijing 100871, Peoples R China

Source:

JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY | 2002 / Vol. 17 / Issue 05

Keywords:

semi-structured document; XML; text mining; vector space model; structured link vector model;

DOI:

10.1007/BF02948828

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can be fully utilized. To take advantage of the structure and link information in semi-structured documents for better mining, this paper presents a structured link vector model (SLVM), in which a vector represents a document and the vector's elements are determined by terms, document structure, and neighboring documents. For brevity and clarity, text mining based on SLVM is described through the K-means procedure: calculating document similarity and calculating the cluster center. In the experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model, with the F value increasing from 0.65-0.73 to 0.82-0.86.

Pages: 603-610

Page count: 8