Phrase-based document similarity based on an Index Graph model

Cited by: 30
Authors
Hammouda, KM [1]
Kamel, MS [1]
Affiliations
[1] Univ Waterloo, Dept Syst Design Engn, Waterloo, ON N2L 3G1, Canada
Source
2002 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS | 2002
DOI
10.1109/ICDM.2002.1183904
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Document clustering techniques mostly rely on single-term analysis of the document data set, as in the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in a document as well as its single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases rather than single terms only. The semi-structured nature of web documents helps identify potential phrases that, when matched in other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, enhances web document clustering quality significantly.
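The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the class name `PhraseIndexGraph`, the greedy matching strategy, the phrase score, and the blend factor `alpha` are all assumptions made for illustration. Words are graph nodes, an edge records where one word directly follows another, and shared phrases are found by following edges that both documents contain.

```python
import math
from collections import Counter, defaultdict


class PhraseIndexGraph:
    """Hypothetical phrase index: nodes are words, edges are adjacent word pairs."""

    def __init__(self):
        # (w1, w2) -> doc_id -> set of (sentence_no, position of w1)
        self.edges = defaultdict(lambda: defaultdict(set))
        self.docs = {}  # doc_id -> list of tokenized sentences

    def add_document(self, doc_id, sentences):
        tokens = [s.lower().split() for s in sentences]
        self.docs[doc_id] = tokens
        for s_no, words in enumerate(tokens):
            for pos in range(len(words) - 1):
                self.edges[(words[pos], words[pos + 1])][doc_id].add((s_no, pos))

    def _occurrences(self, w1, w2, doc_id):
        # .get avoids creating empty entries in the defaultdict
        return self.edges.get((w1, w2), {}).get(doc_id, set())

    def shared_phrases(self, doc_a, doc_b):
        """Greedily collect phrases (2+ words) of doc_a that also occur in doc_b."""
        found = set()
        for words in self.docs[doc_a]:
            pos = 0
            while pos < len(words) - 1:
                active = set(self._occurrences(words[pos], words[pos + 1], doc_b))
                if not active:
                    pos += 1
                    continue
                n_edges = 1
                # Extend the match while some occurrence in doc_b continues it.
                while pos + n_edges < len(words) - 1:
                    nxt = self._occurrences(
                        words[pos + n_edges], words[pos + n_edges + 1], doc_b)
                    still = {(s, p) for (s, p) in active if (s, p + n_edges) in nxt}
                    if not still:
                        break
                    active = still
                    n_edges += 1
                found.add(tuple(words[pos:pos + n_edges + 1]))
                pos += n_edges
        return found


def term_cosine(graph, doc_a, doc_b):
    """Plain term-frequency cosine similarity (single-term part)."""
    ca = Counter(w for s in graph.docs[doc_a] for w in s)
    cb = Counter(w for s in graph.docs[doc_b] for w in s)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def phrase_score(graph, doc_a, doc_b):
    """Assumed placeholder score: shared-phrase words over the shorter document."""
    phrases = graph.shared_phrases(doc_a, doc_b) | graph.shared_phrases(doc_b, doc_a)
    n_a = sum(len(s) for s in graph.docs[doc_a])
    n_b = sum(len(s) for s in graph.docs[doc_b])
    shortest = min(n_a, n_b)
    if shortest == 0:
        return 0.0
    return min(1.0, sum(len(p) for p in phrases) / shortest)


def combined_similarity(graph, doc_a, doc_b, alpha=0.5):
    """Blend phrase and single-term similarity; alpha is an assumed weight."""
    return alpha * phrase_score(graph, doc_a, doc_b) + \
        (1 - alpha) * term_cosine(graph, doc_a, doc_b)


graph = PhraseIndexGraph()
graph.add_document("d1", ["river rafting trips", "mild river rafting"])
graph.add_document("d2", ["wild river adventures", "river rafting vacation plan"])
print(graph.shared_phrases("d1", "d2"))
print(combined_similarity(graph, "d1", "d2"))
```

The resulting pairwise values could then be passed to any standard clustering routine as a similarity matrix. The toy sentences and the equal weighting are illustrative choices only.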
Pages: 203-210
Page count: 8