A novel document similarity measure based on earth mover's distance

Times cited: 69
Authors
Wan, Xiaojun [1]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing 100871, Peoples R China
Keywords
document similarity measure; document similarity search; earth mover's distance; TextTiling; subtopic structure
DOI
10.1016/j.ins.2007.02.045
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper we propose a novel measure based on the earth mover's distance (EMD) that evaluates document similarity by allowing many-to-many matching between subtopics. Each document is first decomposed into a set of subtopics; the EMD is then used to evaluate the similarity between the two sets of subtopics of two documents by solving the transportation problem. The proposed measure improves on the previous optimal matching (OM) based measure, which allows only one-to-one matching between subtopics. Experiments on the TDT3 dataset comparing existing similarity measures show that the EMD-based measure outperforms the OM-based measure and all other measures. In addition to the TextTiling algorithm, a sentence clustering algorithm is adopted for document decomposition, and the experimental results show that the proposed EMD-based measure does not rely on the choice of document decomposition algorithm and is therefore more robust than the OM-based measure. (C) 2007 Elsevier Inc. All rights reserved.
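The idea in the abstract, many-to-many matching between subtopics via the transportation problem, can be illustrated with a minimal Python sketch. It is not the paper's implementation. The following are assumptions made only for illustration: subtopics are given as token lists (no TextTiling step is performed), each subtopic's weight is its share of the document's tokens, the ground distance is one minus the cosine similarity of term counts, and the document similarity is taken as one minus the EMD. The function names `cosine`, `emd`, and `emd_similarity` are hypothetical.

```python
import math
from collections import Counter

INF = float("inf")
EPS = 1e-12  # tolerance for float comparisons


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def emd(weights_p, weights_q, dist):
    """Earth mover's distance, solved as a min-cost flow with
    successive shortest augmenting paths (Bellman-Ford)."""
    n, m = len(weights_p), len(weights_q)
    src, snk, size = n + m, n + m + 1, n + m + 2
    cap = [[0.0] * size for _ in range(size)]   # residual capacities
    cost = [[0.0] * size for _ in range(size)]  # per-unit edge costs
    for i in range(n):
        cap[src][i] = weights_p[i]
        for j in range(m):
            cap[i][n + j] = INF
            cost[i][n + j] = dist[i][j]
            cost[n + j][i] = -dist[i][j]  # cost of undoing flow
    for j in range(m):
        cap[n + j][snk] = weights_q[j]

    total = min(sum(weights_p), sum(weights_q))  # amount that must move
    moved = work = 0.0
    while total - moved > EPS:
        best = [INF] * size
        prev = [-1] * size
        best[src] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if best[u] == INF:
                    continue
                for v in range(size):
                    if cap[u][v] > EPS and best[u] + cost[u][v] < best[v] - EPS:
                        best[v] = best[u] + cost[u][v]
                        prev[v] = u
                        changed = True
            if not changed:
                break
        if best[snk] == INF:
            break  # no augmenting path left
        # Find the bottleneck along the cheapest path, then push flow.
        push, v = total - moved, snk
        while v != src:
            push = min(push, cap[prev[v]][v])
            v = prev[v]
        v = snk
        while v != src:
            u = prev[v]
            cap[u][v] -= push
            cap[v][u] += push
            v = u
        moved += push
        work += push * best[snk]
    return work / moved if moved else 0.0


def emd_similarity(subtopics_a, subtopics_b):
    """Similarity of two documents given as non-empty lists of
    subtopics, each subtopic a non-empty list of tokens.
    Weighting and 1 - EMD normalization are illustrative assumptions."""
    bags_a = [Counter(s) for s in subtopics_a]
    bags_b = [Counter(s) for s in subtopics_b]
    len_a = sum(len(s) for s in subtopics_a)
    len_b = sum(len(s) for s in subtopics_b)
    w_a = [len(s) / len_a for s in subtopics_a]
    w_b = [len(s) / len_b for s in subtopics_b]
    dist = [[max(0.0, 1.0 - cosine(x, y)) for y in bags_b] for x in bags_a]
    return 1.0 - emd(w_a, w_b, dist)
```

In this sketch a document with one long subtopic can spread its weight over several subtopics of the other document, which a one-to-one matching cannot express. For example, `emd([1.0], [0.5, 0.5], [[0.2, 0.4]])` sends half of the single source weight to each target.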
Pages: 3718-3730
Page count: 13