Normalized compression distance for visual analysis of document collections

Cited by: 18
Authors
Telles, G. P. [1]
Minghim, R. [1]
Paulovich, F. V. [1]
Affiliation
[1] Univ Sao Paulo, Inst Ciencias Matemat & Comp, BR-13560970 Sao Paulo, Brazil
Source
COMPUTERS & GRAPHICS-UK | 2007 / Vol. 31 / Issue 03
Funding
São Paulo Research Foundation (FAPESP), Brazil
Keywords
Kolmogorov complexity; normalized compression distance; text collection visualization; multi-dimensional projection; document visualization;
DOI
10.1016/j.cag.2007.01.024
Chinese Library Classification
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
In a world flooded by text from various sources, it is of strategic importance to find ways to map information present in written documents in a form that helps users locate and associate important information within a particular text data set. Content-based maps can support extremely useful explorations of text data sets. This paper proposes and evaluates the use of Kolmogorov complexity approximations as a means to detect similarity between general textual documents, in order to support mapping and visualization techniques for corpora exploration. The calculation of this similarity measure requires no intermediate representation of a corpus (such as vector representation) and therefore no pre-processing or parametrization steps. That makes it very attractive for a wider range of exploratory applications compared to conventional measures that need vector-based text representations. The visual layout used here is based on fast distance multi-dimensional projections. It is shown that the similarity measure and the resulting maps present very good precision and that the approach can be used successfully for visual analysis of automatically generated text maps. (C) 2007 Elsevier Ltd. All rights reserved.
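As an illustration of the similarity measure named in the title and keywords, the normalized compression distance is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the length of the compressed input and xy is the concatenation of the two documents. The sketch below is not taken from the paper: it assumes zlib as the compressor (the abstract does not say which compressor the authors used), and the function names compressed_size, ncd and distance_matrix are hypothetical. It shows how a pairwise distance matrix could be built directly from raw text, with no vector representation, as input for a distance-based projection.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(docs: list[str]) -> list[list[float]]:
    """Symmetric pairwise NCD matrix for a list of text documents."""
    encoded = [d.encode("utf-8") for d in docs]
    n = len(encoded)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Computed once per pair and mirrored; the diagonal stays 0.0.
            matrix[i][j] = matrix[j][i] = ncd(encoded[i], encoded[j])
    return matrix
```

With a real compressor the value for two identical documents is small but usually not exactly zero, and values slightly above 1 can occur, so the result is best read as an approximation of the underlying Kolmogorov-complexity-based distance.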
Pages: 327-337
Page count: 11