Document-document similarity approaches and science mapping: Experimental comparison of five approaches

Cited by: 76
Authors
Ahlgren, Per [1]
Colliander, Cristian [2]
Affiliations
[1] Stockholm Univ, Univ Lib, Dept E Resources, SE-10691 Stockholm, Sweden
[2] Jonkoping Univ, Univ Lib, SE-55111 Jonkoping, Sweden
Keywords
Citation data; Textual data; Data source combination; Cluster analysis; Science mapping; INFORMATION;
DOI
10.1016/j.joi.2008.11.003
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
This paper treats document-document similarity approaches in the context of science mapping. Five approaches, involving nine methods, are compared experimentally. We compare text-based approaches, the citation-based bibliographic coupling approach, and approaches that combine text-based approaches and bibliographic coupling. Forty-three articles, published in the journal Information Retrieval, are used as test documents. We investigate how well the approaches agree with a ground truth subject classification of the test documents, when the complete linkage method is used, and under two types of similarities, first-order and second-order. The results show that it is possible to achieve a very good approximation of the classification by means of automatic grouping of articles. One text-only method and one combination method, under second-order similarities in both cases, give rise to cluster solutions that to a large extent agree with the classification. (C) 2008 Elsevier Ltd. All rights reserved.
Pages: 49-63
Page count: 15