Bibliographic coupling, common abstract stems and clustering: A comparison of two document-document similarity approaches in the context of science mapping

Cited by: 33
Authors
Ahlgren, Per [1]
Jarneving, Bo [1]
Affiliations
[1] Swedish Sch Lib & Informat Sci, S-50190 Boras, Sweden
Keywords
Cluster Solution; Adjusted Rand Index; Test Article; Bibliographic Coupling; Coupling Solution;
DOI
10.1007/s11192-007-1935-1
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This paper deals with two document-document similarity approaches in the context of science mapping: bibliographic coupling and a text approach based on the number of common abstract stems. We used 43 articles, published in the journal Information Retrieval, as test articles. An information retrieval expert performed a classification of these articles. We used the cosine measure for normalization, and the complete linkage method was used for clustering the articles. A number of article pairs were ranked (1) according to descending normalized coupling strength, and (2) according to descending normalized frequency of common abstract stems. The degree of agreement between the two obtained rankings was low, as measured by Kendall's tau. The agreement between the two cluster solutions, one for each approach, was fairly low, according to the adjusted Rand index. However, there were examples of perfect agreement between the coupling solution and the stems solution. The classification generated by the expert contained larger groups compared to the coupling and stems solutions, and the agreement between the two solutions and the classification was not high. According to the adjusted Rand index, though, the stems solution was a better approximation of the classification than the coupling solution. With respect to cluster quality, the overall Silhouette value was slightly higher for the stems solution. Examples of homogeneous cluster structures, as well as negative Silhouette values, were found with regard to both solutions. The expert classification indicates that the field of information retrieval, as represented by one volume of articles published in Information Retrieval, is fairly heterogeneous regarding research themes, since the classification is associated with 15 themes. The complete linkage method, in combination with the upper tail rule, gave rise to a fairly good approximation of the classification with respect to the number of identified groups, especially in the case of the stems approach.
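As a rough illustration of two quantities named in the abstract, the following Python sketch computes a cosine-normalized overlap between two sets and the adjusted Rand index between two partitions. It is not the authors' implementation: the set-based form of the cosine measure (shared items divided by the square root of the product of the set sizes) is an assumption, as are the function names and the toy data. The same overlap function could be applied to reference lists (bibliographic coupling) or to sets of abstract stems.

```python
from math import comb, sqrt


def cosine_overlap(items_a, items_b):
    """Cosine-normalized overlap of two sets (assumed set-based form).

    For bibliographic coupling the sets hold cited references;
    for the text approach they would hold abstract stems.
    """
    if not items_a or not items_b:
        return 0.0
    return len(items_a & items_b) / sqrt(len(items_a) * len(items_b))


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie, 1985) of two partitions.

    Each argument is a sequence of cluster labels, one per document,
    in the same document order.
    """
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both partitions must label the same documents")
    if n < 2:
        return 1.0  # nothing to compare with fewer than two documents

    # Contingency table: how many documents share each label pair.
    cells, rows, cols = {}, {}, {}
    for a, b in zip(labels_a, labels_b):
        cells[(a, b)] = cells.get((a, b), 0) + 1
        rows[a] = rows.get(a, 0) + 1
        cols[b] = cols.get(b, 0) + 1

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (maximum - expected)


if __name__ == "__main__":
    # Hypothetical reference lists for three toy articles.
    refs = {
        "A": {"r1", "r2", "r3"},
        "B": {"r2", "r3", "r4"},
        "C": {"r7", "r8"},
    }
    print(cosine_overlap(refs["A"], refs["B"]))
    print(cosine_overlap(refs["A"], refs["C"]))

    # Two hypothetical cluster solutions over the same four documents.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The adjusted Rand index depends only on which documents are grouped together, not on the label names, so relabelling the clusters of one solution leaves the value unchanged.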
Pages: 273-290
Number of pages: 18