Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Cited by: 186
Authors
Boyack, Kevin W. [1 ]
Newman, David [2 ,3 ]
Duhon, Russell J. [4 ]
Klavans, Richard [5 ]
Patek, Michael [5 ]
Biberstine, Joseph R. [4 ]
Schijvenaars, Bob [6 ]
Skupin, Andre [7 ]
Ma, Nianli [4 ]
Boerner, Katy [4 ]
Affiliations
[1] SciTech Strategies Inc, Albuquerque, NM USA
[2] Univ Calif Irvine, Irvine, CA USA
[3] NICTA Victorian Res Lab, Melbourne, Australia
[4] Indiana Univ, Sch Lib & Informat Sci, Bloomington, IN 47401 USA
[5] SciTech Strategies Inc, Berwyn, PA USA
[6] Collexis Inc, Geldermalsen, Netherlands
[7] San Diego State Univ, Dept Geog, San Diego, CA 92182 USA
Source
PLOS ONE | 2011 / Vol. 6 / Issue 03
Funding
Australian Research Council; U.S. National Institutes of Health;
Keywords
INFORMATION; SEARCH; DECOMPOSITION; RETRIEVAL; MODELS; GRAPH;
DOI
10.1371/journal.pone.0018029
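Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code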
07 ; 0710 ; 09 ;
Abstract
Background: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary, and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.
Methodology: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models: BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
Conclusions: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
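Illustrative sketch (not part of the original record): the abstract names tf-idf cosine similarity, top-n filtering of each document's similarities, and a coherence measure based on the Jensen-Shannon divergence. The toy Python below shows one plausible form of each step on tokenized documents. Every function name, the tf-idf weighting variant (raw term count times log(N/df)), and the base-2 logarithm are assumptions made for illustration; this is not the authors' pipeline, which operated on 2.15 million MEDLINE records.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (term -> weight) from tokenized documents.

    Assumed weighting: raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n_similarities(vectors, n):
    """For each document, keep only its n most similar other documents.

    Brute-force O(N^2) comparison; fine for a toy example only.
    """
    kept = []
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by index
        kept.append([(j, s) for s, j in sims[:n]])
    return kept


def word_distribution(tokens):
    """Relative word frequencies of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    0 means identical distributions; 1 means no shared vocabulary.
    """
    terms = set(p) | set(q)
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl(a):
        return sum(pa * math.log2(pa / mix[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


if __name__ == "__main__":
    toy_docs = [
        ["gene", "expression", "cancer"],
        ["gene", "expression", "tumor"],
        ["protein", "folding", "structure"],
    ]
    print(top_n_similarities(tfidf_vectors(toy_docs), n=1))
    print(jensen_shannon(word_distribution(toy_docs[0]),
                         word_distribution(toy_docs[2])))
```

In a coherence setting, a lower divergence between a document's word distribution and that of its cluster would indicate a tighter cluster. How the study aggregated such values across clusters is not stated in this record.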
Pages: 11
Related papers
48 entries in total
  • [1] Bibliographic coupling, common abstract stems and clustering: A comparison of two document-document similarity approaches in the context of science mapping
    Ahlgren, Per
    Jarneving, Bo
    [J]. SCIENTOMETRICS, 2008, 76 (02) : 273 - 290
  • [2] Document-document similarity approaches and science mapping: Experimental comparison of five approaches
    Ahlgren, Per
    Colliander, Cristian
    [J]. JOURNAL OF INFORMETRICS, 2009, 3 (01) : 49 - 63
  • [3] Text categorization models for high-quality article retrieval in internal medicine
    Aphinyanaphongs, Y
    Tsamardinos, I
    Statnikov, A
    Hardin, D
    Aliferis, CF
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2005, 12 (02) : 207 - 216
  • [4] COMBINING THE EVIDENCE OF MULTIPLE QUERY REPRESENTATIONS FOR INFORMATION-RETRIEVAL
    BELKIN, NJ
    KANTOR, P
    FOX, EA
    SHAW, JA
    [J]. INFORMATION PROCESSING & MANAGEMENT, 1995, 31 (03) : 431 - 448
  • [5] Using linear algebra for intelligent information retrieval
    Berry, MW
    Dumais, ST
    O'Brien, GW
    [J]. SIAM REVIEW, 1995, 37 (04) : 573 - 595
  • [6] Latent Dirichlet allocation
    Blei, DM
    Ng, AY
    Jordan, MI
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) : 993 - 1022
  • [7] Visual conceptualizations and models of science
    Borner, Katy
    Scharnhorst, Andrea
    [J]. JOURNAL OF INFORMETRICS, 2009, 3 (03) : 161 - 172
  • [8] Co-Citation Analysis, Bibliographic Coupling, and Direct Citation: Which Citation Approach Represents the Research Front Most Accurately?
    Boyack, Kevin W.
    Klavans, Richard
    [J]. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2010, 61 (12) : 2389 - 2404
  • [9] Mapping the backbone of science
    Boyack, KW
    Klavans, R
    Börner, K
    [J]. SCIENTOMETRICS, 2005, 64 (03) : 351 - 374
  • [10] BOYACK KW, 2009, 12 INT C INT SOC SCI, P730