PubMed related articles: a probabilistic topic-based model for content similarity

Cited by: 133
Authors
Lin, Jimmy [1, 2]
Wilbur, W. John [2]
Affiliations
[1] Univ Maryland, Coll Informat Studies, College Pk, MD 20742 USA
[2] Natl Lib Med, Natl Ctr Biotechnol Informat, Bethesda, MD 20894 USA
Keywords
Information Retrieval; MeSH; Retrieval Model; Related Article; Test Collection
DOI
10.1186/1471-2105-8-423
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: We present a probabilistic topic-based model for content similarity called pmra that underlies the related article search feature in PubMed. Whether or not a document is about a particular topic is computed from term frequencies, modeled as Poisson distributions. Unlike previous probabilistic retrieval models, we do not attempt to estimate relevance; rather, our focus is "relatedness", the probability that a user would want to examine a particular document given known interest in another. We also describe a novel technique for estimating parameters that does not require human relevance judgments; instead, the process is based on the existence of MeSH® in MEDLINE®. Results: The pmra retrieval model was compared against bm25, a competitive probabilistic model that shares theoretical similarities. Experiments using the test collection from the TREC 2005 genomics track show a small but statistically significant improvement of pmra over bm25 in terms of precision. Conclusion: Our experiments suggest that the pmra model provides an effective ranking algorithm for related article search.
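
The abstract's statement that topic "aboutness" is computed from term frequencies modeled as Poisson distributions can be illustrated with a minimal sketch. This is not the pmra formula from the paper; it is a generic two-Poisson posterior in the spirit of the Harter-style models listed among the related papers. The function names, the two rate parameters (`lam_elite`, `lam_non`), and the prior are hypothetical values chosen only for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing k occurrences under a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def prob_about_topic(k: int, lam_elite: float, lam_non: float, prior: float) -> float:
    """Posterior probability that a document is about a topic, given that the
    topic's term occurs k times in it (Bayes' rule over two Poisson sources).

    lam_elite: assumed mean term count in documents about the topic
    lam_non:   assumed mean term count in documents not about the topic
    prior:     assumed prior probability that a document is about the topic
    """
    p_elite = prior * poisson_pmf(k, lam_elite)
    p_non = (1.0 - prior) * poisson_pmf(k, lam_non)
    return p_elite / (p_elite + p_non)


# Hypothetical parameters, not taken from the paper.
for k in range(6):
    p = prob_about_topic(k, lam_elite=3.0, lam_non=0.3, prior=0.1)
    print(f"term count {k}: P(about topic) = {p:.4f}")
```

With these assumed rates, the posterior rises with the term count, which is the qualitative behavior such a model relies on; how the paper combines these probabilities into a relatedness score, and how it estimates the parameters from MeSH, is not shown here.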
Pages: 14
Related papers
21 in total
[1] Berger AL, 1996, COMPUT LINGUIST, V22, P39
[2] CLEVERDON C, 1968, ASLIB CRANFIELD RES
[3] PROBABILISTIC APPROACH TO AUTOMATIC KEYWORD INDEXING .1. DISTRIBUTION OF SPECIALTY WORDS IN A TECHNICAL LITERATURE [J]. HARTER, SP. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, 1975, 26 (04): 197-206
[4] HERSH WR, 2005, P 14 TEXT RETR C TRE
[5] LIN J, 2007, LAMPTR145CSTR4877UMI
[6] MODELING DOCUMENTS WITH MULTIPLE POISSON-DISTRIBUTIONS [J]. MARGULIS, EL. INFORMATION PROCESSING & MANAGEMENT, 1993, 29 (02): 215-227
[7] METZLER D, 2005, P 28 ANN INT ACM SIG, P472, DOI 10.1145/1076034.1076115
[8] PROBABILITY RANKING PRINCIPLE IN IR [J]. ROBERTSON, SE. JOURNAL OF DOCUMENTATION, 1977, 33 (04): 294-304
[9] Robertson SE., 1994, P 17 ANN INT ACM SIG, P232
[10] ROBERTSON SE, 1994, P 3 TEXT RETR C TREC