Concept-based information retrieval using explicit semantic analysis

Cited by: 15
Authors
Egozi O. [1]
Markovitch S. [1]
Gabrilovich E. [2]
Affiliations
[1] Department of Computer Science, Israel Institute of Technology, Technion, Haifa
[2] Technion, Israel Institute of Technology; Yahoo! Research, Santa Clara, CA 95054
Keywords
Concept-based retrieval; Explicit semantic analysis; Feature selection; Semantic search
DOI
10.1145/1961209.1961211
Abstract
Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, due to the lack of labeled data, traditional feature selection methods cannot be used; we therefore propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results. © 2011 ACM.
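The concept-based idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the three-entry `CONCEPTS` dictionary is a hypothetical stand-in for a knowledge repository such as Wikipedia, the TF-IDF weighting and cosine ranking are assumptions chosen for simplicity, and the feature selection step the abstract describes is omitted. Texts are mapped to weighted concept vectors, and documents are ranked by similarity to the query in concept space.

```python
import math
from collections import Counter

# Hypothetical toy repository: concept title -> article text.
# In ESA-style methods these would be, for example, Wikipedia articles.
CONCEPTS = {
    "Automobile": "car vehicle engine road driver wheel",
    "Astronomy": "star planet telescope galaxy orbit",
    "Cooking": "recipe oven ingredient flavor kitchen",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def build_inverted_index(concepts):
    """Map each term to {concept: TF-IDF weight} over the concept articles."""
    n = len(concepts)
    tf = {name: Counter(tokenize(text)) for name, text in concepts.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    index = {}
    for name, counts in tf.items():
        for term, count in counts.items():
            # The +1 keeps weights positive even for terms in every article.
            idf = math.log(n / df[term]) + 1.0
            index.setdefault(term, {})[name] = count * idf
    return index


def concept_vector(text, index):
    """Sum the concept weights of every known term in the text."""
    vec = Counter()
    for term in tokenize(text):
        for name, weight in index.get(term, {}).items():
            vec[name] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs, index):
    """Return (score, document) pairs, best match first."""
    q = concept_vector(query, index)
    scored = [(cosine(q, concept_vector(d, index)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    index = build_inverted_index(CONCEPTS)
    docs = [
        "the driver started the engine",
        "a telescope pointed at a distant galaxy",
    ]
    for score, doc in rank("car", docs, index):
        print(f"{score:.2f}  {doc}")
```

In this toy setup the query "car" shares no keyword with either document, yet the first document ranks higher because both map to the same concept. This is the keyword-mismatch case the abstract describes.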