Scalable entity-based summarization of web search results using MapReduce

Cited by: 7
Authors
Kitsos, Ioannis [1, 2]
Magoutis, Kostas [1, 2]
Tzitzikas, Yannis [1, 2]
Affiliations
[1] FORTH ICS, Inst Comp Sci, Iraklion, Greece
[2] Univ Crete, Dept Comp Sci, Iraklion, Greece
Keywords
Text data analytics through summaries and synopses; Interactive data analysis through queryable summaries and indices; Information retrieval and named entity mining; MapReduce; Cloud computing
DOI
10.1007/s10619-013-7133-7
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Although Web search engines index and provide access to huge numbers of documents, user queries typically return only a linear list of hits. While this is often satisfactory for focalized search, it does not support exploration or deeper analysis of the results. One way to provide advanced exploration facilities, exploiting the availability of structured (and semantic) data in Web search, is to enrich Web search with entity mining over the full contents of the search results. Such services provide users with an initial overview of the information space, allowing them to gradually restrict it until they locate the desired hits, even if these are ranked low. This is especially important in areas of professional search such as medical search and patent search. In this paper we consider a general scenario of providing such services as meta-services (that is, layered over systems that support keyword search) without a priori indexing of the underlying document collection(s). To make such services feasible for large amounts of data, we use the MapReduce distributed computation model on a Cloud infrastructure (Amazon EC2). Specifically, we show how the required computational tasks can be factorized and expressed as MapReduce functions. A key contribution of our work is a thorough evaluation of platform configuration and tuning, an aspect that is often disregarded or inadequately addressed in prior work but is crucial for the efficient utilization of resources. Finally, we report experimental results on the speedup achieved in various settings.
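The abstract states that entity mining over search results can be factorized into MapReduce functions. As a rough illustration only, the following Python sketch simulates one such factorization locally: a map step emits entity postings per hit and a reduce step groups them per entity. The gazetteer, the function names (`map_hit`, `reduce_entity`, `run_job`) and the sample hits are all hypothetical and are not taken from the paper, which may decompose the work differently.

```python
from collections import defaultdict

# Hypothetical gazetteer standing in for a real named-entity recognizer.
GAZETTEER = {"Crete": "Location", "Amazon": "Organization", "Hadoop": "Software"}


def map_hit(doc_id, text):
    """Map: emit one ((category, entity), doc_id) pair per entity found in a hit."""
    tokens = {tok.strip(".,;:()") for tok in text.split()}
    for tok in sorted(tokens):
        if tok in GAZETTEER:
            yield (GAZETTEER[tok], tok), doc_id


def reduce_entity(key, doc_ids):
    """Reduce: collapse all postings of one entity into a sorted document list."""
    return key, sorted(set(doc_ids))


def run_job(hits):
    """Simulate the shuffle phase locally by grouping mapper output by key."""
    groups = defaultdict(list)
    for doc_id, text in hits:
        for key, value in map_hit(doc_id, text):
            groups[key].append(value)
    return dict(reduce_entity(k, v) for k, v in sorted(groups.items()))


# Assumed sample hits: (document id, full text of the search result).
hits = [
    (1, "Hadoop clusters on Amazon EC2."),
    (2, "A workshop in Crete about Hadoop."),
]
summary = run_job(hits)
```

In this sketch, `summary` maps each (category, entity) pair to the hits that mention it, which is the kind of overview a user could use to narrow the result list. On a real cluster the grouping done in `run_job` would be performed by the framework's shuffle phase.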
Pages: 405-446
Page count: 42