Fast Candidate Generation for Real-Time Tweet Search with Bloom Filter Chains

被引:24
作者
Asadi, Nima [1 ]
Lin, Jimmy [2 ]
机构
[1] Univ Maryland Coll Pk, Dept Comp Sci, College Pk, MD 20742 USA
[2] Univ Maryland Coll Pk, Coll Informat Studies, iSchool, College Pk, MD USA
基金
美国国家科学基金会;
关键词
Algorithms; Experimentation; Scalability; efficiency; top-k retrieval; tweet search; bloom filters;
D O I
10.1145/2493175.2493178
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The rise of social media and other forms of user-generated content have created the demand for real-time search: against a high-velocity stream of incoming documents, users desire a list of relevant results at the time the query is issued. In the context of real-time search on tweets, this work explores candidate generation in a two-stage retrieval architecture where an initial list of results is processed by a second-stage rescorer to produce the final output. We introduce Bloom filter chains, a novel extension of Bloom filters that can dynamically expand to efficiently represent an arbitrarily long and growing list of monotonically-increasing integers with a constant false positive rate. Using a collection of Bloom filter chains, a novel approximate candidate generation algorithm called BWAND is able to perform both conjunctive and disjunctive retrieval. Experiments show that our algorithm is many times faster than competitive baselines and that this increased performance does not require sacrificing end-to-end effectiveness. Our results empirically characterize the trade-off space defined by output quality, query evaluation speed, and memory footprint for this particular search architecture.
引用
收藏
页码:1 / 36
页数:36
相关论文
共 74 条
[1]   Scalable Bloom Filters [J].
Almeida, Paulo Sergio ;
Baquero, Carlos ;
Preguica, Nuno ;
Hutchison, David .
INFORMATION PROCESSING LETTERS, 2007, 101 (06) :255-261
[2]  
Anh V. N., 2005, SIGIR 2005. Proceedings of the Twenty-Eighth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P226, DOI 10.1145/1076034.1076075
[3]  
[Anonymous], TECH REP MSR TR 2010
[4]  
[Anonymous], ARXIV 1302 5302
[5]  
[Anonymous], 1999, PAGERANK CITATION RA, DOI 10.1.1.31.1768
[6]  
[Anonymous], LEARNING TO RANK FOR
[7]  
[Anonymous], TECH REP
[8]  
[Anonymous], PROCEEDINGS OF THE O
[9]  
[Anonymous], PROCEEDINGS OF THE 2
[10]  
[Anonymous], 2011, P 34 INT ACM SIGIR C