Theory and Applications of b-Bit Minwise Hashing

被引:54
作者
Li, Ping [1 ]
Koenig, Arnd Christian [2 ]
机构
[1] Cornell Univ, Dept Stat Sci, Fac Comp & Informat Sci, Ithaca, NY 14853 USA
[2] Microsoft Corp, Redmond, WA 98052 USA
关键词
ALGORITHMS;
D O I
10.1145/1978542.1978566
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is minwise hashing. This paper describes b-bit minwise hashing, which can provide an order of magnitude improvements in storage requirements and computational overhead over the original scheme in practice. We give both theoretical characterizations of the performance of the new algorithm as well as a practical evaluation on large real-life datasets and show that these match very closely. Moreover, we provide a detailed comparison with other important alternative techniques proposed for estimating set similarities. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme.
引用
收藏
页码:101 / 109
页数:9
相关论文
共 25 条
[1]   Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions [J].
Andoni, Alexandr ;
Indyk, Piotr .
COMMUNICATIONS OF THE ACM, 2008, 51 (01) :117-122
[2]  
[Anonymous], MINING MASSIVE DATAS
[3]   On the resemblance and containment of documents [J].
Broder, AZ .
COMPRESSION AND COMPLEXITY OF SEQUENCES 1997 - PROCEEDINGS, 1998, :21-29
[4]   Syntactic clustering of the Web [J].
Broder, AZ ;
Glassman, SC ;
Manasse, MS ;
Zweig, G .
COMPUTER NETWORKS AND ISDN SYSTEMS, 1997, 29 (8-13) :1157-1166
[5]  
Charikar Moses S, 2002, P 34 ANN ACM S THEOR, P380, DOI DOI 10.1145/509907.509965
[6]  
Cherkasova L, 2009, KDD-09: 15TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, P1087
[7]   Finding interesting associations without support pruning [J].
Cohen, E ;
Datar, M ;
Fujiwara, S ;
Gionis, A ;
Indyk, P ;
Motwani, R ;
Ullman, JD ;
Yang, C .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2001, 13 (01) :64-78
[8]  
Fetterly D., 2003, P 12 INT WORLD WID W, P669, DOI [DOI 10.1145/775152.775246, 10.1145/775152.775246]
[9]  
Forman George, 2009, Operating Systems Review, V43, P84, DOI 10.1145/1496909.1496926
[10]  
GAMON M, 2008, AAAI C WEBL SOC MED