The link database: Fast access to graphs of the Web

被引:37
作者
Randall, KH [1 ]
Stata, R [1 ]
Wickremesinghe, RG [1 ]
Wiener, JL [1 ]
机构
[1] Compaq Syst Res Ctr, Palo Alto, CA 94301 USA
来源
DCC 2002: DATA COMPRESSION CONFERENCE, PROCEEDINGS | 2002年
关键词
D O I
10.1109/DCC.2002.999950
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URLs) connected by directed edges (hyperlinks). The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms, we have found it important to fit the Link Database into RAM. In the first version of the Link Database, we achieved this fit by using machines with lots of memory (8GB), and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the web graph. The first compression technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second technique takes advantage of the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6 billion-edge graph in 22 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database, and provides performance numbers for compression ratios and decompression speed.
引用
收藏
页码:122 / 131
页数:10
相关论文
共 20 条
[1]  
ABELLO J, 1998, P 6 EUR S ALG, P332
[2]  
ADLER M, 2001, P IEEE DAT COMPR C D
[3]  
[Anonymous], 1998, P ACM SIAM S DISCR A
[4]  
BAEZAYATES RA, 1999, MODERN INFORMATION R
[5]  
BHARAT K, 1998, P 7 INT WORLD WID WE
[6]  
BHARAT K, 1999, P ACM C DIG LIB
[7]  
BOOKSTEIN A, 1991, INFORMATION SYSTEMS, P16
[8]  
Brin S., 1998, 7 INT WORLD WIDE WEB
[9]  
Broder A., 2000, P 9 INT WORLD WID WE
[10]  
CHIANG YJ, 1995, PROCEEDINGS OF THE SIXTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P139