Archiving the web using page changes patterns: a case study

Cited by: 8
Authors
Ben Saad, Myriam [1]
Gancarski, Stephane [1]
Affiliations
[1] Univ Paris 06, LIP6, 4 Pl Jussieu, F-75005 Paris, France
Keywords
Web archiving; Importance of page changes; Pattern; Temporal completeness
DOI
10.1007/s00799-012-0094-z
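Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work]
Subject classification code
1205; 120501
Abstract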
A pattern is a model or template that summarizes and describes the behavior (or trend) of data that generally exhibit recurrent events. Patterns have received considerable attention in recent years and have been widely studied in the data mining field. Various pattern mining approaches have been proposed and applied to network monitoring, moving object tracking, financial or medical data analysis, scientific data processing, etc. In these contexts, discovered patterns have been used to detect anomalies, to predict data behavior (or trend) or, more generally, to simplify data processing or improve system performance. However, to the best of our knowledge, patterns have never been used in the context of Web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools for archiving Websites efficiently. We first define our pattern model, which describes the importance of page changes. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns, and (iii) exploit them to improve Web archives. The archive of the French public TV channels France Televisions is chosen as a case study to validate our approach. Our experimental evaluation, based on real Web pages, shows the utility of patterns for improving archive quality and optimizing indexing or storage.
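The three-step strategy named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and every detail below is an assumption made for illustration: each observed change is taken to be a `(day, hour, importance)` tuple with a non-negative importance score, a pattern is modeled as the average change importance per fixed-length slot of the day, and the pattern is exploited by splitting a daily crawl budget in proportion to those averages. The names `build_pattern`, `allocate_crawls`, and `slot_hours` are hypothetical.

```python
def build_pattern(observations, slot_hours=6):
    """Average change importance per time-of-day slot (assumed pattern model).

    observations: iterable of (day, hour, importance) tuples, hour in 0..23.
    Returns one average per slot, taken over the number of distinct days seen.
    """
    if 24 % slot_hours != 0:
        raise ValueError("slot_hours must divide 24")
    totals = [0.0] * (24 // slot_hours)
    days = set()
    for day, hour, importance in observations:
        totals[hour // slot_hours] += importance
        days.add(day)
    n_days = max(len(days), 1)  # avoid division by zero on empty input
    return [t / n_days for t in totals]


def allocate_crawls(pattern, budget):
    """Split an integer crawl budget across slots, proportionally to the pattern.

    Uses largest-remainder rounding so the allocations sum to the budget.
    """
    total = sum(pattern)
    if total == 0:
        # No observed changes: spread the budget as evenly as possible.
        base, extra = divmod(budget, len(pattern))
        return [base + (1 if i < extra else 0) for i in range(len(pattern))]
    raw = [budget * p / total for p in pattern]
    alloc = [int(r) for r in raw]  # floor, since all values are non-negative
    remaining = max(budget - sum(alloc), 0)
    by_remainder = sorted(
        range(len(pattern)), key=lambda i: raw[i] - alloc[i], reverse=True
    )
    for i in by_remainder[:remaining]:
        alloc[i] += 1
    return alloc


# Hypothetical observations over two days: (day, hour, importance)
observations = [(1, 8, 0.5), (1, 20, 0.25), (2, 8, 0.5), (2, 9, 0.25)]
pattern = build_pattern(observations, slot_hours=6)
schedule = allocate_crawls(pattern, budget=6)
```

In this sketch, slots where important changes recur receive more crawls, which is one simple way a discovered pattern could guide when a page is captured. The proportional allocation is a design choice made here for brevity, not something stated in the abstract.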
Pages: 33-49
Page count: 17