A document comparison scheme for secure duplicate detection

Cited by: 1
Authors
Mandreoli F. [1]
Martoglia R. [1]
Tiberio P. [1]
Affiliation
[1] Università di Modena e Reggio Emilia, Dipartimento di Ingegneria dell'Informazione, 41100 Modena, via Vignolese
Keywords
Clustering; Data reduction; Databases; Information retrieval; Intellectual property protection
DOI
10.1007/s00799-004-0079-7
Abstract
The ever-growing volumes of textual information from various sources have fostered the development of digital libraries, making digital content readily accessible but also easy for malicious users to plagiarize, thus giving rise to security problems. In this paper, we introduce a duplicate detection scheme that is able to determine, with a particularly high accuracy, the degree to which one document is similar to another. Our pairwise document comparison scheme detects the resemblance between the content of documents by considering document chunks, representing contexts of words selected from the text. The resulting duplicate detection technique presents a good level of security in the protection of intellectual property while improving the availability of the data stored in the digital library and the correctness of the search results. Finally, the paper addresses efficiency and scalability issues by introducing new data reduction techniques. © Springer-Verlag 2004.
Pages: 223-244
Page count: 21
Related papers
26 in total
  • [1] (1995)
  • [2] Arms W.Y., Digital Libraries, (2000)
  • [3] Baeza-Yates R., Ribeiro-Neto B., Modern Information Retrieval, (1999)
  • [4] Baeza-Yates R.A., Navarro G., A faster algorithm for approximate string matching, 7th Annual Symposium On Combinatorial Pattern Matching, pp. 1-23, (1996)
  • [5] Breunig M., Kriegel H., Kroger P., Sander J., Data bubbles: Quality preserving performance boosting for hierarchical clustering, Proc. ACM International Conference On Management of Data (SIGMOD'01), pp. 79-90, (2001)
  • [6] Bricklin D., Copy Protection Robs the Future, (2004)
  • [7] Brin S., Davis J., Garcia-Molina H., Copy detection mechanisms for digital documents, Proc. 1995 ACM SIGMOD International Conference On Management of Data, pp. 398-409, (1995)
  • [8] Broder A., Glassman S., Manasse M., Zweig G., Syntactic clustering of the Web, Computer Netw ISDN Syst, 29, 8-13, pp. 1157-1166, (1997)
  • [9] Chowdhury A., Frieder O., Grossman D., Collection statistics for fast duplicate document detection, ACM Trans Inf Syst, 20, 2, pp. 171-191, (2002)
  • [10] Ciaccia P., Patella M., Searching in metric spaces with user-defined and approximate distances, ACM Trans Database Syst, 27, 4, pp. 398-437, (2002)