Achieving Anonymity via Clustering

被引:85
作者
Aggarwal, Gagan [1 ]
Feder, Tomas [2 ]
Kenthapadi, Krishnaram [3 ]
Khuller, Samir [4 ]
Panigrahy, Rina [3 ]
Thomas, Dilys [5 ]
Zhu, An [1 ]
机构
[1] Google Inc, Mountain View, CA USA
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[3] Microsoft Res, Mountain View, CA USA
[4] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[5] Oracle, Redwood Shores, CA USA
基金
美国国家科学基金会;
关键词
Algorithms; Privacy; anonymity; clustering; approximation algorithms;
D O I
10.1145/1798596.1798602
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Publishing data for analysis from a table containing personal records, while maintaining individual privacy, is a problem of increasing importance today. The traditional approach of deidentifying records is to remove identifying fields such as social security number, name, etc. However, recent research has shown that a large fraction of the U. S. population can be identified using nonkey attributes (called quasi-identifiers) such as date of birth, gender, and zip code. The k-anonymity model protects privacy via requiring that nonkey attributes that leak information are suppressed or generalized so that, for every record in the modified table, there are at least k - 1 other records having exactly the same values for quasi-identifiers. We propose a new method for anonymizing data records, where quasi-identifiers of data records are first clustered and then cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a prespecified number of data records. This technique is more general since we have a much larger choice for cluster centers than k-anonymity. In many cases, it lets us release a lot more information without compromising privacy. We also provide constant factor approximation algorithms to come up with such a clustering. This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter k. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an epsilon fraction of points to remain unclustered, that is, deleted from the anonymized publication. Thus, by not releasing a small fraction of the database records, we can ensure that the data published for analysis has less distortion and hence is more useful. Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.
引用
收藏
页数:19
相关论文
共 17 条
[1]  
AGGARWAL G, 2005, J PRIVACY TECHNOL
[2]  
[Anonymous], 1990, COMPUT INTRACTABILIT
[3]   HOW TO ALLOCATE NETWORK CENTERS [J].
BARILAN, J ;
KORTSARZ, G ;
PELEG, D .
JOURNAL OF ALGORITHMS, 1993, 15 (03) :385-415
[4]  
Bayardo RJ, 2005, PROC INT CONF DATA, P217
[5]  
Charikar M, 2001, SIAM PROC S, P642
[6]  
Chawla S, 2005, LECT NOTES COMPUT SC, V3378, P363
[7]   Hierarchical placement and network design problems [J].
Guha, S ;
Meyerson, A ;
Munagala, K .
41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, :603-612
[8]   A BEST POSSIBLE HEURISTIC FOR THE K-CENTER PROBLEM [J].
HOCHBAUM, DS ;
SHMOYS, DB .
MATHEMATICS OF OPERATIONS RESEARCH, 1985, 10 (02) :180-184
[9]  
Jain K., 1999, Proc. 40th Annu. IEEE Sympos. Found. Comput. Sci, P2
[10]  
Karger DR, 2000, ANN IEEE SYMP FOUND, P613, DOI 10.1109/SFCS.2000.892329