Applying text mining methods for data loss prevention

Cited by: 13
Authors
Mashechkin, I. V. [1]
Petrovskiy, M. I. [1]
Popov, D. S. [1]
Tsarev, D. V. [1]
Affiliations
[1] Moscow MV Lomonosov State Univ, Dept Computat Math & Cybernet, Moscow 119991, Russia
Keywords
NONNEGATIVE MATRIX FACTORIZATION
DOI
10.1134/S0361768815010041
Chinese Library Classification number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Currently, the greatest risks to the information security of organizations come from internal rather than external threats. Data loss prevention (DLP) systems are used to minimize the risks related to internal threats. The main function of a DLP system is to prevent leaks of confidential data; however, DLP systems are currently compared on their ability to analyze captured information and on the convenience of carrying out retrospective investigations of information security incidents. This paper presents a new approach to the retrospective analysis of a user's text information. The idea of the approach is to perform topic analysis of the text content the user processed in the past and to predict the user's further behavior with content. User text content can cover different categories, including confidential ones. Topic analysis of user text content determines the main topics and their weights for given past time intervals. From deviations of the user's operations with content from the forecast, one can identify time intervals in which work with documents of a given category differs from normal (historical) work, and intervals in which the user worked with documents of unusual categories. The proposed approach was verified experimentally on actual corporate email correspondence drawn from the Enron data set.
Pages: 23-30
Number of pages: 8