Big Data Security Internal Threat Detection: The Critical Role of Machine Learning
Objectives
1. Discuss internal user risk management challenges in the Big Data environment
2. Discuss why machine learning is critical in managing internal threats
3. Share machine learning use cases in the Big Data environment
Interconnected World
The technology landscape is changing at a faster pace: cloud, big data, subscription
The operating environment is also changing: vendors, partners, outsourcing, third-party processing, API-driven
The ability to protect data and harness its value through analytics will separate the leaders from the laggards in the interconnected world
Data Breaches in the News
2015 data breach statistics: 780 reported incidents, 177 million records breached
Why Is Internal Threat Management Important?
Emerging Trends in Data Security
1. Automated, autonomous
2. Contextual: IP protection versus PHI and PII information
3. Performance: complexity (data, events, and interactions), volume and velocity, real time
Motivation for Big Data Security Internal Threat Management
1. Hadoop, NoSQL, and IoT systems were not designed with security in mind
2. Data security is often overlooked in big data ecosystems because of their complexity
3. Seamless security monitoring across regular data and big data ecosystems is cumbersome
Key Security Questions:
1. Who is accessing the data?
2. What data are they accessing?
3. Is someone trying to access data that they don't have access to?
4. Are there any anomalous access patterns?
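As a rough illustration of how the first three key security questions could be answered from an access log, here is a minimal sketch. The record layout `(user, resource, allowed)`, the user names, and the resource names are all hypothetical; real Hadoop or NoSQL audit logs carry different fields.

```python
from collections import Counter, defaultdict

# Hypothetical audit records: (user, resource, allowed).
AUDIT_LOG = [
    ("amy", "sales.orders", True),
    ("amy", "sales.orders", True),
    ("joe", "hr.salaries", False),
    ("joe", "hr.salaries", False),
    ("joe", "sales.orders", True),
    ("mark", "hr.salaries", True),
]

def summarize(log):
    """Return (accesses, denied): who read what, and who was refused."""
    accesses = defaultdict(Counter)  # user -> resource -> number of allowed reads
    denied = Counter()               # user -> number of denied attempts
    for user, resource, allowed in log:
        if allowed:
            accesses[user][resource] += 1
        else:
            denied[user] += 1
    return accesses, denied

accesses, denied = summarize(AUDIT_LOG)
for user, count in denied.items():
    print(f"{user}: {count} denied attempt(s)")
```

The fourth question, anomalous access patterns, needs a baseline of normal behavior and is where the machine learning methods on the later slides come in.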
Threat Scenarios
[Diagram: internal users 60%, external actors 40%, collusion X?; data in motion, transactional data, data lake]
Data Protection by Design and by Default
Monitoring and Analysis Methods
[Chart labels: effort/cost; actionable; manual monitoring; statistical/rule-based expert system; searching/visualization (Elastic Search); machine learning]
Why Machine Learning?
Traditional usage monitoring: file, rule based, high volume, high processing time, delay / batch mode, external threat focus, high infrastructure cost
Use of Machine Learning
Identify outliers
Understand trends
Understand patterns
Classify/group similar transactions
Understand relations
Types of Machine Learning
Comparison
Temporal: count and amount reasonability, moving average (Shewhart control chart)
Spatial: outlier based on standard deviation, Benford's Law
Classification: clustering, decision tree, neural network
Correlation: regression, network analysis
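The control-chart and standard-deviation outlier ideas listed above can be sketched in a few lines. This is a minimal example under stated assumptions: the baseline and daily counts are made-up numbers, and the three-sigma limit `k=3.0` is a conventional default rather than anything prescribed by the slides.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Lower and upper control limits: baseline mean +/- k standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - k * spread, center + k * spread

def out_of_control(values, baseline, k=3.0):
    """Indices of values that fall outside the control limits."""
    lower, upper = control_limits(baseline, k)
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical daily record counts read by one user.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
today = [101, 99, 180, 100]

flagged = out_of_control(today, baseline)
print(flagged)
```

A moving-average variant would recompute `baseline` from the most recent observations instead of a fixed history.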
Machine Learning in Internal Threat Management
Data Discovery: classification of content (e.g. email spam) and PII data; SVM, NN, random forest
Data Minimization: encryption and minimization; k-anonymity, l-diversity, t-closeness
Usage Monitoring: anomalous user behavior; NN, eigenvector decomposition, TDA
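To make the k-anonymity property named above concrete, the following sketch measures k for a small table: the size of the smallest group of rows that share the same quasi-identifier values. The rows, column names, and generalized values are invented for illustration.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical, already generalized records.
rows = [
    {"zip": "981**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "981**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "981**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "981**", "age": "40-49", "diagnosis": "diabetes"},
]

k = k_anonymity(rows, ["zip", "age"])
print(k)
```

A table is k-anonymous when this value is at least the chosen k. l-diversity and t-closeness add further conditions on the sensitive column within each group and are not checked here.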
Content Classification: Use Cases
Good corporate citizen - Amy: Accidentally shares sensitive information via email, but is unable to stop the data leak.
Needs to share data - Joe: Routinely shares information with vendors and clients via email. The emails contain sensitive information in attachments, but he is unaware of its sensitivity or the consequences.
Malicious user - Mark: Shares sensitive information with vendors and clients with malicious intent. Routinely accesses sensitive information and disguises data for sharing.
Content Classification
Image recognition using machine learning: PII-related images, inappropriate content
Document recognition using machine learning: patent documents, legal documents, spam email
Content Classification
Ref: https://goo.gl/images/cych3j
Content Classification
Context-based intelligence
Content Minimization Use Cases
[Diagram: production data sources, test data sources, files, information exchange]
Content Minimization
Privacy with loss of information
Database
testid        dl
123-456-7899  UK-7897
123-456-7900  UK-7898
123-456-7901  UK-7899
Format and Context Preserving Encryption
Database
testid        dl
791-456-3456  UK-7896
833-456-4567  UK-3345
999-451-7901  UK-3456
Privacy without loss of information
Ref: Unharnessing collective intelligence: A business model for privacy on Mobile devices based on k-anonymity, 2008, Ajit
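The format-preserving idea on this slide can be illustrated with a keyed masking function that keeps length, separators, and letters while replacing digits. This is only a sketch of the shape of the transformation: it is a one-way pseudonymization built on HMAC, not a reversible format-preserving encryption scheme, and the function name and key are hypothetical.

```python
import hashlib
import hmac

def mask_preserving_format(value, key):
    """Replace each digit with a keyed pseudo-random digit; keep all other characters."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    i = 0
    for ch in value:
        if ch.isdigit():
            # Map one digest byte to a digit (slightly biased; fine for a demo).
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)

KEY = b"demo-key-not-for-production"  # hypothetical key
masked_id = mask_preserving_format("123-456-7899", KEY)
masked_dl = mask_preserving_format("UK-7897", KEY)
print(masked_id, masked_dl)
```

Because the output is deterministic for a given key, the same source value always maps to the same masked value, which keeps joins between masked test tables consistent.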
Usage Analytics Use Cases
[Diagram: file, activity data, sliding window, unsupervised training, supervised training, alerts]
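The sliding-window step in this pipeline can be sketched as a counter over recent activity that raises an alert when a user is unusually busy. The timestamps, window length, and fixed threshold below are assumptions for illustration; in the pipeline on the slide the threshold would come from training rather than being hard-coded.

```python
from collections import deque

def sliding_window_alerts(timestamps, window_seconds, max_events):
    """Return timestamps at which more than max_events fall inside the window.

    Assumes timestamps are sorted in ascending order.
    """
    window = deque()
    alerts = []
    for ts in timestamps:
        window.append(ts)
        # Drop events that have aged out of the window.
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        if len(window) > max_events:
            alerts.append(ts)
    return alerts

# Hypothetical access times (seconds) for one user.
events = [0, 10, 20, 200, 201, 202, 203, 400]
alerts = sliding_window_alerts(events, window_seconds=60, max_events=3)
print(alerts)
```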
User Anomaly Detection: Eigenvalue Decomposition Method
Compute the mean and variance
Compute the eigenvectors and determine the principal components
Normal data points lie near the first few principal components
Abnormal data points lie farther from the first few principal components and closer to the later components
Reference: Apache Eagle Reference Guide
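The steps above can be sketched with a small, dependency-free example. It keeps only the first principal component (found by power iteration) and scores each point by its distance from that component. The two-feature data set is invented, and this is not the Apache Eagle implementation, only an illustration of the same idea.

```python
import math

def column_means(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def covariance(rows, means):
    """Sample covariance matrix of the rows."""
    n, dims = len(rows), len(means)
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def top_eigenvector(matrix, iterations=200):
    """First principal direction via power iteration."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

def residual_distance(point, means, component):
    """Distance from the point to the line spanned by the first component."""
    centered = [p - m for p, m in zip(point, means)]
    proj = sum(c * e for c, e in zip(centered, component))
    residual = [c - proj * e for c, e in zip(centered, component)]
    return math.sqrt(sum(x * x for x in residual))

# Hypothetical normal behavior: two activity features that rise together.
normal = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
means = column_means(normal)
pc1 = top_eigenvector(covariance(normal, means))

typical = residual_distance((3.0, 3.0), means, pc1)
unusual = residual_distance((3.0, 0.5), means, pc1)
print(typical, unusual)
```

A point is flagged when its residual distance exceeds a threshold chosen from the normal data; with more features, several leading components would be retained instead of one.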
Graph-Based Usage Anomaly Detection
http://ailab.wsu.edu/adgs/plads.png
Operationalizing Machine Learning
Autonomous, automated, script, ad hoc
360-Degree View of Data, User, and Usage
Summary
1. Protecting sensitive data from accidental or intentional leakage in a big data ecosystem is challenging because of volume, velocity, and complexity.
2. Rule-based or statistics-based systems are costly to set up and cannot keep up with fast-changing data, especially in a big data environment.
3. Appropriately constructed machine-learning-based threat management schemes can help organizations identify sensitive data, encrypt data elements, and monitor user behavior to detect intentional or unintentional errors.