Data Science Tutorial


Eliezer Kanal, Technical Manager, CERT. Daniel DeCapria, Data Scientist, ETC. Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA 15213. 2017 SEI Data Science in Cybersecurity Symposium. Approved for Public Release; Distribution is Unlimited.

About us. Eliezer Kanal, Technical Manager, CERT. Recent projects: ML-based malware classifier, network traffic analysis, cybersecurity questionnaire optimization. Daniel DeCapria, Data Scientist, ETC. Recent projects: cyber risk situational dashboard, Big Learning benchmarks.

Today's presentation: a tale of two roles. The call center manager: an introduction to data science capabilities. The master carpenter: an overview of the data science toolkit.

Call center manager. First day on the job: welcome! Goal: reduce costs. Task: keep calls short! Data: average call time 5.14 minutes (5:08), very long! Number of employees: 300. Average calls per day: ~28,000.

Call center manager: gather data. Get the data! Where is it? What will you use to analyze it? How accurate is it? How complete is it? Is it too big to easily read?

Data cleaning = 90% of the work. 2 weeks (10 days) = 9 days cleaning, 1 day analyzing.

Cleaning the Data: structuring the data. Goal: organize data in a table, where columns = descriptors (age, weight, height) and rows = individual, complete records. How can you get data out of these documents? Less structure vs. more structure.

Cleaning the Data. Even when you think your data should be clean, it might not be.

Cleaning the Data: Call Center Example

Name          | Mgr        | Dir      | Call Length | Phone Line | Problem solved? | Comment
Beth Jones    | Dan Thomas | Anne Kim | 1:30        | 1          | Y               | ❺
Beth Jones    | Dan Thomas | Anne Kim | 1:52        | 3          | Y               | ❶
Jones, Beth ❶ | Dan Thomas | Anne Kim | 90          | 2          | Y               |
Tom Keane     | Mark Ryan  | Tim Pike | 88          | 2          | N               | ❶
Tom Keane     | Mark Ryan  | Tim Pike | 144         | 3          | No              | ❷
Tom Keane     | Kevin Wood | Tim Pike | 200         | ❹          | Yes             |
Tom Keane     | Kevin Wood | Tim Pike | 94511 ❸     | 2          | No              | ❻
Tom Keane     | Kevin Wood | Tim Pike | 421         | 2          | Yes             |

Column types: String (Name/Mgr/Dir), Integer (Call Length, Phone Line), Nominal (Problem solved?), Unstructured (Comment)

Call center manager: exploring data.

Exploratory Data Analysis (EDA): mean, median, standard deviation, histograms!
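The EDA summary statistics above can be computed with nothing more than Python's standard library. A minimal sketch, using invented call lengths (in seconds); the variable names and values here are hypothetical, not from the actual call-center data:

```python
import statistics

# hypothetical call lengths in seconds (invented for illustration)
call_lengths = [90, 112, 130, 88, 144, 200, 421, 95]

mean = statistics.mean(call_lengths)      # 160
median = statistics.median(call_lengths)  # 121.0 -- robust to the 421 outlier
stdev = statistics.stdev(call_lengths)

# a crude text histogram: bucket the calls into 60-second bins
bins = {}
for length in call_lengths:
    bins[length // 60] = bins.get(length // 60, 0) + 1
for minute in sorted(bins):
    print(f"{minute}-{minute + 1} min: {'#' * bins[minute]}")
```

Note how the single 421-second call drags the mean well above the median, which is exactly the kind of skew a histogram makes visible at a glance.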

Distributions. The majority of data will follow SOME distribution: weight of all Americans: Gaussian; phone call length: exponential. Determining the distribution is a common data science task. Multidimensional outliers: insider threat example. Image Copyright 2001-2016 The Apache Software Foundation. See copyright slide for more details.

EDA: smart visualizations.


Brief interruption: skeptics in the audience.

Brief interruption. Data science helps you use data to get results. This is it.

Call center manager: call duration histogram. Average: 5:08.

Call center manager: insights! Strategy update: goodbye "reduce call time," hello "reduce callbacks." How to measure? "Callbacks" isn't currently captured.

Feature Engineering. Need more useful data? Create it yourself!

Feature Engineering. Feature engineering: coming up with new, useful (i.e., informative) data: means, sums, medians, etc.; derived terms such as x², x·y, √(x·y), etc. Our case: # of callbacks; call during peak time?; overall agent performance (a combination of factors).
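The callback and peak-time features described above can be derived directly from raw call records. A minimal sketch, with entirely invented records and a hypothetical 12:00–15:00 peak window:

```python
from collections import Counter

# hypothetical raw call records: (caller_id, hour_of_day, length_seconds)
calls = [("c1", 9, 140), ("c2", 13, 95), ("c1", 10, 60),
         ("c3", 13, 300), ("c1", 14, 80)]

# engineered feature 1: callbacks per caller (every call after the first)
call_counts = Counter(caller for caller, _, _ in calls)
callbacks = {caller: n - 1 for caller, n in call_counts.items()}

# engineered feature 2: did the call land in an assumed 12:00-15:00 peak window?
peak_time = [(caller, 12 <= hour < 15) for caller, hour, _ in calls]
```

Neither feature exists in the raw data; both are created from it, which is the whole point of the technique.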

The role of listening in data science. Data science finds hidden patterns in data; experts know which data and patterns are important. Talk to subject matter experts.

Call center manager: predictive analytics. Can we predict staffing levels one day ahead? One week ahead? One month ahead? Can we determine what types of calls to expect for a product we haven't had before? For a market we've never seen before?

Example predictive analytics questions.
Predicting current unknowns. Online: Which ads are malicious? Security: Is the bank transaction fraudulent? IC: Which names map to the same person (entity resolution)?
Predicting future events. Retail: What will be the new trend of merchandise that a company should stock? Security: Where will a hacker next attack our network? IC: Who will become the next insider threat?
Determining future actions. Sales: How can a company increase sales revenues? Health: What actions can be taken to prevent the spread of flu? IC: How will a vulnerability patch affect our knowledge/preparedness for future attacks?

Call center manager: predictive analytics. Many techniques are available; they are explored in the next section.

Call center manager: review. Get data → clean data → EDA/visualization → interpretation → action! → prediction.

Because we know our data, we can ask more intelligent questions: action-oriented questions, questions that can be answered.

This slide intentionally left blank.

The master carpenter: the right tool for the job.

Feature Engineering, part 2. "With the wrong wood, I can make nothing." The fuel of data science is data: data preparation is critical, and data quality matters at least as much as algorithm choice. That will come up again.

Types of machine learning algorithms.
Classification: Naïve Bayes, logistic regression, decision trees, k-nearest neighbors, support vector machines.
Regression: linear regression, support vector machines.
Clustering: k-means clustering.

Types of machine learning algorithms. Applications: everywhere. Banking, weather, sports scores, economics, environmental science, cybersecurity.

Linear Regression: prediction. Problem: if I have examples of X and Y, when I see a new X, can I predict Y?

Linear Regression: prediction. Solution: find the line that is closest to every point. Said differently: find the line for which the SUM of all (squared) errors is smallest.

Linear Regression: prediction. Three dimensions, same concept. HUNDREDS of dimensions, same concept.

Linear Regression. Very widely used: simple to implement, quick to run, easy to interpret, works for many problems. First identified in the early 1800s; very well studied. When applicable: works best with numeric data (usually); works for predicting a specific numeric outcome.
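The least-squares line described above can be fit in a few lines of NumPy. A minimal sketch with invented data generated roughly from y = 2x + 1 plus noise (the numbers are illustrative, not from the talk):

```python
import numpy as np

# toy data: y is roughly 2x + 1, with invented noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# design matrix with an intercept column; lstsq minimizes the sum of squared errors
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The same call handles hundreds of columns in `A` unchanged, which is the "HUNDREDS of dimensions, same concept" point from the previous slide.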

Logistic Regression: classification. Idea: classification using a discriminative model. Predict future behavior based on existing labeled data. Draws a line to assign labels. Mainly used for binary classification: either red or blue.

Logistic Regression: classification. Look at the distribution: what is likely, based on the current data? Points on one side of the line are probably blue; points on the other are probably red.

Logistic Regression. Three dimensions, same concept. HUNDREDS of dimensions, same concept.
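The "draw a line to assign labels" idea can be sketched with plain gradient descent on the logistic loss. A minimal sketch, assuming two invented, well-separated 2-D clusters standing in for the "red" and "blue" classes:

```python
import numpy as np

rng = np.random.default_rng(0)
# two invented 2-D clusters: class 0 around (0, 0), class 1 around (3, 3)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

w = np.zeros(2)
b = 0.0
for _ in range(500):                       # plain gradient descent on the log-loss
    p = 1 / (1 + np.exp(-(X @ w + b)))     # predicted probability of class 1
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

pred = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(int)
```

Because the model is just a weight per feature plus a bias, the identical loop works with hundreds of feature columns, matching the "same concept" claim above.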

Classification: Support Vector Machine. Idea: the optimal classifier is the one that is farthest from both classes. (Figure axes: barometric pressure vs. dew point.)

Classification: Support Vector Machine. Algorithm: find lines as before, then assign a cost to misclassified data points based on their distance from the classification line. (Figure: slack variables ξ₁, ξ₂ on the barometric pressure vs. dew point plot.)
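The "cost for points on the wrong side" idea is the hinge loss, which can be minimized by subgradient descent. A minimal linear-SVM sketch; the two weather-like clusters, their centers, and the learning-rate/regularization constants are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# invented weather points: class -1 around (740, 40), class +1 around (780, 70)
X = np.vstack([rng.normal([740, 40], 3, (20, 2)),
               rng.normal([780, 70], 3, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
X = (X - X.mean(0)) / X.std(0)        # standardize so gradient steps behave

w = np.zeros(2)
b = 0.0
lam = 0.01                            # regularization strength (assumed)
for _ in range(1000):                 # subgradient descent on the hinge loss
    margins = y * (X @ w + b)
    mask = margins < 1                # only margin-violating points contribute
    w -= 0.01 * (lam * w - (y[mask, None] * X[mask]).sum(0) / len(y))
    b -= 0.01 * (-(y[mask]).sum() / len(y))

pred = np.sign(X @ w + b)
```

The regularization term `lam * w` is what pushes the separator toward the widest margin rather than just any separating line.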

Classification: Decision Trees. Idea: instead of drawing a single complicated line through the data, draw many simpler lines. Algorithm: scan through all values of all features to find the one that helps most to determine which data gets which label ("information gain"). Divide the data based on that value, then repeat recursively on each part. (Figure axes: barometric pressure vs. dew point.)

Classification: Decision Trees. Benefits: works well when small; very easy to understand! Challenges: trees overfit easily and are very sensitive to the data (this motivates Random Forests). (Figure: tree with splits DP > 60, DP > 55, B > 760 on the barometric pressure vs. dew point plot.)
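The "scan all values of all features" step above is the core of tree building. A minimal sketch of finding one split by information gain, on an invented (dew point, pressure) dataset; a full tree would just apply this recursively to each half:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Scan every value of every feature; return the (feature, threshold, gain)
    with the highest information gain, as on the slide."""
    base = entropy(labels)
    best = (None, None, -1.0)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            gain = base - (len(left) * entropy(left)
                           + len(right) * entropy(right)) / len(labels)
            if gain > best[2]:
                best = (f, t, gain)
    return best

# invented (dew point, pressure) rows labelled rain / dry
rows = [(55, 770), (58, 765), (62, 750), (66, 748), (61, 752), (52, 772)]
labels = ["dry", "dry", "rain", "rain", "rain", "dry"]
split = best_split(rows, labels)   # splits on dew point at 58
```

On this toy data, a dew-point threshold separates the labels perfectly, so the gain equals the full entropy of the label set.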

Classification: K-Nearest Neighbors. Idea: a new point is likely to share the same label as the points around it. Algorithm: pick a constant k as the number of neighbors to look at; for each new point, vote on its label using the k neighbors' labels. (Figure axes: barometric pressure vs. dew point.)

Classification: K-Nearest Neighbors. Works well when there is a good distance metric and a weighting function to vote on classification. Challenges: not a smooth classifier, so points near each other may get classified differently; must search all your data every time you want to classify a new point; when k is small (1, 2, 3, 4), it essentially overfits to the data points.
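The pick-k-and-vote algorithm above fits in a few lines of plain Python. A minimal sketch on invented 2-D points with Euclidean distance (majority vote, no distance weighting):

```python
import math
from collections import Counter

def knn_predict(train, labels, point, k=3):
    """Vote on a new point's label using its k nearest training points."""
    nearest = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i], point))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["low", "low", "low", "high", "high", "high"]
```

Note the `sorted` over the whole training set: that full scan per query is exactly the "must search all your data every time" challenge from the slide.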

Clustering: unsupervised learning. Finds structure in unlabeled data. Organizes records into groups based on some similarity measure; a cluster is a collection of records that are similar.

Clustering: K-means. Idea: find the clusters by minimizing the distances of cluster centers to the data. Algorithm: instantiate k distinct random guesses μᵢ of the cluster centers; each data point assigns itself to the μᵢ it is closest to; each μᵢ moves to the centroid of the points assigned to it; repeat until the centroids don't move. (Figure axes: barometric pressure vs. dew point.)


Clustering: K-means. Works well when there is a good distance metric between the points and the number of clusters is known in advance. Challenges: clusters that overlap or are not separable are difficult to cluster correctly. (Figure axes: barometric pressure vs. dew point.)
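The assign-then-recompute loop described above (Lloyd's algorithm) is short in NumPy. A minimal sketch on two invented, well-separated blobs; the seeds and blob centers are arbitrary:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Lloyd's algorithm: assign points to the nearest center, move each
    center to the centroid of its points, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # k random data points
    for _ in range(iters):
        # distance of every point to every center -> nearest-center labels
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # move each center to its centroid; keep the old center if it went empty
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

# two invented blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(5, 0.3, (15, 2))])
centers, labels = kmeans(X, 2)
```

A fixed iteration count stands in for the "repeat until centroids don't move" test to keep the sketch short; real implementations stop when the centers stabilize.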

Influencers. Goal: detect the people who control or distribute information through a network.

Influencers: degree centrality. Idea: influential people have a lot of people watching them. Definition: degree centrality = the number of directed edges into the node. High-degree-centrality people are those with large numbers of followers. For an undirected graph, transform each edge into a bi-directional pair and compute the same way.

(Figure: example network with each node labeled by its degree centrality.)
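Counting incoming edges is a one-liner over an edge list. A minimal sketch on an invented "follows" network (all names and edges are made up):

```python
from collections import Counter

# invented "follows" edges: (follower, followed)
edges = [("ann", "bob"), ("carl", "bob"), ("dina", "bob"),
         ("bob", "ann"), ("dina", "ann"), ("ann", "carl")]

# degree centrality here = number of directed edges pointing at the node
in_degree = Counter(followed for _, followed in edges)
print(in_degree.most_common(1))   # bob has the most followers
```

For an undirected graph, append the reversed copy of every edge first, exactly as the slide suggests.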

Influencers: betweenness centrality. Idea: influential people are information brokers who connect different groups of people. Algorithm: find all shortest paths between all pairs of nodes in the graph; the betweenness centrality of a node is the fraction of those shortest paths that pass through it, summed over all start/end pairs.

(Figure: example network with each node labeled by its betweenness centrality.)
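The all-pairs shortest-path accumulation described above is usually computed with Brandes' algorithm rather than by enumerating paths. A minimal sketch for unweighted, undirected graphs, tried on an invented four-node path graph:

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm: for each source, BFS counts shortest paths
    (sigma), then dependencies are accumulated back down the BFS order."""
    cb = {v: 0.0 for v in graph}
    for s in graph:
        stack, preds = [], {v: [] for v in graph}
        sigma = {v: 0 for v in graph}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in graph}; dist[s] = 0
        queue = deque([s])
        while queue:                                  # BFS from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                                  # back-propagate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return {v: c / 2 for v, c in cb.items()}          # undirected: pairs counted twice

# invented path graph a-b-c-d: the interior nodes broker all traffic
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
bc = betweenness(graph)
```

On the path graph, the endpoints sit on no one else's shortest paths while each interior node lies on two pairs' paths, which the scores reflect.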

Indicator communities. But what if we aren't starting with a reference indicator? We assume that indicators generated by a coherent real-world process are more likely to co-occur in tickets than arbitrary pairs of indicators. Find groups of highly similar indicators in the complete indicator-ticket graph.

Indicator-ticket graph. A subset of the ticket-indicator graph (for a small set of selected indicators). Tickets are grey triangles; indicators are black circles; edges connect tickets to the indicators they contain.

Machine Learning Is Growing. Preferred approach for many problems: speech recognition, natural language processing, medical diagnosis, robot control, sensor networks, computer vision, weather prediction, social network analysis, AlphaGo, Watson on Jeopardy!

This slide also intentionally left blank, just like the earlier one.

What we did today. (Recap: the cleaned call-center table and the call-duration histogram, with its 5:08 average.)

What we did today.

Data science helps you use data to get results.

Eliezer Kanal, Technical Manager, (412) 268-5204, ekanal@sei.cmu.edu. Daniel DeCapria, Data Scientist, (412) 268-2457, djdecapria@sei.cmu.edu.