Computational Statistics and Mathematics for Cyber Security

Computational Statistics and Mathematics for Cyber Security. David J. Marchette. Acknowledgment: this work was funded in part by the NSWC In-House Laboratory Independent Research (ILIR) program.


Take-Away Points Mathematics and statistics provide many tools for cyber security. Simple can be powerful. Complicated models or algorithms are not always necessary. Sometimes they are. Complicated things become simple with familiarity. High dimensional data is complicated, messy, and can fool you. Know your data! If your results appear too good to be true, triple check them!

Two Cultures There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.* There are many aspects of this dichotomy: modeling vs. algorithms. Parametric vs. non-parametric. Statistics vs. machine learning. Inference vs. prediction.** Small data vs. big data. Traditional statistics vs. computational statistics. *Leo Breiman, Statistical Science 2001, Vol. 16, No. 3. **Donoho, D. (2015, September). 50 Years of Data Science. Tukey Centennial Workshop, Princeton, NJ. http://www.economicsguy.com/wp-content/uploads/0/0/0yearsdatascience.pdf

The Illusion of Progress "... [comparative studies] often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion."* Simple models often produce essentially the same accuracy as more complicated models. They can be easier to understand and fit, and may have fewer parameters to choose, possibly resulting in lower variance. The data you get is rarely (if ever) a true random draw from the distribution you will be running your trained/implemented algorithm on. This is particularly important in cyber security. By its nature, cyber security data is non-stationary, and today's data may look very different from tomorrow's. *David Hand, Statistical Science 2006, Vol. 21, No. 1.

The Illusion of Progress When building a model, one makes assumptions, which are often not testable, and which can impact the ultimate performance. Simpler models (may) have fewer assumptions. Non-parametric methods (may) be superior to parametric ones in that they (tend to) make fewer assumptions. However, if the assumptions are true, parametric may be superior. A good non-parametric algorithm would be nearly as good as the parametric one, while allowing a hedge on the assumptions. Hand suggests we spend less time developing the next great classifier and more time on methods that mitigate the above issues.

Outline Probability density estimation: kernel estimators; streaming data. Machine learning: nearest neighbors; random forests. Manifold learning: graphs; spectral embedding. We'll see how much of this we can cover today; for the rest, see the paper.


The Histogram [Figure: histogram density estimates; vertical axis: Density.]

The Histogram and the Kernel Estimator [Figure: histogram and kernel density estimates compared; vertical axis: Density.]

The Kernel Estimator $\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} \phi\!\left(\frac{x - x_i}{h}\right)$. Easily extended to multivariate versions. Note that this is an average.
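A minimal sketch of this estimator in Python (numpy assumed; the sample data, grid, and bandwidth h below are illustrative placeholders):

    import numpy as np

    def kde(x_grid, data, h):
        """Gaussian kernel density estimate on a grid:
        f_hat(x) = (1 / (n h)) * sum_i phi((x - x_i) / h)."""
        # Scaled differences: one row per grid point, one column per datum.
        u = (x_grid[:, None] - data[None, :]) / h
        phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
        return phi.mean(axis=1) / h

    # Example: estimate a density on a grid from simulated data.
    data = np.random.normal(size=1000)
    x_grid = np.linspace(-4, 4, 200)
    f_hat = kde(x_grid, data, h=0.3)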

Network Flows http://csr.lanl.gov/data/cyber/

Network Flows [figure]

Streaming Data Averages can be computed in a streaming fashion: $\bar X_n = \frac{n-1}{n}\,\bar X_{n-1} + \frac{1}{n}\,X_n$. We can implement an exponential window: $\bar X_n = \frac{N-1}{N}\,\bar X_{n-1} + \frac{1}{N}\,X_n = \theta\,\bar X_{n-1} + (1-\theta)\,X_n$, and apply this idea to the kernel estimator: $\hat f_n(x) = \theta\,\hat f_{n-1}(x) + (1-\theta)\,K_h(x - X_n)$, with $K_h$ as above. $\theta$ controls how much of the past we remember. Note that we have to set a grid of $x$ points at which we want to compute $\hat f$.
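A sketch of the exponential-window update on a fixed grid (the grid, bandwidth h, and memory parameter theta are placeholders to be chosen for the application; numpy assumed):

    import numpy as np

    class StreamingKDE:
        """Exponentially weighted kernel density estimate on a fixed grid:
        f_n(x) = theta * f_{n-1}(x) + (1 - theta) * K_h(x - X_n)."""

        def __init__(self, x_grid, h, theta):
            self.x_grid = x_grid
            self.h = h
            self.theta = theta
            self.f = np.zeros_like(x_grid)

        def update(self, x_new):
            # Kernel contribution of the new observation, evaluated on the grid.
            u = (self.x_grid - x_new) / self.h
            k = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * self.h)
            self.f = self.theta * self.f + (1 - self.theta) * k
            return self.f

    # Example: stream simulated log(#bytes) values one at a time.
    est = StreamingKDE(x_grid=np.linspace(0, 20, 400), h=0.5, theta=0.99)
    for logbytes in np.random.normal(loc=8, scale=2, size=5000):
        est.update(logbytes)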

Streaming Network Flows: log(#bytes) in a Flow [figure]


Machine Learning: Classification Given $\{(x_i, y_i)\}_{i=1,\dots,n} \subset \mathcal{X} \times \mathcal{Y}$, with $x_i$ corresponding to observations (flows, programs, email, system calls, log files: "features"), and $y_i$ corresponding to class labels (e.g. "malware", "benign"). A classifier is a mapping $g : \mathcal{X} \to \mathcal{Y}$. Machine learning (pattern recognition, classification) is designing a function $g$ from training data $\{(x_i, y_i)\}_{i=1,\dots,n}$ for which the truth is known. We are given training data $\{(x_i, y_i)\}_{i=1,\dots,n} \subset \mathcal{X} \times \mathcal{Y}$, and will be presented with a new $x \in \mathcal{X}$ for which the label is unknown. We wish to infer the $y$ associated with $x$.

Nearest Neighbors We are given training data $\{(x_i, y_i)\}_{i=1,\dots,n} \subset \mathcal{X} \times \mathcal{Y}$, and a new $x \in \mathcal{X}$ for which the label is unknown. Find the closest $x_i$ to $x$: $\hat y = y_{\arg\min_i d(x, x_i)}$. We must select an appropriate distance (dissimilarity) $d$. Alternative: we can compute the $k$ closest and vote: take the majority class.
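A minimal k-nearest-neighbor sketch in Python (Euclidean distance used for illustration; the toy data are placeholders, and any dissimilarity could be substituted):

    import numpy as np
    from collections import Counter

    def knn_predict(x, train_X, train_y, k=1):
        """Classify x by majority vote among its k nearest training points."""
        d = np.linalg.norm(train_X - x, axis=1)       # distance to every training point
        nearest = np.argsort(d)[:k]                   # indices of the k closest
        votes = Counter(train_y[i] for i in nearest)  # vote the labels
        return votes.most_common(1)[0][0]

    # Example with toy two-class data.
    train_X = np.vstack([np.random.normal(0, 1, (50, 2)),
                         np.random.normal(3, 1, (50, 2))])
    train_y = np.array(["benign"] * 50 + ["malware"] * 50)
    print(knn_predict(np.array([2.5, 2.5]), train_X, train_y, k=5))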

Kaggle Malware Examples of malware grouped into malware families.* Each file has been byte-dumped and tabulated: we are using the frequency with which each byte value 0, ..., 255 occurs in the file. This seems really dumb (computer scientists laugh when I tell this story). We'll look at the nearest neighbor classifier on these data. A fixed number of observations from each family are used for training (fewer from the family that contains only a small number of observations); we test on the remaining. Remember: sometimes simple is good. *https://www.kaggle.com/c/malware-classification
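A sketch of the byte-count feature extraction described above (the file name is hypothetical; numpy assumed):

    import numpy as np

    def byte_histogram(path, normalize=True):
        """Feature vector: how often each byte value 0..255 occurs in the file."""
        data = np.fromfile(path, dtype=np.uint8)
        counts = np.bincount(data, minlength=256).astype(float)
        if normalize and counts.sum() > 0:
            counts /= counts.sum()   # relative frequencies, so file size does not dominate
        return counts

    # Example (hypothetical file name):
    # features = byte_histogram("sample.exe")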

Kaggle Malware: NN Performance [Confusion matrix by true class omitted.] The error rate is small; that is, the vast majority of the observations are correctly classified.

Kaggle Malware: NN Performance Why??? Text analogy: the byte-count histogram is analogous to the word-count histogram used in text analysis. Maybe this is more like a morpheme-count histogram. Intuitively, a family shares a core of code (they are modifications of the mother malware). The bytes correspond to machine instructions, or at least they would if we were counting words instead of bytes.

Kaggle Malware: Smoothed NN Performance Using the kernel estimator instead of the histogram, one obtains a similarly small error. This is another place for computer scientists to laugh: bytes are not continuous, machine instruction codes are discrete... and yet it works. Remember Hand's paper. Here is the point at which we need to better understand our data. Unfortunately, we won't be doing this today.

Random Forests We are given training data $\{(x_i, y_i)\}_{i=1,\dots,n} \subset \mathcal{X} \times \mathcal{Y}$, and a new $x \in \mathcal{X}$ for which the label is unknown. The random forest is an ensemble of decision trees: sample (with replacement) from the training data; sample a subset of the variables; build a decision tree using the two samples, without any optimization or pruning; repeat. With a new observation, vote the trees.
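A minimal sketch using scikit-learn (the random X and y below are placeholders standing in for byte-histogram features and class labels):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder data: rows are byte-histogram feature vectors, y the class labels.
    X = np.random.rand(500, 256)
    y = np.random.choice(["benign", "malicious"], size=500)

    # Each tree is grown on a bootstrap sample of the rows and considers a random
    # subset of the features at each split; predictions are made by voting the trees.
    forest = RandomForestClassifier(n_estimators=500, max_features="sqrt")
    forest.fit(X, y)
    print(forest.predict(X[:5]))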

Benign vs. Malicious A collection of observations of Windows binaries, part benign and part malicious. Random forest performance: a very low overall error, with small fractions of both the benign and the malicious files misclassified. The nearest neighbor classifier is a little worse, with a somewhat higher overall error.

Know Your Data The results demonstrate that there is something going on with this byte-count approach. Logically, the performance seems too good to be true, and yet it does seem to work. The data are high dimensional (256 byte-count features), so maybe there is a curse-of-dimensionality thing going on here. Perhaps we are finding OS-specific things: the data collected for the benign files may come from a different version of the operating system than the malicious files. We don't have version information about the data (beyond "these are Windows files"). Worrisome fact: there are several different sets of benign (or malicious) data. A classifier can be built to tell which of the benign collections a file belongs to.

Know Your Data? Maaten & Hinton (00). Visualizing data using t-sne. Journal of Research,, -0. NSWCDD-PN--00


Manifold Learning Hypothesis: high-dimensional data lives on a lower-dimensional structure. Manifold learning is a set of techniques to infer this structure, or to embed the data from the high-dimensional space into a lower-dimensional space that respects the local structure.

Multidimensional Scaling Problem: given a distance matrix (or dissimilarity matrix) $D$, find a set of points $X \subset \mathbb{R}^d$ whose interpoint distances $d(X)$ best approximate $D$. This is the problem solved by multidimensional scaling (MDS). Different definitions of "best approximates" lead to different algorithms. Classical MDS utilizes the eigenvector decomposition of (a modified version of) the distance matrix. Some manifold learning algorithms compute a local distance and use MDS; others compute eigenvectors of related matrices. These are the algorithms I use most often.
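A minimal classical-MDS sketch in Python (numpy assumed; the example distance matrix is built from simulated points):

    import numpy as np

    def classical_mds(D, d=2):
        """Classical MDS: embed a distance matrix D into R^d via the eigen-
        decomposition of the double-centered squared-distance matrix."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
        vals, vecs = np.linalg.eigh(B)
        order = np.argsort(vals)[::-1][:d]           # largest eigenvalues first
        lam = np.clip(vals[order], 0, None)
        return vecs[:, order] * np.sqrt(lam)

    # Example: recover a 2-d configuration from pairwise distances.
    pts = np.random.rand(30, 2)
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    Z = classical_mds(D, d=2)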

Basic Graph Theory A graph is a set $V$ of vertices and a set $E$ of pairs of vertices (edges). The edges can be directed or undirected, and can have weights; in this talk they will be undirected. The (graph) distance between two vertices is the length of the shortest path between them in the graph. The adjacency matrix of a graph on $n$ vertices is the $n \times n$ binary matrix with a 1 in those positions corresponding to the edges of the graph. The spectrum of a graph is the eigendecomposition of the adjacency matrix $A$, or more generally, of some function $f(A)$.

Graph Examples [Figures: an ɛ-ball graph and a k-nearest neighbor graph.]

Basic Steps of Spectral Embedding Given data $\{x_1, \dots, x_n\} \subset \mathbb{R}^p$: construct a graph whose vertices are the $x_i$, with edges between near points (a k-nearest neighbor graph, an ɛ-ball graph, or variations). Compute the eigenvectors of the adjacency matrix, the Laplacian of the adjacency matrix, or scaled/modified versions of the above. Set $Z$ to the matrix with columns corresponding to the main eigenvectors; that is, the rows $\{z_1, \dots, z_n\}$ are the embedded data. Perform inference on $Z$.
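A sketch of these steps using an ɛ-ball graph and the normalized Laplacian (the data and ɛ below are placeholders; numpy assumed):

    import numpy as np

    def spectral_embedding(X, eps, d=2):
        """Embed points via eigenvectors of the normalized Laplacian of an
        epsilon-ball graph built on the data."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        A = ((D < eps) & (D > 0)).astype(float)       # epsilon-ball adjacency matrix
        deg = A.sum(axis=1)
        deg[deg == 0] = 1.0                           # guard against isolated points
        Dinv = np.diag(1.0 / np.sqrt(deg))
        L = np.eye(len(X)) - Dinv @ A @ Dinv          # normalized Laplacian
        vals, vecs = np.linalg.eigh(L)
        return vecs[:, 1:d + 1]                       # skip the trivial eigenvector

    # Example with placeholder data (byte histograms would be used in practice).
    Z = spectral_embedding(np.random.rand(100, 5), eps=0.6)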

Compute an ɛ-ball graph on the Kaggle training data. Lay out the graph. Embed using the scaled Laplacian. Embed using the adjacency matrix. Embed using MDS on the graph distance. [Four figure slides: the graph layout and the three embeddings.]

Discussion Different embedding methods extract different information about the data. These two-dimensional plots are misleading in that there is no reason to assume the intrinsic dimensionality is 2. Some care must be taken to ensure that the embedding method can be applied to new data.

Joint Embedding Jointly embed $D_1$ and $D_2$ using the block matrix $\begin{pmatrix} D_1 & W \\ W & D_2 \end{pmatrix}$, where $W = \lambda D_1 + (1 - \lambda) D_2$.
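A small sketch of the block-matrix construction as reconstructed above (D1, D2, and lam are placeholders; the embedding step reuses the classical_mds sketch given earlier, which is an assumption about how the joint matrix would be embedded):

    import numpy as np

    def joint_embedding_matrix(D1, D2, lam=0.5):
        """Block matrix M = [[D1, W], [W, D2]] with W = lam*D1 + (1-lam)*D2,
        which can then be embedded with (classical) MDS."""
        W = lam * D1 + (1 - lam) * D2
        return np.block([[D1, W], [W, D2]])

    # Example: jointly embed two dissimilarity matrices on the same n objects.
    # M = joint_embedding_matrix(D1, D2, lam=0.5); Z = classical_mds(M, d=2)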


Topological Data Analysis (TDA) The basic idea is to use topological features (measures that are invariant to smooth deformations) to learn about the structure of the data. We will only be able to touch briefly on this subject. See: Carlsson, Topology and Data, Bulletin of the American Mathematical Society, 46, 2009, 255-308. Ghrist, Elementary Applied Topology, Createspace Independent Publishing Platform, 2014.

Simplices A (geometric) simplex of dimension $d$ is a set of $d + 1$ points in general position. A 0-simplex is a point, a 1-simplex a line segment, a 2-simplex a triangle, and so on.

Simplicial Complexes A simplicial complex is a collection $S$ of simplices that satisfies the following conditions: if $\sigma \in S$, then so are the faces of $\sigma$; if $\sigma_1, \sigma_2 \in S$ are simplices, then either they are disjoint or they intersect in a lower-dimensional simplex which is a face of both.

Persistent Homology We construct an ɛ-ball graph on the data, and from this we get a simplicial complex. We compute a measure of the topology (the rank of the homology groups, i.e. the Betti numbers): how many d-dimensional holes are there? Those structures that persist across ranges of ɛ are interesting, and more likely to be real structure rather than noise.
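A minimal illustration of the persistence idea for the simplest invariant, the number of connected components (Betti_0), tracked across ɛ (the simulated clusters and ɛ values are placeholders; numpy assumed):

    import numpy as np

    def betti0_curve(X, eps_values):
        """Number of connected components (Betti_0) of the epsilon-ball graph
        for each epsilon; components that persist over many scales are more
        likely to be real structure than noise."""
        n = len(X)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        counts = []
        for eps in eps_values:
            parent = list(range(n))            # union-find over the vertices

            def find(i):
                while parent[i] != i:
                    parent[i] = parent[parent[i]]
                    i = parent[i]
                return i

            for i in range(n):
                for j in range(i + 1, n):
                    if D[i, j] < eps:
                        parent[find(i)] = find(j)
            counts.append(len({find(i) for i in range(n)}))
        return counts

    # Example: two well-separated clusters give Betti_0 = 2 over a range of eps.
    X = np.vstack([np.random.normal(0, 0.1, (30, 2)), np.random.normal(3, 0.1, (30, 2))])
    print(betti0_curve(X, eps_values=[0.2, 0.5, 1.0, 4.0]))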

Euler Characteristic One defines the Euler characteristic as: $\chi(X) = \sum_{j=0}^{n} (-1)^j\,\mathrm{Betti}_j(X)$. This is equivalent to the standard Euler characteristic one learns in grade school, extended to general topological spaces and higher dimensions. The persistent version is to compute this on the persistent homologies from the ɛ-ball graphs.
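A sketch of an Euler characteristic curve across ɛ, computed from simplex counts of the ɛ-ball (Vietoris-Rips) complex built up to dimension 2, which by Euler-Poincaré equals the alternating sum of the Betti numbers of that complex (the noisy-circle data and ɛ values are placeholders; numpy assumed):

    import numpy as np
    from itertools import combinations

    def euler_characteristic_curve(X, eps_values):
        """Euler characteristic of the epsilon-ball complex (up to dimension 2)
        for each epsilon: chi = #vertices - #edges + #triangles."""
        n = len(X)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        chis = []
        for eps in eps_values:
            A = D < eps
            edges = [(i, j) for i, j in combinations(range(n), 2) if A[i, j]]
            triangles = sum(1 for i, j, k in combinations(range(n), 3)
                            if A[i, j] and A[j, k] and A[i, k])
            chis.append(n - len(edges) + triangles)
        return chis

    # Example: a noisy circle (one component, one loop) has chi near 0 at moderate eps.
    theta = np.random.uniform(0, 2 * np.pi, 60)
    X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.randn(60, 2)
    print(euler_characteristic_curve(X, eps_values=[0.2, 0.4, 0.8]))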

Persistent Euler Characteristics of Malware [figure]

Discussion Mathematics has many tools for the data analyst, in particular for the analysis of cyber data. These tools include: computational statistics, machine learning, graph theory, manifold learning, and topological data analysis. New applications of pure mathematics to data analysis are developed every day, and these areas are all huge growth areas for applied mathematicians.