Statistics and Mathematics for Cyber Security
David J. Marchette
September
Acknowledgment: This work was funded in part by the NSWC In-House Laboratory Independent Research (ILIR) program.
NSWCDD-PN--00
Topics
Take-Away Points
Mathematics and statistics provide many tools for cyber security. Simple can be powerful: complicated models or algorithms are not always necessary. Sometimes they are. Complicated things become simple with familiarity. High dimensional data is complicated, messy, and can fool you. Know your data! If your results appear too good to be true, triple check them!
Two Cultures
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.*
There are many aspects of this dichotomy: modeling vs. algorithms; parametric vs. non-parametric; statistics vs. machine learning; inference vs. prediction;** small data vs. big data; traditional statistics vs. computational statistics.
*Leo Breiman, "Statistical Modeling: The Two Cultures," Statistical Science, 2001, Vol. 16, No. 3.
**Donoho, D. (2015, September). 50 Years of Data Science. Tukey Centennial Workshop, Princeton NJ. http://www.economicsguy.com/wp-content/uploads/0/0/0yearsdatascience.pdf
The Illusion of Progress
"... [comparative studies] often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion."*
Simple models often produce essentially the same accuracy as more complicated models. They can be easier to understand and fit, and may have fewer parameters to choose, possibly resulting in lower variance. The data you get are rarely (if ever) a true random draw from the distribution you will be running your trained/implemented algorithm on. This is particularly important in cyber security: by its nature, cyber security data is non-stationary, and today's data may look very different from tomorrow's.
*David Hand, "Classifier Technology and the Illusion of Progress," Statistical Science, 2006, Vol. 21, No. 1.
The Illusion of Progress
When building a model, one makes assumptions, which are often not testable, and which can impact the ultimate performance. Simpler models (may) have fewer assumptions. Non-parametric methods (may) be superior to parametric ones in that they (tend to) make fewer assumptions; however, if the assumptions are true, parametric may be superior. A good non-parametric algorithm would be nearly as good as the parametric one, while allowing a hedge on the assumptions. Hand suggests we spend less time developing the next great classifier and more time on methods that mitigate the above issues.
Outline
Probability density estimation: kernel estimators; streaming data. Machine learning: nearest neighbors; random forests. Manifold learning: graphs; spectral embedding. We'll see how much of this we can cover today; for the rest, see the paper.
Topics
The Histogram
[Figure: histograms of the data at several bin widths; y-axis: Density.]
The Histogram / The Kernel Estimator
[Figure: histograms with kernel density estimates; y-axis: Density.]
The Kernel Estimator
$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} \phi\!\left(\frac{x - x_i}{h}\right).$
Easily extended to multivariate versions. Note that this is an average.
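The estimator above is just an average of kernel bumps centered at the data. A minimal Python sketch (mine, not from the talk), using the Gaussian kernel $\phi$ and bandwidth $h$ exactly as in the formula; the grid, sample, and bandwidth values are illustrative:

```python
import numpy as np

def kernel_density(x_grid, data, h):
    """Gaussian kernel density estimate: (1/(n*h)) * sum_i phi((x - x_i)/h),
    evaluated at each point of x_grid; phi is the standard normal density."""
    u = (x_grid[:, None] - data[None, :]) / h          # (grid, n) standardized differences
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)     # Gaussian kernel values
    return phi.mean(axis=1) / h                        # average over the data, scale by h

# Example: estimate the density of a sample on a grid.
rng = np.random.default_rng(0)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 81)
f_hat = kernel_density(grid, data, h=0.3)
```

Since the estimate is an average of densities, it is itself (approximately, on a finite grid) a density: nonnegative and integrating to one.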
Network Flows
http://csr.lanl.gov/data/cyber/
Network Flows
Streaming Data
Averages can be computed in a streaming fashion:
$\bar X_n = \frac{n-1}{n}\bar X_{n-1} + \frac{1}{n} X_n.$
We can implement an exponential window:
$\bar X_n = \frac{N-1}{N}\bar X_{n-1} + \frac{1}{N} X_n = \theta \bar X_{n-1} + (1-\theta) X_n,$
and apply this idea to the kernel estimator:
$\hat f_n(x) = \theta \hat f_{n-1}(x) + (1-\theta)\frac{1}{h}\phi\!\left(\frac{x - X_n}{h}\right).$
$\theta$ controls how much of the past we remember. Note that we have to set a grid of $x$ points at which we want to compute $\hat f$.
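The recursion above translates directly into code. A sketch (not from the talk) of the exponentially-windowed kernel estimator on a fixed grid; the grid, bandwidth, and $\theta$ values are illustrative assumptions:

```python
import numpy as np

class StreamingKDE:
    """Exponentially-weighted kernel density estimate on a fixed grid.

    Update rule: f_n(x) = theta * f_{n-1}(x) + (1 - theta) * (1/h) * phi((x - X_n)/h).
    theta controls how quickly old observations are forgotten.
    """
    def __init__(self, grid, h, theta):
        self.grid, self.h, self.theta = grid, h, theta
        self.f = np.zeros_like(grid)

    def update(self, x_new):
        u = (self.grid - x_new) / self.h
        kernel = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * self.h)
        self.f = self.theta * self.f + (1 - self.theta) * kernel
        return self.f

# Feed a stream of observations one at a time.
rng = np.random.default_rng(1)
kde = StreamingKDE(grid=np.linspace(-4, 4, 81), h=0.3, theta=0.99)
for x in rng.normal(size=2000):
    kde.update(x)
```

Each update costs only one kernel evaluation per grid point, so the estimate can track a non-stationary stream without storing past data.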
Streaming Network Flows: log(#bytes) in a Flow
Topics
Machine Learning: Classification
Given $\{(x_i, y_i)\}_{i=1,\dots,n} \subset \mathcal{X}\times\mathcal{Y}$, with $x_i$ corresponding to observations (flows, programs, email, system calls, log files: "features"), and $y_i$ corresponding to class labels (e.g. "malware", "benign"). A classifier is a mapping $g:\mathcal{X}\to\mathcal{Y}$. Machine learning (pattern recognition, classification) is designing the function $g$ from training data $\{(x_i, y_i)\}_{i=1,\dots,n}$ for which truth is known. We are given the training data, and will be presented with a new $x \in \mathcal{X}$ for which the label is unknown. We wish to infer the $y$ associated with $x$.
Nearest Neighbors
We are given training data $\{(x_i, y_i)\}_{i=1,\dots,n} \subset \mathcal{X}\times\mathcal{Y}$, and a new $x \in \mathcal{X}$ for which the label is unknown. Find the closest $x_i$ to $x$: $\hat y = y_{\arg\min_i d(x, x_i)}$. We must select an appropriate distance (dissimilarity) $d$. Alternative: compute the $k$ closest and vote: take the majority class.
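The rule just described fits in a few lines. A minimal sketch (not from the talk), assuming Euclidean distance for $d$ and a toy two-cluster data set:

```python
import numpy as np

def knn_predict(x_new, X_train, y_train, k=1):
    """Predict the label of x_new by majority vote among its k nearest
    training points; with k=1 this is the plain nearest-neighbor rule."""
    dists = np.linalg.norm(X_train - x_new, axis=1)        # distance to every training point
    nearest = np.argsort(dists)[:k]                        # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                       # majority class

# Toy data: two well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.2, 0.1]), X, y, k=3))  # → 0
```

Swapping in a different dissimilarity (e.g. a distance between byte-count histograms) only changes the `dists` line.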
Kaggle Malware
Examples of malware grouped into 9 malware families.* Each file has been byte-dumped and tabulated: we use the frequency with which each byte value 0, ..., 255 occurs in the file. This seems really dumb (computer scientists laugh when I tell this story). We'll look at the nearest neighbor classifier on these data. A fixed number of observations from each family is used for training (fewer from the smallest family); we test on the remaining. Remember: sometimes simple is good.
*https://www.kaggle.com/c/malware-classification
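The byte-count representation is easy to sketch. `byte_histogram` is a hypothetical helper name (not from the talk), and normalizing to relative frequencies is my assumption, made so that files of different sizes are comparable:

```python
import numpy as np

def byte_histogram(raw: bytes, normalize=True):
    """256-dimensional feature vector: frequency of each byte value 0..255.
    No parsing or disassembly of the file is needed."""
    counts = np.bincount(np.frombuffer(raw, dtype=np.uint8), minlength=256).astype(float)
    if normalize and counts.sum() > 0:
        counts /= counts.sum()   # relative frequencies
    return counts

# Tiny example: a 4-byte "file".
feat = byte_histogram(b"\x00\x00\xff\x10")
```

The resulting 256-dimensional vectors are what the nearest neighbor classifier operates on.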
Kaggle Malware: NN Performance
[Confusion matrix of predicted vs. true class omitted.]
The error rate is small: the vast majority of the observations are correctly classified.
Kaggle Malware: NN Performance
Why??? Text analogy: the byte-count histogram is analogous to the word-count histogram used in text analysis. Maybe this is more like a morpheme-count histogram. Intuitively, a family shares a core of code (the members are modifications of the "mother" malware). The bytes correspond to machine instructions, or at least they would if we were counting words instead of bytes.
Kaggle Malware: Smoothed NN Performance
Using the kernel estimator instead of the histogram, one obtains a similarly small error. This is another place for computer scientists to laugh: bytes are not continuous, machine instruction codes are discrete... and yet it works. Remember Hand's paper. Here is the point at which we need to better understand our data. Unfortunately, we won't be doing this today.
Random Forests
We are given training data $\{(x_i, y_i)\}_{i=1,\dots,n} \subset \mathcal{X}\times\mathcal{Y}$, and a new $x \in \mathcal{X}$ for which the label is unknown. The random forest is an ensemble of decision trees: sample (with replacement) from the training data; sample a subset of the variables; build a decision tree using the two samples, without any optimization or pruning; repeat. With a new observation, vote the trees.
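A toy sketch (not from the talk) of the recipe above: bootstrap the rows, subsample the columns, vote the trees. To keep the code short it uses depth-1 trees (stumps) in place of full unpruned trees; a real random forest grows each tree much deeper:

```python
import numpy as np

def fit_stump(X, y):
    """Depth-1 decision tree: best (feature, threshold) split by misclassification count."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            l_lab = np.bincount(left).argmax()     # majority class on each side
            r_lab = np.bincount(right).argmax()
            err = np.sum(left != l_lab) + np.sum(right != r_lab)
            if best is None or err < best[0]:
                best = (err, j, t, l_lab, r_lab)
    _, j, t, l_lab, r_lab = best
    return lambda x: l_lab if x[j] <= t else r_lab

def random_forest(X, y, n_trees=25, n_features=1, rng=None):
    """Ensemble of stumps: bootstrap the rows, subsample the columns, no pruning."""
    if rng is None:
        rng = np.random.default_rng(0)
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), len(X))                     # bootstrap sample
        cols = rng.choice(X.shape[1], n_features, replace=False)   # random feature subset
        trees.append((cols, fit_stump(X[np.ix_(rows, cols)], y[rows])))
    def predict(x):
        votes = [stump(x[cols]) for cols, stump in trees]          # each tree votes
        return np.bincount(votes).argmax()
    return predict

# Toy data: two well-separated clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2], [5.2, 5.1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
predict = random_forest(X, y, n_trees=25, n_features=1)
```

The randomness in rows and columns decorrelates the trees, so the vote of many weak trees has lower variance than any single tree.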
Benign vs Malicious
A collection of Windows binaries, some benign and some malicious. Random forest performance: a small overall error, with small misclassification rates on both the benign and the malicious files. The nearest neighbor classifier is a little worse overall.
Know Your Data
The results demonstrate that there is something going on with this byte-count approach. Logically, the performance seems too good to be true, and yet it does seem to work. The data are high dimensional (256), so maybe there is a curse-of-dimensionality thing going on here. Perhaps we are finding OS-specific things: the data collected for the benign files may come from a different version of the operating system than the malicious. We don't have version information about the data (beyond "these are Windows files"). Worrisome fact: there are several different sets of benign (or malicious) data, and a classifier can be built to tell which of the benign collections a file belongs to.
Know Your Data?
Maaten, L. van der, & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
Topics
Manifold Learning
Hypothesis: high dimensional data lives on a lower dimensional structure. Manifold learning is a set of techniques to infer this structure, or to embed the data from the high dimensional space into a lower dimensional space in a way that respects the local structure.
Multidimensional Scaling
Problem: Given a distance matrix (or dissimilarity matrix) D, find a set of points $X \subset \mathbb{R}^d$ whose interpoint distances $d(x_i, x_j)$ best approximate D. This is the problem solved by multidimensional scaling (MDS). Different definitions of "best approximates" lead to different algorithms. Classical MDS utilizes the eigenvector decomposition of (a modified version of) the distance matrix. Some manifold learning algorithms compute a local distance and use MDS; others compute eigenvectors of related matrices. These are the algorithms I use most often.
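Classical MDS as just described (an eigendecomposition of a double-centered, squared distance matrix) can be sketched in a few lines; this is a generic implementation, not code from the talk:

```python
import numpy as np

def classical_mds(D, d=2):
    """Classical multidimensional scaling from a distance matrix D:
    double-center -D^2/2, then embed with the top d eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D**2) @ J               # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]        # take the d largest
    scale = np.sqrt(np.maximum(vals[idx], 0))
    return vecs[:, idx] * scale             # embedded coordinates, one row per point

# Sanity check: recover planar points (up to rotation/reflection) from their distances.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Z = classical_mds(D, d=2)
```

When D is an exact Euclidean distance matrix of points in d dimensions, the embedding reproduces the original configuration up to a rigid transformation.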
Basic Graph Theory
A graph is a set V of vertices and a set E of pairs of vertices (edges). The edges can be directed or undirected, and can have weights; in this talk they will be undirected. The (graph) distance between two vertices is the length of the shortest path between them in the graph. The adjacency matrix of a graph on n vertices is the n × n binary matrix with a 1 in those positions corresponding to the edges of the graph. The spectrum of a graph is the eigendecomposition of the adjacency matrix A, or more generally, of some function f(A).
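The graph distance just defined can be computed from the adjacency matrix with a breadth-first search; a generic sketch, not code from the talk:

```python
import numpy as np
from collections import deque

def graph_distances(A, source):
    """BFS shortest-path (graph) distances from `source` in an undirected,
    unweighted graph given by its n x n binary adjacency matrix A."""
    n = A.shape[0]
    dist = np.full(n, -1)        # -1 marks vertices not yet reached
    dist[source] = 0
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in np.flatnonzero(A[v]):   # neighbors of v
            if dist[u] == -1:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

# Path graph 0-1-2-3: distance from 0 to 3 is 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
```

Running BFS from every vertex gives the full graph-distance matrix, which is what the MDS-on-graph-distance embeddings later in the talk consume.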
Graph Examples
ε-ball graph. k-nearest neighbor graph.
Basic Steps of Spectral Embedding
Given data $\{x_1,\dots,x_n\} \subset \mathbb{R}^p$:
Construct a graph whose vertices are the $x_i$, with edges between near points: a k-nearest neighbor graph, an ε-ball graph, or variations.
Compute the eigenvectors of: the adjacency matrix; the Laplacian of the adjacency matrix; or scaled or modified versions of the above.
Set $Z$ to the matrix whose columns are the main eigenvectors; the rows $\{z_1,\dots,z_n\}$ are the embedded data.
Perform inference on $Z$.
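The steps above can be sketched end to end. This is my own minimal version (not from the talk); the ε value and the choice of the normalized Laplacian are illustrative assumptions:

```python
import numpy as np

def spectral_embed(X, eps, d=2):
    """Spectral embedding: eps-ball graph -> normalized Laplacian -> eigenvectors."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # pairwise distances
    A = ((D < eps) & (D > 0)).astype(float)                # eps-ball adjacency, no self loops
    deg = A.sum(axis=1)
    # D^{-1/2}, with isolated vertices (degree 0) mapped to 0
    dinv = np.where(deg > 0, 1.0 / np.sqrt(np.where(deg > 0, deg, 1.0)), 0.0)
    L = np.eye(len(X)) - dinv[:, None] * A * dinv[None, :] # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                          # ascending eigenvalues
    return vecs[:, 1:d + 1]                                 # skip the trivial eigenvector

# Embed a small point cloud into two dimensions.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
Z = spectral_embed(X, eps=1.5, d=2)
```

The rows of `Z` are the embedded points; inference (clustering, classification) then proceeds on `Z`.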
Compute the ε-ball graph on the Kaggle training data. Lay out the graph. Embed using the scaled Laplacian. Embed using the adjacency matrix. Embed using MDS on the graph distance.
Discussion
Different embedding methods extract different information about the data. These two dimensional plots are misleading in that there is no reason to assume the intrinsic dimensionality is 2. Some care must be taken to ensure that the embedding method can be applied to new data.
Joint Embedding
$M = \begin{pmatrix} D_1 & W \\ W & D_2 \end{pmatrix}$
Jointly embed $D_1$ and $D_2$ using $M$, where $W = \lambda D_1 + (1-\lambda) D_2$.
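A sketch of the joint-embedding construction, assuming two dissimilarity matrices on the same n objects; `omnibus_matrix` is a hypothetical name of mine, not the talk's:

```python
import numpy as np

def omnibus_matrix(D1, D2, lam=0.5):
    """Joint dissimilarity matrix M = [[D1, W], [W, D2]] for two n x n
    dissimilarities D1, D2, with W = lam*D1 + (1 - lam)*D2.
    Embedding M (e.g. with classical MDS) places both copies of the
    n objects in a common space."""
    W = lam * D1 + (1 - lam) * D2
    return np.block([[D1, W], [W, D2]])

# Two 2 x 2 dissimilarity matrices on the same pair of objects.
D1 = np.array([[0.0, 1.0], [1.0, 0.0]])
D2 = np.array([[0.0, 3.0], [3.0, 0.0]])
M = omnibus_matrix(D1, D2, lam=0.5)
```

The off-diagonal block W ties the two copies of each object together, so their embedded positions are comparable across the two dissimilarities.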
Topics
Topological Data Analysis (TDA)
The basic idea is to use topological features (measures that are invariant to smooth deformations) to learn about the structure of the data. We will only be able to touch briefly on this subject. See:
Carlsson, G., "Topology and Data," Bulletin of the American Mathematical Society, 46, 2009, 255-308.
Ghrist, R., Elementary Applied Topology, CreateSpace Independent Publishing Platform, 2014.
Simplices
A (geometric) simplex of dimension d is a set of d + 1 points in general position. A 0-simplex is a point, a 1-simplex a line segment, a 2-simplex a triangle, and so on.
Simplicial Complexes
A simplicial complex is a collection S of simplices that satisfies the following conditions: if σ ∈ S then so are the faces of σ; if σ₁, σ₂ ∈ S, then either they are disjoint or they intersect in a lower dimensional simplex which is a face of both.
Persistent Homology
We construct an ε-ball graph on the data, and from this we get a simplicial complex. We compute a measure of the topology (the rank of the homology group, i.e. the Betti number): how many d-dimensional holes are there? Those structures that persist across ranges of ε are interesting, and more likely to be real structure rather than noise.
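Persistence of the simplest invariant, Betti₀ (the number of connected components of the ε-ball graph), can be sketched with union-find; the point set is a toy example of mine, not data from the talk:

```python
import numpy as np

def betti0(points, eps):
    """Number of connected components (Betti_0) of the eps-ball graph,
    computed with union-find over all pairs within distance eps."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < eps:
                parent[find(i)] = find(j)   # union the two components
    return len({find(i) for i in range(n)})

# Two clusters on a line: small eps keeps them separate, large eps merges them.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
```

Sweeping ε and recording Betti₀ at each value gives the persistence of the components: the two-cluster structure persists over a long range of ε and so is likely real, not noise.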
Euler Characteristic
One defines the Euler characteristic as
$\chi(X) = \sum_{j=0}^{n} (-1)^j\,\mathrm{Betti}_j(X).$
This is equivalent to the standard Euler characteristic one learns in grade school, extended to general topological spaces and higher dimensions. The persistent version is to compute this on the persistent homologies from the ε-ball graphs.
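For a simplicial complex, the alternating sum of Betti numbers equals the alternating sum of simplex counts, so χ can be computed directly by counting simplices of each dimension. A pure-Python sketch; the helper names are hypothetical:

```python
from itertools import combinations

def euler_characteristic(simplices):
    """Euler characteristic: sum over simplices of (-1)^dim, where a
    simplex on k+1 vertices has dimension k. For a simplicial complex
    this equals the alternating sum of Betti numbers."""
    return sum((-1) ** (len(s) - 1) for s in simplices)

def full_complex(faces):
    """Close a set of top-level faces under taking subfaces."""
    out = set()
    for f in faces:
        for k in range(1, len(f) + 1):
            out.update(frozenset(c) for c in combinations(f, k))
    return out

# A filled triangle: 3 vertices - 3 edges + 1 face, so chi = 1 (contractible).
triangle = full_complex([(0, 1, 2)])
```

A hollow triangle (three edges, no face) instead gives χ = 3 − 3 = 0, matching Betti₀ = Betti₁ = 1 for a circle.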
Persistent Euler Characteristics of Malware
Discussion
Mathematics has many tools for the data analyst, in particular for the analysis of cyber data. These tools include: computational statistics; machine learning; graph theory; manifold learning; topological data analysis. New applications of pure mathematics to data analysis are developed every day, and these areas are all huge growth areas for applied mathematicians.