A Comparative Study of Supervised and Unsupervised Learning Schemes for Intrusion Detection. NIS Research Group Reza Sadoddin, Farnaz Gharibian, and


1 A Comparative Study of Supervised and Unsupervised Learning Schemes for Intrusion Detection NIS Research Group Reza Sadoddin, Farnaz Gharibian, and

2 Agenda
- Brief Overview
- Machine Learning Techniques
- Clustering/Classification Techniques
- Dataset & Data Preparation
- Simulated Attacks
- Experimental Results
- Conclusions

3 Introduction
Shortcomings of traditional approaches to intrusion detection:
- Normal or attack behavior is difficult to specify: it requires expert knowledge and is time-consuming.
- Concept drift is a serious issue: how do we adapt to changes in the environment?

4 Security Intelligence
Six Steps to Producing Security Intelligence:
1. Designate Data: log entries, raw or formatted measures of activity in an environment.
2. Model Analyst Expertise: weights, centers, and pertinent event knowledge comprise the analytic or data mining model.
3. Train Model: a baseline of activity that is typical.
4. Generate Knowledge: live or offline data is compared against the baseline/classifier.
5. Teach Model: user supervision and infusion of expert knowledge.
6. Leverage Model

5 Introduction
Advantages of machine learning and data mining:
- Ability to learn and discover
- Adaptation can be done automatically

6 Goals of this study
A blind comparison between different supervised and unsupervised techniques:
- Overall accuracy in detecting attacks
- Performance in detecting different attack categories
- Sensitivity of the techniques to the distribution of the training and test datasets

7 Machine Learning Techniques
- Unsupervised:
  - Distance-based
  - Clustering: K-Means, C-Means, EM, SOM, Y-Means, ICLN
  - Unsupervised SVM
- Supervised: Naïve Bayes, Decision Tree, Random Forest, Support Vector Machine, Gaussian

8 K-Means Clustering
Basics:
- Groups the objects into k clusters
- Assumption: the number of clusters is given
- Objective: minimize the total intra-cluster variance
$$\sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

9 K-Means Algorithm
1. Choose k initial centroids.
2. Assign each object to its closest centroid.
3. Calculate the mean vector of each cluster.
4. Shift the centroids to their means.
5. Are all centroids stable? If no, return to step 2; if yes, end.
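The loop on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the study: the function name `kmeans`, the `max_iter` safety cap, and the handling of empty clusters are assumptions.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means on a list of equal-length tuples.

    `centroids` holds the k initial seeds; returns (centroids, labels).
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assign each object to its closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Shift each centroid to the mean vector of its cluster.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster simply keeps its previous centroid.
                new_centroids.append(old)
        # Stop once no centroid moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Because the result depends on the seeds passed in, running it with different `centroids` arguments is a quick way to see the sensitivity to initialization.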

10 K-Means Clustering
Pros:
- Simplicity
- Quick convergence
Cons:
- No guarantee of finding the global optimum
- Its performance depends on the initial seeds (cluster centroids)
- What is a suitable k?

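11 Fuzzy C-Means Clustering
Basics:
- Each point has a degree of belonging to each cluster, and the degrees sum to one over the clusters:
$$\sum_{k=1}^{\text{number of clusters}} u_k(x) = 1$$
- Center of clusters: all the points contribute, each with its degree of membership:
$$c_k = \frac{\sum_x u_k(x)^m \, x}{\sum_x u_k(x)^m}$$
- Minimizing the following objective function:
$$J_m(X, V) = \sum_{j=1}^{N} \sum_{i=1}^{k} (u_{ij})^m \, d^2(X_j, V_i)$$
where N is the number of points, k is the number of clusters, $u_{ij}$ is the degree of membership, and d is the distance function.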

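12 Fuzzy C-Means Clustering
Algorithm:
1. Choose the initial number of clusters.
2. Assign random coefficients for being in the clusters to each point (i.e., $u_{ij}$).
3. Repeat until convergence:
   - Compute the centroid for each cluster:
$$c_k^{new} = \frac{\sum_x u_k(x)^m \, x}{\sum_x u_k(x)^m}$$
   - For each point, compute its coefficients of being in the clusters (here i indexes points and j indexes clusters):
$$u_{ij}^{new} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}$$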
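One pass of the two updates on this slide might look as follows. This is a sketch under stated assumptions: `fcm_step` is a hypothetical name, `memberships[i][k]` stores the degree of point i in cluster k, and a point that coincides with a centroid is given full membership in it to avoid a division by zero.

```python
import math

def fcm_step(points, memberships, m=2.0):
    """One fuzzy c-means iteration; returns (centroids, new_memberships)."""
    n_clusters = len(memberships[0])
    dim = len(points[0])
    # Centroid update: membership-weighted mean of all points.
    centroids = []
    for k in range(n_clusters):
        weights = [u[k] ** m for u in memberships]
        total = sum(weights)
        centroids.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ))
    # Membership update from the relative distances to the centroids.
    new_memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        if 0.0 in dists:
            # Assumption: a point sitting on a centroid belongs fully to it.
            hit = dists.index(0.0)
            row = [1.0 if k == hit else 0.0 for k in range(n_clusters)]
        else:
            row = [
                1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                for dj in dists
            ]
        new_memberships.append(row)
    return centroids, new_memberships
```

Calling `fcm_step` repeatedly until the memberships stop changing by more than a small tolerance gives the full algorithm.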

13 Fuzzy C-Means Clustering
Pros:
- Useful when hard classification of the data is not possible
Cons:
- Suffers from all the problems mentioned for K-Means; initial memberships are required as well
- Convergence is slower than K-Means

14 EM (Expectation-Maximization)
Basics:
- A model-based approach to clustering
- Assumption: the data is generated by K Gaussian distributions, e.g. $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$
- Provides a soft clustering, as opposed to K-Means

15 Y-Means
- A new classification method based on K-means
- A dynamic clustering method without supervision
- It overcomes three shortcomings of K-means: degeneracy, dependency on the number of clusters, and dependency on the initial states

16 Y-Means Algorithm
1. Start: normalize the training data.
2. Run K-means.
3. Is there degeneracy? If true, remove the empty clusters.
4. Split clusters.
5. Are the clusters stable? If false, repeat; if true, link clusters.
6. End.

17 SOM (Self-Organizing Maps)
- A competitive learning technique
- Useful for extracting structural information from data and for visualization in low dimensions
- Extremely fast (compared to other techniques)
- Suitable for large datasets
Figure: input data feeding the Kohonen layer.

18 Algorithm
- Initialization: the map is initialized with a specified topology and neighbor function.
- Assignment & relocation: an input is presented to the map, and the winner neuron and its neighbors are updated with the following rule:
$$w_i(t+1) = w_i(t) + \alpha(t) \, h_{ci}(t) \, [x(t) - w_i(t)]$$
where $\alpha(t)$ is the learning rate at time t and $h_{ci}(t)$ is the neighbour kernel around the winner unit c.
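A single presentation of an input to a one-dimensional map can be sketched as below. The Gaussian form of the neighbour kernel, the 1-D grid, and the names `som_update` and `sigma` are assumptions made for illustration; the slides do not specify the kernel or the topology.

```python
import math

def som_update(weights, x, alpha, sigma):
    """Apply w_i(t+1) = w_i(t) + alpha * h_ci * (x - w_i) to a 1-D map.

    Assumption: h_ci is a Gaussian of the grid distance between unit i
    and the winner c, with width `sigma`.
    Returns (winner_index, updated_weights).
    """
    # Winner c: the unit whose weight vector is closest to the input.
    c = min(range(len(weights)), key=lambda i: math.dist(weights[i], x))
    updated = []
    for i, w in enumerate(weights):
        h = math.exp(-((i - c) ** 2) / (2.0 * sigma ** 2))
        updated.append([wj + alpha * h * (xj - wj) for wj, xj in zip(w, x)])
    return c, updated
```

In practice both `alpha` and `sigma` are usually decreased over time so that the map settles.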

19 Improved Competitive Learning Network (ICLN)
Standard Competitive Learning Network (SCLN):
- Input layer
- Single layer of output neurons

20 ICLN
Improved Competitive Learning Network (ICLN): what if this happens?

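21 ICLN
ICLN Algorithm:
- Euclidean distance:
$$d(x, w) = \sqrt{\sum_i (x_i - w_i)^2}$$
- Weight update for the winner neuron c, where $\eta_1$ is the learning rate:
$$w_c = w_c + \eta_1 (x - w_c)$$
- Weight update for the other neurons, where $\eta_2$ is the learning rate and $\beta$ is the update factor:
$$w = w - \eta_2 \, \beta \, (x - w)$$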

22 Support Vector Machines (SVM)
- Powerful, state-of-the-art classifier
- Supported by strong mathematical foundations: Vapnik-Chervonenkis (VC) theory
- Good generalization to novel data
- Ability to classify non-linearly separable data

23 SVM: Conceptual Simplicity
The SVM model defines a hyperplane in the feature space in terms of coefficients (w) and a bias term (b).
Prediction: $f = \operatorname{sign}(w \cdot x + b)$
Figure labels: $d_-$ and $d_+$.

24 Support Vector Machine: Multi-class

25 K-Nearest Neighbor (KNN)
- Describe an instance x as a feature vector $\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle$.
- Euclidean distance:
$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$$
- Hamming distance: the distance between two values is 0 if they are the same, 1 if different.

26 The inductive bias of k-NN
The assumption that the classification of an instance x will be most similar to the classification of other instances that are nearby in Euclidean distance.

27 Voronoi diagram
Figure: Voronoi diagram around a query point $x_q$, with positive (+) and negative (-) training examples.

28 K-Nearest Neighbor (KNN)
KNN-based outlier indices:
- Kappa: $\kappa(x) = \lVert x - z_k \rVert$, where $z_k$ is the kth nearest neighbor of x
- Gamma: $\gamma(x) = \frac{1}{k} \sum_{j=1}^{k} \lVert x - z_j \rVert$
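Both indices follow directly from the sorted distances to the reference points. The sketch below uses a brute-force search and a hypothetical function name, `knn_outlier_indices`; it assumes the query point is not itself part of the reference data.

```python
import math

def knn_outlier_indices(x, data, k):
    """Return (kappa, gamma) for point x against the reference points `data`.

    kappa: distance to the k-th nearest neighbour of x.
    gamma: mean distance to the k nearest neighbours of x.
    Assumption: x itself is not contained in `data`.
    """
    dists = sorted(math.dist(x, z) for z in data)[:k]
    return dists[-1], sum(dists) / len(dists)
```

A larger value of either index means the point lies further from its neighbours, so thresholding them is one way to flag outliers.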

29 Naive Bayes Classifier
- Probabilistic classifier based on Bayes' theorem
- Based on the simplifying assumption that the attribute values are conditionally independent given the target value
- Naive Bayes probabilistic model:
$$p(C \mid F_1, \ldots, F_n) = \frac{1}{Z} \, p(C) \prod_{i=1}^{n} p(F_i \mid C)$$
where C is the class variable, $F_1, \ldots, F_n$ are the features, and Z is a scaling factor dependent on $F_1, \ldots, F_n$.

30 C4.5 (Decision Tree)
- A decision tree consists of nodes, leaves, and edges.
- A node of a decision tree specifies an attribute by which the data is to be partitioned.
- Each node has a number of edges, which are labeled according to a possible value of the attribute in the parent node.
- An edge connects either two nodes or a node and a leaf.
- Leaves are labeled with a decision value for categorization of the data.

31 Random Forest Classifier
- Generates many classification trees.
- Each tree gives a vote that indicates the tree's decision about the class of the object.
- The forest chooses the class with the most votes for the object.

32 Dataset: KDD99
- Used in the Third International Knowledge Discovery and Data Mining Tools Competition
- Acquired from live network traffic by Lincoln Lab
- Attacks are simulated on top of background traffic
- Accepted as a standard dataset for evaluating IDSs
- Comes with both a training dataset (attack-free) and a test dataset (containing both attack and normal data)

33 Dataset
Figure: simulation network for collecting data.

34 Simulated Attacks
Category | Description | Example
Probe | Scan a network of computers to gather information or find known vulnerabilities | IPsweep, Saint, Satan
DoS | Excessive consumption of resources that denies legitimate requests from the system | DDoS, Pingflood, SYNflood, Mailbomb
U2R | Successful execution of the attack results in a normal user getting root privileges | Eject, Fdformat, Loadmodule, Perl
R2L | An attacker having no account gains a legal user account on the victim machine by sending packets over the network | Dictionary, FTP-write, Sendmail

35 KDD Features
Feature Type | Description | Example
Basic | Common to all connections | Duration of connection; service requested; bytes transferred
Content | Based on information from local hosts | Number of failed logins; number of root accesses; number of file creation operations
Time-based | Connections with respect to the current one within a 2 second time window | # of connections that have SYN errors; # of connections to the same service
Connection-based | Past 100 connections with respect to the current one | # of connections to the current host that have S0 error

36 Data Preparation
- Feature types (41 features in total): 38 continuous features and 3 discrete features (protocol type, flag, service type)
- Converting discrete features to continuous features: using the frequency instead of the initial values of the discrete features
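The frequency substitution for discrete features could be sketched as follows. The slide does not say whether raw counts or relative frequencies were used; this example assumes relative frequencies, and `frequency_encode` is a hypothetical helper name.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each discrete value by its relative frequency in the column.

    Assumption: "frequency" means the share of records carrying that value.
    """
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]
```

For example, a protocol column `["tcp", "udp", "tcp", "icmp"]` becomes `[0.5, 0.25, 0.5, 0.25]`.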

37 Data Preparation
Necessity of normalization:
- Features are of different natures
- Large variance in maximum and minimum values
- Without normalization, large-scale features dominate the small-scale ones
Normalization formula:
$$NewVal_i = \mathrm{normalize}(\ln(val_i + 1)), \qquad \mathrm{normalize}(X_i) = \frac{X_i - Min_i}{Max_i - Min_i}$$
where $val_i$ stands for the value of a record on the i-th feature.
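Applied to one feature column, the formula amounts to a log transform followed by min-max scaling. The sketch below assumes non-negative feature values; the name `normalize_feature` and the treatment of constant columns are illustrative choices, not taken from the slides.

```python
import math

def normalize_feature(column):
    """Log-scale a non-negative feature column, then min-max scale it to [0, 1]."""
    logged = [math.log(v + 1.0) for v in column]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # Assumption: a constant feature maps to 0 to avoid dividing by zero.
        return [0.0 for _ in logged]
    return [(v - lo) / (hi - lo) for v in logged]
```

The logarithm compresses heavy-tailed features such as byte counts before the scaling step, so a few very large values do not squeeze the rest of the column toward zero.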

38 Data Preparation
Selecting datasets with different relative populations for train and test.
Table columns: Normal-Attack, Training, Test.

39 Labeling Heuristics for Clustering Techniques
- Why are labeling heuristics required?
- Practiced labeling heuristics:
  - Count-based: label sparse clusters as anomalous (based on a threshold)
  - Distance-based: label the distant clusters as anomalous (based on a threshold)
- Inter-cluster distance:
$$ICD_i = \frac{1}{C - 1} \sum_{j=1,\, j \neq i}^{C} \mathrm{distance}(c_i, c_j)$$
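The two heuristics can be sketched as simple threshold tests. The function names, the label strings, and the way the thresholds are passed in are assumptions for illustration; the slides do not say how the thresholds were chosen.

```python
import math

def inter_cluster_distances(centroids):
    """ICD_i: mean distance from centroid i to every other centroid."""
    n = len(centroids)
    return [
        sum(math.dist(ci, cj) for j, cj in enumerate(centroids) if j != i) / (n - 1)
        for i, ci in enumerate(centroids)
    ]

def label_by_distance(centroids, threshold):
    """Label cluster i anomalous when ICD_i exceeds `threshold` (hypothetical cut-off)."""
    return [
        "attack" if icd > threshold else "normal"
        for icd in inter_cluster_distances(centroids)
    ]

def label_by_count(sizes, threshold):
    """Label clusters holding fewer than `threshold` records as anomalous."""
    return ["attack" if size < threshold else "normal" for size in sizes]
```

Count-based labeling relies on attacks being rare in the data, whereas distance-based labeling only needs attack clusters to sit far from the rest.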

40 Performance Criteria
Confusion matrix (rows: actual class, columns: predicted class):
Actual \ Predicted | Normal | Attack
Normal | a | b
Attack | c | d
$$\text{Detection Rate} = \frac{d}{c + d}, \qquad \text{False Positive Rate} = \frac{b}{a + b}$$
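Both criteria can be computed from parallel lists of actual and predicted labels. In this sketch the label strings `"normal"` and `"attack"` and the function name `detection_metrics` are illustrative conventions, and the counters a, b, c, d follow the confusion matrix on the slide.

```python
def detection_metrics(actual, predicted):
    """Return (detection_rate, false_positive_rate) from parallel label lists."""
    a = b = c = d = 0
    for truth, guess in zip(actual, predicted):
        if truth == "normal":
            if guess == "normal":
                a += 1  # normal record correctly passed
            else:
                b += 1  # false alarm
        else:
            if guess == "normal":
                c += 1  # missed attack
            else:
                d += 1  # detected attack
    # Assumption: a rate with an empty denominator is reported as 0.
    detection_rate = d / (c + d) if (c + d) else 0.0
    false_positive_rate = b / (a + b) if (a + b) else 0.0
    return detection_rate, false_positive_rate
```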

41 Different Experiments
Per-technique experiments:
- Comparison between the performance of count-based and distance-based labeling for each clustering technique
- Comparison between the performance of each clustering technique in two different modes:
  - Direct application to the test dataset
  - Application of trained clusters to the test dataset

42 Different Experiments
Comparison between different techniques:
- In direct application to the test dataset
- In application of trained models to the test dataset
- In detecting different attack categories (Probe, DoS, ...)
- Average cluster purities (clustering techniques only)

43 Count vs. Distance (EM, k = 50)

44 Count vs. Distance (ICLN, Init k = 50)

45 Count vs. Distance (K-Means, k = 50)

46 Count vs. Distance (SOM, k = 50)

47 Count vs. Distance (C-Means, k = 50)

48 Count vs. Distance (Y-Means, k = 50)

49 Count vs. Distance (Y-Means, Initial k = 50)

51 Direct vs. Indirect (EM, 50, Test_9604)

52 Direct vs. Indirect (EM, 50, Test_8020)

53 Direct vs. Indirect (ICLN, 50, Test_8020)

54 Direct vs. Indirect (K-Means, 50, Test_9604)

55 Direct vs. Indirect (SOM, 50, Test_9604)

56 Direct vs. Indirect (SOM, 50, Test_8020)

57 Direct vs. Indirect (Y-Means, 50, Test_8020)

58 Direct vs. Indirect (C-Means, 50, Test_8020)

59 Different Experiments
Comparison between different techniques:
- In application of trained models to the test dataset
- In direct application to the test dataset
- In detecting different attack categories (Probe, DoS, ...)
- Average cluster purities (clustering techniques only)

60 Experimental Results (8020->8020)

61 Experimental Results (8020->9604)

62 Experimental Results (9604->8020)

63 Experimental Results (9604->9604)

65 Direct application of techniques to Test_9604

66 Direct application of techniques to Test_8020

68 Attack Category Detection (Train_8020, Test_8020)

69 Attack Category Detection (Train_8020, Test_9604)

70 Attack Category Detection (Train_9604, Test_8020)

71 Attack Category Detection (Train_9604, Test_9604)

73 Experimental Results: Cluster Purities
Measurement: information entropy
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$
Cluster impurity:
$$H(C) = -p(C_{normal}) \log_2 p(C_{normal}) - p(C_{attack}) \log_2 p(C_{attack})$$
$$p(C_{normal}) = \frac{|C_{normal}|}{|C|}, \qquad p(C_{attack}) = \frac{|C_{attack}|}{|C|}$$
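The impurity of a single cluster depends only on how many normal and attack records it contains. The sketch below is illustrative: the function names are hypothetical, and the unweighted mean in `average_impurity` is an assumption, since the slides do not say whether the average is weighted by cluster size.

```python
import math

def cluster_impurity(n_normal, n_attack):
    """Entropy (in bits) of the normal/attack mix inside one cluster."""
    total = n_normal + n_attack
    if total == 0:
        raise ValueError("empty cluster")
    impurity = 0.0
    for count in (n_normal, n_attack):
        if count:  # the term 0 * log(0) is taken as 0
            p = count / total
            impurity -= p * math.log2(p)
    return impurity

def average_impurity(clusters):
    """Unweighted mean impurity over (n_normal, n_attack) pairs."""
    return sum(cluster_impurity(n, a) for n, a in clusters) / len(clusters)
```

A pure cluster scores 0 and an evenly mixed one scores 1, so lower averages indicate cleaner clusterings.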

74 Average Impurity of Clusters in Different Techniques
Table columns: Technique, Impurity. Techniques listed: K-Means, EM, Y-Means, SOM, C-Means, ICLN.

75 Experimental Results (Supervised Schemes)
Chart: Attack Detection Results (FP and DR) for Gaussian, Naïve Bayes, C4.5, Random Forest, and SVM.

76 Experimental Results (Supervised Schemes)
Chart: DoS Detection Results (FP and DR) for Gaussian, Naïve Bayes, C4.5, Random Forest, and SVM.

77 Experimental Results (Supervised Schemes)
Chart: Probe Detection Results (FP and DR) for Gaussian, Naïve Bayes, C4.5, Random Forest, and SVM.

78 Experimental Results (Supervised Schemes)
Chart: R2L Detection Results (FP and DR) for Gaussian, Naïve Bayes, C4.5, Random Forest, and SVM.

79 Experimental Results (Supervised Schemes)
Chart: U2R Detection Results (FP and DR) for Gaussian, Naïve Bayes, C4.5, Random Forest, and SVM.

80 Lessons Learned
- Distance-based labeling provides more robust results than count-based labeling; for most of the techniques, it is clearly dominant as well.
- Direct application of clustering techniques performs as well as a two-step process.
- Clustering techniques vs. other outlier detection schemes

81 Lessons Learned (Cont'd)
- Most of the techniques are good at detecting Probe and DoS attacks.
- Almost all of them are poor at detecting R2L attacks.
- Unsupervised SVM and Y-Means are good at detecting U2R attacks.

82 Future Work
- Looking for more intelligent heuristics:
  - Combination of count-based and distance-based labeling
  - Considering other criteria such as cluster density
- Looking for more discriminative features, of special value for detecting U2R and R2L attacks
- Comparison of other learning schemes: semi-supervised, active learning, ...
- Designing hybrid detectors based on the results of this study

83 Questions?

84 Approximate Auto Regressive Modeling For Network Attack Detection Harshit Nayyar and Ali A. Ghorbani

85 Scope
- Network attack detection
- Anomaly based
- Statistical analysis
- Statistical signal processing
- Wavelet filtering
- System identification
- Signal approximation
- ARX modeling
- Approximate autoregressive modeling

86 Introduction
Usual methodology: thresholds
- Network dependent
- Different thresholds for different times?
- No scientific basis for determining the threshold
Basis of our technique:
- Assumption: unusual is unexpected.
- Obtain the predictable component from the network data.
- Create a predictive model of the network.
- Create a model for high-frequency components/peaks.
- Flag large and/or persistent deviations from the created model.

87 Network Data at a Glance

88 Techniques 1: Wavelet Approximations
Wavelet transform, using the Haar wavelet.

89 Need & Effect of Wavelet Filtering
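As a rough illustration of what Haar-based filtering does to a traffic series, the sketch below keeps only the approximation band: at each level, every pair of samples is replaced by its mean, which is what reconstruction gives when the detail coefficients are set to zero. The function name, the `levels` parameter, and the length restriction are assumptions; the slides do not state how many decomposition levels were used.

```python
def haar_approximation(signal, levels=1):
    """Smooth a series by keeping only the Haar approximation band.

    After `levels` levels, each block of 2 ** levels samples is replaced
    by the block mean (detail coefficients zeroed).
    Assumption: len(signal) is divisible by 2 ** levels.
    """
    block = 2 ** levels
    if len(signal) % block:
        raise ValueError("signal length must be divisible by 2 ** levels")
    smoothed = []
    for start in range(0, len(signal), block):
        mean = sum(signal[start:start + block]) / block
        smoothed.extend([mean] * block)
    return smoothed
```

Increasing `levels` removes more of the high-frequency content, leaving the slower trend that a predictive model can follow.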

90 Techniques 2: ARX Model
System ID: ARX model (Auto Regressive with eXternal input)
- A linear difference equation relating previous outputs (AR) and the external input to future values:
$$A(q) \, Y(t) = B(q) \, U(t) + \text{Error}$$
- The predictive model ignores the error.
- ARX[P, Q, R]:
  - P = number of past outputs (2)
  - Q = number of past inputs (2)
  - R = time delay in the system (2T)
- System identification deals with identifying A(q) and B(q) given P, Q, R: find the optimal A(q) and B(q), i.e. those which minimize the prediction error.
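To make the identification step concrete, here is a least-squares fit of the simplest possible case. It is deliberately a first-order model, y(t) = a*y(t-1) + b*u(t-1), not the ARX[2, 2, 2T] structure on the slide, and the function names are hypothetical. Note also that in the A(q)Y(t) = B(q)U(t) convention the coefficient `a` below corresponds to the negated first coefficient of A(q).

```python
def fit_arx_first_order(y, u):
    """Least-squares fit of y(t) = a*y(t-1) + b*u(t-1); returns (a, b).

    Assumption: a first-order model, simpler than the ARX[2,2,2T] on the slide.
    Solves the 2x2 normal equations directly.
    """
    syy = syu = suu = sty = stu = 0.0
    for t in range(1, len(y)):
        prev_y, prev_u, target = y[t - 1], u[t - 1], y[t]
        syy += prev_y * prev_y
        syu += prev_y * prev_u
        suu += prev_u * prev_u
        sty += target * prev_y
        stu += target * prev_u
    det = syy * suu - syu * syu
    if det == 0:
        raise ValueError("regressors are collinear; cannot identify a and b")
    a = (sty * suu - stu * syu) / det
    b = (stu * syy - sty * syu) / det
    return a, b

def predict_next(prev_y, prev_u, a, b):
    """One-step-ahead prediction; the error term is ignored."""
    return a * prev_y + b * prev_u
```

The gap between `predict_next(...)` and the observed value is the deviation that an anomaly detector built on this kind of model would monitor.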

91 Framework Phase 1: ARX Model Training
Diagram: the training time series and the external input are used to obtain the ARX model.

92 Phase 1 Results: Model Training

93 Phase 1 Results: Predictions

94 Framework Phase 2
Find the limits of normal peaks (windowed max) and obtain the peak model.

95 Results: Phase 2

96 Phase 3: Anomaly Detection

97 Phase 3 Results: Operation

98 Phase 3 Results: Operation

99 Phase 3 Results: Operation

100 Phase 3 Results: Operation

101 Phase 3 Results: Operation

102 Phase 3 Results: Operation

103 Table of Attacks

104 Conclusions
Contributions: a technique for network data modeling which can automatically detect anomalies caused by network attacks. The technique is:
- Portable: the learning phase ensures portability across networks; it is also usable with other network signals.
- Effective: in detecting network anomalies caused by attacks.
- Unsupervised: minimal human intervention.
- Online: detects attacks before completion.

105 Future Work
- Experiments with real network data: test performance in a real network. Issues: data collection, attack identification.
- Experiments with longer-term data: retraining (how and when), external input modification.
- Correlation of anomalies: improve identification of attack type; allow higher-level correlation rules.
- Improve predictions: nonlinear models, other wavelet bases.

106 Questions?

107 EM (Expectation-Maximization) Algorithm
- Initialization: initialize the model parameters
$$\Lambda = \{\mu_k, \sigma_k, \alpha_k\}$$
- Expectation: estimate the posterior probability of model k
$$P(k \mid x_n, \Lambda) = \frac{\alpha_k \, p(x_n \mid \lambda_k)}{\sum_j \alpha_j \, p(x_n \mid \lambda_j)}$$
- Maximization: re-estimate the model parameters
  - Mean:
$$\mu_k^{new} = \frac{\sum_n P(k \mid x_n, \Lambda) \, x_n}{\sum_n P(k \mid x_n, \Lambda)}$$
  - Variance:
$$(\sigma_k^{new})^2 = \frac{1}{d} \, \frac{\sum_n P(k \mid x_n, \Lambda) \, \lVert x_n - \mu_k^{new} \rVert^2}{\sum_n P(k \mid x_n, \Lambda)}$$
  - Prior probability:
$$\alpha_k^{new} = \frac{1}{N} \sum_n P(k \mid x_n, \Lambda)$$
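For one-dimensional data (d = 1) the expectation and maximization steps reduce to a few sums. The sketch below runs a single iteration; the function names are hypothetical, and it does not guard against a component collapsing to zero variance, which a practical implementation would need to handle.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_step(data, alphas, mus, variances):
    """One EM iteration for a 1-D Gaussian mixture.

    Returns the re-estimated (alphas, mus, variances).
    """
    k = len(alphas)
    # Expectation: posterior probability of each component for every point.
    posteriors = []
    for x in data:
        joint = [alphas[j] * gaussian_pdf(x, mus[j], variances[j]) for j in range(k)]
        total = sum(joint)
        posteriors.append([p / total for p in joint])
    # Maximization: re-estimate prior, mean, and variance of each component.
    new_alphas, new_mus, new_vars = [], [], []
    for j in range(k):
        resp = sum(row[j] for row in posteriors)
        mu = sum(row[j] * x for row, x in zip(posteriors, data)) / resp
        var = sum(row[j] * (x - mu) ** 2 for row, x in zip(posteriors, data)) / resp
        new_alphas.append(resp / len(data))
        new_mus.append(mu)
        new_vars.append(var)
    return new_alphas, new_mus, new_vars
```

Unlike K-Means, every point contributes to every component here, weighted by its posterior, which is what makes the resulting clustering soft.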

108 One-Class SVM (Unsupervised)
- Outlier detection: typical cases vs. outliers
- Tradeoff between including all examples and finding the smallest sphere around the data
- Outliers are supposed to be excluded

109 Kernel Classifiers
1. Transform the data via a non-linear mapping to an inner product feature space (Gaussian, polynomial, and RBF kernels).
2. Train a linear machine in the new feature space.

110 C4.5 (Decision Tree)
- C4.5: decision tree construction algorithm
- All nodes are represented by a tuple (C, R, F, L):
  - C = condition (feature, operator, value)
  - R = set of candidate detection rules
  - F = feature set (already used to decompose the tree)
  - L = set of detection rules matched at that node
- root = (null, All Rules, ∅, ∅)
