Cyber attack detection using decision tree approach


Amit Shinde
Department of Industrial Engineering, Arizona State University, Tempe, AZ, USA

In this information age, information technology (computers and the global networks that connect them) is a primary mover of the nation's economy. Communications, aviation, power delivery, financial services, and trade are just some of the sectors that depend on reliable computer networks for day-to-day functioning. Vital information such as medical records, credit card information, business strategies, and criminal records is stored in cyberspace. This dependency on computers poses a serious threat to the economic well-being of citizens and companies alike, and to public safety and national security. These days, securing a computer and its network is just as important as securing a nuclear power plant. To protect computers against a cyberattack, three categories of defense mechanisms are generally employed: prevention (blocking), detection, and reaction (curing) of an identified attack. In this study, we focus on developing a mechanism to detect cyberattacks.

Key words: data mining, decision trees, cyber attack detection

1 Objective

The objectives of this study are two-fold:

1. Correctly detect whether a computer is under attack or under normal operating conditions.
2. Identify a set of variables that provide contrast between normal operating conditions and attack conditions.

2 Data description

The Windows Performance Objects utility tracks roughly 1000 to 1200 variables related to computer objects such as Cache, Memory, Network Interface, and System. The data in this study come from the log of activity, state, and performance changes collected with this utility on a victim computer under attack and normal ("norm") conditions. Three different experiments were conducted to collect data under norm and attack conditions. A summary of the experiments and the data collected is given in Table 1.
Table 1: Experimental setup (columns: Experiment No, Idle, Attack, Text, Web, Post attack, Data points, Variables)

It is expected that an attack would create signals in some of the variables being tracked. These signals are characterized as a shift (change in mean), spikes, or a drift or trend [1]. Similarly, norm activities such as text editing and web browsing act as additional sources of variation.
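To make the signal types concrete, the following minimal sketch builds synthetic examples of a shift, a spike, and a drift on a flat baseline. The function names, magnitudes, and series length are illustrative assumptions and are not taken from the study's data.

```python
from statistics import mean

def baseline(n, level=10.0):
    """Flat series standing in for a variable under norm conditions."""
    return [level] * n

def add_shift(series, start, delta):
    """Shift: the mean changes by `delta` from index `start` onward."""
    return [x + delta if i >= start else x for i, x in enumerate(series)]

def add_spike(series, at, height):
    """Spike: a single large, short-lived deviation at index `at`."""
    return [x + height if i == at else x for i, x in enumerate(series)]

def add_drift(series, start, slope):
    """Drift (trend): the value changes gradually from index `start` onward."""
    return [x + slope * (i - start) if i >= start else x
            for i, x in enumerate(series)]

norm = baseline(20)
shifted = add_shift(norm, start=10, delta=5.0)
spiked = add_spike(norm, at=10, height=50.0)
drifted = add_drift(norm, start=10, slope=0.5)

# Compare the first half (before the hypothetical attack) with the second half.
for name, s in [("shift", shifted), ("spike", spiked), ("drift", drifted)]:
    print(name, mean(s[:10]), mean(s[10:]))
```

A shift and a spike can produce the same change in the mean over a window, which is one reason a detector that looks only at averages may confuse them.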

The objective of this study is to identify a set of rules (variables) that will help us classify an unknown activity as a normal activity or an attack. The second and third data sets are closer to reality than the first. Hence, data from these two experiments were collated into a single data set. The combined data set contains 3643 measurements and 1162 variables.

3 Supervised learners

3.1 Comparison of supervised learners

The applicability of some supervised learners to this data set is summarized below.

1. Regression for classification - Given K classes, we create K indicator variables:

   $$y_k = \begin{cases} 1 & \text{if the class is } k \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

   A regression model is built for each $y_k$, and the object is assigned to the class with the maximum $\hat{y}_k$, $k = 1, \ldots, K$. However, this can be an awkward approach since $\hat{y}_k$ is not restricted to the range 0-1. Also, it is a linear global model and hence some classes may be masked [2]. These models are also highly susceptible to outliers (spikes) in the data.

2. Naïve Bayes - The Naïve Bayes classifier assumes that, given a class k, the variables are independent within the class. For numerical variables, it typically also assumes that the data follow a normal distribution. Cyber attack data is characterized by highly skewed variables, and the independence assumption may not hold for all attributes. Given these assumptions and the nature of the data, this classifier is not appropriate.

3. Artificial Neural Networks (ANN) - Although ANN is very useful for classification, it suffers from poor interpretability. We would be able to classify an unknown test case into one of the classes, but we would not be able to identify which variables are important for the classification. In addition to proper classification, our objective is to identify a reduced set of important variables. Thus, the use of ANN for our objective is inappropriate.

4. Decision Trees - Tree classification techniques provide several advantages over the methods mentioned above. Interpreting results summarized in a tree is very simple. Tree models are non-parametric, i.e. they do not assume any particular distribution for the data. Additionally, trees partition the data set into smaller (purer) subspaces, so nonlinear structures in the data can be modeled. Pruning a decision tree helps ensure that the model is not fit to noise and hence provides a model robust to outliers [3]. A disadvantage of decision trees is that they are high-variance models: small changes in the data set can result in a different tree structure. This problem worsens in the presence of multi-collinearity. Hence, addressing variable redundancy is an important pre-processing step.

4 Methodology and Results

4.1 Pre-processing

In general, data pre-processing targets a reduction in the number of variables (complexity) as well as a reduction in outliers (noise). As mentioned before, decision trees are robust to outliers in X-space but are susceptible to multi-collinearity. Hence, we focus on reducing dimensionality as a pre-processing step and handle outliers by pruning the tree. Dimensionality reduction was carried out in three steps:

1. Delete all columns with constant values. This was done by deleting all columns with standard deviation < [value missing in source].
2. Delete duplicate columns, i.e. retain only unique columns. The correlation coefficient (r) was computed for all pairs of variables. Among variables with r < -0.95 or r > 0.95, all but one were eliminated.
3. Delete variables that increase monotonically with time. Such variables provide very little practical information about the mechanics of an attack. However, since a decision tree tries to create pure subspaces, the presence of such a variable will lead to perfect classification. Hence, all such variables were deleted.

The above three rules reduced the number of variables from 1162 to [value missing in source].

Table 2: Confusion matrix - Training data (rows: actual class; columns: predicted class; classes: PostAttackText, PostAttackWeb, Attack-Text, Attack-Web, Idle, Text, Web)

4.2 Decision tree

The Classification and Regression Trees (CART) implementation of decision trees was used for this project [4]. CART uses binary splits, in contrast to C4.5, which can use multi-way splits. The Gini index is used as the splitting criterion. Growing a decision tree to fit the training set as closely as possible is not useful. Given enough terminal nodes, a decision tree will eventually learn the entire training set, including noise (outliers). Such an overfitted tree will not perform well on data that did not participate in building the model (the test set). Pruning is a method that controls the complexity of the model. CART uses the post-pruning approach: the tree is grown to its entirety and then trimmed in a bottom-up sequence. If the generalization error (test error) improves after trimming, the sub-tree is replaced by a leaf node. This results in a smaller, less complex tree that will perform better on test cases.
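The Gini-based binary split can be illustrated with a short sketch. The code below is a minimal, assumed implementation for a single numeric variable, not the CART software used in the study; the function names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, impurity of the unsplit node) if no split lowers the impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, gini(labels)
    for i in range(1, n):
        left_value, right_value = pairs[i - 1][0], pairs[i][0]
        if left_value == right_value:
            continue  # cannot split between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Impurity of the two children, weighted by their sizes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = (left_value + right_value) / 2, score
    return best_threshold, best_score

# Toy data: one hypothetical counter that is higher during an attack.
handle_count = [100, 102, 101, 180, 175, 190, 99, 185]
condition = ["norm", "norm", "norm", "attack", "attack", "attack", "norm", "attack"]

threshold, impurity = best_binary_split(handle_count, condition)
print(threshold, impurity)
```

A full tree applies this search to every variable at every node and recurses on the two resulting subsets; pruning then removes splits that do not help on held-out data.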
To estimate the generalization error, the data set was randomly split into 80% training and 20% testing data. To ensure robustness of the model to outliers, an additional pruning parameter was specified: if a terminal node had fewer than three training instances, the tree was pruned up.

4.3 Classification accuracy

Classification results on the training and testing data are summarized in the confusion matrices given in Table 2 and Table 3.
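A confusion matrix and the misclassification rate derived from it can be computed as in the following sketch. The labels and predictions are made-up toy values and the helper names are hypothetical; they do not reproduce Table 2 or Table 3. Treating only the "Attack-" classes as attacks is an assumption of this sketch.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

def misclassification_rate(matrix):
    """Fraction of observations whose predicted label differs from the actual one."""
    total = sum(matrix.values())
    wrong = sum(count for (a, p), count in matrix.items() if a != p)
    return wrong / total

def is_attack(label):
    """Treat only 'Attack-*' classes as attacks (an assumption of this sketch)."""
    return label.startswith("Attack")

# Toy labels using the class names from the confusion matrices.
actual = ["Idle", "Text", "Web", "Attack-Text", "Attack-Web",
          "PostAttackText", "Attack-Text", "Web"]
predicted = ["Idle", "Text", "Web", "Attack-Text", "Attack-Web",
             "Attack-Text", "Text", "Web"]

matrix = confusion_matrix(actual, predicted)
rate = misclassification_rate(matrix)

# False alarm: an attack class was predicted when the actual class was not an attack.
false_alarms = sum(c for (a, p), c in matrix.items()
                   if is_attack(p) and not is_attack(a))
# Missed signal: an actual attack class was predicted as a non-attack class.
missed = sum(c for (a, p), c in matrix.items()
             if is_attack(a) and not is_attack(p))

print(f"misclassification rate: {rate:.3f}, "
      f"false alarms: {false_alarms}, missed: {missed}")
```

Separating false alarms from missed signals matters here because the two errors carry different costs: a false alarm wastes attention, while a missed signal leaves an attack undetected.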

Table 3: Confusion matrix - Testing data (rows: actual class; columns: predicted class; classes: PostAttackText, PostAttackWeb, Attack-Text, Attack-Web, Idle, Text, Web)

As can be seen, this model has a misclassification rate of 0.668% (5/784). Out of the five misclassified points, three are false alarms (i.e. classified as Attack when no attack was in progress). The classifier has difficulty distinguishing between the Post Attack and Attack categories. One observation was a missed signal (i.e. classified as Text when it was an Attack).

4.4 Variable importance

The primary splitters for the decision tree model are shown in Figure 1. As can be seen, only seven variables were used to create this tree. Since CART is a greedy algorithm, there may be alternate splitters (variables) that could have performed just as well on this data set. Hence, the reduced set of important variables should include these alternate variables in addition to the seven primary variables shown.

Figure 1: Splitters

Given a split at node $v$ denoted by $s$, a surrogate split at node $v$, denoted by $s_j$, is obtained for each predictor variable $x_j$. A surrogate split is chosen to maximize the association with the actual split. Variable importance is defined as

$$M(x_j) = \sum_{v \in T_s} \Delta i(s_j, v) \qquad (2)$$

where $T_s$ is the set of all nodes in tree $T$ that have splits (non-leaf nodes). $M(x_j)$ sums, over these nodes, the change in impurity $\Delta i$ produced by the surrogate split based on $x_j$. If splits on $x_j$ that match the actual splits in the tree can also reduce impurity, then $x_j$ is important. Variable importance is usually reported in percentages relative to the maximum as

$$\frac{M(x_j)}{\max_j M(x_j)} \qquad (3)$$

Table 4 lists the variables that should be tracked for signs of attack.

Table 4: Variable Importance (columns: Variable, % importance, Name). The variable names, in table order:

ALPHA02-VICTIM-Objects-Sections
ALPHA02-VICTIM-Memory-% Committed Bytes In Use
ALPHA02-VICTIM-Process(services)-Page File Bytes Peak
ALPHA02-VICTIM-Process(lsass)-Working Set
ALPHA02-VICTIM-Objects-Mutexes
ALPHA02-VICTIM-Process(svchost 1)-Handle Count
ALPHA02-VICTIM-Memory-System Driver Resident Bytes
ALPHA02-VICTIM-Memory-Cache Bytes
ALPHA02-VICTIM-Objects-Processes
ALPHA02-VICTIM-Process(nc)-Thread Count
ALPHA02-VICTIM-Process(spoolsv)-Pool Nonpaged Bytes
ALPHA02-VICTIM-Process(spoolsv)-Thread Count
ALPHA02-VICTIM-Terminal Services Session(Console)-Input Async Overflow
ALPHA02-VICTIM-Terminal Services Session(Console)-Protocol Save Screen Bitmap Cache Hits
ALPHA02-VICTIM-Terminal Services Session(Console)-Total WaitForOutBuf
ALPHA02-VICTIM-Terminal Services Session(Console)-Thread Count
ALPHA02-VICTIM-TCP-Connections Reset
ALPHA02-VICTIM-Cache-Data Map Pins/sec

5 Conclusions

The use of decision trees was demonstrated for detecting cyber attacks and for arriving at a set of variables that should be monitored to provide a contrast between normal conditions and attack conditions. The decision tree built using the post-pruning technique is relatively small (12 leaf nodes) and uses seven variables. In addition, the classification performance of the tree is very good: only 0.668% of the test sample was misclassified.
Variable importance using surrogate splits results in a set of eighteen variables that should be monitored.

References

[1] Deepak Krishna Lakshminarasimhan, Wavelet based cyber attack detection, Arizona State University.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer, New York.
[3] George H. John, Robust decision trees: Removing outliers from databases, Proceedings of the First International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth International Group, Belmont, CA.
