A Multi-agent Based Cognitive Approach to Unsupervised Feature Extraction and Classification for Network Intrusion Detection

Int'l Conf. on Advances on Applied Cognitive Computing, ACC'17

Kaiser Nahiyan, Samilat Kaiser, Dr. Ken Ferens, Dr. Robert McLeod
Department of Electrical and Computer Engineering, University of Manitoba, Canada
{nahiyank, kaisers3}@myumanitoba.ca, {Ken.Ferens, Robert.McLeod}@umanitoba.ca

Abstract

The importance of finding meaning in unstructured data is increasing. In network intrusion detection, unsupervised learning from unlabeled data is vitally important, yet there is no universal technique for it. Most approaches, including unsupervised machine learning algorithms, are computationally expensive on large amounts of data that need additional preprocessing, and their detection accuracy is still unsatisfactory. This work focuses on an automated, agent-based, unsupervised, relatively uncomplicated cognitive approach that separates attacks from normal events within the large search space with reduced computational demands. The algorithm collects features from a statistical analysis of the observed attributes over each time-step (much as an intuitive learner would try to infer from a stream of unlabeled data) and isolates attack events from normal ones by applying unsupervised k-means clustering to the reduced dataset. The load on the central processor is further reduced by an agent-based architecture: agents are deployed on hosts, part of the processing is done at each host, and the rest is done by the node that performs the classification. With a growing number of small-device networks supporting IoT, mobile and sensor networks, fast, lightweight machine learning models for unsupervised attack identification are required.
We validate our algorithm on two recent datasets with modern-day attacks, and further perform a multi-scale analysis to locate the time-scale of attacks.

I. INTRODUCTION

From leaked debit card details to intrusions into highly classified materials, cyber-attacks have become a real threat and a part of our political and social discourse. Attacks are no longer carried out only by isolated individuals; there are now organized crimes orchestrated by hacker groups. Research in cybersecurity is likewise very active. Machine learning has recently demonstrated much success in transforming many sectors, including cybersecurity. However, datasets in cybersecurity are scarce. Only a small number of datasets are publicly available, generation methods are not uniform, the datasets often contain private data that comes with added formalities, and in many cases there is no ground truth to tell researchers what to expect. Supervised learning methods use labelled data to train the algorithm on mixed samples covering all the classes. Once the learner has been trained, it can be exposed to new samples and can separate attacks from normal traffic. Unlike data in many other machine learning domains [1], network traffic data is vastly diverse: IPs and ports are categorical data represented as numbers, hardware addresses are categories represented by groups of characters, payloads and user data are often encrypted, network parameters are flags that are often binary, and the list goes on. Hence, achieving consistent detection accuracy on test data is difficult even for supervised techniques, let alone unsupervised ones. In unsupervised machine learning the data is unlabeled, so there is no guidance on how to find meaning in the data or how to use that knowledge to classify the samples.
Conventional techniques such as clustering, when applied to the entire dataset, do not deliver satisfactory accuracy, whereas complex methods such as deep learning and neural networks require huge sets of data samples and long hours of intensive computation. We argue that much current focus is on implementing complex, resource-hungry machine learning methods, whereas comparable results can be achieved with much less computation power and much less data, and hence timely actions can be taken to address the intrusion. From the perspective of cognitive detection, an unsupervised learner would apply simpler techniques such as statistical learning, flow analysis and clustering to identify the attacks, which is exactly the approach described in this study.

Having set our focus on simplified learning with less computation power, we introduce the agent-based model used in our approach. An Agent Based Model (ABM) is a class of computational models for simulating the actions and interactions of autonomous agents (either individual or collective entities such as organizations or groups) with a view to assessing their effects on the system as a whole [2]. In our model, agents are deployed in the hosts and gateways; the agents at the hosts independently analyze the host traffic and provide the processed information to a gateway agent for further processing. In this manner, the computation load and the time required for convergence are further reduced.

II. RELATED PREVIOUS WORK

Statistical analysis of traffic has previously been used to classify application or user types. Roughan et al. [3] used nearest neighbor and linear discriminant analysis to map different network applications to different QoS classes. Bernaille et al. [4] proposed a technique using an unsupervised ML algorithm (K-Means clustering) that classifies different types of TCP-based applications using the first few packets of the traffic flow. On the UNSW-NB15 dataset [5], Moustafa et al. [6] applied Association Rule Mining as feature selection to generate the strongest features from the dataset, and Gharaee et al. [7] proposed an anomaly-based IDS using a genetic algorithm and a Support Vector Machine (SVM) with a new feature selection method. Moustafa et al.
[8] performed a statistical analysis of the observations and attributes of the UNSW-NB15 dataset, examined the feature correlations, and applied existing classifiers to evaluate its complexity, in terms of accuracy, against the KDD99 dataset. On the Aegean Wi-Fi Intrusion Dataset (AWID) [9], Kolias et al. compared the accuracy of different machine learning techniques on the AWID reduced dataset. Thanthrige et al. [10] applied feature reduction techniques such as Information Gain and Chi-Squared statistics to evaluate performance on the dataset with reduced features. However, none of these works analyzes time-step based statistical features on these datasets, and none takes the agent-based computation approach presented in this work. The accuracy reported by previous authors came at the cost of computation-heavy machine learning methods that had to process every row in the dataset and hence required long processing times. Our approach is much more straightforward and can be easily automated; it classifies big, complex datasets by extracting smaller feature datasets using statistical techniques, runs much faster, and uses a distributed processing architecture that makes it compatible with micro habitats.

III. PROPOSED METHODOLOGY

Our aim is to classify the dataset into normal and attack events in an unsupervised manner, without any training, and to find meaning in the data. We first apply our algorithm to the UNSW-NB15 dataset. We consider all four files of the UNSW-NB15 dataset, which together have 3,239,993 rows, of which 14.48% are attack rows and the rest are normal. The UNSW dataset contains 49 columns in total. To adapt the datasets to the unsupervised problem, we strip the labels from the dataset during preprocessing. The missing data analysis is shown in the Missing Data figures.
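As an illustration only, the label stripping and missing-data summary described above might look like the following pandas sketch. The toy rows, the column names and the `LABEL_COLS` list are assumptions made for this example and are not taken from the original experiments.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few flow records; the column names are assumptions.
raw = pd.DataFrame({
    "srcip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "service": ["http", None, "dns", None],
    "sbytes": [120, 3400, np.nan, 88],
    "attack_cat": [None, "Exploits", None, None],
    "label": [0, 1, 0, 0],
})

# Assumed names of the ground-truth columns.
LABEL_COLS = ["attack_cat", "label"]

# Keep the labels aside for later evaluation only.
labels = raw[LABEL_COLS].copy()

# This is all the unsupervised learner is allowed to see.
unlabeled = raw.drop(columns=LABEL_COLS)

# Fraction of missing values per remaining column, largest first.
missing_fraction = unlabeled.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Keeping `labels` in a separate frame makes it harder to leak ground truth into the clustering step while still allowing the accuracy check at the end.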
We impute the missing values and convert the categorical data into numeric representations for the columns state, proto, service, srcip and dstip. Such steps are conventional measures that make it easier for the machine to learn. We add two more attributes, srcip_trunc_encoded and dstip_trunc_encoded, which are the subnet addresses of the source and destination IPs, encoded from categories to numbers.

When working with large datasets, it is helpful to divide the data into smaller fractions that can be analyzed individually. Our sampler divides the data into time-steps, fragmenting the dataset into smaller sections based on the timestamp. These small segments are then processed to derive features from their statistical analysis. Our hypothesis is that time windows that contain attack samples will show significant feature separation from time windows that contain only normal samples. Hence, the sampler collects the groups of rows that fall within a certain range of timestamps and creates a new data frame. Then, the feature extraction portion of the algorithm extracts the mean and standard deviation of each attribute in the data frame. This transforms 3,239,993 rows into 85,348 rows, each row representing the events that occurred during one time window. At this point the dataset has been reduced by 97%. The statistics of each attribute are extracted as features, forming a new data frame that consists of the mean and variance of all the attributes except IP, port, time, etc.

Fig. 1: Sampler and Feature Extractor
Fig. 2: Classification and Evaluation

The dataset created from the original data is much smaller in both rows and columns. Hence we reduce the algorithm's computation time by reducing the number of rows that the unsupervised algorithm needs to process.

Fig. 3: Missing Data UNSW
Fig. 4: Missing Data AWID

Applying k-means clustering to the reduced dataset presents the cognitive learner with two sample spaces (clusters), one of which contains the attack samples. To identify which one is the attack cluster, the intuitive learner picks time-step samples from each cluster and tries to determine whether any attack occurred during that time-step. One way of achieving this would be to use the internal system logs for the time covered by the time-step; however, this part is outside the scope of this study.

For the evaluation of our algorithm, the two clusters are examined individually to check whether they have accurately classified the events as attack and normal. Since the datasets originally had labels, we can compare the predictions with the actual values and compute accuracy scores from the true and false positives and negatives. After evaluating the accuracy on the first dataset, we apply the same algorithm to another dataset as a final validation. For this we use the AWID dataset.
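The sampler, feature extractor, clustering and evaluation steps described in this section can be sketched end to end as follows. This is a minimal sketch on synthetic data, not the code used for the reported results: the column names (`stime`, `sbytes`, `dur`), the helper `extract_window_features`, the synthetic traffic and the use of `StandardScaler` before clustering are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def extract_window_features(df, time_col="stime", step=1, drop_cols=()):
    """Group rows into fixed-width time windows and summarise each window.

    Returns one row per window holding the mean and standard deviation of
    every remaining numeric attribute (hypothetical helper for this sketch).
    """
    window = (df[time_col] - df[time_col].min()) // step
    attrs = df.drop(columns=[time_col, *drop_cols]).select_dtypes("number")
    feats = attrs.groupby(window).agg(["mean", "std"])
    feats.columns = [f"{col}_{stat}" for col, stat in feats.columns]
    return feats.fillna(0.0)  # a window with a single row has no defined std


# Synthetic traffic (assumption): 40 one-second windows of 50 flows each.
# Every fourth window contains attack flows mixed with normal ones.
rng = np.random.default_rng(0)
n_windows, rows_per_window = 40, 50
stime = np.repeat(np.arange(n_windows), rows_per_window)
is_attack_window = np.arange(n_windows) % 4 == 0
attack_row = np.repeat(is_attack_window, rows_per_window) & (
    rng.random(stime.size) < 0.5
)
flows = pd.DataFrame({
    "stime": stime,
    "sbytes": rng.normal(500, 50, stime.size) + 2000 * attack_row,
    "dur": rng.normal(1.0, 0.1, stime.size) + 5.0 * attack_row,
})

# Sampler + feature extractor: 2000 flow rows become 40 window rows.
features = extract_window_features(flows, time_col="stime", step=1)

# Unsupervised classification of the windows into two clusters.
X = StandardScaler().fit_transform(features)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Evaluation only: cluster ids are arbitrary, so align them with the
# held-back ground truth before counting hits and misses.
truth = is_attack_window.astype(int)
pred = clusters if (clusters == truth).mean() >= 0.5 else 1 - clusters
tp = int(((pred == 1) & (truth == 1)).sum())
tn = int(((pred == 0) & (truth == 0)).sum())
fp = int(((pred == 1) & (truth == 0)).sum())
fn = int(((pred == 0) & (truth == 1)).sum())
accuracy = (tp + tn) / len(truth)
print(f"{len(flows)} rows -> {len(features)} windows, accuracy = {accuracy:.2f}")
```

The ground truth is used here only to score the clusters. In a deployment without labels, deciding which cluster is the attack cluster would need outside evidence such as the system logs mentioned above.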
On top of classifying network intrusions from the statistical features of time-steps, our work presents a multi-scale analysis of the time-steps: we create feature datasets for time-steps of t, 2t, 4t and 8t, where t is a relatively small time window in the dataset that contains a balanced number of events. In other words, we look for the best value of n for the time-step 2^n t. We run our algorithm on each of the datasets, and our results give the best time-scales for each dataset. Such an analysis can serve as a benchmark for future research.

IV. MULTI-AGENT BASED MODEL

Without an agent-based approach, the gateway is the node that has access to all traffic in the network; hence the gateway must process the traffic flows from each host, which may include multiple flows from each host within a time window. For example, suppose a host X initiates 1000 traffic flows in a time-step t. If there are 20 hosts like X, the gateway must process 20*1000 traffic flows every time-step, and if we record 50 attributes per traffic flow, the gateway node must perform multiplications and additions over a data size of 20*1000*50 = 1,000,000 values in each time-step. If the classification is deployed on another node, external to the gateway, then the gateway must send this much data (in each time-step) over the network to that classifier node for further processing.

In contrast, with a multi-agent approach arranged as follows, the computation is moved off the central gateway or classifier node and distributed over the network. Each host has an agent that performs the computation on the traffic flows for that host IP. Hence, the 1000 traffic flows of host X are processed by the agent in X, and the statistical analysis of these 1000 rows yields a single row for host X at time-step t. The classifier now has to compute over only 20 (nodes) * 1 (row provided by each node) * 50 (attributes recorded) = 1000 values instead of 1,000,000, which is a significant reduction. The other advantage of this approach is that any node, not only the gateway, can be the classifier node.

V. ANALYSIS AND RESULTS

A. Physical Setup

We perform the simulation in Python on a 64-bit OS; the underlying hardware is an AMD quad-core processor with 8 GB RAM. The data is processed using Python libraries such as pandas and scikit-learn.

B. Results

Out of the 85,348 rows in the reduced dataset, 36,580 positives and 37,902 negatives were correctly identified. The confusion matrix is shown in Fig. 8. Fig. 5 compares the computation time and shows that our approach is much faster. The classification is 89% correct, which is high for unsupervised learning. The AWID-R dataset showed an accuracy of 29% with basic unsupervised K-Means; with our algorithm the accuracy increased by 60%. This is depicted in Fig. 7.

Fig. 5: Comparison of processing time
Fig. 6: Comparison of Rows Processed
Fig. 7: Comparison of Achieved Accuracy

Figs. 9 to 12 depict the time scale analysis of one of the UNSW dataset files.
Each plot shows the time-step on the x-axis against the count of total rows observed during that time-step, for various scales from t = 1 second to t = 4 seconds.

Fig. 8: Confusion Matrix for K-means on time scale t = 1 sec for UNSW dataset

As the scale increases, the maximum value on the x-axis decreases and the maximum value on the y-axis increases.
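The effect of the scale on the two axes can be reproduced with a small counting sketch. The timestamps below are hypothetical (three flows in every second for 16 seconds), and `rows_per_step` is a helper name introduced only for this illustration.

```python
from collections import Counter


def rows_per_step(timestamps, step):
    """Count how many rows fall into each time-step of width `step` seconds."""
    start = min(timestamps)
    return Counter(int((ts - start) // step) for ts in timestamps)


# Hypothetical flow start times in seconds: 3 flows per second for 16 seconds.
timestamps = [sec for sec in range(16) for _ in range(3)]

summary = {}
for n in range(4):  # scales t, 2t, 4t, 8t with t = 1 second
    step = 2 ** n
    counts = rows_per_step(timestamps, step)
    # (largest time-step index on the x-axis, largest row count on the y-axis)
    summary[step] = (max(counts), max(counts.values()))
    print(f"scale {step} s: last step index = {summary[step][0]}, "
          f"max rows per step = {summary[step][1]}")

# scale 1 s: last step index = 15, max rows per step = 3
# scale 2 s: last step index = 7, max rows per step = 6
# scale 4 s: last step index = 3, max rows per step = 12
# scale 8 s: last step index = 1, max rows per step = 24
```

Doubling the scale halves the number of time-steps and, for evenly spread traffic, doubles the rows per step, which matches the trend described for the plots.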

Fig. 9: Time Scale Analysis shown for UNSW (scale t = 1 sec)
Fig. 10: Time Scale Analysis shown for UNSW (scale t = 2 sec)
Fig. 11: Time Scale Analysis shown for UNSW (scale t = 4 sec)
Fig. 12: Time Scale Analysis shown for UNSW (scale t = 8 sec)

The metrics that we use for evaluating the results are Accuracy, Recall, Precision and F1 score. Their descriptions are provided in Table 1. The results in Table 2 show that the algorithm performed best for scales of 4 seconds or higher. The same is depicted in Fig. 13.

Table 2: Results achieved for different time-scales

Time Scale  Class  Precision  Recall  F1-score
1 sec       0      0.9        0.85    0.87
1 sec       1      0.85       0.89    0.87
1 sec       total  0.87       0.87    0.87
2 sec       0      0.98       0.84    0.91
2 sec       1      0.85       0.98    0.91
2 sec       total  0.92       0.91    0.91
4 sec       0      0.99       0.84    0.91
4 sec       1      0.85       0.99    0.92
4 sec       total  0.92       0.91    0.91
8 sec       0      0.99       0.84    0.91
8 sec       1      0.85       0.99    0.92
8 sec       total  0.92       0.91    0.91

Fig. 13: Accuracy Analysis on various time windows

Table 1: Accuracy Metrics

Metric     Formula                                          Description
Accuracy   (TP + TN) / (TP + TN + FP + FN)                  Ratio of positive and negative cases correctly identified
Recall     TP / (TP + FN)                                   Ratio of overall positive cases correctly identified
Precision  TP / (TP + FP)                                   Ratio of predicted positive cases that are actually positive
F1 Score   2 * (Precision * Recall) / (Precision + Recall)  Measure of the accuracy of the test; the harmonic mean of recall and precision

VI. FUTURE WORK

We need to address the fact that attacks are ever changing. No algorithm can hold up for decades, as attackers make ever better efforts to imitate normal traffic; hence there will soon be attacks with normal-looking features. Therefore, our future work will be to synthetically design attack traffic that evades this algorithm, and then to apply other advanced techniques to filter out such attacks. One way of doing this would be to apply fractal analysis to differentiate normal from attack traffic. This approach has received significant recent attention in the research community.

VII. REFERENCES

[1] R. Sommer and V. Paxson, "Outside the Closed World: On Using Machine Learning for Network Intrusion Detection," in IEEE Symposium on Security and Privacy, Oakland, CA, USA, 2010.
[2] "Wikipedia," [Online]. Available: https://en.wikipedia.org/wiki/agentbased_model. [Accessed 15 May 2017].
[3] M. Roughan, S. Sen, O. Spatscheck and N. Duffield, "Class of service mapping for QoS: a statistical signature-based approach to IP traffic classification," in 4th ACM SIGCOMM Conference on Internet Measurement, New York, NY, USA, 2004.
[4] L. Bernaille, R. Teixeira, T. Akodkenou, A. Soule and K. Salamatian, "Traffic Classification On The Fly," ACM SIGCOMM Computer Communication Review, New York, NY, USA, April 2006.
[5] N. Moustafa and J. Slay, "UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)," in 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, 2015.
[6] N. Moustafa and J. Slay, "The Significant Features of the UNSW-NB15 and the KDD99 Data Sets for Network Intrusion Detection Systems," in 4th International Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), Kyoto, 2015.
[7] H. Gharaee and H. Hosseinvand, "A new feature selection IDS based on genetic algorithm and SVM," in 8th International Symposium on Telecommunications (IST), Tehran, 2016.
[8] N. Moustafa and J. Slay, "The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set," Information Security Journal: A Global Perspective, vol. 25, no. 1-3, pp. 18-31, 2016.
[9] C. Kolias, G. Kambourakis, A. Stavrou and S. Gritzali, "Intrusion detection in 802.11 networks: Empirical evaluation of threats and a public dataset," IEEE Communications Surveys & Tutorials, 2015.
[10] U. S. K. P. M. Thanthrige, J. Samarabandu and X. Wang, "Machine learning techniques for intrusion detection on public dataset," in IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Vancouver, 2016.