BUILDING A FRAMEWORK FOR INTRUSION DETECTION AND PREVENTION IN IoT USING DATA ANALYTICS METHODS

RESEARCH PROPOSAL
STUDENT NAME: Ahmad Arida
STUDENT NUMBER: 2632348
COURSE NAME: CIS 698 Independent Study
DEPARTMENT: Department of Electrical Engineering and Computer Science
COURSE CODE: 6405
SUPERVISOR: Dr. Sunnie S Chung
DATE OF SUBMISSION: 01/23/2017

ABSTRACT
With the rise of e-commerce and the Internet of Things (IoT), the security of systems that communicate over wireless networks is a growing concern. Using the recently collected Aegean Wi-Fi Intrusion Dataset (AWID), a set of richly characterized wireless network logs that contain real traces of both normal and intrusive traffic, we will identify and characterize suspicious and malicious activities over a wireless network. We will build a framework that detects these events and predicts them in order to prevent future intrusion attempts. This research will explore the current state-of-the-art methodologies for intrusion detection in the data analytics literature. To identify the outliers that characterize wireless intrusion attempts, we will apply a variety of data mining methods, including Bayesian analysis, nearest neighbour hierarchical clustering, and K-means clustering, to develop a more accurate outlier detection process. Gathering data from the logs about the characteristics of each intrusion will help build a set of rules for an algorithm to detect intrusion attempts. Finally, prevention and forecasting of intrusions will be implemented using the forecasting tools in SQL Server Analysis Services (SSAS), accessed through its business intelligence (BI) applications. This research will provide novel algorithms for detecting and preventing intrusions over wireless networks.

INTRODUCTION
BACKGROUND AND SIGNIFICANCE
WIRELESS NETWORKS
WIDESPREAD USE OF WIRELESS NETWORKING
Recent analysis and forecasting studies predicted that annual worldwide internet traffic would surpass the zettabyte mark (1000 exabytes) by the end of 2016 and would then increase three-fold over the following five years [1]. In addition, almost two-thirds of this traffic will come from wireless or mobile devices. With the majority of internet traffic carried over wireless networks, the security of these information systems becomes increasingly important. Because the technology is relatively new, there are likely many gaps in the security of these networks that can be exploited through intrusion attempts. Wireless intrusion detection systems (IDS) are therefore being developed rapidly to counter these potentially malicious behaviours.
Importance of Wireless Network Security
Medical devices and implants are an increasingly common application of wireless network technology [2]. Connecting such devices to the internet greatly benefits the patient; however, ethical hackers have shown that they can access insulin pumps and pacemakers and could potentially switch them off, killing the patient. Although no attacks on medical devices have been reported to date, the consequences of such an attack could be devastating. Hospitals, firefighters, and the military also use wireless systems, and failure of or malicious access to these systems could have far-ranging impacts [3].

802.11 Standard
The IEEE 802.11 wireless standard is currently one of the most widely used wireless technologies in the world [4]. Its popularity is driven predominantly by the high adoption rate of mobile devices (smartphones, tablets, laptops, etc.) combined with the convenience of portable communications. The network architecture of the IEEE 802.11 family can be divided into two main modes: infrastructure and ad hoc. In infrastructure mode, workstations connect to the network through an Access Point (AP). In ad hoc mode, workstations connect directly to each other. Given the scope of this research and the available dataset, only infrastructure mode will be discussed.
The 802.11 standard defines three frame categories [5]: management, control, and data. In general, management frames are used by workstations to join or leave the basic service set. Control frames are heard by all of the workstations and assist with the delivery of data frames, but contain only header information. Finally, data frames carry the actual data for the higher-layer protocols. Data frames have a consistent structure consisting of a header, a body of variable length up to 2312 bytes, and a frame check sequence (FCS) [6].
WEP Security
In the late 1990s, Wired Equivalent Privacy (WEP) was the sole security mechanism in the first iteration of the 802.11 wireless standard [6]. It was introduced mainly to bring wireless networks up to the level of security and confidentiality of existing wired networks [7]. WEP uses a stream cipher to encrypt data packets using a pre-shared key. However, the protocol is highly susceptible to many types of attacks that can recover these encryption keys given enough time. Despite this well-known weakness, a large number of devices still use WEP.
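As a small illustration of the three frame categories in the 802.11 Standard section, the sketch below reads the category from the first byte of a frame's Frame Control field, where bits 2-3 hold the type (0 = management, 1 = control, 2 = data) and bits 4-7 hold the subtype. The function name `classify_frame` and the sample bytes are illustrative assumptions and are not part of the AWID tooling.

```python
from typing import Tuple

# Type values of the 802.11 Frame Control field (bits 2-3 of the first byte).
FRAME_TYPES = {0: "management", 1: "control", 2: "data", 3: "reserved/extension"}


def classify_frame(frame_control_byte: int) -> Tuple[str, int]:
    """Return (category, subtype) for the first Frame Control byte.

    Hypothetical helper written for this example only.
    """
    frame_type = (frame_control_byte >> 2) & 0b11    # bits 2-3: type
    subtype = (frame_control_byte >> 4) & 0b1111     # bits 4-7: subtype
    return FRAME_TYPES[frame_type], subtype


if __name__ == "__main__":
    # Sample first bytes: beacon, acknowledgement, plain data frame.
    for name, byte in [("beacon", 0x80), ("ACK", 0xD4), ("data", 0x08)]:
        print(name, classify_frame(byte))
```

A detector that counts frames per category, for example to spot a burst of management frames, could start from a mapping like this one.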

Types Of Attacks
There are a number of different types of attacks against the 802.11 wireless security protocols, including key retrieving attacks, keystream retrieving attacks, availability attacks, and man-in-the-middle attacks [6].
Key retrieving attacks, as the name suggests, focus on gaining access to the secret key. Because of the relatively weak security of WEP, an attacker only needs to monitor a network for specific packets and then run a key cracking algorithm offline to recover the key. This process is passive and completely untraceable; however, the attacker may try to validate the key by sending packets across the network, which could reveal information about them.
Even without the key, an attacker can still gain access through keystream retrieving attacks. These attacks exploit the initialization vector (IV) attached to each packet, because the protocol does not forbid IV reuse. Decrypting part of a packet could therefore give an attacker a valid keystream/initialization vector pair that can be reused across network traffic.
Availability attacks, commonly called denial of service (DoS) attacks, interrupt the availability and service of a specific network. These attacks are fairly simple to implement against most 802.11 wireless networks, up to 802.11n [8]. However, a DoS attack needs to be maintained, and the attacker needs to remain within the network (or within range of it) for the duration of the attack.
Finally, a man-in-the-middle attack occurs when an attacker intercepts the communication between two parties who believe that they are communicating directly with each other [9]. The main goal of the attack is to bypass mutual authentication, which succeeds when the attacker can impersonate each endpoint to the other.
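The weakness behind keystream retrieving attacks can be shown in a few lines. WEP's stream cipher is RC4, keyed with the IV concatenated with the shared key, so two packets sent under the same IV are encrypted with the same keystream. The sketch below is a toy model under stated assumptions: the key, IV, and packet contents are made-up values, the helper names are hypothetical, and the WEP integrity checksum is omitted for brevity.

```python
def rc4_keystream(key: bytes, length: int) -> bytes:
    """Generate `length` bytes of RC4 keystream for `key`."""
    # Key-scheduling algorithm: permute the state using the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    # Pseudo-random generation algorithm: emit one byte per step.
    i = j = 0
    out = bytearray()
    for _ in range(length):
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(state[(state[i] + state[j]) % 256])
    return bytes(out)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_wep_encrypt(iv: bytes, shared_key: bytes, plaintext: bytes) -> bytes:
    """Encrypt with RC4 keyed by IV || shared key (integrity check omitted)."""
    return xor_bytes(plaintext, rc4_keystream(iv + shared_key, len(plaintext)))


if __name__ == "__main__":
    iv, key = b"\xaa\xbb\xcc", b"\x01\x02\x03\x04\x05"   # made-up values
    known = b"KNOWN PROBE DATA"       # plaintext the observer already knows
    secret = b"secret payload!!"      # plaintext the observer wants to read
    c_known = toy_wep_encrypt(iv, key, known)
    c_secret = toy_wep_encrypt(iv, key, secret)
    # Same IV means same keystream: recover it from the known pair ...
    keystream = xor_bytes(c_known, known)
    # ... and decrypt the other packet without ever learning the key.
    print(xor_bytes(c_secret, keystream))
```

Nothing in the last two steps uses the shared key, which is why repeated IVs are a useful signal to look for in captured traffic.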
AWID Dataset
The AWID dataset is a collection of two equal datasets (AWID-CLS and AWID-ATK), which differ in whether the records are labelled according to the attack classification or the actual attack. Each of these is provided as both a full set and a reduced set, since a smaller subset is easier to use for testing and developing research strategies. Each subset in turn has two versions, a training set and a test set, for the purposes of model building (Figure 1) [6].
Figure 1. AWID Subsets.
Each row of the dataset is a vector of 156 attributes (155 attributes + 1 classification). The dataset was designed to contain as many 802.11 fields as possible.
Intrusion Detection
IDSs are used as part of network security measures to prevent, detect, and/or tolerate intrusions, depending on the circumstances [3]. The first case is intrusion prevention, where a security measure intervenes and stops an attacker at the edge of the network before they gain access. The next case is intrusion detection, which aims to identify, log, and track attackers who have penetrated the network. Finally, intrusion tolerance involves techniques to combat attackers and their methods. This is a rapidly evolving race in which attackers and IDSs continuously compete and develop better and more sophisticated methods.
Three standard metrics are generally used to measure performance in intrusion detection: false positive (FP), false negative (FN), and detection [3]. A FP occurs when a legitimate node or access attempt is incorrectly identified as an intrusion. A FN occurs when a malicious or illegitimate node or access attempt is incorrectly identified as legitimate. A detection occurs when a true intrusion attempt is detected and identified correctly.
Detection Techniques
1. Anomaly Based Intrusion Detection
Anomaly based intrusion detection raises an alarm when an observed behaviour exceeds a given threshold [10]. For this kind of detection system to function, significant resources need to be invested in estimating the normal boundaries of the network. Then, when a specific condition is violated, a predefined response, usually including a logging event and an alarm, is triggered.
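A minimal sketch of threshold-based anomaly detection, scored with the FP, FN, and detection counts defined under Intrusion Detection. The feature (frames per second from one station), the mean-plus-three-standard-deviations rule, and all sample values are assumptions chosen for illustration; the proposal does not prescribe a particular threshold rule.

```python
from statistics import mean, stdev


def fit_threshold(normal_values, k=3.0):
    """Estimate the upper boundary of normal behaviour as mean + k * stdev."""
    return mean(normal_values) + k * stdev(normal_values)


def detect(values, threshold):
    """Raise an alarm (True) for every observation above the threshold."""
    return [v > threshold for v in values]


def score(alarms, labels):
    """Count detections, false positives, and false negatives."""
    counts = {"detection": 0, "FP": 0, "FN": 0}
    for alarm, is_intrusion in zip(alarms, labels):
        if alarm and is_intrusion:
            counts["detection"] += 1
        elif alarm and not is_intrusion:
            counts["FP"] += 1
        elif not alarm and is_intrusion:
            counts["FN"] += 1
    return counts


if __name__ == "__main__":
    baseline = [10, 12, 11, 9, 13, 10, 11, 12]          # made-up normal traffic
    observed = [11, 14, 95, 12, 16, 13]                 # made-up new traffic
    labels = [False, False, True, False, False, True]   # True = real intrusion
    threshold = fit_threshold(baseline)
    print(round(threshold, 2), score(detect(observed, threshold), labels))
```

On this sample, the burst of 95 frames is detected, the harmless spike of 16 is a false positive, and the low-rate intrusion at 13 is a false negative, which illustrates why the estimate of the normal boundary matters so much.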

2. Signature Based Intrusion Detection
Unlike anomaly detection, signature based intrusion detection is mainly used against specific classes of well-known attacks [10]. A signature detector uses previously identified patterns and definitions to mark a threat or attack, which means that these algorithms are unable to detect new or unfamiliar intrusions.
3. Specification Based Intrusion Detection
Specification based intrusion detection is something of a hybrid of the anomaly and signature methods, with input from a human expert [10]. The expert manually creates the data model as a set of rules which, if complete, define normal system behaviour and can reduce FP results. This is because the expert can include legitimate activities that were not previously captured in the model, so that they are not reported as intrusions.
Outlier Detection Methods
1. Bayesian networks
A Bayesian network is a graphical model that encodes relationships between variables and can be used when dealing with uncertainty [11]. Applied to intrusion detection in combination with statistical schemes, this technique can also predict events, based either on the prior knowledge of a human expert or on machine learning tools [10]. A naive Bayesian classifier is the simplest form of a Bayesian network: a supervised learning method built on the naive assumption that the features are independent of one another given the class. This methodology can be very effective in some situations, but the results can vary widely depending on the assumptions made about the behaviour of the specific system [11].
2. Clustering and Outlier Detection
Cluster analysis is a technique in which a set of data is grouped such that the items in the same group, referred to as a cluster, are more similar to each other than to items in other clusters [12]. Clustering analyses are used to group the observed dataset into defined clusters based on specific variables of interest. As a first approximation, clustering can help researchers uncover information that would otherwise remain hidden within the bulk of a large dataset. Briefly, a common first step is to determine the distance between data points with respect to the variables of interest [12], usually by computing Euclidean distances [13]. Once the distances are known, an algorithm can be used to assign each data point to a specific cluster [14]. Points that do not belong to any cluster are considered outliers, meaning that they are not similar to any of the defined clusters. Such points may simply be anomalies, but they are extremely useful for identifying intrusion attempts [15].
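The clustering steps above (compute Euclidean distances, assign points to clusters, treat points far from every cluster as outliers) can be sketched with a small K-means implementation. The proposal plans to run this step with R packages; the Python below is only an illustration. The two-dimensional sample points, the choice of the first k points as starting centroids, and the distance cutoff are all assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, iterations=20):
    """Plain K-means; the first k points seed the centroids (naive choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def find_outliers(points, centroids, cutoff):
    """Points whose distance to the nearest centroid exceeds the cutoff."""
    return [p for p in points if min(euclidean(p, c) for c in centroids) > cutoff]


if __name__ == "__main__":
    sample = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.1, 7.9),
              (7.9, 8.2), (1.1, 1.2), (8.2, 8.1), (4.5, 15.0)]
    centroids = kmeans(sample, k=2)
    print(find_outliers(sample, centroids, cutoff=3.0))
```

Note that the outlier still pulls its cluster's centroid toward itself, so the cutoff has to be chosen with some care; this is one reason to compare K-means against the other methods listed in this proposal.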

OBJECTIVES AND AIMS
Overall Objective
The overall objective of this study is to use the AWID datasets to develop an algorithm that can detect and prevent intrusion attempts over an 802.11 wireless network. The dataset is relatively new, having been created in 2015, and will provide a novel basis for IDS algorithms.
Specific Aims
1. Identification of Outliers in the AWID Dataset
2. Classification of Features of Legitimate Intrusion Attempts
3. Creation of an Algorithm to Detect Intrusion Attempts
4. Refinement of Algorithm (or creation of additional algorithms) to Prevent Intrusions
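One possible starting point for aims 2 and 3 is the naive Bayesian classifier described under Outlier Detection Methods. The sketch below trains a categorical naive Bayes model with add-one smoothing on a handful of labelled rows. The two features (frame category and retry flag), the labels "normal" and "attack", and every row are invented for illustration and are not taken from AWID.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    vocab = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the label with the highest log-posterior for `row`."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)                      # log prior
        for i, value in enumerate(row):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numer = value_counts[(i, label)][value] + 1
            denom = count + len(vocab[i])
            score += math.log(numer / denom)                 # naive independence
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    rows = [("management", "no"), ("data", "no"), ("data", "no"), ("data", "yes"),
            ("management", "yes"), ("management", "yes"), ("management", "no")]
    labels = ["normal"] * 4 + ["attack"] * 3
    model = train(rows, labels)
    print(predict(model, ("management", "yes")), predict(model, ("data", "no")))
```

Because the model multiplies per-feature probabilities, it relies on the continuous AWID attributes first being converted into discrete values.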

RESEARCH DESIGN AND METHODS
Brief Overview
Using the AWID dataset, a set of data mining techniques, including Bayesian analysis, nearest neighbour hierarchical clustering, and K-means clustering, will be implemented to identify potential outliers in the data. These outliers can be investigated further to determine whether they are genuine and to classify them as intrusion attempts. Gathering data from the logs about the characteristics of each intrusion will help build a set of rules for an algorithm to detect intrusion attempts. Use of different training and test sets will further refine the algorithm. Finally, prevention and forecasting of intrusions will be attempted using the forecasting tools in SQL Server Analysis Services (SSAS), accessed through its business intelligence (BI) applications.
Sources of Data
The AWID dataset contains 4 different test sets and 4 different training sets (Figure 1). The training sets will be used first, to begin identifying outliers and to generate the initial rules of the detection algorithm. The algorithm will then be refined using the test sets. The sets vary in size from approximately 1 million entries for the training set up to 38 million entries for the full testing set.
Analysis Tools
As data of this size cannot be opened with ordinary applications, specialized editors will be used to convert the data into usable formats for downstream applications. Tools such as EmEditor and EditPad have been able to open the full 38 million entry dataset containing 156 attributes. Depending on the application, one of these two editors will be used for saving and converting the data.
To sample the data quickly, the logs will be loaded into a flat table in Microsoft SQL Server. This will allow fast querying of the data to determine ranges of values, and custom views will provide a quick way to convert continuous data into discrete bins. These views will also allow the modified data to be exported to text files for use in other applications.
An additional use of Microsoft SQL Server will be the incorporation of multidimensional cubes through integration with Microsoft Visual Studio. This will allow further refinement of the dataset, as well as the ability to examine the data in more than two dimensions for common features. Much of the statistical processing, including the forecasting models, will also be carried out with SQL Server's business intelligence tools.
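The conversion of continuous attributes into discrete bins, which the proposal intends to do with SQL Server views, can be illustrated with equal-width binning. The Python below is a stand-in for that step, not the planned implementation; the attribute (a frame length between 0 and 100) and the number of bins are made-up values.

```python
from bisect import bisect_right


def equal_width_edges(low, high, bins):
    """Interior bin edges that split [low, high] into `bins` equal widths."""
    width = (high - low) / bins
    return [low + width * i for i in range(1, bins)]


def to_bin(value, edges):
    """Index of the bin containing `value`.

    Values below the first edge fall into bin 0, and values at or above the
    last edge fall into the final bin, so out-of-range values are clamped.
    """
    return bisect_right(edges, value)


if __name__ == "__main__":
    edges = equal_width_edges(0, 100, 4)     # edges at 25, 50, 75
    for length in (10, 25, 60, 99, 150):
        print(length, "-> bin", to_bin(length, edges))
```

Querying the minimum and maximum of each attribute first, as described above, supplies the `low` and `high` values that this kind of binning needs.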

Clustering analysis, including hierarchical clustering and K-means analysis, will be performed using the appropriate packages in RStudio. RStudio is an integrated development environment (IDE) for the R programming language. A major benefit of using RStudio is that it includes an editor that supports direct code execution, as well as access to many statistical tools and packages designed for data analysis that can be readily adapted for data mining purposes. It is also an open source platform that can quickly analyse large datasets, which will be necessary when running analyses on 38 million rows of data.
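For readers unfamiliar with nearest neighbour (single-linkage) hierarchical clustering, the sketch below shows the idea on six made-up points: every point starts as its own cluster, and the two clusters with the closest pair of members are merged until the requested number of clusters remains. This is a Python illustration only; the analysis itself is planned in R, and the function name and data are assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(points, num_clusters):
    """Agglomerative clustering with nearest neighbour (single) linkage.

    Returns clusters as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None   # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # merge b into a
        del clusters[b]
    return clusters


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (30, 30)]
    print(single_linkage(sample, num_clusters=3))
```

A cluster that ends up with a single member, such as the point (30, 30) here, is a natural outlier candidate. This naive version recomputes all pairwise distances at every merge, so it is only practical on a sample of the data rather than on the full 38 million rows.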

REFERENCES
1) K. Bode, Wireless traffic to reach 11.2 exabytes a month, Cisco, San Jose, CA, USA, 2013. [Online]. Available: http://www.dslreports.com/shownews/cisco-wireless-traffic-to-reach-112-Exabytes-a-Month-By-2017-123040
2) C. Bates, Hackers can gain access to medical implants and endanger patients' lives, 2012. <http://www.dailymail.co.uk/health/article-2127568/hackersgain-access-medical-implants-endanger-patients-lives.html>
3) Robert Mitchell, Ing-Ray Chen, A survey of intrusion detection in wireless network applications, Computer Communications, Volume 42, 1 April 2014, Pages 1-23, ISSN 0140-3664, http://dx.doi.org/10.1016/j.comcom.2014.01.012.
4) Malik A, Qadir J, Ahmad B, Alvin Yau K, Ullah U. QoS in IEEE 802.11-based wireless networks: A contemporary review. Journal Of Network & Computer Applications [serial online]. September 2015;55:24-46.
5) N. Parsi, Wi-Fi every where, 2012. [Online]. Available: http://ilovewifi.blogspot.com/2012/07/80211-frame-types.html
6) Constantinos Kolias, Georgios Kambourakis, Angelos Stavrou, and Stefanos Gritzalis "Intrusion Detection in 802.11 Networks: Empirical Evaluation of Threats and a Public Dataset" Communications Surveys & Tutorials, 2015 IEEE (Volume:PP, Issue:99).
7) Andrea Bittau, Mark Handley, and Joshua Lackey. 2006. The Final Nail in WEP's Coffin. In Proceedings of the 2006 IEEE Symposium on Security and Privacy (SP '06). IEEE Computer Society, Washington, DC, USA, 386-400. DOI=http://dx.doi.org/10.1109/SP.2006.40
8) IEEE Standard for Information Technology Telecommunications and Information Exchange Between Systems Local and Metropolitan Area Networks Specific requirements. Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. Amendment 5: Enhancements for Higher Throughput, IEEE Std. 802.11n-2009, 2012.
9) Kreitz G. Flow stealing: A well-timed redirection attack. Journal Of Computer Security [serial online]. June 2013;21(3):371-391.
10) García-Teodoro P, Díaz-Verdejo J, Maciá-Fernández G, Vázquez E. Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security [serial online]. February 2009;28(1/2):18-28.
11) Kenaza T, Tabia K, Benferhat S. On the Use of Naive Bayesian Classifiers for Detecting Elementary and Coordinated Attacks. Fundamenta Informaticae [serial online]. February 15, 2011;105(4):435-466.
12) Antonenko P, Toy S, Niederhauser D. Using cluster analysis for data mining in educational technology research. Educational Technology Research & Development [serial online]. June 2012;60(3):383-398.
13) Everitt, B. S., Landau, S., & Leese, M. (2009). Cluster analysis (4th ed.) London: Arnold.
14) Portnoy L., Eskin E., Stolfo S.J. Intrusion detection with unlabelled data using clustering. In: Proceedings of The ACM Workshop on Data Mining Applied to Security; 2001.
15) Sequeira K., Zaki M. ADMIT: anomaly-based data mining for intrusions. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002. p. 386-395.