MACHINE LEARNING & INTRUSION DETECTION: HYPE OR REALITY?


SUMMARY
The potential use of machine learning techniques for intrusion detection is widely discussed amongst security experts. At Kudelski Security, we looked into this topic, and this briefing paper provides an overview of the possibilities and limitations of machine learning. We conclude that although pure machine learning is not well suited for general network intrusion detection today, it is relevant for more specific tasks such as user behavior analysis or specific endpoint security problems. Kudelski Security is developing machine learning-based methods for specific problems where it is more cost-effective than classical methods, and will continue to monitor the evolution of machine learning.

TABLE OF CONTENTS
INTRODUCTION
A SUMMARY OF MACHINE LEARNING
SUPERVISED LEARNING
UNSUPERVISED LEARNING
SUCCESSFUL APPLICATION OF MACHINE LEARNING
POTENTIAL BENEFITS FOR INTRUSION DETECTION
INTRUSION DETECTION
HOW IS INTRUSION DETECTION CURRENTLY CARRIED OUT?
CAVEAT EMPTOR
WHAT ABOUT ACADEMIC RESEARCH?
CONCLUSION

INTRODUCTION
Machine learning (ML) occupies a central place in current debates on cybersecurity. While its value as a marketing asset is clear, the real value and cost-effectiveness of ML in cybersecurity applications remain poorly understood. This paper is a response to that knowledge gap. We investigated the application of ML that generates the most interest and raises the most questions: intrusion detection (ID), which is, in loose terms, the process of looking for attack attempts on a network.

Intrusion detection can be seen as a classification problem whose aim is to distinguish legitimate traffic from malicious traffic. ML often works well on this kind of problem, but its adoption as a standard tool for ID will depend on whether it works well on the particular kind of classification problem encountered in ID itself.

A SUMMARY OF MACHINE LEARNING
According to Andrew Ng, artificial intelligence expert and Stanford computer science professor, ML is the science of getting computers to act without being explicitly programmed. In other words, ML is a set of techniques to categorize or find patterns within data. At its most fundamental level, ML uses algorithms that learn from example data and enable us to make predictions on new or unseen data. ML learns on the job, hence the name. There are two main types of ML: supervised and unsupervised.

SUPERVISED LEARNING
Supervised learning approximates a complex function or process from a list of example data, called training data. This is best understood with an example. Let's suppose we need to predict whether a house that has been put on the market will be sold within the next six months, and that we'll base our prediction on sales data for other houses: their size, their price, and whether or not they sold within six months of listing. The way supervised learning works for this problem is simple.
We first plot the points (x,y) = (price, size) for each house on a graph, marking the houses sold within six months as pink dots and the others as green stars, as shown in Figure 1. The training phase then consists of telling the algorithm which houses were sold quickly and which were not. The algorithm learns the relationship between a house's price, its size and whether it was sold within the specified timeframe. In our example, this relationship is depicted by the dark blue curve in Figure 1.

Figure 1: Supervised learning
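The training phase described above can be sketched in a few lines of Python. This is only an illustration: the paper does not name a specific algorithm, so a k-nearest-neighbours vote is used here for brevity, and the prices, sizes and labels in `TRAINING` are made-up values, not real sales data.

```python
from math import dist

# Hypothetical training data: ((price in thousands, size in m^2), sold within six months?)
TRAINING = [
    ((200, 120), True),
    ((220, 140), True),
    ((250, 150), True),
    ((400, 100), False),
    ((450, 120), False),
    ((500, 130), False),
]


def predict_sold(house, k=3):
    """Predict the label of a new house by majority vote of its k nearest labelled neighbours."""
    # Sort the labelled houses by Euclidean distance to the new data point.
    # Note: the features are not rescaled here, which a real system would need to consider.
    nearest = sorted(TRAINING, key=lambda item: dist(item[0], house))[:k]
    votes = sum(1 for _, sold in nearest if sold)
    return votes * 2 > k


print(predict_sold((230, 130)))  # True: close to the houses that sold quickly
print(predict_sold((480, 110)))  # False: close to the houses that did not
```

Instead of drawing an explicit curve, this classifier defines the boundary implicitly: a new point takes the label of the labelled points around it.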

Using this information, we should be able to predict with some certainty whether or not the particular house under consideration will be sold within the next six months. In Figure 1, a new house would correspond to a new data point. If this data point is to the left of the curve, we can predict that the house will be sold within six months (as was the case for the pink dots). If the data point is to the right of the curve, we can predict that it will not be sold in the next six months (as was the case for the green stars).

Another common example of supervised learning is spam detection. Following a training period in which we tell the system which emails are spam and which are not, a spam detection system will learn what spam looks like and will be able to predict whether incoming emails are spam or not.

The fundamental idea in supervised learning is that each data point has a label. The houses from our first example were either sold within six months or not sold within six months. The emails from our second example were either spam or not spam.

Supervised ML is not perfect and will make classification errors: spam emails predicted as non-spam, or non-spam emails predicted as spam. Classification errors are not a problem, however, if the probability of a correct guess is sufficiently high. How high depends on the kind of error: we can probably tolerate 5% of the spam we receive being filtered as non-spam, but will not tolerate 5% of legitimate emails being classified as spam and deleted automatically.

UNSUPERVISED LEARNING
Unsupervised learning works on the same principle as supervised learning, but without labels. In this case, an algorithm is simply fed the raw data and automatically groups data points according to how similar they are.

Figure 2: Unsupervised learning

Based on some mathematical notion of distance, data points that are close to each other are classified as similar.
Similarity can be seen in Figure 2: the points in the blue cluster are close to each other, as are the points in the pink and green clusters. Unsupervised learning goes beyond grouping similar points, to discovering patterns and relationships within data. One significant advantage of unsupervised learning is its freedom from human bias in the exploration of relationships. Because the classification is not restricted by labels, a machine might find a relationship that a human would not have thought of.
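As a rough illustration of grouping by distance, the sketch below implements plain k-means clustering. The choice of k-means and of Euclidean distance is an assumption made for the example (the paper only speaks of "some mathematical notion of distance"), and the points and starting centroids are invented.

```python
from math import dist

# Hypothetical unlabelled data points forming three visible groups.
POINTS = [
    (1, 1), (1.5, 2), (2, 1),
    (8, 8), (9, 8.5), (8.5, 9),
    (1, 9), (2, 8.5), (1.5, 9.5),
]


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move each centroid to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep the old one if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Starting centroids are chosen by hand so that the example is deterministic.
clusters, centroids = k_means(POINTS, [(1, 1), (8, 8), (1, 9)])
for cluster in clusters:
    print(cluster)
```

No labels are supplied at any point: the three groups emerge only from how close the points are to one another.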

SUCCESSFUL APPLICATION OF MACHINE LEARNING
ML has proven its commercial value in some specific cases. The best-known successes are recommendation systems, such as those used by Netflix and Amazon.

Figures 3 and 4: Machine learning-based recommendation systems (Netflix and Amazon)
Sources: https://www.netflix.com and http://www.amazon.fr

Netflix uses ML to present clients with suggestions of movies or TV shows they may like, using data on previous movie and TV choices, or on favorites selected from a given list (note that this occurs in the training phase). This case, illustrated in Figure 3, is an instance of supervised learning, whereby the movies that have been watched are labeled as movies liked. Once the client has provided sufficient information about their preferences, the algorithm learns from this information and is able to predict what else they may like to watch [1].

In another case of supervised learning, Amazon uses ML to identify products a customer may want to purchase (see Figure 4). The system looks at the products the customer has purchased (in the training phase) and at the other products typically bought with them. The principle behind this idea is to push additional products that the customer may end up buying.

ML has proven to be of great help in other situations as well, such as optical character recognition (OCR), spam detection (as mentioned previously) and fraud detection (as demonstrated by PayPal).

POTENTIAL BENEFITS FOR INTRUSION DETECTION
There are two main potential benefits of using ML to detect intrusions. First, ML should enable the detection of unknown or previously unseen attacks by learning what intrusions look like. This is not possible with the standard methods used today, which need a precise description of what to look for. Second, ML may adapt in response to new attacks. This benefit derives from ML's ability to learn from new data as it is generated. In principle, this is a great strength.
It would allow a system to keep working in a context of evolving threats, with only minimal human intervention to modify the algorithm.

[1] Netflix actually created an open competition for the best algorithm to predict user ratings based on previous ratings. The winners received a million dollars.

INTRUSION DETECTION
There is a wide range of threat and intrusion detection methods, which fall into two categories: misuse detection and anomaly detection.

Misuse detection is the simpler of the two. It uses explicit descriptions of what is bad. Typically, this is done with signatures, blacklists, or other indicators of compromise. Any new incoming data point is checked against all indicators. If nothing is flagged, it is considered benign.

Anomaly detection is more subtle. It assumes that attack traffic is inherently different from benign traffic. The goal is therefore to detect any anomalies. The subtlety lies in the fact that there is no explicit description to serve as a benchmark for comparison. ML has to look instead for something that in some way stands out. The fact that ML does not need an explicit description of what to look for would make it a useful tool to detect anomalies, again in principle.

HOW IS INTRUSION DETECTION CURRENTLY CARRIED OUT?
Most ID systems in use today are based on misuse detection. Signature lists are founded on the experience and knowledge of experts and on established heuristics. Sandboxing is also used to detect attacks. For example, a file can be opened or a program can be run in an isolated environment so as to detect any strange (unwanted) behavior that might ensue. If opening the file or running the program does not trigger any unusual behavior, it is considered safe. If it does, an alarm is raised, and no damage is caused to the target system. In either case, the system is looking for attacks that we know exist and that we are able to recognize.

An obvious challenge arises when an attacker slightly modifies some known malware so that it is not detected by these standard methods. If the attack is modified sufficiently to generate a different signature, then nothing will be flagged. It is precisely in this instance that ML could be useful.
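The weakness of signature matching can be shown with a deliberately simplified sketch. It assumes that a "signature" is just the SHA-256 digest of a payload, which is only one of the forms signatures take in practice (real systems also use byte patterns and heuristics); the payload bytes and the `KNOWN_BAD` list are hypothetical.

```python
import hashlib

# Hypothetical signature list: SHA-256 digests of payloads already known to be malicious.
KNOWN_BAD = {hashlib.sha256(b"malicious payload v1").hexdigest()}


def is_flagged(payload: bytes) -> bool:
    """Misuse detection in its simplest form: flag a payload only if its digest is on the list."""
    return hashlib.sha256(payload).hexdigest() in KNOWN_BAD


print(is_flagged(b"malicious payload v1"))   # True: exact match with a known signature
print(is_flagged(b"malicious payload v1 "))  # False: one extra byte yields a different digest
```

The second call shows the problem described above: a trivially modified sample no longer matches any indicator and is therefore treated as benign.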
Some technology vendors already claim to use ML to detect intrusions. They rarely (if ever) specify their techniques and methods, as these are proprietary. Without direct access to their systems, it is almost impossible to understand what they are doing and how they do it. Furthermore, and maybe even more importantly, they do not release statistics that measure the effectiveness of their solutions. We have no way of assessing how well these methods perform compared to standard techniques.

This highlights what is probably the most relevant issue for businesses: we do not need new solutions to detect intrusions that can already be detected by standard techniques. Rather, we need to see if and how ML can detect attack attempts that bypass these standard techniques.

Figure 5: Intrusion detection Pareto curve

Consider the Pareto curve in Figure 5. It depicts the 80-20 rule, which states that 80% of the results come from 20% of the effort. With regard to ML, this means two things. First, current intrusion detection efforts perform well. In Figure 5, we are positioned at the green dot, which illustrates that we are able to detect a high number of intrusions with existing efforts [2]. Second, however, it also means that if we want to climb the curve and get better results, we will need to leverage more advanced tools and techniques.

CAVEAT EMPTOR
Given the potential of ML to detect intrusions, and its ability to detect new attacks and evolve in response to developments in the cyberthreat landscape, it would seem safe to assume that it is a standard cybersecurity tool in every organization. If Amazon and Netflix can get their ML systems to work, so too should security technology vendors. This is unfortunately not the case, for several reasons.

First, ML is better at finding similarities than it is at finding differences, which is why it works well for Amazon's recommendation system, for example. Amazon seeks to find products that are typically purchased together, not products that are not purchased together. In addition, ID systems by definition operate in a malicious environment. Attackers will try to leverage ML's ability to evolve over time in order to train the system to learn that malicious elements are benign, so that something that is in reality different (bad) is read as something similar (good). This particular caveat does not apply to recommendation systems: the risk that users will go out of their way to make Netflix mistakenly suggest unsuitable movies is negligible.

Second, a challenge arises from the results that are generated. Anomaly detection differs from misuse detection, in which you can simply identify which rule or signature was triggered and therefore establish why a particular event has been flagged.
With ML-based anomaly detection, the system will just tell you that one data point (a network packet, URL or file, for example) looks like other data points. But it won't necessarily tell you which value or pattern caused the similarity. Therefore, knowing what to do with the results of ML-based anomaly detection is not obvious. There needs to be a way to help analysts deal efficiently with the flagged data points.

As stated previously, ML's ability to carry out anomaly detection is based on the assumption that attacks or intrusions are somehow different from benign, or normal, traffic. We are faced with a difficult question, however: what is normal, and can it be described? This is challenging because so-called normal traffic varies widely: different ports, protocols, sources, destinations, encrypted or unencrypted payloads, length of files, sessions, and so on. The picture becomes even more complicated when you add the issues of virtualization and bring-your-own-cloud. Without clarity on what normal traffic looks like, it is hard to detect abnormal traffic.

A third challenge arises from the high cost of errors in ID systems. Put simply, false positives (also known as false alarms) waste time: analysts need to go through every flagged output in order to establish that the data point is benign. False negatives (also known as missed attacks), however, can be extremely dangerous.

WHAT ABOUT ACADEMIC RESEARCH?
Many academic papers discuss the use of ML as a tool for ID. It is hard to find relevance in these papers because the datasets that inform the discussions are inaccurate. In many cases, academic researchers use the DARPA and KDD datasets to train and test ML models. DARPA is an artificial dataset created in 1998 by MIT's Lincoln Lab. KDD is a subset of DARPA and was created in 1999.
Both of these datasets have been criticized for many different reasons, but the simple fact that they are artificial and were created more than 15 years ago means that they cannot be considered even remotely relevant today.

[2] Note that what we mean by effort here is everything from research to implementation.

CONCLUSION
We believe that machine learning techniques are not well suited today for pure network intrusion detection systems that only analyze network traffic. As stated already, this is mainly due to the high volume and variety of data passing through a network, which make it hard to define what is normal.

That said, machine learning would most likely be helpful in user-centric or endpoint behavioral analysis. This could be done in several different ways. First, a profile could be built for each user of a network. This would make it possible to detect attacks by finding discrepancies in each person's activity. Second, profiles could be created per group or hierarchy. For example, within a company, there could be an IT group as well as HR and legal groups. If it became apparent that a staff member from HR was using the network in a way typical of IT staff, it might signify that there is something to investigate.

At Kudelski Security, we're developing machine learning methods for specific problems related to intrusion detection. Nevertheless, our products will only rely on machine learning where it is more cost-effective than simpler methods. At the moment, one of our promising applications relates to privacy-preserving user behavior modeling on a network: that is, how to build profiles of legitimate users in order to detect unauthorized ones, but without using any privacy-sensitive information.

ABOUT KUDELSKI SECURITY
Kudelski Security is a premier cybersecurity solutions provider, working with the most security-conscious organizations in Europe and across the United States. Our long-term approach to client partnerships enables us to continuously evaluate their security posture to design and deliver solutions to reduce business risk, maintain compliance and increase overall security effectiveness.
For more information about capabilities including consulting, technology, managed security services or custom innovation, visit:

Follow us on LinkedIn: https://www.linkedin.com/company/kudelski-security
Follow us on Twitter: @KudelskiSec
Visit our blog: http://cybermashup.com
Visit our website: https://

Limitations on Use
This document is provided for marketing and general informational purposes only and should not be relied upon or construed as advice to implement or undertake any specific activities relating to its subject matter. Further consultation with Kudelski Security is recommended to ensure that particular factual situations and other relevant factors are appropriately assessed.

© 2016 Kudelski Group / All rights reserved
Kudelski and Kudelski Security are trademarks of Kudelski Group