Internet Traffic Classification using Machine Learning

Internet Traffic Classification using Machine Learning by Alina Lapina 2018, UiO, INF5050

Alina Lapina, Master's student at IFI, full-stack developer at Ciber Experis

Based on: Thuy T. T. Nguyen, Grenville Armitage. A Survey of Techniques for Internet Traffic Classification using Machine Learning. IEEE Communications Surveys & Tutorials, Vol. 10, No. 4, Fourth Quarter 2008, pp. 56-76.

Outline
1. Introduction to the article
2. Traffic properties
3. Requirements for operational deployment
4. Traffic classification metrics
5. Application of Machine Learning in IP Traffic Classification

Section 1 gives a short introduction to the article. Section 2 explains why traffic is classified and discusses the limitations of traditional port- and payload-based classification. Section 3 goes deeper into the requirements for a deployed classifier. Section 4 introduces a number of metrics for assessing classification accuracy. Section 5 provides background information about ML and how it can be applied in IP traffic classification.

1. Introduction to the article
- Easy to read, well-structured
- Clear terms and abbreviations, no formulas
- Machine learning basics in Section II
- Materials from 2004-2008
- Colors in the presentation, skipping slides, references

2. Traffic properties
- Why classify, and who is interested in what
- What and how to inspect

Why classify IP traffic
Real-time traffic classification may be a core part of:
- Automated intrusion detection systems
- Quality of service (QoS)
- Lawful interception (LI)

Automated intrusion detection systems
Real-time traffic classification may be a core part of automated intrusion detection systems, used to:
- detect patterns indicative of denial of service attacks,
- trigger automated re-allocation of network resources for priority customers,
- identify customer use of network resources that in some way contravenes the operator's terms of service.

Quality of Service (QoS)
- Real-time traffic classification is the core component of emerging QoS-enabled products and automated QoS architectures.
- All QoS schemes have some degree of IP traffic classification implicit in their design (DiffServ, IntServ).
- Detected application classes allow specific QoS requirements to be communicated between Internet applications and the network.
- A pricing mechanism could be built to differentiate customers with different needs and charge for the QoS that they receive, based on the classification of their traffic use.

Lawful interception (LI)
- Traffic classification is a solution to the requirement that ISPs provide LI capabilities.
- Traffic classification techniques offer the possibility of identifying traffic patterns (which endpoints are exchanging packets and when), and what classes of applications are being used by a person of interest at any given point in time.
- Depending on the particular traffic classification scheme, this information may be obtained without violating any privacy laws covering the TCP or UDP payloads of the ISP customer's traffic.

Who is interested
- Real-time traffic classification has the potential to solve difficult network management problems for Internet service providers (ISPs) and their equipment vendors.
- Network operators need to know promptly what is flowing over their networks so they can react quickly in support of their various business goals.
- Governments are clarifying ISP obligations with respect to lawful interception (LI) of IP data traffic.

Interested in what
- The main goal of traffic analysis is to determine, in real time, which applications users are running. This is not achievable at present.
- A narrower goal can be to classify one or more applications of interest.
- The interim goal is to cluster traffic into groups that have similar patterns. Those groups can then be used to determine applications.

What to inspect: flows
Successive IP packets having the same 5-tuple of protocol type, source address:port and destination address:port are considered to belong to a flow whose controlling application we wish to determine.
- Unidirectional flow: a series of packets sharing the same 5-tuple: source and destination IP addresses, source and destination ports, and protocol number.
- Bi-directional flow: a pair of unidirectional flows going in opposite directions between the same source and destination IP addresses and ports.
- Full-flow: a bi-directional flow captured over its entire lifetime, from the establishment to the end of the communication connection.
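The flow definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: packets are taken to be already parsed, and the `Packet` type and the helper names are hypothetical, not taken from the survey.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical parsed packet header; the field names are assumptions."""
    proto: int    # IP protocol number (6 = TCP, 17 = UDP)
    src: str
    sport: int
    dst: str
    dport: int
    length: int   # bytes


def unidirectional_key(p):
    """The 5-tuple that identifies a unidirectional flow."""
    return (p.proto, p.src, p.sport, p.dst, p.dport)


def bidirectional_key(p):
    """Same key for both directions: the two endpoints are put in sorted order."""
    first, second = sorted([(p.src, p.sport), (p.dst, p.dport)])
    return (p.proto, first, second)


def group_flows(packets, key=bidirectional_key):
    """Group packets into flows according to the chosen key function."""
    flows = defaultdict(list)
    for p in packets:
        flows[key(p)].append(p)
    return dict(flows)


packets = [
    Packet(6, "10.0.0.1", 40000, "192.0.2.7", 80, 60),     # client to server
    Packet(6, "192.0.2.7", 80, "10.0.0.1", 40000, 1500),   # server to client
    Packet(17, "10.0.0.1", 5353, "192.0.2.9", 53, 80),     # unrelated UDP packet
]
uni = group_flows(packets, key=unidirectional_key)
bi = group_flows(packets)
print(len(uni), len(bi))  # 3 unidirectional flows, 2 bi-directional flows
```

Sorting the endpoints is one simple way to make the key independent of packet direction; a full-flow would additionally require tracking connection start and end, which this sketch does not attempt.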

How to inspect: early approaches

Table 1

Complexity: Simple
Traffic properties: Inferring the controlling application's identity by assuming that most applications consistently use well-known TCP or UDP port numbers (visible in the TCP or UDP headers).
Limitations: Many applications are increasingly using unpredictable (or at least obscure) port numbers. (II-C1)

Complexity: Sophisticated
Traffic properties: Looking for application-specific data (or well-known protocol behavior) within the TCP or UDP payloads ("deep packet inspection").
Limitations: Encryption of packet contents. Privacy regulations. Heavy operational load (repeated updates to follow changes in every application's packet formats). (II-C2)
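As a rough illustration of the simple approach and its main limitation, port-based classification amounts to a table lookup. The port table below is a small, hand-picked subset of well-known ports chosen for this sketch, and `classify_by_port` is a hypothetical helper name.

```python
# Small illustrative subset of well-known ports (an assumption for this sketch).
WELL_KNOWN_PORTS = {22: "ssh", 25: "smtp", 53: "dns", 80: "http", 443: "https"}


def classify_by_port(sport, dport):
    """Guess the application from either port; return 'unknown' if neither is listed."""
    for port in (dport, sport):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"


print(classify_by_port(40000, 80))     # http
print(classify_by_port(51413, 50321))  # unknown: unpredictable ports defeat the lookup
```

The second call shows why the approach degrades: an application that picks arbitrary ports simply falls outside the table.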

How to inspect: newer approaches

Table 2

Complexity: Involves machine learning
Traffic properties: Recognising statistical patterns in externally observable attributes of the traffic, such as typical packet lengths in each direction, inter-packet arrival times/periodicity, flow durations, flow idle time, and intra-class homogeneity. (II-D)
Limitations:
- Every ML algorithm has a different approach to sorting and prioritising sets of features, which leads to different dynamic behaviors during training and classification.
- Applications change their statistical properties over time.
- Malicious attacks might disguise themselves with the statistical properties of a trusted application.
- The classifier itself might have been (re)started whilst thousands of flows were already active through a network monitoring point (missing the starts).
- Applications exhibit different statistical properties in the client-to-server and server-to-client directions, so the classifier must know the direction of a flow.
- Etc. (III-C1, 2, 3, 4)

3. Requirements for operational deployment
- Timely and continuous classification
- Directional neutrality
- Efficient use of memory and processors
- Portability and robustness

Timely and continuous classification
- A timely classifier should reach its decision using as few packets as possible from each flow while the traffic is flowing (rather than waiting until each flow completes before reaching a decision).
- Reducing the number of packets required for classification also reduces the memory required to buffer packets during feature calculations. This is an important consideration when the classifier is calculating features for (tens of) thousands of concurrent flows.
- However, it is not sufficient to simply classify based on the first few packets of a flow. (See Limitations in Table 2.)
- Consequently, the classifier should ideally perform continuous classification, recomputing its classification decision throughout the lifetime of every flow.
- The fact that many applications change their statistical properties over time must be addressed. A flow should ideally be correctly classified as being the same application throughout the flow's lifetime.
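One way to picture continuous classification with bounded memory is a fixed-size sliding window per flow, with the decision recomputed on every packet. The sketch below is an assumption-laden illustration: the window size is arbitrary, and `toy_classify` is a hypothetical threshold rule standing in for a trained model.

```python
from collections import defaultdict, deque

WINDOW = 5  # packets remembered per flow; an arbitrary value for illustration


def toy_classify(lengths):
    """Hypothetical stand-in for a trained model: thresholds the mean packet length."""
    mean = sum(lengths) / len(lengths)
    return "bulk" if mean > 1000 else "interactive"


# One bounded window per flow, so memory per flow never exceeds WINDOW entries.
windows = defaultdict(lambda: deque(maxlen=WINDOW))


def on_packet(flow_id, length):
    """Update the flow's window and return the recomputed decision."""
    window = windows[flow_id]
    window.append(length)  # the oldest entry is dropped automatically when full
    return toy_classify(window)


decisions = [on_packet("flow-1", n) for n in [60, 80, 1500, 1500, 1500, 1500, 1500, 1500]]
print(decisions)
```

In this toy run the decision changes once the window fills with large packets, which is exactly the kind of mid-flow change in statistical properties that a deployed classifier has to handle without mislabelling the application.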

Directional neutrality
- Application flows are assumed to be bi-directional, and the application's statistical features are calculated separately in the forward and reverse directions.
- Consequently, the classifier must either know the direction of a previously unseen flow (for example, which end is the server and which is the client) or be trained to recognise an application of interest without relying on external indications of directionality.
- Inferring the server and client ends of a flow is fraught with practical difficulties.
- As a real-world classifier should not presume that it has seen the first packet of every flow currently being evaluated, it cannot be sure whether the first packet it sees (of any new bi-directional flow of packets) is heading in the forward or reverse direction.
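One possible way to train without relying on direction, offered here as an assumption rather than as the method the slides describe, is to add a mirrored copy of each training instance with its forward and reverse features swapped. The feature names below are hypothetical.

```python
def mirror(instance):
    """Swap the forward and reverse features of one training instance."""
    return {
        "fwd_mean_len": instance["rev_mean_len"],
        "rev_mean_len": instance["fwd_mean_len"],
        "label": instance["label"],
    }


def augment(training_set):
    """Return the training set plus a mirrored copy of every instance."""
    return training_set + [mirror(x) for x in training_set]


# Made-up instance: small requests one way, large responses the other way.
training = [{"fwd_mean_len": 80.0, "rev_mean_len": 1400.0, "label": "web"}]
augmented = augment(training)
print(augmented)
```

A model trained on the augmented set sees the same application from both ends, so it does not matter which packet the classifier happens to observe first.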

Efficient use of memory and processors
- The classifier's efficiency affects the financial cost to build, purchase and operate large-scale traffic classification systems.
- An inefficient classifier may be inappropriate for operational use regardless of how quickly it can be trained and how accurately it identifies flows.
- Minimising CPU cycles and memory consumption is advantageous whether the classifier is expected to sit in the middle of an ISP network (where a small number of large, powerful devices may see hundreds of thousands of concurrent flows at multi-gigabit rates) or out toward the edges (where the traffic load is substantially smaller, but the CPU power and memory resources of individual devices are also diminished).

Portability and Robustness
- A model may be considered robust if it provides consistent accuracy in the face of network layer perturbations such as packet loss, traffic shaping, packet fragmentation, and jitter.
- A classifier is also robust if it can efficiently identify the emergence of new traffic applications.
- A model may be considered portable if it can be used in a variety of network locations.

4. Traffic classification metrics
- Positives, negatives
- Precision and recall
- Byte and flow accuracy

Model: binary classification
- Assume there is a traffic class X in which we are interested, mixed in with a broader set of IP traffic.
- A traffic classifier is being used to identify (classify) packets (or flows of packets) belonging to class X when presented with a mixture of previously unseen traffic.
- The classifier is presumed to give one of two outputs: a flow (or packet) is believed to be a member of class X, or it is not.

Positives and Negatives
- False Negatives (FN): Percentage of members of class X incorrectly classified as not belonging to class X.
- False Positives (FP): Percentage of members of other classes incorrectly classified as belonging to class X.
- True Positives (TP): Percentage of members of class X correctly classified as belonging to class X (equivalent to 100% - FN).
- True Negatives (TN): Percentage of members of other classes correctly classified as not belonging to class X (equivalent to 100% - FP).
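The four percentages defined above can be computed directly from labelled test data. The sketch below uses made-up labels; `confusion_rates` is a hypothetical helper and assumes the test set contains at least one member of class X and at least one member of another class.

```python
def confusion_rates(actual, predicted, target="X"):
    """Return TP, FN, FP and TN as percentages, following the definitions above."""
    pairs = list(zip(actual, predicted))
    members = [p for a, p in pairs if a == target]   # predictions for true class X
    others = [p for a, p in pairs if a != target]    # predictions for all other classes
    tp = 100.0 * sum(p == target for p in members) / len(members)
    fp = 100.0 * sum(p == target for p in others) / len(others)
    return {"TP": tp, "FN": 100.0 - tp, "FP": fp, "TN": 100.0 - fp}


# Made-up example: 4 flows of class X and 6 flows of another class Y.
actual = ["X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y", "Y"]
predicted = ["X", "X", "X", "Y", "X", "Y", "Y", "Y", "Y", "Y"]
rates = confusion_rates(actual, predicted)
print(rates)
```

Here three of the four class X flows are recognised (TP = 75%, FN = 25%), and one of the six other flows is wrongly labelled as X (FP is about 16.7%, TN about 83.3%).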

Precision (P) and Recall (R)
- Recall: Percentage of members of class X correctly classified as belonging to class X. If all metrics are considered to range from 0 (bad) to 100% (optimal), it can be seen that Recall is equivalent to TP.
- Precision: Percentage of those instances that truly have class X, among all those classified as class X.
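A small sketch can make the difference between the two metrics concrete. It works with raw counts of true positives, false positives and false negatives rather than the per-class percentages used on the previous slide; the labels are made up and `precision_recall` is a hypothetical helper.

```python
def precision_recall(actual, predicted, target="X"):
    """Return (precision, recall) for class `target`, both as percentages."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == target and p == target for a, p in pairs)  # X classified as X
    fp = sum(a != target and p == target for a, p in pairs)  # non-X classified as X
    fn = sum(a == target and p != target for a, p in pairs)  # X classified as non-X
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


actual = ["X", "X", "X", "X", "Y", "Y", "Y", "Y"]
predicted = ["X", "X", "X", "Y", "X", "X", "Y", "Y"]
precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.1f}% recall={recall:.1f}%")  # precision=60.0% recall=75.0%
```

Recall only looks at the true members of class X (3 of 4 found), while precision looks at everything labelled X (3 of 5 labels correct), so the two can diverge considerably.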

Accuracy
Accuracy is generally defined as the percentage of correctly classified instances among the total number of instances.
- Flow accuracy: measuring the accuracy with which flows are correctly classified, relative to the number of other flows in the author's test and/or training dataset(s).
- Byte accuracy: how many bytes are carried by the packets of correctly classified flows, relative to the total number of bytes in the author's test and/or training dataset(s).
- The majority of flows on the Internet are small and account for only a small portion of total bytes and packets in the network (mice flows).
- The majority of the traffic bytes are generated by a small number of large flows (elephant flows).
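The gap between flow accuracy and byte accuracy is easiest to see with a toy dataset containing a single misclassified elephant flow. All values below are made up for illustration.

```python
# (true class, predicted class, bytes carried); made-up values
flows = [
    ("web", "web", 2_000),
    ("web", "web", 3_000),
    ("dns", "dns", 200),
    ("p2p", "web", 900_000),  # one misclassified elephant flow
]

correct_flows = sum(true == pred for true, pred, _ in flows)
correct_bytes = sum(size for true, pred, size in flows if true == pred)
total_bytes = sum(size for _, _, size in flows)

flow_accuracy = 100.0 * correct_flows / len(flows)
byte_accuracy = 100.0 * correct_bytes / total_bytes
print(f"flow accuracy={flow_accuracy:.1f}% byte accuracy={byte_accuracy:.2f}%")
```

Three of four flows are classified correctly, yet those three mice flows carry well under 1% of the bytes, so byte accuracy collapses.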

Choice of metrics
Whether precision, flow accuracy or byte accuracy is more important will generally depend on the classifier's intended use.
Examples:
- QoS purposes. Identifying every instance of a short-lived flow needing QoS (such as a 5-minute, 32 Kbit/sec phone call) is as important as identifying long-lived flows needing QoS (such as a 30-minute, 256 Kbit/sec video conference), with both being far more important to identify correctly than the few flows that represent multi-hour (and/or hundreds of megabytes) peer-to-peer file sharing sessions.
- Load patterns on the network. ISPs are interested in correctly classifying the applications driving the elephant flows that contribute a disproportionate number of packets across their network.

5. Application of Machine Learning in IP Traffic Classification
- ML: input and output, types of machine learning
- IP traffic: training and testing model
- Efficiency, timing, resources

Input and output of an ML process
- Machine Learning (ML) is the process of finding and describing structural patterns in a supplied dataset.
- The output is the description of the knowledge that has been learnt from the input.
- ML takes input in the form of a dataset of instances (also known as examples).
- An instance refers to an individual, independent example of the dataset.
- Each instance is characterised by the values of its features (also known as attributes or discriminators) that measure different aspects of the instance.

ML in IP traffic classification
- A class usually indicates IP traffic caused by (or belonging to) an application or group of applications.
- Instances are usually multiple packets belonging to the same flow.
- Features are typically numerical attributes calculated over multiple packets belonging to individual flows. Examples include mean packet lengths, standard deviation of inter-packet arrival times, total flow lengths (in bytes and/or packets), Fourier transform of packet inter-arrival time, and so on.
- Practical ML classifiers choose the smallest set of features that leads to efficient differentiation between members of a class and other traffic outside the class.
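A few of the features named above can be computed with the Python standard library. This is a minimal sketch: the input format (time-ordered pairs of timestamp and packet length) and the `flow_features` name are assumptions, and the Fourier-transform feature is left out.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute simple per-flow features.

    packets: list of (timestamp_seconds, length_bytes) tuples, ordered by time.
    """
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packets": len(packets),                    # total flow length in packets
        "bytes": sum(lengths),                      # total flow length in bytes
        "mean_len": mean(lengths),                  # mean packet length
        "iat_std": pstdev(gaps) if gaps else 0.0,   # std dev of inter-arrival times
    }


# Made-up flow: four equally sized packets sent every 20 ms.
voip_like = [(0.00, 200), (0.02, 200), (0.04, 200), (0.06, 200)]
features = flow_features(voip_like)
print(features)
```

For this evenly paced flow the inter-arrival standard deviation is essentially zero, which is the kind of regularity a classifier could use to separate constant-rate traffic from bursty traffic.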

Different types of learning
- Classification (or supervised learning) (Figures 1-3)
- Clustering (or unsupervised learning)
- Association between features
- Numeric prediction (the outcome to be predicted is not a discrete class but a numeric quantity)

Figure 1. Training and testing for a two-class supervised ML traffic classifier

Figure 2. Training the supervised ML traffic classifier. Slide annotations: "of both instances"; "feature generation result"; "optional for better performance" (III-A7).

Figure 3. Data flow within an operational supervised ML traffic classifier. Slide annotations: "class"; "optional as in Figure 2".
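The train, test and classify stages named in the figure captions can be sketched end to end with a deliberately simple nearest-centroid classifier. The algorithm is chosen here for brevity only and is not presented as one the survey recommends; the feature values and the `train` and `classify` helpers are made up for illustration.

```python
from math import dist


def train(instances):
    """Build a model mapping each label to the centroid of its feature vectors.

    instances: list of (feature_vector, label) pairs.
    """
    sums, counts = {}, {}
    for vec, label in instances:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(s / counts[label] for s in acc) for label, acc in sums.items()}


def classify(model, vec):
    """Return the label whose centroid is closest to the feature vector."""
    return min(model, key=lambda label: dist(model[label], vec))


# Features: (mean packet length in bytes, mean inter-arrival time in ms); made-up values.
training = [((200, 20), "X"), ((210, 22), "X"), ((1400, 5), "other"), ((1300, 8), "other")]
testing = [((190, 19), "X"), ((1350, 6), "other")]

model = train(training)                                       # training phase
correct = sum(classify(model, vec) == label for vec, label in testing)  # testing phase
print(f"{correct}/{len(testing)} test flows classified correctly")
print(classify(model, (205, 21)))                             # operational use on an unseen flow
```

Keeping the test set separate from the training set mirrors the two-phase structure in Figure 1; in practice the features would also be scaled so that no single one dominates the distance.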

Questions

Sources: pictures
ISP. https://pixelprivacy.com/resources/stop-isp-throttling/
Network operators. https://wnet.net.in/
Government internet control. https://ict4bop.files.wordpress.com/2012/04/internetcensorship061208.gif
Classifying internet traffic. https://www.incapsula.com/blog/bot-traffic-report-2015.html
http://blogs.salleurl.edu/data-center-solutions/2016/04/qos-b2016/
Recall. https://commons.wikimedia.org/wiki/file:recall-precision.svg