Network Traffic Measurements and Analysis


DEIB - Politecnico di Milano Fall, 2017

Introduction Often we have only a set of features x = (x_1, x_2, ..., x_n) but no associated response y. We are therefore not interested in prediction or classification, but simply in analyzing the data. Can we discover patterns or subgroups among variables or observations? Is there an informative way to visualize the data? To answer these questions, one applies unsupervised learning techniques.

Agenda. Unsupervised learning techniques:
- Clustering: K-means, hierarchical, density-based
- Dimensionality reduction: Principal Component Analysis (PCA)
- Density estimation

The goal of clustering is to group data together based solely on the features x. Applications:
- Market segmentation: group customers into different segments based on the traffic they produce.
- Mobile traffic segmentation: identify areas in a city which have common traffic signatures.
- User behavior estimation: group together users with the same behavior.

K-Means. K-means is by far the most widely used clustering algorithm.
Goal: group the data into K clusters.
Input: m unlabeled observations x^(i) and K, the number of clusters to produce.
Output: K centroids (cluster centers) and a cluster assignment for each of the m input observations.

K-Means steps. Randomly initialize the K cluster centroids {μ_1, μ_2, ..., μ_K}. Repeat until convergence:
- Assignment: assign each observation x^(i) to the closest cluster centroid, producing labels c^(i).
- Update: compute the new cluster centers μ_k by averaging all x^(i) such that c^(i) = k.
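A minimal NumPy sketch of these two steps, assuming the observations are stored in an (m, n) array X; the function name, the convergence test and the handling of edge cases are illustrative, not part of the original slides:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: X is an (m, n) array of observations, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize the K centroids by picking K observations at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each observation with the index of the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: convergence
        centroids = new_centroids
    return centroids, labels
```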

K-Means example. Let's see a running example (scatter plot of Avg. Usage [Mbps] vs. Connection duration [s]).

K-Means example. After convergence (the same scatter plot of Avg. Usage [Mbps] vs. Connection duration [s], showing the resulting clusters).

Cost function. K-means tries to minimize the following cost function:

J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2    (1)

This cost function:
- has in general many local minima for a fixed K (the algorithm may converge to one of them);
- is decreasing for increasing K; when K = m, the optimal value is J = 0.
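Continuing the sketch above, the cost (1) of a given clustering can be computed directly; this is a hedged illustration reusing the hypothetical kmeans output:

```python
import numpy as np

def kmeans_cost(X, centroids, labels):
    """Cost J of Eq. (1): average squared distance of each point to its assigned centroid."""
    return float(np.mean(np.sum((X - centroids[labels]) ** 2, axis=1)))
```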

Algorithm initialization. Possible options:
- randomly pick the cluster centroids in the feature space;
- randomly pick the cluster centroids from the observation set (recommended option).
However, due to the random initialization, the algorithm may converge to a local optimum. Solution: run K-means multiple times (50-1000 runs), track the final cost J of each run, and pick the clustering with the lowest J.
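In practice a library implementation can run the restarts for you; a sketch with scikit-learn (assuming it is available), where n_init controls the number of random initializations and inertia_ stores the final cost of the best run:

```python
from sklearn.cluster import KMeans

# Run 50 random initializations and keep the clustering with the lowest cost J.
km = KMeans(n_clusters=3, init="random", n_init=50, random_state=0).fit(X)
print(km.cluster_centers_)   # centroids of the best run
print(km.inertia_)           # its final cost (sum of squared distances to the centroids)
```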

How to choose K. In general, this is not easy; mostly one chooses K manually. Sometimes it is not even easy to choose manually.

Elbow Method. Plot the best J versus the number of clusters K and identify the elbow. This does not work every time. However, sometimes one has a downstream purpose and can evaluate the optimal number of clusters based on it (e.g., traffic policies).
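A possible sketch of the elbow plot, sweeping K and recording the best J of each fit; the range of K values is illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), costs, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Best J")
plt.show()
```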

Variants.
- K-median: replace the mean operation with the median.
- K-medoid: cluster centers must belong to the original set of observations.

Hierarchical clustering. One disadvantage of K-means is that we need to specify the number of clusters K. Hierarchical clustering partially solves this issue by creating a dendrogram.

Dendrogram.
- Each leaf of the dendrogram is an observation.
- The height on the vertical axis at which two leaves join represents their distance.
- For a fixed value of the distance (a horizontal cut), the dendrogram identifies a certain number of clusters (similar to K for K-means).

Creating the dendrogram. Bottom-up approach: treat each observation as an individual cluster, then repeat:
- compute all the pairwise distances between clusters;
- identify the two closest clusters and join them; their distance identifies the height at which they join.
To compute distances between clusters with more than one observation, a proper linkage is used:
- Average: compute all pairwise distances and output the average.
- Complete: compute all pairwise distances and output the maximum.
- Single: compute all pairwise distances and output the minimum.
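A minimal SciPy sketch of this bottom-up procedure, assuming the data sit in an (m, n) array X; the cut height t=5.0 is purely illustrative:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Agglomerative clustering with average linkage ("complete" and "single"
# select the other two linkages described above).
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)   # each leaf is one observation
plt.show()

# Cutting the dendrogram at a fixed distance yields the cluster labels.
labels = fcluster(Z, t=5.0, criterion="distance")
```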

DBSCAN. A density-based clustering algorithm that groups together observations that are closely packed. Advantages compared to K-means:
- does not require specifying the number of clusters;
- can find arbitrarily shaped and non-linearly separable clusters;
- identifies and marks as outliers points that lie alone in low-density regions;
- requires only two parameters: ε and minPts.

DBSCAN. For each non-visited observation in the dataset:
- find all neighbors within a distance of ε;
- if the number of neighbors is ≥ minPts, the observation is a core point;
- if the number of neighbors is < minPts but at least one of them is a core point, the observation is a border point;
- otherwise, the observation is marked as noise.
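A scikit-learn sketch of DBSCAN; eps corresponds to ε and min_samples to minPts, and both values below are illustrative:

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_   # one cluster index per observation; -1 marks noise/outliers

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
```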

DBSCAN clusters. K-means vs. DBSCAN (comparison of the two clusterings).

Practical issues in clustering. In order to perform clustering, some decisions must be made:
- For K-means, how many clusters should we use?
- For hierarchical clustering, what kind of linkage and which horizontal cut should be used?
- For DBSCAN, which ε and minPts?
- In general, what kind of distance metric should we use? Euclidean, Hamming, correlation-based?
The answer is not unique: run these algorithms several times with different parameters and select the settings that give the most useful or interpretable solution (Q: should we use cross-validation?). In general, validating a clustering is a difficult task.

Principal Component Analysis. PCA is a fundamental and powerful technique used in several fields of computing. It maps the n-dimensional feature space x into a p-dimensional representation, with p < n. In doing so, some approximation (information loss) is tolerated. Uses:
- Data compression: represent roughly the same information with less data.
- Speeding up learning algorithms: less data, fewer computations.
- Data visualization: 2- or 3-dimensional plots become possible.
- Removal of correlated variables.

Problem formulation. Given a dataset whose observations lie in an n-dimensional space, project the observations onto a set of principal components so that:
- the first component has the largest variance;
- every following component has the largest possible variance while being orthogonal to the preceding components.
PCA therefore just represents the data in a different fashion; however, the first principal components are the ones that capture most of the information (variance) contained in the data.

PCA. (Left: the data with the first and second principal components drawn in the original feature space, DL Volume (bytes) vs. DL Volume (bits). Right: the same data in the principal-component domain, Second Principal Component vs. First Principal Component.)

Algorithm in a nutshell. The exact mathematical derivation is out of the scope of this course.
- Preprocessing: normalize the data so that each feature has zero mean and varies approximately between -1 and 1.
- Compute the covariance matrix \Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T.
- Compute the matrix U of eigenvectors of Σ: those are the principal components. The eigenvector associated with the largest eigenvalue is the first principal component, and so on.
- z = U^T x gives the data values in the principal-component domain.
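A NumPy sketch of this recipe, assuming an (m, n) data matrix X; the plain z-score normalization is one possible way to satisfy the preprocessing step:

```python
import numpy as np

# Preprocessing: zero-mean features with roughly unit scale.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix Sigma = (1/m) * sum_i x^(i) (x^(i))^T
m = X_norm.shape[0]
Sigma = (X_norm.T @ X_norm) / m

# Eigendecomposition of Sigma; eigh returns ascending eigenvalues,
# so reverse the order to get components sorted by decreasing variance.
eigvals, U = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

Z = X_norm @ U   # z = U^T x applied to every observation (rows of X_norm)
```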

Dimensionality reduction. Once the PCA has been computed, one has n principal components (directions in R^n). To reduce the dimensionality of the data, one can project the data onto only the first p principal components and obtain a lower-dimensional approximation:
- compute U_red by retaining only the first p columns of U;
- compute z_red = U_red^T x, the data projected onto the first p PCs;
- go back to the original domain by computing x̃ = U_red z_red.
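Continuing the sketch above, the projection onto the first p components and the mapping back to the original domain (p = 2 is an illustrative choice):

```python
p = 2
U_red = U[:, :p]            # keep only the first p principal components
Z_red = X_norm @ U_red      # z_red = U_red^T x for every observation
X_approx = Z_red @ U_red.T  # approximate reconstruction in the original (normalized) domain
```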

PCA approximation. (Plot of the original data and its PCA approximation, DL Volume (bytes) vs. DL Volume (bits).)

Dimensionality reduction. How to choose p? For visualization, p = 2 or p = 3 allows the data to be visualized easily and helps understand what kind of approach should be taken. In general, one can select p so that a certain quality of approximation is guaranteed. Typically, one selects p so that the ratio between the variance of the approximation error and the variance of the data is below a threshold:

\frac{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} - \tilde{x}^{(i)} \rVert^2}{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} \rVert^2} \le 0.01    (2)
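Since the (normalized) data are zero-mean, the error ratio in (2) equals one minus the fraction of total variance captured by the first p eigenvalues, so p can be chosen directly from the eigenvalues of the sketch above; a hedged illustration:

```python
import numpy as np

def choose_p(eigvals, threshold=0.01):
    """Smallest p such that the error ratio of Eq. (2) is below the threshold."""
    error_ratio = 1.0 - np.cumsum(eigvals) / np.sum(eigvals)
    # At p = n the ratio is 0, so a valid p always exists.
    return int(np.argmax(error_ratio <= threshold)) + 1
```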

Problem motivation (anomaly detection). An unsupervised learning problem with aspects of supervised learning: identify observations with anomalous behaviour. It is somewhat similar to a classification task (anomalous/not anomalous), but:
- in general, we have many normal examples and just a few anomalous examples;
- anomalous examples may be very different from each other.
It is therefore better to learn what anomalies do not look like, i.e., to model normal behaviour.

Density estimation. Assume we collect data on different TCP connections. We look at only two features of such connections (scatter plot of Downloaded bytes vs. Duration [s]).

Density estimation. The problem of anomaly detection can be solved in a probabilistic way. The overall idea is quite simple:
- identify features x_i that might be indicative of anomalies;
- estimate the joint probability density function of such features, p(x_1, x_2, ..., x_n);
- given a new example x, compute p(x) and flag it as an anomaly if p(x) < ε.

Density estimation. In the simplest case, to compute p(x) we assume that

p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)    (3)

where p(x_1), ..., p(x_n) are Gaussian distributions. As in Naive Bayes, this algorithm assumes that the features are independent.
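A sketch of this independent-Gaussian model with SciPy, assuming X_train holds (mostly) non-anomalous examples, x_new is a new observation, and epsilon is a threshold chosen later, e.g. on a cross-validation set:

```python
import numpy as np
from scipy.stats import norm

# Fit one Gaussian per feature on the training data.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

def p_independent(x):
    """p(x) under the independence assumption of Eq. (3)."""
    return float(np.prod(norm.pdf(x, loc=mu, scale=sigma)))

is_anomaly = p_independent(x_new) < epsilon   # flag as anomalous if the density is too low
```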

Example. (Plots of the data in the Duration [s] / Downloaded bytes plane, the per-feature Gaussian densities, and the resulting joint density p(x).)

Multivariate Gaussian. Often, the features used for anomaly detection are not independent. This can cause the method to fail. (Scatter plot of two correlated features x_1 and x_2.)

Example of failure. What happens for the point x_1 = 95, x_2 = 1120? (Plots of the data, the marginal densities p(x_1) and p(x_2), and the estimated p(x): the point looks normal under each marginal, yet it lies far from the bulk of the data.)

Multivariate density estimation. It is better to model the joint distribution all in one go, instead of looking at each feature separately. Multivariate Gaussian distribution for n features:

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \lvert \Sigma \rvert^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)    (4)

where x and μ are n-dimensional vectors and Σ is the covariance matrix (the same one used in PCA).
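The corresponding multivariate sketch with SciPy, again assuming hypothetical X_train, x_new and epsilon variables:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = X_train.mean(axis=0)
Sigma = np.cov(X_train, rowvar=False)   # covariance matrix, captures feature correlations

model = multivariate_normal(mean=mu, cov=Sigma)   # the density of Eq. (4)
is_anomaly = model.pdf(x_new) < epsilon
```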

Example. (Plots of the data and of the fitted multivariate Gaussian density p(x) over x_1 and x_2.)

Independent vs. multivariate. Independent density estimation:
- used more often;
- sometimes there is a need to manually create features to capture anomalies;
- computationally cheap and scales well;
- works well with small training sets.
Multivariate density estimation:
- captures correlation among features;
- less computationally efficient: needs to compute Σ^{-1};
- needs m > n.

Evaluating an anomaly detection system. Generally, one always has some labeled data: usually many non-anomalous examples and just a few anomalous ones. The evaluation of an anomaly detection system can be performed using the standard cross-validation method. Assume we have, e.g., 10000 non-anomalous examples and just 20 anomalous examples. A good way to split the data may be:
- 6000 non-anomalous examples for density estimation;
- 2000 non-anomalous examples and 10 anomalous ones in the CV set, used to choose which features to use, ε, etc.;
- 2000 non-anomalous examples and 10 anomalous ones in the test set.
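A sketch of how ε could be chosen on the CV set: sweeping candidate thresholds and keeping the one with the best F1 score is an assumption (any metric suited to such an imbalanced problem would do), and X_cv / y_cv (with 1 marking anomalies) are hypothetical names:

```python
import numpy as np
from sklearn.metrics import f1_score

p_cv = np.array([p_independent(x) for x in X_cv])   # densities on the CV examples

best_eps, best_f1 = None, -1.0
for eps in np.linspace(p_cv.min(), p_cv.max(), num=1000):
    preds = (p_cv < eps).astype(int)   # 1 = flagged as anomalous
    score = f1_score(y_cv, preds)
    if score > best_f1:
        best_eps, best_f1 = eps, score
```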