Network Traffic Measurements and Analysis


DEIB - Politecnico di Milano, Fall 2017

Sources:
Hastie, Tibshirani, Friedman: The Elements of Statistical Learning
James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning
Andrew Ng: Machine Learning course

Introduction
Given this data, what throughput should we expect when 100 people are connected to the BS?
[Figure: scatter plot of Expected throughput (Mbps) vs. Number of active UE]
Fit a straight line? Fit a second-order polynomial?

Supervised learning problems
Two types of problems:
Regression: predict a continuous-valued output
Classification: classify data into discrete classes
Both generally consider many features:
Predict the throughput based on the number of connected users, signal strength, device type
Classify incoming traffic as malicious based on source IP address, length of flows, evil bit

Agenda
We'll focus on the following supervised learning algorithms:
Regression: (Multiple) Linear regression, Nearest Neighbour
Classification: Logistic Regression, Decision trees, Naive Bayes classifiers
Other very popular and powerful algorithms are available: Support Vector Machines (SVM), Neural Networks, Random Forests...

Linear Regression
Starting point: the training set
m training examples
x: input variables (features, predictors)
y: output variable (target, response)
(x^{(i)}, y^{(i)}): the i-th training example
Linear regression takes the training set and generates a hypothesis function h that tries to estimate the value of y given x:
h_θ(x) = θ_0 + θ_1 x (1)
θ_0 and θ_1 are the parameters, determined by a learning system and used to output a prediction ŷ based on the input x.

Linear Regression - implementation
How to choose the parameters? We would like to choose θ s.t. h_θ(x) is close to y for our training examples.
h_θ(x) tries to convert x into y. Since we have both x and y, we can evaluate how well h_θ(x) does this...
Define the Mean Squared Error (MSE) cost function:
J(θ) = (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)})²
We want to solve a minimization problem:
min_θ J(θ) (2)

Linear Regression - Normal Equation
Let X (the design matrix) and y be
X = [1 x^{(1)}; 1 x^{(2)}; 1 x^{(3)}; ... ; 1 x^{(m)}],   y = [y^{(1)}; y^{(2)}; y^{(3)}; ... ; y^{(m)}]
It can be shown that the optimal parameter vector θ = [θ_0, θ_1]^T is:
θ = (X^T X)^{-1} X^T y
At an arbitrary input x, the prediction is ŷ = h_θ(x).
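As an illustration, a minimal NumPy sketch of the normal equation; the data values below are made up for the example, not taken from the figures:

```python
import numpy as np

# Toy training set: throughput (Mbps) vs. number of active UEs
x = np.array([20, 40, 60, 80, 100, 120, 140], dtype=float)
y = np.array([28.0, 24.5, 18.0, 15.5, 10.0, 6.5, 2.0])

# Design matrix: a column of ones (for theta_0) next to the feature
X = np.column_stack([np.ones_like(x), x])

# Normal equation theta = (X^T X)^{-1} X^T y; solve() is used instead of
# an explicit inverse for numerical stability
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Predict the throughput for 100 active UEs
print(theta[0] + theta[1] * 100.0)
```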

[Figure: linear regression fit y = -0.2315·x + 32.54 over the Expected throughput (Mbps) vs. Number of active UE data]

Multilinear regression
Sometimes we would like to use more than one feature for making our prediction, e.g.:
the throughput may be predicted based on (i) how many users are connected, x_1, and (ii) the channel quality, x_2
we may also create new features starting from existing ones, e.g., the square of the number of connected users, x_3 = x_1²
in general, we can have n features x_1, ..., x_n
The hypothesis in the case of multilinear regression is:
h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n (3)

Multilinear regression
The normal equation can still be used as a solution, with the following design matrix:
X = [1 x_1^{(1)} x_2^{(1)} ... x_n^{(1)}; 1 x_1^{(2)} x_2^{(2)} ... x_n^{(2)}; ... ; 1 x_1^{(m)} x_2^{(m)} ... x_n^{(m)}]
and θ = [θ_0, θ_1, ..., θ_n]^T = (X^T X)^{-1} X^T y
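A sketch of the same idea with engineered polynomial features (the powers of x become extra columns); the data and the degree are our own choices for illustration:

```python
import numpy as np

x = np.array([20, 40, 60, 80, 100, 120, 140], dtype=float)
y = np.array([30.0, 22.0, 16.0, 13.0, 10.0, 7.0, 5.0])  # made-up values

# New features from existing ones: x_1 = x, x_2 = x^2 (quadratic hypothesis)
X = np.column_stack([np.ones_like(x), x, x**2])

# Same normal equation as before, now with n = 2 features
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [theta_0, theta_1, theta_2]
```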

[Figure: multilinear regression fits over the same data; quadratic y = 0.002963·x² - 0.6374·x + 40.77 and cubic y = -1.805e-05·x³ + 0.006383·x² - 0.8·x + 42.49]

Normal equation - Practical problems
The normal equation requires computing the matrix (X^T X)^{-1}:
(X^T X) is an (n+1) × (n+1) matrix
Most implementations compute the matrix inverse in O(n³)
Slow if n is large! How large? n ≈ 1000 is still (relatively) small.
(X^T X) may be non-invertible:
Cause 1: m ≤ n (more features than examples... bad idea)
Cause 2: redundant features (some of the columns of X are linearly dependent: remove them!)

Gradient descent
What if n is too large? Can we still learn the optimal parameters θ?
The answer is the Gradient Descent algorithm:
Start with some initial values for θ, e.g. θ_i = 0 for all i
Change θ in the direction of the negative gradient, in order to reduce the cost function J(θ) a little bit.
Repeat until convergence (a local minimum is found).
For (multi)linear regression, the cost function is convex and has only a single minimum.
Gradient descent will converge to the same optimal θ found by the normal equation (if α is small enough...)

Gradient descent
Do the following until convergence, for all j simultaneously:
θ_j := θ_j - α ∂J(θ)/∂θ_j (4)
that is:
θ_j := θ_j - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)} (5)
with x_0^{(i)} = 1 for convenience.
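A minimal sketch of this update rule in NumPy; the function name and interface are our own:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression (eq. 5).

    X is the (m, n+1) design matrix whose first column is all ones,
    so that x_0^{(i)} = 1 as in the slides.
    """
    m, n = X.shape
    theta = np.zeros(n)                 # start with theta_i = 0 for all i
    for _ in range(n_iters):
        residuals = X @ theta - y       # h_theta(x^{(i)}) - y^{(i)}
        theta -= alpha * (X.T @ residuals) / m
    return theta
```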

Gradient descent
The variable α in:
θ_j := θ_j - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)} (6)
is the learning rate. How to choose it?
too small: baby steps, convergence takes too long
too big: huge steps, you can miss the minimum and fail to converge
Normalizing the features (e.g., between -1 and 1) helps in providing numerical stability.


k-nearest Neighbours Regression
Simple idea: given an unknown input x, output ŷ by averaging the k most similar examples in the training set.
How many examples should we take (i.e., what is k)?
How should we average the examples? Generally, Inverse Distance Weighting (IDW) is applied:
ŷ = Σ_k w_k y_k (7)
with w_k inversely proportional to dist(x, x^{(k)}).
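A small sketch of k-NN regression with inverse-distance weighting for a single feature; the names and the epsilon guard against zero distances are our own:

```python
import numpy as np

def knn_regress(x_new, x_train, y_train, k=4, eps=1e-9):
    """Predict y at x_new by IDW-averaging the k nearest training examples."""
    dist = np.abs(x_train - x_new)      # one feature; use a norm in general
    nearest = np.argsort(dist)[:k]      # indices of the k most similar examples
    w = 1.0 / (dist[nearest] + eps)     # w_k inversely proportional to distance
    return np.sum(w * y_train[nearest]) / np.sum(w)
```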

k-nn 40 35 Training data knn, k = 2 knn, k = 4 Normalized Expected throughput (Mbps) 30 25 20 15 10 5 0 20 40 60 80 100 120 140 Normalized Number of active UE

k-nn 40 35 Training data knn, k = 2 knn, k = 4 Normalized Expected throughput (Mbps) 30 25 20 15 10 5 0 20 40 60 80 100 120 140 Normalized Number of active UE How many samples are needed for prediction? Will it generalize well to new data points?

LOESS
Linear regression and k-NN can be fused together in the LOESS (local regression) method. For a new input x_0:
gather the k points (x_i, y_i) whose x_i are nearest to x_0
fit a linear model h_θ using only these k points
output ŷ = h_θ(x_0)
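A sketch combining the two previous snippets, as the method suggests; this is an unweighted local fit under our own interface (classic LOESS additionally distance-weights the k points):

```python
import numpy as np

def loess_predict(x0, x_train, y_train, k=10):
    """LOESS: fit a local linear model on the k points nearest to x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    X = np.column_stack([np.ones(k), x_train[nearest]])
    theta = np.linalg.solve(X.T @ X, X.T @ y_train[nearest])  # normal equation
    return theta[0] + theta[1] * x0
```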

Bias-Variance trade-off
We have seen a few algorithms. They work well in practice, but may suffer from:
underfitting (high bias): occurs when the model used is too simple (e.g. linear regression)
overfitting (high variance): occurs when the model follows the training data too closely (e.g. k-NN, high-order polynomial fitting)

Evaluating a hypothesis
How do we know if our hypothesis is suffering from underfitting or overfitting?
Linear regression works by finding θ so that the cost function J(θ) is minimized. What if we found J(θ) = 0? Does it mean we have found the perfect model?
We can't be sure unless we try to generalize! Standard way to evaluate the hypothesis: split the training set!
70% used for training the model, by minimizing J_train
20% used for computing the cross-validation error J_cv
10% used for computing the test error J_test
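A sketch of the 70-20-10 split described above, assuming NumPy arrays and our own interface:

```python
import numpy as np

def split_70_20_10(X, y, seed=0):
    """Shuffle the data and split it into train / cross-validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    m = len(y)
    i_tr, i_cv = int(0.7 * m), int(0.9 * m)
    train, cv, test = idx[:i_tr], idx[i_tr:i_cv], idx[i_cv:]
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])
```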

Evaluating a hypothesis
[Figure: linear, quadratic, and 9th-degree polynomial fits over the normalized Expected throughput (Mbps) vs. normalized Number of active UE data]
Which one will have the smallest J_train? Should we select the model based on that?

Good practices for model selection
Divide your initial data set into three parts (70-20-10)
Train all your models on the training set
Compute J_cv on the cross-validation set. Pick the model that has the smallest J_cv error
Compute the generalization error J_test on the test set

k-fold cross validation
If the total number of observations m is small, we may have too few examples in each set. Also, J_cv depends on which 20% of the data is used. To cope with these issues, we can use k-fold cross validation:
randomly divide the entire set into k folds of similar size
use the first fold as cross-validation set and the other k-1 as training set; obtain J_cv,1
repeat k times, each time changing the cross-validation set
set J_cv = (1/k) Σ_k J_cv,k
Generally, k = 5 or k = 10 is used.
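A sketch of the procedure, with `fit` and `error` left as caller-supplied functions (our own interface):

```python
import numpy as np

def k_fold_cv(X, y, fit, error, k=5, seed=0):
    """Average J_cv over k folds; fit(X, y) -> model, error(model, X, y) -> float."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for j in range(k):
        cv = folds[j]                                  # fold j is the CV set
        tr = np.concatenate([f for i, f in enumerate(folds) if i != j])
        errs.append(error(fit(X[tr], y[tr]), X[cv], y[cv]))
    return np.mean(errs)                               # J_cv = (1/k) sum J_cv,k
```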

Bias-variance diagnosis
How do we understand if our model is suffering from underfitting or overfitting? We look at the training and cross-validation errors.
High bias: both errors are high
High variance: training error is low and cv error is high

Bias-variance diagnosis
Another possibility is looking at the learning curves:
High bias: errors are close and high; adding data does not help; you should use a different / more complex model
High variance: adding data could help, but you should consider using a simpler model or using regularization

Regularization / Ridge regression
Overfitting happens when the model is too complex, or uses an excessive number of features.
Main idea: modify the cost function by adding a penalty on the features:
J(θ) = (1/m) [ Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)})² + λ Σ_{j=1}^{n} θ_j² ] (8)
The variable λ controls the amount of penalization, and should be chosen carefully:
if too big, we'll have underfitting
if too small, it's like not using it
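Ridge regression also admits a closed-form solution, θ = (X^T X + λI)^{-1} X^T y; below is a sketch assuming the usual convention of not penalizing the intercept θ_0:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: theta = (X^T X + lam * I)^{-1} X^T y."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0                  # do not penalize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)
```

A side benefit: the penalty makes the matrix being inverted better conditioned, which mitigates the non-invertibility problems of the plain normal equation mentioned earlier.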

When the target variable y is discrete (e.g., 0/1), we talk about classification problems.
[Figure: Application Type (0 = Type 1, 1 = Type 2) vs. Flow duration]

We could use linear regression with a threshold:
Estimate h_θ(x) = θ^T x = θ_0 + θ_1 x
Output 0 if θ^T x < 0.5, and 1 otherwise
[Figure: linear fit over the Application Type vs. Flow duration data, with the 0.5 threshold]

Linear regression as classifier
Problem 1: the addition of a single point changes things a lot!
[Figure: two panels showing the linear fit before and after adding a single point at a much larger flow duration]
Problem 2: h_θ(x) outputs values much greater than 1 or much smaller than 0

k-nn classifier We could use a k-nearest Neighbour classifier: we take the k most similar example and output the class who occurs most often (majority voting). 1 (Type 2) Application Type 0.5 0 (Type 1) 0 5 10 15 20 25 Flow duration

k-nn classifier We could use a k-nearest Neighbour classifier: we take the k most similar example and output the class who occurs most often (majority voting). 1 (Type 2) Application Type 0.5 0 (Type 1) 0 5 10 15 20 25 Flow duration Will it generalize well to new data points?

Logistic regression
To cope with those problems, logistic regression is introduced:
h_θ(x) = 1 / (1 + e^{-θ^T x}) (9)
where f(z) = 1 / (1 + e^{-z}) is the sigmoid function.
[Figure: the sigmoid function 1/(1 + e^{-z}) for z ∈ [-5, 5]]

Sigmoid function
The sigmoid function h_θ(x) = 1 / (1 + e^{-θ^T x}):
outputs values between 0 and 1
equals 0.5 at θ^T x = 0
when θ^T x ≥ 0, h_θ(x) ≥ 0.5
when θ^T x < 0, h_θ(x) < 0.5
The output can be interpreted as the probability of belonging to a particular class.

Logistic regression cost function
To fit the parameters θ, a modified cost function is used (in order to make it convex):
J(θ) = -(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 - y^{(i)}) log(1 - h_θ(x^{(i)})) ] (10)
One can use gradient descent (or other more sophisticated algorithms) to solve for the parameters θ.
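A sketch of the cost in eq. (10) and its gradient; the epsilon guard against log(0) is our own addition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_grad(theta, X, y, eps=1e-12):
    """Cost of eq. (10) and its gradient, for use with gradient descent."""
    h = sigmoid(X @ theta)
    cost = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    grad = X.T @ (h - y) / len(y)   # same form as the linear-regression gradient
    return cost, grad
```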

Logistic regression
[Figure: fitted sigmoid over the Application Type vs. Flow duration data]
In this example, θ = [-20.9, 2.83]^T. The boundary value is at θ^T x = 0, that is x = 7.38.

Decision boundary
With more than one feature, logistic regression finds a decision boundary in the space of the features:
[Figure: Type 1 and Type 2 flows in the Packet Length [byte] vs. Flow Duration [s] plane, separated by a line]
In this example, h_θ(x) = 1 / (1 + e^{-(θ_0 + θ_1 x_1 + θ_2 x_2)}), and the line x_2 = -(θ_0 + θ_1 x_1)/θ_2 is the decision boundary.

Decision trees
Very simple and powerful algorithms, used standalone or to build more complex algorithms (e.g. Random Forest). They can be used for both classification and regression.
Main idea: follow a series of questions, and take a path depending on the answer.
[Figure: the split "packet length < 773?" shown on the Packet Length [byte] vs. Flow Duration [s] plane; YES leads to Type 0, NO leads to Type 1]

Growing (learning) a tree
Growing a decision tree is a greedy process that tries to minimize the misclassification error:
the final tree is not optimal
different trees may be learned from the same data set
How does it work?
Start from the training set; identify a feature and a split criterion on it. Two children nodes are generated.
For each generated node, repeat the process.
Identify a stop criterion, or stop when all nodes contain examples of the same class.
When stopping, output the class label that occurs most often.
Prune the tree.

Growing a decision tree
The output variable y takes values k; in our example k ∈ {0, 1}
The training set is represented by L = {(x^{(i)}, y^{(i)}), i = 1 ... m}
Each node t in the tree contains a set of associated observations L(t). The root of the tree contains all observations: L(t_1) = L.
How and when to split a node t?
If all the observations in L(t) belong to the same class k, we do not split. We declare t to be a leaf node and, whenever a new observation x reaches t, we output y = k.

Splitting criteria
Assume a node t contains samples of different classes. We need to decide:
which feature x_j to split on
at which value v to split
in order to produce the question: is x_j < v?

Node impurity
We introduce the impurity measure Q(t):
if all observations of node t are of the same class, Q(t) = 0
when the distribution of the classes in a node is uniform, Q(t) takes its maximum value
When we split a node, we try to decrease the impurity as much as possible.

Impurity measures
Let p_{t,k} = p(k|t) be the proportion of class-k observations in node t.
Gini index: Q(t) = Σ_k p_{t,k} (1 - p_{t,k}) (11)
Cross-entropy: Q(t) = -Σ_k p_{t,k} log(p_{t,k}) (12)

Split criterion
To measure a split's change in impurity, we can evaluate:
ΔQ = Q(t) - p_L Q(t_L) - p_R Q(t_R)
where p_L and p_R are the proportions of observations that fall in the left and right children nodes, respectively.
In order to find which node t to operate on and which variable to split, we test all possible splits! At each step, we choose the split for which ΔQ is highest.
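A sketch of the Gini index (eq. 11) and the impurity decrease ΔQ for a candidate split, expressed here as a boolean mask for "x_j < v":

```python
import numpy as np

def gini(y):
    """Q(t) = sum_k p_{t,k} (1 - p_{t,k}) over the class labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(np.sum(p * (1 - p)))

def impurity_decrease(y, mask):
    """Delta Q = Q(t) - p_L Q(t_L) - p_R Q(t_R) for the split given by mask."""
    if mask.all() or not mask.any():      # degenerate split: nothing gained
        return 0.0
    p_left = mask.mean()
    return gini(y) - p_left * gini(y[mask]) - (1 - p_left) * gini(y[~mask])
```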

Tree pruning
After the learning phase, the full tree may be very complex; some nodes can be discarded or pruned.
Given a full tree T_0, pruning seeks another tree T with a smaller number of leaves, at the cost of a higher misclassification error R. The process tries to minimize:
R_α(T) = R(T) + α|T| (13)
where α is a penalty on the tree size and |T| is the number of nodes in the tree.

Decision tree summary
Pros:
Highly interpretable
Cons:
Very unstable (a small change in the training set may produce a very different tree)
May be complex to train (depending on the data)
May not generalize well to new data

Other approaches
Support Vector Machines:
Find the decision boundary so that the distance (margin) between the classes is maximized
Binary classifier
Neural Networks:
Learn weights and activations (parameters) of a complex layered structure. By doing this, the network also learns which are the best features.
Work very well in many scenarios and can output multiple classes. Basis for deep learning.
Difficult to interpret

Ensemble methods
Fuse information from many weak classifiers into a strong one. Gold standard: Random Forest.
Main goal: reduce the variance/overfitting of trees.
Majority voting over multiple trees, learnt from multiple training sets obtained by sampling the original set with replacement (bagging).
For each tree, each time split over only a random subset of the features.
Additionally, one can apply boosting: learn trees sequentially, each time creating a new bag paying higher attention to misclassified samples.

Discriminative vs Generative classifiers
The algorithms seen so far are discriminative algorithms. They look at examples from all classes (0s and 1s) and find a decision boundary or a set of rules that separates the classes. They learn:
p(y|x) (14)
in a direct way.
In contrast, a generative learning algorithm looks at only one class of examples at a time, and learns
p(x|y) and p(y) (15)
i.e., what the features look like given a particular class (and the class prior).

Naive Bayes classifier
Recalling Bayes' Theorem:
p(y|x) = p(x|y) p(y) / p(x) (16)
Naive Bayes assumes that the variables x_1, x_2, ..., x_n are conditionally independent given the class. Therefore:
p(x|y) = Π_i p(x_i|y) (17)
p(y) is the class prior and can be easily obtained from the available data
p(x) = Σ_j p(x|y_j) p(y_j), but it is often dropped, as the probability of the data is constant
Note: in practice, variables are never truly independent. However, Naive Bayes works well in practice.

Naive Bayes: estimation
We need to estimate p(x_i|y) for each feature from the available data. For continuous features, a Gaussian distribution is generally assumed:
p(x_i; μ_i, σ_i²) = (1/√(2πσ_i²)) exp(-(x_i - μ_i)² / (2σ_i²)) (18)
Therefore we simply need to estimate μ_i and σ_i for each feature (and each class) of our data. For discrete variables, binomial/multinomial distributions are used.

Naive Bayes: example
[Figure: Type 1 and Type 2 flows in the Packet Length [byte] vs. Flow Duration [s] plane]

Naive Bayes: example
p(x_1|y = 0): μ_1 = 364, σ_1² = 2.8·10⁴
p(x_2|y = 0): μ_2 = 3.1, σ_2² = 1.81
p(x_1|y = 1): μ_1 = 935.75, σ_1² = 8.5·10³
p(x_2|y = 1): μ_2 = 13.8, σ_2² = 15.46
[Figure: fitted per-class Gaussians over the Packet Length [byte] vs. Flow Duration [s] data]

Naive Bayes: classification
Given a new observation x, we can predict the class it belongs to:
compute p(y_j|x) using Bayes' Theorem for all y_j
output the class j for which p(y_j|x) is maximized
Example 1: x_1 = 700, x_2 = 6 (posteriors normalized to sum to 1)
p(x_1 = 700|y = 0) · p(x_2 = 6|y = 0) · p(y = 0) = 0.7768
p(x_1 = 700|y = 1) · p(x_2 = 6|y = 1) · p(y = 1) = 0.2232
Example 2: x_1 = 900, x_2 = 6
p(x_1 = 900|y = 0) · p(x_2 = 6|y = 0) · p(y = 0) = 0.0068
p(x_1 = 900|y = 1) · p(x_2 = 6|y = 1) · p(y = 1) = 0.9932
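A sketch of Gaussian Naive Bayes estimation and prediction along these lines; the interface and the log-space evaluation are our own choices:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the prior p(y=c) and per-feature (mu_i, sigma_i^2) per class."""
    return {c: ((y == c).mean(), X[y == c].mean(axis=0), X[y == c].var(axis=0))
            for c in np.unique(y)}

def predict_gaussian_nb(params, x):
    """Output the class maximizing p(y) * prod_i p(x_i | y), in log space."""
    def log_post(prior, mu, var):
        return np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=lambda c: log_post(*params[c]))
```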

Other approaches
Linear Discriminant Analysis (LDA):
Similar to Naive Bayes, without the independence assumption
Assumes that all classes share the same covariance matrix
Quadratic Discriminant Analysis (QDA):
Similar to LDA, but assumes that each class has its own covariance matrix
LDA is simpler than QDA (it has lower variance). Use it when m is small.

Error analysis
For regression problems, the MSE can be used to evaluate the cross-validation or test error:
MSE = (1/m) Σ_{i=1}^{m} (y_i - ŷ_i)² (19)
For a binary classification problem, we could use the classifier accuracy:
ACC = 1 - (1/m) Σ_{i=1}^{m} I(y_i ≠ ŷ_i) (20)
where I(y_i ≠ ŷ_i) = 1 if y_i ≠ ŷ_i and 0 otherwise. Is it a good metric?

Skewed classes
Assume you have a test set with m = 100 examples of traffic flows, and you need to classify them as neutral (0) or malicious (1).
Assume that 99 examples are neutral and only one is malicious. What is the accuracy of a dummy classifier that always outputs 0? It reaches 99% accuracy while never detecting the malicious flow.
What can we do to better analyse and compare classifier performance?

Precision and Recall
Define:
True Positive (TP): malicious flows that were classified as malicious
False Positive (FP): neutral flows that were classified as malicious (false alarms)
True Negative (TN): neutral flows classified as neutral
False Negative (FN): malicious flows classified as neutral (misses)
Precision: (i) how often does our algorithm cause a false alarm? (ii) among all predicted positive examples, how many were actually positive?
Precision = TP / Number of Predicted Positive = TP / (TP + FP) (21)
Recall: (i) how sensitive is our algorithm? (ii) among all positive examples present in the set, how many were identified?
Recall = TP / Number of Actual Positive = TP / (TP + FN) (22)

F_1 Score
Often you can control a trade-off between recall and precision using a threshold. An always-1 classifier has a recall of 100% but a very low precision (it produces many false positives). Similarly, you can have classifiers with low recall and high precision. How to compare them? Compute the F_1 score:
F_1 = 2 · Precision · Recall / (Precision + Recall) (23)
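A sketch computing the three metrics for binary labels (1 = malicious):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision (eq. 21), recall (eq. 22), and F1 score (eq. 23)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

On the skewed example above, the always-0 classifier gets precision = recall = F1 = 0, exposing what its 99% accuracy hides.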

Multiclass classification
Sometimes you need to classify among multiple classes. Some of the algorithms we have seen naturally have this possibility (k-NN, Naive Bayes). What about logistic regression?

One vs all classification
Assume we have three classes in the training set: A, B, C. We can create three new datasets and learn three classifiers:
h_θ1: A (1) vs B and C (0)
h_θ2: B (1) vs A and C (0)
h_θ3: C (1) vs A and B (0)
On a new input x, look at the outputs of the three classifiers and assign the class for which h_θi(x) is maximized.
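A sketch of one-vs-all built on top of any binary trainer; `train_binary` is a placeholder for, e.g., logistic regression fitted by gradient descent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, classes, train_binary):
    """One binary classifier per class: class c relabeled 1, all others 0."""
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_all(models, x):
    """Assign the class whose classifier h_theta_i(x) is maximized.

    x must include the leading 1 for the intercept theta_0.
    """
    return max(models, key=lambda c: sigmoid(x @ models[c]))
```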

Confusion Matrix
For multiclass problems, one can use the confusion matrix to easily visualize errors: each entry (i, j) counts the examples of true class i that were predicted as class j.
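A minimal sketch of building such a matrix from integer-coded labels:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of examples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```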