Network Traffic Measurements and Analysis


DEIB - Politecnico di Milano, Fall 2017

Sources:
Hastie, Tibshirani, Friedman: The Elements of Statistical Learning
James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning
Andrew Ng: Machine Learning course

Introduction
Given this data, what throughput should we expect when 100 people are connected to the BS?
[Figure: scatter plot of Expected throughput (Mbps) vs. Number of active UE]
Fit a straight line? Fit a second-order polynomial?

Supervised learning problems
Two types of problems:
Regression: predict a continuous-valued output
Classification: classify data into discrete classes
Both generally consider many features:
Predict the throughput based on the number of connected users, signal strength, device type
Classify incoming traffic as malicious based on source IP address, length of flows, evil bit

Agenda
We'll focus on the following supervised learning algorithms:
Regression: (Multiple) Linear regression, Nearest Neighbour
Classification: Logistic Regression, Decision trees, Naive Bayes classifiers
Other very popular and powerful algorithms are available: Support Vector Machines (SVM), Neural Networks, Random Forests...

Linear Regression
Starting point: the training set
m training examples
x: input variables (features, predictors)
y: output variable (target, response)
(x^{(i)}, y^{(i)}): the i-th training example
Linear regression takes the training set and generates a hypothesis function h that tries to estimate the value of y given x:
h_θ(x) = θ_0 + θ_1 x (1)
θ_0 and θ_1 are the parameters, determined by a learning system and used to output a prediction ŷ based on the input x.

Linear Regression - implementation
How to choose the parameters? We would like to choose θ s.t. h_θ(x) is close to y for our training examples.
h_θ(x) tries to convert x into y. Since we have both x and y, we can evaluate how well h_θ(x) does this...
Define the Mean Squared Error (MSE) cost function:
J(θ) = (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)})²
We want to solve a minimization problem:
min_θ J(θ) (2)

Linear Regression - Normal Equation
Let X (the design matrix) and y be
X = [1 x^{(1)}; 1 x^{(2)}; 1 x^{(3)}; ... ; 1 x^{(m)}],   y = [y^{(1)}; y^{(2)}; y^{(3)}; ... ; y^{(m)}]
It can be shown that the optimal parameter vector θ = [θ_0, θ_1]^T is:
θ = (X^T X)^{-1} X^T y
At an arbitrary input x, the prediction is ŷ = h_θ(x).
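As an illustration, a minimal NumPy sketch of the normal equation; the data values below are made up for the example, not taken from the figures:

```python
import numpy as np

# Toy training set: throughput (Mbps) vs. number of active UEs
x = np.array([20, 40, 60, 80, 100, 120, 140], dtype=float)
y = np.array([28.0, 24.5, 18.0, 15.5, 10.0, 6.5, 2.0])

# Design matrix: a column of ones (for theta_0) next to the feature
X = np.column_stack([np.ones_like(x), x])

# Normal equation theta = (X^T X)^{-1} X^T y; solve() is used instead of
# an explicit inverse for numerical stability
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Predict the throughput for 100 active UEs
print(theta[0] + theta[1] * 100.0)
```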

[Figure: linear regression fit y = -0.2315·x + 32.54 over the Expected throughput (Mbps) vs. Number of active UE data]

Multilinear regression
Sometimes we would like to use more than one feature for making our prediction, e.g.:
the throughput may be predicted based on (i) how many users are connected, x_1, and (ii) the channel quality, x_2
we may also create new features starting from existing ones, e.g., the square of the number of connected users, x_3 = x_1²
in general, we can have n features x_1, ..., x_n
The hypothesis in the case of multilinear regression is:
h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n (3)

Multilinear regression
The normal equation can still be used as a solution, with the following design matrix:
X = [1 x_1^{(1)} x_2^{(1)} ... x_n^{(1)}; 1 x_1^{(2)} x_2^{(2)} ... x_n^{(2)}; ... ; 1 x_1^{(m)} x_2^{(m)} ... x_n^{(m)}]
and θ = [θ_0, θ_1, ..., θ_n]^T = (X^T X)^{-1} X^T y
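A sketch of the same idea with engineered polynomial features (the powers of x become extra columns); the data and the degree are our own choices for illustration:

```python
import numpy as np

x = np.array([20, 40, 60, 80, 100, 120, 140], dtype=float)
y = np.array([30.0, 22.0, 16.0, 13.0, 10.0, 7.0, 5.0])  # made-up values

# New features from existing ones: x_1 = x, x_2 = x^2 (quadratic hypothesis)
X = np.column_stack([np.ones_like(x), x, x**2])

# Same normal equation as before, now with n = 2 features
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [theta_0, theta_1, theta_2]
```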

[Figure: multilinear regression fits over the same data; quadratic y = 0.002963·x² - 0.6374·x + 40.77 and cubic y = -1.805e-05·x³ + 0.006383·x² - 0.8·x + 42.49]

Normal equation - Practical problems
The normal equation requires computing the matrix (X^T X)^{-1}:
(X^T X) is an (n+1) × (n+1) matrix
Most implementations compute the matrix inverse in O(n³)
Slow if n is large! How large? n ≈ 1000 is still (relatively) small.
(X^T X) may be non-invertible:
Cause 1: m ≤ n (more features than examples... bad idea)
Cause 2: redundant features (some of the columns of X are linearly dependent: remove them!)

Gradient descent
What if n is too large? Can we still learn the optimal parameters θ?
The answer is the Gradient Descent algorithm:
Start with some initial values for θ, e.g. θ_i = 0 for all i
Change θ in the direction of the negative gradient, in order to reduce the cost function J(θ) a little bit.
Repeat until convergence (a local minimum is found).
For (multi)linear regression, the cost function is convex and has only a single minimum.
Gradient descent will converge to the same optimal θ found by the normal equation (if α is small enough...)

Gradient descent
Do the following until convergence, for all j simultaneously:
θ_j := θ_j - α ∂J(θ)/∂θ_j (4)
that is:
θ_j := θ_j - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)} (5)
with x_0^{(i)} = 1 for convenience.
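A minimal sketch of this update rule in NumPy; the function name and interface are our own:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression (eq. 5).

    X is the (m, n+1) design matrix whose first column is all ones,
    so that x_0^{(i)} = 1 as in the slides.
    """
    m, n = X.shape
    theta = np.zeros(n)                 # start with theta_i = 0 for all i
    for _ in range(n_iters):
        residuals = X @ theta - y       # h_theta(x^{(i)}) - y^{(i)}
        theta -= alpha * (X.T @ residuals) / m
    return theta
```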

Gradient descent
The variable α in:
θ_j := θ_j - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)} (6)
is the learning rate. How to choose it?
too small: baby steps, convergence takes too long
too big: huge steps, you can miss the minimum and fail to converge
Normalizing the features (e.g., between -1 and 1) helps in providing numerical stability.


k-nearest Neighbours Regression
Simple idea: given an unknown input x, output ŷ by averaging the k most similar examples in the training set.
How many examples should we take (i.e., what is k)?
How should we average the examples? Generally, Inverse Distance Weighting (IDW) is applied:
ŷ = Σ_k w_k y_k (7)
with w_k inversely proportional to dist(x, x^{(k)}).
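A small sketch of k-NN regression with inverse-distance weighting for a single feature; the names and the epsilon guard against zero distances are our own:

```python
import numpy as np

def knn_regress(x_new, x_train, y_train, k=4, eps=1e-9):
    """Predict y at x_new by IDW-averaging the k nearest training examples."""
    dist = np.abs(x_train - x_new)      # one feature; use a norm in general
    nearest = np.argsort(dist)[:k]      # indices of the k most similar examples
    w = 1.0 / (dist[nearest] + eps)     # w_k inversely proportional to distance
    return np.sum(w * y_train[nearest]) / np.sum(w)
```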

k-nn 40 35 Training data knn, k = 2 knn, k = 4 Normalized Expected throughput (Mbps) 30 25 20 15 10 5 0 20 40 60 80 100 120 140 Normalized Number of active UE

k-nn 40 35 Training data knn, k = 2 knn, k = 4 Normalized Expected throughput (Mbps) 30 25 20 15 10 5 0 20 40 60 80 100 120 140 Normalized Number of active UE How many samples are needed for prediction? Will it generalize well to new data points?

LOESS
Linear regression and k-NN can be fused together in the LOESS (local regression) method. For a new input x_0:
gather the k points (x_i, y_i) whose x_i are nearest to x_0
fit a linear model h_θ using only these k points
output ŷ = h_θ(x_0)
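A sketch combining the two previous snippets, as the method suggests; this is an unweighted local fit under our own interface (classic LOESS additionally distance-weights the k points):

```python
import numpy as np

def loess_predict(x0, x_train, y_train, k=10):
    """LOESS: fit a local linear model on the k points nearest to x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    X = np.column_stack([np.ones(k), x_train[nearest]])
    theta = np.linalg.solve(X.T @ X, X.T @ y_train[nearest])  # normal equation
    return theta[0] + theta[1] * x0
```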

Bias-Variance trade-off
We have seen a few algorithms. They work well in practice, but may suffer from:
underfitting (high bias): occurs when the model used is too simple (e.g. linear regression)
overfitting (high variance): occurs when the model follows the training data too closely (e.g. k-NN, high-order polynomial fitting)

Evaluating a hypothesis
How do we know if our hypothesis is suffering from underfitting or overfitting?
Linear regression works by finding θ so that the cost function J(θ) is minimized. What if we found J(θ) = 0? Does it mean we have found the perfect model?
We can't be sure unless we try to generalize! Standard way to evaluate the hypothesis: split the training set!
70% used for training the model, by minimizing J_train
20% used for computing the cross-validation error J_cv
10% used for computing the test error J_test
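A sketch of the 70-20-10 split described above, assuming NumPy arrays and our own interface:

```python
import numpy as np

def split_70_20_10(X, y, seed=0):
    """Shuffle the data and split it into train / cross-validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    m = len(y)
    i_tr, i_cv = int(0.7 * m), int(0.9 * m)
    train, cv, test = idx[:i_tr], idx[i_tr:i_cv], idx[i_cv:]
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])
```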

Evaluating a hypothesis
[Figure: linear, quadratic, and 9th-degree polynomial fits over the normalized Expected throughput (Mbps) vs. normalized Number of active UE data]
Which one will have the smallest J_train? Should we select the model based on that?

Good practices for model selection
Divide your initial data set into three parts (70-20-10)
Train all your models on the training set
Compute J_cv on the cross-validation set. Pick the model that has the smallest J_cv error
Compute the generalization error J_test on the test set

k-fold cross validation
If the total number of observations m is small, we may have too few examples in each set. Also, J_cv depends on which 20% of the data is used. To cope with these issues, we can use k-fold cross validation:
randomly divide the entire set into k folds of similar size
use the first fold as cross-validation set and the other k-1 as training set; obtain J_cv,1
repeat k times, each time changing the cross-validation set
set J_cv = (1/k) Σ_k J_cv,k
Generally, k = 5 or k = 10 is used.
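A sketch of the procedure, with `fit` and `error` left as caller-supplied functions (our own interface):

```python
import numpy as np

def k_fold_cv(X, y, fit, error, k=5, seed=0):
    """Average J_cv over k folds; fit(X, y) -> model, error(model, X, y) -> float."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for j in range(k):
        cv = folds[j]                                  # fold j is the CV set
        tr = np.concatenate([f for i, f in enumerate(folds) if i != j])
        errs.append(error(fit(X[tr], y[tr]), X[cv], y[cv]))
    return np.mean(errs)                               # J_cv = (1/k) sum J_cv,k
```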

Bias-variance diagnosis
How do we understand if our model is suffering from underfitting or overfitting? We look at the training and cross-validation errors.
High bias: both errors are high
High variance: training error is low and cv error is high

Bias-variance diagnosis
Another possibility is looking at the learning curves:
High bias: errors are close and high; adding data does not help; you should use a different / more complex model
High variance: adding data could help, but you should consider using a simpler model or using regularization

Regularization / Ridge regression
Overfitting happens when the model is too complex, or uses an excessive number of features.
Main idea: modify the cost function by adding a penalty on the features:
J(θ) = (1/m) [ Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)})² + λ Σ_{j=1}^{n} θ_j² ] (8)
The variable λ controls the amount of penalization, and should be chosen carefully:
if too big, we'll have underfitting
if too small, it's like not using it
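Ridge regression also admits a closed-form solution, θ = (X^T X + λI)^{-1} X^T y; below is a sketch assuming the usual convention of not penalizing the intercept θ_0:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: theta = (X^T X + lam * I)^{-1} X^T y."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0                  # do not penalize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)
```

A side benefit: the penalty makes the matrix being inverted better conditioned, which mitigates the non-invertibility problems of the plain normal equation mentioned earlier.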

When the target variable y is discrete (e.g., 0/1), we talk about classification problems.
[Figure: Application Type (0 = Type 1, 1 = Type 2) vs. Flow duration]

We could use linear regression with a threshold:
Estimate h_θ(x) = θ^T x = θ_0 + θ_1 x
Output 0 if θ^T x < 0.5, and 1 otherwise
[Figure: linear fit over the Application Type vs. Flow duration data, with the 0.5 threshold]

Linear regression as classifier
Problem 1: the addition of a single point changes things a lot!
[Figure: two panels showing the linear fit before and after adding a single point at a much larger flow duration]
Problem 2: h_θ(x) outputs values much greater than 1 or much smaller than 0

k-nn classifier We could use a k-nearest Neighbour classifier: we take the k most similar example and output the class who occurs most often (majority voting). 1 (Type 2) Application Type 0.5 0 (Type 1) 0 5 10 15 20 25 Flow duration

k-nn classifier We could use a k-nearest Neighbour classifier: we take the k most similar example and output the class who occurs most often (majority voting). 1 (Type 2) Application Type 0.5 0 (Type 1) 0 5 10 15 20 25 Flow duration Will it generalize well to new data points?

Logistic regression
To cope with those problems, logistic regression is introduced:
h_θ(x) = 1 / (1 + e^{-θ^T x}) (9)
where f(z) = 1 / (1 + e^{-z}) is the sigmoid function.
[Figure: the sigmoid function 1/(1 + e^{-z}) for z ∈ [-5, 5]]

Sigmoid function
The sigmoid function h_θ(x) = 1 / (1 + e^{-θ^T x}):
outputs values between 0 and 1
equals 0.5 at θ^T x = 0
when θ^T x ≥ 0, h_θ(x) ≥ 0.5
when θ^T x < 0, h_θ(x) < 0.5
The output can be interpreted as the probability of belonging to a particular class.

Logistic regression cost function
To fit the parameters θ, a modified cost function is used (in order to make it convex):
J(θ) = -(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 - y^{(i)}) log(1 - h_θ(x^{(i)})) ] (10)
One can use gradient descent (or other more sophisticated algorithms) to solve for the parameters θ.
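A sketch of the cost in eq. (10) and its gradient; the epsilon guard against log(0) is our own addition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_grad(theta, X, y, eps=1e-12):
    """Cost of eq. (10) and its gradient, for use with gradient descent."""
    h = sigmoid(X @ theta)
    cost = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    grad = X.T @ (h - y) / len(y)   # same form as the linear-regression gradient
    return cost, grad
```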

Logistic regression
[Figure: fitted sigmoid over the Application Type vs. Flow duration data]
In this example, θ = [-20.9, 2.83]^T. The boundary value is at θ^T x = 0, that is x = 7.38.

Decision boundary
With more than one feature, logistic regression finds a decision boundary in the space of the features:
[Figure: Type 1 and Type 2 flows in the Packet Length [byte] vs. Flow Duration [s] plane, separated by a line]
In this example, h_θ(x) = 1 / (1 + e^{-(θ_0 + θ_1 x_1 + θ_2 x_2)}), and the line x_2 = -(θ_0 + θ_1 x_1)/θ_2 is the decision boundary.

Decision trees
Very simple and powerful algorithms, used standalone or to build more complex algorithms (e.g. Random Forest). They can be used for both classification and regression.
Main idea: follow a series of questions, and take a path depending on the answer.
[Figure: the split "packet length < 773?" shown on the Packet Length [byte] vs. Flow Duration [s] plane; YES leads to Type 0, NO leads to Type 1]

Growing (learning) a tree
Growing a decision tree is a greedy process that tries to minimize the misclassification error:
the final tree is not optimal
different trees may be learned from the same data set
How does it work?
Start from the training set; identify a feature and a split criterion on it. Two children nodes are generated.
For each generated node, repeat the process.
Identify a stop criterion, or stop when all nodes contain examples of the same class.
When stopping, output the class label that occurs most often.
Prune the tree.

Growing a decision tree
The output variable y takes values k; in our example k ∈ {0, 1}
The training set is represented by L = {(x^{(i)}, y^{(i)}), i = 1 ... m}
Each node t in the tree contains a set of associated observations L(t). The root of the tree contains all observations: L(t_1) = L.
How and when to split a node t?
If all the observations in L(t) belong to the same class k, we do not split. We declare t to be a leaf node and, whenever a new observation x reaches t, we output y = k.

Splitting criteria
Assume a node t contains samples of different classes. We need to decide:
which feature x_j to split on
at which value v to split
in order to produce the question: is x_j < v?

Node impurity
We introduce the impurity measure Q(t):
if all observations of node t are of the same class, Q(t) = 0
when the distribution of the classes in a node is uniform, Q(t) takes its maximum value
When we split a node, we try to decrease the impurity as much as possible.

Impurity measures
Let p_{t,k} = p(k|t) be the proportion of class-k observations in node t.
Gini index: Q(t) = Σ_k p_{t,k} (1 - p_{t,k}) (11)
Cross-entropy: Q(t) = -Σ_k p_{t,k} log(p_{t,k}) (12)

Split criterion
To measure a split's change in impurity, we can evaluate:
ΔQ = Q(t) - p_L Q(t_L) - p_R Q(t_R)
where p_L and p_R are the proportions of observations that fall in the left and right children nodes, respectively.
In order to find which node t to operate on and which variable to split, we test all possible splits! At each step, we choose the split for which ΔQ is highest.
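A sketch of the Gini index (eq. 11) and the impurity decrease ΔQ for a candidate split, expressed here as a boolean mask for "x_j < v":

```python
import numpy as np

def gini(y):
    """Q(t) = sum_k p_{t,k} (1 - p_{t,k}) over the class labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(np.sum(p * (1 - p)))

def impurity_decrease(y, mask):
    """Delta Q = Q(t) - p_L Q(t_L) - p_R Q(t_R) for the split given by mask."""
    if mask.all() or not mask.any():      # degenerate split: nothing gained
        return 0.0
    p_left = mask.mean()
    return gini(y) - p_left * gini(y[mask]) - (1 - p_left) * gini(y[~mask])
```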

Tree pruning
After the learning phase, the full tree may be very complex; some nodes can be discarded or pruned.
Given a full tree T_0, pruning seeks another tree T with a smaller number of leaves, at the cost of a higher misclassification error R. The process tries to minimize:
R_α(T) = R(T) + α|T| (13)
where α is a penalty on the tree size and |T| is the number of nodes in the tree.

Decision tree summary
Pros:
Highly interpretable
Cons:
Very unstable (a small change in the training set may produce a very different tree)
May be complex to train (depending on the data)
May not generalize well to new data

Other approaches
Support Vector Machines:
Find the decision boundary so that the distance (margin) between the classes is maximized
Binary classifier
Neural Networks:
Learn weights and activations (parameters) of a complex layered structure. By doing this, the network also learns which are the best features.
Work very well in many scenarios and can output multiple classes. Basis for deep learning.
Difficult to interpret

Ensemble methods
Fuse information from many weak classifiers into a strong one. Gold standard: Random Forest.
Main goal: reduce the variance/overfitting of trees.
Majority voting over multiple trees, learnt from multiple training sets obtained by sampling the original set with replacement (bagging).
For each tree, each time split over only a random subset of the features.
Additionally, one can apply boosting: learn trees sequentially, each time creating a new bag paying higher attention to misclassified samples.

Discriminative vs Generative classifiers
The algorithms seen so far are discriminative algorithms. They look at examples from all classes (0s and 1s) and find a decision boundary or a set of rules that separates the classes. They learn:
p(y|x) (14)
in a direct way.
In contrast, a generative learning algorithm looks at only one class of examples at a time, and learns
p(x|y) and p(y) (15)
i.e., what the features look like given a particular class (and the class prior).

Naive Bayes classifier
Recalling Bayes' Theorem:
p(y|x) = p(x|y) p(y) / p(x) (16)
Naive Bayes assumes that the variables x_1, x_2, ..., x_n are conditionally independent given the class. Therefore:
p(x|y) = Π_i p(x_i|y) (17)
p(y) is the class prior and can be easily obtained from the available data
p(x) = Σ_j p(x|y_j) p(y_j), but it is often dropped, as the probability of the data is constant
Note: in practice, variables are never truly independent. However, Naive Bayes works well in practice.

Naive Bayes: estimation
We need to estimate p(x_i|y) for each feature from the available data. For continuous features, a Gaussian distribution is generally assumed:
p(x_i; μ_i, σ_i²) = (1/√(2πσ_i²)) exp(-(x_i - μ_i)² / (2σ_i²)) (18)
Therefore we simply need to estimate μ_i and σ_i for each feature (and each class) of our data. For discrete variables, binomial/multinomial distributions are used.

Naive Bayes: example
[Figure: Type 1 and Type 2 flows in the Packet Length [byte] vs. Flow Duration [s] plane]

Naive Bayes: example
p(x_1|y = 0): μ_1 = 364, σ_1² = 2.8·10⁴
p(x_2|y = 0): μ_2 = 3.1, σ_2² = 1.81
p(x_1|y = 1): μ_1 = 935.75, σ_1² = 8.5·10³
p(x_2|y = 1): μ_2 = 13.8, σ_2² = 15.46
[Figure: fitted per-class Gaussians over the Packet Length [byte] vs. Flow Duration [s] data]

Naive Bayes: classification
Given a new observation x, we can predict the class it belongs to:
compute p(y_j|x) using Bayes' Theorem for all y_j
output the class j for which p(y_j|x) is maximized
Example 1: x_1 = 700, x_2 = 6 (posteriors normalized to sum to 1)
p(x_1 = 700|y = 0) · p(x_2 = 6|y = 0) · p(y = 0) = 0.7768
p(x_1 = 700|y = 1) · p(x_2 = 6|y = 1) · p(y = 1) = 0.2232
Example 2: x_1 = 900, x_2 = 6
p(x_1 = 900|y = 0) · p(x_2 = 6|y = 0) · p(y = 0) = 0.0068
p(x_1 = 900|y = 1) · p(x_2 = 6|y = 1) · p(y = 1) = 0.9932
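A sketch of Gaussian Naive Bayes estimation and prediction along these lines; the interface and the log-space evaluation are our own choices:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the prior p(y=c) and per-feature (mu_i, sigma_i^2) per class."""
    return {c: ((y == c).mean(), X[y == c].mean(axis=0), X[y == c].var(axis=0))
            for c in np.unique(y)}

def predict_gaussian_nb(params, x):
    """Output the class maximizing p(y) * prod_i p(x_i | y), in log space."""
    def log_post(prior, mu, var):
        return np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=lambda c: log_post(*params[c]))
```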

Other approaches
Linear Discriminant Analysis (LDA):
Similar to Naive Bayes, without the independence assumption
Assumes that all classes share the same covariance matrix
Quadratic Discriminant Analysis (QDA):
Similar to LDA, but assumes that each class has its own covariance matrix
LDA is simpler than QDA (it has lower variance). Use it when m is small.

Error analysis
For regression problems, the MSE can be used to evaluate the cross-validation or test error:
MSE = (1/m) Σ_{i=1}^{m} (y_i - ŷ_i)² (19)
For a binary classification problem, we could use the classifier accuracy:
ACC = 1 - (1/m) Σ_{i=1}^{m} I(y_i ≠ ŷ_i) (20)
where I(y_i ≠ ŷ_i) = 1 if y_i ≠ ŷ_i and 0 otherwise. Is it a good metric?

Skewed classes
Assume you have a test set with m = 100 examples of traffic flows, and you need to classify them as neutral (0) or malicious (1).
Assume that 99 examples are neutral and only one is malicious. What is the accuracy of a dummy classifier that always outputs 0? It reaches 99% accuracy while never detecting the malicious flow.
What can we do to better analyse and compare classifier performance?

Precision and Recall
Define:
True Positive (TP): malicious flows that were classified as malicious
False Positive (FP): neutral flows that were classified as malicious (false alarms)
True Negative (TN): neutral flows classified as neutral
False Negative (FN): malicious flows classified as neutral (misses)
Precision: (i) how often does our algorithm cause a false alarm? (ii) among all predicted positive examples, how many were actually positive?
Precision = TP / Number of Predicted Positive = TP / (TP + FP) (21)
Recall: (i) how sensitive is our algorithm? (ii) among all positive examples present in the set, how many were identified?
Recall = TP / Number of Actual Positive = TP / (TP + FN) (22)

F_1 Score
Often you can control a trade-off between recall and precision using a threshold. An always-1 classifier has a recall of 100% but a very low precision (it produces many false positives). Similarly, you can have classifiers with low recall and high precision. How to compare them? Compute the F_1 score:
F_1 = 2 · Precision · Recall / (Precision + Recall) (23)
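A sketch computing the three metrics for binary labels (1 = malicious):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision (eq. 21), recall (eq. 22), and F1 score (eq. 23)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

On the skewed example above, the always-0 classifier gets precision = recall = F1 = 0, exposing what its 99% accuracy hides.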

Multiclass classification
Sometimes you need to classify among multiple classes. Some of the algorithms we have seen naturally have this possibility (k-NN, Naive Bayes). What about logistic regression?

One vs all classification
Assume we have three classes in the training set: A, B, C. We can create three new datasets and learn three classifiers:
h_θ1: A (1) vs B and C (0)
h_θ2: B (1) vs A and C (0)
h_θ3: C (1) vs A and B (0)
On a new input x, look at the outputs of the three classifiers and assign the class for which h_θi(x) is maximized.
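A sketch of one-vs-all built on top of any binary trainer; `train_binary` is a placeholder for, e.g., logistic regression fitted by gradient descent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, classes, train_binary):
    """One binary classifier per class: class c relabeled 1, all others 0."""
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_all(models, x):
    """Assign the class whose classifier h_theta_i(x) is maximized.

    x must include the leading 1 for the intercept theta_0.
    """
    return max(models, key=lambda c: sigmoid(x @ models[c]))
```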

Confusion Matrix
For multiclass problems, one can use the confusion matrix to easily visualize errors: each entry (i, j) counts the examples of true class i that were predicted as class j.
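A minimal sketch of building such a matrix from integer-coded labels:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of examples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```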