DEIB - Politecnico di Milano Fall, 2017
Introduction
Often, we have only a set of features x = (x_1, x_2, ..., x_n), but no associated response y. We are therefore interested neither in prediction nor in classification, but only in the analysis of the data itself. Can we discover patterns or subgroups among variables or observations? Is there an informative way to visualize the data? To answer these questions, one applies unsupervised learning techniques.
Agenda
Unsupervised learning techniques:
- Clustering: K-means, hierarchical, density-based
- Dimensionality reduction: Principal Component Analysis (PCA)
- Density estimation
The goal of clustering is to group data together based solely on the features x. Applications:
- Market segmentation: group customers into different segments based on the traffic they produce
- Mobile traffic segmentation: identify areas in a city which have common traffic signatures
- User behavior estimation: group together users with the same behavior
K-Means
K-means is by far the most widely used clustering algorithm. Goal: group data into K clusters.
Input:
- m unlabeled observations x
- K, the number of clusters to produce
Output:
- K centroids (cluster centers)
- A cluster assignment for each of the m input observations
K-Means steps
Randomly initialize K cluster centroids {µ_1, µ_2, ..., µ_K}. Repeat until convergence:
- Assign each observation x^(i) to the closest cluster centroid, producing labels c^(i)
- Compute the new cluster centers µ_k by averaging all x^(i) such that c^(i) = k
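As a concrete illustration, here is a minimal NumPy sketch of these two steps (the function name and the convergence check are our own, not from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=np.random.default_rng(0)):
    # Initialize centroids by picking K distinct observations at random
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each observation with its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break  # converged: the update no longer moves the centroids
        centroids = new_centroids
    return centroids, labels
```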
K-Means example
Let's see a running example. [Figure: scatter plot of observations, Avg. Usage [Mbps] vs. Connection duration [s]]
K-Means example
After convergence: [Figure: the same data after convergence, Avg. Usage [Mbps] vs. Connection duration [s]]
Cost function
K-means tries to optimize the following cost function:

J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2    (1)

This cost function:
- Has in general many local minima for a fixed K (the algorithm may converge to one of those)
- Is decreasing for increasing K; when K = m, the optimal value is J = 0
Algorithm initialization
Possible options:
- Randomly pick cluster centroids in the feature space
- Randomly pick cluster centroids from the observation set (recommended option)
However, due to the random initialization, the algorithm may converge to a local optimum. Solution: run K-means multiple times (50-1000 runs), track the final error J of each run, and pick the clustering with the lowest J.
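A minimal sketch of this multiple-restart strategy, reusing the kmeans function above (the cost function implements Eq. (1); X and K=3 are illustrative placeholders):

```python
def cost(X, centroids, labels):
    # Eq. (1): mean squared distance of each point to its assigned centroid
    return np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))

best_J, best_clustering = np.inf, None
for seed in range(50):  # e.g., 50 random restarts
    centroids, labels = kmeans(X, K=3, rng=np.random.default_rng(seed))
    J = cost(X, centroids, labels)
    if J < best_J:  # keep the run with the lowest final error
        best_J, best_clustering = J, (centroids, labels)
```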
How to choose K
In general, this is not easy, and K is mostly chosen manually. Sometimes it is not even easy to do manually.
Elbow Method
Plot the best J versus the number of clusters K and identify the elbow. This does not work every time. However, sometimes one has a later purpose and may evaluate the optimal number of clusters based on that (e.g., traffic policies).
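A sketch of the elbow plot using scikit-learn (n_init performs the restarts discussed earlier; note that sklearn's inertia_ is the sum, not the mean, of squared distances, which does not change the shape of the curve):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=50, random_state=0).fit(X).inertia_
         for k in Ks]
plt.plot(Ks, costs, "o-")
plt.xlabel("Number of clusters K")
plt.ylabel("Best cost J")
plt.show()  # look for the K where the curve bends (the elbow)
```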
Variants
- K-median: replace the mean operation with the median
- K-medoid: cluster centers must belong to the original set of observations
Hierarchical clustering
One disadvantage of K-means is that we need to specify the number of clusters K. Hierarchical clustering partially solves this issue by creating a dendrogram.
Dendrogram
- Each leaf of the dendrogram is an observation
- The height on the vertical axis at which two leaves join represents their distance
- For a fixed value of distance, the dendrogram identifies a certain number of clusters (similar to K for K-means)
Creating the dendrogram
Bottom-up approach. Treat each observation as an individual cluster, then repeat:
- Compute all the pairwise distances between clusters
- Identify the two closest clusters and join them; their distance identifies the height at which they join
To compute distances between clusters with more than one observation, a proper linkage is used:
- Average: compute all pairwise distances and output the average
- Complete: compute all pairwise distances and output the maximum
- Single: compute all pairwise distances and output the minimum
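A sketch of the bottom-up procedure with SciPy (the cut distance t=10.0 is an arbitrary placeholder):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

Z = linkage(X, method="average")  # "complete" and "single" are the other linkages
dendrogram(Z)                     # each leaf is one observation
plt.show()
# A horizontal cut at a fixed distance identifies the clusters
labels = fcluster(Z, t=10.0, criterion="distance")
```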
DBSCAN
A density-based clustering algorithm which groups together observations that are closely packed. Advantages compared to K-means:
- Does not require specifying the number of clusters
- Can find arbitrarily shaped and non-linearly separable clusters
- Identifies and marks as outliers points which lie alone in low-density regions
- Requires only two parameters: ε and minPts
DBSCAN
For each non-visited observation in the dataset:
- Find all neighbors within a distance of ε
- If the number of neighbors is ≥ minPts, the observation is a core point
- If the number of neighbors is < minPts, but at least one of them is a core point, the observation is a border point
- Otherwise, the point is marked as noise
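A sketch with scikit-learn's implementation (eps and min_samples correspond to ε and minPts; the values shown are placeholders to be tuned):

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # one cluster index per observation; -1 marks noise points
```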
DBSCAN clusters
K-means vs. DBSCAN:
Practical issues in clustering
In order to perform clustering, some decisions must be made:
- For K-means, how many clusters should we use?
- For hierarchical clustering, what kind of linkage and horizontal cut should be used?
- For DBSCAN, which ε and minPts?
- In general, what kind of distance metric should we use? Euclidean, Hamming, correlation-based?
The answer is not unique: run these algorithms several times with different parameters and select the ones yielding the most useful or interpretable solution (Q: should we use cross-validation?). In general, validating a cluster is a difficult task.
Principal Component Analysis
PCA is a fundamental and powerful technique used in several fields of computing. The technique analyzes the n-dimensional feature space x and converts it into a p-dimensional representation, p < n. In doing so, some approximation (information loss) is tolerated. Uses:
- Data compression: represent about the same information with less data
- Speeding up learning algorithms: less data, fewer computations
- Data visualization: 2- or 3-dimensional plots become possible
- Removal of correlated variables
Problem formulation
Given a dataset whose observations lie in an n-dimensional space, project the observations onto a set of principal components, so that:
- the first component has the largest variance
- every following component has the largest possible variance and is orthogonal to the preceding components
Therefore, PCA just represents the data in a different fashion. However, the first PCs are the ones that contain most of the information (variance) contained in the data.
PCA
[Figure, two panels: the data with its first and second principal components drawn in the original space (DL Volume (bytes) vs. DL Volume (bits)), and the same data re-expressed in the principal component domain (Second Principal Component vs. First Principal Component)]
Algorithm in a nutshell
The exact mathematical derivation is out of the scope of this course.
- Preprocessing: normalize your data so that each feature has zero mean and varies approximately between -1 and 1
- Compute the covariance matrix

\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T

- Compute the eigenvector matrix U of Σ. The eigenvectors are your principal components: the eigenvector associated with the highest eigenvalue is the first principal component, and so on
- z = U^T x are the new data values in the principal component domain
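A NumPy sketch of these steps (X is assumed to be the (m, n) data matrix):

```python
import numpy as np

# Preprocessing: zero mean, roughly unit scale per feature
Xc = (X - X.mean(axis=0)) / X.std(axis=0)

Sigma = (Xc.T @ Xc) / len(Xc)        # covariance matrix (n x n)
eigvals, U = np.linalg.eigh(Sigma)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]    # re-sort in descending order
eigvals, U = eigvals[order], U[:, order]  # columns of U = principal components
Z = Xc @ U                           # z = U^T x applied to every observation
```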
Dimensionality reduction
Once the PCA has been computed, one has n principal components (directions in R^n). To reduce the dimensionality of the data, one can project the data onto only the first p principal components and obtain a lower-dimensional approximation:
- Compute U_red by retaining only the first p columns of U
- Compute z_red = U_red^T x, the data projected onto the first p PCs
- Go back to the original domain by computing x̃ = U_red z_red
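Continuing the sketch above (p = 2 is a placeholder):

```python
p = 2
U_red = U[:, :p]             # keep only the first p columns of U
Z_red = Xc @ U_red           # z_red = U_red^T x, projection onto the first p PCs
X_approx = Z_red @ U_red.T   # x̃ = U_red z_red, back in the original (normalized) domain
```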
PCA approximation
[Figure: original data vs. its approximation after projection, DL Volume (bytes) vs. DL Volume (bits)]
Dimensionality reduction
How to choose p? For visualization, p = 2 or p = 3 makes it possible to visualize the data easily and to understand better what kind of approach should be taken. In general, one can select p so that a certain quality of approximation is guaranteed. Typically, one selects p so that the ratio between the variance of the approximation error and the variance of the data is below a threshold:

\frac{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \tilde{x}^{(i)} \|^2}{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2} \le 0.01    (2)
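Since the eigenvalues of Σ measure the variance along each principal component, for centered data the ratio in Eq. (2) can be computed directly from them; a sketch, reusing eigvals from the earlier block:

```python
import numpy as np

# Fraction of variance retained by the first p components, for every p
explained = np.cumsum(eigvals) / eigvals.sum()
# Smallest p whose error-variance ratio (Eq. 2) is below 1%
p = int(np.argmax(explained >= 0.99)) + 1
```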
Problem Motivation
An unsupervised learning problem with aspects of supervised learning: identify observations with anomalous behaviour. It is somehow similar to a classification task (anomalous/not anomalous), but:
- In general, we have many normal examples and just a few anomalous examples
- Also, anomalous examples may be very different from each other
- It is better to learn what anomalies do not look like
Density estimation
Assume we collect data on different TCP connections, and look at only two features of such connections. [Figure: scatter plot of Downloaded bytes vs. Duration [s]]
Density Estimation
The problem of anomaly detection can be solved in a probabilistic way. The overall idea is quite simple:
- Identify features x_i that might be indicative of anomalies
- Estimate the joint probability density function of such features, p(x_1, x_2, ..., x_n)
- Given a new example x, compute p(x); flag an anomaly if p(x) < ε
Density estimation
In the simplest case, to compute p(x) we assume that

p(x) = p(x_1; \mu_1, \sigma_1^2) \, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)    (3)

where p(x_1), ..., p(x_n) are Gaussian distributions. As in Naive Bayes, this algorithm assumes that the features are independent.
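A sketch of this independent model (x_new and the threshold eps are placeholders):

```python
import numpy as np
from scipy.stats import norm

mu = X.mean(axis=0)     # per-feature mean
sigma = X.std(axis=0)   # per-feature standard deviation

def p_independent(x):
    # Eq. (3): product of the per-feature Gaussian densities
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

eps = 1e-4  # threshold, ideally chosen on a validation set (see below)
is_anomaly = p_independent(x_new) < eps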
Example
[Figure, four panels: the scatter of Downloaded bytes vs. Duration [s], the per-feature densities p(x) for each of the two features, and the resulting joint density surface]
Multivariate Gaussian
Often, the features used for anomaly detection are not independent. This can cause the method to fail. [Figure: scatter plot of correlated features, x_2 vs. x_1]
Example of failure
What happens for the point x_1 = 95, x_2 = 1120? [Figure: scatter of x_2 vs. x_1 with the per-feature densities p(x_1) and p(x_2) and the resulting joint density surface]
Multivariate density estimation
It is better to model the joint distribution all in one go, instead of looking at each feature separately. The multivariate Gaussian distribution for n features is

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)    (4)

where:
- x and µ are n-dimensional vectors
- Σ is the covariance matrix (the same used in PCA)
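A sketch using SciPy's multivariate Gaussian (again, x_new and eps are placeholders):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)  # covariance matrix estimated from the data

model = multivariate_normal(mean=mu, cov=Sigma)  # Eq. (4)
is_anomaly = model.pdf(x_new) < eps
```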
Example
[Figure: the same example re-fitted with a multivariate Gaussian: scatter of x_2 vs. x_1, per-feature views, and the joint density surface]
Independent vs Multivariate
Independent density estimation:
- Used more often
- Sometimes features must be manually created to capture anomalies
- Computationally cheap and scales well
- Works well with small training sets
Multivariate density estimation:
- Captures correlation among features
- Less computationally efficient: needs to compute Σ^{-1}
- Needs m > n (otherwise Σ is not invertible)
Evaluating an anomaly detection system
Generally, one always has some labeled data: usually many non-anomalous examples and just a few anomalous ones. Evaluation of an anomaly detection system can be performed using the standard cross-validation method. Assume we have, e.g., 10000 non-anomalous examples and just 20 anomalous examples. A good way to split the data may be:
- 6000 non-anomalous examples for density estimation
- 2000 non-anomalous examples and 10 anomalous in the CV set: choose which features to use, ε, etc.
- 2000 non-anomalous examples and 10 anomalous in the test set
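A sketch of choosing ε on the CV set by maximizing F1 (X_cv, y_cv, and the candidate thresholds are illustrative; model is the density fitted on the 6000 training examples, e.g. the multivariate Gaussian above):

```python
import numpy as np
from sklearn.metrics import f1_score

scores = model.pdf(X_cv)          # density of each CV example
best_eps, best_f1 = None, -1.0
for eps in np.quantile(scores, [0.001, 0.005, 0.01, 0.05]):
    f1 = f1_score(y_cv, (scores < eps).astype(int))  # y_cv: 1 = anomalous
    if f1 > best_f1:
        best_eps, best_f1 = eps, f1
```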