Clustering. Mihaela van der Schaar. Department of Engineering Science, University of Oxford. January 27, 2017

Many datasets consist of multiple heterogeneous subsets. Cluster analysis: given unlabelled data, we want algorithms that automatically group the datapoints into coherent subsets (clusters). Examples: market segmentation of shoppers based on browsing and purchase histories; identifying different types of breast cancer based on gene expression measurements; discovering communities in social networks; image segmentation.

Types of clustering. Model-based clustering: each cluster is described using a probability model. Model-free clustering: clusters are defined by similarity/dissimilarity among the instances within them.

This Lecture: Model-free Methods. K-means clustering: a partition-based method that divides the data into K clusters. It finds groups such that the variation within each group is small. The number of clusters K is usually fixed beforehand, or various values of K are investigated as part of the analysis.

K-means. Partition-based methods seek to divide the data points into a pre-assigned number of clusters $C_1, \dots, C_K$, where for all $k, k' \in \{1, \dots, K\}$: $C_k \subset \{1, \dots, n\}$, $C_k \cap C_{k'} = \emptyset$ for $k \neq k'$, and $\bigcup_{k=1}^{K} C_k = \{1, \dots, n\}$. Each cluster is represented by a prototype, or cluster centroid, $\mu_k$.

We can measure the quality of a cluster with its within-cluster deviation $W(C_k, \mu_k) = \sum_{i \in C_k} \|x_i - \mu_k\|_2^2$. The overall quality of the clustering is given by the total within-cluster deviation: $W = \sum_{k=1}^{K} W(C_k, \mu_k)$. The overall objective is to choose both the cluster centroids and the allocation of points to clusters so as to minimize W.

$W = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|_2^2 = \sum_{i=1}^{n} \|x_i - \mu_{c_i}\|_2^2$, where $c_i = k$ if and only if $i \in C_k$. Given a partition $\{C_k\}$, we can find the optimal prototypes easily by differentiating W with respect to $\mu_k$: $\frac{\partial W}{\partial \mu_k} = -2 \sum_{i \in C_k} (x_i - \mu_k) = 0 \;\Rightarrow\; \mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$. Given prototypes, we can easily find the optimal partition by assigning each data point to the closest cluster prototype: $c_i = \arg\min_k \|x_i - \mu_k\|_2^2$. But joint minimization over both is computationally difficult.

The K-means algorithm is a widely used method that returns a local optimum of the objective function W, using iterative and alternating minimization.
Step 1: Randomly initialize K cluster centroids $\mu_1, \dots, \mu_K$.
Step 2: Cluster assignment: for each $i = 1, \dots, n$, assign $x_i$ to the cluster with the nearest centroid, $c_i := \arg\min_k \|x_i - \mu_k\|_2^2$. Set $C_k := \{i : c_i = k\}$ for each k.

Step 3: Move centroids: set $\mu_1, \dots, \mu_K$ to the averages of the new clusters: $\mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x_i$.
Step 4: Repeat steps 2-3 until convergence.
Step 5: Return the partition $\{C_1, \dots, C_K\}$ and the means $\mu_1, \dots, \mu_K$.
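Steps 1 to 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `sq_dist` and `kmeans`, the `seed` and `max_iter` parameters, the toy `points`, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for the example. Initialization picks K distinct data points at random.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, K, seed=0, max_iter=100):
    """Alternate the assignment and centroid steps until nothing changes."""
    rng = random.Random(seed)
    # Step 1: initialise the centroids with K distinct data points.
    centroids = [list(p) for p in rng.sample(points, K)]
    assign = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        new_assign = [
            min(range(K), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_assign == assign:
            break  # Step 4: the partition did not change, so we stop.
        assign = new_assign
        # Step 3: move each centroid to the mean of its cluster.
        for k in range(K):
            members = [p for p, c in zip(points, assign) if c == k]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    # Total within-cluster deviation W of the final configuration.
    W = sum(sq_dist(p, centroids[c]) for p, c in zip(points, assign))
    # Step 5: return the partition (as labels c_i), the centroids, and W.
    return assign, centroids, W


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, W = kmeans(points, K=2)
```

On this toy data the two groups are far enough apart that the first three points should end up sharing one label and the last three the other.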

The algorithm stops in a finite number of iterations. In each of steps 2 and 3, W either stays constant or decreases, so as long as W keeps strictly decreasing we never revisit the same partition. As there are only finitely many partitions, the number of iterations cannot exceed the number of partitions.

The K-means algorithm need not converge to the global optimum. K-means is a heuristic search algorithm, so it can get stuck at suboptimal configurations. The result depends on the starting configuration. Typically one performs a number of runs from different starting configurations and picks the end result with minimum W.
[Figure: three panels with axes x and y, showing clusterings with W = 9.184, W = 3.418, and W = 9.264.]

Additional Comments. Good-practice initialization: randomly pick K training examples (without replacement) and set $\mu_1, \mu_2, \dots, \mu_K$ equal to those examples. Sensitivity to the distance measure: Euclidean distance can be greatly affected by the measurement units and by strong correlations between features. One can use the Mahalanobis distance instead: $\|x - y\|_M^2 = (x - y)^\top M^{-1} (x - y)$, where M is a positive definite matrix (so that $M^{-1}$ exists), e.g. the sample covariance.
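The sensitivity to measurement units can be illustrated with a small sketch. It assumes 2-D data, so that the 2 x 2 matrix inverse can be written out by hand, and takes M to be the sample covariance. The helper names and the toy data are hypothetical. Multiplying the first feature by 1000 (a change of units) changes the squared Euclidean distance dramatically, while the squared Mahalanobis distance is unchanged up to rounding.

```python
def sample_covariance(data):
    """2x2 sample covariance (dividing by n) of a list of 2-D points."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / n
    syy = sum((p[1] - my) ** 2 for p in data) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / n
    return ((sxx, sxy), (sxy, syy))


def mahalanobis_sq(x, y, M):
    """(x - y)^T M^{-1} (x - y) for an invertible 2x2 matrix M."""
    (a, b), (c, d) = M
    det = a * d - b * c
    dx, dy = x[0] - y[0], x[1] - y[1]
    # M^{-1} = (1 / det) * [[d, -b], [-c, a]]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def euclidean_sq(x, y):
    """Squared Euclidean distance between two 2-D points."""
    return (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2


# Toy data (hypothetical) and the same data with the first feature rescaled.
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
rescaled = [(1000.0 * u, v) for u, v in data]

d_euc = euclidean_sq(data[0], data[4])
d_euc_rescaled = euclidean_sq(rescaled[0], rescaled[4])
d_mah = mahalanobis_sq(data[0], data[4], sample_covariance(data))
d_mah_rescaled = mahalanobis_sq(rescaled[0], rescaled[4],
                                sample_covariance(rescaled))
```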

Additional Comments. Determination of K: the optimal value of the K-means objective always improves (or at least does not worsen) with a larger number of clusters K, so determining K requires an additional regularization criterion. E.g., minimize the penalized objective $W = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|_2^2 + \lambda K$.

Vector quantization (VQ) was originally developed by the signal processing community for data compression (audio, image and video compression). The idea has since been picked up by the statistics community and extended to tackle a variety of tasks (including clustering and classification). VQ is a simple idea for summarising data by use of codewords. The algorithm is very closely related to the K-means algorithm, but it works sequentially through the data when updating cluster centers.

Given p-dimensional data, a finite set of vectors $Y = \{y_1, \dots, y_K\}$ of the same dimensionality must be found. The vectors $y_k$ are called codewords and $Y$ the codebook. All n observations are mapped to the indices of the codebook using the following rule: $x_i \mapsto y_k \iff \|x_i - y_k\| \le \|x_i - y_{k'}\| \ \forall k'$. Such a mapping induces a partition of $\mathbb{R}^p$ into Voronoi regions defined as $V_k = \{x \in \mathbb{R}^p : \|x - y_k\| \le \|x - y_{k'}\| \ \forall k'\}$, where $\bigcup_{k=1}^{K} V_k = \mathbb{R}^p$ and the $V_k$ are disjoint except for boundaries.

Finding a Useful Codebook As with K-means, a predefined number of K codewords must be found. They should be chosen to give the greatest compression in the data with minimal loss in data quality. Where we have more codewords than clusters, it is easy to see that we should simply place codewords at the center of areas of high density, i.e. good codebooks find cluster centers.

The following iterative algorithm finds a good approximate solution to this problem.
1. Randomly choose K observations to initialise the codebook.
2. Sample an observation x and let $V_c$ be the Voronoi region where it falls.
3. Update the codebook: $y_c \leftarrow y_c + \alpha(t)\,[x - y_c]$ and $y_k \leftarrow y_k$ for all $k \neq c$. Here $\alpha(t)$ quantifies the amount by which $y_c$ moves towards x, and it decays to 0 over time.
4. Repeat 2-3 until there is no change.
5. Return the codebook $Y = \{y_1, \dots, y_K\}$.
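The sequential update can be sketched as follows. This is a minimal illustration under stated assumptions: the step size $\alpha(t) = 1/(t+2)$ is one arbitrary schedule that decays to 0, and a fixed number of sampling steps `n_steps` replaces the "until there is no change" stopping rule. The function names and the toy data are hypothetical.

```python
import random


def nearest(codebook, x):
    """Index c of the codeword closest to x, i.e. the Voronoi region of x."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((xi - yi) ** 2 for xi, yi in zip(x, codebook[k])),
    )


def vq_train(data, K, n_steps=2000, seed=0):
    """Online VQ: move the winning codeword towards each sampled observation."""
    rng = random.Random(seed)
    # Step 1: initialise the codebook with K randomly chosen observations.
    codebook = [list(x) for x in rng.sample(data, K)]
    for t in range(n_steps):
        # Step 2: sample an observation and find its Voronoi region.
        x = rng.choice(data)
        c = nearest(codebook, x)
        # Step 3: y_c <- y_c + alpha(t) * (x - y_c); other codewords stay put.
        alpha = 1.0 / (t + 2)  # assumed schedule, decays to 0
        codebook[c] = [y + alpha * (xi - y) for y, xi in zip(codebook[c], x)]
    # Step 5: return the codebook.
    return codebook


# Toy data (hypothetical): two groups of three 2-D points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
codebook = vq_train(data, K=2)
# Encoding: each observation is replaced by the index of its codeword.
codes = [nearest(codebook, x) for x in data]
```

Because each update is a convex combination of the old codeword and an observation, every codeword stays inside the bounding box of the data.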

Compression. For compression purposes, any observation $x \in \mathbb{R}^p$ is now just mapped to the index in $\{1, \dots, K\}$ of its codeword, according to which Voronoi region the observation falls into. If a large number of observations $x_1, \dots, x_n$ needs to be transferred, the vector of corresponding codeword indices in $\{1, \dots, K\}^n$ can be transferred instead, which achieves compression (with a certain loss of information). Some audio and video codecs use this method. As with K-means, K must be specified. Increasing K improves the quality of the compressed image but worsens the data compression rate, so there is a clear tradeoff. (For clustering, the choice of K is harder and does not have an entirely satisfactory answer.)

Example: Image Compression. 3 × 3 block VQ: view each block of 3 × 3 pixels as a single observation.

Original image (24 bits/pixel, uncompressed size 1,402 kB)
Codebook length 1024 (1.11 bits/pixel, total size 88 kB)
Codebook length 128 (0.78 bits/pixel, total size 50 kB)
Codebook length 16 (0.44 bits/pixel, total size 27 kB)
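The bits-per-pixel figures quoted for the three codebooks are consistent with a simple calculation: each 3 × 3 block (9 pixels) is replaced by one index into a codebook of length K, which takes $\log_2 K$ bits. The sketch below assumes that the cost of storing the codebook itself is not counted in the per-pixel rate. The function name `bits_per_pixel` is hypothetical.

```python
import math


def bits_per_pixel(codebook_length, block_pixels=9):
    """Bits needed per pixel when each block is stored as one codeword index."""
    return math.log2(codebook_length) / block_pixels


# Rates for the three codebook lengths used in the example, rounded to 2 d.p.
rates = {K: round(bits_per_pixel(K), 2) for K in (1024, 128, 16)}
# rates maps 1024 -> 1.11, 128 -> 0.78, 16 -> 0.44
```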

Naive Bayes. Department of Engineering Science, University of Oxford. February 12, 2017

Overview. Naive Bayes is a classifier with a simple generative model that is easy to implement. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ with n entries: $x_i = (x_i^{(1)}, \dots, x_i^{(d)}) \in \mathbb{R}^d$ is a feature vector; $y_i \in \mathcal{Y}$ is a label, with $\mathcal{Y} = \{1, \dots, m\}$ for classification and $\mathcal{Y} = \mathbb{R}$ for regression; and $(x_1, y_1), \dots, (x_n, y_n) \sim P_\theta$ i.i.d. for some parameters $\theta$. Goal: for a new $x \in \mathbb{R}^d$, predict its label y by computing the probability of each label given the features x, i.e. $P(y \mid x)$.

Naive Bayes Assumption. Assume a family of distributions $P_\theta$ such that for $x \in \mathbb{R}^d$, $y \in \mathcal{Y}$, $P_\theta(x, y) = P_\theta(x \mid y)\, P_\theta(y) = P_\theta(x^{(1)} \mid y) \cdots P_\theta(x^{(d)} \mid y)\, P_\theta(y) = \left[ \prod_{j=1}^{d} P_\theta(x^{(j)} \mid y) \right] P_\theta(y)$ (conditional independence assumption). If $(x, y) \sim P_\theta$, then $x^{(1)}, \dots, x^{(d)}$ are independent given y. Naive Bayes assumption: all measured features are independent given the label, i.e. $x^{(j)} \mid y \perp x^{(k)} \mid y$ if $j \neq k$.

Methodology. Estimate the conditional probability distribution $P_\theta(x \mid y)$ and the prior $P_\theta(y)$ that describe the entire population from which the random samples $\{(x_i, y_i)\}_{i=1}^{n}$ are drawn. Algorithm: estimate $\hat\theta$ from the dataset D, then compute $\hat{y} \in \arg\max_{y \in \mathcal{Y}} P_{\hat\theta}(y \mid x) = \arg\max_{y \in \mathcal{Y}} P_{\hat\theta}(x \mid y)\, P_{\hat\theta}(y) = \arg\max_{y \in \mathcal{Y}} P_{\hat\theta}(x^{(1)} \mid y) \cdots P_{\hat\theta}(x^{(d)} \mid y)\, P_{\hat\theta}(y)$.

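Using Bayes' rule, $P_{\hat\theta}(y \mid x) = \frac{P_{\hat\theta}(x \mid y)\, P_{\hat\theta}(y)}{P_{\hat\theta}(x)} = \frac{P_{\hat\theta}(x \mid y)\, P_{\hat\theta}(y)}{\sum_{y' \in \mathcal{Y}} P_{\hat\theta}(x \mid y')\, P_{\hat\theta}(y')}$. By the conditional independence assumption, $P_{\hat\theta}(x \mid y) = \prod_{j=1}^{d} P_{\hat\theta}(x^{(j)} \mid y)$, so $P_{\hat\theta}(y \mid x) = \frac{\prod_{j=1}^{d} P_{\hat\theta}(x^{(j)} \mid y)\, P_{\hat\theta}(y)}{\sum_{y' \in \mathcal{Y}} \prod_{j=1}^{d} P_{\hat\theta}(x^{(j)} \mid y')\, P_{\hat\theta}(y')}$. Therefore, we need to estimate the prior $P_{\hat\theta}(y)$ and the conditional PDF $P_{\hat\theta}(x^{(j)} \mid y)$.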

How to choose $P_\theta$? For classification, let $(x, y) \sim P_\theta$ with $y \in \mathcal{Y} = \{1, \dots, m\}$. Then the prior is $P_\theta(y) = \pi_y$, where $\pi = (\pi_1, \dots, \pi_m)$, and the conditionals are $P_\theta(x^{(j)} \mid y)$, where $\theta$ = {all parameters of the distributions}. If $x^{(j)} \in \{1, \dots, N\}$ (a discrete feature), then $P_\theta(x^{(j)} \mid y)$ can be estimated using the sample mean of the corresponding indicator, i.e. the relative frequency within the class. If $x^{(j)} \in \mathbb{R}$ (a continuous feature), assume a parametric distribution such as a Gaussian or Gamma distribution and estimate its parameters. How to estimate $\theta$? Using Maximum Likelihood Estimation (MLE) or Maximum A Posteriori (MAP) estimation.

Maximum Likelihood Estimation (MLE). Prior estimation with MLE: $P_{\hat\theta}(y = k) = \hat\pi_k = \frac{1}{n} \sum_{i=1}^{n} I(y_i = k) = \frac{n_k}{n}$. Conditional PDF, for discrete features: $P_{\hat\theta}(x^{(j)} = l \mid y = k) = \frac{1}{n_k} \sum_{i=1}^{n} I(x_i^{(j)} = l)\, I(y_i = k) = \frac{n_{lk}}{n_k}$. For continuous features: assume a parametric distribution, estimate its parameters with MLE, and then compute the conditional pdf from the estimated parameters.
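The counting estimates above, together with the arg max rule from the methodology, can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary layout keyed by (feature index j, value l, class k), and the toy dataset are assumptions. A feature value never seen with a class is given probability 0, which is what the MLE implies.

```python
from collections import Counter


def fit_discrete_nb(X, y):
    """MLE of the class priors and the per-feature conditional probabilities."""
    n = len(y)
    n_k = Counter(y)  # n_k: number of samples with label k
    prior = {k: n_k[k] / n for k in n_k}  # pi_hat_k = n_k / n
    n_jlk = Counter()  # number of class-k samples whose feature j equals l
    for x_i, y_i in zip(X, y):
        for j, l in enumerate(x_i):
            n_jlk[(j, l, y_i)] += 1
    cond = {(j, l, k): c / n_k[k] for (j, l, k), c in n_jlk.items()}
    return prior, cond


def predict(prior, cond, x):
    """Return the label maximising P(y) * prod_j P(x^(j) | y)."""
    def score(k):
        s = prior[k]
        for j, l in enumerate(x):
            s *= cond.get((j, l, k), 0.0)  # unseen (j, l, k) has MLE 0
        return s
    return max(prior, key=score)


# Toy dataset (hypothetical): two discrete features, two classes.
X = [(1, 2), (1, 1), (2, 2), (2, 2)]
y = [1, 1, 2, 2]
prior, cond = fit_discrete_nb(X, y)
label_a = predict(prior, cond, (1, 2))
label_b = predict(prior, cond, (2, 2))
```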

Gaussian Distribution Example (continuous features). Estimate the Gaussian parameters for $P(x^{(j)} = x \mid y = k)$. Mean: $\mu_{jk} = \frac{1}{n_k} \sum_{i=1}^{n} x_i^{(j)}\, I(y_i = k)$. Variance: $\sigma_{jk}^2 = \frac{1}{n_k} \sum_{i=1}^{n} (x_i^{(j)} - \mu_{jk})^2\, I(y_i = k)$. Then compute the conditional pdf from the estimated parameters: $P(x^{(j)} = x \mid y = k) = \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\left( -\frac{(x - \mu_{jk})^2}{2\sigma_{jk}^2} \right)$.
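A minimal sketch of the Gaussian case is given below. It assumes every class has at least one sample and every per-class variance is strictly positive. It works with the logarithm of the density, a common numerical choice that is not part of the slides, so that products of small probabilities do not underflow. All names and the toy data are hypothetical.

```python
import math


def fit_gaussian_nb(X, y):
    """Per class k: prior n_k / n and MLE (mu_jk, sigma_jk^2) for each feature j."""
    n = len(y)
    params = {}
    for k in sorted(set(y)):
        rows = [x_i for x_i, y_i in zip(X, y) if y_i == k]
        n_k = len(rows)
        stats = []
        for column in zip(*rows):  # one column per feature j
            mu = sum(column) / n_k
            var = sum((v - mu) ** 2 for v in column) / n_k  # MLE divides by n_k
            stats.append((mu, var))
        params[k] = (n_k / n, stats)
    return params


def log_gaussian_pdf(x, mu, var):
    """Logarithm of the Gaussian density with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(params, x):
    """Return the label maximising log P(y) + sum_j log P(x^(j) | y)."""
    def log_score(k):
        prior, stats = params[k]
        return math.log(prior) + sum(
            log_gaussian_pdf(v, mu, var) for v, (mu, var) in zip(x, stats)
        )
    return max(params, key=log_score)


# Toy dataset (hypothetical): one continuous feature, two classes.
X = [(1.0,), (2.0,), (3.0,), (7.0,), (8.0,), (9.0,)]
y = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_nb(X, y)
low_label = predict(params, (2.5,))
high_label = predict(params, (7.5,))
```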

Text Document Classification Example. Naive Bayes is often used in text document classification, e.g. of scientific articles or emails. A basic standard model for text classification considers a pre-specified dictionary of p words and summarizes each document i by a binary vector $x_i$, where $x_i^{(j)} = 1$ if word j is present in document i, and $x_i^{(j)} = 0$ otherwise.

Presence of word j is the j-th feature/dimension. Naive Bayes is a plug-in classifier which ignores feature correlations and assumes $g_k(x_i) = P(x = x_i \mid y = k) = \prod_{j=1}^{p} P(x^{(j)} = x_i^{(j)} \mid y = k) = \prod_{j=1}^{p} (\phi_{kj})^{x_i^{(j)}} (1 - \phi_{kj})^{1 - x_i^{(j)}}$, where we denote the parametrized conditional PMF by $\phi_{kj} = P(x^{(j)} = 1 \mid y = k)$ (the probability that the j-th word appears in a class-k document). Given a dataset, the MLE of the parameters is $\hat\pi_k = \frac{n_k}{n}$, $\hat\phi_{kj} = \frac{\sum_{i : y_i = k} x_i^{(j)}}{n_k}$.

A problem with MLE: if the l-th word did not appear in any document labelled as class k, then $\hat\phi_{kl} = 0$ and $P(y = k \mid x \text{ with } l\text{-th entry equal to } 1) \propto \hat\pi_k \prod_{j=1}^{p} (\hat\phi_{kj})^{x^{(j)}} (1 - \hat\phi_{kj})^{1 - x^{(j)}} = 0$, i.e. we will never attribute a new document containing word l to class k (regardless of the other words in it). This is an example of overfitting.
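The zero-probability problem is easy to reproduce on a toy corpus. The sketch below fits the Bernoulli model by MLE and evaluates the unnormalised posterior $\hat\pi_k \prod_j (\hat\phi_{kj})^{x^{(j)}} (1 - \hat\phi_{kj})^{1 - x^{(j)}}$ for a new document. The three-word dictionary, the four training documents, the class names "a" and "b", and the function names are all made up for illustration.

```python
def fit_bernoulli_nb(X, y):
    """MLE: pi_hat_k = n_k / n and phi_hat_kj = (sum of x_i^(j) in class k) / n_k."""
    n = len(y)
    pi, phi = {}, {}
    for k in sorted(set(y)):
        rows = [x_i for x_i, y_i in zip(X, y) if y_i == k]
        pi[k] = len(rows) / n
        phi[k] = [sum(column) / len(rows) for column in zip(*rows)]
    return pi, phi


def unnormalised_posterior(pi, phi, x, k):
    """pi_k * prod_j phi_kj^x_j * (1 - phi_kj)^(1 - x_j)."""
    score = pi[k]
    for x_j, p in zip(x, phi[k]):
        score *= p if x_j == 1 else (1.0 - p)
    return score


# Toy corpus (hypothetical): binary word-presence vectors over 3 words.
# Word 3 (last entry) never appears in a class "a" document.
X = [(1, 1, 0), (1, 0, 0), (1, 0, 1), (0, 1, 1)]
y = ["a", "a", "b", "b"]
pi, phi = fit_bernoulli_nb(X, y)

new_doc = (1, 1, 1)  # contains word 3
score_a = unnormalised_posterior(pi, phi, new_doc, "a")
score_b = unnormalised_posterior(pi, phi, new_doc, "b")
```

Here class "a" receives score 0 solely because word 3 was absent from its training documents, even though the other two words are typical of class "a".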

Why the Conditional Independence Assumption? Conditional independence assumption: $P_\theta(x \mid y) = P_\theta(x^{(1)} \mid y) \cdots P_\theta(x^{(d)} \mid y)$. With this assumption, $\theta$ can be estimated more accurately with less data. Wrong but simple can be better than correct and complicated.