Clustering. Mihaela van der Schaar. Department of Engineering Science, University of Oxford. January 27, 2017


Department of Engineering Science, University of Oxford. January 27, 2017

Many datasets consist of multiple heterogeneous subsets. Cluster analysis: given unlabelled data, we want algorithms that automatically group the data points into coherent subsets/clusters. Examples: market segmentation of shoppers based on browsing and purchase histories; identifying different types of breast cancer based on gene expression measurements; discovering communities in social networks; image segmentation.

Types of clustering. Model-based clustering: each cluster is described using a probability model. Model-free clustering: defined by similarity/dissimilarity among instances within clusters.

This Lecture: Model-free Methods. K-means clustering: a partition-based method that divides the data into K clusters. It finds groups such that the variation within each group is small. The number of clusters K is usually fixed beforehand, or various values of K are investigated as part of the analysis.

K-means. Partition-based methods seek to divide the data points into a pre-assigned number of clusters $C_1, \ldots, C_K$, where for all $k, k' \in \{1, \ldots, K\}$ with $k \neq k'$: $C_k \subset \{1, \ldots, n\}$, $C_k \cap C_{k'} = \emptyset$, and $\bigcup_{k=1}^{K} C_k = \{1, \ldots, n\}$. Each cluster is represented using a prototype or cluster centroid $\mu_k$.

K-means. We can measure the quality of a cluster with its within-cluster deviation
$$W(C_k, \mu_k) = \sum_{i \in C_k} \| x_i - \mu_k \|_2^2.$$
The overall quality of the clustering is given by the total within-cluster deviation
$$W = \sum_{k=1}^{K} W(C_k, \mu_k).$$
The overall objective is to choose both the cluster centroids and the allocation of points to minimize this objective function.

K-means. The objective can be rewritten as
$$W = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \mu_k \|_2^2 = \sum_{i=1}^{n} \| x_i - \mu_{c_i} \|_2^2,$$
where $c_i = k$ if and only if $i \in C_k$. Given a partition $\{C_k\}$, we can find the optimal prototypes easily by differentiating $W$ with respect to $\mu_k$:
$$\frac{\partial W}{\partial \mu_k} = -2 \sum_{i \in C_k} (x_i - \mu_k) = 0 \quad \Longrightarrow \quad \mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i.$$
Given the prototypes, we can easily find the optimal partition by assigning each data point to the closest cluster prototype:
$$c_i = \arg\min_k \| x_i - \mu_k \|_2^2.$$
But joint minimization over both is computationally difficult.

K-means. The K-means algorithm is a widely used method that returns a local optimum of the objective function $W$, using iterative, alternating minimization. Step 1: Randomly initialize $K$ cluster centroids $\mu_1, \ldots, \mu_K$. Step 2 (cluster assignment): For each $i = 1, \ldots, n$, assign $x_i$ to the cluster with the nearest centroid,
$$c_i := \arg\min_k \| x_i - \mu_k \|_2^2,$$
and set $C_k := \{ i : c_i = k \}$ for each $k$.

K-means. Step 3 (move centroids): Set $\mu_1, \ldots, \mu_K$ to the averages of the new clusters,
$$\mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x_i.$$
Step 4: Repeat steps 2-3 until convergence. Step 5: Return the partition $\{C_1, \ldots, C_K\}$ and the means $\mu_1, \ldots, \mu_K$.
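The alternating minimization in steps 1-5 translates almost directly into code. Below is a minimal NumPy sketch of the algorithm; the function name `kmeans`, the convergence test on the assignments, and the handling of empty clusters are my own choices rather than part of the lecture.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means sketch. X is an (n, p) array; K is the number of clusters."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: initialise centroids with K distinct training points ("good practice" initialisation).
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    c = None
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
        c_new = d2.argmin(axis=1)
        if c is not None and np.array_equal(c_new, c):
            break                                                  # assignments unchanged: converged
        c = c_new
        # Step 3: move each centroid to the average of its cluster (empty clusters are left in place).
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    W = ((X - mu[c]) ** 2).sum()                                   # total within-cluster deviation
    return c, mu, W
```

Since the result depends on the starting configuration, one would typically call this several times with different seeds and keep the run with the smallest $W$, as the slides recommend below.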

K-means. The algorithm stops in a finite number of iterations. Each execution of steps 2 and 3 either strictly decreases $W$ or leaves it, and the partition, unchanged; hence the algorithm never revisits a partition it has already moved away from. As there are only finitely many partitions, the number of iterations cannot exceed their number.

K-means. The K-means algorithm need not converge to the global optimum. K-means is a heuristic search algorithm, so it can get stuck at suboptimal configurations, and the result depends on the starting configuration. Typically one performs a number of runs from different starting configurations and picks the end result with minimum $W$. [Figure: three K-means runs on the same data from different initializations, ending with $W = 9.184$, $W = 3.418$, and $W = 9.264$.]

K-means: Additional Comments. Good practice for initialization: randomly pick $K$ training examples (without replacement) and set $\mu_1, \mu_2, \ldots, \mu_K$ equal to those examples. Sensitivity to the distance measure: Euclidean distance can be greatly affected by the measurement units and by strong correlations. One can use the Mahalanobis distance instead,
$$\| x - y \|_M = \sqrt{(x - y)^\top M^{-1} (x - y)},$$
where $M$ is a positive semi-definite matrix, e.g. the sample covariance.
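A small sketch contrasting the two distances on features with very different scales; using the sample covariance as $M$ follows the suggestion above, while the specific data and scales are illustrative assumptions.

```python
import numpy as np

def mahalanobis(x, y, M):
    """Mahalanobis distance ||x - y||_M for a positive (semi-)definite matrix M."""
    d = x - y
    return np.sqrt(d @ np.linalg.solve(M, d))   # solve avoids forming M^{-1} explicitly

# Features on wildly different scales: Euclidean distance is dominated by the last feature.
X = np.random.default_rng(0).normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
M = np.cov(X, rowvar=False)                     # sample covariance as M, as suggested above
print(np.linalg.norm(X[0] - X[1]))              # Euclidean distance
print(mahalanobis(X[0], X[1], M))               # scale-corrected Mahalanobis distance
```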

K-means: Additional Comments. Determination of $K$: the K-means objective always improves with a larger number of clusters $K$, so determining $K$ requires an additional regularization criterion, e.g.
$$W = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \mu_k \|_2^2 + \lambda K.$$
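One way to make this criterion operational is to sweep over candidate values of $K$ and keep the one minimizing the penalized objective. The sketch below assumes the `kmeans` function from earlier; the candidate range, the value of $\lambda$, and the number of restarts are illustrative choices, not from the lecture.

```python
import numpy as np

# Assumes the `kmeans` function sketched above is in scope and X is an (n, p) data array.
def choose_K(X, candidates=range(1, 11), lam=1.0, n_restarts=5):
    """Pick K by minimising the penalised objective W + lambda * K."""
    best_K, best_score = None, np.inf
    for K in candidates:
        # Several random restarts per K, keeping the smallest within-cluster deviation.
        W = min(kmeans(X, K, seed=s)[2] for s in range(n_restarts))
        score = W + lam * K
        if score < best_score:
            best_K, best_score = K, score
    return best_K, best_score
```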

Vector Quantisation (VQ). Originally developed by the signal processing community for data compression (audio, image and video compression), the VQ idea has been picked up by the statistics community and extended to tackle a variety of tasks, including clustering and classification. VQ is a simple idea for summarising data by use of codewords. The algorithm is very closely related to the K-means algorithm, yet it works sequentially through the data when updating the cluster centers.

Given $p$-dimensional data, a finite set of vectors $Y = \{y_1, \ldots, y_K\}$ of the same dimensionality must be found. The vectors $y_k$ are called codewords and $Y$ the codebook. All $n$ observations are mapped to the indices of the codebook using the rule
$$x_i \mapsto y_k \quad \text{such that} \quad \| x_i - y_k \| \le \| x_i - y_{k'} \| \;\; \forall k'.$$
Such a mapping induces a partition of $\mathbb{R}^p$ into Voronoi regions defined as
$$V_k = \{ x \in \mathbb{R}^p : \| x - y_k \| \le \| x - y_{k'} \| \;\; \forall k' \},$$
where $\bigcup_{k=1}^{K} V_k = \mathbb{R}^p$ and the $V_k$ are disjoint except for their boundaries.

Finding a Useful Codebook. As with K-means, a predefined number $K$ of codewords must be found. They should be chosen to give the greatest compression of the data with minimal loss in data quality. Where we have more codewords than clusters, it is easy to see that we should simply place codewords at the centers of areas of high density, i.e. good codebooks find cluster centers.

The following iterative algorithm finds a good approximate solution to this problem. 1. Randomly choose $K$ observations to initialise the codebook. 2. Sample an observation $x$ and let $V_c$ be the Voronoi region into which it falls. 3. Update the codebook:
$$y_c = y_c + \alpha(t) [x - y_c], \qquad y_k = y_k \;\; \forall k \neq c,$$
where $\alpha(t)$ quantifies the amount by which $y_c$ moves towards $x$ and decays over time to 0. 4. Repeat steps 2-3 until there is no change. 5. Return the codebook $Y = \{y_1, \ldots, y_K\}$.
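A minimal NumPy sketch of this sequential update, assuming a simple $1/(1+t)$ decay for $\alpha(t)$ (the lecture does not specify the decay schedule) and a fixed number of steps in place of the "no change" stopping rule:

```python
import numpy as np

def vector_quantise(X, K, n_steps=10_000, seed=0):
    """Sequential VQ sketch: learn a codebook Y of K codewords from data X of shape (n, p)."""
    rng = np.random.default_rng(seed)
    # 1. Initialise the codebook with K randomly chosen observations.
    Y = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    for t in range(n_steps):
        # 2. Sample an observation and find the Voronoi region it falls into.
        x = X[rng.integers(X.shape[0])]
        c = ((Y - x) ** 2).sum(axis=1).argmin()
        # 3. Move only the winning codeword towards x; alpha(t) decays to 0 over time.
        alpha = 1.0 / (1.0 + t)
        Y[c] += alpha * (x - Y[c])
    # 4.-5. (A fixed number of steps stands in for "repeat until no change".) Return the codebook.
    return Y
```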

Compression. For compression purposes, any observation $x \in \mathbb{R}^p$ is now mapped to the index set $\{1, \ldots, K\}$ of the codewords, according to which Voronoi region the observation falls into. If a large number of observations $x_1, \ldots, x_n$ needs to be transferred, the vector of corresponding codeword indices in $\{1, \ldots, K\}^n$ can be transferred instead to achieve compression (with a certain loss of information). Some audio and video codecs use this method. As with K-means, $K$ must be specified. Increasing $K$ improves the quality of the compressed image but worsens the compression rate, so there is a clear trade-off. (For clustering, the choice of $K$ is harder and does not have an entirely satisfactory answer.)
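A short sketch of the encode/decode step, assuming the `vector_quantise` codebook above; reconstruction simply replaces each observation by its codeword, which is where the loss of information comes from.

```python
import numpy as np

def encode(X, Y):
    """Map each observation to the index of its nearest codeword (its Voronoi region)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)                    # integer codes in {0, ..., K-1}

def decode(codes, Y):
    """Lossy reconstruction: replace each code by its codeword."""
    return Y[codes]
```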

Example: Image Compression. 3×3 block VQ: view each block of 3×3 pixels as a single observation.

Example: Image Compression. Original image (24 bits/pixel, uncompressed size 1,402 kB)

Example: Image Compression. Codebook length 1024 (1.11 bits/pixel, total size 88 kB)

Example: Image Compression. Codebook length 128 (0.78 bits/pixel, total size 50 kB)

Example: Image Compression. Codebook length 16 (0.44 bits/pixel, total size 27 kB)

Naive Bayes. Department of Engineering Science, University of Oxford. February 12, 2017

Naive Bayes: Overview. Naive Bayes is a classifier with a simple generative model; it is easy to implement. Given a dataset $D = (x_i, y_i)_{i=1}^{n}$ with $n$ entries: $x_i = (x_i^{(1)}, \ldots, x_i^{(d)}) \in \mathbb{R}^d$ is a feature vector, and $y_i \in \mathcal{Y}$ is a label, with $\mathcal{Y} = \{1, \ldots, m\}$ for classification and $\mathcal{Y} = \mathbb{R}$ for regression. We assume $(x_1, y_1), \ldots, (x_n, y_n) \sim P_\theta$ i.i.d. for some parameters $\theta$. Goal: for a new $x \in \mathbb{R}^d$, predict its label $y$ by computing the probability of each label given the feature vector, i.e. $P(y \mid x)$.

Naive Bayes Assumption. Assume a family of distributions $P_\theta$ such that for $x \in \mathbb{R}^d$, $y \in \mathcal{Y}$,
$$P_\theta(x, y) = P_\theta(x \mid y)\, P_\theta(y) = P_\theta(x^{(1)} \mid y) \cdots P_\theta(x^{(d)} \mid y)\, P_\theta(y) = \prod_{j=1}^{d} P_\theta(x^{(j)} \mid y)\, P_\theta(y)$$
(the conditional independence assumption). If $(x, y) \sim P_\theta$, then $x^{(1)}, \ldots, x^{(d)}$ are independent given $y$. Naive Bayes assumption: all measured features are independent given the label, i.e. $x^{(j)} \perp x^{(k)} \mid y$ for $j \neq k$.

Naive Bayes: Methodology. Estimate the conditional probability distribution $P_\theta(x \mid y)$ and the prior $P_\theta(y)$ that describe the entire population from which the random samples $(x_i, y_i)_{i=1}^{n}$ are drawn. Algorithm: estimate $\hat\theta$ from the dataset $D$ and compute
$$\hat y \in \arg\max_{y \in \mathcal{Y}} P_{\hat\theta}(y \mid x) = \arg\max_{y \in \mathcal{Y}} P_{\hat\theta}(x \mid y)\, P_{\hat\theta}(y) = \arg\max_{y \in \mathcal{Y}} P_{\hat\theta}(x^{(1)} \mid y) \cdots P_{\hat\theta}(x^{(d)} \mid y)\, P_{\hat\theta}(y).$$
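A tiny sketch of this decision rule, assuming the class priors and per-feature conditional probabilities have already been estimated and are passed in; working with sums of log-probabilities rather than products of probabilities is a standard numerical precaution, not something the slides discuss.

```python
import numpy as np

def nb_predict(x, log_prior, log_cond):
    """Hypothetical interface: log_prior[k] = log P(y=k); log_cond(j, v, k) = log P(x^(j)=v | y=k)."""
    scores = [log_prior[k] + sum(log_cond(j, x[j], k) for j in range(len(x)))
              for k in range(len(log_prior))]
    return int(np.argmax(scores))               # arg max_y P(y) * prod_j P(x^(j) | y), in log space
```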

Naive Bayes: Methodology. Using Bayes' rule,
$$P_{\hat\theta}(y \mid x) = \frac{P_{\hat\theta}(x \mid y)\, P_{\hat\theta}(y)}{P_{\hat\theta}(x)} = \frac{P_{\hat\theta}(x \mid y)\, P_{\hat\theta}(y)}{\sum_{y' \in \mathcal{Y}} P_{\hat\theta}(x \mid y')\, P_{\hat\theta}(y')}.$$
By the conditional independence assumption, $P_{\hat\theta}(x \mid y) = \prod_{j=1}^{d} P_{\hat\theta}(x^{(j)} \mid y)$, so
$$P_{\hat\theta}(y \mid x) = \frac{\prod_{j=1}^{d} P_{\hat\theta}(x^{(j)} \mid y)\, P_{\hat\theta}(y)}{\sum_{y' \in \mathcal{Y}} P_{\hat\theta}(y') \prod_{j=1}^{d} P_{\hat\theta}(x^{(j)} \mid y')}.$$
Therefore, we need to estimate the prior $P_{\hat\theta}(y)$ and the conditional PDFs $P_{\hat\theta}(x^{(j)} \mid y)$.

Naive Bayes: Methodology. How to choose $P_\theta$? For classification, let $(x, y) \sim P_\theta$ with $y \in \mathcal{Y} = \{1, \ldots, m\}$. Then $P_\theta(y) = \pi_y$, where $\pi = (\pi_1, \ldots, \pi_m)$, and $P_\theta(x^{(j)} \mid y)$ is modelled per feature, with $\theta$ collecting all parameters of these distributions. If $x^{(j)} \in \{1, \ldots, N\}$, then $P_\theta(x^{(j)} \mid y)$ can be estimated using sample means of indicators (empirical frequencies). If $x^{(j)} \in \mathbb{R}$, then assume a parametric distribution such as a Gaussian or Gamma distribution and estimate its parameters. How to estimate $\theta$? Using Maximum Likelihood Estimation (MLE) or Maximum A Posteriori (MAP) estimation.

Naive Bayes: Maximum Likelihood Estimation (MLE). Prior estimation with MLE:
$$P_{\hat\theta}(y = k) = \hat\pi_k = \frac{1}{n} \sum_{i=1}^{n} I(y_i = k) = \frac{n_k}{n}.$$
Conditional PMF for discrete features:
$$P_{\hat\theta}(x^{(j)} = l \mid y = k) = \frac{1}{n_k} \sum_{i=1}^{n} I(x_i^{(j)} = l)\, I(y_i = k) = \frac{n_{lk}}{n_k}.$$
For continuous features: use the parametric distribution assumption to estimate the parameters with MLE; then, based on the estimated parameters, compute the conditional pdf.
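A hedged NumPy sketch of these MLE counts for discrete features; the array layout and the assumption that labels and feature values are coded as small non-negative integers are mine, not the lecture's.

```python
import numpy as np

def nb_fit_discrete(X, y, n_classes, n_values):
    """MLE for Naive Bayes with discrete features.
    X: (n, d) integers in {0, ..., n_values-1}; y: (n,) labels in {0, ..., n_classes-1}."""
    n, d = X.shape
    prior = np.bincount(y, minlength=n_classes) / n             # pi_k = n_k / n
    cond = np.zeros((d, n_values, n_classes))                   # cond[j, l, k] = P(x^(j) = l | y = k)
    for k in range(n_classes):
        Xk = X[y == k]
        for j in range(d):
            cond[j, :, k] = np.bincount(Xk[:, j], minlength=n_values) / len(Xk)  # n_lk / n_k
    return prior, cond
```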

Naive Bayes: Gaussian Distribution Example (continuous features). Estimate the Gaussian parameters for $P(x^{(j)} = x \mid y = k)$. Mean:
$$\mu_{jk} = \frac{1}{n_k} \sum_{i=1}^{n} x_i^{(j)}\, I(y_i = k).$$
Variance:
$$\sigma_{jk}^2 = \frac{1}{n_k} \sum_{i=1}^{n} \big(x_i^{(j)} - \mu_{jk}\big)^2\, I(y_i = k).$$
Compute the conditional pdf based on the estimated parameters:
$$P(x^{(j)} = x \mid y = k) = \frac{1}{\sqrt{2\pi \sigma_{jk}^2}} \exp\!\left( -\frac{(x - \mu_{jk})^2}{2 \sigma_{jk}^2} \right).$$
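A minimal sketch of the Gaussian case, combining the estimated priors, per-class means and variances into a predictor; the small variance floor is my own precaution against zero-variance features and is not part of the slides.

```python
import numpy as np

def gaussian_nb_fit(X, y, n_classes):
    """Per-class, per-feature Gaussian MLE: priors pi_k, means mu_jk and variances sigma^2_jk."""
    n, d = X.shape
    prior = np.bincount(y, minlength=n_classes) / n
    mu = np.vstack([X[y == k].mean(axis=0) for k in range(n_classes)])          # shape (K, d)
    var = np.vstack([X[y == k].var(axis=0) for k in range(n_classes)]) + 1e-9   # small variance floor
    return prior, mu, var

def gaussian_nb_predict(X, prior, mu, var):
    """arg max_k  log pi_k + sum_j log N(x^(j); mu_jk, sigma^2_jk), evaluated in log space."""
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None, :, :]
                      + (X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :]).sum(axis=2)
    return (np.log(prior)[None, :] + log_lik).argmax(axis=1)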

Naive Bayes: Text Document Classification Example. Naive Bayes is often used in text document classification, e.g. of scientific articles or emails. A basic standard model for text classification considers a pre-specified dictionary of $p$ words and summarises each document $i$ by a binary vector $x_i$, where
$$x_i^{(j)} = \begin{cases} 1 & \text{if word } j \text{ is present in document } i, \\ 0 & \text{otherwise.} \end{cases}$$

Naive Bayes: Text Document Classification Example. Presence of word $j$ is the $j$-th feature/dimension. Naive Bayes is a plug-in classifier which ignores feature correlations and assumes
$$g_k(x_i) = P(x = x_i \mid y = k) = \prod_{j=1}^{p} P\big(x^{(j)} = x_i^{(j)} \mid y = k\big) = \prod_{j=1}^{p} \phi_{kj}^{\,x_i^{(j)}} (1 - \phi_{kj})^{1 - x_i^{(j)}},$$
where the conditional PMF is parametrized by $\phi_{kj} = P(x^{(j)} = 1 \mid y = k)$, the probability that the $j$-th word appears in a class-$k$ document. Given the dataset, the MLE of the parameters is
$$\hat\pi_k = \frac{n_k}{n}, \qquad \hat\phi_{kj} = \frac{\sum_{i : y_i = k} x_i^{(j)}}{n_k}.$$
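A short sketch of this Bernoulli model for binary word-presence vectors; the function names are mine, and the log-space prediction is the same numerical precaution used earlier.

```python
import numpy as np

def bernoulli_nb_fit(X, y, n_classes):
    """MLE for the text model: X is an (n, p) binary word-presence matrix, y integer class labels."""
    n, p = X.shape
    pi = np.bincount(y, minlength=n_classes) / n                          # pi_k = n_k / n
    phi = np.vstack([X[y == k].mean(axis=0) for k in range(n_classes)])   # phi_kj = sum_{i: y_i=k} x_i^(j) / n_k
    return pi, phi

def bernoulli_nb_predict(X, pi, phi, eps=1e-12):
    """arg max_k  log pi_k + sum_j [ x^(j) log phi_kj + (1 - x^(j)) log(1 - phi_kj) ]."""
    # eps guards against log(0) when phi_kj is exactly 0 or 1 -- the overfitting issue noted below.
    log_phi, log_1m = np.log(phi + eps), np.log(1 - phi + eps)
    scores = np.log(pi)[None, :] + X @ log_phi.T + (1 - X) @ log_1m.T
    return scores.argmax(axis=1)
```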

Naive Bayes: Text Document Classification Example. A problem with MLE: if the $l$-th word did not appear in any document labelled as class $k$, then $\hat\phi_{kl} = 0$ and, for any $x$ whose $l$-th entry equals 1,
$$P(y = k \mid x) \propto \hat\pi_k \prod_{j=1}^{p} \hat\phi_{kj}^{\,x^{(j)}} (1 - \hat\phi_{kj})^{1 - x^{(j)}} = 0,$$
i.e. we will never attribute a new document containing word $l$ to class $k$, regardless of the other words in it. This is an example of overfitting.

Naive Bayes: Why the Conditional Independence Assumption? The conditional independence assumption
$$P_\theta(x \mid y) = P_\theta(x^{(1)} \mid y) \cdots P_\theta(x^{(d)} \mid y)$$
lets us estimate $\theta$ more accurately with less data. Wrong but simple can be better than correct and complicated.