Lecture 4. Today: we examine clustering in a little more detail (we went over it somewhat quickly last time). Last time: the EM algorithm.


Today (Lecture 4)
- We examine clustering in a little more detail; we went over it somewhat quickly last time
- The CAD data will return and give us an opportunity to work with curves (!)
- We then examine the performance of estimators again...

Last time
- We examined the EM algorithm in some depth and showed how it could be used to fit discrete Gaussian mixtures
- We then looked under the hood and examined why the procedure converges
- We then related the entire EM enterprise in this context to a somewhat simpler algorithm, K-means, that is popular in the clustering literature

The EM algorithm
While we expressed the general algorithm in terms of the conditional expectation
$$\sum_y f(y \mid X, \theta^{(i-1)}) \, \log f(X, y \mid \theta),$$
you can see how this conditional expectation relates to the indicator formulation we followed for normal mixtures. We closed by examining some of the properties of estimators.
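As a concrete reminder of that conditional expectation, here is a minimal sketch (illustrative Python, not the course's own code) of the E-step quantity $E[I_j(Y_i) \mid X_i, \theta^{(0)}]$ for a univariate Gaussian mixture; the function name `e_step` and the toy data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def e_step(x, alphas, mus, sigmas):
    """Posterior probabilities (responsibilities) that replace the indicators I_j(Y_i)."""
    # x: (n,) data; alphas, mus, sigmas: length-J current parameter guesses
    dens = np.array([a * norm.pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas)])  # (J, n)
    return (dens / dens.sum(axis=0)).T   # (n, J); each row sums to 1

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 100)])
gamma = e_step(x, alphas=[0.5, 0.5], mus=[-1.0, 5.0], sigmas=[1.0, 1.0])
print(gamma[:3])   # soft assignments under the initial guess theta^(0)
```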

The EM algorithm
Recall that for our complete-data likelihood, our data were of the form $(X_1, Y_1), \ldots, (X_n, Y_n)$, so that the likelihood became
$$\prod_{i=1}^n \prod_{j=1}^J \left[\alpha_j \, N(X_i; \mu_j, \sigma_j)\right]^{I_j(Y_i)}$$
and the log-likelihood could be written as
$$\sum_{i=1}^n \log f(X_i, Y_i \mid \theta) = \sum_{i=1}^n \sum_{j=1}^J I_j(Y_i)\left[\log \alpha_j + \log N(X_i; \mu_j, \sigma_j)\right]$$

Clustering
Last time we went a little fast past a rather big area in statistics and data mining: clustering. Broadly, clustering describes the process of identifying groups in a data set, groups that are in some way closely related. Usually, the groups can be characterized by a few parameters; perhaps a small number of representative data points, or maybe the group means (often called the "cluster centers"). These parameters can, in turn, be examined and compared to help expose significant structures in a data set.

The EM algorithm
Since the only term in this expression that involves $Y_i$ is the indicator function, taking the conditional expectation of the log-likelihood with respect to $Y_i$, given $X_i$ and a guess $\theta^{(0)}$ for $\theta$, is equivalent to our approach of replacing $I_j(Y_i)$ with its conditional expectation. Again, there is a nice expression of this algorithm in terms of natural parameters and estimates for an exponential family; this is just a hint at the connection.

Clustering
K-means clustering seeks to identify $K$ groups and their associated centers $\mu_1, \ldots, \mu_K$ so as to minimize an overall objective function
$$V = \sum_{k=1}^K \sum_{X_i \in S_k} \|X_i - \mu_k\|^2$$
Last time we described an iterative algorithm that alternately forms group means and then assigns data points to the group with the closest mean.
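As a small illustration of the complete-data log-likelihood above (a sketch only; the one-hot matrix `z` standing in for the indicators $I_j(Y_i)$ is an assumed encoding, not from the lecture):

```python
import numpy as np
from scipy.stats import norm

def complete_loglik(x, z, alphas, mus, sigmas):
    # x: (n,) observations; z: (n, J) one-hot labels encoding I_j(Y_i)
    # alphas, mus, sigmas: length-J parameter vectors
    log_terms = np.log(alphas) + norm.logpdf(x[:, None], mus, sigmas)  # (n, J)
    return np.sum(z * log_terms)   # sum_i sum_j I_j(Y_i) [log alpha_j + log N(x_i; mu_j, sigma_j)]
```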

Relationship to K-means clustering
With K-means, we want to divide our data $X_1, \ldots, X_n$ into, well, $K$ groups; the algorithm is pretty simple. Make an initial guess for the means $\mu^0_1, \ldots, \mu^0_K$. Until there's no change in these means, do:
1. Use the estimated means to classify your data into clusters; each point $X_i$ is associated with the closest mean using simple Euclidean distance
2. For each cluster $k$, form the mean of the data associated with that group

K-means and vector quantization
Vector quantization (VQ) is a lossy data compression method that builds a block code for a source; each point in our (in this case 2-d) space is represented by the nearest codeword. Historically this was a hard problem because it involved a lot of multi-dimensional integrals; in the 1980s, a VQ algorithm was proposed* based on a training set of source vectors, $X_1, \ldots, X_n$. In short, we would like to design a codebook $\mu_1, \ldots, \mu_K$ and a partition $S_1, \ldots, S_K$ to represent the training set so that the overall distortion measure
$$V = \sum_{k=1}^K \sum_{X_i \in S_k} \|X_i - \mu_k\|^2$$
is as small as possible.
* Such algorithms are usually referred to as LBG-VQ for the group proposing the idea: Linde, Buzo and Gray.

Clustering
[Figure: temperature data from CAD node 157.]
At the right, we ask for three clusters, and below we present the result, with cluster centers highlighted in black. Note that the algorithm assigns points according to the nearest group mean, and so in the end we have divisions based on the Voronoi tessellation of these center points.

Let's consider some real data: at the right we have temperatures at 6am and 6pm for 232 consecutive days (from January to November of 2005) as recorded by CAD node 157. These two measurements are fairly highly correlated (0.85), and so K-means divides the data along the ellipse lengthwise. Arguably, clustering is not really achieving much in this (or the previous) case in terms of insight about the data. Let's consider a harder case...
[Figures: scatterplots of temperature at 6pm versus temperature at 6am, with the cluster assignments and cluster centers.]
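A bare-bones sketch of the two alternating steps just described (illustrative Python, not the lecture's code; the empty-cluster corner case is ignored):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]          # initial guess for the means
    for _ in range(n_iter):
        # Step 1: assign each point to the closest current mean (Euclidean distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
        labels = d2.argmin(axis=1)
        # Step 2: recompute each group mean from the points assigned to it
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):                        # stop when nothing changes
            break
        centers = new_centers
    return centers, labels
```

Each full pass (assign, then re-average) can never increase the objective $V$, which is why the iteration settles down.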

Below we plot a time series of our temperature measurements, averaged across hours, for all 232 days. The jigsaw pattern is the basic diurnal effect: warmer during the day, colder at night. Can we get some insight into the kinds of patterns we see during each day? Do the patterns change with time of year? At the right we have the same plot, but colored according to a clustering on the 24-dimensional data for two through five clusters. For this we need to collapse our data by day... What do we observe? What is the clustering highlighting here?

At the right we have, well, all the data; that is, all 232 curves, each representing the temperature over the course of a day. What do we think of this plot?
[Figure: temperature (C) versus hours past midnight, all 232 daily curves.]

Our data space is 24-dimensional; each observation is the vector of average temperatures computed over the course of a day. That means our distances are computed in 24-dimensional space, and our group means live in 24-dimensional space. So rather than treat them as abstract cluster centers, we can plot them as curves (color coding on the right matches that on the previous slide for K=5 groups).
[Figures: group means plotted as curves of average temperature against hour since midnight; clustered time series against day.]
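A hypothetical sketch of that collapse-by-day step (the file name, column layout, and variable names are all assumptions for illustration), reusing the `kmeans` sketch above:

```python
import numpy as np

hourly = np.loadtxt("cad157_hourly_temps.txt")                 # assumed: one average temperature per hour
days = hourly[: (len(hourly) // 24) * 24].reshape(-1, 24)      # (n_days, 24): one row per day

centers, labels = kmeans(days, K=5)                            # cluster the 24-dimensional daily vectors
for k, c in enumerate(centers):                                # each center is itself a daily curve
    print(f"cluster {k}: {np.sum(labels == k)} days, mean daily swing {c.max() - c.min():.1f} C")
```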

Clustering
OK, so that wasn't very stirring; it gets warmer in the summer. Instead, let's start by subtracting out the daily average and then apply K-means; this should have the effect of highlighting within-day shapes (see the sketch below). What do we observe?

At the right we have the group means for a 3-group fit; again, we can display group means as curves. What do we notice now? What are the dominant patterns?
[Figures: group means as curves of average temperature against hour since midnight; clustered series against day.]

K-means judges similarity (or dissimilarity) based on the nearness of points; the standard Euclidean distance is applied in data space. There are many dimension-reduction procedures that operate on pairwise distances between rows in a data table, with the goal of providing you a display or some kind of summary that's easier to work with than the original data. In the upcoming lectures, we will talk about hierarchical clustering as well as dimension-reduction techniques like multi-dimensional scaling. As a final comment, the mixture modeling we started with also provides a clustering of the data, but with soft rather than hard group assignments.
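A two-line sketch of that centering step (again illustrative, reusing the assumed `days` matrix and the `kmeans` sketch from above):

```python
centered = days - days.mean(axis=1, keepdims=True)   # subtract each day's average temperature
centers3, labels3 = kmeans(centered, K=3)            # cluster the within-day shapes
```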

Properties of estimators
Last time we started examining properties of estimators, specifically focusing on their mean and variance. We are, for the moment, in a frequentist paradigm, meaning that the quantities we will evaluate are based on the idea of repeated sampling.

Suppose we are given a sample $X_1, \ldots, X_n$ of size $n$ that are independent draws from a distribution $f$. An estimate $\hat\theta_n$ of a parameter $\theta$ is just some function of these points; that is, $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$. We view $\hat\theta_n$ as a random variable in the sense that each time we repeat our experiment, we would collect another sample of data, producing a different estimate. We refer to the distribution of $\hat\theta_n$ over these repeated experiments as its sampling distribution.

Variance
We can also consider the variance of an estimate; in short, how spread out is the sampling distribution? The standard deviation of $\hat\theta_n$ is called its standard error and is denoted $se(\hat\theta_n) = \sqrt{var(\hat\theta_n)}$.

Unbiasedness
The bias of an estimate is defined to be $bias(\hat\theta_n) = E\hat\theta_n - \theta$; here the expectation is taken over the sampling distribution of $\hat\theta_n$. We say that an estimate is unbiased if $E\hat\theta_n = \theta$, so that $bias(\hat\theta_n) = 0$.

Mean squared error
We often judge the reasonableness of an estimator based on its mean squared error, $MSE = E(\hat\theta_n - \theta)^2$. This quantity captures both bias and variance:
$$\begin{aligned} MSE = E(\hat\theta_n - \theta)^2 &= E(\hat\theta_n - E\hat\theta_n + E\hat\theta_n - \theta)^2 \\ &= E(\hat\theta_n - E\hat\theta_n)^2 + (E\hat\theta_n - \theta)^2 + 2(E\hat\theta_n - \theta)\,E(\hat\theta_n - E\hat\theta_n) \\ &= var(\hat\theta_n) + bias(\hat\theta_n)^2 \end{aligned}$$
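The repeated-sampling idea can be made concrete with a small Monte Carlo sketch (illustrative only; here the "experiment" is drawing $n = 50$ normal observations and the estimator is the sample mean):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.0, 50, 10_000
estimates = np.array([rng.normal(theta, 1.0, n).mean() for _ in range(reps)])  # sampling distribution

bias = estimates.mean() - theta
se = estimates.std()                       # standard error = sd of the sampling distribution
mse = np.mean((estimates - theta) ** 2)
print(bias, se, mse, se**2 + bias**2)      # MSE = var + bias^2, up to Monte Carlo error
```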

Properties of estimators
We say that an estimator $\hat\theta_n$ is consistent if, as $n$ gets large, its distribution concentrates around the parameter $\theta$. To go one level deeper, we need to recall a definition from probability (that you may or may not have had). A sequence of random variables $Z_1, Z_2, Z_3, \ldots$ is said to converge in probability to another random variable $Z$, written $Z_n \xrightarrow{P} Z$, if, for every $\epsilon > 0$,
$$P(|Z_n - Z| > \epsilon) \to 0$$

Consistency
Therefore, we say that an estimator is consistent if it converges in probability to $\theta$*. It is possible to show that if both the bias and standard error of an estimate tend to zero as we collect more and more data (that is, the MSE tends to zero), then the estimate is consistent.
* or, to be precise, to a random variable that takes on the value $\theta$ with probability 1

Example: Means and the WLLN
We can establish consistency of the sample mean using the so-called weak law of large numbers: if $Z_1, \ldots, Z_n$ are independent draws from the same distribution having mean $\mu$, then the sample mean $\bar Z_n \xrightarrow{P} \mu$ as $n \to \infty$.

An easy proof of the WLLN can be found from Chebyshev's inequality*, namely that for a random variable $Z$,
$$\Pr(|Z - EZ| \geq t) \leq \frac{var(Z)}{t^2},$$
assuming the mean and variance of $Z$ are finite.
* Actually, you don't need a second moment for the WLLN to be true, but this is a fast way to prove it.
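A quick simulation sketch of the WLLN (illustrative; exponential data with mean 1 are an arbitrary choice): the empirical probability that the sample mean misses $\mu$ by more than $\epsilon$ shrinks with $n$, and stays below the Chebyshev bound $var(Z)/(n\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, eps, reps = 1.0, 0.1, 5_000
for n in (10, 100, 1_000, 10_000):
    means = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    empirical = np.mean(np.abs(means - mu) > eps)      # Pr(|Zbar_n - mu| > eps), estimated
    chebyshev = 1.0 / (n * eps**2)                     # var(Z) = 1 for an exponential with mean 1
    print(n, empirical, min(chebyshev, 1.0))
```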

Note that the weak law of large numbers implies that the sample mean is a consistent estimate of the population mean; we don't have to put a lot of modeling assumptions in place for this to happen. Now, another good estimate of the center of a distribution is the median (recall that for the normal case, the mean and median are the same).

Let's consider consistency of the median; assume we have data $X_1, \ldots, X_n$ from some continuous distribution $f$ with median $\tilde\mu$, and let $\tilde X$ denote the sample median. To make things easy, let's also assume that we have an odd number of points ($n$ odd), so that the sample median is the $(n+1)/2$ element in the list of sorted data.

To prove consistency, let's take $\epsilon > 0$ and consider
$$\Pr(\tilde X - \tilde\mu > \epsilon) = \Pr(\tilde X > \tilde\mu + \epsilon) = \Pr\big(\text{at least } (n+1)/2 \text{ of the } X_i \text{ are bigger than } \tilde\mu + \epsilon\big)$$
Let $S_n$ denote the number of sample points $X_1, \ldots, X_n$ that are larger than $\tilde\mu + \epsilon$; that means $S_n$ has a binomial distribution $(n, p)$, where $p = \Pr(X_i > \tilde\mu + \epsilon) < 0.5$.

Substituting this into our starting equation (and assuming we have an odd number of samples), we find that
$$\begin{aligned} \Pr(\tilde X - \tilde\mu > \epsilon) &= \Pr(S_n \geq (n+1)/2) = \Pr(S_n - np \geq (n+1)/2 - np) \\ &= \Pr(S_n - np \geq n(1/2 - p) + 1/2) \leq \Pr(S_n - np \geq n(1/2 - p)) \\ &\leq \frac{p(1-p)}{n(1/2 - p)^2} \to 0 \quad \text{as } n \to \infty \end{aligned}$$
so that $\Pr(\tilde X - \tilde\mu > \epsilon) \to 0$; a similar argument can be used to show that $\Pr(\tilde X - \tilde\mu < -\epsilon) \to 0$, giving us consistency.

Comparing consistent estimators
In many cases, the differences between estimators really show up in large samples; that is, as we let the number of data points tend to infinity, we start to see differences. To formalize this, we will consider the asymptotic distribution of a sequence of estimators.
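The binomial argument can also be checked numerically with a small sketch (illustrative; standard normal data, whose median is 0):

```python
import numpy as np

rng = np.random.default_rng(3)
eps, reps = 0.2, 5_000
for n in (11, 101, 1001):                              # odd n, as in the argument above
    med = np.median(rng.standard_normal((reps, n)), axis=1)
    print(n, np.mean(np.abs(med) > eps))               # Pr(|median - mu_tilde| > eps) -> 0
```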

Example: Means and the CLT
Given a sample $X_1, \ldots, X_n$ of independent draws from a distribution with mean $\mu$ and standard deviation $\sigma$, we know that the sample mean has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The Central Limit Theorem states that
$$Z_n = \frac{\bar X_n - \mu}{\sqrt{var(\bar X_n)}} = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma} \xrightarrow{D} Z$$
where $Z$ has a standard normal (mean zero, standard deviation one) distribution.

To make this precise (as we had to do with convergence in probability), we say that a sequence of random variables $Z_1, Z_2, \ldots$ converges in distribution to $Z$ if
$$\lim_{n \to \infty} F_n(x) = F(x)$$
where $F_n$ is the CDF of $Z_n$ and $F$ is the CDF of $Z$, at all points where $F$ is continuous.

The CLT implies that $\sqrt{n}(\bar X_n - \mu)$ has a normal limiting distribution with mean zero and variance $\sigma^2$. What about the median? Given a sample $X_1, \ldots, X_n$ that come from a distribution $f$, it can be shown that $\sqrt{n}(\tilde X - \tilde\mu)$ also has a limiting normal distribution, having zero mean but with variance $1/[2f(\tilde\mu)]^2$.

Suppose our data come from a normal distribution; that is, suppose $f$ is a Gaussian
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \tilde\mu)^2/2\sigma^2}$$
where we have inserted $\tilde\mu$ since the mean and median are the same for this distribution. Therefore, $f(\tilde\mu) = 1/\sqrt{2\pi\sigma^2}$, so $\sqrt{n}(\tilde X - \tilde\mu)$ has a limiting normal distribution with mean zero and variance $\pi\sigma^2/2$.
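A small simulation sketch comparing the two limiting variances in the normal case (sample size and seed chosen arbitrarily): the Monte Carlo variance of $\sqrt{n}(\bar X_n - \mu)$ should sit near $\sigma^2$, and that of $\sqrt{n}(\tilde X - \tilde\mu)$ near $\pi\sigma^2/2 \approx 1.571$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, sigma = 501, 20_000, 1.0
X = rng.normal(0.0, sigma, size=(reps, n))
print(np.var(np.sqrt(n) * X.mean(axis=1)),          # ~ sigma^2 = 1
      np.var(np.sqrt(n) * np.median(X, axis=1)),    # ~ pi * sigma^2 / 2 ~= 1.571
      np.pi * sigma**2 / 2)
```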

So, if we use the mean to estimate the center of a distribution, we have an asymptotic variance of $\sigma^2$; if we use the median, the asymptotic variance is $1/[2f(\tilde\mu)]^2$. In the normal case, the latter expression becomes $\pi\sigma^2/2$; we can then compute the so-called asymptotic relative efficiency between using the median and the mean for data that come from a normal family:
$$\frac{\sigma^2}{1/[2f(\tilde\mu)]^2} = \frac{2}{\pi} = 0.637$$
This means that if our data really come from a normal distribution, we're better off using the sample mean instead of the sample median.

Now consider a contaminated normal family that's often used in so-called robustness studies; Tukey (1960) considered data generated by the normal mixture
$$f(x) = (1 - \epsilon)\, N(x; 0, 1) + \epsilon\, N(x; 0, \tau)$$
This family allows one to contaminate a standard normal distribution (first component) with some outliers (second component). If we had observations solely from a normal distribution, then we know the sample mean (the MLE) is an efficient estimate; but if we start to introduce outliers, what happens?

Given data from the contaminated distribution, we know that the variance of this mixture is given by $\sigma^2 = (1 - \epsilon) + \epsilon\tau^2$; also, the median of this family is 0, so that
$$f(0) = \frac{1}{\sqrt{2\pi}}\left(1 - \epsilon + \frac{\epsilon}{\tau}\right)$$
Therefore, the relative efficiency between the mean and the median is given by
$$\frac{(1 - \epsilon) + \epsilon\tau^2}{1/[2f(0)]^2} = \frac{2}{\pi}\left[(1 - \epsilon) + \epsilon\tau^2\right]\left(1 - \epsilon + \frac{\epsilon}{\tau}\right)^2$$

At the left we have plots of the asymptotic relative efficiency for four values of $\epsilon$ (0.01, 0.03, 0.05, 0.1) and $\tau$ ranging from 2 to 10. We also have a Q-Q plot for one member of the family, $\epsilon = 0.1$, $\tau = 4$, that has a relative efficiency of 1.36. In this case, the median outperforms the mean; notice the effect of the observations from the normal with greater spread.
[Figures: ARE versus tau for the four values of epsilon; normal Q-Q plot for tau = 4, epsilon = 0.1.]
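The closed-form relative efficiency above is easy to evaluate, and a short simulation of the Tukey mixture (a sketch; sample size and seed are arbitrary) gives roughly the same number for $\epsilon = 0.1$, $\tau = 4$:

```python
import numpy as np

def are(eps, tau):
    """Asymptotic variance of the mean over that of the median, contaminated normal."""
    return (2 / np.pi) * ((1 - eps) + eps * tau**2) * ((1 - eps) + eps / tau)**2

print(are(0.1, 4))                     # ~1.36: the median wins for this member of the family

rng = np.random.default_rng(5)
eps, tau, n, reps = 0.1, 4.0, 201, 20_000
outlier = rng.random((reps, n)) < eps                      # which observations are contaminated
X = np.where(outlier, rng.normal(0, tau, (reps, n)), rng.normal(0, 1, (reps, n)))
print(np.var(X.mean(axis=1)) / np.var(np.median(X, axis=1)))   # close to are(0.1, 4)
```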

With this mixture device, we can see clearly the tradeoff between the mean and the median. Next time, we will return to estimation in the context of parametric models and examine the performance of the MLE.