Today
Lecture 4: We examine clustering in a little more detail; we went over it somewhat quickly last time. The CAD data will return and give us an opportunity to work with curves (!). We then examine the performance of estimators again...

Last time
We examined the EM algorithm in some depth and showed how it could be used to fit discrete gaussian mixtures. We then looked under the hood and examined why the procedure converges. We then related the entire EM enterprise in this context to a somewhat simpler algorithm, known as k-means, that is popular in the clustering literature.

The EM algorithm
While we expressed the general algorithm in terms of the conditional expectation
$$Q(\theta \mid \theta^{(i-1)}) = \sum_y f(y \mid x, \theta^{(i-1)}) \log f(x, y \mid \theta)$$
you see how this conditional expectation relates to the indicator formulation we followed for normal mixtures. We closed by examining some of the properties of estimators.
The EM algorithm
Recall that for our complete-data likelihood, our data were of the form $(X_1, Y_1), \ldots, (X_n, Y_n)$, so that the likelihood became
$$\prod_{i=1}^n \prod_{j=1}^J \left[\alpha_j\, N(X_i; \mu_j, \sigma_j)\right]^{I_j(Y_i)}$$
and the log-likelihood could be written as
$$\sum_{i=1}^n \log f(X_i, Y_i \mid \theta) = \sum_{i=1}^n \sum_{j=1}^J I_j(Y_i)\left[\log \alpha_j + \log N(X_i; \mu_j, \sigma_j)\right]$$

Clustering
Last time we went a little fast past a rather big area in statistics and data mining, clustering. Broadly, clustering describes the process of identifying groups in a data set, groups that are in some way closely related. Usually, the groups can be characterized by a few parameters: perhaps a small number of representative data points, or maybe the group means (often called the cluster centers). These parameters can, in turn, be examined and compared to help expose significant structures in a data set.

The EM algorithm
Since the only term in this expression that involves $Y_i$ is the indicator function, taking the conditional expectation of the log-likelihood with respect to $Y_i$ given $X_i$ and a guess $\theta^{(0)}$ for $\theta$ is equivalent to our approach of replacing $I_j(Y_i)$ with its conditional expectation. Again, there is a nice expression of this algorithm in terms of natural parameters and estimates for an exponential family; this is just a hint at the connection.

Clustering
K-means clustering seeks to identify $K$ groups and their associated centers $\mu_1, \ldots, \mu_K$ so as to minimize an overall objective function
$$V = \sum_{k=1}^K \sum_{X_i \in S_k} \|X_i - \mu_k\|^2$$
Last time we described an iterative algorithm that alternately forms group means and then assigns data points to the group with the closest mean.
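To make the E-step replacement concrete, here is a minimal sketch (ours, not from the lecture; the function names are hypothetical) of computing the conditional expectation of $I_j(Y_i)$ given $X_i$ and a current parameter guess, the so-called responsibility, for a one-dimensional normal mixture:

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at x
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def e_step(x, alphas, mus, sigmas):
    """Replace each indicator I_j(Y_i) with its conditional expectation given
    X_i = x and the current guess for theta: the posterior probability that
    x came from component j (its 'responsibility')."""
    weighted = [a * normal_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Two well-separated components; a point near the first component should
# receive nearly all of its weight from that component.
resp = e_step(0.1, alphas=[0.5, 0.5], mus=[0.0, 10.0], sigmas=[1.0, 1.0])
```

The responsibilities are nonnegative and sum to one, so each point contributes fractionally to every component's weighted means in the M-step.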
Relationship to clustering
With k-means, we want to divide our data $X_1, \ldots, X_n$ into, well, $K$ groups; the algorithm is pretty simple. Make an initial guess for the means $\mu_1^0, \ldots, \mu_K^0$, then until there's no change in these means do:
1. Use the estimated means to classify your data into clusters; each point $X_i$ is associated with the closest mean using simple Euclidean distance
2. For each cluster $k$, form the mean of the data associated with the group

K-means and vector quantization
Vector quantization (VQ) is a lossy data compression method that builds a block code for a source; each point in our (in this case 2-d) space is represented by the nearest codeword. Historically this was a hard problem because it involved a lot of multi-dimensional integrals; in the 1980s, a VQ algorithm was proposed* based on a training set of source vectors $X_1, \ldots, X_n$. In short, we would like to design a codebook $\mu_1, \ldots, \mu_K$ and a partition $S_1, \ldots, S_K$ to represent the training set so that the overall distortion measure
$$V = \sum_{k=1}^K \sum_{X_i \in S_k} \|X_i - \mu_k\|^2$$
is as small as possible.

* Such algorithms are usually referred to as LBG-VQ for the group proposing the idea: Linde, Buzo and Gray

Clustering
At the right, we ask for three clusters; and below we present the result, with cluster centers highlighted in black. Note that the algorithm assigns points according to the nearest group mean, and so in the end we have divisions based on the Voronoi tessellation of these center points. Let's consider some real data; at the right we have temperatures at 6am and 6pm for 232 consecutive days (from January to November of 2005) as recorded by CAD node 157. These two measurements are fairly highly correlated (0.85), and so k-means divides the data along the ellipse lengthwise. Arguably, clustering is not really achieving much in this (or the previous) case in terms of insight about the data. Let's consider a harder case...
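The two-step iteration above can be sketched in a few lines; this is a plain-Python illustration for small 2-d data sets (our own sketch, not an optimized implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means: alternate between assigning each point to its
    nearest center (Euclidean distance) and recomputing each center as the
    mean of its group, until the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial guess: k distinct data points
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 1: assign each point to the nearest current center
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            groups[j].append(p)
        # Step 2: recompute each center as the mean of its group
        new_centers = []
        for j, g in enumerate(groups):
            if g:
                new_centers.append((sum(p[0] for p in g) / len(g),
                                    sum(p[1] for p in g) / len(g)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

# Two obvious clumps, around (0, 0) and (10, 10)
pts = [(0.1, 0.0), (0.0, 0.2), (-0.1, 0.1),
       (10.0, 10.1), (9.9, 10.0), (10.1, 9.8)]
centers, groups = kmeans(pts, 2)
```

For this toy data the two recovered centers should land near the two clump means, whichever points the initial guess happens to pick.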
[Figure: two scatterplots of 6pm temperature against 6am temperature; the second shows the three-cluster k-means result with the cluster centers marked.]
Below we plot a time series of our temperature measurements, averaged across hours, for all 232 days. The jigsaw pattern is the basic diurnal effect: warmer during the day, colder at night. Can we get some insight into the kinds of patterns we see during each day? Do the patterns change with time of year? For this we need to collapse our data by day...

At the right we have, well, all the data; that is, all 232 curves, each representing the temperature over the course of a day. What do we think of this plot?

At the right we have the same plot, but colored according to a k-means clustering on the 24-dimensional data, for two through five clusters. What do we observe? What is the clustering highlighting here?

Our data space is 24-dimensional: each observation is the vector of average temperatures computed over the course of a day. That means our distances are computed in 24-dimensional space, and our group means live in 24-dimensional space. So rather than treat them as abstract cluster centers, we can plot them as curves (the color coding on the right matches that on the previous slide for K = 5 groups).

[Figures: temperature (C) against hours past midnight and against day, plus the group-mean curves against hour since midnight.]
Clustering
OK, so that wasn't very stirring; it gets warmer in the summer. Instead, let's start by subtracting out the daily average and then apply k-means; this should have the effect of highlighting within-day shapes. What do we observe? At the right we have the group means for a 3-group fit; again, we can display the group means as curves. What do we notice now? What are the dominant patterns?

[Figure: group-mean curves (average temperature against hour since midnight) for the 3-group fit on the centered data.]

K-means judges similarity (or dissimilarity) based on the nearness of points; the standard Euclidean distance is applied in data space. There are many dimension-reduction procedures that operate on pairwise distances between rows in a data table, with the goal of providing you a display or some kind of summary that's easier to work with than the original data. In the upcoming lectures, we will talk about hierarchical clustering as well as dimension-reduction techniques like multi-dimensional scaling. As a final comment, the mixture modeling we started with also provides a clustering of the data, but with soft rather than hard group assignments.

Properties of estimators
Last time we started examining properties of estimators, specifically focusing on their mean and variance. We are, for the moment, in a frequentist paradigm, meaning that the quantities we will evaluate are based on the idea of repeated sampling.
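The "subtract out the daily average" step is just row-centering before clustering; a small sketch (the helper name is ours) shows why it isolates within-day shape:

```python
def center_rows(table):
    """Subtract each row's mean from that row, so that clustering compares
    within-day shapes rather than overall temperature levels.
    `table` is a list of equal-length rows (days x hours)."""
    return [[x - sum(row) / len(row) for x in row] for row in table]

# Two days with the same diurnal shape but different overall levels
days = [[10, 12, 18, 14],
        [ 0,  2,  8,  4]]
centered = center_rows(days)
# After centering, the two rows should be identical: only shape remains.
```

Running k-means on the centered rows then groups days by the shape of their temperature curve, not by season.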
Properties of estimators
Suppose we are given a sample $X_1, \ldots, X_n$ of size $n$, independent draws from a distribution $f$. An estimate $\hat\theta_n$ of a parameter $\theta$ is just some function of these points; that is, $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$. We view $\hat\theta_n$ as a random variable in the sense that each time we repeat our experiment, we would collect another sample of data, producing a different estimate. We refer to the distribution of $\hat\theta_n$ over these repeated experiments as its sampling distribution.

Unbiasedness
The bias of an estimate is defined to be $\mathrm{bias}(\hat\theta_n) = E\hat\theta_n - \theta$; here the expectation is taken with respect to the sampling distribution of $\hat\theta_n$. We say that an estimate is unbiased if $E\hat\theta_n = \theta$, so that $\mathrm{bias}(\hat\theta_n) = 0$.

Variance
We can also consider the variance of an estimate; in short, how spread out is the sampling distribution? The standard deviation of $\hat\theta_n$ is called its standard error and is denoted $\mathrm{se}(\hat\theta_n) = \sqrt{\mathrm{var}(\hat\theta_n)}$.

Mean squared error
We often judge the reasonableness of an estimator based on its mean squared error, $\mathrm{MSE} = E(\hat\theta_n - \theta)^2$. This quantity captures both bias and variance:
$$\mathrm{MSE} = E(\hat\theta_n - \theta)^2 = E(\hat\theta_n - E\hat\theta_n + E\hat\theta_n - \theta)^2 = E(\hat\theta_n - E\hat\theta_n)^2 + (E\hat\theta_n - \theta)^2 + 2(E\hat\theta_n - \theta)E(\hat\theta_n - E\hat\theta_n) = \mathrm{var}(\hat\theta_n) + \mathrm{bias}(\hat\theta_n)^2$$
where the cross term vanishes because $E(\hat\theta_n - E\hat\theta_n) = 0$.
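The decomposition $\mathrm{MSE} = \mathrm{var} + \mathrm{bias}^2$ can be checked by simulating the repeated-sampling picture; a sketch (ours, not from the lecture) using the biased divide-by-$n$ variance estimate, which has both nonzero bias and nonzero variance:

```python
import random

def mse_decomposition(estimator, sampler, theta, reps=20000, seed=1):
    """Monte Carlo check of MSE = var + bias^2: draw many samples, apply the
    estimator to each, and compute the empirical moments of the estimates."""
    rng = random.Random(seed)
    estimates = [estimator(sampler(rng)) for _ in range(reps)]
    mean_est = sum(estimates) / reps
    bias = mean_est - theta
    var = sum((e - mean_est) ** 2 for e in estimates) / reps
    mse = sum((e - theta) ** 2 for e in estimates) / reps
    return mse, var, bias

def var_hat(xs):
    # Divide-by-n variance estimate: biased downward by sigma^2 / n
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# True variance theta = 1 for standard normal draws, samples of size n = 5
mse, var, bias = mse_decomposition(var_hat,
                                   lambda r: [r.gauss(0, 1) for _ in range(5)],
                                   theta=1.0)
```

The empirical identity holds exactly (up to float error), and the estimated bias should come out negative, near $-\sigma^2/n = -0.2$.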
Properties of estimators
We say that an estimator $\hat\theta_n$ is consistent if, as $n$ gets large, its distribution concentrates around the parameter $\theta$. To go one level deeper, we need to recall a definition from probability (that you may or may not have had). A sequence of random variables $Z_1, Z_2, Z_3, \ldots$ is said to converge in probability to another random variable $Z$, written $Z_n \xrightarrow{P} Z$, if for every $\epsilon > 0$
$$\Pr(|Z_n - Z| > \epsilon) \to 0$$

Consistency
Therefore, we say that an estimator is consistent if it converges in probability to $\theta$*. It is possible to show that if both the bias and standard error of an estimate tend to zero as we collect more and more data (that is, if the MSE tends to zero), then the estimate is consistent.

* or, to be precise, to a random variable that takes on the value $\theta$ with probability 1

Example: Means and the WLLN
We can establish consistency of the sample mean using the so-called weak law of large numbers: if $Z_1, \ldots, Z_n$ are independent draws from the same distribution having mean $\mu$, then the sample mean $\bar Z_n \xrightarrow{P} \mu$ as $n \to \infty$.

Example: Means and the WLLN
An easy proof of the WLLN can be found from Chebyshev's inequality*, namely that for a random variable $Z$
$$\Pr(|Z - EZ| \ge t) \le \frac{\mathrm{var}(Z)}{t^2}$$
assuming the mean and variance of $Z$ are finite.

* Actually, you don't need a second moment for the WLLN to be true, but this is a fast way to prove it.
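We can watch the WLLN at work by estimating the tail probability $\Pr(|\bar Z_n - \mu| > \epsilon)$ by simulation; a sketch (ours) for uniform(0,1) draws, where $\mu = 1/2$ and $\mathrm{var} = 1/12$, so Chebyshev bounds the tail by $(1/12)/(n\epsilon^2)$:

```python
import random

def tail_prob(n, eps, reps=5000, seed=2):
    """Estimate Pr(|sample mean - 0.5| > eps) for n uniform(0,1) draws.
    The WLLN says this shrinks toward 0 as n grows; Chebyshev bounds it
    by (1/12) / (n * eps^2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        m = sum(rng.random() for _ in range(n)) / n
        if abs(m - 0.5) > eps:
            hits += 1
    return hits / reps

p10 = tail_prob(10, 0.1)    # tail probability for n = 10
p100 = tail_prob(100, 0.1)  # should be much smaller for n = 100
```

The estimated tail probability drops sharply with $n$ and stays under the Chebyshev bound, which is loose but sufficient for the limit.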
Note that the weak law of large numbers implies that the sample mean is a consistent estimate of the population mean; we don't have to make a lot of modeling assumptions for this to happen. Now, another good estimate of the center of a distribution is the median (recall that for the normal case, the mean and median are the same).

Let's consider consistency of the median; assume we have data $X_1, \ldots, X_n$ from some continuous distribution $f$ with median $\tilde\mu$, and let $\tilde X$ denote the sample median. To make things easy, let's also assume that we have an odd number of points ($n$ odd), so that the sample median is the $(n+1)/2$ element in the list of sorted data.

To prove consistency, let's take $\epsilon > 0$ and consider
$$\Pr(\tilde X - \tilde\mu > \epsilon) = \Pr(\tilde X > \tilde\mu + \epsilon) = \Pr(\text{at least } (n+1)/2 \text{ of the } X_i\text{'s are bigger than } \tilde\mu + \epsilon)$$
Let $S_n$ denote the number of sample points $X_1, \ldots, X_n$ that are larger than $\tilde\mu + \epsilon$; that means $S_n$ has a binomial distribution $(n, p)$, where $p = \Pr(X_i > \tilde\mu + \epsilon) < 0.5$.

Substituting this into our starting equation (and recalling that we have an odd number of samples), we find that
$$\Pr(\tilde X - \tilde\mu > \epsilon) = \Pr(S_n \ge (n+1)/2) = \Pr(S_n - np \ge (n+1)/2 - np) = \Pr(S_n - np \ge n(1/2 - p) + 1/2) \le \Pr(S_n - np \ge n(1/2 - p)) \le \frac{p(1-p)}{n(1/2 - p)^2} \to 0 \quad \text{as } n \to \infty$$
where the last inequality is Chebyshev applied to $S_n$, which has variance $np(1-p)$. So $\Pr(\tilde X - \tilde\mu > \epsilon) \to 0$; a similar argument can be used to show that $\Pr(\tilde X - \tilde\mu < -\epsilon) \to 0$, giving us consistency.

Comparing consistent estimators
In many cases, the differences between estimators really show up in large samples; that is, as we let the number of data points tend to infinity, we start to see differences. To formalize this, we will consider the asymptotic distribution of a sequence of estimators.
Example: Means and the CLT
Given a sample $X_1, \ldots, X_n$ of independent draws from a distribution with mean $\mu$ and standard deviation $\sigma$, we know that the sample mean $\bar X$ has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The Central Limit Theorem states that
$$Z_n = \frac{\sqrt{n}(\bar X - \mu)}{\sigma} \xrightarrow{D} Z$$
where $Z$ has a standard normal (mean zero, standard deviation one) distribution.

To make this precise (as we had to do with convergence in probability), we say that a sequence of random variables $Z_1, Z_2, \ldots$ converges in distribution to $Z$ if
$$\lim_{n \to \infty} F_n(x) = F(x)$$
where $F_n$ is the CDF of $Z_n$ and $F$ is the CDF of $Z$, at all points where $F$ is continuous.

Example: Means and the CLT
The CLT implies that $\sqrt{n}(\bar X - \mu)$ has a normal limiting distribution with mean zero and variance $\sigma^2$. What about the median?

Example: Means and the CLT
Now, given a sample $X_1, \ldots, X_n$ that come from a distribution $f$, it can be shown that $\sqrt{n}(\tilde X - \tilde\mu)$ also has a limiting normal distribution, having zero mean but with variance $1/[2f(\tilde\mu)]^2$. Suppose our data come from a normal distribution; that is, suppose $f$ is a gaussian
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \tilde\mu)^2/2\sigma^2}$$
where we have inserted $\tilde\mu$ since the mean and median are the same for this distribution. Therefore $f(\tilde\mu) = 1/\sqrt{2\pi\sigma^2}$, so $\sqrt{n}(\tilde X - \tilde\mu)$ has a limiting normal distribution with mean zero and variance $\pi\sigma^2/2$.
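These two limiting variances, $\sigma^2$ for the scaled mean and $\pi\sigma^2/2 \approx 1.57\sigma^2$ for the scaled median, are easy to see in a simulation (a sketch of ours, using standard normal data so $\sigma^2 = 1$):

```python
import math
import random
import statistics

def scaled_estimator_vars(n=101, reps=4000, seed=3):
    """For repeated samples of n standard normal draws, compute the empirical
    variances of sqrt(n) * (sample mean) and sqrt(n) * (sample median).
    These should be near sigma^2 = 1 and pi * sigma^2 / 2, respectively."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        means.append(math.sqrt(n) * (sum(xs) / n))
        medians.append(math.sqrt(n) * statistics.median(xs))
    return statistics.pvariance(means), statistics.pvariance(medians)

v_mean, v_median = scaled_estimator_vars()
# v_mean should be near 1, v_median near pi/2
```

With a few thousand replications the ratio of the two variances lands close to $\pi/2$, matching the asymptotic calculation above.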
So, if we use the mean to estimate the center of a distribution, we have an asymptotic variance of $\sigma^2$; if we use the median, the asymptotic variance is $1/[2f(\tilde\mu)]^2$. In the normal case, the latter expression becomes $\pi\sigma^2/2$; we can then compute the so-called asymptotic relative efficiency between using the median and the mean for data that come from a normal family:
$$\frac{\sigma^2}{1/[2f(\tilde\mu)]^2} = \frac{2}{\pi} \approx 0.637$$
This means that if our data really come from a normal distribution, we're better off using the sample mean instead of the sample median.

Now consider a contaminated normal family that's often used in so-called robustness studies; Tukey (1960) considered data generated by the normal mixture
$$f(x) = (1 - \epsilon)\, N(x; 0, 1) + \epsilon\, N(x; 0, \tau)$$
This family allows one to contaminate a standard normal distribution (the first component) with some outliers (the second component). If we had observations solely from a normal distribution, then we know the sample mean (the MLE) is an efficient estimate; but if we start to introduce outliers, what happens?

Given data from the contaminated distribution, we know that the variance of this mixture is given by $\sigma^2 = (1 - \epsilon) + \epsilon\tau^2$; also, the median of this family is 0, so that
$$f(0) = \frac{1}{\sqrt{2\pi}} \left(1 - \epsilon + \frac{\epsilon}{\tau}\right)$$
Therefore, the relative efficiency between the mean and the median is given by
$$\frac{(1 - \epsilon) + \epsilon\tau^2}{1/[2f(0)]^2} = \frac{2}{\pi}\left[(1 - \epsilon) + \epsilon\tau^2\right]\left(1 - \epsilon + \frac{\epsilon}{\tau}\right)^2$$

At the left we have plots of the asymptotic relative efficiency for four values of $\epsilon$ ($\epsilon = 0.1, 0.05, 0.03, 0.01$) and $\tau$ ranging from 2 to 10. We also have a Q-Q plot for one member of the family, $\epsilon = 0.1$, $\tau = 4$, that has a relative efficiency of 1.36. In this case, the median outperforms the mean; notice the effect of the observations from the normal with greater spread.

[Figures: ARE against tau (2 to 10) for the four values of epsilon, and a normal Q-Q plot for tau = 4, epsilon = 0.1.]
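The claimed advantage of the median under contamination can be checked directly; a simulation sketch (ours) for Tukey's mixture with $\epsilon = 0.1$, $\tau = 4$, where the relative efficiency computed above is 1.36:

```python
import random
import statistics

def compare_mean_median(eps=0.1, tau=4.0, n=101, reps=4000, seed=4):
    """Draw repeated samples from the contaminated normal
    (1 - eps) N(0, 1) + eps N(0, tau^2) and compare the mean squared error
    (around the true center 0) of the sample mean and the sample median."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = [rng.gauss(0, tau) if rng.random() < eps else rng.gauss(0, 1)
              for _ in range(n)]
        means.append(sum(xs) / n)
        medians.append(statistics.median(xs))
    mse_mean = sum(m * m for m in means) / reps
    mse_median = sum(m * m for m in medians) / reps
    return mse_mean, mse_median

mse_mean, mse_median = compare_mean_median()
# mse_mean should exceed mse_median, with a ratio near 1.36 for large n
```

The mean absorbs the occasional wide-spread observation directly, while the median barely moves, which is exactly the tradeoff the ARE formula quantifies.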
With this mixture device, we can see clearly the tradeoff between the mean and the median. Next time, we will return to estimation in the context of parametric models and examine the performance of the MLE.