Clustering and Dimensionality Reduction Some material in these slides is borrowed from Andrew Moore's excellent machine learning tutorials, located at:
Data Mining Automatically extracting meaning from large, high-dimensional data sets. We will talk about two broad approaches today: dimensionality reduction and clustering.
High Dimensional Data??
Dimensionality Reduction... Re-representing high-dimensional data in a low-dimensional space so that as much information as possible is preserved. Sometimes this can be done by hand... Often, we would like to automate the process... Benefits: visualization, saving space, computational efficiency.
Principal Components Analysis Each point is represented by two numbers. How best to squeeze that down to one?
Informal PCA Summary The principal components of a data set are the directions of maximum variance in the data. Each principal component is perpendicular to all of the others. Dimensionality reduction is performed by projecting the data onto the top few principal components.
The Details... The principal components are computed by finding the eigenvectors of the covariance matrix for the data set. The eigenvalues indicate the amount of variance captured by the corresponding eigenvector.
More Details... If P is a matrix whose columns are the first m principal components, dimensionality reduction is as easy as: y = P^T x, where x is the (centered) data point and y is the low-dimensional projection. We can map back to the original space (approximately, when m is less than the original dimension) with: x ≈ P y.
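As a concrete illustration, here is a minimal NumPy sketch of this projection (the function names pca_project and pca_reconstruct are my own, not from the slides; note that the data is centered before computing the covariance matrix):

```python
import numpy as np

def pca_project(X, m):
    # X: (n_samples, n_features) data matrix; m: number of components to keep.
    mu = X.mean(axis=0)
    Xc = X - mu                        # center the data first
    cov = np.cov(Xc, rowvar=False)     # covariance matrix of the features
    vals, vecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1]     # largest variance first
    P = vecs[:, order[:m]]             # columns = top-m principal components
    Y = Xc @ P                         # y = P^T x, applied to every row
    return Y, P, mu

def pca_reconstruct(Y, P, mu):
    # Approximate inverse: x ≈ P y + mean (exact only if m = original dimension).
    return Y @ P.T + mu
```

The sorted eigenvalues also tell you how much variance each retained component captures, which is a quick way to decide how many components to keep.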
PCA For Image Compression
Clustering The problem of grouping unlabeled data on the basis of similarity. A key question in data mining: is there useful structure hidden in this data? Applications: image segmentation, document clustering, protein class discovery, compression. One problem: how do we define similarity? First thought: Euclidean distance.
K-means 1. Ask user how many clusters they'd like (e.g., k = 5). 2. Randomly guess k cluster center locations. 3. Each data point finds out which center it is closest to (thus each center owns a set of data points). 4. Each center finds the centroid of the points it owns... 5. ...and jumps there. 6. Repeat until terminated! (A minimal code sketch follows below.)
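The recipe above translates almost directly into code. A minimal NumPy sketch of Lloyd's algorithm (the function name kmeans and the empty-cluster handling are my choices, not part of the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    rng = np.random.default_rng(rng)
    # Step 2: randomly guess k center locations (here: k distinct data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each data point finds the center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4-5: each center jumps to the centroid of the points it owns.
        new_centers = centers.copy()
        for j in range(k):
            owned = X[labels == j]
            if len(owned) > 0:          # keep the old center if a cluster goes empty
                new_centers[j] = owned.mean(axis=0)
        # Step 6: terminate once the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```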
K-means Start Advance apologies: in black and white, this example will deteriorate. Example generated by Dan Pelleg's super-duper-fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases (KDD99), 1999 (available at www.autonlab.org/pap.html).
K-means continues... (several iterations, shown as animation frames in the original slides)
K-means terminates
Next Question: How Many Clusters?
This Looks Right
This Looks Wrong It is not possible to eyeball data in higher dimensions.
How Do We Define Success? We want two things: Compact clusters: all data points should be near their cluster centers. (We can calculate the total distance from each point to its cluster center.) Few clusters: it is not very useful if we have as many clusters as data points. These two goals are in conflict, so we need to find a good trade-off.
Looking For Elbows...
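A common way to look for the elbow: run K-means for a range of k and plot the total within-cluster squared distance against k. A sketch assuming scikit-learn and matplotlib are available (the synthetic data set is for illustration only; inertia_ is scikit-learn's name for the total squared distance from points to their centers):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # "true" k is 4

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("total within-cluster squared distance")
plt.show()
```

The curve drops steeply up to the true number of clusters and flattens afterward; the bend (the elbow) marks the trade-off point.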
K-Means Evaluation It is guaranteed to converge. It is not guaranteed to reach the global minimum (only a local one). Commonly used because: it is easy to code, and it is efficient.
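Because only a local minimum is guaranteed, a standard remedy (implied rather than stated on the slide) is random restarts: run K-means several times from different initializations and keep the run with the lowest total within-cluster distance. A sketch building on the kmeans() function above:

```python
import numpy as np

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(n_restarts):
        centers, labels = kmeans(X, k, rng=rng.integers(1 << 31))
        cost = ((X - centers[labels]) ** 2).sum()  # total squared distance
        if cost < best_cost:
            best, best_cost = (centers, labels), cost
    return best
```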
Single-Linkage Hierarchical Clustering 1. Say every point is its own cluster. 2. Find the most similar pair of clusters. 3. Merge them into a parent cluster. 4. Repeat until you've merged the whole data set into one cluster. How do we define similarity between clusters? Minimum distance between points in the clusters (in which case we're simply doing Euclidean minimum spanning trees); maximum distance between points in the clusters; or average distance between points in the clusters. You're left with a nice dendrogram, or taxonomy, or hierarchy of data points (not shown here). (A code sketch follows below.)
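In practice the loop above is rarely hand-coded; SciPy implements all three linkage choices. A minimal sketch on synthetic data (for illustration only):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))      # 20 random 2-D points

# method='single' = minimum inter-cluster distance;
# 'complete' and 'average' give the other two definitions above.
Z = linkage(X, method="single")

dendrogram(Z)                     # the resulting taxonomy / hierarchy
plt.show()
```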
Other Kinds of Similarity Sometimes minimizing Euclidean distance doesn't seem to be the right idea. On this data set (figure not shown), K-means gives us these clusters (figure not shown). Take COMP 486: Introduction to Machine Learning.