Clustering: Unsupervised data mining technique

Content
- Examples
- Cluster analysis
- Partitional: K-Means clustering method
- Hierarchical clustering methods
- Data preparation in clustering
- Interpreting clusters
- Cluster validation

Typical Applications
- Marketing: help marketers segment customers based on similar buying patterns, then use this knowledge to develop targeted marketing programs
- Insurance: identify groups of motor insurance policy holders with a high average claim cost
- Image processing: compress images
- Pre-processing step: identify groups for further modeling purposes, e.g. perform clustering and then regression by cluster

Examples
- Clustering of people
- Clustering of products
- Clustering for better fit
- Clustering of financial time series

Image compression: Kohonen vector quantization
Example: Sir Ronald A. Fisher (1890-1962)
- Left: 1024 x 1024 greyscale image at 8 bits per pixel, with 1 MB of storage
- Center: 2x2 block VQ, using k=200 clusters, with 245 KB of storage
- Right: 2x2 block VQ, using k=4 clusters, with 64 KB of storage

Cluster Analysis
- Also called segmentation
- Goal: to group a collection of cases into subsets, such that
  - cases within each cluster tend to be similar to each other
  - cases in different clusters tend to be dissimilar
- Cases are similar in what sense?
- What is the best answer for clustering analysis?
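Returning to the vector quantization example, the quoted storage figures can be checked with a little arithmetic. This is only a sketch under stated assumptions: each 2x2 block is stored as one index into the k-entry codebook at an idealized cost of log2(k) bits, and the small cost of storing the codebook itself is ignored. The function name `vq_storage_bytes` is made up for this illustration.

```python
import math

def vq_storage_bytes(width, height, block, k):
    """Approximate bytes needed to store one codebook index per block."""
    n_blocks = (width // block) * (height // block)
    bits_per_block = math.log2(k)  # assumption: ideal log2(k)-bit index
    return n_blocks * bits_per_block / 8

original_bytes = 1024 * 1024 * 8 / 8  # 8 bits per pixel
print(f"original: {original_bytes / 1024:.0f} KB")
for k in (200, 4):
    kb = vq_storage_bytes(1024, 1024, 2, k) / 1024
    print(f"k={k}: about {kb:.0f} KB")
```

Under these assumptions the results come out at roughly 245 KB for k=200 and 64 KB for k=4, in line with the figures on the slide.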

Many Ways of Clustering

Types of Clustering Methods
Two major clustering methods:
- Hierarchical: a nested set of clusters is created
- Partitional: one set of clusters is created
Other clustering methods:
- Density-based: based on the notion of density
- Grid-based: based on a multiple-level grid structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model

K-Means clustering
Example 1: K=5 (assignment and re-assignment steps)
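The assignment and re-assignment steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the data from Example 1: the toy points, the choice of K=2, the explicit initial centers, and all function names are assumptions made here. Points are assigned to the nearest center, centers are moved to the mean of their members, and points are then re-assigned until the labels stop changing.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of the points currently assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)  # an empty cluster keeps its old center
    return new_centers

def kmeans(points, initial_centers, max_iter=100):
    centers = [tuple(c) for c in initial_centers]
    labels = assign(points, centers)           # first assignment
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)   # re-assignment
        if new_labels == labels:               # no point changed cluster
            break
        labels = new_labels
    return centers, labels

# Hypothetical toy data with two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = kmeans(points, initial_centers=[(1, 1), (8, 8)])
print(labels)
print(centers)
```

Passing the initial centers explicitly keeps the run deterministic; in practice they are often chosen at random.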

Similarities: Distance measure
A good distance measure satisfies:
- Non-negativity: d(x,y) >= 0
- Symmetry: d(x,y) = d(y,x)
- Triangle inequality: d(x,y) <= d(x,z) + d(z,y)
- Identity: d(x,y) = 0 if and only if x = y

Some Distance Measures
- Euclidean distance: most widely used; clusters formed tend to be spherical in shape
- Manhattan (city-block) distance: clusters formed tend to be more cubical in shape

Euclidean Distance: d(x,y) = sqrt(sum_i (x_i - y_i)^2)
Manhattan Distance: d(x,y) = sum_i |x_i - y_i|
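As a small illustration, the distance measures can be written directly in Python and checked against the listed properties. Hamming distance, which the following slide names, is included as the number of positions at which two equal-length sequences differ. The function names and sample points are chosen here for illustration only.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions where two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

x, y, z = (0, 0), (3, 4), (3, 0)
print(euclidean(x, y))            # 5.0
print(manhattan(x, y))            # 7
print(hamming("10110", "10011"))  # 2

# Spot-check the properties of a good distance measure on these points
for d in (euclidean, manhattan):
    assert d(x, y) >= 0                    # non-negativity
    assert d(x, y) == d(y, x)              # symmetry
    assert d(x, y) <= d(x, z) + d(z, y)    # triangle inequality
    assert d(x, x) == 0                    # identity
```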

Hamming distance
Exercise
Comments on K-Means
Importance of choosing initial cluster centers
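A tiny constructed example shows why the initial centers matter. The sketch below is an assumption-laden illustration, not the slide's own example: four hypothetical points form the corners of a wide rectangle, a basic K-Means loop is run from two different starting configurations, and the results are compared by the within-cluster sum of squared distances (the helper names `kmeans` and `sse` are made up here).

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, max_iter=100):
    """Basic K-Means started from the given centers; returns (centers, labels)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # assign every point to its nearest current center
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # recompute each center as the mean of its members
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            )
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    return centers, labels

def sse(points, centers, labels):
    """Within-cluster sum of squared distances to the assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

points = [(0, 0), (0, 1), (4, 0), (4, 1)]       # corners of a 4 x 1 rectangle
good = kmeans(points, [(0, 0.5), (4, 0.5)])     # starts near the left/right pairs
poor = kmeans(points, [(2, 0), (2, 1)])         # starts on the bottom/top edges
print(sse(points, *good))   # 1.0
print(sse(points, *poor))   # 16.0
```

Both runs converge immediately, but the second one is stuck in a much worse solution that pairs points across the long side of the rectangle. A common remedy is to run the algorithm from several starting configurations and keep the best result.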

Limitation of K-Means: different size

Limitation of K-Means: different density
Limitation of K-Means: non-convex shapes
Overcoming K-Means limitations

Other partitional methods

Hierarchical clustering methods
Distance between two clusters

Dendrogram
Example 2: Single linkage

Example 3: Average linkage

Example 4: Complete link
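The three linkage rules named in Examples 2 to 4 differ only in how the distance between two clusters is defined: the smallest, the mean, or the largest of the pairwise point distances. The sketch below assumes Euclidean distance between points and uses hypothetical one-dimensional data (not the data from the examples); the simple bottom-up `agglomerate` loop and all function names are illustrative choices.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(euclidean(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(euclidean(p, q) for p in a for q in b)

def average_linkage(a, b):
    """Mean of all pairwise distances between the two clusters."""
    return sum(euclidean(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points, linkage, n_clusters):
    """Start with singleton clusters and repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

a, b = [(0,), (1,)], [(3,), (7,)]
print(single_linkage(a, b), average_linkage(a, b), complete_linkage(a, b))  # 2.0 4.5 7.0

points = [(0,), (1,), (3,), (7,), (8,)]
print(agglomerate(points, single_linkage, n_clusters=2))
```

Recording the order and height of each merge, rather than stopping at a fixed number of clusters, is what a dendrogram displays.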

Issues in hierarchical clustering

Data preparation in clustering
- Coding data
  - Discrete inputs
  - Interval inputs
  - Mixed inputs
- Missing values
- Variable selection
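The slides list the data preparation topics without prescribing methods, so the following is only a sketch of some common choices, all of them assumptions made here: mean imputation for missing values, z-score standardization for interval inputs (so that no variable dominates the distance because of its scale), and one-hot coding for discrete inputs. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def standardize(values):
    """Rescale an interval input to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)  # assumes the values are not all equal
    return [(v - m) / s for v in values]

def one_hot(values):
    """Code a discrete input as one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values]

income = [20, None, 40, 60]             # interval input with a missing value
region = ["north", "south", "north"]    # discrete input

filled = impute_mean(income)
print(filled)                 # [20, 40, 40, 60]
print(standardize(filled))
print(one_hot(region))        # [[1, 0], [0, 1], [1, 0]]
```

For mixed inputs, the standardized interval columns and the indicator columns can be placed side by side before computing distances, though how to weight the two kinds of columns is itself a modeling decision.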

Variable selection in clustering

Interpreting clusters
Example: Customer Segmentation on Air Miles Reward Program

Cluster Validity
How do we evaluate the goodness of the resulting clusters?
Why do we want to evaluate them?
- To avoid finding patterns in noise
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters

Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
Algorithms for Clustering Data, Jain and Dubes