Cluster Evaluation and Expectation Maximization
Adapted from: Doug Downey and Bryan Pardo, Northwestern University
Kinds of Clustering
- Sequential: fast
- Cost optimization: fixed number of clusters
- Hierarchical: start with many clusters; join clusters at each step
k-means Clustering
Hierarchical Agglomerative Clustering
- Start with N groups, each containing one instance
- Merge similar groups to form larger groups until there is a single group
Divisive Clustering
- Start with a single group
- Divide large groups into smaller groups until each group contains a single instance
Sec. 17.2 Closest pair of clusters
Many variants of defining the closest pair of clusters:
- Single-link: similarity of the most cosine-similar pair of points
- Complete-link: similarity of the furthest points, the least cosine-similar
- Centroid: clusters whose centroids are the most cosine-similar
- Average-link: average cosine similarity between all pairs of elements
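The four linkage variants can be sketched directly from their definitions. A minimal sketch (the helper names `cosine` and `cluster_similarity` and the toy vectors are illustrative, not from the slides):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def cluster_similarity(A, B, variant="single"):
    """Similarity of clusters A and B under the four linkage variants."""
    sims = [cosine(a, b) for a in A for b in B]
    if variant == "single":        # most cosine-similar pair
        return max(sims)
    if variant == "complete":      # least cosine-similar pair
        return min(sims)
    if variant == "average":       # average cosine over all pairs
        return sum(sims) / len(sims)
    if variant == "centroid":      # cosine of the cluster centroids
        cA = [sum(c) / len(A) for c in zip(*A)]
        cB = [sum(c) / len(B) for c in zip(*B)]
        return cosine(cA, cB)
    raise ValueError(variant)

A = [(1.0, 0.0)]
B = [(0.0, 1.0), (1.0, 1.0)]
# single-link picks cos((1,0),(1,1)) = 1/sqrt(2); complete-link picks 0.0
```

Note how single-link and complete-link bound the average-link value from above and below for the same pair of clusters.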
Cluster Labeling Differential Cluster Labeling Cluster-internal Labeling
What is a good clustering?
Internal criteria. Example of an internal criterion: reconstruction error in k-means:

E({m_i}_{i=1..k} | X) = Σ_t Σ_i b_i^t ‖x^t − m_i‖²

b_i^t = 1 if ‖x^t − m_i‖ = min_j ‖x^t − m_j‖, 0 otherwise

But an internal criterion often does not evaluate the actual utility of a clustering in the application.
Alternative: external criteria. Evaluate with respect to a human-defined classification.
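The reconstruction-error criterion above can be computed directly: each point contributes its squared distance to the nearest center. A minimal sketch on hypothetical toy data:

```python
def reconstruction_error(X, centers):
    """E({m_i} | X): sum over points of squared distance to the nearest center."""
    return sum(
        min(sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in centers)
        for x in X
    )

# hypothetical toy data: two points near one center, one point exactly on the other
X = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centers = [(0.0, 0.5), (10.0, 10.0)]
# contributions: 0.25 + 0.25 + 0.0 = 0.5
```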
External criteria for clustering quality
Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification.
Goal: Clustering should reproduce the classes in the gold standard.
(But we only want to reproduce how documents are divided into groups, not the class labels.)
First measure for how well we were able to reproduce the classes: purity
External criterion: Purity
Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters and C = {c_1, c_2, ..., c_J} is the set of classes.
For each cluster ω_k: find the class c_j with the most members n_kj in ω_k.
Sum all n_kj and divide by the total number of points N:

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|
Sec. 16.3 Purity example
Cluster I: max(5, 1, 0) = 5
Cluster II: max(1, 4, 1) = 4
Cluster III: max(2, 0, 3) = 3
purity(clusters I, II, III; classes x, o, ⋄) = (1/17)(5 + 4 + 3) ≈ 0.71
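The purity computation can be checked in a few lines; the cluster compositions below are the ones from the example above (5/1/0, 1/4/1, 2/0/3 members per class in each cluster, 17 points total):

```python
from collections import Counter

def purity(clusters):
    """clusters: one list of gold-class labels per cluster."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n

# cluster compositions from the slide's example (classes x / o / d)
clusters = [
    ["x"] * 5 + ["o"] * 1,               # cluster I:   majority class x (5)
    ["x"] * 1 + ["o"] * 4 + ["d"] * 1,   # cluster II:  majority class o (4)
    ["x"] * 2 + ["d"] * 3,               # cluster III: majority class d (3)
]
# purity = (5 + 4 + 3) / 17 ≈ 0.71
```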
Normalized Mutual Information (NMI)
How much information does the clustering contain about the classification?
Singleton clusters (number of clusters = number of docs) have maximum MI.
Therefore: normalize by the entropy of clusters and classes.
Normalized Mutual Information

NMI(Ω, C) = I(Ω; C) / [ (H(Ω) + H(C)) / 2 ]

where I(Ω; C) = Σ_k Σ_j P(ω_k ∩ c_j) log [ P(ω_k ∩ c_j) / (P(ω_k) P(c_j)) ]
and H(Ω) = −Σ_k P(ω_k) log P(ω_k)
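The NMI definition above can be sketched as follows; `nmi` is an illustrative helper name, and the natural logarithm is assumed (any base works, since it cancels in the ratio):

```python
from collections import Counter
from math import log

def nmi(cluster_labels, class_labels):
    """NMI(Omega, C) = I(Omega; C) / ((H(Omega) + H(C)) / 2)."""
    n = len(cluster_labels)
    pw = Counter(cluster_labels)                    # cluster sizes
    pc = Counter(class_labels)                      # class sizes
    joint = Counter(zip(cluster_labels, class_labels))
    mi = sum((nkj / n) * log(n * nkj / (pw[k] * pc[j]))
             for (k, j), nkj in joint.items())
    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())
    return mi / ((entropy(pw) + entropy(pc)) / 2)

# a clustering that exactly reproduces the classes scores 1.0;
# one that is independent of the classes scores 0.0
```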
Rand index
Definition: RI = (TP + TN) / (TP + FP + FN + TN)
Based on the 2×2 contingency table of all pairs of documents. TP+FP+FN+TN is the total number of pairs; there are N(N−1)/2 pairs for N documents. Example: 17·16/2 = 136 pairs in the o/⋄/x example.
Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters) ... and either true (correct) or false (incorrect): the clustering decision is correct or incorrect.
Rand Index: Example
Rand measure for the o/⋄/x example: RI = (20 + 72)/(20 + 20 + 24 + 72) = 92/136 ≈ 0.68
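The Rand index can be verified by enumerating all document pairs. A sketch using the same 17-point example (cluster compositions taken from the purity slide; the helper names are illustrative):

```python
from itertools import combinations

def pair_counts(cluster_labels, class_labels):
    """TP/FP/FN/TN over all N(N-1)/2 document pairs."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            tp += 1        # correctly put in the same cluster
        elif same_cluster:
            fp += 1        # wrongly put in the same cluster
        elif same_class:
            fn += 1        # wrongly put in different clusters
        else:
            tn += 1        # correctly put in different clusters
    return tp, fp, fn, tn

def rand_index(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# the 17-point example; cluster compositions as in the purity slide
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = (["x"] * 5 + ["o"]) + (["x"] + ["o"] * 4 + ["d"]) + (["x"] * 2 + ["d"] * 3)
# pair_counts gives TP=20, FP=20, FN=24, TN=72; RI = 92/136 ≈ 0.68
```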
Sec. 16.3 Rand index and Cluster F-measure
P = TP / (TP + FP)
R = TP / (TP + FN)
F_1 = 2PR / (P + R)
Cluster F-measure: Example
P = TP / (TP + FP) = 20/40 = 0.5
R = TP / (TP + FN) = 20/44 ≈ 0.455
F_1 = 2PR / (P + R) ≈ 0.476
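Pair-level precision, recall, and F can be computed from the same contingency counts. A sketch, using TP = 20, FP = 20, FN = 24 from the o/⋄/x example (the function name is illustrative):

```python
def cluster_prf(tp, fp, fn):
    """Pair-level precision, recall, and F1 for a clustering."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f1 = cluster_prf(20, 20, 24)   # counts from the Rand-index example
# p = 0.5, r = 20/44 ≈ 0.455, f1 = 10/21 ≈ 0.476
```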
Evaluation results for the o/⋄/x example All four measures range from 0 (really bad clustering) to 1 (perfect clustering).
Hard vs. soft clustering
Hard clustering: each document belongs to exactly one cluster. More common and easier to do.
Soft clustering: a document can belong to more than one cluster, e.g.:
- a document about Chinese cars (China and automobiles)
- a document about electric cars (technology and environment)
Model-based Clustering
K-means is a special case of model-based clustering.
Model-Based Clustering: k-means
EM is a general framework
- Create an initial model, θ (arbitrarily, randomly, or with a small set of training examples)
- Use the model θ to obtain another model θ′ such that Σ_i log P_θ′(y_i) > Σ_i log P_θ(y_i), i.e., θ′ models the data better
- Let θ = θ′ and repeat the above step until reaching a local maximum
Guaranteed to find a better model after each iteration (until a local maximum is reached)
Example - clustering documents
Inferring the Model Parameters from the Data
Similar to k-means, EM alternates between:
- an expectation step (corresponding to reassignment)
- a maximization step (corresponding to recomputation of the parameters of the model)
Maximization Step
(Re)compute the parameters q_mk and α_k (priors) from the soft assignments r_nk:

q_mk = Σ_n r_nk · I(t_m ∈ d_n) / Σ_n r_nk    where I(t_m ∈ d_n) = 1 if term t_m occurs in document d_n, 0 otherwise

α_k = (Σ_n r_nk) / N
Expectation Step
Compute the soft assignment r_nk of documents to clusters given the current parameters q_mk and α_k:

r_nk = α_k (Π_{t_m ∈ d_n} q_mk) (Π_{t_m ∉ d_n} (1 − q_mk)) / Σ_{k'} α_{k'} (Π_{t_m ∈ d_n} q_{mk'}) (Π_{t_m ∉ d_n} (1 − q_{mk'}))
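The two steps can be combined into a small EM loop. A minimal sketch assuming the Bernoulli document model suggested by the q_mk / α_k parameterization (the function name and toy documents are illustrative; a tiny pseudocount is added for numerical safety, which the slides do not mention):

```python
import random

# Documents are sets of term indices; q[k][m] = P(term m present | cluster k),
# alpha[k] = P(cluster k), r[n][k] = soft assignment of document n to cluster k.
def em_bernoulli(docs, n_terms, K, iters=5, eps=1e-9, seed=0):
    rng = random.Random(seed)
    alpha = [1.0 / K] * K
    q = [[rng.uniform(0.25, 0.75) for _ in range(n_terms)] for _ in range(K)]
    for _ in range(iters):
        # E step: r_nk proportional to alpha_k * prod_m q_mk^x * (1 - q_mk)^(1-x)
        r = []
        for d in docs:
            w = []
            for k in range(K):
                wk = alpha[k]
                for m in range(n_terms):
                    wk *= q[k][m] if m in d else (1.0 - q[k][m])
                w.append(wk)
            z = sum(w)
            r.append([wk / z for wk in w])
        # M step: re-estimate alpha_k and q_mk from the soft assignments;
        # the pseudocount eps keeps q away from exactly 0 or 1
        for k in range(K):
            rk = sum(r[n][k] for n in range(len(docs)))
            alpha[k] = rk / len(docs)
            for m in range(n_terms):
                num = sum(r[n][k] for n in range(len(docs)) if m in docs[n])
                q[k][m] = (num + eps) / (rk + 2 * eps)
    return alpha, q, r

docs = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]   # two obvious groups of documents
alpha, q, r = em_bernoulli(docs, n_terms=4, K=2)
```

Real implementations work in log space and smooth more aggressively; this sketch only mirrors the two update equations above.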
Customizing Sentiment Classifiers to New Domains: a Case Study by Aue and Gamon
E and M steps
Expectation: given the current model, compute the expected probabilities P(x | θ_c) of each document belonging to each cluster.
Maximization: given the probabilistic assignment of all the documents, estimate a new model θ_c.
Each iteration increases the likelihood of the data, and convergence (to a local optimum) is guaranteed!
Similar to K-Means
K-means iterates:
- Assign each document to the closest center
- Recalculate centers as the mean of the points in a cluster
EM iterates:
- Expectation: given the current model, compute the expected probabilities P(x | θ_c) of each document belonging to each cluster
- Maximization: given the probabilistic assignment of all the documents, estimate a new model θ_c
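The EM side of this comparison can be illustrated with a 1-D mixture of two Gaussians, tracking the log-likelihood to show it never decreases (a sketch; the data and function name are illustrative, and a small variance floor is added for stability):

```python
from math import exp, log, pi, sqrt

def em_gmm_1d(xs, iters=15):
    """EM for a 2-component 1-D Gaussian mixture; returns means and per-iteration log-likelihood."""
    mu = [min(xs), max(xs)]          # crude initialization at the data extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    lls = []
    for _ in range(iters):
        # E step: weighted component densities and responsibilities for every point
        dens = [[w[k] * exp(-(x - mu[k]) ** 2 / (2 * var[k])) / sqrt(2 * pi * var[k])
                 for k in range(2)] for x in xs]
        lls.append(sum(log(d[0] + d[1]) for d in dens))
        r = [[d[k] / (d[0] + d[1]) for k in range(2)] for d in dens]
        # M step: weighted priors, means, variances (floor avoids variance collapse)
        for k in range(2):
            nk = sum(ri[k] for ri in r)
            w[k] = nk / len(xs)
            mu[k] = sum(ri[k] * x for ri, x in zip(r, xs)) / nk
            var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / nk + 1e-6
    return mu, lls

xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]   # illustrative data: two obvious groups
mu, lls = em_gmm_1d(xs)
# the log-likelihood is non-decreasing across iterations, and the means
# converge to the two group centers (about 0.1 and 5.1)
```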
EM example Figure from Chris Bishop