DD2475 Information Retrieval, Lecture 10: Clustering


DD2475 Information Retrieval
Lecture 10: Clustering
Hedvig Kjellström

Recap: Classification (Sec. 14.1)
- Data points have labels
- Classification task: finding good separators
[Figure: labeled example classes Government, Science, Arts]

Today
- Document clustering
  - Motivations
  - Document representations
  - Success criteria
- Flat clustering (Manning Chapter )
  - K-means
- Hierarchical clustering (Manning Chapter )
  - Agglomerative methods

Document Clustering


What is Clustering? (Ch. 16)
- Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar
  - Documents from different clusters should be dissimilar
- The most common form of unsupervised learning
  - Unsupervised learning = learning from raw, unlabeled data
- Important task in IR and other areas

Clear Cluster Structure (Ch. 16)
- How would you design an algorithm for finding the three clusters in this case?

Applications of Clustering in IR (Sec. 16.1)
- Clustering search results
  - Effective user recall will be higher; better navigation of search results
- Scatter-Gather
  - Better user interface: search without typing
- Visualizing a collection
  - Easier to browse
- Speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

Clustering Search Results (Sec. 16.1)
[Figure: clusty.com / Vivisimo]


Scatter-Gather (Sec. 16.1)
[Figure: Cutting, Karger and Pedersen]

Visualizing a Collection
[Figure: Google News]

Visualizing a Collection (Sec. 16.1)
[Figure: ThemeScapes, Cartia]
- Mountain height = cluster size

Speeding up Vector Space Retrieval
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster the docs in the corpus a priori
  - When a query matches a doc D, also return the other docs in the cluster containing D
- The hope if we do this: the query "car" will also return docs containing "automobile"
  - Because clustering earlier grouped together docs containing "car" with those containing "automobile"
  - Why might this happen?

DD2475 Lecture 10, February 1, 2011
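The recall trick on the "Speeding up Vector Space Retrieval" slide can be sketched in a few lines. This is only an illustration: the document ids, the cluster assignment, and the function name `expand_results` are made up here and do not come from the lecture.

```python
def expand_results(matched, cluster_of, members):
    """Return the matched docs followed by the other docs in their clusters."""
    expanded = list(matched)
    seen = set(matched)
    for doc in matched:
        for other in members[cluster_of[doc]]:
            if other not in seen:
                seen.add(other)
                expanded.append(other)
    return expanded

# Hypothetical corpus: d1 contains "car", d2 contains "automobile",
# and an earlier clustering step put them in the same cluster.
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
members = {0: ["d1", "d2"], 1: ["d3"]}

results = expand_results(["d1"], cluster_of, members)
print(results)
```

The query only matched d1, but d2 is returned as well because it shares a cluster with d1.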

Issues for Clustering (Sec. 16.2)
- Representation for clustering
  - Document representation (Vector space? Normalization?)
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?

Notion of Similarity/Distance
- Ideal: semantic similarity
  - Semantically similar documents close
  - Semantically different documents far away
- Practical: term-statistical similarity
  - Cosine similarity
  - Documents are normalized vectors in N dimensions, N >> 1
- As last week, visualize using Euclidean distance
  - For many algorithms, easier to think in terms of a distance (rather than similarity) between docs
  - But real implementations use cosine similarity

Clustering Algorithms
- Flat algorithms
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
  - K-means clustering
  - (Model-based/probabilistic/EM clustering)
- Hierarchical algorithms
  - Bottom-up, agglomerative clustering
  - (Top-down, divisive clustering)
- More about clustering
  - DD2431 Machine Learning (p1)
  - DD2427 Image Based Recognition and Classification (p4)

Flat Clustering (Manning Chapter )
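One reason it is safe to think in distances while implementing cosine similarity: for unit-length vectors, squared Euclidean distance equals 2 - 2·cos, so the two orderings agree. A minimal check, with made-up three-term document vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity; u and v are assumed to be unit length already."""
    return sum(a * b for a, b in zip(u, v))

def euclidean_sq(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical term-weight vectors for three documents.
a = normalize([3.0, 1.0, 0.0])
b = normalize([2.0, 2.0, 0.0])
c = normalize([0.0, 1.0, 4.0])

for name, other in (("b", b), ("c", c)):
    print(name, cosine(a, other), euclidean_sq(a, other), 2 - 2 * cosine(a, other))
```

The last two columns agree up to rounding, and the doc with the higher cosine to `a` is also the one at the smaller distance.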

Partitioning Algorithms (Sec. 16.4)
- Partitioning method: partition a set of n documents into K clusters
- Given:
  - Set of n documents to be clustered
  - Number of clusters K
- Find: K clusters that
  - Globally minimize the intra-cluster distance
  - Globally maximize the inter-cluster distance
- Intractable for many objective functions (NP-hard)
  - Effective heuristic method: K-means

K-Means Algorithm
- Assumes documents are real-valued vectors (metric space)
- Clusters represented by centroids (center of gravity, mean) of the points in a cluster c:
  $\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$
- Approximates the partition problem by iterating

K-Means Algorithm (Sec. 16.4)

  Select K random docs {s_1, s_2, ..., s_K} as centers of the clusters {c_1, c_2, ..., c_K}.
  Until clustering converges:
      For each doc x_i:
          Assign x_i to the cluster c_j minimizing dist(x_i, s_j).
      End
      For each cluster c_j:
          s_j = µ(c_j)
      End
  End

K-Means Example (K=2) (Sec. 16.4)
[Figure: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged]
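The K-means loop on the slide can be sketched directly in Python. This is a minimal version under a few assumptions not in the lecture: squared Euclidean distance as `dist`, "assignment unchanged" as the convergence test, a `max_iter` cap, and an empty cluster keeping its old center. The six 2-D points are made up.

```python
import random

def dist_sq(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(docs, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(docs, k)          # select K random docs as seeds
    assignment = None
    for _ in range(max_iter):
        # Reassign: each doc goes to the cluster with the closest center.
        new_assignment = [
            min(range(k), key=lambda j: dist_sq(x, centers[j])) for x in docs
        ]
        if new_assignment == assignment:   # partition unchanged: converged
            break
        assignment = new_assignment
        # Recompute: each center becomes the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(docs, assignment) if a == j]
            if members:                    # assumption: empty cluster keeps its center
                centers[j] = centroid(members)
    return assignment, centers

docs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centers = kmeans(docs, k=2)
print(assignment, centers)
```

On this toy data the two well-separated groups are recovered whichever two points are drawn as seeds; only the cluster numbering depends on the seed.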

Termination Conditions (Sec. 16.4)
- Several possibilities, e.g.,
  - A fixed number of iterations
  - Document partition is unchanged
  - Centroid positions do not change
- Does this mean that the document partition is unchanged?

Convergence (Sec. 16.4)
- Why should the K-means algorithm ever reach a fixed point?
  - Basic requirement: static problem
  - Can it be proven?
- K-means is a special case of the Expectation Maximization (EM) algorithm
  - EM has been shown to converge
  - Number of iterations could be large! But in practice usually not

Convergence of K-Means (Sec. 16.4)
- Define the goodness measure G_j of cluster c_j with center s_j as the sum of squared distances from the cluster centroid:
  - $G_j = \sum_i (x_i - s_j)^2$ (sum over all x_i in cluster c_j)
- $G = \sum_j G_j$
- Recomputation monotonically decreases each G_j, since (m_j is the number of members in cluster j):
  - $\sum_i (x_i - a)^2$ reaches its minimum for:
  - $\sum_i -2(x_i - a) = 0$
  - $\sum_i x_i = \sum_i a$
  - $m_j a = \sum_i x_i$
  - $a = (1/m_j) \sum_i x_i = s_j$
- Reassignment cannot increase G either, since each document moves to its closest center

Time Complexity (Sec. 16.4)
- K-means typically converges quickly
  - K = #centroids
  - M = #dim (vocabulary size)
  - N = #documents
- Computing the distance between two docs: O(M)
- Reassigning clusters: O(KN) distance computations => O(KNM)
- Computing centroids: each doc gets added once to some centroid => O(NM)
- Assume these two steps are each done once for I iterations: O(IKNM)
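The derivation on the "Convergence of K-Means" slide says the mean is the center that minimizes the sum of squared distances for a cluster. A small numeric check, using a made-up three-point cluster and a few arbitrary alternative centers:

```python
def sum_sq(points, a):
    """G_j for a cluster: sum of squared distances from its points to center a."""
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in points)

cluster = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]           # hypothetical members
mean = [sum(col) / len(cluster) for col in zip(*cluster)]  # a = (1/m_j) * sum x_i
g_mean = sum_sq(cluster, mean)

# Arbitrary alternative centers, including one of the cluster's own points.
candidates = [[2.0, 2.5], [1.0, 2.0], [2.1, 2.0]]
for a in candidates:
    print(a, sum_sq(cluster, a), ">", g_mean)
```

Every alternative center gives a larger value than the mean, which is why the recomputation step can never increase G_j.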

Seed (Initial Cluster Center) Choice (Sec. 16.4)
- Results can vary based on random seed selection
- Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings
  - Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  - Try out multiple starting points
  - Initialize with the results of another method
- Is this problem convex?

Example showing sensitivity to seeds
[Figure: six points A, B, C, D, E, F]
- In the above, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}
- If you start with D and F you converge to {A,B,D,E} and {C,F}

How Many Clusters?
- Some applications: number of clusters K is given
  - Given docs, partition into an appropriate number of subsets
  - E.g., for query results: ideal value of K not known up front, though the UI may impose limits
- Other applications: K unknown
  - Solve an optimization problem: penalize having lots of clusters
  - Application dependent, e.g., compressed summary of a search results list
  - Tradeoff between having more clusters (better focus within each cluster) and having too many clusters

Benefit-Cost for Estimating K
- Given a clustering, the benefit of a document = cosine similarity to its centroid
  - Total benefit = sum of the individual document benefits
- For each cluster, cost = constant C
  - Total cost = KC
- Value of a clustering = total benefit - total cost
- Find the K maximizing the clustering value

Hierarchical Clustering (Manning Chapter )
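The benefit-cost rule for choosing K can be tried out on a toy example. The sketch below assumes the candidate clusterings for each K are already available (here written by hand rather than produced by K-means); the four document vectors and the cost C = 0.5 are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def clustering_value(clusters, cost_per_cluster):
    """Total benefit (cosine of each doc to its centroid) minus K * C."""
    benefit = sum(cosine(x, centroid(c)) for c in clusters for x in c)
    return benefit - cost_per_cluster * len(clusters)

a, b, c, d = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]
candidates = {                       # hypothetical clusterings for K = 1, 2, 4
    1: [[a, b, c, d]],
    2: [[a, b], [c, d]],
    4: [[a], [b], [c], [d]],
}
C = 0.5
values = {k: clustering_value(cl, C) for k, cl in candidates.items()}
best_k = max(values, key=values.get)
print(values, best_k)
```

With this cost, K = 2 wins: going from one to two clusters raises the benefit by about one, while going from two to four adds almost nothing but costs another 2C. A different C would shift the balance, which is the application-dependent part.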

K-Means Great... (Ch. 17)
- ...but what if K = 10,000?
- Basic problem: no structure
  - Which classes are similar to each other?

Hierarchical Clustering
- Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents
- One approach: recursive application of a partitional clustering algorithm
[Figure: taxonomy tree. animal: vertebrate (fish, reptile, amphib., mammal), invertebrate (worm, insect, crustacean)]

Dendrogram: Hierarchical Clustering (Sec. 17.1)
- Clustering obtained by cutting the dendrogram at a desired level
  - Each connected component forms a cluster

Hierarchical Agglomerative Clustering (HAC)
- Method for obtaining a dendrogram
- General approach
  - Start with each document in a separate cluster
  - Repeatedly join the closest pair of clusters, until there is only one cluster
- Details
  - How to find the closest pair of clusters

Closest pair of clusters (Sec. 17.2)
- Many variants for defining the closest pair of clusters
- Single-link
  - Similarity of the most cosine-similar pair
- Complete-link
  - Similarity of the furthest points, the least cosine-similar pair
- Centroid
  - Clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link
  - Average cosine between pairs of elements

Single-Link (Sec. 17.2)
- Use the maximum similarity of pairs:
  $sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)$
- Can result in long and thin clusters (pairwise similar)
- After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:
  $sim(c_i \cup c_j, c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))$
[Figure: clusters C_i, C_j, C_k]

Single-Link Example (Sec. 17.2)

Complete-Link (Sec. 17.2)
- Use the minimum similarity of pairs:
  $sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$
- Makes spherical clusters (typically preferable)
- After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:
  $sim(c_i \cup c_j, c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))$
[Figure: clusters C_i, C_j, C_k]
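A naive HAC loop makes the difference between single-link and complete-link concrete. The sketch below keeps a cluster-to-cluster similarity table and, after each merge, updates it by taking the max (single-link) or the min (complete-link) of the two merged clusters' similarities to each remaining cluster. The four 1-D points and the use of negative distance as the similarity are assumptions for illustration only; the lecture uses cosine similarity on documents.

```python
def hac(sim, linkage):
    """Naive agglomerative clustering.

    sim: symmetric matrix of pairwise similarities.
    linkage: max for single-link, min for complete-link.
    Returns the merges as (cluster_a, cluster_b, similarity) in merge order.
    """
    n = len(sim)
    clusters = {i: [i] for i in range(n)}
    csim = {i: {j: sim[i][j] for j in range(n) if j != i} for i in range(n)}
    merges = []
    while len(clusters) > 1:
        # The closest pair is the most similar pair of current clusters.
        i, j = max(
            ((a, b) for a in csim for b in csim[a] if a < b),
            key=lambda p: csim[p[0]][p[1]],
        )
        merges.append((clusters[i], clusters[j], csim[i][j]))
        # Merge j into i and update similarities to every other cluster k.
        for k in clusters:
            if k not in (i, j):
                s = linkage(csim[i][k], csim[j][k])
                csim[i][k] = csim[k][i] = s
                del csim[k][j]
        del csim[j]
        del csim[i][j]
        clusters[i] = clusters[i] + clusters.pop(j)
    return merges

points = [0.0, 1.0, 2.5, 4.5]                       # hypothetical 1-D "documents"
sim = [[-abs(p - q) for q in points] for p in points]

single = hac(sim, max)
complete = hac(sim, min)
print([m[:2] for m in single])
print([m[:2] for m in complete])
```

Both variants first merge points 0 and 1. Single-link then chains point 2 onto that cluster, while complete-link prefers to pair points 2 and 3 first, which is the "long and thin" versus "spherical" behaviour in miniature.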

Complete-Link Example (Sec. 17.2)

Average-Link (Sec. 17.3)
- Similarity of two clusters = average similarity of all pairs within the merged cluster
- Compromise between single-link and complete-link
- Two options:
  - Averaged over all pairs in the merged cluster
  - Averaged over all pairs between the two original clusters
- No clear difference in efficacy

Computing Average-Link Similarity (Sec. 17.3)
- Always maintain the sum of vectors in each cluster
- Compute the similarity of clusters in constant time:

Computational Complexity (Sec. )
- In the first iteration, compute the similarity of all pairs of N instances: O(N^2)
- In each of the N-2 merging iterations, compute the distance between the most recent cluster and all other clusters
  - Worst case: O(N^3)
  - Cleverly: O(N^2 log N)
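The formula for the constant-time computation did not survive on the slide. One standard form, given in the Manning textbook for the "all pairs in the merged cluster" option, assumes unit-length document vectors: with s the sum of all vectors in the merged cluster and n its size, the average over ordered pairs of distinct documents is (s·s - n) / (n(n - 1)). The sketch below checks that form against a brute-force average; the vectors are made up, and "constant time" here means independent of the number of documents, not of the vector dimension.

```python
import math
from itertools import permutations

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def avg_link_bruteforce(cluster_a, cluster_b):
    """Average similarity over all ordered pairs of distinct docs in the merged cluster."""
    merged = cluster_a + cluster_b
    pairs = list(permutations(merged, 2))
    return sum(dot(u, v) for u, v in pairs) / len(pairs)

def avg_link_from_sums(sum_a, n_a, sum_b, n_b):
    """Same quantity from the maintained vector sums; assumes unit-length docs."""
    s = [a + b for a, b in zip(sum_a, sum_b)]
    n = n_a + n_b
    return (dot(s, s) - n) / (n * (n - 1))

cluster_a = [normalize(v) for v in ([1.0, 0.2, 0.0], [0.8, 0.4, 0.1])]
cluster_b = [normalize(v) for v in ([0.1, 1.0, 0.3], [0.0, 0.7, 0.9], [0.2, 0.9, 0.5])]
sum_a = [sum(col) for col in zip(*cluster_a)]
sum_b = [sum(col) for col in zip(*cluster_b)]

slow = avg_link_bruteforce(cluster_a, cluster_b)
fast = avg_link_from_sums(sum_a, len(cluster_a), sum_b, len(cluster_b))
print(slow, fast)
```

The subtraction of n removes the n self-similarities, each equal to 1 for unit vectors, which is why the normalization assumption matters.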

What is a Good Clustering? (Sec. 16.3)
- Internal criterion
- A good clustering will produce clusters where:
  - The intra-class (intra-cluster) similarity is high
  - The inter-class similarity is low
  - The measured quality of a clustering depends on both the document representation and the similarity measure used

What is a Good Clustering? (Sec. 16.3)
- External criterion
- A good clustering is able to:
  - Discover some or all hidden patterns or latent classes in gold standard/benchmark data
- Assessing a clustering with respect to ground truth requires labeled data

External Evaluation (Sec. 16.3)
- Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, ω_1, ω_2, ..., ω_K, with n_i members
- Simple measure: purity, the ratio between the size of the dominant class in cluster ω_i and the size of cluster ω_i
- Biased, because having n clusters maximizes purity
- Other measures: Rand index, entropy/mutual information of classes in clusters

Purity example
[Figure: three clusters, Cluster I, Cluster II, Cluster III]
- Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6
- Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6
- Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
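The purity example can be reproduced from the per-cluster class counts given on the slide. The overall purity line (dominant-class counts summed over clusters, divided by the number of documents) is a common aggregate and is an addition here, not something stated on the slide.

```python
# Class counts per cluster, copied from the purity example.
counts = {
    "Cluster I": [5, 1, 0],
    "Cluster II": [1, 4, 1],
    "Cluster III": [2, 0, 3],
}

def purity(class_counts):
    """Share of the cluster taken by its dominant class."""
    return max(class_counts) / sum(class_counts)

per_cluster = {name: purity(c) for name, c in counts.items()}
overall = sum(max(c) for c in counts.values()) / sum(sum(c) for c in counts.values())

# The bias noted on the slide: one cluster per document is always "pure".
singletons = [[1, 0, 0]] * 8 + [[0, 1, 0]] * 5 + [[0, 0, 1]] * 4
overall_singletons = sum(max(c) for c in singletons) / sum(sum(c) for c in singletons)

print(per_cluster, overall, overall_singletons)
```

The per-cluster values match 5/6, 4/6 and 3/5, the aggregate is 12/17, and the degenerate one-document-per-cluster solution scores 1.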

Rand Index Measures Between-Pair Decisions (Sec. 16.3)
- Here RI = 0.68

  Number of points                   | Same cluster in clustering | Different clusters in clustering
  Same class in ground truth         |                            |
  Different classes in ground truth  |                            |

Rand index (Sec. 16.3)
- Compare with standard Precision and Recall:

Final Word
- In clustering, clusters are inferred from the data without human input (unsupervised learning)
- In practice less clear: many ways of influencing the outcome of clustering
  - Number of clusters
  - Similarity measure
  - Representation of documents

Next
- Computer hall session (February 3, )
  - Sporthallen (Lindstedtsvägen 5, level 5)
  - Examination of Assignment 1, 2
  - Sign up in Doodle (see course homepage)
- Lecture 11 (February 7, )
  - Prof Viggo Kann
  - E34
  - Manning Chapter 18
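The Rand index treats every pair of documents as one decision: a pair is a true positive if the two docs share a class and a cluster, a true negative if they share neither, and RI = (TP + TN) / (all pairs). The sketch below computes the four pair counts from a cluster-by-class count table. It assumes the Rand index slide uses the same 17-document example as the purity slide; the variable names are illustrative.

```python
from math import comb

def pair_counts(table):
    """table[i][j] = number of docs in cluster i with class j.

    Returns (tp, fp, fn, tn) counted over unordered pairs of documents.
    """
    n = sum(sum(row) for row in table)
    total = comb(n, 2)
    same_cluster = sum(comb(sum(row), 2) for row in table)
    same_class = sum(comb(sum(col), 2) for col in zip(*table))
    tp = sum(comb(x, 2) for row in table for x in row)  # same class, same cluster
    fp = same_cluster - tp                              # different class, same cluster
    fn = same_class - tp                                # same class, different clusters
    tn = total - tp - fp - fn                           # different class, different clusters
    return tp, fp, fn, tn

# Class counts per cluster from the purity example.
table = [[5, 1, 0], [1, 4, 1], [2, 0, 3]]
tp, fp, fn, tn = pair_counts(table)

rand_index = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, round(rand_index, 2), precision, round(recall, 2))
```

On these counts the index rounds to 0.68, in line with the value quoted on the slide. Precision and recall use only the same-cluster and same-class pairs, so unlike the Rand index they ignore the true negatives.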
