
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22

Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task of automatically grouping observations into categories. A core set of tools within machine learning, data mining, and pattern recognition. An example of flat clustering: k-means Hierarchical clustering Agglomerative Divisive Measuring the distance between clusters Single-linkage, complete-linkage, average-linkage... Reading: Chapter 17 in Manning, Raghavan & Schütze (2008), Introduction to Information Retrieval; http://informationretrieval.org/. Erik Velldal INF4820 2 / 22

Catching Up: Partitional Clustering Creates a flat, one-level grouping of the data. Can be defined as an optimization problem: Search for the partitioning of the data that minimizes some objective function. Optimize a globally defined measure of partition quality. Problem: Exponentially many possible partitions of the data. An exhaustive search over partitions not feasible. Instead we must typically approximate the solution by iteratively refining an initial (possibly random) partition until some stopping criterion is met. Erik Velldal INF4820 3 / 22

k-means Unsupervised variant of the Rocchio classifier. Goal: Partition the n observed objects into k clusters C so that each point x_j belongs to the cluster c_i with the nearest centroid µ_i. Typically assumes Euclidean distance as the similarity function s.

The optimization problem: For each cluster, minimize the within-cluster sum of squares, F_s = WCSS:

$\mathrm{WCSS} = \sum_{c_i \in C} \sum_{\vec{x}_j \in c_i} \lVert \vec{x}_j - \vec{\mu}_i \rVert^2$

WCSS also amounts to the more general measure of how well a model fits the data known as the residual sum of squares (RSS). Minimizing WCSS is equivalent to minimizing the average squared distance between objects and their cluster centroids (since n is fixed), a measure of how well each centroid represents the members assigned to the cluster. Erik Velldal INF4820 4 / 22
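As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the WCSS objective, given a data matrix, a vector of cluster assignments, and the corresponding centroids; the variable names are made up for the example.

import numpy as np

def wcss(X, assignments, centroids):
    # Within-cluster sum of squares: squared Euclidean distance from
    # each point to its assigned centroid, summed over all points.
    diffs = X - centroids[assignments]
    return float(np.sum(diffs ** 2))

# Example: four points in R^2 split into two clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
assignments = np.array([0, 0, 1, 1])
centroids = np.array([X[assignments == i].mean(axis=0) for i in range(2)])
print(wcss(X, assignments, centroids))   # 1.0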

k-means (cont'd)

Algorithm
Initialize: Compute centroids for k random seeds.
Iterate: Assign each object to the cluster with the nearest centroid. Compute new centroids for the clusters.
Terminate: When the stopping criterion is satisfied.

Properties
In short, we keep reassigning memberships and recomputing centroids until the configuration stabilizes. WCSS is monotonically decreasing (or unchanged) for each iteration. Guaranteed to converge, but not to find the global minimum. The time complexity is linear, O(kn). Erik Velldal INF4820 5 / 22
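The whole procedure can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the course's reference implementation: it seeds with k random objects, uses Euclidean distance, stops when the assignments no longer change, and does not handle empty clusters.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: use k random objects from the collection as seeds.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignments = np.full(len(X), -1)
    for it in range(max_iter):
        # Assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break   # configuration has stabilized
        assignments = new_assignments
        # Recompute each centroid as the mean of its cluster's members.
        centroids = np.array([X[assignments == i].mean(axis=0) for i in range(k)])
    return assignments, centroids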

k-means Example for k = 2 in $\mathbb{R}^2$ (Manning, Raghavan & Schütze 2008) Erik Velldal INF4820 6 / 22

Comments on k-means Seeding We initialize the algorithm by choosing random seeds that we use to compute the first set of centroids. Many ways to select the seeds: pick k random objects from the collection; pick k random points in the space; pick k sets of m random points and compute centroids for each set; compute a hierarchical clustering on a subset of the data to find k initial clusters; etc. The heuristics involved in choosing the initial seeds can have a large impact on the resulting clustering (because we typically end up only finding a local minimum of the objective function). Outliers are troublemakers. Erik Velldal INF4820 7 / 22
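Two of the seeding heuristics listed above, sketched in NumPy (the function names are my own, and the random "points" are drawn from the collection of objects rather than arbitrary points in the space):

import numpy as np

def seeds_random_objects(X, k, rng):
    # Pick k random objects from the collection as the initial centroids.
    return X[rng.choice(len(X), size=k, replace=False)]

def seeds_from_random_subsets(X, k, m, rng):
    # Pick k sets of m random objects and use the centroid of each set.
    return np.array([X[rng.choice(len(X), size=m, replace=False)].mean(axis=0)
                     for _ in range(k)])

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
print(seeds_random_objects(X, 3, rng).shape)           # (3, 5)
print(seeds_from_random_subsets(X, 3, 10, rng).shape)  # (3, 5)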

Comments on k-means

Termination Criterion
Fixed number of iterations.
Clusters or centroids are unchanged between iterations.
Threshold on the decrease of the objective function (absolute or relative to the previous iteration).

Some Close Relatives of k-means
k-medoids: Like k-means but uses medoids instead of centroids to represent the cluster centers.
Fuzzy c-means (FCM): Like k-means but assigns soft memberships in [0, 1], where membership is a function of the centroid distance. The computations of both WCSS and centroids are weighted by the membership function. Erik Velldal INF4820 8 / 22
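As an illustration of the last termination criterion, a small helper (my own sketch; the tolerance value is arbitrary) that stops when the relative decrease in the objective between two iterations falls below a threshold:

def converged(prev_wcss, curr_wcss, rel_tol=1e-4):
    # Stop when the objective no longer decreases by more than rel_tol
    # relative to the previous iteration's value.
    if prev_wcss is None:   # first iteration: nothing to compare against
        return False
    return (prev_wcss - curr_wcss) / prev_wcss < rel_tol

# Inside the k-means loop one would track the objective, e.g.:
#   if converged(prev_wcss, wcss(X, assignments, centroids)): break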

Flat Clustering: The Good and the Bad

Pros
Conceptually simple, and easy to implement.
Efficient. Typically linear in the number of objects.

Cons
The dependence on the random seeds makes the clustering non-deterministic.
The number of clusters k must be pre-specified. Often no principled means of a priori specifying k.
The clustering quality is often considered inferior to that of the less efficient hierarchical methods.
Not as informative as the more structured clusterings produced by hierarchical methods. Erik Velldal INF4820 9 / 22

Hierarchical Clustering

Divisive methods
Initially regards all n objects as part of a single cluster.
Splits the groups top-down into smaller and smaller clusters.
Each split defines a binary branch in the tree.
Stops when n singleton clusters remain (unless another criterion is defined).

Agglomerative methods
Initially regards each object as its own singleton cluster.
Iteratively merges (agglomerates) the groups in a bottom-up fashion.
Each merge defines a binary branch in the tree.
Stops when only one cluster remains, containing all the objects (unless another criterion is defined). Erik Velldal INF4820 10 / 22

Dendrograms A hierarchical clustering is often visualized as a binary tree structure known as a dendrogram. A merge is shown as a horizontal line connecting two clusters. The y-axis coordinate of the line corresponds to the combination similarity of the merged cluster. We here assume dot-products of normalized vectors; self-similarity = 1. (Figure: dendrogram over the objects G, E, B, D, H, C, A, F, with a similarity axis running from 1.0 down to 0.0.) Erik Velldal INF4820 11 / 22

Agglomerative Clustering

parameters: {o_1, o_2, ..., o_n}, sim
C = {{o_1}, {o_2}, ..., {o_n}}
T = []
for i = 1 to n - 1 do
    {c_j, c_k} = argmax_{ {c_j, c_k} ⊆ C, j ≠ k } sim(c_j, c_k)
    C = C \ {c_j, c_k}
    C = C ∪ {c_j ∪ c_k}
    T[i] = {c_j, c_k}

At each stage, merge the pair of clusters that are most similar, as defined by some measure of inter-cluster similarity, sim. Plugging in a different sim gives us a different sequence of merges T. Erik Velldal INF4820 12 / 22
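A direct (naive, O(n^3)) Python rendering of the pseudocode, parameterized by an inter-cluster similarity function sim as the slide describes. This is a sketch for illustration, not an optimized implementation; clusters are represented as frozensets of object indices.

from itertools import combinations

def agglomerative(n, sim):
    # n objects, indexed 0..n-1; sim(ci, cj) scores two clusters,
    # each given as a frozenset of object indices.
    C = [frozenset([i]) for i in range(n)]   # start with singleton clusters
    T = []
    while len(C) > 1:
        # Find the pair of clusters with maximum similarity.
        cj, ck = max(combinations(C, 2), key=lambda pair: sim(*pair))
        C.remove(cj)
        C.remove(ck)
        C.append(cj | ck)          # merge the pair
        T.append((cj, ck))         # record the merge sequence T
    return T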

Definitions of Inter-Cluster Similarity So far we've looked at ways to define the similarity between pairs of objects, and between objects and a class. Now we'll look at ways to define the similarity between classes or clusters. In agglomerative clustering, a measure of cluster similarity sim(c_i, c_j) is usually referred to as a linkage criterion (from graph theory): single-linkage, complete-linkage, centroid-linkage, average-linkage. The linkage criterion determines which pair of clusters we will merge to a new cluster in each step. Erik Velldal INF4820 13 / 22

Single-Linkage Merge the two clusters with the minimum distance between any two members. Nearest-neighbors. (Figure: example points A, C, E, D, B.) Can be computed efficiently by taking advantage of the fact that it's best-merge persistent: the distance of the two closest members is a local property that is not affected by merging. Let the nearest neighbor of cluster c_k be in either c_i or c_j. If we merge c_i ∪ c_j = c_l, the nearest neighbor of c_k will be in c_l. Undesirable chaining effect: tendency to produce stretched and straggly clusters. Erik Velldal INF4820 14 / 22

Complete-Linkage Merge the two clusters where the maximum distance between any two members is smallest. Farthest-neighbors. (Figure: example points A, C, E, D, B.) Amounts to merging the two clusters whose merger has the smallest diameter. Preference for compact clusters with small diameters. Sensitive to outliers. Not best-merge persistent: distance defined as the diameter of a merge is a non-local property that can change during merging. Erik Velldal INF4820 15 / 22
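Both criteria can be plugged into the agglomerative sketch above as the sim function. A small NumPy illustration (the function names are my own), treating similarity as negated Euclidean distance so that "most similar" means "closest":

import numpy as np

def single_link_sim(X, ca, cb):
    # Similarity of the two closest members (nearest neighbors):
    # the negated minimum pairwise distance across the two clusters.
    return -min(np.linalg.norm(X[i] - X[j]) for i in ca for j in cb)

def complete_link_sim(X, ca, cb):
    # Similarity determined by the two farthest members:
    # the negated maximum pairwise distance across the two clusters.
    return -max(np.linalg.norm(X[i] - X[j]) for i in ca for j in cb)

# Usage with the earlier agglomerative sketch:
#   T = agglomerative(len(X), lambda a, b: single_link_sim(X, a, b))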

Centroid-Linkage Similarity of two clusters c_i and c_j defined as the similarity between their cluster centroids µ_i and µ_j (the mean vectors). Equivalent to the average pairwise similarity between objects from different clusters:

$\mathrm{sim}(c_i, c_j) = \vec{\mu}_i \cdot \vec{\mu}_j = \frac{1}{|c_i|\,|c_j|} \sum_{\vec{x} \in c_i} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y}$

(Figure: example points A, B, C, D, E with cluster centroids m1 and m2.) Like complete-link, not best-merge persistent. However, unlike the other linkage criteria, it is not monotonic and is subject to so-called inversions: the combination similarity can increase during the clustering. Erik Velldal INF4820 16 / 22
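A quick NumPy check (with made-up data) that the dot product of the centroids really equals the average pairwise dot product between the two clusters:

import numpy as np

rng = np.random.default_rng(0)
ci, cj = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))

centroid_sim = ci.mean(axis=0) @ cj.mean(axis=0)
avg_pairwise = np.mean([x @ y for x in ci for y in cj])

print(np.isclose(centroid_sim, avg_pairwise))   # True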

Inversions A Problem with Centroid-Linkage We usually assume that the clustering is monotonic, i.e. the combination similarity is guaranteed to decrease between iterations. For a non-monotonic clustering criterion (e.g. centroid-linkage), inversions would show in the dendrogram as crossing lines. (Figure: dendrogram over the objects G, E, B, D, H, C, A, F, with a similarity axis from 1.0 down to 0.0; dotted circles mark the inversions.) The dotted circles in the dendrogram indicate inversions: the horizontal merge bar is lower than the bar of a previous merge. Violates the fundamental assumption that small clusters are more coherent than large clusters. Erik Velldal INF4820 17 / 22

Average-Linkage AKA group-average agglomerative clustering. Merge the two clusters where the average of all pairwise similarities in their union is highest. Aims to maximize the coherency of the merged cluster by considering all pairwise similarities between the objects in the clusters. Let c_i ∪ c_j = c_k, and sim(c_i, c_j) = W(c_i ∪ c_j) = W(c_k):

$W(c_k) = \frac{1}{|c_k|\,(|c_k|-1)} \sum_{\vec{x} \in c_k} \sum_{\substack{\vec{y} \in c_k \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y}$

(Figure: example points A, C, E, D, B.) Self-similarities are excluded in order to not penalize large clusters (which have fewer self-similarities). Erik Velldal INF4820 18 / 22

Average-Linkage (cont'd) Compromise of complete- and single-linkage. But not best-merge persistent. (Figure: example points A, C, E, D, B.) Commonly considered the best default linkage criterion for agglomerative clustering. Can be computed very efficiently if we assume normalized vectors and that the similarity measure of the feature vectors s = dot-product:

$W(c_k) = \frac{1}{|c_k|\,(|c_k|-1)} \left( \Big( \sum_{\vec{x} \in c_k} \vec{x} \Big)^{2} - |c_k| \right)$

Erik Velldal INF4820 19 / 22
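A small NumPy check (with made-up, length-normalized vectors) that the efficient sum-vector formula agrees with explicitly averaging all pairwise dot products, self-similarities excluded:

import numpy as np

rng = np.random.default_rng(1)
ck = rng.normal(size=(6, 4))
ck /= np.linalg.norm(ck, axis=1, keepdims=True)   # normalize to unit length
n = len(ck)

# Explicit definition: average over all pairs x, y in the cluster with y != x.
explicit = sum(ck[i] @ ck[j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

# Efficient form: ((sum of vectors) dotted with itself, minus n) / (n (n - 1)).
s = ck.sum(axis=0)
efficient = (s @ s - n) / (n * (n - 1))

print(np.isclose(explicit, efficient))   # True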

Cutting the Tree Hierarchical methods actually produce several partitions; one for each level of the tree. However, for many applications we will want to extract a set of disjoint clusters. In order to turn the nested partitions into a single flat partitioning, we cut the dendrogram. (Figure: dendrogram over the objects G, E, B, D, H, C, A, F, with a similarity axis from 1.0 down to 0.0.) A cutting criterion can be defined as a threshold on e.g. combination similarity, relative drop in the similarity, number of root nodes, etc. Erik Velldal INF4820 20 / 22
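In practice, a library will often do the cutting for us. Assuming SciPy is available, a minimal sketch (with made-up data and an arbitrary threshold) that builds an average-linkage dendrogram and cuts it at a fixed distance threshold:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(3, 0.3, size=(10, 2))])

Z = linkage(X, method='average')                     # agglomerative, average-linkage
labels = fcluster(Z, t=1.5, criterion='distance')    # cut the tree at distance 1.5
print(labels)                                        # flat cluster label per object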

Divisive Hierarchical Clustering Generates the nested partitions top-down: Start by considering all objects part of the same cluster (the root). Split the cluster using a flat clustering algorithm (e.g. by applying k-means for k = 2). Recursively split the clusters until only singleton clusters remain. (Also possible to fix the desired number of levels and stop the clustering before we reach the singleton leaves.) Flat methods such as k-means are generally very efficient; linear in the number of objects. Divisive methods are thereby also generally more efficient than agglomerative methods, which are at least quadratic (single-link). Divisive methods also have the advantage of being able to initially consider the global distribution of the data, while the agglomerative methods must commit to early decisions based on local patterns. Erik Velldal INF4820 21 / 22
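A sketch of the top-down idea (my own illustration): repeatedly split one cluster in two until a target number of clusters is reached, using any 2-way flat clustering routine, e.g. the kmeans sketch from the earlier slide passed in as split_fn.

import numpy as np

def divisive(X, target_k, split_fn):
    # split_fn(X_subset) must return a 0/1 assignment array that splits
    # the subset into two clusters, e.g. one step of 2-means.
    clusters = [np.arange(len(X))]               # the root: all objects
    while len(clusters) < target_k:
        # Split the largest remaining cluster (one possible heuristic).
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        halves = split_fn(X[members])
        clusters.append(members[halves == 0])
        clusters.append(members[halves == 1])
    return clusters

# Example usage with the kmeans sketch from the earlier slide:
#   clusters = divisive(X, target_k=4, split_fn=lambda S: kmeans(S, k=2)[0])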

Next (and Final) Week Summing up. Sample exam. Erik Velldal INF4820 22 / 22