Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani

Outline
- Clustering definition
- Main clustering approaches: partitional (flat) and hierarchical
- Clustering validation

Unsupervised learning
- Clustering: partitioning of data into groups of similar data points.
- Density estimation: parametric and non-parametric density estimation.
- Dimensionality reduction: data representation using a smaller number of dimensions while preserving (perhaps approximately) some properties of the data.

Clustering: Definition
We have a set of unlabeled data points $\{x^{(i)}\}_{i=1}^{N}$ and we intend to find groups of similar objects (based on the observed features):
- high intra-cluster similarity
- low inter-cluster similarity

Clustering: Another Definition
Density-based definition: clusters are regions of high density that are separated from one another by regions of low density.

Clustering Purpose
- Preprocessing stage to index, compress, or reduce the data
- Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes)
- As a tool to understand the hidden structure of the data or to group it
- To gain insight into the structure of the data prior to classifier design
- To group the data when no label is available

Clustering Applications
Information retrieval (search and browsing):
- Cluster text docs or images based on their content
- Cluster groups of users based on their access patterns on webpages

Clustering of docs: Google News example (figure).

Cluster users of social networks by interest (community detection).

Social network: community detection (figure).

Bioinformatics: cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.), or cluster similar genes according to microarray data.

Gene clustering: microarrays measure the expression of all genes; clustering genes can help determine new functions for unknown genes.

Market segmentation: cluster customers based on their purchase history and their characteristics.

Image segmentation, and many more applications.

Categorization of Clustering Algorithms
- Partitional algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion

Clustering methods we will discuss
- Objective-based clustering: k-means
- EM-style algorithm for clustering with a mixture of Gaussians (in the next lecture)
- Hierarchical clustering

Partitional Clustering
Nonhierarchical: each instance is placed in exactly one of K non-overlapping clusters.
Given $X = \{x^{(i)}\}_{i=1}^{N}$, find $C = \{C_1, C_2, \dots, C_K\}$ such that $\forall j,\ C_j \neq \emptyset$ and $\bigcup_{j=1}^{K} C_j = X$.
Hard clustering: each data point can belong to one cluster only, so the partition is disjoint: $\forall i \neq j,\ C_i \cap C_j = \emptyset$.
Since the output is only one set of clusters, the user has to specify the desired number of clusters K.

Partitioning Algorithms: Basic Concept
- Construct a partition of a set of N objects into a set of K clusters
- The number of clusters K is given in advance
- Each object belongs to exactly one cluster in hard clustering methods
- K-means is the most popular partitioning algorithm

Objective Based Clustering
Input: a set of N points and a distance/dissimilarity measure. Output: a partition of the data.
- k-median: find centers $c_1, c_2, \dots, c_K$ to minimize $\sum_{i=1}^{N} \min_{j \in \{1,\dots,K\}} d(x^{(i)}, c_j)$
- k-means: find centers $c_1, c_2, \dots, c_K$ to minimize $\sum_{i=1}^{N} \min_{j \in \{1,\dots,K\}} d^2(x^{(i)}, c_j)$
- k-center: find a partition to minimize the maximum radius
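To make the three objectives concrete, here is a minimal numpy sketch (not from the original slides) that evaluates each cost for a fixed set of centers; the function name and the variables `X` and `centers` are illustrative choices.

```python
import numpy as np

def clustering_costs(X, centers):
    """Evaluate k-median, k-means, and k-center costs for fixed centers.

    X: (N, d) data matrix; centers: (K, d) matrix of candidate centers.
    """
    # Euclidean distance from every point to every center: shape (N, K)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)        # distance of each point to its closest center
    k_median = nearest.sum()           # sum of distances
    k_means = (nearest ** 2).sum()     # sum of squared distances
    k_center = nearest.max()           # maximum radius
    return k_median, k_means, k_center
```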

Distance Measure
Let $O_1$ and $O_2$ be two objects from the universe of possible objects. The distance (dissimilarity) between $O_1$ and $O_2$ is a real number denoted by $d(O_1, O_2)$.
Specifying the distance $d(x, x')$ between pairs $(x, x')$: e.g., number of keywords in common, edit distance.
Example: Euclidean distance in the space of features.

K-means Clustering
Input: a set $x^{(1)}, \dots, x^{(N)}$ of data points (in a d-dimensional feature space) and an integer K.
Output: a set of K representatives $c_1, c_2, \dots, c_K \in \mathbb{R}^d$ as the cluster representatives; data points are assigned to the clusters according to their distances to $c_1, c_2, \dots, c_K$, each data point going to the cluster whose representative is nearest to it.
Objective: choose $c_1, c_2, \dots, c_K$ to minimize $\sum_{i=1}^{N} \min_{j \in \{1,\dots,K\}} d^2(x^{(i)}, c_j)$.

Euclidean k-means Clustering
Input: a set $x^{(1)}, \dots, x^{(N)}$ of data points (in a d-dimensional feature space) and an integer K.
Output: a set of K representatives $c_1, c_2, \dots, c_K \in \mathbb{R}^d$ as the cluster representatives; each data point is assigned to its closest cluster representative.
Objective: choose $c_1, c_2, \dots, c_K$ to minimize $\sum_{i=1}^{N} \min_{j \in \{1,\dots,K\}} \|x^{(i)} - c_j\|^2$.

Euclidean k-means Clustering: Computational Complexity
To find the optimal partition, we need to exhaustively enumerate all partitions (in how many ways can we assign K labels to N observations?).
NP-hard even for K = 2 or d = 2.
For K = 1: $\min_c \sum_{i=1}^{N} \|x^{(i)} - c\|^2$ is attained at $c = \mu = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$.
For d = 1, dynamic programming solves it in time $O(N^2 K)$.
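As a quick numerical sanity check of the K = 1 case (a sketch with synthetic data, not from the slides), the mean should never be beaten by a perturbed center:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
mu = X.mean(axis=0)

def cost(c):
    # Sum of squared Euclidean distances of all points to center c
    return ((X - c) ** 2).sum()

# The mean gives a (weakly) lower cost than any perturbed center.
for _ in range(5):
    c = mu + rng.normal(scale=0.5, size=3)
    assert cost(mu) <= cost(c)
print("sum of squared distances at the mean:", cost(mu))
```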

Common Heuristic in Practice: Lloyd's Method
Input: a set X of N data points $x^{(1)}, \dots, x^{(N)}$ in $\mathbb{R}^d$.
Initialize centers $c_1, c_2, \dots, c_K \in \mathbb{R}^d$ in any way.
Repeat until there is no further change in the cost:
- For each j: $C_j \leftarrow \{x \in X \mid c_j$ is the closest center to $x\}$
- For each j: $c_j \leftarrow$ mean of the members of $C_j$
In other words: holding the centers $c_1, c_2, \dots, c_K$ fixed, find the optimal assignments $C_1, \dots, C_K$ of data points to clusters; then, holding the cluster assignments $C_1, \dots, C_K$ fixed, find the optimal centers $c_1, c_2, \dots, c_K$.

K-means Algorithm (Lloyd's Method)
Select K random points $c_1, c_2, \dots, c_K$ as the clusters' initial centroids.
Repeat until convergence (or another stopping criterion):
- Assign data based on the current centers: for i = 1 to N, assign $x^{(i)}$ to the closest cluster, so that $C_j$ contains all data points that are closer to $c_j$ than to any other center.
- Re-estimate centers based on the current assignment: for j = 1 to K, $c_j = \frac{1}{|C_j|} \sum_{x^{(i)} \in C_j} x^{(i)}$.
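The two alternating steps above translate almost directly into code. Below is a minimal numpy sketch of Lloyd's method (the function name, random initialization choice, and convergence test are illustrative assumptions, not the slides' exact formulation):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """A minimal Lloyd's k-means: alternate assignments and mean updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # converged: centers stop moving
            break
        centers = new_centers
    return centers, labels
```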

Figure: assigning data to clusters and updating the means [Bishop].

Intra-cluster similarity
k-means optimizes intra-cluster similarity:
$J(C) = \sum_{j=1}^{K} \sum_{x^{(i)} \in C_j} \|x^{(i)} - c_j\|^2$, where $c_j = \frac{1}{|C_j|} \sum_{x^{(i)} \in C_j} x^{(i)}$.
Moreover, $\sum_{x^{(i)} \in C_j} \|x^{(i)} - c_j\|^2 = \frac{1}{2|C_j|} \sum_{x^{(i)} \in C_j} \sum_{x^{(i')} \in C_j} \|x^{(i)} - x^{(i')}\|^2$,
i.e., for each cluster it equals half the sum, over its points, of the average squared distance to members of the same cluster.
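The identity between the within-cluster scatter and the pairwise distances is easy to check numerically; this is a small verification sketch with synthetic data (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(50, 2))          # points of one cluster
c = C.mean(axis=0)                    # its centroid

lhs = ((C - c) ** 2).sum()            # sum of squared distances to the centroid
pairwise = ((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
rhs = pairwise.sum() / (2 * len(C))   # (1 / 2|C|) * sum of all pairwise squared distances

print(np.isclose(lhs, rhs))           # True
```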

K-means: Convergence (Sec. 16.4)
It always converges. Why should the K-means algorithm ever reach a state in which the clustering doesn't change?
- The reassignment stage monotonically decreases J, since each vector is assigned to the closest centroid.
- The centroid update stage also, for each cluster, minimizes the sum of squared distances of the assigned points to the cluster center.
Figure: cost after each E-step and M-step [Bishop].

Local optimum
K-means always converges, but it may converge to a local optimum that is different from the global optimum, and that may be arbitrarily worse in terms of the objective score.
Local optimum: every point is assigned to its nearest center and every center is the mean value of its points.

K-means: Local Minimum Problem
Figure: original data, the optimal clustering, and the clustering obtained by k-means.

Lloyd's Method: Initialization
Initialization is crucial (it affects how fast the algorithm converges and the quality of the clustering):
- Random centers from the data points
- Multiple runs, then select the best one
- Initialize with the results of another method
- Select good initial centers using a heuristic: furthest-point traversal, or k-means++ (works well and has provable guarantees)

Another Initialization Idea: Furthest-Point Heuristic
Choose $c_1$ arbitrarily (or at random).
For j = 2, ..., K: select $c_j$ among the data points $x^{(1)}, \dots, x^{(N)}$ to be the one farthest from the previously chosen centers $c_1, \dots, c_{j-1}$.
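A minimal numpy sketch of this greedy maximin initialization (function and variable names are illustrative):

```python
import numpy as np

def furthest_point_init(X, k, rng=None):
    """Furthest-point (maximin) initialization; greedy and sensitive to outliers."""
    if rng is None:
        rng = np.random.default_rng()
    centers = [X[rng.integers(len(X))]]            # first center chosen at random
    for _ in range(1, k):
        # Distance of each point to its nearest already-chosen center
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2),
            axis=1,
        )
        centers.append(X[d.argmax()])              # pick the point furthest from all chosen centers
    return np.array(centers)
```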

Another Initialization Idea: Furthest-Point Heuristic
It is sensitive to outliers.

K-means++ Initialization: D² Sampling [AV07]
Combines the random initialization and furthest-point initialization ideas: let the probability of selecting a point be proportional to the squared distance between this point and its nearest already-chosen center,
$D^2(x) = \min_{k < j} \|x - c_k\|^2.$
Choose $c_1$ arbitrarily (or at random). For j = 2, ..., K: select $c_j$ among the data points $x^{(1)}, \dots, x^{(N)}$ according to the distribution
$\Pr(c_j = x^{(i)}) \propto \min_{k < j} \|x^{(i)} - c_k\|^2.$
Theorem: k-means++ always attains an O(log k) approximation to the optimal k-means solution in expectation.
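A minimal sketch of D² sampling in numpy (names are illustrative); the only change with respect to the furthest-point heuristic is that the next center is sampled in proportion to $D^2(x)$ instead of taken deterministically:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ initialization: sample each new center with probability
    proportional to the squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # D^2(x) for every point: squared distance to its nearest chosen center
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        probs = d2 / d2.sum()
        idx = rng.choice(len(X), p=probs)          # sample the next center
        centers.append(X[idx])
    return np.array(centers)
```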

How Many Clusters?
The number of clusters K is given in advance in the k-means algorithm; however, finding the right number of clusters is part of the problem. There is a tradeoff between having better focus within each cluster and having too many clusters.
- Hold-out validation / cross-validation on an auxiliary task (e.g., a supervised learning task).
- Optimization problem: penalize having lots of clusters; some criteria can be used to automatically estimate K. For example, penalize the number of bits needed to describe the extra parameters: $J'(C) = J(C) + |C| \log N$.

How Many Clusters?
Heuristic: find a large gap between the (K−1)-means cost and the K-means cost ("knee finding" or "elbow finding").
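A sketch of this elbow heuristic, assuming the `lloyd_kmeans` sketch defined earlier and an (N, d) data matrix `X` (both hypothetical names): run k-means for a range of K values and look at where the drop in cost flattens out.

```python
import numpy as np

def kmeans_cost(X, centers):
    # Sum of squared distances of each point to its closest center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

def elbow_costs(X, k_max=10):
    costs = []
    for k in range(1, k_max + 1):
        centers, _ = lloyd_kmeans(X, k)
        costs.append(kmeans_cost(X, centers))
    # The "elbow" is roughly where the gap between successive costs becomes small.
    gaps = -np.diff(costs)
    return costs, gaps
```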

K-means: Advantages and Disadvantages
Strengths:
- It is a simple method
- Relatively efficient: O(tKNd), where t is the number of iterations; usually t ≪ N
- K-means typically converges quickly
Weaknesses:
- Need to specify K, the number of clusters, in advance
- Often terminates at a local optimum
- Not suitable for discovering clusters with arbitrary shapes
- Works for numerical data; what about categorical data?
- Noise and outliers can cause considerable trouble for K-means

k-means Algorithm: Limitation
In general, k-means is unable to find clusters of arbitrary shapes, sizes, and densities, except when the clusters are very well separated.

K-means: Vector Quantization for Data Compression
Vector quantization: construct a codebook using the k-means cluster means as prototypes representing the examples assigned to each cluster.
Figure: reconstructions with k = 3, k = 5, and k = 15 codewords.
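As an illustration of vector quantization, the sketch below compresses pixel colors with k-means, replacing every pixel by its nearest codeword. It uses scikit-learn's KMeans and a synthetic image purely as assumptions for a runnable example; the slides do not prescribe a particular library.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical image: any (H, W, 3) uint8 array would do here.
img = (np.random.default_rng(0).random((64, 64, 3)) * 255).astype(np.uint8)

k = 5
pixels = img.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Codebook = the k cluster means; each pixel is replaced by its nearest codeword.
compressed = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
```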

K-means
K-means was proposed nearly 60 years ago, and thousands of clustering algorithms have been published since then; however, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and to the ill-posed nature of the clustering problem.
A. K. Jain, "Data Clustering: 50 Years Beyond K-means," 2010.

Hierarchical Clustering
The notion of a cluster can be ambiguous: how many clusters are there?
Hierarchical clustering: clusters contain sub-clusters, sub-clusters themselves can have sub-sub-clusters, and so on.
This gives several levels of detail in the clustering; a hierarchy might be more natural, providing different levels of granularity.

Categorization of Clustering Algorithms
- Partitional
- Hierarchical: divisive or agglomerative

Hierarchical Clustering
- Agglomerative (bottom-up): starts with each data point in a separate cluster and repeatedly joins the closest pair of clusters, until there is only one cluster (or another stopping criterion is met).
- Divisive (top-down): starts with the whole data set as one cluster and repeatedly divides one of the clusters, until there is only one data point in each cluster (or another stopping criterion is met).

Example: Hierarchical Agglomerative Clustering (HAC)
Figure: data points and the resulting dendrogram; height represents the distance at which each merge occurs.

Distances between Cluster Pairs
There are many variants for defining the distance between a pair of clusters:
- Single-link: minimum distance between pairs of data points, one from each cluster
- Complete-link: maximum distance between pairs of data points, one from each cluster
- Centroid: distance between the centroids (centers of gravity)
- Average-link: average distance between pairs of elements

Distances between Cluster Pairs
Figure: single-link, complete-link, Ward's, and average-link criteria.

Single Linkage
The minimum of all pairwise distances between points in the two clusters:
$dist_{SL}(C_i, C_j) = \min_{x \in C_i,\ x' \in C_j} dist(x, x')$
It can produce straggly (long and thin) clusters due to the chaining effect.

Single-Link
Figure: single-link merges keep the maximum "bridge" length as small as possible.

Complete Linkage
The maximum of all pairwise distances between points in the two clusters:
$dist_{CL}(C_i, C_j) = \max_{x \in C_i,\ x' \in C_j} dist(x, x')$
It makes tighter, more spherical clusters, which is typically preferable.

Complete Link
Figure: complete-link merges keep the maximum diameter as small as possible.

Ward's Method
The distance between the centers of the two clusters, weighted to take the cluster sizes into account:
$dist_{Ward}(C_i, C_j) = \frac{|C_i|\,|C_j|}{|C_i| + |C_j|}\, dist(c_i, c_j)$
It merges the two clusters such that the increase in the k-means cost is as small as possible, and it works well in practice.
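All of these linkage criteria are available off the shelf; the sketch below uses SciPy (an assumption, since the slides do not name a library) to build agglomerative clusterings of the same data with each criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                # any small (N, d) data set

# The linkage criterion determines which pair of clusters is merged at each step.
Z_single   = linkage(X, method="single")    # minimum pairwise distance
Z_complete = linkage(X, method="complete")  # maximum pairwise distance
Z_average  = linkage(X, method="average")   # average pairwise distance
Z_ward     = linkage(X, method="ward")      # smallest increase in k-means cost
```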

Computational Complexity
In the first iteration, all HAC methods compute the similarity of all pairs of the N individual instances, which requires $O(N^2)$ similarity computations.
In each of the remaining N−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters: done naively this is $O(N^3)$ overall, but done more cleverly it is $O(N^2 \log N)$.

Dendrogram: Hierarchical Clustering
The clustering is obtained by cutting the dendrogram at a desired level:
- cut at a pre-specified level of similarity, or where the gap between two successive combination similarities is largest, or
- select the cutting point that produces K clusters.
Where to cut the dendrogram is user-determined.
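Continuing the SciPy sketch above (again an illustrative assumption, with `X` from the previous snippet), both ways of cutting the dendrogram are supported by `fcluster`:

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Linkage matrix, e.g. from Ward's method.
Z = linkage(X, method="ward")

# Cut by requesting a fixed number of clusters ...
labels_k = fcluster(Z, t=3, criterion="maxclust")

# ... or by cutting at a distance threshold
# (a horizontal line through the dendrogram).
labels_d = fcluster(Z, t=2.5, criterion="distance")
```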

K-means vs. Hierarchical
- Time cost: K-means is usually fast, while hierarchical methods do not scale well.
- Human intuition: a hierarchical structure maps nicely onto human intuition for some domains and provides more natural output.
- Local minimum problem: it is very common for k-means; however, hierarchical methods, like any heuristic search algorithm, also suffer from local optima, since they can never undo what was done previously.
- Choosing the number of clusters: there is no need to specify the number of clusters in advance for hierarchical methods.