Clustering. Unsupervised Learning

Clustering. Unsupervised Learning Maria-Florina Balcan 11/05/2018

Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would we want to do this? Useful for: Automatically organizing data. Understanding hidden structure in data. Preprocessing for further analysis. Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes).

Applications (Clustering comes up everywhere.) Cluster news articles, web pages, or search results by topic. Cluster protein sequences by function, or genes according to expression profile. Cluster users of social networks by interest (community detection). [Figures: Facebook network, Twitter network.]

Applications (Clustering comes up everywhere.) Cluster customers according to purchase history. Cluster galaxies or nearby stars (e.g., Sloan Digital Sky Survey). And many, many more applications.

Clustering Today: objective-based clustering, k-means clustering, hierarchical clustering.

Objective-Based Clustering
Input: a set S of n points, plus a distance/dissimilarity measure specifying the distance d(x, y) between pairs (x, y); e.g., number of keywords in common, edit distance, wavelet coefficients, etc.
Goal: output a partition of the data.
k-means: find center points $c_1, c_2, \dots, c_k$ to minimize $\sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} d^2(x_i, c_j)$.
k-median: find center points $c_1, c_2, \dots, c_k$ to minimize $\sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} d(x_i, c_j)$.
k-center: find a partition to minimize the maximum radius.

Euclidean k-means Clustering
Input: a set of n datapoints $x_1, x_2, \dots, x_n$ in $R^d$ and a target number of clusters k.
Output: k representatives $c_1, c_2, \dots, c_k \in R^d$.
Objective: choose $c_1, c_2, \dots, c_k \in R^d$ to minimize $\sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} \|x_i - c_j\|^2$.
Natural assignment: each point is assigned to its closest center, which leads to a Voronoi partition.
Computational complexity: NP-hard even for k = 2 [Dasgupta 08] or d = 2 [Mahajan-Nimbhorkar-Varadarajan 09]. There are a couple of easy cases.
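To make the objective concrete, here is a minimal numpy sketch (not from the lecture; the function name is mine) that evaluates the Euclidean k-means cost of a given set of centers:

```python
import numpy as np

def kmeans_cost(X, centers):
    """Euclidean k-means objective: sum_i min_j ||x_i - c_j||^2.
    X: (n, d) array of datapoints; centers: (k, d) array."""
    # Pairwise squared distances between every point and every center: (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Each point pays the squared distance to its closest center.
    return d2.min(axis=1).sum()
```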

An Easy Case for k-means: k = 1
Input: a set of n datapoints $x_1, x_2, \dots, x_n$ in $R^d$. Output: $c \in R^d$ to minimize $\sum_{i=1}^{n} \|x_i - c\|^2$.
Solution: the optimal choice is $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$.
Idea: a bias/variance-like decomposition,
$\frac{1}{n} \sum_{i=1}^{n} \|x_i - c\|^2 = \|\mu - c\|^2 + \frac{1}{n} \sum_{i=1}^{n} \|x_i - \mu\|^2$,
i.e., (average k-means cost w.r.t. c) = (squared distance from c to $\mu$) + (average k-means cost w.r.t. $\mu$). So the optimal choice for c is $\mu$.
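For completeness, here is the short derivation behind the decomposition (a standard argument, not spelled out on the slide): expand the square around $\mu$ and note that the cross term vanishes because the deviations from the mean sum to zero.

```latex
\frac{1}{n}\sum_{i=1}^{n}\|x_i - c\|^2
  = \frac{1}{n}\sum_{i=1}^{n}\big\|(x_i - \mu) + (\mu - c)\big\|^2
  = \frac{1}{n}\sum_{i=1}^{n}\|x_i - \mu\|^2
    + \frac{2}{n}\Big\langle \textstyle\sum_{i=1}^{n}(x_i - \mu),\; \mu - c \Big\rangle
    + \|\mu - c\|^2
  = \frac{1}{n}\sum_{i=1}^{n}\|x_i - \mu\|^2 + \|\mu - c\|^2
```

since $\sum_{i=1}^{n}(x_i - \mu) = 0$; the right-hand side is minimized over $c$ exactly at $c = \mu$.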

Another Easy Case for k-means: d = 1
Input: a set of n datapoints $x_1, x_2, \dots, x_n$ on the real line. Output: $c_1, \dots, c_k \in R$ to minimize $\sum_{i=1}^{n} \min_{j} (x_i - c_j)^2$.
Extra-credit homework question. Hint: dynamic programming in time $O(n^2 k)$.
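The hint can be turned into code. Below is a minimal sketch (mine, not from the lecture) of the $O(n^2 k)$ dynamic program; it relies on the standard fact that in one dimension every cluster of an optimal solution is a contiguous block of the sorted points, and `kmeans_1d` is a hypothetical helper name.

```python
import numpy as np

def kmeans_1d(xs, k):
    """Exact 1-D k-means cost via dynamic programming in O(n^2 k) time.
    cost[i][j] = best cost of splitting the first i sorted points into j
    contiguous clusters."""
    xs = np.sort(np.asarray(xs, dtype=float))
    n = len(xs)
    # Prefix sums of x and x^2 let us evaluate a block's squared error in O(1).
    ps = np.concatenate(([0.0], np.cumsum(xs)))
    ps2 = np.concatenate(([0.0], np.cumsum(xs ** 2)))

    def sse(a, b):  # sum of squared deviations from the mean over xs[a:b]
        s, s2, m = ps[b] - ps[a], ps2[b] - ps2[a], b - a
        return s2 - s * s / m

    INF = float("inf")
    cost = np.full((n + 1, k + 1), INF)
    cost[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            # The last cluster covers xs[m:i]; try every split point m.
            cost[i][j] = min(cost[m][j - 1] + sse(m, i) for m in range(j - 1, i))
    return cost[n][k]

# Example: two well-separated groups on the line.
print(kmeans_1d([0.0, 0.1, 0.2, 10.0, 10.1], k=2))  # ~0.025
```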

Common Heuristic in Practice: Lloyd's Method
[Least squares quantization in PCM, Lloyd, IEEE Transactions on Information Theory, 1982]
Input: a set S of n datapoints $x_1, x_2, \dots, x_n$ in $R^d$.
Initialize centers $c_1, c_2, \dots, c_k \in R^d$ and clusters $C_1, C_2, \dots, C_k$ in any way.
Repeat until there is no further change in the cost:
For each j: $C_j \leftarrow \{x \in S$ whose closest center is $c_j\}$
For each j: $c_j \leftarrow$ mean of $C_j$
In other words, the two steps alternate: holding $c_1, \dots, c_k$ fixed, pick the optimal $C_1, \dots, C_k$; holding $C_1, \dots, C_k$ fixed, pick the optimal $c_1, \dots, c_k$.
Note: the method always converges, since the cost never increases and there are only finitely many Voronoi partitions (so only finitely many values the cost can take).
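As a concrete reference, here is a minimal numpy sketch of Lloyd's method with random initialization (my own illustration; `lloyd_kmeans` is a hypothetical name, and real code would typically use a library implementation such as scikit-learn's KMeans):

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=None):
    """One run of Lloyd's method with random initialization.
    X: (n, d) array of datapoints.  Returns (centers, labels, cost)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random initialization: k distinct datapoints become the initial centers.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest current center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # clustering unchanged => converged
        labels = new_labels
        # Update step: each center moves to the mean of its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members):           # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment and cost for the returned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return centers, labels, d2[np.arange(n), labels].sum()
```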

Initialization for Lloyd's Method
Initialization is crucial: it affects both how fast the method converges and the quality of the solution output. Techniques commonly used in practice: random centers chosen from the datapoints (repeat a few times and keep the best), furthest-point traversal, and k-means++ (works well and has provable guarantees).

Lloyd's Method: Random Initialization (example)
Given a set of datapoints, select initial centers at random. Then alternate the two steps: assign each point to its nearest center, and recompute the optimal centers given the fixed clustering. After a few rounds of assignment and recomputation, we get a good-quality solution in this example.

Lloyd's method: Performance It always converges, but it may converge to a local optimum that is different from the global optimum, and in fact could be arbitrarily worse in terms of its score.

Lloyd's method: Performance Local optimum: every point is assigned to its nearest center, and every center is the mean value of its points.

Lloyd s method: Performance.It is arbitrarily worse than optimum solution.

Lloyd's method: Performance This bad performance can happen even with well-separated Gaussian clusters: some of the Gaussians get combined into a single cluster.

Lloyd's method: Performance If we do random initialization, then as k increases it becomes more likely that we won't have picked exactly one initial center per Gaussian, in which case Lloyd's method will output a bad solution. For k equal-sized Gaussians, $\Pr[\text{each initial center is in a different Gaussian}] \approx \frac{k!}{k^k} \approx \frac{1}{e^k}$, which becomes vanishingly small as k gets large.

Another Initialization Idea: Furthest-Point Heuristic
Choose $c_1$ arbitrarily (or at random). For j = 2, ..., k: pick $c_j$ to be the datapoint among $x_1, x_2, \dots, x_n$ that is farthest from the previously chosen $c_1, c_2, \dots, c_{j-1}$.
This fixes the Gaussian problem, but it can be thrown off by outliers.
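A minimal sketch of the furthest-point (farthest-first traversal) initialization, assuming Euclidean data in a numpy array (the function name is mine):

```python
import numpy as np

def furthest_point_init(X, k, seed=None):
    """Farthest-first traversal: c_1 is a random datapoint; each subsequent
    center is the point furthest from all centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(d2.argmax())])
    return np.array(centers)
```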

The furthest-point heuristic does well on the previous example.

The furthest-point initialization heuristic is sensitive to outliers. [Figure: example with k = 3 and points at (0,1), (-2,0), (3,0), (0,-1).]

K-means++ Initialization: $D^2$ Sampling [AV07]
Interpolate between random and furthest-point initialization. Let D(x) be the distance between a point x and its nearest already-chosen center; choose the next center with probability proportional to $D^2(x)$.
Choose $c_1$ at random. For j = 2, ..., k: pick $c_j$ among $x_1, x_2, \dots, x_n$ according to the distribution $\Pr(c_j = x_i) \propto \min_{j' < j} \|x_i - c_{j'}\|^2 = D^2(x_i)$.
Theorem: k-means++ always attains an O(log k) approximation to the optimal k-means solution in expectation. Running Lloyd's method afterwards can only further improve the cost.

K-means++ Idea: $D^{\alpha}$ Sampling
Interpolate between random and furthest-point initialization. Let D(x) be the distance between a point x and its nearest already-chosen center; choose the next center with probability proportional to $D^{\alpha}(x)$.
$\alpha = 0$: random sampling. $\alpha = \infty$: furthest point (side note: this actually works well for k-center). $\alpha = 2$: k-means++. Side note: $\alpha = 1$ works well for k-median.
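The whole family can be written as one small sketch (again my own illustration, not the lecture's code): $D^{\alpha}$ sampling, with α = 0 giving random initialization, α = 2 giving k-means++, and large α approaching the furthest-point heuristic.

```python
import numpy as np

def d_alpha_init(X, k, alpha=2.0, seed=None):
    """D^alpha seeding: alpha=0 is uniform random, alpha=2 is k-means++,
    and alpha -> infinity approaches farthest-first traversal."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]              # c_1 chosen at random
    for _ in range(1, k):
        # D(x): distance from each point to its nearest already-chosen center.
        d = np.sqrt(np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0))
        w = d ** alpha                               # Pr(c_j = x_i) proportional to D(x_i)^alpha
        if w.sum() == 0:                             # all points coincide with chosen centers
            w = np.ones_like(w)
        centers.append(X[rng.choice(len(X), p=w / w.sum())])
    return np.array(centers)
```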

K-means++ fixes the previous outlier example. [Figure: points (0,1), (-2,0), (3,0), (0,-1).]

K-means++ / Lloyd's Running Time
K-means++ initialization: O(nd) time and one pass over the data to select each next center, so O(nkd) time in total.
Lloyd's method: repeat until there is no change in the cost: for each j, $C_j \leftarrow \{x \in S$ whose closest center is $c_j\}$; for each j, $c_j \leftarrow$ mean of $C_j$. Each round takes time O(nkd).
The number of rounds can be exponential in the worst case [AV07], but is expected polynomial in the smoothed-analysis (non-worst-case) model.

K-means++ / Lloyd's Summary
K-means++ always attains an O(log k) approximation to the optimal k-means solution in expectation, and running Lloyd's method can only further improve the cost. Lloyd's method can take an exponential number of rounds in the worst case [AV07], but runs in expected polynomial time in the smoothed-analysis model, and it does well in practice.

What value of k?
Heuristic: find a large gap between the (k-1)-means cost and the k-means cost. Alternatives: hold-out validation / cross-validation on an auxiliary task (e.g., a supervised learning task), or try hierarchical clustering.
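One way to apply the gap heuristic in practice, assuming the `lloyd_kmeans` sketch above (a hypothetical helper, not the lecture's code): compute the best cost found for each k over several restarts and look for a k after which the cost stops dropping sharply.

```python
def kmeans_cost_curve(X, k_max, restarts=10, seed=0):
    """Best k-means cost found for k = 1 .. k_max over several random restarts."""
    costs = []
    for k in range(1, k_max + 1):
        best = min(lloyd_kmeans(X, k, seed=seed + r)[2] for r in range(restarts))
        costs.append(best)
    # Inspect the gaps costs[k-2] - costs[k-1]: a large gap followed by a
    # plateau suggests a reasonable value of k.
    return costs
```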

Hierarchical Clustering
A hierarchy might be more natural: different users might care about different levels of granularity, or even different prunings of the tree. [Figure: a topic hierarchy with root "All topics", children "sports" and "fashion", leaves "soccer" and "tennis" under sports, and "Gucci" and "Lacoste" under fashion.]

Top-Down (divisive) hierarchical clustering: partition the data into 2 groups (e.g., with 2-means), then recursively cluster each group.
Bottom-Up (agglomerative) hierarchical clustering: start with every point in its own cluster and repeatedly merge the closest two clusters. Different definitions of "closest" give different algorithms.

Bottom-Up (agglomerative)
We have a distance measure d(x, y) on pairs of objects, e.g., number of keywords in common, edit distance, etc.
Single linkage: $\mathrm{dist}(A, B) = \min_{x \in A,\, x' \in B} \mathrm{dist}(x, x')$
Complete linkage: $\mathrm{dist}(A, B) = \max_{x \in A,\, x' \in B} \mathrm{dist}(x, x')$
Average linkage: $\mathrm{dist}(A, B) = \mathrm{avg}_{x \in A,\, x' \in B} \mathrm{dist}(x, x')$
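A minimal sketch of the generic bottom-up procedure, parametrized by the linkage (my own illustration; in practice one would typically use a library routine such as scipy.cluster.hierarchy.linkage):

```python
import numpy as np

def agglomerative(D, k, linkage="single"):
    """Naive bottom-up clustering from an (N, N) pairwise distance matrix D.
    linkage: 'single' (min), 'complete' (max), or 'average' (mean).
    Starts with N singleton clusters and merges until k clusters remain."""
    agg = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen linkage.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = agg(D[np.ix_(clusters[a], clusters[b])])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into a
        del clusters[b]
    return clusters
```

This naive version recomputes cluster distances from scratch each time and runs in O(N^3) overall, which matches the running-time discussion below.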

Bottom-Up (agglomerative), Single Linkage
Start with every point in its own cluster; repeatedly merge the closest two clusters, where single linkage means $\mathrm{dist}(A, B) = \min_{x \in A,\, x' \in B} \mathrm{dist}(x, x')$.
[Figure: dendrogram for six points A, B, C, D, E, F at positions -3, -2, 0, 2.1, 3.2, 6 on the line; merges occur in the order {A,B}, {D,E}, {A,B,C}, {A,B,C,D,E}, {A,B,C,D,E,F}.]

Bottom-Up (agglomerative), Single Linkage
One way to think of it: at any moment, we see the connected components of the graph that connects any two points at distance < r. Watch these components as r grows; there are only n-1 relevant values of r, because merges happen only at values of r equal to distances between points lying in different clusters. [Figure: the same six points A-F at positions -3, -2, 0, 2.1, 3.2, 6.]

Bottom-Up (agglomerative), Complete Linkage
Start with every point in its own cluster; repeatedly merge the closest two clusters, where complete linkage means $\mathrm{dist}(A, B) = \max_{x \in A,\, x' \in B} \mathrm{dist}(x, x')$. One way to think of it: at every level, keep the maximum cluster diameter as small as possible. [Figure: dendrogram for the same points A-F; the merge order differs from single linkage, with {D,E,F} formed before the final merge with {A,B,C}.]

Running Time
Each algorithm starts with N clusters and performs N-1 merges. For each algorithm, computing dist(C, C') can be done in time O(|C| · |C'|) (e.g., by examining dist(x, x') for all x ∈ C, x' ∈ C'). Computing all pairwise cluster distances and taking the smallest is O(N^2) per merge, so the overall time is O(N^3). In fact, all of these algorithms can be run in time O(N^2 log N); if curious, see Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://www-nlp.stanford.edu/irbook/

Data-Driven Clustering [Balcan-Nagarajan-Vitercik-White, COLT 17] [Balcan-Dick-White, NIPS 18]
Goal: given a large family F of algorithms and a sample of typical instances from the domain (drawn from an unknown distribution D), find an algorithm in F that performs well on new instances from D. [Figure: a large family F of algorithms (MST, greedy, dynamic programming, farthest location) and a sample of typical clustering inputs, Input 1 through Input m.]

Data-Driven Clustering
Approach: find an algorithm A in F that is near-optimal over the set of sample instances. Key question: will A do well on future (unseen) instances? Sample complexity: how large should our sample of typical instances be in order to guarantee good performance on new instances?

Data-Driven Clustering
Key tool: uniform convergence. For every algorithm in F, its average performance over the samples is close to its expected performance, which implies that A has high expected performance. Learning theory: $N = O(\dim(F)/\epsilon^2)$ instances suffice to be ε-close.

Data-Driven Clustering
Clustering via linkage + dynamic programming [Balcan-Nagarajan-Vitercik-White, COLT 17]: the data is clustered by a parametrized linkage step (single linkage, complete linkage, α-weighted combinations, Ward's algorithm) followed by dynamic programming for the k-means, k-median, or k-center objective.
Clustering via greedy seeding + local search [Balcan-Dick-White, NIPS 18]: the data is clustered by a parametrized seeding step (random seeding, farthest-first traversal, k-means++, $D^{\alpha}$ sampling, parametrized Lloyd's methods) followed by local search ($L_2$-local search, β-local search).

What You Should Know
Partitional clustering: k-means and k-means++, Lloyd's method, and initialization techniques (random, furthest-point traversal, k-means++).
Hierarchical clustering: single linkage and complete linkage.