Clustering: K-means and Kernel K-means


Piyush Rai, Machine Learning (CS771A), Aug 31, 2016

Clustering
Usually an unsupervised learning problem.
Given: N unlabeled examples $\{x_1, \ldots, x_N\}$; the number of desired partitions K.
Goal: group the examples into K homogeneous partitions.
Loosely speaking, it is classification without ground-truth labels.
A good clustering is one that achieves high within-cluster similarity and low inter-cluster similarity.
Picture courtesy: Data Clustering: 50 Years Beyond K-Means, A.K. Jain (2008)
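To make these two criteria concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture) that scores a given hard clustering by its average within-cluster and inter-cluster distances; the function and variable names are assumptions made for this example.

```python
import numpy as np

def clustering_quality(X, labels):
    """Average within-cluster vs. inter-cluster Euclidean distance.

    X : (N, D) data; labels : (N,) integer cluster assignments.
    A good clustering has small 'within' and large 'inter'.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # (N, N) pairwise
    same = labels[:, None] == labels[None, :]        # same-cluster mask
    off_diag = ~np.eye(len(X), dtype=bool)           # ignore self-distances
    within = dists[same & off_diag].mean()
    inter = dists[~same].mean()
    return within, inter

# Toy usage: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
print(clustering_quality(X, np.array([0] * 20 + [1] * 20)))
```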

Similarity can be Subjective
Clustering only looks at similarities; no labels are given.
Without labels, similarity can be hard to define.
Thus using the right distance/similarity measure is very important in clustering.
Also important to define/ask: clustering based on what?
Picture courtesy: http://www.guy-sports.com/humor/videos/powerpoint presentation dogs.htm

Clustering: Some Examples
Document/image/webpage clustering
Image segmentation (clustering pixels)
Clustering web-search results
Clustering (people) nodes in (social) networks/graphs
... and many more ...
Picture courtesy: http://people.cs.uchicago.edu/~pff/segment/

Types of Clustering
1. Flat or partitional clustering: partitions are independent of each other.
2. Hierarchical clustering: partitions can be visualized using a tree structure (a dendrogram); possible to view partitions at different levels of granularity (i.e., can refine/coarsen clusters) using different K.

Flat Clustering: K-means algorithm (Lloyd, 1957)
Input: N examples $\{x_1, \ldots, x_N\}$, $x_n \in \mathbb{R}^D$; the number of partitions K.
Initialize: K cluster means $\mu_1, \ldots, \mu_K$, each $\mu_k \in \mathbb{R}^D$. Usually initialized randomly, but good initialization is crucial; many smarter initialization heuristics exist (e.g., K-means++, Arthur & Vassilvitskii, 2007).
Iterate:
(Re)-assign each example $x_n$ to its closest cluster center (based on the smallest Euclidean distance):
$$C_k = \{n : k = \arg\min_{k'} \|x_n - \mu_{k'}\|^2\}$$
($C_k$ is the set of examples assigned to cluster k, which has center $\mu_k$.)
Update the cluster means:
$$\mu_k = \text{mean}(C_k) = \frac{1}{|C_k|} \sum_{n \in C_k} x_n$$
Repeat while not converged. Stop when the cluster means or the loss do not change by much.
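Below is a minimal NumPy sketch of this procedure (my own, not the instructor's code); it uses plain random initialization rather than K-means++, and all names are illustrative.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and mean-update steps.

    X : (N, D) data matrix; K : number of clusters.
    Returns the cluster means (K, D) and the assignments (N,).
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Random initialization: pick K distinct data points as the initial means
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    z = None
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest center
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)  # (N, K)
        z_new = dists.argmin(axis=1)
        if z is not None and np.array_equal(z_new, z):
            break  # assignments unchanged, so the means won't change either
        z = z_new
        # Update step: each mean becomes the average of its assigned points
        for k in range(K):
            if np.any(z == k):           # keep the old mean if a cluster goes empty
                mu[k] = X[z == k].mean(axis=0)
    return mu, z

# Toy usage
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
mu, z = kmeans(X, K=2)
```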

K-means illustration (the slide figures are not reproduced in this transcription):
K-means: Initialization (assume K = 2)
K-means iteration 1: Assigning points
K-means iteration 1: Recomputing the centers
K-means iteration 2: Assigning points
K-means iteration 2: Recomputing the centers
K-means iteration 3: Assigning points
K-means iteration 3: Recomputing the centers
K-means iteration 4: Assigning points
K-means iteration 4: Recomputing the centers

What Loss Function is K-means Optimizing?
Let $\mu_1, \ldots, \mu_K$ be the K cluster centroids (means).
Let $z_{nk} \in \{0, 1\}$ be such that $z_{nk} = 1$ if $x_n$ belongs to cluster k, and 0 otherwise.
Note: $z_n = [z_{n1}\; z_{n2}\; \ldots\; z_{nK}]$ represents a length-K one-hot encoding of $x_n$.
Define the distortion (or loss) for the cluster assignment of $x_n$:
$$\ell(\mu, x_n, z_n) = \sum_{k=1}^{K} z_{nk} \|x_n - \mu_k\|^2$$
The total distortion over all points defines the K-means loss function:
$$L(\mu, X, Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \|x_n - \mu_k\|^2 = \|X - Z\mu\|^2$$
where Z is $N \times K$ (row n is $z_n$) and $\mu$ is $K \times D$ (row k is $\mu_k$).
The K-means problem is to minimize this objective w.r.t. $\mu$ and Z.
Note that the objective only minimizes within-cluster distortions.
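A small sketch (mine, for illustration) that evaluates this loss in both the summation form and the equivalent matrix form $\|X - Z\mu\|^2$, assuming Z is given as a one-hot assignment matrix:

```python
import numpy as np

def kmeans_loss(X, Z, mu):
    """K-means distortion: sum_n sum_k z_nk * ||x_n - mu_k||^2.

    X : (N, D) data, Z : (N, K) one-hot assignments, mu : (K, D) means.
    """
    # Summation form of the loss
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (N, K)
    loss_sum = float((Z * sq_dists).sum())
    # Matrix form: row n of Z @ mu is the mean assigned to x_n
    loss_mat = float(((X - Z @ mu) ** 2).sum())
    assert np.isclose(loss_sum, loss_mat)
    return loss_sum
```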

K-means Objective
Consider the K-means objective function
$$L(\mu, X, Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \|x_n - \mu_k\|^2$$
It is a non-convex objective function: many local minima are possible, and it is also NP-hard to minimize in general (note that Z is discrete).
The K-means algorithm we saw is a heuristic to optimize this function. It alternates between the following two steps:
Fix $\mu$, minimize w.r.t. Z (assign points to the closest centers).
Fix Z, minimize w.r.t. $\mu$ (recompute the cluster means).
Note: The algorithm usually converges to a local minimum (though it may not always, and it may just converge somewhere). Multiple runs with different initializations can be tried to find a good solution.

Convergence of K-means Algorithm
Each step (updating Z or $\mu$) can never increase the objective.
When we update Z from $Z^{(t-1)}$ to $Z^{(t)}$:
$$L(\mu^{(t-1)}, X, Z^{(t)}) \le L(\mu^{(t-1)}, X, Z^{(t-1)})$$
because the new $Z^{(t)} = \arg\min_Z L(\mu^{(t-1)}, X, Z)$.
When we update $\mu$ from $\mu^{(t-1)}$ to $\mu^{(t)}$:
$$L(\mu^{(t)}, X, Z^{(t)}) \le L(\mu^{(t-1)}, X, Z^{(t)})$$
because the new $\mu^{(t)} = \arg\min_{\mu} L(\mu, X, Z^{(t)})$.
Thus the K-means algorithm monotonically decreases the objective.

K-means: Choosing K
One way to select K for the K-means algorithm is to try different values of K, plot the K-means objective versus K, and look at the elbow point. (In the plot shown on the slide, not reproduced here, K = 6 is the elbow point.)
One can also use an information criterion such as the AIC (Akaike Information Criterion),
$$\text{AIC} = 2 L(\hat{\mu}, X, \hat{Z}) + K \log D$$
and choose the K that has the smallest AIC (the penalty term discourages large K).
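As a sketch of this model-selection loop (my own example; it assumes scikit-learn's KMeans, whose inertia_ attribute holds the fitted K-means objective, and applies the AIC formula exactly as written on the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: a few well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = rng.uniform(-10, 10, size=(6, 2))
X = np.vstack([c + rng.normal(0, 0.5, (50, 2)) for c in centers])
D = X.shape[1]

aic = {}
for K in range(1, 11):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    loss = km.inertia_                     # K-means objective at the fitted solution
    aic[K] = 2 * loss + K * np.log(D)      # AIC as defined on the slide
print("chosen K:", min(aic, key=aic.get))
```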

K-means: Some Limitations
Makes hard assignments of points to clusters: a point either completely belongs to a cluster or doesn't belong at all. There is no notion of a soft assignment (i.e., a probability of being assigned to each cluster: say K = 3 and, for some point $x_n$, $p_1 = 0.7$, $p_2 = 0.2$, $p_3 = 0.1$).
Works well only if the clusters are roughly of equal sizes.
Probabilistic clustering methods such as Gaussian mixture models can handle both these issues (model each cluster using a Gaussian distribution).
K-means also works well only when the clusters are round-shaped, and does badly if the clusters have non-convex shapes.
Kernel K-means or spectral clustering can handle non-convex cluster shapes.

Kernel K-means
Basic idea: replace the Euclidean distance/similarity computations in K-means by their kernelized versions, e.g., replace $d(x_n, \mu_k) = \|\phi(x_n) - \phi(\mu_k)\|$ using
$$\|\phi(x_n) - \phi(\mu_k)\|^2 = \|\phi(x_n)\|^2 + \|\phi(\mu_k)\|^2 - 2\,\phi(x_n)^\top \phi(\mu_k) = k(x_n, x_n) + k(\mu_k, \mu_k) - 2\, k(x_n, \mu_k)$$
Here $k(\cdot, \cdot)$ denotes the kernel function and $\phi$ is its (implicit) feature map.
Note: $\phi$ doesn't have to be computed/stored for the data $\{x_n\}_{n=1}^N$ or the cluster means $\{\mu_k\}_{k=1}^K$ because the computations only depend on kernel evaluations.
A small technical note: when computing $k(\mu_k, \mu_k)$ and $k(x_n, \mu_k)$, remember that $\phi(\mu_k)$ is the average of the $\phi$'s of the data points assigned to cluster k.
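To make the technical note concrete, here is a sketch (mine, with an RBF kernel assumed; not the lecture's code) of the kernelized assignment step. Because $\phi(\mu_k)$ is the average of the $\phi$'s of the points in cluster k, the squared distance to it expands entirely in terms of Gram-matrix entries:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix of the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kernel_kmeans_assign(Kmat, z, K):
    """One assignment step of kernel K-means, using only the Gram matrix.

    For point n and cluster k with member set C_k:
      ||phi(x_n) - phi(mu_k)||^2 = K_nn
                                   - (2/|C_k|)   * sum_{m in C_k} K_nm
                                   + (1/|C_k|^2) * sum_{m,m' in C_k} K_mm'
    """
    N = Kmat.shape[0]
    dists = np.full((N, K), np.inf)
    for k in range(K):
        members = np.where(z == k)[0]
        if len(members) == 0:
            continue                                  # empty cluster: leave at inf
        cross = Kmat[:, members].sum(axis=1) / len(members)
        within = Kmat[np.ix_(members, members)].sum() / len(members) ** 2
        dists[:, k] = np.diag(Kmat) - 2 * cross + within
    return dists.argmin(axis=1)

# Usage: iterate the assignment step until the labels stop changing
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Kmat = rbf_kernel(X, gamma=0.5)
z = rng.integers(0, 2, size=100)       # random initial assignment
for _ in range(20):
    z = kernel_kmeans_assign(Kmat, z, K=2)
```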

Extra Slides: Hierarchical Clustering

Hierarchical Clustering
Agglomerative (bottom-up) clustering:
1. Start with each example in its own singleton cluster.
2. At each time-step, greedily merge the 2 most similar clusters.
3. Stop when there is a single cluster of all examples, else go to 2.
Divisive (top-down) clustering:
1. Start with all examples in the same cluster.
2. At each time-step, remove the "outsiders" from the least cohesive cluster.
3. Stop when each example is in its own singleton cluster, else go to 2.
Agglomerative clustering is more popular and simpler than divisive (but less accurate).
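A naive sketch of the agglomerative loop just described (my own illustration with single-link or complete-link merging; not the lecture's code and not optimized):

```python
import numpy as np

def agglomerative(X, linkage="single"):
    """Naive bottom-up clustering: repeatedly merge the two closest clusters.

    Returns the list of merges as (members_of_a, members_of_b, distance).
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    clusters = {i: [i] for i in range(len(X))}                   # singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                block = d[np.ix_(clusters[a], clusters[b])]
                dist = block.min() if linkage == "single" else block.max()
                if best is None or dist < best[2]:
                    best = (a, b, dist)
        a, b, dist = best
        merges.append((list(clusters[a]), list(clusters[b]), dist))
        clusters[a] = clusters[a] + clusters[b]      # greedily merge the closest pair
        del clusters[b]
    return merges
```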

(Dis)similarity between clusters
We know how to compute the dissimilarity $d(x_i, x_j)$ between two examples. How do we compute the dissimilarity between two clusters R and S?
Min-link or single-link (results in chaining; clusters can get very large):
$$d(R, S) = \min_{x_R \in R,\, x_S \in S} d(x_R, x_S)$$
Max-link or complete-link (results in small, round-shaped clusters):
$$d(R, S) = \max_{x_R \in R,\, x_S \in S} d(x_R, x_S)$$
Average-link (a compromise between single and complete linkage):
$$d(R, S) = \frac{1}{|R|\,|S|} \sum_{x_R \in R,\, x_S \in S} d(x_R, x_S)$$
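Written out directly, the three linkage rules look like the following sketch (my own, assuming R and S are index lists into a precomputed pairwise-distance matrix d):

```python
import numpy as np

def single_link(d, R, S):
    """Min-link: distance between the closest pair across R and S."""
    return d[np.ix_(R, S)].min()

def complete_link(d, R, S):
    """Max-link: distance between the farthest pair across R and S."""
    return d[np.ix_(R, S)].max()

def average_link(d, R, S):
    """Average of all pairwise distances between R and S."""
    return d[np.ix_(R, S)].mean()
```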

Flat vs Hierarchical Clustering
Flat clustering produces a single partitioning. Hierarchical clustering can give different partitionings depending on the level of resolution we are looking at.
Flat clustering is usually more efficient run-time wise. Hierarchical clustering can be slow (it has to make several merge/split decisions).