k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University

Outline
Unsupervised versus Supervised Learning
Clustering Problem
k-means Clustering Algorithm
Visual Example
Worked Example
Initialization Methods
Choosing the Number of Clusters k
Variations for Non-Euclidean Distance Metrics
Summary

Supervised Learning Given training input and output (x, y) pairs, learn a function approximation mapping x's to y's. Regression example: Given (sepal_width, sepal_length) pairs {(3.5, 5.1), (3, 4.9), (3.1, 4.6), (3.2, 4.7), ...}, learn a function f(sepal_width) that predicts sepal_length well for all sepal_width values. Classification example: Given (balance, will_default) pairs {(808, false), (1813, true), (944, true), (1072, false), ...}, learn a function f(balance) that predicts will_default for all balance values.
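
As a minimal illustration of the regression example (my own sketch, not from the slides, and assuming scikit-learn is available), one could fit a line to the (sepal_width, sepal_length) pairs listed above:

```python
# Hypothetical illustration of the regression example: fit f(sepal_width) -> sepal_length.
import numpy as np
from sklearn.linear_model import LinearRegression

sepal_width = np.array([[3.5], [3.0], [3.1], [3.2]])   # inputs x from the slide
sepal_length = np.array([5.1, 4.9, 4.6, 4.7])          # outputs y from the slide

f = LinearRegression().fit(sepal_width, sepal_length)
print(f.predict(np.array([[3.3]])))  # predicted sepal_length for an unseen sepal_width
```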

Unsupervised Learning Given input data only (no training labels/outputs), learn characteristics of the data's structure. Clustering example: Given a set of (neck_size, sleeve_length) pairs representative of a target market, determine a set of clusters that will serve as the basis for shirt size design. Supervised vs. Unsupervised Learning: Supervised learning: given input and output, learn an approximate mapping from input to output. (The output is the supervision.) Unsupervised learning: given input only, output the structure of the input data.

Clustering Problem Clustering is grouping a set of objects such that objects in the same group (i.e. cluster) are more similar to each other, in some sense, than to objects of different groups. Our specific clustering problem: Given a set of n observations x_1, x_2, ..., x_n, where each observation is a d-dimensional real vector, and given a number of clusters k, compute a cluster assignment mapping C(x_i) ∈ {1, ..., k} that minimizes the within-cluster sum of squares (WCSS): $\sum_{i=1}^{n} \lVert x_i - \mu_{C(x_i)} \rVert^2$, where the centroid $\mu_{C(x_i)}$ is the mean of the points in cluster C(x_i).
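
To make the objective concrete, here is a minimal sketch of the WCSS computation in NumPy (my own code, not from the slides); X is the n-by-d data matrix, C holds the cluster index of each point, and mu is the k-by-d matrix of centroids:

```python
import numpy as np

def wcss(X, C, mu):
    """Within-cluster sum of squares: sum_i ||x_i - mu_{C(x_i)}||^2."""
    return float(np.sum((X - mu[C]) ** 2))
```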

k-means Clustering Algorithm General algorithm: Randomly choose k cluster centroids μ_1, μ_2, ..., μ_k and arbitrarily initialize the cluster assignment mapping C. While remapping C from each x_i to its closest centroid μ_j causes a change in C, recompute μ_1, μ_2, ..., μ_k according to the new C. In order to minimize the WCSS, we alternately: recompute C to minimize the WCSS holding the μ_j fixed, and recompute the μ_j to minimize the WCSS holding C fixed. In minimizing the WCSS, we seek a clustering that minimizes Euclidean distance variance within clusters.
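
The following is a sketch of this algorithm in NumPy (my own illustration; function and variable names are assumptions, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, rng=None):
    rng = np.random.default_rng(rng)
    n = len(X)
    # Forgy-style start: k distinct data points serve as the initial centroids.
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    C = np.full(n, -1)
    for _ in range(max_iters):
        # Assignment step: remap each point to its closest centroid (minimizes WCSS holding mu fixed).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        new_C = dists.argmin(axis=1)
        if np.array_equal(new_C, C):   # no change in C -> terminate
            break
        C = new_C
        # Update step: each centroid becomes the mean of its points (minimizes WCSS holding C fixed).
        for j in range(k):
            if np.any(C == j):         # a cluster left empty keeps its old centroid
                mu[j] = X[C == j].mean(axis=0)
    return C, mu
```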

Visual Example Circle data points are randomly assigned to clusters (color = cluster). Diamond cluster centroids are initially assigned to the means of the cluster data points. Screenshots from http://www.onmyphd.com/?p=kmeans.clustering. Try it!

Visual Example Circle data points are reassigned to their closest centroid. Screenshots from http://www.onmyphd.com/?p=kmeans.clustering. Try it!

Visual Example Diamond cluster centroids are reassigned to the means of cluster data points. Note that one cluster centroid no longer has assigned points (red). Screenshots from http://www.onmyphd.com/?p=kmeans.clustering. Try it!

Visual Example After this, no circle data point is reassigned to a different cluster. The WCSS has been (locally) minimized and we terminate. However, this is a local minimum, not the global minimum (which would place one centroid in each apparent cluster). Screenshots from http://www.onmyphd.com/?p=kmeans.clustering. Try it!

Worked Example Given: n = 6 data points with d = 2 dimensions, k = 3 clusters, and Forgy initialization of the centroids. For each data point, compute the new cluster / centroid index to be that of the closest centroid point.

Data (x1, x2) and cluster / centroid index c:
(-2,  1)  c = 1
(-2,  3)  c = 1
( 3,  2)  c = 0
( 5,  2)  c = 0
( 1, -2)  c = 0
( 1, -4)  c = 2

Centroids (index: x1, x2), from Forgy initialization:
0: ( 1, -2)
1: (-2,  1)
2: ( 1, -4)

Worked Example For each centroid, compute the new centroid to be the mean of the data points assigned to that cluster / centroid index.

Data (x1, x2) and cluster / centroid index c:
(-2,  1)  c = 1
(-2,  3)  c = 1
( 3,  2)  c = 0
( 5,  2)  c = 0
( 1, -2)  c = 0
( 1, -4)  c = 2

New centroids (index: x1, x2):
0: ( 3,  0.7)
1: (-2,  2)
2: ( 1, -4)

Worked Example For each data point, compute the new cluster / centroid index to be that of the closest centroid point.

Data (x1, x2) and new cluster / centroid index c:
(-2,  1)  c = 1
(-2,  3)  c = 1
( 3,  2)  c = 0
( 5,  2)  c = 0
( 1, -2)  c = 2
( 1, -4)  c = 2

Centroids (index: x1, x2):
0: ( 3,  0.7)
1: (-2,  2)
2: ( 1, -4)

Worked Example For each centroid, compute the new centroid to be the mean of the data points assigned to that cluster / centroid index.

Data (x1, x2) and cluster / centroid index c:
(-2,  1)  c = 1
(-2,  3)  c = 1
( 3,  2)  c = 0
( 5,  2)  c = 0
( 1, -2)  c = 2
( 1, -4)  c = 2

New centroids (index: x1, x2):
0: ( 4,  2)
1: (-2,  2)
2: ( 1, -3)

Worked Example For each data point, compute the new cluster / centroid index to be that of the closest centroid point. With no change to the cluster / centroid indices, the algorithm terminates at a local (and, in this example, global) minimum with WCSS = 1² + 1² + 1² + 1² + 1² + 1² = 6.

Data (x1, x2) and cluster / centroid index c:
(-2,  1)  c = 1
(-2,  3)  c = 1
( 3,  2)  c = 0
( 5,  2)  c = 0
( 1, -2)  c = 2
( 1, -4)  c = 2

Centroids (index: x1, x2):
0: ( 4,  2)
1: (-2,  2)
2: ( 1, -3)
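
The worked example above can be replayed with a short script (my own sketch, not the authors' code); it assumes no cluster goes empty, which holds for this data:

```python
import numpy as np

X = np.array([[-2, 1], [-2, 3], [3, 2], [5, 2], [1, -2], [1, -4]], dtype=float)
mu = np.array([[1, -2], [-2, 1], [1, -4]], dtype=float)  # Forgy initialization from the slides

for _ in range(10):
    C = np.linalg.norm(X[:, None] - mu[None, :], axis=2).argmin(axis=1)  # closest centroid
    new_mu = np.array([X[C == j].mean(axis=0) for j in range(3)])        # cluster means
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

print(C)                         # [1 1 0 0 2 2]
print(mu)                        # centroids (4, 2), (-2, 2), (1, -3)
print(np.sum((X - mu[C]) ** 2))  # WCSS = 6.0
```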

k-means Clustering Assumptions k-means Clustering assumes real-valued data distributed in clusters that are: separate; roughly hyperspherical (circular in 2D, spherical in 3D), or at least easily clustered via a Voronoi partition; of similar size; and of similar density. Even when these assumptions are met, k-means Clustering is not guaranteed to find the global minimum.

k-means Limitations — Separate Clusters. [Figures: original data and results of k-means Clustering.]

k-means Limitations — Hyperspherical Clusters. [Figures: original data and results of k-means Clustering.] Data available at: http://cs.joensuu.fi/sipu/datasets/ Original data source: Jain, A. and M. Law, Data clustering: A user's dilemma. LNCS, 2005.

k-means Limitations — Hyperspherical Clusters. [Figures: original data and results of k-means Clustering.] Data available at: http://cs.joensuu.fi/sipu/datasets/ Original data source: Chang, H. and D.Y. Yeung. Pattern Recognition, 2008. 41(1): p. 191-203.

k-means Limitations — Hyperspherical Clusters. [Figures: original data and results of k-means Clustering.] Data available at: http://cs.joensuu.fi/sipu/datasets/ Original data source: Fu, L. and E. Medico. BMC Bioinformatics, 2007. 8(1): p. 3.

k-means Limitations — Similar Size Clusters. [Figures: original data and results of k-means Clustering.] Image source: Tan, Steinbach, and Kumar, Introduction to Data Mining. http://www-users.cs.umn.edu/~kumar/dmbook/index.php

k-means Limitations — Similar Density Clusters. [Figures: original data and results of k-means Clustering.] Image source: Tan, Steinbach, and Kumar, Introduction to Data Mining. http://www-users.cs.umn.edu/~kumar/dmbook/index.php

k-means Clustering Improvements As with many local optimization techniques applied to global optimization problems, it often helps to apply the approach through multiple separate runs and retain the clustering from the run with the minimum WCSS (see the sketch below). Initialization: Random initial cluster assignments create initial centroids clustered about the global mean. Forgy initialization: Choose unique random input data points as the initial centroids; local (not global) minimum results are still possible. (Try it out.) Distant samples: Choose unique input data points that approximately minimize the sum of inverse squared distances between points (e.g. through stochastic local optimization).
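
A sketch of this restart strategy (my own code; it assumes scikit-learn is available, which the slides do not mention) runs several independent Forgy-initialized k-means fits and keeps the one with the lowest WCSS (called inertia in scikit-learn):

```python
from sklearn.cluster import KMeans

def iterated_kmeans(X, k, restarts=20):
    best = None
    for seed in range(restarts):
        # init="random" picks k data points at random as the initial centroids (Forgy).
        km = KMeans(n_clusters=k, init="random", n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:  # inertia_ is the WCSS
            best = km
    return best
```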

Where does the given k come from? Sometimes the number of clusters k is determined by the application. Examples: Cluster a given set of (neck_size, sleeve_length) pairs into k = 5 clusters to be labeled S/M/L/XL/XXL. Perform color quantization of a 16.7M-color RGB color space down to a palette of k = 256 colors. Sometimes we need to determine an appropriate value of k. How?

Determining the Number of Clusters k When k isn't determined by your application: The Elbow Method: Graph k versus the WCSS of iterated k-means clustering. The WCSS will generally decrease as k increases. However, at the most natural k one can sometimes see a sharp bend or elbow in the graph, where there is a significant decrease up to that k but not much thereafter. Choose that k. The Gap Statistic. Other methods.
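
A sketch of the Elbow Method (my own illustration, assuming scikit-learn and matplotlib; the synthetic three-cluster data is hypothetical) plots k against the best WCSS found for each k:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with three well-separated clusters, for illustration only.
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in ([0, 0], [4, 0], [2, 3])])

ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.show()  # the elbow should appear near k = 3 for this data
```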

The Gap Statistic Motivation: We'd like to choose k so that clustering achieves the greatest WCSS reduction relative to uniform random data. For each candidate k: Compute the log of the best (least) WCSS we can find, log(W_k). Estimate the expected value E*_n{log(W_k)} on uniform random data. One method: Generate 100 uniformly distributed data sets of the same size over the same ranges, perform k-means clustering on each, compute the log of each WCSS, and average these log WCSS values. The gap statistic for this k is then E*_n{log(W_k)} - log(W_k). Select the k that maximizes the gap statistic. R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2):411-423, 2001.
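
A rough sketch of this procedure (my own code, not the authors'; it assumes scikit-learn and uses its KMeans inertia as the WCSS):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=100, seed=0):
    rng = np.random.default_rng(seed)
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_wk = []
    for _ in range(n_refs):
        # Uniform reference data of the same size over the same ranges as X.
        ref = rng.uniform(lo, hi, size=X.shape)
        ref_log_wk.append(np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(ref).inertia_))
    return np.mean(ref_log_wk) - log_wk  # E*_n{log(W_k)} - log(W_k)

# Select the k that maximizes the gap statistic, e.g.:
# best_k = max(range(1, 10), key=lambda k: gap_statistic(X, k))
```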

Variation: k-medoids Sometimes the Euclidean distance measure is not appropriate (e.g. for qualitative data). k-medoids is a k-means variation that allows a general distance measure D(x_i, x_j): Randomly choose k cluster medoids m_1, m_2, ..., m_k from the data set. While remapping C from each x_i to its closest medoid m_j causes a change in C, recompute each m_j to be the point x_i in cluster j that minimizes the total of within-cluster distances $\sum_{i': C(x_{i'}) = j} D(x_i, x_{i'})$. PAM (Partitioning Around Medoids): as above, except when recomputing each m_j, replace it with any non-medoid data set point x_i that minimizes the overall sum of within-cluster medoid distances.
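
A sketch of this k-medoids loop for an arbitrary distance function D (my own code, not from the slides; it assumes no cluster goes empty):

```python
import numpy as np

def k_medoids(X, k, D, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.array([[D(X[i], X[j]) for j in range(n)] for i in range(n)])  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))  # indices of the initial medoids
    C = np.full(n, -1)
    for _ in range(max_iters):
        new_C = dist[:, medoids].argmin(axis=1)  # remap each point to its closest medoid
        if np.array_equal(new_C, C):             # no change in C -> terminate
            break
        C = new_C
        for j in range(k):
            members = np.where(C == j)[0]
            # New medoid: the cluster member minimizing total within-cluster distance.
            medoids[j] = members[dist[np.ix_(members, members)].sum(axis=1).argmin()]
    return C, medoids
```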

Summary Supervised learning is given input-output pairs for the task of function approximation. Unsupervised learning is given input only for the task of finding structure in the data. k-means Clustering is a simple algorithm for clustering data with separate, hyperspherical clusters of similar size and density. Iterated k-means helps to find the best global clustering. Local cost minima are possible.

Summary (cont.) k-means can be initialized with random cluster assignments, a random sample of data points (Forgy), or a distant sample of data points. The number of clusters k is sometimes determined by the application and sometimes via the Elbow Method, Gap Statistic, etc. k-medoids is a variation that allows one to handle data with any suitable distance measure (not just Euclidean).