k-means Clustering Todd W. Neller Gettysburg College


Outline
- Unsupervised versus Supervised Learning
- Clustering Problem
- k-means Clustering Algorithm
- Visual Example
- Worked Example
- Initialization Methods
- Choosing the Number of Clusters k
- Summary

Supervised Learning
Given training input and output (x, y) pairs, learn a function approximation mapping x's to y's. (Note: x can be a vector, i.e. multivalued.)
Regression: approximate a function to a continuous y.
Regression example 1: Given (sepal_width, sepal_length) pairs {(3.5, 5.1), (3, 4.9), (3.1, 4.6), (3.2, 4.7), ...}, learn a function f(sepal_width) that predicts sepal_length well for all sepal_width values.
Regression example 2: Given Pig play data of hold actions, predict the most likely hold value for a given pair of scores.

Supervised Learning (cont.)
Given training input and output (x, y) pairs, learn a function approximation mapping x's to y's.
Classification: approximate a function to a discrete y, i.e. a set of classifications.
Classification example 1: Given (balance, will_default) pairs {(88, false), (1813, true), (944, true), (172, false), ...}, learn a function f(balance) that predicts will_default for all balance values.
Classification example 2: Player modeling - given a single player's play history, create a function from all game scores to the most likely action (roll or hold).

Unsupervised Learning
Given input data only (no training labels/outputs), learn characteristics of the data's structure.
Clustering: Compute a set of clusters on the data, for some number of clusters, such that the members of each cluster have the strongest similarity according to some measure.
Clustering example 1: Given a set of (neck_size, sleeve_length) pairs representative of a target market, determine a set of clusters that will serve as the basis for shirt size design.
Clustering example 2: Given a set of player suboptimal plays, find clusters of suboptimal plays. (This can be used to characterize areas for training.)

Supervised vs. Unsupervised Learning
Supervised learning: Given input and output, learn an approximate mapping from input to output. (The output is the supervision.)
Unsupervised learning: Given input only, output the structure of the input data.

Clustering Problem
Clustering is grouping a set of objects such that objects in the same group (i.e. cluster) are more similar to each other, in some sense, than to objects of different groups.
Our specific clustering problem:
- Given: a set of $n$ observations $x_1, x_2, \ldots, x_n$, where each observation is a $d$-dimensional real vector
- Given: a number of clusters $k$
- Compute: a cluster assignment mapping $C(x_i) \in \{1, \ldots, k\}$ that minimizes the within-cluster sum of squares (WCSS): $\sum_{i=1}^{n} \lVert x_i - \mu_{C(x_i)} \rVert^2$, where centroid $\mu_{C(x_i)}$ is the mean of the points in cluster $C(x_i)$
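To make the objective concrete, here is a minimal sketch of the WCSS computation in Python with NumPy (the function and variable names are our own, not from the slides):

```python
import numpy as np

def wcss(X, C, centroids):
    """Within-cluster sum of squares: X is an (n, d) data array,
    C a length-n array of cluster indices in 0..k-1, and
    centroids a (k, d) array of cluster means."""
    diffs = X - centroids[C]           # each observation minus its own cluster's centroid
    return float((diffs ** 2).sum())   # sum of squared Euclidean distances
```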

k-means Clustering Algorithm
Hands-on examples: http://www.onmyphd.com/?p=kmeans.clustering
General algorithm:
- Randomly choose $k$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_k$ and arbitrarily initialize the cluster assignment mapping C.
- While remapping C from each $x_i$ to its closest centroid $\mu_j$ causes a change in C: recompute $\mu_1, \mu_2, \ldots, \mu_k$ according to the new C.
In order to minimize the WCSS, we alternately:
- Recompute C to minimize the WCSS, holding the $\mu_j$ fixed.
- Recompute the $\mu_j$ to minimize the WCSS, holding C fixed.
In minimizing the WCSS, we seek a clustering that minimizes Euclidean distance variance within clusters.
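A minimal runnable sketch of this alternating loop (often called Lloyd's algorithm) in Python with NumPy; the random-assignment initialization matches the slide, while the names and the empty-cluster handling are our own assumptions:

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    """Alternate (1) recomputing centroids as cluster means and
    (2) remapping points to their closest centroids, until C stops changing."""
    n = len(X)
    C = rng.integers(k, size=n)                  # arbitrary initial cluster assignment
    centroids = np.zeros((k, X.shape[1]))
    while True:
        # Recompute mu_1..mu_k according to C (keep the old centroid if a cluster empties).
        for j in range(k):
            if np.any(C == j):
                centroids[j] = X[C == j].mean(axis=0)
        # Remap each x_i to its closest centroid mu_j.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_C = dists.argmin(axis=1)
        if np.array_equal(new_C, C):             # no change: WCSS locally minimized
            return C, centroids
        C = new_C
```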

Visual Example
Circle data points are randomly assigned to clusters (color = cluster). Diamond cluster centroids are initially assigned to the means of the cluster data points.
Screenshots from http://www.onmyphd.com/?p=kmeans.clustering. Try it!

Visual Example (cont.)
Circle data points are reassigned to their closest centroid.

Visual Example (cont.)
Diamond cluster centroids are reassigned to the means of their clusters' data points. Note that one cluster centroid (red) no longer has any assigned points.

Visual Example (cont.)
After this, there is no further cluster reassignment of the circle data points, so the WCSS has been minimized and we terminate. However, this is a local minimum, not the global minimum (which here would place one centroid per apparent cluster).

Worked Example
Given: n=6 data points with dimension d=2, k=3 clusters, and Forgy initialization of the centroids (three of the data points).
For each data point, compute the new cluster / centroid index to be that of the closest centroid point.

  Data          Cluster / Centroid Index
  x1   x2       c
  -2    1       1
  -2    3       1
   3    2       0
   5    2       0
   1   -2       0
   1   -4       2

  Centroids
  index   x1   x2
  0        1   -2
  1       -2    1
  2        1   -4

Worked Example (cont.)
For each centroid, compute the new centroid to be the mean of the data points assigned to that cluster / centroid index. (Data and cluster indices are unchanged from the previous step.)

  Centroids
  index   x1    x2
  0        3   0.7
  1       -2    2
  2        1   -4

Worked Example (cont.)
For each data point, compute the new cluster / centroid index to be that of the closest centroid point. (Point (1, -2) is now closer to centroid 2 than to centroid 0.)

  Data          Cluster / Centroid Index
  x1   x2       c
  -2    1       1
  -2    3       1
   3    2       0
   5    2       0
   1   -2       2
   1   -4       2

Worked Example (cont.)
For each centroid, compute the new centroid to be the mean of the data points assigned to that cluster / centroid index.

  Centroids
  index   x1   x2
  0        4    2
  1       -2    2
  2        1   -3

Worked Example (cont.)
For each data point, compute the new cluster / centroid index to be that of the closest centroid point. With no change to the cluster / centroid indices, the algorithm terminates at a local (and in this example, global) minimum with WCSS $= 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 = 6$.

  Data          Cluster / Centroid Index
  x1   x2       c
  -2    1       1
  -2    3       1
   3    2       0
   5    2       0
   1   -2       2
   1   -4       2

  Centroids
  index   x1   x2
  0        4    2
  1       -2    2
  2        1   -3
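Assuming the tables above were reconstructed correctly, the iteration can be checked mechanically; this sketch reruns the example from the same Forgy centroids:

```python
import numpy as np

X = np.array([[-2., 1.], [-2., 3.], [3., 2.], [5., 2.], [1., -2.], [1., -4.]])
centroids = np.array([[1., -2.], [-2., 1.], [1., -4.]])   # Forgy: three of the data points

while True:
    # Assign each point to its closest centroid.
    C = ((X[:, None] - centroids[None]) ** 2).sum(axis=2).argmin(axis=1)
    # Recompute each centroid as the mean of its assigned points.
    new_centroids = np.array([X[C == j].mean(axis=0) for j in range(3)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(C)                                 # [1 1 0 0 2 2]
print(centroids)                         # [[ 4.  2.] [-2.  2.] [ 1. -3.]]
print(((X - centroids[C]) ** 2).sum())   # 6.0 = WCSS
```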

k-means Clustering Assumptions
k-means clustering assumes real-valued data distributed in clusters that are:
- Separate
- Roughly hyperspherical (circular in 2D, spherical in 3D), or easily clustered via a Voronoi partition
- Of similar size
- Of similar density
Even with these assumptions being met, k-means clustering is not guaranteed to find the global minimum.

k-means Clustering Improvements
As with many local optimization techniques applied to global optimization problems, it often helps to apply the approach through multiple separate iterations, and retain the clusters from the iteration with the minimum WCSS.
Initialization:
- Random initial cluster assignments create initial centroids clustered about the global mean.
- Forgy initialization: Choose unique random input data points as the initial centroids. Local (not global) minimum results are still possible. (Try it out.)
- Distant samples: Choose unique input data points that approximately minimize the sum of inverse square distances between points (e.g. through stochastic local optimization). Sketches of the latter two initializers follow.
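For illustration, here are hedged sketches of the two point-sampling initializers; the "distant samples" version below uses a simple greedy farthest-point heuristic as a stand-in for the stochastic local optimization the slide mentions:

```python
import numpy as np

def forgy_init(X, k, rng=np.random.default_rng(0)):
    """Forgy: choose k unique random data points as the initial centroids."""
    return X[rng.choice(len(X), size=k, replace=False)].copy()

def distant_init(X, k, rng=np.random.default_rng(0)):
    """Greedy stand-in for 'distant samples': start from one random point,
    then repeatedly add the point farthest from those chosen so far."""
    chosen = [int(rng.integers(len(X)))]
    while len(chosen) < k:
        # Distance from every point to its nearest already-chosen point.
        d = ((X[:, None] - X[chosen][None]) ** 2).sum(axis=2).min(axis=1)
        chosen.append(int(d.argmax()))
    return X[chosen].copy()
```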

Where does the given k come from?
Sometimes the number of clusters k is determined by the application. Examples:
- Cluster a given set of (neck_size, sleeve_length) pairs into k=5 clusters to be labeled S/M/L/XL/XXL.
- Perform color quantization of a 16.7M-color RGB color space down to a palette of k=256 colors.
Sometimes we need to determine an appropriate value of k. How?

Determining the Number of Clusters k
When k isn't determined by your application, one option is the Elbow Method: graph k versus the WCSS of iterated k-means clustering. The WCSS will generally decrease as k increases; however, at the most natural k one can sometimes see a sharp bend or "elbow" in the graph, where there is significant decrease up to that k but not much thereafter. Choose that k. Other methods exist as well.
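A sketch of the Elbow Method using the k_means function sketched earlier; the restart count, plotting details, and placeholder data are our own choices, and X stands for your (n, d) data array:

```python
import numpy as np
import matplotlib.pyplot as plt

def best_wcss(X, k, restarts=10):
    """Lowest WCSS over several random restarts of k-means."""
    best = float("inf")
    for seed in range(restarts):
        C, centroids = k_means(X, k, rng=np.random.default_rng(seed))
        best = min(best, float(((X - centroids[C]) ** 2).sum()))
    return best

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data; use your own
ks = range(1, 11)
plt.plot(ks, [best_wcss(X, k) for k in ks], marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.show()   # choose the k at the sharp bend ('elbow') in the curve
```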

Summary
Supervised learning is given input-output pairs for the task of function approximation.
- Continuous output? Regression.
- Discrete output? Classification.
Unsupervised learning is given input only, for the task of finding structure in the data.
k-means clustering is a simple algorithm for clustering data with separate, hyperspherical clusters of similar size and density.

Summary (cont.)
Iterated k-means helps to find the best global clustering; local cost minima are still possible.
k-means can be initialized with random cluster assignments, a random sample of data points (Forgy), or a distant sample of data points.
The number of clusters k is sometimes determined by the application, and sometimes via the Elbow Method, etc.