Data clustering & the k-means algorithm


April 27, 2016

Why clustering? Unsupervised learning.
Underlying structure: gain insight into data, generate hypotheses, detect anomalies, identify features.
Natural classification: e.g. biological organisms (phylogenetic relationships).
Data compression.

k-means has been popular for more than 50 years: it is simple, efficient, and empirically successful.

k-means Input: n points in R^d, X = {x_1, ..., x_n}, and k, the number of clusters we want to partition X into. Output: a partition that minimizes the squared error between the mean of each cluster and the points in that cluster.

k-means More precisely, let X = {x_1, ..., x_n}, where x_j ∈ R^d for all 1 ≤ j ≤ n. Let C = {c_1, ..., c_k} be k clusters, each containing some of the x_j's, and let µ_i be the mean of cluster c_i:

µ_i = (1/#c_i) Σ_{x_j ∈ c_i} x_j

k-means We define the squared error between µ_i and the points in c_i by

E(c_i) = Σ_{x_j ∈ c_i} ||x_j − µ_i||²

The sum of the squared errors over all k clusters is

E(C) = Σ_{i=1}^{k} Σ_{x_j ∈ c_i} ||x_j − µ_i||²

This is the objective function that the algorithm is designed to minimize.
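As a concrete check of the definition, E(C) can be computed directly from a partition. A minimal Python sketch (the function name and the representation of clusters as lists of point tuples are choices made here, not part of the slides):

```python
def sum_squared_error(clusters):
    """E(C): for each cluster, sum the squared distances of its points to its mean."""
    total = 0.0
    for points in clusters:
        # mu_i = (1/#c_i) * sum of the points in cluster c_i
        mu = tuple(sum(coord) / len(points) for coord in zip(*points))
        # add sum over x_j in c_i of ||x_j - mu_i||^2
        total += sum(sum((a - b) ** 2 for a, b in zip(x, mu)) for x in points)
    return total
```

For example, the partition [[(0,0), (2,0)], [(10,10)]] has means (1,0) and (10,10), so the error is 1 + 1 + 0 = 2.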

k-means in action Example of the algorithm for points in R^2, with k = 3.

k-means in action (images taken from [Jain, 2010])

basic procedure
1. Select an initial set of k means (for example, choose k points from the dataset).
2. Assign each point to its closest mean to generate a new partition of the data.
3. Calculate the new set of k means with respect to this partition.
4. Repeat Steps 2 and 3 until cluster membership stabilizes.
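The slides use Matlab later on; the four steps above can also be sketched in a few lines of plain Python (an illustrative implementation, not tuned for performance):

```python
import random

def kmeans(X, k, max_iter=100):
    """Basic k-means on a list of d-dimensional points given as tuples."""
    # Step 1: select an initial set of k means (here: k points sampled from the dataset).
    means = random.sample(X, k)
    assign = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest mean (index of the nearest mean).
        new_assign = [
            min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(x, means[i])))
            for x in X
        ]
        # Step 4: stop once cluster membership stabilizes.
        if new_assign == assign:
            break
        assign = new_assign
        # Step 3: recompute each mean from the points now assigned to it.
        for i in range(k):
            members = [x for x, a in zip(X, assign) if a == i]
            if members:  # keep the old mean if a cluster ever becomes empty
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assign
```

On two well-separated groups of points, the returned labels typically recover the groups, whichever initial points are sampled.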

k-means Will this procedure terminate? Yes:
The sum of squared errors decreases monotonically from one iteration to the next.
There are only finitely many clusterings of the finite point set X, so no clustering can repeat.
The number of steps is bounded by n^{O(dk)} (Inaba et al., 1994).
Note that k-means is a greedy algorithm: it may terminate in a local minimum.

back to example What's wrong with this example?

determining k Sometimes we don't know what k should be a priori.
Increasing k will always decrease the squared error! In fact, for k = n (the number of data points) the sum of squared errors is 0. So squared error alone does not tell us which k to use.
Idea: determine k based on the information we want from the clustering, or run k-means for various k and then use some heuristic to compare the results.

determining k One heuristic: the elbow method. Plot the sum of squared errors against k and pick the k at the "elbow" of the curve, where increasing k stops yielding large reductions in error. [Image from http://commons.wikimedia.org/wiki/File:DataClustering_ElbowCriterion.JPG]
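The elbow heuristic can be sketched as follows. This Python script (hypothetical data and a deterministic first-k-points initialization, chosen here for reproducibility; it repeats a compact copy of the k-means loop) computes the sum of squared errors for k = 1, ..., 4 on nine points that form three tight groups:

```python
def kmeans_sse(X, k, max_iter=100):
    """Run k-means with the first k points as initial means; return the final SSE."""
    means = [X[i] for i in range(k)]
    assign = None
    for _ in range(max_iter):
        # assign each point to its nearest mean
        new_assign = [
            min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(x, means[i])))
            for x in X
        ]
        if new_assign == assign:  # membership stabilized
            break
        assign = new_assign
        # recompute the means
        for i in range(k):
            members = [x for x, a in zip(X, assign) if a == i]
            if members:
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # sum of squared distances of every point to its assigned mean
    return sum(sum((a - b) ** 2 for a, b in zip(x, means[i])) for x, i in zip(X, assign))

# Nine points forming three tight groups, interleaved so the first k points spread out.
X = [(0, 0), (10, 0), (0, 10), (1, 0), (11, 0), (1, 10), (0, 1), (10, 1), (0, 11)]
sses = [kmeans_sse(X, k) for k in range(1, 5)]
```

The error drops sharply as k goes from 1 to 3 and then levels off, so the elbow (and the natural choice of k) is at k = 3.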

in Matlab [idx,C] = kmeans(X,k); X is an array whose rows are the points in your dataset: X = [x_1; x_2; ...; x_n], i.e., row j of X is the point x_j.

in Matlab [idx,C] = kmeans(X,k); k is the number of clusters.

in Matlab [idx,C] = kmeans(X,k); idx is a vector of length n identifying which cluster each point belongs to: idx(j) is the cluster that x_j is in.

in Matlab [idx,C] = kmeans(X,k); C is an array containing the k centroids (means): row i of C is the centroid of cluster i.

References Jain, Anil K. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8), June 2010, pp. 651–666. Elsevier Science Inc., New York, NY, USA.