k-means, k-means++ Barna Saha March 8, 2016

K-Means: The Most Popular Clustering Algorithm

The k-means clustering problem is one of the oldest and most important problems in data analysis. Given an integer k and a set of data points in R^d, the goal is to choose k centers so as to minimize φ, the sum of squared distances between each point and its closest center. Solving this problem exactly is NP-hard. 25 years ago, Lloyd proposed a simple local search algorithm that is still very widely used today. A 2002 survey of data mining techniques states that it is by far the most popular clustering algorithm used in scientific and industrial applications.
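Written out, for a data set X ⊂ R^d and a set C of k centers, the objective is

    φ(C) = Σ_{x ∈ X} min_{c ∈ C} ‖x − c‖²,

and the k-means problem asks for a set C of size k minimizing φ(C).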

Lloyd's Local Search Algorithm

1. Begin with k arbitrary centers, typically chosen uniformly at random from the data points.
2. Assign each point to its nearest cluster center.
3. Recompute each cluster center as the center of mass of the points assigned to it.
4. Repeat Steps 2-3 until the process converges.

φ is monotonically decreasing, which ensures no configuration is repeated.

Convergence could still be slow: there are k^n possible clusterings. However, Lloyd's k-means algorithm has polynomial smoothed running time.
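For concreteness, here is a minimal sketch of Lloyd's algorithm in Python, assuming NumPy; the function name lloyd_kmeans and its parameters are illustrative, not from the lecture.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iters=100, seed=0):
    """Lloyd's local search: alternate assignment and center updates.
    X: (n, d) array of data points. Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: k arbitrary centers, chosen uniformly at random from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest cluster center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        # (keep the old center if a cluster goes empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the configuration no longer changes; since phi is
        # monotonically decreasing, the process must terminate.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```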

Illustration of k-means clustering. [Figure: convergence to a local optimum. Image by Agor153, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=19403547]

Drawbacks of the k-means algorithm

- Euclidean distance is used as the metric and variance is used as the measure of cluster scatter.
- The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks to determine the number of clusters in the data set.
- Convergence to a local minimum may produce results far from the optimal clustering.

k-means++

The first fix: select centers based on distance. But selecting centers this way is sensitive to outliers.
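The slides do not spell out the distance-based rule, but the standard version is farthest-point seeding: each new center is the data point farthest from the centers chosen so far. A minimal sketch, assuming NumPy (farthest_first_seed is an illustrative name):

```python
import numpy as np

def farthest_first_seed(X, k, seed=0):
    """Farthest-point seeding: each new center is the point farthest
    from all centers chosen so far."""
    rng = np.random.default_rng(seed)
    # First center: an arbitrary data point.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2)
                    .sum(axis=2), axis=1)
        # Pick the point with the largest D(x).
        centers.append(X[d2.argmax()])
    return np.array(centers)
```

A single outlier maximizes D(x) and is therefore picked immediately, which is exactly the sensitivity noted above.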

k-means++ interpolates between the two methods: let D(x) be the distance between a point x and the nearest cluster center already chosen, and sample x as the next cluster center with probability proportional to D(x)². With this seeding, k-means++ returns a clustering C that is O(log k)-competitive. In class we will show that if one cluster center is chosen from each optimal cluster, then k-means++ is 8-competitive.
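A minimal sketch of the k-means++ seeding step, again assuming NumPy (kmeanspp_seed is an illustrative name); the resulting centers would then be handed to Lloyd's algorithm above.

```python
import numpy as np

def kmeanspp_seed(X, k, seed=0):
    """k-means++ seeding: sample each new center with probability
    proportional to D(x)^2, the squared distance from x to the
    nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    # First center: uniformly at random from the data points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2 for every point x.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2)
                    .sum(axis=2), axis=1)
        # Sample the next center proportionally to D(x)^2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Note the sense of the interpolation: sampling proportional to D(x)^0 is uniform random seeding, while D(x)^∞ recovers farthest-point seeding; D(x)² sits between the two, favoring far-away points without committing to outliers.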