K-Means++: The Advantages of Careful Seeding

K-Means++: The Advantages of Careful Seeding PAPER BY DAVID ARTHUR & SERGEI VASSILVITSKII PRESENTATION BY STEVEN ROHR

What is it and why do we need it? By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Arthur & Vassilvitskii claim (often dramatically) improved accuracy and speed. The method also focuses on improving the cases where Lloyd's method typically does poorly.

How are we defining better? As this is a derivative of Lloyd's method, we are still using the k-means cost as the measure of "good": k-means cost = Σ_x dist(x, nearest cluster center)², summed over all data points x. Finding the optimal clustering with respect to k-means cost is NP-hard, so we settle for "good." Typically several dozen runs of Lloyd's method (100 is common) are done and the result with the lowest k-means cost is presented as output. K-means++ promises equal or lower k-means cost, often in far less time, and it is O(log k)-competitive* (proven, but we won't get into the proofs here). *Just for selecting the initial centers.
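As a concrete illustration of the cost being minimized, here is a minimal NumPy sketch (the function name kmeans_cost and the array layout are my own, not from the slides) that computes the k-means cost for a given set of centers:

```python
import numpy as np

def kmeans_cost(points, centers):
    # points: (n, d) array of data points, centers: (k, d) array of centers.
    # Distance from every point to every center, shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Sum of squared distances from each point to its nearest center.
    return float(np.sum(np.min(dists, axis=1) ** 2))
```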

How does it work? Initial centers are chosen with probability proportional to their contribution to the overall potential, designated φ: each following center c_i is chosen to be a data point x ∈ X with probability D(x)² / Σ_{x∈X} D(x)², where D(x) is the shortest distance from point x to the closest already-chosen center, until we have the desired number of centers (this is called D² seeding). Lloyd's method then continues as normal.
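A minimal sketch of D² seeding in NumPy (function and variable names are mine; the first center is taken uniformly at random, and the subsequent Lloyd's iterations are omitted):

```python
import numpy as np

def kmeanspp_seed(points, k, rng=None):
    # D^2 seeding: the first center is uniform at random; each later center
    # is a data point x picked with probability D(x)^2 / sum_x D(x)^2,
    # where D(x) is the distance from x to the nearest center chosen so far.
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    centers = [points[rng.integers(n)]]
    for _ in range(1, k):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers],
                    axis=0)                      # D(x)^2 for every point
        centers.append(points[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```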

What? Basically, each new center is chosen probabilistically based on how far it is from the existing centers. Points further from existing centers have a higher chance of being chosen (which point is chosen is still up to the RNG; we're just weighting the odds in our favor). Think of it as a compromise between truly random centers and Furthest Centroids (where only the first point is random at all).

Experimental Results: [tables comparing k-means cost and running time omitted from the transcription]

Pictures? Say we have a set of points which we want to cluster:

Pictures?: Lloyd's method. Points are randomly selected as centers, each with equal probability 1/|points not chosen|. This is like a fair game of roulette: imagine there are numbers on the wheel corresponding to the data points. No betting on only black or red.

Pictures?: Lloyd's method. Points are randomly selected as centers, each with equal probability 1/|points not chosen|. This is like a fair game of roulette. Sadly, the colors don't always work out in this analogy.

Pictures?: Lloyd's method. Points are randomly selected as centers, each with equal probability 1/|points not chosen|. This is like a fair game of roulette.
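For contrast with the D² sketch above, the plain seeding the roulette analogy describes is just a uniform draw over the data points; a tiny sketch (names are mine, not from the slides):

```python
import numpy as np

def uniform_seed(points, k, rng=None):
    # Every point is equally likely to be picked: a "fair roulette wheel"
    # spun k times, never landing on the same point twice.
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(points), size=k, replace=False)
    return points[idx]
```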

Pictures?: Lloyd's method. Unfortunately, "fair" doesn't mean a good outcome, as any gambler (or D&D player) will tell you.

It explains the Phantom Menace.

Pictures?: K-Means++. Points are still selected at random, but the selection is now weighted by each point's distance from the already-chosen centroids.

Pictures?: K-Means++. Points are still selected at random, but now weighted by distance from the existing centroids. The result is not like a fair game of roulette; it's a not-so-fair roulette wheel.

Pictures?: K-Means++. Points are still selected at random, but now weighted by distance from the existing centroids. The result is not like a fair game of roulette; it's a not-so-fair roulette wheel. Now with proper color alternation.

Pictures?: K-Means++ This gives us a reasonably high chance of not choosing horrible initial centers, though the chance still exists (and thus multiple runs are still required).
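A rough sketch of that "multiple runs" practice, reusing the hypothetical kmeanspp_seed and kmeans_cost sketches above (the Lloyd's refinement step is elided; with scikit-learn one would instead set n_init on KMeans):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))         # placeholder data

best_cost, best_centers = float("inf"), None
for _ in range(10):                        # several restarts, keep the best
    centers = kmeanspp_seed(points, k=5, rng=rng)
    # ... Lloyd's iterations would refine `centers` here ...
    cost = kmeans_cost(points, centers)
    if cost < best_cost:
        best_cost, best_centers = cost, centers
```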

K-Means++ vs. other algorithms: Being a k-means-style algorithm, it has near-linear runtime (much better than linkage algorithms, which are near O(n³)). Vs. Furthest Centroids: Furthest Centroids always chooses the next centroid to be the point furthest from all current centroids, while K-Means++ chooses centers probabilistically based on distance. End result: K-Means++ is faster; however, both algorithms generally produce very similar results.
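For comparison, a sketch of the Furthest Centroids idea described above (again, names are mine): it is the same loop as the D² sketch except that the next center is the single farthest point rather than a weighted random draw.

```python
import numpy as np

def furthest_first_seed(points, k, rng=None):
    # Furthest Centroids: only the first center is random; each later center
    # is the point farthest from all centers chosen so far (deterministic).
    rng = np.random.default_rng() if rng is None else rng
    centers = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers],
                    axis=0)
        centers.append(points[np.argmax(d2)])
    return np.array(centers)
```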

Conclusions: Randomly seeded Lloyd's method is fast, but the traditional approach chooses initial points blindly and effectively leaves everything up to random chance. By adding a concept of weight to the random selection, we can dramatically improve performance simply by being smarter about how we choose the initial centers.