Clustering: Centroid-Based Partitioning

Yufei Tao
Department of Computer Science and Engineering
Chinese University of Hong Kong

In this lecture, we will discuss another fundamental topic in data mining: clustering. At a high level, the objective of clustering can be stated as follows. Let P be a set of objects. We want to divide P into several groups, each of which is called a cluster, satisfying the following conditions:

(Homogeneity) Objects in the same cluster should be similar to each other.
(Heterogeneity) Objects in different clusters should be dissimilar.

Typically, the similarity between two objects o_1, o_2 is measured by a distance function dist(o_1, o_2): the larger dist(o_1, o_2), the less similar they are. We will consider only distance functions satisfying the triangle inequality, namely, for any objects o_1, o_2, o_3, it holds that:

$\mathrm{dist}(o_1, o_2) + \mathrm{dist}(o_2, o_3) \ge \mathrm{dist}(o_1, o_3).$

Today we will focus on centroid-based partitioning, which works as follows. Let k be the number of clusters desired. It first identifies k objects c_1, ..., c_k (which are not necessarily in P) called centroids. Then, it forms clusters P_1, P_2, ..., P_k, where P_i includes all the objects in P that have c_i as their nearest centroid. Formally:

$P_i = \{ o \in P \mid \mathrm{dist}(o, c_i) \le \mathrm{dist}(o, c_j) \;\; \forall j \in [1, k] \}.$

If an object o happens to be equidistant from two centroids c_i, c_j, it can be assigned to either P_i or P_j arbitrarily.
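To make the assignment rule concrete, here is a minimal Python sketch of the nearest-centroid partitioning step. It assumes objects are coordinate tuples and takes dist to be the Euclidean distance (one choice of metric satisfying the triangle inequality); the name assign_clusters is ours for illustration.

```python
import math

def assign_clusters(P, centroids):
    """Partition P by nearest centroid: clusters[i] collects the objects
    whose nearest centroid is centroids[i]; min() breaks distance ties by
    the lowest index, one of the arbitrary choices the slide allows."""
    clusters = [[] for _ in centroids]
    for o in P:
        i = min(range(len(centroids)), key=lambda j: math.dist(o, centroids[j]))
        clusters[i].append(o)
    return clusters
```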

We will discuss two classic algorithms for centroid-based partitioning:

1. k-center
2. k-means

k-center

Problem. Let P be a set of n objects in R^d, and let k be an integer at most n. Let C be a set of objects in R^d; we refer to C as a centroid set. Define, for each object p ∈ P, its centroid distance as

$d_C(p) = \min_{c \in C} \mathrm{dist}(p, c).$

The radius of C is defined to be

$r(C) = \max_{o \in P} d_C(o).$

The goal of the k-center problem is to find a centroid set C of size k with the minimum radius.

This problem is NP-hard, namely, no algorithm can solve it in time polynomial in both n and k (unless P = NP). Hence, we will aim to find approximate answers with precision guarantees. Let C* be an optimal centroid set for the k-center problem. A set C of k objects is ρ-approximate if r(C) ≤ ρ · r(C*). We will give an algorithm guaranteed to return a 2-approximate solution.

A 2-Approximate Algorithm

algorithm k-center(P)
/* this algorithm returns a 2-approximate centroid set C */
1. C ← ∅
2. add to C an arbitrary object in P
3. for i = 2 to k
4.     o ← an object in P with the maximum d_C(o)
5.     add o to C
6. return C

The algorithm can be easily implemented in O(nk) time.
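As a sanity check of the O(nk) claim, here is a minimal Python sketch of this greedy (farthest-point) algorithm, assuming Euclidean points stored as tuples. It maintains every object's current centroid distance so that each round costs O(n).

```python
import math

def k_center(P, k):
    """Greedy 2-approximate k-center: repeatedly add the object
    farthest from the centroids chosen so far."""
    C = [P[0]]                               # an arbitrary first centroid
    d = [math.dist(p, P[0]) for p in P]      # d[j] = current centroid distance of P[j]
    for _ in range(k - 1):
        j = max(range(len(P)), key=lambda i: d[i])   # object with maximum d_C
        C.append(P[j])
        # refresh each centroid distance against the newly added centroid
        d = [min(d[i], math.dist(P[i], P[j])) for i in range(len(P))]
    return C
```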

Example (k = 3). [Figure: a set of points with the first centroid c_1 marked.] Initially, C = {c_1}.

Example (k = 3), continued. [Figure: the same points with c_1 and c_2 marked.] After a round, C = {c_1, c_2}.

Example (k = 3), continued. [Figure: the same points with c_1, c_2, c_3 marked.] After another round, C = {c_1, c_2, c_3}.

Example Example: k = 3 c 1 c 2 c 3 r(c) is the radius of the largest circle. 13 / 29 Y Tao Clustering: Centroid-Based Partitioning

Theorem. The k-center algorithm is 2-approximate.

Proof. Let C* = {c*_1, c*_2, ..., c*_k} be an optimal centroid set, i.e., one with the smallest radius r(C*). Let P*_1, P*_2, ..., P*_k be the optimal clusters, namely, P*_i (1 ≤ i ≤ k) contains all the objects that find c*_i as the closest centroid among all the centroids in C*. Let C = {c_1, c_2, ..., c_k} be the output of our algorithm. We want to prove r(C) ≤ 2r(C*).

Proof (cont.). Case 1: C has an object in each of P*_1, P*_2, ..., P*_k. Take any object o ∈ P. We will prove that d_C(o) ≤ 2r(C*), which implies that r(C) ≤ 2r(C*). Suppose that o ∈ P*_i (for some i ∈ [1, k]), and let c be an object in C ∩ P*_i. Since c and o both belong to P*_i, each is within distance r(C*) of c*_i, so by the triangle inequality:

$d_C(o) \le \mathrm{dist}(c, o) \le \mathrm{dist}(c, c^*_i) + \mathrm{dist}(c^*_i, o) \le 2r(C^*).$

Proof (cont.). Case 2: at least one of P*_1, ..., P*_k covers no object in C. By the pigeonhole principle, one of P*_1, ..., P*_k must then cover at least two objects c_1, c_2 ∈ C; both are within distance r(C*) of the same optimal centroid, so dist(c_1, c_2) ≤ 2r(C*). Next we will prove r(C) ≤ dist(c_1, c_2), which will complete the whole proof. Without loss of generality, assume that c_2 was picked after c_1 by our algorithm. Hence, at that moment c_2 had the largest centroid distance (by how our algorithm runs), and this distance was at most dist(c_1, c_2) because c_1 was already in C. Therefore, every object o ∈ P had a centroid distance at most dist(c_1, c_2) at this moment, and centroid distances can only decrease in the rest of the algorithm. It thus follows that r(C) ≤ dist(c_1, c_2).

k-means

The k-means problem is defined only on point objects.

Problem. Let P be a set of n points, and let k be an integer at most n. Let C be a set of points in R^d; we refer to C as a centroid set. Define, for each point p ∈ P, its centroid distance as

$d_C(p) = \min_{c \in C} \mathrm{dist}(p, c)$

where dist(p, c) is the straight-line distance between p and c. The cost of C is defined to be

$\phi(C) = \sum_{o \in P} d_C(o)^2.$

The goal of the k-means problem is to find a centroid set C of size k with the minimum cost. The problem is once again NP-hard.
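For concreteness, the cost function translates directly into Python (a small sketch assuming Euclidean points as tuples; the name cost is ours):

```python
import math

def cost(P, C):
    """phi(C): the sum, over all points o in P, of the squared
    distance from o to its nearest centroid in C."""
    return sum(min(math.dist(o, c) for c in C) ** 2 for o in P)
```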

algorithm k-means(P)
1. C ← an arbitrary subset of P with size k
2. repeat
3.     C_old ← C /* assume C_old = {c_1, ..., c_k} */
4.     partition P into P_1, ..., P_k such that P_i (1 ≤ i ≤ k) is the set of objects that find c_i as the nearest centroid (among the centroids in C_old); if an object o is equidistant from two centroids c_i and c_j, it is assigned to P_i or P_j arbitrarily
5.     for i = 1 to k
6.         c'_i ← the geometric center of P_i
7.     C ← {c'_1, ..., c'_k}
8. until C_old = C
9. return C

Remark: the geometric center of a point set P is the point whose i-th coordinate is the average of the i-th coordinates of all the points in P.
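Below is a minimal Python sketch of this algorithm (often called Lloyd's algorithm), assuming the points are tuples of floats. Keeping an empty cluster's old centroid is our own choice for an edge case the pseudocode leaves open.

```python
import math
import random

def k_means(P, k):
    """Alternate nearest-centroid assignment with recomputing each
    cluster's geometric center, until the centroid set stops changing."""
    dim = len(P[0])
    C = random.sample(P, k)                   # Line 1: arbitrary size-k subset of P
    while True:
        clusters = [[] for _ in range(k)]     # Line 4: partition by nearest centroid
        for p in P:
            i = min(range(k), key=lambda j: math.dist(p, C[j]))
            clusters[i].append(p)
        # Lines 5-7: move each centroid to its cluster's geometric center
        new_C = [tuple(sum(p[t] for p in Pi) / len(Pi) for t in range(dim))
                 if Pi else C[i]              # keep the old centroid if P_i is empty
                 for i, Pi in enumerate(clusters)]
        if new_C == C:                        # Line 8: C_old = C, so terminate
            return C
        C = new_C
```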

Example. Suppose k = 2. Points c_1 and c_2 are the initial two centroids, chosen arbitrarily.

Round 1. [Figure: the point set with old centroids c_1, c_2 and new centroids c'_1, c'_2.] P_1 includes all the black points (they are closer to c_1 than to c_2), and P_2 the red points. c'_1 and c'_2 are the new centroids.

Example (cont.). Points c_1 and c_2 are the two centroids from the last round.

Round 2. [Figure: the updated clusters with new centroids c'_1, c'_2.] P_1 includes all the black points, and P_2 the red points. c'_1 and c'_2 are the new centroids.

Example (cont.). Points c_1 and c_2 are the two centroids from the last round.

Round 3. [Figure: the clusters are unchanged; c'_1 coincides with c_1 and c'_2 with c_2.] P_1 includes all the black points, and P_2 the red points. The new centroids c'_1 and c'_2 are identical to c_1 and c_2, respectively. The algorithm therefore terminates.

An important question to answer is whether the k-means algorithm can run forever. Next we will prove that it cannot, namely, it always terminates. We will need the lemma below:

Lemma. Let P be a set of points in R^d. Given a point q, define the summed squared distance SSD_q(P) to be

$SSD_q(P) = \sum_{p \in P} \mathrm{dist}(q, p)^2$

where dist(q, p) is the straight-line distance between q and p. Let c be the geometric center of P. For any point q ∈ R^d with q ≠ c, it holds that SSD_c(P) < SSD_q(P).

This lemma can be easily proved by taking the derivative of SSD_q(P) with respect to each coordinate of q.
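For completeness, here is the one-line calculus argument the lemma refers to, written out (our sketch of the standard derivation):

```latex
% SSD_q(P) = \sum_{p \in P} \|q - p\|^2 is a strictly convex function of q.
% Setting the partial derivative in the j-th coordinate to zero:
\frac{\partial}{\partial q_j} \sum_{p \in P} \|q - p\|^2
  \;=\; \sum_{p \in P} 2\,(q_j - p_j) \;=\; 0
  \quad\Longrightarrow\quad
  q_j \;=\; \frac{1}{|P|} \sum_{p \in P} p_j .
% The unique minimizer is therefore the geometric center c,
% so SSD_c(P) < SSD_q(P) for every q \neq c.
```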

Theorem. The k-means algorithm always terminates.

Proof. First observe that only a finite number of centroid sets can possibly be produced at the end of each round (think: why?). We will show that after each round, the cost of the centroid set is strictly lower than that of the old centroid set, unless the two centroid sets are identical. This implies that the algorithm must terminate eventually.

Proof (cont.). Let C_old = {c_1, ..., c_k} be the old centroid set at the beginning of a round. By definition, its cost equals

$\phi(C_{old}) = \sum_{o \in P} d_{C_{old}}(o)^2.$

Let P_1, ..., P_k be the partitions obtained at Line 4 of the k-means algorithm. We can thus rewrite φ(C_old) as:

$\phi(C_{old}) = \sum_{i=1}^{k} \sum_{o \in P_i} \mathrm{dist}(o, c_i)^2.$

Let C = {c'_1, ..., c'_k} be the new centroid set obtained at Line 7. By the lemma of the previous slide, we know

$\sum_{o \in P_i} \mathrm{dist}(o, c'_i)^2 \le \sum_{o \in P_i} \mathrm{dist}(o, c_i)^2$

where the equality holds only if c'_i = c_i. In other words, if C_old ≠ C, then

$\phi(C_{old}) > \sum_{i=1}^{k} \sum_{o \in P_i} \mathrm{dist}(o, c'_i)^2.$

Proof (cont.). By definition, d_C(o) ≤ dist(o, c'_i) for every object o ∈ P_i. Hence,

$\sum_{i=1}^{k} \sum_{o \in P_i} \mathrm{dist}(o, c'_i)^2 \;\ge\; \sum_{i=1}^{k} \sum_{o \in P_i} d_C(o)^2 \;=\; \phi(C).$

We have thus shown φ(C_old) > φ(C), which completes the whole proof.

Having proved that the algorithm always terminates, let us next worry about its accuracy guarantee. Let C* be an optimal centroid set for the k-means problem. A centroid set C is said to be ρ-approximate if φ(C) ≤ ρ · φ(C*). The k-means algorithm given earlier does not have a bounded approximation ratio. In other words, the centroid set C it returns can have a cost greater than φ(C*) by an arbitrarily large ratio (i.e., ρ = ∞). It turns out that the issue stems from picking the initial centroid set too arbitrarily. By doing so more carefully, as shown on the next slide, it is possible to improve the approximation ratio significantly.

In the k-means algorithm, replace the centroid set C at Line 1 with the centroid set returned by the following algorithm.

algorithm k-seeding(P)
1. c ← a random point chosen uniformly from P
2. C ← {c}
3. for i = 2 to k
4.     c ← a point from P chosen as follows: each p ∈ P is chosen as c with probability $\frac{d_C(p)^2}{\sum_{p' \in P} d_C(p')^2}$
5.     if c ∉ C then
6.         add c to C
7.     else go to Line 4
8. return C
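Here is a minimal Python sketch of this seeding procedure (the k-means++ seeding), again assuming Euclidean points as tuples. A point already in C has centroid distance 0, so drawing with these weights can never re-select it, which plays the role of Lines 5-7.

```python
import math
import random

def k_seeding(P, k):
    """Pick the first centroid uniformly from P; pick each later one
    with probability proportional to its squared centroid distance."""
    C = [random.choice(P)]                        # Line 1: uniform first centroid
    d2 = [math.dist(p, C[0]) ** 2 for p in P]     # squared centroid distances
    for _ in range(k - 1):
        # Line 4: index j is drawn with probability d2[j] / sum(d2)
        j = random.choices(range(len(P)), weights=d2)[0]
        C.append(P[j])
        d2 = [min(d2[i], math.dist(P[i], P[j]) ** 2) for i in range(len(P))]
    return C
```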

It is known that the k-means algorithm, equipped with an initial centroid set chosen in the way shown on the previous slide, returns a solution whose approximation ratio is O(log k) in expectation. The proof is rather involved and is not required in this course. Interested students can refer to:

David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful seeding. SODA 2007: 1027-1035.