Approximation Algorithms for NP-Hard Clustering Problems

1 Approximation Algorithms for NP-Hard Clustering Problems
Ramgopal R. Mettu
Dissertation Advisor: Greg Plaxton
Department of Computer Science, University of Texas at Austin

2 What Is Clustering? The goal of clustering is to partition n weighted points into a small number of coherent groups. Clustering algorithms can be used to organize (e.g., document collections), analyze (e.g., data mining), and manage (e.g., networks).

3 Measuring Cluster Quality The cost of a set of cluster centers is the sum, over all points, of the weighted distance from each point to its closest center (also called a median).
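The cost definition above can be written down directly. The following is a minimal sketch, assuming points are coordinate tuples and distance is Euclidean; the names `clustering_cost` and `euclidean` are illustrative and do not come from the talk.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(p, q)

def clustering_cost(points, weights, centers, dist=euclidean):
    """Sum over all points of weight * distance to the closest center."""
    return sum(w * min(dist(p, c) for c in centers)
               for p, w in zip(points, weights))

points = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
weights = [1.0, 2.0, 1.0]
# With a single center at the origin: 1*0 + 2*5 + 1*10
print(clustering_cost(points, weights, centers=[(0.0, 0.0)]))  # 20.0
```

Evaluating the cost this way takes one distance computation per (point, center) pair.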

4 The Problems We Study The facility location problem asks us to identify a set of cluster centers that minimizes the sum of the centers' associated penalties and the cost. The k-median problem asks us to identify k cluster centers that minimize cost. The online median problem asks us to identify one cluster center at a time, while ensuring at every step that we have a low-cost set of cluster centers.
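To make the first two objectives concrete, here is a small sketch under stated assumptions: centers are chosen among the input points, `penalties[i]` is the penalty for opening point i as a center, and the exhaustive search is only meant for tiny instances. All function names are illustrative, not taken from the talk.

```python
from itertools import combinations

def cost(points, weights, centers, dist):
    """Weighted distance from each point to its closest center, summed."""
    return sum(w * min(dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def facility_location_objective(points, weights, penalties, chosen, dist):
    """Penalties of the chosen centers plus the clustering cost.
    `chosen` holds indices into `points`."""
    centers = [points[i] for i in chosen]
    return sum(penalties[i] for i in chosen) + cost(points, weights, centers, dist)

def brute_force_k_median(points, weights, k, dist):
    """Exhaustive search over all k-subsets; exponential, for illustration only."""
    return min(combinations(range(len(points)), k),
               key=lambda idx: cost(points, weights, [points[i] for i in idx], dist))

line = [0.0, 1.0, 10.0, 11.0]          # four points on a line
unit = [1.0] * 4
d = lambda a, b: abs(a - b)

best = brute_force_k_median(line, unit, 2, d)
print(best, cost(line, unit, [line[i] for i in best], d))            # (0, 2) 2.0
print(facility_location_objective(line, unit, [5.0] * 4, best, d))   # 12.0
```

The k-median objective fixes the number of centers, while the facility location objective lets the penalties decide how many centers are worth opening.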

5 Talk Outline
Introduction
Summary of Results
k-median: Successive Sampling Algorithm
Online Median: Hierarchically Greedy Strategy
Experimental Work
Conclusion

6 Assumptions We assume the input points are drawn from a metric space. A metric distance function is symmetric, nonnegative, and satisfies the triangle inequality. Two well-known metrics are Euclidean distance and shortest-paths distance.
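The three properties listed above can be checked by brute force on a small point set. This is a sketch for illustration only; `is_metric` is a hypothetical helper, and squared Euclidean distance is included as a familiar example of a function that fails the triangle inequality.

```python
import math
from itertools import product

def is_metric(points, dist, tol=1e-9):
    """Check nonnegativity, symmetry, and the triangle inequality on `points`."""
    for x, y in product(points, repeat=2):
        if dist(x, y) < 0 or abs(dist(x, y) - dist(y, x)) > tol:
            return False
    for x, y, z in product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            return False
    return True

pts = [(0, 0), (1, 0), (2, 0), (0, 2)]
squared = lambda p, q: math.dist(p, q) ** 2

print(is_metric(pts, math.dist))  # True
print(is_metric(pts, squared))    # False: 4 > 1 + 1 along (0,0), (1,0), (2,0)
```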

7 Hardness Results For arbitrary distances, it is NP-hard to approximate to within a factor of o(log n). For arbitrary metric spaces, it is NP-hard to compute solutions with cost less than a certain constant factor times optimal. For example, [JMS 02] establish such a constant-factor hardness bound for k-median.

8 Standard Approaches The k-means heuristic is widely used due to its simplicity and speed. Given an initial solution, the k-means heuristic utilizes an O(nk)-time iterative improvement step. There are no useful guarantees on solution quality.
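One way to see where the O(nk) bound per step comes from is to write a single improvement step out. The sketch below assumes the usual assign-then-recompute-centroids form of the heuristic in fixed dimension; the talk does not spell out its exact variant, and `kmeans_step` is an illustrative name.

```python
import math

def kmeans_step(points, centers):
    """One improvement step: assign each point to its nearest center
    (n*k distance evaluations), then move each center to its cluster's centroid."""
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for p in points:
        j = min(range(k), key=lambda c: math.dist(p, centers[c]))
        clusters[j].append(p)
    new_centers = []
    for j, members in enumerate(clusters):
        if members:
            dim = len(members[0])
            new_centers.append(tuple(sum(m[t] for m in members) / len(members)
                                     for t in range(dim)))
        else:
            new_centers.append(centers[j])  # an empty cluster keeps its old center
    return new_centers

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(kmeans_step(points, [(0.0, 0.0), (10.0, 2.0)]))  # [(0.0, 1.0), (10.0, 1.0)]
```

The step is repeated until the centers stop moving; nothing in it bounds how far the final cost is from optimal.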

9 Summary of Results
1. A randomized constant-factor approximation algorithm for the k-median problem that runs in Θ(nk) time for log n <= k <= n/log^2 n.
2. A Θ(n^2)-time constant-factor approximation algorithm for the online median problem.
3. A greedy Θ(n^2)-time constant-factor approximation algorithm for the facility location problem.
4. Analysis of approximate metrics that extends our results to more general objective functions.

11 Previous Work The k-median problem has been studied widely in Operations Research [FM 90]. The first constant-factor approximation algorithm for the k-median problem is due to Charikar et al. [CGTS 99] and is based on LP rounding. The fastest deterministic algorithm runs in O(n^2) time [MP 00]; the best approximation constant is due to [AGKMMP 01].

12 Previous Work The first randomized algorithm was due to [Indyk 99]. His algorithm runs in O(nk polylog(n)) time but produces O(k) medians rather than exactly k. [Thorup 01] gives an algorithm for the graph version of k-median.

13 Uniform Weights k-median Algorithm
Our algorithm works in two phases:
1. Use successive sampling to rapidly identify O(k log(n/k)) points with cost within a constant factor of optimal.
2. Construct a small problem instance from the sampled points and use an existing k-median algorithm.

14 Successive Sampling Let near(X, Y) be the half of the points in Y that are nearest to X.

15 Successive Sampling
U_0 := U, i := 0
While |U_i| > 0 do:
    S_i := 3k/2 random samples from U_i
    U_{i+1} := U_i - near(S_i, U_i)
    i := i + 1
return S = union of all S_i
(Figure: example with k = 2 showing the regions near(S_0, U_0), near(S_1, U_1), and near(S_2, U_2).)
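The loop on this slide can be sketched in a few lines. This is a simplified illustration under several assumptions rather than the algorithm as analyzed in the dissertation: points are distinct hashable tuples, distance is Euclidean, the sample size is rounded up to ceil(3k/2) and capped at the number of remaining points, and the sampled points themselves are also removed each round so that no point is sampled twice. The names `nearest_half` and `successive_sampling` are illustrative.

```python
import math
import random

def nearest_half(samples, remaining):
    """near(S, U): the ceil(|U|/2) points of U closest to the sample set S."""
    ranked = sorted(remaining, key=lambda p: min(math.dist(p, s) for s in samples))
    return ranked[: math.ceil(len(ranked) / 2)]

def successive_sampling(points, k, rng=random):
    """Repeatedly sample about 3k/2 points and discard the half of the
    remaining points that lies nearest to the sample."""
    remaining = list(points)
    sample_size = max(1, math.ceil(3 * k / 2))
    chosen = []
    while remaining:
        batch = rng.sample(remaining, min(sample_size, len(remaining)))
        chosen.extend(batch)
        # Assumption: drop the batch as well as its nearest half.
        covered = set(nearest_half(batch, remaining)) | set(batch)
        remaining = [p for p in remaining if p not in covered]
    return chosen

pts = [(float(i), 0.0) for i in range(64)]
S = successive_sampling(pts, k=2, rng=random.Random(0))
print(len(S))
```

Because at least half of the remaining points disappear in every round, the loop runs for a logarithmic number of rounds, which is where the size bound on the sample comes from.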

16 Successive Sampling Bounds Theorem: With high probability, cost(S) is within a constant factor of the optimal k-median solution cost. Running Time: For the case of uniform weights, our successive sampling algorithm runs in O(n(k + log n)) time.

17 Second Phase Collapse the points onto the sampled points and apply a k-median algorithm to the resulting weighted problem instance. The output of the second phase is within a constant factor of optimal [GMMO 00], and can be computed quickly.
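The collapsing step might look as follows for uniform weights. This sketch assumes that "collapse" means moving each point to its nearest sampled point and giving that sample a weight equal to the number of points it absorbs; the function name `collapse` is illustrative.

```python
import math
from collections import Counter

def collapse(points, samples):
    """Map each point to its nearest sample and count how many points
    each sample absorbs; the counts are the weights of the small instance."""
    weight = Counter()
    for p in points:
        nearest = min(samples, key=lambda s: math.dist(p, s))
        weight[nearest] += 1
    return dict(weight)

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
samples = [(1.0,), (10.0,)]
print(collapse(points, samples))  # {(1.0,): 3, (10.0,): 2}
```

The resulting weighted instance has only as many points as there are samples, so a slower k-median algorithm can then be run on it cheaply.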

18 Upper Bounds Theorem: With high probability, our k-median algorithm produces a solution with cost within a constant factor of optimal. Running Time: O(n(k + log n)) if we use our online median algorithm for the second phase.

19 Our Arbitrary Weights k-median Algorithm
The uniform weights algorithm can be used as a subroutine:
1. Divide the points into power-of-2 weight classes.
2. Run the uniform weights algorithm on each weight class.
3. Apply the approach of the second phase to obtain k points.
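The first step, bucketing by powers of two, can be sketched as below. It assumes strictly positive weights and places a weight w in class j when 2^j <= w < 2^(j+1); `weight_classes` is an illustrative name, and within a class the weights differ by less than a factor of two, which is what makes treating them as uniform plausible.

```python
import math
from collections import defaultdict

def weight_classes(points, weights):
    """Group points so that class j holds the weights w with 2**j <= w < 2**(j+1)."""
    classes = defaultdict(list)
    for p, w in zip(points, weights):
        if w <= 0:
            raise ValueError("weights are assumed to be positive")
        # frexp gives w = m * 2**e with 0.5 <= m < 1, so floor(log2 w) = e - 1.
        _, e = math.frexp(w)
        classes[e - 1].append(p)
    return dict(classes)

print(weight_classes(["a", "b", "c", "d"], [1, 3, 2, 9]))
# {0: ['a'], 1: ['b', 'c'], 3: ['d']}
```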

20 Upper Bounds Theorem: With high probability, our k-median algorithm produces a solution with cost within a constant factor of optimal. Running Time: O(n(k + log n) + k^2 log^2 n).

21 k-median Lower Bound For log n <= k <= n/log^2 n, our upper bound of O(n(k + log n) + k^2 log^2 n) is tight. Theorem: Any o(nk)-time randomized constant-factor approximation algorithm for the k-median problem has a negligible success probability. [GMMO 00] gives a deterministic lower bound.
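The tightness claim rests on the upper bound collapsing to O(nk) in the stated range of k. A short derivation, using only the two inequalities that define the range:

```latex
\begin{align*}
\log n \le k
  &\;\Longrightarrow\; n \log n \le nk, \\
k \le \frac{n}{\log^2 n}
  &\;\Longrightarrow\; k^2 \log^2 n \le k \cdot \frac{n}{\log^2 n} \cdot \log^2 n = nk, \\
\text{hence}\quad n(k + \log n) + k^2 \log^2 n
  &\le nk + nk + nk = 3nk = O(nk).
\end{align*}
```

Since the theorem rules out o(nk)-time algorithms, the O(nk) upper bound matches the lower bound up to constant factors in this range.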

23 The Online Median Problem What if we wish to compute a set of cluster centers, but we don't know k?

24 The Online Median Problem The goal of the online median problem is to identify an ordering of the points such that, over all i, the i-median cost of the prefix of length i is minimized. Is there always an ordering of the points such that, for all i, the cost of the prefix of length i is within a constant factor of optimal?

25 A Natural Greedy Approach Idea: Find successive points in the ordering greedily. (Figure: a small instance with the optimal solution for k=3 marked.) For k=3, the optimal solution has cost 0. But greedy has cost 1, so the approximation ratio is 1/0, i.e., unbounded!
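The failure of plain greedy can be reproduced on a tiny instance. The instance below is an assumption made for illustration, not necessarily the one drawn on the slide: a star metric with a zero-weight hub `m` at distance 1 from three unit-weight leaves `a`, `b`, `c`, which are at distance 2 from each other. The function names are illustrative.

```python
def cost(dist, weights, centers):
    """Weighted distance of every point to its closest center, summed."""
    return sum(w * min(dist[p][c] for c in centers) for p, w in weights.items())

def greedy_order(dist, weights):
    """At each step, add the point that minimizes the cost of the prefix."""
    order, remaining = [], list(weights)
    while remaining:
        best = min(remaining, key=lambda x: cost(dist, weights, order + [x]))
        order.append(best)
        remaining.remove(best)
    return order

names = ["m", "a", "b", "c"]

def d(x, y):
    if x == y:
        return 0
    return 1 if "m" in (x, y) else 2   # hub-leaf = 1, leaf-leaf = 2

dist = {x: {y: d(x, y) for y in names} for x in names}
weights = {"m": 0, "a": 1, "b": 1, "c": 1}

order = greedy_order(dist, weights)
print(order[:3], cost(dist, weights, order[:3]))  # ['m', 'a', 'b'] 1
print(cost(dist, weights, ["a", "b", "c"]))       # 0
```

Greedy commits to the hub first because it is the best single center (cost 3 versus 4), and that early choice leaves one leaf uncovered at k=3, where the optimal cost is 0.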

26 A Hierarchically Greedy Approach Balance global and local decisions by considering the metric space at varying levels of granularity. Instead of making a single greedy choice, make a sequence of greedy choices that are increasingly local.

27 Definitions Let ball(x, r) denote $\{y : d(x, y) \le r\}$. A ball C = ball(y, r/3) is a child of B = ball(x, r) if radius(B) = 3 radius(C) and $d(x, y) \le 2r$. (Figure: B = ball(x, r) = {x, a, b, c} and a child C = ball(y, r/3).)

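28 Definitions Let $\mathrm{value}(B) = \sum_{y \in B} (r - d(y, \mathrm{center}(B)))\, w(y)$, where r = radius(B). Let isolated(x, Y) = ball(x, d(x, Y)/15). (Figure: a ball B with value(B) = r + 5, and isolated(z, {x, a, b, c}) = ball(z, s/15) for a point z at distance s from {x, a, b, c}.)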
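The definitions on the two preceding slides translate almost directly into code. The sketch below assumes a finite point set, a distance function `dist`, closed balls (distance at most r), and weights stored in a dictionary keyed by point; all function names mirror the slide notation but are otherwise illustrative.

```python
import math

def ball(x, r, points, dist):
    """ball(x, r): all points within distance r of x."""
    return [y for y in points if dist(x, y) <= r]

def value(x, r, points, weights, dist):
    """value(B) for B = ball(x, r): sum of (r - d(y, x)) * w(y) over y in B."""
    return sum((r - dist(x, y)) * weights[y] for y in ball(x, r, points, dist))

def isolated_radius(x, Y, dist):
    """Radius of isolated(x, Y) = ball(x, d(x, Y) / 15)."""
    return min(dist(x, y) for y in Y) / 15

def children(x, r, points, dist):
    """Children of ball(x, r): balls of radius r/3 centered within 2r of x."""
    return [(y, r / 3) for y in points if dist(x, y) <= 2 * r]

points = [0, 1, 4, 30]                       # points on a line
weights = {0: 1, 1: 2, 4: 1, 30: 1}
dist = lambda a, b: abs(a - b)

print(ball(0, 3, points, dist))              # [0, 1]
print(value(0, 3, points, weights, dist))    # (3-0)*1 + (3-1)*2 = 7
print(children(0, 3, points, dist))          # [(0, 1.0), (1, 1.0), (4, 1.0)]
print(isolated_radius(30, [0, 1, 4], dist))  # 26/15
```

A ball's value rewards both the weight it contains and how close that weight sits to the center.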

29 Hierarchically Greedy Algorithm
Let Z denote the points in the ordering so far.
B := maximum-value isolated(x, Z) over all x
While B has > 1 child do:
    B := maximum-value child of B
return center(B) as the next point in the ordering
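A direct, unoptimized reading of this pseudocode is sketched below. It makes several assumptions that the slide does not state: the points are distinct, weights are positive, only points not yet in the ordering are considered as starting centers, and when Z is empty the starting ball around x has radius equal to the largest distance from x. This naive version does not attempt to meet the O(n^2) running time of the actual algorithm, and `online_median_order` is an illustrative name.

```python
def online_median_order(points, weights, dist):
    """Naive sketch of the hierarchically greedy ordering."""
    def value(x, r):
        return sum((r - dist(x, y)) * weights[y]
                   for y in points if dist(x, y) <= r)

    def start_radius(x, chosen):
        if not chosen:  # assumption: with no centers yet, the ball covers everything
            return max(dist(x, y) for y in points)
        return min(dist(x, z) for z in chosen) / 15

    order = []
    while len(order) < len(points):
        candidates = [x for x in points if x not in order]
        x = max(candidates, key=lambda c: value(c, start_radius(c, order)))
        r = start_radius(x, order)
        while r > 0:
            kids = [y for y in points if dist(x, y) <= 2 * r]  # centers of children
            if len(kids) <= 1:          # the only child is centered at x itself
                break
            r /= 3
            x = max(kids, key=lambda y: value(y, r))
        order.append(x)
    return order

pts = [0, 1, 10]
order = online_median_order(pts, {p: 1 for p in pts}, lambda a, b: abs(a - b))
print(order)
```

On this three-point line the first choice lands in the dense pair {0, 1} and the second choice is the far point 10, so every prefix is a reasonable set of centers.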

30 Approximation Ratio and Running Time Theorem: The hierarchically greedy strategy produces an ordering such that every prefix has cost within a constant factor of optimal. Running Time: Our online median algorithm can be implemented in O(n^2) time. This is optimal by the k-median lower bound.

32 Algorithm Implementations We implemented our uniform-weights k-median and online median algorithms in Java (version 1.3.1). We also implemented the k-means heuristic with a centroid-based initialization procedure. Common data structures took 542 lines of code; k-median took 726, online median took 800, and k-means took 218.

33 Goals of Experiments Our algorithms have provably good solutions, are simple, and are asymptotically fast. How do they compare in practice to heuristics in terms of speed and solution quality?

34 Experiments with Gaussians We generated synthetic inputs consisting of k d-dimensional Gaussians. We tested k-means with centroid-based initialization and k-means with k-median initialization. We varied n to test scalability, and varied d to test the effect of dimensionality.
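A synthetic input of this kind could be generated as follows. The talk does not say how the Gaussian means and variances were chosen, so the parameters `spread` and `sigma`, the uniform placement of means, and the round-robin assignment of points to components are all assumptions of this sketch.

```python
import random

def gaussian_mixture(n, k, d, spread=10.0, sigma=1.0, seed=0):
    """Draw n points in d dimensions from k spherical Gaussians.
    Means are placed uniformly in [0, spread]^d (an assumption)."""
    rng = random.Random(seed)
    means = [[rng.uniform(0.0, spread) for _ in range(d)] for _ in range(k)]
    points = []
    for i in range(n):
        mean = means[i % k]  # round-robin keeps the components balanced
        points.append(tuple(rng.gauss(mu, sigma) for mu in mean))
    return means, points

means, pts = gaussian_mixture(n=1000, k=5, d=3)
print(len(means), len(pts), len(pts[0]))  # 5 1000 3
```

Varying `n` with `k` and `d` fixed gives a scalability test, and varying `d` gives a dimensionality test, in the spirit of the experiments described above.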

35 Scalability Results: Solution Costs (Figure: solution cost versus n for k-means and k-median+k-means.)

36 Scalability Results: Running Times (Figure: running time in seconds versus n for k-means and k-median+k-means.)

37 A Real-World Application We applied our online median implementation to particle selection in electron microscopy images. Electron microscopy images typically have a low signal-to-noise ratio and are thus hard to interpret by inspection. We compare our results with those of [YB 02].

38 Input to Online Median A weighted 2D point set is obtained by thresholding the electron microscopy image [YB 02]. (Figure: source image and the resulting 376 weighted points.)

39 Methodology Our online median algorithm proceeds by choosing one cluster center at a time. We chose the appropriate number of clusters for a given data set interactively. Note that heuristics exist for choosing the number of clusters in a data set.

40 Comparison of Results (Figure: results of [YB 02] and of Online Median, side by side.) We obtained comparable results on the other inputs.

42 Directions for Future Work Can the approximation constants be improved? Can the hierarchically greedy strategy be applied to other location problems (e.g., cooperative caching in a metric space)? Is our successive sampling algorithm useful for other problems?

43 Extra Slides

44 With High Probability Let A be an algorithm that runs in time T(n). A succeeds with high probability if, given c > 0, A can be made to succeed with probability 1 - n^(-c) while maintaining a running time of O(T(n)).

45 K-Clustering Lower Bound For the same objective function, is the problem of just partitioning the points into k sets considerably simpler? Theorem: Any randomized constant-factor approximation algorithm for the k-clustering problem, with even a negligible success probability, requires Ω(nk) time.

46 K-Median Lower Bound Think of k equidistant groups, each containing n/k unit-weight points. We show that no algorithm (even randomized) can distinguish between two groups without examining at least nk distances. Any algorithm that cannot make this distinction cannot achieve a constant-factor approximation.

47 The Approximation Constants For our online median algorithm, we can show that the approximation constant is around 27. For our k-median algorithm, the approximation factor depends on a number of complicated statistical arguments.
