Outline. Main Questions. What is the k-means method? What can we do when worst case analysis is too pessimistic?


Outline: Main Questions
1. Data Clustering: What is the k-means method?
2. Smoothed Analysis: What can we do when worst-case analysis is too pessimistic?
3. Smoothed Analysis of k-means Method: What is the smoothed complexity of the k-means method?
4. Extensions and Conclusions

Data Clustering
Input: point set $X \subseteq \mathbb{R}^d$ with $|X| = n$; number of clusters $k$
Output: partition $C_1, \ldots, C_k$ of $X$; centers $c_1, \ldots, c_k \in \mathbb{R}^d$
Goal: $\min \sum_{i=1}^{k} \sum_{x \in C_i} \|x - c_i\|^2$
Theory: The problem is NP-hard, but a PTAS exists (running time is exponential in $k$).
Practice: k-means method.
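The objective on this slide can be evaluated directly. The following is a minimal sketch in plain Python; the function names `squared_distance` and `kmeans_cost` and the toy data are illustrative assumptions, not part of the talk.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(x: Point, c: Point) -> float:
    """Squared Euclidean distance ||x - c||^2."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans_cost(clusters: List[List[Point]], centers: List[Point]) -> float:
    """Sum over clusters C_i and points x in C_i of ||x - c_i||^2."""
    return sum(
        squared_distance(x, c)
        for cluster, c in zip(clusters, centers)
        for x in cluster
    )


# Toy instance (hypothetical data): two clusters in the plane.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0)]]
centers = [(1.0, 0.0), (10.0, 0.0)]
cost = kmeans_cost(clusters, centers)  # 1 + 1 + 0
```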

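Local Search: k-means Method
Local search based on two observations:
1. Clusters $C_i$ fixed $\Rightarrow$ optimal centers $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$.
2. Centers $c_i$ fixed $\Rightarrow$ clusters $C_i$ fixed: every point is assigned to its nearest center.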

k-means Method
Input: $X \subseteq \mathbb{R}^d$, $k$
1. choose $c_1, \ldots, c_k \in \mathbb{R}^d$
2. repeat
3.     partition $X$ into $C_1, \ldots, C_k$
4.     $c_i \leftarrow \frac{1}{|C_i|} \sum_{x \in C_i} x$
5. until stable

"by far the most popular clustering algorithm used in scientific and industrial applications" (Berkhin 2002)
"in practice the number of iterations is generally much less than the number of points" (Duda et al. 2001)
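The five steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the speaker's implementation: the initial centers are passed in by the caller, ties go to the lowest-index center, an empty cluster keeps its old center, and "stable" is taken to mean that the assignment did not change. The names `kmeans`, `sq_dist` and `mean` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Point, c: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, c))


def mean(points: List[Point]) -> Point:
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points: List[Point], centers: List[Point],
           max_iter: int = 10_000) -> Tuple[List[Point], List[int], int]:
    """Return (centers, assignment, number of iterations that changed the assignment)."""
    centers = list(centers)
    assignment: Optional[List[int]] = None
    for iteration in range(1, max_iter + 1):
        # Step 3: assign every point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            for x in points
        ]
        if new_assignment == assignment:  # Step 5: stable
            return centers, assignment, iteration - 1
        assignment = new_assignment
        # Step 4: move every center to the mean of its cluster.
        for i in range(len(centers)):
            members = [x for x, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = mean(members)
    return centers, assignment, max_iter


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
final_centers, final_assignment, iterations = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])
```

On this toy input the first pass puts three points into the second cluster, and the second pass repairs that, after which the assignment no longer changes.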

k-means Method: Theoretical Results
Running Time
Upper Bound: At most $(k^2 n)^{kd}$ iterations. No clustering can occur twice.
Lower Bound: At least $2^{\Omega(k)}$ iterations for $d \ge 2$. [Andrea Vattani (SoCG 09)]
Quality
Local optima can be arbitrarily bad.
Huge discrepancy between theory and practice. (Focus of this talk: running time.)

Smoothed Analysis
Worst-case analysis is too pessimistic. Typical instances are not adversarial.
Average-case analysis is unrealistic. What is the right probability distribution?
Smoothed analysis uses a less powerful adversary:
1. Adversary chooses instance $I$.
2. Small amount of random noise is added to $I$. ($\sigma$ = amount of noise)
$T(n, \sigma) = \max_{I,\, |I| = n} \mathbb{E}\bigl[\text{running time}(\mathrm{per}_\sigma(I))\bigr]$
The noise models, e.g., measurement errors or numerical imprecision.
Smoothed complexity low $\Rightarrow$ bad performance unlikely in practice.
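The definition of $T(n, \sigma)$ can be approximated empirically. The sketch below is only a Monte Carlo illustration under several assumptions: the true maximum ranges over all instances of size $n$, whereas the code maximizes over a finite list of candidate instances; the expectation is replaced by an average over a fixed number of trials; and insertion sort (counting element shifts) stands in as the algorithm being measured. All names are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def perturb(instance: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """per_sigma(I): add independent Gaussian noise to every number of the instance."""
    return [v + rng.gauss(0.0, sigma) for v in instance]


def insertion_sort_shifts(values: Sequence[float]) -> int:
    """Stand-in 'running time': number of element shifts done by insertion sort."""
    a = list(values)
    shifts = 0
    for j in range(1, len(a)):
        key = a[j]
        i = j - 1
        while i >= 0 and a[i] > key:
            a[i + 1] = a[i]
            i -= 1
            shifts += 1
        a[i + 1] = key
    return shifts


def smoothed_estimate(candidates: List[Sequence[float]],
                      running_time: Callable[[Sequence[float]], int],
                      sigma: float, trials: int = 200, seed: int = 0) -> float:
    """Max over candidate instances of the average running time on perturbed copies."""
    rng = random.Random(seed)
    worst = 0.0
    for instance in candidates:
        total = sum(running_time(perturb(instance, sigma, rng)) for _ in range(trials))
        worst = max(worst, total / trials)
    return worst


candidates = [[5.0, 4.0, 3.0, 2.0, 1.0], [1.0, 2.0, 3.0, 4.0, 5.0]]
worst_case = smoothed_estimate(candidates, insertion_sort_shifts, sigma=0.0)
smoothed = smoothed_estimate(candidates, insertion_sort_shifts, sigma=2.0)
```

With `sigma=0.0` the estimate collapses to the worst case over the candidates; with larger noise the adversarial order is partly destroyed and the average can only stay the same or drop.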

Smoothed Analysis of Simplex Method
Smoothed analysis was introduced for the simplex method.
Linear Programming: $\max c^T x$ such that $Ax \le b$
Simplex Method: Follow edges of the polytope. In the worst case slow.
Perturbation: Add independent Gaussian with standard deviation $\sigma$ to every entry in matrix $A$ and vector $b$.
Theorem [Spielman, Teng 2001]: The smoothed running time of the simplex method with shadow vertex pivot rule is polynomial in $n$, $m$, and $1/\sigma$.
In 2006, Roman Vershynin improved the bound from $O(m^{86} n^{55} \sigma^{-30})$ to $O(\max\{n^5 \log^2 m,\ n^9 \log^4 n,\ n^3 \sigma^{-4}\})$.
Further results on condition number of matrices, interior-point method, Gaussian elimination, etc.

Smoothed Analysis of k-means
Model: Every point is perturbed by an independent $d$-dimensional Gaussian with standard deviation $\sigma$.
$T(n, \sigma) = \max_{X,\, |X| = n} \mathbb{E}\bigl[\#\text{Iterations}(\mathrm{per}_\sigma(X))\bigr]$
Arthur, Vassilvitskii (FOCS 2006): Smoothed number of iterations is at most $\mathrm{poly}(n^k, 1/\sigma)$.
Manthey, Röglin (SODA 2009): Smoothed number of iterations is at most
    $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$
    $k^{kd} \cdot \mathrm{poly}(n, 1/\sigma)$
Our Result (FOCS 2009): Smoothed number of iterations is $\mathrm{poly}(n, 1/\sigma)$.

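General Approach
1. Initial Potential: After the first iteration, whp $\Phi = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - c_i\|^2 = O(\mathrm{poly}(n))$.
2. Define the smallest possible improvement: $\Delta = \min_{\text{iteration } C \to \mathrm{succ}(C)} \bigl(\Phi(C) - \Phi(\mathrm{succ}(C))\bigr)$.
   $\Rightarrow$ at most $O(\mathrm{poly}(n)/\Delta)$ steps.
   In the worst case: $\Delta$ arbitrarily small.
Lemma: For $d \ge 2$, for every $X \subseteq [0, 1]^d$, in the model of smoothed analysis: $\mathbb{E}\left[\frac{1}{\Delta}\right] = \mathrm{poly}(n, 1/\sigma)$.
$\Rightarrow \max_{X,\, |X| = n} \mathbb{E}\bigl[\#\text{Iterations}(\mathrm{per}_\sigma(X))\bigr] = \mathrm{poly}(n, 1/\sigma)$.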
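The step from a lower bound on the improvement to an upper bound on the number of iterations is a standard potential argument. A short worked version follows, writing $\Phi_t$ for the potential after iteration $t$, $T$ for the number of iterations, and $\Delta$ for the smallest improvement of any iteration (this indexing is introduced here for illustration only):

```latex
\begin{align*}
  \Phi_t - \Phi_{t+1} &\ge \Delta
      && \text{for every iteration } t = 1, \ldots, T-1, \\
  (T-1)\,\Delta &\le \sum_{t=1}^{T-1} \bigl(\Phi_t - \Phi_{t+1}\bigr)
      = \Phi_1 - \Phi_T \le \Phi_1
      && \text{since } \Phi_T \ge 0, \\
  T &\le 1 + \frac{\Phi_1}{\Delta}
      = O\!\left(\frac{\mathrm{poly}(n)}{\Delta}\right)
      && \text{whenever } \Phi_1 = O(\mathrm{poly}(n)).
\end{align*}
```

This is why a bound on $\mathbb{E}[1/\Delta]$ is the quantity of interest: the iteration count scales with $1/\Delta$, not with $\Delta$ itself.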

When does the potential drop?
1) A center moves by $\varepsilon$: improvement by $\varepsilon^2$.
2) A point with distance $\varepsilon$ to the bisector changes assignment: improvement by $2\varepsilon\delta$, where $\delta$ is the distance between the two centers.
Goal: Show that in every iteration
1. either a center moves significantly
2. or a reassigned point is significantly far from the bisector.
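Both improvement formulas can be checked numerically on a small example. The sketch below assumes that $\delta$ denotes the distance between the two centers involved in the reassignment, and it uses the exact identity for the center update, which gives an improvement of $|C| \cdot \varepsilon^2$ (at least the $\varepsilon^2$ stated on the slide). The helper names and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, c))


def distance_to_bisector(x: Point, c1: Point, c2: Point) -> float:
    """Distance from x to the hyperplane of points equidistant from c1 and c2."""
    delta = math.sqrt(sq_dist(c1, c2))
    midpoint = [(a + b) / 2 for a, b in zip(c1, c2)]
    signed = sum((xi - mi) * (bi - ai)
                 for xi, mi, ai, bi in zip(x, midpoint, c1, c2)) / delta
    return abs(signed)


# Case 2: x is currently assigned to c1 but is closer to c2.
x, c1, c2 = (3.0, 1.0), (0.0, 0.0), (4.0, 0.0)
eps = distance_to_bisector(x, c1, c2)     # distance of x to the bisector
delta = math.sqrt(sq_dist(c1, c2))        # distance between the two centers
reassign_drop = sq_dist(x, c1) - sq_dist(x, c2)

# Case 1: a center at distance eps_c from the cluster mean moves to the mean.
cluster = [(0.0, 0.0), (2.0, 0.0)]
old_center, new_center = (1.0, 0.5), (1.0, 0.0)
eps_c = math.sqrt(sq_dist(old_center, new_center))
center_drop = (sum(sq_dist(p, old_center) for p in cluster)
               - sum(sq_dist(p, new_center) for p in cluster))
```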

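How large is $\Delta$?
Configuration $C$ is $\varepsilon$-bad if $\Phi(C) - \Phi(\mathrm{succ}(C)) \le \varepsilon$.
Naive approach: Union bound over all configurations.
$\Pr[\exists \text{ configuration } C : C \text{ is } \varepsilon\text{-bad}] \le \sum_{\text{configuration } C} \Pr[C \text{ is } \varepsilon\text{-bad}]$
Problem: Too many configurations: $k^n$.
[Figure: the space of all configurations, covered by events $F_1, \ldots, F_5$]
$\Pr[\exists \text{ configuration } C : C \text{ is } \varepsilon\text{-bad}] \le \sum_{i=1}^{5} \Pr[F_i]$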

Transition Graph
Transition graph $G = (V, E)$ of an iteration $C \to \mathrm{succ}(C)$:
V: clusters
E: labeled directed edge for each reassigned point
Approach: Union bound over different transition blueprints.
First glance: Natural idea.
Second glance: Not enough information. E.g., no information about positions of centers and bisectors.
Third glance: Enough information!

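Approximate Centers
Goal: Show that in every iteration
1. either a center moves significantly
2. or a reassigned point is significantly far from the bisector.
[Figure: a cluster $C$ gains the points $A$ and loses the points $B$, giving $(C \cup A) \setminus B$. The slide compares the center movement $\mathrm{cm}((C \cup A) \setminus B) - \mathrm{cm}(C)$ with the distance between $\mathrm{cm}(C)$ and $\mathrm{approx}(A, B) = \frac{|B|\,\mathrm{cm}(B) - |A|\,\mathrm{cm}(A)}{|B| - |A|}$.]
Small potential drop $\Rightarrow$ $\mathrm{cm}(C)$ must be close to $\mathrm{approx}(A, B)$.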
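Why a nearly stationary center pins down $\mathrm{cm}(C)$ can be seen from a short calculation. The derivation below is a reconstruction under stated assumptions, not a transcription of the slide: $A$ is disjoint from $C$, $B \subseteq C$, $|A| \ne |B|$, $C' = (C \cup A) \setminus B$ is non-empty, $\mathrm{cm}(\cdot)$ is the center of mass, and $\mathrm{approx}(A, B) = \frac{|A|\,\mathrm{cm}(A) - |B|\,\mathrm{cm}(B)}{|A| - |B|}$. The exact constant used on the original slide is not recoverable from the text.

```latex
\begin{align*}
  |C'|\,\mathrm{cm}(C') &= |C|\,\mathrm{cm}(C) + |A|\,\mathrm{cm}(A) - |B|\,\mathrm{cm}(B),
      \qquad |C'| = |C| + |A| - |B|, \\
  |C'|\,\bigl(\mathrm{cm}(C') - \mathrm{cm}(C)\bigr)
      &= |A|\,\mathrm{cm}(A) - |B|\,\mathrm{cm}(B) - \bigl(|A| - |B|\bigr)\,\mathrm{cm}(C) \\
      &= \bigl(|A| - |B|\bigr)\,\bigl(\mathrm{approx}(A, B) - \mathrm{cm}(C)\bigr), \\
  \bigl\|\mathrm{cm}(C') - \mathrm{cm}(C)\bigr\|
      &= \frac{\bigl||A| - |B|\bigr|}{|C'|}\,
         \bigl\|\mathrm{approx}(A, B) - \mathrm{cm}(C)\bigr\|
      \;\ge\; \frac{1}{n}\,\bigl\|\mathrm{approx}(A, B) - \mathrm{cm}(C)\bigr\|.
\end{align*}
```

So if the center of $C$ barely moves, $\mathrm{cm}(C)$ lies within $n$ times that movement of $\mathrm{approx}(A, B)$, a point that depends only on the reassigned points and not on the rest of the cluster.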

Proof of Main Theorem
Theorem: Smoothed number of iterations is $\mathrm{poly}(n, 1/\sigma)$.
Number of blueprints with $m$ edges: $(k^2 n)^m$.
Probability that a fixed data point has distance at most $\varepsilon$ from its approximate bisector: $\varepsilon/\sigma$.
Probability that all $m$ data points are $\varepsilon$-close to their approximate bisectors: $(\varepsilon/\sigma)^m$.
Union Bound: $\Pr[\exists\, \varepsilon\text{-bad blueprint}] \le (k^2 n)^m \cdot (\varepsilon/\sigma)^m$
Very unlikely for $\varepsilon = 1/\mathrm{poly}(n, 1/\sigma)$.
Technical Difficulties: Data points are not independent from approximate bisectors, approximate centers are not defined for balanced clusters, blueprints must have enough edges, ...
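To see why $\varepsilon = 1/\mathrm{poly}(n, 1/\sigma)$ suffices in the union bound, it helps to plug in one concrete value. The choice $\varepsilon = \sigma / (2 k^2 n)$ below is an illustrative assumption, not the value used in the actual proof, and the calculation ignores the technical difficulties listed on the slide.

```latex
\begin{align*}
  (k^2 n)^m \cdot \left(\frac{\varepsilon}{\sigma}\right)^m
      &= \left(\frac{k^2 n\,\varepsilon}{\sigma}\right)^m, \\
  \varepsilon = \frac{\sigma}{2 k^2 n}
      \;&\Longrightarrow\;
  \left(\frac{k^2 n\,\varepsilon}{\sigma}\right)^m = 2^{-m}.
\end{align*}
```

The failure probability thus decays exponentially in the number of edges $m$, which is consistent with the remark that blueprints must have enough edges.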

Kullback-Leibler Divergence
Text classification and bag-of-words model:
$S$: set of all words
Count words and normalize: probability distribution $p : S \to [0, 1]$
Kullback-Leibler divergence (relative entropy):
$\mathrm{KLD}(p, q) = \sum_{i=1}^{d} p_i \log\left(\frac{p_i}{q_i}\right)$
= (number of bits to encode $p$ with Huffman code for $q$) $-$ (number of bits to encode $p$ with Huffman code for $p$)
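The bag-of-words construction and the divergence can be sketched as follows. The code uses idealized code lengths $-\log_2 q_i$ in place of actual Huffman code lengths (which are integers), so the "bits" interpretation holds only in that idealized sense; the function names and the two toy documents are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def word_distribution(words: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Count words and normalize to a probability distribution over the vocabulary."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocabulary)
    return [counts[w] / total for w in vocabulary]


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """Expected ideal code length (in bits) when p is encoded with a code built for q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total -= pi * math.log2(qi)
    return total


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KLD(p, q) = sum_i p_i * log2(p_i / q_i), with the convention 0 * log 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total


vocabulary = ["a", "b"]
p = word_distribution("a b a b".split(), vocabulary)   # [0.5, 0.5]
q = word_distribution("a b b b".split(), vocabulary)   # [0.25, 0.75]
kld = kl_divergence(p, q)
extra_bits = cross_entropy(p, q) - cross_entropy(p, p)
```

Note that the divergence is not symmetric: `kl_divergence(q, p)` generally differs from `kl_divergence(p, q)`.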

Bregman Divergences
Bregman divergences are distance measures that generalize squared Euclidean distances and the Kullback-Leibler divergence.
Ackermann, Blömer, Sohler (SODA 2008): Approximation scheme for special cases of Bregman divergences (e.g., Kullback-Leibler divergence).
Manthey, Röglin (ISAAC 2009): For any well-behaved Bregman divergence, the smoothed number of iterations is at most
    $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$
    $k^{kd} \cdot \mathrm{poly}(n, 1/\sigma)$
    $\Rightarrow \mathrm{poly}(n, 1/\sigma)$ if $d, k = O\bigl(\sqrt{\log n / \log\log n}\bigr)$ (or $d = 1$)
The polynomial bound does not extend, as it uses special properties of Gaussian perturbations.
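For reference, the standard definition is $D_\varphi(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y), x - y \rangle$ for a strictly convex, differentiable $\varphi$. The sketch below illustrates the two special cases named on the slide; it takes the gradient as an explicit argument, uses natural logarithms, and assumes strictly positive entries for the entropy case. Function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def bregman(phi: Callable[[Vector], float],
            grad_phi: Callable[[Vector], List[float]],
            x: Vector, y: Vector) -> float:
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))


# phi(x) = ||x||^2 yields the squared Euclidean distance.
def sq_norm(x: Vector) -> float:
    return sum(v * v for v in x)


def grad_sq_norm(x: Vector) -> List[float]:
    return [2 * v for v in x]


# phi(p) = sum_i p_i ln p_i yields the KL divergence on probability vectors.
def neg_entropy(p: Vector) -> float:
    return sum(v * math.log(v) for v in p)


def grad_neg_entropy(p: Vector) -> List[float]:
    return [math.log(v) + 1 for v in p]


euclid = bregman(sq_norm, grad_sq_norm, (1.0, 2.0), (3.0, 5.0))
p, q = (0.5, 0.5), (0.25, 0.75)
kl_nats = bregman(neg_entropy, grad_neg_entropy, p, q)
```

For vectors that do not sum to one, the second case gives the generalized KL divergence $\sum_i p_i \ln(p_i/q_i) - \sum_i p_i + \sum_i q_i$; the extra terms cancel for probability distributions.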

Further Results and Open Questions
Summary:
Worst-case instances are often fragile.
Smoothed analysis often leads to better understanding of observed practical behavior.
Future Research:
improve exponents for k-means (currently $n^{30}$); better understanding of dynamics seems necessary for this
explanation for good approximation ratio
better analysis of Bregman divergences
more systematic theory of smoothed local search: Are all local search problems in PLS easy in smoothed analysis?
