K-means properties on six clustering benchmark datasets. Pasi Fränti and Sami Sieranoja, Algorithms, 2017.
1 K-means properties on six clustering benchmark datasets. Pasi Fränti and Sami Sieranoja, Algorithms, 2017.
2 Goal of k-means. Input: N points X = {x_1, x_2, ..., x_N}. Output: partition P = {p_1, p_2, ..., p_N} and k centroids C = {c_1, c_2, ..., c_k}. Objective function: SSE = Σ_{i=1..N} ||x_i − c_{p_i}||² (SSE = sum-of-squared errors).
3 Goal of k-means. Input: N points X = {x_1, x_2, ..., x_N}. Output: partition P = {p_1, p_2, ..., p_N} and k centroids C = {c_1, c_2, ..., c_k}. Objective function: SSE = Σ_{i=1..N} ||x_i − c_{p_i}||². Assumptions: SSE is a suitable objective, and k is known.
4 K-means algorithm. X = data set, C = cluster centroids, P = partition.

K-Means(X, C) → (C, P):
  REPEAT
    C_prev ← C
    FOR i = 1 TO N DO p_i ← FindNearest(x_i, C)          (assignment step)
    FOR j = 1 TO k DO c_j ← average of {x_i : p_i = j}   (centroid step)
  UNTIL C = C_prev
5 K-means optimization steps.
Assignment step: p_i ← arg min_{1≤j≤k} ||x_i − c_j||², for i = 1..N.
Centroid step: c_j ← Σ_{p_i = j} x_i / |{i : p_i = j}|, for j = 1..k.
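The two optimization steps above can be sketched in Python. This is a minimal Lloyd-style implementation for illustration, not the authors' code; the empty-cluster handling (keep the old centroid) is an assumption.

```python
def kmeans(X, C):
    """Plain k-means: alternate the assignment step and the centroid step
    until the centroids no longer move. X and C are lists of equal-length
    numeric tuples; returns the final centroids and partition labels."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    while True:
        # Assignment step: p_i <- index of the nearest centroid
        P = [min(range(len(C)), key=lambda j: dist2(x, C[j])) for x in X]
        # Centroid step: c_j <- mean of the points assigned to cluster j
        C_new = []
        for j in range(len(C)):
            members = [x for x, p in zip(X, P) if p == j]
            if members:
                C_new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                C_new.append(C[j])  # assumption: keep an emptied cluster's centroid
        if C_new == C:
            return C, P
        C = C_new
```

For example, `kmeans([(0,0),(0,1),(10,10),(10,11)], [(0,0),(10,10)])` converges in two passes to centroids (0, 0.5) and (10, 10.5).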
6 Problems of k-means: distance between clusters. K-means cannot move centroids between clusters that are far away from each other.
7 Problems of k-means: dependency on the initial solution (compare the initial solution with the result after k-means).
8 K-means performance: how is it affected by 1. overlap, 2. number of clusters, 3. dimensionality, 4. unbalance of cluster sizes?
9 Basic Benchmark
10 Data set statistics. Varying property per dataset: A = number of clusters; S = overlap; DIM = dimensions; G2 = dimensions + overlap; Birch = structure (100 clusters); Unbalance = balance.
11 A sets. Spherical clusters; the number of clusters varies from k=20 to 50: A1 k=20, A2 k=35, A3 k=50. Subsets of each other: A1 ⊂ A2 ⊂ A3. Other parameters fixed: cluster size, deviation, overlap; dimensionality = 2.
12 S sets. K=15 Gaussian clusters (a few truncated). Overlap increases from S1 (least overlap) to S4 (strong overlap, but the clusters are still recognizable).
13 Unbalance. K=8. Dense clusters (st.dev = 2043) and sparse clusters; the areas are well separated. *Correct clustering can be obtained by minimizing SSE.
14 DIM sets. K=16 well-separated clusters in high-dimensional spaces. Dimensions vary: 32, 64, 128, 256, 512, 1024 (example shown: DIM32).
15 G2 datasets. K=2. Dataset name: G2-dim-sd. Centroid 1: [500, 500, ...]; centroid 2: [600, 600, ...]. Dimensions: 1, 2, 4, 8, 16, ..., 1024. St.dev: 10, 20, 30, ..., 100.
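A G2-style dataset is straightforward to generate from the description above. The sketch below is illustrative (the 1024 points per cluster and the seed handling are assumptions consistent with the 2048-point example later in the deck), not the benchmark's generator.

```python
import random

def make_g2(dim, sd, n_per_cluster=1024, seed=0):
    """G2-style data: two Gaussian clusters in `dim` dimensions, centred at
    [500,...,500] and [600,...,600], both with standard deviation `sd`."""
    rng = random.Random(seed)
    def cluster(mean):
        return [[rng.gauss(mean, sd) for _ in range(dim)]
                for _ in range(n_per_cluster)]
    X = cluster(500) + cluster(600)          # 2 * n_per_cluster points in total
    labels = [0] * n_per_cluster + [1] * n_per_cluster
    return X, labels
```

E.g. `make_g2(2, 10)` corresponds to G2-2-10: 2048 two-dimensional points in two barely overlapping blobs.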
16 Birch. Birch1: regular 10×10 grid, constant variance. Birch2: clusters placed along a sine curve, y(x) = amplitude * sin(2*π*frequency*x + phaseshift) + offset.
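The Birch2 centroid layout can be sketched directly from the sine formula. The constants (amplitude, frequency, phase shift, offset) and the even spacing of x positions are placeholder assumptions here, since the slide's actual values did not survive transcription.

```python
import math

def birch2_centroids(k=100, amplitude=1.0, frequency=1.0, phase=0.0, offset=0.0):
    """Centroid locations on a sine curve, as in the Birch2 layout.
    Parameter defaults are illustrative placeholders, not the benchmark's
    actual constants."""
    xs = [i / (k - 1) for i in range(k)]  # k evenly spaced x positions (assumption)
    return [(x, amplitude * math.sin(2 * math.pi * frequency * x + phase) + offset)
            for x in xs]
```

Each centroid would then be surrounded by a Gaussian blob of points to form the full dataset.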
17 Birch2 subsets. B2-random: random subsampling, N decreasing down to 2 000 and 1 000. B2-sub: cutting off the last cluster, k = 100, 99, 98, ..., 3, 2, 1.
18 Properties
19 Measured properties: overlap, contrast, intrinsic dimensionality, H-index, distance profiles.
20 Overlap by misclassification probability: points from the blue cluster that are closer to the red centroid, plus points from the red cluster that are closer to the blue centroid. Example: points = 2048, incorrect = 20, overlap = 20 / 2048 ≈ 1%.
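The misclassification-based overlap above reduces to counting points whose nearest centroid differs from their own cluster's. A minimal sketch (illustrative helper, not the authors' code):

```python
def overlap_misclassification(X, labels, centroids):
    """Overlap = fraction of points that lie closer to some other cluster's
    centroid than to their own cluster's centroid."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    wrong = sum(
        1 for x, lab in zip(X, labels)
        if min(range(len(centroids)), key=lambda j: dist2(x, centroids[j])) != lab
    )
    return wrong / len(X)
```

With 20 misclassified points out of 2048 this returns roughly 0.01, matching the slide's 1% example.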
21 Overlap by separation: overlap = (1/N) Σ_{i=1..N} d1(i)/d2(i), where d1 = distance to the nearest centroid and d2 = distance to the 2nd nearest. Example: d1 = 0.25, d2 = 0.75.
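The separation measure can be sketched as below. Note the averaged ratio d1/d2 is a reconstruction of the slide's garbled formula, so treat the exact form as an assumption.

```python
import math

def overlap_separation(X, centroids):
    """For each point take d1 = distance to its nearest centroid and
    d2 = distance to the second nearest, then average d1/d2 over all points.
    A ratio near 1 means the two nearest centroids are almost equidistant,
    i.e. the clusters overlap strongly."""
    total = 0.0
    for x in X:
        d = sorted(math.dist(x, c) for c in centroids)
        total += d[0] / d[1]
    return total / len(X)
```

For the slide's example point (d1 = 0.25, d2 = 0.75) the per-point ratio is 1/3.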
22 Contrast. contrast = median[(d_max − d_min) / d_min], computed over the pairwise distances. Figures: contrast as a function of dimensions for the DIM and G2 sets (G2 curves for sd = 30, 70, ...): contrast decreases as dimensionality grows.
23 Intrinsic dimensionality. ID = d̂² / (2σ²), where d̂ is the average and σ² the variance of the pairwise distances. (Figure: ID values for the Unbalance, DIM, Birch, S and A sets.)
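The ID formula above (Chavez-style: squared mean over twice the variance of pairwise distances) is easy to compute exactly on small data. A brute-force sketch over all pairs; for large N one would sample pairs instead:

```python
import math
from itertools import combinations

def intrinsic_dim(X):
    """Intrinsic dimensionality: ID = mean(d)^2 / (2 * var(d)),
    computed over all pairwise distances d in the data set."""
    d = [math.dist(a, b) for a, b in combinations(X, 2)]
    mean = sum(d) / len(d)
    var = sum((v - mean) ** 2 for v in d) / len(d)
    return mean * mean / (2 * var)
```

For 20 evenly spaced points on a line the pairwise-distance mean is 7 and the variance 21, so ID = 49/42 ≈ 1.17, close to the true dimensionality of 1.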
24 H-index. Based on hubness values: how often each point appears among other points' nearest neighbors; points are ranked by their hub counts.
25 Distance profiles. Data that contains clusters tends to have two peaks: local distances (inside the clusters) and global distances (across different clusters). Shown for A1, A2, A3, S1, S2, S3, S4, Birch1, Birch2, DIM32 and Unbalance.
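A distance profile is just a histogram of all pairwise distances; the two-peak shape described above falls out directly. A small sketch (the bin count is an arbitrary choice):

```python
import math
from itertools import combinations

def distance_profile(X, bins=10):
    """Histogram of all pairwise distances. Clustered data tends to show two
    peaks: small within-cluster distances and large between-cluster ones."""
    d = [math.dist(a, b) for a, b in combinations(X, 2)]
    hi = max(d)
    counts = [0] * bins
    for v in d:
        counts[min(int(v / hi * bins), bins - 1)] += 1
    return counts
```

Two tight, well-separated clusters put all mass into the first and last bins, i.e. a clearly bimodal profile.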
26 Distance profiles, G2 datasets. Top: dimension increases (D = 2, 4, 8, 16, 32, ..., 1024). Bottom: overlap increases (sd = 10, 20, 30, 40, 50, 60), shown for D = 2 and D = 128.
27 Summary of the properties: Overlap, Contrast, Intrinsic dim. and H-index tabulated for each dataset (A1, A2, A3, S1, S2, S3, S4, DIM, G2*, Birch1, Birch2, Unbalance).
28 G2 overlap. Overlap increases with st.dev (downwards) and decreases with dimension (left to right):

sd \ dim (increasing →)
10:   0%   0%   0%  0%  0%    0%  0%  0%  0%  0%
20:   0%   0%   0%  0%  0%    0%  0%  0%  0%  0%
30:   1%   0%   0%  0%  0%    0%  0%  0%  0%  0%
40:   4%   1%   0%  0%  0%    0%  0%  0%  0%  0%
50:   8%   2%   0%  0%  0%    0%  0%  0%  0%  0%
60:  12%   4%   1%  0%  0%    0%  0%  0%  0%  0%
70:  15%   8%   2%  0%  0%    0%  0%  0%  0%  0%
80:  19%   9%   4%  1%  0%    0%  0%  0%  0%  0%
90:  22%  12%   6%  2%  0%    0%  0%  0%  0%  0%
100:  –   15%   7%  2%  0.1%  0%  0%  0%  0%  0%
29 G2 contrast: contrast decreases as dimension increases (table of values by sd and dim).
30 G2 intrinsic dimensionality: ID increases with dimension (if there is overlap); the effect is most significant at high dimensions (table of values by sd and dim).
31 G2 H-index: H-index increases with dimension (no change otherwise); the effect is most significant at high dimensions (table of values by sd and dim).
32 Evaluation
33 Internal measures.
Sum of squared errors: SSE = Σ_{i=1..N} ||x_i − c_{p_i}||².
Normalized mean square error: nMSE = SSE / (N·D).
Approximation ratio: ε = SSE / SSE_opt.
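The two computable internal measures above can be sketched in a few lines (the approximation ratio additionally needs a reference SSE_opt). This helper is illustrative, not the authors' evaluation code:

```python
def internal_measures(X, P, C, D):
    """SSE and normalised MSE = SSE / (N * D) for points X, partition labels P,
    centroids C and dimensionality D. Given a reference optimum, the
    approximation ratio would be sse / sse_opt."""
    sse = sum(sum((xi - ci) ** 2 for xi, ci in zip(x, C[p]))
              for x, p in zip(X, P))
    nmse = sse / (len(X) * D)
    return sse, nmse
```

For two 2-d points at distance 1 from their shared centroid, SSE = 2 and nMSE = 2 / (2·2) = 0.5.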
34 External measures: centroid index (CI), a cluster-level similarity measure [P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 2014]. Example: CI = 4 (too many centroids in some regions, missing centroids in others).
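The centroid index compares two centroid sets at the cluster level: map each centroid of one set to its nearest centroid in the other, count the centroids that are never mapped to (orphans), and symmetrise. A sketch following that definition (an assumption based on the cited 2014 paper, not the reference implementation):

```python
def centroid_index(C1, C2):
    """Centroid index: number of cluster-level mismatches between two
    centroid sets. CI = 0 means a structurally correct clustering."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    def orphans(A, B):
        # centroids of B never chosen as nearest by any centroid of A
        hit = {min(range(len(B)), key=lambda j: dist2(a, B[j])) for a in A}
        return len(B) - len(hit)
    return max(orphans(C1, C2), orphans(C2, C1))
```

If two centroids of C1 sit in the same true cluster while another cluster has none, exactly one centroid of C2 is orphaned and CI = 1.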
35 External measures: success rate = share of runs that reach CI = 0 (here 17%). Example runs: CI = 1, 2, 1, 0, 2, 2.
36 Results
37 Summary of results: success rate, CI and relative CI (the ARI, SSE and nMSE columns did not survive transcription):

Dataset     Success  CI    rel-CI
A1           1%      –      –
A2           0%      –      –
A3           0%      –      –
S1           3%      –      –
S2          11%      1.4    9%
S3          12%      1.3    9%
S4          26%      0.9    6%
Unbalance    0%      –      –
Birch1       0%      6.6    7%
Birch2       0%      –      –
Dim32        0%      –      –
Average:     5%      –      –
38 Dependency on overlap (S datasets). Success rates as overlap increases: S1 3%, S2 11%, S3 12%, S4 26%.
39 Dependency on overlap (G2 datasets): success rate increases with overlap (figure: success 0-100% vs. overlap 0.0%-10.0%+, starting from about 2%).
40 Main observation 1. Overlap: overlap is good!
41 Dependency on the number of clusters k (A datasets). k increases: A1 k=20, A2 k=35, A3 k=50. Success: 1%, 0%, 0%. Relative CI: 13%, 13%, 13%.
42 Dependency on the number of clusters k (Birch2 subsets): CI-value of k-means and repeated k-means as a function of k (figure).
43 Dependency on data size N (Birch2 subsets): CI-value of k-means and repeated k-means as a function of data size in thousands (figure).
44 Main observation 2. Number of clusters: the error (CI) increases linearly with k!
45 Dependency on dimensions (DIM datasets). As dimensionality increases, the success rate stays at 0%.
46 Dependency on dimensions (G2 datasets). Success rate by sd (rows) and dimension (columns, increasing): success degrades with dimension but improves with sd (overlap).

sd \ dim (increasing →)
10:   0%   0%   0%   0%  0%  0%  0%  0%  0%  0%
20:   0%   0%   0%   0%  0%  0%  0%  0%  0%  0%
30:  35%   1%   0%   0%  0%  0%  0%  0%  0%  0%
40:  99%   7%   0%   0%  0%  0%  0%  0%  0%  0%
50:  99%  80%   1%   0%  0%  0%  0%  0%  0%  0%
60:  79%  84%  16%   1%  0%  0%  0%  0%  0%  0%
70:  95%  86%  56%   1%  0%  0%  0%  0%  0%  0%
80:  63%  93%  94%   5%  0%  0%  0%  0%  0%  0%
90:   –   89%  69%  16%  0%  0%  0%  0%  0%  0%
100:  –   91%  71%  12%  1%  0%  0%  0%  0%  0%
47 Lack of overlap is the cause! The G2 overlap values (slide 28) and the G2 success rates (slide 46), both indexed by sd and dimension, correlate strongly: correlation 0.88.
48 Main observation 3. Dimensionality: no direct effect!
49 Effect of unbalance (Unbalance dataset). K-means tends to put too many centroids in the dense area and too few in the sparse clusters. Success: 0%; average CI: 3.9. The problem originates from the random initialization.
50 Main observation 4. Unbalance of cluster sizes: unbalance is bad!
51 Improving k-means
52 Better initialization techniques. Simple initializations: random centroids [Forgy, MacQueen]; furthest-point heuristic (maxmin) [Gonzalez]. More complex: k-means++ [Arthur and Vassilvitskii]; Luxburg [Luxburg].
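The furthest-point (maxmin) heuristic mentioned above is the simplest of these to sketch: grow the centroid set by repeatedly adding the point furthest from all centroids chosen so far. An illustrative implementation (starting from the first point for determinism; in practice the first centroid is a random point):

```python
def maxmin_init(X, k):
    """Furthest-point (maxmin) initialization: each new centroid is the data
    point that maximises the distance to its nearest already-chosen centroid."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    C = [X[0]]  # a random point in practice; the first point here for determinism
    while len(C) < k:
        C.append(max(X, key=lambda x: min(dist2(x, c) for c in C)))
    return C
```

The resulting centroids are spread across the data, which avoids the random initialization's tendency to drop several centroids into one dense cluster.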
53 Initialization techniques, varying N: CI-value of Random, K-means++, Maxmin and Luxburg as a function of data size in thousands (figure).
54 Initialization techniques, varying k: relative CI-value (0-20%) of Random, Luxburg, K-means++ and Maxmin as a function of the number of clusters k (figure).
55 Repeated k-means (RKM). Repeat k-means 100 times from different random initializations and keep the best result. Can increase the chances of success significantly; in principle, running forever would solve the problem. Limitations remain when k is large (figures: CI vs. number of clusters and CI vs. data size, Birch2 subsets).
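RKM as described above is a one-line wrapper around k-means: run it from many random initializations and keep the lowest-SSE result. A self-contained sketch (the inner k-means and the sampled-points initialization are illustrative assumptions):

```python
import random

def repeated_kmeans(X, k, repeats=100, seed=0):
    """Repeated k-means: run k-means from `repeats` random initializations
    (k distinct data points each) and keep the lowest-SSE solution."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def kmeans(C):
        while True:
            P = [min(range(k), key=lambda j: dist2(x, C[j])) for x in X]
            C_next = []
            for j in range(k):
                members = [x for x, p in zip(X, P) if p == j]
                C_next.append(tuple(sum(v) / len(members) for v in zip(*members))
                              if members else C[j])
            if C_next == C:
                return sum(dist2(x, C[p]) for x, p in zip(X, P)), C, P
            C = C_next

    # min over (sse, centroids, partition) tuples picks the lowest SSE
    return min(kmeans(rng.sample(X, k)) for _ in range(repeats))
```

Each restart costs a full k-means run, which is why the slide notes that RKM stops helping when k grows: the probability that one random initialization is structurally correct shrinks rapidly with k.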
56 A better algorithm: Random Swap (RS)

RandomSwap(X) → (C, P):
  C ← SelectRandomRepresentatives(X)
  P ← OptimalPartition(X, C)
  REPEAT T times
    (C_new, j) ← RandomSwap(X, C)
    P_new ← LocalRepartition(X, C_new, P, j)
    (C_new, P_new) ← KMeans(X, C_new, P_new)
    IF f(C_new, P_new) < f(C, P) THEN (C, P) ← (C_new, P_new)
  RETURN (C, P)

Genetic Algorithm (GA)

GeneticAlgorithm(X) → (C, P):
  FOR i ← 1 TO Z DO
    C_i ← RandomCodebook(X)
    P_i ← OptimalPartition(X, C_i)
  SortSolutions({C, P})
  REPEAT
    {C, P} ← CreateNewSolutions({C, P})
    SortSolutions({C, P})
  UNTIL no improvement

CreateNewSolutions({C, P}) → {C_new, P_new}:
  (C_new_1, P_new_1) ← (C_1, P_1)
  FOR i ← 2 TO Z DO
    (a, b) ← SelectNextPair
    (C_new_i, P_new_i) ← Cross(C_a, P_a, C_b, P_b)
    IterateKMeans(C_new_i, P_new_i)

Cross(C_1, P_1, C_2, P_2) → (C_new, P_new):
  C_new ← CombineCentroids(C_1, C_2)
  P_new ← CombinePartitions(P_1, P_2)
  C_new ← UpdateCentroids(C_new, P_new)
  RemoveEmptyClusters(C_new, P_new)
  IS(C_new, P_new)

CombineCentroids(C_1, C_2): C_new ← C_1 ∪ C_2
CombinePartitions(C_new, P_1, P_2): FOR i ← 1 TO N DO p_i_new ← p_i_1 IF ||x_i − c_{p_i_1}||² ≤ ||x_i − c_{p_i_2}||², ELSE p_i_2
UpdateCentroids(C_new, P_new): FOR j ← 1 TO |C_new| DO c_j_new ← CalculateCentroid(P_new, j)

Full implementation: cs.uef.fi/pages/franti/research/ga.txt
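The Random Swap loop can be sketched compactly. This is a simplified assumption-laden version of the pseudocode above: the swap picks a random data point (rather than a dedicated heuristic), and "local repartition + k-means" is approximated by two full k-means iterations per trial.

```python
import random

def random_swap(X, k, T=50, seed=0):
    """Random Swap sketch: trial swaps of one centroid to a random data point,
    each followed by two k-means iterations; a swap is kept only when the SSE
    improves."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def assign(C):
        return [min(range(k), key=lambda j: dist2(x, C[j])) for x in X]

    def tune(C):
        for _ in range(2):  # a couple of k-means iterations per trial suffice
            P = assign(C)
            C_next = []
            for j in range(k):
                members = [x for x, p in zip(X, P) if p == j]
                C_next.append(tuple(sum(v) / len(members) for v in zip(*members))
                              if members else C[j])
            C = C_next
        P = assign(C)
        return C, P, sum(dist2(x, C[p]) for x, p in zip(X, P))

    C, P, f = tune(rng.sample(X, k))
    for _ in range(T):
        C_new = list(C)
        C_new[rng.randrange(k)] = rng.choice(X)   # the random swap
        C_new, P_new, f_new = tune(C_new)
        if f_new < f:                             # accept only improvements
            C, P, f = C_new, P_new, f_new
    return C, P, f
```

The key difference to RKM: a swap moves a single centroid globally, so RS can escape the local minima where k-means alone cannot move centroids between distant clusters.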
57 Overall comparison: CI-values of k-means (Ste, Ran, Max, K++ and Lux initializations), RKM, RS and GA on A1-A3, S1-S4, Unbalance, Birch1, Birch2 and Dim32, with averages (table values lost in transcription).
58 Conclusions
59 Conclusions. How did k-means perform? 1. Overlap: good! 2. Number of clusters: bad! 3. Dimensionality: no change. 4. Unbalance of cluster sizes: bad!
60 References
J. MacQueen, "Some methods for classification and analysis of multivariate observations", Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1: Statistics, University of California Press, Berkeley, Calif., 1967.
S.P. Lloyd, "Least squares quantization in PCM", IEEE Trans. on Information Theory, 28 (2), 1982.
E. Forgy, "Cluster analysis of multivariate data: Efficiency vs. interpretability of classification", Biometrics, 21, 768, 1965.
M. Steinbach, L. Ertöz, V. Kumar, "The challenges of clustering high dimensional data", New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag.
U. von Luxburg, R.C. Williamson, I. Guyon, "Clustering: Science or Art?", J. Machine Learning Research, 27: 65-79, 2012.
P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, 2000.
P. Fränti and J. Kivijärvi, "Randomized local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 2000.
P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), November 2006.
T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 1997.
I. Kärkkäinen and P. Fränti, "Dynamic local search algorithm for the clustering problem", Research Report A-2002-6.
P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), May 2006.
P. Fränti, R. Mariescu-Istodor and C. Zhong, "XNN graph", IAPR Joint Int. Workshop on Structural, Syntactic, and Statistical Pattern Recognition, Merida, Mexico, LNCS 10029, November 2016.
M. Rezaei and P. Fränti, "Set-matching methods for external cluster validity", IEEE Trans. on Knowledge and Data Engineering, 28 (8), August 2016.
E. Chavez and G. Navarro, "A probabilistic spell for the curse of dimensionality", Workshop on Algorithm Engineering and Experimentation, LNCS 2153, 2001.
N. Tomasev, M. Radovanović, D. Mladenić and M. Ivanović, "The role of hubness in clustering high-dimensional data", IEEE Trans. on Knowledge and Data Engineering, 26 (3), March 2014.
D. Steinley, "Local optima in k-means clustering: what you don't know may hurt you", Psychological Methods, 8, 2003.
P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: Cluster level similarity measure", Pattern Recognition, 47 (9), 2014.
T. Gonzalez, "Clustering to minimize the maximum intercluster distance", Theoretical Computer Science, 38 (2-3), 1985.
61 Back-up material
62
63 K-means neighbors Estimation
64 Shrinking of the space. Clusters divided into two halves: in low dimensions the centroids are clearly separated; in high dimensions there is no separation anymore, which can even cause a 3:1 division. The centroids move toward the group center as dimensionality grows (examples from 2-d up to 1024-d).
65 Backup material (examples in 2-d, 4-d, 8-d, 16-d and 32-d).
66 How the k-means process goes
67 Time complexity
68 Efficiency of k-means. Total time to find the correct clustering = time per iteration × number of iterations. Time complexity of a single iteration: swap 2 = O(1); remove cluster 2kN/k = O(N); add cluster 2N = O(N); centroids 2N/k + 2N/k + 2 = O(N/k); k-means IkN = O(IkN): the bottleneck!
69 Efficiency of fast k-means. Total time to find the correct clustering = time per iteration × number of iterations. Time complexity of a single iteration: swap 2 = O(1); remove cluster 2kN/k = O(N); add cluster 2N = O(N); centroids 2N/k + 2N/k + 2 = O(N/k); (fast) k-means 4N = O(N), with 2 iterations only! [T. Kaukoranta, P. Fränti and O. Nevalainen, "A fast exact GLA based on code vector activity detection", IEEE Trans. on Image Processing, 9 (8), August 2000.]
70
71 Centroid index (CI): a cluster-level similarity measure [P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47 (9), September 2014].
72 Data sets visualized: S1-S4 (artificial), Unbalance, DIM32.
73
74
More informationFeature-Guided K-Means Algorithm for Optimal Image Vector Quantizer Design
Journal of Information Hiding and Multimedia Signal Processing c 2017 ISSN 2073-4212 Ubiquitous International Volume 8, Number 6, November 2017 Feature-Guided K-Means Algorithm for Optimal Image Vector
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray
More informationLecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, nd Edition by Tan, Steinbach, Karpatne, Kumar What is Cluster Analysis? Finding groups
More informationNORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM
NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM Saroj 1, Ms. Kavita2 1 Student of Masters of Technology, 2 Assistant Professor Department of Computer Science and Engineering JCDM college
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationHierarchical Document Clustering
Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationDATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm
DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)
More informationCluster Analysis (b) Lijun Zhang
Cluster Analysis (b) Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary
More informationK-Means Clustering. Sargur Srihari
K-Means Clustering Sargur srihari@cedar.buffalo.edu 1 Topics in Mixture Models and EM Mixture models K-means Clustering Mixtures of Gaussians Maximum Likelihood EM for Gaussian mistures EM Algorithm Gaussian
More informationStatistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.
Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or
More informationClustering in Data Mining
Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,
More informationCS Introduction to Data Mining Instructor: Abdullah Mueen
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts
More informationA Naïve Soft Computing based Approach for Gene Expression Data Analysis
Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2124 2128 International Conference on Modeling Optimization and Computing (ICMOC-2012) A Naïve Soft Computing based Approach for
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationCISC 4631 Data Mining
CISC 4631 Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F.
More informationFeature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule.
CS 188: Artificial Intelligence Fall 2007 Lecture 26: Kernels 11/29/2007 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit your
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Slides From Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining
More informationIndexing in Search Engines based on Pipelining Architecture using Single Link HAC
Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily
More informationExploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray
Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data
More informationDOCUMENT CLUSTERING USING HIERARCHICAL METHODS. 1. Dr.R.V.Krishnaiah 2. Katta Sharath Kumar. 3. P.Praveen Kumar. achieved.
DOCUMENT CLUSTERING USING HIERARCHICAL METHODS 1. Dr.R.V.Krishnaiah 2. Katta Sharath Kumar 3. P.Praveen Kumar ABSTRACT: Cluster is a term used regularly in our life is nothing but a group. In the view
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationIntroduction to Mobile Robotics
Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,
More informationEnhancing K-means Clustering Algorithm with Improved Initial Center
Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of
More information