K-means properties on six clustering benchmark datasets. Pasi Fränti and Sami Sieranoja, Algorithms, 2017.
1 K-means properties on six clustering benchmark datasets. Pasi Fränti and Sami Sieranoja, Algorithms, 2017.
2 Goal of k-means. Input: N points X = {x_1, x_2, ..., x_N}. Output: partition P = {p_1, p_2, ..., p_N} and k centroids C = {c_1, c_2, ..., c_k}. Objective function: SSE = Σ_{i=1..N} ||x_i − c_{p_i}||² (SSE = sum-of-squared errors).
3 Goal of k-means. Input: N points X = {x_1, x_2, ..., x_N}. Output: partition P = {p_1, p_2, ..., p_N} and k centroids C = {c_1, c_2, ..., c_k}. Objective function: SSE = Σ_{i=1..N} ||x_i − c_{p_i}||². Assumptions: SSE is a suitable objective, and k is known.
4 K-means algorithm. X = data set, C = cluster centroids, P = partition.

K-Means(X, C) → (C, P):
  REPEAT
    C_prev ← C
    FOR i = 1 TO N DO p_i ← FindNearest(x_i, C)          (assignment step)
    FOR j = 1 TO k DO c_j ← average of {x_i : p_i = j}   (centroid step)
  UNTIL C = C_prev
5 K-means optimization steps.
Assignment step: p_i ← arg min_{1≤j≤k} ||x_i − c_j||², for i = 1..N.
Centroid step: c_j ← Σ_{p_i = j} x_i / |{i : p_i = j}|, for j = 1..k.
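The two optimization steps above can be sketched in Python. This is a minimal Lloyd-style implementation for illustration, not the authors' code; the empty-cluster handling (keep the old centroid) is an assumption.

```python
def kmeans(X, C):
    """Plain k-means: alternate the assignment step and the centroid step
    until the centroids no longer move. X and C are lists of equal-length
    numeric tuples; returns the final centroids and partition labels."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    while True:
        # Assignment step: p_i <- index of the nearest centroid
        P = [min(range(len(C)), key=lambda j: dist2(x, C[j])) for x in X]
        # Centroid step: c_j <- mean of the points assigned to cluster j
        C_new = []
        for j in range(len(C)):
            members = [x for x, p in zip(X, P) if p == j]
            if members:
                C_new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                C_new.append(C[j])  # assumption: keep an emptied cluster's centroid
        if C_new == C:
            return C, P
        C = C_new
```

For example, `kmeans([(0,0),(0,1),(10,10),(10,11)], [(0,0),(10,10)])` converges in two passes to centroids (0, 0.5) and (10, 10.5).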
6 Problems of k-means: distance between clusters. K-means cannot move centroids between clusters that are far away from each other.
7 Problems of k-means: dependency on the initial solution (compare the initial solution with the result after k-means).
8 K-means performance: how is it affected by 1. overlap, 2. number of clusters, 3. dimensionality, 4. unbalance of cluster sizes?
9 Basic Benchmark
10 Data set statistics. Varying property per dataset: A = number of clusters; S = overlap; DIM = dimensions; G2 = dimensions + overlap; Birch = structure (100 clusters); Unbalance = balance.
11 A sets. Spherical clusters; the number of clusters varies from k=20 to 50: A1 k=20, A2 k=35, A3 k=50. Subsets of each other: A1 ⊂ A2 ⊂ A3. Other parameters fixed: cluster size, deviation, overlap; dimensionality = 2.
12 S sets. K=15 Gaussian clusters (a few truncated). Overlap increases from S1 (least overlap) to S4 (strong overlap, but the clusters are still recognizable).
13 Unbalance. K=8. Dense clusters (st.dev = 2043) and sparse clusters; the areas are well separated. *Correct clustering can be obtained by minimizing SSE.
14 DIM sets. K=16 well-separated clusters in high-dimensional spaces. Dimensions vary: 32, 64, 128, 256, 512, 1024 (example shown: DIM32).
15 G2 datasets. K=2. Dataset name: G2-dim-sd. Centroid 1: [500, 500, ...]; centroid 2: [600, 600, ...]. Dimensions: 1, 2, 4, 8, 16, ..., 1024. St.dev: 10, 20, 30, ..., 100.
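A G2-style dataset is straightforward to generate from the description above. The sketch below is illustrative (the 1024 points per cluster and the seed handling are assumptions consistent with the 2048-point example later in the deck), not the benchmark's generator.

```python
import random

def make_g2(dim, sd, n_per_cluster=1024, seed=0):
    """G2-style data: two Gaussian clusters in `dim` dimensions, centred at
    [500,...,500] and [600,...,600], both with standard deviation `sd`."""
    rng = random.Random(seed)
    def cluster(mean):
        return [[rng.gauss(mean, sd) for _ in range(dim)]
                for _ in range(n_per_cluster)]
    X = cluster(500) + cluster(600)          # 2 * n_per_cluster points in total
    labels = [0] * n_per_cluster + [1] * n_per_cluster
    return X, labels
```

E.g. `make_g2(2, 10)` corresponds to G2-2-10: 2048 two-dimensional points in two barely overlapping blobs.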
16 Birch. Birch1: regular 10×10 grid, constant variance. Birch2: clusters placed along a sine curve, y(x) = amplitude * sin(2*π*frequency*x + phaseshift) + offset.
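The Birch2 centroid layout can be sketched directly from the sine formula. The constants (amplitude, frequency, phase shift, offset) and the even spacing of x positions are placeholder assumptions here, since the slide's actual values did not survive transcription.

```python
import math

def birch2_centroids(k=100, amplitude=1.0, frequency=1.0, phase=0.0, offset=0.0):
    """Centroid locations on a sine curve, as in the Birch2 layout.
    Parameter defaults are illustrative placeholders, not the benchmark's
    actual constants."""
    xs = [i / (k - 1) for i in range(k)]  # k evenly spaced x positions (assumption)
    return [(x, amplitude * math.sin(2 * math.pi * frequency * x + phase) + offset)
            for x in xs]
```

Each centroid would then be surrounded by a Gaussian blob of points to form the full dataset.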
17 Birch2 subsets. B2-random: random subsampling, N decreasing down to 2 000 and 1 000. B2-sub: cutting off the last cluster, k = 100, 99, 98, ..., 3, 2, 1.
18 Properties
19 Measured properties: overlap, contrast, intrinsic dimensionality, H-index, distance profiles.
20 Overlap by misclassification probability: points from the blue cluster that are closer to the red centroid, plus points from the red cluster that are closer to the blue centroid. Example: points = 2048, incorrect = 20, overlap = 20 / 2048 ≈ 1%.
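The misclassification-based overlap above reduces to counting points whose nearest centroid differs from their own cluster's. A minimal sketch (illustrative helper, not the authors' code):

```python
def overlap_misclassification(X, labels, centroids):
    """Overlap = fraction of points that lie closer to some other cluster's
    centroid than to their own cluster's centroid."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    wrong = sum(
        1 for x, lab in zip(X, labels)
        if min(range(len(centroids)), key=lambda j: dist2(x, centroids[j])) != lab
    )
    return wrong / len(X)
```

With 20 misclassified points out of 2048 this returns roughly 0.01, matching the slide's 1% example.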
21 Overlap by separation: overlap = (1/N) Σ_{i=1..N} d1(i)/d2(i), where d1 = distance to the nearest centroid and d2 = distance to the 2nd nearest. Example: d1 = 0.25, d2 = 0.75.
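The separation measure can be sketched as below. Note the averaged ratio d1/d2 is a reconstruction of the slide's garbled formula, so treat the exact form as an assumption.

```python
import math

def overlap_separation(X, centroids):
    """For each point take d1 = distance to its nearest centroid and
    d2 = distance to the second nearest, then average d1/d2 over all points.
    A ratio near 1 means the two nearest centroids are almost equidistant,
    i.e. the clusters overlap strongly."""
    total = 0.0
    for x in X:
        d = sorted(math.dist(x, c) for c in centroids)
        total += d[0] / d[1]
    return total / len(X)
```

For the slide's example point (d1 = 0.25, d2 = 0.75) the per-point ratio is 1/3.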
22 Contrast. contrast = median[(d_max − d_min) / d_min], computed over the pairwise distances. Figures: contrast as a function of dimensions for the DIM and G2 sets (G2 curves for sd = 30, 70, ...): contrast decreases as dimensionality grows.
23 Intrinsic dimensionality. ID = d̂² / (2σ²), where d̂ is the average and σ² the variance of the pairwise distances. (Figure: ID values for the Unbalance, DIM, Birch, S and A sets.)
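The ID formula above (Chavez-style: squared mean over twice the variance of pairwise distances) is easy to compute exactly on small data. A brute-force sketch over all pairs; for large N one would sample pairs instead:

```python
import math
from itertools import combinations

def intrinsic_dim(X):
    """Intrinsic dimensionality: ID = mean(d)^2 / (2 * var(d)),
    computed over all pairwise distances d in the data set."""
    d = [math.dist(a, b) for a, b in combinations(X, 2)]
    mean = sum(d) / len(d)
    var = sum((v - mean) ** 2 for v in d) / len(d)
    return mean * mean / (2 * var)
```

For 20 evenly spaced points on a line the pairwise-distance mean is 7 and the variance 21, so ID = 49/42 ≈ 1.17, close to the true dimensionality of 1.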
24 H-index. Based on hubness values: how often each point appears among other points' nearest neighbors; points are ranked by their hub counts.
25 Distance profiles. Data that contains clusters tends to have two peaks: local distances (inside the clusters) and global distances (across different clusters). Shown for A1, A2, A3, S1, S2, S3, S4, Birch1, Birch2, DIM32 and Unbalance.
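A distance profile is just a histogram of all pairwise distances; the two-peak shape described above falls out directly. A small sketch (the bin count is an arbitrary choice):

```python
import math
from itertools import combinations

def distance_profile(X, bins=10):
    """Histogram of all pairwise distances. Clustered data tends to show two
    peaks: small within-cluster distances and large between-cluster ones."""
    d = [math.dist(a, b) for a, b in combinations(X, 2)]
    hi = max(d)
    counts = [0] * bins
    for v in d:
        counts[min(int(v / hi * bins), bins - 1)] += 1
    return counts
```

Two tight, well-separated clusters put all mass into the first and last bins, i.e. a clearly bimodal profile.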
26 Distance profiles, G2 datasets. Top: dimension increases (D = 2, 4, 8, 16, 32, ..., 1024). Bottom: overlap increases (sd = 10, 20, 30, 40, 50, 60), shown for D = 2 and D = 128.
27 Summary of the properties: Overlap, Contrast, Intrinsic dim. and H-index tabulated for each dataset (A1, A2, A3, S1, S2, S3, S4, DIM, G2*, Birch1, Birch2, Unbalance).
28 G2 overlap. Overlap increases with st.dev (downwards) and decreases with dimension (left to right):

sd \ dim (increasing →)
10:   0%   0%   0%  0%  0%    0%  0%  0%  0%  0%
20:   0%   0%   0%  0%  0%    0%  0%  0%  0%  0%
30:   1%   0%   0%  0%  0%    0%  0%  0%  0%  0%
40:   4%   1%   0%  0%  0%    0%  0%  0%  0%  0%
50:   8%   2%   0%  0%  0%    0%  0%  0%  0%  0%
60:  12%   4%   1%  0%  0%    0%  0%  0%  0%  0%
70:  15%   8%   2%  0%  0%    0%  0%  0%  0%  0%
80:  19%   9%   4%  1%  0%    0%  0%  0%  0%  0%
90:  22%  12%   6%  2%  0%    0%  0%  0%  0%  0%
100:  –   15%   7%  2%  0.1%  0%  0%  0%  0%  0%
29 G2 contrast: contrast decreases as dimension increases (table of values by sd and dim).
30 G2 intrinsic dimensionality: ID increases with dimension (if there is overlap); the effect is most significant at high dimensions (table of values by sd and dim).
31 G2 H-index: H-index increases with dimension (no change otherwise); the effect is most significant at high dimensions (table of values by sd and dim).
32 Evaluation
33 Internal measures.
Sum of squared errors: SSE = Σ_{i=1..N} ||x_i − c_{p_i}||².
Normalized mean square error: nMSE = SSE / (N·D).
Approximation ratio: ε = SSE / SSE_opt.
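The two computable internal measures above can be sketched in a few lines (the approximation ratio additionally needs a reference SSE_opt). This helper is illustrative, not the authors' evaluation code:

```python
def internal_measures(X, P, C, D):
    """SSE and normalised MSE = SSE / (N * D) for points X, partition labels P,
    centroids C and dimensionality D. Given a reference optimum, the
    approximation ratio would be sse / sse_opt."""
    sse = sum(sum((xi - ci) ** 2 for xi, ci in zip(x, C[p]))
              for x, p in zip(X, P))
    nmse = sse / (len(X) * D)
    return sse, nmse
```

For two 2-d points at distance 1 from their shared centroid, SSE = 2 and nMSE = 2 / (2·2) = 0.5.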
34 External measures: centroid index (CI), a cluster-level similarity measure [P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 2014]. Example: CI = 4 (too many centroids in some regions, missing centroids in others).
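The centroid index compares two centroid sets at the cluster level: map each centroid of one set to its nearest centroid in the other, count the centroids that are never mapped to (orphans), and symmetrise. A sketch following that definition (an assumption based on the cited 2014 paper, not the reference implementation):

```python
def centroid_index(C1, C2):
    """Centroid index: number of cluster-level mismatches between two
    centroid sets. CI = 0 means a structurally correct clustering."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    def orphans(A, B):
        # centroids of B never chosen as nearest by any centroid of A
        hit = {min(range(len(B)), key=lambda j: dist2(a, B[j])) for a in A}
        return len(B) - len(hit)
    return max(orphans(C1, C2), orphans(C2, C1))
```

If two centroids of C1 sit in the same true cluster while another cluster has none, exactly one centroid of C2 is orphaned and CI = 1.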
35 External measures: success rate = share of runs that reach CI = 0 (here 17%). Example runs: CI = 1, 2, 1, 0, 2, 2.
36 Results
37 Summary of results: success rate, CI and relative CI (the ARI, SSE and nMSE columns did not survive transcription):

Dataset     Success  CI    rel-CI
A1           1%      –      –
A2           0%      –      –
A3           0%      –      –
S1           3%      –      –
S2          11%      1.4    9%
S3          12%      1.3    9%
S4          26%      0.9    6%
Unbalance    0%      –      –
Birch1       0%      6.6    7%
Birch2       0%      –      –
Dim32        0%      –      –
Average:     5%      –      –
38 Dependency on overlap (S datasets). Success rates as overlap increases: S1 3%, S2 11%, S3 12%, S4 26%.
39 Dependency on overlap (G2 datasets): success rate increases with overlap (figure: success 0-100% vs. overlap 0.0%-10.0%+, starting from about 2%).
40 Main observation 1. Overlap: overlap is good!
41 Dependency on the number of clusters k (A datasets). k increases: A1 k=20, A2 k=35, A3 k=50. Success: 1%, 0%, 0%. Relative CI: 13%, 13%, 13%.
42 Dependency on the number of clusters k (Birch2 subsets): CI-value of k-means and repeated k-means as a function of k (figure).
43 Dependency on data size N (Birch2 subsets): CI-value of k-means and repeated k-means as a function of data size in thousands (figure).
44 Main observation 2. Number of clusters: the error (CI) increases linearly with k!
45 Dependency on dimensions (DIM datasets). As dimensionality increases, the success rate stays at 0%.
46 Dependency on dimensions (G2 datasets). Success rate by sd (rows) and dimension (columns, increasing): success degrades with dimension but improves with sd (overlap).

sd \ dim (increasing →)
10:   0%   0%   0%   0%  0%  0%  0%  0%  0%  0%
20:   0%   0%   0%   0%  0%  0%  0%  0%  0%  0%
30:  35%   1%   0%   0%  0%  0%  0%  0%  0%  0%
40:  99%   7%   0%   0%  0%  0%  0%  0%  0%  0%
50:  99%  80%   1%   0%  0%  0%  0%  0%  0%  0%
60:  79%  84%  16%   1%  0%  0%  0%  0%  0%  0%
70:  95%  86%  56%   1%  0%  0%  0%  0%  0%  0%
80:  63%  93%  94%   5%  0%  0%  0%  0%  0%  0%
90:   –   89%  69%  16%  0%  0%  0%  0%  0%  0%
100:  –   91%  71%  12%  1%  0%  0%  0%  0%  0%
47 Lack of overlap is the cause! The G2 overlap values (slide 28) and the G2 success rates (slide 46), both indexed by sd and dimension, correlate strongly: correlation 0.88.
48 Main observation 3. Dimensionality: no direct effect!
49 Effect of unbalance (Unbalance dataset). K-means tends to put too many centroids in the dense area and too few in the sparse clusters. Success: 0%; average CI: 3.9. The problem originates from the random initialization.
50 Main observation 4. Unbalance of cluster sizes: unbalance is bad!
51 Improving k-means
52 Better initialization techniques. Simple initializations: random centroids [Forgy, MacQueen]; furthest-point heuristic (maxmin) [Gonzalez]. More complex: k-means++ [Arthur and Vassilvitskii]; Luxburg [Luxburg].
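The furthest-point (maxmin) heuristic mentioned above is the simplest of these to sketch: grow the centroid set by repeatedly adding the point furthest from all centroids chosen so far. An illustrative implementation (starting from the first point for determinism; in practice the first centroid is a random point):

```python
def maxmin_init(X, k):
    """Furthest-point (maxmin) initialization: each new centroid is the data
    point that maximises the distance to its nearest already-chosen centroid."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    C = [X[0]]  # a random point in practice; the first point here for determinism
    while len(C) < k:
        C.append(max(X, key=lambda x: min(dist2(x, c) for c in C)))
    return C
```

The resulting centroids are spread across the data, which avoids the random initialization's tendency to drop several centroids into one dense cluster.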
53 Initialization techniques, varying N: CI-value of Random, K-means++, Maxmin and Luxburg as a function of data size in thousands (figure).
54 Initialization techniques, varying k: relative CI-value (0-20%) of Random, Luxburg, K-means++ and Maxmin as a function of the number of clusters k (figure).
55 Repeated k-means (RKM). Repeat k-means 100 times from different random initializations and keep the best result. Can increase the chances of success significantly; in principle, running forever would solve the problem. Limitations remain when k is large (figures: CI vs. number of clusters and CI vs. data size, Birch2 subsets).
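RKM as described above is a one-line wrapper around k-means: run it from many random initializations and keep the lowest-SSE result. A self-contained sketch (the inner k-means and the sampled-points initialization are illustrative assumptions):

```python
import random

def repeated_kmeans(X, k, repeats=100, seed=0):
    """Repeated k-means: run k-means from `repeats` random initializations
    (k distinct data points each) and keep the lowest-SSE solution."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def kmeans(C):
        while True:
            P = [min(range(k), key=lambda j: dist2(x, C[j])) for x in X]
            C_next = []
            for j in range(k):
                members = [x for x, p in zip(X, P) if p == j]
                C_next.append(tuple(sum(v) / len(members) for v in zip(*members))
                              if members else C[j])
            if C_next == C:
                return sum(dist2(x, C[p]) for x, p in zip(X, P)), C, P
            C = C_next

    # min over (sse, centroids, partition) tuples picks the lowest SSE
    return min(kmeans(rng.sample(X, k)) for _ in range(repeats))
```

Each restart costs a full k-means run, which is why the slide notes that RKM stops helping when k grows: the probability that one random initialization is structurally correct shrinks rapidly with k.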
56 A better algorithm: Random Swap (RS)

RandomSwap(X) → (C, P):
  C ← SelectRandomRepresentatives(X)
  P ← OptimalPartition(X, C)
  REPEAT T times
    (C_new, j) ← RandomSwap(X, C)
    P_new ← LocalRepartition(X, C_new, P, j)
    (C_new, P_new) ← KMeans(X, C_new, P_new)
    IF f(C_new, P_new) < f(C, P) THEN (C, P) ← (C_new, P_new)
  RETURN (C, P)

Genetic Algorithm (GA)

GeneticAlgorithm(X) → (C, P):
  FOR i ← 1 TO Z DO
    C_i ← RandomCodebook(X)
    P_i ← OptimalPartition(X, C_i)
  SortSolutions({C, P})
  REPEAT
    {C, P} ← CreateNewSolutions({C, P})
    SortSolutions({C, P})
  UNTIL no improvement

CreateNewSolutions({C, P}) → {C_new, P_new}:
  (C_new_1, P_new_1) ← (C_1, P_1)
  FOR i ← 2 TO Z DO
    (a, b) ← SelectNextPair
    (C_new_i, P_new_i) ← Cross(C_a, P_a, C_b, P_b)
    IterateKMeans(C_new_i, P_new_i)

Cross(C_1, P_1, C_2, P_2) → (C_new, P_new):
  C_new ← CombineCentroids(C_1, C_2)
  P_new ← CombinePartitions(P_1, P_2)
  C_new ← UpdateCentroids(C_new, P_new)
  RemoveEmptyClusters(C_new, P_new)
  IS(C_new, P_new)

CombineCentroids(C_1, C_2): C_new ← C_1 ∪ C_2
CombinePartitions(C_new, P_1, P_2): FOR i ← 1 TO N DO p_i_new ← p_i_1 IF ||x_i − c_{p_i_1}||² ≤ ||x_i − c_{p_i_2}||², ELSE p_i_2
UpdateCentroids(C_new, P_new): FOR j ← 1 TO |C_new| DO c_j_new ← CalculateCentroid(P_new, j)

Full implementation: cs.uef.fi/pages/franti/research/ga.txt
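The Random Swap loop can be sketched compactly. This is a simplified assumption-laden version of the pseudocode above: the swap picks a random data point (rather than a dedicated heuristic), and "local repartition + k-means" is approximated by two full k-means iterations per trial.

```python
import random

def random_swap(X, k, T=50, seed=0):
    """Random Swap sketch: trial swaps of one centroid to a random data point,
    each followed by two k-means iterations; a swap is kept only when the SSE
    improves."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def assign(C):
        return [min(range(k), key=lambda j: dist2(x, C[j])) for x in X]

    def tune(C):
        for _ in range(2):  # a couple of k-means iterations per trial suffice
            P = assign(C)
            C_next = []
            for j in range(k):
                members = [x for x, p in zip(X, P) if p == j]
                C_next.append(tuple(sum(v) / len(members) for v in zip(*members))
                              if members else C[j])
            C = C_next
        P = assign(C)
        return C, P, sum(dist2(x, C[p]) for x, p in zip(X, P))

    C, P, f = tune(rng.sample(X, k))
    for _ in range(T):
        C_new = list(C)
        C_new[rng.randrange(k)] = rng.choice(X)   # the random swap
        C_new, P_new, f_new = tune(C_new)
        if f_new < f:                             # accept only improvements
            C, P, f = C_new, P_new, f_new
    return C, P, f
```

The key difference to RKM: a swap moves a single centroid globally, so RS can escape the local minima where k-means alone cannot move centroids between distant clusters.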
57 Overall comparison: CI-values of k-means (Ste, Ran, Max, K++ and Lux initializations), RKM, RS and GA on A1-A3, S1-S4, Unbalance, Birch1, Birch2 and Dim32, with averages (table values lost in transcription).
58 Conclusions
59 Conclusions. How did k-means perform? 1. Overlap: good! 2. Number of clusters: bad! 3. Dimensionality: no change. 4. Unbalance of cluster sizes: bad!
60 References
J. MacQueen, "Some methods for classification and analysis of multivariate observations", Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1: Statistics, University of California Press, Berkeley, Calif., 1967.
S.P. Lloyd, "Least squares quantization in PCM", IEEE Trans. on Information Theory, 28 (2), 1982.
E. Forgy, "Cluster analysis of multivariate data: Efficiency vs. interpretability of classification", Biometrics, 21, 768, 1965.
M. Steinbach, L. Ertöz, V. Kumar, "The challenges of clustering high dimensional data", New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag.
U. von Luxburg, R.C. Williamson, I. Guyon, "Clustering: Science or Art?", J. Machine Learning Research, 27: 65-79, 2012.
P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, 2000.
P. Fränti and J. Kivijärvi, "Randomized local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 2000.
P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), November 2006.
T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 1997.
I. Kärkkäinen and P. Fränti, "Dynamic local search algorithm for the clustering problem", Research Report A-2002-6.
P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), May 2006.
P. Fränti, R. Mariescu-Istodor and C. Zhong, "XNN graph", IAPR Joint Int. Workshop on Structural, Syntactic, and Statistical Pattern Recognition, Merida, Mexico, LNCS 10029, November 2016.
M. Rezaei and P. Fränti, "Set-matching methods for external cluster validity", IEEE Trans. on Knowledge and Data Engineering, 28 (8), August 2016.
E. Chavez and G. Navarro, "A probabilistic spell for the curse of dimensionality", Workshop on Algorithm Engineering and Experimentation, LNCS 2153, 2001.
N. Tomasev, M. Radovanović, D. Mladenić and M. Ivanović, "The role of hubness in clustering high-dimensional data", IEEE Trans. on Knowledge and Data Engineering, 26 (3), March 2014.
D. Steinley, "Local optima in k-means clustering: what you don't know may hurt you", Psychological Methods, 8, 2003.
P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: Cluster level similarity measure", Pattern Recognition, 47 (9), 2014.
T. Gonzalez, "Clustering to minimize the maximum intercluster distance", Theoretical Computer Science, 38 (2-3), 1985.
61 Back-up material
62
63 K-means neighbors Estimation
64 Shrinking of the space. Clusters divided into two halves: in low dimensions the centroids are clearly separated; in high dimensions there is no separation anymore, which can even cause a 3:1 division. The centroids move toward the group center as dimensionality grows (examples from 2-d up to 1024-d).
65 Backup material (examples in 2-d, 4-d, 8-d, 16-d and 32-d).
66 How the k-means process goes
67 Time complexity
68 Efficiency of k-means. Total time to find the correct clustering = time per iteration × number of iterations. Time complexity of a single iteration: swap 2 = O(1); remove cluster 2kN/k = O(N); add cluster 2N = O(N); centroids 2N/k + 2N/k + 2 = O(N/k); k-means IkN = O(IkN): the bottleneck!
69 Efficiency of fast k-means. Total time to find the correct clustering = time per iteration × number of iterations. Time complexity of a single iteration: swap 2 = O(1); remove cluster 2kN/k = O(N); add cluster 2N = O(N); centroids 2N/k + 2N/k + 2 = O(N/k); (fast) k-means 4N = O(N), with 2 iterations only! [T. Kaukoranta, P. Fränti and O. Nevalainen, "A fast exact GLA based on code vector activity detection", IEEE Trans. on Image Processing, 9 (8), August 2000.]
70
71 Centroid index (CI): a cluster-level similarity measure [P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47 (9), September 2014].
72 Data sets visualized: S1-S4 (artificial), Unbalance, DIM32.
73
74
More informationFeature-Guided K-Means Algorithm for Optimal Image Vector Quantizer Design
Journal of Information Hiding and Multimedia Signal Processing c 2017 ISSN 2073-4212 Ubiquitous International Volume 8, Number 6, November 2017 Feature-Guided K-Means Algorithm for Optimal Image Vector
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray
More informationLecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, nd Edition by Tan, Steinbach, Karpatne, Kumar What is Cluster Analysis? Finding groups
More informationNORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM
NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM Saroj 1, Ms. Kavita2 1 Student of Masters of Technology, 2 Assistant Professor Department of Computer Science and Engineering JCDM college
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationHierarchical Document Clustering
Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationDATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm
DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)
More informationCluster Analysis (b) Lijun Zhang
Cluster Analysis (b) Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary
More informationK-Means Clustering. Sargur Srihari
K-Means Clustering Sargur srihari@cedar.buffalo.edu 1 Topics in Mixture Models and EM Mixture models K-means Clustering Mixtures of Gaussians Maximum Likelihood EM for Gaussian mistures EM Algorithm Gaussian
More informationStatistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.
Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or
More informationClustering in Data Mining
Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,
More informationCS Introduction to Data Mining Instructor: Abdullah Mueen
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts
More informationA Naïve Soft Computing based Approach for Gene Expression Data Analysis
Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2124 2128 International Conference on Modeling Optimization and Computing (ICMOC-2012) A Naïve Soft Computing based Approach for
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationCISC 4631 Data Mining
CISC 4631 Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F.
More informationFeature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule.
CS 188: Artificial Intelligence Fall 2007 Lecture 26: Kernels 11/29/2007 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit your
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Slides From Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining
More informationIndexing in Search Engines based on Pipelining Architecture using Single Link HAC
Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily
More informationExploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray
Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data
More informationDOCUMENT CLUSTERING USING HIERARCHICAL METHODS. 1. Dr.R.V.Krishnaiah 2. Katta Sharath Kumar. 3. P.Praveen Kumar. achieved.
DOCUMENT CLUSTERING USING HIERARCHICAL METHODS 1. Dr.R.V.Krishnaiah 2. Katta Sharath Kumar 3. P.Praveen Kumar ABSTRACT: Cluster is a term used regularly in our life is nothing but a group. In the view
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationIntroduction to Mobile Robotics
Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,
More informationEnhancing K-means Clustering Algorithm with Improved Initial Center
Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of
More information