Pasi Fränti K-means properties on six clustering benchmark datasets Pasi Fränti and Sami Sieranoja Algorithms, 2017.

Size: px
Start display at page:

Download "Pasi Fränti K-means properties on six clustering benchmark datasets Pasi Fränti and Sami Sieranoja Algorithms, 2017."

Transcription

1 K-means properties Pasi Fränti K-means properties on six clustering benchmark datasets Pasi Fränti and Sami Sieranoja Algorithms, 2017.

2 Goal of k-means Input N points: X={x 1, x 2,, x N } Output partition and k centroids: P={p 1, p 2,, p k } C={c 1, c 2,, c k } Objective function: SSE N i1 x i c j 2 SSE = sum-of-squared errors

3 Goal of k-means Input N points: X={x 1, x 2,, x N } Output partition and k centroids: P={p 1, p 2,, p k } C={c 1, c 2,, c k } Objective function: SSE N i1 x i c j 2 Assumptions: SSE is suitable k is known

4 K-means algorithm X = Data set C = Cluster centroids P = Partition K-Means(X, C) (C, P) REPEAT C prev C; FOR i=1 TO N DO p i FindNearest(x i, C); FOR j=1 TO k DO c j Average of x i p i = j; Assignment step Centroid step UNTIL C = C prev

5 N i c x P j i k j i, 1 arg min 2 1 k j x c, 1 1 K-means optimization steps Assignment step: Centroid step: k j x c j P j P i j i i 1 1,

6 Problems of k-means Distance of clusters Cannot move centroids between clusters far away

7 Problems of k-means Dependency of initial solution Initial solution: After k-means:

8 K-means performance How affected by? 1. Overlap 3. Dimensionality 2. Number of clusters 4. Unbalance of cluster sizes

9 Basic Benchmark

10 Data sets statistics Dataset Varying Size Clusters Per cluster A Number of clusters S Overlap Dim Dimensions G2 Dimensions + overlap Birch Structure 100, Unbalance Balance

11 A sets K=20 K=35 K=50 A 1 A 2 A 3 Spherical clusters Number of clusters changing from k=20 to 50 Subsets of each other: A1 A2 A3. Other parameters fixed: - Cluster size = Deviation = Overlap = Dimensionality = 2

12 S sets K=15 Gaussian clusters (few truncated) S 1 S 2 S 3 S 4 overlap increases Least overlap Strong overlap but the clusters still recognizable

13 Unbalance K=8 Dense clusters st.dev=2043 Sparse clusters st.dev= Areas well-separated *Correct clustering can be obtained by minimizing SSE

14 DIM sets K=16 DIM 32 Well-separated clusters in high-dimensional spaces Dimensions vary: 32, 64, 128, 256, 512, 1024

15 G2 Datasets K=2 G G G , ,500 Dataset name: G2-dim-sd Centroid 1: [500,500,...] Centroid 2: [600,600,...] Dimensions: 1,2,4,8,16, St.dev. 10,20,30,

16 Birch Birch 1 Birch 2 Regular 10x10 grid Constant variance offset = amplitude = phaseshift = frequency = y(x) = amplitude * sin(2**frequency*x + phaseshift) + offset

17 Birch 2 subsets B2-random N= N= N= N= B2-sub k=100 k=99 k=98 k= Random subsampling. k=3 k=96 k=95 Cutting off last cluster N=2 000 N=1 000 k=2 k=1

18 Properties

19 Measured properties Overlap Contrast Intrinsic dimensionality H-index Distance profiles

20 Overlap by misclassification probability Points from blue cluster that are closer to red centroid. Points from red cluster that are closer to red centroid. Points = 2048 Incorrect = 20 Overlap = 20 / %

21 overlap 1 N Overlap by separation d d 1 2 d 1 = distance to nearest centroid d 2 = distance to 2 nd nearest 2 d 1 = 0.25 d 2 = 0.75 d 1 = 0.50 d 2 = d 1 = 0.25 d 2 = 0.75

22 Contrast contrast median d max d d min min 1000 DIM 1000 G2 Contrast Contrast Dimensions 1 sd=30 sd=70 sd= Dimensions

23 ID 2 dˆ 2 Intrinsic dimensionality 2 dˆ Average of distances 2 Variance of distances Unbalance DIM 0.4 Birch 1 Birch S sets 8.3 A 1 A 2 A

24 H-index 2 Hubness values: Rank: Hub:

25 Distance profiles Data that contains clusters tends to have two peaks: Local distances: distances inside the clusters Global distances: distances across different clusters A 1 A 2 A 3 S 1 S 2 S 3 S 4 Birch 1 Birch 2 DIM 32 Unbalance

26 Distance profiles G2 datasets G2: dimension increases D=2 D=4 D=8 D=16 D=32 D=1024 G2: overlap increases sd=10 sd=20 sd=30 sd=40 sd=50 sd=60 D=2 sd=10 sd=20 sd=30 sd=40 sd=50 sd=60 D=128

27 Summary of the properties Dataset Overlap Contrast Intrinsic dim. H-index A A A S S S S Dim G * Birch Birch Unbalance

28 G2 overlap Overlap decreases \dim % 0% 0% 0% 0% 0% 0% 0% 0% 0% 20 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 30 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 40 4% 1% 0% 0% 0% 0% 0% 0% 0% 0% 50 8% 2% 0% 0% 0% 0% 0% 0% 0% 0% 60 12% 4% 1% 0% 0% 0% 0% 0% 0% 0% 70 15% 8% 2% 0% 0% 0% 0% 0% 0% 0% 80 19% 9% 4% 1% 0% 0% 0% 0% 0% 0% 90 22% 12% 6% 2% 0% 0% 0% 0% 0% 0% % 15% 7% 2% 0.1% 0% 0% 0% 0% 0% Overlap incre eases

29 G2 contrast Contrast decreases \dim Contrast decre eases

30 G2 Intrinsic dimensionality ID increases (if overlap) \dim ID increas ses Most significant

31 G2 H-index H-index increases \dim No chang ge Most significant

32 Evaluation

33 Internal measures Sum of squared distances (SSE) SSE N i1 x i c j 2 Normalized mean square error (nmse) nmse SSE N D Approximation ratio () SSE SSE SSE opt opt

34 P. Fränti, M. Rezaei, Q. Zhao Centroid index: cluster level similarity measure Pattern Recognition 2014 External measures Centroid index CI=4 Too many centoids Missing centroids

35 External measures Success rate 17% CI=1 CI=2 CI=1 CI=0 CI=2 CI=2

36 Results

37 Summary of results Dataset Success Clustering quality Objective function CI rel-ci ARI SSE nmse A1 1% % A2 0% % A3 0% % S1 3% % S2 11% 1.4 9% S3 12% 1.3 9% S4 26% 0.9 6% Unbalance 0% % Birch1 0% 6.6 7% Birch2 0% % Dim32 0% % Average: 5% %

38 Dependency on overlap S datasets Success rates and CI-values: overlap increases S 1 S 2 S 3 S 4 3% 11% 12% 26%

39 Dependency on overlap G2 datasets Succ cess 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 2% 0.0 % 0.1 % 1.0 % 10.0 % % Overlap

40 Main observation 1. Overlap Overlap is good!

41 Dependency on clusters (k) A datasets Clusters increases K=20 K=35 K=50 A 1 A 2 A 3 Success: CI: 1% 0% 0% Relative CI: 13% 13% 13%

42 Dependency on clusters (k) Birch2 subsets CI-va alue K-means Repeated k-means Number of clusters (k)

43 Dependency on data size (N) CI-val lue K-means Repeated k-means Birch2 subsets Size of data (thousands)

44 Main observation 2. Number of clusters Linear increase with k!

45 Dependency on dimensions DIM datasets Dimensions increases CI: Success rate: 0%

46 Dependency on dimensions G2 datasets Success degrades \dim % 0% 0% 0% 0% 0% 0% 0% 0% 0% 20 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 30 35% 1% 0% 0% 0% 0% 0% 0% 0% 0% 40 99% 7% 0% 0% 0% 0% 0% 0% 0% 0% 50 99% 80% 1% 0% 0% 0% 0% 0% 0% 0% 60 79% 84% 16% 1% 0% 0% 0% 0% 0% 0% 70 95% 86% 56% 1% 0% 0% 0% 0% 0% 0% 80 63% 93% 94% 5% 0% 0% 0% 0% 0% 0% % 89% 69% 16% 0% 0% 0% 0% 0% 0% % 91% 71% 12% 1% 0% 0% 0% 0% 0% Success impr roves

47 Lack of overlap is the cause! \dim % 40 4% 1% 50 8% 2% Overlap 60 12% 4% 1% 70 15% 8% 2% 80 19% 9% 4% 1% 90 22% 12% 6% 2% % 15% 7% 2% 0.1% Correlation: 0.88 \dim % 1% 40 99% 7% 50 99% 80% 1% 60 79% 84% 16% 1% Success 70 95% 86% 56% 1% 80 63% 93% 94% 5% % 89% 69% 16% % 91% 71% 12% 1%

48 Main observation 3. Dimensionality No direct effect!

49 Effect of unbalance DIM datasets K-means tend to put too many clusters here and too few here Success: Average CI: 0% 3.9 Problem originates from the random initialization.

50 Main observation 4. Unbalance of cluster sizes Unbalance is bad!

51 Improving k-means

52 Better initialization technique Simple initializations: Random centroids (Random) [Forgy][MacQueen] Further point heuristic (max) [Gonzalez] More complex: K-means++ [Vasilievski] Luxburg [Luxburg]

53 Initialization techniques Varying N CI-v value Random K-means++ Maxmin Luxburg Data size (thousands)

54 Initialization techniques Varying k Relative CI-value 20% 15% 10% 5% 0% Random Luxburg K-means++ Maxmin Clusters (k)

55 Repeated k-means (RKM) K-means Repeat 100 times Can increase changes to success significantly In principle, running forever would solve Limitations if k is large CI-value Birch2 subsets K-means Repeated k-means Number of clusters (k) CI-value K-means Repeated k-means Size of data (thousands) Birch2 subsets

56 Random Swap (RS) Random Swap(X) C, P A better algorithm C Select random representatives(x); P Optimal partition(x, C); REPEAT T times (C new,j) Random swap(x, C); P new Local repartition(x, C new, P, j); C new, P new Kmeans(X, C new, P new ); IF f(c new, P new ) < f(c, P) THEN (C, P) C new, P new ; RETURN (C, P); Genetic Algorithm (GA) GeneticAlgorithm(X) (C, P) FOR i1 TO Z DO C i RandomCodebook(X); P i OptimalPartition(X, C i ); SortSolutions(C,P); REPEAT {C,P} CreateNewSolutions( {C,P} ); SortSolutions(C,P); UNTIL no improvement; CreateNewSolutions({C, P}) {C new, P new } C new-1, P new-1 C 1, P 1 ; FOR i2 TO Z DO (a,b) SelectNextPair; C new-i, P new-i Cross(C a, P a, C b, P b ); IterateK-Means(C new-i, P new-i ); CombineCentroids(C 1, C 2 ) C new C new C 1 C 2 CombinePartitions(C new, P 1, P 2 ) P new FOR i1 TO N DO 2 2 IF x c 1 x c 2 THEN pi ELSE pi END-FOR i p i i pi new new p 1 i p 2 UpdateCentroids(C 1, C 2 ) C new FOR j1 TO C new DO c j new CalculateCentroid(P new, j ); i Cross(C 1, P 1, C 2, P 2 ) (C new, P new ) C new CombineCentroids(C 1, C 2 ); P new CombinePartitions(P 1, P 2 ); C new UpdateCentroids(C new, P new ); RemoveEmptyClusters(C new, P new ); IS(C new, P new ); cs.uef.fi/pages/franti/research/ga.txt

57 Overall comparison CI-values Dataset K-means RKM RS GA Ste Ran Max K++ Lux A A A S S S S Unbalance Birch Birch Dim Average:

58 Conclusions

59 Conclusions How did K-means perform? 1. Overlap 3. Dimensionality Good! No change 2. Number of clusters 4. Unbalance of cluster sizes Bad! Bad!

60 References J. MacQueen, Some methods for classification and analysis of multivariate observations, Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp , University of California Press, Berkeley, Calif., S.P. Lloyd, Least squares quantization in PCM, IEEE Trans. on Information Theory, 28 (2), , Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics, 21, 768. M. Steinbach, L. Ertöz, V. Kumar, The challenges of clustering high dimensional data, New Vistas in Statistical Physics -- Applications in Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag, U. Luxburg, R.C. Williamson, I. Guyon, "Clustering: Science or Art?", J. Machine Learning Research, 27: 65 79, P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, 2000 P. Fränti and J. Kivijärvi, "Randomized local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), , P. Fränti, O. Virmajoki and V. Hautamäki, Fast agglomerative clustering using a k-nearest neighbor graph, IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), , November Zhang R. Ramakrishnan and M. Livny, BIRCH: A new data clustering algorithm and its applications, Data Mining and Knowledge Discovery, 1 (2), , I. Kärkkäinen and P. Fränti, Dynamic local search algorithm for the clustering problem, Research Report A P. Fränti and O. Virmajoki, Iterative shrinking method for clustering problems, Pattern Recognition, 39 (5), , May P. Fränti R. Mariescu-Istodor and C. Zhong, XNN graph IAPR Joint Int. Workshop on Structural, Syntactic, and Statistical Pattern Recognition Merida, Mexico, LNCS 10029, , November M. Rezaei and P. Fränti, "Set-matching methods for external cluster validity", IEEE Trans. on Knowledge and Data Engineering, 28 (8), , August E. Chavez and G. Navarro, A probabilistic spell for the curse of dimensionality. Workshop on Algorithm Engineering and Experimentation, LNCS 2153, , N. Tomasev, M. Radovanovi, D. Mladeni and M. Ivanovi, The role of hubness in clustering high-dimensional data, IEEE Trans. on Knowledge and Data Engineering, 26 (3), , March D. Steinley, Local optima in k-means clustering: what you don t know may hurt you, Psychological Methods, 8, , P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: Cluster level similarity measure", Pattern Recognition, 47 (9), , T. Gonzalez, Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38 (2 3), , 1985.

61 Back-up material

62

63 K-means neighbors Estimation

64 Shrinking of the space Clusters divided to two halves Centroids clearly separated No separation anymore. Can even cause 3:1 division. 2-d 611, d 598, d 603, , , , , , , , , ,500 Centroids closing to the group center

65 Backup material 8-d 194,203 2-d 198, d 68,91 170,175 83,63 186, , , , ,56 77, , d 8-d 32-d 4-d

66 How K-means process goes

67 Time complexity

68 Efficiency of k-means Total time to find correct clustering: Time per iteration Number of iterations Time complexity of single iteration: Swap: O(1) Remove cluster: 2kN/k = O(N) Add cluster: 2N = O(N) Centroids: 2N/k + 2N/k + 2 = O(N/k) K-means: IkN = O(IkN) Bottleneck!

69 Efficiency of fast k-means Total time to find correct clustering: Time per iteration Number of iterations Time complexity of single iteration: Swap: O(1) Remove cluster: 2kN/k = O(N) Add cluster: 2N = O(N) Centroids: 2N/k + 2N/k + 2 = O(N/k) (Fast) K-means: 4N = O(N) 2 iterations only! T. Kaukoranta, P. Fränti and O. Nevalainen "A fast exact GLA based on code vector activity detection" IEEE Trans. on Image Processing, 9 (8), , August 2000.

70

71 Centroid index CI = Centroid index: P. Fränti, M. Rezaei and Q. Zhao "Centroid index: cluster level similarity measure Pattern Recognition, 47 (9), , September 2014, 2014.

72 Data sets visualized Artificial S 1 S 2 S 3 S 4 Unbalance DIM 32

73

74

Random Swap algorithm

Random Swap algorithm Random Swap algorithm Pasi Fränti 24.4.2018 Definitions and data Set of N data points: X={x 1, x 2,, x N } Partition of the data: P={p 1, p 2,,p k }, Set of k cluster prototypes (centroids): C={c 1, c

More information

Random Projection for k-means Clustering

Random Projection for k-means Clustering Random Projection for k-means Clustering Sami Sieranoja and Pasi Fränti ( ) School of Computing, University of Eastern Finland, Joensuu, Finland {sami.sieranoja,pasi.franti}@uef.fi Abstract. We study how

More information

Efficiency of random swap clustering

Efficiency of random swap clustering https://doi.org/10.1186/s40537-018-0122-y RESEARCH Open Access Efficiency of random swap clustering Pasi Fränti * *Correspondence: franti@cs.uef.fi Machine Learning Group, School of Computing, University

More information

Comparative Study on VQ with Simple GA and Ordain GA

Comparative Study on VQ with Simple GA and Ordain GA Proceedings of the 9th WSEAS International Conference on Automatic Control, Modeling & Simulation, Istanbul, Turkey, May 27-29, 2007 204 Comparative Study on VQ with Simple GA and Ordain GA SADAF SAJJAD

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

Improving K-Means by Outlier Removal

Improving K-Means by Outlier Removal Improving K-Means by Outlier Removal Ville Hautamäki, Svetlana Cherednichenko, Ismo Kärkkäinen, Tomi Kinnunen, and Pasi Fränti Speech and Image Processing Unit, Department of Computer Science, University

More information

Generalizing centroid index to different clustering models

Generalizing centroid index to different clustering models Generalizing centroid index to different clustering models Pasi Fränti and Mohammad Rezaei University of Eastern Finland Abstract. Centroid index is the only measure that evaluates cluster level differrences

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Codebook generation for Image Compression with Simple and Ordain GA

Codebook generation for Image Compression with Simple and Ordain GA Codebook generation for Image Compression with Simple and Ordain GA SAJJAD MOHSIN, SADAF SAJJAD COMSATS Institute of Information Technology Department of Computer Science Tobe Camp, Abbotabad PAKISTAN

More information

Dynamic local search algorithm for the clustering problem

Dynamic local search algorithm for the clustering problem UNIVERSITY OF JOENSUU DEPARTMENT OF COMPUTER SCIENCE Report Series A Dynamic local search algorithm for the clustering problem Ismo Kärkkäinen and Pasi Fränti Report A-2002-6 ACM I.4.2, I.5.3 UDK 519.712

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of real-valued data is often used as a pre-processing

More information

Clustering methods: Part 7 Outlier removal Pasi Fränti

Clustering methods: Part 7 Outlier removal Pasi Fränti Clustering methods: Part 7 Outlier removal Pasi Fränti 6.5.207 Machine Learning University of Eastern Finland Outlier detection methods Distance-based methods Knorr & Ng Density-based methods KDIST: K

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

1090 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 5, MAY Centroid Ratio for a Pairwise Random Swap Clustering Algorithm

1090 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 5, MAY Centroid Ratio for a Pairwise Random Swap Clustering Algorithm 1090 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 5, MAY 2014 Centroid Ratio for a Pairwise Random Swap Clustering Algorithm Qinpei Zhao and Pasi Fränti, Senior Member, IEEE Abstract

More information

Clustering in Big Data Using K-Means Algorithm

Clustering in Big Data Using K-Means Algorithm Clustering in Big Data Using K-Means Algorithm Ajitesh Janaswamy, B.E Dept of CSE, BITS Pilani, Dubai Campus. ABSTRACT: K-means is the most widely used clustering algorithm due to its fairly straightforward

More information

An Adaptive and Deterministic Method for Initializing the Lloyd-Max Algorithm

An Adaptive and Deterministic Method for Initializing the Lloyd-Max Algorithm An Adaptive and Deterministic Method for Initializing the Lloyd-Max Algorithm Jared Vicory and M. Emre Celebi Department of Computer Science Louisiana State University, Shreveport, LA, USA ABSTRACT Gray-level

More information

DISCRETIZATION BASED ON CLUSTERING METHODS. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

DISCRETIZATION BASED ON CLUSTERING METHODS. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania DISCRETIZATION BASED ON CLUSTERING METHODS Daniela Joiţa Titu Maiorescu University, Bucharest, Romania daniela.oita@utm.ro Abstract. Many data mining algorithms require as a pre-processing step the discretization

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Towards the world s fastest k-means algorithm

Towards the world s fastest k-means algorithm Greg Hamerly Associate Professor Computer Science Department Baylor University Joint work with Jonathan Drake May 15, 2014 Objective function and optimization Lloyd s algorithm 1 The k-means clustering

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and

More information

Analyzing Outlier Detection Techniques with Hybrid Method

Analyzing Outlier Detection Techniques with Hybrid Method Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

More information

Mean-shift outlier detection

Mean-shift outlier detection Mean-shift outlier detection Jiawei YANG a, Susanto RAHARDJA b a,1 and Pasi FRÄNTI a School of Computing, University of Eastern Finland b Northwestern Polytechnical University, Xi an, China Abstract. We

More information

Statistics 202: Statistical Aspects of Data Mining

Statistics 202: Statistical Aspects of Data Mining Statistics 202: Statistical Aspects of Data Mining Professor Rajan Patel Lecture 11 = Chapter 8 Agenda: 1)Reminder about final exam 2)Finish Chapter 5 3)Chapter 8 1 Class Project The class project is due

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Balanced COD-CLARANS: A Constrained Clustering Algorithm to Optimize Logistics Distribution Network

Balanced COD-CLARANS: A Constrained Clustering Algorithm to Optimize Logistics Distribution Network Advances in Intelligent Systems Research, volume 133 2nd International Conference on Artificial Intelligence and Industrial Engineering (AIIE2016) Balanced COD-CLARANS: A Constrained Clustering Algorithm

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Genetic algorithm with deterministic crossover for vector quantization

Genetic algorithm with deterministic crossover for vector quantization Pattern Recognition Letters 21 (2000) 61±68 www.elsevier.nl/locate/patrec Genetic algorithm with deterministic crossover for vector quantization Pasi Franti * Department of Computer Science, University

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015 // What is Cluster Analysis? COMP : Data Mining Clustering Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, rd ed. Cluster: A collection of data

More information

Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric.

Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric. CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance

More information

CS 188: Artificial Intelligence Fall 2008

CS 188: Artificial Intelligence Fall 2008 CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley 1 1 Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance

More information

K-Means Clustering With Initial Centroids Based On Difference Operator

K-Means Clustering With Initial Centroids Based On Difference Operator K-Means Clustering With Initial Centroids Based On Difference Operator Satish Chaurasiya 1, Dr.Ratish Agrawal 2 M.Tech Student, School of Information and Technology, R.G.P.V, Bhopal, India Assistant Professor,

More information

On the splitting method for vector quantization codebook generation

On the splitting method for vector quantization codebook generation On the splitting method for vector quantization codebook generation Pasi Fränti University of Joensuu Department of Computer Science P.O. Box 111 FIN-80101 Joensuu, Finland E-mail: franti@cs.joensuu.fi

More information

An Efficient Approach towards K-Means Clustering Algorithm

An Efficient Approach towards K-Means Clustering Algorithm An Efficient Approach towards K-Means Clustering Algorithm Pallavi Purohit Department of Information Technology, Medi-caps Institute of Technology, Indore purohit.pallavi@gmail.co m Ritesh Joshi Department

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

Detecting Clusters and Outliers for Multidimensional

Detecting Clusters and Outliers for Multidimensional Kennesaw State University DigitalCommons@Kennesaw State University Faculty Publications 2008 Detecting Clusters and Outliers for Multidimensional Data Yong Shi Kennesaw State University, yshi5@kennesaw.edu

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Social and health care services as an optimization problem

Social and health care services as an optimization problem Christmas seminar 2018 Social and health care services as an optimization problem Pasi Fränti Univ. of Eastern Finland Joensuu, FINLAND Data mining Information retrieval Location-aware applications Clustering

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

CSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning

CSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning CSE 40171: Artificial Intelligence Learning from Data: Unsupervised Learning 32 Homework #6 has been released. It is due at 11:59PM on 11/7. 33 CSE Seminar: 11/1 Amy Reibman Purdue University 3:30pm DBART

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

Cluster Analysis. Summer School on Geocomputation. 27 June July 2011 Vysoké Pole

Cluster Analysis. Summer School on Geocomputation. 27 June July 2011 Vysoké Pole Cluster Analysis Summer School on Geocomputation 27 June 2011 2 July 2011 Vysoké Pole Lecture delivered by: doc. Mgr. Radoslav Harman, PhD. Faculty of Mathematics, Physics and Informatics Comenius University,

More information

Introduction to Computer Science

Introduction to Computer Science DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Data Mining Algorithms In R/Clustering/K-Means

Data Mining Algorithms In R/Clustering/K-Means 1 / 7 Data Mining Algorithms In R/Clustering/K-Means Contents 1 Introduction 2 Technique to be discussed 2.1 Algorithm 2.2 Implementation 2.3 View 2.4 Case Study 2.4.1 Scenario 2.4.2 Input data 2.4.3 Execution

More information

A Parallel Architecture for the Generalized Traveling Salesman Problem

A Parallel Architecture for the Generalized Traveling Salesman Problem A Parallel Architecture for the Generalized Traveling Salesman Problem Max Scharrenbroich AMSC 663 Project Proposal Advisor: Dr. Bruce L. Golden R. H. Smith School of Business 1 Background and Introduction

More information

Fast Approximate Minimum Spanning Tree Algorithm Based on K-Means

Fast Approximate Minimum Spanning Tree Algorithm Based on K-Means Fast Approximate Minimum Spanning Tree Algorithm Based on K-Means Caiming Zhong 1,2,3, Mikko Malinen 2, Duoqian Miao 1,andPasiFränti 2 1 Department of Computer Science and Technology, Tongji University,

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Collaborative Rough Clustering

Collaborative Rough Clustering Collaborative Rough Clustering Sushmita Mitra, Haider Banka, and Witold Pedrycz Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India {sushmita, hbanka r}@isical.ac.in Dept. of Electrical

More information

k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University

k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University Outline Unsupervised versus Supervised Learning Clustering Problem k-means Clustering Algorithm Visual

More information

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Ben Karsin University of Hawaii at Manoa Information and Computer Science ICS 63 Machine Learning Fall 8 Introduction

More information

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010 Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS

AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS H.S Behera Department of Computer Science and Engineering, Veer Surendra Sai University

More information

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Midterm Exam Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Covers topics through Decision Trees and Random Forests (does not include constraint satisfaction) Closed book 8.5 x 11 sheet with notes

More information

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Clustering Results. Result List Example. Clustering Results. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

More information

I. INTRODUCTION II. RELATED WORK.

I. INTRODUCTION II. RELATED WORK. ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A New Hybridized K-Means Clustering Based Outlier Detection Technique

More information

Feature-Guided K-Means Algorithm for Optimal Image Vector Quantizer Design

Feature-Guided K-Means Algorithm for Optimal Image Vector Quantizer Design Journal of Information Hiding and Multimedia Signal Processing c 2017 ISSN 2073-4212 Ubiquitous International Volume 8, Number 6, November 2017 Feature-Guided K-Means Algorithm for Optimal Image Vector

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, nd Edition by Tan, Steinbach, Karpatne, Kumar What is Cluster Analysis? Finding groups

More information

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM Saroj 1, Ms. Kavita2 1 Student of Masters of Technology, 2 Assistant Professor Department of Computer Science and Engineering JCDM college

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Cluster Analysis (b) Lijun Zhang

Cluster Analysis (b) Lijun Zhang Cluster Analysis (b) Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary

More information

K-Means Clustering. Sargur Srihari

K-Means Clustering. Sargur Srihari K-Means Clustering Sargur srihari@cedar.buffalo.edu 1 Topics in Mixture Models and EM Mixture models K-means Clustering Mixtures of Gaussians Maximum Likelihood EM for Gaussian mistures EM Algorithm Gaussian

More information

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes. Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

A Naïve Soft Computing based Approach for Gene Expression Data Analysis

A Naïve Soft Computing based Approach for Gene Expression Data Analysis Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2124 2128 International Conference on Modeling Optimization and Computing (ICMOC-2012) A Naïve Soft Computing based Approach for

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

CISC 4631 Data Mining

CISC 4631 Data Mining CISC 4631 Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F.

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule. CS 188: Artificial Intelligence Fall 2007 Lecture 26: Kernels 11/29/2007 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit your

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Slides From Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining

More information

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

More information

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data

More information

DOCUMENT CLUSTERING USING HIERARCHICAL METHODS. 1. Dr.R.V.Krishnaiah 2. Katta Sharath Kumar. 3. P.Praveen Kumar. achieved.

DOCUMENT CLUSTERING USING HIERARCHICAL METHODS. 1. Dr.R.V.Krishnaiah 2. Katta Sharath Kumar. 3. P.Praveen Kumar. achieved. DOCUMENT CLUSTERING USING HIERARCHICAL METHODS 1. Dr.R.V.Krishnaiah 2. Katta Sharath Kumar 3. P.Praveen Kumar ABSTRACT: Cluster is a term used regularly in our life is nothing but a group. In the view

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information

Enhancing K-means Clustering Algorithm with Improved Initial Center

Enhancing K-means Clustering Algorithm with Improved Initial Center Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of

More information