
Working with Unlabeled Data: Clustering Analysis
Hsiao-Lung Chan, Dept. of Electrical Engineering, Chang Gung University, Taiwan (chanhl@mail.cgu.edu.tw)

Unsupervised learning
- Finding centers of similarity using the popular k-means algorithm
- Taking a bottom-up approach to building hierarchical clustering trees
- Identifying arbitrarily shaped groups of objects using a density-based clustering approach

K-means clustering

K-means clustering
1. Randomly pick k centroids from the sample points as initial cluster centers.
2. Assign each sample to the nearest centroid μ^(j), j = 1, …, k.
3. Move each centroid to the center of the samples that were assigned to it.
4. Repeat steps 2 and 3 until the cluster assignments no longer change, or a user-defined tolerance or maximum number of iterations is reached.
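A minimal sketch of these steps with scikit-learn's KMeans (the toy dataset, k = 3, and the parameter values are illustrative assumptions, not part of the original slides):

```python
# Sketch: k-means on a synthetic blob dataset with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=3,        # k: number of centroids
            init='random',       # plain random initialization (step 1)
            n_init=10,           # rerun with 10 random seeds, keep lowest SSE
            max_iter=300,        # cap on assignment/update iterations (step 4)
            tol=1e-4,            # user-defined tolerance (step 4)
            random_state=0)
labels = km.fit_predict(X)       # steps 2-3 repeated until convergence
print(km.cluster_centers_)       # final centroids mu^(j)
```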

Similarity measures between objects
Squared Euclidean distance between two points x and y in m-dimensional space:

d(x, y)^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \lVert x - y \rVert_2^2
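As a quick illustration of the formula (the points x and y are made-up values):

```python
# Sketch: squared Euclidean distance between two points with NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
d2 = np.sum((x - y) ** 2)           # sum_j (x_j - y_j)^2
# equivalently: np.linalg.norm(x - y) ** 2
```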

Optimization by minimizing cluster inertia
K-means is an iterative approach for minimizing the within-cluster sum of squared errors (SSE), also known as cluster inertia:

SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \lVert x^{(i)} - \mu^{(j)} \rVert_2^2

where μ^(j) is the representative point (centroid) for cluster j, and w^(i,j) = 1 if the sample x^(i) is in cluster j, w^(i,j) = 0 otherwise.
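A minimal NumPy sketch of this quantity, assuming X, centroids, and labels come from a k-means run such as the one above (scikit-learn exposes the same value as the fitted model's inertia_ attribute):

```python
# Sketch: within-cluster SSE given samples, centroids, and assignments.
import numpy as np

def within_cluster_sse(X, centroids, labels):
    # Sum of squared distances of each sample to its assigned centroid;
    # equals sum_i sum_j w(i,j) * ||x(i) - mu(j)||^2 with one-hot w.
    diffs = X - centroids[labels]
    return np.sum(diffs ** 2)
```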

A smarter way of placing the initial cluster centroids: k-means++
1. Initialize an empty set M to store the k centroids being selected.
2. Randomly choose the first centroid μ^(1) from the input samples and add it to M.
3. For each sample x^(i) not in M, find the minimum squared distance d(x^(i), M)^2 to any of the centroids in M.
4. To randomly select the next centroid μ^(p), use a weighted probability distribution equal to d(\mu^{(p)}, M)^2 / \sum_i d(x^{(i)}, M)^2.
5. Repeat steps 3 and 4 until k centroids are chosen.
6. Proceed with the classic k-means algorithm.
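A rough NumPy sketch of this seeding scheme (illustrative only; in practice passing init='k-means++' to scikit-learn's KMeans uses its tested implementation):

```python
# Sketch: k-means++ initialization in NumPy.
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    centroids = [X[rng.integers(len(X))]]        # step 2: first centroid at random
    for _ in range(k - 1):
        # step 3: squared distance of each sample to its nearest chosen centroid
        d2 = np.min(((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2)
                    .sum(axis=2), axis=1)
        # step 4: sample the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```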

Fuzzy C-means (FCM)
Cluster membership vector of a sample x in k-means: a hard, one-hot assignment, e.g. [0, 0, 1] for k = 3.
Cluster membership vector in FCM: a vector of membership probabilities in [0, 1] that sum to 1, e.g. [0.10, 0.85, 0.05].

FCM algorithm
1. Specify the number of centroids k and randomly assign the cluster memberships for each point.
2. Compute the cluster centroids μ^(j), j = 1, …, k.
3. Update the cluster memberships for each point.
4. Repeat steps 2 and 3 until the membership coefficients no longer change, or a user-defined tolerance or maximum number of iterations is reached.

Cluster membership probability
The membership of sample x^(i) belonging to cluster j:

w^{(i,j)} = \left[ \sum_{c=1}^{k} \left( \frac{\lVert x^{(i)} - \mu^{(j)} \rVert_2}{\lVert x^{(i)} - \mu^{(c)} \rVert_2} \right)^{\frac{2}{m-1}} \right]^{-1}

Typically m = 2; m is the so-called fuzziness coefficient (fuzzifier), and larger values of m yield fuzzier memberships.

Cluster centroid
The center μ^(j) of cluster j is calculated as the mean of all n samples, weighted by the degree (w^(i,j))^m to which each sample belongs to that cluster:

\mu^{(j)} = \frac{\sum_{i=1}^{n} \big( w^{(i,j)} \big)^m x^{(i)}}{\sum_{i=1}^{n} \big( w^{(i,j)} \big)^m}

The objective function of FCM
Similar to the within-cluster sum of squared errors that k-means minimizes:

J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} \big( w^{(i,j)} \big)^m \lVert x^{(i)} - \mu^{(j)} \rVert_2^2
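A compact NumPy sketch of the full FCM loop under these definitions (the function name and parameter defaults are assumptions; libraries such as scikit-fuzzy provide tested implementations):

```python
# Sketch: fuzzy c-means with fuzzifier m, alternating the centroid and
# membership updates until the memberships stop changing.
import numpy as np

def fuzzy_c_means(X, k, m=2.0, n_iter=100, tol=1e-5, rng=np.random.default_rng(0)):
    n = len(X)
    W = rng.random((n, k))
    W /= W.sum(axis=1, keepdims=True)            # random memberships, rows sum to 1
    for _ in range(n_iter):
        Wm = W ** m
        centroids = (Wm.T @ X) / Wm.sum(axis=0)[:, None]   # weighted means mu^(j)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
        d = np.fmax(d, np.finfo(float).eps)      # avoid division by zero
        W_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)),
                             axis=2)             # membership update formula
        if np.max(np.abs(W_new - W)) < tol:      # memberships no longer change
            W = W_new
            break
        W = W_new
    return centroids, W
```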

The elbow method to find the optimal number of clusters
The within-cluster SSE (distortion) can be used to quantify the quality of a clustering: plot the distortion for a range of k values and pick the k at which the distortion stops decreasing rapidly (the "elbow").
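A minimal sketch of the elbow method, assuming the sample matrix X from the earlier k-means example:

```python
# Sketch: plot KMeans inertia (within-cluster SSE) against k and look for
# the elbow where the decrease levels off.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

distortions = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    km.fit(X)                        # X: the sample matrix from earlier
    distortions.append(km.inertia_)  # within-cluster SSE after fitting

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Distortion (within-cluster SSE)')
plt.show()
```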

Quantifying the quality of clustering via silhouette plots
A silhouette plot shows a measure of how tightly grouped the samples in the clusters are. To compute the silhouette coefficient of a sample x^(i):
1. Calculate the cluster cohesion a^(i) as the average distance between x^(i) and all other points in the same cluster.
2. Calculate the cluster separation b^(i) as the average distance between x^(i) and all samples in the nearest (next closest) cluster.
3. Calculate the silhouette s^(i):

s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\left\{ a^{(i)}, b^{(i)} \right\}}

The silhouette coefficient ranges from -1 to 1; values close to 1 indicate a well-separated sample.

Silhouette plots
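A minimal sketch of how such a plot can be produced with scikit-learn's silhouette_samples, assuming X and labels from the earlier k-means example:

```python
# Sketch: per-sample silhouette coefficients, drawn as horizontal bars
# grouped by cluster, with the overall average as a reference line.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_samples

sil_vals = silhouette_samples(X, labels, metric='euclidean')  # s(i) per sample
y_lower = 0
for c in np.unique(labels):
    vals = np.sort(sil_vals[labels == c])        # one bar per sample in cluster c
    plt.barh(range(y_lower, y_lower + len(vals)), vals, height=1.0)
    y_lower += len(vals)
plt.axvline(np.mean(sil_vals), color='red', linestyle='--')   # average silhouette
plt.xlabel('Silhouette coefficient s(i)')
plt.ylabel('Samples grouped by cluster')
plt.show()
```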

Hierarchical clustering
Agglomerative hierarchical clustering: start with each sample as an individual cluster and iteratively merge the closest pairs of clusters until only one cluster remains.
Divisive hierarchical clustering: start with one cluster that encompasses all the samples and iteratively split it into smaller clusters until each cluster contains only one sample.

Agglomerative clustering using the complete linkage approach
1. Compute the distance matrix of all samples.
2. Represent each data point as a singleton cluster.
3. Merge the two closest clusters based on the distance between their most dissimilar (distant) members.
4. Update the distance matrix.
5. Repeat steps 3 and 4 until one single cluster remains.

Cluster merging based on complete linkage vs. single linkage
Single linkage merges the two clusters whose most similar members are closest; complete linkage merges the two clusters whose most dissimilar members are closest.

Distance matrix, pairwise distances, and dendrogram
[Figure: the pairwise distance matrix of the samples and the dendrogram produced by the successive merges]
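A minimal sketch with SciPy, assuming a small sample matrix X: pdist computes the condensed pairwise distance matrix, linkage performs the complete-linkage merges, and dendrogram draws the result:

```python
# Sketch: complete-linkage agglomerative clustering and its dendrogram.
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

row_clusters = linkage(pdist(X, metric='euclidean'), method='complete')
dendrogram(row_clusters)             # visualize the successive merges
plt.ylabel('Euclidean distance')
plt.show()
```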

Density-based spatial clustering of applications with noise (DBSCAN)
Density-based clustering assigns cluster labels based on dense regions of points.

Point labeling in the DBSCAN algorithm
- A point is considered a core point if at least a specified number of neighboring points (MinPts) fall within a specified radius ε.
- A border point is a point that has fewer than MinPts neighbors within ε but lies within the ε radius of a core point.
- All other points, which are neither core nor border points, are considered noise points.

DBSCAN algorithm
1. Form a separate cluster for each core point or connected group of core points (core points are connected if they are no farther apart than ε).
2. Assign each border point to the cluster of its corresponding core point.
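A minimal sketch with scikit-learn's DBSCAN on the two-moons dataset (the eps and min_samples values are illustrative assumptions; min_samples corresponds to MinPts):

```python
# Sketch: DBSCAN finds the two non-spherical "moon" clusters that k-means
# cannot separate; points labeled -1 are treated as noise.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2,          # radius epsilon
            min_samples=5,    # MinPts
            metric='euclidean')
labels_db = db.fit_predict(X_moons)   # noise points receive the label -1
```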

Advantages of DBSCAN
- It does not assume that the clusters have a spherical shape, as k-means does.
- It does not necessarily assign every point to a cluster, so it is capable of removing noise points.

Reference
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow. Second Edition. Packt Publishing, 2017.