Clustering fundamentals

Elena Baralis, Tania Cerquitelli
Politecnico di Torino, DataBase and Data Mining Group

What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Intra-cluster distances are minimized.
Inter-cluster distances are maximized.

Applications of Cluster Analysis
Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
Summarization: reduce the size of large data sets.

Discovered Clusters (with Industry Group)
Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

[Figure: Clustering precipitation in Australia]

Notion of a Cluster can be Ambiguous
How many clusters?
[Figure: the same points grouped as Six Clusters, Two Clusters, and Four Clusters]

Types of Clusterings
A clustering is a set of clusters.
There is an important distinction between hierarchical and partitional sets of clusters.
Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

Partitional Clustering
[Figure: A Partitional Clustering]

Hierarchical Clustering
[Figure: Traditional Hierarchical Clustering with its Traditional Dendrogram, and Non-traditional Hierarchical Clustering with its Non-traditional Dendrogram, over points p1 to p4]

Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering

K-means Clustering
Partitional clustering approach.
Each cluster is associated with a centroid (center point).
Each point is assigned to the cluster with the closest centroid.
The number of clusters, K, must be specified.
The basic algorithm is very simple.

Two different K-means Clusterings
[Figure: an Optimal Clustering and a Sub-optimal Clustering of the same points]
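The assignment rule described above (each point goes to the closest centroid, each centroid is the center of its cluster) can be sketched in a few lines. This is a minimal illustration, not the slides' own pseudocode: the function name kmeans, the random choice of initial centroids, the Euclidean distance, and the stopping rule (assignments no longer change) are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k points sampled at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Changing the seed changes the initial centroids, which is the simplest way to observe that different starts can lead to different final clusterings on less well-separated data.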

Importance of Choosing Initial Centroids
[Figure: K-means centroids and cluster assignments at Iterations 1 to 6]

Importance of Choosing Initial Centroids
[Figure: a second K-means run, showing centroids and cluster assignments at Iterations 1 to 5]

Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE).
For each point, the error is the distance to the nearest cluster.
To get the SSE, we square these errors and sum them:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

Here x is a data point in cluster C_i and m_i is the representative point for cluster C_i. One can show that m_i corresponds to the center (mean) of the cluster.
Given two clusterings, we can choose the one with the smallest error.
One easy way to reduce SSE is to increase K, the number of clusters.
A good clustering with smaller K can have a lower SSE than a poor clustering with higher K.

Solutions to Initial Centroids Problem
Multiple runs: helps, but probability is not on our side.
Sample and use hierarchical clustering to determine initial centroids.
Select more than k initial centroids and then select among these initial centroids, choosing the most widely separated.
Postprocessing.
Bisecting K-means: not as susceptible to initialization issues.
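A small worked example may make the SSE remarks concrete. The sketch below assumes squared Euclidean distance and uses the cluster mean as the representative point; the helper names cluster_means and sse and the toy one-dimensional data are made up for illustration. It shows a good clustering with K = 2 scoring a lower SSE than a poor clustering with K = 3.

```python
def cluster_means(points, labels, k):
    """Mean of each cluster; assumes every cluster 0..k-1 is non-empty."""
    means = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        means.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return means

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster's centroid."""
    total = 0.0
    for p, lab in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
    return total

points = [(1,), (2,), (3,), (10,), (11,), (12,)]

good_k2 = [0, 0, 0, 1, 1, 1]   # follows the two natural groups
poor_k3 = [0, 1, 1, 1, 1, 2]   # more clusters, but badly placed

sse_good = sse(points, good_k2, cluster_means(points, good_k2, 2))
sse_poor = sse(points, poor_k3, cluster_means(points, poor_k3, 3))
print(sse_good, sse_poor)  # 4.0 65.0
```

The same sse function can be used to compare the results of multiple K-means runs and keep the run with the smallest error.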

Pre-processing and Post-processing
Pre-processing:
Normalize the data.
Eliminate outliers.
Post-processing:
Eliminate small clusters that may represent outliers.
Split loose clusters, i.e., clusters with relatively high SSE.
Merge clusters that are close and that have relatively low SSE.
Can use these steps during the clustering process.
From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006

Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
K-means has problems when the data contains outliers.
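The slides do not say which normalization to use. One common option, shown here purely as an assumption, is min-max scaling of every attribute to the range [0, 1], so that no attribute dominates the distance computation because of its units. The function name min_max_normalize is hypothetical.

```python
def min_max_normalize(points):
    """Rescale every attribute (column) of the data to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for p in points:
        scaled.append(tuple(
            # A constant column carries no information; map it to 0.0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(p, lows, highs)
        ))
    return scaled

raw = [(1, 100), (2, 300), (3, 200)]
print(min_max_normalize(raw))  # [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
```

Min-max scaling is itself sensitive to outliers, which is one reason outlier elimination appears in the same pre-processing list.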

Limitations of K-means: Differing Sizes
[Figure: K-means result on clusters of differing sizes]

Limitations of K-means: Differing Density
[Figure: K-means result on clusters of differing density]

Limitations of K-means: Non-globular Shapes
[Figure: K-means result on clusters with non-globular shapes]

Overcoming K-means Limitations
[Figure: K-means Clusters]
One solution is to use many clusters. K-means then finds parts of the natural clusters, which need to be put together afterwards.

Overcoming K-means Limitations
[Figures: two further examples of K-means Clusters obtained with many clusters]

Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: example dendrogram]

Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level.
They may correspond to meaningful taxonomies, for example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...).

Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative: start with the points as individual clusters. At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive: start with one, all-inclusive cluster. At each step, split a cluster until each cluster contains a point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique.
The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
The key operation is the computation of the proximity of two clusters.
Different approaches to defining the distance between clusters distinguish the different algorithms.
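The agglomerative loop above can be sketched directly. This is an illustrative version under stated assumptions, not a reference implementation: Euclidean distance between points, MIN (single link) as the cluster proximity, a dictionary standing in for the proximity matrix, and a hypothetical function name agglomerative. It stops when k clusters remain (k = 1 gives the full hierarchy) and records each merge, which is the information a dendrogram displays.

```python
import math
from itertools import combinations

def agglomerative(points, k=1):
    """Merge the closest pair of clusters until k clusters remain.

    Returns the final clusters (lists of point indices) and the merge history
    as (cluster_id_a, cluster_id_b, distance) tuples.
    """
    # Let each data point be a cluster, identified by its index.
    clusters = {i: [i] for i in range(len(points))}
    # Compute the proximity matrix, stored as {(id_a, id_b): distance} with id_a < id_b.
    prox = {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }
    merges = []
    next_id = len(points)
    while len(clusters) > max(k, 1):
        # Merge the two closest clusters.
        (a, b), d = min(prox.items(), key=lambda item: item[1])
        merged = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        # Update the proximity matrix: drop entries involving a or b ...
        new_prox = {
            pair: dist for pair, dist in prox.items()
            if a not in pair and b not in pair
        }
        # ... and add entries for the merged cluster. Assumption: MIN (single link),
        # so the new distance is the smaller of the two old distances.
        for c in clusters:
            d_a = prox[(min(a, c), max(a, c))]
            d_b = prox[(min(b, c), max(b, c))]
            new_prox[(c, next_id)] = min(d_a, d_b)
        prox = new_prox
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values()), merges

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
final_clusters, history = agglomerative(pts, k=2)
print(final_clusters)
print(history)
```

Replacing min(d_a, d_b) with max(d_a, d_b) would turn the same loop into MAX (complete link) clustering, which illustrates how the cluster distance definition distinguishes the algorithms.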

How to Define Inter-Cluster Similarity
[Figure: two clusters and the proximity matrix over points p1, p2, p3, p4, p5, ...]
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function: Ward's Method uses squared error.

Hierarchical Clustering: Comparison
[Figure: clusterings of the same points produced by MIN, MAX, Group Average, and Ward's Method]
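The first four definitions can be written down compactly. The sketch below assumes Euclidean distance between points and represents a cluster as a list of coordinate tuples; the function names are chosen for the example. Ward's Method is left out because it is defined through the increase in squared error rather than a direct point-to-point distance.

```python
import math

def min_distance(cluster_a, cluster_b):
    """MIN (single link): distance between the two closest points."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def max_distance(cluster_a, cluster_b):
    """MAX (complete link): distance between the two farthest points."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

def group_average(cluster_a, cluster_b):
    """Group average: mean of all pairwise distances between the clusters."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def centroid_distance(cluster_a, cluster_b):
    """Distance between the centroids (means) of the two clusters."""
    mean_a = [sum(col) / len(cluster_a) for col in zip(*cluster_a)]
    mean_b = [sum(col) / len(cluster_b) for col in zip(*cluster_b)]
    return math.dist(mean_a, mean_b)

a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
for measure in (min_distance, max_distance, group_average, centroid_distance):
    print(measure.__name__, measure(a, b))
```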

DBSCAN
DBSCAN is a density-based algorithm.
Density = number of points within a specified radius (Eps).
A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are at the interior of a cluster.
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
A noise point is any point that is not a core point or a border point.

DBSCAN: Core, Border, and Noise Points
[Figure: core, border, and noise points]
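The three point types can be computed with a brute-force neighborhood count. This sketch covers only the classification step, not the full DBSCAN cluster expansion. It makes two assumptions that the text leaves open: distance is Euclidean, and a point counts as core when its Eps-neighborhood, including the point itself, holds at least MinPts points (a widespread convention; other threshold conventions differ by one). The function name classify_points is hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    # Eps-neighborhood of each point (the point itself is included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Assumption: core means at least min_pts points in the neighborhood.
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
print(classify_points(pts, eps=1.5, min_pts=4))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```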

DBSCAN: Core, Border, and Noise Points
[Figure: point types (core, border and noise); MinPts = 4, Eps value missing]

When DBSCAN Works Well
[Figure: clusters found by DBSCAN]
Resistant to noise.
Can handle clusters of different shapes and sizes.

When DBSCAN Does NOT Work Well
Varying densities.
High-dimensional data.
[Figure: DBSCAN results for (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.6)]

Measures of Cluster Validity
The validation of clustering structures is the most difficult task.
To evaluate the goodness of the resulting clusters, some numerical measures can be exploited.
Numerical measures are classified into two main classes:
External Index: used to measure the extent to which cluster labels match externally supplied class labels, e.g., entropy, purity, Rand index, adjusted Rand index.
Internal Index: used to measure the goodness of a clustering structure without respect to external information, e.g., Sum of Squared Error (SSE), cluster cohesion, cluster separation.
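Entropy and purity are not defined in the text, so the sketch below uses their usual definitions as an assumption: within each cluster, take the fraction of its points that belong to each class; the cluster's entropy is the Shannon entropy (base 2) of those fractions and its purity is the largest fraction; the overall values are averages weighted by cluster size. The function name entropy_and_purity is made up for the example.

```python
import math
from collections import Counter

def entropy_and_purity(cluster_labels, class_labels):
    """Size-weighted entropy and purity of a clustering against known class labels."""
    n = len(cluster_labels)
    # Count how many points of each class fall into each cluster.
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    entropy = 0.0
    purity = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        fractions = [count / size for count in counts.values()]
        cluster_entropy = -sum(f * math.log2(f) for f in fractions)
        entropy += (size / n) * cluster_entropy
        purity += (size / n) * max(fractions)
    return entropy, purity

clusters = [0, 0, 0, 0, 1, 1, 1, 1]
classes = ["a", "a", "a", "b", "b", "b", "b", "b"]
print(entropy_and_purity(clusters, classes))
```

Lower entropy and higher purity both indicate clusters that are dominated by a single class; a clustering that matches the classes exactly has entropy 0 and purity 1.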

External Measures of Cluster Validity: Entropy and Purity

Internal Measures: Cohesion and Separation
A proximity graph based approach can also be used for cohesion and separation.
Cluster cohesion is the sum of the weights of all links within a cluster.
Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
[Figure: cohesion and separation]
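The graph-based definitions translate almost word for word into code. In this sketch the proximity graph is assumed to be given as a symmetric weight matrix (a list of lists, with 0 meaning no link), and a cluster is a set of node indices; the names cohesion and separation mirror the text, while the example weights are invented.

```python
def cohesion(weights, cluster):
    """Sum of the weights of all links between pairs of nodes inside the cluster."""
    members = sorted(cluster)
    return sum(
        weights[a][b]
        for i, a in enumerate(members)
        for b in members[i + 1:]  # each within-cluster pair is counted once
    )

def separation(weights, cluster):
    """Sum of the weights of links between nodes in the cluster and nodes outside it."""
    outside = [v for v in range(len(weights)) if v not in cluster]
    return sum(weights[a][b] for a in cluster for b in outside)

# Symmetric proximity matrix for four nodes (illustrative values).
W = [
    [0, 5, 1, 0],
    [5, 0, 2, 1],
    [1, 2, 0, 4],
    [0, 1, 4, 0],
]
print(cohesion(W, {0, 1}), separation(W, {0, 1}))  # 5 4
print(cohesion(W, {2, 3}), separation(W, {2, 3}))  # 4 4
```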

Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
Algorithms for Clustering Data, Jain and Dubes