Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining


Slides from Lecture Notes for Chapter 8 of Introduction to Data Mining, by Tan, Steinbach, and Kumar.

What is Cluster Analysis? Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized; inter-cluster distances are maximized.

Notion of a Cluster Can Be Ambiguous. How many clusters? (Figure: the same set of points grouped into two clusters, four clusters, or six clusters.)

Partitional Clustering. A division of the data objects into non-overlapping subsets (clusters) such that each object is in exactly one subset. (Figure: the original points and a partitional clustering of them.)

Hierarchical Clustering. A set of nested clusters organized as a hierarchical tree. (Figures: a traditional hierarchical clustering of points p1–p4 with its dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram.)

Other Distinctions Between Sets of Clusters. Exclusive versus non-exclusive: in non-exclusive clusterings, points may belong to multiple clusters; this can represent multiple classes or border points. Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1; probabilistic clustering has similar characteristics. Partial versus complete: in some cases we only want to cluster some of the data. Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities.

Hierarchical Clustering. There are two main types of hierarchical clustering. Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters). Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

Agglomerative Clustering Algorithm. This is the more popular hierarchical clustering technique. The basic algorithm is straightforward (see the sketch below):
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms.
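The sketch below (not from the textbook) implements the six steps above in NumPy, assuming Euclidean distances and the single-link (MIN) rule for updating the proximity matrix; other proximity definitions change only the update step.

```python
# A minimal sketch of the basic agglomerative algorithm above (assumptions:
# Euclidean distance, single-link updates). Not the textbook's code.
import numpy as np

def agglomerative(points, k=1):
    """Merge the two closest clusters until only k clusters remain."""
    # Step 1: compute the proximity (Euclidean distance) matrix
    diff = points[:, None, :] - points[None, :, :]
    prox = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(prox, np.inf)           # ignore self-distances

    # Step 2: let each data point be its own cluster
    clusters = [[i] for i in range(len(points))]

    # Steps 3-6: repeatedly merge the two closest clusters
    while len(clusters) > k:
        i, j = np.unravel_index(np.argmin(prox), prox.shape)
        if i > j:
            i, j = j, i
        clusters[i].extend(clusters[j])       # merge cluster j into cluster i
        del clusters[j]
        # Update the proximity matrix: single link keeps the minimum distance
        prox[i, :] = np.minimum(prox[i, :], prox[j, :])
        prox[:, i] = prox[i, :]
        prox[i, i] = np.inf
        prox = np.delete(np.delete(prox, j, axis=0), j, axis=1)
    return clusters

# Example: five 2-D points, stop when two clusters remain
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
print(agglomerative(pts, k=2))   # [[0, 1, 2], [3, 4]]
```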

Starting Situation. Start with clusters of individual points and a proximity matrix. (Figure: points p1–p5 and their 5x5 proximity matrix.)

Intermediate Situation. After some merging steps, we have some clusters. (Figure: clusters C1–C5 and the corresponding proximity matrix.)

Intermediate Situation. We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. (Figure: clusters C1–C5 and the proximity matrix before the merge.)

After Merging. The question is: how do we update the proximity matrix? (Figure: the proximity matrix after merging C2 and C5 into C2 U C5, with the entries involving the merged cluster marked "?".)

How to Define Inter-Cluster Similarity. The options are MIN, MAX, group average, distance between centroids, and other methods driven by an objective function (Ward's Method uses squared error). (Figure: two clusters drawn from points p1–p5 and their proximity matrix, asking which entries define the similarity between the two clusters.)
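As an illustration (SciPy is a choice made here, not something the slides prescribe), each of these proximity definitions corresponds to a `method` argument of `scipy.cluster.hierarchy.linkage`:

```python
# Each `method` below corresponds to one inter-cluster proximity definition
# listed on this slide. The two-blob data set is made up for the example.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

for method in ["single",    # MIN
               "complete",  # MAX
               "average",   # group average
               "centroid",  # distance between centroids
               "ward"]:     # Ward's Method (squared error)
    Z = linkage(X, method=method)                      # merge history (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut into 2 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes
```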

Cluster Similarity: MIN or Single Link. Similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.

Similarity matrix:
     I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00

(Figure: the resulting single-link dendrogram over points 1–5.)

Cluster Similarity: MAX or Complete Linkage. Similarity of two clusters is based on the two least similar (most distant) points in the different clusters; it is determined by all pairs of points in the two clusters. (Same similarity matrix as above; figure: the resulting complete-link dendrogram over points 1–5.)

Cluster Similarity: Group Average. Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

proximity(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i,\, p_j \in Cluster_j} proximity(p_i, p_j)}{|Cluster_i| \times |Cluster_j|}

Need to use average connectivity for scalability, since total proximity favors large clusters. (Same similarity matrix as above; figure: the resulting group-average dendrogram over points 1–5.)
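A small sketch (the function name `cluster_similarity` and the example clusters are ours, not from the slides) that evaluates the single-link, complete-link, and group-average definitions on the similarity matrix above, for the clusters {I1, I2} and {I3, I4, I5}. Note that on a similarity matrix the "closest" pair is the one with the largest similarity.

```python
import numpy as np

# The similarity matrix from the slides above
sim = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                [0.90, 1.00, 0.70, 0.60, 0.50],
                [0.10, 0.70, 1.00, 0.40, 0.30],
                [0.65, 0.60, 0.40, 1.00, 0.80],
                [0.20, 0.50, 0.30, 0.80, 1.00]])

def cluster_similarity(sim, a, b, method):
    """Similarity between clusters a and b (lists of row indices)."""
    pairwise = sim[np.ix_(a, b)]            # all cross-cluster similarities
    if method == "single":                  # MIN link: most similar pair
        return pairwise.max()
    if method == "complete":                # MAX link: least similar pair
        return pairwise.min()
    if method == "average":                 # group average over all pairs
        return pairwise.mean()
    raise ValueError(method)

a, b = [0, 1], [2, 3, 4]                    # {I1, I2} and {I3, I4, I5}
for m in ("single", "complete", "average"):
    print(m, cluster_similarity(sim, a, b, m))
# single 0.7, complete 0.1, average ~0.458
```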

Hierarchical Clustering: Time and Space Requirements. O(N^2) space, since it uses the proximity matrix (N is the number of points). O(N^3) time in many cases: there are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched. Complexity can be reduced to O(N^2 log N) time for some approaches.

Hierarchical Clustering: Problems and Limitations. Once a decision is made to combine two clusters, it cannot be undone. No objective function is directly minimized. Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different sized clusters and convex shapes; breaking large clusters.

Internal Measures: Cohesion and Separation. Cluster cohesion measures how closely related the objects in a cluster are; example: sum of squared errors (SSE). Cluster separation measures how distinct or well-separated a cluster is from other clusters; example: squared error. Cohesion is measured by the within-cluster sum of squares:

WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares:

BSS = \sum_i |C_i| (m - m_i)^2

where |C_i| is the size of cluster C_i, m_i is the mean (centroid) of C_i, and m is the overall mean.

Internal Measures: Cohesion and Separation. Example (SSE): BSS + WSS = constant. (Figure: the points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster means m1 = 1.5, m2 = 4.5.)

K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 x (3 - 3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} and {4, 5}):
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 x (3 - 1.5)^2 + 2 x (4.5 - 3)^2 = 9
Total = 1 + 9 = 10
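A minimal sketch (assuming 1-D data and the cluster mean as prototype, as in the example above) that reproduces these WSS/BSS numbers:

```python
import numpy as np

def wss_bss(points, labels):
    """Within-cluster and between-cluster sums of squares."""
    points, labels = np.asarray(points, float), np.asarray(labels)
    m = points.mean()                                  # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        m_i = cluster.mean()                           # cluster mean
        wss += ((cluster - m_i) ** 2).sum()            # cohesion
        bss += len(cluster) * (m - m_i) ** 2           # separation
    return wss, bss

data = [1, 2, 4, 5]
print(wss_bss(data, [0, 0, 0, 0]))   # K=1: (10.0, 0.0)
print(wss_bss(data, [0, 0, 1, 1]))   # K=2: (1.0, 9.0); total is again 10
```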

Internal Measures: Cohesion and Separation. A proximity-graph-based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster; cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster. (Figure: cohesion edges within a cluster and separation edges between clusters.)