
Hierarchy
- An arrangement or classification of things according to inclusiveness
- A natural way of abstraction, summarization, compression, and simplification for understanding
- Typical setting: organize a given set of objects into a hierarchy
- No or very little supervision
- Some heuristic guidance on the quality of the hierarchy

(Slides: Jian Pei, CMPT 459/741, Clustering (2))

Hierarchical Clustering
- Group data objects into a tree of clusters
- Top-down (divisive, DIANA) versus bottom-up (agglomerative, AGNES)

(Figure: objects a, b, c, d, e merged step by step into {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e} by agglomerative AGNES, and split in the reverse order by divisive DIANA)

AGNES (Agglomerative Nesting)
- Initially, each object is a cluster
- Step-by-step cluster merging, until all objects form one cluster
- Single-link approach: each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
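The single-link merging loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from the slides; the function name `single_link_agnes` and the stopping criterion (stop at k clusters rather than one) are choices made here for the example.

```python
# Minimal single-link AGNES sketch: start with singleton clusters and
# repeatedly merge the pair of clusters with the closest pair of points.
from math import dist

def single_link_agnes(points, k):
    """Merge singleton clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance of the closest cross-cluster pair
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

clusters = single_link_agnes([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)
# two tight groups: {(0,0), (0,1)} and {(5,5), (5,6)}
```

Running the full loop to k = 1 and recording each merge would produce exactly the sequence of merges a dendrogram depicts.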

Dendrogram
- Shows how clusters are merged hierarchically
- Decomposes the data objects into a multilevel nested partitioning (a tree of clusters)
- A clustering of the data objects: cut the dendrogram at the desired level; each connected component forms a cluster

DIANA (Divisive ANAlysis)
- Initially, all objects are in one cluster
- Step-by-step splitting of clusters until each cluster contains only one object

(Figure: three scatter plots showing the data split into progressively smaller clusters)

Distance Measures
Notation: C is a cluster, m its mean, n the number of objects in it.
- Minimum distance: d_min(C_i, C_j) = min over p in C_i, q in C_j of |p - q|
- Maximum distance: d_max(C_i, C_j) = max over p in C_i, q in C_j of |p - q|
- Mean distance: d_mean(C_i, C_j) = |m_i - m_j|
- Average distance: d_avg(C_i, C_j) = (1 / (n_i * n_j)) * sum over p in C_i, q in C_j of |p - q|
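The four inter-cluster distance measures can be computed directly from their definitions. A small sketch (the helper name `cluster_distances` is introduced here for illustration):

```python
# Compute the four inter-cluster distances for two clusters of 2-D points.
from math import dist

def cluster_distances(Ci, Cj):
    mi = tuple(sum(col) / len(Ci) for col in zip(*Ci))  # mean of Ci
    mj = tuple(sum(col) / len(Cj) for col in zip(*Cj))  # mean of Cj
    pairs = [dist(p, q) for p in Ci for q in Cj]        # all cross-cluster pairs
    return {
        "min": min(pairs),                         # single link
        "max": max(pairs),                         # complete link
        "mean": dist(mi, mj),                      # distance between means
        "avg": sum(pairs) / (len(Ci) * len(Cj)),   # average pairwise distance
    }

d = cluster_distances([(0, 0), (0, 2)], [(4, 0), (4, 2)])
# d["min"] == 4.0 and d["mean"] == 4.0 for these two parallel pairs
```

Note how single link ("min") and complete link ("max") already differ on this tiny example, which is what makes the choice of measure matter for the shape of the resulting hierarchy.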

Challenges
- Hard to choose merge/split points
- Merging/splitting can never be undone, so these decisions are critical
- High complexity: O(n^2)
- Integrating hierarchical clustering with other techniques: BIRCH, CURE, CHAMELEON, ROCK

BIRCH
- Balanced Iterative Reducing and Clustering using Hierarchies
- CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
- Clustering objects → clustering leaf nodes of the CF tree

Clustering Feature Vector
- Clustering feature: CF = (N, LS, SS)
  - N: number of data points
  - LS: sum over i = 1..N of o_i (linear sum)
  - SS: sum over i = 1..N of o_i^2 (square sum)
- Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190))

CF-tree in BIRCH
- Clustering feature: summarizes the statistics for a cluster; many cluster quality measures (e.g., radius, distance) can be derived from it
- Additivity: CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
- A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
- A nonleaf node in the tree has descendants ("children"); the nonleaf nodes store the sums of the CFs of their children
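Both the CF example and the additivity property are easy to verify numerically. A small sketch, using per-dimension sums as in the slide's example (the helper names `cf` and `cf_add` are introduced here):

```python
# Clustering feature CF = (N, LS, SS) for 2-D points, with SS kept
# per-dimension to match the slide's example (54, 190).
def cf(points):
    N = len(points)
    LS = tuple(sum(col) for col in zip(*points))                 # linear sum
    SS = tuple(sum(x * x for x in col) for col in zip(*points))  # square sum
    return (N, LS, SS)

def cf_add(a, b):
    """Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts))                                      # (5, (16, 30), (54, 190))
print(cf_add(cf(pts[:2]), cf(pts[2:])) == cf(pts))  # True
```

Additivity is what lets a nonleaf node summarize its subtree exactly: its CF is just the sum of the CFs of its children, with no need to revisit the raw points.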

CF Tree (B = 7, L = 6)
- Root: entries CF_1, CF_2, ..., CF_6, each with a child pointer
- Non-leaf node: entries CF_1, CF_2, ..., CF_5, each with a child pointer
- Leaf nodes: entries CF_1, CF_2, ..., with prev/next pointers chaining the leaves together

Parameters of a CF-tree
- Branching factor: the maximum number of children per node
- Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
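The threshold can be checked from a CF entry alone, without the raw points: the diameter (average pairwise distance) of a sub-cluster is derivable from (N, LS, SS). A sketch assuming the per-dimension SS convention used in the earlier example (the function name `diameter` is introduced here):

```python
# Diameter of a sub-cluster from its CF alone:
#   D = sqrt((2*N*sum(SS) - 2*||LS||^2) / (N*(N-1)))
# which follows from expanding sum_{i,j} ||o_i - o_j||^2.
from math import sqrt

def diameter(cf_entry):
    N, LS, SS = cf_entry
    if N < 2:
        return 0.0  # a single point has no pairwise distances
    return sqrt((2 * N * sum(SS) - 2 * sum(x * x for x in LS)) / (N * (N - 1)))

# Two points 2 apart, e.g. (0,0) and (2,0): CF = (2, (2,0), (4,0))
print(diameter((2, (2, 0), (4, 0))))  # 2.0
```

This is how BIRCH decides whether a new point can be absorbed into an existing leaf entry: add its CF (by additivity), recompute the diameter, and compare against the threshold.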

BIRCH Clustering
- Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree

Pros & Cons of BIRCH
- Linear scalability: good clustering with a single scan; quality can be further improved by a few additional scans
- Can handle only numeric data
- Sensitive to the order of the data records

Drawbacks of Square-Error-Based Methods
- One representative per cluster: good only for convex-shaped clusters of similar size and density
- k, the number of clusters, is a parameter: good only if k can be reasonably estimated

CURE: the Ideas
- Each cluster has c representatives
  - Choose c well-scattered points in the cluster
  - Shrink them towards the mean of the cluster by a fraction α
  - The representatives capture the physical shape and geometry of the cluster
- Merge the closest two clusters
  - Distance between two clusters: the distance between their two closest representatives

CURE: the Algorithm
- Draw a random sample S
- Partition the sample into p partitions
- Partially cluster each partition
- Eliminate outliers: random sampling + removal of clusters growing too slowly
- Cluster the partial clusters until only k clusters are left
- Shrink the representatives of each cluster towards the cluster center

Data Partitioning and Clustering

(Figure: scatter plots of the sample, its partitions, and the resulting partial clusters)

Shrinking Representative Points
- Shrink the multiple representative points towards the gravity center by a fraction α
- The representatives capture the shape of the cluster

(Figure: representative points before and after shrinking)
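The shrinking step above is a single linear interpolation per representative, rep' = rep + α·(mean − rep). A minimal sketch (the function name `shrink` is introduced here for illustration):

```python
# Shrink each representative point toward the cluster's gravity center
# by a fraction alpha: rep' = rep + alpha * (mean - rep).
def shrink(representatives, alpha):
    cols = list(zip(*representatives))
    mean = tuple(sum(col) / len(col) for col in cols)  # gravity center
    return [tuple(r + alpha * (m - r) for r, m in zip(rep, mean))
            for rep in representatives]

reps = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
shrunk = shrink(reps, 0.5)
print(shrunk)  # each corner moves halfway toward the center (2, 2)
```

With α = 0 the representatives stay on the cluster boundary (outlier-sensitive, like all-points approaches); with α = 1 they collapse to the mean (like centroid methods). Intermediate α is what lets CURE keep shape information while damping outliers.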

Clustering Categorical Data: ROCK
- ROCK: RObust Clustering using linKs
- Uses the number of common neighbors between two points (links) to measure similarity/proximity; not distance based
- Complexity: O(n^2 + n*m_m*m_a + n^2 log n), where m_m is the maximum and m_a the average number of neighbors
- Basic idea, similarity function and neighbors: let T1 = {1,2,3} and T2 = {3,4,5}; then
  Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
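The slide's Jaccard-style similarity for categorical transactions maps directly onto Python sets:

```python
# Jaccard similarity between two transactions (sets of items),
# as used to define neighbors in ROCK.
def sim(T1, T2):
    return len(T1 & T2) / len(T1 | T2)  # |intersection| / |union|

T1, T2 = {1, 2, 3}, {3, 4, 5}
print(sim(T1, T2))  # 0.2
```

Two points are "neighbors" when their similarity exceeds a chosen threshold θ, and the link count between two points is the number of neighbors they share; that count, not any raw distance, is what ROCK's merge criterion optimizes.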

Limitations
- Merging decisions based on static modeling: no special characteristics of the clusters are considered

(Figure: two pairs of clusters; CURE and BIRCH would merge one pair, although the other pair is more appropriate for merging)

Chameleon
- Hierarchical clustering using dynamic modeling
- Measures the similarity based on a dynamic model: the interconnectivity & closeness (proximity) between two clusters vs. the interconnectivity of the clusters and the closeness of items within the clusters
- A two-phase algorithm
  - Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
  - Find the genuine clusters by repeatedly combining sub-clusters

Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters

To-Do List
- Read Chapter 10.3
- (For thesis-based graduate students only) Read the paper "BIRCH: an efficient data clustering method for very large databases"