
CS570: Introduction to Data Mining
Scalable Clustering Methods: BIRCH and Others
Reading: Han Chapter 10.3; Tan Chapter 9.5
Cengiz Gunay, Ph.D.
Slides courtesy of Li Xiong, Ph.D.; 2011 Han, Kamber & Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann; and 2006 Tan, Steinbach & Kumar, Introduction to Data Mining, Pearson Addison Wesley.

Previously: Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram, a tree-like diagram. A clustering is obtained by cutting the dendrogram at the desired level. We do not have to assume any particular number of clusters, and the levels may correspond to meaningful taxonomies.
[Figure: six nested clusters of points and the corresponding dendrogram.]

Previously: Major Weaknesses of Hierarchical Clustering
Does not scale well (N: number of points): space complexity O(N^2); time complexity O(N^3), or O(N^2 log N) for some cases/approaches. Cannot undo what was done previously. Quality varies with the distance measure: MIN (single link) is susceptible to noise and outliers; MAX and GROUP AVERAGE may not work well with non-globular clusters. How can we improve?

Scalable Hierarchical Clustering Methods
Combine hierarchical and partitioning approaches. Representative methods:
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
- CURE (1998): uses representative points for inter-cluster distance
- ROCK (1999): clusters categorical data by neighbor and link analysis
- CHAMELEON (1999): hierarchical clustering using dynamic modeling on graphs

BIRCH: A Tree-Based Approach

BIRCH
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD 96); winner of the SIGMOD 10-year test-of-time award.
Main ideas:
- Incremental (does not need the whole dataset up front)
- Summarizes the data using an in-memory clustering feature (CF)
- Combines hierarchical clustering for microclustering with partitioning for macroclustering
Features: scales linearly; finds a good clustering with a single scan and improves the quality with a few additional scans.
Weaknesses: handles only numeric data, and is sensitive to the order of the data records.

Cluster Statistics
Given a cluster of $n$ instances $\{x_i\}$:
Centroid: $x_0 = \frac{1}{n}\sum_{i=1}^{n} x_i$
Radius: average distance from member points to the centroid, $R = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \|x_i - x_0\|^2}$
Diameter: average pairwise distance within the cluster, $D = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n} \|x_i - x_j\|^2}$

Inter-Cluster Distance
Given two clusters with $n_1, n_2$ points and centroids $x_{0,1}, x_{0,2}$:
Centroid Euclidean distance: $D_0 = \|x_{0,1} - x_{0,2}\|_2$
Centroid Manhattan distance: $D_1 = \|x_{0,1} - x_{0,2}\|_1$
Average inter-cluster distance: $D_2 = \sqrt{\frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \|x_i - y_j\|^2}$

Clustering Feature (CF) in BIRCH
A CF summarizes a sub-cluster as a triple $CF = (N, LS, SS)$: the number of points, their linear sum, and their sum of squares. Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)). What is it good for?
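A minimal sketch of how this CF is computed, using the five example points from the slide (note the slide keeps SS per dimension; the BIRCH paper often keeps it as one scalar):

```python
import numpy as np

# The five points from the slide: (3,4), (2,6), (4,5), (4,7), (3,8)
points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)

n = len(points)                   # N  = 5
ls = points.sum(axis=0)           # LS = (16, 30): linear sum per dimension
ss = (points ** 2).sum(axis=0)    # SS = (54, 190): sum of squares per dimension

print(n, ls, ss)                  # reproduces CF = (5, (16, 30), (54, 190))
```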

Properties of Clustering Feature
A CF entry is compact: it stores significantly less than all of the data points in the sub-cluster, yet has sufficient information to calculate statistics about the cluster and intra-cluster distances. The additivity theorem, $CF_1 + CF_2 = (N_1 + N_2,\ LS_1 + LS_2,\ SS_1 + SS_2)$, allows us to merge sub-clusters incrementally and consistently.
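A quick illustration of the additivity theorem on the same five points, split into two disjoint sub-clusters:

```python
import numpy as np

# Sub-cluster CFs for {(3,4), (2,6), (4,5)} and {(4,7), (3,8)}
cf1 = (3, np.array([9.0, 15.0]), np.array([29.0, 77.0]))
cf2 = (2, np.array([7.0, 15.0]), np.array([25.0, 113.0]))

# Merging two disjoint sub-clusters is componentwise addition of the triples
merged = (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])
print(merged)  # (5, [16. 30.], [54. 190.]): the CF of the full cluster
```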

Cluster Statistics from CF
With $CF = (n, LS, SS)$, where $SS$ here is the scalar $\sum_i \|x_i\|^2$:
Radius (average distance to centroid): $R = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\|x_i - x_0\|^2} = \sqrt{\frac{n \cdot SS - \|LS\|^2}{n^2}}$
Diameter (average pairwise distance): $D = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n}\|x_i - x_j\|^2} = \sqrt{\frac{2n \cdot SS - 2\|LS\|^2}{n(n-1)}}$
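A short check of these closed forms against the direct definitions, again on the slide's five points (SS kept as a single scalar, the sum over dimensions):

```python
import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)
n, ls = len(points), points.sum(axis=0)
ss = (points ** 2).sum()                      # scalar SS = 244

radius = np.sqrt((n * ss - ls @ ls) / n**2)                   # R = 1.6
diameter = np.sqrt((2*n*ss - 2*(ls @ ls)) / (n * (n - 1)))    # D ~ 2.53

# Cross-check against the definitions that use the raw points
centroid = ls / n
assert np.isclose(radius, np.sqrt(((points - centroid)**2).sum() / n))
pair = sum(((p - q)**2).sum() for p in points for q in points)
assert np.isclose(diameter, np.sqrt(pair / (n * (n - 1))))
```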

Hierarchical CF-Tree
A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. A nonleaf node has children and stores the sums of the CFs of its children. A CF-tree has two parameters: the branching factor (maximum number of children) and the threshold (maximum diameter of sub-clusters stored at the leaf nodes). Why a tree?

The CF-Tree Structure
[Diagram: the root holds entries CF_1 ... CF_6, each with a child pointer; a nonleaf node holds entries CF_1 ... CF_5 with child pointers; leaf nodes hold their own CF entries and are chained together by prev/next pointers.]

CF-Tree Insertion
1. Traverse down from the root to find the appropriate leaf: follow the "closest-CF" path with respect to the chosen distance measures.
2. Modify the leaf: if the closest CF entry cannot absorb the point without violating the threshold, make a new CF entry; if there is no room for the new entry, split the node.
3. Traverse back up, updating the CFs on the path and splitting nodes as needed.
A simplified sketch of the leaf-level absorb-or-create step is shown below.
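A minimal, hedged sketch of that decision, ignoring the tree routing and node splits; the threshold bounds the sub-cluster diameter, as in the CF-tree parameters above:

```python
import numpy as np

def insert_point(leaf_cfs, x, threshold):
    """Absorb x into the closest leaf CF if the sub-cluster diameter stays
    within the threshold; otherwise start a new CF entry. A real CF-tree
    also routes through nonleaf nodes and splits nodes that overflow."""
    x = np.asarray(x, dtype=float)
    best = min(leaf_cfs, default=None,
               key=lambda cf: np.linalg.norm(x - cf[1] / cf[0]))
    if best is not None:
        n, ls, ss = best[0] + 1, best[1] + x, best[2] + x @ x
        diameter = np.sqrt((2*n*ss - 2*(ls @ ls)) / (n * (n - 1)))
        if diameter <= threshold:
            best[:] = [n, ls, ss]          # absorb: update CF in place
            return
    leaf_cfs.append([1, x.copy(), x @ x])  # start a new CF entry

# Toy example: stream a few points through the insertion rule
leaf = []
for p in [(3, 4), (2, 6), (4, 5), (9, 9), (4, 7)]:
    insert_point(leaf, np.array(p, dtype=float), threshold=4.0)
print(len(leaf))  # -> 2; the far-away point (9, 9) starts its own CF
```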

BIRCH Overview
[Figure: overview diagram of the BIRCH phases.]

The Algorithm: BIRCH
Phase 1: Scan the database to build an initial in-memory CF-tree. Subsequent phases become fast, accurate, and less order-sensitive.
Phase 2 (optional): Condense the data by rebuilding the CF-tree with a larger threshold T (why?).
Phase 3: Global clustering: use an existing clustering algorithm on the CF entries. This helps fix the problem where natural clusters span nodes.
Phase 4 (optional): Cluster refining: do additional passes over the dataset and reassign data points to the closest centroid from Phase 3.
A runnable sketch of these phases follows.
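scikit-learn ships a BIRCH implementation whose parameters map onto these phases (threshold corresponds to T, branching_factor to the CF-tree fan-out, and n_clusters triggers the Phase 3 global clustering); a minimal usage sketch on synthetic blob data:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Toy data standing in for a large database scan
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# threshold ~ T (leaf sub-cluster diameter), branching_factor ~ max children,
# n_clusters=3 runs a global clustering over the CF entries (Phase 3)
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)
print(len(birch.subcluster_centers_), "CF sub-clusters before Phase 3")
```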

BIRCH Summary
Main ideas:
- Incremental (does not need the whole dataset)
- Uses an in-memory clustering feature to summarize the data
- Uses hierarchical clustering for microclustering and other clustering methods (e.g., partitioning) for macroclustering
Features: scales linearly; finds a good clustering with a single scan and improves the quality with a few additional scans.
Weaknesses:
- Handles only numeric data
- Sensitive to the order of the data records
- Can produce unnatural clusters because of the leaf-node size limit
- Biased toward spherical clusters because of the diameter threshold (how to solve?)

CURE
CURE: Clustering Using REpresentatives. "CURE: An Efficient Clustering Algorithm for Large Databases" (1998), S. Guha, R. Rastogi, K. Shim. Addresses potential problems with MIN, MAX, and centroid distance based hierarchical clustering.
Main ideas:
- Use representative points for inter-cluster distance
- Random sampling and partitioning
Features: more robust to outliers; better for non-spherical shapes and non-uniform sizes. See Tan Ch. 9.5.3, pp. 635-639.

CURE: Cluster Points
A set of points (e.g., 10 or more) represents each cluster. Selecting representative points: start with the point farthest from the centroid; repeatedly add the point farthest from all points selected so far, until k points are chosen. Then shrink the representatives toward the center of the cluster by a fraction of their distance. Why shrink? Parallels to other methods? Cluster similarity is the similarity of the closest pair of representative points from different clusters. A sketch of this procedure is shown below.
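A minimal sketch of the representative-point selection and shrinking just described; the shrink fraction alpha is a tunable, and the helper names here are illustrative, not from the CURE paper:

```python
import numpy as np

def cure_representatives(points, k=10, alpha=0.2):
    """Pick k well-scattered points, then shrink them toward the centroid."""
    centroid = points.mean(axis=0)
    # Start with the point farthest from the centroid
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(k, len(points)):
        # Next: the point farthest from all representatives chosen so far
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.asarray(reps)
    return reps + alpha * (centroid - reps)   # shrink by fraction alpha

# Inter-cluster distance: closest pair of representatives across clusters
def cure_distance(reps_a, reps_b):
    return min(np.linalg.norm(a - b) for a in reps_a for b in reps_b)
```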

Experimental Results: CURE
[Figure: clustering results; picture from CURE, Guha, Rastogi, Shim.]

Experimental Results: CURE
[Figure: comparison against centroid-based and single-link clustering; picture from CURE, Guha, Rastogi, Shim.]

CURE Cannot Handle Differing Densities
[Figure: original points vs. the CURE result on clusters of differing density.]
So far we have handled only numerical data. What about categorical data?

Clustering Categorical Data: The ROCK Algorithm
ROCK: RObust Clustering using linKs. S. Guha, R. Rastogi & K. Shim, Int. Conf. Data Engineering (ICDE) 99.
Major ideas:
- Use links to measure similarity/proximity
- Sampling-based clustering
Features: more meaningful clusters; emphasizes interconnectivity but ignores proximity. Not in the textbook; see the paper above.

Similarity Measure in ROCK
Example domain: market basket data clustering. Jaccard coefficient-based similarity function: $Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$.
Example: two groups (clusters) of transactions:
- $C_1$, over items <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
- $C_2$, over items <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Let $T_1$ = {a, b, c}, $T_2$ = {c, d, e}, $T_3$ = {a, b, f}. The Jaccard coefficient may lead to the wrong clustering result:
$Sim(T_1, T_2) = \frac{|\{c\}|}{|\{a, b, c, d, e\}|} = \frac{1}{5} = 0.2$
$Sim(T_1, T_3) = \frac{|\{a, b\}|}{|\{a, b, c, f\}|} = \frac{2}{4} = 0.5$
$T_1$ and $T_3$ come from different clusters, yet they are more similar under Jaccard than $T_1$ and $T_2$, which come from the same cluster.

Link Measure in ROCK
Neighbors: two transactions are neighbors if $Sim(T_i, T_j) \geq \theta$ for a user-specified threshold (here $\theta = 0.5$). Links: the number of common neighbors of two transactions.
Reminder: $T_1$ = {a, b, c}, $T_2$ = {c, d, e}, $T_3$ = {a, b, f}; $C_1$: <a, b, c, d, e>, $C_2$: <a, b, f, g>.
Example:
link($T_1$, $T_2$) = 4, since they have 4 common neighbors: {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
link($T_1$, $T_3$) = 3, since they have 3 common neighbors: {a, b, d}, {a, b, e}, {a, b, g}
Links thus correctly rank $T_1$ as more related to $T_2$ than to $T_3$.
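These counts are easy to reproduce; a small sketch over the 14 example transactions with $\theta = 0.5$, where a transaction does not count as its own neighbor:

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

# C1: all 3-item subsets of {a..e}; C2: the four transactions over {a,b,f,g}
c1 = [frozenset(t) for t in combinations("abcde", 3)]
c2 = [frozenset(s) for s in ("abf", "abg", "afg", "bfg")]
data, theta = c1 + c2, 0.5

def neighbors(t):
    return {s for s in data if s != t and jaccard(t, s) >= theta}

def link(t1, t2):
    return len(neighbors(t1) & neighbors(t2))  # number of common neighbors

T1, T2, T3 = frozenset("abc"), frozenset("cde"), frozenset("abf")
print(link(T1, T2), link(T1, T3))   # -> 4 3, matching the slide
```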

ROCK Algorithm
1. Obtain a sample of points from the data set.
2. Compute the link value for each pair of points, i.e., the number of common neighbors (neighbors are determined by the Jaccard coefficient and the threshold).
3. Perform an agglomerative (bottom-up) hierarchical clustering on the data, using the number of shared neighbors as the similarity measure.
4. Assign the remaining points to the clusters that have been found.

Cluster Merging: Limitations of Current Schemes
Existing merging schemes are static in nature:
- MIN or CURE: merge two clusters based on their closeness (minimum distance)
- GROUP-AVERAGE or ROCK: merge two clusters based on their average connectivity

Limitations of Current Merging Schemes
[Figure: four example datasets (a)-(d).] Closeness schemes will merge (a) and (b); average connectivity schemes will merge (c) and (d). Solution?

Chameleon: Clustering Using Dynamic Modeling
Adapts to the characteristics of the data set to find the natural clusters. Uses a dynamic model to measure the similarity between clusters, whose main properties are the relative closeness and relative interconnectivity of the clusters. Two clusters are combined only if the resulting cluster shares certain properties with the constituent clusters; the merging scheme preserves self-similarity.

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar, 99.
Basic ideas:
- A graph-based clustering approach
- A two-phase algorithm: (1) partitioning: cluster objects into a large number of small sub-clusters; (2) agglomerative hierarchical clustering: repeatedly combine sub-clusters
- Measures similarity based on a dynamic model: interconnectivity and closeness (proximity)
Features: handles clusters of arbitrary shapes, sizes, and densities; scales well. See Han Section 10.3.4 and Tan Section 9.4.4.

Graph-Based Clustering
Uses the proximity graph: start with the proximity matrix, consider each point as a node in a graph, and weight each edge between two nodes by the proximity between the two points. On the fully connected proximity graph this gives MIN (single-link) and MAX (complete-link). Sparsification removes edges so that graph algorithms become practical: keep only k-nearest-neighbor (kNN) edges, or only edges above a threshold. Clusters are then connected components in the graph; a small kNN example follows.
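For the kNN-based sparsification, scikit-learn's kneighbors_graph builds exactly this kind of sparse proximity graph; a minimal sketch on toy data (the blob layout is an assumption for illustration):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
# Two well-separated toy blobs of 50 points each
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

# Sparsified proximity graph: keep only each point's 5 nearest neighbors
G = kneighbors_graph(X, n_neighbors=5, mode="connectivity")
n_comp, labels = connected_components(G, directed=False)
print(n_comp)  # -> 2 for these blobs: clusters fall out as components
```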

Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters

Chameleon: Steps
Preprocessing step: represent the data by a graph. Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors. The concept of neighborhood is captured dynamically (even if a region is sparse).
Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices. Each cluster should contain mostly points from one true cluster, i.e., be a sub-cluster of a real cluster.

Chameleon: Steps (cont.)
Phase 2: Use hierarchical agglomerative (bottom-up) clustering to merge sub-clusters. Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters. Two key properties are used to model cluster similarity:
- Relative Interconnectivity: absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters
- Relative Closeness: absolute closeness of two clusters normalized by the internal closeness of the clusters
The usual formal definitions are sketched below.
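For reference, a reconstruction of these criteria following the definitions in the CHAMELEON paper, where $EC_{\{C_i,C_j\}}$ is the edge cut between the two clusters, $EC_{C_k}$ the min-cut bisector of cluster $C_k$, and $\bar{S}_{EC}$ the average weight of the corresponding edges:

$$RI(C_i, C_j) = \frac{\left|EC_{\{C_i,C_j\}}\right|}{\tfrac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}$$

$$RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i,C_j\}}}}{\frac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_j}}}$$

Two sub-clusters are merged when both RI and RC (or a combined score such as $RI \cdot RC^{\alpha}$) are high, which is what makes the merging criterion adapt to each cluster's own internal density.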

CHAMELEON (Clustering Complex Objects)
[Figure: CHAMELEON results on complex 2-D point sets.]

Chapter 7. Cluster Analysis
- Overview
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Other methods
- Cluster evaluation
- Outlier analysis
- Summary