Clustering Part 4 DBSCAN

Clustering Part 4
Dr. Sanjay Ranka, Professor, Computer and Information Science and Engineering, University of Florida, Gainesville

DBSCAN
DBSCAN is a density-based clustering algorithm. Density is defined as the number of points within a specified radius (Eps). A point is a core point if it has more than a specified number of points (MinPts) within Eps; core points lie in the interior of a cluster. A border point has fewer than MinPts within Eps but lies in the neighborhood of a core point. A noise point is any point that is neither a core point nor a border point.
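A minimal sketch of this point classification, using brute-force neighborhood counts. The function name and the toy data are illustrative assumptions; eps and min_pts mirror Eps and MinPts above.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (brute-force DBSCAN step)."""
    # Pairwise Euclidean distances
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood counts include the point itself; >= is used here, change to > if
    # you follow the "more than MinPts" wording strictly
    neighbor_counts = (dists <= eps).sum(axis=1)
    is_core = neighbor_counts >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif is_core[dists[i] <= eps].any():
            labels.append("border")   # not core, but a core point lies within Eps
        else:
            labels.append("noise")
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], dtype=float)
print(classify_points(X, eps=1.5, min_pts=3))   # ['core', 'core', 'core', 'core', 'noise']
```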

DBSCAN: Core, Border and Noise Points
[Figure: example dataset with points marked as core, border, and noise]

When DBSCAN Works Well
[Figure: original dataset and the clusters found by DBSCAN]

DBSCAN: Core, Border and Noise Points
[Figure: original points with Eps = 10, MinPts = 4; point types shown as core, border, and noise]

DBSCAN: Determining Eps and MinPts
The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance, while noise points have their k-th nearest neighbor at a farther distance. So, plot the sorted distance of every point to its k-th nearest neighbor (k = 4 is commonly used for 2D points).
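A sketch of this k-distance diagram, assuming the data is a NumPy array of points; the "knee" of the resulting curve is a reasonable candidate for Eps. The two-blob sample data is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    """Plot each point's distance to its k-th nearest neighbor, sorted in increasing order."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)                 # column 0 is each point's distance to itself (0)
    kth = np.sort(dists[:, k])         # k-th nearest neighbor distance for every point
    plt.plot(kth)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel(f"Distance to {k}-th nearest neighbor")
    plt.show()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
k_distance_plot(X, k=4)
```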

Where DBSCAN Doesn't Work Well
[Figure: original points and the clusters found with MinPts = 4, Eps = 9.75 and with MinPts = 4, Eps = 9.92]

DENCLUE
DENCLUE (DENsity CLUstEring) is a density-based clustering approach that models the overall density of a set of points as the sum of influence functions associated with each point. DENCLUE is based on kernel density estimation, whose goal is to describe the distribution of the data by a function. For kernel density estimation, the contribution of each point to the overall density function is expressed by an influence (kernel) function; the overall density is then simply the sum of the influence functions associated with the individual points.

DENCLUE
The resulting overall density function will have local peaks, i.e., local density maxima, and these local peaks can be used to define clusters. For each point, a hill-climbing algorithm finds the nearest peak associated with that point, and the set of all data points associated with a peak forms a cluster. However, if the density at a local peak is too low, then the points in the associated cluster are labeled as noise and discarded. Similarly, if two peaks are connected by a path of data points, and the density at each point on the path is above a minimum density threshold, then the clusters associated with the two peaks are merged.

DENCLUE: Kernel Function
Typically the kernel function is symmetric, and its value decreases as the distance from the point increases. The Gaussian function is often used as a kernel function.
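A sketch of the density function DENCLUE works with: a sum of Gaussian influence functions, one per data point. The bandwidth sigma is an assumed smoothing parameter, not something the slides specify.

```python
import numpy as np

def overall_density(x, points, sigma=1.0):
    """Kernel density at location x: sum of Gaussian influence functions of all data points."""
    sq_dists = np.sum((points - x) ** 2, axis=1)
    return np.sum(np.exp(-sq_dists / (2.0 * sigma ** 2)))

points = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0]])
print(overall_density(np.array([0.2, 0.1]), points))    # high: near two of the points
print(overall_density(np.array([10.0, 10.0]), points))  # low: far from all points
```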

Subspace Clustering
Instead of using all the attributes (features) of a dataset, we can consider only a subset of the features (a subspace of the data). The clusters we find then depend on the subset of attributes we consider, and they can be quite different from one subspace to another.

Subspace Clustering
[Figure: example of clusters found in different subspaces of the same dataset]

Subspace Clustering
[Figures: further examples of clusters found in different subspaces]

CLIQUE
CLIQUE is a grid-based clustering algorithm. CLIQUE splits each dimension (attribute) into a fixed number of equal-length intervals, which partitions the data space into rectangular units of equal volume. We can measure the density of each unit by the fraction of points it contains. A unit is considered dense if its density exceeds a user-specified threshold. A cluster is a group of contiguous (touching) dense units.

CLIQUE: Example
[Figure: grid over a 2D dataset with dense units highlighted]
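A minimal sketch of the grid step for a single dimension, assuming the values are scaled to [0, 1); n_intervals and threshold correspond to CLIQUE's two user parameters (the number of intervals per dimension and the density threshold). The sample data is illustrative.

```python
import numpy as np
from collections import Counter

def dense_units_1d(values, n_intervals=10, threshold=0.05):
    """Return indices of 1-D intervals whose fraction of points exceeds the threshold."""
    # Map each value in [0, 1) to one of n_intervals equal-length bins
    bins = np.minimum((values * n_intervals).astype(int), n_intervals - 1)
    counts = Counter(bins)
    n = len(values)
    return {b for b, c in counts.items() if c / n > threshold}

rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0.0, 0.2, 200),
                    rng.uniform(0.7, 0.8, 100),
                    rng.uniform(0.0, 1.0, 20)])
print(sorted(dense_units_1d(x)))   # the bins covering [0, 0.2) and [0.7, 0.8) come out dense
```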

CLIQUE
CLIQUE starts by finding all the dense areas in the one-dimensional spaces associated with each attribute. It then generates the set of two-dimensional cells that might possibly be dense by looking at pairs of dense one-dimensional cells. In general, CLIQUE generates the candidate set of k-dimensional cells that might possibly be dense by looking at dense (k-1)-dimensional cells; this is similar to the APRIORI algorithm for finding frequent item sets. It then finds clusters by taking the union of all adjacent high-density cells.

MAFIA (Merging of Adaptive Finite Intervals, And more than a CLIQUE)
MAFIA is a modification of CLIQUE that runs faster and finds better-quality clusters; pMAFIA is a parallel version of MAFIA. The main modification over CLIQUE is the use of an adaptive grid.

MAFIA
Initially, each dimension is partitioned into a large number of intervals, and a histogram is generated that shows the number of data points in each interval. Groups of adjacent intervals are grouped into windows, and the maximum number of points in the window's intervals becomes the value associated with the window.

MAFIA
Adjacent windows are merged if their values are close. As a special case, if all windows end up combined into one window, the dimension is instead partitioned into a fixed number of cells and the threshold for being considered a dense unit is increased for that dimension.
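A rough sketch of this adaptive-grid construction for one dimension. The fine bin count, window size, and the "close" tolerance are illustrative assumptions, not values prescribed by the slides.

```python
import numpy as np

def adaptive_windows(values, n_fine_bins=50, window_size=5, tol=0.2):
    """Build coarse windows over one dimension by merging adjacent fine histogram windows."""
    hist, edges = np.histogram(values, bins=n_fine_bins)
    # Each window covers window_size adjacent intervals; its value is the max interval count
    windows = [(edges[i], edges[min(i + window_size, n_fine_bins)], hist[i:i + window_size].max())
               for i in range(0, n_fine_bins, window_size)]
    merged = [windows[0]]
    for lo, hi, val in windows[1:]:
        plo, phi, pval = merged[-1]
        # Merge adjacent windows whose values are within tol (relative difference)
        if abs(val - pval) <= tol * max(val, pval, 1):
            merged[-1] = (plo, hi, max(val, pval))
        else:
            merged.append((lo, hi, val))
    return merged

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 500), rng.uniform(5, 9, 200)])
for lo, hi, val in adaptive_windows(x):
    print(f"[{lo:6.2f}, {hi:6.2f})  max interval count = {val}")
```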

Limitations of CLIQUE and MAFIA
Time complexity is exponential in the number of dimensions, and both algorithms have difficulty if too many dense units are generated at the lower stages. They may fail if clusters have widely differing densities, since the density threshold is fixed. Determining an appropriate number of intervals and density threshold for a variety of data sets can be challenging, and it is typically not possible to find all clusters using the same threshold.

Clustering Scalability for Large Datasets
One very common solution is sampling, but sampling can miss small clusters, and the data is sometimes not organized in a way that makes valid sampling easy or efficient. Another approach is to compress the data, or portions of the data; any such approach must ensure that not too much information is lost (Scaling Clustering Algorithms to Large Databases, Bradley, Fayyad and Reina).

Scalable Clustering: BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) can efficiently cluster data with a single pass and can improve that clustering in additional passes. It can work with a number of different distance metrics, and it can also deal effectively with outliers.

Scalable Clustering: BIRCH
BIRCH is based on the notion of a clustering feature (CF) and a CF tree. A cluster of data points (vectors) can be represented by a triplet of numbers (N, LS, SS), where N is the number of points in the cluster, LS is the linear sum of the points, and SS is the sum of squares of the points. Points are processed incrementally: each point is placed in the leaf node corresponding to the closest cluster (CF), and the clusters (CFs) are updated.
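A small sketch of the clustering-feature idea: the (N, LS, SS) triplet can be updated incrementally and merged, and the centroid falls out of it directly. The class and method names are assumptions for illustration, and SS is kept here as a single scalar sum of squared norms.

```python
import numpy as np

class ClusteringFeature:
    """BIRCH-style clustering feature: an (N, LS, SS) summary of a set of points."""
    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)   # linear sum of the points
        self.ss = 0.0             # sum of squared norms of the points

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += float(np.dot(x, x))

    def merge(self, other):
        # Two CFs can be combined without revisiting the raw points
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

cf = ClusteringFeature(dim=2)
for p in np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 4.0]]):
    cf.add(p)
print(cf.n, cf.centroid())   # 3 [2.         2.66666667]
```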

Scalable Clustering: BIRCH
Basic steps of BIRCH: first, load the data into memory by creating a CF tree that summarizes the data. Second, perform global clustering, which produces a better clustering than the initial step; an agglomerative hierarchical technique was selected for this phase. Finally, redistribute the data points using the centroids of the clusters discovered in the global clustering phase, and thus discover a new (and hopefully better) set of clusters.

Scalable Clustering: CURE
CURE (Clustering Using Representatives) uses a number of points to represent a cluster. Representative points are found by selecting a constant number of points from a cluster and then shrinking them toward the center of the cluster. Cluster similarity is the similarity of the closest pair of representative points from different clusters. Shrinking the representative points toward the center helps avoid problems with noise and outliers, and CURE is better able to handle clusters of arbitrary shapes and sizes.
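A sketch of the representative-point step. The greedy farthest-point selection used to scatter the representatives and the shrink factor alpha are assumptions in the spirit of CURE, not details given on the slides.

```python
import numpy as np

def cure_representatives(cluster_points, n_rep=5, alpha=0.3):
    """Pick n_rep scattered points from a cluster and shrink them toward its centroid."""
    center = cluster_points.mean(axis=0)
    # Greedy farthest-point selection to obtain scattered representatives
    reps = [cluster_points[np.argmax(np.linalg.norm(cluster_points - center, axis=1))]]
    while len(reps) < min(n_rep, len(cluster_points)):
        dists = np.min([np.linalg.norm(cluster_points - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_points[np.argmax(dists)])
    reps = np.array(reps)
    # Shrink each representative toward the centroid by a factor alpha
    return reps + alpha * (center - reps)

rng = np.random.default_rng(3)
pts = rng.normal(0, 1, (200, 2))
print(cure_representatives(pts, n_rep=4, alpha=0.3))
```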

Experimental Results: CURE
[Figures: example clusterings produced by CURE]

CURE Cannot Handle Differing Densities
[Figure: original points and the clusters found by CURE on data with differing densities]

Graph-Based Clustering
Graph-based clustering uses the proximity graph. Start with the proximity matrix and consider each point as a node in a graph; each edge between two nodes has a weight, which is the proximity between the two points. Initially the proximity graph is fully connected, and MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph. In the simplest case, clusters are the connected components of the graph.
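A minimal sketch of that simplest case: threshold the proximity (here, distance) matrix to build a graph, then take the connected components as clusters. The threshold value and toy data are illustrative assumptions.

```python
import numpy as np

def connected_component_clusters(X, threshold):
    """Cluster points as connected components of the graph linking pairs closer than threshold."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    adjacency = dists <= threshold
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]                    # depth-first search over the graph
        while stack:
            i = stack.pop()
            if labels[i] != -1:
                continue
            labels[i] = cluster
            stack.extend(j for j in range(n) if adjacency[i, j] and labels[j] == -1)
        cluster += 1
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]], dtype=float)
print(connected_component_clusters(X, threshold=1.5))   # [0, 0, 0, 1, 1]
```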

Graph-Based Clustering: Sparsification
The amount of data that needs to be processed is drastically reduced: sparsification can eliminate more than 99% of the entries in a similarity matrix. As a result, the amount of time required to cluster the data is drastically reduced, and the size of the problems that can be handled is increased.

Sparsification
Clustering may also work better. Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points. Since the nearest neighbors of a point tend to belong to the same class as the point itself, this reduces the impact of noise and outliers and sharpens the distinction between clusters. Sparsification also facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning), such as Chameleon and hypergraph-based clustering.
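A sketch of k-nearest-neighbor sparsification of a similarity matrix: keep each point's k strongest connections and zero out the rest. Symmetrizing the result is one common choice and is assumed here; the Gaussian similarity and sample data are illustrative.

```python
import numpy as np

def knn_sparsify(similarity, k):
    """Keep only each row's k largest off-diagonal similarities; zero out everything else."""
    n = similarity.shape[0]
    sparse = np.zeros_like(similarity)
    for i in range(n):
        row = similarity[i].copy()
        row[i] = -np.inf                          # ignore self-similarity
        keep = np.argpartition(row, -k)[-k:]      # indices of the k most similar neighbors
        sparse[i, keep] = similarity[i, keep]
    return np.maximum(sparse, sparse.T)           # keep an edge if either endpoint keeps it

rng = np.random.default_rng(4)
X = rng.normal(0, 1, (6, 2))
sim = np.exp(-np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))   # Gaussian similarity
print(knn_sparsify(sim, k=2))
```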

Sparsification
[Figure: effect of sparsification on a similarity graph]

Limitations of Current Merging Schemes
Existing merging schemes are static in nature.

Chameleon: Clustering Using Dynamic Modeling
Chameleon adapts to the characteristics of the data set to find the natural clusters. It uses a dynamic model to measure the similarity between clusters; the main properties are the relative closeness and relative interconnectivity of the clusters. Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters, so the merging scheme preserves self-similarity. One of the areas of application is spatial data.

Characteristics of Spatial Datasets
Clusters are defined as densely populated regions of the space. Clusters have arbitrary shapes and orientations and non-uniform sizes. There are differences in density across clusters and variation in density within clusters, as well as special artifacts (streaks) and noise. The clustering algorithm must address these characteristics while requiring minimal supervision.

Chameleon
Preprocessing step, represent the data by a graph: given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors. Phase 1: use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices. Each cluster should contain mostly points from one true cluster, i.e., it is a sub-cluster of a real cluster; graph algorithms take the global structure into account.

Chameleon
Phase 2: use hierarchical agglomerative clustering to merge the sub-clusters. Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters. Two key properties are used to model cluster similarity. Relative interconnectivity: the absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters. Relative closeness: the absolute closeness of two clusters normalized by the internal closeness of the clusters.
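The slides state these two normalizations only in words; following the standard Chameleon formulation, they are usually written as below, where EC(C_i, C_j) is the sum of the weights of the edges connecting the two sub-clusters, EC(C_i) is the weight of the edges in a min-cut bisection of C_i, and the bar denotes the corresponding average edge weight. The notation is supplied here for illustration.

```latex
% Relative interconnectivity (RI) and relative closeness (RC),
% standard Chameleon formulation.
\[
RI(C_i, C_j) \;=\; \frac{\lvert EC(C_i, C_j) \rvert}
                        {\tfrac{1}{2}\bigl(\lvert EC(C_i)\rvert + \lvert EC(C_j)\rvert\bigr)}
\qquad
RC(C_i, C_j) \;=\; \frac{\bar{S}_{EC(C_i, C_j)}}
                        {\dfrac{\lvert C_i\rvert}{\lvert C_i\rvert + \lvert C_j\rvert}\,\bar{S}_{EC(C_i)}
                        + \dfrac{\lvert C_j\rvert}{\lvert C_i\rvert + \lvert C_j\rvert}\,\bar{S}_{EC(C_j)}}
\]
```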

Experimental Results: Chameleon
[Figures: example clusterings produced by Chameleon]

Experimental Results: CURE (10 clusters)
[Figure]

Experimental Results: CURE (15 clusters)
[Figure]

Experimental Results: Chameleon
[Figure]

Experimental Results: CURE (9 clusters)
[Figure]

Experimental Results: CURE (15 clusters)
[Figure]