DS504/CS586: Big Data Analytics Big Data Clustering II

Similar documents
DS504/CS586: Big Data Analytics Big Data Clustering II

Clustering Lecture 4: Density-based Methods

Data Mining 4. Cluster Analysis

DBSCAN. Presented by: Garrett Poppe

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

CSE 5243 INTRO. TO DATA MINING

Distance-based Methods: Drawbacks

Density-Based Clustering. Izabela Moise, Evangelos Pournaras

Clustering Part 4 DBSCAN

Data Mining Algorithms

University of Florida CISE department Gator Engineering. Clustering Part 4

Knowledge Discovery in Databases

Clustering CS 550: Machine Learning

COMP 465: Data Mining Still More on Clustering

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

CSE 347/447: DATA MINING

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

CS570: Introduction to Data Mining

Clustering Algorithms for Data Stream

Cluster Analysis: Basic Concepts and Algorithms

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Heterogeneous Density Based Spatial Clustering of Application with Noise

Cluster Analysis (b) Lijun Zhang

Clustering part II 1

Faster Clustering with DBSCAN

Clustering in Data Mining

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

Lecture 7 Cluster Analysis: Part A

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS

Unsupervised Learning : Clustering

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Data Stream Clustering Using Micro Clusters

Density-based clustering algorithms DBSCAN and SNN

Cluster Analysis. Ying Shen, SSE, Tongji University

CSE 5243 INTRO. TO DATA MINING

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition

K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels

CSE 5243 INTRO. TO DATA MINING

A Comparative Study of Various Clustering Algorithms in Data Mining

CHAPTER 4: CLUSTER ANALYSIS

CS249: ADVANCED DATA MINING

Machine Learning (BSMC-GA 4439) Wenke Liu

Analysis and Extensions of Popular Clustering Algorithms

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH

Course Content. What is Classification? Chapter 6 Objectives

CS Data Mining Techniques Instructor: Abdullah Mueen

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

DATA MINING - 1DL105, 1Dl111. An introductory class in data mining

Course Content. Classification = Learning a Model. What is Classification?

数据挖掘 Introduction to Data Mining

University of Florida CISE department Gator Engineering. Clustering Part 5

Lecture Notes for Chapter 8. Introduction to Data Mining

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

Data Mining: Concepts and Techniques. Chapter 7 Jiawei Han. University of Illinois at Urbana-Champaign. Department of Computer Science

A Parallel Community Detection Algorithm for Big Social Networks

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Chapter VIII.3: Hierarchical Clustering

CS145: INTRODUCTION TO DATA MINING

A New Approach to Determine Eps Parameter of DBSCAN Algorithm

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

CS6220: DATA MINING TECHNIQUES

OPTICS-OF: Identifying Local Outliers

Scalable Varied Density Clustering Algorithm for Large Datasets

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

An Enhanced Density Clustering Algorithm for Datasets with Complex Structures

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

AutoEpsDBSCAN : DBSCAN with Eps Automatic for Large Dataset

What is Cluster Analysis?

Clustering Lecture 3: Hierarchical Methods

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

University of Florida CISE department Gator Engineering. Clustering Part 2

CS6220: DATA MINING TECHNIQUES

Unsupervised learning on Color Images

A Review on Cluster Based Approach in Data Mining

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Network Traffic Measurements and Analysis

Mining Social Network Graphs

Chapter ML:XI (continued)

Towards New Heterogeneous Data Stream Clustering based on Density

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)......

On Clustering Validation Techniques

Clustering Techniques

ISSN: (Online) Volume 2, Issue 2, February 2014 International Journal of Advance Research in Computer Science and Management Studies

Contents. 2.5 DBSCAN Pair Auto-Correlation Saving your results References... 18

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Clustering in Ratemaking: Applications in Territories Clustering

Hierarchical Density-Based Clustering for Multi-Represented Objects

CS6220: DATA MINING TECHNIQUES

Clustering Algorithms for Spatial Databases: A Survey

Hierarchical clustering

Review of Spatial Clustering Methods

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Chapter 8: GPS Clustering and Analytics

UNIVERSITY OF CALGARY. Distributed Gaussian Mixture Model Summarization Using the MapReduce Framework. Arina Esmaeilpour A THESIS

Applications. Foreground / background segmentation Finding skin-colored regions. Finding the moving objects. Intelligent scissors

Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering

Transcription:

Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016

More Discussions, Limitations v Center based clustering K-means BFR algorithm v Hierarchical clustering Slides on DBSCAN and DENCLUE are in part based on lecture slides from CSE 601 at University of Buffalo

Example: Picking k=3 Just right; distances rather short. x x x x x x x x x x x x x x x x x xx x x x x x x x x x x x x x x x x x x x x x x J. Leskovec, A. Rajaraman, J. Ullman: 3 Mining of Massive Datasets, http:// www.mmds.org

Limitations of K-means v K-means has problems when clusters are of different Sizes Densities Non-globular shapes v K-means has problems when the data contains outliers.

Limitations of K-means: Differing Sizes Original Points K-means (3 Clusters)

Limitations of K-means: Differing Density Original Points K-means (3 Clusters)

Limitations of K-means: Non-globular Shapes Original Points K-means (2 Clusters)

Overcoming K-means Limitations Original Points K-means Clusters One solution is to use many clusters. Find parts of clusters, but need to put together.

Overcoming K-means Limitations Original Points K-means Clusters

Overcoming K-means Limitations Original Points K-means Clusters

Hierarchical Clustering: Group Average 5 4 1 5 2 2 4 3 3 1 6 0.25 0.2 0.15 0.1 0.05 0 3 6 4 1 2 5 Nested Clusters Dendrogram

Hierarchical Clustering: Time and Space requirements v O(N 2 ) space since it uses the proximity matrix. N is the number of points. v O(N 3 ) time in many cases There are N steps and at each step the size, N 2, proximity matrix must be updated and searched Complexity can be reduced to O(N 2 log(n) ) time for some approaches

Hierarchical Clustering: Problems and Limitations v Once a decision is made to combine two clusters, it cannot be undone v No objective function is directly minimized v Sensitivity to noise and outliers

Density-based Approaches v Why Density-Based Clustering methods? (Non-globular issue) Discover clusters of arbitrary shape. (Non-uniform size issue) Clusters Dense regions of objects separated by regions of low density DBSCAN the first density based clustering DENCLUE a general density-based description of cluster and clustering

DBSCAN: Density Based Spatial Clustering of Applications with Noise v Proposed by Ester, Kriegel, Sander, and Xu (KDD96) v Relies on a density-based notion of cluster: A cluster is defined as a maximal set of denselyconnected points. Discovers clusters of arbitrary shape in spatial databases with noise

Density-Based Clustering Basic Idea: Clusters are dense regions in the data space, separated by regions of lower object density v Why Density-Based Clustering? Results of a k-medoid algorithm for k=4 Different density-based approaches exist (see Textbook & Papers) Here we discuss the ideas underlying the DBSCAN algorithm

Density Based Clustering: Basic Concept v Intuition for the formalization of the basic idea In a cluster, the local point density around that point has to exceed some threshold The set of points from one cluster is spatially connected v Local point density at a point p defined by two parameters ε radius for the neighborhood of point p: ε neighborhood: N ε (p) := {q in data set D dist(p, q) ε} MinPts minimum number of points in the given neighbourhood N(p)

ε-neighborhood v ε-neighborhood Objects within a radius of ε from an object. N ε ( p) :{ q d( p, q) ε} v High density - ε-neighborhood of an object contains at least MinPts of objects. ε q p ε ε-neighborhood of p ε-neighborhood of q Density of p is high (MinPts = 4) Density of q is low (MinPts = 4)

Core, Border & Outlier Border Outlier Given ε and MinPts, categorize the objects into three exclusive groups. Core ε = 1unit, MinPts = 5 A point is a core point if it has more than a specified number of points (MinPts) within Eps These are points that are at the interior of a cluster. A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point. A noise point is any point that is not a core point nor a border point.

Example v M, P, O, and R are core objects since each is in an Eps neighborhood containing at least 3 points Minpts = 3 Eps=radius of the circles

Density-Reachability Directly density-reachable An object q is directly density-reachable from object p if p is a core object and q is in p s ε- neighborhood. ε q p ε q is directly density-reachable from p p is not directly density- reachable from q? Density-reachability is asymmetric. MinPts = 4

Density-reachability v Density-Reachable (directly and indirectly): A point p is directly density-reachable from p2; p2 is directly density-reachable from p1; p1 is directly density-reachable from q; pßp2ßp1ßq form a chain. q p 1 p 2 p p is (indirectly) density-reachable from q q is not density- reachable from p? MinPts = 7

Density-Connectivity Density-reachability is not symmetric not good enough to describe clusters Density-Connectedness p A pair of points p and q are density-connected if they are commonly density-reachable from a point o. Density-connectivity is symmetric q o

Formal Description of Cluster v Given a data set D, parameter ε and threshold MinPts. v A cluster C is a subset of objects satisfying two criteria: Connected: For any p, q in C: p and q are densityconnected. Maximal: For any p,q: if p in C and q is densityreachable from p, then q in C. (avoid redundancy)

DBSCAN: The Algorithm Input: Eps and MinPts Arbitrary select a point p Retrieve all points density-reachable from p wrt Eps and MinPts. If p is a core point, a cluster is formed. If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database. Continue the process until all of the points have been processed.

DBSCAN Algorithm: Example v Parameter ε = 2 cm MinPts = 3 for each o D do if o is not yet classified then if o is a core-object then collect all objects density-reachable from o and assign them to a new cluster. else assign o to NOISE

DBSCAN Algorithm: Example v Parameter ε = 2 cm MinPts = 3 for each o D do if o is not yet classified then if o is a core-object then collect all objects density-reachable from o and assign them to a new cluster. else assign o to NOISE

DBSCAN Algorithm: Example v Parameter ε = 2 cm MinPts = 3 for each o D do if o is not yet classified then if o is a core-object then collect all objects density-reachable from o and assign them to a new cluster. else assign o to NOISE

MinPts = 5 ε C 1 ε P C 1 ε P C 1 P 1 1. Check the ε-neighborhood of p; 2. If p has less than MinPts neighbors then mark p as outlier and continue with the next object 3. Otherwise mark p as processed and put all the neighbors in cluster C 1. Check the unprocessed objects in C 2. If no core object, return C 3. Otherwise, randomly pick up one core object p 1, mark p 1 as processed, and put all unprocessed neighbors of p 1 in cluster C

MinPts = 5 ε ε C 1 C 1 ε ε ε C 1 C 1 C 1

DBSCAN Algorithm Input: The data set D Parameter: ε, MinPts For each object p in D if p is a core object and not processed then C = retrieve all objects density-reachable from p mark all objects in C as processed report C as a cluster else mark p as outlier end if End For DBScan Algorithm Q: Each run reaches the same clustering result? Unique?

Example Original Points Point types: core, border and outliers ε = 10, MinPts = 4

When DBSCAN Works Well Original Points Clusters Resistant to Noise Can handle clusters of different shapes and sizes

Density Based Clustering: Discussion v Advantages Clusters can have arbitrary shape and size Number of clusters is determined automatically Can separate clusters from surrounding noise Can be supported by spatial index structures v Disadvantages Input parameters may be difficult to determine In some situations very sensitive to input parameter setting Hard to handle cases with different densities

When DBSCAN Does NOT Work Well Original Points (MinPts=4, Eps=9.92). Cannot handle Varying densities sensitive to parameters (MinPts=4, Eps=9.75) Explanations?

DBSCAN: Sensitive to Parameters

DENCLUE: using density functions v DENsity-based CLUstEring by Hinneburg & Keim (KDD 98) v Major features Pros: Solid mathematical foundation Good for datasets with large amounts of noise Significantly faster than existing algorithm (faster than DBSCAN by a factor of up to 45) Cons: But needs a large number of parameters

Denclue: Technical Essence v Influence Model: Model density by the notion of influence Each data object has influence on its neighborhood. The influence decreases with distance v Example: Consider each object is a radio, the closer you are to the object, the louder the noise v Key: Influence is represented by mathematical function

Denclue: Technical Essence v Influence functions: (influence of y on x, σ is a usergiven constant) Square : f y square (x) = 0, if dist(x,y) > σ, 1, otherwise Guassian: f d ( x, y) y ( x) = e 2 2σ Gaussian 2

Density Function v Density Definition is defined as the sum of the influence functions of all data points. f D Gaussian d ( x, x = N ( x) e 2σ i = 1 i 2 ) 2

Gradient: The steepness of a slope v Example f ( x, y) e f Gaussian d ( x, y) = 2 2σ d ( x, xi ) D = N ( x) e 2 2σ Gaussian i = 1 d ( x, x = N f D ( x, x ) ( x x) e 2σ Gaussian i i = 1 i 2 2 i 2 ) 2

Denclue: Technical Essence v Clusters can be determined mathematically by identifying density attractors. v Density attractors are local maximum of the overall density function.

Density Attractor

Cluster Definition v Center-defined cluster A subset of objects attracted by an attractor x density(x) ξ v Arbitrary-shape cluster A group of center-defined clusters which are connected by a path P For each object x on P, density(x) ξ.

Center-Defined and Arbitrary

DENCLUE: How to find the clusters v Divide the space into grids, with size 2σ v Consider only grids that are highly populated v For each object, calculate its density attractor Density attractors form basis of all clusters

Features of DENCLUE v Major features Solid mathematical foundation Compact definition for density and cluster Flexible for both center-defined clusters and arbitraryshape clusters But needs parameters, which is in general hard to set σ: parameter to calculate density Largest interval with constant number of clusters ξ: density threshold Greater than noise level Smaller than smallest relevant maxima

Comparison with DBSCAN Corresponding setup v Square wave influence function radius σ models neighborhood ε in DBSCAN Square : f y square (x) = 0, if dist(x,y) > σ, 1, otherwise v Definition of core objects in DBSCAN involves MinPts = ξ v Density reachable in DBSCAN becomes density attracted in DENCLUE

Project 2: v Progress Presentation: v Class schedule v One more group presentation 49

50 BACKUP SLIDES

Determining the Parameters ε and MinPts v Cluster: Point density higher than specified by ε and MinPts v Idea: use the point density of the least dense cluster in the data set as parameters but how to determine this? v Heuristic: look at the distances to the k-nearest neighbors p q 3-distance(p) : 3-distance(q) : v Function k-distance(p): distance from p to the its k-nearest neighbor v k-distance plot: k-distances of all objects, sorted in decreasing order

Determining the Parameters ε and MinPts v Example k-distance plot 3-distance first valley Objects v Heuristic method: border object Fix a value for MinPts (default: 2 d 1) User selects border object o from the MinPts-distance plot; ε is set to MinPts-distance(o)