Density-Based Clustering
Izabela Moise, Evangelos Pournaras
Reminder: unsupervised data mining, clustering, k-means.
Main Clustering Approaches
Partitioning method: constructs partitions of the data points and evaluates the partitions by some criterion. Examples: k-means, k-medoids.
Density-based method: based on connectivity and density functions. Examples: DBSCAN, DJCluster.
Density-Based Clustering
Density-based clustering locates regions (neighbourhoods) of high density that are separated from one another by regions of low density.
Main principles
Two parameters:
1. Eps: the maximum radius of the neighbourhood
2. MinPts: the minimum number of points in the Eps-neighbourhood of a point
The Eps-neighbourhood of a point p in the dataset D is $N_{Eps}(p) = \{ q \in D \mid dist(p, q) \le Eps \}$.
Key idea: the density of the neighbourhood has to exceed some threshold.
The shape of a neighbourhood depends on the dist function.
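As a minimal sketch of the Eps-neighbourhood and the density threshold, the snippet below computes $N_{Eps}(p)$ for a toy dataset under two distance functions. The point coordinates, the helper names (`eps_neighbourhood`, `euclidean`, `chebyshev`) and the parameter values are illustrative assumptions, not part of the slides.

```python
import math

def euclidean(a, b):
    # Straight-line distance: the Eps-neighbourhood is a disc.
    return math.dist(a, b)

def chebyshev(a, b):
    # Maximum coordinate difference: the Eps-neighbourhood is a square.
    return max(abs(x - y) for x, y in zip(a, b))

def eps_neighbourhood(points, p, eps, dist=euclidean):
    """Indices q with dist(points[p], points[q]) <= eps (p itself included)."""
    return [q for q in range(len(points)) if dist(points[p], points[q]) <= eps]

# Hypothetical toy data and parameters.
points = [(0.0, 0.0), (0.5, 0.0), (0.8, 0.8), (3.0, 3.0)]
eps, min_pts = 1.0, 3

disc = eps_neighbourhood(points, 0, eps)               # Euclidean distance
square = eps_neighbourhood(points, 0, eps, chebyshev)  # Chebyshev distance

print(disc, len(disc) >= min_pts)      # [0, 1] False
print(square, len(square) >= min_pts)  # [0, 1, 2] True
```

The point (0.8, 0.8) lies outside the unit disc but inside the unit square, so the same Eps and MinPts give a different density verdict depending on the dist function.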
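Core, Border and Noise/Outlier
[Figure: core, border and noise/outlier points. Source: Jing Gao, SUNY Buffalo]
A core point has at least MinPts points in its Eps-neighbourhood. A border point has fewer, but lies in the Eps-neighbourhood of a core point. A noise point (outlier) is neither.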
Directly Density-Reachable
A point p is directly density-reachable from a point q wrt. Eps, MinPts if:
1. $p \in N_{Eps}(q)$, and
2. $|N_{Eps}(q)| \ge MinPts$ (q is a core point).
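The two conditions translate almost literally into code. The following is a sketch under assumptions: Euclidean distance, a hypothetical four-point dataset, and the helper names `eps_neighbourhood` and `directly_density_reachable` chosen for illustration.

```python
import math

def eps_neighbourhood(points, p, eps):
    """Set of indices within distance eps of points[p] (p itself included)."""
    return {q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps}

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    n_q = eps_neighbourhood(points, q, eps)
    # Condition 1: p lies in N_Eps(q). Condition 2: q is a core point.
    return p in n_q and len(n_q) >= min_pts

# Hypothetical toy data: points 0, 1, 2 are dense; point 3 sits at the edge.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0)]
eps, min_pts = 0.8, 3

print(directly_density_reachable(points, 3, 1, eps, min_pts))  # True
print(directly_density_reachable(points, 1, 3, eps, min_pts))  # False
```

Point 3 is directly density-reachable from the core point 1, but not the other way round, because point 3 has only two points in its own Eps-neighbourhood.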
Density-Reachable
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$.
The relation is transitive but not symmetric.
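One way to check for such a chain is a breadth-first search that only expands core points, since only a core point can be the start of a directly density-reachable step. This is a sketch assuming Euclidean distance and a hypothetical toy dataset; the function names are illustrative.

```python
import math
from collections import deque

def eps_neighbourhood(points, p, eps):
    return {q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps}

def density_reachable(points, p, q, eps, min_pts):
    """True if a chain q = p_1, ..., p_n = p of directly reachable steps exists."""
    seen, queue = {q}, deque([q])
    while queue:
        current = queue.popleft()
        if current == p:
            return True
        neighbours = eps_neighbourhood(points, current, eps)
        if len(neighbours) < min_pts:
            continue  # not a core point: the chain cannot continue from here
        for nxt in neighbours - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False

# Hypothetical toy data: points 0, 1, 2 are core points, point 3 is not.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0)]
eps, min_pts = 0.8, 3

print(density_reachable(points, 3, 0, eps, min_pts))  # True, via the chain 0 -> 1 -> 3
print(density_reachable(points, 0, 3, eps, min_pts))  # False: 3 is not a core point
```

The two calls illustrate the lack of symmetry: the chain from point 0 reaches point 3, but no chain can start at point 3.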
Density-Connected
A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
The relation is symmetric, since the definition treats p and q interchangeably.
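The definition can be checked by brute force: compute the set of points density-reachable from each candidate o and test whether it contains both p and q. The sketch below assumes Euclidean distance and a hypothetical dataset with two low-density edge points (3 and 4) and one isolated point (5); it favours clarity over efficiency.

```python
import math
from collections import deque

def eps_neighbourhood(points, p, eps):
    return {q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps}

def reachable_set(points, o, eps, min_pts):
    """All points density-reachable from o (o itself included as the trivial chain)."""
    seen, queue = {o}, deque([o])
    while queue:
        current = queue.popleft()
        neighbours = eps_neighbourhood(points, current, eps)
        if len(neighbours) >= min_pts:  # only core points extend a chain
            for nxt in neighbours - seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def density_connected(points, p, q, eps, min_pts):
    """True if some point o density-reaches both p and q."""
    return any({p, q} <= reachable_set(points, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical toy data.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0), (0.0, 1.2), (5.0, 5.0)]
eps, min_pts = 0.8, 3

print(density_connected(points, 3, 4, eps, min_pts))  # True
print(density_connected(points, 4, 3, eps, min_pts))  # True
print(density_connected(points, 3, 5, eps, min_pts))  # False
```

Points 3 and 4 are not density-reachable from each other, because neither is a core point, yet they are density-connected through the core point 0.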
DBSCAN - Density-Based Spatial Clustering of Applications with Noise
Main Principles
DBSCAN is one of the most cited clustering algorithms.
Main principle: a cluster is defined as a maximal set of density-connected points.
Discovers clusters of arbitrary shapes (spherical, elongated, linear), and noise.
Works with spatial datasets: geomarketing, tomography, satellite images.
Requires only two parameters (no prior knowledge of the number of clusters).
Definition: Cluster
[Figure. Source: Erik Kropat, University of the Bundeswehr Munich]
Definition: Noise
[Figure. Source: Erik Kropat, University of the Bundeswehr Munich]
The Algorithm
1. Randomly select a point p.
2. Retrieve all points density-reachable from p wrt. Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, then no points are density-reachable from p; visit the next data point.
5. Continue the process until all points have been processed.
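The steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation. It rests on several assumptions: Euclidean distance, a brute-force neighbourhood search, points visited in index order instead of randomly, and the label -1 for noise. The names `dbscan`, `region_query` and `NOISE` are chosen for this example.

```python
import math

NOISE = -1  # assumed label for noise points

def region_query(points, p, eps):
    """Indices of all points within distance eps of points[p] (brute force)."""
    return [q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not yet processed"
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = NOISE  # provisional: may turn out to be a border point
            continue
        labels[p] = cluster_id  # p is a core point, so a cluster is formed
        seeds = list(neighbours)
        i = 0
        while i < len(seeds):  # collect everything density-reachable from p
            q = seeds[i]
            i += 1
            if labels[q] == NOISE:
                labels[q] = cluster_id  # border point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:  # q is a core point: keep expanding
                seeds.extend(q_neighbours)
        cluster_id += 1
    return labels

# Hypothetical toy data: two square clusters and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

A point first marked as noise is relabelled if a later cluster expansion reaches it, which corresponds to step 4: a border point starts no cluster of its own but can still be absorbed into one.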
Selecting Eps and MinPts
The two parameters can be determined by a heuristic.
Observation: for points in a cluster, their k-th nearest neighbours are at roughly the same distance. Noise points have their k-th nearest neighbour at a farther distance.
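The observation can be made concrete by computing each point's distance to its k-th nearest neighbour and sorting the values. In the sketch below, the dataset, the choice k = 3 and the helper name `k_distances` are illustrative assumptions. Reading a candidate Eps off the point where the sorted distances jump is one common way to use such a list, not necessarily the exact procedure meant by the slides.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted descending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Hypothetical toy data: two square clusters and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]

for d in k_distances(points, k=3):
    print(round(d, 3))
```

On this data the isolated point has a 3-distance of about 6.4, while all cluster points share a 3-distance of about 1.41. A candidate Eps just above the smaller value would separate the clusters from the noise point.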
[Three figures. Source: Erik Kropat, University of the Bundeswehr Munich]
Pros and Cons
Pros: discovers clusters of arbitrary shapes; handles noise; needs density parameters as termination condition.
Cons: cannot handle varying densities; sensitive to parameters, since the correct set of parameters is hard to determine; sampling affects density measures.