COMP 465: Data Mining Still More on Clustering


Slides adapted from: Jiawei Han, Micheline Kamber & Jian Pei, Data Mining: Concepts and Techniques, 3rd ed.

Exercise

Describe each of the following clustering algorithms in terms of the following criteria: (1) shapes of clusters that can be determined; (2) input parameters that must be specified; (3) limitations.

(a) k-means (b) k-medoids (c) CLARA (d) BIRCH (e) CHAMELEON (f) DBSCAN

OPTICS: A Cluster-Ordering Method (1999)

- OPTICS: Ordering Points To Identify the Clustering Structure
- Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
- Produces a special order of the database w.r.t. its density-based clustering structure
- This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
- Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure
- Can be represented graphically or using visualization techniques

OPTICS: Some Extensions from DBSCAN

- Index-based: k = # of dimensions, N = # of points; complexity O(N log N)
- Core distance of an object p: the smallest value ε' such that the ε'-neighborhood of p contains at least MinPts objects. With N_ε(p) the ε-neighborhood of p for a distance value ε:
  core-distance_{ε,MinPts}(p) = Undefined if |N_ε(p)| < MinPts, and MinPts-distance(p) otherwise
- Reachability distance of an object p from a core object q: the minimum radius value that makes p density-reachable from q:
  reachability-distance_{ε,MinPts}(p, q) = Undefined if q is not a core object, and max(core-distance(q), distance(q, p)) otherwise
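To make the two definitions concrete, here is a minimal NumPy sketch. The brute-force neighborhood search, the function names, and returning `None` for "undefined" are illustrative assumptions, not the index-based implementation the slides assume.

```python
import numpy as np

def core_distance(data, p, eps, min_pts):
    """Core distance of point index p: distance to its MinPts-th nearest
    neighbor, defined only if the eps-neighborhood holds >= MinPts points."""
    dists = np.linalg.norm(data - data[p], axis=1)  # brute-force distances from p
    neighborhood = np.sort(dists[dists <= eps])     # eps-neighborhood (includes p itself)
    if len(neighborhood) < min_pts:
        return None                                 # undefined: p is not a core object
    return neighborhood[min_pts - 1]                # MinPts-distance(p)

def reachability_distance(data, p, q, eps, min_pts):
    """Reachability distance of p from core object q:
    max(core-distance(q), distance(q, p)); undefined if q is not core."""
    cd = core_distance(data, q, eps, min_pts)
    if cd is None:
        return None                                 # undefined: q is not a core object
    return max(cd, float(np.linalg.norm(data[q] - data[p])))
```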

Core Distance & Reachability Distance

[Figure: reachability plot; the reachability distance (undefined for the first object) is plotted against the cluster-order of the objects]

Density-Based Clustering: OPTICS & Applications

Demo: http://www.dbs.informatik.uni-muenchen.de/forschung/kdd/clustering/optics/demo

DENCLUE: Using Statistical Density Functions

- DENsity-based CLUstEring by Hinneburg & Keim (KDD'98)
- Uses statistical density functions:
  - Influence of y on x: $f_{\text{Gaussian}}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
  - Total influence on x: $f^D_{\text{Gaussian}}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
  - Gradient of x in the direction of x_i: $\nabla f^D_{\text{Gaussian}}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
- Major features:
  - Solid mathematical foundation
  - Good for data sets with large amounts of noise
  - Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  - Significantly faster than existing algorithms (e.g., DBSCAN)
  - But needs a large number of parameters
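The three formulas above translate directly into code. This is a minimal sketch of the Gaussian density and its gradient, which DENCLUE's hill climbing follows toward a density attractor; the function names are illustrative, not DENCLUE's actual API.

```python
import numpy as np

def influence(x, y, sigma):
    """Gaussian influence of point y on point x: f_Gaussian(x, y)."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def density(x, data, sigma):
    """Total influence on x: the sum of Gaussian kernels over all data points."""
    sq = np.sum((data - x) ** 2, axis=1)
    return np.sum(np.exp(-sq / (2 * sigma ** 2)))

def density_gradient(x, data, sigma):
    """Gradient of the density at x; points toward higher density,
    i.e., toward a density attractor (local maximum)."""
    sq = np.sum((data - x) ** 2, axis=1)
    w = np.exp(-sq / (2 * sigma ** 2))
    return np.sum((data - x) * w[:, None], axis=0)
```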

Grid-Based Clustering Method

- Uses a multi-resolution grid data structure
- Several interesting methods:
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  - CLIQUE by Agrawal et al. (SIGMOD'98): both grid-based and subspace clustering
  - WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using wavelets

STING: A Statistical Information Grid Approach

- Wang, Yang and Muntz (VLDB'97)
- The spatial area is divided into rectangular cells
- There are several levels of cells corresponding to different levels of resolution

The STING Clustering Method

- Each cell at a high level is partitioned into a number of smaller cells at the next lower level
- Statistical information for each cell is calculated and stored beforehand and is used to answer queries
- Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
  - count, mean, standard deviation, min, max
  - type of distribution: normal, uniform, etc.
- Uses a top-down approach to answer spatial data queries:
  - Start from a pre-selected layer, typically one with a small number of cells
  - For each cell in the current level, compute the confidence interval

STING Algorithm and Its Analysis

- Remove the irrelevant cells from further consideration
- When finished examining the current layer, proceed to the next lower level
- Repeat this process until the bottom layer is reached
- Advantages: query-independent, easy to parallelize, incremental update; O(K), where K is the number of grid cells at the lowest level
- Disadvantages: all cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
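The bottom-up precomputation is the key trick: a parent cell's statistics follow directly from its children's. The sketch below shows that aggregation; the `CellStats` fields and the `merge_cells` name are assumptions for illustration, not STING's actual data layout.

```python
import math
from dataclasses import dataclass

@dataclass
class CellStats:
    count: int   # n: number of points in the cell
    mean: float  # m: mean of the attribute
    std: float   # s: standard deviation of the attribute
    lo: float    # min
    hi: float    # max

def merge_cells(children):
    """Derive a parent cell's statistics from its child cells, as STING
    precomputes when building the hierarchy bottom-up."""
    n = sum(c.count for c in children)
    if n == 0:
        return CellStats(0, 0.0, 0.0, float("inf"), float("-inf"))
    mean = sum(c.count * c.mean for c in children) / n
    # Recover the parent's E[X^2] from each child's mean and std,
    # then Var = E[X^2] - mean^2.
    ex2 = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children) / n
    std = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return CellStats(n, mean, std,
                     min(c.lo for c in children),
                     max(c.hi for c in children))
```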

CLIQUE (Clustering In QUEst)

- Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
- CLIQUE can be considered both density-based and grid-based:
  - It partitions each dimension into the same number of equal-length intervals
  - It partitions an m-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter (see the sketch after this slide's strengths and weaknesses)
  - A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps

- Partition the data space and find the number of points that lie inside each cell of the partition
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters:
  - Determine dense units in all subspaces of interest
  - Determine connected dense units in all subspaces of interest
- Generate a minimal description for the clusters:
  - Determine the maximal regions that cover each cluster of connected dense units
  - Determine a minimal cover for each cluster

[Figure: points plotted in the (age, salary) and (age, vacation) subspaces, with salary in units of $10,000 and vacation in weeks; the dense units found in each 2-D subspace intersect over the age range 30-50 to suggest a candidate cluster in the 3-D (age, salary, vacation) space]

Strengths and Weaknesses of CLIQUE

- Strengths:
  - Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  - Insensitive to the order of records in input and does not presume any canonical data distribution
  - Scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
- Weakness:
  - The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
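As a concrete illustration of CLIQUE's first step, here is a minimal sketch that finds dense one-dimensional units on an equal-length grid. The function name, the equal-width binning via `np.linspace`, and the density threshold `tau` are assumptions for illustration; the Apriori-style join to higher-dimensional units is omitted.

```python
import numpy as np

def dense_units_1d(data, n_intervals, tau):
    """For each dimension, partition into equal-length intervals and keep
    units whose fraction of points exceeds tau.
    Returns a set of (dimension, interval_index) pairs."""
    n, d = data.shape
    dense = set()
    for dim in range(d):
        col = data[:, dim]
        edges = np.linspace(col.min(), col.max(), n_intervals + 1)
        # Assign each point to an interval; clip so the max lands in the last bin.
        cells = np.clip(np.digitize(col, edges) - 1, 0, n_intervals - 1)
        for cell, count in enumerate(np.bincount(cells, minlength=n_intervals)):
            if count / n > tau:
                dense.add((dim, cell))
    return dense
```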

Determining the Number of Clusters

- Empirical method: # of clusters k ≈ √(n/2) for a dataset of n points; e.g., for n = 200, k = 10
- Elbow method: use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
- Cross-validation method:
  - Divide a given data set into m parts
  - Use m − 1 parts to build a clustering model
  - Use the remaining part to test the quality of the clustering; e.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
  - For any k > 0, repeat it m times, compare the overall quality measures w.r.t. different values of k, and find the # of clusters that fits the data best

Measuring Clustering Quality

- Three kinds of measures: external, internal, and relative
- External (supervised): employs criteria not inherent to the dataset; compares a clustering against prior or expert-specified knowledge (i.e., the ground truth) using a clustering quality measure
- Internal (unsupervised): criteria are derived from the data itself; evaluates the goodness of a clustering by considering how well the clusters are separated and how compact they are (e.g., the silhouette coefficient)
- Relative: directly compares different clusterings, usually those obtained via different parameter settings for the same algorithm

Measuring Clustering Quality: External Methods

- A clustering quality measure Q(C, T) scores a clustering C given the ground truth T
- Q is good if it satisfies the following four essential criteria:
  - Cluster homogeneity: the purer, the better
  - Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster
  - Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category)
  - Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces

Some Commonly Used External Measures

- Matching-based measures: purity, maximum matching, F-measure (a sketch of purity follows this list)
- Entropy-based measures: conditional entropy, normalized mutual information (NMI), variation of information
- Pairwise measures, built on the four possibilities true positive (TP), FN, FP, TN: Jaccard coefficient, Rand statistic, Fowlkes-Mallows measure
- Correlation measures: discretized Hubert statistic, normalized discretized Hubert statistic

[Figure: a ground-truth partitioning T1, T2 compared against clusters C1, C2]
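Of the matching-based measures above, purity is the simplest: each cluster is credited with its majority ground-truth category. A minimal sketch, assuming labels are encoded as small non-negative integers (the function name is illustrative):

```python
import numpy as np

def purity(labels_c, labels_t):
    """Purity of clustering C against ground truth T: the fraction of points
    covered by each cluster's majority ground-truth category; 1 is perfect."""
    labels_c, labels_t = np.asarray(labels_c), np.asarray(labels_t)
    total = 0
    for c in np.unique(labels_c):
        members = labels_t[labels_c == c]       # ground-truth labels inside cluster c
        total += np.max(np.bincount(members))   # size of the majority category
    return total / len(labels_c)

# Example: one mixed cluster lowers purity below 1
print(purity([0, 0, 1, 1], [0, 1, 1, 1]))  # -> 0.75
```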

Entropy-Based Measure (I): Conditional Entropy

- Entropy of clustering C: $H(C) = -\sum_i p_{C_i} \log p_{C_i}$
- Entropy of partitioning T: $H(T) = -\sum_j p_{T_j} \log p_{T_j}$
- Entropy of T w.r.t. cluster C_i: $H(T \mid C_i) = -\sum_j \frac{n_{ij}}{n_i} \log \frac{n_{ij}}{n_i}$, where $n_{ij}$ is the number of objects in cluster C_i that belong to partition T_j
- Conditional entropy of T w.r.t. clustering C: $H(T \mid C) = \sum_i p_{C_i} \, H(T \mid C_i)$
- The more a cluster's members are split into different partitions, the higher the conditional entropy
- For a perfect clustering, the conditional entropy value is 0; the worst possible conditional entropy value is log k

Entropy-Based Measure (II): Normalized Mutual Information (NMI)

- Mutual information quantifies the amount of shared information between the clustering C and the partitioning T:
  $I(C, T) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{p_{C_i} \, p_{T_j}}$
- It measures the dependency between the observed joint probability $p_{ij}$ of C and T, and the expected joint probability $p_{C_i} \cdot p_{T_j}$ under the independence assumption
- When C and T are independent, $p_{ij} = p_{C_i} \cdot p_{T_j}$ and I(C, T) = 0; however, there is no upper bound on the mutual information
- Normalized mutual information:
  $NMI(C, T) = \frac{I(C, T)}{\sqrt{H(C) \, H(T)}}$
- The value range of NMI is [0, 1]; a value close to 1 indicates a good clustering (a code sketch follows the summary below)

Summary

- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- K-means and k-medoids are popular partitioning-based clustering algorithms
- BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
- DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
- STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
- The quality of clustering results can be evaluated in various ways
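Following the NMI definitions above, here is a minimal NumPy sketch; the function name and plain-list label encoding are assumptions, and it assumes at least two clusters and two partitions so that H(C) and H(T) are nonzero. For production use, sklearn.metrics.normalized_mutual_info_score provides a tested implementation (note that its default normalization may differ from the geometric mean used here).

```python
import numpy as np

def nmi(labels_c, labels_t):
    """Normalized mutual information between a clustering C and a
    ground-truth partitioning T: I(C, T) / sqrt(H(C) * H(T))."""
    labels_c, labels_t = np.asarray(labels_c), np.asarray(labels_t)
    cs, ts = np.unique(labels_c), np.unique(labels_t)
    # Marginal probabilities and entropies of C and T
    p_c = np.array([np.mean(labels_c == c) for c in cs])
    p_t = np.array([np.mean(labels_t == t) for t in ts])
    h_c = -np.sum(p_c * np.log(p_c))
    h_t = -np.sum(p_t * np.log(p_t))
    # Mutual information from the observed joint probabilities p_ij
    i_ct = 0.0
    for i, c in enumerate(cs):
        for j, t in enumerate(ts):
            p_ij = np.mean((labels_c == c) & (labels_t == t))
            if p_ij > 0:
                i_ct += p_ij * np.log(p_ij / (p_c[i] * p_t[j]))
    return i_ct / np.sqrt(h_c * h_t)

# Example: label permutations do not matter; identical groupings give NMI = 1
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```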