Distance-based Methods: Drawbacks


Distance-based Methods: Drawbacks
- Hard to find clusters with irregular shapes
- Hard to specify the number of clusters
- Heuristic: a cluster must be dense
Jian Pei: CMPT 459/741 Clustering (3) 1

How to Find Irregular Clusters?
- Divide the whole space into many small areas
  - The density of an area can be estimated
  - Areas may or may not be exclusive
  - A dense area is likely in a cluster
- Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape

Directly Density Reachable
- Parameters
  - Eps: maximum radius of the neighborhood
  - MinPts: minimum number of points in an Eps-neighborhood of that point
- $N_{Eps}(p) = \{q \mid \mathrm{dist}(p, q) \le Eps\}$
- Core object p: $|N_{Eps}(p)| \ge MinPts$
  - A core object is in a dense area
- Point q is directly density-reachable from p iff $q \in N_{Eps}(p)$ and p is a core object
[Figure: points p and q, with MinPts = 3 and Eps = 1 cm]
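The definitions on this slide translate almost directly into code. The following is a minimal Python sketch, assuming Euclidean distance and a small made-up point set; the function names are illustrative and not part of the lecture. A point counts as a member of its own Eps-neighborhood, which follows from dist(p, p) = 0.

```python
import math

def eps_neighborhood(points, p, eps):
    """N_Eps(p): indices q with dist(p, q) <= eps (p itself is included)."""
    return [q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps]

def is_core(points, p, eps, min_pts):
    """p is a core object if its Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    """q is directly density-reachable from p iff p is core and q is in N_Eps(p)."""
    return is_core(points, p, eps, min_pts) and q in eps_neighborhood(points, p, eps)

# Hypothetical data: three close points, one point on the fringe, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.4, 0.0), (5.0, 5.0)]
print(directly_density_reachable(points, 3, 1, eps=1.0, min_pts=3))  # fringe point from a core point
print(directly_density_reachable(points, 1, 3, eps=1.0, min_pts=3))  # the reverse direction
```

The two calls at the end illustrate that the relation is not symmetric: the fringe point is reachable from a core point, but it is not a core object itself, so nothing is directly density-reachable from it.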

Density-Based Clustering
- Density-reachable
  - Directly density-reachable chain: $p_1 \to p_2$, $p_2 \to p_3$, ..., $p_{n-1} \to p_n$
  - Then $p_n$ is density-reachable from $p_1$
- Density-connected
  - If points p and q are both density-reachable from o, then p and q are density-connected
[Figure: a chain of points from p through p_1 to q; points p and q both reached from o]

DBSCAN
- A cluster: a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]

DBSCAN: the Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p wrt Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
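The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and a brute-force neighborhood query (quadratic time, no spatial index); the `NOISE` constant, the function name, and the sample points are assumptions made here, not part of the lecture.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force Eps-neighborhood query; the point itself is included.
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue                      # already processed
        seeds = neighbors(p)
        if len(seeds) < min_pts:
            labels[p] = NOISE             # not core; may become a border point later
            continue
        labels[p] = cluster               # p is a core point: a cluster is formed
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster       # border point: joins, but is not expanded
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = neighbors(q)
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors) # q is core: keep expanding through it
        cluster += 1
    return labels

# Hypothetical data: two dense squares and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(sample, eps=1.5, min_pts=3)
print(labels)
```

A point first marked as noise is relabeled if a later core point reaches it, which is how border points end up inside a cluster without ever being expanded.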

Challenges for DBSCAN
- Different clusters may have very different densities
- Clusters may be in hierarchies

OPTICS: A Cluster-ordering Method
- Idea: ordering points to identify the clustering structure
- Group points by density connectivity
  - Hierarchies of clusters
- Visualize clusters and the hierarchy

Ordering Points
- Points strongly density-connected should be close to one another
- Clusters density-connected should be close to one another and form a cluster of clusters

OPTICS: An Example
[Figure: reachability plot. x-axis: cluster-order of the objects; y-axis: reachability-distance, with "undefined" values and ε thresholds marked]
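The slides use reachability-distance without defining it. The sketch below follows the standard OPTICS definitions (core-distance, then reachability-distance as the larger of the core-distance and the actual distance), which are not stated in this deck. It assumes Euclidean distance and that a point counts as its own neighbor; `None` stands for "undefined", and the sample points are made up.

```python
import math

def core_distance(points, o, eps, min_pts):
    """Distance from o to its min_pts-th nearest neighbor (o included), or None if o is not core."""
    dists = sorted(math.dist(points[o], q) for q in points)
    within_eps = [d for d in dists if d <= eps]
    if len(within_eps) < min_pts:
        return None                      # o is not a core object: undefined
    return within_eps[min_pts - 1]

def reachability_distance(points, p, o, eps, min_pts):
    """Reachability-distance of p with respect to o, or None if o is not core."""
    cd = core_distance(points, o, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[o], points[p]))

# Hypothetical 1-D-like data laid out on a line.
line_points = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(core_distance(line_points, 0, eps=3.0, min_pts=3))
print(reachability_distance(line_points, 1, 0, eps=3.0, min_pts=3))
print(reachability_distance(line_points, 0, 3, eps=3.0, min_pts=3))
```

Plotting these values in the order OPTICS visits the points gives the kind of reachability plot shown on the slide, where valleys correspond to clusters.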

DENCLUE: Using Density Functions
- DENsity-based CLUstEring
- Major features
  - Solid mathematical foundation
  - Good for data sets with large amounts of noise
  - Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  - Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
  - But needs a large number of parameters

DENCLUE: Techniques
- Use grid cells
  - Only keep grid cells actually containing data points
  - Manage cells in a tree-based access structure
- Influence function: describes the impact of a data point on its neighborhood
- Overall density of the data space is the sum of the influence functions of all data points
- Clustering by identifying density attractors
  - Density attractor: local maximum of the overall density function
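To make the influence-function idea concrete, here is a small Python sketch. It assumes a Gaussian influence function with bandwidth `sigma`, and it climbs to a density attractor with the mean-shift fixed-point update rather than a fixed-step gradient ascent; both choices are assumptions for illustration, and the grid cells and tree structure from the slide are left out entirely.

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (assumed Gaussian kernel)."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

def density_attractor(x, data, sigma, tol=1e-6, max_iter=200):
    """Hill-climb from x towards a local maximum of the density function."""
    x = tuple(x)
    for _ in range(max_iter):
        weights = [gaussian_influence(x, y, sigma) for y in data]
        total = sum(weights)  # assumed > 0, i.e. x is not extremely far from all data
        new_x = tuple(
            sum(w * y[d] for w, y in zip(weights, data)) / total
            for d in range(len(x))
        )
        if math.dist(x, new_x) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical data: two tight groups of three points each.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
attractor_a = density_attractor((0.5, 0.5), data, sigma=0.5)
attractor_b = density_attractor((4.5, 4.5), data, sigma=0.5)
print(attractor_a, attractor_b)
```

Points whose climbs end at the same attractor would be placed in the same center-defined cluster.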

Density Attractor

Center-defined and Arbitrary Clusters

A Shrinking-based Approach
- Difficulties of multi-dimensional clustering
  - Noise (outliers)
  - Clusters of various densities
  - Not well-defined shapes
- A novel preprocessing concept: shrinking
- A shrinking-based clustering approach

Intuition & Purpose
- For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?
- Natural sparse subgroups become denser, and thus easier to detect
- Noise points are further isolated

Inspiration
- Newton's Universal Law of Gravitation
  - Any two objects exert a gravitational force of attraction on each other
  - The direction of the force is along the line joining the objects
  - The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them
  - $F_g = G \frac{m_1 m_2}{r^2}$
  - G: universal gravitational constant, $G = 6.67 \times 10^{-11}\ \mathrm{N\,m^2/kg^2}$

The Concept of Shrinking
- A data preprocessing technique
  - Aims to optimize the inner structure of real data sets
- Each data point is attracted by other data points and moves in the direction in which the attraction is strongest
- Can be applied in different fields

Apply Shrinking to the Clustering Field
- Shrink the naturally sparse clusters to make them much denser, to facilitate the subsequent cluster-detection process
[Figure: data points in a multi-attribute hyperspace]

Data Shrinking
- Each data point moves along the direction of the density gradient, and the data set shrinks towards the inside of the clusters
- Points are attracted by their neighbors and move to create denser clusters
- It proceeds iteratively, repeating until the data are stabilized or the number of iterations exceeds a threshold

Approximation & Simplification
- Problem: computing the mutual attraction of every pair of data points is too time-consuming, $O(n^2)$
- Solution
  - Drop Newton's constant G; $m_1$ and $m_2$ are set to unit mass
  - Only aggregate the gravitation surrounding each data point
  - Use grids to simplify the computation
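With G, m_1 and m_2 all set to 1, the pairwise force depends only on distance. The expression below writes that out, followed by one way the "aggregate the gravitation surrounding each data point" step might be expressed. The neighborhood $N(i)$ and the vector form of the sum are assumptions made here for illustration; the grid-based computation mentioned on the slide is not reproduced.

```latex
F_g = G\,\frac{m_1 m_2}{r^2}
\quad\xrightarrow{\;G = m_1 = m_2 = 1\;}\quad
F = \frac{1}{r^2},
\qquad
\vec{F}_i \approx \sum_{j \in N(i)} \frac{x_j - x_i}{\lVert x_j - x_i \rVert^{3}}
```

Each term in the sum is a unit vector pointing from $x_i$ to $x_j$ scaled by $1/r^2$, so restricting $j$ to nearby points avoids evaluating all $O(n^2)$ pairs.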

Termination Condition
- The average movement of all points in the current iteration is less than a threshold
- The number of iterations exceeds a threshold
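The iteration and its two stopping rules can be sketched as follows. This is a deliberately simplified stand-in, not the grid-based procedure from the slides: each point simply moves to the centroid of the points within a fixed `radius` of it. The parameter names, the radius-based neighborhood, and the sample data are all assumptions; the input is assumed non-empty and `max_iter` at least 1.

```python
import math

def shrink(points, radius, move_tol=1e-3, max_iter=50):
    """Repeatedly move every point to the centroid of its radius-neighborhood.

    Stops when the average movement in an iteration drops below move_tol
    or when max_iter iterations have been run. Returns (points, iterations).
    """
    pts = [tuple(p) for p in points]
    for iteration in range(1, max_iter + 1):
        new_pts = []
        for p in pts:
            nbrs = [q for q in pts if math.dist(p, q) <= radius]  # includes p itself
            centroid = tuple(sum(q[d] for q in nbrs) / len(nbrs) for d in range(len(p)))
            new_pts.append(centroid)
        avg_move = sum(math.dist(p, q) for p, q in zip(pts, new_pts)) / len(pts)
        pts = new_pts
        if avg_move < move_tol:
            break                          # data have stabilized
    return pts, iteration

# Hypothetical data: a loose square of four points and one outlier.
raw = [(0.0, 0.0), (0.4, 0.0), (0.0, 0.4), (0.4, 0.4), (3.0, 3.0)]
shrunk, iterations = shrink(raw, radius=1.0)
print(shrunk, iterations)
```

In this toy run the four square points collapse onto their common centroid while the outlier, having no neighbors, stays where it is, which mirrors the intuition that sparse groups become denser and noise stays isolated.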

OPTICS on Pendigits Data
[Figure: two panels, "Before data shrinking" and "After data shrinking"]

Biclustering
- Clustering both objects and attributes simultaneously
- Four requirements
  - Only a small set of objects in a cluster (bicluster)
  - A bicluster only involves a small number of attributes
  - An object may participate in multiple biclusters or no biclusters
  - An attribute may be involved in multiple biclusters or no biclusters
Jian Pei: Big Data Analytics -- Clustering 24

Application Examples
- Recommender systems
  - Objects: users
  - Attributes: items
  - Values: user ratings
- Microarray data
  - Objects: genes
  - Attributes: samples
  - Values: expression levels
[Figure: gene × sample/condition matrix with entries w_11 ... w_nm (n genes, m samples/conditions)]

Biclusters with Constant Values

        b6   b12  b36  b99
a1      60   60   60   60
a33     60   60   60   60
a86     60   60   60   60

On rows:

10  10  10  10  10
20  20  20  20  20
50  50  50  50  50
 0   0   0   0   0

Biclusters with Coherent Values
- Also known as pattern-based clusters

Biclusters with Coherent Evolutions
- Only up- or down-regulated changes over rows or columns

Coherent evolutions on rows:

10   50   30    70   20
20  100   50  1000   30
50  100   90   120   80
 0   80   20   100   10

Differences from Subspace Clustering
- Subspace clustering uses a global distance/similarity measure
- Pattern-based clustering looks at patterns
- A subspace cluster according to a globally defined similarity measure may not follow the same pattern

Objects Follow the Same Pattern?
[Figure: the blue object and the green object compared on attributes D1 and D2, with their pscore marked]
- The smaller the pscore, the more consistent the objects

Pattern-based Clusters
- pscore: the similarity between two objects $r_x$, $r_y$ on two attributes $a_u$, $a_v$
  $\mathrm{pscore}\begin{pmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{pmatrix} = |(r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v)|$
- δ-pcluster (R, D): for any objects $r_x, r_y \in R$ and any attributes $a_u, a_v \in D$,
  $\mathrm{pscore}\begin{pmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{pmatrix} \le \delta \quad (\delta \ge 0)$
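The pscore and the δ-pcluster condition can be checked directly on a small matrix. A minimal Python sketch follows, assuming the data are stored as a list of rows (objects) with attributes as column indices; the function names and the example matrix are made up for illustration.

```python
from itertools import combinations

def pscore(matrix, x, y, u, v):
    """pscore of objects x, y on attributes u, v: |(x.u - y.u) - (x.v - y.v)|."""
    return abs((matrix[x][u] - matrix[y][u]) - (matrix[x][v] - matrix[y][v]))

def is_delta_pcluster(matrix, objects, attributes, delta):
    """True if every pair of objects on every pair of attributes has pscore <= delta."""
    return all(
        pscore(matrix, x, y, u, v) <= delta
        for x, y in combinations(objects, 2)
        for u, v in combinations(attributes, 2)
    )

# Hypothetical data: rows 1 and 2 are shifted copies of row 0 on the first three attributes.
m = [
    [1, 5, 3, 9],
    [11, 15, 13, 20],
    [4, 8, 6, 2],
]
print(pscore(m, 0, 1, 0, 3))
print(is_delta_pcluster(m, [0, 1, 2], [0, 1, 2], delta=0))
print(is_delta_pcluster(m, [0, 1, 2], [0, 1, 2, 3], delta=1))
```

Objects that differ only by a constant shift have pscore 0 on every attribute pair, which is why the three rows form a 0-pcluster on the first three attributes even though their absolute values are far apart.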

Maximal pcluster
- If (R, D) is a δ-pcluster, then every subcluster (R', D') is a δ-pcluster, where R' ⊆ R and D' ⊆ D
  - An anti-monotonic property
  - A large pcluster is accompanied by many small pclusters, so reporting all of them is ineffective
- Idea: mine only the maximal pclusters
  - A δ-pcluster is maximal if no proper super-cluster of it is a δ-pcluster

Mining Maximal pclusters
- Given
  - A cluster threshold δ
  - An attribute threshold min_a
  - An object threshold min_o
- Task: mine the complete set of significant maximal δ-pclusters
  - A significant δ-pcluster has at least min_o objects on at least min_a attributes
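For a tiny matrix the task can be solved by exhaustive search, which is useful as a reference when testing a real miner. The Python sketch below is a brute-force enumeration, exponential in both dimensions and not an efficient algorithm such as MaPle; the function names and the example matrix are illustrative assumptions. It checks the pcluster condition through an equivalent form: for each object pair, the spread (max minus min) of their per-attribute differences must not exceed δ.

```python
from itertools import combinations

def is_delta_pcluster(matrix, objects, attributes, delta):
    """Equivalent pscore test: per object pair, spread of differences <= delta."""
    for x, y in combinations(objects, 2):
        diffs = [matrix[x][a] - matrix[y][a] for a in attributes]
        if max(diffs) - min(diffs) > delta:
            return False
    return True

def maximal_pclusters(matrix, delta, min_o, min_a):
    """Brute force: list all significant delta-pclusters, then keep the maximal ones."""
    n_obj, n_attr = len(matrix), len(matrix[0])
    found = []
    for a_size in range(min_a, n_attr + 1):
        for attrs in combinations(range(n_attr), a_size):
            for o_size in range(min_o, n_obj + 1):
                for objs in combinations(range(n_obj), o_size):
                    if is_delta_pcluster(matrix, objs, attrs, delta):
                        found.append((frozenset(objs), frozenset(attrs)))
    return [
        (sorted(r), sorted(d))
        for r, d in found
        # Drop any cluster that is properly contained in another found cluster.
        if not any((r, d) != (r2, d2) and r <= r2 and d <= d2 for r2, d2 in found)
    ]

# Hypothetical data: 3 objects on 4 attributes.
toy = [
    [1, 5, 3, 9],
    [11, 15, 13, 20],
    [4, 8, 6, 2],
]
result = maximal_pclusters(toy, delta=1, min_o=2, min_a=2)
print(result)
```

Because every super-cluster of a significant pcluster is itself significant, comparing only against the significant clusters found is enough to decide maximality.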

pclusters and Frequent Itemsets
- A transaction database can be modeled as a binary matrix
- Frequent itemset: a sub-matrix of all 1s
  - A 0-pcluster on binary data
  - min_o: support threshold
  - min_a: no fewer than min_a attributes
  - Maximal pclusters correspond to closed itemsets
- Frequent itemset mining algorithms cannot be extended straightforwardly to mine pclusters on numeric data

Where Should We Start from?
- How about the pclusters having only 2 objects or 2 attributes?
  - MDS (maximal dimension set)
  - A pcluster must have at least 2 objects and 2 attributes

Finding MDSs (objects x and y):

Attribute   a   b   c   d   e   f   g   h
x          13  11   9   7   9  13   2  15
y           7   4  10   1  12   3   4   7
x - y       6   7  -1   6  -3  10  -2   8
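For a pair of objects, the maximal dimension sets can be read off the x - y row: sort the differences and slide a window whose spread stays within δ. The Python sketch below applies this to the table above. The threshold δ = 2 is an assumption chosen for the example (the slide does not state one), and the function name and `min_a` default are illustrative.

```python
def object_pair_mds(x_row, y_row, attributes, delta, min_a=2):
    """Maximal attribute sets on which objects x and y form a delta-pcluster."""
    # Sort attributes by the difference x - y.
    diffs = sorted((xv - yv, a) for xv, yv, a in zip(x_row, y_row, attributes))
    result, prev_end, end = [], -1, 0
    for start in range(len(diffs)):
        end = max(end, start)
        # Extend the window while the spread of differences stays within delta.
        while end + 1 < len(diffs) and diffs[end + 1][0] - diffs[start][0] <= delta:
            end += 1
        # A window is maximal only if it reaches further right than the previous one.
        if end > prev_end and end - start + 1 >= min_a:
            result.append(sorted(a for _, a in diffs[start:end + 1]))
        prev_end = max(prev_end, end)
    return result

attributes = ["a", "b", "c", "d", "e", "f", "g", "h"]
x = [13, 11, 9, 7, 9, 13, 2, 15]
y = [7, 4, 10, 1, 12, 3, 4, 7]
mds = object_pair_mds(x, y, attributes, delta=2)
print(mds)
```

With δ = 2 the sorted differences are -3, -2, -1, 6, 6, 7, 8, 10, so the windows fall on {c, e, g}, {a, b, d, h}, and {f, h}; attribute h belongs to two sets because 8 is within 2 of both 6 and 10.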

How to Assemble Larger pclusters?
- Systematically enumerate every combination of attributes D
  - For each attribute subset, find the maximal subsets of objects R s.t. (R, D) is a pcluster
  - Check whether (R, D) is maximal
- Prune search branches as early as possible
- Why attribute-first-object-later?
  - # of objects >> # of attributes
- Algorithm MaPle (Pei et al., 2003)

More Pruning Techniques
- Only possible attributes should be considered to get larger pclusters
- Pruning local maximal pclusters having insufficient possible attributes
- Extracting common attributes from the possible attribute set directly
- Prune non-maximal pclusters

Gene-Sample-Time Series Data
[Figure: a gene × sample × time cube (axes: gene1, gene2, ...; sample1, sample2, ...; time1, time2, ...) with its Gene-Sample, Gene-Time, and Sample-Time matrices; each cell holds the expression level of gene i on sample j at time k]

Mining GST Microarray Data
- Reduce the gene-sample-time series data to gene-sample data
  - Use Pearson's correlation coefficient as the coherence measure
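As a reminder of what the coherence measure computes, here is Pearson's correlation coefficient written out in plain Python and applied to the time series of one gene in different samples. How the slides turn these correlations into a gene-sample matrix is not spelled out, so this sketch stops at the pairwise measure; the series values are made up, and both inputs are assumed to be non-constant (otherwise the denominator is zero).

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical expression levels of one gene over four time points in three samples.
sample1 = [1.0, 2.0, 3.0, 4.0]
sample2 = [2.0, 4.0, 6.0, 8.0]   # same trend, different scale
sample3 = [4.0, 3.0, 2.0, 1.0]   # opposite trend
print(pearson(sample1, sample2), pearson(sample1, sample3))
```

The measure ignores scale and offset, so two samples in which the gene rises and falls together score close to 1 even when the absolute expression levels differ.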

Basic Approaches
- Sample-gene search
  - Enumerate the subsets of samples systematically
  - For each subset of samples, find the genes that are coherent on the samples
- Gene-sample search
  - Enumerate the subsets of genes systematically
  - For each subset of genes, find the samples on which the genes are coherent

Basic Tools
- Set enumeration tree
- Sample-gene search and gene-sample search are not symmetric!
  - Many genes, but only a few samples
  - No requirement on samples coherent on genes

Phenotypes and Informative Genes
[Figure: expression matrix over samples 1-7; genes 1-4 are informative genes, genes 5-7 are non-informative genes]

The Phenotype Mining Problem
- Input: a microarray matrix and k
- Output: phenotypes and informative genes
  - Partitioning the samples into k exclusive subsets (phenotypes)
  - Informative genes discriminating the phenotypes
- Machine learning methods
  - Heuristic search
  - Mutual reinforcing adjustment

Requirements
- The expression levels of each informative gene should be similar over the samples within each phenotype
- The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes

To-Do List
- Read Chapters 10.4 and 11.2
- Assignment 3