Distance-based Methods: Drawbacks
- Hard to find clusters with irregular shapes
- Hard to specify the number of clusters
- Heuristic: a cluster must be dense
Jian Pei: CMPT 459/741 Clustering (3)
How to Find Irregular Clusters?
- Divide the whole space into many small areas
- The density of an area can be estimated
- Areas may or may not be exclusive
- A dense area is likely in a cluster
- Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape
Directly Density-Reachable
- Parameters:
  - Eps: maximum radius of the neighborhood
  - MinPts: minimum number of points in an Eps-neighborhood of that point
- Eps-neighborhood: N_Eps(p) = {q | dist(p, q) ≤ Eps}
- Core object p: |N_Eps(p)| ≥ MinPts; a core object is in a dense area
- A point q is directly density-reachable from a point p iff q ∈ N_Eps(p) and p is a core object
- Example: MinPts = 3, Eps = 1 cm
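These definitions translate almost directly into code. A minimal sketch in Python (the function names and toy points are illustrative, not from the slides):

```python
from math import dist  # Euclidean distance

def eps_neighborhood(points, p, eps):
    """N_Eps(p): all points within distance eps of p (including p itself)."""
    return [q for q in points if dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """p is a core object iff its Eps-neighborhood contains >= MinPts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    """q is directly density-reachable from p iff q is in N_Eps(p) and p is core."""
    return dist(p, q) <= eps and is_core(points, p, eps, min_pts)
```

Note the asymmetry: q can be directly density-reachable from a core point p while p is not directly density-reachable from q, if q is not itself a core object.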
Density-Based Clustering
- Density-reachable: a point p_n is density-reachable from p_1 if there is a chain p_1 → p_2, p_2 → p_3, …, p_{n-1} → p_n in which each step is directly density-reachable
- Density-connected: if points p and q are both density-reachable from some point o, then p and q are density-connected
DBSCAN
- A cluster: a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
- (Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5)
DBSCAN: the Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
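The loop above can be sketched as follows. This is a plain-Python rendering of the textbook algorithm with no spatial index (so each neighborhood query is O(n)); the label -1 marks noise:

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Assign a cluster id to every point; -1 marks noise/outliers."""
    labels = [None] * len(points)
    cluster_id = 0

    def neighbors(i):
        # indices of all points within eps of points[i] (including i)
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # not a core point: tentatively noise
            continue
        labels[i] = cluster_id        # start a new cluster from core point i
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(j_seeds)
        cluster_id += 1
    return labels
```

Border points get the label of the first cluster that reaches them; they never expand the cluster further, which is exactly the "no points are density-reachable from a border point" case above.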
Challenges for DBSCAN
- Different clusters may have very different densities
- Clusters may form hierarchies
OPTICS: A Cluster-Ordering Method
- Idea: ordering points to identify the clustering structure
- Group points by density connectivity
- Hierarchies of clusters
- Visualize the clusters and the hierarchy
Ordering Points
- Points that are strongly density-connected should be close to one another
- Density-connected clusters should be close to one another and form a cluster of clusters
OPTICS: An Example
(Figure: reachability plot, showing reachability-distance, undefined for some objects, against the cluster-order of the objects, with the threshold ε marked)
DENCLUE: Using Density Functions
- DENsity-based CLUstEring
- Major features:
  - Solid mathematical foundation
  - Good for data sets with large amounts of noise
  - Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  - Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
  - But needs a large number of parameters
DENCLUE: Techniques
- Use grid cells
  - Only keep grid cells actually containing data points
  - Manage the cells in a tree-based access structure
- Influence function: describes the impact of a data point on its neighborhood
- The overall density of the data space is the sum of the influence functions of all data points
- Clustering by identifying density attractors
  - Density attractor: a local maximum of the overall density function
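A sketch of the idea with a Gaussian influence function. The kernel choice, step size, and stopping rule here are illustrative assumptions, not the exact DENCLUE formulation (which additionally uses the grid cells to avoid summing over all points):

```python
import math

def gaussian_influence(x, y, sigma=1.0):
    """Influence of data point y on location x, with a Gaussian kernel."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

def climb_to_attractor(x, data, sigma=1.0, step=0.05, max_iter=200):
    """Hill-climb from x along the density gradient toward a density
    attractor (a local maximum of the overall density function)."""
    x = list(x)
    for _ in range(max_iter):
        grad = [0.0] * len(x)
        for y in data:
            g = gaussian_influence(x, y, sigma)
            for d in range(len(x)):
                grad[d] += g * (y[d] - x[d]) / sigma ** 2
        norm = math.sqrt(sum(g * g for g in grad))
        if norm < 1e-9:
            break
        x = [xi + step * gi / norm for xi, gi in zip(x, grad)]
    return tuple(x)
```

Points whose climbs end at the same attractor form a center-defined cluster; attractors whose density falls below a noise threshold are discarded.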
Density Attractor
(Figure)
Center-Defined and Arbitrary Clusters
(Figure)
A Shrinking-based Approach
- Difficulties of multi-dimensional clustering:
  - Noise (outliers)
  - Clusters of various densities
  - No well-defined shapes
- A novel preprocessing concept: shrinking
- A shrinking-based clustering approach
Intuition & Purpose
- For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?
- Natural sparse subgroups become denser and thus easier to detect
- Noise points are further isolated
Inspiration: Newton's Universal Law of Gravitation
- Any two objects exert a gravitational force of attraction on each other
- The direction of the force is along the line joining the objects
- The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them:
  F_g = G m_1 m_2 / r^2
- G: universal gravitational constant, G = 6.67 × 10^-11 N·m^2/kg^2
The Concept of Shrinking
- A data preprocessing technique
- Aims to optimize the inner structure of real data sets
- Each data point is attracted by other data points and moves in the direction in which the attraction is strongest
- Can be applied in different fields
Applying Shrinking to Clustering
- Shrink the natural sparse clusters to make them much denser, facilitating the subsequent cluster-detection process
- (Figure: points moving in a multi-attribute hyperspace)
Data Shrinking
- Each data point moves along the direction of the density gradient, and the data set shrinks towards the inside of the clusters
- Points are attracted by their neighbors and move to create denser clusters
- The process proceeds iteratively, repeated until the data are stabilized or the number of iterations exceeds a threshold
Approximation & Simplification
- Problem: computing the mutual attraction of each pair of data points is too time-consuming: O(n^2)
- Solution:
  - Newton's constant G and the masses m_1, m_2 are set to 1
  - Only aggregate the gravitation surrounding each data point
  - Use grids to simplify the computation
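A toy sketch of one grid-simplified shrinking iteration with G and all masses set to 1: each point moves part of the way toward the centroid of the points sharing its grid cell. The cell size, step factor, and cell-only neighborhood are simplifying assumptions; the actual method also aggregates attraction from surrounding cells.

```python
from collections import defaultdict

def shrink_step(points, cell_size, step=0.5):
    """One shrinking iteration on a list of numeric tuples."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell_size) for c in p)   # grid cell of p
        cells[key].append(p)
    moved = []
    for p in points:
        key = tuple(int(c // cell_size) for c in p)
        group = cells[key]
        # aggregated unit-mass attraction pulls p toward its cell's centroid
        centroid = tuple(sum(q[d] for q in group) / len(group)
                         for d in range(len(p)))
        moved.append(tuple(pi + step * (ci - pi) for pi, ci in zip(p, centroid)))
    return moved
```

Repeating the step makes each populated cell's occupants contract toward one another, which is exactly the "sparse subgroups become denser" effect the slides describe.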
Termination Conditions
- The average movement of all points in the current iteration is less than a threshold
- The number of iterations exceeds a threshold
OPTICS on the Pendigits Data
(Figure: reachability plots before and after data shrinking)
Biclustering
- Clustering both objects and attributes simultaneously
- Four requirements:
  - Only a small set of objects participates in a cluster (bicluster)
  - A bicluster only involves a small number of attributes
  - An object may participate in multiple biclusters, or in no bicluster
  - An attribute may be involved in multiple biclusters, or in no bicluster
Jian Pei: Big Data Analytics -- Clustering
Application Examples
- Recommender systems
  - Objects: users; attributes: items; values: user ratings
- Microarray data
  - Objects: genes; attributes: samples/conditions; values: expression levels
- (Figure: an n × m gene-by-sample matrix with entries w_11 … w_nm)
Biclusters with Constant Values

       b6  b12  b36  b99
a1     60   60   60   60
a33    60   60   60   60
a86    60   60   60   60

Biclusters with constant values on rows:

10 10 10 10 10
20 20 20 20 20
50 50 50 50 50
 0  0  0  0  0
Biclusters with Coherent Values
- Also known as pattern-based clusters
Biclusters with Coherent Evolutions
- Only up- or down-regulated changes over rows or columns
- Example matrix with coherent evolutions on rows:

10  50  30   70  20
20 100  50 1000  30
50 100  90  120  80
 0  80  20  100  10
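Coherent evolution on rows can be checked by comparing the up/down sign pattern of each row; a small sketch (names illustrative):

```python
def sign_pattern(row):
    """Direction of change between consecutive entries: +1 up, -1 down, 0 flat."""
    return tuple((b > a) - (b < a) for a, b in zip(row, row[1:]))

def coherent_on_rows(matrix):
    """True iff every row rises and falls in the same pattern,
    regardless of the magnitudes of the values."""
    return len({sign_pattern(row) for row in matrix}) == 1
```

On the example matrix on this slide, every row follows the pattern up, down, up, down, so the rows evolve coherently even though the magnitudes differ wildly.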
Differences from Subspace Clustering
- Subspace clustering uses a global distance/similarity measure
- Pattern-based clustering looks at patterns
- A subspace cluster according to a globally defined similarity measure may not follow the same pattern
Do Two Objects Follow the Same Pattern?
- pscore compares two objects (blue and green in the figure) on two dimensions D1 and D2
- The smaller the pscore, the more consistently the two objects follow the same pattern
Pattern-based Clusters
- pscore measures how consistently two objects r_x, r_y behave on two attributes a_u, a_v:
  pscore([[r_x.a_u, r_x.a_v], [r_y.a_u, r_y.a_v]]) = |(r_x.a_u − r_y.a_u) − (r_x.a_v − r_y.a_v)|
- δ-pCluster (R, D): for any objects r_x, r_y ∈ R and any attributes a_u, a_v ∈ D, pscore ≤ δ (δ ≥ 0)
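The definition in code: a brute-force check over all 2×2 submatrices of a candidate (R, D). Real miners must prune the exponential search space, but the test itself is simple. Objects are represented here as dicts from attribute names to values (an illustrative encoding):

```python
from itertools import combinations

def pscore(rx, ry, au, av):
    """pscore of the 2x2 submatrix of objects rx, ry on attributes au, av."""
    return abs((rx[au] - ry[au]) - (rx[av] - ry[av]))

def is_delta_pcluster(objects, attrs, delta):
    """(R, D) is a delta-pCluster iff pscore <= delta for every pair of
    objects in R and every pair of attributes in D."""
    return all(pscore(rx, ry, au, av) <= delta
               for rx, ry in combinations(objects, 2)
               for au, av in combinations(attrs, 2))
```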
Maximal pCluster
- If (R, D) is a δ-pCluster, then every sub-cluster (R', D') with R' ⊆ R and D' ⊆ D is also a δ-pCluster
- An anti-monotonic property: a large pCluster is accompanied by many small pClusters, which is wasteful to report
- Idea: mine only the maximal pClusters!
- A δ-pCluster is maximal if no proper super-cluster of it is a δ-pCluster
Mining Maximal pclusters Given A cluster threshold δ An attribute threshold min a An object threshold min o Task: mine the complete set of significant maximal δ-pclusters A significant δ-pcluster has at least min o objects on at least min a attributes Jian Pei: Big Data Analytics -- Clustering 33
pcluters and Frequent Itemsets A transaction database can be modeled as a binary matrix Frequent itemset: a sub-matrix of all 1 s 0-pCluster on binary data Min o : support threshold Min a : no less than mina attributes Maximal pclusters closed itemsets Frequent itemset mining algorithms cannot be extended straightforwardly for mining pclusters on numeric data Jian Pei: Big Data Analytics -- Clustering 34
Where Should We Start from?
- How about the pClusters having only 2 objects or 2 attributes?
- MDS (maximal dimension set)
- A pCluster must have at least 2 objects and 2 attributes
- Finding the MDSs for a pair of objects x, y:

Attribute   a   b   c   d   e   f   g   h
x          13  11   9   7   9  13   2  15
y           7   4  10   1  12   3   4   7
x - y       6   7  -1   6  -3  10  -2   8
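For a pair of objects, the MDSs can be read off the sorted differences x - y: for two attributes, pscore is just the absolute difference of their x - y values, so a set of attributes is δ-coherent for the pair iff the spread (max minus min) of their differences is at most δ, and the MDSs are the maximal sliding windows over the sorted differences. A sketch (names illustrative):

```python
def maximal_dimension_sets(diffs, delta):
    """MDSs for one object pair, given diffs: attribute -> (x - y) value.
    Returns the maximal attribute sets whose difference spread is <= delta
    and that contain at least 2 attributes."""
    order = sorted(diffs, key=diffs.get)   # attributes sorted by difference
    result, start = [], 0
    for end in range(len(order)):
        # shrink the window from the left until its spread fits within delta
        while diffs[order[end]] - diffs[order[start]] > delta:
            start += 1
        # record the window only if it cannot be extended to the right
        if end + 1 == len(order) or diffs[order[end + 1]] - diffs[order[start]] > delta:
            if end - start + 1 >= 2:
                result.append(set(order[start:end + 1]))
    return result
```

With δ = 2 on the table above, this yields {c, e, g}, {a, b, d, h}, and {f, h}.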
How to Assemble Larger pClusters?
- Systematically enumerate every combination of attributes D
- For each attribute subset, find the maximal subsets of objects R such that (R, D) is a pCluster
- Check whether (R, D) is maximal
- Prune search branches as early as possible
- Why attribute-first, object-later? The number of objects >> the number of attributes
- Algorithm MaPle (Pei et al., 2003)
More Pruning Techniques
- Only possible attributes should be considered when growing larger pClusters
- Prune local maximal pClusters having insufficient possible attributes
- Extract common attributes from the possible attribute set directly
- Prune non-maximal pClusters
Gene-Sample-Time Series Data
(Figure: a gene × sample × time cube; each cell holds the expression level of gene i on sample j at time k; slicing it yields the sample-time, gene-sample, and gene-time matrices)
Mining GST Microarray Data
- Reduce the gene-sample-time series data to gene-sample data
- Use Pearson's correlation coefficient as the coherence measure
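Pearson's correlation coefficient between two equal-length series, for reference; here it would score how coherently two genes' time series vary on a sample (the helper name is illustrative):

```python
import math

def pearson(u, v):
    """Pearson's correlation coefficient of two equal-length series."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)
```

The coefficient lies in [-1, 1]; values near 1 indicate the two series rise and fall together, which is the coherence the reduction step looks for.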
Basic Approaches
- Sample-gene search:
  - Enumerate the subsets of samples systematically
  - For each subset of samples, find the genes that are coherent on the samples
- Gene-sample search:
  - Enumerate the subsets of genes systematically
  - For each subset of genes, find the samples on which the genes are coherent
Basic Tools
- Set enumeration tree
- Sample-gene search and gene-sample search are not symmetric!
  - Many genes, but only a few samples
  - No requirement that samples be coherent on genes
Phenotypes and Informative Genes
(Figure: samples 1-7; informative genes 1-4 vs. non-informative genes 5-7)
The Phenotype Mining Problem
- Input: a microarray matrix and k
- Output: phenotypes and informative genes
  - Partition the samples into k exclusive subsets: the phenotypes
  - Informative genes discriminate between the phenotypes
- Methods: machine learning methods, heuristic search, mutual reinforcing adjustment
Requirements
- The expression levels of each informative gene should be similar over the samples within each phenotype
- The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes
To-Do List
- Read Chapters 10.4 and 11.2
- Assignment 3