Dimension Induced Clustering

Size: px

Start display at page:

Download "Dimension Induced Clustering"

Diana Norman
5 years ago
Views:

1 Dimension Induced Clustering Aris Gionis Alexander Hinneburg Spiros Papadimitriou Panayiotis Tsaparas HIIT, University of Helsinki Martin Luther University, Halle Carnegie Melon University HIIT, University of Helsinki SIGKDD 2005 Page 1 Data Objects Introduction Vector Space Motivation Mapping into high-dimensional Feature Space High-dimensional Feature Space Observation: data points cannot fill the space => Data lie on one or more low-dimensional manifolds Real data exhibit patterns and regularities SIGKDD 2005 Page 2

2 The red points form a 1d manifold in the 2d space. Example A low dimensional manifold must contain sufficient number of points that are densely packed density-based methods? SIGKDD 2005 Page 3 Dimension Induced Clustering How to separate river and lake River and lake have same density Both are spatially connected But they differ in dimensionality Density is still necessary for separating lake from surroundings SIGKDD 2005 Page 4

3 Dimension Induced Clustering Problem - Given a set of data objects with a distance function - Find dense subsets of objects with similar dimensionality SIGKDD 2005 Page 5 Indexing Other Applications efficient approximation of nearest neighbor for metric data, assumes bounded intrinsic dimensionality [Krauthgammer & Lee, ICALP 2004] Mixture Models of PCA needs average dimensionality as parameter [Aggarwal & Yu, SIGMOD 2002], [P. Agarwal et al, PODS 2004] SIGKDD 2005 Page 6

4 What is dimension? Approach using representation dimension is the number of coordinates decompose data space into set of linear sub-spaces densely filled with points Drawbacks assumes vector spaces only linear manifolds SIGKDD 2005 Page 7 What is dimension? Approach using relative distances use distances between objects only extend notion of fractal dimension Advantages complex curved manifolds applicable to metric spaces SIGKDD 2005 Page 8

5 Fractal Dimension Correlation Dimension Set of objects X, X =n Distance function Points in the ball of radius r around x Correlation Integral Correlation Dimension SIGKDD 2005 Page 9 Fractal Dimension Intuition behind definition d corr = 1 d corr = 2 SIGKDD 2005 Page 10

6 Fractal Dimension In real life, datasets are finite C () r = 1 n x X ( x, r) Calculation of correlation dimension: fit a line on the log-log plot of C(r) B n SIGKDD 2005 Page 11 Fractal Dimension 1D Data d= 2D Data SIGKDD 2005 Page 12

7 Fractal Dimension What if the data is non-homogeneous? SIGKDD 2005 Page 13 Local Fractal Dimension However, looking at A and B individually SIGKDD 2005 Page 14

8 Local Fractal Dimension Local Growth Curve Local Correlation Dimension For finite data G x () r = B ( x, r ) n d x is estimated by fitting a line 1 SIGKDD 2005 Page 15 Local Fractal Dimension Linear Growth Model of an object x i SIGKDD 2005 Page 16

9 Linear Growth Model d i : rate of growth of log G x (r) dimensionality b i : value of L x (log r) at radius 1-- density L x (log r*) : density at radius r* The model can be summarized with two values: d i and c i = L x (log r*) how do we select r*? SIGKDD 2005 Page 17 Selecting r* Idea: choose r* such that c i s and d i s are un-correlated SIGKDD 2005 Page 18

10 Local Representation Local Representation of point x i Captures the view of the world for each point SIGKDD 2005 Page 19 The fitting interval The linear growth model is defined over a subset of the neighbors of x Clipping from above Clipping from below SIGKDD 2005 Page 20

11 Algorithm SIGKDD 2005 Page 21 Experiments Detection of m-flats in high-dimensional space (a) 2d flat in 3d space Classification error = 8.1% (a) 40d flat in 50d space Classification error = 1.2% SIGKDD 2005 Page 22

Comparison to Optics Optics: density based

Rank Sub-Matrix Combinatorial low-rank sub-matrix

and set of rows Final Clusters are the Cartesian

12 Comparison to Optics Optics: density based hierarchical clustering SIGKDD 2005 Page 23 Low Rank Sub-Matrix Combinatorial low-rank sub-matrix in a random Matrix Apply DIC to set of columns and set of rows Final Clusters are the Cartesian product of row and column clusters Rank 2 SIGKDD 2005 Page 24

13 Experiments Gene Expression Data (gene clustering) Yeast data from George Church, Harvard SIGKDD 2005 Page 25 Experiments Gene Expression Data (gene clustering) Neither density nor dimensionality alone can detect the structure SIGKDD 2005 Page 26

14 Conclusion Find subsets with low fractal dimensionality Local Representation local fractal dimensionality local density Visualization of the cluster structure SIGKDD 2005 Page 27

Dimension Induced Clustering

Dimension Induced Clustering Aristides Gionis Basic Research Unit, HIIT University of Helsinki, Finland Spiros Papadimitriou Department of Computer Science Carnegie Mellon University, USA Alexander Hinneburg