CSE 6242 A / CX 4242 DVA, March 6, 2014: Dimension Reduction. Guest Lecturer: Jaegul Choo
1 CSE 6242 A / CX 4242 DVA March 6, 2014 Dimension Reduction Guest Lecturer: Jaegul Choo
2 Data Is Too Big to Analyze!
- Limited memory size: the data may not fit into your machine's memory
- Slow computation: e.g., Euclidean distance computation on very high-dim vs. 10-dim vectors
3 Two Axes of a Data Set
- No. of data items: how many data items?
- No. of dimensions: how many dimensions represent each item?
- Columns as data items vs. rows as data items (we will use one convention throughout this lecture)
4 Dimension Reduction
Let's reduce the data along the dimension axis.
[Diagram: high-dim data -> Dimension Reduction -> low-dim data. Other inputs: additional info about the data, other parameters (user-specified). Other output: a dim-reducing transformer for new data.]
5 What You Get from DR
Obviously:
- Less storage
- Faster computation
More importantly:
- Noise removal (improving the quality of the data), which leads to better task performance
- 2D/3D representation, which enables visual data exploration
6 Applications
Traditionally:
- Microarray data analysis
- Information retrieval
- Face recognition
- Protein disorder prediction
- Network intrusion detection
- Document categorization
- Speech recognition
More interestingly:
- Interactive visualization of high-dimensional data
7 Visualizing a Map of Science [visualization figure]
8 Two Main Techniques
1. Feature selection
- Selects a subset of the original variables as the reduced dimensions
- For example, the number of genes responsible for a particular disease may be small
2. Feature extraction
- Each reduced dimension involves multiple original dimensions
- Active research area
Note: feature = variable = dimension
9 Feature Selection
What is the optimal subset of m features that maximizes a given criterion?
- Widely used criteria: information gain, correlation, ...
- Typically a combinatorial optimization problem, so greedy methods are popular:
- Forward selection: start from the empty set and add one variable at a time
- Backward elimination: start from the entire set and remove one variable at a time
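The greedy forward-selection idea above can be sketched in a few lines. Everything concrete here is made up for illustration: `toy_score` stands in for a real criterion such as information gain or correlation, and the relevance/redundancy values are invented.

```python
# Hypothetical sketch of greedy forward selection: starting from the empty set,
# repeatedly add the feature that most improves a user-supplied criterion.

def forward_selection(features, score, m):
    """Greedily pick m features maximizing score(subset)."""
    selected = []
    remaining = list(features)
    while len(selected) < m and remaining:
        # Try adding each remaining feature and keep the best one.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy criterion: assumed per-feature relevance with a redundancy penalty.
relevance = {"gene_a": 0.9, "gene_b": 0.8, "gene_c": 0.1, "gene_d": 0.7}
redundant = {("gene_a", "gene_b")}  # these two carry overlapping information

def toy_score(subset):
    s = sum(relevance[f] for f in subset)
    for i, f in enumerate(subset):
        for g in subset[i + 1:]:
            if (f, g) in redundant or (g, f) in redundant:
                s -= 0.5  # penalize picking both of a redundant pair
    return s

picked = forward_selection(list(relevance), toy_score, m=2)
```

Note how the redundancy penalty makes the greedy choice skip `gene_b` even though it is individually the second-best feature; backward elimination would mirror this loop, starting from the full set and removing one feature at a time.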
10 From now on, we will discuss only feature extraction.
11 Aspects of DR
- Linear vs. nonlinear
- Unsupervised vs. supervised
- Global vs. local
- Feature vectors vs. similarity (as the input)
12 Aspects of DR: Linear vs. Nonlinear
Linear
- Represents each reduced dimension as a linear combination of the original dimensions
- e.g., Y1 = w11*X1 + w12*X2 + w13*X3 + w14*X4, and similarly for Y2
- Naturally capable of mapping new data into the same space
[Figure: a small table of items in dimensions D1 and D2, mapped by dimension reduction to reduced values Y]
13 Aspects of DR: Linear vs. Nonlinear (cont.)
Nonlinear
- More complicated, but generally more powerful
- A recently popular topic
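The key property of a linear method is that the learned transformation is just a matrix, so new data maps into the same space by multiplication. A minimal NumPy sketch, with illustrative made-up coefficients in `W` (not the slide's exact numbers, which are garbled in the transcription):

```python
import numpy as np

# A linear DR method amounts to a matrix W mapping 4 original dimensions to 2
# reduced ones; each column of Y is a linear combination of the columns of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))      # 100 items, 4 original dimensions
W = np.array([[3.0, 2.0],
              [-4.0, 0.5],
              [0.5, -1.0],
              [-1.5, 2.0]])            # 4 -> 2 linear map (made-up weights)

Y = X @ W                              # reduced representation of the data

x_new = rng.standard_normal((1, 4))
y_new = x_new @ W                      # new data reuses the same transformer
```

This is exactly the "dim-reducing transformer for new data" from the earlier diagram: for linear methods it comes for free, whereas nonlinear methods need an extra out-of-sample mechanism.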
14 Aspects of DR: Unsupervised vs. Supervised
Unsupervised
- Uses only the input data
[Diagram: high-dim data -> Dimension Reduction -> low-dim data and a dim-reducing transformer for new data]
15-16 Aspects of DR: Unsupervised vs. Supervised (cont.)
Supervised
- Uses the input data plus additional info, e.g., a grouping label
[Diagram: high-dim data and additional info about the data -> Dimension Reduction -> low-dim data and a dim-reducing transformer for new data]
17 Aspects of DR: Global vs. Local
Dimension reduction typically tries to preserve all the relationships/distances in the data, but information loss is unavoidable. So, what do you care about most?
Global
- Treats all pairwise distances as equally important
- Tends to care more about larger distances
Local
- Focuses on small distances and neighborhood relationships
- Active research area, a.k.a. manifold learning
18-20 Aspects of DR: Feature Vectors vs. Similarity (as the input)
- Typical setup: feature vectors as the input
- Some methods take a similarity matrix instead, whose (i, j)-th component indicates the similarity between the i-th and j-th data items
- Some methods internally convert feature vectors to a similarity matrix before performing dimension reduction, a.k.a. graph embedding
[Diagram: high-dim data -> similarity matrix -> Dimension Reduction -> low-dim data]
21 Aspects of DR: Feature Vectors vs. Similarity (cont.)
Why is it called graph embedding?
- A similarity matrix can be viewed as a graph in which each similarity value is an edge weight
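One common way to perform the feature-vectors-to-similarity conversion mentioned above is a Gaussian kernel on pairwise distances; the bandwidth `sigma` below is an arbitrary illustrative choice.

```python
import numpy as np

# Sketch: turn feature vectors into a similarity matrix (Gaussian kernel),
# the input form used by graph-embedding-style DR methods.
def similarity_matrix(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    return np.exp(-sq / (2.0 * sigma ** 2))              # closer -> more similar

X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [5.0, 5.0]])
S = similarity_matrix(X)
```

Viewed as a graph, `S[i, j]` is the weight of the edge between items i and j: nearby points get weights near 1, distant points get weights near 0.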
22 Methods
Traditional:
- Principal component analysis (PCA)
- Multidimensional scaling (MDS)
- Linear discriminant analysis (LDA)
Advanced (nonlinear, kernel, manifold learning):
- Isometric feature mapping (Isomap)
- t-distributed stochastic neighbor embedding (t-SNE)
(Matlab code is available; see the resources slides at the end.)
23 Principal Component Analysis
- Finds the axes showing the greatest variation and projects all points onto them
- The reduced dimensions are orthogonal
- Algorithm: eigen-decomposition
- Pros: fast
- Cons: limited performance
Properties: linear, unsupervised, global, feature-vector input
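The eigen-decomposition route described above can be sketched directly in NumPy: center the data, decompose the covariance matrix, and project onto the top axes of variation. The synthetic data set is made up for illustration.

```python
import numpy as np

# Minimal PCA sketch: eigen-decomposition of the covariance matrix.
def pca(X, k):
    Xc = X - X.mean(axis=0)              # center the data
    C = Xc.T @ Xc / (len(X) - 1)         # covariance matrix
    vals, vecs = np.linalg.eigh(C)       # eigh: C is symmetric
    order = np.argsort(vals)[::-1][:k]   # top-k variance directions
    W = vecs[:, order]                   # principal axes (orthogonal)
    return Xc @ W, W

rng = np.random.default_rng(1)
# Toy 3-D data that mostly varies along one direction, plus small noise:
X = rng.standard_normal((200, 1)) @ np.array([[2.0, 1.0, 0.0]]) \
    + 0.1 * rng.standard_normal((200, 3))
Y, W = pca(X, k=2)
```

`W` doubles as the dim-reducing transformer for new data: any new (centered) vector can be projected with `x_new @ W`, reflecting PCA's linearity.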
24 Principal Component Analysis: Document Visualization [figure]
25 Multidimensional Scaling (MDS)
Intuition: tries to preserve given ideal pairwise distances in the low-dimensional space.
- Metric MDS: preserves the given ideal distance values
- Nonmetric MDS: preserves only the orderings of the distance values (for when you only know or care about the ordering of distances)
- Algorithm: gradient-descent type
- cf. classical MDS is equivalent to PCA
Properties: nonlinear, unsupervised, global, similarity input
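The gradient-descent flavor of metric MDS can be sketched as minimizing the "stress," the squared mismatch between embedded distances and the given ideal distances. The learning rate, step count, and toy triangle are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative metric-MDS sketch: gradient descent on the stress
# sum over pairs of (||y_i - y_j|| - d_ij)^2.
def metric_mds(D, k=2, steps=2000, lr=0.02, seed=0):
    n = D.shape[0]
    Y = np.random.default_rng(seed).standard_normal((n, k)) * 0.1
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]      # pairwise differences
        dist = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(dist, 1.0)               # avoid divide-by-zero
        coeff = (dist - D) / dist                 # attract if too far apart
        np.fill_diagonal(coeff, 0.0)
        Y -= lr * 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
    return Y

# Ideal distances taken from a triangle, so they are exactly embeddable in 2D.
pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
D = np.sqrt(((pts[:, None] - pts[None, :]) ** 2).sum(-1))
Y = metric_mds(D)
```

Sammon's mapping (next slide) is the same loop with each pair's error down-weighted by its ideal distance, which is what shifts the emphasis from large distances to small ones.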
26 Multidimensional Scaling: Sammon's Mapping
- A local version of MDS
- Down-weights errors in large distances
- Algorithm: gradient-descent type
Properties: nonlinear, unsupervised, local, similarity input
27 Multidimensional Scaling: Force-Directed Graph Layout
- Rooted in graph visualization, but essentially a variant of metric MDS
- Spring-like attractive and repulsive forces between nodes
- Algorithm: gradient-descent type
- Widely used in visualization: aesthetically pleasing results, simple and intuitive, supports interactivity
Properties: nonlinear, unsupervised, global, similarity input
28 Demos: force-directed graph layout in Prefuse and in D3
29 Multidimensional Scaling
In all variants:
- Pros: widely used (works well in general)
- Cons: slow; nonmetric MDS is much slower than metric MDS
30 Linear Discriminant Analysis
What if clustering information is available? LDA tries to separate clusters by
- Placing different clusters as far apart as possible
- Making each cluster as compact as possible
31 (Recap) Aspects of DR: Unsupervised vs. Supervised
Supervised
- Uses the input data plus additional info, e.g., a grouping label
32 Linear Discriminant Analysis vs. Principal Component Analysis
2D visualizations of a mixture of 7 Gaussians in 1000 dimensions:
- Linear discriminant analysis (supervised)
- Principal component analysis (unsupervised)
33 Linear Discriminant Analysis
Maximally separates clusters by
- Placing different clusters as far apart as possible
- Making each cluster as compact as possible
- Algorithm: generalized eigen-decomposition
- Pros: better reveals cluster structure
- Cons: may distort the original relationships in the data
Properties: linear, supervised, global, feature-vector input
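The "far apart + compact" objective above corresponds to maximizing between-cluster scatter relative to within-cluster scatter, solved via a generalized eigenproblem. A minimal sketch on made-up two-cluster data (the small ridge added to the within-class scatter is an assumption for numerical stability, not part of textbook LDA):

```python
import numpy as np

# LDA sketch: directions maximizing between-class over within-class scatter.
def lda(X, labels, k):
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((d, d))                 # within-class scatter: compactness
    Sb = np.zeros((d, d))                 # between-class scatter: separation
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        dm = (mc - mean)[:, None]
        Sb += len(Xc) * (dm @ dm.T)
    # Equivalent to the generalized eigenproblem S_b w = lambda S_w w.
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(vals.real)[::-1][:k]
    return X @ vecs[:, order].real

rng = np.random.default_rng(2)
A = 0.3 * rng.standard_normal((50, 3))                        # cluster at origin
B = 0.3 * rng.standard_normal((50, 3)) + np.array([3.0, 0.0, 0.0])
X = np.vstack([A, B])
labels = np.array([0] * 50 + [1] * 50)
Y = lda(X, labels, k=1)
```

With c clusters, LDA yields at most c - 1 meaningful directions, which is why it pairs naturally with 2D visualization of a handful of labeled groups.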
34 Methods (cont.)
Advanced (nonlinear, kernel, manifold learning):
- Isometric feature mapping (Isomap)
- t-distributed stochastic neighbor embedding (t-SNE)
35-36 Manifold Learning: Swiss Roll Data
- The Swiss roll data is originally in 3D
- What is its intrinsic dimensionality (allowing flattening)? 2D
- What if your data has low intrinsic dimensionality but resides in a high-dimensional space?
37 Manifold Learning: Goal and Approach
Manifold
- A curvilinear low-dimensional structure of your data, based on its intrinsic dimensionality
Manifold learning
- Matches the intrinsic dimensions to the axes of the dimension-reduced output space
How?
- Each piece of the manifold is approximately linear
- Utilize local neighborhood information, e.g., for a particular point: who are my neighbors, and how closely am I related to them?
(Demo available online.)
38 Isomap (Isometric Feature Mapping)
Let's preserve pairwise geodesic distances (along the manifold):
- Compute each geodesic distance as the shortest-path length in a k-nearest-neighbor (k-NN) graph
- Apply eigen-decomposition* to the pairwise geodesic distance matrix to obtain the embedding that best preserves the given distances
* Recall that eigen-decomposition is also the main algorithm of PCA
39 Isomap (Isometric Feature Mapping) (cont.)
- Algorithm: all-pairs shortest-path computation + eigen-decomposition
- Pros: performs well in general
- Cons: slow (shortest paths), sensitive to parameters
Properties: nonlinear, unsupervised, global (all pairwise distances are considered), feature-vector input
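The three steps above (k-NN graph, all-pairs shortest paths, eigen-decomposition) fit in a short NumPy sketch for small data sets; Floyd-Warshall and the classical-MDS-style double centering below are standard choices, while the semicircle test data is invented for illustration.

```python
import numpy as np

# Isomap sketch: k-NN graph -> geodesic distances -> eigen-decomposition.
def isomap(X, k=3, dim=2):
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))   # Euclidean dists
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):                       # keep edges to the k nearest only
        for j in np.argsort(D[i])[1:k + 1]:
            G[i, j] = G[j, i] = D[i, j]
    for m in range(n):                       # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    B = -0.5 * G ** 2                        # double-center squared geodesics
    B = B - B.mean(0) - B.mean(1)[:, None] + B.mean()
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Points along a semicircle are intrinsically 1-D, so a 1-D Isomap embedding
# should recover their ordering along the curve.
t = np.linspace(0.0, np.pi, 12)
X = np.column_stack([np.cos(t), np.sin(t)])
Y = isomap(X, k=2, dim=1)
```

The sensitivity to parameters shows up directly here: too small a k disconnects the graph (infinite geodesics), while too large a k adds shortcut edges that cut across the manifold, as the facial-data example on the next slide illustrates.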
40 Isomap: Facial Data Example
[Figure: embeddings of facial images varying in angle and person, for k = 8, 22, and 49, where k is the value in the k-NN graph]
Which one do you think is the best?
41 t-sne (t-distributed Stochastic Neighborhood Embedding) Made specifically for visualization! (in very low dimension)! Can reveal clusters without any supervision! e.g., spoken letter data PCA t-sne 41 Official website:
42 t-sne (t-distributed Stochastic Neighborhood Embedding) How it works! Converts distance into probability! Farther distance gets lower probability! Then, minimize differences in probability distribution between high- and low-dimensional spaces! KL divergence naturally focuses on neighborhood relationships! Difference from SNE! t-sne uses heavy-tailed t-distribution instead of Gaussian.! Suitable for dimension reduction to a very low dimension 42
43 t-sne (t-distributed Stochastic Neighborhood Embedding)! Algorithm: gradient-decent type! Pros: works surprisingly well in 2D/3D visualization! Cons: very slow Nonlinear Unsupervised Local Similarity input 43
44 DR in Interactive Visualization
What can you do with visualization via dimension reduction?
- e.g., multidimensional scaling applied to document data
45 DR in Interactive Visualization
The more data items are involved, the harder the analysis becomes:
- For n data items, users are shown O(n^2) relations spatially encoded in the visualization
- Too many to understand in general
46-47 DR in Interactive Visualization: What to Look at First?
People tend to look for a small number of objects that perceptually/visually stand out, e.g.,
- Outliers (if any)
- More commonly, subgroups/clusters
- However, we cannot expect DR to always reveal clusters
48 DR in Interactive Visualization: What to Look at First? (cont.)
What if DR cannot reveal subgroups/clusters clearly? Or, even worse, what if the data does not have any to begin with?
- Often, pre-defined grouping information is injected and color-coded
- Such grouping information usually comes from labels pre-given along with the data, or from labels computed by clustering
49 Dimension Reduction in Action: Handwritten Digit Data Visualization
From the visualization we can now see cluster/data relationships and subclusters/outliers, e.g., two subclusters within digit 5, and a major group plus two minor groups within digit 7.
- Treating the two subclusters in digit 5 as separate clusters improved classification accuracy from 89% to 93% (LDA + k-NN)
50 Practitioner's Guide: Caveats
Can you trust dimension reduction results?
- Expect significant distortion/information loss in 2D/3D
- What the algorithm thinks is best may not be what we think is best; e.g., a PCA visualization of facial image data looks quite different in the (1, 2) dimensions than in the (3, 4) dimensions
51 Practitioner's Guide: Caveats (cont.)
How would you determine the best method and its parameters for your needs?
- Unlike typical data mining problems, where only one shot is allowed, you can freely try out different methods with different parameters
- A basic understanding of the methods greatly helps in applying them properly: what is a particular method trying to achieve, and how suitable is that for your needs? What are the effects of increasing/decreasing its parameters?
52 Practitioner's Guide: General Recommendations
Want something simple and fast to visualize data?
- PCA, force-directed layout
Want to first try some manifold learning method?
- Isomap; if it doesn't show anything good, probably nothing else will either
Have cluster labels to use (pre-given or computed)?
- LDA (supervised); a supervised approach is sometimes the only viable option when your data does not have clearly separable clusters
No labels, but still want some clusters to be revealed? Or simply want a state-of-the-art method for visualization?
- t-SNE (but it may be slow)
53 Practitioner's Guide: Results Still Not Good?
Pre-process the data properly as needed:
- Data centering: subtract the global mean from each vector
- Normalization: make each vector have unit Euclidean norm; otherwise, a few outliers can affect dimension reduction significantly
- Application-specific pre-processing:
- Documents: TF-IDF weighting; remove terms that are too rare and/or too short
- Images: histogram normalization
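The two generic pre-processing steps above are one-liners in NumPy; the toy matrix is made up, and the small `eps` guards against all-zero rows (an assumption, not part of the slide).

```python
import numpy as np

def center(X):
    """Subtract the global mean from each vector (data centering)."""
    return X - X.mean(axis=0)

def unit_norm(X, eps=1e-12):
    """Scale each vector to unit Euclidean norm, taming outlier magnitudes."""
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)

X = np.array([[3.0, 4.0],
              [0.0, 2.0],
              [6.0, 0.0]])
Xc = center(X)
Xn = unit_norm(X)
```

Centering matters most for variance-based methods like PCA, while unit-norm normalization keeps a few large-magnitude items from dominating distance-based methods like MDS or t-SNE.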
54 Practitioner's Guide: Too Slow?
- Apply PCA to reduce to an intermediate number of dimensions before the main dimension reduction step
- t-SNE does this by default
- The results may even improve, thanks to the noise removed by PCA
- See if there is an approximate but faster version:
- Landmark versions (using only a subset of the data items), e.g., landmark Isomap
- Linearized versions (the same criterion, but allowing only a linear mapping), e.g., Laplacian eigenmaps -> locality preserving projection
55 Practitioner's Guide: Still Need More?
Be creative, and feel free to tweak dimension reduction:
- Play with its algorithm, convergence criteria, etc.
- See if you can impose label information
[Figure: original t-SNE vs. t-SNE with a simple modification]
56 Practitioner's Guide: Still Need More? (cont.)
- Restrict the number of iterations to save computation time
The raison d'etre of DR is to serve us in exploring data and solving complicated real-world problems.
57 Useful Resources
- A nice review article, "Dimensionality Reduction: A Comparative Review," by L.J.P. van der Maaten et al.
- The Matlab Toolbox for Dimensionality Reduction
- A Matlab manifold learning demo
58 Useful Resources (cont.)
- FODAVA Testbed Software (for a recent version, contact jaegul.choo@cc.gatech.edu)
59 Thank you for your attention! Jaegul Choo
More informationDimensionality Reduction and Visualization
MTTS1 Dimensionality Reduction and Visualization! Spring 2014 Jaakko Peltonen Lecture 6: Nonlinear dimensionality reduction, part 1 Dimension reduction Dimension reduction methods What to do if... I have
More informationSupport Vector Machines for visualization and dimensionality reduction
Support Vector Machines for visualization and dimensionality reduction Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University, Toruń, Poland tmaszczyk@is.umk.pl;google:w.duch
More informationComputational Statistics and Mathematics for Cyber Security
and Mathematics for Cyber Security David J. Marchette Sept, 0 Acknowledgment: This work funded in part by the NSWC In-House Laboratory Independent Research (ILIR) program. NSWCDD-PN--00 Topics NSWCDD-PN--00
More informationSchool of Computer and Communication, Lanzhou University of Technology, Gansu, Lanzhou,730050,P.R. China
Send Orders for Reprints to reprints@benthamscienceae The Open Automation and Control Systems Journal, 2015, 7, 253-258 253 Open Access An Adaptive Neighborhood Choosing of the Local Sensitive Discriminant
More informationContent-based image and video analysis. Machine learning
Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationPart I. Graphical exploratory data analysis. Graphical summaries of data. Graphical summaries of data
Week 3 Based in part on slides from textbook, slides of Susan Holmes Part I Graphical exploratory data analysis October 10, 2012 1 / 1 2 / 1 Graphical summaries of data Graphical summaries of data Exploratory
More informationCorrespondence. CS 468 Geometry Processing Algorithms. Maks Ovsjanikov
Shape Matching & Correspondence CS 468 Geometry Processing Algorithms Maks Ovsjanikov Wednesday, October 27 th 2010 Overall Goal Given two shapes, find correspondences between them. Overall Goal Given
More informationCurvilinear Distance Analysis versus Isomap
Curvilinear Distance Analysis versus Isomap John Aldo Lee, Amaury Lendasse, Michel Verleysen Université catholique de Louvain Place du Levant, 3, B-1348 Louvain-la-Neuve, Belgium {lee,verleysen}@dice.ucl.ac.be,
More information1 Case study of SVM (Rob)
DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how
More informationVisual Representations for Machine Learning
Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering
More informationCPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017
CPSC 340: Machine Learning and Data Mining Recommender Systems Fall 2017 Assignment 4: Admin Due tonight, 1 late day for Monday, 2 late days for Wednesday. Assignment 5: Posted, due Monday of last week
More informationTechnical Report. Title: Manifold learning and Random Projections for multi-view object recognition
Technical Report Title: Manifold learning and Random Projections for multi-view object recognition Authors: Grigorios Tsagkatakis 1 and Andreas Savakis 2 1 Center for Imaging Science, Rochester Institute
More informationLecture 19: Generative Adversarial Networks
Lecture 19: Generative Adversarial Networks Roger Grosse 1 Introduction Generative modeling is a type of machine learning where the aim is to model the distribution that a given set of data (e.g. images,
More informationOn the Effect of Clustering on Quality Assessment Measures for Dimensionality Reduction
On the Effect of Clustering on Quality Assessment Measures for Dimensionality Reduction Bassam Mokbel, Andrej Gisbrecht, Barbara Hammer CITEC Center of Excellence Bielefeld University D-335 Bielefeld {bmokbel
More informationSupervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationMachine Learning : Clustering, Self-Organizing Maps
Machine Learning Clustering, Self-Organizing Maps 12/12/2013 Machine Learning : Clustering, Self-Organizing Maps Clustering The task: partition a set of objects into meaningful subsets (clusters). The
More informationImage Processing. Image Features
Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching
More informationClustering Web Documents using Hierarchical Method for Efficient Cluster Formation
Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College
More informationBehavioral Data Mining. Lecture 18 Clustering
Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationA Survey on Pre-processing and Post-processing Techniques in Data Mining
, pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,
More informationSubspace Clustering. Weiwei Feng. December 11, 2015
Subspace Clustering Weiwei Feng December 11, 2015 Abstract Data structure analysis is an important basis of machine learning and data science, which is now widely used in computational visualization problems,
More informationCOSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor
COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality
More informationFeature selection. LING 572 Fei Xia
Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection
More informationMachine Learning Techniques for Data Mining
Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already
More informationAkarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction
Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different
More informationEfficient Algorithms may not be those we think
Efficient Algorithms may not be those we think Yann LeCun, Computational and Biological Learning Lab The Courant Institute of Mathematical Sciences New York University http://yann.lecun.com http://www.cs.nyu.edu/~yann
More informationFace Recognition using Tensor Analysis. Prahlad R. Enuganti
Face Recognition using Tensor Analysis Prahlad R. Enuganti The University of Texas at Austin Final Report EE381K 14 Multidimensional Digital Signal Processing May 16, 2005 Submitted to Prof. Brian Evans
More informationNeural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders
Neural Networks for Machine Learning Lecture 15a From Principal Components Analysis to Autoencoders Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Principal Components
More informationExploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray
Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data
More informationThe Curse of Dimensionality. Panagiotis Parchas Advanced Data Management Spring 2012 CSE HKUST
The Curse of Dimensionality Panagiotis Parchas Advanced Data Management Spring 2012 CSE HKUST Multiple Dimensions As we discussed in the lectures, many times it is convenient to transform a signal(time
More informationCIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]
CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.
More informationFinding Structure in CyTOF Data
Finding Structure in CyTOF Data Or, how to visualize low dimensional embedded manifolds. Panagiotis Achlioptas panos@cs.stanford.edu General Terms Algorithms, Experimentation, Measurement Keywords Manifold
More informationRDRToolbox A package for nonlinear dimension reduction with Isomap and LLE.
RDRToolbox A package for nonlinear dimension reduction with Isomap and LLE. Christoph Bartenhagen October 30, 2017 Contents 1 Introduction 1 1.1 Loading the package......................................
More informationVisual Encoding Design
CSE 442 - Data Visualization Visual Encoding Design Jeffrey Heer University of Washington Review: Expressiveness & Effectiveness / APT Choosing Visual Encodings Assume k visual encodings and n data attributes.
More informationECG782: Multidimensional Digital Signal Processing
ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting
More information