CSE 6242 A / CS 4803 DVA. Feb 12, 2013. Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA, Feb 12, 2013. Dimension Reduction. Guest Lecturer: Jaegul Choo

Data Is Too Big to Do Something With
- Limited memory size: the data may not fit in your machine's memory.
- Slow computation: e.g., computing the Euclidean distance between 10^6-dimensional vectors vs. 10-dimensional vectors.
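A minimal sketch of why dimensionality drives the cost of a Euclidean distance computation (NumPy-based; the dimensionalities and timing loop are illustrative, not from the slides):

```python
import time
import numpy as np

def euclidean(x, y):
    # O(d) work: one pass over the d coordinates
    return np.sqrt(np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
for d in (10, 10**6):                       # 10-dim vs. 10^6-dim vectors
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    start = time.perf_counter()
    for _ in range(100):
        euclidean(x, y)
    elapsed = time.perf_counter() - start
    print(f"d={d:>7}: {elapsed:.4f} s for 100 distance computations")
```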

Two Axes of a Data Set
- No. of data items: how many data items are there?
- No. of dimensions: how many dimensions represent each item?
Depending on the layout, either the columns or the rows of the data matrix are the data items, with the other axis holding the dimension index. In this lecture we will use the columns-as-data-items convention.

Dimension Reduction: Let's Reduce Data (along the Dimension Axis)
- Input: high-dimensional data, the number of reduced dimensions (user-specified), and optionally additional info about the data.
- Output: low-dimensional data, plus other outputs such as a dim-reducing transformer for new data.
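The slide describes this contract abstractly; as a concrete illustration, a scikit-learn-style estimator (my choice of library, not named in the slides) follows exactly this input/output pattern. Note that scikit-learn places data items in rows, the opposite of the lecture's columns-as-items convention:

```python
import numpy as np
from sklearn.decomposition import PCA   # any DR estimator follows the same pattern

X_high = np.random.rand(1000, 500)       # 1000 items, 500 original dimensions
k = 10                                    # user-specified number of reduced dimensions

dr = PCA(n_components=k)                  # the "dim-reducing transformer"
X_low = dr.fit_transform(X_high)          # low-dim data: shape (1000, 10)

X_new = np.random.rand(5, 500)            # new data mapped into the same space
X_new_low = dr.transform(X_new)           # shape (5, 10)
```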

What You Get from DR
- Obviously: less storage and faster computation.
- More importantly:
  - Noise removal (improving the quality of the data), which leads to better performance on downstream tasks.
  - 2D/3D representations, which enable visual data exploration.

Applications
- Traditionally: microarray data analysis, information retrieval, face recognition, protein disorder prediction, network intrusion detection, document categorization, speech recognition.
- More interestingly: interactive visualization of high-dimensional data.

Face Recognition: Vector Representation of Images
- Images are serialized/rasterized into vectors of pixel values.
- Dimensions can be huge: a 640x480 image yields 307,200 dimensions.
- The images are collected into a data matrix (color = person, column = image).
- Dimension reduction (PCA, LDA, etc.) is applied, and classification on the dim-reduced data gives better accuracy than classification on the original-dimensional data.
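A minimal sketch of the rasterization step, assuming the images are available as NumPy arrays (the image sizes and random pixel values are placeholders):

```python
import numpy as np

def rasterize(images):
    """Turn a list of HxW grayscale images into an (H*W) x n data matrix,
    one column per image, following the lecture's columns-as-items convention."""
    return np.column_stack([img.reshape(-1) for img in images])

images = [np.random.randint(0, 256, size=(480, 640)) for _ in range(20)]  # stand-in faces
A = rasterize(images)
print(A.shape)   # (307200, 20): 307,200 dimensions, 20 images
```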

Document Retrieval: Latent Semantic Indexing
- Term-document matrix via the bag-of-words model, e.g., for D1 = "I like data" and D2 = "I hate hate data":

          D1   D2
  I        1    1
  like     1    0
  hate     0    2
  data     1    1

- Dimensions can be hundreds of thousands (i.e., the number of distinct words).
- Dimension reduction gives a low-dimensional representation of each document, e.g.:

          D1      D2
  Dim1    1.75   -0.27
  Dim2   -0.21    0.58
  Dim3    1.32    0.25

- Search/retrieval on the dim-reduced data captures semantics better.
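A minimal LSI sketch using scikit-learn (my choice of tools; the slide's numbers come from its own example and will not be reproduced by this code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["I like data", "I hate hate data"]

# Bag-of-words counts; keep 1-letter tokens such as "I".
# Note: this is a document-term matrix (documents as rows), i.e., the
# transpose of the slide's term-document layout.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)          # shape: (2 documents, 4 terms)
print(vectorizer.get_feature_names_out())   # ['data' 'hate' 'i' 'like']

# LSI = truncated SVD of the (term-)document matrix.
lsi = TruncatedSVD(n_components=2)
X_low = lsi.fit_transform(X)                # each document as a 2-dim vector
print(X_low)
```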

Visualizing the Map of Science
- http://www.mapofscience.com

Two Main Techniques
1. Feature selection: selects a subset of the original variables as the reduced dimensions. For example, the number of genes responsible for a particular disease may be small.
2. Feature extraction: each reduced dimension involves multiple original dimensions. An active area of recent research.
Note that feature = variable = dimension.

Feature Selection
- What is the optimal subset of m features that maximizes a given criterion?
- Widely used criteria: information gain, correlation, ...
- These are typically combinatorial optimization problems, so greedy methods are popular (see the sketch below):
  - Forward selection: start from the empty set and add one variable at a time.
  - Backward elimination: start from the entire set and remove one variable at a time.
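A minimal sketch of greedy forward selection, assuming a user-supplied score(features) criterion such as cross-validated accuracy (the helper names and toy criterion are illustrative, not from the slides):

```python
def forward_selection(all_features, score, m):
    """Greedily grow a feature subset of size m, adding whichever
    remaining feature improves the criterion the most at each step."""
    selected = []
    while len(selected) < m:
        best_feature, best_score = None, float("-inf")
        for f in all_features:
            if f in selected:
                continue
            s = score(selected + [f])        # evaluate the candidate subset
            if s > best_score:
                best_feature, best_score = f, s
        selected.append(best_feature)
    return selected

# Example with a toy criterion that prefers low-index features.
features = ["gene_%d" % i for i in range(10)]
toy_score = lambda subset: -sum(features.index(f) for f in subset)
print(forward_selection(features, toy_score, m=3))   # ['gene_0', 'gene_1', 'gene_2']
```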

From now on, we will only discuss feature extraction.

Aspects of DR
- Linear vs. Nonlinear
- Unsupervised vs. Supervised
- Global vs. Local
- Feature vectors vs. Similarity (as the input)

Aspects of DR: Linear vs. Nonlinear
- Linear: represents each reduced dimension as a linear combination of the original dimensions, e.g.,
  Y1 = 3*X1 - 4*X2 + 0.3*X3 - 1.5*X4
  Y2 = 2*X1 + 3.2*X2 - X3 + 2*X4
  Naturally capable of mapping new data into the same space. For example:

          D1   D2                              D1      D2
  X1       1    1     Dimension        Y1     1.75   -0.27
  X2       1    0     Reduction  ->    Y2    -0.21    0.58
  X3       0    2
  X4       1    1

- Nonlinear: more complicated, but generally more powerful. A recently popular topic.
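A minimal sketch of the linear mapping above as a matrix product (the coefficients are the slide's example; the reconstructed minus signs are my reading of the garbled formula):

```python
import numpy as np

# Rows of W are the reduced dimensions Y1, Y2; columns correspond to X1..X4.
W = np.array([[3.0, -4.0,  0.3, -1.5],
              [2.0,  3.2, -1.0,  2.0]])

x = np.array([1.0, 1.0, 0.0, 1.0])   # a single data item (X1..X4)

y = W @ x                            # its 2-dim representation [Y1, Y2]
print(y)

# Mapping a *new* item into the same space is just another product with W.
```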

Aspects of DR: Unsupervised vs. Supervised
- Unsupervised: uses only the input data.
- Supervised: uses the input data plus additional info about the data, e.g., grouping labels.

Aspects of DR: Global vs. Local
- Dimension reduction typically tries to preserve all the relationships/distances in the data, but some information loss is unavoidable (e.g., in PCA).
- So which relationships do you care about preserving?
  - Global: treats all pairwise distances as equally important; in practice tends to care more about the larger distances.
  - Local: focuses on small distances and neighborhood relationships. An active research area, a.k.a. manifold learning.

Aspects of DR: Feature Vectors vs. Similarity (as the input)
- Typical setup: feature vectors as the input (high-dim data -> dimension reduction -> low-dim data).
- Some methods take a similarity matrix instead, whose (i, j)-th entry indicates how similar the i-th and j-th data items are.
- Some methods internally convert the feature vectors into a similarity matrix before performing dimension reduction; this is a.k.a. graph embedding (see the sketch below).
- Why "graph embedding"? A similarity matrix can be viewed as a graph in which the similarities are the edge weights.
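A minimal sketch of the feature-vectors-to-similarity-matrix conversion, using a Gaussian (RBF) kernel restricted to k nearest neighbors (the particular kernel, gamma, and k are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(100, 50)                  # 100 items, 50 dimensions

# Dense similarities: S[i, j] = exp(-gamma * ||x_i - x_j||^2).
S = rbf_kernel(X, gamma=0.1)

# Keep only local similarities via a k-nearest-neighbor graph.
knn_mask = kneighbors_graph(X, n_neighbors=10, mode="connectivity").toarray()
knn_mask = np.maximum(knn_mask, knn_mask.T)  # symmetrize the graph
S_graph = S * knn_mask                       # weighted adjacency matrix of the graph

# S_graph can now be fed to a similarity-input DR method (graph embedding).
```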

Methods
- Traditional:
  - Principal component analysis (PCA)
  - Multidimensional scaling (MDS)
  - Linear discriminant analysis (LDA)
  - Nonnegative matrix factorization (NMF)
- Advanced (nonlinear, kernel, manifold learning):
  - Isometric feature mapping (Isomap)
  - Locally linear embedding (LLE)
  - Laplacian Eigenmaps (LE)
  - Kernel PCA
  - t-distributed stochastic neighbor embedding (t-SNE)
* MATLAB code is available at http://homepage.tudelft.nl/19j49/matlab_toolbox_for_dimensionality_reduction.html

Principal Component Analysis (PCA)
- Finds the axes showing the greatest variation (PC1, PC2, ...) and projects all points onto these axes.
- The reduced dimensions are orthogonal.
- Algorithm: eigen-decomposition.
- Pros: fast. Cons: basic; limited performance.
- Properties: linear, unsupervised, global, feature-vector input.
(Figure: http://en.wikipedia.org/wiki/principal_component_analysis)
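A minimal sketch of PCA via eigen-decomposition of the covariance matrix, matching the algorithm named on the slide (the data is a random placeholder):

```python
import numpy as np

def pca(X, k):
    """Project rows of X (items x dimensions) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                          # center each dimension
    cov = np.cov(Xc, rowvar=False)                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigen-decomposition (ascending order)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k directions of greatest variation
    return Xc @ top                                  # k-dimensional representation

X = np.random.rand(200, 50)
X_2d = pca(X, k=2)                                   # e.g., for a 2-D scatterplot
print(X_2d.shape)                                    # (200, 2)
```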

Principal Component Analysis: Document Visualization

Principal Component Analysis: Testbed Demo (Text Data)

Multidimensional Scaling (MDS)
- Intuition: tries to make the actual pairwise distances in the low-dimensional space match the given ideal pairwise distances.
- Metric MDS: preserves the given ideal distance values.
- Nonmetric MDS: for when you only know/care about the ordering of distances; preserves only the orderings of the distance values.
- Algorithm: gradient-descent type. (cf. classical MDS is equivalent to PCA.)
- Properties: nonlinear, unsupervised, global, similarity input.
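A minimal sketch of metric vs. nonmetric MDS on a precomputed dissimilarity matrix, using scikit-learn (my choice of library; the slides do not prescribe an implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

X = np.random.rand(50, 20)
D = squareform(pdist(X))                 # ideal pairwise distances (similarity-style input)

# Metric MDS: preserve the distance values themselves.
metric_mds = MDS(n_components=2, dissimilarity="precomputed", metric=True)
Y_metric = metric_mds.fit_transform(D)

# Nonmetric MDS: preserve only the ordering of the distances (slower).
nonmetric_mds = MDS(n_components=2, dissimilarity="precomputed", metric=False)
Y_nonmetric = nonmetric_mds.fit_transform(D)
```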

Multidimensional Scaling: Sammon's Mapping
- A local version of MDS: down-weights errors in the large distances.
- Algorithm: gradient-descent type.
- Properties: nonlinear, unsupervised, local, similarity input.
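As a sketch of the down-weighting idea, the Sammon stress divides each squared distance error by the original distance, so large distances contribute relatively less (this is the standard formulation; the slide itself does not show the formula):

```python
import numpy as np

def sammon_stress(D_high, D_low):
    """Sammon stress between original distances D_high and embedded distances D_low.
    Each term (d* - d)^2 / d* is down-weighted when the original distance d* is large."""
    iu = np.triu_indices_from(D_high, k=1)        # each pair counted once
    d_star, d = D_high[iu], D_low[iu]
    return np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)
```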

Multidimensional Scaling: Force-Directed Graph Layout
- Rooted in graph visualization, but essentially a variant of metric MDS: spring-like attractive and repulsive forces between nodes.
- Algorithm: gradient-descent type.
- Widely used in visualization: aesthetically pleasing results, simple and intuitive, supports interactivity.
- Properties: nonlinear, unsupervised, global, similarity input.
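A minimal sketch of a force-directed (spring) layout using NetworkX (my choice of library; the demos on the next slide use Prefuse and D3 instead):

```python
import networkx as nx

# A small weighted graph; edge weights play the role of similarities.
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1.0), ("b", "c", 0.5),
                           ("c", "d", 2.0), ("d", "a", 0.8)])

# spring_layout runs a force-directed (Fruchterman-Reingold) simulation
# and returns 2-D coordinates for each node.
pos = nx.spring_layout(G, weight="weight", seed=42)
print(pos)
```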

Multidimensional Scaling: Force-Directed Graph Layout Demos
- Prefuse: http://prefuse.org/gallery/graphview/
- D3: http://d3js.org/ and http://bl.ocks.org/4062045

Multidimensional Scaling: In All Variants
- Pros: widely used (works well in general).
- Cons: slow; nonmetric MDS is much slower than metric MDS.
- Properties: nonlinear, unsupervised, global, similarity input.

Linear Discriminant Analysis (LDA)
- Maximally separates clusters by (a) putting different clusters as far apart as possible and (b) making each cluster as compact as possible.
- Algorithm: generalized eigen-decomposition.
- Pros: better shows the cluster structure. Cons: may distort the original relationships in the data.
- Properties: linear, supervised, global, feature-vector input.

Linear Discriminant Analysis vs. Principal Component Analysis
- 2D visualizations of a mixture of 7 Gaussians in 1000 dimensions: linear discriminant analysis (supervised) vs. principal component analysis (unsupervised).

Linear Discriminant Analysis: Testbed Demo (Text Data)
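A minimal sketch of supervised dimension reduction with LDA in scikit-learn (placeholder data and labels; the testbed above is a separate system, not this code):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 100))          # 300 items, 100 dimensions
y = rng.integers(0, 3, size=300)             # grouping labels (the "additional info")

# LDA can produce at most (number of classes - 1) reduced dimensions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)               # supervised: uses both X and y
print(X_2d.shape)                            # (300, 2)
```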

Nonnegative Matrix Factorization (NMF)
- Dimension reduction via matrix factorization: A ~= W*H, obtained by solving
  min ||A - W*H||_F  subject to  W >= 0, H >= 0
- Why nonnegativity constraints? A trade-off: a (slightly) worse approximation, but better interpretation; the factors are often physically/semantically meaningful.
- Algorithm: alternating nonnegativity-constrained least squares.

Nonnegative Matrix Factorization as Clustering
- NMF often performs better and faster than k-means.
- W gives the cluster centroids; H gives the soft-clustering memberships.
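A minimal sketch of NMF-based dimension reduction and soft clustering with scikit-learn (my choice of library, which uses a different optimizer than the alternating least squares named above; the data is a placeholder). Note that scikit-learn places data items in rows, so the roles of W and H are swapped relative to the slide's columns-as-items convention:

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.abs(np.random.rand(100, 500))         # nonnegative data: 100 items x 500 dims

nmf = NMF(n_components=5, init="nndsvd", max_iter=500)
W = nmf.fit_transform(A)                     # 100 x 5: per-item soft-clustering memberships
H = nmf.components_                          # 5 x 500: cluster centroids / basis vectors

hard_labels = W.argmax(axis=1)               # hard cluster assignment per item
print(W.shape, H.shape, hard_labels[:10])
```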

In the Next Lecture
- More interesting topics coming up, including:
  - Advanced methods: Isomap, LLE, kernel PCA, t-SNE, ...
  - Real-world applications in interactive visualization
  - A practitioner's guide: what to try first in which situations?