CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

Richard Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto

Today
- Dimensionality Reduction
- PCA
- Autoencoders

Mixture models and Distributed Representations
- One problem with mixture models: each observation is assumed to come from one of K prototypes
- The constraint that only one is active (responsibilities sum to one) limits the representational power
- Alternative: a distributed representation, with several latent variables relevant to each observation
- These can be several binary/discrete variables, or continuous

Example: Continuous Underlying Variables
- What are the intrinsic latent dimensions in these two datasets?
- How can we find these dimensions from the data?

Principal Components Analysis
- PCA: the most popular instance of the second main class of unsupervised learning methods, projection methods, a.k.a. dimensionality-reduction methods
- Aim: find a small number of directions in input space that explain variation in the input data; re-represent the data by projecting along those directions
- Important assumption: variation contains information
- Data is assumed to be continuous: linear relationship between the data and the learned representation

PCA: Common Tool
- Handles high-dimensional data
  - If the data has thousands of dimensions, it can be difficult for a classifier to deal with
  - It can often be described by a much lower-dimensional representation
- Useful for:
  - Visualization
  - Preprocessing
  - Modeling prior for new data
  - Compression

PCA: Intuition
- As in the previous lecture, the training data consists of N vectors $\{x^{(n)}\}_{n=1}^N$ of dimensionality D, so $x^{(n)} \in \mathbb{R}^D$
- Aim to reduce dimensionality: linearly project to a much lower dimensional space, $M \ll D$:
  $$x \approx Uz + a$$
  where $U$ is a $D \times M$ matrix and $z$ an $M$-dimensional vector
- Search for orthogonal directions in space with the highest variance, then project the data onto this subspace
- The structure of the data vectors is encoded in the sample covariance

Finding Principal Components
- To find the principal component directions, we center the data (subtract the sample mean from each variable)
- Calculate the empirical covariance matrix
  $$C = \frac{1}{N} \sum_{n=1}^{N} (x^{(n)} - \bar{x})(x^{(n)} - \bar{x})^T$$
  with $\bar{x}$ the mean
- What's the dimensionality of C?
- Find the M eigenvectors with the largest eigenvalues of C: these are the principal components
- Assemble these eigenvectors into a $D \times M$ matrix $U$
- We can now express D-dimensional vectors $x$ by projecting them to the M-dimensional $z$:
  $$z = U^T x$$

Standard PCA
Algorithm: to find M components underlying D-dimensional data
1. Select the top M eigenvectors of C (the data covariance matrix):
   $$C = \frac{1}{N} \sum_{n=1}^{N} (x^{(n)} - \bar{x})(x^{(n)} - \bar{x})^T = U \Sigma U^T \approx U_{1:M} \Sigma_{1:M} U_{1:M}^T$$
   where $U$ is orthogonal with unit-length eigenvectors as columns ($U^T U = U U^T = I$), and $\Sigma$ is a diagonal matrix of eigenvalues, each giving the variance in the direction of its eigenvector
2. Project each input vector $x$ into this subspace, e.g.
   $$z_j = u_j^T x, \qquad z = U_{1:M}^T x$$
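As a hedged illustration of this recipe (not taken from the lecture), here is a minimal NumPy sketch that centers the data, eigendecomposes the sample covariance, keeps the top M eigenvectors, and projects; the names `pca_fit`, `pca_project`, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_fit(X, M):
    """Minimal PCA sketch: X is an (N, D) data matrix, M the number of components."""
    x_bar = X.mean(axis=0)                    # sample mean
    Xc = X - x_bar                            # center the data
    C = (Xc.T @ Xc) / X.shape[0]              # empirical covariance, shape (D, D)
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending eigenvalues for symmetric C
    order = np.argsort(eigvals)[::-1][:M]     # indices of the M largest eigenvalues
    return x_bar, eigvecs[:, order], eigvals[order]

def pca_project(X, x_bar, U):
    """Project inputs onto the principal subspace: z = U^T (x - x_bar)."""
    return (X - x_bar) @ U

# Usage on synthetic data
X = np.random.randn(500, 10) @ np.random.randn(10, 10)
x_bar, U, lam = pca_fit(X, M=2)
Z = pca_project(X, x_bar, U)                  # codes, shape (500, 2)
```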

Two Derivations of PCA
Two views/derivations:
- Maximize variance (the scatter of the green points)
- Minimize error (the red-green distance per datapoint)

PCA: Minimizing Reconstruction Error
- We can think of PCA as projecting the data onto a lower-dimensional subspace
- One derivation is that we want to find the projection such that the best linear reconstruction of the data is as close as possible to the original data:
  $$J(u, z, b) = \sum_n \| x^{(n)} - \tilde{x}^{(n)} \|^2$$
  where
  $$\tilde{x}^{(n)} = \sum_{j=1}^{M} z_j^{(n)} u_j + \sum_{j=M+1}^{D} b_j u_j$$
- The objective is minimized when the first M components are the eigenvectors with the maximal eigenvalues:
  $$z_j^{(n)} = u_j^T x^{(n)}, \qquad b_j = \bar{x}^T u_j$$
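Continuing the hedged NumPy sketch above (with the assumed `pca_fit`/`pca_project` helpers), the reconstruction view can be checked numerically by rebuilding each point from its M-dimensional code and measuring the squared error.

```python
# Reconstruct from the M-dimensional codes: x_tilde = U z + x_bar
X_tilde = Z @ U.T + x_bar

# Total squared reconstruction error J; it should match N times
# the sum of the eigenvalues of the discarded directions
J = np.sum((X - X_tilde) ** 2)
print(J)
```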

Applying PCA to faces
- Run PCA on 2429 19x19 grayscale images (CBCL data)
- Compresses the data: can get good reconstructions with only 3 components
- PCA for pre-processing: can apply a classifier to the latent representation
  - PCA with 3 components obtains 79% accuracy on face/non-face discrimination on test data, vs. 76.8% for a GMM with 84 states
- Can also be good for visualization
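A hedged sketch of this pre-processing pattern (not the lecture's actual CBCL experiment): fit PCA with 3 components, project, and train a classifier on the codes. The random placeholder data and the choice of logistic regression are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X = np.random.rand(2429, 19 * 19)        # stand-in for flattened 19x19 grayscale images
y = np.random.randint(0, 2, size=2429)   # stand-in face/non-face labels

pca = PCA(n_components=3).fit(X)         # 3 components, as on the slide
Z = pca.transform(X)                     # latent representation

clf = LogisticRegression().fit(Z, y)     # classifier applied to the codes
print(clf.score(Z, y))
```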

Applying PCA to faces: Learned basis

Applying PCA to digits

Relation to Neural Networks
- PCA is closely related to a particular form of neural network
- An autoencoder is a neural network whose outputs are its own inputs
- The goal is to minimize reconstruction error

Autoencoders
- Define
  $$z = f(Wx); \qquad \hat{x} = g(Vz)$$
- Goal:
  $$\min_{W,V} \frac{1}{2N} \sum_{n=1}^{N} \| x^{(n)} - \hat{x}^{(n)} \|^2$$
- If $g$ and $f$ are linear:
  $$\min_{W,V} \frac{1}{2N} \sum_{n=1}^{N} \| x^{(n)} - V W x^{(n)} \|^2$$
- In other words, the optimal solution is PCA.
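A hedged sketch of the linear case (not from the lecture): a linear autoencoder trained with plain gradient descent on the squared reconstruction error; the learning rate, iteration count, initialization scale, and synthetic data are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 500, 10, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
X = X - X.mean(axis=0)                  # center, as in PCA

W = 0.01 * rng.normal(size=(M, D))      # linear encoder: z = W x
V = 0.01 * rng.normal(size=(D, M))      # linear decoder: x_hat = V z
lr = 1e-3

for _ in range(5000):
    Z = X @ W.T                         # codes, shape (N, M)
    X_hat = Z @ V.T                     # reconstructions, shape (N, D)
    R = X_hat - X                       # residuals
    grad_V = R.T @ Z / N                # gradient of (1/2N) sum ||x - V W x||^2 w.r.t. V
    grad_W = V.T @ R.T @ X / N          # ... and w.r.t. W
    V -= lr * grad_V
    W -= lr * grad_W

# V @ W should now approximate the projection onto the top-M principal subspace
```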

Autoencoders: Nonlinear PCA
- What if $g(\cdot)$ is not linear?
- Then we are basically doing nonlinear PCA
- There are some subtleties, but in general this is an accurate description
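As a hedged illustration (not part of the lecture), a nonlinear autoencoder can be written in a few lines of PyTorch; the layer sizes, tanh activations, optimizer settings, and stand-in data are assumptions made for the example.

```python
import torch
import torch.nn as nn

D, M = 784, 30                                   # e.g. flattened 28x28 inputs, 30-dimensional code
encoder = nn.Sequential(nn.Linear(D, 128), nn.Tanh(), nn.Linear(128, M))
decoder = nn.Sequential(nn.Linear(M, 128), nn.Tanh(), nn.Linear(128, D))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(256, D)                          # stand-in minibatch of inputs
for _ in range(200):
    x_hat = decoder(encoder(X))                  # z = f(Wx), x_hat = g(Vz), now with nonlinear f and g
    loss = loss_fn(x_hat, X)                     # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```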

Comparing Reconstructions