CSC321: Neural Networks. Lecture 13: Learning without a teacher: Autoencoders and Principal Components Analysis. Geoffrey Hinton

Three problems with backpropagation

- Where does the supervision come from? Most data is unlabelled. The vestibulo-ocular reflex is an exception.
- How well does the learning time scale? It is impossible to learn features for different parts of an image independently if they all use the same error signal.
- Can neurons implement backpropagation? Not in the obvious way, but getting derivatives from later layers is so important that evolution may have found a way.

Three kinds of learning

- Supervised learning: this models p(y | x). Learn to predict a real-valued output or a class label from an input.
- Reinforcement learning: this just tries to have a good time. Choose actions that maximize payoff.
- Unsupervised learning: this models p(x). Build a causal generative model that explains why some data vectors occur and not others, or learn an energy function that gives low energy to data and high energy to non-data, or discover interesting features, separate sources that have been mixed together, find temporal invariants, etc.

The Goals of Unsupervised Learning

- The general goal of unsupervised learning is to discover useful structure in large data sets without requiring a target (desired output) or a reinforcement signal.
- It is not obvious how to turn this general goal into a specific objective function that can be used to drive the learning.
- A more specific goal: create representations that are better for subsequent supervised or reinforcement learning.

Why unsupervised pre-training makes sense

[Figure: two generative diagrams relating "stuff", image, and label through high-bandwidth and low-bandwidth pathways.]

- If image-label pairs were generated the first way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?
- If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.

Another Goal of Unsupervised Learning

- Improve learning speed for high-dimensional inputs: allow features within a layer to learn independently, and allow multiple layers to be learned greedily.
- Improve the speed of supervised learning by removing directions with high curvature of the error surface. This can be done by using Principal Components Analysis to pre-process the input vectors (a minimal sketch follows below).
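As a rough illustration of this kind of pre-processing (not code from the lecture), here is a minimal NumPy sketch of PCA projection followed by whitening, i.e. rescaling each retained direction to unit variance so that no direction dominates the curvature of a squared-error surface. The toy data, the choice of M, and the eps constant are all invented for the example.

```python
import numpy as np

def pca_whiten(X, M, eps=1e-8):
    """Project data onto the top M principal directions and rescale
    each retained direction to unit variance (PCA whitening)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                # centre the data
    cov = Xc.T @ Xc / (len(X) - 1)               # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:M]        # keep the top M directions
    W = eigvecs[:, order]                        # N x M projection matrix
    Z = Xc @ W                                   # projections (the codes)
    Z_white = Z / np.sqrt(eigvals[order] + eps)  # equalise the variances
    return Z_white, W, mean

# Example: 1000 points in 50 dimensions, keep 10 whitened components.
X = np.random.randn(1000, 50) @ np.random.randn(50, 50)
Z, W, mu = pca_whiten(X, M=10)
print(Z.shape, Z.std(axis=0))   # each kept direction now has variance ~1
```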

Another Goal of Unsupervised Learning

- Build a density model of the data vectors. This assigns a score or probability to each possible data vector.
- There are many ways to use a good density model (a sketch follows below):
  - Classify by seeing which class-specific model likes the test case most.
  - Monitor a complex system by noticing improbable states.
  - Extract interpretable factors (causes or constraints).
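As one concrete, deliberately simple way to use a density model, the sketch below fits a separate Gaussian to each class and classifies a test vector by which model assigns it the higher log-probability; a crude low-probability threshold stands in for the "improbable state" monitor. The data, threshold, and regularization constant are made up for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(X):
    """Fit a single Gaussian density model to the rows of X."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-3 * np.eye(X.shape[1])  # regularise
    return multivariate_normal(mean, cov)

X0 = np.random.randn(500, 5)            # stand-in data for class 0
X1 = np.random.randn(500, 5) + 2.0      # stand-in data for class 1
models = [fit_gaussian(X0), fit_gaussian(X1)]

x_test = np.random.randn(5) + 2.0
scores = [m.logpdf(x_test) for m in models]
print("predicted class:", int(np.argmax(scores)))   # which model likes it most
print("anomalous?", max(scores) < -20.0)             # crude improbability check
```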

Using backprop for unsupervised learning

- Try to make the output be the same as the input in a network with a central bottleneck (a code sketch follows below).
- The activities of the hidden units in the bottleneck form an efficient code: the bottleneck does not have room for redundant features.
- Good for extracting independent features (as in the family trees example).

[Figure: an autoencoder, with the input vector feeding a narrow code layer that feeds the output vector.]
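Here is a minimal NumPy sketch of such an autoencoder, assuming a single sigmoid bottleneck layer, a linear output layer, squared reconstruction error, and plain full-batch gradient descent; the toy data, layer sizes, learning rate, and number of epochs are arbitrary choices, not anything from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 examples of 20-dimensional inputs that really live on a
# 3-dimensional nonlinear manifold, so a narrow bottleneck can code them.
Z_true = rng.normal(size=(500, 3))
X = np.tanh(Z_true @ rng.normal(size=(3, 20)))

n_in, n_code = X.shape[1], 3
W1 = rng.normal(scale=0.1, size=(n_in, n_code)); b1 = np.zeros(n_code)
W2 = rng.normal(scale=0.1, size=(n_code, n_in)); b2 = np.zeros(n_in)
lr = 0.05

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for epoch in range(2000):
    H = sigmoid(X @ W1 + b1)              # bottleneck code
    Y = H @ W2 + b2                       # linear reconstruction
    err = Y - X                           # dLoss/dY for 0.5 * squared error
    # Backpropagate the reconstruction error through both layers.
    dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
    dH = err @ W2.T
    dA = dH * H * (1 - H)                 # through the sigmoid
    dW1 = X.T @ dA / len(X); db1 = dA.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("mean squared reconstruction error:", np.mean((Y - X) ** 2))
```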

Self-supervised backprop in a linear network

- If the hidden and output layers are linear, the network will learn hidden units that are a linear function of the data and that minimize the squared reconstruction error.
- This is exactly what Principal Components Analysis does.
- The M hidden units will span the same space as the first M principal components found by PCA (see the numerical check below), but their weight vectors may not be orthogonal, and they will tend to have equal variances.
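A small NumPy check of this claim, assuming a linear autoencoder with untied encoder and decoder weights trained by plain gradient descent; the toy data, learning rate, and iteration count are arbitrary and not tuned. After enough gradient steps, the cosines of the principal angles between the decoder's subspace and the top-M PCA subspace should approach 1, even though the learned weight vectors need not be orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10)) @ rng.normal(size=(10, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # centre and rescale the data
M = 3

# Top-M principal directions from the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / len(X))
pca_basis = eigvecs[:, np.argsort(eigvals)[::-1][:M]]

# Linear autoencoder: code = X @ W_enc, reconstruction = code @ W_dec,
# trained by gradient descent on the squared reconstruction error.
W_enc = rng.normal(scale=0.1, size=(10, M))
W_dec = rng.normal(scale=0.1, size=(M, 10))
lr = 0.01
for _ in range(5000):
    C = X @ W_enc
    err = C @ W_dec - X
    W_dec -= lr * C.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

# Compare subspaces via principal angles: the cosines are the singular
# values of Q_pca^T Q_ae, where the Qs are orthonormal bases.
Q_pca, _ = np.linalg.qr(pca_basis)
Q_ae, _ = np.linalg.qr(W_dec.T)
cosines = np.linalg.svd(Q_pca.T @ Q_ae, compute_uv=False)
print("cosines of principal angles:", cosines)   # should all be close to 1
```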

Principal Components Analysis

- PCA takes N-dimensional data and finds the M orthogonal directions in which the data have the most variance. These M principal directions form a subspace.
- We can represent an N-dimensional datapoint by its projections onto the M principal directions. This loses all information about where the datapoint is located in the remaining orthogonal directions.
- We reconstruct by using the mean value (over all the data) on the N-M directions that are not represented.
- The reconstruction error is the sum, over all these unrepresented directions, of the squared differences from the mean (a worked sketch follows below).
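A short NumPy sketch of exactly this reconstruction, assuming toy 6-dimensional data and M = 2; it checks that the directly computed reconstruction error equals the sum of squared differences from the mean along the unrepresented directions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))   # toy data, N = 6
mean = X.mean(axis=0)
Xc = X - mean

# Orthonormal principal directions, sorted by decreasing variance.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
U = eigvecs[:, np.argsort(eigvals)[::-1]]   # columns are principal directions
M = 2

x = X[0]                                    # one datapoint
codes = (x - mean) @ U[:, :M]               # its M projections
x_hat = mean + U[:, :M] @ codes             # mean value on the other N-M directions

# The reconstruction error equals the sum of squared differences from the
# mean along the N-M directions that were not represented.
err_direct = np.sum((x - x_hat) ** 2)
err_unrepresented = np.sum(((x - mean) @ U[:, M:]) ** 2)
print(err_direct, err_unrepresented)        # the two numbers agree
```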

A picture of PCA with N=2 and M=1

[Figure: a 2-D data cloud and its first principal component, the direction of greatest variance. A red datapoint is represented by the green point, its projection onto that direction; our reconstruction of the red point has an error equal to the squared distance between the red and green points.]

Self-supervised backprop and clustering

- If we force the hidden unit whose weight vector is closest to the input vector to have an activity of 1 and the rest to have activities of 0, we get clustering.
- The weight vector of each hidden unit represents the center of a cluster.
- Input vectors are reconstructed as the nearest cluster center.

[Figure: a 2-D datapoint (x, y) and its reconstruction as the nearest cluster center.]

Clustering and backpropagation

- We need to tie the input->hidden weights to be the same as the hidden->output weights.
- Usually we cannot backpropagate through binary hidden units, but in this case the derivatives for the input->hidden weights all become zero: if the winner doesn't change, there is no derivative, and the winner only changes when two hidden units give exactly the same error, which also gives no derivative.
- So the only error derivative is for the output weights. This derivative pulls the weight vector of the winning cluster towards the data point (a sketch of this update follows below).
- When the weight vector is at the center of gravity of a cluster, the derivatives all balance out, because the center of gravity minimizes the squared error.
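A minimal NumPy sketch of the resulting learning rule, assuming an online winner-take-all update in which only the winning weight vector moves a small step towards each datapoint (which is just online k-means); the toy clusters, learning rate, and epoch count are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data: three well-separated 2-D clusters.
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.concatenate([c + 0.3 * rng.normal(size=(100, 2)) for c in centers])
rng.shuffle(X)

K, lr = 3, 0.1
W = X[rng.choice(len(X), K, replace=False)].copy()   # cluster-centre weight vectors

for epoch in range(20):
    for x in X:
        winner = np.argmin(np.sum((W - x) ** 2, axis=1))  # closest weight vector
        # Only the winner's output weights get an error derivative:
        # it is pulled a small step towards the datapoint it reconstructed.
        W[winner] += lr * (x - W[winner])

print(np.round(W, 2))   # the weight vectors end up near the cluster centres
```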

A spectrum of representations

- PCA is powerful because it uses distributed representations, but limited because its representations are linearly related to the data. Autoencoders with more hidden layers are not limited this way.
- Clustering is powerful because it uses very non-linear representations, but limited because its representations are local (not componential).
- What we need: representations that are both distributed and non-linear. Unfortunately, these are typically very hard to learn.

[Figure: a chart with axes local vs. distributed and linear vs. non-linear; PCA sits at linear/distributed, clustering at non-linear/local, and "what we need" at non-linear/distributed.]