Autoencoders. Stephen Scott. Introduction. Basic Idea. Stacked AE. Denoising AE. Sparse AE. Contractive AE. Variational AE. GAN.


Transcription:

Autoencoders
Stephen Scott, sscott@cse.unl.edu (adapted from Paul Quint and Ian Goodfellow)

Introduction
Autoencoding is training a network to replicate its input to its output. Applications:
- Unlabeled pre-training for semi-supervised learning
- Learning embeddings to support information retrieval
- Generation of new instances similar to those in the training set
- Data compression

Outline
- Basic idea
- Stacking
- Types of autoencoders: denoising, sparse, contractive, variational
- Generative adversarial networks

Basic Idea
Example network: sigmoid activation functions, 5000 training epochs, square loss, no regularization. What's special about the hidden layer outputs?

An autoencoder is a network trained to learn the identity function: output = input.
- A subnetwork called the encoder f(·) maps the input to an embedded representation.
- A subnetwork called the decoder g(·) maps the embedding back to the input space.
- The autoencoder can be thought of as lossy compression of the input: it must identify the important attributes of its inputs in order to reproduce them faithfully.

General types of autoencoders, based on the size of the hidden layer:
- Undercomplete autoencoders have a hidden layer smaller than the input layer, so the dimension of the embedded space is lower than that of the input space and the network cannot simply memorize training instances.
- Overcomplete autoencoders have much larger hidden layers and must be regularized to avoid overfitting, e.g., by enforcing a sparsity constraint.
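As a concrete illustration of the encoder/decoder split above, here is a minimal sketch of an undercomplete autoencoder in the same TensorFlow 1.x style as the other snippets in these slides. The layer sizes and names (n_inputs, n_hidden, X) are illustrative assumptions, not taken from the slides.

    import tensorflow as tf   # assumes TensorFlow 1.x, as in the other snippets

    n_inputs = 784            # e.g., flattened 28x28 MNIST digits (illustrative)
    n_hidden = 32             # undercomplete: embedding is much smaller than the input

    X = tf.placeholder(tf.float32, shape=[None, n_inputs])

    # Encoder f(.): input -> embedded representation
    hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.sigmoid, name="encoder")

    # Decoder g(.): embedding -> reconstruction in input space
    outputs = tf.layers.dense(hidden, n_inputs, name="decoder")

    # Train to learn the identity function: output = input (square loss)
    reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
    training_op = tf.train.AdamOptimizer(0.001).minimize(reconstruction_loss)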

Example: Principal Component Analysis
A 3-2-3 autoencoder with linear units and square loss performs principal component analysis: it finds the linear transformation of the data that maximizes variance.

Stacked Autoencoders
A stacked autoencoder has multiple hidden layers. Can share parameters to reduce their number by exploiting symmetry: W4 = W1^T and W3 = W2^T.

    weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
    weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
    weights3 = tf.transpose(weights2, name="weights3")  # shared weights
    weights4 = tf.transpose(weights1, name="weights4")  # shared weights

Incremental Training
Can simplify training by starting with a single hidden layer H1. Then train a second autoencoder to mimic the output of H1 and insert it into the first network. The second phase can be built by using H1's output as the training set for Phase 2.

Incremental Training (Single TF Graph)
The previous approach requires multiple TensorFlow graphs. Can instead train both phases in a single graph: first the left side, then the right.

Visualization
[Figure: input MNIST digit vs. network output; weights (features selected) for five nodes from H1.]

Semi-Supervised Learning
Can pre-train the network with unlabeled data to learn useful features, then train the logic of the dense layer with labeled data (see the sketch below).
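A rough sketch of that semi-supervised recipe, in the same TensorFlow 1.x style: reuse the pre-trained encoder and train only a new classification head on the labeled data. The placeholder X, the encoder weights weights1/weights2, the biases biases1/biases2, and the ELU activation are assumed names and choices for illustration; the slides do not specify them.

    y = tf.placeholder(tf.int32, shape=[None])

    # Pre-trained encoder from the unsupervised phase (assumed to reuse weights1/weights2)
    hidden1 = tf.nn.elu(tf.matmul(X, weights1) + biases1)
    hidden2 = tf.nn.elu(tf.matmul(hidden1, weights2) + biases2)

    # New dense classification head, trained on the (small) labeled set
    logits = tf.layers.dense(hidden2, 10, name="logits")        # e.g., 10 digit classes
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    class_loss = tf.reduce_mean(xentropy)

    # Only the head's variables are updated; the pre-trained encoder stays frozen
    head_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="logits")
    training_op = tf.train.AdamOptimizer(0.001).minimize(class_loss, var_list=head_vars)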

Transfer Learning from a Trained Classifier
Can also transfer from a classifier trained on a different task, e.g., transfer a GoogLeNet architecture to ultrasound classification. Often one chooses an existing trained model from a model zoo.

Denoising Autoencoders (Vincent et al., 2010)
Can train an autoencoder to learn to denoise its input by giving it a corrupted instance x̃ as input and targeting the uncorrupted instance x. Example noise models:
- Gaussian noise: x̃ = x + z, where z ∼ N(0, σ²I)
- Masking noise: zero out some fraction of the components of x
- Salt-and-pepper noise: choose some fraction of the components of x and set each to its min or max value (equally likely)

Denoising Example
[Figure: denoising example.]

How does it work? Even though, e.g., MNIST data live in a 784-dimensional space, they lie on a low-dimensional manifold that captures their most important features. The corruption process moves an instance x off the manifold; the encoder and decoder are trained to project x̃ back onto the manifold.

Sparse Autoencoders
An overcomplete architecture. Regularize the outputs of the hidden layer to enforce sparsity:
    J̃(x) = J(x, g(f(x))) + Ω(h),
where J is the loss function, f is the encoder, g is the decoder, h = f(x), and Ω penalizes non-sparsity of h.
- E.g., can use Ω(h) = Σᵢ hᵢ with ReLU activations to force many zero outputs in the hidden layer.
- Can also measure the average activation of each hᵢ across a mini-batch and compare it to a user-specified target sparsity value p (e.g., 0.1), via square error or the Kullback-Leibler divergence
    p log(p/q) + (1 − p) log((1 − p)/(1 − q)),
where q is the average activation of hᵢ over the mini-batch.
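To make the KL-divergence sparsity penalty concrete, here is a small sketch in the same TensorFlow 1.x style. The tensor name hidden (the sigmoid hidden-layer output), reconstruction_loss, and the penalty weight are assumptions for illustration.

    sparsity_target = 0.1    # the target value p from the slide
    sparsity_weight = 0.2    # assumed weight on the sparsity penalty

    # hidden: output of the sigmoid hidden layer, shape [batch, n_hidden]
    q = tf.reduce_mean(hidden, axis=0)   # average activation of each h_i over the mini-batch
    p = sparsity_target

    # KL divergence between Bernoulli(p) and Bernoulli(q), summed over hidden units
    # (in practice add a small epsilon inside the logs to avoid log(0))
    kl = p * tf.log(p / q) + (1 - p) * tf.log((1 - p) / (1 - q))
    sparsity_loss = tf.reduce_sum(kl)

    loss = reconstruction_loss + sparsity_weight * sparsity_loss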

Contractive Autoencoders
Similar to a sparse autoencoder, but use
    Ω(h) = Σ_{j=1..m} Σ_{i=1..n} (∂hᵢ/∂xⱼ)²,
i.e., penalize large partial derivatives of the encoder outputs with respect to the input values. This contracts the output space by mapping input points in a neighborhood near x to a smaller output neighborhood near f(x), so the encoding resists perturbations of the input x. If h has a sigmoid activation, the encoding is nearly binary, and a contractive autoencoder pushes the embeddings toward the corners of a hypercube.

Variational Autoencoders
A VAE is an autoencoder that is also a generative model, so it can generate new instances according to a probability distribution (as can, e.g., hidden Markov models and Bayesian networks). Contrast with discriminative models, which predict classifications.
- The encoder f outputs [µ, σ]ᵀ; each pair (µᵢ, σᵢ) parameterizes a Gaussian distribution for dimension i = 1, ..., n.
- Draw zᵢ ∼ N(µᵢ, σᵢ) and decode this latent variable z to get g(z).

Latent Variables
Independence of the z dimensions makes it easy to generate instances from complex distributions via the decoder g. The latent variables can be thought of as values of attributes describing the inputs; e.g., for MNIST, latent variables might represent thickness, slant, and loop closure.

Architecture
[Figure: variational autoencoder architecture.]

Optimization
Maximum likelihood (ML) approach for training generative models: find the model (θ) with maximum probability of generating the training set X. Achieve this by minimizing the sum of:
- an end-to-end loss (e.g., square or cross-entropy loss), and
- a regularizer measuring the distance (KL divergence) between the latent distribution q(z | x) and N(0, I), the standard multivariate Gaussian. N(0, I) is also considered the prior distribution over z (the distribution when no x is known).

Reparameterization Trick
Cannot backpropagate an error signal through random samples. The reparameterization trick emulates z ∼ N(µ, σ) with ε ∼ N(0, 1) and z = σε + µ.

    eps = 1e-10
    latent_loss = 0.5 * tf.reduce_sum(
        tf.square(hidden3_sigma) + tf.square(hidden3_mean)
        - 1 - tf.log(eps + tf.square(hidden3_sigma)))
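The sampling step itself is not shown in the snippet above; a minimal sketch of the reparameterization trick, assuming the same hidden3_mean and hidden3_sigma tensors, might look like this:

    # Reparameterization: z = sigma * eps + mu, with eps ~ N(0, I).
    # The randomness lives in eps, so gradients can flow through mu and sigma.
    noise = tf.random_normal(tf.shape(hidden3_sigma), dtype=tf.float32)
    hidden3 = hidden3_mean + hidden3_sigma * noise   # latent code z fed to the decoder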

Example Generated Images: Random
Draw z ∼ N(0, I) and display g(z). [Figure: generated samples.]

Example Generated Images: Manifold
Uniformly sample points in z space and decode. [Figure: generated samples over the latent manifold.]

2D Cluster Analysis
[Figure: cluster analysis of the latent space by digit.]

Generative Adversarial Networks (GANs)
GANs are also generative models, like VAEs. A GAN models a game between two players:
- The generator creates samples intended to come from the training distribution.
- The discriminator attempts to discern the real (original training) samples from the fake (generated) ones.
The discriminator trains as a binary classifier; the generator trains to fool the discriminator.

How the Game Works
- Let D(x) be the discriminator, parameterized by θ^(D). Goal: find θ^(D) minimizing J^(D)(θ^(D), θ^(G)).
- Let G(z) be the generator, parameterized by θ^(G). Goal: find θ^(G) minimizing J^(G)(θ^(D), θ^(G)).
- A Nash equilibrium of this game is a pair (θ^(D), θ^(G)) such that each θ^(i), i ∈ {D, G}, yields a local minimum of its corresponding J.

Training
Each training step:
- Draw a minibatch of x values from the dataset.
- Draw a minibatch of z values from the prior (e.g., N(0, I)).
- Simultaneously update θ^(G) to reduce J^(G) and θ^(D) to reduce J^(D), via, e.g., Adam.
For J^(D), it is common to use cross-entropy with label 1 for real and 0 for fake. Since the generator wants to trick the discriminator, one can use J^(G) = −J^(D); other generator losses exist that are generally better in practice, e.g., ones based on maximum likelihood.
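As an illustration of the training setup just described, here is a rough sketch in the same TensorFlow 1.x style. The generator() and discriminator() definitions, layer sizes, data shapes, and learning rates are all assumptions for illustration, not taken from the slides.

    import tensorflow as tf

    X = tf.placeholder(tf.float32, shape=[None, 784])   # real data (assumed MNIST-sized)
    Z = tf.placeholder(tf.float32, shape=[None, 100])   # prior samples z ~ N(0, I)

    def generator(z):
        # assumed two-layer generator G(z)
        with tf.variable_scope("generator"):
            h = tf.layers.dense(z, 128, activation=tf.nn.relu)
            return tf.layers.dense(h, 784, activation=tf.nn.sigmoid)

    def discriminator(x, reuse=False):
        # assumed two-layer discriminator D(x), returning logits
        with tf.variable_scope("discriminator", reuse=reuse):
            h = tf.layers.dense(x, 128, activation=tf.nn.relu)
            return tf.layers.dense(h, 1)

    G = generator(Z)
    D_real_logits = discriminator(X)
    D_fake_logits = discriminator(G, reuse=True)

    # Cross-entropy discriminator loss: label 1 for real, 0 for fake
    d_loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(D_real_logits),
                                                logits=D_real_logits) +
        tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(D_fake_logits),
                                                logits=D_fake_logits))

    # Minimax generator loss J(G) = -J(D); in practice a non-saturating loss
    # (labeling fake samples as 1 for the generator update) usually works better.
    g_loss = -d_loss

    d_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="discriminator")
    g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="generator")

    # Each training step runs both updates on fresh minibatches of x and z
    d_train = tf.train.AdamOptimizer(2e-4).minimize(d_loss, var_list=d_vars)
    g_train = tf.train.AdamOptimizer(2e-4).minimize(g_loss, var_list=g_vars)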

DCGAN: Radford et al. (2015)
A deep, convolutional GAN. The generator uses transposed convolutions (e.g., tf.layers.conv2d_transpose) without pooling to upsample images for input to the discriminator.

DCGAN Generated Images: Bedrooms
Trained on the LSUN dataset; sampled z space. [Figure: generated bedroom images.]

DCGAN Generated Images: Adele Facial Expressions
Trained on frame grabs of an interview; sampled z space. [Figure: generated facial expressions.]

DCGAN Generated Images: Latent Space Arithmetic
Performed semantic arithmetic in z space! (Non-center images have noise added in z space; the center image is noise-free.) [Figure: latent space arithmetic.]
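Since the slide mentions tf.layers.conv2d_transpose, here is a rough sketch of a DCGAN-style generator that upsamples a latent vector into an image with strided transposed convolutions and no pooling. The layer sizes and output resolution are assumed for illustration and are not Radford et al.'s exact architecture.

    def dcgan_generator(z):
        # z: [batch, 100] latent vector; output: [batch, 28, 28, 1] image (assumed sizes)
        h = tf.layers.dense(z, 7 * 7 * 128, activation=tf.nn.relu)
        h = tf.reshape(h, [-1, 7, 7, 128])
        # Each strided transposed convolution doubles the spatial resolution
        h = tf.layers.conv2d_transpose(h, 64, kernel_size=5, strides=2,
                                       padding="same", activation=tf.nn.relu)    # 14x14
        img = tf.layers.conv2d_transpose(h, 1, kernel_size=5, strides=2,
                                         padding="same", activation=tf.nn.tanh)  # 28x28
        return img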