Generative Models in Deep Learning
Sargur N. Srihari
srihari@cedar.buffalo.edu

Topics
1. Need for Probabilities in Machine Learning
2. Representations
   1. Generative and Discriminative Models
   2. Directed/Undirected Representations
   Complexity of Learning: Sampling and Variational Inference
3. Deep Generative Models: RBMs, DBNs, DBMs
4. Directed Generative Nets: VAEs, GANs
5. Remarks on Deep Learning as a science

Neural Network for Autonomous Vehicle
[Figure: a feedforward network with inputs x1, x2, x3, weighted sums passed through a nonlinearity σ, and outputs y1, y2, y3.]

Why are probabilities used in ML?
1. Conditional probability distributions model the output, e.g., p(y|x;w)
2. The likelihood under p(y|x;w) is used to define a loss function E(w) (instead of accuracy), which we optimize to determine the weights
3. Probability distributions model the joint distribution p(v,h), from which we can infer the outputs p(h|v) and p(v|h)

Perceptron did not use probabilities
The output is g(x) = f(w^T x), where f is a threshold function.
Seek w such that: if x_n is in C_1, w^T x_n >= 0; if x_n is in C_2, w^T x_n < 0.
With targets t_n in {+1, -1}, we need w^T x_n t_n > 0.
Minimize the number of misclassified samples via the perceptron criterion (an implicit 0-1 loss):
  E_P(w) = -Σ_{n in M} w^T x_n t_n
Update rule: w^(τ+1) = w^(τ) - η ∇E_P(w^(τ)) = w^(τ) + η x_n t_n
Cycle through the x_n in turn; if a sample is incorrectly classified, add it to the weight vector.
[Figure: snapshots of the decision boundary over successive updates.]
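
Below is a minimal NumPy sketch of the perceptron update rule just described; the toy dataset, learning rate, and stopping rule are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy linearly separable data (assumed); targets t_n in {+1, -1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # absorb the bias into w

w = np.zeros(X_aug.shape[1])
eta = 1.0  # learning rate

for epoch in range(100):
    errors = 0
    for x_n, t_n in zip(X_aug, t):
        if t_n * (w @ x_n) <= 0:        # misclassified (or on the boundary)
            w = w + eta * x_n * t_n     # w <- w + eta * x_n * t_n
            errors += 1
    if errors == 0:                     # all samples correctly classified
        break

print("learned w:", w)
```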

Disadvantage of Accuracy as Loss
Counting the number of misclassified samples, E_P(w) = -Σ_{n in M} w^T x_n t_n, is what we wish to minimize.
Disadvantages:
- Piecewise constant function of w with discontinuities
- No closed-form solution
- Intractable even for a linear classifier
- Gradient descent requires a gradient direction
Solution: define the output to be probabilistic; the output is a distribution p(y|x;w)

Surrogate Loss Function
When the output can take one of two values, assume a Bernoulli distribution with t = {t_n}, n = 1,...,N and y_n = σ(w^T x_n). The likelihood of the observed data under the model is
  p(t|w) = Π_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n}
The negative log-likelihood is the cross-entropy, which we wish to minimize:
  E(w) = -ln p(t|w) = -Σ_{n=1}^N { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
The cross-entropy is convex, so we can use gradient descent:
  ∇E(w) = Σ_{n=1}^N (y_n - t_n) x_n
This model is called logistic regression. Using the learnt w we can predict p(y|x;w).
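
As a sketch of this surrogate loss in practice, the following NumPy snippet minimizes the cross-entropy E(w) by batch gradient descent; the toy data, step size, and iteration count are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy binary data (assumed); targets t_n in {0, 1}
X = np.array([[0.5, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
t = np.array([1, 1, 0, 0])
X_aug = np.hstack([np.ones((len(X), 1)), X])   # bias term

w = np.zeros(X_aug.shape[1])
eta = 0.1

for step in range(1000):
    y = sigmoid(X_aug @ w)                     # y_n = sigma(w^T x_n)
    grad = X_aug.T @ (y - t)                   # grad E(w) = sum_n (y_n - t_n) x_n
    w -= eta * grad

# Cross-entropy at the solution (clipped to avoid log(0))
y = np.clip(sigmoid(X_aug @ w), 1e-12, 1 - 1e-12)
E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
print("w:", w, "cross-entropy:", E)
```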

Surrogate loss for the continuous case: linear regression
Assume the output is corrupted by Gaussian noise:
  p(t|x) = N(t; y(x,w), σ^2),   y(x,w) = w_0 + Σ_{j=1}^{M-1} w_j φ_j(x)
The negative log-likelihood is
  E(w) = -Σ_{n=1}^N log p(y_n|x_n; w) = N log σ + (N/2) log 2π + (1/(2σ^2)) Σ_{n=1}^N (y_n - t_n)^2
so minimizing E(w) is equivalent to minimizing the squared error, with gradient
  ∇E(w) = Σ_{n=1}^N (y_n - t_n) x_n
Using the learnt w we can predict p(y|x;w) and pick the value that yields the least error in expectation.
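
A small NumPy check, with assumed toy numbers, that the Gaussian negative log-likelihood above differs from the sum-of-squares error only by terms and scale factors that do not depend on w.

```python
import numpy as np

# Assumed toy predictions y_n and targets t_n, and a fixed noise level sigma
y = np.array([1.1, 2.0, 2.9])
t = np.array([1.0, 2.0, 3.0])
sigma = 0.5
N = len(t)

nll = N * np.log(sigma) + 0.5 * N * np.log(2 * np.pi) \
      + np.sum((y - t) ** 2) / (2 * sigma ** 2)

sse = 0.5 * np.sum((y - t) ** 2)   # sum-of-squares error

# Up to the additive constant N*log(sigma) + (N/2)*log(2*pi) and the 1/sigma^2 scale,
# minimizing the NLL over w is the same as minimizing the squared error.
print(nll, sse)
```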

Surrogate may learn more
Using the log-likelihood surrogate, the test-set 0-1 loss continues to decrease for a long time after the training-set 0-1 loss has reached zero, because one can improve classifier robustness by further pushing the classes apart. This results in a more confident and robust classifier, extracting more information from the training data than minimizing the 0-1 loss would. This sets the stage for why we need to determine the conditional distribution p(y|x).

Joint and Conditional are very different
Input x, output y. Given 4 data points (x,y): (1,0), (1,0), (2,0), (2,1).

The joint distribution p(x,y) is:

        y=0    y=1
  x=1   1/2    0
  x=2   1/4    1/4

The conditional distribution p(y|x) is:

        y=0    y=1
  x=1   1      0
  x=2   1/2    1/2

We can use the joint to obtain the conditional via Bayes' rule, as seen next.
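
The two tables can be reproduced directly from the four data points; a short NumPy sketch:

```python
import numpy as np

data = [(1, 0), (1, 0), (2, 0), (2, 1)]          # the four (x, y) pairs

joint = np.zeros((2, 2))                          # rows: x in {1,2}, cols: y in {0,1}
for x, y in data:
    joint[x - 1, y] += 1
joint /= joint.sum()                              # p(x, y)

p_x = joint.sum(axis=1, keepdims=True)            # p(x) = sum_y p(x, y)
conditional = joint / p_x                         # p(y | x) = p(x, y) / p(x)

print(joint)        # [[0.5, 0.0], [0.25, 0.25]]
print(conditional)  # [[1.0, 0.0], [0.5, 0.5]]
```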

Classification → determine the conditional
The conditional p(y|x) is a natural fit for classification.
1. Discriminative algorithms model p(y|x) directly.
2. Generative algorithms model p(x,y), which can be transformed into p(y|x) by applying Bayes' rule and then used for classification:
  p(y|x) = p(x,y) / p(x),  where p(x) = Σ_y p(x,y)
and p(x,y) = p(x|y) p(y), i.e., class-conditional × prior.
However, p(x,y) is useful for other purposes; e.g., you could use p(x,y) to generate likely (x,y) pairs. There are examples of both types of classifiers.

Single decision vs. Sequential decisions
[Figure: the generative-discriminative grid]
- Generative, single output: Naïve Bayes classifier, models p(y, x)
- Generative, sequence: Hidden Markov model, models p(Y, X)
- Conditioning gives the discriminative counterparts:
- Discriminative, single output: logistic regression, models p(y|x)
- Discriminative, sequence: conditional random field, models p(Y|X)

Generative or Discriminative?
Generative models, which model p(x,y), appear to be more general and therefore better. But discriminative models, which model p(y|x), outperform generative models in classification:
1. Fewer adaptive parameters. With M features, logistic regression needs M parameters. A Bayes classifier with p(x|y) modeled as Gaussian needs 2M for the means + M(M+1)/2 for the shared Σ + 1 for the class prior → M(M+5)/2 + 1, which grows quadratically with M.
2. Performance depends on how well p(x|y) is approximated.
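
A quick check of the parameter counts quoted above, for an assumed M (the choice M = 100 is illustrative):

```python
M = 100  # assumed number of features

logistic_params = M                                      # M weights (+1 if a bias is included)
gaussian_bayes_params = 2 * M + M * (M + 1) // 2 + 1     # two means + shared Sigma + class prior

assert gaussian_bayes_params == M * (M + 5) // 2 + 1
print(logistic_params, gaussian_bayes_params)            # 100 vs 5251: linear vs quadratic in M
```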

Deep Generative Models: Overview
Generative models are built/trained using PGMs, Monte Carlo methods, partition functions, and approximate inference. The models represent probability distributions: some allow explicit probability evaluation; others don't, but allow sampling. Some are described by graphs and factors, others are not.

Directed graphical models
Representation (Bayesian network): CPDs p(x_i | pa(x_i))
Joint distribution over N variables:
  p(x) = Π_{i=1}^N p(x_i | pa(x_i))
Example with x = {D, I, G, S, L}:
  p(x) = p(D) p(I) p(G|D,I) p(S|I) p(L|G)
Deep learning example: Deep Belief Network
Intractability of inference:
  p(E = e) = Σ_{X\E} Π_{i=1}^N p(x_i | pa(x_i))
Summing out variables is #P-complete.
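
A sketch of this factorization for the x = {D, I, G, S, L} example; the CPD numbers below are assumed placeholders (not from the slides), and the brute-force sum illustrates why exact inference scales exponentially.

```python
import itertools

# Assumed placeholder CPDs for binary D, I, G, S, L (values 0/1)
p_D = {0: 0.6, 1: 0.4}
p_I = {0: 0.7, 1: 0.3}
p_G_given_DI = {(d, i): {0: 0.8 - 0.3 * i + 0.1 * d, 1: 0.2 + 0.3 * i - 0.1 * d}
                for d in (0, 1) for i in (0, 1)}
p_S_given_I = {i: {0: 0.9 - 0.5 * i, 1: 0.1 + 0.5 * i} for i in (0, 1)}
p_L_given_G = {g: {0: 0.7 - 0.4 * g, 1: 0.3 + 0.4 * g} for g in (0, 1)}

def joint(d, i, g, s, l):
    # p(x) = p(D) p(I) p(G|D,I) p(S|I) p(L|G)
    return (p_D[d] * p_I[i] * p_G_given_DI[(d, i)][g]
            * p_S_given_I[i][s] * p_L_given_G[g][l])

# Inference by brute-force summation, e.g. p(L = 1): the cost grows
# exponentially in the number of summed-out variables (#P-hard in general).
p_L1 = sum(joint(d, i, g, s, 1)
           for d, i, g, s in itertools.product((0, 1), repeat=4))
print(p_L1)
```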

Undirected graphical models
Representation (Markov network): factors φ(C)
Unnormalized and normalized joint distribution:
  p̃(x) = Π_{C in G} φ(C),   p(x) = (1/Z) p̃(x),   Z = ∫ p̃(x) dx (or Σ_x p̃(x) for discrete x)
Example with x = {a,b,c,d,e,f} and cliques C in {(a,b), (a,d), (b,c), (b,e), (e,f)}:
  p(x) = (1/Z) φ_{a,b}(a,b) φ_{b,c}(b,c) φ_{a,d}(a,d) φ_{b,e}(b,e) φ_{e,f}(e,f)
Energy model: E(C) = -ln φ(C), φ(C) = exp(-E(C))
  E(x) = E_{a,b}(a,b) + E_{b,c}(b,c) + E_{a,d}(a,d) + E_{b,e}(b,e) + E_{e,f}(e,f)
  p(x) = (1/Z) exp{-E(x)}
Deep learning example: Restricted Boltzmann Machine
  E(v,h) = -b^T v - c^T h - v^T W h
  p(h|v) = Π_i p(h_i|v) and p(v|h) = Π_i p(v_i|h)
Intractability of inference: the partition function Z = Σ_x exp{-E(x)}
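
A small NumPy sketch of the RBM energy and its factorized conditionals; the weights and states are random placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 3                          # visible and hidden units (assumed sizes)
W = rng.normal(scale=0.1, size=(n_v, n_h))
b = np.zeros(n_v)                        # visible bias
c = np.zeros(n_h)                        # hidden bias

def energy(v, h):
    # E(v, h) = -b^T v - c^T h - v^T W h
    return -b @ v - c @ h - v @ W @ h

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_v(v):
    # p(h_i = 1 | v) factorizes over the hidden units
    return sigmoid(c + v @ W)

def p_v_given_h(h):
    # p(v_j = 1 | h) factorizes over the visible units
    return sigmoid(b + W @ h)

v = rng.integers(0, 2, size=n_v).astype(float)
h = rng.integers(0, 2, size=n_h).astype(float)
print(energy(v, h), p_h_given_v(v), p_v_given_h(h))
```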

Generative models are more complex
Modeling is more complex than classification. In classification, both the input and the output are given, and optimization only needs to learn the mapping. In generative modeling, the data doesn't specify both the input z and the output x; the model must decide how to arrange the z space usefully and how to map z to x, which requires optimizing intractable criteria.

RBM training: Contrastive Divergence
Maximize the log-likelihood
  L({x^(1), ..., x^(M)}; θ) = Σ_m [ log p̃(x^(m); θ) - log Z(θ) ]
For stochastic gradient ascent, take derivatives:
  g_m = ∇_θ log p(x^(m); θ) = ∇_θ log p̃(x^(m); θ) - ∇_θ log Z(θ)
Gradient update:
- Positive phase: sample a minibatch of examples from the training set.
- Negative phase: Gibbs samples from the model, initialized with the training data, are used to approximate the intractable term ∇_θ log Z(θ).
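
A minimal CD-1 update sketch for a binary RBM in NumPy, following the positive/negative phase description above; the minibatch, learning rate, and network sizes are assumed toy values, not a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_v, n_h, eta = 6, 3, 0.05
W = rng.normal(scale=0.1, size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd1_step(V):
    """One CD-1 update on a minibatch V of shape (batch, n_v)."""
    global W, b, c
    # Positive phase: hidden probabilities given the data
    ph_data = sigmoid(c + V @ W)
    # Negative phase: one step of Gibbs sampling initialized at the data
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    pv_model = sigmoid(b + h_sample @ W.T)
    v_sample = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(c + v_sample @ W)
    # Approximate gradient: positive statistics minus negative statistics
    W += eta * (V.T @ ph_data - v_sample.T @ ph_model) / len(V)
    b += eta * (V - v_sample).mean(axis=0)
    c += eta * (ph_data - ph_model).mean(axis=0)

V = rng.integers(0, 2, size=(8, n_v)).astype(float)   # assumed toy minibatch
for _ in range(100):
    cd1_step(V)
```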

Approximate Inference Methods
1. Particle-based (sampling) methods
   To evaluate E[f] = ∫ f(z) p(z) dz, use L samples z^(l):
     f̂ = (1/L) Σ_{l=1}^L f(z^(l))
   Directed models: ancestral (forward) sampling. Undirected models: Gibbs sampling.
2. Variational inference methods
   The ELBO L(v, θ, q) is a lower bound on log p(v; θ); inference is viewed as maximizing L with respect to q.
   Mean-field approach: q(h|v) = Π_i q(h_i|v)
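
For the particle-based estimate above, a short NumPy example, assuming p(z) is a standard normal and f(z) = z^2 so that the true expectation is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 100_000
z = rng.standard_normal(L)           # samples z^(l) ~ p(z) = N(0, 1)
f_hat = np.mean(z ** 2)              # (1/L) sum_l f(z^(l)); true E[f] = 1
print(f_hat)
```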

Deep generative models use both directed and undirected graphs
- RBM: an undirected graphical model based on a bipartite graph; it allows efficient evaluation and differentiation of the unnormalized p̃(v) and efficient sampling.
- Deep Belief Network: a hybrid graphical model with multiple hidden layers; sampling uses Gibbs sampling for the top layers and ancestral sampling for the lower ones; trained using contrastive divergence.
- Deep Boltzmann Machine: an undirected graphical model with several layers of latent variables; it can be rewritten as a bipartite structure, and the same equations as for the RBM apply.

Differentiable Generator Nets
Many generative models are based on the idea of using a differentiable generator network. The model transforms samples of latent variables z into samples x, or into distributions over samples x, using a differentiable function g(z; θ^(g)) represented by a neural network. This model class includes:
- Variational autoencoders (VAE), which pair the generator net with an inference net
- Generative adversarial networks (GAN), which pair the generator net with a discriminator net

Autoencoder vs. VAE
[Figure: side-by-side diagrams of an autoencoder and a variational autoencoder.]
- Autoencoder: encoder f(·) (CNN) followed by decoder g(·); loss = reconstruction loss.
- Variational autoencoder: the encoder f(·) outputs μ and σ defining q(z|x) = N(μ, diag(σ)); a sample z = μ + σε is decoded into an image (DCNN); loss = reconstruction loss + latent loss KL(N(μ,Σ), N(0,I)).
We have z ~ Enc(x) = q(z|x) and y ~ Dec(z) = p(x|z).
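
A compact PyTorch sketch of the VAE pieces in the figure: an encoder producing (μ, log-variance), the reparameterized sample z = μ + σε, and a loss equal to reconstruction plus KL(q(z|x) || N(0, I)). Using MLPs instead of a CNN/DCNN, and the specific layer sizes, are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.fc_mu = nn.Linear(h_dim, z_dim)
        self.fc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps           # z = mu + sigma * eps
        return self.dec(z), mu, logvar

def vae_loss(x, x_logits, mu, logvar):
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return recon + kl

x = torch.rand(16, 784)                                  # assumed toy batch
model = VAE()
x_logits, mu, logvar = model(x)
loss = vae_loss(x, x_logits, mu, logvar)
loss.backward()
```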

Adversarial Example: Driving
[Figure: (a) an original road image; (b) a slightly darkened version.]
A self-driving car correctly decides to turn left for the first image (a), but turns right and crashes into the guardrail for the slightly darkened version (b).
(DeepXplore, DeepTest: automated testing of deep neural networks.)

Adversarial Approach
Carefully craft synthetic images by adding minimal perturbations to an existing image, so that they get classified differently from the original picture but still look the same to the human eye.

Adversarial Example Generation
Add to x an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input. This changes GoogLeNet's classification of the image:
  y = "panda" with 58% confidence  +  perturbation (classified as "nematode" with 8.2% confidence)  →  y = "gibbon" with 99% confidence
(From Goodfellow et al., Deep Learning, MIT Press.)
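
The gradient-sign construction described above (the fast gradient sign method), as a PyTorch sketch; the stand-in classifier, input, and epsilon are assumed placeholders rather than GoogLeNet and the panda image.

```python
import torch
import torch.nn as nn

def fgsm(model, x, y, eps=0.007):
    """Return x + eps * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

# Assumed stand-in classifier and input (not GoogLeNet / ImageNet)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
y = torch.tensor([3])
x_adv = fgsm(model, x, y)
print((x_adv - x).abs().max())   # perturbation bounded by eps
```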

Generative Adversarial Network
- A generative model (G-network), a feedforward network that generates images from noise, is trained to fool a feedforward classifier.
- A discriminative model (D-network), a feedforward classifier, attempts to recognize samples from G as fake and samples from the training set as real.

Training a GAN
The training procedure for G is to maximize the probability that D makes a mistake. The framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists. If G and D are MLPs, they can be trained using backpropagation.

Loss function for a GAN
D and G play a minimax game in which we optimize the following value function:
  min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))]
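
A sketch of one training step that optimizes this value function in PyTorch; the toy network sizes, optimizers, and the common non-saturating generator objective are assumptions rather than details given in the slides.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 16, 64                      # assumed toy sizes
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(x_real):
    batch = x_real.size(0)
    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch, z_dim)
    x_fake = G(z).detach()
    d_loss = bce(D(x_real), torch.ones(batch, 1)) + \
             bce(D(x_fake), torch.zeros(batch, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # Generator: instead of minimizing log(1 - D(G(z))), use the common
    # non-saturating form: maximize log D(G(z))
    z = torch.randn(batch, z_dim)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()

x_real = torch.randn(32, x_dim)            # assumed stand-in "real" data
print(gan_step(x_real))
```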

Problems with GANs
- Hard to achieve Nash equilibrium: two models trained simultaneously may not converge.
- Vanishing gradient: if the discriminator behaves badly, the generator gets no useful feedback; if the discriminator does a great job, the gradient drops to zero and learning is extremely slow.
- Mode collapse: during training the generator may collapse to a setting where it always produces the same outputs.

Wasserstein GAN (WGAN)
Minimizes a reasonable and efficient approximation of the Earth Mover's distance instead of the KL divergence. The GAN discriminator outputs a probability; the WGAN critic outputs a real-valued score.
Advantages:
- Improves the stability of learning
- Gets rid of problems like mode collapse
- Provides meaningful learning curves, useful for debugging and hyperparameter searches
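
A sketch of the WGAN critic and generator updates with weight clipping (as in the original WGAN paper); the sizes, learning rates, and clip value are assumed, and gradient-penalty variants are omitted.

```python
import torch
import torch.nn as nn

z_dim, x_dim, clip = 16, 64, 0.01          # assumed toy sizes and clip value
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
C = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 1))  # critic: real-valued score
opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_C = torch.optim.RMSprop(C.parameters(), lr=5e-5)

def wgan_step(x_real, n_critic=5):
    batch = x_real.size(0)
    for _ in range(n_critic):
        z = torch.randn(batch, z_dim)
        # Critic: maximize E[C(x_real)] - E[C(G(z))], an Earth Mover estimate
        c_loss = C(G(z).detach()).mean() - C(x_real).mean()
        opt_C.zero_grad(); c_loss.backward(); opt_C.step()
        for p in C.parameters():           # weight clipping enforces the Lipschitz constraint
            p.data.clamp_(-clip, clip)
    # Generator: maximize E[C(G(z))]
    z = torch.randn(batch, z_dim)
    g_loss = -C(G(z)).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return c_loss.item(), g_loss.item()

print(wgan_step(torch.randn(32, x_dim)))
```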

Importance of GANs
GANs have been at the forefront of research on generative models in the last couple of years. They have been used for:
- Image generation
- Image processing
- Image synthesis from captions
- Data generation for visual recognition
- Many other applications

Data Sets for GANs
LSUN (Large-Scale Scene Understanding)

  Category          Training                    Validation
  Bedroom           3,033,042 images (43 GB)    300 images
  Bridge            818,687 images (16 GB)      300 images
  Church Outdoor    126,227 images (2.3 GB)     300 images
  Classroom         168,103 images (3.1 GB)     300 images
  Conference Room   229,069 images (3.8 GB)     300 images
  Dining Room       657,571 images (11 GB)      300 images
  Kitchen           2,212,277 images (34 GB)    300 images
  Living Room       1,315,802 images (22 GB)    300 images
  Restaurant        626,331 images (13 GB)      300 images
  Tower             708,264 images (12 GB)      300 images

  Testing set: 10,000 images

Visual Quality of Generated Images
[Figure: samples from WGAN (2017) at 128×128 and from an ICLR 2018 model at 256×256.]
Architecture and hyperparameters are carefully selected. The latent variables capture important factors of variation (FoVs).

Images generated from CelebA
[Figure: generated face images.]

Shortcoming of the Adversarial Approach
While it exposes some erroneous behaviors of a deep learning model, it must limit its perturbations to tiny, invisible changes or require manual checks. Adversarial images cover only a small part (52.3%) of the deep learning system's logic.

Final Remarks on Deep Learning
Artificial Intelligence is poised to disrupt the world (NITI Aayog, National Strategy for Artificial Intelligence, #AIforAll).
Philosophy of science: Karl Popper, Thomas Kuhn.