Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient. Ali Mirzapour, Paper Presentation - Deep Learning, March 7th

Outline of the Presentation
- Restricted Boltzmann Machine (RBM)
- Contrastive Divergence (CD) Gradient Approximation
- The Persistent CD Algorithm
- Experimental Results
- Discussion
- Conclusion and Future Work

Restricted Boltzmann Machine (RBM)
- A neural network model for both unsupervised and supervised learning
- Consists of two layers of binary units (visible and hidden)

Restricted Boltzmann Machine (RBM) (Cont.)
- The RBM is an energy-based model with energy
  E(x, h) = -\sum_j \sum_k W_{jk} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j
- Probability of a data point x in the visible layer:
  p(x) = \sum_h p(x, h) = \frac{1}{Z} \sum_h \exp(-E(x, h))
  where Z is the partition function, Z = \sum_{x, h} \exp(-E(x, h))
  (these quantities are sketched in code below)
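The energy and the unnormalized probability translate directly into NumPy. This is a minimal sketch for binary units, not code from the paper: the helper names, the toy dimensions, and the use of the analytic free-energy form (hidden units summed out) are illustrative choices; only the parameter names W, b, c follow the slides.

```python
import numpy as np

def energy(x, h, W, b, c):
    """RBM energy: E(x, h) = -sum_jk W[j,k] h[j] x[k] - sum_k c[k] x[k] - sum_j b[j] h[j]."""
    return -(h @ W @ x) - c @ x - b @ h

def log_unnormalized_p(x, W, b, c):
    """log sum_h exp(-E(x, h)) for binary hidden units, i.e. log p(x) up to -log Z."""
    return c @ x + np.sum(np.log1p(np.exp(b + W @ x)))

# Toy example: 3 visible units, 2 hidden units (illustrative sizes).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))   # W[j, k] couples h_j and x_k
b = np.zeros(2)                          # hidden biases
c = np.zeros(3)                          # visible biases
x = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 1.0])
print(energy(x, h, W, b, c), log_unnormalized_p(x, W, b, c))
```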

Restricted Boltzmann Machine (RBM) (Cont.)
- Training (key to the RBM's success): minimize the average negative log-likelihood (NLL)
  \frac{1}{T} \sum_t -\log p(x^{(t)})
- Stochastic gradient descent is used; the gradient splits into a data-driven (positive) and a model-driven (negative) phase:
  \frac{\partial (-\log p(x^{(t)}))}{\partial \theta} = \mathbb{E}_h\!\left[ \frac{\partial E(x^{(t)}, h)}{\partial \theta} \,\middle|\, x^{(t)} \right] - \mathbb{E}_{x, h}\!\left[ \frac{\partial E(x, h)}{\partial \theta} \right]
  (the weight gradient is written out below)
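For concreteness, here is the standard derivation of this decomposition for a single weight W_{jk} under the energy defined on the previous slide (it is not spelled out on the slide itself):

```latex
\frac{\partial\,(-\log p(x^{(t)}))}{\partial W_{jk}}
  = \mathbb{E}_{h \mid x^{(t)}}\!\left[\frac{\partial E(x^{(t)}, h)}{\partial W_{jk}}\right]
  - \mathbb{E}_{x,h}\!\left[\frac{\partial E(x, h)}{\partial W_{jk}}\right]
  = -\,\mathbb{E}\!\left[h_j \mid x^{(t)}\right] x^{(t)}_k
  + \mathbb{E}_{x,h}\!\left[h_j\, x_k\right]
```

The first (positive) term is tractable, since \mathbb{E}[h_j \mid x] = \sigma(b_j + \sum_k W_{jk} x_k); the second (negative) term is an expectation under the model distribution, and approximating it is what CD and PCD are for.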

Contrastive Divergence (CD) Gradient Approximation
- To estimate the direction of the gradient, replace the model expectation (negative phase) by a point estimate at a sample \tilde{x}
- Obtain \tilde{x} by Gibbs sampling
- Start the sampling chain at the training point x^{(t)}
  (a minimal CD-k sketch follows below)
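A minimal single-example CD-k sketch in NumPy, assuming the binary RBM parameterization from the earlier slides (parameters W, b, c); the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_weight_gradient(x_t, W, b, c, k=1, rng=None):
    """CD-k estimate of the log-likelihood gradient w.r.t. W for one binary
    training vector x_t: the Gibbs chain is started at x_t, run for k full
    updates, and its end point replaces the intractable model expectation."""
    if rng is None:
        rng = np.random.default_rng()
    # Positive phase: exact conditional mean of h given the data point.
    ph_data = sigmoid(b + W @ x_t)
    # Negative phase: k full Gibbs updates starting from the data point.
    x = x_t
    for _ in range(k):
        h = (rng.random(b.shape) < sigmoid(b + W @ x)).astype(float)
        x = (rng.random(c.shape) < sigmoid(c + W.T @ h)).astype(float)
    ph_model = sigmoid(b + W @ x)
    # <h x^T>_data - <h x^T>_sample: move in this direction to raise log p(x_t).
    return np.outer(ph_data, x_t) - np.outer(ph_model, x)
```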

The Persistent CD Algorithm
- Using the CD gradient approximation is too time-consuming
- Instead of initializing the chain at x^{(t)} for every update, initialize the Markov chain with the negative sample from the previous iteration, so the chain persists across parameter updates
  (sketched in code below)
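Relative to the CD sketch above, the only change is where the negative-phase chain starts. This is again a minimal single-chain sketch with illustrative names; the persistent state is passed in and returned so the caller can carry it from one parameter update to the next.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcd_weight_gradient(x_t, x_persistent, W, b, c, rng=None):
    """PCD estimate of the log-likelihood gradient w.r.t. W.  Unlike CD, the
    negative-phase chain is NOT restarted at the data point: it is advanced
    from x_persistent, the negative sample left over from the previous
    parameter update, and the new state is returned for reuse."""
    if rng is None:
        rng = np.random.default_rng()
    # Positive phase: driven by the training point, exactly as in CD.
    ph_data = sigmoid(b + W @ x_t)
    # Negative phase: one full Gibbs update of the persistent chain.
    h = (rng.random(b.shape) < sigmoid(b + W @ x_persistent)).astype(float)
    x_new = (rng.random(c.shape) < sigmoid(c + W.T @ h)).astype(float)
    ph_model = sigmoid(b + W @ x_new)
    grad_W = np.outer(ph_data, x_t) - np.outer(ph_model, x_new)
    return grad_W, x_new  # carry x_new over to the next gradient estimate
```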

Experimental Results: The Considered Data Sets
- MNIST data set of handwritten digit images
  - 28 x 28 pixels
  - 50,000 training cases, 10,000 validation cases, 10,000 test cases
  - Pixel values are binarized by sampling from the given Bernoulli distribution

Experimental Results (Cont.)
- E-mail data set: descriptions of 5,000 e-mails, each labeled as spam or not spam
- Artificial data set: created by combining the outlines of rectangles and triangles; an effectively infinite amount of data can be generated
- Image segmentation data set: pictures of horses with a binary labeling (part of horse vs. part of background)

Experimental Results (Cont.): The Implemented Models
- RBM for unsupervised learning: the exact computation is exponential in the size of the smallest layer (visible or hidden)
- RBM for supervised learning: the labels of the data points are added to the chain
- Fully connected Markov random field (MRF), compared with the pseudo-likelihood algorithm

Experimental Results (Cont.): Best Implementation of the PCD Algorithm
- No Markov chain is ever reset
- One full Gibbs sampling update is performed on each Markov chain for every gradient estimate
- The number of Markov chains equals the number of training data points in a mini-batch
  (a training-loop sketch with this setup follows below)
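Putting those three points together, a compact training loop might look as follows. This is a hypothetical sketch, not the paper's code: batch size, learning rate, number of epochs, and initialization scale are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm_pcd(data, n_hidden, n_epochs=10, batch_size=100, lr=0.01, seed=0):
    """Batched PCD training of a binary RBM: as many persistent chains as
    mini-batch elements, no chain is ever reset, and every gradient estimate
    advances every chain by exactly one full Gibbs update."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_hidden, n_visible))
    b = np.zeros(n_hidden)                  # hidden biases
    c = np.zeros(n_visible)                 # visible biases
    chains = rng.integers(0, 2, size=(batch_size, n_visible)).astype(float)

    for _ in range(n_epochs):
        for start in range(0, len(data) - batch_size + 1, batch_size):
            batch = data[start:start + batch_size]
            # Positive phase on the mini-batch.
            ph_data = sigmoid(batch @ W.T + b)
            # Negative phase: one full Gibbs update of every persistent chain.
            h = (rng.random((batch_size, n_hidden)) < sigmoid(chains @ W.T + b)).astype(float)
            chains = (rng.random((batch_size, n_visible)) < sigmoid(h @ W + c)).astype(float)
            ph_model = sigmoid(chains @ W.T + b)
            # Stochastic gradient step on the estimated log-likelihood gradient.
            W += lr * (ph_data.T @ batch - ph_model.T @ chains) / batch_size
            b += lr * (ph_data - ph_model).mean(axis=0)
            c += lr * (batch - chains).mean(axis=0)
    return W, b, c
```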

Experimental Results (Cont.): PCD for the Fully Connected MRF
- Advantages
  - The positive phase of the gradient is constant
  - The training set can be discarded after the positive phase has been computed
- Disadvantages
  - The Markov chain defined by Gibbs sampling has slow mixing
  - Not all visible units can be updated at the same time

Discussion: Modeling MNIST Data with 25 Hidden Units (results figure)

Discussion (cont.): Modeling MNIST Data with 500 Hidden Units (results figure)

Discussion (cont.): Classification of MNIST Data (results figure)

Discussion
- PCD outperforms the other algorithms
- CD-10 takes about four times as long as PCD, CD-1, and MF CD
- CD-10 performs better than CD-1 when only a little training time is available
- (Figure: performance of RBMs trained with CD-1 and PCD)

Discussion (cont.): Modeling Artificial Data
- CD-10 is preferable when little training time is available; PCD is better when more time is available

Discussion (cont.): Modeling Artificial Data
- The data set is artificially generated, so an infinite amount of data is available
- Weight-decay regularization: determines how dominant the regularization term is in the gradient computation
- A higher regularization term keeps the model parameters smaller
- The CD algorithms depend strongly on the mixing rate of the Markov chain defined by the Gibbs sampler

Discussion (cont.): Classifying E-mail Data
- The data set is small (5,000 e-mails), so the error bars on the performance are large
- PCD is a reasonable choice

Discussion (cont.): Modeling Horse Contours
- PCD is not the best choice here
- The model is much bigger (1024 visible units, 500 hidden units)
- PCD performs better as the amount of training time increases

Discussion (cont.): PCD on MRFs vs. Pseudo-Likelihood (PL)
- PCD on MRFs
  - Moves in the direction of the gradient of the data likelihood
  - Profits from having more time to run
- Pseudo-likelihood (PL)
  - Does not produce the best probability models
  - Needs early stopping to prevent divergence

Discussion (cont.): PCD on MRFs vs. Pseudo-Likelihood (PL) (results figure)

Conclusion and Future Work
- Conclusion
  - Proposed Persistent CD (PCD)
  - Quantified the performance of the proposed method against the other algorithms
  - PCD is fast and simple
  - PCD outperforms the other algorithms
- Future work
  - Investigate the use of weight-decay regularization
  - Compare the algorithms over longer training times

Thank you