Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient
Ali Mirzapour
Paper Presentation - Deep Learning
March 7th
Outline of the Presentation
- Restricted Boltzmann Machine (RBM)
- Contrastive Divergence (CD) Gradient Approximation
- The Persistent CD Algorithm
- Experimental Results
- Discussion
- Conclusion and Future Work
Restricted Boltzmann Machine (RBM)
- A neural network model for both unsupervised and supervised learning
- Consists of two layers of binary units: a visible layer and a hidden layer
Restricted Boltzmann Machine (RBM) (Cont.)
- An energy-based model:
$$E(x, h) = -\sum_{j}\sum_{k} W_{jk} h_j x_k - \sum_{k} c_k x_k - \sum_{j} b_j h_j$$
- Probability of a data point ($x$ in the visible layer):
$$p(x) = \sum_{h} p(x, h) = \frac{\sum_{h} \exp(-E(x, h))}{Z}$$
where $Z$ is the partition function:
$$Z = \sum_{x, h} \exp(-E(x, h))$$
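To make the notation concrete, here is a minimal NumPy sketch of the energy function; the array shapes and variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

def energy(x, h, W, b, c):
    """RBM energy E(x, h) = -h.W.x - c.x - b.h for one configuration.

    x: visible units, shape (K,);  h: hidden units, shape (J,)
    W: weights, shape (J, K);  b: hidden biases (J,);  c: visible biases (K,)
    """
    return -(h @ W @ x) - (c @ x) - (b @ h)

# The unnormalized probability of a joint configuration is exp(-energy(x, h)).
```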
Restricted Boltzmann Machine (RBM) (Cont.)
- Training an RBM minimizes the average negative log-likelihood (NLL) of the data:
$$\frac{1}{T} \sum_{t} -\log p(x^{(t)})$$
- Stochastic gradient descent uses the gradient
$$\frac{\partial \left(-\log p(x^{(t)})\right)}{\partial \theta} = \mathbb{E}_{h}\!\left[\frac{\partial E(x^{(t)}, h)}{\partial \theta} \,\middle|\, x^{(t)}\right] - \mathbb{E}_{x, h}\!\left[\frac{\partial E(x, h)}{\partial \theta}\right]$$
- The first term (positive phase) conditions on the training point; the second term (negative phase) is an expectation under the model distribution
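As a concrete instance, taking $\theta = W_{jk}$ in the gradient above gives the standard result (a routine derivation, not spelled out on the slide):

$$\frac{\partial \left(-\log p(x^{(t)})\right)}{\partial W_{jk}} = -\,p(h_j = 1 \mid x^{(t)})\, x_k^{(t)} + \mathbb{E}_{x, h}\left[h_j x_k\right]$$

The positive phase is tractable because $p(h \mid x)$ factorizes over the hidden units; the negative phase is the intractable part that CD and PCD approximate.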
Contrastive Divergence (CD) Gradient Approximation
- The negative-phase expectation is intractable, so CD estimates the direction of the gradient cheaply
- Replace the expectation by a point estimate at a sample $\tilde{x}$
- Obtain $\tilde{x}$ by Gibbs sampling
- Start the sampling chain at the training point $x^{(t)}$
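A minimal NumPy sketch of CD-k (CD with k Gibbs updates). The conditionals follow from the energy function above; the helper names, batch shapes, and sign conventions are my own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h(x, W, b):
    # p(h_j = 1 | x) = sigmoid(b_j + sum_k W_jk x_k);  x is an (n, K) batch
    p = sigmoid(x @ W.T + b)
    return (rng.random(p.shape) < p).astype(float)

def sample_x(h, W, c):
    # p(x_k = 1 | h) = sigmoid(c_k + sum_j W_jk h_j)
    p = sigmoid(h @ W + c)
    return (rng.random(p.shape) < p).astype(float)

def cd_k_grad(x_batch, W, b, c, k=1):
    """CD-k gradient estimate: the Gibbs chain starts at the data."""
    h_data = sigmoid(x_batch @ W.T + b)       # positive-phase statistics
    x_neg = x_batch
    for _ in range(k):                        # k full Gibbs updates
        x_neg = sample_x(sample_h(x_neg, W, b), W, c)
    h_neg = sigmoid(x_neg @ W.T + b)          # negative-phase statistics
    n = x_batch.shape[0]
    dW = (h_data.T @ x_batch - h_neg.T @ x_neg) / n   # ascent direction
    db = (h_data - h_neg).mean(axis=0)
    dc = (x_batch - x_neg).mean(axis=0)
    return dW, db, dc
```

Adding dW, db, dc to the parameters with a small learning rate moves the model toward higher data likelihood, up to the bias introduced by truncating the chain after k steps.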
The Persistent CD Algorithm
- Running long Gibbs chains from scratch for every CD gradient estimate is too time-consuming
- Instead of initializing the chain at $x^{(t)}$, initialize the Markov chain at the negative sample from the last iteration, so the chain persists across updates
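A sketch of the PCD gradient estimate, reusing sigmoid, sample_h, and sample_x from the CD sketch above. The only change from CD is where the negative-phase chain starts: it continues from its previous state instead of restarting at the data.

```python
def pcd_grad(x_batch, chains, W, b, c):
    """PCD gradient estimate: one full Gibbs update on the persistent chains.

    `chains` holds the visible states of the persistent Markov chains and
    must have as many rows as `x_batch`.
    """
    h_data = sigmoid(x_batch @ W.T + b)                # positive phase: data
    chains = sample_x(sample_h(chains, W, b), W, c)    # advance the chains
    h_neg = sigmoid(chains @ W.T + b)                  # negative phase: chains
    n = x_batch.shape[0]
    dW = (h_data.T @ x_batch - h_neg.T @ chains) / n
    db = (h_data - h_neg).mean(axis=0)
    dc = (x_batch - chains).mean(axis=0)
    return dW, db, dc, chains                          # keep the updated chain states
```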
Experimental results
Data sets considered:
- MNIST handwritten digit images, 28 x 28 pixels
- 50,000 training cases, 10,000 validation cases, 10,000 test cases
- Pixel values are binarized by sampling from the given Bernoulli distributions
Experimental results (Cont.)
- An e-mail data set: 5,000 e-mails, each labeled as spam or not spam
- An artificial data set, created by combining the outlines of rectangles and triangles; an infinite amount of data can be generated
- An image segmentation data set: pictures of horses with binary labels (part of horse vs. part of background)
Experimental results (Cont.)
The implemented models:
- RBM for unsupervised learning: the exact likelihood can be evaluated in time exponential in the size of the smaller layer (visible or hidden)
- RBM for supervised learning: the labels of the data points are added to the Gibbs chain, sampled along with the other units
- Fully connected Markov random field (MRF), compared against the pseudo-likelihood algorithm
Experimental results (Cont.)
Best implementation of the PCD algorithm (a sketch of this setup follows below):
- No Markov chains are ever reset
- One full Gibbs sampling update is performed on each Markov chain for each gradient estimate
- The number of Markov chains equals the number of training data points in a mini-batch
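Putting the pieces together, a hypothetical training loop matching this description, built on pcd_grad and rng from the sketches above: chains are never reset, each gradient estimate advances every chain by one full Gibbs update, and there is one chain per mini-batch example. The learning rate, initialization scale, and epoch count are placeholder values, not the paper's settings.

```python
def train_pcd(data, n_hidden, batch_size=100, lr=0.01, epochs=10):
    n, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    b = np.zeros(n_hidden)
    c = np.zeros(n_visible)
    # One persistent chain per mini-batch example; never reset afterwards.
    chains = (rng.random((batch_size, n_visible)) < 0.5).astype(float)
    for _ in range(epochs):
        for i in range(0, n - batch_size + 1, batch_size):  # drop ragged tail
            batch = data[i:i + batch_size]
            dW, db, dc, chains = pcd_grad(batch, chains, W, b, c)
            W += lr * dW    # ascend the estimated log-likelihood gradient
            b += lr * db
            c += lr * dc
    return W, b, c
```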
Experimental results (Cont.)
PCD for a fully connected MRF:
- Advantages: the positive phase in an MRF is constant, so the training set can be discarded once the positive-phase statistics have been computed
- Disadvantage: the Markov chain defined by Gibbs sampling mixes slowly, because the full connectivity means the visible units cannot all be updated at the same time
Discussion
Modeling MNIST data with 25 hidden units [results figure]
Discussion (cont.)
Modeling MNIST data with 500 hidden units [results figure]
Discussion (cont.)
Classification of MNIST data [results figure]
Discussion
- PCD outperforms the other algorithms
- CD-10 takes about four times as long per update as PCD, CD-1, and mean-field CD (MF CD)
- CD-10 performs better than CD-1 when only a little training time is available
- Performance of RBMs trained by CD-1 and PCD [results figure]
Discussion (cont.)
Modeling Artificial Data
- CD-10 is preferable when little time is available; PCD is better when more time is available
Discussion (cont.)
Modeling Artificial Data
- The data set is artificially generated, so an infinite amount of data is available
- Weight decay regularization: determines how dominant the regularization term is in the gradient computation; a higher regularization term makes the model parameters smaller
- The CD algorithms are quite dependent on the mixing rate of the Markov chain defined by the Gibbs sampler
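As a concrete form of the update with L2 weight decay (a standard formulation; the regularization strength $\lambda$ is a free hyperparameter, not a value from the paper):

$$W \leftarrow W + \eta \left( \Delta W - \lambda W \right)$$

where $\Delta W$ is the CD or PCD gradient estimate. A larger $\lambda$ shrinks the weights, which keeps the conditional distributions closer to uniform and therefore helps the Gibbs chain mix faster.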
Discussion (cont.)
Classifying E-mail Data
- The data set is small (5,000 data points), so the error bars on the performance are large
- PCD is a reasonable choice
Discussion (cont.)
Modeling Horse Contours
- PCD is not the best choice here
- The data points are much bigger (1,024 visible units, 500 hidden units)
- PCD performs better as the amount of training time increases
Discussion (cont.)
PCD on MRFs vs. Pseudo-Likelihood (PL)
- PCD on MRFs: moves in the direction of the data likelihood gradient and profits from having more time to run
- PL: does not produce the best probability models and needs early stopping to prevent divergence
Discussion (cont.)
PCD on MRFs vs. Pseudo-Likelihood (PL) [results figure]
Conclusion and Future Work
Conclusion:
- Proposed the Persistent CD (PCD) algorithm
- Quantified its performance against the other algorithms
- PCD is fast and simple, and it outperforms the other algorithms
Future work:
- Investigate the use of weight decay regularization
- Compare the algorithms with larger amounts of training time
Thank you