Introduction to Deep Learning


ENEE698A: Machine Learning Seminar. Introduction to Deep Learning. Raviteja Vemulapalli. Image credit: [LeCun 1998]

Resources: Unsupervised feature learning and deep learning (UFLDL) tutorial (http://ufldl.stanford.edu/wiki/index.php/ufldl_tutorial). Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, Vol. 2, No. 1, pp. 1-127, 2009, Now Publishers.

Overview Linear, Logistic and Softmax regression Feed forward neural networks Why deep models? Why is training deep networks (except CNNs) difficult? What makes convolutional neural networks (CNNs) relatively easier to train? How are deep networks trained today? Autoencoders and Denoising Autoencoders

Linear Regression. Input $x \in \mathbb{R}^d$, output $y \in \mathbb{R}$. Parameters: $\theta = \{w, b\}$. Model: $y = w^\top x + b$. Squared-error cost function: $\ell(\theta) = \sum_{j=1}^{N} \big(y_j - (w^\top x_j + b)\big)^2$. The cost function is convex and differentiable.
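
As a concrete illustration of minimizing this squared-error cost with gradient descent, here is a minimal numpy sketch on synthetic data (the data, learning rate, and iteration count are illustrative assumptions, not from the slides):

```python
import numpy as np

# Linear regression by gradient descent on the squared-error cost
# l(theta) = sum_j (y_j - (w^T x_j + b))^2, using synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # N=100 samples, d=3 features
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=100)

w, b, lr = np.zeros(3), 0.0, 0.01
for _ in range(500):
    err = X @ w + b - y                        # residuals
    w -= lr * 2 * X.T @ err / len(y)           # gradient of the mean squared error w.r.t. w
    b -= lr * 2 * err.mean()                   # gradient w.r.t. b
print(w, b)                                    # close to true_w, true_b
```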

Logistic Regression. Used for binary classification. Input $x \in \mathbb{R}^d$, output $y \in \{0, 1\}$. Parameters: $\theta = \{w, b\}$. Model: $P(y = 1 \mid x, \theta) = \frac{1}{1 + e^{-(w^\top x + b)}}$. NLL cost function: $\ell(\theta) = -\log \prod_{j=1}^{N} P(y = y_j \mid x_j, \theta) = -\sum_{j=1}^{N} \log P(y = y_j \mid x_j, \theta)$. The cost function is convex and differentiable.
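
A small numpy sketch of the corresponding NLL and its gradient (the helper names are my own; labels are assumed to be 0/1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, b, X, y):
    """Negative log-likelihood for binary labels y in {0, 1}."""
    p = sigmoid(X @ w + b)                     # P(y = 1 | x) for each row of X
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_grad(w, b, X, y):
    """Gradients of the NLL w.r.t. w and b."""
    p = sigmoid(X @ w + b)
    return X.T @ (p - y), np.sum(p - y)
```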

Softmax Regression. Used for multi-class classification. Input $x \in \mathbb{R}^d$, output $y \in \{1, 2, \dots, C\}$. Parameters: $\theta = \{(w_i, b_i)\}_{i=1}^{C}$. Model: $P(y = i \mid x, \theta) = \frac{e^{w_i^\top x + b_i}}{\sum_{k=1}^{C} e^{w_k^\top x + b_k}}$. NLL cost function: $\ell(\theta) = -\log \prod_{j=1}^{N} P(y = y_j \mid x_j, \theta) = -\sum_{j=1}^{N} \log P(y = y_j \mid x_j, \theta)$. The cost function is convex and differentiable.
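
A minimal sketch of the softmax model's class probabilities, with the usual max-subtraction for numerical stability (an implementation detail not stated on the slide):

```python
import numpy as np

def softmax_probs(W, b, x):
    """P(y = i | x) for a C-class softmax model; row i of W is w_i, entry i of b is b_i."""
    scores = W @ x + b                          # shape (C,)
    scores -= scores.max()                      # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()
```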

Regularization & Optimization. $\ell_2$-regularization (weight decay) helps to avoid overfitting: $\min_{w,b} \big[\ell(w, b) + \lambda \lVert w \rVert_2^2\big]$. $\ell_2$-regularization is equivalent to MAP estimation with a zero-mean Gaussian prior on $w$. Optimization: $\ell(w, b) + \lambda \lVert w \rVert_2^2$ is convex and differentiable in the case of linear, logistic, and softmax regression, so gradient-based optimization techniques can be used to find the global optimum.
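
One way to see the MAP remark (a standard derivation, not spelled out on the slide), with $\ell$ denoting the NLL and a zero-mean isotropic Gaussian prior $p(w) = \mathcal{N}(0, \sigma^2 I)$:

$$
\hat{w}_{\mathrm{MAP}} = \arg\max_{w} \big[ \log p(\mathcal{D} \mid w) + \log p(w) \big]
= \arg\min_{w} \Big[ \ell(w) + \tfrac{1}{2\sigma^2} \lVert w \rVert_2^2 \Big],
$$

so weight decay corresponds to $\lambda = \tfrac{1}{2\sigma^2}$.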

Feed Forward Neural Networks

Feed Forward Neural Networks. Artificial neuron: $h_{w,b}(x) = g(w^\top x + b)$, where $g$ is a non-linear function referred to as the activation function. The logistic and hyperbolic tangent functions are two commonly used activation functions. Image credit: UFLDL Tutorial
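
A minimal sketch of a single artificial neuron in numpy (function names are my own):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))       # squashes inputs to (0, 1)

def neuron(w, b, x, g=logistic):
    """Single artificial neuron h_{w,b}(x) = g(w^T x + b)."""
    return g(np.dot(w, x) + b)

# np.tanh is the other activation mentioned on the slide: neuron(w, b, x, g=np.tanh)
```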

Feed Forward Neural Networks. Example: a neural network with 2 hidden layers, consisting of an input layer (3 inputs), hidden layer 1 (3 units), hidden layer 2 (2 units), and an output layer (2 units). The hidden units use a non-linear activation (logistic or tanh); the output layer model depends on the application: regression uses a linear output layer, binary classification a logistic output layer, and multi-class classification a softmax output layer. Note that if the activation function is linear, then the entire neural network is effectively equivalent to the output layer. Image credit: UFLDL Tutorial

Feed Forward Neural Networks - Prediction (forward propagation). Notation: $L$ is the number of layers (including the input and output layers); $n_i$ is the number of units in the $i$-th layer; $a^{(i)} \in \mathbb{R}^{n_i}$ is the output of the $i$-th layer; $W^{(i)} \in \mathbb{R}^{n_{i+1} \times n_i}$ and $b^{(i)} \in \mathbb{R}^{n_{i+1}}$ are the parameters of the mapping function from layer $i$ to layer $i+1$; $S$ is the output layer model (linear, logistic, or softmax). Forward propagation: $a^{(1)} = x$, $a^{(i+1)} = g(W^{(i)} a^{(i)} + b^{(i)})$ for $i = 1, \dots, L-2$, and $h_{W,b}(x) = a^{(L)} = S(W^{(L-1)} a^{(L-1)} + b^{(L-1)})$. Image credit: UFLDL Tutorial
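
The forward propagation recursion translates almost directly into code; a hedged numpy sketch (parameters are passed as per-layer lists, an assumption on my part):

```python
import numpy as np

def forward(x, weights, biases, g, S):
    """Forward propagation: a^(1) = x, a^(i+1) = g(W^(i) a^(i) + b^(i)),
    and h_{W,b}(x) = S(W^(L-1) a^(L-1) + b^(L-1))."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = g(W @ a + b)                      # hidden layers
    return S(weights[-1] @ a + biases[-1])    # output layer model (linear/logistic/softmax)
```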

Feed Forward Neural Networks - Learning. Training samples: $\{(x_i, y_i)\}_{i=1}^{N}$. Parameters: $W$ (weights) and $b$ (biases). Per-sample cost function: regression (linear output layer): $J(W, b; x_i, y_i) = \lVert h_{W,b}(x_i) - y_i \rVert_2^2$; binary classification (logistic output layer, with $h_{W,b}(x_i) = P(y_i = 1 \mid x_i, W, b)$): $J(W, b; x_i, y_i) = -[\,y_i \log h_{W,b}(x_i) + (1 - y_i) \log(1 - h_{W,b}(x_i))\,]$; multi-class classification (softmax output layer, with $h^{k}_{W,b}(x_i) = P(y_i = k \mid x_i, W, b)$): $J(W, b; x_i, y_i) = -\sum_{k=1}^{C} \mathbb{I}(y_i = k) \log h^{k}_{W,b}(x_i)$.
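
The three per-sample cost functions, sketched in numpy (here classes are zero-indexed and h denotes the network output):

```python
import numpy as np

def squared_error(h, y):
    """Regression, linear output layer."""
    return np.sum((h - y) ** 2)

def binary_cross_entropy(h, y):
    """Binary classification, logistic output layer; h = P(y = 1 | x)."""
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

def softmax_cross_entropy(h, y):
    """Multi-class classification, softmax output; h is the vector of class
    probabilities and y the true class index, so only class k = y contributes."""
    return -np.log(h[y])
```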

Feed Forward Neural Networks - Learning. Regularized training objective: $\min_{W,b} \sum_{i=1}^{N} J(W, b; x_i, y_i) + \lambda \lVert W \rVert_2^2$. $J$ is non-convex but differentiable. The gradient of the cost function with respect to $W$ and $b$ can be computed using the backpropagation algorithm. Stochastic gradient descent is commonly used to find a local optimum. The weights are randomly initialized to small values around zero.
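
A schematic stochastic gradient descent loop matching this slide; `backprop` is a hypothetical caller-supplied helper that returns the per-layer gradients of $J$, and the hyperparameters are illustrative:

```python
import numpy as np

def sgd(W, b, data, backprop, lr=0.01, lam=1e-4, epochs=10):
    """W, b: lists of per-layer weight matrices and bias vectors,
    randomly initialized to small values around zero by the caller.
    data: list of (x, y) training pairs."""
    for _ in range(epochs):
        np.random.shuffle(data)                            # visit samples in random order
        for x, y in data:
            gW, gb = backprop(W, b, x, y)                  # per-layer gradients of J
            W = [Wi - lr * (gWi + 2 * lam * Wi) for Wi, gWi in zip(W, gW)]  # weight decay on W only
            b = [bi - lr * gbi for bi, gbi in zip(b, gb)]
    return W, b
```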

How powerful are neural networks?

Expressive Power of Neural Networks Professor Kurt Hornik

Expressive Power of Neural Networks. Universal Approximation Theorem [Hornik (1991)]: Let $g$ be any continuous, bounded, non-constant function. Let $X \subset \mathbb{R}^m$ be a closed and bounded set, and let $C(X)$ denote the space of continuous functions on $X$. Then, given any function $f \in C(X)$ and $\varepsilon > 0$, there exist an integer $N$ and real constants $\alpha_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^m$ for $i = 1, \dots, N$ such that $F(x) = \sum_{i=1}^{N} \alpha_i \, g(w_i^\top x + b_i)$ (a weighted sum of hidden layer outputs) satisfies $|F(x) - f(x)| < \varepsilon$ for all $x \in X$. Kurt Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks, 4:251-257, 1991.
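
A small numerical illustration (not a proof, and not from the slides): approximating $f(x) = \sin(x)$ on $[0, 2\pi]$ by $F(x) = \sum_i \alpha_i g(w_i x + b_i)$ with random hidden-layer parameters and least-squares output weights:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                                        # number of hidden units
w = rng.normal(scale=2.0, size=N)              # random hidden-layer weights
b = rng.uniform(-2 * np.pi, 2 * np.pi, size=N) # random hidden-layer biases

x = np.linspace(0, 2 * np.pi, 500)
g = np.tanh                                    # continuous, bounded, non-constant
H = g(np.outer(x, w) + b)                      # hidden layer outputs, shape (500, N)
alpha, *_ = np.linalg.lstsq(H, np.sin(x), rcond=None)
print(np.max(np.abs(H @ alpha - np.sin(x))))   # small worst-case error on the grid
```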

Deep Neural Network A deep neural network is a neural network with many hidden layers.

Why Deep Networks? The universal approximation theorem says that one hidden layer is enough to approximate any continuous function. Then why deep networks?

Why Deep Networks? Computational efficiency: functions that can be compactly represented (with a number of computational units polynomial in the number of inputs) by a depth-k architecture might require an exponentially large number of computational units to be represented by a shallower (depth < k) architecture. In other words, for the same number of computational units, deep architectures have more expressive power than shallow architectures.

Why Deep Networks? Biological motivation: the visual cortex is organized in a deep architecture (with 5-10 levels), with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of the cortex [Serre 2007]. T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, and T. Poggio, A quantitative theory of immediate visual recognition, Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33-56, 2007.

Training Deep Neural Networks. Prior to 2006, the standard approach to training a neural network was to use gradient-based techniques starting from a random initialization. Researchers had reported positive experimental results with typically one or two hidden layers, but supervised training of deeper networks using random initialization consistently yielded poor results (except for convolutional neural networks). Why is training deep neural networks difficult?

Training Deep Neural Networks. Lack of sufficient labeled data: deep neural networks represent very complex data transformations, and supervised training of such complex models requires a large number of labeled training samples. Local optima: supervised training of deep networks involves optimizing a cost function that is highly non-convex in its parameters, and gradient-based approaches with random initialization usually converge to a very poor local minimum.

Training Deep Neural Networks. Diffusion of gradients: the gradients that are propagated backwards rapidly diminish in magnitude as the depth of the network increases, so the weights of the earlier layers change very slowly and the earlier layers fail to learn well. When the earlier layers do not learn well, a deep network is equivalent to a shallow network (consisting of the top layers) operating on corrupted input (the output of the poorly trained earlier layers), and hence may perform worse than a shallow network trained on the original input.

Training Deep Neural Networks What to do??

Training Deep Neural Networks. There is a lot of free unlabeled data on the internet these days. Can we use unlabeled data to train deep networks?

Unsupervised Layer-wise Pre-training. In 2006, Hinton et al. introduced an unsupervised training approach known as pre-training. The main idea of pre-training is to train the layers of the network one at a time, in an unsupervised manner. Various unsupervised pre-training approaches have since been proposed, based on restricted Boltzmann machines [Hinton 2006], autoencoders [Bengio 2006], sparse autoencoders [Ranzato 2006], denoising autoencoders [Vincent 2008], etc. After unsupervised training of each layer, the learned weights are used for initialization and the entire deep network is fine-tuned in a supervised manner using the labeled training data.
G. E. Hinton, S. Osindero, and Y. Teh, A fast learning algorithm for deep belief nets, Neural Computation, vol. 18, pp. 1527-1554, 2006.
Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Greedy layer-wise training of deep networks, NIPS, 2006.
M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, Efficient learning of sparse representations with an energy-based model, NIPS, 2006.
P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, ICML, 2008.

Unsupervised Layer-wise Pre-training. Pipeline: unlabeled data -> unsupervised layer-wise pre-training -> learned weights -> supervised fine-tuning using backpropagation (with labeled data) -> trained network. The weights learned using the layer-wise unsupervised training approach provide a very good initialization compared to random weights. Supervised fine-tuning with this initialization usually produces a good local optimum for the entire deep network, because the unlabeled data has already provided a significant amount of prior information.

Convolutional Neural Networks (CNNs). From fully connected layers, to local connections with different weights, to local connections sharing the same weights, which is equivalent to convolution by a filter. A convolutional layer applies multiple convolutions using different filters. Image credit: CVPR12 Tutorial on deep learning
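
To make the weight-sharing point concrete, a minimal "valid" 2-D convolution in plain numpy (a naive loop version for clarity, not an efficient implementation):

```python
import numpy as np

def conv2d(image, kernel):
    """Apply the same filter weights at every location of the image (weight sharing)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

# A convolutional layer applies several such filters to produce several feature maps.
```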

Convolutional Neural Networks. Subsampling or pooling: a pooling layer takes the max or average of its inputs. Pooling reduces computational complexity for the upper layers and makes the output of convolutional networks more robust to local translations. Image credit: CVPR12 Tutorial on deep learning
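
A minimal non-overlapping max-pooling sketch in numpy (the pool size is an illustrative parameter):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest response in each size x size block,
    reducing resolution and adding some robustness to small local translations."""
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size                    # crop to a multiple of the pool size
    blocks = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))
```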

Convolutional Neural Networks. CNNs were inspired by the visual system's architecture, and the modern understanding of how the visual system functions is consistent with the processing style found in CNNs [Serre 2007]. The LeNet architecture was used for character/digit recognition [LeCun 1998]. T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, and T. Poggio, A quantitative theory of immediate visual recognition, Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33-56, 2007.

Training Deep CNNs. Supervised training of deep neural networks generally yielded poor results, except for CNNs: long before 2006, researchers were successful in supervised training of deep CNNs (5-6 hidden layers) without any pre-training. What makes CNNs easier to train? The small fan-in of the neurons in a CNN could be helping the gradients propagate through many layers without diffusing so much as to become useless. The local connectivity structure is a very strong prior that is particularly appropriate for vision tasks, and it sets the parameters of the whole network in a favorable region (with all non-connections corresponding to zero weights) from which gradient-based optimization works well.

Unsupervised Pre-training

Pre-training. Pipeline: unlabeled data -> unsupervised layer-wise pre-training -> learned weights -> supervised fine-tuning using backpropagation (with labeled data) -> trained network. Unsupervised approaches: restricted Boltzmann machines (RBM), autoencoders (AE), sparse autoencoders (SAE), denoising autoencoders (DAE). In this talk we will focus on autoencoders and denoising autoencoders.

Autoencoders. An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. The encoder has parameters $(W_e, b_e)$ and the decoder has parameters $(W_d, b_d)$; the input is $x \in [0,1]^d$. If $W_d = W_e^\top$, the autoencoder is said to have tied weights. Image credit: UFLDL Tutorial

Autoencoders. Encoder: $f(x) = g(W_e x + b_e)$. Decoder: $\hat{x} = h(f(x)) = g(W_d f(x) + b_d)$. Autoencoders are trained using backpropagation. Two common cost functions: for real-valued inputs, $\sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert_2^2$; for binary inputs, $-\sum_{i=1}^{N} \sum_{j=1}^{d} \big[\, x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \,\big]$.
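
A hedged numpy sketch of the encoder/decoder pass and the two reconstruction costs (logistic activations are assumed for both encoder and decoder):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x, We, be, Wd, bd):
    """Encoder f(x) = g(We x + be); decoder x_hat = g(Wd f(x) + bd)."""
    f = logistic(We @ x + be)
    x_hat = logistic(Wd @ f + bd)
    return f, x_hat

def reconstruction_loss(x, x_hat, binary=True):
    if binary:                                 # cross-entropy cost for inputs in [0, 1]
        return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
    return np.sum((x - x_hat) ** 2)            # squared error for real-valued inputs

# Tied weights would correspond to Wd = We.T
```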

Autoencoders. An autoencoder encodes the input $x$ into some representation $f(x)$ so that the input can be reconstructed from that representation. If we use neurons with a linear activation function and train the network using the mean squared error criterion, then the autoencoder behaves similarly to PCA. When the number of hidden units is smaller than the number of inputs, an autoencoder learns a compressed representation of the data by exploiting the correlations among the input variables; it can therefore be interpreted as a non-linear dimensionality reduction algorithm. An autoencoder can also be used for denoising by training it with noisy input.

Unsupervised Pre-training with AEs Neural network with 2 hidden layers Image credit: UFLDL Tutorial

Unsupervised Pre-training with AEs - Step 1. Train an autoencoder on the raw input to learn $(W_e^{(1)}, b_e^{(1)})$ and $(W_d^{(1)}, b_d^{(1)})$. Compute the primary features $h_1(x)$ for each input $x$ using the learned parameters $(W_e^{(1)}, b_e^{(1)})$. Image credit: UFLDL Tutorial

Unsupervised Pre-training with AEs - Step 2. Use the primary features $h_1(x)$ as input and train another autoencoder to learn $(W_e^{(2)}, b_e^{(2)})$ and $(W_d^{(2)}, b_d^{(2)})$. Compute the secondary features $h_2(x)$ for each input $h_1(x)$ using the learned parameters $(W_e^{(2)}, b_e^{(2)})$. Image credit: UFLDL Tutorial

Supervised Fine-tuning - Step 3. Use $(W_e^{(1)}, b_e^{(1)})$ to initialize the parameters between the input and the first hidden layer. Use $(W_e^{(2)}, b_e^{(2)})$ to initialize the parameters between the first and second hidden layers. Randomly initialize the parameters of the final classification layer. Train the entire network in a supervised manner using the labeled training data. Image credit: UFLDL Tutorial
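
Steps 1-3 as a schematic procedure; the four callables are hypothetical building blocks supplied by the caller, not code from the slides:

```python
def pretrain_and_finetune(unlabeled_x, labeled_data, n_layers,
                          train_autoencoder, encode, init_output_layer, finetune):
    """Greedy layer-wise pre-training with autoencoders, then supervised fine-tuning.
    train_autoencoder, encode, init_output_layer, and finetune are caller-supplied
    (hypothetical) helpers."""
    weights, layer_input = [], unlabeled_x
    for _ in range(n_layers):
        We, be = train_autoencoder(layer_input)        # steps 1 and 2: one layer at a time
        weights.append((We, be))
        layer_input = encode(layer_input, We, be)      # its features feed the next layer
    weights.append(init_output_layer())                # step 3: random final classification layer
    return finetune(weights, labeled_data)             # supervised fine-tuning with backprop
```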

Denoising Autoencoders. A denoising autoencoder is simply an autoencoder trained to reconstruct a clean input from a corrupted one. Motivation: humans can recognize patterns from partially occluded or corrupted data, so a good representation should be robust to partial destruction of the data. Corrupt the input $x$ to get a partially destroyed version $\tilde{x}$ by means of some stochastic mapping $\tilde{x} \sim q(\tilde{x} \mid x)$, and train an autoencoder to recover $x$ from $\tilde{x}$. Pre-training deep networks using denoising autoencoders was shown to work better than pre-training with ordinary autoencoders on the MNIST digit dataset when images were corrupted by randomly choosing some of the pixels and setting their values to zero.
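
A minimal sketch of the masking corruption $q(\tilde{x} \mid x)$ described here: randomly set a fraction p of the input entries to zero (the DAE is then trained to reconstruct the clean $x$ from the corrupted version):

```python
import numpy as np

def mask_noise(x, p, rng=None):
    """Return a copy of x with roughly a fraction p of its entries set to zero."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p            # keep each entry with probability 1 - p
    return x * mask
```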

Thank You. Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014.