Machine Learning 10-701, Fall 2015: Deep Learning. Eric Xing (and Pengtao Xie), Lecture 8, October 6, 2015, CMU.


Machine Learning 10-701, Fall 2015 Deep Learning Eric Xing (and Pengtao Xie) Lecture 8, October 6, 2015 Eric Xing @ CMU, 2015 1

A perennial challenge in computer vision: feature engineering. Hand-designed descriptors: SIFT, Spin image, HoG, RIFT, GLOH, Textons. Drawbacks of feature engineering: 1. Needs expert knowledge 2. Time-consuming hand-tuning. Courtesy: Lee and Ng. Eric Xing @ CMU, 2015 2

Automatic feature learning? Successful learning of intermediate representations [Lee et al ICML 2009, Lee et al NIPS 2009] Eric Xing @ CMU, 2015 3

Deep models. Neural Networks: feed-forward* (you have seen it); Autoencoders (multilayer neural nets with target output = input); non-probabilistic, directed models: PCA, Sparse Coding; probabilistic, undirected models: MRFs and RBMs*; Convolutional Neural Nets; Recursive Neural Networks*. Eric Xing @ CMU, 2015 4

Neural Network. [Figure: input units x1-x5, hidden units z1-z4, and output units y1-y3, connected by weights; each hidden and output unit applies an activation function, and a loss function is applied to the outputs.] Eric Xing @ CMU, 2015 5

Local Computation at Each Unit: linear combination + nonlinear activation. For example, hidden unit 2 computes $z_2 = \sigma\left(\sum_{i=1}^{5} w_{2i} x_i\right)$, where the $w_{2i}$ are the weights connecting to units in the lower layer and $\sigma$ is the activation function. Eric Xing @ CMU, 2015 6

Deep Neural Network: the same structure with more hidden layers, e.g. inputs x1-x5, a first hidden layer z1-z4, a second hidden layer u1-u4, and outputs y1-y3. Eric Xing @ CMU, 2015 7

Activation Functions: applied on the hidden units to achieve nonlinearity. Popular activation functions: sigmoid, tanh, rectified linear. Eric Xing @ CMU, 2015 8
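For concreteness, a minimal NumPy sketch of these three activations (not from the lecture; the function names are mine):

```python
import numpy as np

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes inputs into (-1, 1)
    return np.tanh(x)

def relu(x):
    # rectified linear: max(0, x), applied elementwise
    return np.maximum(0.0, x)
```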

Loss Functions. Squared loss for regression: $L(y, \hat{y}) = \tfrac{1}{2}(y - \hat{y})^2$, where $\hat{y}$ is the prediction and $y$ the true value. Cross-entropy loss for classification: $L(y, \hat{y}) = -\sum_k y_k \log \hat{y}_k$, where $y$ is the (one-hot) class label and $\hat{y}$ the prediction. Eric Xing @ CMU, 2015 9

Neural Network Prediction: compute unit values layer by layer in a forward manner. At each layer, the prediction function applies the weights to the layer's input (a linear combination) and then the activation function to produce the layer's output. Eric Xing @ CMU, 2015 10

Neural Network Prediction (slides 11-16). Working through the figure unit by unit: each hidden unit computes $z_j = \sigma\left(\sum_{i=1}^{5} w_{ji} x_i\right)$ for $j = 1, \dots, 4$, and each output unit computes $y_k = \sigma\left(\sum_{j=1}^{4} u_{kj} z_j\right)$ for $k = 1, 2, 3$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$. Eric Xing @ CMU, 2015 11-16
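A small sketch of this forward pass for the figure's 5-4-3 network, assuming sigmoid activations throughout; the shapes follow the figure, the weight values are random placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical weights for the 5-input, 4-hidden, 3-output network in the figure.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 5))   # w_ji: input -> hidden weights
U = rng.normal(size=(3, 4))   # u_kj: hidden -> output weights

def forward(x):
    z = sigmoid(W @ x)   # z_j = sigma(sum_i w_ji * x_i), j = 1..4
    y = sigmoid(U @ z)   # y_k = sigma(sum_j u_kj * z_j), k = 1..3
    return z, y

z, y = forward(rng.normal(size=5))
```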

Neural Network Training: gradient descent. Back-Propagation (BP): a routine to compute the gradient using the chain rule of derivatives. Eric Xing @ CMU, 2015 17

Neural Network Training. Goal: compute the gradient $\partial E / \partial w_{ji}$ of the training loss $E$ with respect to the weight $w_{ji}$ between unit $i$ and unit $j$. Apply the chain rule: $\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i$, where $a_j = \sum_i w_{ji} z_i$ is the linear combination value and $\delta_j = \partial E / \partial a_j$ is called the error, computed recursively in a backward manner. Eric Xing @ CMU, 2015 18

Neural Network Training. Apply the chain rule (cont'd): gradient = backward error x forward activation, i.e. $\partial E / \partial w_{ji} = \delta_j z_i$, where the errors propagate backward through the derivative of the activation function, $\delta_j = \sigma'(a_j) \sum_k w_{kj} \delta_k$. Pseudo-code of BP: while not converged, 1. compute forward activations; 2. compute backward errors; 3. compute gradients of the weights; 4. perform gradient descent. Eric Xing @ CMU, 2015 19
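Putting the four BP steps together for the same 5-4-3 sigmoid network, as a hedged sketch (the squared-loss choice and all variable names are mine, not the slide's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W, U, lr=0.1):
    """One BP + gradient-descent step for a 5-4-3 sigmoid network with squared loss.
    x: input (5,), t: target (3,), W: hidden weights (4, 5), U: output weights (3, 4)."""
    # 1. compute forward activations
    z = sigmoid(W @ x)
    y = sigmoid(U @ z)
    # 2. compute backward errors, using sigmoid'(a) = s * (1 - s)
    delta_y = (y - t) * y * (1 - y)           # error at the output units
    delta_z = (U.T @ delta_y) * z * (1 - z)   # error at the hidden units
    # 3. gradients of the weights = backward error x forward activation
    grad_U = np.outer(delta_y, z)
    grad_W = np.outer(delta_z, x)
    # 4. perform gradient descent (in place)
    U -= lr * grad_U
    W -= lr * grad_W
    return W, U
```

The outer products make the slide's slogan concrete: each weight's gradient pairs the backward error at the unit it feeds with the forward activation feeding into it.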

Pretraining: a better initialization strategy for the weight parameters. Based on a Restricted Boltzmann Machine or an auto-encoder model. Unsupervised; layer-wise and greedy. Useful when training data is limited; not necessary when training data is plentiful. Eric Xing @ CMU, 2015 20

Restricted Boltzmann Machine Eric Xing @ CMU, 2015 21

Layer-wise Unsupervised Pretraining (slides 22-28). Train an auto-encoder on the raw input, so that the first layer of features can reconstruct the input. Fix those features and train a second auto-encoder that reconstructs them, producing more abstract features. Repeating the procedure layer by layer yields even more abstract features. Eric Xing @ CMU, 2015 22-28

Supervised Fine-Tuning: use the weights learned in unsupervised pretraining to initialize the network, then run BP in the supervised setting. Eric Xing @ CMU, 2015 29
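As an illustration of the greedy, layer-wise idea, here is a simplified tied-weight auto-encoder pretraining sketch (the auto-encoder variant only; the RBM version differs, and biases and mini-batching are omitted; all names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(H, n_hidden, lr=0.1, epochs=20, seed=0):
    """Train one tied-weight auto-encoder layer to reconstruct the rows of H:
    code c = sigmoid(W h), reconstruction r = sigmoid(W.T c), squared error loss."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(n_hidden, H.shape[1]))
    for _ in range(epochs):
        for h in H:
            c = sigmoid(W @ h)            # encode: features
            r = sigmoid(W.T @ c)          # decode: reconstruction of the input
            delta_r = (r - h) * r * (1 - r)
            delta_c = (W @ delta_r) * c * (1 - c)
            # gradient for the tied weight matrix: encoder term + decoder term
            W -= lr * (np.outer(delta_c, h) + np.outer(c, delta_r))
    return W

# Greedy stacking: pretrain layer 1 on the raw input, layer 2 on layer-1 features, ...
X = np.random.default_rng(1).random((100, 20))   # toy unlabeled data
W1 = pretrain_layer(X, n_hidden=10)
H1 = sigmoid(X @ W1.T)                            # features from layer 1
W2 = pretrain_layer(H1, n_hidden=5)               # more abstract features
# W1, W2 then initialize the network before supervised fine-tuning with BP.
```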

Convolutional Neural Network. Some contents are borrowed from Rob Fergus, Yann LeCun and Stanford's course. Eric Xing @ CMU, 2015 30

Ordinary Neural Network. Figure courtesy: Fei-Fei, Andrej Karpathy. Eric Xing @ CMU, 2015 31

Now: all neural net activations are arranged in 3 dimensions. For example, a CIFAR-10 image is a 32*32*3 volume: 32 width, 32 height, 3 depth (RGB). Figure courtesy: Fei-Fei, Andrej Karpathy. Eric Xing @ CMU, 2015 32

Local connectivity. Image: a 32 * 32 * 3 volume. Before, full connectivity: 32 * 32 * 3 weights for each neuron. Now, one unit connects to, e.g., a 5*5*3 chunk and only has 5*5*3 weights. Note the connectivity is local in space but full in depth. Eric Xing @ CMU, 2015 33

Convolution (slides 34-36). One local region only gives one output. Convolution: replicate the column of hidden units across space, with some stride. Example: a 7 * 7 input with 3*3 connectivity and stride = 1 produces a map. What is the size of the map? 5 * 5. What if stride = 2? What if stride = 3? (The small calculation after the convolution summary below works these out.) Eric Xing @ CMU, 2015 34-36

Convolution in practice: zero padding. Input size 7 * 7, filter size 3*3, stride 1, pad with a 1-pixel border. Output size? 7 * 7 => the size is preserved! Slide courtesy: Fei-Fei, Andrej Karpathy. Eric Xing @ CMU, 2015 37

Convolution: summary. For an input volume of size [W1 * H1 * D1], using K units with receptive fields F x F and applying them at stride S gives an output volume [W2 * H2 * D2], where W2 = (W1 - F)/S + 1, H2 = (H1 - F)/S + 1, and D2 = K. (With zero padding of P pixels, W2 = (W1 - F + 2P)/S + 1.) Slide courtesy: Fei-Fei, Andrej Karpathy. Eric Xing @ CMU, 2015 38
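These sizes are easy to check with a few lines of Python; the padding argument is the standard extension of the formula above, and the calls answer the stride questions from the previous slide:

```python
def conv_output_size(w_in, f, stride, pad=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1.
    Raises if the filter does not tile the (padded) input evenly."""
    span = w_in - f + 2 * pad
    if span % stride != 0:
        raise ValueError("filter/stride do not fit the input evenly")
    return span // stride + 1

conv_output_size(7, 3, 1)          # 5: the 5*5 map from the slide
conv_output_size(7, 3, 2)          # 3
conv_output_size(7, 3, 1, pad=1)   # 7: zero padding preserves the size
# conv_output_size(7, 3, 3) raises: stride 3 does not fit a 7-wide input
```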

Convolution: Problem. Assume input [32 * 32 * 3]; 30 units with receptive field 5 * 5, applied at stride 1 / pad 1 => output volume [30 * 30 * 30]. At each position of the output volume we need 5 * 5 * 3 weights => number of weights in such a layer: 27,000 * 75 ≈ 2 million. Idea: weight sharing! Learn one unit, and let that unit convolve across all local receptive fields! Eric Xing @ CMU, 2015 39

Convolution: Problem. Assume input [32 * 32 * 3]; 30 units with receptive field 5 * 5, applied at stride 1 / pad 1 => output volume [30 * 30 * 30] = 27,000 units. Weight sharing => before: number of weights in such a layer: 27,000 * 75 ≈ 2 million => after weight sharing: 30 * 75 = 2250. But also note that sometimes it's not a good idea to do weight sharing! When? Eric Xing @ CMU, 2015 40
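The arithmetic behind the two counts, written out (pure Python, numbers taken from the slide):

```python
# Input 32*32*3, 30 filters of size 5*5*3, stride 1, pad 1 -> output volume 30*30*30.
weights_per_unit = 5 * 5 * 3                       # 75 weights per receptive field
output_units = 30 * 30 * 30                        # 27,000 units in the output volume
no_sharing = output_units * weights_per_unit       # 2,025,000 (~2 million) distinct weights
with_sharing = 30 * weights_per_unit               # 2,250: one shared filter per depth slice
```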

Convolutional Layers. Connect units only to local receptive fields. Use the same unit weight parameters for all units in each depth slice (i.e. across spatial positions). We can call the units "filters". We call the layer convolutional because it is related to the convolution of two signals. Sometimes we also add a bias term b, y = Wx + b, as we did for an ordinary NN. Short question: will convolutional layers introduce nonlinearity? Eric Xing @ CMU, 2015 41
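A deliberately naive NumPy sketch of such a layer (plain loops instead of fast convolution routines, no padding; the shapes and names are my own):

```python
import numpy as np

def conv2d_naive(x, filters, bias, stride=1):
    """Naive convolutional layer. x: (H, W, D_in) input volume,
    filters: (K, F, F, D_in) shared weights, bias: (K,).
    Each of the K filters slides over every spatial position (weight sharing)."""
    H, W, D = x.shape
    K, F, _, _ = filters.shape
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i*stride:i*stride+F, j*stride:j*stride+F, :]
                out[i, j, k] = np.sum(patch * filters[k]) + bias[k]   # y = Wx + b
    return out
```

Note that the layer computes only sums of products plus a bias, so it is linear in its input; the nonlinearity comes from the activation applied afterwards.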

Stacking Convolutional Layers Eric Xing @ CMU, 2015 42

Pooling Layers. In ConvNet architectures, conv layers are often followed by pool layers, which make the representations smaller and more manageable without losing too much information. The most common choice computes a MAX operation. Slide courtesy: Fei-Fei, Andrej Karpathy. Eric Xing @ CMU, 2015 43

Pooling Layers (cont'd). For an input volume of size [W1 x H1 x D1], pooling units with receptive fields F x F applied at stride S give an output volume [W2 x H2 x D1]: the depth is unchanged! W2 = (W1-F)/S+1, H2 = (H1-F)/S+1. Short question: will a pooling layer introduce nonlinearity? Eric Xing @ CMU, 2015 44
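A matching sketch of max pooling over each depth slice independently (again naive loops; names are mine):

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Max pooling on an (H, W, D) volume: depth unchanged, spatial size
    shrinks to (H-F)/S+1 x (W-F)/S+1, each output is the max of its window."""
    H, W, D = x.shape
    H_out = (H - f) // stride + 1
    W_out = (W - f) // stride + 1
    out = np.zeros((H_out, W_out, D))
    for i in range(H_out):
        for j in range(W_out):
            window = x[i*stride:i*stride+f, j*stride:j*stride+f, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out
```

Taking a max is itself a nonlinear operation, unlike the convolution above.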

Nonlinearity. As in an ordinary NN, we need to introduce nonlinearity in a CNN: sigmoid, tanh, or ReLU (Rectified Linear Units) -> preferred, since it simplifies backpropagation, makes learning faster, and avoids saturation issues. Slide courtesy: Yann LeCun. Eric Xing @ CMU, 2015 45

Convolutional Networks: 1989 LeNet: a layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits. [ LeNet ] Slide courtesy, Yangqing Jia Eric Xing @ CMU, 2015 46

Convolutional Nets: 2012 AlexNet: a layered model composed of convolution, subsampling, and further operations followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12. [ AlexNet ] + data + gpu + non-saturating nonlinearity + regularization Slide courtesy, Yangqing Jia Eric Xing @ CMU, 2015 47

Convolutional Nets: 2014 ILSVRC14 Winners: ~6.6% Top-5 error - GoogLeNet: composition of multiscale dimension-reduced modules (pictured) - VGG: 16 layers of 3x3 convolution interleaved with max pooling + 3 fully-connected layers + depth + data + dimensionality reduction Slide courtesy, Yangqing Jia Eric Xing @ CMU, 2015 48

Training CNN: use GPUs. Convolutional layers reduce parameters BUT increase computation. FC layers: each neuron has more weights but less computation. Conv layers: each neuron has fewer weights but more computation. Why? Because of weight sharing: the unit convolves at every position! GPUs are good at convolution. Eric Xing @ CMU, 2015 49

Training CNN: depth matters! 21 layers! The gradient vanishes when the network is too deep, making it slow to learn. Remedies: add intermediate loss layers to produce error signals; do contrast normalization after each conv layer; use ReLU to avoid saturation. Eric Xing @ CMU, 2015 50

Training CNN: a huge model needs more data! Only 7 layers, but 60M parameters! Need more labeled data to train. Data augmentation: crop, translate, rotate, add noise. Eric Xing @ CMU, 2015 51
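A toy sketch of some of the augmentations named on the slide (random crop/translate and additive noise; a horizontal flip is added as a common extra not on the slide, and rotation is omitted to keep the sketch short):

```python
import numpy as np

def augment(img, rng=None):
    """Simple augmentation of an (H, W, 3) image; assumes H, W > 8."""
    rng = rng or np.random.default_rng()
    h, w, _ = img.shape
    i, j = rng.integers(0, 9, size=2)            # random translation via an 8-pixel crop jitter
    crop = img[i:i + h - 8, j:j + w - 8, :]
    if rng.random() < 0.5:
        crop = crop[:, ::-1, :]                  # random horizontal flip
    return crop + rng.normal(0.0, 0.01, crop.shape)   # add a little Gaussian noise
```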

Training CNN: a highly nonconvex objective demands more advanced optimization techniques. Add momentum, as we did for NN. Learning rate policy: decrease the learning rate regularly; let different layers use different learning rates; observe the trend of the objective curve more often. Initialization really matters: supervised pretraining, unsupervised pretraining. Eric Xing @ CMU, 2015 52
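A minimal sketch of two of these techniques, momentum and a step-decay learning-rate policy (per-layer learning rates amount to passing a different lr per parameter group; all names and constants are mine):

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lr=0.01, momentum=0.9):
    """Classic momentum update on lists of NumPy arrays, in place:
    v <- momentum * v - lr * grad, then p <- p + v."""
    for p, g, v in zip(params, grads, velocities):
        v *= momentum
        v -= lr * g
        p += v

def step_decay(base_lr, epoch, drop=0.1, every=10):
    """'Decrease the learning rate regularly': multiply by `drop` every `every` epochs."""
    return base_lr * (drop ** (epoch // every))

# Usage: a layer's weights W, a dummy gradient, and its velocity buffer.
W = np.zeros((4, 5)); vW = np.zeros_like(W)
sgd_momentum_step([W], [np.ones_like(W)], [vW], lr=step_decay(0.01, epoch=12))
```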

Training CNN: avoid overfitting. More data is always the best way to avoid overfitting: data augmentation. Add regularization: recall what we did for linear regression. Dropout. Eric Xing @ CMU, 2015 53
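A short sketch of dropout on a layer's activations (the "inverted" rescaling at training time is a common implementation choice, not something the slide specifies):

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training and
    rescale by 1/(1-p) so no change is needed at test time."""
    if not train:
        return h
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask
```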

Visualize and Understand CNN A CNN transforms the image to 4096 numbers that are then linearly classified. Eric Xing @ CMU, 2015 54

Visualize and Understand CNN. Find images that maximize some class score: yes, Google Inceptionism! Eric Xing @ CMU, 2015 55

Visualize and Understand CNN. More visualizations: https://www.youtube.com/watch?v=agkfiq4iGaM&feature=youtu.be Eric Xing @ CMU, 2015 56

Limitations. Supervised training: needs a huge amount of labeled data, but labels are scarce! Slow training: training an AlexNet on a single machine takes about one week! Optimization: highly nonconvex objective. Parameter tuning is hard: the parameter space is very large. Eric Xing @ CMU, 2015 57

Summary Neural network with specialized connectivity structure Stack multiple stages of feature extractors Higher stages compute more global, more invariant features Classification layer at the end Eric Xing @ CMU, 2015 Slide courtesy, Rob Fergus 58

Summary. Feed-forward: convolve the input, apply a non-linearity (rectified linear), then pooling (local max or mean). Supervised: train the convolutional filters by back-propagating the classification error from the end. Pipeline (bottom to top): input image -> convolution (learned) -> non-linearity -> spatial pooling -> normalization -> feature maps. Eric Xing @ CMU, 2015 59

Further reading. Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks (http://karpathy.github.io/2015/05/21/rnn-effectiveness). Recurrent Neural Networks Tutorial (http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns). Eric Xing @ CMU, 2015 60