Deep Learning. Vladimir Golkov, Technical University of Munich, Computer Vision Group

1D Input, 1D Output (figure: target values plotted against a 1D input) 2

2D Input, 1D Output: Data Distribution Complexity. Imagine many dimensions (the data occupies sparse, entangled regions). Deep network: a sequence of (simple) nonlinear disentangling transformations. (The transformation parameters are the optimization variables.) 3

Nonlinear Coordinate Transformation (interactive demo: http://cs.stanford.edu/people/karpathy/convnetjs/). Dimensionality may change! 4
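To make this concrete, here is a minimal NumPy sketch (my own illustration with arbitrary random parameters, not the lecture's code) of a sequence of simple nonlinear coordinate transformations; in a trained network the weights and biases are optimization variables rather than random numbers.

```python
import numpy as np

def transform(x, W, b):
    """One simple nonlinear coordinate transformation: affine map + ReLU kink."""
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=2)                                   # a point in 2D
W1, b1 = rng.normal(size=(5, 2)), rng.normal(size=5)     # 2D -> 5D
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)     # 5D -> 3D
print(transform(transform(x, W1, b1), W2, b2).shape)     # (3,): dimensionality changed
```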

Sequence of Simple Nonlinear Coordinate Transformations. Rectified linear units produce exactly such kinks. Linear separation of the purple and white sheets. 5

Sequence of Simple Nonlinear Coordinate Transformations. Linear separation of dough and butter. 6

Sequence of Simple Nonlinear Coordinate Transformations. The data is sparse (almost lower-dimensional). 7

The Increasing Complexity of Features. Input features: RGB. Layer 1 feature space coordinates: 45° edge yes/no, green patch yes/no. Layer 3 feature space coordinates: person yes/no, car wheel yes/no. [Feature visualizations of layers 1-3 from Zeiler & Fergus, ECCV 2014.] Dimensionality, receptive field, invariance and complexity increase from layer to layer, partly by design and partly by convergence. 8

Network Architecture: Your Decision! You define the architecture; the network learns its parameters.
- Informed approach: if I were to choose layer-wise features by hand in a smart, optimal way, which features, how many features, and with which receptive fields would I choose?
- Bottom-up approach: make the dataset simple (e.g. simple samples and/or few samples), get a simple network (few layers, few neurons/filters) to work at least on the training set, then retrain increasingly complex networks on increasingly complex data.
- Top-down approach: use a network architecture that is known to work well on similar data, get it to work on your data, then tune the architecture if necessary. 9

FORWARD PASS (DATA TRANSFORMATION)

Fully-Connected Layer. x^(0) is the input feature vector of the neural network (one sample); x^(L) is the output vector of a neural network with L layers. Layer number l has:
- Inputs (usually x^(l-1), i.e. the outputs of layer number l-1)
- A weight matrix W_l and a bias vector b_l, both trained (e.g. with stochastic gradient descent) such that the network output x^(L) for the training samples minimizes some objective (loss)
- A nonlinearity s_l (fixed in advance, for example ReLU: z -> max{0, z})
Quiz: why is the nonlinearity useful?
The transformation from x^(l-1) to the output x^(l) performed by layer l is: x^(l) = s_l(W_l x^(l-1) + b_l) 11
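A minimal NumPy sketch of this layer formula (the function and argument names are my own, not from the lecture):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # ReLU: z -> max{0, z}, applied elementwise

def fully_connected(x_prev, W, b, s=relu):
    """One fully-connected layer: x_l = s_l(W_l x_{l-1} + b_l)."""
    return s(W @ x_prev + b)
```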

Example:
W_l = [[0, 0.1, -1], [0.2, 0, 1]], x^(l-1) = (1, 2, 3)^T, b_l = (0, 1.2)^T, i.e. b_1^(l) = 0 and b_2^(l) = 1.2.
W_l x^(l-1) + b_l = (0·1 + 0.1·2 - 1·3 + 0, 0.2·1 + 0·2 + 1·3 + 1.2) = (-2.8, 4.4) 12
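The same numbers checked with NumPy (note that the -1 entry of W_l is implied by the stated result -2.8; the exact layout of the matrix is my reconstruction):

```python
import numpy as np

W = np.array([[0.0, 0.1, -1.0],
              [0.2, 0.0,  1.0]])
x = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 1.2])

print(W @ x + b)                   # [-2.8  4.4]
print(np.maximum(0.0, W @ x + b))  # after the ReLU nonlinearity: [0.  4.4]
```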

Structured Data. Zero-dimensional data: multilayer perceptron. Structured data calls for translation-covariant operations. Neighborhood structure (2D/3D images, 1D biological sequences, ...): convolutional networks. Sequential structure, i.e. memory (1D text, 1D audio, ...): recurrent networks. 13
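As an illustration of these three families (Keras is my choice of framework here, not something the lecture prescribes), the corresponding standard layer types might look like this:

```python
from tensorflow import keras

# Zero-dimensional (unstructured) input vectors: multilayer perceptron
mlp = keras.Sequential([keras.layers.Dense(64, activation="relu"),
                        keras.layers.Dense(10, activation="softmax")])

# Neighborhood structure (e.g. 2D images): convolutional network
convnet = keras.Sequential([keras.layers.Conv2D(16, 3, activation="relu"),
                            keras.layers.GlobalAveragePooling2D(),
                            keras.layers.Dense(10, activation="softmax")])

# Sequential structure with memory (e.g. text, audio): recurrent network
rnn = keras.Sequential([keras.layers.LSTM(32),
                        keras.layers.Dense(10, activation="softmax")])
```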

2D Convolutional Layer. Appropriate for 2D structured data (e.g. images) where we want:
- Locality of feature extraction (far-away pixels do not influence the local output)
- Translation-equivariance (shifting the input in space (the i, j dimensions) yields the same output, shifted in the same way)
x^(l)_{i,j,k'} = s_l( b^(l)_{k'} + sum_{i',j',k} w^(l)_{i-i',j-j',k,k'} x^(l-1)_{i',j',k} )
The size of w along the i, j dimensions is called the filter size; its size along the k dimension is the number of input channels (e.g. three (red, green, blue) in the first layer); its size along the k' dimension is the number of filters (the number of output channels). The same construction works for 1D, 3D, ... (Animation: http://cs231n.github.io/assets/conv-demo/) 14
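A minimal NumPy sketch of this layer ("valid" output positions only, no padding or strides; the variable names are mine). Like most deep learning frameworks it computes a cross-correlation, which differs from the flipped-kernel indexing above only by a re-indexing of the learned weights.

```python
import numpy as np

def conv2d_forward(x_prev, w, b):
    """x_prev: (H, W, C_in), w: (kh, kw, C_in, C_out), b: (C_out,)."""
    kh, kw, c_in, c_out = w.shape
    H, W, _ = x_prev.shape
    out = np.zeros((H - kh + 1, W - kw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x_prev[i:i + kh, j:j + kw, :]              # local receptive field
            out[i, j, :] = np.tensordot(patch, w, axes=3) + b  # all filters at once
    return np.maximum(0.0, out)                                # ReLU nonlinearity
```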

Convolutional Network Interactive: http://scs.ryerson.ca/~aharley/vis/conv/flat.html 15

Nonlinearity 16
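The lecture uses ReLU and later mentions sigmoid and softmax; a minimal NumPy sketch of these common nonlinearities (my own summary, not the slide's content):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # max{0, z}, elementwise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # squashes each value into (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))            # shift for numerical stability
    return e / e.sum()                   # class probabilities summing to 1
```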

Loss Functions
- N-class classification: N outputs; nonlinearity in the last layer: softmax; loss: categorical cross-entropy between outputs x^(L) and targets t (summed over all training samples)
- 2-class classification: 1 output; nonlinearity in the last layer: sigmoid; loss: binary cross-entropy between outputs x^(L) and targets t (summed over all training samples)
- 2-class classification (alternative formulation): 2 outputs; nonlinearity in the last layer: softmax; loss: categorical cross-entropy between outputs x^(L) and targets t (summed over all training samples)
- Many regression tasks: linear output in the last layer; loss: mean squared error between outputs x^(L) and targets t (summed over all training samples) 17
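A hedged NumPy sketch of these losses for a single sample (in practice they are summed or averaged over all training samples or a mini-batch; the function names are mine):

```python
import numpy as np

def categorical_cross_entropy(x_L, t):
    """x_L: softmax outputs over N classes, t: one-hot target vector."""
    return -np.sum(t * np.log(x_L))

def binary_cross_entropy(x_L, t):
    """x_L: single sigmoid output, t: target in {0, 1}."""
    return -(t * np.log(x_L) + (1 - t) * np.log(1 - x_L))

def mean_squared_error(x_L, t):
    """Typical regression loss between outputs and targets."""
    return np.mean((x_L - t) ** 2)
```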

Neural Network Training Procedure
- Fix the number L of layers
- Fix the sizes of the weight arrays and bias vectors (for a fully-connected layer, this corresponds to the number of neurons)
- Fix the nonlinearities
- Initialize the weights and biases with random numbers
- Repeat: select a mini-batch (i.e. a small subset) of the training samples; compute the gradient of the loss with respect to all trainable parameters (all weights and biases), using the chain rule ("error backpropagation") to compute the gradients for the hidden layers; perform a gradient-descent step (or similar) towards minimizing the error
(This is called stochastic gradient descent because every mini-batch is a random subset of the entire training set.)
An important hyperparameter is the learning rate (i.e. the step length factor).
"Early stopping": stop when the loss on a validation set starts increasing (to avoid overfitting). 18
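A minimal runnable sketch of this loop (my own toy example, not from the lecture): mini-batch stochastic gradient descent on a single linear layer with a mean-squared-error loss. A deeper network would obtain its hidden-layer gradients via backpropagation, and early stopping would additionally monitor a validation loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # toy training inputs
t = X @ np.array([1., -2., 0.5, 0., 3.]) + 0.1      # toy regression targets

W = rng.normal(scale=0.1, size=5)                   # random initialization
b = 0.0
learning_rate = 0.01                                # important hyperparameter

for epoch in range(20):
    for idx in np.split(rng.permutation(len(X)), 50):   # mini-batches of 20 samples
        x_batch, t_batch = X[idx], t[idx]
        err = x_batch @ W + b - t_batch             # forward pass and error
        grad_W = 2 * x_batch.T @ err / len(idx)     # gradient of the MSE loss
        grad_b = 2 * err.mean()
        W -= learning_rate * grad_W                 # gradient-descent step
        b -= learning_rate * grad_b
```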

Data Representation: Your Decision! The data representation should be natural (do not outsource known data transformations to the learning of the mapping); make it easy for the network.
- For angles, use sine and cosine to avoid the jump from 360° to 0°. Redundancy is okay!
- Use a fair scale of the features (and of the initial weights and the learning rate) to facilitate optimization.
- Use data augmentation based on natural assumptions.
- Features from different distributions, or missing features: use several disentangled inputs to tell the network!
- Trade-off: the more a network should be able to do, the more data and/or better techniques are required.
- Encode according to the statistical data type (https://en.wikipedia.org/wiki/statistical_data_type): categorical variable: one-hot encoding; ordinal variable: cumulative sum of the one-hot encoding [cf. Jianlin Cheng, arXiv:0704.1028]; etc. 19
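A small NumPy sketch of the representation tricks mentioned above (the helper names are mine; the ordinal encoding follows the slide's "cumulative sum of the one-hot encoding" literally):

```python
import numpy as np

def encode_angle(deg):
    """Angle as (sin, cos): 359 degrees and 1 degree end up close together."""
    rad = np.deg2rad(deg)
    return np.array([np.sin(rad), np.cos(rad)])

def one_hot(category, n_categories):
    """Categorical variable: one-hot encoding."""
    v = np.zeros(n_categories)
    v[category] = 1.0
    return v

def ordinal(level, n_levels):
    """Ordinal variable: cumulative sum of the one-hot encoding."""
    return np.cumsum(one_hot(level, n_levels))

print(encode_angle(359), encode_angle(1))   # nearly identical representations
print(one_hot(2, 4))                        # [0. 0. 1. 0.]
print(ordinal(2, 4))                        # [0. 0. 1. 1.]
```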

Regularization to Avoid Overfitting. Impose meaningful invariances/assumptions about the mapping, in a hard or soft manner:
- Limited complexity of the model: few layers, few neurons, weight sharing
- Locality and shift-equivariance of feature extraction: ConvNets
- Exact spatial locations of features don't matter: pooling; strided convolutions (http://cs231n.github.io/assets/conv-demo/)
- Deep features shouldn't strongly rely on each other: dropout (randomly setting some deep features to zero during training)
- Data augmentation: known meaningful transformations of the training samples
- Random noise (e.g. dropout) in the first or hidden layers
- Optimization algorithm tricks, e.g. early stopping 20
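A minimal sketch of dropout as described above. The rescaling by 1/(1 - drop probability) ("inverted dropout", so that the expected activation is unchanged and the layer can be used as-is at test time) is a common implementation detail and my own addition, not stated on the slide.

```python
import numpy as np

def dropout(x, drop_prob=0.5, training=True, rng=None):
    """Randomly set some deep features to zero during training; identity at test time."""
    if not training or drop_prob == 0.0:
        return x
    rng = rng or np.random.default_rng()
    keep = rng.random(x.shape) >= drop_prob     # random keep/drop mask
    return x * keep / (1.0 - drop_prob)         # rescale the kept features
```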

WHAT DID THE NETWORK LEARN?

Convolutional Network Interactive: http://scs.ryerson.ca/~aharley/vis/conv/flat.html 23

Visualization: creative, no canonical way. Look at standard networks to gain intuition. 24

Image Reconstruction from Deep-Layer Features
- Inverse problem, loss in feature space [Mahendran & Vedaldi, CVPR 2015]
- Another network, loss in image space [Dosovitskiy & Brox, CVPR 2016]: not the entire information content can be retrieved; e.g. many solutions may be known, but only their least-squares compromise can be shown
- Another network, loss in feature space + adversarial loss + loss in image space [Dosovitskiy & Brox, NIPS 2016]: the generative model improvises realistic details 25

Reasons for Activations of Each Neuron [Selvaraju et al., arXiv 2016] 26

Network Bottleneck 27

Network Bottleneck 28

An Unpopular and Counterintuitive Fact. Any amount of information can pass through a layer with just one neuron, "smuggled" as fractional digits (e.g. the values 2, 3, 5, 7, 11 packed into the single number 0.2030507011 and unpacked again afterwards). Quiz time: so why don't we have one-neuron layers? Learning to entangle (before the bottleneck) and to disentangle (after it) is difficult for the usual reasons: the solution is not the global optimum and not perfect, and the optimum on the training set is not the optimum on the test set. This difficulty to entangle/disentangle can be used to our advantage: a hard bottleneck (few neurons) or a soft bottleneck (additional loss terms) yields dimensionality reduction, a regularizing effect, and an understanding of the intrinsic factors of variation of the data. 29

Cost Functions: Your Decision!
- An appropriate measure of quality
- Desired properties of the output (e.g. image smoothness, anatomic plausibility, ...)
- Desired properties of the mapping (e.g. f(x_1) ≈ f(x_2) for x_1 ≈ x_2)
- Weird data distributions: class imbalance, domain adaptation, ...
- If the derivative is zero on more than a null set, better use a smoothed version. Apart from that, all you need is subdifferentiability almost everywhere (not even everywhere), i.e. kinks and jumps are OK (e.g. ReLU). 30
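As one concrete example of adapting the cost function to a weird data distribution (my example, not prescribed by the lecture): for class imbalance, a common choice is to weight the rarer class more strongly in the cross-entropy.

```python
import numpy as np

def weighted_binary_cross_entropy(x_L, t, pos_weight):
    """x_L: sigmoid outputs, t: targets in {0, 1}; pos_weight > 1 upweights
    the rarer positive class."""
    x_L = np.clip(x_L, 1e-7, 1 - 1e-7)       # avoid log(0)
    return -np.mean(pos_weight * t * np.log(x_L) + (1 - t) * np.log(1 - x_L))
```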

THANK YOU!