Perceptron: This is convolution!


[Figure: shared weights]

Filter = local perceptron. Also called kernel.

By pooling responses at different locations, we gain robustness to the exact spatial location of image features.
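
To make the pooling idea concrete, here is a minimal numpy sketch (not from the original slides): the same strong filter response, shifted by one pixel within a pooling window, produces exactly the same pooled map.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Naive max pooling over a 2D response map."""
    h, w = x.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

# The same strong response at two nearby locations inside one pooling window:
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0
print(max_pool2d(a))  # [[1. 0.] [0. 0.]]
print(max_pool2d(b))  # identical pooled map, despite the one-pixel shift
```

The robustness only holds for shifts within a pooling window; stacking several conv + pool stages tolerates progressively larger displacements.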

AlexNet diagram (simplified) [Krizhevsky et al. 2012]
Input: 227 x 227 x 3
Conv 1: 11 x 11 x 3, stride 4, 96 filters
Max pool: 3 x 3, stride 2
Conv 2: 5 x 5 x 96, stride 1, 256 filters
Max pool: 3 x 3, stride 2
Conv 3: 3 x 3 x 256, stride 1, 384 filters
Conv 4: 3 x 3 x 192, stride 1, 384 filters
Conv 5: 3 x 3 x 192, stride 1, 256 filters
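
For reference, a sketch of this convolutional stack in PyTorch (an illustration under simplifying assumptions: the original two-GPU channel split is omitted, padding choices are approximate, and the fully connected layers are not shown):

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),               # Conv 1
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),                     # 3x3 pool, stride 2
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),    # Conv 2
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),                     # 3x3 pool, stride 2
    nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),   # Conv 3
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),   # Conv 4
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),   # Conv 5
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.zeros(1, 3, 227, 227)
print(features(x).shape)  # torch.Size([1, 256, 6, 6])
```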

Training neural networks: learning the weight matrices W

Gradient descent [Figure: a 1D function f(x) plotted against x]

General approach: Pick a random starting point a_1.

General approach: Compute the gradient at that point, ∇f(a_1) (analytically or by finite differences).

General approach: Move through parameter space in the direction of the negative gradient: a_2 = a_1 - γ ∇f(a_1), where γ = the amount to move = the learning rate.

General approach: Move through parameter space in the direction of the negative gradient: a_3 = a_2 - γ ∇f(a_2), where γ = the amount to move = the learning rate.

General approach: Stop when we don't move any more. We reach a_stop when the step γ ∇f(a_{n-1}) = 0.

Gradient descent: an optimizer for functions. It is guaranteed to find the optimum for convex functions; for non-convex functions it finds a local optimum, and most vision problems aren't convex. It works for multivariate functions, where we need to compute the matrix of partial derivatives (the Jacobian).
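
A minimal sketch of the "general approach" above on a toy convex function (not from the original slides; the function, starting point, and step count are arbitrary):

```python
import numpy as np

def f(a):
    return (a - 3.0) ** 2 + 1.0   # simple convex toy function, minimum at a = 3

def grad_f(a):
    return 2.0 * (a - 3.0)        # analytic gradient

a = np.random.randn()             # pick a random starting point
gamma = 0.1                       # learning rate
for _ in range(1000):
    step = gamma * grad_f(a)
    a = a - step                  # move against the gradient
    if abs(step) < 1e-8:          # stop when we don't move any more
        break
print(a, f(a))                    # approximately 3.0 and 1.0
```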

Train a NN with gradient descent: (x_i, y_i), i = 1, ..., n, are the training examples; f(x) is the feed-forward neural network; L(x, y; θ) is some loss function. The loss function measures how good our network is at classifying the training examples with respect to the parameters of the model (the perceptron weights).

Train a NN with gradient descent. [Figure: the loss function (evaluating the NN on the training data) plotted against the model parameters (the perceptron weights), with iterates a_1, a_2, a_3, ..., a_stop]

What is an appropriate loss? Detection: define some output threshold. Classification: compare the training class to the output class. Zero-one loss (per class): L(y, ŷ) = 0 if ŷ = y, else 1, where y is the true label and ŷ is the predicted label. Is it good? Nope, it's a step function, and I need to compute the gradient of the loss. This loss is not differentiable, and it flips easily.

Classification has binary outputs. Special function on the last layer, the softmax: it "squashes" a C-dimensional vector o of arbitrary real values into a C-dimensional vector σ(o) of real values in the range (0, 1) that add up to 1. This turns the output into a probability distribution over classes.
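
A minimal numpy version of the softmax (a sketch, not from the original slides; subtracting the max is the usual numerical-stability trick):

```python
import numpy as np

def softmax(o):
    """Squash a C-dimensional score vector into probabilities in (0, 1) that sum to 1."""
    e = np.exp(o - o.max())   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])
print(softmax(scores), softmax(scores).sum())  # approx [0.71 0.26 0.04], sums to 1.0
```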

Cross-entropy loss function: the negative log-likelihood, L(x, y; θ) = -Σ_j y_j log p(c_j | x). Is it a good loss? It is differentiable, and the cost decreases as the probability of the correct class increases. [Figure: L plotted against p(c_j | x)]
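
A small sketch of this negative log-likelihood for a one-hot label (not from the original slides; the probabilities are made up):

```python
import numpy as np

def cross_entropy(y_onehot, probs, eps=1e-12):
    """Negative log-likelihood of the true class under the predicted distribution."""
    return -np.sum(y_onehot * np.log(probs + eps))

y = np.array([0.0, 1.0, 0.0])                       # true class is class 1
print(cross_entropy(y, np.array([0.2, 0.7, 0.1])))  # approx 0.36
print(cross_entropy(y, np.array([0.2, 0.2, 0.6])))  # approx 1.61: a worse prediction costs more
```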

But the ReLU is not differentiable at 0! Right. Fudge it:
- 0 is the best place for this to occur, because we don't care about the result there (no activation).
- Risk: dead perceptrons (units stuck at zero activation).
- ReLU has an unbounded positive response: potentially faster convergence, but also the risk of overstepping.
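
A sketch of the "fudge" in code (not from the original slides): we simply define the derivative at exactly 0 to be 0; some implementations pick 1 there instead, and either choice works in practice.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Not differentiable at exactly 0; we just pick a value there (here 0, i.e. "no activation").
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), relu_grad(x))   # [0. 0. 3.] [0. 0. 1.]
```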

Optimization demo: http://www.emergentmind.com/neural-network (thank you, Matt Mazur).

Stochastic gradient descent: the dataset can be too large to strictly apply gradient descent. Instead, randomly sample a data point, perform a gradient descent step on that point, and iterate; the true gradient is only approximated. Picking a subset of points at a time gives a mini-batch.
Pick a starting W and a learning rate γ. While not at a minimum: shuffle the training set, then for each data point i = 1, ..., n (maybe as mini-batches) take a gradient descent step. One full pass over the training set is an epoch.
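
A sketch of this mini-batch SGD loop on a hypothetical least-squares problem (not from the original slides; the data, batch size, and learning rate are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                     # toy dataset
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = rng.normal(size=5)                             # pick a starting W
gamma = 0.05                                       # learning rate
batch_size = 32
for epoch in range(20):                            # one pass over the data = one epoch
    order = rng.permutation(len(X))                # shuffle the training set
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]      # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # mean-squared-error gradient
        w = w - gamma * grad                       # gradient step on the mini-batch
print(w)                                           # close to true_w
```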

Stochastic gradient descent: the loss will not always decrease (locally), since the training data point is chosen at random, but it still converges over time. [Image: Wikipedia]

Gradient descent oscillations: slow to converge to the (local) optimum. [Image: Wikipedia]

Momentum: adjust the gradient by a weighted sum of the previous amount plus the current amount.
Without momentum: θ_{t+1} = θ_t - γ ∇L(θ_t)
With momentum (new α parameter): θ_{t+1} = θ_t - γ (α ∇L(θ_{t-1}) + ∇L(θ_t))
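
A sketch of this update rule on a toy quadratic loss (not from the original slides; note that most libraries instead keep a running velocity v = α v + ∇L(θ) and step with that, which accumulates all past gradients rather than only the previous one):

```python
import numpy as np

def loss_grad(theta):
    # gradient of a hypothetical toy loss L(theta) = 0.5 * ||theta||^2
    return theta

theta = np.array([5.0, -3.0])
gamma, alpha = 0.1, 0.9
prev_grad = np.zeros_like(theta)
for _ in range(100):
    g = loss_grad(theta)
    theta = theta - gamma * (alpha * prev_grad + g)   # the update rule above
    prev_grad = g
print(theta)   # approaches the minimum at the origin
```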

"But James, I thought we were going to treat machine learning like a black box? I like black boxes." Deep learning is: a black box. [Figure: training data goes in, a classifier comes out]

"But James, I thought we were going to treat machine learning like a black box? I like black boxes." Deep learning is: a black box, and also a black art. [Image: http://www.isrtv.com/]

"But James, I thought we were going to treat machine learning like a black box? I like black boxes." There are many approaches and hyperparameters: activation functions, learning rate, mini-batch size, momentum. Often these need tweaking, and you need to know what they do to change them intelligently.

Nailing hyperparameters + trade-offs

Lowering the learning rate = smaller steps in SGD: less ping-ponging, but it takes longer to get to the optimum. [Image: Wikipedia]

Flat regions in energy landscape

Problem of fitting Too many parameters = overfitting Not enough parameters = underfitting More data = less chance to overfit How do we know what is required?

Regularization: attempt to guide the solution so that it does not overfit, while still giving it freedom with many parameters.

Data fitting problem [Nielsen]

Which is better? Which is better a priori? A 9th-order polynomial or a 1st-order polynomial? [Nielsen]

Regularization: attempt to guide the solution so that it does not overfit, while still giving it freedom with many parameters. Idea: penalize the use of parameters, to prefer small weights.

Regularization idea: add a cost to having high weights, with λ = the regularization parameter. [Nielsen]

Both can describe the data, but one is simpler. Occam's razor: "Among competing hypotheses, the one with the fewest assumptions should be selected." For us: large weights cause large changes in behaviour in response to small changes in the input; simpler models (or smaller changes) are more robust to noise.

Regularization idea: add a cost to having high weights, with λ = the regularization parameter: C = C_0 + (λ / 2n) Σ_w w², where C_0 is the normal cross-entropy loss (binary classes) and the second term is the regularization term. [Nielsen]
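
A sketch of the regularized cost and the extra gradient it contributes (not from the original slides; the base loss value, λ, n, and the weights are made up):

```python
import numpy as np

def l2_regularized_loss(base_loss, weights, lam, n):
    """C = C_0 + (lam / 2n) * sum of squared weights."""
    return base_loss + (lam / (2 * n)) * sum(np.sum(w ** 2) for w in weights)

def l2_penalty_grad(w, lam, n):
    """Gradient of the penalty term alone: it shrinks each weight toward zero ("weight decay")."""
    return (lam / n) * w

weights = [np.array([[3.0, -1.0], [0.5, 2.0]])]
print(l2_regularized_loss(0.42, weights, lam=0.1, n=100))  # base loss plus a small penalty (approx 0.427)
print(l2_penalty_grad(weights[0], lam=0.1, n=100))          # each weight is nudged toward zero
```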

Regularization: Dropout. Our networks typically start with random weights, so every time we train we get a slightly different outcome. Why random weights? If the weights were all equal, the responses across filters would be equivalent, and the network wouldn't train. [Nielsen]

Regularization: our networks typically start with random weights, so every time we train we get a slightly different outcome. Why not train 5 different networks with random starts and vote on their outcome? It works fine! It helps generalization because the error is averaged.

Regularization: Dropout [Nielsen]

Regularization: Dropout. At each mini-batch: randomly select a subset of neurons and ignore them (set their outputs to zero). At test time: halve the outgoing weights to compensate for having trained on half the neurons. Effect: neurons become less dependent on the outputs of particular connected neurons, and the network is forced to learn more robust features that are useful to many subsets of neurons. It is like averaging over many different networks trained from different random initializations, except much cheaper to train. [Nielsen]
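
A sketch of this scheme with p_keep = 0.5 (not from the original slides; scaling the test-time activations by p_keep has the same effect as halving the outgoing weights, and many modern implementations use "inverted dropout", scaling by 1/p_keep at training time instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p_keep=0.5):
    """Training: randomly select a subset of neurons and ignore (zero) them."""
    mask = rng.random(activations.shape) < p_keep
    return activations * mask

def dropout_test(activations, p_keep=0.5):
    """Test: keep every neuron but scale by p_keep, compensating for training on fewer neurons."""
    return activations * p_keep

a = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_train(a))  # some entries zeroed at random
print(dropout_test(a))   # [0.5 1.  1.5 2. ]
```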

Many forms of regularization Adding more data is a kind of regularization Pooling is a kind of regularization Data augmentation is a kind of regularization