
Lecture 11: Multi-layer Perceptron, Forward Pass, Backpropagation. Aykut Erdem, November 2016, Hacettepe University

Administrative Assignment 2 due Nov. 10, 2016! Midterm exam on Monday, Nov. 14, 2016. You are responsible for the material from the beginning of the course up to the end of this week. You can prepare and bring a full-page copy sheet (A4 paper, both sides) to the exam. Assignment 3 will be out soon! It is due December 1, 2016. You will implement a 2-layer Neural Network. 2

Last time Linear classification slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 3

Last time Linear classification slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 4

Last time Linear classification slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 5

Interactive web demo time. slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/ 6

Last time Perceptron Inputs x_1, x_2, x_3, ..., x_n with synaptic weights w_1, ..., w_n; output f(x) = Σ_i w_i x_i = ⟨w, x⟩. slide by Alex Smola 7

This Week Multi-layer perceptron Forward Pass Backward Pass 8

Introduction 9

A brief history of computers (slide by Alex Smola)
        1970s   1980s       1990s            2000s    2010s
Data    10^2    10^3        10^5             10^8     10^11
RAM     ?       1MB         100MB            10GB     1TB
CPU     ?       10MF        1GF              100GF    1PF GPU
                deep nets   kernel methods            deep nets
Data grows at a higher exponent; Moore's law (silicon) vs. Kryder's law (disks). Early algorithms were data bound, now CPU/RAM bound. 10

Not linearly separable data Some datasets are not linearly separable! - e.g. XOR problem slide by Alex Smola Nonlinear separation is trivial 11

Addressing non-linearly separable data Two options: - Option 1: Non-linear features - Option 2: Non-linear classifiers slide by Dhruv Batra 12

Option 1 Non-linear features Choose non-linear features, e.g., - Typical linear features: w_0 + Σ_i w_i x_i - Example of non-linear features: degree-2 polynomials, w_0 + Σ_i w_i x_i + Σ_{i,j} w_{ij} x_i x_j The classifier h_w(x) is still linear in the parameters w - As easy to learn - Data is linearly separable in higher dimensional spaces - Express via kernels slide by Dhruv Batra 13
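As a rough illustration of Option 1 (a minimal NumPy sketch; the feature map and variable names are my own, not from the slides), degree-2 polynomial features keep the classifier linear in its parameters w:

```python
import numpy as np

def degree2_features(x):
    """Map input x to degree-2 polynomial features: [1, x_i, x_i * x_j]."""
    quad = np.outer(x, x)[np.triu_indices(len(x))]  # all pairwise products x_i * x_j (i <= j)
    return np.concatenate(([1.0], x, quad))

x = np.array([0.5, -1.0])            # 2-D input
phi = degree2_features(x)            # 6-D feature vector: bias, x1, x2, x1^2, x1*x2, x2^2
w = np.random.randn(phi.shape[0])    # classifier is still linear in w
score = w @ phi
```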

Option 2 Non-linear classifiers Choose a classifier h_w(x) that is non-linear in the parameters w, e.g., decision trees, neural networks, ... More general than linear classifiers, but can often be harder to learn (non-convex optimization required). Often very useful (outperforms linear classifiers). In a way, both ideas are related. slide by Dhruv Batra 14

Biological Neurons - Soma (CPU): cell body, combines signals - Dendrite (input bus): combines the inputs from several other nerve cells - Synapse (interface): interface and parameter store between neurons - Axon (cable): may be up to 1m long and transports the activation signal to neurons at different locations slide by Alex Smola 15

Recall: The Neuron Metaphor Neurons - accept information from multiple inputs, - transmit information to other neurons. Multiply inputs by weights along edges Apply some function to the set of inputs at each node slide by Dhruv Batra 16

Types of Neuron - Linear Neuron: y = θ_0 + Σ_i x_i θ_i - Perceptron: z = θ_0 + Σ_i x_i θ_i, y = 1 if z ≥ 0, 0 otherwise - Logistic Neuron: z = θ_0 + Σ_i x_i θ_i, y = 1 / (1 + e^(-z)) Potentially more. Requires a convex loss function for gradient descent training. slide by Dhruv Batra 17
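A minimal sketch of the three neuron types in NumPy (assuming the θ parameterization above; the function names are illustrative):

```python
import numpy as np

def linear_neuron(x, theta0, theta):
    return theta0 + theta @ x                   # y = theta_0 + sum_i x_i theta_i

def perceptron_neuron(x, theta0, theta):
    z = theta0 + theta @ x
    return 1.0 if z >= 0 else 0.0               # hard threshold

def logistic_neuron(x, theta0, theta):
    z = theta0 + theta @ x
    return 1.0 / (1.0 + np.exp(-z))             # sigmoid squashing
```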

Limitation A single neuron still gives only a linear decision boundary. What to do? Idea: stack a bunch of them together (see the XOR sketch below)! slide by Dhruv Batra 18
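As a concrete illustration of stacking (hand-picked, hypothetical weights, not from the slides), two layers of threshold neurons compute XOR, which no single neuron can:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

# XOR with two stacked perceptron-style layers
W1 = np.array([[ 1.0,  1.0],      # hidden unit 1: fires for x1 OR x2
               [-1.0, -1.0]])     # hidden unit 2: fires for NOT (x1 AND x2)
b1 = np.array([-0.5, 1.5])
w2 = np.array([1.0, 1.0]); b2 = -1.5   # output: fires only if both hidden units fire

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x) + b1)
    y = step(w2 @ h + b2)
    print(x, int(y))              # prints 0, 1, 1, 0
```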

Nonlinearities via Layers Cascade neurons together: the output from one layer is the input to the next, and each layer has its own set of weights. Kernels: y_1i = k(x_i, x). Deep nets: y_1i(x) = σ(⟨w_1i, x⟩), y_2(x) = σ(⟨w_2, y_1⟩), and optimize all weights. slide by Alex Smola 19

Nonlinearities via Layers y_1i(x) = σ(⟨w_1i, x⟩), y_2i(x) = σ(⟨w_2i, y_1⟩), y_3(x) = σ(⟨w_3, y_2⟩) slide by Alex Smola 20

Representational Power A neural network with at least one hidden layer is a universal approximator (it can approximate any continuous function). Proof in: Approximation by Superpositions of a Sigmoidal Function, Cybenko, 1989. The capacity of the network increases with more hidden units and more hidden layers. slide by Raquel Urtasun, Richard Zemel, Sanja Fidler 21

A simple example Consider a neural network with two layers of neurons: - neurons in the top layer represent known shapes (the digit classes 0-9), - neurons in the bottom layer represent pixel intensities. A pixel gets to vote if it has ink on it. Each inked pixel can vote for several different shapes. The shape that gets the most votes wins. slide by Geoffrey Hinton 22

How to display the weights (figure: weight maps for output units 1 2 3 4 5 6 7 8 9 0 and the input image) Give each output unit its own map of the input image and display the weight coming from each pixel in the location of that pixel in the map. Use a black or white blob with the area representing the magnitude of the weight and the color representing the sign. slide by Geoffrey Hinton 23

How to learn the weights (figure: weight maps for output units 1 2 3 4 5 6 7 8 9 0 and the image) Show the network an image and increment the weights from active pixels to the correct class. Then decrement the weights from active pixels to whatever class the network guesses. slide by Geoffrey Hinton 24
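A minimal sketch of this voting-and-weight-update scheme (the array shapes, learning rate, and names are my assumptions, not from the slide):

```python
import numpy as np

n_pixels, n_classes = 32 * 32, 10
W = np.zeros((n_classes, n_pixels))          # one weight map per digit class

def train_step(image, label, lr=1.0):
    """image: binary vector of active (inked) pixels; label: correct class index."""
    votes = W @ image                        # each inked pixel votes for every class
    guess = np.argmax(votes)                 # the shape with the most votes wins
    W[label] += lr * image                   # increment weights from active pixels to the correct class
    W[guess] -= lr * image                   # decrement weights from active pixels to the guessed class
    return guess
```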

(Figure slides 25-29: the weight maps for output units 1 2 3 4 5 6 7 8 9 0 after successive training images. slides by Geoffrey Hinton)

The learned weights (figure: weight maps for output units 1 2 3 4 5 6 7 8 9 0 and the image) The details of the learning algorithm will be explained later. slide by Geoffrey Hinton 30

Why this is insufficient A two-layer network with a single winner in the top layer is equivalent to having a rigid template for each shape. - The winner is the template that has the biggest overlap with the ink. The ways in which hand-written digits vary are much too complicated to be captured by simple template matches of whole shapes. - To capture all the allowable variations of a digit, we need to learn the features it is composed of. slide by Geoffrey Hinton 31

Multilayer Perceptron Layer representation: y_i = W_i x_i, x_{i+1} = σ(y_i) (typically). Iterate between a linear mapping W x and a nonlinear function σ. A loss function l(y, y_i) measures the quality of the estimate so far. (Figure: a stack x_1 → W_1 → x_2 → W_2 → x_3 → W_3 → x_4 → W_4 → y.) slide by Alex Smola 32
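A minimal sketch of this layer representation (the layer sizes below are hypothetical; the final layer is left linear, matching the figure):

```python
import numpy as np

def sigma(z):                                    # nonlinearity (sigmoid here)
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights):
    """Iterate between the linear map W_i x_i and the nonlinearity sigma."""
    for W in weights[:-1]:
        x = sigma(W @ x)                         # x_{i+1} = sigma(W_i x_i)
    return weights[-1] @ x                       # final linear layer produces y

rng = np.random.default_rng(0)
Ws = [rng.standard_normal(s) for s in [(8, 4), (8, 8), (8, 8), (1, 8)]]  # W_1 .. W_4
y = mlp_forward(rng.standard_normal(4), Ws)
```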

Forward Pass 33

Forward Pass: What does the Network Compute? The output of the network can be written as: h_j(x) = f(v_{j0} + Σ_{i=1}^D x_i v_{ji}), o_k(x) = g(w_{k0} + Σ_{j=1}^J h_j(x) w_{kj}) (j indexing hidden units, k indexing the output units, D the number of inputs). Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU): σ(z) = 1 / (1 + exp(-z)), tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)), ReLU(z) = max(0, z). slide by Raquel Urtasun, Richard Zemel, Sanja Fidler 34

Forward Pass in Python Example code for a forward pass for a 3-layer network in Python (see the sketch below). Can be implemented efficiently using matrix operations. In the example, W_1 is a matrix of size 4x3 and W_2 is 4x4. What about the biases and W_3? [http://cs231n.github.io/neural-networks-1/] slide by Raquel Urtasun, Richard Zemel, Sanja Fidler 35
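The code on the slide is an image and is not reproduced in this transcript; the following is a sketch in the style of the cs231n notes the slide cites. The shapes also answer the question above: W_3 is 1x4, and each layer has its own bias vector.

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))                  # activation function (sigmoid)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)   # first layer: 4x3 weights, 4x1 biases
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)   # second layer: 4x4 weights, 4x1 biases
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)   # output layer: 1x4 weights, 1x1 bias

x = np.random.randn(3, 1)                # random input vector (3x1)
h1 = f(np.dot(W1, x) + b1)               # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)              # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                # output neuron (1x1)
```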

Special Case What is a single-layer (no hiddens) network with a sigmoid activation function? Network: logistic regression! o_k(x) = 1 / (1 + exp(-z_k)), z_k = w_{k0} + Σ_{j=1}^J x_j w_{kj}. slide by Raquel Urtasun, Richard Zemel, Sanja Fidler 36

Example Application Consider trying to classify an image of a handwritten digit: 32x32 pixels. Single output unit: is it a 4 (one vs. all)? Classify the image as 4 vs. non-4. Use the sigmoid output function: o_k = 1 / (1 + exp(-z_k)), z_k = w_{k0} + Σ_{j=1}^J h_j(x) w_{kj}. We can train the network, that is, adjust all the parameters w, to optimize the training objective, but this is a complicated function of the parameters. How would you build your network? For example, use one hidden layer and the sigmoid activation function. How can we train the network, that is, adjust all the parameters w? slide by Raquel Urtasun, Richard Zemel, Sanja Fidler 37

Training Neural Networks Find weights: w* = argmin_w Σ_{n=1}^N loss(o^(n), t^(n)), where o = f(x; w) is the output of the neural network. Define a loss function, e.g.: - Squared loss: Σ_k ½ (o_k^(n) - t_k^(n))² - Cross-entropy loss: -Σ_k t_k^(n) log o_k^(n) Gradient descent: w^{t+1} = w^t - η ∂E/∂w^t, where η is the learning rate (and E is the error/loss). slide by Raquel Urtasun, Richard Zemel, Sanja Fidler 38
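A minimal sketch of gradient-descent training for the single sigmoid-output case with a (mean) cross-entropy loss; the data shapes, learning rate, and names are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, t, lr=0.1, steps=1000):
    """Plain gradient descent; X is N x D, t holds N binary targets."""
    N, D = X.shape
    w, w0 = np.zeros(D), 0.0
    for _ in range(steps):
        o = sigmoid(X @ w + w0)          # forward pass: o^(n) for every example
        grad_z = (o - t) / N             # dE/dz for sigmoid + cross-entropy
        w  -= lr * (X.T @ grad_z)        # w^{t+1} = w^t - eta * dE/dw
        w0 -= lr * grad_z.sum()
    return w, w0
```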

Useful derivatives (slide by Raquel Urtasun, Richard Zemel, Sanja Fidler)
name      function                                           derivative
Sigmoid   σ(z) = 1 / (1 + exp(-z))                           σ(z)(1 - σ(z))
Tanh      tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))  1 / cosh²(z)
ReLU      ReLU(z) = max(0, z)                                1 if z > 0, 0 if z ≤ 0
39
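The same derivatives written out in NumPy (a small illustrative sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # sigma(z) * (1 - sigma(z))

def d_tanh(z):
    return 1.0 / np.cosh(z) ** 2         # 1 / cosh^2(z)

def d_relu(z):
    return (z > 0).astype(float)         # 1 if z > 0, else 0

z = np.linspace(-2.0, 2.0, 5)
print(d_sigmoid(z), d_tanh(z), d_relu(z))
```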

Backpropagation and Neural Networks 40

Recap: Loss function/optimization We defined a (linear) score function, f(x; W) = Wx (figure: example class scores for the training images). TODO: 1. Define a loss function that quantifies our unhappiness with the scores across the training data. 2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization). slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 41

Softmax Classifier (Multinomial Logistic Regression) (a sequence of worked-example slides, 42-51) slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
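These softmax slides appear only as figures in the transcript. As a rough sketch, assuming the standard softmax cross-entropy loss L_i = -log(e^{s_{y_i}} / Σ_j e^{s_j}) that this material presents, the computation looks like:

```python
import numpy as np

def softmax_loss(scores, y):
    """Cross-entropy loss for one example: L = -log( e^{s_y} / sum_j e^{s_j} )."""
    shifted = scores - np.max(scores)            # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[y])

scores = np.array([3.2, 5.1, -1.7])              # unnormalized class scores (hypothetical)
print(softmax_loss(scores, y=0))                 # loss when the correct class is 0
```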

Optimization 52

Gradient Descent slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 53

Mini-batch Gradient Descent only use a small portion of the training set to compute the gradient slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 54

Mini-batch Gradient Descent only use a small portion of the training set to compute the gradient (see the sketch below); there are also fancier update formulas (momentum, Adagrad, RMSProp, Adam, ...) slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 55
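A minimal sketch of vanilla mini-batch gradient descent (the function grad_fn, batch size, and learning rate are hypothetical placeholders, not from the slides):

```python
import numpy as np

def minibatch_gd(X, t, grad_fn, w, lr=0.01, batch_size=256, epochs=10):
    """grad_fn(X_batch, t_batch, w) should return dE/dw evaluated on that batch."""
    N = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(N)
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]        # small portion of the training set
            w -= lr * grad_fn(X[batch], t[batch], w)     # update using the mini-batch gradient only
    return w
```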

The effects of different update formulas slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson (image credits to Alec Radford) 56
