Natural Language Processing with Deep Learning CS224N/Ling284. Christopher Manning Lecture 4: Backpropagation and computation graphs


Lecture Plan
Lecture 4: Backpropagation and computation graphs
1. Matrix gradients for our simple neural net and some tips [15 mins]
2. Computation graphs and backpropagation [40 mins]
3. Stuff you should know [15 mins]
   a. Regularization to prevent overfitting
   b. Vectorization
   c. Nonlinearities
   d. Initialization
   e. Optimizers
   f. Learning rates

1. Derivative wrt a weight matrix
Let's look carefully at computing $\frac{\partial s}{\partial W}$. Using the chain rule again:
$$\frac{\partial s}{\partial W} = \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}\,\frac{\partial z}{\partial W}, \qquad s = u^{\top} h, \quad h = f(z), \quad z = W x + b$$
$$x = [\,x_{\text{museums}}\;\; x_{\text{in}}\;\; x_{\text{Paris}}\;\; x_{\text{are}}\;\; x_{\text{amazing}}\,]$$

Deriving gradients for backprop
For this function (following on from last time):
$$\frac{\partial s}{\partial W} = \delta\,\frac{\partial z}{\partial W} = \delta\,\frac{\partial}{\partial W}\,(Wx + b)$$
Let's consider the derivative of a single weight $W_{ij}$. $W_{ij}$ only contributes to $z_i$. For example, $W_{23}$ is only used to compute $z_2$, not $z_1$:
$$\frac{\partial z_i}{\partial W_{ij}} = \frac{\partial}{\partial W_{ij}}\big(W_{i\cdot}\,x + b_i\big) = \frac{\partial}{\partial W_{ij}}\sum_{k} W_{ik}\,x_k = x_j$$
[Figure: the small neural net from last time, with output $s = u^{\top} h$, hidden units $h_1 = f(z_1)$ and $h_2 = f(z_2)$, weight $W_{23}$, bias $b$, and inputs $x_1, x_2, x_3, +1$.]

Deriving gradients for backprop
So, for the derivative of a single $W_{ij}$:
$$\frac{\partial s}{\partial W_{ij}} = \delta_i\, x_j$$
where $\delta_i$ is the error signal from above and $x_j$ is the local gradient signal.
We want the gradient for the full $W$, but each element follows the same pattern. Overall answer, an outer product:
$$\frac{\partial s}{\partial W} = \delta^{\top} x^{\top} \qquad [n \times m] = [n \times 1]\,[1 \times m]$$

Deriving gradients: Tips
Tip 1: Carefully define your variables and keep track of their dimensionality!
Tip 2: Chain rule! If $y = f(u)$ and $u = g(x)$, i.e., $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}$$
Keep straight what variables feed into what computations.
Tip 3: For the top softmax part of a model: first consider the derivative wrt $f_c$ when $c = y$ (the correct class), then consider the derivative wrt $f_c$ when $c \neq y$ (all the incorrect classes).
Tip 4: Work out element-wise partial derivatives if you're getting confused by matrix calculus!
Tip 5: Use the Shape Convention. Note: the error message $\delta$ that arrives at a hidden layer has the same dimensionality as that hidden layer.

Deriving gradients wrt words for window model
The gradient that arrives at and updates the word vectors can simply be split up for each word vector. Let $\delta = \frac{\partial s}{\partial z}$ be the error signal arriving from above, with
$$x_{\text{window}} = [\,x_{\text{museums}}\;\; x_{\text{in}}\;\; x_{\text{Paris}}\;\; x_{\text{are}}\;\; x_{\text{amazing}}\,]$$
We have
$$\frac{\partial s}{\partial x_{\text{window}}} = W^{\top} \delta,$$
which is then split into one chunk per word vector in the window.

Updating word gradients in window model
This will push word vectors around so that they will (in principle) be more helpful in determining named entities. For example, the model can learn that seeing $x_{\text{in}}$ as the word just before the center word is indicative of the center word being a location.

A pitfall when retraining word vectors
Setting: We are training a logistic regression classification model for movie review sentiment using single words.
In the training data we have "TV" and "telly".
In the testing data we have "television".
The pre-trained word vectors have all three close together: TV, telly, television.
Question: What happens when we update the word vectors?

A pitfall when retraining word vectors
Question: What happens when we update the word vectors?
Answer: The words that are in the training data ("TV" and "telly") move around; words not in the training data ("television") stay where they were. This can be bad: "television" is left behind, no longer close to the words it should be similar to.

So what should I do?
Question: Should I use available pre-trained word vectors?
Answer: Almost always, yes! They are trained on a huge amount of data, so they will know about words not in your training data and will know more about the words that are in your training data. Have hundreds of millions of words of data? Then it's okay to start from random vectors.
Question: Should I update ("fine tune") my own word vectors?
Answer: If you only have a small training data set, don't train the word vectors. If you have a large dataset, it probably will work better to train = update = fine-tune the word vectors to the task.

Backpropagation
We've almost shown you backpropagation. It's taking derivatives and using the (generalized) chain rule. The other trick: we re-use derivatives computed for higher layers when computing derivatives for lower layers, so as to minimize computation.

2. Computation Graphs and Backpropagation
We represent our neural net equations as a graph:
Source nodes: inputs
Interior nodes: operations
Edges pass along the result of the operation
Running the graph from inputs to output computes the values: this is forward propagation.

Backpropagation
Then go backwards along the edges, passing along gradients.

Backpropagation: Single Node
A node receives an upstream gradient; the goal is to pass on the correct downstream gradient. Each node also has a local gradient: the gradient of its output with respect to its input. Chain rule:
[downstream gradient] = [upstream gradient] × [local gradient]

Backpropagation: Single Node
What about nodes with multiple inputs (e.g., *)? Multiple inputs mean multiple local gradients: one local gradient per input, and hence one downstream gradient per input, each equal to the upstream gradient times the corresponding local gradient.

An Example
Consider $f(x, y, z) = (x + y)\,\max(y, z)$ evaluated at $x = 1$, $y = 2$, $z = 0$.

Forward prop steps:
$a = x + y = 3$
$b = \max(y, z) = 2$
$f = a \cdot b = 6$

Local gradients:
$\frac{\partial f}{\partial a} = b = 2$, $\frac{\partial f}{\partial b} = a = 3$
$\frac{\partial a}{\partial x} = 1$, $\frac{\partial a}{\partial y} = 1$
$\frac{\partial b}{\partial y} = 1$ (since $y > z$), $\frac{\partial b}{\partial z} = 0$

Backward pass (upstream × local = downstream):
At the output: gradient $= 1$.
At the * node: $1 \cdot 2 = 2$ flows to $a$, and $1 \cdot 3 = 3$ flows to $b$.
At the max node: $3 \cdot 1 = 3$ flows to $y$, and $3 \cdot 0 = 0$ flows to $z$.
At the + node: $2 \cdot 1 = 2$ flows to $x$, and $2 \cdot 1 = 2$ flows to $y$.

Gradients sum at outward branches
Here $y$ feeds into both the + node and the max node, so its gradient is the sum of the gradients arriving from the two branches: $\frac{\partial f}{\partial y} = 2 + 3 = 5$.

Node Intuitions
+ distributes the upstream gradient to each summand.
max routes the upstream gradient: the winning input gets it all, the others get zero.
* switches the upstream gradient: each input receives the upstream gradient times the value of the other input.
(These intuitions are illustrated in the sketch below.)
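As a concrete check, here is a minimal Python sketch of this worked example done by hand (the function $f(x, y, z) = (x + y)\max(y, z)$ and the values $x = 1$, $y = 2$, $z = 0$ are reconstructed from the slide figures):

```python
# Forward and backward pass for f(x, y, z) = (x + y) * max(y, z),
# worked by hand at x = 1, y = 2, z = 0.

def example_backprop(x, y, z):
    # Forward pass: compute and cache intermediate values
    a = x + y           # a = 3
    b = max(y, z)       # b = 2
    f = a * b           # f = 6

    # Backward pass: upstream * local = downstream
    df_df = 1.0                                   # gradient at the output
    df_da = df_df * b                             # * switches: local grad wrt a is b
    df_db = df_df * a                             # * switches: local grad wrt b is a
    df_dx = df_da * 1.0                           # + distributes
    df_dy_plus = df_da * 1.0                      # + distributes
    df_dy_max = df_db * (1.0 if y > z else 0.0)   # max routes
    df_dz = df_db * (1.0 if z > y else 0.0)       # max routes
    df_dy = df_dy_plus + df_dy_max                # gradients sum at outward branches

    return f, (df_dx, df_dy, df_dz)

print(example_backprop(1.0, 2.0, 0.0))  # (6.0, (2.0, 5.0, 0.0))
```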

Efficiency: compute all gradients at once
Incorrect way of doing backprop: first compute the gradient with respect to one set of parameters, then independently compute the gradient with respect to another, redoing the shared upstream work each time. Duplicated computation!
Correct way: compute all the gradients at once, re-using the shared intermediate gradients. This is analogous to using $\delta$ when we computed the gradients by hand.

Back-Prop in General Computation Graph
Assume a single scalar output $z$.
1. Fprop: visit nodes in topological sort order; compute the value of each node given its predecessors.
2. Bprop: initialize the output gradient to 1, then visit nodes in reverse order, computing the gradient wrt each node using the gradients wrt its successors. If $\{y_1, y_2, \ldots, y_n\}$ are the successors of $x$:
$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}$$
Done correctly, the big-O complexity of fprop and bprop is the same. In general our nets have a regular layer structure, so we can use matrices and Jacobians. (A sketch of the general algorithm follows.)
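A minimal sketch of this general algorithm, under the assumption that each node stores its operation, its local-gradient function, and its predecessors (the `Node` class and its fields are illustrative, not code from the lecture):

```python
# Minimal sketch of fprop/bprop over a general computation graph.

class Node:
    def __init__(self, op, local_grads, inputs):
        self.op = op                    # computes this node's value from its inputs' values
        self.local_grads = local_grads  # returns d(output)/d(input) for each input
        self.inputs = inputs            # list of predecessor Nodes (empty for source nodes)
        self.value = None
        self.grad = 0.0

def fprop(topo_order):
    # Visit nodes in topological order; compute each value from its predecessors.
    for node in topo_order:
        node.value = node.op(*[p.value for p in node.inputs])

def bprop(topo_order):
    # Initialize the output gradient to 1, then visit nodes in reverse order.
    topo_order[-1].grad = 1.0
    for node in reversed(topo_order):
        locals_ = node.local_grads(*[p.value for p in node.inputs])
        for parent, local in zip(node.inputs, locals_):
            # Gradients from different successors accumulate (sum) at each node.
            parent.grad += node.grad * local

# Usage on the earlier example: f = (x + y) * max(y, z) at x = 1, y = 2, z = 0
x = Node(lambda: 1.0, lambda: [], [])
y = Node(lambda: 2.0, lambda: [], [])
z = Node(lambda: 0.0, lambda: [], [])
a = Node(lambda u, v: u + v, lambda u, v: [1.0, 1.0], [x, y])
b = Node(lambda u, v: max(u, v), lambda u, v: [float(u >= v), float(v > u)], [y, z])
f = Node(lambda u, v: u * v, lambda u, v: [v, u], [a, b])
order = [x, y, z, a, b, f]
fprop(order); bprop(order)
print(f.value, x.grad, y.grad, z.grad)  # 6.0 2.0 5.0 0.0
```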

Automatic Differentiation
The gradient computation can be automatically inferred from the symbolic expression of the fprop. Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output. Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you, but mainly leave it to the layer/node writer to hand-calculate the local derivative.
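For instance, with PyTorch's autograd the same tiny example requires no hand-derived gradients (a minimal sketch; each op's local derivative is supplied by the framework):

```python
import torch

# The framework records the graph during the forward pass and
# backpropagates through it automatically.
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(0.0, requires_grad=True)

f = (x + y) * torch.max(y, z)
f.backward()

print(x.grad, y.grad, z.grad)  # tensor(2.) tensor(5.) tensor(0.)
```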

Backprop Implementations

Implementation: forward/backward API
[Slides show example code: each graph node/gate implements a forward() method and a backward() method.]
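As an illustration of that API, a single gate might look roughly like this (a sketch under the forward/backward convention above, not the exact code from the slides):

```python
class MultiplyGate:
    """Toy node implementing the forward/backward API: out = x * y."""

    def forward(self, x, y):
        # Cache the inputs; they are the local gradients needed in backward.
        self.x, self.y = x, y
        return x * y

    def backward(self, d_out):
        # downstream = upstream * local: the * node "switches" its inputs.
        dx = d_out * self.y
        dy = d_out * self.x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)    # 6.0
dx, dy = gate.backward(1.0)     # (2.0, 3.0)
```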

Gradient checking: Numeric Gradient
For small $h$ (about 1e-4), approximate the derivative with the centered difference:
$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$
Easy to implement correctly, but approximate and very slow: you have to recompute $f$ for every parameter of your model. Useful for checking your implementation. In the old days, when we hand-wrote everything, it was key to do this everywhere; now it is much less needed, when throwing together layers.
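A small NumPy sketch of such a check (the quadratic test function at the end is just an illustration):

```python
import numpy as np

def numeric_grad(f, x, h=1e-4):
    """Centered-difference gradient: (f(x+h) - f(x-h)) / (2h), one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fp = f(x)
        x.flat[i] = old - h
        fm = f(x)
        x.flat[i] = old                      # restore the parameter
        grad.flat[i] = (fp - fm) / (2 * h)   # slow: two evaluations of f per parameter
    return grad

# Compare against a known analytic gradient, e.g. for f(x) = sum(x**2) the gradient is 2x:
x = np.random.randn(5)
approx = numeric_grad(lambda v: np.sum(v ** 2), x)
print(np.allclose(approx, 2 * x, atol=1e-6))  # True if the analytic gradient matches
```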

Summary
We've mastered the core technology of neural nets!
Backpropagation: recursively apply the chain rule along the computation graph.
[downstream gradient] = [upstream gradient] × [local gradient]
Forward pass: compute results of operations and save intermediate values.
Backward pass: apply the chain rule to compute gradients.

Why learn all these details about gradients?
Modern deep learning frameworks compute gradients for you. But why take a class on compilers or systems when they are implemented for you? Understanding what is going on under the hood is useful!
Backpropagation doesn't always work perfectly, and understanding why is crucial for debugging and improving models. See the Karpathy article (in the syllabus): https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
Example in a future lecture: exploding and vanishing gradients.

3. We have models with many params! Regularization!
A full loss function in practice includes regularization over all parameters $\theta$, e.g., L2 regularization:
$$J(\theta) = J_{\text{data}}(\theta) + \lambda \sum_{k} \theta_k^2$$
Regularization (largely) prevents overfitting when we have a lot of features (or, later, a very powerful/deep model, etc.).
[Figure: training error keeps decreasing as model power grows, while test error eventually rises again: overfitting.]
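In code this just adds a penalty term to the data loss. A PyTorch-style sketch with assumed names (`model`, `criterion`, `logits`, `targets`, `reg_lambda` are placeholders; in practice the same effect is often obtained via the optimizer's weight_decay argument):

```python
# L2 regularization: add lambda * (sum of squared parameters) to the data loss.
data_loss = criterion(logits, targets)                         # assumed to be defined elsewhere
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())   # sum_k theta_k^2
loss = data_loss + reg_lambda * l2_penalty
loss.backward()
```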

Vectorization
E.g., looping over word vectors versus concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:
Looping: 1000 loops, best of 3: 639 µs per loop
Vectorized: 10000 loops, best of 3: 53.8 µs per loop

Vectorization
The (10x) faster method uses a single C × N matrix multiply. Always try to use vectors and matrices rather than for loops! You should speed-test your code a lot, too. tl;dr: Matrices are awesome!
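A rough sketch of the comparison (dimensions and timings are illustrative and will vary by machine):

```python
import numpy as np
import timeit

d, C, N = 300, 5, 500                  # word-vector dim, number of classes, number of words
W = np.random.randn(C, d)              # softmax weights
wordvecs = [np.random.randn(d) for _ in range(N)]

def with_loop():
    return [W.dot(v) for v in wordvecs]      # one matrix-vector product per word

def vectorized():
    X = np.column_stack(wordvecs)            # d x N matrix of word vectors
    return W.dot(X)                          # a single C x N matrix product

print(timeit.timeit(with_loop, number=100))
print(timeit.timeit(vectorized, number=100)) # typically around an order of magnitude faster
```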

Non-linearities: The starting points
logistic ("sigmoid"), tanh, hard tanh
tanh is just a rescaled and shifted sigmoid (twice as steep, range $[-1, 1]$):
$$\tanh(z) = 2\,\mathrm{logistic}(2z) - 1$$
Both logistic and tanh are still used in particular places, but are no longer the defaults for making deep networks.

Non-linearities: The new world order
ReLU (rectified linear unit), Leaky ReLU, Parametric ReLU
$$\mathrm{rect}(z) = \max(z, 0)$$
For building a feed-forward deep network, the first thing you should try is ReLU: it trains quickly and performs well due to good gradient backflow.
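For reference, a quick NumPy sketch of these activations (the 0.01 slope for Leaky ReLU is a typical choice; in Parametric ReLU that slope is a learned parameter):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)               # rect(z) = max(z, 0)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)    # small negative slope keeps some gradient flowing

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)            # linear in [-1, 1], saturates outside
```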

Parameter Initialization
You normally must initialize weights to small random values, to avoid symmetries that prevent learning/specialization.
Initialize hidden layer biases to 0, and output (or reconstruction) biases to the optimal value if the weights were 0 (e.g., the mean target, or the inverse sigmoid of the mean target).
Initialize all other weights $\sim \mathrm{Uniform}(-r, r)$, with $r$ chosen so the numbers get neither too big nor too small.
Xavier initialization has variance inversely proportional to the fan-in $n_{in}$ (previous layer size) and fan-out $n_{out}$ (next layer size):
$$\mathrm{Var}(W_i) = \frac{2}{n_{in} + n_{out}}$$
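A minimal sketch of the uniform form of Xavier initialization (the 300 → 100 layer sizes are illustrative):

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    """Xavier/Glorot initialization: Var(W) = 2 / (n_in + n_out).
    For a Uniform(-r, r) draw, Var = r^2 / 3, so r = sqrt(6 / (n_in + n_out))."""
    r = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-r, r, size=(n_out, n_in))

W = xavier_uniform(300, 100)   # weights for a 300 -> 100 layer
b = np.zeros(100)              # hidden-layer biases start at 0
```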

Optimizers
Usually, plain SGD will work just fine. However, getting good results will often require hand-tuning the learning rate (next slide).
For more complex nets and situations, or just to avoid worry, you often do better with one of a family of more sophisticated "adaptive" optimizers that scale the parameter adjustment by an accumulated gradient. These methods give per-parameter learning rates:
Adagrad
RMSprop
Adam ← a fairly good, safe place to begin in many cases
SparseAdam
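In a framework this is a one-line choice. A PyTorch sketch (`model`, `compute_loss`, and `batch` are assumed placeholders):

```python
import torch

# Plain SGD with a hand-tuned learning rate:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Or an adaptive optimizer; Adam is a fairly good, safe default:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Standard training step:
optimizer.zero_grad()
loss = compute_loss(model, batch)   # assumed helper
loss.backward()
optimizer.step()
```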

Learning Rates
You can just use a constant learning rate. Start around lr = 0.001?
It must be the right order of magnitude: try powers of 10.
Too big: your model may diverge or not converge.
Too small: your model may not have trained by the deadline.
Better results can generally be obtained by allowing the learning rate to decrease as you train:
By hand: halve the learning rate every k epochs (an epoch = a pass through the data, shuffled or sampled).
By a formula: $lr = lr_0\, e^{-kt}$, for epoch $t$.
There are fancier methods like cyclic learning rates (q.v.).
Fancier optimizers still use a learning rate, but it may be an initial rate that the optimizer shrinks, so you may be able to start it high.
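Both schedules in one small sketch (lr0, k, and the halving interval are illustrative values, not from the lecture):

```python
import math

lr0 = 0.001   # initial learning rate

def halve_every_k_epochs(epoch, k=3):
    # "By hand": halve the learning rate every k epochs.
    return lr0 * 0.5 ** (epoch // k)

def exponential_decay(epoch, k=0.1):
    # "By a formula": lr = lr0 * e^(-k t) for epoch t.
    return lr0 * math.exp(-k * epoch)

for t in range(10):
    print(t, halve_every_k_epochs(t), exponential_decay(t))
```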