Backpropagation + Deep Learning



1 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Backpropagation + Deep Learning
Matt Gormley
Lecture 13
Mar 1,

2 Reminders
Homework 5: Neural Networks
Out: Tue, Feb 28
Due: Fri, Mar 9 at 11:59pm

3 Q&A

4 BACKPROPAGATION

5 Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   Decision function
   Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

6 Approaches to Differentiation
Question 1: When can we compute the gradients of the parameters of an arbitrary neural network?
Question 2: When can we make the gradient computation efficient?


7 Approaches to Differentiation
1. Finite Difference Method
   Pro: Great for testing implementations of backpropagation
   Con: Slow for high dimensional inputs / outputs
   Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
   Note: The method you learned in high school
   Note: Used by Mathematica / Wolfram Alpha / Maple
   Pro: Yields easily interpretable derivatives
   Con: Leads to exponential computation time if not carefully implemented
   Required: Mathematical expression that defines f(x)
3. Automatic Differentiation - Reverse Mode
   Note: Called Backpropagation when applied to Neural Nets
   Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
   Con: Slow for high dimensional outputs (e.g. vector-valued functions)
   Required: Algorithm for computing f(x)
4. Automatic Differentiation - Forward Mode
   Note: Easy to implement. Uses dual numbers.
   Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
   Con: Slow for high dimensional inputs (e.g. vector-valued x)
   Required: Algorithm for computing f(x)
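The dual-number idea behind forward mode can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the `Dual` class, the helpers `sin` and `derivative`, and the test function `f` are all hypothetical names chosen here. Each value carries its derivative with respect to one chosen input, which is why one pass yields the derivatives of all outputs with respect to a single input.

```python
import math


class Dual:
    """A number val + dot*eps with eps**2 == 0; `dot` carries the derivative."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def sin(d):
    # Chain rule for sin applied to a dual number.
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)


def derivative(func, x):
    """Derivative of a one-input function at x: seed the input with dot = 1."""
    return func(Dual(x, 1.0)).dot


def f(x):
    # Hypothetical test function: f(x) = x*sin(x) + 3x
    return x * sin(x) + 3 * x
```

For a function of many inputs, this would need one forward pass per input, which matches the "Con" listed for forward mode.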

8 Finite Difference Method
Notes:
   Suffers from issues of floating point precision, in practice
   Typically only appropriate to use on small examples with an appropriately chosen epsilon
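A gradient check of this kind might look like the sketch below. The slide's formula did not survive extraction, so this assumes the common central-difference form (f(x + eps) - f(x - eps)) / (2 eps); the function names, the test function, and the default epsilon are illustrative choices, not taken from the lecture.

```python
import math


def finite_difference_gradient(f, x, epsilon=1e-5):
    """Approximate the gradient of f at x (a list of floats) by central differences."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += epsilon
        x_minus[i] -= epsilon
        # Two function calls per input dimension: slow for high dimensional inputs.
        grad.append((f(x_plus) - f(x_minus)) / (2 * epsilon))
    return grad


def f(v):
    # Hypothetical test function f(x, z) = exp(x*z) + sin(x)
    x, z = v
    return math.exp(x * z) + math.sin(x)


approx = finite_difference_gradient(f, [0.5, 0.3])
# Hand-derived gradient of the test function, for comparison.
exact = [0.3 * math.exp(0.15) + math.cos(0.5), 0.5 * math.exp(0.15)]
```

Comparing `approx` against a hand-derived or backpropagated gradient is the usual way such a check is used. If epsilon is too large the approximation is poor, and if it is too small floating point round-off dominates.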

9 Symbolic Differentiation
Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

10 Symbolic Differentiation
Differentiation Quiz #2:

11 Chain Rule
Whiteboard: Chain Rule of Calculus

12 Chain Rule
Given: $y = g(u)$ and $u = h(x)$.
Chain Rule: $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \frac{du_j}{dx_k}, \quad \forall i, k$
Backpropagation is just repeated application of the chain rule from Calculus

14 Backpropagation
Whiteboard
   Example: Backpropagation for Chain Rule #1
   Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

15 Backpropagation
Automatic Differentiation - Reverse Mode (aka. Backpropagation)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order. For variable u_i with inputs v_1, ..., v_N:
   a. Compute u_i = g_i(v_1, ..., v_N)
   b. Store the result at the node
Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, ..., v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j) (Choice of algorithm ensures computing (du_i/dv_j) is easy)
Return partial derivatives dy/du_i for all variables
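The forward and backward computations above can be sketched as a tiny tape-based implementation. This is a minimal illustration under assumptions of my own: the `Node` and `Tape` classes, the three supported operations, and the example function y = x*z + sin(x) are hypothetical and not part of the lecture.

```python
import math


class Node:
    """One variable u_i in the computation graph."""

    def __init__(self, value, parents=()):
        self.value = value      # result stored at the node during the forward pass
        self.parents = parents  # pairs (input node v_j, local derivative du_i/dv_j)
        self.grad = 0.0         # dy/du_i


class Tape:
    """Records nodes in the order they are computed, which is a topological order."""

    def __init__(self):
        self.nodes = []

    def var(self, value, parents=()):
        node = Node(value, parents)
        self.nodes.append(node)
        return node

    def add(self, a, b):
        return self.var(a.value + b.value, ((a, 1.0), (b, 1.0)))

    def mul(self, a, b):
        return self.var(a.value * b.value, ((a, b.value), (b, a.value)))

    def sin(self, a):
        return self.var(math.sin(a.value), ((a, math.cos(a.value)),))

    def backward(self, y):
        for node in self.nodes:
            node.grad = 0.0                  # initialize all dy/du_j to 0
        y.grad = 1.0                         # dy/dy = 1
        for node in reversed(self.nodes):    # reverse topological order
            for parent, local in node.parents:
                parent.grad += node.grad * local  # increment dy/dv_j


tape = Tape()
x = tape.var(2.0)
z = tape.var(3.0)
y = tape.add(tape.mul(x, z), tape.sin(x))  # hypothetical function y = x*z + sin(x)
tape.backward(y)
```

After `backward`, `x.grad` and `z.grad` hold dy/dx and dy/dz. Note that `x` feeds two nodes, so its gradient is accumulated by two increments, which is why step (b) says "increment" rather than "set".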

16 Backpropagation
Simple Example: The goal is to compute $J = \cos(\sin(x^2) + 3x^2)$ on the forward pass and the derivative $\frac{dJ}{dx}$ on the backward pass.
Forward
   $J = \cos(u)$
   $u = u_1 + u_2$
   $u_1 = \sin(t)$
   $u_2 = 3t$
   $t = x^2$
Backward
   $\frac{dJ}{du} = -\sin(u)$
   $\frac{dJ}{du_1} = \frac{dJ}{du}\frac{du}{du_1}, \quad \frac{du}{du_1} = 1$
   $\frac{dJ}{du_2} = \frac{dJ}{du}\frac{du}{du_2}, \quad \frac{du}{du_2} = 1$
   $\frac{dJ}{dt} = \frac{dJ}{du_1}\frac{du_1}{dt} + \frac{dJ}{du_2}\frac{du_2}{dt}, \quad \frac{du_1}{dt} = \cos(t), \quad \frac{du_2}{dt} = 3$
   $\frac{dJ}{dx} = \frac{dJ}{dt}\frac{dt}{dx}, \quad \frac{dt}{dx} = 2x$
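The simple example can be written out directly in Python, one line per intermediate quantity. The sketch below follows the forward and backward steps of the slide; the function names, the test point `x0`, and the finite-difference comparison are additions made here for illustration.

```python
import math


def forward_backward(x):
    # Forward pass: compute and keep every intermediate quantity.
    t = x ** 2
    u1 = math.sin(t)
    u2 = 3 * t
    u = u1 + u2
    J = math.cos(u)

    # Backward pass: visit the intermediates in reverse order.
    dJ_du = -math.sin(u)
    dJ_du1 = dJ_du * 1.0                          # du/du1 = 1
    dJ_du2 = dJ_du * 1.0                          # du/du2 = 1
    dJ_dt = dJ_du1 * math.cos(t) + dJ_du2 * 3.0   # t feeds both u1 and u2
    dJ_dx = dJ_dt * 2 * x                         # dt/dx = 2x
    return J, dJ_dx


def J_only(x):
    return math.cos(math.sin(x ** 2) + 3 * x ** 2)


x0 = 0.7  # arbitrary test point
J, dJ_dx = forward_backward(x0)
eps = 1e-6
numeric = (J_only(x0 + eps) - J_only(x0 - eps)) / (2 * eps)
```

The central-difference value `numeric` should agree with `dJ_dx` to several decimal places, which is the kind of check the finite difference method is recommended for.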

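18 Backpropagation
Case 1: Logistic Regression
[Figure: network diagram with an Input layer connected to a single Output by weights θ_1, θ_2, θ_3, ..., θ_M]
Forward
   $J = y^* \log y + (1 - y^*) \log(1 - y)$
   $y = \frac{1}{1 + \exp(-a)}$
   $a = \sum_{j=0}^{D} \theta_j x_j$
Backward
   $\frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$
   $\frac{dJ}{da} = \frac{dJ}{dy}\frac{dy}{da}, \quad \frac{dy}{da} = \frac{\exp(-a)}{(\exp(-a) + 1)^2}$
   $\frac{dJ}{d\theta_j} = \frac{dJ}{da}\frac{da}{d\theta_j}, \quad \frac{da}{d\theta_j} = x_j$
   $\frac{dJ}{dx_j} = \frac{dJ}{da}\frac{da}{dx_j}, \quad \frac{da}{dx_j} = \theta_j$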
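As a sketch, the logistic regression case translates into Python almost line for line. The parameter and input values below are made up for illustration, and J is the log-likelihood exactly as written on the slide (not its negative).

```python
import math


def logistic_forward_backward(theta, x, y_star):
    """Return J plus the gradients dJ/dtheta_j and dJ/dx_j for one example."""
    # Forward
    a = sum(t_j * x_j for t_j, x_j in zip(theta, x))
    y = 1.0 / (1.0 + math.exp(-a))
    J = y_star * math.log(y) + (1 - y_star) * math.log(1 - y)

    # Backward
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = math.exp(-a) / (math.exp(-a) + 1) ** 2
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = [dJ_da * x_j for x_j in x]    # da/dtheta_j = x_j
    dJ_dx = [dJ_da * t_j for t_j in theta]    # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx


theta = [0.1, -0.4, 0.25]   # hypothetical parameters
x = [1.0, 2.0, -1.5]        # hypothetical input; x_0 = 1 plays the role of a bias feature
J, dJ_dtheta, dJ_dx = logistic_forward_backward(theta, x, y_star=1)
```

Multiplying out dJ/dy and dy/da gives dJ/da = y* - y, so each entry of `dJ_dtheta` equals (y* - y) x_j, the familiar logistic regression gradient.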

19 Backpropagation
[Figure: network diagram with Input, Hidden Layer, and Output]
(F) Loss: $J = \frac{1}{2}(y - y^*)^2$
(E) Output (sigmoid): $y = \frac{1}{1 + \exp(-b)}$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(C) Hidden (sigmoid): $z_j = \frac{1}{1 + \exp(-a_j)}, \quad \forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \quad \forall j$
(A) Input: Given $x_i, \quad \forall i$

21 Backpropagation
Case 2: Neural Network
Modules, from top to bottom: Loss, Sigmoid, Linear, Sigmoid, Linear
Forward
   $J = y^* \log y + (1 - y^*) \log(1 - y)$
   $y = \frac{1}{1 + \exp(-b)}$
   $b = \sum_{j=0}^{D} \beta_j z_j$
   $z_j = \frac{1}{1 + \exp(-a_j)}$
   $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i$
Backward
   $\frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$
   $\frac{dJ}{db} = \frac{dJ}{dy}\frac{dy}{db}, \quad \frac{dy}{db} = \frac{\exp(-b)}{(\exp(-b) + 1)^2}$
   $\frac{dJ}{d\beta_j} = \frac{dJ}{db}\frac{db}{d\beta_j}, \quad \frac{db}{d\beta_j} = z_j$
   $\frac{dJ}{dz_j} = \frac{dJ}{db}\frac{db}{dz_j}, \quad \frac{db}{dz_j} = \beta_j$
   $\frac{dJ}{da_j} = \frac{dJ}{dz_j}\frac{dz_j}{da_j}, \quad \frac{dz_j}{da_j} = \frac{\exp(-a_j)}{(\exp(-a_j) + 1)^2}$
   $\frac{dJ}{d\alpha_{ji}} = \frac{dJ}{da_j}\frac{da_j}{d\alpha_{ji}}, \quad \frac{da_j}{d\alpha_{ji}} = x_i$
   $\frac{dJ}{dx_i} = \sum_{j=0}^{D} \frac{dJ}{da_j}\frac{da_j}{dx_i}, \quad \frac{da_j}{dx_i} = \alpha_{ji}$
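The neural network case can be sketched the same way, writing `alpha[j][i]` for the hidden-layer weights and `beta[j]` for the output weights. The weights, input, and label below are invented for illustration, and for brevity this sketch has no constant bias unit in the hidden layer.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def nn_forward(alpha, beta, x, y_star):
    a = [sum(a_ji * x_i for a_ji, x_i in zip(row, x)) for row in alpha]  # hidden (linear)
    z = [sigmoid(a_j) for a_j in a]                                      # hidden (sigmoid)
    b = sum(b_j * z_j for b_j, z_j in zip(beta, z))                      # output (linear)
    y = sigmoid(b)                                                       # output (sigmoid)
    J = y_star * math.log(y) + (1 - y_star) * math.log(1 - y)
    return a, z, b, y, J


def nn_backward(alpha, beta, x, y_star):
    a, z, b, y, J = nn_forward(alpha, beta, x, y_star)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * math.exp(-b) / (math.exp(-b) + 1) ** 2
    dJ_dbeta = [dJ_db * z_j for z_j in z]
    dJ_dz = [dJ_db * b_j for b_j in beta]
    dJ_da = [dJ_dz[j] * math.exp(-a[j]) / (math.exp(-a[j]) + 1) ** 2
             for j in range(len(a))]
    dJ_dalpha = [[dJ_da[j] * x_i for x_i in x] for j in range(len(a))]
    # Each input x_i feeds every hidden unit, so its gradient sums over j.
    dJ_dx = [sum(dJ_da[j] * alpha[j][i] for j in range(len(a)))
             for i in range(len(x))]
    return J, dJ_dalpha, dJ_dbeta, dJ_dx


alpha = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]  # hypothetical hidden-layer weights
beta = [0.7, -0.6]                            # hypothetical output weights
x = [1.0, 0.5, -2.0]                          # hypothetical input
J, dJ_dalpha, dJ_dbeta, dJ_dx = nn_backward(alpha, beta, x, y_star=0)
```

Each backward line reuses a quantity computed by the line before it, so the whole gradient costs about as much as one forward pass.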

23 Backpropagation
Whiteboard
   SGD for Neural Network
   Example: Backpropagation for Neural Network

24 Backpropagation
Backpropagation (Auto.Diff. - Reverse Mode)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   a. Compute the corresponding variable's value
   b. Store the result at the node
Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, ..., v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j) (Choice of algorithm ensures computing (du_i/dv_j) is easy)
Return partial derivatives dy/du_i for all variables

25 Background: A Recipe for Machine Learning (Gradients)
1. Given training data
2. Choose each of these:
   Decision function
   Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

26 Summary
1. Neural Networks
   provide a way of learning features
   are highly nonlinear prediction functions
   (can be) a highly parallel network of logistic regression classifiers
   discover useful hidden representations of the input
2. Backpropagation
   provides an efficient way to compute gradients
   is a special case of reverse-mode automatic differentiation

27 DEEP NETS

28 Goals for Today's Lecture
1. Explore a new class of decision functions (Deep Neural Networks)
2. Consider variants of this recipe for training
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   Decision function
   Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

29 Idea #1: No pre-training
Idea #1: (Just like a shallow network)
   Compute the supervised gradient by backpropagation.
   Take small steps in the direction of the gradient (SGD)

30 Comparison on MNIST
Results from Bengio et al. (2006) on MNIST digit classification task
[Figure: bar chart of percent error (lower is better) for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training)]

32 Idea #1: No pre-training
Idea #1: (Just like a shallow network)
   Compute the supervised gradient by backpropagation.
   Take small steps in the direction of the gradient (SGD)
What goes wrong?
A. Gets stuck in local optima
   Nonconvex objective
   Usually start at a random (bad) point in parameter space
B. Gradient is progressively getting more dilute
   "Vanishing gradients"

33 Problem A: Nonconvexity
Where does the nonconvexity come from?
Even a simple quadratic objective z = xy is nonconvex.
[Figure: surface plot of z = xy over axes x and y]
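A quick numerical check makes the claim about z = xy concrete. The sketch below, with test points chosen here for illustration, tests the midpoint inequality that any convex function must satisfy and finds it violated. The mirrored inequality for concavity fails too, so the origin is a saddle point.

```python
def z(x, y):
    return x * y


# Convexity would require z(midpoint) <= average of z at the two endpoints.
p, q = (1.0, -1.0), (-1.0, 1.0)
mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
violates_convexity = z(*mid) > (z(*p) + z(*q)) / 2    # 0 > -1

# Concavity would require z(midpoint) >= average of z at the two endpoints.
r, s = (1.0, 1.0), (-1.0, -1.0)
mid2 = ((r[0] + s[0]) / 2, (r[1] + s[1]) / 2)
violates_concavity = z(*mid2) < (z(*r) + z(*s)) / 2   # 0 < 1
```

A product of two weights is exactly the kind of term that appears when layers are composed, which is one way to see where the nonconvexity of a neural network objective comes from.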


39 Problem A: Nonconvexity
Stochastic Gradient Descent climbs to the top of the nearest hill, which might not lead to the top of the mountain
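The hill-climbing picture can be reproduced on a one-dimensional toy landscape. The function below is a hypothetical example chosen here, with a lower hill near x = -0.96 and a higher one near x = 1.04, and the update is plain (non-stochastic) gradient ascent to keep the sketch deterministic.

```python
def f(x):
    # Hypothetical landscape with two hills of different heights.
    return -(x ** 2 - 1) ** 2 + 0.3 * x


def df(x):
    # Derivative of f.
    return -4 * x * (x ** 2 - 1) + 0.3


def gradient_ascent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x += lr * df(x)  # small step in the direction of the gradient
    return x


x_left = gradient_ascent(-1.5)   # starts on the slope of the lower hill
x_right = gradient_ascent(1.5)   # starts on the slope of the higher hill
```

Both runs stop where the gradient is zero, but only the second reaches the higher peak. The starting point alone decides which hill is climbed.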

40 Problem B: Vanishing Gradients
The gradient for an edge at the base of the network depends on the gradients of many edges above it
The chain rule multiplies many of these partial derivatives together
[Figure: network diagram with Input, three Hidden Layers, and Output]
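To see how quickly such a product can shrink, consider only the sigmoid factors along one path through the network. This is a deliberately simplified sketch with function names chosen here: it ignores the weight factors that also appear in the chain rule, and it uses the best case for the sigmoid, whose derivative is at most 0.25.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_derivative(a):
    s = sigmoid(a)
    return s * (1.0 - s)  # at most 0.25, attained at a = 0


def chain_product(pre_activations):
    """Product of sigmoid derivatives along one path, one factor per layer."""
    product = 1.0
    for a in pre_activations:
        product *= sigmoid_derivative(a)
    return product


# Best case: every unit sits at a = 0, where the derivative is largest.
best_case = {depth: chain_product([0.0] * depth) for depth in (1, 5, 10, 20)}
```

Even in this best case the factor contributed by the sigmoids drops below one in a million after 10 layers, so edges near the input receive a very dilute gradient unless the weights compensate.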


44 Idea #2: Supervised Pre-training
Idea #2: (Two Steps)
   Train each level of the model in a greedy way
   Then use our original idea
1. Supervised Pre-training
   Use labeled data
   Work bottom-up
      Train hidden layer 1. Then fix its parameters.
      Train hidden layer 2. Then fix its parameters.
      ...
      Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   Use labeled data to train following Idea #1
   Refine the features by backpropagation so that they become tuned to the end-task

45 Idea #2: Supervised Pre-training
Idea #2: (Two Steps)
   Train each level of the model in a greedy way
   Then use our original idea
[Figure sequence: the network grows one hidden layer at a time, from Input, Hidden Layer 1, Output up to Input, Hidden Layer 1, Hidden Layer 2, Hidden Layer 3, Output]

49 Comparison on MNIST
Results from Bengio et al. (2006) on MNIST digit classification task
[Figure: bar chart of percent error (lower is better) for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training)]

51 Idea #3: Unsupervised Pre-training
Idea #3: (Two Steps)
   Use our original idea, but pick a better starting point
   Train each level of the model in a greedy way
1. Unsupervised Pre-training
   Use unlabeled data
   Work bottom-up
      Train hidden layer 1. Then fix its parameters.
      Train hidden layer 2. Then fix its parameters.
      ...
      Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   Use labeled data to train following Idea #1
   Refine the features by backpropagation so that they become tuned to the end-task

52 The solution: Unsupervised pretraining
Unsupervised pretraining of the first layer:
   What should it predict?
   What else do we observe?
   The input!
This topology defines an Auto-encoder.
[Figure: network diagram with Input at the bottom, a Hidden Layer, and a reconstructed Input at the top in place of the Output]

54 Auto-Encoders
Key idea: Encourage z to give small reconstruction error:
   x' is the reconstruction of x
   Loss = $\| x - \mathrm{DECODER}(\mathrm{ENCODER}(x)) \|^2$
   Train with the same backpropagation algorithm for 2-layer Neural Networks with x_m as both input and output.
ENCODER: z = h(Wx)
DECODER: x' = h(W'z)
[Figure: network diagram with Input, Hidden Layer, and reconstructed Input]
Slide adapted from Raman Arora
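A very small auto-encoder along these lines could be sketched as follows. Several details are assumptions made here rather than taken from the slide: h is taken to be the sigmoid, there are no bias terms, and the names `W_dec`, `train`, the learning rate, and the initialization range are illustrative.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    return [sum(w * v_i for w, v_i in zip(row, v)) for row in W]


def autoencoder_loss_and_grads(W, W_dec, x):
    """Squared reconstruction error for one example, with gradients by backpropagation."""
    # Forward: ENCODER z = h(Wx), DECODER x' = h(W'z)
    z = [sigmoid(a) for a in matvec(W, x)]
    x_rec = [sigmoid(b) for b in matvec(W_dec, z)]
    loss = sum((x_i - r_i) ** 2 for x_i, r_i in zip(x, x_rec))

    # Backward: the same chain-rule steps as for a 2-layer network
    d_b = [-2 * (x_i - r_i) * r_i * (1 - r_i) for x_i, r_i in zip(x, x_rec)]
    grad_W_dec = [[d * z_j for z_j in z] for d in d_b]
    d_z = [sum(d_b[i] * W_dec[i][j] for i in range(len(x))) for j in range(len(z))]
    d_a = [d_z[j] * z[j] * (1 - z[j]) for j in range(len(z))]
    grad_W = [[d * x_i for x_i in x] for d in d_a]
    return loss, grad_W, grad_W_dec


def train(data, n_hidden, lr=0.5, epochs=200, seed=0):
    """SGD on the reconstruction loss; each x serves as both input and target."""
    rng = random.Random(seed)
    n_in = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    for _ in range(epochs):
        for x in data:
            _, gW, gWd = autoencoder_loss_and_grads(W, W_dec, x)
            W = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, gW)]
            W_dec = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W_dec, gWd)]
    return W, W_dec
```

After training, the encoder weights `W` would be kept as the pretrained hidden layer and the decoder weights `W_dec` discarded, which is how an auto-encoder fits into the layer-wise scheme.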

55 The solution: Unsupervised pretraining
Unsupervised pretraining
   Work bottom-up
      Train hidden layer 1. Then fix its parameters.
      Train hidden layer 2. Then fix its parameters.
      ...
      Train hidden layer n. Then fix its parameters.
Supervised fine-tuning
   Backprop and update all parameters
[Figure sequence: hidden layers are added one at a time above the Input; the final network has Input, three Hidden Layers, and Output]

59 Deep Network Training
Idea #1:
   1. Supervised fine-tuning only
Idea #2:
   1. Supervised layer-wise pre-training
   2. Supervised fine-tuning
Idea #3:
   1. Unsupervised layer-wise pre-training
   2. Supervised fine-tuning

60 Comparison on MNIST
Results from Bengio et al. (2006) on MNIST digit classification task
[Figure: bar chart of percent error (lower is better) for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training)]

62 Is layer-wise pre-training always necessary?
In 2010, a record on a handwriting recognition task was set by standard supervised backpropagation (our Idea #1).
How? A very fast implementation on GPUs.
See Ciresan et al. (2010)

63 Deep Learning
Goal: learn features at different levels of abstraction
Training can be tricky due to
   Nonconvexity
   Vanishing gradients
Unsupervised layer-wise pre-training can help with both!
