EE 511 Neural Networks

1 EE 511 Neural Networks. Instructor: Hanna Hajishirzi. Slides adapted from Ali Farhadi, Mari Ostendorf, Pedro Domingos, Carlos Guestrin, Luke Zettlemoyer, and Andrej Karpathy.

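2 Computational Graphs. Recap of the pieces: scores function, SVM (hinge) loss, and total loss = data loss + regularization; want the gradient of the total loss with respect to W. As a graph: x and W feed a multiply node (*) that produces the scores s; s feeds the hinge loss; W also feeds the regularizer R; a + node adds the hinge loss and R to give L.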

3 Backpropagation: a simple example, e.g. x = -2, y = 5, z = -4. Want: the gradient of the output with respect to x, y, and z.

4-8 Backpropagation: a simple example (backward pass, one step per slide). Gradient values annotated on the graph, starting from the output: 1, 3, -4, -4, -4. Chain rule: multiply each node's local gradient by the gradient arriving from upstream.
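
The gradient values on these slides (1, 3, -4, -4) are consistent with the function f(x, y, z) = (x + y) * z. That function is an assumption here, since the slide figure is not preserved. Under that assumption, a minimal sketch of the forward and backward pass looks like this:

```python
# Assumed function (not stated in the surviving slide text): f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # intermediate node: 3
f = q * z            # output: -12

# Backward pass, in reverse order, applying the chain rule at each node
df_df = 1.0          # gradient of the output with respect to itself
df_dz = q * df_df    # mul gate: local gradient w.r.t. z is q
df_dq = z * df_df    # mul gate: local gradient w.r.t. q is z
df_dx = 1.0 * df_dq  # add gate: local gradient is 1
df_dy = 1.0 * df_dq  # add gate: local gradient is 1

print(df_dx, df_dy, df_dz)
```

Each line of the backward pass is one "local gradient times upstream gradient" product.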

9-12 [Figure: a single gate f with its inputs and output, annotated with the local gradient and the gradients flowing backward through the gate.]

13 Another example:

19 Another example: (-1) * (-0.20) = 0.20

20-21 Another example: [local gradient] x [upstream gradient]. Add gate: [1] x [0.2] = 0.2 and [1] x [0.2] = 0.2 (both inputs!)

22-23 Another example: [local gradient] x [upstream gradient]. Multiply gate: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2
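
The local gradients on the multiply gate imply w0 = 2 and x0 = -1. The rest of this sketch is an assumption: it takes the function to be a sigmoid of w0*x0 + w1*x1 + w2, with w1 = -3, x1 = -2, w2 = -3 chosen so that the forward pass reproduces the 0.73 and 0.2 that appear in this example. With those hedges, the gate-by-gate backward pass might be written as:

```python
import math

# w0 and x0 follow from the slide; w1, x1, w2 are assumed values.
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward pass, one gate at a time
a = w0 * x0          # -2
b = w1 * x1          # 6
c = a + b            # 4
d = c + w2           # 1
e = -d               # -1     (*-1 gate)
g = math.exp(e)      # ~0.37  (exp gate)
h = g + 1.0          # ~1.37  (+1 gate)
out = 1.0 / h        # ~0.73  (1/x gate)

# Backward pass: [local gradient] x [upstream gradient] at every gate
dh = (-1.0 / h ** 2) * 1.0   # 1/x gate, ~-0.53
dg = 1.0 * dh                # +1 gate
de = math.exp(e) * dg        # exp gate, ~-0.20
dd = -1.0 * de               # *-1 gate: (-1) * (-0.20) = 0.20
dc = 1.0 * dd                # add gate passes the gradient to both inputs
dw2 = 1.0 * dd
da = 1.0 * dc
db = 1.0 * dc
dw0 = x0 * da                # mul gate: ~-0.2
dx0 = w0 * da                # mul gate: ~0.4
dw1 = x1 * db
dx1 = w1 * db
```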

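24-25 Sigmoid gate. The sigmoid function sigma(x) = 1 / (1 + e^(-x)) has the derivative d sigma/dx = (1 - sigma(x)) * sigma(x), so the whole sigmoid can be treated as a single gate. With an output of 0.73: (0.73) * (1 - 0.73) = 0.2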
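
As a sanity check on the single-gate shortcut, the sketch below compares the closed-form local gradient with a centered finite difference. The input 1.0 is an assumed value, picked because the sigmoid of 1 is about 0.73; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Closed-form local gradient of the sigmoid gate: (1 - s) * s
    s = sigmoid(x)
    return (1.0 - s) * s

x = 1.0                      # assumed input; sigmoid(1.0) is about 0.73
analytic = sigmoid_grad(x)   # about 0.2

# Numerical gradient by centered difference, for comparison
eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(analytic, numeric)
```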

26 add gate: gradient distributor. Q: What is a max gate? A: max gate: gradient router. Q: What is a mul gate? A: mul gate: gradient switcher.

27 Gradients add at branches: when a value feeds more than one node, the gradients flowing back along each branch are summed (+).
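
The three gate patterns and the branch rule can be sketched as small backward functions. The function names and the numeric inputs are hypothetical, chosen only for illustration.

```python
def add_backward(x, y, upstream):
    # Distributor: both inputs receive the upstream gradient unchanged
    return upstream, upstream

def max_backward(x, y, upstream):
    # Router: only the larger input receives the upstream gradient
    return (upstream, 0.0) if x >= y else (0.0, upstream)

def mul_backward(x, y, upstream):
    # Switcher: each input receives upstream scaled by the *other* input
    return y * upstream, x * upstream

# Gradients add at branches: in f = x * x the value x feeds the mul gate
# twice, so its total gradient is the sum over both branches.
x = 3.0
dx_branch_1, dx_branch_2 = mul_backward(x, x, 1.0)
dx_total = dx_branch_1 + dx_branch_2   # 2 * x = 6.0
```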

28 Gradients for vectorized code (x, y, z are now vectors). The local gradient is now the Jacobian matrix (the derivative of each element of z w.r.t. each element of x).

29 Vectorized operations: f(x) = max(0,x) (elementwise), with a 4096-d input vector and a 4096-d output vector. Q: What is the size of the Jacobian matrix? [4096 x 4096!] In practice we process an entire minibatch (e.g. 100) of examples at one time, i.e. the Jacobian would technical...
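
Because max(0, x) acts elementwise, its Jacobian is diagonal, so the full matrix never has to be formed. A small NumPy sketch, using n = 6 as a stand-in for the 4096-d vector and random values as hypothetical data:

```python
import numpy as np

n = 6                                  # small stand-in for 4096
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
upstream = rng.standard_normal(n)      # gradient of the loss w.r.t. z = max(0, x)

# Explicit Jacobian: n x n, but only the diagonal can be non-zero
jacobian = np.diag((x > 0).astype(float))
dx_full = jacobian.T @ upstream

# Equivalent elementwise form that never builds the matrix
dx_fast = upstream * (x > 0)
```

Both forms give the same gradient; the second costs O(n) memory instead of O(n^2).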

30-34 A vectorized example. Always check: the gradient with respect to a variable should have the same shape as the variable.
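
The function used on these slides is not preserved, so the sketch below assumes f(x, W) = ||W x||^2 purely for illustration. It shows the shape check in practice: each gradient is asserted to match the shape of its variable.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))   # hypothetical weights
x = rng.standard_normal(2)        # hypothetical input

# Forward pass for the assumed function f = sum((W x)^2)
q = W @ x
f = np.sum(q ** 2)

# Backward pass
dq = 2.0 * q                      # df/dq
dW = np.outer(dq, x)              # df/dW
dx = W.T @ dq                     # df/dx

# Always check: gradient shape equals variable shape
assert dW.shape == W.shape
assert dx.shape == x.shape
```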

35 Modularized implementation: forward / backward API. Graph (or Net) object (rough pseudocode)

36-37 Modularized implementation: forward / backward API. Example gate: z = x * y (x, y, z are scalars)
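
A minimal sketch of one gate that follows the forward / backward API for z = x * y. The class name MultiplyGate and the example inputs are illustrative, not taken from a specific library.

```python
class MultiplyGate:
    """Scalar multiply gate: z = x * y."""

    def forward(self, x, y):
        # Cache the inputs: they are the local gradients needed in backward()
        self.x = x
        self.y = y
        return x * y

    def backward(self, dz):
        # dz is the upstream gradient dL/dz
        dx = self.y * dz   # [dz/dx] * [dL/dz]
        dy = self.x * dz   # [dz/dy] * [dL/dz]
        return dx, dy


gate = MultiplyGate()
z = gate.forward(3.0, -4.0)
dx, dy = gate.backward(1.0)
```

Caching the inputs in forward() is what lets backward() run without recomputing anything.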

38 Summary
- Neural nets will be very large: it is impractical to write down the gradient formula by hand for all parameters.
- Backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates.
- Implementations maintain a graph structure, where the nodes implement the forward() / backward() API.
- forward: compute the result of an operation and save any intermediates needed for gradient computation in memory.
- backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs.

39 Neural networks: Architectures. "2-layer Neural Net", or "1-hidden-layer Neural Net"; "3-layer Neural Net", or "2-hidden-layer Neural Net". Both are built from "fully-connected" layers.

40-41 Example feed-forward computation of a neural network. We can efficiently evaluate an entire layer of neurons.
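
Evaluating a whole layer at once amounts to one matrix multiply plus an elementwise activation. The sketch below assumes a 3-layer net with layer sizes 3-4-4-1, sigmoid activations, and random weights; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    # Sigmoid activation (an assumed choice for this sketch)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a 3-4-4-1 network
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))
W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))

x = rng.standard_normal((3, 1))   # input column vector
h1 = f(W1 @ x + b1)               # first hidden layer, all 4 neurons at once
h2 = f(W2 @ h1 + b2)              # second hidden layer
out = W3 @ h2 + b3                # output neuron (no activation)
```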

42 Loop:
1. Sample a batch of data
2. Forward prop it through the graph (network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
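
The four steps of the loop can be sketched on a toy problem. The linear model, squared-error loss, synthetic data, learning rate, and batch size below are all stand-ins chosen to keep the example short; they are not the network from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))          # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                             # synthetic targets

w = np.zeros(3)                            # parameters to learn
learning_rate = 0.1
batch_size = 20
losses = []

for step in range(200):
    # 1. Sample a batch of data
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. Forward prop, get loss
    pred = xb @ w
    loss = np.mean((pred - yb) ** 2)
    # 3. Backprop to calculate the gradient of the loss w.r.t. w
    grad = 2.0 * xb.T @ (pred - yb) / batch_size
    # 4. Update the parameters using the gradient
    w -= learning_rate * grad
    losses.append(loss)
```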

43 Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

44 Activation Functions: Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron
Problems:
1. Saturated neurons "kill" the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive

45 Sigmoid gate with input x. What happens when x = -10? What happens when x = 0? What happens when x = 10?
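
One way to answer the three questions is to evaluate the gate's local gradient, sigma(x) * (1 - sigma(x)), at each point. A minimal sketch (the helper and dictionary names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

local_grads = {}
for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    local_grads[x] = s * (1.0 - s)   # local gradient of the sigmoid gate
    print(x, s, local_grads[x])
```

At x = 0 the local gradient takes its maximum, 0.25. At x = -10 and x = 10 it is below 1e-4, so almost none of the upstream gradient passes through: the neuron is saturated.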

46 Activation Functions: ReLU (Rectified Linear Unit)
- Computes f(x) = max(0,x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid

47 Training Neural Networks: Overview
1. One time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation: model ensembles

48 Step 1: Preprocess the data

49 Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (this is also why you want zero-mean data!) [Figure: the allowed gradient update directions, the resulting zig zag path, and the hypothetical optimal w vector.]
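
For a neuron f = w . x + b, the gradient on w is the upstream gradient times x, so all-positive inputs force every component of that gradient to share one sign. The sketch below illustrates this and the zero-centering step; the data, the feature count, and the upstream value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 255.0, size=(100, 8))   # hypothetical all-positive data

# Gradient on w for one example: dL/dw = (dL/df) * x
x = X[0]
upstream = -0.7                              # hypothetical scalar dL/df
grad_w = upstream * x
same_sign = bool(np.all(grad_w > 0) or np.all(grad_w < 0))

# Preprocessing: subtract the per-feature mean so inputs are zero-centered
X_centered = X - X.mean(axis=0)
grad_w_centered = upstream * X_centered[0]   # components may now differ in sign
```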

50 Step 2: Choose the architecture: say we start with one hidden layer of 50 neurons. Input layer: CIFAR-10 images, 3072 numbers. Hidden layer: 50 hidden neurons. Output layer: 10 output neurons, one per class.

51 Hyperparameters: cross-validation strategy
Coarse -> fine cross-validation in stages:
- First stage: only a few epochs to get a rough idea of what params work
- Second stage: longer running time, finer search (repeat as necessary)
Tip for detecting explosions in the solver: if the cost is ever > 3 * original cost, break out early
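
The staged search and the early-exit tip can be sketched with a toy solver. Everything specific here is an assumption made for illustration: the solver is plain gradient descent on cost(w) = w^2, the only hyperparameter is the learning rate, and candidates are sampled log-uniformly.

```python
import random

def train(lr, num_steps):
    """Toy stand-in for a solver; returns the final cost, or inf if it exploded."""
    w = 5.0
    original_cost = w * w
    cost = original_cost
    for _ in range(num_steps):
        w -= lr * 2.0 * w                 # gradient step on cost(w) = w^2
        cost = w * w
        if cost > 3.0 * original_cost:    # explosion check: break out early
            return float("inf")
    return cost

random.seed(0)

# First stage: few steps, wide range, to get a rough idea of what works
coarse = [10 ** random.uniform(-4, 1) for _ in range(20)]
_, best_coarse_lr = min((train(lr, 5), lr) for lr in coarse)

# Second stage: longer runs, narrower range around the best coarse value
fine = [best_coarse_lr * 10 ** random.uniform(-0.5, 0.5) for _ in range(20)]
best_cost, best_lr = min((train(lr, 50), lr) for lr in fine)
```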

52 Monitor and visualize the accuracy: a big gap between training and validation accuracy = overfitting => increase regularization strength? No gap => increase model capacity?

53 Overfitting How to improve single-model performance? Regularization

54 Regularization: Add term to loss. In common use: L2 regularization (weight decay), L1 regularization, Elastic net (L1 + L2)
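
A sketch of the three penalty terms and how one of them is added to the loss. The function names, the strength lam, and the elastic-net mixing weight beta are assumptions for illustration; the slide does not give exact formulas.

```python
import numpy as np

def l2_penalty(W):
    return np.sum(W ** 2)

def l1_penalty(W):
    return np.sum(np.abs(W))

def elastic_net_penalty(W, beta):
    # One common form: a weighted L2 term plus the L1 term
    return beta * l2_penalty(W) + l1_penalty(W)

def total_loss(data_loss, W, lam):
    # Add term to loss: data loss + lam * R(W), here with R = L2
    return data_loss + lam * l2_penalty(W)

W = np.array([[1.0, -2.0], [0.0, 3.0]])   # hypothetical weights
print(l2_penalty(W), l1_penalty(W), elastic_net_penalty(W, 0.5))
```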

55 Regularization: Dropout In each forward pass, randomly set some neurons to zero Probability of dropping is a hyperparameter; 0.5 is common How can this possibly be a good idea?

56 Regularization: Dropout. Forces the network to have a redundant representation; prevents co-adaptation of features. [Figure: the features "has an ear", "has a tail", "is furry", "has claws", and "mischievous look" feed a cat score, with some of them crossed out (X).] Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model.

57 Dropout: Test time. Dropout makes our output random! The output (label) is a function of the input (image) and a random mask. Want to average out the randomness at test time, i.e. take the expectation over masks. But this integral seems hard.

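58 Dropout: Test time. Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2. At test time we have: E[a] = w1*x + w2*y. During training (dropout probability 0.5) we have: E[a] = 1/2 * (w1*x + w2*y). At test time, multiply by the dropout probability (0.5 here; in general, the probability that a unit is kept).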
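
The test-time scaling can be checked empirically: averaging many randomly masked outputs should approach the activations scaled by the keep probability. In this sketch p is the probability of keeping a unit (equal to the drop probability at 0.5), and the activation values are hypothetical.

```python
import numpy as np

p = 0.5                                  # probability of keeping a unit
rng = np.random.default_rng(0)
h = np.array([1.0, 2.0, 0.5, 1.5])       # hypothetical activations

def dropout_train(h, p, rng):
    # Training: sample a binary mask and zero out the dropped units
    mask = rng.random(h.shape) < p
    return h * mask

def dropout_test(h, p):
    # Test time: no mask, scale by p to match the training-time expectation
    return h * p

avg_train = np.mean([dropout_train(h, p, rng) for _ in range(20000)], axis=0)
test_out = dropout_test(h, p)
```

A common variant, inverted dropout, instead divides by p during training so that the test-time code needs no scaling.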

59 Regularization: A common pattern Training: Add some kind of randomness Testing: Average out randomness (sometimes approximate)

60 Regularization: Data Augmentation. Pipeline: load image and label ("cat") -> CNN -> compute loss. This image by Nikita is licensed under CC-BY 2.0

61 Regularization: Data Augmentation. Pipeline: load image and label ("cat") -> transform image -> CNN -> compute loss.

62 Data Augmentation: random crops and scales
Training: sample random crops / scales. ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
Testing: average a fixed set of crops. ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips
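
The cropping steps can be sketched in NumPy. Resizing is left out, so the input is assumed to be already resized; the function names and the blank test image are hypothetical.

```python
import numpy as np

def random_crop(img, size, rng):
    # Training: sample one random size x size patch
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def ten_crop(img, size):
    # Testing: 4 corners + center, plus the horizontal flip of each
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = [img[t:t + size, l:l + size] for t, l in offsets]
    crops += [c[:, ::-1] for c in crops]
    return crops

rng = np.random.default_rng(0)
img = np.zeros((256, 320, 3))     # stand-in for an image resized to short side 256
patch = random_crop(img, 224, rng)
test_crops = ten_crop(img, 224)
```

At test time the model's predictions on the ten crops would be averaged.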

63 Summary: Backpropagation; strategies for training neural nets; regularization
