Neural Networks. Robot Image Credit: Viktoriya Sukhanova 123RF.com
Transcription
1 Neural Networks These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric. Robot Image Credit: Viktoriya Sukhanova 123RF.com
2 Neural Function Brain function (thought) occurs as the result of the firing of neurons. Neurons connect to each other through synapses, which propagate action potential (electrical impulses) by releasing neurotransmitters. Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds. Learning occurs as a result of the synapses' plasticity: they exhibit long-term changes in connection strength. There are about 10^11 neurons and about 10^14 synapses in the human brain! Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr
3 Biology of a Neuron [figure: anatomy of a biological neuron]
4 Brain Structure Different areas of the brain have different functions. Some areas seem to have the same function in all humans (e.g., Broca's region for motor speech); the overall layout is generally consistent. Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly. We don't know how different functions are assigned or acquired: partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors), and partly the result of experience (learning). We really don't understand how this neural structure leads to what we perceive as consciousness or thought. Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr
5 The One Learning Algorithm Hypothesis [figures: Auditory Cortex, Somatosensory Cortex] Auditory cortex learns to see [Roe et al., 1992]. Somatosensory cortex learns to see [Metin & Frost, 1989]. Based on slide by Andrew Ng
6 Sensor Representations in the Brain Seeing with your tongue. Human echolocation (sonar). Haptic belt: direction sense. Implanting a 3rd eye. [BrainPort; Welsh & Blasch, 1997; Nagel et al., 2005; Constantine-Paton & Law, 2009] Slide by Andrew Ng
7 Comparison of computing power
INFORMATION CIRCA 2012    Computer                          Human Brain
Computation Units         10-core Xeon: 10^9 gates          10^11 neurons
Storage Units             10^9 bits RAM, 10^12 bits disk    10^11 neurons, 10^14 synapses
Cycle time                10^-9 sec                         10^-3 sec
Bandwidth                 10^9 bits/sec                     10^14 bits/sec
Computers are way faster than neurons, but there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel. Neural networks are designed to be massively parallel. The brain is effectively a billion times faster.
8 Neural Networks Origins: algorithms that try to mimic the brain. Very widely used in the 80s and early 90s; popularity diminished in the late 90s. Recent resurgence: state-of-the-art technique for many applications. Artificial neural networks are not nearly as complex or intricate as the actual brain structure. Based on slide by Andrew Ng
9 Neural networks [figure: layered feed-forward network with input units, hidden units, and output units] Neural networks are made up of nodes or units, connected by links. Each link has an associated weight and activation level. Each node has an input function (typically summing over weighted inputs), an activation function, and an output. Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr
10 Neuron Model: Logistic Unit [figure: inputs x_1, x_2, x_3 plus bias unit x_0 = 1 feeding a single sigmoid unit]
x = [x_0; x_1; x_2; x_3], θ = [θ_0; θ_1; θ_2; θ_3], with x_0 = 1 (bias unit)
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
Sigmoid (logistic) activation function: g(z) = 1 / (1 + e^(-z))
Based on slide by Andrew Ng
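As a quick sanity check, here is a minimal NumPy sketch of this logistic unit (the weight and input values below are arbitrary placeholders, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, theta):
    """h_theta(x) = g(theta^T x); x already includes the bias entry x0 = 1."""
    return sigmoid(theta @ x)

# Example: 3 input features plus the bias unit x0 = 1.
x = np.array([1.0, 0.5, -1.2, 3.0])        # [x0, x1, x2, x3]
theta = np.array([-1.0, 2.0, 0.5, 1.0])    # placeholder weights
print(logistic_unit(x, theta))
```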
11 Neural Network [figure: three-layer network with bias units; input x feeds a^(2), which feeds h_Θ(x)] Layer 1 (Input Layer), Layer 2 (Hidden Layer), Layer 3 (Output Layer). Slide by Andrew Ng
12 Feed-Forward Process Input layer units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level. Working forward through the network, the input function of each unit is applied to compute the input value: usually this is just the weighted sum of the activations on the links feeding into this node. The activation function transforms this input function into a final value: typically this is a nonlinear function, often a sigmoid function corresponding to the threshold of that node. Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr
13 Neural Network a_i^(j) = activation of unit i in layer j. Θ^(j) = weight matrix controlling the function mapping from layer j to layer j+1. If the network has s_j units in layer j and s_{j+1} units in layer j+1, then Θ^(j) has dimension s_{j+1} × (s_j + 1). Here, Θ^(1) ∈ R^{3×4} and Θ^(2) ∈ R^{1×4}. Slide by Andrew Ng
14 Vectorization
a_1^(2) = g(z_1^(2)) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + Θ_13^(1) x_3)
a_2^(2) = g(z_2^(2)) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2 + Θ_23^(1) x_3)
a_3^(2) = g(z_3^(2)) = g(Θ_30^(1) x_0 + Θ_31^(1) x_1 + Θ_32^(1) x_2 + Θ_33^(1) x_3)
h_Θ(x) = g(Θ_10^(2) a_0^(2) + Θ_11^(2) a_1^(2) + Θ_12^(2) a_2^(2) + Θ_13^(2) a_3^(2)) = g(z^(3))
Feed-Forward Steps:
z^(2) = Θ^(1) x
a^(2) = g(z^(2))
Add a_0^(2) = 1
z^(3) = Θ^(2) a^(2)
h_Θ(x) = a^(3) = g(z^(3))
Based on slide by Andrew Ng
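A minimal NumPy sketch of these feed-forward steps for one hidden layer (layer sizes and weight values here are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Feed-forward for one hidden layer.
    Theta1: (s2 x (d+1)), Theta2: (K x (s2+1)), x: (d,)."""
    a1 = np.concatenate(([1.0], x))             # add bias unit a0^(1) = 1
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # add bias unit a0^(2) = 1
    z3 = Theta2 @ a2
    return sigmoid(z3)                          # h_Theta(x) = a^(3)

# Example with 3 inputs, 3 hidden units, 1 output (random placeholder weights).
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))
print(forward(np.array([0.2, -0.4, 1.0]), Theta1, Theta2))
```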
15 Other Network Architectures s = [3, 3, 2, 1] [figure: Layer 1, Layer 2, Layer 3, Layer 4] L denotes the number of layers. s ∈ N^L contains the number of nodes at each layer (not counting bias units). Typically, s_0 = d (# input features) and s_{L-1} = K (# classes).
16 Multiple Output Units: One-vs-Rest [figure: network with four output units for Pedestrian, Car, Motorcycle, Truck] h_Θ(x) ∈ R^K. We want: h_Θ(x) ≈ [1; 0; 0; 0] when pedestrian, [0; 1; 0; 0] when car, [0; 0; 1; 0] when motorcycle, [0; 0; 0; 1] when truck. Slide by Andrew Ng
17 Multiple Output Units: One-vs-Rest Given {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, we must convert the labels to a 1-of-K representation: e.g., y_i = [0; 0; 1; 0] when motorcycle, y_i = [0; 1; 0; 0] when car, etc. Based on slide by Andrew Ng
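A small NumPy sketch of this 1-of-K conversion (the class ordering pedestrian/car/motorcycle/truck is just an assumed example):

```python
import numpy as np

def one_of_k(labels, K):
    """Convert integer class labels (0..K-1) to 1-of-K target vectors."""
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

# Classes: 0=pedestrian, 1=car, 2=motorcycle, 3=truck (assumed ordering).
print(one_of_k([2, 1], K=4))
# [[0. 0. 1. 0.]   <- motorcycle
#  [0. 1. 0. 0.]]  <- car
```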
18 Neural Network Classification Given: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}. s ∈ N^L contains the # of nodes at each layer; s_0 = d (# features). Binary classification: y = 0 or 1, 1 output unit (s_{L-1} = 1). Multi-class classification (K classes): y ∈ R^K, e.g., [1;0;0;0], [0;1;0;0], [0;0;1;0], [0;0;0;1] for pedestrian, car, motorcycle, truck; K output units (s_{L-1} = K). Slide by Andrew Ng
19 Understanding Representations
20 Representing Boolean Functions Simple example: AND. [figure: logistic/sigmoid function g(z)]
h_Θ(x) = g(-30 + 20·x_1 + 20·x_2)
x_1  x_2  h_Θ(x)
0    0    g(-30) ≈ 0
0    1    g(-10) ≈ 0
1    0    g(-10) ≈ 0
1    1    g(10) ≈ 1
Based on slide and example by Andrew Ng
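The truth table above can be verified directly; a short NumPy sketch using the slide's weights (-30, +20, +20):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights from the slide: theta = [-30, 20, 20] computes x1 AND x2.
theta = np.array([-30.0, 20.0, 20.0])

for x1 in (0, 1):
    for x2 in (0, 1):
        h = sigmoid(theta @ np.array([1.0, x1, x2]))  # bias unit x0 = 1
        print(x1, x2, round(float(h), 4))
# (0,0)->g(-30)~0, (0,1)->g(-10)~0, (1,0)->g(-10)~0, (1,1)->g(10)~1
```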
21 Representing Boolean Functions AND: h_Θ(x) = g(-30 + 20·x_1 + 20·x_2). OR: h_Θ(x) = g(-10 + 20·x_1 + 20·x_2). NOT: h_Θ(x) = g(10 - 20·x_1). (NOT x_1) AND (NOT x_2): h_Θ(x) = g(10 - 20·x_1 - 20·x_2).
22 Combining Representations to Create Non-Linear Functions Combining (NOT x_1) AND (NOT x_2) with AND and OR yields NOT(XOR): the first layer computes a_1 = x_1 AND x_2 (weights -30, +20, +20) and a_2 = (NOT x_1) AND (NOT x_2) (weights +10, -20, -20); the output layer computes a_1 OR a_2 (weights -10, +20, +20). [figure: positive examples lie in quadrants I and III, so the output fires when the input is in I or III, i.e., when x_1 = x_2] Based on example by Andrew Ng
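A sketch of the resulting two-layer XNOR network, using the Boolean-unit weights from the previous slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer 1: a1 = x1 AND x2, a2 = (NOT x1) AND (NOT x2); layer 2: OR.
Theta1 = np.array([[-30.0,  20.0,  20.0],    # AND
                   [ 10.0, -20.0, -20.0]])   # (NOT x1) AND (NOT x2)
Theta2 = np.array([-10.0, 20.0, 20.0])       # OR

def xnor(x1, x2):
    a = sigmoid(Theta1 @ np.array([1.0, x1, x2]))
    return sigmoid(Theta2 @ np.concatenate(([1.0], a)))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(float(xnor(x1, x2)), 4))  # ~1 iff x1 == x2
```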
23 Layering Representations [figure: handwritten-digit images and the unrolled input vector x_1, ..., x_400] 20 × 20 pixel images, d = 400, 10 classes. Each image is unrolled into a vector x of pixel intensities.
24 Layering Representations [figure: Input Layer x_1, ..., x_d → Hidden Layer → Output Layer (digit classes 0-9); visualization of the hidden layer]
25 LeNet-5 Demonstration:
26 Neural Network Learning
27 Perceptron Learning Rule θ ← θ + α (y - h(x)) x. Equivalent to the intuitive rules: if the output is correct, don't change the weights; if the output is low (h(x) = 0, y = 1), increment the weights for all inputs which are 1; if the output is high (h(x) = 1, y = 0), decrement the weights for all inputs which are 1. Perceptron Convergence Theorem: if there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge [Minsky & Papert, 1969].
28 Batch Perceptron
Given training data {(x^(i), y^(i))}, i = 1, ..., n
Let θ ← [0, 0, ..., 0]
Repeat:
  Let Δ ← [0, 0, ..., 0]
  for i = 1 ... n, do
    if y^(i) (x^(i) · θ) ≤ 0   // prediction for the i-th instance is incorrect
      Δ ← Δ + y^(i) x^(i)
  Δ ← Δ / n   // compute the average update
  θ ← θ + α Δ
Until ||Δ||_2 < ε
Simplest case: α = 1 and don't normalize Δ, which yields the fixed increment perceptron. Each iteration of the outer loop is called an epoch.
Based on slide by Alan Fern
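A NumPy sketch of this batch perceptron, assuming labels y ∈ {-1, +1} and a leading column of 1s in X (the stopping tolerance and epoch cap are assumed defaults, not from the slide):

```python
import numpy as np

def batch_perceptron(X, y, alpha=1.0, tol=1e-6, max_epochs=1000):
    """X: (n, d+1) with a leading column of 1s; y: (n,) labels in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):          # each outer iteration is one epoch
        delta = np.zeros_like(theta)
        for xi, yi in zip(X, y):
            if yi * (xi @ theta) <= 0:   # prediction for this instance is wrong
                delta += yi * xi
        delta /= len(y)                  # average update
        theta += alpha * delta
        if np.linalg.norm(delta) < tol:
            break
    return theta
```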
29 Learning in NN: Backpropagation Similar to the perceptron learning algorithm, we cycle through our examples: if the output of the network is correct, no changes are made; if there is an error, weights are adjusted to reduce the error. The trick is to assess the blame for the error and divide it among the contributing weights. Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr
30 Cost Function
Logistic Regression:
J(θ) = -(1/n) Σ_{i=1..n} [y_i log h_θ(x_i) + (1 - y_i) log(1 - h_θ(x_i))] + (λ/2n) Σ_{j=1..d} θ_j^2
Neural Network: h_Θ ∈ R^K, (h_Θ(x))_i = i-th output
J(Θ) = -(1/n) Σ_{i=1..n} Σ_{k=1..K} [y_ik log(h_Θ(x_i))_k + (1 - y_ik) log(1 - (h_Θ(x_i))_k)] + (λ/2n) Σ_{l=1..L-1} Σ_{i=1..s_l} Σ_{j=1..s_{l+1}} (Θ_ji^(l))^2
The first term penalizes predicting "not k" when the k-th class is true; the second penalizes predicting k when the k-th class is not true.
Based on slide by Andrew Ng
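A NumPy sketch of this cost J(Θ), assuming the network outputs and 1-of-K targets are given as matrices and that bias weights sit in column 0 of each Θ matrix (so they are excluded from regularization):

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularized cross-entropy cost.
    H: (n, K) network outputs in (0, 1); Y: (n, K) 1-of-K targets;
    Thetas: list of weight matrices with bias weights in column 0."""
    n = Y.shape[0]
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / n
    reg_term = (lam / (2 * n)) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return data_term + reg_term
```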
31 Optimizing the Neural Network
J(Θ) = -(1/n) Σ_{i=1..n} Σ_{k=1..K} [y_ik log(h_Θ(x_i))_k + (1 - y_ik) log(1 - (h_Θ(x_i))_k)] + (λ/2n) Σ_{l=1..L-1} Σ_{i=1..s_l} Σ_{j=1..s_{l+1}} (Θ_ji^(l))^2
Solve via gradient descent. J(Θ) is not convex, so gradient descent on a neural net yields a local optimum; but it tends to work well in practice. Need code to compute J(Θ) and the partial derivatives ∂J(Θ)/∂Θ_ij^(l).
Based on slide by Andrew Ng
32 Forward Propagation Given one labeled training instance (x, y):
a^(1) = x
z^(2) = Θ^(1) a^(1)
a^(2) = g(z^(2))   [add a_0^(2) = 1]
z^(3) = Θ^(2) a^(2)
a^(3) = g(z^(3))   [add a_0^(3) = 1]
z^(4) = Θ^(3) a^(3)
a^(4) = h_Θ(x) = g(z^(4))
Based on slide by Andrew Ng
33 Backpropagation Intuition Each hidden node j is responsible for some fraction of the error δ_j^(l) in each of the output nodes to which it connects. δ_j^(l) is divided according to the strength of the connection between the hidden node and the output node. Then, the blame is propagated back to provide the error values for the hidden layer. Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr
34 Backpropagation Intuition [figure: network with δ values at layers 2, 3, 4] δ_j^(l) = error of node j in layer l. Formally, δ_j^(l) = (∂/∂z_j^(l)) cost(x_i), where cost(x_i) = y_i log h_Θ(x_i) + (1 - y_i) log(1 - h_Θ(x_i)). Based on slide by Andrew Ng
35 Backpropagation Intuition [figure] δ_1^(4) = a_1^(4) - y. δ_j^(l) = error of node j in layer l; formally, δ_j^(l) = (∂/∂z_j^(l)) cost(x_i), where cost(x_i) = y_i log h_Θ(x_i) + (1 - y_i) log(1 - h_Θ(x_i)). Based on slide by Andrew Ng
36 Backpropagation Intuition [figure] δ_2^(3) = Θ_12^(3) δ_1^(4). (Same definition of δ_j^(l) as above.) Based on slide by Andrew Ng
37 Backpropagation Intuition [figure] δ_1^(3) = Θ_11^(3) δ_1^(4) and δ_2^(3) = Θ_12^(3) δ_1^(4). (Same definition of δ_j^(l) as above.) Based on slide by Andrew Ng
38 Backpropagation Intuition [figure] δ_2^(2) = Θ_12^(2) δ_1^(3) + Θ_22^(2) δ_2^(3). (Same definition of δ_j^(l) as above.) Based on slide by Andrew Ng
39 Backpropagation: Gradient Computation Let δ_j^(l) = error of node j in layer l (# layers L = 4).
Backpropagation:
δ^(4) = a^(4) - y
δ^(3) = (Θ^(3))^T δ^(4) .* g'(z^(3))
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
(No δ^(1).)
Here .* is the element-wise product, and g'(z^(3)) = a^(3) .* (1 - a^(3)), g'(z^(2)) = a^(2) .* (1 - a^(2)).
∂J(Θ)/∂Θ_ij^(l) = a_j^(l) δ_i^(l+1)   (ignoring λ; exact if λ = 0)
Based on slide by Andrew Ng
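A NumPy sketch of these delta computations for the four-layer case (the function and variable names are ours, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deltas(x, y, Theta1, Theta2, Theta3):
    """delta^(l) for a 4-layer network (L = 4), one training instance (x, y)."""
    a1 = np.concatenate(([1.0], x))                     # a^(1), with bias
    a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))  # a^(2), with bias
    a3 = np.concatenate(([1.0], sigmoid(Theta2 @ a2)))  # a^(3), with bias
    a4 = sigmoid(Theta3 @ a3)                           # a^(4) = h_Theta(x)
    d4 = a4 - y
    # (Theta^T delta), dropping the bias row, times g'(z) = a .* (1 - a):
    d3 = (Theta3.T @ d4)[1:] * a3[1:] * (1 - a3[1:])
    d2 = (Theta2.T @ d3)[1:] * a2[1:] * (1 - a2[1:])
    # Gradient: dJ/dTheta(l)[i,j] = a(l)[j] * delta(l+1)[i], e.g. np.outer(d4, a3).
    return d2, d3, d4   # no delta^(1) for the input layer
```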
40 Backpropagation
Set Δ_ij^(l) = 0 for all l, i, j   (used to accumulate the gradient)
For each training instance (x_i, y_i):
  Set a^(1) = x_i
  Compute {a^(2), ..., a^(L)} via forward propagation
  Compute δ^(L) = a^(L) - y_i
  Compute errors {δ^(L-1), ..., δ^(2)}
  Compute gradients Δ_ij^(l) = Δ_ij^(l) + a_j^(l) δ_i^(l+1)
Compute the average regularized gradient:
  D_ij^(l) = (1/n) (Δ_ij^(l) + λ Θ_ij^(l))   if j ≠ 0
  D_ij^(l) = (1/n) Δ_ij^(l)                  otherwise
D^(l) is the matrix of partial derivatives of J(Θ).
Note: Δ_ij^(l) = Δ_ij^(l) + a_j^(l) δ_i^(l+1) can be vectorized as Δ^(l) = Δ^(l) + δ^(l+1) (a^(l))^T.
Based on slide by Andrew Ng
41 Training a Neural Network via Gradient Descent with Backprop
Given: training set {(x_1, y_1), ..., (x_n, y_n)}
Initialize all Θ^(l) randomly (NOT to 0!)
Loop:   // each iteration is called an epoch
  Set Δ_ij^(l) = 0 for all l, i, j   (used to accumulate the gradient)
  For each training instance (x_i, y_i):   // backpropagation
    Set a^(1) = x_i
    Compute {a^(2), ..., a^(L)} via forward propagation
    Compute δ^(L) = a^(L) - y_i
    Compute errors {δ^(L-1), ..., δ^(2)}
    Compute gradients Δ_ij^(l) = Δ_ij^(l) + a_j^(l) δ_i^(l+1)
  Compute the average regularized gradient D_ij^(l) (as on the previous slide)
  Update weights via the gradient step: Θ_ij^(l) = Θ_ij^(l) - α D_ij^(l)
Until weights converge or the max # of epochs is reached
Based on slide by Andrew Ng
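Putting the pieces together, a self-contained NumPy sketch of this training loop for a single hidden layer (L = 3); the hyperparameters are assumed defaults and may need tuning:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, s_hidden, alpha=0.5, lam=0.0, epochs=5000, seed=0):
    """Gradient descent with backprop, one hidden layer (L = 3).
    X: (n, d), Y: (n, K) 1-of-K targets."""
    n, d = X.shape
    K = Y.shape[1]
    rng = np.random.default_rng(seed)
    Theta1 = rng.uniform(-0.12, 0.12, (s_hidden, d + 1))  # NOT zeros!
    Theta2 = rng.uniform(-0.12, 0.12, (K, s_hidden + 1))
    for _ in range(epochs):                       # one epoch per iteration
        D1, D2 = np.zeros_like(Theta1), np.zeros_like(Theta2)
        for x, y in zip(X, Y):
            a1 = np.concatenate(([1.0], x))       # forward propagation
            a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
            a3 = sigmoid(Theta2 @ a2)
            d3 = a3 - y                           # output error delta^(3)
            d2 = (Theta2.T @ d3)[1:] * a2[1:] * (1 - a2[1:])
            D1 += np.outer(d2, a1)                # accumulate gradients
            D2 += np.outer(d3, a2)
        D1 /= n
        D2 /= n
        D1[:, 1:] += (lam / n) * Theta1[:, 1:]    # regularize j != 0 only
        D2[:, 1:] += (lam / n) * Theta2[:, 1:]
        Theta1 -= alpha * D1                      # gradient step
        Theta2 -= alpha * D2
    return Theta1, Theta2

# Example: learn XOR (not linearly separable); may need alpha/epoch tuning.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])
T1, T2 = train(X, Y, s_hidden=3, alpha=2.0, epochs=5000)
A2 = sigmoid(np.hstack([np.ones((4, 1)), X]) @ T1.T)
print(sigmoid(np.hstack([np.ones((4, 1)), A2]) @ T2.T).round(2))
```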
42 Backprop Issues "Backprop is the cockroach of machine learning. It's ugly, and annoying, but you just can't get rid of it." - Geoff Hinton. Problems: black box; local minima.
43 Implementation Details
44 Random Initialization Important to randomize the initial weight matrices. We can't have uniform initial weights, as in logistic regression; otherwise, all updates will be identical and the net won't learn.
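A minimal sketch of such random initialization; the interval half-width eps = 0.12 is a common choice, not something the slide specifies:

```python
import numpy as np

def rand_init(s_in, s_out, eps=0.12):
    """Weight matrix with entries drawn uniformly from [-eps, eps].
    s_in inputs (plus a bias column), s_out outputs."""
    return np.random.uniform(-eps, eps, size=(s_out, s_in + 1))

Theta1 = rand_init(400, 25)   # e.g., 400 inputs -> 25 hidden units
Theta2 = rand_init(25, 10)    # 25 hidden units -> 10 outputs
```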
45 Implementation Details For convenience, compress all parameters into θ: unroll Θ^(1), Θ^(2), ..., Θ^(L-1) into one long vector θ. E.g., if Θ^(1) is 10 × 11, then the first 110 entries of θ contain the values in Θ^(1). Use the reshape command to recover the original matrices, e.g., Theta1 = reshape(theta[0:110], (10, 11)). Each step, check to make sure that J(θ) decreases. Implement a gradient-checking procedure to ensure that the gradient is correct...
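A NumPy sketch of the unroll/reshape round trip, using the 10 × 11 example size from above:

```python
import numpy as np

Theta1 = np.zeros((10, 11))
Theta2 = np.zeros((1, 11))

# Unroll all parameters into one long vector theta.
theta = np.concatenate([Theta1.ravel(), Theta2.ravel()])

# Recover the original matrices with reshape.
Theta1_rec = theta[:110].reshape(10, 11)
Theta2_rec = theta[110:].reshape(1, 11)
```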
46 Gradient Checking Idea: estimate the gradient numerically to verify the implementation, then turn off gradient checking.
∂J(θ)/∂θ_i ≈ [J(θ^(i+c)) - J(θ^(i-c))] / (2c)
where θ^(i+c) = [θ_1, θ_2, ..., θ_{i-1}, θ_i + c, θ_{i+1}, ...] and c ≈ 1E-4. Change ONLY the i-th entry in θ, increasing (or decreasing) it by c.
Based on slide by Andrew Ng
47 Gradient Checking θ ∈ R^m is an unrolled version of Θ^(1), Θ^(2), ...; θ = [θ_1, θ_2, θ_3, ..., θ_m].
∂J(θ)/∂θ_1 ≈ [J([θ_1 + c, θ_2, θ_3, ..., θ_m]) - J([θ_1 - c, θ_2, θ_3, ..., θ_m])] / (2c)
∂J(θ)/∂θ_2 ≈ [J([θ_1, θ_2 + c, θ_3, ..., θ_m]) - J([θ_1, θ_2 - c, θ_3, ..., θ_m])] / (2c)
...
∂J(θ)/∂θ_m ≈ [J([θ_1, θ_2, θ_3, ..., θ_m + c]) - J([θ_1, θ_2, θ_3, ..., θ_m - c])] / (2c)
Put these in a vector called gradApprox, and check that the approximate numerical gradient matches the entries in the D matrices.
Based on slide by Andrew Ng
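A NumPy sketch of this numerical estimate; J is any function mapping the unrolled parameter vector to the cost:

```python
import numpy as np

def grad_approx(J, theta, c=1e-4):
    """Two-sided numerical estimate of the gradient of J at theta."""
    approx = np.zeros_like(theta)
    for i in range(len(theta)):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += c             # change ONLY the i-th entry
        minus[i] -= c
        approx[i] = (J(plus) - J(minus)) / (2 * c)
    return approx

# Check against the backprop gradient (DVec), e.g.:
# assert np.allclose(grad_approx(J, theta), DVec, atol=1e-6)
```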
48 Implementation Steps Implement backprop to compute DVec (the unrolled {D^(1), D^(2), ...} matrices). Implement numerical gradient checking to compute gradApprox. Make sure DVec has similar values to gradApprox. Then turn off gradient checking and use the backprop code for learning. Important: be sure to disable your gradient-checking code before training your classifier; if you run the numerical gradient computation on every iteration of gradient descent, your code will be very slow. Based on slide by Andrew Ng
49 Putting It All Together
50 Training a Neural Network Pick a network architecture (connectivity pattern between nodes): # input units = # of features in the dataset; # output units = # classes. Reasonable default: 1 hidden layer, or if > 1 hidden layer, have the same # of hidden units in every layer (usually the more the better). Based on slide by Andrew Ng
51 Training a Neural Network 1. Randomly initialize the weights. 2. Implement forward propagation to get h_Θ(x_i) for any instance x_i. 3. Implement code to compute the cost function J(Θ). 4. Implement backprop to compute the partial derivatives ∂J(Θ)/∂Θ_ij^(l). 5. Use gradient checking to compare the gradient computed using backpropagation vs. the numerical gradient estimate; then disable the gradient-checking code. 6. Use gradient descent with backprop to fit the network. Based on slide by Andrew Ng