Dept. of Computing Science & Math
1 Lecture 4: Multi-Layer Perceptrons
2 Review of Gradient Descent Learning
1. The purpose of neural network training is to minimize the output errors on a particular set of training data by adjusting the network weights w.
2. We define a cost function E(w) that measures how far the current network's output is from the desired one.
3. Partial derivatives of the cost function, ∂E(w)/∂w, tell us which direction we need to move in weight space to reduce the error.
4. The learning rate η specifies the step sizes we take in weight space for each iteration of the weight update equation.
5. We keep stepping through weight space until the errors are small enough (a code sketch of one such step follows this list).
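The update loop is easy to state in code. Below is a minimal sketch, not from the lecture, of a single gradient descent step in Python/NumPy; `cost_grad` is a hypothetical helper standing in for whatever computes ∂E(w)/∂w.

```python
import numpy as np

def gradient_descent_step(w, cost_grad, eta=0.1):
    """One iteration of gradient descent: move a small step
    against the gradient of the cost function.

    w         -- current weight vector
    cost_grad -- hypothetical helper returning dE/dw at w
    eta       -- learning rate (step size in weight space)
    """
    return w - eta * cost_grad(w)

# Example: minimise E(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, lambda v: 2.0 * v)
print(w)  # close to the global minimum at [0, 0]
```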
3 Graphical Representation of GDR
[Figure: total error plotted against a weight w_i, marking a local minimum, the global minimum, and the ideal weight.]
4 Review of Perceptron Training
1. Generate a training pair or pattern x that you wish your network to learn.
2. Set up your network with N input units fully connected to M output units.
3. Initialize the weights w at random.
4. Select an appropriate error function E(w) and learning rate η.
5. Apply the weight change Δw = -η ∂E(w)/∂w to each weight w for each training pattern p. One set of updates for all the weights for all the training patterns is called one epoch of training.
6. Repeat step 5 until the network error function is small enough (a code sketch of this loop follows this list).
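As a concrete illustration, here is a minimal sketch of steps 2-6 for a single-layer network, assuming linear output units and the squared-error cost; the names and values are illustrative, not from the lecture.

```python
import numpy as np

def train_perceptron(X, T, eta=0.1, epochs=100):
    """Delta-rule training of a single-layer network (linear outputs).

    X -- (patterns, N) input array, T -- (patterns, M) target array.
    One pass over all patterns below is one epoch of training.
    """
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.5, 0.5, (X.shape[1], T.shape[1]))  # step 3
    for _ in range(epochs):                               # step 6
        for x, t in zip(X, T):                            # step 5
            y = x @ w                      # network output
            w += eta * np.outer(x, t - y)  # dw = -eta dE/dw for E = 1/2 (t-y)^2
    return w
```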
5 Review of XOR and Linear Separability
Recall that it is not possible to find weights that enable Single Layer Perceptrons to deal with non-linearly separable problems like XOR:

  in1  in2 | out
   0    0  |  0
   0    1  |  1
   1    0  |  1
   1    1  |  0

[Figure: the four XOR patterns plotted in the (I1, I2) plane; no single straight line separates the two output classes.]

The proposed solution was to use a more complex network that is able to generate more complex decision boundaries. That network is the Multi-Layer Perceptron.
6 Multi-Layer Perceptrons (MLPs)

Output layer:  Y_k = f( Σ_j w_jk O_j )
Hidden layer:  O_j = f( Σ_i w_ij X_i )
Input layer:   X_i

[Figure: input units X_1, X_2, X_3, ..., X_i feed hidden units O_j via weights w_ij, which feed output units Y_1, Y_2, ..., Y_k via weights w_jk.]
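Reading the two equations bottom-up gives the forward pass of the network. A minimal NumPy sketch (the array shapes are my assumption; bias terms omitted for brevity):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w_ij, w_jk):
    """Forward pass through a 2-layer MLP.

    x    -- input vector (X_i), length N
    w_ij -- input-to-hidden weights, shape (N, M)
    w_jk -- hidden-to-output weights, shape (M, P)
    """
    o = sigmoid(x @ w_ij)   # hidden layer: O_j = f(sum_i w_ij X_i)
    y = sigmoid(o @ w_jk)   # output layer: Y_k = f(sum_j w_jk O_j)
    return o, y
```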
7 Can We Use a Generalized Form of the PLR/Delta Rule to Train the MLP?
Recall the PLR/Delta rule: adjust neuron weights to reduce the error at the neuron's output:

w_new = w_old + η δ x, where δ = (desired output - actual output)

Main problem: how do we adjust the weights in the hidden layer, so they reduce the error in the output layer, when there is no specified target response in the hidden layer?
Solution: alter the non-linear Perceptron (discrete threshold) activation function to make it differentiable and hence help derive a Generalized Delta Rule for MLP training.
[Figures: the discrete threshold function and its smooth sigmoid replacement.]
8 Sigmoid (S-shaped) Function Properties
Approximates the threshold function.
Smoothly differentiable (everywhere), and hence DR applicable.
Positive slope.
A popular choice is f(a) = 1 / (1 + e^(-a)).
[Figure: sigmoid curve rising smoothly from 0 to 1.]
The derivative of the sigmoidal function is: f'(a) = f(a) (1 - f(a))
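In code, the derivative identity means we never need to differentiate numerically; a direct transcription (mine, not the lecture's):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid f(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_deriv(a):
    """f'(a) = f(a) * (1 - f(a)): computable from the activation alone."""
    f = sigmoid(a)
    return f * (1.0 - f)
```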
9 Weight Update Rule
Generally, the weight change from any unit j to unit k by gradient descent (i.e. a weight change by a small increment in the negative direction of the gradient) is now called the Generalized Delta Rule (GDR) or Backpropagation:

w_new = w_old - η ∂E/∂w = w_old + η δ_k x_j

Now the delta is more complicated because of the sigmoid function. With error E(w) = ½ Σ_k (target_k - out_k)² and out_k = f(a_k) = 1 / (1 + e^(-a_k)), applying the chain rule gives:

δ_k = (target_k - out_k) f'(a_k) = (target_k - out_k) out_k (1 - out_k)
10 Weight Update Rule (2)
For the output units, delta is the output error multiplied by a gradient term:

δ_k = (target_k - out_k) out_k (1 - out_k)

For hidden units we also need a value of error. A suitable quantity to use is the weighted sum of the output deltas from a hidden unit:

δ_j = o_j (1 - o_j) Σ_k δ_k w_jk

And again the weight change is:

Δw_ij = η δ_j x_i
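A minimal sketch of both delta computations, vectorised over units; the array shapes are my assumption:

```python
import numpy as np

def backprop_deltas(y, target, o, w_jk):
    """Deltas for the Generalized Delta Rule with sigmoid units.

    y      -- output activations (length P), target -- desired outputs
    o      -- hidden activations (length M)
    w_jk   -- hidden-to-output weights, shape (M, P)
    """
    delta_out = (target - y) * y * (1.0 - y)        # output error times gradient term
    delta_hid = o * (1.0 - o) * (w_jk @ delta_out)  # weighted sum of output deltas
    return delta_out, delta_hid
```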
11 Training of a 2-Layer Feed-Forward Network
1. Take the set of training patterns you wish the network to learn.
2. Set up the network with N input units fully connected to M non-linear hidden units via connections with weights w_ij, which in turn are fully connected to P output units via connections with weights w_jk.
3. Generate random initial weights, e.g. from the range [-t, +t].
4. Select an appropriate error function E(w) and learning rate η.
5. Apply the weight update equation Δw_jk = -η ∂E(w)/∂w_jk to each weight w_jk for each training pattern p.
6. Do the same for all hidden layers (weights w_ij).
7. Repeat steps 5-6 until the network error function is small enough (a runnable sketch of one such epoch follows this list).
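Putting the pieces together, here is a runnable sketch of one on-line training epoch for a 2-layer network, applied to the XOR problem from slide 5. The bias weights (handled by appending a constant input of 1) and all numeric values are my additions, not from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_epoch(X, T, w_ij, w_jk, eta=0.5):
    """One on-line epoch of backpropagation (steps 5-6)."""
    for x, t in zip(X, T):
        xb = np.append(x, 1.0)                 # input plus bias unit
        o = sigmoid(xb @ w_ij)                 # hidden activations O_j
        ob = np.append(o, 1.0)                 # hidden plus bias unit
        y = sigmoid(ob @ w_jk)                 # network outputs Y_k
        delta_out = (t - y) * y * (1.0 - y)
        delta_hid = o * (1.0 - o) * (w_jk[:-1] @ delta_out)
        w_jk += eta * np.outer(ob, delta_out)  # dw_jk = eta * delta_k * o_j
        w_ij += eta * np.outer(xb, delta_hid)  # dw_ij = eta * delta_j * x_i
    return w_ij, w_jk

# Learn XOR with 2 hidden units.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
rng = np.random.default_rng(1)
w_ij = rng.uniform(-1, 1, (3, 2))      # 2 inputs + bias -> 2 hidden
w_jk = rng.uniform(-1, 1, (3, 1))      # 2 hidden + bias -> 1 output
for _ in range(5000):                  # restart from new random weights if a
    w_ij, w_jk = train_epoch(X, T, w_ij, w_jk)  # local minimum traps the net
for x in X:
    o = sigmoid(np.append(x, 1.0) @ w_ij)
    y = sigmoid(np.append(o, 1.0) @ w_jk)
    print(x, np.round(y, 2))
```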
12 Practical Considerations for Learning Rules
There are a number of important issues about training single layer neural networks that need further resolving:
1. Do we need to pre-process the training data? If so, how?
2. How do we choose the initial weights from which we start the training?
3. How do we choose an appropriate learning rate η?
4. Should we change the weights after each training pattern, or after the whole set?
5. Are some activation/transfer functions better than others?
6. How do we avoid local minima in the error function?
7. How do we know when we should stop the training?
8. How many hidden units do we need?
9. Should we have different learning rates for the different layers?
We shall now consider each of these issues one by one.
13 Pre-processing of the Training Data
In principle, we can just use any raw input-output data to train our networks. However, in practice, it often helps the network to learn appropriately if we carry out some pre-processing of the training data before feeding it to the network.
We should make sure that the training data is representative: it should not contain too many examples of one type at the expense of another. On the other hand, if one class of pattern is easy to learn, having large numbers of patterns from that class in the training set will only slow down the over-all learning process.
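One common pre-processing step (my example; the slide does not prescribe a particular method) is to rescale each input component to zero mean and unit variance so that no single input dominates the weighted sums:

```python
import numpy as np

def standardise(X):
    """Rescale each input component of X (patterns x features)
    to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0.0] = 1.0   # leave constant features unscaled
    return (X - mu) / sigma
```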
14 Choosing the Initial Weight Values
The gradient descent learning algorithm treats all the weights in the same way, so if we start them all off with the same values, all the hidden units will end up doing the same thing and the network will never learn properly. For that reason, we generally start off all the weights with small random values. Usually we take them from a flat distribution around zero, [-t, +t], or from a Gaussian distribution around zero with standard deviation t.
Choosing a good value of t can be difficult. Generally, it is a good idea to make it as large as you can without saturating any of the sigmoids.
We usually hope that the final network performance will be independent of the choice of initial weights, but we need to check this by training the network from a number of different random initial weight sets.
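A sketch of the two initialisation schemes mentioned above; the default t is illustrative only, since the slide's advice is to pick t as large as possible without saturating the sigmoids:

```python
import numpy as np

def init_weights(n_in, n_out, t=0.5, gaussian=False, seed=None):
    """Small random initial weights: flat on [-t, +t], or
    Gaussian around zero with standard deviation t."""
    rng = np.random.default_rng(seed)
    if gaussian:
        return rng.normal(0.0, t, (n_in, n_out))
    return rng.uniform(-t, t, (n_in, n_out))
```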
15 Choosing the Learning Rate
Choosing a good value for the learning rate η is constrained by two opposing facts:
1. If η is too small, it will take too long to get anywhere near the minimum of the error function.
2. If η is too large, the weight updates will over-shoot the error minimum and the weights will oscillate, or even diverge.
Unfortunately, the optimal value is very problem- and network-dependent, so one cannot formulate reliable general prescriptions. Generally, one should try a range of different values (e.g. η = 0.1, 0.01, 1.0) and use the results as a guide.
16 Batch Training vs. On-line Training
Batch Training: update the weights after all training patterns have been presented.
On-line Training (or Sequential Training): a natural alternative is to update all the weights immediately after processing each training pattern.
On-line learning does not perform true gradient descent, and the individual weight changes can be rather erratic. Normally a much lower learning rate η will be necessary than for batch learning. However, because each weight now has N updates per epoch (where N is the number of patterns), rather than just one, overall the learning is often much quicker. This is particularly true if there is a lot of redundancy in the training data, i.e. many training patterns containing similar information.
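The difference between the two schedules is just where the weight update sits relative to the pattern loop. A schematic comparison, assuming a hypothetical helper `grad(x, t, w)` that returns the per-pattern gradient dE_p/dw:

```python
def online_epoch(X, T, w, grad, eta):
    """Sequential training: update immediately after each pattern."""
    for x, t in zip(X, T):
        w = w - eta * grad(x, t, w)   # N small, possibly erratic steps
    return w

def batch_epoch(X, T, w, grad, eta):
    """Batch training: accumulate the full gradient, one step per epoch."""
    g = sum(grad(x, t, w) for x, t in zip(X, T))
    return w - eta * g                # true gradient descent
```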
17 Choosing the Transfer Function
We have already seen that having a differentiable transfer/activation function is important for the gradient descent algorithm to work. We have also seen that, in terms of computational efficiency, the standard sigmoid (i.e. the logistic function) is a particularly convenient replacement for the step function of the Simple Perceptron.
The logistic function ranges from 0 to 1. There is some evidence that an anti-symmetric transfer function (e.g. tanh), i.e. one that satisfies f(-x) = -f(x), enables the gradient descent algorithm to learn faster.
When the outputs are required to be non-binary, i.e. continuous real values, having sigmoidal transfer functions no longer makes sense. In these cases, a simple linear transfer function f(x) = x is appropriate.
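For reference, the three transfer functions discussed above side by side (a trivial sketch):

```python
import numpy as np

def logistic(a):
    """Ranges from 0 to 1; convenient replacement for the step function."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    """Anti-symmetric, f(-x) = -f(x); ranges -1 to 1, often learns faster."""
    return np.tanh(a)

def linear(a):
    """f(x) = x, for continuous real-valued outputs."""
    return a
```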
18 Local Minima
Cost functions can quite easily have more than one minimum. If we start off in the vicinity of a local minimum, we may end up at the local minimum rather than the global minimum. Starting with a range of different initial weight sets increases our chances of finding the global minimum. Any variation from true gradient descent will also increase our chances of stepping into the deeper valley.
19 When to Stop Training
The sigmoid function only takes on its extreme values of 0 and 1 at x = ±∞. In effect, this means that the network can only achieve its binary targets when at least some of its weights reach ±∞. So, given finite gradient descent step sizes, our networks will never reach their binary targets. Even if we offset the targets (to 0.1 and 0.9, say), we will generally require an infinite number of increasingly small gradient descent steps to achieve those targets.
Clearly, if the training algorithm can never actually reach the minimum, we have to stop the training process when it is near enough. What constitutes near enough depends on the problem. If we have binary targets, it might be enough that all outputs are within 0.1 (say) of their targets. Or, it might be easier to stop the training when the sum squared error function becomes less than a particular small value (0.2, say).
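Both "near enough" tests translate directly into code; the tolerance values below are the slide's own examples:

```python
import numpy as np

def should_stop(outputs, targets, tol=0.1, sse_limit=0.2):
    """Stop when every output is within tol of its binary target,
    or when the sum-squared error drops below sse_limit."""
    within_tol = np.all(np.abs(outputs - targets) < tol)
    sse = np.sum((outputs - targets) ** 2)
    return within_tol or sse < sse_limit
```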
20 How Many Hidden Units?
The best number of hidden units depends in a complex way on many factors, including:
1. The number of training patterns
2. The numbers of input and output units
3. The amount of noise in the training data
4. The complexity of the function or classification to be learned
5. The type of hidden unit activation function
6. The training algorithm
Too few hidden units will generally leave high training and generalisation errors due to under-fitting. Too many hidden units will result in low training errors, but will make the training unnecessarily slow, and will result in poor generalisation unless some other technique (such as regularisation) is used to prevent over-fitting.
Virtually all rules of thumb you hear about are actually nonsense. A sensible strategy is to try a range of numbers of hidden units and see which works best.
21 Different Learning Rates for Different Layers?
A network as a whole will usually learn most efficiently if all its neurons are learning at roughly the same speed. So maybe different parts of the network should have different learning rates η. There are a number of factors that may affect the choices:
1. The later network layers (nearer the outputs) will tend to have larger local gradients (deltas) than the earlier layers (nearer the inputs).
2. The activations of units with many connections feeding into or out of them tend to change faster than units with fewer connections.
3. The activations required for linear units will be different from those for sigmoidal units.
4. There is empirical evidence that it helps to have different learning rates η for the thresholds/biases compared with the real connection weights.
In practice, it is often quicker to just use the same rates η for all the weights and thresholds, rather than spending time trying to work out appropriate differences. A very powerful approach is to use evolutionary strategies to determine good learning rates.