Exercise: Training Simple MLP by Backpropagation. Using Netlab.

Petr Pošík, December, 27

1 File list

This document is an explanatory text for the following scripts:

demomlpklin.m: a script implementing the backpropagation algorithm derived in Sec. 2. It uses the data set tren.dat and the m-file mlperc.m, which implements the neural network.

demonetlabklin.m: constructs the same neural network as the first script (see Sec. 3.1), this time using Netlab (which must be downloaded separately). It uses the data set tren.dat and the helper function plotdata.m.

demonetlabxor.m: uses Netlab to construct the neural network for the XOR data set, see Sec. 3.2 (trenxorrect.dat contains the rectangular XOR pattern, trenxor.dat contains a rotated version). It uses the helper function plotdata.m.

2 Backpropagation for a simple MLP

Suppose we have a training set that exhibits the pattern shown in Fig. 1. The task is to create a simple neural network classifier that is able to discriminate between the red crosses and the blue circles.

Figure 1: The first training set.

2.1 Preliminary questions

2.1.1 Is one simple perceptron sufficient to classify the training points precisely?

No. One perceptron creates only one linear decision boundary.

2.1.2 How many linear boundaries do we need here?

At least two.
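To get a feeling for why at least two boundaries are needed, it may help to plot the raw training data from Fig. 1. The following lines are only a sketch: they assume that tren.dat stores one training point per row as [x1 x2 label] with labels 0 and 1, which is an assumption about the file format (the provided helper plotdata.m serves the same purpose in the actual scripts):

% Load the training set and plot the two classes (sketch only; the exact
% column layout of tren.dat and the label-to-class mapping are assumed).
data = load('tren.dat');
x = data(:, 1:2);                        % input coordinates x1, x2
y = data(:, 3);                          % target labels, assumed to be 0 or 1
figure; hold on;
plot(x(y == 1, 1), x(y == 1, 2), 'rx');  % one class as red crosses
plot(x(y == 0, 1), x(y == 0, 2), 'bo');  % the other class as blue circles
xlabel('x_1'); ylabel('x_2');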

2.1.3 What architecture of the network will be suitable?

Well, we need two linear decision boundaries, i.e. two neurons; the variables x_1 and x_2 will be the inputs to both of them. OK, but this would give us a network with two outputs. We need an additional neuron to combine the outputs of the first two neurons into one output.

2.1.4 We need to tune the NN to the training set. What is actually tuned?

The weights w.

2.2 Questions for our network

2.2.1 What does our network look like?

Our network is depicted in Fig. 2. The output nonlinear function g is the same for all three neurons:

    g(a) = \frac{1}{1 + e^{-\lambda a}}    (1)

To reiterate, its derivative at a point a can be computed as

    g'(a) = \lambda g(a) (1 - g(a))    (2)
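Equations (1) and (2) translate directly into MATLAB. The following small check is not part of the original scripts; it only illustrates that the analytic derivative (2) agrees with a finite-difference approximation (lambda = 1 and the test point are chosen arbitrarily):

lambda = 1;                                  % slope of the sigmoid (illustrative value)
g      = @(a) 1 ./ (1 + exp(-lambda * a));   % equation (1)
gprime = @(a) lambda * g(a) .* (1 - g(a));   % equation (2)

a  = 0.3;                                    % arbitrary test point
h  = 1e-6;
fd = (g(a + h) - g(a - h)) / (2 * h);        % central finite difference
fprintf('analytic: %.6f  numeric: %.6f\n', gprime(a), fd);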

Figure 2: Our neural network.

2.2.2 How many weights have to be tuned in our network?

It seems that 6: two inputs to neuron 1, two inputs to neuron 2, and two inputs to neuron 3. Really? Well, all the neurons should also have an additional input that is connected to the constant value 1 (do you remember the homogeneous coordinates?). That is why we have 3 inputs for each of the 3 neurons, which gives 9 weights.

2.2.3 Given an input vector x, how do we compute the output of our NN?

We use forward propagation. The weights related to neuron 1 form a vector w_1 = (w_{11}, w_{12}, w_{13}). Similarly, we have vectors of weights for neurons 2 and 3: w_2 = (w_{21}, w_{22}, w_{23}) and w_3 = (w_{31}, w_{32}, w_{33}). For a given vector x = (x_1, x_2), we create a vector \tilde{x} in homogeneous coordinates:

    \tilde{x} = (x_1, x_2, 1).    (3)

Then we can compute the weighted sums a_1 and a_2 of neurons 1 and 2,

    a_1 = w_1^T \tilde{x},    (4)
    a_2 = w_2^T \tilde{x},    (5)

and we can also compute the outputs z_1 and z_2 of neurons 1 and 2:

    z_1 = g(a_1),    (6)
    z_2 = g(a_2).    (7)

Then we can construct the input vector for the third neuron as

    \tilde{x}_H = (z_1, z_2, 1)    (8)

and pass it through the third neuron:

    a_3 = w_3^T \tilde{x}_H,    (9)
    z_3 = o = g(a_3).    (10)

This way we have obtained the values of all the z_i in the whole network.
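The forward pass (3)-(10) for a single input point can be written in a few lines of MATLAB. This is only an illustrative sketch (the weight matrix W, the input point and lambda are arbitrary examples); the batch version actually used by the script appears in Sec. 2.2.6:

lambda = 1;                            % slope of the sigmoid (illustrative)
g    = @(a) 1 ./ (1 + exp(-lambda*a)); % logistic function, equation (1)
W    = 0.1 * randn(3, 3);              % rows are w_1, w_2, w_3 (random example weights)
x    = [0.3; 0.7];                     % an arbitrary input point (x_1, x_2)

xhom = [x; 1];                         % homogeneous coordinates, equation (3)
a1   = W(1,:) * xhom;                  % equation (4)
a2   = W(2,:) * xhom;                  % equation (5)
z1   = g(a1);                          % equation (6)
z2   = g(a2);                          % equation (7)
xH   = [z1; z2; 1];                    % equation (8)
a3   = W(3,:) * xH;                    % equation (9)
o    = g(a3)                           % equation (10), the network output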

2.2.4 How do we compute the weights of the network?

We define the loss function (optimization criterion) E as

    E = \frac{1}{2} \sum_{n=1}^{N} \sum_{c=1}^{C} (y_{nc} - o_{nc})^2,    (11)

i.e. as the sum of squared errors of all outputs (c = 1,...,C) over all training examples (n = 1,...,N). However, we can consider only the error for one training example:

    E = \frac{1}{2} \sum_{c=1}^{C} (y_c - o_c)^2.    (12)

And, since our network has only one output, the error is defined as

    E = \frac{1}{2} (y - o)^2.    (13)

Since the output o is actually a function of the weights, o(w), we can optimize the error function E by modifying the weights. We shall use the gradient descent method. The update rule of the gradient descent method is

    w \leftarrow w - \eta \, \mathrm{grad}(E(w)).    (14)

2.2.5 How do we compute the gradient?

The gradient at the point w can be computed using the backpropagation algorithm. In our case, the gradient can be written as a matrix (arranged similarly to w):

    \mathrm{grad}(E(w)) =
    \begin{pmatrix}
      \partial E/\partial w_{11} & \partial E/\partial w_{12} & \partial E/\partial w_{13} \\
      \partial E/\partial w_{21} & \partial E/\partial w_{22} & \partial E/\partial w_{23} \\
      \partial E/\partial w_{31} & \partial E/\partial w_{32} & \partial E/\partial w_{33}
    \end{pmatrix}    (15)

From the derivation of the backpropagation algorithm we know that the individual derivatives can be computed as

    \frac{\partial E}{\partial w_{ji}} = \delta_j z_i,    (16)

where \delta_j is the error of the neuron at the output of the edge i \to j, and z_i is the signal at the input of the edge i \to j. The values of the z_i are known from the forward propagation phase; we still need to compute the \delta_j. Again, from the derivation of the backpropagation algorithm we know that for the output layer we can write

    \delta_3 = g'(a_3) \cdot \frac{\partial E}{\partial o} = \lambda z_3 (1 - z_3) \cdot (-1)(y - o).    (17)
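Equations (16) and (17) follow from the chain rule; restated in the notation of this section (this is the standard derivation, written out here only as a reminder):

    \frac{\partial E}{\partial w_{ji}}
      = \frac{\partial E}{\partial a_j} \, \frac{\partial a_j}{\partial w_{ji}}
      = \delta_j z_i
    \quad \text{since } a_j = \sum_i w_{ji} z_i \text{ and } \delta_j := \frac{\partial E}{\partial a_j},

    \delta_3
      = \frac{\partial E}{\partial o} \, \frac{\partial o}{\partial a_3}
      = -(y - o) \, g'(a_3)
      = \lambda z_3 (1 - z_3) \, \bigl(-(y - o)\bigr).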

For the \delta_j in the hidden layer, we can write

    \delta_j = g'(a_j) \sum_{k \in \mathrm{Dest}(j)} w_{kj} \delta_k.    (18)

Consider neuron 1. The set of neurons which are fed with the signal from neuron 1 is Dest(1) = {3}, so the sum has only one element:

    \delta_1 = g'(a_1) \sum_{k \in \mathrm{Dest}(1)} w_{k1} \delta_k    (19)
             = \lambda z_1 (1 - z_1) w_{31} \delta_3.    (20)

Similarly, for neuron 2 we can write

    \delta_2 = \lambda z_2 (1 - z_2) w_{32} \delta_3.    (21)

That completes the backpropagation algorithm for our particular network. We know the z_i from the forward propagation and we have computed the \delta_j using backpropagation; we can thus compute the gradient of the error function, grad(E(w)), and make the gradient descent step w \leftarrow w - \eta \, \mathrm{grad}(E(w)).

2.2.6 How can that be written in MATLAB?

The following code is the MATLAB implementation of the procedure derived above. The only difference is that we derived the on-line version, while the code implements the batch algorithm.

function err = train()
% W - [3 x 3] matrix of NN weights
% x - [nin x 2] matrix of coordinates x1 and x2
% y - [nin x 1] matrix of target values (0 or 1)

% Propagate training points through the network
nin   = size(x,1);
data  = [x ones(nin,1)];
z1    = perc(W(1,:), data);
z2    = perc(W(2,:), data);
data2 = [z1 z2 ones(nin,1)];
z3    = perc(W(3,:), data2);

% Compute the average error per point
err = 0.5 * sum((y-z3).^2) / nin;

% Delta in the output layer, neuron 3
delta3 = lambda * z3 .* (1-z3) .* (-(y-z3));

% Propagate the errors back to the hidden layer, neurons 1 and 2

delta1 = lambda * z1 .* (1-z1) .* (delta3 * W(3,1));
delta2 = lambda * z2 .* (1-z2) .* (delta3 * W(3,2));

% Compute the derivatives
dedw1 = delta1' * data;
dedw2 = delta2' * data;
dedw3 = delta3' * data2;

% Compile the gradient of the weight matrix
nablaew = [dedw1; dedw2; dedw3];

% Adapt the weights using the gradient descent rule
W = W - eta * nablaew;

% Apply the decay factor to the learning rate
eta = eta*quo;
end
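The listing above calls a helper perc (the file list refers to it as mlperc.m), which is not reproduced in this document. A minimal sketch consistent with equations (1) and (4)-(7) might look as follows; in the real script, lambda is presumably shared with train() rather than fixed inside the helper:

function z = perc(w, data)
% Sketch of an assumed implementation of the neuron-layer helper:
% data is an [n x 3] matrix of inputs in homogeneous coordinates,
% w is a [1 x 3] row vector of weights; z is the [n x 1] vector of
% outputs z = g(a) with a = data*w'.
lambda = 1;                       % assumed slope of the sigmoid
a = data * w';                    % weighted sums for all n points
z = 1 ./ (1 + exp(-lambda * a));  % logistic nonlinearity, equation (1)
end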

2.2.7 How can we interpret the weights after learning?

After repeatedly running the training algorithm for a sufficient number of iterations, we hopefully get near-optimal weights. The meaning of the individual parts of our trained NN can be seen in Fig. 3.

Figure 3: The meaning of the elements in the net. Two of the plotted lines are level sets of a_1(\tilde{x}) and two are level sets of a_2(\tilde{x}) (the linear boundaries implemented by hidden neurons 1 and 2), and the remaining line is the decision boundary of the whole network, i.e. the set of points for which z_3(\tilde{x}) = 0.5.

3 Using Netlab

Netlab is a set of MATLAB functions that allows us to create simple neural networks (among other things). It was created by Ian Nabney and Christopher Bishop, the author of the very popular book Neural Networks for Pattern Recognition. It can be downloaded at http://www.ncrg.aston.ac.uk/netlab/.

3.1 Netlab and our first data set

3.1.1 Crucial functions of Netlab

First, we apply Netlab to the same problem as in the previous section. The important part of the script is shown here:

%% Suppose:
% x  - [ntr x 2]  input part of training vectors
% y  - [ntr x 1]  output part of training vectors
% xt - [ntst x 2] input part of testing vectors
% yt - [ntst x 1] output part of testing vectors

%% Create the network structure
% 2 inputs, 2 neurons in the hidden layer, 1 output.
% The nonlinear function for the output neuron is logistic.
% The nonlinear function for the neurons in the hidden layer is
% hardwired and is the hyperbolic tangent.
net = mlp(2, 2, 1, 'logistic');

%% Set the options for optimization
options = zeros(1, 18);
options(1) = 1;  % This provides display of error values.

%% Optimize the network weights
% Train the network in structure net to correctly classify
% input vectors x to outputs y using scg (scaled conjugate
% gradients).
[net, options] = netopt(net, options, x, y, 'scg');

%% Compute the predictions of the trained network for the testing data
ypred = mlpfwd(net, xt);

% Compute errors on the testing data
errors = yt - ypred;
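The variable errors above contains the raw differences between the 0/1 targets and the continuous logistic outputs. If hard class labels and a misclassification count are wanted, the outputs can be thresholded at 0.5; this step is not part of the original script, only a common way to evaluate such a classifier:

% Turn the continuous network outputs into 0/1 class labels and count the
% misclassified test points (0.5 is the natural threshold for a logistic
% output trained on 0/1 targets).
labels = ypred > 0.5;
nwrong = sum(labels ~= yt);
fprintf('Misclassified %d of %d test points.\n', nwrong, length(yt));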

As can be seen, for our purposes (a multilayer perceptron) we need only three functions from Netlab:

net = mlp(nin, nhid, nout, func) creates a data structure holding the NN with nin inputs, nhid neurons in the hidden layer, and nout outputs. The nonlinear function of the output neurons is set to func. Further info: help mlp.

net = netopt(net, options, x, y, 'scg') optimizes the NN weights (hidden in the structure net that was created by the function mlp) so that the NN has the least possible error when classifying the training inputs x into the training outputs y. It can use several optimization algorithms; one of them is scaled conjugate gradients, i.e. 'scg'. Further info: help netopt.

ypred = mlpfwd(net, xt) provides the predictions that the net gives us when new data points xt are presented to our NN. Further info: help mlpfwd.

3.1.2 Netlab results on our first data set

If we train the same network on the same data as in the previous section, we should expect very similar results. The differences are that (1) the nonlinear functions in the hidden layer are not logistic but hyperbolic tangent, and (2) we do not use gradient descent to optimize the weights, but rather the method of scaled conjugate gradients. The results of training can be seen in Fig. 4.

Figure 4: Results for the network from Fig. 2. Left: the decision boundary, right: the discrimination function.

3.2 Netlab and the XOR problem

Consider the pattern of training data points depicted in Fig. 5.

Figure 5: Training data forming the XOR pattern.

Similarly to our first data set, the two classes are not linearly separable. It seems reasonable to use our 2-2-1 network for this data set as well.

3.2.1 Locally optimal solution

In Fig. 6 we can see one possible solution, in which the NN tries to separate one corner of the training data from the rest of the patterns. It can be seen that such a network makes a pretty large error (see the image title). However, there exists another solution to this problem, which is very hard to obtain if the NN already encodes the decision boundary from Fig. 6, i.e. when the NN is in a local optimum.

3.2.2 Globally optimal solution

A better solution is depicted in Fig. 7. The title of the left picture clearly shows that this solution is better (it makes a lower error with respect to the given data set). Looking at Fig. 7, one can ask where the second decision boundary came from. The same picture at a somewhat larger scale reveals that there is no second boundary; these are just two parts of one boundary, see Fig. 8.
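Larger-scale views such as Fig. 8 (and Fig. 10 later) can be produced by evaluating the trained Netlab network on a grid that extends well beyond the training data. The following is only a sketch of such a visualization, not the original plotting code; it assumes net is a network already trained with netopt as above, and the grid range and step are arbitrary:

% Evaluate the trained net on a grid far outside the unit square to see
% how the network extrapolates.
[g1, g2] = meshgrid(-6:0.1:8, -6:0.1:6);
zpred = mlpfwd(net, [g1(:) g2(:)]);   % network output for every grid point
zpred = reshape(zpred, size(g1));

figure;                               % decision boundary: level set z3 = 0.5
contour(g1, g2, zpred, [0.5 0.5]);
figure;                               % discrimination function as a surface
surf(g1, g2, zpred);
shading interp;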

Figure 6: Locally optimal solution of the network from Fig. 2 applied to the XOR pattern. Left: the decision boundary, right: the discrimination function.

Figure 7: Globally optimal solution of the network from Fig. 2 applied to the XOR pattern. Left: the decision boundary, right: the discrimination function.

3.2.3 More complex network

The first and the second data sets are similar in that neither of them is linearly separable. They differ, however, in that the first data set is separable by a single decision boundary, while the XOR pattern is not. The next step is therefore to add one more neuron to the hidden layer, which gives a 2-3-1 network. In that case the solution can look like the one depicted in Fig. 9.

The result seems to be an almost perfect solution! All training points are classified correctly, and the decision boundaries look like those that would be drawn by a human. However, if you look at the solution at a larger scale (see Fig. 10), you can see that the boundaries are wilder than expected. Moreover, although the lower right corner of the training data is full of red crosses, the neural network produces another decision boundary there and creates an area that is classified as blue circles, without any support in the training data!

This phenomenon can be observed in all types of machine learning models. A particular model has its own structural constraints that it imposes on the final form of the classifier. Only in the area covered by the training data set can the predictions of the model be influenced in a systematic way; in areas beyond the training set, each model extrapolates the values in its own way, similarly to what would be done, e.g., by 2nd- and 3rd-order polynomials when modeling a one-dimensional function.

Figure 8: Globally optimal solution of the network at a larger scale. Left: the decision boundary, right: the discrimination function.

Figure 9: Solution of the 2-3-1 network for the XOR problem. Left: the decision boundary, right: the discrimination function.

Figure 10: Solution of the 2-3-1 network for the XOR problem at a larger scale. Left: the decision boundary, right: the discrimination function.