Pattern Recognition Winter Semester 2009/2010. Exemplary Solution

4 Exercise

The focus of this exercise will be neural networks. You will train and apply a multi-layer perceptron (MLP) as well as a radial basis function network (RBF) on the Imaging Mass Spectrometry Dataset (again we will only use the first two mass channels to simplify visualization). We will use the netlab toolbox; this way you do not have to implement the neural networks yourself, but can rely on existing Matlab code and focus on the interpretation.

4.1 Netlab Toolbox Example

To familiarize yourself with the netlab toolbox, follow through the example below:

% Generate the matrix of inputs x and targets t.
x = [0:1/19:1]';
ndata = size(x,1);
t = sin(2*pi*x) + 0.2*randn(ndata, 1);
% Set up network parameters.
nin = 1;                 % Number of inputs.
nhidden = 3;             % Number of hidden units.
nout = 1;                % Number of outputs.
outfunction = 'linear';  % Output function ('linear' or 'logistic').
%alpha = 0.01;           % Coefficient of weight-decay prior.
net = mlp(nin, nhidden, nout, outfunction);
options(1) = 1;          % This provides display of error values.
options(9) = 1;          % Check the gradient calculations.
options(14) = 1000;      % Number of training cycles.
% Train using scaled conjugate gradients.
[net, options] = netopt(net, options, x, t, 'scg');
% Plot the trained network predictions.
plotvals = [0:0.01:1]';
y = mlpfwd(net, plotvals);
plot(plotvals, y, 'ob')

4.1.1 Interpret the example (1 point)

Print the resulting plot and comment on what the program does and also what the quality of the result looks like.
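One way to judge the quality of the result (this addition is not part of the original listing) is to overlay the noisy training points and the noise-free sine curve on the same axes; the snippet assumes the variables x, t, plotvals and y from the example above are still in the workspace.

hold on
plot(x, t, 'xr');                          % noisy training targets
plot(plotvals, sin(2*pi*plotvals), '-g');  % noise-free sine for reference
legend('MLP prediction', 'training data', 'true sine');
hold off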

The example code produces data points from a sine curve with additional noise, and an MLP with 3 hidden units is trained for 1000 iterations on these data. The plot shows the successful recovery of the sine structure from only 20 data points.

4.1.2 Influence of noise on MLP (2 points)

The example in 4.1 uses a fixed noise factor of 0.2. Investigate the performance of the MLP when you change the noise level in the interval [0,2], create plots of the results and comment.

The code below was used to repeat the example program with various noise levels:

for noise = 0:0.5:2
    % Generate the matrix of inputs x and targets t.
    x = [0:1/19:1]';
    ndata = size(x,1);
    t = sin(2*pi*x) + noise*randn(ndata, 1);
    % Set up network parameters.
    nin = 1;                 % Number of inputs.
    nhidden = 3;             % Number of hidden units.
    nout = 1;                % Number of outputs.
    outfunction = 'linear';  % Output function ('linear' or 'logistic').
    %alpha = 0.01;           % Coefficient of weight-decay prior.
    net = mlp(nin, nhidden, nout, outfunction);
    %options(1) = 1;         % This provides display of error values.
    %options(9) = 1;         % Check the gradient calculations.
    options(14) = 1000;      % Number of training cycles.
    % Train using scaled conjugate gradients.
    [net, options] = netopt(net, options, x, t, 'scg');
    % Plot the trained network predictions.
    plotvals = [0:0.01:1]';
    y = mlpfwd(net, plotvals);

    figure;
    plot(plotvals, y, 'ob');
    hold;
    plot(x, t, 'or');
    title(num2str(noise));
    xlabel('x');
    ylabel('f(x)');
end
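A rough quantitative counterpart to the plots (not part of the original solution) is the mean squared error of each trained network against the noise-free sine; the sketch below redoes the loop in a self-contained way and only assumes that the netlab toolbox is on the path.

plotvals = [0:0.01:1]';
truevals = sin(2*pi*plotvals);
for noise = 0:0.5:2
    x = [0:1/19:1]';
    t = sin(2*pi*x) + noise*randn(size(x,1), 1);
    net = mlp(1, 3, 1, 'linear');
    opts = foptions;
    opts(14) = 1000;   % number of training cycles
    net = netopt(net, opts, x, t, 'scg');
    y = mlpfwd(net, plotvals);
    fprintf('noise = %.1f   MSE vs. true sine = %.4f\n', noise, mean((y - truevals).^2));
end

The error typically grows with the noise level, matching the visual impression from the plots.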

It is evident that with increasing noise levels, the quality of the prediction decreases rapidly. While at a noise level of 0.5 the fitted function still shows the same general overall trend, that is no longer true for higher values. Still, it should be noted that for high noise levels this task would also be infeasible for a human.

4.2 Multi Layer Perceptron

4.2.1 Application to breast cancer data set (5 points)

Set up a Multi-Layer Perceptron on the training data (again reduced to the first 2 mass channels) of the imaging mass spectrometry data set with the same parameters as in the example above and compute the correct classification rate on the training as well as the test data. Repeat this step several times and evaluate the results (round the prediction results to 0/1!) - do you think the settings are useful? Comment on the choice of each parameter and make a plot of the classifications and approximate the decision boundary.

training2c = trainingx(1:2,:);
test2c = testx(1:2,:);
net = mlp(2, 3, 1, 'linear');
options(1) = 1;     % This provides display of error values.
options(9) = 1;     % Check the gradient calculations.
options(14) = 100;  % Number of training cycles.

% Train using scaled conjugate gradients.
[net, options] = netopt(net, options, training2c', trainingy', 'scg');
predicty = mlpfwd(net, training2c');
sum(abs(round(predicty)' - trainingy))
predicty = mlpfwd(net, test2c');
sum(abs(round(predicty)' - testy))

Repeated 10 times, we see strong variation in the number of incorrect predictions on the test data set (33, 34, 33, 34, 105, 28, 33, 104, 32, 33). This indicates that the number of iterations is not sufficient to reliably train the model. Also, the linear outfunction is not a wise choice since we are dealing with 0/1 labeled data; a logistic outfunction would be a more natural fit. The low number of misclassifications on the test data (at least with regard to the median of the various runs) suggests that 3 hidden units are already sufficient to separate the data. This is also supported by the fact that we see almost 100% correct classifications on the training data (again as a median; even there we see strongly divergent results). This is also connected to the good separation of the two label classes achieved by the two mass channels. For more complex data, more hidden units might be needed.

% and now plot training data
predicty = mlpfwd(net, training2c');   % recompute the predictions on the training data for plotting
figure;
plot(training2c(1,round(predicty)==1), training2c(2,round(predicty)==1), 'oc')
title('mlp results on training data');
xlabel('mass channel 1');
ylabel('mass channel 2');
hold
plot(training2c(1,round(predicty)==0), training2c(2,round(predicty)==0), 'og')
plot(training2c(1,trainingy==1), training2c(2,trainingy==1), 'xr')
plot(training2c(1,trainingy==0), training2c(2,trainingy==0), 'xy')
% and test data

predicty = mlpfwd(net, test2c');   % recompute the predictions on the test data for plotting
figure;
plot(test2c(1,round(predicty)==1), test2c(2,round(predicty)==1), 'oc')
title('mlp results on test data');
xlabel('mass channel 1');
ylabel('mass channel 2');
hold
plot(test2c(1,round(predicty)==0), test2c(2,round(predicty)==0), 'og')
plot(test2c(1,testy==1), test2c(2,testy==1), 'xr')
plot(test2c(1,testy==0), test2c(2,testy==0), 'xy')

% plot decision boundary
% produce grid of input values:
decisiongrid = zeros(100,100);
for i = 1:100
    for j = 1:100
        % evaluate the trained MLP at each grid point
        decisiongrid(i,j) = round(mlpfwd(net, [i, j]));
    end
end
% make a plot
imagesc(decisiongrid')
title('decision boundary for MLP');
xlabel('mass channel 1');
ylabel('mass channel 2');
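The run-to-run variation reported above can be reproduced with a small loop such as the following sketch, which is not part of the original solution; it assumes training2c, trainingy, test2c and testy are defined as above and that the netlab toolbox is on the path.

testerrs = zeros(1,10);
for run = 1:10
    net = mlp(2, 3, 1, 'linear');   % same settings as in 4.2.1
    opts = foptions;
    opts(14) = 100;                 % number of training cycles
    net = netopt(net, opts, training2c', trainingy', 'scg');
    predicty = mlpfwd(net, test2c');
    testerrs(run) = sum(abs(round(predicty)' - testy));
end
testerrs          % misclassifications on the test set per run
median(testerrs)

Looking at the spread and the median of testerrs gives a more reliable picture than a single run.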

4.2.2 Influence of Parameters on MLP performance (3 points)

Compute a MLP with 50 hidden units and a logistic outfunction and use 5000 iterations to fit it. What is the error on the training data? What is the error on the test data? What do you think is the reason for the difference? Comment.

Using these settings, we obtain far more stable results over various runs; 5000 iterations and 50 hidden units appear (more than) suitable for fitting a stable network. But we see reasons to expect overfitting, since at least in some of the runs the training error went down to zero and the decision boundary closely follows the training data. We do not see a decrease in the number of misclassifications on the test data, which stays roughly constant around 34, but we no longer observe any large outliers.

net = mlp(2, 50, 1, 'logistic');
options(14) = 5000;  % Number of training cycles.
% Train using scaled conjugate gradients.
[net, options] = netopt(net, options, training2c', trainingy', 'scg');
predicty = mlpfwd(net, training2c');
sum(abs(round(predicty)' - trainingy))
predicty = mlpfwd(net, test2c');
sum(abs(round(predicty)' - testy))

4.2.3 Parameter tuning (3 points)

What do you think is a sensible number of hidden units and iterations? What is an appropriate outfunction? Comment and then compute the results using the setting you proposed. If the results are not significantly better than the results in 4.2.1 and 4.2.2, what do you think is the reason?
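A systematic way to arrive at such a setting, which is not shown in the original solution, is a small sweep over candidate numbers of hidden units, judging each candidate by its misclassifications; the sketch below reuses the data variables from 4.2.1 and the netlab toolbox, and the candidate values are chosen for illustration only. Strictly speaking, such a comparison should use a separate validation split rather than the test set.

hiddens = [3 5 10 15 25 50];   % candidate numbers of hidden units
testerr = zeros(size(hiddens));
for k = 1:length(hiddens)
    net = mlp(2, hiddens(k), 1, 'logistic');
    opts = foptions;
    opts(14) = 1500;           % number of training cycles
    net = netopt(net, opts, training2c', trainingy', 'scg');
    predicty = mlpfwd(net, test2c');
    testerr(k) = sum(abs(round(predicty)' - testy));
end
[hiddens; testerr]             % hidden units vs. misclassifications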

It becomes evident from 4.2.1 and 4.2.2 that using sufficient iterations and hidden units is necessary to obtain stable results, but too many might lead to overfitting. A logistic outfunction is strongly suggested by the data. Therefore, we try the following model:

net = mlp(2, 15, 1, 'logistic');
options(14) = 1500;  % Number of training cycles.
% Train using scaled conjugate gradients.
[net, options] = netopt(net, options, training2c', trainingy', 'scg');
predicty = mlpfwd(net, training2c');
sum(abs(round(predicty)' - trainingy))
predicty = mlpfwd(net, test2c');
sum(abs(round(predicty)' - testy))

Overall, we only see very moderate changes between the parametrizations. With the above setting, we obtain approximately 30 misclassifications on the test data set. We do not observe large outliers and have lower training costs and fewer signs of overfitting than with the more complex model of 4.2.2.

4.3 Radial Basis Function Network (6 points)

Now use a RBF network (read demrbf1.m to familiarize yourself with the most important commands) and train a network with 2 hidden units and Gaussian radial basis functions. What does the correct classification rate on the training and test data look like? Explain what the model you applied corresponds to and how you could interpret it.

% Create and initialize network weight and parameter vectors.
net = rbf(2, 2, 1, 'gaussian');
options = foptions;
options(1) = 1;    % Display EM training
options(14) = 20;  % Number of iterations of EM
net = rbftrain(net, options, training2c', trainingy');
predicty = rbffwd(net, training2c');
sum(abs(round(predicty)' - trainingy))
predicty = rbffwd(net, test2c');
sum(abs(round(predicty)' - testy))

The RBF network with two Gaussian basis functions is very closely related to QDA.

The difference is that while in QDA the covariance matrix was directly estimated from the data using a maximum likelihood approach, it is learned iteratively in the RBF network. We see more misclassifications (7) on the training data set than with the MLP, but fewer misclassifications (24) on the test data set. This indicates that the RBF network with so little complexity (only 2 Gaussians) generalizes very well and captures the essence of the data without the danger of overfitting.

4.4 Bonus: PCA computation using Neural Networks (6 points)

Now, use Oja's Rule to implement a neural network which computes the first principal component. Write a function w = function(trainingx, rate, delta, maxiter) which takes the p x n dimensional training data with n observations as input, as well as the learning rate rate, the convergence criterion delta (to which the norm of the residuals is compared in each step), and the maximum number of iterations maxiter, and outputs the first principal component w. Oja's Rule states that for each observation i, the current result is computed by

y = w'*trainingx(:,i);

and w is updated by

w = w + rate*(y*trainingx(:,i) - y^2*w);

Normalizing w simplifies the convergence. Comment on the performance of the approach.

function w = oja(x, rate, delta, maxiter)
% x is the set of input vectors; each column of x is one sample.
% w is the returned first principal component.
[m, n] = size(x);
% random initial weights
w = rand(m,1);
% normalize weights
w = w/norm(w);
% run through all input samples
for iter = 1:maxiter
    wold = w;   % keep the old result for the residual computation
    for ii = 1:n
        y = w'*x(:,ii);
        w = w + rate*(y*x(:,ii) - y^2*w);
        % normalize again
        w = w/norm(w);
    end
    if (norm(w - wold) < delta)   % compare the residual to delta
        break;
    end
end

The algorithm is strongly dependent on the input values and might not converge easily in many settings; thus many iterations are required.
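As a quick check of the approach, not part of the original solution, the output of oja can be compared with the leading eigenvector of the sample covariance matrix on synthetic data; the learning rate and convergence threshold below are illustrative choices only.

% generate zero-mean, correlated 2-D data
n = 500;
x = [2 0.8; 0.8 1]*randn(2, n);
x = x - repmat(mean(x, 2), 1, n);   % center the data

w = oja(x, 0.001, 1e-6, 1000);      % first principal component via Oja's rule

[V, D] = eig(cov(x'));              % reference solution from the covariance matrix
[maxval, idx] = max(diag(D));
v = V(:, idx);

abs(w'*v)                           % close to 1 if the two directions agree (up to sign)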
