Week 3: Perceptron and Multi-layer Perceptron


Phong Le, Willem Zuidema
November 12, 2013

Last week we studied two famous biological neuron models, the Fitzhugh-Nagumo model and the Izhikevich model. This week, we will first explore another one which, though less biologically faithful, is computationally very practical and widely used in artificial intelligence: the perceptron. Then we will see how to combine many such neurons to build complex neural networks called multi-layer perceptrons.

Required Library
We'll use the library RSNNS [1], which is installed by typing install.packages("RSNNS").
[1] http://cran.r-project.org/web/packages/rsnns/index.html

Required R Code
At http://www.illc.uva.nl/laco/clas/fncm13/assignments/computerlab-week3/ you can find the R-files you need for this exercise.

1 Classification and Regression

Many tasks in artificial intelligence can be cast as supervised learning: given a training dataset {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, we need to find a function y = f(x) that is also applicable to unseen patterns x. Based on the nature of y, we can divide those tasks into two classes:

Classification assigns a predefined label to an unknown pattern. For instance, given a picture of an animal, we need to identify whether the picture shows a cat, a dog, a mouse, etc. If there are two categories (two classes), the problem is called binary classification; otherwise it is multi-class classification. Binary classification has a special case in which the patterns of the two classes are perfectly separated by a hyper-plane (see Figure 1). We call this phenomenon linear separability, and the hyper-plane a decision boundary.

Figure 1: Example of linear separability.

Regression differs from classification in that what we need to assign to an unknown pattern is a real number, not a label. For instance, given the height and age of a person, can we infer his/her weight?
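To make the notion of linear separability (Figure 1) concrete, here is a small illustrative R sketch of our own (it is not one of the provided course files): it generates two 2-D point clouds and draws one possible decision boundary.

    # Illustrative sketch only (not one of the provided R files):
    # two 2-D point clouds that are linearly separable, plus one separating line.
    set.seed(1)
    n <- 20
    class0 <- cbind(rnorm(n, mean = 0, sd = 0.5), rnorm(n, mean = 0, sd = 0.5))
    class1 <- cbind(rnorm(n, mean = 3, sd = 0.5), rnorm(n, mean = 3, sd = 0.5))
    plot(rbind(class0, class1), col = rep(c("blue", "red"), each = n),
         pch = 19, xlab = "x1", ylab = "x2")
    abline(a = 3, b = -1, lty = 2)   # the line x2 = 3 - x1 acts as a decision boundary

With this seed the two clouds are comfortably separated, so the dashed line classifies every point correctly.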

2 Perceptron

A perceptron is a simplified neuron that receives its inputs as a vector of real numbers and outputs a single real number (see Figure 2). Mathematically, a perceptron is described by the equation

    y = f(w_1 x_1 + w_2 x_2 + ... + w_n x_n + b) = f(w^T x + b)

where w_1, ..., w_n are weights, b is a bias, x_1, ..., x_n are the inputs, y is the output, and f is an activation function. In this section we will use the threshold binary function (see Figure 3)

    f(z) = 1 if z > 0, and 0 otherwise.

Figure 2: Perceptron (from http://en.wikipedia.org/wiki/File:Perceptron.svg).

Figure 3: Threshold binary activation function.

2.1 Prediction

A new pattern x is assigned the label ŷ = f(w^T x + b). The mean square error (MSE) and the classification accuracy on a sample {(x_1, y_1), ..., (x_m, y_m)} are defined respectively as

    error = (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)²        accuracy = (1/m) Σ_{i=1}^{m} I_{y_i}(ŷ_i)

where I_u(v) is the indicator function, which returns 1 if u = v and 0 otherwise.
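The prediction rule and the two evaluation measures above translate directly into a few lines of R. The sketch below is purely illustrative and independent of the provided files (the function and variable names are our own):

    # Sketch of perceptron prediction, MSE and accuracy (illustrative; names are our own).
    f <- function(z) ifelse(z > 0, 1, 0)            # threshold binary activation

    perceptron.predict <- function(w, b, x) {
      f(sum(w * x) + b)                             # y.hat = f(w^T x + b)
    }

    # X: m-by-n matrix of patterns (one row per pattern), y: vector of m true labels
    evaluate <- function(w, b, X, y) {
      y.hat <- apply(X, 1, function(x) perceptron.predict(w, b, x))
      list(error    = mean((y - y.hat)^2),          # mean square error
           accuracy = mean(y == y.hat))             # fraction of correct labels
    }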

2.2 Training

Traditionally, a perceptron is trained in an online-learning manner with the delta rule: we randomly pick an example (x, y) and update the weights and bias as follows

    w_new ← w_old + η(y − ŷ_old) x        (1)
    b_new ← b_old + η(y − ŷ_old)          (2)

where η is a learning rate (0 < η < 1) and ŷ_old is the prediction based on the old weights and bias. Intuitively, we only update the weights and bias if our prediction ŷ_old differs from the true label y, and the amount of the update is (negatively, in the case y = 0) proportional to x. To see why this works, let's apply the updated weights and bias to that same example:

    w_new^T x + b_new = (w_old + η(y − ŷ_old) x)^T x + b_old + η(y − ŷ_old)
                      = w_old^T x + b_old + η(y − ŷ_old)(‖x‖² + 1)

Now assume that ŷ_old = 0, so that w_old^T x + b_old ≤ 0. If y = 0, nothing happens. Otherwise, η(y − ŷ_old)(‖x‖² + 1) > 0, and therefore w_new^T x + b_new > w_old^T x + b_old. In other words, the update pulls the decision boundary in a direction that makes the prediction on that example less incorrect (see Figure 4). Novikoff (1962) proves that if the training dataset is linearly separable, the training is guaranteed to find a hyper-plane that correctly separates the training dataset (i.e., error = 0) in a finite number of update steps. Otherwise, the training will not converge.

Figure 4: Example of weight update. The chosen example is enclosed in the green box. The update pulls the plane in a direction that makes the prediction on that example less incorrect.

Exercise 2.1: We have set up an experiment for the logic operation OR in the file perceptron_logic_opt.r. You should see

    x <- t(matrix(c(
        0,0,
        0,1,
        1,0,
        1,1), 2,4))
    # OR #
    y <- matrix(c(
        0,
        1,
        1,
        1), 4, 1)

which is mathematically written as

    x_1 = (0, 0)    y_1 = 0
    x_2 = (0, 1)    y_2 = 1
    x_3 = (1, 0)    y_3 = 1
    x_4 = (1, 1)    y_4 = 1

and the parameters

    maxit = 25          # number of iterations
    learn.rate = 0.1    # learning rate
    stepbystep = TRUE   # the program will output intermediate results on the R console
    animation = TRUE

Now type source("perceptron_logic_opt.r") to execute the file. First, you will see the following in your R console

    ------ init -----
    weights 0 0
    bias 0
    learning rate: 0.1
    pick x[ 1 ] = 0 0

then compute yourself

    true target y: ...
    prediction ŷ: ...
    weights after update: ...
    bias after update: ...

Press Enter to see whether your computation is correct. At the same time, a plot will appear showing which example (black circle) is currently being used and what the current decision boundary looks like. Repeat this until the program finishes.

Exercise 2.2: Repeat Exercise 2.1 for the XOR operation. Before that, you need to open the file perceptron_logic_opt.r and change y so that the dataset expresses the XOR operation. What can you conclude about the convergence of the training? Explain why.
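As a companion to Exercises 2.1 and 2.2, the following sketch re-implements the delta rule of Equations (1) and (2) as a plain R loop. It is our own illustration, not the code of perceptron_logic_opt.r. With the OR targets the loop converges quickly; swapping in the XOR targets c(0, 1, 1, 0) does not, which is the behaviour Exercise 2.2 asks you to explain.

    # Illustrative delta-rule training loop (not the provided perceptron_logic_opt.r).
    f <- function(z) ifelse(z > 0, 1, 0)
    x <- t(matrix(c(0,0, 0,1, 1,0, 1,1), 2, 4))
    y <- c(0, 1, 1, 1)                    # OR; try c(0, 1, 1, 0) for XOR
    w <- c(0, 0); b <- 0; eta <- 0.1
    for (it in 1:25) {
      for (i in sample(1:4)) {            # visit the examples in random order
        y.hat <- f(sum(w * x[i, ]) + b)
        w <- w + eta * (y[i] - y.hat) * x[i, ]   # Equation (1)
        b <- b + eta * (y[i] - y.hat)            # Equation (2)
      }
      y.all <- apply(x, 1, function(xi) f(sum(w * xi) + b))
      cat("iteration", it, "accuracy", mean(y.all == y), "\n")
    }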

Exercise 2.3: In this exercise, you will build a perceptron for a breast cancer classification task. A dataset is provided in the file breast-cancer.dat.shuf, which contains 683 rows (i.e., cases) with the following format

    label feature_1 feature_2 ... feature_10

You are required to play with the file perceptron_breast_cancer.r.

- Open the file and set values for maxit (say, 100) and learn.rate (say, 0.1).
- Execute the file (source("perceptron_breast_cancer.r")). Report what you get.
- Train the perceptron and report what you get again. Do you get the same results? Why or why not?
- Look at the graph showing the error at each iteration and explain why there are many peaks.

3 Multi-layer Perceptron

The perceptron model presented above is very limited: it is theoretically applicable only to linearly separable data. In order to handle non-linearly separable data, the perceptron is extended to a more complex structure, the multi-layer perceptron (MLP). An MLP is a neural network in which neuron layers are stacked such that the output of a neuron in one layer may only be an input to neurons in the layer above (see Figure 5).

Figure 5: A 3-layer perceptron with 2 hidden layers (from http://www.statistics4u.com/fundstat_eng/cc_ann_bp_function.html).

It turns out that, if the activation functions of those neurons are non-linear, such as the sigmoid (or logistic) function (see Figure 6)

    sigm(z) = 1 / (1 + e^(−z))

an MLP is capable of capturing highly non-linear structure in the data: Hornik et al. (1989) prove that complex-enough MLPs can approximate any continuous function to arbitrarily small error.

Figure 6: Sigmoid activation function.
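As a quick aside, the sigmoid of Figure 6 is easy to inspect in R. The sketch below is ours (not part of the course files); it also plots the derivative, using the standard identity sigm'(z) = sigm(z)(1 − sigm(z)), which is handy for the backpropagation exercises later on.

    # Sketch: the sigmoid activation and its derivative (illustrative only).
    sigm  <- function(z) 1 / (1 + exp(-z))
    dsigm <- function(z) sigm(z) * (1 - sigm(z))   # standard derivative of the sigmoid
    curve(sigm,  from = -6, to = 6, xlab = "z", ylab = "sigm(z)")
    curve(dsigm, from = -6, to = 6, add = TRUE, lty = 2)   # dashed: derivative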

Let's denote by w_{li,(l+1)j} the weight of the link from the i-th neuron in the l-th layer to the j-th neuron in the (l+1)-th layer. Then we can formalize the i-th neuron in the l-th layer by the equation

    y_{li} = f_{li}(z_{li}),    z_{li} = Σ_{j=1}^{n_{l−1}} w_{(l−1)j,li} y_{(l−1)j} + b_{li}        (3)

where y_{li}, f_{li}, b_{li} are respectively its output, activation function, and bias, and n_l is the number of neurons in the l-th layer. (Note that, for convenience, we denote y_{0i} ≡ x_i.) Informally speaking, a neuron is activated by the weighted sum of the outputs of the neurons in the layer below.

3.1 Prediction

Prediction is very fast thanks to the feedforward algorithm (Figure 7, left). Given x, we first compute the outputs of the neurons in the first layer; then we compute the outputs of the neurons in the second layer; and so on, until we reach the top layer.

Figure 7: Feedforward (left) and backpropagation (right).
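The feedforward computation of Equation (3) and Section 3.1 can be written in a few lines of R. The sketch below is a generic illustration with made-up weights (it does not correspond to Figure 9 or to any provided file): weights are stored as one matrix per layer, and each layer's output is the sigmoid of a weighted sum of the previous layer's outputs.

    # Illustrative feedforward pass for a small MLP with sigmoid units.
    # W[[l]] has one row per neuron of layer l and one column per neuron of layer l-1;
    # b[[l]] is the bias vector of layer l. All numbers here are made up.
    sigm <- function(z) 1 / (1 + exp(-z))
    W <- list(matrix(c(0.1, -0.2, 0.4, 0.3), nrow = 2, byrow = TRUE),  # layer 1: 2 neurons, 2 inputs
              matrix(c(0.5, -0.5), nrow = 1))                          # layer 2: 1 output neuron
    b <- list(c(0.5, 0.5), 0.5)

    feedforward <- function(x, W, b) {
      y <- x                                   # y_{0i} = x_i
      for (l in seq_along(W)) {
        z <- as.vector(W[[l]] %*% y + b[[l]])  # z_{li} = sum_j w_{(l-1)j,li} y_{(l-1)j} + b_{li}
        y <- sigm(z)                           # y_{li} = f(z_{li})
      }
      y                                        # output of the top layer
    }

    feedforward(c(1, 0), W, b)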

3.2 Training

Training an MLP means minimizing an objective function with respect to its parameters (i.e., weights and biases); the objective is related to the task the MLP is used for. For instance, for the binary classification task the following objective function is widely used

    E(θ) = (1/n) Σ_{(x,y)∈D} (y − ŷ)²

where D is a training dataset of n examples, ŷ is the output of the MLP given an input x, and θ is its set of weights and biases. In order to minimize the objective function E(θ), we will use the gradient descent method (see Figure 8), which says that the amount of update for a parameter is negatively proportional to the gradient at its current value.

Figure 8: Illustration of the gradient descent method. The blue line is the tangent at the current value of the parameter w. If we update w by subtracting an amount proportional to the gradient at that point, the value of E is pushed along the arrow and hence decreases.

However, this method is only guaranteed to converge to a local minimum. The central point of the gradient descent method is to compute the gradient ∂E/∂w for every w ∈ θ, which is easily done with the chain rule (L denotes the top layer):

    ∂E/∂z_{Li} = ∂E/∂y_{Li} · ∂y_{Li}/∂z_{Li}                                                  (4)

    ∂E/∂z_{li} = Σ_j ∂E/∂z_{(l+1)j} · ∂z_{(l+1)j}/∂y_{li} · ∂y_{li}/∂z_{li}
               = Σ_j ∂E/∂z_{(l+1)j} · w_{li,(l+1)j} · ∂y_{li}/∂z_{li}                          (5)

    ∂E/∂w_{li,(l+1)j} = ∂E/∂z_{(l+1)j} · ∂z_{(l+1)j}/∂w_{li,(l+1)j} = ∂E/∂z_{(l+1)j} · y_{li}  (6)

That is the main idea of the back-propagation algorithm (see Figure 7, right).

Exercise 3.1(!): You may notice that none of the formulas above says how to compute ∂E/∂b_{li} for all l, i. That's your task.

Exercise 3.2(!): Compute y_{li} using the feedforward algorithm, and ∂E/∂w_{li,(l+1)j} and ∂E/∂b_{li} using the backpropagation algorithm, for all l, i, j, for the 2-layer MLP shown in Figure 9, in which all biases are 0.5 and all neurons have the sigmoid activation function, with x = (−0.5, 0.2) and y = 1.

Figure 9: MLP for Exercise 3.2. All the biases are 0.5. All the neurons have the sigmoid activation function.

The RSNNS library provides a function named mlp for training an MLP. Carefully read the description of that function in the library documentation (http://cran.r-project.org/web/packages/rsnns/RSNNS.pdf) because you will use it in the next exercise. Note that its arguments x, y are similar to those of the perceptron function above. If learnFunc="Std_Backpropagation" (the default), then learnFuncParams plays the role of learn.rate. The default values of the other parameters are fine for our purposes; you need not change them.
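To make the description above concrete, a typical call looks roughly like the sketch below. It uses only the arguments mentioned here (size, maxit, learnFunc, learnFuncParams) together with the generic predict; the specific values are ours, so treat it as an illustration rather than the exact contents of the provided mlp files.

    # Illustrative use of RSNNS::mlp (not the provided mlp_logic_opt.r / mlp_breast_cancer.r).
    library(RSNNS)
    x <- t(matrix(c(0,0, 0,1, 1,0, 1,1), 2, 4))
    y <- matrix(c(0, 1, 1, 0), 4, 1)                 # XOR targets
    model <- mlp(x, y,
                 size = c(5),                        # one hidden layer with 5 neurons
                 maxit = 100,                        # number of training iterations
                 learnFunc = "Std_Backpropagation",
                 learnFuncParams = c(0.1))           # plays the role of learn.rate
    round(predict(model, x))                         # predicted labels for the training patterns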

Exercise 3.3: Use the file mlp_logic_opt.r for Exercises 2.1 and 2.2 (OR and XOR); you don't need to compute anything by yourself.

- Open the file and set values for x and y (the training dataset).
- Set values for maxit (say, 100), learn.rate (say, 0.1), and size (say, c(5), i.e. one hidden layer with 5 neurons).
- Execute the file (source("mlp_logic_opt.r")).
- Find a value for maxit such that the accuracy is 1. What is the weighted SSE?

Exercise 3.4: Use the file mlp_breast_cancer.r for Exercise 2.3.

- Open the file and set values for maxit (say, 100), learn.rate (say, 0.1), and size (say, c(5), i.e. one hidden layer with 5 neurons).
- Execute the file (source("mlp_breast_cancer.r")).

4 Submission

You have to submit a file named your_name.pdf for the exercises requiring explanations and math solutions. Note that exercises marked with (!) do not need to be submitted. The deadline is 15:00, Monday 18 Nov. If you have any questions, contact Phong Le (p.le@uva.nl).

References

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.

Novikoff, A. (1962). On convergence proofs on perceptrons. 12:615–622.