Supervised Learning with Neural Networks We now look at how an agent might learn to solve a general problem by seeing examples. Aims: to present an outline of supervised learning as part of AI; to introduce much of the notation and terminology used; to introduce the classical perceptron, and to show how it can be applied more generally using kernels; to introduce multilayer perceptrons and the backpropagation algorithm for training them. Reading: Russell and Norvig, chapters 18 and 19. Copyright © Sean Holden 2002-2005.

An example A common source of problems in AI is medical diagnosis. Imagine that we want to automate the diagnosis of an embarrassing disease (call it $D$) by constructing a machine that takes as input measurements from the patient, e.g. heart rate, blood pressure, presence of green spots and so on, and outputs $1$ if the patient suffers from $D$ and $0$ otherwise. Could we do this by explicitly writing a program that examines the measurements and outputs a diagnosis? Experience suggests that this is unlikely.

An example, continued... Let's look at an alternative approach. Each collection of measurements can be written as a vector $\mathbf{x} = (x_1, x_2, x_3, \ldots)$ where, for example, $x_1 = $ heart rate, $x_2 = $ blood pressure, $x_3 = 1$ if the patient has green spots and $0$ otherwise, and so on.

An example, continued... A vector of this kind contains all the measurements for a single patient and is generally called a feature vector or instance. The measurements are usually referred to as attributes or features. (Technically, there is a difference between an attribute and a feature, but we won't have time to explore it.) Attributes or features generally appear as one of three basic types: continuous: $x_i \in \mathbb{R}$; binary: $x_i \in \{0, 1\}$ or $x_i \in \{-1, +1\}$; discrete: $x_i$ can take one of a finite number of values, say $x_i \in \{v_1, v_2, \ldots, v_d\}$.

An example, continued... Now imagine that we have a large collection of patient histories ($m$ in total) and for each of these we know whether or not the patient suffered from $D$. The $i$th patient history gives us an instance $\mathbf{x}_i$. This can be paired with a single bit, $0$ or $1$, denoting whether or not the $i$th patient suffers from $D$. The resulting pair $(\mathbf{x}_i, y_i)$ is called an example or a labelled example. Collecting all the examples together we obtain a training sequence $\mathbf{s} = \left((\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m)\right)$.
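As a concrete illustration (not part of the original notes), a training sequence of this kind can be stored as an array of feature vectors together with an array of labels. The feature values and names below are invented for the example.

    import numpy as np

    # Hypothetical training data: each row is an instance (feature vector) with
    # columns for heart rate, blood pressure, and a binary "green spots" flag.
    X = np.array([
        [72.0, 118.0, 0.0],
        [95.0, 140.0, 1.0],
        [60.0, 110.0, 0.0],
    ])

    # y[i] is 1 if patient i suffers from the disease, 0 otherwise.
    y = np.array([0, 1, 0])

    # The training sequence s is the paired collection of examples (x_i, y_i).
    s = list(zip(X, y))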

An example, continued... In the form of machine learning to be emphasised here, we aim to design a learning algorithm $L$ which takes the training sequence $\mathbf{s}$ and produces a hypothesis $h = L(\mathbf{s})$. Intuitively, a hypothesis is something that lets us diagnose new patients.

But what's a hypothesis? Denote by $\mathcal{X}$ the set of all possible instances. Then a hypothesis $h$ could simply be a function from $\mathcal{X}$ to $\{0, 1\}$. In other words, $h$ can take any instance $\mathbf{x}$ and produce a $0$ or a $1$, depending on whether (according to $h$) the patient the measurements were taken from is suffering from $D$.

But what's a hypothesis? There are some important issues here, which are effectively right at the heart of machine learning. Most importantly: the hypothesis $h$ can assign a $0$ or a $1$ to any $\mathbf{x} \in \mathcal{X}$. This includes instances $\mathbf{x}$ that did not appear in the training sequence. The overall process therefore involves rather more than memorising the training sequence. Ideally, the aim is that $h$ should be able to generalize. That is, it should be possible to use it to diagnose new patients.

But what's a hypothesis? In fact we need a slightly more flexible definition of a hypothesis. A hypothesis is a function from $\mathcal{X}$ to some suitable set $\mathcal{Y}$, because: there may be more than two classes, for example $\mathcal{Y} = \{\text{no disease}, \text{disease } D_1, \text{disease } D_2, \ldots\}$; or we might want $h(\mathbf{x})$, taking values in $[0, 1]$, to indicate how likely it is that the patient has the disease, where $1$ denotes that the patient definitely does have the disease, $0$ denotes that they definitely do not have it, and a value close to $1$ might, for example, denote that the patient is reasonably certain to have the disease.

But what's a hypothesis? One way of thinking about the previous case is in terms of probabilities: $h(\mathbf{x}) = \Pr(\mathbf{x} \text{ is in class } 1)$. We may also have $\mathcal{Y} = \mathbb{R}$: for example, if $\mathbf{x}$ contains several recent measurements of a currency exchange rate and we want $h(\mathbf{x})$ to be a prediction of what the rate will be some minutes into the future. Such problems are generally known as regression problems.

Types of learning The form of machine learning described here is called supervised learning, and this introduction will concentrate on it. The literature also discusses: 1. Unsupervised learning. 2. Learning using membership queries and equivalence queries. 3. Reinforcement learning.

Some further examples Speech recognition. Deciding whether or not to give credit. Detecting credit card fraud. Deciding whether to buy or sell a stock option. Deciding whether a tumour is benign. Data mining - that is, extracting interesting but hidden knowledge from existing, large databases; for example, databases containing financial transactions or loan applications. Deciding whether driving conditions are dangerous. Automatic driving. (See Pomerleau, 1989, in which a car is driven for 90 miles at 70 miles per hour, on a public road with other cars present, but with no assistance from humans!) Playing games. (For example, see Tesauro, 1992 and 1995, where a world class backgammon player is described.)

What have we got so far? Extracting what we have so far, we get the following central ideas: A collection $\mathcal{X}$ of possible instances. A collection $\mathcal{Y}$ of classes to which any instance can belong; in some scenarios we might have $\mathcal{Y} = \mathbb{R}$. A training sequence $\mathbf{s}$ containing $m$ labelled examples $(\mathbf{x}_i, y_i)$, with $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$ for $i = 1, \ldots, m$. A learning algorithm $L$, which takes $\mathbf{s}$ and produces a hypothesis $h$; we can write $h = L(\mathbf{s})$.

What have we got so far? But we need some more ideas as well: We often need to state what kinds of hypotheses are available to $L$. The collection of available hypotheses is called the hypothesis space and is denoted $\mathcal{H}$. Some learning algorithms do not always return the same $h$ for each run on a given training sequence $\mathbf{s}$; in this case $L(\mathbf{s})$ is a probability distribution on $\mathcal{H}$.

What's the right answer? We may sometimes assume that there is a correct function that governs the relationship between the $\mathbf{x}$s and the labels. This is called the target concept and is denoted $c$. It can be regarded as the perfect hypothesis, and so we have $y_i = c(\mathbf{x}_i)$ for each example. This is not always a sufficient way of thinking about the problem...

Generalization The learning algorithm never gets to know exactly what the correct relationship between instances and classes is - it only ever gets to see a finite number of examples. So: generalization corresponds to the ability of $L$ to pick a hypothesis $h$ which is close in some sense to the best possible one; however, we have to be careful about what close means here. For example, what if some instances are much more likely than others?

Generalization performance How can generalization performance be assessed? Model the generation of training examples using a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. All examples are assumed to be independent and identically distributed (i.i.d.) according to $P$. Given a hypothesis $h$ and any example $(\mathbf{x}, y)$ we can introduce a measure $\ell(h(\mathbf{x}), y)$ of the error that $h$ makes in classifying that example. For example, the following definitions for $\ell$ might be appropriate: $\ell(h(\mathbf{x}), y) = 0$ if $h(\mathbf{x}) = y$ and $1$ otherwise, for a classification problem; and $\ell(h(\mathbf{x}), y) = (h(\mathbf{x}) - y)^2$ when $\mathcal{Y} = \mathbb{R}$, for a regression problem.

Generalization performance A reasonable definition of generalization performance is then $\mathrm{er}(h) = \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[\ell(h(\mathbf{x}), y)\right]$. In the case of the definition for classification problems given on the previous slide this gives $\mathrm{er}(h) = \Pr\left(h(\mathbf{x}) \neq y\right)$. In the case of the definition given for regression problems, $\mathrm{er}(h)$ is the expected square of the difference between the true label and the predicted label.
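In practice these expectations are estimated empirically. A minimal sketch (my own addition, assuming a held-out set of examples drawn i.i.d. from the same distribution and a hypothesis given as a Python callable) of the two estimates:

    import numpy as np

    def classification_error(h, X, y):
        """Fraction of held-out examples that the hypothesis misclassifies."""
        predictions = np.array([h(x) for x in X])
        return np.mean(predictions != y)

    def squared_error(h, X, y):
        """Mean squared difference between predicted and true labels (regression)."""
        predictions = np.array([h(x) for x in X])
        return np.mean((predictions - y) ** 2)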

Problems encountered in practice In practice there are various problems that can arise: measurements may be missing from the $\mathbf{x}$ vectors; there may be noise present; classifications may be incorrect. The practical techniques to be presented have their own approaches to dealing with such problems. Similarly, problems arising in practice are addressed by the theory.

[Figure: the overall learning framework. The target concept is NOT KNOWN; some examples are available, but some measurements may be missing and noise may corrupt them. The noisy examples, together with prior knowledge, are supplied to the learning algorithm, which produces a hypothesis.]

The perceptron: a review The initial development of linear discriminants was carried out by Fisher in 1936, and since then they have been central to supervised learning. Their influence continues to be felt in the recent and ongoing development of support vector machines, of which more later...

The perceptron: a review We have a two-class classification problem in $\mathbb{R}^n$. We compute the linear function $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$ and output class $+1$ if $f(\mathbf{x}) \geq 0$, or class $-1$ if $f(\mathbf{x}) < 0$.

The perceptron: a review So the hypothesis is $h(\mathbf{x}) = \sigma\left(\mathbf{w}^T\mathbf{x} + w_0\right)$, where $\sigma(z) = +1$ if $z \geq 0$ and $\sigma(z) = -1$ otherwise.

The primal perceptron algorithm
    w ← 0; w_0 ← 0; R ← max_i ||x_i||
    do
        for (each example (x_i, y_i) in s)
            if (y_i (w^T x_i + w_0) ≤ 0)
                w ← w + η y_i x_i
                w_0 ← w_0 + η y_i R^2
    while (mistakes are made in the for loop)
    return w, w_0.
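A sketch of the primal algorithm in Python (my own illustration, not code from the notes; it assumes labels in {-1, +1}, and the bias update using R^2 follows the convention used in the pseudocode above, with the learning rate eta and the epoch cap as free parameters):

    import numpy as np

    def primal_perceptron(X, y, eta=0.1, max_epochs=1000):
        """Train a perceptron; X is (m, n), y contains labels -1 or +1."""
        m, n = X.shape
        w = np.zeros(n)
        b = 0.0
        R = np.max(np.linalg.norm(X, axis=1))
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:   # misclassified
                    w += eta * yi * xi
                    b += eta * yi * R ** 2
                    mistakes += 1
            if mistakes == 0:                        # a full pass with no mistakes
                break
        return w, b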

Novikoff's theorem The perceptron algorithm does not converge if $\mathbf{s}$ is not linearly separable. However, Novikoff proved the following: Theorem 1 If $\mathbf{s}$ is non-trivial and linearly separable, with $\|\mathbf{x}_i\| \leq R$ for all $i$, and there exists a hyperplane $(\mathbf{w}_{\mathrm{opt}}, w_{0\,\mathrm{opt}})$ with $\|\mathbf{w}_{\mathrm{opt}}\| = 1$ and $y_i\left(\mathbf{w}_{\mathrm{opt}}^T\mathbf{x}_i + w_{0\,\mathrm{opt}}\right) \geq \gamma$ for all $i$, then the perceptron algorithm makes at most $\left(\frac{2R}{\gamma}\right)^2$ mistakes.

Dual form of the perceptron algorithm If we set $\mathbf{w} = \mathbf{0}$ initially, the primal perceptron algorithm operates by adding and subtracting misclassified points $\mathbf{x}_i$ to this initial $\mathbf{w}$ at each step. As a result, when it stops we can represent the final $\mathbf{w}$ as $\mathbf{w} = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i$. Note: the values $\alpha_i$ are positive and proportional to the number of times $\mathbf{x}_i$ is misclassified; if $\mathbf{s}$ is fixed then the vector $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ is an alternative representation of $\mathbf{w}$.

Dual form of the perceptron algorithm Using these facts, the hypothesis can be re-written as $h(\mathbf{x}) = \sigma\left(\sum_{i=1}^m \alpha_i y_i \mathbf{x}_i^T\mathbf{x} + w_0\right)$.

Dual form of the perceptron algorithm
    α ← 0; w_0 ← 0; R ← max_i ||x_i||
    do
        for (each example (x_i, y_i) in s)
            if (y_i (Σ_j α_j y_j x_j^T x_i + w_0) ≤ 0)
                α_i ← α_i + 1
                w_0 ← w_0 + y_i R^2
    while (mistakes are made in the for loop)
    return α, w_0.
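A corresponding sketch of the dual form (again my own illustration; it stores the mistake counts alpha_i and touches the training points only through inner products, which is what makes the kernel substitution introduced below possible):

    import numpy as np

    def dual_perceptron(X, y, max_epochs=1000):
        """Dual-form perceptron; returns the mistake counts alpha and the bias b."""
        m = X.shape[0]
        alpha = np.zeros(m)
        b = 0.0
        R = np.max(np.linalg.norm(X, axis=1))
        G = X @ X.T                              # Gram matrix of inner products x_j . x_i
        for _ in range(max_epochs):
            mistakes = 0
            for i in range(m):
                if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                    alpha[i] += 1
                    b += y[i] * R ** 2
                    mistakes += 1
            if mistakes == 0:
                break
        return alpha, b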

Mapping to a bigger space There are many problems a perceptron can t solve. The classic example is the parity problem.

Mapping to a bigger space But what happens if we add another element to For example we could use.? 1.2 1 0.8 1 x2 0.6 0.4 x1 * x2 0.5 0.2 0 0 0.2 0.5 0 0.5 1 1.5 x1 1 x2 0.5 0 0 0.5 x1 1 In a perceptron can easily solve this problem.

Mapping to a bigger space In general we map $\mathbf{x}$ to a new vector $\Phi(\mathbf{x}) = \left(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_k(\mathbf{x})\right)$, where each $\phi_i$ is a fixed function of $\mathbf{x}$, and apply the perceptron to $\Phi(\mathbf{x})$ rather than to $\mathbf{x}$.

Mapping to a bigger space This is an old trick, and the functions $\phi_i$ can be anything we like. Example: in a multilayer perceptron, $\phi_i(\mathbf{x}) = g\left(\mathbf{w}_i^T\mathbf{x} + b_i\right)$, where $\mathbf{w}_i$ is the vector of weights and $b_i$ the bias associated with hidden node $i$. Note however that for the time being the functions $\phi_i$ are fixed, whereas in a multilayer perceptron they are allowed to vary as a result of varying the $\mathbf{w}_i$ and $b_i$.

Mapping to a bigger space..

Mapping to a bigger space We can now use a perceptron in the high-dimensional space, obtaining a hypothesis of the form $h(\mathbf{x}) = \sigma\left(\sum_{i=1}^k w_i \phi_i(\mathbf{x}) + w_0\right)$, where the $w_i$ are the weights from the hidden nodes to the output node. So: start with $\mathbf{x}$; use the $\phi_i$ to map it to a bigger space; use a (linear) perceptron in the bigger space.

Mapping to a bigger space What happens if we use the dual form of the perceptron algorithm in this process? We end up with a hypothesis of the form $h(\mathbf{x}) = \sigma\left(\sum_{i=1}^m \alpha_i y_i \Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}) + w_0\right)$, where $\Phi(\mathbf{x})$ is the vector $\left(\phi_1(\mathbf{x}), \ldots, \phi_k(\mathbf{x})\right)$. Notice that this introduces the possibility of a tradeoff of $m$ against $k$: the sum has (potentially) become smaller, but the cost associated with this is that we may have to calculate $\Phi(\mathbf{x}_i)^T\Phi(\mathbf{x})$ several times.

Kernels This suggests that it might be useful to be able to calculate $\Phi(\mathbf{x})^T\Phi(\mathbf{z})$ easily. In fact such an observation has far-reaching consequences. Definition 2 A kernel is a function $K$ such that for all vectors $\mathbf{x}$ and $\mathbf{z}$, $K(\mathbf{x}, \mathbf{z}) = \Phi(\mathbf{x})^T\Phi(\mathbf{z})$. Note that a given kernel naturally corresponds to an underlying collection of functions $\phi_i$. Ideally, we want to make sure that the dimension $k$ of the expanded space does not have a great effect on the calculation of $K(\mathbf{x}, \mathbf{z})$; if this is the case then $h$ is easy to evaluate.

Kernels: an example We can use $K(\mathbf{x}, \mathbf{z}) = \left(\mathbf{x}^T\mathbf{z}\right)^p$ or $K(\mathbf{x}, \mathbf{z}) = \left(\mathbf{x}^T\mathbf{z} + c\right)^p$ to obtain polynomial kernels. In the latter case we have features that are monomials up to degree $p$, so the decision boundary obtained using $K$ as described above will be a polynomial curve of degree $p$.
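A quick numerical check (my own) that the degree-2 polynomial kernel $(\mathbf{x}^T\mathbf{z} + 1)^2$ really does equal an inner product of explicitly constructed monomial features, for two-dimensional inputs:

    import numpy as np

    def phi(x):
        """Monomial features up to degree 2 for x = (x1, x2), scaled so that
        the inner product matches (x.z + 1)**2."""
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2,
                         np.sqrt(2) * x1 * x2])

    def K(x, z):
        return (np.dot(x, z) + 1.0) ** 2

    x = np.array([0.3, -1.2])
    z = np.array([2.0, 1.0])
    print(K(x, z), np.dot(phi(x), phi(z)))   # the two values agree (up to rounding)

The point is that evaluating K directly costs roughly the same as one inner product in the original space, however many monomial features it implicitly corresponds to.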

Kernels: an example [Figure: a perceptron trained on 100 examples with a polynomial kernel, $p = 10$, $c = 1$. Left: the resulting decision boundary in the $(x_1, x_2)$ plane. Right: the weighted sum value (truncated at 1000) plotted over the same plane.]

Gradient descent An alternative method for training a basic perceptron works as follows. We define a measure of error $E(\mathbf{w})$ for a given collection of weights, for example $E(\mathbf{w}) = \sum_{i=1}^m \left(y_i - (\mathbf{w}^T\mathbf{x}_i + w_0)\right)^2$, where the threshold function $\sigma$ has not been used. Modifying our notation slightly, so that $w_0$ is absorbed into $\mathbf{w}$ and each $\mathbf{x}_i$ gains a corresponding constant element $1$, gives $E(\mathbf{w}) = \sum_{i=1}^m \left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2$.

Gradient descent $E(\mathbf{w})$ is parabolic and has a unique global minimum and no local minima. We therefore start with a randomly chosen $\mathbf{w}$ and update it as follows: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \left.\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w}_t}$, where $\eta$ is some small positive number. The vector $-\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}}$ tells us the direction of the steepest decrease in $E(\mathbf{w})$.
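A minimal gradient-descent sketch (my own, not from the notes) for the squared-error measure above, with the bias absorbed by appending a constant input of 1 to each instance; the step size eta and the number of steps are free parameters:

    import numpy as np

    def gradient_descent(X, y, eta=0.01, steps=1000):
        """Minimise E(w) = sum_i (y_i - w.x_i)^2 by steepest descent."""
        X = np.column_stack([X, np.ones(len(X))])    # constant input absorbs the bias
        w = np.random.randn(X.shape[1]) * 0.01       # random starting point
        for _ in range(steps):
            grad = -2.0 * X.T @ (y - X @ w)          # dE/dw
            w -= eta * grad                          # small step against the gradient
        return w                                     # eta must be small enough to avoid divergence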

Simple feedforward neural networks We continue using the same notation as previously. Usually, we think in terms of a training algorithm finding a hypothesis $h$ based on a training sequence $\mathbf{s}$, and then classifying new instances $\mathbf{x}$ by evaluating $h(\mathbf{x})$. With neural networks, the training algorithm instead provides a vector $\mathbf{w}$ of weights. We therefore have a hypothesis that depends on the weight vector. This is often made explicit by writing $h_{\mathbf{w}}$ and representing the hypothesis as a mapping depending on both $\mathbf{w}$ and the new instance $\mathbf{x}$, so classification of $\mathbf{x}$ is performed by evaluating $h(\mathbf{w}; \mathbf{x})$.

Backpropagation: the general case First, let's look at the general case. We have a completely unrestricted feedforward structure. For the time being, there may be several outputs, and no specific layering is assumed.

Backpropagation: the general case For each node $j$: $w_{ji}$ is the weight on the connection from node $i$ to node $j$; $a_j = \sum_i w_{ji} z_i$ is the weighted sum or activation for node $j$; $g$ is the activation function; and $z_j = g(a_j)$ is the output of node $j$.

Backpropagation: the general case In addition, there is often a bias input for each node, which is always set to $1$. This is not always included explicitly; sometimes the bias is included by writing the weighted summation as $a_j = b_j + \sum_i w_{ji} z_i$, where $b_j$ is the bias for node $j$.

Backpropagation: the general case As usual we have: a training sequence $\mathbf{s} = \left((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)\right)$; a measure of the error of the network on individual instances. We also define a measure of training error $E(\mathbf{w})$, where $\mathbf{w}$ is the vector of all the weights in the network. Our aim is to find a set of weights that minimises $E(\mathbf{w})$.

Backpropagation: the general case How can we find a set of weights that minimises $E(\mathbf{w})$? The approach used by the backpropagation algorithm is very simple: 1. begin at step $t = 0$ with a randomly chosen collection of weights $\mathbf{w}_0$; 2. at the $t$th step, calculate the gradient $\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}}$ at the point $\mathbf{w}_t$; 3. update the weight vector by taking a small step in the direction of the negative gradient, $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \left.\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w}_t}$; 4. repeat this process until $E(\mathbf{w})$ is sufficiently small.

Backpropagation: the general case In order to do this we have to calculate the derivatives $\frac{\partial E(\mathbf{w})}{\partial w_{ji}}$. Often $E(\mathbf{w})$ is the sum of separate components, one for each example in $\mathbf{s}$, $E(\mathbf{w}) = \sum_{p=1}^m E_p(\mathbf{w})$, in which case $\frac{\partial E(\mathbf{w})}{\partial w_{ji}} = \sum_{p=1}^m \frac{\partial E_p(\mathbf{w})}{\partial w_{ji}}$. We can therefore consider examples individually.

Backpropagation: the general case Place example $p$ at the inputs and calculate the values $a_j$ and $z_j$ for all the nodes. This is called forward propagation. We have $\frac{\partial E_p(\mathbf{w})}{\partial w_{ji}} = \frac{\partial E_p(\mathbf{w})}{\partial a_j}\,\frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i$, where we've defined $\delta_j = \frac{\partial E_p(\mathbf{w})}{\partial a_j}$ and used the fact that $\frac{\partial a_j}{\partial w_{ji}} = z_i$. So we now need to calculate the values $\delta_j$ for all the nodes.

Backpropagation: the general case When $j$ is an output unit this is easy, as $\delta_j = \frac{\partial E_p(\mathbf{w})}{\partial a_j} = \frac{\partial E_p(\mathbf{w})}{\partial z_j}\, g'(a_j)$, and the first term is in general easy to calculate for a given $E_p$.

Backpropagation: the general case When $j$ is not an output unit we have $\delta_j = \frac{\partial E_p(\mathbf{w})}{\partial a_j} = \sum_k \frac{\partial E_p(\mathbf{w})}{\partial a_k}\,\frac{\partial a_k}{\partial a_j}$, where $k$ ranges over the nodes to which node $j$ sends a connection.

Backpropagation: the general case Then by definition $\frac{\partial E_p(\mathbf{w})}{\partial a_k} = \delta_k$, and $\frac{\partial a_k}{\partial a_j} = w_{kj}\, g'(a_j)$. So $\delta_j = g'(a_j) \sum_k \delta_k w_{kj}$.

Backpropagation: the general case Summary: to calculate $\frac{\partial E_p(\mathbf{w})}{\partial w_{ji}}$ for one pattern: 1. Forward propagation: apply $\mathbf{x}_p$ and calculate the outputs $a_j$, $z_j$ and so on for all the nodes in the network. 2. Backpropagation 1: for output nodes, calculate $\delta_j = \frac{\partial E_p(\mathbf{w})}{\partial z_j}\, g'(a_j)$. 3. Backpropagation 2: for the other nodes, $\delta_j = g'(a_j)\sum_k \delta_k w_{kj}$, where the $\delta_k$ were calculated at an earlier step. The required derivative is then $\frac{\partial E_p(\mathbf{w})}{\partial w_{ji}} = \delta_j z_i$.
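The three steps written out in code (my own sketch, not code from the notes), assuming one hidden layer of sigmoid units, a single sigmoid output, and the squared-error measure $E_p = \frac{1}{2}(z_{\text{out}} - y_p)^2$:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backprop_single_example(x, y, W1, b1, w2, b2):
        """Gradients of E_p = 0.5*(z_out - y)**2 for one pattern.
        W1 (H, n), b1 (H,): hidden-layer weights and biases; w2 (H,), b2: output weights and bias."""
        # Forward propagation.
        a1 = W1 @ x + b1                  # activations of the hidden nodes
        z1 = sigmoid(a1)                  # outputs of the hidden nodes
        a2 = w2 @ z1 + b2                 # activation of the output node
        z2 = sigmoid(a2)                  # network output

        # Backpropagation 1: delta for the output node.
        delta2 = (z2 - y) * z2 * (1.0 - z2)

        # Backpropagation 2: deltas for the hidden nodes.
        delta1 = z1 * (1.0 - z1) * (w2 * delta2)

        # Derivatives dE_p/dw = delta_j * z_i (with z_i = x for the first layer).
        grad_W1 = np.outer(delta1, x)
        grad_b1 = delta1
        grad_w2 = delta2 * z1
        grad_b2 = delta2
        return grad_W1, grad_b1, grad_w2, grad_b2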

Backpropagation: a specific example [Figure: a network with one layer of hidden nodes and a single output node.]

Backpropagation: a specific example For the output node we use the sigmoid activation function $g(a) = \sigma(a) = \frac{1}{1 + \exp(-a)}$. For the other (hidden) nodes we use the same activation function, so $g'(a) = \sigma(a)\left(1 - \sigma(a)\right)$. Also, for a single pattern we use the error measure $E_p(\mathbf{w}) = \frac{1}{2}\left(y_p - z_{\text{output}}\right)^2$, where $z_{\text{output}}$ is the network output.

Backpropagation: a specific example For the output: we have $\delta_{\text{output}} = \frac{\partial E_p(\mathbf{w})}{\partial z_{\text{output}}}\, g'(a_{\text{output}})$ and $\frac{\partial E_p(\mathbf{w})}{\partial z_{\text{output}}} = z_{\text{output}} - y_p$, so $\delta_{\text{output}} = \left(z_{\text{output}} - y_p\right)\sigma(a_{\text{output}})\left(1 - \sigma(a_{\text{output}})\right)$ and $\frac{\partial E_p(\mathbf{w})}{\partial w_{\text{output},j}} = \delta_{\text{output}}\, z_j$.

Backpropagation: a specific example For the hidden nodes: we have $\delta_j = g'(a_j)\sum_k \delta_k w_{kj}$, but there is only one output, so $\delta_j = g'(a_j)\,\delta_{\text{output}}\, w_{\text{output},j}$, and we already have a value for $\delta_{\text{output}}$, so $\frac{\partial E_p(\mathbf{w})}{\partial w_{ji}} = \sigma(a_j)\left(1 - \sigma(a_j)\right)\delta_{\text{output}}\, w_{\text{output},j}\, z_i$.

Putting it all together We can then use the derivatives in one of two basic ways. Batch: (as described previously) sum the derivatives over all the patterns before updating the weights. Sequential: update the weights using just one pattern at a time, selecting patterns either in sequence or at random.
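A sketch of the two usages (my own, reusing the hypothetical backprop_single_example from the earlier sketch; params is the list [W1, b1, w2, b2]):

    import numpy as np

    def batch_step(examples, params, eta):
        """Batch: sum the per-pattern gradients over the whole training
        sequence, then take one step."""
        W1, b1, w2, b2 = params
        grads = [np.zeros_like(p) for p in params]
        for x, y in examples:
            for g, dg in zip(grads, backprop_single_example(x, y, W1, b1, w2, b2)):
                g += dg
        return [p - eta * g for p, g in zip(params, grads)]

    def sequential_step(examples, params, eta, rng):
        """Sequential: update the weights after each randomly chosen pattern."""
        for i in rng.permutation(len(examples)):
            x, y = examples[i]
            W1, b1, w2, b2 = params
            grads = backprop_single_example(x, y, W1, b1, w2, b2)
            params = [p - eta * g for p, g in zip(params, grads)]
        return params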

Example: the classical parity problem As an example we show the result of training a network with: two inputs; one output; one hidden layer; all other details as above. The problem is the classical parity problem, and the training data consist of noisy examples. The sequential approach is used, with repeated passes through the entire training sequence.

Example: the classical parity problem [Figure: the behaviour of the network in the $(x_1, x_2)$ plane, before training (left) and after training (right).]

Example: the classical parity problem [Figure: the network output plotted as a surface over the $(x_1, x_2)$ plane, before training (left) and after training (right).]

Example: the classical parity problem 10 Error during training 9 8 7 6 61 5 4 3 2 1 0 0 100 200 300 400 500 600 700 800 900 1000