Lecture #11: The Perceptron
Mat Kallada
STAT2450 - Introduction to Data Mining

Outline for Today
Welcome back!
- Assignment 3
- The Perceptron Learning Method
- The Perceptron Learning Rule

Assignment 3 will be due on March 1st. Please let me know if you have any questions.

Supervised Learning Methods
Methods or algorithms to construct a predictive model to solve either classification or regression:
- K-Nearest Neighbours
- Decision Trees
- Support Vector Machines
Why learn various approaches to solve the same task?

Why learn various approaches to solve the same task? There are trade-offs! One algorithm may be better at a task in terms of:
- Interpretability (What is the process it takes to build predictions?)
- Predictive Power (How well does it work in the real world?)
- Training Speed (How long does it take to train?)
- Testing Speed (How long does it take to make a prediction?)

Review of Support Vector Machines: it's been a while. How do they work?

Support Vector Machines: How do they draw lines? Step 1: Identify the outermost points of each class, the ones closest to the other class. They are called the support vectors in the SVM model.

Support Vector Machines: How do they draw lines? Step 2: Draw the dividing line equidistant between the support vectors, perpendicular to the margin, so that the distance to the boundaries of the support vectors on either side is as large as possible.

How do they solve Multi-class problems?

The One-vs-One trick for Multi-Class SVMs: train a binary SVM for every pair of classes. To answer "What species is this?" for a new observation such as <4.2, 5.4>, each pairwise model casts a vote (here: Cat, Mouse, Mouse, Mouse), and the majority vote (Mouse) becomes the prediction.
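A minimal sketch of how the one-vs-one vote could be tallied in Python. The pairwise classifiers below are made-up stand-ins (not trained models, and not code from the course); a real version would call .predict() on fitted binary SVMs instead.

```python
from collections import Counter

def one_vs_one_predict(pairwise_classifiers, observation):
    """Collect one vote per pairwise classifier and return the majority class."""
    votes = [classify(observation) for classify in pairwise_classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-ins for trained pairwise SVMs in the species example.
pairwise_classifiers = [
    lambda obs: "Cat",    # vote from the Cat-vs-Mouse model
    lambda obs: "Mouse",  # vote from the Mouse-vs-Dog model
    lambda obs: "Mouse",  # vote from the Cat-vs-Dog model
]

print(one_vs_one_predict(pairwise_classifiers, (4.2, 5.4)))  # -> Mouse
```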

How do they solve non-linear problems?

Support Vector Machines: the underlying data cannot be linearly separated, and the raw SVM does a terrible job here. (Figure: Height vs. Width scatter plots before and after fitting the SVM.)

Non-Linear Support Vector Machines: our original data is not linearly separable. Using a plain SVM here would give us a predictor with terrible performance.

Non-Linear Support Vector Machines, Step 1: transform the given data using a kernel function. We hope that after applying a non-linear kernel to the data, we can apply a regular SVM and get good accuracy.

Non-Linear Support Vector Machines, Steps 2 and 3: apply the regular SVM in the transformed feature space and find the middle-ground line; the line is drawn there, separating the two classes apart.

Non-Linear Support Vector Machines, Step 4: project the decision surface back onto our original feature space. We get a non-linear decision boundary.

Kernel Transformations: transform the original feature space based on some property. The most common is the radial basis function. Radial Basis Functions: fit a Gaussian around each point to create the additional dimension.
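As a rough sketch of what the radial basis function computes between two points: similarity decays with squared distance, following a Gaussian. The gamma width parameter below is an illustrative assumption, not something from the slides.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian / radial basis function kernel value between two points.
    gamma controls how quickly similarity falls off with distance."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-gamma * np.sum(diff ** 2))

# Nearby points look very similar (value near 1); distant points do not (value near 0).
print(rbf_kernel([1.0, 2.0], [1.1, 2.1]))   # close to 1
print(rbf_kernel([1.0, 2.0], [5.0, 9.0]))   # close to 0
```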

What about Noisy observations? How does the SVM handle those?

Support Vector Machines: The Cost Parameter. Cost (C): when picking the support vectors, the cost of incorrectly classifying a data point. Higher C values mean a higher cost for incorrectly classifying a training point; too high means we'll overfit. Lower C values mean a lower cost for incorrectly classifying a training point; too low means we'll underfit.

Support Vector Machines: The Cost Parameter, C = 1,000. The cost of making mistakes on the training data is high. Since the cost is high, we can't make mistakes, so the line is drawn to classify every training point correctly.

Support Vector Machines: The Cost Parameter, C = 0.001. The cost of making mistakes on the training data is low. Since the cost is low, we can make some mistakes, so a simpler line is drawn.
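If you want to try the cost parameter yourself, scikit-learn's SVC exposes it directly. This is only a sketch: the tiny height/width dataset below is made up for illustration, not taken from the lecture.

```python
from sklearn.svm import SVC

# Tiny made-up training set: two features (height, width) and two classes.
X = [[1.0, 1.2], [1.5, 0.9], [1.1, 1.0], [3.0, 3.2], [3.4, 2.9], [2.9, 3.1]]
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

# High C: large penalty for training mistakes, so the boundary bends to fit every point.
strict_svm = SVC(kernel="rbf", C=1000).fit(X, y)

# Low C: small penalty for training mistakes, so the boundary stays smoother.
tolerant_svm = SVC(kernel="rbf", C=0.001).fit(X, y)

print(strict_svm.predict([[1.2, 1.1]]), tolerant_svm.predict([[1.2, 1.1]]))
```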

Any questions so far?

Introduction to Artificial Neural Networks: predictive models inspired by the communication pattern of biological neurons. First investigated in the 1950s by Frank Rosenblatt to model the behaviour of actual neurons; eventually they started to shift away from the actual biological counterpart.

Artificial neural networks: a large family of data mining algorithms inspired by the communication pattern of biological neurons. Inside our brains are billions of interconnected neurons communicating with other neurons to solve certain tasks.

Actual Biological Neuron

An Artificial Neuron: Looks something like this! This is a very abstract view of a biological neuron.

We mimic a biological neural network by linking several artificial neurons together. Forming an artificial neural network can help us solve data mining problems, with the neurons collectively working together at the task.

Artificial Neural Networks: a few starting points. Artificial neurons are far off from mimicking the entire functionality of a biological neuron, although they have many components inspired by the brain. They were pioneered by Frank Rosenblatt, who was trying to model the human brain (around 1962).

There are many different types of artificial neural networks. One type, which we will look at in this section, is the feed-forward neural network (shown below), which can be used to solve supervised learning problems (i.e. regression and classification).

An Introduction to the Perceptron: the simplest type of neural network is the perceptron. This is a basic feed-forward neural network that can be used to solve linearly separable problems. We will extend this idea to multi-layer perceptron algorithms for solving complex, non-linearly separable problems.

An Introduction to the Perceptron: the simplest kind of feed-forward neural network is the perceptron, which can be applied to classification problems that are linearly separable. Consider the scenario on the left-hand panel below, where we are given some training examples and our goal is to find the dividing line for the given binary classification problem.

Introduction to the Perceptron Algorithm: this model was invented by Frank Rosenblatt of the Cornell Aeronautical Laboratory in 1957, while investigating how biological neurons communicate.

Consider the scenario below where we are given some training examples and our goal is to find the dividing line for the given binary classification problem.

To create this decision boundary, we need only two pieces of information: the weights of the line.

The linear solution shown can be represented as: 0 = height * 0.5 - length * 0.4. Or equivalently: length = (0.5/0.4) * height.

Linearly Separable Lines In general, for any linearly separable binary classification problem, we want to find the weights for the decision boundary equation below. All we need to do is find these weights (or slopes).
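The general equation referred to here did not survive the transcription; based on the height/length example above, it is presumably of the form 0 = w1*Height + w2*Length, where w1 and w2 are the weights we need to learn.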

Linearly Separable Lines: the only issue at hand is that we don't know the weights. With the appropriate weights in place, we should be able to find a solution that separates the two classes apart. We don't know what the appropriate weights for this equation are. How should we find them?

Finding the Weights through Iterative Optimization: these weights can be learned through an optimization method. The goal is to minimize the error with respect to the chosen weights. What are some optimization techniques we could use?

What are some optimization techniques we could use? We could try all the possible weights. That would take too long. Instead, we try an incremental optimization approach.

Visualizing the Weight Space of Two Features. (Figure: Training Score plotted over the two weights.)

Incremental Optimization to Pick the Weights: consider the equation below, which indicates how each weight is iteratively updated. The improvement is calculated based on the error.
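The update equation itself is missing from the transcription; the standard perceptron weight update it presumably refers to is: new_weight = old_weight + learning_rate * (actual_label - predicted_label) * input_value, so the size of the improvement shrinks as the prediction error shrinks.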

Finding the Weights through Iterative Optimization: by continually improving upon the previous weights, the appropriate weights can be found. Note that we can train neural networks incrementally (one observation at a time) to update the weights.

Finding the Weights through Iterative Optimization The improvement value is calculated by measuring how far the predicted label value (found by the current model) differs from the actual label value. By continually updating the weights by computing the error over each training observation, we should eventually converge to a set of weights which give us an appropriate decision boundary. Consider the sequence below to gain a further appreciation of this incremental learning idea.

Steps for creating Predictive Models with the Perceptron:
1. Randomly pick the weights.
2. While the weights have not converged, update the weights based on the improvement (the improvement is based on the change in error).
A minimal sketch of this loop is given below.
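This sketch is one plausible implementation of the loop above, not the course's own code: the learning rate, the fixed number of passes (instead of a true convergence check), and the toy length/height data are all illustrative assumptions. No bias term is used yet; that comes later in the lecture.

```python
import random

def train_perceptron(examples, labels, learning_rate=0.1, passes=20):
    """Perceptron learning rule: start from random weights, then repeatedly
    nudge each weight by learning_rate * error * input."""
    n_features = len(examples[0])
    weights = [random.uniform(-1, 1) for _ in range(n_features)]  # Step 1: random weights

    for _ in range(passes):                      # Step 2: keep updating the weights
        for x, y in zip(examples, labels):       # one observation at a time (online learning)
            weighted_sum = sum(w * xi for w, xi in zip(weights, x))
            prediction = 1 if weighted_sum > 0 else 0
            error = y - prediction               # the improvement is based on this error
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights

# Illustrative data: two features (length, height), labels 0/1.
X = [[1.0, 4.0], [1.2, 4.5], [3.0, 1.0], [3.5, 0.8]]
y = [0, 0, 1, 1]
print(train_perceptron(X, y))
```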

Step 1: Randomly Initialize the Weights. Equation: 0 = 0.9*Length + 0.1*Height. Training Score: 13%.

Step 2: Update the Weights based on Error. Compute the previous error, and update the current weights to compensate for this error. Equation: 0 = 0.7*Length + 0.2*Height. Training Score: 24%.

Step 2 (again): Update the Weights based on Error. Compute the previous error, and update the current weights to compensate for this error. Equation: 0 = 0.8*Length + 0.4*Height. Training Score: 75%.

Step 2 (yet again): Update the Weights based on Error. After doing this a number of times, we have converged to a solution with good weights, since the training error is stable. Equation: 0 = 1.2*Length + 0.3*Height. Training Score: 100%.

We are trying to find the minimum of the training error function with respect to the weights. The weight improvement value discussed earlier depends on the rate of change of this function; as the algorithm senses that it is approaching the optimal solution (e.g. a small change in error), the improvement value becomes smaller. This type of incremental optimization is known as gradient descent.
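To make the "steps shrink near the minimum" idea concrete, here is a tiny gradient descent sketch on a made-up one-dimensional error function (not from the lecture): the update is proportional to the slope, so it naturally gets smaller as the slope flattens near the optimum.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=50):
    """Repeatedly step opposite the gradient; steps shrink as the slope flattens."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

# Toy error function error(w) = (w - 3)^2, whose gradient is 2*(w - 3).
print(gradient_descent(lambda w: 2 * (w - 3), start=0.0))  # converges near 3
```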

Batch vs Online Data Mining: for perceptrons, this entire training procedure for finding the final solution is known as the perceptron learning rule. By nature, this perceptron learning algorithm is a type of online learning.

Online Data Mining: online data mining methods are trained incrementally, one observation at a time, and can be asked for a prediction at any point; on the other hand, batch data mining methods must first be trained on the entire dataset.

Batch vs Online Data Mining: all the algorithms we have seen earlier in this section were batch algorithms; they needed the entire dataset before making a prediction. Online learning is particularly useful in real-time applications with streaming flows of data, or when we are dealing with huge datasets (so huge that it is infeasible to process them all in one go).
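If you want to see the online flavour in practice, scikit-learn's Perceptron supports incremental training via partial_fit. This is only an illustration; the "stream" of mini-batches below is made up.

```python
import numpy as np
from sklearn.linear_model import Perceptron

model = Perceptron()

# Pretend these mini-batches arrive one at a time from a data stream.
stream = [
    (np.array([[1.0, 4.0], [3.0, 1.0]]), np.array([0, 1])),
    (np.array([[1.2, 4.5], [3.5, 0.8]]), np.array([0, 1])),
]

for X_batch, y_batch in stream:
    # classes must be given on the first call so the model knows all labels up front
    model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

print(model.predict(np.array([[1.1, 4.2]])))  # can be asked for a prediction at any point
```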

Network Diagram for Perceptron Typically, artificial neural networks (such as the perceptron) are presented as a network diagram; the network diagram for a perceptron is shown below.

Network Diagram for Perceptron: try to imagine a new unlabeled observation going through this pipeline; the new observation is fed into the input neurons, a weighted sum is calculated, and finally the output neuron thresholds our weighted sum to make a prediction.

What about the intercept? There is one issue that I forgot to mention: with our current setup, our decision boundary will always pass through the origin.

What is missing here? Our perceptron is trying to learn the boundary, but failing!

What about the intercept? That is, there is no way for our model to have a y-intercept that isn't zero. This issue is illustrated in the scenario on the left panel below.

What is missing here? Adding a constant bias neuron allows us to learn the appropriate y-intercept. The bias neuron always has a value of 1; we just need to find the appropriate scaling factor through the perceptron learning rule.

Bias Neurons: how should we add an intercept? This is the purpose of 'bias neurons' in feed-forward neural networks. We will simply add a constant 'bias' term to our previous equation. With this setup, we can learn a weight which creates the appropriate intercept using the learning rule.
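One common way to implement this (an illustrative helper, not from the slides) is simply to append a constant input of 1 to every observation; the weight learned for that extra input then acts as the intercept. The augmented data can be fed straight into the training sketch shown earlier.

```python
def add_bias_neuron(examples):
    """Append a constant input of 1 to every observation; the weight learned
    for this extra input becomes the intercept of the decision boundary."""
    return [list(x) + [1.0] for x in examples]

X = [[1.0, 4.0], [3.0, 1.0]]
print(add_bias_neuron(X))  # [[1.0, 4.0, 1.0], [3.0, 1.0, 1.0]]
```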

One problem with incremental optimization approaches? (Figure: Training Score plotted over Weight 1 and Weight 2.)

That's all for today. Assignment 3 is released.