The Perceptron
Instructor: Jessica Wu, Harvey Mudd College
The instructor gratefully acknowledges Andrew Ng (Stanford), Eric Eaton (UPenn), David Kauchak (Pomona), and the many others who made their course materials freely available online.
Robot Image Credit: Viktoriya Sukhanova 123RF.com

Perceptron Basics
Learning Goals
- Describe the perceptron model
- Describe the perceptron algorithm
- Describe why the perceptron update works
- Describe the perceptron cost function
- Describe how a bias term affects the perceptron

The Perceptron
Assume there is a linear classifier. Start with a simple learner and analyze what it does.
Let y ∈ {−1, +1}, where y = +1 labels positive examples and y = −1 labels negative examples. So
- if θᵀx ≥ 0, then y = +1
- if θᵀx < 0, then y = −1
The decision boundary is the hyperplane θᵀx = 0. For now, assume the hyperplane passes through the origin (no bias term).

Perceptron Algorithm
start with a guess for θ (typically θ = 0)
repeat until convergence
    for i = 1 to n
        if y^(i) θᵀx^(i) ≤ 0        (a mistake is made; why ≤ 0 rather than < 0?)
            θ ← θ + y^(i) x^(i)      (update step)
Possible convergence criteria:
- all training examples correctly classified
- the average update (‖θ_new − θ_old‖²) falls below some small threshold
- fixed # of iterations
- single pass over the data
Notes
- online learning algorithm
- guaranteed to find a separating hyperplane if the data is linearly separable (theorem later this lecture)
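A minimal NumPy sketch of the algorithm above, assuming labels in {−1, +1} and no bias term; the function names and the max_epochs stopping rule are illustrative choices, not part of the slides.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron without a bias term: X is n x d, labels y are in {-1, +1}."""
    n, d = X.shape
    theta = np.zeros(d)                          # start with the guess theta = 0
    for _ in range(max_epochs):                  # one possible criterion: fixed # of iterations
        mistakes = 0
        for i in range(n):
            if y[i] * np.dot(theta, X[i]) <= 0:  # mistake is made (or exactly on the boundary)
                theta = theta + y[i] * X[i]      # update step
                mistakes += 1
        if mistakes == 0:                        # all training examples correctly classified
            break
    return theta

def perceptron_predict(theta, X):
    return np.where(X @ theta >= 0, 1, -1)
```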

Perceptron Example
(Figure: four labeled training points in the (x₁, x₂) plane and an initial weight vector θ₀ = [·, 0]ᵀ; the exact coordinates and the first component of θ₀ did not survive transcription.)
Repeat until convergence:
- process the points in order 1, 2, 3, 4
- keep track of θ as it changes
- redraw the hyperplane after each step
Based on slide by David Kauchak [originally by Piyush Rai]
(This slide intentionally left blank.)
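Since the example's coordinates were not recovered, here is a hypothetical stand-in dataset run through the sketch above; the points and labels are made up for illustration only.

```python
# Hypothetical stand-in for the slide's worked example (original coordinates not recovered).
X = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [0.5, -1.0]])
y = np.array([1, -1, -1, 1])
theta = perceptron_train(X, y)
print(theta)                         # learned weight vector
print(perceptron_predict(theta, X))  # should reproduce the labels above
```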

Why the Perceptron Update Works: Geometric Interpretation
(Figure: θ_old and a misclassified positive example.) Adding the misclassified example to θ rotates the weight vector toward that example, so the hyperplane moves toward classifying it correctly.
Based on slide by Eric Eaton [originally by Piyush Rai]

Why the Perceptron Update Works: Mathematical Proof
Consider a misclassified example with y = +1: the perceptron wrongly thinks θ_oldᵀx < 0.
After the update θ_new = θ_old + x, we have θ_newᵀx = θ_oldᵀx + ‖x‖², which is strictly larger, so the prediction moves toward the correct (positive) side.
Based on slide by Eric Eaton [originally by Piyush Rai]

Why the Perceptron Update Works
Similar arguments hold for a misclassified negative example.
(Figure: θ_old and a misclassified negative example.) Consider a misclassified example with y = −1: the perceptron wrongly thinks θ_oldᵀx ≥ 0.
After the update θ_new = θ_old − x, we have θ_newᵀx = θ_oldᵀx − ‖x‖², which is strictly smaller, so the prediction moves toward the correct (negative) side.
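A quick numeric check of this argument, using made-up values (not from the slides):

```python
import numpy as np

# A misclassified positive example: theta_old^T x < 0 but y = +1.
theta_old = np.array([-1.0, 0.5])
x = np.array([2.0, 1.0])
y = 1
print(y * np.dot(theta_old, x))      # -1.5 <= 0, so the perceptron makes a mistake

theta_new = theta_old + y * x        # perceptron update
# theta_new^T x = theta_old^T x + ||x||^2, i.e. the score rises by 5.0 here.
print(y * np.dot(theta_new, x))      # 3.5: moved toward the correct (positive) side
```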

Cost Function
What if we tried to minimize the 0/1 loss? (Remember, the prediction is correct if y^(i) θᵀx^(i) > 0.)
The 0/1 loss is an indicator function: 0 if the prediction is correct, 1 otherwise.
Challenges: minimizing the 0/1 loss is NP-hard!
- small changes in θ can induce large changes in J(θ), and the change is not continuous
- lots of local minima
- not a useful gradient: at any point, it gives no information to direct us towards any minimum
Idea: use a surrogate loss function, max(0, −y^(i) θᵀx^(i)).
If the prediction is correct, max(0, −y^(i) θᵀx^(i)) = 0. Otherwise it equals the confidence in the mis-prediction.
To minimize it, walk in the direction of the negative gradient (gradient descent!); for a single misclassified example the gradient is −y^(i) x^(i), so the step is exactly the perceptron update (see the sketch below).
Based on slide by Eric Eaton [originally by Alan Fern]

Bias Term
With a bias term, the hyperplane is θᵀx + b = 0. The bias shifts the hyperplane |b|/‖θ‖ units: in the direction of −θ when b > 0 and in the direction of θ when b < 0.
A positive bias means more examples are classified as positive: shifting away from θ leaves more space for positive classification.
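A sketch of the surrogate loss and its (sub)gradient discussed above, assuming the {−1, +1} label convention; the function names are illustrative.

```python
import numpy as np

def zero_one_loss(theta, x, y):
    """0/1 loss for one example: 0 if correct, 1 otherwise (flat almost everywhere)."""
    return 0.0 if y * np.dot(theta, x) > 0 else 1.0

def perceptron_loss(theta, x, y):
    """Surrogate loss max(0, -y * theta^T x): 0 if correct, else the confidence of the mistake."""
    return max(0.0, -y * np.dot(theta, x))

def perceptron_loss_grad(theta, x, y):
    """(Sub)gradient of the surrogate loss for a single example."""
    if y * np.dot(theta, x) <= 0:
        return -y * x                  # misclassified: gradient is -y^(i) x^(i)
    return np.zeros_like(theta)        # correct: the loss is flat at 0

# One gradient-descent step with step size 1 reproduces the perceptron update theta + y*x:
theta = np.array([-1.0, 0.5]); x = np.array([2.0, 1.0]); y = 1
theta_next = theta - perceptron_loss_grad(theta, x, y)
```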

Perceptron Convergence
Learning Goals
- Does the perceptron converge? If so, how long does it take? (How many mistakes / updates?)
- State the perceptron convergence theorem
- State the implications of the theorem

Margins
For a training set and parameter vector θ:
- functional margin of the i-th example: y^(i) θᵀx^(i)
  this is positive if θ classifies x^(i) correctly; its absolute value is the confidence in the predicted label (or the "mis-confidence")
- geometric margin of the i-th example: y^(i) θᵀx^(i) / ‖θ‖
  the signed distance of the example to the hyperplane (positive if the example is classified correctly)
- margin of the training set = the minimum geometric margin over the examples
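A small sketch of these margin definitions, assuming the same X, y, theta conventions as the earlier sketches:

```python
import numpy as np

def functional_margins(theta, X, y):
    """Functional margin of each example: y^(i) * theta^T x^(i) (positive iff classified correctly)."""
    return y * (X @ theta)

def geometric_margins(theta, X, y):
    """Geometric margin: signed distance of each example to the hyperplane theta^T x = 0."""
    return functional_margins(theta, X, y) / np.linalg.norm(theta)

def training_set_margin(theta, X, y):
    """Margin of the training set = minimum geometric margin over the examples."""
    return geometric_margins(theta, X, y).min()
```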

Perceptron Convergence Theorem
Theorem (Block & Novikoff) [b = 0]
Assume there exists θ* such that ‖θ*‖ = 1 and, for all i, y^(i) θ*ᵀx^(i) ≥ γ* > 0
    (the data is linearly separable with margin γ* by the unit-norm hyperplane θ*)
and, for all i, ‖x^(i)‖ ≤ R
    (the examples are not too big).
Then the perceptron converges after at most (R/γ*)² updates.

Convergence Theorem Guarantees
- The convergence rate depends on the margin γ* and on the size of the data R, but not on the number of training examples n or the data dimensionality d.
- If the perceptron is given data that is linearly separable with margin γ*, it will converge to a solution that separates the data, and it converges quickly if γ* is large.

Proof
- We assumed that θ* (a hyperplane consistent with the data) exists; the classification problem is easy if γ* is large.
- Optimal case (maximum γ*): θ* is the maximum-margin separator.
- But we are not guaranteed to find this θ* using the perceptron algorithm; we will show how to find the maximum-margin separator using SVMs.
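An illustrative empirical check of the bound on a toy separable dataset; the data and the known unit-norm separator theta_star are made up, while the bound itself is the (R/γ*)² stated above.

```python
import numpy as np

def count_perceptron_updates(X, y, max_epochs=1000):
    """Run the perceptron (no bias) and count updates until a full clean pass."""
    theta = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(theta, xi) <= 0:
                theta += yi * xi
                updates += 1
                mistakes += 1
        if mistakes == 0:
            return theta, updates
    return theta, updates

# Compare the observed number of updates with the (R / gamma*)^2 bound,
# using a known unit-norm separator theta_star for this toy data.
X = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [0.5, -1.0]])
y = np.array([1, -1, -1, 1])
theta_star = np.array([1.0, 0.0])                 # unit norm, separates the data
gamma_star = np.min(y * (X @ theta_star))         # margin of the training set
R = np.max(np.linalg.norm(X, axis=1))
theta, updates = count_perceptron_updates(X, y)
print(updates, "<=", (R / gamma_star) ** 2)
```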

Order of Examples
Does the order in which we traverse the examples matter?

The Name "Perceptron"
Learning Goals
- Why is it called the perceptron learning algorithm if it learns a line? Why not the line learning algorithm?

The Name "Perceptron"
The name comes from our nervous system: the neuron.
Based on slide by David Kauchak

Our Nervous System
- the human brain is a large collection of interconnected neurons
- a neuron is a brain cell that collects, processes, and disseminates electrical signals
- neurons are connected via synapses
- neurons fire depending on the conditions of neighboring neurons
Based on slide by David Kauchak

A Neuron
(Diagram: Node A (neuron) connected to Node B (neuron) by an edge with a weight.)
- the weight is the strength of the signal sent from A to B
- if A fires and the weight is positive, then A stimulates B
- if A fires and the weight is negative, then A inhibits B
- if B is stimulated enough, then it also fires; the amount of stimulation required is determined by its threshold
Based on slide by David Kauchak

A Single Neuron / Perceptron
(Diagram: inputs x₁, x₂, x₃, x₄ with weights w₁, w₂, w₃, w₄ feeding a unit that outputs y = g(in).)
- in = Σⱼ wⱼ xⱼ (a linear combination of the inputs)
- g is a threshold function; possible thresholds include the hard threshold and the sigmoid
Based on slide by David Kauchak
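A minimal sketch of a single neuron / perceptron unit with the two threshold choices mentioned above (hard threshold and sigmoid); the weights and inputs are illustrative values, not from the slides.

```python
import numpy as np

def hard_threshold(z):
    """Hard threshold activation: fires (+1) iff the combined input is >= 0."""
    return np.where(z >= 0, 1.0, -1.0)

def sigmoid(z):
    """Smooth (soft) threshold activation in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, g=hard_threshold):
    """Single neuron / perceptron unit: linear combination followed by a threshold g."""
    in_ = np.dot(w, x)          # in = sum_j w_j x_j
    return g(in_)

# Example: four inputs and four weights, as in the slide's diagram.
x = np.array([1.0, 0.0, -1.0, 0.5])
w = np.array([0.2, -0.4, 0.1, 0.3])
print(neuron(x, w), neuron(x, w, g=sigmoid))
```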

Neuron Example
(Figure: a small neuron with specific inputs, weights, and a threshold; only the values 0 and 0.5 survived transcription, so the example cannot be fully reconstructed.) What are the weights, and what is b? What is the output, given the threshold?
Based on slide by David Kauchak

Neural Networks
(Figure: a network diagram.) Each node is a neuron and each edge is a synapse; between the inputs and the output sits a hidden layer of neurons.
Based on slide by David Kauchak
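Building on the single-neuron sketch above, a minimal illustration of a hidden layer of such neurons; all weights here are made up.

```python
# Each row of W_hidden holds the incoming synapse weights of one hidden neuron.
W_hidden = np.array([[0.2, -0.4, 0.1, 0.3],
                     [0.5,  0.1, -0.2, 0.0]])
x = np.array([1.0, 0.0, -1.0, 0.5])
hidden = sigmoid(W_hidden @ x)                   # activations of the hidden layer
w_out = np.array([1.0, -1.0])
output = hard_threshold(np.dot(w_out, hidden))   # output neuron with a hard threshold
```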

(extra slides)

Proof of Perceptron Convergence Theorem
Learning Goals
- Prove the perceptron convergence theorem =D

Proof Overview
- There exists θ* such that the data is linearly separable with margin γ* (we do not know θ*, but we know that it exists).
- The perceptron algorithm tries to find a θ that points roughly in the same direction as θ*: for large γ*, "roughly" is very rough; for small γ*, "roughly" is very precise.
- With every update, the angle between θ and θ* changes.
- Recall that the cosine of the angle between u and v is uᵀv / (‖u‖‖v‖). We will prove that uᵀv increases a lot while ‖u‖ and ‖v‖ do not increase very much, so the angle decreases with each update.

Proof by Induction (This slide intentionally left blank.)
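The slide body itself was left blank (the derivation was presumably worked in lecture). The following is a standard sketch of the induction, consistent with the overview above, where θ_k denotes the weight vector after the k-th update and θ_0 = 0.

```latex
\textbf{Claim 1 (alignment grows).} If the $k$-th update is made on example $(x, y)$, then
\[
  \theta_k^\top \theta^* = (\theta_{k-1} + y\,x)^\top \theta^*
    = \theta_{k-1}^\top \theta^* + y\,x^\top \theta^*
    \ \ge\ \theta_{k-1}^\top \theta^* + \gamma^*,
\]
so by induction $\theta_k^\top \theta^* \ge k\,\gamma^*$.

\textbf{Claim 2 (length grows slowly).} Because an update is only made on a mistake
($y\,\theta_{k-1}^\top x \le 0$) and $\|x\| \le R$,
\[
  \|\theta_k\|^2 = \|\theta_{k-1}\|^2 + 2\,y\,\theta_{k-1}^\top x + \|x\|^2
    \ \le\ \|\theta_{k-1}\|^2 + R^2,
\]
so by induction $\|\theta_k\|^2 \le k\,R^2$.

\textbf{Combine.} Since $\|\theta^*\| = 1$, Cauchy--Schwarz gives
\[
  k\,\gamma^* \ \le\ \theta_k^\top \theta^* \ \le\ \|\theta_k\| \ \le\ \sqrt{k}\,R
  \quad\Longrightarrow\quad k \ \le\ \left(\frac{R}{\gamma^*}\right)^2 .
\]
```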