Linear Classification and Perceptron


Linear Classification and Perceptron. INFO-4604, Applied Machine Learning, University of Colorado Boulder. September 7, 2017. Prof. Michael Paul

Prediction Functions. Remember: a prediction function is the function that predicts what the output should be, given the input. Last time we looked at linear functions, which are commonly used as prediction functions.

Linear Functions. General form with k variables (arguments): f(x_1, ..., x_k) = m_1 x_1 + m_2 x_2 + ... + m_k x_k + b, or equivalently: f(x) = m^T x + b.

Linear Predictions. Regression: learn a linear function that fits the data points.

Linear Predictions. Classification: learn a linear function that separates instances of different classes.

Linear Classification. A linear function divides the coordinate space into two parts: every point is either on one side of the line (or plane or hyperplane) or the other, unless it is exactly on the line (we'll come back to this case in a minute). This means it can only separate two classes. Classification with two classes is called binary classification. Conventionally, one class is called the positive class and the other the negative class. We'll discuss classification with more than 2 classes later on.

Perceptron. Perceptron is an algorithm for binary classification that uses a linear prediction function: f(x) = +1 if w^T x + b ≥ 0, and -1 if w^T x + b < 0. This is called a step function, which reads: the output is +1 if w^T x + b ≥ 0 is true, and the output is -1 if instead w^T x + b < 0 is true.

Perceptron. Perceptron is an algorithm for binary classification that uses a linear prediction function: f(x) = +1 if w^T x + b ≥ 0, and -1 if w^T x + b < 0. By convention, the two classes are +1 and -1.

Perceptron. Perceptron is an algorithm for binary classification that uses a linear prediction function: f(x) = +1 if w^T x + b ≥ 0, and -1 if w^T x + b < 0. By convention, the slope parameters are denoted w (instead of m as we used last time). Often these parameters are called weights.

Perceptron. Perceptron is an algorithm for binary classification that uses a linear prediction function: f(x) = +1 if w^T x + b ≥ 0, and -1 if w^T x + b < 0. By convention, ties are broken in favor of the positive class: if w^T x + b is exactly 0, output +1 instead of -1.

Perceptron. The w parameters are unknown; this is what we have to learn. f(x) = +1 if w^T x + b ≥ 0, and -1 if w^T x + b < 0. In the same way that linear regression learns the slope parameters to best fit the data points, perceptron learns the parameters to best separate the instances.
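To make this concrete, here is a minimal sketch of the prediction function in Python with NumPy; the function name predict is a placeholder, and a score of exactly 0 goes to the positive class, as described above.

    import numpy as np

    def predict(w, b, x):
        """Perceptron prediction: +1 if w^T x + b >= 0 (ties go to the positive class), else -1."""
        return 1 if np.dot(w, x) + b >= 0 else -1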

Example. Suppose we want to predict whether a web user will click on an ad for a refrigerator. Four features: (1) recently searched "refrigerator repair", (2) recently searched "refrigerator reviews", (3) recently bought a refrigerator, (4) has clicked on any ad in the recent past. These are all binary features (values can be either 0 or 1).

Example. Suppose these are the weights: Searched "repair" = 2.0, Searched "reviews" = 8.0, Recent purchase = -15.0, Clicked ads before = 5.0, b (intercept) = -9.0. Prediction function: f(x) = +1 if w^T x + b ≥ 0, and -1 if w^T x + b < 0.

Example. With the same weights, for a user who only searched "reviews": w^T x + b = 2*0 + 8*1 + (-15)*0 + 5*0 + (-9) = 8 - 9 = -1. Prediction: No.

Example. For a user who searched "repair" and "reviews": w^T x + b = 2 + 8 - 9 = 1. Prediction: Yes.

Example. For a user who searched "reviews" and has clicked ads before: w^T x + b = 8 + 5 - 9 = 4. Prediction: Yes.

Example. For a user who searched "reviews", recently bought a refrigerator, and has clicked ads before: w^T x + b = 8 - 15 + 5 - 9 = -11. Prediction: No. If someone bought a refrigerator recently, they probably aren't interested in shopping for another one anytime soon.

Example. For a user with none of the features active: w^T x + b = -9. Prediction: No. Since most people don't click ads, the default prediction is that they will not click (the intercept pushes it negative).
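As a rough check of the arithmetic above, the following sketch (Python with NumPy) recomputes these cases; the feature ordering and the dictionary of example users are my own encoding of the scenarios on the slides.

    import numpy as np

    # Weights in the order: searched repair, searched reviews, recent purchase, clicked ads before
    w = np.array([2.0, 8.0, -15.0, 5.0])
    b = -9.0  # intercept (bias)

    users = {
        "searched reviews only":            np.array([0, 1, 0, 0]),
        "searched repair + reviews":        np.array([1, 1, 0, 0]),
        "reviews + clicked ads before":     np.array([0, 1, 0, 1]),
        "reviews + purchase + clicked ads": np.array([0, 1, 1, 1]),
        "no features active":               np.array([0, 0, 0, 0]),
    }

    for name, x in users.items():
        score = np.dot(w, x) + b
        label = "Yes" if score >= 0 else "No"   # ties go to the positive class
        print(f"{name}: w.x + b = {score:+.1f} -> {label}")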

Learning the Weights. The perceptron algorithm learns the weights by: 1. Initialize all weights w to 0. 2. Iterate through the training data. For each training instance, classify the instance. a) If the prediction (the output of the classifier) was correct, don't do anything. (It means the classifier is working, so leave it alone!) b) If the prediction was wrong, modify the weights by using the update rule. 3. Repeat step 2 some number of times (more on this later).

Learning the Weights. What does an update rule do? If the classifier predicted an instance was negative but it should have been positive: currently w^T x_i + b < 0, want w^T x_i + b ≥ 0. Adjust the weights w so that this function value moves toward positive. If the classifier predicted positive but it should have been negative, shift the weights so that the value moves toward negative.

Learning the Weights. The perceptron update rule: w_j += (y_i - f(x_i)) x_ij, where w_j is the weight of feature j, y_i is the true label of instance i, x_i is the feature vector of instance i, f(x_i) is the class prediction for instance i, and x_ij is the value of feature j in instance i.

Learning the Weights. The perceptron update rule: w_j += (y_i - f(x_i)) x_ij. Let's assume x_ij is 1 in this example for now.

Learning the Weights. The perceptron update rule: w_j += (y_i - f(x_i)) x_ij. The (y_i - f(x_i)) term is 0 if the prediction was correct (y_i = f(x_i)). Then the entire update rule is 0, so no change is made.

Learning the Weights. The perceptron update rule: w_j += (y_i - f(x_i)) x_ij. If the prediction is wrong: this term is +2 if y_i = +1 and f(x_i) = -1, and -2 if y_i = -1 and f(x_i) = +1. The sign of this term indicates the direction of the mistake.

Learning the Weights. The perceptron update rule: w_j += (y_i - f(x_i)) x_ij. If the prediction is wrong: the (y_i - f(x_i)) term is +2 if y_i = +1 and f(x_i) = -1. This will increase w_j (still assuming x_ij is 1), which will increase w^T x_i + b, which will make it more likely that w^T x_i + b ≥ 0 next time (which is what we need for the classifier to be correct).

Learning the Weights. The perceptron update rule: w_j += (y_i - f(x_i)) x_ij. If the prediction is wrong: the (y_i - f(x_i)) term is -2 if y_i = -1 and f(x_i) = +1. This will decrease w_j (still assuming x_ij is 1), which will decrease w^T x_i + b, which will make it more likely that w^T x_i + b < 0 next time (which is what we need for the classifier to be correct).

Learning the Weights. The perceptron update rule: w_j += (y_i - f(x_i)) x_ij. If x_ij is 0, there will be no update: the feature does not affect the prediction for this instance, so it won't affect the weight updates. If x_ij is negative, the sign of the update flips.
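Here is a minimal sketch of this update for a single instance, in Python with NumPy. The function name perceptron_update and the vectorized form (updating all weights at once rather than one w_j at a time) are my own restatement of the per-feature rule above; the bias is updated as if it were a feature whose value is always 1, the trick described on the next slide.

    import numpy as np

    def perceptron_update(w, b, x, y):
        """Apply the perceptron update rule for one instance (x, y), with y in {+1, -1}."""
        f_x = 1 if np.dot(w, x) + b >= 0 else -1   # current prediction (ties go to +1)
        # (y - f_x) is 0 when the prediction is correct, +2 or -2 when it is wrong
        w = w + (y - f_x) * x                      # vectorized form of w_j += (y - f(x)) * x_j
        b = b + (y - f_x)                          # bias acts like a feature that is always 1
        return w, b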

Learning the Weights. What about b? This is the intercept of the linear function, also called the bias. Common implementation: realize that w^T x + b = w^T x + b*1. If we add an extra feature to every instance whose value is always 1, then we can simply write this as w^T x, where the final feature weight is the value of the bias. Then we can update this parameter the same way as all the other weights.
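A sketch of how this trick looks in code (Python with NumPy); the array names X, X_aug, and w_aug are placeholders, and the example feature matrix is illustrative.

    import numpy as np

    X = np.array([[0, 1, 0, 0],
                  [1, 1, 0, 0]])                         # example feature matrix, one row per instance
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])     # append a feature that is always 1

    w_aug = np.zeros(X_aug.shape[1])                     # the last entry of w_aug plays the role of b
    scores = X_aug @ w_aug                               # same values as X @ w + b, with no separate bias term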

Learning the Weights. The vector of w values is called the weight vector. Is the bias b counted when we use this phrase? Usually, especially if you include it by using the trick of adding an extra feature with value 1 rather than treating it separately. Just be clear with your notation.

Linear Separability. The training instances are linearly separable if there exists a hyperplane that will separate the two classes.

Linear Separability. If the training instances are linearly separable, eventually the perceptron algorithm will find weights w such that the classifier gets everything correct.

Linear Separability. If the training instances are not linearly separable, the classifier will always get some predictions wrong. You need to implement some kind of stopping criterion for when the algorithm will stop making updates, or it will run forever. Usually this is specified by running the algorithm for a maximum number of iterations or epochs.

Learning Rate. Let's make a modification to the update rule: w_j += η (y_i - f(x_i)) x_ij, where η is called the learning rate or step size. When you update w_j to be more positive or negative, this controls the size of the change you make (or, how large a step you take). If η = 1 (a common value), then this is the same update rule as on the earlier slide.

Learning Rate. How to choose the step size? If η is too small, the algorithm will be slow because the updates won't make much progress. If η is too large, the algorithm will be slow because the updates will overshoot and may cause previously correct classifications to become incorrect. We'll learn about step sizes a little more next week.

Summary

Perceptron: Prediction. Prediction function: f(x) = +1 if w^T x + b ≥ 0, and -1 if w^T x + b < 0.

Perceptron: Learning. 1. Initialize all weights w to 0. 2. Iterate through the training data. For each training instance, classify the instance. a) If the prediction (the output of the classifier) was correct, don't do anything. b) If the prediction was wrong, modify the weights by using the update rule: w_j += η (y_i - f(x_i)) x_ij. 3. Repeat step 2 until the perceptron correctly classifies every instance or the maximum number of iterations has been reached.
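Putting the summary together, here is a minimal sketch of the full learning procedure in Python with NumPy. The function name train_perceptron, the default eta=1.0, and the max_epochs argument are illustrative choices rather than anything specified on the slides; the bias is folded in with the constant-1 feature trick.

    import numpy as np

    def train_perceptron(X, y, eta=1.0, max_epochs=100):
        """X: (n_instances, n_features) array; y: labels in {+1, -1}."""
        X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # constant-1 feature absorbs the bias
        w = np.zeros(X_aug.shape[1])                      # step 1: initialize all weights to 0
        for epoch in range(max_epochs):                   # step 3: repeat up to a maximum number of epochs
            mistakes = 0
            for x_i, y_i in zip(X_aug, y):                # step 2: iterate through the training data
                f_x = 1 if np.dot(w, x_i) >= 0 else -1    # classify the instance (ties go to +1)
                if f_x != y_i:                            # 2b: wrong prediction -> apply the update rule
                    w += eta * (y_i - f_x) * x_i
                    mistakes += 1
            if mistakes == 0:                             # stop once every instance is classified correctly
                break
        return w

On a linearly separable dataset the loop exits early once an entire pass through the data produces no mistakes; otherwise it stops after max_epochs passes, matching the stopping criterion discussed above.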