CSEP 573: Artificial Intelligence


CSEP 573: Artificial Intelligence. Machine Learning: Perceptron. Ali Farhadi. Many slides over the course adapted from Luke Zettlemoyer and Dan Klein.

Generative vs. Discriminative Generative classifiers: e.g., naïve Bayes A joint probability model with evidence variables Query the model for causes given evidence Discriminative classifiers: No generative model, no Bayes rule, often no probabilities at all! Try to predict the label Y directly from X Robust, accurate with varied features Loosely: mistake-driven rather than model-driven

Discriminative vs. generative [Figure: three plots over x = data, comparing a generative model ("the artist"), a discriminative model ("the lousy painter"), and the resulting classification function.]

Some (Simplified) Biology Very loose inspiration: human neurons

Linear Classifiers Inputs are feature values Each feature has a weight The sum of weighted features is the activation: activation_w(x) = sum_i w_i * f_i(x) If the activation is positive, output +1; if negative, output -1 [Diagram: features f1, f2, f3 are multiplied by weights w1, w2, w3, summed, and passed through a >0? threshold.]
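A minimal sketch in Python of this decision rule, assuming sparse dictionary-based features (function names and the dict representation are illustrative, not from the slides):

# Compute the activation w . f over the features present in one example.
def activation(weights, features):
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Positive activation -> +1, otherwise -> -1.
def classify(weights, features):
    return +1 if activation(weights, features) > 0 else -1

# Usage with the spam weights from the next slide:
# classify({"BIAS": -3, "free": 4, "money": 2}, {"BIAS": 1, "free": 1, "money": 1})  # -> +1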

Example: Spam Imagine 4 features (spam is the positive class): free (number of occurrences of "free"), money (occurrences of "money"), BIAS (intercept, always has value 1), ... For the input "free money", the feature vector is BIAS: 1, free: 1, money: 1, ... and the weight vector is BIAS: -3, free: 4, money: 2, ...
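With these weights, and counting only the features shown, the activation for "free money" is (-3)(1) + (4)(1) + (2)(1) = 3, which is positive, so the example is classified as spam (+1).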

Binary Decision Rule In the space of feature vectors Examples are points Any weight vector is a hyperplane One side corresponds to Y = +1 Other corresponds to Y = -1 [Figure: the weight vector BIAS: -3, free: 4, money: 2 drawn as a decision boundary in the (free, money) plane, with the +1 = SPAM region on one side and the -1 = HAM region on the other.]

Binary Perceptron Algorithm Start with zero weights For each training instance: Classify with current weights If correct (i.e., y=y*), no change! If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.
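A minimal sketch of this training loop in Python, assuming dict-based features and labels y* in {+1, -1} (the function name and the fixed number of passes are illustrative, not from the slides):

def perceptron_train(examples, num_passes=10):
    # examples: list of (features, y_star) pairs, with features a dict and y_star in {+1, -1}
    weights = {}                                # start with zero weights
    for _ in range(num_passes):
        for features, y_star in examples:
            score = sum(weights.get(f, 0.0) * v for f, v in features.items())
            y = +1 if score > 0 else -1         # classify with current weights
            if y != y_star:                     # if wrong: add or subtract the feature vector
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + y_star * v
    return weights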

Examples: Perceptron Separable Case

Examples: Perceptron Separable Case http://isl.ira.uka.de/neuralnetcourse/2004/vl_11_5/perceptron.html

Examples: Perceptron Inseparable Case http://isl.ira.uka.de/neuralnetcourse/2004/vl_11_5/perceptron.html

Multiclass Decision Rule If we have more than two classes: Have a weight vector w_y for each class y Calculate an activation for each class: activation_y(x) = w_y . f(x) Highest activation wins: predict y = argmax_y w_y . f(x)

Example Input sentences: "win the vote", "win the election", "win the game". Each class has a weight vector over the features BIAS, win, game, vote, the, ...

Example For the input "win the vote", the feature vector is BIAS: 1, win: 1, game: 0, vote: 1, the: 1, ... The three class weight vectors are (BIAS: -2, win: 4, game: 4, vote: 0, the: 0, ...), (BIAS: 1, win: 2, game: 0, vote: 4, the: 0, ...), and (BIAS: 2, win: 0, game: 2, vote: 0, the: 0, ...).
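With these weights, and counting only the features shown, the activations for "win the vote" are -2 + 4 + 0 + 0 + 0 = 2 for the first class, 1 + 2 + 0 + 4 + 0 = 7 for the second, and 2 + 0 + 0 + 0 + 0 = 2 for the third, so the second class wins.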

The Multi-class Perceptron Alg. Start with zero weights Iterate over training examples: Classify with current weights If correct, no change! If wrong: lower the score of the wrong answer, raise the score of the right answer
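A minimal Python sketch of the multiclass version, continuing the dict-based setup from the binary sketch above (names and the number of passes are illustrative):

def multiclass_perceptron_train(examples, classes, num_passes=10):
    # examples: list of (features, correct_class) pairs; classes: list of class labels
    weights = {c: {} for c in classes}          # zero weight vector for every class

    def score(c, features):
        return sum(weights[c].get(f, 0.0) * v for f, v in features.items())

    for _ in range(num_passes):
        for features, y_star in examples:
            y = max(classes, key=lambda c: score(c, features))   # highest activation wins
            if y != y_star:
                for f, v in features.items():
                    weights[y_star][f] = weights[y_star].get(f, 0.0) + v   # raise the right answer
                    weights[y][f] = weights[y].get(f, 0.0) - v             # lower the wrong answer
    return weights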

Examples: Perceptron Separable Case

Mistake-Driven Classification For Naïve Bayes: Parameters from data statistics Parameters: probabilistic interpretation Training: one pass through the data For the perceptron: Parameters from reactions to mistakes Parameters: discriminative interpretation Training: go through the data until held-out accuracy maxes out [Diagram: the data is split into Training Data, Held-Out Data, and Test Data.]

Properties of Perceptrons Separability: some parameters get the training set perfectly correct Convergence: if the training data is separable, the perceptron will eventually converge (binary case) Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability [Figures: a separable and a non-separable set of points.]

Problems with the Perceptron Noise: if the data isn't separable, weights might thrash Averaging weight vectors over time can help (averaged perceptron; see the sketch below) Mediocre generalization: finds a barely separating solution Overtraining: test / held-out accuracy usually rises, then falls Overtraining is a kind of overfitting
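A minimal sketch of the averaging idea, reusing the dict-based binary training loop from above (the averaging scheme shown, a running sum over all intermediate weight vectors, is one standard way to do it and is not quoted from the slides):

def averaged_perceptron_train(examples, num_passes=10):
    # Same loop as perceptron_train above, but also accumulate the weight vector
    # after every example and return its average at the end.
    weights, totals, count = {}, {}, 0
    for _ in range(num_passes):
        for features, y_star in examples:
            score = sum(weights.get(f, 0.0) * v for f, v in features.items())
            if (+1 if score > 0 else -1) != y_star:
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + y_star * v
            for f, w in weights.items():        # running sum of the current weights
                totals[f] = totals.get(f, 0.0) + w
            count += 1
    return {f: t / count for f, t in totals.items()}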

Fixing the Perceptron Idea: adjust the weight update to mitigate these effects MIRA*: choose an update size that fixes the current mistake but minimizes the change to w The +1 in the margin constraint (the right answer must beat the wrong answer by at least 1) helps to generalize *Margin Infused Relaxed Algorithm

Minimum Correcting Update Choose the smallest change to the weights that corrects the mistake: min_tau ||w' - w||^2, where w'_{y*} = w_{y*} + tau f and w'_y = w_y - tau f, subject to w'_{y*} . f >= w'_y . f + 1 The minimum is not at tau = 0, or we would not have made an error, so the minimum is where equality holds: tau = ((w_y - w_{y*}) . f + 1) / (2 f . f)

Maximum Step Size In practice, it's also bad to make updates that are too large The example may be labeled incorrectly You may not have enough features Solution: cap the maximum possible value of tau with some constant C, i.e., use the smaller of C and the tau above Corresponds to an optimization that assumes non-separable data Usually converges faster than the perceptron Usually better, especially on noisy data
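A minimal Python sketch of the capped MIRA step, assuming the multiclass dict-based weights from the earlier sketches and using the step-size formula reconstructed above (the function name and the default C are illustrative):

def mira_update(weights, features, y_star, y_guess, C=0.01):
    # weights: dict mapping class -> dict of feature weights; applies one update after a mistake.
    if y_guess == y_star:
        return
    def dot(w, f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())
    f_dot_f = sum(v * v for v in features.values())
    if f_dot_f == 0:
        return
    # Smallest step that fixes the mistake with margin 1, capped at C.
    tau = (dot(weights[y_guess], features) - dot(weights[y_star], features) + 1.0) / (2.0 * f_dot_f)
    tau = min(tau, C)
    for k, v in features.items():
        weights[y_star][k] = weights[y_star].get(k, 0.0) + tau * v
        weights[y_guess][k] = weights[y_guess].get(k, 0.0) - tau * v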

Linear Separators Which of these linear separators is optimal?

Support Vector Machines Maximizing the margin: good according to intuition, theory, and practice Only support vectors matter; other training examples are ignorable Support vector machines (SVMs) find the separator with max margin Basically, SVMs are like MIRA where you optimize over all examples at once
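As a rough sketch of the comparison (standard formulations, not quoted from the slides): MIRA minimizes the change ||w' - w||^2 subject to the margin constraint for the single current example, while a multiclass, separable SVM minimizes ||w||^2 subject to w_{y*} . f(x) >= w_y . f(x) + 1 for every training example x and every incorrect class y.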

Classification: Comparison Naïve Bayes: Builds a model of the training data Gives prediction probabilities Strong assumptions about feature independence One pass through the data (counting) Perceptrons / MIRA: Makes fewer assumptions about the data Mistake-driven learning Multiple passes through the data (prediction) Often more accurate

Extension: Web Search Information retrieval: given information needs, produce information Includes, e.g., web search, question answering, and classic IR Example query: x = "Apple Computers" Web search: not exactly classification, but rather ranking

Feature-Based Ranking For the query x = "Apple Computers", each candidate result is represented by its own feature vector.

Perceptron for Ranking Inputs x, candidates y Many feature vectors f(x, y), one weight vector w Prediction: the highest-scoring candidate, y = argmax_y w . f(x, y) Update (if wrong): w = w + f(x, y*) - f(x, y), i.e., move toward the correct candidate's features and away from the predicted one
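A minimal Python sketch of the ranking perceptron, assuming each candidate's feature vector f(x, y) is a dict and candidates are identified by index (names are illustrative, not from the slides):

def rank_predict(weights, candidate_features):
    # candidate_features: list of dicts, one feature vector f(x, y) per candidate y
    def dot(w, f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())
    return max(range(len(candidate_features)), key=lambda i: dot(weights, candidate_features[i]))

def rank_update(weights, candidate_features, correct_index):
    # If the top-scoring candidate is not the correct one, move the weights toward
    # the correct candidate's features and away from the predicted one.
    predicted = rank_predict(weights, candidate_features)
    if predicted != correct_index:
        for k, v in candidate_features[correct_index].items():
            weights[k] = weights.get(k, 0.0) + v
        for k, v in candidate_features[predicted].items():
            weights[k] = weights.get(k, 0.0) - v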

Pacman Apprenticeship! Examples are states s Candidates are pairs (s, a) Correct actions: those taken by the expert Features defined over (s, a) pairs: f(s, a) Score of a q-state (s, a) given by w . f(s, a) Training adjusts w so that the expert's correct action a* scores highest How is this VERY different from reinforcement learning?

Exam Topics
Search: BFS, DFS, UCS, A* (tree and graph); completeness and optimality; heuristics (admissibility and consistency)
Games: minimax, alpha-beta pruning, expectimax, evaluation functions
MDPs: Bellman equations; value and policy iteration
Reinforcement Learning: exploration vs. exploitation; model-based vs. model-free; TD learning and Q-learning; linear value function approximation
Hidden Markov Models: Markov chains; forward algorithm; particle filter
Bayesian Networks: basic definition, independence; variable elimination; sampling (prior, rejection, importance)
Machine Learning: Naïve Bayes; perceptron