CSEP 573: Artificial Intelligence. Machine Learning: Perceptron. Ali Farhadi. Many slides over the course adapted from Luke Zettlemoyer and Dan Klein.
Generative vs. Discriminative. Generative classifiers (e.g., naïve Bayes): a joint probability model with evidence variables; query the model for causes given evidence. Discriminative classifiers: no generative model, no Bayes rule, often no probabilities at all! Try to predict the label Y directly from X. Robust and accurate with varied features. Loosely: mistake-driven rather than model-driven.
Discriminative vs. generative. [Figure: three plots over x = data: a generative model (the artist), a discriminative model (the lousy painter), and the resulting classification function (+1/-1).]
Some (Simplified) Biology Very loose inspiration: human neurons
Linear Classifiers. Inputs are feature values f_i(x); each feature has a weight w_i. The sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x). If the activation is positive, output +1; if negative, output -1. [Diagram: features f1, f2, f3 weighted by w1, w2, w3, summed (Σ), then tested against > 0.]
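The decision rule is small enough to sketch directly. A minimal Python version, assuming sparse feature dictionaries; the function names activation and classify are mine, not the slides':

```python
def activation(weights, features):
    """Dot product w . f(x) over sparse feature dicts."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def classify(weights, features):
    """Output +1 if the activation is positive, else -1."""
    return 1 if activation(weights, features) > 0 else -1
```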
Example: Spam. Imagine 3 features (spam is the positive class): free (number of occurrences of "free"), money (occurrences of "money"), and BIAS (intercept, always has value 1). For the input "free money": f = {BIAS: 1, free: 1, money: 1}. With weights w = {BIAS: -3, free: 4, money: 2}, the activation is (-3)(1) + (4)(1) + (2)(1) = 3 > 0, so the output is +1 (spam).
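Plugging the slide's numbers into the sketch above reproduces the spam decision:

```python
w = {"BIAS": -3, "free": 4, "money": 2}
f = {"BIAS": 1, "free": 1, "money": 1}   # the input "free money"
print(activation(w, f))  # 3 -> positive activation
print(classify(w, f))    # +1 = SPAM
```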
Binary Decision Rule. In the space of feature vectors, examples are points, and any weight vector defines a hyperplane (the set where w · f(x) = 0). One side corresponds to Y = +1, the other to Y = -1. [Figure: the (free, money) plane for w = {BIAS: -3, free: 4, money: 2}; the +1 (SPAM) region lies on one side of the line 4·free + 2·money - 3 = 0, the -1 (HAM) region on the other.]
Binary Perceptron Algorithm. Start with zero weights. For each training instance: classify with the current weights; if correct (i.e., y = y*), no change! If wrong: adjust the weight vector by adding or subtracting the feature vector, w ← w + y* · f(x) (so subtract when y* is -1). A runnable sketch follows below.
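A minimal sketch of this loop, reusing the helpers above. The data format (a list of (features, label) pairs with labels in {+1, -1}) and the fixed number of passes are my assumptions; the slides only give the update rule:

```python
def train_binary_perceptron(data, passes=10):
    weights = {}                          # start with zero weights
    for _ in range(passes):
        for features, y_star in data:
            y = classify(weights, features)
            if y != y_star:               # mistake: w <- w + y* . f(x)
                for name, value in features.items():
                    weights[name] = weights.get(name, 0.0) + y_star * value
    return weights
```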
Examples: Perceptron Separable Case http://isl.ira.uka.de/neuralnetcourse/2004/vl_11_5/perceptron.html
Examples: Perceptron Inseparable Case http://isl.ira.uka.de/neuralnetcourse/2004/vl_11_5/perceptron.html
Multiclass Decision Rule. If we have more than two classes: have a weight vector w_y for each class y; calculate an activation for each class, activation_y(x) = w_y · f(x); the highest activation wins, y = argmax_y w_y · f(x).
Example. Inputs: "win the vote", "win the election", "win the game". Each class keeps its own weight vector over the features BIAS, win, game, vote, the, ... (values filled in on the next slide).
Example: "win the vote". Feature vector f(x): BIAS: 1, win: 1, game: 0, vote: 1, the: 1.
Class 1 weights: BIAS: -2, win: 4, game: 4, vote: 0, the: 0 → activation = -2 + 4 = 2
Class 2 weights: BIAS: 1, win: 2, game: 0, vote: 4, the: 0 → activation = 1 + 2 + 4 = 7
Class 3 weights: BIAS: 2, win: 0, game: 2, vote: 0, the: 0 → activation = 2
Class 2 has the highest activation, so it wins.
The Multi-class Perceptron Alg. Start with zero weights. Iterate over training examples: classify with current weights, y = argmax_y w_y · f(x); if correct, no change! If wrong: lower the score of the wrong answer and raise the score of the right answer, w_y ← w_y - f(x) and w_{y*} ← w_{y*} + f(x). A sketch follows below.
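A sketch of the multiclass loop in the same style, assuming sparse feature dicts and a fixed class list (the function name and data format are mine):

```python
def train_multiclass_perceptron(data, classes, passes=10):
    weights = {c: {} for c in classes}    # zero weight vector per class
    for _ in range(passes):
        for features, y_star in data:
            # highest activation wins
            y = max(classes, key=lambda c: activation(weights[c], features))
            if y != y_star:
                for name, value in features.items():
                    # lower the wrong answer, raise the right answer
                    weights[y][name] = weights[y].get(name, 0.0) - value
                    weights[y_star][name] = weights[y_star].get(name, 0.0) + value
    return weights
```

On the "win the vote" example above, one such update would subtract f(x) from class 2's weights and add it to the correct class's weights if class 2 were wrong.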
Examples: Perceptron Separable Case
Mistake-Driven Classification. For naïve Bayes: parameters come from data statistics; parameters have a probabilistic interpretation; training is one pass through the data. For the perceptron: parameters come from reactions to mistakes; parameters have a discriminative interpretation; training goes through the data until held-out accuracy maxes out. [Diagram: data split into training, held-out, and test sets.]
Properties of Perceptrons. Separability: some setting of the parameters gets the training set perfectly correct. Convergence: if the training data are separable, the perceptron will eventually converge (binary case). Mistake bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability; for separable data with margin δ, the number of mistakes is at most proportional to 1/δ². [Figure: separable vs. non-separable point sets.]
Problems with the Perceptron. Noise: if the data isn't separable, the weights might thrash; averaging weight vectors over time can help (averaged perceptron). Mediocre generalization: it finds a barely separating solution. Overtraining: test / held-out accuracy usually rises, then falls; overtraining is a kind of overfitting.
Fixing the Perceptron. Idea: adjust the weight update to mitigate these effects. MIRA*: choose an update size that fixes the current mistake but minimizes the change to w: minimize ‖w′ - w‖² subject to w′_{y*} · f(x) ≥ w′_y · f(x) + 1. The +1 margin helps to generalize. (*Margin Infused Relaxed Algorithm)
Minimum Correcting Update. The update has the form w_{y*} ← w_{y*} + τ f(x), w_y ← w_y - τ f(x). The minimum is not at τ = 0, or we would not have made an error, so the minimum is where the margin constraint holds with equality; solving gives τ = ((w_y - w_{y*}) · f(x) + 1) / (2 f(x) · f(x)).
Maximum Step Size. In practice, it's also bad to make updates that are too large: the example may be labeled incorrectly, or you may not have enough features. Solution: cap the maximum possible value of τ with some constant C, i.e., τ = min(C, ((w_y - w_{y*}) · f(x) + 1) / (2 f(x) · f(x))). This corresponds to an optimization that assumes non-separable data. It usually converges faster than the perceptron, and is usually better, especially on noisy data. A sketch of the capped update follows below.
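A sketch of a single capped MIRA update, using the per-class weight dicts and activation helper from above. The closed-form τ follows the derivation on the previous slides; the default C = 0.01 is my assumption, not the slides':

```python
def mira_update(weights, features, y, y_star, C=0.01):
    """Apply one MIRA update after predicting y when the answer was y_star."""
    f_dot_f = sum(v * v for v in features.values())
    if f_dot_f == 0:
        return
    # tau = min(C, ((w_y - w_y*) . f + 1) / (2 f . f))
    gap = activation(weights[y], features) - activation(weights[y_star], features)
    tau = min(C, (gap + 1.0) / (2.0 * f_dot_f))
    for name, value in features.items():
        weights[y_star][name] = weights[y_star].get(name, 0.0) + tau * value
        weights[y][name] = weights[y].get(name, 0.0) - tau * value
```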
Linear Separators Which of these linear separators is optimal?
Support Vector Machines. Maximizing the margin: good according to intuition, theory, and practice. Only the support vectors matter; other training examples are ignorable. Support vector machines (SVMs) find the separator with the maximum margin. Basically, SVMs are like MIRA where you optimize over all examples at once: minimize ‖w‖² subject to w_{y*_i} · f(x_i) ≥ w_y · f(x_i) + 1 for all examples i and classes y. [Figure: MIRA vs. SVM separators.]
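For comparison, an off-the-shelf max-margin linear separator; scikit-learn and the toy data are my choices here, the slides don't name a library:

```python
from sklearn.svm import LinearSVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy feature vectors
y = [-1, -1, 1, 1]                     # separable by the first coordinate
clf = LinearSVC(C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)       # the learned separator w, b
```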
Classification: Comparison. Naïve Bayes: builds a model of the training data; gives prediction probabilities; strong assumptions about feature independence; one pass through the data (counting). Perceptrons / MIRA: make fewer assumptions about the data; mistake-driven learning; multiple passes through the data (prediction); often more accurate.
Extension: Web Search. Information retrieval: given information needs, produce information. Includes, e.g., web search, question answering, and classic IR. Example query: x = "Apple Computers". Web search is not exactly classification, but rather ranking.
Feature-Based Ranking. For a query such as x = "Apple Computers", each candidate result gets its own feature vector. [Figure: query x with candidate results and their feature vectors.]
Perceptron for Ranking. Inputs x; candidates y. Many feature vectors: f(x, y), one per candidate. One weight vector: w. Prediction: y = argmax_y w · f(x, y). Update (if wrong): w ← w + f(x, y*) - f(x, y). A sketch follows below.
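A sketch of the ranking perceptron, assuming each training example is a list of candidate feature dicts plus the index of the correct candidate (that data format is my assumption):

```python
def train_ranking_perceptron(data, passes=10):
    w = {}
    for _ in range(passes):
        for candidates, star in data:
            # predict the candidate with the highest score w . f(x, y)
            guess = max(range(len(candidates)),
                        key=lambda i: activation(w, candidates[i]))
            if guess != star:
                # w <- w + f(x, y*) - f(x, y)
                for name, value in candidates[star].items():
                    w[name] = w.get(name, 0.0) + value
                for name, value in candidates[guess].items():
                    w[name] = w.get(name, 0.0) - value
    return w
```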
Pacman Apprenticeship! Examples are states s; candidates are pairs (s, a). Correct actions a*: those taken by the expert. Features are defined over (s, a) pairs: f(s, a). The score of a q-state (s, a) is given by w · f(s, a), and the expert's action a* should score highest. How is this VERY different from reinforcement learning?
Exam Topics.
Search: BFS, DFS, UCS, A* (tree and graph); completeness and optimality; heuristics (admissibility and consistency).
Games: minimax, alpha-beta pruning, expectimax, evaluation functions.
MDPs: Bellman equations; value and policy iteration.
Reinforcement learning: exploration vs. exploitation; model-based vs. model-free; TD learning and Q-learning; linear value function approximation.
Hidden Markov models: Markov chains; forward algorithm; particle filter.
Bayesian networks: basic definition, independence; variable elimination; sampling (prior, rejection, importance).
Machine learning: naïve Bayes; perceptron.