Natural Language Processing

Natural Language Processing: Classification III. Dan Klein, UC Berkeley.

Classification

Linear Models: Perceptron. The perceptron algorithm iteratively processes the training set, reacting to training errors; it can be thought of as trying to drive down training error. The (online) perceptron algorithm: start with zero weights w; visit the training instances one by one; try to classify each one; if correct, no change; if wrong, adjust the weights.
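A minimal sketch of this update in code (binary case, labels in {-1, +1}; the function name, epoch count, and numpy representation are my own choices, not from the slides):

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """X: (n, d) array of feature vectors; y: (n,) array of labels in {-1, +1}."""
    w = np.zeros(X.shape[1])            # start with zero weights
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):      # visit training instances one by one
            if y_i * (w @ x_i) <= 0:    # wrong: adjust weights toward the correct label
                w += y_i * x_i
            # if correct: no change
    return w
```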

Duals and Kernels

Nearest-Neighbor Classification. Nearest neighbor, e.g. for digits: take a new example, compare it to all training examples, and assign the label of the closest one. Encoding: each image is a vector of pixel intensities. Similarity function: e.g. the dot product of the two image vectors.
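As a sketch, nearest-neighbor classification with a dot-product similarity might look like this (names and array layout are assumptions, not from the slides):

```python
import numpy as np

def nearest_neighbor_classify(x_new, X_train, y_train):
    """X_train: (n, d) array of training images as intensity vectors; y_train: (n,) labels."""
    sims = X_train @ x_new                   # dot-product similarity to every training example
    return y_train[int(np.argmax(sims))]     # assign the label of the most similar example
```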

Non-Parametric Classification. Non-parametric: more examples mean (potentially) more complex classifiers. How about K-Nearest Neighbors? We can be a little more sophisticated, averaging over several neighbors, but it's still not really error-driven learning; the magic is in the distance function. Overall: we can exploit rich similarity functions, but not objective-driven learning.

A Tale of Two Approaches. Nearest-neighbor-like approaches: work with the data through similarity functions; no explicit learning. Linear approaches: explicit training to reduce empirical error; represent the data through features. Kernelized linear models: explicit training, but driven by similarity! Flexible, powerful, very very slow.

The Perceptron, Again. Start with zero weights; visit the training instances one by one; try to classify; if correct, no change; if wrong, adjust the weights by adding or subtracting the feature vectors involved in the mistake (the mistake vectors).

Perceptron Weights. What is the final value of w? Can it be an arbitrary real vector? No! It's built by adding up feature vectors (mistake vectors), so w can be expressed in terms of mistake counts: we can reconstruct the weight vector (the primal representation) from the update counts (the dual representation) for each example i.
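The reconstruction the slide alludes to is usually written as follows, where α_i(y) counts how many times output y was mistakenly predicted for example i (the notation is a standard choice, not necessarily the slide's own):

```latex
w \;=\; \sum_i \sum_y \alpha_i(y)\,\bigl[\,f(x_i, y_i) - f(x_i, y)\,\bigr]
```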

Dual Perceptron. Track mistake counts rather than weights. Start with zero counts (α = 0). For each instance x: try to classify it; if correct, no change; if wrong, raise the mistake count for this example and the offending prediction.

Dual / Kernelized Perceptron. How do we classify an example x? Score it directly from the mistake counts and kernel values: if someone tells us the value of K for each pair of candidates, we never need to build the weight vectors.
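A sketch of the dual (kernelized) perceptron for the binary case, tracking one mistake count per training example; the kernel is passed in as a function, and all names here are mine rather than the slides':

```python
import numpy as np

def dual_perceptron_train(X, y, kernel, epochs=10):
    """X: sequence of n training points; y: labels in {-1, +1}; kernel(x, z) -> float."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    alpha = np.zeros(n)                                       # start with zero mistake counts
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix of pairwise kernel values
    for _ in range(epochs):
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])               # score x_i without ever building w
            if y[i] * score <= 0:                             # wrong: raise the mistake count
                alpha[i] += 1.0
    return alpha

def dual_perceptron_predict(x_new, X, y, alpha, kernel):
    score = sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X))
    return 1 if score >= 0 else -1
```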

Issues with the Dual Perceptron. Problem: to score each candidate, we may have to compare it to all training candidates, which is very, very slow compared to the primal dot product! One bright spot: for the perceptron, we only need to consider candidates we made mistakes on during training; this is slightly better for SVMs, where the alphas are (in theory) sparse. This problem is serious: fully dual methods (including kernel methods) tend to be extraordinarily slow. Of course, we can (so far) also just accumulate our weights as we go...

Kernels: Who Cares? So far, this is a very strange way of doing a very simple calculation. The "kernel trick": we can substitute any* similarity function in place of the dot product, which lets us learn new kinds of hypotheses. *Fine print: if your kernel doesn't satisfy certain technical requirements, lots of proofs break, e.g. convergence and mistake bounds. In practice, "illegal" kernels sometimes work (but not always).

Some Kernels. Kernels implicitly map the original vectors to higher-dimensional spaces, take the dot product there, and hand the result back. Examples: the linear kernel, the quadratic kernel, the RBF kernel (an infinite-dimensional representation), and discrete kernels, e.g. string kernels and tree kernels.
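The exact formulas on the slide did not survive the transcription; the standard forms of the first three are sketched below (the +1 offset in the quadratic kernel and the σ parameter are conventional choices, not taken from the slide). Any of these can be passed as the kernel argument of the dual perceptron sketch above.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                                    # plain dot product

def quadratic_kernel(x, z, c=1.0):
    return (x @ z + c) ** 2                                         # implicit map to all degree-<=2 products

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))   # infinite-dimensional feature space
```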

Tree Kernels [Collins and Duffy 01]. We want to compute the number of common subtrees between two trees T and T'. Add up counts over all pairs of nodes n, n'. Base case: if n, n' have different root productions, or are at depth 0, the count is zero. Otherwise, if n, n' share the same root production, the count is built up recursively from their children, as in the recursion below.
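The Collins and Duffy recursion is usually stated as follows, where C(n, n') counts common subtrees rooted at n and n', ch_j(n) is the j-th child of n, and nc(n) is its number of children (the slide's own equations were lost in transcription; this is the standard form):

```latex
C(n, n') =
\begin{cases}
0 & \text{if $n, n'$ have different productions,} \\
1 & \text{if $n, n'$ are preterminals with the same production,} \\
\prod_{j=1}^{nc(n)} \bigl(1 + C(\mathrm{ch}_j(n), \mathrm{ch}_j(n'))\bigr) & \text{otherwise,}
\end{cases}
\qquad
K(T, T') = \sum_{n \in T} \sum_{n' \in T'} C(n, n').
```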

Dual Formulation for SVMs. We want to optimize the margin objective (separable case for now). This is hard because of the constraints. Solution: the method of Lagrange multipliers. In the Lagrangian representation of this problem, all we've done is express the constraints as an adversary, which leaves our objective alone if we obey the constraints but ruins our objective if we violate any of them.
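The slide's equations did not survive transcription; in the binary, separable case the problem and its Lagrangian are standardly written as (notation mine):

```latex
\min_{w}\ \tfrac{1}{2}\lVert w\rVert^2
\quad\text{s.t.}\quad y_i\,(w\cdot x_i)\ \ge\ 1\ \ \forall i,
\qquad
L(w,\alpha) \;=\; \tfrac{1}{2}\lVert w\rVert^2 \;-\; \sum_i \alpha_i\bigl[y_i\,(w\cdot x_i)-1\bigr],
\quad \alpha_i \ge 0.
```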

Lagrange Duality. We start out with a constrained optimization problem and form the Lagrangian. This is useful because the constrained solution is a saddle point of the Lagrangian (this is a general property): taking the max over α first gives the primal problem in w, and taking the min over w first gives the dual problem in α.
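In symbols, with L the Lagrangian sketched above, the saddle-point property reads as follows (and for this convex problem the two sides are equal):

```latex
\underbrace{\min_{w}\ \max_{\alpha \ge 0}\ L(w,\alpha)}_{\text{primal problem in } w}
\;=\;
\underbrace{\max_{\alpha \ge 0}\ \min_{w}\ L(w,\alpha)}_{\text{dual problem in } \alpha}.
```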

Dual Formulation II. Duality tells us that the primal (min over w of the max over α) has the same value as the dual (max over α of the min over w). This is useful because if we think of the α's as constants, we have an unconstrained min in w that we can solve analytically. Then we end up with an optimization over α instead of over w (easier).

Dual Formulation III. Minimize the Lagrangian for fixed α's: setting the gradient with respect to w to zero expresses w in terms of the α's. Substituting back, we have the Lagrangian as a function of the α's only.
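Carrying this out for the binary, separable SVM (with the bias folded into the features) gives the familiar dual; again this is a standard reconstruction rather than the slide's own notation:

```latex
\nabla_w L = 0 \;\Rightarrow\; w = \sum_i \alpha_i\, y_i\, x_i,
\qquad
\max_{\alpha \ge 0}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i\,\alpha_j\, y_i\, y_j\,(x_i\cdot x_j).
```

With an explicit bias term, an extra equality constraint (the α-weighted labels summing to zero) would also appear.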

Back to Learning SVMs. We want to find the α's that solve this optimization. This is a quadratic program: it can be solved with general QP or convex optimizers, but they don't scale well to large problems. Cf. maxent models, which work fine with general optimizers (e.g. CG, L-BFGS). How would a special-purpose optimizer work?

Coordinate Descent I. Despite all the mess, the objective is just a quadratic in each individual α_i(y). Coordinate descent: optimize one variable at a time. If the unconstrained argmin on a coordinate is negative, just clip it to zero.
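A sketch of this coordinate-wise update for the binary, separable dual above, maintaining w as the α-weighted sum of training points so each update stays cheap; the function name and epoch count are assumptions:

```python
import numpy as np

def svm_dual_coordinate_descent(X, y, epochs=50):
    """X: (n, d) numpy array; y: (n,) labels in {-1, +1}. Hard-margin case: only alpha_i >= 0."""
    n, d = X.shape
    alpha = np.zeros(n)                          # dual variables, one per training example
    w = np.zeros(d)                              # maintained primal weights: w = sum_i alpha_i y_i x_i
    q_ii = np.einsum('ij,ij->i', X, X)           # diagonal of the Gram matrix, x_i . x_i
    for _ in range(epochs):
        for i in range(n):
            if q_ii[i] == 0:
                continue
            g = y[i] * (w @ X[i]) - 1.0          # gradient of the (negated) dual objective in alpha_i
            new_alpha = alpha[i] - g / q_ii[i]   # unconstrained coordinate-wise optimum
            new_alpha = max(new_alpha, 0.0)      # negative? just clip to zero
            w += (new_alpha - alpha[i]) * y[i] * X[i]
            alpha[i] = new_alpha
    return w, alpha
```

In the non-separable case each α_i additionally gets an upper bound (or a simplex constraint in the multiclass setting), which is what SMO-style methods handle.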

Coordinate Descent II. Ordinarily, treating coordinates independently is a bad idea, but here the update is very fast and simple, so we visit each axis many times, but each visit is quick. This approach works fine for the separable case. For the non-separable case, we just gain a simplex constraint, and so we need slightly more complex methods (SMO, exponentiated gradient).

What are the Alphas? Each candidate corresponds to a primal constraint. In the solution, an α_i(y) will be zero if that constraint is inactive and positive if that constraint is active, i.e. positive on the support vectors. Only the support vectors contribute to the weights.

Structure

Handwriting recognition: the input x is an image of a handwritten word and the output y is the string it spells, e.g. "brace". Sequential structure. [Slides: Taskar and Klein 05]

CFG Parsing: the input x is a sentence, e.g. "The screen was a sea of red", and the output y is its parse tree. Recursive structure.

Bilingual Word Alignment: the input x is a sentence pair, e.g. "What is the anticipated cost of collecting fees under the new proposal?" / "En vertu de les nouvelle propositions, quel est le coût prévu de perception de les droits?", and the output y is a word alignment between the two sentences. Combinatorial structure.

Structured Models. Prediction is an argmax over the space of feasible outputs. Assumption: the score is a sum of local part scores, where the parts are, e.g., nodes, edges, or productions.
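The prediction rule and the decomposition assumption can be written as follows (reconstructed in standard notation; the slide's own equation was lost):

```latex
y^*(x) \;=\; \arg\max_{y \in \mathcal{Y}(x)}\ w \cdot f(x, y),
\qquad
f(x, y) \;=\; \sum_{\text{parts } p \in y} f(x, p).
```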

CFG Parsing: the local parts are productions, with features such as #(NP → DT NN), #(PP → IN NP), #(NN → "sea").

Bilingual word alignment: the parts are alignment links between an English position j and a French position k (as in the sentence pair above); features of a link include association, position, and orthography.

Option 0: Reranking. Input x = "The screen was a sea of red." → baseline parser → N-best list (e.g. n = 100) [e.g. Charniak and Johnson 05] → non-structured classification → output.

Reranking. Advantages: directly reduces to the non-structured case; no locality restriction on features. Disadvantages: stuck with the errors of the baseline parser; the baseline system must produce n-best lists. But feedback is possible [McCloskey, Charniak, Johnson 2006].

Efficient Primal Decoding. Common case: you have a black box which computes the highest-scoring output, at least approximately, and you want to learn w. Many learning methods require more (expectations, dual representations, k-best lists), but the most commonly used options do not. The easiest option is the structured perceptron [Collins 01]. Structure enters here in that the search for the best y is typically a combinatorial algorithm (dynamic programming, matchings, ILPs, A*...). Prediction is structured; the learning update is not.
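A sketch of the structured perceptron wrapped around such a black box; feats and decode are assumed user-supplied callables (e.g. decode runs a Viterbi-style dynamic program), and outputs y are assumed comparable with != (e.g. tuples of tags):

```python
import numpy as np

def structured_perceptron_train(data, feats, decode, epochs=10):
    """data: list of (x, y_gold) pairs; feats(x, y) -> numpy feature vector;
    decode(x, w) -> (approximately) the highest-scoring y under w . feats(x, y)."""
    x0, y0 = data[0]
    w = np.zeros_like(feats(x0, y0), dtype=float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = decode(x, w)                         # structured prediction: combinatorial search
            if y_hat != y_gold:                          # the learning update itself is not structured:
                w += feats(x, y_gold) - feats(x, y_hat)  # simple additive perceptron update
    return w
```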

Structured Margin. Remember the margin objective: minimize the norm of the weights subject to margin constraints. This is still defined in the structured case, but there are a lot of constraints.
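Written out in one standard margin-rescaled form (the slide's equation did not survive transcription; the loss ℓ may simply be 1 for every wrong output):

```latex
\min_{w}\ \tfrac{1}{2}\lVert w\rVert^2
\quad\text{s.t.}\quad
w \cdot f(x_i, y_i) \;\ge\; w \cdot f(x_i, y) + \ell(y_i, y)
\quad \forall i,\ \forall y \ne y_i.
```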

Full Margin: OCR. We want the score of the true output "brace" to be the highest; equivalently, one constraint per alternative output: "brace" must outscore "aaaaa", "aaaab", ..., all the way to "zzzzz". That is a lot of constraints!

Parsing example. We want the correct parse of "It was red" to score highest; equivalently, one constraint per alternative parse tree of the sentence. Again, a lot of constraints!

Alignment example. We want the correct alignment of "What is the" / "Quel est le" (positions 1 2 3 to 1 2 3) to score highest; equivalently, one constraint per alternative alignment. A lot of constraints!

Cutting Plane. A constraint-induction method [Joachims et al 09]. It exploits the fact that the number of constraints you actually need per instance is typically very small, and it requires only (loss-augmented) primal decoding. Repeat: find the most violated constraint for an instance; add this constraint and re-solve the (non-structured) QP (e.g. with SMO or another QP solver).
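The "most violated constraint" step is a loss-augmented decode; in standard notation (a reconstruction, not the slide's own formula):

```latex
\hat{y}_i \;=\; \arg\max_{y}\ \bigl[\, \ell(y_i, y) + w \cdot f(x_i, y) \,\bigr].
```

The constraint for (x_i, ŷ_i) is added to the working set only if it is violated by more than the current slack (plus a small tolerance), and then the small QP over the working set is re-solved.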

Cutting Plane: Some Issues. It can easily spend too much time solving QPs, and it doesn't exploit shared constraint structure. In practice it works pretty well: fast like the perceptron/MIRA, more stable, and no averaging needed.

M3Ns. Another option: express all the constraints in a packed form, as in maximum margin Markov networks [Taskar et al 03], which integrates the solution structure deeply into the problem structure. Steps: express inference over the constraints as an LP; use duality to transform the minimax formulation into a min-min; the constraints then factor in the dual along the same structure as the primal, and the alphas essentially act as a dual distribution; this opens up various optimization possibilities in the dual.

Likelihood, Structured. Structure is needed to compute the log normalizer and the expected feature counts. E.g., if a feature is an indicator of the tag pair DT NN, then we need to compute the posterior marginal P(DT NN | sentence) at each position and sum. This also works with latent variables (more later).
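Concretely, for a log-linear model over structures these two quantities are (standard CRF-style notation, assumed rather than taken from the slide):

```latex
\log Z(x) \;=\; \log \sum_{y} \exp\bigl(w \cdot f(x, y)\bigr),
\qquad
\mathbb{E}_{P(y \mid x)}\bigl[f_k\bigr] \;=\; \sum_{y} P(y \mid x)\, f_k(x, y),
```

and the gradient of the log-likelihood for an example (x_i, y_i) is f(x_i, y_i) minus the expected feature vector under P(y | x_i); both sums over y are computed with dynamic programs over the structure rather than by enumeration.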