Natural Language Processing: Classification III (Duals and Kernels)


Natural Language Processing
Classification III
Dan Klein, UC Berkeley

Classification

Linear Models: Perceptron
- The perceptron algorithm iteratively processes the training set, reacting to training errors. It can be thought of as trying to drive down training error.
- The (online) perceptron algorithm:
  - Start with zero weights w
  - Visit training instances one by one and try to classify each
  - If correct, no change! If wrong: adjust the weights (see the sketch below)

Duals and Kernels

Nearest Neighbor Classification
- Nearest neighbor, e.g. for digits:
  - Take a new example
  - Compare it to all training examples
  - Assign a label based on the closest example
- Encoding: each image is a vector of pixel intensities
- Similarity function: e.g. the dot product of the two image vectors

Non-Parametric Classification
- Non-parametric: more examples means (potentially) more complex classifiers
- How about k-nearest neighbors? We can be a little more sophisticated, averaging several neighbors, but it is still not really error-driven learning; the magic is in the distance function
- Overall: we can exploit rich similarity functions, but not objective-driven learning
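As a concrete illustration of the perceptron loop above, here is a minimal multiclass sketch. The feature function `feature_fn(x, y)` (returning a sparse feature dict) and the label set are hypothetical stand-ins, not part of the original slides:

```python
from collections import defaultdict

def perceptron_train(data, feature_fn, labels, epochs=5):
    """Online multiclass perceptron: start at zero, visit instances
    one by one, and adjust the weights only on mistakes."""
    w = defaultdict(float)  # start with zero weights

    def score(x, y):
        return sum(w[f] * v for f, v in feature_fn(x, y).items())

    for _ in range(epochs):
        for x, y_true in data:            # visit instances one by one
            y_pred = max(labels, key=lambda y: score(x, y))
            if y_pred != y_true:          # if wrong: adjust weights
                for f, v in feature_fn(x, y_true).items():
                    w[f] += v
                for f, v in feature_fn(x, y_pred).items():
                    w[f] -= v
    return w
```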

A Tale of Two Approaches
- Nearest-neighbor-like approaches: work with the data through similarity functions; no explicit learning
- Linear approaches: explicit training to reduce empirical error; represent the data through features
- Kernelized linear models: explicit training, but driven by similarity! Flexible, powerful, and very, very slow

The Perceptron, Again
- Start with zero weights
- Visit training instances one by one and try to classify each
- If correct, no change! If wrong: adjust the weights; each update adds or subtracts a feature vector (a "mistake vector")

Perceptron Weights
- What is the final value of w? Can it be an arbitrary real vector?
- No! It is built by adding up feature vectors (mistake vectors), one per training mistake:
  w = Σᵢ Σ_y αᵢ(y) [f(xᵢ, yᵢ) − f(xᵢ, y)]
  where αᵢ(y) counts how often label y was wrongly predicted on example i

Dual Perceptron
- Track mistake counts rather than weights
- Start with zero counts (α)
- For each instance x: try to classify; if correct, no change! If wrong: raise the mistake count for this example and prediction
- We can reconstruct the weight vector (the "primal" representation) from the update counts (the "dual" representation) using the sum above

Dual / Kernelized Perceptron
- How do we classify an example x? Expand the weights into mistake counts:
  score(x, y) = w · f(x, y) = Σᵢ Σ_y' αᵢ(y') [K(xᵢ, yᵢ; x, y) − K(xᵢ, y'; x, y)]
  where K(x, y; x', y') = f(x, y) · f(x', y'), so we only ever need dot products between candidates (see the sketch below)
- If someone tells us the value of K for each pair of candidates, we never need to build the weight vectors

Issues with the Dual Perceptron
- Problem: to score each candidate, we may have to compare it to all training candidates; very, very slow compared to a primal dot product!
- One bright spot: for the perceptron, we only need to consider candidates we made mistakes on during training; slightly better for SVMs, where the alphas are (in theory) sparse
- This problem is serious: fully dual methods (including kernel methods) tend to be extraordinarily slow
- Of course, we can (so far) also accumulate our weights as we go...
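A sketch of the dual perceptron in the same style. The joint kernel `K((x1, y1), (x2, y2))`, standing in for the dot product f(x1, y1) · f(x2, y2), is an assumed interface:

```python
from collections import defaultdict

def dual_perceptron_train(data, K, labels, epochs=5):
    """Dual (kernelized) perceptron: track mistake counts alpha[i][y]
    instead of weights; classify by comparing against past mistakes
    through the kernel K."""
    alpha = [defaultdict(float) for _ in data]   # start with zero counts

    def score(x, y):
        # Expand w . f(x, y) through the mistake counts: compare the
        # candidate against every (example, wrong prediction) pair.
        total = 0.0
        for i, (x_i, y_i) in enumerate(data):
            for y_err, count in alpha[i].items():
                total += count * (K((x_i, y_i), (x, y)) - K((x_i, y_err), (x, y)))
        return total

    for _ in range(epochs):
        for i, (x, y_true) in enumerate(data):
            y_pred = max(labels, key=lambda y: score(x, y))
            if y_pred != y_true:
                alpha[i][y_pred] += 1.0  # raise count for this example/prediction
    return alpha
```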

Kernels: Who Cares?
- So far: a very strange way of doing a very simple calculation
- "Kernel trick": we can substitute any* similarity function in place of the dot product
- This lets us learn new kinds of hypotheses
- * Fine print: if your kernel doesn't satisfy certain technical requirements, lots of proofs break, e.g. convergence and mistake bounds. In practice, illegal kernels sometimes work (but not always).

Some Kernels
- Kernels implicitly map the original vectors to higher-dimensional spaces, take the dot product there, and hand the result back
- Linear kernel: K(x, x') = x · x'
- Quadratic kernel: K(x, x') = (x · x' + 1)²
- RBF kernel: K(x, x') = exp(−||x − x'||² / 2σ²), an infinite-dimensional representation
- Discrete kernels: e.g. string kernels, tree kernels
- (The continuous kernels are written out in code below.)

Tree Kernels [Collins and Duffy 01]
- We want to compute the number of common subtrees between trees T and T'
- Add up the counts C(n, n') over all pairs of nodes n, n'
- Base case: if n and n' have different root productions, or are at depth 0, then C(n, n') = 0
- Recursive case: if n and n' share the same root production, then C(n, n') = Π_j (1 + C(n_j, n'_j)), a product over their aligned children

Dual Formulation for SVMs
- We want to optimize (separable case for now):
  minimize ½||w||²  subject to  w · f(xᵢ, yᵢ) ≥ w · f(xᵢ, y) + 1  for all i and all y ≠ yᵢ
- This is hard because of the constraints
- Solution: the method of Lagrange multipliers. The Lagrangian representation of this problem is:
  L(w, α) = ½||w||² − Σᵢ Σ_y αᵢ(y) [w · f(xᵢ, yᵢ) − w · f(xᵢ, y) − 1],  αᵢ(y) ≥ 0
- All we have done is express the constraints as an adversary which leaves our objective alone if we obey the constraints but ruins our objective if we violate any of them

Lagrange Duality
- We start out with a constrained optimization problem, rewritten as min_w max_{α ≥ 0} L(w, α)
- Duality tells us that min_w max_{α ≥ 0} L(w, α) has the same value as max_{α ≥ 0} min_w L(w, α)
- This is because the constrained solution is a saddle point of L (a general property)
- The primal problem is in w; the dual problem is in α

Dual Formulation II
- We form the Lagrangian L(w, α) as above
- This is useful because, if we think of the α's as constants, we have an unconstrained min in w that we can solve analytically
- Then we end up with an optimization over α instead of w (which is easier)
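The three continuous kernels listed above, written out as a sketch; the constants c and sigma are conventional defaults, not values taken from the slides:

```python
import math

def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))

def quadratic_kernel(x, y, c=1.0):
    """K(x, y) = (x . y + c)^2: implicitly all pairwise feature conjunctions."""
    return (linear_kernel(x, y) + c) ** 2

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)): an implicitly
    infinite-dimensional feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```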

Dual Formulation III
- Minimize the Lagrangian for fixed α's: setting ∂L/∂w = 0 gives
  w = Σᵢ Σ_y αᵢ(y) [f(xᵢ, yᵢ) − f(xᵢ, y)]
- Substituting back, we have the Lagrangian as a function of only the α's
- This is a quadratic program:
  - It can be solved with general QP or convex optimizers
  - But they don't scale well to large problems
  - Cf. maxent models, which work fine with general optimizers (e.g. CG, L-BFGS)
- How would a special-purpose optimizer work?

Back to Learning SVMs
- We want to find the α's which minimize the resulting dual objective

Coordinate Descent I
- Despite all the mess, the dual objective is just a quadratic in each individual αᵢ(y)
- Coordinate descent: optimize one variable at a time
- If the unconstrained argmin on a coordinate is negative, just clip it to zero (see the sketch after this section)

Coordinate Descent II
- Ordinarily, treating coordinates independently is a bad idea, but here the update is very fast and simple
- So we visit each axis many times, but each visit is quick
- This approach works fine for the separable case
- For the non-separable case, we gain a simplex constraint and so need slightly more complex methods (SMO, exponentiated gradient)

What are the Alphas?
- Each candidate corresponds to a primal constraint
- In the solution, an αᵢ(y) will be:
  - Zero if that constraint is inactive
  - Positive if that constraint is active, i.e. positive on the support vectors
- The support vectors contribute to the weights:
  w = Σᵢ Σ_y αᵢ(y) [f(xᵢ, yᵢ) − f(xᵢ, y)]
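The slides work with the structured multiclass dual; as a simpler concrete instance of the same clipped coordinate update, here is a sketch for the binary separable-case dual, maximize Σᵢ αᵢ − ½ Σᵢⱼ αᵢαⱼyᵢyⱼK(xᵢ, xⱼ) with αᵢ ≥ 0. The precomputed Gram matrix `K` and labels `y` in {−1, +1} are assumed inputs:

```python
def dual_coordinate_ascent(K, y, n_passes=50):
    """Clipped coordinate updates for the binary separable-case dual:
        max_a  sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j K[i][j],  a_i >= 0.
    The objective is a quadratic in each single a_k, so the unconstrained
    argmax is closed form; if it is negative, just clip to zero.
    Assumes K[k][k] > 0 (e.g. a positive-definite kernel)."""
    n = len(y)
    a = [0.0] * n
    for _ in range(n_passes):          # visit each axis many times...
        for k in range(n):             # ...but each visit is quick
            rest = sum(a[j] * y[j] * K[k][j] for j in range(n) if j != k)
            a[k] = max(0.0, (1.0 - y[k] * rest) / K[k][k])
    return a
```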

Structured Models [slides: Taskar and Klein 05]
- Handwriting recognition: input x is an image of a word, output y is its letter sequence (e.g. "brace"). Sequential structure.
- CFG parsing: input x is a sentence ("The screen was a sea of red"), output y is its parse tree. Recursive structure.
- Bilingual word alignment: input x is a sentence pair ("What is the anticipated cost of collecting fees under the new proposal?" / "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?"), output y is a word alignment. Combinatorial structure.
- In each case, the space of feasible outputs is huge
- Assumption: the score is a sum of local part scores; parts = nodes, edges, productions (see the factored form below)

Parts and Features
- CFG parsing: parts are productions, with count features such as #(NP → DT NN), #(PP → IN NP), #(NN → "sea")
- Bilingual word alignment: parts are aligned word pairs (j, k), with features such as association, position, and orthography
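The factored scoring assumption, written out; this is a LaTeX transcription of the idea, since the slide's formula image is not preserved:

```latex
% The global score decomposes over the local parts of the structure y.
\mathrm{score}(x, y) \;=\; w^{\top} f(x, y)
  \;=\; \sum_{p \,\in\, \mathrm{parts}(y)} w^{\top} f(x, p)
```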

Option 0: Reranking [e.g. Charniak and Johnson 05]
- Pipeline: input x = "The screen was a sea of red." → baseline parser → n-best list (e.g. n = 100) → non-structured classification → output
- Advantages:
  - Directly reduces to the non-structured case
  - No locality restriction on the features
- Disadvantages:
  - Stuck with the errors of the baseline parser
  - The baseline system must produce n-best lists
  - But feedback is possible [McClosky, Charniak, and Johnson 2006]

Efficient Primal Decoding
- Common case: you have a black box which computes argmax_y w · f(x, y), at least approximately, and you want to learn w
- Many learning methods require more (expectations, dual representations, k-best lists), but the most commonly used options do not
- The easiest option is the structured perceptron [Collins 02]:
  - Structure enters only in that the search for the best y is typically a combinatorial algorithm (dynamic programming, matchings, ILPs, A*)
  - Prediction is structured, but the learning update is not (a sketch follows below)

Structured Margin
- Remember the margin objective: it is still well defined for structured outputs, but there are lots of constraints

Full Margin: OCR
- We want the correct letter sequence to outscore every other sequence:
  w · f(x, "brace") ≥ w · f(x, y) + 1 for all y ≠ "brace"
- Equivalently: one constraint for each of the exponentially many alternatives "aaaaa", "aaaab", ..., "zzzzz". That's a lot!

Parsing Example
- Likewise, we want the correct parse to outscore every other tree for the sentence: one constraint per candidate parse. That's a lot!
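A sketch of the structured perceptron described above. Only the decoder is structured; `decode(x, w)` and `feature_fn(x, y)` are assumed interfaces standing in for whatever combinatorial search and feature map the task uses:

```python
from collections import defaultdict

def structured_perceptron(data, feature_fn, decode, epochs=5):
    """Structured perceptron [Collins 02].  decode(x, w) should return
    (approximately) argmax_y w . f(x, y), e.g. via Viterbi, CKY, or a
    matching algorithm; the update itself is a plain perceptron step."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_true in data:
            y_pred = decode(x, w)          # structured combinatorial search
            if y_pred != y_true:           # ...but a non-structured update
                for f, v in feature_fn(x, y_true).items():
                    w[f] += v
                for f, v in feature_fn(x, y_pred).items():
                    w[f] -= v
    return w
```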

Alignment Example
- Similarly for alignment: we want the correct alignment to outscore every other feasible alignment of the sentence pair, which again means exponentially many constraints. That's a lot!

Cutting Plane
- A constraint-induction method [Joachims et al 09]
- Exploits the fact that the number of constraints you actually need per instance is typically very small
- Requires only (loss-augmented) primal decoding (see the sketch at the end of this section)
- Repeat:
  - Find the most violated constraint for an instance
  - Add this constraint and re-solve the (non-structured) QP (e.g. with SMO or another QP solver)
- Some issues:
  - It is easy to spend too much time solving QPs
  - It doesn't exploit shared constraint structure
- In practice, it works pretty well: fast like the perceptron/MIRA, more stable, and no averaging

M3Ns
- Another option: express all the constraints in a packed form
- Maximum-margin Markov networks [Taskar et al 03]
- Integrates the solution structure deeply into the problem structure
- Steps:
  - Express inference over the constraints as an LP
  - Use duality to transform the minimax formulation into a min-min
  - The constraints factor in the dual along the same structure as the primal; the alphas essentially act as a dual distribution
  - Various optimization possibilities in the dual

Likelihood, Structured
- Structure is needed to compute:
  - The log normalizer
  - Expected feature counts
- E.g. if a feature is an indicator of DT NN, then we need to compute the posterior marginal P(DT NN | sentence) for each position and sum
- Also works with latent variables (more later)
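A schematic sketch of the cutting-plane loop above. The two callbacks are assumed interfaces: `loss_augmented_decode(x, y_true, w)` returns argmax_y w · f(x, y) + loss(y_true, y), and `solve_qp(working_set)` re-solves the small QP over the constraints collected so far. A real implementation would also only add a constraint when its violation exceeds the current slack by some epsilon:

```python
from collections import defaultdict

def cutting_plane_train(data, loss_augmented_decode, solve_qp, n_rounds=10):
    """Cutting-plane training sketch [Joachims et al 09]: repeatedly add
    each instance's most violated constraint, then re-solve the QP over
    the working set of induced constraints."""
    w = defaultdict(float)
    working_set = []                   # the induced constraints
    for _ in range(n_rounds):
        for x, y_true in data:
            y_hat = loss_augmented_decode(x, y_true, w)  # most violated
            if y_hat != y_true:
                working_set.append((x, y_true, y_hat))
        w = solve_qp(working_set)      # re-solve the (non-structured) QP
    return w
```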
