Klein & Manning, NIPS 2002
1 Agenda for today

- Factoring a complex model into a product of simpler models
  - Klein & Manning factored model: dependencies and constituents
- Dual decomposition for higher-order dependency parsing
  - Refresh memory on arc-factored vs. second-order models
  - Intuitions about dual decomposition for this application
  - Step through example inference on whiteboard
  - How does it do?
- Dependencies and higher-order parsing formalisms
  - Tree Adjoining Grammars, derived trees and derivation trees
2 Klein & Manning, NIPS 2002: Fast Exact Inference with a Factored Model for Natural Language Parsing

[Figure 1 of the paper: three kinds of parse structures for "Factory payrolls fell in September": (a) PCFG structure, (b) dependency structure, (c) combined (lexicalized) structure.]

- Unlexicalized PCFG structure (linguistically motivated non-terminals)
- Lexical dependency structures
- Lexicalized PCFGs

From the paper (Section 2, A Factored Model): Generative models for parsing typically model one of the kinds of structures shown in Figure 1. Figure 1a is a plain phrase-structure tree T, which primarily models syntactic units; Figure 1b is a dependency tree D, which primarily models word-to-word selectional affinities [5]; and Figure 1c is a lexicalized phrase-structure tree L, which carries both category and (part-of-speech-tagged) head word information at each node. A lexicalized tree can be viewed as the pair L = (T, D) of a phrase-structure tree T and a dependency tree D.
3 Factored models

- Constituent and dependency parsers annotate highly correlated information
- Very competitive approach: percolate heads in the constituent parse
  - Splitting non-terminals with lexical heads helps constituent parsing
  - Bi-lexical grammars come with high-complexity inference
- Klein & Manning (2002) break the probability model into two parts
  - Let T be a constituent tree and D a dependency graph. Then

      P(T, D) = P(T) P(D)    (1)

  - Probability mass is allocated to mismatched T, D (deficient)
- Inside/outside performed separately for T, D using O(n^3) algorithms
- Joint inference for T, D is an A* search, with heuristics from inside/outside
- Intuition: simpler models and inference by factoring
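The factored score in equation (1) is just a sum of log-probabilities, with joint inference restricted to compatible (T, D) pairs. A toy sketch (all tree names, probabilities, and the compatibility set are invented for illustration):

```python
import math

# Hypothetical scores: log P(T) over constituent trees and log P(D) over
# dependency graphs for the same sentence. Values are made up.
log_p_tree = {"T1": math.log(0.6), "T2": math.log(0.4)}
log_p_deps = {"D1": math.log(0.7), "D2": math.log(0.3)}

# Only compatible (T, D) pairs are valid joint analyses; mass the factored
# model puts on mismatched pairs is what makes it deficient.
compatible = {("T1", "D1"), ("T2", "D2")}

def joint_log_score(t, d):
    # Equation (1) in log space: log P(T, D) = log P(T) + log P(D)
    return log_p_tree[t] + log_p_deps[d]

best = max(compatible, key=lambda td: joint_log_score(*td))
```

In the actual model this maximization is an A* search over the joint space, guided by inside/outside heuristics rather than brute-force enumeration.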
4 Inference decomposition

- Similar motivation to the dual decomposition we'll look at later
  - Simplifying inference for complex models
  - Some guarantees on exact inference (finding the highest score)
  - Centrally relies on the models finding similar solutions
- Current modeling approaches are more principled
- Before moving on to present dual decomposition:
  - Refresh our memory on dependency parsing and MST approaches
  - Look again at higher-order models
5 Arc-factored models

[Figure: a weighted dependency graph over "ROOT the dog bit the postman"; candidate head-dependent arcs carry individual scores (e.g., 2, 30, 30, 40), since the arc-factored model scores each arc independently.]
6 MST algorithm

[Figure: the same weighted graph with arc scores (e.g., 2, 8, 30, 40); the maximum spanning tree over these arcs gives the highest-scoring dependency parse.]
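The first step of the Chu-Liu/Edmonds MST algorithm can be sketched as follows: every word greedily takes its highest-scoring incoming arc; if the result is acyclic it is already the maximum spanning tree, otherwise cycles are contracted and the step repeats (the contraction step is omitted here). The arc scores below are invented for illustration, not the values on the slide:

```python
# Made-up arc scores (head, dependent) -> score for "the dog bit the postman"
scores = {
    ("ROOT", "bit"): 40, ("ROOT", "the"): 2, ("ROOT", "dog"): 8,
    ("bit", "the"): 30, ("bit", "dog"): 30, ("dog", "the"): 11,
    ("the", "dog"): 5,
}

def best_heads(scores, words):
    """Greedy step: best incoming arc per word, ignoring tree-ness."""
    heads = {}
    for w in words:
        candidates = {h: s for (h, d), s in scores.items() if d == w}
        heads[w] = max(candidates, key=candidates.get)
    return heads

def has_cycle(heads):
    """Follow head pointers from each word; revisiting a node before ROOT
    means the greedy graph contains a cycle to be contracted."""
    for w in heads:
        seen = {w}
        h = heads[w]
        while h != "ROOT":
            if h in seen:
                return True
            seen.add(h)
            h = heads[h]
    return False

heads = best_heads(scores, ["the", "bit", "dog"])
# With these scores the greedy graph is acyclic, so it is the MST.
```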
7 Second-order models

- In constituent parsing, information in non-terminal labels can impact accuracy
  - E.g., parent annotation, siblings under Markov grammar factorization, etc.
  - Increases the parameter space in models for disambiguation
  - Typically at a cost in terms of grammar size and parsing efficiency
- May want to increase features in an edge-factored MST parser
  - First-order features know only about the nodes being linked by a dependency
  - Second-order features would also know about adjacent dependencies
- Unfortunately, exact inference with a second-order model is NP-hard
  - Proof by reduction from another NP-hard graph problem
- Common approach in NLP: approximate inference with a rich model is often preferable to exact inference with a weaker model
8 Adjacent sibling dependencies

- Collins (1997) bi-lexical parsing model used head-dependent parameters
  - This is a constituent model with rules from a head to its dependents
  - Rules were factored (binarized) from the head out, as discussed in lecture 3
- With such factored categories, forget siblings other than the previous k:

    P(d_{l,k} ... d_{l,1}, d_{r,1} ... d_{r,j} | h) =
        P(d_{l,1} | h) ∏_{i=2}^{k} P(d_{l,i} | h, d_{l,i-1})
      · P(d_{r,1} | h) ∏_{i=2}^{j} P(d_{r,i} | h, d_{r,i-1})

- This was a generative model, but the same idea works for feature configurations
  - Factor your dependencies into adjacent siblings; define features on those
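The head-outward factorization above can be computed as a running product, conditioning each dependent on the head and the previous sibling on the same side (with None marking the start of a side). A minimal sketch; the probability table `p` would come from a trained model and is invented here:

```python
import math

def head_outward_logprob(left, right, head, p):
    """Log-probability of a head's dependents under the adjacent-sibling
    factorization. `left` and `right` list dependents from the head
    outward; `p` maps (dep, head, prev_sibling) to a probability."""
    total = 0.0
    for side in (left, right):
        prev = None  # first child on each side conditions only on the head
        for dep in side:
            total += math.log(p[(dep, head, prev)])
            prev = dep
    return total
```

For example, with one left dependent and one right dependent the result is just log P(d_l1 | h) + log P(d_r1 | h).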
9 Second-order factorization (McDonald and Pereira, 2006)

From the paper: we write s(x_i, -, x_j) when x_j is the first left or first right dependent of word x_i. For example, s(2, -, 4) is the score of creating a dependency from hit to ball, since ball is the first child to the right of hit. More formally, if the word x_{i_0} has children x_{i_1}, ..., x_{i_j} to its left and x_{i_{j+1}}, ..., x_{i_m} to its right, the score factors as follows:

    ∑_{k=1}^{j-1} s(i_0, i_{k+1}, i_k) + s(i_0, -, i_j) + s(i_0, -, i_{j+1}) + ∑_{k=j+1}^{m-1} s(i_0, i_k, i_{k+1})

[Graph and projective parsing pseudocode (their Figure 4) from McDonald and Pereira (2006) omitted.]

- Partitioning dependencies to the left and right of the head subsumes the first-order factorization, since the score function s(i, k, j) could simply ignore the middle argument to simulate first-order scoring
- Features include the closest dependency on the same side that is closer to the head
- First-order features are also used in models for second-order parsing (get them for free)
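The score factorization above can be implemented as a single pass over each side of the head: the first child on each side pairs with a null sibling "-", and every later child pairs with its adjacent inner sibling. A sketch with a hypothetical score table `s` keyed (head, sibling, dependent):

```python
def second_order_score(head, left, right, s):
    """Score of a head's dependents under the second-order (adjacent
    sibling) factorization. `left` and `right` list dependents ordered
    from the head outward; missing entries score minus infinity."""
    total = 0.0
    for side in (left, right):
        prev = "-"  # null sibling marker for the first child on each side
        for dep in side:
            total += s.get((head, prev, dep), float("-inf"))
            prev = dep
    return total
```

Ignoring the middle (sibling) argument of `s` recovers exactly the first-order, arc-factored score, which is why this factorization subsumes it.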
10 Approximate approach

- Use Eisner's algorithm to build dependencies from the head out
  - The algorithm builds left and right dependencies separately
  - Has the information for sibling features accessible
  - Cubic-complexity algorithm; further, only projective dependencies
  - Exact inference for projective dependencies
- Greedy approximation:
  - Take the projective parse from Eisner as the starting point
  - For every word, try changing its head to other words
  - Choose the new (valid) head that increases the score the most
  - If none increase the score, then done; otherwise iterate
- Best projective parse is probably not far from the best overall parse
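The greedy approximation above can be sketched as a hill-climbing loop over single head changes. `tree_score` and `is_tree` are assumed helpers (the model's global score and a well-formedness check); both names are illustrative:

```python
def hill_climb(heads, words, tree_score, is_tree):
    """Greedy post-processing: starting from an initial parse (e.g., the
    best projective parse from Eisner's algorithm), repeatedly apply the
    single head change that most improves the global score, until no
    change helps. `heads` maps each word to its current head."""
    improved = True
    while improved:
        improved = False
        best_delta, best_change = 0.0, None
        for w in words:
            for h in ["ROOT"] + [x for x in words if x != w]:
                if h == heads[w]:
                    continue
                cand = dict(heads)
                cand[w] = h
                if not is_tree(cand):
                    continue  # skip changes that break tree-ness
                delta = tree_score(cand) - tree_score(heads)
                if delta > best_delta:
                    best_delta, best_change = delta, cand
        if best_change is not None:
            heads, improved = best_change, True
    return heads
```

Because each step re-attaches only one word, the search can reach non-projective parses even though the starting point is projective.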
11 Improvement over first-order models?

- Use second-order word/word and word/POS pairs, and POS/POS/POS triples as features (as well as all of the first-order features)
- Despite approximate inference, solid accuracy gains in three languages
  - English projective dependencies:
  - Czech non-projective dependencies:
  - Danish non-projective dependencies:
  - Also allow multiple parents in Danish: 85.6
- Unsurprising: major slowdown versus edge-factored (first-order) models (dominated by the projective part)
- Might a rescoring/post-editing process on standard first-order MST work?
12 Dual decomposition (intuitions)

- General method for breaking complex problems down into smaller problems
  - Also called Lagrangian relaxation
- For example, sibling dependencies in dependency graphs
  - MST with the full-blown model is NP-hard
  - But MST with an edge-factored (first-order) model is quadratic
  - Finding the best dependents for each head under a sibling model is also quadratic (using dynamic programming), but not guaranteed to be a tree
    - If it were a tree, it would be a solution
  - If MST and sibling dynamic programming found the same solution...
- Iterative strategy for solving the two problems and comparing their solutions
  - Each term has Lagrangian multipliers that are updated at each iteration
  - Comes with formal guarantees about finding the optimum
13 Dynamic programming: best dependents for each head

[Figure: a lattice for the head "bit" over "the dog bit the postman": nodes for candidate dependents between <s> and </s>, with arcs scored s(bit, sib, dep), e.g., s(bit, -, the), s(bit, the, postman), s(bit, dog, <s>).]

- O(n) nodes in each head's graph; O(n) incoming arcs; overall O(n^3)
  - Paper claims quadratic complexity, but that is per head position
  - Can collapse states into equivalence classes (head automata)
- Similar approach for grandparent models
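The per-head lattice can be solved with a small dynamic program whose state is the last dependent taken (or "-" for none yet). A sketch for one side of one head; the score table `s` is a hypothetical (head, sibling, dependent) map, and because each head is solved independently the combined output need not form a tree:

```python
def best_dependents_one_side(head, words, s):
    """Highest-scoring set of dependents for `head` on one side under the
    sibling model. `words` are candidate dependents ordered from the head
    outward. Returns (score, chosen dependents)."""
    # state: last dependent taken -> (best score, dependents so far)
    best = {"-": (0.0, [])}
    for w in words:
        new_best = dict(best)  # option 1: skip w, states carry over
        for prev, (score, deps) in best.items():
            # option 2: take w with sibling `prev`
            cand = score + s.get((head, prev, w), float("-inf"))
            if cand > new_best.get(w, (float("-inf"),))[0]:
                new_best[w] = (cand, deps + [w])
        best = new_best
    return max(best.values())
```

With O(n) states and O(n) words this is O(n^2) per head per side, matching the "quadratic per head position" note above.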
14 Some relevant notes

- Each best head-dependent configuration is calculated independently
  - Hence words can be dependents zero or more times (a tree requires exactly once)
- The three-word scores s(head, sib, dep) can include many features
  - Arc-factored features included without increasing complexity
  - Of course, POS tags, direction and distance are important
- Even if the best solution is not a tree, hopefully it is close to a tree
  - Dual decomposition works when the two solutions are close
  - Too many iterations are required otherwise
15 Lagrangian relaxation (following Koo et al., 2010)

- Let Y be the set of possible well-formed dependency trees, and Z the set of possible head-dependent relations (not necessarily a well-formed tree)
- Let g(y) be the score according to the edge-factored model for y ∈ Y
- Let f(z) be the score according to the sibling model for z ∈ Z
- Let z(i, j) be one if i is the head of j, zero otherwise (y(i, j) similarly defined)
- Let u(i, j) be the Lagrangian multiplier for head i and dependent j

    L* = max_{z ∈ Z, y ∈ Y, z = y} [ f(z) + g(y) ]

    L(u) = max_{z ∈ Z} [ f(z) − ∑_{i,j} u(i, j) z(i, j) ] + max_{y ∈ Y} [ g(y) + ∑_{i,j} u(i, j) y(i, j) ]

- L(u) is an upper bound on L*; so search (over all u) for the minimum L(u)
  - Known as the dual problem; the objective is convex but non-differentiable
16 Inference algorithm

- Iterative algorithm; parameterized maximum number of iterations
  - Initialize the Lagrangian multipliers (u scores for arcs) to 0
  - Perform both inference tasks independently
  - If z = y, then done
  - Reward u(i, j) for arcs (i, j) in z; penalize u(i, j) for arcs (i, j) in y
  - Iterate
- Some comments on its use
  - Remember: this is an inference algorithm, not a training algorithm
  - Expensive, due to iteration
  - Works when the independent z and y solutions are close
  - Some potential speedups, e.g., caching previous solutions
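The loop on this slide can be sketched as a subgradient method in the spirit of Koo et al. (2010). `solve_tree` and `solve_sibling` are assumed black boxes: each takes a dict of per-arc score adjustments and returns a head assignment {word: head}; the fixed step size is a simplification (in practice the step is usually decayed):

```python
def dual_decompose(words, solve_tree, solve_sibling, max_iters=50, step=1.0):
    """Dual-decomposition inference sketch. If the two sub-solvers agree,
    the agreed solution comes with a certificate of optimality; otherwise
    the multipliers u are updated to push the solutions together."""
    u = {}  # Lagrangian multiplier per (head, dependent) arc; 0 by default
    heads = None
    for _ in range(max_iters):
        y = solve_tree(u)                                 # maximizes g(y) + u·y
        z = solve_sibling({a: -v for a, v in u.items()})  # maximizes f(z) − u·z
        if y == z:
            return y, True  # z = y: done, provably optimal
        for w in words:     # update multipliers where the solutions differ
            if y[w] != z[w]:
                u[(z[w], w)] = u.get((z[w], w), 0.0) + step  # reward arcs in z
                u[(y[w], w)] = u.get((y[w], w), 0.0) - step  # penalize arcs in y
        heads = y
    return heads, False  # iteration budget exhausted, no certificate
```

Each iteration only re-runs the two cheap sub-solvers with adjusted arc scores, which is the whole point: the hard joint problem is never solved directly.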
17 Example: run an iteration, get solutions

- MST arc-factored solution: ROOT the aged bottle flies fast
- Sibling-model DP solution: ROOT the aged bottle flies fast
- Now what?
18 How well does it work? Koo et al., 2010 dependency parsing results, UAS

[Table 1 of Koo et al. (2010), garbled in transcription: a comparison of non-projective automaton-based parsers with results from previous work across Danish, Dutch, Portuguese, Slovene, Swedish, Turkish, Czech (PDT) and English (PTB), reporting UAS, certificate rates, and parse/train times. MST: their first-order baseline. Sib/G+S: non-projective head automata with sibling or grandparent/sibling interactions, decoded via dual decomposition. Ma09: the best UAS of the LP/ILP-based parsers introduced in Martins et al. (2009). Sm08: the best UAS of any LBP-based parser in Smith and Eisner (2008). Mc06: the best UAS reported by McDonald and Pereira (2006). Best: for the CoNLL-X languages only, the best UAS of any parser in the original shared task.]
19 How many iterations does it take to converge? Koo et al., 2010 dependency parsing results

[Figure 4 of Koo et al. (2010): the behavior of the dual-decomposition parser with sibling automata as the maximum number of iterations K is varied, plotting percentage validation UAS, percentage of certificates, and percentage of exact matches.]
20 Final notes on dual decomposition

- General method for inference, applicable to many problems
  - Every year it is being applied to more and more problems
  - Becoming part of the standard NLP toolbox
- Comes with great exact-inference guarantees
  - Important and useful, but not required for quality inference
  - Accuracy plateaus at a maximum of around 50 iterations, with approximately 75% of solutions coming with a guarantee of optimality
  - As is often the case in NLP, approximate inference is faster and just as accurate
- Expect new variants, potentially with new approximate heuristics
21 Lexicalized grammars

- Incorporation of lexical items into grammars influenced by lexicalized grammar formalisms
  - Those now known as mildly context-sensitive, e.g., TAGs
  - Also Lexical-Functional Grammar (LFG), a unification grammar
- Linguistic insight that words impact syntax, e.g., subcategorization
- Common approaches to mildly context-sensitive grammars build this into the formalism: lexical anchors in TAG; word tags in CCG
  - Leads to some important connections with dependencies
  - Also leads to spurious derivation ambiguities
- Worthwhile to think of derived versus derivation structure
22 Tree-Adjoining grammars

A Tree-Adjoining Grammar (TAG) G = (V, T, S, I, A):
- a set of non-terminal variables V
- a set of terminals T
- a special start symbol S ∈ V
- a set of initial trees I
  - Non-terminals on the frontier marked for substitution
- a set of auxiliary trees A
  - One non-terminal on the frontier marked as the foot node
  - Otherwise like initial trees
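The five-tuple above can be written down as a minimal data structure. The `Tree` representation (a label, children, and markers for substitution and foot nodes) is an illustrative assumption, not a standard encoding:

```python
from dataclasses import dataclass, field

@dataclass
class Tree:
    label: str
    children: list = field(default_factory=list)
    subst: bool = False   # frontier non-terminal marked for substitution
    foot: bool = False    # the single foot node of an auxiliary tree

@dataclass
class TAG:
    nonterminals: set   # V
    terminals: set      # T
    start: str          # S, must be a member of V
    initial: list       # I: initial trees
    auxiliary: list     # A: auxiliary trees (foot category == root category)

def foot_node(tree):
    """Return the foot node of an auxiliary tree (None for initial trees)."""
    if tree.foot:
        return tree
    for c in tree.children:
        f = foot_node(c)
        if f is not None:
            return f
    return None
```

A well-formedness check for an auxiliary tree would assert `foot_node(t).label == t.label`, mirroring the root/foot category constraint on the next slide.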
23 Elementary trees

- Initial trees (I) and auxiliary trees (A) together make up the set of elementary trees
  - In contrast to derived trees
- Elementary trees are of type X, where X is the root category
  - The foot node in an auxiliary tree must be of the same category as the root
- Lexicalized TAG (LTAG) requires at least one terminal item (the anchor) on every elementary tree
- Two operations defined on trees: substitution and adjunction
24 Initial tree

- Rooted at a single node (X)
- Yield of the tree can consist of terminals and non-terminals
  - Non-terminals are substitution nodes
- Trees with root category Y can substitute at a substitution node with category Y

[Figure: a schematic initial tree rooted in X, with terminal items and substitution nodes on its frontier.]
25 Auxiliary tree

- Rooted at a single node (X)
- Yield of the tree can consist of terminals and non-terminals
- One frontier non-terminal is the foot node, denoted with * (X*)
  - The foot node category must match the root category

[Figure: a schematic auxiliary tree rooted in X, with terminal items, substitution nodes, and the foot node X* on its frontier.]
26 Substitution and Adjunction

- Substitution
  - Replace a substitution node in the yield of tree T1 with a tree T2 rooted in the same category
- Adjunction
  - Detach the sub-tree rooted with category X from T1
  - Attach that subtree at the foot node of an auxiliary tree T2
  - Re-attach the root of T2 at the site of the original sub-tree
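The two operations can be sketched on a minimal tree encoding: (label, children) tuples, with "X!" marking a substitution node and "X*" a foot node. The encoding is an illustrative assumption, and for simplicity each operation applies at every matching node (a real implementation would target one addressed node):

```python
def substitute(tree, t2):
    """Replace substitution nodes of t2's root category with t2."""
    label, children = tree
    if label == t2[0] + "!":
        return t2  # drop the initial tree into the substitution slot
    return (label, [substitute(c, t2) for c in children])

def adjoin(tree, aux, site):
    """Adjoin auxiliary tree `aux` at nodes labeled `site`: the subtree
    rooted there detaches, aux takes its place, and the detached subtree
    re-attaches at aux's foot node (labeled site + '*')."""
    label, children = tree
    if label == site:
        detached = (label, children)
        return _plug_foot(aux, detached)
    return (label, [adjoin(c, aux, site) for c in children])

def _plug_foot(aux, subtree):
    """Re-attach the detached subtree at the foot node of `aux`."""
    label, children = aux
    if label == subtree[0] + "*":
        return subtree
    return (label, [_plug_foot(c, subtree) for c in children])
```

For example, adjoining the auxiliary tree (VP (ADV often) VP*) at the VP node of (S (NP John) (VP sleeps)) yields "John often sleeps" with the original VP re-attached under the foot.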
27 Substitution

[Figure: schematic of substitution: a tree rooted in Y replaces a Y substitution node on the frontier of a tree rooted in X, yielding a combined tree rooted in X.]
28 Adjunction

[Figure: schematic of adjunction in three steps: (a) a subtree rooted in X is detached from the original tree; (b) an auxiliary tree with root X and foot X* is attached at that site; (c) the detached subtree re-attaches at the foot node.]
29 Notes on TAG

- Adjunction makes this context-sensitive
  - With substitution alone it is context-free equivalent
  - Hence an O(n^3) parsing algorithm with no adjunction
  - Parsing with standard TAG is O(n^6)
- Instead of many rules, just two rules of combination
- Increased role of the lexicon to dictate possible structures
  - Information encoded in rules in a CFG is now encoded in the lexicon
30 Elementary trees (slide taken from Joshi & Schabes, 1997) 30
31 Why mention this formalism in this class?

- Putting trees together to build a parse, not rules
- Trees have lexical anchors (heads)
  - When a tree substitutes or adjoins, it attaches to the head
- The steps taken in putting them together form a derivation
  - The derivation is a tree of moves, not a sequence of moves
  - In other words, it is a dependency tree
- Hence we get both a phrase-structure tree and a dependency structure out of a TAG derivation
32 Derived tree (slide taken from Joshi & Schabes, 1997) 32
33 Derivation tree (slide taken from Joshi & Schabes, 1997) 33
34 Summary

- Covered dual decomposition for dependency parsing
  - Very useful approach for many NLP tasks; something of a hot topic
- Noted links between dependency trees and TAGs
  - Similar links with other approaches, e.g., CCG
- Next lecture will wrap up with some miscellaneous topics
  - Approximate inference techniques for dependency parsing
  - Using dependency parsing for machine translation
  - Using dependency parsing for language modeling
  - (All three related to some term projects in the works)
Outline 1 Definition Computer Science 331 Red-Black rees Mike Jacobson Department of Computer Science University of Calgary Lectures #20-22 2 Height-Balance 3 Searches 4 Rotations 5 s: Main Case 6 Partial
More informationSome Interdefinability Results for Syntactic Constraint Classes
Some Interdefinability Results for Syntactic Constraint Classes Thomas Graf tgraf@ucla.edu tgraf.bol.ucla.edu University of California, Los Angeles Mathematics of Language 11 Bielefeld, Germany 1 The Linguistic
More informationAdmin PARSING. Backoff models: absolute discounting. Backoff models: absolute discounting 3/4/11. What is (xy)?
Admin Updated slides/examples on backoff with absolute discounting (I ll review them again here today) Assignment 2 Watson vs. Humans (tonight-wednesday) PARING David Kauchak C159 pring 2011 some slides
More informationCSE 417 Branch & Bound (pt 4) Branch & Bound
CSE 417 Branch & Bound (pt 4) Branch & Bound Reminders > HW8 due today > HW9 will be posted tomorrow start early program will be slow, so debugging will be slow... Review of previous lectures > Complexity
More information4 Basics of Trees. Petr Hliněný, FI MU Brno 1 FI: MA010: Trees and Forests
4 Basics of Trees Trees, actually acyclic connected simple graphs, are among the simplest graph classes. Despite their simplicity, they still have rich structure and many useful application, such as in
More informationAdvanced PCFG algorithms
Advanced PCFG algorithms! BM1 Advanced atural Language Processing Alexander Koller! 9 December 2014 Today Agenda-based parsing with parsing schemata: generalized perspective on parsing algorithms. Semiring
More information1. Why Study Trees? Trees and Graphs. 2. Binary Trees. CITS2200 Data Structures and Algorithms. Wood... Topic 10. Trees are ubiquitous. Examples...
. Why Study Trees? CITS00 Data Structures and Algorithms Topic 0 Trees and Graphs Trees and Graphs Binary trees definitions: size, height, levels, skinny, complete Trees, forests and orchards Wood... Examples...
More informationContext-Free Grammars. Carl Pollard Ohio State University. Linguistics 680 Formal Foundations Tuesday, November 10, 2009
Context-Free Grammars Carl Pollard Ohio State University Linguistics 680 Formal Foundations Tuesday, November 10, 2009 These slides are available at: http://www.ling.osu.edu/ scott/680 1 (1) Context-Free
More informationElements of Language Processing and Learning lecture 3
Elements of Language Processing and Learning lecture 3 Ivan Titov TA: Milos Stanojevic Institute for Logic, Language and Computation Today Parsing algorithms for CFGs Recap, Chomsky Normal Form (CNF) A
More informationPhrase Structure Parsing. Statistical NLP Spring Conflicting Tests. Constituency Tests. Non-Local Phenomena. Regularity of Rules
tatistical NLP pring 2007 Lecture 14: Parsing I Dan Klein UC Berkeley Phrase tructure Parsing Phrase structure parsing organizes syntax into constituents or brackets In general, this involves nested trees
More informationQuery Evaluation Strategies
Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa
More informationA Simple Syntax-Directed Translator
Chapter 2 A Simple Syntax-Directed Translator 1-1 Introduction The analysis phase of a compiler breaks up a source program into constituent pieces and produces an internal representation for it, called
More informationDesign and Analysis of Algorithms
CSE 101, Winter 018 D/Q Greed SP s DP LP, Flow B&B, Backtrack Metaheuristics P, NP Design and Analysis of Algorithms Lecture 8: Greed Class URL: http://vlsicad.ucsd.edu/courses/cse101-w18/ Optimization
More informationCS 4100 // artificial intelligence
CS 4100 // artificial intelligence instructor: byron wallace Constraint Satisfaction Problems Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials
More informationCOMP 181 Compilers. Administrative. Last time. Prelude. Compilation strategy. Translation strategy. Lecture 2 Overview
COMP 181 Compilers Lecture 2 Overview September 7, 2006 Administrative Book? Hopefully: Compilers by Aho, Lam, Sethi, Ullman Mailing list Handouts? Programming assignments For next time, write a hello,
More information8.1. Optimal Binary Search Trees:
DATA STRUCTERS WITH C 10CS35 UNIT 8 : EFFICIENT BINARY SEARCH TREES 8.1. Optimal Binary Search Trees: An optimal binary search tree is a binary search tree for which the nodes are arranged on levels such
More informationCollins and Eisner s algorithms
Collins and Eisner s algorithms Syntactic analysis (5LN455) 2015-12-14 Sara Stymne Department of Linguistics and Philology Based on slides from Marco Kuhlmann Recap: Dependency trees dobj subj det pmod
More informationProgramming, numerics and optimization
Programming, numerics and optimization Lecture C-4: Constrained optimization Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428 June
More informationAlgorithms for Integer Programming
Algorithms for Integer Programming Laura Galli November 9, 2016 Unlike linear programming problems, integer programming problems are very difficult to solve. In fact, no efficient general algorithm is
More informationIterative CKY parsing for Probabilistic Context-Free Grammars
Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST
More informationFormal Languages and Compilers Lecture V: Parse Trees and Ambiguous Gr
Formal Languages and Compilers Lecture V: Parse Trees and Ambiguous Grammars Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/
More informationNotes slides from before lecture. CSE 21, Winter 2017, Section A00. Lecture 10 Notes. Class URL:
Notes slides from before lecture CSE 21, Winter 2017, Section A00 Lecture 10 Notes Class URL: http://vlsicad.ucsd.edu/courses/cse21-w17/ Notes slides from before lecture Notes February 13 (1) HW5 is due
More informationParsing Part II. (Ambiguity, Top-down parsing, Left-recursion Removal)
Parsing Part II (Ambiguity, Top-down parsing, Left-recursion Removal) Ambiguous Grammars Definitions If a grammar has more than one leftmost derivation for a single sentential form, the grammar is ambiguous
More informationLing 571: Deep Processing for Natural Language Processing
Ling 571: Deep Processing for Natural Language Processing Julie Medero February 4, 2013 Today s Plan Assignment Check-in Project 1 Wrap-up CKY Implementations HW2 FAQs: evalb Features & Unification Project
More informationMotivation. CS389L: Automated Logical Reasoning. Lecture 5: Binary Decision Diagrams. Historical Context. Binary Decision Trees
Motivation CS389L: Automated Logical Reasoning Lecture 5: Binary Decision Diagrams Işıl Dillig Previous lectures: How to determine satisfiability of propositional formulas Sometimes need to efficiently
More informationSyntax Analysis. Chapter 4
Syntax Analysis Chapter 4 Check (Important) http://www.engineersgarage.com/contributio n/difference-between-compiler-andinterpreter Introduction covers the major parsing methods that are typically used
More informationWhat is Parsing? NP Det N. Det. NP Papa N caviar NP NP PP S NP VP. N spoon VP V NP VP VP PP. V spoon V ate PP P NP. P with.
Parsing What is Parsing? S NP VP NP Det N S NP Papa N caviar NP NP PP N spoon VP V NP VP VP PP NP VP V spoon V ate PP P NP P with VP PP Det the Det a V NP P NP Det N Det N Papa ate the caviar with a spoon
More informationTask Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document
More informationA structure-sharing parser for lexicalized grammars
A structure-sharing parser for lexicalized grammars Roger Evans Information Technology Research Institute University of Brighton Brighton, BN2 4G J, UK Roger. Evans @it ri. bright on. ac. uk David Weir
More informationA Logical Approach to Structure Sharing in TAGs
Workshop TAG+5, Paris, 25-27 May 2000 171 A Logical Approach to Structure Sharing in TAGs Adi Palm Department of General Linguistics University of Passau D-94030 Passau Abstract Tree adjoining grammars
More informationText Compression through Huffman Coding. Terminology
Text Compression through Huffman Coding Huffman codes represent a very effective technique for compressing data; they usually produce savings between 20% 90% Preliminary example We are given a 100,000-character
More informationChapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.
Topics Chapter 4 Lexical and Syntax Analysis Introduction Lexical Analysis Syntax Analysis Recursive -Descent Parsing Bottom-Up parsing 2 Language Implementation Compilation There are three possible approaches
More informationLecturers: Sanjam Garg and Prasad Raghavendra March 20, Midterm 2 Solutions
U.C. Berkeley CS70 : Algorithms Midterm 2 Solutions Lecturers: Sanjam Garg and Prasad aghavra March 20, 207 Midterm 2 Solutions. (0 points) True/False Clearly put your answers in the answer box in front
More information1. Generalized CKY Parsing Treebank empties and unaries. Seven Lectures on Statistical Parsing. Unary rules: Efficient CKY parsing
even Lectures on tatistical Parsing 1. Generalized CKY Parsing Treebank empties and unaries -HLN -UBJ Christopher Manning LA Linguistic Institute 27 LA 354 Lecture 3 -NONEε Atone PTB Tree -NONEε Atone
More information