The CKY Parsing Algorithm and PCFGs COMP-550 Oct 12, 2017

Announcements I'm out of town next week: Tuesday lecture: Lexical semantics, by TA Jad Kabbara Thursday lecture: Guest lecture by Prof. Timothy O'Donnell (Linguistics) Corollary: no office hours on Tuesday TA office hours about A2 will be announced 2

Outline CYK parsing PCFGs Probabilistic CYK parsing Markovization 3

Parsing Input: sentence + grammar; output: parse tree Parsing into a CFG: constituent parsing Parsing into a dependency representation: dependency parsing Difficulty: need an efficient way to search through plausible parse trees for the input sentence 4

Parsing into a CFG Given: 1. CFG 2. A sentence made up of words that are in the terminal vocabulary of the CFG Task: Recover all possible parses of the sentence. Why all possible parses? 5

Syntactic Ambiguity I shot the elephant in my pyjamas. [Two parse trees: in one, the PP "in my pyjamas" attaches to the VP headed by "shot"; in the other, it attaches to the NP "the elephant".] 6

Types of Parsing Algorithms Top-down Start at the top of the tree, and expand downwards by using rewrite rules of the CFG to match the tokens in the input string e.g., Earley parser Bottom-up Start from the input words, and build ever-bigger subtrees, until a tree that spans the whole sentence is found e.g., CYK algorithm, shift-reduce parser Key to efficiency is to have an efficient search strategy that avoids redundant computation 7

CYK Algorithm (Cocke-Younger-Kasami algorithm, also known as the CKY algorithm) A dynamic programming algorithm: partial solutions are stored and efficiently reused to find all possible parses for the entire sentence. Steps: 1. Convert CFG to an appropriate form 2. Set up a table of possible constituents 3. Fill in table 4. Read table to recover all possible parses 8

CFGs and Constituent Trees Rules/productions: S → NP VP; NP → this; VP → V; V → is | rules | jumps | rocks Trees: [two example trees, for "this rules" and "this rocks"; S, NP, VP, V are non-terminals, the words are terminals] 9

CYK Algorithm (Cocke-Younger-Kasami algorithm, also known as the CKY algorithm) A dynamic programming algorithm: partial solutions are stored and efficiently reused to find all possible parses for the entire sentence. Steps: 1. Convert CFG to an appropriate form 2. Set up a table of possible constituents 3. Fill in table 4. Read table to recover all possible parses 10

Chomsky Normal Form To make things easier later, need all productions to be in one of these forms: 1. A → B C, where A, B, C are non-terminals 2. A → s, where A is a non-terminal and s is a terminal This is actually not a big problem. 11

Converting to CNF (1) Rule of type A → B C D Rewrite into: A → X1 D and X1 → B C Rule of type A → s B Rewrite into: A → X2 B and X2 → s Rule of type A → B Everywhere in which we see B on the LHS, replace it with A 12
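To make the rewrites concrete, here is a minimal Python sketch of the first two conversion types (long rules, and rules mixing terminals with non-terminals). The rule representation, the lowercase test for terminals, and the X1, X2 naming are assumptions for illustration; the unary case (copying rules across LHS symbols) is not handled here.

```python
# A minimal sketch of the first two CNF rewrites. Rules are (LHS, RHS-tuple)
# pairs; X1, X2, ... are fresh non-terminals introduced by the conversion.
def to_cnf(rules):
    new_rules = []
    counter = [0]

    def fresh():
        counter[0] += 1
        return "X%d" % counter[0]

    for lhs, rhs in rules:
        rhs = list(rhs)
        # Replace terminals inside long RHSs with fresh preterminals (A -> s B).
        if len(rhs) > 1:
            for i, sym in enumerate(rhs):
                if sym.islower():          # crude terminal test (assumption)
                    x = fresh()
                    new_rules.append((x, (sym,)))
                    rhs[i] = x
        # Binarize long RHSs (A -> B C D  becomes  A -> X1 D, X1 -> B C).
        while len(rhs) > 2:
            x = fresh()
            new_rules.append((x, tuple(rhs[:2])))
            rhs = [x] + rhs[2:]
        new_rules.append((lhs, tuple(rhs)))
    return new_rules

print(to_cnf([("VP", ("V", "NP", "PP")), ("PP", ("in", "NP"))]))
# [('X1', ('V', 'NP')), ('VP', ('X1', 'PP')), ('X2', ('in',)), ('PP', ('X2', 'NP'))]
```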

Examples of Conversion Let's convert the following grammar fragment into CNF: S → NP VP; N → I | elephant | pyjamas; VP → V NP PP; V → shot; VP → V NP; Det → my | the; NP → N; NP → Det N; NP → Det N PP; PP → in NP 13

Next: Set Up a Table This table will store all of the constituents that can be built from contiguous spans within the sentence. Let the sentence have N words: w[0], w[1], ..., w[N-1] Create a table such that a cell in row i, column j corresponds to the span w[i:j+1], zero-indexed. Since i < j, we really just need half the table. The entry at each cell is a list of non-terminals that can span those words according to the grammar. 14

Parse Table I 0 shot 1 the 2 elephant 3 in 4 my 5 pyjamas 6 [Empty chart with one cell per span, [0:1] through [6:7]; only spans [i:j] with i < j are needed.] Grammar (in CNF): S → NP VP; VP → X1 PP; X1 → V NP; VP → V NP; NP → Det N; NP → X2 PP; X2 → Det N; PP → P NP; P → in; NP → I | elephant | pyjamas; N → I | elephant | pyjamas; V → shot; Det → my | the 15

Filling in Table: Base Case One word (e.g., cell [0:1]): easy, add all the lexical rules that can generate that word 16

Base Case Examples (First 3 Words) I 0 shot 1 the 2 elephant 3 in 4 my 5 pyjamas 6 [Chart with the first three base-case cells filled: [0:1] = NP, N ("I"); [1:2] = V ("shot"); [2:3] = Det ("the"); grammar as on the previous slide.] 17

Filling in Table: Recursive Step Cell corresponding to multiple words, e.g., cell for span [0:3] I shot the Key idea: all rules that produce phrases are of the form A → B C So, check all the possible break points m in between the start i and the end j, and see if we can build a constituent with a rule of the form A[i:j] → B[i:m] C[m:j] 18

Recursive Step Example 1 I 0 shot 1 the 2 elephant 3 in 4 my 5 pyjamas 6 [Chart with a "?" in a top-row cell (a span starting at "I"): fill it by checking, for each break point m, whether some rule A → B C combines a constituent from the left sub-span with one from the right sub-span. So far [0:1] contains NP, N and [1:2] contains V.] 19

Recursive Step Example 2 I 0 shot 1 the 2 elephant 3 in 4 my 5 pyjamas 6 [Chart with a "?" in cell [2:4] ("the elephant"): combine Det in [2:3] with N in [3:4]; the rules NP → Det N and X2 → Det N apply, so NP and X2 can be added to [2:4].] 20

Backpointers I 0 shot 1 the 2 elephant 3 in 4 my 5 pyjamas 6 [Same chart as before; each entry in a cell also records which cells and rule it was built from.] Store where you came from! 21

Putting It Together I 0 shot 1 the 2 elephant 3 in 4 my 5 pyjamas 6 [Chart in which cell [2:4] now contains NP with backpointer (Det 2:3, N 3:4).] Fill the table in the correct order! 22
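Putting the base case, the recursive step, and the fill order together, here is a minimal CKY recognition sketch in Python. The data layout (lexical and binary rule dictionaries, a dict of span cells holding (non-terminal, backpointer) pairs) is an assumption for illustration, not the course's reference implementation.

```python
from collections import defaultdict

# Binary rules are stored as {(B, C): {A, ...}}, lexical rules as {word: {A, ...}}.
def cky(words, lexical, binary):
    n = len(words)
    table = defaultdict(set)        # table[(i, j)] = set of (A, backpointer)

    # Base case: spans of length 1 get every preterminal for that word.
    for i, w in enumerate(words):
        for A in lexical.get(w, ()):
            table[(i, i + 1)].add((A, w))

    # Recursive step: build longer spans from pairs of shorter ones,
    # filling shorter spans before longer ones.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for m in range(i + 1, j):               # break point
                for B, _ in table[(i, m)]:
                    for C, _ in table[(m, j)]:
                        for A in binary.get((B, C), ()):
                            table[(i, j)].add((A, ((B, i, m), (C, m, j))))
    return table

lexical = {"I": {"NP", "N"}, "shot": {"V"}, "the": {"Det"}, "elephant": {"NP", "N"}}
binary = {("Det", "N"): {"NP", "X2"}, ("V", "NP"): {"VP", "X1"}, ("NP", "VP"): {"S"}}
chart = cky("I shot the elephant".split(), lexical, binary)
print(sorted(A for A, _ in chart[(0, 4)]))   # ['S'] if a full parse exists
```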

Finish the Example Let's finish the example together for practice How do we reconstruct the parse trees from the table? 23

Dealing with Syntactic Ambiguity In practice, one of these is more likely than the other: [the two parse trees from before: the PP "in my pyjamas" attached to the VP vs. to the NP "the elephant"] How to distinguish them? 24

Probabilistic CFGs Associate each rule with a probability: e.g., NP → NP PP 0.2; NP → Det N 0.4; NP → I 0.1; V → shot 0.005 Probability of a parse tree for a sentence is the product of the probabilities of the rules in the tree. 25

Formally Speaking For each non-terminal A ∈ N: Σ_{α → β ∈ R s.t. α = A} Pr(α → β) = 1 i.e., rules for each LHS form a probability distribution If a tree t contains rules α_1 → β_1, α_2 → β_2, ..., then Pr(t) = Π_i Pr(α_i → β_i) Tree probability is product of rule probabilities 26
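A tiny sketch of the product formula, with made-up rule probabilities:

```python
# Tree probability as a product of rule probabilities (illustrative numbers).
rule_prob = {("S", ("NP", "VP")): 1.0,
             ("NP", ("I",)): 0.1,
             ("VP", ("V", "NP")): 0.3,
             ("V", ("shot",)): 0.005,
             ("NP", ("Det", "N")): 0.4,
             ("Det", ("the",)): 0.5,
             ("N", ("elephant",)): 0.05}

def tree_prob(rules_used):
    p = 1.0
    for r in rules_used:
        p *= rule_prob[r]
    return p

# Rules appearing in a tree for "I shot the elephant":
used = [("S", ("NP", "VP")), ("NP", ("I",)), ("VP", ("V", "NP")),
        ("V", ("shot",)), ("NP", ("Det", "N")), ("Det", ("the",)), ("N", ("elephant",))]
print(tree_prob(used))
```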

Probabilistic Parsing Goal: recover the best parse for a sentence, along with its probability For a sentence sent, let τ(sent) be the set of possible parses for it; we want to find argmax_{t ∈ τ(sent)} Pr(t) Idea: extend CYK to keep track of probabilities in table 27

Extending CYK to PCFGs Previously, cell entries are non-terminals (+ backpointer) e.g., table[2:4] = {(NP, (Det[2:3], N[3:4]))} table[3:4] = {(NP, ...), (N, ...)} Now, cell entries include the (best) probability of generating the constituent with that non-terminal e.g., table[2:4] = {(NP, (Det[2:3], N[3:4]), 0.215)} table[3:4] = {(NP, ..., 0.022), (N, ..., 0.04)} Equivalently, write as a 3-dimensional array: table[2, 4, NP] = 0.215 with backpointer (Det[2:3], N[3:4]); table[3, 4, NP] = 0.022; table[3, 4, N] = 0.04 28

New Recursive Step Filling in the dynamic programming table proceeds almost as before. During the recursive step, compute the probability of new constituents to be constructed: Pr(A[i:j] → B[i:m] C[m:j]) = Pr(A → B C) * table[i,m,B] * table[m,j,C] where Pr(A → B C) comes from the PCFG, and table[i,m,B], table[m,j,C] come from previously filled cells. There could be multiple rules that form constituent A for span [i:j]. Take max: table[i,j,A] = max over rules A → B C and break points m of Pr(A[i:j] → B[i:m] C[m:j]) 29
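A sketch of the probabilistic (Viterbi) CKY update described above. The dictionaries lexical and binary, and the prob/back table layout, are assumptions for illustration, not the course's reference code.

```python
from collections import defaultdict

# lexical[word]  = {A: Pr(A -> word)}
# binary[(B, C)] = {A: Pr(A -> B C)}
def viterbi_cky(words, lexical, binary):
    n = len(words)
    prob = defaultdict(float)   # prob[(i, j, A)] = best probability found so far
    back = {}                   # back[(i, j, A)] = (m, B, C) backpointer

    # Base case: single words.
    for i, w in enumerate(words):
        for A, p in lexical.get(w, {}).items():
            prob[(i, i + 1, A)] = p

    # Recursive step: take the max over rules and break points.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for m in range(i + 1, j):
                for (B, C), parents in binary.items():
                    pB, pC = prob[(i, m, B)], prob[(m, j, C)]
                    if pB == 0.0 or pC == 0.0:
                        continue
                    for A, p_rule in parents.items():
                        p = p_rule * pB * pC
                        if p > prob[(i, j, A)]:
                            prob[(i, j, A)] = p
                            back[(i, j, A)] = (m, B, C)
    return prob, back
```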

Example I 0 shot 1 the 2 elephant 3 in 4 my 5 pyjamas 6 [Chart with probabilities: [0:1] = {NP 0.25, N 0.625}; [1:2] = {V 1.0}; [2:3] = {Det 0.6}; [3:4] = {NP 0.1, N 0.25}; cell [2:4] is being filled with NP. New value: 0.6 * 0.25 * Pr(NP → Det N)] 30

Bottom-Up vs. Top-Down CYK algorithm is bottom-up Starting from words, build little pieces, then big pieces Alternative: top-down parsing Starting from the start symbol, expand non-terminal symbols according to rules in the grammar. Doing this efficiently can also get us all the parses of a sentence (Earley algorithm) 31

How to Train a PCFG? Derive from a treebank, such as WSJ. Simplest version: each LHS corresponds to a categorical distribution; outcomes of the distribution are the RHSs MLE estimates: Pr(α → β) = #(α → β) / #(α) Can smooth these estimates in various ways, some of which we've discussed 32
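A small sketch of the MLE estimate, assuming the treebank has already been reduced to a list of observed (LHS, RHS) rule instances (the toy list below is made up):

```python
from collections import Counter, defaultdict

# MLE rule probabilities: Pr(alpha -> beta) = #(alpha -> beta) / #(alpha).
observed_rules = [("S", ("NP", "VP")), ("S", ("NP", "VP")),
                  ("NP", ("PRP",)), ("NP", ("Det", "NN")),
                  ("VP", ("VBD", "NP"))]

rule_counts = Counter(observed_rules)                 # #(alpha -> beta)
lhs_counts = Counter(lhs for lhs, _ in observed_rules)  # #(alpha)

pcfg = defaultdict(dict)
for (lhs, rhs), c in rule_counts.items():
    pcfg[lhs][rhs] = c / lhs_counts[lhs]

print(pcfg["NP"])   # {('PRP',): 0.5, ('Det', 'NN'): 0.5}
```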

Vanilla PCFGs Estimate of rule probabilities: MLE estimates: Pr(α → β) = #(α → β) / #(α) e.g., Pr(S -> NP VP) = #(S -> NP VP) / #(S) Recall: these distributions are normalized by LHS symbol Even with smoothing, doesn't work very well: Not enough context Rules are too sparse 33

Subject vs Object NPs NPs in subject and object positions are not identically distributed: Obvious cases: pronouns (I vs me) But both appear as NP -> PRP, PRP -> I/me Less obvious: certain classes of nouns are more likely to appear in subject than object position, and vice versa. For example, subjects tend to be animate (usually humans, animals, other moving objects) Many other cases of obvious dependencies between distant parts of the syntactic tree. 34

Sparsity Consider subcategorization of verbs, with modifiers: ate: VP -> VBD; ate quickly: VP -> VBD AdvP; ate with a fork: VP -> VBD PP; ate a sandwich: VP -> VBD NP; ate a sandwich quickly: VP -> VBD NP AdvP; ate a sandwich with a fork: VP -> VBD NP PP; quickly ate a sandwich with a fork: VP -> AdvP VBD NP PP We should be able to factorize the probabilities: of having an adverbial modifier, of having a PP modifier, etc. 35

Wrong Independence Assumptions Vanilla PCFGs make independence assumptions that are too strong AND too weak. Too strong: vertically, up and down the syntax tree (e.g., NP -> PRP, PRP -> I) Too weak: horizontally, across the RHS of a production (e.g., VP -> AdvP VBD NP PP?) 36

Adding Context Add more context vertically to the PCFG Annotate with the parent category Before: NP -> PRP, NP -> Det NN, etc. Now: Subjects: NP^S -> PRP, NP^S -> Det NN, etc. Objects: NP^VP -> PRP, NP^VP -> Det NN, etc. Learn the probabilities of the rules separately (though they may influence each other through interpolation/smoothing) 37
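A sketch of first-order parent annotation on a tree represented as (label, children) tuples; the representation, and the fact that preterminals are also annotated here, are simplifying assumptions for illustration:

```python
# Parent annotation (first-order vertical markovization).
# Trees are (label, children) tuples; leaves are plain strings.
def annotate_parent(tree, parent=None):
    label, children = tree
    new_label = label if parent is None else "%s^%s" % (label, parent)
    new_children = [c if isinstance(c, str) else annotate_parent(c, label)
                    for c in children]
    return (new_label, new_children)

t = ("S", [("NP", [("PRP", ["I"])]),
           ("VP", [("VBD", ["shot"]),
                   ("NP", [("DT", ["the"]), ("NN", ["elephant"])])])])
print(annotate_parent(t))
# NP under S becomes NP^S, NP under VP becomes NP^VP, etc.
```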

Example Let's help Pierre Vinken find his ancestors. ( (S (NP (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP (NNP Nov.) (CD 29)))) (. .))) Note that the tree here is given in bracket parse format, rather than drawn out as a graph. 38

Removing Context Conversely, we break down the RHS of the rule when estimating its probability. Before: Pr(VP -> START AdvP VBD NP PP END) as a unit Now: Pr(VP -> START AdvP) * Pr(VP -> AdvP VBD) * Pr(VP -> VBD NP) * Pr(VP -> NP PP) * Pr(VP -> PP END) In other words, we're making the same N-gram assumption as in language modelling, only over non-terminal categories rather than words. Learn probability of factors separately 39
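A sketch of the first-order horizontal factorization, scoring an RHS as a chain of bigram factors Pr(next | LHS, previous) with START/END padding; the probability table below is made up for illustration:

```python
# First-order horizontal markovization of an RHS.
bigram = {
    ("VP", "START", "AdvP"): 0.2, ("VP", "AdvP", "VBD"): 0.7,
    ("VP", "VBD", "NP"): 0.6, ("VP", "NP", "PP"): 0.3, ("VP", "PP", "END"): 0.5,
}

def rhs_prob(lhs, rhs):
    symbols = ["START"] + list(rhs) + ["END"]
    p = 1.0
    for prev, nxt in zip(symbols, symbols[1:]):
        p *= bigram.get((lhs, prev, nxt), 0.0)
    return p

print(rhs_prob("VP", ["AdvP", "VBD", "NP", "PP"]))  # 0.2 * 0.7 * 0.6 * 0.3 * 0.5
```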

Example Let's help Pierre Vinken find his children. ( (S (NP (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP (NNP Nov.) (CD 29)))) (. .))) 40

Markovization Vertical markovization: adding ancestors as context Zeroth order: vanilla PCFGs First order: the scheme we just described Can go further, e.g., second order: NP^VP^S -> ... Horizontal markovization: breaking the RHS into parts Infinite order: vanilla PCFGs First order: the scheme we just described Can choose any other order, do interpolation, etc. 41

Effect of Category Splitting WSJ results by Klein and Manning (2003) With additional linguistic insights, they got up to 87.04 F1 Current best is around 94-95 F1 42