Parsing: Cocke Younger Kasami (CYK). Laura Kallmeyer. Winter 2017/18. Heinrich-Heine-Universität Düsseldorf.


Table of contents
1 Introduction
2 The recognizer
3 The CNF recognizer
4 CYK parsing
5 CYK with dotted productions
6 Bibliography

Introduction

The CYK parser is
- a bottom-up parser: we start with the terminals in the input string and then compute recognized parse trees by going from already recognized right-hand sides (rhs) of productions to the nonterminal on the left-hand side (lhs);
- a non-directional parser: the check for recognized components of a rhs in order to complete a lhs is not ordered; in particular, we cannot complete only parts of a rhs: everything in the rhs must be recognized before the left-hand category can be completed.

The algorithm was independently proposed by Cocke, Kasami and Younger in the 1960s (Cocke and Schwartz 1970; Kasami 1965; Younger 1967); see also Grune and Jacobs (2008) and Hopcroft and Ullman (1979, 1994).

The general recognizer (1)

We store the results in an (n+1) × (n+1) chart (table) C such that A ∈ C_{i,l} iff A ⇒* w_i ... w_{i+l−1}. In other words,
- i is the index of the first terminal in the relevant substring of w; i ranges from 1 to n+1 (the latter for an empty word following w_n);
- l is the length of the substring; l ranges from 0 to n.

All fields in the chart are initialized with ∅.

The general recognizer (2)

General CYK recognizer (for arbitrary CFGs):

C_{i,1} := {w_i} for all 1 ≤ i ≤ n                          (scan)
C_{i,0} := {A | A → ε ∈ P} for all i ∈ [1..n+1]             (ε-productions)
for all l ∈ [0..n]:
    for all i_1 ∈ [1..n+1]:
        repeat until the chart does not change any more:
            for every A → A_1 ... A_k:
                if there are l_1, ..., l_k ∈ [0..n] such that
                l_1 + ... + l_k = l and A_j ∈ C_{i_j,l_j}
                with i_j = i_1 + l_1 + ... + l_{j−1},
                then C_{i_1,l} := C_{i_1,l} ∪ {A}           (complete)
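The pseudocode above can be sketched in Python. The encoding of the grammar (a list of (lhs, rhs) productions with nonempty right-hand sides, plus a separate set of ε-deriving nonterminals) and all names are choices made here for illustration, not part of the slides.

```python
from itertools import product

def general_cyk(productions, eps_lhs, start, word):
    """General CYK recognizer for an arbitrary CFG.

    productions: list of (lhs, rhs) pairs, rhs a nonempty tuple of
                 terminals/nonterminals; eps_lhs: nonterminals A with A -> eps.
    C[i][l] collects all symbols deriving the substring w_i ... w_{i+l-1}.
    """
    n = len(word)
    C = [[set() for _ in range(n + 1)] for _ in range(n + 2)]
    for i in range(1, n + 1):                        # scan
        C[i][1].add(word[i - 1])
    for i in range(1, n + 2):                        # eps-productions
        C[i][0] |= eps_lhs
    for l in range(0, n + 1):                        # substring length
        for i1 in range(1, n + 2):
            if l > 0 and i1 + l - 1 > n:             # substring exceeds the word
                continue
            changed = True
            while changed:                           # repeat until chart stable
                changed = False
                for lhs, rhs in productions:
                    # try all splits l_1 + ... + l_k = l of the span
                    for ls in product(range(l + 1), repeat=len(rhs)):
                        if sum(ls) != l:
                            continue
                        pos, ok = i1, True
                        for sym, lj in zip(rhs, ls):
                            if sym not in C[pos][lj]:
                                ok = False
                                break
                            pos += lj
                        if ok and lhs not in C[i1][l]:
                            C[i1][l].add(lhs)        # complete
                            changed = True
    return start in C[1][n]
```

With the example grammar of the next slide (S → ABC, A → aA | ε, B → bB | ε, C → c), `general_cyk(prods, {"A", "B"}, "S", "aabbbc")` should succeed, while inputs without a final c are rejected. The blind enumeration of all length splits is exactly what makes this general version costly.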

The general recognizer (3)

Example: S → ABC, A → aA | ε, B → bB | ε, C → c. w = aabbbc.

l
6   S
5        S
4             S
3             B    S
2   A         B    B    S
1   a,A  a,A  b,B  b,B  b,B  c,C,S
0   A,B  A,B  A,B  A,B  A,B  A,B   A,B
    i=1  i=2  i=3  i=4  i=5  i=6   i=7

The general recognizer (4)

Parsing schema for CYK: items have three elements:
- X ∈ N ∪ T: the nonterminal or terminal that spans a substring w_i ... w_j of w;
- the index i of the first terminal of the subsequence;
- the length l = j − i + 1 of the subsequence.

Item form: [X, i, l] with X ∈ N ∪ T, i ∈ [1..n+1], l ∈ [0..n].

The general recognizer (5)

Goal item: [S, 1, n].

Deduction rules:

Scan:            ⊢ [a, i, 1]                       provided w_i = a
ε-productions:   ⊢ [A, i, 0]                       provided A → ε ∈ P, i ∈ [1..n+1]
Complete:        [A_1, i_1, l_1], ..., [A_k, i_k, l_k] ⊢ [A, i_1, l]
                 provided A → A_1 ... A_k ∈ P, l = l_1 + ... + l_k,
                 i_j = i_1 + l_1 + ... + l_{j−1}

The general recognizer (6)

Tabulation avoids problems with loops: nothing needs to be computed more than once.

In each complete step, we have to check all combinations of l_1, ..., l_k. This is costly. Note, however, that we create a new chart entry (new item) only for combinations of already recognized parse trees (no blind prediction as in Unger's parser).

With unary rules and ε-productions, an entry in field C_{i,l} can be used to compute a new entry in the same field C_{i,l}. This is why the "repeat until the chart does not change any more" loop is necessary.

The CNF recognizer (1)

A CFG is in Chomsky Normal Form iff all productions are either of the form A → a or A → B C.

If the grammar has this form, we need to check only two lengths l_1, l_2 in a complete step, and we can be sure that, to compute an entry in field C_{i,l}, we never use another entry from the same field C_{i,l}. Consequently, we do not need the "repeat until the chart does not change any more" loop.

The CNF recognizer (2)

The chart C is now an n × n chart.

C_{i,1} := {A | A → w_i ∈ P}                                (scan)
for all l ∈ [1..n]:
    for all i ∈ [1..n]:
        for every A → B C:
            if there is an l_1 ∈ [1..l−1] such that
            B ∈ C_{i,l_1} and C ∈ C_{i+l_1,l−l_1},
            then C_{i,l} := C_{i,l} ∪ {A}                   (complete)
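The CNF recognizer is simple enough to state directly in Python. The grammar encoding (pairs for A → a, triples for A → B C) and the function name are choices made here, not part of the slides.

```python
def cnf_cyk(term_prods, bin_prods, start, word):
    """CYK recognizer for a CFG in Chomsky Normal Form.

    term_prods: set of (A, a) pairs for productions A -> a
    bin_prods:  set of (A, B, C) triples for productions A -> B C
    C[i][l] holds the nonterminals deriving w_i ... w_{i+l-1}.
    """
    n = len(word)
    C = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):                        # scan
        for A, a in term_prods:
            if a == word[i - 1]:
                C[i][1].add(A)
    for l in range(2, n + 1):                        # complete; no fixpoint loop needed
        for i in range(1, n - l + 2):
            for A, B, C2 in bin_prods:
                for l1 in range(1, l):
                    if B in C[i][l1] and C2 in C[i + l1][l - l1]:
                        C[i][l].add(A)
    return n > 0 and start in C[1][n]
```

Using the CNF grammar of the next slides (S → C_a C_b | C_a S_B, S_B → S C_b, C_a → a, C_b → b, i.e. the language a^n b^n), `cnf_cyk(term, binp, "S", "aaabbb")` succeeds. The four nested loops make the O(n^3) running time (times grammar size) directly visible.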

The CNF recognizer (3)

Parsing schema for CNF CYK:

Goal item: [S, 1, n]

Deduction rules:

Scan:       ⊢ [A, i, 1]                                    provided A → w_i ∈ P
Complete:   [B, i, l_1], [C, i + l_1, l_2] ⊢ [A, i, l_1 + l_2]
            provided A → B C ∈ P

The CNF recognizer (4)

Example: S → C_a C_b | C_a S_B, S_B → S C_b, C_a → a, C_b → b. (From S → aSb | ab with transformation into CNF.) w = aaabbb.

l
6   S
5        S_B
4        S
3             S_B
2             S
1   C_a  C_a  C_a  C_b  C_b  C_b
    i=1  i=2  i=3  i=4  i=5  i=6
    a    a    a    b    b    b

The CNF recognizer (5)

Time complexity: how many different instances of scan and complete are possible?

Scan:       ⊢ [A, i, 1]   provided A → w_i ∈ P:                     at most c_1 · n
Complete:   [B, i, l_1], [C, i + l_1, l_2] ⊢ [A, i, l_1 + l_2]
            provided A → B C ∈ P:                                   at most c_2 · n^3

Consequently, the time complexity of CYK for CNF is O(n^3). The space complexity of CYK is O(n^2).

The CNF recognizer (6)

Control structures: there are two possible orders in which the chart can be filled:

1. off-line order: fill row 1 first, then row 2, etc.:
   for all l ∈ [1..n]:              (length)
       for all i ∈ [1..n−l+1]:      (start position)
           compute C_{i,l}

2. on-line order: fill one diagonal after the other, starting with C_{1,1} and, for each end position k, proceeding from C_{k,1} to C_{1,k}:
   for all k ∈ [1..n]:              (end position)
       for all l ∈ [1..k]:          (length)
           compute C_{k−l+1,l}
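The on-line order can be sketched by reorganizing the CNF recognizer's loops. The point of this order is that after end position k has been processed, every chart entry ending at k is complete, so the input can be consumed incrementally. Grammar encoding and names are the same illustrative choices as before, not prescribed by the slides.

```python
def cnf_cyk_online(term_prods, bin_prods, start, word):
    """CNF CYK recognizer filling the chart in on-line order: after
    processing end position k, every entry C[i][l] with i + l - 1 = k
    is complete, so the parser works incrementally, symbol by symbol."""
    n = len(word)
    C = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for k in range(1, n + 1):                  # end position (one diagonal)
        for A, a in term_prods:                # scan w_k
            if a == word[k - 1]:
                C[k][1].add(A)
        for l in range(2, k + 1):              # length; start position i = k - l + 1
            i = k - l + 1
            for A, B, C2 in bin_prods:
                for l1 in range(1, l):
                    # B ends strictly before k; C2 ends at k with a smaller
                    # length, so both cells are already complete here
                    if B in C[i][l1] and C2 in C[i + l1][l - l1]:
                        C[i][l].add(A)
    return n > 0 and start in C[1][n]
```

Both orders fill exactly the same chart; they differ only in when each cell becomes available.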

The CNF recognizer (7)

1. Soundness of the algorithm: if [A, i, l], then A ⇒* w_i ... w_{i+l−1}. Proof via induction over the deduction rules.
2. Completeness of the algorithm: if A ⇒* w_i ... w_{i+l−1}, then [A, i, l]. Proof via induction over l.

CYK parsing (1)

We know that for every CFG G with ε ∉ L(G) we can
- eliminate ε-productions,
- eliminate unary productions,
- eliminate useless symbols,
- transform into CNF,

and the resulting CFG G′ is such that L(G) = L(G′). Therefore, for every such CFG, we can use the CNF recognizer after the transformation. How can we obtain a parser?

CYK parsing (2)

We need to do two things:
- turn the CNF recognizer into a parser, and
- if the original grammar was not in CNF, retrieve the original syntax trees from the CNF syntax trees.

CYK parsing (3)

To turn the CNF recognizer into a parser, we record in the chart not only nonterminal categories but whole productions, with the positions and lengths of the rhs symbols (i.e., with backpointers):

C_{i,1} := {A → w_i | A → w_i ∈ P}
for all l ∈ [1..n]:
    for all i ∈ [1..n]:
        for every A → B C:
            if there is an l_1 ∈ [1..l−1] such that
            B ∈ C_{i,l_1} and C ∈ C_{i+l_1,l−l_1},
            then C_{i,l} := C_{i,l} ∪ {A → [B, i, l_1] [C, i + l_1, l − l_1]}

We can then obtain a parse tree by traversing the productions from left to right, starting with every S-production in C_{1,n}.
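This backpointer idea can be sketched by letting each chart cell map a nonterminal to one way of building it; a parse tree as a nested tuple is then read off recursively. Names are illustrative choices; for simplicity this sketch keeps only the first derivation found per cell and nonterminal, so for ambiguous inputs it returns a single tree rather than a parse forest.

```python
def cnf_cyk_parse(term_prods, bin_prods, start, word):
    """CYK parser for a CNF grammar: each chart entry maps a nonterminal
    to a backpointer, from which a parse tree (nested tuples) is read off.
    Only the first derivation found per cell/nonterminal is kept."""
    n = len(word)
    C = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):                        # scan
        for A, a in term_prods:
            if a == word[i - 1]:
                C[i][1].setdefault(A, a)             # leaf backpointer
    for l in range(2, n + 1):                        # complete
        for i in range(1, n - l + 2):
            for A, B, C2 in bin_prods:
                for l1 in range(1, l):
                    if B in C[i][l1] and C2 in C[i + l1][l - l1]:
                        # rhs symbols with start position and length
                        C[i][l].setdefault(A, ((B, i, l1),
                                               (C2, i + l1, l - l1)))

    def tree(X, i, l):
        bp = C[i][l][X]
        if isinstance(bp, str):                      # terminal production X -> a
            return (X, bp)
        return (X,) + tuple(tree(*child) for child in bp)

    if n == 0 or start not in C[1][n]:
        return None
    return tree(start, 1, n)
```

For w = ab and the sample CNF grammar, the result is ("S", ("Ca", "a"), ("Cb", "b")); reading the leaves of any returned tree from left to right yields the input again.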

CYK parsing (4)

Example: S → C_a C_b | C_a S_B, S_B → S C_b, C_a → a, C_b → b, w = aaabbb. (We write A_{i,l} for [A, i, l].)

The chart entries with backpointers yield the following derivation:

S   → C_a{1,1} S_B{2,5}
S_B → S{2,4} C_b{6,1}
S   → C_a{2,1} S_B{3,3}
S_B → S{3,2} C_b{5,1}
S   → C_a{3,1} C_b{4,1}
C_a → a (i = 1, 2, 3), C_b → b (i = 4, 5, 6)

CYK parsing (5)

From the CNF parse tree to the original parse tree: first, we undo the CNF transformation:

- Replace every C_a → a in the chart with a, and replace every occurrence of C_a in a production with a.
- For all l, i ∈ [1..n]: if A → α D_{i_D,l_D} ∈ C_{i,l} such that D is a new symbol introduced in the CNF transformation, and D → β ∈ C_{i_D,l_D}, then replace A → α D_{i_D,l_D} with A → α β in C_{i,l}.
- Finally, remove all D → γ with D a new symbol introduced in the CNF transformation from the chart.

CYK parsing (6)

Example: S → C_a C_b | C_a S_B, S_B → S C_b, C_a → a, C_b → b, w = aaabbb. New symbols: C_a, C_b, S_B.

Elimination of C_a, C_b:

l=6, i=1:  S → a S_B{2,5}
l=5, i=2:  S_B → S{2,4} b
l=4, i=2:  S → a S_B{3,3}
l=3, i=3:  S_B → S{3,2} b
l=2, i=3:  S → a b
l=1:       a (i = 1, 2, 3), b (i = 4, 5, 6)

CYK parsing (7)

Example: S → C_a C_b | C_a S_B, S_B → S C_b, C_a → a, C_b → b, w = aaabbb.

Replacing S_B in right-hand sides:

l=6, i=1:  S → a S{2,4} b
l=5, i=2:  S_B → S{2,4} b
l=4, i=2:  S → a S{3,2} b
l=3, i=3:  S_B → S{3,2} b
l=2, i=3:  S → a b
l=1:       a (i = 1, 2, 3), b (i = 4, 5, 6)

CYK parsing (8)

Example: S → C_a C_b | C_a S_B, S_B → S C_b, C_a → a, C_b → b, w = aaabbb.

Elimination of S_B:

l=6, i=1:  S → a S{2,4} b
l=4, i=2:  S → a S{3,2} b
l=2, i=3:  S → a b
l=1:       a (i = 1, 2, 3), b (i = 4, 5, 6)

CYK parsing (9)

Undo the elimination of unary productions:

- For every A → β in C_{i,l} that was added while removing the unary productions in order to replace B → β′ (where β′ is β without the chart indices): replace A with B in this entry in C_{i,l}.
- For every unary production A → B in the original grammar and for every B → β ∈ C_{i,l}: add A → B_{i,l} to C_{i,l}.
- Repeat this until the chart does not change any more.

CYK parsing (10)

Undo the elimination of ε-productions:

- Add a row with l = 0 and a column with i = n + 1 to the chart.
- Fill row 0 as in the general case, using the original CFG (tabulating productions).
- For every A → β in C_{i,l} that was added while removing the ε-productions: add the deleted nonterminals to β, with the position of the preceding nonterminal as starting position (or i if it is the first symbol in the rhs) and with length 0.

CYK parsing (11)

Terminal filter. Observation: information about the obligatory presence of terminals can get lost in the CNF transformation:

S → aSb (requires an a, an S and a b)
becomes
S → C_a S_B (requires an a, an S and a b) and S_B → S C_b (requires only an S and a b).

Consider the input babb: in a CYK parser with the original grammar, we would derive [S, 2, 2] and [b, 4, 1], but we could not apply S → aSb. In the CNF grammar, we would have [S, 2, 2] and [C_b, 4, 1]; then we could apply S_B → S C_b and obtain [S_B, 2, 3], even though the only way to continue with S_B in a rhs is S → C_a S_B, which is not possible since the first terminal is not an a.

CYK parsing (12)

Solution: add an additional check.

- Every new nonterminal D introduced for binarization stands (deterministically) for some suffix β of a rhs αβ. Example: S_B in our sample grammar stands for Sb.
- Every terminal in this rhs to the left of β, i.e., every terminal in α, must necessarily be present to the left for a deduction of a D that can lead to a goal item. Example: S_B can only lead to a goal item if there is an a to its left.

Terminal filter: during the CNF transformation, for every nonterminal D introduced for binarization, record the set of terminals in the rhs to the left of the part covered by D. During parsing, check for the presence of these terminals to the left of the deduced D item.

CYK with dotted productions (1)

CNF leads to a binarization: in each completion, only two items are combined. Such a binarization can also be obtained by using dotted productions:

- We process the rhs of a production from left to right.
- In each complete step, a production A → α • X β is combined with an X whose span starts at the end of the α-span. The result is a production A → α X • β.
- We start with the completed terminals and their spans.

Note that this version of CYK is directional.

CYK with dotted productions (2)

Parsing schema for the general version (also allowing ε-productions):

Goal items: [S → α •, 0, n] for all S-productions S → α.

Deduction rules:

Predict (axioms):   ⊢ [A → • α, i, i]               provided A → α ∈ P, i ∈ [0..n]
Scan:               [A → α • a β, i, j] ⊢ [A → α a • β, i, j+1]
                    provided w_{j+1} = a
Complete:           [A → α • B β, i, j], [B → γ •, j, k] ⊢ [A → α B • β, i, k]

CYK with dotted productions (3)

Parsing schema including passive items (just a nonterminal or terminal, no dotted production), assuming an ε-free CFG:

Goal item: [S, 0, n]

Deduction rules:

Scan (axioms):        ⊢ [a, i, i+1]                  provided w_{i+1} = a
Left-corner predict:  [X, i, j] ⊢ [A → X • α, i, j]  provided A → X α ∈ P, X ∈ N ∪ T
Complete:             [A → α • X β, i, j], [X, j, k] ⊢ [A → α X • β, i, k]
Publish:              [A → α •, i, j] ⊢ [A, i, j]

(This is actually a deduction-based version of left-corner parsing.)
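These deduction rules can be executed with a standard chart-and-agenda loop: pop an item, add it to the chart, and put all items deducible from it (together with the chart) on the agenda. The following sketch, with item encodings and names chosen here, assumes an ε-free CFG as the slide does.

```python
def dotted_cyk(productions, start, word):
    """Deduction-based recognizer with dotted productions and passive
    items (left-corner style), for an epsilon-free CFG.

    Passive item: (X, i, j)           -- X spans positions i..j
    Active item:  (A, rhs, d, i, j)   -- A -> rhs[:d] . rhs[d:] spans i..j
    """
    n = len(word)
    chart = set()
    agenda = [(word[i], i, i + 1) for i in range(n)]     # scan (axioms)
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue
        chart.add(item)
        if len(item) == 3:                               # passive item (X, i, j)
            X, i, j = item
            for lhs, rhs in productions:                 # left-corner predict
                if rhs[0] == X:
                    agenda.append((lhs, rhs, 1, i, j))
            for a in list(chart):                        # complete: active + this passive
                if (len(a) == 5 and a[4] == i
                        and a[2] < len(a[1]) and a[1][a[2]] == X):
                    agenda.append((a[0], a[1], a[2] + 1, a[3], j))
        else:                                            # active item (A, rhs, d, i, j)
            A, rhs, d, i, j = item
            if d == len(rhs):                            # publish
                agenda.append((A, i, j))
            else:
                for p in list(chart):                    # complete: this active + passive
                    if len(p) == 3 and p[1] == j and p[0] == rhs[d]:
                        agenda.append((A, rhs, d + 1, i, p[2]))
    return (start, 0, n) in chart
```

The complete rule appears twice so that it fires no matter whether the active or the passive partner enters the chart first; with the example grammar S → ab | aSb of the next slide, `dotted_cyk(prods, "S", "aabb")` succeeds.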

CYK with dotted productions (4)

Example (without ε-productions, left-corner parsing): S → ab | aSb, w = aabb.

l=4, span (0,4):  S → aSb •, S
l=3, span (0,3):  S → aS • b
l=2, span (1,3):  S → ab •, S
l=1:              a (0,1), a (1,2), b (2,3), b (3,4),
                  with S → a • b and S → a • Sb at spans (0,1) and (1,2)

CYK with dotted productions (5)

What about time complexity? The most complex operation, complete, involves only three indices i, j, k ranging from 0 to n:

Complete:  [A → α • X β, i, j], [X, j, k] ⊢ [A → α X • β, i, k]

Consequently, the time complexity is O(n^3), as in the CNF case. But the data structure required for representing a parse item with a dotted production is slightly more complex than what is needed for simple passive items.

Conclusion

- CYK is a non-directional bottom-up parser.
- Used with CNF, it is very efficient: the time complexity is O(n^3).
- The transformation into CNF can be undone after parsing, i.e., we still have a parser for arbitrary CFGs (as long as ε is not in the language).
- Instead of explicitly binarizing, we can use dotted productions and move through the right-hand sides of productions step by step from left to right, which also leads to O(n^3).

Bibliography

Cocke, J. and Schwartz, J. T. (1970). Programming languages and their compilers: Preliminary notes. Technical report, CIMS, NYU. 2nd revised edition.
Grune, D. and Jacobs, C. (2008). Parsing Techniques. A Practical Guide. Monographs in Computer Science. Springer, second edition.
Hopcroft, J. E. and Ullman, J. D. (1979). Introduction to Automata Theory, Languages and Computation. Addison Wesley.
Hopcroft, J. E. and Ullman, J. D. (1994). Einführung in die Automatentheorie, Formale Sprachen und Komplexitätstheorie. Addison Wesley, 3rd edition.
Kasami, T. (1965). An efficient recognition and syntax-analysis algorithm for context-free languages. Technical report, AFCRL.
Younger, D. H. (1967). Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208.