Fall 2014-2015 Compiler Principles Lecture 3: Parsing part 2 Roman Manevich Ben-Gurion University
Tentative syllabus

Front End                Intermediate Representation   Optimizations         Code Generation
Scanning                 Lowering                      Local Optimizations   Register Allocation
Top-down Parsing (LL)                                  Dataflow Analysis     Instruction Selection
Bottom-up Parsing (LR)                                 Loop Optimizations
Attribute Grammars

mid-term exam
Previously
Role of syntax analysis
Context-free grammars refresher
Top-down (predictive) parsing
Recursive descent
Functions for nonterminals

E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor

E() {
  if (current ∈ {TRUE, FALSE}) LIT();
  else if (current == LPAREN) { match(lparen); E(); OP(); E(); match(rparen); }
  else if (current == NOT) { match(not); E(); }
  else error;
}

LIT() {
  if (current == TRUE) match(true);
  else if (current == FALSE) match(false);
  else error;
}

OP() {
  if (current == AND) match(and);
  else if (current == OR) match(or);
  else if (current == XOR) match(xor);
  else error;
}
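The functions above can be sketched as runnable Python. This is only an illustration: the `Parser` class, the `ParseError` exception, the `accepts` helper, the whitespace-separated token format and the `$` end marker are assumptions of this sketch, not part of the slides.

```python
class ParseError(Exception):
    """Raised when the token stream does not match the grammar."""


class Parser:
    def __init__(self, tokens):
        # "$" is an assumed end-of-input marker appended by this sketch.
        self.tokens = list(tokens) + ["$"]
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.current != expected:
            raise ParseError(f"expected {expected!r}, got {self.current!r}")
        self.pos += 1

    def E(self):
        if self.current in ("true", "false"):    # E -> LIT
            self.LIT()
        elif self.current == "(":                # E -> ( E OP E )
            self.match("(")
            self.E()
            self.OP()
            self.E()
            self.match(")")
        elif self.current == "not":              # E -> not E
            self.match("not")
            self.E()
        else:
            raise ParseError(f"unexpected token {self.current!r} in E")

    def LIT(self):
        if self.current in ("true", "false"):
            self.match(self.current)
        else:
            raise ParseError(f"unexpected token {self.current!r} in LIT")

    def OP(self):
        if self.current in ("and", "or", "xor"):
            self.match(self.current)
        else:
            raise ParseError(f"unexpected token {self.current!r} in OP")


def accepts(text):
    """Return True when the whitespace-separated tokens form a valid E."""
    parser = Parser(text.split())
    try:
        parser.E()
        parser.match("$")   # the whole input must be consumed
        return True
    except ParseError:
        return False
```

For instance, `accepts("( true and not false )")` succeeds, while `accepts("true and false")` fails because the grammar requires parentheses around a binary operation.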
Technical challenges with recursive descent
Recursive descent: problem 1
term → ID | indexed_elem
indexed_elem → ID [ expr ]
With lookahead 1, the function for indexed_elem will never be tried
What happens for input of the form ID[expr]?
Recursive descent: problem 2

S → A a b
A → a | ε

int S() { return A() && match(token('a')) && match(token('b')); }
int A() { return match(token('a')) || 1; }

What happens for input ab?
What happens if you flip the order of alternatives and try aab?
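The two questions can be explored with a small Python sketch of the same recognizer. The `make_parser` helper and its `epsilon_first` flag are assumptions of this sketch; the flag stands for flipping the order of the alternatives of A.

```python
def make_parser(epsilon_first):
    """Build a recognizer for S -> A a b, A -> a | epsilon.

    epsilon_first selects which alternative of A is tried first.
    There is no backtracking, as in the slide's code.
    """

    def parse(text):
        pos = 0

        def match(ch):
            nonlocal pos
            if pos < len(text) and text[pos] == ch:
                pos += 1
                return True
            return False

        def A():
            if epsilon_first:
                return True               # A -> epsilon always succeeds
            return match("a") or True     # try A -> a, fall back to epsilon

        def S():
            return A() and match("a") and match("b")

        return S() and pos == len(text)

    return parse


greedy = make_parser(epsilon_first=False)   # order used on the slide
flipped = make_parser(epsilon_first=True)   # alternatives of A flipped
```

Both ab and aab belong to the language, yet `greedy("ab")` is rejected because A consumes the only a, and `flipped("aab")` is rejected because A never consumes one. Neither fixed ordering works without looking at what may follow A.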
Recursive descent: problem 3 (p. 127)

E → E - term | term

int E() { return E() && match(token('-')) && term(); }

What happens when we execute this procedure?
Recursive descent parsers cannot handle left-recursive grammars
Agenda
Predicting productions via FIRST/FOLLOW/NULLABLE sets
Handling conflicts
LL(k) via pushdown automata
How do we predict?
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor
How can we decide which production of E to take?
FIRST sets
For a nonterminal A, FIRST(A) is the set of terminals that can start a sentence derived from A
Formally: FIRST(A) = { t | A ⇒* t ω }
For a sentential form α, FIRST(α) is the set of terminals that can start a sentence derived from α
Formally: FIRST(α) = { t | α ⇒* t ω }
FIRST sets example
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor
FIRST(E) = ?  FIRST(LIT) = ?  FIRST(OP) = ?
FIRST(E) = FIRST(LIT) ∪ FIRST(( E OP E )) ∪ FIRST(not E)
FIRST(LIT) = { true, false }
FIRST(OP) = { and, or, xor }
A set of recursive equations. How do we solve them?
Computing FIRST sets
Assume no null productions (A → ε)
1. Initially, for all nonterminals A, set FIRST(A) = { t | A → t ω for some ω }
2. Repeat the following until no changes occur:
   for each nonterminal A
     for each production A → α1 | … | αk
       FIRST(A) = FIRST(α1) ∪ … ∪ FIRST(αk)
This is known as a fixed-point algorithm
We will see such iterative methods later in the course and learn to reason about them
Exercise: compute FIRST

STMT → if EXPR then STMT | while EXPR do STMT | EXPR ;
EXPR → TERM -> id | zero? TERM | not EXPR | ++ id | -- id
TERM → id | constant

Step                         FIRST(STMT)                             FIRST(EXPR)                    FIRST(TERM)
1. Initialization            if while                                zero? not ++ --                id constant
2. Iterate 1                 if while zero? not ++ --                zero? not ++ --                id constant
2. Iterate 2                 if while zero? not ++ --                zero? not ++ -- id constant    id constant
2. Iterate 3 (fixed-point)   if while zero? not ++ -- id constant    zero? not ++ -- id constant    id constant
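The exercise can be checked mechanically. The following Python sketch implements the fixed-point computation for grammars without ε-productions; the dictionary encoding of the grammar and the name `first_sets` are assumptions of this sketch. The order in which sets are updated may differ from the slides, so intermediate iterations need not match, only the final result.

```python
# Each nonterminal maps to a list of alternatives; each alternative is a list of symbols.
# Symbols that are not keys of the dictionary are treated as terminals.
GRAMMAR = {
    "STMT": [["if", "EXPR", "then", "STMT"],
             ["while", "EXPR", "do", "STMT"],
             ["EXPR", ";"]],
    "EXPR": [["TERM", "->", "id"], ["zero?", "TERM"], ["not", "EXPR"],
             ["++", "id"], ["--", "id"]],
    "TERM": [["id"], ["constant"]],
}


def first_sets(grammar):
    """Fixed-point computation of FIRST, assuming no epsilon productions."""
    # Step 1: terminals that directly begin some alternative.
    first = {
        n: {alt[0] for alt in alts if alt[0] not in grammar}
        for n, alts in grammar.items()
    }
    # Step 2: propagate FIRST of leading nonterminals until nothing changes.
    changed = True
    while changed:
        changed = False
        for n, alts in grammar.items():
            for alt in alts:
                head = alt[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[n]:
                    first[n] |= new
                    changed = True
    return first


FIRST = first_sets(GRAMMAR)
```

Because each set only grows and is bounded by the finite set of terminals, the loop is guaranteed to stop.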
Reasoning about the algorithm
Consider the fixed-point algorithm for computing FIRST sets (assuming no null productions).
Is the algorithm correct?
Does it terminate? (complexity)
Termination: the FIRST sets only grow, and each is bounded by the finite set of terminals, so only finitely many iterations can change anything.
Correctness: every terminal added to FIRST(A) starts some sentence derived from A (by induction on the number of iterations), and every terminal that starts a sentence derived from A is eventually added (by induction on the length of the derivation).
LL(1) Parsing of grammars without epsilon productions
Using FIRST sets
Assume G has no epsilon productions and for every non-terminal X and every pair of productions X → α and X → β we have FIRST(α) ∩ FIRST(β) = {}
No intersection between FIRST sets => can always pick a single rule
Using FIRST sets
In our Boolean expressions example:
FIRST( LIT ) = { true, false }
FIRST( ( E OP E ) ) = { ( }
FIRST( not E ) = { not }
If the FIRST sets intersect, we may need a longer lookahead
LL(k) = class of grammars in which the production rule can be determined using a lookahead of k tokens
LL(1) is an important and useful class
What if there are epsilon productions?
Extending LL(1) Parsing for epsilon productions
FIRST, FOLLOW, NULLABLE sets
For each non-terminal X:
FIRST(X) = set of terminals that can start a sentence derived from X
  FIRST(X) = { t | X ⇒* t ω }
NULLABLE(X) if X ⇒* ε
FOLLOW(X) = set of terminals that can follow X in some derivation
  FOLLOW(X) = { t | S ⇒* α X t β }
Computing the NULLABLE set
Lemma: NULLABLE(α1 … αk) = NULLABLE(α1) ∧ … ∧ NULLABLE(αk)
1. Initially NULLABLE(X) = false
2. For each non-terminal X: if there exists a production X → ε then NULLABLE(X) = true
3. Repeat
     for each production Y → α1 … αk
       if NULLABLE(α1 … αk) then NULLABLE(Y) = true
   until NULLABLE stabilizes
Exercise: compute NULLABLE
S → A a b
A → a | ε
B → A B | C
C → b | ε
NULLABLE(S) = NULLABLE(A) ∧ NULLABLE(a) ∧ NULLABLE(b)
NULLABLE(A) = NULLABLE(a) ∨ NULLABLE(ε)
NULLABLE(B) = (NULLABLE(A) ∧ NULLABLE(B)) ∨ NULLABLE(C)
NULLABLE(C) = NULLABLE(b) ∨ NULLABLE(ε)
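A minimal Python sketch of the NULLABLE iteration, applied to the exercise grammar. The grammar encoding (an empty list stands for ε) and the name `nullable_set` are assumptions of this sketch.

```python
# An empty alternative [] encodes an epsilon production.
GRAMMAR = {
    "S": [["A", "a", "b"]],
    "A": [["a"], []],
    "B": [["A", "B"], ["C"]],
    "C": [["b"], []],
}


def nullable_set(grammar):
    """Iterate until NULLABLE stabilizes; terminals are never nullable."""
    nullable = {n: False for n in grammar}
    changed = True
    while changed:
        changed = False
        for n, alts in grammar.items():
            if nullable[n]:
                continue
            # An alternative is nullable when all of its symbols are nullable.
            # all() over an empty alternative is True, which covers X -> epsilon.
            if any(all(nullable.get(sym, False) for sym in alt) for alt in alts):
                nullable[n] = True
                changed = True
    return nullable


NULLABLE = nullable_set(GRAMMAR)
```

Here B becomes nullable only through B → C, since C → ε; S is not nullable because its single alternative contains terminals.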
FIRST with epsilon productions
How do we compute FIRST(α1 … αk) when epsilon productions are allowed?
FIRST(α1 … αk) =
  if not NULLABLE(α1) then FIRST(α1)
  else FIRST(α1) ∪ FIRST(α2 … αk)
Exercise: compute FIRST
S → A c b
A → a | ε
NULLABLE(S) = NULLABLE(A) ∧ NULLABLE(c) ∧ NULLABLE(b)
NULLABLE(A) = NULLABLE(a) ∨ NULLABLE(ε)
FIRST(S) = FIRST(A) ∪ FIRST(cb)
FIRST(A) = FIRST(a) ∪ FIRST(ε)
Simplifying:
FIRST(S) = FIRST(A) ∪ {c}
FIRST(A) = FIRST(a)
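The rule for FIRST(α1 … αk) can be sketched in Python and applied to this exercise. The function names and the grammar encoding (an empty list for ε) are assumptions of this sketch.

```python
GRAMMAR = {"S": [["A", "c", "b"]], "A": [["a"], []]}


def nullable_set(grammar):
    """NULLABLE by iteration until nothing changes."""
    nullable = {n: False for n in grammar}
    changed = True
    while changed:
        changed = False
        for n, alts in grammar.items():
            if not nullable[n] and any(
                all(nullable.get(s, False) for s in alt) for alt in alts
            ):
                nullable[n] = True
                changed = True
    return nullable


def first_of_seq(seq, first, nullable, grammar):
    """FIRST of a symbol sequence: keep going while the prefix is nullable."""
    result = set()
    for sym in seq:
        if sym not in grammar:        # a terminal starts the rest of the sequence
            result.add(sym)
            return result
        result |= first[sym]
        if not nullable[sym]:
            return result
    return result


def first_sets(grammar, nullable):
    """Fixed-point computation of FIRST for every nonterminal."""
    first = {n: set() for n in grammar}
    changed = True
    while changed:
        changed = False
        for n, alts in grammar.items():
            for alt in alts:
                new = first_of_seq(alt, first, nullable, grammar)
                if not new <= first[n]:
                    first[n] |= new
                    changed = True
    return first


NULLABLE = nullable_set(GRAMMAR)
FIRST = first_sets(GRAMMAR, NULLABLE)
```

Since A is nullable, the terminal c after A also ends up in FIRST(S).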
FOLLOW sets (p. 189)
if X → α Y β then FOLLOW(Y) ⊇ FIRST(β)
  if NULLABLE(β) or β = ε then FOLLOW(Y) ⊇ FOLLOW(X)
Allows predicting epsilon productions: X → ε when the lookahead token is in FOLLOW(X)

S → A c b
A → a | ε
What should we predict for input cb?
What should we predict for input acb?
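The two inclusion rules can be turned into another fixed-point loop. The Python sketch below uses the example grammar S → A c b, A → a | ε and takes NULLABLE and FIRST as hand-supplied inputs (the values follow from the definitions for this grammar). The names `follow_sets` and `predict_A` are assumptions of this sketch, and no end marker is added to FOLLOW(S).

```python
GRAMMAR = {"S": [["A", "c", "b"]], "A": [["a"], []]}
NULLABLE = {"S": False, "A": True}        # supplied by hand for this grammar
FIRST = {"S": {"a", "c"}, "A": {"a"}}     # supplied by hand for this grammar


def first_of_seq(seq):
    """Return (FIRST(seq), NULLABLE(seq)) for a sequence of symbols."""
    result = set()
    for sym in seq:
        if sym not in GRAMMAR:            # terminal
            result.add(sym)
            return result, False
        result |= FIRST[sym]
        if not NULLABLE[sym]:
            return result, False
    return result, True


def follow_sets():
    follow = {n: set() for n in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for x, alts in GRAMMAR.items():
            for alt in alts:
                for i, y in enumerate(alt):
                    if y not in GRAMMAR:
                        continue
                    # X -> alpha Y beta: FOLLOW(Y) includes FIRST(beta) ...
                    beta_first, beta_nullable = first_of_seq(alt[i + 1:])
                    new = set(beta_first)
                    # ... and FOLLOW(X) when beta is nullable or empty.
                    if beta_nullable:
                        new |= follow[x]
                    if not new <= follow[y]:
                        follow[y] |= new
                        changed = True
    return follow


def predict_A(lookahead, follow):
    """Pick the alternative of A for one lookahead token (specific to this grammar)."""
    if lookahead in FIRST["A"]:
        return "A -> a"
    if NULLABLE["A"] and lookahead in follow["A"]:
        return "A -> epsilon"
    return None


FOLLOW = follow_sets()
```

For input cb the lookahead c is in FOLLOW(A), so `predict_A("c", FOLLOW)` selects the ε-production; for input acb the lookahead a selects A → a.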
LL(k) grammars
Conflicts
FIRST-FIRST conflict: X → α and X → β and FIRST(α) ∩ FIRST(β) ≠ {}
FIRST-FOLLOW conflict: NULLABLE(X) and FIRST(X) ∩ FOLLOW(X) ≠ {}
LL(1) grammars
A grammar is in the class LL(1) when its sentences can be parsed via:
  Top-down derivation
  Scanning the input from left to right (L)
  Producing the leftmost derivation (L)
  With lookahead of one token
For every two productions A → α and A → β we have FIRST(α) ∩ FIRST(β) = {}, and if NULLABLE(A) then FIRST(A) ∩ FOLLOW(A) = {}
A language is said to be LL(1) when it has an LL(1) grammar
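The two conditions can be checked mechanically once the sets are known. The Python sketch below takes the sets as hand-supplied inputs; the function name `find_conflicts` and the data layout (one FIRST set per alternative) are assumptions of this sketch.

```python
from itertools import combinations


def find_conflicts(alt_first, nullable, follow):
    """Report FIRST-FIRST and FIRST-FOLLOW conflicts.

    alt_first maps each nonterminal to a list with the FIRST set of each
    of its alternatives; nullable and follow map nonterminals to a bool
    and a set of terminals respectively.
    """
    conflicts = []
    for n, firsts in alt_first.items():
        # FIRST-FIRST: two alternatives can start with the same token.
        for fa, fb in combinations(firsts, 2):
            if fa & fb:
                conflicts.append(("FIRST-FIRST", n, sorted(fa & fb)))
        # FIRST-FOLLOW: a nullable nonterminal whose FIRST overlaps its FOLLOW.
        if nullable[n]:
            overlap = set().union(*firsts) & follow[n]
            if overlap:
                conflicts.append(("FIRST-FOLLOW", n, sorted(overlap)))
    return conflicts


# S -> A a b, A -> a | epsilon (sets written out by hand)
problem2 = find_conflicts(
    alt_first={"S": [{"a"}], "A": [{"a"}, set()]},
    nullable={"S": False, "A": True},
    follow={"S": set(), "A": {"a"}},
)

# E -> LIT | ( E OP E ) | not E (only E is checked here)
boolean = find_conflicts(
    alt_first={"E": [{"true", "false"}, {"("}, {"not"}]},
    nullable={"E": False},
    follow={"E": {"and", "or", "xor", ")"}},
)
```

An empty result means the supplied sets satisfy both LL(1) conditions; the grammar S → A a b, A → a | ε does not.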
LL(k) grammars
Generalizes LL(1) for k lookahead tokens
Need to generalize FIRST and FOLLOW for k lookahead tokens
Handling conflicts
Back to problem 1
term → ID | indexed_elem
indexed_elem → ID [ expr ]
FIRST(term) = { ID }
FIRST(indexed_elem) = { ID }
FIRST-FIRST conflict
Solution: left factoring
Rewrite the grammar to be in LL(1):
term → ID | indexed_elem
indexed_elem → ID [ expr ]
becomes
term → ID after_id
after_id → [ expr ] | ε
The new grammar is more complex and has an epsilon production
Intuition: just like factoring in algebra: x*y + x*z into x*(y+z)
Exercise: apply left factoring
S → if E then S else S | if E then S | T
Solution:
S → if E then S S' | T
S' → else S | ε
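A single left-factoring step can be sketched in Python as follows. The function names, the grammar encoding (an empty list for ε) and the primed names for fresh nonterminals are assumptions of this sketch; it performs one pass only, so the factored tails may themselves need another pass.

```python
def common_prefix(alternatives):
    """Longest sequence of symbols shared at the start of every alternative."""
    prefix = []
    for symbols in zip(*alternatives):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(nonterminal, alternatives):
    """Factor alternatives of one nonterminal that begin with the same symbol."""
    groups = {}
    for alt in alternatives:
        key = alt[0] if alt else None     # None collects an epsilon alternative
        groups.setdefault(key, []).append(alt)

    result = {nonterminal: []}
    for key, group in groups.items():
        if key is None or len(group) == 1:
            result[nonterminal].extend(group)   # nothing to factor
            continue
        prefix = common_prefix(group)
        # Hypothetical fresh name: S', S'', ... depending on how many were created.
        fresh = nonterminal + "'" * len(result)
        result[nonterminal].append(prefix + [fresh])
        result[fresh] = [alt[len(prefix):] for alt in group]
    return result


factored = left_factor("S", [
    ["if", "E", "then", "S", "else", "S"],
    ["if", "E", "then", "S"],
    ["T"],
])
```

Applied to the exercise, the shared prefix `if E then S` is pulled out and the two different continuations (`else S` and ε) become the alternatives of the new nonterminal.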
Back to problem 2
S → A a b
A → a | ε
FIRST(S) = { a }   FOLLOW(S) = { }
FIRST(A) = { a }   FOLLOW(A) = { a }
FIRST-FOLLOW conflict
Solution: substitution
S → A a b
A → a | ε
Substitute A in S:
S → a a b | a b
Left factoring:
S → a after_a
after_a → a b | b
Back to problem 3
E → E - term | term
Left recursion cannot be handled with a bounded lookahead
What can we do?
Left recursion removal (p. 130)
G1: N → Nα | β
G2: N → βN', N' → αN' | ε
L(G1) = β, βα, βαα, βααα, …
L(G2) = same
Can be done algorithmically.
Problem 1: grammar becomes mangled beyond recognition
Problem 2: grammar may not be LL(1)
For our 3rd example:
E → E - term | term
becomes
E → term TE
TE → - term TE | ε
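The transformation from G1 to G2 can be sketched in Python for immediate left recursion only. The function name, the grammar encoding (an empty list for ε) and the `tail` parameter naming the new nonterminal are assumptions of this sketch; indirect left recursion is not handled.

```python
def remove_immediate_left_recursion(nonterminal, alternatives, tail):
    """Rewrite N -> N alpha | beta as N -> beta T, T -> alpha T | epsilon.

    tail is the caller-chosen name of the new nonterminal T.
    """
    # The alpha parts: left-recursive alternatives with the leading N removed.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # The beta parts: alternatives that do not start with N.
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}    # nothing to do
    return {
        nonterminal: [beta + [tail] for beta in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }


rewritten = remove_immediate_left_recursion(
    "E", [["E", "-", "term"], ["term"]], tail="TE"
)
```

The result still has to be checked for conflicts, since removing left recursion does not by itself guarantee an LL(1) grammar.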
Recap
Given a grammar:
Compute for each non-terminal:
  NULLABLE
  FIRST using NULLABLE
  FOLLOW using FIRST and NULLABLE
Compute FIRST for each sentential form appearing on the right-hand side of a production
Check for conflicts
  If any exist: attempt to remove them by rewriting the grammar
LL(1) parsing: the automata approach
[Figure by MG (own work), GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/), via Wikimedia Commons]
Marking end-of-file
Sometimes it will be useful to transform a grammar G with start non-terminal S into a grammar G' with a new start non-terminal S' and a new production rule S' → S $, where $ is not part of the set of tokens
To parse an input α with G' we change it into α $
Simplifies top-down parsing with null productions and LR parsing
Another convention
We will assume that all productions have been consecutively numbered:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )
LL(1) Parsers
Recursive descent
  Manual construction (parsing combinators make this easier, but ...)
  Uses recursion
Wanted
  A parser that can be generated automatically
  Does not use recursion
LL(1) parsing via pushdown automata
Pushdown automaton uses:
  Input stream
  Prediction stack
  Parsing table: Nonterminal × token → production rule
    The entry indexed by nonterminal N and token t contains the alternative of N that must be predicted when the current input starts with t
Essentially, the classic conversion from CFG to PDA
The only difference is that we replace nondeterministic choice with the parsing table
Model of non-recursive predictive parser
[Figure: the predictive parsing program reads the input (a + b $), maintains a stack (X Y Z $), consults the parsing table, and produces output.]
LL(1) parsing algorithm
Set stack = S$
While true:
  Prediction: when top of stack is a nonterminal N, pop N and look up table[N,t], where t is the next input token
    If table[N,t] is not empty, push table[N,t] on the prediction stack
    Otherwise: return syntax error
  Match: when top of prediction stack is a terminal t, it must be equal to the next input token t'
    If (t = t'), pop t and consume t'
    If (t ≠ t'): return syntax error
  End: when prediction stack is empty
    If input is empty at that point: return success
    Otherwise: return syntax error
Example transition table

(1) E → LIT
(2) E → ( E OP E )
(3) E → not E
(4) LIT → true
(5) LIT → false
(6) OP → and
(7) OP → or
(8) OP → xor

Rows: nonterminals. Columns: input tokens. Entries: which rule should be used (for example, ( ∈ FIRST(E) selects rule 2).

       (    )    not   true   false   and   or   xor   $
E      2         3     1      1
LIT                    4      5
OP                                    6     7    8
Running parser example: aacbb$

A → aAb | c

Input suffix   Stack content   Move
aacbb$         A$              predict(A,a) = A → aAb
aacbb$         aAb$            match(a,a)
acbb$          Ab$             predict(A,a) = A → aAb
acbb$          aAbb$           match(a,a)
cbb$           Abb$            predict(A,c) = A → c
cbb$           cbb$            match(c,c)
bb$            bb$             match(b,b)
b$             b$              match(b,b)
$              $               match($,$), success

Parsing table:
       a           b     c
A      A → aAb           A → c
Illegal input example: abcbb$

A → aAb | c

Input suffix   Stack content   Move
abcbb$         A$              predict(A,a) = A → aAb
abcbb$         aAb$            match(a,a)
bcbb$          Ab$             predict(A,b) = ERROR

Parsing table:
       a           b     c
A      A → aAb           A → c
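Both traces can be reproduced with a small table-driven parser. The Python sketch below hard-codes the parsing table for A → aAb | c; the name `ll1_parse`, the dictionary layout of the table and the single-character tokens are assumptions of this sketch.

```python
# Parsing table for A -> a A b | c, indexed by (nonterminal, lookahead token).
TABLE = {
    ("A", "a"): ["a", "A", "b"],
    ("A", "c"): ["c"],
}


def ll1_parse(tokens, table=TABLE, start="A"):
    """Return True when the tokens are accepted, False on a syntax error."""
    nonterminals = {n for n, _ in table}
    tokens = list(tokens) + ["$"]     # end-of-input marker
    stack = ["$", start]              # the top of the stack is the end of the list
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in nonterminals:
            # Prediction: replace the nonterminal by the chosen alternative.
            rhs = table.get((top, lookahead))
            if rhs is None:
                return False          # empty table entry: syntax error
            stack.extend(reversed(rhs))
        elif top == lookahead:
            pos += 1                  # Match: consume one input token
        else:
            return False              # terminal mismatch: syntax error
    return pos == len(tokens)         # stack empty: input must be exhausted too
```

The alternative is pushed in reverse so that its leftmost symbol ends up on top of the stack, which is what makes the parser follow a leftmost derivation.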
Creating the prediction table
Let G be an LL(1) grammar
Compute FIRST/NULLABLE/FOLLOW
Check for conflicts
For non-terminal N and token t, predict the production N → α when t ∈ FIRST(α), or when NULLABLE(α) and t ∈ FOLLOW(N)
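Table construction can be sketched end to end in Python. The function `build_ll1_table`, the grammar encoding (an empty list for ε) and the use of a single joint fixed-point loop for all three sets are choices of this sketch, not something the slides prescribe; a conflict is reported by raising `ValueError`.

```python
def build_ll1_table(grammar):
    """Compute NULLABLE, FIRST and FOLLOW, then fill the prediction table."""
    nullable = {n: False for n in grammar}
    first = {n: set() for n in grammar}
    follow = {n: set() for n in grammar}

    def seq_info(seq):
        """Return (FIRST(seq), NULLABLE(seq)) for a sequence of symbols."""
        out = set()
        for sym in seq:
            if sym not in grammar:            # terminal
                return out | {sym}, False
            out |= first[sym]
            if not nullable[sym]:
                return out, False
        return out, True

    changed = True
    while changed:                            # joint fixed point for all three sets
        changed = False
        for n, alts in grammar.items():
            for alt in alts:
                f, nul = seq_info(alt)
                if nul and not nullable[n]:
                    nullable[n] = changed = True
                if not f <= first[n]:
                    first[n] |= f
                    changed = True
                for i, sym in enumerate(alt):
                    if sym in grammar:
                        f_rest, nul_rest = seq_info(alt[i + 1:])
                        new = f_rest | (follow[n] if nul_rest else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True

    table = {}
    for n, alts in grammar.items():
        for alt in alts:
            f, nul = seq_info(alt)
            # Predict N -> alt on FIRST(alt), plus FOLLOW(N) when alt is nullable.
            for t in f | (follow[n] if nul else set()):
                if (n, t) in table:
                    raise ValueError(f"LL(1) conflict at ({n}, {t})")
                table[(n, t)] = alt
    return table


table = build_ll1_table({"S": [["A", "c", "b"]], "A": [["a"], []]})
```

For S → A c b, A → a | ε the entry for (A, c) is the ε-production, because c is in FOLLOW(A). Calling the function on S → A a b, A → a | ε raises an error, reflecting the FIRST-FOLLOW conflict on a.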
Top-down parsing summary
Recursive descent
LL(k) grammars
LL(k) parsing with pushdown automata
Cannot deal with left recursion
Left-recursion removal may result in a complicated grammar
Next lecture: Bottom-up parsing