Syntactic Analysis Chapter 4 Compiler Construction Syntactic Analysis 1
Context-free Grammars
- The syntax of programming language constructs can be described by context-free grammars (CFGs)
  - Relatively simple and widely used
- More powerful grammars exist
  - Context-sensitive grammars (CSGs)
  - Type-0 grammars
  - Both are too complex and inefficient for general use
- Backus-Naur Form (BNF) and extended BNF (EBNF) are convenient ways to represent CFGs

Advantages of CFGs
- Precise, easy-to-understand syntactic specification of a programming language
- Efficient parsers can be automatically generated for some classes of CFGs
- This automatic generation process can reveal ambiguities that might otherwise go undetected during language design
- A well-designed grammar makes translation to object code easier
- Language evolution is expedited by an existing grammatical language description

Role of the Syntactic Analyzer
- Second phase of compilation
- Input to the parser is the output of the lexer
- Output of the parser is (usually) a parse tree
[Figure: the lexer reads the source code and hands a token to the parser on each "get next token" request; the symbol table is also shown.]

Parsers
- Universal parsers
  - Cocke-Younger-Kasami algorithm
  - Earley's algorithm
  - Both too inefficient for production compilers
- Normal parsers
  - Work only on subclasses of CFGs
  - Examples: LL, LR, LALR(1)
  - Automated tools are available for the popular subclasses

Context-free Grammar
- A context-free grammar (CFG) is a 4-tuple ⟨V_N, V_T, s, P⟩
  - V_N is a set of non-terminal symbols
  - V_T is a set of terminal symbols
  - s is a distinguished element of V_N called the start symbol
  - P is a set of productions or rules that specify how legal strings are built
  - P ⊆ V_N × (V_N ∪ V_T)*

CFG Elements
- Terminals: basic symbols from which strings are formed (typically correspond to tokens from the lexer)
- Non-terminals: syntactic variables that denote sets of strings and, in particular, language constructs
- Start symbol: a non-terminal; the set of strings denoted by the start symbol is the language defined by the grammar
- Productions: set of rules that define how terminals and non-terminals can be combined to form strings in the language
  A → bxy | z

Example
- Symbol table interpreter
  G = ⟨V_N, V_T, s, P⟩
  V_N = {S}
  V_T = {new, id, num, insert, lookup, quit}
  s = S
  P:  S → new id num
      S → insert id id num
      S → lookup id id
      S → quit

Example
- An arithmetic expression language
  G = ⟨V_N, V_T, s, P⟩
  V_N = {E}
  V_T = {id, +, *, (, ), -}
  s = E
  P:  E → E + E | E * E | (E) | -E | id

Notational Conventions (1)
- Dragon book, pages 166, 167
- Terminals
  - Lower-case letters early in the alphabet (a, b, etc.)
  - Operator symbols (+, *, etc.)
  - Punctuation symbols (parentheses, commas, etc.)
  - Digits
  - Boldface strings (id, if, etc.)

Notational Conventions (2)
- Non-terminals
  - Upper-case letters early in the alphabet (A, B, etc.)
  - The letter S, if used, is usually the start symbol
  - Lower-case italic names (expr, stmt, etc.)

Notational Conventions (3)
- Grammar symbols (either terminals or non-terminals)
  - Upper-case letters late in the alphabet (X, Y, etc.)
- Strings of terminals
  - Lower-case letters late in the alphabet (u, v, etc.)
- Strings of grammar symbols
  - Lower-case Greek letters (α, β, etc.)
  - Useful for representing generic productions

Notational Conventions (4)
- Productions with the same left side can be merged into one production using the symbol |
  A → α_1, A → α_2, ..., A → α_k
  becomes
  A → α_1 | α_2 | ... | α_k
- Unless otherwise indicated, the left side of the first listed production is the start symbol

Example
- A programming language construct
  stmt → ;
       | if ( expr ) stmt else stmt
       | while ( expr ) stmt
       | blk
       | id = expr ;
  blk  → { stmt }

Derivations
- Rewrite rule approach
  - A production is treated as a rewriting rule in which a non-terminal on the left side of the production is replaced by the grammar symbols on the right side of the production
- Begin with the start symbol and, through a sequence of derivations, produce any string in L(G)

Derivation
- Given the productions
  A → αBβ
  B → λ_1 λ_2 ... λ_n
  we can derive
  A ⇒ αBβ ⇒ αλ_1 λ_2 ... λ_n β

A Derivation
- Given the productions
  E → E + E | E * E | (E) | -E | id
  we can derive -(id + id):
  E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)

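The rewriting view of a derivation can be sketched in a few lines of Java. This is a minimal illustration, not part of the original slides: the class name DerivationDemo and its methods are hypothetical, and the grammar is read as E → E + E | E * E | (E) | -E | id. Each step replaces the leftmost E with the chosen right side.

```java
import java.util.ArrayList;
import java.util.List;

class DerivationDemo {
    // One derivation step: rewrite the leftmost occurrence of nonTerminal with rhs.
    static List<String> step(List<String> form, String nonTerminal, List<String> rhs) {
        int i = form.indexOf(nonTerminal);
        if (i < 0) throw new IllegalArgumentException("no " + nonTerminal + " in " + form);
        List<String> next = new ArrayList<>(form.subList(0, i));
        next.addAll(rhs);
        next.addAll(form.subList(i + 1, form.size()));
        return next;
    }

    // Replays E => -E => -(E) => -(E + E) => -(id + E) => -(id + id).
    static List<String> derive() {
        List<List<String>> choices = List.of(
            List.of("-", "E"), List.of("(", "E", ")"), List.of("E", "+", "E"),
            List.of("id"), List.of("id"));
        List<String> form = List.of("E");
        List<String> trace = new ArrayList<>();
        trace.add(String.join(" ", form));
        for (List<String> rhs : choices) {
            form = step(form, "E", rhs);
            trace.add(String.join(" ", form));
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" => ", derive()));
    }
}
```

Because the leftmost non-terminal is always the one rewritten, the trace is a leftmost derivation.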
Derivations
- Let α be a string of grammar symbols (terminals and non-terminals)
- α ⇒* β means β is derived from α in zero or more derivation steps
  1. α ⇒* α (base case)
  2. If α ⇒* γ and γ ⇒ β, then α ⇒* β (inductive case)

The Language of a Grammar
- Given a grammar G, the language of G is L(G)
  L(G) ⊆ V_T*
  L(G) = {w ∈ V_T* | S ⇒* w}

Sentential Forms
- Leftmost derivation
  - The leftmost non-terminal is replaced at each step
- Rightmost derivation
  - The rightmost non-terminal is replaced at each step
- Sentential form
  - A string of grammar symbols that may be obtained from the start symbol by a sequence of valid derivations
- Leftmost sentential form
  - A string of grammar symbols that may be obtained from the start symbol by a sequence of valid leftmost derivations

Regular Languages and CFLs
- All regular languages are context-free
- Consider the regular expression a*b*
- Let G = ⟨{A, B}, {a, b}, A, {A → aA | B, B → bB | ɛ}⟩

Producing a Grammar from a Regular Language
1. Construct an NFA from the regular expression
2. Each state in the NFA corresponds to a non-terminal symbol
3. For a transition from state A to state B given input symbol x, add a production of the form A → xB
4. If A is a final state, add the production A → ɛ

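The steps above can be sketched as follows for a*b*. This is a minimal sketch with assumptions: the class NfaToGrammar is hypothetical, the NFA has states A and B with A final only through B, and an ɛ-move from A to B is treated as the unit production A → B (which matches the grammar given for a*b*, but is not spelled out in the four steps).

```java
import java.util.ArrayList;
import java.util.List;

class NfaToGrammar {
    // Each edge is {fromState, inputSymbol, toState}; an empty inputSymbol marks an epsilon move.
    static List<String> productions(List<String[]> edges, List<String> finalStates) {
        List<String> out = new ArrayList<>();
        for (String[] e : edges) {
            // Step 3: a transition A --x--> B yields A -> xB (an epsilon move yields A -> B).
            out.add(e[0] + " -> " + e[1] + e[2]);
        }
        for (String f : finalStates) {
            // Step 4: a final state A yields A -> epsilon.
            out.add(f + " -> epsilon");
        }
        return out;
    }

    // Assumed NFA for a*b*: A loops on a, moves to B on epsilon; B loops on b and is final.
    static List<String> example() {
        List<String[]> edges = List.of(
            new String[] {"A", "a", "A"},
            new String[] {"A", "", "B"},
            new String[] {"B", "b", "B"});
        return productions(edges, List.of("B"));
    }

    public static void main(String[] args) {
        example().forEach(System.out::println);
    }
}
```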
Parse Trees
- A graphical representation of a sequence of derivations
- Each interior node is a non-terminal and its children are the right side of one of the non-terminal's productions
[Figure: parse tree for id + id * id. The root E has children E + E; the left E derives id and the right E derives E * E, whose operands each derive id.]

Parse Trees
- If you read the leaves of the tree from left to right they form a sentential form
  - Also called the yield or frontier of the parse tree
- All the leaves need not be terminals; the parse tree may be incomplete
  - Valid sentential forms can contain non-terminals
[Figure: the same parse tree for id + id * id.]

Ambiguity
- Given the productions
  E → E + E | E * E | (E) | id
- Derive id + id * id:
  E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
  or
  E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

Ambiguity and Parse Trees
- A grammar G is ambiguous if a string in L(G) can have more than one parse tree
[Figure: two parse trees for id + id * id. In the first the root is E + E and the right operand derives E * E; in the second the root is E * E and the left operand derives E + E.]

Consequences of Ambiguity
- Ambiguity is generally bad
- It often means there is more than one way to interpret a string
  - Add before multiply, or multiply before add?
- An ambiguous grammar should be rewritten to remove the ambiguity

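Removing the Ambiguity
- Consider the rewritten productions
  E → T | E + T
  T → F | T * F
  F → (E) | id
- Here only one parse tree is possible
[Figure: the single parse tree for id + id * id. The root E has children E + T; the left E derives T, then F, then id; the right T derives T * F, with T deriving F then id, and F deriving id.]
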
Disambiguating Rules
- Can we provide rules for disambiguating id + (id * id) from (id + id) * id?

Top-down Parsing
- Recursive descent is an example
- Grows the parse tree from the root down to the leaves
- Useful for recognizing flow-of-control constructs since they are always labeled with a keyword (e.g., if, while, do, for)
- Requires each production for the same non-terminal to begin with a unique token

Left factoring
- Can be used to factor out a common prefix in two or more productions
- For example, to parse if...then vs. if...then...else
  C → if E then S else S | if E then S
- Left factor the grammar (factor out the common left expression):
  C → if E then S X
  X → else S | ɛ

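A hedged sketch of the factoring step for a single non-terminal follows. The class LeftFactor and the fresh non-terminal name passed in are hypothetical, and the sketch assumes that all alternatives share one non-empty common prefix; a full implementation would group alternatives by prefix and repeat.

```java
import java.util.ArrayList;
import java.util.List;

class LeftFactor {
    // Longest prefix shared by every alternative.
    static List<String> commonPrefix(List<List<String>> alts) {
        List<String> prefix = new ArrayList<>(alts.get(0));
        for (List<String> alt : alts) {
            int n = 0;
            while (n < prefix.size() && n < alt.size() && prefix.get(n).equals(alt.get(n))) n++;
            prefix = new ArrayList<>(prefix.subList(0, n));
        }
        return prefix;
    }

    // Rewrites lhs -> prefix t1 | prefix t2 | ... as lhs -> prefix fresh, fresh -> t1 | t2 | ...
    static List<String> factor(String lhs, String fresh, List<List<String>> alts) {
        List<String> prefix = commonPrefix(alts);
        List<String> tails = new ArrayList<>();
        for (List<String> alt : alts) {
            List<String> tail = alt.subList(prefix.size(), alt.size());
            tails.add(tail.isEmpty() ? "epsilon" : String.join(" ", tail));
        }
        return List.of(
            lhs + " -> " + String.join(" ", prefix) + " " + fresh,
            fresh + " -> " + String.join(" | ", tails));
    }

    static List<String> example() {
        return factor("C", "X", List.of(
            List.of("if", "E", "then", "S", "else", "S"),
            List.of("if", "E", "then", "S")));
    }

    public static void main(String[] args) {
        example().forEach(System.out::println);
    }
}
```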
Top-down Parsing
- Two requirements
  - Left-factor the grammar
    - Produce a grammar in which no productions for the same non-terminal have a common prefix
  - No left recursion
    - A ⇒+ Aα
    - The parser could get into an infinite loop

Top-down Parsing
- Top-down parsing produces a sequence of leftmost derivations
  A → Bx | Cy
  B → z
  C → w
- Produces two strings: zx and wy

Top-down Parsers
- Two common approaches are used in top-down parsing
  - Recursive descent parser
    - Recursive
    - The structure of the grammar is hard-coded into the parsing program
  - Table-driven parser
    - Non-recursive
    - The structure of the language is encoded in a parse table

Recursive Descent
- Relatively easy to implement
- Reads the input stream (from the scanner) left to right and verifies its correctness
- Perl has a recursive descent parser generator module (Parse::RecDescent)
- Recursive, since parsing is accomplished via recursive procedures
- Descent, since parsing is top-down (descends from the root down the branches to the leaves)

Recursive Descent
- Each non-terminal is a subroutine call
  A → Bx | Cy
  B → z
  C → w
[Figure: transition diagrams for A, B, and C, with states numbered 0 to 9 and edges labeled B, x, C, y, z, and w.]

Recursive Descent
- A candidate grammar (bad because of left recursion):
  E → T | E + T
  T → F | T * F
  F → (E) | d
- The grammar can be modified to support a recursive descent parser:
  E  → T E'
  E' → + T E' | ɛ
  T  → F T'
  T' → * F T' | ɛ
  F  → (E) | d

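The rewrite used here is the usual removal of immediate left recursion: A → Aα | β becomes A → βA' and A' → αA' | ɛ. A minimal sketch of that single transformation is shown below; the class LeftRecursion is hypothetical, it handles only immediate left recursion, and it assumes the non-terminal really has at least one left-recursive and one non-recursive alternative.

```java
import java.util.ArrayList;
import java.util.List;

class LeftRecursion {
    // Rewrites A -> A alpha | beta as A -> beta A' and A' -> alpha A' | epsilon.
    static List<String> eliminate(String a, List<List<String>> alts) {
        String fresh = a + "'";
        List<String> base = new ArrayList<>();      // alternatives of A
        List<String> recursive = new ArrayList<>(); // alternatives of A'
        for (List<String> alt : alts) {
            if (!alt.isEmpty() && alt.get(0).equals(a)) {
                // Left-recursive alternative A alpha: keep alpha, then loop through A'.
                recursive.add(String.join(" ", alt.subList(1, alt.size())) + " " + fresh);
            } else {
                base.add((String.join(" ", alt) + " " + fresh).trim());
            }
        }
        recursive.add("epsilon");
        return List.of(
            a + " -> " + String.join(" | ", base),
            fresh + " -> " + String.join(" | ", recursive));
    }

    public static void main(String[] args) {
        eliminate("E", List.of(List.of("T"), List.of("E", "+", "T"))).forEach(System.out::println);
        eliminate("T", List.of(List.of("F"), List.of("T", "*", "F"))).forEach(System.out::println);
    }
}
```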
Generalized Parser
    public abstract class RecursiveDescent {
        private String input;
        protected int cursor = 0;

        public RecursiveDescent() {
            getInputString();
            if (parse() && cursor == input.length()) {
                System.out.println("Accept");
            } else {
                error();
            }
        }

        // Skips blanks and tabs, then consumes one character and reports whether it is ch.
        // The cursor advances even on a mismatch; callers restore it when backtracking.
        protected final boolean checkNextToken(char ch) {
            // Ignore whitespace
            while (cursor < input.length()
                    && (input.charAt(cursor) == ' ' || input.charAt(cursor) == '\t')) {
                cursor++;
            }
            return (cursor < input.length()) ? input.charAt(cursor++) == ch : false;
        }

        protected static void error() {
            System.out.println("Invalid string");
            System.exit(1);
        }

        protected final void getInputString() {
            // Console.In is not a JDK class; it is assumed to be an input helper
            // supplied with the course code.
            input = Console.In.getString();
        }

        public abstract boolean parse();
    }

Subclass for Given Grammar (1)
    public class Expression extends RecursiveDescent {
        /*
         * Original Grammar:
         *   E -> T | E + T
         *   T -> F | T * F
         *   F -> ( E ) | d
         *
         * Adapted Grammar:
         *   E  -> T E'
         *   E' -> + T E' | e
         *   T  -> F T'
         *   T' -> * F T' | e
         *   F  -> ( E ) | d
         *
         * Note method names: E1() => E' and T1() => T'
         */
        public boolean parse() {
            return E();
        }

        public static void main(String[] args) {
            new Expression();
        }
        // Continued...

Subclass for Given Grammar (2)
        // Production: E → T E'
        private boolean E() {
            int pos = cursor;
            // E -> T E'
            if (T() && E1()) {
                return true;
            }
            cursor = pos; // Backtrack
            return false;
        }

Subclass for Given Grammar (3)
        // Production: E' → + T E' | ɛ
        private boolean E1() {
            int pos = cursor;
            // E' -> + T E'
            if (checkNextToken('+') && T() && E1()) {
                return true;
            }
            cursor = pos; // Backtrack
            // E' -> e
            return true;
        }

Subclass for Given Grammar (4)
        // Production: T → F T'
        private boolean T() {
            int pos = cursor;
            // T -> F T'
            if (F() && T1()) {
                return true;
            }
            cursor = pos; // Backtrack
            return false;
        }

Subclass for Given Grammar (5)
        // Production: T' → * F T' | ɛ
        private boolean T1() {
            int pos = cursor;
            // T' -> * F T'
            if (checkNextToken('*') && F() && T1()) {
                return true;
            }
            cursor = pos; // Backtrack
            // T' -> e
            return true;
        }

Subclass for Given Grammar (6)
        // Production: F → (E) | d
        private boolean F() {
            int pos = cursor;
            // F -> ( E )
            if (checkNextToken('(') && E() && checkNextToken(')')) {
                return true;
            }
            cursor = pos; // Backtrack
            // F -> d
            if (checkNextToken('d')) {
                return true;
            }
            cursor = pos; // Backtrack
            return false;
        }
    }

Backtracking
- The example recursive descent parser used backtracking
- Recursive descent parsing is criticized as being inefficient due to backtracking
- Some grammars can be written so that no backtracking is required
  - The right side of the production starts with a terminal, so you know immediately which production to apply
- A top-down parser that requires no backtracking is called a predictive parser

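For comparison, here is a minimal predictive sketch for the same adapted grammar. It is an illustration only: the class PredictiveParser is hypothetical, each token is assumed to be a single character, and '$' is appended as an end marker. Every choice between alternatives is made by peeking at one character, so the cursor is never reset.

```java
class PredictiveParser {
    private final String input;
    private int pos = 0;

    PredictiveParser(String text) {
        // Strip whitespace and append the end marker.
        input = text.replaceAll("\\s+", "") + "$";
    }

    private char peek() { return input.charAt(pos); }

    private boolean match(char c) {
        if (peek() == c) { pos++; return true; }
        return false;
    }

    // E -> T E'
    private boolean e() { return t() && e1(); }

    // E' -> + T E' | epsilon, chosen by the lookahead character
    private boolean e1() {
        if (peek() == '+') { pos++; return t() && e1(); }
        return true;
    }

    // T -> F T'
    private boolean t() { return f() && t1(); }

    // T' -> * F T' | epsilon, chosen by the lookahead character
    private boolean t1() {
        if (peek() == '*') { pos++; return f() && t1(); }
        return true;
    }

    // F -> ( E ) | d
    private boolean f() {
        if (peek() == '(') { pos++; return e() && match(')'); }
        return match('d');
    }

    // Accept only if E matched and nothing but the end marker remains.
    static boolean accepts(String text) {
        PredictiveParser p = new PredictiveParser(text);
        return p.e() && p.pos == p.input.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(accepts("(d + d) * d"));
        System.out.println(accepts("d + * d"));
    }
}
```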
The Bad News
- Some grammars cannot be processed with a top-down parser
- We need to determine the characteristics required to make a top-down parser feasible

Preprocessing Needed
- FIRST(α) is the set of terminals that begin strings derived from α
  A → Bx | Cy
  B → z
  C → w
  FIRST(B) = {z}
  FIRST(C) = {w}
  FIRST(A) = {z, w}

One Criterion
- Given a production of the form A → α | β, if FIRST(α) ∩ FIRST(β) ≠ ∅, then a predictive (non-backtracking) top-down parser cannot be used

ɛ Productions
- ɛ productions complicate the situation
- FOLLOW(A) is the set of terminals that can appear immediately to the right of A in some sentential form
  A → Bx | Cy
  B → z | ɛ
  C → w
  FIRST(B) = {z, ɛ}
  FIRST(C) = {w}
  FIRST(A) = {z, x, w}   (x is included because B can derive ɛ)
  FOLLOW(B) = {x}
  FOLLOW(C) = {y}
  FOLLOW(A) = {$}   (end of input)

FOLLOW
- Without any ɛ productions, FIRST would be sufficient
- Formally:
  If X ∈ V_N ∪ V_T, then
    FIRST(X) = {X}                          if X ∈ V_T
    FIRST(X) = {a | a ∈ V_T and X ⇒* aβ}    otherwise
  If A ∈ V_N, then
    FOLLOW(A) = {a | a ∈ V_T and S ⇒* αAaβ}
- How do we compute FIRST and FOLLOW?

FIRST Computation
    SetOfTerminalSymbols FIRST(GrammarSymbol X) {
        if ( X is a terminal )
            F ← {X};                         // FIRST(X) is just X
        else {
            F ← ∅;
            if ( X → ɛ is a production )
                F ← F ∪ {ɛ};                 // Add ɛ to FIRST(X)
            if ( X → y_1 y_2 ... y_n is a production ) {
                if ( ∃ i such that ɛ ∈ FIRST(y_1), ɛ ∈ FIRST(y_2), ..., ɛ ∈ FIRST(y_{i-1}), and a ∈ FIRST(y_i) )
                    F ← F ∪ {a};
                if ( ɛ ∈ FIRST(y_1), ɛ ∈ FIRST(y_2), ..., ɛ ∈ FIRST(y_n) )
                    F ← F ∪ {ɛ};             // Add ɛ to FIRST(X)
            }
        }
        return F;
    }

FIRST
- In a nutshell:
  - If A cannot derive ɛ, then FIRST(A) = {a ∈ V_T | A ⇒* aβ}
  - Else, if A ⇒* ɛ, then FIRST(A) = {a ∈ V_T | A ⇒* aβ} ∪ {ɛ}

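One common way to turn the recursive definition into running code is a fixed-point iteration: start with empty sets and keep applying the rules until nothing changes. The sketch below does this for the adapted expression grammar. The class FirstSets, the map-based grammar encoding, and the string "epsilon" standing for ɛ are all assumptions made for illustration.

```java
import java.util.*;

class FirstSets {
    static final String EPS = "epsilon";

    // FIRST of a symbol string; any symbol that is not a key of g is a terminal.
    static Set<String> firstOf(List<String> symbols, Map<String, List<List<String>>> g,
                               Map<String, Set<String>> first) {
        Set<String> out = new TreeSet<>();
        for (String s : symbols) {
            if (!g.containsKey(s)) { out.add(s); return out; }
            for (String t : first.get(s)) if (!t.equals(EPS)) out.add(t);
            if (!first.get(s).contains(EPS)) return out;
        }
        out.add(EPS); // every symbol was nullable (or the string was empty)
        return out;
    }

    // Grammar: non-terminal -> alternatives; an empty alternative is an epsilon production.
    static Map<String, Set<String>> compute(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : g.keySet()) first.put(nt, new TreeSet<>());
        boolean changed = true;
        while (changed) { // repeat until no set grows
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                for (List<String> alt : e.getValue()) {
                    changed |= first.get(e.getKey()).addAll(firstOf(alt, g, first));
                }
            }
        }
        return first;
    }

    static Map<String, List<List<String>>> exampleGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("E", List.of(List.of("T", "E'")));
        g.put("E'", List.of(List.of("+", "T", "E'"), List.<String>of()));
        g.put("T", List.of(List.of("F", "T'")));
        g.put("T'", List.of(List.of("*", "F", "T'"), List.<String>of()));
        g.put("F", List.of(List.of("(", "E", ")"), List.of("d")));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(compute(exampleGrammar()));
    }
}
```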
FOLLOW Computation
    SetOfTerminalSymbols FOLLOW(NonTerminalSymbol A) {
        F ← ∅;
        if ( A is the start symbol )
            F ← F ∪ {$};
        if ( B → αAβ is a production )                      // α can be ɛ
            F ← F ∪ (FIRST(β) − {ɛ});
        if ( C → αA or (C → αAγ and ɛ ∈ FIRST(γ)) )
            F ← F ∪ FOLLOW(C);
        return F;
    }

FOLLOW
- In a nutshell:
  - If A can never end a sentential form (there is no α with S ⇒* αA), then FOLLOW(A) = {a ∈ V_T | S ⇒* αAaβ}
  - Else, if S ⇒* αA, then FOLLOW(A) = {a ∈ V_T | S ⇒* αAaβ} ∪ {$}

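FOLLOW can be computed with the same fixed-point idea, using FIRST as a helper. The following is a sketch under assumptions: the class FollowSets, the grammar encoding, and the strings "epsilon" and "$" are illustrative choices, not part of the original material.

```java
import java.util.*;

class FollowSets {
    static final String EPS = "epsilon";
    static final String END = "$";

    // FIRST of a symbol string; any symbol that is not a key of g is a terminal.
    static Set<String> firstOf(List<String> symbols, Map<String, List<List<String>>> g,
                               Map<String, Set<String>> first) {
        Set<String> out = new TreeSet<>();
        for (String s : symbols) {
            if (!g.containsKey(s)) { out.add(s); return out; }
            for (String t : first.get(s)) if (!t.equals(EPS)) out.add(t);
            if (!first.get(s).contains(EPS)) return out;
        }
        out.add(EPS);
        return out;
    }

    static Map<String, Set<String>> firstSets(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : g.keySet()) first.put(nt, new TreeSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet())
                for (List<String> alt : e.getValue())
                    changed |= first.get(e.getKey()).addAll(firstOf(alt, g, first));
        }
        return first;
    }

    static Map<String, Set<String>> follow(Map<String, List<List<String>>> g, String start) {
        Map<String, Set<String>> first = firstSets(g);
        Map<String, Set<String>> follow = new HashMap<>();
        for (String nt : g.keySet()) follow.put(nt, new TreeSet<>());
        follow.get(start).add(END); // the start symbol is followed by end of input
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                for (List<String> alt : e.getValue()) {
                    for (int i = 0; i < alt.size(); i++) {
                        String a = alt.get(i);
                        if (!g.containsKey(a)) continue; // only non-terminals have FOLLOW sets
                        // For B -> alpha A beta: add FIRST(beta) minus epsilon.
                        Set<String> rest = firstOf(alt.subList(i + 1, alt.size()), g, first);
                        boolean betaNullable = rest.remove(EPS);
                        changed |= follow.get(a).addAll(rest);
                        // If beta is empty or nullable: add FOLLOW of the left side.
                        if (betaNullable)
                            changed |= follow.get(a).addAll(new ArrayList<>(follow.get(e.getKey())));
                    }
                }
            }
        }
        return follow;
    }

    static Map<String, List<List<String>>> exampleGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("E", List.of(List.of("T", "E'")));
        g.put("E'", List.of(List.of("+", "T", "E'"), List.<String>of()));
        g.put("T", List.of(List.of("F", "T'")));
        g.put("T'", List.of(List.of("*", "F", "T'"), List.<String>of()));
        g.put("F", List.of(List.of("(", "E", ")"), List.of("d")));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(follow(exampleGrammar(), "E"));
    }
}
```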
FIRST and FOLLOW Example
- Compute the FIRST and FOLLOW sets for the grammar from which our recursive descent parser was built:
  E  → T E'
  E' → + T E' | ɛ
  T  → F T'
  T' → * F T' | ɛ
  F  → (E) | d

FIRST and FOLLOW Example
  E  → T E'
  E' → + T E' | ɛ
  T  → F T'
  T' → * F T' | ɛ
  F  → (E) | d
- The solution:
  FIRST(+) = {+}     FIRST(*) = {*}     FIRST(d) = {d}
  FIRST(() = {(}     FIRST()) = {)}
  FIRST(E)  = {(, d}        FOLLOW(E)  = {$, )}
  FIRST(E') = {ɛ, +}        FOLLOW(E') = {$, )}
  FIRST(T)  = {(, d}        FOLLOW(T)  = {+, ), $}
  FIRST(T') = {ɛ, *}        FOLLOW(T') = {+, ), $}
  FIRST(F)  = {(, d}        FOLLOW(F)  = {*, +, ), $}

LL(1) Grammar
- L: scanning left to right
- L: leftmost derivation
- 1: one symbol of lookahead
- LL(2), ..., LL(k) means 2, ..., k lookahead symbols
- Most parsers have just one symbol of lookahead

LL(1) Grammar
- Formally, a grammar is LL(1) if and only if whenever A → α | β
  1. FIRST(α) ∩ FIRST(β) = ∅
  2. At most one of α or β can derive ɛ
  3. If β ⇒* ɛ, then α does not derive any string that starts with a terminal in FOLLOW(A)
- All LL(1) grammars can be parsed by a recursive descent parser, and a predictive (non-backtracking) recursive descent parser with one token of lookahead can parse only LL(1) grammars

Common Prefixes
- Recall the common prefix example:
  C → if E then S else S | if E then S
  FIRST(if E then S else S) = {if}
  FIRST(if E then S) = {if}
- Thus the grammar is not LL(1), but the factored grammar is LL(1) (but ambiguous):
  C → if E then S X
  X → else S | ɛ

Left Recursion
- Consider the grammar:
  E → E + d | d
  FIRST(E + d) = {d}
  FIRST(d) = {d}
- Thus the grammar is not LL(1)
- A recursive descent parser would succumb to infinite recursion

Parse Table from FIRST, FOLLOW
- If more than one production matches a table entry, then the grammar is not LL(1)
  - For any two productions P_i, P_j of the same non-terminal, FIRST(P_i) ∩ FIRST(P_j) = ∅
- If A → α and b ∈ FIRST(α), then parseTable[A][b] = A → α
- If X → α and ɛ ∈ FIRST(α), then for each b ∈ FOLLOW(X), parseTable[X][b] = X → α

Parse Table for Example Grammar
- Build an LL(1) parse table for our sample grammar:
  E  → T E'
  E' → + T E' | ɛ
  T  → F T'
  T' → * F T' | ɛ
  F  → (E) | d
- FIRST and FOLLOW sets:
  FIRST(+) = {+}     FIRST(*) = {*}     FIRST(d) = {d}
  FIRST(() = {(}     FIRST()) = {)}
  FIRST(E)  = {(, d}        FOLLOW(E)  = {$, )}
  FIRST(E') = {ɛ, +}        FOLLOW(E') = {$, )}
  FIRST(T)  = {(, d}        FOLLOW(T)  = {+, ), $}
  FIRST(T') = {ɛ, *}        FOLLOW(T') = {+, ), $}
  FIRST(F)  = {(, d}        FOLLOW(F)  = {*, +, ), $}

Parse Table for Example Grammar
- The solution:
  Top of   Input Symbol
  Stack    d           +              *              (           )          $
  E        E → T E'                                  E → T E'
  E'                   E' → + T E'                               E' → ɛ     E' → ɛ
  T        T → F T'                                  T → F T'
  T'                   T' → ɛ         T' → * F T'                T' → ɛ     T' → ɛ
  F        F → d                                     F → (E)

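The two table-filling rules can be sketched directly, taking the FIRST and FOLLOW sets listed for the sample grammar as given input. The class Ll1Table and its encoding ("epsilon" for ɛ, nested maps for the table) are assumptions for illustration. A second production landing in an occupied cell is reported as an LL(1) conflict.

```java
import java.util.*;

class Ll1Table {
    static final String EPS = "epsilon";

    // FIRST of a right side, given FIRST of each non-terminal (terminals are their own FIRST).
    static Set<String> firstOf(List<String> rhs, Map<String, Set<String>> first) {
        Set<String> out = new TreeSet<>();
        for (String s : rhs) {
            if (!first.containsKey(s)) { out.add(s); return out; }
            for (String t : first.get(s)) if (!t.equals(EPS)) out.add(t);
            if (!first.get(s).contains(EPS)) return out;
        }
        out.add(EPS);
        return out;
    }

    static Map<String, Map<String, String>> build(Map<String, List<List<String>>> g,
            Map<String, Set<String>> first, Map<String, Set<String>> follow) {
        Map<String, Map<String, String>> table = new TreeMap<>();
        for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
            String a = e.getKey();
            Map<String, String> row = table.computeIfAbsent(a, k -> new TreeMap<>());
            for (List<String> rhs : e.getValue()) {
                String prod = a + " -> " + (rhs.isEmpty() ? EPS : String.join(" ", rhs));
                // Rule 1: every terminal in FIRST(rhs); rule 2: FOLLOW(a) when rhs is nullable.
                Set<String> lookaheads = firstOf(rhs, first);
                if (lookaheads.remove(EPS)) lookaheads.addAll(follow.get(a));
                for (String b : lookaheads) {
                    if (row.put(b, prod) != null)
                        throw new IllegalStateException("not LL(1): conflict at [" + a + "][" + b + "]");
                }
            }
        }
        return table;
    }

    static Map<String, Map<String, String>> example() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("E", List.of(List.of("T", "E'")));
        g.put("E'", List.of(List.of("+", "T", "E'"), List.<String>of()));
        g.put("T", List.of(List.of("F", "T'")));
        g.put("T'", List.of(List.of("*", "F", "T'"), List.<String>of()));
        g.put("F", List.of(List.of("(", "E", ")"), List.of("d")));
        Map<String, Set<String>> first = Map.of(
            "E", Set.of("(", "d"), "E'", Set.of("+", EPS), "T", Set.of("(", "d"),
            "T'", Set.of("*", EPS), "F", Set.of("(", "d"));
        Map<String, Set<String>> follow = Map.of(
            "E", Set.of("$", ")"), "E'", Set.of("$", ")"), "T", Set.of("+", ")", "$"),
            "T'", Set.of("+", ")", "$"), "F", Set.of("*", "+", ")", "$"));
        return build(g, first, follow);
    }

    public static void main(String[] args) {
        example().forEach((nt, row) -> System.out.println(nt + ": " + row));
    }
}
```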
LL(1) Table-driven Parser
[Figure: the LL parser reads the input a_1 a_2 a_3 ... a_n $, maintains a stack, consults the parse table, and produces output.]

LL(1) Parsing Algorithm
    LL Parser() {
        stack.push(S);                        // Push start symbol onto empty stack
        a ← scanner.getNextToken();           // Get next token
        while ( not stack.empty() ) {
            X ← stack.top();                  // Look at top of stack
            if ( X is a non-terminal and parseTable[X][a] = X → y_1 ... y_k ) {
                stack.pop();                  // Pop off top item
                stack.push(y_k ... y_1);      // Push right side symbols on in reverse order
            }
            else if ( X = a ) {
                stack.pop();                  // Pop off top item
                a ← scanner.getNextToken();   // Get next token
            }
            else
                Error();                      // Illegal string
        }
    }

Parsing Example
  Stack          Input          Rule
  $ E            d + d * d $    E → T E'
  $ E' T         d + d * d $    T → F T'
  $ E' T' F      d + d * d $    F → d
  $ E' T' d      d + d * d $
  $ E' T'        + d * d $      T' → ɛ
  $ E'           + d * d $      E' → + T E'
  $ E' T +       + d * d $
  $ E' T         d * d $        T → F T'
  $ E' T' F      d * d $        F → d
  $ E' T' d      d * d $
  $ E' T'        * d $          T' → * F T'
  $ E' T' F *    * d $
  $ E' T' F      d $            F → d
  $ E' T' d      d $
  $ E' T'        $              T' → ɛ
  $ E'           $              E' → ɛ
  $              $              Accept

Another Parsing Example
  Stack                  Input            Rule
  $ E                    (d + d) * d $    E → T E'
  $ E' T                 (d + d) * d $    T → F T'
  $ E' T' F              (d + d) * d $    F → (E)
  $ E' T' ) E (          (d + d) * d $
  $ E' T' ) E            d + d) * d $     E → T E'
  $ E' T' ) E' T         d + d) * d $     T → F T'
  $ E' T' ) E' T' F      d + d) * d $     F → d
  $ E' T' ) E' T' d      d + d) * d $
  $ E' T' ) E' T'        + d) * d $       T' → ɛ
  $ E' T' ) E'           + d) * d $       E' → + T E'
  $ E' T' ) E' T +       + d) * d $
  $ E' T' ) E' T         d) * d $         T → F T'
  $ E' T' ) E' T' F      d) * d $         F → d
  $ E' T' ) E' T' d      d) * d $
  $ E' T' ) E' T'        ) * d $          T' → ɛ
  $ E' T' ) E'           ) * d $          E' → ɛ
  $ E' T' )              ) * d $
  $ E' T'                * d $            T' → * F T'
  $ E' T' F *            * d $
  $ E' T' F              d $              F → d
  $ E' T' d              d $
  $ E' T'                $                T' → ɛ
  $ E'                   $                E' → ɛ
  $                      $                Accept

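The table-driven algorithm and the two traces can be reproduced with a short driver. This is a minimal sketch, assuming single-character tokens and a hand-encoded copy of the parse table for the sample grammar; the class Ll1Parser and its rule helper are hypothetical names. The end marker "$" is pushed below the start symbol so that the stack empties exactly when the input is consumed.

```java
import java.util.*;

class Ll1Parser {
    static final Set<String> NON_TERMINALS = Set.of("E", "E'", "T", "T'", "F");
    // Key "X a" maps non-terminal X and lookahead a to the right side to push.
    static final Map<String, String[]> TABLE = new HashMap<>();

    static void rule(String nt, String lookaheads, String... rhs) {
        for (String la : lookaheads.split(" ")) TABLE.put(nt + " " + la, rhs);
    }

    static {
        rule("E", "d (", "T", "E'");
        rule("E'", "+", "+", "T", "E'");
        rule("E'", ") $");                 // E' -> epsilon
        rule("T", "d (", "F", "T'");
        rule("T'", "*", "*", "F", "T'");
        rule("T'", "+ ) $");               // T' -> epsilon
        rule("F", "d", "d");
        rule("F", "(", "(", "E", ")");
    }

    static boolean parse(String input) {
        List<String> tokens = new ArrayList<>();
        for (char c : input.toCharArray())
            if (!Character.isWhitespace(c)) tokens.add(String.valueOf(c));
        tokens.add("$");
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("E");                   // start symbol on top
        int i = 0;
        while (!stack.isEmpty()) {
            String x = stack.peek();
            String a = tokens.get(i);
            if (NON_TERMINALS.contains(x)) {
                String[] rhs = TABLE.get(x + " " + a);
                if (rhs == null) return false;          // empty table cell: illegal string
                stack.pop();
                for (int k = rhs.length - 1; k >= 0; k--) stack.push(rhs[k]); // reverse order
            } else if (x.equals(a)) {
                stack.pop();                            // terminal matches lookahead
                i++;
            } else {
                return false;
            }
        }
        return i == tokens.size();
    }

    public static void main(String[] args) {
        System.out.println(parse("d + d * d"));
        System.out.println(parse("(d + d) * d"));
    }
}
```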
Try a Non-LL(1) Grammar
  E → E + id | id
- Observe FIRST(E + id) = FIRST(id) = {id}
- Recursive descent parser: infinite recursion
- Parse table:
  Top of   Input Symbol
  Stack    id                        $
  E        E → id, E → E + id
  The entry for [E][id] holds two productions, so the grammar is not LL(1)

Top-down Parsing Summary
- To produce a top-down parser:
  1. Eliminate left recursion and common prefixes; this often, though not always, yields an LL(1) grammar
  2. Find the FIRST and FOLLOW sets
  3. Build either the recursive descent parser methods or the parsing table

Limitations of LL(1) Grammars
- In many cases a grammar G_1 can be easily devised to represent strings in a language L(G_1), but G_1 is not LL(1)
- Sometimes G_1 can be rewritten to form G_2, where L(G_1) = L(G_2) and G_2 is LL(1)
- Some context-free languages have no LL(1) grammars

Bottom-up Parsing
- Grows the parse tree from the leaves up
- Only two choices when scanning input
  - shift a symbol onto the stack
  - reduce
- The parser reduces in the reverse order of a rightmost derivation
- Bottom-up parsers are more powerful than top-down parsers
  - They can be used to parse a larger variety of grammars

Reduction
  E → E + E | E * E | (E) | -E | id

  E ⇒ E + E
    ⇒ E + E * E
    ⇒ E + E * id
    ⇒ E + id * id
    ⇒ id + id * id
- The parser produces this rightmost derivation in reverse

Handles
- A handle of a string is a substring that matches the right side of a production and whose reduction to the non-terminal on the left side represents one step along the reverse of a rightmost derivation
- For unambiguous grammars, every right-sentential form has a unique handle

Handle
- More formally
  - A handle of a right-sentential form γ is a production A → β and a position in γ where β can be found
  - If (A → β, k) is a handle, then replacing β in γ at position k with A produces the previous right-sentential form in a rightmost derivation of γ
  - The substring to the right of a handle contains only terminal symbols

Handle Pruning
- Begin with the string to parse
- Find a handle and replace it with the left side of a production that produces that handle
- Repeat until only the start symbol remains

Handle Pruning Example
  E → E + T | T
  T → T * F | F
  F → d

  Sentential Form    Handle
  d + d * d          (F → d, 1)
  F + d * d          (T → F, 1)
  T + d * d          (E → T, 1)
  E + d * d          (F → d, 3)
  E + F * d          (T → F, 3)
  E + T * d          (F → d, 5)
  E + T * F          (T → T * F, 3)
  E + T              (E → E + T, 1)
  E
- Observe that this is a rightmost derivation in reverse

Shift-Reduce Parsing
- Two problems to solve
  - Find the substring to be reduced in a right-sentential form
  - Determine which production to choose in case more than one production has that substring on its right side

Overview of Process
- The stack contains states and grammar symbols
- Grammar symbols on the stack represent a viable prefix
[Figure: the LR parser reads the input a_1 a_2 a_3 ... a_n $, maintains a stack, and consults a parse table made of an Action part and a Goto part.]

Parse Table
- Action
  - shift
  - reduce
- Goto
  - Next state
[Figure: the same LR parser diagram, highlighting the Action and Goto parts of the parse table.]

Parse Table Actions
- Shift
  - Pushes the input symbol and a state onto the stack
- Reduce
  - Replaces a string of symbols on the stack with a non-terminal
- Symbols on the stack can be either terminals or non-terminals
[Figure: the same LR parser diagram.]

Shift-Reduce Parsing
- Stack holds grammar symbols
  - $ indicates the bottom of the stack
- Input buffer for the string to be parsed
  - $ indicates the end of the string
- Parser activity
  - Shifts zero or more input symbols onto the stack until a handle β is on the top of the stack
  - β is then reduced to the left side of a production

Shift-Reduce Parsing
- Initial parser state
  Stack: $     Input: w$
  (The stack grows to the right; the string is consumed from left to right)
- Final parser state (if no errors)
  Stack: $S    Input: $
- Parser actions
  - Shift the next input symbol to the top of the stack
  - Reduce the handle on top of the stack to a non-terminal
  - Accept when the string is consumed and S is on the stack
  - Error when the string cannot be parsed

Viable Prefix
- A prefix of a right-sentential form that can appear on the stack of a shift-reduce parser

Types of Bottom-up Parsers
- SLR
  - Simple LR
  - Built from LR(0) items, which carry no lookahead
- LR
  - LR(1), more powerful, but requires a lot of memory
- LALR
  - Look-ahead LR
  - Yacc is LALR(1)

SLR
- We'll concentrate on SLR since it is the simplest form
- To construct an SLR parse table we need items
  - An item consists of a production and a numeric position within that production
  - An item encodes where you are in a production

Expression Grammar
  E → E + E | E * E | (E) | id
- Compare to
  E → E + T | T
  T → T * F | F
  F → (E) | id

Canonical LR(0) States
1. Augment the grammar by adding a new production S' → S
2. The closure operation sets up states
3. The goto operation computes transitions between states

LR(0) Items
- An LR(0) item of a grammar G is a production of G with a dot (·) at some position of the right side
- Example: four items can be derived from the production A → XYZ
  A → ·XYZ
  A → X·YZ
  A → XY·Z
  A → XYZ·

Interpreting LR(0) Items
- An item indicates how much of a production we have seen at a given point in the parsing process
- The item [A → X·YZ] means we have seen a string derivable from X and hope to see a string derivable from YZ

Closure Algorithm
    ItemSet closure(ItemSet I) {
        J ← I;
        do {
            J_old ← J;
            for each item [A → α·Bβ] ∈ J and each production B → γ ∈ G do {   // B is a non-terminal
                J ← J ∪ {[B → ·γ]};
            }
        } while ( J ≠ J_old );
        return J;
    }
- If one B-production is added to the closure with a dot on the left end, then all B-productions will be added to the closure

Closure
  closure([E → E + ·T]) =
    E → E + ·T
    T → ·T * F
    T → ·F
    F → ·(E)
    F → ·id

goto Function
- goto(I, X)
  - I is a set of items (really just a state)
  - X is a grammar symbol
- goto(I, X) is defined as the closure of the set of all items [A → αX·β] such that [A → α·Xβ] is in I
- Intuitively, if I is the set of items valid for a viable prefix γ, then goto(I, X) is the set of items valid for the viable prefix γX

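Both operations can be sketched compactly for the expression grammar. The encoding below is an assumption made for brevity: the class Lr0Items is hypothetical, the terminal d stands in for id, and an item is stored as the integer productionIndex * 10 + dotPosition, which works here because no right side is longer than three symbols.

```java
import java.util.*;

class Lr0Items {
    // Augmented grammar; each row is {left side, right side symbols...}. Production 0 is S' -> E.
    static final String[][] P = {
        {"S'", "E"}, {"E", "E", "+", "T"}, {"E", "T"}, {"T", "T", "*", "F"},
        {"T", "F"}, {"F", "(", "E", ")"}, {"F", "d"}
    };

    // Symbol immediately after the dot, or null when the dot is at the right end.
    static String symbolAfterDot(int item) {
        String[] p = P[item / 10];
        int dot = item % 10;
        return dot + 1 < p.length ? p[dot + 1] : null;
    }

    static Set<Integer> closure(Set<Integer> items) {
        Set<Integer> j = new TreeSet<>(items);
        Deque<Integer> work = new ArrayDeque<>(items);
        while (!work.isEmpty()) {
            String b = symbolAfterDot(work.pop());
            if (b == null) continue;
            // Add [B -> . gamma] for every production of B not already present.
            for (int k = 0; k < P.length; k++)
                if (P[k][0].equals(b) && j.add(k * 10)) work.push(k * 10);
        }
        return j;
    }

    static Set<Integer> goTo(Set<Integer> items, String x) {
        Set<Integer> moved = new TreeSet<>();
        for (int item : items)
            if (x.equals(symbolAfterDot(item))) moved.add(item + 1); // advance the dot over x
        return closure(moved);
    }

    // Renders an item such as 12 as "E -> E + . T".
    static String show(int item) {
        String[] p = P[item / 10];
        StringBuilder sb = new StringBuilder(p[0] + " ->");
        for (int i = 1; i <= p.length; i++) {
            if (i - 1 == item % 10) sb.append(" .");
            if (i < p.length) sb.append(" ").append(p[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int item : closure(Set.of(12))) System.out.println(show(item));
    }
}
```

Calling closure(Set.of(12)) corresponds to closure([E → E + ·T]), and goTo(closure(Set.of(0)), "E") gives the state reached from the initial state on E.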
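LR(0) Item Sets
[Figure: state diagram of the LR(0) item sets I0 to I11 for the augmented expression grammar, with transitions labeled E, T, F, d, (, ), + and *. The item sets are:]
  I0:   E' → ·E,  E → ·E + T,  E → ·T,  T → ·T * F,  T → ·F,  F → ·(E),  F → ·d
  I1:   E' → E·,  E → E· + T
  I2:   E → T·,  T → T· * F
  I3:   T → F·
  I4:   F → (·E),  E → ·E + T,  E → ·T,  T → ·T * F,  T → ·F,  F → ·(E),  F → ·d
  I5:   F → d·
  I6:   F → (E·),  E → E· + T
  I7:   F → (E)·
  I8:   E → E + ·T,  T → ·T * F,  T → ·F,  F → ·(E),  F → ·d
  I9:   T → T * ·F,  F → ·(E),  F → ·d
  I10:  T → T * F·
  I11:  E → E + T·,  T → T· * F
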
Set-of-Items Construction
    SetOfItems items(Grammar G') {                  // G' is the augmented grammar
        C ← { closure({[S' → ·S]}) };
        do {
            C_old ← C;
            for each set of items I ∈ C and each grammar symbol X such that goto(I, X) is not empty do {
                C ← C ∪ { goto(I, X) };
            }
        } while ( C ≠ C_old );
        return C;
    }

SLR Parse Table Construction
    BuildSLRParser(Grammar G') {                            // G' is the augmented grammar
        Initialize all the entries in the goto and action tables to "error";
        C ← items(G');                                      // C = {I_0, I_1, ..., I_n}
        for each item set I_i ∈ C do {
            if [A → α·aβ] ∈ I_i and goto(I_i, a) = I_j      // a is a terminal
                action[i][a] ← "shift j";
            if [A → α·] ∈ I_i and A ≠ S'
                for all a ∈ FOLLOW(A) do
                    action[i][a] ← "reduce A → α";
            if [S' → S·] ∈ I_i
                action[i][$] ← "accept";
            for each non-terminal A ∈ G' do
                if goto(I_i, A) = I_j
                    goto[i][A] ← j;
        }
        The initial state of the parser is i where [S' → ·S] ∈ I_i;
    }

SLR Parsing Example
  FOLLOW(E) = {$, +, )}
  FOLLOW(T) = {$, +, *, )}
  FOLLOW(F) = {$, +, *, )}

SLR Parse Table
         Action                                                                      Goto
  State  d        +             *             (        )             $             E   T   F
  0      shift 5                              shift 4                              1   2   3
  1               shift 8                                            accept
  2               reduce E→T    shift 9                reduce E→T    reduce E→T
  3               reduce T→F    reduce T→F             reduce T→F    reduce T→F
  4      shift 5                              shift 4                              6   2   3
  5               reduce F→d    reduce F→d             reduce F→d    reduce F→d
  6               shift 8                              shift 7
  7               reduce F→(E)  reduce F→(E)           reduce F→(E)  reduce F→(E)
  8      shift 5                              shift 4                                  11  3
  9      shift 5                              shift 4                                      10
  10              reduce T→T*F  reduce T→T*F           reduce T→T*F  reduce T→T*F
  11              reduce E→E+T  shift 9                reduce E→E+T  reduce E→E+T

LR Parsing Algorithm
    LR Parser() {
        stack.push(s_0);                          // Push initial state onto empty stack
        done ← false;
        a ← scanner.getNextToken();               // Get next token
        while ( not done ) {
            s ← stack.top();                      // Look at state on top of stack
            if ( action[s][a] = shift s' ) {
                stack.push(a);
                stack.push(s');
                a ← scanner.getNextToken();
            }
            else if ( action[s][a] = reduce A → β ) {
                stack.pop 2·|β| symbols;          // Pop off some symbols
                s' ← stack.top();
                stack.push(A);
                stack.push(goto[s'][A]);
            }
            else if ( action[s][a] = accept ) {
                done ← true;
            }
            else {
                Error();                          // Illegal string
            }
        }
    }

Parsing Example
  Stack                     Input              Rule
  $ S0                      (d + d) * d $      Shift 4
  $ S0 (4                   d + d) * d $       Shift 5
  $ S0 (4 d5                + d) * d $         Reduce F → d
  $ S0 (4 F3                + d) * d $         Reduce T → F
  $ S0 (4 T2                + d) * d $         Reduce E → T
  $ S0 (4 E6                + d) * d $         Shift 8
  $ S0 (4 E6 +8             d) * d $           Shift 5
  $ S0 (4 E6 +8 d5          ) * d $            Reduce F → d
  $ S0 (4 E6 +8 F3          ) * d $            Reduce T → F
  $ S0 (4 E6 +8 T11         ) * d $            Reduce E → E + T
  $ S0 (4 E6                ) * d $            Shift 7
  $ S0 (4 E6 )7             * d $              Reduce F → (E)
  $ S0 F3                   * d $              Reduce T → F
  $ S0 T2                   * d $              Shift 9
  $ S0 T2 *9                d $                Shift 5
  $ S0 T2 *9 d5             $                  Reduce F → d
  $ S0 T2 *9 F10            $                  Reduce T → T * F
  $ S0 T2                   $                  Reduce E → T
  $ S0 E1                   $                  Accept

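The trace can be reproduced with a small driver over a hand-encoded copy of the SLR table for the expression grammar. This is a sketch under assumptions: the class SlrParser, the compact cell codes ("s5" for shift 5, "r2" for reduce by production 2, "acc" for accept), and the production numbering 1: E → E + T, 2: E → T, 3: T → T * F, 4: T → F, 5: F → (E), 6: F → d are all illustrative choices. Unlike the pseudocode, the sketch keeps only states on the stack, so a reduction pops |β| entries rather than 2·|β|.

```java
import java.util.*;

class SlrParser {
    static final String TERMS = "d+*()$";
    // ACTION[state][index of terminal in TERMS]; an empty string means error.
    static final String[][] ACTION = {
        {"s5", "",   "",   "s4", "",   ""},     // 0
        {"",   "s8", "",   "",   "",   "acc"},  // 1
        {"",   "r2", "s9", "",   "r2", "r2"},   // 2
        {"",   "r4", "r4", "",   "r4", "r4"},   // 3
        {"s5", "",   "",   "s4", "",   ""},     // 4
        {"",   "r6", "r6", "",   "r6", "r6"},   // 5
        {"",   "s8", "",   "",   "s7", ""},     // 6
        {"",   "r5", "r5", "",   "r5", "r5"},   // 7
        {"s5", "",   "",   "s4", "",   ""},     // 8
        {"s5", "",   "",   "s4", "",   ""},     // 9
        {"",   "r3", "r3", "",   "r3", "r3"},   // 10
        {"",   "r1", "s9", "",   "r1", "r1"},   // 11
    };
    static final int[] NONE = {-1, -1, -1};
    // GOTO[state] = next state for {E, T, F}; -1 means no transition.
    static final int[][] GOTO = {
        {1, 2, 3}, NONE, NONE, NONE, {6, 2, 3}, NONE,
        NONE, NONE, {-1, 11, 3}, {-1, -1, 10}, NONE, NONE
    };
    static final String LHS = "-EETTFF";            // left side of production n
    static final int[] RHS_LEN = {0, 3, 1, 3, 1, 3, 1};

    // Returns the sequence of parser actions, ending in "accept" or "error".
    static List<String> parse(String input) {
        String text = input.replaceAll("\\s+", "") + "$";
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);
        List<String> trace = new ArrayList<>();
        int i = 0;
        while (true) {
            int col = TERMS.indexOf(text.charAt(i));
            String act = col < 0 ? "" : ACTION[states.peek()][col];
            if (act.equals("acc")) {
                // Accept only when the end marker is really the last character.
                trace.add(i == text.length() - 1 ? "accept" : "error");
                return trace;
            }
            if (act.isEmpty()) { trace.add("error"); return trace; }
            int n = Integer.parseInt(act.substring(1));
            if (act.charAt(0) == 's') {
                states.push(n);
                i++;
                trace.add("shift " + n);
            } else {
                for (int k = 0; k < RHS_LEN[n]; k++) states.pop();
                states.push(GOTO[states.peek()]["ETF".indexOf(LHS.charAt(n))]);
                trace.add("reduce " + n);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("(d + d) * d"));
    }
}
```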
Comparing Grammars
- LR(1) grammars describe languages that are a proper superset of the languages represented by LL(1) grammars
- LR(1) is more powerful than LALR(1)
- LALR(1) is more efficient than LR(1)
- For a language like C:
  - An LR(1) parser has thousands of states
  - An LALR(1) parser has hundreds of states

Comparing Context-free Grammars
[Figure: containment diagram of the grammar classes LL(1), SLR(1), LALR(1), LR(1), LR(k), and CFGs.]

Chomsky's Grammar Hierarchy
- Consider productions of the form α → β
  Type     Name                Criteria        Recognizer
  Type 3   Regular             A → a | aB      Finite automaton
  Type 2   Context-free        A → α           Push-down automaton
  Type 1   Context-sensitive   |α| ≤ |β|       Linear bounded automaton
  Type 0   Unrestricted        α ≠ ɛ           Turing machine

Grammar Hierarchy
[Figure: nested regions, from innermost to outermost: Regular (Type 3), Context free (Type 2), Context sensitive (Type 1), Unrestricted (Type 0).]

Error Handling
- Compilers cannot process only syntactically correct programs; they must also cope with incorrect ones
- Language specifications do not usually describe how the compiler should respond to syntax errors
- Review of types of errors
  - Lexical
  - Syntactic
  - Semantic
  - Logical

Syntactic Errors
- What should be done when the stream of tokens coming from the lexer disobeys the grammatical rules of the language?

Goals
- Errors should be reported clearly and accurately
- Some error recovery should be performed so subsequent errors can be detected
- The error detection and reporting mechanism should not significantly slow down the processing of correct programs

Issues
- Sometimes an error exists many lines before it is detected
- Types of errors are dependent on the programming language used
- See Example 4.1 in the dragon book

Error Handling
- Report
  - the location of the detected error
    - at least the line number
    - possibly the position within that line
  - the problem
- Recovery
  - A poor job may produce many spurious errors
  - One strategy: skip bad tokens and require a number of good tokens to be parsed before any subsequent errors are reported

Error Recovery Strategies (1)
- Panic-mode
  - Discard tokens until some synchronizing token is detected
  - Advantages
    - simple to implement
    - won't enter an infinite loop

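A minimal sketch of panic-mode recovery follows. Everything in it is an assumption for illustration: the class PanicMode, the toy statement form stmt → id = num ;, and the choice of ";" as the only synchronizing token. After reporting an error, the checker discards tokens up to and including the next ";" and resumes with the following statement. Each pass through the loop consumes at least one token, which is why it cannot loop forever.

```java
import java.util.ArrayList;
import java.util.List;

class PanicMode {
    // Checks a token stream against the assumed statement form: id = num ;
    static List<String> check(List<String> tokens) {
        String[] expected = {"id", "=", "num", ";"};
        List<String> errors = new ArrayList<>();
        int pos = 0;
        while (pos < tokens.size()) {
            boolean ok = true;
            for (String want : expected) {
                if (pos < tokens.size() && tokens.get(pos).equals(want)) {
                    pos++;
                } else {
                    errors.add("token " + pos + ": expected " + want);
                    ok = false;
                    break;
                }
            }
            if (!ok) {
                // Panic mode: discard tokens up to and including the synchronizing token.
                while (pos < tokens.size() && !tokens.get(pos).equals(";")) pos++;
                if (pos < tokens.size()) pos++;
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
            "id", "=", "num", ";",
            "id", "num", ";",          // missing "="
            "id", "=", "num", ";");
        System.out.println(check(tokens));
    }
}
```

Only the second statement is reported; the third is parsed normally because the parser resynchronized at the ";".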
Error Recovery Strategies (2)
- Phrase-level
  - Perform local correction on the remaining input (e.g., replace a comma by a semicolon) to allow the parser to continue
  - Used first with top-down parsers
  - Has difficulty coping with errors that occur before the point of detection

Error Recovery Strategies (3)
- Error productions
  - Augment the grammar with special error rules
  - Very useful if certain erroneous constructs are anticipated
  - Yacc supports error productions

Error Recovery Strategies (4)
- Global correction
  - Finds the minimal number of corrections required to produce a good parse tree from a bad one
  - Interesting from a theoretical point of view, but not too practical
  - The corrected parse tree obviously may not be what the programmer intended!

Yacc/Bison
- Program used to generate LALR(1) parsers
- Developed by S.C. Johnson
- YACC stands for "Yet Another Compiler-Compiler"
- As with Lex, originally for C under Unix, but other platforms are supported
- Yacc-generated C code can be linked with Lex-generated C code for a ready-made lexer/parser combination
- GNU Bison is the modern version that we will use
  - We'll just call it Yacc, though

Yacc Specification
    %{
        C/C++ declarations
    %}
        Yacc declarations
    %%
        Rules
    %%
        Programmer functions

Yacc Specification (2)
    %{
        C/C++ declarations
    %}
        Yacc declarations
    %%
        Rules
    %%
        Programmer functions
1. C/C++ macros and declarations are placed in the C/C++ declarations section
2. Yacc token declarations and precedence assignments are placed in the Yacc declarations section
3. Code to execute when productions are matched is placed in the rules section
4. Arbitrary C/C++ code is placed in the programmer functions section; functions named yylex() (normally produced by Lex) and yyerror() must be available

Yacc Rules
- Consist of a grammar production and an associated action
- The Yacc syntax for the rule A → Bx | C is
    A : B 'x' { $$ = new ANode($1, "x");
                cout << "Matched A -> Bx" << endl; }
      | C     { $$ = new ANode($1);
                cout << "Matched A -> C" << endl; }
      ;

Yacc Rules
  A → Bx | C
    A : B 'x' { $$ = new ANode($1, "x");
                cout << "Matched A -> Bx" << endl; }
      | C     { $$ = new ANode($1);
                cout << "Matched A -> C" << endl; }
      ;
- The $$ metasymbol represents the value to be returned by the parser when the production is matched; it represents the left side non-terminal (A in this case)
- The $1, $2, etc. metasymbols represent the values of the grammar symbols matched on the right side of the production
- Since the parser works from the bottom up, the right side symbols will have already been matched and their values will be available

Example Yacc Specification
    %{
    /* -------------------------- C/C++ declarations */
    #include <ctype.h>
    #include <stdio.h>
    int yylex();
    void yyerror(char *);
    %}
    /* -------------------------- Yacc declarations */
    %union {
        int value;
        int symbol;
    }
    %type <value> S E I
    %token <symbol> digit
    %left '+'
    %left '*'
    %%
    /* -------------------------- Rules */
    S : E               { printf("%d\n", $1); }
      | /* epsilon */   {}
      ;
    E : E '+' E         { $$ = $1 + $3; }
      | E '*' E         { $$ = $1 * $3; }
      | '(' E ')'       { $$ = $2; }
      | I               { $$ = $1; }
      ;
    I : I digit         { $$ = 10 * $1 + ($2 - '0'); }
      | digit           { $$ = $1 - '0'; }
      ;
    %%
    /* -------------------------- C/C++ code */
    int main() {
        while ( !feof(stdin) ) {
            yyparse();
        }
        return 0;
    }

Yacc Specification to Parser
[Figure: prog.y (declarations, %%, production rules, %%, C procedures, including main() { yyparse(); }) is translated into y.tab.c, which contains yyparse() and the DFA parse table.]

Build Process
[Figure: prog.y is processed by yacc to produce y.tab.c, which is compiled by gcc to produce prog.]
    yacc prog.y
    gcc -o prog y.tab.c