PART 3 - SYNTAX ANALYSIS
F. Wotawa (IST @ TU Graz), Compiler Construction, Summer term 2016

Goals

- Define the syntax of a programming language using context-free grammars
- Methods for parsing programs: determine whether a program is syntactically correct
- Advantages of grammars:
  - Precise, easily comprehensible language definition
  - Automatic construction of parsers
  - Explicit description of the structure of a programming language (important for translation and error detection)
  - Easy language extensions and modifications

Tasks

[Figure: position of the parser in the compiler. Labels: source program, lexical analyser, token, get next token, parser, parse tree, rest of the front end, intermediate representation, symbol table]

- Parser types:
  - Universal parsers (inefficient)
  - Top-down parsers
  - Bottom-up parsers
  - Top-down and bottom-up parsers only work for subclasses of grammars (LL, LR)
- Further tasks:
  - Collect token information
  - Type checking
  - Intermediate code generation

Syntax error handling

- Error types:
  - Lexical errors (e.g. a misspelled keyword)
  - Syntactic errors (e.g. a missing closing bracket)
  - Semantic errors (e.g. an operand that is incompatible with its operator)
  - Logic errors (e.g. an infinite loop)
- Tasks:
  - Exact error description
  - Error recovery: consecutive errors should be detectable
  - Error correction should not slow down the processing of correct programs

Problems during error handling

- Spurious errors: consecutive errors created by error recovery
  - Example: error recovery removes the declaration of pi; semantic analysis later reports "pi undefined"
- An error is detected late in the process, so the error message does not point to the correct position within the code
- Too many error messages are issued

Error recovery

- Panic mode: skip symbols until the input can be synchronized to a token
- Phrase-level recovery: local error corrections, e.g. replacement of "," by ";"
- Error productions: extension of the grammar to handle common errors
- Global correction: minimal correction of the program in order to find a matching derivation (cost intensive)

Grammars

A grammar is a 4-tuple G = (V_N, V_T, S, Φ), where:

- V_N: set of nonterminal symbols
- V_T: set of terminal symbols
- S ∈ V_N: start symbol
- Φ ⊆ (V_N ∪ V_T)* V_N (V_N ∪ V_T)* × (V_N ∪ V_T)*: set of production rules (rewriting rules)
- (α, β) ∈ Φ is written as α → β

Example: ({S, A, Z}, {a, b, 1, 2}, S, {S → AZ, A → a, A → b, Z → ε, Z → 1, Z → 2, Z → ZZ})

Derivations in grammars

Direct derivation: let σ, ψ ∈ (V_T ∪ V_N)*. σ can be directly derived from ψ (in one step; ψ ⇒ σ) if there are two strings φ_1, φ_2 such that σ = φ_1 β φ_2, ψ = φ_1 α φ_2 and α → β ∈ Φ.

Example:

    ψ      σ      Rule used   φ_1   φ_2
    S      AZ     S → AZ      ε     ε
    aZ     a1     Z → 1       a     ε
    AZZ    A2Z    Z → 2       A     Z

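The definition of a direct derivation can be tried out on the example grammar. The following is a minimal Python sketch, not part of the original slides; it assumes every grammar symbol is a single character and uses the empty string for ε. The names `PRODUCTIONS` and `direct_derivations` are illustrative.

```python
# Productions of the example grammar as (left side, right side) pairs.
# Symbols are single characters; "" stands for the empty string ε.
PRODUCTIONS = [("S", "AZ"), ("A", "a"), ("A", "b"),
               ("Z", ""), ("Z", "1"), ("Z", "2"), ("Z", "ZZ")]


def direct_derivations(psi):
    """Return every σ with ψ ⇒ σ: ψ = φ1 α φ2, σ = φ1 β φ2 for a rule α → β."""
    results = set()
    for alpha, beta in PRODUCTIONS:
        start = psi.find(alpha)
        while start != -1:
            phi1, phi2 = psi[:start], psi[start + len(alpha):]
            results.add(phi1 + beta + phi2)
            start = psi.find(alpha, start + 1)  # try the next occurrence of α
    return results


print(sorted(direct_derivations("AZZ")))
```

Each row of the table above corresponds to one element of the returned set, e.g. `"A2Z"` is among the results for `"AZZ"`.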
Derivation

- Production: a string ψ produces σ (ψ ⇒+ σ) if there are strings φ_0, ..., φ_n (n > 0) such that ψ = φ_0 ⇒ φ_1, φ_1 ⇒ φ_2, ..., φ_(n-1) ⇒ φ_n = σ.
  - Example: S ⇒ AZ ⇒ AZZ ⇒ A2Z ⇒ a2Z ⇒ a21
- Reflexive, transitive closure: ψ ⇒* σ ⟺ ψ ⇒+ σ or ψ = σ
- Accepted language: a grammar G accepts the language L(G) = {σ | S ⇒* σ, σ ∈ (V_T)*}

Parse trees

Example: E → E + E | E * E | id

There are 2 derivations (and parse trees) for id+id*id:

    Tree 1                 Tree 2

        E                        E
      / | \                    / | \
     E  +  E                  E  *  E
     |   / | \              / | \   |
    id  E  *  E            E  +  E  id
        |     |            |     |
        id    id           id    id

The grammar is ambiguous.

Classification of grammars

Chomsky hierarchy (restrictions on the production rules α → β):

- Unrestricted grammar: no restrictions
- Context-sensitive grammar: |α| ≤ |β|
- Context-free grammar: |α| ≤ |β| and α ∈ V_N
- Regular grammar: |α| ≤ |β|, α ∈ V_N and β has the form aB or a, where a ∈ V_T and B ∈ V_N

Grammar examples

Regular grammar for (a|b)*abb:

    A0 → aA0 | bA0 | aA1
    A1 → bA2
    A2 → bA3
    A3 → ε

Context-sensitive languages:

- L1 = {wcw | w ∈ (a|b)*}
  - But L1' = {wcw^R | w ∈ (a|b)*} is context-free
- L2 = {a^n b^m c^n d^m | n ≥ 1, m ≥ 1}
- L3 = {a^n b^n c^n | n ≥ 1}

Conversions

Remove ambiguities:

    stmt → if expr then stmt
         | if expr then stmt else stmt
         | other

There are 2 parse trees for "if E1 then if E2 then S1 else S2":

- Left tree: if E1 then (if E2 then S1 else S2), i.e. the else belongs to the inner if
- Right tree: if E1 then (if E2 then S1) else S2, i.e. the else belongs to the outer if

Prefer the left tree: associate each else with the closest preceding then.

Removing left recursion

- A grammar is left-recursive if there is a nonterminal A with a derivation A ⇒+ Aα
- Top-down parsing can't handle left recursion
- Example: convert A → Aα | β to:

      A  → βA1
      A1 → αA1 | ε

Algorithm to eliminate left recursion

Input: grammar G without cycles and ε-productions
Output: grammar without left recursion

    Arrange the nonterminals in some order A_1, A_2, ..., A_n
    for i := 1 to n do
        for j := 1 to i-1 do
            Replace each production A_i → A_j γ by the productions
            A_i → δ_1 γ | ... | δ_k γ, where A_j → δ_1 | ... | δ_k
            are all the current A_j-productions
        end
        Eliminate the immediate left recursion among the A_i-productions
    end

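The algorithm above, including the elimination of immediate left recursion from the previous slide, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the lecture's implementation: a grammar is a dict mapping each nonterminal to a list of alternatives, each alternative is a list of symbols, `[]` stands for ε, and the new nonterminal for A is named by appending "1" (assumed not to clash with an existing name).

```python
def eliminate_left_recursion(grammar, order):
    """Return a copy of `grammar` without left recursion.

    grammar: dict nonterminal -> list of alternatives (lists of symbols)
    order:   the chosen ordering A_1, ..., A_n of the nonterminals
    """
    g = {nt: [list(alt) for alt in alts] for nt, alts in grammar.items()}
    for i, a_i in enumerate(order):
        for a_j in order[:i]:
            expanded = []
            for alt in g[a_i]:
                if alt[:1] == [a_j]:
                    # Replace A_i → A_j γ by A_i → δ γ for every A_j → δ.
                    expanded.extend(delta + alt[1:] for delta in g[a_j])
                else:
                    expanded.append(alt)
            g[a_i] = expanded
        # Eliminate immediate left recursion: A → Aα | β becomes
        # A → β A1 and A1 → α A1 | ε.
        alphas = [alt[1:] for alt in g[a_i] if alt[:1] == [a_i]]
        betas = [alt for alt in g[a_i] if alt[:1] != [a_i]]
        if alphas:
            new_nt = a_i + "1"  # hypothetical fresh nonterminal name
            g[a_i] = [beta + [new_nt] for beta in betas]
            g[new_nt] = [alpha + [new_nt] for alpha in alphas] + [[]]
    return g


example = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"]]}
print(eliminate_left_recursion(example, ["S", "A"]))
```

In the example, S is left untouched, while the indirect recursion A ⇒ Sd ⇒ Aad is first made immediate and then removed.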
Left factoring

- Important for predictive parsing
- Eliminates alternative productions that start with a common prefix
- Example:

      stmt → if expr then stmt else stmt
           | if expr then stmt

- Solution: for each nonterminal A find the longest prefix α common to two or more alternative productions
- If α ≠ ε, replace all A-productions A → αβ_1 | αβ_2 | ... | αβ_n | γ (γ does not start with α) with:

      A  → αA1 | γ
      A1 → β_1 | β_2 | ... | β_n

- Apply the transformation until no prefix α ≠ ε can be found

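One application of the left-factoring transformation can be sketched as below. This is a minimal Python illustration, not taken from the slides; alternatives are lists of symbols, `[]` stands for ε, and the helper names and the "1" suffix for the new nonterminal are assumptions made for the example.

```python
from itertools import combinations


def common_prefix(x, y):
    """Longest common prefix of two symbol lists."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return x[:n]


def left_factor_once(nonterminal, alternatives):
    """Apply one left-factoring step to the A-productions."""
    # Longest prefix α shared by at least two alternatives.
    alpha = max((common_prefix(x, y) for x, y in combinations(alternatives, 2)),
                key=len, default=[])
    if not alpha:
        return {nonterminal: alternatives}  # nothing to factor
    new_nt = nonterminal + "1"              # hypothetical fresh name
    with_prefix = [alt for alt in alternatives if alt[:len(alpha)] == alpha]
    others = [alt for alt in alternatives if alt[:len(alpha)] != alpha]
    return {
        nonterminal: [alpha + [new_nt]] + others,               # A → αA1 | γ
        new_nt: [alt[len(alpha):] for alt in with_prefix],      # A1 → β_1 | ...
    }


stmt = [["if", "expr", "then", "stmt", "else", "stmt"],
        ["if", "expr", "then", "stmt"],
        ["other"]]
print(left_factor_once("stmt", stmt))
```

For the dangling-else productions this yields stmt → if expr then stmt stmt1 | other and stmt1 → else stmt | ε. To factor a grammar completely, the step would be repeated on the new productions until it no longer changes anything.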
Top-down parsing

- Idea: construct the parse tree for a given input, starting at the root node
- Recursive-descent parsing (with backtracking)
  - Example: S → cAd, A → ab | a
  - Matching of cad:

    [Figure: three parse-tree snapshots. (1) S with children c, A, d. (2) A expanded to a b. (3) A expanded to a.]

- Predictive parsing (without backtracking, a special case of recursive-descent parsing)
- Left-recursive grammars can lead to infinite loops!

Predictive parsers

- Recursive-descent parser without backtracking
- Possible if the production that needs to be used is obvious for each input symbol
- Transition diagrams:
  1. Remove left recursion
  2. Left factoring
  3. For each nonterminal A:
     1. Create an initial state and an end state
     2. For each production A → X_1 X_2 ... X_n create a path leading from the initial state to the end state, labeling the edges X_1, ..., X_n

Predictive parsers (II)

Processing:

1. Start at the initial state of the start symbol.
2. Suppose we are currently in state s, which has an edge labeled with a terminal a that leads to state t. If the next input symbol is a, go to state t and read a new input symbol.
3. Suppose the edge from s to t is labeled with a nonterminal A. In that case go to the initial state of A (without reading a new input symbol). When the end state of A is reached, continue in state t, the successor of s.
4. If the edge is labeled with ε, go directly to t without reading any input.

Easily implemented with recursive procedures.

Example - Predictive parser

Grammar:

    E  → id E1 | (E)
    E1 → op E | ε

Procedures:

    E():
        if nexttoken = id then getnexttoken; E1()
        if nexttoken = ( then getnexttoken; E(); if nexttoken = ) then accept

    E1():
        if nexttoken = op then getnexttoken; E()
        else return

[Figure: transition diagrams. E: one path labeled id, E1 and one path labeled (, E, ). E1: one path labeled op, E and one path labeled ε.]

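The two procedures can be turned into a runnable recursive predictive parser. The following Python sketch is one possible rendering, not the slide's own code; the class name, the `match` helper, the explicit end marker "$" and the use of `SyntaxError` for rejected input are assumptions made for the example.

```python
class PredictiveParser:
    """Recursive predictive parser for E → id E1 | ( E ),  E1 → op E | ε."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks the end of the input
        self.pos = 0

    @property
    def next_token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        """Consume the next token if it is the expected terminal."""
        if self.next_token != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.next_token!r}")
        self.pos += 1

    def parse_E(self):
        if self.next_token == "id":        # E → id E1
            self.match("id")
            self.parse_E1()
        elif self.next_token == "(":       # E → ( E )
            self.match("(")
            self.parse_E()
            self.match(")")
        else:
            raise SyntaxError(f"unexpected token {self.next_token!r}")

    def parse_E1(self):
        if self.next_token == "op":        # E1 → op E
            self.match("op")
            self.parse_E()
        # otherwise E1 → ε: consume nothing


def accepts(tokens):
    """True if the token sequence is derivable from E."""
    parser = PredictiveParser(tokens)
    try:
        parser.parse_E()
        parser.match("$")                  # the whole input must be consumed
        return True
    except SyntaxError:
        return False


print(accepts(["id", "op", "id"]))
```

Each procedure picks its production by looking only at the next token, so no backtracking is needed.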
Non-recursive predictive parser

[Figure: model of a non-recursive predictive parser. Input buffer "a + b $", stack "X Y Z $", predictive parsing program, output, parsing table M]

- Input buffer: string to be parsed (terminated by a $)
- Stack: initialized with the start symbol; contains the grammar symbols that have not been derived yet (terminated by a $)
- Parsing table M(A, a): A is a nonterminal, a is a terminal or $

Top-down parsing with stack

Mode of operation: X is the top element of the stack, a the current input symbol.

1. X is a terminal or $:
   - If X = a = $, the input was matched.
   - If X = a ≠ $, pop X off the stack and read the next input symbol.
   - Otherwise an error occurred.
2. X is a nonterminal: fetch the entry M(X, a).
   - If this entry is an error, skip to error recovery.
   - Otherwise the entry is a production of the form X → UVW. Replace X on the stack with WVU (afterwards U is the topmost element of the stack).

Example

Grammar:

    E  → id E1 | (E)
    E1 → op E | ε

Parsing table M(X, a):

    NONTERMINAL | id         | op         | (        | )       | $
    E           | E → id E1  |            | E → (E)  |         |
    E1          |            | E1 → op E  |          | E1 → ε  | E1 → ε

Derivation of id op id: see the next slide.

Example (II)

    STACK      INPUT         OUTPUT
    $ E        id op id $
    $ E1 id    id op id $    E → id E1
    $ E1       op id $
    $ E op     op id $       E1 → op E
    $ E        id $
    $ E1 id    id $          E → id E1
    $ E1       $
    $          $             E1 → ε

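The stack-based mode of operation and the trace above can be reproduced with a small table-driven parser. The Python sketch below hard-codes the parsing table of the example grammar; the dictionary layout and the function name `ll1_parse` are illustrative choices, not part of the slides.

```python
# Parsing table M(X, a) of the example grammar; missing entries mean "error".
TABLE = {
    ("E", "id"): ["id", "E1"],
    ("E", "("): ["(", "E", ")"],
    ("E1", "op"): ["op", "E"],
    ("E1", ")"): [],            # E1 → ε
    ("E1", "$"): [],            # E1 → ε
}
NONTERMINALS = {"E", "E1"}


def ll1_parse(tokens):
    """Return the productions used (the OUTPUT column) or raise SyntaxError."""
    buffer = list(tokens) + ["$"]
    stack = ["$", "E"]          # top of the stack is the end of the list
    output = []
    pos = 0
    while True:
        top, a = stack[-1], buffer[pos]
        if top in NONTERMINALS:
            if (top, a) not in TABLE:
                raise SyntaxError(f"no entry M({top}, {a})")
            body = TABLE[(top, a)]
            stack.pop()
            stack.extend(reversed(body))   # push right side in reverse order
            output.append(f"{top} → {' '.join(body) or 'ε'}")
        elif top == a:
            if top == "$":
                return output              # X = a = $: input matched
            stack.pop()                    # X = a ≠ $: match the terminal
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, found {a!r}")


print(ll1_parse(["id", "op", "id"]))
# ['E → id E1', 'E1 → op E', 'E → id E1', 'E1 → ε']
```

The returned list corresponds to the OUTPUT column of the trace.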
FIRST and FOLLOW

Used when calculating the parsing table.

- FIRST(α): set of terminals that begin the strings derivable from α (α is a string of grammar symbols); if α can derive ε, then ε is also in FIRST(α)
- FOLLOW(A): set of terminals that can occur directly to the right of the nonterminal A in a derivation
  - If A can be the rightmost symbol in a derivation, then $ is contained in FOLLOW(A)

Calculation of FIRST

FIRST(X) for a grammar symbol X:

1. X is a terminal: FIRST(X) = {X}
2. X → ε is a production: add ε to FIRST(X)
3. X is a nonterminal and X → Y_1 Y_2 ... Y_k is a production: a is in FIRST(X) if
   1. there is an i such that a is in FIRST(Y_i) and ε is in every set FIRST(Y_1), ..., FIRST(Y_(i-1)), or
   2. a = ε and ε is in every set FIRST(Y_1), ..., FIRST(Y_k)

FIRST(X_1 X_2 ... X_n) for a string of grammar symbols:

- Each non-ε symbol of FIRST(X_1) is in the result
- If ε ∈ FIRST(X_1), then each non-ε symbol of FIRST(X_2) is in the result, and so on
- If ε is in every FIRST set, then it is also contained in the result

Calculation of FOLLOW

To calculate FOLLOW(A) for all nonterminals A, apply the following rules until nothing more can be added:

1. Add $ to FOLLOW(S), where S is the start symbol
2. For each production A → αBβ, add all elements of FIRST(β) except ε to FOLLOW(B)
3. For each production A → αB, and each production A → αBβ with ε ∈ FIRST(β), add each element of FOLLOW(A) to FOLLOW(B)

Example

Grammar:

    E  → id E1 | (E)
    E1 → op E | ε

FIRST sets:

    FIRST(E)  = {id, (}
    FIRST(E1) = {op, ε}

FOLLOW sets:

    FOLLOW(E) = FOLLOW(E1) = {$, )}

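The FIRST and FOLLOW rules can be checked against this example with a short fixed-point computation. The following Python sketch is illustrative only; the grammar encoding (dict of alternatives, `[]` for ε) and the function names are assumptions, not part of the slides.

```python
EPS = "ε"
GRAMMAR = {"E": [["id", "E1"], ["(", "E", ")"]],
           "E1": [["op", "E"], []]}          # [] stands for ε
START = "E"


def first_of_string(symbols, first):
    """FIRST(X_1 ... X_n) given the current FIRST sets of the nonterminals."""
    result = set()
    for x in symbols:
        fx = first[x] if x in GRAMMAR else {x}   # terminal: FIRST(x) = {x}
        result |= fx - {EPS}
        if EPS not in fx:
            return result
    result.add(EPS)                              # every X_i can derive ε
    return result


def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:                               # repeat until nothing is added
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for alt in alternatives:
                new = first_of_string(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(first):
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add("$")                       # rule 1
    changed = True
    while changed:
        changed = False
        for a, alternatives in GRAMMAR.items():
            for alt in alternatives:
                for i, b in enumerate(alt):
                    if b not in GRAMMAR:
                        continue
                    first_beta = first_of_string(alt[i + 1:], first)
                    new = first_beta - {EPS}     # rule 2
                    if EPS in first_beta:
                        new |= follow[a]         # rule 3
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow


FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(FIRST, FOLLOW)
```

For the example grammar the computed sets agree with the ones listed above.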
Construction of parsing tables

Input: grammar G
Output: parsing table M

1. For each production A → α do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M(A, a).
3. If ε is in FIRST(α), add A → α to M(A, b) for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M(A, $).
4. Make each undefined entry of M an error.

Example: see the table of the last example grammar.

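The table construction can be sketched for the example grammar as follows. To keep the Python sketch short, the FIRST sets of the right sides and the FOLLOW sets are hard-coded from the previous example rather than computed; the names `FIRST_OF_BODY`, `FOLLOW` and `build_table` are illustrative assumptions.

```python
from collections import defaultdict

EPS = "ε"
PRODUCTIONS = [("E", ("id", "E1")), ("E", ("(", "E", ")")),
               ("E1", ("op", "E")), ("E1", ())]      # () stands for ε

# FIRST of each right side and the FOLLOW sets, taken from the example.
FIRST_OF_BODY = {("id", "E1"): {"id"}, ("(", "E", ")"): {"("},
                 ("op", "E"): {"op"}, (): {EPS}}
FOLLOW = {"E": {"$", ")"}, "E1": {"$", ")"}}


def build_table():
    """Map (nonterminal, terminal) to the list of applicable right sides."""
    table = defaultdict(list)
    for head, body in PRODUCTIONS:
        first = FIRST_OF_BODY[body]
        for a in first - {EPS}:                      # step 2
            table[(head, a)].append(body)
        if EPS in first:                             # step 3
            for b in FOLLOW[head]:                   # FOLLOW may contain "$"
                table[(head, b)].append(body)
    return dict(table)                               # absent keys are errors


table = build_table()
is_ll1 = all(len(entries) == 1 for entries in table.values())
print(table, is_ll1)
```

Entries are kept as lists so that multiple entries per cell, which indicate that a grammar is not LL(1), remain visible.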
LL(1) Grammars

- The parsing table construction can be used with arbitrary grammars
- Multiple elements per entry may occur
- LL(1) grammar: grammar whose parsing table contains no multiple entries
  - L ... scanning the input from LEFT to right
  - L ... producing the LEFTMOST derivation
  - 1 ... using 1 input symbol of lookahead

Properties of LL(1)

- No ambiguous or left-recursive grammar is LL(1)
- G is LL(1) ⟺ for each two different productions A → α | β the following holds:
  1. No strings starting with the same terminal a can be derived from both α and β
  2. At most one of α and β can derive ε
  3. If β ⇒* ε, then α must not derive any string that starts with an element of FOLLOW(A)
- Multiple entries in the parsing table can occasionally be removed by hand (without changing the language recognized by the automaton)

Error recovery in predictive parsing

Heuristics in panic-mode error recovery:

1. Initially, all symbols in FOLLOW(A) can be used for synchronisation: skip all tokens until an element of FOLLOW(A) is read, then remove A from the stack.
2. If FOLLOW sets don't suffice: use the hierarchical structure of program constructs, e.g. add keywords occurring at the beginning of a statement to the synchronisation set.
3. FIRST(A) can be used as well: if an element of FIRST(A) is read, continue parsing at A.
4. If a terminal that can't be matched is at the top of the stack, remove it.

Bottom-up parsing

- Shift-reduce parsing
- Reduction of an input towards the start symbol of the grammar
- Reduction step: replace a substring that matches the right side of a production with the left side of that same production
- Example:

      S → aABe
      A → Abc | b
      B → d

  Reduction sequence: abbcde, aAbcde, aAde, aABe, S

Handles

A handle is a substring that matches the right side of a production and whose reduction is one step of a rightmost derivation in reverse.

Example (ambiguous grammar):

    E → E + E
    E → E * E
    E → (E)
    E → id

Rightmost derivation of id + id * id in reverse:

    Right-sentential form | Handle       | Reducing production
    id + id * id          | id (first)   | E → id
    E + id * id           | id (first)   | E → id
    E + E * id            | id           | E → id
    E + E * E             | E * E        | E → E * E
    E + E                 | E + E        | E → E + E
    E                     |              |

Stack implementation

- Initially:

      Stack    Input
      $        w$

- Shift n ≥ 0 symbols from the input onto the stack until a handle is found
- Reduce the handle (replace the handle with the left side of the production)

Example shift-reduce parsing

         Stack            Input             Action
    (1)  $                id + id * id $    shift
    (2)  $ id             + id * id $       reduce by E → id
    (3)  $ E              + id * id $       shift
    (4)  $ E +            id * id $         shift
    (5)  $ E + id         * id $            reduce by E → id
    (6)  $ E + E          * id $            shift
    (7)  $ E + E *        id $              shift
    (8)  $ E + E * id     $                 reduce by E → id
    (9)  $ E + E * E      $                 reduce by E → E * E
    (10) $ E + E          $                 reduce by E → E + E
    (11) $ E              $                 accept

Viable prefixes, conflicts

- Viable prefix: a prefix of a right-sentential form that can occur on the stack of a shift-reduce parser
- Conflicts (ambiguous grammars):

      stmt → if expr then stmt
           | if expr then stmt else stmt
           | other

- Configuration:

      Stack                       Input
      ... if expr then stmt       else ...

- No unambiguous handle (shift-reduce conflict): the parser can either reduce "if expr then stmt" or shift else

LR parser

LR(k) parsing:

- L ... left-to-right scanning
- R ... rightmost derivation in reverse

Advantages:

- Can be used for (nearly) every programming language construct
- Most general backtrack-free shift-reduce parsing method
- The class of LR grammars is larger than that of LL grammars
- LR parsers identify errors as early as possible

Disadvantage:

- An LR parser is hard to build manually

LR-parsing algorithm

[Figure: model of an LR parser. Input a_1 ... a_i ... a_n $; stack s_m, X_m, s_(m-1), X_(m-1), ..., s_0 with s_m on top; LR parsing program; output; parsing table with action and goto parts]

- The stack stores s_0 X_1 s_1 X_2 s_2 ... X_m s_m (X_i grammar symbol, s_i state)
- Parsing table = action table and goto table
- s_m is the current state, a_i the current input symbol
- action[s_m, a_i] ∈ {shift, reduce, accept, error}
- goto[s, X]: transition function of a DFA (state s, grammar symbol X)

LR-parsing mode of operation

- Configuration: (s_0 X_1 s_1 ... X_m s_m, a_i a_(i+1) ... a_n)
- The next step (move) is determined by reading a_i and depends on action[s_m, a_i]:
  1. action[s_m, a_i] = shift s
     New configuration: (s_0 X_1 s_1 ... X_m s_m a_i s, a_(i+1) ... a_n)
  2. action[s_m, a_i] = reduce A → β
     New configuration: (s_0 X_1 s_1 ... X_(m-r) s_(m-r) A s, a_i a_(i+1) ... a_n)
     where s = goto[s_(m-r), A] and r is the length of β
  3. action[s_m, a_i] = accept
  4. action[s_m, a_i] = error

Example

Grammar:

    (1) E → E + T
    (2) E → T
    (3) T → T * F
    (4) T → F
    (5) F → (E)
    (6) F → id

Parsing table (sN = shift and go to state N, rN = reduce by production N, acc = accept):

    State |              action               |   goto
          | id    +     *     (     )     $   | E   T   F
    0     | s5                s4              | 1   2   3
    1     |       s6                      acc |
    2     |       r2    s7          r2    r2  |
    3     |       r4    r4          r4    r4  |
    4     | s5                s4              | 8   2   3
    5     |       r6    r6          r6    r6  |
    6     | s5                s4              |     9   3
    7     | s5                s4              |         10
    8     |       s6                s11       |
    9     |       r1    s7          r1    r1  |
    10    |       r3    r3          r3    r3  |
    11    |       r5    r5          r5    r5  |

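The LR mode of operation can be run against this table. The Python sketch below encodes the table as dictionaries and implements the shift, reduce, accept and error moves. As a simplification it keeps only the states on the stack, since the grammar symbols X_i are not needed to drive the parser; the names `ACTION`, `GOTO` and `lr_parse` are illustrative.

```python
# Production number -> (left side, length of right side).
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
               4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4: {"id": "s5", "(": "s4"},
    5: {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}
GOTO = {0: {"E": 1, "T": 2, "F": 3}, 4: {"E": 8, "T": 2, "F": 3},
        6: {"T": 9, "F": 3}, 7: {"F": 10}}


def lr_parse(tokens):
    """Return the numbers of the productions used in the reductions."""
    buffer = list(tokens) + ["$"]
    states = [0]                 # state stack; symbols X_i are left implicit
    pos = 0
    reductions = []
    while True:
        act = ACTION[states[-1]].get(buffer[pos])
        if act is None:          # empty table entry
            raise SyntaxError(f"unexpected {buffer[pos]!r} in state {states[-1]}")
        if act == "acc":
            return reductions
        number = int(act[1:])
        if act[0] == "s":        # shift: push the new state, advance the input
            states.append(number)
            pos += 1
        else:                    # reduce A → β: pop |β| states, then goto
            head, length = PRODUCTIONS[number]
            del states[len(states) - length:]
            states.append(GOTO[states[-1]][head])
            reductions.append(number)


print(lr_parse(["id", "*", "id", "+", "id"]))
# [6, 4, 6, 3, 2, 6, 4, 1]
```

Read backwards, the list of reductions is a rightmost derivation of the input.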
Construction of SLR parsing tables

- LR(0) items: production with a dot at some position of the right side
  - Example: the production A → XYZ has 4 items: A → .XYZ, A → X.YZ, A → XY.Z and A → XYZ.
  - Exception: the production A → ε only has the item A → .
- Augmented grammar: grammar with a new start symbol S' and the production S' → S
- Functions: closure and goto
- closure(I) (I ... set of items):
  1. All items of I are in closure(I)
  2. If A → α.Bβ is in closure(I) and B → γ is a production, then add B → .γ to closure(I)

Construction, goto

- goto(I, X), with I a set of items and X a grammar symbol:
  goto(I, X) = closure of the set of all items A → αX.β for all A → α.Xβ in I
- Example: I = {E' → E., E → E. + T}
  goto(I, +) = {E → E + .T, T → .T * F, T → .F, F → .(E), F → .id}
- Sets-of-items construction (construction of all sets of LR(0) items):

      items(G'):
          I_0 = closure({S' → .S})
          C = {I_0}
          repeat
              for each set of items I ∈ C and each grammar symbol X
                      such that goto(I, X) is not empty and not in C do
                  Add goto(I, X) to C
          until no more sets of items can be added to C

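The functions closure and goto and the sets-of-items construction can be sketched in Python for the augmented expression grammar. This is an illustrative sketch, not the lecture's code: an item is represented as a tuple (left side, right side, dot position), and the function names are assumptions.

```python
GRAMMAR = [("E'", ("E",)), ("E", ("E", "+", "T")), ("E", ("T",)),
           ("T", ("T", "*", "F")), ("T", ("F",)),
           ("F", ("(", "E", ")")), ("F", ("id",))]
NONTERMINALS = {head for head, _ in GRAMMAR}


def closure(items):
    """closure(I); an item A → α.β is the tuple (A, αβ, position of the dot)."""
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in NONTERMINALS:
            for h, b in GRAMMAR:           # add B → .γ for every production of B
                if h == body[dot] and (h, b, 0) not in result:
                    result.add((h, b, 0))
                    worklist.append((h, b, 0))
    return frozenset(result)


def goto(items, symbol):
    """goto(I, X): move the dot over X, then take the closure."""
    moved = {(h, b, d + 1) for h, b, d in items
             if d < len(b) and b[d] == symbol}
    return closure(moved)


def canonical_collection():
    """All sets of LR(0) items reachable from closure({S' → .S})."""
    symbols = sorted({s for _, body in GRAMMAR for s in body})
    collection = [closure({("E'", ("E",), 0)})]
    for item_set in collection:            # the list grows while iterating
        for x in symbols:
            target = goto(item_set, x)
            if target and target not in collection:
                collection.append(target)
    return collection


I = {("E'", ("E",), 1), ("E", ("E", "+", "T"), 1)}
print(sorted(goto(I, "+")))
print(len(canonical_collection()))
```

For this grammar the construction yields 12 item sets, matching the 12 states (0 to 11) of the example parsing table, although the numbering of the sets depends on the order in which they are discovered.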
SLR parsing table

Input: augmented grammar G'
Output: SLR parsing table

1. Calculate C = {I_0, I_1, ..., I_n}, the collection of sets of LR(0) items of G'
2. State i is created from I_i as follows:
   1. If A → α.aβ is in I_i and goto(I_i, a) = I_j, then action[i, a] = shift j (a is a terminal symbol)
   2. If A → α. is in I_i, then action[i, a] = reduce A → α for all a ∈ FOLLOW(A), where A ≠ S'
   3. If S' → S. is in I_i, then action[i, $] = accept
3. For all nonterminal symbols A: goto[i, A] = j if goto(I_i, A) = I_j
4. Every other table entry is set to error
5. The initial state is the one created from the item set containing S' → .S

SLR(1), conflicts, error handling

- If the SLR parsing table algorithm yields a table without multiple entries, then the grammar is SLR(1)
- Otherwise the algorithm fails and a more powerful method (such as LR) has to be used, which generally results in increased processing requirements
- Shift/reduce conflicts can be partially resolved
  - This usually involves specifying operator binding strength (precedence) and associativity
- Error handling can be incorporated directly into the parsing table