Topic 3: Syntax Analysis I
Compiler Design
Prof. Hanjun Kim
CoreLab (Compiler Research Lab), POSTECH

The Front End
  Compiler pipeline (divided into a Front-End and a Back-End):
    Source Program -> Lexical Analysis -> Syntax Analysis -> Semantic Analysis -> IR Code Generation
      -> Intermediate Representation -> IR Optimization -> Target Code Generation
      -> Target Code Optimization -> Target Program
  Lexical Analysis: break the input into tokens (think words, punctuation)
  Syntax Analysis: parse the phrase structure (think document, paragraphs, sentences)
  Semantic Analysis: calculate meaning

Parser in the Front-End
  Source -> Lexer -> Stream of Tokens -> Parser -> Abstract Syntax Tree -> FE -> IR
  Parser functions:
    Verify that the token stream is valid
    If it is not valid, report the syntax error and recover
    Build the Abstract Syntax Tree (AST)

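Analogy to English
  Parsing: understanding sentence structure, i.e. checking grammar
  Ex: This line is a longer sentence
    Words:   This (article), line (noun), is (verb), a (article), longer (adjective), sentence (noun)
    Phrases: "This line" (subject), "a longer sentence" (complement)
    Whole:   sentence
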
Syntax Analysis (Parsing)
  A process that verifies that the token stream is valid
  Checks the grammar of the programming language
  Ex: if a < b then c = 1 else c = 2
    Tokens:  IF ID LT ID THEN ID ASSIGN NUM ELSE ID ASSIGN NUM
    Phrases: IF expression THEN statement ELSE statement
    Whole:   IF-THEN-ELSE statement

Syntax Analysis (Parsing)
  Every programming language has a set of rules that describe the syntax of well-formed programs
  Syntax analysis (parsing) is the process that determines whether a source program satisfies these rules
  Why do we need a parser in addition to a lexer?
    Some program constructs have recursive structure
      digits = [0-9]+
      expr   = {digits} | "(" {expr} "+" {expr} ")"
    Examples: 28, (28+301), ((28+301) + 9)
    Finite automata cannot recognize recursive constructs

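
The recursive definition of expr above can be recognized by a function that calls itself once per nesting level. The following is a minimal sketch in Python, not part of the original slides; the names parse_expr and is_expr are hypothetical, and whitespace is simply stripped before matching.

```python
DIGITS = "0123456789"


def parse_expr(s: str, i: int = 0) -> int:
    """Match one expr starting at s[i]; return the index just past it."""
    # expr = {digits}
    if i < len(s) and s[i] in DIGITS:
        while i < len(s) and s[i] in DIGITS:
            i += 1
        return i
    # expr = "(" {expr} "+" {expr} ")"
    if i < len(s) and s[i] == "(":
        i = parse_expr(s, i + 1)          # recursion handles any nesting depth
        if i >= len(s) or s[i] != "+":
            raise ValueError("expected '+'")
        i = parse_expr(s, i + 1)
        if i >= len(s) or s[i] != ")":
            raise ValueError("expected ')'")
        return i + 1
    raise ValueError("expected digits or '('")


def is_expr(text: str) -> bool:
    """True if the whole text is a single well-formed expr."""
    s = text.replace(" ", "")
    try:
        return parse_expr(s) == len(s)
    except ValueError:
        return False
```

The call stack plays the role of the unbounded memory that a finite automaton lacks: each open parenthesis adds one pending call.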
Limitation of Finite Automata
  Cannot recognize recursive constructs
  A machine with N states cannot remember a parenthesis-nesting depth greater than N
  Can an FA check the correctness of (( ))?
  Then can the same FA check the correctness of ((( )))?
  Can an FA remember how deeply it is nested?
  [Figure: finite automaton whose transitions are labeled ( and )]

We need a more powerful formalism: Context-Free Grammar

Context-Free Grammar
  Regular Expressions describe the lexical structure of tokens
    Regular Expressions -> Lexer Generator -> Lexer
  Context-Free Grammars describe the syntactic nature of programs
    Context-Free Grammar -> Parser Generator -> Parser

Analogy
                         Lexical Analysis    Syntax Analysis
    Output               Set of tokens       Set of source programs
    Output of Each Rule  Token               Source Program
    Input                ASCII character     Token

Context-Free Grammars
  A Context-Free Grammar consists of a set of productions
    symbol -> symbol symbol symbol
  Symbol types:
    Terminal: a token type
    Non-terminal: a symbol that appears on the left side of some production
  Left-Hand Side (LHS): a non-terminal
  Right-Hand Side (RHS): a sequence of terminals and non-terminals
  Start Symbol: a special non-terminal that stands for a whole program accepted by the grammar
  Each production specifies how terminals and non-terminals may be combined to form a substring of the language
  Recursion is easy to specify:
    stmt -> IF exp THEN stmt ELSE stmt

End-of-File Marker
  The parser must also recognize the End-of-File (EOF)
  The EOF marker in the grammar is $
  Introduce a new start symbol S' and the production S' -> S $

Derivation
  Derivation (Execution of Parsing)
    1. Begin with the start symbol
    2. While a non-terminal exists, replace any non-terminal with the RHS of one of its productions
  Multiple derivations exist for a given sentence
    Left-most derivation: replace the left-most non-terminal in each step
    Right-most derivation: replace the right-most non-terminal in each step

Example
  Terminals
    SEMI ;    ID    NUM    ASSIGN :=    LPAREN (    RPAREN )    PLUS +    PRINT print    COMMA ,
  Non-terminals
    stmt: statement
    expr: expression
    expr_list: expression list
  Rules
    stmt      -> stmt ; stmt
    stmt      -> ID := expr
    stmt      -> PRINT ( expr_list )
    expr      -> ID
    expr      -> NUM
    expr      -> expr + expr
    expr      -> ( stmt , expr )
    expr_list -> expr
    expr_list -> expr_list , expr

Example (rules written with token names)
  Terminals
    SEMI ;    ID    NUM    ASSIGN :=    LPAREN (    RPAREN )    PLUS +    PRINT print    COMMA ,
  Non-terminals
    stmt: statement
    expr: expression
    expr_list: expression list
  Rules
    stmt      -> stmt SEMI stmt
    stmt      -> ID ASSIGN expr
    stmt      -> PRINT LPAREN expr_list RPAREN
    expr      -> ID
    expr      -> NUM
    expr      -> expr PLUS expr
    expr      -> LPAREN stmt COMMA expr RPAREN
    expr_list -> expr
    expr_list -> expr_list COMMA expr

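
A grammar like the one above is easy to hold as plain data. The sketch below (Python, not from the slides; GRAMMAR and classify are hypothetical names) stores each rule as a (LHS, RHS) pair and recovers the terminal and non-terminal sets from the definition "a non-terminal is a symbol that appears on the left side of some production".

```python
# Each production is (left-hand side, right-hand side as a tuple of symbols).
GRAMMAR = [
    ("stmt", ("stmt", "SEMI", "stmt")),
    ("stmt", ("ID", "ASSIGN", "expr")),
    ("stmt", ("PRINT", "LPAREN", "expr_list", "RPAREN")),
    ("expr", ("ID",)),
    ("expr", ("NUM",)),
    ("expr", ("expr", "PLUS", "expr")),
    ("expr", ("LPAREN", "stmt", "COMMA", "expr", "RPAREN")),
    ("expr_list", ("expr",)),
    ("expr_list", ("expr_list", "COMMA", "expr")),
]


def classify(productions):
    """Split the grammar symbols into (non-terminals, terminals)."""
    # A non-terminal is any symbol that occurs as an LHS.
    nonterminals = {lhs for lhs, _ in productions}
    # Every other symbol that occurs in some RHS is a terminal (token type).
    terminals = {sym for _, rhs in productions for sym in rhs} - nonterminals
    return nonterminals, terminals
```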
Example: Left-most Derivation
  Input: a := 12; print(23)
  Result of lexical analysis: ID ASSIGN NUM SEMI PRINT LPAREN NUM RPAREN
  Left-most derivation
    1. stmt
    2. stmt SEMI stmt
    3. ID ASSIGN expr SEMI stmt
    4. ID ASSIGN NUM SEMI stmt
    5. ID ASSIGN NUM SEMI PRINT LPAREN expr_list RPAREN
    6. ID ASSIGN NUM SEMI PRINT LPAREN expr RPAREN
    7. ID ASSIGN NUM SEMI PRINT LPAREN NUM RPAREN

Example: Right-most Derivation
  Input: a := 12; print(23)
  Result of lexical analysis: ID ASSIGN NUM SEMI PRINT LPAREN NUM RPAREN
  Right-most derivation
    1. stmt
    2. stmt SEMI stmt
    3. stmt SEMI PRINT LPAREN expr_list RPAREN
    4. stmt SEMI PRINT LPAREN expr RPAREN
    5. stmt SEMI PRINT LPAREN NUM RPAREN
    6. ID ASSIGN expr SEMI PRINT LPAREN NUM RPAREN
    7. ID ASSIGN NUM SEMI PRINT LPAREN NUM RPAREN

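
Both derivation orders can be replayed mechanically. The following Python sketch is an illustration, not the course's code: derive, LEFT_CHOICES and RIGHT_CHOICES are hypothetical names, and the production to apply at each step is supplied by hand rather than chosen by a parser.

```python
P_SEQ = ("stmt", ("stmt", "SEMI", "stmt"))
P_ASSIGN = ("stmt", ("ID", "ASSIGN", "expr"))
P_PRINT = ("stmt", ("PRINT", "LPAREN", "expr_list", "RPAREN"))
P_NUM = ("expr", ("NUM",))
P_LIST = ("expr_list", ("expr",))

NONTERMINALS = {"stmt", "expr", "expr_list"}
SENTENCE = ("ID", "ASSIGN", "NUM", "SEMI", "PRINT", "LPAREN", "NUM", "RPAREN")


def derive(start, choices, leftmost=True):
    """Apply each chosen production to the left-most (or right-most)
    non-terminal and return every sentential form along the way."""
    form = [start]
    steps = [tuple(form)]
    for lhs, rhs in choices:
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        if not positions:
            raise ValueError("no non-terminal left to replace")
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"cannot apply a {lhs} production to {form[i]}")
        form[i:i + 1] = rhs               # replace the non-terminal by the RHS
        steps.append(tuple(form))
    return steps


# Production sequences matching the two derivations of a := 12; print(23).
LEFT_CHOICES = [P_SEQ, P_ASSIGN, P_NUM, P_PRINT, P_LIST, P_NUM]
RIGHT_CHOICES = [P_SEQ, P_PRINT, P_LIST, P_NUM, P_ASSIGN, P_NUM]
```

Calling derive("stmt", LEFT_CHOICES) and derive("stmt", RIGHT_CHOICES, leftmost=False) produces different intermediate forms that end in the same token sequence.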
Parsing Tree
  Graphical representation of a derivation
  Each internal node is labeled with a non-terminal
  Each leaf node is labeled with a terminal
  Parsing tree of the example: ID ASSIGN NUM SEMI PRINT LPAREN NUM RPAREN
    stmt
      stmt
        ID
        ASSIGN
        expr
          NUM
      SEMI
      stmt
        PRINT
        LPAREN
        expr_list
          expr
            NUM
        RPAREN

Inefficiency in Parsing Tree
  Concrete parse tree
    Each internal node is labeled with a non-terminal
    Its children are labeled with the symbols in the RHS of the production
  Concrete parse trees are inconvenient to use!
    Punctuation is needed to specify structure when writing code, but the tree already describes the program structure
  Make trees simple!
    Remove tokens that carry no additional information

Inefficiency in Parsing Tree
  Grammar
    P -> ( S )
    S -> S ; S
    S -> ID := E
    E -> ID
    E -> NUM
    E -> E + E
    E -> E - E
    E -> E * E
    E -> E / E
  Input: ( a := 4 ; b := 5 )
  Concrete parse tree
    P
      (
      S
        S
          ID(a)
          :=
          E
            NUM(4)
        ;
        S
          ID(b)
          :=
          E
            NUM(5)
      )
  Do we need ( , ) or ; ?

Abstract Syntax Tree
  Solution: generate an abstract parse tree (abstract syntax tree, AST)
  An AST is similar to the concrete parse tree, except that redundant tokens are left out
    CompoundStm
      AssignStm
        ID(a)
        NUM(4)
      AssignStm
        ID(b)
        NUM(5)

Abstract Syntax Tree Example
  Grammar
    P -> ( S )
    S -> S ; S
    S -> ID := E
    E -> ID
    E -> NUM
    E -> E + E
    E -> E - E
    E -> E * E
    E -> E / E
  How can you describe the abstract syntax tree structure?
    type id = string

    datatype binop = PLUS | MINUS | TIMES | DIV

    (* stm refers to exp, so the two datatypes are declared together with "and" *)
    datatype stm = CompoundStm of stm * stm
                 | AssignStm of id * exp
         and exp = IDExp of id
                 | NUMExp of int
                 | OpExp of exp * binop * exp

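
For readers more familiar with Python, the same tree shape can be sketched with dataclasses. This is an illustration only: the field names, the run and eval_exp helpers, and the choice of integer division for DIV are assumptions that do not appear on the slide.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class IDExp:
    name: str


@dataclass(frozen=True)
class NUMExp:
    value: int


@dataclass(frozen=True)
class OpExp:
    left: "Exp"
    op: str          # one of "PLUS", "MINUS", "TIMES", "DIV"
    right: "Exp"


Exp = Union[IDExp, NUMExp, OpExp]


@dataclass(frozen=True)
class AssignStm:
    target: str
    value: Exp


@dataclass(frozen=True)
class CompoundStm:
    first: "Stm"
    second: "Stm"


Stm = Union[AssignStm, CompoundStm]

# Assumption: DIV is integer division; the slide does not define its meaning.
OPS = {
    "PLUS": lambda a, b: a + b,
    "MINUS": lambda a, b: a - b,
    "TIMES": lambda a, b: a * b,
    "DIV": lambda a, b: a // b,
}


def eval_exp(e: Exp, env: dict) -> int:
    """Evaluate an expression tree under a variable environment."""
    if isinstance(e, NUMExp):
        return e.value
    if isinstance(e, IDExp):
        return env[e.name]
    return OPS[e.op](eval_exp(e.left, env), eval_exp(e.right, env))


def run(s: Stm, env: dict) -> dict:
    """Execute a statement tree and return the updated environment."""
    if isinstance(s, CompoundStm):
        return run(s.second, run(s.first, env))
    return {**env, s.target: eval_exp(s.value, env)}


# AST for ( a := 4 ; b := 5 ): no parentheses or semicolon nodes are needed.
PROGRAM = CompoundStm(AssignStm("a", NUMExp(4)), AssignStm("b", NUMExp(5)))
```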
Ambiguous Grammars
  A grammar is ambiguous if it can derive a string of tokens with two or more different parsing trees
  Example
    expr -> NUM
    expr -> expr + expr
    expr -> expr * expr
  Consider 4 + 5 * 6: is this 34 or 54?
  Tree 1, (4 + 5) * 6 = 54
    expr
      expr
        expr -> NUM(4)
        +
        expr -> NUM(5)
      *
      expr -> NUM(6)
  Tree 2, 4 + (5 * 6) = 34
    expr
      expr -> NUM(4)
      +
      expr
        expr -> NUM(5)
        *
        expr -> NUM(6)

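
The effect of the two trees can be checked directly. In the small Python sketch below (illustrative; the tuple encoding and the names tokens and evaluate are assumptions), a tree is either an integer leaf or a triple (operator, left subtree, right subtree).

```python
def tokens(tree):
    """Read the leaves left to right: the token string the tree derives."""
    if isinstance(tree, int):
        return [str(tree)]
    op, left, right = tree
    return tokens(left) + [op] + tokens(right)


def evaluate(tree):
    """Interpret the tree bottom-up, as a compiler would."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r


TIMES_AT_ROOT = ("*", ("+", 4, 5), 6)   # tree with * at the root
PLUS_AT_ROOT = ("+", 4, ("*", 5, 6))    # tree with + at the root
```

Both trees derive the same token string, yet they evaluate to different numbers, which is exactly why the ambiguity matters.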
Ambiguous Grammars
  Problem: the compiler uses the parse tree to interpret the meaning of parsed expressions
    Different parse trees may have different meanings, resulting in different interpreted results
    For example, does 4+5*6 equal 34 or 54?
  Solution: rewrite the grammar to eliminate the ambiguity
    Operators have a relative precedence
      * binds tighter than +
    Operators with the same precedence must be resolved by associativity
      Some operators have left associativity; others have right associativity

Ambiguous Grammars
  Non-terminals
    expr: Expression
    term: Term (add)
    fact: Factor (mult)
  Rules
    expr -> expr + term
    expr -> term
    term -> term * fact
    term -> fact
    fact -> NUM
  Parse tree for 4 + 5 * 6
    expr
      expr
        term
          fact
            NUM(4)
      +
      term
        term
          fact
            NUM(5)
        *
        fact
          NUM(6)

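
The layered expr / term / fact grammar yields exactly one tree per input. The Python sketch below is one way to see this; note that it swaps in loops for the left-recursive rules (expr -> expr + term becomes "a term followed by zero or more + term"), which is an implementation choice of this sketch and not something the slide prescribes. Function names and the list-of-tokens input format are assumptions.

```python
def parse_fact(toks):
    """fact -> NUM"""
    if not toks or not isinstance(toks[0], int):
        raise SyntaxError("expected NUM")
    return toks[0], toks[1:]


def parse_term(toks):
    """term -> term * fact | fact, built left-associatively with a loop."""
    tree, rest = parse_fact(toks)
    while rest and rest[0] == "*":
        right, rest = parse_fact(rest[1:])
        tree = ("*", tree, right)
    return tree, rest


def parse_expr(toks):
    """expr -> expr + term | term, built left-associatively with a loop."""
    tree, rest = parse_term(toks)
    while rest and rest[0] == "+":
        right, rest = parse_term(rest[1:])
        tree = ("+", tree, right)
    return tree, rest
```

Because a term is completed before any + is considered, * always ends up lower in the tree than +, so it binds tighter.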
How to analyze the syntax of a program?

Back to analogy
  How do you recognize an English sentence?
  Prediction-based approach
    If you see a subject, you expect a verb to follow.
    If you see a verb at the beginning of a sentence, you know the sentence is a question.
    Predictive parsing (LL parsing)
  Bottom-up approach
    Read a sentence, and then figure out its structure.
    Bottom-up parsing (LR parsing, shift-reduce parsing)

Recursive Descent Parsing
  1. LL(k) Parsing

Recursive Descent Parsing
  One recursive function for each non-terminal
  Each production becomes a clause in the function
  A.K.A. predictive parsing, top-down parsing, LL(1)
  LL(1): Left-to-right parse, Leftmost derivation, 1 symbol of lookahead

Example
  Grammar:
    Non-terminals: S, E, L
    Terminals: IF(if), THEN(then), ELSE(else), BEGIN(begin), END(end), SEMI(;), NUM, EQ(=)
    S -> if E then S else S
    S -> begin S L
    L -> end
    L -> ; S L
    E -> num = num
  Parser:
    datatype token = EOF | IF | THEN | ELSE | BEGIN | END | SEMI | NUM | EQ

    (* gettoken and error are assumed to be provided by the lexer and the error reporter *)
    val tok = ref (gettoken())
    fun advance() = tok := gettoken()
    fun eat(t) = if (!tok = t) then advance() else error()

    (* one function per non-terminal, one case clause per production;
       "and" is required because S, L and E are mutually recursive *)
    fun S() = case !tok of
          IF    => (eat(IF); E(); eat(THEN); S(); eat(ELSE); S())
        | BEGIN => (eat(BEGIN); S(); L())
        | _     => error()
    and L() = case !tok of
          END   => (eat(END))
        | SEMI  => (eat(SEMI); S(); L())
        | _     => error()
    and E() = (eat(NUM); eat(EQ); eat(NUM))

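
The same recursive descent structure can be sketched in Python. Two things here are assumptions rather than slide content: the class and method names, and an extra production S -> print E. As listed on the slide, every S production contains another S, so no finite sentence can be derived; the hypothetical print production is added only so the sketch has something to accept.

```python
class Parser:
    """One method per non-terminal; each branch corresponds to one production."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["EOF"]
        self.pos = 0

    @property
    def tok(self):
        return self.tokens[self.pos]

    def eat(self, expected):
        if self.tok != expected:
            raise SyntaxError(f"expected {expected}, got {self.tok}")
        self.pos += 1

    def S(self):
        if self.tok == "IF":          # S -> if E then S else S
            self.eat("IF"); self.E(); self.eat("THEN"); self.S()
            self.eat("ELSE"); self.S()
        elif self.tok == "BEGIN":     # S -> begin S L
            self.eat("BEGIN"); self.S(); self.L()
        elif self.tok == "PRINT":     # S -> print E (ASSUMPTION, not on the slide)
            self.eat("PRINT"); self.E()
        else:
            raise SyntaxError(f"unexpected {self.tok} at start of statement")

    def L(self):
        if self.tok == "END":         # L -> end
            self.eat("END")
        elif self.tok == "SEMI":      # L -> ; S L
            self.eat("SEMI"); self.S(); self.L()
        else:
            raise SyntaxError(f"unexpected {self.tok} in statement list")

    def E(self):                      # E -> num = num
        self.eat("NUM"); self.eat("EQ"); self.eat("NUM")


def accepts(tokens):
    """True if the token list is exactly one statement S."""
    parser = Parser(tokens)
    try:
        parser.S()
        parser.eat("EOF")
        return True
    except SyntaxError:
        return False
```

In every branch the decision is made by looking at the single current token, which is what makes this parser LL(1).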
Formal Techniques
  Before making a parser, we need to compute 3 values
  Nullable
    For each γ corresponding to the RHS of a production, γ is nullable if γ can derive the empty string (ε)
  First(γ)
    For each γ corresponding to the RHS of a production, First(γ) is the set of all terminal symbols that can begin any string derived from γ
    Ex: S -> if E then S else S
        if ∈ First(S)
  Follow(X)
    For each non-terminal X in the grammar, Follow(X) is the set of all terminal symbols that can immediately follow X in a derivation
    Ex: S -> if E then S else S
        then ∈ Follow(E)

Computation of Nullable
  γ is nullable if every symbol S ∈ γ is nullable
    Check whether S can derive ε
  Example
    Z -> X Y Z      Y -> c      X -> a
    Z -> d          Y -> ε      X -> b Y e

         Initial   Iteration 1   Iteration 2
    X    No        No            No
    Y    No        Yes           Yes
    Z    No        No            No

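
The iteration in the table above is a fixed-point computation: keep sweeping over the productions until nothing changes. A minimal Python sketch (not from the slides; compute_nullable and the list encoding of the grammar are assumptions) looks like this, with ε written as an empty right-hand side.

```python
GRAMMAR = [
    ("Z", ["X", "Y", "Z"]),
    ("Z", ["d"]),
    ("Y", ["c"]),
    ("Y", []),                 # Y -> ε
    ("X", ["a"]),
    ("X", ["b", "Y", "e"]),
]


def compute_nullable(productions):
    """Return the set of nullable non-terminals."""
    nullable = set()           # initially nothing is known to be nullable
    changed = True
    while changed:             # repeat until one full pass adds nothing
        changed = False
        for lhs, rhs in productions:
            # all() of an empty RHS is True, so X -> ε makes X nullable;
            # terminals are never in the set, so they block nullability.
            if lhs not in nullable and all(sym in nullable for sym in rhs):
                nullable.add(lhs)
                changed = True
    return nullable
```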
Computation of First
  If T is a terminal symbol, then First(T) = {T}
  If X is a non-terminal and X -> Y1 Y2 Y3 ... Yn, then
    First(Y1) ⊆ First(X)
    First(Y2) ⊆ First(X)   if Y1 is nullable
    First(Y3) ⊆ First(X)   if Y1, Y2 are nullable
    ...
    First(Yn) ⊆ First(X)   if Y1, Y2, ..., Yn-1 are nullable

Computation of Follow
  Let X, Y be non-terminals; let γ, γ1, γ2 be strings of terminals and non-terminals
  If the grammar includes the production X -> γ Y:
    Follow(X) ⊆ Follow(Y)
  If the grammar includes the production X -> γ1 Y γ2:
    First(γ2) ⊆ Follow(Y)
    Follow(X) ⊆ Follow(Y)   if γ2 is nullable
  Apply these rules iteratively to compute the nullable, First and Follow sets for each non-terminal in the grammar

Example
  Z -> X Y Z      Y -> c      X -> a
  Z -> d          Y -> ε      X -> b Y e

  Initial
         nullable   first      follow
    X    No
    Y    No
    Z    No
  Iteration 1
         nullable   first      follow
    X    No         a,b
    Y    Yes        c
    Z    No         d
  Iteration 2
         nullable   first      follow
    X    No         a,b
    Y    Yes        c
    Z    No         d,a,b
  Iteration 3
         nullable   first      follow
    X    No         a,b        c,d,a,b
    Y    Yes        c          e,d,a,b
    Z    No         d,a,b

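
The three sets can be computed together in one fixed-point loop. The Python sketch below is an illustration under stated assumptions: analyze and first_of are hypothetical names, ε is an empty right-hand side, and the loop applies the nullable, First and Follow rules in a single pass per iteration, so its intermediate states need not match the slide's iteration-by-iteration table even though the final result does.

```python
GRAMMAR = [
    ("Z", ["X", "Y", "Z"]), ("Z", ["d"]),
    ("Y", ["c"]), ("Y", []),
    ("X", ["a"]), ("X", ["b", "Y", "e"]),
]


def analyze(productions):
    """Return (nullable, first, follow) for every non-terminal."""
    nonterminals = {lhs for lhs, _ in productions}
    nullable = set()
    first = {n: set() for n in nonterminals}
    follow = {n: set() for n in nonterminals}

    def first_of(symbols):
        """First set of a symbol string, and whether the string is nullable."""
        result = set()
        for sym in symbols:
            if sym not in nonterminals:        # terminal: First(T) = {T}
                result.add(sym)
                return result, False
            result |= first[sym]
            if sym not in nullable:
                return result, False
        return result, True

    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            rhs_first, rhs_nullable = first_of(rhs)
            if rhs_nullable and lhs not in nullable:
                nullable.add(lhs)
                changed = True
            if not rhs_first <= first[lhs]:
                first[lhs] |= rhs_first
                changed = True
            for i, sym in enumerate(rhs):
                if sym in nonterminals:
                    # X -> γ1 Y γ2: First(γ2) ⊆ Follow(Y), and
                    # Follow(X) ⊆ Follow(Y) when γ2 is nullable (or empty).
                    tail_first, tail_nullable = first_of(rhs[i + 1:])
                    new = tail_first | (follow[lhs] if tail_nullable else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return nullable, first, follow
```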
Example
  Z -> X Y Z      Y -> c      X -> a
  Z -> d          Y -> ε      X -> b Y e

         nullable   first      follow
    X    No         a,b        c,d,a,b
    Y    Yes        c          e,d,a,b
    Z    No         d,a,b

  Build the predictive parsing table from the nullable, First and Follow sets

         a            b            c         d         e
    X    X -> a       X -> b Y e
    Y    Y -> ε       Y -> ε       Y -> c    Y -> ε    Y -> ε
    Z    Z -> X Y Z   Z -> X Y Z             Z -> d

  Enter S -> γ in row S, column T for each T ∈ First(γ)
  If γ is nullable, also enter S -> γ in row S, column T for each T ∈ Follow(S)
  The entry in row S, column T tells the parser which clause to execute if the current function is S and the next token is T
  Blank entries are syntax errors

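
The two table-filling rules translate almost line for line into code. In this Python sketch (illustrative only; build_table and first_of_string are hypothetical names), the nullable, First and Follow sets are copied from the table above rather than recomputed, and each cell holds a list so that a cell with more than one production would show up as a conflict.

```python
NULLABLE = {"Y"}
FIRST = {"X": {"a", "b"}, "Y": {"c"}, "Z": {"a", "b", "d"}}
FOLLOW = {"X": {"a", "b", "c", "d"}, "Y": {"a", "b", "d", "e"}, "Z": set()}

GRAMMAR = [
    ("Z", ["X", "Y", "Z"]), ("Z", ["d"]),
    ("Y", ["c"]), ("Y", []),              # [] stands for ε
    ("X", ["a"]), ("X", ["b", "Y", "e"]),
]


def first_of_string(symbols):
    """First(γ) for a symbol string γ, and whether γ is nullable."""
    result = set()
    for sym in symbols:
        if sym not in FIRST:              # terminal
            result.add(sym)
            return result, False
        result |= FIRST[sym]
        if sym not in NULLABLE:
            return result, False
    return result, True


def build_table(productions):
    """Map (non-terminal, lookahead token) to the list of applicable RHSs."""
    table = {}
    for lhs, rhs in productions:
        columns, rhs_nullable = first_of_string(rhs)
        if rhs_nullable:                  # nullable γ: also use Follow(lhs)
            columns = columns | FOLLOW[lhs]
        for token in columns:
            table.setdefault((lhs, token), []).append(rhs)
    return table
```

For this grammar every filled cell contains exactly one production, so the table can drive a predictive parser without backtracking.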
Another Example
  S' -> S $
  S  -> E
  S  -> IF E THEN A
  S  -> IF E THEN A ELSE A
  E  -> E + T
  E  -> T
  T  -> NUM
  A  -> ID = NUM

         nullable   first      follow
    S'
    S
    E
    T
    A

Another Example
  S' -> S $
  S  -> E
  S  -> IF E THEN A
  S  -> IF E THEN A ELSE A
  E  -> E + T
  E  -> T
  T  -> NUM
  A  -> ID = NUM

         nullable   first      follow
    S'   No         IF, NUM
    S    No         IF, NUM    $
    E    No         NUM        $, THEN, +
    T    No         NUM        $, THEN, +
    A    No         ID         $, ELSE

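Another Example
  S' -> S $
  S  -> E
  S  -> IF E THEN A
  S  -> IF E THEN A ELSE A
  E  -> E + T
  E  -> T
  T  -> NUM
  A  -> ID = NUM

  Predictive parsing table (columns: IF, THEN, ELSE, +, NUM, ID, =, $; only non-blank entries are listed)
    Row S'   IF:  S' -> S $
             NUM: S' -> S $
    Row S    IF:  S -> IF E THEN A
                  S -> IF E THEN A ELSE A     (two entries in one cell)
             NUM: S -> E
    Row E    NUM: E -> E + T
                  E -> T                      (two entries in one cell)
    Row T    NUM: T -> NUM
    Row A    ID:  A -> ID = NUM
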
Left-Recursion Problem
  E -> E + T
  E -> T
  First(E + T) = First(T)
  In function E(), if the next token is NUM, the parser cannot tell which production to use and gets stuck
  A left-recursive grammar cannot be LL(1)
  Solution: rewrite the grammar so that it is right-recursive
    E  -> T E'
    E' -> + T E'
    E' -> ε
  Rule:
    X -> X γ        becomes        X  -> α X'
    X -> α                         X' -> γ X'
                                   X' -> ε

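
The rewriting rule can be applied mechanically. The Python sketch below is a minimal illustration that handles only immediate left recursion on a single non-terminal; the function name, the list encoding, and the convention of naming the new symbol by appending a prime are assumptions.

```python
def eliminate_left_recursion(nonterminal, alternatives):
    """Rewrite X -> X γ | α into X -> α X', X' -> γ X' | ε.

    alternatives is a list of right-hand sides, each a list of symbols.
    Returns a dict from non-terminal to its new list of right-hand sides.
    """
    new = nonterminal + "'"
    # γ parts: what follows the leading X in each left-recursive alternative
    gammas = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # α parts: the alternatives that do not start with X
    alphas = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not gammas:
        return {nonterminal: alternatives}      # nothing to rewrite
    return {
        nonterminal: [alpha + [new] for alpha in alphas],
        new: [gamma + [new] for gamma in gammas] + [[]],   # [] stands for ε
    }
```

Applied to E -> E + T | T, it returns E -> T E' together with E' -> + T E' | ε, matching the rewritten grammar above.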
Left-Factoring
  S -> IF E THEN A
  S -> IF E THEN A ELSE A
  Two productions begin with the same symbol
    First(IF E THEN A) = First(IF E THEN A ELSE A)
  Solution: Left-Factoring
    S -> IF E THEN A V
    V -> ε
    V -> ELSE A

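
Left-factoring pulls the longest common prefix out of the conflicting alternatives and moves the differing tails into a new non-terminal. The Python sketch below is a simplified illustration: it assumes the caller passes only the alternatives that share a prefix and supplies the name of the new non-terminal; left_factor is a hypothetical name.

```python
def left_factor(nonterminal, alternatives, new_name):
    """Factor the common prefix out of the given alternatives.

    Returns a dict from non-terminal to its list of right-hand sides.
    """
    # Longest prefix shared by every alternative.
    prefix = []
    for symbols in zip(*alternatives):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    if len(alternatives) < 2 or not prefix:
        return {nonterminal: alternatives}      # nothing to factor
    n = len(prefix)
    return {
        nonterminal: [prefix + [new_name]],
        new_name: [alt[n:] for alt in alternatives],   # [] stands for ε
    }
```

With the two IF productions and new_name "V", the result is S -> IF E THEN A V and V -> ε | ELSE A.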
Modified Example
  S' -> S $
  S  -> E
  S  -> IF E THEN A V
  V  -> ELSE A
  V  -> ε
  E  -> T E'
  E' -> ε
  E' -> + T E'
  T  -> NUM
  A  -> ID = NUM

         nullable   first      follow
    S'
    S
    V
    E
    E'
    T
    A

Modified Example
  S' -> S $
  S  -> E
  S  -> IF E THEN A V
  V  -> ELSE A
  V  -> ε
  E  -> T E'
  E' -> ε
  E' -> + T E'
  T  -> NUM
  A  -> ID = NUM

         nullable   first      follow
    S'   No         IF, NUM
    S    No         IF, NUM    $
    V    Yes        ELSE       $
    E    No         NUM        $, THEN
    E'   Yes        +          $, THEN
    T    No         NUM        $, THEN, +
    A    No         ID         $, ELSE

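Modified Example
  S' -> S $
  S  -> E
  S  -> IF E THEN A V
  V  -> ELSE A
  V  -> ε
  E  -> T E'
  E' -> ε
  E' -> + T E'
  T  -> NUM
  A  -> ID = NUM

  Predictive parsing table (columns: IF, THEN, ELSE, +, NUM, ID, =, $; only non-blank entries are listed)
    Row S'   IF:   S' -> S $
             NUM:  S' -> S $
    Row S    IF:   S -> IF E THEN A V
             NUM:  S -> E
    Row V    ELSE: V -> ELSE A
             $:    V -> ε
    Row E    NUM:  E -> T E'
    Row E'   THEN: E' -> ε
             +:    E' -> + T E'
             $:    E' -> ε
    Row T    NUM:  T -> NUM
    Row A    ID:   A -> ID = NUM
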
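
With one production per cell, the table can drive a parser directly. The slides implement predictive parsing with recursive functions; the Python sketch below swaps in an explicit stack instead, which is an equivalent, commonly used formulation and not the slides' own code. TABLE transcribes the entries above; ll1_accepts is a hypothetical name, and the end marker $ is appended by the function.

```python
TABLE = {
    ("S'", "IF"): ["S", "$"],
    ("S'", "NUM"): ["S", "$"],
    ("S", "IF"): ["IF", "E", "THEN", "A", "V"],
    ("S", "NUM"): ["E"],
    ("V", "ELSE"): ["ELSE", "A"],
    ("V", "$"): [],                    # V -> ε
    ("E", "NUM"): ["T", "E'"],
    ("E'", "THEN"): [],                # E' -> ε
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", "$"): [],                   # E' -> ε
    ("T", "NUM"): ["NUM"],
    ("A", "ID"): ["ID", "=", "NUM"],
}
NONTERMINALS = {lhs for lhs, _ in TABLE}


def ll1_accepts(tokens):
    """True if the token list (without $) is derivable from S'."""
    tokens = list(tokens) + ["$"]
    stack = ["S'"]                     # symbols still expected, top at the end
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos] if pos < len(tokens) else None
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:            # blank entry: syntax error
                return False
            stack.extend(reversed(rhs))    # leftmost RHS symbol ends up on top
        elif top == look:              # terminal matches the lookahead
            pos += 1
        else:
            return False
    return pos == len(tokens)
```

Each step either expands the non-terminal on top of the stack using one lookahead token or matches a terminal, so the parser never backtracks.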