Topic 5: Syntax Analysis III
Compiler Design
Prof. Hanjun Kim
CoreLab (Compiler Research Lab), POSTECH
The Front End
Source Program -> Lexical Analysis -> Syntax Analysis -> Semantic Analysis -> IR Code Generation -> Intermediate Representation -> IR Optimization -> Target Code Generation -> Target Code Optimization -> Target Program

- Lexical Analysis: break the input into tokens (think words and punctuation)
- Syntax Analysis: parse the phrase structure (think document, paragraphs, sentences)
- Semantic Analysis: calculate the meaning
Parser in the Front End
Source -> Lexer -> Stream of Tokens -> Parser -> Abstract Syntax Tree -> FE -> IR

Parser functions:
- Verify that the token stream is valid; if it is not, report a syntax error and recover
- Build the Abstract Syntax Tree (AST)
Parsing Power
[Diagram: containment of grammar classes]
- Ambiguous grammars lie outside all of the classes below
- Unambiguous grammars include the LL(k) and LR(k) families
- LR(0) < SLR < LALR(1) < LR(1) < LR(k); LL(0) < LL(1) < LL(k)
Real-world Parser Generators
Real-world Parser Generators
Context-Free Grammar -> Parser Generator -> Parser; Stream of Tokens -> Parser -> Abstract Syntax Tree

Parser generators:
- yacc, bison: LALR parser generators for C
- ml-yacc: an LALR parser generator for ML

Parser generator specification:
- Input: a context-free grammar specifying a parser
- Outputs: a parser in the target language, and a description of its state machine
- Rules: each consists of a pattern and an action
  - The pattern is a context-free grammar production
  - The action is a fragment of ordinary target-language code
- Example: exp: exp PLUS exp   (exp1 + exp2)
Parser Generator Example: Bison

Declarations:
%{
#include <math.h>
%}
%token NUM
%left '-' '+'
%left '*'
%left NEG    /* negation--unary minus */

Rules:
%%
line: '\n'
    | exp '\n'          { printf ("\t%.10g\n", $1); }
    ;
exp:  NUM               { $$ = $1; }
    | exp '+' exp       { $$ = $1 + $3; }
    | exp '-' exp       { $$ = $1 - $3; }
    | exp '*' exp       { $$ = $1 * $3; }
    | '-' exp %prec NEG { $$ = -$2; }
    ;

User code:
%%
main () { yyparse (); }
Parser Generator Example: ML-YACC

User declaration:
structure A = struct
  type id = S.symbol
  datatype binop = PLUS | MINUS | TIMES | DIV
  datatype stm = CompoundStm of stm * stm
               | AssignStm of id * exp
  datatype exp = IDExp of id
               | NUMExp of int
               | OpExp of exp * binop * exp
end

YACC definition:
%%
%term INT of int | ID of string | PLUS | MINUS
%nonterm exp of A.exp | stm of A.stm | prog of A.stm

Rules:
%%
prog: LPAREN stm RPAREN  (stm)
stm:  stm SEMICOLON stm  (A.CompoundStm(stm1, stm2))
stm:  ID ASSIGN exp      (A.AssignStm(S.symbol(ID), exp))
exp:  INT                (A.NUMExp(INT))
exp:  ID                 (A.IDExp(S.symbol(ID)))
exp:  exp PLUS exp       (A.OpExp(exp1, A.PLUS, exp2))
exp:  exp MINUS exp      (A.OpExp(exp1, A.MINUS, exp2))
Parser Generator Example: ML-YACC

User declaration:
- Defines values that are available to the rules section

YACC definition:
- Declares terminal and non-terminal symbols and their attributes
  %term IF | THEN | ELSE | NUM of int
  %nonterm prog | stmt | exp
- Declares precedences for terminals, which help resolve shift-reduce conflicts
- Specifies the type of the current input file position (%pos int)
- Optionally specifies an end-of-parse symbol (%eop EOF)
- Optionally specifies the start symbol (%start prog); otherwise, the LHS non-terminal of the first rule is taken as the start symbol

Rules:
- Specify the productions of the grammar and the semantic action associated with each production
  symbol0: symbol1 symbol2 ... symboln   (semantic action)
Parser Generator Example: ML-YACC Positions

To report semantic errors, each AST node must be annotated with the source-file position of its characters:
- Xn: returns the attribute of the nth occurrence of X
- Xnleft: returns the left-end position of the token(s) corresponding to X
- Xnright: returns the right-end position of the token(s) corresponding to X

Example:
stm: stm SEMICOLON stm  (A.PosStm(stm1left, A.CompoundStm(stm1, stm2)))
AST Example

structure A = struct
  type id = S.symbol
  datatype binop = PLUS | MINUS | TIMES | DIV
  datatype stm = CompoundStm of stm * stm
               | AssignStm of id * exp
               | PosStm of int * stm
  datatype exp = IDExp of id
               | NUMExp of int
               | OpExp of exp * binop * exp
               | PosExp of int * exp
end
%%
%term INT of int | ID of string | PLUS | MINUS
%nonterm exp of A.exp | stm of A.stm | prog of A.stm
%%
prog: LPAREN stm RPAREN  (stm)
stm:  stm SEMICOLON stm  (A.PosStm(stm1left, A.CompoundStm(stm1, stm2)))
stm:  ID ASSIGN exp      (A.PosStm(IDleft, A.AssignStm(S.symbol(ID), exp)))
exp:  INT                (A.PosExp(INTleft, A.NUMExp(INT)))
exp:  ID                 (A.PosExp(IDleft, A.IDExp(S.symbol(ID))))
AST Example

Input program: (a := 5 ; b := a + 1)

Abstract syntax (positions are 1-indexed character offsets):
PosStm[ int = 2,
  stm = CompoundStm[
    stm = PosStm[ int = 2,
      stm = AssignStm[
        id  = S.symbol(a),
        exp = PosExp[int = 7, exp = NUMExp(5)] ] ],
    stm = PosStm[ int = 11,
      stm = AssignStm[
        id  = S.symbol(b),
        exp = PosExp[ int = 16,
          exp = OpExp[
            exp   = PosExp[int = 16, exp = IDExp(S.symbol(a))],
            binop = PLUS,
            exp   = PosExp[int = 20, exp = NUMExp(1)] ] ] ] ] ] ]
YACC & Ambiguous Grammars

A grammar is ambiguous if it can derive a string of tokens with two or more different parse trees.

Consider 4+5*6. It has two parse trees:
- one with * at the root: (4+5)*6
- one with + at the root: 4+(5*6)

We prefer to bind * tighter than +, i.e. 4+(5*6).
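The two trees are not just cosmetically different: evaluating each yields a different value. A small Python sketch (the tuple encoding and the `ev` helper are my own illustration, not part of yacc):

```python
# Each parse tree as a nested tuple: (operator, left subtree, right subtree).
def ev(t):
    """Evaluate a parse tree bottom-up."""
    if isinstance(t, int):
        return t
    op, left, right = t
    return ev(left) + ev(right) if op == '+' else ev(left) * ev(right)

star_at_root = ('*', ('+', 4, 5), 6)   # (4+5)*6 -- the unwanted tree
plus_at_root = ('+', 4, ('*', 5, 6))   # 4+(5*6) -- the preferred tree

print(ev(star_at_root))   # 54
print(ev(plus_at_root))   # 34
```

Because the two trees disagree (54 vs 34), the parser must commit to exactly one of them.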
YACC & Ambiguous Grammars

Similarly, consider 4+5+6. It has two parse trees:
- (4+5)+6
- 4+(5+6)

We prefer to bind the left + first, i.e. (4+5)+6.
YACC & Ambiguous Grammars

YACC reports these ambiguities as shift-reduce conflicts:
- 4+5*6: with 4+5 on top of the stack, the parser sees * as the current token. It can reduce by the + rule or shift the *. We prefer shift.
- 4+5+6: with 4+5 on top of the stack, the parser sees + as the current token. It can reduce by the + rule or shift the +. We prefer reduce.
Directives

Three solutions:
1. Let YACC complain, but check that its default choice (shift) is correct
2. Rewrite the grammar to eliminate the ambiguity
3. Keep the grammar, but add precedence directives that enable conflicts to be resolved

Use %left, %right, %nonassoc. For this grammar:
%left PLUS MINUS
%left MULT DIV

- PLUS and MINUS are left associative and bind equally tightly
- MULT and DIV are left associative and bind equally tightly
- MULT and DIV bind tighter than PLUS and MINUS
Directives

Given directives, YACC assigns a precedence to each terminal and each rule:
- The precedence of a terminal is based on the order in which its associativity is specified (later declarations bind tighter)
- The precedence of a rule is the precedence of its right-most terminal
  Ex: precedence(E -> E + E) = precedence(PLUS)

Given a shift-reduce conflict, YACC finds the precedence of the rule to be reduced and of the terminal to be shifted, then:
- prec(terminal) > prec(rule): shift
- prec(rule) > prec(terminal): reduce
- prec(terminal) = prec(rule):
  - assoc(terminal) = left: reduce
  - assoc(terminal) = right: shift
  - assoc(terminal) = nonassoc: report an error
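The resolution rules above can be sketched as a small decision function. This is an illustration of the algorithm, not yacc's actual source; the table names PREC and ASSOC are my own, encoding "%left PLUS MINUS" followed by "%left MULT DIV" (later declarations bind tighter):

```python
# Precedence and associativity tables, as yacc would derive them from
# "%left PLUS MINUS" (level 1) then "%left MULT DIV" (level 2).
PREC  = {'PLUS': 1, 'MINUS': 1, 'MULT': 2, 'DIV': 2}
ASSOC = {'PLUS': 'left', 'MINUS': 'left', 'MULT': 'left', 'DIV': 'left'}

def resolve(rule_terminal, lookahead):
    """Decide a shift-reduce conflict.

    rule_terminal: right-most terminal of the rule on the stack
                   (its precedence is the rule's precedence).
    lookahead:     terminal the parser could shift next.
    """
    rule_prec, tok_prec = PREC[rule_terminal], PREC[lookahead]
    if tok_prec > rule_prec:
        return 'shift'
    if rule_prec > tok_prec:
        return 'reduce'
    # Equal precedence: associativity of the lookahead breaks the tie.
    assoc = ASSOC[lookahead]
    if assoc == 'left':
        return 'reduce'
    if assoc == 'right':
        return 'shift'
    return 'error'        # nonassoc

print(resolve('PLUS', 'MULT'))   # shift   (4 + 5 . * 6)
print(resolve('MULT', 'PLUS'))   # reduce  (4 * 5 . + 6)
print(resolve('PLUS', 'PLUS'))   # reduce  (4 + 5 . + 6)
```

The three printed cases match the worked examples on the next slide.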
Precedence Example

Input: 4 + 5 * 6   Stack: 4 + 5   Lookahead: *
Action: prec(*) > prec(+) -> shift

Input: 4 * 5 + 6   Stack: 4 * 5   Lookahead: +
Action: prec(*) > prec(+) -> reduce

Input: 4 + 5 + 6   Stack: 4 + 5   Lookahead: +
Action: assoc(+) = left -> reduce
Default Behavior

What if directives are not specified?
- shift-reduce conflict: report it, shift by default
- reduce-reduce conflict: report it, reduce by the rule that occurs first

What to do:
- shift-reduce: acceptable in well-defined cases (e.g. the dangling else)
- reduce-reduce: unacceptable; rewrite the grammar
%prec Directive

Commonly used for the unary minus problem:
%left PLUS MINUS
%left MULT DIV

Consider -4*6. We prefer unary minus (-) to bind tighter, i.e. (-4)*6. But the precedence of MINUS is lower than MULT, so the parser produces -(4*6), not (-4)*6.

Solution: introduce a pseudo-terminal UMINUS with the highest precedence, and tag the unary-minus rule with %prec UMINUS:
%term NUM PLUS MINUS MULT DIV UMINUS
%left PLUS MINUS
%left MULT DIV
%left UMINUS
...
exp: MINUS exp %prec UMINUS  ()
exp: exp PLUS exp            ()
A parser can support semantic actions. Why does a compiler separate semantic actions from parsing?
Precedence Parsing with Semantic Actions

Grammar:
E -> E + E
E -> E - E
E -> E * E
E -> NUM
E -> -NUM

ML-YACC specification:
%%
%term INT of int | PLUS | MINUS | TIMES | UMINUS | EOF
%nonterm exp of int
%start exp
%eop EOF
%left PLUS MINUS      (* left associativity *)
%left TIMES
%left UMINUS
%%
exp: INT                     (INT)
exp: exp PLUS exp            (exp1 + exp2)
exp: exp MINUS exp           (exp1 - exp2)
exp: exp TIMES exp           (exp1 * exp2)
exp: MINUS exp %prec UMINUS  (~exp)
Parsing with Semantic Actions

Grammar: E -> E + E | E - E | E * E | NUM | -NUM

Input program: 1 + 2 * 3

Stack                           Input         Action
                                1 + 2 * 3 $   shift
NUM(1)                          + 2 * 3 $     reduce
E(1)                            + 2 * 3 $     shift
E(1) PLUS                       2 * 3 $       shift
E(1) PLUS NUM(2)                * 3 $         reduce
E(1) PLUS E(2)                  * 3 $         shift
E(1) PLUS E(2) TIMES            3 $           shift
E(1) PLUS E(2) TIMES NUM(3)     $             reduce
E(1) PLUS E(2) TIMES E(3)       $             reduce
E(1) PLUS E(6)                  $             reduce
E(7)                            $             accept
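The same semantic actions (compute values instead of building an AST) can be imitated with a precedence-climbing evaluator. This is an illustrative Python sketch under my own naming, not generated parser code; it encodes the %left levels in a PREC table and gives unary minus the tightest binding, like %prec UMINUS:

```python
import re

# Precedence levels mirroring "%left PLUS MINUS" then "%left TIMES".
PREC = {'+': 1, '-': 1, '*': 2}

def tokenize(src):
    """Split the input into number and operator tokens."""
    return re.findall(r'\d+|[+\-*]', src)

def evaluate(src):
    """Evaluate an expression with +, -, *, and unary minus."""
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def next_tok():
        nonlocal pos
        t = tokens[pos]
        pos += 1
        return t

    def parse_primary():
        t = next_tok()
        if t == '-':              # unary minus binds tightest (UMINUS)
            return -parse_primary()
        return int(t)

    def parse_expr(min_prec):
        lhs = parse_primary()
        while peek() in PREC and PREC[peek()] >= min_prec:
            op = next_tok()
            # "+ 1" makes each level left associative: an operator of
            # equal precedence on the right does not capture the operand.
            rhs = parse_expr(PREC[op] + 1)
            lhs = lhs + rhs if op == '+' else lhs - rhs if op == '-' else lhs * rhs
        return lhs

    return parse_expr(1)

print(evaluate("1 + 2 * 3"))   # 7, matching the trace above
print(evaluate("4 - 5 - 6"))   # -7: left associative, (4-5)-6
print(evaluate("-4 * 6"))      # -24: unary minus binds as (-4)*6
```

Note how the three precedence decisions (shift on tighter lookahead, reduce on looser, reduce on equal-and-left) correspond to the `>=` comparison and the `+ 1` in the recursive call.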
Parsing with Semantic Actions

Disadvantages of doing all compilation inside semantic actions:
- The specification file becomes too large and difficult to manage
- The program must be processed in the order in which it is parsed, so global and inter-procedural optimization is impossible

Alternative: separate parsing from the remaining compiler phases.
Context-Free Grammars are More Powerful than Regular Expressions
Context-Free Grammars & REs

CFGs are more powerful than REs:
- Any language that can be generated by a regular expression can be generated by a context-free grammar
- There are languages that can be generated by a context-free grammar but cannot be generated by any regular expression
  Examples: matching parentheses, nested comments
Proof Outline

1. Given an RE R, we can construct a CFG G such that L(R) == L(G)
2. We can define a grammar G for which there is no FA F such that L(F) == L(G)
Proof 1: Every RE Has an Equivalent CFG

Build a grammar by induction on the structure of the RE; each sub-expression gets its own nonterminal.

Base cases:
- Symbol a:       A -> a
- Epsilon (ε):    A -> ε

Inductive cases (with M', N' the nonterminals for sub-expressions M, N):
- Alternation (M | N):   A -> M' | N'
- Concatenation (M N):   A -> M' N'
- Repetition (M*):       A -> M' A | ε
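The inductive construction can be written down directly as code. This is a sketch of the proof's construction; the RE node classes (Sym, Eps, Alt, Cat, Star) and the nonterminal naming scheme are my own:

```python
from dataclasses import dataclass

# Regular-expression AST nodes (one per proof case).
@dataclass
class Sym:          # a single symbol a
    ch: str

@dataclass
class Eps:          # epsilon
    pass

@dataclass
class Alt:          # M | N
    left: object
    right: object

@dataclass
class Cat:          # M N
    left: object
    right: object

@dataclass
class Star:         # M*
    body: object

def re_to_cfg(re_node):
    """Return (start_nonterminal, productions) for an equivalent CFG.

    Productions are (nonterminal, [rhs symbols]); an empty RHS is epsilon.
    """
    rules = []
    count = [0]

    def fresh():
        count[0] += 1
        return f"N{count[0]}"

    def build(node):
        nt = fresh()                      # one nonterminal per RE node
        if isinstance(node, Sym):
            rules.append((nt, [node.ch]))               # A -> a
        elif isinstance(node, Eps):
            rules.append((nt, []))                      # A -> epsilon
        elif isinstance(node, Alt):
            rules.append((nt, [build(node.left)]))      # A -> M'
            rules.append((nt, [build(node.right)]))     # A -> N'
        elif isinstance(node, Cat):
            rules.append((nt, [build(node.left), build(node.right)]))
        elif isinstance(node, Star):
            rules.append((nt, [build(node.body), nt]))  # A -> M' A
            rules.append((nt, []))                      # A -> epsilon
        return nt

    return build(re_node), rules

# Example: (a|b)* becomes a five-nonterminal grammar.
start, rules = re_to_cfg(Star(Alt(Sym('a'), Sym('b'))))
```

Each rule the function emits corresponds line-for-line to one case of the proof, which is why the construction shows L(R) == L(G).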
Proof 2: A CFG with No Equivalent FA

Grammar for matched parentheses:
S -> ( S ) S
S -> ε

An FA has a FINITE number of states, N. To accept exactly the matched strings, the FA must remember the number of ('s in order to demand the same number of )'s. After reading at most N+1 ('s, the FA must revisit some state, so that state represents two different counts of pending )'s. Both counts must now lead to acceptance, yet one of them is invalid: a contradiction.

Representations:
- Regular (finite-state) grammars: finite automata (FAs)
- Context-free grammars: push-down automata (PDAs)
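The stack that a PDA adds is exactly what the FA lacks; for this language a single unbounded counter (a degenerate one-symbol stack) suffices. A minimal Python sketch:

```python
def matched(s):
    """Recognize the matched-parentheses language with an unbounded counter.

    The counter plays the role of the PDA's stack: it can grow without
    bound, which no machine with finitely many states can simulate for
    arbitrarily deep nesting.
    """
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a ')' with no matching '('
                return False
        else:
            return False       # alphabet is only ( and )
    return depth == 0          # every '(' was closed

print(matched("(()())"))   # True
print(matched("(()"))      # False
```

A DFA with N states, by contrast, cannot distinguish the N+1 distinct depths reached by "(", "((", ..., which is the counting argument above.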
Applications of Lexers and Parsers
Applications

- Compilers & interpreters
- Pattern matching
  - Searching for an exact word (ex. "compiler")
  - Find and replace with a rule (ex. [a-z][a-z0-9]*)
- Rendering
  - Rendering a web page of HTML + content
  - Rendering an image
  - Printing a document
- Natural language processing
  - Translation
  - Understanding Korean particles
- Data analysis
  - Analyzing XML files
  - Big data analysis