Building Compilers with Phoenix: Syntax-Directed Translation
Structure of a Compiler
(Figure: compiler phases and the data passed between them)
  character stream
  -> Lexical Analyzer -> token stream
  -> Syntax Analyzer -> syntax tree
  -> Semantic Analyzer -> syntax tree
  -> Intermediate Code Generation -> intermediate representation
  -> Machine-Independent Optimizer -> intermediate representation
  -> Code Generator -> target machine code
  -> Machine-Dependent Optimizer -> target machine code
Syntax Definition
- (E)BNF: (Extended) Backus-Naur Form
- context-free grammars
  - terminal symbols: provided by the scanner / lexical analysis (tokens / lexemes)
  - nonterminal symbols: syntactic variables
  - productions
    - head / left side: a nonterminal
    - arrow
    - body / right side: sequence of terminals and/or nonterminals, possibly ε
- BNF notational convenience: | (or)
- EBNF additional operators: [optional], {zero-or-more}, (group)
  - alternatively: * (Kleene star), +, ?
BNF Example
  list  → list '+' digit
  list  → list '-' digit
  list  → digit
  digit → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Derivations
- start symbol
- derivation step: replace a nonterminal with the right-hand side of one of its productions
- word (of a language): sequence of terminals derivable from the start symbol
- language: set of all words
Parse Trees
- tree of terminals and nonterminals
- start symbol at the root
- terminals in the leaves
- children: right-hand side of a production
(Figure: parse tree for 9-5+2 under the list grammar, with nested list nodes on the left and one digit node per operand)
Ambiguity
- multiple parse trees for a single word of the language
  string → string '+' string
         | string '-' string
         | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
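To make the ambiguity concrete, the following sketch counts how many distinct parse trees the string grammar admits for a given word. The function name `count_parse_trees` and the approach (trying every operator as the top-level production and memoizing sub-results) are illustrative choices, not part of the slides.

```python
from functools import lru_cache

DIGITS = "0123456789"


def count_parse_trees(word: str) -> int:
    """Count parse trees of `word` under
    string -> string '+' string | string '-' string | digit."""

    @lru_cache(maxsize=None)
    def count(lo: int, hi: int) -> int:
        # Number of ways word[lo:hi] can be derived from `string`.
        if hi - lo == 1:
            return 1 if word[lo] in DIGITS else 0
        total = 0
        # Try every interior operator as the root production.
        for k in range(lo + 1, hi - 1):
            if word[k] in "+-":
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(word))


if __name__ == "__main__":
    print(count_parse_trees("9-5+2"))
```

For 9-5+2 there are two trees, corresponding to (9-5)+2 and 9-(5+2), which evaluate to different results; this is why the grammar needs to be disambiguated.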
Associativity
- favor one parse tree over another in case of ambiguity
- left-associative vs. right-associative
- in BNF: left-recursive vs. right-recursive productions
  init   → letter '=' init | letter
  letter → 'a' | 'b' | 'c' | ... | 'z'
Operator Precedence
- resolution of ambiguity for different (binary) operators
- different nesting of nonterminals
  factor → digit | '(' expr ')'
  term   → term '*' factor | term '/' factor | factor
  expr   → expr '+' term | expr '-' term | term
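The effect of nesting expr, term and factor can be made visible by fully parenthesizing an input according to the grammar. This is a minimal sketch: the function name `parenthesize` is an assumption, and the left-recursive productions are implemented with loops instead of recursion, which yields the same left-associative grouping.

```python
DIGITS = "0123456789"


def parenthesize(text: str) -> str:
    """Return `text` with explicit parentheses around every binary operation."""
    pos = 0

    def peek() -> str:
        # Current unconsumed character, or "" at end of input.
        return text[pos] if pos < len(text) else ""

    def factor() -> str:
        # factor -> digit | '(' expr ')'
        nonlocal pos
        ch = peek()
        if ch and ch in DIGITS:
            pos += 1
            return ch
        if ch == "(":
            pos += 1
            inner = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return inner
        raise SyntaxError(f"unexpected {ch!r} at position {pos}")

    def term() -> str:
        # term -> term '*' factor | term '/' factor | factor
        nonlocal pos
        left = factor()
        while peek() and peek() in "*/":
            op = peek()
            pos += 1
            right = factor()
            left = f"({left}{op}{right})"
        return left

    def expr() -> str:
        # expr -> expr '+' term | expr '-' term | term
        nonlocal pos
        left = term()
        while peek() and peek() in "+-":
            op = peek()
            pos += 1
            right = term()
            left = f"({left}{op}{right})"
        return left

    result = expr()
    if pos != len(text):
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return result


if __name__ == "__main__":
    print(parenthesize("2+3*4"))
    print(parenthesize("9-5-2"))
```

Because term sits below expr in the grammar, a multiplication is always grouped before the surrounding addition: `2+3*4` becomes `(2+(3*4))`, while `9-5-2` groups to the left as `((9-5)-2)`.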
Syntax-directed Translation
- Syntax analysis: form parse tree in memory
- Semantic analysis: attach attributes to nodes in the tree (attributed grammar)
- Synthesis: output attribute values
- Alternatively: don't represent the syntax tree in memory, but represent the tree hierarchy only in the call stack / value stack
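Example: Postfix Notation
(Figure: annotated parse tree for 9-5+2. Attribute values from the root down: expr.t = 95-2+; below it expr.t = 95- and term.t = 2; below that expr.t = 9 and term.t = 5; then term.t = 9; leaves 9, 5, 2.)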
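The attribute values in this example can be reproduced by computing a synthesized attribute t bottom-up over a parse tree. The sketch below assumes a parse tree encoded as nested `(label, children)` tuples with terminals as plain strings; this encoding and the function name `postfix` are illustrative, not from the slides.

```python
def postfix(node) -> str:
    """Compute the synthesized attribute t (postfix form) of a parse-tree node."""
    if isinstance(node, str):
        return node  # terminal: its own text
    label, children = node
    if label == "term":
        # term -> digit            { term.t = digit }
        return children[0]
    if label == "expr" and len(children) == 1:
        # expr -> term             { expr.t = term.t }
        return postfix(children[0])
    if label == "expr" and len(children) == 3:
        # expr -> expr1 op term    { expr.t = expr1.t + term.t + op }
        left, op, right = children
        return postfix(left) + postfix(right) + op
    raise ValueError(f"unexpected node {node!r}")


# Parse tree for 9-5+2 (hand-built for illustration).
TREE = (
    "expr",
    [
        ("expr", [("expr", [("term", ["9"])]), "-", ("term", ["5"])]),
        "+",
        ("term", ["2"]),
    ],
)

if __name__ == "__main__":
    print(postfix(TREE))
```

Each node's value depends only on the values of its children, which is what allows the same computation to run on a value stack without ever building the tree.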
Tree Traversals
- depth-first vs. breadth-first
  - recursive traversal: depth-first
- top-down vs. bottom-up
  - both useful in parsing
- preorder vs. postorder
  - cases of depth-first
  - potentially: consider node both before and after visiting children
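The traversal orders can be compared on one small tree. In this sketch a tree node is a `(label, children)` tuple, and a single depth-first walk accepts optional `enter` and `leave` callbacks so that a node can be considered both before and after its children; all names here are illustrative.

```python
from collections import deque


def depth_first(node, enter=None, leave=None):
    """Recursive depth-first walk calling enter(label) before the children
    and leave(label) after them."""
    label, children = node
    if enter:
        enter(label)
    for child in children:
        depth_first(child, enter, leave)
    if leave:
        leave(label)


def breadth_first(root):
    """Return labels level by level, using an explicit queue."""
    order, queue = [], deque([root])
    while queue:
        label, children = queue.popleft()
        order.append(label)
        queue.extend(children)
    return order


def preorder(root):
    out = []
    depth_first(root, enter=out.append)
    return out


def postorder(root):
    out = []
    depth_first(root, leave=out.append)
    return out


# Tree for (9 - 5) + 2.
TREE = ("+", [("-", [("9", []), ("5", [])]), ("2", [])])

if __name__ == "__main__":
    print(preorder(TREE))
    print(postorder(TREE))
    print(breadth_first(TREE))
```

Postorder on this tree yields 9 5 - 2 +, which is the postfix form of the expression.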
Top-Down Parsing
- top-down processing of an imaginary parse tree
  - an in-memory parse tree might get created through post-order creation of tree nodes
- one function created per nonterminal
- lookahead token: not-yet-consumed input token
  - e.g. global variable, member of parser object
  - possibly multiple lookahead tokens
- select the alternative of a production according to the lookahead
  - for terminals, consume the lookahead
  - for nonterminals, descend into the appropriate function (recursive-descent parsing)
  - proceed reading additional tokens for the right-hand side of the selected production
- ambiguous grammars: may need to backtrack
  - ambiguity vs. conflict
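The points above can be sketched as a tiny recursive-descent parser. The grammar S → '(' S ')' S | ε (balanced parentheses) is a hypothetical example chosen for brevity, as are the names `Parser`, `match` and `max_depth`. The lookahead is a member of the parser object, and the single nonterminal gets one method.

```python
class Parser:
    """Recursive-descent parser for the hypothetical grammar
    S -> '(' S ')' S | ε"""

    def __init__(self, text: str):
        self._tokens = iter(text)
        # The not-yet-consumed input token; None marks end of input.
        self.lookahead = next(self._tokens, None)

    def match(self, expected: str) -> None:
        """Consume the lookahead if it is the expected terminal."""
        if self.lookahead != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.lookahead!r}")
        self.lookahead = next(self._tokens, None)

    def s(self) -> int:
        """Parse one S and return its maximum nesting depth."""
        if self.lookahead == "(":
            # Alternative S -> '(' S ')' S, selected by the lookahead.
            self.match("(")
            inner = self.s()
            self.match(")")
            rest = self.s()
            return max(inner + 1, rest)
        # Alternative S -> ε: consume nothing.
        return 0


def max_depth(text: str) -> int:
    parser = Parser(text)
    depth = parser.s()
    if parser.lookahead is not None:
        raise SyntaxError(f"unexpected {parser.lookahead!r}")
    return depth


if __name__ == "__main__":
    print(max_depth("(()())()"))
```

No tree is built here: the nesting of the imaginary parse tree exists only as the nesting of calls to `s`, and the result is passed back through return values.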
Predictive Parsing
- avoid backtracking by always knowing which alternative to choose
  - requires constraints on the grammar
- FIRST set: set of all possible first tokens of an alternative
  - if alternative a starts with terminal t: FIRST(a) = { t }
  - if alternative a starts with nonterminal e: FIRST(a) = FIRST(e)
  - ε-productions: if ε can be derived from e, then also include FIRST(e2) in FIRST(a), where e2 follows e
- FIRST sets of all alternatives need to be disjoint
  - reformulate the grammar if FIRST sets overlap
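The FIRST rules above can be turned into a small fixed-point computation. This is a sketch under assumptions: a grammar is encoded as a dict from nonterminal to a list of alternatives (each a list of symbols, with the empty list standing for ε), any symbol that is not a key of the dict is treated as a terminal, and the example grammar with the helper nonterminal `rest` is hypothetical.

```python
EPS = "ε"


def first_of_sequence(seq, first):
    """FIRST of a sequence of symbols, given FIRST sets of the nonterminals."""
    result = set()
    for sym in seq:
        sym_first = first[sym] if sym in first else {sym}  # terminal t: { t }
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result  # sym cannot vanish, so later symbols do not matter
    result.add(EPS)  # every symbol can derive ε (or the sequence is empty)
    return result


def compute_first(grammar):
    """Iterate until no FIRST set grows any further."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                new = first_of_sequence(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def alternatives_disjoint(grammar, first, nt):
    """Check that the FIRST sets of all alternatives of `nt` are disjoint."""
    seen = set()
    for alt in grammar[nt]:
        alt_first = first_of_sequence(alt, first)
        if seen & alt_first:
            return False
        seen |= alt_first
    return True


GRAMMAR = {
    "expr": [["term", "rest"]],
    "rest": [["+", "term", "rest"], ["-", "term", "rest"], []],
    "term": [["digit"]],
}

if __name__ == "__main__":
    FIRST = compute_first(GRAMMAR)
    for nt in GRAMMAR:
        print(nt, sorted(FIRST[nt]), alternatives_disjoint(GRAMMAR, FIRST, nt))
```

The disjointness check covers only the condition stated on the slide; a complete LL(1) test additionally consults FOLLOW sets for alternatives that can derive ε.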
Left Recursion
- left-recursive production: a recursive-descent parser will overflow the stack
- reformulate left recursion:
    A → Aα | Aβ | γ
  becomes
    A → γR
    R → αR | βR | ε
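The rewrite can be expressed as a short function over a grammar fragment. This sketch handles only immediate left recursion of a single nonterminal; the function name, the list-of-symbols encoding of alternatives (empty list for ε) and the name chosen for the new nonterminal are assumptions made for illustration.

```python
def eliminate_immediate_left_recursion(nt, alternatives, new_nt):
    """Rewrite  A -> A α | A β | γ   as   A -> γ R ;  R -> α R | β R | ε.

    `alternatives` is a list of symbol lists; the empty list stands for ε.
    Returns a dict mapping each resulting nonterminal to its alternatives.
    """
    # Tails α, β of the left-recursive alternatives.
    tails = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    # The remaining alternatives γ.
    others = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not tails:
        return {nt: alternatives}  # nothing to do
    return {
        nt: [gamma + [new_nt] for gamma in others],
        new_nt: [tail + [new_nt] for tail in tails] + [[]],
    }


if __name__ == "__main__":
    # list -> list '+' digit | list '-' digit | digit
    rewritten = eliminate_immediate_left_recursion(
        "list",
        [["list", "+", "digit"], ["list", "-", "digit"], ["digit"]],
        "R",
    )
    for head, alts in rewritten.items():
        print(head, "->", " | ".join(" ".join(a) or "ε" for a in alts))
```

Applied to the list grammar from the BNF example, this yields list → digit R and R → '+' digit R | '-' digit R | ε, which a recursive-descent parser can process without recursing before consuming a token.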
Left Factorization
- alternatives overlap in their FIRST sets
- extract the common prefix into a separate nonterminal:
    A → αX | αY | β
  becomes
    H → α
    A → H (X | Y) | β
  or, with a further nonterminal T:
    H → α
    A → H T | β
    T → (X | Y)
Abstract Syntax Trees
- unambiguous representation of the program
- abstract away unnecessary punctuation
- use nodes specialized for the language
(Figure: abstract syntax tree for 9-5+2, with operator nodes '-' and '+' and leaves 9, 5, 2)
Lexical Analysis
- might apply BNF and syntax-directed processing to lexical analysis as well
- however:
  - lexis is often simpler than syntax (using regular languages, not arbitrary context-free ones)
    - analysis possible using finite automata
  - lexis is often ambiguous
    - e.g. "staticpublicintfoo"
    - ambiguities are broken in a local fashion, e.g. prefer the longest match
- lexer needs to drop white-space tokens (including comments)
- lexer needs to group lexemes into token classes (e.g. identifier), with the original lexeme as value
- lexer needs to consider "reserved" words (keywords)
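A minimal lexer sketch illustrating the longest-match rule, token classes and dropped white space follows. The token classes, the `//` comment syntax and the operator set are assumptions chosen for the example. Since Python's `re` alternation picks the first matching branch instead of the longest one, the code tries every token pattern at the current position and keeps the longest match explicitly.

```python
import re

# (token class, pattern) pairs; classes and syntax are illustrative.
TOKEN_SPECS = [
    ("WHITESPACE", re.compile(r"\s+")),
    ("COMMENT", re.compile(r"//[^\n]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP", re.compile(r"==|=|\+|-|\*|/|\(|\)")),
]
SKIPPED = {"WHITESPACE", "COMMENT"}


def tokenize(text: str):
    """Split `text` into (token class, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in TOKEN_SPECS:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if best_kind not in SKIPPED:
            # Keep the original lexeme as the token's value.
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    print(tokenize("x == 42 // answer"))
    print(tokenize("staticpublicintfoo"))
```

With longest match, `staticpublicintfoo` is a single identifier instead of three keywords followed by a name, and `//` starts a comment instead of being read as two division operators.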
Recognizing Keywords
- often keywords use identifier syntax
- solution: recognize identifiers, then check whether the identifier is a keyword
  - binary search, hashing (perfect hash functions)
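Both lookup strategies can be sketched in a few lines. The keyword list and function names below are made up for illustration; the hashing variant uses an ordinary Python `frozenset` and not a perfect hash function, and the binary-search variant uses the standard `bisect` module on a sorted list.

```python
import bisect

# Illustrative keyword list, not taken from any particular language.
KEYWORDS = frozenset({"if", "else", "while", "return", "static", "public", "int"})
SORTED_KEYWORDS = sorted(KEYWORDS)


def classify(lexeme: str):
    """Hashing: reclassify an already recognized identifier if it is reserved."""
    kind = "KEYWORD" if lexeme in KEYWORDS else "IDENT"
    return (kind, lexeme)


def is_keyword_bsearch(lexeme: str) -> bool:
    """Binary search over the sorted keyword list."""
    i = bisect.bisect_left(SORTED_KEYWORDS, lexeme)
    return i < len(SORTED_KEYWORDS) and SORTED_KEYWORDS[i] == lexeme


if __name__ == "__main__":
    for word in ["static", "staticpublicintfoo", "int", "integer"]:
        print(classify(word), is_keyword_bsearch(word))
```

The check runs on whole identifiers only, so a name that merely begins with a keyword, such as `integer`, stays an identifier.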