Front End Hwansoo Han
Traditional Two-pass Compiler: Source code → Front End → IR → Back End → Machine code (errors reported along the way). High-level functions: recognize a legal program and generate correct code (that the OS & linker can accept); manage the storage of all variables and code. Two passes: use an intermediate representation (IR). Front end maps legal source code into IR - O(n) or O(n log n). Back end maps IR into target machine code - typically NP-complete. Admits multiple front ends & multiple passes (better code) 2
Front End: Source code → Scanner → tokens → Parser → IR (errors). Responsibilities: recognize legal (& illegal) programs; report errors in a useful way; produce IR & preliminary storage map; shape the code for the back end. Much of front end construction can be automated: lex/yacc, flex/bison, Elkhound 3
Front End - Scanner: Source code → Scanner → tokens → Parser → IR (errors). Scanner maps the character stream into words (basic units of syntax) and produces tokens: a word & its part of speech. x = x + y ; becomes <id,x> = <id,x> + <id,y> ; (word = lexeme, part of speech = token type). Typical tokens include number, identifier, +, -, new, while, if. The scanner eliminates white space. Produced by an automatic scanner generator 4
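The tokenization step above can be sketched with a small table-driven scanner. This is a minimal illustration in Python, not generator output; the token names and the regular expressions are assumptions chosen for the x = x + y ; example.

```python
import re

# Hypothetical token spec: (name, pattern) pairs, tried in order
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=;]"),
    ("WS",     r"\s+"),          # recognized, then discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    """Map a character stream into <type, lexeme> tokens."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "WS":   # the scanner eliminates white space
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(scan("x = x + y ;"))
# [('ID', 'x'), ('OP', '='), ('ID', 'x'), ('OP', '+'), ('ID', 'y'), ('OP', ';')]
```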
Front End - Parser: Source code → Scanner → tokens → Parser → IR (errors). Parser recognizes context-free syntax; guides context-sensitive (semantic) analysis, e.g. type checking; builds the IR for the source program. Produced by automatic parser generators 5
Back End: IR → Instruction Selection → IR → Instruction Scheduling → IR → Register Allocation → Machine code (errors). Responsibilities: translate IR into target machine code; choose instructions to implement each IR operation; decide which values to keep in registers; find an optimal order of instruction execution; ensure conformance with system interfaces. Automation has been less successful in the back end 6
Optimizing Compiler: Source Code → Front End → IR → Middle End → IR → Back End → Machine code (errors). Code optimizations: analyze the IR and rewrite (or transform) it. The primary goal is to reduce execution time, space usage, or power consumption. Must preserve the meaning of the code 7
Scanner in Front End: Source code → Scanner → tokens → Parser → IR (errors). Scanner maps the character stream into words (basic units of syntax) and produces tokens: a word & its part of speech 8
Scanner Generator: We want to avoid writing scanners by hand, even though the scanner is the first stage in the front end. Specifications can be expressed using regular expressions; the generator builds tables and code from a DFA, e.g. lex, flex. Specifications → Scanner Generator → tables or code; source code → Scanner → tokens. Represent words as indices into a global table 9
Regular Expressions: A regular expression (RE) over an alphabet Σ is defined as follows. ε is an RE denoting the set {ε}. If a is in Σ, then a is an RE denoting {a}. If x and y are REs denoting L(x) and L(y), then x | y is an RE denoting L(x) ∪ L(y); xy is an RE denoting L(x)L(y); x* is an RE denoting L(x)* 10
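The three RE-building rules above can be sketched as operations on the languages they denote. This is a minimal illustration (with L(x)* truncated to a finite number of repetitions so the sets stay finite):

```python
def union(Lx, Ly):
    """x | y denotes L(x) ∪ L(y)."""
    return Lx | Ly

def concat(Lx, Ly):
    """xy denotes L(x)L(y): every string of L(x) followed by one of L(y)."""
    return {a + b for a in Lx for b in Ly}

def star(Lx, k=3):
    """x* denotes L(x)*, here truncated at k repetitions."""
    L = {""}                       # zero repetitions: the empty string ε
    for _ in range(k):
        L |= concat(L, Lx)
    return L

La, Lb = {"a"}, {"b"}
print(sorted(concat(La, star(Lb, 2))))   # ['a', 'ab', 'abb']
```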
Token Recognizer: Tokens are recognized by NFAs. [Figure: a main recognizer NFA combining sub-recognizers for num (integer), num (real), id_or_keyword, and equal; the keyword recognizer re-reads the string stored in the buffer to distinguish keywords such as program and integer from identifiers] 11
Automating Scanner Construction: RE → NFA (Thompson's construction): build an NFA for each term, then combine them with ε-moves. NFA → DFA (subset construction): build the simulation DFA. DFA → minimal DFA: Hopcroft's algorithm. The Cycle of Constructions: RE → NFA → DFA → minimal DFA. DFA → RE is also possible (though not part of scanner construction): an all-pairs, all-paths problem - take the union of all paths from s0 to an accepting state 12
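The subset construction step can be sketched directly: each DFA state is the set of NFA states the NFA could be in, and ε-closures absorb the ε-moves introduced by Thompson's construction. A minimal sketch, using a tiny Thompson-style NFA for a|b as the example:

```python
def eps_closure(states, eps):
    """All NFA states reachable from `states` via ε-moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(start, delta, eps, alphabet):
    """Build the DFA that simulates the NFA (the 'simulation DFA')."""
    d0 = eps_closure({start}, eps)
    dfa, work = {}, [d0]
    while work:
        S = work.pop()
        dfa[S] = {}
        for a in alphabet:
            move = {t for s in S for t in delta.get((s, a), ())}
            T = eps_closure(move, eps)
            if T:
                dfa[S][a] = T
                if T not in dfa and T not in work:
                    work.append(T)
    return d0, dfa

# Thompson-style NFA for a|b: ε-moves fan out from state 0 and rejoin at 5
eps   = {0: (1, 3), 2: (5,), 4: (5,)}
delta = {(1, "a"): (2,), (3, "b"): (4,)}
d0, dfa = subset_construction(0, delta, eps, "ab")
# start state is {0,1,3}; on 'a' we reach {2,5}, on 'b' we reach {4,5}
```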
Parser in Front End: Source code → Scanner → tokens → Parser → IR (errors). Parser checks the stream of words and their parts of speech for grammatical correctness; determines if the input is syntactically well formed; guides checking at deeper levels than syntax; builds an IR representation of the code 13
Specification of Grammar: Syntax is specified with a CFG G = (S, T, N, P) - start symbol, terminals, non-terminals, productions.
  1 Expr → Expr Op Expr
  2      | number
  3      | id
  4 Op   → +
  5      | -
  6      | *
  7      | /
A sequence of rewrites is called a derivation; the process of discovering a derivation is called parsing.
  Rule  Sentential Form
  -     Expr
  1     Expr Op Expr
  3     <id,x> Op Expr
  5     <id,x> - Expr
  1     <id,x> - Expr Op Expr
  2     <id,x> - <num,2> Op Expr
  6     <id,x> - <num,2> * Expr
  3     <id,x> - <num,2> * <id,y>
We denote this derivation: Expr →* id - num * id 14
Derivations: A derivation consists of multiple steps of rewrites. At each step, we choose a non-terminal to replace; different choices can lead to different derivations. Two derivations are of interest: leftmost derivation - replace the leftmost NT at each step; rightmost derivation - replace the rightmost NT at each step. These are the two systematic derivations (we don't care about randomly-ordered derivations!). The example on the preceding slide was a leftmost derivation. Of course, there is also a rightmost derivation; interestingly, it turns out to be different 15
The Two Derivations for x - 2 * y:
Leftmost derivation
  Rule  Sentential Form
  -     Expr
  1     Expr Op Expr
  3     <id,x> Op Expr
  5     <id,x> - Expr
  1     <id,x> - Expr Op Expr
  2     <id,x> - <num,2> Op Expr
  6     <id,x> - <num,2> * Expr
  3     <id,x> - <num,2> * <id,y>
Rightmost derivation
  Rule  Sentential Form
  -     Expr
  1     Expr Op Expr
  3     Expr Op <id,y>
  6     Expr * <id,y>
  1     Expr Op Expr * <id,y>
  2     Expr Op <num,2> * <id,y>
  5     Expr - <num,2> * <id,y>
  3     <id,x> - <num,2> * <id,y>
In both cases, Expr →* id - num * id. The two derivations produce different parse trees, and the parse trees imply different evaluation orders! 16
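The two derivations on this slide can be replayed mechanically. A minimal sketch: at each step we replace the leftmost (or rightmost) occurrence of the chosen rule's left-hand side, which for these rule sequences is exactly the leftmost (rightmost) non-terminal.

```python
RULES = {
    1: ("Expr", ["Expr", "Op", "Expr"]),
    2: ("Expr", ["num"]),
    3: ("Expr", ["id"]),
    4: ("Op", ["+"]), 5: ("Op", ["-"]), 6: ("Op", ["*"]), 7: ("Op", ["/"]),
}

def rewrite(form, rule, leftmost):
    """Replace one occurrence of the rule's LHS in the sentential form."""
    lhs, rhs = RULES[rule]
    idx = form.index(lhs) if leftmost else len(form) - 1 - form[::-1].index(lhs)
    return form[:idx] + rhs + form[idx + 1:]

def derive(rules, leftmost):
    form = ["Expr"]
    for r in rules:
        form = rewrite(form, r, leftmost)
    return form

left  = derive([1, 3, 5, 1, 2, 6, 3], leftmost=True)    # leftmost derivation
right = derive([1, 3, 6, 1, 2, 5, 3], leftmost=False)   # rightmost derivation
print(left)    # ['id', '-', 'num', '*', 'id']
print(right)   # the same sentence, reached by a different derivation
```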
Leftmost vs. Rightmost Derivations: [Figure: the two parse trees for x - 2 * y] The leftmost derivation's parse tree evaluates as x - (2 * y); the rightmost derivation's parse tree evaluates as (x - 2) * y 17
Parsing Techniques: Top-down parsers (LL = Left-to-right input scan, Leftmost derivation): start at the root of the parse tree and grow toward the leaves; pick a production & try to match the input; a bad pick may need to backtrack; LL(1) grammars are backtrack-free (predictive parsing). Bottom-up parsers (LR = Left-to-right input scan, Rightmost derivation): start at the leaves and grow toward the root; as input is consumed, encode possibilities in an internal state; start in a state valid for legal first tokens. Bottom-up parsers handle a large class of grammars 18
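A backtrack-free top-down parser can be sketched as recursive descent with one token of lookahead. This is a minimal illustration; it uses a left-factored, loop-based form of the expression grammar (equivalent to the LL-friendly rewriting of the grammar on the next slide), with tokens as plain strings:

```python
def parse(tokens):
    """Predictive (LL(1)) parse: the lookahead token picks each production."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def factor():                       # Factor -> id | num
        tok = peek()
        if tok in ("id", "num"):
            advance()
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

    def term():                         # Term -> Factor (('*'|'/') Factor)*
        tree = factor()
        while peek() in ("*", "/"):
            op = peek(); advance()
            tree = (op, tree, factor())
        return tree

    def expr():                         # Expr -> Term (('+'|'-') Term)*
        tree = term()
        while peek() in ("+", "-"):
            op = peek(); advance()
            tree = (op, tree, term())
        return tree

    tree = expr()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return tree

print(parse(["id", "-", "num", "*", "num"]))
# ('-', 'id', ('*', 'num', 'num'))  -- * binds tighter than -, as intended
```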
Bottom-Up Parser Example: The expression grammar:
  1  Goal   → Expr
  2  Expr   → Expr + Term
  3         | Expr - Term
  4         | Term
  5  Term   → Term * Factor
  6         | Term / Factor
  7         | Factor
  8  Factor → number
  9         | id
  10        | ( Expr )
Handles for the rightmost derivation of x - 2 * y (reverse rightmost derivation, RRD); handles given as production, position:
  Prod'n  Sentential Form             Handle
  -       Goal                        -
  1       Expr                        1,1
  3       Expr - Term                 3,3
  5       Expr - Term * Factor        5,5
  9       Expr - Term * <id,y>        9,5
  7       Expr - Factor * <id,y>      7,3
  8       Expr - <num,2> * <id,y>     8,3
  4       Term - <num,2> * <id,y>     4,1
  7       Factor - <num,2> * <id,y>   7,1
  9       <id,x> - <num,2> * <id,y>   9,1
19
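The reverse rightmost derivation can be replayed by reducing each handle in turn. A minimal sketch: the handle list (production, right-end position, 1-based) is taken from the table above, read bottom-up, and each reduction replaces the handle with the production's left-hand side.

```python
# The expression grammar from this slide (number abbreviated to "num")
GRAMMAR = {
    1: ("Goal", ["Expr"]),
    2: ("Expr", ["Expr", "+", "Term"]),
    3: ("Expr", ["Expr", "-", "Term"]),
    4: ("Expr", ["Term"]),
    5: ("Term", ["Term", "*", "Factor"]),
    6: ("Term", ["Term", "/", "Factor"]),
    7: ("Term", ["Factor"]),
    8: ("Factor", ["num"]),
    9: ("Factor", ["id"]),
    10: ("Factor", ["(", "Expr", ")"]),
}

def reduce_handle(form, prod, end):
    """Replace the handle ending at 1-based position `end` with its LHS."""
    lhs, rhs = GRAMMAR[prod]
    start = end - len(rhs)
    assert form[start:end] == rhs, "not a handle at this position"
    return form[:start] + [lhs] + form[end:]

form = ["id", "-", "num", "*", "id"]            # x - 2 * y
for prod, end in [(9, 1), (7, 1), (4, 1), (8, 3), (7, 3),
                  (9, 5), (5, 5), (3, 3), (1, 1)]:
    form = reduce_handle(form, prod, end)
print(form)   # ['Goal'] -- the whole input reduces to the start symbol
```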
Semantic Analysis: Try to find semantic errors, and gather various information (e.g., symbol types) for later phases. Use an attribute grammar to describe semantics: identify attributes of language elements for semantic analysis; augment the grammar with semantic rules (R1, R2, ..., Rn), e.g. S → T1 {R1} T2 {R2} ... Tn {Rn}. Evaluate the rules in the order given by the hierarchical syntax structure 20
Type Checking: Type checking is a semantic analysis that detects type errors (a kind of semantic error). Traverse the parse tree and evaluate the semantic rules. Pascal (a strongly-typed language) example:
  x : integer; y : char;
  y := 'a'; x := 5;
  x + 20 + y   -- integer + integer yields integer, but integer + char is a type error
A simplified attribute grammar to check the type of expressions:
  E → E1 + E2   {E.type = (E1.type == E2.type ? E1.type : type error)}
  E → int_num   {E.type = integer}
  E → char      {E.type = char}
  E → id        {E.type = symtable(id).type}
21
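The attribute rules above can be sketched as a bottom-up tree walk that synthesizes E.type at each node. A minimal illustration; the symbol table contents and the tuple encoding of the parse tree are assumptions for this example.

```python
# Hypothetical symbol table for the Pascal example above
SYMTABLE = {"x": "integer", "y": "char"}

def type_of(node):
    """Synthesize E.type by evaluating the semantic rules bottom-up."""
    if isinstance(node, int):
        return "integer"                       # E -> int_num
    if isinstance(node, str) and node.startswith("'"):
        return "char"                          # E -> char literal, e.g. "'a'"
    if isinstance(node, str):
        return SYMTABLE[node]                  # E -> id
    op, lhs, rhs = node                        # E -> E1 + E2
    lt, rt = type_of(lhs), type_of(rhs)
    return lt if lt == rt else "type error"

# x + 20 + y parses as (x + 20) + y
print(type_of(("+", ("+", "x", 20), "y")))    # 'type error'
print(type_of(("+", "x", 20)))                # 'integer'
```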
Intermediate Representations: Source Code → Front End → IR → Middle End → IR → Back End → Machine code (errors). Front end - produces an intermediate representation (IR). Middle end - transforms the IR into an equivalent IR that runs more efficiently (potentially multiple IRs can be used). Back end - transforms the IR into native code. The IR encodes the compiler's knowledge of the program: AST, DAG; stack-machine code (one-address code), e.g. Java bytecode; three-address code - a low-level, RISC-like IR 22
Types of Intermediate Representations: Structural IR - graphically oriented, heavily used in source-to-source translators, tends to be large; examples: trees, DAGs. Linear IR - pseudo-code for an abstract machine, level of abstraction varies, simple and compact data structures, easier to rearrange; examples: three-address code, stack machine code. Hybrid IR - combination of graphs and linear code; example: control-flow graph 23
Abstract Syntax Tree: An abstract syntax tree (AST) is the procedure's parse tree with the non-terminal nodes removed. [Figure: parse tree and AST for x - 2 * y] Can use a linearized form of the tree, which is easier to manipulate than pointers to nodes: x 2 y * - in postfix form; - x * 2 y in prefix form 24
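The linearized forms come straight out of tree walks: a postorder walk yields postfix, a preorder walk yields prefix. A minimal sketch, encoding the AST for x - 2 * y as nested tuples:

```python
ast = ("-", "x", ("*", "2", "y"))    # AST for x - 2 * y

def postfix(node):
    """Postorder walk: children first, then the operator."""
    if isinstance(node, str):
        return [node]
    op, left, right = node
    return postfix(left) + postfix(right) + [op]

def prefix(node):
    """Preorder walk: operator first, then the children."""
    if isinstance(node, str):
        return [node]
    op, left, right = node
    return [op] + prefix(left) + prefix(right)

print(" ".join(postfix(ast)))   # x 2 y * -
print(" ".join(prefix(ast)))    # - x * 2 y
```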
Directed Acyclic Graph: A directed acyclic graph (DAG) is an AST with a unique node for each value. [Figure: DAG for z ← x - 2 * y ; w ← x / 2, with the nodes for x and 2 shared] Makes sharing explicit; encodes redundancy. The same expression appearing twice means that the compiler might arrange to evaluate it just once! 25
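The "unique node for each value" property can be sketched with hash-consing: a node table keyed by (op, child-ids), so identical subexpressions map to the same node id. A minimal illustration using the z and w example above:

```python
nodes = {}      # (op, left_id, right_id) or leaf name -> node id

def node(key):
    """Return the existing node for `key`, or create one (hash-consing)."""
    if key not in nodes:
        nodes[key] = len(nodes)
    return nodes[key]

# z <- x - 2 * y
z = node(("-", node("x"), node(("*", node("2"), node("y")))))
# w <- x / 2  -- re-uses the existing nodes for x and 2
w = node(("/", node("x"), node("2")))

print(len(nodes))   # 6 nodes total: x, 2, y, *, -, / (x and 2 are shared)
```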
Three Address Code: There are several different representations of three address code. In general, three address code has statements of the form x ← y op z, with one operator (op) and, at most, three names (x, y, z). z ← x - 2 * y becomes: t ← 2 * y ; z ← x - t. Characteristics: resembles many machines; introduces a new set of names; compact form 26
Three Address Code: Quadruples: A naïve representation of three address code - a table of k * 4 small integers. Simple record structure; easy to reorder; explicit names.
RISC assembly code:          Quadruples:
  load  r1, y                  load  | r1 | y  |
  loadi r2, 2                  loadi | r2 | 2  |
  mult  r3, r2, r1             mult  | r3 | r2 | r1
  load  r4, x                  load  | r4 | x  |
  sub   r5, r4, r3             sub   | r5 | r4 | r3
27
Summary: Front End - language grammars are processed (CFG, semantic analysis); automatic code generators (lex/yacc, flex/bison, Elkhound). Intermediate representations: structural IR - graphs (AST, DAG); linear IR - pseudo code for abstract machines (3-address code). What's left: Middle End - optimizations; Back End - target-specific code generation 28