CSE 431S Final Review Washington University Spring 2013
What You Should Know The six stages of a compiler and what each stage does. The input to and output of each compilation stage (especially the back-end). Context-free languages. Definition of a context-free grammar (including the formal definition). Leftmost and rightmost derivations and parse trees. Ambiguity.
What You Should Know Bottom-up (shift-reduce) parsing. LR(0) parser construction. SLR conflict resolution. LR(1) parser construction. Abstract syntax trees. L-value vs. R-value. Static type checking. Symbol tables.
What You Should Know CUP actions. Jasmin basics. Code generation. Call stack (function activation). Stack-based vs heap-based memory allocation. Parameter passing mechanisms. Register allocation (graph coloring).
Context-Free Languages
Recall right-linear grammars, which restrict the right-hand side: X → a Y | b
Context-free grammars allow anything on the right-hand side: A → ( A ) | x
Context-Free Grammars
A grammar is a 4-tuple (Σ, V, S, P):
Σ: set of terminals
V: set of nonterminals
S: start nonterminal
P: set of productions (rewrite rules)
For a grammar to be context-free, all productions must be of the form A → α, where A is a single nonterminal and α is any sequence of symbols (terminals and nonterminals).
Ambiguity
What about: E → E + E | a
There are two syntax trees for the string a + a + a: one groups it as (a + a) + a, the other as a + (a + a).
Ambiguity If there are multiple parse trees--or, equivalently, multiple leftmost derivations--for some string then the grammar is ambiguous. Note that it is the grammar that is ambiguous, not the language. There may exist a non-ambiguous grammar for the same language.
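As a rough illustration of this kind of ambiguity, the parse trees of a + a + ... + a under the grammar E → E + E | a can be counted by trying every split point for the top-level +. This is only a sketch; the class and method names are hypothetical and not from the course materials.

```java
// Counts the distinct parse trees of "a + a + ... + a" (n operands)
// under the ambiguous grammar E -> E + E | a.
final class AmbiguityDemo {
    // trees(n): number of parse trees deriving a string with n operands.
    static long trees(int n) {
        if (n == 1) {
            return 1; // only E -> a applies
        }
        long total = 0;
        // E -> E + E: choose how many operands fall in the left subtree.
        for (int left = 1; left < n; left++) {
            total += trees(left) * trees(n - left);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(trees(3)); // number of trees for a + a + a
    }
}
```

Any count above 1 for some string means the grammar is ambiguous; a left-recursive grammar such as E → E + a | a generates the same strings with exactly one tree each.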
Bottom-Up Parsing Instead of starting from the start nonterminal and producing the parse tree top down, start from the leaves and build the tree bottom up. The start nonterminal is now the goal nonterminal.
Sample Grammar
1. S → E $
2. E → E + T
3.   | T
4. T → a
5.   | ( E )
LR(0) Item The dot represents the current parse state (e.g., what has been seen so far). The initial set of rules is called the kernel. The non-kernel items are generated by the closure operation and represent the rules for any nonterminal immediately after the dot.
LR(0) Parse States
I0 = START
  S → • E $
LR(0) Parse States
I0 = START
  S → • E $      (go to 1)
  E → • E + T    (go to 1)
  E → • T        (go to 9)
  T → • a        (go to 5)
  T → • ( E )    (go to 6)
The closure operation adds all of the rules for a nonterminal to the immediate right of the dot.
The number beside each item (shown in a square on the slide) indicates which state to go to on the symbol to the right of the dot. Each symbol must lead to a single state (deterministic).
LR(0) Parse States
I0 = START
  S → • E $      (go to 1)
  E → • E + T    (go to 1)
  E → • T        (go to 9)
  T → • a        (go to 5)
  T → • ( E )    (go to 6)
I1 = GOTO(I0, E)
  S → E • $      (go to 2)
  E → E • + T    (go to 3)
There must be only one state with a given kernel, i.e., no identical states.
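The closure and GOTO operations on the sample grammar can be sketched as follows. Encoding an item as the string "rule:dot" and all class and method names are assumptions made for this illustration, not the course's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of LR(0) closure and GOTO for the sample grammar.
// An item is encoded as "rule:dot" (an assumption of this sketch), where
// dot is the number of right-hand-side symbols already seen.
final class Lr0Closure {
    // Element 0 of each row is the left-hand side; the rest is the right-hand side.
    static final String[][] RULES = {
        {"S", "E", "$"},
        {"E", "E", "+", "T"},
        {"E", "T"},
        {"T", "a"},
        {"T", "(", "E", ")"},
    };

    // Returns the symbol right after the dot, or null if the dot is at the end.
    static String symbolAfterDot(String item) {
        String[] parts = item.split(":");
        String[] rule = RULES[Integer.parseInt(parts[0])];
        int dot = Integer.parseInt(parts[1]);
        return dot + 1 < rule.length ? rule[dot + 1] : null;
    }

    static Set<String> closure(Set<String> kernel) {
        Set<String> items = new LinkedHashSet<String>(kernel);
        List<String> work = new ArrayList<String>(kernel);
        while (!work.isEmpty()) {
            String next = symbolAfterDot(work.remove(work.size() - 1));
            for (int r = 0; r < RULES.length; r++) {
                // Add every rule for the nonterminal after the dot, with the dot at the start.
                if (RULES[r][0].equals(next) && items.add(r + ":0")) {
                    work.add(r + ":0");
                }
            }
        }
        return items;
    }

    static Set<String> gotoState(Set<String> state, String symbol) {
        Set<String> kernel = new LinkedHashSet<String>();
        for (String item : state) {
            if (symbol.equals(symbolAfterDot(item))) {
                String[] parts = item.split(":");
                // Move the dot over the symbol to form a kernel item of the new state.
                kernel.add(parts[0] + ":" + (Integer.parseInt(parts[1]) + 1));
            }
        }
        return closure(kernel);
    }
}
```

Starting from the kernel item "0:0" (S → • E $), closure yields the five items of I0, and gotoState on E yields the two kernel items of I1.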
Example
1. S → A C $
2. A → a B C d
3.   | B Q
4.   | λ
5. B → b B
6.   | d
7. C → c
8.   | λ
9. Q → q
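LR(0) Parse States (shift and goto targets are listed in the SLR(1) state table)
I0: S → • A C $, A → • a B C d, A → • B Q, A → •, B → • b B, B → • d
I1: S → A • C $, C → • c, C → •
I2: S → A C • $
I3: S → A C $ •
I4: C → c •
I5: A → a • B C d, B → • b B, B → • d
I6: A → a B • C d, C → • c, C → •
I7: A → a B C • d
I8: A → a B C d •
I9: B → b • B, B → • b B, B → • d
I10: B → b B •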
LR(0) Parse States
I11: B → d •
I12: A → B • Q (go to 13), Q → • q (go to 14)
I13: A → B Q •
I14: Q → q •
Grammar is not LR(0) parsable: shift/reduce conflicts in states 0, 1, and 6.
SLR(1) Create the LR(0) states. If there are no conflicts, then we are done. For states with conflicts, try to use Follow sets to resolve the conflicts. If all conflicts can be resolved using the Follow sets, then the grammar is SLR(1).
SLR(1)
Shift/reduce conflict: need to make sure that every terminal to the immediate right of a dot is not in the Follow set of the left-hand nonterminal of the reduction rule.
I1: S → A • C $, C → • c, C → •
I6: A → a B • C d, C → • c, C → •
States 1 and 6: Follow(C) = { d, $ }, so c is not an element of Follow(C).
SLR(1)
I0: S → • A C $, A → • a B C d, A → • B Q, A → •, B → • b B, B → • d
State 0: Follow(A) = { c, $ }, so a, b, and d are not elements of Follow(A).
All conflicts can be resolved using the Follow sets, so the grammar is SLR parsable.
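The Follow sets quoted above can be checked with a small fixed-point computation over the example grammar. This is a sketch under stated assumptions: symbols that start with an uppercase letter are treated as nonterminals, an empty right-hand side stands for λ, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch: iterate nullable/First/Follow to a fixed point for the example grammar.
final class FollowSets {
    // Element 0 of each row is the left-hand side; a row of length 1 is a lambda rule.
    static final String[][] RULES = {
        {"S", "A", "C", "$"},
        {"A", "a", "B", "C", "d"},
        {"A", "B", "Q"},
        {"A"},
        {"B", "b", "B"},
        {"B", "d"},
        {"C", "c"},
        {"C"},
        {"Q", "q"},
    };

    // Assumption of this sketch: uppercase first letter means nonterminal.
    static boolean isNonterminal(String s) {
        return Character.isUpperCase(s.charAt(0));
    }

    static Map<String, Set<String>> follow() {
        Set<String> nullable = new HashSet<String>();
        Map<String, Set<String>> first = new HashMap<String, Set<String>>();
        Map<String, Set<String>> follow = new HashMap<String, Set<String>>();
        for (String[] r : RULES) {
            first.putIfAbsent(r[0], new TreeSet<String>());
            follow.putIfAbsent(r[0], new TreeSet<String>());
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] r : RULES) {
                // First(lhs) and nullable(lhs): scan the right-hand side left to right.
                boolean allNullable = true;
                for (int i = 1; i < r.length && allNullable; i++) {
                    if (isNonterminal(r[i])) {
                        changed |= first.get(r[0]).addAll(first.get(r[i]));
                        allNullable = nullable.contains(r[i]);
                    } else {
                        changed |= first.get(r[0]).add(r[i]);
                        allNullable = false;
                    }
                }
                if (allNullable) {
                    changed |= nullable.add(r[0]);
                }
                // Follow(r[i]): whatever can start the rest of the rule after r[i].
                for (int i = 1; i < r.length; i++) {
                    if (!isNonterminal(r[i])) {
                        continue;
                    }
                    boolean restNullable = true;
                    for (int j = i + 1; j < r.length && restNullable; j++) {
                        if (isNonterminal(r[j])) {
                            changed |= follow.get(r[i]).addAll(first.get(r[j]));
                            restNullable = nullable.contains(r[j]);
                        } else {
                            changed |= follow.get(r[i]).add(r[j]);
                            restNullable = false;
                        }
                    }
                    // If everything after r[i] can vanish, Follow(lhs) carries over.
                    if (restNullable && !r[i].equals(r[0])) {
                        changed |= follow.get(r[i]).addAll(follow.get(r[0]));
                    }
                }
            }
        }
        return follow;
    }
}
```

Calling FollowSets.follow().get("C") gives the set containing d and $, and get("A") gives the set containing c and $, matching the sets used to resolve the conflicts in states 0, 1, and 6.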
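SLR(1) State Table
Columns: terminals a b c d q $, nonterminals A B C Q S
State | Entries
0 | a: S5, b: S9, c: R4, d: S12, $: R4, A: S1, B: S13, S: Done
1 | c: S4, d: R8, $: R8, C: S2
2 | $: S3
3 | R1
4 | R7
5 | b: S9, d: S12, B: S6
6 | c: S4, d: R8, $: R8, C: S7
7 | d: S8
8 | R2
9 | b: S9, d: S11, B: S10
10 | R5
11 | R6
12 | q: S14, Q: S13
13 | R3
14 | R9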
Sample Parse
Stack | Action | Remaining Input
- 0 | S5 | a b b d d c $
- 0 a 5 | S9 | b b d d c $
- 0 a 5 b 9 | S9 | b d d c $
- 0 a 5 b 9 b 9 | S11 | d d c $
- 0 a 5 b 9 b 9 d 11 | R6 | d c $
- 0 a 5 b 9 b 9 | S10 | B d c $
- 0 a 5 b 9 b 9 B 10 | R5 | d c $
Sample Parse
Stack | Action | Remaining Input
- 0 a 5 b 9 | S10 | B d c $
- 0 a 5 b 9 B 10 | R5 | d c $
- 0 a 5 | S6 | B d c $
- 0 a 5 B 6 | R8 | d c $
- 0 a 5 B 6 | S7 | C d c $
- 0 a 5 B 6 C 7 | S8 | d c $
- 0 a 5 B 6 C 7 d 8 | R2 | c $
Sample Parse
Stack | Action | Remaining Input
- 0 | S1 | A c $
- 0 A 1 | S4 | c $
- 0 A 1 c 4 | R7 | $
- 0 A 1 | S2 | C $
- 0 A 1 C 2 | S3 | $
- 0 A 1 C 2 $ 3 | R1 |
- 0 | Done | S
Syntax Trees
Concrete: the actual parse tree.
Abstract: eliminates unnecessary nodes, structures the tree appropriately for evaluation, and serves as the basis for code generation.
Concrete vs. Abstract
Construction
Java code is added to productions. The most common action is to build a new tree node and assign it to RESULT, which attaches it to the left-hand nonterminal. Values for the nonterminals on the right-hand side are usually child tree nodes.
Stmt ::= id:id assign E:e
           {: RESULT = new AssignmentNode(id, e); :}
       | if lparen E:pr rparen Stmt:s fi
           {: RESULT = new IfNode(pr, s); :}
       | if lparen E:pr rparen Stmt:s1 else Stmt:s2 fi
           {: RESULT = new IfNode(pr, s1, s2); :}
       ;
Construction
Stmt ::= begin Stmts:block end
           {: RESULT = block; :}
       ;
Stmts ::= Stmts:block semi Stmt:stmt
           {: block.add(stmt); RESULT = block; :}
        | Stmt:s
           {: RESULT = new BlockNode(s); :}
        ;
Construction
Alternate construction of BlockNode:
Stmt ::= begin Stmts:list end
           {: RESULT = new BlockNode(list); :}
       ;
Stmts ::= Stmts:list semi Stmt:stmt
           {: list.add(stmt); RESULT = list; :}
        | Stmt:s
           {: RESULT = new ArrayList(); RESULT.add(s); :}
        ;
Left and Right Values
x = y
x is the L-value: it refers to the location of x, not its value.
y is the R-value: it refers to the value of y, not its location.
Example Note that there is an error in this figure. The deref in the tree for example b should not be there.
Type Checking When are types checked? Statically at compile time: the compiler does type checking during compilation, ideally eliminating runtime checks. Dynamically at runtime: the compiler generates code to do type checking at runtime. Compare JavaScript vs. Java. Java still does a large amount of runtime type checking. We'll focus on static typing for basic types.
Expression Types
For every operator we need to know the allowed types of operands and the resulting type.
Implicit coercion changes the representation, not the data: short to long.
Implicit conversion may change the data: int to float.
Explicit cast may lose information: float to int, int to short.
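The three kinds of conversion listed above can be observed directly in Java. The following is a small sketch with hypothetical class and method names; the specific input values are chosen only for illustration.

```java
// Sketch: widening that preserves the value, widening that may round,
// and narrowing casts that may lose information.
final class ConversionDemo {
    // Implicit coercion: short to long changes the representation, not the value.
    static long shortToLong(short s) {
        return s;
    }

    // Implicit conversion: int to float may round, because a float has
    // only 24 bits of significand.
    static float intToFloat(int i) {
        return i;
    }

    // Explicit cast: float to int drops the fractional part.
    static int floatToInt(float f) {
        return (int) f;
    }

    // Explicit cast: int to short keeps only the low 16 bits.
    static short intToShort(int i) {
        return (short) i;
    }

    public static void main(String[] args) {
        System.out.println(shortToLong((short) 12345));
        System.out.println(intToFloat(16777217)); // 2^24 + 1 is not representable as a float
        System.out.println(floatToInt(3.99f));
        System.out.println(intToShort(70000));
    }
}
```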
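What are the types?
[Figure: expression tree for x = y + 3.14. The = node has children x (int) and +; the + node has children y (int) and 3.14 (float). The types of the = and + nodes are marked "?".]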
Determining Types
1. Make sure the type combination is allowed (int + float).
2. Assign the resultant type to the operator (float).
3. Generate any necessary coercion(s) or conversion(s): most hardware has (int + int) and (float + float) but not (int + float).
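Adding Coercion
[Figure: the tree for x = y + 3.14 after coercion. The + node now has type float; its children are an int2float node (float) wrapping y (int), and 3.14 (float). The = node, with left child x (int), is still marked "?".]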
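Explicit Casting
[Figure: the tree with an explicit cast. The = node has type int; its children are x (int) and a float2int node (int). The float2int node wraps the + node (float), whose children are an int2float node (float) wrapping y (int), and 3.14 (float).]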
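The coercion step for a + node can be sketched as a small tree rewrite: when the operand types differ, the int operand is wrapped in an int2float node and the result type becomes float. The node class, the two-type enum, and the method names below are assumptions made for this illustration only.

```java
// Sketch: typing a '+' node and inserting an int2float coercion when needed.
final class TypeCheckDemo {
    enum Type { INT, FLOAT }

    // Minimal expression node: a leaf, a unary coercion, or a binary operator.
    static final class Node {
        final String label;
        final Type type;
        final Node left;
        final Node right;

        Node(String label, Type type, Node left, Node right) {
            this.label = label;
            this.type = type;
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            if (left == null) {
                return label;                          // leaf
            }
            if (right == null) {
                return label + "(" + left + ")";       // coercion node
            }
            return "(" + left + " " + label + " " + right + ")";
        }
    }

    static Node leaf(String name, Type type) {
        return new Node(name, type, null, null);
    }

    // Wraps an int-typed node in an int2float coercion; float nodes pass through.
    static Node toFloat(Node n) {
        return n.type == Type.FLOAT ? n : new Node("int2float", Type.FLOAT, n, null);
    }

    // Types an addition: same-typed operands keep their type, mixed operands become float.
    static Node add(Node l, Node r) {
        if (l.type == r.type) {
            return new Node("+", l.type, l, r);
        }
        return new Node("+", Type.FLOAT, toFloat(l), toFloat(r));
    }
}
```

For y (int) and 3.14 (float), add produces a float-typed node that prints as (int2float(y) + 3.14), which mirrors the coerced tree described above.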
Symbol Table
[Figure: a Proc node with children Dcls and Body. Dcls (int i; float j;) synthesizes symbol info up to Proc; Body (i = 3; j = i * 3.14;) inherits symbol info from Proc.]
Symbol Table Persists the synthesized information as a side effect of the translation. Maps a name and environment to information. The environment is the scope, and scope is static. Basic actions: establish a mapping; retrieve a mapping.
public class Car {
    int id;
    int color;
    int GetType() {
        String id;
    }
    public class Wheel {
        Object id;
        int GetType() {
            float id;
        }
    }
}

Name  | Scope             | Info
id    | Car               | int
color | Car               | int
id    | Car:GetType       | String
id    | Car:Wheel         | Object
id    | Car:Wheel:GetType | float
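Scopes
Scopes are static. Scopes are nested. Scopes are entered and left in LIFO (last in, first out) order.
[Figure: nested scopes. The Car scope contains a GetType scope and a Wheel scope; the Wheel scope contains its own GetType scope.]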
Possible Implementations
Option 1: Keep all information available at all times.
Option 2: Use LIFO and process a scope at a time.

Name  | Scope             | Info
id    | Car               | int
color | Car               | int
id    | Car:GetType       | String
id    | Car:Wheel         | Object
id    | Car:Wheel:GetType | float
LIFO Scopes The symbol table will be a stack of maps from name to information, with one map per scope (environment). Four basic operations: Enter Scope, Leave Scope, Add Symbol, Lookup Symbol.
Implementation Scopes are LIFO, so using a stack makes sense. For each scope, use a map, since we look up names to retrieve info about them. Typically use a hash map.
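A stack of hash maps with the four basic operations might look like the sketch below. Storing the symbol info as a plain String and the class and method names are simplifying assumptions for illustration; a real compiler would store a richer symbol-info object.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch: a symbol table as a stack of maps, one map per scope.
final class SymbolTable {
    private final Deque<Map<String, String>> scopes = new ArrayDeque<Map<String, String>>();

    // Enter Scope: push a fresh map for the new scope.
    void enterScope() {
        scopes.push(new HashMap<String, String>());
    }

    // Leave Scope: discard the innermost scope and everything declared in it.
    void leaveScope() {
        scopes.pop();
    }

    // Add Symbol: record a name in the innermost scope.
    void add(String name, String info) {
        scopes.peek().put(name, info);
    }

    // Lookup Symbol: search from the innermost scope outward; null if undeclared.
    String lookup(String name) {
        for (Map<String, String> scope : scopes) { // iterates from the top of the stack
            String info = scope.get(name);
            if (info != null) {
                return info;
            }
        }
        return null;
    }
}
```

With the Car example, entering the Car scope and adding id as int, then entering the GetType scope and adding id as String, makes lookup("id") return String until the GetType scope is left, after which it returns int again.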
Hello World :: Source
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
Hello World :: Jasmin
.class public HelloWorld
.super java/lang/Object

;
; standard initializer (calls java.lang.Object's initializer)
;
.method public <init>()V
    aload_0
    invokenonvirtual java/lang/Object/<init>()V
    return
.end method

;
; main() - prints out Hello World
;
.method public static main([Ljava/lang/String;)V
    .limit stack 2    ; up to two items can be pushed

    ; push System.out onto the stack
    getstatic java/lang/System/out Ljava/io/PrintStream;

    ; push a string onto the stack
    ldc "Hello World!"

    ; call the PrintStream.println() method.
    invokevirtual java/io/PrintStream/println(Ljava/lang/String;)V

    ; done
    return
.end method
Source to AST
Source:
if (i > 431) {
    a = b + c;
}
AST:
IF_STATEMENT
    GREATER_THAN
        VAR_USE
            IDENTIFIER (i) (SymbolInfo: INT, lv = 0)
        INTEGER_LITERAL (431)
    BLOCK
        EXPRESSION_STATEMENT
            ASSIGN
                IDENTIFIER (a) (SymbolInfo: INT, lv = 1)
                ADDITION
                    VAR_USE
                        IDENTIFIER (b) (SymbolInfo: INT, lv = 2)
                    VAR_USE
                        IDENTIFIER (c) (SymbolInfo: INT, lv = 3)
AST to Code
AST:
IF_STATEMENT
    GREATER_THAN
        VAR_USE
            IDENTIFIER (i) (SymbolInfo: INT, lv = 0)
        INTEGER_LITERAL (431)
    BLOCK
        EXPRESSION_STATEMENT
            ASSIGN
                IDENTIFIER (a) (SymbolInfo: INT, lv = 1)
                ADDITION
                    VAR_USE
                        IDENTIFIER (b) (SymbolInfo: INT, lv = 2)
                    VAR_USE
                        IDENTIFIER (c) (SymbolInfo: INT, lv = 3)
Code:
    iload 0
    ldc 431
    if_icmpgt label3
    iconst_0
    goto label4
label3:
    iconst_1
label4:
    ifeq label1
    iload 2
    iload 3
    iadd
    istore 1
    goto label2
label1:
label2:
Break It Down
IF_STATEMENT node:
1. Create two labels (will be needed later).
2. Visit the first child, which generates code for the boolean test expression. That code should leave 0 (for false) or 1 (for true) on top of the stack.
3. Output code that compares the top of the stack to 0 and, if it is 0, jumps to the label for the else block (to be output later).
4. Visit the second child, which generates code for the then block.
5. Output code that jumps over the else block, then output the label that starts the else block.
6. Visit the third child (if it exists), which generates code for the else block.
7. Output the label at the end of the else block.
IF_STATEMENT
private void visitIfStatementNode(ASTNode node) throws Exception {
    String elseLabel = generateLabel();
    String endLabel = generateLabel();

    node.getChild(0).accept(this);          // visit first child (test expression)
    stream.println(" ifeq " + elseLabel);   // jump to the else block if the test left 0
    node.getChild(1).accept(this);          // visit second child (then block)
    stream.println(" goto " + endLabel);    // skip over the else block
    stream.println(elseLabel + ":");
    ASTNode elseBlock = node.getChild(2);
    if (elseBlock != null) {
        elseBlock.accept(this);             // visit third child (else block)
    }
    stream.println(endLabel + ":");
}
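To see how the label bookkeeping produces the listing on the "AST to Code" slide, the sketch below hard-codes the traversal for if (i > 431) { a = b + c; } instead of walking a real AST. The class, the label counter, and the method names are assumptions for illustration; only the emitted instructions come from the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: emitting Jasmin for one if statement with generated labels.
final class JasminSketch {
    private int nextLabel = 1;
    private final List<String> out = new ArrayList<String>();

    private String generateLabel() {
        return "label" + nextLabel++;
    }

    // GREATER_THAN: both operands are already on the stack; leave 1 if
    // left > right, otherwise 0.
    private void emitGreaterThan() {
        String trueLabel = generateLabel();
        String doneLabel = generateLabel();
        out.add("if_icmpgt " + trueLabel);
        out.add("iconst_0");
        out.add("goto " + doneLabel);
        out.add(trueLabel + ":");
        out.add("iconst_1");
        out.add(doneLabel + ":");
    }

    // IF_STATEMENT for: if (i > 431) { a = b + c; } with no else block.
    List<String> ifGreaterThanExample() {
        String elseLabel = generateLabel();
        String endLabel = generateLabel();
        out.add("iload 0");            // i lives in local variable 0
        out.add("ldc 431");
        emitGreaterThan();             // test expression leaves 0 or 1
        out.add("ifeq " + elseLabel);  // jump past the then block on 0
        out.add("iload 2");            // b
        out.add("iload 3");            // c
        out.add("iadd");
        out.add("istore 1");           // a
        out.add("goto " + endLabel);
        out.add(elseLabel + ":");
        out.add(endLabel + ":");
        return out;
    }
}
```

Because the if statement creates its two labels before visiting the test expression, the comparison receives label3 and label4, which is why those numbers appear first in the generated code.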
Run-time System The run-time system consists of everything needed at run-time to support the execution of a process. This includes memory management, call-stack management, system call API, etc.
Function Calls
Invoke f during runtime. What happens?
1. Parameters are transmitted
2. Local storage is allocated
3. Local storage is initialized
4. Body of f executes
5. Return values prepared
6. Free storage
7. Return context to call
Function Calls Each invocation of f is a new activation. What is the lifetime of f?
Lifetime
[Figure: timelines for the lifetimes of two activations a and b, with cases labeled "overlapping" and "disjoint".]
Activation Use a stack to represent activations. No activation-specific info survives death. No activation-specific info is required for birth. Each activation pushes a new activation record onto the run-time stack. What will we record in it?
Activation Record
Return address
Storage information
Local storage
Parameters
Access to non-locals
Parameter Passing
Call by value: the argument is an R-value. The values of the arguments are copied into the function, so swap(x, y) won't change the value of x or y.
Call by reference: the argument is an L-value. The variable in the function points to the same location as the argument, so swap(x, y) would change the values of x and y.
Most modern languages use call-by-value semantics.
Parameter Passing Java uses call-by-value semantics It is sometimes said that Java uses call-by-value for primitives and call-by-reference for object types, but that is not quite true. Java is call-by-value for everything, except that it does not copy objects but rather copies references to the objects. That is, the caller and callee both have references to the same object.
Parameter Passing
Does not work in Java: primitive parameters are copied.
void swap(int x, int y) {
    int t = x;
    x = y;
    y = t;
}
Parameter Passing
Still does not work in Java: references to objects are copied.
void swap(Integer x, Integer y) {
    Integer t = x;
    x = y;
    y = t;
}
Parameter Passing
Cannot swap the objects, but could change the internal state of the objects.
void swap(ModInteger x, ModInteger y) {
    int t = x.getValue();
    x.setValue(y.getValue());
    y.setValue(t);
}
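The three swap attempts can be run side by side to see which one has an effect visible to the caller. The slides do not define ModInteger, so the mutable wrapper below is a hypothetical stand-in; the method names are likewise chosen for this sketch.

```java
// Sketch: call-by-value in Java for primitives, references, and object state.
final class SwapDemo {
    // Hypothetical mutable wrapper standing in for the slides' ModInteger.
    static final class ModInteger {
        private int value;

        ModInteger(int value) {
            this.value = value;
        }

        int getValue() {
            return value;
        }

        void setValue(int value) {
            this.value = value;
        }
    }

    // Swaps only the callee's copies of the primitive values.
    static void swapPrimitives(int x, int y) {
        int t = x;
        x = y;
        y = t;
    }

    // Swaps only the callee's copies of the references.
    static void swapReferences(Integer x, Integer y) {
        Integer t = x;
        x = y;
        y = t;
    }

    // Changes the state of the objects that caller and callee both reference.
    static void swapContents(ModInteger x, ModInteger y) {
        int t = x.getValue();
        x.setValue(y.getValue());
        y.setValue(t);
    }

    public static void main(String[] args) {
        int a = 1, b = 2;
        swapPrimitives(a, b);
        System.out.println(a + " " + b);                       // unchanged in the caller

        Integer c = 1, d = 2;
        swapReferences(c, d);
        System.out.println(c + " " + d);                       // unchanged in the caller

        ModInteger e = new ModInteger(1), f = new ModInteger(2);
        swapContents(e, f);
        System.out.println(e.getValue() + " " + f.getValue()); // internal state swapped
    }
}
```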
Register Allocation Most architectures have only a handful of registers to use for calculations. Values need to be copied from memory into registers when needed, and then copied back to memory when a register is needed for something else. For performance, we want to minimize the number of copies to and from memory.
Register Allocation We can build an interference graph to determine which variables are live at the same time. First, determine the live ranges of variables based on their "use" and "def". A def is an assignment to a variable (L-value). A use is the use of the value of a variable (R-value).
Live Ranges
[Figure: live ranges of x, y, and z drawn beside the code sequence x = ..., y = ..., z = ..., ... = x, ... = z, ... = y.]
Variables with ranges that overlap are live at the same time and therefore must use different registers to avoid extra copying in and out of memory.
Interference Graph Each variable is a vertex in the graph. An edge in the graph indicates that those two variables are live at the same time. So the edges indicate which variables cannot share a register.
[Figure: interference graph on the vertices x, y, and z.]
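For straight-line code, an interference graph can be sketched by treating each live range as an interval from the def to the last use and adding an edge whenever two intervals overlap. The interval representation, the line numbers in the usage note, and the class and method names are assumptions of this sketch; real compilers derive liveness from dataflow analysis over the control-flow graph.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch: build an interference graph from live ranges given as
// {defLine, lastUseLine} pairs.
final class InterferenceGraph {
    static Map<String, Set<String>> build(Map<String, int[]> ranges) {
        Map<String, Set<String>> graph = new TreeMap<String, Set<String>>();
        for (String v : ranges.keySet()) {
            graph.put(v, new TreeSet<String>());
        }
        for (String a : ranges.keySet()) {
            for (String b : ranges.keySet()) {
                if (a.compareTo(b) >= 0) {
                    continue; // consider each unordered pair once
                }
                int[] ra = ranges.get(a);
                int[] rb = ranges.get(b);
                // Strict comparison: a range that ends on the line where
                // another begins is not treated as overlapping.
                if (ra[0] < rb[1] && rb[0] < ra[1]) {
                    graph.get(a).add(b);
                    graph.get(b).add(a);
                }
            }
        }
        return graph;
    }
}
```

With illustrative ranges x = {1, 4}, y = {2, 6}, and z = {3, 5}, every pair overlaps, so the graph is a triangle and the three variables need three different registers.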
Graph Coloring The problem of allocating registers now becomes one of coloring the interference graph. We want to color the vertices of the graph so that no two adjacent vertices have the same color. The maximum number of colors we can use is equal to the number of available registers. A coloring that uses at most k colors is called a k-coloring. But k-coloring a graph is NP-complete (for k >= 3) and we need it to be fast, so use a heuristic algorithm.
Graph Coloring
1. Find a vertex whose edge count is < k.
2. Push the vertex on a stack and remove it from the graph.
3. Repeat until there are no vertices left in the graph or there are no vertices with an edge count < k in the graph.
4. If all vertices have been removed from the graph, then the graph can be k-colored.
5. Pop a vertex from the stack and add it back to the graph.
6. Color the vertex a different color from any of its neighbors currently in the graph. How can we know that there is an available color?
7. Repeat until the stack is empty.
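The heuristic above can be sketched as two phases: remove low-degree vertices onto a stack, then pop and color. A color is always available in the second phase because each vertex had fewer than k neighbors left in the graph when it was removed, and those are exactly its neighbors that are back in the graph when it is popped. The adjacency-map representation, the null return for failure, and all names are assumptions of this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: the remove-then-color heuristic for k-coloring an interference graph.
final class GraphColoring {
    // Helper for building an undirected graph.
    static void addEdge(Map<String, Set<String>> graph, String a, String b) {
        graph.computeIfAbsent(a, key -> new HashSet<String>()).add(b);
        graph.computeIfAbsent(b, key -> new HashSet<String>()).add(a);
    }

    // Returns vertex -> color in [0, k), or null if the heuristic gets stuck.
    static Map<String, Integer> color(Map<String, Set<String>> graph, int k) {
        // Work on a copy so the caller's graph is left intact.
        Map<String, Set<String>> work = new HashMap<String, Set<String>>();
        for (Map.Entry<String, Set<String>> e : graph.entrySet()) {
            work.put(e.getKey(), new HashSet<String>(e.getValue()));
        }
        Deque<String> stack = new ArrayDeque<String>();
        while (!work.isEmpty()) {
            String pick = null;
            for (String v : work.keySet()) {
                if (work.get(v).size() < k) {
                    pick = v;
                    break;
                }
            }
            if (pick == null) {
                return null; // every remaining vertex has >= k edges
            }
            for (String n : work.remove(pick)) {
                work.get(n).remove(pick);
            }
            stack.push(pick);
        }
        Map<String, Integer> colors = new HashMap<String, Integer>();
        while (!stack.isEmpty()) {
            String v = stack.pop();
            Set<Integer> used = new HashSet<Integer>();
            for (String n : graph.get(v)) {
                if (colors.containsKey(n)) {
                    used.add(colors.get(n)); // neighbor is already back in the graph
                }
            }
            int c = 0;
            while (used.contains(c)) {
                c++;
            }
            colors.put(v, c);
        }
        return colors;
    }
}
```

A null result only means this heuristic failed; as the later slide notes, the graph may still be k-colorable, or registers may need to be spilled.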
Graph Coloring
Try k = 3
[Figure: interference graph on the vertices A, B, C, D, E, F, and G.]
Graph Coloring
[Figure: the same graph on the vertices A, B, C, D, E, F, and G.]
Graph Coloring Note that if, while removing vertices from the graph, we get to a point where all of the remaining vertices have an edge count >= k, it does not necessarily mean the graph cannot be k-colored. It just means the heuristic algorithm failed, and we could try a different algorithm. But it could be that the graph is not k-colorable. Then we will need to spill registers: at some point, copy the registers out to memory so we can use them to hold other variables.
Parsers
LR(0): 0 symbols of lookahead when creating the parse table.
SLR: Simple LR; resolves conflicts using global grammar Follow sets.
LALR: Look-Ahead LR; combines some states based on follow set information.
LR(k): most powerful of those where parse states are created ahead of time.
Yet Another Example
1. P → S $
2. S → A B A C
3.   | a a c
4. A → a a
5. B → b
6.   | λ
7. C → c
8.   | λ
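Grammar Parse States (kernel rules are listed first in each state; lookahead sets are in braces)
I0:
  P → • S $, {}            (go to 12)
  S → • A B A C, {$}       (go to 4)
  S → • a a c, {$}         (go to 1)
  A → • a a, {b, a}        (go to 1)
I1:
  S → a • a c, {$}         (go to 2)
  A → a • a, {b, a}        (go to 2)
I2:
  S → a a • c, {$}         (go to 3)
  A → a a •, {b, a}
I3:
  S → a a c •, {$}
I4:
  S → A • B A C, {$}       (go to 6)
  B → • b, {a}             (go to 5)
  B → •, {a}
I5:
  B → b •, {a}
I6:
  S → A B • A C, {$}       (go to 9)
  A → • a a, {c, $}        (go to 7)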
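Grammar Parse States (cont.) (kernel rules are listed first in each state; lookahead sets are in braces)
I7:
  A → a • a, {c, $}        (go to 8)
I8:
  A → a a •, {c, $}
I9:
  S → A B A • C, {$}       (go to 11)
  C → • c, {$}             (go to 10)
  C → •, {$}
I10:
  C → c •, {$}
I11:
  S → A B A C •, {$}
I12:
  P → S • $, {}            (go to 13)
I13:
  P → S $ •, {}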