Introduction to Compilers

Compilers are language translators
- input: program in one language
- output: equivalent program in another language

Two types
- Compilers (offline): Program -> Compiler -> executable; Data -> executable -> Output
- Interpreters (online): Program + Data -> Interpreter -> Output

History of Compilers

From Machine Code to Assembly Code to High-Level Programming Languages
- even for the earliest machines (UNIVAC, IBM 701/704), the cost of software exceeded the hardware cost

Grace Hopper (1906-1992)
- coined the term "compiler"
- 1952: wrote the first compiler, for the A-0 language
- B-0, ARITH-MATIC, MATH-MATIC, FLOW-MATIC, COBOL

John Backus (1924-2007)
- 1953: Speedcoding
  - interpreted: code ran 10-20x slower
  - interpreter used up 310 memory words
- 19: FORTRAN I (FORmula TRANslator)
  - translated high-level code to assembly
- [ Backus-Naur Form (BNF) ]

FORTRAN I
- 1954-7: FORTRAN I project
- compiler released in 1957
- first successful high-level programming language
- 1958: more than 50% of all software is written in FORTRAN
- huge impact on computer science
  - led to an enormous body of theoretical work
  - even modern compilers preserve the outlines of the FORTRAN I compiler

The Structure of a Compiler
The Bigger Picture

source code
    #include <stdio.h>
    #define N 1000
    int main() { int a[N]; a[i] = a[i] + 1; printf("%d", a[i]); }
  -> Preprocessor ->
modified source code
    int printf();
    int main() { int a[1000]; a[i] = a[i] + 1; printf("%d", a[i]); }
  -> Compiler ->
assembly code
    movl $a, %eax
    addl (%eax, %ebx, 4), %ecx
    call printf
  -> Assembler ->
relocatable machine code
    6: e8 fc ff ff ff    call 7 <main+0x14>
    7: R_386_PC32 printf
  -> Linker/Loader ->
    8048ba: e8 09 00 00 00    call 8048c8

Typical Phases of a Modern Compiler

character stream
  -> Lexical Analyzer -> token stream
  -> Syntax Analyzer -> syntax tree
  -> Semantic Analyzer -> syntax tree
  -> Intermediate Code Generator -> intermediate representation
  -> Machine-Independent Code Optimizer -> intermediate representation
  -> Code Generator
  -> Machine-Dependent Code Optimizer
The Symbol Table is shared by all phases.

Basic Structure of a Compiler
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Optimization
5. Code Generation

acknowledgements: contains adapted material from Alex Aiken

Lexical Analysis or Scanning
- recognize the words in the input
    This is an example sentence.
    -> This, is, an, example, sentence, .
- Not as trivial as it looks. Consider:
    thisi sane xam plesent ence.
- input: character stream of source program
- process:
  - split stream into lexemes according to the grammar of the language
  - build attributed tokens
- output: token stream
  - a token contains the token name (id) and the token attributes
    < token name, attribute values >
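The scanning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's scanner: the token classes, the regular expressions, and the names `TOKEN_SPEC` and `tokenize` are assumptions made for this example. Identifiers receive a symbol-table index as their attribute, numbers carry their value, and operators are tokens without attributes.

```python
import re

# Token classes for a toy language. Names and patterns are illustrative
# assumptions, not taken from the slides.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("op",     r"[=+\-*/()]"),
    ("skip",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Split a character stream into attributed tokens.

    Returns (tokens, symbols) where symbols maps each identifier lexeme
    to its 1-based symbol-table index.
    """
    symbols = {}
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue  # whitespace separates lexemes but produces no token
        if kind == "id":
            # First occurrence gets the next free index; later ones reuse it.
            index = symbols.setdefault(lexeme, len(symbols) + 1)
            tokens.append(("id", index))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme,))  # operator token without attributes
    return tokens, symbols


tokens, symbols = tokenize("position = initial + rate * 60")
```

Here `tokens` holds one tuple per lexeme, in source order, and `symbols` records which index each identifier was assigned.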
Example: position = initial + rate * 60
  1. lexeme position -> token <id, 1>
  2. lexeme =        -> token <=>
  3. lexeme initial  -> token <id, 2>
  4. lexeme +        -> token <+>
  5. lexeme rate     -> token <id, 3>
  6. lexeme *        -> token <*>
  7. lexeme 60       -> token <number, 60>
  token stream: <id,1> <=> <id,2> <+> <id,3> <*> <number,60>

Syntax Analysis or Parsing
- understand the structure of the input
    This example is a longer sentence
    This (article) example (noun) is (verb) a (article) longer (adjective) sentence (noun)
    subject: "This example"; object: "a longer sentence"; together they form a sentence
- input: token stream
- process: transform the token stream into a syntax (parse) tree using the grammar of the language, by starting at the start symbol and repeatedly applying productions
- output: syntax tree, symbol table

Example
  position = initial + rate * 60
  <id,1> <=> <id,2> <+> <id,3> <*> <number,60>
  [Slide figure: grammar productions for expr and term (the operators *, / and a parenthesized expr are legible), and the parse tree derived for this token stream]

Another Example
  position = rate * 60 + initial
  <id,1> <=> <id,2> <*> <number,60> <+> <id,3>
  candidate syntax trees:
    <=> ( <id,1>, <+> ( <*> ( <id,2>, <number,60> ), <id,3> ) )
    vs
    <=> ( <id,1>, <*> ( <id,2>, <+> ( <number,60>, <id,3> ) ) )
  which syntax tree is correct?

Semantic Analysis
- understand the meaning of the input
  - this is too hard for compilers
  - semantic analysis is restricted to detecting inconsistencies
- Tom said Jerry left his phone at home.
  -> whose phone?
- Tom said Tom left his phone at home.
  -> how many people are we talking about?
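Returning to the parsing question above (which syntax tree is correct?): the grammar decides. The sketch below is a small recursive-descent parser in Python, under an assumed grammar in which `*` and `/` bind tighter than `+` and `-`, and all four are left-associative. The grammar, the function name `parse_assignment`, and the nested-tuple tree format are assumptions for illustration, not the grammar from the slides.

```python
def parse_assignment(tokens):
    """Parse `id = expr` from a list of lexemes into a nested-tuple tree.

    Assumed grammar (hypothetical, for illustration only):
        stmt   -> id '=' expr
        expr   -> term   (('+' | '-') term)*
        term   -> factor (('*' | '/') factor)*
        factor -> id | number | '(' expr ')'
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tok

    def expect(lexeme):
        tok = advance()
        if tok != lexeme:
            raise SyntaxError(f"expected {lexeme!r}, found {tok!r}")

    def factor():
        tok = advance()
        if tok == "(":
            node = expr()
            expect(")")
            return node
        if tok.isidentifier() or tok.isdigit():
            return tok  # leaves are kept as plain lexemes
        raise SyntaxError(f"unexpected token {tok!r}")

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())  # left-associative
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())  # left-associative
        return node

    target = advance()
    if not target.isidentifier():
        raise SyntaxError(f"expected an identifier, found {target!r}")
    expect("=")
    tree = ("=", target, expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


tree = parse_assignment("position = rate * 60 + initial".split())
```

Because `term` is called from inside `expr`, a multiplication is always grouped before the surrounding addition, so under this assumed grammar only the first of the two candidate trees can be produced.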
Semantic Analysis
- compilers catch inconsistencies
    Tom left her phone at home.
    -> type mismatch
- input: syntax tree, symbol table
- process: semantic consistency checking, type checking
- output: (semantically sound) syntax tree, possibly with added coercions

Example: position = initial + rate * 60
  symbol table:
    1  position  float
    2  initial   float
    3  rate      float
  syntax tree before: = ( <id,1>, + ( <id,2>, * ( <id,3>, 60 ) ) )
  syntax tree after:  = ( <id,1>, + ( <id,2>, * ( <id,3>, inttofloat(60) ) ) )

Examples
- type checking
    public static void fun(int a, java.lang.String[] s) { int sum = a + s; }
- variable bindings
    int i = 4;
    int i = 6;
    printf("%d\n", i);

Optimization
- no strong counterpart in English; similar to editing
    Optimization is a little bit like editing.
    -> Optimization is akin to editing.
  here: reduce the number of words

Common (High-Level) Optimizations
Optimizations automatically modify programs so that they use less of some resource:
- run faster
- use less memory
- use less (disk) space
- consume less energy
- reduce the number of network accesses
Common optimizations:
- common subexpression elimination
- dead code elimination
- code motion
- constant propagation
- partial-redundancy elimination
- loop optimizations
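As one concrete instance from the list above, constant propagation can be sketched for straight-line code as follows. This is a simplified illustration under stated assumptions: instructions are hypothetical `(dest, left, op, right)` tuples, operands are either integer constants or variable names, there are no branches or loops, and the function name `propagate_constants` is made up for this example. A production compiler would run this as a data-flow analysis over a control-flow graph.

```python
import operator

# Integer operators this sketch knows how to evaluate at compile time.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def propagate_constants(code):
    """Propagate known integer constants through straight-line code.

    Each instruction is (dest, left, op, right); op is None for a plain
    copy `dest = left`. Operands are ints or variable names (str).
    """
    known = {}   # variable -> constant value known at this point
    result = []
    for dest, left, op, right in code:
        # Replace operands whose value is already known.
        if isinstance(left, str):
            left = known.get(left, left)
        if isinstance(right, str):
            right = known.get(right, right)

        if op is None:
            value = left
        elif isinstance(left, int) and isinstance(right, int):
            value = OPS[op](left, right)  # both operands constant: evaluate now
        else:
            value = None

        if isinstance(value, int):
            known[dest] = value
            result.append((dest, value, None, None))
        else:
            # dest now holds an unknown value; forget any stale constant.
            known.pop(dest, None)
            result.append((dest, left, op, right))
    return result


code = [
    ("n", 1000, None, None),      # n = 1000
    ("size", "n", "*", 4),        # size = n * 4
    ("x", "size", "+", "y"),      # x = size + y
]
optimized = propagate_constants(code)
```

After the pass, `size` is computed at compile time and `x` adds a constant to `y`. The assignment to `n` is left in place; removing it once nothing reads it would be the job of dead code elimination.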
Optimization Example
    x = y * 0
  can be optimized to
    x = 0
- unfortunately, this rule is not correct: for floating-point y, NaN * 0 is NaN, not 0
- optimizations should be conservative, i.e., be correct under all circumstances

Code Generation or CodeGen
- translation into another language, e.g., English into Korean
    I miss you -> 보고싶다

Code Generation
More formally
- input: intermediate code
- process: map to machine code
  - register allocation
  - memory locations
Example
  intermediate code:
    t1 = id3 * 60.0
    id1 = id2 + t1
  machine code:
    ldf  r2, id3
    mulf r2, r2, #60.0
    ldf  r, id2
    addf r, r, r2
    stf  id1, r

Proportions of the Compiler Phases
[Figure: relative size of the phases L (lexical analysis), P (parsing), S (semantic analysis), O (optimization), and CG (code generation) in FORTRAN vs. modern compilers]

Optimizations in a Modern Compiler
- Scalar replacement of array references
- Data-cache optimizations
- Procedure integration
- Tail-call optimization, including tail-recursion elimination
- Scalar replacement of aggregates
- Sparse conditional constant propagation
- Interprocedural constant propagation
- Procedure specialization and cloning
- Global value numbering
- Local and global copy propagation
- Constant folding
- Local and global CSE
- Partial redundancy elimination
- Loop-invariant code motion
- Algebraic simplifications, including reassociation
- In-line expansion
- Leaf-routine optimization
- Shrink wrapping
- Machine idioms
- Tail merging
- Branch optimizations and conditional moves
- Software pipelining with loop unrolling, variable expansion, register renaming, and hierarchical reduction
- Basic-block and branch scheduling
- Register allocation by graph coloring
- Intraprocedural I-cache optimization
- Instruction prefetching
- Data prefetching
- Branch prediction

Summary
Compilers are fundamental to Computer Science
Basic Structure of a Compiler
1. Lexical Analysis: from a character stream to tokens
2. Syntax Analysis: from tokens to an AST
3. Semantic Analysis: type checking
4. Optimization: varying degree of complexity
5. Code Generation: from IR to machine code

This class covers all phases except optimizations
We will build a simple but fully functional compiler for a Pascal-based language

Optimizations in a Modern Compiler (continued)
- Code hoisting
- Induction-variable strength reduction
- Linear-function test replacement
- Induction-variable removal
- Unnecessary bounds-checking removal
- Control-flow optimizations
- Interprocedural register allocation
- Aggregation of global references
- Interprocedural I-cache optimization
source: Steven Muchnick, Advanced Compiler Design & Implementation