CSE401: Introduction to Compiler Construction
Larry Ruzzo
Spring 2004

Slides by Chambers, Eggers, Notkin, Ruzzo, and others
W.L. Ruzzo & UW CSE 1994-2004

Today's objectives
- Administrative details
- Define compilers and why we study them
- Define the high-level structure of compilers
- Associate specific tasks, theories, and technologies with achieving the different structural elements of a compiler
- And build some initial intuition about why these are needed

Administrative Details
- Course Web: http://www.cs.washington.edu/401
- Grading
  - Homeworks ~20%
  - Project ~40%
  - Midterm ~15%
  - Final ~25%
- Project: toy compiler, an (almost) real one. Staged. Optional teams of 2-3 people.

What is a compiler?
    Source Code -> Compiler -> Executable Code
- A software tool that translates a program in source code form to an equivalent program in an executable (target) form
- Converts from a form good for people to a form good for computers

Examples
- Source languages: Java, C, C++, LISP, ML, COBOL
- Target architectures: MIPS, x86, SPARC, Alpha

Why study compilers?

CSE401's project-oriented approach
- Start with a compiler for PL/0, written in C++
- We define additional language features
  - Such as comments, arrays, call-by-reference parameters, result-returning procedures, for loops, etc.
- You modify the compiler to translate the extended PL/0 language
  - Project completed in well-defined stages

More on the project
- Strongly recommended that you work in two-person teams for the quarter
- Grading based on
  - correctness
  - clarity of design and implementation
  - quality of testing
- Provides experience with object-oriented design and with C++
- Provides experience with working in a team

What's hard about compiling
- I will present a small program to you, character by character
- Identify problems that you can see that you will encounter in compiling this program

Example
- Here's an example problem: when we see a character 1 followed by a character 7, we have to convert it to the integer 17.

     1. i    2. n    3. t    4. _    5. i    6. ;    7. _    8. i    9. :
    10. =   11. 1   12. 7   13. _   14. ;   15. p   16. r   17. i   18. n
    19. t   20. (   21. i   22. *   23. i   24. +   25. 2   26. )   27. ;

- _ is the space character
- This is not a PL/0 program!

Structure of compilers
- A common compiler structure has been defined
  - Years and years of deep, difficult research intermixed with building of thousands of compilers
- Actual compilers often differ from this prototype
  - Primary differences are the ordering and clarity with which the pieces are actually separated
  - But the model is still extremely useful
- You will see the structure to a large degree in the PL/0 compiler

Prototype compiler structure
    Source (stream of characters)
      -> Lexical analysis             -> sequence of tokens
      -> Syntactic analysis           -> abstract syntax tree (AST)
      -> Semantic analysis            -> AST+ and symbol tables
      -> Storage layout               -> AST++ and symbol tables
      -> Intermediate code generation -> intermediate representation
      -> Optimization                 -> intermediate representation
      -> Code generation              -> Target (executable code)
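
To make the character-by-character exercise above concrete, here is a minimal scanner sketch in Python (used here for brevity; the course compiler itself is written in C++). It groups the 27 characters into tokens and converts the digit characters 1 and 7 into the integer 17. The token kinds, the KEYWORDS set, and the decision to treat print as an ordinary identifier are illustrative assumptions, not part of PL/0 or the course project.

```python
import string

# Hypothetical keyword set for this toy input; an assumption, not PL/0's.
KEYWORDS = {"int"}


def tokenize(text):
    """Group a character stream into (kind, value) tokens."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in " \t\n":
            # White space separates tokens but produces none.
            i += 1
        elif ch in string.digits:
            # Accumulate digit characters into one integer value:
            # '1' then '7' becomes 1 * 10 + 7 = 17.
            value = 0
            while i < len(text) and text[i] in string.digits:
                value = value * 10 + int(text[i])
                i += 1
            tokens.append(("int", value))
        elif ch in string.ascii_letters:
            # A letter starts an identifier or keyword (Letter AlphaNum*).
            start = i
            while i < len(text) and text[i] in string.ascii_letters + string.digits:
                i += 1
            word = text[start:i]
            tokens.append(("keyword" if word in KEYWORDS else "id", word))
        elif text.startswith(":=", i):
            # Two characters form one operator, so look ahead before
            # treating ':' on its own.
            tokens.append(("op", ":="))
            i += 2
        elif ch in "+*();":
            tokens.append(("op", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens


SOURCE = "int i; i:=17 ;print(i*i+2);"
TOKENS = tokenize(SOURCE)
```

Even this small sketch surfaces several of the problems the exercise is meant to raise: where one token ends and the next begins, when white space matters, and whether a word such as int or print is a keyword or just a name.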

Front- and back-end
    Source -> Lexical analysis -> Syntactic analysis -> Semantic analysis -> Storage layout
           -> Intermediate code generation -> Optimization -> Code generation -> Target
- These parts are often lumped into two categories
- The front-end
  - Focuses on (repeated) analysis
  - Determines what the program is
- The back-end
  - Focuses on synthesis
  - Produces target program equivalent to source program

An example compilation
    module main;
      var x:int, result:int;
      procedure square(n:int);
      begin
        result := n*n;
      end square;
    begin
      x := input;
      while x <> 0 do
        square(x);
        output := result;
        x := input;
      end;
    end main.
- A real PL/0 program
- We'll step through
  - Lexical analysis
  - Syntactic analysis
  - Semantic analysis
  - Storage layout
  - Code generation

Lexical analysis (AKA scanning and tokenizing)
- Read in characters and clump them into tokens
  - Also strip out white space and comments
- Specify tokens with regular expressions
- Use finite state machines to scan
  - Remember the connection between regular expressions and finite state machines

    Ident    ::= Letter AlphaNum*
    Integer  ::= Digit+
    AlphaNum ::= Letter | Digit
    Letter   ::= a | ... | z | A | ... | Z
    Digit    ::= 0 | ... | 9

- E.g.:
    while    x    <>    0    do
    keywd    id   op    int  keywd

Syntactic analysis (AKA parsing)
- Turn token stream into tree based on the program's syntactic structure
- Define syntax using context free grammar (CFG)
  - EBNF is a common notation for defining concrete syntax
  - Cares about semi-colons, parens, and such
- Parser usually constructs AST representing abstract syntax
  - Cares about statement structures, precedence and such

    Stmt   ::= Astmt | IfStmt
    Astmt  ::= Lvalue := Expr ;
    Lvalue ::= Id
    IfStmt ::= if Test then Stmt [else Stmt] ;
    Test   ::= Expr = Expr | Expr < Expr
    Expr   ::= Term + Term | Term - Term | Term
    Term   ::= Factor * Factor | Factor
    Factor ::= - Factor | Id | Int | ( Expr )

Syntactic analysis example
- Parsing the statement "result := n * n ;" with the grammar above:

    Stmt
      Astmt
        Lvalue
          Id            result
        :=
        Expr
          Term
            Fact
              Id        n
            *
            Fact
              Id        n
        ;

Semantic analysis (Name resolution and type checking)
- Given AST
  - figure out what declaration each name refers to
  - perform static consistency checks
- Key data structure: symbol table
  - maps names to information about name derived from declaration
- Semantic analysis steps
  - Process each scope, top down
  - Process declarations in each scope into symbol table for scope
  - Process body of each scope in context of symbol table
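
The semantic analysis steps above can be sketched as a chain of per-scope tables that map names to information from their declarations (commonly called symbol tables). The following Python sketch is only an illustration: the Scope class, its method names, and the use of plain strings as type information are assumptions made here, not the PL/0 compiler's actual design. It uses the names from the example PL/0 program (x, result, square, and the parameter n).

```python
class Scope:
    """One scope's table, linked to the enclosing scope."""

    def __init__(self, parent=None):
        self.parent = parent
        self.table = {}          # name -> info derived from its declaration

    def declare(self, name, info):
        # Static consistency check: no duplicate declarations in one scope.
        if name in self.table:
            raise NameError(f"duplicate declaration of {name}")
        self.table[name] = info

    def lookup(self, name):
        # Name resolution: search this scope, then each enclosing scope.
        scope = self
        while scope is not None:
            if name in scope.table:
                return scope.table[name]
            scope = scope.parent
        raise NameError(f"undeclared name {name}")


def check_int_assignment(scope, target, operands):
    """Check that 'target := op1 * op2 ...' uses only int names."""
    for name in [target, *operands]:
        if scope.lookup(name) != "int":
            raise TypeError(f"{name} is not an int")
    return True


# Step 1: process the outer scope's declarations into its table.
main_scope = Scope()
main_scope.declare("x", "int")
main_scope.declare("result", "int")
main_scope.declare("square", "procedure(int)")

# Step 2: process the nested scope's declarations.
square_scope = Scope(parent=main_scope)
square_scope.declare("n", "int")

# Step 3: process the body 'result := n*n' in the context of that scope.
BODY_OK = check_int_assignment(square_scope, "result", ["n", "n"])
```

Looking up result from inside square succeeds by walking outward to the enclosing scope, while looking up n from the outer scope fails, which is the scoping behavior the top-down processing order is meant to enforce.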

Semantic analysis example
    int x;
    int y(void);
    int main(void) {
      double x, y;
      x = x + 5;
      printf("x is %d", x);
      x = y();
      return 1/2;
    }
- Which var with which decl?
- What type?
- Operators legal on those types?
- Coercion?
- Function arg & return types too?
- Overloading?
- Goto/case labels unique?

Storage layout
- Given symbol tables, determine how and where variables will be stored at runtime
- What representation is used for each kind of data?
- How much space does each variable require?
- In what kind of memory should it be placed?
  - static, global memory
  - stack memory
  - heap memory
- Where in memory should it be placed?
  - e.g., what stack offset?

Storage layout example
- For the same C program as in the semantic analysis example:
  - Outer x: 4 bytes, static
  - Inner x, y: 8 bytes each on stack
- What address?
- How does printf find its parameters?
- How does main return a value?

Code generation
- Given annotated AST and symbol tables, produce target code
- Often done as three steps
  - Produce machine-independent low-level representation of the program (intermediate representation or IR)
  - Perform machine-independent optimizations (optional)
  - Translate IR into machine-specific target instructions
    - Instruction selection
    - Register allocation

Codegen example
    Source        IR                   Target code
    x = x + y;    t42 <- x             lw  $2, 48($fp)
                  t43 <- y             lw  $3, 52($fp)
                  t44 <- t42 + t43     add $2, $2, $3
                  x   <- t44
    x = x * 2;    t45 <- x             lw  $2, 48($fp)
                  t46 <- 2             li  $3, 2
                  t47 <- t45 * t46     mul $2, $2, $3
                  x   <- t47
    x += y;       t48 <- x             lw  $2, 48($fp)
                  t49 <- y             lw  $3, 52($fp)
                  t50 <- t48 + t49     add $2, $2, $3
                  x   <- t50

Optimization
- Can you see simple changes that would streamline the code above?
- How could you find them automatically?
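
One possible answer to the question above, sketched in Python under stated assumptions: within straight-line code, remember which temporary already holds each variable's value and reuse it instead of loading the variable again. The tuple encoding of the IR, the operation names ("load", "store", "const"), and the function names are all made up for this illustration, and the sketch assumes each temporary is assigned exactly once and that there are no branches or calls. It is one simple local technique, not a description of what the course compiler does.

```python
# The IR from the codegen example, encoded as (dest, op, args) tuples.
CODE = [
    ("t42", "load", ("x",)), ("t43", "load", ("y",)),
    ("t44", "+", ("t42", "t43")), ("x", "store", ("t44",)),
    ("t45", "load", ("x",)), ("t46", "const", (2,)),
    ("t47", "*", ("t45", "t46")), ("x", "store", ("t47",)),
    ("t48", "load", ("x",)), ("t49", "load", ("y",)),
    ("t50", "+", ("t48", "t49")), ("x", "store", ("t50",)),
]


def reuse_loaded_values(code):
    """Drop loads of variables whose value is already in a temporary."""
    holds = {}   # variable -> temporary currently holding its value
    alias = {}   # dropped temporary -> temporary that replaces it
    out = []
    for dest, op, args in code:
        if op == "load":
            var = args[0]
            if var in holds:
                alias[dest] = holds[var]      # redundant load: reuse
            else:
                holds[var] = dest
                out.append((dest, op, args))
        elif op == "store":
            src = alias.get(args[0], args[0])
            holds[dest] = src                 # variable now equals src
            out.append((dest, op, (src,)))
        else:
            # Arithmetic or constant: rewrite operands through aliases.
            out.append((dest, op, tuple(alias.get(a, a) for a in args)))
    return out


def run(code, env):
    """Tiny evaluator used to check that both versions agree."""
    vals = dict(env)
    for dest, op, args in code:
        if op in ("load", "store"):
            vals[dest] = vals[args[0]]
        elif op == "const":
            vals[dest] = args[0]
        elif op == "+":
            vals[dest] = vals[args[0]] + vals[args[1]]
        elif op == "*":
            vals[dest] = vals[args[0]] * vals[args[1]]
    return vals


OPTIMIZED = reuse_loaded_values(CODE)
```

On this input the pass removes the three reloads (t45, t48, t49), leaving 9 of the original 12 IR instructions. The first two stores to x are also unnecessary, but removing them would need an additional dead-store analysis that this sketch does not attempt.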

Does this structure work well?
- FORTRAN I Compiler (circa 1954-56)
  - 18 person years
- PL/0 Compiler
  - By the end of the quarter, you'll have a working compiler that's way better than FORTRAN I in most respects (key exception: optimization)

Compilers vs. interpreters
- Compilers implement languages by translation
- Interpreters implement languages directly
- Note: the line is not always crystal-clear
- Compilers and interpreters have tradeoffs
  - Execution speed of program
  - Start-up overhead, turn-around time
  - Ease of implementation
  - Programming environment facilities
  - Conceptual clarity

Compiler engineering issues
- Portability
  - Ideal is multiple front-ends and multiple back-ends with a shared intermediate language
- Sequencing phases of compilation
  - Stream-based vs. syntax-directed
- Multiple, separate passes vs. fewer, integrated passes
- How to avoid compiler bugs?

Objectives: next lecture
- Define overall theory and practical structure of lexical analysis
- Briefly recap regular expressions, finite state machines, and their relationship
  - Even briefer recap of the language hierarchy
- Show how to define tokens with regular expressions
- Show how to leverage this style of token definition in implementing a lexer