Course Overview

PART I: overview material
  1 Introduction
  2 Language processors (tombstone diagrams, bootstrapping)
  3 Architecture of a compiler
PART II: inside a compiler
  4 Syntax analysis
  5 Contextual analysis
  6 Runtime organization
  7 Code generation
PART III: conclusion
  8 Interpretation
  9 Review

The Phases of a Compiler

Source Program -> Syntax Analysis (this chapter) -> Abstract Syntax Tree -> Contextual Analysis -> Decorated Abstract Syntax Tree -> Code Generation -> Object Code

In Chapter 4

Syntax Analysis
- Scanning: recognize "words" or tokens in the input
- Parsing: recognize the structure of the program
  - Different parsing strategies
  - How to construct a recursive descent parser
- AST Construction
- Use of theoretical "Tools":
  - Regular Expressions and Finite State Machines
  - Grammars
  - Extended BNF notation
  - First sets and Follow sets

Syntax Analysis

The job of syntax analysis is to read the source program (text file) and determine its structure.

Subphases:
- Scanning
- Parsing
- Construct an internal representation of the source text that shows the structure (usually an AST)

Note: A single-pass compiler usually does not explicitly construct an AST.

Multi Pass Compiler

A multi pass compiler makes several passes over the program. The output of a preceding phase is stored in a data structure and used by subsequent phases.

Dependency diagram of a typical Multi Pass Compiler:
[Figure: the Compiler Driver calls the Syntactic Analyzer, the Contextual Analyzer and the Code Generator. Input/output chain: Source Text -> Syntactic Analyzer -> AST -> Contextual Analyzer -> Decorated AST -> Code Generator -> Object Code. The Syntactic Analyzer is the subject of this chapter.]

Syntax Analysis

Dataflow chart:
Source Program (Stream of Characters) -> Scanner -> Stream of Tokens -> Parser -> Abstract Syntax Tree
(1) Scan: Divide Input into Tokens

An example Mini Triangle source program:

  let var y: Integer
  in !new year
     y := y+1

Tokens are "words" in the input, for example keywords, operators, identifiers, literals, etc.

Scanner output (kind, then spelling):
  let let | var var | ident. y | colon : | ident. Integer | in in | ident. y | becomes := | ident. y | op. + | intlit 1 | eot

(2) Parse: Determine structure of program

The parser analyzes the structure of the token stream with respect to the grammar of the language.

[Figure: parse tree over the token stream. Inner nodes: Program, Declaration, single-Declaration, Type-Denoter, V-Name, Expression, primary-Exp. Leaves: Ident, Ident, Ident, Ident, Op., Int.Lit, together with the tokens colon, becomes, op +, intlit and eot.]

(3) AST Construction

[Figure: abstract syntax tree. Nodes: Program, LetCommand, VarDecl, AssignCommand, BinaryExpr, SimpleType, SimpleVar, VNameExp, Int.Expr. Leaves: Ident y, Ident Integer, Ident y, Ident y, Op +, Int.Lit 1.]

Grammars

RECAP:
- The Syntax of a Language can be specified by means of a CFG (Context Free Grammar).
- A CFG can be expressed in BNF (Backus-Naur Form).

Mini Triangle grammar in BNF:

  Program        ::= single-Command
  Command        ::= single-Command
                   | Command ; single-Command
  single-Command ::= V-name := Expression
                   | begin Command end
                   | ...

Grammars (continued)

For our convenience, we will use EBNF or "Extended BNF" rather than simple BNF.

EBNF = BNF + regular expressions

Mini Triangle in EBNF (* means 0 or more occurrences of):

  Program        ::= single-Command
  Command        ::= ( single-Command ; )* single-Command
  single-Command ::= V-name := Expression
                   | begin Command end
                   | ...

Regular Expressions

RE are a notation for expressing a set of strings of terminal symbols.

Different kinds of RE:
  ε       The empty string
  t       Generates only the string t
  X Y     Generates any string xy such that x is generated by X and y is generated by Y
  X | Y   Generates any string which is generated either by X or by Y
  X*      The concatenation of zero or more strings generated by X
  (X)     Used for grouping
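As a rough illustration of the scanning step, the sketch below splits a Mini Triangle-like program into (kind, spelling) pairs using Python's `re` module. The token kind names, the keyword list and the operator set are assumptions made for this example, not the course's actual scanner; comments are assumed to run from `!` to the end of the line.

```python
import re

# Hypothetical keyword list; a real Mini Triangle scanner knows more keywords.
KEYWORDS = {"let", "var", "in", "begin", "end"}

# One named alternative per token kind. Order matters: ':=' must be tried
# before ':' so that 'becomes' is not split into two tokens.
TOKEN_RE = re.compile(r"""
    (?P<skip>\s+|![^\n]*)
  | (?P<ident>[A-Za-z][A-Za-z0-9]*)
  | (?P<intlit>[0-9]+)
  | (?P<becomes>:=)
  | (?P<colon>:)
  | (?P<semicolon>;)
  | (?P<op>[+\-*/<>=])
""", re.VERBOSE)


def scan(text):
    """Return the list of (kind, spelling) tokens in text, ending with eot."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind == "skip":
            continue  # whitespace and comments produce no token
        spelling = match.group()
        if kind == "ident" and spelling in KEYWORDS:
            kind = spelling  # keywords are identifiers with a reserved spelling
        tokens.append((kind, spelling))
    tokens.append(("eot", ""))
    return tokens


if __name__ == "__main__":
    for token in scan("let var y: Integer in !new year\n y := y+1"):
        print(token)
```

Each alternative in the pattern is itself a regular expression over characters, which is why regular expressions are the natural notation for specifying tokens.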
RE: Examples

What sets of strings do each of the following RE generate?
  1. ε
  2. M (r | s) .
  3. (foo | bar)*
  4. (foo | bar)(foo | bar)*
  5. (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*
  6. 0 | (1 | .. | 9)(0 | .. | 9)*

Regular Expressions

The languages that can be defined by RE and CFG have been extensively studied by theoretical computer scientists. These are some important conclusions / terminology:
- RE is a "weaker" formalism than CFG: any language expressible by a RE can be expressed by a CFG, but not the other way around!
- The languages expressible as RE are called regular languages.
- Generally: a language that exhibits "self embedding" cannot be expressed by RE.
- Programming languages exhibit self embedding. (Examples: an expression can contain another expression, and a command can contain another command.)

Extended BNF

Extended BNF combines BNF with RE.

A production in EBNF looks like

  LHS ::= RHS

where LHS is a non-terminal symbol and RHS is an extended regular expression.

An extended RE is just like a regular expression except that it is composed of terminals and non-terminals of the grammar.

Simply put, EBNF adds to BNF these notations:
  (...)   for the purpose of grouping
  *       for denoting "0 or more repetitions of"

Extended BNF: an Example

Example: a simple expression language

  Expression ::= PrimaryExp (Operator PrimaryExp)*
  PrimaryExp ::= Literal | Identifier | ( Expression )
  Identifier ::= Letter (Letter | Digit)*
  Literal    ::= Digit Digit*
  Letter     ::= a | b | c | ... | z
  Digit      ::= 0 | 1 | 2 | 3 | 4 | ... | 9

A little bit of useful theory

We will now look at a few useful bits of theory. These will be necessary later when we implement parsers.
- Grammar transformations: a grammar can be transformed in a number of ways without changing its meaning (i.e. its language, or the set of strings that it generates).
- The definition and computation of "starter sets" (first sets), follow sets, and nullable symbols.

Grammar Transformations

Left factorization

  X Y | X Z    becomes    X ( Y | Z )

Example (here X = if Expression then single-Command, Y = ε, Z = else single-Command):

  if Expression then single-Command
  | if Expression then single-Command else single-Command

becomes

  if Expression then single-Command ( ε | else single-Command )
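The EBNF expression grammar above maps fairly directly onto code: each non-terminal becomes a function, `*` becomes a loop, and self embedding becomes recursion. The following is only a minimal sketch of that idea, working on single characters with no whitespace handling. The class and function names are made up for the example, and since the grammar leaves Operator undefined, the operator set `+ - * /` is an assumption.

```python
OPERATORS = set("+-*/")  # assumption: Operator is not defined in the grammar


def is_letter(ch):
    return len(ch) == 1 and "a" <= ch <= "z"  # Letter ::= a | b | ... | z


def is_digit(ch):
    return len(ch) == 1 and "0" <= ch <= "9"  # Digit ::= 0 | 1 | ... | 9


class Recognizer:
    """Checks whether a string is generated by Expression (hypothetical helper)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        # The empty string stands for "end of input".
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def accept(self, expected=None):
        ch = self.peek()
        if ch == "" or (expected is not None and ch != expected):
            raise SyntaxError(f"unexpected {ch!r} at position {self.pos}")
        self.pos += 1

    def expression(self):
        # Expression ::= PrimaryExp (Operator PrimaryExp)*
        self.primary()
        while self.peek() in OPERATORS:  # the * becomes a loop
            self.accept()
            self.primary()

    def primary(self):
        # PrimaryExp ::= Literal | Identifier | ( Expression )
        ch = self.peek()
        if is_digit(ch):                 # Literal ::= Digit Digit*
            while is_digit(self.peek()):
                self.accept()
        elif is_letter(ch):              # Identifier ::= Letter (Letter | Digit)*
            self.accept()
            while is_letter(self.peek()) or is_digit(self.peek()):
                self.accept()
        elif ch == "(":                  # self embedding becomes recursion
            self.accept("(")
            self.expression()
            self.accept(")")
        else:
            raise SyntaxError(f"unexpected {ch!r} at position {self.pos}")


def recognizes(text):
    """True if the whole of text is an Expression."""
    recognizer = Recognizer(text)
    try:
        recognizer.expression()
    except SyntaxError:
        return False
    return recognizer.pos == len(text)


if __name__ == "__main__":
    for sample in ["a+(b*42)", "x1", "(a+1", "a+", "1x"]:
        print(sample, recognizes(sample))
```

The choice between the three alternatives of PrimaryExp is made by looking at one character only, which works here because the alternatives start with disjoint sets of symbols.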
Grammar Transformations (continued)

Elimination of Left Recursion

  N ::= X | N Y    becomes    N ::= X Y*

Example:

  Identifier ::= Letter
               | Identifier Letter
               | Identifier Digit

  Identifier ::= Letter
               | Identifier (Letter | Digit)

  Identifier ::= Letter (Letter | Digit)*

Grammar Transformations (continued)

Substitution of non-terminal symbols

  N ::= X                     N ::= X
  M ::= α N β    becomes      M ::= α X β

Example:

  single-Command ::= for controlvar := Expression direction Expression do single-Command
  direction      ::= to | downto

becomes

  single-Command ::= for controlvar := Expression (to | downto) Expression do single-Command

Starter Sets (a.k.a. First Sets)

Informal Definition:
The starter set of a RE X is the set of terminal symbols that can occur as the start of any string generated by X.

Example:
  starters[ (+ | - | ε) (0 | 1 | … | 9)+ ] = {+, -, 0, 1, …, 9}

Formal Definition:
  starters[ε]     = { }
  starters[t]     = {t}                          (where t is a terminal symbol)
  starters[X Y]   = starters[X]                  (if X doesn't generate ε)
  starters[X Y]   = starters[X] ∪ starters[Y]    (if X generates ε)
  starters[X | Y] = starters[X] ∪ starters[Y]
  starters[X*]    = starters[X]

Derivations

Replacing a non-terminal:

  E ::= T | E + T
  T ::= i | ( E )

  S => E => E + T => T + T => i + T => i + i

This is a left-most derivation (it replaces the left-most non-terminal at each step).

Can you find the corresponding right-most derivation?
Can you find a derivation that is neither left-most nor right-most?

Sentential forms

A sentential form is a sequence of grammar symbols that can be derived from the start symbol.

  S => E => E + T => T + T => i + T => i + i

A sentence is a sentential form that contains only terminal symbols, that is, a string that can be generated using the grammar.

Ambiguous grammars

A grammar is ambiguous if some sentence has more than one distinct parse tree.

Equivalently, a grammar is ambiguous if some sentence has more than one left-most derivation, or more than one right-most derivation.

Does i + i + i demonstrate an ambiguity?
Derivations using the productions E ::= E + E and E ::= i:

  E => E + E => i + E => i + i

  E => E + E => i + E => i + E + E => i + i + E => i + i + i

  E => E + E => E + E + E => i + E + E => i + i + E => i + i + i

The last two are distinct left-most derivations of the same sentence i + i + i.
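To make the ambiguity tangible, the sketch below counts parse trees by brute force for a token sequence under two grammars: the ambiguous E ::= E + E | i used in the derivations above, and the grammar E ::= T | E + T, T ::= i | ( E ) from the derivations slide. The function names are invented for this example, and the counting strategy (trying every `+` as the top-level operator) is just one simple way to do it, practical only for short inputs.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def trees_ambiguous(tokens):
    """Number of parse trees for tokens under E ::= E + E | i."""
    if tokens == ("i",):
        return 1
    total = 0
    for k, tok in enumerate(tokens):
        if tok == "+":  # try this '+' as the root of the tree
            total += trees_ambiguous(tokens[:k]) * trees_ambiguous(tokens[k + 1:])
    return total


@lru_cache(maxsize=None)
def trees_e(tokens):
    """Number of parse trees for tokens under E ::= T | E + T."""
    total = trees_t(tokens)
    for k, tok in enumerate(tokens):
        if tok == "+":  # left operand is an E, right operand must be a T
            total += trees_e(tokens[:k]) * trees_t(tokens[k + 1:])
    return total


@lru_cache(maxsize=None)
def trees_t(tokens):
    """Number of parse trees for tokens under T ::= i | ( E )."""
    if tokens == ("i",):
        return 1
    if len(tokens) >= 2 and tokens[0] == "(" and tokens[-1] == ")":
        return trees_e(tokens[1:-1])
    return 0


def tokens_of(text):
    """Split a string such as 'i + i' into one-character tokens."""
    return tuple(text.replace(" ", ""))


if __name__ == "__main__":
    sentence = tokens_of("i + i + i")
    print(trees_ambiguous(sentence))  # more than 1 means ambiguous
    print(trees_e(sentence))
```

For i + i + i the first grammar yields two trees, matching the two left-most derivations, while the second grammar yields exactly one because the right operand of + is restricted to T.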
Augmented grammars

We augment grammars to ensure that we can recognize and handle the end of the input string:

  S' ::= S $

Here $ denotes the end-of-file token.

Nullable, First sets (starter sets), and Follow sets

- A non-terminal is nullable if it derives the empty string.
- First(N) or starters(N) is the set of all terminals that can begin a sentence derived from N.
- Follow(N) is the set of terminals that can follow N in some sentential form.

Next we will see algorithms to compute each of these.
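Ahead of those algorithms, the three definitions can be made concrete with a small fixed-point computation: start with empty sets and keep applying the definitions until nothing changes. This is one common formulation and not necessarily the one presented in the course. The grammar encoding (a dict from non-terminal to a list of right-hand sides, with `[]` standing for ε) and the name `analyse` are assumptions made for this sketch. The example grammar is the augmented form of the E/T grammar from the derivations, with S ::= E assumed as the start production.

```python
# Augmented example grammar: S' ::= S $, S ::= E, E ::= T | E + T, T ::= i | ( E )
GRAMMAR = {
    "S'": [["S", "$"]],
    "S": [["E"]],
    "E": [["T"], ["E", "+", "T"]],
    "T": [["i"], ["(", "E", ")"]],
}


def analyse(grammar):
    """Return (nullable, first, follow) for a grammar given as a dict."""
    nonterminals = set(grammar)
    nullable = set()
    first = {n: set() for n in grammar}
    follow = {n: set() for n in grammar}

    def first_of(symbols):
        """First set of a symbol sequence, and whether it can derive ε."""
        result = set()
        for sym in symbols:
            if sym not in nonterminals:   # a terminal starts the string
                result.add(sym)
                return result, False
            result |= first[sym]
            if sym not in nullable:
                return result, False
        return result, True               # every symbol was nullable

    changed = True
    while changed:                        # iterate until a fixed point
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                rhs_first, rhs_nullable = first_of(rhs)
                if rhs_nullable and lhs not in nullable:
                    nullable.add(lhs)
                    changed = True
                if not rhs_first <= first[lhs]:
                    first[lhs] |= rhs_first
                    changed = True
                for k, sym in enumerate(rhs):
                    if sym not in nonterminals:
                        continue
                    # Whatever can start the rest of the rule can follow sym;
                    # if the rest can vanish, so can whatever follows lhs.
                    rest_first, rest_nullable = first_of(rhs[k + 1:])
                    new = set(rest_first)
                    if rest_nullable:
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return nullable, first, follow


if __name__ == "__main__":
    nullable, first, follow = analyse(GRAMMAR)
    print("nullable:", nullable)
    for n in GRAMMAR:
        print(n, "First:", sorted(first[n]), "Follow:", sorted(follow[n]))
```

The loop terminates because the sets only ever grow and are bounded by the finite set of grammar symbols. The end-of-file token `$` from the augmented production is what gives Follow(S) a non-empty value.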