Principles of Compiler Design — Presented by R. Venkadeshan, M.Tech (IT), Lecturer, CSE Dept, Chettinad College of Engineering & Technology


1 Principles of Compiler Design — R. Venkadeshan, M.Tech (IT), Lecturer, CSE Dept, Chettinad College of Engineering & Technology

2 Preliminaries Required: basic knowledge of programming languages; basic knowledge of FSA and CFG; knowledge of a high-level programming language for the programming assignments. Textbook: Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1986.

3 Subjects Lexical Analysis (Scanning); Syntax Analysis (Parsing); Syntax-Directed Translation; Intermediate Code Generation; Run-Time Environments; Code Generation; Machine-Independent Optimization.

4 Course Outline Introduction to Compiling; Lexical Analysis; Syntax Analysis: Context-Free Grammars, Top-Down Parsing (LL Parsing), Bottom-Up Parsing (LR Parsing); Syntax-Directed Translation: Attribute Definitions, Evaluation of Attribute Definitions; Semantic Analysis, Type Checking; Run-Time Organization; Intermediate Code Generation; Code Optimization; Code Generation.

5 COMPILERS A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language. source program (normally a program written in a high-level programming language) → COMPILER → target program (normally the equivalent program in machine code — a relocatable object file), plus error messages.

6 Other Applications In addition to the development of compilers, the techniques used in compiler design are applicable to many other problems in computer science. Techniques used in a lexical analyzer can be used in text editors, information retrieval systems, and pattern recognition programs. Techniques used in a parser can be used in query processing systems such as SQL. Much software with a complex front end needs techniques used in compiler design; a symbolic equation solver, for example, takes an equation as input and must parse it. Most of the techniques used in compiler design can be used in Natural Language Processing (NLP) systems.

7 Major Parts of Compilers There are two major parts of a compiler: analysis and synthesis. In the analysis phase, an intermediate representation is created from the given source program; the Lexical Analyzer, Syntax Analyzer, and Semantic Analyzer are the parts of this phase. In the synthesis phase, the equivalent target program is created from this intermediate representation; the Intermediate Code Generator, Code Optimizer, and Code Generator are the parts of this phase.

8 Phases of a Compiler Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program. Each phase transforms the source program from one representation into another. All phases communicate with the error handlers and with the symbol table.

9 Lexical Analyzer The lexical analyzer reads the source program character by character and returns the tokens of the source program. A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters, and so on). Ex: newval := oldval + 12 yields the tokens: newval (identifier), := (assignment operator), oldval (identifier), + (add operator), 12 (a number). The lexical analyzer puts information about identifiers into the symbol table. Regular expressions are used to describe tokens (lexical constructs). A (deterministic) finite state automaton can be used in the implementation of a lexical analyzer.

10 Syntax Analyzer A syntax analyzer creates the syntactic structure (generally a parse tree) of the given program. A syntax analyzer is also called a parser. A parse tree describes a syntactic structure: all terminals are at the leaves, and all inner nodes are non-terminals of a context-free grammar. Parse tree for newval := oldval + 12:
assgstmt
├── identifier (newval)
├── :=
└── expression
    ├── expression → identifier (oldval)
    ├── +
    └── expression → number (12)

11 Syntax Analyzer (CFG) The syntax of a language is specified by a context-free grammar (CFG). The rules in a CFG are mostly recursive. A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not; if it does, the syntax analyzer creates a parse tree for the given program. Ex: we use BNF (Backus-Naur Form) to specify a CFG: assgstmt → identifier := expression; expression → identifier; expression → number; expression → expression + expression

12 Syntax Analyzer versus Lexical Analyzer Which constructs of a program should be recognized by the lexical analyzer, and which ones by the syntax analyzer? Both of them do similar things, but the lexical analyzer deals with simple, non-recursive constructs of the language, while the syntax analyzer deals with recursive constructs. The lexical analyzer simplifies the job of the syntax analyzer: it recognizes the smallest meaningful units (tokens) in a source program, and the syntax analyzer works on those tokens to recognize meaningful structures in our programming language.

13 Parsing Techniques Depending on how the parse tree is created, there are different parsing techniques, categorized into two groups: top-down parsing and bottom-up parsing. Top-Down Parsing: construction of the parse tree starts at the root and proceeds towards the leaves. Efficient top-down parsers can be easily constructed by hand: Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing). Bottom-Up Parsing: construction of the parse tree starts at the leaves and proceeds towards the root. Normally, efficient bottom-up parsers are created with the help of software tools. Bottom-up parsing is also known as shift-reduce parsing. Operator-Precedence Parsing: simple, restrictive, easy to implement. LR Parsing: a much more general form of shift-reduce parsing — LR, SLR, LALR.

14 Semantic Analyzer A semantic analyzer checks the source program for semantic errors and collects type information for code generation. Type checking is an important part of the semantic analyzer. Normally, semantic information cannot be represented by the context-free grammar used in the syntax analyzer, so the context-free grammar used in syntax analysis is integrated with attributes (semantic rules); the result is a syntax-directed translation (attribute grammars). Ex: newval := oldval + 12 — the type of the identifier newval must match the type of the expression (oldval+12).

15 Intermediate Code Generation A compiler may produce an explicit intermediate code representing the source program. This intermediate code is generally machine-independent (architecture-independent), but its level is close to the level of machine code. Ex: newval := oldval * fact + 1 becomes id1 := id2 * id3 + 1, which yields the intermediate code (quadruples): MULT id2,id3,temp1; ADD temp1,#1,temp2; MOV temp2,,id1
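The slides do not show a data structure for these quadruples; the following minimal C sketch (the field names are my own, for illustration only) shows one plausible representation of the three instructions above.

#include <stdio.h>

/* A quadruple: operator, two operand fields, and a result field. */
typedef struct {
    const char *op;      /* e.g. "MULT", "ADD", "MOV" */
    const char *arg1;
    const char *arg2;    /* may be empty, as in MOV */
    const char *result;
} Quad;

int main(void) {
    /* the intermediate code for newval := oldval * fact + 1 */
    Quad code[] = {
        { "MULT", "id2",   "id3", "temp1" },
        { "ADD",  "temp1", "#1",  "temp2" },
        { "MOV",  "temp2", "",    "id1"   },
    };
    for (int i = 0; i < 3; i++)
        printf("%s %s,%s,%s\n", code[i].op, code[i].arg1,
               code[i].arg2, code[i].result);
    return 0;
}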

16 Code Optimizer (for Intermediate Code Generator) The code optimizer optimizes the code produced by the intermediate code generator in terms of time and space. Ex: the three quadruples above reduce to two: MULT id2,id3,temp1; ADD temp1,#1,id1

17 Code Generator Produces the target language for a specific architecture. The target program is normally a relocatable object file containing the machine code. Ex (assuming an architecture whose instructions require at least one operand to be a machine register): MOVE id2,R1; MULT id3,R1; ADD #1,R1; MOVE R1,id1

18 Front-end, Back-end division Source code → Front end → IR → Back end → Machine code (errors reported along the way). The front end maps legal code into IR; the back end maps IR onto the target machine. This division simplifies retargeting and allows multiple front ends; multiple passes lead to better code.

19 Front end Source code → Scanner → tokens → Parser → IR (errors reported along the way). The front end recognizes legal code, reports errors, produces IR, and builds preliminary storage maps.

20 Front end Scanner: maps characters into tokens, the basic unit of syntax. x = x + y becomes <id, x> = <id, x> + <id, y>. Typical tokens: number, id, +, -, *, /, do, end. The scanner eliminates white space (tabs, blanks, comments). A key issue is speed, so instead of using a tool like LEX it is sometimes necessary to write your own scanner.

21 Front end Parser: recognizes context-free syntax, guides context-sensitive analysis, constructs IR, produces meaningful error messages, and attempts error correction. There are parser generators like YACC which automate much of the work.

22 Front end Context-free grammars are used to represent programming language syntax: <expr> ::= <expr> <op> <term> | <term>; <term> ::= <number> | <id>; <op> ::= + | -

23 Front end A parser tries to map a program to the syntactic elements defined in the grammar. A parse can be represented by a tree, called a parse or syntax tree.

24 Front end A parse tree can be represented more compactly as an Abstract Syntax Tree (AST). The AST is often used as the IR between the front end and the back end.

25 Back end IR → Instruction selection → Register Allocation → Machine code (errors reported along the way). The back end translates IR into target machine code: it chooses instructions for each IR operation, decides what to keep in registers at each point, and ensures conformance with system interfaces.

26 Back end Instruction selection: produce compact, fast code; use the available addressing modes.

27 Back end Register allocation: have each value in a register when it is used. Registers are a limited resource, and optimal allocation is difficult.

28 Traditional three-pass compiler Source code → Front end → IR → Middle end → IR → Back end → Machine code (errors reported along the way). Code improvement analyzes and changes the IR; the goal is to reduce runtime.

29 Middle end (optimizer) Modern optimizers are usually built as a set of passes. Typical passes: constant propagation, common sub-expression elimination, redundant store elimination, dead code elimination.

30 Lexical Analysis

31 Outline Role of the lexical analyzer; specification of tokens; recognition of tokens; lexical analyzer generators; finite automata; design of a lexical analyzer generator.

32 Lexical Analyzer The lexical analyzer reads the source program character by character to produce tokens. Normally a lexical analyzer doesn't return a list of tokens in one shot; it returns a token when the parser asks for one. source program → Lexical Analyzer —token→ Parser (get next token)

33 The role of the lexical analyzer Source program → Lexical Analyzer —token→ Parser (getNextToken) → to semantic analysis. Both consult the symbol table.

34 Token A token represents a set of strings described by a pattern. For example, an identifier represents the set of strings which start with a letter and continue with letters and digits. The actual string (e.g., newval) is called a lexeme. Tokens: identifier, number, addop, delimiter, ... Since a token can represent more than one lexeme, additional information must be held for the specific lexeme; this additional information is called the attribute of the token. For simplicity, a token may have a single attribute which holds the required information for that token. For identifiers, this attribute is a pointer into the symbol table, and the symbol table holds the actual attributes for that token. Some attributes: <id,attr> where attr is a pointer into the symbol table; <assgop,_> — no attribute is needed (if there is only one assignment operator); <num,val> where val is the actual value of the number. A token type and its attribute uniquely identify a lexeme. Regular expressions are widely used to specify patterns.
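As an illustration of the <token, attribute> pairs above, here is a minimal C sketch; the type and field names are assumptions, not from the slides.

#include <stdio.h>

/* A token is a (type, attribute) pair. For an identifier the attribute is
   a pointer (here, an index) into the symbol table; for a number it is
   the value itself; an assignment operator needs no attribute. */
enum TokenType { ID, NUM, ASSGOP, ADDOP };

typedef struct {
    enum TokenType type;
    int attr;                /* symbol-table index for ID, value for NUM */
} Token;

int main(void) {
    /* tokens for: newval := oldval + 12 */
    Token toks[] = { {ID, 0}, {ASSGOP, 0}, {ID, 1}, {ADDOP, 0}, {NUM, 12} };
    printf("%d tokens\n", (int)(sizeof toks / sizeof toks[0]));
    return 0;
}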

35 Recognition of tokens (cont.) The next step is to formalize the patterns: digit → [0-9]; digits → digit+; number → digits (. digits)? (E [+-]? digits)?; letter → [A-Za-z_]; id → letter (letter | digit)*; if → if; then → then; else → else; relop → < | > | <= | >= | = | <>. We also need to handle whitespace: ws → (blank | tab | newline)+

36 Terminology of Languages Alphabet: a finite set of symbols (e.g., the ASCII characters). String: a finite sequence of symbols over an alphabet; sentence and word are also used in place of string. ε is the empty string; |s| is the length of string s. Language: a set of strings over some fixed alphabet. The empty set ∅ is a language; {ε}, the set containing only the empty string, is a language; the set of well-formed C programs is a language; the set of all possible identifiers is a language. Operations on strings — concatenation: xy represents the concatenation of strings x and y; εs = s and sε = s; s^n = s s ... s (n times); s^0 = ε

37 Operations on Languages Concatenation: L1 L2 = { s1 s2 | s1 ∈ L1 and s2 ∈ L2 }. Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }. Exponentiation: L^0 = {ε}, L^1 = L, L^2 = L L. Kleene closure: L* = the union of L^i for all i ≥ 0. Positive closure: L+ = the union of L^i for all i ≥ 1.

38 Example L1 = {a,b,c,d}, L2 = {1,2}. L1 L2 = {a1,a2,b1,b2,c1,c2,d1,d2}. L1 ∪ L2 = {a,b,c,d,1,2}. L1^3 = all strings of length three over {a,b,c,d}. L1* = all strings over the letters a,b,c,d, including the empty string. L1+ = the same, but not including the empty string.

39 Regular Expressions We use regular expressions to describe the tokens of a programming language. A regular expression is built up from simpler regular expressions (using defining rules). Each regular expression denotes a language. A language denoted by a regular expression is called a regular set.

40 Regular Expressions (Rules) Regular expressions over an alphabet Σ:

Reg. Expr       Language it denotes
ε               {ε}
a ∈ Σ           {a}
(r1) | (r2)     L(r1) ∪ L(r2)
(r1) (r2)       L(r1) L(r2)
(r)*            (L(r))*
(r)             L(r)

Abbreviations: (r)+ = (r)(r)*;  (r)? = (r) | ε

41 Regular Expressions (cont.) We may remove parentheses by using precedence rules: * (highest), concatenation (next), | (lowest). So a b* c means (a(b)*)(c). Ex: Σ = {0,1}. 0|1 denotes {0,1}; (0|1)(0|1) denotes {00,01,10,11}; 0* denotes {ε,0,00,000,0000,...}; (0|1)* denotes all strings of 0s and 1s, including the empty string.

42 Regular Definitions Writing a regular expression for some languages can be difficult, because their regular expressions can be quite complex. In those cases, we may use regular definitions: we give names to regular expressions, and we can use these names as symbols to define other regular expressions. A regular definition is a sequence of definitions of the form d1 → r1, d2 → r2, ..., dn → rn, where each di is a distinct name and each ri is a regular expression over the symbols in Σ ∪ {d1,d2,...,d(i-1)} — the basic symbols and the previously defined names.

43 Regular Definitions (cont.) Ex: identifiers in Pascal — letter → A | B | ... | Z | a | b | ... | z; digit → 0 | 1 | ... | 9; id → letter (letter | digit)*. If we tried to write the regular expression for identifiers without regular definitions, it would be complex: (A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*. Ex: unsigned numbers in Pascal — digit → 0 | 1 | ... | 9; digits → digit+; opt-fraction → (. digits)?; opt-exponent → (E (+|-)? digits)?; unsigned-num → digits opt-fraction opt-exponent

44 Lexical Analyzer Generator — Lex Lex source program lex.l → Lex compiler → lex.yy.c; lex.yy.c → C compiler → a.out; input stream → a.out → sequence of tokens

45 Structure of Lex programs
declarations
%%
translation rules      (each of the form: pattern {action})
%%
auxiliary functions

46 Example
%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%
{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
%%

int installID() {/* function to install the lexeme, whose first character is
                    pointed to by yytext and whose length is yyleng, into the
                    symbol table, and return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical constants into
                     a separate table */
}

47 Finite Automata Regular expressions = specification; finite automata = implementation. A finite automaton consists of: an input alphabet Σ; a set of states S; a start state n; a set of accepting states F ⊆ S; a set of transitions state —input→ state.

48 Finite Automata The transition s1 —a→ s2 is read: in state s1, on input a, go to state s2. At the end of the input: if we are in an accepting state, accept; otherwise, reject. If no transition is possible, reject.

49 Finite Automata State Graphs In a state graph: a circle is a state; an arrow into a circle marks the start state; a double circle is an accepting state; a labeled arc —a→ is a transition.

50 A Simple Example A finite automaton that accepts only 1. A finite automaton accepts a string if we can follow transitions labeled with the characters of the string from the start state to some accepting state.

51 Another Simple Example A finite automaton accepting any number of 1's followed by a single 0. Alphabet: {0,1}. Check that 1110 is accepted but 110 is not.

52 And Another Example Alphabet still {0,1}. The operation of this automaton is not completely defined by the input: on input 11 the automaton could be in either state (figure).

53 Epsilon Moves Another kind of transition: ε-moves. The machine can move from state A to state B without reading any input.

54 Transition diagrams Transition diagram for relop (figure).

55 Transition diagrams (cont.) Transition diagram for reserved words and identifiers (figure).

56 Finite Automata A recognizer for a language is a program that takes a string x and answers yes if x is a sentence of that language, and no otherwise. The recognizer for tokens is a finite automaton. A finite automaton can be deterministic (DFA) or non-deterministic (NFA); this means we may use either a deterministic or a non-deterministic automaton as a lexical analyzer. Both deterministic and non-deterministic finite automata recognize regular sets. Which one? Deterministic: a faster recognizer, but it may take more space. Non-deterministic: slower, but it may take less space. Deterministic automata are widely used in lexical analyzers. First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens. Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA). Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA).

57 Non-Deterministic Finite Automaton (NFA) A non-deterministic finite automaton (NFA) is a mathematical model that consists of: S — a set of states; Σ — a set of input symbols (the alphabet); move — a transition function mapping state-symbol pairs to sets of states; s0 — a start (initial) state; F — a set of accepting (final) states. ε-transitions are allowed in NFAs; in other words, we can move from one state to another without consuming any symbol. An NFA accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.

58 NFA (Example) Transition graph of an NFA: 0 —a→ 0, 0 —a→ 1, 0 —b→ 0, 1 —b→ 2. The language recognized by this NFA is (a|b)* a b. 0 is the start state s0; {2} is the set of final states F; Σ = {a,b}; S = {0,1,2}. Transition function:
       a        b
0    {0,1}    {0}
1      –      {2}
2      –       –

59 Deterministic Finite Automaton (DFA) A deterministic finite automaton (DFA) is a special form of an NFA: no state has an ε-transition, and for each symbol a and state s there is at most one edge labeled a leaving s; i.e., the transition function maps a state-symbol pair to a single state (not a set of states). The language recognized by the example DFA (figure) is also (a|b)* a b.

60 Implementing a DFA Let us assume that the end of a string is marked with a special symbol (say eos). The algorithm for recognition is as follows (an efficient implementation):
s ← s0                  { start from the initial state }
c ← nextchar            { get the next character from the input string }
while (c != eos) do     { do until the end of the string }
begin
    s ← move(s,c)       { transition function }
    c ← nextchar
end
if (s in F) then        { if s is an accepting state }
    return "yes"
else
    return "no"
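The pseudocode above translates almost directly into C. The following sketch hard-codes a transition table for the three-state DFA of (a|b)*ab (the language of slide 59); the state numbering and table encoding are my own, for illustration only.

#include <stdio.h>

#define NSTATES 3

/* DFA for (a|b)*ab: state 0 is the start state, state 2 is accepting.
   trans[s][c] gives the next state; c is 0 for 'a', 1 for 'b'. */
static const int trans[NSTATES][2] = {
    {1, 0},   /* state 0: a->1, b->0 */
    {1, 2},   /* state 1: a->1, b->2 */
    {1, 0},   /* state 2: a->1, b->0 */
};
static const int accepting[NSTATES] = {0, 0, 1};

int dfa_accepts(const char *input) {
    int s = 0;                        /* start from the initial state */
    for (; *input != '\0'; input++)   /* '\0' plays the role of eos   */
        s = trans[s][*input == 'b'];  /* transition function          */
    return accepting[s];              /* accept iff s is in F         */
}

int main(void) {
    printf("%d\n", dfa_accepts("aab"));  /* 1: accepted */
    printf("%d\n", dfa_accepts("aba"));  /* 0: rejected */
    return 0;
}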

61 Implementing an NFA
S ← ε-closure({s0})              { set of all states accessible from s0 by ε-transitions }
c ← nextchar
while (c != eos) do
begin
    S ← ε-closure(move(S,c))     { set of all states accessible from a state in S
                                   by a transition on c }
    c ← nextchar
end
if (S ∩ F != ∅) then             { if S contains an accepting state }
    return "yes"
else
    return "no"
This algorithm is not efficient.
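A minimal C sketch of this simulation for the NFA of slide 58, which has no ε-transitions, so ε-closure is the identity here; with ε-edges one would repeatedly add ε-successors until the set stops growing. Representing state sets as bitmasks is my own encoding choice.

#include <stdio.h>
#include <stdint.h>

/* NFA of slide 58 for (a|b)*ab: 0 -a-> {0,1}, 0 -b-> {0}, 1 -b-> {2}.
   Bit s of a mask is set iff state s is in the set. */
static uint32_t move_set(uint32_t S, char c) {
    uint32_t R = 0;
    if (S & (1u << 0)) R |= (c == 'a') ? ((1u << 0) | (1u << 1)) : (1u << 0);
    if ((S & (1u << 1)) && c == 'b') R |= 1u << 2;
    return R;
}

int nfa_accepts(const char *input) {
    uint32_t S = 1u << 0;            /* epsilon-closure({s0}) = {0} here */
    for (; *input; input++)
        S = move_set(S, *input);     /* all states reachable on c */
    return (S & (1u << 2)) != 0;     /* does S contain a final state? */
}

int main(void) {
    printf("%d\n", nfa_accepts("aab"));  /* 1: accepted */
    printf("%d\n", nfa_accepts("aba"));  /* 0: rejected */
    return 0;
}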

62 Regular Expressions to Finite Automata High-level sketch: Lexical Specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA.

63 Converting a Regular Expression into an NFA (Thompson's Construction) This is one way to convert a regular expression into an NFA; there can be other (more efficient) ways. Thompson's Construction is a simple and systematic method. It guarantees that the resulting NFA has exactly one final state and one start state. Construction starts from the simplest parts (alphabet symbols); to create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined.

64 Thompson's Construction (cont.) To recognize the empty string ε: i —ε→ f. To recognize a symbol a of the alphabet: i —a→ f. If N(r1) and N(r2) are the NFAs for regular expressions r1 and r2, then the NFA for r1 | r2 is built by adding a new start state i with ε-edges to the start states of N(r1) and N(r2), and ε-edges from their final states to a new final state f.

65 Thompson's Construction (cont.) For the regular expression r1 r2: connect N(r1) and N(r2) in series, i → N(r1) → N(r2) → f; the final state of N(r2) becomes the final state of N(r1 r2). For the regular expression r*: add a new start state i and a new final state f, with ε-edges from i into N(r) and from N(r) to f, an ε-edge from i directly to f, and an ε-edge from the final state of N(r) back to its start state.

66 Thompson's Construction (Example: (a|b)* a) Build the NFAs for a and b; combine them in parallel for a|b; apply the star construction for (a|b)*; finally concatenate the NFA for a to obtain (a|b)* a (figures).

67 Converting an NFA into a DFA (subset construction)
put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is an unmarked state S1 in DS) do
begin
    mark S1
    for each input symbol a do
    begin
        S2 ← ε-closure(move(S1,a))     { move(S1,a): the set of states to which there
                                         is a transition on a from a state in S1 }
        if (S2 is not in DS) then
            add S2 into DS as an unmarked state
        transfunc[S1,a] ← S2
    end
end
A state S in DS is an accepting state of the DFA if a state in S is an accepting state of the NFA. The start state of the DFA is ε-closure({s0}), the set of all states accessible from s0 by ε-transitions.
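The following C sketch runs this subset construction on the Thompson NFA for (a|b)*a used on the next slide (states 0–8, final state 8). DFA states are bitmasks over NFA states; the edge tables and encoding are my own reconstruction, and the program reproduces the states S0, S1, S2 computed on slides 68–69.

#include <stdio.h>
#include <stdint.h>

#define NNFA 9
static const int eps[NNFA][3] = {        /* epsilon successors; -1 ends the list */
    {1,7,-1}, {2,4,-1}, {-1}, {6,-1}, {-1}, {6,-1}, {1,7,-1}, {-1}, {-1}
};

static uint32_t eclosure(uint32_t S) {   /* grow S by epsilon edges to a fixpoint */
    uint32_t prev;
    do {
        prev = S;
        for (int s = 0; s < NNFA; s++)
            if (S & (1u << s))
                for (int i = 0; i < 3 && eps[s][i] >= 0; i++)
                    S |= 1u << eps[s][i];
    } while (S != prev);
    return S;
}

static uint32_t move_set(uint32_t S, char c) { /* labeled edges: 2-a->3, 4-b->5, 7-a->8 */
    uint32_t R = 0;
    if (c == 'a' && (S & (1u << 2))) R |= 1u << 3;
    if (c == 'b' && (S & (1u << 4))) R |= 1u << 5;
    if (c == 'a' && (S & (1u << 7))) R |= 1u << 8;
    return R;
}

int main(void) {
    uint32_t ds[32];                     /* the DFA states found so far */
    int transfunc[32][2], nds = 0, marked = 0;
    ds[nds++] = eclosure(1u << 0);       /* unmarked start state */
    while (marked < nds) {               /* while there is an unmarked S1 */
        uint32_t S1 = ds[marked];        /* mark S1 */
        for (int a = 0; a < 2; a++) {
            uint32_t S2 = eclosure(move_set(S1, "ab"[a]));
            int j = 0;
            while (j < nds && ds[j] != S2) j++;
            if (j == nds) ds[nds++] = S2;    /* add S2 as an unmarked state */
            transfunc[marked][a] = j;
        }
        marked++;
    }
    for (int i = 0; i < nds; i++)
        printf("S%d%s: a->S%d b->S%d\n", i,
               (ds[i] & (1u << 8)) ? " (accepting)" : "",
               transfunc[i][0], transfunc[i][1]);
    return 0;
}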

68 Converting an NFA into a DFA (Example) (Figure: the Thompson NFA for (a|b)* a, states 0–8, final state 8.)
S0 = ε-closure({0}) = {0,1,2,4,7}; put S0 into DS as an unmarked state.
mark S0: ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1 — S1 into DS; transfunc[S0,a] = S1. ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2 — S2 into DS; transfunc[S0,b] = S2.
mark S1: ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1; transfunc[S1,a] = S1. ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2; transfunc[S1,b] = S2.
mark S2: ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1; transfunc[S2,a] = S1. ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2; transfunc[S2,b] = S2.

69 Converting an NFA into a DFA (Example cont.) S0 is the start state of the DFA, since 0 is a member of S0 = {0,1,2,4,7}. S1 is an accepting state of the DFA, since 8 is a member of S1 = {1,2,3,4,6,7,8}. Transitions: S0 —a→ S1, S0 —b→ S2, S1 —a→ S1, S1 —b→ S2, S2 —a→ S1, S2 —b→ S2.

70 Table Implementation of a DFA
       0   1
S      T   U
T      T   U
U      T   U

71 Converting Regular Expressions Directly to DFAs We may convert a regular expression into a DFA without creating an NFA first. First we augment the given regular expression by concatenating it with a special symbol #: r → (r)# (the augmented regular expression). Then we create a syntax tree for this augmented regular expression; in this syntax tree, all alphabet symbols (plus # and the empty string) are at the leaves, and all inner nodes are the operators. Then each alphabet symbol (plus #) is numbered (given a position number).

72 Regular Expression → DFA (cont.) (a|b)* a becomes (a|b)* a # as the augmented regular expression. Syntax tree of (a|b)* a # (figure): the root concatenates (a|b)*, a, and #; each symbol is numbered (positions) — a → 1, b → 2, a → 3, # → 4; each symbol is at a leaf; the inner nodes are operators.

73 followpos We then define the function followpos for the positions (the numbers assigned to the leaves). followpos(i) is the set of positions which can follow position i in the strings generated by the augmented regular expression. For example, for (a|b)* a #: followpos(1) = {1,2,3}; followpos(2) = {1,2,3}; followpos(3) = {4}; followpos(4) = {}. followpos is defined only for leaves; it is not defined for inner nodes.

74 firstpos, lastpos, nullable To evaluate followpos, we need three more functions defined for the nodes (not just the leaves) of the syntax tree: firstpos(n) — the set of positions of the first symbols of strings generated by the sub-expression rooted at n; lastpos(n) — the set of positions of the last symbols of strings generated by the sub-expression rooted at n; nullable(n) — true if the empty string is a member of the strings generated by the sub-expression rooted at n, false otherwise.

75 How to evaluate firstpos, lastpos, nullable

n                       nullable(n)                     firstpos(n)                                        lastpos(n)
leaf labeled ε          true                            ∅                                                  ∅
leaf with position i    false                           {i}                                                {i}
c1 | c2                 nullable(c1) or nullable(c2)    firstpos(c1) ∪ firstpos(c2)                        lastpos(c1) ∪ lastpos(c2)
c1 c2                   nullable(c1) and nullable(c2)   if nullable(c1) then firstpos(c1) ∪ firstpos(c2)   if nullable(c2) then lastpos(c1) ∪ lastpos(c2)
                                                        else firstpos(c1)                                  else lastpos(c2)
(c1)*                   true                            firstpos(c1)                                       lastpos(c1)

76 How to evaluate followpos Two rules define the function followpos: 1. If n is a concatenation node with left child c1 and right child c2, and i is a position in lastpos(c1), then all positions in firstpos(c2) are in followpos(i). 2. If n is a star node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i). If firstpos and lastpos have been computed for each node, followpos of each position can be computed by making one depth-first traversal of the syntax tree.

77 Example — (a|b)* a # (Figure: the syntax tree annotated with firstpos in green and lastpos in blue — the root has firstpos {1,2,3} and lastpos {4}; the # leaf has {4}; the a leaf at position 3 has {3}; the star node has {1,2} for both; the a and b leaves have {1} and {2}.) Then we can calculate followpos: followpos(1) = {1,2,3}; followpos(2) = {1,2,3}; followpos(3) = {4}; followpos(4) = {}. After we calculate the follow positions, we are ready to create the DFA for the regular expression.

78 Algorithm (RE → DFA)
create the syntax tree of (r)#
calculate the functions followpos, firstpos, lastpos, nullable
put firstpos(root) into the states of the DFA as an unmarked state
while (there is an unmarked state S in the states of the DFA) do
    mark S
    for each input symbol a do
        let s1,...,sn be the positions in S whose symbols are a
        S' ← followpos(s1) ∪ ... ∪ followpos(sn)
        move(S,a) ← S'
        if (S' is not empty and not in the states of the DFA) then
            put S' into the states of the DFA as an unmarked state
the start state of the DFA is firstpos(root)
the accepting states of the DFA are all states containing the position of #

79 Example — (a|b)* a # followpos(1)={1,2,3}, followpos(2)={1,2,3}, followpos(3)={4}, followpos(4)={}.
S1 = firstpos(root) = {1,2,3}
mark S1: a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2 — move(S1,a) = S2; b: followpos(2) = {1,2,3} = S1 — move(S1,b) = S1
mark S2: a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2 — move(S2,a) = S2; b: followpos(2) = {1,2,3} = S1 — move(S2,b) = S1
start state: S1; accepting states: {S2}
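A minimal C sketch of this direct construction, with the followpos sets and position symbols of (a|b)*a# hard-coded from the slides (the bitmask representation is my own); it prints the two-state DFA derived above.

#include <stdio.h>
#include <stdint.h>

/* Positions 1..4 of (a|b)*a#; bit i of a mask = position i. */
static const uint32_t followpos[5] = { 0, 0x0E, 0x0E, 0x10, 0 };
                     /* unused, {1,2,3}, {1,2,3}, {4}, {} */
static const char sym[5] = { 0, 'a', 'b', 'a', '#' };

int main(void) {
    uint32_t ds[16]; int trans[16][2], nds = 0, marked = 0;
    ds[nds++] = 0x0E;                    /* firstpos(root) = {1,2,3} */
    while (marked < nds) {
        uint32_t S = ds[marked];         /* mark S */
        for (int a = 0; a < 2; a++) {
            char c = "ab"[a];
            uint32_t T = 0;
            for (int i = 1; i <= 4; i++)           /* positions in S    */
                if ((S & (1u << i)) && sym[i] == c) /* whose symbol is c */
                    T |= followpos[i];
            int j = 0;
            while (j < nds && ds[j] != T) j++;
            if (j == nds) ds[nds++] = T;
            trans[marked][a] = j;
        }
        marked++;
    }
    for (int i = 0; i < nds; i++)        /* accepting iff position 4 (#) is in it */
        printf("S%d%s: a->S%d b->S%d\n", i + 1,
               (ds[i] & (1u << 4)) ? " (accepting)" : "",
               trans[i][0] + 1, trans[i][1] + 1);
    return 0;
}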

80 Example — (a|ε) b c* # followpos(1)={2}, followpos(2)={3,4}, followpos(3)={3,4}, followpos(4)={}.
S1 = firstpos(root) = {1,2}
mark S1: a: followpos(1) = {2} = S2 — move(S1,a) = S2; b: followpos(2) = {3,4} = S3 — move(S1,b) = S3
mark S2: b: followpos(2) = {3,4} = S3 — move(S2,b) = S3
mark S3: c: followpos(3) = {3,4} = S3 — move(S3,c) = S3
start state: S1; accepting states: {S3}

81 Minimizing the Number of States of a DFA Partition the set of states into two groups: G1, the set of accepting states, and G2, the set of non-accepting states. For each new group G, partition G into subgroups such that states s1 and s2 are in the same group iff, for all input symbols a, states s1 and s2 have transitions into states in the same group. The start state of the minimized DFA is the group containing the start state of the original DFA; the accepting states of the minimized DFA are the groups containing the accepting states of the original DFA.
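A minimal C sketch of this partitioning for the three-state DFA of the next slide: iterative refinement until the partition stops changing. The array encoding is my own, for illustration only.

#include <stdio.h>
#include <string.h>

#define N 3
/* DFA of slide 82 (0-based states): 1-a->2 1-b->3, 2-a->2 2-b->3, 3-a->2 3-b->3 */
static const int delta[N][2] = { {1, 2}, {1, 2}, {1, 2} };
static const int accepting[N] = { 0, 1, 0 };

int main(void) {
    int group[N], next[N];
    for (int s = 0; s < N; s++) group[s] = accepting[s];   /* G1 / G2 */
    for (;;) {
        /* two states stay together iff they are in the same group and
           their successors' groups agree on every input symbol */
        int ngroups = 0, changed = 0;
        for (int s = 0; s < N; s++) next[s] = -1;
        for (int s = 0; s < N; s++) {
            if (next[s] >= 0) continue;
            next[s] = ngroups;
            for (int t = s + 1; t < N; t++)
                if (next[t] < 0 && group[t] == group[s] &&
                    group[delta[t][0]] == group[delta[s][0]] &&
                    group[delta[t][1]] == group[delta[s][1]])
                    next[t] = ngroups;
            ngroups++;
        }
        for (int s = 0; s < N; s++)
            if (next[s] != group[s]) changed = 1;
        memcpy(group, next, sizeof group);
        if (!changed) break;            /* partition is stable */
    }
    for (int s = 0; s < N; s++)         /* prints: states 1 and 3 share a group */
        printf("state %d -> group %d\n", s + 1, group[s]);
    return 0;
}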

82 Minimizing DFA — Example States 1, 2, 3 with 2 accepting; transitions: 1 —a→ 2, 1 —b→ 3, 2 —a→ 2, 2 —b→ 3, 3 —a→ 2, 3 —b→ 3. G1 = {2}, G2 = {1,3}. G2 cannot be partitioned, because move(1,a) = 2 = move(3,a) and move(1,b) = 3 = move(3,b). So the minimized DFA (with minimum states) is: {1,3} —a→ {2}, {1,3} —b→ {1,3}, {2} —a→ {2}, {2} —b→ {1,3}.

83 Minimizing DFA — Another Example States 1, 2, 3, 4 with 4 accepting; transitions:
      a    b
1 →   2    3
2 →   2    3
3 →   4    3
Groups: {1,2,3} {4}, then {1,2} {3} {4} (3 goes to {4} on a while 1 and 2 go to {1,2}); no more partitioning. So the minimized DFA is: {1,2} —a→ {1,2}, {1,2} —b→ {3}, {3} —a→ {4}, {3} —b→ {3}.

84 Some Other Issues in Lexical Analyzer The lexical analyzer has to recognize the longest possible string. Ex: identifier newval — n, ne, new, newv, newva, newval. What is the end of a token? Is there any character which marks the end of a token? It is normally not defined. If the number of characters in a token is fixed, there is no problem (e.g., + or -); but < may be < alone, or the start of <= or <> (in Pascal). The end of an identifier: a character that cannot occur in an identifier can mark the end of the token. We may need a lookahead. In Prolog: p :- X is 1. versus p :- X is 1.5. — a dot followed by a white space character can mark the end of a number, but if that is not the case, the dot must be treated as part of the number.

85 Some Other Issues in Lexical Analyzer (cont.) Skipping comments: normally we don't return a comment as a token; we skip it and return the next token (which is not a comment) to the parser. So comments are only processed by the lexical analyzer, and they don't complicate the syntax of the language. Symbol table interface: the symbol table holds information about tokens (at least the lexemes of identifiers). How should the symbol table be implemented, and with what kinds of operations? Typically a hash table (open addressing or chaining): inserting a token, and finding the position of a token from its lexeme. Positions of the tokens in the file are also kept (for error handling).

86 Syntax Analysis

87 Outline Role of the parser; context-free grammars; top-down parsing; bottom-up parsing; parser generators.

88 The role of the parser Source program → Lexical Analyzer —token→ Parser (getNextToken) → parse tree → Rest of Front End → intermediate representation. Both consult the symbol table.

89 Syntax Analyzer The syntax analyzer creates the syntactic structure of the given source program; this syntactic structure is mostly a parse tree. The syntax analyzer is also known as the parser. The syntax of a programming language is described by a context-free grammar (CFG); we will use BNF (Backus-Naur Form) notation in the description of CFGs. The syntax analyzer (parser) checks whether a given source program satisfies the rules implied by a context-free grammar or not: if it does, the parser creates the parse tree of that program; otherwise, the parser gives error messages. A context-free grammar gives a precise syntactic specification of a programming language; the design of the grammar is an initial phase of the design of a compiler, and a grammar can be directly converted into a parser by some tools.

90 Parser The parser works on a stream of tokens; the smallest item is a token. source program → Lexical Analyzer —token→ Parser → parse tree (get next token)

91 Parsers (cont.) We categorize parsers into two groups: 1. Top-down parsers — the parse tree is created top to bottom, starting from the root. 2. Bottom-up parsers — the parse tree is created bottom to top, starting from the leaves. Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time). Efficient top-down and bottom-up parsers can be implemented only for sub-classes of context-free grammars: LL for top-down parsing, LR for bottom-up parsing.

92 Context-Free Grammars Inherently recursive structures of a programming language are defined by a context-free grammar. In a context-free grammar, we have: a finite set of terminals (in our case, this will be the set of tokens); a finite set of non-terminals (syntactic variables); a finite set of production rules of the form A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string); and a start symbol (one of the non-terminals). Example: E → E + E | E - E | E * E | E / E | - E; E → ( E ); E → id

93 Derivations E ⇒ E+E: E+E derives from E — we can replace E by E+E. To be able to do this, we have to have the production rule E → E+E in our grammar. E ⇒ E+E ⇒ id+E ⇒ id+id: a sequence of replacements of non-terminal symbols is called a derivation of id+id from E. In general, a derivation step is αAβ ⇒ αγβ if there is a production rule A → γ in our grammar, where α and β are arbitrary strings of terminal and non-terminal symbols. α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1, or α1 derives αn). ⇒ means derives in one step; ⇒* means derives in zero or more steps; ⇒+ means derives in one or more steps.

94 CFG — Terminology L(G) is the language of G (the language generated by G), which is a set of sentences. A sentence of L(G) is a string of terminal symbols of G. If S is the start symbol of G, then ω is a sentence of L(G) iff S ⇒+ ω, where ω is a string of terminals of G. If G is a context-free grammar, L(G) is a context-free language. Two grammars are equivalent if they produce the same language. For S ⇒* α: if α contains non-terminals, it is called a sentential form of G; if α does not contain non-terminals, it is called a sentence of G.

95 Derivation Example E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id) OR E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id). At each derivation step, we can choose any of the non-terminals in the sentential form of G for the replacement. If we always choose the left-most non-terminal in each derivation step, the derivation is called a left-most derivation; if we always choose the right-most non-terminal, it is called a right-most derivation.

96 Left-Most and Right-Most Derivations Left-most derivation: E ⇒lm -E ⇒lm -(E) ⇒lm -(E+E) ⇒lm -(id+E) ⇒lm -(id+id). Right-most derivation: E ⇒rm -E ⇒rm -(E) ⇒rm -(E+E) ⇒rm -(E+id) ⇒rm -(id+id). We will see that top-down parsers try to find the left-most derivation of the given source program, while bottom-up parsers try to find the right-most derivation of the given source program in reverse order.

97 Parse Tree The inner nodes of a parse tree are non-terminal symbols; the leaves are terminal symbols. A parse tree can be seen as a graphical representation of a derivation. E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id), with the parse tree growing at each step (figures): the root E has children - and E; that E has children (, E, ); the inner E has children E, +, E; and the two innermost E nodes each derive id.

98 Ambiguity A grammar that produces more than one parse tree for a sentence is called an ambiguous grammar. E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id (parse tree with + at the root), versus E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id (parse tree with * at the root).

99 Ambiguity (cont.) For most parsers, the grammar must be unambiguous: an unambiguous grammar gives a unique selection of the parse tree for a sentence. We should eliminate the ambiguity in the grammar during the design phase of the compiler; an unambiguous grammar should be written to eliminate the ambiguity. We have to prefer one of the parse trees of a sentence (generated by an ambiguous grammar) and disambiguate the grammar to restrict it to this choice.

100 Ambiguity (cont.) stmt → if expr then stmt | if expr then stmt else stmt | otherstmts. For if E1 then if E2 then S1 else S2 there are two parse trees: one in which the else matches the outer if (stmt → if expr then stmt else stmt, with E1, if E2 then S1, and S2), and one in which the else matches the inner if (stmt → if expr then stmt, with E1 and if E2 then S1 else S2).

101 Ambiguity (cont.) We prefer the second parse tree (the else matches the closest if), so we have to disambiguate our grammar to reflect this choice. The unambiguous grammar will be: stmt → matchedstmt | unmatchedstmt; matchedstmt → if expr then matchedstmt else matchedstmt | otherstmts; unmatchedstmt → if expr then stmt | if expr then matchedstmt else unmatchedstmt

102 Ambiguity — Operator Precedence Ambiguous grammars (because of ambiguous operators) can be disambiguated according to the precedence and associativity rules. E → E+E | E*E | E^E | id | (E). Disambiguate the grammar with precedence ^ (right to left), * (left to right), + (left to right): E → E+T | T; T → T*F | F; F → G^F | G; G → id | (E)

103 Left Recursion A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α. Top-down parsing techniques cannot handle left-recursive grammars, so we have to convert our left-recursive grammar into an equivalent grammar which is not left-recursive. The left recursion may appear in a single step of the derivation (immediate left recursion), or may appear in more than one step of the derivation.

104 Immediate Left-Recursion A → Aα | β (where β does not start with A). Eliminate immediate left recursion: A → βA'; A' → αA' | ε (an equivalent grammar). In general: A → Aα1 | ... | Aαm | β1 | ... | βn (where β1,...,βn do not start with A). Eliminate immediate left recursion: A → β1A' | ... | βnA'; A' → α1A' | ... | αmA' | ε (an equivalent grammar).

105 Immediate Left-Recursion — Example E → E+T | T; T → T*F | F; F → id | (E). Eliminate immediate left recursion: E → TE'; E' → +TE' | ε; T → FT'; T' → *FT' | ε; F → id | (E)

106 Left-Recursion — Problem A grammar may not be immediately left-recursive, but it still can be left-recursive. By just eliminating the immediate left recursion, we may not get a grammar which is not left-recursive. S → Aa | b; A → Sc | d: this grammar is not immediately left-recursive, but it is still left-recursive — S ⇒ Aa ⇒ Sca, or A ⇒ Sc ⇒ Aac, causes left recursion. So, we have to eliminate all left recursion from our grammar.

107 Eliminate Left-Recursion — Algorithm
- Arrange the non-terminals in some order: A1 ... An
- for i from 1 to n do {
-     for j from 1 to i-1 do {
          replace each production Ai → Aj γ by
              Ai → α1 γ | ... | αk γ,   where Aj → α1 | ... | αk
      }
-     eliminate the immediate left recursion among the Ai productions
  }

108 Eliminate Left-Recursion — Example S → Aa | b; A → Ac | Sd | f. Order of non-terminals: S, A. For S: we do not enter the inner loop; there is no immediate left recursion in S. For A: replace A → Sd with A → Aad | bd, so we will have A → Ac | Aad | bd | f. Eliminate the immediate left recursion in A: A → bdA' | fA'; A' → cA' | adA' | ε. So the resulting equivalent grammar, which is not left-recursive, is: S → Aa | b; A → bdA' | fA'; A' → cA' | adA' | ε

109 Eliminate Left-Recursion — Example 2 S → Aa | b; A → Ac | Sd | f. Order of non-terminals: A, S. For A: we do not enter the inner loop; eliminate the immediate left recursion in A: A → SdA' | fA'; A' → cA' | ε. For S: replace S → Aa with S → SdA'a | fA'a, so we will have S → SdA'a | fA'a | b. Eliminate the immediate left recursion in S: S → fA'aS' | bS'; S' → dA'aS' | ε. So the resulting equivalent grammar, which is not left-recursive, is: S → fA'aS' | bS'; S' → dA'aS' | ε; A → SdA' | fA'; A' → cA' | ε

110 Left-Factoring A predictive parser (a top-down parser without backtracking) insists that the grammar be left-factored: grammar → a new equivalent grammar suitable for predictive parsing. stmt → if expr then stmt else stmt | if expr then stmt. When we see if, we cannot know which production rule to choose to rewrite stmt in the derivation.

111 Left-Factoring (cont.) In general, A → αβ1 | αβ2, where α is non-empty and the first symbols of β1 and β2 (if they have any) are different. When processing α, we cannot know whether to expand A to αβ1 or to αβ2. But if we rewrite the grammar as A → αA'; A' → β1 | β2, we can immediately expand A to αA'.

112 Left-Factoring — Algorithm For each non-terminal A with two or more alternatives (production rules) with a common non-empty prefix, say A → αβ1 | ... | αβn | γ1 | ... | γm, convert it into: A → αA' | γ1 | ... | γm; A' → β1 | ... | βn

113 Left-Factoring — Example 1 A → abB | aB | cdg | cdeB | cdfB ⇒ A → aA' | cdg | cdeB | cdfB; A' → bB | B ⇒ A → aA' | cdA''; A' → bB | B; A'' → g | eB | fB

114 Left-Factoring — Example 2 A → ad | a | ab | abc | b ⇒ A → aA' | b; A' → d | ε | b | bc ⇒ A → aA' | b; A' → d | bA'' | ε; A'' → c | ε

115 Non-Context-Free Language Constructs There are some constructs in programming languages which are not context-free; this means we cannot write a context-free grammar for these constructs. L1 = { ωcω | ω is in (a|b)* } is not context-free: declaring an identifier and checking whether it is declared or not later. We cannot do this with a context-free language; we need a semantic analyzer (which is not context-free). L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 } is not context-free: declaring two functions (one with n parameters, the other with m parameters) and then calling them with the actual numbers of parameters.

116 Top-Down Parsing The parse tree is created top to bottom. Top-down parsers: Recursive-Descent Parsing — backtracking is needed (if a choice of a production rule does not work, we backtrack to try other alternatives); it is a general parsing technique, but not widely used and not efficient. Predictive Parsing — no backtracking, efficient, but needs a special form of grammars (LL(1) grammars). Recursive predictive parsing is a special form of recursive-descent parsing without backtracking; the non-recursive (table-driven) predictive parser is also known as the LL(1) parser.

117 Introduction A top-down parser tries to create a parse tree from the root towards the leaves, scanning the input from left to right. It can also be viewed as finding a left-most derivation for an input string. Example: id+id*id with the grammar E → TE'; E' → +TE' | ε; T → FT'; T' → *FT' | ε; F → (E) | id. The parse tree grows step by step along the left-most derivation: E ⇒lm TE' ⇒lm FT'E' ⇒lm id T'E' ⇒lm id E' ⇒lm id + TE' ⇒lm ... (figures).

118 Recursive-Descent Parsing (uses Backtracking) Backtracking is needed. The parser tries to find the left-most derivation. S → aBc; B → bc | b. Input: abc. First try S → aBc with B → bc: this fails (after matching a, b, c there is no input left for the final c of S), so backtrack and try B → b, which succeeds (figures).

119 Recursive descent parsing (cont.) General recursive descent may require backtracking, so the previous code needs to be modified to allow it. In its general form the parser can't choose an A-production easily, so we need to try all alternatives: if one fails, the input pointer needs to be reset and another alternative tried. Recursive descent parsers can't be used for left-recursive grammars.

120 Example S → cAd; A → ab | a. Input: cad. Try S → cAd: match c; try A → ab: match a, but b does not match d, so backtrack; try A → a: match a; then match d — success (figures).

121 Predictive Parser a grammar → eliminate left recursion → left factor → a grammar suitable for predictive parsing (an LL(1) grammar); there is no 100% guarantee. When rewriting a non-terminal in a derivation step, a predictive parser can uniquely choose a production rule by just looking at the current symbol in the input string. A → α1 | ... | αn — input: ... a ... (the current token).

122 Predictive Parser (example) stmt → if ... | while ... | begin ... | for ... When we are trying to rewrite the non-terminal stmt, if the current token is if, we have to choose the first production rule: we can uniquely choose the production rule by just looking at the current token. We eliminate the left recursion in the grammar and left-factor it, but the result may still not be suitable for predictive parsing (i.e., not an LL(1) grammar).

123 Recursive Predictive Parsing Each non-terminal corresponds to a procedure. Ex: A → aBb (this is the only production rule for A)
proc A {
    - match the current token with a, and move to the next token;
    - call B;
    - match the current token with b, and move to the next token;
}

124 Recursive Predictive Parsing (cont.) A → aBb | bAB
proc A {
    case of the current token {
        a: - match the current token with a, and move to the next token;
           - call B;
           - match the current token with b, and move to the next token;
        b: - match the current token with b, and move to the next token;
           - call A;
           - call B;
    }
}

125 Recursive Predictive Parsing (cont.) When do we apply ε-productions? A → aA | bB | ε. If all other productions fail, we should apply an ε-production; for example, if the current token is not a or b, we may apply the ε-production. Most correct choice: we should apply an ε-production for a non-terminal A when the current token is in the follow set of A (the terminals that can follow A in sentential forms).

126 Recursive Predictive Parsing (Example) A → aBe | cBd | C; B → bB | ε; C → f
proc A {
    case of the current token {
        a: - match the current token with a, and move to the next token;
           - call B;
           - match the current token with e, and move to the next token;
        c: - match the current token with c, and move to the next token;
           - call B;
           - match the current token with d, and move to the next token;
        f: - call C                         (f is in the first set of C)
    }
}
proc B {
    case of the current token {
        b: - match the current token with b, and move to the next token;
           - call B;
        e,d: do nothing                     (e and d are in the follow set of B)
    }
}
proc C {
    match the current token with f, and move to the next token;
}
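The pseudocode above corresponds to the following runnable C sketch; the single-character token encoding and helper names are assumptions for illustration.

#include <stdio.h>
#include <stdlib.h>

/* Recursive predictive parser for: A -> aBe | cBd | C
                                    B -> bB | epsilon
                                    C -> f               */
static const char *input;

static void error(void) { printf("syntax error\n"); exit(1); }
static void match(char t) { if (*input == t) input++; else error(); }
static void A(void); static void B(void); static void C(void);

static void A(void) {
    switch (*input) {
    case 'a': match('a'); B(); match('e'); break;
    case 'c': match('c'); B(); match('d'); break;
    case 'f': C(); break;               /* f is in FIRST(C) */
    default: error();
    }
}
static void B(void) {
    if (*input == 'b') { match('b'); B(); }
    /* else apply B -> epsilon: 'e' and 'd' are in FOLLOW(B) */
}
static void C(void) { match('f'); }

int main(void) {
    input = "abbe";
    A();
    if (*input == '\0') printf("accepted\n"); else error();
    return 0;
}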

127 Non-Recursive Predictive Parsing — LL(1) Parser Non-recursive predictive parsing is table-driven parsing. It is a top-down parser, also known as the LL(1) parser. Components: input buffer, stack, parsing table, output (figure).

128 LL(1) Parser input buffer: our string to be parsed; we will assume that its end is marked with a special symbol $. output: a production rule representing a step of the derivation sequence (left-most derivation) of the string in the input buffer. stack: contains the grammar symbols; at the bottom of the stack there is a special end-marker symbol $. Initially the stack contains only the symbol $ and the starting symbol S ($S is the initial stack); when the stack is emptied (i.e., only $ is left in the stack), parsing is complete. parsing table: a two-dimensional array M[A,a] — each row is a non-terminal symbol, each column is a terminal symbol or the special symbol $, and each entry holds a production rule.

129 LL(1) Parser — Parser Actions The symbol at the top of the stack (say X) and the current symbol in the input string (say a) determine the parser action. There are four possible parser actions: 1. If X and a are both $, the parser halts (successful completion). 2. If X and a are the same terminal symbol (different from $), the parser pops X from the stack and moves to the next symbol in the input buffer. 3. If X is a non-terminal, the parser looks at the parsing table entry M[X,a]: if M[X,a] holds a production rule X → Y1 Y2 ... Yk, it pops X from the stack and pushes Yk, Yk-1, ..., Y1 onto the stack, and outputs the production X → Y1 Y2 ... Yk to represent a step of the derivation. 4. None of the above: error — all empty entries in the parsing table are errors. (If X is a terminal symbol different from a, this is also an error case.)

130 LL(1) Parser — Example 1 Grammar: S → aBa; B → bB | ε. LL(1) parsing table:
      a         b        $
S   S → aBa
B   B → ε    B → bB

stack    input    output
$S       abba$    S → aBa
$aBa     abba$
$aB      bba$     B → bB
$aBb     bba$
$aB      ba$      B → bB
$aBb     ba$
$aB      a$       B → ε
$a       a$
$        $        accept, successful completion

131 LL(1) Parser — Example 1 (cont.) Outputs: S → aBa, B → bB, B → bB, B → ε. Derivation (left-most): S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba. Parse tree: S has children a, B, a; that B has children b, B; that B has children b, B; the last B derives ε.
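For illustration, a minimal C sketch of the table-driven LL(1) algorithm specialized to this grammar, with the table entries hard-coded as cases (the character encoding is my own); it prints exactly the output column of the trace above.

#include <stdio.h>

/* Table-driven LL(1) parser for S -> aBa, B -> bB | eps.
   The stack holds grammar symbols, top at the right; uppercase letters
   are non-terminals, lowercase are terminals, '$' is the end marker. */
int main(void) {
    const char *w = "abba$";             /* input, $-terminated */
    char stack[64] = "$S";               /* bottom $, start symbol S */
    int top = 1;
    for (;;) {
        char X = stack[top], a = *w;
        if (X == '$' && a == '$') { printf("accept\n"); return 0; }
        if (X == a) { top--; w++; continue; }      /* match terminal */
        if (X == 'S' && a == 'a') {                /* M[S,a]: S -> aBa */
            printf("S -> aBa\n");
            stack[top] = 'a'; stack[++top] = 'B'; stack[++top] = 'a';
        } else if (X == 'B' && a == 'b') {         /* M[B,b]: B -> bB */
            printf("B -> bB\n");
            stack[top] = 'B'; stack[++top] = 'b';
        } else if (X == 'B' && a == 'a') {         /* M[B,a]: B -> eps */
            printf("B -> eps\n");
            top--;
        } else { printf("error\n"); return 1; }
    }
}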

132 LL(1) Parser — Example 2 Grammar: E → TE'; E' → +TE' | ε; T → FT'; T' → *FT' | ε; F → (E) | id. LL(1) parsing table:
        id          +            *            (           )          $
E     E → TE'                              E → TE'
E'               E' → +TE'                             E' → ε     E' → ε
T     T → FT'                              T → FT'
T'               T' → ε      T' → *FT'                 T' → ε     T' → ε
F     F → id                               F → (E)

133 LL(1) Parser — Example 2
stack      input      output
$E         id+id$     E → TE'
$E'T       id+id$     T → FT'
$E'T'F     id+id$     F → id
$E'T'id    id+id$
$E'T'      +id$       T' → ε
$E'        +id$       E' → +TE'
$E'T+      +id$
$E'T       id$        T → FT'
$E'T'F     id$        F → id
$E'T'id    id$
$E'T'      $          T' → ε
$E'        $          E' → ε
$          $          accept

134 Constructing LL(1) Parsing Tables Two functions are used in the construction of LL(1) parsing tables: FIRST and FOLLOW. FIRST(α) is the set of terminal symbols which occur as the first symbols of strings derived from α, where α is any string of grammar symbols; if α derives ε, then ε is also in FIRST(α). FOLLOW(A) is the set of terminals which can occur immediately after (follow) the non-terminal A in strings derived from the start symbol: a terminal a is in FOLLOW(A) if S ⇒* αAaβ, and $ is in FOLLOW(A) if S ⇒* αA.

135 Compute FIRST for Any String X If X is a terminal symbol, FIRST(X) = {X}. If X is ε, FIRST(X) = {ε}. If X is a non-terminal symbol and X → ε is a production rule, then ε is in FIRST(X). If X is a non-terminal symbol and X → Y1 Y2 ... Yn is a production rule: if a terminal a is in FIRST(Yi) and ε is in FIRST(Yj) for all j = 1,...,i-1, then a is in FIRST(X); if ε is in FIRST(Yj) for all j = 1,...,n, then ε is in FIRST(X). The same two rules compute FIRST of any string X = Y1 Y2 ... Yn of grammar symbols.
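These rules can be computed as a fixpoint iteration. The following C sketch does this for the expression grammar of the next slide; the single-character encoding (e for E', t for T', i for id, # for ε) is my own, for illustration only.

#include <stdio.h>
#include <string.h>

/* Grammar: E -> Te'   e -> +Te' | eps   T -> Ft'   t -> *Ft' | eps
            F -> (E) | id
   encoded with one char per symbol: E, e(=E'), T, t(=T'), F,
   terminals + * ( ) i(=id), and '#' standing for epsilon. */
static const char *prods[] = {
    "E=Te", "e=+Te", "e=#", "T=Ft", "t=*Ft", "t=#", "F=(E)", "F=i"
};
static char first[128][16];              /* FIRST set per symbol */

static int add(char sym, char t) {       /* returns 1 if t was new */
    if (strchr(first[(int)sym], t)) return 0;
    first[(int)sym][strlen(first[(int)sym])] = t;
    return 1;
}

int main(void) {
    for (const char *t = "+*()i#"; *t; t++) add(*t, *t); /* FIRST(a)={a} */
    int changed = 1;
    while (changed) {                    /* iterate until nothing is added */
        changed = 0;
        for (int p = 0; p < 8; p++) {
            char A = prods[p][0];
            int all_eps = 1;             /* have all Yj so far derived eps? */
            for (const char *Y = prods[p] + 2; *Y && all_eps; Y++) {
                all_eps = 0;
                for (const char *f = first[(int)*Y]; *f; f++) {
                    if (*f == '#') all_eps = 1;     /* Y can vanish     */
                    else changed |= add(A, *f);     /* a is in FIRST(A) */
                }
            }
            if (all_eps) changed |= add(A, '#');    /* whole rhs -> eps */
        }
    }
    for (const char *n = "EeTtF"; *n; n++)
        printf("FIRST(%c) = {%s}\n", *n, first[(int)*n]);
    return 0;
}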

136 FIRST Example Grammar: E → TE'; E' → +TE' | ε; T → FT'; T' → *FT' | ε; F → (E) | id.
FIRST(F) = {(, id}      FIRST(TE') = {(, id}
FIRST(T') = {*, ε}      FIRST(+TE') = {+}
FIRST(T) = {(, id}      FIRST(ε) = {ε}
FIRST(E') = {+, ε}      FIRST(FT') = {(, id}
FIRST(E) = {(, id}      FIRST(*FT') = {*}
                        FIRST((E)) = {(}
                        FIRST(id) = {id}

137 Compute FOLLOW (for non-terminals) If S is the start symbol, $ is in FOLLOW(S). If A → αBβ is a production rule, everything in FIRST(β) except ε is in FOLLOW(B). If A → αB is a production rule, or A → αBβ is a production rule and ε is in FIRST(β), then everything in FOLLOW(A) is in FOLLOW(B). We apply these rules until nothing more can be added to any follow set.

138 FOLLOW Example Grammar: E → TE'; E' → +TE' | ε; T → FT'; T' → *FT' | ε; F → (E) | id. FOLLOW(E) = { $, ) }; FOLLOW(E') = { $, ) }; FOLLOW(T) = { +, ), $ }; FOLLOW(T') = { +, ), $ }; FOLLOW(F) = { +, *, ), $ }

139 Constructing LL(1) Parsing Table — Algorithm For each production rule A → α of a grammar G: for each terminal a in FIRST(α), add A → α to M[A,a]; if ε is in FIRST(α), for each terminal a in FOLLOW(A) add A → α to M[A,a]; if ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A,$]. All other undefined entries of the parsing table are error entries.

140 Constructing LL(1) Parsing Table — Example E → TE': FIRST(TE') = {(, id}, so E → TE' goes into M[E,(] and M[E,id]. E' → +TE': FIRST(+TE') = {+}, so E' → +TE' goes into M[E',+]. E' → ε: FIRST(ε) = {ε}, so none by the first rule; but since ε is in FIRST(ε) and FOLLOW(E') = {$, )}, E' → ε goes into M[E',$] and M[E',)]. T → FT': FIRST(FT') = {(, id}, so T → FT' goes into M[T,(] and M[T,id]. T' → *FT': FIRST(*FT') = {*}, so T' → *FT' goes into M[T',*]. T' → ε: FIRST(ε) = {ε}, so none by the first rule; but since FOLLOW(T') = {$, ), +}, T' → ε goes into M[T',$], M[T',)] and M[T',+]. F → (E): FIRST((E)) = {(}, so F → (E) goes into M[F,(]. F → id: FIRST(id) = {id}, so F → id goes into M[F,id].

141 LL(1) Grammars

A grammar whose parsing table has no multiply-defined entries is said to be an LL(1) grammar.
LL(1): the input is scanned from left to right, the parser produces a leftmost derivation, and one input symbol is used as a lookahead symbol to determine the parser action.
An entry of the parsing table may contain more than one production rule. In this case, we say that the grammar is not an LL(1) grammar.

142 A Grammar which is not LL(1)

S → i C t S E | a       FOLLOW(S) = { $, e }
E → e S | ε             FOLLOW(E) = { $, e }
C → b                   FOLLOW(C) = { t }

FIRST(iCtSE) = {i}   FIRST(a) = {a}   FIRST(eS) = {e}   FIRST(ε) = {ε}   FIRST(b) = {b}

        a        b        e                 i            t    $
S     S → a                             S → iCtSE
E                       E → eS
                        E → ε                                E → ε
C              C → b

There are two production rules for M[E,e]. The problem is the ambiguity of the grammar.

143 A Grammar which is not LL(1) (cont.)

What do we do if the resulting parsing table contains multiply-defined entries?
- If we didn't eliminate left recursion, eliminate the left recursion in the grammar.
- If the grammar is not left factored, we have to left factor the grammar.
- If the (new grammar's) parsing table still contains multiply-defined entries, that grammar is ambiguous or it is inherently not an LL(1) grammar.
A left-recursive grammar cannot be an LL(1) grammar: for A → Aα | β, any terminal that appears in FIRST(β) also appears in FIRST(Aα) because Aα ⇒ βα; if β is ε, any terminal that appears in FIRST(α) also appears in FIRST(Aα) and FOLLOW(A).
A grammar that is not left factored cannot be an LL(1) grammar: for A → αβ1 | αβ2, any terminal that appears in FIRST(αβ1) also appears in FIRST(αβ2).
An ambiguous grammar cannot be an LL(1) grammar.

144 Properties of LL(1) Grammars

A grammar G is LL(1) if and only if the following conditions hold for every two distinct production rules A → α and A → β:
1. Both α and β cannot derive strings starting with the same terminals.
2. At most one of α and β can derive ε.
3. If β can derive ε, then α cannot derive any string starting with a terminal in FOLLOW(A).

145 Error Recovery in Predictive Parsing

An error may occur in predictive (LL(1)) parsing:
- if the terminal symbol on the top of the stack does not match the current input symbol, or
- if the top of the stack is a non-terminal A, the current input symbol is a, and the parsing table entry M[A,a] is empty.
What should the parser do in an error case? The parser should give an error message (as meaningful an error message as possible). It should recover from the error, and it should be able to continue parsing the rest of the input.

146 Error Recovery Techniques

Panic-Mode Error Recovery: skip the input symbols until a synchronizing token is found.
Phrase-Level Error Recovery: each empty entry in the parsing table is filled with a pointer to a specific error routine that handles that error case.
Error Productions: if we have a good idea of the common errors that might be encountered, we can augment the grammar with productions that generate erroneous constructs. When an error production is used by the parser, we can generate an appropriate error diagnostic. Since it is almost impossible to know all the errors that programmers can make, this method is not practical.
Global Correction: ideally, we would like a compiler to make as few changes as possible in processing incorrect inputs. We have to analyze the input globally to find the error. This is an expensive method, and it is not used in practice.

147 Panic-Mode Error Recovery in LL(1) Parsing

In panic-mode error recovery, we skip all the input symbols until a synchronizing token is found.
What is the synchronizing token? All the terminal symbols in the FOLLOW set of a non-terminal can be used as the synchronizing token set for that non-terminal.
So, a simple panic-mode error recovery for LL(1) parsing: all the empty entries are marked as sync to indicate that the parser will skip all the input symbols until a symbol in the FOLLOW set of the non-terminal A on the top of the stack appears. Then the parser pops that non-terminal A from the stack, and parsing continues from that state.
To handle an unmatched terminal symbol, the parser pops that terminal from the stack and issues an error message saying that the unmatched terminal was inserted.

148 Panic-Mode Error Recovery -- Example

S → A b S | e | ε       FOLLOW(S) = {$}
A → a | c A d           FOLLOW(A) = {b, d}

        a          b       c          d       e       $
S     S → AbS    sync    S → AbS    sync    S → e   S → ε
A     A → a      sync    A → cAd    sync    sync    sync

stack    input    output
$S       aab$     S → AbS
$SbA     aab$     A → a
$Sba     aab$     (match a)
$Sb      ab$      Error: missing b, inserted
$S       ab$      S → AbS
$SbA     ab$      A → a
$Sba     ab$      (match a)
$Sb      b$       (match b)
$S       $        S → ε
$        $        accept

stack    input    output
$S       ceadb$   S → AbS
$SbA     ceadb$   A → cAd
$SbdAc   ceadb$   (match c)
$SbdA    eadb$    Error: unexpected e (illegal A) (remove all input tokens until the first b or d, pop A)
$Sbd     db$      (match d)
$Sb      b$       (match b)
$S       $        S → ε
$        $        accept

149 Phrase-Level Error Recovery

Each empty entry in the parsing table is filled with a pointer to a special error routine which handles that error case. These error routines may:
- change, insert, or delete input symbols;
- issue appropriate error messages;
- pop items from the stack.
We should be careful when we design these error routines, because we may put the parser into an infinite loop.

150 Bottom-Up Parsing

A bottom-up parser creates the parse tree of the given input starting from the leaves towards the root.
A bottom-up parser tries to find the rightmost derivation of the given input in reverse order:
S ⇒ ... ⇒ ω   (the rightmost derivation of ω; the bottom-up parser finds this derivation in reverse order)
Bottom-up parsing is also known as shift-reduce parsing because its two main actions are shift and reduce:
- at each shift action, the current symbol in the input string is pushed onto a stack;
- at each reduce action, the symbols at the top of the stack (this symbol sequence is the right side of a production) are replaced by the non-terminal on the left side of that production.
There are also two more actions: accept and error.

151 Shift-Reduce Parsing

A shift-reduce parser tries to reduce the given input string to the starting symbol:
a string ⇒ ... ⇒ the starting symbol
At each reduction step, a substring of the input matching the right side of a production rule is replaced by the non-terminal on the left side of that production rule. If the substring is chosen correctly, the rightmost derivation of the string is created in reverse order.
Rightmost derivation: S ⇒*rm ω        The shift-reduce parser finds: ω ⇐rm ... ⇐rm S

152 Shift-Reduce Parsing -- Example

S → a A B b         input string: aaabb
A → a A | a         aaabb  →  aaAbb   (reduction by A → a)
B → b B | b         aaAbb  →  aAbb    (reduction by A → aA)
                    aAbb   →  aABb    (reduction by B → b)
                    aABb   →  S       (reduction by S → aABb)

S ⇒rm aABb ⇒rm aAbb ⇒rm aaAbb ⇒rm aaabb    (right sentential forms)

How do we know which substring to replace at each reduction step?

153 Handle

Informally, a handle of a string is a substring that matches the right side of a production rule. But not every substring that matches the right side of a production rule is a handle.
A handle of a right sentential form γ (= αβω) is a production rule A → β and a position of γ where the string β may be found and replaced by A to produce the previous right sentential form in a rightmost derivation of γ:
S ⇒*rm αAω ⇒rm αβω
If the grammar is unambiguous, then every right sentential form of the grammar has exactly one handle.
We will see that ω is a string of terminals.

154 Handle Pruning

A rightmost derivation in reverse can be obtained by handle pruning:
S = γ0 ⇒rm γ1 ⇒rm γ2 ⇒rm ... ⇒rm γn-1 ⇒rm γn = ω   (the input string)
Start from γn; find a handle An → βn in γn, and replace βn by An to get γn-1.
Then find a handle An-1 → βn-1 in γn-1, and replace βn-1 by An-1 to get γn-2.
Repeat this until we reach S.

155 A Shift-Reduce Parser

E → E + T | T       Rightmost derivation of id+id*id:
T → T * F | F       E ⇒ E+T ⇒ E+T*F ⇒ E+T*id ⇒ E+F*id ⇒ E+id*id ⇒ T+id*id ⇒ F+id*id ⇒ id+id*id
F → ( E ) | id

Right-most sentential form    Reducing production
id+id*id                      F → id
F+id*id                       T → F
T+id*id                       E → T
E+id*id                       F → id
E+F*id                        T → F
E+T*id                        F → id
E+T*F                         T → T*F
E+T                           E → E+T
E

In each right sentential form, the handle is the occurrence of the right side of the reducing production.

156 A Stack Implementation of a Shift-Reduce Parser

There are four possible actions of a shift-reduce parser:
1. Shift: the next input symbol is shifted onto the top of the stack.
2. Reduce: replace the handle on the top of the stack by the non-terminal.
3. Accept: successful completion of parsing.
4. Error: the parser discovers a syntax error and calls an error recovery routine.
Initially the stack contains only the end-marker $. The end of the input string is also marked by the end-marker $.

157 A Stack Implementation of a Shift-Reduce Parser

Stack       Input        Action
$           id+id*id$    shift
$id         +id*id$      reduce by F → id
$F          +id*id$      reduce by T → F
$T          +id*id$      reduce by E → T
$E          +id*id$      shift
$E+         id*id$       shift
$E+id       *id$         reduce by F → id
$E+F        *id$         reduce by T → F
$E+T        *id$         shift
$E+T*       id$          shift
$E+T*id     $            reduce by F → id
$E+T*F      $            reduce by T → T*F
$E+T        $            reduce by E → E+T
$E          $            accept

(The slide also shows the parse tree of id+id*id, which is built bottom-up as these reductions are performed.)

158 Conflicts During Shift-Reduce Parsing

There are context-free grammars for which shift-reduce parsers cannot be used: the stack contents and the next input symbol may not be enough to decide the action.
- shift/reduce conflict: the parser cannot decide whether to make a shift operation or a reduction.
- reduce/reduce conflict: the parser cannot decide which of several reductions to make.
If a shift-reduce parser cannot be used for a grammar, that grammar is called a non-LR(k) grammar (L: left-to-right scanning, R: rightmost derivation, k: k symbols of lookahead).
An ambiguous grammar can never be an LR grammar.

159 Shift-Reduce Parsers

There are two main categories of shift-reduce parsers:
1. Operator-Precedence Parser: simple, but handles only a small class of grammars.
2. LR Parsers: cover a wide range of grammars (CFG ⊇ LR ⊇ LALR ⊇ SLR).
   - SLR: simple LR parser
   - LR: the most general LR parser
   - LALR: intermediate LR parser (lookahead LR parser)
SLR, LR and LALR parsers work the same way; only their parsing tables are different.

160 LR Parsers

The most powerful (yet efficient) shift-reduce parsing method is LR(k) parsing (L: left-to-right scanning, R: rightmost derivation in reverse, k: k symbols of lookahead; when k is omitted, it is 1).
LR parsing is attractive because:
- LR parsing is the most general non-backtracking shift-reduce parsing method, yet it is still efficient.
- The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers: LL(1) grammars ⊂ LR(1) grammars.
- An LR parser can detect a syntactic error as soon as it is possible to do so in a left-to-right scan of the input.

161 LR Parsers

LR parsers cover a wide range of grammars.
- SLR: simple LR parser
- LR: the most general LR parser
- LALR: intermediate LR parser (lookahead LR parser)
SLR, LR and LALR parsers work the same way (they use the same algorithm); only their parsing tables are different.

162 LR Parsing Algorithm

An LR parser consists of:
- an input buffer a1 ... ai ... an $, scanned left to right;
- a stack holding states and grammar symbols: S0 X1 S1 ... Xm-1 Sm-1 Xm Sm (Sm on top);
- an action table indexed by states and terminals (including $), each entry holding one of four actions;
- a goto table indexed by states and non-terminals, each entry holding a state number;
- an output stream of the productions used.

163 A Configuration of the LR Parsing Algorithm

A configuration of an LR parser is:
( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ )
       stack              rest of input
Sm and ai decide the parser action by consulting the parsing action table. (The initial stack contains just S0.)
A configuration of an LR parser represents the right sentential form:
X1 ... Xm ai ai+1 ... an

164 Actions of an LR Parser

1. shift s -- shifts the next input symbol and the state s onto the stack:
   ( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ )  →  ( S0 X1 S1 ... Xm Sm ai s, ai+1 ... an $ )
2. reduce A → β (or rn, where n is a production number) -- pop 2|β| items from the stack; then push A and s, where s = goto[Sm-|β|, A]:
   ( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ )  →  ( S0 X1 S1 ... Xm-|β| Sm-|β| A s, ai ... an $ )
   The output is the reducing production: reduce A → β.
3. Accept -- parsing successfully completed.
4. Error -- the parser detected an error (an empty entry in the action table).
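A minimal Python sketch of this driver loop (the tuple encoding of actions and the dict ACTION/GOTO tables are assumptions). Since the sketch keeps only states on the stack and leaves the grammar symbols implicit, a reduction pops |β| entries instead of 2|β|:

def lr_parse(tokens, action, goto_table):
    stack = [0]                               # state stack; initial state S0
    tokens = tokens + ['$']
    i = 0
    while True:
        act = action.get((stack[-1], tokens[i]))
        if act is None:
            raise SyntaxError('error in state %d on %s' % (stack[-1], tokens[i]))
        if act[0] == 'shift':                 # ('shift', s): push s, advance the input
            stack.append(act[1])
            i += 1
        elif act[0] == 'reduce':              # ('reduce', A, r): pop r states,
            _, lhs, r = act                   # then push goto[s, A]
            del stack[len(stack) - r:]
            stack.append(goto_table[(stack[-1], lhs)])
            print('reduce by %s' % lhs)       # output the head of the reducing production
        else:                                 # ('accept',)
            return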

165 Reduce Action

Pop 2|β| items from the stack; let us assume that β = Y1 Y2 ... Yr; then push A and s, where s = goto[Sm-r, A]:
( S0 X1 S1 ... Xm-r Sm-r Y1 Sm-r+1 ... Yr Sm, ai ai+1 ... an $ )  →  ( S0 X1 S1 ... Xm-r Sm-r A s, ai ... an $ )
In fact, Y1 Y2 ... Yr is a handle:
X1 ... Xm-r A ai ... an $  ⇐  X1 ... Xm-r Y1 ... Yr ai ai+1 ... an $

166 (SLR) Parsing Tables for the Expression Grammar

1) E → E + T    2) E → T    3) T → T * F    4) T → F    5) F → ( E )    6) F → id

              Action table                    Goto table
state    id    +    *    (    )    $      E    T    F
0        s5              s4                1    2    3
1              s6                  acc
2              r2   s7        r2   r2
3              r4   r4        r4   r4
4        s5              s4                8    2    3
5              r6   r6        r6   r6
6        s5              s4                     9    3
7        s5              s4                          10
8              s6             s11
9              r1   s7        r1   r1
10             r3   r3        r3   r3
11             r5   r5        r5   r5

167 Actions of an (S)LR Parser -- Example

stack          input       action             output
0              id*id+id$   shift 5
0id5           *id+id$     reduce by F → id   F → id
0F3            *id+id$     reduce by T → F    T → F
0T2            *id+id$     shift 7
0T2*7          id+id$      shift 5
0T2*7id5       +id$        reduce by F → id   F → id
0T2*7F10       +id$        reduce by T → T*F  T → T*F
0T2            +id$        reduce by E → T    E → T
0E1            +id$        shift 6
0E1+6          id$         shift 5
0E1+6id5       $           reduce by F → id   F → id
0E1+6F3        $           reduce by T → F    T → F
0E1+6T9        $           reduce by E → E+T  E → E+T
0E1            $           accept

168 Constructing SLR Parsing Tables -- LR(0) Items

An LR(0) item of a grammar G is a production of G with a dot at some position of the right side.
Ex: A → aBb gives four possible LR(0) items:
A → .aBb    A → a.Bb    A → aB.b    A → aBb.
Sets of LR(0) items will be the states of the action and goto tables of the SLR parser.
A collection of sets of LR(0) items (the canonical LR(0) collection) is the basis for constructing SLR parsers.
Augmented grammar: G' is G with a new production rule S' → S, where S' is the new starting symbol.

169 The Closure Operation

If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items constructed from I by two rules:
1. Initially, every LR(0) item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production rule of G, then B → .γ will be in closure(I). We apply this rule until no more new LR(0) items can be added to closure(I).
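A sketch of closure(I) in Python. The encoding of an item as a (lhs, body, dot) triple with body a tuple of symbols is an assumption made for this illustration; g maps each non-terminal to a list of its production bodies, as in the earlier grammar sketch:

def closure(items, g):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, body, dot) in list(result):
            if dot < len(body) and body[dot] in g:       # dot before a non-terminal B
                for prod in g[body[dot]]:
                    item = (body[dot], tuple(prod), 0)   # add B -> .gamma
                    if item not in result:
                        result.add(item)
                        changed = True
    return frozenset(result)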

170 The Closure Operation -- Example

E' → E           closure({E' → .E}) =
E → E + T        {  E' → .E        (the kernel item)
E → T               E → .E + T
T → T * F           E → .T
T → F               T → .T * F
F → ( E )           T → .F
F → id              F → .( E )
                    F → .id  }

171 The Goto Operation

If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows: if A → α.Xβ is in I, then every item in closure({A → αX.β}) will be in goto(I,X).
Example: I = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
goto(I,E) = { E' → E., E → E.+T }
goto(I,T) = { E → T., T → T.*F }
goto(I,F) = { T → F. }
goto(I,() = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
goto(I,id) = { F → id. }

172 Construction of the Canonical LR(0) Collection

To create the SLR parsing tables for a grammar G, we create the canonical LR(0) collection of the grammar G.
Algorithm:
  C is { closure({S' → .S}) }
  repeat until no more sets of LR(0) items can be added to C:
    for each I in C and each grammar symbol X:
      if goto(I,X) is not empty and not in C, add goto(I,X) to C
The goto function is a DFA on the sets in C.
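Continuing the item encoding from the closure sketch, here is a hedged sketch of goto(I,X) and the collection loop; the assumption that the augmented start symbol has a single production body is mine, not the slides':

def goto(items, x, g):
    moved = {(l, b, d + 1) for (l, b, d) in items if d < len(b) and b[d] == x}
    return closure(moved, g) if moved else frozenset()

def canonical_collection(g, start):
    # start is the augmented start symbol, e.g. the E' of E' -> E
    symbols = {s for bodies in g.values() for b in bodies for s in b}
    c = {closure({(start, tuple(g[start][0]), 0)}, g)}   # closure({S' -> .S})
    changed = True
    while changed:
        changed = False
        for i in list(c):
            for x in symbols:
                j = goto(i, x, g)
                if j and j not in c:
                    c.add(j)
                    changed = True
    return c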

173 The Canonical LR(0) Collection -- Example

I0: E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id
I1: E' → E., E → E.+T
I2: E → T., T → T.*F
I3: T → F.
I4: F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id
I5: F → id.
I6: E → E+.T, T → .T*F, T → .F, F → .(E), F → .id
I7: T → T*.F, F → .(E), F → .id
I8: F → (E.), E → E.+T
I9: E → E+T., T → T.*F
I10: T → T*F.
I11: F → (E).

174 Transition Diagram (DFA) of the Goto Function

I0 --E--> I1    I0 --T--> I2    I0 --F--> I3    I0 --(--> I4    I0 --id--> I5
I1 --+--> I6
I2 --*--> I7
I4 --E--> I8    I4 --T--> I2    I4 --F--> I3    I4 --(--> I4    I4 --id--> I5
I6 --T--> I9    I6 --F--> I3    I6 --(--> I4    I6 --id--> I5
I7 --F--> I10   I7 --(--> I4    I7 --id--> I5
I8 --)--> I11   I8 --+--> I6
I9 --*--> I7

175 Constructing the SLR Parsing Table (of an augmented grammar G')

1. Construct the canonical collection of sets of LR(0) items for G': C = {I0,...,In}.
2. Create the parsing action table as follows:
   - If a is a terminal, A → α.aβ is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
   - If A → α. is in Ii, then action[i,a] is reduce A → α for all a in FOLLOW(A), where A ≠ S'.
   - If S' → S. is in Ii, then action[i,$] is accept.
   - If any conflicting actions are generated by these rules, the grammar is not SLR(1).
3. Create the parsing goto table: for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j.
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S.
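A sketch of step 2 and 3 in Python, building on the goto() and item encoding above; it assumes the collection is given as a list of states, and that FOLLOW sets were computed as in the earlier sketch. Detecting a conflicting entry signals that the grammar is not SLR(1):

def slr_tables(states, g, follow, start):
    action, goto_tab = {}, {}
    index = {s: n for n, s in enumerate(states)}
    for n, i in enumerate(states):
        for (lhs, body, dot) in i:
            if dot < len(body):                       # dot before a symbol X
                x = body[dot]
                j = index[goto(i, x, g)]
                if x in g:
                    goto_tab[(n, x)] = j              # non-terminal: goto entry
                else:
                    if action.get((n, x), ('shift', j)) != ('shift', j):
                        raise ValueError('not SLR(1): conflict in state %d' % n)
                    action[(n, x)] = ('shift', j)     # terminal: shift entry
            elif lhs == start:
                action[(n, '$')] = ('accept',)
            else:
                for a in follow[lhs]:                 # reduce on FOLLOW(lhs)
                    if (n, a) in action:
                        raise ValueError('not SLR(1): conflict in state %d' % n)
                    action[(n, a)] = ('reduce', lhs, len(body))
    return action, goto_tab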

176 Parsing Tables of the Expression Grammar

              Action table                    Goto table
state    id    +    *    (    )    $      E    T    F
0        s5              s4                1    2    3
1              s6                  acc
2              r2   s7        r2   r2
3              r4   r4        r4   r4
4        s5              s4                8    2    3
5              r6   r6        r6   r6
6        s5              s4                     9    3
7        s5              s4                          10
8              s6             s11
9              r1   s7        r1   r1
10             r3   r3        r3   r3
11             r5   r5        r5   r5

177 SLR(1) Grammars

An LR parser using SLR(1) parsing tables for a grammar G is called the SLR(1) parser for G. If a grammar G has an SLR(1) parsing table, it is called an SLR(1) grammar (or SLR grammar for short).
Every SLR grammar is unambiguous, but not every unambiguous grammar is an SLR grammar.

178 shift/reduce and reduce/reduce conflicts

If a state does not know whether it should make a shift operation or a reduction for a terminal, we say that there is a shift/reduce conflict.
If a state does not know whether it should make a reduction using production rule i or production rule j for a terminal, we say that there is a reduce/reduce conflict.
If the SLR parsing table of a grammar G has a conflict, we say that the grammar is not an SLR grammar.

179 Conflict Example

S → L = R    I0: S' → .S        I1: S' → S.        I6: S → L=.R     I9: S → L=R.
S → R            S → .L=R                              R → .L
L → * R          S → .R         I2: S → L.=R           L → .*R
L → id           L → .*R            R → L.             L → .id
R → L            L → .id
                 R → .L         I3: S → R.         I7: L → *R.

                                I4: L → *.R        I8: R → L.
                                    R → .L
                                    L → .*R
                                    L → .id

                                I5: L → id.

Problem: FOLLOW(R) = {=, $}, so in state I2 on input '=' the parser can either shift 6 (from S → L.=R) or reduce by R → L: a shift/reduce conflict.

180 Conflict Example 2

S → A a A b     I0: S' → .S
S → B b B a         S → .AaAb
A → ε               S → .BbBa
B → ε               A → .
                    B → .

Problem: FOLLOW(A) = {a, b} and FOLLOW(B) = {a, b}, so on input a the parser can reduce by A → ε or by B → ε, and likewise on input b: reduce/reduce conflicts.

181 Constructing Canonical LR(1) Parsing Tables

In the SLR method, state i makes a reduction by A → α when the current token is a if A → α. is in Ii and a is in FOLLOW(A).
In some situations, however, βA cannot be followed by the terminal a in a right sentential form when β and state i are on the top of the stack. This means that making the reduction in this case is not correct.

S → AaAb        S ⇒ AaAb ⇒ Aab ⇒ ab
S → BbBa        S ⇒ BbBa ⇒ Bba ⇒ ba
A → ε
B → ε

In AaAb, the first A is followed by a and the second by b; in BbBa, the first B is followed by b and the second by a. So at the start of the input, reducing ε to A is correct only when the lookahead is a, and reducing ε to B only when it is b, even though FOLLOW(A) = FOLLOW(B) = {a, b}.

182 LR(1) Items

To avoid some invalid reductions, the states need to carry more information. Extra information is put into a state by including a terminal symbol as a second component in an item.
An LR(1) item is:
A → α.β, a
where a is the lookahead of the LR(1) item (a is a terminal or the end-marker $).

183 LR(1) Items (cont.)

When β (in the LR(1) item A → α.β, a) is not empty, the lookahead does not have any effect.
When β is empty (A → α., a), we do the reduction by A → α only if the next input symbol is a (not for every terminal in FOLLOW(A)).
A state will contain A → α., a1 ... A → α., an, where {a1,...,an} ⊆ FOLLOW(A).

184 Canonical Collection of Sets of LR(1) Items

The construction of the canonical collection of sets of LR(1) items is similar to the construction of the canonical collection of sets of LR(0) items, except that the closure and goto operations work a little differently.
closure(I) (where I is a set of LR(1) items):
- every LR(1) item in I is in closure(I);
- if A → α.Bβ, a is in closure(I) and B → γ is a production rule of G, then B → .γ, b will be in closure(I) for each terminal b in FIRST(βa).

185 goto Operation

If I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows: if A → α.Xβ, a is in I, then every item in closure({A → αX.β, a}) will be in goto(I,X).

186 Construction of the Canonical LR(1) Collection

Algorithm:
  C is { closure({S' → .S, $}) }
  repeat until no more sets of LR(1) items can be added to C:
    for each I in C and each grammar symbol X:
      if goto(I,X) is not empty and not in C, add goto(I,X) to C
The goto function is a DFA on the sets in C.

187 A Short Notation for Sets of LR(1) Items

A set of LR(1) items containing the items
A → α.β, a1  ...  A → α.β, an
can be written as
A → α.β, a1/a2/.../an

188 Canonical LR(1) Collection -- Example

S → AaAb    I0: S' → .S, $ ; S → .AaAb, $ ; S → .BbBa, $ ; A → ., a ; B → ., b
S → BbBa    I1: S' → S., $          (goto(I0,S))
A → ε       I2: S → A.aAb, $        (goto(I0,A))
B → ε       I3: S → B.bBa, $        (goto(I0,B))
            I4: S → Aa.Ab, $ ; A → ., b   (goto(I2,a))
            I5: S → Bb.Ba, $ ; B → ., a   (goto(I3,b))
            I6: S → AaA.b, $        (goto(I4,A))
            I7: S → BbB.a, $        (goto(I5,B))
            I8: S → AaAb., $        (goto(I6,b))
            I9: S → BbBa., $        (goto(I7,a))

189 Canonical LR(1) Collection -- Example 2

S' → S    1) S → L = R    2) S → R    3) L → * R    4) L → id    5) R → L

I0:  S' → .S, $ ; S → .L=R, $ ; S → .R, $ ; L → .*R, $/= ; L → .id, $/= ; R → .L, $
I1:  S' → S., $                         (goto(I0,S))
I2:  S → L.=R, $ ; R → L., $            (goto(I0,L))
I3:  S → R., $                          (goto(I0,R))
I4:  L → *.R, $/= ; R → .L, $/= ; L → .*R, $/= ; L → .id, $/=   (goto(I0,*))
I5:  L → id., $/=                       (goto(I0,id))
I6:  S → L=.R, $ ; R → .L, $ ; L → .*R, $ ; L → .id, $          (goto(I2,=))
I7:  L → *R., $/=                       (goto(I4,R))
I8:  R → L., $/=                        (goto(I4,L))
I9:  S → L=R., $                        (goto(I6,R))
I10: R → L., $                          (goto(I6,L))
I11: L → *.R, $ ; R → .L, $ ; L → .*R, $ ; L → .id, $           (goto(I6,*))
I12: L → id., $                         (goto(I6,id))
I13: L → *R., $                         (goto(I11,R))

Pairs with the same core: I4 and I11, I5 and I12, I7 and I13, I8 and I10.

190 Construction of LR(1) Parsing Tables

1. Construct the canonical collection of sets of LR(1) items for G': C = {I0,...,In}.
2. Create the parsing action table as follows:
   - If a is a terminal, A → α.aβ, b is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
   - If A → α., a is in Ii and A ≠ S', then action[i,a] is reduce A → α.
   - If S' → S., $ is in Ii, then action[i,$] is accept.
   - If any conflicting actions are generated by these rules, the grammar is not LR(1).
3. Create the parsing goto table: for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j.
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S, $.

191 LR(1) Parsing Tables (for Example 2)

state    id     *      =     $       S    L    R
0        s5     s4                   1    2    3
1                            acc
2                      s6    r5
3                            r2
4        s5     s4                        8    7
5                      r4    r4
6        s12    s11                       10   9
7                      r3    r3
8                      r5    r5
9                            r1
10                           r5
11       s12    s11                       10   13
12                           r4
13                           r3

There is no shift/reduce or reduce/reduce conflict, so it is an LR(1) grammar.

192 LALR Parsing Tables

LALR stands for LookAhead LR.
LALR parsers are often used in practice because LALR parsing tables are smaller than LR(1) parsing tables.
The number of states in the SLR and LALR parsing tables for a grammar G is the same. But LALR parsers recognize more grammars than SLR parsers.
yacc creates an LALR parser for the given grammar.
A state of an LALR parser will again be a set of LR(1) items.

193 Creating LALR Parsing Tables

Canonical LR(1) parser → (shrink the number of states) → LALR parser
This shrinking process may introduce a reduce/reduce conflict in the resulting LALR parser (in which case the grammar is not LALR).
But this shrinking process cannot produce a shift/reduce conflict.

194 The Core of a Set of LR(1) Items

The core of a set of LR(1) items is the set of its first components (i.e., the LR(0) items obtained by dropping the lookaheads).
Ex: the state { S → L.=R, $ ; R → L., $ } has the core { S → L.=R ; R → L. }.
We find the states (sets of LR(1) items) in a canonical LR(1) parser with the same cores, and we merge them into a single state.
I1: L → id., =      I1 and I2 have the same core; merging them gives the new state
I2: L → id., $      I12: L → id., =/$
We do this for all states of the canonical LR(1) parser to get the states of the LALR parser. In fact, the number of states of the LALR parser for a grammar will be equal to the number of states of the SLR parser for that grammar.

195 Creation of LALR Parsing Tables

- Create the canonical LR(1) collection of sets of LR(1) items for the given grammar.
- For each core, find all sets having that core, and replace those sets by their union: C = {I0,...,In} becomes C' = {J1,...,Jm}, where m ≤ n.
- Create the parsing tables (action and goto tables) in the same way as for an LR(1) parser.
  Note that if J = I1 ∪ ... ∪ Ik, then, since I1,...,Ik have the same core, the cores of goto(I1,X),...,goto(Ik,X) must also be the same. So goto(J,X) = K, where K is the union of all sets of items having the same core as goto(I1,X).
- If no conflict is introduced, the grammar is an LALR(1) grammar. (We may only introduce reduce/reduce conflicts; we cannot introduce a shift/reduce conflict.)
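The core-merging step itself is a small grouping operation. A hedged Python sketch, assuming LR(1) items are encoded as (lhs, body, dot, lookahead) tuples (the 4-tuple encoding is mine, not the slides'):

def merge_by_core(lr1_states):
    groups = {}
    for state in lr1_states:
        core = frozenset((l, b, d) for (l, b, d, a) in state)
        groups.setdefault(core, set()).update(state)   # union the items sharing a core
    return [frozenset(s) for s in groups.values()]

The goto entries of the merged states follow from the note above: states with the same core have goto targets with the same core, so each merged state's goto can be taken from any one of its members.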

196 Shift/Reduce Conflict

We claimed that we cannot introduce a shift/reduce conflict during the shrinking process for the creation of the states of an LALR parser.
Assume that we could introduce a shift/reduce conflict. In this case, a state of the LALR parser must have:
A → α., a    and    B → β.aγ, b
This means that some state of the canonical LR(1) parser must have:
A → α., a    and    B → β.aγ, c
But this state also has a shift/reduce conflict, i.e. the original canonical LR(1) parser already had the conflict. (The reason: the shift operation does not depend on lookaheads.)

197 Reduce/Reduce Conflict

But we may introduce a reduce/reduce conflict during the shrinking process:

I1: A → α., a        merged:   I12: A → α., a/b
    B → β., b                       B → β., b/c   →  reduce/reduce conflict on b
I2: A → α., b
    B → β., c

198 Canonical LALR(1) Collection -- Example 2

S' → S    1) S → L = R    2) S → R    3) L → * R    4) L → id    5) R → L

I0:   S' → .S, $ ; S → .L=R, $ ; S → .R, $ ; L → .*R, $/= ; L → .id, $/= ; R → .L, $
I1:   S' → S., $                        (goto(I0,S))
I2:   S → L.=R, $ ; R → L., $           (goto(I0,L))
I3:   S → R., $                         (goto(I0,R))
I411: L → *.R, $/= ; R → .L, $/= ; L → .*R, $/= ; L → .id, $/=   (goto on *)
I512: L → id., $/=                      (goto on id)
I6:   S → L=.R, $ ; R → .L, $ ; L → .*R, $ ; L → .id, $          (goto(I2,=))
I713: L → *R., $/=                      (goto(I411,R))
I810: R → L., $/=                       (goto(I411,L) and goto(I6,L))
I9:   S → L=R., $                       (goto(I6,R))

Same cores merged: I4 and I11, I5 and I12, I7 and I13, I8 and I10.

199 LALR(1) Parsing Tables (for Example 2)

state    id      *       =     $       S    L     R
0        s512    s411                  1    2     3
1                              acc
2                        s6    r5
3                              r2
411      s512    s411                       810   713
512                      r4    r4
6        s512    s411                       810   9
713                      r3    r3
810                      r5    r5
9                              r1

There is no shift/reduce or reduce/reduce conflict, so it is an LALR(1) grammar.

200 Using Ambiguous Grammars

All grammars used in the construction of LR parsing tables must be unambiguous. Can we create LR parsing tables for ambiguous grammars? Yes, but they will have conflicts. We can resolve these conflicts in favor of one of the alternatives to disambiguate the grammar, and at the end we will again have an unambiguous grammar.
Why would we want to use an ambiguous grammar?
- Some ambiguous grammars are much more natural, and a corresponding unambiguous grammar can be very complex.
- Usage of an ambiguous grammar may eliminate unnecessary reductions.
Ex:
E → E+E | E*E | (E) | id     versus     E → E+T | T
                                        T → T*F | F
                                        F → (E) | id

201 Sets of LR(0) Items for the Ambiguous Grammar

I0: E' → .E, E → .E+E, E → .E*E, E → .(E), E → .id
I1: E' → E., E → E.+E, E → E.*E
I2: E → (.E), E → .E+E, E → .E*E, E → .(E), E → .id
I3: E → id.
I4: E → E+.E, E → .E+E, E → .E*E, E → .(E), E → .id
I5: E → E*.E, E → .E+E, E → .E*E, E → .(E), E → .id
I6: E → (E.), E → E.+E, E → E.*E
I7: E → E+E., E → E.+E, E → E.*E
I8: E → E*E., E → E.+E, E → E.*E
I9: E → (E).

202 SLR Parsing Tables for the Ambiguous Grammar

FOLLOW(E) = { $, +, *, ) }
State I7 has shift/reduce conflicts for the symbols + and * (reached by I0 --E--> I1 --+--> I4 --E--> I7):
- when the current token is +: shifting makes + right-associative; reducing makes + left-associative;
- when the current token is *: shifting gives * higher precedence than +; reducing gives + higher precedence than *.

203 SLR Parsing Tables for the Ambiguous Grammar

FOLLOW(E) = { $, +, *, ) }
State I8 has shift/reduce conflicts for the symbols + and * (reached by I0 --E--> I1 --*--> I5 --E--> I8):
- when the current token is *: shifting makes * right-associative; reducing makes * left-associative;
- when the current token is +: shifting gives + higher precedence than *; reducing gives * higher precedence than +.

204 SLR Parsing Tables for the Ambiguous Grammar

1) E → E+E    2) E → E*E    3) E → (E)    4) E → id
(conflicts resolved with + and * left-associative, and * of higher precedence than +)

              Action                    Goto
state    id    +    *    (    )    $    E
0        s3              s2             1
1              s4   s5             acc
2        s3              s2             6
3              r4   r4        r4   r4
4        s3              s2             7
5        s3              s2             8
6              s4   s5        s9
7              r1   s5        r1   r1
8              r2   r2        r2   r2
9              r3   r3        r3   r3

205 Error Recovery in LR Parsing

An LR parser detects an error when it consults the parsing action table and finds an error entry. All empty entries in the action table are error entries. Errors are never detected by consulting the goto table.
An LR parser announces an error as soon as there is no valid continuation for the scanned portion of the input.
A canonical LR parser (LR(1) parser) never makes even a single reduction before announcing an error. SLR and LALR parsers may make several reductions before announcing an error, but all LR parsers (LR(1), LALR and SLR) will never shift an erroneous input symbol onto the stack.

206 Panic-Mode Error Recovery in LR Parsing

Scan down the stack until a state s with a goto on a particular non-terminal A is found (and discard everything on the stack above this state s).
Discard zero or more input symbols until a symbol a is found that can legitimately follow A. The simplest choice for a is a symbol in FOLLOW(A), but this may not work for all situations.
The parser pushes the non-terminal A and the state goto[s,A] onto the stack, and resumes normal parsing.
This non-terminal A is normally a construct for a major program piece (there can be more than one choice for A): stmt, expr, block, ...

207 Phrase-Level Error Recovery in LR Parsing

Each empty entry in the action table is marked with a specific error routine. An error routine reflects the error that the user most likely made in that case.
An error routine inserts symbols into the stack or the input (or it deletes symbols from the stack and the input, or it can do both insertion and deletion). Typical cases:
- missing operand
- unbalanced right parenthesis

208 Syntax-Directed Translation

Grammar symbols are associated with attributes to associate information with the programming language constructs that they represent.
Values of these attributes are evaluated by the semantic rules associated with the production rules.
Evaluation of these semantic rules may:
- generate intermediate code
- put information into the symbol table
- perform type checking
- issue error messages
- perform some other activities
In fact, they may perform almost any activity.
An attribute may hold almost anything: a string, a number, a memory location, a complex record.

209 Syntax-Directed Definitions and Translation Schemes

When we associate semantic rules with productions, we use two notations: syntax-directed definitions and translation schemes.
Syntax-directed definitions:
- give high-level specifications for translations;
- hide many implementation details, such as the order of evaluation of semantic actions;
- we associate a production rule with a set of semantic rules, and we do not say when they will be evaluated.
Translation schemes:
- indicate the order of evaluation of the semantic actions associated with a production rule;
- in other words, translation schemes give some information about implementation details.

210 Syntax-Directed Definitions

A syntax-directed definition is a generalization of a context-free grammar in which:
- each grammar symbol is associated with a set of attributes; this set of attributes is partitioned into two subsets, called the synthesized and inherited attributes of that grammar symbol;
- each production rule is associated with a set of semantic rules.
Semantic rules set up dependencies between attributes, which can be represented by a dependency graph. This dependency graph determines the evaluation order of the semantic rules.
Evaluation of a semantic rule defines the value of an attribute, but a semantic rule may also have side effects, such as printing a value.

211 Annotated Parse Tree

A parse tree showing the values of the attributes at each node is called an annotated parse tree.
The process of computing the attribute values at the nodes is called annotating (or decorating) the parse tree.
Of course, the order of these computations depends on the dependency graph induced by the semantic rules.

212 Syntax-Directed Definition

In a syntax-directed definition, each production A → α is associated with a set of semantic rules of the form b = f(c1,c2,...,cn), where f is a function and b can be one of the following:
- b is a synthesized attribute of A, and c1,c2,...,cn are attributes of the grammar symbols in the production (A → α), OR
- b is an inherited attribute of one of the grammar symbols in α (on the right side of the production), and c1,c2,...,cn are attributes of the grammar symbols in the production (A → α).

213 Attribute Grammar

So, a semantic rule b = f(c1,c2,...,cn) indicates that the attribute b depends on the attributes c1,c2,...,cn.
In a syntax-directed definition, a semantic rule may just evaluate the value of an attribute, or it may have side effects, such as printing values.
An attribute grammar is a syntax-directed definition in which the functions in the semantic rules cannot have side effects (they can only evaluate values of attributes).

214 Syntax-Directed Definition -- Example

Production         Semantic Rules
L → E return       print(E.val)
E → E1 + T         E.val = E1.val + T.val
E → T              E.val = T.val
T → T1 * F         T.val = T1.val * F.val
T → F              T.val = F.val
F → ( E )          F.val = E.val
F → digit          F.val = digit.lexval

Symbols E, T, and F are associated with a synthesized attribute val. The token digit has a synthesized attribute lexval (it is assumed that it is evaluated by the lexical analyzer).

215 Annotated Parse Tree -- Example

Input: 5+3*4

L
├─ E.val = 17
│  ├─ E.val = 5
│  │  └─ T.val = 5
│  │     └─ F.val = 5
│  │        └─ digit.lexval = 5
│  ├─ +
│  └─ T.val = 12
│     ├─ T.val = 3
│     │  └─ F.val = 3
│     │     └─ digit.lexval = 3
│     ├─ *
│     └─ F.val = 4
│        └─ digit.lexval = 4
└─ return

216 Dependency Graph

Input: 5+3*4

digit.lexval=5 → F.val=5 → T.val=5 → E.val=5
digit.lexval=3 → F.val=3 → T.val=3
digit.lexval=4 → F.val=4
T.val=3 and F.val=4 → T.val=12
E.val=5 and T.val=12 → E.val=17 → L

217 Syntax-Directed Definition -- Example 2

Production      Semantic Rules
E → E1 + T      E.loc = newtemp(), E.code = E1.code || T.code || add E1.loc,T.loc,E.loc
E → T           E.loc = T.loc, E.code = T.code
T → T1 * F      T.loc = newtemp(), T.code = T1.code || F.code || mult T1.loc,F.loc,T.loc
T → F           T.loc = F.loc, T.code = F.code
F → ( E )       F.loc = E.loc, F.code = E.code
F → id          F.loc = id.name, F.code = ""

Symbols E, T, and F are associated with the synthesized attributes loc and code. The token id has a synthesized attribute name (it is assumed that it is evaluated by the lexical analyzer). It is assumed that || is the string concatenation operator.

218 Syntax-Directed Definition -- Inherited Attributes

Production      Semantic Rules
D → T L         L.in = T.type
T → int         T.type = integer
T → real        T.type = real
L → L1 id       L1.in = L.in, addtype(id.entry, L.in)
L → id          addtype(id.entry, L.in)

Symbol T is associated with a synthesized attribute type. Symbol L is associated with an inherited attribute in.

219 A Dependency Graph -- Inherited Attributes

Input: real p q

Parse tree: D → T L, T → real, L → L1 id (id.entry = q), L1 → id (id.entry = p).
Dependency graph: T.type = real flows into L.in = real; L.in flows into L1.in = real; L.in and id.entry = q feed addtype(q, real); L1.in and id.entry = p feed addtype(p, real).

220 S-Attributed Definitions

Syntax-directed definitions are used to specify syntax-directed translations. Creating a translator for an arbitrary syntax-directed definition can be difficult. We would like to evaluate the semantic rules during parsing (i.e. in a single pass: we parse and we also evaluate the semantic rules during the parsing).
We will look at two sub-classes of syntax-directed definitions:
- S-attributed definitions: only synthesized attributes are used in the syntax-directed definition.
- L-attributed definitions: in addition to synthesized attributes, inherited attributes may also be used, in a restricted fashion.
S-attributed and L-attributed definitions are easy to implement (we can evaluate the semantic rules in a single pass during parsing). Implementations of S-attributed definitions are a little easier than implementations of L-attributed definitions.

221 Bottom-Up Evaluation of S-Attributed Definitions

We put the values of the synthesized attributes of the grammar symbols into a parallel stack: when an entry of the parser stack holds a grammar symbol X (terminal or non-terminal), the corresponding entry in the parallel stack holds the synthesized attribute(s) of the symbol X.
We evaluate the values of the attributes during reductions.
For A → XYZ with A.a = f(X.x, Y.y, Z.z), where all attributes are synthesized: before the reduction, X.x, Y.y, Z.z occupy the top entries of the parallel stack (Z.z on top); after the reduction, they are replaced by A.a.

222 Bottom-Up Evaluation of S-Attributed Definitions (cont.)

Production       Semantic Rules
L → E return     print(val[top-1])
E → E1 + T       val[ntop] = val[top-2] + val[top]
E → T
T → T1 * F       val[ntop] = val[top-2] * val[top]
T → F
F → ( E )        val[ntop] = val[top-1]
F → digit

At each shift of digit, we also push digit.lexval onto the val-stack. At all other shifts, we do not put anything into the val-stack because the other terminals do not have attributes (but we increment the stack pointer for the val-stack).
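The mechanism behind these val[...] rules is a single pop-and-push on the parallel stack at each reduction. A minimal Python sketch (the helper name and the use of None for an empty attribute slot are assumptions, not notation from the slides):

def reduce_with_values(vals, rhs_len, fn):
    # pop the attribute slots of the handle, push the synthesized value
    args = vals[len(vals) - rhs_len:]
    del vals[len(vals) - rhs_len:]
    vals.append(fn(*args))

vals = [5, None, 7]          # slots for E, '+' (empty slot), T
reduce_with_values(vals, 3, lambda e, plus, t: e + t)   # E -> E + T
print(vals)                  # [12]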

223 Canonical LR(0) Collection for the Grammar L → E r, E → E+T | T, T → T*F | F, F → (E) | d

I0:  L' → .L, L → .Er, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .d
I1:  L' → L.                                   (goto(I0,L))
I2:  L → E.r, E → E.+T                         (goto(I0,E))
I3:  E → T., T → T.*F                          (goto(I0,T))
I4:  T → F.                                    (goto(I0,F))
I5:  F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .d   (goto(I0,())
I6:  F → d.                                    (goto(I0,d))
I7:  L → Er.                                   (goto(I2,r))
I8:  E → E+.T, T → .T*F, T → .F, F → .(E), F → .d    (goto(I2,+))
I9:  T → T*.F, F → .(E), F → .d                (goto(I3,*))
I10: F → (E.), E → E.+T                        (goto(I5,E))
I11: E → E+T., T → T.*F                        (goto(I8,T))
I12: T → T*F.                                  (goto(I9,F))
I13: F → (E).                                  (goto(I10,)))

224 Bottom-Up Evaluation -- Example

At each shift of a digit (d), we also push d.lexval onto the val-stack.

stack           val-stack   input     action            semantic rule
0                           5+3*4r$   shift 6           push 5 (d.lexval) onto val-stack
0d6             5           +3*4r$    reduce by F → d   F.val = d.lexval (do nothing)
0F4             5           +3*4r$    reduce by T → F   T.val = F.val (do nothing)
0T3             5           +3*4r$    reduce by E → T   E.val = T.val (do nothing)
0E2             5           +3*4r$    shift 8           push an empty slot onto val-stack
0E2+8           5 -         3*4r$     shift 6           push 3 (d.lexval) onto val-stack
0E2+8d6         5 - 3       *4r$      reduce by F → d
0E2+8F4         5 - 3       *4r$      reduce by T → F
0E2+8T11        5 - 3       *4r$      shift 9           push an empty slot onto val-stack
0E2+8T11*9      5 - 3 -     4r$       shift 6           push 4 (d.lexval) onto val-stack
0E2+8T11*9d6    5 - 3 - 4   r$        reduce by F → d
0E2+8T11*9F12   5 - 3 - 4   r$        reduce by T → T*F T.val = T1.val * F.val
0E2+8T11        5 - 12      r$        reduce by E → E+T E.val = E1.val + T.val
0E2             17          r$        shift 7           push an empty slot onto val-stack
0E2r7           17 -        $         reduce by L → Er  print(17), pop the empty slot from val-stack
0L1             17          $         accept

225 Top-Down Evaluation (of S-Attributed Definitions)

Productions      Semantic Rules
A → B            print(B.n0), print(B.n1)
B → 0 B1         B.n0 = B1.n0 + 1, B.n1 = B1.n1
B → 1 B1         B.n0 = B1.n0, B.n1 = B1.n1 + 1
B → ε            B.n0 = 0, B.n1 = 0

where B has two synthesized attributes (n0 and n1).

226 Top-Down Evaluation (of S-Attributed Definitions)

Remember that in a recursive predictive parser, each non-terminal corresponds to a procedure:

procedure A() {
  call B();                                           /* A → B */
}
procedure B() {
  if (currtoken == 0) { consume 0; call B(); }        /* B → 0 B */
  else if (currtoken == 1) { consume 1; call B(); }   /* B → 1 B */
  else if (currtoken == $) { }                        /* B → ε; $ is the end-marker */
  else error("unexpected token");
}

227 Top-Down Evaluation (of S-Attributed Definitions)

procedure A() {
  int n0, n1;
  call B(&n0, &n1);      /* the synthesized attributes of B are the output parameters of procedure B */
  print(n0); print(n1);
}
procedure B(int *n0, int *n1) {
  if (currtoken == 0) { int a, b; consume 0; call B(&a, &b); *n0 = a+1; *n1 = b; }
  else if (currtoken == 1) { int a, b; consume 1; call B(&a, &b); *n0 = a; *n1 = b+1; }
  else if (currtoken == $) { *n0 = 0; *n1 = 0; }      /* $ is the end-marker */
  else error("unexpected token");
}

All the semantic rules can be evaluated at the end of the parsing of the production rules.

228 L-Attributed Definitions

S-attributed definitions can be efficiently implemented. We are looking for a larger subset of syntax-directed definitions (larger than S-attributed definitions) which can still be efficiently evaluated: L-attributed definitions.
L-attributed definitions can always be evaluated by a depth-first visit of the parse tree. This means that they can also be evaluated during parsing.

229 L-Attributed Definitions

A syntax-directed definition is L-attributed if each inherited attribute of Xj, where 1 ≤ j ≤ n, on the right side of A → X1 X2 ... Xn depends only on:
1. the attributes of the symbols X1,...,Xj-1 to the left of Xj in the production, and
2. the inherited attributes of A.
Every S-attributed definition is L-attributed; the restrictions apply only to the inherited attributes (not to the synthesized attributes).

230 A Definition which is NOT L-Attributed

Productions     Semantic Rules
A → L M         L.in = l(A.i), M.in = m(L.s), A.s = f(M.s)
A → Q R         R.in = r(A.in), Q.in = q(R.s), A.s = f(Q.s)

This syntax-directed definition is not L-attributed because the semantic rule Q.in = q(R.s) violates the restrictions of L-attributed definitions: Q.in must be evaluated before we enter Q, because it is an inherited attribute, but the value of Q.in depends on R.s, which will be available only after we return from R. So we are not able to evaluate the value of Q.in before we enter Q.

231 Translation Schemes

In a syntax-directed definition, we do not say anything about the evaluation times of the semantic rules (i.e. when the semantic rules associated with a production should be evaluated).
A translation scheme is a context-free grammar in which:
- attributes are associated with the grammar symbols, and
- semantic actions, enclosed between braces {}, are inserted within the right sides of productions.
Ex: A → { ... } X { ... } Y { ... }     (the braces hold semantic actions)

232 Translation Schemes

When designing a translation scheme, some restrictions should be observed to ensure that an attribute value is available when a semantic action refers to it. These restrictions (motivated by L-attributed definitions) ensure that a semantic action does not refer to an attribute that has not yet been computed.
In translation schemes, we use the semantic-action terminology instead of the semantic-rule terminology used in syntax-directed definitions.
The position of a semantic action on the right side indicates when that semantic action will be evaluated.

233 Translation Schemes for S-Attributed Definitions

If our syntax-directed definition is S-attributed, the construction of the corresponding translation scheme is simple: each semantic rule in an S-attributed syntax-directed definition is inserted as a semantic action at the end of the right side of the associated production.

Production      Semantic Rule
E → E1 + T      E.val = E1.val + T.val              (a production of a syntax-directed definition)
E → E1 + T { E.val = E1.val + T.val }               (the production of the corresponding translation scheme)

234 A Translation Scheme Example

A simple translation scheme that converts infix expressions to the corresponding postfix expressions:

E → T R
R → + T { print('+') } R1
R → ε
T → id { print(id.name) }

a+b+c → ab+c+
(infix expression → postfix expression)
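A recursive-descent rendering of this scheme in Python, executing each print action at its position in the production; the single-character tokenization is an assumption made for this illustration:

def parse_E(toks):               # E -> T R
    parse_T(toks)
    parse_R(toks)

def parse_R(toks):               # R -> + T {print('+')} R | eps
    if toks and toks[0] == '+':
        toks.pop(0)
        parse_T(toks)
        print('+', end='')       # the embedded action, executed after T
        parse_R(toks)

def parse_T(toks):               # T -> id {print(id.name)}
    print(toks.pop(0), end='')

parse_E(list('a+b+c'))           # prints ab+c+
print()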

235 A Translation Scheme Example (cont.)

The parse tree of a+b+c:

E
├─ T → id(a) { print("a") }
└─ R → + T { print("+") } R
        ├─ T → id(b) { print("b") }
        └─ R → + T { print("+") } R
                ├─ T → id(c) { print("c") }
                └─ R → ε

The depth-first traversal of the parse tree (executing the semantic actions in that order) produces the postfix representation of the infix expression.

236 Inherited Attributes in Translation Schemes

If a translation scheme has to contain both synthesized and inherited attributes, we have to observe the following rules:
1. An inherited attribute of a symbol on the right side of a production must be computed in a semantic action before that symbol.
2. A semantic action must not refer to a synthesized attribute of a symbol to the right of that semantic action.
3. A synthesized attribute for the non-terminal on the left can only be computed after all the attributes it references have been computed (we normally put this semantic action at the end of the right side of the production).
With an L-attributed syntax-directed definition, it is always possible to construct a corresponding translation scheme which satisfies these three conditions (this may not be possible for a general syntax-directed translation).

237 Top-Down Translation

We will look at the implementation of L-attributed definitions during predictive parsing. Instead of syntax-directed definitions, we will work with translation schemes.
We will see how to evaluate inherited attributes (in L-attributed definitions) during recursive predictive parsing.
We will also look at what happens to attributes during left-recursion elimination in left-recursive grammars.

238 A Translation Scheme with Inherited Attributes

D → T id { addtype(id.entry, T.type), L.in = T.type } L
T → int { T.type = integer }
T → real { T.type = real }
L → id { addtype(id.entry, L.in), L1.in = L.in } L1
L → ε

This is a translation scheme for an L-attributed definition.

239 Predictive Parsing (of Inherited Attributes)

procedure D() {
  int Ttype, Lin, identry;
  call T(&Ttype);
  consume(id, &identry);
  addtype(identry, Ttype);
  Lin = Ttype;
  call L(Lin);
}
procedure T(int *Ttype) {        /* Ttype: a synthesized attribute (an output parameter) */
  if (currtoken is int) { consume(int); *Ttype = TYPEINT; }
  else if (currtoken is real) { consume(real); *Ttype = TYPEREAL; }
  else { error("unexpected type"); }
}
procedure L(int Lin) {           /* Lin: an inherited attribute (an input parameter) */
  if (currtoken is id) {
    int L1in, identry;
    consume(id, &identry);
    addtype(identry, Lin);
    L1in = Lin;
    call L(L1in);
  }
  else if (currtoken is endmarker) { }
  else { error("unexpected token"); }
}

240 Eliminating Left Recursion from a Translation Scheme

A translation scheme with a left-recursive grammar:

E → E1 + T { E.val = E1.val + T.val }
E → E1 - T { E.val = E1.val - T.val }
E → T { E.val = T.val }
T → T1 * F { T.val = T1.val * F.val }
T → F { T.val = F.val }
F → ( E ) { F.val = E.val }
F → digit { F.val = digit.lexval }

When we eliminate the left recursion from the grammar (to get a grammar suitable for top-down parsing), we also have to change the semantic actions.

241 Eliminating Left Recursion (cont.)

(A.in and B.in are inherited attributes; A.syn and B.syn are synthesized attributes)

E → T { A.in = T.val } A { E.val = A.syn }
A → + T { A1.in = A.in + T.val } A1 { A.syn = A1.syn }
A → - T { A1.in = A.in - T.val } A1 { A.syn = A1.syn }
A → ε { A.syn = A.in }
T → F { B.in = F.val } B { T.val = B.syn }
B → * F { B1.in = B.in * F.val } B1 { B.syn = B1.syn }
B → ε { B.syn = B.in }
F → ( E ) { F.val = E.val }
F → digit { F.val = digit.lexval }

242 Eliminating Left Recursion (in general)

A → A1 Y { A.a = g(A1.a, Y.y) }      a left-recursive grammar with
A → X { A.a = f(X.x) }               synthesized attributes (a, y, x)

↓ eliminate left recursion

(R.in is the inherited attribute of the new non-terminal; R.syn is its synthesized attribute)

A → X { R.in = f(X.x) } R { A.a = R.syn }
R → Y { R1.in = g(R.in, Y.y) } R1 { R.syn = R1.syn }
R → ε { R.syn = R.in }

243 Evaluating Attributes

Parse tree of the left-recursive grammar (input XY):
A  (A.a = g(f(X.x), Y.y))
├─ A  (A.a = f(X.x))
│  └─ X  (X.x)
└─ Y  (Y.y)

Parse tree of the non-left-recursive grammar:
A
├─ X  (X.x)
└─ R  (R.in = f(X.x); R.syn = g(f(X.x), Y.y))
   ├─ Y  (Y.y)
   └─ R1  (R1.in = g(f(X.x), Y.y); R1.syn = g(f(X.x), Y.y))

244 Translation Scheme -- Intermediate Code Generation

E → T { A.in = T.loc } A { E.loc = A.loc }
A → + T { A1.in = newtemp(); emit(add, A.in, T.loc, A1.in) } A1 { A.loc = A1.loc }
A → ε { A.loc = A.in }
T → F { B.in = F.loc } B { T.loc = B.loc }
B → * F { B1.in = newtemp(); emit(mult, B.in, F.loc, B1.in) } B1 { B.loc = B1.loc }
B → ε { B.loc = B.in }
F → ( E ) { F.loc = E.loc }
F → id { F.loc = id.name }

245 Predictive Parsing -- Intermediate Code Generation

procedure E(char **Eloc) {
  char *Ain, *Tloc, *Aloc;
  call T(&Tloc); Ain = Tloc;
  call A(Ain, &Aloc); *Eloc = Aloc;
}
procedure A(char *Ain, char **Aloc) {
  if (currtok is +) {
    char *A1in, *Tloc, *A1loc;
    consume(+); call T(&Tloc);
    A1in = newtemp(); emit("add", Ain, Tloc, A1in);
    call A(A1in, &A1loc); *Aloc = A1loc;
  }
  else { *Aloc = Ain; }
}

246 Predictive Parsing (cont.)

procedure T(char **Tloc) {
  char *Bin, *Floc, *Bloc;
  call F(&Floc); Bin = Floc;
  call B(Bin, &Bloc); *Tloc = Bloc;
}
procedure B(char *Bin, char **Bloc) {
  if (currtok is *) {
    char *B1in, *Floc, *B1loc;
    consume(*); call F(&Floc);
    B1in = newtemp(); emit("mult", Bin, Floc, B1in);
    call B(B1in, &B1loc); *Bloc = B1loc;
  }
  else { *Bloc = Bin; }
}
procedure F(char **Floc) {
  if (currtok is "(") {
    char *Eloc;
    consume("("); call E(&Eloc); consume(")");
    *Floc = Eloc;
  }
  else {
    char *idname;
    consume(id, &idname);
    *Floc = idname;
  }
}

247 Bottom-Up Evaluation of Inherited Attributes

Using a top-down translation scheme, we can implement any L-attributed definition based on an LL(1) grammar.
Using a bottom-up translation scheme, we can also implement any L-attributed definition based on an LL(1) grammar (every LL(1) grammar is also an LR(1) grammar).
In addition to the L-attributed definitions based on LL(1) grammars, we can implement some of the L-attributed definitions based on LR(1) grammars (but not all of them) using a bottom-up translation scheme.

248 Removing Embedded Semantic Actions

In a bottom-up evaluation scheme, the semantic actions are evaluated during reductions.
During the bottom-up evaluation of S-attributed definitions, we have a parallel stack to hold synthesized attributes. Problem: where are we going to hold inherited attributes?
A solution: we convert our grammar into an equivalent grammar which guarantees the following:
- all embedded semantic actions in our translation scheme are moved to the ends of the production rules;
- all inherited attributes are copied into synthesized attributes (most of the time, synthesized attributes of new non-terminals).
Thus we evaluate all semantic actions during reductions, and we find a place to store the inherited attributes.

249 Removing Embedded Semantic Actions

To transform our translation scheme into an equivalent translation scheme:
1. Remove each embedded semantic action Si and put a new non-terminal Mi in its place.
2. Put that semantic action Si at the end of a new production rule Mi → ε for that non-terminal Mi.
3. The semantic action Si will be evaluated when this new production rule is reduced.
4. The evaluation order of the semantic rules is not changed by this transformation.

250 Removing Embedded Semantic Actions

A → {S1} X1 {S2} X2 ... {Sn} Xn

↓ remove embedded semantic actions

A → M1 X1 M2 X2 ... Mn Xn
M1 → ε {S1}
M2 → ε {S2}
...
Mn → ε {Sn}

251 Removing Embedded Semantic Actions

E → T R
R → + T { print('+') } R1
R → ε
T → id { print(id.name) }

↓ remove embedded semantic actions

E → T R
R → + T M R1
R → ε
T → id { print(id.name) }
M → ε { print('+') }

252 Translation with Inherited Attributes

Let us assume that every non-terminal A has an inherited attribute A.i, and every symbol X has a synthesized attribute X.s in our grammar.
For every production rule A → X1 X2 ... Xn, introduce new marker non-terminals M1, M2, ..., Mn and replace this production rule with A → M1 X1 M2 X2 ... Mn Xn:
- the synthesized attribute of Xi is not changed;
- the inherited attribute of Xi is copied into the synthesized attribute of Mi by the new semantic action added at the end of the new production rule Mi → ε;
- now the inherited attribute of Xi can be found in the synthesized attribute of Mi (which is immediately available in the stack).

A → {B.i = f1(...)} B {C.i = f2(...)} C {A.s = f3(...)}
↓
A → {M1.i = f1(...)} M1 {B.i = M1.s} B {M2.i = f2(...)} M2 {C.i = M2.s} C {A.s = f3(...)}
M1 → ε {M1.s = M1.i}
M2 → ε {M2.s = M2.i}

253 Translation with Inherited Attributes

S → {A.i = 1} A {S.s = k(A.i, A.s)}
A → {B.i = f(A.i)} B {C.i = g(A.i, B.i, B.s)} C {A.s = h(A.i, B.i, B.s, C.i, C.s)}
B → b {B.s = m(B.i, b.s)}
C → c {C.s = n(C.i, c.s)}

↓

S → {M1.i = 1} M1 {A.i = M1.s} A {S.s = k(M1.s, A.s)}
A → {M2.i = f(A.i)} M2 {B.i = M2.s} B {M3.i = g(A.i, M2.s, B.s)} M3 {C.i = M3.s} C {A.s = h(A.i, M2.s, B.s, M3.s, C.s)}
B → b {B.s = m(B.i, b.s)}
C → c {C.s = n(C.i, c.s)}
M1 → ε {M1.s = M1.i}
M2 → ε {M2.s = M2.i}
M3 → ε {M3.s = M3.i}

254 Actual Translation Scheme

The translation scheme of the previous slide, expressed with the parallel s-attribute stack:

S → M1 A { s[ntop] = k(s[top-1], s[top]) }
M1 → ε { s[ntop] = 1 }
A → M2 B M3 C { s[ntop] = h(s[top-4], s[top-3], s[top-2], s[top-1], s[top]) }
M2 → ε { s[ntop] = f(s[top]) }
M3 → ε { s[ntop] = g(s[top-2], s[top-1], s[top]) }
B → b { s[ntop] = m(s[top-1], s[top]) }
C → c { s[ntop] = n(s[top-1], s[top]) }

255 Evaluation of Attributes

S  (S.s = k(1, h(..)))
└─ A  (A.i = 1; A.s = h(1, f(1), m(..), g(..), n(..)))
   ├─ B  (B.i = f(1); B.s = m(f(1), b.s))
   │  └─ b
   └─ C  (C.i = g(1, f(1), m(..)); C.s = n(g(..), c.s))
      └─ c

256 Evaluation of Attributes
stack           | input | s-attribute stack
                | bc$   |
M1              | bc$   | 1
M1 M2           | bc$   | 1 f(1)
M1 M2 b         | c$    | 1 f(1) b.s
M1 M2 B         | c$    | 1 f(1) m(f(1),b.s)
M1 M2 B M3      | c$    | 1 f(1) m(f(1),b.s) g(1,f(1),m(f(1),b.s))
M1 M2 B M3 c    | $     | 1 f(1) m(f(1),b.s) g(1,f(1),m(f(1),b.s)) c.s
M1 M2 B M3 C    | $     | 1 f(1) m(f(1),b.s) g(1,f(1),m(f(1),b.s)) n(g(..),c.s)
M1 A            | $     | 1 h(f(1),m(..),g(..),n(..))
S               | $     | k(1,h(..))
6/30/2010 Principles of Compiler Design R.Venkadeshan 256

257 Problems Not all L-attributed definitions based on LR grammars can be evaluated during bottom-up parsing.
S → { L.i=0 } L
L → { L1.i=L.i+1 } L1 1
L → ε { print(L.i) }
This translation scheme cannot be implemented during bottom-up parsing. Transformed:
S → M1 L
L → M2 L1 1
L → ε { print(s[top]) }
M1 → ε { s[ntop]=0 }
M2 → ε { s[ntop]=s[top]+1 }
But since L → ε will be reduced first by the bottom-up parser, the translator cannot know the number of 1s.
6/30/2010 Principles of Compiler Design R.Venkadeshan 257

258 Problems The modified grammar is not an LR grammar anymore:
L → L b        L → M L b
L → a     ⇒    L → a          NOT an LR grammar
               M → ε
The item set containing S' → .L, $ / L → .M L b, $ / L → .a, $ / M → ., a has a shift/reduce conflict (shift a versus reduce M → ε on a).
6/30/2010 Principles of Compiler Design R.Venkadeshan 258

259 Type Checking A compiler has to do semantic checks in addition to syntactic checks. Semantic checks: static (done during compilation) and dynamic (done during run-time). Type checking is one of these static checking operations; we may not do all type checking at compile-time, and some systems use dynamic type checking too. A type system is a collection of rules for assigning type expressions to the parts of a program. A type checker implements a type system. A sound type system eliminates run-time checking for type errors. A programming language is strongly-typed if every program its compiler accepts will execute without type errors. In practice, some type checking operations are done at run-time (so most programming languages are not strongly-typed). Ex: int x[100]; x[i] - most compilers cannot guarantee that i will be between 0 and 99. 6/30/2010 Principles of Compiler Design R.Venkadeshan 259
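For the array example, the only safe fallback is a run-time check. A minimal C sketch (the checked_get helper is hypothetical, not part of any standard API):

#include <stdio.h>
#include <stdlib.h>

/* The value of i is only known at run-time, so the bound must be
 * checked at run-time; a static type checker cannot rule this out. */
static int checked_get(const int *x, int n, int i) {
    if (i < 0 || i >= n) {
        fprintf(stderr, "index %d out of range [0,%d)\n", i, n);
        exit(EXIT_FAILURE);
    }
    return x[i];
}

int main(void) {
    int x[100] = {0};
    int i;
    if (scanf("%d", &i) == 1)
        printf("%d\n", checked_get(x, 100, i));
    return 0;
}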

260 Type Expression The type of a language construct is denoted by a type expression. A type expression can be: A basic type - a primitive data type such as integer, real, char, boolean; type-error to signal a type error; void for no type. A type name - a name can be used to denote a type expression. A type constructor applied to other type expressions: arrays - if T is a type expression, then array(I,T) is a type expression where I denotes an index range, e.g. array(0..99,int); products - if T1 and T2 are type expressions, then their cartesian product T1 x T2 is a type expression, e.g. int x int; pointers - if T is a type expression, then pointer(T) is a type expression, e.g. pointer(int); functions - we may treat functions in a programming language as a mapping from a domain type D to a range type R, so the type of a function can be denoted by the type expression D → R where D and R are type expressions, e.g. int → int represents the type of a function which takes an int value as parameter and whose return type is also int. 6/30/2010 Principles of Compiler Design R.Venkadeshan 260
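One possible way to represent such type expressions in an implementation is a tagged union; the following C sketch (names and layout are our own, not fixed by the slides) provides constructors mirroring array(...), pointer(...), T1 x T2 and D → R:

#include <stdlib.h>

typedef enum { T_INT, T_REAL, T_CHAR, T_BOOL, T_ERROR,
               T_ARRAY, T_PRODUCT, T_POINTER, T_FUNCTION } Tag;

typedef struct Type {
    Tag tag;
    int low, high;            /* index range, used only by T_ARRAY   */
    struct Type *t1, *t2;     /* children of constructed types       */
} Type;

static Type *mktype(Tag tag, Type *t1, Type *t2) {
    Type *t = malloc(sizeof *t);
    t->tag = tag; t->t1 = t1; t->t2 = t2; t->low = t->high = 0;
    return t;
}

/* array(low..high, T) */
static Type *array_of(int low, int high, Type *elem) {
    Type *t = mktype(T_ARRAY, elem, NULL);
    t->low = low; t->high = high;
    return t;
}

static Type *pointer_to(Type *t)        { return mktype(T_POINTER, t, NULL); }
static Type *product(Type *a, Type *b)  { return mktype(T_PRODUCT, a, b); }   /* T1 x T2 */
static Type *function(Type *d, Type *r) { return mktype(T_FUNCTION, d, r); }  /* D -> R  */

For instance, array_of(0, 99, mktype(T_INT, NULL, NULL)) builds array(0..99, int).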

261 A Simple Type Checking System
P → D ; E
D → D ; D
D → id : T { addtype(id.entry, T.type) }
T → char { T.type=char }
T → int { T.type=int }
T → real { T.type=real }
T → ↑ T1 { T.type=pointer(T1.type) }
T → array [ intnum ] of T1 { T.type=array(1..intnum.val, T1.type) }
6/30/2010 Principles of Compiler Design R.Venkadeshan 261

262 Type Checking of Expressions
E → id { E.type=lookup(id.entry) }
E → charliteral { E.type=char }
E → intliteral { E.type=int }
E → realliteral { E.type=real }
E → E1 + E2 { if (E1.type=int and E2.type=int) then E.type=int
              else if (E1.type=int and E2.type=real) then E.type=real
              else if (E1.type=real and E2.type=int) then E.type=real
              else if (E1.type=real and E2.type=real) then E.type=real
              else E.type=type-error }
E → E1 [ E2 ] { if (E2.type=int and E1.type=array(s,t)) then E.type=t else E.type=type-error }
E → E1 ↑ { if (E1.type=pointer(t)) then E.type=t else E.type=type-error }
6/30/2010 Principles of Compiler Design R.Venkadeshan 262
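The rule for E → E1 + E2 is easy to mirror in code; a small C sketch of just that case:

#include <stdio.h>

typedef enum { T_INT, T_REAL, T_ERROR } Tag;

/* E -> E1 + E2 under the rules above: int+int is int, any int/real
 * mix is real, everything else is a type error. */
static Tag check_plus(Tag e1, Tag e2) {
    if (e1 == T_INT && e2 == T_INT) return T_INT;
    if ((e1 == T_INT || e1 == T_REAL) && (e2 == T_INT || e2 == T_REAL))
        return T_REAL;
    return T_ERROR;
}

int main(void) {
    printf("%d %d\n", check_plus(T_INT, T_INT), check_plus(T_INT, T_REAL));
    return 0;
}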

263 Type Checking of Statements
S → id = E { if (id.type=E.type) then S.type=void else S.type=type-error }
S → if E then S1 { if (E.type=boolean) then S.type=S1.type else S.type=type-error }
S → while E do S1 { if (E.type=boolean) then S.type=S1.type else S.type=type-error }
6/30/2010 Principles of Compiler Design R.Venkadeshan 263

264 Type Checking of Functions
E → E1 ( E2 ) { if (E2.type=s and E1.type=s→t) then E.type=t else E.type=type-error }
Ex: int f(double x, char y) { ... } has type f: double x char → int (argument types → return type).
6/30/2010 Principles of Compiler Design R.Venkadeshan 264

265 Structural Equivalence of Type Expressions How do we know that two type expressions are equal? As long as type expressions are built from basic types (no type names), we may use structural equivalence between two type expressions. Structural Equivalence Algorithm (sequiv):
if (s and t are the same basic type) then return true
else if (s=array(s1,s2) and t=array(t1,t2)) then return (sequiv(s1,t1) and sequiv(s2,t2))
else if (s=s1 x s2 and t=t1 x t2) then return (sequiv(s1,t1) and sequiv(s2,t2))
else if (s=pointer(s1) and t=pointer(t1)) then return sequiv(s1,t1)
else if (s=s1→s2 and t=t1→t2) then return (sequiv(s1,t1) and sequiv(s2,t2))
else return false
6/30/2010 Principles of Compiler Design R.Venkadeshan 265
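A C sketch of sequiv over a tagged-union representation like the one sketched earlier (repeated here so the fragment stands alone; field names are our own):

#include <stdbool.h>
#include <stddef.h>

typedef enum { T_INT, T_REAL, T_CHAR, T_BOOL,
               T_ARRAY, T_PRODUCT, T_POINTER, T_FUNCTION } Tag;

typedef struct Type {
    Tag tag;
    int low, high;            /* index range for T_ARRAY       */
    struct Type *t1, *t2;     /* children of constructed types */
} Type;

static bool sequiv(const Type *s, const Type *t) {
    if (s->tag != t->tag)
        return false;
    switch (s->tag) {
    case T_INT: case T_REAL: case T_CHAR: case T_BOOL:
        return true;                               /* same basic type */
    case T_ARRAY:                                  /* same index range and element type */
        return s->low == t->low && s->high == t->high && sequiv(s->t1, t->t1);
    case T_POINTER:
        return sequiv(s->t1, t->t1);
    case T_PRODUCT: case T_FUNCTION:               /* compare both components */
        return sequiv(s->t1, t->t1) && sequiv(s->t2, t->t2);
    }
    return false;
}

int main(void) {
    Type i1 = { T_INT, 0, 0, NULL, NULL }, i2 = { T_INT, 0, 0, NULL, NULL };
    Type a1 = { T_ARRAY, 0, 99, &i1, NULL }, a2 = { T_ARRAY, 0, 99, &i2, NULL };
    return sequiv(&a1, &a2) ? 0 : 1;               /* structurally equal */
}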

266 Names for Type Expressions In some programming languages, we give a name to a type expression, and we use that name as a type expression afterwards. type link = ↑cell; var p,q : link; var r,s : ↑cell - do p, q, r, s have the same types? How do we treat type names? Get the equivalent type expression for a type name (then use structural equivalence), or treat a type name as a basic type. 6/30/2010 Principles of Compiler Design R.Venkadeshan 266

267 Cycles in Type Expressions type link = ↑cell; type cell = record x : int, next : link end; We cannot use structural equivalence if there are cycles in type expressions. We have to treat type names as basic types, but this means that the type expression link is different from the type expression ↑cell. 6/30/2010 Principles of Compiler Design R.Venkadeshan 267

268 Type Conversions x + y - what is the type of this expression (int or double)? What kind of code do we have to produce if the type of x is double and the type of y is int?
inttoreal y,,t1
real+ t1,x,t2
6/30/2010 Principles of Compiler Design R.Venkadeshan 268

269 Intermediate Code Generation 6/30/2010 Principles of Compiler Design R.Venkadeshan 269

270 Outline Variants of Syntax Trees Three-address code Types and declarations Translation of expressions Type checking Control flow Backpatching 6/30/2010 Principles of Compiler Design R.Venkadeshan 270

271 Introduction Intermediate code is the interface between the front end and the back end in a compiler. Ideally the details of the source language are confined to the front end and the details of the target machine to the back end: with an intermediate representation, m source languages and n target machines can be covered by m front ends and n back ends instead of m*n separate compilers. In this chapter we study intermediate representations, static type checking and intermediate code generation. Parser → Static Checker → Intermediate Code Generator (front end) → Code Generator (back end). 6/30/2010 Principles of Compiler Design R.Venkadeshan 271

272 Intermediate Code Generation Translating the source program into an intermediate language that is simple and CPU-independent, yet close in spirit to machine language (and this completes the front end of compilation). Benefits: 1. Retargeting is facilitated. 2. Machine-independent code optimization can be applied. 6/30/2010 Principles of Compiler Design R.Venkadeshan 272

273 Intermediate Code Generation Intermediate codes are machine independent codes, but they are close to machine instructions. The given program in a source language is converted to an equivalent program in an intermediate language by the intermediate code generator. The intermediate language can be many different languages, and the designer of the compiler decides this intermediate language: syntax trees can be used as an intermediate language; postfix notation can be used as an intermediate language; three-address code (quadruples) can be used as an intermediate language. 6/30/2010 Principles of Compiler Design R.Venkadeshan 273

274 Intermediate Languages Graphical Representations. Consider the assignment a:=b*-c+b*-c 6/30/2010 Principles of Compiler Design R.Venkadeshan 274

275 Intermediate Languages Graphical Representations. Consider the assignment a:=b*-c+b*-c. (Syntax tree: assign at the root with children a and +; the + node has two * children, each computing b * (uminus c).) 6/30/2010 Principles of Compiler Design R.Venkadeshan 275

276 Intermediate Languages Graphical Representations. Consider the assignment a:=b*-c+b*-c. (Syntax tree as on the previous slide; in the DAG, the subexpression b * (uminus c) is built once and the single * node is shared by both operands of +.) 6/30/2010 Principles of Compiler Design R.Venkadeshan 276

277 Syntax Dir. Definition for Assignment Statements The tree-building functions are: mknode(op,left,right), mkunode(op,child), and mkleaf(id,id.place). 6/30/2010 Principles of Compiler Design R.Venkadeshan 277

278 Syntax Dir. Definition for Assignment Statements
PRODUCTION        Semantic Rule
S → id := E       { S.nptr = mknode('assign', mkleaf(id, id.place), E.nptr) }
E → E1 + E2       { E.nptr = mknode('+', E1.nptr, E2.nptr) }
E → E1 * E2       { E.nptr = mknode('*', E1.nptr, E2.nptr) }
E → - E1          { E.nptr = mkunode('uminus', E1.nptr) }
E → ( E1 )        { E.nptr = E1.nptr }
E → id            { E.nptr = mkleaf(id, id.place) }
6/30/2010 Principles of Compiler Design R.Venkadeshan 278

279 Two Representations of the Syntax Tree 6/30/2010 Principles of Compiler Design R.Venkadeshan 279

280 Three Address Code Statements of general form x:=y op z No built-up arithmetic expressions are allowed. As a result, x:=y + z * w should be represented as t 1 :=z * w t 2 :=y + t 1 x:=t 2 6/30/2010 Principles of Compiler Design R.Venkadeshan 280

281 Example of 3-address code For a := b*-c + b*-c (following the syntax tree):
t1 := -c
t2 := b * t1
t3 := -c
t4 := b * t3
t5 := t2 + t4
a := t5
6/30/2010 Principles of Compiler Design R.Venkadeshan 281

282 Types of Intermediate Languages Graphical Representations. For the same assignment a:=b*-c+b*-c, following the DAG the common subexpression is computed only once:
t1 := -c
t2 := b * t1
t5 := t2 + t2
a := t5
6/30/2010 Principles of Compiler Design R.Venkadeshan 282

283 Types of Three-Address Statements. Assignment Statement: x:=y op z Assignment Instructions: x:=op z Copy Statement: x:=z Unconditional Jump: goto L Conditional Jump: if x relop y goto L Procedure: param x 1 param x 2 param x n call p,n Index Assignments: x:=y[i] x[i]:=y Address and Pointer Assignments: x:=&y x:=*y *x:=y 6/30/2010 Principles of Compiler Design R.Venkadeshan 283

284 Syntax-Directed Translation into 3-address code. First deal with assignments. Use attributes: E.place - the name that will hold the value of E (identifiers will be assumed to already have the place attribute defined); E.code - holds the three-address code statements that evaluate E (this is the 'translation' attribute). Use the function newtemp that returns a new temporary variable that we can use. Use the function gen to generate a single three-address statement given the necessary information (variable names and operations). 6/30/2010 Principles of Compiler Design R.Venkadeshan 284

285 Syntax-Dir. Definition for 3-address code
PRODUCTION        Semantic Rule
S → id := E       { S.code = E.code gen(id.place ':=' E.place) }
E → E1 + E2       { E.place = newtemp; E.code = E1.code E2.code gen(E.place ':=' E1.place '+' E2.place) }
E → E1 * E2       { E.place = newtemp; E.code = E1.code E2.code gen(E.place ':=' E1.place '*' E2.place) }
E → - E1          { E.place = newtemp; E.code = E1.code gen(E.place ':=' 'uminus' E1.place) }
E → ( E1 )        { E.place = E1.place; E.code = E1.code }
E → id            { E.place = id.place; E.code = '' }
6/30/2010 Principles of Compiler Design R.Venkadeshan 285
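A minimal C sketch of this style of translation (our own illustration of newtemp and gen: a post-order walk over a small expression tree that prints one three-address statement per operator and returns each node's place):

#include <stdio.h>

typedef struct Expr {
    char op;                 /* 'i' = identifier, '+', '*', 'u' (unary minus) */
    const char *name;        /* identifier name when op == 'i' */
    struct Expr *l, *r;
} Expr;

static int tempno = 0;

static const char *newtemp(void) {
    static char buf[8][16];  /* enough distinct buffers for this demo */
    static int k = 0;
    char *t = buf[k++ % 8];
    sprintf(t, "t%d", ++tempno);
    return t;
}

/* Emit code for e and return its place (an identifier or a temporary). */
static const char *emit(Expr *e) {
    if (e->op == 'i') return e->name;
    const char *l = emit(e->l);
    const char *t = newtemp();
    if (e->op == 'u')
        printf("%s := uminus %s\n", t, l);
    else
        printf("%s := %s %c %s\n", t, l, e->op, emit(e->r));
    return t;
}

int main(void) {
    /* b * -c */
    Expr c = {'i', "c", 0, 0}, b = {'i', "b", 0, 0};
    Expr u = {'u', 0, &c, 0}, m = {'*', 0, &b, &u};
    printf("a := %s\n", emit(&m));
    return 0;
}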

286 What about things that are not assignments? E.g. while statements of the form S → while E do S1 (interpreted as: while the value of E is not 0, do S1). Extension to the previous syntax-directed definition:
S → while E do S1    S.begin = newlabel; S.after = newlabel;
                     S.code = gen(S.begin ':') E.code gen('if' E.place '=' '0' 'goto' S.after) S1.code gen('goto' S.begin) gen(S.after ':')
6/30/2010 Principles of Compiler Design R.Venkadeshan 286

287 Implementations of 3-address statements (1. Quadruples)
t1 := -c; t2 := b * t1; t3 := -c; t4 := b * t3; t5 := t2 + t4; a := t5
#    op      arg1  arg2  result
(0)  uminus  c           t1
(1)  *       b     t1    t2
(2)  uminus  c           t3
(3)  *       b     t3    t4
(4)  +       t2    t4    t5
(5)  :=      t5          a
Temporary names must be entered into the symbol table as they are created.
6/30/2010 Principles of Compiler Design R.Venkadeshan 287
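A possible in-memory layout for the quadruple table (a sketch; the field names are ours, not fixed by the slides):

#include <stdio.h>

/* One quadruple: op arg1 arg2 result.  Unused fields stay NULL. */
typedef struct {
    const char *op, *arg1, *arg2, *result;
} Quad;

int main(void) {
    Quad code[] = {
        { "uminus", "c",  NULL, "t1" },
        { "*",      "b",  "t1", "t2" },
        { "uminus", "c",  NULL, "t3" },
        { "*",      "b",  "t3", "t4" },
        { "+",      "t2", "t4", "t5" },
        { ":=",     "t5", NULL, "a"  },
    };
    for (unsigned i = 0; i < sizeof code / sizeof code[0]; i++)
        printf("(%u) %s %s %s %s\n", i, code[i].op,
               code[i].arg1   ? code[i].arg1   : "",
               code[i].arg2   ? code[i].arg2   : "",
               code[i].result ? code[i].result : "");
    return 0;
}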

288 Implementations of 3-address statements (2. Triples)
t1 := -c; t2 := b * t1; t3 := -c; t4 := b * t3; t5 := t2 + t4; a := t5
#    op      arg1  arg2
(0)  uminus  c
(1)  *       b     (0)
(2)  uminus  c
(3)  *       b     (2)
(4)  +       (1)   (3)
(5)  assign  a     (4)
Temporary names are not entered into the symbol table.
6/30/2010 Principles of Compiler Design R.Venkadeshan 288

289 Implementations of 3-address statements (2. Triples) contd. Ternary operations like x[i]:=y and x:=y[i] require two or more entries. E.g. for x[i]:=y:
#    op     arg1  arg2
(0)  [ ]=   x     i
(1)  assign (0)   y
and for x:=y[i]:
#    op     arg1  arg2
(0)  [ ]=   y     i
(1)  assign x     (0)
6/30/2010 Principles of Compiler Design R.Venkadeshan 289

290 Implementations of 3-address statements (3. Indirect Triples)
statements:      triples:
(0) (14)         (14)  uminus  c
(1) (15)         (15)  *       b     (14)
(2) (16)         (16)  uminus  c
(3) (17)         (17)  *       b     (16)
(4) (18)         (18)  +       (15)  (17)
(5) (19)         (19)  assign  a     (18)
6/30/2010 Principles of Compiler Design R.Venkadeshan 290

291 Yacc Program 6/30/2010 Principles of Compiler Design R.Venkadeshan 291

292 Yacc Program Translation Rules 6/30/2010 Principles of Compiler Design R.Venkadeshan 292

293 Variants of syntax trees It is sometimes beneficial to create a DAG instead of a tree for expressions. This way we can easily show the common sub-expressions and then use that knowledge during code generation. Example: a+a*(b-c)+(b-c)*d (the DAG shares the single node built for b-c, and the leaf for a, between their two uses). 6/30/2010 Principles of Compiler Design R.Venkadeshan 293

294 SDD for creating DAG's
Production        Semantic Rules
1) E → E1 + T     E.node = new Node('+', E1.node, T.node)
2) E → E1 - T     E.node = new Node('-', E1.node, T.node)
3) E → T          E.node = T.node
4) T → ( E )      T.node = E.node
5) T → id         T.node = new Leaf(id, id.entry)
6) T → num        T.node = new Leaf(num, num.val)
Example (a+a*(b-c)+(b-c)*d):
1) p1 = Leaf(id, entry-a)
2) p2 = Leaf(id, entry-a) = p1
3) p3 = Leaf(id, entry-b)
4) p4 = Leaf(id, entry-c)
5) p5 = Node('-', p3, p4)
6) p6 = Node('*', p1, p5)
7) p7 = Node('+', p1, p6)
8) p8 = Leaf(id, entry-b) = p3
9) p9 = Leaf(id, entry-c) = p4
10) p10 = Node('-', p3, p4) = p5
11) p11 = Leaf(id, entry-d)
12) p12 = Node('*', p5, p11)
13) p13 = Node('+', p7, p12)
6/30/2010 Principles of Compiler Design R.Venkadeshan 294

295 Value-number method for constructing DAG's (figure: the nodes of the DAG for i = i + 10 stored in an array, with the id node pointing to the symbol-table entry for i) Algorithm: search the array for a node M with label op, left child l and right child r. If there is such a node, return the value number of M. If not, create in the array a new node N with label op, left child l and right child r, and return its value number. We may use a hash table to make the search efficient. 6/30/2010 Principles of Compiler Design R.Venkadeshan 295
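A C sketch of the value-number method (using a linear search where a hash table could be substituted; for brevity, leaves store the identifier itself as their label):

#include <stdio.h>
#include <string.h>

/* DAG nodes live in an array; a node's index is its value number. */
typedef struct {
    char op[8];   /* operator, or an identifier name for leaves */
    int  l, r;    /* value numbers of children, -1 if none      */
} Node;

static Node nodes[256];
static int nnodes = 0;

/* Return the value number of (op,l,r), creating the node only if an
 * identical one does not already exist. */
static int valnum(const char *op, int l, int r) {
    for (int i = 0; i < nnodes; i++)
        if (strcmp(nodes[i].op, op) == 0 && nodes[i].l == l && nodes[i].r == r)
            return i;                      /* found: reuse the node */
    strcpy(nodes[nnodes].op, op);
    nodes[nnodes].l = l;
    nodes[nnodes].r = r;
    return nnodes++;
}

int main(void) {
    /* Build a DAG for (b - c) + (b - c): the second b-c is shared. */
    int b   = valnum("b", -1, -1);
    int c   = valnum("c", -1, -1);
    int s1  = valnum("-", b, c);
    int s2  = valnum("-", b, c);           /* same value number as s1 */
    int sum = valnum("+", s1, s2);
    printf("s1=%d s2=%d sum=%d\n", s1, s2, sum);
    return 0;
}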

296 Three address code In three-address code there is at most one operator on the right side of an instruction. Example, for a + a*(b-c) + (b-c)*d:
t1 = b - c
t2 = a * t1
t3 = a + t2
t4 = t1 * d
t5 = t3 + t4
6/30/2010 Principles of Compiler Design R.Venkadeshan 296

297 Forms of three address instructions x = y op z x = op y x = y goto L if x goto L and iffalse x goto L if x relop y goto L Procedure calls using: param x call p,n y = call p,n x = y[i] and x[i] = y x = &y and x = *y and *x =y 6/30/2010 Principles of Compiler Design R.Venkadeshan 297

298 Example: do i = i+1; while (a[i] < v);
With symbolic labels:       With position numbers:
L:  t1 = i + 1              100: t1 = i + 1
    i = t1                  101: i = t1
    t2 = i * 8              102: t2 = i * 8
    t3 = a[t2]              103: t3 = a[t2]
    if t3 < v goto L        104: if t3 < v goto 100
6/30/2010 Principles of Compiler Design R.Venkadeshan 298

299 Data structures for three address codes Quadruples Has four fields: op, arg1, arg2 and result Triples Temporaries are not used and instead references to instructions are made Indirect triples In addition to triples we use a list of pointers to triples 6/30/2010 Principles of Compiler Design R.Venkadeshan 299

300 Example: b * minus c + b * minus c
Three address code:
t1 = minus c
t2 = b * t1
t3 = minus c
t4 = b * t3
t5 = t2 + t4
a = t5
Quadruples (op arg1 arg2 result):
minus c — t1; * b t1 t2; minus c — t3; * b t3 t4; + t2 t4 t5; = t5 — a
Triples (op arg1 arg2):
(0) minus c; (1) * b (0); (2) minus c; (3) * b (2); (4) + (1) (3); (5) = a (4)
Indirect triples: an instruction list (0)..(5) pointing to triples (14)..(19) with the same contents as above.
6/30/2010 Principles of Compiler Design R.Venkadeshan 300

301 Type Expressions Example: int[2][3] is array(2, array(3, integer)). A basic type is a type expression. A type name is a type expression. A type expression can be formed by applying the array type constructor to a number and a type expression. A record is a data structure with named fields. A type expression can be formed by using the type constructor → for function types. If s and t are type expressions, then their Cartesian product s*t is a type expression. Type expressions may contain variables whose values are type expressions. 6/30/2010 Principles of Compiler Design R.Venkadeshan 301

302 Type Equivalence They are the same basic type. They are formed by applying the same constructor to structurally equivalent types. One is a type name that denotes the other. 6/30/2010 Principles of Compiler Design R.Venkadeshan 302

303 Declarations 6/30/2010 Principles of Compiler Design R.Venkadeshan 303

304 Computing types and their widths Storage Layout for Local Names 6/30/2010 Principles of Compiler Design R.Venkadeshan 304

305 Storage Layout for Local Names Syntax-directed translation of array types 6/30/2010 Principles of Compiler Design R.Venkadeshan 305

306 Sequences of Declarations Actions at the end: 6/30/2010 Principles of Compiler Design R.Venkadeshan 306

307 Fields in Records and Classes 6/30/2010 Principles of Compiler Design R.Venkadeshan 307

308 Translation of Expressions and Statements We discussed how to find the types and offsets of variables. We therefore have the necessary preparation to discuss translation to intermediate code. We also discuss type checking. 6/30/2010 Principles of Compiler Design R.Venkadeshan 308

309 Three-address code for expressions 6/30/2010 Principles of Compiler Design R.Venkadeshan 309

310 Incremental Translation 6/30/2010 Principles of Compiler Design R.Venkadeshan 310

311 Addressing Array Elements Layouts for a two-dimensional array: 6/30/2010 Principles of Compiler Design R.Venkadeshan 311

312 Semantic actions for array reference 6/30/2010 Principles of Compiler Design R.Venkadeshan 312

313 Translation of Array References Nonterminal L has three synthesized attributes: L.addr L.array L.type 6/30/2010 Principles of Compiler Design R.Venkadeshan 313

314 Conversions between primitive types in Java 6/30/2010 Principles of Compiler Design R.Venkadeshan 314

315 Introducing type conversions into expression evaluation 6/30/2010 Principles of Compiler Design R.Venkadeshan 315

316 Abstract syntax tree for the function definition fun length(x) = if null(x) then 0 else length(tl(x)) + 1 This is a polymorphic function in the ML language. 6/30/2010 Principles of Compiler Design R.Venkadeshan 316

317 Inferring a type for the function length 6/30/2010 Principles of Compiler Design R.Venkadeshan 317

318 Algorithm for Unification 6/30/2010 Principles of Compiler Design R.Venkadeshan 318

319 Unification algorithm
boolean unify(Node m, Node n) {
    s = find(m); t = find(n);     // equivalence-class representatives
    if (s = t) return true;
    else if (nodes s and t represent the same basic type) return true;
    else if (s is an op-node with children s1 and s2 and
             t is an op-node with children t1 and t2) {
        union(s, t);              // merge the classes, then unify children
        return unify(s1, t1) and unify(s2, t2);
    }
    else if (s or t represents a variable) {
        union(s, t);              // bind the variable to the other term
        return true;
    }
    else return false;
}
6/30/2010 Principles of Compiler Design R.Venkadeshan 319

320 Control Flow boolean expressions are often used to: Alter the flow of control. Compute logical values. 6/30/2010 Principles of Compiler Design R.Venkadeshan 320

321 Short-Circuit Code 6/30/2010 Principles of Compiler Design R.Venkadeshan 321

322 Flow-of-Control Statements 6/30/2010 Principles of Compiler Design R.Venkadeshan 322

323 Syntax-directed definition 6/30/2010 Principles of Compiler Design R.Venkadeshan 323

324 Generating three-address code for booleans 6/30/2010 Principles of Compiler Design R.Venkadeshan 324

325 translation of a simple if-statement 6/30/2010 Principles of Compiler Design R.Venkadeshan 325

326 Backpatching Previous codes for Boolean expressions insert symbolic labels for jumps, so a separate pass is needed to set them to appropriate addresses. We can use a technique named backpatching to avoid this. We assume we save instructions into an array, and labels will be indices into the array. For nonterminal B we use two attributes, B.truelist and B.falselist, together with the following functions: makelist(i) - create a new list containing only i, an index into the array of instructions; merge(p1,p2) - concatenate the lists pointed to by p1 and p2 and return a pointer to the concatenated list; backpatch(p,i) - insert i as the target label for each of the instructions on the list pointed to by p. 6/30/2010 Principles of Compiler Design R.Venkadeshan 326
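A C sketch of these three helpers (the list nodes, the target array, and everything beyond the names makelist/merge/backpatch are our own assumptions):

#include <stdio.h>
#include <stdlib.h>

typedef struct ListNode {          /* one incomplete jump instruction */
    int instr;                     /* index into the instruction array */
    struct ListNode *next;
} ListNode;

static int target[1000];           /* target[i] = jump target of instruction i */

static ListNode *makelist(int i) {
    ListNode *p = malloc(sizeof *p);
    p->instr = i;
    p->next = NULL;
    return p;
}

static ListNode *merge(ListNode *p1, ListNode *p2) {
    if (!p1) return p2;
    ListNode *end = p1;
    while (end->next) end = end->next;
    end->next = p2;
    return p1;
}

static void backpatch(ListNode *p, int i) {
    for (; p; p = p->next)
        target[p->instr] = i;      /* fill in the previously unknown label */
}

int main(void) {
    ListNode *truelist = merge(makelist(100), makelist(104));
    backpatch(truelist, 200);      /* both jumps now go to instruction 200 */
    printf("%d %d\n", target[100], target[104]);
    return 0;
}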

327 Backpatching for Boolean Expressions 6/30/2010 Principles of Compiler Design R.Venkadeshan 327

328 Backpatching for Boolean Expressions Annotated parse tree for x < 100 || x > 200 && x != y 6/30/2010 Principles of Compiler Design R.Venkadeshan 328

329 Flow-of-Control Statements 6/30/2010 Principles of Compiler Design R.Venkadeshan 329

330 Translation of a switch-statement 6/30/2010 Principles of Compiler Design R.Venkadeshan 330

331 Intermediate Code Generation Intermediate codes are machine independent codes, but they are close to machine instructions. The given program in a source language is converted to an equivalent program in an intermediate language by the intermediate code generator. The intermediate language can be many different languages, and the designer of the compiler decides this intermediate language: syntax trees can be used as an intermediate language; postfix notation can be used as an intermediate language; three-address code (quadruples) can be used as an intermediate language. We will use quadruples to discuss intermediate code generation. Quadruples are close to machine instructions, but they are not actual machine instructions. Some programming languages have well-defined intermediate languages: Java - the Java Virtual Machine; Prolog - the Warren Abstract Machine. In fact, there are byte-code emulators to execute instructions in these intermediate languages. 6/30/2010 Principles of Compiler Design R.Venkadeshan 331

332 Three-Address Code (Quadruples) A quadruple is: x := y op z where x, y and z are names, constants or compiler-generated temporaries, and op is any operator. But we may also use the following notation for quadruples (a better notation because it looks like a machine code instruction): op y,z,x - apply operator op to y and z, and store the result in x. We use the term three-address code because each statement usually contains three addresses (two for operands, one for the result). 6/30/2010 Principles of Compiler Design R.Venkadeshan 332

333 Three-Address Statements Binary Operator: op y,z,result or result := y op z where op is a binary arithmetic or logical operator. This binary operator is applied to y and z, and the result of the operation is stored in result. Ex: add a,b,c gt a,b,c addr a,b,c addi a,b,c Unary Operator: op y,,result or result := op y where op is a unary arithmetic or logical operator. This unary operator is applied to y, and the result of the operation is stored in result. Ex: uminus a,,c not a,,c inttoreal a,,c 6/30/2010 Principles of Compiler Design R.Venkadeshan 333

334 Three-Address Statements (cont.) Move Operator: mov y,,result or result := y where the content of y is copied into result. Ex: mov a,,c movi a,,c movr a,,c Unconditional Jumps: jmp,,l or goto L We will jump to the three-address code with the label L, and the execution continues from that statement. Ex: jmp,,l1 // jump to L1 jmp,,7 // jump to the statement 7 6/30/2010 Principles of Compiler Design R.Venkadeshan 334

335 Three-Address Statements (cont.) Conditional Jumps: jmprelop y,z,l or if y relop z goto L We will jump to the three-address code with the label L if the result of y relop z is true, and the execution continues from that statement. If the result is false, the execution continues from the statement following this conditional jump statement. Ex: jmpgt y,z,l1 // jump to L1 if y>z jmpgte y,z,l1 // jump to L1 if y>=z jmpe y,z,l1 // jump to L1 if y==z jmpne y,z,l1 // jump to L1 if y!=z Our relational operator can also be a unary operator. jmpnz y,,l1 // jump to L1 if y is not zero jmpz y,,l1 // jump to L1 if y is zero jmpt y,,l1 // jump to L1 if y is true jmpf y,,l1 // jump to L1 if y is false 6/30/2010 Principles of Compiler Design R.Venkadeshan 335

336 Three-Address Statements (cont.) Procedure Parameters: param x,, or param x. Procedure Calls: call p,n, or call p,n - where x is an actual parameter, we invoke the procedure p with n parameters. Ex: p(x1,...,xn) is translated to: param x1,, ... param xn,, call p,n,. For f(x+1,y): add x,1,t1 ; param t1,, ; param y,, ; call f,2, 6/30/2010 Principles of Compiler Design R.Venkadeshan 336

337 Three-Address Statements (cont.) Indexed Assignments: move y[i],,x or move x,,y[i] or x := y[i] y[i] := x Address and Pointer Assignments: moveaddr y,,x or x := &y movecont y,,x or x := *y 6/30/2010 Principles of Compiler Design R.Venkadeshan 337

338 Syntax-Directed Translation into Three-Address Code
S → id := E     S.code = E.code gen( mov E.place,, id.place)
E → E1 + E2     E.place = newtemp(); E.code = E1.code E2.code gen( add E1.place, E2.place, E.place)
E → E1 * E2     E.place = newtemp(); E.code = E1.code E2.code gen( mult E1.place, E2.place, E.place)
E → - E1        E.place = newtemp(); E.code = E1.code gen( uminus E1.place,, E.place)
E → ( E1 )      E.place = E1.place; E.code = E1.code
E → id          E.place = id.place; E.code = null
6/30/2010 Principles of Compiler Design R.Venkadeshan 338

339 Syntax-Directed Translation (cont.)
S → while E do S1      S.begin = newlabel(); S.after = newlabel();
                       S.code = gen(S.begin :) E.code gen( jmpf E.place,, S.after) S1.code gen( jmp,, S.begin) gen(S.after :)
S → if E then S1 else S2    S.else = newlabel(); S.after = newlabel();
                       S.code = E.code gen( jmpf E.place,, S.else) S1.code gen( jmp,, S.after) gen(S.else :) S2.code gen(S.after :)
6/30/2010 Principles of Compiler Design R.Venkadeshan 339

340 Translation Scheme to Produce Three-Address Code
S → id := E { p = lookup(id.name); if (p is not nil) then emit( mov E.place,, p) else error( undefined-variable ) }
E → E1 + E2 { E.place = newtemp(); emit( add E1.place, E2.place, E.place) }
E → E1 * E2 { E.place = newtemp(); emit( mult E1.place, E2.place, E.place) }
E → - E1 { E.place = newtemp(); emit( uminus E1.place,, E.place) }
E → ( E1 ) { E.place = E1.place; }
E → id { p = lookup(id.name); if (p is not nil) then E.place = id.place else error( undefined-variable ) }
6/30/2010 Principles of Compiler Design R.Venkadeshan 340

341 Translation Scheme with Locations
S → id := { E.inloc = S.inloc } E { p = lookup(id.name); if (p is not nil) then { emit(E.outloc mov E.place,, p); S.outloc = E.outloc+1 } else { error( undefined-variable ); S.outloc = E.outloc } }
E → { E1.inloc = E.inloc } E1 + { E2.inloc = E1.outloc } E2 { E.place = newtemp(); emit(E2.outloc add E1.place, E2.place, E.place); E.outloc = E2.outloc+1 }
E → { E1.inloc = E.inloc } E1 * { E2.inloc = E1.outloc } E2 { E.place = newtemp(); emit(E2.outloc mult E1.place, E2.place, E.place); E.outloc = E2.outloc+1 }
E → - { E1.inloc = E.inloc } E1 { E.place = newtemp(); emit(E1.outloc uminus E1.place,, E.place); E.outloc = E1.outloc+1 }
E → ( E1 ) { E.place = E1.place; E.outloc = E1.outloc+1 }
E → id { E.outloc = E.inloc; p = lookup(id.name); if (p is not nil) then E.place = id.place else error( undefined-variable ) }
6/30/2010 Principles of Compiler Design R.Venkadeshan 341

342 Boolean Expressions
E → { E1.inloc = E.inloc } E1 and { E2.inloc = E1.outloc } E2 { E.place = newtemp(); emit(E2.outloc and E1.place, E2.place, E.place); E.outloc = E2.outloc+1 }
E → { E1.inloc = E.inloc } E1 or { E2.inloc = E1.outloc } E2 { E.place = newtemp(); emit(E2.outloc or E1.place, E2.place, E.place); E.outloc = E2.outloc+1 }
E → not { E1.inloc = E.inloc } E1 { E.place = newtemp(); emit(E1.outloc not E1.place,, E.place); E.outloc = E1.outloc+1 }
E → { E1.inloc = E.inloc } E1 relop { E2.inloc = E1.outloc } E2 { E.place = newtemp(); emit(E2.outloc relop.code E1.place, E2.place, E.place); E.outloc = E2.outloc+1 }
6/30/2010 Principles of Compiler Design R.Venkadeshan 342

343 Translation Scheme (cont.)
S → while { E.inloc = S.inloc } E do { emit(E.outloc jmpf E.place,, NOTKNOWN); S1.inloc = E.outloc+1; } S1 { emit(S1.outloc jmp,, S.inloc); S.outloc = S1.outloc+1; backpatch(E.outloc, S.outloc); }
S → if { E.inloc = S.inloc } E then { emit(E.outloc jmpf E.place,, NOTKNOWN); S1.inloc = E.outloc+1; } S1 else { emit(S1.outloc jmp,, NOTKNOWN); S2.inloc = S1.outloc+1; backpatch(E.outloc, S2.inloc); } S2 { S.outloc = S2.outloc; backpatch(S1.outloc, S.outloc); }
6/30/2010 Principles of Compiler Design R.Venkadeshan 343

344 Three Address Codes - Example
x:=1;                          01: mov 1,,x
y:=x+10;                       02: add x,10,t1
while (x<y) {                  03: mov t1,,y
  x:=x+1;                      04: lt x,y,t2
  if (x%2==1) then y:=y+1;     05: jmpf t2,,17
  else y:=y-2;                 06: add x,1,t3
}                              07: mov t3,,x
                               08: mod x,2,t4
                               09: eq t4,1,t5
                               10: jmpf t5,,14
                               11: add y,1,t6
                               12: mov t6,,y
                               13: jmp,,16
                               14: sub y,2,t7
                               15: mov t7,,y
                               16: jmp,,4
                               17:
6/30/2010 Principles of Compiler Design R.Venkadeshan 344

345 Arrays Elements of arrays can be accessed quickly if the elements are stored in a block of consecutive locations. For a one-dimensional array A: baseA is the address of the first location of the array A, width is the width of each array element, and low is the index of the first array element. The location of A[i] is baseA + (i-low)*width. 6/30/2010 Principles of Compiler Design R.Venkadeshan 345

346 Arrays (cont.) baseA + (i-low)*width can be rewritten as i*width + (baseA - low*width), where the first part must be computed at run-time and the second part can be computed at compile-time. So the location of A[i] can be computed at run-time by evaluating the formula i*width + c, where c is (baseA - low*width), which is evaluated at compile-time. The intermediate code generator should produce the code to evaluate this formula i*width + c (one multiplication and one addition operation). 6/30/2010 Principles of Compiler Design R.Venkadeshan 346

347 Two-Dimensional Arrays A two-dimensional array can be stored in either row-major (row-by-row) or column-major (column-by-column). Most of the programming languages use row-major method. Row-major representation of a two-dimensional array: row 1 row 2 row n 6/30/2010 Principles of Compiler Design R.Venkadeshan 347

348 Two-Dimensional Arrays (cont.) The location of A[i1,i2] is baseA + ((i1-low1)*n2 + i2-low2)*width, where baseA is the location of the array A, low1 is the index of the first row, low2 is the index of the first column, n2 is the number of elements in each row, and width is the width of each array element. Again, this formula can be rewritten as ((i1*n2)+i2)*width + (baseA - ((low1*n2)+low2)*width); the first part must be computed at run-time, the second part can be computed at compile-time. 6/30/2010 Principles of Compiler Design R.Venkadeshan 348

349 Multi-Dimensional Arrays In general, the location of A[i1,i2,...,ik] is ((...((i1*n2)+i2)...)*nk+ik)*width + (baseA - ((...((low1*n2)+low2)...)*nk+lowk)*width). So the intermediate code generator should produce the code to evaluate the formula ((...((i1*n2)+i2)...)*nk+ik)*width + c to find the location of A[i1,i2,...,ik]. To evaluate the ((...((i1*n2)+i2)...)*nk+ik portion of this formula, we can use the recurrence equation: e1 = i1; em = em-1 * nm + im. 6/30/2010 Principles of Compiler Design R.Venkadeshan 349
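A small C sketch of evaluating this formula with the recurrence (the constant part c is assumed to have been folded at compile time):

#include <stdio.h>

/* Evaluate ((..(i1*n2 + i2)..)*nk + ik) * width + c using the
 * recurrence e1 = i1, em = e(m-1)*nm + im.  Here c stands for the
 * compile-time constant baseA - ((..(low1*n2+low2)..)*nk+lowk)*width. */
static long element_offset(const long i[], const long n[], int k,
                           long width, long c) {
    long e = i[0];
    for (int m = 1; m < k; m++)
        e = e * n[m] + i[m];
    return e * width + c;
}

int main(void) {
    /* int A[10][20] with low bounds 0, so c = baseA; take baseA = 0. */
    long i[] = {3, 5}, n[] = {10, 20};
    printf("%ld\n", element_offset(i, n, 2, 4, 0)); /* (3*20+5)*4 = 260 */
    return 0;
}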

350 Translation Scheme for Arrays If we use the following grammar to calculate addresses of array elements, we need inherited attributes:
L → id | id [ Elist ]
Elist → Elist , E | E
Instead of this grammar, we will use the following grammar to calculate addresses of array elements, so that we do not need inherited attributes (we will use only synthesized attributes):
L → id | Elist ]
Elist → Elist , E | id [ E
6/30/2010 Principles of Compiler Design R.Venkadeshan 350

351 Translation Scheme for Arrays (cont.)
S → L := E { if (L.offset is null) emit( mov E.place,, L.place) else emit( mov E.place,, L.place [ L.offset ]) }
E → E1 + E2 { E.place = newtemp(); emit( add E1.place, E2.place, E.place) }
E → ( E1 ) { E.place = E1.place; }
E → L { if (L.offset is null) E.place = L.place else { E.place = newtemp(); emit( mov L.place [ L.offset ],, E.place) } }
6/30/2010 Principles of Compiler Design R.Venkadeshan 351

352 Translation Scheme for Arrays (cont.)
L → id { L.place = id.place; L.offset = null; }
L → Elist ] { L.place = newtemp(); L.offset = newtemp(); emit( mov c(Elist.array),, L.place); emit( mult Elist.place, width(Elist.array), L.offset) }
Elist → Elist1 , E { Elist.array = Elist1.array; Elist.place = newtemp(); Elist.ndim = Elist1.ndim + 1; emit( mult Elist1.place, limit(Elist.array, Elist.ndim), Elist.place); emit( add Elist.place, E.place, Elist.place); }
Elist → id [ E { Elist.array = id.place; Elist.place = E.place; Elist.ndim = 1; }
6/30/2010 Principles of Compiler Design R.Venkadeshan 352

353 Translation Scheme for Arrays - Example 1 A one-dimensional double array A: 5..99, n1=95, width=8 (double), low1=5. Intermediate code for x := A[y]:
mov c,,t1        // where c = baseA - 5*8
mult y,8,t2
mov t1[t2],,t3
mov t3,,x
6/30/2010 Principles of Compiler Design R.Venkadeshan 353

354 Translation Scheme for Arrays - Example 2 A two-dimensional int array A: 1..10 x 1..20, n1=10, n2=20, width=4 (integers), low1=1, low2=1. Intermediate code for x := A[y,z]:
mult y,20,t1
add t1,z,t1
mov c,,t2        // where c = baseA - (1*20+1)*4
mult t1,4,t3
mov t2[t3],,t4
mov t4,,x
6/30/2010 Principles of Compiler Design R.Venkadeshan 354

355 Translation Scheme for Arrays - Example 3 A three-dimensional int array A: 0..9 x 0..19 x 0..29, n1=10, n2=20, n3=30, width=4 (integers), low1=0, low2=0, low3=0. Intermediate code for x := A[w,y,z]:
mult w,20,t1
add t1,y,t1
mult t1,30,t2
add t2,z,t2
mov c,,t3        // where c = baseA - ((0*20+0)*30+0)*4
mult t2,4,t4
mov t3[t4],,t5
mov t5,,x
6/30/2010 Principles of Compiler Design R.Venkadeshan 355

356 Declarations
P → M D
M → ε { offset=0 }
D → D ; D
D → id : T { enter(id.name, T.type, offset); offset = offset + T.width }
T → int { T.type=int; T.width=4 }
T → real { T.type=real; T.width=8 }
T → array [ num ] of T1 { T.type=array(num.val, T1.type); T.width = num.val * T1.width }
T → ↑ T1 { T.type=pointer(T1.type); T.width=4 }
where enter creates a symbol table entry with the given values.
6/30/2010 Principles of Compiler Design R.Venkadeshan 356

357 Nested Procedure Declarations For each procedure we should create a symbol table. mktable(previous) create a new symbol table where previous is the parent symbol table of this new symbol table enter(symtable,name,type,offset) create a new entry for a variable in the given symbol table. enterproc(symtable,name,newsymbtable) create a new entry for the procedure in the symbol table of its parent. addwidth(symtable,width) puts the total width of all entries in the symbol table into the header of that table. We will have two stacks: tblptr to hold the pointers to the symbol tables offset to hold the current offsets in the symbol tables in tblptr stack. 6/30/2010 Principles of Compiler Design R.Venkadeshan 357

358 Nested Procedure Declarations P M D { addwidth(top(tblptr),top(offset)); pop(tblptr); pop(offset) } M { t=mktable(nil); push(t,tblptr); push(0,offset) } D D ; D D proc id N D ; S { t=top(tblptr); addwidth(t,top(offset)); pop(tblptr); pop(offset); enterproc(top(tblptr),id.name,t) } D id : T { enter(top(tblptr),id.name,t.type,top(offset)); top(offset)=top(offset)+t.width } N { t=mktable(top(tblptr)); push(t,tblptr); push(0,offset) } 6/30/2010 Principles of Compiler Design R.Venkadeshan 358

359 Run-Time Environments How do we allocate the space for the generated target code and the data objects of our source programs? The places of the data objects that can be determined at compile time will be allocated statically, but the places of some of the data objects will be allocated at run-time. The allocation and de-allocation of the data objects is managed by the run-time support package: the run-time support package is loaded together with the generated target code, and the structure of the run-time support package depends on the semantics of the programming language (especially the semantics of procedures in that language). Each execution of a procedure is called an activation of that procedure. 6/30/2010 Principles of Compiler Design R.Venkadeshan 359

360 Procedure Activations An execution of a procedure starts at the beginning of the procedure body; when the procedure is completed, it returns control to the point immediately after the place where that procedure was called. Each execution of a procedure is called an activation of it. The lifetime of an activation of a procedure is the sequence of steps between the first and the last steps in the execution of that procedure (including the other procedures called by that procedure). If a and b are procedure activations, then their lifetimes are either non-overlapping or nested. If a procedure is recursive, a new activation can begin before an earlier activation of the same procedure has ended. 6/30/2010 Principles of Compiler Design R.Venkadeshan 360

361 Activation Tree We can use a tree (called an activation tree) to show the way control enters and leaves activations. In an activation tree: each node represents an activation of a procedure; the root represents the activation of the main program; the node a is a parent of the node b iff control flows from a to b; the node a is to the left of the node b iff the lifetime of a occurs before the lifetime of b. 6/30/2010 Principles of Compiler Design R.Venkadeshan 361

362 Activation Tree (cont.) program main; procedure s; begin... end; procedure p; procedure q; begin... end; begin q; s; end; begin p; s; end; enter main enter p enter q exit q enter s exit s exit p enter s exit s exit main A Nested Structure 6/30/2010 Principles of Compiler Design R.Venkadeshan 362

363 Activation Tree (cont.) main p s q s 6/30/2010 Principles of Compiler Design R.Venkadeshan 363

364 Control Stack The flow of control in a program corresponds to a depth-first traversal of the activation tree that starts at the root, visits a node before its children, and recursively visits children at each node in a left-to-right order. A stack (called the control stack) can be used to keep track of live procedure activations: an activation record is pushed onto the control stack as the activation starts, and that activation record is popped when the activation ends. When node n is at the top of the control stack, the stack contains the nodes along the path from n to the root. 6/30/2010 Principles of Compiler Design R.Venkadeshan 364

365 Variable Scopes The same variable name can be used in different parts of the program. The scope rules of the language determine which declaration of a name applies when the name appears in the program. An occurrence of a variable (a name) is: local, if that occurrence is in the same procedure in which the name is declared; non-local otherwise (i.e. it is declared outside of that procedure). Example: procedure p; var b:real; procedure q; var a:integer; begin a := 1; b := 2; end; begin ... end; - inside q, a is local and b is non-local. 6/30/2010 Principles of Compiler Design R.Venkadeshan 365

366 Run-Time Storage Organization
Code - memory locations for code are determined at compile time.
Static Data - locations of static data can also be determined at compile time.
Stack - data objects allocated at run-time (activation records).
Heap - other dynamically allocated data objects at run-time (for example, the malloc area in C).
6/30/2010 Principles of Compiler Design R.Venkadeshan 366

367 Activation Records Information needed by a single execution of a procedure is managed using a contiguous block of storage called an activation record. An activation record is allocated when a procedure is entered, and it is de-allocated when that procedure exits. The size of each field can be determined at compile time (although the actual location of the activation record is determined at run-time), except that if the procedure has a local variable whose size depends on a parameter, its size is determined at run time. 6/30/2010 Principles of Compiler Design R.Venkadeshan 367

368 Activation Records (cont.) Fields: return value, actual parameters, optional control link, optional access link, saved machine status, local data, temporaries. The returned value of the called procedure is returned in the return-value field to the calling procedure (in practice, we may use a machine register for the return value). The field for actual parameters is used by the calling procedure to supply parameters to the called procedure. The optional control link points to the activation record of the caller. The optional access link is used to refer to nonlocal data held in other activation records. The field for saved machine status holds information about the state of the machine before the procedure is called. The field for local data holds data that is local to an execution of the procedure. Temporary values are stored in the field for temporaries. 6/30/2010 Principles of Compiler Design R.Venkadeshan 368

369 Activation Records (Ex1) program main; procedure p; var a:real; procedure q; var b:integer; begin ... end; begin q; end; procedure s; var c:integer; begin ... end; begin p; s; end; (stack while q executes: main, then p with field a:, then q with field b:) 6/30/2010 Principles of Compiler Design R.Venkadeshan 369

370 Activation Records for Recursive Procedures program main; procedure p; function q(a:integer):integer; begin if (a=1) then q:=1; else q:=a+q(a-1); end; begin q(3); end; begin p; end; (stack: main, p, q(3) with a:3, q(2) with a:2, q(1) with a:1) 6/30/2010 Principles of Compiler Design R.Venkadeshan 370

371 Creation of An Activation Record Who allocates an activation record of a procedure? Some part of the activation record of a procedure is created by that procedure immediately after that procedure is entered. Some part is created by the caller of that procedure before that procedure is entered. Who deallocates? Callee de-allocates the part allocated by Callee. Caller de-allocates the part allocated by Caller. 6/30/2010 Principles of Compiler Design R.Venkadeshan 371

372 Creation of An Activation Record (cont.) return value actual parameters optional control link optional access link saved machine status local data temporaries return value actual parameters optional control link optional access link saved machine status local data temporaries Caller s Activation Record Caller s Responsibility Callee s Activation Record Callee s Responsibility 6/30/2010 Principles of Compiler Design R.Venkadeshan 372

373 Variable Length Data Variable-length data is allocated after the temporaries, and there is a pointer from the local data area to each such array: return value, actual parameters, optional control link, optional access link, saved machine status, local data (holding pointer a and pointer b), temporaries, array a, array b. 6/30/2010 Principles of Compiler Design R.Venkadeshan 373

374 Dangling Reference Whenever storage is de-allocated, the problem of dangling references may occur.
main() { int *p; p = dangle(); }
int *dangle() { int i = 2; return &i; }  /* returns the address of i, which is de-allocated when dangle returns */
6/30/2010 Principles of Compiler Design R.Venkadeshan 374

375 Access to Nonlocal Names Scope rules of a language determine the treatment of references to nonlocal names. Scope Rules: Lexical Scope (Static Scope) Determines the declaration that applies to a name by examining the program text alone at compile-time. Most-closely nested rule is used. Pascal, C,.. Dynamic Scope Determines the declaration that applies to a name at run-time. Lisp, APL,... 6/30/2010 Principles of Compiler Design R.Venkadeshan 375

376 Lexical Scope The scope of a declaration in a block-structured language is given by the most closely nested rule. Each procedure (block) will have its own activation record (begin-end blocks are treated like procedures, without creating most parts of an activation record). A procedure may access a nonlocal name using: access links in activation records, or displays (an efficient way to access nonlocal names). 6/30/2010 Principles of Compiler Design R.Venkadeshan 376

377 Access Links program main; var a:int; procedure p; var d:int; begin a:=1; end; procedure q(i:int); var b:int; procedure s; var c:int; begin p; end; begin if (i<>0) then q(i-1) else s; end; begin q(1); end; (stack with access links: main [a:], q(1) [i,b:], q(0) [i,b:], s [c:], p [d:]; each access link points to the activation record of the textually enclosing procedure's most recent activation) 6/30/2010 Principles of Compiler Design R.Venkadeshan 377

378 Procedure Parameters program main; procedure p(procedure a); begin a; end; procedure q; procedure s; begin ... end; begin p(s) end; begin q; end; (stack: main, q, p, s) Access links must be passed with procedure parameters: when p calls its parameter a (here s), s's access link comes from the link passed along with the parameter. 6/30/2010 Principles of Compiler Design R.Venkadeshan 378

379 Displays An array of pointers to activation records can be used to access activation records. This array is called a display. For each nesting level, there will be an array entry: D[1] - current activation record at level 1; D[2] - current activation record at level 2; D[3] - current activation record at level 3. 6/30/2010 Principles of Compiler Design R.Venkadeshan 379

380 Accessing Nonlocal Variables program main; var a:int; procedure p; var b:int; begin q; end; procedure q(); var c:int; begin c:=a+b; end; begin p; end; (stack: main [a:], p [b:], q [c:]) Code for c := a + b using access links:
addrc := offsetc(currar)
t := *currar
addrb := offsetb(t)
t := *t
addra := offseta(t)
ADD addra,addrb,addrc
6/30/2010 Principles of Compiler Design R.Venkadeshan 380

381 Accessing Nonlocal Variables using Display program main; var a:int; procedure p; var b:int; begin q; end; procedure q(); var c:int; begin c:=a+b; end; begin p; end; (stack: main [a:] = D[1], p [b:] = D[2], q [c:] = D[3]) Code for c := a + b using the display:
addrc := offsetc(D[3])
addrb := offsetb(D[2])
addra := offseta(D[1])
ADD addra,addrb,addrc
6/30/2010 Principles of Compiler Design R.Venkadeshan 381

382 Parameter Passing Methods Call-by-value Call-by-reference Call-by-name (used by Algol) 6/30/2010 Principles of Compiler Design R.Venkadeshan 382
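For the first two methods, a quick C illustration (call-by-name has no direct C analogue):

#include <stdio.h>

static void inc_by_value(int x)      { x = x + 1; }   /* callee gets a copy */
static void inc_by_reference(int *x) { *x = *x + 1; } /* callee gets an address */

int main(void) {
    int a = 0;
    inc_by_value(a);      printf("%d\n", a);  /* still 0 */
    inc_by_reference(&a); printf("%d\n", a);  /* now 1  */
    return 0;
}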

383 SOME OTHER ISSUES Symbol Tables hold scope information. Dynamic Scope: uses control link (similar to access links) Dynamic Storage Allocation (Heap Management) different allocation techniques 6/30/2010 Principles of Compiler Design R.Venkadeshan 383

384 Code Optimization 6/30/2010 Principles of Compiler Design R.Venkadeshan 384

385 Introduction Criteria for Code-Improving Transformation: Meaning must be preserved (correctness) Speedup must occur on average. Work done must be worth the effort. Opportunities: Programmer (algorithm, directives) Intermediate code Target code 6/30/2010 Principles of Compiler Design R.Venkadeshan 385

386 Peephole Optimizations (9.9) A simple but effective technique for locally improving the target code is peephole optimization: a method for trying to improve the performance of the target program by examining a short sequence of target instructions and replacing these instructions by a shorter or faster sequence whenever possible. Characteristics of peephole optimization: 1. Redundant instruction elimination 2. Flow-of-control optimizations 3. Algebraic simplification 4. Use of machine idioms 6/30/2010 Principles of Compiler Design R.Venkadeshan 386

387 Peephole Optimizations Constant folding:
x := 32
x := x + 32      becomes      x := 64
Unreachable code: a statement such as x := x + 1 immediately after an unconditional goto L2 can never execute and can be removed. Flow-of-control optimizations:
goto L1                       goto L2
...              becomes      ...
L1: goto L2                   L1: goto L2
(and L1: goto L2 itself is not needed if no other statement branches to L1).
6/30/2010 Principles of Compiler Design R.Venkadeshan 387

388 Peephole Optimizations Algebraic simplification: x := x + 0 is not needed and can be removed. Dead code: after x := 32, if x is not used except in y := x + y, the use can be replaced to give y := y + 32, making x := 32 dead. Reduction in strength: x := x * 2 becomes x := x + x (or a shift, x := x << 1). 6/30/2010 Principles of Compiler Design R.Venkadeshan 388

389 Basic Block Level 1. Common Sub expression elimination 2. Constant Propagation 3. Copy Propagation 4. Dead code elimination 5. 6/30/2010 Principles of Compiler Design R.Venkadeshan 389

390 Common expression can be eliminated Simple example: a[i+1] = b[i+1] t1 = i+1 t2 = b[t1] t3 = i + 1 a[t3] = t2 t1 = i + 1 t2 = b[t1] t3 = i + 1 no longer live a[t1] = t2 6/30/2010 Principles of Compiler Design R.Venkadeshan 390

391 Now, suppose i is a constant: i = 4 t1 = i+1 t2 = b[t1] a[t1] = t2 i = 4 t1 = 5 t2 = b[t1] a[t1] = t2 i = 4 t1 = 5 t2 = b[5] a[5] = t2 Final Code: i = 4 t2 = b[5] a[5] = t2 6/30/2010 Principles of Compiler Design R.Venkadeshan 391

392 Optimizations on CFG Must take control flow into account Common Sub-expression Elimination Constant Propagation Dead Code Elimination Partial redundancy Elimination Applying one optimization may raise opportunities for other optimizations. 6/30/2010 Principles of Compiler Design R.Venkadeshan 392

393 Code Motion Move invariants out of the loop. Example: while (i <= limit - 2) becomes t := limit - 2 while (i <= t) Simple Loop Optimizations 6/30/2010 Principles of Compiler Design R.Venkadeshan 393

394 Three Address Code of Quick Sort
 1  i = m - 1            16  t7 = 4 * i
 2  j = n                17  t8 = 4 * j
 3  t1 = 4 * n           18  t9 = a[t8]
 4  v = a[t1]            19  a[t7] = t9
 5  i = i + 1            20  t10 = 4 * j
 6  t2 = 4 * i           21  a[t10] = x
 7  t3 = a[t2]           22  goto (5)
 8  if t3 < v goto (5)   23  t11 = 4 * i
 9  j = j - 1            24  x = a[t11]
10  t4 = 4 * j           25  t12 = 4 * i
11  t5 = a[t4]           26  t13 = 4 * n
12  if t5 > v goto (9)   27  t14 = a[t13]
13  if i >= j goto (23)  28  a[t12] = t14
14  t6 = 4 * i           29  t15 = 4 * n
15  x = a[t6]            30  a[t15] = x
6/30/2010 Principles of Compiler Design R.Venkadeshan 394

395 Find The Basic Blocks For the listing on the previous slide, the leaders are statement 1 (the first statement), statements 5, 9 and 23 (targets of jumps), and statements 13 and 14 (statements immediately following jumps). This gives the basic blocks B1 = 1-4, B2 = 5-8, B3 = 9-12, B4 = 13, B5 = 14-22, B6 = 23-30. 6/30/2010 Principles of Compiler Design R.Venkadeshan 395

396 Flow Graph
B1: i = m - 1; j = n; t1 = 4*n; v = a[t1]
B2: i = i + 1; t2 = 4*i; t3 = a[t2]; if t3 < v goto B2
B3: j = j - 1; t4 = 4*j; t5 = a[t4]; if t5 > v goto B3
B4: if i >= j goto B6
B5: t6 = 4*i; x = a[t6]; t7 = 4*i; t8 = 4*j; t9 = a[t8]; a[t7] = t9; t10 = 4*j; a[t10] = x; goto B2
B6: t11 = 4*i; x = a[t11]; t12 = 4*i; t13 = 4*n; t14 = a[t13]; a[t12] = t14; t15 = 4*n; a[t15] = x
6/30/2010 Principles of Compiler Design R.Venkadeshan 396

397 Common Subexpression Elimination The flow graph of the previous slide, with the local common subexpressions highlighted in the original figure: within B5, t7 = 4*i recomputes t6 = 4*i, and t10 = 4*j recomputes t8 = 4*j. 6/30/2010 Principles of Compiler Design R.Venkadeshan 397

398 Common Subexpression Elimination After eliminating t7 in B5 (reusing t6): B5: t6 = 4*i; x = a[t6]; t8 = 4*j; t9 = a[t8]; a[t6] = t9; t10 = 4*j; a[t10] = x; goto B2. The other blocks are unchanged. 6/30/2010 Principles of Compiler Design R.Venkadeshan 398

399 Common Subexpression Elimination After eliminating t10 in B5 (reusing t8): B5: t6 = 4*i; x = a[t6]; t8 = 4*j; t9 = a[t8]; a[t6] = t9; a[t8] = x; goto B2. 6/30/2010 Principles of Compiler Design R.Venkadeshan 399

400 Common Subexpression Elimination The same elimination applies in B6: t12 = 4*i recomputes t11 = 4*i, and t15 = 4*n recomputes t13 = 4*n (the candidates are highlighted in the original figure). 6/30/2010 Principles of Compiler Design R.Venkadeshan 400

401 Common Subexpression Elimination After eliminating t12 and t15 in B6 (reusing t11 and t13): B6: t11 = 4*i; x = a[t11]; t13 = 4*n; t14 = a[t13]; a[t11] = t14; a[t13] = x. 6/30/2010 Principles of Compiler Design R.Venkadeshan 401

402 Common Subexpression Elimination Current state after the local eliminations: B5: t6 = 4*i; x = a[t6]; t8 = 4*j; t9 = a[t8]; a[t6] = t9; a[t8] = x; goto B2. B6: t11 = 4*i; x = a[t11]; t13 = 4*n; t14 = a[t13]; a[t11] = t14; a[t13] = x. The remaining common subexpressions are global ones. 6/30/2010 Principles of Compiler Design R.Venkadeshan 402

403 Common Subexpression Elimination Global candidates (highlighted in the original figure): t6 = 4*i in B5 and t11 = 4*i in B6 recompute t2 = 4*i from B2, since i does not change between B2 and these blocks. 6/30/2010 Principles of Compiler Design R.Venkadeshan 403

404 Common Subexpression Elimination After replacing t6 by t2 in B5: B5: x = a[t2]; t8 = 4*j; t9 = a[t8]; a[t2] = t9; a[t8] = x; goto B2. B6 is still: t11 = 4*i; x = a[t11]; t13 = 4*n; t14 = a[t13]; a[t11] = t14; a[t13] = x. 6/30/2010 Principles of Compiler Design R.Venkadeshan 404

405 Common Subexpression Elimination Since a[t2] was already loaded into t3 in B2 (and neither a nor t2 changes in between), x = a[t2] in B5 becomes x = t3: B5: x = t3; t8 = 4*j; t9 = a[t8]; a[t2] = t9; a[t8] = x; goto B2. 6/30/2010 Principles of Compiler Design R.Venkadeshan 405

406 Common Subexpression Elimination Likewise, t8 = 4*j in B5 recomputes t4 = 4*j from B3: B5: x = t3; t9 = a[t4]; a[t2] = t9; a[t4] = x; goto B2. 6/30/2010 Principles of Compiler Design R.Venkadeshan 406

407 Common Subexpression Elimination And a[t4] was already loaded into t5 in B3, so t9 = a[t4] becomes t5: B5: x = t3; a[t2] = t5; a[t4] = x; goto B2. B6 is still: t11 = 4*i; x = a[t11]; t13 = 4*n; t14 = a[t13]; a[t11] = t14; a[t13] = x. 6/30/2010 Principles of Compiler Design R.Venkadeshan 407

408 Common Subexpression Elimination Similarly for B6 (t11 becomes t2, a[t11] becomes t3, and t13 becomes t1): B5: x = t3; a[t2] = t5; a[t4] = x; goto B2. B6: x = t3; t14 = a[t1]; a[t2] = t14; a[t1] = x. 6/30/2010 Principles of Compiler Design R.Venkadeshan 408

409 Dead Code Elimination. In the graph above, every use of x in B5 and B6 can be replaced by t3 (copy propagation); once that is done, the assignments x = t3 compute a value that is never used, i.e. they are dead code and can be removed.
6/30/2010 Principles of Compiler Design R.Venkadeshan 409

410 Dead Code Elimination. The result:
B1: i = m - 1; j = n; t1 = 4 * n; v = a[t1]
B2: i = i + 1; t2 = 4 * i; t3 = a[t2]; if t3 < v goto B2
B3: j = j - 1; t4 = 4 * j; t5 = a[t4]; if t5 > v goto B3
B4: if i >= j goto B6
B5: a[t2] = t5; a[t4] = t3; goto B2
B6: t14 = a[t1]; a[t2] = t14; a[t1] = t3
6/30/2010 Principles of Compiler Design R.Venkadeshan 410

411 Reduction in Strength. The only multiplications left are t2 = 4 * i in B2 and t4 = 4 * j in B3, recomputed on every trip around their loops. Since i increases by 1 and j decreases by 1 on each iteration, t2 and t4 change by +4 and -4 respectively, so the multiplications can be replaced by additions once t2 and t4 are initialized in B1.
6/30/2010 Principles of Compiler Design R.Venkadeshan 411

412 Reduction in Strength. The result:
B1: i = m - 1; j = n; t1 = 4 * n; v = a[t1]; t2 = 4 * i; t4 = 4 * j
B2: i = i + 1; t2 = t2 + 4; t3 = a[t2]; if t3 < v goto B2
B3: j = j - 1; t4 = t4 - 4; t5 = a[t4]; if t5 > v goto B3
B4: if i >= j goto B6
B5: a[t2] = t5; a[t4] = t3; goto B2
B6: t14 = a[t1]; a[t2] = t14; a[t1] = t3
6/30/2010 Principles of Compiler Design R.Venkadeshan 412
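The rewrite of t2 = 4 * i into t2 = t2 + 4 is mechanical enough to sketch in a few lines. The following is a minimal illustrative sketch in Python, not the slides' algorithm; three-address statements are modeled as hypothetical (op, dst, arg1, arg2) tuples.

def reduce_strength(block, ind_var, step, temp, factor):
    # Replace temp = factor * ind_var by temp = temp + factor*step,
    # assuming temp = factor * ind_var is computed once before the
    # loop (in B1 above) and ind_var changes by `step` each iteration.
    out = []
    for op, dst, a, b in block:
        if (op, dst, a, b) == ('*', temp, factor, ind_var):
            out.append(('+', temp, temp, factor * step))
        else:
            out.append((op, dst, a, b))
    return out

B2 = [('+', 'i', 'i', 1),            # i = i + 1
      ('*', 't2', 4, 'i'),           # t2 = 4 * i
      ('=[]', 't3', 'a', 't2')]      # t3 = a[t2]
print(reduce_strength(B2, ind_var='i', step=1, temp='t2', factor=4))

Applying the same call to B3 with step = -1 yields t4 = t4 - 4, matching the slide.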

413 The End 6/30/2010 Principles of Compiler Design R.Venkadeshan 413

414 Code Generation 6/30/2010 Principles of Compiler Design R.Venkadeshan 414

415 Introduction. Position of the code generator: source program -> front end -> intermediate code -> code optimizer -> intermediate code -> code generator -> target program, with every phase consulting the symbol table.
6/30/2010 Principles of Compiler Design R.Venkadeshan 415

416 Issues in the Design of a Code Generator. The details depend on the target language and the operating system, but the following issues are inherent in all code-generation problems: memory management, instruction selection, register allocation, and evaluation order.
6/30/2010 Principles of Compiler Design R.Venkadeshan 416

417 Input to the Code Generator. We assume the front end has scanned, parsed and translated the source program into a reasonably detailed intermediate representation; that type checking, type conversion and obvious semantic errors have already been handled; and that the symbol table is able to provide the run-time address of each data object. The intermediate representation may be postfix notation, three-address code, stack machine code, a syntax tree or a DAG.
6/30/2010 Principles of Compiler Design R.Venkadeshan 417

418 Target Programs. The output of the code generator is the target program, which may be: absolute machine language, which can be placed in a fixed location of memory and immediately executed; re-locatable machine language, which allows subprograms to be compiled separately, a set of re-locatable object modules then being linked together and loaded for execution by a linker; or assembly language, which is easier to generate.
6/30/2010 Principles of Compiler Design R.Venkadeshan 418

419 Instruction Selection. The nature of the instruction set of the target machine determines the difficulty of instruction selection: uniformity and completeness of the instruction set are important, and instruction speed matters as well. Say x = y + z:
MOV y, R0; ADD z, R0; MOV R0, x
Statement-by-statement code generation often produces poor code.
6/30/2010 Principles of Compiler Design R.Venkadeshan 419

420 Instruction Selection (2). Translating a = b + c; d = a + e statement by statement gives:
MOV b, R0; ADD c, R0; MOV R0, a; MOV a, R0; ADD e, R0; MOV R0, d
The fourth instruction is redundant, since it reloads the value just stored, and so is the third if a is not subsequently used.
6/30/2010 Principles of Compiler Design R.Venkadeshan 420

421 Instruction Selection (3). The quality of the generated code is determined by its speed and size, and the cost difference between different implementations may be significant. Say a = a + 1:
MOV a, R0; ADD #1, R0; MOV R0, a
If the target machine has an increment instruction (INC), we can instead write:
INC a
6/30/2010 Principles of Compiler Design R.Venkadeshan 421
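Both improvements, removing a load of a value just stored and using INC, are classic peephole rewrites over a small sliding window. As a hedged illustration only (instructions modeled as plain strings; this is not the slides' algorithm, and the INC rule here only covers register operands):

def peephole(code):
    out = []
    for instr in code:
        parts = instr.replace(',', ' ').split()
        if out:
            prev = out[-1].replace(',', ' ').split()
            # "MOV x, R" right after "MOV R, x" reloads a value that
            # was just stored, so the load can be dropped
            if (parts[0] == 'MOV' and prev[0] == 'MOV'
                    and parts[1] == prev[2] and parts[2] == prev[1]):
                continue
        if parts[0] == 'ADD' and parts[1] == '#1':
            out.append('INC ' + parts[2])    # use INC where it exists
            continue
        out.append(instr)
    return out

print(peephole(['MOV b, R0', 'ADD c, R0', 'MOV R0, a',
                'MOV a, R0', 'ADD e, R0', 'MOV R0, d']))
# the redundant "MOV a, R0" from slide 420 disappears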

422 Register Allocation. Instructions involving register operands are usually shorter and faster than those involving operands in memory, so efficient utilization of registers is particularly important in code generation. The use of registers is subdivided into two subproblems: during register allocation, we select the set of variables that will reside in registers at a point in the program; during a subsequent register assignment phase, we pick the specific register that each variable will reside in.
6/30/2010 Principles of Compiler Design R.Venkadeshan 422

423 Basic Blocks and Flow Graphs. A flow graph is a graph representation of three-address statements: nodes in the flow graph represent computations, and edges represent the flow of control. Basic block: a basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end, without halting or branching except possibly at the end.
6/30/2010 Principles of Compiler Design R.Venkadeshan 423

424 Basic Blocks and Flow Graphs (2). This is a basic block:
t1 = a*a; t2 = a*b; t3 = 2*t2; t4 = t1 + t3; t5 = b*b; t6 = t4 + t5
A three-address statement x = y + z is said to define x and to use y and z. A name in a basic block is said to be live at a given point if its value is used after that point in the program, perhaps in another basic block.
6/30/2010 Principles of Compiler Design R.Venkadeshan 424

425 Basic Blocks and Flow Graphs (3). Partition into basic blocks. Method: we first determine the leaders. The first statement is a leader; any statement that is the target of a conditional or unconditional goto is a leader; any statement that immediately follows a conditional or unconditional goto is a leader. For each leader, its basic block consists of the leader and all statements up to, but not including, the next leader or the end of the program.
6/30/2010 Principles of Compiler Design R.Venkadeshan 425

426 Basic Blocks and Flow Graphs (4).
B1: (1) prod = 0; (2) i = 1
B2: (3) t1 = 4*i; (4) t2 = a[t1]; (5) t3 = 4*i; (6) t4 = b[t3]; (7) t5 = t2 * t4; (8) t6 = prod + t5; (9) prod = t6; (10) t7 = i + 1; (11) i = t7; (12) if i <= 20 goto (3)
6/30/2010 Principles of Compiler Design R.Venkadeshan 426

427 Transformations on Basic Blocks. A basic block computes a set of expressions, and transformations are useful for improving the quality of the code. Two important classes of local optimizations can be applied to a basic block: structure-preserving transformations and algebraic transformations.
6/30/2010 Principles of Compiler Design R.Venkadeshan 427

428 Structure Preserving Transformations. Common subexpression elimination:
before:  a = b + c; b = a - d; c = b + c; d = a - d
after:   a = b + c; b = a - d; c = b + c; d = b
The two occurrences of a - d compute the same value, so d = a - d becomes d = b; the two occurrences of b + c do not, because b is redefined between them.
6/30/2010 Principles of Compiler Design R.Venkadeshan 428
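The "b is redefined" check can be automated with a table of available expressions. An illustrative sketch (the (dst, op, arg1, arg2) statement encoding is an assumption, not the slides' notation):

def local_cse(block):
    avail, out = {}, []                 # (op, a, b) -> name holding the value
    for dst, op, a, b in block:
        key = (op, a, b)
        if key in avail:                # value already computed: reuse it
            out.append((dst, '=', avail[key], None))
        else:
            out.append((dst, op, a, b))
        # assigning dst invalidates every expression built from dst
        avail = {k: v for k, v in avail.items()
                 if dst not in (k[1], k[2]) and v != dst}
        if key not in avail and dst not in (a, b):
            avail[key] = dst
    return out

block = [('a', '+', 'b', 'c'), ('b', '-', 'a', 'd'),
         ('c', '+', 'b', 'c'), ('d', '-', 'a', 'd')]
print(local_cse(block))
# d = a - d becomes d = b, while c = b + c is recomputed because
# the redefinition of b removed b + c from the table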

429 Structure Preserving Transformations. Dead code elimination: say x is dead, that is, never subsequently used, at the point where the statement x = y + z appears in a block; then this statement may be safely removed. Renaming temporary variables: say t = b + c, where t is a temporary; if we rename it to u = b + c, then every instance of t must be changed to u. Interchange of statements: t1 = b + c and t2 = x + y can be interchanged iff neither x nor y is t1 and neither b nor c is t2.
6/30/2010 Principles of Compiler Design R.Venkadeshan 429

430 Algebraic Transformations. Replace expensive expressions by cheaper ones: X = X + 0 and X = X * 1 can be eliminated; X = y**2 is expensive (exponentiation is normally implemented by a function call) and can be replaced by X = y * y. Flow graph: we can add flow-of-control information to the set of basic blocks making up a program by constructing a directed graph called a flow graph. There is a directed edge from block B1 to block B2 if there is a conditional or unconditional jump from the last statement of B1 to the first statement of B2, or if B2 immediately follows B1 in the order of the program and B1 does not end in an unconditional jump.
6/30/2010 Principles of Compiler Design R.Venkadeshan 430
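These rewrites are easy to express over three-address tuples. A minimal illustrative sketch, again assuming a hypothetical (op, dst, arg1, arg2) encoding:

def algebraic(block):
    out = []
    for op, dst, a, b in block:
        if op == '+' and b == 0:          # x = y + 0  ->  x = y
            out.append(('=', dst, a, None))
        elif op == '*' and b == 1:        # x = y * 1  ->  x = y
            out.append(('=', dst, a, None))
        elif op == '**' and b == 2:       # x = y**2   ->  x = y * y
            out.append(('*', dst, a, a))  # avoids the function call
        else:
            out.append((op, dst, a, b))
    return out

print(algebraic([('+', 'x', 'x', 0), ('**', 'x', 'y', 2)]))
# x = x + 0 becomes the copy x = x (a later pass can delete it);
# x = y**2 becomes x = y * y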

431 Loops. A loop is a collection of nodes in a flow graph such that: all nodes in the collection are strongly connected, that is, from any node in the loop to any other there is a path of length one or more wholly within the loop; and the collection of nodes has a unique entry, that is, a node in the loop such that the only way to reach a node of the loop from a node outside the loop is to first go through the entry.
6/30/2010 Principles of Compiler Design R.Venkadeshan 431

432 The DAG Representation of a Basic Block. For the block:
(1) A = 4*i; (2) B = a[A]; (3) C = 4*i; (4) D = b[C]; (5) E = B * D; (6) F = prod + E; (7) prod = F; (8) G = i + 1; (9) i = G; (10) if i <= 20 goto (1)
[DAG figure: interior nodes <=, +, * and [] built over the leaves prod, a, b, 4, i, 1 and 20; statements (1) and (3) share the single node for 4*i]
6/30/2010 Principles of Compiler Design R.Venkadeshan 432

433 Outline Code Generation Issues Target language Issues Addresses in Target Code Basic Blocks and Flow Graphs Optimizations of Basic Blocks A Simple Code Generator Peephole optimization Register allocation and assignment Instruction selection by tree rewriting 6/30/2010 Principles of Compiler Design R.Venkadeshan 433

434 Introduction. The final phase of a compiler is the code generator. It receives an intermediate representation (IR) with supplementary information in the symbol table and produces a semantically equivalent target program. The code generator's main tasks are instruction selection, register allocation and assignment, and instruction ordering. Position: front end -> code optimizer -> code generator.
6/30/2010 Principles of Compiler Design R.Venkadeshan 434

435 Issues in the Design of a Code Generator. The most important criterion is that the code generator produce correct code. Its input is the IR plus the symbol table; we assume the front end produces a low-level IR, i.e. one whose names' values can be directly manipulated by machine instructions, and that syntactic and semantic errors have already been detected. For the target program, common architectures are RISC, CISC and stack-based machines; in this chapter we use a very simple RISC-like computer, extended with some CISC-like addressing modes.
6/30/2010 Principles of Compiler Design R.Venkadeshan 435

436 The complexity of the mapping is determined by the level of the IR, the nature of the instruction-set architecture, and the desired quality of the generated code. For x = y + z:
LD R0, y; ADD R0, R0, z; ST x, R0
For a = b + c; d = a + e:
LD R0, b; ADD R0, R0, c; ST a, R0; LD R0, a; ADD R0, R0, e; ST d, R0
6/30/2010 Principles of Compiler Design R.Venkadeshan 436

437 Register Allocation. Two subproblems: register allocation, selecting the set of variables that will reside in registers at each point in the program, and register assignment, selecting the specific register that a variable resides in. Complications are imposed by the hardware architecture, for example register pairs for multiplication and division:
t = a + b; t = t * c; t = t / d  gives  L R1, a; A R1, b; M R0, c; D R0, d; ST R1, t
t = a + b; t = t + c; t = t / d  gives  L R0, a; A R0, b; A R0, c; SRDA R0, 32; D R0, d; ST R1, t
6/30/2010 Principles of Compiler Design R.Venkadeshan 437

438 A simple target machine model Load operations: LD r,x and LD r1, r2 Store operations: ST x,r Computation operations: OP dst, src1, src2 Unconditional jumps: BR L Conditional jumps: Bcond r, L like BLTZ r, L 6/30/2010 Principles of Compiler Design R.Venkadeshan 438

439 Addressing Modes. Variable name: x. Indexed address: a(r), e.g. LD R1, a(R2) means R1 = contents(a + contents(R2)). Integer indexed by a register: e.g. LD R1, 100(R2). Indirect addressing modes: *r and *100(r). Immediate constant addressing mode: e.g. LD R1, #100.
6/30/2010 Principles of Compiler Design R.Venkadeshan 439

440 b = a[i]
LD R1, i          // R1 = i
MUL R1, R1, 8     // R1 = R1 * 8
LD R2, a(R1)      // R2 = contents(a + contents(R1))
ST b, R2          // b = R2
6/30/2010 Principles of Compiler Design R.Venkadeshan 440

441 a[j] = c
LD R1, c          // R1 = c
LD R2, j          // R2 = j
MUL R2, R2, 8     // R2 = R2 * 8
ST a(R2), R1      // contents(a + contents(R2)) = R1
6/30/2010 Principles of Compiler Design R.Venkadeshan 441

442 x = *p
LD R1, p          // R1 = p
LD R2, 0(R1)      // R2 = contents(0 + contents(R1))
ST x, R2          // x = R2
6/30/2010 Principles of Compiler Design R.Venkadeshan 442

443 if x < y goto L (a conditional-jump three-address instruction)
LD R1, x          // R1 = x
LD R2, y          // R2 = y
SUB R1, R1, R2    // R1 = R1 - R2
BLTZ R1, M        // if R1 < 0 jump to M, the label of L's target code
6/30/2010 Principles of Compiler Design R.Venkadeshan 443
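The four translations on slides 440-443 each follow a fixed template, so a statement-by-statement emitter can be sketched directly. Illustrative only: the statement encoding, the 8-byte element size and the fixed registers R1/R2 are assumptions.

def gen(stmt):
    kind = stmt[0]
    if kind == 'load_index':                 # b = a[i]
        _, b, a, i = stmt
        return [f'LD R1, {i}', 'MUL R1, R1, 8',
                f'LD R2, {a}(R1)', f'ST {b}, R2']
    if kind == 'store_index':                # a[j] = c
        _, a, j, c = stmt
        return [f'LD R1, {c}', f'LD R2, {j}',
                'MUL R2, R2, 8', f'ST {a}(R2), R1']
    if kind == 'load_indirect':              # x = *p
        _, x, p = stmt
        return [f'LD R1, {p}', 'LD R2, 0(R1)', f'ST {x}, R2']
    if kind == 'cond_jump':                  # if x < y goto M
        _, x, y, label = stmt
        return [f'LD R1, {x}', f'LD R2, {y}',
                'SUB R1, R1, R2', f'BLTZ R1, {label}']
    raise ValueError('unknown statement kind: %s' % kind)

for line in gen(('load_index', 'b', 'a', 'i')):
    print(line)                              # reproduces slide 440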

444 Costs associated with the addressing modes:
LD R0, R1         cost = 1
LD R0, M          cost = 2
LD R1, *100(R2)   cost = 3
6/30/2010 Principles of Compiler Design R.Venkadeshan 444

445 Addresses in the Target Code. Run-time storage comprises: Code, a statically determined area; Static, a statically determined data area; Heap, a dynamically managed area; and Stack, a dynamically managed area.
6/30/2010 Principles of Compiler Design R.Venkadeshan 445

446 Three-address statements for procedure calls and returns: call callee; return; halt; action (standing for other three-address statements).
6/30/2010 Principles of Compiler Design R.Venkadeshan 446

447 Target program for a sample call and return 6/30/2010 Principles of Compiler Design R.Venkadeshan 447

448 Stack Allocation. Return to caller: in the callee, BR *0(SP); in the caller, SUB SP, SP, #caller.recordsize. The branch to a called procedure works analogously (see the sketch below).
6/30/2010 Principles of Compiler Design R.Venkadeshan 448
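A hedged sketch of emitting these sequences (the three-instruction call sequence and the constant 16 added to `here` to form the return address follow the usual textbook layout; both are assumptions here, as are the names):

def gen_call(callee, caller_recordsize, here):
    return [f'ADD SP, SP, #{caller_recordsize}',  # SP -> callee's record
            f'ST *SP, #{here + 16}',              # save the return address
            f'BR {callee}']                       # transfer to the callee

def gen_return(caller_recordsize):
    return (['BR *0(SP)'],                            # in callee: jump back
            [f'SUB SP, SP, #{caller_recordsize}'])    # in caller: pop record

print(gen_call('callee.codearea', 300, here=100))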

449 Target code for stack allocation 6/30/2010 Principles of Compiler Design R.Venkadeshan 449

450 Basic blocks and flow graphs Partition the intermediate code into basic blocks The flow of control can only enter the basic block through the first instruction in the block. That is, there are no jumps into the middle of the block. Control will leave the block without halting or branching, except possibly at the last instruction in the block. The basic blocks become the nodes of a flow graph 6/30/2010 Principles of Compiler Design R.Venkadeshan 450

451 rules for finding leaders The first three-address instruction in the intermediate code is a leader. Any instruction that is the target of a conditional or unconditional jump is a leader. Any instruction that immediately follows a conditional or unconditional jump is a leader. 6/30/2010 Principles of Compiler Design R.Venkadeshan 451
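These three rules translate directly into code. A minimal illustrative sketch (instructions are modeled as dicts with an optional 'goto' field holding a 0-based target index, whereas the slides number statements from 1; the encoding is an assumption):

def basic_blocks(code):
    leaders = {0}                            # rule 1: the first instruction
    for i, instr in enumerate(code):
        if 'goto' in instr:
            leaders.add(instr['goto'])       # rule 2: any jump target
            if i + 1 < len(code):
                leaders.add(i + 1)           # rule 3: instruction after a jump
    starts = sorted(leaders)
    return [code[s:e] for s, e in zip(starts, starts[1:] + [len(code)])]

code = [{'op': 'prod = 0'}, {'op': 'i = 1'},
        {'op': 't1 = 4*i'}, {'op': 'i = t7'},
        {'op': 'if i <= 20', 'goto': 2}]
for number, block in enumerate(basic_blocks(code), start=1):
    print('B%d:' % number, [instr['op'] for instr in block])

Each leader's block runs up to, but not including, the next leader, which is exactly what the slicing on the last line of basic_blocks computes.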

452 Intermediate code to set a 10*10 matrix to an identity matrix 6/30/2010 Principles of Compiler Design R.Venkadeshan 452

453 Flow graph based on Basic Blocks 6/30/2010 Principles of Compiler Design R.Venkadeshan 453

454 Liveness and Next-Use Information. We wish to determine, for each three-address statement x = y + z, what the next uses of x, y and z are. Algorithm, scanning the block backwards from its last statement: attach to statement i the information currently found in the symbol table regarding the next use and liveness of x, y and z; in the symbol table, set x to "not live" and "no next use"; in the symbol table, set y and z to "live" and the next uses of y and z to i.
6/30/2010 Principles of Compiler Design R.Venkadeshan 454
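An illustrative sketch of this backward scan (statements are modeled as (dst, uses) pairs and next-use information as a dict per statement; the encoding is an assumption):

def next_use(block, live_on_exit):
    live = {v: 'exit' for v in live_on_exit}   # name -> next-use index
    info = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):    # scan the block backwards
        dst, uses = block[i]
        # step 1: attach the information currently recorded
        info[i] = {v: live.get(v) for v in (dst,) + tuple(uses)}
        live.pop(dst, None)                    # step 2: x not live, no next use
        for v in uses:                         # step 3: y, z live, next use = i
            live[v] = i
    return info

block = [('t1', ('a', 'a')), ('t2', ('a', 'b')), ('x', ('t1', 't2'))]
for i, inf in enumerate(next_use(block, live_on_exit={'x'})):
    print(i, inf)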

455 DAG representation of basic blocks There is a node in the DAG for each of the initial values of the variables appearing in the basic block. There is a node N associated with each statement s within the block. The children of N are those nodes corresponding to statements that are the last definitions, prior to s, of the operands used by s. Node N is labeled by the operator applied at s, and also attached to N is the list of variables for which it is the last definition within the block. Certain nodes are designated output nodes. These are the nodes whose variables are live on exit from the block. 6/30/2010 Principles of Compiler Design R.Venkadeshan 455

456 Code Improving Transformations. We can eliminate local common subexpressions, that is, instructions that compute a value that has already been computed. We can eliminate dead code, that is, instructions that compute a value that is never used. We can reorder statements that do not depend on one another; such reordering may reduce the time a temporary value needs to be preserved in a register. We can apply algebraic laws to reorder operands of three-address instructions, and sometimes thereby simplify the computation. A sketch of the second transformation follows.
6/30/2010 Principles of Compiler Design R.Venkadeshan 456
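Dead-code elimination falls out of a backward liveness scan like the one on slide 454. A hypothetical sketch (statements as (dst, uses) pairs; a statement with side effects would need dst = None so that it is always kept):

def eliminate_dead(block, live_on_exit):
    live, kept = set(live_on_exit), []
    for dst, uses in reversed(block):
        if dst is None or dst in live:   # the computed value is needed
            kept.append((dst, uses))
            live.discard(dst)
            live.update(uses)
        # otherwise dst is never used afterwards: drop the statement
    return kept[::-1]

block = [('t1', ('a', 'b')), ('x', ('t1',)), ('y', ('t1',))]
print(eliminate_dead(block, live_on_exit={'x'}))
# y = ... is dropped; t1 and x survive because x is live on exit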

457 DAG for basic block 6/30/2010 Principles of Compiler Design R.Venkadeshan 457

458 DAG for basic block 6/30/2010 Principles of Compiler Design R.Venkadeshan 458

459 Array Accesses in a DAG. An assignment from an array, like x = a[i], is represented by creating a node with operator =[] and two children representing the initial value of the array, a0 in this case, and the index i; variable x becomes a label of this new node. An assignment to an array, like a[j] = y, is represented by a new node with operator []= and three children representing a0, j and y; there is no variable labeling this node. What is different is that the creation of this node kills all currently constructed nodes whose value depends on a0. A node that has been killed cannot receive any more labels; that is, it cannot become a common subexpression. A sketch of this kill rule follows.
6/30/2010 Principles of Compiler Design R.Venkadeshan 459
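A sketch of DAG construction including this kill rule (the node and statement encodings are hypothetical, and for simplicity the kill step only marks nodes whose first child is the array's node, a weaker check than "depends on a0"):

class Node:
    def __init__(self, op, children=()):
        self.op, self.children = op, tuple(children)
        self.labels, self.killed = [], False

def build_dag(block):
    nodes, current = [], {}                  # current: name -> defining node
    def leaf(v):
        if v not in current:
            n = Node(v); nodes.append(n); current[v] = n
        return current[v]
    def find_or_make(op, kids):
        for n in nodes:                      # reusing a node = local CSE,
            if n.op == op and n.children == kids and not n.killed:
                return n                     # but never reuse a killed node
        n = Node(op, kids); nodes.append(n); return n
    for dst, op, args in block:
        n = find_or_make(op, tuple(leaf(a) for a in args))
        if op == '[]=':                      # a[j] = y kills nodes built on a
            for m in nodes:
                if m is not n and m.children and m.children[0] is n.children[0]:
                    m.killed = True
        else:
            n.labels.append(dst); current[dst] = n
    return nodes

# x = a[i]; a[j] = y; z = a[i]: the second a[i] must not reuse the first
block = [('x', '=[]', ('a', 'i')),
         (None, '[]=', ('a', 'j', 'y')),
         ('z', '=[]', ('a', 'i'))]
print([(n.op, n.labels, n.killed) for n in build_dag(block) if n.children])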

460 DAG for a sequence of array assignments 6/30/2010 Principles of Compiler Design R.Venkadeshan 460
