PSD3A Principles of Compiler Design: Units I-V
UNIT I - SYLLABUS: Compiler, Assembler, Language Processing System, Phases of Compiler, Lexical Analyser, Finite Automata, NFA, DFA, Compiler Tools
Compiler - A compiler is computer software that translates code written in one programming language (the source language) into another computer language (the target language).
Compiler Architecture
Assembler
Language-Processing System
Compiler Vs Interpreter
Phases of Compiler
Lexical Analyser www.csd.uwo.ca/~moreno/cs447/lectures/introduction.html/node10.html
Finite Automata - A finite automaton (FA), also called a finite state machine (FSM), is an abstract model of a computing entity that decides whether to accept or reject a string. Every regular expression can be represented as an FA, and vice versa. Two types of FAs: Non-deterministic (NFA): may have more than one alternative action for the same input symbol. Deterministic (DFA): has at most one action for a given state and input symbol.
Scanner generator - Main steps of scanner generation (e.g., Lex): convert a regular expression to a non-deterministic finite automaton (NFA) via Thompson's construction; convert the NFA to a deterministic finite automaton (DFA) via the subset construction; minimize the DFA to reduce the number of states; generate a program in C or some other language to simulate the DFA. Pipeline: RE -> (Thompson construction) -> NFA -> (subset construction) -> DFA -> (minimization) -> minimized DFA -> (DFA simulation) -> scanner program.
Non-deterministic Finite Automata (NFA) - An NFA is a 5-tuple (S, Σ, δ, s0, F): S: a finite set of states; Σ: the input alphabet; δ: the transition function, mapping a (state, symbol) pair to a set of states; s0 ∈ S: the start state; F ⊆ S: the set of final (accepting) states. Non-deterministic: a state and symbol pair can be mapped to a set of states. Finite: the number of states is finite.
Transition Diagram - An FA can be represented by a transition diagram. Corresponding to the FA definition, a transition diagram has: states, represented by circles; an alphabet (Σ), represented by labels on edges; transitions, represented by labeled directed edges between states, where the label is the input symbol; one start state, marked with an incoming arrowhead; one or more final states, represented by double circles. Example: a transition diagram to recognize (a|b)*abb.
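As a sketch (not from the slides), the DFA that such a diagram describes can be simulated with a transition table. The state numbering below is an assumption: state 0 is the start state and state 3 is the only accepting state, following the usual minimized DFA for (a|b)*abb.

```python
# Transition table for the (assumed) minimized DFA of (a|b)*abb.
# Key: (current state, input symbol) -> next state.
DFA = {
    (0, 'a'): 1, (0, 'b'): 0,
    (1, 'a'): 1, (1, 'b'): 2,
    (2, 'a'): 1, (2, 'b'): 3,
    (3, 'a'): 1, (3, 'b'): 0,
}

def accepts(s):
    """Simulate the DFA: accept iff we end in state 3."""
    state = 0
    for ch in s:
        state = DFA[(state, ch)]
    return state == 3

print(accepts("aabb"))   # True: ends in ...abb
print(accepts("abba"))   # False
```

The table-driven loop is exactly what a generated scanner executes: one lookup per input character, no backtracking.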
Compiler Tools
UNIT II - SYLLABUS: Context-Free Grammar, Parse Tree, Leftmost & Rightmost Derivation, Derivation Trees, Ambiguous Grammar, Parser, Types of Parser, Shift Reduce Parsing
Context-Free Grammars - Languages generated by context-free grammars are context-free languages. Context-free grammars are more expressive than finite automata: if a language L is accepted by a finite automaton, then L can be generated by a context-free grammar. Definition: a context-free grammar is a 4-tuple (Σ, NT, R, S), where: Σ is an alphabet (each character in Σ is called a terminal); NT is a set (each element of NT is called a nonterminal); R, the set of rules, is a subset of NT × (Σ ∪ NT)*; S, the start symbol, is one of the symbols in NT.
Parse Tree - A parse tree of a derivation is a tree in which: each internal node is labeled with a nonterminal; if a rule A -> A1 A2 ... An occurs in the derivation, then A is a parent node of nodes labeled A1, A2, ..., An. (Example parse tree shown on slide.)
Leftmost & Rightmost Derivations - A leftmost derivation of a sentential form is one in which a rule is always applied to the leftmost nonterminal. A rightmost derivation is one in which a rule is always applied to the rightmost nonterminal. (Example grammar and derivations shown on slide.)
Derivation Trees - Example derivation trees for w = aabb. (Figures shown on slide.)
Ambiguity & Disambiguation - Given an ambiguous grammar, we would like an equivalent unambiguous grammar. It lets us know more about the structure of a given derivation, simplifies inductive proofs on derivations, and can lead to more efficient parsing algorithms. In programming languages, we want to impose a canonical structure on derivations, e.g., for 1+2*3.
Role of Parser
Types of Parser https://www.tutorialspoint.com/compiler_design/compiler_design_types_of_parsing.htm
Bottom-Up Parsing
Top-Down Parsing
Shift-reduce Parsing - Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the tokens to be parsed. Shift: push the next input token onto the top of the stack. Reduce: the right end of the string to be reduced must be at the top of the stack; locate the left end of the string within the stack and decide which nonterminal replaces it. Accept: announce successful completion of parsing. Error: discover a syntax error and call an error-recovery routine.
Shift-reduce Parsing
Predictive Parser - A predictive parser is a recursive descent parser that can predict which production is to be used to replace the input string. The predictive parser does not suffer from backtracking. To accomplish its task, it uses a lookahead pointer, which points to the next input symbols. To make the parser backtracking-free, the predictive parser puts some constraints on the grammar and accepts only the class of grammars known as LL(k) grammars.
UNIT III - SYLLABUS: Variants of Syntax Trees, Three-address code, Types and declarations, Translation of expressions, Type checking, Control flow, Backpatching
Intermediate Code - Intermediate code is the interface between the front end and the back end in a compiler. Ideally the details of the source language are confined to the front end and the details of the target machine to the back end, so m front ends and n back ends can be combined without building m*n separate compilers. In this chapter we study intermediate representations, static type checking, and intermediate code generation. Pipeline: Parser -> Static Checker -> Intermediate Code Generator -> Code Generator (front end | back end).
Variants of syntax trees - It is sometimes beneficial to create a DAG instead of a tree for expressions. This way we can easily expose the common subexpressions and use that knowledge during code generation. Example: a + a*(b-c) + (b-c)*d. (DAG shown on slide.)
Value-number method for constructing DAGs - Algorithm: search the node array for a node M with label op, left child l, and right child r. If there is such a node, return the value number of M. If not, create in the array a new node N with label op, left child l, and right child r, and return its value number. We may use a hash table to make the search fast. (Example for i = i + 10 shown on slide.)
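A minimal sketch of this algorithm (the data layout is an assumption): a node array plus a dict keyed by (op, left, right) playing the role of the hash table, so an identical node is never created twice.

```python
nodes = []    # node array: ('leaf', name) or (op, left_vn, right_vn)
table = {}    # signature -> value number (index into nodes)

def value_number(sig):
    if sig not in table:           # no node with this label and children
        nodes.append(sig)          # create a new node N
        table[sig] = len(nodes) - 1
    return table[sig]              # return the (possibly existing) number

def leaf(name):
    return value_number(('leaf', name))

def node(op, left, right):
    return value_number((op, left, right))

# i = i + 10: building '+' twice yields the same value number,
# so the DAG shares one '+' node instead of duplicating it.
n1 = node('+', leaf('i'), leaf('10'))
n2 = node('+', leaf('i'), leaf('10'))
print(n1 == n2, len(nodes))  # True 3
```

Only three nodes exist afterward (two leaves and one '+'), which is exactly the sharing the DAG is meant to capture.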
Three address code - In three-address code there is at most one operator on the right side of an instruction. Example, for a + a*(b-c) + (b-c)*d: t1 = b - c; t2 = a * t1; t3 = a + t2; t4 = t1 * d; t5 = t3 + t4. www.geeksforgeeks.org/intermediate-code-generation-in-compiler-design
Forms of three address instructions: x = y op z; x = op y; x = y; goto L; if x goto L and ifFalse x goto L; if x relop y goto L; procedure calls using param x and call p,n; y = call p,n; x = y[i] and x[i] = y; x = &y, x = *y, and *x = y.
Data structures for three address codes - Quadruples: four fields: op, arg1, arg2, and result. Triples: temporaries are not used; instead, references to instructions are made. Indirect triples: in addition to the triples, we use a list of pointers to triples.
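The three representations can be sketched side by side for the fragment t1 = b - c; t2 = a * t1 (the tuple encodings below are an assumption, one plausible way to lay the fields out):

```python
# Quadruples: explicit result field holds the temporary's name.
quadruples = [
    ('-', 'b', 'c', 't1'),
    ('*', 'a', 't1', 't2'),
]

# Triples: no temporaries; an operand may instead be a reference to an
# earlier instruction, written here as ('ref', index).
triples = [
    ('-', 'b', 'c'),              # (0)
    ('*', 'a', ('ref', 0)),       # (1) second operand is the result of (0)
]

# Indirect triples: a separate list of pointers (indices) into the
# triples, so instructions can be reordered by permuting this list
# without rewriting any ('ref', ...) operands.
instr_order = [0, 1]

print(triples[instr_order[1]])  # ('*', 'a', ('ref', 0))
```

This shows the trade-off the slide names: quadruples make moving instructions easy (temporaries are named), triples save space but embed positions, and indirect triples recover reorderability.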
Type Expressions - Example: int[2][3] is array(2, array(3, integer)). A basic type is a type expression. A type name is a type expression. A type expression can be formed by applying the array type constructor to a number and a type expression. A record is a data structure with named fields. A type expression can be formed by using the type constructor -> for function types. Type expressions may contain variables whose values are type expressions.
Backpatching - Previous codes for Boolean expressions insert symbolic labels for jumps, so a separate pass is needed to set them to appropriate addresses. We can use a technique named backpatching to avoid this. We assume we save instructions into an array, and labels are indices into that array. For nonterminal B we use two attributes, B.truelist and B.falselist, together with the functions makelist(i), merge(p1, p2), and backpatch(p, i).
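The three helper functions can be sketched directly over such an instruction array (the [opcode, target] cell layout is an assumption; unfilled jump targets start as None):

```python
code = []  # instruction array: each entry is [opcode, target]

def emit(op):
    code.append([op, None])        # jump with target not yet known
    return len(code) - 1           # index (label) of the new instruction

def makelist(i):
    return [i]                     # new list containing only instruction i

def merge(p1, p2):
    return p1 + p2                 # concatenation of the two patch lists

def backpatch(p, i):
    for j in p:                    # fill target i into every jump in p
        code[j][1] = i

truelist = makelist(emit('goto'))   # e.g. B.truelist
falselist = makelist(emit('goto'))  # e.g. B.falselist
backpatch(merge(truelist, falselist), 5)
print(code)  # [['goto', 5], ['goto', 5]]
```

Both jumps are emitted with unknown targets and patched in one step once the destination index (here 5) becomes known, which is exactly what removes the separate label-resolution pass.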
Type Equivalence - Two type expressions are structurally equivalent if: they are the same basic type; or they are formed by applying the same constructor to structurally equivalent types; or one is a type name that denotes the other.
Three-address code for expressions
Addressing Array Elements - Layouts for a two-dimensional array:
Control Flow - Boolean expressions are often used to: alter the flow of control; compute logical values. Short-circuit code.
UNIT IV - SYLLABUS: Optimization Rules, Basic Blocks, Control Flow Graph (CFG), Loops, Local Optimizations, Peephole optimization
Levels of Optimizations - Local: inside a basic block. Global (intraprocedural): across basic blocks; whole-procedure analysis. Interprocedural: across procedures; whole-program analysis.
Basic Blocks - A basic block is a maximal sequence of consecutive three-address instructions with the following properties: the flow of control can only enter the basic block through the first instruction; control leaves the block without halting or branching, except possibly at the last instruction. Basic blocks become the nodes of a flow graph, with edges indicating the order. https://www.youtube.com/watch?v=bc3yshc5rh0
Examples - Source:
for i from 1 to 10 do
  for j from 1 to 10 do
    a[i,j] = 0.0
for i from 1 to 10 do
  a[i,i] = 1.0
Three-address code:
1) i = 1
2) j = 1
3) t1 = 10 * i
4) t2 = t1 + j
5) t3 = 8 * t2
6) t4 = t3 - 88
7) a[t4] = 0.0
8) j = j + 1
9) if j <= 10 goto (3)
10) i = i + 1
11) if i <= 10 goto (2)
12) i = 1
13) t5 = i - 1
14) t6 = 88 * t5
15) a[t6] = 1.0
16) i = i + 1
17) if i <= 10 goto (13)
Identifying Basic Blocks - Input: a sequence of instructions instr(i). Output: a list of basic blocks. Method: identify the leaders, the first instruction of each basic block; then iterate: add subsequent instructions to the basic block until we reach another leader.
Identifying Leaders - Rules for finding leaders in code: the first instruction in the code is a leader; any instruction that is the target of a (conditional or unconditional) jump is a leader; any instruction that immediately follows a (conditional or unconditional) jump is a leader.
Basic Block Partition Algorithm
leaders = {1}                        // start of program
for i = 1 to n                       // all instructions
    if instr(i) is a branch
        leaders = leaders U targets of instr(i) U {instr(i+1)}
worklist = leaders
while worklist not empty
    x = first instruction in worklist
    worklist = worklist - {x}
    block(x) = {x}
    for (i = x + 1; i <= n && i not in leaders; i++)
        block(x) = block(x) U {i}
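The leader-finding half of the algorithm can be sketched as follows (the (text, jump_target) instruction encoding is an assumption; numbering is 1-based as in the slides):

```python
def find_leaders(instrs):
    """Apply the three leader rules to a 1-indexed instruction list."""
    leaders = {1}                          # rule 1: first instruction
    for i, (_, target) in enumerate(instrs, start=1):
        if target is not None:             # instr(i) is a branch
            leaders.add(target)            # rule 2: the jump target
            if i + 1 <= len(instrs):
                leaders.add(i + 1)         # rule 3: instr after the jump
    return sorted(leaders)

prog = [
    ('i = 1', None),            # 1
    ('j = i + 1', None),        # 2
    ('if j < 10 goto 2', 2),    # 3
    ('k = 0', None),            # 4
]
print(find_leaders(prog))  # [1, 2, 4]
```

Each leader then starts a block that extends to the instruction before the next leader, exactly as the worklist half of the slide's algorithm does.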
Control-Flow Edges - Basic blocks are the nodes. Edges: add a directed edge from B1 to B2 if: there is a branch from the last statement of B1 to the first statement of B2 (B2 is a leader), or B2 immediately follows B1 in program order and B1 does not end with an unconditional branch (goto). Then B1 is a predecessor of B2, and B2 is a successor of B1.
Control-Flow Edge Algorithm
Input: block(i), a sequence of basic blocks
Output: the CFG, where nodes are basic blocks
for i = 1 to the number of blocks
    x = last instruction of block(i)
    if instr(x) is a branch
        for each target y of instr(x), create edge (i -> y)
    if instr(x) is not an unconditional branch, create edge (i -> i+1)
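The two edge rules can be sketched over a summarized block list (the dict encoding is an assumption: each block records the branch targets of its last instruction, as block numbers, and whether that branch is unconditional):

```python
def cfg_edges(blocks):
    """Branch edges plus fall-through edges, blocks numbered from 1."""
    edges = []
    for i, b in enumerate(blocks, start=1):
        for t in b['targets']:
            edges.append((i, t))           # branch edge
        if not b['uncond'] and i < len(blocks):
            edges.append((i, i + 1))       # fall-through edge
    return edges

blocks = [
    {'targets': [], 'uncond': False},      # B1 falls through to B2
    {'targets': [2], 'uncond': False},     # B2 conditionally repeats itself
    {'targets': [1], 'uncond': True},      # B3 ends with goto B1
]
print(cfg_edges(blocks))  # [(1, 2), (2, 2), (2, 3), (3, 1)]
```

Note that B3 gets no fall-through edge because its goto is unconditional, matching the second rule on the slide.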
Loops - Loops come from while, do-while, for, and goto. Loop definition: a set of nodes L in a CFG is a loop if: 1. there is a node called the loop entry, and no other node in L has a predecessor outside L; 2. every node in L has a nonempty path (within L) to the entry of L. Loop examples: {B3}, {B6}, {B2, B3, B4}.
Peephole Optimization - Simple compilers do not perform machine-independent code improvement; they generate naive code. It is possible to take the target code and optimize small windows of it: sub-optimal sequences of instructions that match an optimization pattern are transformed into better sequences. This technique is known as peephole optimization. It usually works by sliding a window of several instructions (a peephole) over the target code.
Peephole Optimization - Goals: improve performance; reduce memory footprint; reduce code size. Method: 1. examine short sequences of target instructions; 2. replace each sequence by a more efficient one. Typical transformations: redundant-instruction elimination, algebraic simplifications, flow-of-control optimizations, use of machine idioms.
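One redundant-instruction-elimination pattern can be sketched concretely (the (op, source, destination) tuple encoding follows the slide's "opcode source, destination" convention; the pass itself is an illustrative assumption): a load "mov a, R0" that immediately follows the store "mov R0, a" is deleted, since R0 already holds the value of a.

```python
def peephole(instrs):
    """Drop a reload that immediately follows the matching store."""
    out = []
    for ins in instrs:
        if (out and ins[0] == 'mov' and out[-1][0] == 'mov'
                and out[-1][1] == ins[2]     # stored register == reload dest
                and out[-1][2] == ins[1]):   # stored location == reload src
            continue                         # redundant reload: skip it
        out.append(ins)
    return out

code = [('mov', 'R0', 'a'),   # store R0 into a
        ('mov', 'a', 'R0'),   # redundant reload of a into R0
        ('add', 'R1', 'R0')]
print(peephole(code))  # [('mov', 'R0', 'a'), ('add', 'R1', 'R0')]
```

A real peephole pass applies many such patterns repeatedly until no window matches; this shows the shape of a single pattern.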
Peephole Optimization Common Techniques
UNIT V - SYLLABUS: Code Generation, Code Generation Algorithm, Function getreg, DAG, Types of Error, Phrase Level Recovery
Code generation and Instruction Selection - Pipeline: input -> Front end -> Intermediate Code Generator -> Code Generator -> output, with a shared symbol table. Requirements: the output code must be correct; the output code must be of high quality; the code generator should run efficiently.
Issues in the design of code generator - Input: an intermediate representation with a symbol table; assume that the input has been validated by the front end. Target programs: absolute machine language (fast for small programs); relocatable machine code (requires linker and loader); assembly code (requires assembler, linker, and loader). www.geeksforgeeks.org/intermediate-code-generation-in-compiler-design
Instruction Selection - Instruction selection: uniformity, completeness, instruction speed. Register allocation: instructions with register operands are faster; store long-lifetime values and counters in registers; temporary locations; even/odd register pairs. Evaluation order.
Target Machine - Byte-addressable with 4 bytes per word. It has n registers R0, R1, ..., Rn-1. Two-address instructions of the form: opcode source, destination. Usual opcodes like mov, add, sub, etc. Addressing modes:
MODE               FORM    ADDRESS
absolute           M       M
register           R       R
indexed            c(R)    c + contents(R)
indirect register  *R      contents(R)
indirect indexed   *c(R)   contents(c + contents(R))
literal            #c      the constant c
Code Generator - Consider each statement and remember whether an operand is in a register. Register descriptor: keeps track of what is currently in each register; initially all registers are empty. Address descriptor: keeps track of the location where the current value of a name can be found at runtime; the location might be a register, a stack location, a memory address, or a set of these.
Code Generation Algorithm - For each statement x = y op z: invoke the function getreg to determine the location L where the result x must be stored (usually L is a register). Consult the address descriptor of y to determine y', preferring a register for y'; if the value of y is not already in L, generate mov y', L. Generate op z', L, again preferring a register for z'. Update the address descriptor of x to indicate that x is in L; if L is a register, update its descriptor to indicate that it contains x, and remove x from all other register descriptors. If the current values of y and/or z have no next use, are dead on exit from the block, and are in registers, change the register descriptors to indicate that they no longer contain y and/or z.
Function getreg - 1. If y is in a register that holds no other values, and y is not live and has no next use after x = y op z, then return the register of y for L. 2. Failing (1), return an empty register. 3. Failing (2), if x has a next use in the block, or op requires a register, then get an occupied register R, store its contents into memory (by mov R, M), and use it. 4. Otherwise, select the memory location of x as L.
Example
Stmt          Code         Register descriptor    Address descriptor
t1 = a - b    mov a, R0    R0 contains t1         t1 in R0
              sub b, R0
t2 = a - c    mov a, R1    R0 contains t1         t1 in R0
              sub c, R1    R1 contains t2         t2 in R1
t3 = t1 + t2  add R1, R0   R0 contains t3         t3 in R0
                           R1 contains t2         t2 in R1
d = t3 + t2   add R1, R0   R0 contains d          d in R0
              mov R0, d                           d in R0 and memory
DAG representation of basic blocks - DAGs are useful data structures for implementing transformations on basic blocks: they give a picture of how the value computed by a statement is used in subsequent statements, and they are a good way of determining common subexpressions. A DAG for a basic block has the following labels on its nodes: leaves are labeled by unique identifiers, either variable names or constants; interior nodes are labeled by an operator symbol; nodes may also optionally be given a sequence of identifiers as labels.
DAG representation: example
1. t1 := 4 * i
2. t2 := a[t1]
3. t3 := 4 * i
4. t4 := b[t3]
5. t5 := t2 * t4
6. t6 := prod + t5
7. prod := t6
8. t7 := i + 1
9. i := t7
10. if i <= 20 goto (1)
(DAG shown on slide.)
Code Generation from DAG - Before:
S1 = 4 * i
S2 = addr(a) - 4
S3 = S2[S1]
S4 = 4 * i
S5 = addr(b) - 4
S6 = S5[S4]
S7 = S3 * S6
S8 = prod + S7
prod = S8
S9 = i + 1
i = S9
if i <= 20 goto (1)
After eliminating common subexpressions via the DAG:
S1 = 4 * i
S2 = addr(a) - 4
S3 = S2[S1]
S5 = addr(b) - 4
S6 = S5[S1]
S7 = S3 * S6
prod = prod + S7
i = i + 1
if i <= 20 goto (1)
Types of Error - There are mainly four types of error: Lexical error: such as misspelling an identifier, keyword, or operator. Syntactic error: such as an arithmetic expression with unbalanced parentheses. Semantic error: such as an operator applied to an incompatible operand. Logical error: such as an infinitely recursive call.
Phrase Level Recovery - On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. For example, on a missing semicolon it can report the error, insert the ';', and continue. Global Correction - We would like the compiler to make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal set of changes to obtain a globally least-cost correction.