Compiler Construction Class Notes


1 Compiler Construction Class Notes Reg Dodds, Department of Computer Science, University of the Western Cape © 2006, 2017 Reg Dodds March 22, 2017

2 Introduction What is a Compiler? What is an Interpreter? Why Compiler Construction? What languages? An example of a very simple compilation. Why write a compiler? Layout of a compiler. 1

3 What is interpretation? Let L ∈ 𝕃 be a programming language, with 𝕃 = {Fortran, Lisp, Algol, COBOL, PL/1, BASIC, APL, SNOBOL, Pascal, C, C++, Ada, SQL, Java, ML, Haskell, ...}. I_L is an interpreter for a program p_L ∈ L, and input ∈ A* is data, where A is usually called a character set and A* is its Kleene closure, from which I_L computes output data output ∈ A*. The execution of the interpreter may abort and lead to an error condition: I_L : L × A* → A* ∪ {error}, which we may also write as: I_L(p_L, input) = output ∈ A* ∪ {error}. A single process takes place: the source program is directly interpreted.

4 Making interpreters efficient In a production quality interpreter it is advantageous to produce some sort of compact interpretable code by a process that is similar to compilation, once, and then subsequently reinterpret this compact code repeatedly. This process is used by Java and many interpreters for BASIC such as GWBasic. Typically a command-line interface interprets the command directly. An even better idea is to compile blocks of code incrementally, directly to executable machine code. When a block is altered its corresponding code is replaced with new code. Interpreters often have direct access to the original source code, which is very useful for finding errors in the source program. Stepping mechanisms that move line-by-line through the source are easily implemented with interpreters.

5 One view of a compiler When compiling is involved, two processes are applied to execute a source program. A compiler C_L for a language L translates a syntactically correct source program p_L ∈ L into equivalent machine code: source program → compiler → machine language. Examples: A source program in C++ is translated into MIPS machine code. Visual Basic source code is compiled into Intel x86 machine code. A Java source program is translated into JVM byte code.

6 Execution of machine language The machine code produced by the compiler is somehow executed by hardware. Hardware may be emulated by microcode, or it may be hardwired. Some instructions may be entirely executable by hardware. Certain instructions may be emulated by microcode. The user is usually not aware that some of the machine code instructions, or even all of them, are being emulated. On some machines the machine instruction set may change dynamically, depending on the application. It is likely that compiled machine code, on any particular machine, runs faster than code running on an interpreter on the same machine. 5

7 What is compilation? The source program p_L ∈ L is first translated by a compiler C_L into an equivalent machine-executable program p_M. Next p_M is interpreted, or executed, by a machine plus its input to create output and/or an error. To run a program: (1) it is compiled and (2) then it is executed. C_L(p_L) = p_M ∪ {error}; if there are no compilation errors then the second step may be invoked: I_M(p_M, input) = output ∪ {error}. Notice that the interpreter I_L has now become I_M, which is perhaps hardware. Interpreters and computers are different realizations of computing machines. Sun's picoJava chip or the Java Virtual Machine on your computer can be used interchangeably to run the same byte code program p_M.

8 Java source program

public class simple {
    public static void main (String argsv[]) {
        int a;
        a = 41;
        a = a + 19;
    }
}

9 Java byte code

Compiled from simple.java
public class simple extends java.lang.Object {
    public static void main(java.lang.String[]);
    public simple();
}

Method void main(java.lang.String[])
   0 bipush 41
   2 istore_1
   3 iload_1
   4 bipush 19
   6 iadd
   7 istore_1
   8 return

Method simple()
   0 aload_0
   1 invokespecial #12 <Method java.lang.Object.<init>()>
   4 return

Note there is a main method and a constructor method.

10 Overview of course Programs related to compilers. The compilation process: phases, intermediate code, structures. Bootstrapping and transfer, T-diagrams, Louden s TINY and TM. SEPL: interpreter, emulator, compiler. 9

11 Programs related to compilers (Louden p 4-6) interpreters assemblers linkers loaders preprocessors editors debuggers profilers project managers SCCS and RCS 10

12 The compilation process (Louden p 7, 8-14) Phases and intermediate representations:

source code → scanner (lexical analyser) → tokens → syntax analyser → abstract syntax tree → semantic analyser → annotated syntax tree → intermediate code optimizer → intermediate code → code generator → target code → target code optimizer → optimized code → linker-loader → executable code

Structures used throughout: literals, symbol table, error handler, temporary files.

13 Bootstrapping and transfer of programming languages (Louden 1.6, p 18-21) T-diagrams next slide. Pascal in 1970 on the CDC. A P-code compiler for Pascal, with the P-code emulator written in Algol 60, and in Fortran, led to widespread usage of Pascal. (Why?)

14 T-diagrams A T-diagram represents a translator from a Source language to a Target language, written in (running on) Host code:

    Source → Target
        Host

Let two compilers run on the same host machine. One compiler translates from language Start into an intermediate language IL and the other compiler translates from IL into language Final:

    Start → IL      IL → Final      gives      Start → Final
      Host             Host                        Host

We have produced a system that can compile from Start into Final.

15 T-diagrams One compiler for Pascal creates P-code, but runs on machine M. Another processor running on M can generate code for machine N:

    Pascal → P-code      M → N      gives      Pascal → P-code
        M                  M                        N

We have produced a system that can compile from Pascal into P-code on a new machine.

16 T-diagrams: compiler-old to compiler-new

    Pascal → P-code      M → N      gives      Pascal → P-code
        M                  M                        N

17 T-diagrams Define the SEPL language. Write an interpreter for it. Develop a machine emulator or use an available one. Develop a compiler that compiles to our machine's machine code. Add an optimizing phase to the compiler. Alter the compiler to produce code for another machine.

18 Students Educational Programming Language (SEPL) Various projects lie ahead. Define the SEPL language (Louden calls his TINY). Develop its syntax and informal semantics. Write an interpreter for it using flex/lex and bison/yacc. Decide on a target machine. Develop a machine emulator for the target or use a real machine. Develop a compiler that produces executable code. Introduce an optimization phase (not really enough time). How much time is required to produce the compiler?

19 Scanning Lexical analysis (Louden Chapter 2) Producing tokens from lexemes is done quite well by flex. Regular expressions (Louden p 38). Extension of the notation for regular expressions does not give the notation any more power, but simplifies its practical use. Regular expressions are widely used: flex, vim, sed, emacs, python, bash, tcl/tk, grep, awk, perl, etc. Regular expressions and FSAs (Louden p 47). DFSA-FSA relationship (Louden p 46-72). Minimization of the number of states. Louden's TINY-scanner: gives insight into the direct connection between an FSA and a scanner. (Louden 2.5) Application of flex for scanning lexical analysis.

20 Context-free languages (CFLs) and syntax analysis (Louden Chapter 3) Syntax analysers are based on CFLs. A list of tokens goes into the analyser, which produces an abstract syntax tree: syntaxtree = analyse();

21 Parse trees have dynamic structure. recursive structure. The tree keeps track of attributes such as: types, scope, liveness, nesting and values. (Figure: an annotated tree for the assignment a[i] = 6; the subscript expression has children id a of type integer[] and id i of type integer, and is assigned the number 6 of type integer.)

22 Context-free grammars (CFGs) (Louden 3.2) Formally a CFG is a four-tuple G = (N, T, P, S) where N and T are alphabets, N is the set of non-terminals or variables and T is the set of terminals, P ⊆ N × (N ∪ T)* is the set of production rules and S ∈ N is the start symbol. Example: N = {exp, op}, T = {number, +, -, *}, P = {exp → exp op exp | (exp) | number, op → + | - | *} and S = exp. Note that number is treated as a token. The source string (117 - 17) * 5 is first tokenized to (number - number) * number before it is analysed. P_1 = {E → E O E | (E) | n, O → + | - | *} is a set of productions not different from P.

23 Derivations sentential form: any string in (N ∪ T)* derived from S, the start symbol. direct derivation: one production is applied to a part of a sentential form, matching a non-terminal in that part with the left-hand side of the production and replacing it with the production's right-hand side. Example: The production exp → (exp) can be applied to bring about the direct derivation exp * number ⇒ (exp) * number. derivation: a chain of direct derivations applied one after the other transforms the sentential form s_0 into another sentential form s_n. It is written as s_0 ⇒* s_n. language: all strings s ∈ T* that can be derived from the start symbol S, symbolically: L(G) = {s ∈ T* | S ⇒* s}.

24 Derivation: exp ⇒* (number - number) * number

[exp → exp op exp]: exp ⇒ exp op exp,
[exp → number]:     ⇒ exp op number,
[op → *]:           ⇒ exp * number,
[exp → (exp)]:      ⇒ (exp) * number,
[exp → exp op exp]: ⇒ (exp op exp) * number,
[exp → number]:     ⇒ (exp op number) * number,
[op → -]:           ⇒ (exp - number) * number,
[exp → number]:     ⇒ (number - number) * number.

25 language, sentence, examples language: all strings s ∈ T* that can be derived from the start symbol S, symbolically: L(G) = {s ∈ T* | S ⇒* s}. sentence: the elements of the language L(G), s ∈ L(G), are known as sentences. Example: G = ({E}, {a, (, )}, {E → (E) | a}, E). E → a, i.e. E ⇒ a, so a ∈ L(G). Similarly E → (E) ⇒ (a), i.e. E ⇒* (a), and E ⇒ (E) ⇒ ((E)) ⇒ ((a)), i.e. E ⇒* ((a)). Theorem: E ⇒* (^n a )^n, ∀n ∈ N_0. Proof: using induction. P_0: E ⇒* (^0 a )^0 = a, since E → a. P_1: E ⇒* (a), because E ⇒ (E) ⇒ (a). P_k → P_{k+1}: assume that P_k holds, i.e. E ⇒* (^k a )^k. Now E ⇒ (E), in other words E ⇒ (E) ⇒* ((^k a )^k) = (^{k+1} a )^{k+1}, so E ⇒* (^n a )^n, ∀n ∈ N_0, i.e. L(G) = {(^n a )^n | n ∈ N_0}.

26 Examples Problem with an empty base If P = {E → (E)} then L(G) = {} = ∅. This is empty because it is impossible to form the bases P_0 or P_1. Since the base does not exist an infinite regress ensues. However, we can prove that E ⇒* (^n E )^n, but this is of little value, since E cannot be reduced to a terminal. CFL using regular expressions If P = {E → E + a | a}, then L(G) = a(+a)*, where a(+a)* = {a, a+a, a+a+a, ...}.

27 An if-statement G = ({statement, if-statement, expression}, {0, 1, if, else, other}, {statement → if-statement | other, if-statement → if (expression) statement | if (expression) statement else statement, expression → 0 | 1}, statement) and L(G) = {other, if (0) other, if (1) other, if (0) other else other, if (1) other else other, if (0) if (0) other, if (1) if (0) other, if (0) if (1) other, if (1) if (1) other, if (0) if (0) other else other, if (1) if (0) other else other, if (0) if (1) other else other, if (1) if (1) other else other, ...}

28 The use of ε Consider the grammar (we only show the productions P): {statement → if-statement | other, if-statement → if (expression) statement | if (expression) statement else statement, expression → 0 | 1} It may be written using an ε-grammar as follows: {statement → if-statement | other, if-statement → if (expression) statement else-part, else-part → else statement | ε, expression → 0 | 1} ε is also useful for lists: list → statement ; list | statement, statement → s. This generates the language L(G) = {s, s;s, s;s;s, ...}. It is rewritten using ε as follows: list → non-ε-list | ε, non-ε-list → statement ; non-ε-list | statement, statement → s.

29 Left- and right recursion The regular language a+ is represented as follows with left recursive productions: A → Aa | a. a ∈ L(G) since A → a, thus A ⇒ a; but A → Aa, so A ⇒ Aa ⇒ aa, and A may again be replaced using A → Aa, so that A ⇒* aaa. It is simple to prove with mathematical induction that L(G) = a+. Our notation is rather informal: the set represented by a+ was formerly represented more exactly by L(a+), which represents the set {a, aa, aaa, ...}. Similarly we can prove that a grammar using the right recursive productions A → aA | a generates the same language. How is a* represented? A → Aa | ε, or using A → aA | ε. What is L(G) for the grammar with the productions A → (A)A | ε?

30 Parse trees and abstract syntax trees (ASTs) It is convenient to distinguish between a parse tree and an abstract syntax tree. An abstract syntax tree is often called a syntax tree. A parse tree contains all the information concerning the syntactical structure of the derivation. Consider the parse tree and its corresponding stripped-down (abstract) syntax tree generated by the derivation on the next slide. Syntax trees usually show the actual values at the terminals and not merely the tokens.

31 Right derivation for exp ⇒* (number - number) * number The derivation below is executed in a determinate order. The rightmost non-terminal is replaced in each step until no more non-terminals remain.

(1) [exp → exp op exp]: exp ⇒ exp op exp,
(2) [exp → number]:     ⇒ exp op number,
(3) [op → *]:           ⇒ exp * number,
(4) [exp → (exp)]:      ⇒ (exp) * number,
(5) [exp → exp op exp]: ⇒ (exp op exp) * number,
(6) [exp → number]:     ⇒ (exp op number) * number,
(7) [op → -]:           ⇒ (exp - number) * number,
(8) [exp → number]:     ⇒ (number - number) * number.

32 Parse tree and syntax tree for the derivation exp ⇒* (29-11) * 47 (Figure: the parse tree for (29-11) * 47, with the nodes numbered in the order the derivation created them, and the corresponding syntax tree: a * node over a - node, with leaves 29, 11 and 47.)

33 Right derivation for exp ⇒* (number - number) * number The derivation below is executed in a determinate order. The rightmost non-terminal is replaced in each step until no more non-terminals remain.

(1) [exp → exp op exp]: exp ⇒ exp op exp,
(2) [exp → number]:     ⇒ exp op number,
(3) [op → *]:           ⇒ exp * number,
(4) [exp → (exp)]:      ⇒ (exp) * number,
(5) [exp → exp op exp]: ⇒ (exp op exp) * number,
(6) [exp → number]:     ⇒ (exp op number) * number,
(7) [op → -]:           ⇒ (exp - number) * number,
(8) [exp → number]:     ⇒ (number - number) * number.

34 Parse tree for right derivation of exp ⇒* (number - number) * number (Figure: the parse tree, with each node numbered by the step of the rightmost derivation that created it: the root exp (1) has children exp (4), op (3) and exp (2); node (4) expands to ( exp ) with exp (5) inside, which in turn expands to exp (8), op (7) and exp (6) over number - number.)

35 Leftmost derivation for exp ⇒* (number - number) * number The derivation below is executed in a determinate order. The leftmost non-terminal of the sentential form is replaced each time until there are no more non-terminals.

(1) [exp → exp op exp]: exp ⇒ exp op exp,
(2) [exp → (exp)]:      ⇒ (exp) op exp,
(3) [exp → exp op exp]: ⇒ (exp op exp) op exp,
(4) [exp → number]:     ⇒ (number op exp) op exp,
(5) [op → -]:           ⇒ (number - exp) op exp,
(6) [exp → number]:     ⇒ (number - number) op exp,
(7) [op → *]:           ⇒ (number - number) * exp,
(8) [exp → number]:     ⇒ (number - number) * number.

36 A parse tree for the leftmost derivation of exp ⇒* (number - number) * number (Figure: the parse tree; it is the same tree as for the rightmost derivation, only the order in which its nodes are created differs.)

37 Right and left derivations for number + number A left derivation: (1) exp ⇒ exp op exp, (2) ⇒ number op exp, (3) ⇒ number + exp, (4) ⇒ number + number. (Figure: the parse tree, root exp with children exp, op, exp over number, +, number.)

38 Rightmost derivation A rightmost derivation for number + number: (1) exp ⇒ exp op exp, (2) ⇒ exp op number, (3) ⇒ exp + number, (4) ⇒ number + number. (Figure: the same parse tree, with the nodes numbered in the order the rightmost derivation created them.)

39 Ambiguous grammars The grammar with P = {exp → exp op exp | (exp) | number, op → + | - | *} is ambiguous because a string such as number - number * number has two different parse trees. It will also therefore have two different leftmost and two different rightmost derivations, because each parse tree has a unique leftmost derivation. (Figure: the first parse tree, whose root operator is *, grouping the string as (number - number) * number; and now the other tree.)

40 Ambiguous grammars A different parse tree for number - number * number. (Figure: the root operator is -, with left child number and right child exp op exp over number * number, i.e. the grouping number - (number * number).) Ambiguous: if two different parse trees can be derived from a given grammar then it is ambiguous. It is preferable to use an unambiguous grammar for defining a computing language. Ambiguity can be eliminated in two ways: the grammar can be altered so that it becomes unambiguous, or, the way bison/yacc does it, precedence rules or association rules can be applied where there are ambiguities.

41 The dangling else problem (Louden p ) The string if (0) if (1) other else other has two parse trees. This is the dangling else problem. (Figure: the two parse trees; in one the else part is attached to the inner if, in the other to the outer if.)

42 The dangling else problem The C code

if (x != 0)
    if (y == 1/x) OK = TRUE;
    else z = 1/x;

could have had two interpretations:

if (x != 0) {                      if (x != 0) {
    if (y == 1/x) OK = TRUE;           if (y == 1/x) OK = TRUE;
    else z = 1/x;                  }
}                                  else z = 1/x;

C disambiguates with the most closely nested rule: an else part is attached to the nearest preceding unmatched if, which resolves the ambiguity in favour of the left-hand interpretation. The grammar rules may be adapted as follows:

if-statement → matched | unmatched
matched → if (exp) matched else matched | other
unmatched → if (exp) if-statement | if (exp) matched else unmatched
exp → 0 | 1

The next slide shows the unambiguous parse tree.

43 An unambiguous grammar for C's if-statement

if-statement → matched | unmatched
matched → if (exp) matched else matched | other
unmatched → if (exp) if-statement | if (exp) matched else unmatched
exp → 0 | 1

(Figure: the unique parse tree for if (0) if (1) other else other; the else is attached to the inner if (1).)

44 Representations of syntax: BNF BNF: Backus-Naur form. The metasymbol ::= is used like → in production rules; | separates alternatives. Angle brackets < and > delimit non-terminals. Terminals are written in plain text, or in bold face. The code below defines a <program>:

<program> ::= program <declaration-list> begin <statement-list> end.

A program starts with program, and is followed by a list of declarations, then a begin, and a list of statements terminated with end and a full stop. EBNF: Extended BNF. BNF was made more convenient to use by extending it slightly.

45 Representations of syntax: EBNF EBNF: Extended BNF. Optional items are put inside brackets [ and ]:

<if-statement> ::= if <boolean> then <statement-list> [else <statement-list>] end if ;

Repetition is done using braces { and }:

<identifier> ::= <letter> { <letter> | <digit> }

An <identifier> is a word that starts with a letter and is followed by any number of letters or digits.

<statement-list> ::= <statement> { ; <statement> }

A <statement-list> is a <statement> or a list of <statement>s separated by semicolons.

46 Representations of syntax: EBNF tramline diagrams used by Wirth for Pascal, and for ANS Fortran. two-level grammar Algol 68. etc. 77

47 Formal properties of CFLs (Louden p ) Vide Louden. 78

48 The Chomsky hierarchy (Louden p. 131) Chomsky-type: Description 3: Regular languages. Let A ∈ N and α ∈ T*; then productions in the grammar have the form A → α or A → Aα, or alternatively the recursion may be right: A → αA. Only one kind of recursion may be present, i.e. left or right, otherwise G is a CFL. 2: Context-free languages. Let A ∈ N and γ ∈ (N ∪ T)* and A → γ. In a context-free language A can always be replaced in any context by γ. 1: Context-sensitive languages. If the production A → γ is in a context-sensitive language, then it may be applied only in a predetermined context, i.e. A may produce γ only if A lies in a given context, e.g. αAβ → αγβ, where α ≠ ε. Such a rule is context sensitive. An example of context sensitivity is the restriction that variables must be declared before they may be used. 0: Phrase structure grammars are the most powerful.

49 Top-down parsing (Louden Chapter 4) Recursive descent. LL(1) parsing. first and follow sets. Error recovery in top-down parsers.

50 Top-down parsing A top-down parser executes a leftmost derivation. It starts from the start symbol and works its way down to the terminals in the form of tokens. Predictive parser: attempts to forecast the next construction by using lookahead tokens. Backtracking parser: attempts different possibilities for parsing the known input, and backs up when it hits dead ends. Slower than predictive parsers; may use exponential time; more powerful. Recursive-descent parsing is usually applied in hand-written compilers. Wirth's compilers often use RD parsers. Your 1st-year compiler was RD. LL(1) parsing: the first L means the input is followed from left to right; the second L means a leftmost derivation; the 1 means that only one token is used to predict the progress of the parser.

51 LL(1) parsing LL(1) parsers work from left to right through the input and follow a leftmost derivation that uses one lookahead token. Viable-prefix property: in such languages it is easy to see very quickly that there is an error, namely when the lookahead token does not correspond with what we expect. The viable prefix corresponds to first. LL(k) parsers are also possible, where k > 1; it is more difficult to see errors. first and follow sets derived from the grammar are used to construct the tables that will be used for LL(1) parsing.

52 first and follow sets The set first(X), where X is a terminal or ε, is simply {X}. Suppose X is a nonterminal; then first(X) is the set of all terminals x such that X ⇒* xβ, where β may be ε. In other words first(X) is the set of leading terminals of the sentential forms derivable from X. The definition may be altered to accommodate LL(k) parsers by replacing x with strings of k terminals, or with |x| < k when β is ε. (See also Louden p. 168)

53 first sets In the grammar for arithmetic expressions:

exp → exp addop term | term
addop → + | -
term → term mulop factor | factor
mulop → *
factor → ( exp ) | number

first(addop) = { +, - }
first(mulop) = { * }
first(exp) = { (, number }
first(term) = { (, number }
first(factor) = { (, number }

54 first in the grammar for an if-statement G = ({statement, if-statement, else-part, expression}, {0, 1, if, else, rest}, {statement → if-statement | rest, if-statement → if (expression) statement else-part, else-part → else statement | ε, expression → 0 | 1}, statement) first(statement) = {if, rest} first(expression) = {0, 1} first(if-statement) = {if} first(else-part) = {else, ε}

55 Basic LL(1) parsing (Louden p. 152) LL(1) parsers use a push-down stack rather than backtracking from recursive procedure calls. Consider S → ( S ) S | ε. Initialize the stack to $S.

    Parsing stack   Input   Action
  1 $S              ()$     S → (S)S
  2 $S)S(           ()$     match
  3 $S)S            )$      S → ε
  4 $S)             )$      match
  5 $S              $       S → ε
  6 $               $       accept

Two actions: 1. Replace A ∈ N at the top of the stack by α, where A → α and α ∈ (N ∪ T)*. 2. Match the token on top of the stack with the next input token.

56 LL(1) parsing

    Parsing stack   Input   Action
  1 $S              ()$     S → (S)S
  2 $S)S(           ()$     match
  3 $S)S            )$      S → ε
  4 $S)             )$      match
  5 $S              $       S → ε
  6 $               $       accept

At step 1 the stack contains S and the input is ()$. Apply rule S → (S)S: the right-hand side is placed item-by-item onto the stack so that it appears reversed. In step 2 the ( on top of the stack is removed because it matches the token at the start of the input.

57 LL(1) recursion-free productions for arithmetic (Louden p. 160)

exp → term exp′
exp′ → addop term exp′ | ε
addop → + | -
term → factor term′
term′ → mulop factor term′ | ε
mulop → *
factor → ( exp ) | number


59 Parse tree and syntax tree for 3-4-5 (Louden p. 161) The parse tree for the expression does not represent the left associativity of subtraction. The parser should still construct the left-associative syntax tree. 1. The value 3 must be passed up to the root exp. 2. The root exp hands 3 down to exp′, which subtracts 4 from it. 3. The resulting -1 is passed down to the next exp′, 4. which subtracts 5 yielding -6, 5. which is passed to the next exp′. 6. The rightmost exp′ has an ε child and finally passes the -6 back to the root exp.

60 Building the syntax tree with an LL(1) grammar Implement exp → term exp′ as follows:

exp()
{ term();
  exp′();
}

To compute the expression it is rewritten as:

int exp()
{ int temp;
  temp = term();
  return exp′(temp);
}

61 Code for arithmetic The code for exp′ → addop term exp′ | ε is

exp′() {
  switch (token) {
  case '+': match('+'); term(); exp′(); break;
  case '-': match('-'); term(); exp′(); break;
  }
}

To compute the expression it could be rewritten as:

int exp′(int val) {
  switch (token) {
  case '+': match('+'); val += term(); return exp′(val);
  case '-': match('-'); val -= term(); return exp′(val);
  default: return val;
  }
}

Note that exp′ requires a parameter passed from exp.

62 Left factoring Left factoring is needed when right-hand sides of productions share a common prefix, e.g. A → αβ | αγ. Typical practical examples are: stmt-sequence → stmt ; stmt-sequence | stmt, stmt → s and if-stmt → if ( exp ) statement | if ( exp ) statement else statement. An LL(1) parser cannot distinguish between such productions. The solution is to factor out the common prefix as follows: A → αA′, A′ → β | γ. For factoring to work properly α should be the longest left prefix. Louden gives a left-factoring algorithm and many examples on pp

63 follow sets In this discussion we regard $ as a terminal. Recall that first(A) is the set of leading terminals of the sentential forms derivable from A. Informally, follow(A) is the set of terminals that may be derived from the symbols appearing after A on the right-hand side of productions, or, put differently, the set of those terminals that can follow A in sentential forms. Since $ is regarded as a terminal, if A is the start symbol then $ is in follow(A). Formally: follow(A) is the set of terminals such that if there is a production B → αAγ, 1. then first(γ) \ {ε} is in follow(A), and 2. if ε is in first(γ), then follow(A) contains follow(B). follow sets are only defined for nonterminals.

64 An algorithm for follow(A) Algol style

for all nonterminals A do follow(A) := { };
follow(start-symbol) := {$};
while there are changes to any follow sets do
    for each production A → X_1 X_2 ... X_n do
        for each X_i that is a nonterminal do
            add first(X_{i+1} X_{i+2} ... X_n) \ {ε} to follow(X_i)
            /* Note: if i = n then X_{i+1} X_{i+2} ... X_n = ε */
            if ε ∈ first(X_{i+1} X_{i+2} ... X_n) then
                add follow(A) to follow(X_i)

65 An algorithm for follow(A) C-style

for (all nonterminals A) follow(A) = { };
follow(start-symbol) = {$};
while (there are changes to any follow sets)
    for (each production A → X_1 X_2 ... X_n)
        for (each X_i that is a nonterminal) {
            add first(X_{i+1} X_{i+2} ... X_n) \ {ε} to follow(X_i);
            /* Note: if i == n then X_{i+1} X_{i+2} ... X_n = ε */
            if (ε ∈ first(X_{i+1} X_{i+2} ... X_n))
                add follow(A) to follow(X_i);
        }

66 Construct follow from the first set In the grammar for arithmetic expressions:

(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number

first(addop) = { +, - }
first(mulop) = { * }
first(factor) = { (, number }
first(term) = { (, number }
first(exp) = { (, number }

67 Constructing follow from first In the grammar for arithmetic expressions:

(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number

Ignore (3), (4), (7) and (9): no right-hand-side nonterminals. Set all follow(A) = { }; follow(exp) = {$}. (1) affects the follow sets of exp, addop and term: first(addop) is added to follow(exp), so follow(exp) = { $, -, + }; first(term) is added to follow(addop), so follow(addop) = { (, number }; and follow(exp) is added to follow(term), so follow(term) = { $, +, - }.

68 Constructing follow from first In the grammar for arithmetic expressions:

(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number

(2) causes follow(exp) to be added to follow(term), which does not add anything new. (5) is similar to (1): first(mulop) is added to follow(term), so follow(term) = { $, +, -, * }; first(factor) is added to follow(mulop), so follow(mulop) = { (, number }; and follow(term) is added to follow(factor), so follow(factor) = { $, +, -, * }.

69 Constructing follow from first In the grammar for arithmetic expressions:

(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number

(6) adds follow(term) to follow(factor): no effect. (8) adds first( ) ) to follow(exp), so that follow(exp) = { $, +, -, ) }. During the second pass ) propagates via (1) into follow(term) and from there into follow(factor), so that follow(term) = { $, +, -, *, ) } and follow(factor) = { $, +, -, *, ) }.

70 Constructing LL(1) parse tables The parse table M[A, a] contains productions added according to the rules: 1. If A → α is a production rule such that there is a derivation α ⇒* aβ, where a is a token, then the rule A → α is added to M[A, a]. 2. If A → α is a production with a derivation α ⇒* ε, and there is a derivation S$ ⇒* βAaγ, where S is the start symbol and a is a token, or $, then the production A → α is added to M[A, a]. The token a in rule 1 is in first(α) and the token in rule 2 is in follow(A). This is repeatedly applied for each nonterminal A and each production A → α: 1. For each token a in first(α), add A → α to the entry M[A, a]. 2. If ε ∈ first(α), for each element a ∈ follow(A), add A → α to M[A, a].

71 Characterizing an LL(1) grammar A grammar in BNF is LL(1) if the following conditions are satisfied: 1. For every production A → α_1 | α_2 | ... | α_n, first(α_i) ∩ first(α_j) is empty for all i, j ∈ [1..n], i ≠ j. 2. For every nonterminal A such that ε ∈ first(A), first(A) ∩ follow(A) is empty.

72 Examples See Louden's examples on p

73 Bottom-up parsing Overview. Finite automata of LR(0) items and LR(0) parsing. SLR(1) parsing. General LR(1) and LALR(1) parsing. bison an LALR(1) parser generator. Generation of a parser using bison. Error recovery in bottom-up parsers. 101

74 Bottom-up parsing: an overview The most general bottom-up parser is the LR(1) parser: the L indicates that the input is processed from the left to the right, the R indicates that a rightmost derivation is applied, and the 1 indicates that a single token is used for lookahead. LR(0) parsers are also possible, where there is no lookahead, i.e. the lookahead token can only be examined after it appears on the parse stack. SLR(1) parsers improve on LR(0) parsing. An even more powerful method, but still not as general as LR(1) parsing, is the LALR(1) parser. Bottom-up parsers are generally more powerful than their top-down counterparts; for example, left recursion can be handled. Bottom-up parsers are unsuitable for hand coding, so parser generators like bison are used.

75 Bottom-up parsing overview The parse stack contains tokens and nonterminals PLUS state information. The parse stack starts empty and ends with the start symbol alone on the stack and an empty input string. Actions: shift, reduce and accept. A shift merely moves a token from the input to the top of the stack. A reduce replaces the string α on top of the stack with a nonterminal A, given A → α. Top-down parsers are generate-match parsers and bottom-up parsers are shift-reduce parsers. If the grammar does not possess a unique start symbol that only appears once in the grammar, then bottom-up parsers always augment the grammar with such a start symbol.

76 Bottom-up parse of () Consider the grammar with P = {S → (S)S | ε}. Augment it by adding: S′ → S. A bottom-up parse for the parenthesis grammar of () follows:

    Parsing stack   Input   Action
  1 $               ()$     shift
  2 $(              )$      reduce S → ε
  3 $(S             )$      shift
  4 $(S)            $       reduce S → ε
  5 $(S)S           $       reduce S → (S)S
  6 $S              $       reduce S′ → S
  7 $S′             $       accept

The bottom-up parser looks deeper into its parse stack and thus requires arbitrary stack lookahead. The derivation is: S′ ⇒ S ⇒ (S)S ⇒ (S) ⇒ (). Clearly the rightmost nonterminal is reduced at each derivation step.

77 A bottom-up parse of the + grammar Consider the grammar with P = {E → E + n | n}. Augment it by adding: E′ → E. A bottom-up parse for the + grammar of n+n:

    Parsing stack   Input    Action
  1 $               n + n$   shift
  2 $n              + n$     reduce E → n
  3 $E              + n$     shift
  4 $E +            n$       shift
  5 $E + n          $        reduce E → E + n
  6 $E              $        reduce E′ → E
  7 $E′             $        accept

The derivation is: E′ ⇒ E ⇒ E + n ⇒ n + n. We see that the rightmost nonterminal is reduced at each derivation step.

78 Bottom-up parse overview

    Parsing stack   Input    Action
  1 $               n + n$   shift
  2 $n              + n$     reduce E → n
  3 $E              + n$     shift
  4 $E +            n$       shift
  5 $E + n          $        reduce E → E + n
  6 $E              $        reduce E′ → E
  7 $E′             $        accept

In the derivation E′ ⇒ E ⇒ E + n ⇒ n + n, each of the intermediate strings is called a right sentential form, and it is split between the parse stack and the input. E + n occurs in step 3 of the parse as E on the stack with + n in the input, as E + with n in step 4, and finally as E + n entirely on the stack in step 5. The string of symbols on top of the stack is called a viable prefix of the right sentential form. E, E + and E + n are all viable prefixes of E + n. The viable prefixes of n + n are ε and n, but n + and n + n are not.

79 Bottom-up parse overview

A shift-reduce parser shifts terminals onto the stack until it can perform a reduction to obtain the next right sentential form. This occurs when the top of the stack matches the right-hand side of a production. This string, together with the position in the right sentential form where it occurs and the production used to reduce it, is known as the handle. Handles are unique in unambiguous grammars. The handle of n + n is thus its leading n with the production E → n, and the handle of E + n, to which the previous form is reduced, is E + n itself with the production E → E + n. The main task of a shift-reduce parser is finding the next handle.

80 Bottom-up parse overview

     Parsing stack   Input   Action
   1 $               ()$     shift
   2 $ (             )$      reduce S → ε
   3 $ ( S           )$      shift
   4 $ ( S )         $       reduce S → ε
   5 $ ( S ) S       $       reduce S → (S)S
   6 $ S             $       reduce S′ → S
   7 $ S′            $       accept

The main task of a shift-reduce parser is finding the next handle. Reductions may only occur when the reduced string yields a right sentential form. In step 3 above the reduction S → ε cannot be performed, because the resulting string after then shifting ) onto the stack would be (SS), which is not a right sentential form. Thus S → ε is not a handle at this position in the sentential form (S. To reduce by S → (S)S, the parser must know that (S)S, the right-hand side of this production, is complete on top of the stack; it tracks this by using a DFA of items.

81 LR(0) items

The grammar with P = {S′ → S, S → (S)S | ε} has three productions and eight LR(0) items:

   S′ → .S       S′ → S.
   S → .(S)S     S → (.S)S
   S → (S.)S     S → (S).S
   S → (S)S.     S → .

When P = {E′ → E, E → E + n | n} there are three productions and eight LR(0) items:

   E′ → .E       E′ → E.
   E → .E + n    E → E. + n
   E → E + .n    E → E + n.
   E → .n        E → n.

82 LR(0) parsing: LR(0) items

An LR(0) item of a CFG is a production with a distinguished position in its right-hand side. The distinguished position is usually denoted by the metasymbol '.', i.e. a period. E.g. if A → α is a production and β and γ are any two strings of symbols, including ε, such that α = βγ, then A → .βγ, A → β.γ and A → βγ. are all LR(0) items. They are called LR(0) items because they contain no explicit reference to lookahead. An item records progress in recognizing the right-hand side of a particular production. Specifically, A → β.γ, constructed from A → βγ, denotes that the β part has already been seen and that it may be possible to derive the next input tokens from γ.
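The definition above can be made concrete: every production with a right-hand side of length m contributes m + 1 items, one per dot position. A hedged sketch (the grammar encoding and names are my own) that enumerates the LR(0) items of a grammar, representing each item as (lhs, symbols-before-dot, symbols-after-dot):

```python
# Enumerating LR(0) items: one item per dot position in each production.
# Representation (assumed, not from the notes): (lhs, before, after),
# with the epsilon right-hand side encoded as the empty tuple.

def lr0_items(productions):
    items = []
    for lhs, rhs in productions:
        for dot in range(len(rhs) + 1):
            items.append((lhs, rhs[:dot], rhs[dot:]))
    return items

S_GRAMMAR = [("S'", ("S",)),
             ("S", ("(", "S", ")", "S")),
             ("S", ())]                 # S -> epsilon

items = lr0_items(S_GRAMMAR)
print(len(items))  # 2 + 5 + 1 = 8, matching the count on the slide
```

The ε-production contributes exactly one item, S → ., which is why the S grammar has eight items rather than nine.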

83 LR(0) parsing: LR(0) items

The item A → .α indicates that A could eventually be reduced from α; it is called an initial item. The item A → α. indicates that α is on the top of the stack and may be the handle, if A → α is used to reduce α to A; it is called a complete item. The LR(0) items are used as the states of a finite automaton that maintains information about the parse stack and the progress of a shift-reduce parse.

84 LR(0) parsing: finite automata of items

LR(0) items denote the states of a finite automaton that maintains the progress of a shift-reduce parse. One approach is first to construct a nondeterministic finite automaton of LR(0) items and then derive a DFA from it. Another approach is to construct the DFA of sets of LR(0) items directly. What transitions are represented in the NFA of LR(0) items? Suppose that the symbol X ∈ (N ∪ T). Let A → α.Xη be an LR(0) item, representing a state reached after α has been recognized, with the focal point '.' directly before X. If X is a token, then there is a transition on the token X to the next LR(0) state A → αX.η:

   A → α.Xη  --X-->  A → αX.η

85 LR(0) parsing: finite automata of items

We are considering A → α.Xη, where the focal point '.' is directly before X. Suppose that X is a nonterminal; then it cannot be matched directly against a token on the input stream. The transition

   A → α.Xη  --X-->  A → αX.η

corresponds to pushing X onto the stack as the result of a reduction of some β to X by applying a rule X → β. Such a reduction must be preceded by the recognition of β, and the state denoted by X → .β represents the start of the process of recognizing β. So when X is a nonterminal, ε-transitions must also be provided, leaving A → α.Xη and going to the LR(0) state X → .β, for every production X → β with X on the left:

   A → α.Xη  --ε-->  X → .β

86 LR(0) parsing: finite automata of items

The two transitions

   A → α.Xη  --X-->  A → αX.η    and    A → α.Xη  --ε-->  X → .β

are the only ones in the NFA of LR(0) items. The start state of the NFA must correspond to the initial conditions of the parser: the parse stack is empty, and S, the start symbol, is about to be parsed, i.e. any initial item S → .α could be used. Since we want the start state to be unique, the simple device of augmenting the grammar with a new, unique start symbol S′, for which S′ → S, suffices. The start state is then S′ → .S.

87 LR(0) parsing: finite automata of items

What are the accepting states of the NFA? The NFA does not need accepting states: it is not being used to recognize the language, merely to keep track of the state of the parse. The parser itself determines when it accepts an input stream, namely when the input stream is empty and the start symbol is on the top of the parse stack.

88 LR(0) parsing: finite automata of items

The grammar with P = {S′ → S, S → (S)S | ε} has three productions and eight LR(0) items:

   S′ → .S       S′ → S.
   S → .(S)S     S → (.S)S
   S → (S.)S     S → (S).S
   S → (S)S.     S → .

The NFA of LR(0) items for the S grammar has the transitions:

   S′ → .S     --S-->  S′ → S.
   S′ → .S     --ε-->  S → .(S)S,  S → .
   S → .(S)S   --(-->  S → (.S)S
   S → (.S)S   --S-->  S → (S.)S
   S → (.S)S   --ε-->  S → .(S)S,  S → .
   S → (S.)S   --)-->  S → (S).S
   S → (S).S   --S-->  S → (S)S.
   S → (S).S   --ε-->  S → .(S)S,  S → .

The next step is to produce the DFA that corresponds to the NFA.

89 LR(0) parsing: converting the NFA into a DFA

To convert the NFA of LR(0) items into a DFA, form the ε-closure of each set of LR(0) items: the closure always contains the set itself; add each item reachable by an ε-transition from the original set; then recursively add all items that are ε-reachable from the items already aggregated. Starting from the closure of the start item, do this for every set of items reached, and add the transitions on grammar symbols that leave each aggregate.
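The ε-closure step described above can be sketched in a few lines. This is an illustrative implementation, not from the notes, using the E grammar and the item representation (lhs, before-dot, after-dot); the names are my own.

```python
# A sketch of the epsilon-closure of a set of LR(0) items, on the
# grammar E' -> E, E -> E + n | n from these notes.

E_PRODS = [("E'", ("E",)), ("E", ("E", "+", "n")), ("E", ("n",))]
NONTERMS = {"E'", "E"}

def closure(items):
    """Add B -> .beta for every item with the dot before nonterminal B,
    repeating until no new item is epsilon-reachable."""
    result = set(items)
    work = list(items)
    while work:
        _lhs, _before, after = work.pop()
        if after and after[0] in NONTERMS:
            for lhs, rhs in E_PRODS:
                if lhs == after[0]:
                    item = (lhs, (), rhs)   # initial item B -> .beta
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return result

state0 = closure({("E'", (), ("E",))})
print(len(state0))  # 3: E' -> .E, E -> .E + n, E -> .n
```

Starting from the single start item E′ → .E, the closure pulls in the two initial E items, yielding exactly the start state of the DFA on a later slide.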

90 LR(0) parsing: an NFA and its corresponding DFA

From the NFA for the S grammar on the previous slide, the subset construction yields the DFA with states:

   State 0: S′ → .S, S → .(S)S, S → .
   State 1: S′ → S.
   State 2: S → (.S)S, S → .(S)S, S → .
   State 3: S → (S.)S
   State 4: S → (S).S, S → .(S)S, S → .
   State 5: S → (S)S.

and transitions 0 --S--> 1, 0 --(--> 2, 2 --(--> 2, 2 --S--> 3, 3 --)--> 4, 4 --(--> 2, 4 --S--> 5.
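The whole construction can be sketched end to end. The following illustrative code (not from the notes; the encoding and names are my own) runs the subset construction for the S grammar, with states as frozensets of items (lhs, before-dot, after-dot), and recovers the six DFA states:

```python
# Sketch of the subset construction for S' -> S, S -> (S)S | eps.

PRODS = [("S'", ("S",)), ("S", ("(", "S", ")", "S")), ("S", ())]
NONTERMS = {"S'", "S"}

def closure(items):
    """epsilon-closure: add B -> .beta whenever a dot precedes B."""
    result = set(items)
    work = list(items)
    while work:
        _lhs, _before, after = work.pop()
        if after and after[0] in NONTERMS:
            for lhs, rhs in PRODS:
                if lhs == after[0]:
                    item = (lhs, (), rhs)
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)

def goto(state, X):
    """Move the dot over X in every item of `state` that allows it."""
    moved = {(l, b + (X,), a[1:]) for (l, b, a) in state if a and a[0] == X}
    return closure(moved)

start = closure({("S'", (), ("S",))})
states, work = {start}, [start]
while work:
    st = work.pop()
    for X in {a[0] for (_l, _b, a) in st if a}:
        nxt = goto(st, X)
        if nxt not in states:
            states.add(nxt)
            work.append(nxt)
print(len(states))  # 6 states, as in the DFA on this slide
```

Each reachable goto target is closed and added once, so the fixed point is exactly the set of DFA states.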

91 LR(0) parsing: finite automata of items

When P = {E′ → E, E → E + n | n} there are three productions and eight LR(0) items:

   E′ → .E       E′ → E.
   E → .E + n    E → E. + n
   E → E + .n    E → E + n.
   E → .n        E → n.

The NFA of LR(0) items for the E grammar has the transitions:

   E′ → .E      --E-->  E′ → E.
   E′ → .E      --ε-->  E → .E + n,  E → .n
   E → .E + n   --E-->  E → E. + n
   E → .E + n   --ε-->  E → .E + n,  E → .n
   E → E. + n   --+-->  E → E + .n
   E → E + .n   --n-->  E → E + n.
   E → .n       --n-->  E → n.

The next step is to produce the DFA that corresponds to the NFA.

92 LR(0) parsing: NFA and equivalent DFA

From the NFA for the E grammar, the subset construction yields the DFA with states:

   State 0: E′ → .E, E → .E + n, E → .n
   State 1: E′ → E., E → E. + n
   State 2: E → n.
   State 3: E → E + .n
   State 4: E → E + n.

and transitions 0 --E--> 1, 0 --n--> 2, 1 --+--> 3, 3 --n--> 4. The items that are added by the ε-closure are known as closure items, and the items that originate a state are called its kernel items.

93 LR(0) parsing

The LR(0) algorithm keeps track of the current state in the DFA of LR(0) items. The parse stack need hold only state numbers, since they represent all the necessary information. For the sake of simplifying the description of the algorithm, the grammar symbol will also be pushed onto the parse stack before the state number. The parse starts with:

     Parsing stack   Input
   1 $ 0             input string$

Suppose the token n is shifted onto the stack and the next state is 2:

     Parsing stack   Input
   2 $ 0 n 2         rest of input string$

The LR(0) parsing algorithm chooses its next action depending on the state on the top of the stack and the current input token.

94 The LR(0) parsing algorithm

Let s be the current state.

1. If state s contains an item A → α.Xβ, where X is a terminal, then the action is a shift. If the current token is X, then the next state is the one containing A → αX.β. If the current token cannot be shifted, then there is an error.

2. If s contains a complete item such as A → γ., then the action is to reduce by the rule A → γ. When the start symbol S is reduced by the rule S′ → S and the input is empty, then accept; if the input is not empty, then announce an error. In every other case the next state is computed as follows: (a) pop γ, with its state numbers, off the stack; (b) let s be the state now on top, which contains an item B → α.Aβ; (c) push A and push the state containing B → αA.β.

95 LR(0) parsing: shift-reduce and reduce-reduce conflicts

A grammar is said to be an LR(0) grammar if the parsing rules apply without ambiguity. If a state contains the complete item A → α., then it may contain no other items. If such a state were also to contain a shift item A → α.Xβ, where X is a terminal, then an ambiguity arises as to whether action (1) or (2) must be executed; this is called a shift-reduce conflict. If such a state were also to contain another complete item B → β., then an ambiguity arises as to which production to apply, A → α or B → β; this is known as a reduce-reduce conflict. A grammar is therefore LR(0) if and only if each state is either a shift state or a reduce state containing a single complete item.

96 SLR(1) parsing

The SLR(1) parsing algorithm. Disambiguating rules for parsing conflicts. Limits of SLR(1) parsing power. SLR(k) grammars.

97 The SLR(1) parsing algorithm

Simple LR(1), i.e. SLR(1), parsing uses the DFA of sets of LR(0) items. The power of LR(0) is significantly increased by using the next token in the input stream to direct the parser's actions in two ways: 1. the input token is consulted before a shift is made, to ensure that an appropriate DFA transition exists; and 2. the follow set of a nonterminal is used to decide whether a reduction should be performed. This is powerful enough to parse almost all common language constructs.

98 The SLR(1) parsing algorithm

Let s be the current state, i.e. the state on top of the stack.

1. If s contains any item of the form A → α.Xβ, where X is the next token in the input stream, then shift X onto the stack and push the state containing the item A → αX.β.

2. If s contains the complete item A → γ. and the next token in the input stream is in follow(A), then reduce by the rule A → γ (more details follow on the next slide).

3. If the next input token is not accommodated by (1) or (2), then an error is declared.

99 The SLR(1) parsing algorithm

If s contains the complete item A → γ. and the next token in the input stream is in follow(A), then reduce by the rule A → γ. Reduction by S′ → S, where S is the start symbol and the next token is $, implies acceptance; otherwise the new state is computed as follows: (a) remove the string γ and all its corresponding states from the parse stack; (b) this backs the DFA up to the state in which the construction of γ started; (c) by construction, this state contains an item of the form B → α.Aβ; push A onto the stack and push the state containing B → αA.β.

100 SLR(1) grammars

A grammar is an SLR(1) grammar if the application of the SLR(1) parsing rules results in no ambiguity. That is, a grammar is SLR(1) if and only if, for every state s:

1. for any item A → α.Xβ in s, where X is a token, there is no complete item B → γ. in s with X ∈ follow(B); a violation of this condition is a shift-reduce conflict;

2. for any two complete items A → α. ∈ s and B → β. ∈ s, follow(A) ∩ follow(B) = ∅; a violation of this condition is a reduce-reduce conflict.

101 Table-driven SLR(1) parsing

The grammar with P = {E′ → E, E → E + n | n} is not LR(0) but is SLR(1). (It is not LR(0) because state 1 below contains both the complete item E′ → E. and the shift item E → E. + n.) Its DFA of sets of items is:

   State 0: E′ → .E, E → .E + n, E → .n
   State 1: E′ → E., E → E. + n
   State 2: E → n.
   State 3: E → E + .n
   State 4: E → E + n.

with transitions 0 --E--> 1, 0 --n--> 2, 1 --+--> 3, 3 --n--> 4. Since follow(E′) = {$} and follow(E) = {$, +}, the entry for state 1 on $ is accept instead of r(E′ → E):

   State |  n    +              $              | Goto E
     0   |  s2                                 |   1
     1   |       s3             accept         |
     2   |       r(E → n)       r(E → n)       |
     3   |  s4                                 |
     4   |       r(E → E + n)   r(E → E + n)   |

102 SLR(1) parse of n+n+n

   State |  n    +              $              | Goto E
     0   |  s2                                 |   1
     1   |       s3             accept         |
     2   |       r(E → n)       r(E → n)       |
     3   |  s4                                 |
     4   |       r(E → E + n)   r(E → E + n)   |

     Parsing stack      Input        Action
   1 $ 0                n + n + n$   shift 2
   2 $ 0 n 2            + n + n$     reduce E → n
   3 $ 0 E 1            + n + n$     shift 3
   4 $ 0 E 1 + 3        n + n$       shift 4
   5 $ 0 E 1 + 3 n 4    + n$         reduce E → E + n
   6 $ 0 E 1            + n$         shift 3
   7 $ 0 E 1 + 3        n$           shift 4
   8 $ 0 E 1 + 3 n 4    $            reduce E → E + n
   9 $ 0 E 1            $            accept
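The table above can be executed directly by a small driver. This is a hedged sketch (the table encoding is my own): ("s", state) shifts, ("r", lhs, n) reduces by popping n states and following the goto entry, and ("acc",) accepts; for brevity the stack holds only state numbers.

```python
# Executing the SLR(1) table for E' -> E, E -> E + n | n.

ACTION = {
    (0, "n"): ("s", 2),
    (1, "+"): ("s", 3),
    (1, "$"): ("acc",),
    (2, "+"): ("r", "E", 1), (2, "$"): ("r", "E", 1),  # E -> n
    (3, "n"): ("s", 4),
    (4, "+"): ("r", "E", 3), (4, "$"): ("r", "E", 3),  # E -> E + n
}
GOTO = {(0, "E"): 1}

def parse(tokens):
    stack = [0]                      # stack of DFA state numbers only
    tokens = list(tokens) + ["$"]
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return False             # no table entry: syntax error
        if act[0] == "acc":
            return True
        if act[0] == "s":            # shift: consume token, push state
            stack.append(act[1])
            i += 1
        else:                        # reduce: pop |rhs| states, goto
            _, lhs, n = act
            del stack[len(stack) - n:]
            stack.append(GOTO[(stack[-1], lhs)])

print(parse(["n", "+", "n", "+", "n"]))  # True
```

The driver is grammar-independent: only the ACTION and GOTO tables encode the language, which is the point of table-driven parsing.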

103 SLR(1) parse of ()()

   State |  (    )             $             | Goto S
     0   |  s2   r(S → ε)      r(S → ε)      |   1
     1   |                     accept        |
     2   |  s2   r(S → ε)      r(S → ε)      |   3
     3   |       s4                          |
     4   |  s2   r(S → ε)      r(S → ε)      |   5
     5   |       r(S → (S)S)   r(S → (S)S)   |

     Parsing stack                       Input   Action
   1 $ 0                                 ()()$   shift 2
   2 $ 0 ( 2                             )()$    reduce S → ε
   3 $ 0 ( 2 S 3                         )()$    shift 4
   4 $ 0 ( 2 S 3 ) 4                     ()$     shift 2
   5 $ 0 ( 2 S 3 ) 4 ( 2                 )$      reduce S → ε
   6 $ 0 ( 2 S 3 ) 4 ( 2 S 3             )$      shift 4
   7 $ 0 ( 2 S 3 ) 4 ( 2 S 3 ) 4         $       reduce S → ε
   8 $ 0 ( 2 S 3 ) 4 ( 2 S 3 ) 4 S 5     $       reduce S → (S)S
   9 $ 0 ( 2 S 3 ) 4 S 5                 $       reduce S → (S)S
  10 $ 0 S 1                             $       accept

104 Disambiguating rules for parsing conflicts

Shift-reduce conflicts have a natural disambiguating rule: prefer the shift over the reduce. Reduce-reduce conflicts are more complex to resolve; they usually require the grammar to be altered. Preferring the shift over the reduce in the dangling-else ambiguity amounts to incorporating the most-closely-nested-if rule. The grammar with the following productions is ambiguous:

   statement → if-statement | other
   if-statement → if (exp) statement | if (exp) statement else statement
   exp → 0 | 1

We will consider the even simpler grammar:

   S → I | other
   I → if S | if S else S

105 Disambiguating a shift-reduce conflict

Consider the grammar:

   S → I | other
   I → if S | if S else S

Since follow(I) = follow(S) = {$, else}, there is a parsing conflict in state 5: the complete item I → if S. indicates a reduction on input else or $, but the item I → if S. else S indicates a shift when else is read. The DFA of sets of items is:

   State 0: S′ → .S, S → .I, S → .other, I → .if S, I → .if S else S
   State 1: S′ → S.
   State 2: S → I.
   State 3: S → other.
   State 4: I → if.S, I → if.S else S, S → .I, S → .other, I → .if S, I → .if S else S
   State 5: I → if S., I → if S. else S
   State 6: I → if S else.S, S → .I, S → .other, I → .if S, I → .if S else S
   State 7: I → if S else S.

States 0, 4 and 6 each have transitions on if to state 4, on other to state 3 and on I to state 2, and on S to states 1, 5 and 7 respectively; state 5 has a transition on else to state 6.

106 An SLR(1) table without conflicts

The rules are numbered:

   (1) S → I
   (2) S → other
   (3) I → if S
   (4) I → if S else S

The SLR(1) parse table, with the shift preferred in state 5:

   State |  if   else   other   $        | Goto S  I
     0   |  s4          s3               |      1  2
     1   |                      accept   |
     2   |       r1             r1       |
     3   |       r2             r2       |
     4   |  s4          s3               |      5  2
     5   |       s6             r3       |
     6   |  s4          s3               |      7  2
     7   |       r4             r4       |

107 Limits of SLR(1) parsing power

Consider the grammar, which describes parameterless procedure calls and assignment statements:

   stmt → call-stmt | assign-stmt
   call-stmt → identifier
   assign-stmt → var := exp
   var → var [ exp ] | identifier
   exp → var | number

Assignments and procedure calls both start with an identifier. The parser can only decide whether a call or an assignment is being processed at the end of the statement, or when the token := appears.

108 Limits of SLR(1) parsing power

Consider the simplified grammar:

   S → id | V := E
   V → id
   E → V | n

The start state of the DFA of sets of items contains:

   S′ → .S
   S → .id
   S → .V := E
   V → .id

This state has a shift transition on id to the state:

   S → id.
   V → id.

follow(S) = {$} and follow(V) = {:=, $}. On getting the input token $ the SLR(1) parser will try to reduce by both the rules S → id and V → id; this is a reduce-reduce conflict. This simple problem can be solved by using an SLR(k) parser.
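The conflict above is exactly a violation of SLR(1) condition 2 from the earlier slide: the two complete items in the state have overlapping follow sets. A small illustrative check (follow sets hardcoded from this slide; the helper name is my own):

```python
# Checking SLR(1) condition 2 on the conflicting state {S -> id., V -> id.}
# with the follow sets given on this slide.

FOLLOW = {"S": {"$"}, "V": {":=", "$"}}
state = [("S", ("id",)), ("V", ("id",))]   # two complete items

def reduce_reduce_conflict(complete_items, follow):
    """True if two complete items have overlapping follow sets."""
    for i, (a, _rhs) in enumerate(complete_items):
        for b, _rhs2 in complete_items[i + 1:]:
            if follow[a] & follow[b]:
                return True
    return False

print(reduce_reduce_conflict(state, FOLLOW))  # True: both contain $
```

Since follow(S) ∩ follow(V) = {$} is nonempty, the grammar fails the SLR(1) test even though it is unambiguous.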

109 SLR(k) grammars

The SLR(1) algorithm can be extended to SLR(k) parsing, with k ≥ 1 lookahead symbols, using the first_k and follow_k sets and the two rules:

1. If s contains A → α.Xβ, where X is a token and Xw ∈ first_k(Xβ) gives the next k tokens in the input stream, then the action is to shift the current input token onto the stack and to push the state containing the item A → αX.β.

2. If s contains the complete item A → α. and w ∈ follow_k(A) gives the next k tokens in the input string, then the action is to reduce by the rule A → α.

SLR(k) parsing is more powerful than SLR(1) parsing when k > 1, but it is substantially more expensive, since the size of the parsing tables grows exponentially in k. Typical non-SLR(1) constructs are instead handled using an LALR(1) parser, by using standard disambiguating rules, or by rewriting the grammar.

110 General LR(1) and LALR(1) parsing

LR(1) parsing, also called canonical LR(1) parsing, overcomes the problems of SLR(1) parsing, but at the cost of a substantially larger DFA. Lookahead LR(1), or LALR(1), parsing preserves the efficiency of SLR(1) parsing while retaining most of the benefits of general LR(1) parsing. We will discuss: finite automata of LR(1) items; the LR(1) parsing algorithm; LALR(1) parsing.

111 Finite automata of LR(1) items (Louden)

SLR(1) applies lookahead after constructing the DFA of LR(0) items; the construction itself ignores the advantages that may ensue from considering lookaheads. General LR(1) uses a new DFA that has the lookaheads built in from the start. This DFA uses items that are an extension of LR(0) items; they are called LR(1) items because they include a single lookahead token in each item. LR(1) items are written

   [A → α.β, a]

where A → α.β is an LR(0) item and a is the lookahead token. Next the transitions between LR(1) items will be defined.

112 Transitions between LR(1) items

There are several similarities with DFAs of LR(0) items: they include ε-transitions, and the DFA states are again built from ε-closures. However, transitions between LR(1) items must also keep track of the lookahead token. Normal, i.e. non-ε, transitions are quite similar to those in DFAs of LR(0) items; the major difference lies in the definition of the ε-transitions.

113 Definition of LR(1) transitions

Given an LR(1) item [A → α.Xγ, a], where X ∈ N ∪ T, there is a transition on X to the item [A → αX.γ, a]. Given an LR(1) item [A → α.Bγ, a], where B ∈ N, there are ε-transitions to the items [B → .β, b] for every production B → β and for every token b ∈ first(γa). Only ε-transitions create new lookaheads.

114 DFA of sets of LR(0) items for A → (A) | a (Louden p. 208)

The augmented grammar with P = {A′ → A, A → (A) | a} has the DFA of sets of LR(0) items:

   State 0: A′ → .A, A → .(A), A → .a
   State 1: A′ → A.
   State 2: A → a.
   State 3: A → (.A), A → .(A), A → .a
   State 4: A → (A.)
   State 5: A → (A).

with transitions 0 --A--> 1, 0 --a--> 2, 0 --(--> 3, 3 --(--> 3, 3 --a--> 2, 3 --A--> 4, 4 --)--> 5. The parsing actions for the input ((a)) follow:

     Parsing stack        Input    Action
   1 $ 0                  ((a))$   shift
   2 $ 0 ( 3              (a))$    shift
   3 $ 0 ( 3 ( 3          a))$     shift
   4 $ 0 ( 3 ( 3 a 2      ))$      reduce A → a
   5 $ 0 ( 3 ( 3 A 4      ))$      shift
   6 $ 0 ( 3 ( 3 A 4 ) 5  )$       reduce A → (A)
   7 $ 0 ( 3 A 4          )$       shift
   8 $ 0 ( 3 A 4 ) 5      $        reduce A → (A)
   9 $ 0 A 1              $        accept

115 DFA of sets of LR(1) items for A → (A) | a (Louden p. 218)

Augment the grammar by adding A′ → A.

State 0: first put [A′ → .A, $] into State 0. To complete the closure, add, via ε-transitions, the items with A on the left of the production and $ as the lookahead: [A → .(A), $] and [A → .a, $].

   State 0: [A′ → .A, $], [A → .(A), $], [A → .a, $]

State 1: there is a transition from State 0 on A to the closure of the set that includes [A′ → A., $]. The action for this state will be to accept.

   State 1: [A′ → A., $]

116 DFA of sets of LR(1) items for A → (A) | a

   State 0: [A′ → .A, $], [A → .(A), $], [A → .a, $]

State 2: there is a transition on ( leaving State 0 to the closure of the LR(1) item [A → (.A), $], which forms the basis of State 2. There are ε-transitions from this item to [A → .(A), )] and to [A → .a, )], because the lookahead for the A inside the parentheses is first( )$ ) = { ) }. Note that a new lookahead appears here. The complete State 2 is:

   State 2: [A → (.A), $], [A → .(A), )], [A → .a, )]
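The LR(1) closure rule used to build State 2 can be sketched directly from the definition on slide 113. This is illustrative code, not from the notes: items are (lhs, before-dot, after-dot, lookahead), the FIRST sets are hardcoded for this grammar (which has no nullable symbols), and the names are my own.

```python
# Sketch of LR(1) closure for the grammar A' -> A, A -> (A) | a.

PRODS = [("A'", ("A",)), ("A", ("(", "A", ")")), ("A", ("a",))]
NONTERMS = {"A'", "A"}
FIRST = {"A": {"(", "a"}}

def first_of(seq):
    """FIRST of a nonempty sequence; a terminal is its own FIRST set."""
    return FIRST[seq[0]] if seq[0] in NONTERMS else {seq[0]}

def closure1(items):
    result = set(items)
    work = list(items)
    while work:
        _lhs, _before, after, la = work.pop()
        if after and after[0] in NONTERMS:
            # [A -> alpha . B gamma, a]: add [B -> .beta, b]
            # for every b in first(gamma a)
            for b in first_of(after[1:] + (la,)):
                for lhs, rhs in PRODS:
                    if lhs == after[0]:
                        item = (lhs, (), rhs, b)
                        if item not in result:
                            result.add(item)
                            work.append(item)
    return result

state2 = closure1({("A", ("(",), ("A", ")"), "$")})
print(len(state2))  # 3 items, matching State 2 on the slide
```

From the kernel item [A → (.A), $], first(γa) = first( )$ ) = { ) } generates the lookahead ) on the two added initial items, reproducing State 2 exactly.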


More information

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;}

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;} Compiler Construction Grammars Parsing source code scanner tokens regular expressions lexical analysis Lennart Andersson parser context free grammar Revision 2012 01 23 2012 parse tree AST builder (implicit)

More information

S Y N T A X A N A L Y S I S LR

S Y N T A X A N A L Y S I S LR LR parsing There are three commonly used algorithms to build tables for an LR parser: 1. SLR(1) = LR(0) plus use of FOLLOW set to select between actions smallest class of grammars smallest tables (number

More information

Question Bank. 10CS63:Compiler Design

Question Bank. 10CS63:Compiler Design Question Bank 10CS63:Compiler Design 1.Determine whether the following regular expressions define the same language? (ab)* and a*b* 2.List the properties of an operator grammar 3. Is macro processing a

More information

Compilerconstructie. najaar Rudy van Vliet kamer 140 Snellius, tel rvvliet(at)liacs(dot)nl. college 3, vrijdag 22 september 2017

Compilerconstructie. najaar Rudy van Vliet kamer 140 Snellius, tel rvvliet(at)liacs(dot)nl. college 3, vrijdag 22 september 2017 Compilerconstructie najaar 2017 http://www.liacs.leidenuniv.nl/~vlietrvan1/coco/ Rudy van Vliet kamer 140 Snellius, tel. 071-527 2876 rvvliet(at)liacs(dot)nl college 3, vrijdag 22 september 2017 + werkcollege

More information

Compiler Design Concepts. Syntax Analysis

Compiler Design Concepts. Syntax Analysis Compiler Design Concepts Syntax Analysis Introduction First task is to break up the text into meaningful words called tokens. newval=oldval+12 id = id + num Token Stream Lexical Analysis Source Code (High

More information

Top down vs. bottom up parsing

Top down vs. bottom up parsing Parsing A grammar describes the strings that are syntactically legal A recogniser simply accepts or rejects strings A generator produces sentences in the language described by the grammar A parser constructs

More information

The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program.

The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program. COMPILER DESIGN 1. What is a compiler? A compiler is a program that reads a program written in one language the source language and translates it into an equivalent program in another language-the target

More information

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

MIT Specifying Languages with Regular Expressions and Context-Free Grammars MIT 6.035 Specifying Languages with Regular essions and Context-Free Grammars Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology Language Definition Problem How to precisely

More information

Chapter 2 :: Programming Language Syntax

Chapter 2 :: Programming Language Syntax Chapter 2 :: Programming Language Syntax Michael L. Scott kkman@sangji.ac.kr, 2015 1 Regular Expressions A regular expression is one of the following: A character The empty string, denoted by Two regular

More information

CS 314 Principles of Programming Languages

CS 314 Principles of Programming Languages CS 314 Principles of Programming Languages Lecture 5: Syntax Analysis (Parsing) Zheng (Eddy) Zhang Rutgers University January 31, 2018 Class Information Homework 1 is being graded now. The sample solution

More information

UNIT-III BOTTOM-UP PARSING

UNIT-III BOTTOM-UP PARSING UNIT-III BOTTOM-UP PARSING Constructing a parse tree for an input string beginning at the leaves and going towards the root is called bottom-up parsing. A general type of bottom-up parser is a shift-reduce

More information

Introduction to Syntax Analysis. The Second Phase of Front-End

Introduction to Syntax Analysis. The Second Phase of Front-End Compiler Design IIIT Kalyani, WB 1 Introduction to Syntax Analysis The Second Phase of Front-End Compiler Design IIIT Kalyani, WB 2 Syntax Analysis The syntactic or the structural correctness of a program

More information

LR Parsing, Part 2. Constructing Parse Tables. An NFA Recognizing Viable Prefixes. Computing the Closure. GOTO Function and DFA States

LR Parsing, Part 2. Constructing Parse Tables. An NFA Recognizing Viable Prefixes. Computing the Closure. GOTO Function and DFA States TDDD16 Compilers and Interpreters TDDB44 Compiler Construction LR Parsing, Part 2 Constructing Parse Tables Parse table construction Grammar conflict handling Categories of LR Grammars and Parsers An NFA

More information

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised:

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised: EDAN65: Compilers, Lecture 06 A LR parsing Görel Hedin Revised: 2017-09-11 This lecture Regular expressions Context-free grammar Attribute grammar Lexical analyzer (scanner) Syntactic analyzer (parser)

More information

More Bottom-Up Parsing

More Bottom-Up Parsing More Bottom-Up Parsing Lecture 7 Dr. Sean Peisert ECS 142 Spring 2009 1 Status Project 1 Back By Wednesday (ish) savior lexer in ~cs142/s09/bin Project 2 Due Friday, Apr. 24, 11:55pm My office hours 3pm

More information

Chapter 4. Lexical and Syntax Analysis

Chapter 4. Lexical and Syntax Analysis Chapter 4 Lexical and Syntax Analysis Chapter 4 Topics Introduction Lexical Analysis The Parsing Problem Recursive-Descent Parsing Bottom-Up Parsing Copyright 2012 Addison-Wesley. All rights reserved.

More information

Chapter 3. Describing Syntax and Semantics ISBN

Chapter 3. Describing Syntax and Semantics ISBN Chapter 3 Describing Syntax and Semantics ISBN 0-321-49362-1 Chapter 3 Topics Introduction The General Problem of Describing Syntax Formal Methods of Describing Syntax Copyright 2009 Addison-Wesley. All

More information

Unit 13. Compiler Design

Unit 13. Compiler Design Unit 13. Compiler Design Computers are a balanced mix of software and hardware. Hardware is just a piece of mechanical device and its functions are being controlled by a compatible software. Hardware understands

More information

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University CS415 Compilers Syntax Analysis These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University Limits of Regular Languages Advantages of Regular Expressions

More information

Bottom-Up Parsing. Lecture 11-12

Bottom-Up Parsing. Lecture 11-12 Bottom-Up Parsing Lecture 11-12 (From slides by G. Necula & R. Bodik) 9/22/06 Prof. Hilfinger CS164 Lecture 11 1 Bottom-Up Parsing Bottom-up parsing is more general than topdown parsing And just as efficient

More information

Concepts Introduced in Chapter 4

Concepts Introduced in Chapter 4 Concepts Introduced in Chapter 4 Grammars Context-Free Grammars Derivations and Parse Trees Ambiguity, Precedence, and Associativity Top Down Parsing Recursive Descent, LL Bottom Up Parsing SLR, LR, LALR

More information

CS 406/534 Compiler Construction Parsing Part I

CS 406/534 Compiler Construction Parsing Part I CS 406/534 Compiler Construction Parsing Part I Prof. Li Xu Dept. of Computer Science UMass Lowell Fall 2004 Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy and Dr.

More information

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING Subject Name: CS2352 Principles of Compiler Design Year/Sem : III/VI

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING Subject Name: CS2352 Principles of Compiler Design Year/Sem : III/VI DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING Subject Name: CS2352 Principles of Compiler Design Year/Sem : III/VI UNIT I - LEXICAL ANALYSIS 1. What is the role of Lexical Analyzer? [NOV 2014] 2. Write

More information

Abstract Syntax Trees & Top-Down Parsing

Abstract Syntax Trees & Top-Down Parsing Review of Parsing Abstract Syntax Trees & Top-Down Parsing Given a language L(G), a parser consumes a sequence of tokens s and produces a parse tree Issues: How do we recognize that s L(G)? A parse tree

More information

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology MIT 6.035 Specifying Languages with Regular essions and Context-Free Grammars Martin Rinard Massachusetts Institute of Technology Language Definition Problem How to precisely define language Layered structure

More information

It parses an input string of tokens by tracing out the steps in a leftmost derivation.

It parses an input string of tokens by tracing out the steps in a leftmost derivation. It parses an input string of tokens by tracing out CS 4203 Compiler Theory the steps in a leftmost derivation. CHAPTER 4: TOP-DOWN PARSING Part1 And the implied traversal of the parse tree is a preorder

More information

Introduction to Syntax Analysis

Introduction to Syntax Analysis Compiler Design 1 Introduction to Syntax Analysis Compiler Design 2 Syntax Analysis The syntactic or the structural correctness of a program is checked during the syntax analysis phase of compilation.

More information

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES THE COMPILATION PROCESS Character stream CS 403: Scanning and Parsing Stefan D. Bruda Fall 207 Token stream Parse tree Abstract syntax tree Modified intermediate form Target language Modified target language

More information

Languages and Compilers

Languages and Compilers Principles of Software Engineering and Operational Systems Languages and Compilers SDAGE: Level I 2012-13 3. Formal Languages, Grammars and Automata Dr Valery Adzhiev vadzhiev@bournemouth.ac.uk Office:

More information

CS 403: Scanning and Parsing

CS 403: Scanning and Parsing CS 403: Scanning and Parsing Stefan D. Bruda Fall 2017 THE COMPILATION PROCESS Character stream Scanner (lexical analysis) Token stream Parser (syntax analysis) Parse tree Semantic analysis Abstract syntax

More information

Part 5 Program Analysis Principles and Techniques

Part 5 Program Analysis Principles and Techniques 1 Part 5 Program Analysis Principles and Techniques Front end 2 source code scanner tokens parser il errors Responsibilities: Recognize legal programs Report errors Produce il Preliminary storage map Shape

More information

Bottom up parsing. The sentential forms happen to be a right most derivation in the reverse order. S a A B e a A d e. a A d e a A B e S.

Bottom up parsing. The sentential forms happen to be a right most derivation in the reverse order. S a A B e a A d e. a A d e a A B e S. Bottom up parsing Construct a parse tree for an input string beginning at leaves and going towards root OR Reduce a string w of input to start symbol of grammar Consider a grammar S aabe A Abc b B d And

More information

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis. Topics Chapter 4 Lexical and Syntax Analysis Introduction Lexical Analysis Syntax Analysis Recursive -Descent Parsing Bottom-Up parsing 2 Language Implementation Compilation There are three possible approaches

More information

Bottom-Up Parsing II. Lecture 8

Bottom-Up Parsing II. Lecture 8 Bottom-Up Parsing II Lecture 8 1 Review: Shift-Reduce Parsing Bottom-up parsing uses two actions: Shift ABC xyz ABCx yz Reduce Cbxy ijk CbA ijk 2 Recall: he Stack Left string can be implemented by a stack

More information

Abstract Syntax Trees & Top-Down Parsing

Abstract Syntax Trees & Top-Down Parsing Abstract Syntax Trees & Top-Down Parsing Review of Parsing Given a language L(G), a parser consumes a sequence of tokens s and produces a parse tree Issues: How do we recognize that s L(G)? A parse tree

More information

Abstract Syntax Trees & Top-Down Parsing

Abstract Syntax Trees & Top-Down Parsing Review of Parsing Abstract Syntax Trees & Top-Down Parsing Given a language L(G), a parser consumes a sequence of tokens s and produces a parse tree Issues: How do we recognize that s L(G)? A parse tree

More information

CIT Lecture 5 Context-Free Grammars and Parsing 4/2/2003 1

CIT Lecture 5 Context-Free Grammars and Parsing 4/2/2003 1 CIT3136 - Lecture 5 Context-Free Grammars and Parsing 4/2/2003 1 Definition of a Context-free Grammar: An alphabet or set of basic symbols (like regular expressions, only now the symbols are whole tokens,

More information

Building Compilers with Phoenix

Building Compilers with Phoenix Building Compilers with Phoenix Syntax-Directed Translation Structure of a Compiler Character Stream Intermediate Representation Lexical Analyzer Machine-Independent Optimizer token stream Intermediate

More information

Syntax Analysis/Parsing. Context-free grammars (CFG s) Context-free grammars vs. Regular Expressions. BNF description of PL/0 syntax

Syntax Analysis/Parsing. Context-free grammars (CFG s) Context-free grammars vs. Regular Expressions. BNF description of PL/0 syntax Susan Eggers 1 CSE 401 Syntax Analysis/Parsing Context-free grammars (CFG s) Purpose: determine if tokens have the right form for the language (right syntactic structure) stream of tokens abstract syntax

More information

Bottom-Up Parsing II (Different types of Shift-Reduce Conflicts) Lecture 10. Prof. Aiken (Modified by Professor Vijay Ganesh.

Bottom-Up Parsing II (Different types of Shift-Reduce Conflicts) Lecture 10. Prof. Aiken (Modified by Professor Vijay Ganesh. Bottom-Up Parsing II Different types of Shift-Reduce Conflicts) Lecture 10 Ganesh. Lecture 10) 1 Review: Bottom-Up Parsing Bottom-up parsing is more general than topdown parsing And just as efficient Doesn

More information

8 Parsing. Parsing. Top Down Parsing Methods. Parsing complexity. Top down vs. bottom up parsing. Top down vs. bottom up parsing

8 Parsing. Parsing. Top Down Parsing Methods. Parsing complexity. Top down vs. bottom up parsing. Top down vs. bottom up parsing 8 Parsing Parsing A grammar describes syntactically legal strings in a language A recogniser simply accepts or rejects strings A generator produces strings A parser constructs a parse tree for a string

More information

COMPILER DESIGN - QUICK GUIDE COMPILER DESIGN - OVERVIEW

COMPILER DESIGN - QUICK GUIDE COMPILER DESIGN - OVERVIEW COMPILER DESIGN - QUICK GUIDE http://www.tutorialspoint.com/compiler_design/compiler_design_quick_guide.htm COMPILER DESIGN - OVERVIEW Copyright tutorialspoint.com Computers are a balanced mix of software

More information

Lexical and Syntax Analysis

Lexical and Syntax Analysis Lexical and Syntax Analysis (of Programming Languages) Top-Down Parsing Lexical and Syntax Analysis (of Programming Languages) Top-Down Parsing Easy for humans to write and understand String of characters

More information

Syntactic Analysis. Top-Down Parsing

Syntactic Analysis. Top-Down Parsing Syntactic Analysis Top-Down Parsing Copyright 2017, Pedro C. Diniz, all rights reserved. Students enrolled in Compilers class at University of Southern California (USC) have explicit permission to make

More information

CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages CMSC 330: Organization of Programming Languages Context Free Grammars and Parsing 1 Recall: Architecture of Compilers, Interpreters Source Parser Static Analyzer Intermediate Representation Front End Back

More information

Programming Language Specification and Translation. ICOM 4036 Fall Lecture 3

Programming Language Specification and Translation. ICOM 4036 Fall Lecture 3 Programming Language Specification and Translation ICOM 4036 Fall 2009 Lecture 3 Some parts are Copyright 2004 Pearson Addison-Wesley. All rights reserved. 3-1 Language Specification and Translation Topics

More information

CS 321 Programming Languages and Compilers. VI. Parsing

CS 321 Programming Languages and Compilers. VI. Parsing CS 321 Programming Languages and Compilers VI. Parsing Parsing Calculate grammatical structure of program, like diagramming sentences, where: Tokens = words Programs = sentences For further information,

More information

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2 Formal Languages and Grammars Chapter 2: Sections 2.1 and 2.2 Formal Languages Basis for the design and implementation of programming languages Alphabet: finite set Σ of symbols String: finite sequence

More information

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones Parsing III (Top-down parsing: recursive descent & LL(1) ) (Bottom-up parsing) CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones Copyright 2003, Keith D. Cooper,

More information

Theory and Compiling COMP360

Theory and Compiling COMP360 Theory and Compiling COMP360 It has been said that man is a rational animal. All my life I have been searching for evidence which could support this. Bertrand Russell Reading Read sections 2.1 3.2 in the

More information

Bottom-up parsing. Bottom-Up Parsing. Recall. Goal: For a grammar G, withstartsymbols, any string α such that S α is called a sentential form

Bottom-up parsing. Bottom-Up Parsing. Recall. Goal: For a grammar G, withstartsymbols, any string α such that S α is called a sentential form Bottom-up parsing Bottom-up parsing Recall Goal: For a grammar G, withstartsymbols, any string α such that S α is called a sentential form If α V t,thenα is called a sentence in L(G) Otherwise it is just

More information

ICOM 4036 Spring 2004

ICOM 4036 Spring 2004 Language Specification and Translation ICOM 4036 Spring 2004 Lecture 3 Copyright 2004 Pearson Addison-Wesley. All rights reserved. 3-1 Language Specification and Translation Topics Structure of a Compiler

More information

Syntax Analysis: Context-free Grammars, Pushdown Automata and Parsing Part - 4. Y.N. Srikant

Syntax Analysis: Context-free Grammars, Pushdown Automata and Parsing Part - 4. Y.N. Srikant Syntax Analysis: Context-free Grammars, Pushdown Automata and Part - 4 Department of Computer Science and Automation Indian Institute of Science Bangalore 560 012 NPTEL Course on Principles of Compiler

More information

Part 3. Syntax analysis. Syntax analysis 96

Part 3. Syntax analysis. Syntax analysis 96 Part 3 Syntax analysis Syntax analysis 96 Outline 1. Introduction 2. Context-free grammar 3. Top-down parsing 4. Bottom-up parsing 5. Conclusion and some practical considerations Syntax analysis 97 Structure

More information

Principles of Programming Languages

Principles of Programming Languages Principles of Programming Languages h"p://www.di.unipi.it/~andrea/dida2ca/plp- 14/ Prof. Andrea Corradini Department of Computer Science, Pisa Lesson 8! Bo;om- Up Parsing Shi?- Reduce LR(0) automata and

More information