
Chapter 5: Advanced Topics

    If a little knowledge is dangerous, where is the man who has so
    much as to be out of danger?  (Thomas Huxley)

There are a number of interesting topics that cannot all be covered in a term. For this book, so closely tied to a term project, it is also important to get to the operational material early so that the projects can get underway. The writers were faced with a choice: present all of the material in its most natural order and depend on the instructor to pick and skip, or present one complete track through the material and then organize the rest as additional topics, some of which can serve as lecture material toward the end of the term, when it is too late to change the course of a term project but not too late to think about the next project. Thus Chapters 1-4 are designed to mesh with the stages of the project; there are few sections in them that can be skipped. There are a few topics that should not be skipped, but that interact with the main track in such a way as to make it more difficult to present. They include two final features of X, procedures and macros, and the bottom-up alternative for parsing. Static analysis, macros, linkers and loaders, and some other sections are optional.

5.1 Subprograms and Procedures

    subprogram:   "INPUTSET.x"
    procedure:    z := "GCD.x" := 33, 111

The definition of X, and therefore the implementation of X, was left incomplete in Chapters 1, 2, and 3. Two of the missing features are presented here. While the details differ from similar constructs in traditional programming languages, the implementations are conventional. The consequence is that, like

the earlier semantic material, this material too can be applied to a variety of languages.

Definition of Subprogram

A complete X program can be inserted into a statement list in any other program without violating syntactic or scoping rules. The meaning of insertion is as if the combined programs were analyzed after the insertion. If the inserted program is a block, its private variables are, by definition, kept separate from the variables in the surrounding text. The global variables of the inserted block are the only path of information into and out of the block.

A program that, when run by itself, caused input or output may no longer cause input or output after combination with another program. Output is a response to a variable never used on the right; the other program may contain a left use of the variable, so that in the combined program the variable is used on both the left and the right, and is therefore no longer used for output.

Suppose a program has some input variables. If one inserts a list of assignments to those variables into that program, the assignments will take precedence over the implicit input caused by right-only use in the containing program. This is how batch input is realized. One can analogously cause batch output by inserting a list of assignments to new, unique names.

The form of insertion is syntactically a statement:

    "ProgramName"

where the program name inside quotes is known to the system. Typically it will be a file name. (1) It has exactly the same kind of semantics as the typical include statement, such as that of C, although it is somewhat more interesting (surprising?) in X because of the inference mechanisms for variable type and use.

Definition of Procedure

A subprogram with parameters is a procedure. Suppose "GCD.x" is a file containing the program:

    x, y := X, Y;
    it
      if x < y -> y := y - x
      :: x > y -> x := x - y
      :: x = y -> exit
      fi
    ti;
    Z := x

(1) This design has good points and bad points. The string quotes free the name from the syntactic constraints of X; on the other hand, one may find procedure names constrained by file naming conventions.

If run as a program, it would request input for X and Y and report the value of Z. One might invoke GCD.x as a procedure by:

    gcd := "GCD.x" := 17, 51

The effect of the invocation is as if the file GCD.x were surrounded in its implicit be-eb pair, all its outer variables made private, an assignment of the actual parameter values to the input variables placed immediately after the nomenclature, and an assignment of the output variables (in this case, Z) placed immediately before the closing eb. The result, for GCD.x, is as follows:

    be x y X Y Z
      X, Y := 17, 51;
      "GCD.x";
      gcd := Z
    eb

The order of the actual parameters is determined by the order of appearance of the input and output variables in the text of the procedure definition. This information is typically supplied by the procedure author in documentation rather than by the user examining the text of the definition. Running a program standalone in Hyper exhibits the input and output variables, in order, to assist the documentor.

A procedure call is defined as if by substitution: the invocation is replaced by the defining block described above. The number of actual parameters must match the number of input and output variables.

A procedure may be defined in terms of itself. For file GCD.x we might have had instead: (2)

    x, y := X, Y;
    if x = y -> z := y
    :: x < y -> z := "GCD.x" := (y-1)//x+1, x
    :: x > y -> z := "GCD.x" := y, x
    fi;
    Z := z

Because of the potential for name clashes, the concepts of preactive and postactive regions are essential to the understanding of recursive procedures. Even when GCD.x is nested within itself, the inputs and outputs are kept separate, despite the fact that they have the same names in each nesting. One can see this by carrying out the nesting one more level in the GCD.x example above. It is also interesting that a recursive procedure has an infinite definition. The definition needs to be examined in any one execution only as deeply as the recursion actually goes. There is a well understood mechanism to get the effect of recursion without needing to copy the definition.

(2) Reminder: operator // means remainder.
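In a language with conventional recursion the same definition is a few lines. Here is a minimal Python sketch of the recursive GCD.x, assuming only that X's // (remainder, per the footnote) corresponds to Python's %:

    def gcd(x, y):
        # Mirrors the recursive GCD.x above. The expression (y - 1) % x + 1
        # is the remainder nudged into the range 1..x, which keeps both
        # arguments positive as the recursion descends.
        if x == y:
            return y
        elif x < y:
            return gcd((y - 1) % x + 1, x)
        else:
            return gcd(y, x)

    print(gcd(33, 111))   # 3, the result of z := "GCD.x" := 33, 111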

Run Stack

Implementing Recursion

5.2 Static Analysis

Control Paths

The access patterns of a program are represented as a regular expression over a vocabulary of read and write events. The notation:

    {x, y, ...}   a set of variable names
    V_p           the set of names in construct p
    path_x(p)     the read/write sequence for x in p
    B             the set of block designators

Definitions (with ∘ standing for a unary or binary operator):

    B = {0, 1, 2, ...}                                         (5.1)
    V_x = {x}                                                  (5.2)
    V_c = {}                                                   (5.3)
    V_∘p = V_p                                                 (5.4)
    V_p∘q = V_p ∪ V_q                                          (5.5)
    V_(p) = V_p                                                (5.6)

    read_x(S)  = r_x if x ∈ S, λ otherwise                     (5.7)
    write_x(S) = w_x if x ∈ S, λ otherwise                     (5.8)

    path_x(x_1,...,x_k := p_1,...,p_k)
               = read_x(∪_i V_p_i) write_x(∪_i {x_i})          (5.9)
    path_x(abort) = λ                                          (5.10)
    path_x(skip)  = λ                                          (5.11)
    path_x(exit)  = λ                                          (5.12)
    path_x(p q)   = path_x(p) path_x(q)                        (5.13)
    path_x(if p_1 -> q_1 :: ... :: p_n -> q_n fi)
               = read_x(∪_i V_p_i) (∪_i path_x(q_i))           (5.14)
    path_x(it p ti) = (path_x(p))+                             (5.15)

    Table 5.1: Static Analysis
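The equations of Table 5.1 tabulate mechanically. The following is a minimal Python sketch for assignments, sequences, and the trivial statements only; the tuple encoding of statements and all names here are invented for illustration, not taken from the book:

    # Hypothetical encoding: ("assign", targets, expr_vars), ("seq", p, q),
    # ("skip",), ("abort",), ("exit",). expr_vars lists the V_p sets of
    # the right-hand-side expressions.

    def read(x, names):                  # read_x(S): r_x if x in S, else empty
        return ["r_" + x] if x in names else []

    def write(x, names):                 # write_x(S): w_x if x in S, else empty
        return ["w_" + x] if x in names else []

    def path(x, stmt):
        kind = stmt[0]
        if kind == "assign":             # equation (5.9)
            _, targets, expr_vars = stmt
            return read(x, set().union(*expr_vars)) + write(x, set(targets))
        if kind == "seq":                # equation (5.13)
            return path(x, stmt[1]) + path(x, stmt[2])
        return []                        # skip, abort, exit: (5.10)-(5.12)

    # z := x + y; x := z
    prog = ("seq", ("assign", ["z"], [{"x", "y"}]),
                   ("assign", ["x"], [{"z"}]))
    print(" ".join(path("x", prog)))     # r_x w_x : x is read, then written
    print(" ".join(path("z", prog)))     # w_z r_z : z is written, then read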

5.3 Macros

A macro is a text transformation mechanism.

5.4 Linkers and Loaders

In modern languages the process of getting a program into execution involves more than translation to machine-executable form. Typically the unit of compilation, called a module, is less than a whole program. The translation of a single module must leave behind a record that can eventually be combined with other compilation records, so that when all of the modules have been translated, they can be combined into a single runnable program.

One way is just to compile them all together: not a bad idea when the compiler is very fast and the program is not too large. But, in fact, programs are too large. One's own relatively modest program may have to be combined with a much larger program written by someone else. It is often the case that the source files for the other program are not available. (3)

The compilation record is called an object module. The program that combines object modules is called a linker. The program that takes a linked program and places it in execution is called a loader. The linker and loader are often combined in one.

Object Files

5.5 Automatic Parsing

    I never have a computer do something I can do by hand.  (a mathematician)
    I never do anything by hand I can do with a computer.  (a hacker)
    That mathematician had better not construct parsers.  (a compiler writer)

The recursive descent technique for writing parsers requires the programmer to write ten or so lines of recursive code for each nonterminal in the grammar. There is another method, usually called bottom-up, in contrast to the top-down recursive descent, for building parsers for which no parser code need be written and for which it is guaranteed that there are no parser errors. Bottom-up parsers also have a better chance to recover from input errors and proceed with usable analysis beyond the point of the first error. The disadvantages of the bottom-up method are that a table-building tool must be available or be constructed, and a grammar for the source language, obeying the restrictions of the table builder, must be prepared. There is no significant practical difference in the resulting parsers.

(3) Source files can get lost, become out of date with respect to available compilers, be in the wrong language, or even contain trade secrets.

There are many ways of building bottom-up parsers. The dominant technology goes under the initials lr. One develops the lr technology by considering the input text as the catenation of a parse stack ρ and an as yet unprocessed input δ. Then one reduces the process of parsing to a sequence of just two kinds of actions:

    shift    move one symbol from the head of δ to the top of ρ
    reduce   apply a rule to the top of ρ, thereby rewriting its rightmost symbols

Recalling Exercise 1d in Chapter 2, start the process with ρ = λ and δ = f∨t. If some authoritative sergeant bawled out the cadence

    shift, reduce by r8, reduce by r6, reduce by r4, reduce by r2,
    shift, shift, reduce by r7, reduce by r6, reduce by r4,
    reduce by r1, reduce by r0, parser halt!

the parser could respond with the steps

    ρ                            δ
    λ                            f∨t
    f                            ∨t
    Boolean                      ∨t
    Complement                   ∨t
    Conjunction                  ∨t
    Disjunction                  ∨t
    Disjunction ∨                t
    Disjunction ∨ t              λ
    Disjunction ∨ Boolean        λ
    Disjunction ∨ Complement     λ
    Disjunction ∨ Conjunction    λ
    Disjunction                  λ
    Proposition                  λ

What we need is the sergeant and the parser.

It is tempting to think that the sergeant need not name the rule, since the right side of the rule must match the tail of ρ, but in fact more than one rule might match. For example, whenever rule r1 is applied, rule r2 also matches the tail of ρ. How, then, can the sergeant decide which rule to apply? The answer is lookahead. There is nothing fundamental in this choice; it reflects the idea that the input is coming off the input tape, and therefore only the next few symbols are conveniently available for examination. A language for which the reduce choices can be made by looking ahead k symbols is called lr(k). The combinatorics of looking at k symbols keeps k = 1 for all practical purposes. Language designers have learned to live within this constraint.

The lookahead might accidentally look beyond the input (as in the application of rule r1 above). To keep things simple, and within the model of using only

grammars for language description, all lr(k) grammars are given an end-of-file symbol ⊣; exactly k ⊣ are appended to the last applied rule of the grammar. In the case of the lr(1) grammar in Table 2.1, the first rule would become

    r0    Proposition → Disjunction ⊣

The implementation of ⊣ is achieved by having the scanner return ⊣ when the end of file is detected.

The many versions of lr differ principally in how lookahead is computed and used, in the size of the tables needed, in the generality of the resulting parser, and in the speed with which it operates. One particular version, called lalr(1), is presented in this section. Finite state, lr(0), and slr(1) parsers are also presented, because they are intermediate steps in the construction of lalr(1) parsers.

Finite Automata

Finite automata (fa) are abstract algorithms for the recognition of strings. They are closely related to regular expressions. (4) Automata have many applications: they form the basis of pattern matching programs such as Unix egrep, and are sometimes used as the basis for scanners. The reason for presenting them here is that they are the basis for lr parsers.

Automata are also called state-transition machines. The central idea is that an automaton, at any one time, is in a unique state and can transition to some other state by reading input. The transitions are defined by a relation from state-input pairs to states. Transitions on the null string, i.e., non-reading transitions, are allowed. The sequence of input values read is the string that is processed. Processing is started in a unique start state. At each step the automaton examines the text and, based on its state-transition table, goes to another state. Each time an automaton transitions on an input symbol, that symbol is discarded so that the next symbol may be processed. If input appears for which there is no defined transition, the automaton is said to reject the input. Processing continues until the input is rejected or there is no more input. Whenever the automaton is in a designated final state, the input read so far is said to have been accepted. Starting and stopping the automaton is done outside of it; typically an automaton is stopped when an end-of-input is detected or a final state accepting an end-of-input symbol is reached.

Finite automata can be deterministic (dfa) or nondeterministic (nfa). If the state-transition relation is a single-valued function, and there are no null transitions, the automaton is a dfa. A dfa executes in time proportional to the length of the input and is therefore convenient and efficient on conventional computers; a nfa is harder to execute. Unfortunately, nfa often arise naturally in applications. Fortunately there is an algorithm to transform any nfa into an equivalent dfa.

(4) See Section 2.5 and Section 5.6.

One can draw an intuitive diagram representing an automaton. The diagram below recognizes properly rounded values approximating 2/3. The value to be recognized must start with "0", is followed by any number of 6s, and terminates with a 7. Each state is boxed; the initial state is labelled G; the final state is double boxed.

    [Figure 5.1: A dfa for rounded values of 2/3. The transcription
    preserves the transitions G → 0 A, A → 6 B, B → 6 B, B → 7 C;
    G is the start state and C the (double-boxed) final state.]

Exercises

1. [1,1] Draw a diagram for a dfa that recognizes positive integers. (5)

2. [1,1] Draw a diagram for a dfa that recognizes rounded values of 1/7.

3. [1,1] Suppose that you have a dfa that recognizes truncated representations of fraction 1/n. How can you transform it into a dfa that recognizes rounded representations of 1/n? (Hint: does your solution work for 1/101?)

4. [1,1] Draw a diagram for a dfa that recognizes any sequence of nickels and dimes (N and D) that adds up to a quarter. (Hint: let state k represent an accumulation of 5k cents.)

5. [1,1] Draw a diagram for an automaton that recognizes a sequence of zero or more a's, followed by a sequence of zero or more b's, followed by one c. (Hint: there is a 3-state nfa solution.)

6. [1,1] Use a regular expression to describe the strings accepted by each of the automata in this list of exercises.

Definition of Finite Automata

    A → aB        (5.16)
    A → B         (5.17)
    A → λ         (5.18)

    Table 5.2: Schema for Automata

(5) The statement of a recognition condition for an automaton implicitly adds "and rejects anything else."

An automaton can be defined by a cfg. If the rules in Π are restricted to the three forms shown in Table 5.2, then a recognizer similar to the diagram in Figure 5.1 can be built. The nonterminals correspond to states, the rules correspond to transitions, and the terminals are the input. The goal symbol is the start state.

The first kind of fa rule (Equation 5.16) defines the transitions (often called shifts). The shifts are deterministic if

    ∀A ∈ V_N, a ∈ V_T:  size({A → aB | A → aB ∈ Π}) ≤ 1

That is to say, for no nonterminal A is there more than one shift defined for any terminal a. The second kind of rule (Equation 5.17) is called an empty transition. (6) If the shifts are deterministic and there are no empty transitions, then the automaton is deterministic. The third kind of rule (Equation 5.18) is a final transition. If one uses a fa grammar to produce a string, one starts with the goal symbol, and at each stage discards one nonterminal in the text and replaces it with another, until a final transition is applied, leaving no further nonterminals. The set of states with final transitions is the set of final states of the automaton:

    V_F = {A | A → λ ∈ Π}        (5.19)

The requirement that there be final states in a fa is the same as the constraint on cfgs to avoid nonterminating rules. (7)

Exercises

7. [1,1] Write down the grammars defining the automata derived in the previous set of exercises.

8. [1,1] Write a program to execute the dfa in Figure 5.1.

9. [1,1] Write a program to execute any dfa. (Hint: represent Π as a 2-dimensional matrix defining a function mapping each state-symbol pair into a state, and V_N as a vector recording which states are final and which are not.)

10. [1,1] Write a program to execute any nfa. (Hint: provide backtracking.)

11. [1,1] Show how to derive a grammar A° for the sequence of states passed to accept a string from the grammar A defining the fa. (Hint: V°_T = V_N.)

12. [1,1] Show how to derive a grammar A° for the sequence of transitions applied to accept a string from the grammar A defining the fa. (Hint: V°_T = Π.)

(6) An empty transition is often called an ε transition in literature using the letter ε to denote the empty string.

(7) See Section 2.3.
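Exercises 8 and 9 ask for exactly such a program. A minimal Python sketch follows, using the table representation suggested in the hint of Exercise 9; the 6-loop on state B is assumed from the prose description of Figure 5.1:

    # Transition table for the dfa of Figure 5.1, as a map from
    # (state, symbol) pairs to states.
    DELTA = {("G", "0"): "A", ("A", "6"): "B",
             ("B", "6"): "B", ("B", "7"): "C"}
    START, FINALS = "G", {"C"}

    def accepts(text):
        state = START
        for ch in text:
            if (state, ch) not in DELTA:
                return False             # undefined transition: reject
            state = DELTA[(state, ch)]
        return state in FINALS           # accept only in a final state

    for s in ["067", "0667", "06", "77"]:
        print(s, accepts(s))             # True, True, False, False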

Construction of a DFA from a NFA

Given a nfa A, there is a dfa A' such that L(A') = L(A). The construction for V'_T, V'_N, G' and Π' follows. The central idea is that the set of states of the nfa which can be reached by some string α are all mapped into a single state of the dfa. Once the construction is complete, the states of the dfa can be relabelled to simplify the description of the dfa.

The construction of Π' and V'_N is mutually dependent, as represented by the mutually recursive formulas below. The construction is based on a function S which collects the set of states S(A', a) that can be reached from the set of states A' via a transition on terminal symbol a. The value of S is often the empty set, corresponding to an error state (rejection state) in the resulting dfa. The function C closes a set of states under empty transitions:

    C(A')    = {B | A ∈ A' ∧ B ∈ V_N ∧ A →* B}               (5.20)
    S(A', a) = {B | A ∈ C(A') ∧ A → aB ∈ Π}                  (5.21)

Using S and C we can compute A':

    V'_T = V_T
    G'   = C({G})
    V'_N = D(Π'), the state sets appearing in Π'
    V'_F = {A' | A' ∈ V'_N ∧ A' ∩ V_F ≠ {}}
    Π'   = {A' → aB' | A' ∈ V'_N ∧ B' = S(A', a)}
           ∪ {A' → λ | A' ∈ V'_F}
    A'   = ⟨V'_T, V'_N, G', Π'⟩

    Table 5.3: nfa to dfa Transformation

To start the mutual recursion off, the start symbol G' may be put into V'_N. The error state {} is a member of V'_N in any constructed dfa that does not accept all strings (i.e., L(A') ≠ V*_T). This happens because S(A', a) is empty whenever there is no transition on a out of any state A ∈ A' in the nfa. The error state is the unique place to which all rejected strings take the dfa. The error state has many entries and no exits; it is a trap from which no string returns. The existence of the error state ensures that the state-transition function for the dfa is a single-valued function.
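The transformation is short in code. Below is a minimal Python sketch of Table 5.3, assuming the nfa arrives as a shift map, an empty-transition map, and a set of final states; frozensets of nfa states serve as dfa state names, and the empty frozenset is the error state:

    def closure(states, empty):
        """C(A'): everything reachable from the given states by empty moves."""
        result, work = set(states), list(states)
        while work:
            a = work.pop()
            for b in empty.get(a, ()):
                if b not in result:
                    result.add(b)
                    work.append(b)
        return frozenset(result)

    def nfa_to_dfa(start, shifts, empty, finals, alphabet):
        """Subset construction per Table 5.3; the error state and its
        transitions are retained, as the book recommends."""
        g = closure({start}, empty)                  # G' = C({G})
        table, seen, work = {}, {g}, [g]
        while work:
            a = work.pop()
            for sym in alphabet:
                target = set()
                for q in a:                          # S(A', sym)
                    target |= shifts.get((q, sym), set())
                t = closure(target, empty)
                table[(a, sym)] = t
                if t not in seen:
                    seen.add(t)
                    work.append(t)
        dfa_finals = {a for a in seen if a & finals} # contains a nfa final
        return g, table, dfa_finals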

Exercises

13. [1,1] Show A ∈ C({A}).

14. [1,1] Show A' ⊆ C(A').

15. [1,1] Show that the error state is not a final state.

16. [1,1] Show {} ∈ V'_N iff L(A') ≠ V*_T.

17. [1,1] Show that for any constructed dfa A':

    size(Π') = size(V'_N) × size(V_T) + size(V'_F)

18. [1,1] Suppose the nfa→dfa transformation were applied to a dfa. Would the input and output of the transformation necessarily be the same?

19. [1,1] Suppose the nfa→dfa transformation were applied twice, to a nfa and then to the resulting dfa. Would the input and output of the second transformation necessarily be the same?

Example

The following nfa is an answer to Exercise 5, to build an automaton to recognize a*b*c. It is a nfa:

    V_T = {a, b, c}
    V_N = {A, B, C}
    G   = A
    V_F = {C}
    Π   = {A → aA, A → B, B → bB, B → cC, C → λ}
    A   = ⟨V_T, V_N, G, Π⟩

The corresponding dfa for a*b*c follows:

    V'_T = {a, b, c}
    V'_N = {{A, B}, {B}, {C}, {}}
    G'   = {A, B}
    V'_F = {{C}}
    Π'   = {{A, B} → a{A, B},  {B} → a{},   {C} → a{},  {} → a{},
            {A, B} → b{B},     {B} → b{B},  {C} → b{},  {} → b{},
            {A, B} → c{C},     {B} → c{C},  {C} → c{},  {} → c{},
            {C} → λ}
    A'   = ⟨V'_T, V'_N, G', Π'⟩
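Running the subset-construction sketch above on this nfa reproduces the dfa just listed, with frozensets in place of the relabelled names:

    shifts = {("A", "a"): {"A"}, ("B", "b"): {"B"}, ("B", "c"): {"C"}}
    empty = {"A": {"B"}}                 # the rule A -> B
    g, table, finals = nfa_to_dfa("A", shifts, empty, {"C"}, "abc")

    print(sorted(g))                     # ['A', 'B'], the start state {A, B}
    print(sorted(table[(g, "b")]))       # ['B']
    print(sorted(table[(g, "c")]))       # ['C']
    print(len(finals))                   # 1, namely {C}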

Exercises

Using the example above:

20. [1,1] Verify the nfa→dfa transformation.

21. [1,1] Verify the formula in Exercise 17.

22. [1,1] Label the states of the dfa for a*b*c and draw the diagram for it.

23. [1,1] Repeat the previous three exercises for a b c.

LR Parsers

The central fact of lr parsing is that, for any cfg, the set of parse stacks ρ generated during canonical parses is a fa language. The application of a cfg rule A → α leads to a text transformation ρAδ ⇒ ραδ, where δ is the unprocessed input text. The set of all values ρ, over all steps of all parses, is the fa language. Therefore, there is a dfa to recognize it. It is upon this dfa that the lr parser is built.

The construction of an lalr(1) parser proceeds in six steps. The cfg for the language is the starting point.

1. The lr(0) nfa for the parse stack is written down;
2. The lr(0) dfa is constructed from the nfa;
3. The lalr(1) shift function is taken from the dfa;
4. The Lookahead Grammar is constructed from the dfa and cfg;
5. The slr(1) lookahead is constructed for the Lookahead Grammar;
6. The slr(1) lookahead is added to the dfa to build a lalr(1) reduce function for the cfg.

The starting point is a context-free grammar G. The first objective is the lr(0) parser in Figure 5.2. It may be interesting to do Exercise 99 and then return to this point and resume reading.

The LR(0) NFA Construction

One characteristic of the material to follow is the construction of grammars from grammars, where the symbols of the constructed grammar are parts of the original grammar. It is helpful to enhance the notation by introducing meta-brackets to turn strings α into symbols [α]. Suppose the language for which a parser is desired is described by cfg G = ⟨V_T, V_N, G, Π⟩. The lr(0) nfa is written down as follows:

    V'_T = V_T ∪ V_N
    V'_N = {[A → α] | A → αβ ∈ Π}
    G'   = [G → λ]
    Π'   = {[A → α] → B [A → αB] | B ∈ V_T ∪ V_N ∧ A → αBγ ∈ Π}
           ∪ {[A → α] → [B → λ] | B ∈ V_N ∧ A → αBγ ∈ Π}
           ∪ {[A → α] → λ | A → α ∈ Π}
    A'   = ⟨V'_T, V'_N, G', Π'⟩

    Table 5.4: cfg to lr(0) nfa Construction

Since the parse stack can have both the terminal and nonterminal symbols from the cfg, the nfa must have transitions defined for all of them: the terminal vocabulary V'_T is the whole vocabulary of the cfg. The nonterminal vocabulary consists of partially completed rules from the cfg; each [A → α] represents the state of having recognized the part α of some rule A → αβ.

The canonical parse in the cfg requires that the rightmost nonterminal is expanded. Here is the source of the surprising fact that the parse stack is describable by a fa. As the fa walks the stack, it is walking across partially completed right-hand sides of rules. Whenever the next symbol in a grammar rule is a nonterminal in the cfg, there are two possibilities: either the nonterminal is already there in the stack, or part of the right-hand side of one of its rules is already in the stack. The first two schemata for filling Π' in Table 5.4 reflect these two possibilities. The first schema steps to a state including the symbol (terminal or nonterminal) on the stack. The second schema starts expanding the nonterminal by an empty transition to a state where the new right-hand side is empty and therefore ready to start being built from the input. The third schema terminates the fa because a rule has been fully built on the parse stack. There is a terminating schema for each rule in the grammar. The application of the lr(0) machine will repeatedly arrive in these terminating states and then be restarted to do the next step in the canonical parse. The start state of the automaton is a rule for the goal symbol of the cfg with none of the right-hand side built. The lr(0) dfa is then computed from the lr(0) nfa.

Exercises

24. [1,1] Show that the set of parse stacks ρ is a fa language.

25. [1,1] Show that, for an lr(0) automaton, the final states correspond to the rules of the cfg: V'_F ≃ Π.
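The schemata of Table 5.4 also tabulate directly. The sketch below uses the common dotted-rule items (lhs, rhs, k), a slightly finer representation than the book's [A → α] states, which merge items sharing a recognized string; the subset construction yields the same dfa either way, up to state names. The grammar shown is that of the worked example which follows, with "|" standing in for ∨ and "-|" for ⊣:

    def lr0_nfa(rules, nonterminals):
        """Write down the lr(0) nfa of Table 5.4 for a cfg given as a
        list of (lhs, rhs) pairs with rhs a tuple of symbols."""
        shifts, empties, finals = [], [], []
        for lhs, rhs in rules:
            for k, nxt in enumerate(rhs):
                item = (lhs, rhs, k)                 # k symbols already built
                shifts.append((item, nxt, (lhs, rhs, k + 1)))
                if nxt in nonterminals:              # second schema
                    empties += [(item, (l2, r2, 0))
                                for l2, r2 in rules if l2 == nxt]
            finals.append((lhs, rhs, len(rhs)))      # third schema: rule built
        return shifts, empties, finals

    rules = [("P", ("D", "-|")), ("D", ("D", "|", "C")), ("D", ("C",)),
             ("C", ("f",)), ("C", ("(", "D", ")"))]
    shifts, empties, finals = lr0_nfa(rules, {"P", "D", "C"})
    print(len(shifts), len(empties), len(finals))    # 10 10 5

With the book's merged items there are five empty transitions where this sketch produces ten; the ten shift and five final transitions agree with the worked example below.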

A Worked Example: LR(0) Construction

Consider the following grammar G, where

    V_T = {f, ∨, (, ), ⊣}
    V_N = {P, D, C}
    G   = P
    Π   = {r0, r1, r2, r3, r4}

    r0 = P → D ⊣
    r1 = D → D ∨ C
    r2 = D → C
    r3 = C → f
    r4 = C → ( D )

The grammar consists of five rules, including one ending in ⊣, signifying end-of-input. The language is a combination of the boolean value f, logical or operations, and parentheses. It is a subset of the language for which a recursive descent parser was written in Chapter 2. This language is picked for the example because it does not require any lookahead. This is fine for purposes of illustrating the construction and use of the lr(0) machine, but slightly misleading, since every useful programming language does require lookahead. An lr(0) parser for this language follows. The first step is to write down the nfa for the lr(0) machine.

The LR(0) NFA (example)

    V'_T = {P, D, C, ∨, ⊣, f, (, )}
    V'_N = {m0, m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11, m12}
    G'   = m0
    V'_F = {m2, m6, m7, m9, m12}
    Π'   = {t0, t1, t2, ..., t19}
    A'   = ⟨V'_T, V'_N, G', Π'⟩

where

    m0 = [P → λ]      m5 = [D → D∨]     m10 = [C → (]
    m1 = [P → D]      m6 = [D → D∨C]    m11 = [C → (D]
    m2 = [P → D⊣]     m7 = [D → C]      m12 = [C → (D)]
    m3 = [D → λ]      m8 = [C → λ]
    m4 = [D → D]      m9 = [C → f]

and

    t0 = m0 → D m1     t7 = m8 → ( m10      t14 = m10 → m3
    t1 = m1 → ⊣ m2     t8 = m10 → D m11     t15 = m2 → λ
    t2 = m3 → D m4     t9 = m11 → ) m12     t16 = m6 → λ
    t3 = m4 → ∨ m5     t10 = m0 → m3        t17 = m7 → λ
    t4 = m5 → C m6     t11 = m3 → m3        t18 = m9 → λ
    t5 = m3 → C m7     t12 = m3 → m8        t19 = m12 → λ
    t6 = m8 → f m9     t13 = m5 → m8

Exercise

26. [1,1] Verify the construction of the example nfa.

The LR(0) DFA (example)

The next step is to compute the dfa A'' from the nfa A'. The majority of transitions involve the generated error state, something that is relatively unsightly to write down and also not very informative. We establish the convention of leaving all transitions to the error state, and the error state itself, out of the displayed computations and also out of the eventual dfa diagram. In an implementation it is, on the other hand, of no cost to retain the error state and its transitions.

    V''_T = {P, D, C, ∨, ⊣, f, (, )}
    V''_N = {n0, n1, n2, n3, n4, n5, n6, n7, n8, n9, n10}
    G''   = n0
    V''_F = {n2, n3, n5, n8, n9}
    Π''   = {v0, v1, v2, ..., v19}
    A''   = ⟨V''_T, V''_N, G'', Π''⟩

where (n10 is the error state, omitted from the displays by the convention above)

    n0 = {m0, m3, m8}     n4 = {m3, m8, m10}    n8 = {m6}
    n1 = {m1, m4}         n5 = {m2}             n9 = {m12}
    n2 = {m7}             n6 = {m5, m8}
    n3 = {m9}             n7 = {m4, m11}

and

    v0 = n0 → D n1     v7 = n4 → C n2      v14 = n7 → ) n9
    v1 = n0 → C n2     v8 = n4 → f n3      v15 = n2 → λ
    v2 = n0 → f n3     v9 = n4 → ( n4      v16 = n3 → λ
    v3 = n0 → ( n4     v10 = n6 → C n8     v17 = n5 → λ
    v4 = n1 → ⊣ n5     v11 = n6 → f n3     v18 = n8 → λ
    v5 = n1 → ∨ n6     v12 = n6 → ( n4     v19 = n9 → λ
    v6 = n4 → D n7     v13 = n7 → ∨ n6

Exercise

27. [1,1] Verify the construction of the example dfa.

Diagram of the LR(0) DFA (example)

To display the lr(0) machine, it is convenient to use the state names from the construction for the non-final states, and the rule names from the original grammar for the final states. This allows the reader to readily apply the diagram to arbitrary input texts. The mapping from states to rules here is n2 = r2, n3 = r3, n5 = r0, n8 = r1, n9 = r4.

    [Figure 5.2: The lr(0) dfa. Transitions: n0 → D n1, n0 → C r2,
    n0 → f r3, n0 → ( n4, n1 → ⊣ r0, n1 → ∨ n6, n4 → D n7,
    n4 → C r2, n4 → f r3, n4 → ( n4, n6 → C r1, n6 → f r3,
    n6 → ( n4, n7 → ∨ n6, n7 → ) r4.]

Exercises

Each of the following grammars poses a problem for lr parsers. Construct each lr(0) nfa and dfa. Your results will be used in the discussion on lookahead.

28. [1,1] lalr(1) shift-reduce conflict for lr(0):

    G → E ⊣
    E → T
    T → T x
    T → x

29. [1,1] lalr(1) reduce-reduce conflict for lr(0):

    G → E ⊣
    E → S x
    E → T z
    S → a
    T → a

30. [1,1] lalr(1) reduce-reduce conflict for slr(1):

    G → E ⊣
    E → a T a
    E → b T b
    E → a x b
    T → x

31. [1,1] lalr(1) erasure in the lookahead:

    G → E ⊣
    E → S x
    E → T U y
    S → a
    T → a
    U → λ

32. [1,1] lalr(1) beats simple lookahead analysis of the lr(0) nfa (an nqlalr example):

    G → E ⊣
    E → b A d
    E → a A c
    E → b g c
    E → a g d
    A → B
    B → g

33. [1,1] not lalr(1):

    G → E ⊣
    E → S x y
    E → T x z
    S → a
    T → a

34. [1,1] Simple ambiguous grammar. Parse xxx two ways:

    G → E ⊣
    E → E E
    E → x

35. [1,1] Classical dangling-else ambiguity. Parse iixtx two ways:

    G → E ⊣
    E → i E
    E → i E t E
    E → x

Applying the LR(0) Machine

The canonical parse is a sequence of grammar rule applications. The rules are applied to the catenation of the parse stack ρ and the remaining input δ. Initially ρ is empty and all of δ is available. At each rule application, the right side of a rule matches a substring of ρδ; the matched substring is removed and is replaced by the left side of the rule.

The lr(0) machine is repetitively applied, yielding one parse step per application. As one might expect, the dfa is started in its initial state. When a transition occurs, the transition symbol is taken from δ and pushed onto ρ. When the lr(0) machine reaches a final state, the grammar rule to be applied is given by the state label, and the right side of that rule is on the top of the parse stack. The matched string is popped off ρ and replaced by the left side of the rule. The process is repeated, starting again in the initial state and at the left of ρ. One of two things finally happens: the goal symbol G of the grammar appears, or the lr(0) machine rejects the input. In the former case the sequence of rule applications is the canonical parse. In the latter case an error diagnostic can be reported.

    parse stack  unread input  commentary
    ρ            δ
    λ            (f∨f)⊣        starting in state n0
    (            f∨f)⊣         read (, goto n4
    (f           ∨f)⊣          read f, goto r3
    λ            (C∨f)⊣        apply r3; starting in state n0
    (            C∨f)⊣         read (, goto n4
    (C           ∨f)⊣          read C, goto r2
    λ            (D∨f)⊣        apply r2; starting in state n0
    (            D∨f)⊣         read (, goto n4
    (D           ∨f)⊣          read D, goto n7
    (D∨          f)⊣           read ∨, goto n6
    (D∨f         )⊣            read f, goto r3
    λ            (D∨C)⊣        apply r3; starting in state n0
    (            D∨C)⊣         read (, goto n4
    (D           ∨C)⊣          read D, goto n7
    (D∨          C)⊣           read ∨, goto n6
    (D∨C         )⊣            read C, goto r1

    λ            (D)⊣          apply r1; starting in state n0
    (            D)⊣           read (, goto n4
    (D           )⊣            read D, goto n7
    (D)          ⊣             read ), goto r4
    λ            C⊣            apply r4; starting in state n0
    C            ⊣             read C, goto r2
    λ            D⊣            apply r2; starting in state n0
    D            ⊣             read D, goto n1
    D⊣           λ             read ⊣, goto r0
    P                          apply r0, quit

The canonical parse is r3, r2, r3, r1, r4, r2, r0. If ρ is the parse stack and δ the unread input, the invariant G ⇒* ρδ holds throughout the parse. Note that the actions after restarting the dfa are repetitious. This is a consequence of the parse stack not changing to the left of the substitution.

Exercises

36. [1,1] Verify the invariant G ⇒* ρδ, where G = P, for the parse of (f∨f)⊣ shown in the previous example.

37. [1,1] Apply the lr(0) machine to strings f⊣, (f)⊣, f∨f∨f⊣, (f∨f)⊣. What is the canonical parse in each case?

38. [1,1] Apply the lr(0) machine to string ff⊣. What kind of diagnostic can be generated in this case? In general?

39. [1,1] Invent a hack to avoid the repetitious transitions across the parse stack ρ after a substitution. (Hint: if p is the length of the canonical parse, and i is the length of the input excluding ⊣, the number of dfa steps, shift or reduce, should be only 2p + i.)

Using the LR(0) DFA more efficiently (example)

The last exercise above hinted at an inefficiency in the applications of the lr(0) machine. Suppose we start this time with the valid text f∨(f)⊣. The canonical parse will be a sequence of rule applications resulting in a sequence of forms eventually converging to the goal symbol P. It is convenient to write the string and the states of the dfa together, with the states between and below the symbols of the string. Whenever the dfa gets to a final state the rewriting is done, removing some symbols from the string and replacing them by the phrase name. When symbols are removed, so are the interpolated states, which become invalid after the substitution. The effect

is that one does not need to start from the left end of the parse stack after each rewrite. To reestablish the state, the new phrase name is tacked onto the front of the input and a nonterminal transition gets things going again.

    stack (with interpolated states)  input    commentary
    n0                                f∨(f)⊣   start
    n0 f r3                           ∨(f)⊣    shift over f
    n0                                C∨(f)⊣   apply rule r3
    n0 C r2                           ∨(f)⊣    shift over C
    n0                                D∨(f)⊣   apply rule r2
    n0 D n1                           ∨(f)⊣    shift over D
    n0 D n1 ∨ n6                      (f)⊣     shift over ∨
    n0 D n1 ∨ n6 ( n4                 f)⊣      shift over (
    n0 D n1 ∨ n6 ( n4 f r3            )⊣       shift over f
    n0 D n1 ∨ n6 ( n4                 C)⊣      apply rule r3
    n0 D n1 ∨ n6 ( n4 C r2            )⊣       shift over C
    n0 D n1 ∨ n6 ( n4                 D)⊣      apply rule r2
    n0 D n1 ∨ n6 ( n4 D n7            )⊣       shift over D
    n0 D n1 ∨ n6 ( n4 D n7 ) r4       ⊣        shift over )
    n0 D n1 ∨ n6                      C⊣       apply rule r4
    n0 D n1 ∨ n6 C r1                 ⊣        shift over C
    n0                                D⊣       apply rule r1
    n0 D n1                           ⊣        shift over D
    n0 D n1 ⊣ r0                      λ        shift over ⊣
    n0                                P        apply rule r0, quit

The canonical parse is r3, r2, r3, r2, r4, r1, r0.

Now one can observe that the symbols in the interpolated stack (as contrasted to the states) are never used. That is, the reduce step must discard one state and symbol from the interpolated stack for every symbol on the right side of the applied rule, but need examine none of them while doing so. The newly exposed top-of-stack is the restarting state. A new stack, consisting of only the states, can be used in place of ρ. This is the form of parse stack used in the rest of this section. We will call it the parse state stack, to distinguish it from the parse stack, and use the symbol σ to represent it.

Exercises

40. [1,1] Once again apply the lr(0) machine to strings f⊣, (f)⊣, f∨f∨f⊣, (f∨f)⊣, but in this case use the parse state stack σ instead of the parse stack ρ.

41. [1,1] Does using σ affect the quality of the diagnostics that can be generated?

42. [1,1] Given the lr(0) dfa and some parse state stack σ, show how to compute the corresponding parse stack ρ.
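The driver just described fits in a dozen lines. Below is a runnable Python sketch of the parse-state-stack machine for Figure 5.2, with "|" standing in for ∨ and "$" for ⊣; GOTO holds the transitions v0 through v14, and FINAL maps each reduce state to its rule name, left side, and right-side length:

    GOTO = {("n0", "D"): "n1", ("n0", "C"): "n2", ("n0", "f"): "n3",
            ("n0", "("): "n4", ("n1", "$"): "n5", ("n1", "|"): "n6",
            ("n4", "D"): "n7", ("n4", "C"): "n2", ("n4", "f"): "n3",
            ("n4", "("): "n4", ("n6", "C"): "n8", ("n6", "f"): "n3",
            ("n6", "("): "n4", ("n7", "|"): "n6", ("n7", ")"): "n9"}
    FINAL = {"n2": ("r2", "D", 1), "n3": ("r3", "C", 1),
             "n5": ("r0", "P", 2), "n8": ("r1", "D", 3),
             "n9": ("r4", "C", 3)}

    def parse(text):
        sigma, delta, out = ["n0"], list(text), []
        while True:
            state = sigma[-1]
            if state in FINAL:
                rule, lhs, n = FINAL[state]
                out.append(rule)
                if rule == "r0":
                    return out                 # goal reached
                del sigma[-n:]                 # pop one state per rhs symbol
                delta.insert(0, lhs)           # push lhs back onto the input
            else:
                sym = delta.pop(0)
                sigma.append(GOTO[(sigma[-1], sym)])  # KeyError means reject

    print(parse("(f|f)$"))  # ['r3', 'r2', 'r3', 'r1', 'r4', 'r2', 'r0']

The printed sequence is the canonical parse of (f∨f)⊣ computed by hand in the previous example.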

The Failures of LR(0)

    [Figure 5.3: An inadequate lr(0) dfa, for the grammar G → E ⊣,
    E → T, T → T x, T → x. Transitions: p0 → E p1, p1 → ⊣ p4,
    p0 → T p2, p2 → x p5, p0 → x p3.]

Taking the grammar from Exercise 28, we get the dfa in Figure 5.3. The problem arises with state p2, which is both a reduce state for rule E → T and also a shift state, carrying on by x to state p5. Having arrived in state p2, the sergeant (8) won't know what command to give.

The answer can be found by examining the lr(0) dfa. The only allowed shift in state p2 is on symbol x. One could take the attitude "shift when you can," and that would work in this case. One can also imagine doing a trial reduction and seeing where that would leave the lr(0) machine. In state p2, E will be pushed onto the head of the input and the top of σ will be p0. Then, shifting the E goes to state p1, in which only ⊣ is valid input. Thus, if reduce by rule E → T (from state p2) was the right answer, the next symbol will surely be ⊣. This resolves the sergeant's dilemma: when in state p2, an x gets shifted, but a ⊣ is left alone and a reduce by rule E → T is done instead. The next task is to generalize and formalize this insight.

Exercises

43. [1,1] Show that merely letting shift take precedence over reduce is the correct solution for the lr(0) machine in Figure 5.3.

44. [1,1] Each of the grammars in the set starting with Exercise 28 fails to be lr(0). Identify the failure(s). See if the lr(0) machine contains the resolution to the problem(s), as in the example worked above.

Lookahead

The lr(0) machine must be augmented with lookahead for practical languages. In fact, the dfa has been used as a stepping procedure, taking the form (in language X):

    σ, δ := "step" := σ, δ

(8) See the cadence example at the beginning of this section.

where at each step either a shift takes a symbol from δ and places the new state on the top of σ, or a reduce pops some states off σ and puts a nonterminal on the front of δ.

The arguments of step range over infinite sets, therefore they cannot be directly tabulated. The top of the parse state stack s and the leading symbol on the input D are the keys. One can implement step with finite tables recording all of the possible decisions. There are three possibilities following the arrival at any particular state s:

    shift   (if there is a transition from s defined on D)
    reduce  (if D is in the lookahead)
    reject  (a syntax error has been discovered)

What is needed are two functions:

    s' := shift(s, D)    which looks up a new state s'
    r := reduce(s, D)    which looks up the rule r to be applied

Information about rules, such as the length of the rule and which nonterminal it defines, must also be available to the algorithm implementing step.

The functions shift and reduce can each be represented by a fixed-size table. For cfg G = ⟨V_T, V_N, G, Π⟩ and derived lr(0) dfa A' = ⟨V'_T, V'_N, G', Π'⟩, each table is a matrix with size(V'_N) × size(V_T ∪ V_N) elements. That is, for every state of the dfa and every symbol of the cfg, there is one entry in each table. The entries for shift can be picked directly off the lr(0) machine:

    A → D C ∈ Π'  iff  shift(A, D) = C        (5.22)

Suppose that A ∈ V'_F. Then there is some rule r ∈ Π from the cfg such that [r] ∈ A. There are several strategies for recording values for reduce. If there is nothing else in A except [r], then the state is lr(0): only a reduction is allowed. In this case it will never cause the parser to fail (although it may delay the detection of a syntax error) to give the value r to reduce for all possible lookaheads:

    A = {[r]} ∧ r ∈ Π ∧ D ∈ V_T ∪ V_N  ⟹  reduce(A, D) = r        (5.23)

It is only when {[r]} ⊊ A for some r ∈ Π that lookahead must be added. There are two algorithms of interest: slr and lalr, the latter being the more powerful of the two. As it happens, the slr algorithm can be used to compute the lalr tables, so it makes sense to present slr first.
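A minimal Python sketch of the resulting step function, assuming SHIFT and REDUCE are maps keyed by (state, symbol) pairs and RULES maps a rule name to its left side and right-side length (all of these names are hypothetical):

    def step(sigma, delta, SHIFT, REDUCE, RULES):
        """One parser step: shift if defined, else reduce if the lookahead
        permits, else reject with a diagnostic."""
        s, D = sigma[-1], delta[0]
        if (s, D) in SHIFT:
            sigma.append(SHIFT[(s, D)])       # shift: push the new state
            del delta[0]
            return "shift"
        if (s, D) in REDUCE:
            r = REDUCE[(s, D)]                # reduce: pop rhs, push back lhs
            lhs, rhs_len = RULES[r]
            del sigma[-rhs_len:]
            delta.insert(0, lhs)
            return r
        raise SyntaxError("unexpected %r in state %r" % (D, s))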

SLR(1) Lookahead

Whenever a reduction is applied, a phrase is reduced to a nonterminal. Whatever comes next in the input must follow that nonterminal in the cfg. In the little cfg below (from Figure 5.3), E occurs once on the right-hand side of a rule. There is a terminal symbol ⊣ to its immediate right. Therefore, whenever a reduction to E is made, the only acceptable following symbol is ⊣.

    G → E ⊣
    E → T
    T → T x
    T → x

There is one other nonterminal in the little cfg. It occurs at the right end of a rule defining E, which says that whatever follows E may also follow T. T is also followed by x in the cfg. We deduce that whenever a T is made, the following symbol may be either x or ⊣.

The slr(1) lookahead for the application of any rule A → β is the set of symbols that can follow A. In the little cfg, the problem arises in state p2, where shift(p2, x) = p5 and reduce(p2, ⊣) = E → T. Without the lookahead it would not be clear what to do in state p2.

The relation FB, meaning followed-by, is the information needed to compute the slr(1) lookahead. Both are defined below:

    A FB D  ⟺  ∃α, δ: G ⇒* αADδ                                    (5.24)
    r = (A → β) ∈ Π ∧ [r] ∈ s ∧ A FB D  ⟹  reduce(s, D) = r        (5.25)

The computation of the FB relation is complicated by erasure. (9) If some symbol in a rule might disappear, then things to the right of what was erased must also be recorded as following, and so on. All followed-by symbols ultimately derive from symbols next to each other in some grammar rule. Supposing that the cfg is not pathological, (10) then

    A FB D  iff  ∃γ, X, µ, B, C, ν:
        γ ⇒* λ  ∧  X → µBγCν ∈ Π  ∧  B ⇒* ηA  ∧  C ⇒* Dζ

Symbols B and C are next to each other. Any nonterminal ending B must be created when any symbol that can be a head of C appears. All of the possibilities

(9) See the discussion of erasure.

(10) See the discussion of pathological grammars.

can be collected into three situations: erasure occurs on the left of a rule, in the middle of a rule, or on the right of a rule. Suppose that string γ ⇒* λ (i.e., γ can be erased). Then we have three relations that can be read directly out of the grammar:

    C <· D  ⟺  C → γDδ ∈ Π        (5.26)
    B =· C  ⟺  X → αBγCδ ∈ Π      (5.27)
    A ·> B  ⟺  B → αAγ ∈ Π        (5.28)

They may be described by the phrases:

    C <· D    D starts some rule defining C
    B =· C    B precedes C in some rule
    A ·> B    A ends some rule defining B

The pointy end of the relation is toward the visible symbol within which the other symbols in the relation hide. The reflexive transitive closure of the relations (i.e., <·* and ·>*) exposes the hidden components. One can show

    FB = ·>* ∘ =· ∘ <·*        (5.29)

One can understand the compound relation above by inserting the symbols B and C: A ·>* =· <·* D means there are symbols B and C, where A is a tail of B, D is a head of C, and B is next to C in some rule. That is:

    ∃B, C:  A ·>* B  ∧  B =· C  ∧  C <·* D

The computation of the followed-by relation is thereby reduced to picking information out of the cfg and doing some standard computations on relations.

If reduce is not single-valued, the language is not slr(1) (a reduce-reduce conflict). If both shift and reduce are defined for some parameters s and D, the language is not slr(1) (a shift-reduce conflict). If neither shift nor reduce is defined for some parameters s and D, there is no defined action; if such a situation arises during translation, a syntax error has been discovered and a diagnostic may be issued. (11) Otherwise the function that is defined is obeyed. Since in practice shift and reduce are never both defined at any one fixed location [s, D], an implementation may use a single matrix to store them both.

(11) It is my personal opinion that the translator should gracefully quit at this point and wait for the error to be fixed. Conventional wisdom includes attempting to repair the damage well enough so that translation can continue. The difference of opinion probably is related to the speed of the compiler for which the issue is raised. Error recovery is an interesting topic in itself.
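The standard computations are a relational composition and a reflexive transitive closure. The following is a minimal Python sketch of Equation 5.29 that ignores erasure (none of the example grammars erase anything), with relations represented as sets of pairs and "-|" again standing in for ⊣:

    def compose(r, s):
        """Relational composition: (a, d) when a r b and b s d."""
        return {(a, d) for (a, b) in r for (c, d) in s if b == c}

    def rt_closure(rel, universe):
        """Reflexive transitive closure of rel over the given symbols."""
        result = {(x, x) for x in universe} | set(rel)
        while True:
            bigger = result | compose(result, result)
            if bigger == result:
                return result
            result = bigger

    def followed_by(rules, universe):
        """FB = (.>)* composed with (=.) composed with (<.)*,
        per equation (5.29), without erasure."""
        lt = {(lhs, rhs[0]) for lhs, rhs in rules if rhs}           # C <. D
        eq = {(rhs[i], rhs[i + 1]) for _, rhs in rules
              for i in range(len(rhs) - 1)}                         # B =. C
        gt = {(rhs[-1], lhs) for lhs, rhs in rules if rhs}          # A .> B
        return compose(compose(rt_closure(gt, universe), eq),
                       rt_closure(lt, universe))

    # The little grammar of Figure 5.3:
    rules = [("G", ("E", "-|")), ("E", ("T",)),
             ("T", ("T", "x")), ("T", ("x",))]
    fb = followed_by(rules, {"G", "E", "T", "x", "-|"})
    print(sorted(p for p in fb if p[0] in ("E", "T")))
    # [('E', '-|'), ('T', '-|'), ('T', 'x')], as deduced above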

Exercises

45. [1,1] Compute the relations <·, =·, ·>, <·*, ·>*, FB for each of the cfgs starting with Exercise 28.

46. [1,1] Compute the functions shift for each of the cfgs starting with Exercise 28.

47. [1,1] Compute the slr(1) functions reduce for each of the cfgs starting with Exercise 28.

48. [1,1] Considering the results of the previous two exercises, which of the cfgs are not slr(1), and why not?

49. [1,1] What difficulties might arise from the storage-efficiency hack of combining the implementing arrays for shift and reduce?

The Lookahead Grammar

The difference between slr and lalr is that slr keys off the nonterminal of the reduction, where lalr keys off the state in which the reduction is made. The result is to give lalr more information, and therefore a finer separation of lookahead sets. The difference is enough in practice to justify the additional effort of extending slr(1) to lalr(1). The cost is entirely in building the tables; it does not affect the size of the tables or the efficiency of the parser.

Consider the sequence of lr(0) transitions through which the lr(0) dfa goes during the parse of some text. When a rule is applied, the rhs of the rule is popped off the parse state stack, the nonterminal lhs of the rule is prefixed to the unprocessed input, and the next transition is over that nonterminal. Eventually the goal symbol is all that is left on the stack. The set of all transition sequences over all valid input texts is also a language (recall Exercise 12). Suppose G is the original cfg and A' describes the lr(0) dfa. Then there is a cfg G'', called the Lookahead Grammar, for the transition language. G'' will be derived below; it is useful because the slr(1) lookahead for G'' gives the lalr(1) lookahead for the original cfg G.

Select some particular transition in Π' over a nonterminal B:

    A → B C ∈ Π',  B ∈ V_N

By the construction of the lr(0) dfa, one can deduce, for each rule B → β ∈ Π, the existence of a path in A' from state A across the symbols of β. For example, there are transitions from state p0 over E and T in Figure 5.3. There are four paths p0 ⟶ β:

    p0 → E p1 → ⊣ p4
    p0 → T p2
    p0 → T p2 → x p5
    p0 → x p3

The vocabulary of the Lookahead Grammar is the set of transitions from the lr(0) dfa. The brackets will be used, as before, to distinguish the symbol [A → B C] ∈ V'' from the transition A → B C ∈ Π'. The notation will be extended so that the symbol [A ⟶ β] signifies the (perhaps empty) sequence of transitions from state A over the sequence of symbols β. The lr(0) terminating rules A → λ are not included on the end of the sequences [A ⟶ β]. All of this is summarized by the definitions in Table 5.5:

    V''_T = {[A → b C] | A → b C ∈ Π' ∪ {G''} ∧ b ∈ V_T}
    V''_N = {[A → B C] | A → B C ∈ Π' ∪ {G''} ∧ B ∈ V_N}
    G''   = [G' → G]
    Π''   = {[A → B C] → [A ⟶ β] | A → B C ∈ Π' ∪ {G''} ∧ B → β ∈ Π}
    G''   = ⟨V''_T, V''_N, G'', Π''⟩

    Table 5.5: Construction of the Lookahead Grammar

The rules in the Lookahead Grammar have the same form as those in the original grammar G. The difference is that there may be more than one rule in Π'' corresponding to a single rule in Π. The application of a rule in G'' replaces the transitions that crossed the right-hand side of the rule with a transition that crosses the left-hand side of the rule, which is precisely the action associated with the use of the lr(0) dfa. Thus the Lookahead Grammar follows the entire parse, rather than just a single parse step.

A Failure of SLR(1)

The cfg in Figure 5.4 is neither lr(0) nor slr(1). The Lookahead Grammar will be needed to construct the lalr(1) lookahead. To see the necessity, the slr(1) lookahead will first be constructed and shown to fail. The relation FB for G must be constructed to get the lookahead, and therefore the values of function reduce.

    [Figure 5.4: A lr(0) dfa showing slr(1) failure, for the grammar
    G → E ⊣, E → a T a, E → a x b, E → b T b, T → x. Transitions:
    q0 → E q1, q1 → ⊣ q4, q0 → a q2, q2 → T q5, q5 → a q9,
    q2 → x q6, q6 → b q10, q0 → b q3, q3 → T q7, q7 → b q11,
    q3 → x q8.]

The relation FB for slr(1) lookahead is constructed directly from the cfg. The components of FB are given below:

    <·:  G <· E,  E <· a,  E <· b,  T <· x
    =·:  E =· ⊣,  a =· T,  T =· a,  a =· x,  x =· b,  b =· T,  T =· b
    ·>:  ⊣ ·> G,  a ·> E,  b ·> E,  x ·> T

Combining the three relations, through the closures ·>* and <·*, we get the slr(1) lookahead for G:

    T FB a    a FB T    E FB ⊣
    T FB b    a FB x    a FB ⊣
    x FB a    b FB T    b FB ⊣
    x FB b    b FB x

The corresponding values of shift and reduce, defined in Equations 5.22 and 5.25, are given below:

    shift(q0, E) = q1     shift(q3, T) = q7
    shift(q0, a) = q2     shift(q3, x) = q8
    shift(q0, b) = q3     shift(q5, a) = q9
    shift(q1, ⊣) = q4     shift(q6, b) = q10
    shift(q2, T) = q5     shift(q7, b) = q11
    shift(q2, x) = q6

    reduce(q6, a) = T → x      reduce(q9, ⊣)  = E → a T a
    reduce(q6, b) = T → x      reduce(q10, ⊣) = E → a x b
    reduce(q8, a) = T → x      reduce(q11, ⊣) = E → b T b
    reduce(q8, b) = T → x

The failure to be slr(1) shows up in the definitions of both reduce and shift for state q6 and input symbol b. The slr(1) lookahead is inadequate. The cfg is not slr(1), and it is now time to try constructing the Lookahead Grammar and the lalr(1) tables.

The Lookahead Grammar (example)

There are 11 transitions in the lr(0) dfa A'. That is, size(Π') = 11, where

    V'_T = {E, T, ⊣, a, b, x}
    V'_N = {q0, q1, q2, ..., q11}
    G'   = q0
    Π'   = {t0, t1, t2, ..., t10}
    A'   = ⟨V'_T, V'_N, G', Π'⟩

    t0 = q0 → E q1    t4 = q2 → T q5    t8 = q5 → a q9
    t1 = q0 → a q2    t5 = q2 → x q6    t9 = q6 → b q10
    t2 = q0 → b q3    t6 = q3 → T q7    t10 = q7 → b q11
    t3 = q1 → ⊣ q4    t7 = q3 → x q8

The Lookahead Grammar G'' is defined in terms of the transitions of the lr(0) dfa. The vocabulary V'' is taken from Π', augmented by one extra symbol G'':

    V''_T = {t1, t2, t3, t5, t7, t8, t9, t10}
    V''_N = {G'', t0, t4, t6}
    G''   = [q0 → G]

    Π''  =  G''  →  t0 t3
            t0   →  t1 t4 t8
            t0   →  t1 t5 t9
            t0   →  t2 t6 t10
            t4   →  t5
            t6   →  t7
    G''  =  ⟨V''_T, V''_N, G'', Π''⟩

The Lookahead Grammar cfg rules are easier to read if the transitions tn are spelled out:

    [q0 → G]    →  [q0 → E q1] [q1 → ⊣ q4]
    [q0 → E q1] →  [q0 → a q2] [q2 → T q5] [q5 → a q9]
    [q0 → E q1] →  [q0 → a q2] [q2 → x q6] [q6 → b q10]
    [q0 → E q1] →  [q0 → b q3] [q3 → T q7] [q7 → b q11]
    [q2 → T q5] →  [q2 → x q6]
    [q3 → T q7] →  [q3 → x q8]

The effective difference between G and G'' is two rules for T → x. This causes different lookahead to be computed for states q6 and q8 in Figure 5.4. Since it was q6 that caused the slr(1) inadequacy, the lalr(1) construction looks promising. The computation of relation FB is repeated for G''. The base relations read off G'' are:

    <·:  G'' <· t0,  t0 <· t1,  t0 <· t2,  t4 <· t5,  t6 <· t7
    =·:  t0 =· t3,  t1 =· t4,  t4 =· t8,  t1 =· t5,  t5 =· t9,
         t2 =· t6,  t6 =· t10
    ·>:  t3 ·> G'',  t8 ·> t0,  t9 ·> t0,  t10 ·> t0,  t5 ·> t4,
         t7 ·> t6

Combining the three relations we get the slr(1) lookahead for G'':

    t0 FB t3     t1 FB t4     t2 FB t6     t4 FB t8     t6 FB t10
    t8 FB t3     t1 FB t5     t2 FB t7     t5 FB t8     t7 FB t10
    t9 FB t3                               t5 FB t9
    t10 FB t3
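As a cross-check, the followed_by sketch from the slr(1) discussion reproduces this table when applied to G'' (the goal symbol is spelled G2 below, since the tn and the goal are just symbols to the code):

    rules2 = [("G2", ("t0", "t3")),
              ("t0", ("t1", "t4", "t8")),
              ("t0", ("t1", "t5", "t9")),
              ("t0", ("t2", "t6", "t10")),
              ("t4", ("t5",)),
              ("t6", ("t7",))]
    universe = {"G2"} | {"t%d" % i for i in range(11)}
    fb = followed_by(rules2, universe)
    print(len(fb))                                   # 13 pairs, as listed
    print(("t4", "t8") in fb, ("t5", "t8") in fb)    # True True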

LALR(1) Lookahead

Now comes the application of the Lookahead Grammar slr(1) information to the construction of the lalr(1) lookahead for the original cfg. The shift function values are unchanged. There are fewer reduce values, because of the finer separation of lookahead in lalr(1).

For every rule [A → B A'] → [A ⟶ β] ∈ Π'' there is a rule B → β ∈ Π and a destination state qn, the target of the final transition in the sequence [A ⟶ β]. Suppose [A → B A'] FB [C → D C']. Then the lalr(1) lookahead for state qn in the lr(0) dfa for G is given by:

    reduce(qn, D) = B → β        (5.30)

Applying this formula to the running example, only symbols tn ∈ V''_N have slr(1) lookahead and can give rise to lalr(1) lookahead for G:

    t0 FB t3   gives  reduce(q9, ⊣)  = E → a T a
                      reduce(q10, ⊣) = E → a x b
                      reduce(q11, ⊣) = E → b T b
    t4 FB t8   gives  reduce(q6, a)  = T → x
    t6 FB t10  gives  reduce(q8, b)  = T → x

The resulting values of reduce no longer conflict with shift. (12)

Exercises

50. [1,1] Verify the computation of the lalr(1) lookahead for the example.

51. [1,1] Compute the relations <·, =·, ·>, <·*, ·>*, FB for the lookahead grammars associated with each of the cfgs starting with Exercise 28.

52. [1,1] Compute the lalr(1) functions reduce for each of the cfgs starting with Exercise 28.

53. [1,1] Considering the results of the previous two exercises, which of the cfgs are not lalr(1), and why not?

When LALR(1) Fails

It will often be the case that lalr(1) is not enough to build tables for your favorite grammar. While it is possible to increase the lookahead, at the expense of complicating the tables and the algorithms that have to deal with them, it is almost always better to change the cfg. There is an art to writing grammars

(12) Compare the shift values computed for Figure 5.4 above.

that are both pleasing and also lalr(1). The lalr(1) tables for any practical cfg are far beyond reasonable hand computation, so the failures are reported through a table-building program. Therefore, in addition to knowing what to do to change the cfg, one also needs to know how to interpret diagnostic messages from a relatively opaque algorithm. (13)

Once the simple syntactic constraints of the input cfg are met, troubles are always deep: they come from within relatively massive computations on intermediate grammars the user never wants to see. The symptom, a shift-reduce or reduce-reduce conflict, often has no obvious cause. What can be reported (directly from the symptom) is the symbol that shows up in the conflict; it is the second argument to the shift and reduce functions. The original cfg rule for which the reduce decision cannot be tabulated can also be reported. This leads to diagnostics of the form:

    lalr(1) shift-reduce conflict for symbol ) on rule
        Term = Term * Factor

or

    lalr(1) reduce-reduce conflict for symbol ) on rules
        Term = Term * Factor
        SimpleDecl = * SimpleDecl

There may be hundreds of such messages and only one error in the grammar. The challenge of good diagnostics is to suppress messages that do not contribute to locating the error, and to give traceability information. For instance, the set of symbols for which the conflict symbol is FB eases the problem of finding how the conflict arose:

    The following symbol(s) are followed by the conflict symbol ):
        Term  Factor  Declaration  SimpleDecl

The problem of giving reasonable diagnostics is also interesting when a correct lalr(1) parser detects an input error. There is a particular bit of compiler-writer arrogance to avoid: a syntax error may reflect simple carelessness on the part of the programmer; it may also reflect a reasonable generalization of the language, for which the language designer should be criticized. In any case the diagnostic is issued by a mere computer; any accusatory tone in the diagnostic is out of place. A cloying preamble ("O master, I do not understand your divine intent") is overshoot, but in the right direction. Once the compiler-writer's heart is in the right place, the compiler can get down to issuing just the facts: where the syntax error was detected, what was found, and what would have been acceptable. That is usually enough for a syntax error, which is all a parser can detect.

(13) The situation is even worse for recursive descent techniques: there one may end up with a parser that apparently works but in fact contains bugs.


Compiler Design 1. Bottom-UP Parsing. Goutam Biswas. Lect 6 Compiler Design 1 Bottom-UP Parsing Compiler Design 2 The Process The parse tree is built starting from the leaf nodes labeled by the terminals (tokens). The parser tries to discover appropriate reductions,

More information

SLR parsers. LR(0) items

SLR parsers. LR(0) items SLR parsers LR(0) items As we have seen, in order to make shift-reduce parsing practical, we need a reasonable way to identify viable prefixes (and so, possible handles). Up to now, it has not been clear

More information

Context-free grammars

Context-free grammars Context-free grammars Section 4.2 Formal way of specifying rules about the structure/syntax of a program terminals - tokens non-terminals - represent higher-level structures of a program start symbol,

More information

CSE P 501 Compilers. LR Parsing Hal Perkins Spring UW CSE P 501 Spring 2018 D-1

CSE P 501 Compilers. LR Parsing Hal Perkins Spring UW CSE P 501 Spring 2018 D-1 CSE P 501 Compilers LR Parsing Hal Perkins Spring 2018 UW CSE P 501 Spring 2018 D-1 Agenda LR Parsing Table-driven Parsers Parser States Shift-Reduce and Reduce-Reduce conflicts UW CSE P 501 Spring 2018

More information

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing Roadmap > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing The role of the parser > performs context-free syntax analysis > guides

More information

Chapter 3: Lexing and Parsing

Chapter 3: Lexing and Parsing Chapter 3: Lexing and Parsing Aarne Ranta Slides for the book Implementing Programming Languages. An Introduction to Compilers and Interpreters, College Publications, 2012. Lexing and Parsing* Deeper understanding

More information

MIT Parse Table Construction. Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology

MIT Parse Table Construction. Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology MIT 6.035 Parse Table Construction Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology Parse Tables (Review) ACTION Goto State ( ) $ X s0 shift to s2 error error goto s1

More information

Bottom-up parsing. Bottom-Up Parsing. Recall. Goal: For a grammar G, withstartsymbols, any string α such that S α is called a sentential form

Bottom-up parsing. Bottom-Up Parsing. Recall. Goal: For a grammar G, withstartsymbols, any string α such that S α is called a sentential form Bottom-up parsing Bottom-up parsing Recall Goal: For a grammar G, withstartsymbols, any string α such that S α is called a sentential form If α V t,thenα is called a sentence in L(G) Otherwise it is just

More information

CS143 Handout 20 Summer 2011 July 15 th, 2011 CS143 Practice Midterm and Solution

CS143 Handout 20 Summer 2011 July 15 th, 2011 CS143 Practice Midterm and Solution CS143 Handout 20 Summer 2011 July 15 th, 2011 CS143 Practice Midterm and Solution Exam Facts Format Wednesday, July 20 th from 11:00 a.m. 1:00 p.m. in Gates B01 The exam is designed to take roughly 90

More information

CS 2210 Sample Midterm. 1. Determine if each of the following claims is true (T) or false (F).

CS 2210 Sample Midterm. 1. Determine if each of the following claims is true (T) or false (F). CS 2210 Sample Midterm 1. Determine if each of the following claims is true (T) or false (F). F A language consists of a set of strings, its grammar structure, and a set of operations. (Note: a language

More information

Compiler Construction: Parsing

Compiler Construction: Parsing Compiler Construction: Parsing Mandar Mitra Indian Statistical Institute M. Mitra (ISI) Parsing 1 / 33 Context-free grammars. Reference: Section 4.2 Formal way of specifying rules about the structure/syntax

More information

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis. Topics Chapter 4 Lexical and Syntax Analysis Introduction Lexical Analysis Syntax Analysis Recursive -Descent Parsing Bottom-Up parsing 2 Language Implementation Compilation There are three possible approaches

More information

shift-reduce parsing

shift-reduce parsing Parsing #2 Bottom-up Parsing Rightmost derivations; use of rules from right to left Uses a stack to push symbols the concatenation of the stack symbols with the rest of the input forms a valid bottom-up

More information

LR Parsing LALR Parser Generators

LR Parsing LALR Parser Generators LR Parsing LALR Parser Generators Outline Review of bottom-up parsing Computing the parsing DFA Using parser generators 2 Bottom-up Parsing (Review) A bottom-up parser rewrites the input string to the

More information

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence.

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence. Bottom-up parsing Recall For a grammar G, with start symbol S, any string α such that S α is a sentential form If α V t, then α is a sentence in L(G) A left-sentential form is a sentential form that occurs

More information

Let us construct the LR(1) items for the grammar given below to construct the LALR parsing table.

Let us construct the LR(1) items for the grammar given below to construct the LALR parsing table. MODULE 18 LALR parsing After understanding the most powerful CALR parser, in this module we will learn to construct the LALR parser. The CALR parser has a large set of items and hence the LALR parser is

More information

3. Parsing. Oscar Nierstrasz

3. Parsing. Oscar Nierstrasz 3. Parsing Oscar Nierstrasz Thanks to Jens Palsberg and Tony Hosking for their kind permission to reuse and adapt the CS132 and CS502 lecture notes. http://www.cs.ucla.edu/~palsberg/ http://www.cs.purdue.edu/homes/hosking/

More information

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7 Top-Down Parsing and Intro to Bottom-Up Parsing Lecture 7 1 Predictive Parsers Like recursive-descent but parser can predict which production to use Predictive parsers are never wrong Always able to guess

More information

CS 406/534 Compiler Construction Putting It All Together

CS 406/534 Compiler Construction Putting It All Together CS 406/534 Compiler Construction Putting It All Together Prof. Li Xu Dept. of Computer Science UMass Lowell Fall 2004 Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy

More information

Lecture Bottom-Up Parsing

Lecture Bottom-Up Parsing Lecture 14+15 Bottom-Up Parsing CS 241: Foundations of Sequential Programs Winter 2018 Troy Vasiga et al University of Waterloo 1 Example CFG 1. S S 2. S AyB 3. A ab 4. A cd 5. B z 6. B wz 2 Stacks in

More information

CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages CMSC 330: Organization of Programming Languages Context Free Grammars and Parsing 1 Recall: Architecture of Compilers, Interpreters Source Parser Static Analyzer Intermediate Representation Front End Back

More information

Languages and Compilers

Languages and Compilers Principles of Software Engineering and Operational Systems Languages and Compilers SDAGE: Level I 2012-13 3. Formal Languages, Grammars and Automata Dr Valery Adzhiev vadzhiev@bournemouth.ac.uk Office:

More information

LR Parsing LALR Parser Generators

LR Parsing LALR Parser Generators Outline LR Parsing LALR Parser Generators Review of bottom-up parsing Computing the parsing DFA Using parser generators 2 Bottom-up Parsing (Review) A bottom-up parser rewrites the input string to the

More information

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7 Top-Down Parsing and Intro to Bottom-Up Parsing Lecture 7 1 Predictive Parsers Like recursive-descent but parser can predict which production to use Predictive parsers are never wrong Always able to guess

More information

CS 164 Programming Languages and Compilers Handout 9. Midterm I Solution

CS 164 Programming Languages and Compilers Handout 9. Midterm I Solution Midterm I Solution Please read all instructions (including these) carefully. There are 5 questions on the exam, some with multiple parts. You have 1 hour and 20 minutes to work on the exam. The exam is

More information

UNIT III & IV. Bottom up parsing

UNIT III & IV. Bottom up parsing UNIT III & IV Bottom up parsing 5.0 Introduction Given a grammar and a sentence belonging to that grammar, if we have to show that the given sentence belongs to the given grammar, there are two methods.

More information

Semantics via Syntax. f (4) = if define f (x) =2 x + 55.

Semantics via Syntax. f (4) = if define f (x) =2 x + 55. 1 Semantics via Syntax The specification of a programming language starts with its syntax. As every programmer knows, the syntax of a language comes in the shape of a variant of a BNF (Backus-Naur Form)

More information

LR Parsing Techniques

LR Parsing Techniques LR Parsing Techniques Introduction Bottom-Up Parsing LR Parsing as Handle Pruning Shift-Reduce Parser LR(k) Parsing Model Parsing Table Construction: SLR, LR, LALR 1 Bottom-UP Parsing A bottom-up parser

More information

Lecture Notes on Bottom-Up LR Parsing

Lecture Notes on Bottom-Up LR Parsing Lecture Notes on Bottom-Up LR Parsing 15-411: Compiler Design Frank Pfenning Lecture 9 1 Introduction In this lecture we discuss a second parsing algorithm that traverses the input string from left to

More information

14.1 Encoding for different models of computation

14.1 Encoding for different models of computation Lecture 14 Decidable languages In the previous lecture we discussed some examples of encoding schemes, through which various objects can be represented by strings over a given alphabet. We will begin this

More information

Syntactic Analysis. Top-Down Parsing

Syntactic Analysis. Top-Down Parsing Syntactic Analysis Top-Down Parsing Copyright 2017, Pedro C. Diniz, all rights reserved. Students enrolled in Compilers class at University of Southern California (USC) have explicit permission to make

More information

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology MIT 6.035 Specifying Languages with Regular essions and Context-Free Grammars Martin Rinard Massachusetts Institute of Technology Language Definition Problem How to precisely define language Layered structure

More information

Chapter 4. Lexical and Syntax Analysis

Chapter 4. Lexical and Syntax Analysis Chapter 4 Lexical and Syntax Analysis Chapter 4 Topics Introduction Lexical Analysis The Parsing Problem Recursive-Descent Parsing Bottom-Up Parsing Copyright 2012 Addison-Wesley. All rights reserved.

More information

Downloaded from Page 1. LR Parsing

Downloaded from  Page 1. LR Parsing Downloaded from http://himadri.cmsdu.org Page 1 LR Parsing We first understand Context Free Grammars. Consider the input string: x+2*y When scanned by a scanner, it produces the following stream of tokens:

More information

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

MIT Specifying Languages with Regular Expressions and Context-Free Grammars MIT 6.035 Specifying Languages with Regular essions and Context-Free Grammars Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology Language Definition Problem How to precisely

More information

In One Slide. Outline. LR Parsing. Table Construction

In One Slide. Outline. LR Parsing. Table Construction LR Parsing Table Construction #1 In One Slide An LR(1) parsing table can be constructed automatically from a CFG. An LR(1) item is a pair made up of a production and a lookahead token; it represents a

More information

4. Lexical and Syntax Analysis

4. Lexical and Syntax Analysis 4. Lexical and Syntax Analysis 4.1 Introduction Language implementation systems must analyze source code, regardless of the specific implementation approach Nearly all syntax analysis is based on a formal

More information

CS606- compiler instruction Solved MCQS From Midterm Papers

CS606- compiler instruction Solved MCQS From Midterm Papers CS606- compiler instruction Solved MCQS From Midterm Papers March 06,2014 MC100401285 Moaaz.pk@gmail.com Mc100401285@gmail.com PSMD01 Final Term MCQ s and Quizzes CS606- compiler instruction If X is a

More information

Parsing. Handle, viable prefix, items, closures, goto s LR(k): SLR(1), LR(1), LALR(1)

Parsing. Handle, viable prefix, items, closures, goto s LR(k): SLR(1), LR(1), LALR(1) TD parsing - LL(1) Parsing First and Follow sets Parse table construction BU Parsing Handle, viable prefix, items, closures, goto s LR(k): SLR(1), LR(1), LALR(1) Problems with SLR Aho, Sethi, Ullman, Compilers

More information

Assignment 4 CSE 517: Natural Language Processing

Assignment 4 CSE 517: Natural Language Processing Assignment 4 CSE 517: Natural Language Processing University of Washington Winter 2016 Due: March 2, 2016, 1:30 pm 1 HMMs and PCFGs Here s the definition of a PCFG given in class on 2/17: A finite set

More information

4. Lexical and Syntax Analysis

4. Lexical and Syntax Analysis 4. Lexical and Syntax Analysis 4.1 Introduction Language implementation systems must analyze source code, regardless of the specific implementation approach Nearly all syntax analysis is based on a formal

More information

CS453 : JavaCUP and error recovery. CS453 Shift-reduce Parsing 1

CS453 : JavaCUP and error recovery. CS453 Shift-reduce Parsing 1 CS453 : JavaCUP and error recovery CS453 Shift-reduce Parsing 1 Shift-reduce parsing in an LR parser LR(k) parser Left-to-right parse Right-most derivation K-token look ahead LR parsing algorithm using

More information

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou Administrative! [ALSU03] Chapter 3 - Lexical Analysis Sections 3.1-3.4, 3.6-3.7! Reading for next time [ALSU03] Chapter 3 Copyright (c) 2010 Ioanna

More information

Compiler Construction 2016/2017 Syntax Analysis

Compiler Construction 2016/2017 Syntax Analysis Compiler Construction 2016/2017 Syntax Analysis Peter Thiemann November 2, 2016 Outline 1 Syntax Analysis Recursive top-down parsing Nonrecursive top-down parsing Bottom-up parsing Syntax Analysis tokens

More information

Question Bank. 10CS63:Compiler Design

Question Bank. 10CS63:Compiler Design Question Bank 10CS63:Compiler Design 1.Determine whether the following regular expressions define the same language? (ab)* and a*b* 2.List the properties of an operator grammar 3. Is macro processing a

More information

Optimizing Finite Automata

Optimizing Finite Automata Optimizing Finite Automata We can improve the DFA created by MakeDeterministic. Sometimes a DFA will have more states than necessary. For every DFA there is a unique smallest equivalent DFA (fewest states

More information

Parsing II Top-down parsing. Comp 412

Parsing II Top-down parsing. Comp 412 COMP 412 FALL 2018 Parsing II Top-down parsing Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled

More information

Limitations of Algorithmic Solvability In this Chapter we investigate the power of algorithms to solve problems Some can be solved algorithmically and

Limitations of Algorithmic Solvability In this Chapter we investigate the power of algorithms to solve problems Some can be solved algorithmically and Computer Language Theory Chapter 4: Decidability 1 Limitations of Algorithmic Solvability In this Chapter we investigate the power of algorithms to solve problems Some can be solved algorithmically and

More information

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward Lexical Analysis COMP 524, Spring 2014 Bryan Ward Based in part on slides and notes by J. Erickson, S. Krishnan, B. Brandenburg, S. Olivier, A. Block and others The Big Picture Character Stream Scanner

More information

Zhizheng Zhang. Southeast University

Zhizheng Zhang. Southeast University Zhizheng Zhang Southeast University 2016/10/5 Lexical Analysis 1 1. The Role of Lexical Analyzer 2016/10/5 Lexical Analysis 2 2016/10/5 Lexical Analysis 3 Example. position = initial + rate * 60 2016/10/5

More information

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata Lexical Analysis Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata Phase Ordering of Front-Ends Lexical analysis (lexer) Break input string

More information

Bottom Up Parsing. Shift and Reduce. Sentential Form. Handle. Parse Tree. Bottom Up Parsing 9/26/2012. Also known as Shift-Reduce parsing

Bottom Up Parsing. Shift and Reduce. Sentential Form. Handle. Parse Tree. Bottom Up Parsing 9/26/2012. Also known as Shift-Reduce parsing Also known as Shift-Reduce parsing More powerful than top down Don t need left factored grammars Can handle left recursion Attempt to construct parse tree from an input string eginning at leaves and working

More information

Introduction to Lexical Analysis

Introduction to Lexical Analysis Introduction to Lexical Analysis Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexical analyzers (lexers) Regular

More information

CS 406/534 Compiler Construction Parsing Part I

CS 406/534 Compiler Construction Parsing Part I CS 406/534 Compiler Construction Parsing Part I Prof. Li Xu Dept. of Computer Science UMass Lowell Fall 2004 Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy and Dr.

More information

CSE 401 Compilers. LR Parsing Hal Perkins Autumn /10/ Hal Perkins & UW CSE D-1

CSE 401 Compilers. LR Parsing Hal Perkins Autumn /10/ Hal Perkins & UW CSE D-1 CSE 401 Compilers LR Parsing Hal Perkins Autumn 2011 10/10/2011 2002-11 Hal Perkins & UW CSE D-1 Agenda LR Parsing Table-driven Parsers Parser States Shift-Reduce and Reduce-Reduce conflicts 10/10/2011

More information

Syntax Analysis Part I

Syntax Analysis Part I Syntax Analysis Part I Chapter 4: Context-Free Grammars Slides adapted from : Robert van Engelen, Florida State University Position of a Parser in the Compiler Model Source Program Lexical Analyzer Token,

More information

3.5 Practical Issues PRACTICAL ISSUES Error Recovery

3.5 Practical Issues PRACTICAL ISSUES Error Recovery 3.5 Practical Issues 141 3.5 PRACTICAL ISSUES Even with automatic parser generators, the compiler writer must manage several issues to produce a robust, efficient parser for a real programming language.

More information

CSE 130 Programming Language Principles & Paradigms Lecture # 5. Chapter 4 Lexical and Syntax Analysis

CSE 130 Programming Language Principles & Paradigms Lecture # 5. Chapter 4 Lexical and Syntax Analysis Chapter 4 Lexical and Syntax Analysis Introduction - Language implementation systems must analyze source code, regardless of the specific implementation approach - Nearly all syntax analysis is based on

More information

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones Parsing III (Top-down parsing: recursive descent & LL(1) ) (Bottom-up parsing) CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones Copyright 2003, Keith D. Cooper,

More information

Compilers. Yannis Smaragdakis, U. Athens (original slides by Sam

Compilers. Yannis Smaragdakis, U. Athens (original slides by Sam Compilers Parsing Yannis Smaragdakis, U. Athens (original slides by Sam Guyer@Tufts) Next step text chars Lexical analyzer tokens Parser IR Errors Parsing: Organize tokens into sentences Do tokens conform

More information

UNIT-III BOTTOM-UP PARSING

UNIT-III BOTTOM-UP PARSING UNIT-III BOTTOM-UP PARSING Constructing a parse tree for an input string beginning at the leaves and going towards the root is called bottom-up parsing. A general type of bottom-up parser is a shift-reduce

More information

Syntax Analysis: Context-free Grammars, Pushdown Automata and Parsing Part - 4. Y.N. Srikant

Syntax Analysis: Context-free Grammars, Pushdown Automata and Parsing Part - 4. Y.N. Srikant Syntax Analysis: Context-free Grammars, Pushdown Automata and Part - 4 Department of Computer Science and Automation Indian Institute of Science Bangalore 560 012 NPTEL Course on Principles of Compiler

More information

CSCI312 Principles of Programming Languages

CSCI312 Principles of Programming Languages Copyright 2006 The McGraw-Hill Companies, Inc. CSCI312 Principles of Programming Languages! LL Parsing!! Xu Liu Derived from Keith Cooper s COMP 412 at Rice University Recap Copyright 2006 The McGraw-Hill

More information

CS 4120 Introduction to Compilers

CS 4120 Introduction to Compilers CS 4120 Introduction to Compilers Andrew Myers Cornell University Lecture 6: Bottom-Up Parsing 9/9/09 Bottom-up parsing A more powerful parsing technology LR grammars -- more expressive than LL can handle

More information

Introduction to Parsing. Lecture 5

Introduction to Parsing. Lecture 5 Introduction to Parsing Lecture 5 1 Outline Regular languages revisited Parser overview Context-free grammars (CFG s) Derivations Ambiguity 2 Languages and Automata Formal languages are very important

More information

Automata & languages. A primer on the Theory of Computation. The imitation game (2014) Benedict Cumberbatch Alan Turing ( ) Laurent Vanbever

Automata & languages. A primer on the Theory of Computation. The imitation game (2014) Benedict Cumberbatch Alan Turing ( ) Laurent Vanbever Automata & languages A primer on the Theory of Computation The imitation game (24) Benedict Cumberbatch Alan Turing (92-954) Laurent Vanbever www.vanbever.eu ETH Zürich (D-ITET) September, 2 27 Brief CV

More information

Lecture Notes on Bottom-Up LR Parsing

Lecture Notes on Bottom-Up LR Parsing Lecture Notes on Bottom-Up LR Parsing 15-411: Compiler Design Frank Pfenning Lecture 9 September 23, 2009 1 Introduction In this lecture we discuss a second parsing algorithm that traverses the input string

More information

LECTURE NOTES ON COMPILER DESIGN P a g e 2

LECTURE NOTES ON COMPILER DESIGN P a g e 2 LECTURE NOTES ON COMPILER DESIGN P a g e 1 (PCCS4305) COMPILER DESIGN KISHORE KUMAR SAHU SR. LECTURER, DEPARTMENT OF INFORMATION TECHNOLOGY ROLAND INSTITUTE OF TECHNOLOGY, BERHAMPUR LECTURE NOTES ON COMPILER

More information

General Overview of Compiler

General Overview of Compiler General Overview of Compiler Compiler: - It is a complex program by which we convert any high level programming language (source code) into machine readable code. Interpreter: - It performs the same task

More information

1 Parsing (25 pts, 5 each)

1 Parsing (25 pts, 5 each) CSC173 FLAT 2014 ANSWERS AND FFQ 30 September 2014 Please write your name on the bluebook. You may use two sides of handwritten notes. Perfect score is 75 points out of 85 possible. Stay cool and please

More information

Syntax Analysis, VI Examples from LR Parsing. Comp 412

Syntax Analysis, VI Examples from LR Parsing. Comp 412 Midterm Exam: Thursday October 18, 7PM Herzstein Amphitheater Syntax Analysis, VI Examples from LR Parsing Comp 412 COMP 412 FALL 2018 source code IR IR target Front End Optimizer Back End code Copyright

More information

Parsing III. (Top-down parsing: recursive descent & LL(1) )

Parsing III. (Top-down parsing: recursive descent & LL(1) ) Parsing III (Top-down parsing: recursive descent & LL(1) ) Roadmap (Where are we?) Previously We set out to study parsing Specifying syntax Context-free grammars Ambiguity Top-down parsers Algorithm &

More information

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice. Parsing Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice. Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students

More information

Parser Generation. Bottom-Up Parsing. Constructing LR Parser. LR Parsing. Construct parse tree bottom-up --- from leaves to the root

Parser Generation. Bottom-Up Parsing. Constructing LR Parser. LR Parsing. Construct parse tree bottom-up --- from leaves to the root Parser Generation Main Problem: given a grammar G, how to build a top-down parser or a bottom-up parser for it? parser : a program that, given a sentence, reconstructs a derivation for that sentence ----

More information

Comp 411 Principles of Programming Languages Lecture 3 Parsing. Corky Cartwright January 11, 2019

Comp 411 Principles of Programming Languages Lecture 3 Parsing. Corky Cartwright January 11, 2019 Comp 411 Principles of Programming Languages Lecture 3 Parsing Corky Cartwright January 11, 2019 Top Down Parsing What is a context-free grammar (CFG)? A recursive definition of a set of strings; it is

More information

LR Parsing E T + E T 1 T

LR Parsing E T + E T 1 T LR Parsing 1 Introduction Before reading this quick JFLAP tutorial on parsing please make sure to look at a reference on LL parsing to get an understanding of how the First and Follow sets are defined.

More information

Programming Languages Third Edition

Programming Languages Third Edition Programming Languages Third Edition Chapter 12 Formal Semantics Objectives Become familiar with a sample small language for the purpose of semantic specification Understand operational semantics Understand

More information