CS143 Handout 14 Summer 2011 July 6th, 2011 LALR Parsing


Handout written by Maggie Johnson, revised by Julie Zelenski.

Motivation
Because a canonical LR(1) parser splits states based on differing lookahead sets, it can have many more states than the corresponding SLR(1) or LR(0) parser. Potentially it could require splitting a state with just one item into a different state for each subset of the possible lookaheads; in a pathological case, this means the entire power set of its follow set (which theoretically could contain all terminals, yikes!). It never actually gets that bad in practice, but a canonical LR(1) parser for a programming language might have an order of magnitude more states than an SLR(1) parser. Is there something in between?

With LALR (lookahead LR) parsing, we attempt to reduce the number of states in an LR(1) parser by merging similar states. This reduces the number of states to the same as SLR(1), but still retains some of the power of the LR(1) lookaheads. Let's examine the LR(1) configurating sets from an example given in the LR parsing handout:

S' -> S
S -> XX
X -> aX
X -> b

I0: S' -> •S, $
    S -> •XX, $
    X -> •aX, a/b
    X -> •b, a/b

I1: S' -> S•, $

I2: S -> X•X, $
    X -> •aX, $
    X -> •b, $

I3: X -> a•X, a/b
    X -> •aX, a/b
    X -> •b, a/b

I4: X -> b•, a/b

I5: S -> XX•, $

I6: X -> a•X, $
    X -> •aX, $
    X -> •b, $

I7: X -> b•, $

I8: X -> aX•, a/b

I9: X -> aX•, $

Notice that some of the LR(1) states look suspiciously similar. Take I3 and I6, for example. These two states are virtually identical: they have the same number of items, the core of each item is identical, and they differ only in their lookahead sets. This observation may make you wonder if it is possible to merge them into one state. The same is true of I4 and I7, and of I8 and I9. If we did merge, we would end up replacing those six states with just these three:

I36: X -> a•X, a/b/$
     X -> •aX, a/b/$
     X -> •b, a/b/$

I47: X -> b•, a/b/$

I89: X -> aX•, a/b/$

But isn't this just SLR(1) all over again? In the above example, yes, since after the merging we coincidentally end up with the complete follow sets as the lookahead. This is not always the case, however. Consider this example:

S' -> S
S -> Bbb | aab | bBa
B -> a

I0: S' -> •S, $
    S -> •Bbb, $
    S -> •aab, $
    S -> •bBa, $
    B -> •a, b

I1: S' -> S•, $

I2: S -> B•bb, $

I3: S -> a•ab, $
    B -> a•, b

...

In an SLR(1) parser there is a shift-reduce conflict in state 3 when the next input is anything in Follow(B), which includes a and b. In LALR(1), state 3 will shift on a and reduce on b. Intuitively, this is because the LALR(1) state "remembers" that we arrived at state 3 after seeing an a. Thus we are trying to parse either Bbb or aab. In order for that first a to be a valid reduction to B, the next input has to be exactly b, since that is the only symbol that can follow B in this particular context. Although elsewhere an expansion of B can be followed by an a, we consider only the subset of the follow set that can appear here, and thus avoid the conflict an SLR(1) parser would have.

LALR Merge Conflicts
Can merging states in this way ever introduce new conflicts? A shift-reduce conflict cannot exist in a merged set unless the conflict was already present in one of the original LR(1) configurating sets. When merging, the two sets must have the same core items. If the merged set has a configuration that shifts on a and another that reduces on a, both configurations must have been present in the original sets, and at least one of those sets had the conflict already.
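To make the merge operation concrete, here is a small sketch (my own illustration, not part of the handout; the tuple encoding of items is an assumption) that represents an LR(1) item as (lhs, rhs, dot position, lookahead set) and merges two same-core states, such as I3 and I6 above, by unioning their lookaheads:

```python
def core(state):
    """The LR(0) core of a state: its items with the lookaheads dropped."""
    return {(lhs, rhs, dot) for (lhs, rhs, dot, _) in state}

def merge(s1, s2):
    """Merge two LR(1) states with identical cores by unioning lookaheads."""
    assert core(s1) == core(s2), "only same-core states may be merged"
    las = {}
    for (lhs, rhs, dot, la) in s1 | s2:
        las[(lhs, rhs, dot)] = las.get((lhs, rhs, dot), frozenset()) | la
    return {key + (la,) for key, la in las.items()}

# States I3 and I6 for the grammar S -> XX, X -> aX | b.
# ("X", ("a", "X"), 1, ...) encodes the item X -> a•X (dot at position 1).
I3 = {("X", ("a", "X"), 1, frozenset("ab")),
      ("X", ("a", "X"), 0, frozenset("ab")),
      ("X", ("b",),     0, frozenset("ab"))}
I6 = {("X", ("a", "X"), 1, frozenset("$")),
      ("X", ("a", "X"), 0, frozenset("$")),
      ("X", ("b",),     0, frozenset("$"))}

I36 = merge(I3, I6)   # every item now carries the lookahead {a, b, $}
```
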

Reduce-reduce conflicts, however, are another story. Consider the following grammar:

S' -> S
S -> aBc | bCc | aCd | bBd
B -> e
C -> e

The LR(1) configurating sets are as follows:

I0: S' -> •S, $
    S -> •aBc, $
    S -> •bCc, $
    S -> •aCd, $
    S -> •bBd, $

I1: S' -> S•, $

I2: S -> a•Bc, $
    S -> a•Cd, $
    B -> •e, c
    C -> •e, d

I3: S -> b•Cc, $
    S -> b•Bd, $
    C -> •e, c
    B -> •e, d

I4: S -> aB•c, $

I5: S -> aC•d, $

I6: B -> e•, c
    C -> e•, d

I7: S -> bC•c, $

I8: S -> bB•d, $

I9: B -> e•, d
    C -> e•, c

I10: S -> aBc•, $

I11: S -> aCd•, $

I12: S -> bCc•, $

I13: S -> bBd•, $

We try to merge I6 and I9 since they have the same core items and they differ only in lookahead:

I69: B -> e•, c/d
     C -> e•, c/d

However, this creates a problem. The merged configurating set allows a reduction to either B or C when the next token is c or d. This is a reduce-reduce conflict, and it can be an unintended consequence of merging LR(1) states. When such a conflict arises in doing a merging, we say the grammar is not LALR(1).
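The reduce-reduce check itself is mechanical: a state is in conflict exactly when two completed items (dot at the end) have overlapping lookaheads. A sketch, again using an assumed (lhs, rhs, dot, lookaheads) encoding of items rather than anything from the handout:

```python
def reduce_reduce_conflicts(state):
    """Pairs of completed items whose lookahead sets overlap."""
    done = [(lhs, la) for (lhs, rhs, dot, la) in state if dot == len(rhs)]
    return [(a, b, la1 & la2)
            for i, (a, la1) in enumerate(done)
            for (b, la2) in done[i + 1:]
            if la1 & la2]

# I6 and I9 from the example are conflict-free on their own...
I6 = {("B", ("e",), 1, frozenset("c")), ("C", ("e",), 1, frozenset("d"))}
I9 = {("B", ("e",), 1, frozenset("d")), ("C", ("e",), 1, frozenset("c"))}
# ...but their merge I69 can reduce to either B or C on both c and d:
I69 = {("B", ("e",), 1, frozenset("cd")), ("C", ("e",), 1, frozenset("cd"))}
```
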

LALR Table Construction
An LALR(1) parsing table is built from the configurating sets in the same way as for canonical LR(1); the lookaheads determine where to place the reduce actions. In fact, if there are no mergable states in the configurating sets, the LALR(1) table will be identical to the corresponding LR(1) table and we gain nothing. In the common case, however, there will be states that can be merged and the LALR table will have fewer rows than LR. The LR table for a typical programming language may have several thousand rows, which can be merged into just a few hundred for LALR.

Due to merging, the LALR(1) table seems more similar to the SLR(1) and LR(0) tables. All three have the same number of states (rows), but the LALR table may have fewer reduce actions: some reductions are not valid if we are more precise about the lookahead. Thus, some conflicts are avoided because an action cell with conflicting actions in an SLR(1) or LR(0) table may have a unique entry in an LALR(1) table once some erroneous reduce actions have been eliminated.

Brute Force?
There are two ways to construct LALR(1) parsing tables. The first (and certainly more obvious) way is to construct the LR(1) table and merge the sets manually. This is sometimes referred to as the "brute force" approach. If you don't mind first finding all the multitude of states required by the canonical parser, compressing the LR table into the LALR version is straightforward.

1. Construct all canonical LR(1) states.
2. Merge those states that are identical if the lookaheads are ignored, i.e., two states being merged must have the same number of items and the items must have the same core (the same productions, differing only in lookahead). The lookahead on each merged item is the union of the lookaheads from the states being merged.
3. The successor function for the new LALR(1) state is the union of the successors of the merged states. If the two configurations have the same core, then the original successors must have the same core as well, and thus the new state has the same successors.
4. The action and goto entries are constructed from the LALR(1) states as for the canonical LR(1) parser.

Let's do an example, eh?
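The merging steps above can be sketched as a small function (my own illustration, not the handout's code): bucket the canonical states by core, union lookaheads within each bucket, and record a remap table that step 3 would use to redirect successors.

```python
def lalr_merge(states):
    """states: list of sets of LR(1) items (lhs, rhs, dot, lookaheads).
    Returns the merged states plus a map old-state-number -> new one."""
    def core(state):
        return frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _) in state)

    by_core = {}               # core -> index of its merged state
    merged, remap = [], {}
    for old, state in enumerate(states):
        c = core(state)
        if c not in by_core:   # first state with this core: copy it over
            by_core[c] = len(merged)
            merged.append({item[:3]: item[3] for item in state})
        else:                  # same core seen before: union the lookaheads
            tgt = merged[by_core[c]]
            for item in state:
                tgt[item[:3]] = tgt[item[:3]] | item[3]
        remap[old] = by_core[c]
    return merged, remap

# Merging I4 = {X -> b•, a/b} and I7 = {X -> b•, $} into one state:
merged, remap = lalr_merge([
    {("X", ("b",), 1, frozenset("ab"))},
    {("X", ("b",), 1, frozenset("$"))},
])
```
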

Consider the LR(1) table for the grammar given at the start of this handout. There are nine states.

State on top of stack    Action             Goto
                         a     b     $     S     X
0                        s3    s4          1     2
1                                    acc
2                        s6    s7                5
3                        s3    s4                8
4                        r3    r3
5                                    r1
6                        s6    s7                9
7                                    r3
8                        r2    r2
9                                    r2

Looking at the configurating sets, we saw that states 3 and 6 can be merged, as can 4 and 7, and 8 and 9. Now we build the LALR(1) table with the six remaining states:

State on top of stack    Action             Goto
                         a     b     $     S     X
0                        s36   s47         1     2
1                                    acc
2                        s36   s47               5
36                       s36   s47               89
47                       r3    r3    r3
5                                    r1
89                       r2    r2    r2

The More Clever Approach
Having to compute the LR(1) configurating sets first means we won't save any time or effort in building an LALR parser. However, the work wasn't all for naught: when the parser is executing, it can work with the compressed table, thereby saving memory. The difference can be an order of magnitude in the number of states.

However, there is a more efficient strategy for building the LALR(1) states called step-by-step merging. The idea is that you merge the configurating sets as you go, rather than waiting until the end to find the identical ones. Sets of states are constructed as in the LR(1) method, but at each point where a new set is spawned, you first check to see whether it may be merged with an existing set. This means examining the other states to see if one with the same core already exists. If so, you merge the new set with the existing one; otherwise you add it normally.

Here is an example of this method in action:

S' -> S
S -> V = E
E -> F | E + F
F -> V | int | (E)
V -> id

Start building the LR(1) collection of configurating sets as you would normally:

I0: S' -> •S, $
    S -> •V = E, $
    V -> •id, =

I1: S' -> S•, $

I2: S -> V•= E, $

I3: V -> id•, =

I4: S -> V =•E, $
    E -> •F, $/+
    E -> •E + F, $/+
    F -> •V, $/+
    F -> •int, $/+
    F -> •(E), $/+
    V -> •id, $/+

I5: S -> V = E•, $
    E -> E•+ F, $/+

I6: E -> F•, $/+

I7: F -> V•, $/+

I8: F -> int•, $/+

I9: F -> (•E), $/+
    E -> •F, )/+
    E -> •E + F, )/+
    F -> •V, )/+
    F -> •int, )/+
    F -> •(E), )/+
    V -> •id, )/+

I10: F -> (E•), $/+
     E -> E•+ F, )/+
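The check-before-add bookkeeping just described can be sketched with a dictionary keyed by core (an illustration under assumed data structures, not the algorithm any particular tool uses):

```python
class LALRBuilder:
    """Merge-as-you-go: consult a core -> state map before adding a set."""

    def __init__(self):
        self.states = []     # state number -> {(lhs, rhs, dot): lookaheads}
        self.by_core = {}    # frozenset core -> state number

    def add(self, items):
        """items: set of (lhs, rhs, dot, lookaheads) tuples.
        Returns (state number, whether a genuinely new state was created)."""
        c = frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _) in items)
        if c in self.by_core:                 # same core exists: merge in
            n = self.by_core[c]
            for (lhs, rhs, dot, la) in items:
                self.states[n][(lhs, rhs, dot)] |= la
            return n, False
        n = len(self.states)                  # genuinely new core: add it
        self.by_core[c] = n
        self.states.append({(l, r, d): set(la) for (l, r, d, la) in items})
        return n, True

# Spawning I6 = {E -> F•, $/+} and later I11 = {E -> F•, )/+}:
builder = LALRBuilder()
n6, _ = builder.add({("E", ("F",), 1, frozenset("$+"))})
n11, is_new = builder.add({("E", ("F",), 1, frozenset(")+"))})
# n11 == n6: I11 folds into I6, whose lookahead is now $/+/)
```
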

When we construct state I11, we get something we've seen before:

I11: E -> F•, )/+

It has the same core as I6, so rather than add a new state, we go ahead and merge with that one to get:

I611: E -> F•, $/+/)

We have a similar situation with state I12, which can be merged with state I7. The algorithm continues like this, merging into existing states where possible and only adding new states when necessary. When we finish creating the sets, we construct the table just as in LR(1).

LALR(1) Grammars
A formal definition of what makes a grammar LALR(1) cannot be easily encapsulated in a set of rules, because it needs to look beyond the particulars of a production in isolation to consider the other situations where the production appears on the top of the stack and what happens when we merge those situations. Instead we state that what makes a grammar LALR(1) is the absence of conflicts in its parser. If you build the parser and it is conflict-free, it implies the grammar is LALR(1), and vice versa.

LALR(1) is a subset of LR(1) and a superset of SLR(1). A grammar that is not LR(1) is definitely not LALR(1), since whatever conflict occurred in the original LR(1) parser will still be present in the LALR(1) parser. A grammar that is LR(1) may or may not be LALR(1), depending on whether merging introduces conflicts. A grammar that is SLR(1) is definitely LALR(1). A grammar that is not SLR(1) may or may not be LALR(1), depending on whether the more precise lookaheads resolve the SLR(1) conflicts.

LALR(1) has proven to be the most used variant of the LR family. The weakness of the SLR(1) and LR(0) parsers means they are only capable of handling a small set of grammars. The expansive memory needs of LR(1) caused it to languish for several years as a theoretically interesting but intractable approach. It was the advent of LALR(1) that offered a good balance between the power of the specific lookaheads and table size.
The popular tools yacc and bison generate LALR(1) parsers and most programming language constructs can be described with an LALR(1) grammar (perhaps with a little grammar massaging or parser trickery to skirt some isolated issues).
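To see a compressed table in use, here is a minimal table-driven driver (a sketch with the table hard-coded by hand, not generator output) running the merged LALR(1) table built earlier for the grammar S -> XX, X -> aX | b. Productions are numbered 1: S -> XX, 2: X -> aX, 3: X -> b.

```python
ACTION = {
    (0, "a"): ("s", 36), (0, "b"): ("s", 47),
    (1, "$"): ("acc",),
    (2, "a"): ("s", 36), (2, "b"): ("s", 47),
    (36, "a"): ("s", 36), (36, "b"): ("s", 47),
    (47, "a"): ("r", 3), (47, "b"): ("r", 3), (47, "$"): ("r", 3),
    (5, "$"): ("r", 1),
    (89, "a"): ("r", 2), (89, "b"): ("r", 2), (89, "$"): ("r", 2),
}
GOTO = {(0, "S"): 1, (0, "X"): 2, (2, "X"): 5, (36, "X"): 89}
PRODS = {1: ("S", 2), 2: ("X", 2), 3: ("X", 1)}   # number -> (lhs, rhs length)

def parse(tokens):
    """Accept iff the input is in the language of S -> XX, X -> aX | b."""
    stack, toks, i = [0], list(tokens) + ["$"], 0
    while True:
        act = ACTION.get((stack[-1], toks[i]))
        if act is None:                       # blank (error) entry: reject
            return False
        if act[0] == "acc":
            return True
        if act[0] == "s":                     # shift the lookahead token
            stack.append(act[1]); i += 1
        else:                                 # reduce by production act[1]
            lhs, n = PRODS[act[1]]
            del stack[-n:]                    # pop one state per rhs symbol
            stack.append(GOTO[(stack[-1], lhs)])
```

For example, parse("abb") accepts (ab reduces to X, then b reduces to X, then S -> XX), while parse("ab") rejects because the input ends before a second X can be found.
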

Error Handling
As in LL(1) parsing tables, we can implement error processing for any of the variations of LR parsing by placing appropriate actions in the parse table. Here is a parse table for a simple arithmetic expression grammar with error actions inserted into what would have been the blank entries in the table.

E -> E + E | E * E | (E) | id

State on top of stack    Action                               Goto
                         id    +     *     (     )     $     E
0                        s3    e1    e1    s2    e2    e1    1
1                        e3    s4    s5    e3    e2    acc
2                        s3    e1    e1    s2    e2    e1    6
3                        e3    r4    r4    e3    r4    r4
4                        s3    e1    e1    s2    e2    e1    7
5                        s3    e1    e1    s2    e2    e1    8
6                        e3    s4    s5    e3    s9    e4
7                        e3    r1    s5    e3    r1    r1
8                        e3    r2    r2    e3    r2    r2
9                        e3    r3    r3    e3    r3    r3

Error e1 is called from states 0, 2, 4, and 5 when we encounter an operator. All of these states expect to see the beginning of an expression, i.e., an id or a left parenthesis. One way to fix this is for the parser to act as though id was seen in the input and shift state 3 onto the stack (the successor for id in these states), effectively faking that the necessary token was found. The error message printed might be something like "missing operand".

Error e2 is called from states 0, 1, 2, 4, and 5 on finding a right parenthesis where we were expecting either the beginning of a new expression (or potentially the end of input, for state 1). A possible fix: remove the right parenthesis from the input and discard it. The message printed could be "unbalanced right parenthesis".

Error e3 is called from states 1, 3, 6, 7, 8, and 9 on finding an id or left parenthesis. What were these states expecting? What might be a good fix? How should you report the error to the user?

Error e4 is called from state 6 on finding $. What is a reasonable fix? What do you tell the user?
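The recovery actions described for e1 and e2 can be sketched in a table-driven driver (my own illustration of the idea, not reference code; e3 and e4 are left unhandled here and simply abort):

```python
PRODS = {1: ("E", 3), 2: ("E", 3), 3: ("E", 3), 4: ("E", 1)}
GOTO = {(0, "E"): 1, (2, "E"): 6, (4, "E"): 7, (5, "E"): 8}

ACTION = {}
for st in (0, 2, 4, 5):                   # states expecting an expression
    ACTION[(st, "id")] = ("s", 3)
    ACTION[(st, "(")] = ("s", 2)
    for t in ("+", "*", "$"):
        ACTION[(st, t)] = ("e", 1)        # e1: missing operand
    ACTION[(st, ")")] = ("e", 2)          # e2: unbalanced right parenthesis
ACTION[(1, "+")] = ("s", 4); ACTION[(1, "*")] = ("s", 5)
ACTION[(1, ")")] = ("e", 2); ACTION[(1, "$")] = ("acc",)
for t in ("+", "*", ")", "$"):
    ACTION[(3, t)] = ("r", 4)             # reduce E -> id
ACTION[(6, "+")] = ("s", 4); ACTION[(6, "*")] = ("s", 5)
ACTION[(6, ")")] = ("s", 9)
for st, p in ((7, 1), (8, 2), (9, 3)):
    for t in ("+", ")", "$"):
        ACTION[(st, t)] = ("r", p)
ACTION[(7, "*")] = ("s", 5)               # shift * over a pending E + E
ACTION[(8, "*")] = ("r", 2); ACTION[(9, "*")] = ("r", 3)

def parse(tokens):
    """Returns (accepted, error messages); e1 and e2 recover, e3/e4 abort."""
    stack, toks, errors, i = [0], list(tokens) + ["$"], [], 0
    while True:
        act = ACTION.get((stack[-1], toks[i]))
        if act is None:                   # an e3/e4 cell: give up here
            return False, errors + ["unrecoverable error"]
        if act[0] == "acc":
            return True, errors
        if act[0] == "s":
            stack.append(act[1]); i += 1
        elif act[0] == "r":
            lhs, n = PRODS[act[1]]
            del stack[-n:]
            stack.append(GOTO[(stack[-1], lhs)])
        elif act == ("e", 1):             # pretend id was seen: push state 3
            errors.append("missing operand")
            stack.append(3)
        else:                             # ("e", 2): discard the ')'
            errors.append("unbalanced right parenthesis")
            i += 1
```

With this driver, the input "+ id" recovers with the message "missing operand" and still accepts, and "id )" recovers with "unbalanced right parenthesis".
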