LL Parsing, LR Parsing, Complexity, and Automata


R. Gregory Taylor
Department of Mathematics and Computer Science, Manhattan College
Riverdale, New York 10471-4098 USA <gtaylor@manhattan.edu>

Abstract

It is well known that pushdown-stack automata find application within the syntactic analysis phase of compilation. Nonetheless, in most compiler design textbooks the relation between popular parsing algorithms and the theory of deterministic pushdown-stack automata remains implicit. We show that it is not difficult to implement these algorithms as deterministic automata. These implementations in turn yield instructive time/space analyses of the implemented algorithms.

1. LL(1) Parsing and Deterministic Pushdown-Stack Automata

1.1 LL(1) Expression Grammar

Example 1. It will be helpful for our discussion to focus on a particular example, and so let us consider the expression grammar G whose productions are as follows.

    expr     → term expr_aux
    expr_aux → add_op term expr_aux | ε
    term     → fac term_aux
    term_aux → mult_op fac term_aux | ε
    fac      → primary | - primary
    primary  → ( expr ) | tok_id
    add_op   → +
    mult_op  → * | /

Note that G is a left-factored grammar. It is obvious that G has no direct left recursion, and it is also quite easy to see that G has no indirect left recursion. Table 1 presents part of an LL(1) parse table that is constructed, in accordance with a well-known technique, on the basis of grammar G (see [1], [2]).

Table 1

1.2 Implementation of an LL(1) Parser as a Pushdown-Stack Automaton

It can be shown that an LL(1) parser is, at bottom, a deterministic pushdown-stack automaton that accepts by empty stack. Of course, a stack component does play a role in the table-driven LL(1) parsers that one standardly considers. However, it is presumably not obvious how we may view a parser that uses both a stack and a table as a pushdown automaton, which, of course, appears to involve no table.
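Before turning to the automaton itself, it may help to see the table-driven parser as ordinary code. The sketch below is illustrative rather than taken from the paper: since Table 1 is reproduced only partially, its entries are reconstructed here from grammar G by the usual FIRST/FOLLOW computation, and the function name `ll1_parse` is ours. The driver realizes the accept-by-empty-stack behavior that the discussion develops.

```python
# Hypothetical sketch of a table-driven LL(1) parser for grammar G of Example 1.
# The table entries below are reconstructed via FIRST/FOLLOW sets, not copied
# from the paper's (partial) Table 1.

NONTERMINALS = {"expr", "expr_aux", "term", "term_aux", "fac", "primary",
                "add_op", "mult_op"}

# TABLE[nonterminal][lookahead] = right-hand side to push (empty list = ε).
TABLE = {
    "expr":     {t: ["term", "expr_aux"] for t in ("tok_id", "(", "-")},
    "expr_aux": {"+": ["add_op", "term", "expr_aux"], ")": [], "#": []},
    "term":     {t: ["fac", "term_aux"] for t in ("tok_id", "(", "-")},
    "term_aux": {"*": ["mult_op", "fac", "term_aux"],
                 "/": ["mult_op", "fac", "term_aux"],
                 "+": [], ")": [], "#": []},
    "fac":      {"tok_id": ["primary"], "(": ["primary"],
                 "-": ["-", "primary"]},
    "primary":  {"(": ["(", "expr", ")"], "tok_id": ["tok_id"]},
    "add_op":   {"+": ["+"]},
    "mult_op":  {"*": ["*"], "/": ["/"]},
}

def ll1_parse(tokens):
    """Accept by empty stack, mirroring the automaton M of Figure 1."""
    tokens = tokens + ["#"]          # append end-of-input symbol
    stack = ["#", "expr"]            # stack-initialization symbol, start symbol
    i = 0
    while stack:
        top = stack.pop()
        la = tokens[i]
        if top in NONTERMINALS:
            rhs = TABLE[top].get(la)
            if rhs is None:          # no table entry: reject
                return False
            stack.extend(reversed(rhs))   # expand; an ε-entry pushes nothing
        elif top == la:              # terminal (or #): match and consume
            i += 1
        else:
            return False
    return i == len(tokens)         # accepted iff stack emptied on full input

print(ll1_parse(["tok_id", "*", "tok_id", "+", "tok_id"]))  # True
print(ll1_parse(["(", "tok_id"]))                           # False
```

A single lookahead token selects each expansion, just as the single lookahead selects a peripheral state of the automaton M constructed below.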
As the reader will have guessed, this table component is really just the tabular representation of the state diagram of a (deterministic) pushdown automaton. To see this, consider the automaton M of Figure 1 in conjunction with the following remarks. We shall use Table 1 as the basis upon which to construct M's state diagram, as depicted (partially) in Figure 1. The result will be that M accepts (or recognizes) the class of expressions generated by G. We assume that the input stream has previously been tokenized. The input alphabet of M then comprises the terminal symbols (token classes) of G together with end-of-input symbol #. The stack alphabet of M contains all terminals and nonterminals of G plus stack-initialization symbol #. As usual, we assume the latter symbol to be the only symbol on M's stack at the inception of execution.

[SIGCSE Bulletin, Vol. 34, No. 4, December 2002]

An arc labeled a, b; c, d, e, say, is interpreted as follows: if symbol a is the current input symbol and symbol b is currently on top of the stack, then pop the stack and push symbols c, d, and e, in that order. If symbol a is ε, then the input stream is, in effect, being ignored. If what follows the semicolon is an occurrence of ε, this is to say that no symbol is pushed onto the stack. The states of M are the two initialization states q0 and q1 together with eight states, one for each symbol of its input alphabet, including #. It will be convenient to designate the latter states as q_tok_id, q_+, q_-, q_*, q_/, q_(, q_), and q_#. (State designation q_+ abbreviates q_tok_plus or the like.) The start state of M is q0, and M has no accepting states. Each of the transitions from state q1 corresponds to M's looking ahead one token in the input stream and then incorporating the encountered lookahead symbol into its state. For example, if the lookahead symbol at state q1 is *, then M is seen to enter state q_* and then to manipulate its stack without reading further input; see the two self-loops at state q_*. (For the sake of simplicity in presenting the diagram, we use symbol a as a wildcard ranging over all members of M's stack alphabet.) Processing at state q_* continues until symbol * itself appears atop the stack, at which point it is popped and M re-enters state q1. For simplicity again, we show only two self-loops for each of the states q_tok_id, q_+, and so forth; the omitted self-loops may easily be inferred from (the completion of) Table 1. It is rather easy to see that M is deterministic, primarily because the expression grammar on which it is based is left-factored.

Tracing the computation of M for the input string num1 * num2 + num3, as well as for (num1 +, say, reveals that the former is accepted whereas the latter is not. In fact, M's computation will be essentially that of an LL(1) parser, as recorded in Tables 2 and 3 (both partial). The student should then have no trouble believing that the table-driven LL(1) parser is, at root, the implementation of a pushdown automaton that accepts by empty stack.

Table 2

Table 3

The state diagram of Figure 1 may be used to clarify the role of the (single) lookahead in LL(1) parsing. Namely, at central state q1, machine M must decide which peripheral state to enter and, by implication, how to expand the leftmost parse-tree node labeled by a nonterminal. It renders this decision based solely upon the lookahead: if the lookahead is (, then M enters state q_(, and so forth. Once at peripheral state q_(, say, M proceeds to expand the tree downward until terminal ( appears atop the stack, at which point the stack is popped and M returns to state q1. It is at this point, and not before, that input token ( is said to have been consumed. In other words, lookahead ( is used as the basis for tree expansion before it is consumed. Incidentally, the topology of the state diagram of Figure 1, a single central state surrounded by a ring of peripheral states with multiple self-loops, is characteristic of parsing automata that use a single lookahead. (How would we configure parsers using two lookaheads? Three lookaheads?)

Figure 1. Arc label a serves as a wildcard representing an arbitrary stack alphabet symbol.

1.3 The Complexity of LL(1) Parsing

Finally, reflection upon Figure 1 enables us to justify a certain claim regarding the efficiency of LL(1) parsing. Note that if token * is the current input symbol, then M reads that symbol, leaves state q1, and enters state q_*. Once in state q_*, M will traverse ε-self-loops until symbol * appears atop its stack; it is easy to see that this will require, in the worst case, two steps. One more step will then bring M back to state q1. The case of input token / is perfectly analogous, as are all the other possibilities: each peripheral state q has but finitely many ε-self-loops. Since the grammar upon which M was constructed is free of left recursion, none of these ε-self-loops will ever be traversed twice during any one stint at q. Moreover, the number of these self-loops in no way depends upon n, where we take n to be the number of tokens in the tokenized input stream. (What it does depend on is the grammar G of Example 1.) Finally, since the self-loops at q are all ε-moves, they never advance the input. It should now be apparent that, for each of the n input tokens, automaton M enters some peripheral state and then computes there for O(1) steps, worst case, assuming the obvious notion of computation step. Moreover, each step increases the height of M's stack by O(1). Apparently, we have proved the following proposition.

Theorem 1. LL(1) parsing requires O(n) time and O(n) space, where n is the length of the token stream.

2. LR Parsing and Deterministic Pushdown-Stack Automata

In fact, we shall look only at so-called SLR(1) parsing, a particularly simple form of LR(1) parsing.

2.1 Example 2

Our discussion of SLR parsing will focus on the expression grammar G appearing below.

    (1.1) expr    → term
    (1.2) expr    → expr add_op term
    (2.1) term    → fac
    (2.2) term    → term mult_op fac
    (3.1) fac     → ( expr )
    (3.2) fac     → tok_int_lit
    (3.3) fac     → - ( expr )
    (3.4) fac     → - tok_int_lit
    (4.1) add_op  → +
    (4.2) add_op  → -
    (5.1) mult_op → *
    (5.2) mult_op → /

Note that G is left-recursive and not left-factored.

2.2 Implementation of an SLR Parser as a Pushdown-Stack Automaton

Once again we endeavor to show that a certain type of parser, this time an SLR parser for the expression grammar G of Example 2, is, in its essentials, the implementation of a deterministic pushdown automaton M that accepts by empty stack. A part of the transition diagram of M appears in Figure 2. Again, it is presumably not obvious that Tables 4 and 5 represent this machine; consequently, some explanation will be required to make this plausible. The explanation takes the form of a step-by-step description of the construction of M based upon G or, rather, based upon the action and goto tables that are based, in turn, upon G.

(1) M will have ten states in total: one for each of the seven terminal symbols of G, one more for end-of-input symbol #, and two initialization states q0 and q1. Again, it will be convenient to designate the peripheral states as q_tok_int_lit, q_+, q_-, q_*, q_/, q_(, q_), and q_#. M's start state will be q0.

(2) M's stack alphabet will include all terminals and nonterminals of G plus stack-initialization symbol #. In addition, the stack alphabet will contain numerals representing each of the 22 rows in the complete action table (Table 4). We shall think of the numeral 46, say, as a single stack alphabet symbol.

(3) The stack of M will contain symbols of G (terminals and nonterminals) alternating with numerals designating states of a certain finite-state automaton (not shown). This is potentially confusing, we admit: our talk of "states" here has nothing whatever to do with the states of the pushdown automaton M now under construction, of which there are only ten.
(4) A single arc from state q0 to state q1 pushes numeral 0 onto M's stack, which is assumed already to contain stack-initialization symbol #.

(5) For each terminal a of G and every state numeral n, we

add an arc labeled a, n; n from central state q1 to state q_a. Similarly, for end-of-input symbol #, we add an arc labeled #, n; n from q1 to state q_#. (See Figure 2, where these arcs are indicated schematically.)

(6) Corresponding to each shift action within the action table, there will be an ε-move leading from a peripheral state back to state q1. For instance, corresponding to the shift S5 in the upper left-hand corner of Table 4, there will be an arc labeled ε, 0; 0 tok_int_lit 5 from state q_tok_int_lit back to state q1. There are 31 shift actions in the action table, but we have included only a very few of the corresponding arcs in the diagram of Figure 2: just the ones that we shall need to cite later.

(7) Corresponding to each reduce action in Table 4, there will be an ε-self-loop on one of the peripheral states in Figure 2. For example, corresponding to the reduce action R3.2 in the sixth row and second column (not shown), there will be a self-loop labeled ε, tok_int_lit 5; fac at state q_+. (The reader will need to check production (3.2) of G in order to make sense of this.) Again, we have presented only a very few of these arcs in Figure 2. The entire goto table (Table 5) will be represented by ε-self-loops on each of the peripheral states, including state q_#. Thus, corresponding to the three entries in the first row of that table, there will be arcs labeled ε, 0 expr; 0 expr 1 and ε, 0 term; 0 term 2 and ε, 0 fac; 0 fac 3 on state q_+ and on every other peripheral state.

(8) The single Accept action of the action table will be reflected in a self-loop labeled ε, # 0 expr 1; ε at state q_# (see Figure 2).

Table 4. Action Table (partial)

Table 5. Goto Table (partial)

Figure 2.
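The machine just described is driven by the standard shift-reduce loop. Because Tables 4 and 5 are reproduced only partially, the sketch below does not encode the paper's tables; instead it hardcodes the SLR(1) action and goto tables for a deliberately tiny grammar, (1) E → E + n and (2) E → n, whose state numbering and token name n are illustrative assumptions. The driver loop itself, however, is the one that the arcs of steps (4) through (8) implement.

```python
# Hypothetical sketch of an SLR(1) driver for the tiny grammar
#   (1) E -> E + n    (2) E -> n
# States and table entries are for this toy grammar, not the paper's Tables 4/5.

PRODUCTIONS = {1: ("E", 3), 2: ("E", 1)}   # rule number -> (lhs, rhs length)

# ACTION[state][token]: ("s", state) = shift, ("r", rule) = reduce, "acc" = accept.
# A missing entry is an error entry.
ACTION = {
    0: {"n": ("s", 2)},
    1: {"+": ("s", 3), "#": "acc"},
    2: {"+": ("r", 2), "#": ("r", 2)},
    3: {"n": ("s", 4)},
    4: {"+": ("r", 1), "#": ("r", 1)},
}
GOTO = {0: {"E": 1}}

def slr_parse(tokens):
    """Standard LR driver: the stack holds the state numerals of section 2.2."""
    tokens = tokens + ["#"]          # append end-of-input symbol
    stack = [0]                      # state numerals; grammar symbols implicit
    i = 0
    while True:
        act = ACTION[stack[-1]].get(tokens[i])
        if act is None:
            return False             # error entry: reject
        if act == "acc":
            return True
        kind, arg = act
        if kind == "s":              # shift: consume token, push its state
            stack.append(arg)
            i += 1
        else:                        # reduce: pop |rhs| states, then goto on lhs
            lhs, length = PRODUCTIONS[arg]
            del stack[-length:]
            stack.append(GOTO[stack[-1]][lhs])

print(slr_parse(["n", "+", "n"]))   # True
print(slr_parse(["n", "n"]))        # False
```

Note that each reduce is immediately followed by a goto, matching the paired ε-moves observed in the complexity analysis of section 2.3.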

2.3 The Complexity of SLR Parsing

As in the case of LL parsing, reflection upon deterministic pushdown automata such as machine M of Figure 2 enables us to give worst-case time and space analyses for SLR parsing.

(1) First, note that, after entering central state q1, machine M makes a single-step transition from q1 to some peripheral state q_a for each token a within its input stream. That transition does not itself alter M's stack. However, M will eventually make a transition from q_a back to q1, simultaneously pushing a together with some state numeral onto its stack.

(2) Since such periphery-to-center transitions are the only instructions that strictly increase the height of the stack, one can readily see that the height of the stack at any point during M's computation is O(n), where n is the length of the token stream. Thus M computes in O(n) space.

(3) Further, while at peripheral state q_a, machine M executes a number of ε-moves. These ε-moves occur in pairs: an ε-move implementing a reduce action, followed by an ε-move implementing a goto action. Again, the number of such ε-moves executed during any one stint at q_a is O(1): it depends upon G and not upon n.

(4) Putting (1) through (3) together, we have established the following proposition.

Theorem 2. SLR(1) parsing executes in time and space that are linear in the length of the token stream. The same is true of LR(1) parsing.

Of course, Theorem 2 does not take into account the cost of the computation required to construct the SLR parser itself. That is as it should be, since the cost of parser construction is a one-time cost that should not be charged to the parsing process itself. As for the generalization to LR(1) parsing, we remind the reader that the driver routines of SLR and LR parsing do not differ; rather, the difference between LR and SLR parsing is a matter of the size of the respective parse tables. In the present context, this means that the general structure of the transition diagrams of the implementing pushdown automata will be the same. What will change is the number of ε-moves at peripheral states, as well as the number of center-to-periphery and periphery-to-center transitions. Consequently, the foregoing analyses for SLR parsing are applicable to LR parsing as well.

3. Summary

We have shown that both LL and LR parsing can be implemented by deterministic pushdown-stack automata and, moreover, that both algorithms can be carried out in linear time. A more careful statement of the situation is the following: given any LL(1) (respectively, LR(1)) grammar G for context-free language L, a deterministic pushdown-stack automaton M_G can be constructed such that M_G parses an arbitrary string of length n, over the set of terminal symbols of G, in O(n) steps. This having been said, there do exist context-free languages generated by no LR(1) grammar, and the class of LL(1) grammars is still more restrictive. Of course, these grammar classes have not been defined above (see [3]); suffice it to say here that they are precisely the grammars for which our deterministic automaton construction techniques work.

Acknowledgements

The author wishes to thank Jane Stanton for editorial assistance. He also acknowledges a debt to the late Matthew Smosna, in whose superb course at New York University he acquired a first exposure to compiler design theory.

References

[1] Fischer, Charles N. and LeBlanc, Richard, Jr. Crafting a Compiler. Benjamin/Cummings, Menlo Park, California, 1988.
[2] Parsons, Thomas W. Introduction to Compiler Construction. Computer Science Press, New York, 1992.
[3] Sorenson, Paul G. and Tremblay, Jean-Paul. The Theory and Practice of Compiler Writing. McGraw-Hill, New York, 1985.
In the present context, this means that the general structure of the transition diagrams of implementing pushdown automata will be the same. What will change is the number of e- moves at peripheral states as well as the number of center-toperiphery and periphery-to-center transitions. Consequently, the foregoing analyses for SLR parsing are applicable to LR parsing as well. 3. Summary We have that both LL and LR parsing can be implemented by deterministic pushdown stack automata. Moreover, both algorithms can be carried out in linear time. A more careful statement of the situation is the following. Given any LL(1) respectively LR(1) grammar G for context-free language L, a deterministic pushdown stack automaton MG can be constructed such that MG parses an arbitrary string of length n, over the set of terminal symbols of G, in O(n) steps. This having been said, there do exist context-free languages generated by no LR(1) grammar, and the class of LL(1) grammars is still more restrictive. Of course, these grammar classes have not been defined above (see [3]). Suffice it to say here that they are precisely the grammars for which our deterministic automaton construction techniques work. Acknowledgements The author wishes to thank Jane Stanton for editorial assistance. He also acknowledges a debt to the late Matthew Smosna, in whose superb course at New York University he acquired a first exposure to compiler design theory. References [1] Fischer, Charles N. and LeBlanc, Richard., Jr. Crafting a Compiler. Benjamin/Cummings, Menlo Park, California, 1988. [2] Parsons, Thomas W. Introduction to Compiler Construction, Computer Science Press, New York, 1992. [3] Sorenson, Paul G. and Tremblay, Jean Paul, The Theory and Practice of Compiler Writing, McGraw Hill, New York, 1985. FASE Forum for Advanced Software Engineering Education Online Newsletter for educating and training software engineers <http://www.cs.ttu.edu/fase/> Vol 34, No. 4, 2002 December 75 SIGCSE Bulletin