Parsing

For a given CFG G, parsing a string w is to check whether w ∈ L(G) and, if it is, to find a sequence of production rules which derives w. Since, for a given language L, there are many grammars which generate the same language L, parsing must be done with respect to a grammar G, not its language L(G). Consider the following two context-free grammars G1 and G2, which generate the same language { a^i b^i | i > 0 }.

G1: S → aSb | ab
G2: S → aA, A → Sb | b

Clearly, the following PDA recognizes this language. However, this PDA does not provide any information for identifying the grammar, or a way of generating a given string with one of the grammars.

[PDA diagram: the start state loops on (a, Z0/aZ0) and (a, a/aa); on (b, a/ε) it moves to a second state, which loops on (b, a/ε); on (ε, Z0/Z0) it moves to the accepting state.]
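The point can be made concrete with a minimal sketch (ours, not from the notes; Python is an assumption) of the PDA above. It answers only "accept or reject" and records nothing about which grammar rule produced each symbol:

```python
# Minimal simulation of the PDA above for { a^i b^i | i > 0 }.
# It decides membership only; it yields no derivation information.

def pda_accepts(w: str) -> bool:
    stack = ["Z0"]
    state = "push"            # "push": reading a's; "pop": matching b's
    for sym in w:
        if state == "push" and sym == "a":
            stack.append("a")             # (a, Z0/aZ0) and (a, a/aa)
        elif sym == "b" and stack[-1] == "a":
            stack.pop()                   # (b, a/ε)
            state = "pop"
        else:
            return False
    # (ε, Z0/Z0): accept iff only Z0 remains and at least one b was matched
    return state == "pop" and stack == ["Z0"]

assert pda_accepts("aabb") and not pda_accepts("aab")
```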

Two Derivation Rules

Recall that if a context-free grammar G is unambiguous, then for each string x in L(G) there is a unique parse tree that yields x. So a parse tree could be a good output form for parsing. However, it is not practical to output parse trees in two-dimensional form. How about representing them in one-dimensional form, i.e., as the sequence of production rules applied? There is a problem with this approach: in general there can be more than one sequence of production rules that generates the same string x. This is true even when the grammar is unambiguous. Recall that a string x is in L(G) if x can be derived by applying a sequence of production rules (one rule at a time) in G. Suppose that a sentential form (i.e., a string of terminals and nonterminals) containing more than one nonterminal symbol appears in the middle of such a sequence. The final result does not depend on which of its nonterminal symbols is chosen to derive the next sentential form. We should choose one from such multiple sequences of production rules, and the chosen sequence should be uniquely identifiable and effective to work with. The two derivation disciplines defined below are uniquely identifiable.

Leftmost (rightmost) derivation: A string is derived by iteratively applying a production rule to the leftmost (rightmost) nonterminal symbol of the current sentential form.

For the following grammar G, the leftmost and rightmost derivations are as shown below.

G: S → ABC, A → aa, B → a, C → cC | c

Leftmost derivation: S ⇒ ABC ⇒ aaBC ⇒ aaaC ⇒ aaacC ⇒ aaacc
Rightmost derivation: S ⇒ ABC ⇒ ABcC ⇒ ABcc ⇒ Aacc ⇒ aaacc

[Parse tree for aaacc: the root S has children A, B, C; A yields aa, B yields a, and C yields cC, whose child C yields c.]

Notice that the sequence of productions applied under the leftmost derivation rule corresponds to a top-down left-to-right traversal of the parse tree, and the reverse of the sequence applied under the rightmost derivation rule corresponds to a bottom-up left-to-right traversal of the parse tree.
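As a small sketch (ours; the rule numbering 1–5 is hypothetical, assigned here for illustration), the leftmost derivation above can be replayed mechanically by always rewriting the leftmost nonterminal:

```python
# Replay a sequence of productions as a leftmost derivation for the
# grammar G above. Nonterminals are uppercase, terminals lowercase.

RULES = {1: ("S", "ABC"), 2: ("A", "aa"), 3: ("B", "a"),
         4: ("C", "cC"), 5: ("C", "c")}

def leftmost_derive(rule_seq):
    form = "S"
    steps = [form]
    for r in rule_seq:
        lhs, rhs = RULES[r]
        # locate the leftmost nonterminal; it must match the rule's left side
        i = next(i for i, ch in enumerate(form) if ch.isupper())
        assert form[i] == lhs, f"rule {r} does not apply to {form}"
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps

# prints: S => ABC => aaBC => aaaC => aaacC => aaacc
print(" => ".join(leftmost_derive([1, 2, 3, 4, 5])))
```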

The Basic Strategy for LL(k) Parsing

Now we investigate how we can use a DPDA for parsing. Consider the following CFG G, which generates the language { a^10 x | x = b^i or x = c^i, i > 0 }. (For convenience, when we refer to a rule of G, we shall use the rule number shown before the rule.)

G: (1) S → AB  (2) S → AC  (3) A → aaaaaaaaaa  (4) B → bB  (5) B → b  (6) C → cC  (7) C → c

We want to design a DPDA which, given a string x ∈ {a, b, c}* on the input tape, outputs a sequence of production rules that generates x, if x ∈ L(G). We assume that the machine has an output port, as shown in the figure below, and that the grammar is stored in memory as a lookup table. Let's first try a simple greedy strategy of generating a string in the stack that matches the string x appearing on the input tape. Since any string in L(G) must be generated from the start symbol S, the machine initially pushes S onto the stack, entering a working state q1, and examines the input to choose a proper production rule for S. Recall that the conventional PDA sees the stack top, which is S, and decides whether or not it will read the next input symbol.

[Figure: DPDA with input tape contents aaaaaaaaaabbb, an output port, the grammar G stored in memory, current state q1, and stack contents SZ0.]

Without reading the input, the machine has no information available for choosing rule (1) or (2) for S. So we let the machine read the input. Suppose that the symbol read is a, as shown in the figure below. This information does not help, because both rules (1) and (2) generate the same leading a's (actually 10 a's). The b's located after the a's in the input string indicate that the first production rule to apply to generate the input string is rule (1), i.e., S → AB. Using the conventional DPDA, it is impossible to correctly choose this production rule.

[Figure: the same DPDA, its read head on the first a of input aaaaaaaaaabbb, state q1, stack contents SZ0.]

To overcome this problem we equip the DPDA with the capability of looking ahead in the input string by some constant k cells. For the current grammar the look-ahead length k should be at least 11, because the first symbol b appears 11 cells away from the current input position. (Notice that the count includes the current cell under the read head.) This symbol b is the nearest information in the input string that helps in choosing the correct production rule for S, rule (1) in this example.
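A quick check (ours, not from the notes) confirms the required look-ahead length: the shortest strings derivable via rules (1) and (2) first differ at position 10, i.e., the 11th cell counting the one under the read head.

```python
# Why k must be at least 11 for grammar G: the shortest yields of
# S -> AB and S -> AC agree on the first 10 symbols and differ at the 11th.

s1 = "a" * 10 + "b"   # shortest yield of S -> AB
s2 = "a" * 10 + "c"   # shortest yield of S -> AC
first_diff = next(i for i, (x, y) in enumerate(zip(s1, s2)) if x != y)
print(first_diff + 1)  # 11 cells must be visible to tell the rules apart
```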

Now, by looking 11 symbols ahead, the machine knows that the input string, if it is a string generated by grammar G, must be derived by applying production rule (1) first. So the machine replaces S at the stack top with the right-side string of rule (1) and outputs rule number (1), as shown in the figure below. (Notice that looking ahead does not involve any move of the read head.) Whenever a terminal symbol appears at the stack top, the machine reads the input symbol, compares it with the stack top, and pops it if they match. Otherwise, the input is rejected.

[Figure: (a) the DPDA after replacing S with AB on the stack, stack contents ABZ0, with rule (1) emitted at the output port; (b) the general machine configuration (q, α, β).]

For convenience, let (q, α, β) denote a configuration of the machine with current state q (including the stored grammar G), the input portion α yet to be read, and current stack contents β. From now on we shall use this triple for the machine configuration instead of a diagram.

The initial configuration (q0, aaaaaaaaaabbb, Z0) is routinely changed to the ready configuration (q1, aaaaaaaaaabbb, SZ0). Based on the information looked ahead 11 positions, this configuration is changed by applying rule (1) as shown below. Then, seeing A at the stack top, the machine replaces A with the right side of rule (3). For this operation the machine does not need to look ahead, because there is only one production rule for A. Now, the first 10 a's of the input can be successfully matched one by one with the 10 a's appearing at the stack top as follows. (The rule numbers attached to the arrows indicate the production applied.)

(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0) ⊢(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0) ⊢ .... ⊢ (q1, abbb, aBZ0) ⊢ (q1, bbb, BZ0)

Now symbol B appears at the stack top. Which of the production rules B → bB | b should be applied to generate the next input symbol b? Since there is more than one b, the next input b must be generated by rule B → bB. To see whether there is more than one b, the machine needs to look ahead 2 cells. Thus, the machine applies rule B → bB whenever it sees two b's ahead, and applies rule B → b when it sees one b. This way the machine successfully parses the remaining input, as the following slide shows. The last configuration (q1, ε, Z0), with only Z0 remaining on the stack and no input left to parse, implies that the parsing has been completed successfully. The output is the sequence of production rules applied whenever a nonterminal symbol appeared at the stack top.

(q1, bbb, BZ0) ⊢(4) (q1, bbb, bBZ0) ⊢ (q1, bb, BZ0) ⊢(4) (q1, bb, bBZ0) ⊢ (q1, b, BZ0) ⊢(5) (q1, b, bZ0) ⊢ (q1, ε, Z0)

The sequence of productions applied by this machine is shown below, and it follows exactly the order of the leftmost derivation.

S ⇒(1) AB ⇒(3) aaaaaaaaaaB ⇒(4) aaaaaaaaaabB ⇒(4) aaaaaaaaaabbB ⇒(5) aaaaaaaaaabbb

We can easily see that the machine, given a string x on the input tape, correctly generates the sequence of production rules in the order applied for the leftmost derivation of x if and only if x is in L(G). This machine parses the input string reading left-to-right, looking ahead at most 11 cells, and generates the sequence of production rules applied according to the leftmost derivation. We call this machine an LL(11) parser. Conventionally, an LL(k) parser is represented by a table that shows, depending on the nonterminal symbol appearing at the stack top and the look-ahead contents, which production rule should be applied. Reading input symbols to match stack-top terminal symbols, and the popping operations, are usually omitted for convenience.

The parse table for the above example is shown below, where blank entries are for the undefined cases (i.e., the input is rejected), and x in the look-ahead contents is a don't-care (wild card) symbol. Each entry gives the right side with which the stack-top nonterminal is replaced.

Stack top | aaaaaaaaaab | aaaaaaaaaac | bbxxxxxxxxx | b | ccxxxxxxxxx | c | (no look-ahead)
S         | AB          | AC          |             |   |             |   |
A         |             |             |             |   |             |   | aaaaaaaaaa
B         |             |             | bB          | b |             |   |
C         |             |             |             |   | cC          | c |
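The table and the stack machine together fit in a short table-driven loop. Below is a sketch (ours; the encoding of the table as prefix/production pairs and the rule numbers are our own choices): each nonterminal maps to a list of (required look-ahead prefix, production) pairs, tried longest prefix first, with "" meaning no look-ahead is needed.

```python
# Table-driven sketch of the LL(11) parser for grammar G above.
# Nonterminals are uppercase, terminals lowercase; Z0 marks the stack bottom.

K = 11
TABLE = {
    "S": [("aaaaaaaaaab", (1, "AB")), ("aaaaaaaaaac", (2, "AC"))],
    "A": [("", (3, "aaaaaaaaaa"))],            # only one rule: no look-ahead
    "B": [("bb", (4, "bB")), ("b", (5, "b"))],  # longest prefix listed first
    "C": [("cc", (6, "cC")), ("c", (7, "c"))],
}

def ll_parse(w: str):
    stack, pos, output = ["Z0", "S"], 0, []
    while stack[-1] != "Z0":
        top = stack.pop()
        if top.islower():                       # terminal: match and advance
            if pos >= len(w) or w[pos] != top:
                return None
            pos += 1
            continue
        lookahead = w[pos:pos + K]
        for prefix, (num, rhs) in TABLE[top]:
            if lookahead.startswith(prefix):
                output.append(num)
                stack.extend(reversed(rhs))     # push right side, leftmost on top
                break
        else:
            return None                         # blank table entry: reject
    return output if pos == len(w) else None

print(ll_parse("aaaaaaaaaa" + "bbb"))           # [1, 3, 4, 4, 5]
```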

Example 1. Construct an LL(k) parser for the following CFG with minimum k.

(1) S → aSb  (2) S → aabbb

This grammar generates the language { a^i aabbb b^i | i ≥ 0 }. Consider the string aaaaabbbbbb and its leftmost derivation:

S ⇒(1) aSb ⇒(1) aaSbb ⇒(1) aaaSbbb ⇒(2) aaaaabbbbbb

Notice that the aabbb at the center of this string is generated by rule S → aabbb. If we let our parser look ahead 3 cells, it can select the correct production rule that generates the next input symbol as follows. If it sees aaa, then the first a in this look-ahead contents must have been generated by rule S → aSb. If it sees aab, then this string aab, together with the two succeeding b's, must have been generated by production rule S → aabbb. Based on this observation, our LL(3) parser parses the string aaaaabbbbbb as follows. First the parser gets ready by pushing S onto the stack.

(q0, aaaaabbbbbb, Z0) ⊢ (q1, aaaaabbbbbb, SZ0)

Our parser, looking ahead aaa, applies rule (1) S → aSb and then, seeing a at the stack top, pops it while reading a from the input tape. Thus, the configuration changes as follows.

(q1, aaaaabbbbbb, SZ0) ⊢(1) (q1, aaaaabbbbbb, aSbZ0) ⊢ (q1, aaaabbbbbb, SbZ0)

Again, looking ahead aaa, the parser applies rule (1) S → aSb two more times as follows.

(q1, aaaabbbbbb, SbZ0) ⊢(1) (q1, aaaabbbbbb, aSbbZ0) ⊢ (q1, aaabbbbbb, SbbZ0) ⊢(1) (q1, aaabbbbbb, aSbbbZ0) ⊢ (q1, aabbbbbb, SbbbZ0)

Now, our parser looks ahead and sees aab, applies rule (2) S → aabbb, and then matches the remaining input symbols with the ones appearing at the stack top as follows.

(q1, aabbbbbb, SbbbZ0) ⊢(2) (q1, aabbbbbb, aabbbbbbZ0) ⊢ .... ⊢ (q1, ε, Z0)

The sequence of productions applied, (1)(1)(1)(2), is exactly the one applied in the leftmost derivation deriving aaaaabbbbbb. In fact, the parser derived the string in the stack according to the leftmost derivation rule. Clearly, this parser operates according to the following parsing table.

Stack top | aaa | aab
S         | aSb | aabbb
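The driver sketched earlier, instantiated with this two-entry LL(3) table (the encoding is again ours), reproduces the trace above:

```python
# Example 1's parser: the same table-driven loop with k = 3.

K = 3
TABLE = {"S": [("aaa", (1, "aSb")), ("aab", (2, "aabbb"))]}

def ll_parse(w):
    stack, pos, out = ["Z0", "S"], 0, []
    while stack[-1] != "Z0":
        top = stack.pop()
        if top.islower():                 # terminal: match and advance
            if pos >= len(w) or w[pos] != top:
                return None
            pos += 1
            continue
        for prefix, (num, rhs) in TABLE[top]:
            if w[pos:pos + K].startswith(prefix):
                out.append(num)
                stack.extend(reversed(rhs))
                break
        else:
            return None
    return out if pos == len(w) else None

print(ll_parse("aaaaabbbbbb"))   # [1, 1, 1, 2]
```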

Example 2. Construct an LL(k) parser with minimum k for the following grammar.

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

We will build an LL(2) parser by examining how it can parse the string ababaaaa by deriving it in the stack according to the following leftmost derivation.

S ⇒(1) abA ⇒(3) abSaa ⇒(1) ababAaa ⇒(3) ababSaaaa ⇒(2) ababaaaa

Following the routine initialization operation, we have S at the stack top as follows.

(q0, ababaaaa, Z0) ⊢ (q1, ababaaaa, SZ0)

By looking ahead two cells on the tape, the parser sees ab. Definitely the next rule applied in the leftmost derivation is rule (1), which is the only rule producing ab at the left end. So the parser applies rule (1), and the configuration changes as follows;

(q1, ababaaaa, SZ0) ⊢(1) (q1, ababaaaa, abAZ0) ⊢ .. ⊢ (q1, abaaaa, AZ0)

For convenience the grammar is repeated here.

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

If rule (4) were applied for A, the terminal symbol appearing in the next cell would be b, not a. So the next input symbol must be generated by S, i.e., rule (3) must be applied next to generate the input string. Thus the configuration is changed as follows.

(q1, abaaaa, AZ0) ⊢(3) (q1, abaaaa, SaaZ0)

Now, the parser looks ahead two cells and sees ab. Rule (1) must be applied to derive the input string: if rule (2) were applied, the two-cell look-ahead contents would be either ε (for the case of null remaining input) or aa (generated by rule (3)). Thus the parser applies rule (1) for the S on the stack top and changes its configuration as follows.

(q1, abaaaa, SaaZ0) ⊢(1) (q1, abaaaa, abAaaZ0) ⊢ .. ⊢ (q1, aaaa, AaaZ0)

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

Again, since the next input symbol is a, the next rule applied cannot be rule (4); it must be rule (3). Thus the configuration changes as follows.

(q1, aaaa, AaaZ0) ⊢(3) (q1, aaaa, SaaaaZ0)

Now the parser looks ahead and sees aa, which cannot be generated by rule (1), whose right side begins with ab; the aa must be the aa pushed onto the stack earlier by rule (3). Thus the parser applies rule (2) as follows and then matches the remaining input with the string in the stack.

(q1, aaaa, SaaaaZ0) ⊢(2) (q1, aaaa, aaaaZ0) ⊢ .... ⊢ (q1, ε, Z0)

The sequence of rules applied by the parser at the stack top is exactly the same as the sequence applied in the leftmost derivation deriving ababaaaa.

Clearly, this parser can parse the language with the following parsing table, where B denotes a blank (end of input) and x is a don't-care symbol.

Stack top | ab  | aa  | bx | BB
S         | abA | ε   |    | ε
A         | Saa | Saa | b  |
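The only new ingredient here is the ε-production. A sketch of the same driver (our encoding: S → ε pushes nothing, and the blank look-ahead column BB is encoded as an empty prefix that matches only at end of input):

```python
# Example 2's parser with k = 2 and an epsilon production.

K = 2
TABLE = {
    "S": [("ab", (1, "abA")), ("aa", (2, "")), ("", (2, ""))],
    "A": [("ab", (3, "Saa")), ("aa", (3, "Saa")), ("b", (4, "b"))],
}

def applies(prefix: str, la: str) -> bool:
    # empty prefix encodes the blank (end-of-input) look-ahead column
    return la == "" if prefix == "" else la.startswith(prefix)

def ll_parse(w):
    stack, pos, out = ["Z0", "S"], 0, []
    while stack[-1] != "Z0":
        top = stack.pop()
        if top.islower():
            if pos >= len(w) or w[pos] != top:
                return None
            pos += 1
            continue
        la = w[pos:pos + K]
        for prefix, (num, rhs) in TABLE[top]:
            if applies(prefix, la):
                out.append(num)
                stack.extend(reversed(rhs))   # rhs == "" pushes nothing
                break
        else:
            return None
    return out if pos == len(w) else None

print(ll_parse("ababaaaa"))   # [1, 3, 1, 3, 2]
print(ll_parse(""))           # [2]: the empty string is in the language
```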

Example 3. The grammar below is not an LL(k) grammar for any fixed integer k.

S → A | B
A → aA | 0
B → aB | 1

Notice that the language of this grammar is { a^n x | n ≥ 0, x ∈ {0, 1} }. The strings in this language can have an arbitrarily large number of a's followed by either 0 or 1, depending on whether the string is generated by rule S → A or S → B, respectively. With a finite look-ahead range k it is impossible to see the crucial indicator (0 or 1) that is needed to decide which production rule must be applied to generate the input string. For this grammar, there is no LL(k) parser for any finite k.

[Figure: input tape a a a a .... a a 0, with the indicator 0 lying beyond any fixed look-ahead window; state q1, stack contents SZ0.]

It is easy to see that for the following grammar, which generates the same language, we can construct an LL(1) parser.

S → aS | D
D → 0 | 1
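For the refactored grammar, one symbol of look-ahead suffices: a selects S → aS, while 0 or 1 selects S → D. A sketch (ours, written as a recursive-descent parser rather than the stack machine; names are hypothetical):

```python
# LL(1) recursive-descent recognizer for S -> aS | D, D -> 0 | 1.

def parse(w: str) -> bool:
    pos = 0

    def peek():
        return w[pos] if pos < len(w) else None

    def parse_S():
        nonlocal pos
        if peek() == "a":        # S -> aS
            pos += 1
            return parse_S()
        return parse_D()         # S -> D

    def parse_D():
        nonlocal pos
        if peek() in ("0", "1"): # D -> 0 | 1
            pos += 1
            return True
        return False

    return parse_S() and pos == len(w)

assert parse("aaaa0") and parse("1") and not parse("aaa")
```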

Formal Definition of LL(k) Grammars

Notation: Let (k)ω denote the prefix of length k of string ω; if |ω| < k, then (k)ω = ω. For example, (2)ababaa = ab and (3)ab = ab.

Definition (LL(k) grammar). Let G = (V_T, V_N, P, S) be a CFG. Grammar G is an LL(k) grammar, for some fixed integer k, if it has the following property: for any two leftmost derivations

S ⇒* ωAα ⇒ ωβα ⇒* ωy and S ⇒* ωAα ⇒ ωγα ⇒* ωx,

where α, β, γ ∈ (V_T ∪ V_N)* and ω, x, y ∈ V_T*, if (k)x = (k)y, then β = γ.

If a CFG G has this property, then for every x ∈ L(G) we can decide the sequence of leftmost derivation steps which generates x by scanning x left to right, looking ahead at most k symbols. (If you are interested in the proof of this claim, see The Theory of Computation by D. Wood, or a book on compiler construction.)
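A small illustration of the definition (ours, not from the notes): for Example 3's grammar, S ⇒ A ⇒* a^k 0 and S ⇒ B ⇒* a^k 1 share the same k-symbol prefix for every k, yet the first derivation steps use different right sides (β = A vs. γ = B), so the grammar violates the LL(k) condition for every k.

```python
# Checking the LL(k) condition on Example 3's grammar for a few values of k.

def k_prefix(k: int, w: str) -> str:
    """(k)ω: the prefix of ω of length k, or ω itself if |ω| < k."""
    return w[:k]

for k in (1, 3, 10):
    x = "a" * k + "0"   # derived via S => A => ... (β = A)
    y = "a" * k + "1"   # derived via S => B => ... (γ = B)
    assert k_prefix(k, x) == k_prefix(k, y)   # look-aheads agree...
    # ...but β ≠ γ, so the LL(k) condition fails for this k
print("LL(k) condition fails for k = 1, 3, 10 (and, by the same pattern, all k)")
```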