CS2 Language Processing note 12
Automatic generation of parsers

In this note we describe how one may automatically generate a parse table from a given grammar (assuming the grammar is LL(1)). This and similar methods can be used in practice to build parser generators: programs which take a grammar supplied by the user and automatically write the code for a parser for the corresponding language. We end the note with a brief discussion of lexer and parser generators.

As we have seen, for small examples of grammars it is often possible to work out a parse table by hand simply by inspection of the grammar. For a large grammar, however, this is no longer feasible, and we require a more systematic method for computing the parse table. The method we shall describe here also has the virtue that it can be carried out completely automatically, and hence one can write a program that computes parse tables for us.

First and Follow sets

To generate a parse table from a given grammar, we need some preliminary data about which tokens a parser might meet when considering any particular nonterminal. This takes the form of two sets for each nonterminal A, known as First(A) and Follow(A). One can think of these as follows:

Suppose at some point during parsing, we are expecting a phrase corresponding to some nonterminal A, and we see a terminal a as the next symbol in the input. Assuming the sentence we are parsing is a valid one, there are two possible reasons for the appearance of this a: either it is the first symbol of the A-phrase we are expecting, or this A-phrase is in fact the empty string and the a we see is the start of whatever comes after it. We are therefore interested in both what an A can possibly start with and what an A can possibly be followed by in a complete sentence. More precisely:

First: For any nonterminal A, the set First(A) comprises all terminals that can appear at the start of some sentential form derived from A. In addition, if A derives the empty string, then First(A) includes the symbol ε.

Follow: For any nonterminal A, the set Follow(A) is made up of all the terminals that can appear immediately after A in any sentential form derived from some nonterminal. In addition, Follow(A) contains the token $ if A can appear at the end of some sentential form derived from the start symbol for the grammar.

For those who like a more mathematical notation:

First(A) = { a | ∃u. A ⇒* au } ∪ { ε | A ⇒* ε }
Follow(A) = { a | ∃B, u, v. B ⇒* uAav } ∪ { $ | ∃u. S ⇒* uA }

Computing First sets

We now give a systematic method for computing First and Follow sets. We will illustrate our method by applying it to the following grammar G. This describes a language of commands such as one might type to a shell, where a command name like cp or ls is followed by some options and some file names.

shell → command args
args → opts files
opts → option opts
opts → ε
files → file files
files → ε

As a first stage, we construct the set E of potentially empty nonterminals, i.e. those nonterminals A such that A ⇒* ε. The general method is as follows:

1. Start by setting E to be the set of nonterminals A such that the grammar contains a production A → ε.
2. For every production A → t in the grammar, if t consists entirely of nonterminals in E, add A itself to E.
3. Repeat step 2 until E stabilizes.

Applying this method to the grammar G, we see at step 1 that opts, files ∈ E. At step 2 we then note that since we have a production args → opts files, we must also have args ∈ E. Since further applications of step 2 do not yield any new elements of E, we conclude that E = {opts, files, args}.

We now give the algorithm for computing First sets. In fact, it will be convenient to compute a set First(x) for every symbol x, terminal or nonterminal.

1. Start by setting First(a) = {a} for each terminal a; First(A) = {ε} for each nonterminal A ∈ E; and First(A) = ∅ for each nonterminal A ∉ E.
2. For each production A → x1 x2 … xn (where the xi may be terminals or nonterminals), and for each i (1 ≤ i ≤ n) such that xj ∈ E whenever 1 ≤ j ≤ i−1, add every terminal a ∈ First(xi) to First(A).
3. Repeat step 2 until all the sets First(A) stabilize.
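Both of these computations translate almost line for line into code. The following Python sketch is one possible rendering, not part of the original note: the grammar is represented as a dictionary from each nonterminal to its list of productions, and the names G, nullable and first_sets are our own choices.

```python
# A sketch (not from the note) of the E-set and First-set algorithms.
# A production is a list of symbols, with [] standing for the empty
# string ε; a symbol is a terminal exactly when it is not a dict key.

G = {
    'shell': [['command', 'args']],
    'args':  [['opts', 'files']],
    'opts':  [['option', 'opts'], []],
    'files': [['file', 'files'], []],
}

def nullable(grammar):
    """E: the set of nonterminals that can derive the empty string."""
    E = set()
    changed = True
    while changed:                      # repeat until E stabilizes
        changed = False
        for A, prods in grammar.items():
            for t in prods:
                # Add A if some production consists entirely of members
                # of E (an empty production qualifies vacuously).
                if A not in E and all(x in E for x in t):
                    E.add(A)
                    changed = True
    return E

def first_sets(grammar):
    """First(A) for every nonterminal A, computed as a fixed point."""
    E = nullable(grammar)
    first = {A: ({'ε'} if A in E else set()) for A in grammar}
    changed = True
    while changed:                      # repeat until the sets stabilize
        changed = False
        for A, prods in grammar.items():
            for t in prods:
                for x in t:
                    # First(x) is {x} for a terminal; for a nonterminal
                    # we take the terminals found so far (ε was handled
                    # by the initialization).
                    fx = {x} if x not in grammar else first[x] - {'ε'}
                    if not fx <= first[A]:
                        first[A] |= fx
                        changed = True
                    if x not in E:      # later symbols cannot appear at
                        break           # the start, so stop here
    return first

print(nullable(G))            # {'args', 'opts', 'files'}
print(first_sets(G)['args'])  # {'option', 'file', 'ε'}
```

The same fixed-point shape, applying a monotone step until nothing changes, recurs in all the algorithms of this note.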

Let us apply this algorithm to our grammar G. We start by setting First(a) = {a} for each of the terminals a = command, option, file; First(A) = {ε} for A = args, opts, files; and First(shell) = ∅. For step 2, we need to consider in turn each production and each possible value of i for that production. For the production shell → command args, taking i = 1 gives us that command ∈ First(shell), and we need not consider i = 2 since command ∉ E. The production args → opts files yields nothing at the moment, since First(opts) and First(files) do not at present contain any terminals. The remaining productions, however, yield that option ∈ First(opts) and file ∈ First(files). This completes the first round of step 2. On the second round, we discover something new when we consider the rule args → opts files: by taking i = 1, 2 respectively, we find that option, file ∈ First(args). The First sets are now complete, since nothing new is added on the third round.

Note that we can also sensibly define the First set not just of a single symbol but of an arbitrary sentential form t = x1 x2 … xn (n ≥ 0): just replace A by t in the definition given above. Once we have all the sets First(A), for any t we may easily compute First(t) as follows:

1. Whenever we have a ∈ First(xi) where 1 ≤ i ≤ n, x1, …, xi−1 ∈ E and a ≠ ε, put a ∈ First(t).
2. If x1, …, xn ∈ E, put ε ∈ First(t).

We will need certain such sets First(t) in order to construct the parse table.

Computing Follow sets

Follow sets are also calculated progressively; the algorithm makes use of First sets.

1. Begin with Follow(S) = {$} for the start symbol S, and Follow(A) = ∅ for all other nonterminals.
2. For each rule that can be presented as A → t B x1 … xn with n ≥ 1 (where t is a possibly empty string of symbols), and each i such that x1, …, xi−1 ∈ E, add all of First(xi), except ε, to Follow(B).
3. For each rule A → t B, or A → t B x1 … xn with x1, …, xn ∈ E, add all of Follow(A) to Follow(B).
4. Repeat step 3 until all Follow sets stabilize.

This is generally rather trickier to carry out than the calculation of First sets. Let us see how it works for our example. In step 1 we set Follow(shell) = {$} and Follow(A) = ∅ for A = args, opts, files. Step 2 only yields anything in the case of the production args → opts files: here we must add all of First(files) except ε to Follow(opts); thus file ∈ Follow(opts). Moving on to step 3, we discover on the first round that $ ∈ Follow(args), and on the second round that $ ∈ Follow(opts) and $ ∈ Follow(files).
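As with First sets, the calculation is mechanical once written down. Continuing the Python sketch above (follow_sets is again our own name; the code reuses nullable and first_sets):

```python
def follow_sets(grammar, start):
    """Follow(A) for every nonterminal A; '$' marks end of input."""
    E = nullable(grammar)
    first = first_sets(grammar)
    follow = {A: set() for A in grammar}
    follow[start].add('$')              # step 1
    changed = True
    while changed:                      # repeat until the sets stabilize
        changed = False
        for A, prods in grammar.items():
            for t in prods:
                for i, B in enumerate(t):
                    if B not in grammar:   # Follow is defined for
                        continue           # nonterminals only
                    rest_nullable = True
                    # Step 2: whatever can start the rest of the
                    # production (skipping nullable symbols) follows B.
                    for x in t[i + 1:]:
                        fx = {x} if x not in grammar else first[x] - {'ε'}
                        if not fx <= follow[B]:
                            follow[B] |= fx
                            changed = True
                        if x not in E:
                            rest_nullable = False
                            break
                    # Step 3: if the rest of the production can vanish,
                    # whatever follows A also follows B.
                    if rest_nullable and not follow[A] <= follow[B]:
                        follow[B] |= follow[A]
                        changed = True
    return follow

print(follow_sets(G, 'shell'))
# {'shell': {'$'}, 'args': {'$'}, 'opts': {'file', '$'}, 'files': {'$'}}
```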

To summarize, the contents of the sets First(A) and Follow(A) for the grammar G are as given by the following table:

G        First             Follow
shell    command           $
args     option, file, ε   $
opts     option, ε         $, file
files    file, ε           $

Building parse tables

Given the First and Follow sets for a grammar, constructing a parse table is reasonably straightforward. Initially, every entry is empty; then for every production A → t in the grammar:

- for each terminal a in First(t), put A → t in the table at row A, column a;
- if ε ∈ First(t), then for each b ∈ Follow(A), enter A → t in row A, column b.

Applying this procedure to our example, we obtain the following parse table (each cell shows the right-hand side of the production to apply):

G        command        option        file          $
shell    command args
args                    opts files    opts files    opts files
opts                    option opts   ε             ε
files                                 file files    ε

For a given grammar, the fact that this method succeeds in constructing a parse table with at most one production in each cell provides the final confirmation that our grammar is indeed LL(1) and hence suitable for predictive parsing. If the grammar suffers from ambiguities, common prefixes or left recursion, this will typically show up in the fact that some cell contains two or more productions, so during parsing we would not know which of the competing productions to apply. (There are even some grammars not suffering from any of these three defects for which such clashes arise, though such grammars are unlikely to arise in practice.) Of course, if clashes do arise in the parse table, then where they arise can give us a useful hint on how we might debug the grammar.
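To round off the sketch, here is one way to build the table programmatically and detect such clashes. The names first_of and parse_table are ours, and raising an error on a clash is just one possible design choice; a parser generator might instead report all clashes and let the user pick.

```python
def first_of(t, grammar, first, E):
    """First set of an arbitrary sentential form t."""
    out = set()
    for x in t:
        out |= {x} if x not in grammar else first[x] - {'ε'}
        if x not in E:
            return out
    out.add('ε')                        # every symbol of t was nullable
    return out

def parse_table(grammar, start):
    """Build the LL(1) parse table as a dict keyed by (row, column)."""
    E = nullable(grammar)
    first = first_sets(grammar)
    follow = follow_sets(grammar, start)
    table = {}
    for A, prods in grammar.items():
        for t in prods:
            ft = first_of(t, grammar, first, E)
            # A -> t goes in column a for each terminal a in First(t),
            # and, if ε is in First(t), in column b for each b in Follow(A).
            cols = (ft - {'ε'}) | (follow[A] if 'ε' in ft else set())
            for a in cols:
                if (A, a) in table:     # two productions in one cell
                    raise ValueError(f'not LL(1): clash at ({A}, {a})')
                table[A, a] = t
    return table

for (A, a), t in sorted(parse_table(G, 'shell').items()):
    print(f'{A:6} {a:8} ->  {" ".join(t) or "ε"}')
```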

Automatic lexer and parser generators

When any programming activity becomes repetitive, programmers naturally try to get the machine to do it for them. So it is with writing lexers and parsers: the formal theories of state machines and grammars give us sufficient insight to write programs that will write our programs for us. The first lexer- and parser-writing programs were written around 1970, the best known being lex and yacc ("Yet Another Compiler Compiler"). Both of these generate C code and are still in use today. They have several descendants and variants; for example, the Linux systems here provide flex and bison. Other programming languages have equivalent tools: for instance, for Java we have the tools JFlex and Java CUP. More detailed information on these tools may be found at (respectively)

www.jflex.de
www.cs.princeton.edu/~appel/modern/java/cup

With tools like these, building a language processing system generally comprises three steps.

- Prepare a description of the tokens of the language. Give this to lex, or similar, which will write some code for a lexer.
- Write a grammar for the language. Feed this to yacc, or similar, which will write some code for a parser.
- Write some code to operate on a document once it has been read in.

These three pieces of code are then tied together and compiled to give the final program. Usually the first two contain some large tables, with small general engines to drive the lexing and parsing. They are often not very human-readable, but that is not their aim.

Exercises

Revisit some of the examples of grammars from earlier lecture notes, and compute the First and Follow sets, and hence the parse table, using the above method. Check these against the parse tables given in the notes.

It is also quite instructive to try this out on some of the examples of non-LL(1) grammars (e.g. ambiguous grammars), so that you can see where the clashes arise when we try to build the parse table.

Books

The following books may provide a useful supplement to the lecture notes for the second half of this thread (context-free grammars and parsing). They may be readily identified by the animals depicted on the cover.

Dragon book: Aho, Sethi and Ullman, Compilers: Principles, Techniques and Tools.
Tiger book: Appel, Modern Compiler Implementation in { C, Java, ML }.
Turtle book: Aho and Ullman, Foundations of Computer Science.

John Longley 2002, Ian Stark 2001