Introduction; Parsing LL Grammars


CS 440: Programming Languages and Translators
Due Fri Feb 2, 11:59 pm
1/29 pp.1, 2; 2/7 all updates incorporated, solved

Instructions
You can work together in groups of 4. Submit your work on Blackboard.* Submit one copy. Include the names and A-IDs of everyone in the group on that copy (in the pdf, for example). Submit under the name of one person in the group (doesn't matter who).

Questions [100 points total]

1. [10 = 5+5 points] For each question below, a paragraph should be enough.
a. Exercise 1.3 (p.38)
b. Exercise 1.9 (p.39)

For Questions 2-4, your regular expressions can use some basic egrep notations. (Try man re_format on unix for help.) Some simple examples of what you can use:
[a-z_] ("a through z or underscore")
[0-9ab] ("any digit or the letters a or b")
[^xyz] ("any character except for x, y, or z")
x? ("x or nothing")
x+ ("one or more x's")
. (a period or dot means "any one character")
\. (backslash dot means literally a dot, as in the float "12\.34")
Don't use back references (such as "\3"); bounds (such as "{7}"); character classes (such as "[:cntrl:]" or "[[:<:]]"); or assertions (such as "\D"). (You won't need literals like \n (except for \.), and if you try things like \x{89abcdef}, we'll hunt you down :-)

2. [15 = 3*5 points] Translate each regular expression below into English. Don't just translate individual subexpressions; try to get at the essence of the expression. (E.g., "[1-9][0-9]" could be "a two-digit number without a leading zero".) [Hint: You can try an expression using egrep -e "expression" text_file, where each line of text_file has a candidate string to try to match. You may want to add "^" and "$" to the expression, in that case; again, see the man page.]
a. [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]
b. (19|20)[0-9][0-9]-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])
c. (0x)[1-9a-f][0-9a-f]*

* Using group submission is an experiment; let me know how it works.
CS 440: Programming Languages and Translators 1 James Sasaki, 2018
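
The hint in Question 2 suggests trying an expression with egrep against a file of candidate strings. The same experiment can be sketched in Python. This is only an illustration: it assumes Python's re module behaves like egrep for this simple notation (true for the constructs listed above, not in general), and the helper name and the candidate strings are made up for the example.

```python
import re

# Expression 2b, with the alternation bars written out explicitly.
DATE = re.compile(r"(19|20)[0-9][0-9]-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])")

def matches(pattern, candidates):
    """Return the candidates matched in full (like egrep with ^ and $ added)."""
    return [s for s in candidates if pattern.fullmatch(s)]

# Hypothetical candidate strings, one per "line of text_file".
candidates = ["2018-02-02", "1899-12-31", "2018-13-01", "2018-02-31", "2018-2-2"]
print(matches(DATE, candidates))
```

Trying a few near misses like these is a quick way to see what an expression really constrains: here the pattern checks the shape of each field but still lets 2018-02-31 through.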

3. [15 = 3*5 points] Give regular expressions that match each of the following kinds of (possibly empty) strings. There may be more than one answer; we just want one.
a. Strings that alternate between vowels (a, e, i, o, u) and consonants (not vowels) and can start with either a vowel or a consonant.
b. Strings of a's and b's where the number of b's is divisible by 2 or 3.
c. Strings of lowercase letters that don't include abc. (Don't forget to include strings like aaa or xab.)

4. [12 points] Give a regular expression for numbers in the following made-up format:
Integers are sequences of digits; leading zeros are allowed.
Floats include a dot with digits before and/or after the dot.
In addition, you may include a base as a leading b#, o#, d#, or x# (binary, octal, decimal, or hex).
You may also have a leading + or - before the base (or the integer, if there's no base).
In addition, you may include an exponent after the number, of the form e integer where integer is as described above. If specified, the base for the exponent doesn't have to match the base of the number.
A single space can be included between each group of one or more digits, or after the base #, or before the e exponent, but no space is allowed between a leading sign and base or between a base and #.
The letters (b, o, etc.) can be in upper case.
If you like, you can define parts of the expressions as grammar rules (like number → integer | float etc.)

Some random examples of numbers (with spaces as underscores to make them more visible):
-b#_1._e-b#10 equals binary -1.0 / 2² = binary -0.01
1.0eb#10 equals binary 1.0 2 10 = 2 10 cast as a float
+3.e+1 equals 30.0
3e1 equals 30
o#072_031 equals 72031₈
But not b#_3 (because of the 3) or 12__34 (two spaces between 12 and 34) or -_56 (space after -).

5. [18 points] Here's a state transition table for an NFA that accepts the 3-character string abc.
To (I hope) make things clearer, I've mostly given states names that are regular expressions describing the input that takes us to that state. The cells that are empty actually contain err. (I omitted them to make the non-err parts more visible.)

State        ε        a      b      c
Start        ε
(Seen) ε              a
(Seen) a                     ab
(Seen) ab                           abc
(Seen) abc   Accept
Accept       err      err    err    err
err          err      err    err    err

Accept is underlined to indicate that it's (the only) accepting state. Note that once you get to the error state err, you stay there forever.

Now imagine gluing together four NFAs for abc, acc, bbc, and bca, merging their Start, Accept, and err states respectively, and ending up with an NFA with 3 + 4*4 = 19 states. For this problem, convert this NFA to a DFA; the most straightforward way to do this is to use the algorithm in the text. You'll need to use some different terminology to name the states. (Number them? More complicated regular expressions?) You can, but don't have to, give a DFA with the minimum number of states (I believe it's 6 states). Present the DFA using a transition table.

6. [20 = 4*5 points] (Modified Exercise 2.14, p.108) Consider the language consisting of all strings of properly-balanced parentheses and brackets. (I.e., "(", ")", "[", and "]".)
a. Give an LL(1) grammar for this language. Surround each terminal parenthesis or bracket by double quotes to emphasize that they are terminal symbols.
b. Give the corresponding LL(1) parsing table.
c. Show the parse tree for ([]([]))[]. If you like, you may present the tree using an outline form: List the nodes in preorder with the children for each node indented one more level than their parent. E.g., a tree with root X, children Y and Z, with Y having children A and B, and Z having children C and D would be presented as
X
. Y
.. A
.. B
. Z
.. C
.. D
d. Give a trace of the parser actions as it constructs the parse tree.

7. [10 = 5+5 points] (Modified Problem 2.26) Consider the grammar below. The start symbol is S, the other nonterminals are E, T, TL, F, and FL, and the terminal symbols are v and anything double-quoted.
S → E "$$"
E → v ":=" E | T TL
TL → "+" T TL | ε
T → F FL
FL → "*" F FL | ε
F → "(" E ")" | v
a. For each rule A → α above, give the FIRST(α), FOLLOW(A), EPS(α), and PREDICT(A → α) sets. Omit duplicates (there's no reason to show EPS(ε) more than once, for example).
b. What tells us that this grammar is not LL(1)?
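
For Question 5, the NFA-to-DFA conversion is usually done with the subset construction: each DFA state is the ε-closure of a set of NFA states. The following is a minimal sketch of that idea, not necessarily the exact algorithm as the text presents it. The state encoding ((word, i) for "seen the first i characters of word"), the empty string as the ε label, and all function names are assumptions made for this example; the empty set of NFA states stands in for err.

```python
from itertools import product

EPS = ""  # label used in this sketch for ε-moves

def build_nfa(words):
    """Glue one chain of states per word between a shared Start and Accept."""
    delta = {}  # (state, symbol) -> set of next states
    def add(src, sym, dst):
        delta.setdefault((src, sym), set()).add(dst)
    for w in words:
        prev = "Start"
        for i in range(len(w) + 1):
            state = (w, i)  # "seen the first i characters of w"
            add(prev, EPS if i == 0 else w[i - 1], state)
            prev = state
        add(prev, EPS, "Accept")
    return delta

def closure(delta, states):
    """ε-closure: everything reachable through ε-moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in delta.get((stack.pop(), EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)

def to_dfa(delta, alphabet):
    """Subset construction; the empty frozenset plays the role of err."""
    start = closure(delta, {"Start"})
    table, todo = {}, [start]
    while todo:
        cur = todo.pop()
        if cur in table:
            continue
        table[cur] = {}
        for sym in alphabet:
            moved = set().union(*(delta.get((s, sym), set()) for s in cur))
            table[cur][sym] = closure(delta, moved)
            todo.append(table[cur][sym])
    return start, table

def accepts(start, table, text):
    """Run the DFA; a DFA state accepts if it contains the NFA's Accept."""
    cur = start
    for ch in text:
        cur = table[cur][ch]
    return "Accept" in cur

words = ["abc", "acc", "bbc", "bca"]
start, table = to_dfa(build_nfa(words), "abc")
accepted = ["".join(p) for n in range(5) for p in product("abc", repeat=n)
            if accepts(start, table, "".join(p))]
print(sorted(accepted))
```

The result is not minimized, so it has more states than the smallest possible answer; merging states that behave identically is a separate step.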

Solution to Homework 1

1. (Compilation; Correctness)
a. Exercise 1.3, p.38 (Compilation vs interpretation) Some possible answers: Compilation can catch errors earlier; compiled code usually executes faster. An interpreter may take less time to rerun a program that's had a small change made to it (a compiler has to recompile and relink the whole program); an interpreter may produce better error messages; for language development, writing an interpreter can be faster than writing a compiler.
b. Exercise 1.9, p.39 (Program correctness) There are two parts to correctness: the specification and meeting the specification. Specifications can be vague, wrong, or not cover all possible inputs. For correctness, testing only reveals lack of bugs under the tested inputs; untested inputs may still encounter bugs, plus, determining what inputs to test on is hard. For complex software, it's hard to figure out what environments a program might run in (plus test in all of them). Blind spots can include things you know you don't know (like exact user behavior) and things you don't know you don't know (like unexpected user behavior).

2. (Translate reg expressions to English). There can be alternative answers.
2a. [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]
Three (natural) numbers separated by dashes; the first number has 4 digits and the other two have 2 digits.
2b. (19|20)[0-9][0-9]-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])
Dates of the form year-month-day where the years are 1900-2099, months are 01-12, and days are 01-31.
2c. (0x)[1-9a-f][0-9a-f]*
Hex natural numbers: the tag 0x followed by one or more lower-case hex digits, with no leading zero.

3. (Regular expressions)
3a. Alternating vowels and consonants: [aeiou]?([^aeiou][aeiou])*[^aeiou]?
Note this allows the empty string.
3b. a's and b's where the number of b's is a multiple of 2 or 3: a*(ba*ba*)*a*|a*(ba*ba*ba*)*a*
(You should include the empty string.)
3c. Strings without substring abc: ([^a]|a(a|ba)*([^ab]|b[^ac]))*(a(a|ba)*b?)?
After an a, any further a's or ba's still leave us one step into a possible abc, so the expression has to keep track of them. A shorter form like ([^a]|a[^b]|ab[^c])*(a|ab)? wrongly accepts aabc, because a[^b] can swallow the second a.

4. (Integer and Float Numbers in bases 2, 8, 10, 16). First, let's break down the problem. The basic idea: We need
sign? base? number exponent?
Including spaces gives us
sign? base?\_? number_with_spaces exponent_with_spaces?
"\_" means backslash space, which means an actual space. To avoid having a number lead with spaces, I'm putting them between the base and number, hence \_?. This is the first time I've offered this problem, so if my solution has bugs, let me know.

Sign and base are straightforward: sign? is [+-]?, and base? could be ([bBoOdDxX]#\_?)?, except that the legal digits in the following number depend on the base, so we'll need to break the bases up into cases: base? can be ([bB]#\_?)?, ([oO]#\_?)?, and so on. The exponent part is also straightforward: \_?[eE]integer, with integer as below.

The number part is the hard one, of course. It's either an integer or a float. If the number is an integer (natural number, technically), then we can use an expression like digit+(\_digit+)*, which is a sequence that alternates between runs of digits and one space, beginning and ending with digit(s). (Remember that the Kleene + means "one or more of".) It's \_, not \_+, because we're only allowed one space between runs of digits. There are alternatives like digit ((digit \_)*digit)? that are perfectly fine too.

For a float, the thing to avoid is something like digit*\.digit*, which makes digits optional before or after the dot (which is good) but doesn't insist on having at least one digit somewhere (which is bad). We can follow a (non-empty) integer with a dot and (optionally) more digits and spaces ending with a digit (we don't want trailing spaces):
digit+(\_digit+)*(\.((digit+\_)*digit+)?)?     // integer (dot integer?)?
Or, we can begin with a dot and follow with digits and spaces and end with digit(s):
\.(digit+\_)*digit+
(Note we don't allow a space before and/or after the dot; maybe that's a bug in the specification.)

Below, I'm using name → expression to give names to expressions to make things more readable (I hope). It's fine if you used a different symbol such as ::=. I took off the _with_spaces suffix and went with just number and exponent. The full expansion is pretty horrendous, so I'm skipping it. (Hope you did too.)

value        → sign? base_and_nbr exponent?
sign         → [+-]
base_and_nbr → (base2? nbr2 | base8? nbr8 | base10? nbr10 | base16? nbr16)
base2        → [bB]#\_?
base8        → [oO]#\_?
base10       → [dD]#\_?
base16       → [xX]#\_?
nbr2         → [01]+(\_[01]+)*(\.(([01]+\_)*[01]+)?)? | \.([01]+\_)*[01]+
nbr8         → [0-7]+(\_[0-7]+)*(\.(([0-7]+\_)*[0-7]+)?)? | \.([0-7]+\_)*[0-7]+
nbr10        → [0-9]+(\_[0-9]+)*(\.(([0-9]+\_)*[0-9]+)?)? | \.([0-9]+\_)*[0-9]+
nbr16        → [0-9a-fA-F]+(\_[0-9a-fA-F]+)*(\.(([0-9a-fA-F]+\_)*[0-9a-fA-F]+)?)? | \.([0-9a-fA-F]+\_)*[0-9a-fA-F]+
exponent     → \_?[eE] sign? base_and_nbr
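
Since the full expansion is skipped above, one way to try the named pieces out is to let a program do the expansion. The sketch below assembles them with Python string formatting and checks the sample strings from Question 4. It rests on a few assumptions: Python's re syntax stands in for egrep, a literal space replaces "\_", non-capturing groups (?:...) replace plain parentheses, and the helper and variable names are illustrative only.

```python
import re

def nbr(d):
    """Number over digit class d: integer with optional fraction, or leading-dot float."""
    runs = f"{d}+(?: {d}+)*"   # runs of digits separated by single spaces
    frac = f"(?:{d}+ )*{d}+"   # same shape, used after the dot
    return f"(?:{runs}(?:\\.(?:{frac})?)?|\\.{frac})"

# Base tag letters mapped to the digits that base allows.
digits = {"bB": "[01]", "oO": "[0-7]", "dD": "[0-9]", "xX": "[0-9a-fA-F]"}

# base_and_nbr: for each base, an optional "tag# " prefix followed by its number.
base_and_nbr = "(?:" + "|".join(
    f"(?:[{tag}]# ?)?{nbr(d)}" for tag, d in digits.items()) + ")"

sign = "[+-]"
exponent = f"(?: ?[eE]{sign}?{base_and_nbr})"
value = re.compile(f"{sign}?{base_and_nbr}{exponent}?")

good = ["-b# 1. e-b#10", "1.0eb#10", "+3.e+1", "3e1", "o#072 031"]
bad = ["b# 3", "12  34", "- 56"]
print([bool(value.fullmatch(s)) for s in good])
print([bool(value.fullmatch(s)) for s in bad])
```

Building the pattern from named parts keeps each piece small enough to read, which is the same reason the solution names its subexpressions.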

5. [18 points] (DFA that accepts abc, acc, bbc, and bca) Except for Start, Accept, and err, I named the states after the path you take to get there.

State      a        b          c
Start      a        b          err
a          err      ab|ac|bb   ab|ac|bb
ab|ac|bb   err      err        Accept
b          err      ab|ac|bb   bc
bc         Accept   err        err
Accept     err      err        err
err        err      err        err

[Not asked for: The DFA above is minimal. Rows with different (error vs. not-error) patterns can't be joined, and Accept and err aren't both accepting or non-accepting states, so they can't be joined either. If you have separate rows for ab, ac, and bb, you'll see they behave identically (accept on c, err otherwise). That's why they can be joined. So the minimal automaton has seven states (when I said six I forgot about the error state).]

6. [20 = 4*5 points] (Modified Exercise 2.14, p.108: Balanced parentheses and brackets)
6a. The grammar has four rules, given below. The rule Start → S $$ lets the parser check for end-of-input.

Rule #   Rule
1        Start → S $$
2        S → ( S ) S
3        S → [ S ] S
4        S → ε

6b. The parse table pairs the nonterminal at the top of the stack with the current input token and tells you which rule to apply to the nonterminal. err indicates a syntax error.

             Input Token
Stack Top    (      )      [      ]      $$
Start        1      err    1      err    1
S            2      4      3      4      4
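
The grammar in 6a and the table in 6b are enough to drive a parser directly. Here is a minimal sketch of a table-driven LL(1) parser for them. The dictionary layout, the function name, and the choice to return the list of predicted rule numbers are assumptions of this example; missing table entries play the role of err.

```python
# Rules from 6a: rule number -> (left-hand side, right-hand side symbols).
RULES = {
    1: ("Start", ["S", "$$"]),
    2: ("S", ["(", "S", ")", "S"]),
    3: ("S", ["[", "S", "]", "S"]),
    4: ("S", []),  # S -> ε
}

# Parse table from 6b: nonterminal -> {input token: rule number}.
TABLE = {
    "Start": {"(": 1, "[": 1, "$$": 1},
    "S": {"(": 2, ")": 4, "[": 3, "]": 4, "$$": 4},
}

def parse(text):
    """Return the rule numbers predicted, in order, or raise SyntaxError."""
    tokens = list(text) + ["$$"]
    stack = ["Start"]  # top of the stack is the end of the list
    pos, predicted = 0, []
    while stack:
        top = stack.pop()
        tok = tokens[pos]
        if top in TABLE:  # nonterminal on top: predict a rule
            rule = TABLE[top].get(tok)
            if rule is None:
                raise SyntaxError(f"unexpected {tok!r} at position {pos}")
            predicted.append(rule)
            # Push the right-hand side so its first symbol ends up on top.
            stack.extend(reversed(RULES[rule][1]))
        elif top == tok:  # terminal on top: match it against the input
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, saw {tok!r} at position {pos}")
    return predicted

print(parse("([]([]))[]"))
```

Each predicted rule corresponds to one interior node of the parse tree, listed in preorder, so the returned sequence can be checked against a hand-built tree or trace.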

6c. Parse tree for ([]([]))[]. The outline-format tree is to the left; the terminal string on the right shows where each terminal symbol appears in the input (as the head of the string).

Start
. S
.. (            ([]([]))[]
.. S
... [           []([]))[]
... S
.... ε
... ]           ]([]))[]
... S
.... (          ([]))[]
.... S
..... [         []))[]
..... S
...... ε
..... ]         ]))[]
..... S
...... ε
.... )          ))[]
.... S
..... ε
.. )            )[]
.. S
... [           []
... S
.... ε
... ]           ]
... S
.... ε
. $$

6d. Trace of parser actions:

Parser Stack           Input Stream             Action
Start                  ( [ ] ( [ ] ) ) [ ] $$   (Initialize parser)
S $$                   ( [ ] ( [ ] ) ) [ ] $$   (Predict) Rule 1: Start → S $$
( S ) S $$             ( [ ] ( [ ] ) ) [ ] $$   Rule 2: S → ( S ) S
S ) S $$               [ ] ( [ ] ) ) [ ] $$     Match (
[ S ] S ) S $$         [ ] ( [ ] ) ) [ ] $$     Rule 3: S → [ S ] S
S ] S ) S $$           ] ( [ ] ) ) [ ] $$       Match [
] S ) S $$             ] ( [ ] ) ) [ ] $$       Rule 4: S → ε
S ) S $$               ( [ ] ) ) [ ] $$         Match ]
( S ) S ) S $$         ( [ ] ) ) [ ] $$         Rule 2: S → ( S ) S
S ) S ) S $$           [ ] ) ) [ ] $$           Match (
[ S ] S ) S ) S $$     [ ] ) ) [ ] $$           Rule 3: S → [ S ] S
S ] S ) S ) S $$       ] ) ) [ ] $$             Match [
] S ) S ) S $$         ] ) ) [ ] $$             Rule 4: S → ε
S ) S ) S $$           ) ) [ ] $$               Match ]
) S ) S $$             ) ) [ ] $$               Rule 4: S → ε
S ) S $$               ) [ ] $$                 Match )
) S $$                 ) [ ] $$                 Rule 4: S → ε
S $$                   [ ] $$                   Match )
[ S ] S $$             [ ] $$                   Rule 3: S → [ S ] S
S ] S $$               ] $$                     Match [
] S $$                 ] $$                     Rule 4: S → ε
S $$                   $$                       Match ]
$$                     $$                       Rule 4: S → ε
empty                  ε                        Match $$

Parse successful!

7. [10 = 5+5 points] (Modified Problem 2.26: First, Follow, etc.) The rules are (with Start for the symbol called S in the question):
Start → E $$
E → v ":=" E
E → T TL
TL → + T TL | ε
T → F FL
FL → * F FL | ε
F → ( E ) | v

7a. Here is a table that lists the inferences about FIRST, FOLLOW, and EPS that follow from each rule.

Rule A → α       FIRST(α) includes   Other inferences from rule
Start → E $$     FIRST(E)            FIRST(E) ⊆ FIRST(Start); $$ ∈ FOLLOW(E)
E → v ":=" E     v                   v ∈ FIRST(E)
E → T TL         FIRST(T)            FIRST(T) ⊆ FIRST(E); FIRST(TL) ⊆ FOLLOW(T); FOLLOW(E) ⊆ FOLLOW(TL); if EPS(TL) then FOLLOW(E) ⊆ FOLLOW(T)
TL → + T TL      +                   + ∈ FIRST(TL); FIRST(TL) ⊆ FOLLOW(T); if EPS(TL) then FOLLOW(TL) ⊆ FOLLOW(T)
TL → ε                               EPS(TL) = Y
T → F FL         FIRST(F)            FIRST(F) ⊆ FIRST(T); FIRST(FL) ⊆ FOLLOW(F); FOLLOW(T) ⊆ FOLLOW(FL); if EPS(FL) then FOLLOW(T) ⊆ FOLLOW(F)
FL → * F FL      *                   * ∈ FIRST(FL); FIRST(FL) ⊆ FOLLOW(F); if EPS(FL) then FOLLOW(FL) ⊆ FOLLOW(F)
FL → ε                               EPS(FL) = Y
F → ( E ) | v    (, v                (, v ∈ FIRST(F); ) ∈ FOLLOW(E)

Using the inferences, we can calculate the FIRST, FOLLOW, and EPS sets for each nonterminal:

A       FIRST(A)   FOLLOW(A)      EPS(A)
Start   (, v                      N
E       (, v       ), $$          N
TL      +          ), $$          Y
T       (, v       +, ), $$       N
FL      *          +, ), $$       Y
F       (, v       *, +, ), $$    N

From the FIRST, FOLLOW, and EPS sets, we can calculate the PREDICT sets for the rules:

Rule A → α       PREDICT(A → α)
Start → E $$     (, v
E → v ":=" E     v
E → T TL         (, v
TL → + T TL      +
TL → ε           ), $$
T → F FL         (, v
FL → * F FL      *
FL → ε           +, ), $$
F → ( E )        (
F → v            v

7b. The grammar is not LL(1) because v is in the PREDICT sets of two rules for the same nonterminal, E.
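
The set calculations in 7a amount to applying the inferences repeatedly until nothing changes. The following sketch does that for the Question 7 grammar and then looks for tokens that predict two rules of the same nonterminal. It is an illustration under assumptions: the grammar encoding, the function names, and the fixed-point loop are choices made for this example, not necessarily the textbook's formulation.

```python
# Question 7 grammar, one (lhs, rhs) pair per rule; [] is an ε right-hand side.
GRAMMAR = [
    ("Start", ["E", "$$"]),
    ("E", ["v", ":=", "E"]),
    ("E", ["T", "TL"]),
    ("TL", ["+", "T", "TL"]),
    ("TL", []),
    ("T", ["F", "FL"]),
    ("FL", ["*", "F", "FL"]),
    ("FL", []),
    ("F", ["(", "E", ")"]),
    ("F", ["v"]),
]
NONTERMS = {lhs for lhs, _ in GRAMMAR}

def first_of(seq, first, eps):
    """FIRST of a symbol string, plus whether the whole string can derive ε."""
    out = set()
    for sym in seq:
        if sym not in NONTERMS:  # a terminal ends the scan
            out.add(sym)
            return out, False
        out |= first[sym]
        if sym not in eps:
            return out, False
    return out, True

def analyze(grammar):
    first = {n: set() for n in NONTERMS}
    follow = {n: set() for n in NONTERMS}
    eps = set()
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for lhs, rhs in grammar:
            f, e = first_of(rhs, first, eps)
            if not f <= first[lhs] or (e and lhs not in eps):
                first[lhs] |= f
                if e:
                    eps.add(lhs)
                changed = True
            for i, sym in enumerate(rhs):
                if sym in NONTERMS:
                    # What can follow sym: FIRST of the rest, plus FOLLOW(lhs)
                    # when the rest can vanish.
                    f, e = first_of(rhs[i + 1:], first, eps)
                    new = f | (follow[lhs] if e else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    predict = []
    for lhs, rhs in grammar:
        f, e = first_of(rhs, first, eps)
        predict.append((lhs, rhs, f | (follow[lhs] if e else set())))
    return first, follow, eps, predict

def conflicts(predict):
    """(nonterminal, token) pairs that predict more than one rule."""
    seen, clashes = set(), set()
    for lhs, _, tokens in predict:
        for tok in tokens:
            if (lhs, tok) in seen:
                clashes.add((lhs, tok))
            seen.add((lhs, tok))
    return clashes

first, follow, eps, predict = analyze(GRAMMAR)
print(conflicts(predict))
```

An empty result from conflicts would mean every (nonterminal, token) pair selects at most one rule, which is the LL(1) condition this grammar fails.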