Introduction; Parsing LL Grammars

CS 440: Programming Languages and Translators
Due Fri Feb 2, 11:59 pm
1/29 pp.1, 2; 2/7 all updates incorporated, solved

Instructions

You can work together in groups of 4. Submit your work on Blackboard.*

Submit one copy. Include the names and A-IDs of everyone in the group on that copy (in the pdf, for example). Submit under the name of one person in the group (doesn't matter who).

Questions [100 points total]

1. [10 = 5+5 points] For each question below, a paragraph should be enough.
   a. Exercise 1.3 (p.38)
   b. Exercise 1.9 (p.39)

For Questions 2–4, your regular expressions can use some basic egrep notations. (Try man re_format on unix for help.) Some simple examples of what you can use:

   [a-z_]    ("a through z or underscore")
   [0-9ab]   ("any digit or the letters a or b")
   [^xyz]    ("any character except for x, y, or z")
   x?        ("x or nothing")
   x+        ("one or more x's")
   .         (a period or dot means "any one character")
   \.        (backslash dot means literally a dot, as in the float "12\.34")

Don't use back references (such as "\3"); bounds (such as "{7}"); character classes (such as "[:cntrl:]" or "[[:<:]]"); or assertions (such as "\D"). (You won't need literals like \n (except for \.), and if you try things like \x{89abcdef}, we'll hunt you down :-)

2. [15 = 3*5 points] Translate each regular expression below into English. Don't just translate individual subexpressions; try to get at the essence of the expression. (E.g., "[1-9][0-9]" could be "a two-digit number without a leading zero".)
   [Hint: You can try an expression using egrep -e "expression" text_file, where each line of text_file has a candidate string to try to match. You may want to add "^" and "$" to the expression, in that case; again, see the man page.]
   a. [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]
   b. (19|20)[0-9][0-9]-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])
   c. (0x)[1-9a-f][0-9a-f]*

* Using group submission is an experiment; let me know how it works.
CS 440: Programming Languages and Translators 1 James Sasaki, 2018
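For readers without egrep at hand, the hint in Question 2 can be approximated in Python. This is only a sketch: Python's re module is not egrep, although it agrees with it on the basic notations listed above, and the helper name matching_lines is made up for this example. It uses the sample expression "[1-9][0-9]" from the question text to show what adding "^" and "$" changes.

```python
import re

def matching_lines(expression, lines, anchored=True):
    """Return the lines that match expression, roughly like egrep -e.

    With anchored=True the expression is wrapped as ^(...)$ so that it has
    to match the whole line, as the hint suggests.
    """
    pattern = re.compile(f"^(?:{expression})$" if anchored else expression)
    return [line for line in lines if pattern.search(line)]

candidates = ["42", "07", "123", "x99"]
# Whole-line matches only.
print(matching_lines("[1-9][0-9]", candidates))
# Without anchors, any line containing a match is reported.
print(matching_lines("[1-9][0-9]", candidates, anchored=False))
```

Without the anchors, a line such as 123 is reported because it merely contains a two-digit match, which is usually not what you want when testing a candidate string.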

3. [15 = 3*5 points] Give regular expressions that match each of the following kinds of (possibly empty) strings. There may be more than one answer; we just want one.
   a. Strings that alternate between vowels (a, e, i, o, u) and consonants (not vowels) and can start with either a vowel or a consonant.
   b. Strings of a's and b's where the number of b's is divisible by 2 or 3.
   c. Strings of lowercase letters that don't include abc. (Don't forget to include strings like aaa or xab.)

4. [12 points] Give a regular expression for numbers in the following made-up format: Integers are sequences of digits; leading zeros are allowed. Floats include a dot with digits before and/or after the dot. In addition, you may include a base as a leading b#, o#, d#, or x# (binary, octal, decimal, or hex). You may also have a leading + or - before the base (or the integer, if there's no base). In addition, you may include an exponent after the number, of the form e integer where integer is as described above. If specified, the base for the exponent doesn't have to match the base of the number. A single space can be included between each group of one or more digits, or after the base #, or before the e exponent, but no space is allowed between a leading sign and base or between a base and #. The letters (b, o, etc) can be in upper case. If you like, you can define parts of the expressions as grammar rules (like number → integer | float, etc.)

   Some random examples of numbers (with spaces as underscores to make them more visible):

   -b#_1._e-b#10   equals binary -1.0 / 2² = binary -0.01
   1.0eb#10        equals binary 1.0 2 10 = 2 10 cast as a float
   +3.e+1          equals 30.0
   3e1             equals 30
   o#072_031       equals 72031₈

   But not b#_3 (because of the 3) or 12__34 (two spaces between 12 and 34) or -_56 (space after -).

5. [18 points] Here's a state transition table for an NFA that accepts the 3-character string abc. To (I hope) make things clearer, I've mostly given states names that are regular expressions describing the input that takes us to that state. The cells that are empty actually contain err. (I omitted them to make the non-err parts more visible.)

   State        ε        a     b     c
   Start        ε
   (Seen) ε              a
   (Seen) a                    ab
   (Seen) ab                         abc
   (Seen) abc   Accept
   Accept                err   err   err
   err          err      err   err   err

Accept is underlined to indicate that it's (the only) accepting state. Note that once you get to the error state err, you stay there forever.

Now imagine gluing together four NFAs for abc, acc, bbc, and bca, merging their Start, Accept, and err states respectively, and ending up with an NFA with 3 + 4*4 = 19 states. For this problem, convert this NFA to a DFA; the most straightforward way to do this is to use the algorithm in the text. You'll need to use some different terminology to name the states. (Number them? More complicated regular expressions?) You can, but don't have to, give a DFA with the minimum number of states (I believe it's 6 states). Present the DFA using a transition table.

6. [20 = 4*5 points] (Modified Exercise 2.14, p.108) Consider the language consisting of all strings of properly-balanced parentheses and brackets. (I.e., "(", ")", "[", and "]".)
   a. Give an LL(1) grammar for this language. Surround each terminal parenthesis or bracket by double quotes to emphasize that they are terminal symbols.
   b. Give the corresponding LL(1) parsing table.
   c. Show the parse tree for ([]([]))[]. If you like, you may present the tree using an outline form: List the nodes in preorder with the children for each node indented one more level than their parent. E.g., a tree with root X, children Y and Z, with Y having children A and B, and Z having children C and D would be presented as

      X
      . Y
      .. A
      .. B
      . Z
      .. C
      .. D

   d. Give a trace of the parser action as it constructs the parse tree.

7. [10 = 5+5 points] (Modified Problem 2.26) Consider the grammar below. The start symbol is S, the other nonterminals are E, T, TL, F, and FL, and the terminal symbols are v and anything double-quoted.

      S  → E "$$"
      E  → v ":=" E
      E  → T TL
      TL → "+" T TL | ε
      T  → F FL
      FL → "*" F FL | ε
      F  → "(" E ")" | v

   a. For each rule A → α above, give the FIRST(α), FOLLOW(A), EPS(α), and PREDICT(A → α) sets. Omit duplicates (there's no reason to show EPS(ε) more than once, for example).
   b. What tells us that this grammar is not LL(1)?

Solution to Homework 1

1. (Compilation; Correctness)
   a. Exercise 1.3, p.38 (Compilation vs interpretation) Some possible answers: Compilation can catch errors earlier; compiled code usually executes faster. An interpreter may take less time to rerun a program that's had a small change made to it (a compiler has to recompile and relink the whole program); an interpreter may produce better error messages; for language development, writing an interpreter can be faster than writing a compiler.
   b. Exercise 1.9, p.39 (Program correctness) There are two parts to correctness: the specification and meeting the specification. Specifications can be vague, wrong, or not cover all possible inputs. For correctness, testing only reveals lack of bugs under the tested inputs; untested inputs may still encounter bugs, plus, determining what inputs to test on is hard. For complex software, it's hard to figure out what environments a program might run in (plus test in all of them). Blind spots can include things you know you don't know (like exact user behavior) and things you don't know you don't know (like unexpected user behavior).

2. (Translate reg expressions to English). There can be alternative answers.
   2a. [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]
       Three (natural) numbers separated by dashes; the first number has 4 digits and the other two have 2 digits.
   2b. (19|20)[0-9][0-9]-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])
       Dates of the form year-month-day where the years are 1900–2099, months are 01–12, and days are 01–31.
   2c. (0x)[1-9a-f][0-9a-f]*
       Hex natural numbers: the tag 0x followed by one or more lower-case hex digits, with no leading zero.

3. (Regular expressions)
   3a. Alternating vowels and consonants:
       [aeiou]?([^aeiou][aeiou])*[^aeiou]?
       Note this allows the empty string.
   3b. a's and b's where the number of b's is a multiple of 2 or 3:
       a* (b a* b a*)* a* | a* (b a* b a* b a*)* a*
       (You should include the empty string.)
   3c. Strings without substring abc:
       ([^a]|a(a|ba)*([^ab]|b[^ac]))*(a(a|ba)*b?)?
       Each pass through the outer starred group reads either a non-a, or a stretch that starts with a and runs until no prefix of abc is pending any more; the optional tail lets the string end while a or ab is still pending. A shorter expression such as ([^a]|a[^b]|ab[^c])*(a|ab)? is not enough: it accepts aabc by splitting it into aa, b, c.

4. (Integer and Float Numbers in bases 2, 8, 10, 16). First, let's break down the problem. The basic idea: We need

      sign? base? number exponent?

   Including spaces gives us

      sign? base?\_? number_with_spaces exponent_with_spaces?

   "\_" means backslash space, which means an actual space. To avoid having a number lead with spaces, I'm putting them between the base and number, hence \_?. This is the first time I've offered this problem, so if my solution has bugs, let me know.
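A side note on Question 3: answers to regular-expression problems like these can be sanity-checked by brute force, comparing the expression against a plain-code statement of the requirement on every short string over a small alphabet. The sketch below does that in Python. The helper names, the test alphabets, and the length bounds are assumptions of this example, and the 3c pattern is one candidate worked out from a three-state automaton (nothing pending, a pending, ab pending); agreement on short strings is evidence, not a proof.

```python
import re
from itertools import product

VOWELS = set("aeiou")

def all_strings(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def disagreements(pattern, spec, alphabet, max_len):
    """Strings on which the regex and the predicate spec give different answers."""
    rx = re.compile(pattern)
    return [s for s in all_strings(alphabet, max_len)
            if (rx.fullmatch(s) is not None) != spec(s)]

def alternates(s):
    """3a: every pair of neighbours is one vowel and one consonant."""
    return all((x in VOWELS) != (y in VOWELS) for x, y in zip(s, s[1:]))

def b_count_ok(s):
    """3b: the number of b's is divisible by 2 or by 3."""
    n = s.count("b")
    return n % 2 == 0 or n % 3 == 0

def no_abc(s):
    """3c: abc does not occur as a substring."""
    return "abc" not in s

# (label, candidate pattern, requirement, test alphabet, maximum length)
CHECKS = [
    ("3a", r"[aeiou]?([^aeiou][aeiou])*[^aeiou]?", alternates, "aebc", 6),
    ("3b", r"a*(ba*ba*)*a*|a*(ba*ba*ba*)*a*", b_count_ok, "ab", 10),
    ("3c", r"([^a]|a(a|ba)*([^ab]|b[^ac]))*(a(a|ba)*b?)?", no_abc, "abcx", 6),
]

for label, pattern, spec, alphabet, max_len in CHECKS:
    bad = disagreements(pattern, spec, alphabet, max_len)
    print(label, "agrees" if not bad else f"disagrees on {bad[:3]}")
```

Using fullmatch plays the role of wrapping the expression in "^" and "$". A non-empty result from disagreements gives concrete counterexamples to look at.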

Sign and base are straightforward: sign? is [+-]?, and base? could be ([bBoOdDxX]#\_?)?, except that the legal digits in the following number depend on the base, so we'll need to break the bases up into cases: base? can be ([bB]#\_?)?, ([oO]#\_?)? and so on. The exponent part is also straightforward: \_?[eE]integer, with integer as below.

The number part is the hard one, of course. It's either an integer or a float. If the number is an integer (natural number, technically), then we can use an expression like digit+(\_digit+)*, which is a sequence that alternates between runs of digits and one space, beginning and ending with digit(s). (Remember the Kleene + means "one or more of".) It's \_, not \_+, because we're only allowed one space between runs of digits. There are alternatives like digit ((digit \_)*digit)? that are perfectly fine too.

For a float, the thing to avoid is something like digit*\.digit*, which makes digits optional before or after the dot (which is good) but doesn't insist on having at least one digit somewhere (which is bad). We can follow a (non-empty) integer with a dot and (optionally) more digits and spaces ending with a digit (we don't want trailing spaces):

   digit+(\_digit+)*(\.((digit+\_)*digit+)?)?      // integer (dot integer?)?

Or, we can begin with a dot and follow with digits and spaces and end with a digit(s):

   \.(digit+\_)*digit+

(Note we don't allow a space before and/or after the dot; maybe that's a bug in the specification.)

Below, I'm using name → expression to give names to expressions to make things more readable (I hope). It's fine if you used other symbols, like ::=. I took off the _with_spaces suffixes and went with just number and exponent. The full expansion is pretty horrendous, so I'm skipping it. (Hope you did too.)

   value        → sign? base_and_nbr exponent?
   sign         → [+-]
   base_and_nbr → (base2? nbr2 | base8? nbr8 | base10? nbr10 | base16? nbr16)
   base2        → [bB]#\_?
   base8        → [oO]#\_?
   base10       → [dD]#\_?
   base16       → [xX]#\_?
   nbr2         → [01]+(\_[01]+)*(\.(([01]+\_)*[01]+)?)?
                  | \.([01]+\_)*[01]+
   nbr8         → [0-7]+(\_[0-7]+)*(\.(([0-7]+\_)*[0-7]+)?)?
                  | \.([0-7]+\_)*[0-7]+
   nbr10        → [0-9]+(\_[0-9]+)*(\.(([0-9]+\_)*[0-9]+)?)?
                  | \.([0-9]+\_)*[0-9]+
   nbr16        → [0-9a-fA-F]+(\_[0-9a-fA-F]+)*(\.(([0-9a-fA-F]+\_)*[0-9a-fA-F]+)?)?
                  | \.([0-9a-fA-F]+\_)*[0-9a-fA-F]+
   exponent     → \_?[eE] sign? base_and_nbr
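Since the full expansion is skipped above, one way to try the named pieces without writing the expansion by hand is to let a program assemble it. The following Python sketch does that and runs the examples from Question 4. It assumes that exponent refers back to base_and_nbr, that Python's re syntax is close enough to egrep for these pieces, and that examples are written with underscores standing for spaces; the names nbr, BASES, VALUE, and is_value are invented for this illustration.

```python
import re

SIGN = r"[+-]"

def nbr(digit):
    """Number over one digit class, following the nbr2..nbr16 definitions."""
    integer = f"{digit}+(?: {digit}+)*"     # digit runs separated by single spaces
    fraction = f"(?:{digit}+ )*{digit}+"    # same shape, written as in the text
    # integer (dot fraction?)?   or   dot fraction
    return f"(?:{integer}(?:\\.(?:{fraction})?)?|\\.{fraction})"

# (base letter class, digit class) for bases 2, 8, 10, 16
BASES = [("[bB]", "[01]"), ("[oO]", "[0-7]"),
         ("[dD]", "[0-9]"), ("[xX]", "[0-9a-fA-F]")]

# base_and_nbr: an optional base tag (with optional space) and a matching number
BASE_AND_NBR = "(?:" + "|".join(
    f"(?:{base}# ?)?{nbr(digit)}" for base, digit in BASES) + ")"

# exponent: optional space, e or E, optional sign, then base_and_nbr again
EXPONENT = f" ?[eE]{SIGN}?{BASE_AND_NBR}"

VALUE = re.compile(f"{SIGN}?{BASE_AND_NBR}(?:{EXPONENT})?")

def is_value(example):
    """Check an example written with underscores standing for spaces."""
    return VALUE.fullmatch(example.replace("_", " ")) is not None

for example in ["-b#_1._e-b#10", "1.0eb#10", "+3.e+1", "3e1", "o#072_031",
                "b#_3", "12__34", "-_56"]:
    print(f"{example:15} {is_value(example)}")
```

Building the pattern from one nbr function keeps the four per-base cases in step with each other, which is easy to get wrong when the expansion is typed out by hand.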

5. [18 points] (DFA that accepts abc, acc, bbc, and bca) Except for Start, Accept, and err, I named the states after the path you take to get there.

   State       a        b          c
   Start       a        b          err
   a           err      ab ac bb   ab ac bb
   ab ac bb    err      err        Accept
   b           err      ab ac bb   bc
   bc          Accept   err        err
   Accept      err      err        err
   err         err      err        err

   [Not asked for: The DFA above is minimal. Rows with different (error vs. not error) patterns can't be joined, and Accept and err aren't both accepting or non-accepting states, so they can't be joined either. If you have separate rows for ab, ac, and bb, you'll see they behave identically (accept on c, err otherwise). That's why they can be joined. So the minimal automaton has seven states (when I said six I forgot about the error state).]

6. [20 = 4*5 points] (Modified Exercise 2.14, p.108: Balanced parentheses and brackets)

6a. The grammar has four rules, given below. The rule Start → S $$ lets the parser check for end-of-input.

   Rule #   Rule
   1        Start → S $$
   2        S → ( S ) S
   3        S → [ S ] S
   4        S → ε

6b. The parse table pairs the nonterminal at the top of the stack with the current input token and tells you which rule to apply to the nonterminal. err indicates a syntax error.

                Input Token
   Stack Top    (     )     [     ]     $$
   Start        1     err   1     err   1
   S            2     4     3     4     4
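The table in 6b is all a table-driven parser needs. As an illustration, here is a minimal Python sketch of the usual stack-based LL(1) loop using the four rules of 6a and the table of 6b. The names RULES, TABLE, and parse are made up for this example, input characters are used directly as tokens, and error reporting is kept to a bare SyntaxError.

```python
# Rule number -> (left-hand side, right-hand side); rule 4 is S -> epsilon.
RULES = {
    1: ("Start", ["S", "$$"]),
    2: ("S", ["(", "S", ")", "S"]),
    3: ("S", ["[", "S", "]", "S"]),
    4: ("S", []),
}

# (stack top, input token) -> rule number; a missing entry means err.
TABLE = {
    ("Start", "("): 1, ("Start", "["): 1, ("Start", "$$"): 1,
    ("S", "("): 2, ("S", "["): 3,
    ("S", ")"): 4, ("S", "]"): 4, ("S", "$$"): 4,
}

NONTERMINALS = {"Start", "S"}

def parse(text):
    """Return the list of parser actions for text, or raise SyntaxError."""
    tokens = list(text) + ["$$"]      # one token per character, then end marker
    stack = ["Start"]                 # top of the stack is the end of the list
    pos = 0
    actions = []
    while stack:
        top = stack.pop()
        token = tokens[pos]
        if top in NONTERMINALS:
            rule = TABLE.get((top, token))
            if rule is None:
                raise SyntaxError(f"no rule for {top} on input {token!r}")
            actions.append(f"Rule {rule}")
            # Push the right-hand side so its first symbol ends up on top.
            stack.extend(reversed(RULES[rule][1]))
        elif top == token:
            actions.append(f"Match {token}")
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r} but saw {token!r}")
    return actions

for step in parse("([]([]))[]"):
    print(step)
```

Each printed action corresponds to one row of a parser trace: a "Rule n" line is a prediction step and a "Match" line consumes one input token.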

6c. Parse tree for ([]([]))[]. The outline-format tree is to the left; the terminal string on the right shows where each terminal symbol appears in the input (as the head of the string).

   Tree            Input
   Start
   . S
   .. (            ([]([]))[]
   .. S
   ... [           []([]))[]
   ... S
   .... ε
   ... ]           ]([]))[]
   ... S
   .... (          ([]))[]
   .... S
   ..... [         []))[]
   ..... S
   ...... ε
   ..... ]         ]))[]
   ..... S
   ...... ε
   .... )          ))[]
   .... S
   ..... ε
   .. )            )[]
   .. S
   ... [           []
   ... S
   .... ε
   ... ]           ]
   ... S
   .... ε
   . $$

6d. Trace of parser actions. Each row shows the stack and remaining input after the action on that row.

   Parser Stack          Input Stream             Action
   Start                 ( [ ] ( [ ] ) ) [ ] $$   (Initialize parser)
   S $$                  ( [ ] ( [ ] ) ) [ ] $$   (Predict) Rule 1: Start → S $$
   ( S ) S $$            ( [ ] ( [ ] ) ) [ ] $$   Rule 2: S → ( S ) S
   S ) S $$              [ ] ( [ ] ) ) [ ] $$     Match (
   [ S ] S ) S $$        [ ] ( [ ] ) ) [ ] $$     Rule 3: S → [ S ] S
   S ] S ) S $$          ] ( [ ] ) ) [ ] $$       Match [
   ] S ) S $$            ] ( [ ] ) ) [ ] $$       Rule 4: S → ε
   S ) S $$              ( [ ] ) ) [ ] $$         Match ]
   ( S ) S ) S $$        ( [ ] ) ) [ ] $$         Rule 2: S → ( S ) S
   S ) S ) S $$          [ ] ) ) [ ] $$           Match (
   [ S ] S ) S ) S $$    [ ] ) ) [ ] $$           Rule 3: S → [ S ] S
   S ] S ) S ) S $$      ] ) ) [ ] $$             Match [
   ] S ) S ) S $$        ] ) ) [ ] $$             Rule 4: S → ε
   S ) S ) S $$          ) ) [ ] $$               Match ]
   ) S ) S $$            ) ) [ ] $$               Rule 4: S → ε
   S ) S $$              ) [ ] $$                 Match )
   ) S $$                ) [ ] $$                 Rule 4: S → ε
   S $$                  [ ] $$                   Match )
   [ S ] S $$            [ ] $$                   Rule 3: S → [ S ] S
   S ] S $$              ] $$                     Match [
   ] S $$                ] $$                     Rule 4: S → ε
   S $$                  $$                       Match ]
   $$                    $$                       Rule 4: S → ε
   empty                 ε                        Match $$

   Parse successful!

7. [10 = 5+5 points] (Modified Problem 2.26: First, Follow, etc.) The rules are given below. (The start symbol S is written Start in the tables.)

   S  → E $$
   E  → v ":=" E
   E  → T TL
   TL → + T TL | ε
   T  → F FL
   FL → * F FL | ε
   F  → ( E ) | v

7a. Here is a table that lists the inferences about FIRST, FOLLOW, and EPS that follow from each rule.

   Rule A → α       FIRST(α) includes   Other inferences from rule
   Start → E $$     FIRST(E)            FIRST(E) ⊆ FIRST(Start); $$ ∈ FOLLOW(E)
   E → v ":=" E     v                   v ∈ FIRST(E)
   E → T TL         FIRST(T)            FIRST(T) ⊆ FIRST(E); FIRST(TL) ⊆ FOLLOW(T); FOLLOW(E) ⊆ FOLLOW(TL); if EPS(TL) then FOLLOW(E) ⊆ FOLLOW(T)
   TL → + T TL      +                   + ∈ FIRST(TL); FIRST(TL) ⊆ FOLLOW(T); if EPS(TL) then FOLLOW(TL) ⊆ FOLLOW(T)
   TL → ε                               EPS(TL) = Y
   T → F FL         FIRST(F)            FIRST(F) ⊆ FIRST(T); FIRST(FL) ⊆ FOLLOW(F); FOLLOW(T) ⊆ FOLLOW(FL); if EPS(FL) then FOLLOW(T) ⊆ FOLLOW(F)
   FL → * F FL      *                   * ∈ FIRST(FL); FIRST(FL) ⊆ FOLLOW(F); if EPS(FL) then FOLLOW(FL) ⊆ FOLLOW(F)
   FL → ε                               EPS(FL) = Y
   F → ( E ) | v    (, v                (, v ∈ FIRST(F); ) ∈ FOLLOW(E)

Using the inferences, we can calculate the FIRST, FOLLOW, and EPS sets for each nonterminal. Note that FOLLOW(T) ⊆ FOLLOW(FL) puts all of +, ), and $$ into FOLLOW(FL).

   A       FIRST(A)   FOLLOW(A)      EPS(A)
   Start   (, v                      N
   E       (, v       ), $$          N
   TL      +          ), $$          Y
   T       (, v       +, ), $$       N
   FL      *          +, ), $$       Y
   F       (, v       *, +, ), $$    N

From the FIRST, FOLLOW, and EPS sets, we can calculate the PREDICT sets for the rules:

   Rule A → α       PREDICT(A → α)
   Start → E $$     (, v
   E → v ":=" E     v
   E → T TL         (, v
   TL → + T TL      +
   TL → ε           ), $$
   T → F FL         (, v
   FL → * F FL      *
   FL → ε           +, ), $$
   F → ( E )        (
   F → v            v

7b. The grammar is not LL(1) because v is in the PREDICT sets of two rules for the same nonterminal, E: both E → v ":=" E and E → T TL are predicted when the next token is v.
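The sets for Question 7 can also be computed mechanically, which is a handy cross-check on hand calculation. The Python sketch below runs the standard fixed-point iteration for EPS, FIRST, and FOLLOW over the grammar of Question 7 and then looks for overlapping PREDICT sets. The grammar encoding (a list of left-hand side and right-hand side pairs, with an empty list for ε) and the names analyze and conflicts are choices made for this example, not something taken from the text.

```python
GRAMMAR = [
    ("Start", ["E", "$$"]),
    ("E", ["v", ":=", "E"]),
    ("E", ["T", "TL"]),
    ("TL", ["+", "T", "TL"]),
    ("TL", []),                      # TL -> epsilon
    ("T", ["F", "FL"]),
    ("FL", ["*", "F", "FL"]),
    ("FL", []),                      # FL -> epsilon
    ("F", ["(", "E", ")"]),
    ("F", ["v"]),
]

def analyze(grammar):
    """Return (first, follow, eps, predict) for a list of (lhs, rhs) rules."""
    nts = {lhs for lhs, _ in grammar}
    eps = {a: False for a in nts}
    first = {a: set() for a in nts}
    follow = {a: set() for a in nts}

    def first_of(symbols):
        """FIRST set and EPS flag of a symbol string, from the current tables."""
        out = set()
        for sym in symbols:
            if sym not in nts:       # a terminal ends the scan
                out.add(sym)
                return out, False
            out |= first[sym]
            if not eps[sym]:
                return out, False
        return out, True

    changed = True
    while changed:                   # iterate until nothing grows any more
        changed = False
        for lhs, rhs in grammar:
            f, e = first_of(rhs)
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
            if e and not eps[lhs]:
                eps[lhs] = True
                changed = True
            for i, sym in enumerate(rhs):
                if sym in nts:
                    f, e = first_of(rhs[i + 1:])
                    new = f | (follow[lhs] if e else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True

    predict = []
    for lhs, rhs in grammar:
        f, e = first_of(rhs)
        predict.append((lhs, rhs, f | (follow[lhs] if e else set())))
    return first, follow, eps, predict

def conflicts(predict):
    """Pairs of rules for one nonterminal whose PREDICT sets overlap."""
    found = []
    for i, (lhs1, _, p1) in enumerate(predict):
        for lhs2, _, p2 in predict[i + 1:]:
            if lhs1 == lhs2 and p1 & p2:
                found.append((lhs1, sorted(p1 & p2)))
    return found

first, follow, eps, predict = analyze(GRAMMAR)
for lhs, rhs, p in predict:
    print(lhs, "->", " ".join(rhs) or "ε", ":", sorted(p))
print("conflicts:", conflicts(predict))
```

A grammar is LL(1) exactly when conflicts returns an empty list; here the only overlap reported is the token v shared by the two rules for E.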