Programming Languages (CS 550) Lecture 4 Summary Scanner and Parser Generators. Jeremy R. Johnson

Programming Languages (CS 550) Lecture 4 Summary Scanner and Parser Generators Jeremy R. Johnson 1

Theme We have now seen how to describe syntax using regular expressions and grammars and how to create scanners and parsers, by hand and using automated tools. In this lecture we provide more details on parsing and scanning and indicate how these tools work. Reading: chapter 2 of the text by Scott. 2

Parser and Scanner Generators Tools exist (e.g. yacc/bison 1 for C/C++, PLY for python, CUP for Java) to automatically construct a parser from a restricted set of context free grammars (LALR(1) grammars for yacc/bison and the derivatives CUP and PLY) These tools use table driven bottom up parsing techniques (commonly shift/reduce parsing) Similar tools (e.g. lex/flex for C/C++, Jflex for Java) exist, based on the theory of finite automata, to automatically construct scanners from regular expressions 1 bison in the GNU version of yacc 3

Outline Scanners and DFA Regular Expressions and NDFA Equivalence of DFA and NDFA Regular Languages and the limitations of regular expressions Recursive Descent Parsing LL(1) Grammars and Tob-down (Predictive) Parsing LR(1) Grammars and Bottom-up Parsing 4

Regular Expressions Alphabet = A language over is subset of strings in Regular expressions describe certain types of languages is a regular expression ε = {ε} is a regular expression For each a in, a denoting {a} is a regular expression If r and s are regular expressions denoting languages R and S respectively then (r s), (rs), and (r*) are regular expressions E.G. 00, (0 1)*, (0 1)*00(0 1)*, 00*11*22*, (1 10)* 5

List Tokens LPAREN = ( RPAREN = ) COMMA =, NUMBER = DIGIT DIGIT* DIGIT = 0 1 2 3 4 5 6 7 8 9 Unix shorthand: [0-9], DIGIT+ Whitespace: ( \n \t )* 6

List Scanner TOKEN GetToken() { int val = 0; if (c = getchar() == eof) then return None end if; while c {, \n, \t } then c = getchar() end do; if c { (,,, ) } then return c end if; if c { 0,, 9 } then while c { 0,, 9 } do val = val*10 + (c 0 ); c = getchar(); end do; putchar(c); return (NUMBER,val); else return None; end if; } 7

Flex List Tokens %{ #include "list.tab.h" extern int yylval; %} %% [ \t\n] ; "(" return yytext[0]; ")" return yytext[0]; "," return yytext[0]; [0-9]+ { yylval = atoi(yytext); return NUMBER; } %% 8

Deterministic Finite Automata Input comes from alphabet A Finite set of states, S, start state, s 0, Accepting States, F Transition T from state to state depending on next input M = (A,S,s 0,F,T) The language accepted by a finite automata is the set of input strings that end up in accepting states

Example 1 Create a finite state automata that accepts strings of a s and b s with an even number of a s. b a > S1 S0 a abbbabaabbb 011110010000 b

DFA Implementation Program to implement DFA a > b S1 S0 b a bool EA() { S0: x = getchar(); if (x == b ) goto S0; if (x == a ) goto S1; if (x == ENDM) return true; S1: x = getchar(); if (x == b ) goto S1; if (x == a ) goto S0; if (x == ENDM) return false; }

List DFA d,\t, \n S1 S4 d > S0 (, ( S2 S3 12

Calculator Tokens ASSIGN = := PLUS = +, MINUS = -, TIMES = *, DIV = / LPAREN = (, RPAREN = ) NUMBER = DIGIT DIGIT* DIGIT* (. DIGIT DIGIT.) DIGIT* ID = LETTER (LETTER DIGIT)* DIGIT = 0 1 9, LETTER = a z A Z COMMENT = /* (non-* * non-/)* */ // (non-newline)* newline WHITESPACE = ( \n \t )* 13

Table Driven Scanner State ' ',\t \n / * ( ) + - : =. digit letter other token 1 17 17 2 10 6 7 8 9 11 13 14 16 2 3 4 div 3 3 18 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 5 4 4 4 4 4 4 4 4 4 4 5 4 4 18 5 4 4 4 4 4 4 4 4 4 4 6 lparen 7 rparen 8 plus 9 minus 10 times 11 12 12 assign 13 15 14 15 14 number 15 15 number 16 16 16 identifier 17 17 17 - - - - - - - - - - - - white-space 18 - - - - - - - - - - - - - - comment 15

Non-Deterministic Finite Automata Same as DFA M = (A,S,s 0,F,T) except Can have multiple transitions from same state with same input Can have epsilon transitions Except input string if there is a path to an accepting state The languages accepted by NDFA are the same as DFA

Example 2a DFA accepting (a b)*abb b a > S0 a b S1 S2 a b S0

Example 2b NDFA accepting (a b)*abb a,b a b > S0 S1 S2 b S3 a,b ε a > S0 S1 S2 b S3 b S4

Simulating an NDFA Compute S = set of states NDFA could be in after reading each symbol in the input. Si = set of possible states after reading i input symbols 1. Initialize S 0 = EpsilonClosure{0} 2. for i = 1,,len(str) 1. T i = _{s S i-1 } T[s,str[i]] 2. S i = EpsilonClosure(T i ) b a b b {0,1} {0,1}{0,1,2}{0,1,3}{0,1,4}

NDFA from Regular Expressions Base case c c Union R S ε R ε Concatenation RS ε S ε R S Closure R* ε ε R ε ε

Example 3 Construct a NDFA that accepts the language generated by the regular expression (a bc) ε S1 a S2 ε > S0 S6 ε b c S3 S4 S5 ε

Regular Expression Compiler %{ #include "machine.h" char input[100]; %} %union{ MachinePtr ndfa; char symbol; } %token <symbol> LETTER %type <ndfa> regexp %type <ndfa> cat %type <ndfa> kleene %% statement: regexp { do { printf("enter string\n"); if (scanf("%s",input)!= EOF) Simulate($1,input); else exit(1); } while (1); } regexp: regexp ' ' cat { $$ = MachineOr($1,$3); } cat { $$ = $1; } ; cat: cat kleene { $$ = MachineConcat($1,$2); } kleene { $$ = $1; } ; kleene: '(' regexp ')' { $$ = $2; } kleene '*' { $$ = MachineStar($1); } LETTER { $$ = BaseMachine($1); }; %%

NDFA DFA Find an equivalent DFA for Example 3 States in DFA are sets of states from NDFA [keep track of all possible transitions] > 013 a b 26 4 c 56

Exercise 1 1. Construct NDFA that recognizes (see pages 55-57 of text) 1. d*(.d d.)d* 2. Convert NDFA from (1) to DFA (see pages 56-58 of text)

Solution 1.1 ε d 5 6 7 1 2 3 4 11 12 13 14 ε ε ε ε ε 8. 9 d d. 10 ε ε ε d ε ε ε 25

Solution 1.2 A[1,2,4,5,8] d B[2,3,4,5,8,9] d.. C[6] d D[6,10,11,12,14] d E[7,11,12,14] F[7,11,12,13,14] d G[11,12,13,14] d d 26

Minimizing States in DFA The exists a unique minimal state DFA for any language described by a regular expression Combine equivalent states p q if for each input string x T(p,x) is an accepting state iff T(q,x) is an accepting state Initialize two sets of states: accepting and nonaccepting Partition state sets which transition into multiple sets of states

Exercise 2 1. Find an equivalent DFA to the one in Exercise 1 that minimizes the number of states (see page 59 of text) d,. ABC d,. DEFG d Ambiguity: T(A,d) = T(B,d) T(C,d) split ABC AB,C

Solution 2 A d B d.. C d DEFG d

Equivalence of Regular Expressions and DFA The languages accepted by finite automata are equivalent to those generated by regular expressions Given any regular expression R, there exists a finite state automata M such that L(M) = L(R) Proof is given by previous construction Given any finite state automata M, there exists a regular expression R such that L(R) = L(M) The basic idea is to combine the transitions in each node along all paths that lead to an accepting state. The combination of the characters along the paths are described using regular expressions.

Example 4 Create a regular expression for the language that consists of strings of a s and b s with an even number of a s. b a S1 a b* (b*ab*a)* > S0 b

Grammars and Regular Expressions Given a regular expression R, there exists a grammar with syntactic category <S> such that L(R) = L(<S>). There are grammars such that there does NOT exist a regular expression R with L(<S>) = L(R) <S> a<s>b ε L(<S>) = {a n b n, n=0,1,2, }

Example 5 Create a grammar that generates the language that consists of strings of a s and b s with an even number of a s. b a > b S1 S0 a <S0> b<s0> <S0> a<s1> <S0> ε <S1> b<s1> <S1> a<s0>

Proof that a n b n is not Recognized by a Finite State Automata To show that there is no finite state automata that recognizes the language L = {a n b n, n = 0,1,2, }, we assume that there is a finite state automata M that recognizes L and show that this leads to a contradiction. Since M is a finite state automata it has a finite number of states. Let the number of states = m. Since M recognizes the language L all strings of the form a k b k must end up in accepting states. Choose such a string with k = n which is greater than m.

Proof that a n b n is not Recognized by a Finite State Automata Since n > m there must be a state s that is visited twice while the string a n is read [we can only visit m distinct states and since n > m after reading (m+1) a s, we must go to a state that was already visited]. Suppose that state s is reached after reading the strings a j and a k (j k). Since the same state is reached for both strings, the finite state machine can not distinguish strings that begin with a j from strings that begin with a k. Therefore, the finite state automata must either accept or reject both of the strings a j b j and a k b j. However, a j b j should be accepted, while a k b j should not be accepted.

List Grammar < list > ( < sequence > ) ( ) < sequence > < listelement >, < sequence > < listelement > < listelement > < list > NUMBER 36

Recursive Descent Parser list() { match( ( ); if token ) then seq(); endif; match( ) ); } 37

Recursive Descent Parser seq() { elt(); if token =, then match(, ); seq(); endif; } 38

Recursive Descent Parser elt() { if token = ( then list(); else match(number); endif; } 39

Yacc (bison) Example %token NUMBER /* needed to communicate with scanner */ %% list: '(' sequence ')' { printf("l -> ( seq )\n"); } '(' ')' { printf("l -> () \n "); } sequence: listelement ',' sequence { printf("seq -> LE,seq\n"); } listelement { printf("seq -> LE\n"); } ; listelement: NUMBER { printf("le -> %d\n",$1); } list { printf("le -> L\n"); } ; %% /* since no code here, default main constructed that simply calls parser. */ 40

LR Calculator Grammar Figure 2.24 Program stmt list $$ stmt_list stmt_list stmt stmt stmt id := expr read id write expr expr term expr add op term term factor term mult_op factor factor ( expr ) id number add op + - mult op * / 43

LL Calculator Grammar Here is an LL(1) grammar (Fig 2.15): 1. program stmt_list $$ 2. stmt_list stmt stmt_list 3. ε 4. stmt id := expr 5. read id 6. write expr 7. expr term term_tail 8. term_tail add op term term_tail 9. ε 44

LL Calculator Grammar LL(1) grammar (continued) 10. term factor fact_tailt 11. fact_tail mult_op fact fact_tail 12. ε 13. factor ( expr ) 14. id 15. number 16. add_op + 17. - 18. mult_op * 19. / 45

Predictive Parser Predict which rules to match A α when next token can start α α * ε and the next token can follow A PREDICT(A α) = FIRST(α) FOLLOW(A) if EPS(α) PREDICT(program stmt_list $$) FIRST(stmt_list) = {id, read, write} FOLLOW(stmt_list) = {$$} Intersection of PREDICT sets for same lhs must be empty 46

Recursive Descent Parser procedure program case input_token of id, read, write, $$: stmt_list; match($$) otherwise error procedure stmt_list case input_token of id, read, write: stmt; stmt_list $$: skip otherwise error procedure stmt case input_token of id: match(id); match(:=); expr read: match(read); match(id) write: match(write); expr otherwise error procedure expr case input_token of id, number, (: term; termtail otherwise error procedure term_tail case input_token of +, -: add_op; term; term_tail ), id, read, write, $$: skip otherwise error procedure term case input_token of id, number, (: factor; factor_tail otherwise error 47

Exercise 3 Trace through the recursive descent parser and build parse tree for the following program 1. read A 2. read B 3. sum := A + B 4. write sum 5. write sum / 2

Table-Driven LL Parser Parse Stack Input Stream Comment program stmt_list $$ stmt stmt_list $$ read id stmt_list $$ id stmt_list $$ stmt_list $$ stmt stmt_list $$ read A read B read A read B read A read B read A read B A read B read B read B Initial stack contents program stmt_list $$ stmt_list stmt stmt_list stmt read id match(read) match(id) stmt_list stmt stmt_list stmt_list $$ $$ term_tail ε stmt_list ε 51

Computing First, Follow, Predict Algorithm First/Follow/Predict: FIRST(α) ={c : α * c β} FOLLOW(A) = {c : S + α A c β} EPS(α) = if α * ε then true else false Predict (A α) = FIRST(α) (if EPS(α) then FOLLOW(A) else ) 52

LR Parsing Bottom up (rightmost derivation) Maintain forrest of partially completed subtrees of the parse tree Join trees together when recognizing symbols in rhs of production Keep roots of partially completed trees on stack Shift when new token Reduce when top symbols match rhs Table driven 54

LR Parsing Example Stack ε id(a) id(a), id(a), id(b) id(a), id(b), id(a), id(b), id(c) id(a), id(b), id(c); id(a), id(b), id(c) id_list_tail id(a), id(b) id_list_tail id(a) id_list_tail id_list Remaining Input A, B, C;, B, C; B, C;, C; C; ; 56

LR Calculator Grammar (Figure 2.24, Page 73): 1. program stmt list $$ 2. stmt_list stmt_list stmt 3. stmt 4. stmt id := expr 5. read id 6. write expr 7. expr term 8. expr add op term 57

LR Calculator Grammar LR grammar (continued): 9. term factor 10. term mult_op factor 11. factor ( expr ) 12. id 13. number 14. add op + 15. - 16. mult op * 17. / 58

LR Parser State Keep track of set of productions we might be in along with where in those productions we might be Initial state for calculator grammar program. stmt_list $$ stmt_list. stmt_list stmt stmt_list. stmt stmt. id := expr stmt. read id stmt. write expr // basis // yield 59