Programming Languages (CS 550) Lecture 4 Summary Scanner and Parser Generators. Jeremy R. Johnson


Theme
We have now seen how to describe syntax using regular expressions and grammars, and how to create scanners and parsers both by hand and with automated tools. This lecture gives more detail on scanning and parsing and indicates how those tools work. Reading: Chapter 2 of the text by Scott.

Parser and Scanner Generators
- Tools exist (e.g. yacc/bison [1] for C/C++, PLY for Python, CUP for Java) that automatically construct a parser from a restricted class of context-free grammars (LALR(1) grammars for yacc/bison and its derivatives CUP and PLY)
- These tools use table-driven bottom-up parsing techniques (commonly shift/reduce parsing)
- Similar tools (e.g. lex/flex for C/C++, JFlex for Java), based on the theory of finite automata, automatically construct scanners from regular expressions
[1] bison is the GNU version of yacc

Outline
- Scanners and DFA
- Regular Expressions and NDFA
- Equivalence of DFA and NDFA
- Regular Languages and the limitations of regular expressions
- Recursive Descent Parsing
- LL(1) Grammars and Top-down (Predictive) Parsing
- LR(1) Grammars and Bottom-up Parsing

Regular Expressions
- Alphabet = Σ
- A language over Σ is a subset of the strings in Σ*
- Regular expressions describe certain types of languages
  - ∅ is a regular expression
  - ε, denoting {ε}, is a regular expression
  - For each a in Σ, a, denoting {a}, is a regular expression
  - If r and s are regular expressions denoting languages R and S respectively, then (r | s), (rs), and (r*) are regular expressions
- E.g. 00, (0|1)*, (0|1)*00(0|1)*, 00*11*22*, (1|10)*

List Tokens
- LPAREN = (
- RPAREN = )
- COMMA = ,
- NUMBER = DIGIT DIGIT*
- DIGIT = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- Unix shorthand: [0-9], DIGIT+
- Whitespace: (' ' | \n | \t)*

List Scanner

    TOKEN GetToken() {
        int val = 0;
        if ((c = getchar()) == eof) then return None end if;
        while c ∈ {' ', \n, \t} do c = getchar() end do;
        if c ∈ {'(', ',', ')'} then return c end if;
        if c ∈ {'0', ..., '9'} then
            while c ∈ {'0', ..., '9'} do
                val = val*10 + (c - '0');
                c = getchar();
            end do;
            ungetc(c);    /* push the lookahead character back onto the input */
            return (NUMBER, val);
        else
            return None;
        end if;
    }
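The hand-written scanner above can be sketched in runnable form as follows. This is only an illustration in Python, not the lecture's code: the function name `tokens`, the `("NUMBER", value)` tuple shape, and the choice to raise `ValueError` on an unexpected character are all assumptions of the sketch.

```python
from typing import Iterator, Tuple, Union

Token = Union[str, Tuple[str, int]]  # a punctuation character or ("NUMBER", value)

DIGITS = "0123456789"


def tokens(text: str) -> Iterator[Token]:
    """Yield the list-language tokens found in text, skipping whitespace."""
    i = 0
    while i < len(text):
        c = text[i]
        if c in " \n\t":            # whitespace is not a token
            i += 1
        elif c in "(),":            # LPAREN, RPAREN, COMMA
            yield c
            i += 1
        elif c in DIGITS:           # NUMBER = DIGIT DIGIT*
            val = 0
            while i < len(text) and text[i] in DIGITS:
                val = val * 10 + (ord(text[i]) - ord("0"))
                i += 1
            # the index i already points at the lookahead character,
            # so no explicit push-back is needed here
            yield ("NUMBER", val)
        else:
            raise ValueError(f"unexpected character {c!r} at offset {i}")
```

For example, `list(tokens("(1, 23)"))` yields the tokens for `(`, the number 1, `,`, the number 23, and `)`.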

Flex List Tokens

    %{
    #include "list.tab.h"
    extern int yylval;
    %}
    %%
    [ \t\n]   ;
    "("       return yytext[0];
    ")"       return yytext[0];
    ","       return yytext[0];
    [0-9]+    { yylval = atoi(yytext); return NUMBER; }
    %%

Deterministic Finite Automata
- Input comes from an alphabet A
- Finite set of states S, start state s0, accepting states F
- Transition function T from state to state depending on the next input
- M = (A, S, s0, F, T)
- The language accepted by a finite automaton is the set of input strings that end in an accepting state

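Example 1
Create a finite state automaton that accepts strings of a's and b's with an even number of a's.
Diagram: start state S0 (accepting) and state S1; a moves S0 to S1 and S1 to S0; b loops on each state.
Trace:
    input:     a b b b a b a a b b b
    state:   0 1 1 1 1 0 0 1 0 0 0 0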

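DFA Implementation
Program to implement the DFA (a toggles between S0 and S1; b loops on each state):

    bool EA() {
        int x;
    S0: x = getchar();
        if (x == 'b') goto S0;
        if (x == 'a') goto S1;
        if (x == ENDM) return true;
        return false;                  /* unexpected character: reject */
    S1: x = getchar();
        if (x == 'b') goto S1;
        if (x == 'a') goto S0;
        if (x == ENDM) return false;
        return false;                  /* unexpected character: reject */
    }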
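The goto-based program above hard-codes one state per label. The same automaton can also be driven from a transition table, which is the form scanner generators emit. The sketch below is a minimal Python illustration; the names `TRANSITIONS`, `START`, `ACCEPTING` and `accepts` are assumptions of the sketch, not part of the lecture.

```python
# Transition table for the even-number-of-a's DFA of Example 1.
TRANSITIONS = {
    ("S0", "a"): "S1", ("S0", "b"): "S0",
    ("S1", "a"): "S0", ("S1", "b"): "S1",
}
START = "S0"
ACCEPTING = {"S0"}


def accepts(s: str) -> bool:
    """Run the DFA over s and report whether it ends in an accepting state."""
    state = START
    for ch in s:
        nxt = TRANSITIONS.get((state, ch))
        if nxt is None:        # character outside the alphabet {a, b}: reject
            return False
        state = nxt
    return state in ACCEPTING
```

Changing the language only requires changing the table; the driver loop stays the same.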

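List DFA
[state diagram: start state S0 and states S1-S4, with edges labeled d (digit), whitespace (' ', \t, \n), and the punctuation characters]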

Calculator Tokens
- ASSIGN = :=
- PLUS = +, MINUS = -, TIMES = *, DIV = /
- LPAREN = (, RPAREN = )
- NUMBER = DIGIT DIGIT* | DIGIT* (. DIGIT | DIGIT .) DIGIT*
- ID = LETTER (LETTER | DIGIT)*
- DIGIT = 0 | 1 | ... | 9, LETTER = a | ... | z | A | ... | Z
- COMMENT = /* (non-* | * non-/)* */ | // (non-newline)* newline
- WHITESPACE = (' ' | \n | \t)*

Calculator DFA [figure; Copyright 2009 Elsevier]

Table Driven Scanner

State ' ',\t \n  /   *   (   )   +   -   :   =   .   digit letter other token
1     17     17  2   10  6   7   8   9   11      13  14    16
2                3   4                                                  div
3     3      18  3   3   3   3   3   3   3   3   3   3     3      3
4     4      4   4   5   4   4   4   4   4   4   4   4     4      4
5     4      4   18  5   4   4   4   4   4   4   4   4     4      4
6                                                                       lparen
7                                                                       rparen
8                                                                       plus
9                                                                       minus
10                                                                      times
11                                           12
12                                                                      assign
13                                                   15
14                                               15  14                 number
15                                                   15                 number
16                                                   16    16           identifier
17    17     17  -   -   -   -   -   -   -   -   -   -     -      -     white-space
18    -      -   -   -   -   -   -   -   -   -   -   -     -      -     comment

Non-Deterministic Finite Automata
- Same as a DFA, M = (A, S, s0, F, T), except
  - Can have multiple transitions from the same state on the same input
  - Can have epsilon (ε) transitions
- Accept an input string if there is a path to an accepting state
- The languages accepted by NDFAs are the same as those accepted by DFAs

Example 2a
DFA accepting (a|b)*abb
[state diagram: start state S0, with edges labeled a and b]

Example 2b
NDFA accepting (a|b)*abb
Without ε-transitions: S0 loops on a,b; S0 -a-> S1 -b-> S2 -b-> S3.
With an ε-transition: S0 loops on a,b; S0 -ε-> S1 -a-> S2 -b-> S3 -b-> S4.

Simulating an NDFA
- Compute the set of states the NDFA could be in after reading each symbol of the input
- S_i = set of possible states after reading i input symbols
1. Initialize S_0 = EpsilonClosure({0})
2. For i = 1, ..., len(str):
   1. T_i = ∪_{s ∈ S_{i-1}} T[s, str[i]]
   2. S_i = EpsilonClosure(T_i)
Trace on input babb (ε-NDFA of Example 2b):
    input:          b       a        b        b
    S_i:   {0,1}  {0,1}  {0,1,2}  {0,1,3}  {0,1,4}
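The two-step loop above (take all transitions on the next symbol, then take the ε-closure) can be sketched in Python as follows. The transition dictionary encodes the ε-NDFA of Example 2b with states numbered 0 to 4; the dictionary layout, the use of the empty string as the ε label, and the function names are assumptions of this sketch.

```python
EPS = ""  # label used for epsilon transitions in this sketch

# NDFA for (a|b)*abb: 0 loops on a,b; 0 -eps-> 1 -a-> 2 -b-> 3 -b-> 4.
NFA = {
    (0, "a"): {0}, (0, "b"): {0}, (0, EPS): {1},
    (1, "a"): {2},
    (2, "b"): {3},
    (3, "b"): {4},
}
ACCEPTING = {4}


def epsilon_closure(states):
    """All states reachable from `states` using only epsilon transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def simulate(text):
    """Return [S_0, S_1, ..., S_n], the state sets after each input symbol."""
    current = epsilon_closure({0})
    trace = [current]
    for ch in text:
        moved = set()                      # T_i: union of T[s, ch] over s in S_{i-1}
        for s in current:
            moved |= NFA.get((s, ch), set())
        current = epsilon_closure(moved)   # S_i
        trace.append(current)
    return trace


def accepts(text):
    """Accept if some state in the final set is an accepting state."""
    return bool(simulate(text)[-1] & ACCEPTING)
```

Calling `simulate("babb")` reproduces the sequence of state sets shown in the trace for that input.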

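NDFA from Regular Expressions
- Base case c: a single transition labeled c from the start state to the accepting state
- Union R | S: ε-transitions from a new start state into R and into S, and ε-transitions from R and from S into a new accepting state
- Concatenation RS: the machine for R followed by the machine for S
- Closure R*: ε-transitions that allow the machine for R to be skipped or repeated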

Example 3
Construct an NDFA that accepts the language generated by the regular expression (a | bc).
Diagram: S0 -ε-> S1 -a-> S2 -ε-> S6 and S0 -ε-> S3 -b-> S4 -c-> S5 -ε-> S6, with start state S0 and accepting state S6.
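A construction of this kind can be sketched in a few lines of Python. The sketch below is a Thompson-style construction under stated assumptions: the regular expression is given as a nested tuple (`("lit", c)`, `("or", r, s)`, `("cat", r, s)`, `("star", r)`), which is a representation invented for this example, and concatenation links the two machines with an ε-transition, so the state numbering will not match the diagram of Example 3 exactly.

```python
from collections import defaultdict
from itertools import count

EPS = None  # label for epsilon transitions in this sketch


def build(regex):
    """Return (start, accept, transitions) for a regex given as nested tuples."""
    trans = defaultdict(set)
    fresh = count()                      # generator of new state numbers

    def go(node):
        kind = node[0]
        if kind == "lit":                # base case: start -c-> accept
            s, f = next(fresh), next(fresh)
            trans[(s, node[1])].add(f)
        elif kind == "cat":              # R then S, linked by epsilon
            s, mid = go(node[1])
            start2, f = go(node[2])
            trans[(mid, EPS)].add(start2)
        elif kind == "or":               # new start/accept around the alternatives
            s, f = next(fresh), next(fresh)
            for sub in node[1:]:
                sub_s, sub_f = go(sub)
                trans[(s, EPS)].add(sub_s)
                trans[(sub_f, EPS)].add(f)
        elif kind == "star":             # allow skipping or repeating R
            s, f = next(fresh), next(fresh)
            sub_s, sub_f = go(node[1])
            trans[(s, EPS)] |= {sub_s, f}
            trans[(sub_f, EPS)] |= {sub_s, f}
        else:
            raise ValueError(f"unknown node kind {kind!r}")
        return s, f

    start, accept = go(regex)
    return start, accept, dict(trans)


def closure(states, trans):
    """Epsilon closure of a set of states."""
    todo, seen = list(states), set(states)
    while todo:
        for t in trans.get((todo.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen


def matches(regex, text):
    """Build the NDFA for regex and simulate it on text."""
    start, accept, trans = build(regex)
    current = closure({start}, trans)
    for ch in text:
        step = set()
        for s in current:
            step |= trans.get((s, ch), set())
        current = closure(step, trans)
    return accept in current


# The regular expression of Example 3, (a | bc), in the tuple form assumed here.
A_OR_BC = ("or", ("lit", "a"), ("cat", ("lit", "b"), ("lit", "c")))
```

With these definitions, `matches(A_OR_BC, "a")` and `matches(A_OR_BC, "bc")` are true, while `matches(A_OR_BC, "b")` is false.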

Regular Expression Compiler

    %{
    #include <stdio.h>
    #include <stdlib.h>
    #include "machine.h"
    char input[100];
    %}
    %union {
        MachinePtr ndfa;
        char symbol;
    }
    %token <symbol> LETTER
    %type <ndfa> regexp
    %type <ndfa> cat
    %type <ndfa> kleene
    %%
    statement: regexp {
                   do {
                       printf("enter string\n");
                       if (scanf("%99s", input) != EOF)   /* bounded read into input[100] */
                           Simulate($1, input);
                       else
                           exit(1);
                   } while (1);
               }
             ;
    regexp: regexp '|' cat    { $$ = MachineOr($1, $3); }
          | cat               { $$ = $1; }
          ;
    cat: cat kleene           { $$ = MachineConcat($1, $2); }
       | kleene               { $$ = $1; }
       ;
    kleene: '(' regexp ')'    { $$ = $2; }
          | kleene '*'        { $$ = MachineStar($1); }
          | LETTER            { $$ = BaseMachine($1); }
          ;
    %%

NDFA → DFA
- Find an equivalent DFA for Example 3
- States in the DFA are sets of states from the NDFA [keep track of all possible transitions]
- Result: start state {0,1,3}; {0,1,3} -a-> {2,6}; {0,1,3} -b-> {4}; {4} -c-> {5,6}
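The conversion can be sketched as the usual subset construction. The Python below is an illustration only: the dictionary encoding of the Example 3 NDFA (states 0 to 6, with 6 taken as the accepting state) and all function names are assumptions of the sketch.

```python
EPS = None  # label for epsilon transitions in this sketch

# Example 3 NDFA for (a | bc):
#   0 -eps-> 1 -a-> 2 -eps-> 6   and   0 -eps-> 3 -b-> 4 -c-> 5 -eps-> 6
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, EPS): {6},
    (3, "b"): {4},
    (4, "c"): {5},
    (5, EPS): {6},
}
ALPHABET = "abc"
NFA_START, NFA_ACCEPT = 0, 6


def closure(states):
    """Epsilon closure, returned as a frozenset so it can be a DFA state."""
    todo, seen = list(states), set(states)
    while todo:
        for t in NFA.get((todo.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return frozenset(seen)


def subset_construction():
    """Return (start, transitions, accepting) of the equivalent DFA."""
    start = closure({NFA_START})
    dfa, todo, seen = {}, [start], {start}
    while todo:
        state = todo.pop()
        for ch in ALPHABET:
            moved = set()
            for s in state:
                moved |= NFA.get((s, ch), set())
            if not moved:                 # no transition on ch: leave it undefined
                continue
            target = closure(moved)
            dfa[(state, ch)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    accepting = {q for q in seen if NFA_ACCEPT in q}
    return start, dfa, accepting
```

Running `subset_construction()` yields the four DFA states {0,1,3}, {2,6}, {4} and {5,6}, with {2,6} and {5,6} accepting.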

Exercise 1
1. Construct an NDFA that recognizes d*(.d | d.)d* (see pages 55-57 of the text)
2. Convert the NDFA from (1) to a DFA (see pages 56-58 of the text)

Solution 1.1
[NDFA state diagram: states 1-14 with edges labeled d, ., and ε]

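Solution 1.2
DFA states (sets of NDFA states):
    A = [1,2,4,5,8]
    B = [2,3,4,5,8,9]
    C = [6]
    D = [6,10,11,12,14]
    E = [7,11,12,14]
    F = [7,11,12,13,14]
    G = [11,12,13,14]
Transitions: A -d-> B, A -.-> C, B -d-> B, B -.-> D, C -d-> E, D -d-> F, E -d-> G, F -d-> G, G -d-> G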

Minimizing States in DFA
- There exists a unique minimal-state DFA for any language described by a regular expression
- Combine equivalent states: p ≡ q if, for each input string x, T(p,x) is an accepting state iff T(q,x) is an accepting state
- Initialize two sets of states: accepting and nonaccepting
- Partition any set whose states transition, on the same input, into more than one set

Exercise 2
1. Find a DFA equivalent to the one in Exercise 1 that minimizes the number of states (see page 59 of the text)
Initial partition: ABC (nonaccepting) and DEFG (accepting), with edges labeled d and .
Ambiguity: T(A,d) = T(B,d) ≠ T(C,d), so split ABC → AB, C

Solution 2
States: A, B, C, DEFG
Transitions: A -d-> B, B -d-> B, A -.-> C, B -.-> DEFG, C -d-> DEFG, DEFG -d-> DEFG
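The partitioning procedure can be checked mechanically. The Python sketch below refines the accepting/nonaccepting partition until no set splits. The transition table is transcribed from Solution 1.2 as best the transcript allows, with D, E, F and G taken as accepting; the function names and the treatment of a missing transition as a move to an implicit dead state are assumptions of the sketch.

```python
# DFA from Solution 1.2 for d*(.d | d.)d*; missing entries mean "no transition".
DELTA = {
    ("A", "d"): "B", ("A", "."): "C",
    ("B", "d"): "B", ("B", "."): "D",
    ("C", "d"): "E",
    ("D", "d"): "F",
    ("E", "d"): "G",
    ("F", "d"): "G",
    ("G", "d"): "G",
}
STATES = set("ABCDEFG")
ACCEPTING = set("DEFG")
ALPHABET = "d."


def block_of(partition, q):
    """Index of the block containing q, or None for a missing transition."""
    for i, block in enumerate(partition):
        if q in block:
            return i
    return None


def minimize(states, alphabet, delta, accepting):
    """Return the partition of states into sets of equivalent states."""
    partition = [b for b in (set(accepting), set(states) - set(accepting)) if b]
    changed = True
    while changed:
        changed = False
        refined = []
        for block in partition:
            groups = {}
            for q in block:
                # signature: which block each input symbol leads to
                sig = tuple(block_of(partition, delta.get((q, ch))) for ch in alphabet)
                groups.setdefault(sig, set()).add(q)
            refined.extend(groups.values())
            if len(groups) > 1:
                changed = True
        partition = refined
    return partition
```

On this table the refinement ends with the four sets {A}, {B}, {C} and {D,E,F,G}. This sketch splits on all input symbols in each round, so it may separate A, B and C in a single step rather than in the order shown in the exercise.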

Equivalence of Regular Expressions and DFA
- The languages accepted by finite automata are exactly those generated by regular expressions
- Given any regular expression R, there exists a finite state automaton M such that L(M) = L(R)
  - Proof is given by the previous construction
- Given any finite state automaton M, there exists a regular expression R such that L(R) = L(M)
  - The basic idea is to combine the transitions at each node along all paths that lead to an accepting state. The combinations of characters along the paths are described using regular expressions.

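Example 4
Create a regular expression for the language that consists of strings of a's and b's with an even number of a's.
Diagram: the DFA of Example 1 (start and accepting state S0, state S1; a toggles between them, b loops on each).
Regular expression: (b*ab*a)*b*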
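One way to gain confidence in a regular expression like this is to compare it against the defining property on every short string. The Python sketch below does that for the expression (b*ab*a)*b*, one correct form for this language, using the standard `re` module; the function names and the length bound of 8 are arbitrary choices of the sketch.

```python
import re
from itertools import product

# Each (b*ab*a) group consumes exactly two a's; the final b* absorbs trailing b's.
EVEN_AS = re.compile(r"(b*ab*a)*b*")


def has_even_as(s: str) -> bool:
    """The defining property of the language: an even number of a's."""
    return s.count("a") % 2 == 0


def regex_agrees(max_len: int = 8) -> bool:
    """Check the regex against the property on all strings over {a, b} up to max_len."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if (EVEN_AS.fullmatch(s) is not None) != has_even_as(s):
                return False
    return True
```

A bounded check like this is not a proof, but it catches slips such as forgetting the trailing b's, which would reject a string like aab.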

Grammars and Regular Expressions
- Given a regular expression R, there exists a grammar with syntactic category <S> such that L(R) = L(<S>)
- There are grammars for which there does NOT exist a regular expression R with L(<S>) = L(R)
  - <S> → a<S>b | ε
  - L(<S>) = {a^n b^n, n = 0,1,2,...}

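Example 5
Create a grammar that generates the language that consists of strings of a's and b's with an even number of a's.
Diagram: the DFA of Example 1 (start and accepting state S0, state S1; a toggles between them, b loops on each).
    <S0> → b<S0>
    <S0> → a<S1>
    <S0> → ε
    <S1> → b<S1>
    <S1> → a<S0>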

Proof that a^n b^n is not Recognized by a Finite State Automaton
To show that no finite state automaton recognizes the language L = {a^n b^n, n = 0,1,2,...}, we assume that some finite state automaton M recognizes L and show that this leads to a contradiction. Since M is a finite state automaton it has a finite number of states; let that number be m. Since M recognizes L, every string of the form a^k b^k must end in an accepting state. Choose such a string with k = n, where n is greater than m.

Proof that a^n b^n is not Recognized by a Finite State Automaton (continued)
Since n > m there must be a state s that is visited twice while the string a^n is read [we can only visit m distinct states, and since n > m, after reading (m+1) a's we must go to a state that was already visited]. Suppose that state s is reached after reading the strings a^j and a^k (j ≠ k). Since the same state is reached for both strings, the machine cannot distinguish strings that begin with a^j from strings that begin with a^k. Therefore it must either accept both or reject both of the strings a^j b^j and a^k b^j. However, a^j b^j should be accepted, while a^k b^j should not be accepted.

List Grammar
    <list> → ( <sequence> ) | ( )
    <sequence> → <listelement> , <sequence> | <listelement>
    <listelement> → <list> | NUMBER

Recursive Descent Parser

    list() {
        match('(');
        if token ≠ ')' then seq(); endif;
        match(')');
    }

Recursive Descent Parser (continued)

    seq() {
        elt();
        if token = ',' then match(','); seq(); endif;
    }

Recursive Descent Parser (continued)

    elt() {
        if token = '(' then list();
        else match(NUMBER);
        endif;
    }
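The three procedures list, seq and elt can be sketched as a small runnable parser. The Python below follows the same structure but also builds a nested Python list as its result. The token representation (the strings "(", ")", "," and plain integers for NUMBER), the "$" end marker, and the class and function names are all assumptions of this sketch.

```python
class ParseError(Exception):
    """Raised when the token stream does not match the list grammar."""


class ListParser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" is an end marker assumed here
        self.pos = 0

    @property
    def token(self):
        """The current lookahead token."""
        return self.tokens[self.pos]

    def match(self, expected):
        if self.token != expected:
            raise ParseError(f"expected {expected!r}, found {self.token!r}")
        self.pos += 1

    def list_(self):                 # <list> -> ( <sequence> ) | ( )
        self.match("(")
        items = []
        if self.token != ")":
            items = self.seq()
        self.match(")")
        return items

    def seq(self):                   # <sequence> -> <listelement> [, <sequence>]
        items = [self.elt()]
        if self.token == ",":
            self.match(",")
            items += self.seq()
        return items

    def elt(self):                   # <listelement> -> <list> | NUMBER
        if self.token == "(":
            return self.list_()
        if isinstance(self.token, int):
            value = self.token
            self.pos += 1
            return value
        raise ParseError(f"expected '(' or a number, found {self.token!r}")


def parse(tokens):
    """Parse a complete token sequence and return the nested list it denotes."""
    parser = ListParser(tokens)
    result = parser.list_()
    parser.match("$")                # reject trailing tokens
    return result
```

Each grammar rule becomes one method, and the single lookahead token decides which alternative to take, which is why the code mirrors the grammar so closely.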

Yacc (bison) Example

    %token NUMBER    /* needed to communicate with scanner */
    %%
    list: '(' sequence ')'               { printf("l -> ( seq )\n"); }
        | '(' ')'                        { printf("l -> () \n "); }
        ;
    sequence: listelement ',' sequence   { printf("seq -> LE,seq\n"); }
            | listelement                { printf("seq -> LE\n"); }
            ;
    listelement: NUMBER                  { printf("le -> %d\n", $1); }
               | list                    { printf("le -> L\n"); }
               ;
    %%
    /* since no code here, default main constructed that simply calls parser. */

Top-down vs. Bottom-up Parsing [figure; Copyright 2009 Elsevier]

Bottom-up Parsing: LR Grammar [figure; Copyright 2009 Elsevier]

LR Calculator Grammar (Figure 2.24)
    program → stmt_list $$
    stmt_list → stmt_list stmt | stmt
    stmt → id := expr | read id | write expr
    expr → term | expr add_op term
    term → factor | term mult_op factor
    factor → ( expr ) | id | number
    add_op → + | -
    mult_op → * | /

LL Calculator Grammar
Here is an LL(1) grammar (Fig 2.15):
    1. program → stmt_list $$
    2. stmt_list → stmt stmt_list
    3.           | ε
    4. stmt → id := expr
    5.      | read id
    6.      | write expr
    7. expr → term term_tail
    8. term_tail → add_op term term_tail
    9.           | ε

LL Calculator Grammar
LL(1) grammar (continued):
    10. term → factor fact_tail
    11. fact_tail → mult_op factor fact_tail
    12.           | ε
    13. factor → ( expr )
    14.        | id
    15.        | number
    16. add_op → +
    17.        | -
    18. mult_op → *
    19.         | /

Predictive Parser
- Predict which rule to match: choose A → α when
  - the next token can start α, or
  - α ⇒* ε and the next token can follow A
- PREDICT(A → α) = FIRST(α) ∪ (FOLLOW(A) if EPS(α))
- Example: PREDICT(program → stmt_list $$)
  - FIRST(stmt_list) = {id, read, write}
  - FOLLOW(stmt_list) = {$$}
  - Since stmt_list can derive ε, the result is {id, read, write, $$}
- The intersection of the PREDICT sets for productions with the same lhs must be empty

Recursive Descent Parser

    procedure program
        case input_token of
            id, read, write, $$ : stmt_list; match($$)
            otherwise error
    procedure stmt_list
        case input_token of
            id, read, write : stmt; stmt_list
            $$ : skip
            otherwise error
    procedure stmt
        case input_token of
            id : match(id); match(:=); expr
            read : match(read); match(id)
            write : match(write); expr
            otherwise error
    procedure expr
        case input_token of
            id, number, ( : term; term_tail
            otherwise error
    procedure term_tail
        case input_token of
            +, - : add_op; term; term_tail
            ), id, read, write, $$ : skip
            otherwise error
    procedure term
        case input_token of
            id, number, ( : factor; fact_tail
            otherwise error

Exercise 3
Trace through the recursive descent parser and build the parse tree for the following program:
    1. read A
    2. read B
    3. sum := A + B
    4. write sum
    5. write sum / 2

Solution 3 [figure; Copyright 2009 Elsevier]

Table-Driven LL Parser [figure; Copyright 2009 Elsevier]

Table-Driven LL Parser

    Parse stack              Input stream     Comment
    program                  read A read B    initial stack contents
    stmt_list $$             read A read B    program → stmt_list $$
    stmt stmt_list $$        read A read B    stmt_list → stmt stmt_list
    read id stmt_list $$     read A read B    stmt → read id
    id stmt_list $$          A read B         match(read)
    stmt_list $$             read B           match(id)
    stmt stmt_list $$        read B           stmt_list → stmt stmt_list
    ...
    stmt_list $$                              term_tail → ε
    $$                                        stmt_list → ε

Computing First, Follow, Predict
Algorithm First/Follow/Predict:
- FIRST(α) = {c : α ⇒* c β}
- FOLLOW(A) = {c : S ⇒+ α A c β}
- EPS(α) = if α ⇒* ε then true else false
- PREDICT(A → α) = FIRST(α) ∪ (if EPS(α) then FOLLOW(A) else ∅)
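These definitions can be turned into a fixed-point computation. The Python sketch below computes EPS, FIRST, FOLLOW and PREDICT for the LL(1) calculator grammar, with productions listed in the same order as the numbered grammar so that `predict[i]` belongs to production i+1. The list-of-pairs grammar encoding and the function names are assumptions of the sketch; the iteration simply repeats until no set grows.

```python
# LL(1) calculator grammar; an empty right-hand side stands for epsilon.
GRAMMAR = [
    ("program", ["stmt_list", "$$"]),
    ("stmt_list", ["stmt", "stmt_list"]),
    ("stmt_list", []),
    ("stmt", ["id", ":=", "expr"]),
    ("stmt", ["read", "id"]),
    ("stmt", ["write", "expr"]),
    ("expr", ["term", "term_tail"]),
    ("term_tail", ["add_op", "term", "term_tail"]),
    ("term_tail", []),
    ("term", ["factor", "fact_tail"]),
    ("fact_tail", ["mult_op", "factor", "fact_tail"]),
    ("fact_tail", []),
    ("factor", ["(", "expr", ")"]),
    ("factor", ["id"]),
    ("factor", ["number"]),
    ("add_op", ["+"]),
    ("add_op", ["-"]),
    ("mult_op", ["*"]),
    ("mult_op", ["/"]),
]


def analyze(grammar):
    """Return (first, follow, predict); predict[i] is the set for production i."""
    nonterminals = {lhs for lhs, _ in grammar}
    eps = {a: False for a in nonterminals}
    first = {a: set() for a in nonterminals}
    follow = {a: set() for a in nonterminals}

    def first_of(symbols):
        """FIRST and EPS of a string of grammar symbols."""
        out = set()
        for x in symbols:
            if x not in nonterminals:      # a terminal starts the string
                out.add(x)
                return out, False
            out |= first[x]
            if not eps[x]:
                return out, False
        return out, True                   # every symbol can derive epsilon

    changed = True
    while changed:                         # iterate to a fixed point
        changed = False
        for lhs, rhs in grammar:
            f, e = first_of(rhs)
            if not f <= first[lhs] or (e and not eps[lhs]):
                first[lhs] |= f
                eps[lhs] = eps[lhs] or e
                changed = True
            for i, x in enumerate(rhs):
                if x in nonterminals:
                    f, e = first_of(rhs[i + 1:])
                    new = f | (follow[lhs] if e else set())
                    if not new <= follow[x]:
                        follow[x] |= new
                        changed = True

    predict = []
    for lhs, rhs in grammar:
        f, e = first_of(rhs)
        predict.append(f | (follow[lhs] if e else set()))
    return first, follow, predict


def is_ll1(grammar, predict):
    """True if PREDICT sets of productions with the same lhs are disjoint."""
    seen = {}
    for (lhs, _), p in zip(grammar, predict):
        if seen.setdefault(lhs, set()) & p:
            return False
        seen[lhs] |= p
    return True
```

The resulting PREDICT sets are exactly the case labels a recursive descent parser needs, for example {), id, read, write, $$} for term_tail → ε.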

Predict Set for LL Parser [figure; Copyright 2009 Elsevier]

LR Parsing
- Bottom-up (rightmost derivation)
- Maintain a forest of partially completed subtrees of the parse tree
- Join trees together when recognizing the symbols in the rhs of a production
- Keep the roots of partially completed trees on a stack
  - Shift when a new token arrives
  - Reduce when the top symbols match a rhs
- Table driven

Top-down vs. Bottom-up Parsing [figure; Copyright 2009 Elsevier]

LR Parsing Example

    Stack                                  Remaining input
    ε                                      A, B, C;
    id(A)                                  , B, C;
    id(A) ,                                B, C;
    id(A) , id(B)                          , C;
    id(A) , id(B) ,                        C;
    id(A) , id(B) , id(C)                  ;
    id(A) , id(B) , id(C) ;
    id(A) , id(B) , id(C) id_list_tail
    id(A) , id(B) id_list_tail
    id(A) id_list_tail
    id_list
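The sequence of stacks in this example can be reproduced with a very small shift-reduce loop. The Python sketch below is not a real LR parser: it has no parse table and simply reduces whenever the top of the stack matches a right-hand side, trying longer right-hand sides first, which happens to be enough for this grammar. The three productions are assumed from the stack contents shown in the example, and the function name is hypothetical.

```python
# Productions assumed from the trace; longer right-hand sides are listed first
# so the greedy match prefers them.
PRODUCTIONS = [
    ("id_list_tail", [",", "id", "id_list_tail"]),
    ("id_list", ["id", "id_list_tail"]),
    ("id_list_tail", [";"]),
]


def shift_reduce(tokens):
    """Return (accepted, trace), where trace lists the stack after each step."""
    stack, trace = [], []
    remaining = list(tokens)
    while True:
        for lhs, rhs in PRODUCTIONS:
            if stack[-len(rhs):] == rhs:       # top of stack matches a rhs: reduce
                del stack[-len(rhs):]
                stack.append(lhs)
                break
        else:
            if not remaining:                  # nothing to reduce or shift: stop
                break
            stack.append(remaining.pop(0))     # shift the next token
        trace.append(list(stack))
    return stack == ["id_list"], trace
```

For the token sequence id , id , id ; the trace contains the same ten stacks as the example, ending with the single symbol id_list.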

LR Calculator Grammar (Figure 2.24, Page 73):
    1. program → stmt_list $$
    2. stmt_list → stmt_list stmt
    3.           | stmt
    4. stmt → id := expr
    5.      | read id
    6.      | write expr
    7. expr → term
    8.      | expr add_op term

LR Calculator Grammar
LR grammar (continued):
    9. term → factor
    10.     | term mult_op factor
    11. factor → ( expr )
    12.        | id
    13.        | number
    14. add_op → +
    15.        | -
    16. mult_op → *
    17.         | /

LR Parser State
- Keep track of the set of productions we might be in, along with where in those productions we might be
- Initial state for the calculator grammar:
    program → . stmt_list $$        // basis
    stmt_list → . stmt_list stmt    // yield (this item and those below)
    stmt_list → . stmt
    stmt → . id := expr
    stmt → . read id
    stmt → . write expr
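The step from the basis item to the full initial state is a closure computation: whenever the dot stands before a nonterminal, add every production of that nonterminal with the dot at the start. The Python sketch below does this for the LR calculator grammar. Representing an item as a (production index, dot position) pair, and the names `closure` and `show`, are assumptions of the sketch.

```python
# LR calculator grammar, productions in the order of the numbered list.
GRAMMAR = [
    ("program", ("stmt_list", "$$")),
    ("stmt_list", ("stmt_list", "stmt")),
    ("stmt_list", ("stmt",)),
    ("stmt", ("id", ":=", "expr")),
    ("stmt", ("read", "id")),
    ("stmt", ("write", "expr")),
    ("expr", ("term",)),
    ("expr", ("expr", "add_op", "term")),
    ("term", ("factor",)),
    ("term", ("term", "mult_op", "factor")),
    ("factor", ("(", "expr", ")")),
    ("factor", ("id",)),
    ("factor", ("number",)),
    ("add_op", ("+",)),
    ("add_op", ("-",)),
    ("mult_op", ("*",)),
    ("mult_op", ("/",)),
]


def closure(items):
    """Close a set of (production index, dot position) items."""
    result = set(items)
    todo = list(items)
    while todo:
        prod, dot = todo.pop()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs):
            symbol = rhs[dot]                    # symbol right after the dot
            for i, (lhs, _) in enumerate(GRAMMAR):
                if lhs == symbol and (i, 0) not in result:
                    result.add((i, 0))
                    todo.append((i, 0))
    return result


def show(item):
    """Format an item in the dotted notation used above."""
    prod, dot = item
    lhs, rhs = GRAMMAR[prod]
    symbols = list(rhs[:dot]) + ["."] + list(rhs[dot:])
    return f"{lhs} -> {' '.join(symbols)}"
```

Here `closure({(0, 0)})` produces the six items of the initial state, and `show((0, 0))` prints the basis item as `program -> . stmt_list $$`.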

LR Parser States [figure; Copyright 2009 Elsevier]

Characteristic Finite State Machine [figure; Copyright 2009 Elsevier]

LR Parser Table [figure; Copyright 2009 Elsevier]

LR Parsing Example [figure; Copyright 2009 Elsevier]