Finite State Automata are Limited. Let us use (context-free) grammars!


Context-Free Grammar for a^n b^n
S ::= ε        (a grammar rule)
S ::= a S b    (another grammar rule)
Example of a derivation:
  S => aSb => aaSbb => aaaSbbb => aaabbb
Parse tree: the leaves, read left to right, give us the result.
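
As a small preview of recursive descent parsing later in this lecture (my own sketch, not part of the slide), the two rules can be read directly as a recursive program that checks whether a string is of the form a^n b^n; the helper name parseS is made up:

  // Sketch: recognizer for { a^n b^n | n >= 0 }, mirroring S ::= ε and S ::= a S b.
  def isAnBn(s: String): Boolean = {
    // returns Some(position right after one S), or None if no S can be read here
    def parseS(i: Int): Option[Int] =
      if (i < s.length && s(i) == 'a')
        parseS(i + 1).filter(j => j < s.length && s(j) == 'b').map(_ + 1)   // S ::= a S b
      else Some(i)                                                          // S ::= ε
    parseS(0).contains(s.length)
  }
  // isAnBn("aaabbb") == true, isAnBn("aab") == false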

Context-Free Grammars
G = (A, N, S, R)
  A - terminals (alphabet for generated words w ∈ A*)
  N - non-terminals, symbols with (recursive) definitions
  S ∈ N - the starting symbol
  R - grammar rules, pairs (n, v), written n ::= v, where
      n ∈ N is a non-terminal
      v ∈ (A ∪ N)* is a sequence of terminals and non-terminals
A derivation in G starts from the starting symbol S. Each step replaces a non-terminal with one of its right-hand sides.
Example from before: G = ({a,b}, {S}, S, {(S, ε), (S, aSb)})
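
To make the 4-tuple concrete (my own Scala encoding, not from the lecture; Sym, Terminal, NonTerminal, Rule and Grammar are made-up names), the example grammar can be written down as data:

  // A grammar as the 4-tuple (A, N, S, R)
  sealed trait Sym
  case class Terminal(c: Char) extends Sym
  case class NonTerminal(name: String) extends Sym

  case class Rule(lhs: NonTerminal, rhs: List[Sym])        // a pair (n, v), written n ::= v
  case class Grammar(terminals: Set[Terminal],
                     nonTerminals: Set[NonTerminal],
                     start: NonTerminal,
                     rules: Set[Rule])

  // The grammar from before: G = ({a,b}, {S}, S, {(S, ε), (S, aSb)})
  val S = NonTerminal("S")
  val (a, b) = (Terminal('a'), Terminal('b'))
  val G = Grammar(Set(a, b), Set(S), S,
                  Set(Rule(S, Nil), Rule(S, List(a, S, b))))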

Parse Tree
Given a grammar G = (A, N, S, R), t is a parse tree of G iff t is a node-labelled tree with ordered children that satisfies:
  - the root is labelled by S
  - the leaves are labelled by elements of A
  - each non-leaf node is labelled by an element of N
  - for each non-leaf node labelled by n whose children, left to right, are labelled by p_1 ... p_k, we have a rule (n ::= p_1 ... p_k) ∈ R
The yield of a parse tree t is the unique word in A* obtained by reading the leaves of t from left to right.
Language of a grammar G = the yields of all parse trees of G:
  L(G) = { yield(t) | isParseTree(G, t) }
isParseTree is an easy-to-check condition. Harder: knowing whether a word has a parse tree.
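
A possible Scala rendering of parse trees and their yield (my own sketch; Leaf, Node and yieldOf are made-up names):

  // A parse tree is a leaf labelled by a terminal character, or an inner node
  // labelled by a non-terminal name with an ordered list of children.
  sealed trait ParseTree
  case class Leaf(terminal: Char) extends ParseTree
  case class Node(nonTerminal: String, children: List[ParseTree]) extends ParseTree

  // Read the leaves from left to right.
  def yieldOf(t: ParseTree): String = t match {
    case Leaf(c)           => c.toString
    case Node(_, children) => children.map(yieldOf).mkString
  }

  // Parse tree for the derivation S => aSb => ab in the a^n b^n grammar; its yield is "ab".
  val example = Node("S", List(Leaf('a'), Node("S", Nil), Leaf('b')))
  // yieldOf(example) == "ab"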

Grammar Derivation
A derivation for G is any sequence of words p_i ∈ (A ∪ N)* such that:
  - the first word is S
  - each subsequent word is obtained from the previous one by replacing one of its letters by the right-hand side of a rule in R: p_i = u n v, (n ::= q) ∈ R, p_{i+1} = u q v
  - the last word has only letters from A
Each parse tree of a grammar has one or more derivations, which result in expanding the tree gradually from S. Different orders of expanding non-terminals may generate the same tree.
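
For example (a small illustration of this point, not on the slide): with the rules S ::= P Q, P ::= a, Q ::= b, the two derivations
  S => PQ => aQ => ab     (expanding the leftmost non-terminal first)
  S => PQ => Pb => ab     (expanding the rightmost non-terminal first)
expand non-terminals in different orders, yet both correspond to the same parse tree: root S with children P and Q, whose leaves are a and b.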

Remark: We abbreviate the two rules S ::= p and S ::= q as S ::= p | q

Example: Parse Tree vs Derivation
Consider this grammar G = ({a,b}, {S,P,Q}, S, R) where R is:
  S ::= P Q
  P ::= a | a P
  Q ::= ε | a Q b
Show a derivation tree for aaaabb.
Show at least two derivations that correspond to that tree.

Balanced Parentheses Grammar
Consider the language L consisting of precisely those words consisting of parentheses ( and ) that are balanced (each parenthesis has a matching one).
Example sequences of parentheses:
  ( ( () ) ())   - balanced, belongs to the language
  ( ) ) ( ( )    - not balanced, does not belong
Exercise: give the grammar and an example derivation for the first string.

Balanced Parentheses Grammar

Proving that a Grammar Defines a Language
Grammar G defines the language L(G):
  S ::= ε | ( S ) S
Theorem: L(G) = L_b where
  L_b = { w | for every pair u, v of words such that uv = w, the number of ( symbols in u is greater than or equal to the number of ) symbols in u, and these two numbers are equal in w }
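
Before the proof, the definition of L_b can be read as an executable check (my own sketch, not part of the lecture): scan the word while tracking the running difference between the number of ( and ) seen so far.

  // Membership test for L_b: no prefix may have more ')' than '(',
  // and the two counts must be equal for the whole word.
  def inLb(w: String): Boolean = {
    var diff = 0                 // number of '(' minus number of ')' in the prefix read so far
    var ok = true
    for (c <- w) {
      if (c == '(') diff += 1 else diff -= 1
      if (diff < 0) ok = false   // this prefix has more ')' than '('
    }
    ok && diff == 0
  }
  // inLb("(()())") == true,  inLb("())((") == false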

L(G) ⊆ L_b: If w ∈ L(G), then it has a parse tree. We show w ∈ L_b by induction on the size of the parse tree deriving w using G. If the tree has one node, the derived word is ε, and ε ∈ L_b, so we are done. Suppose the property holds for all trees of size less than n, and consider a tree of size n. The root of the tree is expanded using the rule S ::= ( S ) S. The yields of the subtrees for the first and the second S belong to L_b by the induction hypothesis, so the derived word w is of the form (p)q where p, q ∈ L_b. Let us check that (p)q ∈ L_b. Let (p)q = uv and count the number of ( and ) in u. If u = ε then it satisfies the property. If its length is at most |p|+1, then u is ( followed by a prefix of p, so it has at least one more ( than ). Otherwise u is of the form (p)q1 where q1 is a prefix of q. Because the parentheses balance out in p, and thus in (p), the difference between the number of ( and ) in u is equal to this difference in q1, which is a prefix of q, so it satisfies the property. Thus u satisfies the property in every case. Finally, the numbers of ( and ) are equal in p and in q, hence also in w = (p)q, so w ∈ L_b.

L_b ⊆ L(G): If w ∈ L_b, we need to show that it has a parse tree. We do so by induction on |w|. If w = ε then it has a tree of size one (only the root). Otherwise, suppose all words in L_b of length less than n have a parse tree using G, and let w ∈ L_b with |w| = n > 0. (Picture the running difference between the number of ( and ) as we read w from left to right.) We split w in the following way: let p1 be the shortest non-empty prefix of w such that the number of ( in p1 equals the number of ). Such a prefix always exists and is non-empty, but it could be equal to w itself. Note that p1 = (p) for some p: because p1 is a prefix of a word in L_b, its first symbol must be (, and because its two counts are equal, its last symbol must be ). Therefore, w = (p)q for some shorter words p and q. Because we chose p1 to be shortest, every non-empty proper prefix of (p has at least one more ( than ); therefore every prefix of p has a greater or equal number of (, and the counts are equal in p, so p ∈ L_b. Next, for prefixes of w of the form (p)v, the difference between ( and ) equals this difference in v itself, since (p) is balanced; thus every prefix of q has at least as many ( as ), and the counts in q are equal, so q ∈ L_b as well. We have thus shown that w is of the form (p)q where p, q ∈ L_b. By the induction hypothesis, p and q have parse trees, so the rule S ::= ( S ) S gives a parse tree for w.

Exercise: Grammar Equivalence
Show that each string that can be derived by grammar G1
  B ::= ε | ( B ) | B B
can also be derived by grammar G2
  B ::= ε | ( B ) B
and vice versa. In other words, L(G1) = L(G2).
Remark: there is no algorithm to check the equivalence of arbitrary grammars. We must be clever.

Grammar Equivalence (Easy)
G1: B ::= ε | ( B ) | B B
G2: B ::= ε | ( B ) B
Lemma: Each word over the alphabet A = {(,)} that can be derived by G2 can also be derived by G1.
Proof: Consider a derivation of a word w from G2. We construct a derivation of w in G1 by replacing each step with one or more steps that produce the same effect. There are two cases, depending on the rule used in the G2 step:
  uBv => uv         replace by the same step uBv => uv (B ::= ε is also a rule of G1)
  uBv => u(B)Bv     replace by uBv => uBBv => u(B)Bv (using B ::= B B, then B ::= ( B ))
This constructs a valid derivation in G1.
Corollary: L(G2) ⊆ L(G1)
Lemma: L(G1) ⊆ L_b (words derived by G1 are balanced parentheses). Proof: very similar to the proof of L(G2) ⊆ L_b from before.
Lemma: L_b ⊆ L(G2). This was one direction of the earlier proof that L_b = L(G2).
Corollary: L(G2) = L(G1) = L_b
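
Since grammar equivalence is undecidable in general, the proof above is genuinely needed; still, one can gain confidence by testing first (my own experiment, not from the lecture; the helper words and the bound maxLen are made-up names). The sketch below enumerates, bottom-up, every word of length at most maxLen derivable from the single non-terminal B, so the two grammars can be compared on all short words.

  // rules: right-hand sides of B, written as strings over { '(', ')', 'B' }
  def words(rules: List[String], maxLen: Int): Set[String] = {
    // All ways to replace the B's in one right-hand side by already-known words,
    // keeping only results of length <= maxLen.
    def instantiate(rhs: String, known: Set[String]): Set[String] =
      rhs.headOption match {
        case None      => Set("")
        case Some('B') => for { w <- known; rest <- instantiate(rhs.tail, known)
                                if (w + rest).length <= maxLen } yield w + rest
        case Some(c)   => for { rest <- instantiate(rhs.tail, known)
                                if rest.length + 1 <= maxLen } yield s"$c$rest"
      }
    // Bottom-up fixpoint: keep plugging known words of B into the rules until nothing new appears.
    var known = Set.empty[String]
    var changed = true
    while (changed) {
      val next = known ++ rules.flatMap(rhs => instantiate(rhs, known))
      changed = next != known
      known = next
    }
    known
  }

  val g1 = List("", "(B)", "BB")    // G1: B ::= ε | ( B ) | B B
  val g2 = List("", "(B)B")         // G2: B ::= ε | ( B ) B
  // words(g1, 8) == words(g2, 8)   -- agreement on all words up to length 8: evidence, not a proof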

Regular Languages and Grammars
Exercise: give a grammar describing the same language as this regular expression: (a|b)(ab)*b*

Translating a Regular Expression into a Grammar
Suppose we first allow the regular expression operators * and | within grammars. Then the regular expression R becomes simply the rule S ::= R. Then give rules to remove *, by introducing new non-terminal symbols (see the worked example below).
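
For example, for the exercise above one possible answer (my own, using fresh non-terminals C, X, Y) is:
  S ::= C X Y
  C ::= a | b
  X ::= ε | a b X
  Y ::= ε | b Y
where C generates (a|b), X generates (ab)*, and Y generates b*.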

Eliminating Additional Notation
  Alternatives: s ::= P | Q becomes the two rules s ::= P and s ::= Q
  Parenthesized groups: introduce a fresh non-terminal, e.g. for the operator group in expr (&& | < | == | + | - | * | / | %) expr
  Kleene star: e.g. { statmt* }
  Option: use an alternative with epsilon, e.g. if ( expr ) statmt (else statmt)?
(See the sketch below.)
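
For instance (a sketch with a fresh non-terminal statmtList of my own choosing): the block rule { statmt* } becomes { statmtList } with statmtList ::= ε | statmt statmtList, and the optional else in if ( expr ) statmt (else statmt)? becomes the two alternatives if ( expr ) statmt and if ( expr ) statmt else statmt.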

Grammars for Natural Language
  Statement ::= Sentence "."
  Sentence ::= Simple | Belief
  Simple ::= Person liking Person
  liking ::= "likes" | "does" "not" "like"
  Person ::= "Barack" | "Helga" | "John" | "Snoopy"
  Belief ::= Person believing "that" Sentence but
  believing ::= "believes" | "does" "not" "believe"
  but ::= "" | "," "but" Sentence
Exercise: draw the derivation tree for:
  John does not believe that Barack believes that Helga likes Snoopy, but Snoopy believes that Helga likes Barack.
Grammars can also be used to automatically generate essays.

While Language Syntax
This syntax is given by a context-free grammar:
  program ::= statmt*
  statmt ::= println ( stringconst , ident )
           | ident = expr
           | if ( expr ) statmt (else statmt)?
           | while ( expr ) statmt
           | { statmt* }
  expr ::= intliteral | ident
         | expr (&& | < | == | + | - | * | / | %) expr
         | ! expr | - expr

Source code example:  id3 = 0 while (id3 < 10) { println(,id3); id3 = id3 + 1 }
[Figure: the compiler (scalac, gcc) pipeline. The lexer turns the stream of characters (i, d, 3, =, 0, LF, w, ...) into words (tokens): id3, =, 0, while, (, id3, <, 10, ), ...; the parser then turns the tokens into trees, e.g. an assign node and a while node with its condition and body.]

Recursive Descent Parsing - Manually
  - a weak, but useful parsing technique
  - to make it work, we might need to transform the grammar

Recursive Descent is Decent
  descent = a movement downward
  decent = adequate, good enough
Recursive descent is a decent parsing technique:
  - can be easily implemented manually based on the grammar (which may require transformation)
  - efficient (linear) in the size of the token sequence
Correspondence between grammar and code:
  concatenation     ->  ; (sequencing)
  alternative (|)   ->  if
  repetition (*)    ->  while
  non-terminal      ->  recursive procedure

A Rule of While Language Syntax
// Where things work very nicely for recursive descent!
statmt ::= println ( stringconst , ident )
         | ident = expr
         | if ( expr ) statmt (else statmt)?
         | while ( expr ) statmt
         | { statmt* }

Parser for the statmt (rule -> code)

def skip(t: Token) = if (lexer.token == t) lexer.next else error("Expected " + t)

// statmt ::=
def statmt = {
  // println ( stringconst , ident )
  if (lexer.token == Println) { lexer.next
    skip(openparen); skip(stringconst); skip(comma); skip(identifier); skip(closedparen)
  // ident = expr
  } else if (lexer.token == Ident) { lexer.next
    skip(equality); expr
  // if ( expr ) statmt (else statmt)?
  } else if (lexer.token == ifkeyword) { lexer.next
    skip(openparen); expr; skip(closedparen); statmt
    if (lexer.token == elsekeyword) { lexer.next; statmt }
  // while ( expr ) statmt

Continuing the Parser for the Rule

  // while ( expr ) statmt
  } else if (lexer.token == whilekeyword) { lexer.next
    skip(openparen); expr; skip(closedparen); statmt
  // { statmt* }
  } else if (lexer.token == openbrace) { lexer.next
    while (isfirstofstatmt) { statmt }
    skip(closedbrace)
  } else {
    error("Unknown statement, found token " + lexer.token)
  }
}

How to Construct the if Conditions?
statmt ::= println ( stringconst , ident )
         | ident = expr
         | if ( expr ) statmt (else statmt)?
         | while ( expr ) statmt
         | { statmt* }
Look at what each alternative starts with to decide what to parse. Here we have terminals at the beginning of each alternative. More generally, we use the first computation, as for regular expressions.
Consider a grammar G and a non-terminal N:
  L_G(N) = the set of strings that N can derive
    e.g. L_G(statmt) = all statements of the While language
  first(N) = { a | aw ∈ L_G(N), a a terminal, w a string of terminals }
    first(statmt) = { println, ident, if, while, { }
    first(while ( expr ) statmt) = { while }
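
Based on this first set, the isfirstofstatmt test used in the { statmt* } branch of the parser above can be written as a simple membership check (a sketch reusing the token names assumed in the parser code; not shown in the lecture):

  // first(statmt) = { println, ident, if, while, { }
  def isfirstofstatmt: Boolean =
    Set(Println, Ident, ifkeyword, whilekeyword, openbrace).contains(lexer.token)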