Compilers Course Lecture 4: Context-Free Grammars

Example: an attempt to define simple arithmetic expressions using named regular expressions:

  num  = [0-9]+
  sum  = expr "+" expr
  expr = "(" sum ")" | num

This appears to generate "2", "(3+4)", "(3+(4+5))", etc. However, expr is not a regular language, because:

- If we have scanned N "("s, then we must also scan N ")"s, and finite automata cannot do that for arbitrarily large N.
- expr is a self-recursive definition, which implies that it is not a regular expression.

We need something stronger: Context-Free Grammars.

Context-Free Grammars (CFGs)

A CFG G is a 4-tuple (V, T, P, S), where:

- V is a finite set of variable symbols. Very often the variables are instead called the nonterminals, and the set V is then called N.
- T is a finite set of terminal symbols. V and T are disjoint.
- P is a finite set of productions A → α, where A ∈ V and α ∈ (V ∪ T)*, i.e., α is a string of terminals and variables.
- S is the start symbol, S ∈ V.

Example:

  V = { S }
  T = { "(", ")" }
  P = { S → (S), S → ε }

This is typically written as

  S → (S)
  S → ε

or as

  S → (S) | ε

or as

  S ::= (S) | ε

with the sets V and T being implicit.
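As a small illustration (my own sketch, not from the lecture), the 4-tuple can be written down directly as data. The Python fragment below encodes the balanced-parentheses grammar above and enumerates some strings it generates; the helper names sentences and expand are invented for this example.

from itertools import product

# A CFG as the 4-tuple (V, T, P, S).  P maps each variable to its alternatives;
# an alternative is a tuple of symbols, and the empty tuple () stands for epsilon.
V = {"S"}
T = {"(", ")"}
P = {"S": [("(", "S", ")"), ()]}
S = "S"

def sentences(max_depth):
    """All terminal strings derivable from S using at most max_depth nested expansions."""
    def expand(symbol, depth):
        if symbol in T:
            return {symbol}
        if depth == 0:
            return set()                      # cannot expand a variable any further
        results = set()
        for alternative in P[symbol]:
            # Expand every symbol of the alternative and concatenate the pieces.
            parts = [expand(sym, depth - 1) for sym in alternative]
            if all(parts):                    # every symbol produced at least one string
                for combo in product(*parts):
                    results.add("".join(combo))
        return results
    return expand(S, max_depth)

print(sorted(sentences(4), key=len))          # ['', '()', '(())', '((()))']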

Derivations

A grammar G defines a derivation relation, ⇒, as follows:

  α A γ ⇒ α β γ   if A → β ∈ P

where A is a variable, and α, β, and γ are strings over V ∪ T. The relation ⇒ is then extended to ⇒*:

  α ⇒* α
  α ⇒* γ   if α ⇒* β and β ⇒ γ

That is, we can apply ⇒ zero or more times. Finally, the language of a grammar G, L(G), is defined as:

  L(G) = { w | S ⇒* w, w ∈ T* }

That is, the set of strings of terminals w that can be generated by derivations starting from the start symbol S. These grammars are called "context-free" because the context surrounding a variable, α and γ above, does not affect the applicability of a production for that variable.

If S ⇒* w and w ∈ T*, then w is called a sentence. If S ⇒* w and w ∈ (V ∪ T)*, then w is called a sentential form.

Example: G: S → (S) | ε

  S ⇒ (S) ⇒ ((S)) ⇒ (((S))) ⇒ ... ⇒ (^i S )^i ⇒ (^i )^i

  L(G) = { (^i )^i | i >= 0 }

Example: G is

  (P0) stmt → id := expr
  (P1) stmt → if expr then stmt
  (P2) stmt → stmt ; stmt
  (P3) expr → id
  (P4) expr → num

Evidently V = { stmt, expr } and T = { id, num, :=, ;, if, then }. We assume stmt is the start symbol since it is defined first. Now:

  stmt ⇒(P1) if expr then stmt
       ⇒(P3) if id then stmt
       ⇒(P0) if id then id := expr
       ⇒(P4) if id then id := num

This sequence is called a derivation. We have just proved that "if id then id := num" ∈ L(G). For clarity we underline the variable that is being rewritten, and indicate the production used after the ⇒ sign.

Left-most and Right-most derivations

This was a left-most derivation, where in each step the left-most variable is rewritten (written ⇒lm). There are also right-most derivations, where in each step the right-most variable is rewritten (written ⇒rm). This does not change the generated language, but it becomes relevant later when specific parsing algorithms are introduced.

Parsing and Parsers

To "parse" a string of terminals is to construct a derivation from the start symbol to that string. A "parser" is a device that parses strings, or rejects them if they cannot be parsed.
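As a tiny illustration of such a device (my own sketch, not part of the lecture), here is a recognizer for the grammar S → (S) | ε: it either consumes a whole string of parentheses or rejects it. The names parse_S and accepts are invented for the example.

def parse_S(s, i=0):
    """Try to match one S starting at position i; return the position after it."""
    if i < len(s) and s[i] == "(":        # S -> (S)
        i = parse_S(s, i + 1)
        if i < len(s) and s[i] == ")":
            return i + 1
        raise SyntaxError("expected ')'")
    return i                              # S -> epsilon

def accepts(s):
    try:
        return parse_S(s) == len(s)       # the whole input must be consumed
    except SyntaxError:
        return False

print([accepts(s) for s in ["", "()", "((()))", "(()", ")("]])
# [True, True, True, False, False]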

Parse Trees

A yes/no answer is insufficient. A compiler needs to see the structure of the token stream. Parse trees are those structures. Normally these trees are written with the root at the top and subtrees and leaves further down. In compilers, all further interpretation of the input program is based on its parse tree.

Ambiguity

Consider the grammar E → num | E + E | E * E, and the input "1+2*3".

Derivation 1:

  E ⇒ E*E ⇒ (E+E)*E ⇒ (1+E)*E ⇒ (1+2)*E ⇒ (1+2)*3

(Parentheses shown for clarity; they are not actual terminals.)

Derivation 2:

  E ⇒ E+E ⇒ 1+E ⇒ 1+(E*E) ⇒ 1+(2*E) ⇒ 1+(2*3)

But this results in two different parse trees, which when interpreted would evaluate to different results (9 vs 7).

Definition: A grammar is ambiguous if its language includes strings for which there exist several different parse trees (derivations).
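To make the "9 vs 7" point concrete, here is a small sketch (my own encoding, not from the lecture) of the two parse trees as nested tuples, together with an evaluator.

import operator

# The two parse trees of "1+2*3" from the derivations above, encoded as nested tuples.
OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left), evaluate(right))
    return tree                           # a leaf is just a number

tree1 = ("*", ("+", 1, 2), 3)             # Derivation 1: (1+2)*3
tree2 = ("+", 1, ("*", 2, 3))             # Derivation 2: 1+(2*3)
print(evaluate(tree1), evaluate(tree2))   # 9 7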

Example: Expressions

Consider the naive expression grammar

  E → E + E | E * E | (E) | num | id

This grammar fails to express precedence (priority) and associativity.

Precedence

Conventionally "*" binds tighter than "+", so E1+E2*E3 should be parsed as E1+(E2*E3), not as (E1+E2)*E3. The way to ensure this is to rewrite the grammar so that E1+E2*E3 cannot be parsed as (E1+E2)*E3, which forces the other (correct) derivation. To prevent the incorrect derivation we introduce new non-terminals, separate "+" expressions from "*" expressions, and allow "+" expressions to have "*" expressions as subexpressions, but not vice versa:

  E → E+E | T
  T → T*T | F
  F → id | num | (E)

The left-hand operand of "*" must be a T, but T cannot derive E+E, so E1+E2*E3 can only be parsed as E1+(E2*E3). Splitting the productions into several precedence levels, with careful references between the levels, is a common "design pattern" for expressing precedence in expression grammars.

Associativity

The productions "E → E+E | T" do not express the associativity of the "+" operator, so E1+E2+E3 could parse as (E1+E2)+E3 or as E1+(E2+E3). Replace "+" by "-" or "/" and it is obvious that this ambiguity is a problem.

"E → E+E | T" generates sentential forms "T1+T2+...+Tn". We do not want to change that; what we do want is to control whether they parse as (((T1+T2)+T3)+...)+Tn or as T1+(T2+(T3+(...+Tn))).

If we want left associativity, then we must prevent derivations where T1+T2+T3 parses as T1+(T2+T3). That is, the right-hand operand of a "+" expression must not itself be a "+" expression. Solution:

  E → E+T | T

Now E ⇒ E+T ⇒ (E+T)+T ⇒ (T+T)+T, so "+" is left-associative.

If we want right associativity, we simply put the recursion on the other side. For example, to make "E → E=E | T" right-associative, write:

  E → T=E | T

Then E ⇒ T=E ⇒ T=(T=E) ⇒ T=(T=T), which is appropriate for e.g. C's assignment operator.

Standard Expression Grammar

Rewrites to express both precedence and associativity are usually required, leading to the following standard expression grammar:

  E → E+T | T
  T → T*F | F
  F → id | num | (E)
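As an illustration of the effect of this grammar (my own sketch, not from the lecture, and ahead of the parsing algorithms covered later), here is a small parser for the standard expression grammar. The left-recursive rules E → E+T and T → T*F are realized with loops that nest the result to the left, so both precedence and left associativity are visible in the trees it builds.

import re

# Tokenizer: numbers, identifiers, or single-character operators/parentheses.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(src):
    for num, ident, other in TOKEN.findall(src):
        yield ("num", num) if num else ("id", ident) if ident else (other, other)

class Parser:
    def __init__(self, src):
        self.tokens = list(tokenize(src)) + [("eof", "eof")]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0]

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_E(self):                     # E -> E+T | T
        tree = self.parse_T()
        while self.peek() == "+":
            self.next()
            tree = ("+", tree, self.parse_T())   # nest to the left
        return tree

    def parse_T(self):                     # T -> T*F | F
        tree = self.parse_F()
        while self.peek() == "*":
            self.next()
            tree = ("*", tree, self.parse_F())
        return tree

    def parse_F(self):                     # F -> id | num | (E)
        kind, value = self.next()
        if kind in ("num", "id"):
            return value
        if kind == "(":
            tree = self.parse_E()
            assert self.next()[0] == ")", "expected ')'"
            return tree
        raise SyntaxError("unexpected token: " + value)

print(Parser("1+2*3").parse_E())   # ('+', '1', ('*', '2', '3'))  -- precedence
print(Parser("1+2+3").parse_E())   # ('+', ('+', '1', '2'), '3')  -- left associativity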

More Examples

1. Optional constructs

  S → if E then S else S
    | if E then S

At a higher level, we might want to say:

  S → if E then S (else S)        where the (else S) part is optional

This can be encoded using an auxiliary non-terminal S':

  S  → if E then S S'
  S' → else S | ε

2. Sequences

Consider function calls: f(), f(x), f(x,y,z). How do we express the shape of the argument list?

  E      → id "(" Args ")"
  Args   → ε | ArgSeq
  ArgSeq → E | ArgSeq "," E
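As a rough sketch of how the Args/ArgSeq rules are typically realized in code (my own helper names, not from the lecture): an argument list is either empty, or one expression followed by zero or more "," expression pairs.

def parse_args(tokens, parse_expr):
    """tokens is a list such as ["(", "x", ",", "y", ")"]; parse_expr consumes
    one expression from the front of the list and returns its tree."""
    assert tokens.pop(0) == "("
    args = []                              # Args -> epsilon gives the empty list
    if tokens[0] != ")":                   # Args -> ArgSeq
        args.append(parse_expr(tokens))
        while tokens[0] == ",":            # ArgSeq -> ArgSeq "," E
            tokens.pop(0)
            args.append(parse_expr(tokens))
    assert tokens.pop(0) == ")"
    return args

# With a trivial "expression" parser that just takes one identifier:
simple_expr = lambda ts: ts.pop(0)
print(parse_args(["(", ")"], simple_expr))                           # []
print(parse_args(["(", "x", ",", "y", ",", "z", ")"], simple_expr))  # ['x', 'y', 'z']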

The "Dangling Else" Ambiguity

Consider a simple statement grammar:

  S → if E then S else S
    | if E then S
    | other

Now consider "if E1 then if E2 then S1 else S2". This can be parsed as:

  S ⇒ if E1 then S ⇒ if E1 then (if E2 then S1 else S2)

or as:

  S ⇒ if E1 then S else S2 ⇒ if E1 then (if E2 then S1) else S2

In the first case S2 is executed if E1 is true and E2 is false. In the second case S2 is executed if E1 is false. Clearly this ambiguity is unacceptable.

Traditionally, programming languages prefer the first interpretation: an "else" part is associated with the nearest preceding incomplete "if" statement. To achieve this we can rewrite the grammar and introduce non-terminals for matched and unmatched ifs, as follows:

  S   → MIF | UIF
  MIF → if E then MIF else MIF | other
  UIF → if E then S | if E then MIF else UIF

That is, the statement between "then" and "else" must always be a fully matched statement, where each "if" has an "else". Now consider the second possible parse of the example above:

  if E1 then (if E2 then S1) else S2

The inner (if E2 then S1) is clearly a UIF, so at some point in the derivation there would have to be a sentential form

  if E1 then UIF else S2

However, in the rewritten grammar only a MIF can occur between "then" and "else", so this is impossible. Hence we have prevented the undesirable parse of nested ifs. In practice the "dangling else" ambiguity is often tolerated, because most parsing algorithms will produce the correct derivation anyway.
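To see why most parsers get this right even without the MIF/UIF rewrite, consider the following sketch (my own code and token shapes, not from the lecture): a recursive-descent statement parser that, after parsing the inner statement, greedily consumes an "else" if one follows, so the else attaches to the nearest unmatched if.

def parse_stmt(tokens):
    tok = tokens.pop(0)
    if tok == "if":
        cond = tokens.pop(0)                  # pretend conditions are single tokens
        assert tokens.pop(0) == "then"
        then_branch = parse_stmt(tokens)      # the innermost call runs first ...
        if tokens and tokens[0] == "else":    # ... so it grabs the "else"
            tokens.pop(0)
            return ("if", cond, then_branch, parse_stmt(tokens))
        return ("if", cond, then_branch)
    return tok                                # any other token is an "other" statement

print(parse_stmt(["if", "E1", "then", "if", "E2", "then", "S1", "else", "S2"]))
# ('if', 'E1', ('if', 'E2', 'S1', 'S2'))  -- the else belongs to the inner if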