Principles of Programming Languages COMP251: Syntax and Grammars

Similar documents
Principles of Programming Languages COMP251: Syntax and Grammars

A Simple Syntax-Directed Translator

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

This book is licensed under a Creative Commons Attribution 3.0 License

Chapter 3: Describing Syntax and Semantics. Introduction Formal methods of describing syntax (BNF)

CMPT 755 Compilers. Anoop Sarkar.

Syntax Analysis Check syntax and construct abstract syntax tree

A programming language requires two major definitions A simple one pass compiler

CMSC 330: Organization of Programming Languages. Architecture of Compilers, Interpreters

Architecture of Compilers, Interpreters. CMSC 330: Organization of Programming Languages. Front End Scanner and Parser. Implementing the Front End

EECS 6083 Intro to Parsing Context Free Grammars

More Assigned Reading and Exercises on Syntax (for Exam 2)

2.2 Syntax Definition

A simple syntax-directed

COP 3402 Systems Software Syntax Analysis (Parser)

CPS 506 Comparative Programming Languages. Syntax Specification

Dr. D.M. Akbar Hussain

CMSC 330: Organization of Programming Languages

Where We Are. CMSC 330: Organization of Programming Languages. This Lecture. Programming Languages. Motivation for Grammars

Habanero Extreme Scale Software Research Project

Part 5 Program Analysis Principles and Techniques

CMSC 330: Organization of Programming Languages

EDAN65: Compilers, Lecture 04 Grammar transformations: Eliminating ambiguities, adapting to LL parsing. Görel Hedin Revised:

CMSC 330: Organization of Programming Languages

Chapter 3: CONTEXT-FREE GRAMMARS AND PARSING Part 1

CMSC 330: Organization of Programming Languages. Context Free Grammars

CSCE 314 Programming Languages

Context-Free Grammar. Concepts Introduced in Chapter 2. Parse Trees. Example Grammar and Derivation

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Syntax. A. Bellaachia Page: 1

CS 315 Programming Languages Syntax. Parser. (Alternatively hand-built) (Alternatively hand-built)

Derivations of a CFG. MACM 300 Formal Languages and Automata. Context-free Grammars. Derivations and parse trees

CMPS Programming Languages. Dr. Chengwei Lei CEECS California State University, Bakersfield

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5

1 Lexical Considerations

Syntax. In Text: Chapter 3

CS 314 Principles of Programming Languages

Examples of attributes: values of evaluated subtrees, type information, source file coordinates,

CSE 3302 Programming Languages Lecture 2: Syntax

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Specifying Syntax. An English Grammar. Components of a Grammar. Language Specification. Types of Grammars. 1. Terminal symbols or terminals, Σ

announcements CSE 311: Foundations of Computing review: regular expressions review: languages---sets of strings

Defining syntax using CFGs

CMSC 330: Organization of Programming Languages. Context Free Grammars

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

CSE 311 Lecture 21: Context-Free Grammars. Emina Torlak and Kevin Zatloukal

CSCI312 Principles of Programming Languages!

Parsing II Top-down parsing. Comp 412

Chapter 3. Describing Syntax and Semantics ISBN

Question Bank. 10CS63:Compiler Design

Programming Lecture 3

Section A. A grammar that produces more than one parse tree for some sentences is said to be ambiguous.

22c:111 Programming Language Concepts. Fall Syntax III

A language is a subset of the set of all strings over some alphabet. string: a sequence of symbols alphabet: a set of symbols

Context-Free Grammars

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing

Chapter 3. Describing Syntax and Semantics

Part 3. Syntax analysis. Syntax analysis 96

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;}

([1-9] 1[0-2]):[0-5][0-9](AM PM)? What does the above match? Matches clock time, may or may not be told if it is AM or PM.

Describing Syntax and Semantics

Building Compilers with Phoenix

Syntax Intro and Overview. Syntax

Syntax. Syntax. We will study three levels of syntax Lexical Defines the rules for tokens: literals, identifiers, etc.

Lexical Considerations

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs

Related Course Objec6ves

Chapter 4. Syntax - the form or structure of the expressions, statements, and program units

Lexical and Syntax Analysis. Top-Down Parsing

3. Context-free grammars & parsing

Formal Languages and Compilers Lecture V: Parse Trees and Ambiguous Gr

Decaf Language Reference Manual

Introduction to Syntax Analysis

Chapter 3: CONTEXT-FREE GRAMMARS AND PARSING Part2 3.3 Parse Trees and Abstract Syntax Trees

Introduction to Syntax Analysis. The Second Phase of Front-End

Introduction to Parsing

Lexical Considerations

More on Syntax. Agenda for the Day. Administrative Stuff. More on Syntax In-Class Exercise Using parse trees

UNIT I Programming Language Syntax and semantics. Kainjan Sanghavi

Context-Free Grammars and Languages (2015/11)

Syntax. 2.1 Terminology

Lexical and Syntax Analysis

Syntax Analysis. Prof. James L. Frankel Harvard University. Version of 6:43 PM 6-Feb-2018 Copyright 2018, 2015 James L. Frankel. All rights reserved.

3. Parsing. Oscar Nierstrasz

Compiler Design Concepts. Syntax Analysis

Week 2: Syntax Specification, Grammars

ADTS, GRAMMARS, PARSING, TREE TRAVERSALS

COP 3402 Systems Software Top Down Parsing (Recursive Descent)

VIVA QUESTIONS WITH ANSWERS

Lexical Analysis (ASU Ch 3, Fig 3.1)

Consider a description of arithmetic. It includes two equations that define the structural types of digit and operator:

ΕΠΛ323 - Θεωρία και Πρακτική Μεταγλωττιστών

CSE 401 Midterm Exam 11/5/10

CSE450 Translation of Programming Languages. Lecture 4: Syntax Analysis

Earlier edition Dragon book has been revised. Course Outline Contact Room 124, tel , rvvliet(at)liacs(dot)nl

Languages and Compilers

Stating the obvious, people and computers do not speak the same language.

Theoretical Part. Chapter one:- - What are the Phases of compiler? Answer:

Programming Language Specification and Translation. ICOM 4036 Fall Lecture 3

Transcription:

Principles of Programming Languages COMP251: Syntax and Grammars Prof. Dekai Wu Department of Computer Science and Engineering The Hong Kong University of Science and Technology Hong Kong, China Fall 2007

Part I Language Description

Language Description: Motivation Able was I ere I saw Elba. about Napoléon How do you know that this is English, and not French or Chinese?

Language Description A language has 2 parts: 1 Syntax lexical syntax describes how a sequence of symbols makes up tokens (lexicon) of the language checked by a lexical analyzer grammar describes how a sequence of tokens makes up a valid program. checked by a parser 2 Semantics specifies the meaning of a program

Compilation source program lexical analyzer lexical units syntax analyzer (parser) symbol table intermediate code generator (and semantic analyzer) parse tree optimization code generator intermediate code executable program

Example 1: English Language A word = some combination of the 26 letters, a,b,c,...,z. One form of a sentence = Subject + Verb + Object. e.g. The student wrote a great program.

Example 2: Date Format A date like 06/04/2010 may be written in the general format: D D / D D / D D D D where D = 0,1,2,3,4,5,6,7,8,9 But, does 03/09/1998 mean Sept 3rd, or March 9th?

Example 3: Real Numbers (Simplified) Examples of reals: 0.45 12.3.98 Examples of non-reals: 2+4i 1a2b 8 < Informal rules: In general, a real number has three parts: an integer part (I ) a dot. symbol (.) a fraction part (F ) valid forms: I.F,.F I and F are strings of digits I may be empty but F cannot a digit is one of { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }

Expression: Examples a + b 3 a + b/c b+ b 2 4 a c 2 a a (1 R n ) 1 R if (x > 10) then x /= 10 else x *= 2 c.f. While I was coming to school, I saw a car accident. The sentence is in the form of: While E 1, E 2.

Expression Notation: Example 4 Goal: Add a to b. Infix : Prefix : Postfix : a + b +ab ab+ Abstract Syntax Tree + / \ a b Abstract syntax tree is independent of notation.

Expression A constant or variable is an expression. In general, an expression has the form of a function: E = Op (E 1, E 2,..., E k ) where Op is the operator, and E 1, E 2,..., E k are the operands. An operator with k operands is said to have an arity of k; and Op is an k-ary operator. unary operator : binary operator : ternary operator : x x + y (x > y)? x : y

Infix, Prefix, Postfix, Mixfix Infix : E 1 Op E 2 (must be binary operator!) a + b, a b, a b, a/b, a == b, a < b. Prefix : Op E 1 E 2... E k +ab, ab, ab, /ab, == ab, < ab. Postfix : E 1 E 2... E k Op ab+, ab, ab, ab/, ab ==, ab <. Mixfix : e.g. if E 1 then E 2 else E 3

Abstract Syntax Tree O p E1 E 2 E k

Expression Notation: Example 5 abstract syntax tree infix : prefix : postfix : 3 a + b/c + 3a/bc 3a bc/+ + / \ / \ * / / \ / \ 3 a b c Note: Prefix and postfix notation does not require parentheses.

Expression Notation: Example 6 divide + * _ sqrt() 2 a b _ * * infix : ( b + b 2 4 a c)/(2 a) b b * c prefix : / + b bb 4ac 2a 4 a postfix : b bb 4a c + 2a /

Postfix Evaluation: By a Stack infix expression: 3 a + b/c. postfix expression: 3a bc/+. 3 a 3 * a 3 (3a) b (3a) c b (3a) / c b (3a) (b/c) (3a) + (b/c) (3a) (3a+b/c)

Precedence and Associativity in C++ Operator Description Associativity [ ] array element LEFT structure member pointer - minus RIGHT ++ increment - - decrement indirection multiply LEFT / divide % mod + - add subtract LEFT == logical equal LEFT = assignment RIGHT

Precedence Example: 1/2 + 3 4 = (1/2) + (3 4) because, / has a higher precedence over +,. Precedence rules decide which operators run first. In general, x P y Q z = x P ( y Q z ) if operator Q is at a higher precedence level than operator P.

Associativity: Binary Operators Example: 1 2 + 3 4 = ((1 2) + 3) 4 because +, are left associative. Associativity decides the grouping of operands with operators of the same level of precedence. In general, if binary operator P, Q are of the same precedence level: x P y Q z = x P ( y Q z ) if operator P, Q are both right associative; x P y Q z = ( x P y ) Q z if operator P, Q are both left associative. Question : What if + is left while is right associative?

Associativity: Unary Operators Example in C++: a + + = (a + +) because all unary operators in C++ are right-associative. In Pascal, all operators including unary operators are left-associative. In general, unary operators in many languages may be considered as non-associative as it is not important to assign an associativity for them, and their usage and semantics will decide their order of computation. Question : Which of infix/prefix/postfix notation needs precedence or associative rules?

Summary on Syntax Will describe a language by a formal syntax and an informal semantics Syntax = lexical syntax + grammar Expression notation: infix, prefix, postfix, mixfix Abstract syntax tree: independent of notation Precedence and associativity of operators decide the order of applying the operators

Part II Grammar

Grammar: Motivation What do the following sentences really mean? I saw a small kid on the beach with a binocular. What is the final value of x? x = 15 if (x > 20) then if (x > 30) then x = 8 else x = 9 Ambiguity in semantics is often caused by ambiguous grammar of the language.

A Formal Description: Example 7 1. < real-number > ::= < integer-part >. < fraction > 2. < integer-part > ::= <empty> < digit-sequence > 3. < fraction > ::= < digit-sequence > 4. < digit-sequence > ::= < digit > < digit >< digit-sequence > 5. < digit > ::= 0 1 2 3 4 5 6 7 8 9 This is the context-free grammar of real numbers written in the Backus-Naur Form.

Context Free Grammar (CFG) A context-free grammar has 4 components: 1 A set of tokens or terminals: atomic symbols of the language. English : a, b, c,..., z Reals : 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,. 2 A set of nonterminals: variables denoting language constructs. English : < Noun >, < Verb >, < Adjective >,... Reals : < real-number >, < integer-part >, < fraction >, < digit-sequence >, < digit >

Context Free Grammar.. 3 A set of rules called productions: for generating expressions of the language. nonterminal ::= a string of terminals and nonterminals English : < Sentence > ::= < Noun > < Verb > < Noun > Reals : < integer-part > ::= <empty> < digit-sequence > Notice that CFGs allow only a single non-terminal on the left-hand side of any production rules. 4 A nonterminal chosen as the start symbol: represents the main construct of the language. English : < Sentence > Reals : < real-number > The set of strings that can be generated by a CFG makes up a context-free language.

Backus-Naur Form (BNF) One way to write context-free grammar. Terminals appear as they are. Nonterminals are enclosed by < and >. e.g.: < real-number >, < digit >. The special empty string is written as <empty>. Productions with a common nonterminal may be abbreviated using the special or symbol. e.g. X ::= W1, X ::= W2,..., X ::= Wn may be abbreviated as X ::= W1 W2 Wn

Top-Down Parsing: Example 8 A parser checks to see if a given expression or program can be derived from a given grammar. Check if.5 is a valid real number by finding from the CFG of Example 6 a leftmost derivation of.5 : < real-number > => < integer-part >. < fraction > [Production 1] => <empty>. < fraction > [Production 2] =>.< fraction > [By definition] =>.< digit-sequence > [Production 3] =>.< digit > [Production 4] =>.5 [Production 5]

Bottom-Up Parsing: Example 9 Check if.5 is a valid real number by finding from the CFG of Example 6 a rightmost derivation of.5 in reverse:.5 = <empty>.5 [By definition] => < integer-part >.5 [Production 2] => < integer-part >. < digit > [Production 5] => < integer-part >. < digit-sequence > [Production 4] => < integer-part >. < fraction > [Production 3] => < real-number > [Production 1]

Parse Tree: Example 10 [Real Numbers] A parse tree of.5 generated by the CFG of Example 6. <real-number> / \ <integer-part>. <fraction> <empty> <digit-sequence> <digit> 5

Parse Tree A parse tree shows how a string is generated by a CFG the concrete syntax in a tree representation. Root = start symbol. Leaf nodes = terminals or <empty>. Non-leaf nodes = nonterminals For any subtree, the root is the left-side nonterminal of some production, while its children, if read from left to right, make up the right side of the production. The leaf nodes, read from left to right, make up a string of the language defined by the CFG.

Example 11: CFG/BNF [Expression] < Expr > ::= < Expr >< Op >< Expr > < Expr > ::= (< Expr >) < Expr > ::= < Id > < Op > ::= + - * / = < Id > ::= a b c 1. Terminals: a, b, c, +, -, *, /, =, (, ) 2. Nonterminals: Expr, Op, Id 3. Start symbol: Expr

Parse Tree : Example 12 [Expression] A parse tree of a + b c generated by the CFG of Example 10: <Expr> / \ <Expr> <Op> <Expr> / \ <Expr> <Op> <Expr> - <Id> <Id> + <Id> c a b Question: What is the difference between a parse tree and an abstract syntax tree?

Ambiguous Grammar: Example 13 A grammar is (syntactically) ambiguous if some string in its language is generated by more than one parse tree. <Expr> / \ <Expr> <Op> <Expr> / \ <Id> + <Expr> <Op> <Expr> a <Id> - <Id> b c Solution: Rewrite the grammar to make it unambiguous.

Handle Left Associativity: Example 14 CFG of Example 10 cannot handle a + b c correctly. Add a left recursive production. < Expr > ::= < Expr >< Op >< Term > < Expr > ::= < Term > < Term > ::= (< Expr >) < Id > < Op > ::= + - * / = < Id > ::= a b c

Handle Left Associativity.. Now there is only one parse tree for a + b c : <Expr> / \ / \ <Expr> <Op> <Term> / \ <Expr> <Op> <Term> - <Id> <Term> + <Id> c <Id> b a

Handling Right Associativity: Example 15 CFG of Example 10 cannot handle a = b = c correctly. Add a right recursive production. < Assign > ::= < Expr > = < Assign > < Assign > ::= < Expr > < Expr > ::= < Expr >< Op >< Term > < Term > < Term > ::= (< Expr >) < Id > < Op > ::= + - * / < Id > ::= a b c Question: this grammar will accept strings like a + b = c - d. Try to correct it.

Handling Right Associativity.. Now there is only one parse tree for a = b = c : <Assign> / \ / \ <Expr> = <Assign> / \ <Term> <Expr> = <Assign> <Id> <Term> <Expr> a <Id> <Term> b <Id> c

Handling Precedence: Example 16 CFG of Example 10 cannot handle a + b c correctly. Add one nonterminal (plus appropriate productions) for each precedence level. < Assign > ::= < Expr > = < Assign > < Expr > < Expr > ::= < Expr > + < Term > < Expr > ::= < Expr > < Term > < Term > < Term > ::= < Term > < Factor > < Term > ::= < Term > / < Factor > < Factor > < Factor > ::= (< Expr >) < Id > < Id > ::= a b c

Handling Precedence.. Now there is only one parse tree for a + b c : <Assign> <Expr> / \ / \ <Expr> + <Term> / \ <Term> <Term> * <Factor> <Factor> <Factor> <Id> <Id> <Id> c a b

Tips on Handling Precedence/Associativity left associativity left-recursive production right associativity right-recursive production n levels of precedence divide the operators into n groups write productions for each group of operators start with operators with the lowest precedence In all cases, introduce new non-terminals whenever necessary. In general, one needs a new non-terminal for each new group of operators of different associativity and different precedence.

Dangling-Else: Example 17 Consider the following grammar: < S > ::= if < E > then < S > < S > ::= if < E > then < S > else < S > How many parse trees can you find for the statement: if E 1 then if E 2 then S 1 else S 2

Dangling-Else.. S E 1 if then S E 2 S 1 if then else S 2 S E 1 if then S else S 2 if E 2 then S 1

Dangling-Else... Ambiguity is often a property of a grammar, not of a language. Solution: matching an else with the nearest unmatched if. i.e. the first case.

More CFG Examples 1 < S > ::= < A >< B >< C > < A > ::= a< A > a < B > ::= b< B > b < C > ::= c< C > c 2 < S > ::= < A > a < B > b < A > ::= < A > b b < B > ::= a< B > a 3 <stmts> ::= <empty> <stmt> ; <stmts> <stmt> ::= <id> := <expr> if <expr> then <stmt> if <expr> then <stmt> else <stmt> while <expr> do <stmt> begin <stmts> end

Non-Context Free Grammars: Examples 1 < S > ::= < B >< A >< C > < C >< A >< B > b< A > ::= c< A >< B > < B > c< A > ::= b< A >< C > < C > < B > ::= b < C > ::= c L = { (cb) n, b(cb) n, (bc) n, c(bc) n }. 2 L = { wcw w is a string of a s or b s }. This language abstracts the problem of checking that an identifier is declared before its use in a program. The first w = declaration of the identifier, and the second w = its use in the program.

Summary on Grammar Context-free grammar (CFG) is commonly used to specify most of the syntax of a programming language. However, most programming languages are not CFL! CFG is commonly written in Backus-Naur Form (BNF). CFG = (Terminals, Nonterminals, Productions, Start Symbol) A program is valid if we may construct a parse tree, or a derivation from the grammar. Associativity and precedence of operations are part of the design of a CFG. Avoid ambiguous grammars by rewriting them or imposing parsing rules.

Part III Regular Grammar, Regular Expression

Regular Grammars Regular Grammars are a subset of CFGs in which all productions are in one of the following forms: 1 Right-Regular Grammar <A> ::= x <A> ::= x<b> 2 Left-Regular Grammar <A> ::= x <A> ::= <B>x where A and B are non-terminals and x is a string of terminals.

RE Example 1: Right-Regular Grammar <S> ::= a<a> <S> ::= b<b> <S> ::= <empty> <A> ::= a<s> <B> ::= bb<s> What is the regular language this RG generates?

Regular Expressions Regular expressions (RE) are succinct representations of RGs using the following notations. Sub-Expression Meaning x the single char x. any single char except the newline [abc] char class consisting of a, b, or c [ abc] any char except a, b, c r* repeat r zero or more times r+ repeat r 1 or more times r? zero or 1 occurrence of r rs concatenation of RE r and RE s (r)s r is evaluated and concatenated with s r s RE r or RE s \x escape sequences for white-spaces and special symbols: \b \n \r \t

Precedence of Regular Expression Operators The following table gives the order of RE operator precedence from the highest precedence to the lowest precedence. Function Operator parenthesis ( ) counters * +? { } concatenation disjunction

RE Example 2: Regular Expression Notations RE Meaning abc the string abc a+b+ {a m b n : m, n 1} a*b*c {a m b n c : m, n 0} a*b*c? {a m b n c or a m b n : m, n 0} xy(abc)+ {xy(abc) n : n 1} xy[abc] {xya, xyb, xyc} xy(a b) {xya, xyb} Questions: What are the following REs? foo bar* foo (bar)* (foo bar)*

RE Example 3: Regular Expressions REs are commonly used for pattern matching in editors, word processors, commandline interpreters, etc. The REs used for searching texts in Unix (vi, emacs, perl, grep), Microsoft Word v.6+, and Word Perfect are almost identical. Examples: identifiers in C++: real numbers: email addresses: white spaces: all C++ source or include files:

Summary on Regular Grammars There are algorithms to prove if a language is regular. There are algorithms to prove if a language is context-free too. English is not RL, nor CFL. REs are commonly used for text search. Different applications may extend the standard RE notations.