CS 315 Programming Languages: Syntax

Programming languages must be precise. Remember: programs are instructions. This is unlike natural languages.

Precision is required for
  syntax: think of this as the format of the language
  semantics: think of this as the meaning of the language

Recall the first part of the compilation sequence:

  input program -> Scanner (lexical analyzer) -> tokens -> Parser -> parse tree

  Stage    Specification         Tool                                                Result
  Scanner  Regular expressions   Scanner generator (lex), alternatively hand-built   Implements a DFA (deterministic finite automaton)
  Parser   Context-free grammar  Parser generator (yacc), alternatively hand-built   Implements a PDA (push-down automaton)

E.g. regular expressions:
  DIGIT:      [0-9]
  NUMBER:     DIGIT+
  CHARACTER:  [a-zA-Z]
  IDENTIFIER: CHARACTER (CHARACTER | DIGIT)*
  OP:         ( '+' | '*' | '-' | '/' )

The unquoted symbols [ ], ( ), +, *, and | are meta-characters (shown in red on the slides). + means 1 or more, * (Kleene star) means 0 or more, | is used for alternation, etc. We will cover these in more detail later.

E.g. context-free grammar:
  expression -> expression OP expression
              | '(' expression ')'
              | IDENTIFIER
              | NUMBER

Parse of ( abc + 5 ) / 71:

          /
        /   \
       +     71
      / \
    abc  5

See code as trees, not as lines.
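The scanner step described above can be sketched with Python's `re` module. This is only an illustration, not how lex works internally: the token names `LPAREN`, `RPAREN`, and `SKIP` and the function `tokenize` are assumptions introduced here, while `NUMBER`, `IDENTIFIER`, and `OP` follow the slide's definitions.

```python
import re

# Token classes from the slide as named groups. LPAREN, RPAREN and SKIP are
# hypothetical names added for this sketch; they are not on the slide.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>[0-9]+)
  | (?P<IDENTIFIER>[a-zA-Z][a-zA-Z0-9]*)
  | (?P<OP>[+*/-])
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for kind, lexeme in tokenize("( abc + 5 ) / 71"):
    print(kind, lexeme)
```

The parser then works on this token list rather than on raw characters, which is why the grammar above is written in terms of IDENTIFIER, NUMBER, and OP.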

Specifying Syntax

Tokens (produced as a result of lexing / scanning, based on regular expressions)
  Concatenation: ab
  Alternation: a | b
  Kleene closure (repetition): a*
These rules are called regular expressions. Sets of strings that can be defined by regular expressions are called regular languages.

Syntax tree (produced as a result of parsing, based on context-free grammars)
  Add recursion to the three operations above.
These rules are called context-free grammars. Sets of strings that can be defined by context-free grammars are called context-free languages.

Examples:

Ex1) The set of identifiers can be represented via regular expressions:
  IDENTIFIER: [a-zA-Z][a-zA-Z0-9]*

Ex2) The set of palindromes cannot be represented via regular expressions. It requires recursion, thus context-free grammars are needed.
  palindrome -> '0' palindrome '0'
              | '1' palindrome '1'
              | '1'
              | '0'
              | ε
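A minimal sketch of the contrast between the two examples, assuming binary palindromes over '0' and '1' as in the grammar. The identifier check is a single regular expression, while the palindrome check mirrors the recursive rule. The function names `is_identifier` and `is_palindrome` are illustrative only.

```python
import re

IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def is_identifier(s):
    """Ex1: identifiers are a regular language, so one regex suffices."""
    return IDENTIFIER.fullmatch(s) is not None


def is_palindrome(s):
    """Ex2: follows the recursive structure of the palindrome grammar."""
    if any(ch not in "01" for ch in s):
        return False
    if len(s) <= 1:                  # alternatives '0', '1' and the empty string
        return True
    if s[0] != s[-1]:                # the outer symbols must be the same
        return False
    return is_palindrome(s[1:-1])    # the recursive alternatives


print(is_identifier("abc5"), is_identifier("5abc"))
print(is_palindrome("0110"), is_palindrome("01"))
```

The recursive call on the inner substring is the part that regular expressions, which have no recursion, cannot express.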

Tokens & Regular Expressions

Token: basic building block, e.g.
  keywords (while, if)
  identifiers (variable x)
  symbols (+, -, *, /)
  constants (abc)

Regular expressions are used to specify tokens. A regular expression is one of:
  a character
  the empty string, denoted by ε
  two regular expressions next to each other (concatenation)
  two regular expressions separated by | (alternation)
  a regular expression followed by a Kleene star (repetition)
Parentheses are used to avoid ambiguity.

E.g.:
  DIGIT:    0 | 1 | 2 | ... | 9                            same as [0-9]
  INTEGER:  DIGIT DIGIT*                                   same as DIGIT+
  DECIMAL:  DIGIT* ( '.' DIGIT | DIGIT '.' ) DIGIT*
  EXPONENT: ( e | E ) ( '+' | '-' | ε ) INTEGER            same as ( e | E ) ( '+' | '-' )? INTEGER
  REAL:     INTEGER EXPONENT | DECIMAL ( EXPONENT | ε )    same as INTEGER EXPONENT | DECIMAL EXPONENT?
  NUMBER:   INTEGER | REAL

Precedence of regular expression operators:
  Highest   ()
            *, +, ?
            concatenation
  Lowest    |

Think of these as trees as well. The highest-precedence operator is at the bottom, the lowest is at the top.

Exercises (each regular expression followed by its tree, written root first):
  (AB|C)*D    concat( *( |( concat(A, B), C ) ), D )
  D(A|B*)C    concat( D, |( A, *(B) ), C )
  AB*|C       |( concat( A, *(B) ), C )
  AB|C*D      |( concat(A, B), concat( *(C), D ) )
  AB?+        concat( A, +( ?(B) ) )
  A(C|B?)+    concat( A, +( |( C, ?(B) ) ) )
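As a sketch of how the number definitions compose, they can be built up piece by piece in Python's `re` syntax. The variable names mirror the slide; `is_number` is a hypothetical helper added for illustration.

```python
import re

# Each definition reuses the previous ones, exactly as on the slide.
DIGIT = r"[0-9]"
INTEGER = rf"{DIGIT}+"
DECIMAL = rf"{DIGIT}*(?:\.{DIGIT}|{DIGIT}\.){DIGIT}*"
EXPONENT = rf"[eE][+-]?{INTEGER}"
REAL = rf"(?:{INTEGER}{EXPONENT}|{DECIMAL}(?:{EXPONENT})?)"
NUMBER = re.compile(rf"(?:{INTEGER}|{REAL})")


def is_number(s):
    """True if the whole string is a NUMBER according to the definitions above."""
    return NUMBER.fullmatch(s) is not None


for text in ["42", "3.14", ".5", "5.", "1e10", "6.02E+23", ".", "e5", "1e"]:
    print(repr(text), is_number(text))
```

Note that DECIMAL requires at least one digit next to the dot, so a lone "." is rejected while ".5" and "5." are accepted.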

Example strings for the exercises:

  Regular expression   Valid                        Invalid
  (AB|C)*D             ABD, D, CABCD, ABABD         ACD
  D(A|B*)C             DAC, DC, DBC, DBBC           DABC
  AB?+                 A, AB, ABB
  AB|C*D               AB, D, CD, CCD               ABD
  AB*|C                C, A, AB, ABB                ABC
  A(C|B?)+             AC, ACC, A, ABC, ABBCBCC
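The exercise strings can be checked mechanically. The sketch below uses Python's `re.fullmatch`; the dictionary `EXERCISES` and the helper `check` are names made up for this example. One caveat: the slide's AB?+ is written here as `A(B?)+`, because in recent Python versions (3.11 and later) `?+` is read as a possessive quantifier rather than "one or more of an optional B".

```python
import re

# pattern -> (strings that should match, strings that should not match)
EXERCISES = {
    r"(AB|C)*D": (["ABD", "D", "CABCD", "ABABD"], ["ACD"]),
    r"D(A|B*)C": (["DAC", "DC", "DBC", "DBBC"], ["DABC"]),
    r"A(B?)+":   (["A", "AB", "ABB"], []),
    r"AB|C*D":   (["AB", "D", "CD", "CCD"], ["ABD"]),
    r"AB*|C":    (["C", "A", "AB", "ABB"], ["ABC"]),
    r"A(C|B?)+": (["AC", "ACC", "A", "ABC", "ABBCBCC"], []),
}


def check(pattern, valid, invalid):
    """True if every valid string matches in full and no invalid string does."""
    accepts_valid = all(re.fullmatch(pattern, s) for s in valid)
    rejects_invalid = not any(re.fullmatch(pattern, s) for s in invalid)
    return accepts_valid and rejects_invalid


for pattern, (valid, invalid) in EXERCISES.items():
    print(pattern, check(pattern, valid, invalid))
```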

Context-free Grammars

Regular expressions cannot be used to describe programming language structures such as expressions. Instead, we use context-free grammars (CFGs). They are described using the BNF (Backus Normal Form, also known as Backus-Naur Form) notation.

  expr -> ID | NUMBER | '-' expr | '(' expr ')' | expr op expr
  op   -> '+' | '-' | '*' | '/'

In this grammar:
  expr on the left-hand side is a non-terminal, and also the start symbol
  ID is a token, also called a terminal
  '-' is a terminal
  expr on the right-hand side is a non-terminal

Rules of BNF:
  only non-terminals can appear on the LHS (left-hand side)
  the LHS has only a single non-terminal (otherwise the grammar becomes context-sensitive)
  non-terminals and terminals can appear on the RHS (right-hand side)
  concatenation is allowed on the RHS
  parentheses and Kleene closure operators are not allowed on the RHS in BNF

There is also the extended BNF notation, called EBNF. EBNF removes the last restriction: parentheses and Kleene closure operators are allowed in EBNF.

The grammar
  id_list -> ID (',' ID)*
is in EBNF and is equivalent to the following BNF grammar:
  id_list -> ID | id_list ',' ID

Also, the following forms are all the same:

  op -> '+' | '-' | '*' | '/'

  op -> '+'
  op -> '-'
  op -> '*'
  op -> '/'

  op -> '+'
     -> '-'
     -> '*'
     -> '/'

Derivations and parse trees

Context-free grammars describe how to generate syntactically valid strings.

E.g.: Consider the following input string:
  slope * x + intercept
And the following grammar:
  1. expr -> ID
  2.      -> NUMBER
  3.      -> '-' expr
  4.      -> '(' expr ')'
  5.      -> expr op expr
  6. op   -> '+'
  7.      -> '-'
  8.      -> '*'
  9.      -> '/'

Here is a rightmost derivation of the input string using the above grammar:
  expr
  apply rule 5 => expr op expr
  apply rule 1 => expr op ID
  apply rule 6 => expr '+' ID
  apply rule 5 => expr op expr '+' ID
  apply rule 1 => expr op ID '+' ID
  apply rule 8 => expr '*' ID '+' ID
  apply rule 1 => ID (slope) '*' ID (x) '+' ID (intercept)

Resulting parse tree (numbers give the order in which the nodes are created):
  expr 1
    expr 2
      expr 7
        ID 12 (slope)
      op 8
        '*' 11
      expr 9
        ID 10 (x)
    op 3
      '+' 6
    expr 4
      ID 5 (intercept)

In a rightmost derivation, we always expand the rightmost non-terminal.

Here is a leftmost derivation of the input string using the above grammar:
  expr
  apply rule 5 => expr op expr
  apply rule 5 => expr op expr op expr
  apply rule 1 => ID op expr op expr
  apply rule 8 => ID '*' expr op expr
  apply rule 1 => ID '*' ID op expr
  apply rule 6 => ID '*' ID '+' expr
  apply rule 1 => ID (slope) '*' ID (x) '+' ID (intercept)

Resulting parse tree:
  expr 1
    expr 2
      expr 5
        ID 8 (slope)
      op 6
        '*' 9
      expr 7
        ID 10 (x)
    op 3
      '+' 11
    expr 4
      ID 12 (intercept)

In a leftmost derivation, we always expand the leftmost non-terminal.

Note that these two trees are the same structurally, but were created in different orders. However, for this particular grammar, we can come up with a derivation that results in a different tree.
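The two derivations can be replayed mechanically. The following is a small sketch: the table `RULES` encodes the nine numbered rules, and the hypothetical helper `derive` applies a list of rule numbers, always expanding the leftmost or the rightmost non-terminal.

```python
# The nine numbered rules: rule number -> (left-hand side, right-hand side).
RULES = {
    1: ("expr", ["ID"]),
    2: ("expr", ["NUMBER"]),
    3: ("expr", ["'-'", "expr"]),
    4: ("expr", ["'('", "expr", "')'"]),
    5: ("expr", ["expr", "op", "expr"]),
    6: ("op", ["'+'"]),
    7: ("op", ["'-'"]),
    8: ("op", ["'*'"]),
    9: ("op", ["'/'"]),
}
NONTERMINALS = {"expr", "op"}


def derive(rule_numbers, leftmost):
    """Return the sentential forms produced by applying the rules in order."""
    form = ["expr"]
    steps = [" ".join(form)]
    for n in rule_numbers:
        lhs, rhs = RULES[n]
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"rule {n} does not apply to {form[i]}")
        form[i:i + 1] = rhs          # replace the non-terminal by the rule's RHS
        steps.append(" ".join(form))
    return steps


for step in derive([5, 1, 6, 5, 1, 8, 1], leftmost=False):
    print("=>", step)
for step in derive([5, 5, 1, 8, 1, 6, 1], leftmost=True):
    print("=>", step)
```

Both calls end in the same sentence, ID '*' ID '+' ID, which matches the observation that the two derivations differ only in the order of expansion.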

Here is another rightmost derivation that produces a different tree:
  expr
  apply rule 5 => expr op expr
  apply rule 5 => expr op expr op expr
  apply rule 1 => expr op expr op ID
  apply rule 6 => expr op expr '+' ID
  apply rule 1 => expr op ID '+' ID
  apply rule 8 => expr '*' ID '+' ID
  apply rule 1 => ID (slope) '*' ID (x) '+' ID (intercept)

Resulting parse tree:
  expr 1
    expr 2
      ID 12 (slope)
    op 3
      '*' 11
    expr 4
      expr 5
        ID 10 (x)
      op 6
        '+' 9
      expr 7
        ID 8 (intercept)

If some string has derivations with different parse trees, then the grammar is ambiguous. Ambiguous grammars create problems in parsing, since different parse trees result in different semantics. For instance:
  precedence: 3+4*5 means 3+(4*5), since * has higher precedence than +. Thus, the multiplication should be deeper in the parse tree.
  associativity: 10-4-3 means (10-4)-3, since - is left-associative. Thus, the first subtraction should be deeper in the parse tree.

      +                   -
     / \                 / \
    3   *               -   3
       / \             / \
      4   5          10   4

An unambiguous grammar for expressions
Note: You should know this by heart.
  1.  expr    -> expr add_op term
  2.          -> term
  3.  term    -> term mult_op factor
  4.          -> factor
  5.  factor  -> '-' factor
  6.          -> '(' expr ')'
  7.          -> ID
  8.          -> NUM
  9.  add_op  -> '+'
  10.         -> '-'
  11. mult_op -> '*'
  12.         -> '/'
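To see why ambiguity matters for semantics, the sketch below enumerates every parse tree that the ambiguous rule expr -> expr op expr allows for a flat token list and evaluates each one. The helpers `all_trees` and `evaluate` are hypothetical names for this illustration, and operands are restricted to numbers.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def all_trees(tokens):
    """Yield every parse tree of `tokens` under expr -> expr op expr | NUMBER."""
    if len(tokens) == 1:
        yield tokens[0]
        return
    # Operators sit at the odd positions; any of them may become the root.
    for i in range(1, len(tokens), 2):
        for left in all_trees(tokens[:i]):
            for right in all_trees(tokens[i + 1:]):
                yield (tokens[i], left, right)


def evaluate(tree):
    """Evaluate a tree bottom-up; deeper operators are applied first."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left), evaluate(right))
    return tree


for tokens in ([3, "+", 4, "*", 5], [10, "-", 4, "-", 3]):
    for tree in all_trees(tokens):
        print(tree, "=", evaluate(tree))
```

Each input has two trees with two different values, which is exactly the problem that the unambiguous grammar removes.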

Let's do a RMD (rightmost derivation): 3 + 4 * 5
  expr
  apply rule 1  => expr add_op term
  apply rule 3  => expr add_op term mult_op factor
  apply rule 8  => expr add_op term mult_op NUM
  apply rule 11 => expr add_op term '*' NUM
  apply rule 4  => expr add_op factor '*' NUM
  apply rule 8  => expr add_op NUM '*' NUM
  apply rule 9  => expr '+' NUM '*' NUM
  apply rule 2  => term '+' NUM '*' NUM
  apply rule 4  => factor '+' NUM '*' NUM
  apply rule 8  => NUM (3) '+' NUM (4) '*' NUM (5)

Resulting parse tree (numbers give the order in which the nodes are created):
  expr 1
    expr 2
      term 13
        factor 14
          NUM 15 (3)
    add_op 3
      '+' 12
    term 4
      term 5
        factor 10
          NUM 11 (4)
      mult_op 6
        '*' 9
      factor 7
        NUM 8 (5)

Let's do a RMD: 10 - 4 - 3
  expr
  apply rule 1  => expr add_op term
  apply rule 4  => expr add_op factor
  apply rule 8  => expr add_op NUM
  apply rule 10 => expr '-' NUM
  apply rule 1  => expr add_op term '-' NUM
  apply rule 4  => expr add_op factor '-' NUM
  apply rule 8  => expr add_op NUM '-' NUM
  apply rule 10 => expr '-' NUM '-' NUM
  apply rule 2  => term '-' NUM '-' NUM
  apply rule 4  => factor '-' NUM '-' NUM
  apply rule 8  => NUM (10) '-' NUM (4) '-' NUM (3)

Resulting parse tree:
  expr 1
    expr 2
      expr 8
        term 14
          factor 15
            NUM 16 (10)
      add_op 9
        '-' 13
      term 10
        factor 11
          NUM 12 (4)
    add_op 3
      '-' 7
    term 4
      factor 5
        NUM 6 (3)
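A hand-built parser for this grammar can be sketched as a recursive-descent parser. This is an assumption-laden illustration, not the method of the slides: recursive descent cannot use left-recursive rules directly, so the sketch replaces expr -> expr add_op term with the equivalent loop expr -> term (add_op term)*, building the tree leftward to keep left associativity. The names `Parser`, `parse`, and `tokenize` are hypothetical, and ID is left out so that only numbers are handled.

```python
import re


def tokenize(text):
    """Split into numbers, operators and parentheses (other characters are ignored)."""
    return re.findall(r"[0-9]+|[-+*/()]", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # expr -> term (add_op term)*, folded to the left
        tree = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            tree = (op, tree, self.term())
        return tree

    def term(self):
        # term -> factor (mult_op factor)*, folded to the left
        tree = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            tree = (op, tree, self.factor())
        return tree

    def factor(self):
        # factor -> '-' factor | '(' expr ')' | NUM
        tok = self.take()
        if tok == "-":
            return ("neg", self.factor())
        if tok == "(":
            tree = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return tree
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    parser = Parser(text)
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("3 + 4 * 5"))
print(parse("10 - 4 - 3"))
```

In the resulting trees the multiplication sits below the addition, and the first subtraction sits below the second, matching the parse trees derived by hand above.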

An unambiguous grammar always results in the same tree, irrespective of the derivation performed. Different derivations may result in different construction orders of the trees, but the structure of the parse tree will always be the same.

There are a few details to note about the grammar we used for arithmetic expressions:
  Left recursion results in left associativity (if we were to use right recursion, we would have had right associativity).
  The order of the rules corresponds to the precedence of the operators. In particular, lower-precedence operators appear earlier in the rule list.

Example: Let's say we want to create an expression language that has the following operators:

  operator:                          \
  precedence:     low     medium     high
  associativity:  left    left       right

Grammar:
  expr   -> expr ' ' term | term
  term   -> term ' ' factor | factor
  factor -> expo '\' factor | expo
  expo   -> '(' expr ')' | ID | NUM

Example input: (y z) w \ u \ x

  expr
  => term
  => term ' ' factor
  => term ' ' expo '\' factor
  => term ' ' expo '\' expo '\' factor
  => term ' ' expo '\' expo '\' expo
  => term ' ' expo '\' expo '\' ID
  => term ' ' expo '\' ID '\' ID
  => term ' ' ID '\' ID '\' ID
  => factor ' ' ID '\' ID '\' ID
  => expo ' ' ID '\' ID '\' ID
  => '(' expr ')' ' ' ID '\' ID '\' ID
  => '(' expr ' ' term ')' ' ' ID '\' ID '\' ID
  => '(' expr ' ' factor ')' ' ' ID '\' ID '\' ID
  => '(' expr ' ' expo ')' ' ' ID '\' ID '\' ID
  => '(' expr ' ' ID ')' ' ' ID '\' ID '\' ID
  => '(' term ' ' ID ')' ' ' ID '\' ID '\' ID
  => '(' factor ' ' ID ')' ' ' ID '\' ID '\' ID
  => '(' expo ' ' ID ')' ' ' ID '\' ID '\' ID
  => '(' ID (y) ' ' ID (z) ')' ' ' ID (w) '\' ID (u) '\' ID (x)

Resulting parse tree:
  expr
    term
      term
        factor
          expo
            '('
            expr
              expr
                term
                  factor
                    expo
                      ID (y)
              ' '
              term
                factor
                  expo
                    ID (z)
            ')'
      ' '
      factor
        expo
          ID (w)
        '\'
        factor
          expo
            ID (u)
          '\'
          factor
            expo
              ID (x)

Consider the following grammar for if/else statements:
  stmt       -> if_stmt | nonif_stmt
  if_stmt    -> IF '(' logical_expr ')' THEN stmt
              | IF '(' logical_expr ')' THEN stmt ELSE stmt
  nonif_stmt -> ...

Is this grammar unambiguous? We can prove otherwise via an example. Consider the following input:
  if (x) then if (y) then a = b; else a = c;

We can create two different parse trees.

Alternative 1 (the else belongs to the inner if):
  if_stmt
    IF
    '(' logical_expr (x) ')'
    THEN
    stmt
      if_stmt
        IF
        '(' logical_expr (y) ')'
        THEN
        stmt (a = b;)
        ELSE
        stmt (a = c;)

Alternative 2 (the else belongs to the outer if):
  if_stmt
    IF
    '(' logical_expr (x) ')'
    THEN
    stmt
      if_stmt
        IF
        '(' logical_expr (y) ')'
        THEN
        stmt (a = b;)
    ELSE
    stmt (a = c;)

The else belongs to the closest dangling if, so the first one is the one we should ideally have.
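One common way hand-built parsers obtain the first alternative is to let the innermost open if claim a following else greedily. The sketch below illustrates that convention only; it is not the grammar-based fix. It simplifies the input heavily (no parentheses, conditions and non-if statements are single tokens), and `parse_stmt` is a hypothetical name.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        if tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then'")
        then_branch, pos = parse_stmt(tokens, pos + 3)
        # Greedy choice: the innermost if that is still open takes the else.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos
    # Anything else is treated as a single non-if statement token.
    return tokens[pos], pos + 1


tokens = "if x then if y then a=b else a=c".split()
tree, _ = parse_stmt(tokens)
print(tree)
```

The printed tree nests the else inside the inner if, which corresponds to Alternative 1.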

The following is an unambiguous grammar for the same problem:
  stmt         -> if_stmt | nonif_stmt
  if_stmt      -> matched_if | unmatched_if
  matched_if   -> 'if' logical_expr 'then' matched_stmt 'else' matched_stmt
  matched_stmt -> matched_if | nonif_stmt
  unmatched_if -> 'if' logical_expr 'then' stmt
                | 'if' logical_expr 'then' matched_stmt 'else' unmatched_if
  logical_expr -> id '==' lit
  nonif_stmt   -> assgn_stmt
  assgn_stmt   -> id '=' expr
  expr         -> expr '+' term | term
  term         -> '(' expr ')' | id
  id           -> 'A' | 'B' | 'C'
  lit          -> '0' | '1' | '2'

Consider the following input:
  if A == 0 then if B == 1 then C = A + B else B = C

Let us do a leftmost derivation for the input:
  stmt
  => if_stmt
  => unmatched_if
  => 'if' logical_expr 'then' stmt
  => 'if' id '==' lit 'then' stmt
  => 'if' 'A' '==' lit 'then' stmt
  => 'if' 'A' '==' '0' 'then' stmt
  => 'if' 'A' '==' '0' 'then' if_stmt
  => 'if' 'A' '==' '0' 'then' matched_if
  => 'if' 'A' '==' '0' 'then' 'if' logical_expr 'then' matched_stmt 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' id '==' lit 'then' matched_stmt 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' lit 'then' matched_stmt 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' matched_stmt 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' nonif_stmt 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' assgn_stmt 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' id '=' expr 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' expr 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' expr '+' term 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' term '+' term 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' id '+' term 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' term 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' id 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' 'B' 'else' matched_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' 'B' 'else' nonif_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' 'B' 'else' assgn_stmt
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' 'B' 'else' id '=' expr
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' 'B' 'else' 'B' '=' expr
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' 'B' 'else' 'B' '=' term
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' 'B' 'else' 'B' '=' id
  => 'if' 'A' '==' '0' 'then' 'if' 'B' '==' '1' 'then' 'C' '=' 'A' '+' 'B' 'else' 'B' '=' 'C'