Part 5 Program Analysis Principles and Techniques

Slide 2: Front end

[Diagram: source code → scanner → tokens → parser → IL; the scanner and the parser both report errors]

Responsibilities:
- Recognize legal programs
- Report errors
- Produce IL
- Build a preliminary storage map
- Shape the code for the back end

Much of front-end construction can be automated.

Slide 3: Scanner

[Diagram: source code → scanner → tokens → parser → IL; errors reported by the scanner]

The scanner:
- Maps characters into tokens, the basic unit of syntax. For example, x = x + y; becomes <id,x> = <id,x> + <id,y> ;
- The character string for a token is a lexeme.
- Typical tokens: number, id, +, -, *, /, do, end
- Eliminates white space (tabs, blanks, comments)
- A key issue is speed, so use a specialized recognizer (lex)

Slide 4: Specifying patterns

A scanner must recognize various parts of the language's syntax. Some parts are easy:
- White space: some combination of blank (<b>) and tab
- Keywords and operators: specified as literal patterns (do, end)
- Comments: opening and closing delimiters (/* ... */)
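The mapping from characters to tokens described above can be sketched with Python's re module. This is a minimal illustration, not the recognizer that lex would generate: the token class names, the patterns, and the scan function are assumptions made for this example.

```python
import re

# Token classes for a toy language. Names and patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("comment", r"/\*.*?\*/"),             # opening and closing delimiters
    ("space",   r"[ \t\n]+"),              # blanks, tabs, newlines
    ("keyword", r"\b(?:do|end)\b"),        # literal patterns
    ("number",  r"\d+"),
    ("id",      r"[A-Za-z][A-Za-z0-9]*"),
    ("op",      r"[-+*/=;]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC), re.DOTALL)


def scan(text):
    """Return a list of (token class, lexeme) pairs, dropping white space and comments."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if m.lastgroup not in ("comment", "space"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(scan("x = x + y; /* sum */"))
```

Each pair holds the token class and its lexeme, so the statement x = x + y; yields three id tokens and three operator tokens, while the comment and blanks disappear.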

Slide 5: Specifying patterns

Other parts are much harder:
- Identifiers: alphabetic followed by k alphanumerics; special symbols (-, $, &, ...)
- Numbers:
  - integers: 0, or a digit from 1-9 followed by digits from 0-9
  - decimals: integer . digits from 0-9
  - reals: (integer or decimal) E (+ or -) digits from 0-9
  - complex: ( real , real )

We need a powerful notation to specify these patterns.

Slide 6: Definitions

Operation                                   Definition
Union of L and M, written L ∪ M             L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M, written LM        LM = { st | s ∈ L and t ∈ M }
Kleene closure of L, written L*             L* = ⋃ (i = 0 to ∞) L^i
Positive closure of L, written L+           L+ = ⋃ (i = 1 to ∞) L^i
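The four operations in the definitions table can be tried out on small finite languages represented as Python sets of strings. This is a sketch under one stated assumption: because L* and L+ are infinite in general, the closures below are truncated at a caller-supplied bound max_i. All function names are made up for this example.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M


def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i: L concatenated with itself i times; L^0 is the set holding the empty string."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene(L, max_i):
    """Truncated L*: the union of L^0 through L^max_i."""
    out = set()
    for i in range(max_i + 1):
        out |= power(L, i)
    return out


def positive(L, max_i):
    """Truncated L+: the union of L^1 through L^max_i."""
    out = set()
    for i in range(1, max_i + 1):
        out |= power(L, i)
    return out


print(sorted(concat({"a", "b"}, {"c"})))
print(sorted(kleene({"a"}, 3)))
```

The only difference between the two closures is whether L^0, the set containing just the empty string, is included.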

Slide 7: Regular expressions

Patterns are often specified as regular languages. Notations used to describe a regular language (or a regular set) include both regular expressions and regular grammars.

Regular expressions (over an alphabet Σ):
1. ε is a RE denoting the set { ε }
2. If a ∈ Σ, then a is a RE denoting { a }

Slide 8: Regular expressions

Regular expressions (over an alphabet Σ):
3. If r and s are REs, denoting L(r) and L(s), then:
   - (r) is a RE denoting L(r)
   - (r) | (s) is a RE denoting L(r) ∪ L(s)
   - (r)(s) is a RE denoting L(r)L(s)
   - (r)* is a RE denoting L(r)*
   - (r)+ is a RE denoting L(r)+

If we adopt a precedence for operators, the extra parentheses can go away. We assume closure, then concatenation, then alternation as the order of precedence.

Slide 9: RE examples

identifier
  letter  → (a | b | c | ... | z | A | B | C | ... | Z)
  digit   → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
  id      → letter (letter | digit)*

numbers
  integer → (+ | - | ε) (0 | (1 | 2 | 3 | ... | 9) (digit)*)
  decimal → integer . (digit)*
  real    → (integer | decimal) E (+ | -) (digit)*
  complex → ( real , real )

Slide 10: RE examples

Most programming language tokens can be described with regular expressions. We can use regular expressions to automatically build scanners.
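The identifier and number definitions above translate fairly directly into Python regular expressions. The sketch below is one possible rendering, with two assumptions: an integer is either 0 or a nonzero digit followed by digits, with an optional sign, and the sign after E is mandatory, as the real rule is transcribed. The constant names and the matches helper are hypothetical.

```python
import re

# One possible rendering of the slide's definitions in Python's regex syntax.
ID = r"[A-Za-z][A-Za-z0-9]*"                  # letter (letter | digit)*
INTEGER = r"[+-]?(?:0|[1-9][0-9]*)"           # optional sign, then 0 or a number with no leading zero
DECIMAL = INTEGER + r"\.[0-9]*"               # integer . digit*
REAL = rf"(?:{DECIMAL}|{INTEGER})E[+-][0-9]*"  # (integer | decimal) E (+ | -) digit*


def matches(pattern, text):
    """True when the whole of text belongs to the language of pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, text in [(ID, "x1"), (ID, "1x"), (INTEGER, "007"), (DECIMAL, "3.14"), (REAL, "3.14E+10")]:
    print(text, matches(pattern, text))
```

Using re.fullmatch rather than re.match matters here: a scanner asks whether the entire lexeme is in the language, not whether some prefix is.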

Slide 21: Regular expressions

Regular expressions represent languages. Languages are sets of strings. Operations include Kleene closure, concatenation, and union.

Regular expression      Language
(a)                     { a }
(a) | (b)               { a, b }
(a)(b)                  { ab }
(a)*                    { ε, a, aa, ... }
(a)+                    { a, aa, ... }

Slide 22: Regular expressions

We assume closure, then concatenation, then union as the order of precedence.

ab | cd* = (ab) | (c(d*)) = { ab, c, cd, cdd, ... }
a (bc)* = { a, abc, abcbc, ... }
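Python's re module uses the same precedence (closure, then concatenation, then alternation), so the two examples can be checked by brute force. The sketch below enumerates every string over a small alphabet up to a length bound and keeps those the pattern matches in full. The language helper and its parameters are assumptions made for illustration.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """List the strings over alphabet, up to max_len characters, that pattern matches in full."""
    rx = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if rx.fullmatch(s):
                found.append(s)
    return found


# Without parentheses and with the parentheses made explicit: same language.
print(language("ab|cd*", "abcd", 3))
print(language("(ab)|(c(d*))", "abcd", 3))
print(language("a(bc)*", "abc", 5))
```

If alternation bound tighter than concatenation, the first pattern would instead mean a(b|c)d* and would accept strings such as acd.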

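Slide 23: Recognizers

From a regular expression, we can construct a deterministic finite automaton (DFA).

Recognizer for the identifier:

[Diagram: DFA with states S0 to S3. S0 goes to S1 on letter; S1 loops on letter or digit; S1 goes to S2 (accept) on other; S0 goes to S3 (error) on digit or other]

Slide 24: Recognizers

Identifier:
  letter → (a | b | c | ... | z | A | B | C | ... | Z)
  digit  → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
  id     → letter (letter | digit)*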
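A recognizer of this kind is commonly coded as a transition table plus a short driver loop. The sketch below assumes the usual reading of the identifier DFA: S0 is the start state, S1 means at least one letter has been seen, S2 accepts, and S3 is the error state. The function names and the end-of-input sentinel are assumptions for this example.

```python
def char_class(ch):
    """Map a character to the input classes used on the DFA's edges."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


# (state, character class) -> next state
TRANSITIONS = {
    ("S0", "letter"): "S1",
    ("S0", "digit"): "S3",
    ("S0", "other"): "S3",
    ("S1", "letter"): "S1",
    ("S1", "digit"): "S1",
    ("S1", "other"): "S2",
}


def scan_identifier(text):
    """Return the identifier at the start of text, or None if there is none."""
    state = "S0"
    # A trailing sentinel of class "other" guarantees the loop reaches S2 or S3.
    for i, ch in enumerate(text + "\0"):
        state = TRANSITIONS[(state, char_class(ch))]
        if state == "S2":      # accept: the lexeme ends just before this character
            return text[:i]
        if state == "S3":      # error: the input does not start with a letter
            return None


print(scan_identifier("count1 = 0"))
print(scan_identifier("9lives"))
```

The driver does one table lookup per character, which is why table-driven scanners are fast.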

Slide 25: So what is hard?

Language features that can cause problems:
- Reserved words: PL/I had no reserved words
    if then then then = else; else else = then;
- Significant blanks: FORTRAN and Algol68 ignore blanks
    do 10 I = 1,25
    do 10 I = 1.25

Slide 26: So what is hard?

- String constants: special characters in strings (newline, tab, quote, comment, ...)
- Finite closures: some languages limit identifier lengths, which adds states to count length; FORTRAN 66 allows 6 characters

These problems can be swept under the rug (avoided) by intelligent language design.

Slide 27: Methods and parameters

What is a lexical error?
- 1234G6
- Illegal character

What should the scanner do?
- Report the error
- Try to correct it?

Error correction techniques:
- Minimum distance corrections
- Hard token recovery
- Skip until match

Slide 28: Scanners

[Diagram: source code → scanner → tokens → parser → IL; errors reported by the scanner]

A scanner separates input into tokens based on lexical analysis. Legal tokens are usually specified by regular expressions (REs). Regular expressions specify regular languages.

Slide 31: Limits of regular languages

Not all languages are regular. You cannot construct DFAs to recognize these languages:
- L = { p^k q^k }
- L = { w c w^r | w ∈ Σ* }, where w^r is w reversed

Note: neither of these is a regular language! (DFAs cannot count!)

Slide 32: Limits of regular languages

But this is a little subtle. You can construct DFAs for:
- Alternating 0's and 1's: (ε | 1) (01)* (ε | 0)
- Sets of pairs of 0's and 1's: (01 | 10)+
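The contrast can be made concrete. Recognizing p^k q^k takes a counter that may grow without bound, which is exactly what a fixed set of DFA states cannot supply. The alternating language needs no counting and fits in a regular expression. The sketch assumes k ≥ 0, since the slide gives no lower bound, and the names is_pk_qk and ALTERNATING are invented for this example.

```python
import re


def is_pk_qk(s):
    """Recognize { p^k q^k | k >= 0 } using an unbounded counter."""
    count = 0
    i = 0
    while i < len(s) and s[i] == "p":   # count the leading p's
        count += 1
        i += 1
    while i < len(s) and s[i] == "q":   # cancel one p for each q
        count -= 1
        i += 1
    # Accept only if the whole input was consumed and the counts balanced.
    return i == len(s) and count == 0


# Alternating 0's and 1's: (ε | 1) (01)* (ε | 0)
ALTERNATING = re.compile(r"1?(01)*0?")

print(is_pk_qk("pppqqq"), is_pk_qk("ppq"))
print(bool(ALTERNATING.fullmatch("1010")), bool(ALTERNATING.fullmatch("0110")))
```

The variable count can take arbitrarily many values, so no finite automaton can simulate it for every k.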

Slide 33: More regular languages

Let's look at another regular language: the set of strings containing an even number of zeros and an even number of ones.

[Diagram: DFA with a start state and states S1, S2, S3; transitions are labeled 0 and 1]

The regular expression is:

(00 | 11)* ((01 | 10)(00 | 11)*(01 | 10)(00 | 11)*)*

Slide 34: Summary

Scanners:
- Break up input into tokens
- Catch lexical errors
- Difficulty affected by language design

Scanner generators:
- Tokens specified by regular expressions
- Construct a DFA to recognize the language
- Highly efficient in practice
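One way to gain confidence in this regular expression is to compare it against a direct four-state recognizer on every short bit string. In the sketch below the DFA state is encoded as two parity flags, an assumption about how the four states correspond to (zeros parity, ones parity). The function names are hypothetical.

```python
import re
from itertools import product

EVEN_EVEN = re.compile(r"(00|11)*((01|10)(00|11)*(01|10)(00|11)*)*")


def dfa_accepts(s):
    """Four-state DFA: the state is the pair (odd number of 0's so far, odd number of 1's so far)."""
    zeros_odd = ones_odd = False
    for ch in s:
        if ch == "0":
            zeros_odd = not zeros_odd
        else:
            ones_odd = not ones_odd
    return not zeros_odd and not ones_odd   # accept only in the even/even state


def agree(max_len):
    """Check that the regex and the DFA agree on every bit string up to max_len characters."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if bool(EVEN_EVEN.fullmatch(s)) != dfa_accepts(s):
                return False
    return True


print(agree(8))
```

An exhaustive check to a bounded length is evidence rather than proof, but it catches most transcription mistakes in an expression like this one.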

Slide 35: The role of the parser

[Diagram: source code → scanner → tokens → parser → IL; errors reported by the parser]

Parser:
- Performs context-free syntax analysis
- Guides the context-sensitive analysis
- Constructs an intermediate representation
- Produces meaningful error messages
- Attempts error correction

Slide 36: Syntax analysis

Context-free syntax is specified with a grammar. Formally, a context-free grammar G is a four-tuple (T, NT, S, P).

T is the set of terminal symbols in the grammar. For our purposes, the set of terminals is equivalent to the set of tokens returned by the lexical analyzer.

Slide 37: Syntax analysis

NT is a set of syntactic variables that denote sets of (sub)strings occurring in the language. These are used to impose a structure on the grammar.

S is a distinguished non-terminal (S ∈ NT) that denotes the entire set of strings in L(G). This is sometimes called a goal symbol. S cannot appear on the right-hand side of any production.

P is a set of productions that specify the way that terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left-hand side.

Slide 38: Syntax analysis

Grammars are often written in BNF, or Backus-Naur Form:

1. <goal> ::= <expr>
2. <expr> ::= <expr> <op> <expr>
3.          | number
4.          | id
5. <op>   ::= +
6.          | -
7.          | *
8.          | /
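The four-tuple (T, NT, S, P) maps naturally onto a small data structure. The sketch below encodes the eight-production expression grammar and checks the conditions stated in the definition. The class name Grammar, its field names, and the check method are assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset       # T
    nonterminals: frozenset    # NT
    start: str                 # S, the goal symbol
    productions: tuple         # P, as (left-hand side, right-hand side tuple) pairs

    def check(self):
        """Return a list of violated conditions; an empty list means none were found."""
        problems = []
        if self.terminals & self.nonterminals:
            problems.append("T and NT overlap")
        if self.start not in self.nonterminals:
            problems.append("S is not in NT")
        symbols = self.terminals | self.nonterminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                problems.append(f"left-hand side {lhs!r} is not a single non-terminal")
            if self.start in rhs:
                problems.append(f"S appears on the right-hand side of a production for {lhs!r}")
            problems.extend(f"unknown symbol {sym!r}" for sym in rhs if sym not in symbols)
        return problems


EXPR = Grammar(
    terminals=frozenset({"number", "id", "+", "-", "*", "/"}),
    nonterminals=frozenset({"goal", "expr", "op"}),
    start="goal",
    productions=(
        ("goal", ("expr",)),
        ("expr", ("expr", "op", "expr")),
        ("expr", ("number",)),
        ("expr", ("id",)),
        ("op", ("+",)),
        ("op", ("-",)),
        ("op", ("*",)),
        ("op", ("/",)),
    ),
)

print(EXPR.check())
```

Keeping the productions in an ordered tuple preserves the numbering used in the BNF listing, so production 2 is EXPR.productions[1].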

Slide 39: Syntax analysis

This grammar gives simple expressions over numbers and identifiers.

In a BNF for a grammar, we represent:
- Non-terminals with brackets or capital letters
- Terminals with typewriter font or underline
- Productions as in the example

Slide 40: Why use context-free grammars?

Many advantages:
- Precise syntactic specification of a programming language
- Easy to understand, avoids ad hoc definition
- Easier to maintain, add new language features
- Can automatically construct an efficient parser
- Parser construction reveals ambiguity, other difficulties
- Imparts structure to language
- Supports syntax-directed translation

Slide 41: Grammars for regular languages

Can we place a restriction on the form of a grammar to ensure that it describes a regular language?

Provable fact: for any RE r, there is a grammar g such that L(r) = L(g).

Grammars that generate regular sets are called regular grammars.

Slide 42: Grammars for regular languages

Definition: In a regular grammar, all productions have one of two forms:
1. A → aA
2. A → a
where A is a non-terminal and a is a terminal symbol.

These are also called type 3 grammars (Chomsky).

Slide 43: Scanning vs. parsing

Where do we draw the line?

  term → [a-zA-Z] ([a-zA-Z] | [0-9])*
       | 0 | [1-9][0-9]*
  op   → + | - | * | /
  expr → (term op)* term

Regular expressions are used to classify:
- identifiers, numbers, keywords

Slide 44: Scanning vs. parsing

Context-free grammars are used to count:
- brackets: ( ), begin ... end, if ... then ... else
- imposing structure: expressions

Grammar for cc has 188 productions.

Slide 47: Derivations

We can view the productions of a CFG as rewriting rules. Using our example:

<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <expr> <op> <expr> <op> <expr>
       ⇒ <id,x> <op> <expr> <op> <expr>
       ⇒ <id,x> + <expr> <op> <expr>
       ⇒ <id,x> + <num,2> <op> <expr>
       ⇒ <id,x> + <num,2> * <expr>
       ⇒ <id,x> + <num,2> * <id,y>

Slide 48: Derivations

We have derived the sentence x + 2 * y. We denote this <goal> ⇒* id + num * id.

Such a sequence of rewrites is a derivation or a parse. The process of discovering a derivation is called parsing.
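Treating productions as rewriting rules can be replayed mechanically. The sketch below applies a sequence of production numbers, always to the leftmost non-terminal, and records each sentential form. The numbering follows the eight-production expression grammar; the function name and the choice to write the tokens as plain id and number are assumptions.

```python
# Productions of the simple expression grammar, keyed by production number.
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "expr"]),
    3: ("expr", ["number"]),
    4: ("expr", ["id"]),
    5: ("op", ["+"]),
    6: ("op", ["-"]),
    7: ("op", ["*"]),
    8: ("op", ["/"]),
}
NONTERMINALS = {"goal", "expr", "op"}


def leftmost_derivation(rule_numbers, start="goal"):
    """Apply each production to the leftmost non-terminal and return every sentential form."""
    form = [start]
    steps = [list(form)]
    for n in rule_numbers:
        lhs, rhs = PRODUCTIONS[n]
        idx = next((i for i, sym in enumerate(form) if sym in NONTERMINALS), None)
        if idx is None or form[idx] != lhs:
            raise ValueError(f"production {n} does not apply to the leftmost non-terminal")
        form[idx:idx + 1] = rhs            # rewrite the non-terminal in place
        steps.append(list(form))
    return steps


for step in leftmost_derivation([1, 2, 2, 4, 5, 3, 7, 4]):
    print(" ".join(step))
```

The rule sequence 1, 2, 2, 4, 5, 3, 7, 4 reproduces the derivation of id + number * id step by step.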

Slide 49: Derivations

At each step, we chose a non-terminal to replace. This choice can lead to different derivations. Two are of particular interest:
- Leftmost derivation: the leftmost non-terminal is replaced at each step
- Rightmost derivation: the rightmost non-terminal is replaced at each step

The example was a leftmost derivation.

Slide 50: Rightmost derivation

For the string x + 2 * y:

<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <expr> <op> <id,y>
       ⇒ <expr> * <id,y>
       ⇒ <expr> <op> <expr> * <id,y>
       ⇒ <expr> <op> <num,2> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>

Again, <goal> ⇒* id + num * id.

Slide 51: Parse tree

Let's look at the parse tree.

[Parse tree: goal at the root; the subtree for <id,x> + <num,2> is the left operand of *, and <id,y> is the right operand]

Treewalk evaluation would give the wrong answer: (x + 2) * y instead of x + (2 * y).

Slide 52: Precedence

These two derivations point out a problem with the grammar. The grammar has no notion of precedence, or implied order of evaluation.
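The effect of tree shape on treewalk evaluation is easy to see with concrete values. The sketch below represents a tree as nested (left, operator, right) tuples and evaluates both groupings of x + 2 * y. The tuple encoding, the evaluate function, and the sample values for x and y are assumptions chosen for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def evaluate(tree, env):
    """Treewalk evaluation: evaluate both subtrees, then apply the operator at the root."""
    if isinstance(tree, tuple):
        left, op, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):      # an identifier: look up its value
        return env[tree]
    return tree                    # a number


wrong_tree = (("x", "+", 2), "*", "y")    # the grouping (x + 2) * y
right_tree = ("x", "+", (2, "*", "y"))    # the grouping x + (2 * y)
env = {"x": 1, "y": 3}

print(evaluate(wrong_tree, env), evaluate(right_tree, env))
```

With x = 1 and y = 3 the two trees evaluate to 9 and 7, so the grammar, not the evaluator, decides which answer comes out.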

Slide 53: Precedence

To add precedence takes additional machinery:

1. <goal>   ::= <expr>
2. <expr>   ::= <expr> + <term>
3.            | <expr> - <term>
4.            | <term>
5. <term>   ::= <term> * <factor>
6.            | <term> / <factor>
7.            | <factor>
8. <factor> ::= number
9.            | id

Slide 54: Precedence

This grammar enforces a precedence on the derivation:
- terms must be derived from expressions
- forces the "correct" tree

Slide 55: Precedence

Now, for the string x + 2 * y:

<goal> ⇒ <expr>
       ⇒ <expr> + <term>
       ⇒ <expr> + <term> * <factor>
       ⇒ <expr> + <term> * <id,y>
       ⇒ <expr> + <factor> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <term> + <num,2> * <id,y>
       ⇒ <factor> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>

Again, <goal> ⇒* id + num * id, but this time, we build the desired tree.

Slide 56: Precedence

This time, we get the desired parse tree. Treewalk evaluation computes x + (2 * y).

[Parse tree: <id,x> is the left operand of +, and the subtree for <num,2> * <id,y> is the right operand]
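The layered grammar can be turned into a small hand-written parser with one function per non-terminal. This is a sketch, not a method the slides prescribe: the left-recursive productions are handled with loops, a standard substitution that keeps the operators left-associative, and the token list is assumed to be already scanned into plain strings. All function names are hypothetical.

```python
OPERATORS = {"+", "-", "*", "/"}


def _factor(tokens):
    """<factor> ::= number | id"""
    if not tokens or tokens[0] in OPERATORS:
        raise SyntaxError("expected a number or an identifier")
    return tokens[0], tokens[1:]


def _term(tokens):
    """<term> ::= <term> * <factor> | <term> / <factor> | <factor>, written as a loop."""
    tree, rest = _factor(tokens)
    while rest and rest[0] in {"*", "/"}:
        right, after = _factor(rest[1:])
        tree, rest = (tree, rest[0], right), after
    return tree, rest


def _expr(tokens):
    """<expr> ::= <expr> + <term> | <expr> - <term> | <term>, written as a loop."""
    tree, rest = _term(tokens)
    while rest and rest[0] in {"+", "-"}:
        right, after = _term(rest[1:])
        tree, rest = (tree, rest[0], right), after
    return tree, rest


def parse_expr(tokens):
    """<goal> ::= <expr>; the whole input must be consumed."""
    tree, rest = _expr(tokens)
    if rest:
        raise SyntaxError(f"unexpected token {rest[0]!r}")
    return tree


print(parse_expr(["x", "+", "2", "*", "y"]))
```

Because _expr only ever combines complete terms, the multiplication is grouped before the addition sees it, which is the tree that evaluates to x + (2 * y).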

Slide 57: Ambiguity

If a grammar has multiple leftmost derivations for a single (terminal) word of the language, the grammar is ambiguous. Similarly, a grammar with multiple rightmost derivations for a single (terminal) word of the language is ambiguous.

Example:

<stmt> ::= if <expr> then <stmt>
         | if <expr> then <stmt> else <stmt>
         | other stmts

Slide 58: Ambiguity

Consider deriving the sentential form:

if E1 then if E2 then S1 else S2

It has two derivations. This ambiguity is purely grammatical. It is a context-free ambiguity.
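The two derivations can be found by exhaustive search. The sketch below tries every way of applying the if-then and if-then-else productions to a token list and collects the trees that consume the whole input. It assumes each condition and each "other" statement is a single token; the function names and the tuple encoding of trees are invented for this example.

```python
def parses(tokens, pos=0):
    """Yield (tree, next position) for every way to parse a <stmt> starting at pos."""
    if pos >= len(tokens):
        return
    if tokens[pos] == "if":
        # Assumption: the condition is the single token after 'if'.
        if pos + 2 < len(tokens) and tokens[pos + 2] == "then":
            cond = tokens[pos + 1]
            for then_tree, p in parses(tokens, pos + 3):
                # <stmt> ::= if <expr> then <stmt>
                yield ("if", cond, then_tree), p
                # <stmt> ::= if <expr> then <stmt> else <stmt>
                if p < len(tokens) and tokens[p] == "else":
                    for else_tree, q in parses(tokens, p + 1):
                        yield ("if", cond, then_tree, else_tree), q
    elif tokens[pos] not in ("then", "else"):
        yield tokens[pos], pos + 1          # other stmts


def all_parses(tokens):
    """Keep only the parses that consume every token."""
    return [tree for tree, end in parses(tokens) if end == len(tokens)]


sentence = "if E1 then if E2 then S1 else S2".split()
for tree in all_parses(sentence):
    print(tree)
```

One tree attaches the else to the outer if and the other attaches it to the inner if, which is the dangling-else ambiguity.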

Slide 59: Ambiguity

We may be able to eliminate ambiguities by rearranging the grammar:

<stmt> ::= <ms>
         | <us>
<ms>   ::= if <expr> then <ms> else <ms>
         | other stmts
<us>   ::= if <expr> then <stmt>
         | if <expr> then <ms> else <us>

Slide 60: Ambiguity

This grammar generates the same language as the ambiguous grammar, but applies the common-sense rule: match each else with the closest unmatched then.

This is pretty clearly the language designer's intent.
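The closest-unmatched-then rule can also be enforced procedurally. The sketch below is a hand-written parser that, after parsing the then-branch, immediately claims a following else for the innermost open if. It swaps in this greedy strategy for illustration and is not derived from the rewritten grammar; it assumes single-token conditions and statements, and parse_stmt is a hypothetical name.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement at pos; return (tree, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "if":
        if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then'")
        cond = tokens[pos + 1]
        then_tree, pos = parse_stmt(tokens, pos + 3)
        # Greedy choice: the innermost open 'if' takes the next 'else'.
        if pos < len(tokens) and tokens[pos] == "else":
            else_tree, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_tree, else_tree), pos
        return ("if", cond, then_tree), pos
    if tokens[pos] in ("then", "else"):
        raise SyntaxError(f"unexpected {tokens[pos]!r}")
    return tokens[pos], pos + 1             # other stmts


tree, end = parse_stmt("if E1 then if E2 then S1 else S2".split())
print(tree)
```

Exactly one tree is produced, with the else attached to the inner if.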

Slide 61: Ambiguity

Ambiguity generally refers to a confusion in the context-free specification. Context-sensitive confusions can arise from overloading:

a = f(17)

In many Algol-like languages, f could be either a function or a subscripted variable.

Slide 62: Ambiguity

Disambiguating this statement requires context:
- need values of declarations
- not context free
- really an issue of type

Rather than complicate parsing, we will handle this separately.

Slide 66: Summary: Parser

The parser:
- Performs context-free syntax analysis
- Recognizes structure over a set of tokens
- Deals with ambiguity and precedence