Compilation 2012 Context-Free Languages Parsers and Scanners. Jan Midtgaard Michael I. Schwartzbach Aarhus University



Context-Free Grammars
Example:
  sentence → subject verb object
  subject → person
  person → John | Joe | Zacharias
  verb → asked | kicked
  object → thing | person
  thing → the football | the computer
Nonterminal symbols: sentence, subject, person, verb, object, thing
Terminal symbols: John, Joe, Zacharias, asked, kicked, the football, the computer
Start symbol: sentence
Example of derivation: sentence ⇒ subject verb object ⇒ ... ⇒ John asked the computer

Formal Definition of CFGs
A context-free grammar (CFG) is a 4-tuple G = (V, Σ, S, P):
  V is a finite set of nonterminal symbols
  Σ is an alphabet of terminal symbols, and V ∩ Σ = Ø
  S ∈ V is a start symbol
  P is a finite set of productions of the form A → α, where A ∈ V and α ∈ (V ∪ Σ)*

Derivations
⇒ denotes a single derivation step, where a nonterminal is rewritten according to a production
Thus ⇒ is a relation on the set (V ∪ Σ)*
If α, β ∈ (V ∪ Σ)* then α ⇒ β when
  α = α1 A α2 where α1, α2 ∈ (V ∪ Σ)* and A ∈ V
  the grammar contains the production A → γ
  β = α1 γ α2

The Language of a CFG
Define the relation ⇒* as the reflexive transitive closure of ⇒, that is:
  α ⇒* β iff α ⇒ ... ⇒ β (0 or more steps)
The language of G is defined as L(G) = { x ∈ Σ* | S ⇒* x }
A language L ⊆ Σ* is context-free iff there is a CFG G where L(G) = L

Example 1
The language L = { a^n b^n | n ≥ 0 } is described by a CFG G = (V, Σ, S, P) where
  V = {S}
  Σ = {a, b}
  P = {S → aSb, S → ε}
That is, L(G) = L
Alternative notation: S → aSb | ε
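The derivation relation can be tried out directly on Example 1. The following is a minimal sketch, not part of the original slides: it explores sentential forms breadth-first and collects the terminal strings up to a length bound. The names `step` and `language_up_to` are illustrative, and the empty string stands for ε.

```python
from collections import deque

# Grammar of Example 1: S -> aSb | ε (the empty body encodes ε).
PRODUCTIONS = {"S": ["aSb", ""]}

def step(form):
    """Yield every sentential form reachable from `form` in one derivation step."""
    for i, symbol in enumerate(form):
        for body in PRODUCTIONS.get(symbol, []):
            yield form[:i] + body + form[i + 1:]

def language_up_to(max_len, start="S"):
    """Collect all terminal strings of length <= max_len derivable from `start`."""
    seen, queue, words = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        if not any(sym in PRODUCTIONS for sym in form):
            words.add(form)  # no nonterminals left: a word of L(G)
            continue
        for nxt in step(form):
            # Terminals are never removed by a derivation step, so forms
            # with too many terminals can be pruned safely.
            terminals = sum(sym not in PRODUCTIONS for sym in nxt)
            if terminals <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(words, key=len)

print(language_up_to(6))
```

Every string found has the shape a^n b^n, as the slide claims for L(G).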

Example 2
The language pal = { x ∈ {0,1}* | x = reverse(x) } is described by a CFG G = (V, Σ, S, P) where
  V = {S}
  Σ = {0, 1}
  P = {S → ε, S → 0, S → 1, S → 0S0, S → 1S1}
Alternative notation: S → ε | 0 | 1 | 0S0 | 1S1

Why Context-Free?
α1 A α2 ⇒ α1 γ α2 if the grammar contains the production A → γ
Thus, γ may substitute for A independently of the context (α1 and α2)

Algebraic Expressions
A CFG G = (V, Σ, S, P) for algebraic expressions:
  V = { S }
  Σ = { +, -, *, /, (, ), x, y, z }
  productions in P: S → S + S | S - S | S * S | S / S | ( S ) | x | y | z
Example: x*y+z-(z/y+x)

Derivations
Three derivations of the string x * y + z:
  S ⇒ S + S ⇒ S * S + S ⇒ x * S + S ⇒ x * y + S ⇒ x * y + z
  S ⇒ S + S ⇒ S + z ⇒ S * S + z ⇒ S * y + z ⇒ x * y + z
  S ⇒ S * S ⇒ S * S + S ⇒ x * S + S ⇒ x * y + S ⇒ x * y + z

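Derivation Trees
A derivation tree shows the structure of a derivation, but not the detailed order:
[Figure: two derivation trees for x * y + z. In the first, the root expands to S + S, with the left S deriving x * y and the right S deriving z. In the second, the root expands to S * S, with the left S deriving x and the right S deriving y + z.]
A parser finds a derivation tree for a given string.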

Ambiguous CFGs
Definition: A CFG G is ambiguous if there exists a string x ∈ L(G) with more than one derivation tree
Thus, the CFG for algebraic expressions is ambiguous
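Ambiguity of the algebraic-expression grammar can be checked mechanically by counting derivation trees. The sketch below is an illustration only: it hard-codes the grammar S → S op S | ( S ) | x | y | z, takes one character per token, and uses the hypothetical name `count_trees`.

```python
from functools import lru_cache

OPERATORS = set("+-*/")
VARIABLES = set("xyz")

def count_trees(tokens):
    """Count derivation trees of `tokens` under S -> S op S | ( S ) | x | y | z."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of trees in which S derives tokens[i:j].
        if j - i == 1:
            return 1 if tokens[i] in VARIABLES else 0
        total = 0
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)                # S -> ( S )
        for k in range(i + 1, j - 1):
            if tokens[k] in OPERATORS:
                total += count(i, k) * count(k + 1, j)  # S -> S op S
        return total

    return count(0, len(tokens))

print(count_trees("x*y+z"))
```

A count above 1 for any string witnesses ambiguity; for x*y+z the two trees correspond to the groupings (x*y)+z and x*(y+z).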

A Simpler Syntax for Grammars
S → S + S | S - S | S * S | S / S | ( S ) | x | y | z
V is inferred from the left-hand sides
Σ contains the remaining symbols
S is the first nonterminal symbol used
P is written explicitly

Rewriting Grammars
The ambiguous grammar:
  S → S + S | S - S | S * S | S / S | ( S ) | x | y | z
may be rewritten to become unambiguous:
  S → S + T | S - T | T
  T → T * F | T / F | F
  F → x | y | z | ( S )
This imposes an operator precedence

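Unambiguous Parsing
The string x * y + z now only admits a single parse tree:
[Figure: parse tree for x * y + z. The root S expands to S + T. The left S expands to T, which expands to T * F with T → F → x and F → y. The right T expands to F → z.]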

Chomsky Normal Form
Definition: A CFG G = (V, Σ, S, P) is in Chomsky Normal Form (CNF) if every production in P is of the form A → BC or A → a, for A, B, C ∈ V and a ∈ Σ
Any CFG G can be rewritten to a CFG G' where G' is in Chomsky Normal Form and L(G') = L(G) - {ε}
For every terminal a that appears in a body of length 2 or more, create a new production A → a and replace a by A in the body
Break productions with more than two variables into groups of productions with two variables
The CNF transformation preserves unambiguity

Parsing Any CNF Grammar
Given a grammar G in CNF and a string x of length n
Define a |V| × n × n table P of booleans where P(A,i,j) iff A ⇒* x[i..j]
Using dynamic programming, fill in this table bottom-up in time O(n^3) (CYK algorithm)
Now, x ∈ L(G) iff P(S,0,n-1) is true
This algorithm can easily be extended to also construct a parse tree
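The table-filling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the slides' own code: `cyk`, `UNIT`, and `PAIR` are hypothetical names, the table stores sets of nonterminals instead of one boolean per nonterminal, and the sample grammar is a CNF grammar for { a^n b^n | n ≥ 1 } chosen for the example. Indices are inclusive, matching P(A,i,j) above.

```python
def cyk(word, unit_rules, pair_rules, start="S"):
    """Return True iff `start` derives `word` under a grammar in CNF.

    unit_rules: list of (A, a) for productions A -> a
    pair_rules: list of (A, B, C) for productions A -> B C
    """
    n = len(word)
    if n == 0:
        return False  # a grammar in CNF as defined above cannot derive ε
    # table[i][j] holds every nonterminal A with A =>* word[i..j] (inclusive).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        for head, terminal in unit_rules:
            if terminal == ch:
                table[i][i].add(head)
    for length in range(2, n + 1):          # substring length
        for i in range(n - length + 1):     # start position
            j = i + length - 1              # end position (inclusive)
            for split in range(i, j):       # left part is word[i..split]
                for head, left, right in pair_rules:
                    if left in table[i][split] and right in table[split + 1][j]:
                        table[i][j].add(head)
    return start in table[0][n - 1]

# Assumed CNF grammar for { a^n b^n | n >= 1 }:
#   S -> A B | A C,  C -> S B,  A -> a,  B -> b
UNIT = [("A", "a"), ("B", "b")]
PAIR = [("S", "A", "B"), ("S", "A", "C"), ("C", "S", "B")]

print(cyk("aabb", UNIT, PAIR), cyk("abab", UNIT, PAIR))
```

The three nested loops over length, start position, and split point give the cubic running time mentioned above.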

An Impractical Approach
Cubic time is much too slow in practice: the source code for Windows is 60 million lines
An industrial parser must be close to linear time

Shift-Reduce Parsing
Shift-reduce parsers work bottom-up using a stack to track derivations
Extend the grammar with an EOF symbol $
Choose between the following actions:
  shift: move the first input token to the top of the stack
  reduce: replace α on top of the stack by A for some production A → α
  accept: when S$ is reduced

Shift-Reduce in Action
stack    input    action
(empty)  x*y+z$   shift
x        *y+z$    reduce F->x
F        *y+z$    reduce T->F
T        *y+z$    shift
T*       y+z$     shift
T*y      +z$      reduce F->y
T*F      +z$      reduce T->T*F
T        +z$      reduce S->T
S        +z$      shift
S+       z$       shift
S+z      $        reduce F->z
S+F      $        reduce T->F
S+T      $        reduce S->S+T
S        $        shift
S$       (empty)  accept

Shift-Reduce Always Works
A shift-reduce trace is the same as a backward rightmost derivation sequence:
stack    input    action          sentential form
(empty)  x*y+z$   shift
x        *y+z$    reduce F->x     x*y+z
F        *y+z$    reduce T->F     F*y+z
T        *y+z$    shift
T*       y+z$     shift
T*y      +z$      reduce F->y     T*y+z
T*F      +z$      reduce T->T*F   T*F+z
T        +z$      reduce S->T     T+z
S        +z$      shift
S+       z$       shift
S+z      $        reduce F->z     S+z
S+F      $        reduce T->F     S+F
S+T      $        reduce S->S+T   S+T
S        $        shift
S$       (empty)  accept          S

Deterministic Parsing
A string is parsed by a grammar iff it is accepted by some run of the shift-reduce parser
We must know when to shift and when to reduce
A deterministic parser uses a table to determine which action to take
Some grammars can be parsed like this

The General LR(1) Algorithm
Enumerate the productions of the grammar:
  1: S → S + T
  2: S → S - T
  3: S → T
  4: T → T * F
  5: T → T / F
  6: T → F
  7: F → x
  8: F → y
  9: F → z
  10: F → ( S )

LR(1) Setup
A finite set of states (numbered 0, 1, 2, ...)
An input string
A stack of states, initialized to the state 0
A magical table
An LALR(1) parser is a particular case of an LR(1) parser in which some states have been merged. This merging generally leads to smaller tables.

LALR(1) Table
(. denotes an empty entry)
state  x    y    z    +    -    *    /    (    )    $    S    T    F
0      s1   s2   s3   .    .    .    .    s4   .    a    g5   g6   g7
1      r7   r7   r7   r7   r7   r7   r7   r7   r7   r7   .    .    .
2      r8   r8   r8   r8   r8   r8   r8   r8   r8   r8   .    .    .
3      r9   r9   r9   r9   r9   r9   r9   r9   r9   r9   .    .    .
4      s1   s2   s3   .    .    .    .    s4   .    .    g8   g6   g7
5      .    .    .    s10  s11  .    .    .    .    s9   .    .    .
6      r3   r3   r3   r3   r3   s12  s13  r3   r3   r3   .    .    .
7      r6   r6   r6   r6   r6   r6   r6   r6   r6   r6   .    .    .
8      .    .    .    s10  s11  .    .    .    s14  .    .    .    .
9      a    a    a    a    a    a    a    a    a    a    .    .    .
10     s1   s2   s3   .    .    .    .    s4   .    .    .    g15  g7
11     s1   s2   s3   .    .    .    .    s4   .    .    .    g16  g7
12     s1   s2   s3   .    .    .    .    s4   .    .    .    .    g17
13     s1   s2   s3   .    .    .    .    s4   .    .    .    .    g18
14     r10  r10  r10  r10  r10  r10  r10  r10  r10  r10  .    .    .
15     r1   r1   r1   r1   r1   s12  s13  r1   r1   r1   .    .    .
16     r2   r2   r2   r2   r2   s12  s13  r2   r2   r2   .    .    .
17     r4   r4   r4   r4   r4   r4   r4   r4   r4   r4   .    .    .
18     r5   r5   r5   r5   r5   r5   r5   r5   r5   r5   .    .    .

LR(1) Actions
sk: shift and push state k
gk: push state k ("goto action")
ri: is a combination of two steps:
  First pop |α| states, where A → α is the i'th production
  Then look up a goto action gk at entry (j,A), where j is the new stack top, and push state k to the stack
  (Note: the action at (j,A) is always a goto action)
a: accept
An empty table entry indicates a parse error

LR(1) Parsing
Keep executing the action at entry (j,a), where j is the stack top and a is the next input symbol
Stop at either accept or error
This is completely deterministic
The time complexity is linear in the input string
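The driver loop is short once the table exists. The following sketch is one possible encoding, not the slides' implementation: the names `TABLE`, `reduce_row`, and `parse` are illustrative, each input character is taken as one token, and the table entries are transcribed from the LALR(1) table for the expression grammar (the lone accept entry in row 0 is left out because no run from state 0 needs it).

```python
# Productions 1..10 of the expression grammar as (head, body length).
PRODUCTIONS = {1: ("S", 3), 2: ("S", 3), 3: ("S", 1), 4: ("T", 3), 5: ("T", 3),
               6: ("T", 1), 7: ("F", 1), 8: ("F", 1), 9: ("F", 1), 10: ("F", 3)}
TERMINALS = "xyz+-*/()$"

def reduce_row(i, overrides=None):
    """Row that reduces by production i on every terminal, except overrides."""
    entries = {t: f"r{i}" for t in TERMINALS}
    entries.update(overrides or {})
    return entries

OPERAND = {"x": "s1", "y": "s2", "z": "s3", "(": "s4"}
MULDIV = {"*": "s12", "/": "s13"}

TABLE = {
    0: {**OPERAND, "S": "g5", "T": "g6", "F": "g7"},
    1: reduce_row(7), 2: reduce_row(8), 3: reduce_row(9),
    4: {**OPERAND, "S": "g8", "T": "g6", "F": "g7"},
    5: {"+": "s10", "-": "s11", "$": "s9"},
    6: reduce_row(3, MULDIV),
    7: reduce_row(6),
    8: {"+": "s10", "-": "s11", ")": "s14"},
    9: {t: "a" for t in TERMINALS},
    10: {**OPERAND, "T": "g15", "F": "g7"},
    11: {**OPERAND, "T": "g16", "F": "g7"},
    12: {**OPERAND, "F": "g17"},
    13: {**OPERAND, "F": "g18"},
    14: reduce_row(10),
    15: reduce_row(1, MULDIV), 16: reduce_row(2, MULDIV),
    17: reduce_row(4), 18: reduce_row(5),
}

def parse(text):
    """Run the table-driven parser; return the actions taken or raise SyntaxError."""
    tokens = list(text) + ["$"]
    stack, pos, trace = [0], 0, []
    while True:
        # After $ has been shifted the input is empty; state 9 accepts on anything.
        lookahead = tokens[pos] if pos < len(tokens) else "$"
        action = TABLE[stack[-1]].get(lookahead)
        if action is None:  # empty entry: parse error
            raise SyntaxError(f"unexpected {lookahead!r} in state {stack[-1]}")
        if action == "a":
            trace.append("a")
            return trace
        number = int(action[1:])
        if action[0] == "s":  # shift: push state and consume one token
            stack.append(number)
            pos += 1
            trace.append(action)
        else:  # reduce: pop |body| states, then follow the goto for the head
            head, length = PRODUCTIONS[number]
            del stack[-length:]
            goto = TABLE[stack[-1]][head]
            stack.append(int(goto[1:]))
            trace.append(f"{action} + {goto}")

print(parse("x*y+z"))
```

Each loop iteration does a constant amount of work, and every token is shifted exactly once, which is where the linear running time comes from.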

LALR(1) Example
stack       input    action
0           x*y+z$   s1
0,1         *y+z$    r7 + g7
0,7         *y+z$    r6 + g6
0,6         *y+z$    s12
0,6,12      y+z$     s2
0,6,12,2    +z$      r8 + g17
0,6,12,17   +z$      r4 + g6
0,6         +z$      r3 + g5
0,5         +z$      s10
0,5,10      z$       s3
0,5,10,3    $        r9 + g7
0,5,10,7    $        r6 + g15
0,5,10,15   $        r1 + g5
0,5         $        s9
0,5,9       (empty)  a

LR(1) and LALR(1) Conflicts
The LR(1) algorithm tries to construct a table
For some grammars, the table becomes perfect
For other grammars, it may contain conflicts:
  shift/reduce: an entry contains both a shift and a reduce action
  reduce/reduce: an entry contains two different reduce actions
Because of the state merging in an LALR(1) table, it will potentially contain more conflicts than the corresponding LR(1) table.

Grammar Containments
Context-Free ⊃ Unambiguous ⊃ LR(1) ⊃ LALR(1)

LALR(1) Conflicts in Action
The ambiguous grammar
  S → S + S | S - S | S * S | S / S | ( S ) | x | y | z
generates 16 LALR(1) shift/reduce conflicts
But the unambiguous version is LALR(1)...

Tokens
For a Java grammar, Σ = Unicode
This is not a practical approach
Instead, grammars use an alphabet of tokens:
  keywords, identifiers, numerals, strings, constants, comments, symbols (==, <=, ++, ...), whitespace, ...

Tokens Are Regular Expressions
Tokens are defined through regular expressions:
  keyword: class
  identifier: [a-z][a-z0-9]*
  numeral: [+]?[0-9]+
  symbol: ++
  whitespace: [ ]*

Scanning
A scanner translates a string of characters into a string of tokens
It is defined by an ordered sequence of regular expressions for the tokens: r_1, r_2, ..., r_k
Let t_i be the longest prefix of the input string that is recognized by r_i
Let k = max{ |t_i| }
Let j = min{ i | |t_i| = k }
The next token is then t_j
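The longest-match rule with ties broken by definition order can be sketched with Python's `re` module. This is an illustration under assumptions, not the slides' implementation: `TOKEN_DEFS` and `scan` are hypothetical names, the symbol ++ is escaped as `\+\+` because + is a regex operator, and an empty match (possible for `[ ]*`) is never taken as a token, so the scanner cannot loop forever.

```python
import re

# Ordered token definitions; earlier entries win ties (the min{ i | ... } rule).
TOKEN_DEFS = [
    ("keyword", r"class"),
    ("identifier", r"[a-z][a-z0-9]*"),
    ("numeral", r"[+]?[0-9]+"),
    ("symbol", r"\+\+"),
    ("whitespace", r"[ ]*"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_DEFS]

def scan(text):
    """Split `text` into (token name, lexeme) pairs using the longest-match rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            # For these patterns the greedy match is also the longest prefix;
            # patterns with alternation could need more care.
            match = regex.match(text, pos)
            # Strict '>' keeps the earliest definition when lengths tie.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

for name, lexeme in scan("class foo +17 c++"):
    print(name, repr(lexeme))
```

On an input such as classy, the identifier definition wins over the keyword because its match is longer, while on class alone the two tie and the keyword wins by order.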

Scanning with a DFA
Scanning (for each token definition) can be efficiently performed with a minimal DFA
Run the input string through the DFA
At each accept state, record the current prefix
When the DFA crashes, the last prefix is the candidate token
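As an illustration of the record-and-crash procedure, here is a small hand-built DFA for the numeral definition [+]?[0-9]+. The state names and the function `longest_accepted_prefix` are assumptions made for this sketch; a real scanner would generate the automaton from the regular expression.

```python
DIGITS = "0123456789"

# Transition function of a DFA for [+]?[0-9]+ (state names are illustrative).
TRANSITIONS = {
    ("start", "+"): "sign",
    **{("start", d): "digits" for d in DIGITS},
    **{("sign", d): "digits" for d in DIGITS},
    **{("digits", d): "digits" for d in DIGITS},
}
ACCEPTING = {"digits"}

def longest_accepted_prefix(text, start="start"):
    """Return the longest prefix of `text` accepted by the DFA, or None."""
    state, last_accept = start, None
    for i, ch in enumerate(text):
        state = TRANSITIONS.get((state, ch))
        if state is None:  # the DFA crashes: no transition on this character
            break
        if state in ACCEPTING:
            last_accept = i + 1  # record the current prefix length
    return None if last_accept is None else text[:last_accept]

print(longest_accepted_prefix("+17 c++"))
```

Only the length of the last accepted prefix needs to be remembered, so the scan of one token is a single left-to-right pass.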

Scanning in Action (1/3)
The previous collection of tokens:
  class
  [a-z][a-z0-9]*
  [+]?[0-9]+
  ++
  [ ]*

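Scanning in Action (2/3)
The automata:
[Figure: one automaton per token definition, with edges labelled c, l, a, s, s for the keyword; a-z followed by a-z0-9 for identifiers; + and 0-9 for numerals; +, + for the symbol; and \u0020 for whitespace.]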

Scanning in Action (3/3)
The input string:
  class foo +17 c++
generates the tokens:
  keyword: class
  whitespace
  identifier: foo
  whitespace
  numeral: +17
  whitespace
  identifier: c
  symbol: ++