Intro To Parsing. Step By Step

#1 Intro To Parsing Step By Step

#2 Self-Test from Last Time Are practical parsers and scanners based on deterministic or non-deterministic automata? How can regular expressions be used to specify nested constructs? How is a two-dimensional transition table used in table-driven scanning?

#3 Self-Test Answers Are practical parsers and scanners based on deterministic or non-deterministic automata? Deterministic. (Half credit: NFAs and DFAs are equivalent, but practical scanners convert to a DFA.) How can regular expressions be used to specify nested constructs? They cannot. (Scott Book 2.1.2, second sentence) Nested parentheses (^n )^n are the canonical example. How is a two-dimensional transition table used in table-driven scanning? new_state = table[old_state][new_character]
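
To make the last answer concrete, here is a tiny table-driven scanning step in Python. The states, character classes, and table below are my own toy illustration, not the course's scanner.

    # Rows are states, columns are character classes; entries are next states (-1 = error).
    DIGIT, OTHER = 0, 1
    table = [
        [1, -1],   # state 0 (start): a digit moves to state 1, anything else is an error
        [1, -1],   # state 1 (in an integer literal): more digits stay here
    ]

    def classify(ch):
        return DIGIT if ch.isdigit() else OTHER

    state = 0
    for ch in "42":
        state = table[state][classify(ch)]   # new_state = table[old_state][new_character]
    assert state == 1                        # "42" is accepted as an integer literal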

#4 Cunning Plan Formal Languages Regular Languages Revisited Parser Overview Context-Free Grammars (CFGs) Derivations Ambiguity

#5 One-Slide Summary A parser takes a sequence of tokens as input. If the input is valid, it produces a parse tree (or derivation). Context-free grammars are a notation for specifying formal languages. They contain terminals, non-terminals and productions (aka rewrite rules).

#6 Languages and Automata Formal languages are very important in CS, especially in programming languages Regular languages The weakest formal languages widely used Many applications We will also study context-free languages A stronger type of formal language

#7 Limitations of Regular Languages Intuition: A finite automaton that runs long enough must repeat states Pigeonhole Principle: imagine 20 states and 300 input characters A finite automaton can't remember how often it has visited a particular state It only has enough memory to know which state it is in Cannot count, except up to a finite limit The language of balanced parentheses is not regular: { (^i )^i | i >= 0 } http://en.wikipedia.org/wiki/pumping_lemma_for_regular_languages
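
As a small illustration of the counting argument (my own sketch, not from the slides): membership in { (^i )^i | i >= 0 } can be checked with one unbounded counter, which is exactly the memory a finite automaton lacks.

    def in_paren_language(s):
        # Count the leading "(" characters, then require exactly that many ")".
        # The unbounded count is precisely what a DFA's finite states cannot hold.
        opens = 0
        while opens < len(s) and s[opens] == "(":
            opens += 1
        return s == "(" * opens + ")" * opens

    assert in_paren_language("((()))") and not in_paren_language("(()")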

#8 The Functionality of the Parser Input: sequence of tokens from the lexer e.g., the .cl-lex files you make in PA2 Output: parse tree of the program Also called an abstract syntax tree Output: error if the input is not valid e.g., parse error on line 3

#9 Example Cool program text: if x = y then 1 else 2 fi Parser input (tokens): IF ID = ID THEN INT ELSE INT FI Parser output (tree): an IF-THEN-ELSE node whose children are the comparison ID = ID and the two INT leaves

#11 The Role of the Parser Not all sequences of tokens are programs, e.g., then x * / + 3 while x ; y z then The parser must distinguish between valid and invalid sequences of tokens We need A language or formalism to describe valid sequences of tokens A method (an algorithm) for distinguishing valid from invalid sequences of tokens

#12 Programming Language Structure Programming languages have recursive structure Consider the language of arithmetic expressions with integers, +, *, and ( ) An expression is either: an integer an expression followed by + followed by an expression an expression followed by * followed by an expression a ( followed by an expression followed by ) int, int + int, (int + int) * int are expressions

#13 Notation for Programming Languages An alternative notation: E → int | E + E | E * E | ( E ) We can view these rules as rewrite rules We start with E and replace occurrences of E with some right-hand side E → E * E → ( E ) * E → ( E + E ) * E → ... → (int + int) * int

#14 Observation All arithmetic expressions can be obtained by a sequence of replacements Any sequence of replacements forms a valid arithmetic expression This means that we cannot obtain ( int ) ) by any sequence of replacements. Why? This notation is a context-free grammar

#15 Context Free Grammars A context-free grammar consists of A set of non-terminals N Written in uppercase in these notes A set of terminals T Lowercase or punctuation in these notes A start symbol S (a non-terminal) A set of productions (rewrite rules) Assuming X ∈ N: X → Y1 Y2 ... Yn, or X → ε, where each Yi ∈ N ∪ T
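
For concreteness, here is one possible way (my own encoding, not part of the lecture) to write the arithmetic grammar of the next slide as plain Python data, with the four components N, T, S, and the productions spelled out:

    GRAMMAR = {
        "non_terminals": {"E"},
        "terminals": {"int", "+", "*", "(", ")"},
        "start": "E",
        "productions": {                      # E -> int | E + E | E * E | ( E )
            "E": [["int"], ["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"]],
        },
    }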

#16 Examples of CFGs Simple arithmetic expressions: E → int | E + E | E * E | ( E ) One non-terminal: E Several terminals: int + * ( ) Called terminals because they are never replaced By convention the non-terminal of the first production is the start symbol

#17 The Language of a CFG Read productions as replacement rules: X → Y1 ... Yn means X can be replaced by Y1 ... Yn X → ε means X can be erased or eaten (replaced with the empty string)

#18 Key Idea To construct a valid sequence of terminals: Begin with a string consisting of the start symbol S Replace any non-terminal X in the string by a right-hand side of some production X → Y1 ... Yn Continue replacing until there are only terminals in the string
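
A minimal Python sketch of this replacement process, assuming the arithmetic grammar above; the random choice of production and the step bound are my own additions, only there to keep the sketch terminating:

    import random

    PRODUCTIONS = {"E": [["int"], ["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"]]}

    def derive(start="E", max_steps=20, seed=1):
        # Keep replacing the left-most non-terminal until only terminals remain.
        rng = random.Random(seed)
        form = [start]
        for _ in range(max_steps):
            nts = [i for i, sym in enumerate(form) if sym in PRODUCTIONS]
            if not nts:
                return form                               # only terminals left
            i = nts[0]
            form = form[:i] + rng.choice(PRODUCTIONS[form[i]]) + form[i + 1:]
        # If the string is still growing, finish it off with E -> int everywhere.
        return [sym if sym not in PRODUCTIONS else "int" for sym in form]

    print(" ".join(derive()))   # prints one (randomly chosen) expression in the language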

#19 The Language of a CFG: → More formally, write X1 ... X(i-1) Xi X(i+1) ... Xn → X1 ... X(i-1) Y1 ... Ym X(i+1) ... Xn if there is a production Xi → Y1 ... Ym

#20 The Language of a CFG: →* Write X1 ... Xn →* Y1 ... Ym if X1 ... Xn → ... → Y1 ... Ym in 0 or more steps.

#21 The Language of a CFG Let G be a context-free grammar with start symbol S. Then the language of G is: L(G) = { a1 ... an | S →* a1 ... an and every ai is a terminal } L(G) is a set of strings over the alphabet of terminals.

#22 CFG Language Examples S → 0, S → 1 (also written as S → 0 | 1) generates the language { 0, 1 } What about S → 1 A, A → 0 | 1? What about S → 1 A, A → 0 | 1 A? What about S → ( S )? Pencil and paper corner!

#23 Arithmetic Example Simple arithmetic expressions: E → E + E | E * E | ( E ) | id Some elements of the language: id, id + id, (id), id * id, (id) * id, id * (id)

#24 Cool Example A fragment of COOL: EXPR → if EXPR then EXPR else EXPR fi | while EXPR loop EXPR pool | id

#25 Cool Example (Cont.) Some elements of the language: id; if id then id else id fi; while id loop id pool; if while id loop id pool then id else id fi; if if id then id else id fi then id else id fi

#26 Notes The idea of a CFG is a big step. But: Membership in a language is a yes-or-no answer; we also need the parse tree of the input We must handle errors gracefully We need an implementation of CFGs bison, yacc, ocamlyacc, ply, etc.

#27 More Notes Form of the grammar is important Many grammars generate the same language Give me an example. Automatic tools are sensitive to the grammar Note: Tools for regular languages (e.g., flex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

#28 Derivations and Parse Trees A derivation is a sequence of productions S → ... → ... A derivation can be drawn as a tree The start symbol is the tree's root For a production X → Y1 ... Yn, add children Y1, ..., Yn to node X

#29 Derivation Example Grammar: E → E + E | E * E | ( E ) | id String: id * id + id We're going to build a derivation!

#30 Derivation Example (Cont.) E → E + E → E * E + E → id * E + E → id * id + E → id * id + id Thus E →* id * id + id

#31 Derivation in Detail (1) Start with the start symbol: E

#32 Derivation in Detail (2) Apply E → E + E: the tree gets a + node with two E children. Derivation so far: E → E + E

#33 Derivation in Detail (3) Expand the left-most E with E → E * E. Derivation so far: E → E + E → E * E + E

#34 Derivation in Detail (4) Expand the left-most E with E → id. Derivation so far: ... → id * E + E

#35 Derivation in Detail (5) Expand the left-most E with E → id. Derivation so far: ... → id * id + E

#36 Derivation in Detail (6) Expand the remaining E with E → id. Derivation so far: ... → id * id + id. The parse tree is complete.

#37 Notes on Derivations A parse tree has Terminals at the leaves Non-terminals at the interior nodes A left-to-right traversal of the leaves is the original input The parse tree shows the association of operations; the input string does not!
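
As a small check on these notes (my own encoding, not the course's), here is the parse tree for id * id + id as nested tuples, with a left-to-right leaf traversal recovering the original input:

    # (non-terminal, child, child, ...) for interior nodes, plain strings for leaves.
    tree = ("E",
            ("E", ("E", "id"), "*", ("E", "id")),
            "+",
            ("E", "id"))

    def leaves(node):
        # Left-to-right traversal of the leaves recovers the original token sequence.
        if isinstance(node, str):
            return [node]
        return [tok for child in node[1:] for tok in leaves(child)]

    assert leaves(tree) == ["id", "*", "id", "+", "id"]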

#38 Some CS Research Highlights 80% of CS1 and CS2 courses allow students to collaborate. Is there any evidence that pair programming is helpful? See L. Williams (e.g., Lessons Learned from Seven Years of Pair Programming at North Carolina State University ) About one-third of edits to software are systematic: similar changes to many locations. What can we do to help with such copy-and-paste edits? See M. Kim (e.g., LASE: An Example-based Program Transformation Tool for Locating and Applying Systematic Edits ) Searches and clicks are increasingly critical to on-line advertising. How can we make use of such information while preserving privacy? See N. Mishra (e.g., Releasing Search Queries and Clicks Privately ) Crowdsourcing services like Mechanical Turk are increasingly popular and serve as the backbone of many businesses. How can we make the most of such data? See J. Widom (e.g., Query Optimization over Crowdsourced Data ) Every day more defects are reported than developers can address. How can we automatically fix bugs? See C. Le Goues (e.g., A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each )

Q: Games (548 / 842) In the card game Hearts, how many points would you accumulate if you ended up with all of the Queens and all of the Jacks? (Standard scoring.)

Q: Music (158 / 842) Name the "shocking" 1976 line dance created by Ric Silver and sung and choreographed by Marcia Griffiths.

Q: Games (582 / 842) Minty, Snuzzle, Butterscotch, Bluebelle, Cotton Candy, and Blossom were the first six characters in the 1982 edition of this series of Hasbro-produced brushable hoofed dolls with designs on their hips.

#42 Left-most and Right-most Derivations The previous step-by-step example was a left-most derivation: at each step, replace the left-most non-terminal. There is an equivalent notion of a right-most derivation, in which the right-most non-terminal is replaced at each step. Final result shown here (step-by-step shown next): E → E + E → E + id → E * E + id → E * id + id → id * id + id

#43 Right-most Derivation in Detail (1) Start with the start symbol: E

#44 Right-most Derivation in Detail (2) Apply E → E + E. Derivation so far: E → E + E

#45 Right-most Derivation in Detail (3) Expand the right-most E with E → id. Derivation so far: E → E + E → E + id

#46 Right-most Derivation in Detail (4) Expand the right-most remaining E with E → E * E. Derivation so far: ... → E * E + id

#47 Right-most Derivation in Detail (5) Expand the right-most E with E → id. Derivation so far: ... → E * id + id

#48 Right-most Derivation in Detail (6) Expand the remaining E with E → id. Derivation so far: ... → id * id + id. Same parse tree, different order.

#49 Derivations and Parse Trees Note that for each parse tree there is a left-most and a right-most derivation The difference is the order in which branches are added We will start with a parsing technique that yields left-most derivations Later we'll move on to right-most derivations
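
A short sketch (mine, not from the slides) of the point that both derivations come from one tree: treat a sentential form as a list of subtrees and repeatedly expand either the left-most or the right-most non-terminal subtree.

    def derivation(tree, leftmost=True):
        # List the sentential forms obtained by always expanding the left-most
        # (or right-most) non-terminal subtree in the current form.
        label = lambda n: n[0] if isinstance(n, tuple) else n
        form = [tree]
        steps = [[label(n) for n in form]]
        while any(isinstance(n, tuple) for n in form):
            idxs = [i for i, n in enumerate(form) if isinstance(n, tuple)]
            i = idxs[0] if leftmost else idxs[-1]
            form = form[:i] + list(form[i][1:]) + form[i + 1:]
            steps.append([label(n) for n in form])
        return steps

    tree = ("E", ("E", ("E", "id"), "*", ("E", "id")), "+", ("E", "id"))
    for step in derivation(tree, leftmost=False):     # the right-most derivation
        print(" ".join(step))
    # prints, one form per line: E / E + E / E + id / E * E + id / E * id + id / id * id + id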

#50 Summary of Derivations We are not just interested in whether s ∈ L(G) We also need a parse tree for s A derivation defines a parse tree But one parse tree may have many derivations Left-most and right-most derivations are important in parser implementation

#51 Review A parser consumes a sequence of tokens s and produces a parse tree Issues: How do we recognize that s ∈ L(G)? A parse tree of s describes how s ∈ L(G) Ambiguity: more than one parse tree (interpretation) for some string s Error: no parse tree for some string s How do we construct the parse tree?

#52 Ambiguity (Ambiguous) Grammar G: E → E + E | E * E | ( E ) | int Some strings in L(G): int + int + int, int * int + int

#53 Ambiguity. Example The string int + int + int has two parse trees: one groups the operations as (int + int) + int, the other as int + (int + int). + is left-associative.

#54 Ambiguity. Example The string int * int + int has two parse trees: one groups the operations as (int * int) + int, the other as int * (int + int). * has higher precedence than +.
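
A quick way to see why this matters (my own example, with concrete numbers substituted for the ints): the two trees come from the same token string but evaluate differently.

    times_below_plus = ("+", ("*", 2, 3), 4)    # the tree grouped as (2 * 3) + 4
    plus_below_times = ("*", 2, ("+", 3, 4))    # the tree grouped as 2 * (3 + 4)

    def evaluate(node):
        # Evaluate a tuple-encoded expression tree bottom-up.
        if isinstance(node, int):
            return node
        op, left, right = node
        return evaluate(left) + evaluate(right) if op == "+" else evaluate(left) * evaluate(right)

    print(evaluate(times_below_plus), evaluate(plus_below_times))   # 10 14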

#55 Ambiguity (Cont.) A grammar is ambiguous if it has more than one parse tree for some string Equivalently, there is more than one right-most or left-most derivation for some string Ambiguity is bad Leaves meaning of some programs ill-defined Ambiguity is common in programming languages Arithmetic expressions IF-THEN-ELSE

#56 Dealing with Ambiguity There are several ways to handle ambiguity The most direct method is to rewrite the grammar unambiguously: E → E + T | T and T → T * int | int | ( E ) Enforces precedence of * over + Enforces left-associativity of + and *
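
A hedged sketch of how a hand-written parser can follow this rewritten grammar; the left recursion in E → E + T and T → T * int is handled with loops, which is what keeps + and * grouping to the left here. The function names and token encoding are my own.

    # Tokens are plain strings: "int", "+", "*", "(", ")".
    def parse_E(toks, i=0):
        # E -> E + T | T, with the left recursion turned into a loop.
        node, i = parse_T(toks, i)
        while i < len(toks) and toks[i] == "+":
            right, i = parse_T(toks, i + 1)
            node = ("+", node, right)            # left-associative grouping
        return node, i

    def parse_T(toks, i):
        # T -> T * int | int | ( E ), again with a loop for the left recursion.
        if toks[i] == "(":
            node, i = parse_E(toks, i + 1)
            assert toks[i] == ")", "expected ')'"
            i += 1
        else:
            assert toks[i] == "int", "expected 'int'"
            node, i = "int", i + 1
        while i < len(toks) and toks[i] == "*":  # per this grammar, '*' is followed by int
            assert toks[i + 1] == "int", "expected 'int'"
            node, i = ("*", node, "int"), i + 2
        return node, i

    tree, _ = parse_E("int * int + int".split())
    print(tree)   # ('+', ('*', 'int', 'int'), 'int') -- matches the unique parse tree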

#57 Ambiguity. Example The string int * int + int has only one parse tree now: the root is E → E + T, the left E derives T → T * int → int * int, and the right T derives int, so the grouping is (int * int) + int.

#58 Ambiguity: The Dangling Else Consider this new grammar: E → if E then E | if E then E else E | OTHER This grammar is also ambiguous

#59 The Dangling Else: Example The string if E1 then if E2 then E3 else E4 has two parse trees: one attaches the else to the outer if (if E1 then (if E2 then E3) else E4), the other attaches it to the inner if (if E1 then (if E2 then E3 else E4)). Typically we want the second form.

#60 The Dangling Else: A Fix else matches the closest unmatched then We can describe this in the grammar (distinguish between matched and unmatched then): E → MIF /* all then are matched */ | UIF /* some then are unmatched */ MIF → if E then MIF else MIF | OTHER UIF → if E then E | if E then MIF else UIF Describes the same set of strings

#61 The Dangling Else: Example Revisited The expression if E1 then if E2 then E3 else E4: attaching the else to the inner if gives a valid parse tree (for a UIF); attaching it to the outer if is not valid, because the then expression would have to be a MIF and it is not.
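
Hand-written parsers often implement the same rule more directly than the MIF/UIF grammar: after parsing a then branch, greedily consume a following else, so each else binds to the closest unmatched then. The sketch below is my own illustration of that trick, not the grammar above.

    def parse_E(toks, i=0):
        # if E then E [else E] | OTHER, where OTHER is any single non-keyword token.
        if i < len(toks) and toks[i] == "if":
            cond, i = parse_E(toks, i + 1)
            assert toks[i] == "then", "expected 'then'"
            then_branch, i = parse_E(toks, i + 1)
            if i < len(toks) and toks[i] == "else":      # closest unmatched then wins
                else_branch, i = parse_E(toks, i + 1)
                return ("if", cond, then_branch, else_branch), i
            return ("if", cond, then_branch), i
        return toks[i], i + 1

    tree, _ = parse_E("if E1 then if E2 then E3 else E4".split())
    print(tree)   # ('if', 'E1', ('if', 'E2', 'E3', 'E4')) -- else bound to the inner if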

#62 Ambiguity No general techniques for handling ambiguity Impossible to convert automatically every ambiguous grammar to an unambiguous one Used with care, ambiguity can simplify the grammar Sometimes allows more natural definitions So we need disambiguation mechanisms As shown next...

#63 Precedence and Associativity Declarations Instead of rewriting the grammar Use the more natural (ambiguous) grammar Along with disambiguating declarations Most tools allow precedence and associativity declarations to disambiguate grammars Examples

#64 Associativity Declarations Consider the grammar E → E + E | int and the string int + int + int. The two parse trees group as (int + int) + int and as int + (int + int). Declarations: %left + %left * (left-associativity; in yacc-family tools, operators declared later, like *, bind tighter than those declared earlier)
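
Slide #26 mentioned ply among the CFG tools; as a hedged, minimal illustration of how such declarations look in practice (the token names and grammar here are my own, not the course's), precedence and associativity declarations in ply are written roughly like this:

    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ("INT", "PLUS", "TIMES")
    t_PLUS = r"\+"
    t_TIMES = r"\*"
    t_INT = r"\d+"
    t_ignore = " \t"

    def t_error(t):
        t.lexer.skip(1)

    # Later lines bind tighter, so TIMES gets higher precedence than PLUS.
    precedence = (
        ("left", "PLUS"),
        ("left", "TIMES"),
    )

    def p_expr_binop(p):
        """expr : expr PLUS expr
                | expr TIMES expr"""
        p[0] = (p[2], p[1], p[3])

    def p_expr_int(p):
        "expr : INT"
        p[0] = int(p[1])

    def p_error(p):
        print("syntax error")

    lexer = lex.lex()
    parser = yacc.yacc()
    print(parser.parse("1 + 2 * 3"))   # ('+', 1, ('*', 2, 3))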

#65 Review We can specify language syntax using CFGs A parser will answer whether s ∈ L(G), build a parse tree, and pass it on to the rest of the compiler Next: How do we answer s ∈ L(G) and build a parse tree?

#66 Homework RS1 Recommended for Today PA2 (Lexer) Due You may work in pairs. CA1 (Dead Code) Due You may work in pairs.