MIT 6.035: Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard, Massachusetts Institute of Technology

Language Definition Problem. How to precisely define a language? Layered structure of language definition: start with a set of letters in the language. Lexical structure identifies the words in the language (each word is a sequence of letters). Syntactic structure identifies the sentences in the language (each sentence is a sequence of words). Semantics gives the meaning of the program (specifies what the result should be for each input). Today's topic: lexical and syntactic structure.

Specifying Formal Languages Huge Triumph of Computer Science Beautiful Theoretical Results Practical Techniques and Applications Two Dual Notions Generative approach (grammar or regular expression) Recognition approach (automaton) Lots of theorems about converting one approach automatically to another

Specifying Lexical Structure Using Regular Expressions. Have some alphabet Σ = set of letters. Regular expressions are built from:
ε - the empty string
Any letter from the alphabet
r1 r2 - regular expression r1 followed by r2 (sequence)
r1 | r2 - either regular expression r1 or r2 (choice)
r* - iterated sequence and choice: ε | r | rr | ...
Parentheses to indicate grouping/precedence

Concept of Regular Expression Generating a String. Rewrite the regular expression until only a sequence of letters (a string) is left. General rules:
1) r1 | r2 → r1
2) r1 | r2 → r2
3) r* → r r*
4) r* → ε
Example: (0|1)*.(0|1)* → (0|1)(0|1)*.(0|1)* → 1(0|1)*.(0|1)* → 1.(0|1)* → 1.(0|1)(0|1)* → 1.(0|1) → 1.0

Nondeterminism in Generation. Rewriting is similar to equational reasoning, but different rule applications may yield different final results.
Example 1: (0|1)*.(0|1)* → (0|1)(0|1)*.(0|1)* → 1(0|1)*.(0|1)* → 1.(0|1)* → 1.(0|1)(0|1)* → 1.(0|1) → 1.0
Example 2: (0|1)*.(0|1)* → (0|1)(0|1)*.(0|1)* → 0(0|1)*.(0|1)* → 0.(0|1)* → 0.(0|1)(0|1)* → 0.(0|1) → 0.1

Concept of Language Generated by Regular Expressions. The set of all strings generated by a regular expression is the language of the regular expression. In general, the language may be (countably) infinite. A string in the language is often called a token.

Examples of Languages and Regular Expressions
Σ = { 0, 1, . }
(0|1)*.(0|1)* - binary floating point numbers
(00)* - even-length all-zero strings
1*(01*01*)* - strings with an even number of zeros
Σ = { a, b, c, 0, 1, 2 }
(a|b|c)(a|b|c|0|1|2)* - alphanumeric identifiers
(0|1|2)* - ternary numbers
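These languages can be checked directly with Python's re module; a small sketch (Python writes choice as |, needs \. for a literal dot, and ^...$ to force a whole-string match):

```python
import re

# The slide's regular expressions over the alphabet {0, 1, .},
# rewritten in Python regex syntax.
binary_float = re.compile(r"^(0|1)*\.(0|1)*$")   # (0|1)*.(0|1)*
even_zero_string = re.compile(r"^(00)*$")        # even-length all-zero strings
even_zeros = re.compile(r"^1*(01*01*)*$")        # even number of zeros

assert binary_float.match("11.0")
assert not binary_float.match("110")             # no dot: rejected
assert even_zero_string.match("0000") and not even_zero_string.match("000")
assert even_zeros.match("1001") and not even_zeros.match("10")
```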

Alternate Abstraction: Finite-State Automata. Alphabet; set of states with initial and accept states; transitions between states, labeled with letters. [Diagram: automaton for (0|1)*.(0|1)*: the start state loops on 0 and 1, a transition labeled '.' leads to the accept state, which also loops on 0 and 1.]

Automaton Accepting a String. Conceptually, run the string through the automaton. Keep a current state and a current letter in the string: start with the start state and the first letter. At each step, match the current letter against a transition whose label is the same as the letter. Continue until reaching the end of the string or until the match fails. If the run ends in an accept state, the automaton accepts the string. The language of the automaton is the set of strings it accepts.

Example. [Six slides step the string 11.0 through the automaton, advancing the current letter through 1, 1, ., 0; each letter matches a transition from the current state, and the run ends in the accept state, so the string is accepted.]
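This stepping can be simulated directly. A minimal sketch, with the two-state automaton for (0|1)*.(0|1)* encoded as a transition table (the state names "pre"/"post" are made up for readability):

```python
# DFA for (0|1)*.(0|1)* as a transition table. States: "pre" (start state,
# before the dot) and "post" (accept state, after the dot).
DFA = {
    ("pre", "0"): "pre", ("pre", "1"): "pre", ("pre", "."): "post",
    ("post", "0"): "post", ("post", "1"): "post",
}

def accepts(string, start="pre", accept=("post",)):
    state = start
    for letter in string:
        if (state, letter) not in DFA:
            return False                  # match fails: reject
        state = DFA[(state, letter)]
    return state in accept                # accept iff run ends in accept state

assert accepts("11.0")                    # the slide's example
assert not accepts("110")                 # never reaches the accept state
```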

DFA vs. NFA [Diagram: two automata accepting "ab or ac". The DFA has a single a-transition from the start state and then branches on b or c. The NFA has two a-transitions from the start state, one leading to a state expecting b and one to a state expecting c.]

DFA vs. NFA. A DFA has only one possible transition at each state. An NFA may have multiple possible transitions: 2 or more transitions with the same label, or transitions labeled with the empty string ε. Rule: a string is accepted if any execution accepts. Angelic vs. demonic nondeterminism: angelic - all decisions are made so as to accept; demonic - all decisions are made so as to not accept. NFAs use angelic nondeterminism.
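Angelic nondeterminism is easy to implement by tracking the set of all states the NFA could be in. A sketch over a small NFA for the language {ab, ac} (the state numbering is an assumption of the sketch):

```python
# NFA with two a-transitions out of the start state (state 0): one guess
# heads toward "ab", the other toward "ac". Accept state: 3.
NFA = {
    (0, "a"): {1, 2},      # two transitions with the same label
    (1, "b"): {3},
    (2, "c"): {3},
}

def nfa_accepts(string, start=0, accept={3}):
    states = {start}       # all states some execution could be in
    for letter in string:
        states = set().union(*(NFA.get((s, letter), set()) for s in states))
    return bool(states & accept)   # angelic: accept if ANY execution accepts

assert nfa_accepts("ab") and nfa_accepts("ac")
assert not nfa_accepts("a") and not nfa_accepts("abc")
```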

Generative Versus Recognition. Regular expressions give you a way to generate all strings in the language. Automata give you a way to recognize whether a specific string is in the language. Philosophically very different, but theoretically equivalent (for regular expressions and automata). Standard approach: use regular expressions when defining the language; translate them automatically into automata for the implementation.

From Regular Expressions to Automata. Construction by structural induction. Given an arbitrary regular expression r, assume we can convert r to an automaton with one start state and one accept state. Show how to convert all constructors to deliver an automaton with one start state and one accept state.

Basic Constructs. [Diagram: for a letter a ∈ Σ, an automaton with a start state, an accept state, and a single transition labeled a between them.]

Sequence r1 r2. [Slides build the automaton stepwise: take the automata for r1 and r2; the new start state is r1's old start state and the new accept state is r2's old accept state; r1's old accept state is linked to r2's old start state, so a string must match r1 and then r2.]

Choice r1 | r2. [Slides build the automaton stepwise: a new start state with transitions to the old start states of r1 and r2, and transitions from their old accept states to a new accept state, so a string may match either r1 or r2.]

Kleene Star r*. [Slides build the automaton stepwise: a new start state and a new accept state, with transitions that bypass r (accepting the empty string), enter r's old start state, and loop from r's old accept state back around for repeated matches.]

NFA vs. DFA. A DFA has no ε-transitions and at most one transition from each state for each letter; an NFA has neither restriction. [Diagram: a state with one a-transition and one b-transition is OK for a DFA; a state with two a-transitions is NOT OK.]

Conversions Our regular expression to automata conversion produces an NFA Would like to have a DFA to make recognition algorithm simpler Can convert from NFA to DFA (but DFA may be exponentially larger than NFA)

NFA to DFA Construction. The DFA has a state for each subset of states in the NFA. The DFA start state corresponds to the set of states reachable by following ε-transitions from the NFA start state. A DFA state is an accept state if an NFA accept state is in its set of NFA states. To compute the transition for a given DFA state D and letter a: set S to the empty set; find the set N of D's NFA states; for all NFA states n in N, compute the set N' of states that the NFA may be in after matching a, and set S to S ∪ N'. If S is nonempty, there is a transition for a from D to the DFA state that has the set S of NFA states; otherwise, there is no transition for a from D.
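The construction above can be sketched in a few lines. Assumed encoding: the NFA is a dict mapping (state, letter) to a set of successor states, with the letter "" standing for an ε-move; DFA states are frozensets of NFA states.

```python
from collections import deque

def eclose(nfa, states):
    # All states reachable from `states` by following ε ("") transitions.
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, ""), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(nfa, start, accepts, alphabet):
    d_start = eclose(nfa, {start})        # DFA start: ε-closure of NFA start
    dfa, d_accepts = {}, set()
    work, seen = deque([d_start]), {d_start}
    while work:
        d = work.popleft()
        if d & accepts:                   # contains an NFA accept state
            d_accepts.add(d)
        for a in alphabet:
            # Union of states the NFA may reach from d after matching a
            n = set().union(*(nfa.get((s, a), set()) for s in d))
            if n:                         # nonempty: add a DFA transition
                dfa[(d, a)] = eclose(nfa, n)
                if dfa[(d, a)] not in seen:
                    seen.add(dfa[(d, a)])
                    work.append(dfa[(d, a)])
    return dfa, d_start, d_accepts

# Tiny NFA for a*. : an ε-move into a loop on 'a', then '.' to accept.
nfa = {(0, ""): {1}, (1, "a"): {1}, (1, "."): {2}}
dfa, s0, acc = subset_construction(nfa, 0, {2}, {"a", "."})
assert s0 == frozenset({0, 1})
assert dfa[(s0, ".")] in acc
```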

NFA to DFA Example for (a|b)*.(a|b)*. [Diagram: the NFA produced by the construction has states numbered 1-16, with a/b branches. The subset construction yields DFA states such as {1,2,3,4,8} (start), with an a-transition to {5,7,2,3,4,8}, a b-transition to {6,7,2,3,4,8}, a '.'-transition into {9,10,11,12,16}, and a/b transitions among {13,15,10,11,12,16} and {14,15,10,11,12,16}.]

Lexical Structure in Languages. Each language typically has several categories of words. In a typical programming language: keywords (if, while); arithmetic operators (+, -, *, /); integer numbers (1, 2, 45, 67); floating point numbers (1.0, .2, 3.337); identifiers (abc, i, j, ab345). Typically there is a lexical category for each keyword and one for each category of words. Each lexical category is defined by a regular expression.

Lexical Categories Example
IfKeyword = if
WhileKeyword = while
Operator = + | - | * | /
Integer = [0-9] [0-9]*
Float = [0-9]* . [0-9]*
Identifier = [a-z] ([a-z] | [0-9])*
Note that [0-9] = (0|1|2|3|4|5|6|7|8|9) and [a-z] = (a|b|c|...|y|z). Will use lexical categories in the next level.
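These categories map directly onto a lexer sketch using Python named groups. This is a simplification: alternatives are tried in order, so keywords are listed before Identifier, and maximal-munch subtleties (an identifier like "ifx") are ignored here.

```python
import re

# One named group per lexical category, tried in order at each position.
TOKEN_SPEC = [
    ("IfKeyword",    r"if"),
    ("WhileKeyword", r"while"),
    ("Float",        r"[0-9]*\.[0-9]*"),
    ("Integer",      r"[0-9][0-9]*"),
    ("Identifier",   r"[a-z]([a-z]|[0-9])*"),
    ("Operator",     r"[+\-*/]"),
    ("Skip",         r"\s+"),            # whitespace between tokens
]
LEXER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(text)
            if m.lastgroup != "Skip"]

assert tokenize("while i + 45") == [
    ("WhileKeyword", "while"), ("Identifier", "i"),
    ("Operator", "+"), ("Integer", "45"),
]
assert tokenize("3.337")[0] == ("Float", "3.337")
```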

Programming Language Syntax. Regular languages are inadequate for specifying programming language syntax. Why? Constructs with nested syntax: (a+(b-c))*(d-(x-(y-z))); if (x < y) if (y < z) a = 5 else a = 6 else a = 7. Regular languages lack the state required to model nesting. Canonical example: nested expressions; there is no regular expression for the language of parenthesized expressions.

Solution: Context-Free Grammar. Set of terminals { Op, Int, Open, Close }, each terminal defined by a regular expression. Set of nonterminals { Start, Expr }. Set of productions: a single nonterminal on the LHS, a sequence of terminals and nonterminals on the RHS.
Op = + | - | * | /
Int = [0-9] [0-9]*
Open = <
Close = >
Start → Expr
Expr → Expr Op Expr
Expr → Int
Expr → Open Expr Close

Production Game. Keep a current string, starting with the Start nonterminal. Loop until no more nonterminals remain: choose a nonterminal in the current string; choose a production with that nonterminal on the LHS; replace the nonterminal with the RHS of the production. Then substitute each terminal's regular expression with a corresponding generated string. The generated string is in the language. Note: different choices produce different strings.

Sample Derivation
Op = + | - | * | /
Int = [0-9] [0-9]*
Open = <
Close = >
1) Start → Expr
2) Expr → Expr Op Expr
3) Expr → Int
4) Expr → Open Expr Close
Start → Expr → Expr Op Expr → Open Expr Close Op Expr → Open Expr Op Expr Close Op Expr → Open Int Op Expr Close Op Expr → Open Int Op Expr Close Op Int → Open Int Op Int Close Op Int → < 2 - 1 > + 1
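The production game can be played mechanically. A sketch that makes the random choices for you; the grammar encoding, the sample terminal strings, and the depth cutoff that forces termination are all assumptions of the sketch:

```python
import random

# Grammar from the slides: Start -> Expr;
# Expr -> Expr Op Expr | Int | Open Expr Close.
GRAMMAR = {
    "Start": [["Expr"]],
    "Expr":  [["Expr", "Op", "Expr"], ["Int"], ["Open", "Expr", "Close"]],
}
# Each terminal expands to a sample letter from its regular expression.
TERMINALS = {"Op": "+-", "Int": "12", "Open": "<", "Close": ">"}

def generate(symbol, depth=0):
    if symbol in TERMINALS:                      # terminal: pick a letter
        return random.choice(TERMINALS[symbol])
    rules = GRAMMAR[symbol]
    if depth > 4:                                # cutoff: force Expr -> Int
        rules = [r for r in rules if r == ["Int"]] or rules
    return "".join(generate(s, depth + 1) for s in random.choice(rules))

sample = generate("Start")                       # e.g. "<2-1>+1"
assert sample and set(sample) <= set("12+-<>")
```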

Parse Tree. Internal nodes: nonterminals. Leaves: terminals. Edges: from the nonterminal on the LHS of a production to the nodes from the RHS of that production. Captures the derivation of the string.

Parse Tree for <2-1>+1. [Diagram: Start derives an Expr combining a left Expr, Op(+), and Int(1); the left Expr derives Open(<), Expr, Close(>), and the bracketed Expr derives Int(2), Op(-), Int(1).]

Ambiguity in Grammar. A grammar is ambiguous if there are multiple derivations (and therefore multiple parse trees) for a single string. The derivation and parse tree usually reflect the semantics of the program, so ambiguity in the grammar often reflects ambiguity in the semantics of the language, which is considered undesirable.

Ambiguity Example: two parse trees for 2-1+1. [Diagram: one tree corresponds to <2-1>+1, with Op(+) at the root and 2-1 grouped in the left subtree; the other corresponds to 2-<1+1>, with Op(-) at the root and 1+1 grouped in the right subtree.]

Eliminating Ambiguity. Solution: hack the grammar.
Original grammar:
Start → Expr
Expr → Expr Op Expr
Expr → Int
Expr → Open Expr Close
Hacked grammar:
Start → Expr
Expr → Expr Op Int
Expr → Int
Expr → Open Expr Close
Conceptually, this makes all operators associate to the left.

Parse Trees for Hacked Grammar: only one parse tree for 2-1+1! [Diagram: the valid tree has Op(+) at the top with 2-1 grouped in the left subtree; the tree with Op(-) at the top and 1+1 grouped on the right is no longer valid.]

Precedence Violations. All operators associate to the left, which violates the precedence of * over + and -: 2-3*4 associates like <2-3>*4. [Parse tree for 2-3*4: the root combines the subtree for 2-3 with Op(*) and Int(4); the subtree combines Int(2), Op(-), Int(3).]

Hacking Around Precedence
Original grammar:
Op = + | - | * | /
Int = [0-9] [0-9]*
Open = <
Close = >
Start → Expr
Expr → Expr Op Int
Expr → Int
Expr → Open Expr Close
Hacked grammar:
AddOp = + | -
MulOp = * | /
Int = [0-9] [0-9]*
Open = <
Close = >
Start → Expr
Expr → Expr AddOp Term
Expr → Term
Term → Term MulOp Num
Term → Num
Num → Int
Num → Open Expr Close
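The layered precedence grammar maps directly onto a recursive-descent evaluator. A sketch: the left-recursive productions Expr → Expr AddOp Term and Term → Term MulOp Num are rewritten as loops (a standard transformation), and < and > play the grammar's role of grouping brackets.

```python
import re

def parse(text):
    # Tokenize: integers and the single-character terminals.
    tokens = re.findall(r"[0-9]+|[+\-*/<>]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():                      # Expr -> Term (AddOp Term)*
        v = term()
        while peek() in ("+", "-"):
            v = v + term() if take() == "+" else v - term()
        return v

    def term():                      # Term -> Num (MulOp Num)*
        v = num()
        while peek() in ("*", "/"):
            v = v * num() if take() == "*" else v / num()
        return v

    def num():                       # Num -> Int | Open Expr Close
        if peek() == "<":
            take()                   # consume <
            v = expr()
            take()                   # consume matching >
            return v
        return int(take())

    return expr()

assert parse("2-3*4") == -10         # * binds tighter than -
assert parse("<2-3>*4") == -4        # brackets override precedence
```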

Parse Tree Changes. [Diagrams: the old parse tree for 2-3*4 groups as <2-3>*4. The new parse tree has Start derive Expr AddOp(-) Term, where the left Expr reduces through Term and Num to Int(2), and the Term derives Term(Num(Int 3)) MulOp(*) Num(Int 4), grouping as 2-<3*4>.]

General Idea: Group Operators into Precedence Levels. * and / are at the top level and bind strongest; + and - are at the next level and bind next strongest. There is a nonterminal for each precedence level: Term is the nonterminal for * and /, and Expr is the nonterminal for + and -. Operators can be made left or right associative within each level. This generalizes to arbitrary levels of precedence.

Parser. Converts a program into a parse tree. Can be written by hand, or produced automatically by a parser generator, which accepts a grammar as input and produces a parser as output. Practical problem: the parse tree for the hacked grammar is complicated; we would like to start with a more intuitive parse tree.

Solution: Abstract versus Concrete Syntax. Abstract syntax corresponds to the intuitive way of thinking of the structure of the program; it omits details like superfluous keywords that are there only to make the language unambiguous, and it may be ambiguous. Concrete syntax corresponds to the full grammar used to parse the language. Parsers are often written to produce abstract syntax trees.

Abstract Syntax Trees Start with intuitive but ambiguous grammar Hack grammar to make it unambiguous Concrete parse trees Less intuitive Convert concrete parse trees to abstract syntax trees Correspond to intuitive grammar for language Simpler for program to manipulate
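The same recursive-descent pass over the hacked grammar can emit abstract syntax trees directly, skipping concrete-syntax details like the Open/Close brackets. A sketch; the tuple node shape ("op", left, right) is an assumption of the sketch, not the slides' notation:

```python
import re

def ast(text):
    # Tokenize integers and the single-character terminals.
    tokens = re.findall(r"[0-9]+|[+\-*/<>]", text)

    def expr(pos):                           # Expr -> Term (AddOp Term)*
        node, pos = term(pos)
        while pos < len(tokens) and tokens[pos] in "+-":
            op = tokens[pos]
            right, pos = term(pos + 1)
            node = (op, node, right)         # left-associative grouping
        return node, pos

    def term(pos):                           # Term -> Num (MulOp Num)*
        node, pos = num(pos)
        while pos < len(tokens) and tokens[pos] in "*/":
            op = tokens[pos]
            right, pos = num(pos + 1)
            node = (op, node, right)
        return node, pos

    def num(pos):                            # Num -> Int | Open Expr Close
        if tokens[pos] == "<":
            node, pos = expr(pos + 1)
            return node, pos + 1             # skip >: brackets leave no node
        return int(tokens[pos]), pos + 1

    return expr(0)[0]

assert ast("<2-3>*4") == ("*", ("-", 2, 3), 4)   # brackets vanish from the tree
```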

Example. Intuitive but ambiguous grammar:
Op = * | / | + | -
Int = [0-9] [0-9]*
Start → Expr
Expr → Expr Op Expr
Expr → Int
Hacked unambiguous grammar:
AddOp = + | -
MulOp = * | /
Int = [0-9] [0-9]*
Open = <
Close = >
Start → Expr
Expr → Expr AddOp Term
Expr → Term
Term → Term MulOp Num
Term → Num
Num → Int
Num → Open Expr Close

[Diagrams: the concrete parse tree for <2-3>*4 follows the hacked grammar, with Term MulOp(*) Num at the top and the bracketed 2-3 underneath Open/Close nodes. The abstract syntax tree for <2-3>*4 uses the intuitive grammar and eliminates the superfluous Open and Close terminals.]

[Diagrams: the abstract parse tree for <2-3>*4 keeps Op nodes as children under Start; the further simplified abstract syntax tree promotes each operator to an interior node, giving a '*' node whose children are a '-' node (over Int 2 and Int 3) and Int 4.]

Summary. Lexical and syntactic levels of structure: lexical - regular expressions and automata; syntactic - grammars. Grammar ambiguities, hacked grammars, abstract syntax trees. Generation versus recognition approaches: generation is more convenient for specification; recognition is required in implementation.

Handling If Then Else
Start → Stat
Stat → if Expr then Stat else Stat
Stat → if Expr then Stat
Stat → ...

Parse Trees. Consider the statement: if e1 then if e2 then s1 else s2

Two Parse Trees. [Diagram: in one tree the else attaches to the inner if: the outer Stat is "if e1 then Stat", whose body is "if e2 then s1 else s2". In the other, the else attaches to the outer if: "if e1 then (if e2 then s1) else s2". Which is correct?]

Alternative Readings. Parse tree number 1: if e1 { if e2 s1 else s2 }. Parse tree number 2: if e1 { if e2 s1 } else s2. The grammar is ambiguous.

Hacked Grammar
Goal → Stat
Stat → WithElse
Stat → LastElse
WithElse → if Expr then WithElse else WithElse
WithElse → <statements without if then or if then else>
LastElse → if Expr then Stat
LastElse → if Expr then WithElse else LastElse

Hacked Grammar: Basic Idea. Control carefully where an if without an else can occur: either at the top level of the statement, or as the very last in a sequence of if then else if then ... statements.
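The nearest-if reading that the hacked grammar enforces is also what a hand-written recursive-descent parser does naturally: when it finishes a then-branch and sees "else", it attaches the else to the if it is currently parsing. A sketch over whitespace-separated tokens; the tuple AST shape and the fixed "if cond then" token layout are assumptions:

```python
def stat(tokens, pos=0):
    # Parse one statement: either "if COND then STAT [else STAT]"
    # or a single non-if token (e.g. "s1").
    if tokens[pos] == "if":
        cond = tokens[pos + 1]                     # condition token, e.g. "e1"
        then_branch, pos = stat(tokens, pos + 3)   # skip "if", cond, "then"
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = stat(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos      # if without else
    return tokens[pos], pos + 1

tree, _ = stat("if e1 then if e2 then s1 else s2".split())
# The else attaches to the nearest (inner) if:
assert tree == ("if", "e1", ("if", "e2", "s1", "s2"))
```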

Grammar Vocabulary. Leftmost derivation: always expands the leftmost remaining nonterminal; similarly for rightmost derivation. Sentential form: a partially or fully derived string from a step in a valid derivation, e.g. 0 + Expr Op Expr and 0 + Expr - 2.

Grammar Defining a Language Generative approach All strings that grammar generates (How many are there for grammar in previous example?) Automaton Recognition approach All strings that automaton accepts Different flavors of grammars and automata In general, grammars and automata correspond

Regular Languages: Automaton Characterization (S, A, F, s0, SF). Finite set of states S; finite alphabet A; transition function F : S × A → S; start state s0; final states SF. The language is the set of strings accepted by the automaton.

Regular Languages: Regular Grammar Characterization (T, NT, S, P). Finite set of terminals T; finite set of nonterminals NT; start nonterminal S (goal symbol, start symbol); finite set of productions P : NT → T ∪ NT ∪ T NT (each right-hand side is a single terminal, a single nonterminal, or a terminal followed by a nonterminal). The language is the set of strings generated by the grammar.

Grammar and Automata Correspondence:
Regular Grammar - Finite-State Automaton
Context-Free Grammar - Push-Down Automaton
Context-Sensitive Grammar - Turing Machine

Context-Free Grammars: Grammar Characterization (T, NT, S, P). Finite set of terminals T; finite set of nonterminals NT; start nonterminal S (goal symbol, start symbol); finite set of productions P : NT → (T ∪ NT)*. The RHS of a production can be any sequence of terminals and nonterminals.

DFA Plus a Stack: Push-Down Automata (S, A, V, F, s0, SF). Finite set of states S; finite input alphabet A and stack alphabet V; transition relation F : S × (A ∪ {ε}) × V → S × V*; start state s0; final states SF. Each configuration consists of a state, a stack, and the remaining input string.
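A sketch of the "DFA plus a stack" idea for the bracket language over < and >. This is a deterministic special case chosen for clarity; a general PDA transition relation also consults the stack top and can push sequences of stack symbols.

```python
def pda_accepts(string):
    # One state plus a stack: push on '<', pop on '>'.
    stack = []
    for letter in string:
        if letter == "<":
            stack.append("<")            # push on open bracket
        elif letter == ">":
            if not stack:                # pop fails: reject
                return False
            stack.pop()
        else:
            return False                 # letter outside the alphabet
    return not stack                     # accept iff stack empty at the end

assert pda_accepts("<<>>") and pda_accepts("<><>") and pda_accepts("")
assert not pda_accepts("<><") and not pda_accepts(">")
```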

CFG Versus PDA. CFGs and PDAs are of equivalent power. Grammar implementation mechanism: translate the CFG to a PDA, then use the PDA to parse the input string. This is the foundation for bottom-up parser generators.

Context-Sensitive Grammars and Turing Machines. Context-sensitive grammars allow productions to use context: P : (T ∪ NT)+ → (T ∪ NT)*. Turing machines have a finite state control and a two-way tape instead of a stack.