Optimizing Finite Automata


We can improve the DFA created by MakeDeterministic. Sometimes a DFA will have more states than necessary; for every DFA there is a unique smallest equivalent DFA (one with the fewest states possible).

Some DFAs contain unreachable states that cannot be reached from the start state. Other DFAs may contain dead states that cannot reach any accepting state. Neither unreachable states nor dead states can participate in scanning any valid token, so we eliminate all such states as part of our optimization process.

We optimize a DFA by merging together states we know to be equivalent. For example, two accepting states that have no transitions at all out of them are equivalent. Why? Because they behave exactly the same way: they accept the string read so far, but will accept no additional characters.

If two states, s1 and s2, are equivalent, then all transitions to s2 can be replaced with transitions to s1. In effect, the two states are merged together into one common state.

How do we decide which states to merge? We take a greedy approach and try the most optimistic merger of states. By definition, accepting and non-accepting states are distinct, so we initially try to create only two states: one representing the merger of all accepting states, and the other representing the merger of all non-accepting states.

This merger into only two states is almost certainly too optimistic. In particular, all the constituents of a merged state must agree on the same transition for each possible character. That is, for character c, either all the merged states must have no successor under c, or they must all go to a single (possibly merged) state. If all constituents of a merged state do not agree on the transition to follow for some character, the merged state is split into two or more smaller states that do agree.
As an example, assume we start with the following automaton (states 4 and 7 are accepting):

    1 -a-> 2    2 -b-> 3    3 -c-> 4
    1 -d-> 5    5 -b-> 6    6 -c-> 7

Initially we have a merged non-accepting state {1,2,3,5,6} and a merged accepting state {4,7}. A merger is legal if and only if all constituent states agree on the same successor state for all characters. For example, states 3 and 6 would go to an accepting state given character c; states 1, 2, and 5 would not, so a split must occur.
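The example automaton can be simulated directly. A minimal Python sketch, with the transition table transcribed from the figure above (the helper names are ours, not part of the notes):

```python
# Transition table of the example DFA; a missing entry means the move
# goes to the (rejecting) error state.
delta = {(1, 'a'): 2, (2, 'b'): 3, (3, 'c'): 4,
         (1, 'd'): 5, (5, 'b'): 6, (6, 'c'): 7}
accepting = {4, 7}

def accepts(w):
    q = 1                                  # state 1 is the start state
    for ch in w:
        if (q, ch) not in delta:
            return False                   # implicit move to the error state
        q = delta[(q, ch)]
    return q in accepting

print(accepts("abc"), accepts("dbc"), accepts("ab"))  # True True False
```

So the automaton accepts exactly the two strings abc and dbc, which is why states along the two paths can potentially be merged.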

We will add an error state sE to the original DFA that is the successor state under any illegal character. (Thus reaching sE becomes equivalent to detecting an illegal token.) sE is not a real state; rather, it allows us to assume that every state has a successor under every character. sE is never merged with any real state.

Algorithm Split, shown below, splits merged states whose constituents do not agree on a common successor state for all characters. When Split terminates, we know that the states that remain merged are equivalent in that they always agree on common successors.

    Split(FASet StateSet) {
        repeat
            for (each merged state S in StateSet) {
                Let S correspond to {s1, ..., sn}
                for (each char c in Alphabet) {
                    Let t1, ..., tn be the successor states
                        to s1, ..., sn under c
                    if (t1, ..., tn do not all belong to the
                        same merged state) {
                        Split S into two or more new states such that
                        si and sj remain in the same merged state
                        if and only if ti and tj are in the
                        same merged state
                    }
                }
            }
        until no more splits are possible
    }

Returning to our example, we initially have states {1,2,3,5,6} and {4,7}. Invoking Split, we first observe that states 3 and 6 have a common successor under c, while states 1, 2, and 5 have no successor under c (equivalently, have the error state sE as a successor). This forces a split, yielding {1,2,5}, {3,6}, and {4,7}.

Now, for character b, states 2 and 5 would go to the merged state {3,6}, but state 1 would not, so another split occurs. We now have {1}, {2,5}, {3,6}, and {4,7}. At this point we are done, as all constituents of merged states agree on the same successor for each input symbol.

Once Split has executed, we are essentially done. Transitions between merged states are the same as the transitions between states in the original DFA. Thus, if there was a transition from si to sj under character c, there is now a transition under c from the merged state containing si to the merged state containing sj.
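The Split algorithm can be sketched concretely. Below is an illustrative partition-refinement version run on the example automaton; the state encoding and helper functions are ours, not part of the original pseudocode:

```python
# Illustrative partition refinement (the Split algorithm) on the example
# DFA. ERROR stands in for the error state s_E, which is never merged
# with any real state.
ERROR = 0
delta = {(1, 'a'): 2, (2, 'b'): 3, (3, 'c'): 4,
         (1, 'd'): 5, (5, 'b'): 6, (6, 'c'): 7}
states = {1, 2, 3, 4, 5, 6, 7}
accepting = {4, 7}
alphabet = {'a', 'b', 'c', 'd'}

def successor(s, c):
    return delta.get((s, c), ERROR)    # missing moves go to s_E

def split():
    # Optimistic start: all accepting states merged, all non-accepting
    # states merged, and s_E kept by itself.
    partition = [set(accepting), states - accepting, {ERROR}]

    def block_index(t):
        return next(i for i, g in enumerate(partition) if t in g)

    changed = True
    while changed:                     # "repeat ... until no more splits"
        changed = False
        for group in partition:
            for c in alphabet:
                # Bucket members by the merged state their successor
                # under c belongs to; more than one bucket means the
                # constituents disagree, so the group must split.
                buckets = {}
                for s in group:
                    buckets.setdefault(block_index(successor(s, c)), set()).add(s)
                if len(buckets) > 1:
                    partition.remove(group)
                    partition.extend(buckets.values())
                    changed = True
                    break
            if changed:
                break
    return sorted(sorted(g) for g in partition if ERROR not in g)

print(split())   # [[1], [2, 5], [3, 6], [4, 7]]
```

The result matches the hand simulation above: {1}, {2,5}, {3,6}, and {4,7}. The order in which characters are examined can change which split happens first, but the final partition is always the same.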
The start state of the optimized DFA is the merged state that contains the original start state. Accepting states are those merged states that contain accepting states (recall that accepting and non-accepting states are never merged).

Returning to our example, the minimum-state automaton we obtain is:

    1 -a-> {2,5}    1 -d-> {2,5}    {2,5} -b-> {3,6}    {3,6} -c-> {4,7}

Properties of Regular Expressions and Finite Automata

Some token patterns can't be defined as regular expressions or finite automata. Consider the set of balanced brackets of the form [ [ [ ] ] ]. This set is defined formally as { [m ]m | m >= 1 }, that is, m open brackets followed by m close brackets. This set is not regular: no finite automaton that recognizes exactly this set can exist. Why? Consider the inputs [, [[, [[[, ... For two different counts (call them i and j), the prefixes [i and [j must reach the same state of a given FA. (Why? The automaton has only finitely many states, but there are infinitely many distinct prefixes.) Once that happens, we know that if [i ]i is accepted (as it should be), then [j ]i will also be accepted (and that should not happen).

The complement of a regular set is regular: if R is regular, then so is ¬R = V* - R. Why? Build a finite automaton for R. Be careful to include transitions to an error state sE for illegal characters. Now invert final and non-final states. What was previously accepted is now rejected, and what was rejected is now accepted; that is, ¬R is accepted by the modified automaton.

Not all subsets of a regular set are themselves regular. The regular expression [+ ]+ (one or more open brackets followed by one or more close brackets) has a subset that isn't regular. (What is that subset?)

Let R be a set of strings. Define Rrev as all strings in R, in reversed (backward) character order. Thus if R = {abc, def} then Rrev = {cba, fed}. If R is regular, then Rrev is too. Why? Build a finite automaton for R. Make sure the automaton has only one final state. Now reverse the direction of all transitions, and interchange the start and final states. What does the modified automaton accept?
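The complementation construction just described can be checked with a short sketch. The sample DFA and helper names below are illustrative, not from the notes:

```python
# Sketch of complementing a DFA: complete it with an error state s_E,
# then invert final and non-final states.
def complement(states, alphabet, delta, accepting):
    err = "s_E"                          # the added error state
    full = set(states) | {err}
    # Complete the transition function: every state now has a
    # successor under every character (missing moves go to s_E).
    completed = {(s, c): delta.get((s, c), err)
                 for s in full for c in alphabet}
    # Invert: what was previously rejected is now accepted.
    return completed, full - set(accepting)

def run(delta, start, accepting, w):
    q = start
    for ch in w:
        q = delta[(q, ch)]
    return q in accepting

# Example: a DFA over {a, b} accepting strings that end in 'a'.
delta = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 0}
cdelta, caccept = complement({0, 1}, {'a', 'b'}, delta, {1})
print(run(cdelta, 0, caccept, "ab"))   # True: "ab" does not end in 'a'
print(run(cdelta, 0, caccept, "ba"))   # False: "ba" ends in 'a'
```

Note that completing the automaton with sE before inverting is essential: otherwise strings that previously "fell off" the automaton would still be rejected instead of accepted.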

If R1 and R2 are both regular, then R1 ∩ R2 is also regular. We can show this two different ways:

1. Build two finite automata, one for R1 and one for R2. Pair together states of the two automata to match R1 and R2 simultaneously. The paired-state automaton accepts only if both R1 and R2 would, so R1 ∩ R2 is matched.
2. Use the identity R1 ∩ R2 = ¬(¬R1 ∪ ¬R2). We already know that union and complementation preserve regularity.

Reading Assignment: read Chapter 4 of Crafting a Compiler.

Context-Free Grammars

A context-free grammar (CFG) is defined by:

- a finite terminal set Vt; these are the tokens produced by the scanner;
- a set of intermediate symbols, called non-terminals, Vn;
- a start symbol, a designated non-terminal, that starts all derivations;
- a set of productions (sometimes called rewriting rules) of the form A → X1 ... Xm, where X1 to Xm may be any combination of terminals and non-terminals. If m = 0 we have A → λ, which is a valid production.

Example:

    Prog  → { Stmts }
    Stmts → Stmts ; Stmt
    Stmts → Stmt
    Stmt  → id = Expr
    Expr  → id
    Expr  → Expr + id
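Returning to the intersection construction in (1) above: the paired-state idea can be sketched by simply running both DFAs in lockstep. The two sample DFAs below are illustrative:

```python
# Sketch of the paired-state (product) construction: each product state
# is a pair (p, q); the product accepts only when both components do,
# so exactly R1 ∩ R2 is matched.
def intersect_accepts(d1, f1, d2, f2, w, start=(0, 0)):
    p, q = start
    for ch in w:
        p, q = d1[(p, ch)], d2[(q, ch)]   # one step in each automaton
    return p in f1 and q in f2

# D1: even number of a's.   D2: strings ending in 'b'.   Alphabet {a, b}.
d1 = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}
d2 = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}

print(intersect_accepts(d1, {0}, d2, {1}, "aab"))  # True: even a's, ends in b
print(intersect_accepts(d1, {0}, d2, {1}, "ab"))   # False: odd number of a's
```

Building the full product automaton (rather than simulating on demand) yields at most |states of D1| x |states of D2| states, which is why the construction stays finite.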

Often more than one production shares the same left-hand side. Rather than repeat the left-hand side, an "or" notation is used:

    Prog  → { Stmts }
    Stmts → Stmts ; Stmt | Stmt
    Stmt  → id = Expr
    Expr  → id | Expr + id

Derivations

Starting with the start symbol, non-terminals are rewritten using productions until only terminals remain. Any terminal sequence that can be generated in this manner is syntactically valid. If a terminal sequence can't be generated using the productions of the grammar, it is invalid (it has syntax errors). The set of strings derivable from the start symbol is the language of the grammar (sometimes denoted L(G)).

For example, starting at Prog we generate a terminal sequence by repeatedly applying productions:

    Prog
    ⇒ { Stmts }
    ⇒ { Stmts ; Stmt }
    ⇒ { Stmt ; Stmt }
    ⇒ { id = Expr ; Stmt }
    ⇒ { id = id ; Stmt }
    ⇒ { id = id ; id = Expr }
    ⇒ { id = id ; id = Expr + id }
    ⇒ { id = id ; id = id + id }

Parse Trees

To illustrate a derivation, we can draw a derivation tree (also called a parse tree). For { id = id ; id = id + id }:

    Prog
    ├─ {
    ├─ Stmts
    │  ├─ Stmts
    │  │  └─ Stmt
    │  │     ├─ id
    │  │     ├─ =
    │  │     └─ Expr
    │  │        └─ id
    │  ├─ ;
    │  └─ Stmt
    │     ├─ id
    │     ├─ =
    │     └─ Expr
    │        ├─ Expr
    │        │  └─ id
    │        ├─ +
    │        └─ id
    └─ }

An abstract syntax tree (AST) shows essential structure but eliminates unnecessary delimiters and intermediate symbols. For the same program:

    Prog
    └─ Stmts
       ├─ Stmts
       │  └─ =
       │     ├─ id
       │     └─ id
       └─ =
          ├─ id
          └─ +
             ├─ id
             └─ id

If A → γ is a production, then αAβ ⇒ αγβ, where ⇒ denotes a one-step derivation (using production A → γ). We extend ⇒ to ⇒+ (derives in one or more steps) and ⇒* (derives in zero or more steps). We can show our earlier derivation as

    Prog ⇒ { Stmts }
         ⇒ { Stmts ; Stmt }
         ⇒ { Stmt ; Stmt }
         ⇒ { id = Expr ; Stmt }
         ⇒ { id = id ; Stmt }
         ⇒ { id = id ; id = Expr }
         ⇒ { id = id ; id = Expr + id }

so Prog ⇒+ { id = id ; id = Expr + id }.

When deriving a token sequence, if more than one non-terminal is present, we have a choice of which to expand next. We must specify, at each step, which non-terminal is expanded and which production is applied. For simplicity we adopt a convention on which non-terminal is expanded at each step: we choose the leftmost possible non-terminal. A derivation that follows this rule is a leftmost derivation. If we know a derivation is leftmost, we need only specify the productions used; the choice of non-terminal is always fixed. To denote derivations that are leftmost, we use ⇒L, ⇒+L, and ⇒*L.

The production sequence discovered by a large class of parsers (the top-down parsers) is a leftmost derivation; hence these parsers produce a leftmost parse.

    Prog ⇒L { Stmts }
         ⇒L { Stmts ; Stmt }
         ⇒L { Stmt ; Stmt }
         ⇒L { id = Expr ; Stmt }
         ⇒L { id = id ; Stmt }
         ⇒L { id = id ; id = Expr }
         ⇒L { id = id ; id = Expr + id }

so Prog ⇒+L { id = id ; id = Expr + id }.
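The leftmost convention is easy to mechanize: since the non-terminal to expand is always fixed, a derivation is just a list of productions. A small illustrative Python sketch that replays the derivation above (the production list is hard-coded):

```python
# Replay a leftmost derivation: at each step, rewrite the leftmost
# non-terminal in the sentential form using the given production.
NONTERMS = {"Prog", "Stmts", "Stmt", "Expr"}

def apply_leftmost(form, lhs, rhs):
    i = next(k for k, sym in enumerate(form) if sym in NONTERMS)
    assert form[i] == lhs, "production must rewrite the leftmost non-terminal"
    return form[:i] + rhs + form[i + 1:]

# Productions applied, in order, for the example derivation.
steps = [
    ("Prog",  ["{", "Stmts", "}"]),
    ("Stmts", ["Stmts", ";", "Stmt"]),
    ("Stmts", ["Stmt"]),
    ("Stmt",  ["id", "=", "Expr"]),
    ("Expr",  ["id"]),
    ("Stmt",  ["id", "=", "Expr"]),
    ("Expr",  ["Expr", "+", "id"]),
    ("Expr",  ["id"]),
]

form = ["Prog"]
for lhs, rhs in steps:
    form = apply_leftmost(form, lhs, rhs)
    print(" ".join(form))
# last line printed: { id = id ; id = id + id }
```

Because the derivation is leftmost, the list of (lhs, rhs) pairs alone determines every intermediate sentential form; no extra bookkeeping about which non-terminal to expand is needed.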

Rightmost Derivations

A rightmost derivation is an alternative to a leftmost derivation: now the rightmost non-terminal is always expanded. This derivation sequence may seem less intuitive given our normal left-to-right bias, but it corresponds to an important class of parsers (the bottom-up parsers, including CUP).

As a bottom-up parser discovers the productions used to derive a token sequence, it discovers a rightmost derivation, but in reverse order. The last production applied in a rightmost derivation is the first that is discovered; the first production used, involving the start symbol, is discovered last. The sequence of productions recognized by a bottom-up parser is a rightmost parse. It is the exact reverse of the production sequence that represents a rightmost derivation.

For rightmost derivations, we use the notation ⇒R, ⇒+R, and ⇒*R:

    Prog ⇒R { Stmts }
         ⇒R { Stmts ; Stmt }
         ⇒R { Stmts ; id = Expr }
         ⇒R { Stmts ; id = Expr + id }
         ⇒R { Stmts ; id = id + id }
         ⇒R { Stmt ; id = id + id }
         ⇒R { id = Expr ; id = id + id }
         ⇒R { id = id ; id = id + id }

so Prog ⇒+R { id = id ; id = id + id }.

You can derive the same set of tokens using leftmost and rightmost derivations; the only difference is the order in which productions are used.

Ambiguous Grammars

Some grammars allow more than one parse tree for the same token sequence. Such grammars are ambiguous. Because compilers use syntactic structure to drive translation, ambiguity is undesirable: it may lead to an unexpected translation. Consider

    E → E - E | id

When parsing the input a-b-c (where a, b, and c are scanned as identifiers) we can build the following two parse trees:

    E                           E
    ├─ E                        ├─ E
    │  ├─ E                     │  └─ id   (a)
    │  │  └─ id   (a)           ├─ -
    │  ├─ -                     └─ E
    │  └─ E                        ├─ E
    │     └─ id   (b)              │  └─ id   (b)
    ├─ -                           ├─ -
    └─ E                           └─ E
       └─ id   (c)                    └─ id   (c)

The effect is to parse a-b-c as either (a-b)-c or a-(b-c). These two groupings are certainly not equivalent. Ambiguous grammars are usually avoided in building compilers; the tools we use, like Yacc and CUP, strongly prefer unambiguous grammars. To correct this ambiguity, we use

    E → E - id | id

Now a-b-c can only be parsed as:

    E
    ├─ E
    │  ├─ E
    │  │  └─ id   (a)
    │  ├─ -
    │  └─ id   (b)
    ├─ -
    └─ id   (c)

that is, as (a-b)-c.

Operator Precedence

Most programming languages have operator precedence rules that state the order in which operators are applied (in the absence of explicit parentheses). Thus in C and Java and CSX, a+b*c means: compute b*c, then add in a. These operator precedence rules can be incorporated directly into a CFG. Consider

    E → E + T | T
    T → T * P | P
    P → id | ( E )

Does a+b*c mean (a+b)*c or a+(b*c)? The grammar tells us! Look at the derivation tree:

    E
    ├─ E
    │  └─ T
    │     └─ P
    │        └─ id   (a)
    ├─ +
    └─ T
       ├─ T
       │  └─ P
       │     └─ id   (b)
       ├─ *
       └─ P
          └─ id   (c)

The other grouping can't be obtained unless explicit parentheses are used. (Why?)
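Both claims above can be checked mechanically: the ambiguous grammar really does give two parses of a-b-c, and the precedence grammar groups a+b*c as a+(b*c). An illustrative Python sketch (helper names are ours; the precedence parser replaces left recursion with iteration, which preserves the left-associative grouping):

```python
# Experiment 1: count parse trees for a-b-c under E -> E - E | id.
def parses(tokens):
    """All parse trees (nested tuples) for `tokens` under E -> E - E | id."""
    if tokens == ["id"]:
        return ["id"]
    trees = []
    for i, t in enumerate(tokens):
        if t == "-":                          # try this '-' as the root
            for left in parses(tokens[:i]):
                for right in parses(tokens[i + 1:]):
                    trees.append((left, "-", right))
    return trees

print(parses(["id", "-", "id", "-", "id"]))   # two trees: ambiguous

# Experiment 2: parse with the precedence grammar
#   E -> E + T | T     T -> T * P | P     P -> id | ( E )
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat(t):
        nonlocal pos
        assert peek() == t, f"expected {t}"
        pos += 1
    def E():
        t = T()
        while peek() == "+":
            eat("+"); t = (t, "+", T())
        return t
    def T():
        p = P()
        while peek() == "*":
            eat("*"); p = (p, "*", P())
        return p
    def P():
        if peek() == "(":
            eat("("); e = E(); eat(")"); return e
        eat("id"); return "id"
    tree = E()
    assert pos == len(tokens), "trailing input"
    return tree

print(parse(["id", "+", "id", "*", "id"]))    # ('id', '+', ('id', '*', 'id'))
```

The precedence grammar yields exactly one tree, with * bound more tightly than +; the only way to get the other grouping is to write explicit parentheses, which enter through the P → ( E ) production.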