CS308 Compiler Principles Lexical Analyzer Li Jiang

CS308 Lexical Analyzer Li Jiang Department of Computer Science and Engineering Shanghai Jiao Tong University

Content: Outline Basic concepts: pattern, lexeme, and token. Operations on languages, and regular expression Recognition of tokens Finite automata, including NFA and DFA Conversion from regular expression to NFA and DFA Optimization of lexical analyzer 2

Lexical Analyzer The lexical analyzer reads the source program character by character to produce tokens. It strips out comments and whitespace, returns a token whenever the parser asks for one, and correlates error messages with the source program. 3

Token A token is a pair of a token name and an optional attribute value. The token name specifies the pattern of the token; the attribute stores the lexeme of the token. Examples of tokens: Keyword: begin, if, else, ... Identifier: a string of letters or digits, starting with a letter. Integer: a non-empty string of digits. Punctuation symbol: ',', ';', '(', ')', ... Regular expressions are widely used to specify the patterns of tokens. 4
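As a concrete illustration, here is a minimal Python sketch of tokens as (token name, attribute) pairs; the token names, the regular expressions, and the tiny input are illustrative assumptions, not the course's definitions.

```python
import re

# Illustrative token specification (an assumption): name -> pattern.
token_spec = [
    ('IF',    r'if'),                      # keyword, listed before identifiers
    ('ID',    r'[A-Za-z][A-Za-z0-9]*'),    # letters/digits, starting with a letter
    ('NUM',   r'[0-9]+'),                  # non-empty string of digits
    ('PUNCT', r'[;(),]'),
    ('WS',    r'\s+'),                     # whitespace: stripped out, not returned
]
pattern = '|'.join(f'(?P<{name}>{rx})' for name, rx in token_spec)

def tokens(src):
    """Yield (token name, lexeme) pairs; the lexeme plays the role of the attribute."""
    for m in re.finditer(pattern, src):
        if m.lastgroup != 'WS':
            yield (m.lastgroup, m.group())

print(list(tokens("if x1 (42);")))
# [('IF', 'if'), ('ID', 'x1'), ('PUNCT', '('), ('NUM', '42'), ('PUNCT', ')'), ('PUNCT', ';')]
```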

Attributes of a Token Attributes carry information about the particular lexeme for subsequent compiler phases: the token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse. Attributes of an identifier: lexeme, type, location; stored in the symbol table. A tricky problem (Fortran): DO 5 I = 1.25 vs. DO 5 I = 1,25. The first is an assignment to the variable DO5I, the second is the start of a DO loop, and the lexical analyzer cannot tell which until it reaches the '.' or ','. 5

Token Example 6

Content: Outline Basic concepts: pattern, lexeme, and token. Operations on languages, and regular expression Recognition of tokens Finite automata, including NFA and DFA Conversion from regular expression to NFA and DFA Optimization of lexical analyzer 7

Input Buffering Why does a compiler need buffers? Buffer pairs: two buffers that are reloaded alternately. Two pointers: lexemeBegin and forward. Sentinels: a special mark for the end of a buffer. The scheme fails if the length of a lexeme plus the lookahead distance exceeds the buffer size. 8

Lookahead with Sentinels 9
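A rough sketch of the sentinel idea in Python follows; the buffer size, the EOF marker, and the use of a single reloadable buffer instead of a true buffer pair are simplifying assumptions. The point is that the inner loop tests only for the sentinel, one comparison per character.

```python
import io

BUF, SENTINEL = 8, '\0'       # tiny buffer size and end marker, chosen for illustration

def chars_with_sentinel(stream):
    """Yield characters; the forward pointer stops only on the sentinel."""
    while True:
        block = stream.read(BUF)
        buf = block + SENTINEL            # sentinel marks the end of the loaded buffer
        forward = 0
        while buf[forward] != SENTINEL:   # single test per character in the hot loop
            yield buf[forward]
            forward += 1
        if len(block) < BUF:              # the sentinel really was the end of input
            return                        # otherwise loop around and reload

print(''.join(chars_with_sentinel(io.StringIO("DO 5 I = 1,25"))))
```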

Terminology of Languages Alphabet: a finite set of symbols (e.g., ASCII, Unicode). String: a finite sequence of symbols over an alphabet; ε is the empty string; |s| is the length of string s. Concatenation: xy represents x followed by y. Exponentiation: sⁿ = ss...s (n times), with s⁰ = ε. Language: a set of strings over some fixed alphabet; the empty set ∅ is a language; the set of well-formed C programs is a language. 10

Operations on Languages Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }. Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }. (Kleene) Closure: L* = union over i ≥ 0 of Lⁱ. Positive Closure: L⁺ = union over i ≥ 1 of Lⁱ. 11

Example L1 = {a,b,c,d}, L2 = {1,2}. L1 ∪ L2 = {a,b,c,d,1,2}. L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}. L1* = all strings over the letters a,b,c,d, including the empty string. L1⁺ = all strings over the letters a,b,c,d, without the empty string. 12
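A small Python sketch of these operations on finite languages; the closure is truncated to a bounded number of pieces (an arbitrary assumption), since L* itself is infinite.

```python
L1, L2 = {'a', 'b', 'c', 'd'}, {'1', '2'}

union  = L1 | L2                                   # L1 ∪ L2
concat = {s1 + s2 for s1 in L1 for s2 in L2}       # L1 L2

def closure_up_to(L, n):
    """All strings of L* made of at most n concatenated pieces (L* is infinite)."""
    result, frontier = {''}, {''}                  # L^0 = {empty string}
    for _ in range(n):
        frontier = {x + y for x in frontier for y in L}
        result |= frontier
    return result

print(sorted(concat))            # ['a1', 'a2', 'b1', 'b2', 'c1', 'c2', 'd1', 'd2']
print(len(closure_up_to(L1, 3))) # 1 + 4 + 16 + 64 = 85 strings of L1* up to length 3
```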

Regular Expressions A regular expression is a representation of a language, built by applying operators to the symbols of some alphabet. A regular expression is built up from smaller regular expressions (using defining rules). Each regular expression r denotes a language L(r). A language denoted by a regular expression is called a regular set. 13

Regular Expressions (Rules) Regular expressions over an alphabet Σ, and the language each denotes:
ε denotes L(ε) = {ε}
a (for a ∈ Σ) denotes L(a) = {a}
(r1)|(r2) denotes L(r1) ∪ L(r2)
(r1)(r2) denotes L(r1)L(r2)
(r)* denotes (L(r))*
(r) denotes L(r)
Extensions:
(r)+ = (r)(r)*, denoting (L(r))+  -- positive closure
(r)? = (r)|ε, denoting L(r) ∪ {ε}  -- zero or one instance
[a1-an] = a1|a2|...|an  -- character class
14

Regular Expressions (cont.) We may remove parentheses by using precedence rules: * has the highest precedence, concatenation the second highest, and | the lowest; so ab*|c means (a(b)*)|(c). Example: Σ = {0,1}. 0|1 => {0,1}. (0|1)(0|1) => {00,01,10,11}. 0* => {ε,0,00,000,0000,...}. (0|1)* => all strings of 0s and 1s, including the empty string. 15

Lex regular expression 16

Regular Definitions We can give names to regular expressions, and use these names as symbols to define other regular expressions. A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where each di is a new symbol not in the alphabet, and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, ..., di-1}, i.e., the alphabet plus the previously defined symbols. 17

Regular Definitions Example Example: identifiers in Pascal.
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit)*
If we try to write the regular expression for identifiers without using regular definitions, it becomes complex: (A | ... | Z | a | ... | z) ( (A | ... | Z | a | ... | z) | (0 | ... | 9) )*. Q: write a regular definition for unsigned numbers (integer or floating point). 18
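One possible answer to the question above, written as Python regular expressions; the exact shape of "unsigned number" (digits, optional fraction, optional exponent) is an assumption, not the official solution.

```python
import re

letter = r'[A-Za-z]'
digit  = r'[0-9]'
ident  = rf'{letter}({letter}|{digit})*'                  # id -> letter (letter|digit)*
digits = rf'{digit}+'
number = rf'{digits}(\.{digits})?([Ee][+-]?{digits})?'    # unsigned integer or float

print(bool(re.fullmatch(ident,  'x25')))       # True
print(bool(re.fullmatch(number, '6.02E23')))   # True
print(bool(re.fullmatch(number, '.5')))        # False: this definition requires a leading digit
```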

Quiz * 1. All strings of lowercase letters that contain the five vowels in order. 2. All strings of lowercase letters in which the letters are in ascending lexicographic order. 3. Comments, consisting of a string surrounded by /* and */, without an intervening */, unless it is inside double quotes ("). [HOMEWORK] 19

Content: Outline Basic concepts: pattern, lexeme, and token. Operations on languages, and regular expression Recognition of tokens Finite automata, including NFA and DFA Conversion from regular expression to NFA and DFA Optimization of lexical analyzer 21

Recognition of Tokens Express the pattern (as a grammar together with regular definitions); then find a prefix of the remaining input that is a lexeme matching the pattern. 22

Transition Diagram * State: represents a condition that could occur during scanning; there is a start/initial state, accepting/final states (a lexeme has been found), and intermediate states. Edge: directed from one state to another, labeled with one symbol or a set of symbols. 23

Transition Diagram for relop Among the lexemes that match the pattern for relop (< | <= | = | <> | > | >=), which one are we looking at? We may need one character of lookahead to decide. Transition diagram for relop. 24

Transition-Diagram-Based Lexical Analyzer A switch statement or multi-way branch holds the number of the current state and determines the next state by reading and examining the next input character: find the matching edge and take the action. Implementation of the relop transition diagram. 25
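A minimal Python sketch of the relop transition diagram as a multi-way branch; the token names (LT, LE, NE, EQ, GT, GE) and the retract-by-not-consuming convention are assumptions.

```python
def relop(s, i):
    """Try to recognize a relational operator starting at s[i].
    Returns (token name, index just past the lexeme) or None."""
    if i >= len(s):
        return None
    c, nxt = s[i], s[i + 1] if i + 1 < len(s) else None
    if c == '<':
        if nxt == '=': return ('LE', i + 2)
        if nxt == '>': return ('NE', i + 2)
        return ('LT', i + 1)          # other character: "retract" by not consuming it
    if c == '=':
        return ('EQ', i + 1)
    if c == '>':
        if nxt == '=': return ('GE', i + 2)
        return ('GT', i + 1)
    return None                       # fail: not a relop

print(relop("<=x", 0))   # ('LE', 2)
print(relop("<y", 0))    # ('LT', 1)
```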

Transition Diagram for Others * What about the Transition Diagram of letter/digit? A transition diagram for id's A transition diagram for unsigned numbers 26

Content: Outline Basic concepts: pattern, lexeme, and token. Operations on languages, and regular expression Recognition of tokens Finite automata, including NFA and DFA Conversion from regular expression to NFA and DFA Optimization of lexical analyzer 29

Finite Automata A finite automaton is a recognizer that takes a string and answers "yes" if the string matches a pattern of the specified language, and "no" otherwise. * Two kinds: Nondeterministic finite automata (NFA): no restriction on the labels of their edges. Deterministic finite automata (DFA): for each state and each symbol, exactly one edge labeled with that symbol leaves the state. NFAs and DFAs have the same recognizing power. We may use either an NFA or a DFA as the lexical analyzer. 30

Nondeterministic Finite Automaton (NFA) An NFA consists of: S, a set of states; Σ, a set of input symbols (the alphabet); a transition function that maps state-symbol pairs to sets of states; s0, a start (initial) state; and F, a set of accepting (final) states. An NFA can be represented by a transition graph. It accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x. Remarks: the same symbol can label edges from one state to several different states, and an edge may be labeled by ε, the empty string. 31

NFA Example (1) The language recognized by this NFA is (a|b)*ab. 32

NFA Example (2) An NFA accepting aa* | bb*. 33

Implementing an NFA
S ← ε-closure({s0})   -- the set of all states reachable from s0 by ε-transitions
c ← nextchar()
while (c != eof) do begin
  S ← ε-closure(move(S, c))   -- the set of all states reachable from a state in S by a transition on c
  c ← nextchar()
end
if (S ∩ F != ∅) then return "yes"   -- S contains an accepting state
else return "no"
Backtracking may be needed to identify the longest match. This on-the-fly simulation uses the same idea as the subset construction. 34
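The pseudocode above can be made concrete with a small Python sketch; the dictionary representation of the NFA (state -> {symbol -> set of states}, with '' as the ε label) and the example state numbering are assumptions.

```python
def eps_closure(nfa, states):
    """All states reachable from `states` using only epsilon ('') edges."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get('', set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(nfa, states, c):
    """All states reachable from `states` by one edge labeled c."""
    return {t for s in states for t in nfa.get(s, {}).get(c, set())}

def nfa_accepts(nfa, start, accepting, word):
    S = eps_closure(nfa, {start})
    for c in word:
        S = eps_closure(nfa, move(nfa, S, c))
    return bool(S & accepting)

# Example NFA for (a|b)*ab (state numbering is an assumption): 2 is accepting.
nfa = {0: {'a': {0, 1}, 'b': {0}}, 1: {'b': {2}}}
print(nfa_accepts(nfa, 0, {2}, "aab"))   # True
```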

Exercise 3 For the NFA in the following figure, indicate all the paths labeled aabb. Does the NFA accept aabb? Give the transition table.
(0) -a-> (1) -a-> (2) -b-> (2) -b-> ((3))
(0) -a-> (1) -a-> (2) -b-> (2) -b-> (2)
(0) -a-> (0) -a-> (0) -b-> (0) -b-> (0)
(0) -a-> (0) -a-> (1) -b-> (1) -b-> (1)
(0) -a-> (1) -a-> (1) -b-> (1) -b-> (1)
(0) -a-> (1) -a-> (2) -b-> (2) -ε-> (0) -b-> (0)
(0) -a-> (1) -a-> (2) -ε-> (0) -b-> (0) -b-> (0)
35

Deterministic Finite Automaton (DFA) A Deterministic Finite Automaton (DFA) is a special form of NFA: no state has an ε-transition, and for each symbol a and state s, there is at most one a-labeled edge leaving s. The language recognized by this DFA is (a|b)*ab. 36

Practice * Draw the transition diagram for recognizing the following regular expression: a(a|b)*a. The slide shows a nondeterministic automaton with states 1, 2, 3 and an equivalent deterministic one. 37

Implementing a DFA
s ← s0   -- start from the initial state
c ← nextchar()   -- get the next character from the input string
while (c != eof) do begin   -- repeat until the end of the string
  s ← move(s, c)   -- transition function
  c ← nextchar()
end
if (s in F) then return "yes"   -- s is an accepting state
else return "no"
38
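The same loop in Python, as a minimal sketch; the transition-table format (a dict keyed by (state, symbol)) and the example DFA for (a|b)*ab are assumptions.

```python
def dfa_accepts(dtran, start, accepting, word):
    s = start
    for c in word:
        if (s, c) not in dtran:    # no edge: the DFA dies, reject
            return False
        s = dtran[(s, c)]
    return s in accepting

# DFA for (a|b)*ab with states A, B, C (C accepting); the state names are assumptions.
dtran = {('A', 'a'): 'B', ('A', 'b'): 'A',
         ('B', 'a'): 'B', ('B', 'b'): 'C',
         ('C', 'a'): 'B', ('C', 'b'): 'A'}
print(dfa_accepts(dtran, 'A', {'C'}, "abab"))   # True
```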

NFA vs. DFA
       Compactness  Readability  Speed
NFA    Good         Good         Slow
DFA    Bad          Bad          Fast
DFAs are widely used to build lexical analyzers: maintaining a set of states is more complex than keeping track of a single state. The language recognized in both cases: (a|b)*ab. 39

Pop Quiz 1) What are the languages recognized by the two FAs shown on the slide? (a) A fixed-pattern automaton over {0,1} with states 1-9. Solution: the strings over {0,1} of length 4, except 0110. (b) A closure-style automaton over {a} with states 1-5. Solution: a(aaaaa)*. 40

Content: Outline Basic concepts: pattern, lexeme, and token. Operations on languages, and regular expression Recognition of tokens Finite automata, including NFA and DFA Conversion from regular expression to NFA and DFA Optimization of lexical analyzer 42

Regular Expression → NFA The McNaughton-Yamada-Thompson (MYT) construction is simple and systematic: it works recursively up the parse tree of the regular expression. The construction starts from the simplest parts (alphabet symbols); for a complex regular expression, the NFAs of its subexpressions are combined to create its NFA. It guarantees that the resulting NFA has exactly one start state and one final state. 43

MYT Construction Basic rules, for subexpressions with no operators: for the expression ε, a start state i with an ε-edge to a final state f; for a symbol a in the alphabet, a start state i with an a-edge to a final state f. 44

MYT Construction (cont'd) Inductive rules, for constructing larger NFAs from the NFAs of subexpressions (let N(r1) and N(r2) denote the NFAs for regular expressions r1 and r2, respectively). For the regular expression r1|r2: a new start state i with ε-edges to the start states of N(r1) and N(r2), whose accepting states both have ε-edges to a new final state f. 45

MYT Construction (cont'd) For the regular expression r1r2: N(r1) followed by N(r2), with the start state i of N(r1) as start and the final state f of N(r2) as final. For the regular expression r*: a new start state i and final state f around N(r), with ε-edges from i to the start of N(r), from the accepting state of N(r) to f, from i directly to f (zero repetitions), and from the accepting state of N(r) back to its start (further repetitions). 46

Example: (a|b)*a. Build NFAs for a and for b, combine them into an NFA for (a|b), wrap that with the star rule to get (a|b)*, and finally concatenate an NFA for a to obtain (a|b)*a. 47
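A compact Python sketch of the MYT construction, using the same dictionary NFA representation as the simulation sketch earlier (so eps_closure/move/nfa_accepts can be reused for the final check). One simplification is an assumption: concatenation glues the two fragments with an ε-edge rather than merging the two states.

```python
import itertools
_new = itertools.count()          # generator of fresh state names

def symbol(a):
    i, f = next(_new), next(_new)
    return i, f, {i: {a: {f}}}

def union(n1, n2):                # r1 | r2
    i1, f1, t1 = n1; i2, f2, t2 = n2
    i, f = next(_new), next(_new)
    return i, f, {**t1, **t2, i: {'': {i1, i2}}, f1: {'': {f}}, f2: {'': {f}}}

def concat(n1, n2):               # r1 r2 (glued with an epsilon edge, a simplification)
    i1, f1, t1 = n1; i2, f2, t2 = n2
    t = {**t1, **t2}
    t.setdefault(f1, {}).setdefault('', set()).add(i2)
    return i1, f2, t

def star(n1):                     # r*
    i1, f1, t1 = n1
    i, f = next(_new), next(_new)
    t = dict(t1)
    t[i] = {'': {i1, f}}
    t.setdefault(f1, {}).setdefault('', set()).update({i1, f})
    return i, f, t

# Build (a|b)*a and check a string with nfa_accepts from the earlier sketch.
i, f, nfa = concat(star(union(symbol('a'), symbol('b'))), symbol('a'))
print(nfa_accepts(nfa, i, {f}, "abba"))   # True
```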

Properties of the Constructed NFA 1. N(r) has at most twice as many states as there are operators and operands in r. This bound follows from the fact that each step of the algorithm creates at most two new states. 2. N(r) has one start state and one accepting state. The accepting state has no outgoing transitions, and the start state has no incoming transitions. 3. Each state of N(r) other than the accepting state has either one outgoing transition on a symbol in Σ, or two outgoing transitions, both on ε. 48

Conversion of an NFA to a DFA Approach: subset construction; each state of the constructed DFA corresponds to a set (combination) of NFA states. Details: 1. Create the transition table Dtran for the DFA. 2. Insert ε-closure(s0) into Dstates as the initial state. 3. Pick an unvisited state T in Dstates. 4. For each symbol a, create the state ε-closure(move(T, a)) and add it to Dstates and Dtran. 5. Repeat steps (3) and (4) until all states in Dstates are visited. 49

The Subset Construction Simulate in parallel all possible moves the NFA can make on the input. 50

NFA to DFA Example NFA for (a|b)*abb.
A = ε-closure({0}) = {0,1,2,4,7}; add A to Dstates as an unmarked state.
Mark A:
  ε-closure(move(A,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B; add B to Dstates
  ε-closure(move(A,b)) = ε-closure({5}) = {1,2,4,5,6,7} = C; add C to Dstates
  transfunc[A,a] ← B, transfunc[A,b] ← C
Mark B:
  ε-closure(move(B,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
  ε-closure(move(B,b)) = ε-closure({5,9}) = {1,2,4,5,6,7,9} = D
  transfunc[B,a] ← B, transfunc[B,b] ← D
Mark C:
  ε-closure(move(C,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
  ε-closure(move(C,b)) = ε-closure({5}) = {1,2,4,5,6,7} = C
  transfunc[C,a] ← B, transfunc[C,b] ← C
51
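The worked example above can be reproduced with a short Python sketch of the subset construction, reusing eps_closure and move from the NFA-simulation sketch; the NFA state numbering for (a|b)*abb follows the usual textbook figure and is an assumption.

```python
def subset_construction(nfa, start, accepting, alphabet):
    start_set = frozenset(eps_closure(nfa, {start}))
    dstates, unmarked, dtran = {start_set}, [start_set], {}
    while unmarked:
        T = unmarked.pop()                     # pick an unmarked DFA state
        for a in alphabet:
            U = frozenset(eps_closure(nfa, move(nfa, T, a)))
            if not U:
                continue
            dtran[(T, a)] = U
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
    dfa_accepting = {S for S in dstates if S & accepting}
    return start_set, dfa_accepting, dtran

# NFA for (a|b)*abb, states 0..10 with 10 accepting (numbering is an assumption).
nfa = {0: {'': {1, 7}}, 1: {'': {2, 4}}, 2: {'a': {3}}, 3: {'': {6}},
       4: {'b': {5}},   5: {'': {6}},    6: {'': {1, 7}}, 7: {'a': {8}},
       8: {'b': {9}},   9: {'b': {10}}}
A, acc, dtran = subset_construction(nfa, 0, {10}, "ab")
print(sorted(A))   # [0, 1, 2, 4, 7]  -- the state called A above
```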

NFA to DFA Example NFA for (a|b)*abb; the transition table for the DFA; the equivalent DFA. 52

Quiz 1 Suppose we have two tokens: (1) the keyword if, and (2) identifiers, which are strings of letters other than if. Show: 1. The NFA for these tokens, and 2. The DFA for these tokens NFA DFA 55

Regular Expression → DFA First, augment the given regular expression by concatenating a special end symbol #: r → r# (the augmented regular expression). Second, create a syntax tree for the augmented regular expression: all leaves are alphabet symbols (plus # and the empty string ε), and all inner nodes are operators. Third, number each alphabet symbol (plus #) with a position number. 56

Regular Expression → DFA (cont'd) (a|b)*a → (a|b)*a# (the augmented regular expression). The slide shows the syntax tree of (a|b)*a#: each symbol is at a leaf, each leaf is numbered with its position (a:1, b:2, a:3, #:4), and the inner nodes are operators. 57

followpos We then define the function followpos on positions (the numbers assigned to leaves). followpos(i) is the set of positions that can follow position i in the strings generated by the augmented regular expression. Example: (a|b)*a# with positions 1 2 3 4: followpos(1) = {1,2,3}, followpos(2) = {1,2,3}, followpos(3) = {4}, followpos(4) = {}. followpos() is defined only for leaves, not for inner nodes. 58

firstpos, lastpos, nullable To compute followpos, we need three more functions defined for the nodes (not just for leaves) of the syntax tree. firstpos(n) -- the set of the positions of the first symbols of strings generated by the subexpression rooted by n. lastpos(n) -- the set of the positions of the last symbols of strings generated by the subexpression rooted by n. nullable(n) -- true if the empty string is a member of strings generated by the subexpression rooted by n; false otherwise 59

Usage of the Functions (a|b)*a → (a|b)*a# (the augmented regular expression). In the syntax tree of (a|b)*a#, let m be the star node for (a|b)* and n be the concatenation node joining m with a (position 3): nullable(m) = true, nullable(n) = false, firstpos(n) = {1, 2, 3}, lastpos(n) = {3}. 60

Computing nullable, firstpos, lastpos (a straightforward recursion on the height of the tree):
Leaf labeled ε: nullable = true; firstpos = ∅; lastpos = ∅.
Leaf labeled with position i: nullable = false; firstpos = {i}; lastpos = {i}.
n = c1 | c2: nullable = nullable(c1) or nullable(c2); firstpos = firstpos(c1) ∪ firstpos(c2); lastpos = lastpos(c1) ∪ lastpos(c2).
n = c1 c2: nullable = nullable(c1) and nullable(c2); firstpos = firstpos(c1) ∪ firstpos(c2) if nullable(c1), else firstpos(c1); lastpos = lastpos(c1) ∪ lastpos(c2) if nullable(c2), else lastpos(c2).
n = (c1)*: nullable = true; firstpos = firstpos(c1); lastpos = lastpos(c1).
61

Thinking Extend the above table to include two more operations: (a) (c1)? and (b) (c1)+.
(c1)?: nullable = true; firstpos = firstpos(c1); lastpos = lastpos(c1).
(c1)+: nullable = nullable(c1); firstpos = firstpos(c1); lastpos = lastpos(c1).
62

How to evaluate followpos Two rules define the function followpos: 1. If n is a concatenation node with left child c1 and right child c2, and i is a position in lastpos(c1), then all positions in firstpos(c2) are in followpos(i). 2. If n is a star node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i). If firstpos and lastpos have been computed for each node, followpos of each position can be computed by making one depth-first traversal of the syntax tree. 63

Example -- (a|b)*a# The slide shows the syntax tree annotated with firstpos (in red) and lastpos (in blue) at every node. Then we can calculate followpos: followpos(1) = {1,2,3}, followpos(2) = {1,2,3}, followpos(3) = {4}, followpos(4) = {}. After we calculate the follow positions, we are ready to create the DFA for the regular expression. 64
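A Python sketch of these computations on the syntax tree of (a|b)*a#; the tuple representation of tree nodes and the traversal order are assumptions.

```python
from collections import defaultdict

followpos = defaultdict(set)

def analyze(n):
    """Return (nullable, firstpos, lastpos) of node n and fill in followpos."""
    kind = n[0]
    if kind == 'sym':                                   # leaf at position n[1]
        return False, {n[1]}, {n[1]}
    if kind == 'or':                                    # c1 | c2
        n1, f1, l1 = analyze(n[1]); n2, f2, l2 = analyze(n[2])
        return (n1 or n2), f1 | f2, l1 | l2
    if kind == 'cat':                                   # c1 c2
        n1, f1, l1 = analyze(n[1]); n2, f2, l2 = analyze(n[2])
        for i in l1:                                    # rule 1 (cat node)
            followpos[i] |= f2
        return (n1 and n2), (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)
    if kind == 'star':                                  # c1*
        n1, f1, l1 = analyze(n[1])
        for i in l1:                                    # rule 2 (star node)
            followpos[i] |= f1
        return True, f1, l1

# Syntax tree of (a|b)*a#  (a:1, b:2, a:3, #:4)
tree = ('cat', ('cat', ('star', ('or', ('sym', 1), ('sym', 2))), ('sym', 3)), ('sym', 4))
print(analyze(tree)[1])   # firstpos(root) = {1, 2, 3}
print(dict(followpos))    # {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}}
```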

Algorithm (RE → DFA) 1. Create the syntax tree of (r)#. 2. Calculate nullable, firstpos, lastpos, followpos. 3. Put firstpos(root) into the states of the DFA as an unmarked state. 4. While there is an unmarked state S in the states of the DFA: mark S; for each input symbol a, let s1,...,sn be the positions in S whose symbols are a, and let S' = followpos(s1) ∪ ... ∪ followpos(sn); set Dtran[S,a] ← S'; if S' is not yet among the states of the DFA, add S' as an unmarked state. The start state of the DFA is firstpos(root); the accepting states of the DFA are all states containing the position of #. 65

Example -- (a|b)*a# followpos(1)={1,2,3}, followpos(2)={1,2,3}, followpos(3)={4}, followpos(4)={} (positions 1 2 3 4).
S1 = firstpos(root) = {1,2,3}.
Mark S1: on a, followpos(1) ∪ followpos(3) = {1,2,3,4} = S2, so Dtran[S1,a] = S2; on b, followpos(2) = {1,2,3} = S1, so Dtran[S1,b] = S1.
Mark S2: on a, followpos(1) ∪ followpos(3) = {1,2,3,4} = S2, so Dtran[S2,a] = S2; on b, followpos(2) = {1,2,3} = S1, so Dtran[S2,b] = S1.
Start state: S1; accepting states: {S2}. 66
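Continuing the followpos sketch above, here is a small Python function that carries out step 4 of the algorithm; the symbol-at-position table and the alphabet argument are assumptions.

```python
def positions_to_dfa(start_positions, followpos, sym_at, alphabet, end_pos):
    start = frozenset(start_positions)                  # firstpos(root)
    dstates, unmarked, dtran = {start}, [start], {}
    while unmarked:
        S = unmarked.pop()                              # mark S
        for a in alphabet:
            U = set()
            for p in S:                                 # positions in S labeled a
                if sym_at[p] == a:
                    U |= followpos[p]
            if not U:
                continue
            U = frozenset(U)
            dtran[(S, a)] = U
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
    accepting = {S for S in dstates if end_pos in S}    # states containing the # position
    return start, accepting, dtran

sym_at = {1: 'a', 2: 'b', 3: 'a', 4: '#'}               # positions of (a|b)*a#
S1, acc, dtran = positions_to_dfa({1, 2, 3}, followpos, sym_at, "ab", 4)
print(sorted(dtran[(S1, 'a')]))   # [1, 2, 3, 4]  -- the state called S2 above
```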

Example -- (a|ε)bc*# Let's continue, with positions 1 2 3 4: followpos(1)={2}, followpos(2)={3,4}, followpos(3)={3,4}, followpos(4)={}.
S1 = firstpos(root) = {1,2}.
Mark S1: on a, followpos(1) = {2} = S2, so Dtran[S1,a] = S2; on b, followpos(2) = {3,4} = S3, so Dtran[S1,b] = S3.
Mark S2: on b, followpos(2) = {3,4} = S3, so Dtran[S2,b] = S3.
Mark S3: on c, followpos(3) = {3,4} = S3, so Dtran[S3,c] = S3.
Start state: S1; accepting states: {S3}. 67

Minimizing the Number of DFA States For any regular language there is a unique minimum-state DFA (up to renaming of states), which can be constructed from any DFA for the language. Algorithm: Partition the set of states into two groups: G1, the set of accepting states, and G2, the set of non-accepting states. For each new group G, partition G into subgroups such that states s1 and s2 are in the same subgroup iff for every input symbol a, s1 and s2 have transitions into states of the same group. The start state of the minimized DFA is the group containing the start state of the original DFA. The accepting states of the minimized DFA are the groups containing the accepting states of the original DFA. 68

Minimizing DFA Example (1) The slide shows a three-state DFA over {a,b} with accepting state 2. G1 = {2}, G2 = {1,3}. G2 cannot be partitioned, because Dtran[1,a]=2 and Dtran[3,a]=2 (both lead into G1), while Dtran[1,b]=3 and Dtran[3,b]=3 (both lead into G2). So the minimized DFA (with the minimum number of states) has two states, {1,3} and {2}. 69

Minimizing DFA Example (2) The slide shows a four-state DFA over {a,b} with accepting state 4 (its table lists each state's transitions: 1→2, 1→3; 2→2, 2→3; 3→4, 3→3). Groups: {1,2,3} and {4}; then {1,2} and {3}; no more partitioning. The minimized DFA has three states: {1,2}, {3}, {4}. 70
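A Python sketch of this partition refinement, checked against Example (1); the transition-table format matches the earlier DFA sketch, and state 2's own transitions (not listed on the slide) are assumptions.

```python
def minimize(states, alphabet, dtran, accepting):
    groups = [frozenset(accepting), frozenset(states - accepting)]
    groups = [g for g in groups if g]                    # drop an empty group
    while True:
        def group_of(s):
            for idx, g in enumerate(groups):
                if s in g:
                    return idx
            return None                                  # missing transition (dead state)
        refined = []
        for g in groups:
            buckets = {}
            for s in g:                                  # split by where each symbol leads
                key = tuple(group_of(dtran.get((s, a))) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            refined += [frozenset(b) for b in buckets.values()]
        if len(refined) == len(groups):                  # no group was split: done
            return refined
        groups = refined

# Example (1): states 1,2,3 over {a,b}, accepting {2}; state 2's transitions are assumed.
dtran = {(1, 'a'): 2, (1, 'b'): 3, (2, 'a'): 2, (2, 'b'): 3, (3, 'a'): 2, (3, 'b'): 3}
print(minimize({1, 2, 3}, "ab", dtran, {2}))   # [frozenset({2}), frozenset({1, 3})]
```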

Architecture of a Lexical Analyzer 71

An NFA for a Lex Program Create an NFA for each regular expression, then combine all the NFAs into one: introduce a new start state and connect it with ε-transitions to the start states of the individual NFAs. 72

Pattern Matching with an NFA 1. The lexical analyzer reads the input and calculates the set of states it is in after each symbol. 2. Eventually, it reaches a point with no next state. 3. It looks backwards in the sequence of sets of states until it finds a set that includes one or more accepting states. 4. It picks the pattern associated with the earliest entry in the list of patterns in the Lex program. 5. It performs the associated action of that pattern. 73

Pattern Matching with an NFA -- Example Input: aaba. Report pattern: a*b+. 74

Pattern Matching with a DFA 1. Convert the NFA for all the patterns into an equivalent DFA. For each DFA state that contains more than one accepting NFA state, choose the pattern defined earliest as the output of that DFA state. 2. Simulate the DFA until there is no next state. 3. Trace back to the most recent accepting DFA state, and perform the associated action. Input: abba; DFA states visited: 0137, 247, 58, 68; report pattern abb. 75
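A minimal Python sketch of this loop; instead of literally tracing back, it remembers the most recent accepting state, which is an equivalent and common implementation. The pattern names, the tiny DFA, and the input are assumptions.

```python
def scan(dtran, start, accept_pattern, text):
    """accept_pattern maps each accepting DFA state to the earliest-declared pattern."""
    tokens, i = [], 0
    while i < len(text):
        s, j, last_accept = start, i, None
        while j < len(text) and (s, text[j]) in dtran:
            s = dtran[(s, text[j])]
            j += 1
            if s in accept_pattern:
                last_accept = (j, accept_pattern[s])   # longest match so far
        if last_accept is None:
            raise ValueError(f"lexical error at position {i}")
        end, pattern = last_accept
        tokens.append((pattern, text[i:end]))
        i = end                                        # resume after the longest lexeme
    return tokens

# Toy DFA over {a, b, 1, 2}: NUM = digits, ID = a letter then letters/digits (assumed).
dtran = {(0, '1'): 1, (0, '2'): 1, (1, '1'): 1, (1, '2'): 1,
         (0, 'a'): 2, (0, 'b'): 2}
dtran.update({(2, c): 2 for c in "ab12"})
print(scan(dtran, 0, {1: 'NUM', 2: 'ID'}, "12ab1"))   # [('NUM', '12'), ('ID', 'ab1')]
```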

Summary How lexical analyzers work Convert REs to NFA Convert NFA to DFA Minimize DFA Use the minimized DFA to recognize tokens in the input Use priorities, longest matching rule 76

Homework Check the web page!!! 77