Implementation of Lexical Analysis
Lecture 4
Prof. Aiken, CS 143

Written Assignments
- Written assignment assigned today
- Due in one week, by 5pm
- Turn in:
  - In class
  - In box outside 4 Gates
  - Electronically

Tips on Building Large Systems
- KISS (Keep It Simple, Stupid!)
- Don't optimize prematurely
- Design systems that can be tested
- It is easier to modify a working system than to get a system working

Outline
- Specifying lexical structure using regular expressions
- Finite automata
  - Deterministic Finite Automata (DFAs)
  - Non-deterministic Finite Automata (NFAs)
- Implementation of regular expressions
  RegExp => NFA => DFA => Tables

Notation
- There is variation in regular expression notation
- Union: A + B, also written A | B
- Option: A + ε, also written A?
- Range: 'a' + 'b' + … + 'z', also written [a-z]
- Excluded range: complement of [a-z], also written [^a-z]

Regular Expressions in Lexical Specification
- Last lecture: a specification for the predicate s ∈ L(R)
- But a yes/no answer is not enough!
- Instead: partition the input into tokens
- We adapt regular expressions to this goal
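
The notation variants above can be tried out directly. This is a minimal sketch using Python's standard `re` module, which writes union as `|` rather than `+`; the sample patterns and the helper name `full` are illustrative assumptions, not part of the lecture.

```python
import re

def full(pattern, text):
    """Return True if the whole of text is in the language of pattern."""
    return re.fullmatch(pattern, text) is not None

# Union: A + B in the lecture's notation is A|B in Python's re syntax.
UNION = r"if|else"
# Option: A + epsilon is written A? (here: an optional leading minus sign).
OPTION = r"-?[0-9]"
# Range: 'a' + 'b' + ... + 'z' is written [a-z].
RANGE = r"[a-z]"
# Excluded range: the complement of [a-z] is written [^a-z].
EXCLUDED = r"[^a-z]"

print(full(UNION, "else"), full(OPTION, "-7"), full(RANGE, "q"), full(EXCLUDED, "Q"))
```

Note that `full` answers only the yes/no question s ∈ L(R); it does not partition the input into tokens.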
Regular Expressions => Lexical Spec. (1)
1. Write a rexp for the lexemes of each token
   - Number = digit^+
   - Keyword = 'if' + 'else' + …
   - Identifier = letter (letter + digit)*
   - OpenPar = '('
   - …

Regular Expressions => Lexical Spec. (2)
2. Construct R, matching all lexemes for all tokens
   R = Keyword + Identifier + Number + …
     = R_1 + R_2 + …

Regular Expressions => Lexical Spec. (3)
3. Let input be x_1 … x_n
   For 1 ≤ i ≤ n check x_1 … x_i ∈ L(R)
4. If success, then we know that x_1 … x_i ∈ L(R_j) for some j
5. Remove x_1 … x_i from input and go to (3)

Ambiguities (1)
- There are ambiguities in the algorithm
- How much input is used? What if
  x_1 … x_i ∈ L(R) and also x_1 … x_K ∈ L(R)?
- Rule: Pick the longest possible string in L(R)
  - The "maximal munch"

Ambiguities (2)
- Which token is used? What if
  x_1 … x_i ∈ L(R_j) and also x_1 … x_i ∈ L(R_k)?
- Rule: use the rule listed first (j if j < k)
  - Treats "if" as a keyword, not an identifier

Error Handling
- What if no rule matches a prefix of the input?
- Problem: Can't just get stuck
- Solution:
  - Write a rule matching all "bad" strings
  - Put it last (lowest priority)
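
The steps and the two disambiguation rules above can be sketched as a small tokenizer. This is an illustration under stated assumptions, not the lecture's implementation: the `Whitespace` rule, the `RULES` list, and the function name `tokenize` are hypothetical, and Python's `re` is used in place of an automaton. For these particular patterns `match` happens to return the longest match of each rule; that is not guaranteed for arbitrary Python patterns.

```python
import re

# Rules in priority order: earlier rules win ties (hypothetical rule set).
RULES = [
    ("Keyword", re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number", re.compile(r"[0-9]+")),
    ("OpenPar", re.compile(r"\(")),
    ("Whitespace", re.compile(r"[ \t\n]+")),   # assumption: not on the slide
    ("Error", re.compile(r".", re.DOTALL)),    # catch-all "bad string" rule, listed last
]

def tokenize(text):
    """Partition text into (token class, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Maximal munch: only a strictly longer match replaces the current
            # best, so on equal length the rule listed first is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        # The Error rule always matches one character, so best_end > pos here.
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

print(tokenize("if ifx 42("))
```

In this sketch `if` is classified as a Keyword because of rule order, while `ifx` becomes an Identifier because the longer match wins.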
Summary
- Regular expressions provide a concise notation for string patterns
- Use in lexical analysis requires small extensions
  - To resolve ambiguities
  - To handle errors
- Good algorithms known
  - Require only a single pass over the input
  - Few operations per character (table lookup)

Finite Automata
- Regular expressions = specification
- Finite automata = implementation
- A finite automaton consists of
  - An input alphabet Σ
  - A set of states S
  - A start state n
  - A set of accepting states F ⊆ S
  - A set of transitions state --input--> state

Finite Automata
- Transition
  s_1 --a--> s_2
- Is read
  In state s_1 on input a go to state s_2
- If end of input and in accepting state => accept
- Otherwise => reject

Finite Automata State Graphs
- A state
- The start state
- An accepting state
- A transition, labeled with its input a
[Figure: the graphical symbol for each element]

A Simple Example
- A finite automaton that accepts only "1"
[Figure: state graph]

Another Simple Example
- A finite automaton accepting any number of 1's followed by a single 0
- Alphabet: {0,1}
[Figure: state graph]
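
The definition and the acceptance rule above can be sketched in a few lines. This is a minimal sketch, assuming the alphabet {0,1} and an automaton for "any number of 1's followed by a single 0"; the state names `s1` and `s2` and the function name `accepts` are illustrative choices.

```python
def accepts(transitions, start, accepting, text):
    """Run a deterministic automaton given as a (state, symbol) -> state dict."""
    state = start
    for symbol in text:
        key = (state, symbol)
        if key not in transitions:
            return False          # no transition for this input: reject
        state = transitions[key]
    # End of input: accept only if we stopped in an accepting state.
    return state in accepting

# Any number of 1's followed by a single 0 (assumed alphabet {0, 1}).
ONES_THEN_ZERO = {
    ("s1", "1"): "s1",   # stay in the start state while reading 1's
    ("s1", "0"): "s2",   # a single 0 moves to the accepting state
}

print(accepts(ONES_THEN_ZERO, "s1", {"s2"}, "1110"))
```

Because `s2` has no outgoing transitions, any input after the single 0 leads to rejection.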
And Another Example
- Alphabet {0,1}
- What language does this recognize?
[Figure: state graph]

Epsilon Moves
- Another kind of transition: ε-moves
- Machine can move from state A to state B without reading input
[Figure: states A and B joined by an ε-transition]

Deterministic and Nondeterministic Automata
- Deterministic Finite Automata (DFA)
  - One transition per input per state
  - No ε-moves
- Nondeterministic Finite Automata (NFA)
  - Can have multiple transitions for one input in a given state
  - Can have ε-moves

Execution of Finite Automata
- A DFA can take only one path through the state graph
  - Completely determined by input
- NFAs can choose
  - Whether to make ε-moves
  - Which of multiple transitions for a single input to take

Acceptance of NFAs
- An NFA can get into multiple states
[Figure: NFA state graph with an example input]
- Rule: NFA accepts if it can get to a final state

NFA vs. DFA (1)
- NFAs and DFAs recognize the same set of languages (regular languages)
- DFAs are faster to execute
  - There are no choices to consider
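
One way to make "an NFA can get into multiple states" concrete is to track the whole set of states the machine could be in. The sketch below does that; the example machine (strings over {0,1} ending in 1), the state names, and the use of `None` as the ε label are assumptions made for illustration.

```python
EPS = None  # label used for epsilon-moves in this sketch

def eps_closure(states, delta):
    """All states reachable from `states` using only epsilon-moves."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(delta, start, accepting, text):
    """Accept if some sequence of choices ends in an accepting state."""
    current = eps_closure({start}, delta)
    for symbol in text:
        moved = set()
        for s in current:
            moved |= set(delta.get((s, symbol), ()))
        current = eps_closure(moved, delta)
    return bool(current & accepting)

# Hypothetical NFA: q0 loops on 0 and 1, may guess that a 1 is the last
# symbol (q0 -> q1), and then takes an epsilon-move to the final state q2.
DELTA = {
    ("q0", "0"): {"q0"},
    ("q0", "1"): {"q0", "q1"},
    ("q1", EPS): {"q2"},
}

print(nfa_accepts(DELTA, "q0", {"q2"}, "101"))
```

The simulation never backtracks: each input character updates the set of possible states once, which is the idea the NFA-to-DFA conversion builds on.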
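NFA vs. DFA (2)
- For a given language the NFA can be simpler than the DFA
[Figure: an NFA and a DFA for the same language]
- DFA can be exponentially larger than NFA

Regular Expressions to Finite Automata
- High-level sketch
  Lexical Specification => Regular expressions => NFA => DFA => Table-driven Implementation of DFA

Regular Expressions to NFA (1)
- For each kind of rexp, define an NFA
  - Notation: NFA for rexp M
    [Figure: box labeled M]
- For ε
  [Figure: start state with an ε-transition to an accepting state]
- For input a
  [Figure: start state with an a-transition to an accepting state]

Regular Expressions to NFA (2)
- For AB
  [Figure: NFA for A joined to NFA for B by an ε-transition]
- For A + B
  [Figure: new start state with ε-transitions into A and B, and ε-transitions from both to a new accepting state]

Regular Expressions to NFA (3)
- For A*
  [Figure: NFA for A with ε-transitions that allow skipping or repeating A]

Example of RegExp -> NFA conversion
- Consider the regular expression
  (1+0)*1
- The NFA is
  [Figure: NFA with states A, B, C, D, E, F, G, H, I, J]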
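
The per-construct rules above can be sketched as functions that glue NFA fragments together with ε-transitions. This is an illustrative sketch, not the lecture's code: the `Frag` class, the function names, the integer state numbering, and the example expression (1+0)*1 are all assumptions, and the exact ε-edges chosen for `star` are one common variant.

```python
import itertools

EPS = None                      # label for epsilon-transitions
_fresh = itertools.count()      # supplies fresh integer state names

class Frag:
    """An NFA fragment with one start state and one accepting state."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def symbol(a):                  # NFA for a single input a (or for epsilon)
    s, f = next(_fresh), next(_fresh)
    return Frag(s, f, [(s, a, f)])

def concat(m1, m2):             # NFA for AB
    return Frag(m1.start, m2.accept,
                m1.edges + m2.edges + [(m1.accept, EPS, m2.start)])

def union(m1, m2):              # NFA for A + B
    s, f = next(_fresh), next(_fresh)
    glue = [(s, EPS, m1.start), (s, EPS, m2.start),
            (m1.accept, EPS, f), (m2.accept, EPS, f)]
    return Frag(s, f, m1.edges + m2.edges + glue)

def star(m):                    # NFA for A*
    s, f = next(_fresh), next(_fresh)
    glue = [(s, EPS, m.start), (s, EPS, f),          # enter A, or skip it
            (m.accept, EPS, f), (m.accept, EPS, s)]  # leave, or repeat
    return Frag(s, f, m.edges + glue)

def matches(m, text):
    """Simulate the fragment by tracking the set of reachable states."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for p, a, q in m.edges:
                if p == s and a is EPS and q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen
    current = closure({m.start})
    for ch in text:
        current = closure({q for p, a, q in m.edges if p in current and a == ch})
    return m.accept in current

# (1+0)*1 built bottom-up from the constructors above.
NFA = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
print(matches(NFA, "0101"))
```

Each constructor adds at most two new states, so the NFA grows linearly with the size of the regular expression.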
NFA to DFA: The Trick
- Simulate the NFA
- Each state of DFA
  = a non-empty subset of states of the NFA
- Start state
  = the set of NFA states reachable through ε-moves from NFA start state
- Add a transition S --a--> S' to DFA iff
  S' is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well

NFA to DFA. Remark
- An NFA may be in many states at any time
- How many different states?
- If there are N states, the NFA must be in some subset of those N states
- How many subsets are there?
  2^N - 1 = finitely many

NFA -> DFA Example
[Figure: the NFA with states A to J and the resulting DFA with states ABCDHI, FGABCDHI, and EJGABCDHI]

Implementation
- A DFA can be implemented by a 2D table T
  - One dimension is "states"
  - Other dimension is "input symbol"
  - For every transition S_i --a--> S_k define T[i,a] = k
- DFA "execution"
  - If in state S_i and input a, read T[i,a] = k and skip to state S_k
  - Very efficient

Table Implementation of a DFA
[Figure: DFA state graph with states S, T, U]

      0   1
  S   T   U
  T   T   U
  U   T   U

Implementation (Cont.)
- NFA -> DFA conversion is at the heart of tools such as flex
- But, DFAs can be huge
- In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations
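
The subset construction and the table-driven execution described above can be sketched together. The NFA below uses the state names A to J from the example, but its edge list is a reconstruction and should be read as an assumption, as should the function names and the dict-of-dicts table layout.

```python
EPS = None  # label for epsilon-moves

# Assumed NFA for (1+0)*1 with states A..J; J is the accepting state.
EDGES = [
    ("A", EPS, "B"), ("A", EPS, "H"), ("B", EPS, "C"), ("B", EPS, "D"),
    ("C", "1", "E"), ("D", "0", "F"), ("E", EPS, "G"), ("F", EPS, "G"),
    ("G", EPS, "H"), ("G", EPS, "A"), ("H", EPS, "I"), ("I", "1", "J"),
]
ACCEPTING = {"J"}

def closure(states):
    """NFA states reachable from `states` through epsilon-moves only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for p, a, q in EDGES:
            if p == s and a is EPS and q not in seen:
                seen.add(q)
                stack.append(q)
    return frozenset(seen)

def move(states, symbol):
    """NFA states reachable from `states` by reading one symbol."""
    return {q for p, a, q in EDGES if p in states and a == symbol}

def subset_construction(start, alphabet):
    """Build the DFA table: table[S][a] = S', with sets of NFA states as DFA states."""
    start_set = closure({start})
    table, worklist = {}, [start_set]
    while worklist:
        current = worklist.pop()
        if current in table:
            continue
        table[current] = {}
        for a in alphabet:
            target = closure(move(current, a))
            if target:                      # only non-empty subsets are DFA states
                table[current][a] = target
                worklist.append(target)
    return start_set, table

def run(start_set, table, text):
    """Table-driven DFA execution: one lookup per input character."""
    state = start_set
    for ch in text:
        if ch not in table[state]:
            return False
        state = table[state][ch]
    return bool(state & ACCEPTING)

START, TABLE = subset_construction("A", "01")
for dfa_state, row in TABLE.items():
    print("".join(sorted(dfa_state)), {a: "".join(sorted(t)) for a, t in row.items()})
```

A production scanner generator would typically renumber the subsets as small integers and store the table as an array; the dictionary here just keeps the correspondence with NFA states visible.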