2010: Compilers
Lexical Analysis: Finite State Automata
Dr. Licia Capra
UCL/CS

REVIEW: REGULAR EXPRESSIONS

a         Character in A
ε         Empty string
R|S       Alternation (either R or S)
RS        Concatenation (R followed by S)
R*        Repetition (zero or more R)
R+        RR* (one or more R)
R?        (R|ε) (zero or one R)
[abcd]    a|b|c|d (any of the listed)
[a-z]     a|b|c|...|y|z (character range)
[^ab]     c|d|...|y|z (anything but the listed)
[^a-x]    y|z (anything but the range)

HOW TO USE REGULAR EXPRESSIONS

Mechanism to determine whether an input string s belongs to the language L denoted by regular expression R.

[Figure: the input string s and the language L are fed to an acceptor, which answers "Yes" if s is in L and "No" if s is not in L]
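The acceptor question above ("is s in L(R)?") can be tried out directly with an existing regex engine. This is only a sketch using Python's standard `re` module; the helper name `in_language` is an assumption made for illustration and is not part of the slides. Note that `re` works over all characters, so `[^a-x]` matches more than just y and z.

```python
import re

def in_language(pattern: str, s: str) -> bool:
    """Return True if the whole string s matches pattern, i.e. s is in L(pattern)."""
    # fullmatch anchors the match at both ends, so prefixes alone do not count.
    return re.fullmatch(pattern, s) is not None

if __name__ == "__main__":
    for pattern, s in [("ab*a", "abba"), ("ab*a", "ab"), ("(a|b)?", ""), ("[^a-x]", "y")]:
        print(pattern, repr(s), in_language(pattern, s))
```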
OUTLINE

[Figure: the Lexical Analyser takes the source program (the character stream i f ( b = = 0 ) a = b ;) and produces a list of tokens (if ( b == 0 ) a = b ;)]

- Regular Expressions
- Finite State Automata
- Lexer Generator

FINITE STATE AUTOMATA

Finite state automata consist of:
- A finite set of states
- Edges (transitions) between states, each labelled with a symbol
- A start state
- A set of final states (accepting states)

FINITE STATE AUTOMATA

Two kinds of finite automata:
- Deterministic finite automata (DFA): the transition from the current state is uniquely determined by the current input character
- Nondeterministic finite automata (NFA): there may be multiple possible choices, or some transitions do not depend on the input character
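DFA EXAMPLE

DFA that accepts the strings in the language denoted by regular expression ab*a

[Graph: states 0, 1 and 2; edge 0 to 1 labelled a, loop on 1 labelled b, edge 1 to 2 labelled a; state 2 is final]

Examples: abba, aaa, ab

Transition table:

state   a       b
0       1       error
1       2       1
2       error   error

    #include <stdio.h>

    /* Returns 1 if the characters read from stdin form a string in ab*a. */
    int accepts(void) {
        int state = 0;
        int error = 0;
        int c = getchar();              /* int, so that EOF can be told apart */
        while (c != EOF && !error) {
            switch (state) {
            case 0: if (c == 'a') state = 1; else error = 1; break;
            case 1: if (c == 'b') state = 1;
                    else if (c == 'a') state = 2;
                    else error = 1;
                    break;
            case 2: error = 1; break;   /* no transitions leave state 2 */
            }
            if (!error) c = getchar();
        }
        return state == 2 && !error;    /* accept only in the final state */
    }

MORE DFA EXAMPLES

[Figures: DFAs for the IF token (i followed by f), the NUM token, and the ID token (state 1 to state 2 on _, a-z or A-Z, then a loop on _, a-z, A-Z or 0-9)]

NFA DEFINITION

A nondeterministic finite state automaton (NFA) is an automaton where the state transitions are such that:
- There may be ε-transitions (i.e., transitions that do not consume input characters)
- There may be multiple transitions from the same state on the same input character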
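The transition table for ab*a can also drive a generic loop instead of a hand-written switch. The following is a minimal table-driven sketch; the names `TRANSITIONS`, `START`, `FINAL` and `dfa_accepts` are chosen here for illustration, and a missing table entry plays the role of "error".

```python
# Transition table of the DFA for ab*a; any (state, symbol) pair that is
# absent from the table means "error".
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "a"): 2,
}
START = 0
FINAL = {2}

def dfa_accepts(s: str) -> bool:
    """Run the DFA on s and report whether it ends in a final state."""
    state = START
    for c in s:
        state = TRANSITIONS.get((state, c))
        if state is None:          # no transition: reject immediately
            return False
    return state in FINAL

if __name__ == "__main__":
    for s in ["abba", "aaa", "ab"]:
        print(s, dfa_accepts(s))
```

Changing the recognised language only requires changing the table, which is what makes the table-driven form attractive for generated lexers.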
NFA EXAMPLE

NFA that accepts the strings in the language denoted by regular expression ab*a

[Graph: NFA for ab*a]

Example: abba

DFA VS NFA

DFA
- Action fully determined on each input symbol
- String accepted if we can go from the initial state to a final state while reading the string
- Obvious table-driven implementation

NFA
- There may be a choice at each step (which path should I take?)
- String accepted if there is any path that leads to acceptance (the automaton must guess correctly)
- Difficult to implement

BUILDING A LEXER IN STEPS

Programming language L

Regular expression describing all valid tokens:
(while | if | for | else | int | char | ([a-zA-Z_][a-zA-Z0-9_]*) | (-?[0-9]+) | ...)*

DFA recognising tokens in L: ???
No obvious implementation of a DFA accepting programs in L

3-step solution to automatically build the DFA:
- Step 1: from RE to NFA
- Step 2: from NFA to DFA
- Step 3: from DFA to minimised DFA
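To make "accepted if there is any path that leads to acceptance" concrete, one naive option is to try the paths one by one with backtracking. The sketch below does that on a hypothetical NFA for ab*a; the slide's own graph is not reproduced here, so the state numbering and the ε-edges are assumptions. The empty string is used as the ε label.

```python
EPS = ""  # label used in this sketch for ε-transitions

# Hypothetical NFA for ab*a: after the first a, either loop through b
# (states 2 and 3) or skip straight to state 4, then read the final a.
NFA = {
    (0, "a"): {1},
    (1, EPS): {2, 4},
    (2, "b"): {3},
    (3, EPS): {2, 4},
    (4, "a"): {5},
}
START, FINAL = 0, {5}

def accepts_from(state, s, i=0, eps_path=frozenset()):
    """Depth-first search: is there a path from state that consumes s[i:]?"""
    if i == len(s) and state in FINAL:
        return True
    # Try ε-moves first; eps_path stops us from looping on ε-cycles.
    path = eps_path | {state}
    for nxt in NFA.get((state, EPS), ()):
        if nxt not in path and accepts_from(nxt, s, i, path):
            return True
    # Then try consuming the next input character.
    if i < len(s):
        for nxt in NFA.get((state, s[i]), ()):
            if accepts_from(nxt, s, i + 1):
                return True
    return False

def nfa_accepts(s):
    return accepts_from(START, s)
```

Backtracking can revisit the same situations many times, which is one reason the slides go on to convert the NFA into a DFA rather than execute it this way.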
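BUILDING A LEXER IN STEPS

STEP 1: FROM RE TO NFA
STEP 2: FROM NFA TO DFA
STEP 3: FROM DFA TO min DFA

STEP 1: FROM RE TO NFA

Strategy: build the finite automaton inductively, based on the definition of RE

ε (empty string): [Figure: automaton for ε]
a (character): [Figure: automaton with one edge labelled a]

STEP 1: FROM RE TO NFA

Given the R automaton and the S automaton:

Alternation R|S: [Figure: NFA for R|S built from the R and S automata]
Concatenation RS: [Figure: NFA for RS built from the R and S automata]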
STEP 1: FROM RE TO NFA

Given the R automaton:

Kleene star R*: [Figure: NFA for R* built from the R automaton]

Note: an NFA only needs one final state (WHY?)

STEP 1: FROM RE TO NFA - EXAMPLE

A={a,b}
R=(ab|ba)*

[Figure: NFA for (ab|ba)*]

EXERCISE

Write the NFA that recognises the strings described by the following RE: (a*|b*)*
Simulate its execution on input ababbab
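The inductive strategy of Step 1 can be sketched in code: each case of the RE definition yields a small NFA fragment with one start and one final state, glued together with ε-edges. This follows the usual textbook (Thompson-style) construction, which may differ in detail from the slide figures. The tuple encoding of regular expressions and the names `Builder`, `closure` and `accepts` are assumptions of this sketch.

```python
from collections import defaultdict

EPS = ""  # label used in this sketch for ε-transitions

class Builder:
    """Builds NFA fragments; every fragment has one start and one final state."""
    def __init__(self):
        self.edges = defaultdict(set)   # (state, label) -> set of target states
        self.count = 0

    def new_state(self):
        self.count += 1
        return self.count - 1

    def build(self, rx):
        """rx is ('eps',), ('chr', c), ('alt', r, s), ('cat', r, s) or ('star', r)."""
        kind = rx[0]
        start, final = self.new_state(), self.new_state()
        if kind == "eps":
            self.edges[(start, EPS)].add(final)
        elif kind == "chr":
            self.edges[(start, rx[1])].add(final)
        elif kind == "alt":                       # R|S: branch into R or S
            for sub in rx[1:]:
                s, f = self.build(sub)
                self.edges[(start, EPS)].add(s)
                self.edges[(f, EPS)].add(final)
        elif kind == "cat":                       # RS: R, then S
            s1, f1 = self.build(rx[1])
            s2, f2 = self.build(rx[2])
            self.edges[(start, EPS)].add(s1)
            self.edges[(f1, EPS)].add(s2)
            self.edges[(f2, EPS)].add(final)
        elif kind == "star":                      # R*: skip R, or repeat it
            s, f = self.build(rx[1])
            self.edges[(start, EPS)].update({s, final})
            self.edges[(f, EPS)].update({s, final})
        else:
            raise ValueError(f"unknown node kind: {kind}")
        return start, final

def closure(edges, states):
    """States reachable from `states` through ε-edges only."""
    seen, stack = set(states), list(states)
    while stack:
        for t in edges.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(edges, start, final, text):
    """Track the set of states the NFA could be in after each character."""
    current = closure(edges, {start})
    for c in text:
        moved = set()
        for s in current:
            moved |= edges.get((s, c), set())
        current = closure(edges, moved)
    return final in current

# R = (ab|ba)* over A = {a, b}
R = ("star", ("alt", ("cat", ("chr", "a"), ("chr", "b")),
                     ("cat", ("chr", "b"), ("chr", "a"))))
builder = Builder()
start, final = builder.build(R)
```

Because every fragment is wrapped in its own start and final state, the resulting NFA always has exactly one final state, at the cost of some extra ε-edges.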
STEP 2: FROM NFA TO DFA

Problem: how to execute an NFA?
String accepted if there is any path that leads to acceptance. How to guess correctly?

Solution: search all paths consistent with the string
If there is any path that accepts the string, we will find it

Idea: search paths in parallel
- Keep track of the set of NFA states we could be in after seeing some string prefix
- Search the set of possible states we could move to when reading the next input character

STEP 2: FROM NFA TO DFA

closure(s) = set of states reachable from state s with ε-transitions
closure(T) = ⋃_{s ∈ T} closure(s)
edge(T,a) = set of states reachable with transition a from any state in T
DFAedge(T,a) = closure(edge(T,a))
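The four definitions above translate almost line by line into set operations. The sketch below applies them to a hypothetical NFA for ab*a; its states and ε-edges are assumptions, not the slide's graph. `nfa_accepts` is the resulting "search paths in parallel" simulation.

```python
EPS = ""  # label used in this sketch for ε-transitions

# Hypothetical NFA for ab*a with ε-edges.
NFA = {
    (0, "a"): {1},
    (1, EPS): {2, 4},
    (2, "b"): {3},
    (3, EPS): {2, 4},
    (4, "a"): {5},
}
NFA_START, NFA_FINAL = 0, 5

def closure(T):
    """closure(T): all states reachable from states in T via ε-transitions."""
    result, stack = set(T), list(T)
    while stack:
        for t in NFA.get((stack.pop(), EPS), ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return frozenset(result)

def edge(T, a):
    """edge(T, a): states reachable with one a-transition from any state in T."""
    out = set()
    for s in T:
        out |= NFA.get((s, a), set())
    return out

def dfa_edge(T, a):
    """DFAedge(T, a) = closure(edge(T, a))."""
    return closure(edge(T, a))

def nfa_accepts(text):
    """Follow all paths at once by tracking the set of possible states."""
    current = closure({NFA_START})
    for c in text:
        current = dfa_edge(current, c)
    return NFA_FINAL in current
```

Each value taken by `current` is one of the state sets that Step 2 turns into a single DFA state.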
STEP 2: FROM NFA TO DFA

DFA initial state = closure({NFA initial state})
For each DFA state S
    For each character x in A
        S' = DFAedge(S,x)
        add an edge (S,S') labelled with character x to the DFA
For each DFA state S
    If S contains an NFA final state
        Mark S as a DFA final state

STEP 2: FROM NFA TO DFA - EXAMPLE

A={a,b}
R=(ab|ba)*

[Figure: DFA obtained from the NFA for (ab|ba)*]

EXERCISES

Given the following regular expressions R, build the NFAs that recognise L(R) and then convert them to DFAs:
- R=(a|b)*
- R=b*(ab|ba)b*
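The loop "for each DFA state S, for each character x" can be run with a worklist, since new DFA states are discovered while edges are being added. The sketch below applies it to a small hand-written NFA for (ab|ba)*. That NFA, its state numbers and all function names are assumptions for illustration, and the empty set is kept as an explicit dead state.

```python
EPS = ""  # label used in this sketch for ε-transitions

# Hypothetical NFA for (ab|ba)*: state 1 is the hub and the only final state.
NFA = {
    (0, EPS): {1},
    (1, "a"): {2}, (2, "b"): {3},
    (1, "b"): {4}, (4, "a"): {3},
    (3, EPS): {1},
}
NFA_START, NFA_FINALS = 0, {1}

def closure(T):
    result, stack = set(T), list(T)
    while stack:
        for t in NFA.get((stack.pop(), EPS), ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return frozenset(result)

def dfa_edge(T, a):
    moved = set()
    for s in T:
        moved |= NFA.get((s, a), set())
    return closure(moved)

def subset_construction(alphabet):
    """Return (start, transitions, finals, states); DFA states are frozensets."""
    start = closure({NFA_START})
    trans, states, worklist = {}, {start}, [start]
    while worklist:
        S = worklist.pop()
        for x in alphabet:
            S2 = dfa_edge(S, x)
            trans[(S, x)] = S2
            if S2 not in states:        # newly discovered DFA state
                states.add(S2)
                worklist.append(S2)
    finals = {S for S in states if S & NFA_FINALS}
    return start, trans, finals, states

def dfa_accepts(dfa, text):
    start, trans, finals, _ = dfa
    state = start
    for c in text:
        state = trans[(state, c)]
    return state in finals

DFA = subset_construction("ab")
```

On this NFA the construction yields five DFA states, two of which ({0,1} and {1,3}) behave identically, which is the kind of redundancy the minimisation step removes.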
STEP 3: MINIMISATION ALGORITHM

The DFA automatically built from the NFA is generally not minimal, i.e. it may contain more states than necessary

Minimisation algorithm: converts a DFA into another DFA which recognises the same language and has the minimum number of states

STEP 3: STATE MINIMISATION

Idea: find groups of equivalent states
- On each input symbol, all transitions from states in one group G1 go to states in the same group G2
- Construct the minimised DFA such that there is one state for each group of states from the initial DFA
STEP 3: DFA MINIMISATION ALGORITHM

STEP 1: Construct a partition P of the set S of states in the original DFA having 2 groups:
- F = set of final states
- S-F = set of non-final states

STEP 2: Repeat
    Let P = G1 ∪ G2 ∪ ... ∪ Gn be the current partition
    Partition each group Gi into subgroups such that s and t are in the same subgroup if, for each symbol a in A, there are transitions s -a-> s' and t -a-> t', and s', t' are in the same group Gj
    Combine the computed subgroups into a new partition P'
Until P' == P (otherwise set P = P' and repeat)

STEP 3: Construct a DFA with one state for each group of states in the final partition

STEP 3: FROM DFA TO MIN DFA - EXAMPLE

R=(ab|ba)*

[Figure: minimised DFA for (ab|ba)*]

EXERCISE

Given the regular expression R=b*(ab|ba)b*, build a minimal DFA that recognises L(R)
- Step 1: from R to NFA
- Step 2: from NFA to DFA
- Step 3: from DFA to minimal DFA
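The partition-refinement loop can be sketched as follows. The input is a hypothetical five-state DFA for (ab|ba)* with an explicit dead state 4; the state numbers and the function names `minimise`, `quotient` and `run` are assumptions of this sketch. States are split by the tuple of groups their transitions lead to, and the loop stops when no group is split any further.

```python
# Hypothetical DFA for (ab|ba)*: states 0 and 3 are final, 4 is a dead state.
ALPHABET = "ab"
STATES = {0, 1, 2, 3, 4}
START, FINALS = 0, {0, 3}
TRANS = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 4, (1, "b"): 3,
    (2, "a"): 3, (2, "b"): 4,
    (3, "a"): 1, (3, "b"): 2,
    (4, "a"): 4, (4, "b"): 4,
}

def minimise(states, alphabet, trans, finals):
    """Refine {F, S-F} until every group is closed under all transitions."""
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        new_partition = []
        for g in partition:
            buckets = {}
            for s in g:
                # Signature: the group reached from s on each symbol.
                key = tuple(group_of[trans[(s, a)]] for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):   # nothing was split: P' == P
            return new_partition
        partition = new_partition

def quotient(partition, alphabet, trans, start, finals):
    """Build the minimised DFA: one state (a group index) per group."""
    group_of = {s: i for i, g in enumerate(partition) for s in g}
    min_trans = {(group_of[s], a): group_of[trans[(s, a)]]
                 for g in partition for s in g for a in alphabet}
    return group_of[start], min_trans, {group_of[f] for f in finals}

def run(start, trans, finals, text):
    state = start
    for c in text:
        state = trans[(state, c)]
    return state in finals

PARTITION = minimise(STATES, ALPHABET, TRANS, FINALS)
MIN_DFA = quotient(PARTITION, ALPHABET, TRANS, START, FINALS)
```

Comparing partition sizes is enough as a stopping test here because refinement only ever splits groups, so an unchanged size means an unchanged partition.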
PUTTING THE PIECES TOGETHER

[Figure: pipeline]
- Regular expression R
- RE to NFA conversion
- NFA to DFA conversion
- DFA optimisation
- DFA simulation on input string s
- Yes, if s in L(R); No, if s not in L(R)

LEXICAL ANALYSERS VS ACCEPTORS

Lexical analysers use the same mechanism but they:
- Have multiple REs describing multiple tokens

LEXICAL ANALYSERS

Handling multiple REs: the NFAs of all regular expressions R1,...,Rn must be combined into a single finite automaton

[Figure: the automata for Keywords, Numbers, Identifiers and Whitespaces are combined into one minimised DFA]
LEXICAL ANALYSERS VS ACCEPTORS

Lexical analysers use the same mechanism but they:
- Have multiple REs describing multiple tokens
- Have a character stream in input
- Return a sequence of matching tokens or an error

LEXICAL ANALYSERS

Input/output stream
- Associate tokens with final states
- Output the corresponding token when reaching a final state

[Figure: the automata for Keywords, Numbers, Identifiers and Whitespaces are combined into one minimised DFA]

LEXICAL ANALYSERS VS ACCEPTORS

Lexical analysers use the same mechanism but they:
- Have multiple REs describing multiple tokens
- Have a character stream in input
- Return a sequence of matching tokens or an error
- Always return the longest matching token
- For multiple longest matching tokens, they use rule priorities
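A minimal sketch of a DFA whose final states carry tokens, scanning a character stream and returning the longest match each time. The DFA is written by hand rather than generated, and the state numbers, the token names `IF`, `ID`, `NUM`, `WS` and the choice to skip whitespace are all assumptions. State 2 is reached by "if", which both the keyword rule and the identifier rule match; it is tagged `IF`, on the assumption that keywords have higher priority.

```python
import string

LETTERS = set(string.ascii_letters + "_")
DIGITS = set(string.digits)
SPACES = set(" \t\n")

# Token associated with each final state of the hand-written combined DFA.
TOKEN_OF = {1: "ID", 2: "IF", 3: "ID", 4: "NUM", 5: "WS"}

def step(state, c):
    """Transition function of the combined DFA; None means no transition."""
    if state == 0:
        if c == "i": return 1
        if c in LETTERS: return 3
        if c in DIGITS: return 4
        if c in SPACES: return 5
    elif state == 1:                      # seen "i"
        if c == "f": return 2
        if c in LETTERS or c in DIGITS: return 3
    elif state in (2, 3):                 # seen "if" or a longer identifier
        if c in LETTERS or c in DIGITS: return 3
    elif state == 4:
        if c in DIGITS: return 4
    elif state == 5:
        if c in SPACES: return 5
    return None

def tokenize(text):
    """Repeatedly run the DFA, remembering the last final state reached."""
    tokens, pos = [], 0
    while pos < len(text):
        state, i = 0, pos
        last_final, last_end = None, pos
        while i < len(text):
            state = step(state, text[i])
            if state is None:             # no further transitions
                break
            i += 1
            if state in TOKEN_OF:
                last_final, last_end = state, i
        if last_final is None:
            raise ValueError(f"lexical error at position {pos}")
        if TOKEN_OF[last_final] != "WS":  # whitespace is dropped here
            tokens.append((TOKEN_OF[last_final], text[pos:last_end]))
        pos = last_end
    return tokens
```

Remembering the last final state, rather than stopping at the first one, is what lets "iffy" come out as one identifier instead of the keyword "if" followed by "fy".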
LEXICAL ANALYSERS

Longest match
- When in a final state, look if there are further transitions; if not, return the token for the current final state

Rule priority
- Several tokens may match with the same length, so a final state may correspond to multiple tokens
- Associate the final state with the token that has the highest priority

AUTOMATING LEXICAL ANALYSIS

All of the lexical analysis process can be automated!
We only need to specify:
- Regular expressions for tokens
- Rule priorities for multiple longest match cases

JLex/JFlex = Lexical Analyser Generators
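The two inputs a lexer generator needs, regular expressions and rule priorities, can be mimicked in a few lines. This sketch is not how JLex or JFlex work internally: it tries each rule with Python's `re` module instead of building one combined DFA. The rule names and the rule list are assumptions, and the order of the list stands in for rule priority.

```python
import re

# Rules in priority order: earlier rules win when matches have equal length.
RULES = [
    ("IF",  r"if"),
    ("ID",  r"[_a-zA-Z][_a-zA-Z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("EQ",  r"=="),
    ("SYM", r"[()=;]"),
    ("WS",  r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def tokenize(text):
    """Return (token, lexeme) pairs using longest match, then rule priority."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # Strictly longer only: on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"lexical error at position {pos}")
        if best_name != "WS":             # whitespace is dropped here
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

For example, `tokenize("if (b == 0) a = b;")` classifies `if` as `IF` through rule priority, while `tokenize("iffy")` yields a single `ID` through longest match.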