Lexical Analysis
- Role, Specification & Recognition
- Tool: LEX
- Construction:
  - RE to NFA to DFA to min-state DFA
  - RE to DFA


Conducting Lexical Analysis
Techniques for specifying and implementing lexical analyzers:
- Hand-written
  - a state transition diagram that reveals the structure of the tokens
  - a hand-translated driver program
- Tools: pattern-triggered actions
  - pattern-action language: LEX
  - other applications: query languages, information retrieval, AWK, shell commands, PCB inspection

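Lexical Analyzer and Parser
- Data flow: source program -> lexical analyzer -> (token & attributes) -> parser. The parser asks the lexical analyzer to "get next token"; both consult the symbol table.
- token: the smallest logically cohesive sequence of characters of interest in the source program (Aho, Sethi, Ullman, pp. 84)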

Lexical Analysis
Convert input lexemes to a stream of tokens.
- Lexeme: a sequence of characters that comprises a single token
Typical functions:
- Removal of white space and comments
  - instead of writing productions that include spaces and comments
  - keeping a line count for associating error messages with line numbers
- Digits into token ID + value/attributes
  - instead of writing productions for integer constants
  - 31+28+59 => <num, 31> <+, > <num, 28> <+, > <num, 59>
- Recognizing identifiers and keywords
  - Identifiers: count = count + increment => id = id + id
  - Keywords: begin, end, if, else => begin, end, if, else
  - Operators/punctuation: >, <=, <>
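The digit-grouping step above can be sketched in a few lines of Python. This is a minimal illustration, not a full scanner: the function name `tokenize`, the `(token, attribute)` tuple layout, and the use of `None` for an empty attribute are assumptions made for this example.

```python
DIGITS = "0123456789"

def tokenize(text):
    """Convert a string such as '31+28+59' into (token, attribute) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in " \t\n":            # white space is removed, never tokenized
            i += 1
        elif ch in DIGITS:           # a run of digits becomes one num token
            j = i
            while j < len(text) and text[j] in DIGITS:
                j += 1
            tokens.append(("num", int(text[i:j])))
            i = j
        elif ch == "+":              # operators carry no attribute
            tokens.append(("+", None))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens

print(tokenize("31+28+59"))
```

The parser then sees only `num` and `+` tokens and needs no productions for individual digits or blanks.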

Why Separate Lexical Analysis from Parsing
- Simpler design
  - no production rules and translations for white space and comments
- Improved efficiency
  - the lexical analyzer can be optimized separately (e.g., using specialized buffering techniques)
- Enhanced compiler portability
  - input alphabet peculiarities and device-specific anomalies can be restricted to the lexical analyzer

Tokens, Patterns, Lexemes
- Token: a terminal symbol or lexical unit of the parser, representing a set of strings of a particular type
  - e.g., pi, count, ... => id
  - e.g., 3.1416, 6.02e23, ... => num
  - Typical: keywords, operators, identifiers, constants, literal strings, punctuation symbols
  - Representation: an integer (e.g., #define ID 258) with associated attributes
- Pattern: a specification of the set of strings
  - a rule describing the set of lexemes that can represent a particular token
  - e.g., id => a letter followed by letters and digits
- Lexeme: a sequence of characters in the source program that is matched by the pattern for a token
- Examples: Fig. 3.2

Attributes for Tokens
- Attributes: additional information about the particular lexeme matched
  - needed when a pattern matches more than one lexeme
  - used for parsing decisions and translation
- Implementation: a pointer to the symbol-table entry in which the token information is kept
- Example: E = M * C ** 2
  - E (or M, C): <id, ptr to symbol-table entry for E (M, C)>
  - =: <assign_op, >
  - *: <multi_op, >
  - **: <exp_op, >
  - 2: <num, integer value 2>

Lexical Errors
- Matched but ambiguous: left to the other phases (e.g., the parser)
  - e.g., fi ( a == f(x) ) ...: is fi an identifier, or a misspelling of if?
- Unmatched:
  - Panic mode recovery: delete successive characters from the remaining input until a well-formed token is found
  - Repair input (single error):
    - deleting an extraneous character
    - inserting a missing character
    - replacing an incorrect character with a correct one
    - transposing two adjacent characters
  - Minimum-distance error correction (multiple errors)

Specification of Tokens
A Formal Specification for Tokens or Patterns
- Strings and Languages
- Regular Expressions & Definitions
- Recognition of Tokens

Strings and Languages
- alphabet (or character class): any finite set of symbols
- string over some alphabet: a finite sequence of symbols drawn from that alphabet
- length of string s, |s|: the number of symbols in s
- empty string ε: the special string of length zero
- (proper) prefix, (proper) suffix, (proper) substring, subsequence: for the string abcdef, for instance, abc is a proper prefix, def a proper suffix, cde a proper substring, and ace a subsequence

Strings and Languages
- language: any set of strings over some alphabet
- empty set: the language with no strings at all, ∅ = {}; it is different from {ε}, the set containing only the empty string

Operations on Strings
- Concatenation: xy
  - sε = εs = s
  - x = Dog, y = House => xy = DogHouse
- Exponentiation: s^i = s^(i-1) s, with s^0 = ε

Operations on Languages
- Union: L ∪ M = {s | s is in L or s is in M}
- Concatenation: LM = {st | s is in L and t is in M}
- Kleene closure (zero or more concatenations): L* = union of L^i for i = 0, 1, 2, ...
  - L^0 = {ε}, L^i = L^(i-1) L
- Positive closure (one or more concatenations): L+ = union of L^i for i = 1, 2, ...

Examples
L = {A, B, ..., Z, a, b, ..., z}, D = {0, 1, ..., 9}
- Union: L ∪ D = {letters and digits, i.e., strings of length 1}
- Concatenation:
  - LD = {a letter followed by a digit} (= {A0, A1, B0, ...})
  - L^4 = {4-letter strings} (= {aaaa, AABC, BBBB, ...})
- Kleene closure (zero or more concatenations):
  - L* = {all strings of letters of length zero (i.e., ε) or more}
  - L(L ∪ D)* = {all strings of letters and digits, starting with a letter}
- Positive closure (one or more concatenations):
  - D+ = {strings of one or more digits}
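The operations above map directly onto Python sets of strings. The sketch below uses tiny alphabets so the results stay readable; the helper names `concat`, `power`, and `closure_up_to` are made up for this example, and since L* is infinite the last helper only builds a finite approximation.

```python
def concat(L, M):
    """LM = {st | s in L and t in M}."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i, with L^0 = {empty string}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure_up_to(L, n):
    """Finite approximation of L*: the union of L^0 .. L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out

L = {"a", "b"}          # stand-in for the letters
D = {"0", "1"}          # stand-in for the digits

print(sorted(L | D))                   # union
print(sorted(concat(L, D)))            # a letter followed by a digit
print(sorted(closure_up_to(D, 2)))     # D^0, D^1 and D^2 together
```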

Regular Expression (R.E.)
A Formal Specification for Tokens

Regular Expression: Syntax for Specifying String Patterns
- A regular expression r over an alphabet Σ defines the language L(r) corresponding to r
- Regular set: a language denoted by a regular expression
- Basic symbols
  - empty string: ε
  - any symbol a in the input symbol set Σ
- Basic operators
  - disjunction (OR, union): r | s
  - concatenation (AND): r · s (or simply rs)
  - closure (repetition): r*
  - identity (parenthesized): (r)

Regular Expression: Syntax for Specifying String Patterns
Extended operators:
- ? : optional operator
- + : positive closure operator
- . : any character but newline
- [a-z] : character class
- [^a-z] : complement (any character NOT in [a-z])
- {m,n} : number of occurrences
- ^ : start of line
- $ : end of line
- registers: the n-th part of a match: \1, \2
  - sed 's/.*<img src=\([^ >]*\).*/\1/g'
- escape, meta-symbols: \c (character c literally)
  - [a\-z] : a, - or z (NOT: a, b, ..., z)
- r/s : r, but only when followed by s ( / : lookahead operator)

Notational Shorthands
- One or more instances: (r)+ denotes (L(r))+
  - r* = r+ | ε
  - r+ = r r*
- Zero or one instance: r? = r | ε
- Character classes
  - [abc] = a | b | c
  - [a-z] = a | b | ... | z

Regular Expression Examples: Σ = {a, b}
- r = a | b : {a, b}
- r = (a | b)(a | b) : {aa, ab, ba, bb}
  - = aa | ab | ba | bb (another equivalent regular expression)
- r = a* : {ε, a, aa, aaa, ...}
- r = (a | b)* : {all strings of a's and b's}
  - = (a*b*)*

Equivalence
- A language may be represented by two or more equivalent regular expressions.
- Equivalence: r = s if and only if L(r) = L(s)
- Algebraic properties of regular expressions
  - Commutative: r | s = s | r
  - Associative: r | (s | t) = (r | s) | t, and (rs)t = r(st)
  - Distributive: r(s | t) = rs | rt, and (s | t)r = sr | tr
  - Identity element (ε): εr = r and rε = r
- Application of the properties: proof of equivalence
  - r* = (r | ε)*
  - r** = r*

Regular Definition: A CFG-like Notation for Regular Expressions
- Similar to a CFG
- Defines regular expressions in terms of named regular expressions:
    d1 -> r1
    d2 -> r2
    ...
    dn -> rn

Regular Definition
Example of a regular definition:
    letter -> A | B | C | ... | z
    digit  -> 0 | 1 | ... | 9
    id     -> letter (letter | digit)*
Another example: unsigned numbers (ex. 3.5), e.g., 512, 3.14, 6.33E4, 1.89E-5
    digit             -> 0 | 1 | ... | 9
    digits            -> digit digit*
    optional_fraction -> . digits | ε
    optional_exponent -> (E (+ | - | ε) digits) | ε
    num               -> digits optional_fraction optional_exponent
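A regular definition translates almost line by line into named pattern strings. The sketch below uses Python's `re` module as a stand-in for the notation above; the variable names mirror the definition, and `is_num` is a helper name invented for this example.

```python
import re

# Each named pattern is built only from earlier names, as in a regular definition.
DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}{DIGIT}*"
OPTIONAL_FRACTION = rf"(\.{DIGITS})?"
OPTIONAL_EXPONENT = rf"(E[+-]?{DIGITS})?"
NUM = re.compile(rf"{DIGITS}{OPTIONAL_FRACTION}{OPTIONAL_EXPONENT}")

def is_num(s):
    """True if the whole string s is an unsigned number."""
    return NUM.fullmatch(s) is not None

for sample in ["512", "3.14", "6.33E4", "1.89E-5", "3.", ".5"]:
    print(sample, is_num(sample))
```

Note that `3.` and `.5` are rejected: the definition requires digits on both sides of the decimal point.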

Nonregular Sets
Some languages cannot be described by any regular expression. Examples:
- Balanced and nested constructs
  - BUT, they can be specified by a CFG
- Repeating strings: {wcw | w is a string of a's and b's} = {c, aca, bcb, abcab, ...}
  - cannot be expressed by a CFG either
- Context-dependent strings: nHa1a2...an (a count n followed by exactly n characters)

Regular Expression: Syntax for Specifying String Patterns
Chomsky Hierarchy:
- regular set (R.E.)
- context-free
- context-sensitive
- recursively enumerable (Turing Machine)

Regular Expression: Syntax for Specifying String Patterns
Applications:
- matching wildcard characters (shell commands, filename expansion)
- string pattern matching (grep, awk)
- search engines (keyword matching, fuzzy match)
- string pattern editing/processing (sed, vi, tr)

Recognition of Tokens

Example Task
Grammar:
    stmt -> if expr then stmt
          | if expr then stmt else stmt
          | ε
    expr -> term relop term
          | term
    term -> id
          | num

Example Task
Terminal symbols:
    if    -> if
    then  -> then
    else  -> else
    relop -> < | <= | = | <> | > | >=
    id    -> letter (letter | digit)*
    num   -> digit+ (. digit+)? (E (+ | -)? digit+)?
White space (delimiters):
    delim -> blank | tab | newline
    ws    -> delim+

Example Task
- Goal: construct a lexical analyzer that isolates the lexeme for the next token and produces the token and its associated attribute values
- Methods: FA / FSA (Finite (State) Automata)
  - By hand: constructing the FAs and a simulator for the FAs
    - the simulator (scanner) depends on the FAs
  - By tools: writing regular definitions for scanner generators, which build the FAs for the scanner
    - Scanner: a driver program that is independent of the form of the FAs

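FA and Transition Diagrams
[Figure: transition diagram for r = (abc)+, with edges labeled a, b, c and callouts for a state, a transition, the start state, and the final state]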

FA/FSA and Transition Tables
    state | input a | input b | input c
    q0    | q1      |         |
    q1    |         | q2      |
    q2    |         |         | q3
    q3    | q1      |         |
NextState = Move(CurrentState, Input)

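Recognition
    state = 0;
    while ((c = next_char()) != EOF) {
        switch (state) {
        /* a character with no transition rejects the input */
        case 0: if (c == 'a') state = 1; else return (FALSE); break;
        case 1: if (c == 'b') state = 2; else return (FALSE); break;
        case 2: if (c == 'c') state = 3; else return (FALSE); break;
        case 3: if (c == 'a') state = 1;
                else { ungetchar(); return (TRUE); }  /* push back the lookahead */
                break;
        default: error();
        }
    }
    if (state == 3) return (TRUE); else return (FALSE);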

Finite Automata for the Lexical Tokens
[Figure: one finite automaton per token class: IF, ID, NUM, REAL, white space (and comments starting with "--"), and error] (Appel, pp. 21)

Regular expressions for tokens
    if                                  {return IF;}
    [a-z][a-z0-9]*                      {return ID;}
    [0-9]+                              {return NUM;}
    ([0-9]+"."[0-9]*)|("."[0-9]+)       {return REAL;}
    ("--"[a-z]*"\n")|(" "|"\n"|"\t")+   {/* do nothing */}
    .                                   {error();}
(Appel, pp. 20)

Recognition of the Lexical Tokens Given the FAs (Naive Pattern Matching)
- Traverse the state transition diagrams in sequence until one of them matches
- Give unique state numbers to the initial states (and all other states) of the individual diagrams before writing a program that simulates the traversal
- Match the longest expression first if two state transition diagrams have a super-/sub-string relationship
  - e.g., match REAL before INTEGER
- On failure, next_state = init_state of the next FA
- Example program: [Aho 86]

Finite State Automata

How to Construct an FA Systematically?
- You can construct a single complicated state transition diagram directly to recognize all token types, if you are smart enough
  - e.g., (next page)
- Or you can do it systematically, by constructing simpler transition diagrams and composing them into larger networks
  - preferred for automatic construction
  - easy to verify its correctness

A DFA for Recognizing Common Token Types
[Figure: combined DFA for IF, ID, NUM and error; each DFA state is labeled with a set of NFA states, such as {1,4,9,14}, {2,5,6,8,15}, {5,6,7,8,15}, {10,11,12,13,15}, {3,6,7,8}, {11,12,13}, {6,7,8} and {15}. Callouts: where a state is labeled "IF (or ID)", the 1st pattern or reserved word in the LEX spec wins; longest match.] (Appel, pp. 29)

Finite (State) Automata
- A set of states: S
- A set of input symbols: Σ (the input symbol alphabet)
- A transition (move) function: δ(s, a) = s'
- Initial (start) state: s0
- A set of final (accepting) states: F

Finite (State) Automata
- Graphical representation: state transition diagram
- Implementation: state transition table
- Deterministic (DFA): a single transition for every state on every input symbol
- Non-deterministic (NFA): more than one transition for at least one state on some input symbol

NFA: Nondeterministic Finite Automata
An NFA consists of
- S: a finite set of states
- Σ: a finite set of input symbols
- δ: a transition function that maps (state, symbol) pairs to sets of states
- s0: a state distinguished as the start state
- F: a set of states distinguished as final states

NFA: An Example
- RE: (a | b)*abb
- States: {0, 1, 2, 3}
- Input symbols: {a, b}
- Transition function:
  - δ(0, a) = {0, 1}, δ(0, b) = {0}
  - δ(1, b) = {2}, δ(2, b) = {3}
- Start state: 0
- Final states: {3}

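Transition Diagram (NFA) for (a | b)*abb
[Figure: NFA with states 0 (start/initial), 1, 2, 3 (final); state 0 loops on a and b, then 0 -a-> 1 -b-> 2 -b-> 3]
- States: {0 (start/initial), 1, 2, 3 (final)}
- Input symbols: {a, b}
- NFA transition function:
  - δ(0, a) = {0, 1}, δ(0, b) = {0}
  - δ(1, b) = {2}, δ(2, b) = {3}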

Acceptance by an NFA
- An NFA accepts an input string s iff there is some path in the transition diagram from the start state to some final state such that the edge labels along this path spell out s
- Example: bbababb is accepted by (a | b)*abb; bbabab is NOT

NFA: Example with ε-transitions
- RE: aa* | bb*
- States: {0, 1, 2, 3, 4}
- Input symbols: {a, b}
- Transition function:
  - δ(0, ε) = {1, 3}, δ(1, a) = {2}, δ(2, a) = {2}
  - δ(3, b) = {4}, δ(4, b) = {4}
- Start state: 0
- Final states: {2, 4}

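Transition Diagram (NFA) for aa* | bb*
[Figure: NFA with start state 0 and ε-edges to states 1 and 3; 1 -a-> 2 with an a-loop on 2; 3 -b-> 4 with a b-loop on 4]
- NFA transition function:
  - δ(0, ε) = {1, 3}, δ(1, a) = {2}, δ(2, a) = {2}
  - δ(3, b) = {4}, δ(4, b) = {4}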

Deterministic Finite Automata
A DFA is a special case of an NFA in which
- no state has an ε-transition
- for each state s and input symbol a, there is at most one edge labeled a leaving s

DFA: An Example
- RE: (a | b)*abb
- States: {0, 1, 2, 3}
- Input symbols: {a, b}
- Transition function:
  - δ(0, a) = {1}, δ(1, a) = {1}, δ(2, a) = {1}, δ(3, a) = {1}
  - δ(0, b) = {0}, δ(1, b) = {2}, δ(2, b) = {3}, δ(3, b) = {0}
- Start state: 0
- Final states: {3}

Transition Diagram
[Figure: a DFA for (a | b)*abb with states 0 (start), 1, 2, 3 (final)]

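Transition Diagram
[Figure: the NFA for (a | b)*abb shown above the DFA for (a | b)*abb; each DFA state is annotated with the set of NFA states it stands for: 0 = {0}, 1 = {0,1}, 2 = {0,2}, 3 = {0,3}]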

Recognition of Regular Expression Using DFA
Simulating Deterministic Finite Automata (DFA):
    initialization: current_state = s0; input_symbol = 1st symbol
    while (current_state is not fail_state && input_symbol != EOF)
        next_state = δ(current_state, input_symbol)
        current_state = next_state
        input_symbol = next_input_symbol
    if (current_state in final states) accept() else fail()

Simulating a DFA
Input. An input string x ended with eof, and a DFA D with start state s0 and final states F.
Output. The answer "yes" if D accepts x, "no" otherwise.
    begin
        s := s0;
        c := nextchar;
        while c <> eof do begin
            s := move(s, c);    // transition function
            c := nextchar
        end;
        if s is in F then return "yes"
        else return "no"
    end.

DFA: An Example
[Figure: the DFA for (a | b)*abb with states 0 (start), 1, 2, 3 (final)]

An Example
    Input bbababb               Input bbabab
    s = 0                       s = 0
    s = move(0, b) = 0          s = move(0, b) = 0
    s = move(0, b) = 0          s = move(0, b) = 0
    s = move(0, a) = 1          s = move(0, a) = 1
    s = move(1, b) = 2          s = move(1, b) = 2
    s = move(2, a) = 1          s = move(2, a) = 1
    s = move(1, b) = 2          s = move(1, b) = 2
    s = move(2, b) = 3          s is not in {3}
    s is in {3}
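The DFA simulation loop can be written as a table lookup. The sketch below encodes the transition function of the DFA for (a | b)*abb given earlier; the dictionary layout and the name `dfa_accepts` are choices made for this example, and a missing table entry is treated as rejection.

```python
# Transition table of the 4-state DFA for (a|b)*abb: (state, symbol) -> state
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START = 0
FINAL = {3}

def dfa_accepts(text):
    """Run the DFA over text; accept iff it stops in a final state."""
    s = START
    for c in text:
        if (s, c) not in MOVE:      # symbol outside the alphabet: reject
            return False
        s = MOVE[(s, c)]
    return s in FINAL

print(dfa_accepts("bbababb"))   # the first input traced above
print(dfa_accepts("bbabab"))    # the second input traced above
```

Each input character costs one dictionary lookup, which is why DFA simulation runs in time proportional to the input length.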

Recognition of Regular Expression Using NFA
Simulating Non-Deterministic Finite Automata (NFA):
- Backtrack/backup (sequential traversal): remember the next alternative configuration (current input & next alternative state) whenever alternative choices are possible
- Parallelism (parallel traversal): trace every possible alternative in parallel
- Look-ahead: look at more input symbols to make the choice deterministic

Simulating an NFA
Input. An input string x ended with eof, and an NFA N with start state s0 and final states F.
Output. The answer "yes" if N accepts x, "no" otherwise.
    begin
        S := ε-closure({s0});              // s0 =ε=> S
        c := nextchar;
        while c <> eof do begin
            S := ε-closure(move(S, c));    // S =c=> M =ε=> S'
            c := nextchar
        end;
        if S ∩ F <> ∅ then return "yes"
        else return "no"
    end.

Operations on NFA States
- ε-closure: the set of states reachable without consuming any input symbol
  - ε-closure(s): the set of NFA states reachable from NFA state s on ε-transitions alone
  - ε-closure(S): the set of NFA states reachable from some NFA state s in S on ε-transitions alone
- move(S, c): the set of NFA states to which there is a transition on input symbol c from some NFA state s in S

Computation of ε-closure
Input. An NFA and a set of NFA states S.
Output. T = ε-closure(S).
    begin
        push all states in S onto stack;
        initialize T := S;
        while stack is not empty do begin
            pop t, the top element, off of stack;
            for each state u with an edge from t to u labeled ε do
                if u is not in T [i.e., the current ε-closure(S)] do begin
                    add u to T;
                    push u onto stack
                end
        end;
        return T
    end.

An Example: (a | b)*abb
[Figure: Thompson NFA for (a | b)*abb with states 0 to 10]
T = ε-closure(0), where S is the stack:
    01: S={0}, T={0}
    02: S={}; t=0; T={0}
    03: S={1,7}; T={0,1,7}
    04: S={1}; t=7; T={0,1,7}
    05: S={1}; T={0,1,7}
    06: S={}; t=1; T={0,1,7}
    07: S={2,4}; T={0,1,2,4,7}
    08: S={2}; t=4; T={0,1,2,4,7}
    09: S={2}; T={0,1,2,4,7}
    10: S={}; t=2; T={0,1,2,4,7}
    **: S={}; T={0,1,2,4,7}

An Example: (a | b)*abb
[Figure sequence: the Thompson NFA for (a | b)*abb (states 0 to 10), with the states of each set highlighted in turn]
- A = ε-closure({0}) = {0,1,2,4,7}
- move(A, a) = {3,8}
- move(A, b) = {5}
- C = ε-closure(move(A, b)) = {1,2,4,5,6,7}
- move(C, b) = {5}
- ε-closure(move(C, b)) = {1,2,4,5,6,7} = C

An Example
Input bbabb:
    S = ε-closure({0}) = {0,1,2,4,7} = A
    S = ε-closure(move({0,1,2,4,7}, b)) = ε-closure({5}) = {1,2,4,5,6,7} = C
    S = ε-closure(move({1,2,4,5,6,7}, b)) = ε-closure({5}) = {1,2,4,5,6,7} = C
    S = ε-closure(move({1,2,4,5,6,7}, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8}
    S = ε-closure(move({1,2,3,4,6,7,8}, b)) = ε-closure({5,9}) = {1,2,4,5,6,7,9}
    S = ε-closure(move({1,2,4,5,6,7,9}, b)) = ε-closure({5,10}) = {1,2,4,5,6,7,10}
    S ∩ {10} <> ∅, so the input is accepted
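The parallel simulation traced above can be reproduced with two small set operations. In this sketch the Thompson NFA for (a | b)*abb is written out by hand as a dictionary; using the empty string as the ε label and the names `eps_closure`, `move`, and `nfa_accepts` are assumptions of the example, not part of the original algorithm text.

```python
EPS = ""   # label chosen here to mark ε-transitions

# Thompson NFA for (a|b)*abb, states 0..10: (state, label) -> set of next states
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9},   (9, "b"): {10},
}
START = 0
FINAL = {10}

def eps_closure(states):
    """All states reachable from `states` on ε-transitions alone (stack-based)."""
    closure = set(states)
    stack = list(states)
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, c):
    """All states reachable from some state in `states` on symbol c."""
    out = set()
    for s in states:
        out |= NFA.get((s, c), set())
    return out

def nfa_accepts(text):
    S = eps_closure({START})
    for c in text:
        S = eps_closure(move(S, c))
    return bool(S & FINAL)

print(sorted(eps_closure({0})))
print(nfa_accepts("bbabb"))
```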

Recognition of Regular Expression
- Simulating an NFA is harder than simulating a DFA
- Constructing an NFA is easier than constructing a DFA
- So: construct the NFA => construct an equivalent DFA => (optional) state minimization => simulate the DFA
  - the DFA is built by pre-defining each set of NFA states that can be reached in parallel as one DFA state, and pre-computing all possible transitions
  - instead of simulating the parallel transitions at run time

Constructing Automata from R.E. (1)
R.E. -> NFA (Thompson's construction) -> DFA (subset construction) -> state minimization
- decompose the R.E. into basic alphabet symbols & operators
- construct an FA for each basic symbol
- merge the FAs according to the operators

Constructing Automata from R.E. (2)
R.E. -> DFA -> state minimization, where a state transition corresponds to a position transition in the pattern
- annotate the RE symbols with position labels
- get the syntax tree of the annotated pattern
- compute {nullable, firstpos, lastpos} of the subexpressions
- compute follow(i)
- s0 = firstpos(root)
- construct the transition function according to follow(i)

Regular Expression to NFA
R.E. -> NFA (Thompson's construction)

Constructing an NFA
How do we define an NFA that accepts a regular expression? It is very simple. Remember that a regular expression is formed by the use of alternation, concatenation and repetition. Thus all we need to know is how to build the NFA for a single symbol, and how to compose NFAs.

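Composing NFAs with Alternation
- The NFA for a symbol a (or ε) is: start -> i -a-> f (the edge is labeled ε for the empty string)
- Given two NFAs N(s) and N(t), the NFA N(s | t) is:
  [Figure: a new start state i with ε-edges into N(s) and N(t), and ε-edges from their final states to a new final state f]
(Aho, Sethi, Ullman, pp. 122)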

Composing NFAs with Concatenation
- Given two NFAs N(s) and N(t), the NFA N(st) is:
  [Figure: start state i of N(s), then N(s) followed directly by N(t), ending in the final state f of N(t)]
(Aho, Sethi, Ullman, pp. 123)

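Composing NFAs with Repetition
- The NFA for N(s*) is:
  [Figure: new start state i and new final state f around N(s), with ε-edges that allow skipping N(s) and repeating it]
(Aho, Sethi, Ullman, pp. 123)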

Properties of the NFA via Thompson's Construction
Following the construction rules, we obtain an NFA N(r) that:
- has at most twice as many states as the number of symbols and operators in r
- has exactly one start state and one accepting state
- has, for each state, at most one outgoing transition on a symbol of the alphabet or at most two outgoing ε-transitions
All nondeterministic transitions are introduced by the ε-transitions that connect the new and old initial/final states.
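The composition rules can be tried out with a handful of combinators. This is a simplified sketch, not the textbook construction verbatim: a fragment is a `(start, final, edges)` tuple, the empty string stands for ε, and `concat` links the two fragments with an ε-edge instead of merging states, so the state numbering differs from the 11-state figure. All function names are invented for the example.

```python
import itertools

EPS = ""                       # label chosen here for ε-edges
_ids = itertools.count()       # source of fresh state numbers

def symbol(c):
    """NFA for a single symbol c: i --c--> f."""
    i, f = next(_ids), next(_ids)
    return (i, f, [(i, c, f)])

def alt(s, t):
    """N(s|t): new start/final states joined to both fragments by ε-edges."""
    i, f = next(_ids), next(_ids)
    return (i, f, s[2] + t[2] + [(i, EPS, s[0]), (i, EPS, t[0]),
                                 (s[1], EPS, f), (t[1], EPS, f)])

def concat(s, t):
    """N(st): final state of s linked to start of t (by an ε-edge in this sketch)."""
    return (s[0], t[1], s[2] + t[2] + [(s[1], EPS, t[0])])

def star(s):
    """N(s*): ε-edges allow skipping the fragment or repeating it."""
    i, f = next(_ids), next(_ids)
    return (i, f, s[2] + [(i, EPS, s[0]), (s[1], EPS, f),
                          (s[1], EPS, s[0]), (i, EPS, f)])

def accepts(nfa, text):
    """Parallel NFA simulation over the edge list."""
    start, final, edges = nfa

    def closure(states):
        states, stack = set(states), list(states)
        while stack:
            t = stack.pop()
            for src, label, dst in edges:
                if src == t and label == EPS and dst not in states:
                    states.add(dst)
                    stack.append(dst)
        return states

    current = closure({start})
    for c in text:
        current = closure({dst for src, label, dst in edges
                           if src in current and label == c})
    return final in current

# (a|b)*abb assembled bottom-up from the combinators
nfa = concat(concat(concat(star(alt(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
print(accepts(nfa, "bbababb"), accepts(nfa, "bbabab"))
```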

An Example: (a | b)*abb
[Figure: the NFA built by Thompson's construction, states 0 to 10, start state 0, final state 10]

Comparison: NFA (by Heuristics) for (a | b)*abb
[Figure: the 4-state NFA with states 0 (start/initial), 1, 2, 3 (final)]
- NOT constructed using Thompson's construction
- States: {0 (start/initial), 1, 2, 3 (final)}
- Input symbols: {a, b}
- NFA transition function:
  - δ(0, a) = {0, 1}, δ(0, b) = {0}
  - δ(1, b) = {2}, δ(2, b) = {3}

NFA to DFA
NFA -> DFA (subset construction)

Translating an NFA into a DFA
- Each state of the DFA (D) corresponds to a set of states of the NFA (N)
  - transforming N to D is done by subset construction
- D will be in state {x, y, z} after reading a given input string if and only if N could be in any of the states x, y, or z, depending on the transitions it chooses.
- D keeps track of all the possible routes N might take and runs them in parallel.


NFA to DFA: generalizing the NFA simulation
Recall the NFA simulation loop: S := ε-closure({s0}), then S := ε-closure(move(S, c)) for each input character c. The subset construction runs the same loop, with these changes:
- c: ranges over all symbols in the alphabet (not just the symbols of one particular input)
- S: all state sets generated during the parallel NFA traversal over all possible input prefixes (NOT one particular input), starting from the initial state
- δ: all transitions found during the traversal, each leading from a previous state T to a next state U on symbol c

From an NFA to a DFA
Subset construction algorithm.
Input. An NFA N.
Output. A DFA D with states Dstates and transition table Dtran.
    begin
        add ε-closure(s0) as an unmarked state to Dstates;
        while there is an unmarked state T in Dstates do begin
            mark T;
            for each input symbol a do begin
                U := ε-closure(move(T, a));
                if U is not in Dstates then
                    add U as an unmarked state to Dstates;
                    // mark U as final if it contains the original final state
                Dtran[T, a] := U
            end
        end
    end.

An Example: (a | b)*abb
[Figure: the Thompson NFA for (a | b)*abb with states 0 to 10]

An Example: ε-closure(s) & move(s, x)
    s  | ε-closure(s)   | move(s,a) | move(s,b) | important state?
    0  | {0,1,2,4,7}    |           |           |
    1  | {1,2,4}        |           |           |
    2  | {2}            | 3         |           | Yes
    3  | {1,2,3,4,6,7}  |           |           |
    4  | {4}            |           | 5         | Yes
    5  | {1,2,4,5,6,7}  |           |           |
    6  | {1,2,4,6,7}    |           |           |
    7  | {7}            | 8         |           | Yes
    8  | {8}            |           | 9         | Yes
    9  | {9}            |           | 10        | Yes
    10 | {10}           | ((Fin))   | ((Fin))   | ((?))

An Example
Notes: ignore the ε-transitions (0, 1, ...) when computing move. a-transitions: (2,a) -> 3, (7,a) -> 8. b-transitions: (4,b) -> 5, (8,b) -> 9, (9,b) -> 10. It is good to label states sequentially, such that δ(s, x) -> s+1.
    ε-closure({0}) = {0,1,2,4,7} = A
    A: ε-closure(move({0,1,2,4,7}, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
    A: ε-closure(move({0,1,2,4,7}, b)) = ε-closure({5}) = {1,2,4,5,6,7} = C
    B: ε-closure(move({1,2,3,4,6,7,8}, a)) = ε-closure({3,8}) = B
    B: ε-closure(move({1,2,3,4,6,7,8}, b)) = ε-closure({5,9}) = {1,2,4,5,6,7,9} = D
    C: ε-closure(move({1,2,4,5,6,7}, a)) = ε-closure({3,8}) = B
    C: ε-closure(move({1,2,4,5,6,7}, b)) = ε-closure({5}) = C
    D: ε-closure(move({1,2,4,5,6,7,9}, a)) = ε-closure({3,8}) = B
    D: ε-closure(move({1,2,4,5,6,7,9}, b)) = ε-closure({5,10}) = {1,2,4,5,6,7,10} = E
    E: ε-closure(move({1,2,4,5,6,7,10}, a)) = ε-closure({3,8}) = B
    E: ε-closure(move({1,2,4,5,6,7,10}, b)) = ε-closure({5}) = C
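The hand computation above can be checked mechanically. The following sketch runs the subset construction on the same Thompson NFA; the dictionary encoding, the empty string as ε label, the breadth-first processing order, and the naming of DFA states by discovery order (A, B, C, ...) are all choices made for this example.

```python
EPS = ""   # label chosen here for ε-transitions
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9},   (9, "b"): {10},
}
ALPHABET = "ab"

def eps_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, c):
    out = set()
    for s in states:
        out |= NFA.get((s, c), set())
    return out

def subset_construction(start, finals):
    start_set = frozenset(eps_closure({start}))
    dstates = [start_set]          # discovery order of DFA states
    dtran = {}                     # (DFA state, symbol) -> DFA state
    unmarked = [start_set]
    while unmarked:
        T = unmarked.pop(0)        # "mark" T by removing it from the work list
        for a in ALPHABET:
            U = frozenset(eps_closure(move(T, a)))
            if not U:              # no transition on a: leave the entry empty
                continue
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    dfinals = {T for T in dstates if T & finals}
    return dstates, dtran, dfinals

dstates, dtran, dfinals = subset_construction(0, {10})
names = {T: chr(ord("A") + i) for i, T in enumerate(dstates)}
table = {(names[T], a): names[U] for (T, a), U in dtran.items()}
for name in sorted(set(names.values())):
    print(name, table[(name, "a")], table[(name, "b")])
```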

An Example
Notes: ignore the ε-transitions (0, 1, ...). a-transitions: (2,a) -> 3, (7,a) -> 8. b-transitions: (4,b) -> 5, (8,b) -> 9, (9,b) -> 10. It is good to label states sequentially, such that δ(s, x) -> s+1. Below, {X}* denotes ε-closure({X}).
    State                               | a | b
    A = {0}*    = {0,1,2,4,7}           | B | C
    B = {3,8}*  = {1,2,3,4,6,7,8}       | B | D
    C = {5}*    = {1,2,4,5,6,7}         | B | C
    D = {5,9}*  = {1,2,4,5,6,7,9}       | B | E
    E = {5,10}* = {1,2,4,5,6,7,10}      | B | C


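An Example: Result of Subset Construction
[Figure: the resulting DFA with start state A = {0,1,2,4,7} and states B = {1,2,3,4,6,7,8}, C = {1,2,4,5,6,7}, D = {1,2,4,5,6,7,9}, E = {1,2,4,5,6,7,10}; edges labeled a and b as in the transition table]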

Minimizing the Number of States
- Every DFA has a unique smallest equivalent DFA.
- Given a DFA M, we use splitting to construct the equivalent minimal DFA.
- In practice, we merge individual states into larger sets of states rather than splitting wildly.

DFA to Minimum-State DFA
Input. A DFA M = (S, Σ, δ, s0, F).
Output. An equivalent DFA M' = (S', Σ, δ', s0', F') with as few states as possible.
    begin
        initialize a partition Π of two groups of states:
            {F (final states), S-F (non-final states)}
        repeat /* until Π_new is unchanged */
            for each group G of Π do begin
                partition G into subgroups such that any two states s and t of G
                are in the same subgroup iff for all input symbols a, states s and t
                have transitions on a to states in the same group of Π;
                /* at worst, a state will be in a subgroup by itself */
                update Π_new by replacing G by the set of all subgroups formed
            end
        s0' = r(s0), the representative of s0;
        S' = {representatives of subgroups};
        F' = {representatives of states in F};
        δ(s, a) = t  =>  δ'(r(s), a) = r(t)
    end.

Splitting into Equivalent States
Algorithm:
Initially, there are two sets, one consisting of all accepting states of M, the other containing the remaining states.
    repeat {
        Choose a set A = {s1, s2, ..., sn}
        Split A into A1, A2, ..., Am so that for all Ai and all symbols a:
            if sj, sk are in Ai and, on input a, sj -> tj and sk -> tk    // source -> target
            then tj and tk are in the same set.
    } until no more change.

An Example
    State                 | a | b
    A = {0,1,2,4,7}       | B | C
    B = {1,2,3,4,6,7,8}   | B | D
    C = {1,2,4,5,6,7}     | B | C
    D = {1,2,4,5,6,7,9}   | B | E
    E = {1,2,4,5,6,7,10}  | B | C

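An Example
Initial partition: non-final states (-Fin) = {A, B, C, D}; final states (+Fin) = {E}.
    Group | State                 | a | b
    -Fin  | A = {0,1,2,4,7}       | B | C
    -Fin  | B = {1,2,3,4,6,7,8}   | B | D
    -Fin  | C = {1,2,4,5,6,7}     | B | C
    -Fin  | D = {1,2,4,5,6,7,9}   | B | E
    +Fin  | E = {1,2,4,5,6,7,10}  | B | C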


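An Example
States A and C have identical transitions into the same groups, so C is merged into A (every C is renamed A):
    State                 | a | b
    A = {0,1,2,4,7}       | B | A
    B = {1,2,3,4,6,7,8}   | B | D
    A = {1,2,4,5,6,7}     | B | A
    D = {1,2,4,5,6,7,9}   | B | E
    E = {1,2,4,5,6,7,10}  | B | A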
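The partition refinement can be sketched as follows, using the five-state transition table from the example. The function names and the dictionary layout are assumptions of this sketch, and it presumes a complete DFA in which every state has a transition on every symbol.

```python
# Transition table produced by the subset construction for (a|b)*abb
DTRAN = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
FINALS = {"E"}

def group_of(partition, state):
    """Index of the group that currently contains `state`."""
    return next(i for i, g in enumerate(partition) if state in g)

def minimize(dtran, finals):
    """Refine {non-final, final} until no group can be split any further."""
    states = set(dtran)
    partition = [g for g in (states - finals, states & finals) if g]
    while True:
        new_partition = []
        for g in partition:
            buckets = {}
            for s in sorted(g):
                # states stay together iff they agree, symbol by symbol,
                # on which group their transitions lead to
                key = tuple(group_of(partition, dtran[s][a])
                            for a in sorted(dtran[s]))
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):   # nothing was split
            return new_partition
        partition = new_partition

groups = minimize(DTRAN, FINALS)
print(sorted(sorted(g) for g in groups))
```

On this table the refinement first separates D (its b-transition reaches the final group), then B, and leaves A and C together, matching the merge shown above.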

Transition Diagram (after State Reduction)
We said earlier that this is a DFA for (a | b)*abb:
[Figure: DFA with states 0 = {0} (start), 1 = {0,1}, 2 = {0,2}, 3 = {0,3} (final)]

Transition Diagram (after State Reduction)
It really is a DFA for (a | b)*abb:
[Figure: the same 4-state DFA, now labeled with the minimized states: 0 = A (start), 1 = B, 2 = D, 3 = E (final)]

RE to DFA
Construct a DFA from an RE directly, without an intermediate NFA

Review of Thompson's Transition Diagram: An Example for (a | b)*abb
[Figure sequence: the Thompson NFA for (a | b)*abb (states 0 to 10), with the relevant state sets highlighted]
- A = ε-closure({0}) = {0,1,2,4,7}
- move(A, b) = {5}
- C = ε-closure(move(A, b)) = {1,2,4,5,6,7}

Constructing a DFA from an R.E.
- "Important states":
  - ε-transitions have no effect on determining the next state, since they do not make a transition on a visible input symbol
  - ε-transitions determine "equivalent" states in a loose sense
  - important states are related to a non-null symbol at a particular position in the RE
    - e.g., b at position 2 of (a | b)*abb#
- Re-definition of "states":
  - Thompson's transition diagram: nodes as states (the status before & after matching a symbol)
  - Alternative method: arcs as states (the position in the RE of a match)
    - #: simulates the last node, for checking the final state
    - only states that consume symbols matter

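DFA directly from R.E.: underlying NFA
Annotated RE: (a1 | b2)* a3 b4 b5 #6
[Figure: the underlying NFA with nodes A to F; the symbol-consuming edges carry the positions a1, b2, a3, b4, b5, #6]
Important states ({1, ..., 6}): those with non-null transitions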

DFA directly from R.E.: underlying NFA
Annotated RE: (a1 | b2)* a3 b4 b5 #6
[Figure: the same underlying NFA, highlighting the positions that can follow position 1]
followpos(1) = {1,2,3}

Constructing Automata from R.E.
Example: RE = (a | b)*abb#, annotated as (a1 | b2)* a3 b4 b5 #6
Syntax tree for the RE: (Fig. 3.41)
followpos():
    Node       | followpos
    1 (on a)   | {1,2,3}
    2 (on b)   | {1,2,3}
    3 (on a)   | {4}
    4 (on b)   | {5}
    5 (on b)   | {6}
    6 (#)      | -
[Figure: directed graph for followpos(), with callouts "ready to match a at 3" and "ready to match b at 2"]
followpos(1): (a1 | b2)* a3 b4 b5 #6 ~ ((a1 | b2)(a1 | b2)) a3 b4 b5 #6

DFA directly from R.E.
DFA for (a | b)*abb, from the annotated RE (a1 | b2)* a3 b4 b5 #6
[Figure: the 4-state DFA; each state is labeled with its set of possible (next) matching positions: 0 = {1,2,3} (start), 1 = {1,2,3,4}, 2 = {1,2,3,5}, 3 = {1,2,3,6} (final)]

Constructing a DFA from an RE: firstpos, lastpos, nullable
- Matching REs: 3 possible cases
  - x(c1 | c2)y
  - x(c1 . c2)y
  - x(c*)y
- followpos: which position(s)/symbol(s) can be matched after matching lastpos of x?
  - requires firstpos of c, c1, c2, y
  - need to know whether c1, c2 can be passed through (nullable); c* is always nullable

Constructing a DFA from an R.E.
R.E. -> DFA:
- State = a set of positions (and their respective symbols) in the RE, where an input character is being matched
- State transition = an allowed position transition in the RE
- Set of positions = set of important states of the NFA (those that consume input symbols)
DFA construction:
- Augment the RE: (r)# [#: end-of-pattern mark]
- Annotate the RE symbols (excluding ε) with position labels
- Get the syntax tree T of the annotated pattern
- Compute {nullable, firstpos, lastpos} of the nodes [sub-REs]
- Compute follow(i) [by a depth-first traversal over the tree T]
- Initial state: s0 = firstpos(root) [the complete RE]
- Construct the transition function according to follow(i) [δ(i, a) = i']

Constructing a DFA from an R.E.
DFA construction:
    Initial state: s0 = firstpos(root), and S = {s0}
    While there is an unmarked state Q in S do begin
        mark Q
        For each input symbol a do begin
            For each position p in Q s.t. symbol(p) = a,
                let U = followpos(p)    // take the union if there is more than one such p
            If U is not ∅ (empty) and U is not in S, then S += U    // new state
            δ(Q, a) = U    // new transition
        End /* for */
    End /* while */
(Compare with the NFA -> DFA subset construction: Q is a set of positions {p, p', ...}, and U is the union of the sets followpos(p).)
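Given the followpos table, the construction loop is short. The sketch below hard-codes the position symbols, followpos sets, and firstpos(root) for (a1 | b2)* a3 b4 b5 #6 from the example rather than computing them from a syntax tree; the names `SYMBOL`, `FOLLOWPOS`, and `build_dfa` are invented for this illustration.

```python
# Annotated RE: (a1|b2)* a3 b4 b5 #6
SYMBOL = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
FIRSTPOS_ROOT = frozenset({1, 2, 3})
END_POSITION = 6          # position of the end marker '#'

def build_dfa(alphabet="ab"):
    states = [FIRSTPOS_ROOT]      # each DFA state is a set of positions
    dtran = {}
    unmarked = [FIRSTPOS_ROOT]
    while unmarked:
        Q = unmarked.pop(0)
        for a in alphabet:
            # union of followpos(p) over all positions p in Q that carry symbol a
            U = frozenset().union(*(FOLLOWPOS[p] for p in Q if SYMBOL[p] == a))
            if not U:
                continue
            if U not in states:
                states.append(U)
                unmarked.append(U)
            dtran[(Q, a)] = U
    finals = {Q for Q in states if END_POSITION in Q}
    return states, dtran, finals

states, dtran, finals = build_dfa()
for Q in states:
    print(sorted(Q), {a: sorted(dtran[(Q, a)]) for a in "ab"})
```

The four states it finds are exactly the position sets labeling the DFA for (a | b)*abb, so no separate minimization step is needed for this example.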

Lexical Analyzer Generator
RE -> (Thompson's construction) -> NFA -> (subset construction) -> DFA

Time-Space Tradeoffs
- RE (r) to NFA, then simulate the NFA on input x
  - time: O(|r| * |x|), space: O(|r|) [max. 2|r| states]
- RE to NFA, NFA to DFA, then simulate the DFA
  - time: O(|x|), space: O(2^|r|)
- Lazy transition evaluation
  - transitions are computed as needed at run time; computed transitions are stored in a cache for later use

LEX
A Language for Specifying Lexical Analyzers

Lex
A language for specifying lexical analyzers (for any language, say, X):
- lex.l (lexical analyzer spec.) -> lex compiler -> lex.yy.c
- lex.yy.c (lexical analyzer in C) -> C compiler -> a.out
- source code in X -> a.out (lexical analyzer executable) -> tokens (for the parser)
- next_token = yylex();

Using a Scanner Generator: Lex
- Lex is a lexical analyzer generator developed by Lesk and Schmidt of AT&T Bell Lab, written in C, running under UNIX.
- Lex produces an entire scanner module that can be compiled and linked with other compiler modules.
- Lex associates regular expressions with arbitrary code fragments. When an expression is matched, the code segment is executed.
- A typical lex program contains three sections separated by %% delimiters.

Lex Programs
    %{
    auxiliary declarations
    %}
    regular definitions
    %%
    translation rules
    %%
    auxiliary procedures

First Section of Lex
The first section defines character classes and auxiliary regular expressions. (Fig. 3.5 on p. 67)
- [] delimits character classes
- - denotes ranges: [xyz] == [x-z]
- \ denotes the escape character, as in C
- ^ complements a character class (NOT): [^xy] denotes all characters except x and y
- |, *, and + (alternation, Kleene closure, and positive closure) are provided
- () can be used to control the grouping of subexpressions
- (expr)? == (expr)|ε, i.e., matches expr zero times or once
- {} signals the macro expansion of a symbol defined in the first section

First Section of Lex, cont.
- Catenation is specified by the juxtaposition of two expressions; no explicit operator is used.
- [ab][cd] will match any of ad, ac, bc, and bd.
- begin == "begin" == [b][e][g][i][n]

Second Section of Lex
- The second section of lex defines a table of regular expressions and corresponding commands.
- When an expression is matched, its associated command is executed. Auxiliary functions may be defined in the third section.
- Input that is matched is stored in the string variable yytext, whose length is yyleng.
- Lex creates an integer function yylex() that may be called from the parser.
- The value returned is usually the token code of the token scanned by Lex.
- When yylex() encounters end of file, it calls a user-supplied integer function named yywrap() to wrap up input processing.

Translation Rules
    P1 {action1}
    P2 {action2}
    ...
    Pn {actionn}
where the Pi are regular expressions and the actioni are program segments to be executed on matching Pi

Dealing with Multiple Input Files
yylex() uses three user-defined functions to handle character I/O:
- input(): retrieve a single character, 0 on EOF
- output(c): write a single character to the output
- unput(c): put a single character back on the input to be re-read

An Example
    %{
    /* auxiliary declarations (in C) */
    #define LT 24
    #define LE 25
    #define EQ 26
    ...
    %}
    /* regular definitions */
    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    number  {digit}+(\.{digit}+)?(e[+\-]?{digit}+)?
    %%

An Example
Translation rules (actions are in C):
    {ws}      { /* no action and no return */ }
    if        {return (IF);}
    then      {return (THEN);}
    else      {return (ELSE);}
    {id}      {yylval = install_id(); return (ID);}
    {number}  {yylval = install_num(); return (NUMBER);}
    "<"       {yylval = LT; return (RELOP);}
    "<="      {yylval = LE; return (RELOP);}
    ...
    %%
Auxiliary procedures (in C):
    install_id() { /* yytext to symbol table */ }
    install_num() { ... /* yytext to symbol table */ }

Functions and Variables
- yylex(): a function implementing the lexical analyzer and returning the token matched
- yytext: a global pointer variable pointing to the lexeme matched
- yyleng: a global variable giving the length of the lexeme matched
- yylval: an external global variable storing the attribute of the token

NFA from Lex Programs
Patterns: P1 | P2 | ... | Pn
[Figure: a new start state s0 with ε-transitions to the NFAs N(P1), N(P2), ..., N(Pn)]

Rules
- Look for the longest lexeme
  - e.g., number
  - match until there is no transition, then retract to the longest match
- Look for the first-listed pattern that matches the longest lexeme
  - e.g., keywords and identifiers
- List frequently occurring patterns first
  - e.g., white space

Rules
- View keywords as exceptions to the rule for identifiers
  - construct a keyword table to distinguish them from ids
- Lookahead operator: r1/r2 matches a string in r1 only if it is followed by a string in r2
  - DO 5 I = 1.25 (an assignment) versus DO 5 I = 1,25 (a loop)
  - DO/({letter}|{digit})* = ({letter}|{digit})*,
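The longest-lexeme and first-listed rules can be imitated with Python's `re` module. This is only a sketch of the disambiguation policy, not of how Lex is implemented (Lex compiles all patterns into one automaton); the rule table, token names, and the function `scan` are assumptions of this example.

```python
import re

# Rules in priority order: earlier rules win ties, as in a Lex specification.
RULES = [
    ("WS",   r"[ \t\n]+"),
    ("IF",   r"if"),
    ("ID",   r"[a-z][a-z0-9]*"),
    ("REAL", r"[0-9]+\.[0-9]*|\.[0-9]+"),
    ("NUM",  r"[0-9]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def scan(text):
    """Split text into (token, lexeme) pairs using longest match, first rule on ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # strictly longer only: on equal length the earlier rule is kept
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no pattern matches at position {pos}")
        if best_name != "WS":            # white space produces no token
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(scan("if iffy 12 3.5"))
```

Here `if` is reported as IF because IF is listed before ID, while `iffy` becomes an ID because the longer match wins.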

Lexical Error Recovery
- Error: none of the patterns matches a prefix of the remaining input
- Panic mode error recovery
  - delete successive characters from the remaining input until the pattern matching can continue
- Error repair:
  - delete an extraneous character
  - insert a missing character
  - replace an incorrect character
  - transpose two adjacent characters

Appendix: Regular Expression and Pattern Matching
- KMP algorithm
- AC algorithm

R.E. and Pattern Matching
Naive pattern matching:
- Specify the pattern with a regular expression: one R.E. for each keyword
- Construct an FA for each such R.E., and conduct left-to-right matching:
    DFA := State_Transition_Table := Construct_DFA(R.E.)
    while (input_pointer != EOF)
        stop_state = recognize(input_pointer, DFA)
        if fail (stop_state not in final_states):
            move the input pointer by one character
        if success (stop_state in final_states):
            output the matching status & skip over the matched pattern

R.E. and Pattern Matching
Why Is It Slow?
- match multiple keywords multiple times
  - once for each keyword
- on failure, move the input pointer backward to the character next to the last beginning of matching & reset to the initial state
  - even though some repeated pattern might appear in the recently matched partial string
- the probability of failure is significantly larger than the probability of a successful match in most applications (success or match only a few times)
  - will therefore start the next matching session by setting the input pointer one character after the starting position of the previous match most of the time

R.E. and Pattern Matching
RE vs. Pattern Matching
- R.E. <=> FA for recognizing one of a set of keywords/patterns in an input string
  - say yes if the input string is in Lang(R.E.) (the regular language for the expression)
- Pattern Matching (PM): recognizing all the occurrences of any keyword/pattern, specified in a regular expression, within a text document
  - specify each pattern/keyword with an RE
  - output all occurrences, in addition to saying yes/no

R.E. and Pattern Matching
Formal Method for Pattern Matching (PM)
- Constructing an FA for (single/multi-keyword) PM is equivalent to constructing an FA that recognizes the regular expression PM = (.* RE)*, and outputting the keyword upon visiting a final state of the original FA for recognizing RE
  - RE = K1 | K2 | K3 | ... | Kn (the regular expression for all specified keywords)
  - . : any character that is not one of the first characters of K1 ~ Kn
  - .* : unspecified patterns (or unknown keywords)

R.E. and Pattern Matching
- Constructing FA1 for recognizing RE = K1 | K2 | ... | Kn
  - equivalent to merging prefixes of the keywords to avoid redundant forward matching => TRIE lexicon tree = DFA for RE
- Constructing FA2 for recognizing PM = (.* RE)*
  - extending FA1 by (1) including unknown keywords and (2) introducing epsilon-moves from the original final states to the original initial states
  - on matching failure, redundant backward matching can be avoided if the substring preceding the current input pointer is the prefix of another keyword
  - failure function: the state (in the TRIE) to back off to on failure (!= initial state if the above-mentioned substring exists and is non-null)
  - epsilon-moves & failure function make FA2 an NFA, whose DFA counterpart can be simulated by backtracking

R.E. and Fast Methods for Pattern Matching
Fast Single Keyword Matching [KMP - Knuth, Morris & Pratt 1977]
- Reference: [Aho et al. 1986, Ex. 3.26-3.27]
- keyword => state_transition_table
- reduce repeated matching suggested by the keyword pattern
- failure function: where to back off on failure

R.E. and Fast Methods for Pattern Matching
Fast Multiple Keyword Matching [AC, Cherry 1982]
- Reference: [Aho, Ex. 3.31-32]
- keywords => TRIE (state_transition_table)
- reduce repeated matching suggested by the TRIE of the keywords
- TRIE failure function
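A trie with a failure function can be sketched in C as below, in the style of the Aho-Corasick construction. This is an illustration under stated assumptions rather than code from the cited references: keywords use lowercase letters only, the trie has at most `MAX_STATES` states, any other character resets the automaton to the initial state, and the names `ac_build` and `ac_count` are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_STATES 64   /* assumed upper bound on trie size for this sketch */
#define SIGMA 26        /* lowercase letters only */

static int go[MAX_STATES][SIGMA];   /* trie edges, -1 = no edge */
static int fail_state[MAX_STATES];  /* state to back off to on failure */
static int out_count[MAX_STATES];   /* keywords recognized on entering a state */
static int nstates;

/* Build the trie, then fill in the failure function breadth-first. */
static void ac_build(const char *const *keys, size_t nkeys) {
    int queue[MAX_STATES], head = 0, tail = 0;
    memset(go, -1, sizeof go);
    memset(fail_state, 0, sizeof fail_state);
    memset(out_count, 0, sizeof out_count);
    nstates = 1;                                /* state 0 is the root */
    for (size_t k = 0; k < nkeys; k++) {        /* merge keyword prefixes */
        int s = 0;
        for (const char *p = keys[k]; *p != '\0'; p++) {
            int a = *p - 'a';
            assert(a >= 0 && a < SIGMA);
            if (go[s][a] < 0) {
                assert(nstates < MAX_STATES);
                go[s][a] = nstates++;
            }
            s = go[s][a];
        }
        out_count[s]++;
    }
    for (int a = 0; a < SIGMA; a++)             /* depth-1 states fail to root */
        if (go[0][a] > 0) queue[tail++] = go[0][a];
    while (head < tail) {
        int r = queue[head++];
        for (int a = 0; a < SIGMA; a++) {
            int u = go[r][a], t;
            if (u < 0) continue;
            t = fail_state[r];
            while (t > 0 && go[t][a] < 0) t = fail_state[t];
            if (go[t][a] >= 0) t = go[t][a];
            fail_state[u] = t;
            out_count[u] += out_count[t];       /* inherit shorter matches */
            queue[tail++] = u;
        }
    }
}

/* Count all keyword occurrences in text in a single left-to-right pass. */
static size_t ac_count(const char *text) {
    size_t total = 0;
    int s = 0;
    for (; *text != '\0'; text++) {
        int a = *text - 'a';
        if (a < 0 || a >= SIGMA) { s = 0; continue; }  /* outside alphabet */
        while (s > 0 && go[s][a] < 0) s = fail_state[s];
        if (go[s][a] >= 0) s = go[s][a];
        total += (size_t)out_count[s];
    }
    return total;
}
```

After building the automaton for `he`, `she`, `his` and `hers`, scanning `ushers` finds `she`, `he` and `hers` without ever moving the input pointer backward.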

R.E. and Fast Methods for Pattern Matching
- Boyer & Moore [1977]
- Harrison [1971]: Hashing Method

KMP: Failure Function
[Figure: start -> 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -a-> 5 -a-> 6 (keyword: ababaa)]
s     0 1 2 3 4 5 6
f(s)  0 0 0 1 2 3 1
- If failed at state 5 on x => Input = ababax (input pointer => x)
- Need to re-try babax, abax, bax, ax from state 0
  - babax: fails again (does not start with prefix ab)
  - abax: success until state 3, pointing at x
- Look back from s5 & see the longest match (s3) to a prefix
- Choose the longest one so we can re-try the least
- Do you need to go back and try all these? No. Simply set s := 3 and keep the input pointer at x
- State 3 is the failure state of state 5

KMP: Re-Matching on Failure
[Figure: start -> 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -a-> 5 -a-> 6 (keyword: ababaa)]
s     0 1 2 3 4 5 6
f(s)  0 0 0 1 2 3 1
- If failed at state 5 on x => (5,x) = fail (ababax does not match a prefix)
- f(5)=3 => (3,x)=??, if fail (abax unmatched)
- f(3)=1 => (1,x)=??, if fail (ax unmatched)
- f(1)=0 => (0,x)=??, try x from the initial state (since no partial match in the failed prefixes is observed)
- If (.,x) is a legal transition, just go ahead to (.,x)
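The re-matching chain can be sketched in C using the keyword `ababaa` and the failure table shown above. The function name `kmp_find` and the array names are assumptions of this sketch; state s means that the first s characters of the keyword have been matched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Keyword and failure function from the table: fail_tab[s] is the state to
   fall back to when state s has no move on the current input character. */
static const char kw[] = "ababaa";
static const int fail_tab[] = { 0, 0, 0, 1, 2, 3, 1 };

/* Returns the offset of the first occurrence of kw in text, or -1. */
static long kmp_find(const char *text) {
    int s = 0;                           /* current state */
    const int m = (int)strlen(kw);
    assert(text != NULL);
    for (size_t i = 0; text[i] != '\0'; i++) {
        /* No move from s on text[i]: back off via the failure function
           while keeping the input pointer where it is. */
        while (s > 0 && text[i] != kw[s])
            s = fail_tab[s];
        if (text[i] == kw[s])
            s++;                         /* legal transition: go ahead */
        if (s == m)
            return (long)(i + 1) - m;    /* reached the final state */
    }
    return -1;
}
```

On the input `ababax` the inner loop visits states 5, 3, 1 and 0 for the character `x`, which is the chain listed on the slide, and the input pointer never moves backward.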

KMP
[Figure: start -> 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -a-> 5 -a-> 6 (keyword: ababaa)]
- Recursively compute f(s) based on f(.) of previous states
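One common way to carry out this computation is sketched below in C. The function name `failure_function` is an assumption; the state numbering follows the slides, so f has one entry per state 0..m for a keyword of length m.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fills f[0..m] for a keyword of length m, where state s means "the first s
   characters of the keyword have been matched". f must have m + 1 entries. */
static void failure_function(const char *kw, int *f) {
    size_t m;
    int t = 0;                  /* failure value of the state just processed */
    assert(kw != NULL && f != NULL);
    m = strlen(kw);
    f[0] = 0;
    if (m == 0) return;
    f[1] = 0;
    for (size_t s = 1; s < m; s++) {
        /* Try to extend the previous fallback by kw[s]; if that fails, fall
           back further using entries of f that are already computed. */
        while (t > 0 && kw[s] != kw[t])
            t = f[t];
        if (kw[s] == kw[t])
            t++;
        f[s + 1] = t;
    }
}
```

For the keyword `ababaa` this yields f = 0 0 0 1 2 3 1 for states 0 through 6, which agrees with the table on the earlier slides.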