CS412/413 Introduction to Compilers Tim Teitelum Lecture 4: Lexicl Anlyzers 28 Jn 08
Outline DFA stte minimiztion Lexicl nlyzers Automting lexicl nlysis Jlex lexicl nlyzer genertor CS 412/413 Spring 2008 Introduction to Compilers 2
Finite Automt Finite utomt: Sttes, trnsitions etween sttes Initil stte, set of finl sttes DFA: Deterministic Finite Automton Ech trnsition consumes n input chrcter Ech trnsition is uniquely determined y the input chrcter NFA: Non-deterministic Finite Automton ε-trnsitions, which do not consume input chrcters Multiple trnsitions from the sme stte on the sme input chrcter CS 412/413 Spring 2008 Introduction to Compilers 3
From RE to DFA Two steps: Convert the regulr expression to n NFA Convert the resulting NFA to DFA The generted DFAs my hve lrge numer of sttes Stte Minimiztion is n optimiztion tht converts DFA to nother DFA tht recognizes the sme lnguge nd hs minimum numer of sttes CS 412/413 Spring 2008 Introduction to Compilers 4
Stte Minimiztion Exmple: DFA1: 0 1 2 3 4 DFA2: 0 1 2 Both DFAs ccept: ** CS 412/413 Spring 2008 Introduction to Compilers 5
Stte Minimiztion Step1. Prtition sttes of originl DFA into mximlsized groups of equivlent sttes S = {G 1,,G n } 1 0 2 3 4 Step 2. Construct the minimized DFA such tht there is stte for ech group G i CS 412/413 Spring 2008 Introduction to Compilers 6
DFA Minimiztion (Equivlence) All sttes in group G i re equivlent iff for ny two sttes p nd q in G i, nd for every symol σ, trnsition(p,σ) nd trnsition(q,σ) re either oth Error, or re sttes in the sme group G j (possily G i itself). c G i G j p G (or Error) k q c r c CS 412/413 Spring 2008 Introduction to Compilers 7
DFA Minimiztion (Equivlence) All sttes in group G i re equivlent iff for ny two sttes p nd q in G i, nd for every symol σ, trnsition(p,σ) nd trnsition(q,σ) re either oth Error, or re sttes in the sme group G j (possily G i itself). c G i G j G (or Error) k CS 412/413 Spring 2008 Introduction to Compilers 8
DFA Minimiztion Step1. Prtition sttes of originl DFA into mximlsized groups of equivlent sttes Step 1. Discrd sttes not rechle from strt stte Step 1. Initil prtition is S = {Finl, Non-finl} Step 1c. Repetedly refine the prtition {G 1,,G n } while some group G i contins sttes p nd q such tht for some symol σ, trnsitions from p nd q on σ re to different groups G j G i G (or Error) k p j k q CS 412/413 Spring 2008 Introduction to Compilers 9
DFA Minimiztion Step1. Prtition sttes of originl DFA into mximlsized groups of equivlent sttes Step 1. Discrd sttes not rechle from strt stte Step 1. Initil prtition is S = {Finl, Non-finl} Step 1c. Repetedly refine the prtition {G 1,,G n } while some group G i contins sttes p nd q such tht for some symol σ, trnsitions from p nd q on σ re to different groups G i G j p G (or Error) k j k q G i CS 412/413 Spring 2008 Introduction to Compilers 10
Optimized Acceptor Regulr Expression R RE NFA NFA DFA Minimize DFA Input String w DFA Simultion Yes, if w L(R) No, if w L(R) CS 412/413 Spring 2008 Introduction to Compilers 11
Lexicl Anlyzers vs Acceptors Lexicl nlyzers use the sme mechnism, ut they: Hve multiple RE descriptions for multiple tokens Output sequence of mtching tokens (or n error) Alwys return the longest mtching token For multiple longest mtching tokens, use rule priorities CS 412/413 Spring 2008 Introduction to Compilers 12
Lexicl Anlyzers REs for Tokens R 1 R n RE NFA NFA DFA Minimize DFA Chrcter Strem progrm DFA Simultion Token strem (nd errors) CS 412/413 Spring 2008 Introduction to Compilers 13
Hndling Multiple REs Construct one NFA for ech RE Associte the finl stte of ech NFA with the given RE Comine NFAs for ll REs into one NFA Convert NFA to minimized DFA, ssociting ech finl DFA stte with the highest priority RE of the corresponding NFA sttes NFAs keywords Minimized DFA ε ε ε ε whitespce identifier numer CS 412/413 Spring 2008 Introduction to Compilers 14
Scnning Algorithm Scn input nd simulte DFA until no further trnsition is possile keeping trck of most recently visited finl stte F Roll input ck to position t the time F ws entered Emit token ssocited with F For ech successive token, scn remining input nd simulte DFA from the strt stte, i.e., scnner is stteless (NB. this is to e chnged elow.) CS 412/413 Spring 2008 Introduction to Compilers 15
Exmple of Roll Bck Consider three REs: { ] nd input: 0 1 2 3 4 5 6 Rech stte 3 with no trnsition on next chrcter Roll input ck to position on entering stte 2 (i.e., hving red ) Emit token for On next cll to scnner, strt in stte 0 gin with input CS 412/413 Spring 2008 Introduction to Compilers 16
Automting Lexicl Anlysis All of the lexicl nlysis process cn e utomted RE NFA DFA Minimized DFA Minimized DFA Lexicl Anlyzer (DFA Simultion Progrm) We only need to specify: Regulr expressions for the tokens Rule priorities for multiple longest mtch cses CS 412/413 Spring 2008 Introduction to Compilers 17
Lexicl Anlyzer Genertors REs for Tokens lex.l Jlex Compiler lex.jv jvc Compiler Chrcter Strem progrm lex.clss Token strem (nd errors) CS 412/413 Spring 2008 Introduction to Compilers 18
Jlex Specifiction File Jlex = Lexicl nlyzer genertor written in Jv genertes Jv lexicl nlyzer Hs three prts: Premle, which contins pckge/import declrtions Definitions, which contins regulr expression revitions Regulr expressions nd ctions, which contins: the list of regulr expressions for ll the tokens Corresponding ctions for ech token (Jv code to e executed when the token is recognized) CS 412/413 Spring 2008 Introduction to Compilers 19
Exmple Specifiction File Pckge Prse; Import Error.LexiclError; %% digits = 0 [1-9][0-9]* letter = [A-Z-z] identifier = {letter}({letter} [0-9_])* whitespce = [\ \t\n\r]+ %% {whitespce} {/* discrd */} {digits} { return new Token(INT, Integer.vlueOf(yytext()); } if { return new Token(IF, null); } while { return new Token(WHILE, null); } {identifier} { return new Token(ID, yytext()); }. { ErrorMsg.error( illegl chrcter ); } CS 412/413 Spring 2008 Introduction to Compilers 20
Strt Sttes Mechnism tht specifies stte in which to strt the execution of the DFA Declre sttes in the second section %stte STATE Use sttes s prefixes of regulr expressions in the third section: <STATE> regex {ction} Set current stte in the ctions yyegin(state) There is pre-defined initil stte: YYINITIAL CS 412/413 Spring 2008 Introduction to Compilers 21
Exmple INITIAL STRING. if %% %stte STRING %% <YYINITIAL> if { return new Token(IF, null); } <YYINITIAL> \ { yyegin(string); } <STRING> \ { yyegin(yyinitial); } <STRING>. { } CS 412/413 Spring 2008 Introduction to Compilers 22
Strt Sttes nd REs The use of strt sttes llows the lexer to recognize more thn regulr expressions (or DFAs) Reson: the lexer cn jump cross different sttes in the semntic ctions using yyegin(state) Exmple: nested comments Increment glol vrile on open prentheses nd decrement it on close prentheses When the vrile gets to zero, jump to YYINITIAL The glol vrile essentilly models n infinite numer of sttes! CS 412/413 Spring 2008 Introduction to Compilers 23
Conclusion Regulr expressions: concise wy of specifying tokens Cn convert RE to NFA, then to DFA, then to minimized DFA Use the minimized DFA to recognize tokens in the input strem Automte the process using lexicl nlyzer genertors Write regulr expression descriptions of tokens Automticlly get lexicl nlyzer progrm which identifies tokens from n input strem of chrcters CS 412/413 Spring 2008 Introduction to Compilers 24