CSc 453: Compilers and Systems Software
Lexical Analysis II
Department of Computer Science, University of Arizona
collberg@gmail.com
Copyright (c) 2009 Christian Collberg

Implementing Automata

NFAs and DFAs can be hard-coded using this pattern:

    state := start state
    c := first char
    while (true) {
        case state of {
            1: case c of {
                   char1 : { c := nextChar(); state := new state; }
                   ...
               }
            2: case c of {
                   char2 : { c := nextChar(); state := new state; }
                   char3 : { return; /* accept */ }
               }
        }
    }

We can also encode the transitions directly into a transition table. The table has one row per state, one "next state" column for each of char1, char2, and other, and a final column that marks accepting states.

[Table: state | char1 | char2 | other | accepting. The entries were lost in extraction.]

States in brackets don't consume their inputs. Accepting states are marked in the accepting column. Empty entries represent error states.

Given the table, we can write an interpreter to perform lexical analysis of any DFA:

    state := start state
    c := first char
    while not ACCEPT[state] do {
        newstate := NEXTSTATE[state,c]
        if ADVANCE[state,c] then
            c := nextChar()
        state := newstate
    }
    if ACCEPT[state] then accept;
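To make the bracketed (non-consuming) entries concrete, here is a minimal Java sketch of the table interpreter applied to a small identifier DFA. The DFA, the class name IdentScanner, and the helper classify are illustrative assumptions, not part of the slides. The one bracketed entry is ADVANCE[1][OTHER] = false: the character that ends the identifier moves the DFA to its accepting state but is left in the input.

```java
final class IdentScanner {
    // Character classes (the columns of the tables).
    static final int LETTER = 0, DIGIT = 1, OTHER = 2;

    // States: 0 = start, 1 = inside an identifier, 2 = accepting; -1 = error.
    static final int[][] NEXTSTATE = {
        {  1, -1, -1 },   // state 0: an identifier must start with a letter
        {  1,  1,  2 },   // state 1: anything else ends the identifier
        { -1, -1, -1 }    // state 2: accepting, never left
    };

    // ADVANCE[state][class] == false marks a bracketed table entry:
    // the transition is taken but the input character is not consumed.
    static final boolean[][] ADVANCE = {
        { true, true, true  },
        { true, true, false },
        { true, true, true  }
    };

    static final boolean[] ACCEPT = { false, false, true };

    // Maps the character at pos to its class; end of input counts as OTHER.
    static int classify(String input, int pos) {
        if (pos >= input.length()) return OTHER;
        char ch = input.charAt(pos);
        if (Character.isLetter(ch)) return LETTER;
        if (Character.isDigit(ch)) return DIGIT;
        return OTHER;
    }

    // Returns the length of the identifier at the start of input, or -1 if there is none.
    static int scan(String input) {
        int state = 0;
        int pos = 0;
        int c = classify(input, pos);
        while (state >= 0 && !ACCEPT[state]) {
            int newState = NEXTSTATE[state][c];
            if (ADVANCE[state][c]) {
                pos++;
                c = classify(input, pos);
            }
            state = newState;
        }
        return state >= 0 ? pos : -1;
    }
}
```

Calling IdentScanner.scan("x1+y") stops with pos pointing at the '+', so a following scanner call can start from that character.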
Table-driven Comments

[Figure: DFA with states 0 to 4 that recognizes C-style comments. Edge labels: "/", "*", "all chars except *", and "all chars except *,/".]

    state   /    *    other   accepting
      0     1
      1          2
      2     2    3      2
      3     4    3      2
      4                          yes

    class Comments {
        public static final int SLASH = 0;
        public static final int STAR  = 1;
        public static final int OTHER = 2;
        public static final int END   = 3;

        static int[][] NEXTSTATE = {
            // "/"  "*"  other
            {  1,  -1,  -1 },
            { -1,   2,  -1 },
            {  2,   3,   2 },
            {  4,   3,   2 },
            { -1,  -1,  -1 } };

        static boolean[] ACCEPT = { false, false, false, false, true };

        static boolean[][] ADVANCE = {
            // "/"   "*"   other
            { true, true, true },
            { true, true, true },
            { true, true, true },
            { true, true, true },
            { true, true, true } };

        static String input;
        static int current = -1;

        static int nextChar() {
            int ch;
            current++;
            if (current >= input.length())
                return END;
            switch (input.charAt(current)) {
                case '/': { ch = SLASH; break; }
                case '*': { ch = STAR;  break; }
                default : { ch = OTHER; break; }
            }
            return ch;
        }
        public static boolean interpret() {
            int state = 0;
            int c = nextChar();
            while ((c != END) && (state >= 0) && !ACCEPT[state]) {
                int newstate = NEXTSTATE[state][c];
                if (ADVANCE[state][c])
                    c = nextChar();
                state = newstate;
            }
            return (state >= 0) && ACCEPT[state];
        }

        public static void main(String[] args) {
            input = args[0];
            boolean result = interpret();
        }
    }

Hard-coded Comments

[Figure: the same comment DFA as before, with states 0 to 4.]

Let's do the same thing again, but this time we will hard-code the interpreter using switch-statements. nextChar and the constant declarations are the same as for the previous program.

    class Comments {
        // Declarations of SLASH, STAR, OTHER, END, and nextChar().

        public static boolean interpret() {
            int state = 0;
            int ch = nextChar();
            while (true) {
                switch (state) {
                    case -1: return false;
                    case 0: switch (ch) {
                                case SLASH: ch = nextChar(); state = 1; break;
                                default   : return false;
                            }
                            break;
                    case 1: switch (ch) {
                                case STAR: ch = nextChar(); state = 2; break;
                                default  : return false;
                            }
                            break;
                    case 2: switch (ch) {
                                case SLASH: ch = nextChar(); state = 2; break;
                                case STAR : ch = nextChar(); state = 3; break;
                                case OTHER: ch = nextChar(); state = 2; break;
                                default   : return false;
                            }
                            break;
                    case 3: switch (ch) {
                                case SLASH: ch = nextChar(); state = 4; break;
                                case STAR : ch = nextChar(); state = 3; break;
                                case OTHER: ch = nextChar(); state = 2; break;
                                default   : return false;
                            }
                            break;
                    case 4: return (ch == END);
                }
            }
        }
    }

From REs to NFAs

We will describe our tokens using REs, convert these to an NFA, convert this to a DFA, and finally code this into a program or a table to be interpreted:

    RE -> NFA -> DFA -> program, or table + interpreter

We will next show how to construct an NFA from a regular expression. This algorithm is called Thompson's Construction (after Ken Thompson of Bell Labs).

Thompson's Construction

Each piece of a regular expression is turned into a part of an NFA. Each part is glued together (using ε-transitions) into a complete automaton.

An RE matching the character a translates into

[Figure: a start state, one edge labeled a, and an accepting state.]

An RE matching ε translates into

[Figure: a start state, one edge labeled ε, and an accepting state.]
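Thompson's Construction: Concatenation and Alternation

We represent an RE component r by the figure:

[Figure: the component r, drawn with the start state for r and the accepting state for r.]

The regular expression r|s translates into

[Figure: NFA for r|s, built from the components r and s.]

An RE matching the regular expression r followed by the regular expression s (rs) translates into

[Figure: NFA for rs, built from the components r and s.]

Thompson's Construction: Repetition

The regular expression r* translates into

[Figure: NFA for r*, built from the component r.]

Thompson's Construction: Example I

[Figure: a worked example. The regular expression and its NFA did not survive extraction.]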
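The construction rules above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the slides' own code: the class Thompson, the Frag type, and the choice to join r and s with a single ε edge in concat are all hypothetical (the lost figures may glue the parts together differently). The matches method simulates the NFA directly so that the result can be checked.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class Thompson {
    static final int EPS = -1;                    // label used for epsilon edges (assumption)
    final List<int[]> edges = new ArrayList<>();  // each edge is {from, label, to}
    int stateCount = 0;

    /** An NFA fragment with exactly one start state and one accepting state. */
    static final class Frag {
        final int start, accept;
        Frag(int start, int accept) { this.start = start; this.accept = accept; }
    }

    private Frag fresh() { int s = stateCount++; int a = stateCount++; return new Frag(s, a); }
    private void edge(int from, int label, int to) { edges.add(new int[] { from, label, to }); }

    /** RE matching a single character c. */
    Frag symbol(char c) { Frag f = fresh(); edge(f.start, c, f.accept); return f; }

    /** RE rs: glue r's accepting state to s's start state with an epsilon edge. */
    Frag concat(Frag r, Frag s) { edge(r.accept, EPS, s.start); return new Frag(r.start, s.accept); }

    /** RE r|s: new start and accepting states, connected to both parts by epsilon edges. */
    Frag alt(Frag r, Frag s) {
        Frag f = fresh();
        edge(f.start, EPS, r.start);
        edge(f.start, EPS, s.start);
        edge(r.accept, EPS, f.accept);
        edge(s.accept, EPS, f.accept);
        return f;
    }

    /** RE r*: allow skipping r entirely and looping back after each pass through r. */
    Frag star(Frag r) {
        Frag f = fresh();
        edge(f.start, EPS, r.start);
        edge(f.start, EPS, f.accept);
        edge(r.accept, EPS, r.start);
        edge(r.accept, EPS, f.accept);
        return f;
    }

    /** All states reachable from the given states on epsilon edges alone. */
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> stack = new ArrayDeque<>(states);
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int[] e : edges)
                if (e[0] == t && e[1] == EPS && result.add(e[2]))
                    stack.push(e[2]);
        }
        return result;
    }

    /** Simulates the NFA on input by tracking the set of states it could be in. */
    boolean matches(Frag nfa, String input) {
        Set<Integer> current = closure(Collections.singleton(nfa.start));
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int[] e : edges)
                if (e[1] == c && current.contains(e[0])) next.add(e[2]);
            current = closure(next);
        }
        return current.contains(nfa.accept);
    }
}
```

As a usage example, letting 'l' stand for any letter and 'd' for any digit, the expression letter(letter|digit)* is built with t.concat(t.symbol('l'), t.star(t.alt(t.symbol('l'), t.symbol('d')))) on a Thompson instance t. The resulting fragment accepts "lld" and rejects "dl".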
Thompson's Construction: Example II

The regular expression letter(letter|digit)* translates into

[Figure: NFA with one edge labeled letter leading into a loop that contains one edge labeled letter and one edge labeled digit.]

From NFA to DFA

    RE -> NFA -> DFA -> program, or table + interpreter

We now know how to translate a regular expression into an NFA, and how to translate a DFA into code. The missing piece is how to translate an NFA into a DFA.

Each state in the DFA corresponds to a set of states in the NFA. The DFA will be in the state {s1, s2, s3} if the NFA could have been in any of the states s1, s2, s3.

After reading the input a1 a2 ... an, the DFA is in the state that represents the set of states the NFA could be in after seeing a1 a2 ... an.
From NFA to DFA...

[Figure: an NFA next to its DFA, each DFA state labeled with a set of NFA states. The state names were lost in extraction.]

The DFA's start state represents a set of NFA states: the states the NFA could be in before any input is consumed (the start states). A second DFA state represents the set of NFA states we can get to on one input symbol from the start state.

We need three functions:

ε-closure(T) is the set of NFA states reachable from some NFA state s in T on ε-transitions alone. This is essentially a graph exploration algorithm that finds the nodes in a graph reachable from a given node.

move(T,a) is the set of NFA states to which there is a transition on input symbol a from some NFA state s in T.

SubsetConstruction(N) returns a DFA D=(Dstates,Dtrans) corresponding to NFA N.

ε-closure(T)

    procedure ε-closure(T)
        push all states in T onto stack
        C := T
        while stack is not empty do
            t := pop(stack)
            for each edge t --ε--> u do
                if u is not in C then
                    C := C ∪ {u}
                    push(stack, u)
        return C

ε-closure(T) Example

[Figure: an example NFA with four ε-closure computations, three on single states and one on a set of two states. The state numbers were lost in extraction.]
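The two helper functions can be sketched in Java as below. The edge-list representation, the class name NfaOps, and the sample NFA in EXAMPLE are assumptions made for illustration. EXAMPLE is a hypothetical NFA for a|b, not the NFA from the slides' lost figure.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

final class NfaOps {
    static final int EPS = -1;   // label used for epsilon edges (assumption)

    // Hypothetical NFA for a|b; each edge is {from, label, to}.
    static final int[][] EXAMPLE = {
        { 1, EPS, 2 }, { 1, EPS, 4 },
        { 2, 'a', 3 }, { 4, 'b', 5 },
        { 3, EPS, 6 }, { 5, EPS, 6 }
    };

    /** Stack-based epsilon-closure, following the pseudocode step by step. */
    static Set<Integer> epsilonClosure(int[][] edges, Set<Integer> T) {
        Deque<Integer> stack = new ArrayDeque<>(T);   // push all states in T
        Set<Integer> C = new TreeSet<>(T);            // C := T
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int[] e : edges) {
                // for each edge t --eps--> u where u is not yet in C
                if (e[0] == t && e[1] == EPS && !C.contains(e[2])) {
                    C.add(e[2]);
                    stack.push(e[2]);
                }
            }
        }
        return C;
    }

    /** States reachable from some state in T by one edge labeled a. */
    static Set<Integer> move(int[][] edges, Set<Integer> T, char a) {
        Set<Integer> result = new TreeSet<>();
        for (int[] e : edges)
            if (e[1] == a && T.contains(e[0])) result.add(e[2]);
        return result;
    }
}
```

On the sample NFA, the ε-closure of {1} is {1, 2, 4}, move of that set on 'a' is {3}, and the ε-closure of {3} is {3, 6}.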
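move(T,a) Example

[Figure: the example NFA with two move computations, one on a set of two states and one on a set of three states. The state numbers and symbols were lost in extraction.]

SubsetConstruction(N)

    procedure SubsetConstruction(NFA N)
        Dstates := {ε-closure(s0)}
        Dtrans := {}
        repeat
            T := an unexplored state in Dstates
            for each input symbol a do
                U := ε-closure(move(T,a))
                if U is not in Dstates then
                    Dstates := Dstates ∪ {U}
                Dtrans := Dtrans ∪ {(T --a--> U)}
        until all states have been explored
        return (Dstates,Dtrans)

SubsetConstruction(N) Example

[Figure: the example NFA N next to the DFA under construction. Legend: start state, unexplored state, new DFA state.]

The ε-closure of the NFA's start state will be the DFA's start state.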
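The procedure above can be sketched in Java as follows. The class name SubsetConstruction, the edge-list NFA representation, and the sample NFA in the usage note are assumptions for illustration only. One deliberate deviation from the pseudocode is labeled in the code: an empty set U is treated as the error state and left out of the DFA instead of being added as a dead state.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SubsetConstruction {
    static final int EPS = -1;   // label used for epsilon edges (assumption)

    /** States reachable from T on epsilon edges alone; each edge is {from, label, to}. */
    static Set<Integer> closure(int[][] edges, Set<Integer> T) {
        Deque<Integer> stack = new ArrayDeque<>(T);
        Set<Integer> C = new TreeSet<>(T);
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int[] e : edges)
                if (e[0] == t && e[1] == EPS && C.add(e[2]))
                    stack.push(e[2]);
        }
        return C;
    }

    /** States reachable from some state in T by one edge labeled a. */
    static Set<Integer> move(int[][] edges, Set<Integer> T, char a) {
        Set<Integer> result = new TreeSet<>();
        for (int[] e : edges)
            if (e[1] == a && T.contains(e[0])) result.add(e[2]);
        return result;
    }

    /**
     * Returns Dtrans as a map: DFA state (a set of NFA states) -> symbol -> DFA state.
     * The key set of the map is Dstates; the first key is the DFA's start state.
     */
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(
            int[][] edges, int s0, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dtrans = new LinkedHashMap<>();
        Deque<Set<Integer>> unexplored = new ArrayDeque<>();
        Set<Integer> start = closure(edges, Collections.singleton(s0));
        dtrans.put(start, new TreeMap<Character, Set<Integer>>());
        unexplored.add(start);
        while (!unexplored.isEmpty()) {
            Set<Integer> T = unexplored.remove();
            for (char a : alphabet) {
                Set<Integer> U = closure(edges, move(edges, T, a));
                if (U.isEmpty()) continue;   // deviation: no dead state is created
                if (!dtrans.containsKey(U)) {
                    dtrans.put(U, new TreeMap<Character, Set<Integer>>());
                    unexplored.add(U);
                }
                dtrans.get(T).put(a, U);     // add the transition T --a--> U
            }
        }
        return dtrans;
    }
}
```

For a hypothetical NFA for a|b with edges 1 --ε--> 2, 1 --ε--> 4, 2 --a--> 3, 4 --b--> 5, 3 --ε--> 6, and 5 --ε--> 6, build produces three DFA states: {1, 2, 4} (the start state), {3, 6} reached on a, and {5, 6} reached on b.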
Example...

The next four steps all have the same shape. Take a DFA state T and an input symbol a, expand T into its set of NFA states, and compute

    ε-closure(move(T, a))

The resulting set of NFA states is a DFA state U (a new one if it has not been seen before). We add the transition T --a--> U.

[Figures: the NFA and the growing DFA after each of the four steps. The state numbers and input symbols were lost in extraction.]
Example, Take 2

A slightly different approach is to generate the power-set of the set of NFA states, and then add all the edges we get from ε-closure().

[Figures: every subset of the NFA's states drawn as a node, with one edge added per step. The state numbers and input symbols were lost in extraction.]

On ε we can go from the NFA's start state to a set of states, which becomes our start state.

From each set of states reached so far, we add an edge on each input symbol to the set of states we can go to on that symbol. The slides do this in four steps.

Finally, removing unreachable states gives us our DFA.

Keywords
Keywords revisited

For a language with many keywords (Ada-95 has 98, COBOL has hundreds), the transition table can be large.

We can remove all keywords from the transition table and instead analyze them as IDENTs. When an IDENT is found we look it up in a special table to see if it is, in fact, a reserved word.

We can use a regular hash-table, of course, but if we are concerned about speed we can use a minimal perfect hash-table. This is a static table and related lookup routines that have been optimized for a particular static set of words.

For example, we could build this perfect hash-table for the words LUA, MODULA-2, OBERON:

    0   LUA
    1   MODULA-2
    2
    3   OBERON

    int hash(String s) { return s[0] - 'L'; }
    boolean member(String s) { return table[hash(s)] == s; }

In this case we use the first character of the string as the hash-value. This is not a minimal table: there's one wasted entry.

Using Unix gperf

gperf (http://www.gnu.org/manual/gperf-2.7) is a Unix program that takes a list of keywords as input and returns a perfect hash-table (and related search routines) as output.

From the gperf manual:

    The perfect hash function generator gperf reads a set of "keywords" from a keyfile. It attempts to derive a perfect hashing function that recognizes a member of the static keyword set with at most a single probe into the lookup table. If gperf succeeds in generating such a function it produces a pair of C source code routines that perform hashing and table lookup recognition.

The following command

    > echo "BEGIN\nEND" | gperf -L ANSI-C

generates the program below.

    /* ANSI-C code produced by gperf version 2.7 */
    #define TOTAL_KEYWORDS 2
    #define MIN_WORD_LENGTH 3
    #define MAX_WORD_LENGTH 5
    #define MIN_HASH_VALUE 3
    #define MAX_HASH_VALUE 5
    static unsigned int hash (
        register const char *str,
        register unsigned int len) {
        static unsigned char asso_values[] = {
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 0, 6, 0, 0,
            <--- Lots more stuff like this --->
        };
        return len +
               asso_values[(unsigned char)str[len - 1]] +
               asso_values[(unsigned char)str[0]];
    }

    const char * in_word_set (
        register const char *str,
        register unsigned int len) {
        static const char * wordlist[] = {
            "", "", "", "END", "", "BEGIN" };
        if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
            register int key = hash (str, len);
            if (key <= MAX_HASH_VALUE && key >= 0) {
                register const char *s = wordlist[key];
                if (*str == *s && !strcmp (str + 1, s + 1))
                    return s;
            }
        }
        return 0;
    }

In this particular case, the hash function only looks at the first and last characters of the string, as well as the string length.

Summary

The problem with table-driven methods is that the tables can easily get huge. Much work has gone into constructing table-compression algorithms, and data structures for sparse tables. See the Dragon book for details.

There are also many algorithms for minimizing the number of states in a DFA. See Louden, pp. 7 7.
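Returning to the hand-built keyword table for LUA, MODULA-2, and OBERON from the Keywords revisited slide: its two one-line routines are pseudocode, so here is a minimal runnable Java sketch of the same idea. The class name Keywords and the bounds and null checks are additions for illustration. The slide's version assumes the input always hashes into the table.

```java
final class Keywords {
    // Slot 2 is the wasted entry: no keyword in this set starts with 'N'.
    static final String[] TABLE = { "LUA", "MODULA-2", null, "OBERON" };

    // The hash value is the first character's offset from 'L'.
    static int hash(String s) {
        return s.charAt(0) - 'L';
    }

    // One probe into the table decides membership.
    static boolean member(String s) {
        if (s.isEmpty()) return false;
        int h = hash(s);
        return h >= 0 && h < TABLE.length && s.equals(TABLE[h]);
    }
}
```

A scanner would call Keywords.member on every IDENT it finds. Here member("OBERON") is true, while member("LISP") is false even though it hashes to the same slot as LUA, because the probe still compares the full string.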
Readings and References

Read Louden, pp. 80. Or, read the Dragon book, pp. 8 0.

An interview with Ken Thompson: http://www.computer.org/computer/thompson.htm.

His Turing award lecture (Reflections on Trusting Trust): http://www.acm.org/classics/sep95/. The next slide shows how you insert a Trojan Horse in the compiler.

Reflections on Trusting Trust

    compile (String S)
        if (we're compiling "login.c")
            GENERATE_CODE(
                if (user=="collberg" && passwd=="d. Troi")
                    login_ok = true
            )
        if (we're compiling "gcc.c")
            GENERATE_CODE(
                if (we're compiling "login.c")
                    GENERATE_CODE(
                        if (user=="collberg" && passwd=="d. Troi")
                            login_ok = true
                    )
            )