Implementing Automata. CSc 453. Compilers and Systems Software. 4 : Lexical Analysis II. Department of Computer Science University of Arizona


collberg@gmail.com
Copyright © 2009 Christian Collberg

Implementing Automata

NFAs and DFAs can be hard-coded using this pattern:

    state := start state
    c := first char
    while (true) {
        case state of {
            ... : case c of {
                      char : { c := nextChar(); state := new state; }
                  }
            ... : case c of {
                      char : { c := nextChar(); state := new state; }
                      char : { return; /* accept */ }
                  }
        }
    }

We can also encode the transitions directly into a transition table. Each row is a state and each entry gives the next state:

    state | char | char | other | accepting

States in brackets ([ ]) don't consume their inputs. Accepting states are marked in the accepting column. Empty entries represent error states.

Given the table, we can write an interpreter to perform lexical analysis of any DFA:

    state := start state
    c := first char
    while not ACCEPT[state] do {
        newstate := NEXTSTATE[state, c]
        if ADVANCE[state, c] then
            c := nextChar()
        state := newstate
    }
    if ACCEPT[state] then accept;
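As a rough illustration of the bracketed, non-consuming entries, here is a minimal Java sketch of a table-driven scanner for identifiers of the form letter(letter|digit)*. The class name IdentScanner, the three character classes, and the state numbering are assumptions made for this example and do not come from the slides. The entry for state 1 on OTHER is the bracketed one: the scanner moves to the accepting state without consuming the delimiter.

```java
// Hypothetical example: table-driven scanner for letter(letter|digit)*.
class IdentScanner {
    static final int LETTER = 0, DIGIT = 1, OTHER = 2; // character classes
    static final int ERROR = -1;                        // stands for an empty table entry

    // NEXTSTATE[state][class]; state 2 is the only accepting state.
    static final int[][] NEXTSTATE = {
        { 1,     ERROR, ERROR },  // state 0: start, needs a letter
        { 1,     1,     2     },  // state 1: inside an identifier
        { ERROR, ERROR, ERROR }   // state 2: accept
    };

    // ADVANCE is false exactly where the table entry would be written in brackets.
    static final boolean[][] ADVANCE = {
        { true, true, true  },
        { true, true, false },
        { true, true, true  }
    };

    static final boolean[] ACCEPT = { false, false, true };

    // Maps the character at pos to its class; end of input acts as a delimiter.
    static int classify(String in, int pos) {
        if (pos >= in.length()) return OTHER;
        char ch = in.charAt(pos);
        if (Character.isLetter(ch)) return LETTER;
        if (Character.isDigit(ch)) return DIGIT;
        return OTHER;
    }

    // Returns the length of the identifier at the start of in, or -1 if there is none.
    static int scan(String in) {
        int state = 0, pos = 0;
        while (state != ERROR && !ACCEPT[state]) {
            int c = classify(in, pos);
            if (ADVANCE[state][c]) pos++;
            state = NEXTSTATE[state][c];
        }
        return state == ERROR ? -1 : pos;
    }
}
```

Calling IdentScanner.scan("x1+y") stops in front of the "+", so the next token can start there.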

Table-driven Comments

[Figure: DFA for /* ... */ comments with states 0 to 4. State 0 goes to 1 on /; 1 goes to 2 on *; 2 loops on all chars except * and goes to 3 on *; 3 loops on *, returns to 2 on all chars except * and /, and goes to the accepting state 4 on /.]

    state   /   *   other   accepting
    0       1
    1           2
    2       2   3   2
    3       4   3   2
    4                       yes

    class Comments {
        public static final int SLASH = 0;
        public static final int STAR  = 1;
        public static final int OTHER = 2;
        public static final int END   = 3;

        static int[][] NEXTSTATE = {
            // "/"  "*"  other
            {  1,  -1,  -1},
            { -1,   2,  -1},
            {  2,   3,   2},
            {  4,   3,   2},
            { -1,  -1,  -1}};

        static boolean[] ACCEPT = {false, false, false, false, true};

        static boolean[][] ADVANCE = {
            // "/"  "*"   other
            {true, true, true},
            {true, true, true},
            {true, true, true},
            {true, true, true},
            {true, true, true}};

        static String input;
        static int current = -1;

        static int nextChar() {
            int ch;
            current++;
            if (current >= input.length()) return END;
            switch (input.charAt(current)) {
                case '/' : { ch = SLASH; break; }
                case '*' : { ch = STAR;  break; }
                default  : { ch = OTHER; break; }
            }
            return ch;
        }

        public static boolean interpret() {
            int state = 0;
            int c = nextChar();
            while ((c != END) && (state >= 0) && !ACCEPT[state]) {
                int newstate = NEXTSTATE[state][c];
                if (ADVANCE[state][c]) c = nextChar();
                state = newstate;
            }
            return (state >= 0) && ACCEPT[state];
        }

        public static void main(String[] args) {
            input = args[0];
            boolean result = interpret();
        }
    }

Hard-coded Comments

Let's do the same thing again, but this time we will hard-code the interpreter using switch-statements. nextChar and the constant declarations are the same as for the previous program.

    class Comments {
        // Declarations of SLASH, STAR, OTHER, END, and nextChar().
        public static boolean interpret() {
            int state = 0;
            int ch = nextChar();
            while (true) {
                switch (state) {
                    case -1 : return false;
                    case 0 : switch (ch) {
                                 case SLASH: ch = nextChar(); state = 1; break;
                                 default   : return false;
                             }
                             break;
                    case 1 : switch (ch) {
                                 case STAR: ch = nextChar(); state = 2; break;
                                 default  : return false;
                             }
                             break;
                    case 2 : switch (ch) {
                                 case SLASH: ch = nextChar(); state = 2; break;
                                 case STAR : ch = nextChar(); state = 3; break;
                                 case OTHER: ch = nextChar(); state = 2; break;
                                 default   : return false;
                             }
                             break;
                    case 3 : switch (ch) {
                                 case SLASH: ch = nextChar(); state = 4; break;
                                 case STAR : ch = nextChar(); state = 3; break;
                                 case OTHER: ch = nextChar(); state = 2; break;
                                 default   : return false;
                             }
                             break;
                    case 4 : return (ch == END);
                }
            }
        }
    }

From REs to NFAs

We will describe our tokens using REs, convert these to an NFA, convert this to a DFA, and finally code this into a program or table to be interpreted:

    RE -> NFA -> DFA -> program, or table + interpreter

We will next show how to construct an NFA from a regular expression. This algorithm is called Thompson's Construction (after Ken Thompson of Bell Labs).

Thompson's Construction

Each piece of a regular expression is turned into a part of an NFA. Each part is glued together (using ε-transitions) into a complete automaton.

An RE matching the character a translates into

[Figure: start state, edge labeled a, accepting state.]

An RE matching ε translates into

[Figure: start state, edge labeled ε, accepting state.]

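Thompson's Construction: Alternation

We represent an RE component r by the figure:

[Figure: a box labeled r with the start state for r on the left and the accepting state for r on the right.]

The regular expression r|s translates into

[Figure: NFA for r|s, built from the NFAs for r and s.]

Thompson's Construction: Concatenation

An RE matching the regular expression r followed by the regular expression s (rs) translates into

[Figure: NFA for rs, the NFA for r followed by the NFA for s.]

Thompson's Construction: Repetition

The regular expression r* translates into

[Figure: NFA for r*, built from the NFA for r.]

Thompson's Construction: Example I

[Figure: the NFA built from a sample regular expression.]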
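The construction rules above can be sketched in Java as follows. This is only an illustration: the class name Thompson, the {start, accept} fragment representation, the use of an ε-edge to join the two parts in concatenation, and the sample expression a(b|c)* are assumptions for this example, since the original figures are not available. The matches method simulates the resulting NFA directly so the result can be checked.

```java
import java.util.*;

// Sketch of Thompson's construction; all names here are illustrative.
class Thompson {
    static final char EPS = '\0';   // label used for ε-edges (assumed encoding)

    // edges.get(s) holds {label, target} pairs for the edges leaving state s.
    static final List<List<int[]>> edges = new ArrayList<>();

    static int newState() { edges.add(new ArrayList<>()); return edges.size() - 1; }
    static void edge(int from, char label, int to) { edges.get(from).add(new int[]{label, to}); }

    // Every NFA fragment is an array {start, accept}.
    static int[] symbol(char a) {
        int s = newState(), f = newState();
        edge(s, a, f);
        return new int[]{s, f};
    }

    // rs: glue the accepting state of r to the start state of s with an ε-edge.
    static int[] concat(int[] r, int[] s) {
        edge(r[1], EPS, s[0]);
        return new int[]{r[0], s[1]};
    }

    // r|s: new start and accepting states, connected to both parts by ε-edges.
    static int[] alt(int[] r, int[] s) {
        int st = newState(), f = newState();
        edge(st, EPS, r[0]); edge(st, EPS, s[0]);
        edge(r[1], EPS, f);  edge(s[1], EPS, f);
        return new int[]{st, f};
    }

    // r*: ε-edges allow skipping r entirely or repeating it.
    static int[] star(int[] r) {
        int st = newState(), f = newState();
        edge(st, EPS, r[0]);   edge(st, EPS, f);
        edge(r[1], EPS, r[0]); edge(r[1], EPS, f);
        return new int[]{st, f};
    }

    // All states reachable from t on ε-edges alone.
    static Set<Integer> closure(Set<Integer> t) {
        Deque<Integer> stack = new ArrayDeque<>(t);
        Set<Integer> c = new HashSet<>(t);
        while (!stack.isEmpty())
            for (int[] e : edges.get(stack.pop()))
                if (e[0] == EPS && c.add(e[1])) stack.push(e[1]);
        return c;
    }

    // Simulates the NFA on input by tracking the set of possible states.
    static boolean matches(int[] nfa, String input) {
        Set<Integer> cur = closure(Collections.singleton(nfa[0]));
        for (char ch : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : cur)
                for (int[] e : edges.get(s))
                    if (e[0] != EPS && e[0] == ch) next.add(e[1]);
            cur = closure(next);
        }
        return cur.contains(nfa[1]);
    }

    // Hypothetical sample expression: a(b|c)*
    static int[] example() {
        return concat(symbol('a'), star(alt(symbol('b'), symbol('c'))));
    }
}
```

Each operator only adds states and ε-edges around the fragments it is given, which is why the pieces can be combined in any nesting order.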

Thompson's Construction: Example II

The regular expression letter(letter|digit)* translates into

[Figure: NFA with edges labeled letter, letter, and digit.]

From NFA to DFA

We now know how to translate a regular expression into an NFA, and how to translate a DFA into code. The missing piece is how to translate an NFA into a DFA.

    RE -> NFA -> DFA -> program, or table + interpreter

Each state in the DFA corresponds to a set of states in the NFA. The DFA will be in the state that stands for a given set of NFA states if the NFA could have been in any of the states in that set. After reading a1 a2 ... an the DFA is in a state that represents the states the NFA could be in after seeing the input a1 ... an.

[Figure: an NFA next to the corresponding DFA. The DFA's start state represents the set of NFA states the automaton could be in before any input is consumed (the start states). A second DFA state represents the set of NFA states we can get to on one input symbol from the start state.]

We need three functions:

ε-closure(T) is the set of NFA states reachable from some NFA state s in T on ε-transitions alone. This is essentially a graph exploration algorithm that finds the nodes in a graph reachable from a given node.

move(T,a) is the set of NFA states to which there is a transition on input symbol a from some NFA state s ∈ T.

SubsetConstruction(N) returns a DFA D=(Dstates,Dtrans) corresponding to NFA N.

ε-closure(T)

    procedure ε-closure(T)
        push all states in T onto stack
        C := T
        while stack is not empty do
            t := pop(stack)
            for each edge t -ε-> u do
                if u is not in C then
                    C := C ∪ {u}
                    push(stack, u)
        return C

ε-closure(T) Example

[Figure: a sample NFA with the ε-closure worked out for three single states and for one set of two states.]
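As a minimal sketch, the two helper functions could look like this in Java. The edge-list representation, the EPS constant, and the five-state sample NFA are assumptions made for this example and are not the NFA from the slides.

```java
import java.util.*;

// Sketch of ε-closure(T) and move(T,a); names and encoding are illustrative.
class NfaOps {
    static final int EPS = -1;  // label for ε-edges (assumed encoding)

    // Each edge is {from, label, to}. Mirrors the stack-based pseudocode.
    static Set<Integer> epsClosure(List<int[]> edges, Set<Integer> t) {
        Deque<Integer> stack = new ArrayDeque<>(t);
        Set<Integer> c = new TreeSet<>(t);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int[] e : edges)
                if (e[0] == s && e[1] == EPS && c.add(e[2]))
                    stack.push(e[2]);   // newly discovered state: explore it later
        }
        return c;
    }

    // States reachable from some state in t by one edge labeled a.
    static Set<Integer> move(List<int[]> edges, Set<Integer> t, char a) {
        Set<Integer> result = new TreeSet<>();
        for (int[] e : edges)
            if (e[1] == a && t.contains(e[0])) result.add(e[2]);
        return result;
    }

    // Hypothetical NFA: 1 -ε-> 2, 1 -ε-> 4, 2 -a-> 3, 4 -b-> 5, 3 -ε-> 5
    static List<int[]> sample() {
        return Arrays.asList(
            new int[]{1, EPS, 2}, new int[]{1, EPS, 4},
            new int[]{2, 'a', 3}, new int[]{4, 'b', 5},
            new int[]{3, EPS, 5});
    }
}
```

On this sample, the ε-closure of {1} is {1, 2, 4}, move of that set on a is {3}, and the ε-closure of {3} is {3, 5}.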

move(T,a) Example

[Figure: a sample NFA with move worked out for two sets of states.]

SubsetConstruction(N)

    procedure SubsetConstruction(NFA N)
        Dstates := {ε-closure(s0)}
        Dtrans := {}
        repeat
            T := an unexplored state in Dstates
            for each input symbol a do
                U := ε-closure(move(T,a))
                if U is not in Dstates then
                    Dstates := Dstates ∪ {U}
                Dtrans := Dtrans ∪ {(T -a-> U)}
        until all states have been explored
        return (Dstates,Dtrans)

SubsetConstruction(N) Example

[Figure: NFA N and the DFA under construction. Legend: start state, unexplored state, new DFA state.]

The ε-closure of the NFA's start state will be the DFA's start state.

Example...

The example proceeds in four steps. Each step takes an unexplored DFA state T and an input symbol a and computes ε-closure(move(T, a)): T is expanded to its set of NFA states, move is applied to that set, and the ε-closure of the result is taken. The resulting set is a DFA state, and we add the transition on a from T to that state.
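The whole procedure can be sketched in Java as below. The edge-array encoding, the class and method names, and the sample NFA for a*b are assumptions for illustration. One deliberate difference from the pseudocode is labeled in the code: an empty result set is skipped rather than added as a DFA state, so a missing entry plays the role of the error state.

```java
import java.util.*;

// Sketch of the subset construction; names and encoding are illustrative.
class SubsetConstruction {
    static final int EPS = -1;  // label for ε-edges (assumed encoding)

    // nfa is a list of edges {from, label, to}.
    static Set<Integer> closure(int[][] nfa, Set<Integer> t) {
        Deque<Integer> stack = new ArrayDeque<>(t);
        Set<Integer> c = new TreeSet<>(t);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int[] e : nfa)
                if (e[0] == s && e[1] == EPS && c.add(e[2])) stack.push(e[2]);
        }
        return c;
    }

    static Set<Integer> move(int[][] nfa, Set<Integer> t, char a) {
        Set<Integer> r = new TreeSet<>();
        for (int[] e : nfa)
            if (e[1] == a && t.contains(e[0])) r.add(e[2]);
        return r;
    }

    // Returns Dtrans: DFA state (a set of NFA states) -> symbol -> DFA state.
    // The key set of the returned map is Dstates.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(
            int[][] nfa, int s0, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dtrans = new LinkedHashMap<>();
        Deque<Set<Integer>> unexplored = new ArrayDeque<>();
        Set<Integer> start = closure(nfa, Collections.singleton(s0));
        dtrans.put(start, new TreeMap<>());
        unexplored.add(start);
        while (!unexplored.isEmpty()) {
            Set<Integer> t = unexplored.remove();
            for (char a : alphabet) {
                Set<Integer> u = closure(nfa, move(nfa, t, a));
                if (u.isEmpty()) continue;   // deviation: no DFA state for the empty set
                if (!dtrans.containsKey(u)) {
                    dtrans.put(u, new TreeMap<>());
                    unexplored.add(u);
                }
                dtrans.get(t).put(a, u);
            }
        }
        return dtrans;
    }

    // Hypothetical NFA for a*b with start state 0 and accepting state 4.
    static int[][] sample() {
        return new int[][] {
            {0, EPS, 1}, {0, EPS, 3}, {1, 'a', 2},
            {2, EPS, 1}, {2, EPS, 3}, {3, 'b', 4}};
    }
}
```

For the sample NFA this produces three DFA states: {0, 1, 3}, {1, 2, 3}, and {4}. A DFA state is accepting when its set contains an accepting NFA state, here state 4.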

Example, Take 2

A slightly different approach is to generate the power-set of the set of NFA states, and then add all the edges we get from ε-closure().

[Figure: the power-set of the NFA states, drawn as candidate DFA states.]

On ε we can go from the NFA's start state to a set of states, which becomes our start state. From that set we can go to another set of states on an input symbol, and we add the corresponding edge. We repeat this for each input symbol and for each set of states we reach. Finally, removing unreachable states gives us our DFA.

Keywords revisited

For a language with many keywords (Ada-95 has 98, COBOL has hundreds), the transition table can be large. We can remove all keywords from the transition table and instead analyze them as IDENTs. When an IDENT is found we look it up in a special table to see if it is, in fact, a reserved word.

We can use a regular hash-table, of course, but if we are concerned about speed we can use a minimal perfect hash-table. This is a static table and related lookup routines that have been optimized for a particular static set of words.

For example, we could build this perfect hash-table for the words LUA, MODULA-2, OBERON:

    0  LUA
    1  MODULA-2
    2
    3  OBERON

    int hash(String s) { return s[0] - 'L'; }
    boolean member(String s) { return table[hash(s)] == s; }

In this case we use the first character of the string as the hash-value. This is not a minimal table, there's one wasted entry.

Using Unix gperf

gperf (http://www.gnu.org/manual/gperf-2.7) is a Unix program that takes a list of keywords as input and returns a perfect hash-table (and related search routines) as output. From the gperf manual:

    The perfect hash function generator gperf reads a set of "keywords" from a keyfile. It attempts to derive a perfect hashing function that recognizes a member of the static keyword set with at most a single probe into the lookup table. If gperf succeeds in generating such a function it produces a pair of C source code routines that perform hashing and table lookup recognition.

The following command

    > echo "BEGIN\nEND" | gperf -L ANSI-C

generates the program below.

    /* ANSI-C code produced by gperf version 2.7 */
    #define TOTAL_KEYWORDS 2
    #define MIN_WORD_LENGTH 3
    #define MAX_WORD_LENGTH 5
    #define MIN_HASH_VALUE 3
    #define MAX_HASH_VALUE 5

    static unsigned int hash (
        register const char *str,
        register unsigned int len)
    {
        static unsigned char asso_values[] = {
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 0, 6, 0, 0,
            <--- Lots more stuff like this --->
        };
        return len + asso_values[(unsigned char)str[len - 1]] +
               asso_values[(unsigned char)str[0]];
    }

    const char * in_word_set (
        register const char *str,
        register unsigned int len)
    {
        static const char * wordlist[] = {
            "", "", "", "END", "", "BEGIN"};
        if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
            register int key = hash (str, len);
            if (key <= MAX_HASH_VALUE && key >= 0) {
                register const char *s = wordlist[key];
                if (*str == *s && !strcmp (str + 1, s + 1))
                    return s;
            }
        }
        return 0;
    }

In this particular case, the hash function only looks at the first and last characters of the string, as well as the string length.

Summary

The problem with table-driven methods is that the tables can easily get huge. Much work has gone into constructing table-compression algorithms, and data structures for sparse tables. See the Dragon book for details.

There are also many algorithms for minimizing the number of states in a DFA. See Louden, pp. 7 7.
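Returning to the first-character keyword lookup from "Keywords revisited", a runnable Java version might look like the sketch below. The class name Keywords and the bounds and empty-string checks are additions for this example. The checks are needed because an arbitrary identifier can hash outside the table, and strings are compared with equals rather than by reference.

```java
// Sketch of a perfect hash lookup keyed on the first character; names are illustrative.
class Keywords {
    // Index = first character minus 'L'. Slot 2 (first letter 'N') is the wasted entry.
    static final String[] TABLE = { "LUA", "MODULA-2", null, "OBERON" };

    static int hash(String s) {
        return s.charAt(0) - 'L';
    }

    // True only if s is one of the keywords: a single probe, then a full comparison.
    static boolean member(String s) {
        if (s.isEmpty()) return false;
        int h = hash(s);
        return h >= 0 && h < TABLE.length && s.equals(TABLE[h]);
    }
}
```

A scanner would call Keywords.member on every IDENT it finds and return a keyword token instead when the lookup succeeds.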

Readings and References

Read Louden, pp. 80. Or, read the Dragon book, pp. 8 0.

An interview with Ken Thompson: http://www.computer.org/computer/thompson.htm.

His Turing award lecture (Reflections on Trusting Trust): http://www.acm.org/classics/sep95/. The next slide shows how you insert a Trojan Horse in the compiler.

Reflections on Trusting Trust

    compile (String S)
        if (we're compiling "login.c")
            GENERATE_CODE(
                if (user=="collberg" && passwd=="d. Troi") login_ok = true
            )
        if (we're compiling "gcc.c")
            GENERATE_CODE(
                if (we're compiling "login.c")
                    GENERATE_CODE(
                        if (user=="collberg" && passwd=="d. Troi") login_ok = true
                    )
            )