Writing a Lexical Analyzer in Haskell (part II)

Today: regular languages and lexical analysis, part II. Some of today's slides are from Dr. Saumya Debray and Dr. Christian Collberg.

This week: PA1 is due in 4 days! PA2 has been posted; we are starting to cover the concepts needed for PA2. Recitation tomorrow will be on implementing lexers in Haskell using a table-driven approach.

CS453 Lecture, Regular Languages and Lexical Analysis

General Approach for Lexical Analysis

- Regular languages
- Finite state machines
- DFAs: deterministic finite automata
- Complications when doing lexical analysis
- NFAs: non-deterministic finite automata
- From regular expressions to NFAs
- From NFAs to DFAs

Complications

1. "1234" is a NUMBER, but what about the 123 in 1234, or the 23, etc.? Also, the scanner must recognize many tokens, not just one, stopping only at end of file.
2. "if" is a keyword or reserved word IF, but "if" is also matched by the regular expression for identifier ID. We want to recognize IF.
3. We want to discard white space and comments.
4. "123" is a NUMBER, but so are "235" and "0", just as "a" is an ID and so is "bcd". We want to recognize a token, but add attributes to it.

Complication 1 (longest match)

1. "1234" is a NUMBER, but what about the 123 in 1234, or the 23, etc.? Also, the scanner must recognize many tokens, not just one, stopping only at end of file.

So: recognize the longest string matched by some regular expression, and stop consuming input only when there is no longer a match. This introduces the need to reconsider a character, since it may be the first character of the next token. E.g. fname(a,bcd); would be scanned as (TokenID "fname") OPEN (TokenID "a") COMMA ... SEMI EOF. While scanning fname, the lexer would peek at '(', which would be put back and then recognized as OPEN.
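
The longest-match rule is easy to express in Haskell, where span consumes the longest prefix satisfying a predicate and returns the rest of the input untouched, so "putting back" a character costs nothing. A minimal sketch; the token names here are illustrative, not the course's actual definitions:

```haskell
import Data.Char (isAlpha, isDigit)

-- Illustrative token type for this sketch.
data Token = TokenNUM String | TokenID String
           | OPEN | CLOSE | COMMA | SEMI
  deriving (Show, Eq)

-- Maximal munch: span grabs the longest run of characters in the
-- class; the unconsumed remainder becomes the input for the next call.
lexer :: String -> [Token]
lexer [] = []
lexer input@(c:rest)
  | isDigit c = let (num,   remaining) = span isDigit input
                in  TokenNUM num   : lexer remaining
  | isAlpha c = let (ident, remaining) = span isAlpha input
                in  TokenID  ident : lexer remaining
  | c == '('  = OPEN  : lexer rest
  | c == ')'  = CLOSE : lexer rest
  | c == ','  = COMMA : lexer rest
  | c == ';'  = SEMI  : lexer rest
  | otherwise = lexer rest          -- silently skip anything else
```

With this sketch, lexer "fname(a,bcd);" yields exactly the stream from the slide: TokenID "fname", OPEN, TokenID "a", COMMA, TokenID "bcd", CLOSE, SEMI.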

Structure of a Scanner Automaton

(Diagram not transcribed.)

CSc 453: Lexical Analysis

Implementing finite state machines

Table-driven FSMs (e.g., lex, flex) use a table T, indexed by the current state and the next input character, to encode transitions:

    next_state = T[curr_state][next_char];

One bit in the state number can indicate whether it is a final (or error) state; if so, a separate table is consulted for what action to take.

Table-driven FSMs: Example

    int acceptstring() {
        char ch;
        int currstate = 1;
        ch = nextchar();
        while (ch != EOF) {
            currstate = T[currstate][ch];
            ch = nextchar();
        }
        if (IsFinal(currstate)) {
            return 1;  /* success */
        }
        return 0;      /* fail */
    }

Transition table T, for inputs a and b:

    state      | a | b
    1          | 2 | 3
    2          | 2 | 3
    3 (final)  | 2 | 3

Table-driven FSMs: Scanning a token

The previous function only determines whether the full string is in the language. A scanner instead runs the table until it reaches a final state and returns a token:

    Token scanner() {
        char ch;
        int currstate = 1, nextstate;
        ch = nextchar();
        while (!IsFinal(currstate)) {
            nextstate = T[currstate][ch];
            if (consume(currstate, ch)) {
                ch = nextchar();
            }
            if (ch == EOF) {
                return 0;  /* fail */
            }
            currstate = nextstate;
        }
        return finaltoken(currstate);  /* success */
    }

Transition table T, for inputs a, b, and [other]; state 4 is final and yields TokenAB:

    state      | a | b | [other]
    1          | 2 | 3 |
    2          | 2 | 3 |
    3          | 2 | 3 | 4
    4 (final: TokenAB)

Table-Driven FSM for Numbers

Go see http://www.cs.arizona.edu/classes/cs453/fall16/recit/lexerstart-take2.hs

    -- Produce tokens until the input string
    -- has been completely consumed.
    lexer :: String -> [Token]
    lexer [] = []
    lexer input =
      let (tok, remaining) = drivetable 0 "" input
      in if tok == Whitespace
           then lexer remaining
           else tok : lexer remaining

    -- From the given state, consume characters
    -- from the string until a token is found.
    drivetable :: Int -> String -> String -> (Token, String)
    drivetable curr tokstr [] = (UnexpectedEOF, "")
    drivetable curr tokstr (c:rest) =
      let (next, consume)         = nextstate curr c
          (nexttokstr, remaining) = ...
          (done, tok)             = final next nexttokstr
      in if done
           then (tok, remaining)
           else drivetable next nexttokstr remaining

The FSM (drawn on the board):

    State 0: digit -> state 1
    State 1: digit -> state 1; other -> state 2
    State 2: final state for TokenNUM

How should we define the nextstate and final functions?
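
One way the slide's question might be answered, sketched under the assumption that nextstate reports (next state, consume this character?) and final reports (done, token), matching how the driver above uses them; the actual lexerstart-take2.hs may differ in detail, and whitespace/error states are omitted here:

```haskell
import Data.Char (isDigit)

-- Minimal token type for this sketch (not the course's full type).
data Token = TokenNUM String | NoToken
  deriving (Show, Eq)

-- (next state, should we consume this character?)
-- State 0: start; state 1: inside a number; state 2: final for TokenNUM.
nextstate :: Int -> Char -> (Int, Bool)
nextstate 0 c | isDigit c = (1, True)
nextstate 1 c | isDigit c = (1, True)
nextstate 1 _             = (2, False)  -- not consumed: it starts the next token
nextstate _ _             = (2, False)  -- error handling omitted in this sketch

-- (are we done, and if so, with which token?)
final :: Int -> String -> (Bool, Token)
final 2 tokstr = (True, TokenNUM tokstr)
final _ _      = (False, NoToken)
```

The False branch of nextstate is what implements the "reconsider a character" idea from Complication 1: the character that ends the number is left in the input.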

Complication 2 (priority combined with longest match)

2. "if" is a keyword or reserved word IF, but "if" is also matched by the regular expression for identifier ID; we want to recognize IF. So: have some way of determining which token (IF or ID) is recognized. This can be done using priority: e.g., in scanner generators an earlier definition has higher priority than a later one. By putting the definition for IF before the definition for ID in the input to the scanner generator, we get the desired result. What about the string ifyouleavemenow?
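
In a hand-written lexer the same effect is often achieved the other way around: scan the longest identifier first, then check it against a keyword table. A sketch, with an illustrative two-entry table:

```haskell
data Token = TokenIfKW | TokenWhileKW | TokenID String
  deriving (Show, Eq)

-- Illustrative keyword table (assumed for this sketch).
keywords :: [(String, Token)]
keywords = [("if", TokenIfKW), ("while", TokenWhileKW)]

-- The longest match has already been taken as an identifier string;
-- only an exact keyword string is reclassified.
classify :: String -> Token
classify s = maybe (TokenID s) id (lookup s keywords)
```

This also answers the slide's question: classify "ifyouleavemenow" is TokenID "ifyouleavemenow", because the longest-match rule ran first, while classify "if" is TokenIfKW.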

Complication 3

3. We want to discard white space and comments, and not bother the parser with them. So: in scanner generators, we can specify white space with a regular expression, e.g. [\t\n ], and return no token, i.e. move on to the next one; we can likewise specify comments with a (nasty) regular expression and again return no token. When writing a scanner by hand:

    lexer (c:rest) =
      if isSpace c
        then lexer rest
        else TokenUnknownChar c : lexer rest

Complication 4 (information associated with a token)

4. "123" is a NUMBER, but so are "235" and "0", just as "a" is an ID and so is "bcd". We want to recognize a token, but add attributes to it. So we want to associate some data with some token types:

    data Token = TokenIfKW
               | TokenID String
               -- ...
               deriving (Show, Eq)

Often more information is added to a token, e.g. line number and position.
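
Positions can be carried the same way, by pairing every token with a line and column that the lexer threads along as it consumes characters. The record and helper below are one possible shape, not the course's definition:

```haskell
data Token = TokenIfKW | TokenID String | TokenNUM String
  deriving (Show, Eq)

-- One way to attach source positions to tokens (hypothetical names).
data PosToken = PosToken
  { ptok  :: Token
  , pline :: Int   -- 1-based line number
  , pcol  :: Int   -- 1-based column
  } deriving (Show, Eq)

-- Advance a (line, column) position over one character
-- (tab expansion and other details are omitted in this sketch).
advance :: (Int, Int) -> Char -> (Int, Int)
advance (l, _) '\n' = (l + 1, 1)
advance (l, c) _    = (l, c + 1)
```

A lexer would fold advance over the characters it consumes and stamp each token with the position where it started.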

(Non-)Deterministic Finite State Automata

A deterministic finite state automaton (DFA) has disjoint character sets on its edges, i.e., the choice of next state is deterministic. A non-deterministic finite state automaton (NFA) does not: it can have character sets on its edges that overlap (non-empty intersection), and edges labeled with the empty string (labeled ε). NFAs are used in the translation from regular expressions to FSAs. E.g., when we combine the regular expression for IF with the regular expression for ID by simply merging the two transition graphs, we get an NFA. NFAs are a first step in creating a DFA for a scanner; the NFA is then transformed into a DFA.

From regular expressions to NFAs

- a: a single edge labeled with the letter a.
- ε: a single edge labeled with the empty string.
- AB: concatenate the NFAs, connecting the accept state of the NFA for A to the start state of the NFA for B with an ε edge.
- A|B: split and merge them: ε edges from a new start state to A and to B, and ε edges from their accept states to a new accept state.
- A*: build a loop: ε edges allow repeating A or skipping it entirely.

The Problem

DFAs are easy to execute (table-driven interpretation); NFAs are easy to build from regular expressions, but hard to execute: we would need some form of guessing, implemented by backtracking. To build a DFA from an NFA, we avoid the backtracking by taking all choices in the NFA at once: a move with a character or ε gets us to a set of states in the NFA, which becomes one state in the DFA. We keep doing this until we have exhausted all possibilities. This mechanism is called transitive closure. (It terminates because there are only finitely many subsets of NFA states. How many are there?)

Example: IF and ID

    let : [a-z]
    dig : [0-9]
    tok : if | id
    if  : i f
    id  : let (let | dig)*

Notes to read through later. Definitions: edge(s,c) and closure(S)

edge(s,c): the set of all NFA states reachable from state s by following one edge labeled c.

closure(S): the set of all states reachable from the states in S with no characters, i.e. following only ε edges. It satisfies

    closure(S) = S ∪ ( ∪ { edge(s,ε) | s ∈ closure(S) } )

and is computed with a fixpoint iteration:

    T = S
    repeat
        T' = T
        T  = T' ∪ ( ∪ { edge(s,ε) | s ∈ T' } )
    until T == T'
    closure(S) = T

This transitive closure algorithm terminates because there is a finite number of states in the NFA.
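
The fixpoint loop is a few lines of Haskell. The NFA representation used here, a list of (from, label, to) edges with Nothing standing for ε, is an assumption of this sketch, not the course's data structure:

```haskell
import Data.List (nub, sort)

-- NFA edges: (from, label, to); Nothing is an epsilon edge (assumed encoding).
type Edges = [(Int, Maybe Char, Int)]

-- edge s c: all states reachable from s by one edge labeled c.
edge :: Edges -> Int -> Maybe Char -> [Int]
edge es s l = [t | (s', l', t) <- es, s' == s, l' == l]

-- closure: repeatedly add epsilon-successors until nothing changes,
-- exactly the T / T' loop above.
closure :: Edges -> [Int] -> [Int]
closure es s = go (sort (nub s))
  where
    go t =
      let t' = sort (nub (t ++ [q | p <- t, q <- edge es p Nothing]))
      in if t' == t then t else go t'
```

Termination is visible in the code: each iteration either grows the (finite) state set or stops.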

DFAedge and NFA Simulation

Suppose we are in DFA state d = {s_i, s_k, s_l}. By moving with character c from d we reach a set of new NFA states; call this set DFAedge(d,c), a new or already existing DFA state:

    DFAedge(d,c) = closure( ∪ { edge(s,c) | s ∈ d } )

NFA simulation: let the input string be c_1 ... c_k, and let s_1 be the start state of the NFA:

    d = closure({s_1})
    for i from 1 to k:
        d = DFAedge(d, c_i)
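
Combining closure with DFAedge gives a complete NFA simulator: the loop above is just a left fold over the input string. As before, the edge-list encoding is an assumption of this sketch:

```haskell
import Data.List (nub, sort)

-- NFA edges: (from, label, to); Nothing is an epsilon edge (assumed encoding).
type Edges = [(Int, Maybe Char, Int)]

edge :: Edges -> Int -> Maybe Char -> [Int]
edge es s l = [t | (s', l', t) <- es, s' == s, l' == l]

closure :: Edges -> [Int] -> [Int]
closure es s = go (sort (nub s))
  where
    go t =
      let t' = sort (nub (t ++ [q | p <- t, q <- edge es p Nothing]))
      in if t' == t then t else go t'

-- DFAedge(d, c): the closure of all c-successors of the states in d.
dfaedge :: Edges -> [Int] -> Char -> [Int]
dfaedge es d c = closure es (nub [t | s <- d, t <- edge es s (Just c)])

-- Simulate the NFA: d = closure({start}), then d = DFAedge(d, c_i)
-- for each input character; the result is the final set of NFA states.
simulate :: Edges -> Int -> String -> [Int]
simulate es start = foldl (dfaedge es) (closure es [start])
```

On the small NFA 1 -ε-> 2 -a-> 3 -b-> 4, simulate ends "ab" in the state set [4]; an input with no matching path ends in the empty set.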

Constructing a DFA with closure and DFAedge

- State d_1 = closure({s_1}), the closure of the start state of the NFA.
- Make new states by moving from existing states with a character c, using DFAedge(d,c); record these in the transition table.
- Mark accepting states in the transition table: if d contains an accepting state of the NFA, d is accepting; decide by priority if it contains more than one accepting state.
- Instead of single characters, use non-overlapping (DFA) character classes to keep the table manageable.

Suggested Exercise

Build an NFA and a DFA for integer and float literals:

    dot       : .
    dig       : [0-9]
    int-lit   : dig+
    float-lit : dig* dot dig+
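
For checking an attempt, here is one possible DFA for these two token classes, encoded directly in Haskell. Build the NFA and derive your own DFA first; the state numbering below is arbitrary, and the function names are this sketch's own:

```haskell
import Control.Monad (foldM)
import Data.Char (isDigit)

data Lit = IntLit | FloatLit deriving (Show, Eq)

-- States: 0 start; 1 after dig+ (accepts int-lit);
-- 2 after the dot; 3 after dot dig+ (accepts float-lit).
step :: Int -> Char -> Maybe Int
step 0 c | isDigit c = Just 1
step 0 '.'           = Just 2
step 1 c | isDigit c = Just 1
step 1 '.'           = Just 2
step 2 c | isDigit c = Just 3
step 3 c | isDigit c = Just 3
step _ _             = Nothing   -- no transition: reject

-- Run the DFA over the whole string (Nothing aborts early),
-- then test whether the end state is accepting.
classifyLit :: String -> Maybe Lit
classifyLit s = foldM step 0 s >>= accept
  where accept 1 = Just IntLit
        accept 3 = Just FloatLit
        accept _ = Nothing
```

classifyLit "123" is Just IntLit, classifyLit ".5" is Just FloatLit (float-lit allows dig* before the dot, so zero digits are fine), and classifyLit "12." is Nothing because float-lit requires dig+ after the dot.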