Implementation of Lexical Analysis. Lecture 4



Tips on Building Large Systems
- KISS (Keep It Simple, Stupid!)
- Don't optimize prematurely
- Design systems that can be tested
- It is easier to modify a working system than to get a system working

Outline
- Specifying lexical structure using regular expressions
- Finite automata
  - Deterministic Finite Automata (DFAs)
  - Non-deterministic Finite Automata (NFAs)
- Implementation of regular expressions: RegExp => NFA => DFA => Tables

Notation
There is variation in regular expression notation:
- Union: A | B ≡ A + B
- Option: A + ε ≡ A?
- Range: 'a' + 'b' + … + 'z' ≡ [a-z]
- At least one: A+ ≡ AA*
- Excluded range: complement of [a-z] ≡ [^a-z]

Regular Expressions in Lexical Specification
- Last lecture: a specification for the predicate s ∈ L(R)
  - Note: we can do this just by looking at the regular expression R
- But a yes/no answer is not enough! We need to know not just whether the string is a valid program, but also how to partition the input into tokens
- We adapt regular expressions to this goal, i.e., there are some small required extensions

Regular Expressions => Lexical Spec. (1)
1. Write a regexp for the lexemes of each token class
   - Number = digit+
   - Keyword = 'if' + 'else' + …
   - Identifier = letter (letter + digit)*
   - OpenPar = '('
   - …
So, we write down a whole list of regular expressions, one for each syntactic category in the language

Regular Expressions => Lexical Spec. (2)
2. Construct R, matching all lexemes for all token classes:
   R = Keyword + Identifier + Number + … = R_1 + R_2 + …

What follows is the key to how we use the regular expression specification to perform lexical analysis

Regular Expressions => Lexical Spec. (3)
3. Let the input be x_1 … x_n. For 1 ≤ i ≤ n, check whether the prefix x_1 … x_i ∈ L(R)
4. If success, then we know that x_1 … x_i ∈ L(R_j) for some j
5. Remove x_1 … x_i from the input and go to (3)
Continue removing pieces until we have tokenized the entire string

Ambiguities (1)
- There are ambiguities in the algorithm: some things are underspecified (and these turn out to be interesting)
- How much input is used? What if x_1 … x_i ∈ L(R) and also x_1 … x_k ∈ L(R) (of course i ≠ k)?
  - Ex.: we've got ==
  - Do we see "How" or "H-o-w"?
- Rule: pick the longest possible string in L(R), the "maximal munch" (yes, this is really what this rule is called)
- The reason is that this is the way humans read things, so tools work this way as well (which usually does the right thing)

Ambiguities (2)
- Which token is used? What if x_1 … x_i ∈ L(R_j) and also x_1 … x_i ∈ L(R_k)?
  - Ex.: recall our specifications for keywords and identifiers
    - Keyword = 'if' + 'else' + …
    - Identifier = letter (letter + digit)*
    - 'if' satisfies both
- Note: in most languages, if it's a keyword, it's not an identifier. But changing the RE for Identifier to explicitly exclude keywords is a real pain (the current rule is much more natural)
- Rule: use the rule listed first (j if j < k). This treats 'if' as a keyword, not an identifier
- Bottom line: in the file defining our lexical specification, put Keywords before Identifiers
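The two disambiguation rules (longest match first, then earliest rule) can be sketched in a few lines of Python. This is only an illustration, not the algorithm a lexer generator actually uses: the rule names, the patterns, and the `tokenize` function are made up for the example, and it relies on each individual pattern being greedy enough that Python's `re` returns that rule's longest match.

```python
import re

# Hypothetical token classes, listed in priority order (Keyword before Identifier).
RULES = [
    ("Keyword", r"if|else"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Number", r"[0-9]+"),
    ("Equals", r"=="),
    ("Assign", r"="),
    ("OpenPar", r"\("),
    ("Whitespace", r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Split text into (token_class, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # A strictly longer match wins (maximal munch);
            # on a tie the rule listed first is kept.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "Whitespace":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

With these assumed rules, `if` is classified as a Keyword because of the tie-break, while `ifx` becomes a single Identifier because the longer match wins.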

Error Handling
- What if no rule matches a prefix of the input? (Note: this comes up quite a bit)
- Problem: we can't just get stuck. It is important for a compiler to do good error handling (it can't simply crash)
- We need to provide feedback as to where the error is and what kind of error it is

Error Handling
- What if no rule matches a prefix of the input? (Note: this comes up quite a bit)
- Solution: don't let it ever happen that a string isn't in L(R)
  - Write a rule matching all "bad" strings
  - Create an Error token class
  - Put it last (lowest priority)
- Putting it last also allows us to be a little bit sloppy: the RE can include strings that ARE valid, because earlier rules will have caught these
- The action for this rule is to print an error string
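A minimal sketch of the catch-all Error rule, assuming a made-up rule list and a deliberately sloppy error pattern that matches any single character. Because it is listed last, it only fires when no earlier rule matches at least as much input. Here the "action" collects a message rather than printing it, which is a choice made for the example.

```python
import re

# Hypothetical rules; Error comes last so every earlier rule has priority.
RULES = [
    ("Number", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Whitespace", re.compile(r"\s+")),
    ("Error", re.compile(r".", re.DOTALL)),  # sloppy on purpose: any one character
]


def tokenize(text):
    """Return (tokens, errors); the Error rule guarantees progress."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        name, lexeme = None, ""
        for rule_name, regex in RULES:
            match = regex.match(text, pos)
            # Longest match wins; ties go to the earlier rule.
            if match and len(match.group()) > len(lexeme):
                name, lexeme = rule_name, match.group()
        if name == "Error":
            errors.append(f"position {pos}: unexpected character {lexeme!r}")
        elif name != "Whitespace":
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens, errors
```

The error pattern also matches valid single characters such as `a`, but the Identifier rule is listed earlier and wins the tie, which is the sloppiness the slide allows.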

Summary
- Regular expressions provide a concise notation for string patterns
- Use in lexical analysis requires small extensions
  - To resolve ambiguities
  - To handle errors
- Warning: when you actually go to write the specification for a lexer, the two rules for resolving ambiguity can lead to tricky situations. You must think carefully about the ordering of the rules! You may not always be getting what you think you are!
- Good algorithms are known
  - They require only a single pass over the input
  - Few operations per character (table lookup)
- These algorithms are the subject of the following slides

Finite Automata
- Regular expressions = specification
- Finite automata = implementation
- Closely related: they can specify exactly the same languages, the regular languages. We won't prove this, but we will use it
- A finite automaton consists of
  - An input alphabet Σ (the characters the FA can read)
  - A finite set of states S (thus "finite" automata)
  - A start state n
  - A set of accepting states F ⊆ S
  - A set of transitions state →(input) state, i.e., if the automaton is in a given state, it can read some input and move to another specified state

Finite Automata
- The transition s_1 →(a) s_2 is read: in state s_1 on input a, go to state s_2
- If at end of input and in an accepting state => accept (that is, yes, this string was in the language of this machine)
- Otherwise => reject

So
- Start in the start state
- Repeat (until end of input string):
  - Read one character of the input string
  - Move to the appropriate state
- If after the last character read you end up in an accepting state, the string is in the language of the FA
- Else, the string is not in the language of the FA. E.g.,
  - It terminates in a state not in F, or
  - The machine gets stuck: it finds itself in a state and there is no transition out of that state on the input (note that it does not read out of …)

Alternative Notation: Finite Automata State Graphs
[Figure: legend showing the graph symbols for a state, the start state, an accepting state, and a transition on input a]

A Simple Example
A finite automaton that accepts only "1"
[Figure: state graph with a single transition on 1]

How The Machine Executes
A finite automaton that accepts only "1"
[Figure: state graph A →(1) B, with B accepting]
- Input "1": start in state A, read 1 and move to B. End of input in an accepting state, so accept
- Input "0": start in state A. There is no transition from A on 0, so the machine gets stuck and rejects
- Input "10": start in state A, read 1 and move to B. There is no transition from B on 0, so the machine rejects
The input pointer always advances one spot. It never moves backwards

The Language of a Finite Automaton
Is the set consisting of the strings it accepts

Another Simple Example
A finite automaton accepting any number of 1's followed by a single 0
Alphabet: {0,1}
[Figure: state graph with transitions on 1 and 0]
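The execution procedure described earlier can be tried on this automaton. The sketch below is one possible encoding, not taken from the slides: the state names `A` and `B` and the dictionary layout are assumptions made for the example.

```python
# Hypothetical encoding of "any number of 1's followed by a single 0".
START = "A"
ACCEPTING = {"B"}
TRANSITIONS = {
    ("A", "1"): "A",  # stay in A while reading 1's
    ("A", "0"): "B",  # the single 0 moves to the accepting state
}


def accepts(text):
    """Run the automaton; reject if it gets stuck or ends outside ACCEPTING."""
    state = START
    for ch in text:
        key = (state, ch)
        if key not in TRANSITIONS:
            return False  # stuck: no transition from this state on this input
        state = TRANSITIONS[key]
    return state in ACCEPTING
```

For instance, `accepts("1110")` holds, while `accepts("100")` fails because state `B` has no outgoing transition.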

And Another Example
Alphabet: {0,1}
What language does this recognize?
[Figure: state graph with transitions labelled 0 and 1]

Epsilon Moves
- Another kind of transition: ε-moves
- The machine can move from state A to state B without reading input
[Figure: A →(ε) B]

Deterministic and Nondeterministic Automata
- Deterministic Finite Automata (DFA)
  - One transition per input per state
  - No ε-moves
- Nondeterministic Finite Automata (NFA)
  - Can have multiple transitions for one input in a given state
  - Can have ε-moves

Execution of Finite Automata
- A DFA can take only one path through the state graph, completely determined by the input
- NFAs can choose
  - Whether to make ε-moves
  - Which of multiple transitions for a single input to take

Acceptance of NFAs
- An NFA can get into multiple states
[Figure: NFA state graph with transitions on 1 and 0, shown with a sample input]
- Rule: an NFA accepts if it can get to a final state

NFA vs. DFA (1)
- NFAs and DFAs recognize the same set of languages (the regular languages)
  - Think of NFAs as parallel processors
- DFAs are faster to execute: there are no choices to consider

NFA vs. DFA (2)
- For a given language, the NFA can be simpler than the DFA
[Figure: an NFA and an equivalent DFA, with transitions on 1 and 0]
- The DFA can be exponentially larger than the NFA (why would you expect that to be the case?)

Regular Expressions to Finite Automata
High-level sketch:
Lexical Specification => Regular expressions => NFA => DFA => Table-driven Implementation of DFA (lookup tables + code for traversing those tables)

Regular Expressions to NFA: Atomic REs
- For each kind of rexp, define an equivalent NFA
- Notation: NFA for rexp M [Figure: box labelled M]
- For ε [Figure: a single ε-transition]
- For input a [Figure: a single transition on a]
The notion here is that we will be building our overall machine up from smaller machines

Regular Expressions to NFA: Compound REs
- For AB [Figure: NFA for A followed by NFA for B]
  - Note the modification of the final state of A, and the addition of an ε-transition from A to B
- For A + B [Figure: NFAs for A and B side by side]
  - Note the addition of new start and final states
- Remember: with an NFA, if ANY choice works, the string is accepted

Regular Expressions to NFA (3)
- For A* [Figure: NFA for A with added ε edges]
  - Iteration of A
  - Note the ε edges: one allows A to be repeated, one allows A to be skipped entirely

Let's remember one very important thing in all of this: it is done this way so that the process can be automated! You might see easier DFAs or NFAs, but a computer needs to have an algorithm to create these, which is why we go through this process.
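To illustrate how mechanical the construction is, here is a small sketch of the atomic and compound cases as Python functions. It follows the general shape of the constructions above (new start and final states for union and star, an ε edge for concatenation), but the class, the function names, and the exact placement of ε edges are assumptions for the example rather than a transcription of the slides.

```python
import itertools

_ids = itertools.count()  # fresh state numbers


class NFA:
    """An NFA fragment with one start and one final state.
    edges is a list of (source, label, target); label None means epsilon."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges


def atom(ch):
    s, f = next(_ids), next(_ids)
    return NFA(s, f, [(s, ch, f)])


def concat(a, b):
    # Link the final state of a to the start state of b with an epsilon edge.
    return NFA(a.start, b.final, a.edges + b.edges + [(a.final, None, b.start)])


def union(a, b):
    s, f = next(_ids), next(_ids)
    extra = [(s, None, a.start), (s, None, b.start),
             (a.final, None, f), (b.final, None, f)]
    return NFA(s, f, a.edges + b.edges + extra)


def star(a):
    s, f = next(_ids), next(_ids)
    # Enter a, loop back after a, or skip a entirely.
    extra = [(s, None, a.start), (a.final, None, s), (s, None, f)]
    return NFA(s, f, a.edges + extra)


def accepts(nfa, text):
    """Simulate the NFA by tracking the set of states it could be in."""
    def closure(states):
        seen, work = set(states), list(states)
        while work:
            current = work.pop()
            for src, label, dst in nfa.edges:
                if src == current and label is None and dst not in seen:
                    seen.add(dst)
                    work.append(dst)
        return seen

    current = closure({nfa.start})
    for ch in text:
        current = closure({dst for src, label, dst in nfa.edges
                           if src in current and label == ch})
    return nfa.final in current


# (1+0)*1 built bottom-up from smaller machines.
example = concat(star(union(atom("1"), atom("0"))), atom("1"))
```

Under these assumptions the machine for (1+0)*1 ends up with ten states, and `accepts(example, "001")` holds while `accepts(example, "10")` does not.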

Example of RegExp -> NFA conversion
- Consider the regular expression (1+0)*1
- The NFA is
[Figure: NFA with states A to J; transitions C →(1) E, D →(0) F, I →(1) J, and ε-moves between the remaining states]

ε-closure
- An idea that helps with the transition from NFA to DFA
- The ε-closure of state B is all states that can be reached from B via ε-moves
- ε-closure(B) = {B,C,D}; ε-closure(G) = {A,B,C,D,G,H,I}
[Figure: the NFA for (1+0)*1 with states A to J]
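A short sketch of computing an ε-closure with a worklist. The ε edges below are an assumption: the figure is not preserved in this transcript, so they were reconstructed to be consistent with the closures stated above and may differ in detail from the original diagram.

```python
# Assumed epsilon edges of the NFA for (1+0)*1 (reconstructed, not from the figure).
EPSILON_EDGES = {
    "A": {"B", "H"},
    "B": {"C", "D"},
    "E": {"G"},
    "F": {"G"},
    "G": {"A"},
    "H": {"I"},
}


def epsilon_closure(states):
    """All states reachable from the given states using only epsilon moves."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        state = worklist.pop()
        for target in EPSILON_EDGES.get(state, ()):
            if target not in closure:
                closure.add(target)
                worklist.append(target)
    return frozenset(closure)
```

With these edges, `epsilon_closure({"B"})` is {B, C, D} and `epsilon_closure({"G"})` is {A, B, C, D, G, H, I}, matching the slide.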

NFA to DFA. Remark
- An NFA may be in many states at any time
- How many different states? If there are N states, the NFA must be in some non-empty subset of those N states
- How many such subsets are there? 2^N - 1, i.e., finitely many

Some helpful notation
For a character a in the input alphabet Σ, and X a set of states in the NFA, define a(X) to be a subset of the set of all states in the NFA, as follows:
a(X) = { y | for some x in X, there is a transition from x to y on input a }
I.e., if we are in some state in set X, and a is the next input, then a(X) is the set of states to which we could transition.
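As a small illustration, a(X) can be written as a function over a transition dictionary. The three labelled transitions below are the ones that survive in the transcript of the (1+0)*1 example (C on 1 to E, D on 0 to F, I on 1 to J); the dictionary layout and the name `move` are assumptions.

```python
# Labelled (non-epsilon) transitions of the example NFA.
SYMBOL_EDGES = {
    ("C", "1"): {"E"},
    ("D", "0"): {"F"},
    ("I", "1"): {"J"},
}


def move(states, symbol):
    """a(X): every state reachable from some state in X on the given symbol."""
    result = set()
    for state in states:
        result |= SYMBOL_EDGES.get((state, symbol), set())
    return result
```

For X = {A, B, C, D, H, I}, this gives {E, J} on input 1 and {F} on input 0.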

NFA to DFA: The Trick
- Simulate the NFA
- Each state of the DFA is a non-empty subset of states of the NFA
  - Not all subsets of states of the NFA will have links to/from them
  - So there is a large number of possible states (but finite!)
- Start state: the set of NFA states reachable through ε-moves from the NFA start state, i.e., ε-clos(start state)
  - Think about why this makes sense: which sets of states might the NFA be in at the beginning of execution?
- Add a transition S →(a) S' to the DFA iff S' is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well: S' = ε-clos(a(S))
- Final states: the set of DFA states X such that X contains a state in F (the set of final states of the NFA)
  - Think about why this makes sense: if a subset of NFA states contains a state that was in F (for the NFA), and execution of the DFA ends up in that subset, then there is a path through the NFA which ends in a final state.

Is What We Get a DFA?
- A finite set of states
- A start state
- A set of final states
- A transition function with only one move per input
- No ε-moves

Let's do this!
- We won't follow the exact steps: there are too many possible states to list them all (and most are not involved in the final DFA)
- Start with the start state, then work out which additional states are required
[Figure: the NFA for (1+0)*1 with states A to J]

NFA -> DFA Example
[Figure: the NFA with states A to J, and the resulting DFA with states ABCDHI (start), FGHIABCD, and EJGHIABCD, connected by transitions on 0 and 1]
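The worked example can be reproduced with a short subset-construction sketch. The NFA edges are reconstructed from the closures and labels that survive in this transcript, so treat them as an assumption; the function and variable names are made up for the example.

```python
from collections import deque

# Assumed NFA for (1+0)*1: epsilon edges and labelled edges (reconstructed).
EPS = {"A": {"B", "H"}, "B": {"C", "D"}, "E": {"G"},
       "F": {"G"}, "G": {"A"}, "H": {"I"}}
DELTA = {("C", "1"): {"E"}, ("D", "0"): {"F"}, ("I", "1"): {"J"}}
NFA_START, NFA_FINAL = "A", {"J"}
ALPHABET = ("0", "1")


def eps_closure(states):
    closure, work = set(states), list(states)
    while work:
        for target in EPS.get(work.pop(), ()):
            if target not in closure:
                closure.add(target)
                work.append(target)
    return frozenset(closure)


def move(states, symbol):
    result = set()
    for state in states:
        result |= DELTA.get((state, symbol), set())
    return result


def subset_construction():
    """Build only the DFA states reachable from the start state."""
    start = eps_closure({NFA_START})
    dfa = {}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in ALPHABET:
            target = eps_closure(move(current, symbol))
            if target:  # DFA states are non-empty subsets
                dfa[current][symbol] = target
                queue.append(target)
    accepting = {state for state in dfa if state & NFA_FINAL}
    return start, dfa, accepting


start, dfa, accepting = subset_construction()
```

Only three of the possible subsets turn out to be reachable, which is why the example works forward from the start state instead of enumerating every subset.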

The Final Step: Implementation of DFA
- First, note that our original diagram was somewhat misleading: some systems go straight from NFA to implementation
- More on this in a bit

Implementation
- A DFA can be implemented by a 2D table T
  - One dimension is states
  - The other dimension is input symbols
  - For every transition S_i →(a) S_k, define T[i,a] = k
- DFA execution
  - If in state S_i and input a, read T[i,a] = k and skip to state S_k
  - Very efficient

Table Implementation of a DFA
[Figure: DFA with states S, T, U and transitions on 0 and 1]

          input symbol
state     0     1
S         T     U
T         T     U
U         T     U

The code
[Code listing from the slide is not preserved in this transcript]
Note how compact and efficient this is
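Since the slide's listing did not survive, here is a stand-in sketch of a table-driven DFA loop for the S/T/U table above. It is not the original code. The numeric state encoding is an assumption, and so is the choice of U as the accepting state, which is inferred from the earlier (1+0)*1 example.

```python
# States encoded as row indices; columns are the input symbols 0 and 1.
STATE_S, STATE_T, STATE_U = 0, 1, 2
TABLE = [
    [STATE_T, STATE_U],  # S: on 0 go to T, on 1 go to U
    [STATE_T, STATE_U],  # T
    [STATE_T, STATE_U],  # U
]
ACCEPTING = {STATE_U}  # assumption: U is the accepting state


def run(bits):
    """Return the state reached after reading a string of '0'/'1' characters."""
    state = STATE_S
    for ch in bits:
        state = TABLE[state][int(ch)]  # one table lookup per character
    return state


def accepts(bits):
    return run(bits) in ACCEPTING
```

The inner loop does a single indexed lookup per input character, which is the point the slide makes about efficiency.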

Table Implementation of a DFA
[Figure: states S, T, U each point to a single shared row: on 0 go to T, on 1 go to U]
- Much more compact
- Duplicate rows are common in DFAs for lexical analysis
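Row sharing can be sketched as a one-dimensional array of references to rows, where identical rows are stored once. The helper names below are made up for the illustration, and states are encoded as the indices S=0, T=1, U=2.

```python
def compact(table):
    """Replace duplicate rows with references to a single shared tuple."""
    unique = {}
    row_refs = []
    for row in table:
        key = tuple(row)
        # setdefault returns the first stored tuple for any equal row.
        row_refs.append(unique.setdefault(key, key))
    return row_refs


# Full 2D table for the S/T/U automaton: every row is "0 -> T, 1 -> U".
FULL_TABLE = [[1, 2], [1, 2], [1, 2]]
ROWS = compact(FULL_TABLE)


def step(state, symbol):
    # One extra indirection: first find the row for the state, then index it.
    return ROWS[state][symbol]
```

Here all three entries of `ROWS` refer to the same tuple, so only one row is actually stored. The cost is the extra dereference on every step.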

Compaction of Table
- It turns out that this technique can make tables much more compact
- Moreover, duplicate rows show up quite a bit in the FAs that arise in lexical analysis
- Considering the large number of potential states, this can lead to considerable compaction of the table
  - While the blowup in states from NFA to DFA is often not the worst case, it can be substantial
  - The 2D table can be quite large
- Disadvantage: pointer dereferences

What if we don't want a DFA?
- Why? The table that results might be huge, so we may want to use the NFA directly

Directly from NFA to Table
[Figure: the NFA for (1+0)*1 with states A to J]

state     0       1       ε
A                         {B,H}
B                         {C,D}
C                 {E}
D         {F}

Good and Bad of Going Directly from NFA to Table
- Good: the table is guaranteed to be relatively small, limited by the size of the NFA and the size of the input alphabet
- Bad: even with compression tricks, the inner loop runs much more slowly because we are dealing with sets of states, rather than states themselves
  - So on each move, we need to keep track of all possible states to which we could go (and the associated ε-moves)
- Bottom line: can save a lot of space, but cost a lot in time
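The time cost becomes visible in a sketch that runs the NFA table directly: every input character requires a pass over a whole set of states plus an ε-closure. The table below is the example NFA for (1+0)*1 as reconstructed from this transcript (an assumption), and the names are invented for the illustration.

```python
# Assumed NFA table: state -> {symbol or "eps" -> set of target states}.
NFA_TABLE = {
    "A": {"eps": {"B", "H"}},
    "B": {"eps": {"C", "D"}},
    "C": {"1": {"E"}},
    "D": {"0": {"F"}},
    "E": {"eps": {"G"}},
    "F": {"eps": {"G"}},
    "G": {"eps": {"A"}},
    "H": {"eps": {"I"}},
    "I": {"1": {"J"}},
}
START, FINAL = "A", {"J"}


def closure(states):
    seen, work = set(states), list(states)
    while work:
        for target in NFA_TABLE.get(work.pop(), {}).get("eps", ()):
            if target not in seen:
                seen.add(target)
                work.append(target)
    return seen


def nfa_accepts(text):
    """Track the set of possible NFA states instead of a single DFA state."""
    current = closure({START})
    for ch in text:
        reached = set()
        for state in current:  # inner loop over a set of states
            reached |= NFA_TABLE.get(state, {}).get(ch, set())
        current = closure(reached)
        if not current:
            return False  # no state can continue on this input
    return bool(current & FINAL)
```

The table has one small row per NFA state, but each step touches several states, in contrast to the single lookup of the DFA loop.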

Implementation (Cont.)
- NFA -> DFA conversion is at the heart of tools such as flex
- But DFAs can be huge
- In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations
- The user chooses, via configuration, whether they want to be closer to a full DFA or a full NFA