David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

Similar documents
Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

CSc 453 Lexical Analysis (Scanning)

Lexical Analysis. Chapter 2

Lexical Analysis. Implementation: Finite Automata

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

Introduction to Lexical Analysis

Lexical Analysis. Lecture 2-4

Figure 2.1: Role of Lexical Analyzer

CSE302: Compiler Design

Implementation of Lexical Analysis

UNIT -2 LEXICAL ANALYSIS

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Zhizheng Zhang. Southeast University

Lexical Analysis. Lecture 3-4

Lexical Analysis 1 / 52

Implementation of Lexical Analysis

UNIT II LEXICAL ANALYSIS

Lexical Analysis. Introduction

Formal Languages and Compilers Lecture VI: Lexical Analysis

Front End: Lexical Analysis. The Structure of a Compiler

Implementation of Lexical Analysis

Lexical Analysis. Prof. James L. Frankel Harvard University

Lexical Analysis. Lecture 3. January 10, 2018

Finite automata. We have looked at using Lex to build a scanner on the basis of regular expressions.

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Compiler Construction

Compiler course. Chapter 3 Lexical Analysis

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

CSE302: Compiler Design

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

CS308 Compiler Principles Lexical Analyzer Li Jiang

Introduction to Lexical Analysis

Lexical Analysis. Finite Automata

Compiler Construction LECTURE # 3

CS415 Compilers. Lexical Analysis

Chapter 3 Lexical Analysis

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

Dixita Kagathara Page 1

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

Lexical Analysis (ASU Ch 3, Fig 3.1)

2. Lexical Analysis! Prof. O. Nierstrasz!

[Lexical Analysis] Bikash Balami

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Compiler Construction

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Lexical Analysis. Finite Automata

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Implementation of Lexical Analysis. Lecture 4

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

CSE 105 THEORY OF COMPUTATION

CS 314 Principles of Programming Languages

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

Cunning Plan. Informal Sketch of Lexical Analysis. Issues in Lexical Analysis. Specifying Lexers

CS 432 Fall Mike Lam, Professor. Finite Automata Conversions and Lexing

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Lexical Error Recovery

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console


Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Regular Languages. MACM 300 Formal Languages and Automata. Formal Languages: Recap. Regular Languages

Structure of Programming Languages Lecture 3

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

Implementation of Lexical Analysis

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Chapter 3: Lexical Analysis

Lexical Analysis - 2

Writing a Lexical Analyzer in Haskell (part II)

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

Group A Assignment 3(2)

Part 5 Program Analysis Principles and Techniques

Dr. D.M. Akbar Hussain

G52LAC Languages and Computation Lecture 6

Lexical Analyzer Scanner

The Language for Specifying Lexical Analyzer

Compiler phases. Non-tokens

DFA: Automata where the next state is uniquely given by the current state and the current input character.

Lexical Analyzer Scanner

2. λ is a regular expression and denotes the set {λ} 4. If r and s are regular expressions denoting the languages R and S, respectively

Week 2: Syntax Specification, Grammars

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

LANGUAGE TRANSLATORS

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

Outline CS4120/4121. Compilation in a Nutshell 1. Administration. Introduction to Compilers Andrew Myers. HW1 out later today due next Monday.

Lecture 3: Lexical Analysis

Transcription:

David Griol Barres dgriol@inf.uc3m.es Computer Science Department Carlos III University of Madrid Leganés (Spain)

OUTLINE Introduction: Definitions The role of the Lexical Analyzer Scanner Implementation Regular Expressions review Regular Expressions for tokens Finite Automata review Implementing the scanner From regular expressions to NFA From NFA to DFA From DFA to program 2

Introduction: Definitions Lexical analysis or scanning: To read from left-to-right a source program and divide it into a set of tokens (first phase of a compiler). TOKEN: Sequence of characters with a collective syntactic meaning Objectives: To simplify the syntax analyzer. To facilitate the portability of the compiler. 3

Introduction: Definitions Objectives: It may identify errors in the source program. It may strip out from the source program comments and white space characters (tab, newline, space). It may also associate a line number from the source program with a given error message. 4

The role of the Lexical Analyzer Error messages Source Program Lexical Analyzer Token stream Syntax Analyzer Control (Next) Symbol Table 5

Introduction: Definitions Tokens: reserved words (if, while), identifiers (a23, var53d), special symbols (+, *, >=) Lexemes: Particular instances of tokens. Patterns: Rules that describe the lexemes of a token. Tokens: subject, verb, predicate Lexemes: verb (go, be, belong, arrive ) Pattern: go be belong arrive 6

The role of the Lexical Analyzer Errors than can be detected: The scanner has no information about context It can detect: illegal characters, unterminated comments Can eliminate comments, white spaces, etc. Correlates error messages from the compiler with the source program. 7

The role of the Lexical Analyzer It does not look: garbled sequences, tokens out of place, undeclared identifiers, misspelled keywords, mismatched types. 8

Scanner Implementation There are basically two methods for implementing a scanner: 1. A program that is hard-coded to perform the scanner analysis (Loop and Switch). 2. Using methods to define and recognize patterns in sequences of characters: regular expressions. finite automata theory. 9

Scanner Implementation There are basically two methods for implementing a scanner: 1. A program that is hard-coded to perform the scanner analysis (Loop and Switch): Write the lexical analyzer in a conventional programming/scripting language, using the I/O facilities of that language to read the input. A good candidate is PERL with the rich pattern matching capabilities it offers. Write the lexical analyzer in assembly language and explicitly manage the reading of input. 10

Scanner Implementation Loop and Switch Main Loop: Reads characters one by one from the input file. Uses a switch statement to process the character(s) just read. Output: A list of tokens and lexemes from the source program. Ad hoc scanners (specific problems). Gcc: C lexer is over 2,500 lines of code; 11

Scanner Implementation There are basically two methods for implementing a scanner: 1. Using methods to define and recognize patterns in sequences of characters: regular expressions. finite automata theory. 12

Regular Expressions review Given an alphabet Σ, the rules that define regular expressions of Σ are: a Σ is a regular expression. ε is a regular expression. If r and s are regular expressions, then (r) rs r s r* are regular expressions. Nothing else is a regular expression. 13

Regular Expressions review Axioms: r s = s r r (s t) = (r s) t (rs)t = r(st) r(s t)=rs rt λr = r rλ = r r * = (r λ) * r ** = r * 14

Regular Expressions review Notation: One or more: + R* = r * λ Cero or one:? Cero or more: * Any character:. Any other character: ~ Classes: a b c z = [a-z] 15

Regular Expressions for tokens Numbers: nat = [0 1 2 3 4 5 6 7 8 9]+ natwithsign = (+ -)? nat number = natwithsign (. nat)? (E natwithsign)? 16

Regular Expressions for tokens Identifiers and reserved words: reserved = if while do letter = [a-za-z] digit= [0-9] identifier = letter(letter digit)* 17

Regular Expressions for tokens Comments: {this is a comment in Pascal} comment= {(~})*} 18

Finite Automata review Once all the tokens are defined using regular expressions, a finite automaton can be created for recognizing them. A finite automata consists of: A finite set of states, including a start state and some final states. An alphabet Σ of possible input symbols. A finite set of transitions. 19

Finite Automata review (II) 20

Finite Automata review (IV) digit (0 1 2 3 4 5 6 7 8 9)+ digit digit f not digit not digit 21

Finite Automata review (VI) Deterministic finite automata (DFA): AFD=(Σ, Q, f, q0, F) Σ is the alphabet of possible input symbols. Q is the set of states q0 Q is the start state F Q is the set of final states f is the transition fuction f : Q Σ Q 22

Finite Automata review (VII) Nondeterministic finite automata: NFA=(Σ, Q, f, q0, F) Σ is the alphabet of possible input symbols. Q is the set of states q0 Q is the start state F Q is the set of final states f is the transition fuction f : Q (Σ {λ}) P(Q) 23

Finite Automata review (VIII) Deterministic finite automata (DFA): 1. There are not moves on input ε. 2. For each state s and input symbol a, there is exactly one edge out of s labeled as a. Nondeterministic finite automata (NFA): 1. More than one edge with the same label from any state is allowed. 2. Some states for which certain input symbols have no edge are allowed. 3. ε-nfa: ε transitions allowed. 24

Implementing the scanner Regular expression NFA DFA Program From regular expressions to NFA: Thompson s construction From NFA to DFA: Subsets construction From DFA to program: Specific purpose programs Transition tables 25

Regular expression NFA DFA Program Thompson s construction Input. A regular expression r over an alphabet T. Output. An NDFA N accepting the language L(r ). Method. First we parse r and fragment it into sub-expressions. Then we create NDFAs for the basic symbols appearing in the regular expression. Finally, we integrate the basic fragments into an NDFA that represents the entire expression. 26

Regular expression NFA DFA Program Thompson s construction Basic Regular expressions (ε, a): i ε f i a f 27

Regular expression NFA DFA Program Thompson s construction Concatenation rs: r ε s 28

Regular expression NFA DFA Program Thompson s construction Selection r s: ε r ε ε s ε 29

Regular expression NFA DFA Program Thompson s construction Repetition r * : ε ε r ε ε 30

Regular expression NFA DFA Program Example 1: ab a a b ab a ε b ab a ε a ε b ε ε a ε 31

Regular expression NFA DFA Program Conversion of and ε-nfa into a DFA Subset construction 32

Regular expression NFA DFA Program Conversion of and ε-nfa into a DFA For s N, closure(s) ={t N, there are a ε- transitions from s to t} For T in N, closure(t)=u si T closure(s i ) For T in N, move(t,a)=u si T {states in N to which there is an a-transition from s i in T} Algorithm: construction of states D E and the D T table 1. Initially, D E contains the closure(s 0 ) 2. while there is an unmarked state T in D E 1. Mark T 2. for each input symbol a Σ : 1. U=closure(move(T,a)) 2. if U is not in D E then 1. add U to D E 2. D T (T,a)=U 3. End 3. End Subset construction 33

Regular expression NFA DFA Program Minimizing the number of states of a DFA Construction of a DFA M accepting the same language as M and having as few states as possible 1. Construct an initial partition Π with two groups : F (acp), S (no) 2. Construct Π n : 1. For each group G of Π, partition G into subgroups for until any pair of states s and t in the same subgroup there is a transtition on an input a to states in the same group Π. 3. If Π n =Π, go to the next step. Otherwise repeat previous step with Π Π n 4. The groups in Π are the states of M 1. Construct transition table 2. Eliminate unreachable states 34

Regular expression NFA DFA Program Specific purpose programs (I) {start: state 1} if nextchar is a letter then read newchar; {now in state 2} while nextchar is a letter or a digit do read newchar; {stay at state2} end while; {goto to state 3 without reading newchar} accept; else {error or other cases} end if; digit, letter 1 letter [other] 2 3 Only for a small number of states. Each DFA has its specific implementation. 35

Regular expression NFA DFA Program Specific purpose programs (II) state:=1 {initial state} while state = 1 or 2 do case state of: 1: case inputchar of letter: read newchar; state:=2; else state:= {error or another}; end case; 2: case inputchar of letter, digit: read newchar; state:=2; else state:=3; end case: end case; end while; if state := 3 then accept else error; digit, letter 1 letter [other] 2 3 Introduces a variable that denotes the state. Case selections to represent the transitions. 36

Regular expression NFA DFA Program Transition tables digit, letter 1 letter [other] 2 3 Input character state Letter digit another Accept? 1 2 no 2 2 2 [3] no 3 yes 37

Regular expression NFA DFA Program Transition tables state := 1 ch := next input character; while not Accept[state] and not error(state) do newstate := T[state,ch]; if Advance[state,ch] then ch := next input char; state := newstate; end while; if Accept[state] then accept; digit, letter 1 letter [other] 2 3 The code is reduced. It can be used for many different problems. It is easy to modify. 38