Formal Languages and Compilers Lecture VI: Lexical Analysis


Formal Languages and Compilers, Lecture VI: Lexical Analysis. Free University of Bozen-Bolzano, Faculty of Computer Science, POS Building, Room 2.03. artale@inf.unibz.it, http://www.inf.unibz.it/~artale/. Formal Languages and Compilers, BSc course 2017/18, First Semester.

Summary of Lecture VI

- Lexemes and Tokens.
- Lexer Representation: Regular Expressions (REs).
- Implementing a Recognizer of REs: Automata.
- Lexical Analyzer Implementation.
- Breaking the Input into Substrings.
- Dealing with Conflicts.
- Lexical Analyzer Architecture.

Lexical Analyzer. Lexical Analysis is the first phase of a compiler. Main task: read the input characters and produce a sequence of Tokens that will be processed by the Parser. Upon receiving a get-next-token command from the Parser, the input is read to identify the next token; while looking for the next token, comments and white space are eliminated. Tokens are treated as terminal symbols in the grammar of the source program. [Figure: the Lexical Analyzer reads the Source Program and sends a token to the Parser on each get-next-token request; both components consult the Symbol Table.]

Token Vs. Lexeme. In general, a set of input strings (Lexemes) gives rise to the same Token. For example:

Token     Lexeme                 Description
var       var                    language keyword
if        if                     language keyword
relation  <, <=, =, <>, >, >=    comparison operators
id        position, A1, x        sequence of letters and digits
num       3.14, 14, 6.02E23      numeric constant

Attributes for Tokens. When a Token can be generated by different Lexemes, the Lexical Analyzer must transmit the Lexeme to the subsequent phases of the compiler. This information is specified as an Attribute associated with the Token. Usually, the attribute of a Token is a pointer to the symbol-table entry that keeps information about the Token. Important! The Token influences parsing decisions: the Parser relies on token distinctions, e.g., identifiers are treated differently from keywords. The Attribute influences the semantic analysis and code generation phases.

Attributes for Tokens: An Example. Consider the following assignment statement: E := M * C ** 2. The following ⟨token, attribute⟩ pairs are passed to the Parser:

⟨id, pointer to symbol-table entry for E⟩
⟨assign-op, ⟩
⟨id, pointer to symbol-table entry for M⟩
⟨mult-op, ⟩
⟨id, pointer to symbol-table entry for C⟩
⟨exp-op, ⟩
⟨num, integer value 2⟩

Some Tokens have a null attribute: the Token alone is sufficient to identify the Lexeme. From an implementation point of view, each token is encoded as an integer number.
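The resulting token stream can be pictured in code. This is a minimal sketch, assuming tokens are (name, attribute) pairs and attributes are indices into a symbol table; the names and layout are illustrative, not the lecture's implementation:

```python
# Sketch of the token/attribute pairs for "E := M * C ** 2".
# Attributes are represented as symbol-table indices; None marks
# a null attribute (the token alone identifies the lexeme).
symbol_table = ["E", "M", "C"]   # entries created/looked up by the lexer

tokens = [
    ("id", 0),             # points at the symbol-table entry for E
    ("assign-op", None),
    ("id", 1),             # entry for M
    ("mult-op", None),
    ("id", 2),             # entry for C
    ("exp-op", None),
    ("num", 2),            # attribute is the integer value 2
]

for token, attribute in tokens:
    print(token, attribute)
```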

Identifiers Vs. Keywords. Programming languages use fixed strings to denote particular keywords, e.g., if, then, else, etc. Since keywords are just identifiers, the Lexical Analyzer must distinguish between the two possibilities. If keywords are reserved, i.e., they cannot be used as identifiers, we can initialize the symbol table with all the keywords and mark them as such. Then, a string is recognized as an identifier only if it is not already in the symbol table as a keyword.
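The reserved-keyword technique can be sketched in a few lines. This is an assumption-laden sketch, not the lecture's code: the keyword set and the table layout are illustrative.

```python
# Pre-load the symbol table with the reserved keywords, then classify a
# scanned name by lookup: keywords map to themselves, anything else is 'id'.
KEYWORDS = {"if", "then", "else", "var"}

symbol_table = {kw: "keyword" for kw in KEYWORDS}

def classify(lexeme):
    """Return the token for a scanned name: the keyword itself, or 'id'."""
    # setdefault inserts a fresh identifier entry on first sight
    kind = symbol_table.setdefault(lexeme, "id")
    return lexeme if kind == "keyword" else "id"

print(classify("if"))        # a reserved word: the token is the keyword
print(classify("position"))  # a fresh name: recognized as an identifier
```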


Main Objective. What we want to accomplish: given a way to describe the Lexemes of the input language, automatically generate the Lexical Analyzer. This objective can be split into the following sub-problems:

1. Lexer specification language: how to represent the Lexemes of the input language.
2. The lexical analyzer mechanism: how to generate Tokens starting from Lexeme representations.
3. Lexical analyzer implementation: coding (1) + (2) + breaking the input into substrings corresponding to ⟨Lexeme, Token⟩ pairs.

Problem 1. Lexer Specification Language: Regular Expressions. Regular Expressions are the most popular specification formalism to describe Lexemes and map them to Tokens. Example. An identifier is made of a letter followed by zero or more letters or digits: letter (letter | digit)*. The vertical bar | means "or"; parentheses are used to group sub-expressions; the * means zero or more occurrences.

Regular Expressions. Each Regular Expression, say R, denotes a Language, L(R). The following are the rules to build them over an alphabet V:

1. If a ∈ V ∪ {ϵ}, then a is a Regular Expression denoting the language {a};
2. If R, S are Regular Expressions denoting the Languages L(R) and L(S), then:
   1. R | S is a Regular Expression denoting L(R) ∪ L(S);
   2. RS is a Regular Expression denoting the concatenation L(R)L(S), i.e. L(R)L(S) = {rs | r ∈ L(R) and s ∈ L(S)};
   3. R* is a Regular Expression denoting L(R)*, zero or more concatenations of L(R), i.e. L(R)* = ⋃_{i≥0} L(R)^i;
   4. (R) is a Regular Expression denoting L(R).

Regular Expressions: Examples. Let V = {a, b}.

1. The Regular Expression a | b denotes the Language {a, b}.
2. The Regular Expression (a | b)(a | b) denotes the Language {aa, ab, ba, bb}.
3. The Regular Expression a* denotes the Language of all strings of zero or more a's, {ϵ, a, aa, aaa, ...}.
4. The Regular Expression (a | b)* denotes the Language of all strings of a's and b's.
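These examples can be checked with Python's re module, whose syntax for |, *, and grouping matches the notation above (fullmatch tests whether the whole string belongs to L(R)):

```python
import re

# Membership tests for the four example languages over V = {a, b}.
assert re.fullmatch(r"a|b", "a")             # a | b denotes {a, b}
assert re.fullmatch(r"(a|b)(a|b)", "ba")     # {aa, ab, ba, bb}
assert re.fullmatch(r"a*", "")               # ϵ is in L(a*)
assert re.fullmatch(r"a*", "aaa")
assert re.fullmatch(r"(a|b)*", "abba")       # all strings of a's and b's
assert not re.fullmatch(r"(a|b)(a|b)", "abb")  # length 3: not in the language
```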

Regular Expressions: Shorthands. Notational shorthands are introduced for frequently used constructors.

1. +, one or more instances: if R is a Regular Expression, then R+ ≡ RR*.
2. ?, zero or one instance: if R is a Regular Expression, then R? ≡ ϵ | R.
3. Character Classes: if a, b, ..., z ∈ V, then [a, b, c] ≡ a | b | c, and [a-z] ≡ a | b | ... | z.

Regular Definitions. Regular Definitions are used to give names to Regular Expressions and then to re-use these names to build new Regular Expressions. A Regular Definition is a sequence of definitions of the form:

D1 → R1
D2 → R2
...
Dn → Rn

where each Di is a distinct name and each Ri is a Regular Expression over the extended alphabet V ∪ {D1, D2, ..., Di−1}. Note: such names for Regular Expressions will often be the Tokens returned by the Lexical Analyzer. As a convention, names are printed in boldface.

Regular Definitions: Examples. Example 1. Identifiers are usually strings of letters and digits beginning with a letter:

letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit)*

Using Character Classes we can define identifiers as: id → [A-Za-z][A-Za-z0-9]*

Regular Definitions: Examples (Cont.) Example 2. Numbers are usually strings such as 5230, 3.14, 6.45E4, 1.84E-4.

digit → 0 | 1 | ... | 9
digits → digit+
optional-fraction → (. digits)?
optional-exponent → (E (+ | −)? digits)?
num → digits optional-fraction optional-exponent
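A rough transliteration of these definitions into Python regular expressions; the pattern strings are a sketch derived from the definitions above, not an official grammar (\d abbreviates [0-9], and the escaped dot matches a literal '.'):

```python
import re

# id  → [A-Za-z][A-Za-z0-9]*
# num → digits optional-fraction optional-exponent
ID  = re.compile(r"[A-Za-z][A-Za-z0-9]*")
NUM = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?")

for lexeme in ["position", "A1", "x"]:
    assert ID.fullmatch(lexeme)

for lexeme in ["5230", "3.14", "6.45E4", "1.84E-4"]:
    assert NUM.fullmatch(lexeme)
```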


Problem 2. The Lexical Analyzer Mechanism: Finite Automata. We need a mechanism to recognize Regular Expressions and thus associate Tokens to Lexemes. While Regular Expressions are a specification language, Finite Automata are their implementation. Given an input string w and a Regular Language L, they answer yes if w ∈ L and no otherwise.

Deterministic Finite Automata. A Deterministic Finite Automaton, DFA for short, is a tuple A = (S, V, δ, s0, F) where:

S is a finite non-empty set of states;
V is the input symbol alphabet;
δ : S × V → S is a total function called the Transition Function;
s0 ∈ S is the initial state;
F ⊆ S is the set of final states.

Deterministic Finite Automata (Cont.) To define when an Automaton accepts a string, we extend the transition function δ to a multiple transition function δ̂ : S × V* → S:

δ̂(s, ϵ) = s
δ̂(s, xa) = δ(δ̂(s, x), a), for x ∈ V*, a ∈ V

A DFA accepts an input string w if, starting from the initial state with w as input, the Automaton stops in a final state: δ̂(s0, w) = f with f ∈ F. The Language accepted by a DFA A = (S, V, δ, s0, F) is: L(A) = {w ∈ V* | δ̂(s0, w) ∈ F}.

Transition Graphs. A DFA can be represented by a Transition Graph where the nodes are the states and each labeled edge represents the transition function. The initial state has an incoming arc marked start. Final states are indicated by double circles. Example. DFA that accepts strings in the Language L((a | b)* abb). [Figure: transition graph with states 0, 1, 2, 3; start arc into state 0; edges 0 →a→ 1, 0 →b→ 0, 1 →a→ 1, 1 →b→ 2, 2 →a→ 1, 2 →b→ 3, 3 →a→ 1, 3 →b→ 0; state 3 is final.]

Transition Table. Transition Tables implement transition graphs, and thus Automata. A Transition Table has a row for each state and a column for each input symbol. The value of cell (si, aj) is the state reached from state si with input aj. Example. The table implementing the previous transition graph has 4 rows and 2 columns. Calling the table δ:

state   a   b
0       1   0
1       1   2
2       1   3
3       1   0
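The table-driven recognizer can be sketched directly from this table; the encoding of δ as a dictionary is an implementation choice for illustration:

```python
# Table-driven DFA for (a|b)*abb, taken from the transition table above.
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
FINAL = {3}

def accepts(w, start=0):
    """Run the extended transition function and test membership in F."""
    state = start
    for ch in w:
        state = delta[(state, ch)]  # δ is total: every (state, symbol) is defined
    return state in FINAL

assert accepts("abb")
assert accepts("aababb")
assert not accepts("abba")
```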


Main Objective. What we want to accomplish: given a way to describe the Lexemes of the input language, automatically generate the Lexical Analyzer. This objective can be split into the following sub-problems:

1. Lexer specification language: how to represent the Lexemes of the input language (Regular Expressions).
2. The lexical analyzer mechanism: how to generate Tokens starting from Lexeme representations (a side effect of the Automaton recognition process).
3. Lexical analyzer implementation: coding (1) + (2) + breaking the input into substrings corresponding to ⟨Lexeme, Token⟩ pairs.

Lexical Analyzer Implementation. Given that the source program is just a single string, the Lexical Analyzer must do two things:

1. Break the input string by recognizing substrings corresponding to Lexemes;
2. Return the pair ⟨Token, Attribute⟩ for each Lexeme.

Compare this procedure to Automata:

1. Automata accept or reject a string; they do not partition it.
2. We need to do more than just implement an Automaton.

Lexical Analyzer Implementation (Cont.) Two important issues:

1. The goal is to partition the source program into Lexemes: this is implemented by reading left-to-right, recognizing one Lexeme at a time.
2. Dealing with conflicts, e.g., pos vs. position as identifiers.


Dealing with Conflicts. There are two possible conflicts: Case 1. The same Lexeme is recognized by two different REs. Case 2. The same RE can recognize a portion of a Lexeme.

Dealing with Conflicts (Case 1). Assume we have the following REs:

R1 → abb
id → letter (letter | digit)*

The Lexeme abb matches both R1 and id. Solution: impose an ordering between the REs. If a1...an ∈ L(Rj) and a1...an ∈ L(Rk), we use the RE listed first (i.e., Rj if j < k). Remark. To distinguish between keywords and identifiers, insert the regular expressions for keywords before the one for identifiers.

Dealing with Conflicts (Case 2). Consider the RE for identifiers and the string position:

id → letter (letter | digit)*

p matches id; po matches id; ...; position matches id. Solution: Maximal Lexemes. The lexeme for a given token must be maximal.

Maximal Lexemes Strategies (1). Solution 1. Use Automata with Lookahead. Example: Automaton for id with lookahead. [Figure: states 0, 1, 2; start arc into state 0; edge 0 →letter→ 1; loops on state 1 for letter and digit; edge 1 →other→ 2 marked with *; state 2 is final.] The label other refers to any character not indicated on any other edge leaving the node. The * indicates that we have read one extra character, so we must retract the forward input pointer by one character.

Maximal Lexemes Strategies (2). Solution 2. Change the response to non-acceptance: don't stop when reaching an accepting state; when failure occurs, revert to the last accepting state. This is the preferred technique, used also by Lex. Example: Automaton for numbers. [Figure: states 0-6; start arc into state 0; a digit transition into the integer part with a digit loop, a '.' transition into the fraction part with its digit loop, and an 'e' transition with an optional sign into the exponent part with its digit loop; the states completing the integer part, the fraction, and the exponent are accepting.] Try it on the input 61.23Express. Exercise. Modify the above Automaton to comply with the Lookahead technique.

Conflict Resolution in Lex. When several possible Lexemes of the input match one or more REs:

1. Always prefer a longer lexeme.
2. If the longest lexeme is matched by two or more REs, prefer the RE listed first.
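Both rules can be sketched in a few lines. This is a simplified illustration of the Lex policy, not Lex itself; the rule set, names, and REs are assumptions for the example:

```python
import re

# Rules in priority order: listed-first wins ties on length.
RULES = [
    ("keyword", re.compile(r"if|var")),
    ("id",      re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("num",     re.compile(r"[0-9]+")),
]

def next_token(text, pos):
    """Try every rule at pos; keep the longest match, break ties by rule order."""
    best = None  # (length, -rule_index, token, lexeme)
    for i, (token, rx) in enumerate(RULES):
        m = rx.match(text, pos)
        if m:
            cand = (len(m.group()), -i, token, m.group())
            if best is None or cand > best:
                best = cand
    return (best[2], best[3]) if best else None

print(next_token("ifx := 1", 0))  # ('id', 'ifx'): the longer lexeme wins
print(next_token("if x", 0))      # ('keyword', 'if'): same length, first rule wins
```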


From Regular Expressions to Lexical Analyzers.

1. For each Token, write the associated Regular Expression;
2. Build the Automaton for each Regular Expression;
3. Combine all the Automata Ai into a single big Automaton by adding a new start state with ϵ-transitions (an ϵ-NFA) to each of the start states of the Ai;
4. Read the input to map Lexemes into Tokens until either the whole input is successfully read or the Automaton fails without recognizing the remaining input (a lexical error).

Lexical Analyzer Example. Consider the following Regular Expressions:

R1 → a
R2 → abb
R3 → a*b+

Construct the Automaton and consider the input abb.

Lexical Analyzer Example (Cont.) The single combined NFA. [Figure: new start state 0 with ϵ-transitions to states 5, 1, and 7. Branch for R1: 5 →a→ 6 (accepting). Branch for R2: 1 →a→ 2 →b→ 3 →b→ 4 (accepting). Branch for R3: 7 →b→ 8 (accepting), with an a-loop on state 7 and a b-loop on state 8.]

Reading the Input. Lexical analysis is the only phase that reads the input. Such reading can be time-consuming, so it is necessary to adopt efficient buffering techniques (see the book, Chapter 3.2, for more details). Two pointers into the input buffer, begin and forward, are maintained. The begin pointer always points at the beginning of the Lexeme to be recognized.

Lexical Analyzer Architecture

Breaking the Input into Lexemes. Two pointers into the input buffer, begin and forward, are maintained:

1. Initially, both pointers point to the first character of the Lexeme to be found;
2. The forward pointer scans ahead until there are no more next states in the Automaton, so we are sure that the longest lexeme has been found;
3. We go back through the sequence of sets of states (assuming the Automaton is an NFA) until we find a set that contains one or more accepting states; otherwise we fail;
4. If there are several accepting states, prefer the state associated with the RE listed first;
5. When the Lexeme is successfully processed, transmit its Token and Attribute to the Parser and set both pointers to the character immediately past the Lexeme;
6. If there are no more input characters, succeed; else go to step 1.
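The driver loop above can be sketched as follows, using Python regular expressions in place of a hand-built automaton; the rule set and token names are illustrative assumptions:

```python
import re

# Priority-ordered rules; 'ws' lexemes are recognized but discarded.
RULES = [("num", re.compile(r"[0-9]+(\.[0-9]+)?")),
         ("id",  re.compile(r"[A-Za-z][A-Za-z0-9]*")),
         ("ws",  re.compile(r"[ \t\n]+"))]

def tokenize(text):
    tokens, begin = [], 0
    while begin < len(text):
        best = None
        for i, (token, rx) in enumerate(RULES):
            m = rx.match(text, begin)          # the forward pointer scans ahead
            if m and (best is None or (len(m.group()), -i) > best[:2]):
                best = (len(m.group()), -i, token, m.group())
        if best is None:
            raise SyntaxError(f"lexical error at position {begin}")
        if best[2] != "ws":                    # eliminate white space
            tokens.append((best[2], best[3]))
        begin += best[0]                       # both pointers move past the lexeme
    return tokens

print(tokenize("x1 12.5 y"))  # [('id', 'x1'), ('num', '12.5'), ('id', 'y')]
```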
