Compiler Design. Lexical Analysis


What is Lexical Analysis? It is the phase where the compiler reads the source program text from the device. (Slide figure: the source program feeds the lexical analyzer, which passes tokens to the parser; the parser requests them with "get next token", and both consult the symbol table.) Reading has to be done character by character, but buffering helps.

What is Lexical Analysis? It detects valid tokens, which are equivalent to words in a text. For example, from the following code segment: If xval <= yval then result := yval, the lexical analyzer will detect the following unbreakable, meaningful components: if, xval, <=, yval, then, result, :=, yval.

What is Lexical Analysis? It checks whether these meaningful components, which are like words, are valid from the point of view of the given language. If a component is valid according to the language, the lexical analyzer determines its type, i.e., its token. In the example, tokens are identified as follows: if is a keyword; <= is a relational operator; xval is an identifier; := is the assignment operator; yval is an identifier; result is an identifier.

What is Lexical Analysis? Question: what is a valid token? Answer: there is a set of rules that defines the valid tokens. For example, the usual rule for an identifier is: starts with a letter, followed by any combination of letters and digits, or by nothing. So xval is a valid identifier. But if we write 9xval, it is not a valid identifier; in fact, generally, it is valid as nothing at all: no keyword, no operator, no numeric value.
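To make the rule concrete, here is a minimal C sketch of such a check (the function name and the use of ASCII letter/digit classes are illustrative assumptions, not something the slides prescribe):

```c
#include <ctype.h>
#include <stdio.h>

/* Sketch of the stated rule: first character a letter,
   every following character a letter or a digit. */
int is_valid_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]))
        return 0;                          /* "9xval" is rejected here */
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))
            return 0;
    return 1;
}

int main(void) {
    printf("xval  -> %d\n", is_valid_identifier("xval"));   /* 1 */
    printf("9xval -> %d\n", is_valid_identifier("9xval"));  /* 0 */
    return 0;
}
```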

Valid Tokens. Valid operators are fixed strings: =, <, <=, >, >=. Valid numerics follow a pattern: starts with a digit, followed by more digits; there may be a decimal point, with digits after the decimal; then there may be an E (exponent) followed by a signed or unsigned integer. Examples: 12.34, 12.34E10, 12.34E-5.

File Reading is Overhead. This is the most expensive phase of the compiler, because it reads the text from the device, an extensive input operation, even though it processes the input character by character while matching set patterns. It is better to read a block (say 1024 bytes) at a time, place it in a buffer, and afterwards process from the buffer. Why? Every read involves a system call, and therefore a context switch. It saves much more time to make one system call per 1024 characters than one per character.

File Reading is Overhead. Without buffering, the following happens in a loop for each character: 1. Read a character from the disk (system call). 2. Compare it with a pattern (user mode). 3. Change state (user mode). 4. Go to 1. So we see that for every input character read there is a system call, which means the context changes from user to system; after the read, the context changes from system back to user.

What is Lexical Analysis? A context change being an overhead, for every input character we incur 2 such overheads. If we could read, say, 1024 characters at once through a single system call, we would cut the context-switch cost by a factor of roughly 1024.

Lexical Analyzer tasks: loop over steps 1 and 2 until end of file. 1. Read a character from disk. THIS IS A SYSTEM CALL: the CPU changes context to the OS, saving the PCB of the user process. 2. Match the pattern. Here the CPU goes back to user mode, changing context again and restoring the PCB of the user process. So, if there is a file of 10,000 characters, context changes take place 20,000 times.

(Slide diagrams: in the character-by-character scheme, control bounces between the user process (pattern match) and the OS (read a character), with a context switch in each direction: 2048 context switches for 1024 characters. With block reading, the user process issues one request, the OS reads 1024 characters, and control returns: only 2 context switches instead of 2048.)
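A minimal C sketch of this block-reading idea, assuming a POSIX read(2) interface; the file name is illustrative, and the inner loop merely counts characters where a real scanner would match patterns:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK 1024   /* one system call per 1024 characters */

int main(void) {
    char buf[BLOCK];
    long chars = 0;
    ssize_t n;
    int fd = open("source.txt", O_RDONLY);   /* illustrative file name */
    if (fd < 0)
        return 1;
    while ((n = read(fd, buf, BLOCK)) > 0) { /* system call: 2 context switches */
        for (ssize_t i = 0; i < n; i++)      /* user mode: no switches at all */
            chars++;                         /* pattern matching would go here */
    }
    close(fd);
    printf("%ld characters scanned\n", chars);
    return 0;
}
```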

How it happens, using a buffer. The buffer holds newval=oldval*12, with two pointers into it: Base (BP) stays at the first character of the current lexeme, the n of newval, while Forward (FP) advances one character at a time over e, w, v, a, l, growing the lexeme n, ne, new, and so on up to newval. When FP reaches =, that character cannot extend the identifier, so the analyzer retracts FP by one and returns the token (Retract; Return(Gettoken)): the string between BP and FP is the next token, provided it has adhered to some rule. Then BP is sprung forward to FP, and the next token will be found the same way.
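The walkthrough can be sketched in C roughly as follows; the function name get_token and the identifier-only matching are simplifying assumptions of this sketch:

```c
#include <ctype.h>
#include <stdio.h>

/* Two-pointer scan: base stays at the start of the lexeme while
   forward advances; when the next character can no longer extend
   the token, forward stops and the lexeme is [base, forward). */
int get_token(const char *base, int *length) {
    const char *forward = base;
    if (!isalpha((unsigned char)*forward))
        return 0;                          /* no identifier starts here */
    while (isalnum((unsigned char)*forward))
        forward++;                         /* the forward pointer advances */
    *length = (int)(forward - base);       /* implicit retract: '=' not consumed */
    return 1;
}

int main(void) {
    const char *buf = "newval=oldval*12";
    int len;
    if (get_token(buf, &len))
        printf("token: %.*s\n", len, buf); /* prints "newval" */
    return 0;
}
```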

Okay: input characters are read in blocks and put in a buffer (an array) in main memory, and the characters are scanned one by one. It follows that when the last character in the buffer has been read and processed, the buffer needs to be reloaded. Consider a scenario where the buffer ends before the variable/identifier oldval is complete: the buffer holds newval = old, with Forward at its last character, partway through the incomplete lexeme.

Obviously, the next block has to be read into the buffer, and as a result the current buffer is overwritten: it now holds val * 12. Therefore the previous content of the buffer is lost, and BP no longer points to the earlier content; the old prefix of the half-read lexeme is gone.

Way out: two buffers, or a split buffer, reloaded alternately. Initially both buffers are empty, with Base and Forward at the beginning. Read the first block into the first buffer: it holds newval = old. Keep scanning and processing until FP reaches the last character; that means the lexical analyzer is inside the last, potentially incomplete, token. Now read the next block and load it into the second buffer, so the pair holds newval = old followed by val * 12; nothing in the first buffer is overwritten, and the incomplete lexeme survives. Move the forward pointer ahead by one character: it thereby enters the second buffer. Continue processing as if the pair were a single buffer; the token oldval is completed across the boundary.

Necessity of buffering (A): to read a block of characters at a time, reducing both context-switching time and I/O time. Otherwise the process would be: Loop: issue an I/O request to read the next character (system call); execute the processing logic (user mode); go back to Loop. There would be 1024 system calls for reading 1024 characters, each one separately requesting the disk controller to read a single character. Using block read and buffering, one system call and one I/O request are issued per block of characters, for example per 1024 characters.

Necessity of buffering (B): sometimes the lexical analyzer has to look ahead in order to identify a token. Example: two similar-looking FORTRAN statements, with their meanings. (i) DO 5 I = 1.25 (DO5I is a variable; set its value to 1.25; FORTRAN allows spaces in variable names). (ii) DO 5 I = 1,25 (execute through line 5 for I = 1 to 25). The difference between the two is the . versus the , between 1 and 25. The lexical analyzer has to read forward (look ahead) beyond FP to detect the presence of a comma, and after that it goes back to FP.
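A toy C sketch of this lookahead decision; it assumes the scanner stands just past the 1 and simply scans ahead for the deciding character (real FORTRAN lexing is more involved):

```c
#include <stdio.h>

typedef enum { DO_LOOP, ASSIGNMENT } stmt_kind;

/* Look ahead from just past "DO 5 I = 1" without consuming input:
   a ',' means a DO loop, a '.' means an assignment to variable DO5I. */
stmt_kind classify(const char *ahead) {
    while (*ahead != '\0' && *ahead != ',' && *ahead != '.')
        ahead++;                     /* halt on the deciding character */
    return (*ahead == ',') ? DO_LOOP : ASSIGNMENT;
}

int main(void) {
    printf("%d\n", classify(",25")); /* 0 = DO_LOOP:    "DO 5 I = 1,25" */
    printf("%d\n", classify(".25")); /* 1 = ASSIGNMENT: "DO 5 I = 1.25" */
    return 0;
}
```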

Limitations of the buffer pair. (1) The lookahead character may lie beyond the end of the buffer, as in PL/1: in DECLARE (ARG1, ARG2, ..., ARGn), determining whether DECLARE is an array name or a keyword requires looking ahead all the way to the closing ). (2) With every character scanned, the lexical analyzer has to check whether it is at the end of a block; since there are two buffers, it has to check twice, once for the end of buffer 1 and once for the end of buffer 2. The algorithm and its more efficient alternative are furnished in the following two slides.

Algorithm. With every character read, the program has to check for end of buffer (check the 1st half first; if not at its end, check the 2nd):
if FWD = end of 1st half: reload the 2nd half, and FWD moves on into it;
else if FWD = end of 2nd half: reload the 1st half and move FWD to the beginning of the 1st half;
else: FWD := FWD + 1;
then back to the loop.
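In C, the flowchart might look roughly like this; reload is a sketched helper, src must have been opened and the 1st half preloaded, and the reload timing is simplified:

```c
#include <stdio.h>

#define N 1024                    /* size of each buffer half */

static char buf[2 * N];           /* the two halves live side by side */
static char *forward = buf;
static FILE *src;                 /* source file: open it and preload the 1st half */

/* Sketched helper: read up to N characters into the given half. */
static void reload(char *half) {
    size_t n = fread(half, 1, N, src);
    if (n < N)
        half[n] = '\0';           /* crude end-of-input mark, for this sketch only */
}

/* Every advance tests both half boundaries, exactly as in the flowchart. */
static void advance(void) {
    forward++;
    if (forward == buf + N) {            /* FWD = end of 1st half? */
        reload(buf + N);                 /* reload 2nd half; forward is already there */
    } else if (forward == buf + 2 * N) { /* FWD = end of 2nd half? */
        reload(buf);                     /* reload 1st half */
        forward = buf;                   /* move FWD to its beginning */
    }
}
```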

Alternative (more efficient): use a sentinel value for end of buffer by putting a $ at the end of each buffer half. On every character, test only FWD = $?
no: back to the loop (the common case);
yes: if FWD = end of 2nd half, reload the 1st half and move FWD to its beginning; otherwise it must be the end of the 1st half, so reload the 2nd half; then back to the loop.
Though this looks like two checks, the second check is made only when FWD = $, which happens once per buffer half. If each half holds 1024 characters, then 1023 times out of 1024 the second check does not happen.
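A corresponding C sketch of the sentinel scheme; it uses '\0' in place of the slides' $, and the buffer layout and helper names are assumptions of the sketch (open src, call reload(buf), and set forward = buf before scanning):

```c
#include <stdio.h>

#define N 1024
#define SENTINEL '\0'            /* stands in for the slides' '$' */

static char buf[2 * N + 2];      /* one extra slot per half for the sentinel */
static char *forward = buf;
static FILE *src;

/* Fill a half and plant the sentinel right after the valid characters. */
static int reload(char *half) {
    int n = (int)fread(half, 1, N, src);
    half[n] = SENTINEL;
    return n;
}

/* Layout: buf[0..N-1] + sentinel at buf[N]; buf[N+1..2N] + sentinel at buf[2N+1].
   The hot path is a single comparison against the sentinel. */
static int next_char(void) {
    char c = *forward++;
    if (c != SENTINEL)
        return c;                          /* 1023 times out of 1024: done */
    if (forward == buf + N + 1) {          /* sentinel ending the 1st half */
        if (reload(buf + N + 1) == 0) return EOF;
        return *forward++;
    }
    if (forward == buf + 2 * N + 2) {      /* sentinel ending the 2nd half */
        if (reload(buf) == 0) return EOF;
        forward = buf;                     /* wrap back to the 1st half */
        return *forward++;
    }
    return EOF;                            /* sentinel inside a half: true end of input */
}
```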

Specification of Tokens. Regular expressions are used for recognizing patterns. Let's understand how a regular pattern is represented. First of all, a single character is a regular language in itself. (1) Union operation on regular languages: the single letter A is itself a regular language; likewise B, and C, and so on. Therefore {A, B, C, ..., Z} is regular, since it is the union of regular languages.

Specification of Tokens. Say L = {A, B, C, ..., Z, a, b, c, ..., z} (all the letters) and D = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} (all the digits). Then Union: L U D = {A, B, C, ..., Z, a, b, c, ..., z, 0, 1, 2, ..., 9} is also a regular language, representing the set of all letters and digits. (2) Concatenation: L.D consists of each element of L concatenated with each element of D, resulting in the set {A0, A1, A2, ..., A9, B0, B1, B2, ..., B9, ..., z9}, which is also regular.

Specification of Tokens. (3) Exponent / Kleene closure: L is regular, so the concatenation L.L, which can be written as L^2, is also regular. Similarly L^3, L^4, L^5, and so on are all regular. Also, ε is regular, and L^0 = {ε}. Therefore L^0 + L^1 + L^2 + L^3 + L^4 + ... (any number of repetitions) is regular. This is called the Kleene closure, written L*.
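In symbols, these standard definitions read:

```latex
L^{0} = \{\varepsilon\}, \qquad
L^{i} = L \cdot L^{i-1} \quad (i \ge 1), \qquad
L^{*} = \bigcup_{i \ge 0} L^{i}
```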

Specification of Tokens. (4) Transpose / reverse: the reverse of a regular string is also regular, and the set of the reverses of all strings in a regular language is also regular. So, {A0, A1, A2} being regular, {0A, 1A, 2A} is also regular.

Specification of Tokens: Regular Expressions. A regular language is represented by a regular expression. As mentioned before, a single character is a regular expression. As with regular languages, the union, concatenation, Kleene closure, and reversal of regular expressions result in regular expressions. So, if R1 and R2 are two regular expressions, then the following are also regular expressions: R1+R2 (alternatively written R1|R2), R1.R2, R1*, and any combination of the above three.

Specification of Tokens: Regular Expressions. Example: identifiers in Pascal. An identifier starts with a letter, followed by any number of letters and digits in any order.
Letter -> A | B | C | ... | Z | a | b | c | ... | z
Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Id -> Letter . (Letter | Digit)*
Note: (1) Letter and Digit have to be defined before Id. (2) Character-class notation can also be used to declare Letter and Digit: Letter -> [A-Za-z], Digit -> [0-9].
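As a sketch, the same identifier pattern can be tested with the POSIX regex library; the anchored pattern and the test strings are illustrative:

```c
#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    const char *tests[] = { "xval", "9xval", "A1b2" };
    /* ^ and $ force the whole string to match [A-Za-z][A-Za-z0-9]* */
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    for (int i = 0; i < 3; i++)
        printf("%-6s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "Id" : "not an Id");
    regfree(&re);
    return 0;
}
```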

Specification of Tokens: Regular Expressions. Example: floating-point numbers. Rule: digits, optionally followed by a decimal point and digits after the decimal, optionally followed by E and a signed or unsigned integer.
Digit -> [0-9]
Digits -> Digit . (Digit)*
Optional_Fraction -> (. Digits) | ε
Optional_Exponent -> (E (+ | - | ε) Digits) | ε
Floating_point_number -> Digits . Optional_Fraction . Optional_Exponent
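Collapsed into a single POSIX extended regular expression, the same rule can be checked as follows (the pattern and the test strings are this sketch's own):

```c
#include <regex.h>
#include <stdio.h>

int main(void) {
    /* Digits, optional fraction, optional signed exponent */
    const char *pat = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    const char *tests[] = { "12", "12.34", "12.34E10", "12.34E-5", "12." };
    regex_t re;
    if (regcomp(&re, pat, REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    for (int i = 0; i < 5; i++)
        printf("%-9s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "match" : "no match");
    regfree(&re);
    return 0;
}
```

Note that 12. reports no match, consistent with the rule that a decimal point must be followed by digits.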