Chapter 2 Lexical Analysis

Lexical analysis, or scanning, is the process that reads the stream of characters making up the source program from left to right and groups them into tokens. The lexical analyzer takes a source program as input and produces a stream of tokens as output. The particular character sequences that the lexical analyzer recognizes as instances of tokens are called lexemes. Each token is then passed to the next phase of the compiler, i.e. syntax analysis. It is common for a lexical analyzer to interact with the symbol table as well: when the lexical analyzer discovers a lexeme constituting an identifier, it enters that lexeme into the symbol table. In some cases, information concerning the kind of identifier may be read back from the symbol table by the lexical analyzer to assist it in determining the suitable token it must pass to the parser. Figure 2.1 shows the role of a lexical analyzer.

Figure 2.1: Role of Lexical Analyzer

2.1 Constituents of Lexical Analysis

Token: A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or

a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. Some examples of tokens in C are: keywords (e.g. int, while), identifiers (e.g. rate, total), constants (e.g. 10, 2.5), strings (e.g. "total", "hello"), special symbols (e.g. ( ), { }), and operators (e.g. +, /, -, *).

Pattern: A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that forms the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

Lexeme: A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. Table 2.1 shows examples of tokens, patterns and lexemes used in the C language.

Table 2.1: Example of Token, Pattern, Lexemes

Token     Lexeme           Pattern
ID        x, y, n0         letter followed by letters and digits
NUM       -123, 1.456e-5   any numeric constant
IF        if               if
LPAREN    (                (
RPAREN    )                )
LITERAL   "Hello"          any string of characters

For example, if we consider the C statement printf("Final = %d", Number); both printf and Number are lexemes matching the pattern for token ID, and "Final = %d" is a lexeme matching LITERAL. ( and ) match the tokens LPAREN and RPAREN respectively.

When more than one lexeme can match a pattern, the lexical analyzer must provide additional information about the particular lexeme. It therefore returns to the subsequent compiler phases not only a token name but also an attribute value that describes the lexeme represented by the token. The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.
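The token-name/attribute-value pairing described above can be modeled directly in C. The following sketch is illustrative only: the type names (TokenName, Token), the tiny fixed-size symbol table, and the helper functions are our own assumptions, not part of any particular compiler.

```c
#include <string.h>

/* Abstract token names, as handed to the parser. */
typedef enum { TOK_ID, TOK_NUM, TOK_IF, TOK_LPAREN, TOK_RPAREN, TOK_LITERAL } TokenName;

/* A token is a pair: an abstract name plus an optional attribute.
   For TOK_ID the attribute is an index into the symbol table. */
typedef struct {
    TokenName name;
    int attribute;   /* e.g. symbol-table index; unused for single-lexeme tokens */
} Token;

/* Hypothetical symbol table: intern a lexeme and return its index. */
char symtab[32][64];
int symtab_len = 0;

int symtab_intern(const char *lexeme) {
    for (int i = 0; i < symtab_len; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return i;                 /* already present: reuse its index */
    strcpy(symtab[symtab_len], lexeme);
    return symtab_len++;
}

/* Build the <ID, n> token for an identifier lexeme. */
Token make_id_token(const char *lexeme) {
    Token t = { TOK_ID, symtab_intern(lexeme) };
    return t;
}
```

With this sketch, scanning the identifiers printf and Number would yield tokens whose shared name says "identifier" while distinct attribute values say *which* identifier (indices are 0-based here).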
For the C statement printf("Final = %d", Number); the tokens returned would be:

<ID,1> <LPAREN> <LITERAL> <,> <ID,2> <RPAREN> <;>

Here more than one identifier is discovered, so to differentiate them a numeric attribute value is attached to each ID token.

2.2 Input Buffering

There are three general approaches to the implementation of a lexical analyzer:

1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

Because of the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the overhead required to process a single input character. Two important buffering techniques are described below.

2.2.1 Buffer Pairs

In this technique two pointers into the input are maintained. The first pointer, lexemeBegin, marks the beginning of the current lexeme, whose extent we are attempting to determine, while the second pointer, forward, scans ahead until a pattern match is found. Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.

2.2.2 Sentinels

If we use the idea of buffer pairs, we must make sure each time we advance forward that we have not moved off the end of one of the buffers; if we have, then we must reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character was read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character EOF. Note that EOF retains its use as a marker for the end of the entire input.
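The buffer-pair scheme with sentinels can be sketched in C roughly as follows. The buffer size, the sentinel value, and the reload stubs are illustrative assumptions; a real scanner would refill each half from the source file in BUF_SIZE chunks and plant the sentinel after the last character read.

```c
#include <stdio.h>

#define BUF_SIZE 4096
#define SENTINEL '\0'   /* stand-in for EOF; assumed never to occur in source text */

char buf[2 * BUF_SIZE + 2];       /* two buffers, each followed by a sentinel slot */
char *lexeme_begin = buf;         /* start of the lexeme being scanned */
char *forward = buf;              /* scans ahead looking for a pattern match */

/* Reload one half of the buffer pair (stubbed out in this sketch). */
void reload_second_buffer(void) { /* read next BUF_SIZE chars from the source */ }
void reload_first_buffer(void)  { /* read next BUF_SIZE chars from the source */ }

/* Advance `forward` one character. Only when the sentinel is seen do we
   pay for the extra test that decides buffer-end vs. real end of input. */
int advance(void) {
    int c = *forward++;
    if (c == SENTINEL) {
        if (forward == buf + BUF_SIZE + 1) {            /* end of first buffer */
            reload_second_buffer();                     /* continue in second half */
        } else if (forward == buf + 2 * BUF_SIZE + 2) { /* end of second buffer */
            reload_first_buffer();
            forward = buf;                              /* wrap back to first half */
        } else {
            return EOF;                                 /* sentinel mid-buffer: real end */
        }
        c = *forward++;
    }
    return c;
}
```

The common case (an ordinary character) costs exactly one test; the two-way boundary check runs only when a sentinel is actually seen.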
Any EOF that appears other than at the end of a buffer means that the input is at an end.

2.3 Token Specification

The patterns corresponding to a token are generally specified using a compact notation known as a regular expression. Regular expressions of a language are created by combining members of its alphabet. A regular expression r corresponds to a set of strings L(r), where L(r) is called

a regular set or a regular language, and may be infinite. Regular expressions are defined as follows:

A basic regular expression a denotes the set {a}, where a ∈ Σ; L(a) = {a}.
The regular expression ɛ denotes the set {ɛ}. Technically, the regular expression ɛ is different from the string ɛ; here ɛ represents the empty string.

If r and s are two regular expressions denoting the sets L(r) and L(s), then the following are rules for regular expressions:

R1. r | s is a regular expression denoting the union L(r) ∪ L(s)
R2. rs is a regular expression denoting the concatenation L(r)L(s)
R3. r* is a regular expression denoting the Kleene closure L(r)*
R4. (r) is a regular expression denoting the set L(r)

Following are some examples of regular expressions:

0 | 1 denotes the set {0, 1}, as per rule R1.
0* denotes the set {ɛ, 0, 00, 000, 0000, ...}, as per rule R3.
(0 | 1)(0 | 1) denotes the set {00, 01, 10, 11}, as per rules R1 and R2.
(0 | 1)* denotes the set {ɛ, 0, 1, 00, 01, 10, 11, 000, 001, ...}, as per rules R1 and R3.
0 | 0*1 denotes the set {0, 1, 01, 001, 0001, ...}, as per rules R1 and R3.

2.3.1 Regular Definitions

We may assign a name to a regular expression in order to use and reuse that name in other (more complex) regular expressions and to enhance the readability of long regular expressions. Suppose the following regular definitions are given:

digit = [0-9]        (any digit in the range 0 through 9)
letter = [A-Za-z]    (any letter from capital A through Z or small a through z)
eol = [\n]           (end of line)
neol = [^\n]         (any character other than newline)

We can use these regular definitions to write more complex regular expressions, for example:

Integer_Literal = digit+
Fixed_Point_Literal = digit+ . digit+
Floating_Point_Literal = digit+ . digit+ (e|E) (+|-)? digit+
Identifier = letter (letter | digit)*

2.4 Token Recognition

The previous section described the specification of the tokens of a language using the compact notation of regular expressions. This section elaborates how to construct recognizers that can identify the tokens occurring in an input stream. These recognizers are known as finite automata. A finite automaton (FA) consists of:

- A finite set of states
- A set of transitions (or moves) between states; the transitions are labeled by characters from the alphabet
- A special start state
- A set of final or accepting states

A finite automaton for Identifier = letter (letter | digit)* is shown below in Figure 2.2.

Figure 2.2: A finite automaton for Identifier

2.4.1 Deterministic Finite Automata (DFA)

A Deterministic Finite Automaton (DFA) is a 5-tuple M = (Q, Σ, δ, S, F) consisting of:

1. A finite set of states Q

2. A finite set of input symbols Σ
3. A transition function δ : Q × Σ → Q
4. A start state S ∈ Q
5. A set of accepting states F ⊆ Q

A DFA takes an input string w over the alphabet Σ, and either accepts or rejects the string. Identifying acceptance with the value 1 and rejection with 0, one can think of a DFA as a machine that takes a string w as input and outputs a single bit b ∈ {0, 1}. A DFA can be represented by a transition table T indexed by state s and input character c: T[s][c] is the next state to visit from state s if the input character is c. T can also be described as a transition function T : Q × Σ → Q that maps the pair (s, c) to the next state. The DFA and transition table for a C comment are shown in Figure 2.3 and Table 2.2. Empty entries (shown as -) represent an error state. A full transition table contains one column for each character, which may waste space; characters that are treated identically by the DFA are therefore combined into character classes.

Figure 2.3: DFA for C Comments

Table 2.2: Transition Table for C Comments

State   /   *   other
1       2   -   -
2       -   3   -
3       3   4   3
4       5   4   3
5 (accepting)
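A transition table like Table 2.2 translates almost directly into a table-driven recognizer. The sketch below is an illustration under our own conventions: state 0 plays the role of the error (trap) state, state 5 is the accepting state, and the function names are invented.

```c
/* States 1..5 as in Table 2.2; state 0 is the error (trap) state. */
enum { ERR = 0, START = 1, ACCEPT = 5 };

/* Character classes as table columns: 0 = '/', 1 = '*', 2 = anything else. */
static int column(int c) { return c == '/' ? 0 : c == '*' ? 1 : 2; }

static const int T[6][3] = {
    /* err */ {0, 0, 0},
    /*  1  */ {2, 0, 0},   /* expect the opening '/' */
    /*  2  */ {0, 3, 0},   /* expect the opening '*' */
    /*  3  */ {3, 4, 3},   /* inside the comment body */
    /*  4  */ {5, 4, 3},   /* saw a '*': '/' would close the comment */
    /*  5  */ {0, 0, 0},   /* accepting: the comment has ended */
};

/* Returns 1 if the string s is exactly one complete C comment, 0 otherwise. */
int is_c_comment(const char *s) {
    int state = START;
    for (; *s; s++)
        state = T[state][column((unsigned char)*s)];
    return state == ACCEPT;
}
```

Note how character classes keep the table at three columns instead of one per character, exactly as described above.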

2.4.2 Non-Deterministic Finite Automata (NFA)

An NFA is a 5-tuple M = (Q, Σ, δ, S, F) consisting of:

1. A finite set of states Q
2. A finite set of input symbols Σ
3. A transition function δ : Q × (Σ ∪ {ɛ}) → P(Q)
4. A start state S ∈ Q
5. A set of accepting states F ⊆ Q

The only difference between a DFA and an NFA is in the transition function δ, and computations are defined in the same style as for DFAs. An NFA is similar to a DFA except that multiple transitions labeled by the same character are allowed from the same state, and ɛ-transitions are allowed. ɛ-transitions are spontaneous: they occur without consuming any input character. Figure 2.4 and Figure 2.5 show a DFA and an NFA for relational operators.

Figure 2.4: DFA for Relational Operators

Figure 2.5: NFA for Relational Operators
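In practice an NFA is simulated by tracking a *set* of current states rather than a single state. The sketch below simulates a small hand-coded NFA for the regular expression a(a|b)*b over the alphabet {a, b} — an invented example, not the relational-operator NFA of Figure 2.5 — representing state sets as bitmasks. (An NFA with ɛ-transitions would additionally need an ɛ-closure step after each move.)

```c
/* NFA for a(a|b)*b: states 0, 1, 2; accepting set = {2}.
   delta[state][symbol] is a bitmask of successor states.
   The nondeterminism: from state 1 on 'b' we may go to 1 or to 2. */
enum { SYM_A = 0, SYM_B = 1 };

static const unsigned delta[3][2] = {
    /* state 0 */ { 1u << 1, 0 },                      /* a -> {1}         */
    /* state 1 */ { 1u << 1, (1u << 1) | (1u << 2) },  /* a -> {1}, b -> {1,2} */
    /* state 2 */ { 0, 0 },
};

/* Simulate the NFA on a string of 'a'/'b' characters; 1 = accepted. */
int nfa_accepts(const char *s) {
    unsigned current = 1u << 0;                /* start in state 0 */
    for (; *s; s++) {
        int sym = (*s == 'a') ? SYM_A : SYM_B;
        unsigned next = 0;
        for (int q = 0; q < 3; q++)            /* union of all successor sets */
            if (current & (1u << q))
                next |= delta[q][sym];
        current = next;
    }
    return (current & (1u << 2)) != 0;         /* accept if state 2 is reachable */
}
```

This set-tracking simulation is exactly what the subset construction freezes into a DFA ahead of time: each reachable bitmask becomes one DFA state.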

2.5 Lexical Analyzer Generator

A lexical analyzer generator, or scanner generator, generates lexical analyzers that can be used to scan a file. Lex and Flex are the two most popular scanner generators available on UNIX and Linux platforms. They take as input a specification of requirements in the form of regular expressions and generate C code that performs lexical analysis of the file supplied as input, i.e., they generate a lexical analyzer. Figure 2.6 shows the working of lex/flex and Figure 2.7 gives a general template for writing lex/flex specifications.

Figure 2.6: Working of Lex/Flex

Figure 2.7: Lex/Flex Specification Template

2.5.1 Definition Section

This section defines header files to include in the code, macros, and basic declarations of variables, functions, keywords, special patterns, etc. It is copied verbatim to the generated C file. We include the following code in our definition section:

#include<stdio.h>
int vowels=0;
int cons=0;

2.5.2 Rule Section

This section pairs regular expression patterns with C statements. When the scanner matches a declared pattern in the input file, it executes the code associated with that pattern. Based on the declarations in the definition section, we define the following rules:

[aeiouAEIOU] {vowels++;}

This rule means that whenever a vowel is read, the vowel count is incremented.

[a-zA-Z] {cons++;}

This rule means that whenever a consonant is read, the consonant count is incremented.

2.5.3 User Subroutines

This section contains the main function, definitions of functions declared in the definition section, and other relevant C code. These statements are copied directly to the generated source file. Functions defined here are invoked from the actions written in the rules section.

int main()
{
    printf("Enter the string.. at end press ^d ");
    yylex();
    printf("No of vowels=%d No of consonants=%d", vowels, cons);
    return 0;
}

When lex compiles the input specification, it generates the C file lex.yy.c, which contains the routine yylex(). This routine reads the input and tries to match it against the token patterns specified in the rules section. On a match, the associated action is executed. If there is more than one match, the action associated with the pattern that matches the most text (including context) is executed. If two or more patterns still match the same amount of text, the action associated with the pattern listed first in the specification file is executed. If no match is found, the default action is executed. The input text (lexeme) associated with the recognized token is placed in the global variable yytext. A detailed description of using the lex/flex compiler is given in Appendix A.
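The longest-match rule just described can be sketched without lex at all. The example below tries a few hypothetical operator patterns against the front of the input and keeps the longest match, with earlier-listed rules winning ties; the pattern list and token names are our own assumptions for illustration.

```c
#include <string.h>

/* Hypothetical patterns, in specification order (earlier wins on a tie). */
static const struct { const char *pat; const char *name; } rules[] = {
    { "=",  "ASSIGN" },
    { "==", "EQ"     },
    { "<",  "LT"     },
    { "<=", "LE"     },
};

/* Match the longest rule that is a prefix of `input`.
   Stores the match length in *len and returns the token name,
   or NULL if no rule matches. */
const char *longest_match(const char *input, int *len) {
    const char *best = NULL;
    *len = 0;
    for (unsigned i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        int n = (int)strlen(rules[i].pat);
        /* Keep strictly longer matches only: on equal length the
           earlier-listed rule (already stored) wins, as in lex. */
        if (n > *len && strncmp(input, rules[i].pat, n) == 0) {
            best = rules[i].name;
            *len = n;
        }
    }
    return best;
}
```

On the input "==x" this returns EQ with length 2 rather than ASSIGN with length 1, which is why a==b is scanned as a comparison and not as two assignments.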

Example: To count the number of vowels and consonants in a given string.

%{
#include<stdio.h>
int vowels=0;
int cons=0;
%}
%%
[aeiouAEIOU] vowels++;
[a-zA-Z] cons++;
%%
int yywrap()
{
    return 1;
}
int main()
{
    printf("Enter the string.. at end press ^d ");
    yylex();
    printf("No of vowels=%d No of consonants=%d", vowels, cons);
    return 0;
}

Using the approach described in this chapter, a lexical analyzer can be designed to perform a specific lexical-analysis task.
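For comparison, the same counting logic can be written in plain C without lex. This hand-written version (the function name and globals are ours) makes explicit the per-character decisions that the generated scanner performs, including the rule order: the vowel test fires first, so the letter test only ever catches consonants.

```c
#include <string.h>

int vowel_count, consonant_count;

/* Classify each character the way the two lex rules do:
   vowels first, then any remaining letter counts as a consonant. */
void count_vowels_and_consonants(const char *s) {
    vowel_count = consonant_count = 0;
    for (; *s; s++) {
        if (strchr("aeiouAEIOU", *s))
            vowel_count++;
        else if ((*s >= 'a' && *s <= 'z') || (*s >= 'A' && *s <= 'Z'))
            consonant_count++;
        /* anything else (digits, spaces, punctuation) is ignored */
    }
}
```

For the input "Hello World" this counts 3 vowels and 7 consonants, and the space matches neither rule.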