We use the term language to mean any set of strings formed from some specific alphabet. The notion of concatenation can also be applied to languages. If L and M are languages, then L.M, or just LM, is the language consisting of all strings xy that can be formed by selecting a string x from L and a string y from M and concatenating them in that order. That is,

    LM = {xy | x is in L and y is in M}

We call LM the concatenation of L and M.

Example: Let L be {0, 01, 110} and let M be {10, 110}. Then LM = {010, 0110, 01110, 11010, 110110}. The string 0110 arises twice (as 0.110 and as 01.10) but is listed once.

The symbol · is the concatenation operator on strings:

    w1 = fire, w2 = truck
    w1·w2 = firetruck
    w2·w1 = truckfire
    w2·w2 = trucktruck

We often drop the ·, writing w1w2 = firetruck. For any string w, wε = w.

We can concatenate languages as well as strings:

    L1L2 = {wv : w ∈ L1 and v ∈ L2}
    {a, ab}{ε, b} = {a, ab, abb}
    {a, ab}{a, ab} = {aa, aab, aba, abab}
    {a, aa}{a, aa} = {aa, aaa, aaaa}

We use $L^i$ to stand for LL...L (i times). It is logical to define $L^0$ to be {ε}. The union of languages L and M is given by

    L ∪ M = {x | x is in L or x is in M}

The empty set, ∅, is the identity under union, since ∅ ∪ L = L ∪ ∅ = L, and ∅L = L∅ = ∅.
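The operations above can be tried out on small finite languages. The following is a minimal sketch, assuming languages are represented as Python sets of strings and ε as the empty string ""; the function names `concat` and `power` are illustrative and do not come from the notes.

```python
from itertools import product

def concat(L, M):
    """Return LM = {xy | x in L and y in M} for finite languages (sets of strings)."""
    return {x + y for x, y in product(L, M)}

def power(L, i):
    """Return L^i, with L^0 defined as {""} (the language holding only the empty string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

L = {"0", "01", "110"}
M = {"10", "110"}
print(sorted(concat(L, M)))          # the five strings of LM from the example
print(sorted(L | M))                 # union is ordinary set union
print(concat(set(), L) == set())     # the empty set concatenated with L is empty
```

Because the result is a set, the string 0110, which can be formed in two ways, is stored only once.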

There is another operation on languages which plays an important role in specifying tokens. This is the Kleene closure operator. We use $L^*$ to denote the concatenation of language L with itself any number of times:

$$L^* = \bigcup_{i=0}^{\infty} L^i$$

Example: Let D be the language consisting of the strings 0, 1, ..., 9, that is, each string is a single decimal digit. Then $D^*$ is all strings of digits, including the empty string. As another example, if L = {aa}, then $L^*$ is all strings of an even number of a's, since $L^0 = \{\varepsilon\}$, $L^1 = \{aa\}$, $L^2 = \{aaaa\}$, and so on.

If we wished to exclude ε, we could write $L.(L^*)$ to denote that language. That is,

$$L.(L^*) = L.\bigcup_{i=0}^{\infty} L^i = \bigcup_{i=0}^{\infty} L^{i+1} = \bigcup_{i=1}^{\infty} L^i$$

We shall often use $L^+$ for $L.(L^*)$. The unary postfix operator + is called positive closure, and denotes "one or more instances of".

A Simple Approach to the Design of Lexical Analyzers

There are two primary methods for implementing a scanner. The first is a program that is hard-coded to perform the scanning tasks. The second uses regular expressions and finite automata theory to model the scanning process.

One way to begin the design of any program is to describe the behavior of the program by a flowchart. This approach is particularly useful when the program is a lexical analyzer, because the action taken is highly dependent on what characters have been seen recently. Remembering previous characters by the position in a flowchart is a valuable tool, so much so that a specialized kind of flowchart for lexical analyzers, called a transition diagram, has evolved. In a transition diagram, the boxes of the flowchart are drawn as circles and called states. The states are connected by arrows, called edges. The labels on the various edges leaving a state indicate the input characters that can appear after that state.

For example, an identifier can be specified as:

    identifier = letter (letter | digit)*
    digit      = [0-9]
    letter     = [A-Za-z]
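Since $L^*$ is usually infinite, it can only be enumerated up to some bound. The sketch below assumes a length cutoff `max_len` (an assumption made for illustration, not part of the definition) and builds $L^1, L^2, \ldots$ in turn until no new string fits under the cutoff; the name `closure_up_to` is hypothetical.

```python
def closure_up_to(L, max_len, positive=False):
    """Strings of L* (or of L+ when positive=True) whose length is at most max_len."""
    result = set() if positive else {""}   # L* always contains the empty string
    frontier = {""}                        # strings still to be extended by one more factor from L
    while frontier:
        # Append one more string of L to every frontier string, dropping over-long results.
        extended = {x + y for x in frontier for y in L if len(x + y) <= max_len}
        frontier = extended - result       # only strings not seen before need extending again
        result |= frontier
    return result

print(sorted(closure_up_to({"aa"}, 6), key=len))                  # even-length strings of a's
print(sorted(closure_up_to({"aa"}, 6, positive=True), key=len))   # the same without the empty string
```

Extending only the newly found strings keeps the loop finite even when L itself contains the empty string.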

[Fig. 6: Transition diagram for identifier. From the start state 0, an edge labeled "letter" leads to state 1; state 1 has an edge labeled "letter or digit" back to itself and an edge labeled "delimiter" to state 2, which is marked *.]

Fig. 6 shows a transition diagram for an identifier, defined to be a letter followed by any number of letters or digits. The starting state of the transition diagram is state 0, the edge from which indicates that the first input character must be a letter. If this is the case, we enter state 1 and look at the next input character. If that is a letter or a digit, we re-enter state 1 and continue this way, reading letters and digits and making transitions from state 1 to itself, until the next input character is a delimiter for an identifier, which we have assumed is any character that is not a letter or a digit. On reading the delimiter, we enter state 2.

To turn a collection of transition diagrams into a program, we construct a segment of code for each state. The first step to be done in the code for any state is to obtain the next character from the input buffer. For this purpose we use a function GETCHAR, which returns the next character, advancing the lookahead pointer at each call. The next step is to determine which edge, if any, out of the state is labeled by a character or class of characters that includes the character just read. If such an edge is found, control is transferred to the state pointed to by that edge. If no such edge is found, and the state is not one which indicates that a token has been found (indicated by a double circle), we have failed to find this token. The lookahead pointer must be retracted to where the beginning pointer is, and another token must be searched for, using another transition diagram. If all transition diagrams have been tried without success, a lexical error has been detected, and an error correction routine must be called.

Consider the transition diagram in Fig. 6. The code for state 0 might be:

    State 0: C := GETCHAR();
             if LETTER(C) then goto state 1
             else FAIL()

Here, LETTER is a procedure which returns true if and only if C is a letter. FAIL is a routine which retracts the lookahead pointer and starts up the next transition diagram, if there is one, or calls the error routine. The code for state 1 is:

    State 1: C := GETCHAR();
             if LETTER(C) or DIGIT(C) then goto state 1
             else if DELIMITER(C) then goto state 2
             else FAIL()

DIGIT is a procedure which returns true if and only if C is one of the digits 0, 1, ..., 9. DELIMITER is a procedure which returns true whenever C is a character that could follow an identifier. If we define a delimiter to be any character that is not a letter or digit, then the clause "if DELIMITER(C) then" need not be present in state 1. To detect errors more effectively, we might define a delimiter precisely (e.g., blank, arithmetic or logical operator, left or right parenthesis, equal sign, colon, semicolon, or comma), depending on the language being compiled.

State 2 indicates that an identifier has been found. Since the delimiter is not part of the identifier, we must retract the lookahead pointer one character, for which we use a procedure RETRACT. We use * to indicate states on which input retraction must take place. We must also install the newly found identifier in the symbol table if it is not already there, using the procedure INSTALL. In state 2 we return a pair consisting of the integer code for an identifier, which we denote by id, and a value that is a pointer to the symbol table returned by INSTALL. The code for state 2 is:

    State 2: RETRACT();
             return (id, INSTALL())

If blanks must be skipped in the language at hand, we should include in the code for state 2 a step that moves the beginning pointer to the next non-blank.

Fig. 7 shows a list of tokens that we want to recognize, using a token recognizer based on the transition diagrams of Fig. 8.

    Token        Code   Value
    begin        1      -
    end          2      -
    if           3      -
    then         4      -
    else         5      -
    identifier   6      Pointer to symbol table
    constant     7      Pointer to symbol table
    <            8      1
    <=           8      2
    =            8      3
    <>           8      4
    >            8      5
    >=           8      6

    Fig. 7: Token Recognizer
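The code segments for states 0, 1 and 2 of the identifier diagram can be combined into one small routine. The following is only a sketch under several assumptions: the input is a Python string, the beginning and lookahead pointers are integer indices, the symbol table is a plain list, end of input counts as a delimiter, and a delimiter is any character that is not a letter or digit. The class name `Scanner` and its method names are hypothetical stand-ins for GETCHAR, RETRACT and INSTALL; the code 6 for an identifier is taken from Fig. 7.

```python
ID = 6  # integer code for "identifier", as listed in Fig. 7

def letter(c):
    return len(c) == 1 and c.isascii() and c.isalpha()

def digit(c):
    return len(c) == 1 and c in "0123456789"

class Scanner:
    def __init__(self, text):
        self.text = text
        self.begin = 0      # beginning pointer: start of the current token
        self.forward = 0    # lookahead pointer
        self.symbols = []   # symbol table (assumed here to be a simple list of names)

    def getchar(self):
        """Return the next character ("" at end of input) and advance the lookahead pointer."""
        c = self.text[self.forward] if self.forward < len(self.text) else ""
        self.forward += 1
        return c

    def retract(self):
        """Move the lookahead pointer back one character."""
        self.forward -= 1

    def install(self):
        """Enter the lexeme between the two pointers if absent; return its table index."""
        name = self.text[self.begin:self.forward]
        if name not in self.symbols:
            self.symbols.append(name)
        return self.symbols.index(name)

    def identifier(self):
        """Follow the identifier diagram; return (ID, index), or None on failure."""
        state = 0
        while True:
            if state == 0:
                if letter(self.getchar()):
                    state = 1
                else:
                    self.forward = self.begin   # FAIL: retract to the beginning pointer
                    return None
            elif state == 1:
                c = self.getchar()
                if not (letter(c) or digit(c)):
                    state = 2                   # any other character acts as the delimiter
            else:                               # state 2: an identifier has been found
                self.retract()                  # the delimiter is not part of the identifier
                token = (ID, self.install())
                self.begin = self.forward
                return token

scanner = Scanner("count1 = 0")
print(scanner.identifier(), scanner.symbols)
```

On failure the routine returns None rather than starting the next diagram, so the caller decides which diagram to try next.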

[Fig. 8: Transition diagram. States marked * are those on which the input is retracted. The edges recoverable from the figure are listed below.]

    Keywords (start state 0):
        0 -[B]-> 1 -[E]-> 2 -[G]-> 3 -[I]-> 4 -[N]-> 5 -[blank or newline]-> 6*     return (1, )
        0 -[E]-> 7 -[N]-> 8 -[D]-> 9 -[blank or newline]-> 10*                      return (2, )
        7 -[L]-> 11 -[S]-> 12 -[E]-> 13 -[blank or newline]-> 14*                   return (5, )
        0 -[I]-> 15 -[F]-> 16 -[blank or newline]-> 17*                             return (3, )
        0 -[T]-> 18 -[H]-> 19 -[E]-> 20 -[N]-> 21 -[blank or newline]-> 22*         return (4, )

    Identifier (start state 23):
        23 -[letter]-> 24
        24 -[letter or digit]-> 24
        24 -[not letter or digit]-> 25*     return (6, INSTALL())

    Constant (start state 26):
        26 -[digit]-> 27
        27 -[digit]-> 27
        27 -[not digit]-> 28*               return (7, INSTALL())

    Relops (start state 29):
        29 -[<]-> 30
        30 -[not = or >]-> 31*              return (8, 1)
        30 -[=]-> 32                        return (8, 2)
        30 -[>]-> 33                        return (8, 4)
        29 -[=]-> 34                        return (8, 3)
        29 -[>]-> 35
        35 -[not =]-> 36*                   return (8, 5)
        35 -[=]-> 37                        return (8, 6)

A more efficient program can be constructed from a single transition diagram than from a collection of diagrams, since there is no need to backtrack and rescan using a second transition diagram. In Fig. 8, we have combined all keywords into one transition diagram. However, if we attempt to combine the diagram for identifiers with that for keywords, difficulties arise. For example, on seeing the three letters BEG, we could not tell whether to be in state 3 or state 24.

In Fig. 8, each keyword is treated as a separate token, whereas all relops are combined into one token class, with the associated token value distinguishing one relop from another.

Let us now consider an example of the action of the lexical analyzer constructed from the transition diagrams of Fig. 8. On seeing IFA followed by a blank, the lexical analyzer would traverse states 0, 15, and 16, then fail and retract the input to I. It would then start up the second transition diagram at state 23, traverse state 24 three times, go to state 25 on the blank, retract the input one position, and install IFA in the symbol table.

Definition of Regular Expression

After the definition of strings and languages, we are ready to describe regular expressions, the notation we shall use to define the class of languages known as regular sets. Recall that a token is either a single string (such as a punctuation symbol) or one of a collection of strings of a certain type (such as an identifier). If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens. In regular-expression notation we could write the definition for identifier as:

    identifier = letter (letter | digit)*

The vertical bar means "or", that is, union; the parentheses are used to group subexpressions; and the star is the closure operator meaning "zero or more instances".

What we call the regular expressions over alphabet Σ are exactly those expressions that can be constructed from the following rules. Each regular expression denotes a language, and we give the rules for construction of the denoted languages along with the regular-expression construction rules.

1- ε is a regular expression denoting {ε}, that is, the language consisting only of the empty string.
2- For each a in Σ, a is a regular expression denoting {a}, the language with only one string, that string consisting of the single symbol a.
3- If R and S are regular expressions denoting languages $L_R$ and $L_S$, respectively, then:
   i) (R)|(S) is a regular expression denoting $L_R \cup L_S$
   ii) (R).(S) is a regular expression denoting $L_R . L_S$
   iii) (R)* is a regular expression denoting $L_R^*$

We have shown regular expressions formed with parentheses whenever possible.
In fact, we eliminate them when we can, using the precedence rules that * has highest precedence, then comes ., and | has lowest precedence.
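Python's `re` module uses the same precedence (star binds tightest, then concatenation, then |), so it can serve as a quick check of how an unparenthesized expression is grouped. This is only an illustration with a stand-in tool, assuming letters and digits are restricted to ASCII; the helper name `is_identifier` is hypothetical.

```python
import re

# identifier = letter (letter | digit)*, with letter = [A-Za-z] and digit = [0-9]
IDENTIFIER = re.compile(r"[A-Za-z]([A-Za-z]|[0-9])*")

def is_identifier(s):
    """True if the whole string s is denoted by the identifier expression."""
    return IDENTIFIER.fullmatch(s) is not None

print(is_identifier("IFA"), is_identifier("x1"), is_identifier("1x"))

# Star binds tighter than concatenation: ab* is a(b)*, not (ab)*.
print(re.fullmatch(r"ab*", "abbb") is not None)   # a followed by b's
print(re.fullmatch(r"ab*", "abab") is not None)   # would need (ab)*

# Concatenation binds tighter than |: a|bc is a|(bc), not (a|b)c.
print(re.fullmatch(r"a|bc", "bc") is not None)
print(re.fullmatch(r"a|bc", "ac") is not None)
```

`fullmatch` is used rather than `match` so that the whole string, not just a prefix, must belong to the denoted language.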

Let us assume that our alphabet is {a, b}. The regular expression a denotes {a}, which is different from just the string a.

1- The regular expression a* denotes the closure of the language {a}, that is,

$$a^* = \bigcup_{i=0}^{\infty} \{a^i\}$$

the set of all strings of zero or more a's. The regular expression aa*, which by our precedence rules is parsed a(a)*, denotes the strings of one or more a's. We may use a+ for aa*.
2- What does the regular expression (a|b)* denote? We see that a|b denotes {a, b}, the language with two strings a and b. Thus (a|b)* denotes

$$\bigcup_{i=0}^{\infty} \{a, b\}^i$$

which is just the set of all strings of a's and b's, including the empty string. The regular expression (a*b*)* denotes the same set.
3- The expression a|ba* is grouped a|(b(a)*), and denotes the set of strings consisting of either a single a, or a b followed by zero or more a's.
4- The expression aa|ab|ba|bb denotes all strings of length two, so (aa|ab|ba|bb)* denotes all strings of even length. Note that ε is a string of length zero.
5- ε|a|b denotes strings of length zero or one.

Example: The tokens discussed in Fig. 7 can be described by regular expressions as follows:

    Keyword    = BEGIN | END | IF | THEN | ELSE
    Identifier = letter (letter | digit)*
    Constant   = digit+
    Relop      = < | <= | = | <> | > | >=

where letter stands for A | B | ... | Z, and digit stands for 0 | 1 | ... | 9.

If two regular expressions R and S denote the same language, we write R = S, and say that R and S are equivalent. For example, we previously observed that (a|b)* = (a*b*)*. For any regular expressions R, S and T, the following axioms hold:

1- R|S = S|R (| is commutative)
2- R|(S|T) = (R|S)|T (| is associative)
3- R(ST) = (RS)T (. is associative)
4- R(S|T) = RS|RT and (S|T)R = SR|TR (. distributes over |)
5- εR = Rε = R (ε is the identity for concatenation)

Finite Automata

A recognizer for a language L is a program that takes as input a string x and answers "yes" if x is a sentence of L and "no" otherwise. Clearly, the part of a lexical analyzer that identifies the presence of a token on the input is a recognizer for the language defining that token.

Suppose we have specified a language by a regular expression R, and we are given some string x. We want to know whether x is in the language L denoted by R. One way to attempt this test is to check that x can be decomposed into a sequence of substrings denoted by the primitive subexpressions in R. Suppose R is (a|b)*abb, the set of all strings ending in abb, and x is the string aabb. We see that R = R1R2, where R1 = (a|b)* and R2 = abb. We can verify that a is an element of the language denoted by R1 and that abb similarly matches R2. In this way, we show that aabb is in the language denoted by R.

Nondeterministic Finite Automata (NFA)

A better way to convert a regular expression to a recognizer is to construct a generalized transition diagram from the expression. This diagram is called a nondeterministic finite automaton. A nondeterministic finite automaton recognizing the language (a|b)*abb is shown in Fig. 9.

[Fig. 9: Nondeterministic finite automaton with states 0 (start), 1, 2 and 3. State 0 has edges labeled a and b back to itself; 0 -[a]-> 1 -[b]-> 2 -[b]-> 3.]

The NFA is a labeled directed graph. The nodes are called states, and the labeled edges are called transitions. The NFA looks almost like a transition diagram, but edges can be labeled by ε as well as by characters, and the same character can label two or more transitions out of one state. One state (0 in Fig. 9) is distinguished as the start state, and one or more states may be distinguished as accepting states (or final states). State 3 in Fig. 9 is the final state.

The transitions of an NFA can be conveniently represented in tabular form by means of a transition table. The transition table for the NFA of Fig. 9 is shown in Fig. 10. In the transition table there is a row for each state and a column for each input symbol. The entry for row i and symbol a is the set of possible next states for state i on input a.

    State    Input symbol
             a        b
    0        {0,1}    {0}
    1        -        {2}
    2        -        {3}

    Fig. 10: Transition table

The NFA accepts an input string x if and only if there is a path from the start state to some accepting state, such that the labels along that path spell out x. If the input string is aabb, then we can show this sequence of moves:

    State    Remaining input
    0        aabb
    0        abb
    1        bb
    2        b
    3        ε

In Fig. 11 below we can see an NFA to recognize aa*|bb*. The string aaa is accepted by going through states 0, 1, 2, 2, and 2. The labels of these edges are ε, a, a and a, whose concatenation is aaa.

[Fig. 11: NFA accepting aa*|bb*. 0 -[ε]-> 1 -[a]-> 2, with an edge labeled a from 2 to itself; 0 -[ε]-> 3 -[b]-> 4, with an edge labeled b from 4 to itself.]
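The acceptance condition can be checked mechanically. The sketch below does not search for one path at a time; instead it tracks the set of all states reachable on the input read so far, a technique chosen here for illustration and not described in the text above. It assumes the transition table is stored as a dictionary from (state, symbol) pairs to sets of states, with ε written as the empty string; the names `epsilon_closure`, `accepts`, `FIG10` and `FIG11` are hypothetical, and the accepting states {2, 4} for Fig. 11 are inferred from the trace of aaa.

```python
def epsilon_closure(states, delta):
    """All states reachable from `states` using only ε-transitions (ε is written as "")."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, ""), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def accepts(delta, start, finals, x):
    """True if some path from `start` to a state in `finals` has labels spelling out x."""
    current = epsilon_closure({start}, delta)
    for c in x:
        moved = set()
        for s in current:
            moved |= delta.get((s, c), set())   # a missing entry means no transition
        current = epsilon_closure(moved, delta)
    return bool(current & finals)

# Transition table of Fig. 10: the NFA for (a|b)*abb, accepting state 3.
FIG10 = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}

# NFA of Fig. 11 for aa*|bb*, with ε-moves out of state 0 (accepting states assumed to be 2 and 4).
FIG11 = {(0, ""): {1, 3}, (1, "a"): {2}, (2, "a"): {2}, (3, "b"): {4}, (4, "b"): {4}}

print(accepts(FIG10, 0, {3}, "aabb"))
print(accepts(FIG11, 0, {2, 4}, "aaa"))
```

Tracking a set of states avoids backtracking: each input character is examined once, at the cost of keeping up to one entry per NFA state.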