Module 6 Lexical Phase - RE to DFA

Similar documents
CS308 Compiler Principles Lexical Analyzer Li Jiang

[Lexical Analysis] Bikash Balami

Lexical Analysis/Scanning

Dixita Kagathara Page 1

CS6660 COMPILER DESIGN L T P C

UNIT II LEXICAL ANALYSIS

Principles of Compiler Design Presented by, R.Venkadeshan,M.Tech-IT, Lecturer /CSE Dept, Chettinad College of Engineering &Technology

Lexical Analysis. Prof. James L. Frankel Harvard University

CS 432 Fall Mike Lam, Professor. Finite Automata Conversions and Lexing

Regular Languages. MACM 300 Formal Languages and Automata. Formal Languages: Recap. Regular Languages

Finite automata. We have looked at using Lex to build a scanner on the basis of regular expressions.

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Lecture Compiler Construction

Chapter 20: Binary Trees

Converting a DFA to a Regular Expression JP

Decision, Computation and Language

Ambiguous Grammars and Compactification

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

Chapter Seven: Regular Expressions

CSE302: Compiler Design

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

DVA337 HT17 - LECTURE 4. Languages and regular expressions

Lexical Analysis. Implementation: Finite Automata

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

Multi-way Search Trees. (Multi-way Search Trees) Data Structures and Programming Spring / 25

March 20/2003 Jayakanth Srinivasan,

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

Slides for Faculty Oxford University Press All rights reserved.

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

CSE 105 THEORY OF COMPUTATION

Roll No. :... Invigilator's Signature :. CS/B.Tech(CSE)/SEM-7/CS-701/ LANGUAGE PROCESSOR. Time Allotted : 3 Hours Full Marks : 70

Front End: Lexical Analysis. The Structure of a Compiler

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions


XML databases. Jan Chomicki. University at Buffalo. Jan Chomicki (University at Buffalo) XML databases 1 / 9

Zhizheng Zhang. Southeast University

Section 5.5. Left subtree The left subtree of a vertex V on a binary tree is the graph formed by the left child L of V, the descendents

Binary Search Tree (2A) Young Won Lim 5/17/18

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

HKN CS 374 Midterm 1 Review. Tim Klem Noah Mathes Mahir Morshed

Lexical Analysis. Introduction

A Simple Syntax-Directed Translator

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Theoretical Part. Chapter one:- - What are the Phases of compiler? Answer:

1 Finite Representations of Languages

Chapter Seven: Regular Expressions. Formal Language, chapter 7, slide 1

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

Abstract Syntax Trees L3 24

2. Lexical Analysis! Prof. O. Nierstrasz!

An introduction to suffix trees and indexing

Regular Expressions. Regular Expressions. Regular Languages. Specifying Languages. Regular Expressions. Kleene Star Operation

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Implementation of Lexical Analysis

1. Lexical Analysis Phase

CS 171: Introduction to Computer Science II. Binary Search Trees

Regular Languages and Regular Expressions

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np

Lexical Error Recovery

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Data Structures and Algorithms

CSE 431S Scanning. Washington University Spring 2013

TREES. Trees - Introduction

Implementation of Lexical Analysis

Conversion of Fuzzy Regular Expressions to Fuzzy Automata using the Follow Automata

Implementation of Lexical Analysis

Lexical Analysis 1 / 52

Lexical Analysis. Chapter 2

CSE450. Translation of Programming Languages. Automata, Simple Language Design Principles

Finite Automata. Dr. Nadeem Akhtar. Assistant Professor Department of Computer Science & IT The Islamia University of Bahawalpur

Binary Search Tree (3A) Young Won Lim 6/2/18

Midterm I review. Reading: Chapters 1-4

Trees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures.

ECS 120 Lesson 7 Regular Expressions, Pt. 1

18.3 Deleting a key from a B-tree

More Assigned Reading and Exercises on Syntax (for Exam 2)

Non-deterministic Finite Automata (NFA)

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

(Refer Slide Time: 0:19)

CSE 548: Analysis of Algorithms. Lectures 14 & 15 ( Dijkstra s SSSP & Fibonacci Heaps )

1. (10 points) Draw the state diagram of the DFA that recognizes the language over Σ = {0, 1}

Programming Languages and Techniques (CIS120)

CS6160 Theory of Computation Problem Set 2 Department of Computer Science, University of Virginia

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES. stadfa: AN EFFICIENT SUBEXPRESSION MATCHING METHOD MOHAMMAD IMRAN CHOWDHURY

CPSC 434 Lecture 3, Page 1

Dr. D.M. Akbar Hussain

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Lecture 5: Suffix Trees

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Introduction to Lexical Analysis

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

CS402 Theory of Automata Solved Subjective From Midterm Papers. MIDTERM SPRING 2012 CS402 Theory of Automata

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

Lexical Analysis - 2

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

2.2 Syntax Definition

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill

Transcription:

Module 6 Lexical Phase - RE to DFA The objective of this module is to construct a minimized DFA from a regular expression. A NFA is typically easier to construct but string matching with a NFA is slower. Hence, we go in for a DFA representation. However, if the regular expression is converted to a DFA using the Thompson s subset construction algorithm, which was discussed in modules 4 and 5, the resultant DFA will have more states than actually necessary. Therefore for string matching, this DFA would take more than the optimized DFA that is directly constructed for the language. So, in this module, we will discuss the construction of a minimized DFA from a regular expression. 6.1 Minimized DFA Construction A DFA can be constructed for a language using a direct representation as a directed graph, or as a procedure that converts a regular expression to a DFA. The conversion from RE to DFA can be done using the following procedure Thompson s subset construction algorithm results in a non-minimized DFA and hence could be minimized using Table filling algorithm Syntax tree procedure Results in a minimized DFA In this module, the syntax tree procedure that converts a regular expression to a minimized DFA is discussed. 6.2 Syntax Tree Procedure The following algorithm, gives the steps to be followed in converting the regular expression (RE) to a DFA. SyntaxTreeAlgorithm Input: Regular Expression Output: a Minimized DFA 1. Augment the regular expression r with a special symbol # which is used as an end marker and any transition over # in a DFA will be an accepting. Hence, the new expression is r #. 2. Construct a syntax tree for r# 3. Traverse the tree to construct functions nullable( ), firstpos( ), lastpos ( ), and followpos( ) 4. Based on the functions firstpos( ), lastpos ( ), and followpos( ) and using the tree, the minimized DFA is constructed Let us discuss each of these functions in detail before going ahead with the implementation of the algorithm:

6.2.1 Syntax Tree construction Based on the precedence of the regular expression operators *, + and. a syntax tree is constructed. The kleene closure operator * has the highest precedence and will have just one child. The next precedence is assigned to the concatenation operator. and this will have two children. The least precedence is given to the union operator + and this will also have two children. The input symbols and the special symbol # will be the leaf nodes of this expression. If a and b are input symbols and they are connected using the + operator then the construction of the syntax tree is given in figure 6.1 (a). The syntax tree will be a typical binary tree. +. a b a b Figure 6.1(a) Syntax tree for a+b Figure 6.1 (b) Syntax tree for a. b If the input symbols are connected by the concatenation operator then the syntax tree would be as given in figure 6.1 (b). If the input symbol has a kleene closure operator, then the syntax tree would be as given in figure 6.1 (c) * Figure 6.1(c) Syntax tree for a* Using this method and the precedence of operators, a syntax tree for the regular expression r# is constructed. All the leaf nodes are labeled with integers from 1 to n, which would be used as the information to construct the DFA later. 6.2.2. Nullable( ) a After constructing the syntax tree, for every symbol in the syntax tree, the function nullable() is defined. The leaf nodes are first assigned nullable based on whether the sub-tree under them can generate an empty string ε. Hence, if a leaf node is labeled with ε then the nullable information

of that node is set to True and for all other input alphabet a such that a is not an operator, the nullable information of that node is set to False. For the regular expression operators, nullable information is assigned. They form the interior node part of the syntax tree. The concatenation operator and the union operator cannot generate empty string and hence their nullable information depends on the nullable status of its children. The concatenation operator is considered as and operation while the union operator is considered as or operation for determining the nullable information of the interior node. If c1 and c2 are the children of these interior nodes, then the following relationship is used to calculate the nullable information of the interior nodes + and. nullable (+) = nullable(c1) or nullable(c2) (6.1) nullable (.) = nullable (c1) and nullable(c2) (6.2) Here the operators or and and indicates the logical operators. Hence, nullable(+) will be set to True if any of its children is nullable and nullable(.) will be set to True only if both of its children are nullable. Since, the kleene closure operator can generate ε as a string, the nullable of the * node is set to True nullabe(*) = True (6.3) 6.2.3 firstpos() The function firstpos(n) is defined as the set of positions that can match the first symbol of a string generated by the sub-tree at node n. It is defined for all the nodes and depends on the firstpos() values of the leaf nodes. To start with, for any input symbol a, the firstpos(a) is the integer assigned to it. For the node ε, firstpos( ) is defined as empty. After assigning the firstpos( ) for all the leaf nodes, firstpos() of all interior nodes are computed. To compute this, nullable information of all interior and leaf nodes are necessary. For a node labeled with the Kleene closure operator, *, and child c1 the firstpos() is defined as firstpos(*) = firstpos(c1) (6.4) For interior nodes labeled with + and. with children c1 and c2, firstpos() is defined as follows: firstpos(+) = firstpos(c1) U firstpos(c2) (6.5) firstpos(.) = if (nullable(c1) == true) then return (firstpos(c1) firstpos(c2))

else return firstpos(c1) (6.6) This function is computed for all interior and leaf nodes and the result is stored in a list. 6.2.4 lastpos() lastpos() is also computed for all the nodes of the syntax tree. lastpos(n) is defined as the set of positions that can match the last symbol of a string generated be the subtree at node n. To start with, for any input symbol a, the lastpos(a) is also the integer assigned to it similar to the firstpos(). For the node ε, lastpos( ) is defined as empty, since ε which is defined as the suffix or prefix of any string is not shown with the string during representation. After assigning the lastpos( ) for all the leaf nodes, lastpos() of all interior nodes are computed. To compute this, nullable information of all interior and leaf nodes is necessary. For a node labeled with the Kleene closure operator, *, and child c1 the lastpos() is defined as lastpos(*) = lastpos(c1) (6.7) For interior nodes labeled + and. and with children c1 and c2, lastpos() is defined as follows: lastpos(+) = lastpos(c1) U lastpos(c2) (6.8) lastpos(.) = if (nullable(c2) == true) then return (lastpos(c1) lastpos(c2)) else return lastpos(c2) (6.9) The difference between the firstpos() and the lastpos() in only for the sub-tree that involves the concatenation operator and for all other nodes, firspos() and lastpos() are same. This function is computed for all interior and leaf nodes and the result is stored in a list. The summary of all the functions are listed in Table 6.1

Table 6.1 Summary of the functions nullable(), firstpos() and lastpos() Node n nullable(n) firstpos(n) lastpos(n) Leaf ε true Leaf i false {i} {i} + or ( ) / \ c 1 c 2 / \ c1 c2 * c1 nullable(c 1 ) or nullable(c 2 ) nullable(c1) and nullable(c2) firstpos(c 1 ) firstpos(c 2 ) if nullable(c1) then firstpos(c1) firstpos(c2) else firstpos(c1) lastpos(c 1 ) lastpos(c 2 ) if nullable(c2) then lastpos(c1) lastpos(c2) else lastpos(c2) true firstpos(c1) lastpos(c1) 6.2.5 followpos(n) followpos() is computed only for the leaf nodes that are labeled with the input symbols that constitute the language of the regular expression. followpos(i) is defined as the set of positions that can follow position i" in the syntax tree. Hence, this is an important function to construct the DFA. To compute followpos(i), the firstpost() and the laspos() of all the nodes are necessary. The algorithm to compute followpos() is given in algorithm 6.1 1. for each node n in the tree do 2. if n is a concatenation node with left child c 1 and right child c 2 then 3. for each i in lastpos(c 1 ) do 4. followpos(i) := followpos(i) firstpos(c 2 ) 5. else if n is a *-node 6. for each i in lastpos(n) do 7. followpos(i) := followpos(i) firstpos(n) end if Algorithm 6.1: Followpos() construction From Algorithm 6.1 it is visible that the followpos() is determined for all the leaf nodes based on their integers assigned to them. In computation of followpos() the Kleene closure operator node and the concatenation operator nodes are only considered. The union operator is not considered since the union operator can choose the appropriate path to validate the string. Line 1 of the algorithm, considers all the nodes of the syntax tree. In line 2, the node can be analysed based on the concatenation operator and line 5 analyses and constructs followpos()

based on the Kleene closure operator. In line 3, every element present in lastpos() of the left child of the concatenation node is considered and the followpos() value is updated according to the relationship given in line 4 of the algorithm. Similarly, line 6 of the algorithm, considers all the elements present in the lastpos of the * node and determines their followpos() based on the relation given in line 7 of the algorithm. As line 7 involves followpos() of * node, the followpos() of the concatenation operator node is first determined which is then used to identify the followpos() of the Kleene closure operator node. 6.2.6 DFA Construction The construction algorithm is given in Algorithm 6.2. Initially, the firstpos (root) is considered. The integers representing in this firstpos( ) is considered as one state of the DFA. This is the starting state of the DFA. Line 1 and 2 of the algorithm, indicates, this procedure, where the variable Dstates is used to remember the states of the DFA. Line 3 initializes a while loop which generates new states based on the transition from the initial state. The start state is marked in line 4 to indicate that the state has been considered. Line 5 initiates for loop to define transitions from the start state on all input symbols. The new state will have state numbers which is determined as a union of the state numbers of the followpos() of the state numbers in the initiating state. The procedure is repeated till there are no more new states can be generated. 1. s 0 := firstpos(root) where root is the root of the syntax tree 2. Dstates := {s 0 } and is unmarked 3. while there is an unmarked state T in Dstates do 4. mark T 5. for each input symbol a Σ do 6. let U be the set of positions that are in followpos(p) 7. for some position p in T, a. such that the symbol at position p is a b. if U is not empty and not in Dstates then add U as an unmarked state to Dstates end if Dtran[T,a] := U Algorithm 6.2 DFA Construction Example 6.1 : Construct a minimized DFA for the regular expression (a b)*abb

In example 6.1, the regular expression is suffixed with # to form (a b)*abb#. According to the construction algorithm, given in figures 6.1 (a c), the syntax tree is constructed and is shown in figure 6.2 Figure 6.2 Syntax tree construction for (a b)*abb# For the syntax tree, the only nullable() node happens to be the * node. Equations 6.1 to 6.3 are used to compute the nullable of all the nodes. All the leaf nodes are not nullable. For the other interior nodes, the interior concatenation node of children * node and node 3, their nullable information is computed as the. of its children. Hence, the concatenation node will have nullable information set to False. Similarly, other interior concatenation nodes will also have nullable information set to False. Figure 6.3 Syntax tree with firstpos() and lastpos() marked. After computing the nullable information, equations 6.4 to 6.9 are used to compute the firstpos() and lastpos() of all the nodes. For the leaf nodes, the numbers indicated will be their firstpos()

and lastpos() and is indicated in figure 6.3. Consider, the union operator node of the children 1 and 2. The firstpos( ) and the laspos( ) is given by the union of the firstpos and lastpos of its children. Hence, it is given by {1,2}. The interior node,., which is a parent of * and (3), the firstpos(. ) is computed as union of the firstpos() of its children since c1 being the * node is nullable. The computation of firstpos() of all the other interior nodes is computed as firstpos() of its left child c1 since, c1 is not nullable for all other interior nodes. On the other hand, the lastpos(.) will be computed based on c2. As the right child of all interior node is not nullable, the lastpos, is computed as just the lastpos (c2) only. Then using this information, followpos() is computed. At the star node, the followpos(1) and followpos(2) will be the set {1,2} using line 7 of algorithm 6.1. Consider the parent of the * node and we add {3} to the followpos(1) and followpos(2) using line 4 of algorithm 6.1. The followpos() of all the nodes is tabulated in Table 6.2 Table 6.2 Followpos() of figure 6.3 Node Followpos(n) 1 {1, 2, 3} 2 {1,2,3} 3 {4} 4 {5} 5 {6} 6 Φ After computing the followpos(), the DFA is constructed by using root s firstpos() as the start state. The root s firstpos() is {1,2,3} and hence, this set is the start state. In this set, 1 and 3 corresponds to the input symbol a. Hence, to define the edge from this state on a, we need to consider the followpos(1) and followpos(3) which is {1,2,3} and {4} respectively. So, we define an edge from {1,2,3} to {1,2,3,4} on the input a. Similarly, the state 2 corresponds to input b. Hence, followpos(2) is to be considered for the edge on b from {1,2,3} which is the same state itself. Now, we have a new state {1,2,3,4}. From this state, 1 and 3 corresponds to a and hence, there is a self loop on this state on the input symbol a. For b, nodes 2, 4 corresponds and hence the union of their followpos() is {1,2,3,5} which is again a new state. The process is repeated and the resultant DFA is shown in figure 6.4. The final state corresponds to the states in which the number corresponding to # is available. In this example, the state number is represented by 6. In the states of the DFA, 6 is contained in the set representing {1,2,3,6} and thus it corresponds to the final state of the DFA. Thus we end up with 4 states in the DFA as against 5 states which is the resultant of Thompson s construction algorithm.

Summary Figure 6.4 Minimized DFA for the expression (a b)*abb In this module, we discussed the construction of minimized DFA which is performed from the regular expression. This algorithm will be proportional asymptotically to the number of operators in the input expression and the number of input symbols present in the regular expression. The next module will discuss the minimized DFA construction from the input DFA.