ECS 120 Lesson 7 Regular Expressions, Pt. 1

Oliver Kreylos
Friday, April 13th, 2001

1 Outline

Thus far, we have been discussing one way to specify a (regular) language: giving a machine that reads a word and tells whether it is in the language or not. Though this is a valid and unambiguous specification, it is sometimes not a very helpful one. Specifying languages by automata has two major shortcomings: first, when given a language, it is often difficult to construct an automaton that accepts it; second, when given an automaton, it is often difficult to understand which language it accepts.

Regular expressions are an alternative specification method for regular languages: they are easier to construct, and it is easier to see which language they describe by just looking at the expression. Both benefits stem from the fact that regular expressions describe the structure of the words contained in a language, rather than giving a machine that must be run in order to decide whether a word belongs to it.

Regular expressions are very common in computer applications because they are a powerful way to describe patterns in text. Text editors (in their search-and-replace functions), programming languages such as Perl, and UNIX utilities such as grep, awk and lex all use regular expressions to describe patterns. Programming language compilers typically use regular expressions to define the lowest-level constructs of program source code ("tokens"), and the stage of the compiler responsible for recognizing tokens (the "lexer" or "scanner", which runs before the parser) is often constructed automatically from those regular expressions using a utility such as lex.

2 Regular Expressions

Regular expressions over an alphabet Σ define languages over Σ by describing the structure of the words in a language. They are based on the three regular operations: union, concatenation and Kleene star. They are very similar to arithmetic expressions like 3 + (4 · 5): they consist of constants and operators, and they construct complex expressions from simpler building blocks. Unlike arithmetic expressions, their values are not numbers but languages.

Examples:

hello specifies the language consisting of the single word hello.
hello ∪ world specifies the language consisting of the two words hello and world.
(aa)* specifies the language of all words consisting of an even number of a's.
a* ∘ bb, often written just as a*bb, specifies the set of all words consisting of any number of a's followed by two b's.

Since every regular expression defines exactly one language, we will write L(R) to denote the language defined by the regular expression R.

3 Formal Definition of Regular Expressions

Regular expressions over an alphabet Σ are defined recursively, very similarly to arithmetic expressions. We start by defining the simplest regular expressions, and then define operations to create more complex ones from simpler building blocks:

1. ∅ is a regular expression defining the empty language, L(∅) = ∅ ⊆ Σ*.
2. ɛ is a regular expression defining the language consisting only of the empty word, L(ɛ) = {ɛ} ⊆ Σ*.
3. If a ∈ Σ is a character, then a is a regular expression over Σ defining the language consisting of the single one-character word a, L(a) = {a} ⊆ Σ*.

4. If R_1 and R_2 are two regular expressions over the same alphabet Σ, then (R_1 ∪ R_2) is a regular expression over Σ defining the union of the languages of R_1 and R_2, L((R_1 ∪ R_2)) = L(R_1) ∪ L(R_2) ⊆ Σ*.
5. If R_1 and R_2 are two regular expressions over the same alphabet Σ, then (R_1 ∘ R_2), often abbreviated as (R_1 R_2), is a regular expression over Σ defining the concatenation of the languages of R_1 and R_2, L((R_1 ∘ R_2)) = L((R_1 R_2)) = L(R_1)L(R_2) ⊆ Σ*.
6. If R is a regular expression over Σ, then (R*) is a regular expression over Σ defining the Kleene star of the language of R, L((R*)) = L(R)* ⊆ Σ*.

The regular expressions generated by these definitions are fully parenthesized to avoid ambiguities. Here is what the earlier example expressions look like when following the recursive definition:

((((h ∘ e) ∘ l) ∘ l) ∘ o) = ((((he)l)l)o)
(((((h ∘ e) ∘ l) ∘ l) ∘ o) ∪ ((((w ∘ o) ∘ r) ∘ l) ∘ d)) = (((((he)l)l)o) ∪ ((((wo)r)l)d))
((a ∘ a)*) = ((aa)*)
((a*) ∘ (b ∘ b)) = ((a*)(bb))

Even when dropping the explicit concatenation operator, the fully parenthesized expressions are almost impossible to read. Therefore, we introduce rules for implicit parenthesization, similar to those used in arithmetic:

R_1 R_2 R_3 := ((R_1 R_2) R_3) = (R_1 (R_2 R_3)). Since the concatenation operator is associative, we can drop parentheses in sequences of concatenation operations entirely.
R_1 ∪ R_2 ∪ R_3 := ((R_1 ∪ R_2) ∪ R_3) = (R_1 ∪ (R_2 ∪ R_3)). Since the union operator is associative as well, we can drop parentheses in sequences of union operations entirely.
R_1 ∪ R_2 R_3 := (R_1 ∪ (R_2 R_3)). Concatenation has precedence over union.
R_1 R_2* := (R_1 (R_2*)). Kleene star has precedence over concatenation.

R_1 ∪ R_2* := (R_1 ∪ (R_2*)). Kleene star has precedence over union.

We also define the following shorthand notation: if A = {a_1, a_2, ..., a_n} ⊆ Σ ∪ {ɛ} is a set of characters from Σ or the symbol ɛ, then A is a shorthand for a_1 ∪ a_2 ∪ ··· ∪ a_n, the regular expression denoting the language L(A) = {a_1, a_2, ..., a_n} ⊆ Σ*.

Here are some more relevant examples of regular expressions over the ASCII alphabet. In the following, let L := {A, ..., Z, a, ..., z} be the set of letters, and D := {0, ..., 9} the set of decimal digits.

DD* describes all words starting with a digit, followed by any number of further digits. This is the set of all non-negative integers in decimal notation (leading zeros are allowed).
{+, -, ɛ}DD* describes the language of all integer constants with an optional sign.
{+, -, ɛ}(DD* ∪ DD*.D* ∪ D*.DD*)(({E, e}{+, -, ɛ}DD*) ∪ ɛ) describes the language of floating-point constants with an optional sign and an optional exponent part, roughly as recognized by the C programming language. Here "." stands for the literal decimal point.
(L ∪ {_})(L ∪ D ∪ {_})* describes the language of all valid identifiers in the C programming language (not taking reserved words into account).

4 Equivalence of Regular Expressions and Finite State Machines

Earlier we claimed that the class of languages that can be described by regular expressions is exactly the class of regular languages. We are now going to prove this statement. First, we will show that the language L(R) generated by any regular expression R is accepted by some NFA M. Second, we will show that the language L(M) accepted by any automaton M is generated by some regular expression R.

5 Construction of Automata from Regular Expressions

We will prove the existence of an automaton that accepts the language generated by a regular expression by structural induction. This means we will follow the recursive definition of regular expressions: we first construct automata accepting the languages generated by the simplest regular expressions, and then show how to combine those automata to accept the languages generated by more complex ones. For all the following constructions, we will assume that all regular expressions are over some alphabet Σ, and we write Σ_ɛ := Σ ∪ {ɛ}.

5.1 Case 1: R = ∅

If R = ∅, then L(R) = ∅. The empty language is accepted by the NFA M_∅ := ({q_0}, Σ, δ, q_0, ∅) where δ(q_0, a) = ∅ for all a ∈ Σ_ɛ. A transition diagram for this automaton is shown in Figure 1.

Figure 1: An NFA M accepting the empty language, L(M) = ∅. (Diagram: a single non-accepting start state q_0.)

5.2 Case 2: R = ɛ

If R = ɛ, then L(R) = {ɛ}. The language consisting only of the empty word is accepted by the NFA M_ɛ := ({q_0}, Σ, δ, q_0, {q_0}) where δ(q_0, a) = ∅ for all a ∈ Σ_ɛ. A transition diagram for this automaton is shown in Figure 2.

Figure 2: An NFA M accepting the language consisting only of the empty word, L(M) = {ɛ}. (Diagram: a single accepting start state q_0.)

5.3 Case 3: R = a

If R = a for some character a ∈ Σ, then L(R) = {a}. The language consisting only of the word a is accepted by the NFA M_a := ({q_0, q_1}, Σ, δ, q_0, {q_1}) where, for all q ∈ {q_0, q_1} and x ∈ Σ_ɛ, δ(q, x) = {q_1} if q = q_0 and x = a, and δ(q, x) = ∅ otherwise.
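The three base-case automata can also be written down as plain data, which makes it easy to check that each one accepts exactly the intended language. The following is only a minimal sketch: the tuple layout, the use of `None` for ɛ-moves, and the names `nfa_empty`, `nfa_epsilon`, `nfa_char`, `eps_closure` and `accepts` are illustrative assumptions, not part of the lecture. A missing dictionary entry plays the role of δ(q, x) = ∅.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical representation: an NFA is (states, delta, start, accepting).
# delta maps (state, symbol) to a set of successor states; the symbol None
# stands for an epsilon move, and a missing key means "no successors".
Delta = Dict[Tuple[str, Optional[str]], Set[str]]
NFA = Tuple[Set[str], Delta, str, Set[str]]


def nfa_empty() -> NFA:
    """Case 1: one non-accepting state, no transitions."""
    return ({"q0"}, {}, "q0", set())


def nfa_epsilon() -> NFA:
    """Case 2: one accepting state, no transitions."""
    return ({"q0"}, {}, "q0", {"q0"})


def nfa_char(a: str) -> NFA:
    """Case 3: a single transition q0 --a--> q1, with q1 accepting."""
    return ({"q0", "q1"}, {("q0", a): {"q1"}}, "q0", {"q1"})


def eps_closure(delta: Delta, states: Set[str]) -> Set[str]:
    """All states reachable from `states` using only epsilon moves."""
    seen = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, None), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(nfa: NFA, word: str) -> bool:
    """Simulate the NFA by tracking the set of currently possible states."""
    _, delta, start, accepting = nfa
    current = eps_closure(delta, {start})
    for ch in word:
        moved: Set[str] = set()
        for q in current:
            moved |= delta.get((q, ch), set())
        current = eps_closure(delta, moved)
    return bool(current & accepting)
```

Under these assumptions, `accepts(nfa_char("a"), "a")` holds while `accepts(nfa_char("a"), "aa")` does not, matching L(M_a) = {a}.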

A transition diagram for the automaton M_a from Case 3 is shown in Figure 3.

Figure 3: An NFA M accepting the language consisting only of the word a, L(M) = {a}. (Diagram: start state q_0 with a transition labelled a to the accepting state q_1.)

5.4 Case 4: R = R_1 ∪ R_2

If R = R_1 ∪ R_2 for two regular expressions R_1 and R_2, then L(R) = L(R_1) ∪ L(R_2). To construct an automaton that accepts L(R), we first construct the two automata M_1 and M_2 that accept L(R_1) and L(R_2), respectively. Then we use the union construction shown earlier to build an automaton M that accepts L(R_1) ∪ L(R_2).

5.5 Case 5: R = R_1 ∘ R_2

If R = R_1 ∘ R_2 for two regular expressions R_1 and R_2, then L(R) = L(R_1)L(R_2). To construct an automaton that accepts L(R), we first construct the two automata M_1 and M_2 that accept L(R_1) and L(R_2), respectively. Then we use the concatenation construction shown earlier to build an automaton M that accepts L(R_1)L(R_2).

5.6 Case 6: R = R_1*

If R = R_1* for a regular expression R_1, then L(R) = L(R_1)*. To construct an automaton that accepts L(R), we first construct an automaton M_1 that accepts L(R_1). Then we use the Kleene star construction shown earlier to build an automaton M that accepts L(R_1)*.
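The six cases together amount to a recursive procedure that turns any regular expression into an NFA. The sketch below illustrates that recursion in Python. The union, concatenation and Kleene star constructions "shown earlier" are not part of this lesson, so the code assumes the common ɛ-move versions of them; the tuple-based expression encoding and all function names are likewise hypothetical choices made for illustration.

```python
from itertools import count

_fresh = count()  # illustrative supply of unused state names


def _merge(d1, d2):
    """Copy two transition tables into one (state sets are disjoint)."""
    merged = {k: set(v) for k, v in d1.items()}
    for k, v in d2.items():
        merged.setdefault(k, set()).update(v)
    return merged


# An NFA is (delta, start, accepting); the symbol None is an epsilon move.
def build(r):
    """Follow the recursive definition: r is a tuple such as ("chr", "a"),
    ("union", r1, r2), ("cat", r1, r2), ("star", r1), ("eps",), ("empty",)."""
    op = r[0]
    if op == "empty":                      # Case 1
        return ({}, next(_fresh), set())
    if op == "eps":                        # Case 2
        q = next(_fresh)
        return ({}, q, {q})
    if op == "chr":                        # Case 3
        q0, q1 = next(_fresh), next(_fresh)
        return ({(q0, r[1]): {q1}}, q0, {q1})
    if op == "union":                      # Case 4: new start, epsilon to both
        (d1, s1, f1), (d2, s2, f2) = build(r[1]), build(r[2])
        delta, s = _merge(d1, d2), next(_fresh)
        delta[(s, None)] = {s1, s2}
        return (delta, s, f1 | f2)
    if op == "cat":                        # Case 5: epsilon from M_1's accepting states
        (d1, s1, f1), (d2, s2, f2) = build(r[1]), build(r[2])
        delta = _merge(d1, d2)
        for f in f1:
            delta.setdefault((f, None), set()).add(s2)
        return (delta, s1, set(f2))
    if op == "star":                       # Case 6: new accepting start, loop back
        d1, s1, f1 = build(r[1])
        delta, s = _merge(d1, {}), next(_fresh)
        delta[(s, None)] = {s1}
        for f in f1:
            delta.setdefault((f, None), set()).add(s1)
        return (delta, s, f1 | {s})
    raise ValueError(f"unknown operator: {op!r}")


def accepts(nfa, word):
    """Simulate the NFA on word, tracking the set of possible states."""
    delta, start, accepting = nfa

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for q in delta.get((stack.pop(), None), set()):
                if q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    current = closure({start})
    for ch in word:
        moved = set()
        for q in current:
            moved |= delta.get((q, ch), set())
        current = closure(moved)
    return bool(current & accepting)


# Two of the example expressions from Section 2, written as tuples.
A = ("chr", "a")
B = ("chr", "b")
EVEN_AS = ("star", ("cat", A, A))              # (aa)*
A_STAR_BB = ("cat", ("star", A), ("cat", B, B))  # a*bb
```

For instance, `accepts(build(EVEN_AS), "aaaa")` holds and `accepts(build(EVEN_AS), "aaa")` does not. Because `build` creates fresh states at every node of the expression, the automata being combined never share states, which is what the merging step relies on.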