CMPSCI 250: Introduction to Computation. Lecture #28: Regular Expressions and Languages David Mix Barrington 2 April 2014

Similar documents
CHAPTER TWO LANGUAGES. Dr Zalmiyah Zakaria

Chapter 4: Regular Expressions

Notes for Comp 454 Week 2

CMPSCI 250: Introduction to Computation. Lecture #36: State Elimination David Mix Barrington 23 April 2014

Glynda, the good witch of the North

Chapter Seven: Regular Expressions

DVA337 HT17 - LECTURE 4. Languages and regular expressions

CMPSCI 250: Introduction to Computation. Lecture #30: Properties of the Regular Languages David Mix Barrington 7 April 2014

Automata Theory TEST 1 Answers Max points: 156 Grade basis: 150 Median grade: 81%

CMPSCI 250: Introduction to Computation. Lecture 20: Deterministic and Nondeterministic Finite Automata David Mix Barrington 16 April 2013

Regular Expressions. Lecture 10 Sections Robb T. Koether. Hampden-Sydney College. Wed, Sep 14, 2016

CMPSCI 250: Introduction to Computation. Lecture #1: Things, Sets and Strings David Mix Barrington 22 January 2014

Compiler Construction

Strings, Languages, and Regular Expressions

1.3 Functions and Equivalence Relations 1.4 Languages

CMPSCI 250: Introduction to Computation. Lecture #7: Quantifiers and Languages 6 February 2012

CMPSCI 250: Introduction to Computation. Lecture #30: Proving Properties of the Regular Languages David Mix Barrington 9 April 2012

CSE 105 THEORY OF COMPUTATION

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Proof Techniques Alphabets, Strings, and Languages. Foundations of Computer Science Theory

Welcome! CSC445 Models Of Computation Dr. Lutz Hamel Tyler 251

Finite Automata Part Three

ITEC2620 Introduction to Data Structures

Lexical Analysis. Sukree Sinthupinyo July Chulalongkorn University

Regular Languages. Regular Language. Regular Expression. Finite State Machine. Accepts

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

2. λ is a regular expression and denotes the set {λ} 4. If r and s are regular expressions denoting the languages R and S, respectively

Regular Expressions & Automata

Assignment No.4 solution. Pumping Lemma Version I and II. Where m = n! (n-factorial) and n = 1, 2, 3

Chapter Seven: Regular Expressions. Formal Language, chapter 7, slide 1

CS402 Theory of Automata Solved Subjective From Midterm Papers. MIDTERM SPRING 2012 CS402 Theory of Automata

Regular Languages and Regular Expressions

Lec-5-HW-1, TM basics

Formal Languages and Automata

CS402 - Theory of Automata FAQs By

CMSC 132: Object-Oriented Programming II

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np

Languages and Strings. Chapter 2

Finite Automata Part Three

Recursion defining an object (or function, algorithm, etc.) in terms of itself. Recursion can be used to define sequences

1 Finite Representations of Languages

CSE 105 THEORY OF COMPUTATION

CHAPTER 8. Copyright Cengage Learning. All rights reserved.

Discrete Mathematics Lecture 4. Harper Langston New York University

Tilings of the Sphere by Edge Congruent Pentagons

Recursively Defined Functions

Chapter 18: Decidability

CSE 431S Scanning. Washington University Spring 2013

SWEN 224 Formal Foundations of Programming

Lexical Analyzer Scanner

JNTUWORLD. Code No: R

Recursively Enumerable Languages, Turing Machines, and Decidability

Definition: A context-free grammar (CFG) is a 4- tuple. variables = nonterminals, terminals, rules = productions,,

Solutions to Homework 10

Lexical Analyzer Scanner

Formal Languages and Compilers Lecture VI: Lexical Analysis

Regular Expressions. Steve Renals (based on original notes by Ewan Klein) ICL 12 October Outline Overview of REs REs in Python

Ambiguous Grammars and Compactification

Recursion defining an object (or function, algorithm, etc.) in terms of itself. Recursion can be used to define sequences

8 ε. Figure 1: An NFA-ǫ

Turing Machine Languages

Recursive Definitions Structural Induction Recursive Algorithms

2. Lexical Analysis! Prof. O. Nierstrasz!

Section 2.4 Sequences and Summations

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Finite Automata Part Three

Regular Expressions. Regular Expressions. Regular Languages. Specifying Languages. Regular Expressions. Kleene Star Operation

KHALID PERVEZ (MBA+MCS) CHICHAWATNI

This is already grossly inconvenient in present formalisms. Why do we want to make this convenient? GENERAL GOALS

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

CMPSCI 250: Introduction to Computation. Lecture #22: Graphs, Paths, and Trees David Mix Barrington 12 March 2014

AUBER (Models of Computation, Languages and Automata) EXERCISES

Dr. D.M. Akbar Hussain

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

CS 314 Principles of Programming Languages

Slides for Faculty Oxford University Press All rights reserved.

Phil 320 Chapter 1: Sets, Functions and Enumerability I. Sets Informally: a set is a collection of objects. The objects are called members or

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

AXIOMS FOR THE INTEGERS

Regular Expressions. Chapter 6

QUESTION BANK. Formal Languages and Automata Theory(10CS56)

Alphabets, strings and formal. An introduction to information representation

Languages and Finite Automata

Section 1.7 Sequences, Summations Cardinality of Infinite Sets

Learn Smart and Grow with world

UNION-FREE DECOMPOSITION OF REGULAR LANGUAGES

CMPSCI 250: Introduction to Computation. Lecture #14: Induction and Recursion (Still More Induction) David Mix Barrington 14 March 2013

Lexical Analysis (ASU Ch 3, Fig 3.1)

GSAC TALK: THE WORD PROBLEM. 1. The 8-8 Tessellation. Consider the following infinite tessellation of the open unit disc. Actually, we want to think

Lexical Analysis - 1. A. Overview A.a) Role of Lexical Analyzer

CmpSci 187: Programming with Data Structures Spring 2015

Multiple Choice Questions

The Kuratowski Closure-Complement Theorem

UNIT I PART A PART B

Finite automata. We have looked at using Lex to build a scanner on the basis of regular expressions.

MA/CSSE 474. Today's Agenda

Where We Are. CMSC 330: Organization of Programming Languages. This Lecture. Programming Languages. Motivation for Grammars

CS 3100 Models of Computation Fall 2011 This assignment is worth 8% of the total points for assignments 100 points total.

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

Transcription:

CMPSCI 250: Introduction to Computation Lecture #28: Regular Expressions and Languages David Mix Barrington 2 April 2014

Regular Expressions and Languages Regular Expressions The Formal Inductive Definition The Kleene Star Operation Finite Languages The Language (a + ab) * Logically Describable Languages Languages From Number Theory

Regular Expressions We re now entering the final segment of the course, dealing with regular expressions and finite-state machines. A regular expression is a way to denote a language (a subset of Σ * for some finite alphabet Σ). A finite-state machine is a particular kind of computer that reads a string in Σ * and gives a boolean answer. Thus the machine decides some language.

Regular Expressions Our major result will be Kleene s Theorem, which says that a language is denoted by a regular expression (is a regular language) if and only if it is decided by a finitestate machine. We ll learn algorithms to go from an expression to an equivalent machine, and vice versa.

Regular Expressions Regular expressions, with slightly different notation, occur in programming languages and operating systems such as Unix. It s common for a language (or a property of strings) to be defined by a regular expression. The system then has to build a finite-state machine to decide that language, and it uses the very algorithm we will present here.

The Formal Inductive Definition Fix an alphabet Σ. A regular expression over Σ is a string over the alphabet Σ {, +,, *, (, )} that can be built by the rules below. Each regular expression R denotes a language L(R), also determined by the rules below. is a regular expression and denotes the empty set. If a is any letter in Σ, then a is a regular expression and denotes the language {a}.

The Formal Inductive Definition If R and S are two regular expressions denoting languages L(R) and L(S), then R S (often written RS ) is a regular expression denoting the concatenation L(R)L(S), and R+S is a regular expression denoting the union L(R) L(S). If R is a regular expression denoting the language L(R), then R * is a regular expression denoting the Kleene star of L(R), which is written L(R) *. Nothing else is a regular expression.

The Kleene Star Operation If A is any language, the Kleene star of A, written A *, is the set of all strings that can be written as the concatenation of zero or more strings from A. If A =, A * = {λ} because we can only have a concatenation of zero strings from A. If A = {a}, then A * = {λ, a, aa, aaa, aaaa,...}, the set of all strings of a s.

The Kleene Star Operation If A = Σ, then A * is just Σ *, so the star notation we have been using for Σ * is just this same Kleene star operation. A string over Σ is just the concatenation of zero or more letters from Σ. In general, A * is the union of the languages A 0, A 1, A 2, A 3,... where A 0 = {λ}, A 1 = A, A 2 = AA, A 3 = AAA, and so on. (Note that some of the laws of exponents still work, like A i A j = A i+j and (A i ) j = A ij.)

Finite Languages The regular expression aba denotes the concatenation {a}{b}{a} = {aba}, by the definition of concatenation of languages. Thus any language consisting of a single nonempty string has a regular expression, which is (up to a type cast) itself. The language {λ}, as we just saw, can be written *.

Finite Languages If I have any finite language, I can denote it by a regular expression, as the union of the one-string languages for each of the strings in it. For example, the finite language {λ, aba, abbb, b} is denoted by the regular expression * + aba + abbb + b. (Note that this + is not the Java concatenation operator!)

Finite Languages A regular expression that never uses the star operator must denote a finite set of nonempty strings. (We can prove this fact using induction!) If we use the star operator on any language that contains a non-empty string, the result is an infinite language, such as (aa) * = {λ, aa, aaaa, aaaaaa,...}.

Clicker Question #1 Which of these sets of strings is denoted by the regular expression (ab + ba)(aa + bb)ba? (a) {ab, ba, aa, bb, ba} (b) {aabbba, baaaba, babbba, abaaba} (c) {abbbba, baaaba, abaaba, babbba} (d) {abbaaabbba}

Answer #1 Which of these sets of strings is denoted by the regular expression (ab + ba)(aa + bb)ba? (a) {ab, ba, aa, bb, ba} (b) {aabbba, baaaba, babbba, abaaba} (c) {abbbba, baaaba, abaaba, babbba} (d) {abbaaabbba} (in lecture none of the four were correct)

The Language (a + ab) * Here is a more interesting regular language, denoted by the regular expression (a + ab) *. (Note that the parentheses are important -- a + ab * and a + (ab) * denote quite different languages.) The strings in (a + ab) * are exactly those strings that can be made by concatenating zero or more strings, each of which is equal to either a or ab.

The Language (a + ab) * We can systematically list (a + ab) * by listing (a + ab) i in turn for each natural i. We get (a + ab) 0 = {λ}, (a + ab) 1 = {a, ab}, (a + ab) 2 = {aa, aab, aba, abab}, (a + ab) 3 = (aaa, aaab, aaba, aabab, abaa, abaab, ababa, ababab}, and so forth.

The Language (a + ab) * How can we tell whether a given string of a s and b s is in (a + ab) *? If it ends in a, we know that the last string used in the concatenation was a, and if it ends in b, the last string used was ab. So we can delete a s and ab s from the right as long as we can, and if we produce λ then the string was in the language. It turns out that (a + ab) * is the set of strings that don t begin with b and never have two b s in a row. (How would you prove this assertion?)

Clicker Question #2 Let X {a, b}* be the language denoted by the regular expression (ab + bbb) *. Which of these is not true of every string in X? (a) The string must end with b. (b) There are at least as many b s as a s. (c) There are never two a s in a row. (d) There cannot be two a s separated by exactly five letters that are all b s.

Answer #2 Let X {a, b}* be the language denoted by the regular expression (ab + bbb) *. Which of these is not true of every string in X? (a) The string must end with b. (λ doesn t.) (b) There are at least as many b s as a s. (c) There are never two a s in a row. (d) There cannot be two a s separated by exactly five letters that are all b s.

Logically Describable Languages We can say the first letter is not b and there are never two b s in a row in the predicate calculus. One way to do it is to have variables that range over positions in the string. Our atomic predicates are Ca(x) ( position x contains an a, Cb(x) ( position x contains a b ), x = y ( x and y are the same position ), and x < y ( x is to the left of y ).

Logically Describable Languages So we can say that the first letter is not b, x: Cb(x) y: x y, and that there are never two b s in a row, x: y: Cb(x) Cb(y) (x < y) z: (z x) (z y). One way to say both things at once is x: C b(x) y: Pred(y, x) Ca(y), where Pred(y, x) abbreviates (y < x) z: (z y) (z x). It turns out that languages are logically describable if and only if they have a regular expression of a certain kind, and that (aa) * is not describable.

Clicker Question #3 Let L be the language described by the logical formula x: y: (x y) Cb(y). Consider the three strings λ, baa, and abba. How many of them are in the language L? (a) all three (b) two (c) one (d) none

Answer #3 Let L be the language described by the logical formula x: y: (x y) Cb(y). Consider the three strings λ, baa, and abba. How many of them are in the language L? (a) all three (b) two (c) one (statement says no last letter of a ) (d) none

Languages From Number Theory We can easily make a regular expression for the set of even-length strings of a s, (aa) *, or the oddlength strings of a s, (aa) * a, or the set of strings of a s whose length is congruent to 3 modulo 7, a 3 (a 7 ) *, or the set of strings whose length is congruent to 1, 2, or 5 modulo 6, (a + a 2 + a 5 )(a 6 ) *. What about the set of strings over {a,b} that have an even number of a s? A good first guess is that such a string is a concatenation of zero or more strings, each of which has exactly two a s. This would be the language (b * ab * ab * ) *.

Languages From Number Theory But this isn t exactly right, because bb, for example, has 0 a s and 0 is even. A correct expression for this language is (b + ab * a) * -- we can divide any such string into pieces which either have exactly two a s (with some number of b s between) or are just b s themselves. It s harder to get the strings with a number of a s congruent to 3 mod 7, or the strings with an even number of a s and an even number of b s, but both are possible.