Regular Expressions. Chapter 6

Similar documents
Regular Expression Module-2

Regular Languages. Regular Language. Regular Expression. Finite State Machine. Accepts

Languages and Strings. Chapter 2

Chapter Seven: Regular Expressions

Chapter Seven: Regular Expressions. Formal Language, chapter 7, slide 1

Regular Expressions. A Language for Specifying Languages. Thursday, September 20, 2007 Reading: Stoughton 3.1

Regular Expressions & Automata

Welcome! CSC445 Models Of Computation Dr. Lutz Hamel Tyler 251

ECS 120 Lesson 7 Regular Expressions, Pt. 1

A Formal Study of Practical Regular Expressions

CMSC 132: Object-Oriented Programming II

Finite Automata. Dr. Nadeem Akhtar. Assistant Professor Department of Computer Science & IT The Islamia University of Bahawalpur

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np

Formal Languages and Automata

Multiple Choice Questions

CHAPTER TWO LANGUAGES. Dr Zalmiyah Zakaria

1 Finite Representations of Languages

Proof Techniques Alphabets, Strings, and Languages. Foundations of Computer Science Theory

Structure of Programming Languages Lecture 3

Regular Languages and Regular Expressions

CMSC 330: Organization of Programming Languages. Architecture of Compilers, Interpreters

Lec-5-HW-1, TM basics

Languages and Compilers

Ambiguous Grammars and Compactification

CMSC 330: Organization of Programming Languages. Context-Free Grammars Ambiguity

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

2. Lexical Analysis! Prof. O. Nierstrasz!

Finite Automata Part Three

Glynda, the good witch of the North

Alphabets, strings and formal. An introduction to information representation

Regular Expressions. Regular Expressions. Regular Languages. Specifying Languages. Regular Expressions. Kleene Star Operation

Regular Expressions. Lecture 10 Sections Robb T. Koether. Hampden-Sydney College. Wed, Sep 14, 2016

CMPSCI 250: Introduction to Computation. Lecture #36: State Elimination David Mix Barrington 23 April 2014

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

CSE 431S Scanning. Washington University Spring 2013

Turing Machine Languages

Lexical Analysis (ASU Ch 3, Fig 3.1)

CMSC 330: Organization of Programming Languages

Architecture of Compilers, Interpreters. CMSC 330: Organization of Programming Languages. Front End Scanner and Parser. Implementing the Front End

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

System & Network Engineering. Regular Expressions ESA 2008/2009. Mark v/d Zwaag, Eelco Schatborn 22 september 2008

CMSC 330: Organization of Programming Languages. Context Free Grammars

CMSC 330: Organization of Programming Languages

Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5

Lecture 5: The Halting Problem. Michael Beeson

Compiler Design. 2. Regular Expressions & Finite State Automata (FSA) Kanat Bolazar January 21, 2010

CMPSCI 250: Introduction to Computation. Lecture #28: Regular Expressions and Languages David Mix Barrington 2 April 2014

JNTUWORLD. Code No: R

Where We Are. CMSC 330: Organization of Programming Languages. This Lecture. Programming Languages. Motivation for Grammars

Automata Theory TEST 1 Answers Max points: 156 Grade basis: 150 Median grade: 81%

CMSC 330: Organization of Programming Languages

Compiler Construction

Lexical Analysis. Prof. James L. Frankel Harvard University

CS402 Theory of Automata Solved Subjective From Midterm Papers. MIDTERM SPRING 2012 CS402 Theory of Automata

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

DVA337 HT17 - LECTURE 4. Languages and regular expressions

MATH Iris Loeb.

Perl Regular Expressions. Perl Patterns. Character Class Shortcuts. Examples of Perl Patterns

QUESTION BANK. Formal Languages and Automata Theory(10CS56)

Notes for Comp 454 Week 2

14.1 Encoding for different models of computation

Compilers CS S-01 Compiler Basics & Lexical Analysis

Regular Expressions. Steve Renals (based on original notes by Ewan Klein) ICL 12 October Outline Overview of REs REs in Python

Recursively Enumerable Languages, Turing Machines, and Decidability

CMSC 330: Organization of Programming Languages. Context Free Grammars

CSE 311 Lecture 21: Context-Free Grammars. Emina Torlak and Kevin Zatloukal

UNIT I PART A PART B

Compilers CS S-01 Compiler Basics & Lexical Analysis

CSE 105 THEORY OF COMPUTATION

Part 5 Program Analysis Principles and Techniques

Reflection in the Chomsky Hierarchy

CS402 - Theory of Automata FAQs By

Automata & languages. A primer on the Theory of Computation. The imitation game (2014) Benedict Cumberbatch Alan Turing ( ) Laurent Vanbever

chapter 14 Monoids and Automata 14.1 Monoids GOALS Example

TOPIC PAGE NO. UNIT-I FINITE AUTOMATA

n λxy.x n y, [inc] [add] [mul] [exp] λn.λxy.x(nxy) λmn.m[inc]0 λmn.m([add]n)0 λmn.n([mul]m)1

Learn Smart and Grow with world

Week 2: Syntax Specification, Grammars

8 ε. Figure 1: An NFA-ǫ

CS S-01 Compiler Basics & Lexical Analysis 1

Finite Automata Part Three

WFST: Weighted Finite State Transducer. September 12, 2017 Prof. Marie Meteer

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

QUESTION BANK. Unit 1. Introduction to Finite Automata

Syntax and Parsing COMS W4115. Prof. Stephen A. Edwards Fall 2004 Columbia University Department of Computer Science

Context-Free Languages and Parse Trees

Zhizheng Zhang. Southeast University

Lecture 12: Cleaning up CFGs and Chomsky Normal

A.1 Numbers, Sets and Arithmetic

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

Name: Finite Automata

LECTURE NOTES THEORY OF COMPUTATION

Group A Assignment 3(2)

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

CS402 - Theory of Automata Glossary By

Normal Forms for CFG s. Eliminating Useless Variables Removing Epsilon Removing Unit Productions Chomsky Normal Form

Compiler Construction

Transcription:

Regular Expressions Chapter 6

Regular Languages Generates Regular Language Regular Expression Recognizes or Accepts Finite State Machine

Stephen Cole Kleene 1909 1994, mathematical logician One of many distinguished students (e.g., Alan Turing) of Alonzo Church (lambda calculus) at Princeton. Best known as a founder of the branch of mathematical logic known as recursion theory. Also invented regular expressions. Kleene pronounced his last name KLAY-nee. `kli:ni and `kli:n are common mispronunciations. His son, Ken Kleene, wrote: "As far as I am aware this pronunciation is incorrect in all known languages. I believe that this novel pronunciation was invented by my father. " Kleeneness is next to Godelness Cleanliness is next to Godliness A pic of pic by me

Regular Expressions Regular expression Σ contains two kinds of symbols: special symbols,, ε, *, +,, (, ) symbols that regular expressions will match against The regular expressions over an alphabet Σ are all and only the strings that can be obtained as follows: 1. is a regular expression. 2. ε is a regular expression. 3. Every element of Σ is a regular expression. 4. If α, β are regular expressions, then so is αβ. 5. If α, β are regular expressions, then so is α β. 6. If α is a regular expression, then so is α*. 7. α is a regular expression, then so is α +. 8. If α is a regular expression, then so is (α).

Regular Expression Examples If Σ = {a, b}, the following are regular expressions: ε a (a b)* abba ε

Regular Expressions Define Languages Regular expressions are useful because each RE has a meaning If the meaning of an RE α is the language A, then we say that α defines or describes A. Define L, a semantic interpretation function for regular expressions: 1. L( ) =. //the language that contains no strings 2. L(ε) = {ε}. //the language that contains just the empty string 3. L(c) = {c}, where c Σ. 4. L(αβ) = L(α) L(β). 5. L(α β) = L(α) L(β). 6. L(α*) = (L(α))*. 7. L(α + ) = L(αα*) = L(α) (L(α))*. If L(α) is equal to, then L(α + ) is also equal to. Otherwise L(α + ) is the language that is formed by concatenating together one or more strings drawn from L(α). 8. L((α)) = L(α).

The Role of the Rules Rules 1, 3, 4, 5, and 6 give the language its power to define sets. Rule 8 has as its only role grouping other operators. Rules 2 and 7 appear to add functionality to the regular expression language, but they don t. 2. ε is a regular expression. 7. α is a regular expression, then so is α +.

Analyzing a Regular Expression The compositional semantic interpretation function lets us map between regular expressions and the languages they define. L((a b)*b) = L((a b)*) L(b) = (L((a b)))* L(b) = (L(a) L(b))* L(b) = ({a} {b})* {b} = {a, b}* {b}.

Examples L( a*b* ) = L( (a b)* ) = L( (a b)*a*b* ) = L( (a b)*abba(a b)* ) = L( (a b) (a b)a(a b)* ) =

Going the Other Way Given a language, find a regular expression L = {w {a, b}*: w is even} ((a b) (a b))* (aa ab ba bb)* L = {w {a, b}*: w contains an odd number of a s} b* (ab*ab*)* a b* b* a b* (ab*ab*)*

Common Idioms (α ε) Optional α, matching α or the empty string (a b)* Set of all strings composed of the characters a and b The regular expression a* is simply a string. It is different from the language L(a*) ={w: w is composed of zero or more a s}. However, when no confusion, we do not write the semantic interpretation function explicitly. We will say things like, The language a* is infinite

Operator Precedence in Regular Expressions Regular Expressions Arithmetic Expressions Highest Kleene star exponentiation concatenation multiplication Lowest union addition a b* c d* x y 2 + i j 2

a* b* (a b)* The Details Matter (ab)* a*b*

Kleene s Theorem Finite state machines and regular expressions define the same class of languages. To prove this, we must show: Theorem: Any language that can be defined with a regular expression can be accepted by some FSM and so is regular. Theorem: Every regular language (i.e., every language that can be accepted by some DFSM) can be defined with a regular expression. Sometimes FSM is easy, sometimes RE is easy.

For Every Regular Expression α, There is a Corresponding FSM M s.t. L(α) = L(M) We ll show this by construction. First, primitive regular expressions, then regular expressions that exploit the operations of union, concatenation, and Kleene star. : A single element of Σ: ε ( *):

Union If α is the regular expression β γ and if both L(β) and L(γ) are regular:

Concatenation If α is the regular expression βγ and if both L(β) and L(γ) are regular:

Kleene Star If α is the regular expression β* and if L(β) is regular:

From RE to FSM: An Example (b ab)* An FSM for b An FSM for a An FSM for b An FSM for ab:

An Example (b ab)* An FSM for (b ab):

An Example (b ab)* An FSM for (b ab)*: Error

The Algorithm regextofsm regextofsm(α: regular expression) = Beginning with the primitive subexpressions of α and working outwards until an FSM for all of α has been built do: Construct an FSM as described above.

For Every FSM There is a Corresponding Regular Expression We ll show this by construction. The key idea is that we ll allow arbitrary regular expressions to label the transitions of an FSM. Read if interested

A Simple Example Let M be: Suppose we rip out state 2:

The Algorithm fsmtoregexheuristic fsmtoregexheuristic(m: FSM) = 1. Remove unreachable states from M. 2. If M has no accepting states then return. 3. If the start state of M is part of a loop, create a new start state s and connect s to M s start state via an ε-transition. 4. If there is more than one accepting state of M or there are any transitions out of any of them, create a new accepting state and connect each of M s accepting states to it via an ε-transition. The old accepting states no longer accept. 5. If M has only one state then return ε. 6. Until only the start state and the accepting state remain do: 6.1 Select rip (not s or an accepting state). 6.2 Remove rip from M. 6.3 *Modify the transitions among the remaining states so M accepts the same strings. 7. Return the regular expression that labels the one remaining transition from the start state to the accepting state.

Regular Expressions in Perl Syntax Name Description abc Concatenation Matches a, then b, then c, where a, b, and c are any regexs a b c Union (Or) Matches a or b or c, where a, b, and c are any regexs a* Kleene star Matches 0 or more a s, where a is any regex a+ Atleast one Matches 1 or more a s, where a is any regex a? Matches 0 or 1 a s, where a is any regex a{n, m} Replication Matches at least n but no more than m a s,where a is any regex a*? Parsimonious Turns off greedy matching so the shortest match is selected a+? ʺ ʺ. Wild card Matches any character except newline ^ Left anchor Anchors the match to the beginning of a line or string $ Right anchor Anchors the match to the end of a line or string [a-z] Assuming a collating sequence, matches any single character in range [^a-z] Assuming a collating sequence, matches any single character not in range \d Digit Matches any single digit, i.e., string in [0-9] \D Nondigit Matches any single nondigit character, i.e., [^0-9] \w Alphanumeric Matches any single word character, i.e., [a-za-z0-9] \W Nonalphanumeric Matches any character in [^a-za-z0-9] \s White space Matches any character in [space, tab, newline, etc.]

Regular Expressions in Perl Syntax Name Description \S Nonwhite space Matches any character not matched by \s \n Newline Matches newline \r Return Matches return \t Tab Matches tab \f Formfeed Matches formfeed \b Backspace Matches backspace inside [] \b Word boundary Matches a word boundary outside [] \B Nonword boundary Matches a non-word boundary \0 Null Matches a null character \nnn Octal Matches an ASCII character with octal value nnn \xnn Hexadecimal Matches an ASCII character with hexadecimal value nn \cx Control Matches an ASCII control character \char Quote Matches char; used to quote symbols such as. and \ (a) Store Matches a, where a is any regex, and stores the matched string in the next variable \1 Variable Matches whatever the first parenthesized expression matched \2 Matches whatever the second parenthesized expression matched For all remaining variables Testing. many other online tools

Using Regular Expressions Matching numbers: in the Real World -? ([0-9]+(\.[0-9]*)? \.[0-9]+) Matching ip addresses: [0-9]{1,3} (\. [0-9] {1,3}){3} Trawl for email addresses: \b[a-za-z0-9_%-]+@[a-za-z0-9_%-]+ (\.[A-Zaz]+){1,4}\b From Friedl, J., Mastering Regular Expressions, O Reilly,1997. IE: information extraction, unstructured data management

A Biology Example BLAST Given a protein or DNA sequence, find others that are likely to be evolutionarily close to it. ESGHDTTTYYNKNRYPAGWNNHHDQMFFWV Build a DFSM that can examine thousands of other sequences and find those that match any of the selected patterns.

Simplifying Regular Expressions Regex s describe sets: Union is commutative: α β = β α. Union is associative: (α β) γ = α (β γ). is the identity for union: α = α = α. Union is idempotent: α α = α. Concatenation: Concatenation is associative: (αβ)γ = α(βγ). ε is the identity for concatenation: α ε = ε α = α. is a zero for concatenation: α = α =. Concatenation distributes over union: (α β) γ = (α γ) (β γ). γ (α β) = (γ α) (γ β). Kleene star: * = ε. ε* = ε. (α*)* = α*. α*α* = α*. α β)* = (α*β*)*.

Regular Grammars Chapter 7

Regular Languages Generates Regular Language Regular Grammar Recognizes or Accepts Finite State Machine

Regular Grammars A regular grammar G is a quadruple (V, Σ, R, S), where: V (rule alphabet) contains nonterminals and terminals terminals: symbols that can appear in strings generated by G nonterminals: symbols that are used in the grammar but do not appear in strings of the language Σ (the set of terminals) is a subset of V, R (the set of rules) is a finite set of rules of the form: X Y S (the start symbol) is a nonterminal

Regular Grammars In a regular grammar, all rules in R must: have a left hand side that is a single nonterminal have a right hand side that is: ε, or a single terminal, or a single terminal followed by a single nonterminal. Legal: S a, S ε, and T as Not legal: S asa and asa T Regular grammars must always produce strings one character at a time, moving left to right.

Regular Grammars The one we study is actually right regular grammar. Also called right linear grammar Generates regular languages, recognized by FSM Note FSM reads the input string w left to right Left regular grammar (left linear grammar) S a, S ε, and T Sa Does it generate regular languages?

Regular Grammar Example L = {w {a, b}* : w is even} ((aa) (ab) (ba) (bb))* M: G: S ε S at S bt T as T bs By convention, the start symbol of any grammar G will be the symbol on the left-hand side of the first rule Notice the clear correspondence between M and G Given one, easy to derive the other

Regular Languages and Regular Grammars Theorem: The class of languages that can be defined with regular grammars is exactly the regular languages. Proof: By two constructions.

Regular Languages and Regular Grammars Regular grammar FSM: grammartofsm(g = (V, Σ, R, S)) = 1. Create in M a separate state for each nonterminal in V. 2. Start state is the state corresponding to S. 3. If there are any rules in R of the form X w, for some w Σ, create a new state labeled #. 4. For each rule of the form X w Y, add a transition from X to Y labeled w. 5. For each rule of the form X w, add a transition from X to # labeled w. 6. For each rule of the form X ε, mark state X as accepting. 7. Mark state # as accepting. FSM Regular grammar: Similarly.

Strings That End with aaaa L = {w {a, b}* : w ends with the pattern aaaa}. S as S bs S ab B ac C ad D a

One Character Missing L = {w {a, b, c}*: there is a symbol in the alphabet not appearing in w}. S ε A ba C ac S ab A ca C bc S ac A ε C ε S ba B ab S bc B cb S ca B ε S cb How about ε transitions?