System & Network Engineering. Regular Expressions, ESA 2008/2009. Mark v/d Zwaag, Eelco Schatborn. 22 September 2008

Regular Expressions, ESA 2008/2009. Mark v/d Zwaag, Eelco Schatborn (eelco@os3.nl). 22 September 2008

Today: Regular Expressions and Grammars
  Formal languages
  Context-free grammars; BNF, ABNF
  Unix regular expressions

Regular Expressions
  A regular expression (regexp, regex, regxp) is a string (a word) that, according to certain syntax rules, describes a set of strings (a language).
  Regular expressions are used in many (Unix) text editors, tools and programming languages to search for patterns in a text and to substitute strings.

History
  Theory of formal languages
  Kleene's algebra of regular sets
  Ken Thompson introduced the RE notation to ed
  Regexes are used in grep, awk, emacs, vi and perl

Regular Expressions in formal language theory
  A regular expression represents a set of strings/words (a language).
  Regular expressions are made up of constants and operators.
  Given a finite alphabet Σ, the following constants are defined:
    empty set: ∅ represents the empty set
    empty string (length 0): ɛ represents the set {ɛ}
    literals: a character A in Σ represents the set {A}

Regular Expressions in formal language theory: the operators
  Assume regular expressions R and S.
    concatenation: (RS) represents the set {αβ | α ∈ R, β ∈ S}
    choice: (R | S) represents a choice between R and S, that is, the union R ∪ S
    iteration: (R*) represents the closure of R under concatenation; R* = {ɛ} ∪ R ∪ RR ∪ RRR ∪ ...
  Binding strength: Kleene star > concatenation > choice
    ((ab)c) can be written as abc
    (a | (b(c)*)) can be written as a | bc*
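The three operators can be tried out directly on finite sets of strings. The following is a minimal Python sketch, not part of the original slides; the function names concat, choice and star are made up for illustration, and the Kleene star is truncated at a maximum word length (an assumption needed to keep the set finite).

```python
from itertools import product

def concat(R, S):
    """Concatenation RS: every word of R followed by every word of S."""
    return {a + b for a, b in product(R, S)}

def choice(R, S):
    """Choice R | S is plain set union."""
    return R | S

def star(R, max_len):
    """Kleene star R*, truncated to words of at most max_len characters."""
    result = {""}            # the empty string is always a member of R*
    frontier = {""}
    while frontier:
        # Extend the newest words by one more word of R, keeping only new, short ones.
        frontier = {w for w in concat(frontier, R)
                    if len(w) <= max_len and w not in result}
        result |= frontier
    return result

zero, one = {"0"}, {"1"}
# (0 | 00)1* with the star cut off after two repetitions
lang = concat(choice(zero, concat(zero, zero)), star(one, 2))
print(sorted(lang, key=lambda w: (len(w), w)))
# ['0', '00', '01', '001', '011', '0011']
```

Truncating the star is only a device for printing; the real R* is infinite whenever R contains a non-empty word.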

Regular Expressions in formal language theory: examples
  Let Σ = {0, 1}, then
    00 represents {00}
    0100100 represents {0100100}
    0 | 1 represents {0, 1}
    10 | 01 represents {10, 01}
    0* represents {ɛ, 0, 00, 000, ...}
    (01)* represents {ɛ, 01, 0101, 010101, ...}
    (0 | 00)1* represents {0, 00, 01, 001, 011, 0011, ...}
    (0 | 00)1* = 01* | 001*

Regular Expressions in formal language theory: more examples
    a | b* represents {a, ɛ, b, bb, ...}
    (a | b)* = b*(ab*)*
    ab*(c | ɛ) represents the set of strings that begin with a single a, followed by zero or more b, and ending in an optional c.
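Identities between regular expressions can be spot-checked by enumerating all short words over the alphabet. Below is a small Python sketch using the standard re module; the helper name language and the length bound of 6 are illustrative choices, and agreement up to a bound is evidence for an identity, not a proof.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """All words over alphabet, up to max_len long, that pattern matches entirely."""
    words = ("".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

# (a | b)* and b*(ab*)* agree on every word of length 0..6
print(language(r"(a|b)*", "ab", 6) == language(r"b*(ab*)*", "ab", 6))   # True

# ab*(c | empty): a single a, then zero or more b, then an optional c
print(sorted(language(r"ab*(c|)", "abc", 3)))
# ['a', 'ab', 'abb', 'abc', 'ac']
```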

Regular Languages
  Languages represented by regular expressions are called regular languages.
  They correspond to the so-called type 3 grammars in the Chomsky hierarchy.
  Example of a regular grammar, with start symbol S:
    S → aS
    S → bA
    A → ɛ
    A → cA
  This grammar corresponds to the regular expression a*bc*
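The correspondence between the grammar and the expression a*bc* can be checked for short words. This Python sketch derives all words up to a given length from the grammar and compares them with what the regular expression accepts; the names RULES, derive and regex_language are assumptions made for this illustration.

```python
import re
from itertools import product

# Right-linear grammar: S -> aS | bA, A -> (empty) | cA
RULES = {"S": ["aS", "bA"], "A": ["", "cA"]}

def derive(max_len):
    """All terminal words of at most max_len characters derivable from S."""
    done, todo = set(), ["S"]
    while todo:
        form = todo.pop()
        if form[-1:] not in RULES:        # no nonterminal left: a finished word
            done.add(form)
            continue
        for rhs in RULES[form[-1]]:       # rewrite the trailing nonterminal
            new = form[:-1] + rhs
            terminals = len(new) - (new[-1:] in RULES)
            if terminals <= max_len:
                todo.append(new)
    return done

def regex_language(pattern, alphabet, max_len):
    """All words up to max_len over alphabet that the pattern matches entirely."""
    words = ("".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

print(sorted(derive(3), key=lambda w: (len(w), w)))
# ['b', 'ab', 'bc', 'aab', 'abc', 'bcc']
print(derive(4) == regex_language(r"a*bc*", "abc", 4))   # True
```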

Context-Free Languages
  Regular expressions/languages/grammars are less expressive than context-free grammars (CFGs).
  Example: the context-free language {a^k b^k | k ≥ 0}, generated by the CFG with start symbol S:
    S → aSb
    S → ɛ
  This language cannot be expressed by a regular expression.
  Note: regular expressions in Perl, for instance, are not strictly regular. Extra expressiveness can be detrimental to efficiency: the worst-case time complexity of matching a string against a Perl RE is exponential in the length of the input.
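A recogniser for this language only has to undo the rule S → aSb until nothing is left. The Python sketch below (the function name in_anbn is made up here) also shows that the nearest regular expression, a*b*, accepts too much because it cannot count.

```python
import re

def in_anbn(word):
    """Recognise a^k b^k by peeling off one a and one b per step (S -> aSb)."""
    while word.startswith("a") and word.endswith("b"):
        word = word[1:-1]
    return word == ""          # S -> empty string ends the derivation

for w in ["", "ab", "aabb", "aab", "abab"]:
    print(repr(w), in_anbn(w), re.fullmatch(r"a*b*", w) is not None)
# ''     True  True
# 'ab'   True  True
# 'aabb' True  True
# 'aab'  False True    <- a*b* cannot require equal counts
# 'abab' False False
```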

Backus-Naur Form (BNF)
  BNF grammars: a format for defining CFGs
  Example:
    <bit>  ::= 0 | 1
    <expr> ::= <bit> | (<expr> + <expr>) | (<expr> * <expr>)
  This BNF grammar generates the strings 0, 1, (0 + 1), (1 * (1 + 1)), ...
  Non-terminals: <bit>, <expr>
  Terminals: 0, 1, (, ), *, +, and the space character
  Note: this language is context-free but not regular; it cannot be described using a regular expression.
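Because <expr> is defined in terms of itself, a recogniser for this grammar is naturally recursive. The following Python sketch is one possible recursive-descent recogniser; the function names are illustrative, and it assumes that exactly one space is required on each side of + and *, as in the generated strings shown on the slide.

```python
def parse_expr(s, i=0):
    """Return the index just past one <expr> starting at s[i], or None on failure."""
    if s[i:i + 1] in ("0", "1"):                  # <expr> ::= <bit>
        return i + 1
    if s[i:i + 1] != "(":                         # otherwise a "(" must follow
        return None
    j = parse_expr(s, i + 1)                      # left operand
    if j is None or s[j:j + 3] not in (" + ", " * "):
        return None
    k = parse_expr(s, j + 3)                      # right operand
    if k is None or s[k:k + 1] != ")":
        return None
    return k + 1

def generated(s):
    """True if the whole string s is generated by <expr>."""
    return parse_expr(s) == len(s)

print(generated("(1 * (1 + 1))"))   # True
print(generated("(0+1)"))           # False: the spaces are terminals too
print(generated("0 + 1"))           # False: the parentheses are mandatory
```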

ABNF: Augmented BNF
  RFC 2234, Augmented BNF for Syntax Specifications: ABNF. Obsolete.
  RFC 4234, Augmented BNF for Syntax Specifications: ABNF.
  The specification of ABNF is itself written in ABNF.

ABNF: Rules
  Rules have names like elements, rule0 and char-a-z
  Rule names may be put inside angle brackets: <elements>
  Case-insensitive: <rulename>, <Rulename> and <RULENAME> refer to the same rule
  A rule is defined by the sequence
    name = elements ; comment CRLF

ABNF: Terminal Values
  Terminal values are characters; a character encoding like US-ASCII may be used.
    %b1000001 (binary for 65, US-ASCII "A")
    %x42 (hexadecimal for 66, US-ASCII "B")
    %d67 (decimal 67, US-ASCII "C")
    %d13.10 (the sequence CR, LF)
  Literal text: "abc" (the sequence a, b, c)
  NB: literal text is case-insensitive US-ASCII

Example
  Example 1: rulename = "abc" and rulename = "ABc" both generate the set
    { abc, Abc, aBc, abC, ABc, aBC, AbC, ABC }
  Example 2: rulename = %d97 %d98 %d99 and rulename = %d97.98.99 both generate the set
    { abc }
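The difference between numeric terminals and quoted literals can be made concrete with a few lines of Python. This is only a sketch: decode_terminal and case_variants are hypothetical helpers written for this illustration, and value ranges such as %x30-39 are deliberately not handled.

```python
from itertools import product

BASES = {"b": 2, "d": 10, "x": 16}

def decode_terminal(value):
    """Decode an ABNF numeric terminal such as %x42 or %d13.10 into a string."""
    if not value.startswith("%") or value[1:2].lower() not in BASES:
        raise ValueError(f"not a numeric terminal: {value!r}")
    base = BASES[value[1].lower()]
    # A dot separates the character codes of a sequence.
    return "".join(chr(int(part, base)) for part in value[2:].split("."))

def case_variants(text):
    """All spellings generated by a case-insensitive quoted literal."""
    return {"".join(p) for p in product(*((c.lower(), c.upper()) for c in text))}

print(decode_terminal("%b1000001"), decode_terminal("%x42"), decode_terminal("%d67"))
# A B C
print(repr(decode_terminal("%d13.10")))     # '\r\n'
print(decode_terminal("%d97.98.99"))        # abc (exactly this spelling)
print(len(case_variants("abc")))            # 8
```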

ABNF: Basic operators
  Concatenation
  Alternatives
  Repetition (variable)
  Grouping
  Comments

ABNF: Concatenation
  Rule = Rule1 Rule2
  Example:
    magic = xyzzy foo bar
    xyzzy = "xyzzy"
    foo = "foo"
    bar = "bar"
  magic generates "xyzzyfoobar"

ABNF: Alternatives
  Rule = Rule1 / Rule2
  Some notations use a pipe (|) instead of /
  Example:
    magic = xyzzy / foo / bar
  magic generates "xyzzy", but also "foo", and also "bar"

ABNF: Variable Repetition
  Rule = <n>*<m>Rule1
  n and m are optional decimal values; the default for n is 0 and for m is infinity
  Example (the angle brackets in <n>*<m> only mark placeholders):
    magic = 2*3xyzzy
  magic generates "xyzzyxyzzy" and "xyzzyxyzzyxyzzy"

ABNF: Grouping
  Rule = ( Rule1 )
  Only used for parsing (syntax); has no semantic counterpart
  Example:
    magictoo = ( magic )
  magictoo has the same productions as magic
  Example: elem (foo / bar) blat versus elem foo / bar blat
  Without parentheses the second form is read as (elem foo) / (bar blat), because concatenation binds more tightly than alternation. Use grouping to avoid misunderstanding.

ABNF: Comment
  Rule = ... ; followed by an explanation
  Example:
    magic = xyzzy "," foo "," bar ; comma-separated magic
  magic generates "xyzzy,foo,bar"

ABNF: More operators
  Incremental alternative
  Value ranges
  Optional presence
  Specific repetition

ABNF: Incremental alternative
  Alternatives may be added later in extra rules
  Rule =/ Rule1
  Example:
    magic = "xyzzy"
    magic =/ "foo"
    magic =/ "bar"
  Equivalently:
    magic = "xyzzy" / "foo" / "bar"

ABNF: Value ranges
  Uses - as range indicator in terminal specifications
  Example:
    DIGIT = %x30-39 ; "0" / "1" / ... / "9"
    UPPER = %x41-5a ; "A" / "B" / ... / "Z"
  DIGIT is a core rule. More core rules:
    ALPHA = %x41-5a / %x61-7a ; A-Z / a-z
    BIT = "0" / "1"

ABNF: Optional presence
  Rule = [ Rule1 ]
  Equivalently: Rule = *1Rule1
  Example:
    magic = [ "xyzzy" ]
  magic generates "xyzzy", but also ""

ABNF: Specific repetition
  Rule = <n>Rule1
  Equivalently: Rule = <n>*<n>Rule1
  Example (the angle brackets only mark the placeholder n):
    magic = 3"xyzzy"
  magic generates "xyzzyxyzzyxyzzy"
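Most ABNF operators have a direct counterpart in Unix-style regular expressions. The Python sketch below translates the magic examples by hand; the PATTERNS table and the matches helper are assumptions made for this illustration, not an ABNF parser, and re.IGNORECASE stands in for the case-insensitivity of ABNF literal text.

```python
import re

xyzzy, foo, bar = "xyzzy", "foo", "bar"

# ABNF rule (in the comment) next to a hand-written regular expression for it
PATTERNS = {
    "concatenation": xyzzy + foo + bar,        # magic = xyzzy foo bar
    "alternatives": f"{xyzzy}|{foo}|{bar}",    # magic = xyzzy / foo / bar
    "repetition": f"({xyzzy}){{2,3}}",         # magic = 2*3xyzzy
    "optional": f"({xyzzy})?",                 # magic = [ "xyzzy" ]
    "specific": f"({xyzzy}){{3}}",             # magic = 3"xyzzy"
}

def matches(rule, text):
    """True if text is generated by the named example rule."""
    return re.fullmatch(PATTERNS[rule], text, re.IGNORECASE) is not None

print(matches("concatenation", "xyzzyfoobar"))   # True
print(matches("repetition", "xyzzy"))            # False: at least two repetitions
print(matches("optional", ""))                   # True
print(matches("alternatives", "FOO"))            # True: literals ignore case
```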

Unix regexps
  The following syntax is more or less standard in many Unix tools and programming languages.
  Basic rules
  1. Every printable character that is not a metacharacter is a regular expression that represents itself, for example a for the letter a
  2. . represents any single character except newline
  3. ^ represents the beginning of a line
  4. $ represents the end of a line
  5. \ followed by a metacharacter represents the character itself; for example, \. represents a dot
  6. [E] represents a single character; the characterization E is put in brackets

Examples
  [a] represents a
  [abc] represents any of a, b and c
  [a-z] represents a character in the range a to z (ordered according to ASCII)
  [A-Za-z0-9] represents a digit or a letter
  [^E] represents every character that is not represented by [E]
  [acq-z] represents a, c, or a character in the range q to z
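What a bracket expression stands for can be listed by testing it against every printable ASCII character. A short Python sketch follows; chars_matching is a helper name invented here, and Python's re module is used as a stand-in for the Unix tools.

```python
import re

def chars_matching(char_class):
    """Printable ASCII characters matched by a bracket expression, in ASCII order."""
    return "".join(c for c in map(chr, range(32, 127))
                   if re.fullmatch(char_class, c))

print(chars_matching(r"[abc]"))               # abc
print(chars_matching(r"[acq-z]"))             # acqrstuvwxyz
print(len(chars_matching(r"[A-Za-z0-9]")))    # 62 (26 + 26 letters, 10 digits)
print("a" in chars_matching(r"[^abc]"))       # False: ^ negates the class
```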

Unix regexps: inductive rules
  If A and B are regular expressions, then
  1. AB is a regular expression (concatenation)
  2. A|B is a regular expression (choice/union)
  3. A* is a regular expression (Kleene star, zero or more)
  4. A+ is a regular expression (one or more)
  5. A? is a regular expression (zero or one)
  6. (A) is a regular expression (awk, egrep, perl)
  7. \(A\) is a regular expression (vi, sed, grep)

Unix regexps: inductive rules (continued)
  8. A{m,n} for integers m and n represents m to n concatenated instances of A
  9. A{m} represents m concatenated instances of A
  Iteration binds more strongly than concatenation, which binds more strongly than choice, so A|BC* = A|(B(C)*)
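The binding rules and the bounded repetition operators are easy to check experimentally. A minimal Python sketch, with full as an illustrative helper that requires the whole string to match:

```python
import re

def full(pattern, text):
    """True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# Iteration binds tighter than concatenation, which binds tighter than choice.
print(full(r"A|BC*", "BCC"))     # True: read as A|(B(C)*)
print(full(r"A|BC*", "AC"))      # False: the star does not apply to A
print(full(r"(A|B)C*", "AC"))    # True once the grouping is made explicit

# Bounded repetition.
print(full(r"o{2}", "oo"))       # True
print(full(r"o{1,3}", "oooo"))   # False: four instances exceed the upper bound
```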

Examples
  $ cat aap
  (aap)
  aap
  $ grep '(aap)' aap
  (aap)
  $ grep -E '(aap)' aap
  (aap)
  aap
  $ cat noot
  not
  noot
  nooot
  $ grep 'o\{2\}' noot
  noot
  nooot
  $ grep -E 'o{3}' noot
  nooot

More Examples (the regex must match the whole string)
  regex       matches                                  does not match
  A.          A9, Aa, AA                               aa, AAA
  a.c         abc, aac, a4c, a+c                       ABC, abcd, abbc
  a\.c        a.c                                      abc
  .ap         aap, lap, hap
  [al]ap      aap, lap
  [^al]ap     hap, kap                                 aap, lap
  [al]+ap     aap, lap, aaap, alap, laap, llap
  [^A-Z]      5, b                                     A, Q, W
  [abc]*      aaab, cba
  Iets.*      Iets, Iets is beter dan niets
  Iets.+      Iets is beter dan niets                  Iets
  [ab]{4}     abba, baba                               aba
  a(bc)*d     ad, abcd, abcbcd                         abcxd, bcbcd
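Tables like this one are convenient to turn into an executable check. The Python sketch below encodes a selection of the rows and tests each string against its pattern with re.fullmatch, on the assumption that the table means whole-string matching; the names CASES and check are made up for this illustration.

```python
import re

# (pattern, strings that should match, strings that should not)
CASES = [
    (r"A.",      ["A9", "Aa", "AA"],             ["aa", "AAA"]),
    (r"a.c",     ["abc", "aac", "a4c", "a+c"],   ["ABC", "abcd", "abbc"]),
    (r"a\.c",    ["a.c"],                        ["abc"]),
    (r"[^al]ap", ["hap", "kap"],                 ["aap", "lap"]),
    (r"[al]+ap", ["aap", "lap", "aaap", "alap", "laap", "llap"], []),
    (r"Iets.+",  ["Iets is beter dan niets"],    ["Iets"]),
    (r"[ab]{4}", ["abba", "baba"],               ["aba"]),
    (r"a(bc)*d", ["ad", "abcd", "abcbcd"],       ["abcxd", "bcbcd"]),
]

def check(cases):
    """Return the (pattern, string) pairs that disagree with the table."""
    wrong = []
    for pattern, good, bad in cases:
        wrong += [(pattern, s) for s in good if not re.fullmatch(pattern, s)]
        wrong += [(pattern, s) for s in bad if re.fullmatch(pattern, s)]
    return wrong

print(check(CASES))   # [] when every row behaves as listed
```

With grep the same rows behave differently unless the pattern is anchored with ^ and $ (or grep -x is used), because grep reports a line as soon as any part of it matches.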

Dutch Examples
  Telephone number (land line)
    ^[-0-9+() ]*$
    ^[0-9]{3,4}-[0-9]{6,7}$
    Example: 020-5253705
  Postal code
    ^[1-9]{1}[0-9]{3}[[:space:]]?[A-Z]{2}

More Examples
  Email
    [A-Za-z0-9_-]+([.]{1}[A-Za-z0-9_-]+)*@[A-Za-z0-9-]+([.]{1}[A-Za-z0-9-]+)+
    Example: E.P.Schatborn@uva.nl
  Mobile
    ^06[-]?[0-9]{8}$
    Example: 06-20369972
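The patterns for Dutch phone numbers, postal codes and email addresses can be exercised with a few sample values. This Python sketch makes some assumptions of its own: Python's re module has no POSIX character classes, so \s stands in for the space class; the redundant {1} is dropped; a closing $ is added to the postal code pattern; and 1098 XH is merely a sample value.

```python
import re

LANDLINE = r"^[0-9]{3,4}-[0-9]{6,7}$"
MOBILE = r"^06[-]?[0-9]{8}$"
POSTAL = r"^[1-9][0-9]{3}\s?[A-Z]{2}$"      # \s replaces the POSIX space class
EMAIL = (r"[A-Za-z0-9_-]+([.][A-Za-z0-9_-]+)*"
         r"@[A-Za-z0-9-]+([.][A-Za-z0-9-]+)+")

def ok(pattern, text):
    """True if the whole of text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(ok(LANDLINE, "020-5253705"))        # True
print(ok(MOBILE, "06-20369972"))          # True
print(ok(POSTAL, "1098 XH"))              # True
print(ok(POSTAL, "0123 AB"))              # False: the first digit must be 1-9
print(ok(EMAIL, "E.P.Schatborn@uva.nl"))  # True
print(ok(EMAIL, "nobody@localhost"))      # False: the domain needs at least one dot
```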

One More Example
  $ cat sedscr
  s/\([a-z][a-z]*\){\([^}]*\)}/\1<\2>/
  $ cat b
  a{ab}
  A{d}
  abc{aba}
  ABC(a)
  ABC{a}ab
  $ sed -f sedscr b
  a<ab>
  A{d}
  abc<aba>
  ABC(a)
  ABC{a}ab
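The same substitution can be written with Python's re.sub, which makes the two capture groups easier to see. This is a sketch of an equivalent, not the sed implementation: in Python the grouping parentheses are unescaped and the literal braces are escaped (the reverse of sed's basic syntax), and count=1 mirrors sed replacing only the first match on each line. The names PATTERN and rewrite are illustrative.

```python
import re

# Group 1: one or more lower-case letters; group 2: the text between { and }
PATTERN = re.compile(r"([a-z][a-z]*)\{([^}]*)\}")

def rewrite(line):
    """Replace the first name{text} on the line by name<text>."""
    return PATTERN.sub(r"\1<\2>", line, count=1)

for line in ["a{ab}", "A{d}", "abc{aba}", "ABC(a)", "ABC{a}ab"]:
    print(rewrite(line))
# a<ab>
# A{d}
# abc<aba>
# ABC(a)
# ABC{a}ab
```

Only lines in which a lower-case name directly precedes the opening brace are changed, since [a-z] does not match upper-case letters in Python.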

Assignment
  1. Create a Unix regular expression to show all lines that contain URIs in an HTML file (hrefs, ...).
     Use grep -E.
     Use wget to retrieve a page.
     Look at pages with many URIs, like www.csszengarden.com
  2. Create a regexp to read all telephone numbers correctly (correct length etc.).
  3. Create a regular expression using grep that will remove all lines with only comments from any bash script file.