System & Network Engineering. Regular Expressions ESA 2008/2009. Mark v/d Zwaag, Eelco Schatborn 22 september PDF Free Download

1 Regular Expressions ESA 2008/2009 Mark v/d Zwaag, Eelco Schatborn eelco@os3.nl 22 september 2008

Today: Regular1 Expressions and Grammars Formal Languages Context-free grammars; BNF, ABNF Unix Regular Expressions

Regular 1 Expressions A regular expression (regexp, regex, regxp) is a string (a word), that, according to certain syntax rules, describes a set of strings (a language). Regular Expressions are used in many text (unix) editors, tools and programming languages to search for patterns in a text, and for substitution of strings

1 History Theory of formal languages Kleenes algebra of regular sets Ken Thompson introduced the RE notation to ed Regex is used in grep, awk, emacs, vi, perl.

Regular Expressions 1 in formal languages theory a regular expression represents a set of strings/words (a language). Regular Expressions are made up of constants and operators Given a finite alphabet Σ, the following constants are defined: empty set represents the empty set empty string (length 0) ɛ represents the set {ɛ} literals a character A in Σ represents the set {A}

Regular Expressions 1 in formal languages theory The operators Assume Regular Expressions R and S concatenation (RS) represents the set {αβ α R, β S} choice (R S) represents a choice between R and S iteration (R) represents the closure of R under concatenation; R = {ɛ} R RR RRR Binding strength: Kleene star > concatenation > choice ((ab)c) can be written as abc (a (b(c) )) as a bc.

Regular Expressions 1 in formal languages theory Examples Let Σ = {0, 1}, then 00 represents {00} 0100100 represents {0100100}

Regular Expressions 1 in formal languages theory Examples Let Σ = {0, 1}, then 00 represents {00} 0100100 represents {0100100} 0 1 represents {0, 1} 10 01 represents {10, 01}

Regular Expressions 1 in formal languages theory Examples Let Σ = {0, 1}, then 00 represents {00} 0100100 represents {0100100} 0 1 represents {0, 1} 10 01 represents {10, 01} 0 represents {ɛ, 0, 00, 000,...} (01) represents {ɛ, 01, 0101, 010101,...}

Regular Expressions 1 in formal languages theory More Examples a b represents {a, ɛ, b, bb,...} (a b) = b (ab ) ab (c ɛ) represents the set of strings that begin with a single a, followed by zero or more b, and ending in an optional c.

Regular 1 Languages languages represented by Regular Expressions are called regular languages they correspond with so called type 3 grammars in the Chomsky hierarchy Example of a Regular grammar S as S ba A ɛ A ca with startsymbol S corresponds with regular expression

Context-Free 1 languages Regular Expressions/Languages/Grammars are less expressive than context-free grammars (CFGs) Example Context-Free Language a k b k S asb S ɛ (CFG) with start symbol S. This language cannot be expressed by a regular expression. Note For instance regular expressions in Perl are not strictly regular. Extra expressiveness can be detrimental to effectiveness Worst-case complexity or matching a string agains a Perl RE is time exponential to the length of the input

Backus-Naur Form System & Network Engineering Example 1 BNF Grammars a format for defining CFG s <bit> ::= 0 1 <expr> ::= <bit> (<expr> + <expr>) (<expr> * <expr>) This BNF grammar generates the strings 0, 1, (0 + 1), (1 * (1 + 1)),... Non-terminals <bit>, <expr> Terminals 0, 1, (, ), *, +, (spatie)

ABNF: 1 Augmented BNF RFC 2234, Augmented BNF for Syntax Specifications: ABNF. Obsolete. RFC 4234, Augmented BNF for Syntax Specifications: ABNF. Specificatie of ABNF in ABNF

1ABNF: Rules Rules have names like elements, rule0 and char-a-z Rulenames may be put inside brackets <elements> Case-insensitive: <rulename>, <rulename>, and <RULENAME> refer to the same rule

1ABNF: Rules Rules have names like elements, rule0 and char-a-z Rulenames may be put inside brackets <elements> Case-insensitive: <rulename>, <rulename>, and <RULENAME> refer to the same rule A rule is defined by a sequence name = elements ; comment crlf

ABNF: 1 Terminal Values Terminal values: characters A character encoding like US-ASCII may be used %b1000001 (binary 65, US-ASCII A ) %x42 (hexadecimal 66, US-ASCII B ) %d67 (decimal 67, US-ASCII C ) %d13.10 (the sequence CR,LF) Literal text: abc (the sequence a,b,c) NB: case-insensitive US-ASCII

1 Example Example 1 rulename = "abc" and rulename = "ABc" both generate the set { abc, Abc, abc, abc, ABc, abc, AbC, ABC}

1 Example Example 1 rulename = "abc" and rulename = "ABc" both generate the set { abc, Abc, abc, abc, ABc, abc, AbC, ABC} Example 2 rulename = %d97 %d98 %d99 and rulename = %d97.98.99 both generate the set { abc }

ABNF: 1 Basic operators Concatenation Alternatives Repetition (variable) Grouping Comments

ABNF: 1 Concatenation Rule = Rule1 Rule2 Example magic = xyzzy foo bar xyzzy = "xyzzy" foo = "foo" bar = "bar" magic "xyzzyfoobar"

ABNF: 1 Alternatives Rule = Rule1 / Rule2 Sometimes uses pipe ( ) instead of / Example magic = xyzzy / foo / bar magic "xyzzy", but also magic "foo", and also magic "bar"

ABNF: 1Variable Repetition Rule = <n> * <m> Rule1 n and m are optional decimal values Default for n is 0 and for m is Example magic = <2> * <3> xyzzy magic "xyzzyxyzzy" magic "xyzzyxyzzyxyzzy"

Rule = ( Rule1 ) ABNF: 1 Grouping Only used for parsing (syntax) Has no semantic counterpart Example magictoo = ( magic ) magictoo has the same productions as magic

Rule = ( Rule1 ) ABNF: 1 Grouping Only used for parsing (syntax) Has no semantic counterpart Example magictoo = ( magic ) magictoo has the same productions as magic Example elem (foo / bar) blat versus elem foo / bar blat Use grouping to avoid misunderstanding (elem foo) / (bar blat)

ABNF: 1 Comment Rule =... ; Followed by an explanation Example magic = xyzzy "," foo "," bar ; comma separated magic magic "xyzzy,foo,bar"

ABNF: 1 More operators Incremental alternative Value ranges Optional presence Specific repetition

ABNF: Incremental 1 alternative Alternatives may be added later in extra rules Rule =/ Rule1 Example magic = "xyzzy" magic =/ "foo" magic =/ "bar" Equivalently: magic = "xyzzy" / "foo" / "bar"

ABNF: 1 Value ranges Uses - as range indicator in terminal specifications Example DIGIT = %x30-39 ; "0" / "1" /... / "9" UPPER = %x41-5a ; "A" / "B" /... / "Z"

ABNF: 1 Value ranges Uses - as range indicator in terminal specifications Example DIGIT = %x30-39 ; "0" / "1" /... / "9" UPPER = %x41-5a ; "A" / "B" /... / "Z" DIGIT is a Core Rule. More core rules: ALPHA = %x41-5a / %x61-7a ; A-Z / a-z BIT = "0" / "1"

ABNF: 1 Optional presence Rule = [ Rule1 ] Equivalently: Rule = *<1> Rule1 Example magic = [ "xyzzy" ] magic "xyzzy", but also magic ""

ABNF: 1 Specific repetition Rule = <n> Rule1 Equivalently: Rule = <n> * <n> Rule1 Example magic = <3> "xyzzy" magic "xyzzyxyzzyxyzzy"

1Unix regexps The following syntax is more or less standard in many unix tools and programming languages Basic rules 1. Every printable character that is not a meta character is a regular expression that represents itself, for example a for the letter a 2.. represents any single character except newline 3. ^ represents the beginning of a line 4. $ represents the ending of a line 5. \ followed by a metacharacter represents the character itself Meaning \. represents a dot. 6. [E] represents a single character. The characterization of E is put in brackets.

1 Examples [a] represents a [abc] represents any of a, b, and c [a-z] represents a character in the range a z (ordered according to the ASCII notation)

1 Examples [a] represents a [abc] represents any of a, b, and c [a-z] represents a character in the range a z (ordered according to the ASCII notation) [A-Za-z0-9] represents a digit or character

1 Examples [a] represents a [abc] represents any of a, b, and c [a-z] represents a character in the range a z (ordered according to the ASCII notation) [A-Za-z0-9] represents a digit or character [^E] represents every character that is not represented by [E] [acq-z] represents a, c, or a character in q z.

1Unix regexps Inductive rules 1. if A and B are Regular Expressions then AB is a regular expression (concatenation)

1Unix regexps Inductive rules 1. if A and B are Regular Expressions then AB is a regular expression (concatenation) 2. A B is a regular expression (choice/unification)

1Unix regexps Inductive rules 1. if A and B are Regular Expressions then AB is a regular expression (concatenation) 2. A B is a regular expression (choice/unification) 3. A* is a regular expression (Kleene star, zero or more) 4. A+ is a regular expression (one or more) 5. A? is a regular expression (zero or one)

1Unix regexps Inductive rules 1. A{m, n} for integers m and n represents m to n concatenated instances of A 2. A{m} represents m concatenated instances of A Iteration binds stronger than concatenation which binds stronger than choice, so A BC* = A (B(C)*)

$ cat aap (aap) aap $ grep (aap) aap (aap) $ grep -E (aap) aap (aap) aap $ cat noot not noot nooot $ grep o\{2\} noot noot nooot $ grep -E o{3} noot nooot System & Network Engineering 1 Examples

More 1 Examples regex does doesn t A. A9, Aa, AA aa, AAA a.c abc, aac, a4c, a+c ABC, abcd, abbc a\.c a.c abc

More 1 Examples regex does doesn t A. A9, Aa, AA aa, AAA a.c abc, aac, a4c, a+c ABC, abcd, abbc a\.c a.c abc.ap aap, lap, hap [al]ap aap, lap [\^al]ap hap, kap aap, lap [al]+ap aap, lap, aaap, alap, laap, llap [^A-Z] 5, b A, Q, W [abc]* aaab, cba Iets.* Iets, Iets is beter dan niets Iets.+ Iets is beter dan niets Iets

Dutch 1 Examples Telephone number (land line) ^[-0-9+() ]*$ ^[0-9]{3,4}-[0-9]{6,7}$ 020-5253705 Postal code ^[1-9]{1}[0-9]{3}[:space:]?[A-Z]{2}

More 1 Examples Email [A-Za-z0-9_-]+([.]{1}[A-Za-z0-9_-]+)*@[A-Za-z0-9-]+ ([.]{1}[A-Za-z0-9-]+)+ E.P.Schatborn@uva.nl Mobile ^06[-]?[0-9]{8}$ 06-20369972

One1 More Example Email $ cat sedscr s/$[a-z][a-z]*${$[^}]*$}/\1<\2>/

One1 More Example Email $ cat sedscr s/$[a-z][a-z]*${$[^}]*$}/\1<\2>/ $ cat b a{ab} A{d} abc{aba} ABC(a) ABC{a}ab

One1 More Example Email $ cat sedscr s/$[a-z][a-z]*${$[^}]*$}/\1<\2>/ $ cat b a{ab} A{d} abc{aba} ABC(a) ABC{a}ab $ sed -f sedscr b a{ab} A<d> abc{aba} ABC(a) ABC<a>ab

1 Assignment 1 create a unix regular expression to show all lines that contain URI s in a html file. (hrefs... ) use grep -E use wget to retrieve a page look at pages with many URI s, like www.csszengarden.com 2 create a regexp to read all telephone numbers correctly. (correct length etc... ) 3 create a regular expression using grep that will remove all lines with only comments from any bash script file

System & Network Engineering. Regular Expressions ESA 2008/2009. Mark v/d Zwaag, Eelco Schatborn 22 september 2008