Regular Expressions
ESA 2008/2009
Mark v/d Zwaag, Eelco Schatborn (eelco@os3.nl)
22 September 2008

Today: Regular Expressions and Grammars
  Formal Languages
  Context-free grammars; BNF, ABNF
  Unix Regular Expressions

Regular Expressions
  A regular expression (regexp, regex, regxp) is a string (a word) that, according to certain syntax rules, describes a set of strings (a language).
  Regular expressions are used in many (Unix) text editors, tools and programming languages to search for patterns in a text and to substitute strings.

History
  Theory of formal languages
  Kleene's algebra of regular sets
  Ken Thompson introduced the RE notation to ed
  Regex is used in grep, awk, emacs, vi, perl.

Regular Expressions in formal language theory
  A regular expression represents a set of strings/words (a language).
  Regular expressions are made up of constants and operators.
  Given a finite alphabet Σ, the following constants are defined:
    empty set: ∅ represents the empty set
    empty string (length 0): ε represents the set {ε}
    literals: a character A in Σ represents the set {A}

Regular Expressions in formal language theory
  The operators (assume regular expressions R and S):
    concatenation: (RS) represents the set {αβ | α ∈ R, β ∈ S}
    choice: (R | S) represents a choice between R and S, i.e. their union
    iteration: (R*) represents the closure of R under concatenation; R* = {ε} ∪ R ∪ RR ∪ RRR ∪ ...
  Binding strength: Kleene star > concatenation > choice
    ((ab)c) can be written as abc
    (a | (b(c*))) can be written as a | bc*

Regular Expressions in formal language theory: Examples
  Let Σ = {0, 1}, then
    00 represents {00}
    0100100 represents {0100100}
    0 | 1 represents {0, 1}
    10 | 01 represents {10, 01}
    0* represents {ε, 0, 00, 000, ...}
    (01)* represents {ε, 01, 0101, 010101, ...}
    (0 | 00)1* represents {0, 00, 01, 001, 011, 0011, ...}
    (0 | 00)1* = 01* | 001*

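The formal examples over Σ = {0, 1} can be checked mechanically. The following is a minimal sketch, assuming Python's `re` module as a stand-in for the formal notation (`|` for choice, `*` for the Kleene star); the helper name `in_language` is made up for this illustration.

```python
import re

# Formal regular expressions from the examples, in Python `re` syntax,
# each paired with a few words that should belong to its language.
EXAMPLES = {
    "00": ["00"],
    "0|1": ["0", "1"],
    "10|01": ["10", "01"],
    "0*": ["", "0", "00", "000"],
    "(01)*": ["", "01", "0101", "010101"],
    "(0|00)1*": ["0", "00", "01", "001", "011", "0011"],
}

def in_language(regex: str, word: str) -> bool:
    """Return True if the whole `word` belongs to the language of `regex`."""
    return re.fullmatch(regex, word) is not None

for regex, words in EXAMPLES.items():
    for word in words:
        assert in_language(regex, word), (regex, word)

# A word outside the language: (0|00)1* never starts with 1.
print(in_language("(0|00)1*", "10"))  # False
```

`re.fullmatch` is used rather than `re.search` because membership in a language concerns the whole word, not a substring.
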
Regular Expressions in formal language theory: More Examples
  a | b* represents {a, ε, b, bb, ...}
  (a | b)* = b*(ab*)*
  ab*(c | ε) represents the set of strings that begin with a single a, followed by zero or more b, and ending in an optional c.

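An identity such as (a | b)* = b*(ab*)* can be made plausible by comparing both languages on all short words. This is a sketch only, assuming Python's `re` syntax; a bounded check is evidence, not a proof, and the helper name `language` is hypothetical.

```python
import re
from itertools import product

def language(regex: str, alphabet: str, max_len: int) -> set:
    """All words over `alphabet` of length <= max_len that `regex` matches in full."""
    pattern = re.compile(regex)
    words = set()
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if pattern.fullmatch(word):
                words.add(word)
    return words

lhs = language("(a|b)*", "ab", 6)
rhs = language("b*(ab*)*", "ab", 6)
print(lhs == rhs)  # True: both contain every word over {a, b} up to length 6
```
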
Regular Languages
  Languages represented by regular expressions are called regular languages.
  They correspond to the so-called type 3 grammars in the Chomsky hierarchy.
  Example of a regular grammar:
    S → aS
    S → bA
    A → ε
    A → cA
  with start symbol S; it corresponds to the regular expression a*bc*

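The correspondence between the grammar and a*bc* can be explored by deriving all short words from the grammar and matching them against the expression. A minimal sketch, assuming the rules S → aS | bA and A → ε | cA; the names `RULES` and `generate` are chosen for this illustration only.

```python
import re

# The right-linear grammar: S -> aS | bA, A -> ε | cA.
RULES = {
    "S": ["aS", "bA"],
    "A": ["", "cA"],
}

def generate(max_len: int) -> set:
    """Derive all terminal strings of length <= max_len, starting from S."""
    results = set()
    pending = ["S"]
    while pending:
        form = pending.pop()
        # In a right-linear grammar the only non-terminal is the last symbol.
        if form and form[-1] in RULES:
            prefix, nonterminal = form[:-1], form[-1]
            if len(prefix) > max_len:
                continue  # any word derived from here would be too long
            for body in RULES[nonterminal]:
                pending.append(prefix + body)
        elif len(form) <= max_len:
            results.add(form)
    return results

words = generate(5)
print(all(re.fullmatch("a*bc*", w) for w in words))  # True
```
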
Context-Free Languages
  Regular expressions/languages/grammars are less expressive than context-free grammars (CFGs).
  Example: the context-free language a^k b^k (k ≥ 0)
    S → aSb
    S → ε
  (CFG) with start symbol S. This language cannot be expressed by a regular expression.
  Note: regular expressions in Perl, for instance, are not strictly regular. Extra expressiveness can be detrimental to efficiency: the worst-case time complexity of matching a string against a Perl RE is exponential in the length of the input.

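Although no regular expression describes a^k b^k, a recogniser that follows the two grammar rules directly is short. A minimal sketch; the function name `is_akbk` is made up here.

```python
def is_akbk(word: str) -> bool:
    """Recognise a^k b^k (k >= 0) by following S -> aSb | ε."""
    if word == "":
        return True                      # S -> ε
    if word[0] == "a" and word[-1] == "b":
        return is_akbk(word[1:-1])       # S -> aSb
    return False

print(is_akbk("aaabbb"))  # True
print(is_akbk("aabbb"))   # False
```

The recursion has to remember how many a's are still unmatched, which is exactly what a finite automaton, and hence a regular expression, cannot do for unbounded k.
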
Backus-Naur Form (BNF)
  BNF grammars: a format for defining CFGs
  Example
    <bit> ::= 0 | 1
    <expr> ::= <bit> | (<expr> + <expr>) | (<expr> * <expr>)
  This BNF grammar generates the strings 0, 1, (0 + 1), (1 * (1 + 1)), ...
  Non-terminals: <bit>, <expr>
  Terminals: 0, 1, (, ), *, +, (space)
  Note: this language is context-free, but not regular; it cannot be described using a regular expression.
  System & Network Engineering

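Because the grammar is context-free, a small recursive-descent parser can recognise it. The sketch below assumes exactly one space on each side of an operator, as in the example strings; the names `parse_expr` and `is_expr` are hypothetical.

```python
def parse_expr(text: str, pos: int = 0) -> int:
    """Parse one <expr> starting at `pos` and return the position after it.

    Raises ValueError when the text does not follow the grammar.
    """
    if pos < len(text) and text[pos] in "01":        # <expr> ::= <bit>
        return pos + 1
    if pos < len(text) and text[pos] == "(":         # ( <expr> op <expr> )
        pos = parse_expr(text, pos + 1)
        if text[pos:pos + 3] not in (" + ", " * "):
            raise ValueError(f"expected operator at {pos}")
        pos = parse_expr(text, pos + 3)
        if text[pos:pos + 1] != ")":
            raise ValueError(f"expected ')' at {pos}")
        return pos + 1
    raise ValueError(f"expected <bit> or '(' at {pos}")

def is_expr(text: str) -> bool:
    """Return True if the whole text is generated by <expr>."""
    try:
        return parse_expr(text) == len(text)
    except ValueError:
        return False

print(is_expr("(1 * (1 + 1))"))  # True
print(is_expr("0 + 1"))          # False: parentheses are required
```
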
ABNF: Augmented BNF
  RFC 2234, Augmented BNF for Syntax Specifications: ABNF. Obsolete.
  RFC 4234, Augmented BNF for Syntax Specifications: ABNF.
  Specification of ABNF in ABNF

ABNF: Rules
  Rules have names like elements, rule0 and char-a-z
  Rule names may be put inside angle brackets: <elements>
  Case-insensitive: <rulename>, <Rulename>, and <RULENAME> refer to the same rule
  A rule is defined by a sequence
    name = elements ; comment crlf

ABNF: Terminal Values
  Terminal values: characters
  A character encoding like US-ASCII may be used
    %b1000001 (binary for 65, US-ASCII A)
    %x42 (hexadecimal for 66, US-ASCII B)
    %d67 (decimal 67, US-ASCII C)
    %d13.10 (the sequence CR, LF)
  Literal text: "abc" (the sequence a, b, c)
  NB: literal text is case-insensitive US-ASCII

Example
  Example 1
    rulename = "abc" and rulename = "ABc"
    both generate the set {abc, Abc, aBc, abC, ABc, aBC, AbC, ABC}
  Example 2
    rulename = %d97 %d98 %d99 and rulename = %d97.98.99
    both generate the set {abc}

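The difference between a quoted literal and numeric terminal values can be illustrated by enumerating what each generates. A minimal sketch; `case_variants` and `from_decimal` are hypothetical helpers, not part of any ABNF library.

```python
from itertools import product

def case_variants(literal: str) -> set:
    """All strings an ABNF quoted literal generates (case-insensitive match)."""
    choices = [(c.lower(), c.upper()) for c in literal]
    return {"".join(chars) for chars in product(*choices)}

def from_decimal(*codes: int) -> str:
    """The single string generated by numeric terminals such as %d97.98.99."""
    return bytes(codes).decode("ascii")

print(sorted(case_variants("abc")))
print(case_variants("abc") == case_variants("ABc"))  # True
print(from_decimal(97, 98, 99))                       # abc
```
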
ABNF: Basic operators
  Concatenation
  Alternatives
  Repetition (variable)
  Grouping
  Comments

ABNF: Concatenation
  Rule = Rule1 Rule2
  Example
    magic = xyzzy foo bar
    xyzzy = "xyzzy"
    foo = "foo"
    bar = "bar"
    magic generates "xyzzyfoobar"

ABNF: Alternatives
  Rule = Rule1 / Rule2
  Other BNF variants sometimes use a pipe ( | ) instead of /
  Example
    magic = xyzzy / foo / bar
    magic generates "xyzzy", but also "foo", and also "bar"

ABNF: Variable Repetition
  Rule = <n>*<m>Rule1
  n and m are optional decimal values
  Default for n is 0 and for m is infinity
  Example
    magic = 2*3xyzzy
    magic generates "xyzzyxyzzy" and "xyzzyxyzzyxyzzy"

ABNF: Grouping
  Rule = ( Rule1 )
  Only used for parsing (syntax)
  Has no semantic counterpart
  Example
    magictoo = ( magic )
    magictoo has the same productions as magic
  Example
    elem (foo / bar) blat
    versus
    elem foo / bar blat
    Use grouping to avoid misunderstanding: the second rule is read as (elem foo) / (bar blat)

ABNF: Comment
  Rule = ... ; followed by an explanation
  Example
    magic = xyzzy "," foo "," bar ; comma separated magic
    magic generates "xyzzy,foo,bar"

ABNF: More operators
  Incremental alternative
  Value ranges
  Optional presence
  Specific repetition

ABNF: Incremental alternative
  Alternatives may be added later in extra rules
  Rule =/ Rule1
  Example
    magic = "xyzzy"
    magic =/ "foo"
    magic =/ "bar"
  Equivalently:
    magic = "xyzzy" / "foo" / "bar"

ABNF: Value ranges
  Uses - as range indicator in terminal specifications
  Example
    DIGIT = %x30-39 ; "0" / "1" / ... / "9"
    UPPER = %x41-5A ; "A" / "B" / ... / "Z"
  DIGIT is a Core Rule. More core rules:
    ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
    BIT = "0" / "1"

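A value range is simply a test on a character's code point. The sketch below mirrors DIGIT and ALPHA under that reading; the helper names are chosen for illustration.

```python
def in_range(char: str, low: int, high: int) -> bool:
    """True if the code point of `char` lies in the inclusive range low-high."""
    return low <= ord(char) <= high

def is_digit(char: str) -> bool:
    return in_range(char, 0x30, 0x39)                                 # DIGIT = %x30-39

def is_alpha(char: str) -> bool:
    return in_range(char, 0x41, 0x5A) or in_range(char, 0x61, 0x7A)   # ALPHA

print(is_digit("7"), is_digit("a"))   # True False
print(is_alpha("Q"), is_alpha("_"))   # True False
```
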
ABNF: Optional presence
  Rule = [ Rule1 ]
  Equivalently: Rule = *1Rule1
  Example
    magic = [ "xyzzy" ]
    magic generates "xyzzy", but also ""

ABNF: Specific repetition
  Rule = <n>Rule1
  Equivalently: Rule = <n>*<n>Rule1
  Example
    magic = 3"xyzzy"
    magic generates "xyzzyxyzzyxyzzy"

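The three ABNF repetition forms (variable, optional, specific) all reduce to one lower and one upper bound. As a rough illustration, they can be mapped onto the `{low,high}` quantifier of Python's `re` module; `repetition` is a hypothetical helper and handles literal strings only.

```python
import re

def repetition(literal: str, low: int = 0, high=None):
    """Compile a regex for the ABNF repetition low*high of a literal string.

    high=None stands for the ABNF default upper bound (infinity).
    """
    upper = "" if high is None else str(high)
    return re.compile(f"(?:{re.escape(literal)}){{{low},{upper}}}")

variable = repetition("xyzzy", 2, 3)   # magic = 2*3"xyzzy"
optional = repetition("xyzzy", 0, 1)   # magic = [ "xyzzy" ]
specific = repetition("xyzzy", 3, 3)   # magic = 3"xyzzy"

print(bool(variable.fullmatch("xyzzyxyzzy")))  # True
print(bool(variable.fullmatch("xyzzy")))       # False: fewer than 2 repetitions
print(bool(optional.fullmatch("")))            # True
```
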
Unix regexps
  The following syntax is more or less standard in many Unix tools and programming languages.
  Basic rules
  1. Every printable character that is not a metacharacter is a regular expression that represents itself, for example a for the letter a
  2. . represents any single character except newline
  3. ^ represents the beginning of a line
  4. $ represents the end of a line
  5. \ followed by a metacharacter represents that character itself; for example, \. represents a dot
  6. [E] represents a single character; the characterization E of that character is put in brackets

Examples
  [a] represents a
  [abc] represents any of a, b, and c
  [a-z] represents a character in the range a to z (ordered according to ASCII)
  [A-Za-z0-9] represents a letter or a digit
  [^E] represents every character that is not represented by [E]
  [acq-z] represents a, c, or a character in the range q to z

Unix regexps
  Inductive rules
  1. If A and B are regular expressions, then AB is a regular expression (concatenation)
  2. A|B is a regular expression (choice/union)
  3. A* is a regular expression (Kleene star, zero or more)
  4. A+ is a regular expression (one or more)
  5. A? is a regular expression (zero or one)
  6. (A) is a regular expression (awk, egrep, perl)
  7. \(A\) is a regular expression (vi, sed, grep)

Unix regexps
  Inductive rules (continued)
  8. A{m,n} for integers m and n represents m to n concatenated instances of A
  9. A{m} represents m concatenated instances of A
  Iteration binds stronger than concatenation, which binds stronger than choice, so A|BC* = A|(B(C)*)

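The binding rule can be checked by comparing the implicit and the fully parenthesised form on all short words. This sketch assumes Python's `re` syntax, which follows the same precedence for these operators; `matches` is a made-up helper name.

```python
import re
from itertools import product

def matches(regex: str, alphabet: str = "abc", max_len: int = 4) -> set:
    """All words over `alphabet` up to `max_len` that `regex` matches in full."""
    pattern = re.compile(regex)
    found = set()
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if pattern.fullmatch(word):
                found.add(word)
    return found

implicit = matches("a|bc*")        # A|BC*
explicit = matches("a|(b(c)*)")    # A|(B(C)*)
regrouped = matches("(a|b)c*")     # a different grouping, for contrast

print(implicit == explicit)                   # True
print("ac" in implicit, "ac" in regrouped)    # False True
print(bool(re.fullmatch("o{2,3}", "ooo")), bool(re.fullmatch("o{2}", "ooo")))  # True False
```
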
Examples
$ cat aap
(aap)
aap
$ grep '(aap)' aap
(aap)
$ grep -E '(aap)' aap
(aap)
aap
$ cat noot
not
noot
nooot
$ grep 'o\{2\}' noot
noot
nooot
$ grep -E 'o{3}' noot
nooot

More Examples
  (each pattern is taken to match the whole string)

  regex       matches                              does not match
  A.          A9, Aa, AA                           aa, AAA
  a.c         abc, aac, a4c, a+c                   ABC, abcd, abbc
  a\.c        a.c                                  abc
  .ap         aap, lap, hap
  [al]ap      aap, lap
  [^al]ap     hap, kap                             aap, lap
  [al]+ap     aap, lap, aaap, alap, laap, llap
  [^A-Z]      5, b                                 A, Q, W
  [abc]*      aaab, cba
  Iets.*      Iets, Iets is beter dan niets
  Iets.+      Iets is beter dan niets              Iets
  [ab]{4}     abba, baba                           aba
  a(bc)*d     ad, abcd, abcbcd                     abcxd, bcbcd

  ("Iets is beter dan niets" is Dutch for "Something is better than nothing".)

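A selection of these rows can be verified automatically. The sketch below assumes whole-string matching via Python's `re.fullmatch`; Python's syntax agrees with the Unix syntax for the constructs used here.

```python
import re

# (regex, strings that should match, strings that should not match)
TABLE = [
    (r"A.",      ["A9", "Aa", "AA"],               ["aa", "AAA"]),
    (r"a.c",     ["abc", "aac", "a4c", "a+c"],     ["ABC", "abcd", "abbc"]),
    (r"a\.c",    ["a.c"],                          ["abc"]),
    (r"[^al]ap", ["hap", "kap"],                   ["aap", "lap"]),
    (r"[al]+ap", ["aap", "lap", "aaap", "llap"],   []),
    (r"[^A-Z]",  ["5", "b"],                       ["A", "Q", "W"]),
    (r"Iets.+",  ["Iets is beter dan niets"],      ["Iets"]),
    (r"[ab]{4}", ["abba", "baba"],                 ["aba"]),
    (r"a(bc)*d", ["ad", "abcd", "abcbcd"],         ["abcxd", "bcbcd"]),
]

def check(table) -> bool:
    """True if every row behaves as the table claims."""
    for regex, good, bad in table:
        if not all(re.fullmatch(regex, s) for s in good):
            return False
        if any(re.fullmatch(regex, s) for s in bad):
            return False
    return True

print(check(TABLE))  # True
```
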
Dutch Examples
  Telephone number (land line)
    ^[-0-9+() ]*$
    ^[0-9]{3,4}-[0-9]{6,7}$
    020-5253705
  Postal code
    ^[1-9]{1}[0-9]{3}[[:space:]]?[A-Z]{2}

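The stricter land-line pattern and the postal-code pattern can be tried out as follows. This is a sketch with two stated deviations: Python's `re` has no POSIX `[[:space:]]` class, so `\s` stands in for it, and a closing `$` is added to the postal-code pattern to reject trailing text. The postal codes used are illustrative inputs only.

```python
import re

# Land-line pattern, unchanged.
LANDLINE = re.compile(r"^[0-9]{3,4}-[0-9]{6,7}$")

# Postal code: \s replaces [[:space:]] and a trailing $ is added (assumptions).
POSTAL_CODE = re.compile(r"^[1-9][0-9]{3}\s?[A-Z]{2}$")

print(bool(LANDLINE.match("020-5253705")))  # True
print(bool(POSTAL_CODE.match("1098 XH")))   # True
print(bool(POSTAL_CODE.match("0123 AB")))   # False: first digit must be 1-9
```
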
More Examples
  Email
    [A-Za-z0-9_-]+([.]{1}[A-Za-z0-9_-]+)*@[A-Za-z0-9-]+([.]{1}[A-Za-z0-9-]+)+
    E.P.Schatborn@uva.nl
  Mobile
    ^06[-]?[0-9]{8}$
    06-20369972

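Both patterns carry over to Python's `re` module unchanged. In this sketch the email pattern is applied with `fullmatch`, an assumption made here because the pattern itself has no `^` and `$` anchors; the address `user@localhost` is a made-up negative example.

```python
import re

EMAIL = re.compile(
    r"[A-Za-z0-9_-]+([.]{1}[A-Za-z0-9_-]+)*"   # local part, dot-separated
    r"@[A-Za-z0-9-]+([.]{1}[A-Za-z0-9-]+)+"    # domain with at least one dot
)
MOBILE = re.compile(r"^06[-]?[0-9]{8}$")

print(bool(EMAIL.fullmatch("E.P.Schatborn@uva.nl")))  # True
print(bool(EMAIL.fullmatch("user@localhost")))        # False: no dot in the domain
print(bool(MOBILE.match("06-20369972")))              # True
```
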
One More Example
$ cat sedscr
s/\([a-z][a-z]*\){\([^}]*\)}/\1<\2>/
$ cat b
a{ab}
A{d}
abc{aba}
ABC(a)
ABC{a}ab
$ sed -f sedscr b
a<ab>
A{d}
abc<aba>
ABC(a)
ABC{a}ab

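The same substitution can be traced in Python. This is a sketch, assuming case-sensitive matching of `[a-z]` (as in the C locale): the braces are escaped because Python treats `{` as repetition syntax, and `count=1` mirrors sed replacing only the first match on each line when no `g` flag is given.

```python
import re

# Python counterpart of: s/\([a-z][a-z]*\){\([^}]*\)}/\1<\2>/
PATTERN = re.compile(r"([a-z][a-z]*)\{([^}]*)\}")

def rewrite(line: str) -> str:
    """Replace the first name{...} on the line by name<...>."""
    return PATTERN.sub(r"\1<\2>", line, count=1)

for line in ["a{ab}", "A{d}", "abc{aba}", "ABC(a)", "ABC{a}ab"]:
    print(rewrite(line))
```

Only lines in which lowercase letters directly precede the opening brace are rewritten; the others pass through unchanged.
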
Assignment
  1. Create a Unix regular expression to show all lines that contain URIs in an HTML file (hrefs ...)
       use grep -E
       use wget to retrieve a page
       look at pages with many URIs, like www.csszengarden.com
  2. Create a regexp to read all telephone numbers correctly (correct length etc. ...)
  3. Create a regular expression using grep that will remove all lines with only comments from any bash script file