DVA337 HT17 - LECTURE 4 Languages and regular expressions 1
SO FAR 2
TODAY Formal definition of languages in terms of strings; operations on strings and languages; definition of regular expressions; meaning of regular expressions in terms of languages; outlook: practical use of regular expressions. 3
LANGUAGES alphabets, strings, and languages 4
LANGUAGE How can we define what a (formal) language is? 5
LANGUAGE We define a language to be a set of strings over an alphabet Σ. An alphabet is a set of symbols, e.g., { a, b, c, ..., z }. A string over an alphabet Σ is a sequence of symbols from the alphabet. What is the alphabet for the language L = { apple, pear, 1911 }? For L = { x : x is a binary string }? 6
EXERCISE, LANGUAGE How can we define the alphabet and language for 1) the programming language C? 2) written English? 7
STRINGS 8
STRINGS For a_1, a_2, ..., a_n ∈ Σ, the sequence a_1 a_2 ... a_n is a string over Σ. The empty string is written λ. What are the strings over Σ = { a, b }? Let u, v, w denote strings. 9
CONCATENATION For u = a_1 a_2 ... a_n and v = b_1 b_2 ... b_m (the two strings need not have the same length), what is the concatenation of u and v, written uv? 10
PREFIX AND SUFFIX For a string w = uv, u is a prefix of w and v is a suffix of w. What are all prefixes and suffixes of abbab? 11
SUBSTRING For a string w = u_1 v u_2, v is a substring of w. Prefixes and suffixes are special cases of substrings. What are the substrings of abbab? 12
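The definitions of prefix, suffix, and substring can be tried out directly. The following is a minimal sketch, assuming strings over Σ are represented as Python `str` values and λ as the empty string `""`; the function names are illustrative only.

```python
def prefixes(w: str) -> list[str]:
    """All u such that w = uv for some v (including λ and w itself)."""
    return [w[:i] for i in range(len(w) + 1)]


def suffixes(w: str) -> list[str]:
    """All v such that w = uv for some u (including w itself and λ)."""
    return [w[i:] for i in range(len(w) + 1)]


def substrings(w: str) -> set[str]:
    """All v such that w = u1 v u2 for some u1, u2."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


print(prefixes("abbab"))
print(suffixes("abbab"))
print(sorted(substrings("abbab"), key=lambda s: (len(s), s)))
```

Every prefix and every suffix also shows up in the set of substrings, which matches the remark that they are special cases.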
LENGTH For a string w = a_1 a_2 ... a_n, the length is n, written |w| = n. For example, |λ| = 0 and |abbab| = 5. How can we define length recursively? 13
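One possible recursive definition is |λ| = 0 and |aw| = 1 + |w| for a symbol a and a string w. A minimal sketch of that definition, assuming Python `str` for strings and `""` for λ (this is one of several equivalent ways to set up the recursion):

```python
def length(w: str) -> int:
    """Recursive length: |λ| = 0, |a w| = 1 + |w|."""
    if w == "":
        return 0
    # Strip the first symbol and recurse on the rest.
    return 1 + length(w[1:])


print(length(""), length("abbab"))
```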
PROOF BY INDUCTION Induction over natural numbers. To show that a property P holds for all natural numbers, ∀n ∈ ℕ. P(n), show: a base case, e.g., P(0); an inductive step, ∀n ∈ ℕ. P(n) ⇒ P(n+1). Why can we conclude ∀n ∈ ℕ. P(n) from this? 14
EXERCISE, LENGTH OF CONCATENATION What is |uv|? Can we prove it? 15
LENGTH OF CONCATENATION Theorem: |uv| = |u| + |v|. Proof: By induction on the length of v. 16
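A sketch of how the induction on the length of v might go. It assumes the recursive definition |λ| = 0 and |wa| = |w| + 1 for a symbol a ∈ Σ, which is one common choice and not necessarily the one used in the lecture.

```latex
\textbf{Claim.} $|uv| = |u| + |v|$ for all strings $u, v$ over $\Sigma$.

\textbf{Assumed definition.} $|\lambda| = 0$ and $|wa| = |w| + 1$ for $a \in \Sigma$.

\textbf{Base case} ($|v| = 0$, so $v = \lambda$):
\[ |u\lambda| = |u| = |u| + 0 = |u| + |\lambda|. \]

\textbf{Inductive step.} Assume the claim holds for all $w$ with $|w| = n$.
Let $|v| = n + 1$, so $v = wa$ with $|w| = n$ and $a \in \Sigma$. Then
\begin{align*}
|uv| &= |u(wa)| = |(uw)a| \\
     &= |uw| + 1        && \text{(definition of length)} \\
     &= |u| + |w| + 1   && \text{(induction hypothesis)} \\
     &= |u| + |wa| = |u| + |v|. && \text{(definition of length)}
\end{align*}
```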
REVERSE For a string w = a_1 a_2 ... a_n, what is the reverse w^R of w? What is a palindrome? 17
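As an illustration, the reverse can be defined recursively by λ^R = λ and (aw)^R = w^R a, and a palindrome is then a string with w = w^R. A minimal sketch under the assumption that strings are Python `str` values:

```python
def reverse(w: str) -> str:
    """Recursive reverse: λ^R = λ, (a w)^R = w^R a."""
    if w == "":
        return ""
    return reverse(w[1:]) + w[0]


def is_palindrome(w: str) -> bool:
    """A palindrome reads the same forwards and backwards: w = w^R."""
    return w == reverse(w)


print(reverse("abbab"), is_palindrome("abba"), is_palindrome("abbab"))
```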
REPETITION Let w^n be w repeated n times, w w ... w. Can you write a recursive definition of w^n? 18
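One candidate recursive definition is w^0 = λ and w^(n+1) = w w^n. The sketch below follows that definition, assuming Python `str` for strings; the name `power` is illustrative.

```python
def power(w: str, n: int) -> str:
    """Recursive repetition: w^0 = λ, w^(n+1) = w w^n."""
    if n == 0:
        return ""
    return w + power(w, n - 1)


print(power("ab", 3))
```

With this definition, |w^n| = n · |w| follows from the theorem on the length of a concatenation.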
Σ^n, STRINGS OF LENGTH n Let Σ^n be the set of strings of length n over Σ. For Σ = { a, b }: Σ^0 = { λ }, Σ^1 = { a, b }, Σ^2 = { aa, ab, ba, bb }. How can we define Σ^n? 19
Σ*, KLEENE CLOSURE Σ* is the set of all strings over Σ, e.g., {a,b}* = { λ, a, b, aa, bb, ab, ba, aaa, bbb, ... }. How can we define Σ*? 20
Σ*, KLEENE CLOSURE We have that Σ* = Σ^0 ∪ Σ^1 ∪ ..., where Σ^0 = { λ } and Σ^(n+1) = { xy : x ∈ Σ, y ∈ Σ^n }. Can we use this to define Σ*? As a fixpoint, F(S) ⊆ S, for some F? 21
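The level-by-level definition above can be turned into a small enumerator. This is a sketch, assuming Σ is given as a Python set of one-character strings; since Σ* is infinite for a non-empty Σ, it is produced lazily as a generator, shortest strings first.

```python
from itertools import count, islice
from typing import Iterator


def sigma_n(sigma: set[str], n: int) -> set[str]:
    """Σ^0 = { λ }, Σ^(n+1) = { xy : x ∈ Σ, y ∈ Σ^n }."""
    if n == 0:
        return {""}
    shorter = sigma_n(sigma, n - 1)
    return {x + y for x in sigma for y in shorter}


def sigma_star(sigma: set[str]) -> Iterator[str]:
    """Enumerate Σ* = Σ^0 ∪ Σ^1 ∪ ... in order of increasing length."""
    for n in count():
        level = sorted(sigma_n(sigma, n))
        if not level:  # happens only for Σ = { }, where Σ* = { λ }
            return
        yield from level


print(sigma_n({"a", "b"}, 2))
print(list(islice(sigma_star({"a", "b"}), 7)))
```

Skipping the level n = 0 in `sigma_star` would enumerate the positive closure instead.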
Σ^+, POSITIVE CLOSURE Let Σ^+ = Σ^1 ∪ Σ^2 ∪ ... How can we define the positive closure? 22
EXERCISE For Σ = { a, b }, what is the cardinality of Σ^3? In general, what is the cardinality of Σ^n? For each Σ below, give Σ* and Σ^+: Σ = { 0, 1 }, Σ = { a }, Σ = { }. 23
EXERCISE Prove that |Σ^n| = |Σ|^n. 24
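One way the proof might go is by induction on n, using the recursive definition Σ^0 = { λ }, Σ^(n+1) = { xy : x ∈ Σ, y ∈ Σ^n }. The sketch assumes Σ is finite and uses the convention 0^0 = 1 for the empty alphabet.

```latex
\textbf{Base case.} $|\Sigma^0| = |\{\lambda\}| = 1 = |\Sigma|^0$.

\textbf{Inductive step.} Assume $|\Sigma^n| = |\Sigma|^n$.
Every string in $\Sigma^{n+1}$ can be written in exactly one way as $xy$
with $x \in \Sigma$ (its first symbol) and $y \in \Sigma^n$ (the rest), so
\[
  |\Sigma^{n+1}| = |\Sigma| \cdot |\Sigma^n|
                 = |\Sigma| \cdot |\Sigma|^n
                 = |\Sigma|^{n+1}.
\]
```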
LANGUAGES 25
LANGUAGE A language L is a set of strings over an alphabet Σ, i.e., a subset of Σ*. For Σ = { a, b }, Σ* = { λ, a, b, aa, ab, ba, bb, aaa, aab, ... }. Examples of languages over Σ? 26
EXERCISE What is P(Σ*)? 27
SET OPERATIONS ON LANGUAGES Since languages are sets, the standard set operations apply. For L_1 = { a, b, aaa } and L_2 = { bb, ab }, what is L_1 ∪ L_2, L_1 ∩ L_2, and L_1 \ L_2? What is the complement of a language L? 28
REVERSAL AND CONCATENATION Reversal and concatenation carry over from strings in the natural way. Reversal: L^R = { w^R : w ∈ L }; examples: { ab, aab, baba }^R, { a^n b^n : n ≥ 0 }. Concatenation: L_1 L_2 = { uv : u ∈ L_1, v ∈ L_2 }; example: { ab, aab, baba }{ b, aa }. 29
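For finite languages, both operations can be computed directly from their definitions. A minimal sketch, assuming a language is represented as a Python set of strings (which only works for finite languages; the function names are illustrative):

```python
def lang_reverse(L: set[str]) -> set[str]:
    """L^R = { w^R : w ∈ L }."""
    return {w[::-1] for w in L}


def lang_concat(L1: set[str], L2: set[str]) -> set[str]:
    """L1 L2 = { uv : u ∈ L1, v ∈ L2 }."""
    return {u + v for u in L1 for v in L2}


L = {"ab", "aab", "baba"}
print(lang_reverse(L))
print(lang_concat(L, {"b", "aa"}))
```

Note that concatenating with the empty language gives the empty language, while concatenating with { λ } leaves a language unchanged.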
REPETITION With concatenation of languages defined, we can define repetition: L^0 = { λ }, L^(n+1) = { uv : u ∈ L, v ∈ L^n }. For L = { a^n b^n : n ≥ 0 }, what is L^2? What is L^0? 30
CLOSURES With repetition we can define Kleene closure and positive closure for languages: L* = L^0 ∪ L^1 ∪ ..., L^+ = L^1 ∪ L^2 ∪ ... What is L* in words? If L* = L we say that L is Kleene closed. Is C Kleene closed? 31
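Repetition and closure can be explored on finite languages. The sketch below assumes a language is a finite Python set of strings. Because L* is usually infinite, `star_up_to` (a hypothetical helper) only returns the strings of L* up to a chosen length bound.

```python
def lang_power(L: set[str], n: int) -> set[str]:
    """L^0 = { λ }, L^(n+1) = { uv : u ∈ L, v ∈ L^n }."""
    if n == 0:
        return {""}
    rest = lang_power(L, n - 1)
    return {u + v for u in L for v in rest}


def star_up_to(L: set[str], max_len: int) -> set[str]:
    """All strings of L* = L^0 ∪ L^1 ∪ ... with length at most max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more factor from L.
        frontier = {
            u + v for u in frontier for v in L if len(u + v) <= max_len
        } - result
        result |= frontier
    return result


print(lang_power({"ab", "aabb"}, 2))
print(sorted(star_up_to({"a", "bb"}, 3), key=lambda s: (len(s), s)))
```

Removing λ from the result (when λ is not in L) gives the corresponding bounded view of L^+.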
SUMMARY An alphabet, Σ, is a set of symbols. A string is a sequence of symbols: concatenation, reverse, length, substring, prefix, suffix, repetition. Kleene closure Σ* and positive closure Σ^+. A language over Σ is a set of strings, i.e., a subset of Σ*: union, intersection, difference, complement; reverse, concatenation, repetition; Kleene closure L* and positive closure L^+ (cf. Σ* and Σ^+). 32
WHY IS THIS USEFUL? Broad definition: any set of strings over an alphabet is a language. Methods of defining languages: grammars. Methods of deciding membership in languages: how do we answer the question whether a given string is in a given language? Can membership always be decided? 33
REGULAR EXPRESSIONS 34
REGULAR EXPRESSIONS ∅, λ, and any α ∈ Σ are primitive regular expressions. If r_1 and r_2 are regular expressions, then so are r_1 + r_2, r_1 r_2, r_1*, and (r_1). 35
EXERCISE Is (a + bc)*(c+λ) a regular expression? Is (a + b +) a regular expression? 36
INTUITIVE MEANING Each regular expression over Σ defines a language over Σ (think in terms of matching). ∅, λ, and any α ∈ Σ are primitive regular expressions. If r_1 and r_2 are regular expressions, then so are r_1 + r_2, r_1 r_2, r_1*, and (r_1). 37
EXAMPLE What is the language defined by a + b? What is the language defined by (ab)*? Exercise, what is the language defined by (a + bc)*(c+λ)? 38
LANGUAGE DEFINED BY REGULAR EXPRESSIONS How can we define the language of a regular expression more formally? Can we build a recursive function L(r) that defines the language of a regular expression r? Remember: a language is a set of strings, and we have defined operations on languages: union, concatenation, Kleene star. 39
EXAMPLE What is L((a + b)a*)? 40
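A recursive function L(r) can be sketched by following the structure of the regular expression: one case per primitive and one per operator. The code below is an illustration under several assumptions: regular expressions are represented as a small syntax tree (the class names are made up for this sketch), symbols are single characters, and, because L(r) may be infinite, `lang(r, k)` only returns the strings of L(r) with length at most k.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:            # ∅
    pass


@dataclass(frozen=True)
class Lam:              # λ
    pass


@dataclass(frozen=True)
class Sym:              # a single symbol α ∈ Σ
    a: str


@dataclass(frozen=True)
class Union:            # r1 + r2
    r1: object
    r2: object


@dataclass(frozen=True)
class Concat:           # r1 r2
    r1: object
    r2: object


@dataclass(frozen=True)
class Star:             # r*
    r: object


def lang(r, k: int) -> set[str]:
    """Strings of L(r) with length at most k."""
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Lam):
        return {""}
    if isinstance(r, Sym):
        return {r.a} if k >= 1 else set()
    if isinstance(r, Union):
        return lang(r.r1, k) | lang(r.r2, k)
    if isinstance(r, Concat):
        left, right = lang(r.r1, k), lang(r.r2, k)
        return {u + v for u in left for v in right if len(u + v) <= k}
    if isinstance(r, Star):
        base = lang(r.r, k)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {
                u + v for u in frontier for v in base if len(u + v) <= k
            } - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")


# (a + b)a*
r = Concat(Union(Sym("a"), Sym("b")), Star(Sym("a")))
print(sorted(lang(r, 2), key=lambda s: (len(s), s)))
```

The bound k is only there to keep the sets finite; the case analysis itself mirrors the definition L(r_1 + r_2) = L(r_1) ∪ L(r_2), L(r_1 r_2) = L(r_1)L(r_2), L(r_1*) = (L(r_1))*.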
EXERCISE What is the language defined by (a+b)*(a+bb)? 41
ON PRECEDENCE What is the language defined by (a + b)a? What is the language defined by a + (ba)? Which one is a + ba? 42
EXERCISE What is the language defined by (aa)*(bb)*b? 43
EXAMPLE Create a regular expression over Σ = { 0, 1 } that defines the language L in which every string has at least two consecutive 0s, e.g., 001 ∈ L and 010 ∉ L. 44
EXERCISE Construct a regular expression over { 0, 1 } for the language L in which no string has two consecutive 0s, e.g., 010 ∈ L and 001 ∉ L. 45
EQUIVALENCE OF REGULAR EXPRESSIONS Two regular expressions are equivalent if they define the same language. L = { all strings over {0, 1} without consecutive 0s }, r_1 = (1+01)*(0+λ), r_2 = (1+011*)*(0+λ)+1*(0+λ). Since L = L(r_1) = L(r_2), we have that r_1 and r_2 are equivalent. Can we prove that L(r_1) = L(r_2) in some way? 46
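Before attempting a proof, the claim can be sanity-checked by brute force on all short strings. The sketch below uses Python's `re` module, where union is written `|` instead of `+` and λ is written as an empty alternative. It is only a finite test up to a chosen length, not a proof of equivalence.

```python
import re
from itertools import product
from typing import Iterator

# r_1 = (1+01)*(0+λ) and r_2 = (1+011*)*(0+λ)+1*(0+λ) in Python syntax.
R1 = re.compile(r"(1|01)*(0|)")
R2 = re.compile(r"(1|011*)*(0|)|1*(0|)")


def all_strings(max_len: int) -> Iterator[str]:
    """All strings over {0, 1} of length at most max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)


def agree(max_len: int) -> bool:
    """True if r_1, r_2 and the direct test 'no 00' agree on every short string."""
    return all(
        (R1.fullmatch(w) is not None)
        == (R2.fullmatch(w) is not None)
        == ("00" not in w)
        for w in all_strings(max_len)
    )


print(agree(10))
```

A single disagreeing string would refute the equivalence; agreement on all short strings merely makes it plausible.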
REGULAR EXPRESSIONS IN REALITY Slightly richer alphabet and language than what we saw here, e.g., quantifiers: *, +, ?, {m}, {m,}, {m,n}; atoms: char, [chars], ., ^, $, \char. Example uses: lexical analysis (tokenization preceding parsing); text search with grep/egrep (Unix), e.g., search for gr(a|e)y or ^[-+]?[0-9]*\.?[0-9]+$ 47
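The two search patterns can be tried out with Python's `re` module, whose syntax for these particular constructs matches the egrep-style notation on the slide. A small sketch (the sample inputs are made up):

```python
import re

# gr(a|e)y matches both spellings, "gray" and "grey".
GREY = re.compile(r"gr(a|e)y")

# Optional sign, optional integer part, optional decimal point, at least one digit.
NUMBER = re.compile(r"^[-+]?[0-9]*\.?[0-9]+$")

for text in ["a grey cat", "gray", "groy"]:
    print(text, bool(GREY.search(text)))

for text in ["42", "-3.14", ".5", "3.", "abc"]:
    print(text, bool(NUMBER.match(text)))
```

Here `search` looks for the pattern anywhere in the text, as grep does, while the anchors `^` and `$` force the number pattern to cover the whole line.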
REGULAR EXPRESSIONS IN COMPILERS The programmer creates a program. The lexer splits the program text into a stream of tokens and removes white space. Literals: 1, 1.32, "Hello World!". Keywords: if, while, ... Variables: c, y, counter, ... The token stream is passed to the parser, which creates a parse tree that is used by the next step of the compiler. This simplifies the parser, as it can work on tokens rather than on characters. [Figure: stages Lexer and Parser; data labels Text, Tokens, Binary] 48
PARTS OF EXAMPLE PASCAL LEXER
white_space        [ \t]*
digit              [0-9]
alpha              [A-Za-z_]
alpha_num          ({alpha}|{digit})
hex_digit          [0-9A-F]
identifier         {alpha}{alpha_num}*
unsigned_integer   {digit}+
hex_integer        ${hex_digit}{hex_digit}*
exponent           e[+-]?{digit}+
i                  {unsigned_integer}
real               ({i}\.{i}?|{i}?\.{i}){exponent}?
string             \'([^'\n]|\'\')+\'

and                return(and);
array              return(array);
begin              return(_begin);
49
EXAMPLE TOKENIZATION Consider the following PASCAL program:
  Program Lesson1_Program1;
  Begin
    Write('Hello World');
    Readln;
  End.
which would produce the following token stream:
  PROGRAM IDENTIFIER ; BEGIN IDENTIFIER ( STRING ) ; IDENTIFIER ; END .
Note that the tokens are represented by integers, and tokens like IDENTIFIER and STRING carry the actual string representing the token. 50
REGULAR LANGUAGES Topic for the next few lectures. Ways of defining regular languages: regular expressions (RE), regular grammars. Ways of deciding membership in regular languages: DFA and NFA. Equivalence of the approaches: DFA, NFA, RE. 51
REGULAR LANGUAGES [Figure: diagram relating Regular Language to Regular Expression, DFA, NFA, and Regular Grammar] 52
DO THE EXERCISES! Exercise material is on the homepage; the exercises are similar to what will be on the exam. If you get stuck, ask a friend or ask me. If several of you have issues with an exercise, we'll add it to a lecture. 53