Strings, Languages, and Regular Expressions Strings An alphabet sets is simply a nonempty set of things. We will call these things symbols. A finite string of lengthnover an alphabets is a total function from{0,..n} tos. Examples. SupposeS={ a, b, c }. Then the following function is a finite string of length3overs. ({0,1,2},S,{0 a,1 b,2 b,}) We ll also write this string as[ a, b, b ]. SupposeS={ a, b, c }. Then the following function is a finite string of length0overs. (,S, ) We will also write this string (and any other string of length0) as[]or asɛ. WhenS is a set of characters, I ll use double quotes to indicate strings Example: So[ a, b, b ]= abb. March 1, 2017 1
For eachn N the set( of strings of lengthnovers ) is S n = {0,..n} S tot Ifs S n, we say that its length isnand write s =n. Examples SupposeS={0,1} thens 2 is {[0,0],[0,1],[1,0],[1,1]} aabb =4 The set of finite strings overs is S = n N S n Example SupposeS={ a, b } thens is {ɛ, a, b, aa, ab, ba, bb, aaa, } Note that while the size ofs is infinite (even for finites), the length of each element ofs is finite. We can concatenate two stringssandtto get a stringsˆt. Such that sˆt S s + t and thus sˆt = s + t. Furthermore (sˆt)(i)=s(i), if0 i< s, and (sˆt)( s +i)=t(i), if0 i< t Note theɛis the identity element of catenationsˆɛ=s= ɛˆs, for alls. March 1, 2017 2
Letter conventions for this section of the course I ll use variables as follows S an alphabet a,b,c S s,t,w S M,N S x,y regular expressions overs Q a set of states p,q,r Q R,F Q T a set of transitions March 1, 2017 3
Languages A language overs is any subset ofs. We can extend the operation of concatenation to languages. IfM andn are languages overs. MˆN ={s M,t N sˆt} For example if M={ a, aa, ab } andn={ɛ, b } then MˆN={ a, aa, ab, aab, abb } For eachi N we definem i to bemm M where there areims. For example if M ={ a, aa, ab } then M 1 ={ a, aa, ab } M 2 ={ aa, aaa, aab, aaaa, aaab, aba, abaa, abab } M 3 = { aaa, aaaa, aaab, aaaaa, aaaab, aaba, aabaa, aabab, aaaaaa, aaaaab, aaaba, aaabaa, aaabab, abaa, abaaa, abaab, abaaaa, abaaab, ababa, ababaa, ababab } By convention,m 0 is{ɛ}. So we can definem i by M 0 = {ɛ} M i+1 = MˆM i, for alli N We can define the Kleene closurem of a set of strings as the set of all finite strings that can be generated from its March 1, 2017 4
members by catenation. So ifm ={ a, bb } then M = {ɛ, a, bb, aa, abb, bba, bbbb, aaa, aabb, abba, abbbb, bbaa, bbabb, bbbba, bbbbbb, } Formally we can define M = i N M i To put it differently, a finite stringsis a member ofm iff there is a sequence of 0 or more strings s 0,s 1,,s n M such that s=s 0ˆs 1ˆ ˆs n March 1, 2017 5
Regular languages Given an alphabets, we can make the following simple regular languages: the empty language; contains no strings {ɛ} the language that contains only the empty string For eacha S,{[a]} the languages of single, length 1 strings. Example: IfS = 0,1} we have the following4simple regular languages,{ɛ},{[0]},{[1]} Given thatm andn are languages overs, consider three ways to make languages Union:M N the language that contains all strings either inm or inn Concatenation:MˆN the language of stringssthat can be split into two partss=tˆw, wheret Mand w N. Kleene closure:m the language of stringssthat can be split into some number (including0) of parts, each of which is inm. March 1, 2017 6
Examples: IfS is{0,1} we can make the following languages out of the simple regular languages: {[0]} {[1]} {ɛ}={ɛ,[0],[1]} {[0]}ˆ{[1]}={[0,1]} {[0]} ={ɛ,[0],[0,0],[0,0,0],...} {[0]} {[1]} ={ɛ,[0],[1],[0,0],[1,1],...} {[0]} ˆ{[1]} ={ɛ [0],[1], [0,0],[0,1],[1,1], [0,0,0],[0,0,1],[0,1,1],[1,1,1],...} A regular language overs is any language that can be formed from the simple regular languages over S using the three operations of union, concatenation, and Kleene closure. Next we look at a nice syntax for describing regular languages. March 1, 2017 7
Regular expressions A regular expression overs defines a language overs. The set of regular expressions overs is itself a language overs {ɛ,,,;,,(,)} Syntax: For eacha S,ais a regular expression ɛ is a regular expression. is a regular expression. Ifxandy are regular expressions then (x y) is a regular expression (alternation) (x;y) is a regular expression (concatenation) (x ) is a regular expression (repetition) Underlining: I ve underlined regular expressions to make it clear that that is what they are. For exampleɛis a regular expression and is a string of length 1 in the language of regular expressions, whereasɛis a string of length 0. March 1, 2017 8
Parentheses: Parentheses can be omitted with the understanding that has higher precedence than;and that;has higher precedence than. Thus. (a;(b )) can be abbreviated bya;b whereas (a;b) needs its parentheses. And ((a;b) (c;d)) can be abbreviated bya;b c;d whereas a;(b c);d needs its parentheses. We can writex;y;z to mean(x;y);z Similarly, we can writex y zto mean(x y) z. Redundant parentheses can be added. E.g.((( b ) (( a ) ))). (The operators, ; and are analogous to+,, and exponentiation both in precedence and in some algebraic properties.) March 1, 2017 9
Semantics: We can define the meaning of regular expressions over an alphabet S by defining a function L from regular expressions to languages overs. L(a)={[a]}, for eacha S. L(ɛ)={ɛ} L( )= Ifxandy are regular expressions then L((x;y))=L(x)ˆL(y) L((x y))=l(x) L(y) L((x ))=(L(x)) Ifs L(x) we say thatxdescribessor thatxrecognizess. Two regular expression are equivalent if they describe the same language. March 1, 2017 10
Examples Money Suppose that S = { $, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,,,., - } We want to describe strings representing amounts of money such as $0.05 $-12,345.67 We can describe a set of strings representing amounts of money as follows. Letxrepresent the regular expression ( 0 1 2 3 4 5 6 7 8 9 ) The regular expression y= $ ;( - ɛ);(x;x;x x;x x);(, ;x;x;x) ;. ;x;x is such that $0.05 L(y) and $-12,345.67 L(y), whereas $12345.67 / L(y) and $12.6 / L(y), etc. Conventions and abbreviations We leave off the underline when it is clear we are talking about a regular expression March 1, 2017 11
R.E. Abbreviates Describes... x y x; y catenation s s(0)s(1) s( s 1) {s} x + xx any catenation of 1 or more strings, each described byx. x n xx...x }{{} any catenation of n, each n times described byx. x n m x m x m+1 x n any catenation of frommton strings, each described byx. x? ɛ x a string described byxorɛ.. a b any string of length one. [a b] any string of length one with a character alphabetically between a and b inclusive. With these abbreviations $ ;( - ɛ);(x;x;x x;x x);(, ;x;x;x) ;. ;x;x wherexis( 0 1 2 3 4 5 6 7 8 9 ) can be written as $ -? [ 0 9 ] 3 1 (, [ 0 9 ]3 ). [ 0 9 ] 2 The abbreviations are useful in application of regular expressions. However, any regular expression can be expanded to one that does not use the abbreviations. When dealing with the theory of regular expressions, we can safely ignore the abbreviations. March 1, 2017 12
Identifiers Suppose that S = { _, 0, 1,..., 9, a, b,..., z, A, B,..., Z } then letabe the regular expression ([ a z ] [ A Z ] _ )([ a z ] [ A Z ] [ 0 9 ] _ ) describes C++ identifiers. Example Parity LetS ={0,1}. Define a regular expression that matches only strings with an even number of 1s. ( 0 1 0 1 ) 0 String search LetS = { _, 0, 1,..., 9, a, b,..., z, A, B,..., Z }. Define an r.e. that matches all strings that contain the substring engi. Now the regular expression is. engi. More string searches Does a document contain any of the strings woman, women, man, men?. ( woman women man men ). or, equivalently,. wo? m ( a e ) n. March 1, 2017 13
Does a document contain John followed by Smith?. ( John ). ( Smith ). Challenges A C comment is a string that follows the following rules. C comments have at least 4 characters. The first two are /* and the last two are */. The sequence */ does not occur in between the first two and last two characters. Examples include /**/, /*/*/, and /*o*o/*/. Counter examples include o/**/, /**/o, and /**/*/. Supposexis a regular expression whose language is the set of all strings of length 1 other than / and *. Devise a regular expression whose language is the set of all C comments. Show that if a languagem is represented by a regular expression x, then there is a regular expression that represents the reversal ofm, i.e., the languagen that consists of all the strings inm written backwards. Represent an addition such as2+3=5 by 235 and 12 + 35 = 47 by 134257. Pad the operands with 0s so that both operands and the result are the same number of digits; e.g., represent23+777=800 by 078270370. Thus strings in the language are always of lengths that are multiples of3.create a regular expression that represents correct additions in binary. Hint: it might be easier to create a regular expression for the reversed language. Examples March 1, 2017 14
11+101 = 1000 is represented by a string 001010100110 which is in the language. 11+11=110 is represented by a string001111110 which is in the language 1+1 1 and so111 is not in the language. 00000 is not in the language, as its length is not a multiple of 3. Regular languages (again) A languagem is a regular language exactly if there exists a regular expressionxsuch thatm =L(x). So far we have seen that some languages are regular. Perhaps not surprisingly, there are also languages that are not regular. In the next section of the course we will see that a language is regular exactly if it can be recognized by a program that has a fixed amount of memory. Consider the following language M = i N { a } iˆ{ b } i = {ɛ, ab, aabb, aaabbb, } A program to recognized this language would somehow need to count the number of a s and b s. This would take an amount of memory that depends on the size of the input; i.e. it is not fixed. So this language is not regular. The preceding argument is not a formal proof; after learning a bit more, we will be in a position to be able to formally prove that some languages are not regular. March 1, 2017 15