Notes for Comp 454 Week 2

This week we look at the material in Chapters 3 and 4. Homework on Chapters 2, 3 and 4 is assigned (see end of notes). Answers to the homework problems are due by September 10th.

Errata in Chapter 3: None known.

Chapter 3

You are probably familiar with the concept of recursion from programming and/or math classes. In Chapter 3, Cohen shows how recursion is a powerful tool in defining languages. Recall from last week that a language is a set of strings.

A typical recursive definition of a set will have three parts:
1. A base set of objects,
2. Rules for specifying how additional objects are defined in terms of existing ones,
3. A rule that states that the only objects in the set are those required by rules 1 and 2.

Example 1. EVEN: the set of (positive) even numbers.
1. The numbers 2 and 4 are in EVEN,
2. If x is in EVEN then so is x+4,
3. Only those numbers required to be in EVEN by rules 1 and 2 are in EVEN.

Example 2. TRIPLEX: the set of strings of x's whose length is a multiple of 3.
1. Λ, the empty string, is in TRIPLEX,
2. If the string w is in TRIPLEX then so is the string wxxx,
3. Only strings required to be in TRIPLEX by 1 and 2 are in TRIPLEX.

Usually, the third rule is assumed, so we do not bother including it in our definition.

There are often many ways to define a particular set. Here is another definition of EVEN from Cohen page 22.

Example 3. EVEN: the set of positive even numbers (a different definition from Example 1).
1. The number 2 is in EVEN,
2. If both x and y are in EVEN then so is x + y. x can be the same as y.
Verify for yourself that this includes every positive even number.

Comp 454 Notes Page 1 of 8 September 3, 2013
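The recursive definitions in Examples 1 to 3 can be tried out mechanically. The sketch below is only an illustration, not something from Cohen: the helper name close_under and the cut-offs (numbers up to 50, strings up to length 9) are assumptions made here so that the otherwise infinite sets stay finite. It starts from the base set and keeps applying the rule until nothing new appears, which is how rule 3 ("only those objects required by rules 1 and 2") is enforced.

```python
def close_under(base, rule, keep):
    """Smallest set containing `base` and closed under `rule`, restricted to
    members accepted by `keep` (the cut-off that makes the search finite)."""
    members = {m for m in base if keep(m)}
    while True:
        new = {m for m in rule(members) if keep(m)} - members
        if not new:  # nothing further is required to be in the set
            return members
        members |= new


LIMIT = 50  # arbitrary cut-off chosen for this illustration

# Example 1: 2 and 4 are in EVEN; if x is in EVEN then so is x + 4.
even_v1 = close_under({2, 4},
                      lambda s: {x + 4 for x in s},
                      lambda n: n <= LIMIT)

# Example 3: 2 is in EVEN; if x and y are in EVEN then so is x + y.
even_v2 = close_under({2},
                      lambda s: {x + y for x in s for y in s},
                      lambda n: n <= LIMIT)

# Example 2: the empty string is in TRIPLEX; if w is in TRIPLEX so is wxxx.
triplex = close_under({""},
                      lambda s: {w + "xxx" for w in s},
                      lambda w: len(w) <= 9)

# Both definitions of EVEN agree with the even numbers up to the cut-off.
print(even_v1 == even_v2 == set(range(2, LIMIT + 1, 2)))
print(sorted(triplex, key=len))
```

Agreement up to a cut-off is evidence, not a proof; the "verify for yourself" argument above still has to be made by hand.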
Example 4. INTEGERS: the set of integers.
1. The number 1 is in INTEGERS,
2. If x and y are in INTEGERS then so are x+y and x-y. x can be the same as y.
Verify for yourself that this defines a set that includes 0, +3 and -4.

Note the difficulty of using this recursive approach to define the set of real numbers. There is no smallest real number on which to base a definition.

AE: The Language of Arithmetic Expressions.

It is interesting to define the language of arithmetic expressions as they appear in most programming languages. Limiting ourselves to constants for the moment, here are three examples of arithmetic expressions:

(3 + 4) / (12 * (8 - 3))
-7 * ((4 + 21) / (-8 + 23))
23 / 0

Note that we are only concerned with expressions that are syntactically correct, so we do not care that the value of the last example is mathematically undefined.

Here is Cohen's definition of AE.
1. Any number is in AE,
2. If x is in AE then so are (x) and -x, unless x starts with a minus sign,
3. If x and y are in AE then so are x + y (y must not begin with a sign character), x - y (y must not begin with a sign character), x * y, x / y, x^y, x**y (whatever operators your language supports).

The language defined in this way permits an expression like 34/24/8 but is not concerned with whether it means (34/24)/8 or 34/(24/8).

Question: could you extend the definition of AE to include (a) identifiers and (b) function calls?

THEOREM 2 No string in AE can contain $.
$ is not part of any number. None of our recursive rules introduces $. Therefore there is no way that $ can appear in a string in AE.
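The definition of AE can be explored by applying its rules a fixed number of times to a small set of numbers and inspecting what comes out. The sketch below is an illustration under stated assumptions: the function names ae_step and ae_up_to, the starting numbers "1" and "2", the operator set + - * / and the choice of two rounds are all made up here, and a finite scan of this kind only spot-checks Theorem 2; it does not prove it.

```python
def ae_step(current, operators="+-*/"):
    """Apply rules 2 and 3 of the AE definition once to every member."""
    new = set(current)
    for x in current:
        new.add("(" + x + ")")            # rule 2: (x)
        if not x.startswith("-"):
            new.add("-" + x)              # rule 2: -x unless x starts with -
    for x in current:
        for y in current:
            for op in operators:
                # rule 3: for + and -, y must not begin with a sign character
                if op in "+-" and y[0] in "+-":
                    continue
                new.add(x + op + y)
    return new


def ae_up_to(numbers, rounds):
    """Members of AE reachable from `numbers` in at most `rounds` rule steps."""
    members = set(numbers)
    for _ in range(rounds):
        members = ae_step(members)
    return members


ae = ae_up_to({"1", "2"}, rounds=2)

print(len(ae), "strings generated")
print("contains $:", any("$" in w for w in ae))
print("begins or ends with /:", any(w[0] == "/" or w[-1] == "/" for w in ae))
print("contains //:", any("//" in w for w in ae))
print("1/2/1 generated:", "1/2/1" in ae)
print("1+-2 generated:", "1+-2" in ae)
```

The last two lines show the definition at work: 1/2/1 is produced without any commitment to how it groups, while 1+-2 is excluded by the restriction on sign characters in rule 3.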
THEOREM 3 No string in AE can begin or end with /.
/ is not part of any number. Any string formed by a recursive rule must start with a parenthesis or a number or -. Any string formed by a recursive rule must end with a parenthesis or a number. Therefore there is no way that a member of AE begins or ends with /.

THEOREM 4 No string in AE can contain //.
Read Cohen's proof of this on page 27. Rather than simply use a variation on the proof of Theorem 3, he uses a proof by contradiction. He assumes that there is a string in AE containing // and then shows that this must contradict Theorem 3. Proof by contradiction is another tool that Cohen uses in later chapters.

The language of well-formed formulae (WFF)

See page 28. The structure of this definition is similar to the one for AE; only the operators differ.

Chapter 4

Errata in Chapter 4:
Page 41, 13 lines from the end: delete either "whether" or "if".
Page 41 penultimate line, delete.

Regular expressions.

Regular expressions might already be familiar to those of you who have used Unix shell commands. The name of the Unix pattern-matching utility grep is an acronym: global regular expression and print. The regular expression notation used by Cohen is a little different from the Unix notation and also somewhat different from the notation used in most other computer theory textbooks.

In regular expressions we have the notions of repetition (iteration), sequence and choice.

Repetition. Using the idea of the Kleene star, x* represents a sequence of zero or more x's. Thus,

XSTRING = language(x*) = { Λ x xx xxx xxxx xxxxx ... }
Sequence. We use concatenation. For example, wx represents w followed by x.

Note that pq* represents all strings where p is followed by any number of q's

pq* = { p pq pqq pqqq pqqqq ... }

and is different from (pq)*, which represents strings that are zero or more repetitions of pq

(pq)* = { Λ pq pqpq pqpqpq pqpqpqpq ... }

If each string in our language is a sequence of one or more x's rather than zero or more x's then we can represent it as xx* or x*x. To simplify matters we could define the superscript + operator to represent one or more repetitions; thus x⁺ is the same language as xx*.

There are likely to be many ways to represent an infinite language. Each of the following regular expressions represents the language of strings consisting of one or more x's:

xx*
x⁺
xx*x*
x*xx*
x⁺x*
x*x⁺
x*x*x*xx*

Convince yourself that each expression does define the language of one or more x's. What if the language were strings consisting of two or more x's? Are there correspondingly many ways to represent a finite language?

Choice. On page 34 Cohen introduces his "or" operator, written as an ordinary + between two expressions. Some books and Web sites use the Unix pipe operator for this. Thus, to represent either x or y:

Cohen: x + y
Others: x | y

Now we have choice, sequence and iteration, giving us a complete set of operators for regular expressions. Here are some examples of expressions and the set of strings that each represents.

(x + y)z* = { x y xz yz xzz yzz ... }.
(a+b)* = { Λ a b aa ab ba bb aaa ... }.
a(a+b)*b = { ab aab abb aaab aabb abab abbb aaaab ... }.
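The example languages above can be reproduced with Python's re module, which uses the pipe for choice. The sketch below is an illustration only: the helper names cohen_to_python and language are invented here, the translation handles just Cohen's binary + (not Λ or the superscript ⁺), and the length bounds are arbitrary. It lists, in order of length, every string over a given alphabet that an expression matches in full.

```python
import re
from itertools import product


def cohen_to_python(expr):
    """Translate Cohen's choice operator '+' to the pipe used by Python's re.
    Assumes the expression uses only letters, parentheses, '*' and binary '+'."""
    return expr.replace(" ", "").replace("+", "|")


def language(expr, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len matched by `expr`,
    listed shortest first."""
    pattern = re.compile(cohen_to_python(expr))
    words = ("".join(letters)
             for n in range(max_len + 1)
             for letters in product(alphabet, repeat=n))
    return [w for w in words if pattern.fullmatch(w)]


print(language("pq*", "pq", 4))         # p followed by any number of q's
print(language("(pq)*", "pq", 4))       # zero or more repetitions of pq
print(language("(x + y)z*", "xyz", 3))
print(language("a(a+b)*b", "ab", 4))

# Several different expressions for "one or more x's" give the same strings.
one_or_more = [language(e, "x", 6) for e in ("xx*", "xx*x*", "x*xx*", "x*x*x*xx*")]
print(all(lang == one_or_more[0] for lang in one_or_more))
```

Comparing languages only up to a fixed length cannot prove two expressions equivalent, but a mismatch on short strings is a quick way to show that they are not.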
Formal Definition of Regular Expression (R.E.)

We can use our ideas from recursive definitions to define a valid regular expression as follows.
1. Λ is a regular expression and every character in Σ is a regular expression.
2. If w is a regular expression then so are (w) and w*.
3. If w and v are regular expressions then so are wv and w+v.

Example regular expressions. Assume that our alphabet is {a b}. There are often many different possible regular expressions for a particular language.

The language of all words containing at least one a can be represented by (a+b)*a(a+b)*.
The language of all words containing at least two a's can be represented by (a+b)*ab*ab*.
The language of all words containing exactly two a's can be represented by b*ab*ab*.

See the example on page 39 of how tricky it turns out to be to devise an R.E. representing the set of strings that have at least one a and at least one b. The a might appear before the b or it might appear after it.

Product set

If S and T are both sets of strings then we define the product set of S and T, denoted ST, as the set of strings in which each member is a member of S concatenated with a member of T.

Example 1
S = { cat dog }
T = { fish house }
ST = { catfish cathouse dogfish doghouse }.

Example 2
S = { Λ a aa }
T = { Λ b bbb bbbbb ... }
ST = { Λ a b aa ab aab bbb abbb ... }.

Languages associated with regular expressions.

Suppose we have regular expression R1 defining a language and regular expression R2 defining a language. What is the language defined by R1+R2 and by R1R2? Again, Cohen takes a recursive approach.

(1) If an R.E. is a single letter, the corresponding language is that one-letter word.
(2) If R.E. R1 defines language L1 and R.E. R2 defines language L2 then
(R1)(R2) defines the product of L1 and L2, that is, a language in which each string is a string from L1 followed by a string from L2.
(R1) + (R2) defines the union of L1 and L2, that is, a language in which each string is either in L1 or in L2.
The language associated with (R1)* is the Kleene closure of L1, that is, a language in which each string is a sequence of zero or more strings drawn from L1.

This means that every regular expression defines a language; that is, it represents a set of strings.

It is often quite tricky to determine the characteristics of the language given the regular expression. Cohen's example on page 39, slightly modified, is

( (a+b)*a(a+b)*b(a+b)* ) + bb*aa*.

This represents the set of strings that have both a and b in them. The expression to the left of the top-level + represents the strings where some a precedes some b, and the expression on the right of the + represents the strings omitted by the first term, that is, those in which all the b's come before all the a's.

THEOREM 5 All finite languages are regular.
This is clearly true because we can list the strings in the set and join them with + to form the R.E. So if our language is { cat dog frog mouse } the R.E. is

R = cat + dog + frog + mouse

The understandability of regular expressions.

The ease with which we can determine the language represented by an R.E. is highly variable. Given that (a+b)* represents any string of a's and b's, then clearly (a+b)*(aa+bb)(a+b)* represents strings that are guaranteed to have at least one instance of either aa or bb. But trying to define an expression for the complement, that is, strings that contain neither aa nor bb, is tricky. See page 45.

Try the extended example on page 46. See if you can determine what language is represented by the expression at the top of the page before reading Cohen's analysis.

Note the observation on page 47. It is unknown whether an algorithm exists that can transform an arbitrary regular expression to another equivalent one.
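The recursive rules for the language of an R.E. (product for concatenation, union for +, Kleene closure for *) translate directly into operations on sets of strings. The sketch below is an illustration under assumptions made here: the function names concat, star and chain are invented, and every set is cut off at a maximum length so that it stays finite. It rebuilds Cohen's page 39 expression from those operations and compares the result, for short strings, with the plain-English description "contains both a and b".

```python
from itertools import product as letter_tuples


def concat(S, T, max_len):
    """Product set ST, keeping only strings of length <= max_len."""
    return {s + t for s in S for t in T if len(s + t) <= max_len}


def star(S, max_len):
    """Kleene closure of S, keeping only strings of length <= max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        frontier = concat(frontier, S, max_len) - result
        result |= frontier
    return result


def chain(max_len, *languages):
    """Product of several languages in sequence."""
    out = {""}
    for lang in languages:
        out = concat(out, lang, max_len)
    return out


# Product set, Example 1.
print(sorted(concat({"cat", "dog"}, {"fish", "house"}, 20)))

N = 6                     # arbitrary cut-off for this illustration
A, B = {"a"}, {"b"}
any_ab = star(A | B, N)   # (a+b)*

left = chain(N, any_ab, A, any_ab, B, any_ab)    # (a+b)*a(a+b)*b(a+b)*
right = chain(N, B, star(B, N), A, star(A, N))   # bb*aa*
both = left | right                              # union for the top-level +

expected = {"".join(t)
            for n in range(N + 1)
            for t in letter_tuples("ab", repeat=n)
            if "a" in t and "b" in t}
print(both == expected)
```

Cutting each set off at a fixed length is safe here because concatenation never shortens a string, so any string of length at most N in the full language is built from pieces that also have length at most N.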
The language EVEN-EVEN. (see page 48).

The language EVEN-EVEN will appear at several points in the book. It is simply the set of strings where each string has an even number of a's and an even number of b's.

Strings in EVEN-EVEN include Λ abba bbbbbb abbaababaa.
Strings not in EVEN-EVEN include baa bababa bab aaaaa aaabbbb.

The language is easily specified, but what about devising a regular expression? Let us work backwards to the expression that Cohen gives. Every string that is in EVEN-EVEN can be split up into "syllables" where each syllable is of one of three types.
1: aa.
2: bb.
3: mismatch followed by any number of aa/bb strings followed by mismatch, where "mismatch" means ab or ba.

Note that in each syllable there is an even number of a's and an even number of b's (zero is even).

Examples of strings in EVEN-EVEN broken into syllables as defined above:

aaabaabbbabbaaaa : aa abaabbba bb aa aa
abbaabbbabaa : abba abbbab aa

This leads us to the following R.E. for EVEN-EVEN:

EVEN-EVEN = [ aa + bb + (ab+ba)(aa+bb)*(ab+ba) ]*.

Homework 1

There are 5 homework problems drawn from Chapters 2, 3 and 4. Each answer is worth a maximum of 20 points. Answers due 9/10/13.

1. Consider the language PALINDROME over the alphabet { a b }
(a) Prove that if x is in PALINDROME, then so is x^n for any n > 0
(b) Prove that if y^3 is in PALINDROME then so is y
(c) Prove that PALINDROME has as many words of length 4 as it does of length 3
(d) Prove that PALINDROME has as many words of length 2n as it does of length 2n-1.
2. Show that the following is another recursive definition of the set EVEN:
Rule 1: 2 and 4 are in EVEN
Rule 2: If x is in EVEN, then so is x+4.

3. Using the second recursive definition of EVEN (page 22), what is the smallest number of steps required to prove that 100 is EVEN? Describe a good method for showing that 2n (for positive integer n) is in EVEN.

4. Construct a regular expression defining each of the following languages over the alphabet {a b}.
(i) all words that do not have the substring ab
(ii) all words that do not have both the substrings bba and abb.

5. Show that the following pairs of regular expressions define the same language over the alphabet Σ = {a b}
(a) ((a + bb)*aa)* and Λ + (a + bb)*aa
(b) a(aa)*(Λ + a)b + b and a*b

Reading Assignment

Read Chapters 3 and 4. Next week's class notes will cover Chapters 5 and 6.
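Supplementary sketch for the EVEN-EVEN section of these notes. The syllable argument and the resulting R.E. can be checked on short strings with Python. Everything named here is an assumption made for illustration: the function names syllables and in_even_even, the translation of Cohen's + into the pipe of Python's re module, and the cut-off at length 8. The check compares the expression against a direct count of a's and b's, and the splitter reproduces the syllable breakdown described in the notes.

```python
import re
from itertools import product

# [ aa + bb + (ab+ba)(aa+bb)*(ab+ba) ]* with Cohen's + written as |
EVEN_EVEN = re.compile(r"(aa|bb|(ab|ba)(aa|bb)*(ab|ba))*")


def in_even_even(word):
    """Direct definition: an even number of a's and an even number of b's."""
    return word.count("a") % 2 == 0 and word.count("b") % 2 == 0


def syllables(word):
    """Split a word into syllables of type aa, bb, or mismatch ... mismatch.
    Raises ValueError if the word cannot be split this way."""
    if len(word) % 2:
        raise ValueError("odd length: cannot be in EVEN-EVEN")
    out, current = [], None
    for i in range(0, len(word), 2):
        pair = word[i:i + 2]
        mismatch = pair in ("ab", "ba")
        if current is None:
            if mismatch:
                current = pair        # a type 3 syllable opens
            else:
                out.append(pair)      # type 1 (aa) or type 2 (bb)
        else:
            current += pair
            if mismatch:              # the second mismatch closes it
                out.append(current)
                current = None
    if current is not None:
        raise ValueError("unmatched ab or ba: not in EVEN-EVEN")
    return out


# Compare the expression with the direct definition on every short string.
for n in range(9):
    for letters in product("ab", repeat=n):
        word = "".join(letters)
        assert bool(EVEN_EVEN.fullmatch(word)) == in_even_even(word)

print(syllables("aaabaabbbabbaaaa"))
print(syllables("abbaabbbabaa"))
```

The splitter reads the word two letters at a time, which mirrors the reasoning in the notes: a string in EVEN-EVEN has even length, and its ab/ba pairs must occur an even number of times, so they can be matched up in order.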