Decision Properties for Context-free Languages

CMPU 240 Language Theory and Computation, Fall 2018

Previously: context-free languages; the Pumping Lemma for CFLs; closure properties for CFLs.
Today: Assignment 5 due; decision properties for CFLs, e.g., is a string in the language? Election special!
Later: I'll post practice problems for Exam 2 tonight; Exam 2 review on Thursday.

Decision properties for context-free languages (CFLs)
Start with a representation of a CFL, i.e., a context-free grammar (CFG) or a pushdown automaton (PDA). Since we can convert between CFGs and PDAs, we can use whichever is more convenient.
Spoiler: Very little is decidable about CFLs!

Testing emptiness of a CFL
Given a representation of some context-free language, ask whether it represents ∅. We've already seen how to do this when we were converting to CNF: check whether the start symbol is useless, i.e., whether it fails to derive at least one terminal string.

We can decide if a language is empty. We can decide if a string is in a language.
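The emptiness test can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the grammar encoding (a list of (head, body) pairs) and the names generating_variables and is_empty are assumptions made for the example. A variable is marked "generating" once some production for it has a body made only of terminals and already-generating variables; the language is empty exactly when the start symbol is never marked.

```python
def generating_variables(productions, terminals):
    """Return the set of variables that derive at least one terminal string.

    productions: list of (head, body) pairs, body being a tuple of symbols.
    terminals:   set of terminal symbols.
    (This encoding is an assumption made for the sketch.)
    """
    generating = set()
    changed = True
    while changed:  # repeat until no new variable can be marked
        changed = False
        for head, body in productions:
            if head in generating:
                continue
            # An empty body (an ε-production) passes this test trivially.
            if all(s in terminals or s in generating for s in body):
                generating.add(head)
                changed = True
    return generating


def is_empty(productions, terminals, start):
    """The language is empty iff the start symbol generates nothing."""
    return start not in generating_variables(productions, terminals)


# Hypothetical grammar: B can never finish deriving, so S = AB is useless.
dead = [("S", ("A", "B")), ("A", ("a",)), ("B", ("B", "b"))]
print(is_empty(dead, {"a", "b"}, "S"))

# Adding B -> b makes B, and therefore S, generating.
alive = dead + [("B", ("b",))]
print(is_empty(alive, {"a", "b"}, "S"))
```

The loop runs at most once per variable plus one final pass, so it always terminates.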

Testing finiteness of a CFL
Let L be a CFL. Then there is some Pumping Lemma constant n for L. Test all strings of length between n and 2n − 1 for membership. If there is any such string in L, it can be pumped, and the language is infinite. If there is no such string, then n − 1 is an upper limit on the length of strings in L, so the language is finite.
Trick: If there were a string s = uvxyz in L of length 2n or longer, you can find a shorter string uxz in L, but it's at most n shorter. (Why?) Thus, if there are any strings of length 2n or more, you can repeatedly cut out v and y to get, eventually, a string whose length is in the range n to 2n − 1.

Testing membership of a string in a CFL
Important result: Given a context-free grammar G and a word w, we can tell if G generates w! This can be done in finite time, algorithmically.
What if we only considered PDAs? It's not obvious that this could be done in finite time. Why can't we just simulate a PDA on w and, whenever it stops, we'd have our answer? Simulating a PDA for L on string w doesn't quite work, because the PDA can grow its stack indefinitely on ε input, and we never finish, even if the PDA is deterministic.
The approach to recognizing if a grammar G generates a string w has two steps:
1. Convert G to Chomsky normal form (CNF).
2. Use the CYK algorithm.
The Cocke-Younger-Kasami (CYK) algorithm is an $O(n^3)$ algorithm (n = length of w) that uses a dynamic programming technique.

Aside: CNF
Recall that in Chomsky normal form, every rule in the grammar is of the form A → BC or A → a, where a is a terminal, A is any variable, and B and C are variables other than the start variable. (Exception: allow S → ε.)

Aside: Big O notation
We said the algorithm is $O(n^3)$. If you haven't seen this notation before, it means that it takes at most some constant multiple of $n^3$ steps of computation (loosely defined) to process an input of length n. Big O notation is used in complexity analysis, which we may spend some time on at the end of the course, and is used extensively in CMPU 241, Algorithms.

Aside: Dynamic programming
Dynamic programming is a class of methods that avoid duplicate computation at the expense of memory. Values that may be used in future computations are stored in a table. Think of computing the Fibonacci sequence: each value depends on the two previous ones, so we save them after computation. In dynamic programming, you may have many previous calculations that you want to re-use.

CYK algorithm
Start with a CNF grammar for L.
Build a two-dimensional table:
Row = length of a substring of w
Column = beginning position of the substring
Entry in row i and column j = set of variables that generate the substring of w beginning at position j and extending for i positions
These entries are denoted $X_{j,\,i+j-1}$, i.e., the subscripts are the first and last positions of the string represented, so the first row is $X_{1,1}, X_{2,2}, \ldots, X_{n,n}$; the second row is $X_{1,2}, X_{2,3}, \ldots, X_{n-1,n}$; and so on.
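The Fibonacci illustration from the dynamic programming aside can be sketched as follows. This is a minimal example under assumed conventions (F(0) = 0, F(1) = 1; the function name fib_table is made up for the sketch): each value is computed once, stored in a table, and looked up when later values need it.

```python
def fib_table(n):
    """Return F(n), with F(0) = 0 and F(1) = 1, by filling a table bottom-up."""
    table = [0, 1]  # previously computed values, indexed by position
    for i in range(2, n + 1):
        # Re-use the two stored values instead of recomputing them.
        table.append(table[i - 1] + table[i - 2])
    return table[n]


print(fib_table(10))  # prints 55
```

CYK follows the same pattern, except that each table entry is a set of variables and may draw on many earlier entries rather than just two.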

Table
The horizontal axis corresponds to the positions of the string $w = a_1 a_2 \cdots a_n$. Table entry $X_{i,j}$ is the set of non-terminals A such that A ⇒* $a_i a_{i+1} \cdots a_j$. We are particularly interested in whether S is in $X_{1,n}$, because that is the same as saying S ⇒* w (that is, w is in L).

Basis (row 1): $X_{i,i}$ = the set of variables A such that A → a is a production and a is the symbol at position i of w. The grammar is in CNF, therefore the only way to derive a terminal is with a production of the form A → a, so $X_{i,i}$ is the set of non-terminals A such that A → $a_i$ is a production of G.

Induction: Suppose we want to compute $X_{i,j}$, which is in row j − i + 1, and we have computed all the X's in the rows for shorter strings. We can derive $a_i a_{i+1} \cdots a_j$ from A if there is a production A → BC, B derives some proper prefix of $a_i a_{i+1} \cdots a_j$, and C derives the rest. Thus, we must ask if there is any value of k such that
i ≤ k < j
B is in $X_{i,k}$
C is in $X_{k+1,j}$

Example
We'll use the algorithm to determine if the string w = aabbb is in the language generated by the grammar
S → AB, ...
Note that $w_{11} = a$, so $X_{1,1}$ is the set of all variables that immediately derive a, that is, $X_{1,1} = \{A\}$. Since $w_{22} = a$, we also have $X_{2,2} = \{A\}$, and so on, to get $X_{1,1} = \{A\}$, $X_{2,2} = \{A\}$, $X_{3,3} = \{B\}$, $X_{4,4} = \{B\}$, $X_{5,5} = \{B\}$.

Compute $X_{1,2}$: since $X_{1,1} = \{A\}$ and $X_{2,2} = \{A\}$, $X_{1,2}$ consists of all variables on the left side of a production whose right side is AA. None, so $X_{1,2}$ is empty.
Next, $X_{2,3} = \{A : A \to BC,\ B \in X_{2,2},\ C \in X_{3,3}\}$, so the required right side is AB; thus $X_{2,3}$ is the set of variables with a production whose body is AB, which includes S.
The rest is easy.
(Table: the completed CYK table for w = aabbb.)
Since S is in $X_{1,5}$, w ∈ L(G).
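The basis and induction steps translate almost directly into code. The sketch below is a hedged illustration, not the course's implementation: the function names cyk_table and cyk_accepts and the rule encoding are assumptions, and the grammar S → AB, A → BB | a, B → AB | b is an assumed CNF grammar chosen for the demonstration (only S → AB is taken from the example above). It also assumes the grammar has no S → ε rule.

```python
def cyk_table(word, unit_rules, pair_rules):
    """Fill the CYK table for `word`.

    unit_rules: list of (head, terminal) pairs for productions A -> a.
    pair_rules: list of (head, left, right) triples for productions A -> BC.
    Returns a dict mapping (i, j), 1-indexed and inclusive, to the set X_{i,j}.
    """
    n = len(word)
    table = {}
    # Basis: X_{i,i} holds every A with a production A -> a_i.
    for i in range(1, n + 1):
        table[(i, i)] = {h for h, t in unit_rules if t == word[i - 1]}
    # Induction: handle substrings in order of increasing length.
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            cell = set()
            for k in range(i, j):  # split point, i <= k < j
                for head, left, right in pair_rules:
                    if left in table[(i, k)] and right in table[(k + 1, j)]:
                        cell.add(head)
            table[(i, j)] = cell
    return table


def cyk_accepts(word, unit_rules, pair_rules, start="S"):
    """True iff the start variable is in X_{1,n} (no S -> ε rule assumed)."""
    if not word:
        return False
    return start in cyk_table(word, unit_rules, pair_rules)[(1, len(word))]


# Assumed illustrative grammar: S -> AB, A -> BB | a, B -> AB | b
UNIT = [("A", "a"), ("B", "b")]
PAIR = [("S", "A", "B"), ("A", "B", "B"), ("B", "A", "B")]

print(cyk_accepts("aabbb", UNIT, PAIR))
print(cyk_table("aabbb", UNIT, PAIR)[(1, 2)])
```

Under this assumed grammar, the entry for positions 1 to 2 is empty because no production has body AA, matching the reasoning in the example.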

Which variables have a production body b? a?
Break ba into two nonempty substrings, b and a. The rule must have body αβ, where α is in the entry for b and β is in the entry for a, i.e., BA or BC.

...
(Table: partially completed CYK table, with entries including {S,A}, {A,C} and {S,C}.)
We can break the string aab (positions 2 to 4) after position 2 or after position 3: k = 2 or k = 3. We need to consider bodies in
$X_{2,2}X_{3,4} \cup X_{2,3}X_{4,4} = \{A,C\}\{S,C\} \cup \{B\}\{B\} = \{AS, AC, CS, CC, BB\}$
Only CC shows up as a body.
(Table: the completed CYK table, whose top entry is {S, A, C}.)
baaba ∈ L(G)

CYK as a parsing algorithm
Applicability of the CYK algorithm as a parser is limited by the computational requirements needed to find a derivation.
For an input string of length n, $(n^2+n)/2$ sets need to be constructed to complete the dynamic programming table.
Each of these sets may require the consideration of several decompositions of the associated substring.
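To make the cost concrete, the short sketch below counts the table entries and the split points examined for an input of length n. The function name cyk_work is made up for the illustration, and the count ignores the per-split work of scanning the grammar's productions.

```python
def cyk_work(n):
    """Return (number of table sets, number of split points) for length n."""
    cells = 0
    splits = 0
    for length in range(1, n + 1):
        substrings = n - length + 1      # substrings of this length
        cells += substrings              # one table set per substring
        splits += substrings * (length - 1)  # k ranges over length - 1 values
    return cells, splits


print(cyk_work(5))  # prints (15, 20)
```

The first count agrees with $(n^2+n)/2$, and the second works out to $(n^3-n)/6$, which is where the cubic running time comes from.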

Preview of undecidable CFL problems
Is a given CFG G ambiguous?
Is a given CFL inherently ambiguous?
Is the intersection of two CFLs empty?
Are two CFLs the same?
Is a given CFL equal to Σ*, where Σ is the alphabet of the language?

The Chomsky hierarchy
Recursively enumerable languages: Turing machines
Context-sensitive languages: linear-bounded automata
Context-free languages: pushdown automata
Regular languages: finite automata

Context-sensitive grammars
The next grammar type, more powerful than CFGs, is a somewhat restricted grammar.
A grammar is context-sensitive if all productions are of the form x → y, where x and y are in $(V \cup T)^+$ and |x| ≤ |y|.
Fundamental property: the grammar is non-contracting, i.e., the length of successive sentential forms can never decrease.
Why context-sensitive? All productions can be rewritten in a normal form xAy → xvy. Effectively, A can be replaced by v only in the context of a preceding x and a following y.

Example
CSG for $\{a^n b^n c^n : n \ge 1\}$:
S → abc | aAbc
Ab → bA
Ac → Bbcc
bB → Bb
aB → aa | aaA
Try to derive $a^3 b^3 c^3$:
S ⇒ aAbc ⇒ abAc ⇒ abBbcc ⇒ aBbbcc ⇒ aaAbbcc ⇒ aabAbcc ⇒ aabbAcc ⇒ aabbBbccc ⇒ aabBbbccc ⇒ aaBbbbccc ⇒ aaabbbccc
A and B are messengers: an A is created on the left and travels to the right to the first c, where it creates another b and c. It then sends B back to create the corresponding a. This is similar to the way one would program a TM to accept the language.
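The derivation of aaabbbccc can be checked mechanically. The sketch below is an illustration only: it reads the example grammar as S → abc | aAbc, Ab → bA, Ac → Bbcc, bB → Bb, aB → aa | aaA, and the names PRODUCTIONS, one_step and is_valid_derivation are assumptions made for the example. It confirms that each sentential form follows from the previous one by a single rewrite and that no production shrinks the string.

```python
# Productions as (left side, right side) string pairs; upper case = variables.
PRODUCTIONS = [
    ("S", "abc"), ("S", "aAbc"),
    ("Ab", "bA"), ("Ac", "Bbcc"),
    ("bB", "Bb"), ("aB", "aa"), ("aB", "aaA"),
]


def one_step(form):
    """Return every sentential form reachable from `form` by one rewrite."""
    results = set()
    for lhs, rhs in PRODUCTIONS:
        start = form.find(lhs)
        while start != -1:  # try every occurrence of the left side
            results.add(form[:start] + rhs + form[start + len(lhs):])
            start = form.find(lhs, start + 1)
    return results


def is_valid_derivation(forms):
    """True iff each form follows from its predecessor in one step."""
    return all(b in one_step(a) for a, b in zip(forms, forms[1:]))


DERIVATION = [
    "S", "aAbc", "abAc", "abBbcc", "aBbbcc", "aaAbbcc", "aabAbcc",
    "aabbAcc", "aabbBbccc", "aabBbbccc", "aaBbbbccc", "aaabbbccc",
]

print(is_valid_derivation(DERIVATION))
# Non-contracting: no right side is shorter than its left side.
print(all(len(lhs) <= len(rhs) for lhs, rhs in PRODUCTIONS))
```

Because the grammar is non-contracting, the lengths of the forms in DERIVATION never decrease, which is the property the linear-bounded automaton construction relies on.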

Linear-bounded automata
A limited Turing machine in which tape use is restricted:
Use only the part of the tape occupied by the input, i.e., the machine has an unbounded tape, but the amount that can be used is a function of the input.
Restrict the usable part of the tape to exactly the cells taken by the input.
An LBA is assumed to be nondeterministic.

Relation between CSLs and LBAs
If a language L is accepted by some linear-bounded automaton, then there is a context-sensitive grammar that generates L.
Every sentential form in a derivation of w from a CSG has length at most |w|, because any CSG G is non-contracting.

That is all.