Context-Free Grammars An alphabet is a set of symbols finite or infinite. Given an alphabetathe set of all finite sequences with items inais writtena. A language (overa) is a subset finite or infinite of A. Some languages: {[0,0],[0,1],[1,0],[1,1]} a finite language over{0,1} {0,1} ={[],[0],[1],[0,0],[0,1],[1,0],[1,1],[0,0,0], }, an infinite language over{0,1} {[],[0],[1],[0,0],[1,1],[0,0,0],[0,1,0],[1,0,1],[1,1,1], } the infinite set of palindromes over{0,1} {[1],[2],[1, +,1],[1, +,2],[2, +,2],[1, +,1, +,1], } arithmetic expressions using 1, 2, and + As languages are often infinite sets, we need finite descriptions of infinite languages. A context-free language (CFL) is a language defined by a context-free grammar (below). Context free languages have numerous applications in data formats, programming languages, communication protocols, etc. Typeset October 21, 2014 1
Languages that are too complex to be context free are nevertheless often described by first describing a context free language and then imposing additional restrictions E.g. the set of all syntactically correct Java classes is context free class C { int i = 13.5;} The set of all type correct Java classes is not context free. A context free grammar (CFG)(A,N,P,n start ) consists of An alphabeta(i.e. a set of symbols) A finite set of nonterminal symbolsn disjoint from A. A finite set of production rules of the formn α, wheren N andα (N A) A starting nonterminaln start N. Example G 0 = ({0,1,2,3,4,5,6,7,8,9,(,),, },{pn,d},p 0,pn) wherep 0 is 1 {pn (ddd) ddd dddd, d 0,d 1,d 2,d 3,d 4, d 5,d 6,d 7,d 8,d 9} 1 In formal language theory, it is is usual to write sequences just by listing the items. E.g. dḋd instead of[d,d,d]. This creates a usually harmless ambiguity between symbols and sequences of length 1. Catenation is written asst rather thansˆt. The empty sequence is written as eitherɛor as nothing at all rather than[]. Typeset October 21, 2014 2
For a given grammarg = (A,N,P,n start ), define a relation= on sequences as follows 2 α= β if and only if α,β (N A) and there are sequencesγ,δ,θ (N A) and an n N such that α=γnθ and β=γδθ and (n δ) P Ifα= β, we say that we can deriveβ fromαin one step. Each derivation step γnθ = γδθ replaces one occurrence of some nonterminal symbol n with δ where (n δ) P. For example the following are derivation steps forg 0 ddd = 8dd ddd = d0d pn = (ddd) ddd dddd Thus each grammar defines a directed graph in which the nodes are elements of(a N) and, for eachαand β, there is an edge fromαtoβ iffα= β. 2 Using the notation of the rest of the course, we would write the last line of this definition as such thatα=γˆ[n]ˆθ andβ=γˆδˆθ and(n δ) P. Typeset October 21, 2014 3
A derivation is a finite path in this graph. We write α= β to mean there is a derivation that starts atsand ends att. I.e. α= βmeans that we can transformα intoβ via 0 or more derivation steps. For example pn = (dd9) ddd dd0d The language defined by a CFG(A,N,P,n start ) is the set of sequences inα A such thatn start = α. For example, the language defined byg 0 includes the sequence (709) 867 5309 To prove this, all we need to do is show 1 derivation (of the many) frompn. pn = (ddd) ddd dddd= (dd9) ddd dddd = (dd9) ddd dd0d= (d09) ddd dd0d = (709) ddd dd0d= (709) ddd dd0d = (709) ddd dd09= (709) ddd d309 = (709) 8dd d309= (709) 86d d309 = (709) 867 d309= (709) 867 5309 Typeset October 21, 2014 4
Another example. LetG 1 =(A 1,N 1,P 1,block) A 1 ={+,,/,,(,),<,:=,if,then,while,do,else,end} I N, wherei is a finite set of identifiers disjoint from {if,then,while,do,else,end} andn is some finite subset ofn. N 1 ={block,command,exp,comparand,term,factor} andp 1 contains all of the following production rules 3 block ɛ block command block command i:=exp for alli I command ifexpthenblockelseblockendif command whileexpdoblockendwhile exp comparand exp comparand < comparand comparand term comparand term + comparand comparand term comparand term factor term factor term term factor/term factor n for alln N factor i for alli I factor (exp) An example of a string in this language is whilei<ndoj:=j+ii:=i+1endwhile 3 Recall thatɛmeans an empty sequence. Typeset October 21, 2014 5
We can show that whilei<ndoj:=j+ii:=i+1endwhile is in the language by showing a derivation. block = command block = whileexpdoblockendwhileblock. = whilei<ndoj:=j+icommandblockendwhilebloc = whilei<ndoj:=j+ii:=expblockendwhileblock = whilei<ndoj:=j+ii:=comparandblock = whilei<ndoj:=j+ii:=term+comparandblock = whilei<ndoj:=j+ii:=factor+comparandblock = whilei<ndoj:=j+ii:=i+comparandblock = whilei<ndoj:=j+ii:=i+termblock = whilei<ndoj:=j+ii:=i+factorblock = whilei<ndoj:=j+ii:=i+1blockendwhilebloc = whilei<ndoj:=j+ii:=i+1endwhileblock = whilei<ndoj:=j+ii:=i+1endwhile Typeset October 21, 2014 6
Another way to show that a sequence is in a context free language is with a parse tree. Typeset October 21, 2014 7
Let s define parse trees. An ordered tree is a directed tree whose nodes are either leaves (with no children) or branches whose (0 or more) children are arranged in a sequencechildren(t) A parse tree for a CFG(A,N,P,n start ) is a finite ordered tree whose nodes are labelled with symbols froma N, such that the root is labelled withn start and each branch node is labelled with a nonterminal symboln N and has a sequence of children whose labels form a finite sequenceαsuch that(n α) P. Given a parse treetthe sequence of leaves is given by proc leaves(t) iftis a leaf then return[label(t)] else //tis a branch varα:=[] for u children(t) do α:= αˆleaves(u) end for returnα end if end leaves Exercise: Show that, for any parse treet, ifleaves(t)=α, thenn start = α. Exercise: Show that ifn start = α, there is a parse treet such thatα=leaves(t). Typeset October 21, 2014 8
Exercise: Design a CFG for the language of palindromes in{0,1}. Exercise: Design a CFG for the language of valid boolean expressions in{p,q,r,,,,,(,)} Exercise: Design a CFG for the language of valid C++ variable declarations in {int,,[,],(,),;,a,b,c,0,1} wherea,b, andcare identifiers. E.g. int a(int, int ); declaresato be a function, while int( b[10])(); declaresbto be an array of 10 pointers to functions returning int results. Exercise: Look up the definition of regular language. Show that every regular language is a context-free language. Show that some context free languages are not regular languages. Typeset October 21, 2014 9