MA53: Formal Languages and Automata Theory
Topic: Context-free Grammars (CFG)
Lecture Number 8
Date: September 2, 20

Exercise: Define a context-free grammar that represents (a simplification of) expressions in a typical programming language, such that an expression may contain the operators + (addition) and * (multiplication) and identifiers. Identifiers are formed from the letters a and b and the digits 0 and 1 only. Every identifier must begin with a or b, which may be followed by any string in {a, b, 0, 1}*.

Solution: We need two variables in this grammar. To represent expressions, we use the variable E. It is the start symbol and represents the language of expressions we are defining. The other variable, I, represents identifiers. Its language is actually regular: it is the language of the regular expression (a + b)(a + b + 0 + 1)*.

The rules of the grammar are as follows:

Table 1: Rules of the context-free grammar.
   1. E → I
   2. E → E + E
   3. E → E * E
   4. E → (E)
   5. I → a
   6. I → b
   7. I → Ia
   8. I → Ib
   9. I → I0
  10. I → I1

The grammar for expressions is stated formally as G = ({E, I}, T, R, E), where T is the set of symbols {+, *, (, ), a, b, 0, 1}, R is the set of rules shown in Table 1, and E is the start symbol.

We interpret the rules as follows. Rule (1) is the basis for expressions. It says that an expression can be a single identifier. Rules (2) - (4) describe the inductive case for expressions. Rule (2) (resp. Rule (3)) says that an expression can be two expressions connected by a plus (resp. multiplication) sign. Rule (4) says that if we take any expression and put matching parentheses around it, the result is also an expression.

Rules (5) - (10) describe identifiers. Rules (5) and (6) are the basis: they say that a and b are identifiers. The remaining four rules are the inductive case. They say that if we have any identifier, we can follow it by a, b, 0, or 1, and the result will be another identifier.
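The grammar above can also be written down as plain data, which makes the claim about identifiers easy to check. The following is a minimal sketch, not part of the lecture: the names RULES, VARIABLES, TERMINALS and is_identifier are illustrative assumptions, and the regular expression is the one stated in the solution, translated into Python syntax.

```python
import re

# Rules of Table 1, keyed by rule number: (head, body).
# Every grammar symbol is a single character, so a body can be a plain string.
RULES = {
    1: ("E", "I"),
    2: ("E", "E+E"),
    3: ("E", "E*E"),
    4: ("E", "(E)"),
    5: ("I", "a"),
    6: ("I", "b"),
    7: ("I", "Ia"),
    8: ("I", "Ib"),
    9: ("I", "I0"),
    10: ("I", "I1"),
}
VARIABLES = {"E", "I"}
TERMINALS = set("+*()ab01")
START = "E"

# The regular expression (a + b)(a + b + 0 + 1)* in Python's notation.
IDENTIFIER = re.compile(r"[ab][ab01]*")


def is_identifier(s: str) -> bool:
    """Return True if s belongs to the language of the variable I."""
    return IDENTIFIER.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["a", "b101", "a0", "0a", ""]:
        print(repr(candidate), is_identifier(candidate))
```

Rules (5) - (10) build an identifier from left to right, one symbol at a time, which is why the language of I coincides with the language of this regular expression.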
Derivations Using a Grammar (What is the language L(G) defined by a CFG G?)

The process of deriving strings of the language L(G) from a CFG G by applying rules requires the definition of a new relation symbol, ⇒. Suppose G = (V, T, R, S) is a CFG. Let αAβ be a string of variables and terminals, with A a variable. That is, α and β are strings in (V ∪ T)*, and A is in V. Let A → γ be a rule of G. Then we say αAβ ⇒_G αγβ. If G is understood, we just say αAβ ⇒ αγβ. Note that one derivation step replaces any variable anywhere in the string by the body of one of its rules.

We may extend the ⇒ relationship to represent zero, one, or many derivation steps, much as the transition function δ of a finite automaton was extended to δ̂. For derivations, we use a * to denote zero or more steps, as follows:

Basis: For any string α of terminals and variables, we say α ⇒* α. That is, any string derives itself.
Induction: If α ⇒* β and β ⇒ γ, then α ⇒* γ.

Example: We can infer that, for the rules in Table 1, (a0 + b) * (a + b) is in the language of variable E by showing a derivation starting with E, as given below:

E ⇒ E * E
  ⇒ (E) * E
  ⇒ (E + E) * E
  ⇒ (I + E) * E
  ⇒ (I0 + E) * E
  ⇒ (a0 + E) * E
  ⇒ (a0 + I) * E
  ⇒ (a0 + b) * E
  ⇒ (a0 + b) * (E)
  ⇒ (a0 + b) * (E + E)
  ⇒ (a0 + b) * (I + E)
  ⇒ (a0 + b) * (a + E)
  ⇒ (a0 + b) * (a + I)
  ⇒ (a0 + b) * (a + b).

The Language of a Grammar

If G = (V, T, R, S) is a CFG, the language of G, denoted by L(G), is the set of all terminal strings that have derivations from the start symbol S. That is,

L(G) = {w ∈ T* : S ⇒* w}

If a language L is the language of some context-free grammar, then L is said to be a context-free language (CFL). Two grammars G1 and G2 are said to be equivalent if and only if L(G1) = L(G2).

Leftmost and Rightmost Derivations

Leftmost derivation: At each step we replace the leftmost variable by one of its production/rule bodies. For leftmost derivations we use ⇒_lm and ⇒*_lm for one and many steps, respectively. (The derivation in the above example was actually a leftmost derivation.)
Rightmost derivation: At each step we replace the rightmost variable by one of its production/rule bodies. For rightmost derivations we use ⇒_rm and ⇒*_rm for one and many steps, respectively.
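The two derivation orders can be replayed mechanically. The sketch below is an illustration under stated assumptions rather than part of the lecture: the helper name derive and the two rule-number sequences are hypothetical, and the rule numbers are those of Table 1. Each step replaces the leftmost (or rightmost) variable of the current sentential form by the body of the chosen rule.

```python
# Rules of Table 1, keyed by rule number: (head, body).
RULES = {
    1: ("E", "I"), 2: ("E", "E+E"), 3: ("E", "E*E"), 4: ("E", "(E)"),
    5: ("I", "a"), 6: ("I", "b"), 7: ("I", "Ia"), 8: ("I", "Ib"),
    9: ("I", "I0"), 10: ("I", "I1"),
}
VARIABLES = {"E", "I"}


def derive(rule_numbers, leftmost=True, start="E"):
    """Apply the rules in order and return every sentential form produced.

    At each step the leftmost (or rightmost) variable is replaced; a
    ValueError is raised if the chosen rule's head is not that variable.
    """
    current = start
    forms = [current]
    for n in rule_numbers:
        head, body = RULES[n]
        positions = [i for i, sym in enumerate(current) if sym in VARIABLES]
        if not positions:
            raise ValueError("no variable left to replace")
        pos = positions[0] if leftmost else positions[-1]
        if current[pos] != head:
            raise ValueError(
                f"rule {n} has head {head}, but the variable to replace is {current[pos]}"
            )
        current = current[:pos] + body + current[pos + 1:]
        forms.append(current)
    return forms


# Rule sequences (assumed for illustration) that derive (a0+b)*(a+b).
LEFTMOST = [3, 4, 2, 1, 9, 5, 1, 6, 4, 2, 1, 5, 1, 6]
RIGHTMOST = [3, 4, 2, 1, 6, 1, 5, 4, 2, 1, 6, 1, 9, 5]

if __name__ == "__main__":
    print(" => ".join(derive(LEFTMOST, leftmost=True)))
    print(" => ".join(derive(RIGHTMOST, leftmost=False)))
```

Both sequences use the same rules the same number of times and end in the same terminal string; only the order in which variables are expanded differs.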
Sentential Forms

Derivations from the start symbol produce strings that have a special role. We call these sentential forms. That is, if G = (V, T, R, S) is a CFG, then any string α in (V ∪ T)* such that S ⇒* α is a sentential form. If S ⇒*_lm α, then α is a left sentential form, and if S ⇒*_rm α, then α is a right sentential form. Note that the language L(G) consists of those sentential forms that are in T* (i.e., they consist solely of terminals).

Example: Consider the grammar for expressions from Table 1. For example, E * (I + E) is a sentential form, since there is a derivation

E ⇒ E * E ⇒ E * (E) ⇒ E * (E + E) ⇒ E * (I + E)

However, this derivation is neither leftmost nor rightmost, since at the last step the middle E is replaced.

Parse Tree

Let G = (V, T, R, S) be a context-free grammar. The parse trees for G are trees with the following conditions:

1. Each internal node is labeled by a variable in V.
2. Each leaf is labeled by either (i) a variable, (ii) a terminal, or (iii) ε. However, if the leaf is labeled by ε, then it must be the only child of its parent.
3. If an interior node is labeled by A, and its children are labeled X1, X2, ..., Xk respectively, from the left, then A → X1 X2 ... Xk is a rule in R.

Example:

[Figure 1: (i) A parse tree showing the derivation of a sentential form from E, and (ii) a parse tree showing that (a0 + b) * (a + b) is in the language of the CFG in Table 1.]

Definition: The yield of a parse tree is the concatenation of the labels of its leaves, read from left to right.
A yield is always a string derived from the root variable. If the root is the start symbol S of the CFG, then the yields that consist solely of terminals are strings in the language L(G).

Ambiguity in Grammars and Languages

Consider the sentential form E + E * E for the grammar defined in Table 1. It has two derivations from E:

1. E ⇒ E + E ⇒ E + E * E
2. E ⇒ E * E ⇒ E + E * E

Notice that in derivation (1), the second E is replaced by E * E, while in derivation (2), the first E is replaced by E + E. Figure 2 shows the two parse trees, which are distinct trees.

[Figure 2: Two parse trees, (a) and (b), with the same yield.]

The difference between these two derivations is significant. As far as the structure of the expressions is concerned, derivation (1) says that the second and third expressions are multiplied, and the result is added to the first expression, while derivation (2) adds the first two expressions and multiplies the result by the third. In more concrete terms, the first derivation suggests that 1 + 2 * 3 should be grouped as 1 + (2 * 3) = 7, while the second derivation suggests the same expression should be grouped as (1 + 2) * 3 = 9. Obviously, the first of these (not the second) matches our notion of the correct grouping of arithmetic expressions.

Since the grammar of Table 1 gives two different structures to any string of terminals that is derived by replacing the three expressions in E + E * E by identifiers, we see that this grammar is not a good one for providing unique structure. In particular, while it can give strings the correct grouping as arithmetic expressions, it also gives them incorrect groupings.

On the other hand, the mere existence of different derivations for a string (as opposed to different parse trees) does not imply a defect in the grammar. For example, the string a * b has many derivations. Two examples are

1. E ⇒ E * E ⇒ I * E ⇒ a * E ⇒ a * I ⇒ a * b
2. E ⇒ E * E ⇒ E * I ⇒ E * b ⇒ I * b ⇒ a * b

Both derivations correspond to the same parse tree; they differ only in the order in which the variables are replaced.
Definition: A CFG G = (V, T, R, S) is ambiguous if there is at least one string w ∈ T* for which there exist two different parse trees, each with root labeled S and yield w. If each string has at most one parse tree in the grammar, then the grammar is said to be unambiguous.
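For the expression grammar, the definition can be checked on small strings by counting parse trees directly. The following is a minimal sketch under stated assumptions, not a general ambiguity test: the function name count_parse_trees is illustrative, and the counting relies on two properties of this particular grammar, namely that it has no ε-rules and that its only unit rule is E → I, so every recursive call works on a strictly smaller problem.

```python
from functools import lru_cache

# Rule bodies of the expression grammar, grouped by head.
# Every grammar symbol is a single character.
RULES = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}


def count_parse_trees(w, start="E"):
    """Count the parse trees with root `start` and yield w."""

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        # Number of parse trees with root `symbol` whose yield is w[i:j].
        if symbol not in RULES:  # terminal: it must match the substring exactly
            return 1 if w[i:j] == symbol else 0
        return sum(split(body, i, j) for body in RULES[symbol])

    @lru_cache(maxsize=None)
    def split(body, i, j):
        # Number of ways the symbols of `body`, in order, can yield w[i:j].
        if not body:
            return 1 if i == j else 0
        first, rest = body[0], body[1:]
        total = 0
        # No epsilon-rules: `first` yields at least one character and each
        # symbol of `rest` needs at least one character as well.
        for k in range(i + 1, j - len(rest) + 1):
            left = trees(first, i, k)
            if left:
                total += left * split(rest, k, j)
        return total

    return trees(start, 0, len(w))


if __name__ == "__main__":
    for s in ["a*b", "a+b*a", "(a0+b)*(a+b)", "a+"]:
        print(s, count_parse_trees(s))
```

A count of 2 or more for some string witnesses that the grammar is ambiguous, as happens for a+b*a. A count of 1 for one particular string, such as a*b, only says that this string has a unique parse tree; it does not show that the grammar is unambiguous.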