Habanero Extreme Scale Software Research Project
Comp215: Grammars Zoran Budimlić (Rice University)
Grammar, which knows how to control even kings - Moliere
So you know everything about regular expressions
  Are regular expressions sufficient for describing syntax?
  Example: ((a+b) * c)
  Wouldn't it be great if we had a name for this?
  Can you write a regular expression to recognize something like this? Why not?
  How about this: a^n b^n, n = 0, 1, 2, ...
Grammars to the rescue!
  We use Context-Free Grammars (CFGs) to specify context-free syntax.
  A CFG describes how a sentence of a language may be generated.
  Example:
    <EvilLaugh> ::= mwa <EvilCackle>
    <EvilCackle> ::= ha <EvilCackle>
    <EvilCackle> ::= ha!
  Use this grammar to generate the sentence: mwa ha ha ha!
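The generation exercise above can be sketched in code. This is only an illustration: the dictionary encoding of the grammar and the function name generate are assumptions made here, not part of the slides. Each call expands the leftmost nonterminal using the production index supplied by the caller.

```python
# Hypothetical encoding of the slide's grammar: each nonterminal maps to a
# list of productions, and each production is a list of symbols.
GRAMMAR = {
    "<EvilLaugh>": [["mwa", "<EvilCackle>"]],
    "<EvilCackle>": [["ha", "<EvilCackle>"], ["ha!"]],
}


def generate(choices, start="<EvilLaugh>"):
    """Expand the leftmost nonterminal until none remain.

    choices lists, in order, which production (by index) to apply.
    """
    form = [start]
    picks = iter(choices)
    while any(sym in GRAMMAR for sym in form):
        # Position of the leftmost nonterminal in the current form.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        # Replace that nonterminal with the chosen right-hand side.
        form[i:i + 1] = GRAMMAR[form[i]][next(picks)]
    return " ".join(form)


# Use the recursive <EvilCackle> production twice, then the terminating one.
print(generate([0, 0, 0, 1]))
```

Choosing the recursive production more often yields a longer laugh, which is how a finite grammar describes an unbounded set of sentences.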
Some terminology
  Syntax: the form or structure of the expressions, statements, and program units
    "Colorless green ideas sleep furiously." (Noam Chomsky)
  Semantics: the meaning of the expressions, statements, and program units
  Sentence: a string of characters over some alphabet
  Language: a set of sentences
  Lexeme: the lowest-level syntactic unit of a language (e.g., :=, {, while)
  Token: a category of lexemes (e.g., identifier)
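To make the lexeme/token distinction concrete, here is a minimal tokenizer sketch. The token category names and the toy input are assumptions chosen for illustration. Each result pairs a token (the category) with the lexeme (the matched text).

```python
import re

# Hypothetical token categories for a toy language; the names are assumptions.
TOKEN_SPEC = [
    ("KEYWORD", r"\bwhile\b"),
    ("ASSIGN", r":="),
    ("LBRACE", r"\{"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("SKIP", r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def tokenize(text):
    """Return (token, lexeme) pairs; whitespace is dropped.

    Characters that match no category are silently skipped in this sketch.
    """
    tokens = []
    for match in PATTERN.finditer(text):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens


for token, lexeme in tokenize("while x := 42 {"):
    print(token, repr(lexeme))
```

Note that x and any other name would share the IDENTIFIER token even though they are different lexemes.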
More terminology
  terminal symbols
    atomic components of statements in the language
    appear in source programs
    identifiers, operators, punctuation, keywords
  nonterminal symbols
    intermediate elements in producing terminal symbols
    never appear in source programs
  start (or goal) symbol
    a special nonterminal which is the starting symbol for producing statements
Productions
  rules for transforming nonterminal symbols into terminals or other nonterminals
    nonterminal ::= terminals and/or nonterminals
  each has a left-hand side (LHS) and a right-hand side (RHS)
  every nonterminal must appear on the LHS of at least one production
  Example:
    <EvilLaugh> ::= mwa <EvilCackle>
    <EvilCackle> ::= ha <EvilCackle>
    <EvilCackle> ::= ha!
  Each line of the example is a production. <EvilLaugh> is the start symbol, <EvilLaugh> and <EvilCackle> are nonterminals, and mwa, ha, and ha! are terminals.
Categories of grammars
  regular
    good for identifiers, parameter lists, subscripts
    an expanded form of regular expressions
    left-regular or right-regular
  context free
    LHS of production is a single nonterminal
  context sensitive
    LHS has a nonterminal that can be surrounded by terminals and/or nonterminals
  recursively enumerable
    LHS can be any non-empty sequence of terminals and/or nonterminals
Backus-Naur Form (BNF)
  Used to describe the syntax of programming languages; first used for Algol-60
  Nonterminals are enclosed in <...>
    <expression>, <identifier>
  Alternatives are indicated by |
    <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  Options (0 or 1 occurrences) are indicated by [...]
    <stmt> ::= if <cond> then <stmt> [ else <stmt> ]
  Repetition (0 or more occurrences) is indicated by {...}
    <unsigned> ::= <digit> {<digit>}
  How would you do this using regular expressions?
  Derivation
    apply the rules, starting with the start symbol and ending with a sentence
BNF defined using BNF
  <syntax> ::= <rule> | <rule> <syntax>
  <rule> ::= <opt-whitespace> "<" <rule-name> ">" <opt-whitespace> "::=" <opt-whitespace> <expression> <line-end>
  <opt-whitespace> ::= " " <opt-whitespace> | ""
  <expression> ::= <list> | <list> "|" <expression>
  <line-end> ::= <opt-whitespace> <EOL> | <line-end> <line-end>
  <list> ::= <term> | <term> <opt-whitespace> <list>
  <term> ::= <literal> | "<" <rule-name> ">"
  <literal> ::= '"' <text> '"' | "'" <text> "'"
Our earlier example
  a^n b^n, n = 0, 1, 2, ...
    <expr> ::= [ a <expr> b ]
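The single production above translates almost directly into a recursive recognizer. This is a minimal sketch; the function name matches_expr is an assumption. The empty string corresponds to omitting the optional part, and otherwise the string must start with a, end with b, and have a middle that is itself an <expr>.

```python
def matches_expr(s):
    """Return True if s can be derived from <expr> ::= [ a <expr> b ]."""
    if s == "":
        return True  # the optional part is omitted (n = 0)
    # Otherwise peel one 'a' off the front and one 'b' off the back, then recurse.
    return s[0] == "a" and s[-1] == "b" and matches_expr(s[1:-1])


for candidate in ["", "ab", "aabb", "aab", "abab"]:
    print(repr(candidate), matches_expr(candidate))
```

The recursion does the counting that a regular expression cannot: each a that is consumed is paired with exactly one b.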
Example Grammar and Derivation
  <program> ::= <stmts>
  <stmts> ::= <stmt> | <stmt> ; <stmts>
  <stmt> ::= <var> = <expr>
  <var> ::= a | b | c | d
  <expr> ::= <term> + <term> | <term> - <term>
  <term> ::= <var> | const

  <program> => <stmts>
            => <stmt>
            => <var> = <expr>
            => a = <expr>
            => a = <term> + <term>
            => a = <var> + <term>
            => a = b + <term>
            => a = b + const
Derivation Terminology
  Every string of symbols in the derivation is a sentential form
  A sentence is a sentential form that has only terminal symbols
  A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded
    similarly for a rightmost derivation
  A derivation may be neither leftmost nor rightmost!
  The language defined by a grammar is the set of all sentences that can be derived from that grammar
Derivation Tree
  A derivation tree is the tree resulting from applying productions to rewrite the start symbol
    a parse tree is the same tree, built starting with the terminals and working back to the start symbol (goal symbol)
  A parse tree is a useful data structure for reasoning about the meaning of the program/data

  <program>
    <stmts>
      <stmt>
        <var>
          a
        =
        <expr>
          <term>
            <var>
              b
          +
          <term>
            const
Ending a sentence with a preposition is something up with which I will not put. - Winston Churchill
Ambiguity
  A grammar is ambiguous iff it generates a sentential form that has two or more distinct parse trees
  An ambiguous expression grammar:
    <expr> ::= <expr> <op> <expr> | const
    <op> ::= / | -
  [Figure: two parse trees for const - const / const, one grouping it as (const - const) / const and the other as const - (const / const)]
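One way to see the ambiguity is to enumerate every parse tree the grammar allows for a given token sequence. The sketch below does this by trying each operator as the root of the tree. The function names and the concrete numbers 8, 4, and 2 (standing in for const) are assumptions made for illustration.

```python
def parses(tokens):
    """Yield every parse tree of tokens under
    <expr> ::= <expr> <op> <expr> | const.

    A tree is either a constant or a (left, op, right) tuple.
    """
    if len(tokens) == 1:
        yield tokens[0]
        return
    # Operators sit at the odd positions; try each one as the root.
    for i in range(1, len(tokens), 2):
        for left in parses(tokens[:i]):
            for right in parses(tokens[i + 1:]):
                yield (left, tokens[i], right)


def evaluate(tree):
    """Evaluate a parse tree built by parses()."""
    if not isinstance(tree, tuple):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a - b if op == "-" else a / b


trees = list(parses([8, "-", 4, "/", 2]))
for tree in trees:
    print(tree, "evaluates to", evaluate(tree))
```

The two trees evaluate to different numbers, which is why an ambiguous grammar is a problem for anything that assigns meaning based on the parse tree.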
Dangling Else Ambiguity
  One famous ambiguity is the dangling else
    <stmt> ::= if <cond> then <stmt> [ else <stmt> ]
  This can derive
    if X > 9 then if B = 4 then X := 5 else X := 0
  Which if does this else belong to?
Dangling Else Ambiguity
  Can solve syntactically by adding nonterminals and productions
    <stmt> ::= <matched> | <unmatched>
    <matched> ::= if <cond> then <matched> else <matched>
    <unmatched> ::= if <cond> then <stmt>
                  | if <cond> then <matched> else <unmatched>
  Can also solve semantically
    each else is associated with the immediately preceding unmatched then
Resolving Ambiguity
  An ambiguous expression grammar:
    <expr> ::= <expr> <op> <expr> | const
    <op> ::= / | -
  Unambiguous expression grammar:
    <expr> ::= <expr> - <term> | <term>
    <term> ::= <term> / const | const
  Under the unambiguous grammar, const - const / const has a single parse tree, which groups it as const - (const / const):

  <expr>
    <expr>
      <term>
        const
    -
    <term>
      <term>
        const
      /
      const

  Some languages are inherently ambiguous!
Recursion
  Left recursive grammars
    <expr> ::= <expr> + <term> | <term>
    <term> ::= <term> * const | const
  Right recursive grammars
    <expr> ::= <term> + <remain> | <term>
    <term> ::= const * <term> | const
    <remain> ::= <term> | <term> + <expr>
Associativity
  Left associativity: a - b + c = (a - b) + c
  Right associativity: a ** b ** c = a ** (b ** c)
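Python happens to follow both conventions, so the two rules can be checked directly. The specific numbers below are arbitrary values chosen for illustration.

```python
# Subtraction and addition associate to the left.
left_default = 10 - 4 + 3
left_explicit = (10 - 4) + 3
left_wrong = 10 - (4 + 3)

# Exponentiation associates to the right.
right_default = 2 ** 3 ** 2
right_explicit = 2 ** (3 ** 2)
right_wrong = (2 ** 3) ** 2

print(left_default, left_explicit, left_wrong)
print(right_default, right_explicit, right_wrong)
```

In each group the first two values agree and the third differs, so the grouping chosen by the grammar changes the result.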
Example: Expressions
  Consider the following unambiguous grammar for expressions:
    <expr> ::= [ <expr> <addop> ] <term>
    <term> ::= [ <term> <mulop> ] <factor>
    <factor> ::= ( <expr> ) | <digit>
    <addop> ::= + | -
    <mulop> ::= * | /
    <digit> ::= 0 | ... | 9
  This grammar is left recursive and generates expressions that are left associative
  Changing the <factor> production produces right-associative exponentiation
    <factor> ::= <expon> [ ** <factor> ]
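A small evaluator for this grammar can be sketched as a recursive-descent parser. One caveat: a recursive-descent function cannot call itself first without looping forever, so this sketch swaps the left-recursive productions for the equivalent repetition form, <expr> ::= <term> { <addop> <term> } and <term> ::= <factor> { <mulop> <factor> }, and accumulates results left to right to keep left associativity. The function names and the choice to evaluate while parsing are assumptions of this sketch, and operands are single digits, as in the grammar.

```python
def parse(text):
    """Parse and evaluate an expression over single digits, + - * / and ( )."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():
        # <expr> ::= <term> { <addop> <term> }
        value = term()
        while peek() in ("+", "-"):
            op = peek()
            advance()
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():
        # <term> ::= <factor> { <mulop> <factor> }
        value = factor()
        while peek() in ("*", "/"):
            op = peek()
            advance()
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor():
        # <factor> ::= ( <expr> ) | <digit>
        tok = peek()
        if tok == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if tok is not None and tok in "0123456789":
            advance()
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


print(parse("8 - 4 - 2"))    # left associative: (8 - 4) - 2
print(parse("2 + 3 * 4"))    # <mulop> binds tighter than <addop>
print(parse("(2 + 3) * 4"))
```

Precedence falls out of the grammar's layering: <term> sits below <expr>, so multiplication and division are grouped before addition and subtraction.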
Syntax Graphs
  equivalent to CFGs
  put the terminals in circles or ellipses and put the nonterminals in rectangles; lines and arrows indicate how constructs are built
    <expr> ::= <term> [ <addop> <expr> ]
    <term> ::= <factor> [ <mulop> <term> ]
  [Figure: syntax graphs for expr (term, optionally followed by addop and expr) and term (factor, optionally followed by mulop and term)]
JSON Grammar json.org :