ITEC2620 Introduction to Data Structures
Lecture 9b: Grammars I

Overview
  How can a computer do Natural Language Processing? Grammar checking?
  Artificial Intelligence
    Represent knowledge so that brute-force computation can achieve a result
    The result appears like intelligence

Dictionary Definitions I
  Language: the special terms used by/for a specific application
  Grammar: rules about using words in a language
  Words: any set of symbols/characters which represent a unit of meaning

Dictionary Definitions II
  Grammars use words
    Grammars are the rules which build units of meaning into larger ideas
  Grammars define languages
    Rules of a language
Example I
  Mathematical expressions
    Words: numbers and operators
    Grammar: rules for a valid equation
      Internal operator (infix notation)
      Post-operator (reverse Polish notation)
    Language: physics is spoken in the language of math

Example II
  How does Java know if an equation is valid?
  Need a systematic method to parse (analyze grammatically) an equation and determine whether it is valid, i.e. whether it follows the rules of the grammar

Definitions
  Tokens or Terminals
    T = set of symbols that can appear in the input
  Sentence
    A possible sequence of the terminals (any finite length, with no upper bound)

Details
  $T^i$: the set of all sentences with $i$ tokens
  $T^0 = \{\Phi\}$: the null sentence
  $T^j = T \times T^{j-1}$: set product
  $T^* = \bigcup_{j=0}^{\infty} T^j$: all possible sentences
Simple Example I
  T = { a, b }
  Imagine a keyboard with two characters: a and b
    All sentences will be a sequence of a's and b's
  Imagine a monkey using this keyboard
    What are all the sentences it could type?

Simple Example II
  $T^0 = \{\Phi\}$
  $T^1 = \{a, b\}$
  $T^2 = \{aa, ab, ba, bb\}$
  $T^3 = \{aaa, aab, aba, abb, baa, bab, bba, bbb\}$
  $T^4 = \{\ldots\}$

Simple Example III
  $T^* = \{\Phi, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, \ldots\}$
  The possible sentences for T = { a, b }:
    The (infinite) set of strings that consist of a's and b's (or the empty string/null sentence)

A Better Example
  T = { (, ), +, -, *, /, a...z, 0...9 }
  $T^*$ represents all possible sequences
    Things that a grade 2 student could come up with
    Not all will be valid arithmetic expressions
  The grammar of arithmetic expressions defines which sequences are valid sentences in the language of math
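The set-product definition of $T^j$ can be tried out directly. The following is a minimal Python sketch, not part of the original slides: the function names `sentences_of_length` and `all_sentences` are made up for illustration, and the empty string stands in for the null sentence Φ.

```python
from itertools import islice

def sentences_of_length(terminals, i):
    """Return T^i, using T^0 = {null sentence} and T^j = T x T^(j-1)."""
    if i == 0:
        return [""]  # assumption: the empty string represents the null sentence
    shorter = sentences_of_length(terminals, i - 1)
    # Set product: every terminal followed by every sentence of length i-1.
    return [t + rest for t in terminals for rest in shorter]

def all_sentences(terminals):
    """Generate T* in order of length; this generator never ends on its own."""
    i = 0
    while True:
        yield from sentences_of_length(terminals, i)
        i += 1

T = ["a", "b"]
print(sentences_of_length(T, 2))          # ['aa', 'ab', 'ba', 'bb']
print(list(islice(all_sentences(T), 7)))  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Because $T^*$ is infinite, the sketch exposes it as a generator and takes only a finite prefix with `islice`; each individual sentence it yields is still finite.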
Formal Definitions I
  A Grammar G consists of
    Terminals
    Non-terminals
    Start symbol
    Productions
  G = ( T, N, S, P )

Formal Definitions II
  Terminals (T)
    The (only) symbols that can appear in the sentence
    T = { a, b }
    T = { (, ), +, -, *, /, a...z, 0...9 }

Formal Definitions III
  Non-terminals (N)
    Variables that denote a class of terminals or a set of terminals
    Not in the sentence: internal variables
    Class: operators, digits
    Set: phrases
      Meaningful sub-groups of terminals

Formal Definitions IV
  Start Symbol (S)
    Used to define the initial starting point in the grammar
    Could be S (a non-terminal used only as the start symbol), or any other non-terminal
Formal Definitions V
  Productions (P)
    The re-writing rules
    Define how non-terminals can become terminals and/or other non-terminals
    Only the sentences which can be produced through these rules are valid

Full Example I
  A grammar that specifies valid arithmetic expressions with variables a-z and single digits 0-9
  Java could use such a grammar to check that entered code is syntactically correct

Full Example II
  G = ( T, N, S, P )
  T = { (, ), +, -, *, /, a...z, 0...9 }
  N = { expr, op, id } (expression, operator, identifier)
  S = expr
  P = the following (six) productions:

Full Example III
  a) expr → expr op expr
       A valid expression can consist of a valid expression, followed by an operator, followed by a valid expression
  b) expr → ( expr )
  c) expr → - expr
  d) expr → id
  Sets of variables: sub-phrases
Full Example IV
  e) op → + | - | * | /
  f) id → a | b | c | ... | z | 0 | 1 | 2 | ... | 9
  Classes of variables: interchangeable tokens
    You can switch any + with a - and still have a valid sentence

Full Example V
  Note: the production rules are recursive
    The same non-terminal can appear on both sides of a production rule
    Reduces the quantity of rules required
  expr → - id
    A redundant production (already covered by rules c and d)

Full Example VI
  Sentence Generation
    Use production rules to generate an expression
    ⇒ means "can derive" or "can re-write"

Full Example VII
  expr ⇒ expr op expr              // rule a
       ⇒ expr + expr               // rule e
       ⇒ - expr + expr             // rule c
       ⇒ - ( expr ) + expr         // rule b
       ⇒ - ( expr op expr ) + expr // rule a
       ⇒ - ( expr * expr ) + expr  // rule e
       ⇒ - ( id * id ) + id        // rule d
       ⇒ - ( a * b ) + c           // rule f
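The derivation of - ( a * b ) + c can be replayed mechanically. The sketch below is one possible Python encoding, not something given in the slides: the `PRODUCTIONS` dictionary, the `rewrite` helper, and the choice to store a sentential form as a list of symbols are all assumptions made for illustration.

```python
# Productions (a)-(f), keyed by non-terminal.
# Each right-hand side is a list of symbols.
IDS = list("abcdefghijklmnopqrstuvwxyz0123456789")
PRODUCTIONS = {
    "expr": [["expr", "op", "expr"],     # rule a
             ["(", "expr", ")"],         # rule b
             ["-", "expr"],              # rule c
             ["id"]],                    # rule d
    "op": [["+"], ["-"], ["*"], ["/"]],  # rule e
    "id": [[c] for c in IDS],            # rule f
}

def rewrite(form, index, rhs):
    """Replace the symbol at form[index] with rhs, if some production allows it."""
    symbol = form[index]
    if rhs not in PRODUCTIONS.get(symbol, []):
        raise ValueError(f"no production {symbol} -> {' '.join(rhs)}")
    return form[:index] + rhs + form[index + 1:]

# Each step is (position to rewrite, right-hand side to substitute).
steps = [
    (0, ["expr", "op", "expr"]),  # rule a: expr op expr
    (1, ["+"]),                   # rule e: expr + expr
    (0, ["-", "expr"]),           # rule c: - expr + expr
    (1, ["(", "expr", ")"]),      # rule b: - ( expr ) + expr
    (2, ["expr", "op", "expr"]),  # rule a: - ( expr op expr ) + expr
    (3, ["*"]),                   # rule e: - ( expr * expr ) + expr
    (2, ["id"]), (4, ["id"]), (7, ["id"]),  # rule d, applied three times
    (2, ["a"]), (4, ["b"]), (7, ["c"]),     # rule f, applied three times
]
form = ["expr"]
for index, rhs in steps:
    form = rewrite(form, index, rhs)
    print(" ".join(form))
# The last line printed is: - ( a * b ) + c
```

Because `rewrite` refuses any substitution that is not listed in `PRODUCTIONS`, a run that finishes without an error shows that every step of the derivation is licensed by the grammar. The slides combine the three uses of rule d (and of rule f) into a single line; the sketch applies them one at a time.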
Full Example VIII
  We have derived the sentence - ( a * b ) + c from expr
    The derivation is the sequence of steps
  No longer just a monkey at a keyboard
  Note: derivation is usually less interesting than parsing (the reverse process)

Another Example I
  G = ( T, N, S, P )
  T = { a, b }
  N = { S, A, B }
  (S = S) // redundant from above

Another Example II
  P: S → AB
     A → aA | ε
     B → bB | ε
  ε is the null symbol
    It allows a non-terminal to become nothing

Another Example III
  Can we derive aab?
  S ⇒ AB ⇒ aAB ⇒ aaAB ⇒ aaεB ⇒ aaB ⇒ aabB ⇒ aabε ⇒ aab
Another Example IV
  Can we derive b?
  S ⇒ AB ⇒ εB ⇒ B ⇒ bB ⇒ bε ⇒ b

Another Example V
  Can we derive bba?
  S ⇒ AB ⇒ εB ⇒ B ⇒ bB ⇒ bbB ⇒ bbε ⇒ bb ⇒ ???
  Note: A can become ε, but ε cannot become A

Another Example VI
  G = ( T, N, S, P )
  T = { a, b }
  N = { S, A, B }
  P: S → AB
     A → aA | ε
     B → bB | ε
  Valid sentences in the defined language:
    0 or more a's followed by 0 or more b's

Formal Definitions
  A string s of terminals is a valid sentence in the language defined by a grammar G = ( T, N, S, P ) if and only if $S \Rightarrow^{*} s$ (S derives s through a sequence of rewriting steps)
  Consequently, the language defined by G is the set of all sentences s such that s is a sentence in $T^*$ and S derives s
  $L(G) = \{\, s \mid s \in T^* \text{ and } S \Rightarrow^{*} s \,\}$
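The definition of L(G) suggests a brute-force membership test: search through everything S can be rewritten into and see whether the target string turns up. The Python sketch below does this for the example grammar, read here as S → AB, A → aA | ε, B → bB | ε. The `derives` function and the `PRODUCTIONS` encoding are illustrative assumptions, with the empty string standing in for ε. It is adequate for this small grammar but is not a general parser: the search may fail to terminate for grammars whose non-terminals can multiply without adding terminals.

```python
from collections import deque

# Upper-case letters are non-terminals; "" plays the role of the null symbol.
PRODUCTIONS = {"S": ["AB"], "A": ["aA", ""], "B": ["bB", ""]}

def derives(target, start="S", productions=PRODUCTIONS):
    """Return True if start can be rewritten into target (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        # Rewrite only the leftmost non-terminal; this is enough to reach
        # every sentence of a context-free grammar.
        pos = next((i for i, ch in enumerate(form) if ch in productions), None)
        if pos is None:
            continue  # all terminals, but not the target
        # Terminals left of pos can no longer change, so they must match.
        if not target.startswith(form[:pos]):
            continue
        for rhs in productions[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            terminals = sum(1 for ch in new if ch not in productions)
            # Prune forms that already hold more terminals than the target.
            if terminals <= len(target) and new not in seen:
                seen.add(new)
                queue.append(new)
    return False

print(derives("aab"))  # True
print(derives("b"))    # True
print(derives("bba"))  # False
```

The three calls mirror the three questions asked in the slides: aab and b are in L(G), while bba is not, because no production can place an a after a b.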
Overview of Grammars
  Three main problems
    Derive a sentence from a grammar
    Develop a grammar for a language
    Given a grammar, determine if an input is valid
  We have done derivation
    The next two parts are more interesting

Readings and Assignments
  Suggested Readings from Shaffer (third edition)