Introduction to Parsing
Lecture 8
Adapted from slides by G. Necula and R. Bodik
2/8/2008 Prof. Hilfinger CS164 Lecture 8

Outline
  - Limitations of regular languages
  - Parser overview
  - Context-free grammars (CFGs)
  - Derivations
  - Syntax-Directed Translation

Languages and Automata
  - Formal languages are very important in CS
      - Especially in programming languages
  - Regular languages
      - The weakest formal languages widely used
      - Many applications
  - We will also study context-free languages

Limitations of Regular Languages
  - Intuition: A finite automaton that runs long enough must repeat states
  - A finite automaton can't remember the number of times it has visited a particular state
  - A finite automaton has finite memory
      - Only enough to store which state it is in
      - Cannot count, except up to a finite limit
  - E.g., the language of balanced parentheses is not regular: { (^i )^i | i >= 0 }

The Structure of a Compiler
  Source -> Lexical analysis -> Tokens -> Parsing -> Interm. Language (Optimization) -> Code Gen. -> Machine Code
  - Today we start: Parsing

The Functionality of the Parser
  - Input: sequence of tokens from lexer
  - Output: abstract syntax tree of the program
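The counting argument from the slide on Limitations of Regular Languages can be made concrete. The sketch below is not from the lecture; the function name `in_paren_language` is made up for illustration. It recognizes { (^i )^i | i >= 0 } with an integer counter that can grow without bound, which is exactly the kind of memory a finite automaton does not have.

```python
def in_paren_language(s: str) -> bool:
    """Return True if s is in { (^i )^i | i >= 0 } (hypothetical helper)."""
    depth = 0            # unbounded counter: a finite automaton has no equivalent
    seen_close = False   # once a ')' is seen, no further '(' is allowed
    for ch in s:
        if ch == "(":
            if seen_close:
                return False
            depth += 1
        elif ch == ")":
            seen_close = True
            depth -= 1
            if depth < 0:        # more closing than opening parentheses
                return False
        else:
            return False         # only parentheses belong to the alphabet
    return depth == 0            # every '(' was matched
```

Any fixed number of states can only track `depth` up to some limit, so a sufficiently long run of opening parentheses forces the automaton to revisit a state and lose count.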
Example
  Pyth:
      if x == y: z = 1 else: z = 2
  Parser input:
      IF ID == ID : ID = INT ELSE : ID = INT
  Parser output (abstract syntax tree):
      IF-THEN-ELSE
        ==
          ID
          ID
        =
          ID
          INT
        =
          ID
          INT

Why A Tree?
  - Each stage of the compiler has two purposes:
      - Detect and filter out some class of errors
      - Compute some new information or translate the representation of the program to make things easier for later stages
  - Recursive structure of tree suits recursive structure of language definition
  - With a tree, later stages can easily find "the else clause", e.g., rather than having to scan through tokens to find it.

Comparison with Lexical Analysis
  Phase    Input                     Output
  Lexer    Sequence of characters    Sequence of tokens
  Parser   Sequence of tokens        Syntax tree

The Role of the Parser
  - Not all sequences of tokens are programs
  - Parser must distinguish between valid and invalid sequences of tokens
  - We need
      - A language for describing valid sequences of tokens
      - A method for distinguishing valid from invalid sequences of tokens

Programming Language Structure
  - Programming languages have recursive structure
  - Consider the language of arithmetic expressions with integers, +, *, and ( )
  - An expression is either:
      - an integer
      - an expression followed by + followed by an expression
      - an expression followed by * followed by an expression
      - a ( followed by an expression followed by )
  - int, int + int, ( int + int ) * int are expressions

Notation for Programming Languages
  - An alternative notation:
      E -> int
      E -> E + E
      E -> E * E
      E -> ( E )
  - We can view these rules as rewrite rules
      - We start with E and replace occurrences of E with some right-hand side
      E -> E * E -> ( E ) * E -> ( E + E ) * E -> ... -> ( int + int ) * int
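The rewrite-rule view can be tried out directly on strings. This is a minimal sketch, not part of the lecture: `RULES` and `rewrite_leftmost` are illustrative names, and always rewriting the leftmost E is a choice made here for simplicity.

```python
# Right-hand sides of the productions for the single nonterminal E.
RULES = ["int", "E + E", "E * E", "( E )"]

def rewrite_leftmost(form: str, rhs: str) -> str:
    """Replace the leftmost occurrence of E in form with the right-hand side rhs."""
    if rhs not in RULES:
        raise ValueError("not a right-hand side of the grammar: " + rhs)
    return form.replace("E", rhs, 1)

# Start with E and apply one right-hand side per step.
steps = ["E"]
for rhs in ["E * E", "( E )", "E + E", "int", "int", "int"]:
    steps.append(rewrite_leftmost(steps[-1], rhs))

for form in steps:
    print(form)
```

The last string printed is `( int + int ) * int`, which contains no E and therefore cannot be rewritten further.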
Observation
  - All arithmetic expressions can be obtained by a sequence of replacements
  - Any sequence of replacements forms a valid arithmetic expression
  - This means that we cannot obtain ( int ) ) by any sequence of replacements. Why?
  - This set of rules is a context-free grammar

Context-Free Grammars
  - A CFG consists of
      - A set of nonterminals N
          - By convention, written with a capital letter in these notes
      - A set of terminals T
          - By convention, either lower case names or punctuation
      - A start symbol S (a nonterminal)
      - A set of productions
          - Assuming X ∈ N, each production has the form
              X -> ε, or
              X -> Y_1 Y_2 ... Y_n   where Y_i ∈ N ∪ T

Examples of CFGs
  Simple arithmetic expressions:
      E -> int
      E -> E + E
      E -> E * E
      E -> ( E )
  - One nonterminal: E
  - Several terminals: int, +, *, (, )
      - Called terminals because they are never replaced
  - By convention the nonterminal for the first production is the start one

The Language of a CFG
  Read productions as replacement rules:
      X -> Y_1 ... Y_n
        Means X can be replaced by Y_1 ... Y_n
      X -> ε
        Means X can be erased (replaced with the empty string)

Key Idea
  1. Begin with a string consisting of the start symbol S
  2. Replace any nonterminal X in the string by a right-hand side of some production X -> Y_1 ... Y_n
  3. Repeat (2) until there are only terminals in the string
  4. The successive strings created in this way are called sentential forms.

The Language of a CFG (Cont.)
  More formally, may write
      X_1 ... X_{i-1} X_i X_{i+1} ... X_n  =>  X_1 ... X_{i-1} Y_1 ... Y_m X_{i+1} ... X_n
  if there is a production
      X_i -> Y_1 ... Y_m
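The one-step relation => can be written down almost literally. The sketch below assumes a representation chosen for illustration only: a grammar is a dictionary from each nonterminal to a list of right-hand sides, each a tuple of symbols, and the empty tuple stands for ε. The names `EXPR`, `one_step`, and `is_sentence` are hypothetical.

```python
from typing import Dict, List, Tuple

Form = Tuple[str, ...]                 # a sentential form: a sequence of symbols
Grammar = Dict[str, List[Form]]        # nonterminal -> list of right-hand sides

# E -> int | E + E | E * E | ( E )
EXPR: Grammar = {
    "E": [("int",), ("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")")],
}

def one_step(grammar: Grammar, form: Form) -> List[Form]:
    """All forms obtained from `form` by replacing one nonterminal X_i with Y_1 ... Y_m."""
    results = []
    for i, sym in enumerate(form):
        for rhs in grammar.get(sym, []):       # terminals have no productions
            results.append(form[:i] + rhs + form[i + 1:])
    return results

def is_sentence(grammar: Grammar, form: Form) -> bool:
    """True when only terminals remain, so no further replacement is possible."""
    return all(sym not in grammar for sym in form)

print(one_step(EXPR, ("E",)))
print(is_sentence(EXPR, ("int", "+", "int")))
```

Repeating `one_step` from the form `("E",)` until `is_sentence` holds is steps 1 to 3 of the Key Idea slide; every intermediate tuple is a sentential form.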
The Language of a CFG (Cont.)
  Write
      X_1 ... X_n  =>*  Y_1 ... Y_m
  if
      X_1 ... X_n  =>  ...  =>  Y_1 ... Y_m
  in 0 or more steps

The Language of a CFG
  Let G be a context-free grammar with start symbol S. Then the language of G is:
      L(G) = { a_1 ... a_n | S =>* a_1 ... a_n and every a_i is a terminal }

Examples:
  - S -> 0
    S -> 1
    also written as S -> 0 | 1
    Generates the language { "0", "1" }
  - What about  S -> 1 A,  A -> 0 | 1
  - What about  S -> 1 A,  A -> 0 | 1 A
  - What about  S -> ε | ( S )

Pyth Example
  A fragment of Pyth:
      Compound -> while Expr: Block
                | if Expr: Block Elses
      Elses    -> ε
                | else: Block
                | elif Expr: Block Elses
      Block    -> Stmt_List
                | Suite
  (Formal language papers use one-character nonterminals, but we don't have to!)

Derivations and Parse Trees
  - A derivation is a sequence of sentential forms resulting from the application of a sequence of productions
      S => ... => ...
  - A derivation can be represented as a parse tree
      - Start symbol is the tree's root
      - For a production X -> Y_1 ... Y_n add children Y_1, ..., Y_n to node X

Derivation Example
  - Grammar
      E -> E + E | E * E | ( E ) | int
  - String
      int * int + int
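The "What about" grammars from the Examples slide can be explored by enumerating L(G) up to a length bound. This is a sketch under assumptions: the grammar representation (dictionary of tuples, empty tuple for ε) and the name `language` are made up here, and the cap on the number of nonterminals in a form is a heuristic added only to guarantee termination, so it could miss strings for grammars with many erasable nonterminals.

```python
from collections import deque

def language(grammar, start, max_len):
    """Strings of at most max_len terminals that are derivable from start."""
    found = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if all(sym not in grammar for sym in form):    # only terminals left
            found.add("".join(form))
            continue
        for i, sym in enumerate(form):
            for rhs in grammar.get(sym, []):
                new = form[:i] + rhs + form[i + 1:]
                terminals = sum(1 for s in new if s not in grammar)
                nonterminals = len(new) - terminals
                # Terminals are never replaced, so a form that is too long stays too long.
                if terminals > max_len or nonterminals > max_len + 1:
                    continue
                if new not in seen:
                    seen.add(new)
                    queue.append(new)
    return found

# S -> 1 A,  A -> 0 | 1 A
G1 = {"S": [("1", "A")], "A": [("0",), ("1", "A")]}
# S -> epsilon | ( S )
G2 = {"S": [(), ("(", "S", ")")]}

print(sorted(language(G1, "S", 4)))
print(sorted(language(G2, "S", 4)))
```

For these two grammars the bounded enumeration suggests the pattern: the first generates one or more 1s followed by a single 0, and the second generates the nested parentheses (^i )^i.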
Derivation Example (Cont.)
Derivation in Detail (1) to (5)
  [Figures: the derivation shown one step at a time, with the parse tree growing alongside each sentential form]
Derivation in Detail (6)
  [Figure: final step of the derivation with the completed parse tree]

Notes on Derivations
  - A parse tree has
      - Terminals at the leaves
      - Nonterminals at the interior nodes
  - A left-right traversal of the leaves is the original input
  - The parse tree shows the association of operations, the input string does not!
      - There may be multiple ways to match the input
      - Derivations (and parse trees) choose one

The Payoff: parser as a translator
  syntax-directed translation
      stream of tokens -> parser -> ASTs, or assembly code
      the parser is driven by: syntax + translation rules (typically hardcoded in the parser)

Mechanism of syntax-directed translation
  - syntax-directed translation is done by extending the CFG
      - a translation rule is defined for each production
  - given
      X -> d A B c
    the translation of X is defined recursively using
      - translation of nonterminals A, B
      - values of attributes of terminals d, c
      - constants

To translate an input string:
  1. Build the parse tree.
  2. Working bottom-up
      - Use the translation rules to compute the translation of each nonterminal in the tree
  Result: the translation of the string is the translation of the parse tree's root nonterminal.
  Why bottom up?
      - a nonterminal's value may depend on the value of the symbols on the right-hand side,
      - so translate a nonterminal node only after children translations are available.

Example 1: Arithmetic expression to value
  Syntax-directed translation rules:
      E_1 -> E_2 + T      E_1.trans = E_2.trans + T.trans
      E   -> T            E.trans = T.trans
      T_1 -> T_2 * F      T_1.trans = T_2.trans * F.trans
      T   -> F            T.trans = F.trans
      F   -> int          F.trans = int.value
      F   -> ( E )        F.trans = E.trans
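The bottom-up procedure and the rules of Example 1 can be sketched as a recursive walk over a parse tree. Everything about the representation is an assumption made for illustration: a node is a pair `(label, children)`, a terminal leaf is `(label, value)` with a non-list value, and `RULES`, `trans`, `tok`, and `lit` are hypothetical names. The tree is built by hand for the expression 2 * (4 + 5) rather than by a parser.

```python
# One translation rule per production, keyed by (left-hand side, right-hand side).
# Each rule receives the list of child translations, like $1, $2, $3 shifted to 0-based.
RULES = {
    ("E", ("E", "+", "T")): lambda v: v[0] + v[2],   # E1.trans = E2.trans + T.trans
    ("E", ("T",)):          lambda v: v[0],          # E.trans = T.trans
    ("T", ("T", "*", "F")): lambda v: v[0] * v[2],   # T1.trans = T2.trans * F.trans
    ("T", ("F",)):          lambda v: v[0],          # T.trans = F.trans
    ("F", ("int",)):        lambda v: v[0],          # F.trans = int.value
    ("F", ("(", "E", ")")): lambda v: v[1],          # F.trans = E.trans
}

def trans(node):
    """Translation of a parse-tree node; children are translated before their parent."""
    label, kids = node
    if not isinstance(kids, list):       # terminal leaf: kids holds the token's value
        return kids
    values = [trans(k) for k in kids]    # bottom-up: children first
    return RULES[(label, tuple(k[0] for k in kids))](values)

def tok(symbol):
    """Punctuation or operator leaf with no value."""
    return (symbol, None)

def lit(value):
    """F -> int, with the integer's value attached to the int leaf."""
    return ("F", [("int", value)])

# Parse tree for 2 * (4 + 5)
four = ("E", [("T", [lit(4)])])
total = ("E", [four, tok("+"), ("T", [lit(5)])])
paren = ("F", [tok("("), total, tok(")")])
tree = ("E", [("T", [("T", [lit(2)]), tok("*"), paren])])

print(trans(tree))   # 18
```

Because `trans` only applies a rule after the recursive calls on the children return, each nonterminal is translated exactly when its children's translations are available.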
Example 1: Bison/Yacc Notation
      E : E '+' T     { $$ = $1 + $3; }
      T : T '*' F     { $$ = $1 * $3; }
      F : int         { $$ = $1; }
      F : '(' E ')'   { $$ = $2; }
  KEY:
      $$ : Semantic value of left-hand side
      $n : Semantic value of nth symbol on right-hand side

Example 1 (cont): Annotated Parse Tree
  Input: 2 * (4 + 5)
      E (18)
        T (18)
          T (2)
            F (2)
              int (2)
          *
          F (9)
            (
            E (9)
              E (4)
                T (4)
                  F (4)
                    int (4)
              +
              T (5)
                F (5)
                  int (5)
            )

Example 2: Compute the type of an expression
      E -> E + E      if $1 == INT and $3 == INT: $$ = INT
                      else: $$ = ERROR
      E -> E and E    if $1 == BOOL and $3 == BOOL: $$ = BOOL
                      else: $$ = ERROR
      E -> E == E     if $1 == $3 and $1 != ERROR: $$ = BOOL
                      else: $$ = ERROR
      E -> true       $$ = BOOL
      E -> false      $$ = BOOL
      E -> int        $$ = INT
      E -> ( E )      $$ = $2

Example 2 (cont)
  Input: (2 + 2) == 4
      E (BOOL)
        E (INT)
          (
          E (INT)
            E (INT)
              int (INT)
            +
            E (INT)
              int (INT)
          )
        ==
        E (INT)
          int (INT)

Building Abstract Syntax Trees
  - Examples so far: streams of tokens translated into
      - integer values, or
      - types
  - Translating into ASTs is not very different

AST vs. Parse Tree
  - AST is a condensed form of a parse tree
      - operators appear at internal nodes, not at leaves.
      - "Chains" of single productions are collapsed.
      - Lists are "flattened".
      - Syntactic details are omitted
          - e.g., parentheses, commas, semicolons
  - AST is a better structure for later compiler stages
      - omits details having to do with the source language,
      - only contains information about the essential structure of the program.
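The condensation described under AST vs. Parse Tree can be sketched as a tree walk. The representation is assumed for illustration: parse-tree nodes are `(label, children)` pairs, terminal leaves carry a non-list value, and the AST uses plain integers for literals and `(operator, left, right)` tuples for operations. The names `to_ast`, `tok`, and `lit` are hypothetical.

```python
def to_ast(node):
    """Condense a parse tree for the E/T/F expression grammar into an AST."""
    label, kids = node
    if label == "int":
        return kids                           # literal: keep only its value
    shape = tuple(k[0] for k in kids)
    if len(kids) == 1:                        # chain E -> T, T -> F, F -> int: collapse
        return to_ast(kids[0])
    if shape == ("(", "E", ")"):              # parentheses are syntactic detail: omit
        return to_ast(kids[1])
    left, op, right = kids                    # E -> E + T  or  T -> T * F
    return (op[0], to_ast(left), to_ast(right))   # operator moves to the internal node

def tok(symbol):
    """Punctuation or operator leaf with no value."""
    return (symbol, None)

def lit(value):
    """F -> int, with the integer's value attached to the int leaf."""
    return ("F", [("int", value)])

# Parse tree for 2 * (4 + 5)
four = ("E", [("T", [lit(4)])])
total = ("E", [four, tok("+"), ("T", [lit(5)])])
paren = ("F", [tok("("), total, tok(")")])
tree = ("E", [("T", [("T", [lit(2)]), tok("*"), paren])])

print(to_ast(tree))   # ('*', 2, ('+', 4, 5))
```

The parse tree above has seventeen nodes; the AST keeps two operator nodes and three literals, and the grouping that the parentheses expressed survives only as tree structure.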
Example: 2 * (4 + 5), Parse tree vs. AST
  [Figure: parse tree and AST for 2 * (4 + 5), side by side]

AST-building translation rules
      E -> E + T      $$ = new PlusNode($1, $3)
      E -> T          $$ = $1
      T -> T * F      $$ = new TimesNode($1, $3)
      T -> F          $$ = $1
      F -> int        $$ = new IntLitNode($1)
      F -> ( E )      $$ = $2

Example: 2 * (4 + 5): Steps in Creating AST
  [Figure: parse tree annotated with the AST fragments built at each node]
  (Only some of the semantic values are shown)

Leftmost and Rightmost Derivations
  - Leftmost derivation: always act on the leftmost nonterminal
  - Rightmost derivation: always act on the rightmost nonterminal

Rightmost Derivation in Detail (1) to (2)
  [Figures: the rightmost derivation shown one step at a time, with the parse tree growing alongside]
Rightmost Derivation in Detail (3) to (6)
  [Figures: remaining steps of the rightmost derivation, ending with the completed parse tree]

Aside: Canonical Derivations
  - Take a look at that last derivation in reverse.
  - The active part (red) tends to move left to right.
  - We call this a reverse rightmost or canonical derivation.
  - Comes up in bottom-up parsing. We'll return to it in a couple of lectures.

Derivations and Parse Trees
  - For each parse tree there is exactly one leftmost and one rightmost derivation
  - The difference is the order in which branches are added, not the structure of the tree.
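The claim that a parse tree fixes exactly one leftmost and one rightmost derivation can be checked by reading both off a tree. This sketch assumes the ambiguous grammar E -> E + E | E * E | ( E ) | int, a hand-built tree for int * int + int in which the product is grouped first, and a `(label, children)` node representation where terminal leaves have `None` instead of a child list. The name `derivation` is hypothetical.

```python
def derivation(tree, leftmost=True):
    """Sentential forms of the leftmost (or rightmost) derivation encoded by a parse tree."""
    frontier = [tree]                      # nodes making up the current sentential form
    forms = [tree[0]]
    while True:
        # Positions in the frontier that still hold unexpanded nonterminals.
        pending = [i for i, n in enumerate(frontier) if isinstance(n[1], list)]
        if not pending:
            return forms
        i = pending[0] if leftmost else pending[-1]
        frontier[i:i + 1] = frontier[i][1]             # replace the node by its children
        forms.append(" ".join(n[0] for n in frontier))

INT = ("int", None)
product = ("E", [("E", [INT]), ("*", None), ("E", [INT])])
tree = ("E", [product, ("+", None), ("E", [INT])])

left = derivation(tree, leftmost=True)
right = derivation(tree, leftmost=False)
print(" => ".join(left))
print(" => ".join(right))
```

Both derivations start at E, end at `int * int + int`, and apply the same five productions; they differ only in the order in which the branches of the one tree are expanded. Reading `right` backwards gives the reverse rightmost order mentioned in the aside on canonical derivations.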
Summary of Derivations
  - We are not just interested in whether s ∈ L(G)
      - Also need derivation (or parse tree) and AST.
      - Parse trees slavishly reflect the grammar.
      - Abstract syntax trees abstract from the grammar, cutting out detail that interferes with later stages.
  - A derivation defines a parse tree
      - But one parse tree may have many derivations
  - Derivations drive translation (to ASTs, etc.)
  - Leftmost and rightmost derivations are the most important in parser implementation