CSE450 Translation of Programming Languages Lecture 4: Syntax Analysis
http://xkcd.com/859
Structure of a Compiler
Source Language
Front End: Lexical Analyzer, Syntax Analyzer (Today!), Semantic Analyzer, Int. Code Generator
Intermediate Code
Back End: Code Optimizer, Target Code Generator
Target Language
Project 2: Syntax Analysis
- Your project group will be assigned before Thursday's class.
- You will be extending your lexical analysis program from Project 1.
- Choose one group member's lexer as a starting point (or write a new one!)
- You may (and in some cases should!) change the tokens you use.
- The output from Project 2 is independent of Project 1.
- Do not worry about checking for errors beyond those listed in the project.
- Correctness is still jobs #1, 2, and 3.
Where is Syntax Analysis?
Source: if (idx == 0) idx = 750;
Lexical Analysis (or Scanner) produces the token stream: if ( idx == 0 ) idx = 750 ;
Syntax Analysis (or Parsing) produces the Abstract Syntax Tree (or Parse Tree):
if
  IsEq
    idx
    0
  Assign
    idx
    750
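The hand-off from scanner to parser on this slide can be sketched in a few lines of Python. This is only an illustration of what a token stream looks like, not the Flex-generated scanner used in the course. The token names (IF, ID, INT, and so on) are assumptions made for this example.

```python
import re

# Hypothetical token classes for the slide's example; the names are assumptions.
TOKEN_SPEC = [
    ("IF", r"if\b"),
    ("ID", r"[A-Za-z_]\w*"),
    ("INT", r"[0-9]+"),
    ("EQ", r"=="),
    ("ASSIGN", r"="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMI", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Turn a character stream into a list of (token class, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":  # whitespace is not passed to the parser
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for kind, lexeme in tokenize("if (idx == 0) idx = 750;"):
    print(kind, lexeme)
```

The parser sees only this sequence of token classes and lexemes; it never looks at individual characters.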
Parsing Analogy
Syntax analysis for natural languages
- Identify the function of each word
- Recognize if a sentence is grammatically correct
Example: I gave Jim the card.
sentence
  subject: pronoun "I"
  verb phrase
    action: verb "gave"
    indirect object: proper noun "Jim"
    object: noun phrase
      article "the"
      noun "card"
Syntax Analysis Overview
Goal: Does the input token stream satisfy the syntax of the program?
What do we need to do this?
- An expressive way to describe the syntax
- A mechanism to determine if a token stream satisfies the syntax
- A structured output to be used by later components of the compiler
For lexical analysis:
- Regular expressions describe patterns for tokens
- Finite automata (generated by Flex) convert the input character stream to tokens
- A token stream is made available to later components (specifically, the parser)
Just Use Regular Expressions?
Regular expressions are easy to implement and can expressively describe tokens. Should we also use them to describe the syntax of a programming language?
NO! - They do not have the power to express any non-trivial syntax
Example
- Nested constructs (blocks, expressions, statements)
- Detect balanced braces: {{} {} {{} { }}}
- We need unbounded counting!
- FSAs cannot count except in a strictly modulo fashion
{ { { { { } } } } }...
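The "unbounded counting" point can be made concrete with a small Python sketch. The function name braces_balanced is an assumption made for this example. The key detail is the integer depth, which can grow without limit, whereas a finite automaton has only a fixed number of states to remember how many braces are open.

```python
def braces_balanced(text):
    """Check brace balance with a counter that has no fixed upper bound."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace with nothing open
                return False
    return depth == 0  # every opened brace was closed


print(braces_balanced("{{} {} {{} { }}}"))
print(braces_balanced("{{}"))
```

Any fixed-size replacement for depth would fail on an input nested one level deeper than it can represent.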
Context Free Grammars
Consist of 4 components (Backus-Naur Form or BNF):
- Terminal Symbols = token or ε
- Non-terminal Symbols = syntactic variables
- Start Symbol S = special non-terminal
- Production Rules of the form LHS → RHS
  - LHS = A single non-terminal
  - RHS = A string of terminals and non-terminals
  - Specify how non-terminals may be expanded
S → a S a
S → T
T → b S b
T → ε
The language generated by a grammar is the set of strings of terminals derived from the start symbol by repeatedly applying the productions.
L(G) = language generated by grammar 'G'
Context Free Grammar Example
Grammar for a balanced-parentheses language:
S → ( S ) S
S → ε
- 1 non-terminal: S
- 2 terminals: "(", ")"
- Start symbol: S
- 2 production rules
If the grammar accepts a string, there is a derivation of that string using the production rules.
How do we produce the string "(())"?
S = ( S ) ε = ( ( S ) S ) ε = ( ( ε ) ε ) ε = ( ( ) )
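One way to check membership in this language is a recursive function that mirrors the two productions S → ( S ) S and S → ε. The sketch below is a hand-written recognizer, not a method prescribed by the slides. The names parse_S and accepts are assumptions.

```python
def parse_S(s, i=0):
    """Match one S starting at index i and return the index just past it."""
    if i < len(s) and s[i] == "(":
        # Production S -> ( S ) S
        j = parse_S(s, i + 1)
        if j < len(s) and s[j] == ")":
            return parse_S(s, j + 1)
        raise SyntaxError(f"expected ')' at position {j}")
    # Production S -> epsilon: consume nothing
    return i


def accepts(s):
    """True if the whole string can be derived from S."""
    try:
        return parse_S(s) == len(s)
    except SyntaxError:
        return False


print(accepts("(())"))
print(accepts("(()"))
```

Each recursive call corresponds to expanding one S in a derivation, so the call stack plays the role of the unbounded counter that a finite automaton lacks.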
More on Context Free Grammars
Shorthand - vertical bar '|' to combine multiple productions
S → a S a | T
T → b T b | ε
CFGs are powerful enough to express the syntax of most programming languages
Derivation = successive application of productions starting from S
Acceptance? = Determine if there exists a derivation for an input token stream
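The acceptance question can be illustrated by brute force: store the grammar as data, with one list of alternatives per non-terminal (the '|' shorthand), and search over leftmost derivations. This is a sketch for the small grammar on this slide only. It is not how practical parsers work, and the names GRAMMAR and derives are assumptions.

```python
from collections import deque

# Each non-terminal maps to its alternatives; [] stands for epsilon.
GRAMMAR = {
    "S": [["a", "S", "a"], ["T"]],
    "T": [["b", "T", "b"], []],
}


def derives(target, start="S"):
    """Search leftmost derivations from `start` for one that yields `target`."""
    target = list(target)
    seen = set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if form in seen:
            continue
        seen.add(form)
        # Locate the leftmost non-terminal in the sentential form.
        idx = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if idx is None:
            if list(form) == target:
                return True
            continue
        for rhs in GRAMMAR[form[idx]]:
            new_form = form[:idx] + tuple(rhs) + form[idx + 1:]
            terminals = [sym for sym in new_form if sym not in GRAMMAR]
            # Prune forms that already contain too many terminals.
            if len(terminals) <= len(target):
                queue.append(new_form)
    return False


print(derives("abba"))
print(derives("aba"))
```

The search terminates here because every sentential form of this grammar has at most one non-terminal; a grammar without that property would need a smarter strategy.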
A Parser
Inputs:
- Context Free Grammar, G
- Token Stream, s (from lexer)
Outputs:
- Yes, if s in L(G); No, otherwise
- Error Messages
- If yes, also output abstract syntax tree
Syntax analyzers (parsers) = Context free grammar acceptors that also output the corresponding derivation when the token stream is accepted.
Reverse Derivation
Tokens: Op = '+' | '-' | '*' | '/'   Int = [0-9]+   Open = '('   Close = ')'
Productions:
1) S → E
2) E → E Op E
3) E → Int
4) E → Open E Close
Question: Could we have produced the string (2-1) + 1 from the above grammar?
Work backward from the input toward the start symbol:
(2-1) + 1
  Tokenize:      Open Int Op Int Close Op Int
  Production 3:  Open E Op E Close Op E
  Production 2:  Open E Close Op E
  Production 4:  E Op E
  Production 2:  E
  Production 1:  S
Done
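The reverse derivation can be sketched as a loop that keeps replacing the right-hand side of a production with its left-hand side. The non-terminal names S and E are assumptions made for this sketch, since the extracted slides do not show them, and the productions are taken to be 1) S → E, 2) E → E Op E, 3) E → Int, 4) E → Open E Close. The greedy order of reductions happens to work for this small grammar; it is not a general parsing algorithm.

```python
import re

TOKEN_RE = re.compile(
    r"\s*(?:(?P<Int>[0-9]+)|(?P<Op>[+\-*/])|(?P<Open>\()|(?P<Close>\)))"
)

# (production number, LHS, RHS); S and E are assumed non-terminal names.
# Production 1 (S -> E) is applied only at the very end.
PRODUCTIONS = [
    (3, "E", ["Int"]),
    (4, "E", ["Open", "E", "Close"]),
    (2, "E", ["E", "Op", "E"]),
]


def tokenize(text):
    """Convert the input string into a list of token class names."""
    text = text.strip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.lastgroup)
        pos = match.end()
    return tokens


def reduce_once(form):
    """Apply the first production whose RHS occurs in form (leftmost match)."""
    for number, lhs, rhs in PRODUCTIONS:
        for i in range(len(form) - len(rhs) + 1):
            if form[i:i + len(rhs)] == rhs:
                return number, form[:i] + [lhs] + form[i + len(rhs):]
    return None


def reverse_derivation(text):
    """Return the list of (production used, sentential form) steps."""
    form = tokenize(text)
    steps = [(None, form)]
    while True:
        result = reduce_once(form)
        if result is None:
            break
        number, form = result
        steps.append((number, form))
    if form == ["E"]:
        steps.append((1, ["S"]))  # production 1: S -> E
    return steps


def accepts(text):
    return reverse_derivation(text)[-1][1] == ["S"]


for number, form in reverse_derivation("(2-1) + 1"):
    print(number, " ".join(form))
```

Unlike the slides, this sketch reduces one Int at a time, so production 3 appears three times in the trace before the first use of production 2.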
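True Derivation
Tokens: Op = '+' | '-' | '*' | '/'   Int = [0-9]+   Open = '('   Close = ')'
Productions: 1) S → E   2) E → E Op E   3) E → Int   4) E → Open E Close
Reading the same steps from the start symbol down to the input:
S
E
E Op E
Open E Close Op E
Open E Op E Close Op E
Open Int Op Int Close Op Int
(2-1) + 1
Can we convert this to a parse tree?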
Parse Tree Internal Nodes: Non-terminals Leaves: Terminals Edges: From: non-terminal of LHS of production To: nodes from RHS of production Captures derivation of the string
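A parse tree with these properties can be represented by a small node type. The sketch below builds the tree for (2-1) + 1 by hand, using the assumed non-terminal names S and E and the productions S → E, E → E Op E, E → Int, E → Open E Close. The names Node, leaf, and frontier are also assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    symbol: str                               # grammar symbol at this node
    children: List["Node"] = field(default_factory=list)
    lexeme: Optional[str] = None              # set only on leaves (terminals)


def leaf(symbol, lexeme):
    return Node(symbol, lexeme=lexeme)


def frontier(node):
    """Return the leaves from left to right."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(frontier(child))
    return leaves


# Each internal node's children are the RHS of the production applied there.
inner = Node("E", [Node("E", [leaf("Int", "2")]),
                   leaf("Op", "-"),
                   Node("E", [leaf("Int", "1")])])
grouped = Node("E", [leaf("Open", "("), inner, leaf("Close", ")")])
tree = Node("S", [Node("E", [grouped,
                             leaf("Op", "+"),
                             Node("E", [leaf("Int", "1")])])])

print(" ".join(n.symbol for n in frontier(tree)))
print("".join(n.lexeme for n in frontier(tree)))
```

Reading the leaves from left to right recovers the token stream, which is the sense in which the tree captures the derivation of the string.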
Parse Tree Construction
Tokens: Op = '+' | '-' | '*' | '/'   Int = [0-9]+   Open = '('   Close = ')'
Productions: 1) S → E   2) E → E Op E   3) E → Int   4) E → Open E Close
Each step of the derivation of (2-1) + 1 adds the children of one non-terminal:
S
  E
    E
      Open
      E
        E
          Int
        Op
        E
          Int
      Close
    Op
    E
      Int
Done!
Simplifying the Tree
Starting from the parse tree for ( 2 - 1 ) + 1:
- Label each leaf token with its lexeme: ( 2 - 1 ) + 1
- Drop the Open and Close leaves; the tree structure already records the grouping
- Replace each Int and Op node with its value
- Move each operator up to become the parent of its operands
- Attach the result under a ROOT node
Result:
ROOT
  +
    -
      2
      1
    1
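The simplification can be sketched as one recursive function over the parse tree. The tuple encoding here is an assumption made for this example: a non-terminal is (symbol, list of children), a token is (symbol, lexeme), and the non-terminal names S and E are assumed as well. The resulting AST is either an int or a tuple (operator, left, right).

```python
def simplify(node):
    """Convert a parse tree for the expression grammar into an AST."""
    symbol, payload = node
    if symbol == "Int":
        return int(payload)             # token node becomes its value
    if symbol == "S":
        return simplify(payload[0])     # S -> E: skip the wrapper
    children = payload                  # symbol is E from here on
    if len(children) == 1:
        return simplify(children[0])    # E -> Int
    if children[0][0] == "Open":
        return simplify(children[1])    # E -> Open E Close: drop parentheses
    left, op, right = children          # E -> E Op E
    return (op[1], simplify(left), simplify(right))  # operator moves up


# Parse tree for (2-1) + 1 in the assumed tuple encoding.
PARSE_TREE = ("S", [("E", [
    ("E", [("Open", "("),
           ("E", [("E", [("Int", "2")]),
                  ("Op", "-"),
                  ("E", [("Int", "1")])]),
           ("Close", ")")]),
    ("Op", "+"),
    ("E", [("Int", "1")]),
])])

print(simplify(PARSE_TREE))
```

The parentheses disappear from the output because the nesting of the tuples already encodes which subtraction belongs to which operands.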
Original Input: (2-1) + 1
Parse Tree (leaves, left to right): Open Int Op Int Close Op Int
Abstract Syntax Tree: ROOT → +( -(2, 1), 1 )
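To see why the abstract syntax tree is a convenient output for later compiler stages, here is a small sketch that walks such a tree and computes its value. The tuple encoding (operator, left, right) and the choice of true division for '/' are assumptions made for this example. The slides do not define how the AST is consumed.

```python
import operator

# Assumed mapping from operator lexemes to their meaning.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(ast):
    """Evaluate an AST whose nodes are ints or (op, left, right) tuples."""
    if isinstance(ast, int):
        return ast
    op, left, right = ast
    return OPS[op](evaluate(left), evaluate(right))


print(evaluate(("+", ("-", 2, 1), 1)))   # AST for (2-1) + 1
print(evaluate(("-", 2, ("+", 1, 1))))   # AST for 2 - (1+1)
```

The two trees contain the same numbers and operators, yet evaluate differently, because the grouping lives in the tree shape rather than in parenthesis tokens.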