Abstract Syntax Trees

Abstract Syntax Trees COMS W4115 Prof. Stephen A. Edwards Fall 2007 Columbia University Department of Computer Science

Parsing and Syntax Trees Parsing decides whether the program is part of the language. That alone is not very useful: we want more than a yes/no answer. Like most parser generators, ANTLR lets parsers include actions: pieces of code that run when a rule is matched. Top-down parsers: actions are executed while a rule is being matched. Bottom-up parsers: actions are executed when a rule is reduced.

Actions Simple languages can be interpreted with parser actions.

class CalcParser extends Parser;

expr returns [int r]
  { int a; r=0; }
  : r=mexpr ("+" a=mexpr { r += a; })* EOF ;

mexpr returns [int r]
  { int a; r=0; }
  : r=atom ("*" a=atom { r *= a; })* ;

atom returns [int r]
  { r=0; }
  : i:INT { r = Integer.parseInt(i.getText()); } ;

Actions In a top-down parser, actions are executed during the matching routines. Actions can appear anywhere within a rule: before, during, or after a match.

rule
  { /* before */ }
  : A { /* during */ } B C D { /* after */ }
  ;

Bottom-up parsers are restricted to running actions only after a rule has matched.

Implementing Actions Nice thing about top-down parsing: grammar is essentially imperative. Action code simply interleaved with rule-matching. Easy to understand what happens when.

Implementing Actions

expr returns [int r]
  { int a; r=0; }
  : r=mexpr ("+" a=mexpr { r += a; })* EOF ;

public final int expr() {   // What ANTLR builds
  int r;
  int a;
  r=0;
  r=mexpr();
  while ((LA(1)==PLUS)) {   // ( ... )*
    match(PLUS);            // "+"
    a=mexpr();              // a=mexpr
    r += a;                 // { r += a; }
  }
  match(Token.EOF_TYPE);
  return r;
}
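
For context, here is a minimal sketch of how such a generated parser is typically driven in ANTLR 2. It assumes a companion CalcLexer generated from a matching lexer grammar, which the slides do not show; the class and file names here are only illustrative.

import java.io.DataInputStream;

public class CalcMain {
    public static void main(String[] args) {
        try {
            // The generated lexer (assumed, not shown in the slides) reads characters...
            CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
            // ...and the generated parser pulls tokens from it.
            CalcParser parser = new CalcParser(lexer);
            // Calling the start rule runs the interleaved actions and returns the result.
            System.out.println(parser.expr());
        } catch (Exception e) {
            System.err.println("error: " + e);
        }
    }
}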

Actions Usually, actions build a data structure that represents the program. Separates parsing from translation. Makes modification easier by minimizing interactions. Allows parts of the program to be analyzed in different orders.

Actions Bottom-up parsers can only build bottom-up data structures. Children known first, parents later. Constructor for any object can require knowledge of children, but not of parent. Context of an object only established later. Top-down parsers can build both kinds of data structures.
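
A minimal Java sketch (not from the slides; the class names are hypothetical) of what "bottom-up data structures" means: each constructor needs its already-built children, never its parent, which is exactly the order in which a bottom-up parser reduces.

abstract class Expr {
    Expr parent;                        // context: filled in only when the parent is built
}

class IntLit extends Expr {
    final int value;
    IntLit(int value) { this.value = value; }
}

class Add extends Expr {
    final Expr left, right;
    Add(Expr left, Expr right) {        // children must already exist
        this.left = left;
        this.right = right;
        left.parent = this;             // the parent becomes known only now
        right.parent = this;
    }
}

class BottomUpDemo {
    public static void main(String[] args) {
        // What a bottom-up action for a rule like "expr : expr '+' expr" effectively
        // does, e.g. $$ = new Add($1, $3): both operands were reduced before the parent.
        Expr e = new Add(new IntLit(3), new Add(new IntLit(5), new IntLit(4)));
    }
}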

What To Build? Typically, an Abstract Syntax Tree that represents the program. Represents the syntax of the program almost exactly, but easier for later passes to deal with. Punctuation, whitespace, other irrelevant details omitted.

Abstract vs. Concrete Trees Like scanning and parsing, objective is to discard irrelevant details. E.g., comma-separated lists are nice syntactically, but later stages probably just want lists. AST structure almost a direct translation of the grammar.

Abstract vs. Concrete Trees

expr  : mexpr ("+" mexpr)* ;
mexpr : atom ("*" atom)* ;
atom  : INT ;

For the input 3 + 5 * 4:

Concrete parse tree:

expr
├── mexpr
│   └── atom
│       └── INT:3
├── +
└── mexpr
    ├── atom
    │   └── INT:5
    ├── *
    └── atom
        └── INT:4

Abstract syntax tree:

+
├── INT:3
└── *
    ├── INT:5
    └── INT:4

Implementing ASTs Most general implementation: ASTs are n-ary trees. Each node holds a token and pointers to its first child and next sibling. (Figure: a node, reached from its parent or its previous sibling, with outgoing pointers to its next sibling and its first child.)

Example of AST structure

A statement like "if (a > b) a -= b; else b -= a;" becomes an "if" node with three children: the comparison ">" (children a, b), the then-branch "-=" (children a, b), and the else-branch "-=" (children b, a):

if
├── >
│   ├── a
│   └── b
├── -=
│   ├── a
│   └── b
└── -=
    ├── b
    └── a

Typical AST Operations Create a new node; append a subtree as a child. (The figure shows a tree holding the ">" and first "-=" subtrees, plus a new "-=" subtree with children b and a, combining into a tree with all three as children.)
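
A minimal Java sketch (a simplified stand-in for ANTLR's own AST class, with a plain String where a real Token would go) of the generic first-child/next-sibling node and the two operations above, used to build the if-statement tree from the earlier example:

class ASTNode {
    final String text;        // stands in for the token
    ASTNode firstChild;       // leftmost child
    ASTNode nextSibling;      // next child of this node's parent

    ASTNode(String text) { this.text = text; }

    void appendSibling(ASTNode s) {
        ASTNode n = this;
        while (n.nextSibling != null) n = n.nextSibling;   // walk to the last sibling
        n.nextSibling = s;
    }

    void appendChild(ASTNode c) {
        if (firstChild == null) firstChild = c;            // first child: direct pointer
        else firstChild.appendSibling(c);                  // later children: sibling chain
    }

    public static void main(String[] args) {
        ASTNode cond = new ASTNode(">");
        cond.appendChild(new ASTNode("a"));
        cond.appendChild(new ASTNode("b"));

        ASTNode thenPart = new ASTNode("-=");
        thenPart.appendChild(new ASTNode("a"));
        thenPart.appendChild(new ASTNode("b"));

        ASTNode elsePart = new ASTNode("-=");
        elsePart.appendChild(new ASTNode("b"));
        elsePart.appendChild(new ASTNode("a"));

        ASTNode ifNode = new ASTNode("if");   // if (a > b) a -= b; else b -= a;
        ifNode.appendChild(cond);
        ifNode.appendChild(thenPart);
        ifNode.appendChild(elsePart);
    }
}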

Comment on Generic ASTs Is this general-purpose structure too general? Not very object-oriented: the whole program is represented with one type. Alternative: heterogeneous ASTs, one class per type of object.

class BinOp {
  int operator;
  Expr left, right;
}

class IfThen {
  Expr predicate;
  Stmt thenPart, elsePart;
}

Heterogeneous ASTs Advantage: avoids switch statements when walking the tree. Disadvantage: each analysis requires another method.

class BinOp {
  int operator;
  Expr left, right;
  void typeCheck() { ... }
  void constantProp() { ... }
  void buildThreeAddr() { ... }
}

Analyses are spread out across class files. Classes become littered with analysis code and additional annotations.

Comment on Generic ASTs ANTLR offers a compromise: It can automatically generate tree-walking code. It generates the big switch statement. Each analysis can have its own file. Still have to modify each analysis if the AST changes. Choose the AST structure carefully.
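
A hedged sketch of the kind of "big switch" walk that ANTLR generates for you, written here by hand over the simple ASTNode class sketched after the AST-operations slide (so the names are ours, not ANTLR's): one method dispatches on the node's token and recurses into the children.

class SwitchWalker {
    // Evaluates an arithmetic AST like the one the calculator grammar builds.
    static int evaluate(ASTNode n) {
        switch (n.text) {
            case "+":
                return evaluate(n.firstChild) + evaluate(n.firstChild.nextSibling);
            case "*":
                return evaluate(n.firstChild) * evaluate(n.firstChild.nextSibling);
            default:                                   // an INT leaf
                return Integer.parseInt(n.text);
        }
    }
}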

Building ASTs

The Obvious Way to Build ASTs

class ASTNode {
  ASTNode( Token t ) { ... }
  void appendChild( ASTNode c ) { ... }
  void appendSibling( ASTNode c ) { ... }
}

stmt returns [ASTNode n]
  : "if" p=expr "then" t=stmt "else" e=stmt
    { n = new ASTNode(new Token("IF"));
      n.appendChild(p);
      n.appendChild(t);
      n.appendChild(e); } ;

The Obvious Way Putting code in actions that builds ASTs is traditional and works just fine. But it's tedious. Fortunately, ANTLR can automate this process.

Building an AST Automatically with ANTLR

class TigerParser extends Parser;
options { buildAST=true; }

By default, each matched token becomes an AST node. Each matched token or rule is made a sibling of the AST for the rule. After a token, ^ makes the node the root of a subtree. After a token, ! prevents an AST node from being built.

Automatic AST Construction

Running

class CalcParser extends Parser;
options { buildAST=true; }

expr  : mexpr ("+" mexpr)* EOF ;
mexpr : atom ("*" atom)* ;
atom  : INT ;

on 2*3+4*5+6 gives the flat list

2 * 3 + 4 * 5 + 6 EOF

AST Construction with Annotations

Running

class CalcParser extends Parser;
options { buildAST=true; }

expr  : mexpr ("+"^ mexpr)* EOF! ;
mexpr : atom ("*"^ atom)* ;
atom  : INT ;

on 2*3+4*5+6 gives

+
├── +
│   ├── *
│   │   ├── 2
│   │   └── 3
│   └── *
│       ├── 4
│       └── 5
└── 6

Choosing AST Structure

Designing an AST Structure: sequences of things, removing unnecessary punctuation, additional grouping, and how many token types?

Sequences of Things

Comma-separated lists are common:

int gcd(int a, int b, int c)

args : "(" ( arg ("," arg)* )? ")" ;

A concrete parse tree keeps every token: an args node with children for "(", the three arg subtrees (int a, int b, int c), the two commas, and ")".

Drawbacks: many unnecessary nodes; branching suggests recursion; harder for later routines to get the data they want.

Sequences of Things

Better to choose a simpler structure for the tree. Punctuation is irrelevant; build a simple list.

int gcd(int a, int b, int c)

args : "("! ( arg (","! arg)* )? ")"!
       { #args = #([ARGS], args); } ;

ARGS
├── arg (int a)
├── arg (int b)
└── arg (int c)

What's going on here?

args : "("! ( arg (","! arg)* )? ")"!
       { #args = #([ARGS], args); } ;

The rule generates a sequence of arg nodes. Node generation is suppressed for punctuation (parens, commas). The action uses ANTLR's terse syntax for building trees:

{ #args = #( [ARGS], args ); }

reads as "set the args tree to a new tree whose root is a node of type ARGS and whose child is the old args tree."

What's going on here?

(int a, int b, int c)

args : "("! ( arg (","! arg)* )? ")"!
       { #args = #([ARGS], args); } ;

Before the action runs, #args is the flat sequence of arg subtrees (int a, int b, int c). After it runs, #args is an ARGS node whose children are those three arg subtrees.

Removing Unnecessary Punctuation Punctuation makes the syntax readable and unambiguous; that information is represented by the structure of the AST. Things typically omitted from an AST: parentheses (grouping and precedence/associativity overrides), separators such as commas and semicolons (they mark divisions between phrases), and extra keywords (while-do, if-then-else: one is enough).

Additional Grouping The Tiger language from Appel's book allows mutually recursive definitions only in uninterrupted sequences:

let
  function f1() = ( f2() )   /* OK */
  function f2() = ( ... )
in ... end

let
  function f1() = ( f2() )   /* Error */
  var foo := 42              /* splits group */
  function f2() = ( ... )
in ... end

Grouping

Convenient to group sequences of definitions in the AST. Simplifies later static semantic checks.

let
  function f1() = ( ... )
  function f2() = ( ... )
  var foo := 42
in ... end

Without grouping, the defs node has the children func f1, func f2, and var foo directly. With grouping, the defs node has a funcs child holding func f1 and func f2, and a vars child holding var foo.
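
A hedged Java sketch (all names and data structures here are hypothetical, not Tiger or ANTLR code) of how the grouped structure simplifies the check: for a funcs group, declare every function in the group before checking any body, so mutual recursion works only within one uninterrupted group.

import java.util.*;

class GroupedDefsCheck {
    // Hypothetical model: a function definition plus the names its body calls.
    static class FuncDef {
        final String name;
        final List<String> calls;
        FuncDef(String name, List<String> calls) { this.name = name; this.calls = calls; }
    }

    // Check one uninterrupted "funcs" group against the names already in scope:
    // declare the whole group first, then check every body.
    static void checkFuncGroup(List<FuncDef> group, Set<String> scope) {
        for (FuncDef f : group) scope.add(f.name);           // declare the whole group up front
        for (FuncDef f : group)
            for (String callee : f.calls)
                if (!scope.contains(callee))
                    throw new IllegalStateException(f.name + " calls undefined " + callee);
    }

    public static void main(String[] args) {
        // The "OK" example: f1 and f2 are in one group, so the mutual call is legal.
        checkFuncGroup(List.of(new FuncDef("f1", List.of("f2")),
                               new FuncDef("f2", List.of())), new HashSet<>());
        // In the "Error" example, var foo splits the defs, so f1's group is
        // checked before f2 is ever declared.
        try {
            checkFuncGroup(List.of(new FuncDef("f1", List.of("f2"))), new HashSet<>());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());              // f1 calls undefined f2
        }
    }
}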

Grouping

Identifying and building sequences of definitions is a little tricky in ANTLR. The obvious rules

defs  : ( funcs | vars | types )* ;
funcs : ( func )+ ;
vars  : ( var )+ ;
types : ( type )+ ;

are ambiguous: maximum-length sequences or minimum-length sequences?

Grouping

Hint: use ANTLR's greedy option to disambiguate this. The greedy flag decides whether repeating a rule takes precedence when an outer rule could also work.

string : (dots)* ;
dots   : (".")+ ;

When faced with a period, the second rule can either repeat itself or exit.

The Greedy Option

Setting greedy to true makes each dots as long as possible:

string : (dots)* ;
dots   : ( options { greedy=true; } : "." )+ ;

Setting greedy to false makes each dots a single period:

string : (dots)* ;
dots   : ( options { greedy=false; } : "." )+ ;

How Many Types of Tokens? Since each token is a type plus some text, there is some choice. Generally, want each different construct to have a different token type. Different types make sense when each needs different analysis. Arithmetic operators usually not that different. For the assignment, you need to build a node of type BINOP for every binary operator. The text indicates the actual operator.

Walking ASTs M. C. Escher, Ascending and Descending, 1960

Walking ASTs with ANTLR ANTLR can build tree parsers as easily as token parsers. Much simpler: the tree structure is already resolved. A simple recursive walk on the tree. Matches are sufficient, not exact (cheaper to implement): #( A B ) also matches the larger tree #( A #(B C) D ). These examples are for ANTLR 2.0.

Walking ASTs with ANTLR

class CalcParser extends Parser;
expr  : mexpr ("+"^ mexpr)* ;
mexpr : atom ("*"^ atom)* ;
atom  : INT
      | "("! expr ")"!
      ;

class CalcWalker extends TreeParser;
expr returns [int r]
  { int a,b; r=0; }
  : #("+" a=expr b=expr) { r = a + b; }
  | #("*" a=expr b=expr) { r = a * b; }
  | i:INT { r = Integer.parseInt(i.getText()); }
  ;

Walking ASTs with ANTLR

class CalcWalker extends TreeParser;
expr returns [int r]
  { int a,b; r=0; }
  : #("+" a=expr b=expr) { r = a + b; }
  | #("*" a=expr b=expr) { r = a * b; }
  | i:INT { r = Integer.parseInt(i.getText()); }
  ;

This walker has only one rule; the grammar had three. That's fine: only the structure of the tree matters.

Walking ASTs with ANTLR

  : #("+" a=expr b=expr) { r = a + b; }
  | #("*" a=expr b=expr) { r = a * b; }
  | i:INT { r = Integer.parseInt(i.getText()); }
  ;

The highlighted (first) line says: match a tree #( ... ) with the token "+" at the root and two children matched by expr, storing their results in a and b. When this is matched, assign a + b to the result r.

Comments on walking ASTs Tree grammars may seem to be ambiguous. It does not matter: the tree structure is already known. Unlike proper parsers, tree parsers have only one token of lookahead, so it must be possible to make a decision locally. This has an impact on the choice of AST structure.

Comments on walking ASTs Optional clauses can cause trouble. Place them at the end.

stmt : #("if" expr stmt (stmt)?)   // OK
     | #("do" (stmt)? expr)        // Bad
     ;

The first alternative works: it is easy to decide whether there is another child. The second does not: not enough lookahead.

Comments on walking ASTs Lists of undefined length can also cause trouble:

funcdef : #("func" ID (arg)* stmt) ;

This does not work because the tree walker cannot look far enough ahead to tell where the args end and the stmt begins. Solution: use a subtree:

funcdef : #("func" #("args" (arg)*) stmt) ;

The "args" placeholder subtree resolves the problem.