Exercise ANTLRv4. Patryk Kiepas. March 25, 2017

Similar documents
CS664 Compiler Theory and Design LIU 1 of 16 ANTLR. Christopher League* 17 February Figure 1: ANTLR plugin installer

Install and Configure ANTLR 4 on Eclipse and Ubuntu

Lab 2. Lexing and Parsing with ANTLR4

MATVEC: MATRIX-VECTOR COMPUTATION LANGUAGE REFERENCE MANUAL. John C. Murphy jcm2105 Programming Languages and Translators Professor Stephen Edwards

Project 2 Interpreter for Snail. 2 The Snail Programming Language

Typescript on LLVM Language Reference Manual

An ANTLR Grammar for Esterel

KU Compilerbau - Programming Assignment

12/22/11. Java How to Program, 9/e. Help you get started with Eclipse and NetBeans integrated development environments.

Syntax Analysis MIF08. Laure Gonnord

The SPL Programming Language Reference Manual

CPS 506 Comparative Programming Languages. Syntax Specification

Sprite an animation manipulation language Language Reference Manual

KU Compilerbau - Programming Assignment

Lexing, Parsing. Laure Gonnord sept Master 1, ENS de Lyon

A Simple Syntax-Directed Translator

IT 374 C# and Applications/ IT695 C# Data Structures

4. Inputting data or messages to a function is called passing data to the function.

2.2 Syntax Definition

Basics of Java Programming

ASML Language Reference Manual

SFU CMPT 379 Compilers Spring 2018 Milestone 1. Milestone due Friday, January 26, by 11:59 pm.

IPCoreL. Phillip Duane Douglas, Jr. 11/3/2010

Chapter 3: Describing Syntax and Semantics. Introduction Formal methods of describing syntax (BNF)

JME Language Reference Manual

Examination in Compilers, EDAN65

CS 6353 Compiler Construction Project Assignments

Assoc. Prof. Dr. Marenglen Biba. (C) 2010 Pearson Education, Inc. All rights reserved.

Introduction to C# Applications

JavaCC: SimpleExamples

Expressions and Data Types CSC 121 Fall 2015 Howard Rosenthal

Compilers. Compiler Construction Tutorial The Front-end

Decaf Language Reference

Program Fundamentals

Building Compilers with Phoenix

B The SLLGEN Parsing System

Full file at

Expressions and Data Types CSC 121 Spring 2015 Howard Rosenthal

Introduction to Programming

1. In C++, reserved words are the same as predefined identifiers. a. True

Syntax and Grammars 1 / 21

Syntax and Parsing COMS W4115. Prof. Stephen A. Edwards Fall 2003 Columbia University Department of Computer Science

Introduction to Lexical Analysis

Compiler phases. Non-tokens

flex is not a bad tool to use for doing modest text transformations and for programs that collect statistics on input.

M/s. Managing distributed workloads. Language Reference Manual. Miranda Li (mjl2206) Benjamin Hanser (bwh2124) Mengdi Lin (ml3567)

Chapter 2 Basic Elements of C++

CSE au Final Exam Sample Solution

A simple syntax-directed

Projects for Compilers

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Syntax Analysis Part VIII

ASTs, Objective CAML, and Ocamlyacc

The PCAT Programming Language Reference Manual

CSE 3302 Programming Languages Lecture 2: Syntax

CS 6353 Compiler Construction Project Assignments

CS /534 Compiler Construction University of Massachusetts Lowell. NOTHING: A Language for Practice Implementation

Objectives. Chapter 2: Basic Elements of C++ Introduction. Objectives (cont d.) A C++ Program (cont d.) A C++ Program

Chapter 2: Basic Elements of C++

Chapter 2: Basic Elements of C++ Objectives. Objectives (cont d.) A C++ Program. Introduction

VENTURE. Section 1. Lexical Elements. 1.1 Identifiers. 1.2 Keywords. 1.3 Literals

Lexical and Syntax Analysis

Università degli Studi di Bologna Facoltà di Ingegneria. Principles, Models, and Applications for Distributed Systems M

Related Course Objec6ves

LECTURE 3. Compiler Phases

Syntactic Analysis. The Big Picture Again. Grammar. ICS312 Machine-Level and Systems Programming

Automated Tools. The Compilation Task. Automated? Automated? Easier ways to create parsers. The final stages of compilation are language dependant

Overview. - General Data Types - Categories of Words. - Define Before Use. - The Three S s. - End of Statement - My First Program

VARIABLES & ASSIGNMENTS

CS5363 Final Review. cs5363 1

Course Outline. Introduction to java

Decaf Language Reference Manual

IC Language Specification

Parsing and AST construction with PCCTS

The Compiler So Far. CSC 4181 Compiler Construction. Semantic Analysis. Beyond Syntax. Goals of a Semantic Analyzer.

Writing a Lexer. CS F331 Programming Languages CSCE A331 Programming Language Concepts Lecture Slides Monday, February 6, Glenn G.

Compilers CS S-01 Compiler Basics & Lexical Analysis

Semantic actions for declarations and expressions

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

Contents. Jairo Pava COMS W4115 June 28, 2013 LEARN: Language Reference Manual

Tree visitors. Context classes. CS664 Compiler Theory and Design LIU 1 of 11. Christopher League* 24 February 2016

Java How to Program, 10/e. Copyright by Pearson Education, Inc. All Rights Reserved.

Compiler Techniques MN1 The nano-c Language

CS 330 Homework Comma-Separated Expression

1 Lexical Considerations

Jim Lambers ENERGY 211 / CME 211 Autumn Quarter Programming Project 2

Grammars & Parsing. Lecture 12 CS 2112 Fall 2018

Java Bytecode (binary file)

UNIVERSITY OF COLUMBIA ADL++ Architecture Description Language. Alan Khara 5/14/2014

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

Structure of a compiler. More detailed overview of compiler front end. Today we ll take a quick look at typical parts of a compiler.

Introduction to Java Applications; Input/Output and Operators

Introduction to Java Applications

Building lexical and syntactic analyzers. Chapter 3. Syntactic sugar causes cancer of the semicolon. A. Perlis. Chomsky Hierarchy

Algorithms and Programming I. Lecture#12 Spring 2015

([1-9] 1[0-2]):[0-5][0-9](AM PM)? What does the above match? Matches clock time, may or may not be told if it is AM or PM.

Comp 411 Principles of Programming Languages Lecture 3 Parsing. Corky Cartwright January 11, 2019

d-file Language Reference Manual

Programming for Engineers Introduction to C

Character Set. The character set of C represents alphabet, digit or any symbol used to represent information. Digits 0, 1, 2, 3, 9

Transcription:

Exercise ANTLRv4 Patryk Kiepas March 25, 2017 Our task is to learn ANTLR a parser generator. This tool generates parser and lexer for any language described using a context-free grammar. With this parser we can perform further analysis of the texts written in the language. Help: https://github.com/antlr/antlr4/blob/master/doc/index.md 1 Installation With IntelliJ IDEA 1. Install IntelliJ IDEA Community Edition https://www.jetbrains.com/idea/ 2. Open IntelliJ IDEA Community Edition 3. Click File Settings... (Ctrl + Alt + S) 4. In the Settings window select Plugins and click Install JetBrains plugin... 5. Find and install plugin ANTLR v4 grammar plugin Without IntelliJ IDEA 1. Download ANTLR v4.6 java binary http://www.antlr.org/download/antlr-4.6- complete.jar 2. Put the JAR in any directory 1

2 First project We will start with grammar Hello. Create Hello.g4 file and copy the text below to the file: // Define a grammar called Hello grammar Hello r : hello ID // match keyword hello followed by an identifier ID : [a-z]+ // match lower-case identifiers WS : [ \t\r\n]+ -> skip // skip spaces, tabs, newlines This grammar allows to parse any string consists of word hello followed by a space, and any valid identificator. This identificator is defined as token ID with regular expression [a-z]+. This means: match any characters from a to z one or more times. Note The grammar s name and the name of g4 file must be the same. The grammar must contain at least one parser rule. With IntelliJ IDEA 1. Click File New Project... 2. Then create an Empty Project 3. Include Hello.g4 grammar in the project 4. Open tab with the plugin ANTLR v4 grammar plugin 5. In the tab write down the test code (e.g. hello aaaaaaaa) 6. Select initial rule r from Structure sidebar (on the left) 7. Test the grammar (look over AST) Without IntelliJ IDEA 1. Open terminal and go to the directory with the JAR 2. Put Hello.g4 in that directory 3. Type java -cp "antlr-4.6-complete.jar." org.antlr.v4.tool Hello.g4 (this generate parser/lexer) 4. Compile parser/lexer files with javac Hello*.java 5. Prepare file with code to test TEST_FILE.txt (e.g. hello aaaaaaaa) 6. Test the grammar java -cp "antlr-4.6-complete.jar." org.antlr.v4.gui.testrig Hello r -gui TEST_FILE.txt where Hello is the name of the grammar, and r is the rule to test (look over AST) 2

3 Arithmetic Let s say our language is an arithmetic expression. This language consists of numbers mixed with arithmetic operators (e.g. +, -, /, *) and a pair of paranthesis (()). Example expressions from this language are: 3*2+(103-3), 3*3*3*3 or (((3-3))). We start by fiding all the tokens a basic blocks of our language. Then we prepare parser rules that describe how these tokens are connected. Tokens Token name starts with capital letter. By convention all token name is written in CAPS LOCK. Token is defined using regular grammar (well known regular expression). In our language there are numbers, let s say integers: NUMBER : [0-9]+. This means: any sequence of one or more digits. Apart from named tokens we can have in-lined tokens that are used directly in the parser rules (as we see later). So instead of defining tokens for each operator (e.g. MUL : * ) we will just use *. Apostrophes means that whatever is in between of them, is a part of the described language. Parser rules Having initial token we can move to parser rules. We have one rule which is expression expr. Expression can be a number, or an addition of numbers, multiplicaiton of numbers,... and so on. Also an expression can be nested in parentheses. All of this can be written down using one parser rule with many alternatives (subrules): expr : ( expr ) expr ( * / ) expr expr ( + - ) expr NUMBER When parsing an expression, the rule expr is match from top to the bottom. So the order in which rule s alternatives are written matters. If a subrule A is before subrule B then A will be match if it is correct, even though subrule B might be correct. This behaviour is also correct for tokens. Note Usually we define subrules/tokens starting from the more specific one to the least specific. Using this subrule hierarchy we can mimic operators precedence. For our rule expr multiplication and division have the same priority (( * / )) but also a higher priority than addition and substraction (( + - )) 3

Full example Now we can test our grammar. We added an additional token WS (whitespaces) which matches all white characters (newline, tabulator, space). This token has an annotaion -> skip which means to skip all of these whitespaces and don t use them when matching parser rules. This makes our rules easy, because otherwise it would be necessary to put the tokens for each of these whitespaces whenever they could appear. grammar ArithmeticExpression NUMBER : [0-9]+ WS : [ \r\n\t] -> skip expr : ( expr ) expr ( * / ) expr expr ( + - ) expr NUMBER We can test this grammar with an example given before 3*2+(103-3). The output AST (abstract-syntax tree) for this expression is as follow: Figure 1: An AST for 3 2 + (103 3) 4

4 C++ function Now we try to parse a simple C++ function. Let s look at our language example: // function example #include <iostream> using namespace std int addition (double a, float b) { int r r=a+b return r } Our example consists of a few language constructs that are good candidates for parser rules: Include directive #include <iostream> Namespace directive using namespace std Function definition int addition(double... Argument definition double a Variable definition int r Assignment statement r = a+b Expression a+b Return statement return r Here are a few proposal for our tokens Identificator iostream, std, addition, a, r etc. Type float, int, double Keywords using, namespace, return Comment // function example Operators + Special characters <, >, 5

Tokens We start with tokens. We define TYPE with three subrules for each data type name. ID is a classic sequence of lower and upper latin characters. COMMENT token matches comments and skip them. We know that comments start with // characters. After this characters everything belong to the comment. Term.*? means match every character but do it in a non-greedy matter (we don t want to match the rest of the program with this token). Comment ends with a new line. TYPE : int double float ID : [a-za-z]+ COMMENT : //.*? \n -> skip WS : [ \r\n\t] -> skip Parser rules We define program rule as our main rule. Each program consists of an include directive, a namespace directive and a function definitions. The include directive is a #include token with name of the library... and so on. program : include* namespace* function_def+ include : #include < ID > namespace : using namespace ID arg : TYPE ID function_def : TYPE ID ( arg (, arg)* ) { statements } statements : (stmt )+ stmt : TYPE ID ID = expr return ID expr : expr + expr ID 6

Figure 2: An AST for our C++ program 5 Project Go to course webpage and open PDF with a project description. 7