Building Lexical and Syntactic Analyzers
Chapter 3

"Syntactic sugar causes cancer of the semicolon." (A. Perlis)


Chomsky Hierarchy

Four classes of grammars, from simplest to most complex:
- Regular grammar: what we can express with a regular expression
- Context-free grammar: equivalent to our grammar rules in BNF
- Context-sensitive grammar
- Unrestricted grammar

Only the first two are used in programming languages.
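As a quick illustration (a sketch, not from the slides; the class name and examples are invented), the first two levels differ in what they can recognize: a regular expression handles flat token shapes such as identifiers, while nested structure such as parenthesized expressions requires a context-free rule:

    import java.util.regex.Pattern;

    public class RegularVsContextFree {
        public static void main(String[] args) {
            // Regular: an identifier is a letter followed by letters or digits.
            Pattern id = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
            System.out.println(id.matcher("count2").matches()); // true
            System.out.println(id.matcher("2count").matches()); // false

            // Context-free, not regular: arbitrarily nested parentheses,
            // e.g. Primary -> ( Expression ), require a recursive rule;
            // no single regular expression matches all nesting depths.
        }
    }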

Lexical Analysis

Purpose: transform the program representation
- Input: printable ASCII (or Unicode) characters
- Output: tokens (type, value)
- Discard: whitespace, comments

Definition: a token is a logically cohesive sequence of characters representing a single symbol.

Sample tokens:
- Identifiers
- Literals: 123, 5.67, 'x', true
- Keywords: bool char ...
- Operators: + - * / ...
- Punctuation: ; , ( ) { }
- Whitespace: space, tab
- Comments: // any-char* end-of-line
- End-of-line
- End-of-file
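The slides below use a Token with type() and value() accessors. As a hedged sketch (the course's actual Token class may differ in detail), such a (type, value) pair can be as simple as:

    // Minimal sketch of a (type, value) token; illustration only,
    // not the course's Token class.
    public class Token {
        private final String type;   // e.g., "Identifier", "IntLiteral"
        private final String value;  // e.g., "x", "123"

        public Token(String type, String value) {
            this.type = type;
            this.value = value;
        }

        public String type()  { return type; }
        public String value() { return value; }
    }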

Lexical Phase

Why a separate phase for lexical analysis? Why not make it part of the concrete syntax?
- The lexer needs a simpler, faster machine model than the parser: about 75% of the time is spent in the lexer for a non-optimizing compiler
- Differences in character sets
- End-of-line conventions differ:
  - Mac (classic): cr (ASCII 13)
  - Windows: cr/lf (ASCII 13/10)
  - Unix: nl (ASCII 10)

Categories of Lexical Tokens
- Identifiers
- Literals: includes integers, true, false, floats, chars
- Keywords: bool char else false float if int main true while
- Operators: = || && == != < <= > >= + - * / % ! [ ]
- Punctuation: ; , { } ( )

Regular Expression Review

    RegExpr     Meaning
    x           a character x
    \x          an escaped character, e.g., \n
    {name}      a reference to a name
    M | N       M or N
    M N         M followed by N
    M*          zero or more occurrences of M
    M+          one or more occurrences of M
    M?          zero or one occurrence of M
    [aeiou]     the set of vowels
    [0-9]       the set of digits
    .           any single character

Clite Lexical Syntax

    Category     Definition
    anychar      [ -~]
    Letter       [a-zA-Z]
    Digit        [0-9]
    Whitespace   [ \t]
    Eol          \n
    Eof          \004

    Category     Definition
    Keyword      bool char else false float if int main true while
    Identifier   {Letter}({Letter} | {Digit})*
    integerLit   {Digit}+
    floatLit     {Digit}+\.{Digit}+
    charLit      '{anychar}'

    Category     Definition
    Operator     = || && == != < <= > >= + - * / ! [ ]
    Separator    ; , { } ( )
    Comment      // ({anychar} | {Whitespace})* {Eol}
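For concreteness (an illustrative sketch, not part of the course code; the class and constant names are invented), these definitions map directly onto java.util.regex patterns:

    import java.util.regex.Pattern;

    public class CliteTokenPatterns {
        static final Pattern IDENTIFIER  = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
        static final Pattern INTEGER_LIT = Pattern.compile("[0-9]+");
        static final Pattern FLOAT_LIT   = Pattern.compile("[0-9]+\\.[0-9]+");

        public static void main(String[] args) {
            System.out.println(FLOAT_LIT.matcher("5.67").matches()); // true
            System.out.println(FLOAT_LIT.matcher("5.").matches());   // false: digits required after '.'
        }
    }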

Finite State Automaton

Given the regular expression definition of lexical tokens, how do we design a program to recognize these sequences? One way: build a deterministic finite automaton (DFA):
- Set of states: nodes in the representation graph
- Input alphabet, plus a unique end symbol
- State transition function: arcs in the graph, labelled using the alphabet
- Unique start state
- One or more final states

Example: DFA for Identifiers

An input is accepted if, starting in the start state, the automaton consumes all the input and halts in a final state.
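A hedged sketch of this identifier DFA in Java (the class and method names are invented for illustration; Letter and Digit are the ASCII ranges from the Clite table):

    public class IdentifierDFA {
        private static boolean isLetter(char c) {
            return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z';
        }
        private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

        // Runs the two-state DFA: state 0 = start, state 1 = final.
        static boolean accepts(String input) {
            int state = 0;
            for (char c : input.toCharArray()) {
                if (state == 0 && isLetter(c)) state = 1;                    // start --Letter--> final
                else if (state == 1 && (isLetter(c) || isDigit(c))) state = 1; // self-loop on final
                else return false;                                           // no transition: reject
            }
            return state == 1; // accepted iff all input consumed in a final state
        }

        public static void main(String[] args) {
            System.out.println(accepts("foo"));  // true
            System.out.println(accepts("2bad")); // false
        }
    }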

Overview of DFAs for Clite

(Figure: the DFA state diagrams for Clite's token classes; not reproduced in this transcription.)

Lexer Code

The parser calls the lexer whenever it needs a new token, so the lexer must remember where it left off: a class variable holds the current character (ch).

Greedy consumption goes one character too far. Consider (foo<bar) with no whitespace after the foo: if we consume the < at the end of identifying foo, we lose the first character of the next token. Possible remedies:
- a peek function
- a pushback function
- no symbol consumed by the start state

From Design to Code

    private char ch = ' ';

    public Token next() {
        do {
            switch (ch) {
                ...
            }
        } while (true);
    }

The loop is only exited when a token is found, via a return statement. The variable ch must be a class variable, initialized to a space character.
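To see why the "no symbol consumed by the start state" convention works, here is a minimal self-contained sketch (invented for illustration; not the course's Lexer.java) in which ch always holds the first unconsumed character:

    public class LookaheadDemo {
        private final String input;
        private int pos = 0;
        private char ch = ' ';   // start state consumes no symbol

        LookaheadDemo(String s) { input = s; nextChar(); }

        private void nextChar() {
            ch = pos < input.length() ? input.charAt(pos++) : '\0';
        }

        private static boolean isLetter(char c) {
            return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z';
        }

        String nextIdentifier() {
            StringBuilder sb = new StringBuilder();
            while (isLetter(ch)) { sb.append(ch); nextChar(); }
            // ch now holds the character AFTER the identifier: it stays
            // available for the next call, so nothing is lost.
            return sb.toString();
        }

        public static void main(String[] args) {
            LookaheadDemo lex = new LookaheadDemo("foo<bar");
            System.out.println(lex.nextIdentifier()); // prints foo
            System.out.println(lex.ch);               // prints <
        }
    }

After scanning foo, the < is still sitting in ch, so no peek or pushback machinery is needed.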

Translation Rules

We need to translate our DFA into code, a relatively straightforward process.

Traversing an arc from A to B:
- If labeled with x: test ch == x
- If unlabeled: the else/default part of the if/switch; if it is the only arc, no test need be performed
- Get the next character if A is not the start state

A node with an arc to itself is a do-while. Otherwise the move is translated to an if/switch:
- Each arc is a separate case
- An unlabeled arc is the default case

A sequence of transitions becomes a sequence of translated statements. A complex diagram is translated by boxing its components so that each box is one node, then translating each box using an outside-in strategy.
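Applying these rules to the Comment token (// followed by any characters up to an end-of-line) gives, as a self-contained sketch with invented names:

    // Sketch: translating the rule  Comment -> // ({anychar} | {Whitespace})* {Eol}
    // A sequence of arcs becomes a sequence of statements; the node with an
    // arc to itself becomes a loop, per the translation rules above.
    public class CommentSkipper {
        // Returns the index just past the comment's end-of-line.
        static int skipComment(String src, int pos) {
            pos += 2;                                   // sequence of arcs: consume "//"
            while (pos < src.length() && src.charAt(pos) != '\n')
                pos++;                                  // self-loop: any char until eol
            return pos < src.length() ? pos + 1 : pos;  // consume the eol if present
        }

        public static void main(String[] args) {
            String src = "// a comment\nint x;";
            System.out.println(src.substring(skipComment(src, 0))); // prints "int x;"
        }
    }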

Some Code Helper Functions

    private boolean isLetter(char c) {
        return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z';
    }

    private String concat(String set) {
        StringBuffer r = new StringBuffer();
        do {
            r.append(ch);
            ch = nextChar();
        } while (set.indexOf(ch) >= 0);
        return r.toString();
    }
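Inside next(), concat is typically used like this (a fragment, assuming the conventions above; letters and digits are assumed String constants listing the legal characters, and Token.keyword is an assumed factory that classifies a spelling as a keyword or an identifier):

    // Sketch: on seeing a letter, greedily collect the whole spelling,
    // then classify it as a keyword or an identifier.
    if (isLetter(ch)) {
        String spelling = concat(letters + digits); // assumed constants
        return Token.keyword(spelling);             // assumed factory method
    }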

Code

See the next() method in the Lexer.java source code; the code is in the zip file for homework #1.

Lexical Analysis of Clite in Java

    public class TokenTester {
        public static void main(String[] args) {
            Lexer lex = new Lexer(args[0]);
            Token t;
            int i = 1;
            do {
                t = lex.next();
                System.out.println(i + " Type: " + t.type() + "\tValue: " + t.value());
                i++;
            } while (t != Token.eofTok);
        }
    }

Result of Analysis (seen before)

Source program:

    // Simple Program
    int main() {
        int x;
        x = 3;
    }

Result of lexical analysis:

    1  Type: Int         Value: int
    2  Type: Main        Value: main
    3  Type: LeftParen   Value: (
    4  Type: RightParen  Value: )
    5  Type: LeftBrace   Value: {
    6  Type: Int         Value: int
    7  Type: Identifier  Value: x
    8  Type: Semicolon   Value: ;
    9  Type: Identifier  Value: x
    10 Type: Assign      Value: =
    11 Type: IntLiteral  Value: 3
    12 Type: Semicolon   Value: ;
    13 Type: RightBrace  Value: }
    14 Type: Eof         Value: <<EOF>>

Syntactic Analysis

After the lexical tokens have been generated, the next phase is syntactic analysis, i.e., parsing. Its purpose is to recognize source structure.
- Input: tokens
- Output: parse tree or abstract syntax tree

A recursive descent parser is one in which each nonterminal in the grammar is converted to a function that recognizes input derivable from that nonterminal.

Parsing Preliminaries

(Skipping; some more detail in the book.) To prepare the grammar for easier parsing, it is converted into a left dependency grammar:
- Discover all terminals recursively
- Turn regular expressions into a BNF-style grammar

For example:

    A → x { y } z

becomes

    A  → x A' z
    A' → ε | y A'

Program Structure Consists Of:
- Expressions: x + 2 * y
- Assignment statements: z = x + 2 * y
- Loop statements: while (i < n) a[i++] = 0;
- Function definitions
- Declarations: int i;

    Assignment → Identifier = Expression ;
    Expression → Term { AddOp Term }
    AddOp      → + | -
    Term       → Factor { MulOp Factor }
    MulOp      → * | /
    Factor     → [ UnaryOp ] Primary
    UnaryOp    → - | !
    Primary    → Identifier | Literal | ( Expression )

(Partial here; skipping &&, ||, etc.)

Recursive Descent Parser

One algorithm for generating an abstract syntax tree:
- Input: the lexical and concrete representations; output: the abstract representation
- The lexical data, a stream of tokens, comes from the Lexer we saw earlier
- The algorithm is top-down, based on an EBNF concrete syntax

Overview of Recursive Descent Process for Assignment

(Figure: step-by-step recursive descent over an assignment; not reproduced in this transcription.)

Algorithm for Writing a Recursive Descent Parser from EBNF

(The numbered steps of the algorithm are given in a figure not reproduced in this transcription.)

Implementing Recursive Descent

Say we want to write Java code to parse an Assignment (EBNF, concrete syntax):

    Assignment → Identifier = Expression ;

From steps 1-2, we add a method for an Assignment object:

    private Assignment assignment() {
        // will fill in code here momentarily to parse assignment
        return new Assignment(target, source);
    }

This is a method named assignment in the Parser.java file, separate from the Assignment class defined in AbstractSyntax.java.

Implement Assignment

According to the syntax, assignment should find an identifier, an operator (=), an expression, and a separator (;). So these are coded up into the method:

    private Assignment assignment() {
        // Assignment --> Identifier = Expression ;
        Variable target = new Variable(match(TokenType.Identifier));
        match(TokenType.Assign);
        Expression source = expression();
        match(TokenType.Semicolon);
        return new Assignment(target, source);
    }

Helper Methods

match retrieves the next token or displays a syntax error; error displays the error and terminates:

    private String match(TokenType t) {
        String value = token.value();
        if (token.type().equals(t))
            token = lexer.next();
        else
            error(t);
        return value;
    }

    private void error(TokenType tok) {
        System.err.println("Syntax error: expecting: " + tok + "; saw: " + token);
        System.exit(1);
    }

Expression Method

The assignment method relies on an expression method:

    Expression → Conjunction { || Conjunction }

    private Expression expression() {
        // Conjunction --> Equality { && Equality }
        Expression e = equality();
        while (token.type().equals(TokenType.And)) {
            Operator op = new Operator(token.value());
            token = lexer.next();
            Expression term2 = equality();
            e = new Binary(op, e, term2);
        }
        return e;
    }

The loop is needed because there may be multiple &&'s; the conjunction method must simply return the expression if there are no &&'s.

More Expression Methods

    private Expression factor() {
        // Factor --> [ UnaryOp ] Primary
        if (isUnaryOp()) {
            Operator op = new Operator(match(token.type()));
            Expression term = primary();
            return new Unary(op, term);
        }
        else return primary();
    }
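By the same pattern, the grammar rule Term → Factor { MulOp Factor } from earlier would become a method like the following (a sketch in the style of the code above; isMulOp() is an assumed helper that tests whether the current token is * or /):

    private Expression term() {
        // Term --> Factor { MulOp Factor }
        Expression e = factor();
        while (isMulOp()) {                      // assumed helper: current token is * or /
            Operator op = new Operator(match(token.type()));
            Expression term2 = factor();
            e = new Binary(op, e, term2);
        }
        return e;
    }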

More Expression Methods

    private Expression primary() {
        // Primary --> Identifier | Literal | ( Expression )
        //           | Type ( Expression )
        Expression e = null;
        if (token.type().equals(TokenType.Identifier)) {
            Variable v = new Variable(match(TokenType.Identifier));
            e = v;
        } else if (isLiteral()) {
            e = literal();
        } else if (token.type().equals(TokenType.LeftParen)) {
            token = lexer.next();
            e = expression();
            match(TokenType.RightParen);
        } else if (isType()) {
            Operator op = new Operator(match(token.type()));
            match(TokenType.LeftParen);
            Expression term = expression();
            match(TokenType.RightParen);
            e = new Unary(op, term);
        } else error("Identifier | Literal | ( | Type");
        return e;
    }

Finished Program

The finished recursive descent parser will be available as Parser.java; extending it in some way is left as an exercise. The resulting program incorporates both the concrete and the abstract syntax:
- The concrete syntax defines the methods, the classes, and the expected sequence of tokens
- The abstract syntax is created by setting the class member variables to the appropriate data values as the program is parsed
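By analogy with TokenTester, a small driver for the finished parser might look like this (a sketch; the Parser constructor, program() method, and display() are assumptions about the course code's interface):

    // Sketch of a parser test driver, analogous to TokenTester above.
    // Assumes Parser takes a Lexer and exposes a program() method that
    // returns the root of the abstract syntax tree; the actual names in
    // Parser.java may differ.
    public class ParserTester {
        public static void main(String[] args) {
            Parser parser = new Parser(new Lexer(args[0]));
            Program prog = parser.program(); // parse the whole source file
            prog.display();                  // assumed AST pretty-printer
        }
    }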