Compilers CS414-2017S-01 Compiler Basics & Lexical Analysis David Galles Department of Computer Science University of San Francisco

01-0: Syllabus Office Hours, Course Text, Prerequisites, Test Dates & Testing Policies, Projects (teams of up to 2), Grading Policies, Questions?

01-1: Notes on the Class Don't be afraid to ask me to slow down! We will cover some pretty complex stuff here, which can be difficult to get the first (or even the second) time. ASK QUESTIONS. While specific questions are always preferred, "I don't get it" is always an acceptable question. I am always happy to stop and re-explain a topic in a different way. If you are confused, I can guarantee that at least one other person in the class would benefit from more explanation.

01-2: Notes on the Class Projects are non-trivial: using new tools (JavaCC), managing a large-scale project, lots of complex classes & advanced programming techniques.

01-3: Notes on the Class Projects are non-trivial: using new tools (JavaCC), managing a large-scale project, lots of complex classes & advanced programming techniques. START EARLY! Projects will take longer than you think (especially starting with the semantic analyzer project). ASK QUESTIONS!

01-4: What is a compiler? Simplified view: Source Program → Compiler → Machine Code

01-5: What is a compiler? More accurate view: Source File → Lexical Analyzer → Token Stream → Parser → Abstract Syntax Tree → Semantic Analyzer → Assembly Tree Generator → Abstract Assembly Tree → Assembly Code Generator → Assembler → Relocatable Object Code → Linker (together with Libraries) → Machine Code

01-6: What is a compiler? The same pipeline, divided into a Front End (lexical analysis through semantic analysis) and a Back End (assembly tree generation through linking).

01-7: What is a compiler? The same pipeline, marking which phases are covered in this course: the compiler proper, from the Lexical Analyzer through the Assembly Code Generator (the Assembler and Linker are not covered).

01-8: Why Use Decomposition?

01-9: Why Use Decomposition? Software Engineering! Smaller units are easier to write, test and debug. Code Reuse: writing a suite of compilers (C, Fortran, C++, etc.) for a new architecture, or creating a new language and wanting compilers available for several platforms.

01-10: Lexical Analysis Converting the input file to a stream of tokens:
void main() {
    print(4);
}

01-11: Lexical Analysis Converting the input file to a stream of tokens. Input:
void main() {
    print(4);
}
Token stream: IDENTIFIER(void) IDENTIFIER(main) LEFT-PARENTHESIS RIGHT-PARENTHESIS LEFT-BRACE IDENTIFIER(print) LEFT-PARENTHESIS INTEGER-LITERAL(4) RIGHT-PARENTHESIS SEMICOLON RIGHT-BRACE

01-12: Lexical Analysis Brute-Force Approach Lots of nested if statements:
if ((c = nextchar()) == 'P') {
    if ((c = nextchar()) == 'R') {
        if ((c = nextchar()) == 'O') {
            if ((c = nextchar()) == 'G') {
                /* Code to handle the rest of either PROGRAM or any
                   identifier that starts with PROG */
            } else if (c == 'C') {
                /* Code to handle the rest of either PROCEDURE or any
                   identifier that starts with PROC */
            }
            ...

01-13: Lexical Analysis Brute-Force Approach Break the input file into words, separated by spaces or tabs (this can be tricky, since not all tokens are separated by whitespace), then use string comparison to determine tokens.
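A minimal sketch of this split-and-compare idea (hypothetical Java, not from the slides; the keyword set and the input string are made up) shows both how simple it is and exactly where it breaks down:

import java.util.Set;

public class BruteForceLexer {
    // Hypothetical keyword set, for illustration only.
    static final Set<String> KEYWORDS = Set.of("else", "end", "for");

    public static void main(String[] args) {
        String input = "for x else 42";
        for (String word : input.split("\\s+")) {          // split on spaces, tabs, newlines
            if (KEYWORDS.contains(word))
                System.out.println("KEYWORD(" + word + ")");
            else if (word.matches("[0-9]+"))
                System.out.println("INTEGER-LITERAL(" + word + ")");
            else
                System.out.println("IDENTIFIER(" + word + ")");
        }
        // A word like "print(4);" would come through as one "identifier",
        // because the whitespace assumption fails -- the tricky case noted above.
    }
}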

01-14: Deterministic Finite Automata A DFA consists of a set of states, an initial state, final state(s), and transitions. Examples: DFAs for else, end, and identifiers, and combining them into one DFA.

01-15: DFAs and Lexical Analyzers Given a DFA, it is easy to create C code to implement it. DFAs are easier to understand than C code: visual, almost like structure charts... However, creating a DFA for a complete lexical analyzer is still complex.
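To make the "DFA as code" idea concrete, here is a small hand-written sketch in Java (the slides mention C, but the translation is the same; the state numbering and class name are invented for this example) that classifies a single word as ELSE, END, or IDENTIFIER:

public class KeywordDFA {
    // States: 0=start, 1="e", 2="el", 3="en", 4="els", 5="else" (accept ELSE),
    //         6="end" (accept END), 7=generic identifier (accept IDENTIFIER)
    static String classify(String word) {
        int state = 0;
        for (char c : word.toCharArray()) {
            if (c < 'a' || c > 'z') return "ERROR";       // only lowercase letters in this sketch
            switch (state) {
                case 0:  state = (c == 'e') ? 1 : 7; break;
                case 1:  state = (c == 'l') ? 2 : (c == 'n') ? 3 : 7; break;
                case 2:  state = (c == 's') ? 4 : 7; break;
                case 3:  state = (c == 'd') ? 6 : 7; break;
                case 4:  state = (c == 'e') ? 5 : 7; break;
                default: state = 7; break;                // any extension of a keyword is an identifier
            }
        }
        if (state == 0) return "ERROR";                   // empty input
        if (state == 5) return "ELSE";
        if (state == 6) return "END";
        return "IDENTIFIER";                              // every other reachable state accepts IDENTIFIER
    }

    public static void main(String[] args) {
        System.out.println(classify("else"));   // ELSE
        System.out.println(classify("end"));    // END
        System.out.println(classify("elsed"));  // IDENTIFIER
    }
}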

01-16: Automatic Creation of DFAs We'd like a tool: describe the tokens in the language, have it automatically create a DFA for the tokens, and then automatically create C code that implements the DFA. We need a method for describing tokens.

01-17: Formal Languages Alphabet Σ: the set of all possible symbols (characters) in the input file; think of Σ as the set of symbols on the keyboard. String w: a sequence of symbols from an alphabet. String length |w|: the number of characters in a string: |car| = 3, |abba| = 4. Empty string ε: the string of length 0: |ε| = 0. Formal language: a set of strings over an alphabet. Formal language ≠ programming language: a formal language is only a set of strings.

01-18: Formal Languages Example formal languages: Integers {0, 23, 44, ...}, Floating point numbers {3.4, 5.97, ...}, Identifiers {foo, bar, ...}

01-19: Language Concatenation Given two formal languages L1 and L2, the concatenation of L1 and L2 is L1L2 = {xy | x ∈ L1, y ∈ L2}. For example: {fire, truck, car}{car, dog} = {firecar, firedog, truckcar, truckdog, carcar, cardog}
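A quick way to see the definition in action is to compute it for the finite example languages above (a small Java sketch, not part of the slides):

import java.util.Set;
import java.util.TreeSet;

public class Concatenation {
    public static void main(String[] args) {
        Set<String> l1 = Set.of("fire", "truck", "car");
        Set<String> l2 = Set.of("car", "dog");
        Set<String> concat = new TreeSet<>();   // L1 L2 = { xy | x in L1, y in L2 }
        for (String x : l1)
            for (String y : l2)
                concat.add(x + y);
        System.out.println(concat);
        // prints [carcar, cardog, firecar, firedog, truckcar, truckdog]
    }
}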

01-20: Kleene Closure Given a formal language L: L^0 = {ε}, L^1 = L, L^2 = LL, L^3 = LLL, L^4 = LLLL, ... L* = L^0 ∪ L^1 ∪ L^2 ∪ ... ∪ L^n ∪ ...

01-21: Regular Expressions Regular expressions are used to describe formal languages over an alphabet Σ:
Regular Expression    Language
ε                     L[ε] = {ε}
a ∈ Σ                 L[a] = {a}
(MR)                  L[MR] = L[M]L[R]
(M | R)               L[(M | R)] = L[M] ∪ L[R]
(M*)                  L[(M*)] = L[M]*

01-22: r.e. Precedence From highest to lowest: Kleene closure (*), concatenation, alternation (|). For example, ab*c | e = (a(b*)c) | e

01-23: Regular Expression Examples
all strings over {a,b}
binary integers (with leading zeroes)
all strings over {a,b} that begin and end with a
all strings over {a,b} that contain aa
all strings over {a,b} that do not contain aa

01-24: Regular Expression Examples
all strings over {a,b}: (a|b)*
binary integers (with leading zeroes): (0|1)(0|1)*
all strings over {a,b} that begin and end with a: a(a|b)*a
all strings over {a,b} that contain aa: (a|b)*aa(a|b)*
all strings over {a,b} that do not contain aa: b*(abb*)*(a|ε)

01-25: Reg. Exp. Shorthand
[a,b,c,d] = (a|b|c|d)
[d-g] = [d,e,f,g] = (d|e|f|g)
[d-f,m-o] = [d,e,f,m,n,o] = (d|e|f|m|n|o)
(α)? = optionally α (i.e., (α|ε))
(α)+ = α(α)*

01-26: Regular Expressions & Unix Many unix tools use regular expressions Example: grep <reg exp> filename Prints all lines that contain a match to the regular expression Special characters: ^ beginning of line $ end of line (grep examples on other screen)
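For instance (these are typical grep invocations, assuming some C source file foo.c; they are not taken from the slides):
% grep 'print' foo.c      prints every line containing print
% grep '^void' foo.c      prints every line beginning with void
% grep ';$' foo.c         prints every line ending with a semicolon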

01-27: JavaCC Regular Expressions All characters & strings must be in quotation marks: "else", "+", ("a" | "b"). All regular expressions involving * must be parenthesized: ("a")*, not "a"*

01-28: JavaCC Shorthand
["a","b","c","d"] = ("a" | "b" | "c" | "d")
["d"-"g"] = ["d","e","f","g"] = ("d" | "e" | "f" | "g")
["d"-"f","m"-"o"] = ["d","e","f","m","n","o"] = ("d" | "e" | "f" | "m" | "n" | "o")
(α)? = optionally α (i.e., (α | ε))
(α)+ = α(α)*
(~["a","b"]) = any character except a or b. ~ can only be used with the [] notation: ~(a(a|b)*b) is not legal

01-29: r.e. Shorthand Examples Give regular expressions for the following languages: {if}, the set of legal identifiers, the set of integer literals (leading zeroes allowed), and the set of real literals.

01-30: r.e. Shorthand Examples
Regular Expression                                              Language
"if"                                                            {if}
["a"-"z"](["0"-"9","a"-"z"])*                                   Set of legal identifiers
(["0"-"9"])+                                                    Set of integer literals (leading zeroes allowed)
(["0"-"9"]+"."(["0"-"9"]*)) | ((["0"-"9"])*"."["0"-"9"]+)       Set of real literals
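For intuition, the same token shapes can be checked with java.util.regex (a sketch only; the JavaCC notation used in this course is shown on the following slides, and the class name here is made up):

import java.util.regex.Pattern;

public class TokenShapes {
    public static void main(String[] args) {
        Pattern identifier  = Pattern.compile("[a-z][0-9a-z]*");
        Pattern intLiteral  = Pattern.compile("[0-9]+");
        // real literal: digits on at least one side of the decimal point
        Pattern realLiteral = Pattern.compile("[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+");

        System.out.println(identifier.matcher("foo42").matches());   // true
        System.out.println(intLiteral.matcher("007").matches());     // true
        System.out.println(realLiteral.matcher("3.").matches());     // true
        System.out.println(realLiteral.matcher("x.5").matches());    // false
    }
}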

01-31: Lexical Analyzer Generator JavaCC is a Lexical Analyzer Generator and a Parser Generator. Input: a set of regular expressions (each of which describes a type of token in the language). Output: a lexical analyzer, which reads an input file and separates it into tokens.

01-32: Structure of a JavaCC file
options {
    /* Code to set various option flags */
}

PARSER_BEGIN(foo)
public class foo {
    /* This segment is often empty */
}
PARSER_END(foo)

TOKEN_MGR_DECLS : {
    /* Declarations used by the lexical analyzer */
}

/* Token Rules & Actions */

01-33: Token Rules in JavaCC Tokens are described by rules with the following syntax:
TOKEN : { <TOKEN_NAME: RegularExpression> }
TOKEN_NAME is the name of the token being described; RegularExpression is a regular expression that describes the token.

01-34: Token Rules in JavaCC Token rule examples:
TOKEN : { <ELSE: "else"> }
TOKEN : { <INTEGER_LITERAL: (["0"-"9"])+> }

01-35: Token Rules in JavaCC Several different tokens can be described in the same TOKEN block, with token descriptions separated by |.
TOKEN : {
    <ELSE: "else">
  | <INTEGER_LITERAL: (["0"-"9"])+>
  | <SEMICOLON: ";">
}

01-36: getNextToken When we run javacc on the input file foo.jj, it creates the class fooTokenManager. The class fooTokenManager contains the static method getNextToken(). Every call to getNextToken() returns the next token in the input stream.

01-37: getNextToken When getNextToken is called, a regular expression is found that matches the next characters in the input stream. What if more than one regular expression matches?
TOKEN : {
    <ELSE: "else">
  | <IDENTIFIER: (["a"-"z"])+>
}

01-38: getNextToken When more than one regular expression matches the input stream: Use the longest match: elsed should match IDENTIFIER, not ELSE followed by the identifier d. If two matches have the same length, use the rule that appears first in the .jj file: else should match ELSE, not IDENTIFIER.

01-39: JavaCC Example
PARSER_BEGIN(simple)
public class simple {
}
PARSER_END(simple)

TOKEN : {
    <ELSE: "else">
  | <SEMICOLON: ";">
  | <FOR: "for">
  | <INTEGER_LITERAL: (["0"-"9"])+>
  | <IDENTIFIER: ["a"-"z"](["a"-"z","0"-"9"])*>
}

Example input: else;ford for

01-40: SKIP Rules Tell JavaCC what to ignore (typically whitespace) using SKIP rules. A SKIP rule is just like a TOKEN rule, except that no token is returned.
SKIP : {
    < regularexpression1 >
  | < regularexpression2 >
  | ...
  | < regularexpressionN >
}

01-41: Example SKIP Rules
PARSER_BEGIN(simple2)
public class simple2 {
}
PARSER_END(simple2)

SKIP : {
    < " " >
  | < "\n" >
  | < "\t" >
}

TOKEN : {
    <ELSE: "else">
  | <SEMICOLON: ";">
  | <FOR: "for">
  | <INTEGER_LITERAL: (["0"-"9"])+>
  | <IDENTIFIER: ["A"-"Z"](["A"-"Z","0"-"9"])*>
}

01-42: JavaCC States Comments can be dealt with using SKIP rules. How could we skip over 1-line C++ style comments? // This is a comment

01-43: JavaCC States Comments can be dealt with using SKIP rules. Skipping 1-line C++ style comments (// This is a comment) with a SKIP rule:
SKIP : { < "//" (~["\n"])* "\n" > }

01-44: JavaCC States Writing a regular expression to match multi-line comments (using /* and */) is much more difficult. Writing a regular expression to match nested comments is impossible (take Automata Theory for a proof :) ). What can we do? Use JavaCC States.

01-45: JavaCC States We can label each TOKEN and SKIP rule with a state. Unlabeled TOKEN and SKIP rules are assumed to be in the default state (named DEFAULT, unsurprisingly enough). We can switch to a new state after matching a TOKEN or SKIP rule using the : NEWSTATE notation.

01-46: JavaCC States
SKIP : {
    < " " >
  | < "\n" >
  | < "\t" >
}

SKIP : { < "/*" > : IN_COMMENT }

<IN_COMMENT> SKIP : {
    < "*/" > : DEFAULT
  | < ~[] >
}

TOKEN : {
    <ELSE: "else">
    ... (etc.)
}

01-47: Actions in TOKEN & SKIP We can add Java code to any SKIP or TOKEN rule. That code will be executed when the SKIP or TOKEN rule is matched. Any methods / variables defined in the TOKEN_MGR_DECLS section can be used by these actions.

01-48: Actions in TOKEN & SKIP
PARSER_BEGIN(remComments)
public class remComments {
}
PARSER_END(remComments)

TOKEN_MGR_DECLS : {
    public static int numComments = 0;
}

SKIP : { < "/*" > : IN_COMMENT }

SKIP : { < "//" (~["\n"])* "\n" > { numComments++; } }

01-49: Actions in TOKEN & SKIP
<IN_COMMENT> SKIP : { < "*/" > { numComments++; SwitchTo(DEFAULT); } }

<IN_COMMENT> SKIP : { < ~[] > }

TOKEN : { <ANY: ~[]> }

01-50: Tokens Each call to getNextToken returns a Token object. The Token class is automatically created by javacc. Variables of type Token contain the following public variables: public int kind; (the type of the token). When javacc is run on the file foo.jj, a file fooConstants.java is created, which contains the symbolic names for each token constant:
public interface simplejavaConstants {
    int EOF = 0;
    int CLASS = 8;
    int DO = 9;
    int ELSE = 10;
    ...
}

01-51: Tokens Each call to getNextToken returns a Token object. The Token class is automatically created by javacc. Variables of type Token contain the following public variables: public int beginLine, beginColumn, endLine, endColumn; (the location of the token in the input file).

01-52: Tokens Each call to getNextToken returns a Token object. The Token class is automatically created by javacc. Variables of type Token contain the following public variables: public String image; (the text that was matched to create the token).

01-53: Generated TokenManager
class TokenTest {
    public static void main(String args[]) {
        Token t;
        java.io.InputStream inFile;
        sjavaTokenManager tm;
        boolean loop = true;

        if (args.length < 1) {
            System.out.print("Enter filename as command line argument");
            return;
        }
        try {
            inFile = new java.io.FileInputStream(args[0]);
        } catch (java.io.FileNotFoundException e) {
            System.out.println("File " + args[0] + " not found.");
            return;
        }
        tm = new sjavaTokenManager(new SimpleCharStream(inFile));

01-54: Generated TokenManager
        t = tm.getNextToken();
        while (t.kind != sjavaConstants.EOF) {
            System.out.println("Token : " + t + " : ");
            System.out.println(sjavaConstants.tokenImage[t.kind]);
            t = tm.getNextToken();
        }
    }
}

01-55: Lexer Project Write a .jj file for simplejava tokens. Need to handle all whitespace (tabs, spaces, end-of-line). Need to handle nested comments (to an arbitrary nesting level); a sketch of one approach follows.
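One common way to handle arbitrary nesting is to combine JavaCC states with a depth counter declared in TOKEN_MGR_DECLS. The fragment below is only a sketch of that idea (the name commentDepth is made up, and the rules would need to be merged into your own .jj file):

TOKEN_MGR_DECLS : {
    public static int commentDepth = 0;
}

SKIP : { < "/*" > { commentDepth = 1; } : IN_COMMENT }

<IN_COMMENT> SKIP : {
    < "/*" > { commentDepth++; }            /* a nested comment opens: one level deeper */
  | < "*/" > { commentDepth--;              /* a comment closes: maybe back to DEFAULT  */
               if (commentDepth == 0) SwitchTo(DEFAULT); }
  | < ~[] >                                 /* skip everything else inside the comment  */
}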

01-56: Project Details JavaCC is available at https://javacc.dev.java.net/
To compile your project:
% javacc simplejava.jj
% javac *.java
To test your project:
% java TokenTest <test filename>
To submit your program, create a branch: https://www.cs.usfca.edu/svn/<username>/cs414/lexer/