CMSC 350: COMPILER DESIGN


Lecture 11

LLVMLITE SPECIFICATION (see HW3)

Discussion: Defining a Language
Premise: programming languages are purely formal objects. We (as language designers) get to determine the meaning of the language constructs.
Question: How do we specify that meaning?
Question: What are the properties of a good specification? Examples?

Approaches to Language Specification
- Implementation: it does what it does!
- Social: an authority figure says "it means X"; English prose
- Technological: multiple implementations; a reference interpreter; test cases / examples
- Translation: semantics given in terms of a (hopefully better specified) target
- Mathematical: informal specifications; formal specifications

Less formal techniques may miss problems in programs; more formal techniques eliminate, with certainty, as many problems as possible. This isn't a tradeoff: all of these methods should be used. Even the most formal can still have holes: Did you prove the right thing? Do your assumptions match reality? As Knuth put it: "Beware of bugs in the above code; I have only proved it correct, not tried it."

LLVMlite Notes
Real LLVM requires that constants appearing in getelementptr be declared with type i32:

  %struct = type { i64, [5 x i64], i64 }
  @gbl = global %struct { i64 1, [5 x i64] [i64 2, i64 3, i64 4, i64 5, i64 6], i64 7 }

  define void @foo() {
    %1 = getelementptr %struct* @gbl, i32 0, i32 0
  }

LLVMlite ignores the i32 annotation and treats these as i16 values; we keep the i32 annotation in the syntax to retain compatibility with LLVM syntax.

COMPILING LLVMLITE TO HERA

Compiling LLVMlite Types to HERA
- i1, i16, t* = one word (16 bits)
- Array and struct types are laid out sequentially in memory
- getelementptr computations must be relative to the LLVMlite size definitions, i.e. i1 = one word
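As a hedged illustration of the sequential layout (assuming, purely for the sake of the arithmetic, that each scalar field of the earlier %struct occupies one word under the LLVMlite size definitions): the index path i32 0, i32 2 into %struct = type { i64, [5 x i64], i64 } would denote an offset of 1 + 5 = 6 words from the base address, one word for the first field plus five for the array elements laid out after it.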

Compiling LLVM Locals
How do we manage storage for each %uid defined by an LLVM instruction?
- Option 1: Map each %uid to a HERA register. Efficient! But difficult to do effectively: many %uid values, only 16 registers.
- Option 2: Map each %uid to a stack-allocated space. Less efficient, but simple to implement.
For HW3 we will follow Option 2 (a sketch of the slot assignment appears below).
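As a rough illustration of Option 2, the following Haskell sketch assigns each %uid a distinct word-sized stack slot; the names (Uid, allocateSlots) and the offset scheme are hypothetical, not part of the assignment's actual interface.

  import qualified Data.Map as Map

  type Uid = String

  -- Assign each locally defined %uid a distinct frame offset,
  -- one word apart (offsets here are in words, counting from 0).
  allocateSlots :: [Uid] -> Map.Map Uid Int
  allocateSlots uids = Map.fromList (zip uids [0 ..])

  -- Example: allocateSlots ["%1", "%cnd", "%tmp"]
  --   ==> fromList [("%1",0), ("%cnd",1), ("%tmp",2)]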

Other LLVMlite Features
- Globals: must use data labels
- Calls: follow the HERA calling convention from section 7.5 of the HERA specification
- getelementptr: the trickiest part

TOUR OF ASSIGNMENT 3

LEXING
Lexical analysis, tokens, regular expressions, automata

Compilation in a Nutshell
[Figure: the compiler pipeline: Lexical Analysis → Parsing → Analysis & Transformation → Backend]
Source code (character stream):

  if (b == 0) { a = 1; }

Token stream:

  if ( b == 0 ) { a = 1 ; }

Abstract syntax tree: If (Eq b 0) (Assn a 1) None
Intermediate code:

  l1: %cnd = icmp eq i64 %b, 0
      br i1 %cnd, label %l2, label %l3
  l2: store i64* %a, 1
      br label %l3
  l3:

Assembly code:

  l1: cmpq %eax, $0
      jeq l2
      jmp l3
  l2: ...

Today: Lexing
[Same pipeline figure as the previous slide, with the first stage, Lexical Analysis, highlighted: today we turn the source character stream into a token stream.]

First Step: Lexical Analysis
Change the character stream

  if (b == 0) { a = 0; }

into tokens:

  IF; LPAREN; Ident("b"); EQEQ; Int(0); RPAREN; LBRACE; Ident("a"); EQ; Int(0); SEMI; RBRACE

A token is a data type that represents an indivisible chunk of text:
- Identifiers: a y11 elsex _100
- Keywords: if else while
- Integers: 2 200 -500 5L
- Floating point: 2.0 .02 1e5
- Symbols: + * ` { } ( ) ++ << >> >>>
- Strings: "x" "He said, \"Are you?\""
- Comments: (* CMSC350: HW01 *) /* foo */
Tokens are often delimited by whitespace (' ', '\t', etc.), but in some languages (e.g. Python or Haskell) whitespace is significant.
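In Haskell (which the course's alex-based homework lexers use), such a token type might look like the following minimal sketch; the constructors mirror the slide, except that EQ and Int are renamed (Assign, IntLit) to avoid clashing with Prelude names.

  data Token
    = IF | LPAREN | RPAREN | LBRACE | RBRACE
    | EQEQ            -- the == symbol
    | Assign          -- the = symbol (the slide's EQ)
    | SEMI
    | Ident String    -- identifiers carry their text
    | IntLit Integer  -- integer literals carry their value
    deriving (Eq, Show)

  -- The character stream above scans to:
  -- [IF, LPAREN, Ident "b", EQEQ, IntLit 0, RPAREN,
  --  LBRACE, Ident "a", Assign, IntLit 0, SEMI, RBRACE]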

How hard can it be? Writing Java lexers. (Example: CS245, HW03)

Lexing By Hand
How hard can it be? Tedious and painful!
Problems:
- Precisely defining tokens
- Matching several tokens simultaneously
- Reading too much input (need look-ahead)
- Error handling
- Hand-written tokenizer code is hard to compose/interleave
- Hard to maintain

PRINCIPLED SOLUTION TO LEXING

Regular Expressions
Regular expressions precisely describe sets of strings. A regular expression R has one of the following forms:
- ε           epsilon, stands for the empty string
- 'a'         an ordinary character, stands for itself
- R1 | R2     alternation, stands for a choice of R1 or R2
- R1 R2       concatenation, stands for R1 followed by R2
- R*          Kleene star, stands for zero or more repetitions of R
Useful extensions:
- "foo"       strings, equivalent to 'f' 'o' 'o'
- R+          one or more repetitions of R, equivalent to R R*
- R?          zero or one occurrence of R, equivalent to (ε | R)
- ['a'-'z']   one of a or b or ... or z, equivalent to (a | b | ... | z)
- [^'0'-'9']  any character except 0 through 9
- R as x      name the string matched by R as x

Example Regular Expressions
- Recognize the keyword if:  if
- Recognize a digit:  ['0'-'9']
- Recognize an integer literal:  '-'?['0'-'9']+
- Recognize an identifier:  (['a'-'z'] | ['A'-'Z'])(['0'-'9'] | '_' | ['a'-'z'] | ['A'-'Z'])*
In practice, it's useful to be able to name regular expressions:

  @lowercase = [a-z]
  @uppercase = [A-Z]
  @character = @uppercase | @lowercase

How to Match?
Consider the input string: ifx = 0
- Could lex as: if x = 0
- Or as: ifx = 0
Regular expressions alone are ambiguous; we need a rule for choosing between the options above.
- Most languages choose the longest match, so the second option above will be picked. Note that only the first option is correct for parsing purposes.
Conflicts arise when two regular expressions have a non-empty intersection. Ties are broken by giving some matches higher priority; for example, keywords have priority over identifiers. Priority is usually specified by the order in which the rules appear in the lex input file.

Lexer Generators
A lexer generator reads a list of regular expressions R1, ..., Rn, one per token. Each token has an attached action Ai (just a piece of code to run when the regular expression is matched):

  "-"? $digit+                                { lexint }
  "+"                                         { token PLUS }
  "if"                                        { token IF }
  @character (@digit | @character | '_')*     { identifier }
  $whitespace+                                ;

It generates scanning code that:
1. Decides whether the input is of the form (R1 | ... | Rn)*
2. Whenever the scanner matches a (longest) token, runs the associated action

DEMO: ALEX (Java.x)
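For a feel of what an alex input file looks like, here is a minimal self-contained sketch in the same style; it is illustrative, not the course's Java.x, and the module and token names are made up.

  {
  module MiniLexer where
  }

  %wrapper "basic"

  $digit = 0-9
  $alpha = [a-zA-Z]

  tokens :-
    -- skip whitespace
    $white+                     ;
    -- keyword beats identifier because it is listed first
    "if"                        { \_ -> KwIf }
    $alpha [$alpha $digit \_]*  { \s -> Ident s }
    "-"? $digit+                { \s -> IntLit (read s) }

  {
  data Token = KwIf | Ident String | IntLit Integer
    deriving (Eq, Show)
  -- alex's "basic" wrapper provides: alexScanTokens :: String -> [Token]
  }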

Implementation Strategies
Most tools (lex, alex, ocamllex, flex, etc.):
- Table-based deterministic finite automata (DFAs)
- Goal: efficient, compact representation, high performance
Other approaches: Brzozowski derivatives
- Idea: directly manipulate the (abstract syntax of the) regular expression
- Compute partial derivatives: the regular expression that is left over after seeing the next character
- Elegant, purely functional implementation (very cool!); a small sketch appears below
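The following Haskell sketch of Brzozowski derivatives is a minimal rendering of the idea, not code from the course: the derivative of R with respect to a character c is the regular expression matching what remains of R's strings after consuming c.

  data RE = Empty          -- matches nothing
          | Eps            -- matches the empty string
          | Chr Char       -- matches one character
          | Alt RE RE      -- R1 | R2
          | Seq RE RE      -- R1 R2
          | Star RE        -- R*

  -- Does the RE match the empty string?
  nullable :: RE -> Bool
  nullable Empty     = False
  nullable Eps       = True
  nullable (Chr _)   = False
  nullable (Alt r s) = nullable r || nullable s
  nullable (Seq r s) = nullable r && nullable s
  nullable (Star _)  = True

  -- The regular expression left over after consuming character c
  deriv :: Char -> RE -> RE
  deriv _ Empty     = Empty
  deriv _ Eps       = Empty
  deriv c (Chr d)   = if c == d then Eps else Empty
  deriv c (Alt r s) = Alt (deriv c r) (deriv c s)
  deriv c (Seq r s)
    | nullable r    = Alt (Seq (deriv c r) s) (deriv c s)
    | otherwise     = Seq (deriv c r) s
  deriv c (Star r)  = Seq (deriv c r) (Star r)

  -- A string matches R iff the RE left after consuming it is nullable
  matches :: RE -> String -> Bool
  matches r = nullable . foldl (flip deriv) r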

Finite Automata
Consider the regular expression for string literals: '"' [^'"']* '"'
An automaton (DFA) can be represented as a graph, or as a transition table:

  state | '"'   | non-'"'
  ------+-------+--------
    0   | 1     | ERROR
    1   | 2     | 1
    2   | ERROR | ERROR

State 0 is the start state and state 2 is the accepting state; a table-driven matcher for this DFA is sketched below.
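As a small illustration of the table-driven idea, here is a Haskell sketch that hard-codes the table above; the names are hypothetical, and -1 plays the role of ERROR.

  -- One row of the table per state; -1 stands for ERROR.
  step :: Int -> Char -> Int
  step 0 '"' = 1
  step 0 _   = -1
  step 1 '"' = 2
  step 1 _   = 1    -- stay in state 1 on any non-quote character
  step _ _   = -1   -- state 2 (and ERROR) have no outgoing moves

  -- Run the DFA from state 0; accept iff we end in state 2.
  accepts :: String -> Bool
  accepts = go 0
    where
      go s    []     = s == 2
      go (-1) _      = False
      go s    (c:cs) = go (step s c) cs

  -- Example: accepts "\"hello\"" == True, accepts "\"oops" == False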

RE to Finite Automaton?
Can we build a finite automaton for every regular expression? Yes! (Take CMSC 345 for the complete theory.)
Strategy: consider every possible regular expression, by induction on the structure of regular expressions.
- The base cases ε and 'a' are trivial one- and two-state automata.
- What about R1 R2, R1 | R2, and R*?

Nondeterministic Finite Automata
An NFA has:
- A finite set of states, a start state, and accepting state(s)
- Transition arrows connecting states, labeled by input symbols or by ε (which does not consume input)
Nondeterministic: two arrows leaving the same state may have the same label.
[Figure: an example NFA with a, b, and ε transitions.]

RE to NFA?
Converting regular expressions to NFAs is easy. Assume each NFA has one start state and a unique accept state.
- A single character 'a': one arrow labeled a from the start state to the accept state.
- Concatenation R1 R2: connect the accept state of R1's NFA to the start state of R2's NFA with an ε-transition.
[Figure: the NFAs for 'a' and for R1 R2.]

RE to NFA (cont'd)
Alternation and Kleene star are easy with NFAs:
- R1 | R2: a new start state with ε-transitions into the NFAs for R1 and R2, and ε-transitions from their accept states into a new accept state.
- R*: ε-transitions allow skipping R entirely (zero repetitions) and looping from R's accept state back to its start (further repetitions).
[Figure: the NFA constructions for R1 | R2 and R*.]

DFA versus NFA
DFA:
- The action of the automaton on each input is fully determined
- The automaton accepts if the input is consumed upon reaching an accepting state
- Obvious table-based implementation
NFA:
- The automaton potentially has a choice at every step
- The automaton accepts an input string if there exists some way to reach an accepting state
- Less obvious how to implement efficiently

NFA to DFA Conversion (Intuition)
Idea: run all possible executions of the NFA "in parallel", keeping track of the set of states the NFA could be in ("finite fingers").
Consider -?[0-9]+, with NFA states 0 through 3: an optional '-' edge and an ε edge from state 0 to state 1, a [0-9] edge from 1 to 2, an ε edge from 2 to the accept state 3, and a [0-9] loop for further digits.
The DFA's states are sets of NFA states:
- the start state is {0,1}
- on '-', {0,1} goes to {1}
- on [0-9], both {0,1} and {1} go to the accepting state {2,3}, which loops to itself on [0-9]
A sketch of the underlying subset construction appears below.
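Here is a minimal Haskell sketch of the subset construction's two core operations; the representation of the NFA via eps and move functions is hypothetical.

  import qualified Data.Set as Set
  import Data.Set (Set)

  type State = Int

  -- All states reachable from ss via zero or more epsilon-moves.
  epsClosure :: (State -> [State]) -> Set State -> Set State
  epsClosure eps ss
    | ss' == ss = ss
    | otherwise = epsClosure eps ss'
    where
      ss' = Set.union ss (Set.fromList (concatMap eps (Set.toList ss)))

  -- One DFA step: take every labeled NFA move from every current
  -- state, then close under epsilon-moves.
  dfaStep :: (State -> [State])          -- epsilon-moves of the NFA
          -> (State -> Char -> [State])  -- labeled moves of the NFA
          -> Set State -> Char -> Set State
  dfaStep eps move ss c =
    epsClosure eps (Set.fromList (concatMap (\s -> move s c) (Set.toList ss)))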

Summary of Lexer Generator Behavior
- Take each regular expression Ri and its action Ai.
- Compute the NFA formed by (R1 | R2 | ... | Rn); remember the actions associated with the accepting states of each Ri.
- Compute the DFA for this big NFA. There may be multiple accept states (why?), and a single accept state may correspond to one or more actions (why?).
- Compute the minimal equivalent DFA (there is a standard algorithm due to Myhill & Nerode).
- Produce the transition table.
- Implement longest match (sketched below):
  - Start from the initial state
  - Follow transitions, remembering the last accept state entered (if any)
  - Accept input until no transition is possible (i.e., the next state is ERROR)
  - Perform the highest-priority action associated with the last accept state; if there was no accept state, report a lexing error
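A minimal Haskell sketch of the longest-match loop, assuming a table-driven DFA given by a transition function and an accept predicate (both hypothetical; -1 again stands for ERROR):

  -- Returns the longest accepted prefix and the remaining input,
  -- or Nothing if no prefix is accepted (a lexing error).
  longestMatch :: (Int -> Char -> Int)  -- DFA transition function
               -> (Int -> Bool)         -- is this state accepting?
               -> String
               -> Maybe (String, String)
  longestMatch step accept = go 0 [] Nothing
    where
      go _ _ best [] = best
      go s seen best (c:cs)
        | s' == -1  = best               -- no transition: stop, keep best match
        | accept s' = go s' seen' (Just (reverse seen', cs)) cs
        | otherwise = go s' seen' best cs
        where
          s'    = step s c
          seen' = c : seen

  -- A full scanner would call longestMatch repeatedly, running the
  -- highest-priority action for each accepted lexeme.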

Lexer Generators in Practice
Many existing implementations: lex, flex, JLex, ocamllex, alex, ...
For example alex programs, see Java.x, HERA/Lexer.x (HW02), and LLVM/Lexer.x (HW03) on the course website.
Error reporting:
- Associate a line number / character position with each token
- Use a rule to recognize '\n' and increment the line number
- The lexer generator itself usually provides character-position information
Sometimes it is useful to treat comments specially; for nested comments, keep track of the nesting depth.
Lexer generators are usually designed to work closely with parser generators.