CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Transcription:

CS 1622 Lecture 2: Lexical Analysis

Lecture 2
- Review of last lecture and finish up overview
- The first compiler phase: lexical analysis
- Reading: Chapter 2 in text (by 1/18)

The Front End
Source code -> Scanner -> tokens -> Parser
Responsibilities:
- Recognize legal and illegal programs
- Report errors meaningfully
- Produce an initial storage map
- Shape the code for the back end
Typically automatically constructed:
- From a lexical specification
- Based on finite automata (meet theory)
- Very well understood

Source code -> Scanner -> tokens -> Parser
Scanner:
- Maps characters into tokens, the basic lexical units
    x = y + z   becomes   <id> <assign> <id> <binop> <id>
- Lexeme = string that matches the token
    x, y, and z are lexemes that match <id>
- Some tokens have attributes: <id, x> or <binop, plus>
- Eliminates whitespace
- In some languages performs preprocessing (in C this is done by the preprocessor)

Source code -> Scanner -> tokens -> Parser
Parser:
- Recognizes syntactic structure & errors
- Directs semantic analysis (type checking)
- Builds an intermediate representation (IR) for the source program
- For some languages (more precisely: grammars) can be easily built by hand
- More flexible: use parser generators
    Can change the language more easily
- Typically very fast
- Well-understood theory (push-down automata)

Grammars
- A concise and precise way to specify languages
- For context-free grammars (CFGs) we can build efficient parsers
- Can typically write a CFG for a programming language
- Tool of choice for specifying syntactic structure

Grammars
Formally, a grammar G = (S, N, T, P):
- S is the start symbol
- N is a set of non-terminal symbols
- T is a set of terminal symbols or words
- P is a set of productions or rewrite rules (P : N -> (N ∪ T)*)

CFG Example
    1. goal -> expr
    2. expr -> expr op term
    3.       | term
    4. term -> number
    5.       | id
    6. op   -> +
    7.       | -

    S = goal
    T = { number, id, +, - }
    N = { goal, expr, term, op }
    P = { 1, 2, 3, 4, 5, 6, 7 }

Deriving Sentences
    Production   Result
                 goal
    1            expr
    2            expr op term
    5            expr op y
    7            expr - y
    2            expr op term - y
    4            expr op 2 - y
    6            expr + 2 - y
    3            term + 2 - y
    5            x + 2 - y

To recognize a valid sentence for some CFG, we reverse this process and build up a parse.
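The derivation table above can be replayed mechanically. The following is a minimal Python sketch, not part of the original slides: it stores the seven productions under their slide numbers and rewrites the rightmost non-terminal at each step, which is the order the table follows. The names `PRODUCTIONS`, `NONTERMINALS`, and `derive` are illustrative, and the terminals stay as `id` and `number` rather than the lexemes x, 2, and y.

```python
# Productions of the example grammar, keyed by the rule numbers on the slide.
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op", ["+"]),
    7: ("op", ["-"]),
}
NONTERMINALS = {"goal", "expr", "term", "op"}


def derive(rule_numbers):
    """Apply each rule to the rightmost non-terminal; return all sentential forms."""
    form = ["goal"]
    steps = [" ".join(form)]
    for n in rule_numbers:
        lhs, rhs = PRODUCTIONS[n]
        # Position of the rightmost non-terminal in the current sentential form.
        i = max(k for k, sym in enumerate(form) if sym in NONTERMINALS)
        if form[i] != lhs:
            raise ValueError(f"rule {n} rewrites {lhs}, but rightmost non-terminal is {form[i]}")
        form[i:i + 1] = rhs
        steps.append(" ".join(form))
    return steps


if __name__ == "__main__":
    for line in derive([1, 2, 5, 7, 2, 4, 6, 3, 5]):
        print(line)
```

Running the rule sequence 1, 2, 5, 7, 2, 4, 6, 3, 5 from the table ends in `id + number - id`, the token-level form of x + 2 - y.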

Parse Tree
x + 2 - y

    goal
      expr
        expr
          expr
            term
              <id,x>
          op
            +
          term
            <number,2>
        op
          -
        term
          <id,y>

Lots of superfluous detail.

    1. goal -> expr
    2. expr -> expr op term
    3.       | term
    4. term -> number
    5.       | id
    6. op   -> +
    7.       | -

Abstract Syntax Tree (AST)

    -
      +
        <id,x>
        <number,2>
      <id,y>

- The AST summarizes grammatical structure, without including detail about the derivation
- This is much more concise
- ASTs are one form of intermediate representation (IR)

The Back End - instruction selection
Instruction Selection -> Instruction Scheduling -> Register Allocation -> Machine code
Responsibilities:
- Translates IR to target code
- Selects target instructions for IR operations (trivial for RISC)
- Allocates machine resources (registers, memory)
Typically implemented manually:
- For CISC, some automated pattern-matching approaches
- Lots of hand-crafting done for good back ends; you must know the target architecture well!

Back end - instruction scheduling
Instruction Selection -> Instruction Scheduling -> Register Allocation -> Machine code
- Avoid hardware stalls and interlocks
- Use all functional units productively
- Can increase lifetime of variables
- Optimal scheduling is NP-Complete in nearly all cases, but good heuristic techniques are well understood

Back end - register allocation
Instruction Selection -> Instruction Scheduling -> Register Allocation -> Machine code
- Have each value in a register when it is used
- Manage a limited set of resources
- Can change instruction choices & insert LOADs & STOREs
- Optimal allocation is NP-Complete; compilers approximate it

Traditional Three-pass Compiler
Source Code -> Front End -> Middle End -> Back End -> Machine code
Middle End:
- Analyzes and rewrites (or transforms) the IR
- Primary goal is to reduce running time of the compiled code
- May also improve space, power consumption, ...
- Must preserve meaning of the code

The Optimizer
Opt 1 -> Opt 2 -> Opt 3 -> ... -> Opt n
- Discover & propagate some constant value
- Move a computation to a less frequently executed place
- Specialize some computation based on context
- Discover a redundant computation & remove it
- Remove useless or unreachable code
- Encode an idiom in some particularly efficient form

The Scanner: Overview
Task: translate the sequence of characters to a corresponding sequence of tokens
- essentially grouping characters into words
- removing irrelevant characters, e.g., white space
Each time the scanner is called, it should find the longest sequence of characters in the input, starting with the current character, that corresponds to a token, and return that token.

How to write a scanner?
- write it from scratch, or
- automatically generate it with a scanner generator: lex or flex (produce C code), or jlex (produces Java code)
- input to a scanner generator: one regular expression for each token
- output of a scanner generator: a finite state machine
- so, you need to understand: regular expressions and finite automata
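The longest-match rule from the scanner overview can be sketched as follows. This is a hand-written illustration in Python, not the output of lex, flex, or jlex; the token names and patterns in `TOKEN_SPECS` are assumptions chosen to echo the `<id>`, `<assign>`, and `<binop>` tokens used earlier, plus a hypothetical `eq` token so that `=` and `==` compete.

```python
import re

# Hypothetical token set for illustration: one regular expression per token.
TOKEN_SPECS = [
    ("assign", re.compile(r"=")),
    ("eq", re.compile(r"==")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("number", re.compile(r"[0-9]+")),
    ("binop", re.compile(r"[+\-]")),
]
WHITESPACE = re.compile(r"[ \t\n]+")


def next_token(text, pos):
    """Return (kind, lexeme, new_pos) for the longest token at pos, or None at end of input."""
    ws = WHITESPACE.match(text, pos)
    if ws:
        pos = ws.end()  # white space is recognized and discarded
    if pos >= len(text):
        return None
    best = None
    for kind, pattern in TOKEN_SPECS:
        m = pattern.match(text, pos)
        # Keep the candidate that consumes the most characters.
        if m and (best is None or m.end() > best[2]):
            best = (kind, m.group(), m.end())
    if best is None:
        raise ValueError(f"no token matches at position {pos}: {text[pos]!r}")
    return best


def scan(text):
    """Call the scanner repeatedly until the input is exhausted."""
    tokens, pos = [], 0
    while True:
        tok = next_token(text, pos)
        if tok is None:
            return tokens
        kind, lexeme, pos = tok
        tokens.append((kind, lexeme))
```

On `x1==27` the sketch returns `eq` for `==` instead of two `assign` tokens, because the two-character match is longer. A generated scanner reaches the same result by running a single DFA rather than trying each pattern in turn.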

Lexical analyzers
Goals:
- To simplify specification & implementation of scanners
- To understand the underlying techniques and technologies
Diagram: source code -> Scanner -> parts of speech; specifications -> Scanner Generator -> tables or code (used by the Scanner)

Regular Expressions to Finite Automata
Generating a scanner:
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA

Recognizing words
Example - begin
    s0 --b--> s1 --e--> s2 --g--> s3 --i--> s4 --n--> s5

    c = next char; if c != 'b' then error;
    c = next char; if c != 'e' then error;
    c = next char; if c != 'g' then error;
    ...

Transition diagrams
- serve as abstractions for code that would be written
- finite automata
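The hand-written checks for begin above follow one pattern per state, so they can be folded into a loop. A minimal Python sketch, assuming the recognizer only needs to confirm that the next five characters spell begin (the function name `recognize_begin` is illustrative):

```python
def recognize_begin(text):
    """Return True if the first five characters of text spell 'begin'."""
    chars = iter(text)
    state = 0                      # s0 is the start state
    for expected in "begin":       # one transition per expected character
        c = next(chars, None)      # c = next char (None at end of input)
        if c != expected:
            return False           # error: no transition on this character
        state += 1                 # s0 -> s1 -> ... -> s5
    return state == 5              # s5 is the accepting state
```

Note that this accepts `beginning` as well, because it stops reading after five characters. Telling the keyword apart from a longer identifier is the job of the longest-match rule in the scanner.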

Finite Automata
- A compiler recognizes legal programs in some (source) language.
- A finite-state machine recognizes legal strings in some language.
- Example: identifiers, i.e. sequences of one or more letters or digits, starting with a letter:
    S --letter--> A
    A --letter--> A
    A --digit--> A
  (S is the start state; A is the accepting state)

Finite-Automata State Graphs
- A state
- The start state
- An accepting/final state
- A transition (an edge labeled with an input symbol, e.g. a)

Finite Automata
- Transition s1 --a--> s2 is read: in state s1 on input a, go to state s2
- If end of input or no transition possible:
    If in accepting state => accept
    Otherwise => reject

Language defined by FSM
- The language defined by a FSM is the set of strings accepted by the FSM.
- In the language of the FSM on the previous slide: x, tmp2, XyZzy, position27.
- Not in the language of the FSM on the previous slide: 123, a?, 13apples.

Example: Integer Literals
FA that accepts integer literals with an optional + or - sign:
    S --+--> A
    S -- - --> A
    S --digit--> B
    A --digit--> B
    B --digit--> B

Formal FSA Definition
A finite automaton is a 5-tuple (Σ, S, δ, s0, SF) where:
- Σ is an input alphabet
- S is a set of states
- s0 is a start state
- SF ⊆ S is a set of accepting states
- δ is the state transition function: S x Σ -> S (i.e., it encodes the transitions state x input -> state)

FA for the integer-literal example
- Σ = {digit, +, -}
- A set of states S = {S, A, B}
- A start state s0 = S
- A set of accepting states SF = {B}
- δ is the state transition function:
    (S, digit) -> B
    (S, +) -> A
    (S, -) -> A
    (B, digit) -> B
    (A, digit) -> B

Two kinds of Automata
- Deterministic (DFA): no state has more than one outgoing edge with the same label.
- Non-Deterministic (NFA): states may have more than one outgoing edge with the same label. Edges may be labeled with ε (epsilon), the empty string. The automaton can take an ε transition without looking at the current input character.

Example of NFA
Integer-literal example:
    S --ε--> A
    S --+--> A
    S -- - --> A
    A --digit--> B
    B --digit--> B
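One way to see how an NFA with ε edges processes input is to track the set of states it could be in after each character. The Python sketch below is an illustration, not code from the slides. It assumes the integer-literal NFA has the edges S --ε--> A, S --+--> A, S -- - --> A, A --digit--> B, and B --digit--> B, with B accepting; the empty string stands in for the ε label, and all names are illustrative.

```python
DIGITS = set("0123456789")
EPSILON = ""  # label for moves that consume no input

# Assumed edges of the integer-literal NFA (see the lead-in above).
TRANSITIONS = {
    ("S", EPSILON): {"A"},
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("A", "digit"): {"B"},
    ("B", "digit"): {"B"},
}
START, ACCEPTING = "S", {"B"}


def classify(ch):
    """Map an input character to an edge label."""
    return "digit" if ch in DIGITS else ch


def epsilon_closure(states):
    """All states reachable from `states` using only epsilon moves."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in TRANSITIONS.get((s, EPSILON), set()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return closure


def nfa_accepts(text):
    """Pursue all choices in parallel by tracking every state the NFA could be in."""
    current = epsilon_closure({START})
    for ch in text:
        label = classify(ch)
        moved = set()
        for s in current:
            moved |= TRANSITIONS.get((s, label), set())
        current = epsilon_closure(moved)
    # Accept if some sequence of moves consumed the whole string and ended in a final state.
    return bool(current & ACCEPTING)
```

On `+75` the tracked sets are {S, A}, then {A}, then {B}, then {B}, so the string is accepted. On `75` the ε edge lets the machine start as if it were already in A.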

Non-deterministic automata (NFA)
- Often simpler (e.g. smaller) than a DFA
- Can be in multiple states at the same time
- An NFA accepts a string if there exists a sequence of moves, starting in the start state and ending in a final state, that consumes the entire string.
- Think about it as pursuing all choices in parallel, or having an oracle that says what to do.
- Example: the integer-literal NFA on input "+75"

Equivalence of DFA and NFA
- Theorem: For every non-deterministic finite-state machine M, there exists a deterministic machine M' such that M and M' accept the same language.
- Why is the theorem important for scanner generation?
- The theorem is not enough: what do we need for automatic scanner generation?

How to Implement a FSM
A table-driven approach:
- Table: one row for each state in the machine, and one column for each possible character.
- Table[j][k] gives the state to go to from state j on character k.
- An empty entry corresponds to the machine getting stuck.

The table-driven program for a DFA

    state = S                          // S is the start state
    repeat {
        k = next character from the input
        if k == EOF then               // end of input
            if state is a final state then accept else reject
        state = T[state, k]
        if state == empty then reject  // got stuck
    }

Generating a scanner
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA

Regular Expressions
- FAs are not a good way to specify tokens: diagrams are hard to write down
- Regular expressions are another specification technique: a compact way to define a language that can be accepted by an automaton
- Used as the input to a scanner generator:
    define each token, and
    define white space, comments, etc.; these do not correspond to tokens, but must be recognized and ignored
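The table-driven loop above can be made concrete for the integer-literal DFA given earlier, with δ(S, digit) = B, δ(S, +) = A, δ(S, -) = A, δ(A, digit) = B, δ(B, digit) = B, and B accepting. The Python sketch below is an illustration. Grouping characters into four column classes, including `OTHER` for characters outside the alphabet, is an assumption made here to keep the table small; the slides describe one column per possible character.

```python
# States are row indices; character classes are column indices.
S, A, B = 0, 1, 2
DIGIT, PLUS, MINUS, OTHER = 0, 1, 2, 3
EMPTY = None  # empty table entry: the machine gets stuck

T = [
    # digit  +      -      other
    [B,      A,     A,     EMPTY],  # row S
    [B,      EMPTY, EMPTY, EMPTY],  # row A
    [B,      EMPTY, EMPTY, EMPTY],  # row B
]
FINAL = {B}


def column(ch):
    """Map an input character to its column in the table."""
    if ch.isascii() and ch.isdigit():
        return DIGIT
    if ch == "+":
        return PLUS
    if ch == "-":
        return MINUS
    return OTHER


def run_dfa(text):
    """Return True (accept) or False (reject) for the whole input string."""
    state = S                              # S is the start state
    for ch in text:                        # k = next character from the input
        state = T[state][column(ch)]
        if state is EMPTY:
            return False                   # got stuck
    return state in FINAL                  # end of input: accept iff in a final state
```

Only the table and the `column` function depend on the language being recognized. The driver loop stays the same for any DFA, which is why a scanner generator can emit tables and reuse a fixed driver.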

Example: Simple identifier
- English: a letter, followed by zero or more letters or digits.
- RE: letter.(letter | digit)*
- Operators:
    |    means "or"
    .    means "followed by" (usually just use position)
    *    means zero or more instances
    ( )  are used for grouping

Operands of a regular expression
- Operands are the same as labels on the edges of an FSM: single characters, or the special character ε (the empty string)
- "letter" is a shorthand for a | b | c | ... | z | A | ... | Z
- "digit" is a shorthand for 0 | 1 | ... | 9
- Sometimes we put the characters in quotes; this is necessary when denoting the characters | . *

Precedence of | . * operators

    Regular Expression Operator   Analogous Arithmetic Operator   Precedence
    |                             plus                            lowest
    .                             times                           middle
    *                             exponentiation                  highest

Consider the regular expressions:
    letter.letter | digit*
    letter.(letter | digit)*
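The effect of precedence on the two expressions above can be checked with Python's `re` module, which uses the same relative precedence: alternation lowest, concatenation in the middle, `*` highest. The sketch reads them as letter.letter | digit* and letter.(letter | digit)*, with an explicit `|`. Writing letter as `[A-Za-z]` and digit as `[0-9]` is an assumption for illustration.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# letter.letter | digit*   groups as (letter.letter) | (digit*)
ungrouped = re.compile(f"{LETTER}{LETTER}|{DIGIT}*")

# letter.(letter | digit)*   is the simple-identifier expression
grouped = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")


def matches(pattern, text):
    """True if the whole string is in the language of the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["ab", "123", "", "a1", "abc"]:
        print(repr(text), matches(ungrouped, text), matches(grouped, text))
```

Without parentheses the first expression describes exactly two letters, or any run of digits including the empty string. The parenthesized one describes identifiers such as a1 and abc.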

Examples
Describe (in English) the language defined by each of the following regular expressions:
    letter (letter digit*)
    digit digit* "." digit digit*

Example: Integer Literals
- An integer literal with an optional sign can be defined in English as: (nothing or + or -) followed by one or more digits
- The corresponding regular expression is: (+ | - | epsilon).(digit.digit*)
- A new convenient operator, +: digit.digit* is the same as digit+, which means "one or more digits"

Language Defined by a Regular Expression
- Recall: language = set of strings
- Language defined by an automaton: the set of strings accepted by the automaton
- Language defined by an RE: the set of strings that match the expression

    Regular Exp.    Corresponding Set of Strings
    epsilon         {""}
    a               {"a"}
    a.b.c           {"abc"}
    a | b | c       {"a", "b", "c"}
    (a | b | c)*    {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}

REs describe regular languages
- Patterns form a regular language
- *** any finite language is regular ***

Regular Expression (RE) (over alphabet Σ)
- ε is an RE denoting the set {ε}
- If a is in Σ, then a is an RE denoting {a}
- If x and y are REs denoting L(x) and L(y), then
    x is an RE denoting L(x)
    y is an RE denoting L(y)
    x | y is an RE denoting L(x) ∪ L(y)
    xy is an RE denoting L(x)L(y)
    x* is an RE denoting L(x)*
- Can combine REs to form other REs
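The inductive definition above maps directly onto set operations. The Python sketch below is an illustration: it represents a language as a set of strings and implements union, concatenation, and closure. Because L(x)* is infinite whenever L(x) contains a non-empty string, `star` takes a `max_len` cutoff. That bound and all function names are assumptions made here, not part of the definition.

```python
from itertools import product

EPS = {""}  # the language {ε}


def lit(a):
    """L(a) = {a} for a symbol a in the alphabet."""
    return {a}


def union(lx, ly):
    """L(x | y) = L(x) ∪ L(y)."""
    return lx | ly


def concat(lx, ly):
    """L(xy) = L(x)L(y): every string of L(x) followed by every string of L(y)."""
    return {x + y for x, y in product(lx, ly)}


def star(lx, max_len):
    """L(x)* restricted to strings of length <= max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more string from L(x), dropping any already seen.
        frontier = {w + x for w in frontier for x in lx if len(w + x) <= max_len} - result
        result |= frontier
    return result


if __name__ == "__main__":
    abc = union(union(lit("a"), lit("b")), lit("c"))
    print(sorted(concat(concat(lit("a"), lit("b")), lit("c"))))
    print(sorted(star(abc, 2)))
```

With a, b, and c as symbols, `concat` yields the single string abc, `union` yields the three one-character strings, and `star` enumerates the empty string, the one-character strings, the two-character strings, and so on up to the cutoff.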