Compiler Construction

Similar documents
Compiler Construction

Formal Languages and Compilers Lecture VI: Lexical Analysis

UNIT -2 LEXICAL ANALYSIS

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Lexical Analysis. Chapter 2

Introduction to Lexical Analysis

Compiler Construction I

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Compiler Construction I

Implementation of Lexical Analysis

Lexical Analysis. Implementation: Finite Automata

Lexical Analysis. Lecture 2-4

Lexical Analysis. Introduction

Lexical Analysis. Lecture 3-4

Compiler Construction

Implementation of Lexical Analysis

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

Chapter 3 Lexical Analysis

Compiler course. Chapter 3 Lexical Analysis

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

2. Lexical Analysis! Prof. O. Nierstrasz!

Implementation of Lexical Analysis

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Compiler Construction

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Compiler phases. Non-tokens

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

CSc 453 Lexical Analysis (Scanning)

Figure 2.1: Role of Lexical Analyzer

Lexical Analysis/Scanning

CS415 Compilers. Lexical Analysis

Zhizheng Zhang. Southeast University

1. Lexical Analysis Phase

Introduction to Lexical Analysis

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Lexical Analysis - 2

Front End: Lexical Analysis. The Structure of a Compiler

Lexical Analyzer Scanner

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

Lexical Analysis. Lecture 3. January 10, 2018

Lexical Analyzer Scanner

Part 5 Program Analysis Principles and Techniques

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

CSCI312 Principles of Programming Languages!

Implementation of Lexical Analysis. Lecture 4

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Lexical Analysis 1 / 52

CSE302: Compiler Design

Structure of Programming Languages Lecture 3


Static Program Analysis

Languages, Automata, Regular Expressions & Scanners. Winter /8/ Hal Perkins & UW CSE B-1

Languages and Compilers

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Week 2: Syntax Specification, Grammars

Building Compilers with Phoenix

Finite Automata. Dr. Nadeem Akhtar. Assistant Professor Department of Computer Science & IT The Islamia University of Bahawalpur

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

CS 403: Scanning and Parsing

G Compiler Construction Lecture 4: Lexical Analysis. Mohamed Zahran (aka Z)

Lexical Analysis (ASU Ch 3, Fig 3.1)

MidTerm Papers Solved MCQS with Reference (1 to 22 lectures)

CS308 Compiler Principles Lexical Analyzer Li Jiang

Dr. D.M. Akbar Hussain

Implementation of Lexical Analysis

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

Compiler Construction LECTURE # 3

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

[Lexical Analysis] Bikash Balami

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 5

Finite automata. We have looked at using Lex to build a scanner on the basis of regular expressions.

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Languages and Compilers

Lexical Analysis - An Introduction. Lecture 4 Spring 2005 Department of Computer Science University of Alabama Joel Jones

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

CS 314 Principles of Programming Languages. Lecture 3

COMPILER DESIGN LECTURE NOTES

2. Syntax and Type Analysis

Compiler Construction

COLLEGE OF ENGINEERING, NASHIK. LANGUAGE TRANSLATOR

CSE 401/M501 Compilers

Lecture 4: Syntax Specification

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Non-deterministic Finite Automata (NFA)

A simple syntax-directed

Syntax and Type Analysis

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

Transcription:

Compiler Construction Lecture 2: Lexical Analysis I (Introduction) Thomas Noll Lehrstuhl für Informatik 2 (Software Modeling and Verification) noll@cs.rwth-aachen.de http://moves.rwth-aachen.de/teaching/ss-14/cc14/ Summer Semester 2014

Exercise Class Shift Fri 08:15 09:45 (AH 2) Fri 10:00 11:30 (AH 5)? Compiler Construction Summer Semester 2014 2.2

Conceptual Structure of a Compiler Source code x1 := y2 + 1; Lexical analysis (Scanner) regular expressions/finite automata (id,x1)(gets,)(id,y2)(plus,)(int,1) Syntax analysis (Parser) Semantic analysis Generation of intermediate code Code optimization Generation of machine code Target code Compiler Construction Summer Semester 2014 2.3

Outline 1 Problem Statement 2 Specification of Symbol Classes 3 The Simple Matching Problem 4 Complexity Analysis of Simple Matching Compiler Construction Summer Semester 2014 2.4

Lexical Structures From Merriam-Webster s Online Dictionary Lexical: of or relating to words or the vocabulary of a language as distinguished from its grammar and construction Compiler Construction Summer Semester 2014 2.5

Lexical Structures From Merriam-Webster s Online Dictionary Lexical: of or relating to words or the vocabulary of a language as distinguished from its grammar and construction Starting point: source program P as a character sequence Ω (finite) character set (e.g., ASCII, ISO Latin-1, Unicode,...) a,b,c,... Ω characters (= lexical atoms) P Ω source program (of course, not every w Ω is a valid program) Compiler Construction Summer Semester 2014 2.5

Lexical Structures From Merriam-Webster s Online Dictionary Lexical: of or relating to words or the vocabulary of a language as distinguished from its grammar and construction Starting point: source program P as a character sequence Ω (finite) character set (e.g., ASCII, ISO Latin-1, Unicode,...) a,b,c,... Ω characters (= lexical atoms) P Ω source program (of course, not every w Ω is a valid program) P exhibits lexical structures: natural language for keywords, identifiers,... mathematical notation for numbers, formulae,... (e.g., x 2 x**2) spaces, linebreaks, indentation comments and compiler directives (pragmas) Translation of P follows its hierarchical structure (later) Compiler Construction Summer Semester 2014 2.5

Observations 1 Syntactic atoms (called symbols) are represented as sequences of input characters, called lexemes First goal of lexical analysis Decomposition of program text into a sequence of lexemes Compiler Construction Summer Semester 2014 2.6

Observations 1 Syntactic atoms (called symbols) are represented as sequences of input characters, called lexemes First goal of lexical analysis Decomposition of program text into a sequence of lexemes 2 Differences between similar lexemes are (mostly) irrelevant (e.g., identifiers do not need to be distinguished) lexemes grouped into symbol classes (e.g., identifiers, numbers,...) symbol classes abstractly represented by tokens symbols identified by additional attributes (e.g., identifier names, numerical values,...; required for semantic analysis and code generation) = symbol = (token, attribute) Second goal of lexical analysis Transformation of a sequence of lexemes into a sequence of symbols Compiler Construction Summer Semester 2014 2.6

Lexical Analysis Definition 2.1 The goal of lexical analysis is to decompose a source program into a sequence of lexemes and their transformation into a sequence of symbols. Compiler Construction Summer Semester 2014 2.7

Lexical Analysis Definition 2.1 The goal of lexical analysis is to decompose a source program into a sequence of lexemes and their transformation into a sequence of symbols. The corresponding program is called a scanner (or lexer): (token,[attribute]) Source program Scanner Parser get next token Symbol table Compiler Construction Summer Semester 2014 2.7

Lexical Analysis Definition 2.1 The goal of lexical analysis is to decompose a source program into a sequence of lexemes and their transformation into a sequence of symbols. The corresponding program is called a scanner (or lexer): (token,[attribute]) Source program Scanner Parser get next token Symbol table Example:... x1 :=y2+ 1 ;......(id,p 1 )(gets,)(id,p 2 )(plus,)(int,1)(sem,)... Compiler Construction Summer Semester 2014 2.7

Important Symbol Classes Identifiers: for naming variables, constants, types, procedures, classes,... usually a sequence of letters and digits (and possibly special symbols), starting with a letter keywords usually forbidden; length possibly restricted Compiler Construction Summer Semester 2014 2.8

Important Symbol Classes Identifiers: for naming variables, constants, types, procedures, classes,... usually a sequence of letters and digits (and possibly special symbols), starting with a letter keywords usually forbidden; length possibly restricted Keywords: identifiers with a predefined meaning for representing control structures (while), operators (and),... Compiler Construction Summer Semester 2014 2.8

Important Symbol Classes Identifiers: for naming variables, constants, types, procedures, classes,... usually a sequence of letters and digits (and possibly special symbols), starting with a letter keywords usually forbidden; length possibly restricted Keywords: identifiers with a predefined meaning for representing control structures (while), operators (and),... Numerals: certain sequences of digits, +, -,., letters (for exponent and hexadecimal representation) Compiler Construction Summer Semester 2014 2.8

Important Symbol Classes Identifiers: for naming variables, constants, types, procedures, classes,... usually a sequence of letters and digits (and possibly special symbols), starting with a letter keywords usually forbidden; length possibly restricted Keywords: identifiers with a predefined meaning for representing control structures (while), operators (and),... Numerals: certain sequences of digits, +, -,., letters (for exponent and hexadecimal representation) Special symbols: one special character, e.g., +, *, <, (, ;,...... or two or more special characters, e.g., :=, **, <=,... each makes up a symbol class (plus, gets,...)... or several combined into one class (arithop) Compiler Construction Summer Semester 2014 2.8

Important Symbol Classes Identifiers: for naming variables, constants, types, procedures, classes,... usually a sequence of letters and digits (and possibly special symbols), starting with a letter keywords usually forbidden; length possibly restricted Keywords: identifiers with a predefined meaning for representing control structures (while), operators (and),... Numerals: certain sequences of digits, +, -,., letters (for exponent and hexadecimal representation) Special symbols: one special character, e.g., +, *, <, (, ;,...... or two or more special characters, e.g., :=, **, <=,... each makes up a symbol class (plus, gets,...)... or several combined into one class (arithop) White spaces: blanks, tabs, linebreaks,... generally for separating symbols (exception: FORTRAN) usually not represented by token (but just removed) Compiler Construction Summer Semester 2014 2.8

Specification and Implementation of Scanners Representation of symbols: symbol = (token, attribute) Token: (binary) denotation of symbol class (id, gets, plus,...) Attribute: additional information required in later compilation phases reference to symbol table, value of numeral, concrete arithmetic/relational/boolean operator,... usually unused for singleton symbol classes Compiler Construction Summer Semester 2014 2.9

Specification and Implementation of Scanners Representation of symbols: symbol = (token, attribute) Token: (binary) denotation of symbol class (id, gets, plus,...) Attribute: additional information required in later compilation phases reference to symbol table, value of numeral, concrete arithmetic/relational/boolean operator,... usually unused for singleton symbol classes Observation: symbol classes are regular sets = specification by regular expressions recognition by finite automata enables automatic generation of scanners ([f]lex) Compiler Construction Summer Semester 2014 2.9

Outline 1 Problem Statement 2 Specification of Symbol Classes 3 The Simple Matching Problem 4 Complexity Analysis of Simple Matching Compiler Construction Summer Semester 2014 2.10

Regular Expressions I Definition 2.2 (Syntax of regular expressions) Given some alphabet Ω, the set of regular expressions over Ω, RE Ω, is the least set with RE Ω, Ω RE Ω, and whenever α,β RE Ω, also α β,α β,α RE Ω. Compiler Construction Summer Semester 2014 2.11

Regular Expressions I Definition 2.2 (Syntax of regular expressions) Given some alphabet Ω, the set of regular expressions over Ω, RE Ω, is the least set with RE Ω, Ω RE Ω, and whenever α,β RE Ω, also α β,α β,α RE Ω. Remarks: abbreviations: α + := α α, ε := α β often written as αβ Binding priority: > > (i.e., a b c := a (b (c ))) Compiler Construction Summer Semester 2014 2.11

Regular Expressions II Regular expressions specify regular languages: Definition 2.3 (Semantics of regular expressions) The semantics of a regular expression is defined by the mapping. : RE Ω 2 Ω where := a := {a} α β := α β α β := α β α := α Compiler Construction Summer Semester 2014 2.12

Regular Expressions II Regular expressions specify regular languages: Definition 2.3 (Semantics of regular expressions) The semantics of a regular expression is defined by the mapping. : RE Ω 2 Ω where := a := {a} α β := α β α β := α β α := α Remarks: for formal languages L,M Ω, we have L M := {vw v L,w M} L := n=0 Ln where L 0 := {ε} and L n+1 := L L n (thus L = {w 1 w 2...w n n N, 1 i n : w i L} and ε L ) = = = {ε} Compiler Construction Summer Semester 2014 2.12

Regular Expressions III Example 2.4 1 A keyword: begin Compiler Construction Summer Semester 2014 2.13

Regular Expressions III Example 2.4 1 A keyword: begin 2 Identifiers: (a... z A... Z)(a... z A... Z 0... 9 $...) Compiler Construction Summer Semester 2014 2.13

Regular Expressions III Example 2.4 1 A keyword: begin 2 Identifiers: (a... z A... Z)(a... z A... Z 0... 9 $...) 3 (Unsigned) Integer numbers: (0... 9) + Compiler Construction Summer Semester 2014 2.13

Regular Expressions III Example 2.4 1 A keyword: begin 2 Identifiers: (a... z A... Z)(a... z A... Z 0... 9 $...) 3 (Unsigned) Integer numbers: (0... 9) + 4 (Unsigned) Fixed-point numbers: ( (0... 9) +.(0... 9) ) ( (0... 9).(0... 9) +) Compiler Construction Summer Semester 2014 2.13

Outline 1 Problem Statement 2 Specification of Symbol Classes 3 The Simple Matching Problem 4 Complexity Analysis of Simple Matching Compiler Construction Summer Semester 2014 2.14

The Simple Matching Problem I Problem 2.5 (Simple matching problem) Given α RE Ω and w Ω, decide whether w α or not. Compiler Construction Summer Semester 2014 2.15

The Simple Matching Problem I Problem 2.5 (Simple matching problem) Given α RE Ω and w Ω, decide whether w α or not. This problem can be solved using the following concept: Definition 2.6 (Finite automaton) A nondeterministic finite automaton (NFA) is of the form A = Q,Ω,δ,q 0,F where Q is a finite set of states Ω denotes the input alphabet δ : Q Ω ε 2 Q is the transition function where Ω ε := Ω {ε} (notation: q x q for q δ(q,x)) q 0 Q is the initial state F Q is the set of final states The set of all NFA over Ω is denoted by NFA Ω. If δ(q,ε) = and δ(q,a) = 1 for every q Q and a Ω (i.e., δ : Q Ω Q), then A is called deterministic (DFA). Notation: DFA Ω Compiler Construction Summer Semester 2014 2.15

The Simple Matching Problem II Definition 2.7 (Acceptance condition) Let A = Q,Ω,δ,q 0,F NFA Ω and w = a 1...a n Ω. A w-labeled A-run from q 1 to q 2 is a sequence of transitions ε q 1 a 1 ε a 2 ε ε... a n ε q 2 A accepts w if there is a w-labeled A-run from q 0 to some q F The language recognized by A is L(A) := {w Ω A accepts w} A language L Ω is called NFA-recognizable if there exists a NFA A such that L(A) = L Compiler Construction Summer Semester 2014 2.16

The Simple Matching Problem II Definition 2.7 (Acceptance condition) Let A = Q,Ω,δ,q 0,F NFA Ω and w = a 1...a n Ω. A w-labeled A-run from q 1 to q 2 is a sequence of transitions ε q 1 a 1 ε a 2 ε ε... a n ε q 2 A accepts w if there is a w-labeled A-run from q 0 to some q F The language recognized by A is L(A) := {w Ω A accepts w} A language L Ω is called NFA-recognizable if there exists a NFA A such that L(A) = L Example 2.8 NFA for a b a (on the board) Compiler Construction Summer Semester 2014 2.16

The Simple Matching Problem III Remarks: NFA as specified in Definition 2.6 are sometimes called NFA with ε-transitions (ε-nfa). Compiler Construction Summer Semester 2014 2.17

The Simple Matching Problem III Remarks: NFA as specified in Definition 2.6 are sometimes called NFA with ε-transitions (ε-nfa). For A DFA Ω, the acceptance condition yields δ : Q Ω Q with δ (q,ε) = q and δ (q,aw) = δ (δ(q,a),w), and L(A) = {w Ω δ (q 0,w) F}. Compiler Construction Summer Semester 2014 2.17

The DFA Method I Known from Formal Systems, Automata and Processes: Algorithm 2.9 (DFA method) Input: regular expression α RE Ω, input string w Ω Compiler Construction Summer Semester 2014 2.18

The DFA Method I Known from Formal Systems, Automata and Processes: Algorithm 2.9 (DFA method) Input: regular expression α RE Ω, input string w Ω Procedure: 1 using Kleene s Theorem, construct A α NFA Ω such that L(A α ) = α 2 apply powerset construction to obtain A α = Q,Ω,δ,q 0,F DFA Ω with L(A α ) = L(A α) = α 3 solve the matching problem by deciding whether δ (q 0,w) F Compiler Construction Summer Semester 2014 2.18

The DFA Method I Known from Formal Systems, Automata and Processes: Algorithm 2.9 (DFA method) Input: regular expression α RE Ω, input string w Ω Procedure: 1 using Kleene s Theorem, construct A α NFA Ω such that L(A α ) = α 2 apply powerset construction to obtain A α = Q,Ω,δ,q 0,F DFA Ω with L(A α ) = L(A α) = α 3 solve the matching problem by deciding whether δ (q 0,w) F Output: yes or no Compiler Construction Summer Semester 2014 2.18

The DFA Method II The powerset construction involves the following concept: Definition 2.10 (ε-closure) Let A = Q,Ω,δ,q 0,F NFA Ω. The ε-closure ε(t) Q of a subset T Q is defined by T ε(t) and if q ε(t), then δ(q,ε) ε(t) Compiler Construction Summer Semester 2014 2.19

The DFA Method II The powerset construction involves the following concept: Definition 2.10 (ε-closure) Let A = Q,Ω,δ,q 0,F NFA Ω. The ε-closure ε(t) Q of a subset T Q is defined by T ε(t) and if q ε(t), then δ(q,ε) ε(t) Example 2.11 1 Kleene s Theorem (on the board) 2 Powerset construction (on the board) Compiler Construction Summer Semester 2014 2.19

Outline 1 Problem Statement 2 Specification of Symbol Classes 3 The Simple Matching Problem 4 Complexity Analysis of Simple Matching Compiler Construction Summer Semester 2014 2.20

Complexity of DFA Method 1 in construction phase: Kleene method: time and space O( α ) (where α := length of α) Powerset construction: time and space O(2 Aα ) = O(2 α ) (where A α := # of states of A α ) Compiler Construction Summer Semester 2014 2.21

Complexity of DFA Method 1 in construction phase: Kleene method: time and space O( α ) (where α := length of α) Powerset construction: time and space O(2 Aα ) = O(2 α ) (where A α := # of states of A α ) 2 at runtime: Word problem: time O( w ) (where w := length of w), space O(1) (but O(2 α ) for storing DFA) Compiler Construction Summer Semester 2014 2.21

Complexity of DFA Method 1 in construction phase: Kleene method: time and space O( α ) (where α := length of α) Powerset construction: time and space O(2 Aα ) = O(2 α ) (where A α := # of states of A α ) 2 at runtime: Word problem: time O( w ) (where w := length of w), space O(1) (but O(2 α ) for storing DFA) = nice runtime behavior but memory requirements very high (and exponential time in construction phase) Compiler Construction Summer Semester 2014 2.21

The NFA Method Idea: reduce memory requirements by applying powerset construction at runtime, i.e., only to the run of w through A α Compiler Construction Summer Semester 2014 2.22

The NFA Method Idea: reduce memory requirements by applying powerset construction at runtime, i.e., only to the run of w through A α Algorithm 2.12 (NFA method) Input: automaton A α = Q,Ω,δ,q 0,F NFA Ω, input string w Ω Compiler Construction Summer Semester 2014 2.22

The NFA Method Idea: reduce memory requirements by applying powerset construction at runtime, i.e., only to the run of w through A α Algorithm 2.12 (NFA method) Input: automaton A α = Q,Ω,δ,q 0,F NFA Ω, input string w Ω Variables: T Q, a Ω Procedure: T := ε({q 0 }); while w ε do a := head(w); ( ) T := ε q T δ(q,a) ; w := tail(w) od Compiler Construction Summer Semester 2014 2.22

The NFA Method Idea: reduce memory requirements by applying powerset construction at runtime, i.e., only to the run of w through A α Algorithm 2.12 (NFA method) Input: automaton A α = Q,Ω,δ,q 0,F NFA Ω, input string w Ω Variables: T Q, a Ω Procedure: T := ε({q 0 }); while w ε do a := head(w); ( ) T := ε q T δ(q,a) ; w := tail(w) od Output: if T F then yes else no Compiler Construction Summer Semester 2014 2.22

Complexity Analysis For NFA method at runtime: Space: O( α ) (for storing NFA and T) Time: O( α w ) (in the loop s body, T states need to be considered) = trades exponential space for increase in time Compiler Construction Summer Semester 2014 2.23

Complexity Analysis For NFA method at runtime: Space: O( α ) (for storing NFA and T) Time: O( α w ) (in the loop s body, T states need to be considered) = trades exponential space for increase in time Comparison: Method Space Time (for w α? ) DFA O(2 α ) O( w ) NFA O( α ) O( α w ) Compiler Construction Summer Semester 2014 2.23

Complexity Analysis For NFA method at runtime: Space: O( α ) (for storing NFA and T) Time: O( α w ) (in the loop s body, T states need to be considered) = trades exponential space for increase in time Comparison: Method Space Time (for w α? ) DFA O(2 α ) O( w ) NFA O( α ) O( α w ) In practice: Exponential blowup of DFA method usually does not occur in real applications ( = used in [f]lex) Improvement of NFA method: caching of transitions δ (T,a) = combination of both methods Compiler Construction Summer Semester 2014 2.23