Lexical Analysis (ASU Ch 3, Fig 3.1)

Implementation
  - by hand
  - automatically ((F)Lex)
      - Lex generates a finite automaton recogniser
      - uses regular expressions
Tasks
  - remove white space (ws)
  - display source program
  - line numbers (error display)
[Figure 3.1: the lexical analyser interacts with the parser and the symbol table]
Issues
  - simple: LA/SA split
  - more efficient
  - flexible / portable

Terminology (ASU Ch 3.1)
  pattern - rule describing the set of input strings associated with a token
  lexeme  - sequence of characters in the input matched by a pattern
  token   - syntactic object generated by a pattern match from the input;
            the tokens are the terminal symbols in the grammar G

  token      sample lexemes
  const      const
  if         if
  relation   > >= = <> < <=
  id         pi count D2
  num        3.14 22 0
  literal    "this is a string"

  lexemes correspond to attributes for tokens

Attributes, errors & recovery (ASU Ex. 3.1)
  E.g. E = m*c**2
    <id, E> <assign_op, > <id, m> <mul_op, > <id, c> <exp_op, > <num, 2>
  Errors & recovery
    - delete extraneous characters
    - insert missing characters
    - replace incorrect character
    - transpose characters (fi -> if)
    - resynchronise with the input stream and find the next token
    - error distance: number of steps needed to transform the erroneous program into a correct one
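
The four single-character repairs listed above (delete, insert, replace, transpose) suggest one way to make "error distance" concrete: count the fewest such repairs that turn one string into another. The sketch below uses the optimal string alignment variant of edit distance, which is an assumption about what the slide intends; the function name `error_distance` is hypothetical.

```python
def error_distance(src, dst):
    """Fewest delete/insert/replace/transpose steps turning src into dst.

    Optimal string alignment distance: an adjacent transposition counts
    as one step, and no substring is edited more than once.
    """
    rows, cols = len(src) + 1, len(dst) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all i characters of src
    for j in range(cols):
        d[0][j] = j                      # insert all j characters of dst
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete extraneous character
                d[i][j - 1] + 1,         # insert missing character
                d[i - 1][j - 1] + cost,  # replace incorrect character
            )
            if (i > 1 and j > 1
                    and src[i - 1] == dst[j - 2]
                    and src[i - 2] == dst[j - 1]):
                # transpose two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(error_distance("fi", "if"))   # one transposition
```

A recovery routine could compare a bad lexeme against each keyword and pick the one at the smallest distance, although real lexers rarely go this far.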

Strings & languages (ASU Ch 3.3)
  alphabet (char set / class) - any finite set of symbols
    e.g. binary {1, 0}, ASCII, English, Swedish
  string over an alphabet - a finite sequence of symbols drawn from that alphabet
  string length |s|
    e.g. |foobar| = 6, |ε| = 0 (ε is the empty string)
  language - any set of strings over some fixed alphabet
    - includes { }, the empty set
  concatenation - xy where x and y are strings
  product notation - s^0 = ε, s^1 = s, s^2 = ss, s^n = ss...s (n times)

Strings & languages (ASU Ch 3.3)
  prefix of s - string obtained by removing 0 or more trailing symbols of s
    e.g. ban from banana
  suffix of s - string obtained by removing 0 or more leading symbols of s
    e.g. nana from banana
  sub-string of s - string obtained by deleting a prefix and a suffix from s
    e.g. nan from banana
    every prefix / suffix of s is a sub-string of s
  proper prefix / suffix / sub-string - non-empty string x such that x ≠ s
  sub-sequence of s - string obtained by deleting 0 or more not necessarily contiguous symbols from s
    e.g. baaa from banana
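
The definitions above map directly onto string slicing. The following sketch enumerates each set for a given string; the helper names (`prefixes`, `suffixes`, `substrings`, `proper`, `is_subsequence`) are hypothetical and chosen only for illustration.

```python
def prefixes(s):
    """Remove 0 or more trailing symbols."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """Remove 0 or more leading symbols."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """Delete a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(items, s):
    """Keep only the non-empty members that differ from s itself."""
    return {x for x in items if x != "" and x != s}


def is_subsequence(x, s):
    """True if x can be obtained by deleting symbols of s, order preserved."""
    remaining = iter(s)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so later symbols of x are searched only in what follows.
    return all(ch in remaining for ch in x)


word = "banana"
print("ban" in prefixes(word), "nana" in suffixes(word), "nan" in substrings(word))
print(is_subsequence("baaa", word))
```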

Operations on Languages (ASU Fig 3.8)
  union:            L ∪ M = { s | s in L or s in M }
  concatenation:    LM = { st | s in L and t in M }
  Kleene closure:   L* = L^0 ∪ L^1 ∪ L^2 ∪ ... (union of L^i for i = 0 to ∞)
                    zero or more concatenations of L
  positive closure: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ... (union of L^i for i = 1 to ∞)
                    one or more concatenations of L
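
For finite languages these operations can be tried out with Python sets. The closures are infinite in general, so the sketch below truncates them at a chosen maximum power; that bound and the function names are assumptions made for illustration.

```python
def union(L, M):
    return L | M


def concat(L, M):
    """LM = { st | s in L and t in M }"""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i, with L^0 = {""} (the set holding only the empty string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def closure(L, max_power, start=0):
    """Union of L^i for i = start..max_power.

    start=0 approximates the Kleene closure L*,
    start=1 approximates the positive closure L+.
    """
    result = set()
    for i in range(start, max_power + 1):
        result |= power(L, i)
    return result


L = {"a", "b"}
D = {"0", "1"}
print(sorted(concat(L, D)))
print(sorted(closure(L, 2)))
```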

Regular Expressions (RE) (ASU Ch 3.3)
  E.g. Pascal <id> ::= letter ( letter | digit )*
  each RE r denotes a language L(r)
  built from simpler REs using a set of defining rules:
    - ε is an RE that denotes {ε}
    - for a in alphabet A, a is an RE that denotes {a}, the set containing the string a
      (a can mean symbol a / string a / RE a)
    - let r, s be REs denoting L(r), L(s) respectively; then
        (r)|(s) is an RE denoting L(r) ∪ L(s)
        (r)(s)  is an RE denoting L(r)L(s)
        (r)*    is an RE denoting (L(r))*
        (r)     is an RE denoting L(r) (extra parentheses are allowed)

Regular Expressions - examples (ASU Ch 3.3 Ex. 3.3)
  (a)|((b)*(c)) == a|b*c : a, or zero or more b's followed by c
  for an alphabet A = {a, b}
    - a|b denotes the set {a, b}
    - (a|b)(a|b) denotes the set {aa, ab, ba, bb}
    - a* denotes the set {ε, a, aa, aaa, ...}
    - (a|b)* denotes the set of all strings of a's and b's
    - what does a|a*b denote?
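
One way to explore these examples is to list every string up to some length and keep those that the expression matches in full. The sketch below does this with Python's `re` module, whose `|`, `*` and parentheses behave like the operators on this slide; the helper name `language` and the length bound are assumptions.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=3):
    """Strings over the alphabet, up to max_len, that the pattern fully matches."""
    regex = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if regex.fullmatch(s):
                found.append(s)
    return found


print(language("a|b"))          # ['a', 'b']
print(language("(a|b)(a|b)"))   # ['aa', 'ab', 'ba', 'bb']
print(language("a*"))           # ['', 'a', 'aa', 'aaa']
print(language("a|a*b"))        # ['a', 'b', 'ab', 'aab']
```

The last line hints at an answer to the closing question: the string a, plus every string of zero or more a's followed by a single b.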

Regular Expressions - Equivalence & Axioms (ASU Fig 3.9)
  Equivalence - if REs r and s denote the same language L, then r = s
  axioms (algebraic laws)
    r|s = s|r              | is commutative
    r|(s|t) = (r|s)|t      | is associative
    (rs)t = r(st)          concatenation is associative
    r(s|t) = rs|rt         concatenation distributes over |
    εr = rε = r            ε is the identity element for concatenation
    r* = (r|ε)*            relationship between * and ε
    r** = r*               * is idempotent
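
The laws can be spot-checked for particular instances by comparing two expressions on every string up to a bounded length. This is a bounded test rather than a proof, and the helper names `strings` and `agree` are hypothetical. In Python's `re` syntax the empty alternative in `(a|)` plays the role of ε.

```python
import re
from itertools import product


def strings(alphabet, max_len):
    """Every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def agree(re1, re2, alphabet="abc", max_len=4):
    """True if both patterns accept exactly the same strings up to max_len."""
    p1, p2 = re.compile(re1), re.compile(re2)
    return all(
        bool(p1.fullmatch(s)) == bool(p2.fullmatch(s))
        for s in strings(alphabet, max_len)
    )


# Instances of the laws with r = a, s = b, t = c
print(agree("a|b", "b|a"))          # | is commutative
print(agree("a|(b|c)", "(a|b)|c"))  # | is associative
print(agree("(ab)c", "a(bc)"))      # concatenation is associative
print(agree("a(b|c)", "ab|ac"))     # concatenation distributes over |
print(agree("a*", "(a|)*"))         # r* = (r|ε)*
print(agree("(a*)*", "a*"))         # * is idempotent
print(agree("a|b", "ab"))           # not a law: union vs concatenation
```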

Regular Definitions (RD) (ASU Ch 3.3)
  A regular definition is a sequence of definitions
    d_1 => r_1, d_2 => r_2, ..., d_n => r_n
  - each d_i is a distinct name
  - each r_i is an RE over the symbols in A ∪ {d_1, d_2, ..., d_(i-1)}
    (NB i-1: a definition may use only names defined before it)
  examples: Pascal identifiers
    letter => A | B | ... | Z | a | b | ... | z
    digit  => 0 | 1 | ... | 9
    id     => letter (letter | digit)*
  definition names: letter, digit, id
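
A regular definition can be turned into a single regular expression by substituting earlier names into later bodies. The sketch below assumes a Lex-like `{name}` reference syntax and a hypothetical `expand` helper; it rejects any reference that is not already defined, which mirrors the i-1 restriction. A body that used `{` as a regex quantifier would trip this simple check.

```python
import re

# Pascal identifiers, written with {name} references to earlier definitions
DEFINITIONS = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "{letter}({letter}|{digit})*"),
]


def expand(definitions):
    """Substitute earlier definitions into later ones, in order."""
    expanded = {}
    for name, body in definitions:
        for earlier, pattern in expanded.items():
            body = body.replace("{" + earlier + "}", "(?:" + pattern + ")")
        if "{" in body:
            # Something was left unexpanded: a later or undefined name
            raise ValueError(f"{name} refers to a name not defined before it")
        expanded[name] = body
    return expanded


patterns = expand(DEFINITIONS)
ident = re.compile(patterns["id"])
for lexeme in ["pi", "count", "D2", "2D"]:
    print(lexeme, bool(ident.fullmatch(lexeme)))
```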

Regular Expressions - Notational Shorthand (ASU Ch 3.3)
  + denotes 1 or more instances
    - if r denotes L(r), then (r)+ denotes (L(r))+
    - a+ denotes the set of all strings of one or more a's
  * denotes 0 (zero) or more instances
    - r* = r+ | ε
    - r+ = rr*
  ? denotes 0 (zero) or one instance
    - r? = r | ε
    - (r)? denotes L(r) ∪ {ε}
  character classes
    - [abc] denotes a | b | c where a, b, c in A (alphabet)
    - [a-z] denotes a | b | ... | z
    - id => [A-Za-z][A-Za-z0-9]*

Regular Expressions - Limitations (ASU Ch 3.3)
  Non-regular sets
    - REs cannot describe balanced / nested constructs
        e.g. all strings of balanced parentheses
        this requires a Context Free Grammar (CFG)
    - REs cannot describe repeating strings
        e.g. { wcw | w is a string of a's and b's }
  this leads to the next stage in compiling: the syntax analyser,
  which is based on CFGs and builds on token recognition
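
The nesting limitation can be seen by writing a regular expression that handles a fixed depth and watching it fail one level deeper, whereas a recogniser with an unbounded counter copes with any depth. The pattern `DEPTH2` and the function `balanced` below are illustrative assumptions, not part of the slides.

```python
import re

# Balanced parentheses nested at most two deep: ( ()()... ) repeated
DEPTH2 = re.compile(r"(?:\((?:\(\))*\))*")


def balanced(text):
    """Recognise balanced parentheses with a counter (unbounded memory)."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ')' with no matching '('
                return False
    return depth == 0


for text in ["()", "(()())", "((()))", "(()"]:
    print(text, bool(DEPTH2.fullmatch(text)), balanced(text))
```

Each extra nesting level needs a larger expression, and no single expression covers all depths; the counter stands in for the stack that a CFG-based parser would use.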

Example Grammar + Regular Expressions (ASU Ex 3.6)
  stmt => if expr then stmt
        | if expr then stmt else stmt
  expr => term relop term
        | term
  term => id
        | num
  ====================================
  if    => if
  then  => then
  else  => else
  relop => < | <= | = | <> | > | >=
  id    => letter ( letter | digit )*
  num   => digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
  delim => blank | tab | newline
  ws    => delim+

Token Recognition (ASU Ch 3.4)
  RE        token    attribute
  ws        --       --
  if        if       --
  then      then     --
  else      else     --
  id        id       reference to symbol table entry
  [0..9]    num      reference to table entry
  <         relop    LT
  <=        relop    LE
  etc. (=, <>, >, >= become EQ, NE, GT, GE respectively)
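
The table can be read as a specification for a small scanner. The sketch below is one possible rendering using Python's `re` module; the names `TOKEN_RE`, `tokenize`, `symtab` and `numtab` are hypothetical, and table references are modelled as list indices. Keywords are recognised by matching the id pattern first and then looking the lexeme up in a keyword set, a common alternative to giving each keyword its own pattern.

```python
import re

KEYWORDS = {"if", "then", "else"}
RELOPS = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

TOKEN_RE = re.compile(r"""
    (?P<ws>[ \t\n]+)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<num>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)
  | (?P<relop><=|<>|>=|<|=|>)      # two-character operators listed first
""", re.VERBOSE)


def tokenize(text):
    """Return (tokens, symtab, numtab); each token is a (name, attribute) pair."""
    symtab, numtab, tokens = [], [], []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                          # white space yields no token
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))     # keyword: no attribute
        elif kind == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append(("id", symtab.index(lexeme)))
        elif kind == "num":
            if lexeme not in numtab:
                numtab.append(lexeme)
            tokens.append(("num", numtab.index(lexeme)))
        else:
            tokens.append(("relop", RELOPS[lexeme]))
    return tokens, symtab, numtab


tokens, symtab, numtab = tokenize("if x <= 10 then y else x")
print(tokens)
print(symtab, numtab)
```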

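Transition Diagrams (ASU Fig 3.12)
  Lexical analysers can be represented by transition diagrams:
  circles are states, edges are labelled by a char
  [Figure 3.12: transition diagram for relop]
    state 0 (start):  <  -> state 1,  =  -> state 5,  >  -> state 6
    state 1:          =  -> state 2,  >  -> state 3,  other -> state 4
    state 6:          =  -> state 7,  other -> state 8
    accepting states:
      2    return(relop, LE)
      3    return(relop, NE)
      4 *  return(relop, LT)
      5    return(relop, EQ)
      7    return(relop, GE)
      8 *  return(relop, GT)
    * = retract forward pointer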
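
A transition diagram of this kind translates almost mechanically into code: one branch per state, one test per outgoing edge. The sketch below follows the relop state numbering; the function name `relop`, its return convention, and the use of an empty string for end of input are assumptions made for illustration.

```python
def relop(text, start=0):
    """Simulate the relop transition diagram from position start.

    Returns ((token, attribute), forward), where forward is the position
    just after the lexeme, or None if no relational operator starts here.
    """
    state = 0
    forward = start
    while True:
        c = text[forward] if forward < len(text) else ""  # "" = end of input
        forward += 1
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), forward        # state 5
            elif c == ">":
                state = 6
            else:
                return None
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), forward        # state 2
            if c == ">":
                return ("relop", "NE"), forward        # state 3
            return ("relop", "LT"), forward - 1        # state 4 *: retract
        elif state == 6:
            if c == "=":
                return ("relop", "GE"), forward        # state 7
            return ("relop", "GT"), forward - 1        # state 8 *: retract


print(relop("<="))
print(relop("<a"))
print(relop("x<y", 1))
```

The retract step matters in states 4 and 8: the character that ended the lexeme belongs to the next token, so the forward pointer is moved back by one before returning.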

Summary
  Lexical Analysis
    - automatic ((F)Lex): uses regular expressions and regular definitions
    - pattern / lexeme / token
  Strings and Languages
    - alphabet / string / language
    - operations: union / concatenation / closure (L ∪ M, LM, L*, L+)
  Regular Expressions
    - equivalence & axioms
    - regular definitions
    - shorthand: r*, r+, r?, [a-z]
  Tokens
    - produced by LA, used by SA (syntax analysis)
    - have attributes (e.g. relop >=)
  Transition Diagrams
    - representation of an LA