Lexical Analysis. Introduction

Similar documents
Lexical Analysis - An Introduction. Lecture 4 Spring 2005 Department of Computer Science University of Alabama Joel Jones

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

CS415 Compilers. Lexical Analysis

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Part 5 Program Analysis Principles and Techniques

Front End. Hwansoo Han

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

CSc 453 Lexical Analysis (Scanning)

Formal Languages and Compilers Lecture VI: Lexical Analysis

Lexical Analysis. Implementing Scanners & LEX: A Lexical Analyzer Tool

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Introduction to Lexical Analysis

CSE302: Compiler Design

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Theoretical Part. Chapter one:- - What are the Phases of compiler? Answer:

Lecture 3: CUDA Programming & Compiler Front End

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

Lecture 4: Syntax Specification

CS 403: Scanning and Parsing

Introduction to Lexical Analysis

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Figure 2.1: Role of Lexical Analyzer

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

CS 314 Principles of Programming Languages. Lecture 3

Introduction to Compiler

2. Lexical Analysis! Prof. O. Nierstrasz!

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

CS 314 Principles of Programming Languages

Implementation of Lexical Analysis

Compiler Construction

Lexical Analysis. Chapter 2

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Lexical Analysis. Lecture 3. January 10, 2018

Parsing II Top-down parsing. Comp 412

Implementation of Lexical Analysis

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

CS 315 Programming Languages Syntax. Parser. (Alternatively hand-built) (Alternatively hand-built)

Week 2: Syntax Specification, Grammars

Specifying Syntax. An English Grammar. Components of a Grammar. Language Specification. Types of Grammars. 1. Terminal symbols or terminals, Σ

Lexical Analysis. Lecture 2-4

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Non-deterministic Finite Automata (NFA)

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Introduction to Parsing

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

CPS 506 Comparative Programming Languages. Syntax Specification

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

Lexical Analysis. Lecture 3-4

Compiler Construction

Compiler phases. Non-tokens

Structure of a compiler. More detailed overview of compiler front end. Today we ll take a quick look at typical parts of a compiler.

CSE450 Translation of Programming Languages. Lecture 4: Syntax Analysis

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

Lexical Analysis (ASU Ch 3, Fig 3.1)

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Compiler Construction LECTURE # 3

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

Compiler course. Chapter 3 Lexical Analysis

1. Lexical Analysis Phase

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Programming Languages (CS 550) Lecture 4 Summary Scanner and Parser Generators. Jeremy R. Johnson

Lexical Analysis 1 / 52

Compiling Regular Expressions COMP360

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Compilers. Yannis Smaragdakis, U. Athens (original slides by Sam

UNIT II LEXICAL ANALYSIS

Chapter 3 Lexical Analysis

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

Introduction to Parsing. Comp 412

A simple syntax-directed

CSE 401 Midterm Exam 11/5/10

Syntax Analysis, III Comp 412

COLLEGE OF ENGINEERING, NASHIK. LANGUAGE TRANSLATOR

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

A Simple Syntax-Directed Translator

CSCI312 Principles of Programming Languages!

Lexical Analysis. Finite Automata

COP 3402 Systems Software Syntax Analysis (Parser)

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

Dr. D.M. Akbar Hussain

LECTURE 7. Lex and Intro to Parsing

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

CSE 3302 Programming Languages Lecture 2: Syntax

Question Bank. 10CS63:Compiler Design

Chapter 3. Parsing #1

Structure of Programming Languages Lecture 3

Lexical Analysis. Finite Automata

CPSC 434 Lecture 3, Page 1

Zhizheng Zhang. Southeast University

Transcription:

Lexical Analysis Introduction Copyright 2015, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California have explicit permission to make copies of these materials for their personal use.

The Front End Source code Front End IR Back End Machine code Errors The purpose of the front end is to deal with the input language Perform a membership test: code source language? Is the program well-formed (semantically)? Build an IR version of the code for the rest of the compiler The front end is not monolithic 2

The Front End Source code Scanner tokens Parser IR Errors Scanner Maps stream of characters into words Basic unit of syntax x = x + y ; becomes <id,x> <eq,=> <id,x> <pl,+> <id,y> <sc,; > Characters that form a word are its lexeme Its part of speech (or syntactic category) is called its token type Scanner discards white space & (often) comments Speed is an issue in scanning use a specialized recognizer 3

The Front End Source code Scanner tokens Parser IR Parser Checks stream of classified words (parts of speech) for grammatical correctness Determines if code is syntactically well-formed Guides checking at deeper levels than syntax Builds an IR representation of the code We ll come back to parsing in a couple of lectures Errors 4

The Big Picture Language syntax is specified with parts of speech, not words Syntax checking matches parts of speech against a grammar 1. goal expr 2. expr expr op term 3. term 4. term number 5. id 6. op + 7. S = goal T = { number, id, +, - } N = { goal, expr, term, op } P = { 1, 2, 3, 4, 5, 6, 7} 5

The Big Picture Language syntax is specified with parts of speech, not words Syntax checking matches parts of speech against a grammar 1. goal expr 2. expr expr op term 3. term 4. term number 5. id 6. op + 7. S = goal T = { number, id, +, - } N = { goal, expr, term, op } P = { 1, 2, 3, 4, 5, 6, 7} No words here! Parts of speech, not words! 6

The Big Picture Why study Lexical Analysis? We want to avoid writing Scanners by hand We want to harness the theory classes source code Scanner parts of speech & words Goals: specifications Scanner Generator Specifications written as regular expressions tables or code To simplify specification & implementation of scanners To understand the underlying techniques and technologies Represent words as indices into a global table 7

What is a Lexical Analyzer? Source program text Tokens Example of Tokens Operators = + - > ( { := == <> Keywords if while for int double Numeric literals 43 4.565-3.6e10 0x13F3A Character literals a ~ \ String literals 4.565 Fall 10 \ \ = empty Example of non-tokens White space Comments space( ) tab( \t ) end-of-line( \n ) /*this is not a token*/ 8

9

10

11

for_key 12

for_key 13

for_key 14

for_key 15

for_key 16

for_key 17

for_key ID( var1 ) 18

for_key ID( var1 ) 19

for_key ID( var1 ) 20

for_key ID( var1 ) eq_op 21

for_key ID( var1 ) eq_op 22

for_key ID( var1 ) eq_op 23

for_key ID( var1 ) eq_op 24

for_key ID( var1 ) eq_op Num(10) 25

for_key ID( var1 ) eq_op Num(10) 26

for_key ID( var1 ) eq_op Num(10) 27

for_key ID( var1 ) eq_op Num(10) 28

for_key ID( var1 ) eq_op Num(10) 29

for_key ID( var1 ) eq_op Num(10) 30

for_key ID( var1 ) eq_op Num(10) ID( var1 ) 31

for_key ID( var1 ) eq_op Num(10) ID( var1 ) 32

for_key ID( var1 ) eq_op Num(10) ID( var1 ) 33

for_key ID( var1 ) eq_op Num(10) ID( var1 ) le_op 34

for_key ID( var1 ) eq_op Num(10) ID( var1 ) le_op 35

Lexical Analyzer needs to Partition Input Program Text into Subsequence of Characters Corresponding to Tokens Attach the Corresponding Attributes to the Tokens Eliminate White Space and Comments 36

Lexical Analysis: Basic Issues How to Precisely Match Strings to Tokens How to Implement a Lexical Analyzer 37

Regular Expressions Lexical patterns form a regular language *** any finite language is regular *** Regular expressions (REs) describe regular languages Regular Expression (over alphabet Σ) ε is a RE denoting the set {ε} If a is in Σ, then a is a RE denoting {a} If x and y are REs denoting L(x) and L(y) then x y is an RE denoting L(x) L(y) xy is an RE denoting L(x)L(y) x * is an RE denoting L(x)* Precedence : closure, then concatenation, then alternation 38

Set Operations (review) Operation Union of L and M Written L M Concatenation of L and M Written LM Kleene closure of L Written L * Positive Closure of L Written L + Definition L M = {s s L or s M } LM = {st s L and t M } L * = 0 i L i L + = 1 i L i These definitions should be well known 39

Examples of Regular Expressions Identifiers: Letter (a b c z A B C Z) Digit (0 1 2 9) Identifier Letter ( Letter Digit ) * Numbers: Integer (+ - ε) (0 (1 2 3 9)(Digit * ) ) Decimal Integer. Digit * Real ( Integer Decimal ) E (+ - ε) Digit * Complex ( Real, Real ) Numbers can get much more complicated! 40

Regular Expressions (the point) Regular expressions can be used to specify the words to be translated to parts of speech by a lexical analyzer Using results from automata theory and theory of algorithms, we can automatically build recognizers from regular expressions Some of you may have seen this construction for string pattern matching We study REs and associated theory to automate scanner construction! 41

Example Consider the problem of recognizing Register names Register r (0 1 2 9) (0 1 2 9) * Allows registers of arbitrary number Requires at least one digit RE corresponds to a recognizer (or DFA) r (0 1 2 9) (0 1 2 9) S 0 S 1 S 2 accepting state Recognizer for Register Transitions on other inputs go to an error state, s e 42

Example (continued) DFA operation Start in state S 0 & take transitions on each input character DFA accepts a word x iff x leaves it in a final state (S 2 ) r (0 1 2 9) (0 1 2 9) S 0 S 1 S 2 Recognizer for Register So, r17 takes it through s 0, s 1, s 2 and accepts r takes it through s 0, s 1 and fails a takes it straight to s e accepting state 43

Example (continued) To be useful, recognizer must turn into code Char next character State s 0 δ r 0,1,2,3,4,5,6,7,8,9 All others while (Char EOF) State δ(state,char) Char next character s 0 s 1 s 1 s e s e s 2 s e s e if (State is a final state ) then report success else report failure s 2 s e s e s e s 2 s e s e s e Skeleton recognizer Table encoding RE 44

Example (continued) To be useful, recognizer must turn into code Char next character State s 0 δ r 0,1,2,3,4,5, 6,7,8,9 All others while (Char EOF) State δ(state,char) perform specified action Char next character if (State is a final state ) then report success else report failure s 0 s 1 s 2 s 1 start s e error s e error s e error s 2 add s 2 add s e error s e error s e error Skeleton recognizer s e s e error s e error s e error Table encoding RE 45

What about a Tighter Specification? r Digit Digit * allows arbitrary numbers Accepts r00000 Accepts r99999 What if we want to limit it to r0 through r31? Write a tighter regular expression Register r ( (0 1 2) (Digit ε) (4 5 6 7 8 9) (3 30 31) ) Register r0 r1 r2 r31 r00 r01 r02 r09 Produces a more complex DFA Has more states Same cost per transition Same basic implementation 46

Tighter Register Specification (cont d) The DFA for Register r ( (0 1 2) (Digit ε) (4 5 6 7 8 9) (3 30 31) ) (0 1 2 9) 0,1,2 S 2 S 3 r 3 0,1 S 0 S 1 S 5 S 6 4,5,6,7,8,9 S 4 Accepts a more constrained set of registers Same set of actions, more states 47

Tighter Register Specification (cont d) δ r 0,1 2 3 4-9 All others s 0 s 1 s e s e s e s e s e s 1 s e s 2 s 2 s 5 s 4 s e s 2 s e s 3 s 3 s 3 s 3 s e s 3 s e s e s e s e s e s e s 4 s e s e s e s e s e s e s 5 s e s 6 s e s e s e s e Runs in the same skeleton recognizer s 6 s e s e s e s e s e s e s e s e s e s e s e s e s e Table encoding RE for the tighter register specification 48

Summary The Role of the Lexical Analyzer Partition input stream into tokens Regular Expressions Used to describe the structure of tokens DFA: Deterministic Finite Automata Machinery to recognize Regular Languages 49