Compilers
Prof. Mohamed Hamada, Software Engineering Lab., The University of Aizu, Japan



Lexical Analyzer (Scanner)

A lexical analyzer (scanner):
1. Uses regular expressions to define tokens
2. Uses finite automata to recognize tokens

[Diagram: the syntax analyzer requests the next token from the lexical analyzer; the lexical analyzer reads the source program character by character ("get next char") and returns the next token; both consult the symbol table, which contains a record for each identifier.]

Token: the smallest meaningful sequence of characters of interest in the source program.
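The request/return loop between scanner and parser can be sketched in a few lines. This is an illustrative Python sketch, not code from the lecture; the token classes (ID, NUM, OP) are deliberately simplified:

```python
# Sketch of the scanner/parser interface: the parser requests the next
# token; the scanner reads characters and returns one token at a time.
def scanner(source):
    """Yield (kind, lexeme) pairs, skipping white space (a non-token)."""
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                      # white space: skip, no token
            i += 1
        elif ch.isalpha():                    # identifier: letter+
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            yield ("ID", source[i:j])
            i = j
        elif ch.isdigit():                    # numeric literal: digit+
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            yield ("NUM", source[i:j])
            i = j
        else:                                 # single-character operator
            yield ("OP", ch)
            i += 1

# The syntax analyzer pulls tokens on demand ("get next token"):
tokens = list(scanner("x = y + 42"))
```

Because `scanner` is a generator, the parser drives the process: each request for a token advances the scan, mirroring the "get next token" arrow in the diagram.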

What is a token?

Examples of tokens:
- Operators: = + - > ( { := == <>
- Keywords: if while for int double
- Identifiers: such as pi in the program fragment const pi=3.14;
- Numeric literals: 43 6.035 -3.6e10 0x13F3A
- Character literals: 'a' '~' '\''
- String literals: "6.891" "Fall 98" "" (the empty string)
- Punctuation symbols: such as comma and semicolon

Examples of non-tokens:
- White space: space ( ), tab (\t), end-of-line (\n)
- Comments: /* this is not a token */

How does the scanner recognize a token? General approach:
1. Build a deterministic finite automaton (DFA) from a regular expression E.
2. Execute the DFA to determine whether an input string belongs to L(E).
Note: the DFA construction is done automatically by a tool such as lex.
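As a minimal sketch of step 2 (executing the DFA), here is a hand-built transition table for the identifier pattern letter(letter|digit)*. The table itself is illustrative, not from the slides:

```python
# Step 2 of the general approach: execute a DFA on an input string.
# This hand-built DFA accepts identifiers of the form letter(letter|digit)*.

def classify(ch):
    """Map a character to the DFA's input alphabet."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

DFA = {
    0: {"letter": 1},                # state 0: start
    1: {"letter": 1, "digit": 1},    # state 1: accepting, loops on letter/digit
}
ACCEPTING = {1}

def accepts(s):
    state = 0
    for ch in s:
        state = DFA.get(state, {}).get(classify(ch))
        if state is None:            # no transition: the string is rejected
            return False
    return state in ACCEPTING
```

For example, accepts("x1") is True while accepts("1x") is False, since the first character must be a letter.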

Regular Definitions
Regular definitions are regular expressions associated with suitable names. For example, the set of identifiers in Java can be expressed by the following regular definition:
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | 2 | ... | 9
id → letter (letter | digit)*
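The same regular definition can be checked directly with Python's re module. A sketch (not from the lecture), using [A-Za-z] and [0-9] for letter and digit:

```python
import re

# The regular definition above as one Python regular expression:
#   letter = [A-Za-z], digit = [0-9], id = letter (letter | digit)*
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")   # \Z anchors the full string

def is_id(s):
    return ID.match(s) is not None
```

Here is_id("pi") and is_id("x9") hold, but is_id("9x") does not, because the first character must be a letter.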

Regular Definition Notations
1. The + symbol denotes one or more instances
2. The ? symbol denotes zero or one instance
3. The [ ] symbols denote character classes
Example: the following regular definitions represent unsigned numbers in C:
digit → [0-9]
digits → digit+
fraction → (. digits)?
exponent → (E (+ | -)? digits)?
number → digits fraction exponent
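Putting these pieces together, the unsigned-number definition becomes a single pattern. A Python sketch (not from the lecture; \Z forces a full-string match):

```python
import re

# The regular definition above as one pattern:
#   digits   = [0-9]+
#   fraction = (. digits)?
#   exponent = (E(+|-)? digits)?
#   number   = digits fraction exponent
NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?\Z")

def is_number(s):
    return NUMBER.match(s) is not None
```

This accepts 6336, 3.14, and 6.336E4, but rejects .5 and 3., since the definition requires digits before the point and, if a point appears, digits after it.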

How to Parse a Regular Expression?
Given a regular expression, how do we generate an automaton to recognize its tokens? Create an NFA and convert it to a DFA. Using the three simple construction rules presented earlier, it is easy to generate an NFA from any regular expression. Given the resulting DFA, we can build a scanner that recognizes the longest prefix of the input that is a valid token.
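One way to see why the conversion works: an NFA can be simulated directly by tracking the set of currently reachable states, and those state sets are exactly what the NFA-to-DFA conversion turns into single DFA states. A Python sketch over a small made-up NFA for a*b (no ε-moves, so ε-closure is omitted):

```python
# Simulating an NFA by tracking the set of reachable states.
# Made-up example NFA for a*b:
#   state 0 --a--> 0,  state 0 --b--> 1,  state 1 accepting

NFA = {
    (0, "a"): {0},
    (0, "b"): {1},
}
START, ACCEPT = {0}, {1}

def nfa_accepts(s):
    states = set(START)
    for ch in s:
        nxt = set()
        for q in states:
            nxt |= NFA.get((q, ch), set())
        states = nxt                 # this set plays the role of one DFA state
    return bool(states & ACCEPT)
```

For instance, nfa_accepts("aab") is True and nfa_accepts("aba") is False.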

Regular expressions for some tokens:
if                                    {return IF;}
[a-z][a-z0-9]*                        {return ID;}
[0-9]+                                {return NUM;}
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   {return REAL;}
("--"[a-z]*"\n")|(" "|"\n"|"\t")+     {/* do nothing */}
.                                     {error();}
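A scanner built from such a rule list applies two conventions: the longest match wins, and among equal-length matches the earlier rule wins (so the keyword if beats the identifier pattern). A Python sketch of this maximal-munch loop, not from the lecture, with the comment rule omitted for brevity:

```python
import re

# Ordered rules mirroring the list above ("if" before the ID pattern,
# so the keyword wins ties of equal length).
RULES = [
    (re.compile(r"if"), "IF"),
    (re.compile(r"[a-z][a-z0-9]*"), "ID"),
    (re.compile(r"[0-9]+\.[0-9]*|[0-9]*\.[0-9]+"), "REAL"),
    (re.compile(r"[0-9]+"), "NUM"),
    (re.compile(r"[ \t\n]+"), None),          # white space: do nothing
]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        best_len, best_tok = 0, None
        for pattern, tok in RULES:            # earlier rule wins ties
            m = pattern.match(text, i)
            if m and m.end() - i > best_len:
                best_len, best_tok = m.end() - i, tok
        if best_len == 0:
            raise ValueError("lexical error at position %d" % i)
        if best_tok is not None:              # None = no token (white space)
            tokens.append((best_tok, text[i:i + best_len]))
        i += best_len
    return tokens
```

So tokenize("if x1 3.14") yields IF, ID, REAL, while tokenize("iffy") yields a single ID, because the longest match beats the keyword rule.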

Building Finite Automata for Lexical Tokens
if  {return IF;}
The NFA for the symbol i: state 1 --i--> state 2 (accepting).
The NFA for the symbol f: state 1 --f--> state 2 (accepting).
The NFA for the regular expression if: state 1 --i--> state 2 --f--> state 3 (accepting, action IF).

Building Finite Automata for Lexical Tokens
[a-z][a-z0-9]*  {return ID;}
NFA: state 1 --[a-z]--> state 2 (accepting, ID), with a loop on state 2 for [a-z0-9].

Building Finite Automata for Lexical Tokens
[0-9]+  {return NUM;}
NFA: state 1 --[0-9]--> state 2 (accepting, NUM), with a loop on state 2 for [0-9].

Building Finite Automata for Lexical Tokens
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)  {return REAL;}
NFA: two branches from the start state, one matching [0-9]+ then "." then [0-9]*, the other matching [0-9]* then "." then [0-9]+; both end in an accepting state (REAL).

Building Finite Automata for Lexical Tokens
("--"[a-z]*"\n")|(" "|"\n"|"\t")+  {/* do nothing */}
NFA: one branch matches "--" followed by any number of [a-z] and a newline (a comment); the other matches one or more blanks, tabs, or newlines (white space); both end in an accepting state with no action.

Building Finite Automata for Lexical Tokens
[Figure: the NFAs for IF, ID, NUM, REAL, white space, and error combined into a single NFA with a common start state.]

Conversion of NFA into DFA

[Figure: the combined NFA for IF, ID, NUM, and error, stepped through over several slides.]

The DFA start state is the set of NFA states active before any input is read: 1-4-9-14. From it:
- on a-h or j-z, the DFA moves to 5-6-8-15 (accepting, ID);
- on i, it moves to 2-5-6-8-15 (accepting, ID);
- on 0-9, it moves to 10-11-13-15 (accepting, NUM);
- on any other character, it moves to 15 (accepting, error).
The analysis for 1-4-9-14 is then complete. We mark it and pick another unmarked DFA state to analyze, continuing until every DFA state has been processed.
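The state-set bookkeeping done by hand on these slides is the subset construction, and it is easy to mechanize. A Python sketch (not from the lecture) over a small made-up NFA without ε-moves; with ε-moves, each set would be replaced by its ε-closure:

```python
# The subset construction: each DFA state is a set of NFA states, just
# like the merged states (1-4-9-14, 2-5-6-8-15, ...) built by hand on
# the slides. This toy NFA accepts strings over {a, b} ending in "ab".

NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
ALPHABET = ("a", "b")
START, ACCEPT = 0, {2}

def subset_construction():
    """Return the DFA transition table: frozenset -> {char: frozenset}."""
    start = frozenset({START})
    dfa, todo = {}, [start]
    while todo:                      # analyze one unmarked DFA state at a time
        S = todo.pop()
        if S in dfa:                 # already marked: analysis complete
            continue
        dfa[S] = {}
        for ch in ALPHABET:
            T = frozenset(q2 for q in S for q2 in NFA.get((q, ch), ()))
            if T:
                dfa[S][ch] = T
                todo.append(T)       # a new (possibly unmarked) DFA state
    return dfa
```

The resulting DFA has three states, {0}, {0,1}, and {0,2}; the last is accepting because it contains NFA state 2.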

The corresponding DFA

[Figure: the resulting DFA. From the start state 1-4-9-14: i leads to 2-5-6-8-15 (ID); a-h and j-z lead to 5-6-8-15 (ID); 0-9 leads to 10-11-13-15 (NUM); anything else leads to the error state 15. From 2-5-6-8-15, f leads to 3-6-7-8 (IF), while a-e, g-z, and 0-9 lead to 6-7-8 (ID). From 10-11-13-15, further digits lead to 11-12-13 (NUM).]


How to write a scanner? General approach: the construction is done automatically by a tool such as the Unix program lex.
1. Using the source language grammar, write a simple lex specification and save it in a file, e.g. lex.l.
2. Run the Unix program lex on lex.l, producing a C scanner program named lex.yy.c.
3. Compile and link lex.yy.c in the normal way, producing the required scanner.

Using Lex
[Diagram: the Lex source program mylex.l goes through the Lex compiler to produce lex.yy.c; lex.yy.c goes through the C compiler to produce the scanner a.out; the source program goes through a.out to produce a sequence of tokens.]

Lex
In addition to compilers and interpreters, Lex has been used in many other software applications:
1. The desktop calculator bc
2. The tools eqn and pic (used for mathematical equations and complex pictures)
3. PCC (Portable C Compiler), used with many UNIX systems
4. GCC (GNU C Compiler), used with many UNIX systems
5. A menu compiler
6. An SQL database language syntax checker
7. The Lex program itself
And many more.

Lex program specification
A Lex program consists of the following three parts:
declarations
%%
translation rules
%%
user subroutines (auxiliary procedures)
The first %% is required to mark the beginning of the translation rules; the second %% is required only if user subroutines follow.

Lex program specification
Declarations: include variables, constants, and regular definitions.
Translation rules are statements of the form:
p1 {action1}
p2 {action2}
...
pn {actionn}
where each pi is a regular expression and each actioni is a program fragment describing what action the lexical analyzer should take when pattern pi matches a lexeme. For example, pi may be the pattern if, and the corresponding actioni is {return(IF);}.

Lex program specification
How to compile and run a Lex specification:
First, use an editor (for example mule) to create your Lex specification and save it under any name with the extension .l (for example mylexprogram.l).
Next, compile the specification with the UNIX lex command, which automatically generates the C program lex.yy.c.
Finally, use the UNIX C compiler cc to compile and link lex.yy.c:
% lex mylexprogram.l
% cc lex.yy.c -o first -ll

Lex program specification
Example 1: a simple verb recognizer
verb → is | am | was | do | does | has | have
The following is a lex program for the tokens of the grammar:

%{
/* This is a very simple Lex program recognizing a few verbs */
%}
%%
[ \t]+          /* ignore white space */ ;
is |
am |
was |
do |
does |
has |
have            {printf("%s: is a verb\n", yytext);}
[a-zA-Z]+       {printf("%s: is other\n", yytext);}
%%
main()
{
    yylex();
}

Lex program specification
Example 2: consider the following grammar:
statement → if expression then statement
          | if expression then statement else statement
expression → term relop term
          | term
term → id
     | number
with the following regular definitions:
letter → [A-Za-z]
digit → [0-9]
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
number → digit+ (. digit+)? (E(+|-)? digit+)?

Example 2 Lex program specification
The following is a lex program for the tokens of the grammar:

%{
/* Here are the definitions of the constants
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = install_id(); return(ID);}
{number} {yylval = install_num(); return(NUMBER);}
"<"      {yylval = LT; return(RELOP);}
"<="     {yylval = LE; return(RELOP);}
"="      {yylval = EQ; return(RELOP);}
"<>"     {yylval = NE; return(RELOP);}
">"      {yylval = GT; return(RELOP);}
">="     {yylval = GE; return(RELOP);}
%%
install_id() {/* code to put the id lexeme into the symbol table */}
install_num() {/* code to put the number lexeme into the symbol table */}