Lexical Analysis. Textbook: Modern Compiler Design, Chapter 2.1.

http://www.cs.tau.ac.il/~msagiv/courses/wcc11-12.html

A motivating example: create a program that counts the number of lines in a given input text file.

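Solution (Flex)

    %{
    #include <stdio.h>
    int num_lines = 0;
    %}
    %option noyywrap
    %%
    \n      ++num_lines;    /* count every newline */
    .       ;               /* ignore any other character */
    %%
    int main(void) {
        yylex();
        printf("# of lines = %d\n", num_lines);
        return 0;
    }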

Solution (Flex): the same specification drawn as an automaton with a single state, initial, that loops on newline (action: ++num_lines) and on any other character (action: ;).
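For comparison, the same counting logic can be written by hand in plain C. This is only a minimal sketch: the function name count_lines and the choice to scan an in-memory buffer (rather than standard input, as the generated scanner does) are assumptions made for illustration.

```c
#include <stddef.h>

/* Count newline characters in a buffer, mirroring the two Flex rules:
   "\n" increments the counter, every other character is ignored.
   (count_lines is a hypothetical name, not part of Flex.) */
long count_lines(const char *text, size_t len)
{
    long num_lines = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        if (text[i] == '\n')
            ++num_lines;
    }
    return num_lines;
}
```

Calling count_lines("a\nb\nc", 5) counts two newlines. The Flex version states the same thing declaratively: one rule per character class, with the loop generated for you.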

JLex Spec File

    User code
      - Copied directly to Java file
      - Possible source of javac errors down the road
    %%
    JLex directives
      - Define macros, state names
      - Example: DIGIT= [0-9]
                 LETTER= [a-zA-Z]
    %%
    Lexical analysis rules
      - Optional state, regular expression, action
      - How to break input to tokens
      - Action when token matched
      - Example: <YYINITIAL> {LETTER}({LETTER}|{DIGIT})*

File: linecount
Jlex linecount

    import java_cup.runtime.*;
    %%
    %cup
    %{
      private int linecounter = 0;
    %}
    %eofval{
      System.out.println("line number=" + linecounter);
      return new Symbol(sym.EOF);
    %eofval}
    NEWLINE=\n
    %%
    {NEWLINE} { linecounter++; }
    [^{NEWLINE}] { }

Outline
- Roles of lexical analysis
- What is a token
- Regular expressions
- Lexical analysis
- Automatic Creation of Lexical Analysis
- Error Handling

Basic Compiler Phases

    Source program (string)
      -> Front-End:
           lexical analysis  -> Tokens
           syntax analysis   -> Abstract syntax tree
           semantic analysis -> Annotated Abstract syntax tree
      -> Back-End
      -> Fin. Assembly

Example Tokens

    Type     Examples
    ID       foo  n_14  last
    NUM      73  00  517  082
    REAL     66.1  .5  10.  1e67  5.5e-10
    IF       if
    COMMA    ,
    NOTEQ    !=
    LPAREN   (
    RPAREN   )

Example Non Tokens

    Type                     Examples
    comment                  /* ignored */
    preprocessor directive   #include <foo.h>
                             #define NUMS 5, 6
    macro                    NUMS
    whitespace               \t \n \b

Example

    void match0(char *s) /* find a zero */
    {
        if (!strncmp(s, "0.0", 3))
            return 0. ;
    }

is scanned into the token sequence

    VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN LBRACE
    IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
    RETURN REAL(0.0) SEMI RBRACE EOF

Lexical Analysis (Scanning)
- Input: program text (file)
- Output: sequence of tokens
- Read input file
- Identify language keywords and standard identifiers
- Handle include files and macros
- Count line numbers
- Remove whitespace
- Report illegal symbols
- [Produce symbol table]

Why Lexical Analysis
- Simplifies the syntax analysis
  - And language definition
- Modularity
- Reusability
- Efficiency

What is a token?
- Defined by the programming language
- Can be separated by spaces
- Smallest units
- Defined by regular expressions

A simplified scanner for C

    Token nexttoken()
    {
        int c;                      /* int, so that EOF can be represented */
    loop:
        c = getchar();
        switch (c) {
        case ' ': goto loop;        /* skip blanks */
        case ';': return SemiColumn;
        case '+':
            c = getchar();          /* one character of lookahead */
            switch (c) {
            case '+': return PlusPlus;
            case '=': return PlusEqual;
            default:  ungetc(c, stdin); return Plus;
            }
        case '<': /* ... */
        case 'w': /* ... */
        }
    }

Regular Expressions

    Basic pattern   Matching
    x               The character x
    .               Any character except newline
    [xyz]           Any of the characters x, y, z
    R?              An optional R
    R*              Zero or more occurrences of R
    R+              One or more occurrences of R
    R1R2            R1 followed by R2
    R1|R2           Either R1 or R2
    (R)             R itself

Escape characters in regular expressions
- \ converts a single operator into text: a\+ and (a\+\*)+
- Double quotes surround text: "a+*"+
- Esthetically ugly
- But standard

Ambiguity Resolving
- Find the longest matching token
- Between two tokens with the same length, use the one declared first

The Lexical Analysis Problem
- Given
  - A set of token descriptions, each with a token name and a regular expression
  - An input string
- Partition the string into tokens (class, value)
- Ambiguity resolution
  - The longest matching token
  - Between two equal-length tokens, select the first
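The two ambiguity rules can be sketched directly in C. The sketch below is illustrative only: the three token classes (IF, ID, NUM), the hand-written matcher functions, and the name classify are assumptions chosen for the example, not part of any scanner generator. Each matcher reports the length of the longest prefix it accepts, and the rules are tried in declaration order.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_IF, TOK_ID, TOK_NUM, TOK_ERROR } TokenClass;

/* Each matcher returns the length of the longest prefix of s that it
   accepts, or 0 if it accepts nothing. */
static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)          /* [a-z][a-z0-9]* */
{
    size_t n;
    if (!islower((unsigned char)s[0]))
        return 0;
    n = 1;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_num(const char *s)         /* [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

typedef size_t (*Matcher)(const char *);

/* Rules in declaration order: earlier entries win ties. */
static const struct { TokenClass cls; Matcher match; } rules[] = {
    { TOK_IF, match_if }, { TOK_ID, match_id }, { TOK_NUM, match_num },
};

/* Return the class of the token at the start of s and store its length
   in *len. TOK_ERROR with *len == 0 means no rule matched. */
TokenClass classify(const char *s, size_t *len)
{
    TokenClass best = TOK_ERROR;
    size_t i;
    *len = 0;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) {        /* strictly longer: ties keep the earlier rule */
            *len = n;
            best = rules[i].cls;
        }
    }
    return best;
}
```

On the input "if" both the IF and ID rules match two characters, so the earlier rule (IF) wins; on "iffy" the ID rule matches four characters and wins by length. Running every matcher at every position is slow, which is one motivation for compiling all rules into a single automaton.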

A Jlex specification of C Scanner

    import java_cup.runtime.*;
    %%
    %cup
    %{
      private int linecounter = 0;
    %}
    Letter= [a-zA-Z_]
    Digit= [0-9]
    %%
    "\t" { }
    "\n" { linecounter++; }
    ";" { return new Symbol(sym.SemiColumn); }
    "++" { return new Symbol(sym.PlusPlus); }
    "+=" { return new Symbol(sym.PlusEq); }
    "+" { return new Symbol(sym.Plus); }
    "while" { return new Symbol(sym.While); }
    {Letter}({Letter}|{Digit})* { return new Symbol(sym.Id, yytext()); }
    "<=" { return new Symbol(sym.LessOrEqual); }
    "<" { return new Symbol(sym.LessThan); }

Jlex
- Input: regular expressions and actions (Java code)
- Output: a scanner program that reads the input and applies actions when an input regular expression is matched

    regular expressions -> Jlex -> scanner
    input program -> scanner -> tokens

How to Implement Ambiguity Resolving
- Between two tokens with the same length, use the one declared first
- Find the longest matching token

Pathological Example

    if                                  { return IF; }
    [a-z][a-z0-9]*                      { return ID; }
    [0-9]+                              { return NUM; }
    [0-9]"."[0-9]*|[0-9]*"."[0-9]+      { return REAL; }
    (\-\-[a-z]*\n)|(" "|\n|\t)          { ; }
    .                                   { error(); }

    int edges[][256] = {
                      /* ..., 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
      /* state 0 */   {0, ..., 0, 0, ..., 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0},
      /* state 1 */   {13, ..., 7, 7, 7, 7, ..., 9, 4, 4, 4, 4, 2, 4, ..., 13, 13},
      /* state 2 */   {0, ..., 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0},
      /* state 3 */   {0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0},
      /* state 4 */   {0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0},
      /* state 5 */   {0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
      /* state 6 */   {0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
      /* state 7 */   ...
      /* state 13 */  {0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0}
    };

Pseudo Code for Scanner

    Token nexttoken()
    {
        lastfinal = 0;
        currentstate = 1;
        inputpositionatlastfinal = input;
        currentposition = input;
        while (not(isdead(currentstate))) {
            nextstate = edges[currentstate][*currentposition];
            if (isfinal(nextstate)) {
                lastfinal = nextstate;
                inputpositionatlastfinal = currentposition;
            }
            currentstate = nextstate;
            advance currentposition;
        }
        input = inputpositionatlastfinal;
        return action[lastfinal];
    }
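The pseudo code can be made concrete with a small table-driven scanner in C. This is a sketch under stated assumptions, not the automaton from the slides: it covers only IF, ID, NUM and a simplified REAL rule ([0-9]+"."[0-9]+), its state numbering is invented for the example, and the names build_edges and next_token are hypothetical. State 0 is the dead state, as in the slides. Unlike the pseudo code, the position is recorded after advancing, so that it points just past the accepted token.

```c
#include <stddef.h>

enum { DEAD = 0, START, S_I, S_IF, S_ID, S_NUM, S_DOT, S_REAL, NSTATES };
typedef enum { T_NONE, T_IF, T_ID, T_NUM, T_REAL } Token;

/* Zero-initialised, so every edge not set below leads to DEAD. */
static unsigned char edges[NSTATES][256];

/* Token returned for each state; T_NONE marks non-final states. */
static const Token action[NSTATES] = {
    T_NONE, T_NONE, T_ID, T_IF, T_ID, T_NUM, T_NONE, T_REAL
};

static void set_range(int state, int lo, int hi, int target)
{
    int c;
    for (c = lo; c <= hi; c++)      /* assumes contiguous ranges, as in ASCII */
        edges[state][c] = (unsigned char)target;
}

/* Fill the transition table; must be called once before next_token. */
void build_edges(void)
{
    int s;
    set_range(START, 'a', 'z', S_ID);
    edges[START]['i'] = S_I;
    set_range(START, '0', '9', S_NUM);
    for (s = S_I; s <= S_ID; s++) {         /* identifier-like states */
        set_range(s, 'a', 'z', S_ID);
        set_range(s, '0', '9', S_ID);
    }
    edges[S_I]['f'] = S_IF;
    set_range(S_NUM, '0', '9', S_NUM);
    edges[S_NUM]['.'] = S_DOT;
    set_range(S_DOT, '0', '9', S_REAL);
    set_range(S_REAL, '0', '9', S_REAL);
}

/* Scan one token starting at *input. On return, *input points just past
   the longest match; T_NONE means nothing matched and *input is unchanged. */
Token next_token(const char **input)
{
    int state = START, last_final = DEAD;
    const char *pos = *input, *pos_at_last_final = *input;
    for (;;) {
        int next = edges[state][(unsigned char)*pos];
        if (next == DEAD)
            break;                          /* no further progress possible */
        state = next;
        pos++;
        if (action[state] != T_NONE) {      /* remember the last final state */
            last_final = state;
            pos_at_last_final = pos;
        }
    }
    *input = pos_at_last_final;             /* back up to the last final state */
    return action[last_final];
}
```

After build_edges(), scanning "12.x" visits the non-final state S_DOT, fails on 'x', and backs up to return NUM with the input left at the '.'. This is the same backing-up behaviour that the slides trace for the "--" prefix. A caller has to handle T_NONE itself, for example by reporting an error and skipping one character.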

Example input: "if --not-a-com"

    final   state   input
    0       1       "if --not-a-com"
    2       2       "if --not-a-com"
    3       3       "if --not-a-com"
    3       0       "if --not-a-com"
    Result: return IF

    final   state   input
    0       1       " --not-a-com"
    12      12      " --not-a-com"
    12      0       " --not-a-com"
    Result: found whitespace

    final   state   input
    0       1       "--not-a-com"
    9       9       "--not-a-com"
    9       10      "--not-a-com"
    9       10      "--not-a-com"
    9       10      "--not-a-com"
    9       0       "--not-a-com"
    Result: error

    final   state   input
    0       1       "-not-a-com"
    9       9       "-not-a-com"
    9       0       "-not-a-com"
    Result: error

Efficient Scanners
- Efficient state representation
- Input buffering
- Using switch and gotos instead of tables

Constructing Automaton from Specification
- Create a non-deterministic automaton (NDFA) from every regular expression
- Merge all the automata using epsilon moves (like the | construction)
- Construct a deterministic finite automaton (DFA)
  - State priority
- Minimize the automaton, starting with separate accepting states

NDFA Construction

    if                { return IF; }
    [a-z][a-z0-9]*    { return ID; }
    [0-9]+            { return NUM; }

DFA Construction

Minimization

Start States
- It may be hard to specify regular expressions for certain constructs
  - Examples: strings, comments
- Writing automata may be easier
- Can combine both
- Specify partial automata with regular expressions on the edges
  - No need to specify all states
  - Different actions at different states
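The idea of start states can be illustrated by hand in C with an explicit state variable: the same input characters trigger different actions depending on the current state. The sketch below is an assumption-laden illustration (the names StartState and strip_comments are hypothetical); it handles only non-nested /* ... */ comments and does not treat string literals specially.

```c
#include <stddef.h>

typedef enum { NORMAL, IN_COMMENT } StartState;

/* Copy src to dst while dropping non-nested C comments.
   dst must have room for strlen(src) + 1 bytes.
   Returns the number of complete comments removed. */
int strip_comments(const char *src, char *dst)
{
    StartState state = NORMAL;
    int comments = 0;
    while (*src != '\0') {
        if (state == NORMAL) {
            if (src[0] == '/' && src[1] == '*') {
                state = IN_COMMENT;         /* enter the comment state */
                src += 2;
            } else {
                *dst++ = *src++;            /* ordinary text: copy it */
            }
        } else {
            if (src[0] == '*' && src[1] == '/') {
                state = NORMAL;             /* leave the comment state */
                comments++;
                src += 2;
            } else {
                src++;                      /* comment text: ignore it */
            }
        }
    }
    *dst = '\0';
    return comments;
}
```

With the input "a/* x */b" the function writes "ab" and returns 1. In a scanner generator the two states would be declared as start states, and each rule would be prefixed with the state in which it applies.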

Missing
- Creating a lexical analysis by hand
- Table compression
- Symbol Tables
- Nested Comments
- Handling Macros

Summary
- For most programming languages, lexical analyzers can be easily constructed automatically
  - Exceptions: Fortran, PL/1
- Lex/Flex/Jlex are useful beyond compilers