Scanning. COMP 520: Compiler Design (4 credits) Professor Laurie Hendren.


COMP 520 Winter 2016 Scanning (1)
COMP 520: Compiler Design (4 credits)
Professor Laurie Hendren (hendren@cs.mcgill.ca)

COMP 520 Winter 2016 Scanning (2)
Readings
- Crafting a Compiler: Chapter 2, A simple compiler; Chapter 3, Scanning - Theory and Practice
- Modern Compiler Implementation in Java: Chapter 1, Introduction; Chapter 2, Lexical Analysis
- Flex tool: Manual - http://flex.sourceforge.net/manual/
- Reference book, Flex & bison - http://mcgill.worldcat.org/title/flex-bison/oclc/457179470

COMP 520 Winter 2016 Scanning (3) Background (1), from Crafting a Compiler

COMP 520 Winter 2016 Scanning (4) Background (2), from Crafting a Compiler

COMP 520 Winter 2016 Scanning (5) Background (3), from Crafting a Compiler

COMP 520 Winter 2016 Scanning (6)
Tokens are defined by regular expressions:
- ∅, the empty set: a language with no strings
- ε, the empty string
- a, where a ∈ Σ and Σ is our alphabet
- M | N, alternation: either M or N
- M N, concatenation: M followed by N
- M*, zero or more occurrences of M
where M and N are both regular expressions. What are M? and M+?

COMP 520 Winter 2016 Scanning (7)
We can write regular expressions for the tokens in our source language using standard POSIX notation:
- simple operators: "*", "/", "+", "-"
- parentheses: "(", ")"
- integer constants: 0|([1-9][0-9]*)
- identifiers: [a-zA-Z_][a-zA-Z0-9_]*
- white space: [ \t\n]+

COMP 520 Winter 2016 Scanning (8)
A scanner or lexer transforms a string of characters into a string of tokens:
- uses a combination of deterministic finite automata (DFA);
- plus some glue code to make it work;
- can be generated by tools like flex (or lex), JFlex, ...

COMP 520 Winter 2016 Scanning (9)
(figure: joos.l is fed to flex, producing lex.yy.c; gcc compiles that into a scanner, which turns foo.joos into tokens)

COMP 520 Winter 2016 Scanning (10)
How to go from regular expressions to DFAs? flex:
- accepts a list of regular expressions (regex);
- converts each regex internally to an NFA (Thompson construction);
- converts each NFA to a DFA (subset construction);
- may minimize the DFA (see Crafting a Compiler, Ch. 3; or Modern Compiler Implementation in Java, Ch. 2)

COMP 520 Winter 2016 Scanning (11) Regular Expressions to NFA (1) from text, Crafting a Compiler

COMP 520 Winter 2016 Scanning (12) Regular Expressions to NFA (2), from text, Crafting a Compiler

COMP 520 Winter 2016 Scanning (13) Regular Expressions to NFA (3), from text, Crafting a Compiler

COMP 520 Winter 2016 Scanning (14)
Some DFAs (figure: DFAs for the operators * / + -, the parentheses ( ), integer constants built from 0 and 1-9 / 0-9, identifiers built from a-zA-Z / a-zA-Z0-9, and white space built from space, \t, \n). Each DFA has an associated action.

COMP 520 Winter 2016 Scanning (15)
Let's assume we have a collection of DFAs, one for each lex rule:

    reg_expr1 -> DFA1
    reg_expr2 -> DFA2
    ...
    reg_exprn -> DFAn

How do we decide which regular expression should match the next characters to be scanned?

COMP 520 Winter 2016 Scanning (16)
Given DFAs D_1, ..., D_n, ordered by the input rule order, the behaviour of a flex-generated scanner on an input string is:

    while input is not empty do
        for each i: s_i := the longest prefix that D_i accepts
        l := max{ |s_i| }
        if l > 0 then
            j := min{ i : |s_i| = l }
            remove s_j from input
            perform the j-th action
        else (error case)
            move one character from input to output
        end
    end

- The longest initial substring match forms the next token, and it is subject to some action
- The first rule to match breaks any ties
- Non-matching characters are echoed back

COMP 520 Winter 2016 Scanning (17)
Why the longest match principle? Example: keywords

    [ \t]+        /* ignore */;
    ...
    import        return tIMPORT;
    ...
    [a-zA-Z_][a-zA-Z0-9_]*   {
        yylval.stringconst = (char *)malloc(strlen(yytext)+1);
        sprintf(yylval.stringconst, "%s", yytext);
        return tIDENTIFIER;
    }

We want to match importedfiles as tIDENTIFIER(importedfiles) and not as tIMPORT tIDENTIFIER(edfiles). Because we prefer longer matches, we get the right result.

COMP 520 Winter 2016 Scanning (18)
Why the first match principle? Again an example with keywords:

    [ \t]+        /* ignore */;
    ...
    continue      return tCONTINUE;
    ...
    [a-zA-Z_][a-zA-Z0-9_]*   {
        yylval.stringconst = (char *)malloc(strlen(yytext)+1);
        sprintf(yylval.stringconst, "%s", yytext);
        return tIDENTIFIER;
    }

We want to match continue foo as tCONTINUE tIDENTIFIER(foo) and not as tIDENTIFIER(continue) tIDENTIFIER(foo). The first match rule gives us the right answer: when both tCONTINUE and tIDENTIFIER match, prefer the first.

COMP 520 Winter 2016 Scanning (19)
When first longest match (flm) is not enough, look-ahead may help. FORTRAN allows for the following tokens: .EQ., 363, 363., .363
flm analysis of 363.EQ.363 gives us: tFLOAT(363.) E Q tFLOAT(.363)
What we actually want is: tINTEGER(363) tEQ tINTEGER(363)
flex allows us to use look-ahead, using /:

    363/".EQ."    return tINTEGER;

COMP 520 Winter 2016 Scanning (20)
Another example taken from FORTRAN; FORTRAN ignores whitespace:
1. DO5I = 1.25 is the same as DO5I=1.25; in C: do5i = 1.25;
2. DO 5 I = 1,25 is the same as DO5I=1,25; in C: for(i=1;i<=25;++i){...} (5 is interpreted as a line number here)
Case 1: flm analysis is correct: tID(DO5I) tEQ tREAL(1.25)
Case 2: we want: tDO tINT(5) tID(I) tEQ tINT(1) tCOMMA tINT(25)
We cannot make the decision on tDO until we see the comma; look-ahead comes to the rescue:

    DO/({letter}|{digit})*=({letter}|{digit})*,    return tDO;

COMP 520 Winter 2016 Scanning (21)
$ cat print_tokens.l

    /* flex source code: includes and other arbitrary C code */
    %{
    #include <stdio.h> /* for printf */
    %}
    /* helper definitions */
    DIGIT [0-9]
    /* regex + action rules come after the first %% */
    %%
    [ \t\n]+                 printf ("white space, length %i\n", yyleng);
    "*"                      printf ("times\n");
    "/"                      printf ("div\n");
    "+"                      printf ("plus\n");
    "-"                      printf ("minus\n");
    "("                      printf ("left parenthesis\n");
    ")"                      printf ("right parenthesis\n");
    0|([1-9]{DIGIT}*)        printf ("integer constant: %s\n", yytext);
    [a-zA-Z_][a-zA-Z0-9_]*   printf ("identifier: %s\n", yytext);
    %%
    /* user code comes after the second %% */
    main () { yylex (); }

COMP 520 Winter 2016 Scanning (22)
Using flex to create a scanner is really simple:

    $ emacs print_tokens.l
    $ flex print_tokens.l
    $ gcc -o print_tokens lex.yy.c -lfl

COMP 520 Winter 2016 Scanning (23)
When given the input a*(b-17) + 5/c:

    $ echo "a*(b-17) + 5/c" | ./print_tokens

our print_tokens scanner outputs:

    identifier: a
    times
    left parenthesis
    identifier: b
    minus
    integer constant: 17
    right parenthesis
    white space, length 1
    plus
    white space, length 1
    integer constant: 5
    div
    identifier: c
    white space, length 1

COMP 520 Winter 2016 Scanning (24)
Count lines and characters:

    %{
    int lines = 0, chars = 0;
    %}
    %%
    \n    lines++; chars++;
    .     chars++;
    %%
    main () {
        yylex ();
        printf ("#lines = %i, #chars = %i\n", lines, chars);
    }

COMP 520 Winter 2016 Scanning (25)
Remove vowels and increment integers:

    %{
    #include <stdlib.h> /* for atoi */
    #include <stdio.h>  /* for printf */
    %}
    %%
    [aeiouy]    /* ignore */
    [0-9]+      printf ("%i", atoi (yytext) + 1);
    %%
    main () { yylex (); }