Scanning. COMP 520: Compiler Design (4 credits) Alexander Krolik MWF 13:30-14:30, MD 279


COMP 520 Winter 2017 Scanning (1)
COMP 520: Compiler Design (4 credits)
Alexander Krolik, alexander.krolik@mail.mcgill.ca
MWF 13:30-14:30, MD 279

COMP 520 Winter 2017 Scanning (2) Announcements (Friday, January 6th)
Facebook group: useful for discussions/announcements; link on mycourses or in email.
Milestones: continue picking your group (3 recommended); create a GitHub account, learn git as needed.
Midterm: either the 1st or 2nd week after break, on the Friday; a 1.5 hour in-class midterm, so either 30 minutes before or after class. Tentative date: Friday, March 10th, or the week after? Thoughts?

COMP 520 Winter 2017 Scanning (3) Readings
Textbook, Crafting a Compiler: Chapter 2: A Simple Compiler; Chapter 3: Scanning Theory and Practice
Modern Compiler Implementation in Java: Chapter 1: Introduction; Chapter 2: Lexical Analysis
Flex tool: Manual - https://github.com/westes/flex
Reference book, flex & bison - http://mcgill.worldcat.org/title/flex-bison/oclc/457179470

COMP 520 Winter 2017 Scanning (4) Scanning: also called lexical analysis; it is the first phase of a compiler; it takes an arbitrary source file and identifies meaningful character sequences. Note: at this point we do not have any semantic or syntactic information. Overall: a scanner transforms a string of characters into a string of tokens.

COMP 520 Winter 2017 Scanning (5) An example. Source:
var a = 5
if (a == 5) {
print "success"
}
Token stream:
tvar tidentifier: a tassign tinteger: 5
tif tlparen tidentifier: a tequals tinteger: 5 trparen tlbrace
tidentifier: print tstring: success
trbrace

COMP 520 Winter 2017 Scanning (6) Review of COMP 330: Σ is an alphabet, a (usually finite) set of symbols; a word is a finite sequence of symbols from an alphabet; Σ* is the set consisting of all possible words using symbols from Σ; a language is a subset of Σ*.
An example:
alphabet: Σ = {0,1}
words: {ε, 0, 1, 00, 01, 10, 11, ..., 0001, 1000, ...}
language: {1, 10, 100, 1000, 10000, 100000, ...}: 1 followed by any number of zeros
language: {0, 1, 1000, 0011, 11111100, ...}: ?!

COMP 520 Winter 2017 Scanning (7) A regular expression: is a string that defines a language (set of strings); in fact, a regular language. A regular language: is a language that can be accepted by a DFA; is a language for which a regular expression exists.

COMP 520 Winter 2017 Scanning (8) In a scanner, tokens are defined by regular expressions:
∅ is a regular expression [the empty set: a language with no strings]
ε is a regular expression [the empty string]
a, where a ∈ Σ, is a regular expression [Σ is our alphabet]
if M and N are regular expressions, then M | N is a regular expression [alternation: either M or N]
if M and N are regular expressions, then M N is a regular expression [concatenation: M followed by N]
if M is a regular expression, then M* is a regular expression [zero or more occurrences of M]
What are M? and M+?

COMP 520 Winter 2017 Scanning (9) Examples of regular expressions, over the alphabet Σ = {a,b}:
a* = {ε, a, aa, aaa, aaaa, ...}
(ab)* = {ε, ab, abab, ababab, ...}
(a|b)* = {ε, a, b, aa, bb, ab, ba, ...}
a*ba* = strings with exactly 1 b
(a|b)*b(a|b)* = strings with at least 1 b

COMP 520 Winter 2017 Scanning (10) We can write regular expressions for the tokens in our source language using standard POSIX notation:
simple operators: "*", "/", "+", "-"
parentheses: "(", ")"
integer constants: 0|([1-9][0-9]*)
identifiers: [a-zA-Z_][a-zA-Z0-9_]*
white space: [ \t\n]+
[...] defines a character class: it matches a single character from a set; allows ranges of characters to be alternated; and can be negated using ^ (i.e. [^\n]).
The wildcard character: is represented as . (dot); and matches all characters except newlines by default (in most implementations).

COMP 520 Winter 2017 Scanning (11) A scanner: can be generated using tools like flex (or lex), JFlex,... ; by defining regular expressions for each type of token. Internally, a scanner or lexer: uses a combination of deterministic finite automata (DFA); plus some glue code to make it work.

COMP 520 Winter 2017 Scanning (12) A finite state machine (FSM): represents a set of possible states for a system; uses transitions to link related states. A deterministic finite automaton (DFA): is a machine which recognizes regular languages; for an input sequence of symbols, the automaton either accepts or rejects the string; it works deterministically - that is, given some input, there is only one sequence of steps.

COMP 520 Winter 2017 Scanning (13) Background (DFAs) from textbook, Crafting a Compiler

COMP 520 Winter 2017 Scanning (14) DFAs (for the previous example regexes)
[DFA diagrams for the operator ("*", "/", "+", "-"), parenthesis, integer-constant (0, 1-9, 0-9), identifier (a-zA-Z_, a-zA-Z0-9_), and whitespace ([ \t\n]) tokens]

COMP 520 Winter 2017 Scanning (15) Try it yourself:
Design a DFA matching binary strings divisible by 3. Use only 3 states.
Design a regular expression for floating point numbers of the form {1., 1.1, .1} (a digit on at least one side of the decimal point).
Design a DFA for the language above.

COMP 520 Winter 2017 Scanning (16) Background (Scanner Table) from textbook, Crafting a Compiler

COMP 520 Winter 2017 Scanning (17) Background (Scanner Algorithm) from textbook, Crafting a Compiler

COMP 520 Winter 2017 Scanning (18) A non-deterministic finite automaton (NFA): is a machine which recognizes regular languages; for an input sequence of symbols, the automaton either accepts or rejects the string; it works non-deterministically - that is, given some input, there is potentially more than one path; an NFA accepts a string if at least one path leads to an accept. Note: DFAs and NFAs are equally powerful.

COMP 520 Winter 2017 Scanning (19) Regular Expressions to NFA (1) from textbook, Crafting a Compiler

COMP 520 Winter 2017 Scanning (20) Regular Expressions to NFA (2) from textbook, Crafting a Compiler

COMP 520 Winter 2017 Scanning (21) Regular Expressions to NFA (3) from textbook, Crafting a Compiler

COMP 520 Winter 2017 Scanning (22) How to go from regular expressions to DFAs? 1. flex accepts a list of regular expressions (regex); 2. converts each regex internally to an NFA (Thompson construction); 3. converts each NFA to a DFA (subset construction); 4. may minimize the DFA. See "Crafting a Compiler", Chapter 3; or "Modern Compiler Implementation in Java", Chapter 2.

COMP 520 Winter 2017 Scanning (23) What you should know: 1. Understand the definition of a regular language, whether that be: prose, regular expression, DFA, or NFA. 2. Given the definition of a regular language, construct either a regular expression or an automaton. What you do not need to know: 1. Specific algorithms for converting between regular language definitions. 2. DFA minimization

COMP 520 Winter 2017 Scanning (24) Let's assume we have a collection of DFAs, one for each lex rule:
reg_expr1 -> DFA1
reg_expr2 -> DFA2
...
reg_exprn -> DFAn
How do we decide which regular expression should match the next characters to be scanned?

COMP 520 Winter 2017 Scanning (25) Given DFAs D_1, ..., D_n, ordered by the input rule order, the behaviour of a flex-generated scanner on an input string is:

while input is not empty do
    s_i := the longest prefix that D_i accepts
    l := max{|s_i|}
    if l > 0 then
        j := min{i : |s_i| = l}
        remove s_j from input
        perform the j-th action
    else (error case)
        move one character from input to output
    end
end

The longest initial substring match forms the next token, and it is subject to some action. The first rule to match breaks any ties. Non-matching characters are echoed back.

COMP 520 Winter 2017 Scanning (26) Why the longest match principle? Example: keywords.

import                  return timport;
[a-zA-Z_][a-zA-Z0-9_]*  return tidentifier;

Given the string importedfiles, we want the token output of the scanner to be
tidentifier(importedfiles)
and not
timport tidentifier(edfiles)
Because we prefer longer matches, we get the right result.

COMP 520 Winter 2017 Scanning (27) Why the first match principle? Example: keywords.

continue                return tcontinue;
[a-zA-Z_][a-zA-Z0-9_]*  return tidentifier;

Given the string continue foo, we want the token output of the scanner to be
tcontinue tidentifier(foo)
and not
tidentifier(continue) tidentifier(foo)
The first match rule gives us the right answer: when both tcontinue and tidentifier match, prefer the first.

COMP 520 Winter 2017 Scanning (28) When first longest match (flm) is not enough, look-ahead may help. FORTRAN allows for the following tokens: .EQ., 363, 363., .363
flm analysis of 363.EQ.363 gives us:
tfloat(363.) E Q tfloat(.363)
What we actually want is:
tinteger(363) teq tinteger(363)
To distinguish between a tfloat and a tinteger followed by a ".", flex allows us to use look-ahead, using /:
363/.EQ.    return tinteger;
A look-ahead matches on the full pattern, but only processes the characters before the /. All subsequent characters are returned to the input stream for further matches.

COMP 520 Winter 2017 Scanning (29) Another example taken from FORTRAN, which ignores whitespace:
1. DO5I = 1.25 and DO5I=1.25: in C, these are equivalent to an assignment: do5i = 1.25;
2. DO 5 I = 1,25 and DO5I=1,25: in C, these are equivalent to looping: for(i=1;i<=25;++i){...} (5 is interpreted as a line number here)
To get the correct token output:
1. flm analysis is correct: tid(do5i) teq treal(1.25)
2. flm analysis gives the incorrect result. What we want is: tdo tint(5) tid(i) teq tint(1) tcomma tint(25)
But we cannot make a decision on tdo until we see the comma; look-ahead comes to the rescue:
DO/({letter}|{digit})*=({letter}|{digit})*,    return tdo;

COMP 520 Winter 2017 Scanning (30) Announcements (Monday, January 9th)
Facebook group: useful for discussions/announcements; link on mycourses or in email.
Milestones: learn flex, bison, SableCC; Assignment 1 out Wednesday; continue forming your groups.
Midterm: Friday, March 17th; a 1.5 hour in-class midterm. You have the option of either 13:00-14:30 or 13:30-15:00.

COMP 520 Winter 2017 Scanning (31) Introduce yourselves! (no, not joking) Name Major/year If grad student, research area Any other fun facts we should know...

COMP 520 Winter 2017 Scanning (32) In practice, we use tools to generate scanners. Using flex:
joos.l --(flex)--> lex.yy.c --(gcc)--> scanner
foo.joos --(scanner)--> tokens

COMP 520 Winter 2017 Scanning (33) A flex file: is used to define a scanner implementation; has 3 main sections divided by %%: 1. declarations, helper code; 2. regular expression rules and associated actions; 3. user code. It saves much effort in compiler design.

%{
/* includes and other arbitrary C code, copied to the scanner verbatim */
%}
/* helper definitions */
DIGIT [0-9]
%%
/* regex + action rules come after the first %% */
RULE ACTION
%%
/* user code comes after the second %% */
main () {}

COMP 520 Winter 2017 Scanning (34)
$ cat print_tokens.l

/* includes and other arbitrary C code */
%{
#include <stdio.h> /* for printf */
%}
/* helper definitions */
DIGIT [0-9]
%%
/* regex + action rules come after the first %% */
[ \t\n]+                printf ("white space, length %i\n", yyleng);
"*"                     printf ("times\n");
"/"                     printf ("div\n");
"+"                     printf ("plus\n");
"-"                     printf ("minus\n");
"("                     printf ("left parenthesis\n");
")"                     printf ("right parenthesis\n");
0|([1-9]{DIGIT}*)       printf ("integer constant: %s\n", yytext);
[a-zA-Z_][a-zA-Z0-9_]*  printf ("identifier: %s\n", yytext);
%%
/* user code comes after the second %% */
main () { yylex (); }

COMP 520 Winter 2017 Scanning (35) Sometimes a token is not enough, we need the value as well: we may want to capture the value of an identifier; or need the value of a string, int, or float literal. In these cases, flex provides:
yytext: the scanned sequence of characters;
yylval: a user-defined variable from the parser (bison) to be returned with the token; and
yyleng: the length of the scanned sequence.

[a-zA-Z_][a-zA-Z0-9_]* {
    yylval.stringconst = (char *)malloc(strlen(yytext)+1);
    sprintf(yylval.stringconst, "%s", yytext);
    return tidentifier;
}

COMP 520 Winter 2017 Scanning (36) Using flex to create a scanner is really simple:
$ vim print_tokens.l
$ flex print_tokens.l
$ gcc -o print_tokens lex.yy.c -lfl

COMP 520 Winter 2017 Scanning (37) Running this scanner with input a*(b-17) + 5/c:
$ echo "a*(b-17) + 5/c" | ./print_tokens
our print_tokens scanner outputs:
identifier: a
times
left parenthesis
identifier: b
minus
integer constant: 17
right parenthesis
white space, length 1
plus
white space, length 1
integer constant: 5
div
identifier: c
white space, length 1

COMP 520 Winter 2017 Scanning (38) Count lines and characters:

%{
int lines = 0, chars = 0;
%}
%%
\n      lines++; chars++;
.       chars++;
%%
main () {
    yylex ();
    printf ("#lines = %i, #chars = %i\n", lines, chars);
}

COMP 520 Winter 2017 Scanning (39) Getting (better) position information in flex: is easy for line numbers (option and variable yylineno); but is more involved for character positions. If position information is useful for further compilation phases: it can be stored in a structure yylloc provided by the parser (bison); but must be updated by a user action.

typedef struct yyltype {
    int first_line, first_column, last_line, last_column;
} yyltype;

%{
#define YY_USER_ACTION yylloc.first_line = yylloc.last_line = yylineno;
%}
%option yylineno
%%
.   {
        printf("Error: (line %d) unexpected char '%s'\n", yylineno, yytext);
        exit(1);
    }

COMP 520 Winter 2017 Scanning (40) Actions in a flex file can either: do nothing (ignore the characters); perform some computation, call a function, etc.; and/or return a token (token definitions provided by the parser).

%{
#include <stdlib.h> /* for atoi */
#include <stdio.h>  /* for printf */
#include "lang.tab.h" /* for tokens */
%}
%%
[aeiouy]    /* ignore */
[0-9]+      printf ("%i", atoi (yytext) + 1);
\\n         { yylval.rune_const = '\n'; return truneconst; }
%%
main () { yylex (); }

COMP 520 Winter 2017 Scanning (41) Summary: a scanner transforms a string of characters into a string of tokens; scanner generating tools like flex allow you to define a regular expression for each type of token; internally, the regular expressions are transformed to deterministic finite automata for matching; to break ties, matching uses 2 principles: longest match and first match.