Lexical Analysis Lecture 3 January 10, 2018

Announcements PA1c is due tonight at 11:50pm! Don't forget about PA1, the Cool implementation! Use Monday's lecture, the video guides, and the Cool examples if you're stuck with Cool!

Programming Assignments Going Forward C was allowed for PA1, but not for PA2 through PA6. How comfortable are we with other languages? Python, Haskell, Ruby, OCaml, and JavaScript.

Lexical Analysis Summary Lexical analysis turns a stream of characters into a stream of tokens. Regular expressions are a way to specify sets of strings, which we use to describe tokens. Example: class Main { ... → Lexical Analyzer → CLASS, IDENT, LBRACE, ...

Cunning Plan Informal Sketch of Lexical Analysis: the LA identifies tokens from the input string, List<Token> lexer(char[]). Issues in Lexical Analysis: lookahead, ambiguity. Specifying Lexers: regular expressions, examples.

Definitions Token: a set of strings defining an atomic element with a distinct meaning; a syntactic category. In English: noun, verb, adjective. In programming: identifier, integer, keyword, whitespace, ... Lexeme: a sequence of characters that can be categorized as a Token.
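To make the Token/Lexeme distinction concrete, here is a minimal sketch in Python; the field names and the two sample tokens are illustrative, not the actual Cool token set.

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str    # the syntactic category, e.g. "CLASS", "IDENT", "INTEGER"
    lexeme: str  # the characters that were matched, e.g. "class", "counter", "42"

# The lexeme "class" belongs to the token (category) CLASS;
# the lexeme "counter" belongs to the token IDENT.
tokens = [Token("CLASS", "class"), Token("IDENT", "counter")]
print(tokens)
```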

Tokens and Lexemes

Token   Lexeme
CLASS   class
LT      <
FALSE   false
IDENT   variable_name

By the way, what do you think of Cool's fi, pool, esac, ...?

Context for Lexers Lexing and parsing go hand-in-hand. The parser uses distinctions between tokens, e.g., a keyword is treated differently than an identifier. [Diagram: input → (get_char) → Lexer → (get_token) → Parser: the lexer reads characters from the input and returns a token each time the parser calls get_token().]
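A common way to realize that diagram is a pull interface in which the parser asks the lexer for tokens one at a time. The sketch below assumes this arrangement; the class and method names mirror the diagram rather than any particular library, and the "tokenization" is just a placeholder split on whitespace.

```python
class Lexer:
    """Hands out one token per call; the parser pulls tokens on demand."""
    def __init__(self, text):
        self.tokens = iter(text.split())  # placeholder for real tokenization

    def get_token(self):
        return next(self.tokens, None)    # None signals end of input

def parse(lexer):
    # A trivial "parser" that simply drives the lexer, as in the diagram.
    while (tok := lexer.get_token()) is not None:
        print("parser received:", tok)

parse(Lexer("class Main { }"))
```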

Lexical Analysis Consider this example: if(i=j) then z<-0 else z<-1. The input is simply a sequence of characters: if(i=j) then\n\tz<-0\nelse... Goal: partition the input string into substrings. Then, classify them according to their role (tokenize!).

Tokens Tokens correspond to sets of strings. Identifier: strings of letters or digits, starting with a letter. Integer: a non-empty string of digits. Keyword: else or class or let ... Whitespace: a non-empty sequence of blanks, newlines, and/or tabs. OpenParen: a left parenthesis (. CloseParen: a right parenthesis ).

Building a Lexical Analyzer A lexer implementation must do three things: 1. Recognize substrings corresponding to tokens. 2. Return the value (lexeme) of the token. 3. Report errors intelligently (line numbers for Cool).

Lexical Analyzer Implementation The lexer usually discards uninteresting tokens that don't contribute to parsing. Examples: whitespace, comments. Exceptions: which languages care about whitespace? Review: what would happen if we removed all whitespace and comments before lexing?

Example Recall: if (i = j) then z<-0 else z<-1. Our Cool lexer would return token-lexeme-line-number tuples: <IF, if, 1> <WHITESPACE, ' ', 1> <OPENPAREN, (, 1> <IDENTIFIER, i, 1> <WHITESPACE, ' ', 1> <EQUALS, =, 1> ...
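A small sketch of how such tuples could be produced with Python's re module; the rule set is a tiny illustrative subset, not the full Cool lexical specification. Note that re alternation picks the first alternative that matches, which is why the keywords are listed before IDENTIFIER here.

```python
import re

# (token name, regular expression) pairs -- an illustrative subset only
RULES = [
    ("IF",         r"if"),
    ("THEN",       r"then"),
    ("ELSE",       r"else"),
    ("OPENPAREN",  r"\("),
    ("CLOSEPAREN", r"\)"),
    ("ASSIGN",     r"<-"),
    ("EQUALS",     r"="),
    ("INTEGER",    r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9_]*"),
    ("WHITESPACE", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in RULES))

def lex(source):
    line = 1
    for m in MASTER.finditer(source):
        yield (m.lastgroup, m.group(), line)   # token, lexeme, line number
        line += m.group().count("\n")

for triple in lex("if (i = j) then z<-0 else z<-1"):
    print(triple)
```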

Lexing Considerations The goal is to partition the input string into meaningful tokens: scan left to right (i.e., in order) and recognize tokens. We really need a way to describe the lexemes associated with each token, and also a way to handle ambiguities (is "if" one keyword, or two variables i and f?).

Regular Languages Sounds like we can use DFAs to recognize lexemes, with accepting states corresponding to tokens! Example: capture the word class with a DFA that reads c, l, a, s, s and then WS. Example: capture a variable name with a DFA that reads A, then AN repeatedly, then WS. (A = letter, AN = alphanumeric, WS = whitespace.)

Capturing Multiple Tokens What about both class and variable names? Combine the two DFAs: roughly, spelling out c, l, a, s, s followed by WS accepts the keyword (state 1), while a letter other than c (A-c) followed by AN characters and then WS accepts an identifier (state 2).
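One way to read the combined picture is as a transition function over the characters of a single lexeme. The sketch below encodes that reading with my own state labels; it assumes the lexeme starts with a letter, as the identifier DFA requires.

```python
KEYWORD = "class"

def classify(lexeme):
    """Run the combined DFA over one lexeme (with trailing whitespace removed)."""
    state = 0  # how many characters of "class" matched so far; -1 = identifier path
    for ch in lexeme:
        if 0 <= state < len(KEYWORD) and ch == KEYWORD[state]:
            state += 1          # still spelling out c, l, a, s, s
        elif ch.isalnum():
            state = -1          # any other letter/digit: fall into the identifier path
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return "CLASS" if state == len(KEYWORD) else "IDENT"

print(classify("class"))    # CLASS
print(classify("cla"))      # IDENT (a prefix of the keyword, but not the keyword)
print(classify("classy"))   # IDENT
print(classify("counter"))  # IDENT
```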

Lexical Analyzer Generators We like regular languages as a means to categorize lexemes into tokens. We don't like the complexity of implementing a DFA manually. We use regular expressions to describe regular languages, and our tokens are recognizable as regular languages! Regular expressions can be automatically turned into a DFA for rapid lexing!

Languages Review Definition: Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ. Σ is called the alphabet.

Examples of Languages Alphabet = English characters, Language = English sentences. Note: not every string of English characters is an English sentence. Example: adsfasdklg gdsajkl. Alphabet = ASCII characters, Language = C programs. Note: the ASCII character set is different from the English character set.

Notation Languages are sets of strings. We need some notation for specifying which sets we want, i.e., which strings are in a set. For lexical analysis, we care about regular languages, which can be described using regular expressions.

Regular Expressions Each regular expression is a notation for a regular language (a set of "words"). Notation forthcoming! If A is a regular expression, we write L(A) to refer to the language denoted by A.

Base Regular Expressions Single character: 'c', with L('c') = { "c" }. Concatenation: AB, where A and B are both regular expressions, with L(AB) = { ab | a ∈ L(A) and b ∈ L(B) }. Example: L('i' 'f') = { "if" }.

Compound Regular Expressions Union: L(A | B) = { s | s ∈ L(A) or s ∈ L(B) }. Examples: L('if' | 'then' | 'else') = { "if", "then", "else" }. L('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9') = what? L(('0' | '1')('0' | '1')) = what?

Starz! So far, base and compound regular expressions only describe finite languages. Iteration: A*, with L(A*) = { ε } ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ ... Examples: L('0'*) = { ε, 0, 00, 000, ... }; L('1' '0'*) = { 1, 10, 100, 1000, ... }. Empty string: ε, with L(ε) = { ε }.
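The examples above, and the "what?" questions on the previous slide, can be checked mechanically. Here is a quick sketch using Python's re module, whose syntax stands in for the formal notation.

```python
import re

digit    = "0|1|2|3|4|5|6|7|8|9"   # L(digit) = the ten one-digit strings
bit_pair = "(0|1)(0|1)"            # L = { 00, 01, 10, 11 }
binary   = "1(0)*"                 # L = { 1, 10, 100, 1000, ... }

print(bool(re.fullmatch(digit, "7")))        # True
print(bool(re.fullmatch(bit_pair, "10")))    # True
print(bool(re.fullmatch(bit_pair, "2")))     # False
print(bool(re.fullmatch(binary, "10000")))   # True
print(bool(re.fullmatch(binary, "01")))      # False
```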

Example: Keyword Keywords: else or if or fi. 'else' | 'if' | 'fi' (Recall that 'else' abbreviates the concatenation 'e' 'l' 's' 'e'.)

Example: Integer Integer: a non-empty string of digits. digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'. number = digit digit*. Abbreviation: A+ = AA*.

Example: Identifiers Identifier: a string of letters or digits, starting with a letter. letter = 'A' | ... | 'Z' | 'a' | ... | 'z'. ident = letter (letter | digit)*. Is (letter* | digit*) the same?
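A quick experiment, again with Python's re as a stand-in for the formal notation, shows that the two expressions are not the same.

```python
import re

ident = r"[A-Za-z][A-Za-z0-9]*"   # letter (letter | digit)*
alt   = r"[A-Za-z]*|[0-9]*"       # (letter* | digit*)

for s in ["x9y", "9", ""]:
    print(repr(s), bool(re.fullmatch(ident, s)), bool(re.fullmatch(alt, s)))
# 'x9y' True  False  -- mixes letters and digits; only the first matches
# '9'   False True   -- no leading letter; only the second matches
# ''    False True   -- the second even accepts the empty string
```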

Example: Whitespace Whitespace: a non-empty sequence of blanks, newlines, and tabs. (' ' | '\t' | '\n' | '\r')+

Example: Phone Numbers Regular expressions are everywhere! Consider: (123)-234-4567. Σ = { 0, 1, 2, ..., 9, (, ), - }. area = digit digit digit. exch = digit digit digit. phone = digit digit digit digit. number = '(' area ')' '-' exch '-' phone.
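Translating that little grammar into Python's re syntax gives a working matcher; this is a sketch of exactly the pattern above, not a general phone-number validator.

```python
import re

digit  = r"[0-9]"
area   = digit * 3            # digit digit digit
exch   = digit * 3
phone  = digit * 4
number = re.compile(r"\(" + area + r"\)-" + exch + "-" + phone)

print(bool(number.fullmatch("(123)-234-4567")))   # True
print(bool(number.fullmatch("123-234-4567")))     # False (missing parentheses)
```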

Example: Email Addresses Consider kjleach@eecs.umich.edu. Σ = { a, b, c, ..., z, '.', '@' }. name = letter+. address = name '@' name ('.' name)*.

Regular Expression Summary Regular expressions describe many useful languages. Given a string s and a regexp R, we can check whether s ∈ L(R). Is this enough? NO! Recall that we need the original lexeme! We must adapt regular expressions to this goal.

Next time Specifying lexical structure using regular expressions. Finite automata: Deterministic Finite Automata, Nondeterministic Finite Automata. Implementation of regular expressions: Regexp → NFA → DFA → lookup table. Lexical analyzer generation (i.e., doing this all automatically).

Lexical Specification (1) Start with a set of tokens (pro tip: PA2 lists them for Cool). Write a regular expression for the lexemes representing each token: Number = digit+, IF = 'if', ELSE = 'else', IDENT = letter (letter | digit)*, ...

Lexical Specification (2) Construct R, matching all lexemes for all tokens: R = Number | IF | ELSE | IDENT | ..., i.e., R = R1 | R2 | R3 | ... If s ∈ L(R), then s is a lexeme; also s ∈ L(Rj) for some j. The particular j corresponds to the type of token reported by the lexer.

Lexical Specification (3) For an input x1, ..., xn, each xi ∈ Σ. For each 1 ≤ i ≤ n, check: is x1 ... xi ∈ L(R)? If so, it must be that x1 ... xi ∈ L(Rj) for some j. Remove x1 ... xi from the input and restart.

Example Lexing R = Whitespace | Integer | Identifier | Plus. Lex "f + 3 + g": "f" matches R (more specifically, Identifier); " " matches R (more specifically, Whitespace); ... What does the lexer output look like for this example?

Ambiguities Ambiguities arise in this algorithm. R = Whitespace | Integer | Identifier | Plus. Lex "foo+3": f, fo, and foo all match R, but not foo+. How much input do we consume? Maximal munch rule: pick the longest possible substring that matches R.

Ambiguities (2) R = Whitespace | new | Integer | Identifier | Plus. Lex "new foo": new matches both the new rule and the Identifier rule; which one do you pick? Generally, pick the rule listed first. Arbitrary, but typical. Important for PA2! new is listed before Identifier, so the token given is new.
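Putting the last few slides together, here is a minimal sketch of a lexer loop that applies both disambiguation rules: maximal munch (the longest match wins) and rule priority (ties go to the rule listed first). The token names are illustrative, not the PA2 token set.

```python
import re

# Rules in priority order: "new" is listed before IDENTIFIER, so when both
# match a lexeme of the same length, the new rule wins (the slide's tie-break).
RULES = [
    ("WHITESPACE", re.compile(r"[ \t\n]+")),
    ("NEW",        re.compile(r"new")),
    ("INTEGER",    re.compile(r"[0-9]+")),
    ("IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("PLUS",       re.compile(r"\+")),
]

def lex(source):
    pos = 0
    while pos < len(source):
        # Try every rule at the current position and keep the longest match;
        # ties are broken by the order of RULES (the earlier rule wins).
        best = None
        for name, rx in RULES:
            m = rx.match(source, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        name, lexeme = best
        if name != "WHITESPACE":      # discard uninteresting tokens
            yield (name, lexeme)
        pos += len(lexeme)

print(list(lex("new foo")))   # [('NEW', 'new'), ('IDENTIFIER', 'foo')]
print(list(lex("newfoo")))    # [('IDENTIFIER', 'newfoo')]  -- maximal munch
print(list(lex("foo+3")))     # [('IDENTIFIER', 'foo'), ('PLUS', '+'), ('INTEGER', '3')]
```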

Summary Regular expressions provide a concise notation for string patterns. We need to adapt them for lexical analysis to resolve ambiguities and handle errors (report line numbers). Next time: Lexical Analysis Generators.
