Lexical Analysis. Finite Automata

Similar documents
Lexical Analysis. Finite Automata

Cunning Plan. Informal Sketch of Lexical Analysis. Issues in Lexical Analysis. Specifying Lexers

Lexical Analysis. Lecture 3. January 10, 2018

Lexical Analysis. Chapter 2

Lexical Analysis. Lecture 2-4

Lexical Analysis. Lecture 3-4

Introduction to Lexical Analysis

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Introduction to Lexical Analysis

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Implementation of Lexical Analysis

Implementation of Lexical Analysis

Lexical Analysis. Implementation: Finite Automata

Implementation of Lexical Analysis

Implementation of Lexical Analysis

Implementation of Lexical Analysis. Lecture 4

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Formal Languages and Compilers Lecture VI: Lexical Analysis

Implementation of Lexical Analysis

Implementation of Lexical Analysis

UNIT -2 LEXICAL ANALYSIS

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

CSE 3302 Programming Languages Lecture 2: Syntax

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Compiler Construction LECTURE # 3

Lexical Analysis. Introduction

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

A simple syntax-directed

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

Front End: Lexical Analysis. The Structure of a Compiler

Compiler Construction D7011E

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Lexical Analysis 1 / 52

CS164: Midterm I. Fall 2003

1. Lexical Analysis Phase

Week 2: Syntax Specification, Grammars

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

Compiler phases. Non-tokens

Part 5 Program Analysis Principles and Techniques

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

UNIT II LEXICAL ANALYSIS

Compiler course. Chapter 3 Lexical Analysis

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

CSc 453 Lexical Analysis (Scanning)

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Outline CS4120/4121. Compilation in a Nutshell 1. Administration. Introduction to Compilers Andrew Myers. HW1 out later today due next Monday.

CSC 467 Lecture 3: Regular Expressions

Compiler Construction

Administrativia. Extra credit for bugs in project assignments. Building a Scanner. CS164, Fall Recall: The Structure of a Compiler

Lexical Analysis. Finite Automata. (Part 2 of 2)

Chapter 3 Lexical Analysis

Topic 3: Syntax Analysis I

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

CMSC 350: COMPILER DESIGN

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

Zhizheng Zhang. Southeast University

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

Programming Languages and Compilers (CS 421)

Programming Languages & Compilers. Programming Languages and Compilers (CS 421) I. Major Phases of a Compiler. Programming Languages & Compilers

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

Midterm I - Solution CS164, Spring 2014

CS308 Compiler Principles Lexical Analyzer Li Jiang

Building Compilers with Phoenix

UVa ID: NAME (print): CS 4501 LDI Midterm 1

Lexical Analysis (ASU Ch 3, Fig 3.1)

Compiler Construction

CSCI312 Principles of Programming Languages!

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

Non-deterministic Finite Automata (NFA)

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

CS415 Compilers. Lexical Analysis

DVA337 HT17 - LECTURE 4. Languages and regular expressions

CSCE 314 Programming Languages

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

COMPILER DESIGN LECTURE NOTES

CSE302: Compiler Design

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.

Figure 2.1: Role of Lexical Analyzer

Question Bank. 10CS63:Compiler Design

Theoretical Part. Chapter one:- - What are the Phases of compiler? Answer:

Programming Languages & Compilers. Programming Languages and Compilers (CS 421) Programming Languages & Compilers. Major Phases of a Compiler

UNIT I- LEXICAL ANALYSIS. 1.Interpreter: It is one of the translators that translate high level language to low level language.

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

COMP 181 Compilers. Administrative. Last time. Prelude. Compilation strategy. Translation strategy. Lecture 2 Overview

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Kinder, Gentler Nation

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Principles of Programming Languages COMP251: Syntax and Grammars

Group A Assignment 3(2)

Transcription:

#1 Lexical Analysis Finite Automata Cool Demo? (Part 1 of 2)

#2 Cunning Plan Informal Sketch of Lexical Analysis LA identifies tokens from input string lexer : (char list) (token list) Issues in Lexical Analysis Lookahead Ambiguity Specifying Lexers Regular Expressions Examples

#3 One-Slide Summary Lexical analysis turns a stream of characters into a stream of tokens. Regular expressions are a way to specify sets of strings. We use them to describe tokens.

#4 Fold Batter Lightly... fold_left f a [1;...;n] == f (... (f (f a 1) 2)) n fold_left (fun a e -> e :: a) [] [1;2;3] = [3;2;1] fold_left (fun a e -> a @ [e]) [] [1;2;3] = [1;2;3] fold_right f [1;...;n] b == f 1 (f 2 (... (f n b))) fold_right (fun a e -> e :: a) [1;2;3] [] = [1;2;3] fold_right (fun e a -> a @ [e]) [1;2;3] [] = [3;2;1]

#5 Compiler Structure Regular Expressions Formal Grammar Lexer Generator Parser Generator Typing Rules Machine Description Lexical Analyzer Parser Semantic Analysis Code Generator Cool Source Code List of Tokens Abstract Syntax Tree Annotated Abstract Syntax Tree Assembly Code

#6 Lexical Analysis What do we want to do? Example: if (i == j) z = 0; else z = 1; The input is just a sequence of characters: if (i == j)\n\tz = 0;\nelse\n\tz = 1; Goal: partition input strings into substrings And classify them according to their role

#7 What's a Token? Output of lexical analysis is a list of tokens A token is a syntactic category In English: noun, verb, adjective,... In a programming language: Identifier, Integer, Keyword, Whitespace,... Parser relies on token distinctions: e.g., identifiers are treated differently than keywords

#8 Tokens Tokens correspond to sets of strings. Identifier: strings of letters or digits, starting with a letter Integer: a non-empty string of digits Keyword: else or if or begin or... Whitespace: a non-empty sequence of blanks, newlines, and/or tabs OpenPar: a left-parenthesis

#9 Lexical Analyzer: Build It! An implementation must do three things: Recognize substrings corresponding to tokens Return the value or lexeme of the token The lexeme is the substring Report errors intelligently!

Example Recall: if (i == j)\n\tz = 0;\nelse\n\tz = 1; Token-lexeme pairs returned by the lexer: <Keyword, if > <Whitespace, > <OpenPar, ( > <Identifier, i > <Whitespace, > <Relation, == > <Whitespace, >... #10

#11 Lexical Analyzer: Implementation The lexer usually discards uninteresting tokens that don't contribute to parsing. Examples: Whitespace, Comments Exception: which languages care about whitespace? Question: What would happen if we removed all whitespace and comments prior to lexing?

#12 Lookahead The goal is to partition the input string (source code). This is implemented by reading left-toright, recognizing one token at a time. Lookahead may be required to decide where one token ends and the next token begins Even our simple example has lookahead issues i vs. if = vs. ==

#13 Still Needed A way to describe the lexemes of each token Recall: lexeme = the substring corresponding to the token A way to resolve ambiguities Is if two variables i and f? Is == two equal signs = =?

#14 Languages Definition. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ. Σ is called the alphabet.

#15 Examples of Languages Alphabet = English Characters Language = English Sentences Note: Not every string on English characters is an English sentence. Example: xayenb sbe' Alphabet = ASCII characters Language = C Programs Note: ASCII character set is different from English character set.

#16 Notation Languages are sets of strings We need some notation for specifying which sets we want that is, which strings are in the set For lexical analysis we care about regular languages, which can be described using regular expressions.

#17 Regular Expressions Each regular expression is a notation for a regular language (a set of words) You'll see the exact notation in minute! If A is a regular expression then we write L(A) to refer to the language denoted by A

#18 Base Regular Expression Single character: 'c' L('c') = { c } (for any c Σ ) Concatenation: AB A and B are other regular expressions L(AB) = { ab a L(A) and b L(B) } Example: L('i' 'f') = { if } We abbreviate 'i' 'f' as 'if'

#19 Compound Regular Expressions Union L(A B) = { s s L(A) or s L(B) } Examples: L('if' 'then' 'else') = { if, then, else } L('0' '1' '2' '3' '4' '5' '6' '7' '8' '9') = what? Fun Example: L( ('0' '1') ('0' '1') ) = { 00, 01, 10, 11 }

#20 Starz! So far we have only finite languages Iteration: A* L(A*) = { } L(A) L(AA) L(AAA)... Examples: L('0'*) = {, 0, 00, 000, 0000,... } L('1' '0' *) = { 1, 10, 100, 1000,...} Empty: ε L(ε) = { }

Q: Events (603 / 842) This product was introduced on May 8, 1985, in one of the greatest consumer flops ever. It was effectively shelved on July 10 of the same year when its "red, white and you" predecessor was re-introduced and set as the default.

Q: Music (150 / 842) In this 1958 Sheb Wooley song the pigeon-toed title character wears short shorts and wants to get a job in a rock'n'roll band playing the horn, but is perhaps best known for his skin tone and non-standard diet.

Tortured Prose (with lexer errors) 6. "'Composition for College Writer's', seems like fun!" 19. The brunette thrusts out her carefully manicured hand. "Hi, my name is Kelly with one 'l'. 206. It was Chris, I was getting a real knack for hearing...okay, ease-dropping. Like its not a common practice! 336. "Yeh, you were quite good, the way you jumped those, those... thingy's," he said not able to think of the correct English word for the hurtles.

Q: Advertising (810 / 842) The United States Forest Service's ursine mascot first appeared in 1944. Give his catchphrase safety message.

#25 Example: Keyword Keyword: else or if or begin or... 'else' 'if' 'begin'... (Recall: 'else' abbreviates 'e' 'l' 's' 'e')

#26 Example: Integers Integer: a non-empty string of digits digit = '0' '1' '2' '3' '4' '5' '6' '7' '8' '9' number = digit digit* Abbreviation: A+ = A A*

#27 Example: Identifier Identifier: string of letters or digits, starting with a letter letter = 'A'... 'Z' 'a'... 'z' ident = letter ( letter digit )* Is (letter* digit*) the same?

#28 Example: Whitespace Whitespace: a non-empty sequence of blanks, newlines, and tabs (' ' '\t' '\n') + or (' ' '\t' '\n' '\r') +

#29 Example: Phone Numbers Regular expressions are everywhere! Consider: (434) 924-1021 Σ = {0, 1, 2, 3,..., 9, (, ), -} area = digit digit digit exch = digit digit digit phone = digit digit digit digit number = '(' area ')' exch '-' phone

#30 Example: Email Addresses Consider weimer@cs.virginia.edu Σ name address = {a, b,..., z,., @} = letter+ = name '@' name ('.' name)*

#31 Regexp Summary Regular expressions describe many useful languages Next: Given a string s and a regexp R, is s L(R) But a yes/no answer is not enough! Recall: we also need the substring (lexeme) Instead: partition the input into lexemes We will adapt regular expression to this goal

#32 Subsequent Outline Specifying lexical structure using regexps Finite Automata Deterministic Finite Automata (DFAs) Non-deterministic Finite Automata (NFAs) Implementation of Regular Expressions Regexp -> NFA -> DFA -> tables

#33 Lexical Specification (1) Select a set of tokens Number, Keyword, Identifier,... Write a regexp for the lexemes of each token Number = digit+ Keyword = 'if' 'else'... Identifier = letter ( letter digit ) * OpenPar = '('...

#34 Lexical Specification (2) Construct R, matching all lexemes for all tokens: R = Keyword Identifier Number... R = R1 R2 R3... Fact: if s L(R) then s is a lexeme Furthermore, s L(Rj) for some j This j determines the token that is reported

#35 Lexical Specification (3) Let the input be x 1... x n Each x i is in the alphabet Σ For 1 i n, check x 1... x i L(R) If so, it must be that x 1... x i L(Rj) for some j Remove x 1... x i from the input and restart

Lexing Example R = Whitespace Integer Identifer Plus Parse f +3 +g f matches R, more precisely Identifier matches R, more precisely Whitespace + matches R, more precisely Plus... The token-lexeme pairs are <Identifier, f > <Whitespace, > <Plus, + >... In the future, we'll just drop whitespace. #36

#37 Ambiguities (1) There are ambiguities in the algorithm Example: R = Whitespace Integer Identifier Plus Parse foo+3 f matches R, more precisely Identifier But also fo matches R, and foo, but not foo+ How much input is used? Maximal Munch rule: Pick the longest possible substring that matches R

#38 Ambiguities (2) R = Whitespace 'new' Integer Identifier Parse new foo new matches R, more precisely 'new' but also Identifier which one do we pick? In general, use the rule listed first No, really. So we must list 'new' (and other keywords) before Identifier. That is, you must list it first when writing PA2.

#39 Error Handling R = Whitespace Integer Identifier '+' Parse =56 No prefix matches R: not =, nor =5, nor =56 Problem: we can't just get stuck and die Solution: Add a rule matching all bad strings Put it last Lexer tools allow the writing of: R = R1 R2... Rn Error

#40 Summary Regular expressions provide a concise notation for string patterns Their use in lexical analysis requires small extensions To resolve ambiguities To handle errors Good algorithms known (next) Requiring only a single pass over the input And few operations per character (table lookup)