Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Based in part on slides and notes by J. Erickson, S. Krishnan, B. Brandenburg, S. Olivier, A. Block, and others.

The Big Picture

A compiler processes a program in phases. Each phase consumes the output of the previous one:

  Character stream
    -> Scanner (lexical analysis)
  Token stream
    -> Parser (syntax analysis)
  Parse tree
    -> Semantic analysis and intermediate code generation
  Abstract syntax tree
    -> Machine-independent optimization (optional)
  Modified intermediate form
    -> Target code generation
  Machine language
    -> Machine-specific optimization (optional)
  Modified target language

Lexical analysis groups consecutive characters that belong together: it turns the stream of individual characters into a stream of tokens that have individual meaning.

Source Program

The compiler reads the program from a file. The input arrives as a character stream.

  Source file:  = 1 2 3 - 7 5 * f o o ;

Compilation requires analysis of the program structure: identifying subroutines, classes, methods, etc. Thus, the first step is to find units of meaning.

Tokens

  Source file:  = 1 2 3 - 7 5 * f o o ;

Not every character has an individual meaning. In Java, a + can have two interpretations:
  - A single + means addition.
  - A ++ sequence means increment.

A sequence of characters that has an atomic meaning is called a token. The compiler must identify all input tokens.

Human analogy: to understand the meaning of an English sentence, we do not look at individual characters. Rather, we look at individual words. Human word = program token.

The example source file consists of the following tokens:

  Characters   Token
  =            Operator: assignment
  123          Integer literal
  -            Operator: minus
  75           Integer literal
  *            Operator: multiplication
  foo          Identifier: foo
  ;            Statement separator/terminator
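As a rough illustration of the token sequence for the example source file, the following Python sketch splits the text into (kind, text) pairs. The token names, the TOKEN_SPEC table, and the tokenize function are assumptions made for this example and do not come from the slides. It also relies on Python's re module instead of a hand-built scanner.

```python
import re

# Hypothetical token kinds for this example. Order matters because the
# regex engine tries the alternatives from left to right.
TOKEN_SPEC = [
    ("INT_LITERAL", r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INCREMENT", r"\+\+"),  # listed before PLUS so that "++" is one token
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("TIMES", r"\*"),
    ("ASSIGN", r"="),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),        # whitespace separates tokens and is dropped
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, text) pairs for the given source text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for kind, lexeme in tokenize("= 123 - 75 * foo;"):
    print(kind, lexeme)
```

Listing INCREMENT before PLUS is what makes a ++ sequence come out as a single token in this sketch.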

Lexical vs. Syntactical Analysis

Why have a separate lexical analysis phase?

In theory, token discovery (lexical analysis) could be done as part of the structure discovery (syntactical analysis, parsing). However, this is impractical: it is much easier (and much more efficient) to express the syntax rules in terms of tokens. Thus, lexical analysis is made a separate step because it greatly simplifies the subsequent syntactical analysis.

Example: Java Language Specification

Token specification (lexical structure). These strings mean something, but knowledge of the exact meaning is not required to identify them.

  The following 37 tokens are the operators, formed from ASCII characters:

  Operator: one of
      =    >    <    !    ~    ?    :
      ==   <=   >=   !=   &&   ||   ++   --
      +    -    *    /    &    |    ^    %    <<   >>   >>>
      +=   -=   *=   /=   &=   |=   ^=   %=   <<=  >>=  >>>=

Syntactical structure. Meaning is given by where the tokens can occur in the program (grammar) and by the language semantics.

  UnaryExpression:
      PreIncrementExpression
      PreDecrementExpression
      + UnaryExpression
      - UnaryExpression
      UnaryExpressionNotPlusMinus

  PreIncrementExpression:
      ++ UnaryExpression

  UnaryExpressionNotPlusMinus:
      PostfixExpression
      ~ UnaryExpression
      ! UnaryExpression
      CastExpression

Lexical Analysis

The need to identify tokens raises two questions:
  - How can we specify the tokens of a language?
  - How can we recognize tokens in a character stream?

  Task                  Tool                                  Concern
  Token specification   Regular expressions                   Language design and specification
  Token recognition     Deterministic finite automata (DFA)   Language implementation

DFA construction turns the regular expressions into a DFA in several steps.

Regular Expression Rules

Base case: a regular expression (RE) is either
  - a character (e.g., 0, 1, ...), or
  - the empty string (i.e., ε).

A compound RE is constructed by
  - concatenation: two REs next to each other (e.g., (1)(0|1)),
  - alternation: two REs separated by | (e.g., 1|0),
  - optional repetition: an RE followed by * (the Kleene star) to denote zero or more occurrences (e.g., 0*),
  - parentheses (in order to avoid ambiguity).

What Does This Regular Expression Match?

  0x(1|2|3|4|5|6|7|8|9|a|b|c|d|e|f)(0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f)*

  Input        Match?
  0x1          Yes
  0x0          No
  0xdeadbeef   Yes
  0x           No
  0x01         No

The expression recognizes positive hexadecimal constants without leading zeros.
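The five answers can be checked mechanically. The sketch below assumes that Python's re module is an acceptable stand-in for the slide notation (it uses the same | and * operators); the names HEX_CONSTANT and matches are made up for this example.

```python
import re

# The two alternations from the slide, written out in full.
NONZERO_HEX_DIGIT = "1|2|3|4|5|6|7|8|9|a|b|c|d|e|f"
HEX_DIGIT = "0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f"
HEX_CONSTANT = re.compile(f"0x({NONZERO_HEX_DIGIT})({HEX_DIGIT})*")


def matches(text):
    """True if the whole string (not just a prefix) matches the expression."""
    return HEX_CONSTANT.fullmatch(text) is not None


for candidate in ["0x1", "0x0", "0xdeadbeef", "0x", "0x01"]:
    print(candidate, "Yes" if matches(candidate) else "No")
```

Using fullmatch matters here: a prefix match would wrongly report part of a longer string as accepted.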

Example

Can we create a regular expression corresponding to the "City, State ZIP-code" line in mailing addresses? E.g.:

  Chapel Hill, NC 27599-3175
  Beverly Hills, CA 90210

Grammars and Languages

A regular grammar is a kind of grammar.
  - A grammar describes the structure of strings.
  - A string that matches a grammar G's structure is said to be in the language L(G) (which is a set).

A grammar is a set of productions: rules to obtain (produce) a string that is in L(G) via repeated substitutions.

There are many grammar classes (see COMP 455). Two are commonly used to describe programming languages: regular grammars for tokens and context-free grammars for syntax.

Grammar 101

  digit           → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  non_zero_digit  → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  natural_number  → non_zero_digit digit*
  non_neg_number  → (0 | natural_number) ( (. digit* non_zero_digit) | ε )

Productions: A → B is called a production.

Non-terminals: the name on the left of a production is called a non-terminal symbol.

Terminals: the symbols on the right are either terminal or non-terminal symbols. A terminal symbol is just a character.

Definition: → means "is a" or "replace with".

Choice: | denotes "or". Thus, the first production means: a digit is a 0 or 1 or 2 or ... or 9.

Optional repetition: * denotes zero or more of a symbol.

Sequence: two symbols next to each other means "followed by". Thus, the third production means: a natural number is a non-zero digit followed by zero or more digits.

Epsilon: ε is a special terminal that means "empty". It corresponds to the empty string.

So, what does the last production mean? A non-negative number is a 0 or a natural number, followed by either nothing or a ".", followed by zero or more digits, followed by (exactly one) non-zero digit.
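To see the four productions at work, they can be turned into one regular expression by substituting each non-terminal into the rules that use it. The sketch below does this with Python string formatting; the variable names mirror the grammar, while is_non_neg_number and the use of Python's re module are assumptions of this example.

```python
import re

# One regex fragment per production. Earlier names are substituted into
# later ones, which works because no rule refers to itself.
digit = "(0|1|2|3|4|5|6|7|8|9)"
non_zero_digit = "(1|2|3|4|5|6|7|8|9)"
natural_number = f"({non_zero_digit}{digit}*)"
# Python's re has no ε symbol; an empty alternative plays that role.
non_neg_number = f"((0|{natural_number})((\\.{digit}*{non_zero_digit})|))"

NON_NEG_NUMBER = re.compile(non_neg_number)


def is_non_neg_number(text):
    """True if the whole string is derivable from non_neg_number."""
    return NON_NEG_NUMBER.fullmatch(text) is not None


for candidate in ["0", "42", "3.14", "3.10", "007", "3."]:
    print(candidate, is_non_neg_number(candidate))
```

The grammar rejects 3.10 and 3. because the fractional part must end in a non-zero digit, and it rejects 007 because a natural number cannot start with 0.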

Regular Grammar Rules

Very similar to regular expression rules!

Terminals: a terminal is either
  - a character (e.g., 0, 1, ...), or
  - the empty string (i.e., ε).

Non-terminals: can be constructed using
  - concatenation: two REs or non-terminals next to each other (e.g., letter letter),
  - alternation: two REs or non-terminals separated by | (e.g., letter | digit),
  - optional repetition: an RE or non-terminal followed by * (the Kleene star) to denote zero or more occurrences (e.g., digit*),
  - parentheses (in order to avoid ambiguity).

A non-terminal is NEVER defined in terms of itself, not even indirectly! Thus, regular grammars cannot define recursive statements.

Example

Let's create a regular grammar corresponding to the "City, State ZIP-code" line in mailing addresses. E.g.:

  Chapel Hill, NC 27599-3175
  Beverly Hills, CA 90210

  city_line     → city ", " state_abbrev " " zip_code
  city          → letter (letter | " " letter)*
  state_abbrev  → AL | AK | AS | AZ | ... | WY
  zip_code      → digit digit digit digit digit (extra | ε)
  extra         → "-" digit digit digit digit
  digit         → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  letter        → A | B | C | ... | ö

Creating a regular expression from a regular grammar is mechanical and easy. Just take the most general non-terminal and keep substituting until you get down to terminals. The lack of recursion means that you won't get into an infinite loop.
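The substitution procedure can be sketched in Python by building each production as a string and inserting it into the productions that use it. Several details are assumptions of this sketch and not part of the slide: letters are limited to ASCII, the state list is a small illustrative subset (the slide elides most of it), the separators are taken to be a comma plus space and a single space, and the helper is_city_line is hypothetical.

```python
import re

letter = "[A-Za-z]"  # assumption: ASCII letters only
digit = "[0-9]"
city = f"{letter}({letter}| {letter})*"
# Illustrative subset only; the full list of abbreviations is elided on the slide.
state_abbrev = "(AL|AK|AS|AZ|CA|NC|WY)"
extra = f"-{digit}{digit}{digit}{digit}"
# The empty alternative stands in for ε.
zip_code = f"{digit}{digit}{digit}{digit}{digit}({extra}|)"
# The most general non-terminal, after all substitutions.
city_line = f"{city}, {state_abbrev} {zip_code}"

CITY_LINE = re.compile(city_line)


def is_city_line(text):
    """True if the whole string matches the expanded city_line expression."""
    return CITY_LINE.fullmatch(text) is not None


print(city_line)
print(is_city_line("Chapel Hill, NC 27599-3175"))
print(is_city_line("Beverly Hills, CA 90210"))
```

Printing city_line shows the single flat expression that results once every non-terminal has been replaced.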

Regular Sets and Finite Automata

If a grammar G is a regular grammar, then the language L(G) is called a regular set. Equivalently, the language accepted by a regular expression is a regular set.

Fundamental equivalence: for every regular set L(G), there exists a deterministic finite automaton (DFA) that accepts a string S if and only if S ∈ L(G). (See COMP 455 for proof.)

DFA 101

A deterministic finite automaton:
  - Has a finite number of states.
  - Has exactly one start state.
  - Has one or more final states.
  - Has transitions, which define how the automaton switches between states (given an input symbol).

Example DFA (states A, B, C):
  A (start):  on 0 stay in A; on 1 go to B
  B:          on 0 stay in B; on 1 go to C
  C (final):  on 0 or 1 stay in C

Start state: A.

Intermediate state (neither start nor final): B.

Final state (indicated by a double border in the diagram): C.

Transition: given an input of 1, if the DFA is in state A, then transition to state B (and consume the input).

Self transition: given an input of 0, if the DFA is in state A, then stay in state A (and consume the input).

Multiple transitions: given an input of either 0 or 1, if the DFA is in state C, then stay in state C (and consume the input).

Transitions must be unambiguous: for each state and each input, there exists only one transition. This is what makes the DFA deterministic. Not a legal DFA: a state X with two transitions on input 1, one to state Y and one to state Z.

DFA String Processing

Example DFA: A (start) stays in A on 0 and goes to B on 1; B stays in B on 0 and goes to C on 1; C (final) stays in C on 0 or 1.

String processing:
  - Initially, the DFA is in the start state.
  - It sequentially makes one transition for each character in the input string.

A DFA either accepts or rejects a string:
  - Reject if a character is encountered for which no transition is defined in the current state.
  - Reject if the end of input is reached and the DFA is not in a final state.
  - Accept if the end of input is reached and the DFA is in a final state.
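The accept and reject rules can be sketched in a few lines of Python for the three-state example DFA (A is the start state, C the final state). Storing the transitions in a dictionary keyed by (state, character) is just one possible representation, and the name accepts is made up for this example.

```python
# Transitions of the example DFA: (current state, input character) -> next state.
TRANSITIONS = {
    ("A", "0"): "A", ("A", "1"): "B",
    ("B", "0"): "B", ("B", "1"): "C",
    ("C", "0"): "C", ("C", "1"): "C",
}
START_STATE = "A"
FINAL_STATES = {"C"}


def accepts(text):
    """Run the DFA over text and report whether it accepts."""
    state = START_STATE
    for ch in text:
        next_state = TRANSITIONS.get((state, ch))
        if next_state is None:
            return False  # no transition defined for this character: reject
        state = next_state
    # End of input: accept only if the DFA stopped in a final state.
    return state in FINAL_STATES


for candidate in ["10", "11", "0101", "12"]:
    print(candidate, accepts(candidate))
```

The input 12 is rejected by the first rule, because no state has a transition on the character 2.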

DFA Example

Input: 1 0, processed by the example DFA (A is the start state, C is the final state; A and B stay put on 0 and advance on 1).

  1. Initially, the DFA is in the start state A.
  2. The first input character is 1. This causes a transition to state B.
  3. The next input character is 0. This causes a self transition in state B.
  4. The end of the input is reached, but the DFA is not in a final state: the string 10 is rejected!

DFA-Equivalent Regular Expression

Example DFA: A (start) stays in A on 0 and goes to B on 1; B stays in B on 0 and goes to C on 1; C (final) stays in C on 0 or 1.

What's the RE such that the RE's language is exactly the set of strings that is accepted by this DFA?

  0*10*1(1|0)*
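One way to gain confidence in this answer is to compare the DFA and the regular expression on every binary string up to some length. The sketch below does that with Python's re module; it is a finite spot check under that assumption, not a proof of equivalence, and the helper names are made up for this example.

```python
import re
from itertools import product

# The example DFA: A is the start state, C is the only final state.
TRANSITIONS = {
    ("A", "0"): "A", ("A", "1"): "B",
    ("B", "0"): "B", ("B", "1"): "C",
    ("C", "0"): "C", ("C", "1"): "C",
}
PATTERN = re.compile(r"0*10*1(1|0)*")


def dfa_accepts(text):
    """Run the DFA; every state has transitions on both 0 and 1."""
    state = "A"
    for ch in text:
        state = TRANSITIONS[(state, ch)]
    return state == "C"


def agree_up_to(max_len):
    """Compare DFA and regex on all strings over {0, 1} up to max_len."""
    for length in range(max_len + 1):
        for chars in product("01", repeat=length):
            text = "".join(chars)
            if dfa_accepts(text) != (PATTERN.fullmatch(text) is not None):
                return False
    return True


print(agree_up_to(10))
```

Both descriptions capture the same idea: a string is accepted exactly when it contains at least two 1s.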

Recognizing Tokens with a DFA 0 0 0, 1 A (Start) 1 B 1 C!56

Recognizing Tokens with a DFA 0 0 0, 1 A (Start) 1 B 1 C Table-driven implementation.! DFA s can be represented as a 2-dimensional table.!56

Recognizing Tokens with a DFA 0 0 0, 1 A (Start) 1 B 1 C Table-driven implementation.! DFA s can be represented as a 2-dimensional table. Current State On 0 On 1 Note A transition to A transition to B start B transition to B transition to C C transition to C transition to C final!56

Recognizing Tokens with a DFA

[Figure and transition table as on the previous slide.]

Table-driven implementation.

    currentstate = start state
    while end of input not yet reached:
        c = get next input character
        if transitiontable[currentstate][c] ≠ null:
            currentstate = transitiontable[currentstate][c]
        else:
            reject input
    if currentstate is final:
        accept input
    else:
        reject input

This accepts exactly one token in the input.
A real lexer must detect multiple successive tokens.
This can be achieved by resetting to the start state.
But what happens if the suffix of one token is the prefix of another? (See Chapter 2 for a solution.)
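The pseudocode translates almost line for line into Python. The following is a sketch under stated assumptions: states are encoded as row indices, input symbols as column indices, and `None` would mark a missing transition; the names `TRANSITION_TABLE`, `COLUMN`, and `accepts` are hypothetical. Like the pseudocode, it recognizes a single token only.

```python
# States are row indices of the table; the encoding is an assumption
# made for this sketch.
A, B, C = 0, 1, 2

# Map each input symbol to its column in the table.
COLUMN = {"0": 0, "1": 1}

# Rows are states, columns are symbols. A None entry would mean
# "no transition defined"; this particular DFA has none.
TRANSITION_TABLE = [
    [A, B],  # state A (start)
    [B, C],  # state B
    [C, C],  # state C (final)
]
START_STATE = A
FINAL_STATES = {C}


def accepts(text):
    """Table-driven DFA run: True iff the whole input is accepted."""
    current = START_STATE
    for ch in text:
        col = COLUMN.get(ch)  # None for symbols outside the alphabet
        nxt = None if col is None else TRANSITION_TABLE[current][col]
        if nxt is None:
            return False  # reject: no transition for this character
        current = nxt
    return current in FINAL_STATES


print(accepts("0110"))  # True
print(accepts("10"))    # False
```

The explicit `is None` test matters here because state A is encoded as 0, which a plain truthiness check would mistake for a missing transition.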

Lexical Analysis

The need to identify tokens raises two questions.
How can we specify the tokens of a language? With regular expressions.
How can we recognize tokens in a character stream? With DFAs.

[Figure: Token Specification (Regular Expressions) → DFA Construction → Token Recognition (Deterministic Finite Automata)]

No single-step algorithm: we first need to construct a Non-Deterministic Finite Automaton (NFA).

Non-Deterministic Finite Automaton (NFA)

Like a DFA, but less restrictive:
Transitions do not have to be unique: each state may have multiple ambiguous transitions for the same input symbol. (Hence, it can be non-deterministic.)
Epsilon transitions do not consume any input. (They correspond to the empty string.)
Note that every DFA is also an NFA.

Acceptance rule:
Accepts an input string if there exists a series of transitions such that the NFA is in a final state when the end of input is reached.

Inherent parallelism: all possible paths are explored simultaneously.

[Figure: a legal NFA fragment with states X, Y, Z, and V and edges labeled 1, 1, and ε.]
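The "all paths simultaneously" view can be simulated by tracking the set of states the NFA could currently be in. The sketch below is an illustration only: the automaton is a hypothetical four-state NFA that reuses the state names X, Y, Z, and V but is not claimed to match the figure, and the empty string is used as the key for ε-transitions.

```python
# Hypothetical NFA for illustration (not necessarily the one in the figure).
# X has two transitions on 1 (non-determinism); Y has an epsilon transition.
EPSILON = ""
NFA = {
    "X": {"1": {"Y", "Z"}},
    "Y": {EPSILON: {"V"}},
    "Z": {"0": {"V"}},
    "V": {},
}
START = "X"
FINAL = {"V"}


def epsilon_closure(states):
    """All states reachable from `states` using only epsilon transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPSILON, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_accepts(text):
    """Accept iff some series of transitions ends in a final state."""
    current = epsilon_closure({START})
    for ch in text:
        moved = set()
        for s in current:
            moved |= NFA[s].get(ch, set())
        current = epsilon_closure(moved)
    return bool(current & FINAL)


print(nfa_accepts("1"))   # True  (X -1-> Y -ε-> V)
print(nfa_accepts("10"))  # True  (X -1-> Z -0-> V)
print(nfa_accepts("11"))  # False
```

If the set of current states becomes empty, every path has died and the input is rejected, which mirrors the DFA rule for undefined transitions.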