Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Similar documents
Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Introduction to Lexical Analysis

Formal Languages and Compilers Lecture VI: Lexical Analysis

Introduction to Lexical Analysis

Lexical Analysis. Chapter 2

Lexical Analysis. Lecture 2-4

Lexical Analysis (ASU Ch 3, Fig 3.1)

Lexical Analysis. Lecture 3-4

Zhizheng Zhang. Southeast University

Lexical Analysis. Lecture 3. January 10, 2018

Implementation of Lexical Analysis

Lexical Analyzer Scanner

Lexical Analyzer Scanner

Part 5 Program Analysis Principles and Techniques

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Implementation of Lexical Analysis

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

UNIT -2 LEXICAL ANALYSIS

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

Front End: Lexical Analysis. The Structure of a Compiler

Chapter 3 Lexical Analysis

Lexical Analysis. Finite Automata

Compiler course. Chapter 3 Lexical Analysis

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

Lexical Analysis. Introduction

Lexical Analysis. Finite Automata

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

Figure 2.1: Role of Lexical Analyzer

CS308 Compiler Principles Lexical Analyzer Li Jiang

Cunning Plan. Informal Sketch of Lexical Analysis. Issues in Lexical Analysis. Specifying Lexers

UNIT II LEXICAL ANALYSIS

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

Compiler Construction

Question Bank. 10CS63:Compiler Design

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Compiler Construction LECTURE # 3

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Outline CS4120/4121. Compilation in a Nutshell 1. Administration. Introduction to Compilers Andrew Myers. HW1 out later today due next Monday.

Implementation of Lexical Analysis

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

Compiler Construction

Lexical Analysis. Implementation: Finite Automata

Compiler Construction

Non-deterministic Finite Automata (NFA)

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

Implementation of Lexical Analysis

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

CS415 Compilers. Lexical Analysis

2068 (I) Attempt all questions.

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

Week 2: Syntax Specification, Grammars

CS 314 Principles of Programming Languages. Lecture 3

Syntax and Parsing COMS W4115. Prof. Stephen A. Edwards Fall 2004 Columbia University Department of Computer Science

UNIT I- LEXICAL ANALYSIS. 1.Interpreter: It is one of the translators that translate high level language to low level language.

UNIT III & IV. Bottom up parsing

Lexical Analysis. Sukree Sinthupinyo July Chulalongkorn University

DVA337 HT17 - LECTURE 4. Languages and regular expressions

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

Chapter Seven: Regular Expressions

1. Lexical Analysis Phase

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Implementation of Lexical Analysis. Lecture 4

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

2. Lexical Analysis! Prof. O. Nierstrasz!

A simple syntax-directed

CSE 3302 Programming Languages Lecture 2: Syntax

Ambiguous Grammars and Compactification

CSc 453 Lexical Analysis (Scanning)

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

Building Compilers with Phoenix

Lexical Analysis - 1. A. Overview A.a) Role of Lexical Analyzer

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Where We Are. CMSC 330: Organization of Programming Languages. This Lecture. Programming Languages. Motivation for Grammars

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

The Language for Specifying Lexical Analyzer

CS 403: Scanning and Parsing

Monday, August 26, 13. Scanners

Lexical Analysis 1 / 52

Languages, Automata, Regular Expressions & Scanners. Winter /8/ Hal Perkins & UW CSE B-1

Regular Expressions. Regular Expressions. Regular Languages. Specifying Languages. Regular Expressions. Kleene Star Operation

CMSC 330: Organization of Programming Languages. Architecture of Compilers, Interpreters

CSCI-GA Compiler Construction Lecture 4: Lexical Analysis I. Hubertus Franke

Wednesday, September 3, 14. Scanners

Specifying Syntax. An English Grammar. Components of a Grammar. Language Specification. Types of Grammars. 1. Terminal symbols or terminals, Σ

Compiler phases. Non-tokens

CS 314 Principles of Programming Languages

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Dr. D.M. Akbar Hussain

Languages and Compilers

Transcription:

Lexical Analysis Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Phase Ordering of Front-Ends Lexical analysis (lexer) Break input string into words called tokens Syntactic analysis (parser) Recover structure from the text and put it in a parse tree Semantic Analysis Discover meaning (e.g., type-checking) Prepare for code generation Works with a symbol table

Similarity to Natural Languages Tokens and a Parse Tree 문장 : non-terminals 주어목적어서술어 : non-terminals 대명사조사명사조사동사 : Tokens (also called terminals) 나는 학교를간다

What is a Token? A syntactic category In English: Noun, verb, adjective, In a programming language: Identifier, Integer, Keyword, White-space, A token corresponds to a set of strings

Terms Token Syntactic atoms that are terminal symbols in the grammar from the source language A data structure (or pointer to it) returned by lexer Patten A rule that defines strings corresponding to a token Lexeme A string in the source code that matches a pattern

An Example of these Terms int foo = 100; The lexeme matched by the pattern for the token represents a string of characters in the source program that can be treated as a lexical unit

What are Tokens For? Classify substrings of a given program according to its role Parser relies on token classification e.g., How to handle reserved keywords? As an identifier or a separate keyword for each? Output of the lexer is a stream of tokens which is input to the parser How parser and lexer co-work? Parser leads the work

Lexical Analysis Problem Partition input string of characters into disjoint substrings which are tokens Useful tokens here: identifier, keyword, relop, integer, white space, (, ), =, ;

Designing Lexical Analyzer First, define a set of tokens Tokens should describe all items of interest Choice of tokens depends on the language and the design of the parser Then, describe what strings belongs to each token by providing a pattern for it

Implementing Lexical Analyzer Implementation must do two thing: Recognize substrings corresponding to tokens Return the value or lexeme of the token: the substring matching the category Reading left-to-right, recognizing one token at a time The lexer usually discards uninteresting tokens that do not contribute to parsing Examples: white space, comments Is it as easy as it sounds? Not actually! Due to lookahead and ambiguity issues (Look at the history)

Lexical Analysis in Fortran Fortran rule: white space is insignificant Example: VAR1 is the same as VA R1 Left-to-right reading is not enough DO 5 I = 1,25 ==> DO 5 I = 1, 25 DO 5 I = 1.25 ==> DO5I = 1.25 Reading left-to-right cannot tell whether DO5I is a variable or a DO statement until. or, is reached Lookahead may be needed to decide where a token ends and the next token begins Even our simple example has lookahead issues e.g, = and ==

Lexical Analysis in PL/I PL/I keywords are not reserved IF THEN ELSE THEN = ELSE; ELSE ELSE = THEN PL/I Declarations DECLARE (ARG1,..,ARGN) Cannot tell whether DECLARE is a keyword or an array reference until we see the charater that follows ), requiring an arbitrarily long lookahead

Lexical Analysis in C++ C++ template syntax: Foo<Bar> C++ io stream syntax: Cin >> var; But there is a conflict with nested templates Foo<Bar<int>>

Review The goal of lexical analysis is to Partition the input string into lexemes Identify the token of each lexeme Left-to-right scan, sometimes requiring lookahead We still need A way to describe the lexemes of each token: pattern A way to resolve ambiguities Is == two equal signs = = or a single relational op?

Specifying Tokens: Regular Languages There are several formalisms for specifying tokens but the most popular one is regular languages Regular languages are not perfect but they have a concise (though sometimes not user-friendly) expression: regular expression a useful theory to evaluate them finite automata a well-understood, efficient implementation a tool to process regular expressions lex Lexical definitions (regular expressions) lex a table-driven lexer (C program)

Formal Language Theory Alphabet : a finite set of symbols (characters) Ex: {a,b}, an ASCII character set String: a finite sequence of symbols over Ex: abab, aabb, a over {a,b}; hello over ASCII Empty string є: zero-length string є Ø {є} Language: a set of strings over Ex: {a, b, abab} over {a,b} Ex: a set of all valid C programs over ASCII

Operations on Strings Concatenation ( ): a b = ab, hello there = hellothere Denoted by α β = αβ Exponentiation: hello 3 = hello hello hello = hellohellohello, hello 0 = є Terms for parts of a string s prefix of s : A string obtained by removing zero or more trailing symbols of string s: (Ex: ban is a prefix of banana) proper prefix of s: A non-empty prefix of s that is not s

Operations on Languages Lex X and Y be sets of strings Concatenation ( ): X Y = {x y x X, y Y} Ex: X = {Liz, Add} Y = {Eddie, Dick} X Y = {LizEddie, LizDick, AddEddie,AddDick} Exponentiation: X 2 = X X X 0 = є Union: X Y = {u u X or u Y} Kleene s Closure: X * = Ex:X = {a,b}, X * = {є, a, b, aa, ab, ba, bb, aaa,..} X i

Regular Languages over Definition of regular languages over Ø is regular {a} is regular {є} is regular R S is regular if R, S are regular R S is regular if R, S are regular Nothing else

Regular Expressions (RE) over In order to describe a regular language, we can use a regular expression (RE), which is strings over representing the regular language Ø is a regular expression є is a regular expression a is regular expression for a Let r, s be regular expressions. Then, (r) (s) is a regular expression (r) (s) is a regular expression (r) * is a regular expression Nothing else Ex: = {a, b}, ab ba * = (a)(b) ((b)((a) * ))

Regular Expressions & Languages Let s and r be REs L(Ø) = Ø, L(є) = {є}, L(a) = {a} L(s r) = L(s) L(r), L(s r) = L(s) L(r) L(r * )=(L(r)) * Anything that can be constructed by a finite number of applications of the rules in the previous page is a regular expression which equally describe a regular language Ex: ab * = {a, ab, abb, } Quiz: what is a RE describing at least one a and any number of b s (a b) * a(a b) * or (a * b * ) * a(a * b * ) *

Non-Regular Languages Not all languages are regular (i.e., cannot be described by any regular expressions) Ex: set of all strings of balanced parentheses {(), (()), ((())), (((()))), } What about ( * ) *? Nesting can be described by a context-free grammar Ex: Set of repeating strings {wcw w is a string of a s and b s} {aca, abcab, abacaba, } Cannot be described even by a context-free grammar Regular languages are not that powerful

RE Shorthands r? = r Є (zero or one instance of r) r + = r r * (positive closure) Charater class: [abc] = a b c, [a-z] = a b c z Ex: ([ab]c?) + = {a, b, aa, ab, ac, ba, bb, bc, }

Regular Definition For convenience, we give names to regular expressions and define other regular expressions using these names as if they are symbols Regular definition is a sequence of definitions of the following form, d 1 r 1 d 2 r 2 d n r n d i is a distinct name r i is a regular expression over the symbols in {d 1,d 2,,d i-1 } For lex we use regular definitions to specify tokens; for example, letter [A-Za-z] digit [0-9] id letter(letter digit) *

Examples of Regular Expressions Our tokens can be specified by the following for for id letter(letter digit)* relop < <= ==!= > >= num digit + (.digit + )?(E(+ -)?digit + )? Our lexer will strip out white spaces delim [ \t\n] ws delim +

More Regular Expression Examples Regular expressions are all around you! Phone numbers: (02)-880-1814 = digit {-,(,)} exchange digit 3 phone digit 4 area (digit 3 ) phone_number = area exchange - phone

Another Regular Expression Example E-mail addresses: smoon@altair.snu.ac.kr = letter {.,@} Name = letter + Address = name @ name. name. name. name Real e-mail address will be more elaborate but still regular Other examples: file path names, etc.

Review and the Next Issue Regular expressions are a language specification that describe many useful languages including set of tokens for programming language compilers We still need an implementation for them Our problem is Given a string s and a regular expression R, is s L(R)? Solution for this problem is the base of lexical analyzer A naïve solution: transition diagram and input buffering A more elaborate solution Using the theory and practice of deterministic finite automata (DFA)

Transition Diagram A flowchart corresponding to regular expression(s) to keep track of information as characters are scanned Composed of states and edges that show transition

Input Buffering Two pointers are maintained Initially both pointers point the first character of the next lexeme Forward pointer scans; if a lexeme is found, it is set to the last character of the lexeme found After processing the lexeme, both pointers are set to the character immediately the lexeme

Making Lexer using Transition Diagrams Build a list of transition diagrams for all regular expressions Start from the top transition diagram and if it fails, try the next diagram until found; fail() is used to move the forward pointer back to the lexeme_beginning If a lexeme is found but requires retract(n), move the forward pointer n charcters back Basically, these ideas are used when implementing deterministic finite automata (DFA) in lex

Deterministic Finite Automata (DFA) Language recognizers with finite memory contained in states A DFA accepts/rejects a given string if it is/is not a language of the DFA Regular languages can be recognized by DFAs

Formal Definition of a DFA A deterministic finite state automata M = (, Q, δ, q 0, F) : alphabet Q: set of states δ: Qⅹ Q, a transition function q 0 : the start state F: final states A run on an input x is a sequence of states by consuming x A string x is accepted by M if its run ends in a final state A language accepted by a DFA M, L(M) = {x M accepts x}

Graphic Representation of DFA

A DFA Example: A Number num digit + (.digit + )?(E(+ -)?digit + )?

From Regular Expression to DFA

From DFA to Regular Expression We can determine a RE directly from a DFA either by inspection or by removing states from the DFA

Nondeterministic Finite Automata (NFA) Conversion from RE to NFA is more straightforward є-transition Multiple transitions on a single input i.e., δ : Qⅹ 2 Q We will not cover much of NFA stuff in this lecture Conversion of NFA to DFA: subset construction Ch. 3.6 From RE to an NFA: Thomson s construction Ch. 3.7 Minimizing the number of states in DFA: Ch. 3.9 Equivalence of RE, NFA, and DFA: L(RE) = L(NFA) = L(DFA)

Subset Construction Basic Idea Each DFA state corresponds to a set of NFA states: keep track of all possible states the NFA can be in after reading each symbol The number of states in DFA is exponential in the number of states of NFA (maximum 2 n states)

Thomson s Construction From RE to NFA

Lexical Analysis using Automata Automata vs. Lexer Automata accepts/rejects strings Lexer recognizes tokens (prefixes) from a longer string Lookahead issues: number of characters that must be read beyond the end of a lexeme to recognize it Resolving ambiguities: Longest lexeme rule Precedence rule

Longest Lexeme Rule In case of multiple matches longer ones are matched Ex: floating-point numbers (digit +.digit * (E(+ -)?digit + )?) Can be implemented with our buffer scheme: when we are in accept state, mark the input position and the pattern; keep scanning until fail when we retract the forward pointer back to the last position recorded Precedence rule of lex Another rule of lex to resolve ambiguities: In case of ties lex matches the RE that is closer to the beginning of the lex input

Pitfall of Longest Lexeme Rule The longest lexeme rule does not always work Ex: L = {ab, aba, baa} and input string abab Infinite maximum lookahead is needed for ababaaba... THIS IS A WRONG set of lexemes Unfortunately this might be a real life situation Ex: Floating-point numbers as defined above and resolving.. (DOTDOT ); e.g., 1..2

Lookahead Operator of lex Lookahead Operator RE for lex input can include an operator / such as ab/cd, where ab is matched only if it is followed by cd If matched at d, the forward pointer goes back to b position before the lexeme ab is processed

Summary of Lexical Analysis Token, pattern, lexeme Regular languages Regular expressions (computational model, tools) Finite automata (DFA, NFA) Lexer using automata: longest lexeme rules Tool: lex Programming Assignment #1 Writing a lexical analyzer for a subset of C, subc, using lex (nested comments, lookaheads, hash tables)