Introduction to Lexical Analysis

Similar documents
Introduction to Lexical Analysis

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Lexical Analysis. Chapter 2

Lexical Analysis. Lecture 2-4

Lexical Analysis. Lecture 3. January 10, 2018

Lexical Analysis. Lecture 3-4

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Lexical Analysis. Finite Automata

Lexical Analysis. Finite Automata

Cunning Plan. Informal Sketch of Lexical Analysis. Issues in Lexical Analysis. Specifying Lexers

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Implementation of Lexical Analysis

Implementation of Lexical Analysis

CSE 3302 Programming Languages Lecture 2: Syntax

Formal Languages and Compilers Lecture VI: Lexical Analysis

Defining Program Syntax. Chapter Two Modern Programming Languages, 2nd ed. 1

Implementation of Lexical Analysis

Lexical Analysis. Introduction

Part 5 Program Analysis Principles and Techniques

1. Lexical Analysis Phase

Compiler Construction LECTURE # 3

A simple syntax-directed

Week 2: Syntax Specification, Grammars

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Specifying Syntax. An English Grammar. Components of a Grammar. Language Specification. Types of Grammars. 1. Terminal symbols or terminals, Σ

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Compiler course. Chapter 3 Lexical Analysis

Administrativia. Extra credit for bugs in project assignments. Building a Scanner. CS164, Fall Recall: The Structure of a Compiler

CSC 467 Lecture 3: Regular Expressions

Implementation of Lexical Analysis. Lecture 4

Chapter 3 Lexical Analysis

Implementation of Lexical Analysis

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Compiler Construction

UNIT -2 LEXICAL ANALYSIS

Lexical Analysis. Implementation: Finite Automata

Topic 3: Syntax Analysis I

Compiler Construction D7011E

CSCE 314 Programming Languages

A Pascal program. Input from the file is read to a buffer program buffer. program xyz(input, output) --- begin A := B + C * 2 end.

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

A Simple Syntax-Directed Translator

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

Lexical analysis. Syntactical analysis. Semantical analysis. Intermediate code generation. Optimization. Code generation. Target specific optimization

CPS 506 Comparative Programming Languages. Syntax Specification

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

CSE302: Compiler Design

CSE302: Compiler Design

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

COMPILER DESIGN LECTURE NOTES

Non-deterministic Finite Automata (NFA)

CMPT 755 Compilers. Anoop Sarkar.

Lexical Analysis (ASU Ch 3, Fig 3.1)

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Syntax. In Text: Chapter 3

Compiler Construction

CSc 453 Compilers and Systems Software

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

LANGUAGE PROCESSORS. Presented By: Prof. S.J. Soni, SPCE Visnagar.

Compiler Construction

Chapter 3: CONTEXT-FREE GRAMMARS AND PARSING Part 1

Today. Assignments. Lecture Notes CPSC 326 (Spring 2019) Quiz 2. Lexer design. Syntax Analysis: Context-Free Grammars. HW2 (out, due Tues)

Syntactic Analysis. The Big Picture Again. Grammar. ICS312 Machine-Level and Systems Programming

Principles of Programming Languages COMP251: Syntax and Grammars

Where We Are. CMSC 330: Organization of Programming Languages. This Lecture. Programming Languages. Motivation for Grammars

Programming Languages and Compilers (CS 421)

UNIT II LEXICAL ANALYSIS

Programming Languages & Compilers. Programming Languages and Compilers (CS 421) I. Major Phases of a Compiler. Programming Languages & Compilers

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Lexical Analyzer Scanner

Principles of Programming Languages COMP251: Syntax and Grammars

CSCI312 Principles of Programming Languages!

Lexical Analyzer Scanner

Chapter 3: Lexical Analysis

CMSC 330: Organization of Programming Languages. Architecture of Compilers, Interpreters

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Languages and Compilers

Implementation of Lexical Analysis

CMSC 330: Organization of Programming Languages

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

Compiler Construction

Related Course Objec6ves

Course Overview. Introduction (Chapter 1) Compiler Frontend: Today. Compiler Backend:

CSE450 Translation of Programming Languages. Lecture 4: Syntax Analysis

Syntax. A. Bellaachia Page: 1

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

CS 314 Principles of Programming Languages. Lecture 3

The Structure of a Syntax-Directed Compiler

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

Outline CS4120/4121. Compilation in a Nutshell 1. Administration. Introduction to Compilers Andrew Myers. HW1 out later today due next Monday.

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.

Transcription:

Introduction to Lexical Analysis

Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexers Regular expressions Examples of regular expressions Compiler Design 1 (2011) 2

Lexical Analysis What do we want to do? Example: if (i == j) then Z = 0; else Z = 1; The input is just a string of characters: \tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1; Goal: Partition input string into substrings Where the substrings are tokens Compiler Design 1 (2011) 3

What s a Token? A syntactic category In English: noun, verb, adjective, In a programming language: Identifier, Integer, Keyword, Whitespace, Compiler Design 1 (2011) 4

Tokens Tokens correspond to sets of strings. Identifier: strings of letters or digits, starting with a letter Integer: a non-empty string of digits Keyword: else or if or begin or Whitespace: a non-empty sequence of blanks, newlines, and tabs Compiler Design 1 (2011) 5

What are Tokens used for? Classify program substrings according to role Output of lexical analysis is a stream of tokens...... which is input to the parser Parser relies on token distinctions An identifier is treated differently than a keyword Compiler Design 1 (2011) 6

Designing a Lexical Analyzer: Step 1 Define a finite set of tokens Tokens describe all items of interest Choice of tokens depends on language, design of parser Recall \tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1; Useful tokens for this expression: Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ; Compiler Design 1 (2011) 7

Designing a Lexical Analyzer: Step 2 Describe which strings belong to each token Recall: Identifier: strings of letters or digits, starting with a letter Integer: a non-empty string of digits Keyword: else or if or begin or Whitespace: a non-empty sequence of blanks, newlines, and tabs Compiler Design 1 (2011) 8

Lexical Analyzer: Implementation An implementation must do two things: 1. Recognize substrings corresponding to tokens 2. Return the value or lexeme of the token The lexeme is the substring Compiler Design 1 (2011) 9

Example Recall: \tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1; Token-lexeme groupings: Identifier: i, j, z Keyword: if, then, else Relation: == Integer: 0, 1 (, ), =, ; single character of the same name Compiler Design 1 (2011) 10

Why do Lexical Analysis? Dramatically simplify parsing The lexer usually discards uninteresting tokens that don t contribute to parsing E.g. Whitespace, Comments Converts data early Separate out logic to read source files Potentially an issue on multiple platforms Can optimize reading code independently of parser Compiler Design 1 (2011) 11

True Crimes of Lexical Analysis Is it as easy as it sounds? Not quite! Look at some programming language history... Compiler Design 1 (2011) 12

Lexical Analysis in FORTRAN FORTRAN rule: Whitespace is insignificant E.g., VAR1 is the same as VA R1 Footnote: FORTRAN whitespace rule was motivated by inaccuracy of punch card operators Compiler Design 1 (2011) 13

A terrible design! Example Consider DO 5 I = 1,25 DO 5 I = 1.25 The first is DO 5 I = 1, 25 The second is DO5I = 1.25 Reading left-to-right, cannot tell if DO5I is a variable or DO stmt. until after, is reached Compiler Design 1 (2011) 14

Lexical Analysis in FORTRAN. Lookahead. Two important points: 1. The goal is to partition the string. This is implemented by reading left-to-write, recognizing one token at a time 2. Lookahead may be required to decide where one token ends and the next token begins Even our simple example has lookahead issues i vs. if = vs. == Compiler Design 1 (2011) 15

Another Great Moment in Scanning PL/1: Keywords can be used as identifiers: IF THEN THEN THEN = ELSE; ELSE ELSE = IF can be difficult to determine how to label lexemes Compiler Design 1 (2011) 16

More Modern True Crimes in Scanning Nested template declarations in C++ vector<vector<int>> myvector vector < vector < int >> myvector (vector < (vector < (int >> myvector))) Compiler Design 1 (2011) 17

Review The goal of lexical analysis is to Partition the input string into lexemes (the smallest program units that are individually meaningful) Identify the token of each lexeme Left-to-right scan lookahead sometimes required Compiler Design 1 (2011) 18

Next We still need A way to describe the lexemes of each token A way to resolve ambiguities Is if two variables i and f? Is == two equal signs = =? Compiler Design 1 (2011) 19

Regular Languages There are several formalisms for specifying tokens Regular languages are the most popular Simple and useful theory Easy to understand Efficient implementations Compiler Design 1 (2011) 20

Languages Def. Let Σ be a set of characters. A language Λ over Σ is a set of strings of characters drawn from Σ (Σ is called the alphabet of Λ) Compiler Design 1 (2011) 21

Examples of Languages Alphabet = English characters Language = English sentences Alphabet = ASCII Language = C programs Not every string on English characters is an English sentence Note: ASCII character set is different from English character set Compiler Design 1 (2011) 22

Notation Languages are sets of strings Need some notation for specifying which sets of strings we want our language to contain The standard notation for regular languages is regular expressions Compiler Design 1 (2011) 23

Atomic Regular Expressions Single character Epsilon ' c' = {" c" } ε = {""} Compiler Design 1 (2011) 24

Compound Regular Expressions Union Concatenation Iteration * { or } A+ B= s s A s B { and } AB= ab a A b B = U i i = i 0 A A where A A... i times... A Compiler Design 1 (2011) 25

Regular Expressions Def. The regular expressions over Σ are the smallest set of expressions including ε ' c' where c A+ B where A, B are rexp over AB " " " * A where A is a rexp over Compiler Design 1 (2011) 26

Syntax vs. Semantics To be careful, we should distinguish syntax and semantics (meaning) of regular expressions L( ε) = {""} L(' c') = {" c"} LA ( + B) = LA ( ) LB ( ) L( AB) = { ab a L( A) and b L( B)} LA = U LA * ( ) ( i ) i 0 Compiler Design 1 (2011) 27

Example: Keyword Keyword: else or if or begin or ' else' + 'if' + 'begin' + L Note: 'else' abbreviates 'e''l''s''e' Compiler Design 1 (2011) 28

Example: Integers Integer: a non-empty string of digits digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9' integer = digit digit * Abbreviation: A + = AA * Compiler Design 1 (2011) 29

Example: Identifier Identifier: strings of letters or digits, starting with a letter letter = 'A' + K+ 'Z' + 'a' + K+ 'z' identifier = letter (letter + digit) * * * Is (letter + digit ) the same? Compiler Design 1 (2011) 30

Example: Whitespace Whitespace: a non-empty sequence of blanks, newlines, and tabs (' ' + '\n' + '\t') + Compiler Design 1 (2011) 31

Example 1: Phone Numbers Regular expressions are all around you! Consider +46(0)18-471-1056 Σ = digits {+,,(,)} country = digit digit city = digit digit univ = digit digit digit extension = digit digit digit digit phone_num = + country ( 0 ) city univ extension Compiler Design 1 (2011) 32

Example 2: Email Addresses Consider kostis@it.uu.se = letters {.,@} name = + letter address = name '@' name '.' name '.' name Compiler Design 1 (2011) 33

Summary Regular expressions describe many useful languages Regular languages are a language specification We still need an implementation Next time: Given a string s and a regular expression R, is s L( R)? Compiler Design 1 (2011) 34