CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

LEXICAL ANALYSIS
- Convert source file characters into a token stream.
- Remove content-free characters (comments, whitespace, ...).
- Detect lexical errors (badly-formed literals, illegal characters, ...).
- Output of lexical analysis is input to syntax analysis.
- Could just do lexical analysis as part of syntax analysis, but we choose to handle it separately for better modularity and portability, and to make syntax analysis easier.
- Idea: look for patterns in the input character sequence, convert them to tokens with attributes, and pass them to the parser in a stream.
PSU CS321 W'12 Lecture 4, © 1992-2012 Andrew Tolmach

LEXICAL ANALYSIS EXAMPLE
Pattern                                 Token     Attribute
if                                      IF
else                                    ELSE
print                                   PRINT
then                                    THEN
:=                                      ASSIGN
= or < or >                             RELOP     enum
letter followed by letters or digits    ID        symbol
digits                                  NUM       int
chars between double quotes             STRING    string

Source code:
    if x>17 then count:= 2 else (* oops!*) print "bad!"
Lexeme    Token     Attribute
if        IF
x         ID        "x"
>         RELOP     GT
17        NUM       17
then      THEN
count     ID        "count"
:=        ASSIGN
2         NUM       2
else      ELSE
print     PRINT
"bad!"    STRING    "bad!"

MORE DETAILS
- A token describes a class of character strings with some distinguished meaning in the language. It may describe a unique string (e.g., IF, ASSIGN) or a set of possible strings, in which case an attribute is needed to indicate which. (Tokens are typically represented as elements of an enumeration.)
- A lexeme is the string in the input that actually matched the pattern for some token.
- Attributes represent lexemes converted to a more useful form, e.g.:
    - strings
    - symbols (like strings, but perhaps handled separately)
    - numbers (integers, reals, ...)
    - enumerations
- Whitespace (spaces, tabs, newlines, ...) and comments usually just disappear (unless they affect program meaning).

STREAM INTERFACE
- Could convert the entire input file to a list of tokens/attributes.
- But the parser needs only one token at a time, so use a stream instead:
    File Reader <--> Lexical Analyzer <--> Parser (or test driver)
    Lexical Analyzer to File Reader: "Give next char!" (answer: Char) and "Take it back!"
    Parser to Lexical Analyzer: "Give next token!" (answer: (Token, Attributes))
- What about ERRORS?
- File Reader / Lexical Analyzer: another stream relationship, with the ability to send back a char.
    read();
    unread();
- Lexical Analyzer / Parser: simple call interface. The parser is the CLIENT of the lexical analyzer.
    class Token { TokenCode c; Attribute a; }
    Token gettoken();

HAND-CODED SCANNER (IN PSEUDO-JAVA)
    Token gettoken() {
      while (true) {
        char c = read();
        if (c is whitespace)
          ignore it;
        else if (c is digit) {
          int n = 0;
          do {
            n = n * 10 + (c - '0');   // convert digit character to its value
            c = read();
          } until (c is not a digit);
          unread(c);                  // give back the lookahead character
          return new Token(NUM, n);
        } else if (c is alpha) {
          String s = "";
          do {
            s = s + c;
            c = read();
          } until (c is not an alphanumeric);
          unread(c);
          return new Token(ID, s);
        } else ...
      }
    }
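The pseudo-Java above can be turned into compilable Java along the following lines. This is only a sketch under stated assumptions: the class and member names (HandScanner, Tok, Code) are hypothetical, java.io.PushbackReader stands in for the read()/unread() stream, ASCII letters and digits are assumed, an EOF token is added so the loop can terminate, and numeric overflow is not checked.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical names; a sketch of the hand-coded scanner, not the course's code.
class HandScanner {
    enum Code { NUM, ID, EOF }

    // Token code plus attribute (Integer for NUM, String for ID, null for EOF).
    static final class Tok {
        final Code code;
        final Object attr;
        Tok(Code code, Object attr) { this.code = code; this.attr = attr; }
    }

    private final PushbackReader in;

    HandScanner(String source) {
        in = new PushbackReader(new StringReader(source));
    }

    // read() returns -1 at end of input.
    private int read() {
        try { return in.read(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Push one lookahead character back; nothing to push back at end of input.
    private void unread(int c) {
        if (c == -1) return;
        try { in.unread(c); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    private static boolean isDigit(int c) { return c >= '0' && c <= '9'; }

    private static boolean isAlpha(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    Tok getToken() {
        while (true) {
            int c = read();
            if (c == -1) return new Tok(Code.EOF, null);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') continue; // skip whitespace
            if (isDigit(c)) {
                int n = 0;
                do { n = n * 10 + (c - '0'); c = read(); } while (isDigit(c));
                unread(c);
                return new Tok(Code.NUM, n);
            }
            if (isAlpha(c)) {
                StringBuilder s = new StringBuilder();
                do { s.append((char) c); c = read(); } while (isAlpha(c) || isDigit(c));
                unread(c);
                return new Tok(Code.ID, s.toString());
            }
            // The slide elides the remaining cases ("else ..."); here they are an error.
            throw new IllegalStateException("illegal character: " + (char) c);
        }
    }
}
```

Calling getToken() repeatedly on new HandScanner("x17 42") should yield an ID with attribute "x17", a NUM with attribute 42, and then EOF.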

PROS AND CONS OF HAND-CODED SCANNERS
- Efficient!
- But easy to get wrong! Note intermixed code for input, output, patterns, conversion.
- Hard to specify! (esp. patterns).

FORMALIZING PATTERN DESCRIPTIONS
Ex.: "An identifier is a letter followed by any number of letters or digits."
- Exactly what is a letter?
    LETTER → a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
- Exactly what is a digit?
    DIGIT → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- How can we express "letters or digits"?
    L_OR_D → LETTER | DIGIT
- How can we express "any number of"?
    L_OR_DS → L_OR_D*
- How can we express "followed by"?
    IDENTIFIER → LETTER L_OR_DS

REGULAR EXPRESSIONS
- A regular expression (R.E.) is a concise formal characterization of a regular language (or regular set).
- Example: The regular language containing all identifiers is described by the regular expression
    LETTER (LETTER | DIGIT)*
  where | means "or" and e* means "zero or more copies of e".
- Regular languages are one particular kind of formal language.

LANGUAGES: SOME PRELIMINARY DEFINITIONS
- An alphabet is a set of symbols (e.g., the ASCII character set).
- A language over an alphabet is a set of strings of symbols from that alphabet.
- We write $\epsilon$ for the empty string (containing zero characters); some authors use $\lambda$ instead.
- If $x$ and $y$ are strings, then the concatenation $xy$ is the string consisting of the characters of $x$ followed by the characters of $y$.
- If $L$ and $M$ are languages, then their concatenation is $LM = \{xy \mid x \in L, y \in M\}$.
- The exponentiation of a language $L$ is defined thus: $L^0 = \{\epsilon\}$, the language containing just the empty string, and $L^i = L^{i-1}L$ for $i > 0$.
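For finite languages, the definitions of concatenation and exponentiation can be tried out directly. The following is a minimal sketch; the class and method names (LanguageOps, concat, power) are hypothetical, and languages are represented as sets of Java strings, with "" playing the role of the empty string.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class: finite languages as sets of strings.
class LanguageOps {
    // LM = { xy | x in L, y in M }
    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> result = new LinkedHashSet<>();
        for (String x : l) {
            for (String y : m) {
                result.add(x + y);
            }
        }
        return result;
    }

    // L^0 = { "" } and L^i = L^(i-1) L for i > 0
    static Set<String> power(Set<String> l, int i) {
        Set<String> result = new LinkedHashSet<>();
        result.add("");                      // L^0: just the empty string
        for (int k = 0; k < i; k++) {
            result = concat(result, l);      // L^(k+1) = L^k L
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> l = new LinkedHashSet<>(Arrays.asList("a", "b"));
        System.out.println(power(l, 2));     // prints [aa, ab, ba, bb]
    }
}
```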

REGULAR EXPRESSIONS AND LANGUAGES
Each R.E. over an alphabet $\Sigma$ denotes a regular language over $\Sigma$, according to the following inductive definition:
- Base rules:
    - The R.E. $\epsilon$ denotes $\{\epsilon\}$.
    - For each $a \in \Sigma$, the R.E. $a$ denotes $\{a\}$, the language containing the single string containing just $a$.
- Inductive rules: If the R.E. $R$ denotes $L_R$ and the R.E. $S$ denotes $L_S$, then
    - $R \mid S$ denotes $L_R \cup L_S$.
    - $R \cdot S$ (or just $RS$) denotes $L_R L_S$.
    - $R^*$ denotes $L_R^* = \bigcup_{i=0}^{\infty} L_R^i$, the Kleene closure (the concatenation of zero or more strings from $L_R$).
    - Also: $(R)$ denotes $L_R$.
- Precedence rules: $()$ before $*$ before $\cdot$ before $\mid$.

REGULAR EXPRESSIONS
- Examples (over alphabet {a, b}):
    a*                      zero or more a's
    (a | b)*                all strings of a's and b's of length ≥ 0
    (a* b*)*                ditto
    (aa | ab | ba | bb)*    all strings of a's and b's of even length
- Counterexamples (not every language is regular!):
    - $\{a^n b^n \mid n \geq 0\}$
    - Set of strings over {(,)} such that parentheses are properly matched.
    - Implication: regular languages can't be used to describe arithmetic expressions.
- R.E.'s are everywhere in command-line programming tools: grep, Perl, shell commands, etc.
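The example expressions can be checked against sample strings with Java's java.util.regex package, which happens to write alternation and Kleene star the same way (| and *). This is a small sketch with a hypothetical class and helper name (RegexExamples, inLanguage). Note that Java's pattern syntax has extra features beyond the three operators defined in these slides, so only the basic operators are used here.

```java
import java.util.regex.Pattern;

// Hypothetical helper: test membership of a string in the language of an R.E.
class RegexExamples {
    // Pattern.matches succeeds only if the entire string matches the expression.
    static boolean inLanguage(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        String anyAB = "(a|b)*";           // all strings of a's and b's
        String anyAB2 = "(a*b*)*";         // same language, written differently
        String evenAB = "(aa|ab|ba|bb)*";  // strings of a's and b's of even length

        for (String s : new String[] {"", "a", "abba", "aba", "abc"}) {
            System.out.println("\"" + s + "\": "
                + inLanguage(anyAB, s) + " "
                + inLanguage(anyAB2, s) + " "
                + inLanguage(evenAB, s));
        }
    }
}
```

For every input, the first two columns should agree, since (a | b)* and (a* b*)* denote the same language; the third is true only for even-length strings of a's and b's.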

REGULAR DEFINITIONS
- Give names to R.E.'s and then use these as a shorthand.
- Must avoid recursive definitions!
- Example of syntactic sugar.
- Examples:
    id       → letter (letter | digit)*
    num      → digit (digit)*                  or this shorthand: digit+
    if       → if                              (not too useful!)
    then     → then
    relop    → < | > | <= | >= | =             or: <(ε | =) | >(ε | =) | =
    assgn    → :=
    string   → "(nonquote)*"
    letter   → a | b | ... | z | A | ... | Z   or this shorthand: [a-zA-Z]
    digit    → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9    or this shorthand: [0-9]
    nonquote → letter | digit | ! | $ | % | ...
- Note that id and keywords have overlapping patterns.
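Because regular definitions are non-recursive, each name can simply be expanded into the expression it stands for. The sketch below mimics that expansion with Java string constants fed to java.util.regex. The class name RegularDefinitions is hypothetical, and since the slide elides the full nonquote character list, the sketch assumes "any character other than a double quote".

```java
import java.util.regex.Pattern;

// Hypothetical class: regular definitions as named, non-recursive string constants.
class RegularDefinitions {
    static final String LETTER = "[a-zA-Z]";
    static final String DIGIT = "[0-9]";
    // id -> letter (letter | digit)*
    static final String ID = LETTER + "(" + LETTER + "|" + DIGIT + ")*";
    // num -> digit+
    static final String NUM = DIGIT + "+";
    // relop -> < | > | <= | >= | =
    static final String RELOP = "<=|>=|<|>|=";
    // Assumption: nonquote is any character except a double quote.
    static final String NONQUOTE = "[^\"]";
    // string -> "(nonquote)*"
    static final String STRING = "\"" + NONQUOTE + "*\"";

    // True if the whole lexeme matches the named definition.
    static boolean matches(String definition, String lexeme) {
        return Pattern.matches(definition, lexeme);
    }
}
```

With these definitions, matches(ID, "if") is true, which illustrates the overlap between id and keyword patterns: a scanner needs some extra rule to decide which token wins.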

SPECIFYING LEXICAL ANALYZERS
- Can define a lexical analyzer via a list of pairs:
    (regular expression, action)
  where the regular expression describes the token pattern (maybe using auxiliary regular definitions), and the action is a piece of code, parameterized by the matching lexeme, that returns a (token, attribute) pair.
- Example:
    (digit+,                  {return new Token(NUM,parseInt(lexeme));})
    (alpha(alpha | digit)*,   {return new Token(ID,lexeme);})
    (space | tab | newline,   {})
    (...,                     ...)
    (...,                     ...)
    (...,                     ...)
- So R.E.s can help us specify scanners.
- But can they help us generate running code that does pattern matching?
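To make the (regular expression, action) list concrete, here is one way it could be interpreted directly in Java using java.util.regex. This is a sketch under assumptions, not the generated-code answer the closing question is heading toward: the names RuleScanner, Rule, and scan are hypothetical, tokens are represented as plain strings for brevity, and the longest-match rule with ties going to the earlier pair is an assumed (though common) convention that the slide does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a scanner driven by a list of (regex, action) pairs.
class RuleScanner {
    static final class Rule {
        final Pattern pattern;
        final Function<String, String> action;  // lexeme -> token text, or null to skip
        Rule(String regex, Function<String, String> action) {
            this.pattern = Pattern.compile(regex);
            this.action = action;
        }
    }

    static final List<Rule> RULES = Arrays.asList(
        new Rule("[0-9]+", lexeme -> "NUM(" + Integer.parseInt(lexeme) + ")"),
        new Rule("[a-zA-Z][a-zA-Z0-9]*", lexeme -> "ID(" + lexeme + ")"),
        new Rule("[ \\t\\n]", lexeme -> null));  // whitespace: empty action

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestEnd = pos;
            for (Rule r : RULES) {
                Matcher m = r.pattern.matcher(input).region(pos, input.length());
                // Assumed convention: the longest match wins; earlier rules win ties.
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = r;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("lexical error at offset " + pos);
            }
            String token = best.action.apply(input.substring(pos, bestEnd));
            if (token != null) {
                tokens.add(token);
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("count 42"));  // prints [ID(count), NUM(42)]
    }
}
```

Trying every pattern at every position keeps the sketch short, but it is slow compared with the hand-coded scanner.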