Structure of Programming Languages Lecture 3

Similar documents
CS415 Compilers. Lexical Analysis

Implementation of Lexical Analysis

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Implementation of Lexical Analysis

Lexical Analysis. Lecture 3-4

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 3

Formal Languages and Compilers Lecture VI: Lexical Analysis

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Lexical Analysis. Lecture 2-4

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

CSC 467 Lecture 3: Regular Expressions

Dr. D.M. Akbar Hussain

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Introduction to Lexing and Parsing

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Lexical Analysis. Chapter 2

CPSC 434 Lecture 3, Page 1

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Outline CS4120/4121. Compilation in a Nutshell 1. Administration. Introduction to Compilers Andrew Myers. HW1 out later today due next Monday.

Figure 2.1: Role of Lexical Analyzer

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

CSCI312 Principles of Programming Languages!

Introduction to Lexical Analysis

Week 2: Syntax Specification, Grammars

Implementation of Lexical Analysis

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Implementation of Lexical Analysis

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Lexical Analysis. Introduction

Assignment 1 (Lexical Analyzer)

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

More Examples. Lex/Flex/JLex

UNIT -2 LEXICAL ANALYSIS

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

A lexical analyzer generator for Standard ML. Version 1.6.0, October 1994

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

Assignment 1 (Lexical Analyzer)

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Compiler course. Chapter 3 Lexical Analysis

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

CS308 Compiler Principles Lexical Analyzer Li Jiang

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

Lecture Outline. COMP-421 Compiler Design. What is Lex? Lex Specification. ! Lexical Analyzer Lex. ! Lex Examples. Presented by Dr Ioanna Dionysiou

Compiler Construction

Compiler Construction

ML 4 A Lexer for OCaml s Type System

Lexical Analysis. Lecture 3. January 10, 2018

Lexical Analysis (ASU Ch 3, Fig 3.1)

Administrativia. Extra credit for bugs in project assignments. Building a Scanner. CS164, Fall Recall: The Structure of a Compiler

Lexical Analysis. Sukree Sinthupinyo July Chulalongkorn University

2. Lexical Analysis! Prof. O. Nierstrasz!

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

Chapter 3 Lexical Analysis

Alternation. Kleene Closure. Definition of Regular Expressions

Compiler Construction LECTURE # 3

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Lexical and Syntax Analysis

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 2

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

Lexical Analysis - 1. A. Overview A.a) Role of Lexical Analyzer

Regular Expressions & Automata

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Compiler Design. 2. Regular Expressions & Finite State Automata (FSA) Kanat Bolazar January 21, 2010

CS 314 Principles of Programming Languages

11. a b c d e. 12. a b c d e. 13. a b c d e. 14. a b c d e. 15. a b c d e

Chapter 3: Lexing and Parsing

Implementation of Lexical Analysis

Implementation of Lexical Analysis

CMSC 350: COMPILER DESIGN

Lexical Analysis 1 / 52

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

MP 3 A Lexer for MiniJava

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Compiler phases. Non-tokens

CS 314 Principles of Programming Languages. Lecture 3

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

CSc 453 Lexical Analysis (Scanning)

MP 3 A Lexer for MiniJava

CS402 Theory of Automata Solved Subjective From Midterm Papers. MIDTERM SPRING 2012 CS402 Theory of Automata

FSASIM: A Simulator for Finite-State Automata

Part 5 Program Analysis Principles and Techniques

Homework 1 Due Tuesday, January 30, 2018 at 8pm

Lexical Analysis. Implementation: Finite Automata

Languages, Automata, Regular Expressions & Scanners. Winter /8/ Hal Perkins & UW CSE B-1

Non-deterministic Finite Automata (NFA)

EXPERIMENT NO : M/C Lenovo Think center M700 Ci3,6100,6th Gen. H81, 4GB RAM,500GB HDD

CMSC 132: Object-Oriented Programming II

Standard 11. Lesson 9. Introduction to C++( Up to Operators) 2. List any two benefits of learning C++?(Any two points)

Regular Expressions. Regular Expressions. Regular Languages. Specifying Languages. Regular Expressions. Kleene Star Operation

Lecture 4: Syntax Specification

The Structure of a Syntax-Directed Compiler

Transcription:

Structure of Programming Languages Lecture 3 CSCI 6636 4536 Spring 2017 CSCI 6636 4536 Lecture 3... 1/25 Spring 2017 1 / 25

Outline 1 Finite Languages Deterministic Finite State Machines Lexical Analysis Regular Expressions 2 Creating a Lexer Lexer Generators Flex 3 Homework CSCI 6636 4536 Lecture 3... 2/25 Spring 2017 2 / 25

Part 1 1. Finite Languages Deterministic Finite State Machines Lexical Analysis CSCI 6636 4536 Lecture 3... 3/25 Spring 2017 3 / 25

Deterministic Finite State Machines Deterministic Finite Automata (DFA) A deterministic finite automaton is an abstract machine for classifying strings. It reads an input string and either accepts or rejects it. It has: A finite set of states, including one called S0, the initial state. A subset of the states, A, called accepting states. An input alphabet. A state transition rule. The machine starts in state S 0. At each step, it reads an input character, c. Then it moves to another state, depending on c and the current state. After the last symbol has been read and processed, the string is accepted if the current state is in the set A. In some generalizations, each state transition has an associated action. CSCI 6636 4536 Lecture 3... 4/25 Spring 2017 4 / 25

Deterministic Finite State Machines Automaton 1 S0 a P D b a R b a a, b Alphabet: a, b States: S0, D, P, R Starting state: S0 Accepting states: {S0,R} b CSCI 6636 4536 Lecture 3... 5/25 Spring 2017 5 / 25

Deterministic Finite State Machines What Language is Defined by this DFA? A language is a set of strings. Automaton 1 accepts this language: The null string ab abab ababab It rejects: abbbbbbbbbb abbbbbbbbbbabbbbab Anything that begins with b. Anything that ends in a. Anything with two consecutive a s. Since the alphabet for this machine is a, b, we don t concern ourselves with any other letters or symbols. CSCI 6636 4536 Lecture 3... 6/25 Spring 2017 6 / 25

Deterministic Finite State Machines Automaton 2 S0 b, d P a a b,d D a, b, d a, b, d Alphabet: a, b, d States: S0, D, P, R Starting state: S0 Accepting states: {R} R CSCI 6636 4536 Lecture 3... 7/25 Spring 2017 7 / 25

Deterministic Finite State Machines What Language is Defined by this DFA? Automaton 2 accepts this language: bb dd bad bab dad daaaab baaaaaaaaaaaaaad It rejects: Anything that begins with a or ends with a. Anything that ends in a. Anything with more than two of b/d. CSCI 6636 4536 Lecture 3... 8/25 Spring 2017 8 / 25

Deterministic Finite State Machines A Shorthand Notation We use DFA is a mathematical abstraction with an alphabet, states, and a transition rule. A DFM is a generalization of the DFA with outputs. We use DFM s to do the lexical analysis for programming languages. We represent the DFM with a graphical language of circles and transition links. I have introduced graphs that are fully explicit: Every state is shown in the graph. Every state has a transition rule for every symbol in the alphabet. Some professionals use graphs with implicit links to a Dead state. Any missing transition goes, by default, to the Dead state, which may not be shown, but is listed among the states of the DFA. CSCI 6636 4536 Lecture 3... 9/25 Spring 2017 9 / 25

Deterministic Finite State Machines Short form of Automaton 2 S0 b, d P a b,d Alphabet: a, b, d States: S0, D, P, R Starting state: S0 Accepting states: {R} R CSCI 6636 4536 Lecture 3... 10/25 Spring 2017 10 / 25

Lexical Analysis Lexical Analysis Lexical analysis is the first stage of translation. The input is the source code of a program or a program unit. The Lexer reads the characters in the source code one at a time and identifies the primitive language units (words and punctuation) that make up the program. Illegal characters (if any) are identified. Comments are identified and discarded. The remaining units are called lexemes. A lexeme, together with its category-code is called a token. A lexer outputs a stream of tokens, ready for preprocessing or for parsing. CSCI 6636 4536 Lecture 3... 11/25 Spring 2017 11 / 25

Lexical Analysis Lexing a number in C A number must start with a digit, which may be followed by other digits. It may have a decimal point at the beginning or end of the digits, or in the middle. If a decimal point is present, an exponent may follow the last digit or the decimal point. An exponent consists of the letter E or e, followed by a + or sign, followed by one or more digits, and ending in a precision indicator (D, d, F, f, L, or l). See: LexC.pdf CSCI 6636 4536 Lecture 3... 12/25 Spring 2017 12 / 25

Lexical Analysis Lexing FORTH A rest-of line comment starts with a backslash in column and ends with the next newline. An inline comment starts with (< space > and ends with the first ). A string literal starts with. < space > and ends with the first. A number is a sequence of digits delimited by whitespace. Any other collection of visible characters form a word. See: LexFORTH.pdf CSCI 6636 4536 Lecture 3... 13/25 Spring 2017 13 / 25

Regular Expressions Regular Expressions A regular expression is a string that describes a whole set of strings in some finite language, according to certain rules. An American logician named Stephen Cole Kleene studied regular languages in the 1950 s and introduced one notation for writing regular expressions. We use regular expressions to specify the lexical rules for computer languages. Theorem (Kleene) A language is definable by a regular expression if and only if it is recognized by a finite-state automaton. CSCI 6636 4536 Lecture 3... 14/25 Spring 2017 14 / 25

Regular Expressions Kleene Regular Expressions To write a Kleene regular expression, you need: An alphabet. Each symbol in the alphabet is a regular expression. A set of operators, in order of precedence: + x means 0 or more copies of x. x y means x followed by y. x + y means the union of x and y: either x or y or both. Grouping symbols to override the precedence: ( ) Variations on this idea are widely used in CS applications (grep, flex, Perl,... ). CSCI 6636 4536 Lecture 3... 15/25 Spring 2017 15 / 25

Regular Expressions Examples: Finite Languages Revisited Expression for Automaton 1: (a b b ) Expression for Automaton 2: (b + d) a (b + d) We use regular expressions to define the lexical structure of languages. These are translated into efficient computer programs (lexers), based on finite state machines, that recognize the language forms. CSCI 6636 4536 Lecture 3... 16/25 Spring 2017 16 / 25

Creating a Lexer Part 2 2. Lexical Analysis Lexer Generators Flex: Fast Lexer Generator CSCI 6636 4536 Lecture 3... 17/25 Spring 2017 17 / 25

Creating a Lexer Lexer Generators Lexer Generators A person implementing a compiler must produce a lexer for the language. This can be done in two ways. (See: Wikipedia Lexical Analysis) Write a custom lexer. This can be more efficient and can be the only solution if the language is complex or its lexemes are recursively defined. Write regular expressions that define the lexical structure. Then use these definitions as input to a lexer-generator, such as lex or flex, which turns them into into a lexer for the language, implemented as a finite-state machine. CSCI 6636 4536 Lecture 3... 18/25 Spring 2017 18 / 25

Creating a Lexer Flex Flex: Fast Lexer Generator Flex is the GNU lexer generator. It reads user-specified input files for a lexer description in the form of pairs of regular expressions and corresponding action-rules coded in C. The result of running flex is a C program, which must be compiled to create a lexer. The resulting lexer analyzes its input (source code) for occurrences of text matching the original regular expressions. Whenever it finds a match, it executes the corresponding action-rule. CSCI 6636 4536 Lecture 3... 19/25 Spring 2017 19 / 25

Creating a Lexer Flex Flex Regular Expressions-1 This is the extended regular expression syntax recognized by flex: x match a particular character xy an x followed by a y (this is Kleene s operator). match any single character (not the same as Kleene s ) x y either an x or a y (this is Kleene s + operator). x* zero or more x s x+ one or more x s (not the same as Kleene s +). x? zero or one x (an optional x) d{2} exactly 2 d s d{2,} 2 or more d s d{2,5} from 2 to 5 d s [^A] match any character except A. (A negated character.) CSCI 6636 4536 Lecture 3... 20/25 Spring 2017 20 / 25

Creating a Lexer Flex Flex Regular Expressions-2 This is the extended regular expression syntax recognized by flex: [xyz] a character class: x or y or z [xyz]* zero or more chars from the set x, y, z, in any order or combination. [0-9] a character class with a range of numbers (describes a decimal literal) [a-za-z] a character class with two ranges [^A-Z\n] a negated set of characters that includes a range [a..z]{-}[lmno] the set a... z with lmno removed Plus several increasingly complex ways to name patterns and make patterns out of other patterns. CSCI 6636 4536 Lecture 3... 21/25 Spring 2017 21 / 25

Creating a Lexer Flex Examples: Flex Regular Expressions (abb*)* Automaton 1, Kleene syntax: (a b b ) (ab+)* Automaton 1, using the extended Flex syntax [bd]a*[bd] Automaton 2, Kleene syntax: (b +d) a (b +d) ab?c A string that starts with and a and ends with a c and has an optional b in the middle. 0[xX][a-fA-F0-9]* A hex literal, in C. [^"] Any character except a double quote. [a-z]{-}[eo] The set of characters a... z, excluding e and o. CSCI 6636 4536 Lecture 3... 22/25 Spring 2017 22 / 25

Creating a Lexer Flex Flex for FORTH To analyze a FORTH program, the following rules should be applied in order. If one fails, try the next. If the first four rules fail, the last one will always succeed. Let WS stand for a sequence of 1 or more whitespace characters. Input symbols are given in red ink, lexer symbols are black. Comment \ WS [ \n ]* \n Comment ( WS [ ) ]* ) WS StringLiteral. WS [ ]* WS Integer? [0 9 WS]* WS Word [ WS]* WS A recognizer for this language needs five accepting states. If you are in an accepting state when a newline or WS terminates a rule, the entire string that has been matched should be output as a token, with the category from the left column. CSCI 6636 4536 Lecture 3... 23/25 Spring 2017 23 / 25

Homework Hw 3: Finite-State Machines and Regular Expressions 1. Define a DFA that will process input strings of letters from a to e. Accept all strings that contain only vowels ( a and e ). Reject all other inputs. 2. Write a Flex regular expression that defines strings of letters that contain only the letters a and e, in any combination, any order, and any number. 3. Define a DFA that will process input strings of letters from a to e. Accept all strings of exactly 3-letters that start with a vowel. (Examples: add, ade, edc, eee) Reject all other inputs (Examples: dad, eat). 4. Write a Flex regular expression that defines strings of 3-letters that contain only the letters a through e and start with a vowel. CSCI 6636 4536 Lecture 3... 24/25 Spring 2017 24 / 25

Homework Hw 3: Finite-State Machines and Regular Expressions 5. Define a DFA that accepts legal C identifiers. The rules are: 1 An identifier can be composed of letters (both uppercase and lowercase letters), digits and underscore _ only. 2 The first letter of identifier cannot be a digit. 3 There is no limit on the length of an identifier. 6. Write FORTH function that displays your name. (Keep it simple.) Read chapter 4.1 through 4.3 of the text. CSCI 6636 4536 Lecture 3... 25/25 Spring 2017 25 / 25