Regular Expressions (CSE 413, Autumn 2005, Programming Languages)


CSE 413, Autumn 2005, Programming Languages
http://www.cs.washington.edu/education/courses/413/05au/
14-Nov-2005, cse413-15-regex, © 2005 University of Washington

Agenda for Today
Basic concepts of formal grammars
Regular expressions
Lexical specification of programming languages
Using finite automata to recognize regular expressions

Programming Language Specifications
Since the 1960s, the syntax of every significant programming language has been specified by a formal grammar
» First done in 1959 with BNF (Backus-Naur Form, or Backus Normal Form), used to specify the syntax of ALGOL 60
» Borrowed from the linguistics community

Grammar for a Tiny Language
program ::= statement | program statement
statement ::= assignstmt | ifstmt
assignstmt ::= id = expr ;
ifstmt ::= if ( expr ) stmt
expr ::= id | int | expr + expr
id ::= a | b | c | i | j | k | n | x | y | z
int ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
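The tiny-language grammar can be written down directly as a data structure. The sketch below (the dictionary layout and `is_terminal` helper are illustrative choices, not part of the course material) encodes each production as a list of alternatives, where each alternative is a sequence of symbols:

```python
# A sketch of the tiny-language grammar as a Python data structure.
# Nonterminals map to lists of alternatives; each alternative is a
# sequence of symbols (nonterminal names or terminal strings).
GRAMMAR = {
    "program":    [["statement"], ["program", "statement"]],
    "statement":  [["assignstmt"], ["ifstmt"]],
    "assignstmt": [["id", "=", "expr", ";"]],
    "ifstmt":     [["if", "(", "expr", ")", "stmt"]],
    "expr":       [["id"], ["int"], ["expr", "+", "expr"]],
    "id":         [[c] for c in "abcijknxyz"],
    "int":        [[d] for d in "0123456789"],
}

def is_terminal(sym):
    # Terminal symbols are exactly those that never appear on a
    # left-hand side of a production.
    return sym not in GRAMMAR
```

This representation makes the terminal/nonterminal distinction from the next slide concrete: `is_terminal("=")` is true, while `is_terminal("expr")` is false.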

Productions
The rules of a grammar are called productions. Rules contain
» Nonterminal symbols: grammar variables (program, statement, id, etc.)
» Terminal symbols: concrete syntax that appears in programs (a, b, c, 0, 1, if, (, ))
Meaning of nonterminal ::= <sequence of terminals and nonterminals>: in a derivation, an instance of the nonterminal can be replaced by the sequence of terminals and nonterminals on the right of the production
Often there are two or more productions for a single nonterminal; a derivation can use either at different times

Alternative Notations
There are several syntax notations for productions in common use
» all mean the same thing
» the nonterminal on the left can be replaced by the expression on the right
ifstmt ::= if ( expr ) stmt
ifstmt → if ( expr ) stmt
<ifstmt> ::= if ( <expr> ) <stmt>

Example Derivation
a = 1 ; if ( a + 1 ) b = 2 ;
program ::= statement | program statement
statement ::= assignstmt | ifstmt
assignstmt ::= id = expr ;
ifstmt ::= if ( expr ) stmt
expr ::= id | int | expr + expr
id ::= a | b | c | i | j | k | n | x | y | z
int ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Parsing
Parsing: reconstruct the derivation (syntactic structure) of a program
In principle, a single recognizer could work directly from the concrete, character-by-character grammar
» In practice this is never done, because there are more useful ways to organize the task that simplify each part
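A derivation is just repeated replacement of a nonterminal by one of its right-hand sides. The sketch below (helper name and step encoding are my own) replays a leftmost derivation of the statement `a = 1 ;` from the tiny grammar, one production application at a time:

```python
def apply_production(form, index, rhs):
    """Replace the nonterminal at position `index` in the sentential
    form with the symbols of the chosen right-hand side."""
    return form[:index] + rhs + form[index + 1:]

# Leftmost derivation of "a = 1 ;":
# program -> statement -> assignstmt -> id = expr ;
#         -> a = expr ; -> a = int ; -> a = 1 ;
form = ["program"]
steps = [
    (0, ["statement"]),
    (0, ["assignstmt"]),
    (0, ["id", "=", "expr", ";"]),
    (0, ["a"]),        # id ::= a
    (2, ["int"]),      # expr ::= int
    (2, ["1"]),        # int ::= 1
]
for index, rhs in steps:
    form = apply_production(form, index, rhs)
# form is now ["a", "=", "1", ";"]
```

Parsing, as described above, is the inverse problem: given the final string of terminals, reconstruct which sequence of replacements produced it.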

Parsing & Scanning
In real compilers the recognizer is split into two phases
» Scanner: translate input characters to tokens; also report lexical errors like illegal characters and illegal symbols
» Parser: read the token stream and reconstruct the derivation
source → Scanner → tokens → Parser → derivation

Recall: Characters vs Tokens
Input text:
// this line is a simple comment
if (x >= y) y = 42;
Token stream:
IF LPAREN ID(x) OP_GEQ ID(y) RPAREN ID(y) OP_ASSIGN INT(42) SCOLON
» Note: tokens are atomic items (objects of class Token), not character strings

Why Separate the Scanner and Parser?
Simplicity & separation of concerns
» Scanner hides details from the parser (comments, whitespace, input files, etc.)
» Parser is easier to build; it has a simpler input stream
Efficiency
» Scanner can use a simpler, faster design
» But it still often consumes a surprising amount of the compiler's total execution time

Tokens
Idea: we want a distinct token type (lexical class) for each distinct terminal symbol in the programming language
» Examine the grammar to find these
Some tokens may have attributes
» Examples: an integer literal token will have the actual integer value (17, 42, ...) as an attribute; identifiers will have a string with the actual id as an attribute, and perhaps some type information
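One common way to prototype the character-to-token step is a master regular expression with one named group per lexical class. The sketch below is an illustration, not the course's actual scanner; the token names come from the slide's example, while the patterns and their ordering are assumptions:

```python
import re

# Illustrative token spec; order matters because Python's alternation
# takes the first alternative that matches at a position, so the
# keyword "if" and the two-character ">=" are listed before ID and "=".
TOKEN_SPEC = [
    ("COMMENT",   r"//[^\n]*"),
    ("WS",        r"\s+"),
    ("IF",        r"if\b"),
    ("OP_GEQ",    r">="),
    ("OP_ASSIGN", r"="),
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("SCOLON",    r";"),
    ("INT",       r"[0-9]+"),
    ("ID",        r"[A-Za-z_][A-Za-z0-9_]*"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    tokens = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind in ("COMMENT", "WS"):
            continue  # the scanner hides comments and whitespace from the parser
        tokens.append((kind, m.group()))
    return tokens
```

Running `scan` on the slide's input produces exactly the token stream shown above: IF LPAREN ID(x) OP_GEQ ID(y) RPAREN ID(y) OP_ASSIGN INT(42) SCOLON, with the comment and whitespace discarded.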

Typical Programming Language Tokens
Operators & punctuation
» + - * / ( ) { } [ ] ; : < <= == = != !
» Each of these is a distinct lexical class
Keywords (reserved)
» if while for goto return switch void
» Each of these is also a distinct lexical class (not a string)
Identifiers
» A single ID lexical class, but parameterized by the actual id
Integer literals
» A single INT lexical class, but parameterized by the int value
Other constants, etc.

Principle of Longest Match
In most languages, the scanner should pick the longest possible string to make up the next token, if there is a choice
Example:
return forbar != beginning;
should be recognized as 5 tokens
RETURN ID(forbar) NEQ ID(beginning) SCOLON
not more (i.e., not parts of words or identifiers, or ! and = as separate tokens)

Languages & Automata Theory
Alphabet: a finite set of symbols
String: a finite, possibly empty sequence of symbols from an alphabet
Language: a set, often infinite, of strings
Finite specifications of (possibly infinite) languages
» Automaton: a recognizer; a machine that accepts all strings in a language (and rejects all other strings)
» Grammar: a generator; a system for producing all strings in the language (and no other strings)
A language may be specified by many different grammars and automata; a grammar or automaton specifies only one language

Regular Expressions and Finite Automata
The lexical grammar (structure) of most programming languages can be specified with regular expressions
» Sometimes a little ad hoc cheating is useful
Tokens can be recognized by a deterministic finite automaton
» Can be either table-driven or built by hand, based on the lexical grammar
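The longest-match principle (maximal munch) can be made explicit by trying every pattern at the current position and keeping the longest match. This sketch uses hypothetical patterns for the slide's example; it is one possible implementation, not the only one:

```python
import re

# Illustrative patterns for the longest-match example. The order only
# breaks ties of equal length (so RETURN beats ID on "return");
# longest_match itself guarantees that "!=" beats "!".
PATTERNS = [
    ("RETURN", r"return\b"),
    ("ID",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NEQ",    r"!="),
    ("NOT",    r"!"),
    ("SCOLON", r";"),
]

def longest_match(text, pos):
    """Try every pattern at `pos` and keep the longest match."""
    best = None
    for kind, pat in PATTERNS:
        m = re.compile(pat).match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best

def scan(text):
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        kind, lexeme = longest_match(text, pos)
        tokens.append((kind, lexeme))
        pos += len(lexeme)
    return tokens
```

On the slide's input this yields the 5 tokens RETURN, ID(forbar), NEQ, ID(beginning), SCOLON, with `!=` recognized as one token rather than `!` followed by `=`.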

Regular Expressions
Defined over some alphabet Σ
» For programming languages, commonly ASCII or Unicode
If re is a regular expression, L(re) is the language (set of strings) generated by re
Note that this is the opposite of the way we often think about regular expressions
» generating strings vs matching strings
» either way, the relevant set of strings is L(re)

Fundamental Regular Expressions
re   L(re)   Notes
a    { a }   Singleton set, for each a in Σ
ε    { ε }   Empty string
∅    { }     Empty language

Operations on Regular Expressions
re     L(re)         Notes
rs     L(r)L(s)      Concatenation
r | s  L(r) ∪ L(s)   Combination (union)
r*     L(r)*         0 or more occurrences (Kleene closure)
Precedence: * (highest), concatenation, | (lowest)
Parentheses can be used to group REs as needed

Abbreviations
The basic operations generate all possible regular expressions, but there are common abbreviations used for convenience. Typical examples:
Abbr.    Meaning              Notes
r+       (rr*)                1 or more occurrences
r?       (r | ε)              0 or 1 occurrence
[a-z]    (a | b | ... | z)    1 character in the given range
[abxyz]  (a | b | x | y | z)  1 of the given characters
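The abbreviations really are just shorthand for the core operators. As a sanity check (assuming Python's `re` semantics agree with the formal definitions on these simple cases, which they do for these operators), each abbreviation and its expansion accept exactly the same sample strings:

```python
import re

def matches(pattern, s):
    # fullmatch asks whether the whole string s is in L(pattern)
    return re.fullmatch(pattern, s) is not None

samples = ["", "a", "aa", "aaa", "b", "ab"]
for s in samples:
    assert matches("a+", s) == matches("aa*", s)       # r+  ==  rr*
    assert matches("a?", s) == matches("(a|)", s)      # r?  ==  (r | ε)
    assert matches("[abc]", s) == matches("(a|b|c)", s)  # char class == union
```

Note the precedence rule in action: `aa*` parses as `a(a*)` because `*` binds tighter than concatenation, which is exactly why it means "one or more a's".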

Examples
re       L(re)
a        the single character a
!        the single character !
!=       the specific 2-character sequence !=
[!<>]=   a 2-character sequence: !=, <=, or >=
\[       the single character [ (escaped, since [ is an operator)
hogwash  the 7-character sequence hogwash

More Examples
re                     L(re)
[abc]+                 nonempty strings of a's, b's, and c's
[abc]*                 possibly empty strings of a's, b's, and c's
[0-9]+                 nonempty digit strings (integer literals, possibly with leading zeros)
[1-9][0-9]*            integer literals without leading zeros
[a-zA-Z][a-zA-Z0-9_]*  typical identifiers: a letter followed by letters, digits, and underscores

Abbreviations
Many systems allow naming regular expressions, to make definitions easier to write and read
name ::= re
for example: digit ::= [0-9]
» Restriction: abbreviations may not be circular (recursive), either directly or indirectly

Example
Possible syntax for numeric constants:
number ::= digits ( . digits )? ( [eE] ( + | - )? digits )?
digits ::= digit+
digit ::= [0-9]
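The numeric-constant definition translates almost symbol-for-symbol into Python regex syntax, with the named abbreviations becoming ordinary variables (the expansion order mirrors the non-recursive restriction: each name is defined before it is used):

```python
import re

# Direct transcription of the slide's numeric-constant definition.
digit  = r"[0-9]"
digits = digit + r"+"
number = re.compile(rf"{digits}(\.{digits})?([eE][+-]?{digits})?")

def is_number(s):
    return number.fullmatch(s) is not None
```

For example, `42`, `3.14`, and `6.02e+23` are accepted, while `.5` and `1.` are rejected because `digits` requires at least one digit on each side of the decimal point.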

Recognizing Regular Expressions
Finite automata can be used to recognize strings generated by regular expressions
Can build by hand or automatically
» Not totally straightforward, but can be done systematically
» Tools like Lex, Flex, and JLex do this automatically, given a set of REs

Finite State Automaton
A finite set of states
» One marked as the initial state
» One or more marked as final states
A set of transitions from state to state
» Each labeled with a symbol from Σ, or ε
Operates by reading input symbols (usually characters)
» A transition can be taken if it is labeled with the current symbol
» An ε-transition can be taken at any time

Example: FSA for "cat"
[Diagram: from the initial state, transitions on c, then a, then t lead to the final state; the automaton recognizes the string cat]

Accept or Reject
Accept
» if a final state is reached and there is no more input
» if in an accepting state when there is no valid transition for the next symbol, or no more input
Reject
» if there is no more input and the machine is not in a final state
» if no transition is possible and the machine is not in an accepting state
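The "cat" automaton and its accept/reject rules fit in a few lines as a transition table. This is a sketch of the DFA case (no ε-transitions); the table encoding is an illustrative choice:

```python
# A minimal DFA for the language { "cat" }: states 0..3, state 3 final.
# Any (state, symbol) pair not in the table means "no valid transition".
TRANSITIONS = {(0, "c"): 1, (1, "a"): 2, (2, "t"): 3}
START, FINAL = 0, {3}

def accepts(s):
    state = START
    for ch in s:
        if (state, ch) not in TRANSITIONS:
            return False              # no transition possible: reject
        state = TRANSITIONS[(state, ch)]
    return state in FINAL             # accept only if in a final state
                                      # when the input is exhausted
```

So `accepts("cat")` is true, while `"ca"` (input exhausted in a non-final state) and `"cats"` (no transition out of the final state) are both rejected.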

DFA Example
Idea: show a hand-written DFA for some typical programming language constructs
» Can use this to construct a hand-written scanner
Setting: the scanner is called whenever the parser needs a new token
» The scanner stores the current position in the input
» Starting there, it uses a DFA to recognize the longest possible input sequence that makes up a token, and returns that token

Scanner DFA Example (1)
[Diagram: state 0 loops on whitespace or comments; on end of input go to state 1, Accept EOF; on ( go to state 2, Accept LPAREN; on ) go to state 3, Accept RPAREN; on ; go to state 4, Accept SCOLON]

Scanner DFA Example (2)
[Diagram: on ! go to state 5; then on = go to state 6, Accept NEQ; on other go to state 7, Accept NOT. On < go to state 8; then on = go to state 9, Accept LEQ; on other go to state 10, Accept LESS]

Scanner DFA Example (3)
[Diagram: on [0-9] go to state 11, which loops on [0-9]; on other go to state 12, Accept INT]
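A hand-written scanner encodes these DFA states directly as control flow. The sketch below covers the fragments shown in examples (1)-(3): whitespace skipping and the `!`, `!=`, `<`, `<=`, and integer-literal paths (function name and return convention are my own):

```python
# Hand-written scanner fragment mirroring the slide's DFA states.
def next_token(text, pos):
    """Return (token_kind, lexeme, new_pos); a sketch, not a full scanner."""
    while pos < len(text) and text[pos].isspace():
        pos += 1                                  # state 0: skip whitespace
    if pos >= len(text):
        return ("EOF", "", pos)                   # state 1
    ch = text[pos]
    if ch == "!":                                 # state 5
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return ("NEQ", "!=", pos + 2)         # state 6
        return ("NOT", "!", pos + 1)              # state 7 (on "other")
    if ch == "<":                                 # state 8
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return ("LEQ", "<=", pos + 2)         # state 9
        return ("LESS", "<", pos + 1)             # state 10 (on "other")
    if ch.isdigit():                              # state 11: loop on [0-9]
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return ("INT", text[pos:end], end)        # state 12 (on "other")
    raise ValueError(f"unexpected character {ch!r}")
```

The "other" transitions become the fall-through returns: the scanner peeks one character ahead, and if the longer token is not there, it accepts the shorter one without consuming the lookahead character.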

Scanner DFA Example (4)
[Diagram: on [a-zA-Z] go to state 13, which loops on [a-zA-Z0-9_]; on other go to state 14, Accept ID or keyword]
Strategies for handling identifiers vs keywords
» Hand-written scanner: look up identifier-like things in a table of keywords to classify them (a good application of perfect hashing)
» Machine-generated scanner: generate a DFA with appropriate transitions to recognize keywords directly; lots of states, but efficient (no extra lookup step)
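The hand-written keyword-table strategy can be sketched in a few lines; here the slide's perfect-hash lookup is stood in for by an ordinary set, and the classification convention (keyword name in upper case, everything else ID) is an illustrative assumption:

```python
# Keyword table for the hand-written-scanner strategy: scan an
# identifier-shaped lexeme first, then classify it with one lookup.
# (The slide suggests perfect hashing; a plain set works the same way.)
KEYWORDS = {"if", "while", "for", "goto", "return", "switch", "void"}

def classify(lexeme):
    """Map an identifier-shaped lexeme to its lexical class."""
    return lexeme.upper() if lexeme in KEYWORDS else "ID"
```

So `classify("if")` yields the keyword token IF, while `classify("ifx")` yields ID: the scanner never needs separate DFA states per keyword, at the cost of one table lookup per identifier.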