Lexical analysis. Concepts. Lexical analysis in perspec>ve

Similar documents
Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Lexical Analysis (ASU Ch 3, Fig 3.1)

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

CSC 467 Lecture 3: Regular Expressions

Lexical Analyzer Scanner

Figure 2.1: Role of Lexical Analyzer

Lexical Analyzer Scanner

Implementation of Lexical Analysis

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Implementation of Lexical Analysis

Lexical Analysis. Chapter 2

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Introduction to Lexical Analysis

Programming Languages (CS 550) Lecture 4 Summary Scanner and Parser Generators. Jeremy R. Johnson

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Monday, August 26, 13. Scanners

Wednesday, September 3, 14. Scanners

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

CSc 453 Lexical Analysis (Scanning)

Lexical Analysis. Lecture 3-4

Part 5 Program Analysis Principles and Techniques

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Lexical Analysis. Lecture 2-4

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Formal Languages and Compilers Lecture VI: Lexical Analysis

Languages, Automata, Regular Expressions & Scanners. Winter /8/ Hal Perkins & UW CSE B-1

Structure of Programming Languages Lecture 3

CMSC 330: Organization of Programming Languages. Architecture of Compilers, Interpreters

Lexical Analysis. Introduction

Compiler course. Chapter 3 Lexical Analysis

Parsing and Pattern Recognition

Week 2: Syntax Specification, Grammars

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

CMSC 330: Organization of Programming Languages

Chapter 3 Lexical Analysis

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

Compiler phases. Non-tokens

CSCI312 Principles of Programming Languages!

Where We Are. CMSC 330: Organization of Programming Languages. This Lecture. Programming Languages. Motivation for Grammars

CMSC 330: Organization of Programming Languages

CS 314 Principles of Programming Languages. Lecture 3

Group A Assignment 3(2)

Dr. D.M. Akbar Hussain

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 2

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

LECTURE 7. Lex and Intro to Parsing

CSE 401/M501 Compilers

CS308 Compiler Principles Lexical Analyzer Li Jiang

CMSC 330: Organization of Programming Languages

Lecture 4: Syntax Specification

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

CSE Lecture 4: Scanning and parsing 28 Jan Nate Nystrom University of Texas at Arlington

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

DVA337 HT17 - LECTURE 4. Languages and regular expressions

Architecture of Compilers, Interpreters. CMSC 330: Organization of Programming Languages. Front End Scanner and Parser. Implementing the Front End

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

Implementation of Lexical Analysis

Lexical Analysis 1 / 52

Lexical Analysis. Sukree Sinthupinyo July Chulalongkorn University

Compilers CS S-01 Compiler Basics & Lexical Analysis

Zhizheng Zhang. Southeast University

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Compilers CS S-01 Compiler Basics & Lexical Analysis

CS 315 Programming Languages Syntax. Parser. (Alternatively hand-built) (Alternatively hand-built)

UNIT II LEXICAL ANALYSIS

10/4/18. Lexical and Syntactic Analysis. Lexical and Syntax Analysis. Tokenizing Source. Scanner. Reasons to Separate Lexical and Syntactic Analysis

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

UNIT -2 LEXICAL ANALYSIS

Defining syntax using CFGs

CMSC 330: Organization of Programming Languages. Context Free Grammars

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

CS 403: Scanning and Parsing

CSE 3302 Programming Languages Lecture 2: Syntax

2. Lexical Analysis! Prof. O. Nierstrasz!

Lexical Analysis. Lecture 3. January 10, 2018

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

The Structure of a Syntax-Directed Compiler

CPS 506 Comparative Programming Languages. Syntax Specification

Compiler Construction

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 3

Compiler Construction

CS 314 Principles of Programming Languages

CS S-01 Compiler Basics & Lexical Analysis 1

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

Chapter 3 -- Scanner (Lexical Analyzer)

Lecture Outline. COMP-421 Compiler Design. What is Lex? Lex Specification. ! Lexical Analyzer Lex. ! Lex Examples. Presented by Dr Ioanna Dionysiou

CMSC 330: Organization of Programming Languages. Context Free Grammars

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

Languages and Compilers

Transcription:

4a Lexical analysis Concepts Overview of syntax and seman3cs Step one: lexical analysis Lexical scanning Regular expressions DFAs and FSAs Lex CMSC 331, Some material 1998 by Addison Wesley Longman, Inc. Lexical analysis in perspec>ve LEXICAL ANALYZER: Transforms character stream to token stream. Also called scanner, lexer, linear analysis source program lexical analyzer token get next token symbol table parser This is an overview of the standard process of turning a text file into an executable program. LEXICAL ANALYZER Scans Input Removes whitespace, newlines, Iden3fies Tokens Creates Symbol Table Inserts Tokens into symbol table Generates Errors Sends Tokens to Parser PARSER Performs Syntax Analysis Ac3ons Dictated by Token Order Updates Symbol Table Entries Creates Abstract Rep. of Source Generates Errors

Where we are Basic lexical analysis terms Total=price+tax; Total = price + tax ; id assignment = Expr id + id price tax Lexical analyzer Parser Token A classifica3on for a common set of strings Examples: <iden3fier>, <number>, <operator>, <open paren>, etc. PaUern The rules which characterize the set of strings for a token Typically defined via regular expressions Lexeme Character sequence that matches pauern a token Iden3fiers: x, count, name, foo32, etc Integers: - 12, 101, 0, Open paren: ) Examples: token, lexeme, panern if (price + gst rebate <= 10.00) gift := false Token lexeme Informal description of pattern if if if Lparen ( ( Identifier price String consists of letters and numbers and starts with a letter operator + + identifier gst String consists of letters and numbers and starts with a letter operator - - identifier rebate String consists of letters and numbers and starts with a letter Operator <= Less than or equal to constant 10.00 Any numeric constant rparen ) ) identifier gift String consists of letters and numbers and starts with a letter Operator := Assignment symbol identifier false String consists of letters and numbers and starts with a letter Regular expression (REs) Scanners are based on regular expressions that define simple pauerns Simpler and less expressive than BNF Examples of a regular expression lener: a b c... z A B C... Z digit: 0 1 2 3 4 5 6 7 8 9 iden>fier: leuer (leuer digit)* Basic opera3ons are (1) set union, (2) concatena3on and (3) Kleene closure Plus: parentheses, naming pauerns No recursion

Regular expression (REs) Example: lener: a b c... z A B C... Z digit: 0 1 2 3 4 5 6 7 8 9 iden>fier: leuer (leuer digit)* leuer ( leuer digit ) * leuer ( leuer digit ) * leuer ( leuer digit ) * concatena3on: one pauern followed by another set union: one pauern or another Kleene closure: zero or more repe3ons of a pauern Regular expressions are extremely useful in many applica3ons. Mastering them will serve you well. Another view Formal language opera>ons "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems. - - Jamie Zawinski (1997) alt.religion.emacs hup://bit.ly/jwzregex Operation Notation Definition union of L and M concatenation of L and M Kleene closure of L positive closure L M LM L M = {s s is in L or s is in M} LM = {st s is in L and t is in M} L* L* denotes zero or more concatenations of L L+ L+ denotes one or more concatenations of L Example L={a, b} M={0,1} {a, b, 0, 1} {a0, a1, b0, b1} All the strings consists of a and b, plus the empty string. {ε, a, b, aa, bb, ab, ba, aaa, } All the strings consists of a and b. {a, b, aa, bb, ab, ba, aaa, }

Regular expression Let Σ be an alphabet, r a regular expression then L(r) is the language that is characterized by the rules of r Defini3on of regular expression ε is a regular expression that denotes the language {ε} If a is in Σ, a is a regular expression that denotes {a} Let r & s be regular expressions with languages L(r) & L(s)» (r) (s) is a regular expression à L(r) L(s)» (r)(s) is a regular expression à L(r) L(s)» (r)* is a regular expression à (L(r))* It is an induc3ve defini3on A regular language is a language that can be defined by a regular expression RE example revisited Examples of regular expression Letter: a b c... z A B C... Z Digit: 0 1 2 3 4 5 6 7 8 9 Identifier: letter (letter digit)* Q: why it is an regular expression? Because it only uses the opera3ons of union, concatena3on and Kleene closure Being able to name pauerns is just syntac3c sugar Using parentheses to group things is just syntac3c sugar provided we specify the precedence and associa3vely of the operators (i.e.,, * and concat ) +: Another common operator The + operator is commonly used to mean one or more repe33ons of a pauern For example, lener + means one or more leuers We can always do without this, e.g. leuer + is equivalent to leuer leuer * So the + operator is just syntac3c sugar Precedence of operators In interpre3ng a regular expression Parens scope sub- expressions * and + have the highest precedence Concatena3on comes next is lowest. All the operators are ley associa3ve Example (A) ((B)* (C)) is equivalent to A B * C What strings does this generate or match? Either an A or any number of Bs followed by a C

Epsilon: more syntac>c sugar Some3mes we d like a token that represents nothing This makes a regular expression matching more complex, but can be useful We use the lower case Greek leuer epsilon (ε) for this special token Example: digit: 0 1 2 3 4 5 6 7 8 9 0 sign: + - ε int: sign digit+ Proper>es of regular expressions We can easily determine some basic proper3es of the operators involved in building regular expressions r s = s r Property r (s t) = (r s) t (rs)t=r(st) r(s t)=rs rt (s t)r=sr tr...... is commutative is associative Description Concatenation is associative Concatenation distributes over RE: S>ll more syntac>c sugar Zero or one instance L? = L ε Examples» Op3onal_frac3onà.digits ε» op3onal_frac3onà (.digits)? Character classes [abc] = a b c [a- z] = a b c... z Systems having RE support (e.g., Java, Python, Lex, Emacs) vary in the features supported and oyen in the nota3on But tend to be very similar Regular grammars / expressions Regular grammars and regular expressions are equivalent Every regular expression can be expressed by regular grammar Every regular grammar can be expressed by regular expression Example: an iden3fier must begin with a leuer and can be followed by any number of leuers and digits Regular expression ID: LETTER (LETTER IT)* Regular grammar ID à LETTER ID_REST ID_REST à LETTER ID_REST IT ID_REST EMPTY

Formal defini>on of tokens A set of tokens is a set of strings over an alphabet {read, write, +, -, *, /, :=, 1, 2,, 10,, 3.45e- 3, } A set of tokens is a regular set that can be defined by using a regular expression For every regular set, there is a finite automaton (FA) that can recognize it Aka determinis3c Finite State Machine (FSM) i.e. determine whether a string belongs to the set or not Scanners extract tokens from source code in the same way DFAs determine membership FSM = FA Finite state machine and finite automaton are different names for the same concept The concept is important and useful in almost every aspect of computer science Provides abstract way to define a process that Has a finite set of states it can be in, with a special statr state and a set of accep3ng states Gets a sequence of inputs Each input causes process to go from its current state to a new state (which might be the same) If ayer the input ends, we are in one of a set of accep3ng state, the input is accepted by the FA Example An FA that determines whether a binary number has an odd or even number of 0's, where S1 is an accep3ng state. transition label is input that triggers it Determinis>c finite automaton (DFA) A DFA has only one choice for a given input in every state No states with two arcs matching same input Incoming arrow identifies start state State names (e.g., S 1, S 2 ) for convenience Double circle identifies accepting state(s) For this FA inputs are expected to be a 0 or 1 Is this a DFA?

Determinis>c finite automaton (DFA) If an input symbol matches no arc for current state, input is not accepted This FA accepts only binary numbers that are multiples of three REs can be represented as DFAs Regular expression for a simple iden3fier Letter: a b c... z A B C... Z Digit: 0 1 2 3 4 5 6 7 8 9 Identifier: letter (letter digit)* letter This DFA recognizes identifiers letter * Marking state with a * is another way to identify accepting state Is this a DFA? 0,1,2,3,4 9 RE < CFG Every language that can be described by a RE can be described by a CFG Some languages can be described by a CFG but not by a RE for example the set of palidromes made up of as and bs: S - > a S a b S b a aa b bb Token Defini>on Numeric literals in Pascal, e.g. 1, 123, 3.1415, 10e- 3, 3.14e4 Defini>on of token unsignednum 0 1 2 3 4 5 6 7 8 9 unsignedint * unsignednum unsignedint ((. unsignedint) ε) ((e ( + ε) unsignedint) ε) Note: Recursion restricted to leymost or rightmost posi3on on LHS Parentheses used to avoid ambiguity. * * e * + - FAs with epsilons are NFAs NFAs are harder to implement, use backtracking Every NFA can be rewriuen as a DFA (gets larger, though)

Simple Problem Read characters consis3ng of as and bs, one at a 3me. If it contains a double aa, print accepted else rejected. An abstract solu3on to this can be expressed as a DFA b a a 1 2 3* Start state b The DFA state transi3ons can be encoded as a table which specifies the new state for a given current state and input current state An accep>ng state 1 2 3 input a, b a b 2 1 3 1 3 3 State transi>on table State transition table, initial state and set of accepting states represent the DFA import sys state = 1 ok = [3] trans = {1:{'a':2,'b':1}, 2:{'a':3,'b':1}, 3:{'a':3,'b':3}} for char in sys.argv[1]: state = trans[state][char] print 'accepted' if state in ok else 'rejected b a 1 2 3* Start state b current state 1 2 3 An accep>ng state input a b 2 1 3 1 3 3 a, b Scanner Generators E.g. lex, flex Take a table as input, return scanner program that extracts tokens from character stream Useful programming u3lity, especially when coupled with a parser generator (e.g., yacc) Standard in Unix Lex Lexical analyzer generator It writes a lexical analyzer Assumes each token matches a regular expression Needs set of regular expressions for each expression an ac3on Produces a highly op3mized C program Automa3cally handles many tricky problems flex is the gnu version of the venerable unix tool lex

Lex example input Examples lex cc foolex foo.l foolex.c foolex > flex - ofoolex.c foo.l > cc - ofoolex foolex.c - lfl >more input begin if size>10 then size * - 3.1415 end > foolex < input Keyword: begin Keyword: if Iden3fier: size Operator: > Integer: 10 (10) Keyword: then Iden3fier: size Operator: * Operator: - Float: 3.1415 (3.1415) Keyword: end tokens The examples to follow can be access on gl See /afs/umbc.edu/users/f/i/finin/pub/lex % ls -l /afs/umbc.edu/users/f/i/finin/pub/lex total 8 drwxr-xr-x 2 finin faculty 2048 Sep 27 13:31 aa drwxr-xr-x 2 finin faculty 2048 Sep 27 13:32 defs drwxr-xr-x 2 finin faculty 2048 Sep 27 11:35 footranscanner drwxr-xr-x 2 finin faculty 2048 Sep 27 11:34 simplescanner A Lex Program Simplest Example defini3ons rules subrou3nes [0-9] ID [a- z][a- z0-9]* {}+ prinƒ("integer\n ); {}+"."{}* prinƒ("float\n ); {ID} prinƒ("iden3fier\n ); [ \t\n]+ /* skip whitespace */. prinƒ( Huh?\n"); main(){yylex();}. \n ECHO; main() { yylex(); } No defini3ons One rule Minimal wrapper Echoes input

(a b)*aa(a b)* [a b]+ Strings containing aa. \n ECHO; main() {yylex();} {printf("accept %s\n", yytext);} {printf("reject %s\n", yytext);} Rules Each has a rule has a paiern and an acjon PaUerns are regular expression Only one ac3on is performed Ac3on corresponding to the pauern matched is performed If several pauerns match, one correspond- ing to the longest sequence is chosen Among the rules whose pauerns match the same number of characters, the first rule is preferred Defini>ons Defini3ons block allows you to name a RE If name in curly braces in a rule, the RE will be subs3tuted [0-9] {}+ printf("int: %s\n", yytext); {}+"."{}* printf("float: %s\n", yytext);. /* skip anything else */ main(){yylex();} /* scanner for a toy Pascal- like language */ %{ #include <math.h> /* needed for call to atof() */ %} [0-9] ID [a- z][a- z0-9]* {}+ prinƒ("integer: %s (%d)\n", yytext, atoi(yytext)); {}+"."{}* prinƒ("float: %s (%g)\n", yytext, atof(yytext)); if then begin end prinƒ("keyword: %s\n",yytext); {ID} prinƒ("iden3fier: %s\n",yytext); "+" "- " "*" "/" prinƒ("operator: %s\n",yytext); "{"[^}\n]*"}" /* skip one- line comments */ [ \t\n]+ /* skip whitespace */. prinƒ("unrecognized: %s\n",yytext); main(){yylex();}

Flex RE syntax x character 'x'. any character except newline [xyz] character class, in this case, matches either an 'x', a 'y', or a 'z' [abj- oz] character class with a range in it; matches 'a', 'b', any leuer from 'j' through 'o', or 'Z' [^A- Z] negated character class, i.e., any character but those in the class, e.g. any character except an uppercase leuer. [^A- Z\n] any character EXCEPT an uppercase leuer or a newline r* zero or more r's, where r is any regular expression r+ one or more r's r? zero or one r's (i.e., an op3onal r) {name} expansion of the "name" defini3on "[xy]\"foo" the literal string: '[xy]"foo' (note escaped ") \x if x is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI- C interpreta3on of \x. Otherwise, a literal 'x' (e.g., escape) rs RE r followed by RE s (e.g., concatena3on) r s either an r or an s <<EOF>> end- of- file