CSC 467 Lecture 3: Regular Expressions


Recall

How we built a lexer by hand:
o Use fgetc()/mmap() to read input
o Use a big switch to match patterns

Homework exercise

    static TokenKind identifier( Token token )
    {
        /* exercise for you -- did you do it? */
        int len;
        char *start = cursor;

        while( isletter(*cursor) || isdigit(*cursor) )
            cursor++;
        token->kind = KIND_TOKEN_IDENTIFIER;
        len = cursor - start;
        token->u.stringval = malloc( len + 1 );
        strncpy( token->u.stringval, start, len );
        token->u.stringval[len] = '\0';
        return token->kind;
    }

    TokenKind gettoken( Token token )
    {
        for( ; ; ) {
            c = *cursor++;
            switch( c ) {
            ...
            case 'a': /* ... cases for each letter ... */ case 'A':
                if( cursor[-1] == 'f' && cursor[0] == 'o' && cursor[1] == 'r'
                    && isblank(cursor[2]) ) {
                    cursor += 2;
                    return KIND_TOKEN_FOR;
                }
                return identifier( token );
            ...
            }
        }
    }

Today

How can we build a lexer systematically?

Start with how to describe token patterns.

Regular Expressions

The notation we use to precisely capture all the variations that a given category of token may take is called "regular expressions" (or, less formally, "patterns", though the word "pattern" is really vague and there are lots of other notations for patterns besides regular expressions). Regular expressions are a shorthand notation for sets of strings. In order to even talk about "strings" you have to first define an alphabet, the set of characters which can appear.

1. Epsilon:
   Notation: ε
   Definition: ε is a regular expression denoting the set containing only the empty string.
2. Symbol:
   Notation: a
   Definition: any letter a in the alphabet is also a regular expression, denoting the set containing the one-letter string consisting of that letter.
3. Alternation: for regular expressions r and s,
   r|s
   is a regular expression denoting the union of r and s.
4. Concatenation: for regular expressions r and s,
   rs
   is a regular expression denoting the set of strings consisting of a member of r followed by a member of s.
5. Repetition: for regular expression r,
   r*
   is a regular expression denoting the set of strings consisting of zero or more occurrences of r.

Notation Sugar

Although these operators are sufficient to describe all regular languages, in practice everybody uses extensions:

You can parenthesize a regular expression to specify operator precedence (otherwise, alternation is like plus, concatenation is like times, and closure is like exponentiation).

For regular expression r, r+ is a regular expression denoting the set of strings consisting of one or more occurrences of r. Equivalent to rr*.

For regular expression r, r? is a regular expression denoting the set of strings consisting of zero or one occurrence of r. Equivalent to r|ε.

The notation [abc] is short for a|b|c. [a-z] is short for a|b|...|z. [^abc] is short for: any character other than a, b, or c.

Example

    for (keyword)   for
    letter          [a-zA-Z]
    digit           [0-9]
    identifier      letter (letter|digit)*
    sign            + | - | ε
    integer         sign (0 | [1-9] digit*)
    decimal         integer . digit*
    real            (integer | decimal) E sign digit*

There is some ambiguity, though: if the input includes the characters for8, then the first rule (for the for-keyword) matches 3 characters (for), while the fourth rule (for identifier) can match 1, 2, 3, or 4 characters, the longest being for8. To resolve this type of ambiguity, when there is a choice of rules, scanner generators choose the one that matches the maximum number of characters. In this case, the chosen rule is the one for identifier that matches 4 characters (for8). This disambiguation rule is called the longest match rule. If more than one rule matches the same maximum number of characters, the rule listed first is chosen. This is the rule priority disambiguation rule. For example, the lexical word for is taken as a for-keyword even though it uses the same number of characters as an identifier.

lex(1) and flex(1)

These programs take a lexical specification given in a .l file and create a corresponding C language lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked with the rest of your compiler. The C code generated by lex has the following public interface. Note the use of global variables instead of parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is also used in the YACC parser generator.
    FILE *yyin;     /* set this variable prior to calling yylex() */
    int yylex();    /* call this function once for each token */
    char yytext[];  /* yylex() writes the token's lexeme to this array */
                    /* note: with flex, I believe extern declarations must
                       read: extern char *yytext; */
    int yywrap();   /* called by lex when it hits end-of-file; see below */

The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify lex elements. The whole file is divided into three sections, separated by %%:

    header
    %%
    body
    %%
    helper functions

The header consists of C code fragments enclosed in %{ and %}, as well as macro definitions consisting of a name and a regular expression denoted by that name. lex macros are invoked explicitly by enclosing the macro name in curly braces. Following are some example lex macros.

    letter  [a-zA-Z]
    digit   [0-9]
    ident   {letter}({letter}|{digit})*

The body consists of a sequence of regular expressions for different token categories and other lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular expression is matched. For most of the regular expressions this code fragment (also called a semantic action) consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions and semantic actions might include:

    " "      { /* no-op, discard whitespace */ }
    {ident}  { return IDENTIFIER; }
    "*"      { return ASTERISK; }
    "."      { return PERIOD; }

You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters.

The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values denoted by literals. One helper function you have to write is yywrap(), which is called when lex hits end of file. If you just want lex to quit, have yywrap() return 1. If your yywrap() switches yyin to a different file and you want lex to continue processing, have yywrap() return 0. The lex and flex libraries (-ll or -lfl) have a default yywrap() function which returns 1, and flex has the directive %option noyywrap which allows you to skip writing this function.
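Putting the three sections together, a minimal complete .l file might look like this (a sketch: the token codes are made-up stand-ins for what a parser header would normally provide, and %option noyywrap is flex-specific):

```lex
%{
/* header section: C fragments between %{ and %} are copied verbatim */
#include <stdio.h>
#define IDENTIFIER 257
#define ASTERISK   258
#define PERIOD     259
%}
%option noyywrap
letter  [a-zA-Z]
digit   [0-9]
ident   {letter}({letter}|{digit})*

%%
[ \t\n]   { /* discard whitespace */ }
{ident}   { return IDENTIFIER; }
"*"       { return ASTERISK; }
"."       { return PERIOD; }
.         { fprintf(stderr, "illegal character: %s\n", yytext); }
%%
```

Running flex on this file produces lex.yy.c, which you compile and link with the driver that calls yylex().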
A Short Comment on Lexing C Reals

C float and double constants have to have at least one digit, either before or after the required decimal point. This is a pain:

    ([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)...

You might almost be happier if you wrote

    ([0-9]*\.[0-9]*)   { return (strcmp(yytext, ".")) ? REAL : PERIOD; }

You-all know C's ternary e1 ? e2 : e3 operator, don't ya? It's an if-then-else expression, very slick.

Lex extended regular expressions

Lex further extends the regular expressions with several helpful operators. Lex's regular expressions include:

    c        normal characters mean themselves
    \c       backslash escapes remove the meaning from most operator
             characters; inside character sets and quotes, backslash
             performs C-style escapes
    "s"      double quotes match the C string given, as itself; this is
             particularly useful for multi-byte operators and may be more
             readable than using backslash multiple times
    [s]      this character set operator matches any one character among
             those in s
    [^s]     a negated set matches any one character not among those in s
    .        the dot operator matches any one character except newline: [^\n]
    r*       match r 0 or more times
    r+       match r 1 or more times
    r?       match r 0 or 1 time
    r{m,n}   match r between m and n times
    r1r2     concatenation: match r1 followed by r2
    r1|r2    alternation: match r1 or r2
    (r)      parentheses specify precedence but do not match anything
    r1/r2    lookahead: match r1 when r2 follows, without consuming r2
    ^r       match r only when it occurs at the beginning of a line
    r$       match r only when it occurs at the end of a line