Handout 7, Lex (5/30/2001)


Lex is a venerable Unix tool that generates scanners. The input to lex is a text file that specifies the scanner; more precisely, it specifies the tokens the scanner-to-be must recognize, and what to do with each recognized pattern. The specification is partitioned into three sections: a definition section, a rules section, and a function section.

The command to run lex is: lex switches file_names. Switches are optional. When more than one file name is specified, the files are all concatenated and used as a single input to lex. When no input file is specified, input comes from stdin. The default action of the generated scanner is to read input from stdin and produce output onto stdout; the shortest possible lex program does just that. Table 0 defines the lex switches. The lex command line is:

    lex [-cntv] [-e|-w] [-V] [-Q[y|n]] [file ...]

Example: lex in.l and then: cc lex.yy.c -ll

Table 0: Lex Switches
    Switch      Meaning
    -c          C-language actions; this is the default
    -e          Scanner can handle EUC (Extended Unix Code) characters; exclusive with -w
    -f          No packing of the generated tables
    -n          Suppress table size statistics after scanner generation
    -t          Instead of generating the C source program in file lex.yy.c, emit the C source onto stdout; from there it may be redirected, or piped into more for viewing
    -v          Write a summary of lex table sizes to stderr
    -w          Scanner can handle EUC characters, stored as wide characters; exclusive with -e
    -V          Print version information onto stderr
    -Q[y|n]     Print version information into lex.yy.c (or onto stdout); -Qn is the default

The definition section holds C-language definitions, for example external objects, or objects unique to the functions that constitute the scanner. The rules section is composed of patterns and actions; we'll explain patterns below. When a defined pattern matches the input, the associated action is executed. The function section holds any C text you wish, including executable statements in C function bodies. The sections are separated from one another by the %% meta operator. When only a single %% is given, the function section is empty. When no %% is given, the specification is erroneous. When 2 or more patterns start matching the same source input, the longest match is accepted. If two or more matches are of the same length, the pattern defined first is taken.
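
As an aside, the shortest possible lex specification is a file that contains nothing but a single %% separator: the definition and rules sections are both empty, and, as noted under the Rules Section below, input that no rule matches is copied to stdout, so the generated scanner is a plain copy program. A minimal sketch follows; the file name min.l and the input file some_text are only illustrations:

    %%

    lex min.l
    cc lex.yy.c -ll
    ./a.out < some_text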

Definition Section: Any line in the definition section (actually anywhere in a lex program) starting with a blank ' ' is considered a snippet of C to be copied into the external definition area of the resulting lex.yy.c source program. Also, text enclosed between the %{ and %} meta brackets is considered C source text and is copied into lex.yy.c. Any line in the definition section not starting with a blank (and not enclosed between the %{ and %} meta operators) is a definition. The format of a definition is: name substitute. The name must match the C lexical rules for an identifier. Such a name may be referenced in the rules section, enclosed in { and } braces, for example {name1}. In that case, the matching substitute replaces the complete string {name1} in a rule. The { and } braces are syntactic means to tell lex: here is a name, defined in the definition section!

States: Any line in the definition section starting with % followed by an identifier beginning with s or S defines a state (aka start condition). For example, %s state1 in the definition section defines a lex state named state1. States can be used in the rules of the rules section to make actions conditional. Any line in the definition section starting with % followed by an identifier beginning with x or X defines an exclusive state (aka exclusive start condition). For example, %x ex1 defines a lex exclusive state named ex1. When the generated scanner is in a %s state, patterns with no state are also considered. When the scanner is in an exclusive %x state, only rules associated with an exclusive state are considered. Mostly you'll use the %s form.

Pointer or Array: The predefined words %array and %pointer specify how yytext is defined in C. The default, matching %array, is extern char yytext[];. This can be overridden with %pointer to mean: extern char * yytext;.

Rules Section: The 0 or more rules in the rules section each consist of a pattern (an Extended Regular Expression, aka ERE), followed by an associated action, expressed as a C program snippet. The two are separated from one another by 1 or more blanks. For example, the rule below is meant to convert uppercase letters to lowercase:

    [A-Z]    putchar( yytext[ 0 ] + 'a' - 'A' );

When an action is not specified, lex does not have to detect this and the result is undefined. Note that a ; in place of the C program snippet is a complete action: it specifies the empty statement, doing nothing. This effectively skips the matched input when the scanner reads its source. Also note that when an input character (or string) is not matched by any rule, it is copied to the output by the scanner. Another example:

    [abcx-z]    { return SPECIAL_L; }

When the input is any of the characters a, b, c, x, y, or z, the value SPECIAL_L is returned by the scanner specified above.

Function Section: This section includes 0 or more C function definitions with complete bodies that can be used in the scanner. For example, these can be functions that are callable from the C snippets specified as actions when patterns are matched.
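
To see how the three sections fit together before looking at the meta operators in detail, here is a small sketch. It is not one of the handout's numbered samples; the definition name DIGIT, the helper function report(), and the file name digits.l are invented for this illustration:

    %{
    /* Illustrative sketch only: DIGIT, report(), and digits.l are invented names */
    #include <stdio.h>
    void report( char *s );
    %}
    DIGIT    [0-9]
    %%
    {DIGIT}+    { report( yytext );  /* {DIGIT} expands to [0-9] */ }
    .|\n        { ;                  /* discard everything else  */ }
    %%
    void report( char *s )
    {
        printf( "number: %s\n", s ); /* s is the matched text */
    }

Built like the example above (lex digits.l, then cc lex.yy.c -ll), the generated scanner prints one line for each run of decimal digits read from stdin and silently discards all other input.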

Table 1 defines the lex meta operators used to build patterns of character strings:

Table 1: Lex Meta Operators
    Operator        Meaning
    %%              Separate one section from the next; at most 3 sections, hence at most two %%
    %{              Begin C source text in the definition section
    %}              End C source text in the definition section
    %array          extern char yytext[]; is the default
    %pointer        extern char * yytext;
    *               Repetition, 0 or more times, of the pattern to the left
    +               Repetition, 1 or more times, of the pattern to the left
    ?               0 or 1 occurrence of the pattern to the left
    |               Select the pattern to the left or the one to the right of |
    ^               Match at the start of a source line, in column 1
    $               Match at the end of a source line
    .               Match any character except the newline \n
    ( )             Group pattern expressions in the rules section
    [ ]             Select exactly 1 character out of the set inscribed
    ^ inside [ ]    Negate the set that follows, inside [ ]
    - inside [ ]    Define a range of characters, from the one on the left of - to the one on the right of -, inside the [ ] pair
    \               Escape: take the next character literally
    ".."            Escape: take the character string literally
    { }             Use a symbolic name, defined in the definition section, in a pattern of the rules section, to be substituted; e.g. {in_comment}
    < >             Enclose a state, defined in the definition section; acts as an additional condition for a pattern to match, i.e. the state must be on
    /               Specify trailing context for the pattern to the left of /; see explanation below
    BEGIN state     Turn the named state on; state must have been defined via %s or %x
    BEGIN INITIAL   Turn all states off; used in the rules section as part of actions

Table 2 shows a few sample patterns and their meaning:

Table 2: Lex Pattern Examples
    Pattern                      Matching strings
    [a-z]                        All lowercase letters, exactly one of them
    [A-Z]+                       One or more uppercase letters
    [0-9]*                       Zero or more digits
    -?[0-9]+                     Decimal number with 1 or more digits, optionally signed with -
    [a-zA-Z][a-zA-Z0-9]*         A letter, followed by 0 or more letters or digits
    "Hello"                      The string "Hello"
    ^mom                         The string "mom" at the beginning of a line
    -?(([0-9]+)|[0-9]*\.[0-9]+)  Optionally signed integer or float literal; the float need not have a leading digit, for example -.99, but -0.99 is also o.k.

Table 3 lists selected external lex C objects, used by programs linked with lex, e.g. yacc:

Table 3: External Lex Objects
    lex C object    Meaning
    yylex()         The scanner function, callable by yacc
    yytext          Char array holding the scanned input, terminated by null; its length can be set
    lex.yy.c        C program that is the scanner, generated by lex
    ECHO            Macro that copies the matched source text to standard output
    y.tab.h         Include file used by lex, generated by yacc
    yylval          Integer object of yacc, frequently assigned by lex
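
The quoted-string and identifier patterns from Table 2 can be combined to illustrate the matching rules stated earlier (the longest match wins; on a tie, the earliest rule wins). The sketch below is not from the handout, and the messages it prints are arbitrary:

    %{
    #include <stdio.h>   /* for printf in the actions below */
    %}
    %%
    "while"                  printf( "keyword\n" );
    [a-zA-Z][a-zA-Z0-9]*     printf( "identifier %s\n", yytext );
    .|\n                     ;

On input while both rules match 5 characters, so the first rule (keyword) is taken; on input whiles the identifier rule matches 6 characters and wins, because the longer match is preferred.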

lex Sample 1: The lex program below scans integer literals, returns a NUMBER token for each, and filters out (skips) the rest of the source coming from stdin:

Lex Sample 1
    %{
    /*
     * The definition section:
     *
     * PSU CS 302 Spring 2001
     * Herb Mayer 4/26/2001
     */
    #include "y.tab.h"
    extern int yylval;
    %}
    %%
    [0-9]+    { yylval = atoi( yytext ); return NUMBER; }
    .|\n      { return yytext[ 0 ]; }
    %%
    /*
     * The Function Section: empty
     */

The simple lex source shows the 3 sections separated from one another by %%. The definition section includes a C-style comment, #includes a file with #defines generated by yacc, and declares a C integer object named yylval that is defined elsewhere; it is external to lex. The scanner recognizes decimal numbers, one or more digits constituting a token, and returns the converted integer value to its caller, presumably a parser generated by yacc. All other input, expressed by the . meta character and the newline, is not copied to stdout but is returned, one character at a time, to the caller of yylex(). Note that the meta character . does not match the newline, which is why \n is listed explicitly.

lex Sample 2: The lex program below converts upper- to lowercase characters:

Lex Sample 2
    %%
    [A-Z]    putchar( yytext[ 0 ] + 'a' - 'A' );
    %%

A scanner generated with Sample 2 reads input and copies it all to stdout, except that uppercase letters are converted to lowercase letters. The rest of the output looks just like the input. Note that the second %% meta operator is superfluous, since the function section is empty.

lex Sample 3: The lex program below converts to lowercase and reduces white space:

Lex Sample 3
    %%
    [A-Z]    putchar( yytext[ 0 ] + 'a' - 'A' );
    [ ]+$    ;
    [ ]+     putchar( ' ' );

A scanner generated with Sample 3 reads input and copies it to stdout, except that uppercase letters are all converted to lowercase. Also, blanks at the end of a text line are discarded, and multiple blanks inside other text are each replaced by a single blank.

Lex macros for I/O: There are a few macros defined in lex that can be used in the action portion of a rule; Table 4 lists these lex-specific I/O intrinsics:

Table 4: Lex I/O intrinsics
    Macro            Meaning
    input()          C macro returning the next character from stdin, if any
    output( char )   C macro printing char to stdout
    unput( char )    C macro pushing char back into stdin
    ECHO             Copy the current value in yytext[] to stdout

We'll see an example of the use of input() and unput( char ) further on.

lex Sample 4: The lex program below filters out C style /* .. */ comments:

Lex Sample 4
    %%
    "/*"    {
      again:                          // loops if EOF before */
        while( input() != '*' ) ;
        // last input() was '*'; what is next? '/' perhaps?
        switch( input() ) {
        case '/': break;              // comment was skipped, /* */
        case '*': unput( '*' );       // stuff back, could be /** then /**/
        default:  goto again;         // yeah, we know, gotos are ugly
        } // end switch
    }   // end single C action, matching pattern /*

A scanner generated with Sample 4 reads any text input and copies it back onto stdout, except that all text between /* and */, including the /* and */ delimiters, has magically disappeared. However, the program loops forever if a comment is incomplete, i.e. the end of the file is reached after a /* has started a comment but the closing */ is never provided in the source.
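
Since Sample 4 loops forever on an unterminated comment, here is one possible EOF-safe sketch of the same rule. It is not part of the original handout, and the end-of-file value returned by input() differs between lex implementations (classic AT&T lex returns 0, flex returns EOF), so the loop checks for both:

    "/*"    {
        int c = 0, prev = 0;
        /* read characters until the closing star-slash is seen or input ends */
        while ( (c = input()) != 0 && c != EOF ) {
            if ( prev == '*' && c == '/' )
                break;                   /* comment properly closed */
            prev = c;
        }
        if ( c == 0 || c == EOF )
            fprintf( stderr, "EOF inside comment\n" );
    }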

Trailing context, the / lex meta operator: An ERE of the form pattern1/pattern2 causes pattern1 in the rules section to match only if it is followed by a source string that matches pattern2. Only the source string matching pattern1 is actually consumed; pattern2 remains untouched. Thus, pattern2 or a portion thereof may be read in a subsequent lex activity. Confusion can arise, and lex behavior is undefined in some of these cases, if the end of pattern1 matches the start of pattern2. For example, given the ERE pattern a*b/cc and input aaabcc, yytext will hold aaab. However, given the pattern x*/xy and input xxxy, the token xxx and not xx is returned by some lexes, since xxx is the longest match of x*. Similarly, for the pattern with trailing context ab*/bc and input abbbc, the length of the recognized match is undefined: it is either abb or abbb. But for the pattern ab/bc and input abbc the matched string is clearly defined: ab.

Sample of a Lex State:

lex Sample 5: The sample below shows a convenient use of states, arbitrarily using exclusive states; the %s form would have worked the same:

Lex Sample 5
    %x CMT
    %%
    "/*"        { printf( "/* " ); BEGIN CMT; }
    <CMT>.      putchar( yytext[ 0 ] );
    <CMT>\n     putchar( yytext[ 0 ] );
    <CMT>"*/"   { printf( "*/ " ); BEGIN INITIAL; }
    .           ;
    \n          ;

When the scanner generated by Sample 5 reads C source input, it generates output consisting solely of the old C-style comments, plus a blank character after the opening /* and after the closing */ of each comment. Note the way the state CMT is turned on and off: the former is accomplished by BEGIN CMT, the latter by BEGIN INITIAL. INITIAL is a predefined name; BEGIN INITIAL turns all user-defined states off.

lex Sample 6: The lex program below is intended to filter out C style /* .. */ comments, but it will fail. We leave it as an exercise to the reader why this lex pattern will not filter out all /* .. */ comments as desired:

Lex Sample 6
    %%
    "/*"(.|\n)*"*/"    ;

Note that the pattern (.|\n) will also match, at least sometimes, the '*' and '/' characters. Which match case is excluded?

lex Sample 7: The next small lex program filters out all new-style C (//) comments, and copies all other characters of an ASCII text file onto standard output:

Lex Sample 7
    %%
    "//".*$    ;

The lex source in Sample 7 scans for strings that start with //, exactly the characters that begin a new-style C comment. From the first pair of // on any text line until the end of that line, all text is skipped. All other characters of the input file are copied back to standard output.

Caution: There exist numerous inconsistencies among the different implementations of lex on different systems, aside from character set differences. Major causes for differences include:

    - lex's response to illegal input, e.g. to a missing action in a rule
    - trailing context matches, when pattern1's end matches pattern2's start
    - the type of yytext
    - lex command line switches