CS143 Handout 04 Summer 2011 June 22, 2011 flex In A Nutshell


Handout written by Julie Zelenski with minor edits by Keith.

flex is a fast lexical analyzer generator. You specify the scanner you want in the form of patterns to match and actions to apply for each token. flex takes your specification and generates a combined NFA to recognize all your patterns, converts it to an equivalent DFA, minimizes the automaton as much as possible, and generates C code that will implement it. flex closely resembles lex, an earlier tool designed by Lesk and Schmidt. While we will use flex in the course, almost all of the features we use are also present in the original lex.

This handout is designed to give you a quick introduction to the flex tool and should serve as a useful reference for the first programming assignment. Be aware, however, that to complete the first assignment you may need some more advanced features of flex that aren't covered here. To learn more about flex, run info flex, or read the documentation at http://flex.sourceforge.net/manual/.

How It Works

flex is designed for use with C code and generates a scanner written in C. The scanner is specified using regular expressions for the patterns and C code for the actions. The specification files are traditionally identified by their .l extension. You invoke flex on a .l file and it creates lex.yy.c, a source file containing a wad of unrecognizable C code that implements a finite automaton encoding all your rules, along with the code for the actions you specified. The file provides an extern function yylex() that will scan one token. You compile that C file normally, link with the lex library, and you have built a scanner! By default, the scanner reads from stdin and writes to stdout.
% flex myfile.l               creates lex.yy.c containing C code for the scanner
% gcc -o scan lex.yy.c -ll    compiles the scanner, links with the lex library
% ./scan                      executes the scanner, reading from stdin

Linking with the lex library provides a simple main that repeatedly calls yylex until it reaches EOF. You can also compile and link the scanner into your own project and use your own main to control when tokens are scanned. The Makefiles we provide for the projects will execute the compilation steps for you, but it is worthwhile to understand the steps required.

A flex Input File

flex input files are structured as follows:

%{
Declarations
%}
Definitions
%%
Rules
%%
User subroutines

The optional Declarations and User subroutines sections contain ordinary C code that you want copied verbatim to the generated C file: declarations are copied to the top of the file, user subroutines to the bottom. The optional Definitions section is where you specify options for the scanner and can set up definitions that give names to regular expressions, a simple substitution mechanism that allows for more readable entries in the Rules section that follows. The required Rules section is where you specify the patterns that identify your tokens and the action to perform upon recognizing each one.

flex Rules

A rule has a regular expression (called the pattern) and an associated set of C statements (called the action). Whenever the scanner reads an input sequence that matches a pattern, it executes the action to process it. This is a substantial generalization of regular expressions that allows the tool to be used in many different contexts. In specifying patterns, flex supports a fairly rich set of conveniences (character classes, specific repetition, etc.) beyond our formal definition of a regular expression. These features don't add expressive power; they simply allow you to construct complicated patterns more succinctly. The table below shows some operators to give you an idea of what is available. For more details, see the web or man pages.

Character classes      [0-9]     Alternation of the characters in the range listed (in this case: 0 1 2 3 4 5 6 7 8 9). More than one range may be specified, e.g. [0-9A-Za-z], as may individual characters, as in [aeiou0-9].
Character exclusion    ^         The first character in a character class may be ^ to indicate the complement of the set of characters specified. For example, [^0-9] matches any non-digit character.

Arbitrary character    .         The period matches any single character except newline.
Single repetition      x?        0 or 1 occurrences of x.
Non-zero repetition    x+        x repeated one or more times; equivalent to xx*.
Specified repetition   x{n,m}    x repeated between n and m times.
Beginning of line      ^x        Match x at the beginning of a line only.
End of line            x$        Match x at the end of a line only.
Context sensitivity    ab/cd     Match ab, but only when followed by cd. The lookahead characters are left in the input stream to be read for the next token.
Literal strings        "x"       Means x even if x would normally have special meaning. Thus, "x*" may be used to match x followed by an asterisk. You can turn off the special meaning of just one character by preceding it with a backslash, e.g. \. matches exactly the period character and nothing more.
Definitions            {name}    Replaced with the earlier defined pattern called name. This kind of substitution allows you to reuse pattern pieces and define more readable patterns.

As the scanner reads characters from the file, it gathers them until it forms the longest possible match for any of the available patterns. If two or more patterns match an equally long sequence, the pattern listed first in the file is used. The code that you include in the actions depends on what processing you are trying to do with each token. Perhaps the only action necessary is to print the matching token, add it to a table, or ignore it in the case of whitespace or comments. For a scanner designed to be used by a compiler, the action will usually record the token attributes and return a code that identifies the token type.

flex Global Variables

The token-grabbing function yylex takes no arguments and returns an integer. Often more information is needed about the token just read than that one integer code. The usual way information about the token is communicated back to the caller is by having the scanner set the contents of a global variable, which the caller can then read.
After counseling you for years that globals are absolute evil, we reluctantly sanction their limited use here, because our tools require that we use them. Here are the specific global variables used:

yytext is a null-terminated string containing the text of the lexeme just recognized as a token. This global variable is declared and managed in the lex.yy.c file. Do not modify its contents. The buffer is overwritten with each subsequent token, so you must make your own copy of a lexeme you need to store more permanently.

yyleng is an integer holding the length of the lexeme stored in yytext. This global variable is declared and managed in the lex.yy.c file. Do not modify its contents.

yylval is the global variable used to store attributes of the token, e.g. for an integer lexeme it might store the value; for a string literal, the pointer to its characters; and so on. This variable is declared to be of type YYSTYPE, which is usually a union of all the various fields needed for different token types. If you are using a parser generator (such as yacc or bison), it will define this type for you; otherwise, you must provide the definition yourself. Your scanner actions should appropriately set the contents of this variable for each token.

yylloc is the global variable used to store the location (line and column) of the token. This variable is declared to be of type YYLTYPE. Again, the parser generator can provide this, or it may be your responsibility. Your scanner actions should appropriately set the contents of this variable for each token.

Example 1

Here is a simple and complete specification for a scanner that replaces all numbers in a stream of text with a question mark. It might be useful, for example, if you were a particularly unscrupulous accountant:

%%
[0-9]+    printf("?");
.         ECHO;

The first %% marks the beginning of the rules section, the only section required in the input file. The pattern for the first rule matches any sequence of digits, and the associated action prints a question mark instead of the number itself.
The second rule matches any remaining character and uses the standard action ECHO (which just prints the matched text unchanged). To build and run this program, you could use the following:

% flex hide-digits.l
% gcc -o hide-digits lex.yy.c -ll
% ./hide-digits

At this point, anything you type is echoed back by the scanner with the numbers replaced by question marks.

Example 2

The following flex input file has all three sections: a definitions section (where you can define substitutions, set up global variables, etc.), the rules section, and the user subroutines section (where you can define helper functions). This file includes its own main rather than using the one supplied by the flex library. What does this program do?

%{
int numchars = 0, numwords = 0, numlines = 0;
%}
%%
\n          {numlines++; numchars++;}
[^ \t\n]+   {numwords++; numchars += yyleng;}
.           {numchars++;}
%%
int main() {
    yylex();
    printf("%d\t%d\t%d\n", numchars, numwords, numlines);
}

You can build and execute this scanner from the command line as follows:

% flex count.l
% gcc -o count lex.yy.c -ll
% ./count < count.l
243     34      17
%

Example 3

The following shows an excerpt from a scanner configured for use in a compiler. When the scanner finds a token, it stores information about that token in the global variable yylval, then returns a predefined token code to inform the compiler of the token type just scanned. There's obviously a lot more that needs to be here, but that's what you get to do for your first programming project!

[+>;]     { return yytext[0]; /* use ASCII code for single-char token */ }
"for"     { return T_For; }
[0-9]+    { yylval.integerConstant = atoi(yytext); return T_IntConstant; }
[a-z]+    { yylval.identifier = strdup(yytext); return T_Identifier; }

Bibliography

The Flex Project. Lexical Analysis with Flex. Accessed online, 19 Jun 2011. URL: http://flex.sourceforge.net/manual/

T. Mason and D. Brown, lex & yacc. Sebastopol, CA: O'Reilly & Associates, 1990.