Module 8 - Lexical Analyzer Generator


This module discusses the core issues in designing a lexical analyzer generator, whether built from scratch or with a tool. The basics of the LEX tool are also discussed.

8.1 Need for a Tool

The lexical analysis phase of the compiler belongs to the analysis part of compilation and is machine independent. It must, however, tokenize the input string, and so it is source-language dependent: the lexical analyzer needs patterns defined for all the programming constructs of the input language. Writing a lexical analyzer from scratch, with all of these patterns coded by hand, is therefore difficult. If, on the other hand, a tool can absorb these variations in the source language, designing the lexical analysis phase becomes much easier.

8.2 Lexical Analyzer Generator Tool

A lexical analyzer generator is typically used as a tool, and some standard tools are available in the UNIX environment:

LEX - helps in writing programs whose flow of control is regulated by regular-expression definitions matched against the input stream. The wrapper programming language is C.
FLEX - a faster lexical analyzer tool; it is likewise a C-language version of the LEX tool.
JLEX - a Java version of the LEX tool.

The control flow of a LEX program is given in Figure 8.1.

[Figure 8.1 Control flow of a LEX program]

The input is a source file with a .l extension. It is compiled by the LEX compiler, and the output of this compiler is a C source file named lex.yy.c. That file can in turn be compiled with a C compiler to obtain the desired scanner.

8.3 Components of a LEX program

A LEX program typically consists of three parts: an initial declaration section, a middle section of translation rules, and a last section containing other auxiliary procedures. The %% symbol acts as a delimiter: it separates the declaration section from the translation rules section, and the translation rules section from the auxiliary procedures section. A program may omit the declaration section, but the delimiter is mandatory:

declarations
%%
translation rules
%%
auxiliary procedures

In the declaration section, variables are declared and initialized. In the translation rules section, the regular expressions that match tokens are defined along with the necessary actions. The auxiliary procedures section consists of a main function, corresponding to a C program, and any other functions required by the actions.

8.3.1 Declaration

In the declaration section a regular expression can be defined and given a name. The following is an example of a declaration section; each statement has two components, a name and the regular expression that the name denotes.

1. delim   [ \t\n]
2. ws      {delim}+
3. letter  [A-Za-z]
4. digit   [0-9]
5. id      {letter}({letter}|{digit})*
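To make the three-part structure concrete, the following is a minimal sketch of a complete LEX source file built around the declarations above. It is a sketch only: the token codes ID and NUM are invented here for illustration, and in a real compiler they would come from the parser.

%{
/* Sketch: a complete scanner skeleton. Token codes are illustrative only.
 * Build (assuming the file is named scan.l):
 *     lex scan.l
 *     cc lex.yy.c -ll        (the -ll library supplies a default yywrap)
 */
#include <stdio.h>
#define ID  1
#define NUM 2
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
%%
{ws}       { /* skip blanks, tabs and newlines */ }
{id}       { return ID; }
{digit}+   { return NUM; }
.          { /* ignore anything else */ }
%%
int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)    /* yylex() returns 0 at end of input */
        printf("token code %d\n", tok);
    return 0;
}

Run over any text, the resulting a.out prints a token code for every identifier and number it finds.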

Table 8.1 summarizes the operators and special characters used in the regular expressions of the declaration and translation rules sections.

Table 8.1 Metacharacters

Metacharacter   Match
.               any character except newline
\n              newline
*               zero or more copies of the preceding expression
+               one or more copies of the preceding expression
?               zero or one copy of the preceding expression
^               beginning of line
$               end of line
a|b             a or b
(ab)+           one or more copies of ab (grouping)
"a+b"           the literal string a+b (C escapes still work)
[ ]             character class

In addition, the declaration section may also contain local variable declarations and definitions, which can be used and modified in the subsequent sections.

8.3.2 Translation Rules

This is the second section of the LEX program, following the declarations, from which it is separated by the %% delimiter. Here each statement consists of two components: a pattern and an action. The pattern is matched against the input; if a pattern matches, the action listed against it is carried out. The LEX tool can thus be looked upon as a rule-based programming language. The following shows patterns p1, p2, ..., pn and their corresponding actions action1 to actionn:

p1   {action1}   /* p = pattern (regular expression) */
p2   {action2}
...
pn   {actionn}

For example, if the token IF is to be returned on a match with the keyword if, the translation rule is defined as

if   {return(IF);}

The ; at the end of return(IF) marks the end of one statement of the action; an action made up of several statements, or written across multiple lines, is enclosed between a pair of braces.
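As a sketch of how these operators read in practice, the following rule fragment (the patterns and messages are invented for illustration; the -ll library supplies the default main) announces which pattern matched each piece of the input:

%%
^#.*$      { printf("a whole line starting with #\n"); }   /* ^ . $ */
"a+b"      { printf("the literal string a+b\n"); }          /* quoted literal */
(ab)+      { printf("one or more copies of ab\n"); }        /* grouping and + */
yes|no     { printf("yes or no\n"); }                       /* alternation */
[0-9]?x    { printf("x with an optional digit\n"); }        /* ? and a character class */
.|\n       { /* discard everything else */ }

Compiled with lex demo.l && cc lex.yy.c -ll, the generated scanner applies the first, longest match at each point of the input.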

Similarly, the following is a rule for an identifier id, whose definition was already given in the declaration section:

{id}   {yylval=install_id(); return(ID);}

In the statement above, two actions are taken when an identifier is encountered. The first is to call the install_id() function and assign its result to yylval; the second is the return statement.

8.3.3 Auxiliary procedures

This section is separated from the translation rules section by the %% delimiter. In this section the C program's main function is declared, and the other necessary functions are also defined. In the example from the translation rules section, install_id() is a procedure that installs the lexeme, whose first character is pointed to by yytext and whose length is given by yyleng, into the symbol table, and returns a pointer to the beginning of that entry:

install_id() {
}

The functionality of install_id can be written separately or combined with the main function (a sketch of a possible body is given after Example 8.1 below). yytext and yyleng are variables provided by LEX that hold the text of the matched input and the length of that string.

8.4 Example LEX program

The following LEX program counts the number of lines in the input data stream.

1. %{ int num_lines = 0; %}
2. %%
3. \n   ++num_lines;
4. .    ;
5. %%
6. main()
7. { yylex();
8.   printf("# of lines = %d\n", num_lines); }

Example 8.1 LEX program to count the number of lines

Line 1 of this program belongs to the declaration section; it declares an integer variable num_lines and initializes it to 0, which concludes the first section. In the translation rules section, the pattern \n on line 3 is paired with the action ++num_lines: whenever a newline is encountered, the line count is incremented. Line 4 matches and discards every other character. The third section spans lines 6 to 8: the call yylex() runs the scanner over the input stream, and as long as data exists in the input the lines are counted; the total is printed by line 8.
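Section 8.3.3 left the body of install_id() empty. The following is a minimal sketch of what it could contain, placed in the auxiliary procedures section of the scanner. The fixed-size linear table symtab is an invention for illustration; a production compiler would use a hash table and store attributes alongside each lexeme.

/* Auxiliary procedures section (after the second %%). */
#include <string.h>
#include <stdlib.h>

#define MAX_SYMS 1000
static char *symtab[MAX_SYMS];   /* hypothetical symbol table: one lexeme per slot */
static int  nsyms = 0;           /* no overflow check in this sketch */

int install_id(void)
{
    int i;
    for (i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], yytext) == 0)
            return i;                       /* lexeme already installed */
    symtab[nsyms] = malloc(yyleng + 1);     /* room for the lexeme and '\0' */
    memcpy(symtab[nsyms], yytext, yyleng);
    symtab[nsyms][yyleng] = '\0';
    return nsyms++;
}

In this variant the table index, rather than a pointer, is what ends up in yylval, so the parser can later recover the lexeme from the token.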

Example 8.2 is an extension of Example 8.1 in which the numbers of characters, words and lines are all counted.

1.  %{ int nchar, nword, nline; %}
2.  %%
3.  \n        { nline++; nchar++; }
4.  [^ \t\n]+ { nword++; nchar += yyleng; }
5.  .         { nchar++; }
6.  %%
7.  int main(void)
8.  { yylex();
9.    printf("%d\t%d\t%d\n", nchar, nword, nline);
10.   return 0; }

Example 8.2 LEX program to count the number of lines, characters and words

In this example, \n increments both the character count and the line count (line 3). The pattern [^ \t\n]+ on line 4 matches a maximal run of characters other than blanks, tabs and newlines, that is, one word; it increments the word count and adds the word's length, yyleng, to the character count. Line 5 counts every remaining character. Lines 7 to 10 contain the main function, which calls yylex() to scan the input and then prints the numbers of characters, words and lines.

Table 8.2 summarizes some of the LEX-provided names that can be used in a LEX program.

Table 8.2

Name               Function
int yylex(void)    call to invoke the lexer; returns a token
char *yytext       pointer to the matched string
yyleng             length of the matched string
yylval             value associated with the token
int yywrap(void)   wrap-up; return 1 if done, 0 if not done
FILE *yyout        output file
FILE *yyin         input file
INITIAL            initial start condition
BEGIN condition    switch start condition
ECHO               write the matched string

The syntax and compilation procedure of a FLEX program are the same as those of a LEX program.
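A short sketch showing several of the names from Table 8.2 in use; the input file name input.txt is an assumption. The program replaces every number on the input with the marker <NUM>, while ECHO copies everything else through unchanged:

%{
#include <stdio.h>
%}
%%
[0-9]+   { fprintf(yyout, "<NUM>"); }   /* rewrite numbers on the output stream */
.|\n     { ECHO; }                      /* ECHO writes yytext to yyout */
%%
int yywrap(void) { return 1; }          /* 1 = no further input after end of file */

int main(void)
{
    if ((yyin = fopen("input.txt", "r")) == NULL)   /* redirect the scanner's input */
        return 1;
    yyout = stdout;                                 /* and its output */
    yylex();
    return 0;
}

Because yywrap() is defined here, the program can be built with a plain cc lex.yy.c, without the -ll library.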

8.5 JLEX program

A typical JLEX program also consists of three parts, though the organization is slightly different. The first section contains user code, which is later copied directly into the generated Java file. The second section consists of JLEX directives, where macros and state names are typically defined. Here again %% is used as a delimiter to separate one section from the other. The third section consists of the lexical analysis rules, which define regular expressions along with the necessary actions. Because the user code is copied into a Java file, a Java compiler is used, instead of the C compiler of the LEX case, to compile and execute the generated scanner. The following is the layout of a JLEX file:

user code
%%
JLex directives
%%
lexical analysis rules

Example 8.3 is a JLEX program to count the number of lines in an input document.

1.  import java_cup.runtime.*;
2.  %%
3.  %cup
4.  %{
5.      private int linecounter = 0;
6.  %}
7.  %eofval{
8.      System.out.println("line number=" + linecounter);
9.      return new Symbol(sym.EOF);
10. %eofval}
11. NEWLINE=\n
12. %%
13. {NEWLINE} {
14.     linecounter++;
15. }
16. [^{NEWLINE}] { }

Example 8.3 Sample JLEX program

In Example 8.3, line 1 is the user code, which is copied into the generated Java file. The %cup directive activates Java CUP compatibility, which makes the generated scanner conform to the java_cup.runtime interface and initializes it accordingly. This compatibility is turned off by default, and hence a JLEX program that is to work with CUP must activate it as a first step. The %eofval{ ... %eofval} block specifies the value returned when the end of the input is reached; here it prints the line count and returns the EOF symbol. NEWLINE is a name given to the \n character, which signals the start of a new line, and linecounter is incremented for every NEWLINE encountered (line 14); the final rule ignores all other characters. JLEX, like LEX, is a rule-based programming language and hence does not follow the flow of a structured programming language.

Summary

This module discussed the use of the tools LEX and JLEX for tokenizing input. Designing and implementing the lexical phase of a compiler from scratch is difficult, so tools can be used to do the job of tokenizing by means of a rule-based programming language. The tools use regular expressions to define patterns together with corresponding actions; when a regular expression matches the input, the associated action is carried out.