Lexical Analysis. Implementing Scanners & LEX: A Lexical Analyzer Tool

Similar documents
PRACTICAL CLASS: Flex & Bison

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

Chapter 3 Lexical Analysis

Module 8 - Lexical Analyzer Generator. 8.1 Need for a Tool. 8.2 Lexical Analyzer Generator Tool

Compiler course. Chapter 3 Lexical Analysis

CSE302: Compiler Design

EXPERIMENT NO : M/C Lenovo Think center M700 Ci3,6100,6th Gen. H81, 4GB RAM,500GB HDD

EXPERIMENT NO : M/C Lenovo Think center M700 Ci3,6100,6th Gen. H81, 4GB RAM,500GB HDD

Lexical Analysis. Introduction

Edited by Himanshu Mittal. Lexical Analysis Phase

LEX/Flex Scanner Generator

Lex Spec Example. Int installid() {/* code to put id lexeme into string table*/}

Using Lex or Flex. Prof. James L. Frankel Harvard University

Type 3 languages. Regular grammars Finite automata. Regular expressions. Deterministic Nondeterministic. a, a, ε, E 1.E 2, E 1 E 2, E 1*, (E 1 )

Lexical analysis. Syntactical analysis. Semantical analysis. Intermediate code generation. Optimization. Code generation. Target specific optimization

Marcello Bersani Ed. 22, via Golgi 42, 3 piano 3769

CSC 467 Lecture 3: Regular Expressions

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

Figure 2.1: Role of Lexical Analyzer

The structure of a compiler

Lecture Outline. COMP-421 Compiler Design. What is Lex? Lex Specification. ! Lexical Analyzer Lex. ! Lex Examples. Presented by Dr Ioanna Dionysiou

Lexical and Parser Tools

Lexical Analysis - An Introduction. Lecture 4 Spring 2005 Department of Computer Science University of Alabama Joel Jones

An introduction to Flex

Big Picture: Compilation Process. CSCI: 4500/6500 Programming Languages. Big Picture: Compilation Process. Big Picture: Compilation Process.

Typical tradeoffs in compiler design are: speed of compilation size of the generated code speed of the generated code Speed of Execution Foundations

CS143 Handout 04 Summer 2011 June 22, 2011 flex In A Nutshell

CSCI Compiler Design

Compiler Construction

Compiler Construction

Big Picture: Compilation Process. CSCI: 4500/6500 Programming Languages. Big Picture: Compilation Process. Big Picture: Compilation Process

Flex and lexical analysis

CS4850 SummerII Lex Primer. Usage Paradigm of Lex. Lex is a tool for creating lexical analyzers. Lexical analyzers tokenize input streams.

Chapter 3 -- Scanner (Lexical Analyzer)

LECTURE 11. Semantic Analysis and Yacc

Parsing and Pattern Recognition

Syntax Analysis Part IV

Lexical Analyzer Scanner

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Lexical Analyzer Scanner

Lesson 10. CDT301 Compiler Theory, Spring 2011 Teacher: Linus Källberg

Lexical and Syntax Analysis

Introduction to Lex & Yacc. (flex & bison)

TDDD55- Compilers and Interpreters Lesson 2

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

Lex & Yacc (GNU distribution - flex & bison) Jeonghwan Park

CS415 Compilers. Lexical Analysis

I. OVERVIEW 1 II. INTRODUCTION 3 III. OPERATING PROCEDURE 5 IV. PCLEX 10 V. PCYACC 21. Table of Contents

Automatic Scanning and Parsing using LEX and YACC

Projects for Compilers

Scanning. COMP 520: Compiler Design (4 credits) Professor Laurie Hendren.

Flex and lexical analysis. October 25, 2016

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

Lex & Yacc. By H. Altay Güvenir. A compiler or an interpreter performs its task in 3 stages:

Principles of Compiler Design Presented by, R.Venkadeshan,M.Tech-IT, Lecturer /CSE Dept, Chettinad College of Engineering &Technology

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

10/4/18. Lexical and Syntactic Analysis. Lexical and Syntax Analysis. Tokenizing Source. Scanner. Reasons to Separate Lexical and Syntactic Analysis

Languages and Compilers

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Lex & Yacc. by H. Altay Güvenir. A compiler or an interpreter performs its task in 3 stages:

Introduction to Yacc. General Description Input file Output files Parsing conflicts Pseudovariables Examples. Principles of Compilers - 16/03/2006

CD Assignment I. 1. Explain the various phases of the compiler with a simple example.

Monday, August 26, 13. Scanners

10/5/17. Lexical and Syntactic Analysis. Lexical and Syntax Analysis. Tokenizing Source. Scanner. Reasons to Separate Lexical and Syntax Analysis

CS 403: Scanning and Parsing

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Wednesday, September 3, 14. Scanners

Scanning. COMP 520: Compiler Design (4 credits) Alexander Krolik MWF 13:30-14:30, MD 279

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

Lexical Analysis (ASU Ch 3, Fig 3.1)

Lexical Analysis. Textbook:Modern Compiler Design Chapter 2.1.

CSc 453 Lexical Analysis (Scanning)

Lex (Lesk & Schmidt[Lesk75]) was developed in the early to mid- 70 s.

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

COLLEGE OF ENGINEERING, NASHIK. LANGUAGE TRANSLATOR

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Preparing for the ACW Languages & Compilers

Yacc: A Syntactic Analysers Generator

Gechstudentszone.wordpress.com

An Introduction to LEX and YACC. SYSC Programming Languages

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

COMPILER CONSTRUCTION LAB 2 THE SYMBOL TABLE. Tutorial 2 LABS. PHASES OF A COMPILER Source Program. Lab 2 Symbol table

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

Handout 7, Lex (5/30/2001)

Recognition of Tokens

Compil M1 : Front-End

Lexical and Syntax Analysis

Ulex: A Lexical Analyzer Generator for Unicon

Modern Compiler Design: An approach to make Compiler Design a Significant Study for Students

As we have seen, token attribute values are supplied via yylval, as in. More on Yacc s value stack

A lexical analyzer generator for Standard ML. Version 1.6.0, October 1994

Generating a Lexical Analyzer Program

Writing a Lexical Analyzer in Haskell (part II)

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Transcription:

Lexical Analysis Implementing Scanners & LEX: A Lexical Analyzer Tool Copyright 2016, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California have explicit permission to make copies of these materials for their personal use. 1

Scanners and Parsers Source code Scanner tokens Parser IR Errors Scanner Maps stream of characters into words Basic unit of syntax x = x + y ; becomes <id,x> <eq,=> <id,x> <pl,+> <id,y> <sc,; > Characters that form a word are its lexeme Its part of speech (or syntactic category) is called its token type Scanner discards white space & (often) comments 2

Scanners Construction Straightforward Implementation Input: REs Construct an NFA for each RE Combine NFAs using ε-transitions (alternation in Thompson s construction) Create a DFA using the subset construction Minimize the resulting DFA Create an executable code for the DFA 3

Table-driven Scanners [0..9] r [0..9] S 0 S 1 S 2 char nextchar() state s 0 while (char eof) state δ(state,char) char nextchar() end while δ s 0 s 1 s 2 r 0..9 Other s 1 s e s e s e s 2 s e s e s 2 s e if state S F s e s e s e s e then report acceptance else report failure 4

Direct-Coded Scanners [0..9] r [0..9] S 0 S 1 S 2 goto s 0 s 0 : char nextchar() if(char = r ) then goto s 1 else goto s e s 1 : char nextchar() if( 0 char 9 ) s 2 : char nextchar() if( 0 char 9 ) then goto s 2 elseif (char = eof) then report acceptance else goto s e s e : report failure then goto s 2 else goto s e 5

Scanners in Practice Uses automated tools to construct a Lexical Analyzer Given a set of tokens defined using regular expressions Tools will generate a character stream tokenizer by constructing a DFA Common Scanner Generator Tools lex in C JLex in java 6

Input: LEX: A Lexical Analyzer Tool File with suffix.l Three (3) regions delimited by the marker %% corresponding to: Declarations: macros and language declaration between %{ e %} Rules: regular expression and corresponding action between brackets { } to be executed when the analyzer matches the regular expression. Code: support functions Output: Generates a C function named yylex in the file lex.yy.c, with the command lex file.l and compiled using -ll switch flex uses the library -lfl 7

LEX: A Lexical Analyzer Tool LEX Specification LEX Compiler Transition Table file.l lex file.l yy.lex.c input buffer matching patterns FA simulator semantic actions output file Transition Table yyin yylex yyout 8

LEX: Regular Expressions x the character x "x" the character x, even if it is a special character \x the character x, even if it is a special character x$ the character x at the end of a line ^x the character x at the beginning of a line x? Zero or one occurrence of x x+ One or more occurrences of x x* Zero or more occurrences of x xy the character x followed by the character y x y the character x or the character y [az] the character a or the character z [a-z] from character a to character z [^a-z] Any character except from a to z x{n} n occurrences of x x{m,n} between m and n occurrences of x x/y x if followed by y (only x is part of the pattern). Any character except \n (x) same as x, parentheses change operator priority <<EOF>> end of file 9

LEX: Regular Expression Short-hands Allow Regular Expressions Simplification In the declaration region; consists of an identifier followed by a regular expression Use between brackets [ ] in subsequent regular expressions DIGIT[0-9] INT {DIGIT}+ EXP [Ee][+-]?{INT} REAL {INT}"."{INT}({EXP})? 10

LEX: Example lex.l %{ /* definitions of constants */ #define LT 128 #define LE 129 #define EQ 130 #define NE 131 #define GT 132 #define GE 133 #define IF 134 #define THEN 135 #define ELSE 136 #define ID 137 #define NUM 138 #define RELOP 139 void install_id() { printf(" Installing an identifier %s\n", yytext); } void install_num() { printf(" Installing a number %s\n", yytext); } %} /* regular definitions */ delim [ \t\n] ws letter {delim}+ [A-Za-z] digit [0-9] id number {letter}({letter} {digit})* {digit}+(\.{digit}+)?(e[+-]{digit}+)? %% {ws} {/* no action and no return */} if { return(if); } then { return(then); } else { return(else); } {id} { install_id(); return(id); } {number} { install_num(); return(num); } < { return(lt); } <= { return(le); } = { return(eq); } <> { return(ne); } > { return(gt); } >= {return(ge); } %% int yywrap() { return 1; } int main(){ } int n, tok; n = 0; while(1) { tok = yylex(); if(tok == 0) break; printf(" token matched is %d\n",tok); n++; } printf(" number of tokens matched is %d\n",n); return 0; 11

LEX: Executing yylex %> more in.txt a = 4 b = 12.5 if a <> 0 then a = 1 %>./a.out < in.txt Installing an identifier a token matched is 137 token matched is 130 Installing a number 4 token matched is 138 Installing an identifier b token matched is 137 token matched is 130 Installing a number 12.5 token matched is 138 token matched is 134 Installing an identifier a token matched is 137 token matched is 131 Installing a number 0 token matched is 138 token matched is 135 Installing an identifier a token matched is 137 token matched is 130 Installing a number 1 token matched is 138 number of tokens matched is 14 12

LEX: Handling of Regular Expressions Disambiguating in case of multiple regular expression matching: Longest input sequence is selected If same length, first in specification is selected Note: Input sequence length not length of regular expression length: %% dependent [a-z]+ printf(found 'dependent'\n"); ECHO; 13

LEX: Context Sensitive Rules Set of regular expressions activated by a BEGIN command and identified by %s in the declaration region The regular expression in each set are preceded by the identifier between < and >. The INITIAL identifier indicates the global rules permanently active. At any given time only the global rules and at most one of the set of context sensitive rules can be active by the invocation of the BEGIN command. 14

LEX: Global Variables (C lang.) char yytext[], string containing matched input text int yyleng, length of string containing matched input text int yylineno, line number of input file where the last character of the macthed input text lies. With flex use the option -l or include %option yylineno or %option lex-compat in input file.l FILE *yyin, file pointer where from the input characters are read FILE *yyout, file pointer where to the output text is written using the ECHO macro or other programmer defined functions YYSTYPE yylval, internal variable used to carry over the values between the lexical analyzer and other tools, namely YACC 15

LEX: Auxiliary Functions (C lang.) int yylex(void), lex generated function that implements the lexical analysis. Returns a numeric value identifying the matched lexical element (i.e. as identified by y.tab.h) or 0 (zero) when EOF is reached int yywrap(void), programmer defined function invoked when the EOF of the current input file is reached. In the absence of more additional input files this function returns 1 (one). It returns 0 (zero) otherwise and the yyin variable should be pointing to the new input file void yymore(void), function invoked in a semantic action allowing the matched text to be saved and concatenated to the following matched text void yyless(int n), function invoked in a semantic action allowing the n characters of yytext to be reconsidered for processing. 16

LEX: Predefined Macros (C lang.) ECHO: outputs the matched text, that is yytext, for a given regular expression, or when aggregated using other rules by the invocation of the yymore() function Is defined as #define ECHO fwrite(yytext, yyleng, 1, yyout) REJECT: after processing of the semantic action that includes a REJECT invocation, the processing restarts at the beginning of the matched text but this time ignoring it. What is the point?! 17

Lex and Flex Lex compatible mode: use flex -l or include %option lexcompat in declaration region of the input specification Access to yylineno: use lex compatibility mode or include %option yylineno in the declaration region Context Sensitive Rules: only the currently active context sensitive rules are active in addition to the global rules using %x instead of %s Debug mode: use flex -d and set the yy_flex_debug to 1 18

Lex: Processing Efficiency Processing time of FA proportional to the size of the input file and not on the number of regular expressions used (although the number of expressions may impact number of internal states and therefore in space) Use as much as possible regular expressions and as little as possible action processing in C Use regular expressions more specific at the beginning of the LEX file specification (keyword for example) and more generic regular expressions at the end (identifiers, for example) 19

Lex: Example lex.l with Context Rules %{ #include <stdio.h> #include <string.h> int nest = 0; %} %s COMMENT ACTION HREF [Hh][Rr][Ee][Ff][ \t\n]*=[ \t\n]* %% "<!--" {nest++; BEGIN COMMENT; printf(" begin comment\n"); } <COMMENT>"<" ; <COMMENT>"-->" {if(--nest==0) BEGIN INITIAL; printf(" end comment\n"); } "<" { BEGIN ACTION; printf(" begin action\n"); } <ACTION>{HREF}\"(\"\" [^\"])*\" { printf("(%s)\n",index(yytext,'"')); } <ACTION>">" { BEGIN INITIAL; printf(" end action\n"); }. \n ; %% int yywrap(){ return 1; } more test1.txt <!-- sdsdswdsdsdsdssds --> <HREF="string4"> <!-- <HREF="string5"> sdsdswdsdsdsdssds --> < HREF="string3"> <HREF="string2"> <HREF = "string1">./a.out < test1.txt begin comment end comment begin action ("string4") end action int main(int argc, char **argv){ int tok; int n; n = 0; while(1) { tok = yylex(); if(tok == 0) break; printf(" token matched is %d\n",tok); n++; } printf(" number of tokens matched is %d\n",n); return 0; } begin comment end comment begin action ("string3") end action begin action ("string2") end action begin action ("string1") end action 20

Summary Scanner Construction Table-driven Direct-coded Lex: A Lexical Analyzer Tool 21