The structure of a compiler

Similar documents
Marcello Bersani Ed. 22, via Golgi 42, 3 piano 3769

Lexical and Syntax Analysis

Chapter 3 -- Scanner (Lexical Analyzer)

LEX/Flex Scanner Generator

An introduction to Flex

TDDD55- Compilers and Interpreters Lesson 2

Using Lex or Flex. Prof. James L. Frankel Harvard University

CS4850 SummerII Lex Primer. Usage Paradigm of Lex. Lex is a tool for creating lexical analyzers. Lexical analyzers tokenize input streams.

CSC 467 Lecture 3: Regular Expressions

EXPERIMENT NO : M/C Lenovo Think center M700 Ci3,6100,6th Gen. H81, 4GB RAM,500GB HDD

Flex and lexical analysis

EXPERIMENT NO : M/C Lenovo Think center M700 Ci3,6100,6th Gen. H81, 4GB RAM,500GB HDD

Edited by Himanshu Mittal. Lexical Analysis Phase

Module 8 - Lexical Analyzer Generator. 8.1 Need for a Tool. 8.2 Lexical Analyzer Generator Tool

Lecture Outline. COMP-421 Compiler Design. What is Lex? Lex Specification. ! Lexical Analyzer Lex. ! Lex Examples. Presented by Dr Ioanna Dionysiou

Big Picture: Compilation Process. CSCI: 4500/6500 Programming Languages. Big Picture: Compilation Process. Big Picture: Compilation Process.

Ray Pereda Unicon Technical Report UTR-02. February 25, Abstract

CS143 Handout 04 Summer 2011 June 22, 2011 flex In A Nutshell

Big Picture: Compilation Process. CSCI: 4500/6500 Programming Languages. Big Picture: Compilation Process. Big Picture: Compilation Process

Automatic Scanning and Parsing using LEX and YACC

Ulex: A Lexical Analyzer Generator for Unicon

FLEX(1) FLEX(1) FLEX(1) FLEX(1) directive. flex fast lexical analyzer generator

Handout 7, Lex (5/30/2001)

Lexical Analysis. Implementing Scanners & LEX: A Lexical Analyzer Tool

Figure 2.1: Role of Lexical Analyzer

Flex and lexical analysis. October 25, 2016

File I/O in Flex Scanners

Lex & Yacc (GNU distribution - flex & bison) Jeonghwan Park

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Type 3 languages. Regular grammars Finite automata. Regular expressions. Deterministic Nondeterministic. a, a, ε, E 1.E 2, E 1 E 2, E 1*, (E 1 )

Flex, version 2.5. A fast scanner generator Edition 2.5, March Vern Paxson

Lex (Lesk & Schmidt[Lesk75]) was developed in the early to mid- 70 s.

Introduction to Lex & Yacc. (flex & bison)

Gechstudentszone.wordpress.com

UNIT - 7 LEX AND YACC - 1

Version 2.4 November

Lexical and Parser Tools

Lexical Analyzer Scanner

Flex, version A fast scanner generator Edition , 27 March Vern Paxson W. L. Estes John Millaway

Lexical Analyzer Scanner

Lexical Analysis with Flex

A lexical analyzer generator for Standard ML. Version 1.6.0, October 1994

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Yacc: A Syntactic Analysers Generator

Flex, version A fast scanner generator Edition , 27 March Vern Paxson W. L. Estes John Millaway

Parsing and Pattern Recognition

An Introduction to LEX and YACC. SYSC Programming Languages

CSE302: Compiler Design

Lesson 10. CDT301 Compiler Theory, Spring 2011 Teacher: Linus Källberg

flex is not a bad tool to use for doing modest text transformations and for programs that collect statistics on input.

Structure of Programming Languages Lecture 3

PRACTICAL CLASS: Flex & Bison

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

Lex & Yacc. By H. Altay Güvenir. A compiler or an interpreter performs its task in 3 stages:

Lex & Yacc. by H. Altay Güvenir. A compiler or an interpreter performs its task in 3 stages:

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Chapter 3 Lexical Analysis

COMPILER CONSTRUCTION Seminar 01 TDDB

Compiler course. Chapter 3 Lexical Analysis

Flex, version 2.5. A fast scanner generator. Edition 2.5, March Vern Paxson

Full file at

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

Program Development Tools. Lexical Analyzers. Lexical Analysis Terms. Attributes for Tokens

CMSC445 Compiler design Blaheta. Project 2: Lexer. Due: 15 February 2012

JFlex Regular Expressions

Advances in Compilers

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

COMPILER CONSTRUCTION LAB 2 THE SYMBOL TABLE. Tutorial 2 LABS. PHASES OF A COMPILER Source Program. Lab 2 Symbol table

Compiler Construction

Compiler Construction

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

Flex, version 2.5. A fast scanner generator. Edition 2.5, March Vern Paxson

Using an LALR(1) Parser Generator

Monday, August 26, 13. Scanners

I. OVERVIEW 1 II. INTRODUCTION 3 III. OPERATING PROCEDURE 5 IV. PCLEX 10 V. PCYACC 21. Table of Contents

Applications of Context-Free Grammars (CFG)

Syntax Analysis Part IV

Wednesday, September 3, 14. Scanners

Introduction to Yacc. General Description Input file Output files Parsing conflicts Pseudovariables Examples. Principles of Compilers - 16/03/2006

A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer.

JFlex. Lecture 16 Section 3.5, JFlex Manual. Robb T. Koether. Hampden-Sydney College. Mon, Feb 23, 2015

The Structure of a Syntax-Directed Compiler

Languages and Compilers

POLITECNICO DI TORINO. Formal Languages and Compilers. Laboratory N 1. Laboratory N 1. Languages?

Formal Languages and Compilers

Lexical analysis. Syntactical analysis. Semantical analysis. Intermediate code generation. Optimization. Code generation. Target specific optimization

(F)lex & Bison/Yacc. Language Tools for C/C++ CS 550 Programming Languages. Alexander Gutierrez

Group A Assignment 3(2)

Lex Source The general format of Lex source is:

Data Types and Variables in C language

Compiler Construction Assignment 2 Fall 2009

LECTURE 11. Semantic Analysis and Yacc

Compil M1 : Front-End

L L G E N. Generator of syntax analyzier (parser)

Work relative to other classes

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

Syntax Analysis Part VIII

Programming for Engineers Introduction to C

cmps104a 2002q4 Assignment 2 Lexical Analyzer page 1

Transcription:

The structure of a compiler Source code front-end Intermediate front-end representation compiler back-end machine code

Front-end & Back-end C front-end Pascal front-end C front-end Intel x86 back-end Motorola 68000 back-end The front-end abstracts from the hardware. The back-end abstracts from the high level language.

Front-end Structure Source code lexical analyzer tokenized front-end code front-end syntax analyzer any desired output

Lexical Relating to words or the vocabulary of a language as distinguished from its grammar and construction. Webster s Dictionary

Lexical Analysis It recognizes patterns in a stream of characters. A pattern represents a category of lexical elements, named tokens. Each token can have one or more attributes describing, for example, its position in the original text.

Example WORD SPACE Attributes Input stream a word, made by one or more alphabetical characters (in upper or lower case); a sequence of one or more blank spaces; characters constituting the token; position and length of the token expressed in number of characters; 1 2 3 4 5 6 7 1 A s i m p l 2 e e x a m p 3 l e The result of the lexical analysis over the above text follows: WORD: A, (1,1), 1 SPACE:, (1,2), 1 WORD: simple, (1,3), 6 SPACE:, (2,2), 1 WORD: example, (2,3), 7

The scanner A program that performs lexical analysis. Construction by hand is a tedious work. There are programs that generate scanners automatically.

flex: a scanner generator flex is a scanner generator, a complete rewriting of the AT&T Unix tool lex. You can find flex at the GNU site at the following address: www.gnu.org/software/flex/flex.html flex is free, and distributed under the terms of GNU General Public License (GPL). A useful book to understand flex is: lex & yacc, 2nd Edition by John Levine, Tony Mason & Doug Brown O Reilly

How flex Works Scanner description (.lex file) flex Scanner C source code (file lex.yy.c) gcc Input stream Scanner executable Tokenized output

%option noyywrap UPPER [A-Z] LOWER [a-z] BLANK [ ] TAB [\t] NEWLINE [\n] %% {UPPER} { printf("%c",tolower(*yytext)); /* replace uppercase letters with lowercase ones */} {NEWLINE} { printf(".\n"); /* replace newlines with a dot*/} {BLANK}+ { printf(" "); /* replace any number of spaces with a single space */} %% int main() { return yylex();}

%{ // A more complicated example #define max_col 7 %} %option noyywrap LETTER [a-za-z] SPACE [ ] %% %{ int col=0; int row=1; %} {LETTER}+ {printf("word: '%s', (%d,%d),%d\n", yytext, row, col+1, yyleng); col=((col+yyleng)<=max_col)? col+yyleng : row++,(col+yyleng)%max_col; } {SPACE}+ {printf("space: '%s', (%d,%d), %d\n", yytext, row, col+1, yyleng); col=((col+yyleng)<=max_col)? col+yyleng : row++,(col+yyleng)%max_col; } %% int main() { return yylex();}

The format of a flex input file definitions %% rules %% user code

The format of a flex input file (2) DEFINITIONS Name definitions example: LETTER [a-za-z] The notation [ ] represents a class of character. In the rules section, each occurrence of {LETTER} is substituted by ([a-za- Z]).

The format of a flex input file (3) RULES Rule example: {LETTER}+ {a block of C code} Pattern condition: a regular expression; Action: a fragment of C code to be executed each time a token in the input stream matches the associated pattern.

The format of a flex input file (4) USER CODE User C code is copied to the generated scanner source as is. This section can contain routines called by actions, or code which calls the scanner.

The format of a flex input file (3) ADDITIONAL CODE %{ %} It can be put in the definitions and in the rules sections. This code is copied into the generated scanner source code as is.

Regular Expression Rules R the regular expression R RS the concatenation of R and S R S either R or S R* zero or more R s R+ one or more R s R? zero or one R s R{m,n} a number of R s ranging from m to n R{n,} n or more times R s R{n} exactly n R s (R) (parentheses to override precedence) R/S R, but only if followed by S ^R R, but only at beginning of a line R$ R, but only at end of a line <s1>r R, but only in start condition s1 <s1,,sn>r R, in any of start conditions s1,,sn <*>R R, in any start condition, even an exclusive one

Regular Expression Rules x matches the character x. matches any character except newline [xyz] any character in the given set (x y z) [a-z] any character in the given range (a b z) [^A-Z] any character but those in the range {x} expansion of x s definition a literal string \x ANSI-C interpretation of \x, if any, otherwise the literal x (to escape operators) \0 the NUL character <<EOF>> the end-of-file

How the generated scanner works It reads the input stream, looking for strings that match any of its patterns. Longest matching rule: if more than one matching string is found, the rule that generates the longest one is selected. First Rule: if more than one string with the same length are found, the rule listed first in the rules section is selected;

How the generated scanner works (2) If no rules were found, the scanner performs the standard action: the next character in input is considered matched and it is copied to the output stream; then the scanner goes on. Once the right match is determined, the corresponding text is made available thru the global char * variable yytext, and its length is available in yyleng.

Rule actions Each rule can have its own action, specified as a block of C code. The default action is to discard matched text. A symbol in place of a block of C code instructs flex to use the same rule as the following pattern.

Special directives in actions ECHO copies yytext to the output. BEGIN sx: places the scanner in the corresponding start condition; REJECT chooses the next best matching rule; yymore()the next matched text is appended to yytext. yyless(n) sends back to the input stream all but the first n characters of the matched string.

Special directives in actions (2) unput(c)sends character c back to the input stream; WARNING: calls to unput(c) trash the contents of yytext; therefore contents of yytext must be copied before calling unput(c), if required. input(): consumes the next character in input.

The yylex() scanner function Default signature: int yylex() You can modify it by, e.g., as follows: #define YY_DECL float lexscan(float a) The yylex input is the global input stream yyin, which by default is assigned to stdin. The yylex output is the global output stream, yyout which by default is assigned to stdout.

Multiple Scanners Sometimes it is useful to have more than one scanner together. A classic example: comments in a C source code. /* comment body INITIAL COMMENT C source code */ EOF ERROR

Start conditions In flex it is possible to model this behavior as follows: %x COMMENT %option noyywrap %% <INITIAL>[^/]* ECHO; <INITIAL>"/"+[^*/]* ECHO; <INITIAL>"/*" BEGIN(COMMENT); <COMMENT>[^*]* <COMMENT>"*"+[^*/]* <COMMENT>"*"+"/" BEGIN(INITIAL); %% int main(){ return yylex(); }

Start conditions (2) A pattern preceded by an <s> start condition is active only when the scanner is in such a state. A start condition can be declared with %x (exclusive mode) or %s (inclusive mode). The special start condition <*> matches every start condition. The initial start condition is INITIAL. Start conditions are stored as integer values. The current start condition is stored in YY_START variable.

Example: %x COMMENT %option noyywrap SLASH [/] STAR [*] %% %{ int nesting_level=0; int comment_caller[10]; %} <INITIAL>[^/]* ECHO; <INITIAL>{SLASH}+[^*/]* ECHO; <INITIAL>{SLASH}{STAR} { comment_caller[nesting_level++]=yy_start; BEGIN(COMMENT); } <COMMENT>[^/*]* <COMMENT>{SLASH}+[^*/]* <COMMENT>{SLASH}{STAR} { comment_caller[nesting_level++]=yy_start; BEGIN(COMMENT); } <COMMENT>{STAR}+[^*/]* <COMMENT>{STAR}+{SLASH} BEGIN(comment_caller[--nesting_level]); %% int main() { return yylex(); }

Good regular expressions CONCISENESS %x COMMENT %option noyywrap %% <INITIAL>([^/]*("/"+[^*/])*)* ECHO; <INITIAL>"/*" BEGIN(COMMENT); <COMMENT>([^*]*("*"+[^*/])*)* <COMMENT>"*"+"/" BEGIN(INITIAL); %% void main(){ return yylex(); }

Good regular expressions (2) READABILITY NOT_SLASH [^/] NOT_STAR [^*] NOT_SLASH_STAR [^*/] SLASH [/] STAR [*] %% <INITIAL>{ ({NOT_SLASH}*({SLASH}+{NOT_SLASH_STAR})*)* ECHO; {SLASH}{STAR} BEGIN(COMMENT); } <COMMENT>{ ({NOT_STAR}*({STAR}+{NOT_SLASH_STAR})*)* {STAR}+{SLASH} BEGIN(INITIAL); }

Common pitfalls <COMMENT>[^*]* (rule 1) <COMMENT>"*"+[^/]* (rule 2) /* * what you want */ rule 2 matches * what you want * rule 1 matches / results: the end of comment is lost. <COMMENT>([^*]*("*"+[^*/]*)*)* (rule 1) /* * what you want */ rule 1 matches * what you want * rule 1 matches / results: the end of comment is lost.

Common pitfalls (2) <INITIAL> / * + BEGIN(COMMENT); (rule 1) <COMMENT>[^*]* (rule 2) /***************/ rule 1 matches /*************** rule 2 matches / results: the end of comment is lost.

Stacked start conditions %x COMMENT %option noyywrap %option stack SLASH [/] STAR [*] %% <INITIAL>{ [^/]* ECHO; {SLASH}+[^*/]* ECHO; {SLASH}{STAR} { yy_push_state(yy_start); BEGIN(COMMENT);} } <COMMENT>{ [^*/]* ; {SLASH}+[^*/]* ; {SLASH}{STAR} yy_push_state(yy_start); {STAR}+[^*/]* ; {STAR}+{SLASH} yy_pop_state(); } %% int main(int argc, char* argv[]){ argv++; argc--; yyin=fopen(argv[0],"r"); yylex(); }

Multiple input buffers It is sometimes useful to switch among multiple input buffers. A classical example: header files included in a C source code. YY_BUFFER_STATE yy_create_buffer(file * file, int size) void yy_switch_to_buffer(yy_buffer_state buffer) void yy_delete_buffer(yy_buffer_state buffer) YY_CURRENT_BUFFER

Other useful options -d enables the debugging mode; -s suppresses the default rule and raises an error whenever text cannot be matched by any rule; -v increases the output verbosity; %option yylineno an integer variable named yylineno stores the number of line currently being read.