Chapter 3 -- Scanner (Lexical Analyzer)

Job: Translate the input character stream into a token stream (terminals)
    Most programs with structured input have to deal with this problem
Need a precise definition of tokens
    Strings: 'xx''x', "yy\"y"
    Reals: 0.3 vs. .3
    Others: 1..10
Regular Expressions -- simple patterns
    (a|b|c)(a|b|c|_)*
    [0-9]+
Deterministic Finite Automata -- Recognizers
Scanner Generators
    Input: Regular Expressions
    Output: DFA in a program
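To make the "DFA in a program" idea concrete, here is a minimal hand-written sketch of a recognizer for the identifier pattern (a|b|c)(a|b|c|_)*. The names `step` and `matches_id` and the three-state layout are assumptions made for illustration; a scanner generator would normally emit a transition table instead.

```c
#include <stddef.h>

/* Hypothetical state names for a DFA recognizing (a|b|c)(a|b|c|_)* */
enum state { START, IN_ID, DEAD };

/* One DFA transition: the next state for state s on input character c. */
static enum state step(enum state s, char c)
{
    int letter = (c == 'a' || c == 'b' || c == 'c');

    switch (s) {
    case START: return letter ? IN_ID : DEAD;
    case IN_ID: return (letter || c == '_') ? IN_ID : DEAD;
    default:    return DEAD;            /* DEAD is a trap state */
    }
}

/* Returns 1 if the whole string matches the pattern, otherwise 0. */
int matches_id(const char *s)
{
    enum state st = START;

    for (; *s != '\0'; s++)
        st = step(st, *s);
    return st == IN_ID;                 /* IN_ID is the only accepting state */
}
```

Calling `matches_id("ab_c")` accepts, while `matches_id("_a")` rejects because the first character must be one of a, b, or c.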

LEX & FLEX (One of many)
Input: description file
    Regular expressions with associated "actions"
Lex: the original, standard AT&T program
Flex: open source re-implementation with added features
File format:
    << definitions and %{ initial code %} >>
    %%
    << rules and associated actions >>
    %%
    << extra code >>

Simple Flex Example
%{
#include <stdio.h>
int num_lines = 0, num_chars = 0;
%}
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main(void)
{
    yylex();
    printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
    return 0;
}
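For comparison, the behavior of those two rules can be sketched by hand in plain C. This is only an illustration of what the generated scanner computes, not the code flex emits; the function name `count_text` and the choice to read from a string instead of standard input are assumptions.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the two rules "\n" and "." of the example. */
void count_text(const char *text, int *num_lines, int *num_chars)
{
    *num_lines = 0;
    *num_chars = 0;
    for (size_t i = 0; text[i] != '\0'; i++) {
        if (text[i] == '\n')
            ++*num_lines;       /* rule for \n also bumps the line count */
        ++*num_chars;           /* both rules count the character */
    }
}
```

Every input character matches exactly one of the two rules, which is why the character count is incremented unconditionally.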

ATL/0 scanner in lex (definitions)
%{
/* scan.l: An ATL/0 scanner. 1/13/94 */
#include "defs.h"
#include "global.h"
#include "parse.h"
#define YY_NO_UNPUT
%}

parse.h (created by yacc)
#define END 257
#define READ 258
#define BEGINSY 259
#define WRITE 260
#define INTEGER 261
#define PROGRAM 262
#define WRITELN 263
#define VARIABLE 264
#define ASSIGN 265
#define CONST 266
#define ID 267
typedef union {
    int i_value;
    char *s_value;
    syntax_node *node_ptr;
} YYSTYPE;
extern YYSTYPE yylval;
Note: BEGIN is a flex reserved name, so BEGINSY stands for BEGIN in the ATL/1 source.

ATL/0 scanner in lex (rules)
[ \t]+  { /* ignore spaces and tabs */
          if (list_src) ECHO; }
\n      { if (list_src) ECHO;
          line_no++;
          dump_errors ();
          if (list_src) fprintf (yyout, "%5d: ", line_no); }
"+"|"-"|";"|"("|")"|","|"."|":"  { if (list_src) ECHO;
                                   return((int)yytext[0]); }

ATL/0 scanner in lex (rules - page 2)
end       { if (list_src) ECHO; return(END); }
read      { if (list_src) ECHO; return(READ); }
begin     { if (list_src) ECHO; return(BEGINSY); }
write     { if (list_src) ECHO; return(WRITE); }
integer   { if (list_src) ECHO; return(INTEGER); }
program   { if (list_src) ECHO; return(PROGRAM); }
writeln   { if (list_src) ECHO; return(WRITELN); }
variable  { if (list_src) ECHO; return(VARIABLE); }

ATL/0 scanner in lex (rules - page 3)
\<--             { if (list_src) ECHO; return(ASSIGN); }
[a-z][a-z0-9_]*  { if (list_src) ECHO;
                   yylval.s_value = strdup(yytext);
                   return(ID); }
[0-9]+           { if (list_src) ECHO;
                   yylval.s_value = strdup(yytext);
                   return(CONST); }

ATL/0 scanner in lex (rules - page 4)
.  { if (list_src) ECHO;
     if (yytext[0] < ' ')          /* control character: show as ^X */
         yyerror ("illegal character: ^%c", yytext[0] + '@');
     else if (yytext[0] > '~')     /* beyond printable ASCII: show the code */
         yyerror ("illegal character: \%3d", (int) yytext[0]);
     else
         yyerror ("illegal character: %s", yytext); }

ATL/0 scanner in lex (subroutines)
#ifdef TESTSCAN
YYSTYPE yylval;
int yyparse()
{
    int val;
    line_no = 1;
    list_src = 0;
    while ((val = yylex()) != 0)
        printf ("val = %d yytext = %s \n", val, yytext);
    return 0;
}
#endif
(Use "make testscan" in the atl1 directory.)

More about FLEX patterns
Flex matches the longest sequence of characters that it can.
    x          match the character x
    .          any character (byte) except newline
    [xyz]      a "character class"; in this case, the pattern matches either an x, a y, or a z
    [abj-oz]   a "character class" with a range in it; matches an a, a b, any letter from j through o, or a z
    [^A-Z]     a "negated character class", i.e., any character but those in the class; in this case, any character EXCEPT an uppercase letter
    [^A-Z\n]   any character EXCEPT an uppercase letter or a newline

More about FLEX patterns (page 2)
    r*             zero or more r's, where r is any regular expression
    r+             one or more r's
    r?             zero or one r's (that is, "an optional r")
    r{2,5}         anywhere from two to five r's
    r{2,}          two or more r's
    r{4}           exactly 4 r's
    {name}         the expansion of the "name" definition (definitions are explained a couple of pages later)
    "[xyz]\"foo"   the literal string: [xyz]"foo
    \X             if X is an a, b, f, n, r, t, or v, then the ANSI-C interpretation of \X; otherwise, a literal X (used to escape operators such as *)

More about FLEX patterns (page 3)
    \0     a NUL character (ASCII code 0)
    \123   the character with octal value 123
    \x2a   the character with hexadecimal value 2a
    (r)    match an r; parentheses are used to override precedence
    rs     the regular expression r followed by the regular expression s; called "concatenation"
    r|s    either an r or an s
Precedence (highest to lowest):
    groups -- [xyz]
    *, +, ?, r{..} -- r*
    concatenation -- rs
    union -- r|s
Example: foo|ba[rz]* => (foo)|(ba(([rz])*))

More about FLEX patterns (page 4)
    r/s       an r, but only if it is followed by an s
    ^r        an r, but only at the beginning of a line
    r$        an r, but only at the end of a line (i.e., just before a newline); equivalent to "r/\n"
    <<EOF>>   matches the end of the file (Flex only)
Definitions (in the definition section):
    DIGIT [0-9]
Use (in regular expressions):
    {DIGIT}+("."{DIGIT}+)?

Start States
Method to allow only a few rules to apply at a time
%x xyz      /* Exclusive start state, declaration part */
Use in the rule part:
xyz         { BEGIN(xyz); }
<xyz>r1     { action... }
<xyz>r2     { action... }
<xyz>r3     { BEGIN(INITIAL); }   /* Revert to using the initial start state */
INITIAL has the value 0.

Start State Example -- C comments
%x comment
%%
"/*"                    { BEGIN(comment); }
<comment>[^*\n]         /* eat it! */
<comment>"*"+[^*/\n]    /* eat it! */
<comment>\n             { line_no++; }
<comment>"*"+"/"        { BEGIN(INITIAL); }
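The same two-mode idea can be written by hand. The sketch below is an assumption-laden illustration, not flex output: the enum values, the function name `strip_comments`, and the choice to copy non-comment text into a caller-supplied buffer are all hypothetical. The mode variable plays the role of the start state.

```c
/* Hypothetical modes standing in for the INITIAL and comment start states. */
enum mode { INITIAL_MODE, COMMENT_MODE };

/*
 * Copies src to dst with C comments removed and returns the number of
 * newlines seen inside comments. dst must be at least as large as src.
 */
int strip_comments(const char *src, char *dst)
{
    enum mode m = INITIAL_MODE;
    int comment_newlines = 0;

    while (*src != '\0') {
        if (m == INITIAL_MODE) {
            if (src[0] == '/' && src[1] == '*') {
                m = COMMENT_MODE;           /* like BEGIN(comment) */
                src += 2;
            } else {
                *dst++ = *src++;            /* ordinary text is kept */
            }
        } else {
            if (src[0] == '*' && src[1] == '/') {
                m = INITIAL_MODE;           /* like BEGIN(INITIAL) */
                src += 2;
            } else {
                if (*src == '\n')
                    comment_newlines++;     /* keep line numbers accurate */
                src++;                      /* eat it */
            }
        }
    }
    *dst = '\0';
    return comment_newlines;
}
```

An unterminated comment simply consumes the rest of the input here; a real scanner would likely report an error at end of file.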

Running lex / flex
File extension is usually .l
    "flex scan.l"            => lex.yy.c
    "flex -oscan.c scan.l"   => scan.c
With yacc...
    "yacc -d parse.y"        => y.tab.c, y.tab.h
    y.tab.h - token definitions for the scanner
    y.tab.c - C code for the parser that calls the scanner

Other considerations...
Reserved words
    Reserved vs. Restricted
    Part of IDs and then use table lookup?
Compiler Control, e.g. pragmas
    Conditional Compilation -- C uses #ifdef
    Source Listings -- not as often now
Symbol Table entry
    Some scanners enter names in a table
    String tables

Other considerations... (page 2)
Inclusion of other files? (#include "file")
Multi-character lookahead
    DO 10 I = 1,100  vs.  DO 10 I = 1.100  (Fortran)
    arrayname'length  vs.  'a'  (Ada)
Non-regular structures
    ATL/1 nested comments -- use variables in the scanner!
    flex -- use different start states
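Nested comments are not a regular language, so a plain pattern cannot match them; a counter variable can. The following is a minimal sketch, assuming ATL/1-style (* ... *) delimiters; the function name `skip_nested_comment` and its pointer-returning interface are hypothetical.

```c
#include <stddef.h>

/*
 * p must point at an opening "(*". Returns a pointer just past the
 * MATCHING "*)", or NULL if p does not start a comment or the comment
 * is never closed.
 */
const char *skip_nested_comment(const char *p)
{
    int depth = 0;                      /* the "variable in the scanner" */

    if (!(p[0] == '(' && p[1] == '*'))
        return NULL;

    while (*p != '\0') {
        if (p[0] == '(' && p[1] == '*') {
            depth++;                    /* one level deeper */
            p += 2;
        } else if (p[0] == '*' && p[1] == ')') {
            depth--;                    /* one level closed */
            p += 2;
            if (depth == 0)
                return p;               /* matching close found */
        } else {
            p++;
        }
    }
    return NULL;                        /* ran out of input */
}
```

In a flex scanner the same counter would typically live in a global and be updated from actions inside an exclusive start state.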

Lexical Errors
Delete all characters -- start again?
Delete first character -- start again?
How about <- in ATL? (This is about matching... not errors!)
How about beg#in?
Flex:
    .    { Generate an error... "eat 1 char" }

ATL/1 Scanner notes: (ATL1.notes)
1) Comments start with (* and end at the MATCHING *).
2) An IDENTIFIER is a string from [A-Za-z][A-Za-z0-9_]*
   Case is important. "Aname" is different from "AName".
3) RESERVED WORDS look like IDs. All capitalizations of a reserved word are the same reserved word. For example, BeGiN, begin, Begin and so forth are all the same reserved word, BEGIN. The reserved words are:
   DO IF IS OF OR AND END NOT ELSE THEN TYPE ARRAY BEGIN ELSIF UNTIL VALUE WHILE REPEAT RETURN RETURNS PROGRAM VARIABLE FUNCTION PROCEDURE
   (Note: The word BEGIN is reserved in flex. Therefore, yacc and lex use BEGINSY to refer to the BEGIN reserved word in ATL/1.)
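One way to handle reserved words that "look like IDs" is to scan every word with the identifier pattern and then consult a table, ignoring case. The sketch below assumes that approach; `lookup_reserved`, `same_ignoring_case`, and the use of the table index as a stand-in for a token code are all hypothetical choices, not part of the ATL/1 materials.

```c
#include <ctype.h>
#include <stddef.h>

/* The ATL/1 reserved words, in the order the notes list them. */
static const char *const reserved[] = {
    "DO", "IF", "IS", "OF", "OR", "AND", "END", "NOT", "ELSE", "THEN",
    "TYPE", "ARRAY", "BEGIN", "ELSIF", "UNTIL", "VALUE", "WHILE",
    "REPEAT", "RETURN", "RETURNS", "PROGRAM", "VARIABLE", "FUNCTION",
    "PROCEDURE"
};

/* Returns 1 if a and b are equal when letter case is ignored. */
static int same_ignoring_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Returns the table index of a reserved word, or -1 for an ordinary ID. */
int lookup_reserved(const char *lexeme)
{
    size_t n = sizeof reserved / sizeof reserved[0];

    for (size_t i = 0; i < n; i++)
        if (same_ignoring_case(lexeme, reserved[i]))
            return (int)i;
    return -1;
}
```

With this table, `lookup_reserved("BeGiN")` and `lookup_reserved("begin")` yield the same index, while `lookup_reserved("Aname")` returns -1 and the lexeme is treated as a case-sensitive identifier.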

ATL/1 Scanner notes (page 2):
4) A STRING starts with a double quote (") and ends with a double quote. Strings may not cross line breaks. Strings may have "quoted" characters in the string. They are \b for backspace, \f for formfeed, \n for newline, \r for carriage return, \t for tab, \" for the double quote character and \\ for the backslash character. An arbitrary character can be specified by \nnn notation, where nnn is a decimal value less than 256.
   (Your scanner does not have to translate strings; the strings can be copied directly to the assembler. The hcas assembler uses the exact same escape sequences. The primary recognition issue for your scanner is the \".)
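The recognition issue with \" can be illustrated with a small hand-written routine. This is a sketch under assumptions: the name `scan_string` and the convention of returning the literal's length (or -1 on failure) are hypothetical, and the routine does not validate which escape letters follow the backslash.

```c
/*
 * p must point at the opening double quote. Returns the length of the
 * string literal including both quotes, or -1 if the line or the input
 * ends before the closing quote.
 */
int scan_string(const char *p)
{
    int i = 1;

    if (p[0] != '"')
        return -1;

    while (p[i] != '\0' && p[i] != '\n') {
        if (p[i] == '"')
            return i + 1;               /* unescaped quote closes the string */
        if (p[i] == '\\' && p[i + 1] != '\0' && p[i + 1] != '\n')
            i += 2;                     /* skip the escaped character, e.g. \" */
        else
            i++;
    }
    return -1;                          /* strings may not cross line breaks */
}
```

Because the backslash and the character after it are consumed together, an escaped quote never terminates the literal.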

ATL/1 Scanner notes (page 3):
5) An INT_CONST is a string of digits ([0-9]+).
6) A MUL_OP is one of: * / mod
   (mod is like a reserved word even though it is not returned as an ID token. Any capitalization of mod is still mod.)
7) A REL_OP is one of: = != < <= > >=
8) An ASSIGN is: <--
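Since < is a prefix of both <= and <--, the scanner has to prefer the longest operator that matches, just as flex does. The sketch below shows that rule for the REL_OP and ASSIGN tokens only; the `struct op` table, the function name `longest_operator`, and the use of token names as strings are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical operator table: lexeme text and the token it belongs to. */
struct op {
    const char *text;
    const char *token;
};

static const struct op ops[] = {
    { "<--", "ASSIGN" },
    { "<=",  "REL_OP" }, { "<", "REL_OP" },
    { ">=",  "REL_OP" }, { ">", "REL_OP" },
    { "=",   "REL_OP" }, { "!=", "REL_OP" }
};

/*
 * Returns the length of the longest operator that starts at p (0 if none)
 * and stores the token name through *token when a match is found.
 */
size_t longest_operator(const char *p, const char **token)
{
    size_t best = 0;

    for (size_t i = 0; i < sizeof ops / sizeof ops[0]; i++) {
        size_t len = strlen(ops[i].text);

        if (len > best && strncmp(p, ops[i].text, len) == 0) {
            best = len;                 /* keep only strictly longer matches */
            *token = ops[i].token;
        }
    }
    return best;
}
```

For the input `<-x` only the one-character `<` matches, so the scanner returns a REL_OP and leaves `-` for the next token, which is a matching decision and not a lexical error.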