Edited by Himanshu Mittal. Lexical Analysis Phase


Lexical Analyzer The main task of lexical analysis is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source file. Other tasks: stripping out comments and whitespace, and correlating error messages with the source program.

Role of Lexical Analyzer The parser issues a getNextToken command that causes the lexical analyzer to read input characters until it can identify the next lexeme, which it produces as the next token and returns to the parser. The stream of tokens is sent to the parser for syntax analysis.

Patterns Pattern: the definition used for recognizing tokens. Regular expressions are used for defining/specifying patterns. An NFA/DFA is used to implement the regular expressions.

Lexeme Lexeme: a sequence of characters (a substring) of the input, identified by a pattern (or regular expression) as a token. E.g., in the C statement int a = 0, the lexemes are int, a, =, and 0.

Token Class Token Class: a category to which a lexeme can belong. Some common token classes are Keyword, Identifier, Digit, Operator, and Literal. E.g., in the C statement int a = 0: int belongs to the keyword class, a to the identifier class, = to the operator class, and 0 to the digit class.

Another Example In the C statement char v = "hello" //v is a variable: char belongs to the keyword class, v to the identifier class, = to the operator class, "hello" to the literal class, and //v is a variable to the comment class.

Token Token: the symbol used for representing a lexeme. The representation form for tokens can vary. Generally, a token is represented as a pair of token class and lexeme: <token_class, lexeme>. E.g., the token for int is <keyword, int>. Note: patterns for tokens are specified through regular expressions. Recognition of tokens is done through finite automata (NFA/DFA).

Lex Tool A tool for constructing lexical analyzers from special-purpose notations based on regular expressions. It is widely used to specify lexical analyzers for a variety of languages, and is freely available on Unix systems.

Lex Tool A Lex program is a file with extension .l that contains regular expressions, together with the actions to be taken when each expression is matched. The Lex compiler produces an output file, usually called lex.yy.c, that contains C code defining a procedure yylex(), a table-driven implementation of a DFA corresponding to the regular expressions in the Lex file, which operates like a getToken procedure. The lex.yy.c file is then compiled with a C compiler and linked to a main program to get a running program.

Lex Process Create a file, named filename.l, that contains the specifications/regular expressions. The Lex compiler processes filename.l and produces the lex.yy.c file. The C compiler turns the lex.yy.c file into an a.out file. Steps to execute a Lex file on a Unix terminal: lex filename.l; gcc lex.yy.c -ll (-ll means link with the Lex library; use -lfl if using Flex); ./a.out. Note: lex.yy.c contains a function yylex() which does the actual lexical analysis.

Creating a Lexical Analyzer with LEX (figure): Lex source program -> LEX compiler -> lex.yy.c; lex.yy.c -> C compiler -> a.out; input stream -> a.out -> sequence of tokens.

Lex File Format
<Definitions>          (#includes, #defines, regular expression definitions)
%%
<Rules>                (pattern/action pairs: {pattern1} {action1}, {pattern2} {action2}, ...)
%%
<Supplementary code>   (additional code; not always needed)

Eg: Simple Lex Program A small Lex program that prints everything entered as input back as output. File name: scan.l

%%
.    ECHO;
\n   ECHO;
%%
main() { yylex(); }

We get this behavior by default in Lex! This form reads from stdin; to terminate, type Ctrl-D. Put this code in a file called scan.l. Run Lex: lex scan.l. Compile: gcc lex.yy.c -ll. Run by typing ./a.out or ./a.out < somefile.txt.

Another Example

%{
#include <stdio.h>
%}
digit  [0-9]+
letter [a-zA-Z]+
id     {letter}({letter}|{digit})*
%%
{id}    { printf("Found identifier %s", yytext); }
{digit} { printf("Found digit %s", yytext); }
%%

The %{ ... %} section shows how to include C code. yytext is an internal variable containing the text of the matched word.

Eg: To count the variables in an input string

%{
#include <stdio.h>
int count = 0;
%}
digit  [0-9]+
letter [a-zA-Z]+
id     {letter}({letter}|{digit})*
%%
{id}   { count++; }
%%
int main() {
    yylex();
    printf("The no. of variables in string: %d", count);
    return 0;
}

Main Points Text that is not matched is echoed as read; thus, there is an implied ECHO. If you don't specify a main(), you get one for free! Lex patterns only match a given input character or string once. Lex executes the action for the longest possible match for the current input. If two matches are of the same length, Lex executes the action of the pattern with higher priority (the one listed first).

Example: AAA { printf("<found 3 A's>"); } AA { printf("<found 2 A's>"); } Given the input AAAAAAAA, this will print: <found 3 A's><found 3 A's><found 2 A's>. The scanning continues unless a value is returned!

Lex Predefined Variables yytext --> a string containing the matched lexeme. yyleng --> the length of the matched lexeme. yyin --> the input stream pointer; the default input of the default main() is stdin. yyout --> the output stream pointer; the default output of the default main() is stdout. E.g., ./a.out < inputfile > outfile. Example rules:
[a-z]+      printf("%s", yytext);
[a-z]+      ECHO;
[a-zA-Z]+   { words++; chars += yyleng; }
PLLab, NTHU, CS2403 Programming Languages

Lex Library Routines yylex(): the default main() contains a call to yylex(). yymore(): causes the next matched text to be appended to the current yytext. yyless(n): retains the first n characters in yytext and returns the rest to the input. yywrap(): called whenever Lex reaches an end-of-file; the default yywrap() always returns 1.

Pattern Matching Primitives
Metacharacter   Matches
.               any character except newline
\n              newline
*               zero or more copies of the preceding expression
+               one or more copies of the preceding expression
?               zero or one copy of the preceding expression
^               beginning of line (complement inside a character class)
$               end of line
a|b             a or b
(ab)+           one or more copies of ab (grouping)
[ab]            a or b
a{3}            3 instances of a
"a+b"           literal a+b (C escapes still work)

Review of Lex Predefined Variables
Name                 Function
char *yytext         pointer to matched string
int yyleng           length of matched string
FILE *yyin           input stream pointer
FILE *yyout          output stream pointer
int yylex(void)      call to invoke lexer, returns token
yymore()             append next match to current yytext
yyless(int n)        retain the first n characters in yytext
int yywrap(void)     wrapup, return 1 if done, 0 if not done
ECHO                 write matched string
REJECT               go to the next alternative rule
INITIAL              initial start condition
BEGIN                switch start condition