

A Pascal program

    program xyz(input, output);      Program header
    var A, B, C: integer;            Declaration part
    begin A := B + C * 2 end.        Statement part

Input from the file is read to a buffer (program buffer):

    program xyz(input, output) --- begin A := B + C * 2 end. \0

OR input is read char by char to a lexeme buffer:

    xyz\0    (

BUT by the time the lexeme is complete, the next char after the lexeme has already been read; this character must be saved (it may be the beginning of the next lexeme). This technique is used for Prolog & Lisp.

DFR - PL - Program

Program: textual view

Sequence of characters from an alphabet

    White space:           blank, tab, newline (ignored)
    Alphanumeric strings:  begin with a letter (keywords / user defined ids)
    Numeric strings:       begin with a number
    Other strings:         begin with a non letter/number

Patterns

Literal strings:

    program, input, output, var, integer, begin, end
    (   ,   )   ;   :   :=   +   *   .

Regular expressions:

    Alphanumeric   [a-zA-Z][a-zA-Z0-9]*
    Numeric        [0-9][0-9]*

Matching via algorithms OR table lookup
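The two regular expressions above can be matched "via algorithms" with a pair of small hand-written functions. The following is only a sketch: the function names are hypothetical, and it assumes the default "C" locale, where isalpha() and isdigit() accept exactly a-z, A-Z and 0-9.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Matches [a-zA-Z][a-zA-Z0-9]*: a letter followed by letters or digits.
   (Hypothetical helper, not taken from the slides.) */
bool matches_alphanumeric(const char *s)
{
    if (s == NULL || !isalpha((unsigned char)s[0]))
        return false;
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))
            return false;
    return true;
}

/* Matches [0-9][0-9]*: one or more digits.
   (Hypothetical helper, not taken from the slides.) */
bool matches_numeric(const char *s)
{
    if (s == NULL || !isdigit((unsigned char)s[0]))
        return false;
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!isdigit((unsigned char)s[i]))
            return false;
    return true;
}
```

With these, "xyz" and "A1" match the alphanumeric pattern, "2" matches the numeric pattern, and "2x" matches neither.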

Lexeme

Each substring in the program text (input string) which is matched by a pattern is called a LEXEME, e.g. keyword, user defined id, symbol, number.

    program xyz ( input , output ) ;
    var A , B , C : integer ;
    begin A := B + C * 2 end .

Copy the lexeme from the input buffer to a lexeme buffer OR read the file char by char into the lexeme buffer. BUT by then the next char after the lexeme has been read; this character must be saved (it may be the beginning of the next lexeme).
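A minimal sketch of the first variant (copying a lexeme from the program buffer into a lexeme buffer) might look as follows. The function name next_lexeme and its parameters are assumptions made for illustration. In this variant the character after the lexeme is never consumed: the position simply stays on it, which plays the role of the "saved" character. A char-by-char reader over a FILE could get the same effect with the standard ungetc().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copies the next lexeme from the program buffer `src`, starting at *pos,
   into `lexeme` (capacity `cap`, which must be at least 3).
   Returns the lexeme length, or 0 at the end of the input.
   On return, *pos indexes the first character after the lexeme, so that
   character is kept for the next call. (Hypothetical helper.) */
size_t next_lexeme(const char *src, size_t *pos, char *lexeme, size_t cap)
{
    size_t n = 0;
    assert(src != NULL && pos != NULL && lexeme != NULL && cap >= 3);

    while (isspace((unsigned char)src[*pos]))    /* white space is ignored */
        (*pos)++;

    unsigned char c = (unsigned char)src[*pos];
    if (c == '\0') {                             /* end of program buffer */
        lexeme[0] = '\0';
        return 0;
    }

    if (isalpha(c)) {                            /* keyword or user defined id */
        while (isalnum((unsigned char)src[*pos]) && n + 1 < cap)
            lexeme[n++] = src[(*pos)++];
    } else if (isdigit(c)) {                     /* number */
        while (isdigit((unsigned char)src[*pos]) && n + 1 < cap)
            lexeme[n++] = src[(*pos)++];
    } else {                                     /* symbol, with ":=" as the only 2-char case */
        lexeme[n++] = src[(*pos)++];
        if (c == ':' && src[*pos] == '=')
            lexeme[n++] = src[(*pos)++];
    }
    lexeme[n] = '\0';
    return n;
}
```

Called repeatedly on "begin A := B + C * 2 end." this yields the lexemes begin, A, :=, B, +, C, *, 2, end and the final period, then 0. A lexeme longer than the buffer is split across calls; a real lexer would report that as an error instead.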

Tokens

Each LEXEME may be represented by a TOKEN (often a (symbolic) integer value to save space).
A TOKEN represents a class of lexemes (often just 1 member).
ID and NUMBER have a potentially infinite number of members.

    program xyz ( input , output ) ; var A , B , C : integer ; begin A := B + C * 2 end .

(lexemes to tokens)

    261 257 40 262 44 263 41 59

OR use symbolic names

    program ID lparen input comma output

Token Values

In a language such as C the values may be defined as

    ASCII values (0-255)   single character tokens
    Values > 256           ID, NUMBER, assign, keywords

    typedef enum tvalues   // tokens + keywords
    {
        tstart = 257, id, number, assign, predef, tempty, undef, error, typ, tend,
        kstart, program, input, output, var, begin, end, boolean, integer, real, kend
    } toktyp;
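As a sketch of how this numbering can be used: because every enumerator after tstart = 257 takes the next integer, the multi-character tokens can never collide with a character code. The two helper functions below are hypothetical, and reading kstart and kend as sentinels that bracket the keywords is an assumption based on their names.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum tvalues   /* tokens + keywords, as on the slide */
{
    tstart = 257, id, number, assign, predef, tempty, undef, error, typ, tend,
    kstart, program, input, output, var, begin, end, boolean, integer, real, kend
} toktyp;

/* Compile-time check: enum tokens start above the character range 0-255. */
static_assert(tstart > 255, "enum tokens must not overlap character codes");

/* A single character token is represented by its own character code. */
bool is_single_char_token(int tok)
{
    return tok >= 0 && tok <= 255;
}

/* Assumption: kstart and kend are sentinels around the keyword values. */
bool is_keyword_token(int tok)
{
    return tok > kstart && tok < kend;
}
```

With this enum, id is 258 and program is 268, while '(' is represented by its ASCII value 40.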

The Parsing Process

The role of the Parser is to determine if the input program is syntactically correct or not.
The role of the Lexer is to identify lexemes and convert them to tokens (as well as to remove white space).

    Program text (input string)
      -> Lexical Analysis (pattern matching)
      -> Token stream
      -> Parsing (syntax checks)
      -> T/F

Lexemes -> Tokens (via a table)

    Keyword table            Token table
    lexeme     token         lexeme     token
    program    program       id         id
    input      input         number     number
    output     output        :=         assign
    var        var           ,          comma
    integer    integer       ;          semicolon
    begin      begin         +          plus
    etc.       etc.          etc.       etc.

NB: id and number are pseudo-lexemes
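The table lookup can be sketched in C as below. This is an illustration under assumptions, not the course's lexer: the names tab_entry and lexeme_to_token are hypothetical, single character tokens are returned as their ASCII value (following the Token Values slide) instead of being listed in the token table, and the lexeme is assumed to be well formed already.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum tvalues   /* tokens + keywords, as on the Token Values slide */
{
    tstart = 257, id, number, assign, predef, tempty, undef, error, typ, tend,
    kstart, program, input, output, var, begin, end, boolean, integer, real, kend
} toktyp;

typedef struct { const char *lexeme; int token; } tab_entry;   /* hypothetical */

static const tab_entry keyword_table[] = {
    {"program", program}, {"input", input},     {"output", output},
    {"var", var},         {"begin", begin},     {"end", end},
    {"boolean", boolean}, {"integer", integer}, {"real", real}
};

static const tab_entry token_table[] = {
    {":=", assign}        /* the only multi-character symbol in the example */
};

/* Maps a lexeme to its token: keyword table first, then token table,
   then the pseudo-lexemes id and number, then single characters. */
int lexeme_to_token(const char *lexeme)
{
    assert(lexeme != NULL);
    for (size_t i = 0; i < sizeof keyword_table / sizeof keyword_table[0]; i++)
        if (strcmp(lexeme, keyword_table[i].lexeme) == 0)
            return keyword_table[i].token;
    for (size_t i = 0; i < sizeof token_table / sizeof token_table[0]; i++)
        if (strcmp(lexeme, token_table[i].lexeme) == 0)
            return token_table[i].token;
    if (isalpha((unsigned char)lexeme[0]))
        return id;                               /* any other alphanumeric string */
    if (isdigit((unsigned char)lexeme[0]))
        return number;                           /* any numeric string */
    if (lexeme[0] != '\0' && lexeme[1] == '\0')
        return (unsigned char)lexeme[0];         /* single character: ASCII value */
    return undef;
}
```

Checking the keyword table before falling back to id is what separates a keyword such as program from a user defined id such as xyz, even though both match the alphanumeric pattern.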

Tokens -> Lexemes (via a table)

Often for debugging, it is useful to convert the tokens back to lexemes: use the tables. (ID -> id; NUMBER -> number)

1. All identifiers map to the pseudo-lexeme ID
2. For ID we have a {token, lexeme} tuple {ID, "xyz"}
3. Similarly all numbers map to NUMBER
4. Again we have a {token, lexeme} tuple {NUMBER, "2"}
5. We will use this in the Prolog and Lisp parsers (as a list)
6. In the C parser we have get_token() and get_lexeme()
7. This means that the actual values (lexemes) of IDs and NUMBERs must be saved
8. IDs -> Symbol Table
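The reverse lookup and the {token, lexeme} tuple can be sketched as follows. All names here (name_entry, token_to_lexeme, tok_pair, make_pair) are hypothetical; the signatures of the course's get_token() and get_lexeme() are not given in the slides, so they are deliberately not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef enum tvalues   /* tokens + keywords, as on the Token Values slide */
{
    tstart = 257, id, number, assign, predef, tempty, undef, error, typ, tend,
    kstart, program, input, output, var, begin, end, boolean, integer, real, kend
} toktyp;

typedef struct { int token; const char *lexeme; } name_entry;   /* hypothetical */

static const name_entry names[] = {
    {id, "id"},           {number, "number"},   {assign, ":="},
    {program, "program"}, {input, "input"},     {output, "output"},
    {var, "var"},         {begin, "begin"},     {end, "end"},
    {boolean, "boolean"}, {integer, "integer"}, {real, "real"}
};

/* Writes the (pseudo-)lexeme for `tok` into `buf` and returns `buf`.
   Single character tokens print as the character itself. */
char *token_to_lexeme(int tok, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (names[i].token == tok) {
            snprintf(buf, cap, "%s", names[i].lexeme);
            return buf;
        }
    }
    if (tok > 0 && tok <= 255)
        snprintf(buf, cap, "%c", tok);
    else
        snprintf(buf, cap, "undef");
    return buf;
}

/* A {token, lexeme} tuple: keeps the actual text of an ID or NUMBER. */
typedef struct { int token; char lexeme[32]; } tok_pair;

tok_pair make_pair(int tok, const char *lexeme)
{
    tok_pair p;
    assert(lexeme != NULL);
    p.token = tok;
    snprintf(p.lexeme, sizeof p.lexeme, "%s", lexeme);   /* truncates long lexemes */
    return p;
}
```

For example, make_pair(id, "xyz") keeps the name that token_to_lexeme(id, ...) alone would print only as the pseudo-lexeme id.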

Symbol Table

    Name       Rôle         Type       Size   Address (or offset)
    _predef    type         _predef    0      0
    _undef     type         _predef    0      0
    _error     type         _predef    0      0
    integer    type         _predef    4      0
    boolean    type         _predef    4      0
    xyz        program id   _predef    12     9999
    A          variable     integer    4      0
    B          variable     integer    4      4
    C          variable     integer    4      8
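A symbol table with these columns could be sketched in C as below. The struct layout, the fixed capacity and the function names are assumptions for illustration. Giving each variable the running total of the preceding sizes as its offset is one reading of the A, B, C rows (0, 4, 8), and it would also explain the size 12 shown for xyz.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 64
#define NAME_LEN    32

typedef struct {
    char        name[NAME_LEN];
    const char *role;      /* "type", "program id" or "variable" */
    const char *type;      /* e.g. "_predef" or "integer" */
    int         size;
    int         address;   /* address or offset */
} symbol;

typedef struct {
    symbol rows[MAX_SYMBOLS];
    size_t count;
    int    next_offset;    /* running total of variable sizes */
} symtab;

/* Returns the row for `name`, or NULL if it is not in the table. */
const symbol *find_symbol(const symtab *t, const char *name)
{
    assert(t != NULL && name != NULL);
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->rows[i].name, name) == 0)
            return &t->rows[i];
    return NULL;
}

/* Adds a variable at the next free offset.
   Returns its offset, or -1 if the table is full or the name exists. */
int add_variable(symtab *t, const char *name, const char *type, int size)
{
    assert(t != NULL && name != NULL && type != NULL);
    if (t->count >= MAX_SYMBOLS || find_symbol(t, name) != NULL)
        return -1;
    symbol *s = &t->rows[t->count++];
    strncpy(s->name, name, NAME_LEN - 1);
    s->name[NAME_LEN - 1] = '\0';
    s->role = "variable";
    s->type = type;
    s->size = size;
    s->address = t->next_offset;
    t->next_offset += size;
    return s->address;
}
```

Starting from a zero-initialised table and adding A, B and C as 4-byte integers gives the offsets 0, 4 and 8 and a total size of 12.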

What's the difference?

This may lead to some confusion

       Program   program xyz              (text string)
       Pattern   program                  (text string)
    OR Pattern   [a-zA-Z][a-zA-Z0-9]*     (regular expression)
       Lexeme    program / xyz            (substring of program)
       Token     program / id             (symbolic name)
    OR Token     257 / 258                (integer value)

Alphanumeric: keyword or ID

Summary

    Source program         a text, i.e. a string
    Pattern                string or regular expression
    Lexeme                 substring of the program text
    Token                  class of lexemes (often just 1 member)
    Token representation   as integers or symbolic names
    Lexer                  string -> token stream
    Parser                 token stream -> Boolean (T/F)