Languages and Compilers


Principles of Software Engineering and Operational Systems
Languages and Compilers (SDAGE: Level I, 2012-13)
4. Lexical Analysis (Scanning)
Dr Valery Adzhiev, vadzhiev@bournemouth.ac.uk, Office: TA-121
For some images: Copyright 2009 Elsevier, Inc. All rights reserved.

Contents
- Lexical Analysis and Scanner Functionality
- Tokens and Their Specifics
- Scanner Implementation: Ad Hoc, Direct-Coded (Pure DFA), Table-Driven DFA
- Lex: Scanner Generator
- Check Your Understanding

Lexical / Syntax Analysis
Together, the scanner and parser are responsible for discovering the syntactic structure of the program.
- The scanner's principal job is to reduce the quantity and complexity of information that must be processed by the parser.
- The parser is in control of recognising syntactic structure: the scanner is called by the parser whenever the parser needs the next token.
Separating lexical and syntax analysis allows for:
- Better efficiency in both phases.
- Portability: parts of the lexical analyzer may not be portable, but the parser usually is. While reading input files, the lexical analyzer buffers input, which is platform-dependent; the syntax analyzer is always platform-independent.
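A minimal C sketch of this pull-style interface: the parser is in control and asks the scanner for one token at a time. The token kinds, the Token struct, and the stubbed token stream are illustrative assumptions, not the course's actual code.

#include <stdio.h>

typedef enum { TOK_ID, TOK_ASSIGN, TOK_NUM, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    const char *image;   /* spelling of the lexeme */
} Token;

/* stub scanner: hands out a fixed token stream, one token per call */
static Token scan_next_token(void) {
    static const Token stream[] = {
        { TOK_ID, "sum" }, { TOK_ASSIGN, ":=" }, { TOK_NUM, "42" },
        { TOK_EOF, "" }
    };
    static int pos = 0;
    if (stream[pos].kind != TOK_EOF) return stream[pos++];
    return stream[pos];
}

/* the "parser" drives the scanner: it pulls tokens only as it needs them */
int main(void) {
    for (Token t = scan_next_token(); t.kind != TOK_EOF; t = scan_next_token())
        printf("token %d: %s\n", (int)t.kind, t.image);
    return 0;
}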

Scanner Functionality
Main function: tokenising.
- Aggregates characters into substrings to form words (lexemes).
- Applies a set of rules describing the lexical structure (microsyntax) to determine whether each word (lexeme) is valid, i.e. matches a pattern.
- If it is valid, the scanner assigns it a syntactic category, thus recognising a token: the smallest meaningful language entity. If not, a lexical error is reported.
- Saves tokens with source locations (file, line, column) to make it easier to generate error messages in subsequent phases.
- Saves the text of interesting tokens (identifiers, strings, numerical literals, ...).
- Removes comments.
- Often also deals with pragmas (i.e. "significant comments").

Dealing with Special Tokens
Handling keywords (reserved words): treat them as exceptions to the rule for identifiers. Before returning an id, the scanner looks it up in a special hash table to make sure it is not a keyword.
Near-universal rule: always try to recognise the longest possible token from the input, which means you return only when the next character cannot be used to continue the current token: foobar, not f or foob; 3.14159 is one real constant, never the three tokens 3, ., and 14159.
White space (blanks, tabs, newlines) is generally ignored, except to the extent that it separates tokens (so foo bar is different from foobar).
In some cases one may need to peek ahead further than one character:
- In Pascal, when you have read 3 and the next character is a dot: do you proceed in hopes of getting 3.14, or do you stop for fear of getting 3..5 (.. can be a token)?
- Fortran is even messier (e.g., buffered characters may need to be "unread"): DO 5 I = 1, 25 versus DO 5 I = 1.25 (compare NASA's Mariner 1!).
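A small C sketch of the two ideas above: longest-match ("maximal munch") scanning of an identifier, followed by a keyword lookup so reserved words are caught as exceptions to the identifier rule. The keyword list and names are illustrative assumptions; a real scanner would use a hash table rather than a linear scan.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* tiny keyword table (illustrative); a real scanner would hash */
static const char *keywords[] = { "read", "write" };

static int is_keyword(const char *s) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0) return 1;
    return 0;
}

int main(void) {
    const char *p = "foobar read x";
    char lexeme[64];
    while (*p) {
        if (!isalnum((unsigned char)*p)) { p++; continue; } /* skip separators */
        size_t n = 0;
        /* maximal munch: keep consuming while the next char can extend the token */
        while (isalnum((unsigned char)*p) && n < sizeof lexeme - 1)
            lexeme[n++] = *p++;
        lexeme[n] = '\0';
        printf("%s: %s\n", is_keyword(lexeme) ? "keyword" : "id", lexeme);
    }
    return 0;
}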

Token Attributes
An attribute of a token is additional information about the specific lexeme. For simplicity, a token may have a single attribute which holds the required information for that token. For identifiers, this attribute is a pointer into the symbol table, and the symbol table holds the actual attributes for that token.
Attributes for some tokens:
- <id, attr>: attr is a pointer into the symbol table.
- <assign-op, _>: no attribute is needed (if there is only one type of assignment operator).
- <num, val>: val is the actual value of the number.
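One way to represent such <kind, attribute> pairs in C is a tagged union; this is a sketch of a plausible representation, not the lecture's actual data structure.

#include <stdio.h>

typedef enum { TOK_ID, TOK_ASSIGN, TOK_NUM } TokenKind;

struct SymtabEntry;                 /* opaque symbol-table entry */

/* a token as a <kind, attribute> pair, using a tagged union */
typedef struct {
    TokenKind kind;
    union {
        struct SymtabEntry *sym;    /* <id, attr>: pointer into the symbol table */
        long val;                   /* <num, val>: the actual value of the number */
    } attr;                         /* <assign-op, _> simply leaves this unused */
} Token;

int main(void) {
    Token t = { .kind = TOK_NUM, .attr.val = 42 };
    if (t.kind == TOK_NUM) printf("num with value %ld\n", t.attr.val);
    return 0;
}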

Pragmas
Pragmas are constructs that provide directives or hints to the compiler. Pragmas do not change program semantics, only the compilation process ("significant comments"). They can:
- Turn various run-time checks on and off.
- Turn certain code improvements on and off.
- Enable or disable performance profiling (statistics, etc.).
The scanner usually deals with pragmas in languages where they can appear anywhere in the source.
Examples of pragmas as hints for the compiler:
- Variable x is very heavily used: keep it in a register!
- Subroutine F is a pure function: its only effect is returning a value.
- Subroutine S is not (even indirectly) recursive: its storage can be statically allocated.
- 32 bits of precision (instead of 64) suffice for floating-point variable x.
The compiler may ignore these, in the interest of simplicity or in the face of contradictory information.

Calculator Language: Tokens
- := is used for assignment.
- The tokens read and write are listed as exceptions to the rule for id: in effect, they are treated as keywords.
- Two styles of comments (as in C) are allowed. Comments of the same type cannot nest, but the two different styles can appear inside each other, which allows "commenting out" code.
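A plausible C enumeration of this token set, assuming the usual calculator-language tokens (identifiers, numeric literals, the four arithmetic operators, parentheses, assignment, and the read/write keywords); the exact set shown on the slide's figure is not reproduced in this transcription.

/* token kinds for the calculator language (assumed set) */
typedef enum {
    TOK_READ, TOK_WRITE,                       /* keywords: exceptions to the id rule */
    TOK_ID, TOK_LITERAL,                       /* identifiers and numeric literals */
    TOK_ASSIGN,                                /* := */
    TOK_PLUS, TOK_MINUS, TOK_TIMES, TOK_DIV,
    TOK_LPAREN, TOK_RPAREN,
    TOK_EOF
} TokenKind;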

Scanner Implementation
- Ad hoc approach (hand-coded): production compilers often use ad hoc scanners, as they generally yield the fastest, most compact code by doing lots of special-purpose things.
- Semi-mechanical pure DFA (direct-coded).
- Table-driven DFA.
DFA-based implementations are preferable during development, as they allow the scanner to be built in a more structured way.

Ad Hoc Scanner
- Simpler and more common cases are checked first.
- Characters are read one at a time, with look-ahead ("peek") when needed.
- Loops are embedded for comments and for long tokens.
- When invoked again, the scanner repeats from the beginning, using the next available characters, including those peeked at but not consumed.
- Lexical errors?! (Handling them is left entirely to the hand-written code; see the sketch below.)
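A sketch of the ad hoc style for a fragment of the calculator language: simple cases first, one character of look-ahead where needed (for :=), and hand-written loops for the long tokens. The fragment and its names are illustrative, not the lecture's code.

#include <ctype.h>
#include <stdio.h>

/* ad hoc scanner fragment: prints the kind of each token in a string */
int main(void) {
    const char *p = "a := 3 + 42";
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; }           /* simple case first */
        else if (*p == '+') { puts("plus"); p++; }
        else if (*p == ':') {                              /* one-character look-ahead */
            if (p[1] == '=') { puts("assign"); p += 2; }
            else { puts("lexical error: ':' not followed by '='"); p++; }
        } else if (isdigit((unsigned char)*p)) {           /* embedded loop, long token */
            while (isdigit((unsigned char)*p)) p++;
            puts("literal");
        } else if (isalpha((unsigned char)*p)) {
            while (isalnum((unsigned char)*p)) p++;
            puts("id");
        } else {
            printf("lexical error: unexpected '%c'\n", *p++);
        }
    }
    return 0;
}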

DFA-Based Scanner Implementation (Typical Scanner Generator)
1. Write the language's lexical specification and express it as regular expressions (REs).
2. Convert the REs into a nondeterministic finite automaton (NFA).
3. Translate the NFA into an equivalent DFA.
4. Optimise (minimise) the DFA.
5. Implement the DFA, either direct-coded or using table-driven scanning.

Recognising Multiple Kinds of Tokens
A scanner differs from a plain formal DFA in that it identifies tokens in addition to recognising them, i.e. it indicates which token was found. In practice, this means it must keep separate final states for every kind of token. To build a scanner for a language with n different kinds of tokens:
- Begin with one NFA per token kind: M_i, i = 1..n.
- Create a new start state with ε-transitions to the start state of each M_i.
- In contrast to the normal alternation construction, do not create a single final state: keep the existing ones, each labeled by the token for which it is final.
- Then apply the NFA-to-DFA construction as before.
- In the DFA minimisation phase, instead of starting with two equivalence classes (final and non-final states), begin with n+1 classes: the non-final states, plus a separate class of final states for each kind of token.

Scanner for Calculator: DFA
- The FA starts in a distinguished initial state.
- When it reaches one of a designated set of final states, it recognises the token associated with that state.
- Comments, when recognised, send the scanner back to its start state.
- The longest-possible-token rule means the scanner returns to the parser only when the next character cannot be used to continue the current token.

Scanner Code: Pure DFA
This direct-coded, hand-written approach embeds the automaton in the control flow of the program using nested case (switch) statements:
- The outer case statement covers the states of the FA.
- The inner cases cover the transitions out of each state.
- Most of the inner clauses set a new state.
- Some return from the scanner with the current token (if the current character should not be part of that token, it is pushed back onto the input stream).
Easier to write and to debug than the ad hoc approach, if not quite as efficient.
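A direct-coded sketch for a two-token fragment (numbers and :=): the outer switch is on the DFA state, the inner cases on the input character. State names and the fragment itself are illustrative assumptions.

#include <ctype.h>
#include <stdio.h>

typedef enum { S_START, S_IN_NUM, S_SAW_COLON } State;

/* recognise one token (number or ':=') at s; return chars consumed, or -1 on error */
static int scan_one(const char *s, const char **kind) {
    State state = S_START;
    int i = 0;
    for (;;) {
        char c = s[i];
        switch (state) {                            /* outer: DFA states */
        case S_START:
            switch (c) {                            /* inner: transitions out of state */
            case ':': state = S_SAW_COLON; i++; break;
            default:
                if (isdigit((unsigned char)c)) { state = S_IN_NUM; i++; }
                else return -1;                     /* lexical error */
            }
            break;
        case S_IN_NUM:
            if (isdigit((unsigned char)c)) i++;     /* stay in state, consume */
            else { *kind = "number"; return i; }    /* c pushed back: not consumed */
            break;
        case S_SAW_COLON:
            if (c == '=') { *kind = "assign"; return i + 1; }
            return -1;
        }
    }
}

int main(void) {
    const char *kind;
    int n = scan_one("123+", &kind);
    if (n > 0) printf("%s, %d chars\n", kind, n);
    return 0;
}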

Scanner Tables for the Calculator (tables shown as a figure on the slide)

Scanner Tables and Driver
Scanner tables generated for the calculator language: states are numbered as in the calculator DFA graph, with the addition of states 17 and 18 to recognise white space and comments. There are three main tables:
- scan_tab: each entry specifies an action: move to a new state (and if so, which), return a token, or announce an error.
- token_tab: indicates, for each state, whether we might be at the end of a token (and if so, which one). Separating this table from the main one lets us notice when we pass a state that might have been the end of a token, so we can back up if we hit an error state.
- keyword_tab: contains read and write.
The driver for a table-driven scanner must return:
- The kind of token found.
- Its character-string image (spelling), needed for semantic analysis and error messages.
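A plausible shape for these three tables in C; the sizes, the character-class indexing, and the action encoding are assumptions made for illustration, not the book's exact declarations.

#define NUM_STATES  19        /* DFA states plus white-space/comment states 17, 18 (assumed count) */
#define NUM_CLASSES 32        /* one column per character class, not per raw character */

typedef enum { T_NONE, T_ID, T_LITERAL, T_ASSIGN, T_READ, T_WRITE /* ... */ } TokenKind;
typedef enum { A_MOVE, A_RETURN, A_ERROR } ActionKind;

/* scan_tab: for each (state, character class), the action to take */
typedef struct {
    ActionKind action;
    int new_state;            /* meaningful when action == A_MOVE */
} ScanEntry;
extern ScanEntry scan_tab[NUM_STATES][NUM_CLASSES];

/* token_tab: for each state, the token that might end here (T_NONE if none);
   kept separate so the driver can remember the last state that could have
   ended a token and back up to it on hitting an error state */
extern TokenKind token_tab[NUM_STATES];

/* keyword_tab: spellings checked before returning an id */
static const struct { const char *image; TokenKind kind; } keyword_tab[] = {
    { "read", T_READ }, { "write", T_WRITE }
};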

Driver for a Table-Driven Scanner
The driver program (a generic "skeleton scanner"):
- Uses the current state and input character to index into scan_tab.
- Before returning, looks tokens up in keyword_tab.
- An outer loop serves to filter out comments and white space (spaces, tabs, newlines).
Lexical errors: the next character of input may be neither an acceptable continuation of the current token nor the start of another token. The scanner must print a message and perform some sort of recovery:
- Throw away the current, invalid token.
- Skip forward until the next proper character is found.
- Restart the scanning algorithm.
- Count on the error-recovery mechanism of the parser.
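A self-contained miniature of this driver loop: the outer loop filters white space, the inner loop runs the DFA via the tables, and token_tab tells us whether the state we stopped in ends a token. Error recovery is simplified to "report, skip one character, restart", and keyword lookup is omitted; the two-state table is illustrative.

#include <stdio.h>

enum { N_STATES = 2, N_CLASSES = 2 };              /* class 0 = digit, class 1 = other */
typedef enum { T_NONE, T_LITERAL } TokenKind;

/* scan_tab: next state for (state, class); -1 means "no move possible" */
static const int scan_tab[N_STATES][N_CLASSES] = {
    /* digit other */
    {  1,    -1 },   /* state 0 (start): a digit starts a number */
    {  1,    -1 },   /* state 1 (in number): digits extend it */
};
static const TokenKind token_tab[N_STATES] = { T_NONE, T_LITERAL };

static int class_of(char c) { return (c >= '0' && c <= '9') ? 0 : 1; }

int main(void) {
    const char *p = "123 45 x";
    while (*p) {
        if (*p == ' ') { p++; continue; }          /* outer loop: filter white space */
        int state = 0, len = 0;
        /* run the DFA as far as it will go: the longest-possible-token rule */
        while (p[len] && scan_tab[state][class_of(p[len])] != -1)
            state = scan_tab[state][class_of(p[len++])];
        if (token_tab[state] != T_NONE) {          /* stopped in a token-ending state */
            printf("literal: %.*s\n", len, p);
            p += len;
        } else {                                   /* lexical error: report, skip, restart */
            printf("lexical error at '%c'\n", *p);
            p++;
        }
    }
    return 0;
}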

Lex: Scanner Generator
- Lex is a Unix tool for automatically generating a scanner from a lex specification (a .l file).
- A Lex source file is essentially a table of REs and corresponding program fragments; Lex implements the matching DFA for you.
- Lex reads the .l file and generates a C program containing a function yylex(), to be called by the parser (usually one generated by yacc).
- There are free open-source analogues of lex, notably flex (commonly used with the Bison parser).
- Many tutorials are available online.

Lex Specification
A Lex file has three sections, separated by %% lines:
- Definitions section: defines macros and imports header files written in C. It is also possible to write arbitrary C code here, which is copied verbatim into the generated source file.
- Rules section: associates RE patterns with C statements. When the lexer sees input text matching a given pattern, it executes the associated C code.
- User code section: contains user-written C functions (helper routines and, if the scanner is used standalone, a main); it too is copied verbatim into the generated source file.
REs in Lex: http://dinosaur.compilertools.net/lex/index.html
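A minimal flex specification showing the three-section layout (this one counts identifiers); the rules are illustrative, not the lecture's example:

%{
/* definitions section: C declarations, copied verbatim into the scanner */
#include <stdio.h>
int ids = 0;
%}
%option noyywrap
%%
[a-zA-Z][a-zA-Z0-9]*   { ids++; /* rules section: pattern plus C action */ }
[ \t\n]+               { /* skip white space */ }
.                      { /* ignore everything else */ }
%%
/* user code section, copied verbatim */
int main(void) {
    yylex();
    printf("%d identifiers\n", ids);
    return 0;
}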

Lex: Example
Example: generate a scanner that recognises strings of numbers (integers) in the input and simply prints them out.
- If this specification is given to flex, it is converted into lex.yy.c.
- lex.yy.c is compiled into an executable which matches and outputs strings of integers.
See: http://en.wikipedia.org/wiki/Lex_(software)
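The example's code and its sample input/output were figures lost in this transcription. A reconstruction along the lines of the Wikipedia page cited above (treat the exact wording as approximate):

%{
/* definitions section */
#include <stdio.h>
%}
%option noyywrap
%%
[0-9]+    { printf("Saw an integer: %s\n", yytext); }
.|\n      { /* ignore all other characters */ }
%%
/* user code section */
int main(void) {
    yylex();   /* scan the standard input */
    return 0;
}

Building with, say, flex scanner.l && cc lex.yy.c -o scanner (file names hypothetical) and feeding it an input such as abc123z.!&*2gj6 should print Saw an integer: 123, Saw an integer: 2, and Saw an integer: 6.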

Exercises
- Build an ad hoc scanner (e.g., in C) for the calculator language. As output, have it print a list, in order, of the input tokens. For simplicity, feel free to simply halt in the event of a lexical error.
- Try the Lex or Flex tools on the calculator language. Compare your hand-written C program with the generated scanner in C.

Check Your Understanding
- List the tasks performed by a typical scanner.
- What are pragmas?
- Explain the reasons behind the longest-possible-token rule.
- Why must a scanner save the text of tokens? Why must it sometimes peek at upcoming characters?
- Explain the main approaches to scanner implementation.
- What are the advantages of an automatically generated scanner in comparison to a handwritten one? Why do many commercial compilers use a handwritten scanner anyway?
- Describe the process of building a scanner using the Lex tool.