Parsing and Pattern Recognition


Topics in IT 1: Parsing and Pattern Recognition
Week 10: Lexical analysis
College of Information Science and Engineering, Ritsumeikan University

this week
  mid-term evaluation review
  lexical analysis
    its place in typical compiler architecture
    semantic type vs. semantic value
    tokenisation
  tools and examples

compiler architecture: lexical analysis

  source file (text)     for (i = 0; i < 10; ++i) printf("%i\n", i);
    -> tokeniser (lexical analyser)
  tokens                 FOR LPAREN ID<i> EQ NUM<0> SEMI ...
    -> parser (syntax analyser)
  tree                   (e.g. an "=" node with children i and 0)
    -> optimiser -> code generator
  assembly language      movl $0, 24(%esp)
  executable file        (binary)

tokenisation
also known as lexical analysis, scanning

  file or terminal input (text)  ->  buffer (character sequence)
  regular expressions            ->  string matching rules
  C code fragments               ->  rule actions (allocate & initialise)
                                       ->  tokens  ->  parser

lexical analysis

a grammar defines the structure of sentences of a language
categories (ID, NUM, ...) represent roles, ignoring specific values
  e.g., foo, bar and baz are all ID, regardless of their names

  %token FOR LPAREN RPAREN SEMI EQ ID NUM
  statement  = FOR LPAREN statement SEMI expression SEMI expression RPAREN statement
             | ...
  expression = ID EQ expression
             | NUM
             | ...

this is sufficient to recognise whether a sentence is grammatical
however, a parser does care about specific values!
  ID foo is not the same as ID bar, ...

semantic types and values

a token combines a category and a specific value (when appropriate):

  category    value
  ID          char *name
  NUM         int value
  BINARYOP    <=, +, =, etc.

with values, we can analyse the semantics (meaning) of a program
the two parts of a token are therefore called
  semantic type   (identifier, number, binary operator, etc.)
  semantic value  (foo, 123, ADD, etc.)

tokens

semantic types can be represented by unique values (e.g., integers):

  enum {                  // semantic types
      ID, INT, FLOAT,     // variables and literals
      UNYOP, BINOP,       // unary and binary operators
      LPAREN, RPAREN,     // punctuation
      ...
  };

the type of the semantic value often depends on the semantic type:

  enum { ADD, SUB, MUL, DIV, MOD, ... };  // operators

  struct token {
      int semantic_type;
      union {                     // semantic values
          char   *id_name;        // ID
          long    integer_value;  // INT
          double  float_value;    // FLOAT
          int     operator;       // BINOP, UNYOP, etc.
      } semantic_value;
  };

problem: tokenisation
  identify lexemes in the source code of a program
  punctuation, keywords, identifiers, numbers, etc.

solution: use regular expressions to describe what each looks like
  convert the regular expressions into a DFA
  accept whenever an entire lexeme has been read
  construct a token and return it to the parser

extended regular expressions:

  [abcde]    a|b|c|d|e       character set
  [a-dp-s]   [abcdpqrs]      character range
  .          any character   wildcard

tokenisation

  if else do while for break continue return   language keywords
  ; ( )                                        language punctuation
  [-+]?[0-9]+                                  signed decimal integer
  [A-Za-z_][A-Za-z_0-9]*                       identifier
  [ \t\n\r]                                    blank ("white space")

lex scanner generator

lex automates:
  buffering and sequencing of input text
  creating a FSA from regular expressions
  scanning the input characters using the FSA
  recognising semantic types and values
  executing user-supplied actions to create tokens
  supplying tokens one at a time to a client (e.g., a parser)

  scanner.l (definitions, regular expressions, actions)
    -> lex -> lex.yy.c -> cc -> a.out
  text -> a.out -> tokens

lex scanner specification

three sections:
  C declarations and named REs
    named REs can be referred to as {name}
  RE rules and associated actions
    actions can be enclosed in { ... } braces
  supporting C functions
    can be called from within actions

lex converts the specification into a C program, lex.yy.c
lex.yy.c is compiled (with parser, etc.) to create a compiler front-end
the default action of lex.yy.c is to echo characters as they are read
  lex can be used to make simple text filters, word counters, etc.

lex scanner specification

  %{  /* declarations */
  enum { FOR, ID, INTEGER, FLOAT, EQ, LPAREN, RPAREN, SEMI };
  Symbol *intern(const char *string);
  %}
  spaces   [ \t\n]+
  letter   [A-Za-z]
  digit    [0-9]
  id       {letter}({letter}|{digit})*
  integer  {digit}+
  float    {digit}+\.{digit}+(e[+-]?{digit}+)?
  %%  /* rules and actions */
  {spaces}   { /* ignored */ }
  for        { return FOR; }
  {id}       { yylval.symbol = intern(yytext); return ID; }
  {integer}  { yylval.integer_val = atoi(yytext); return INTEGER; }
  {float}    { yylval.float_val = atof(yytext); return FLOAT; }
  "="        { return EQ; }
  %%  /* supporting functions */
  Symbol *intern(const char *string) { ... }

examples (available for download from the course website)

  echo.l       default is to echo characters
  unspace.l    matched characters are not echoed
  startstop.l  actions are attached to matching patterns
  wordnum.l    actions are attached to matching patterns
  wc.l         EOF can be matched too
  config.l     can easily scan configuration files, etc.
  config2.l    yytext contains the matched text
  config.txt   (example input for config and config2)

to compile on Mac, Linux, or Cygwin (Windows):
  lex filename.l ; cc -o filename lex.yy.c

symbols and symbol tables

identifiers are often treated specially
  the same names reappear very many times
  wasteful to allocate a new string for each
  inefficient to compare identifiers using string comparison
  identifiers carry information: type, defined value (of symbolic constants), etc.

identifiers are converted into symbols
  a symbol is a unique string (maybe with other information)
  stored in a symbol table (binary tree, hash table, ...)
  identifier names are looked up in the table during scanning
    if found, the existing symbol is reused
    otherwise a new symbol is created
  symbols are compared by identity (not equality of contents)
  provides a place to store additional information about identifiers

examples

  tokenise.l   tokens made from type + value; yylex() and yylval provide tokens
  tokenise2.l  ordered tree of symbols is created; previously-created symbols
               are always reused; symbols can be compared by identity

lex implementation

a deterministic FSA (DFA) is used
  very fast: table lookup used to perform transitions
    current state + next character -> next state, immediately
  NFA constructed from the regular expression rules
  DFA constructed from the NFA

no need for separate finite-choice matching of keywords
  the DFA is faster than a series of strcmp()s

DFA tables rapidly grow quite large
  trivial languages have hundreds of states
  128 (ASCII) or 256 (UTF-8) characters per state
  a table compression algorithm can be used to minimise size

lex complications

ambiguity between rules
  the longest matching rule is always preferred
  if two rules match the same input characters, the one occurring first
  in the specification is preferred

need for trailing context
  sometimes reserved words must occur in groups
  if any word is missing from the group, the words are identifiers instead
  the right context operator / provides for this, e.g.:
    IF/.*THEN  { return IF; }
  (input after the / must be matched, but is not consumed)

coupling between parser and lexical analyser is sometimes needed
  in C, typedefed names are reserved words (not identifiers)!
  the symbol table provides a place in which this communication can take place

lex complications

modal treatment of characters, e.g., C strings
  C compilers warn of string constants that span lines
  the interpretation of \n changes within a string constant

two ways to handle this; first:
  let the action consume input characters, storing them in a buffer
  explicitly check for un-escaped \n
  tedious and error-prone

or, second: temporarily put lex into a mode where \n becomes illegal
  \"       { BEGIN str; }
  <str>\n  { error("end of line in string"); }
  <str>\"  { BEGIN 0; return STRING; }

homework and next week

homework:
  read slides
  learn vocabulary
  practice using lex
    download the examples from the course website
    compile and run them

next week: we now have tokens, so...
  let's turn them into a parse tree
  recursive-descent parsing

glossary

action
  user-supplied code executed when a sequence of characters has been
  recognised. In lexical analysis, actions typically construct and return a
  token. In syntactic analysis, actions typically construct a parse tree node.

identity
  a property of an item that allows it to be identified uniquely and compared
  for equality. The literal value of a scalar quantity, or the memory address
  of an aggregate structure, typically serves as its identity. Two such items
  can be compared in a single operation (without having to compare the
  contents of the aggregate structure, for example).

lex
  a program that generates scanners from a high-level description based on
  regular expressions.

mode
  in lex, a state in which a different set of rules and patterns is
  temporarily in effect. Scanning a string, for example, might put the
  scanner into a mode where newline characters are not allowed.

reserved word
  a token that is reserved by the programming language. For example, in C the
  tokens for, while and if obey the rules for identifiers but cannot be used
  as identifiers, since they are reserved words that give structure to the
  program. (In C, identifiers that have been defined as type names with
  typedef are treated as reserved words.)

scanner
  another name for a lexical analyser: a program that converts a sequence of
  symbols (typically text characters) into tokens that represent the semantic
  quantities (identifiers, numbers, punctuation symbols) of the language
  being parsed.

scanner generator
  a program that generates a scanner from a high-level description, often
  written as a set of regular expressions that describe the tokens to be
  produced when the generated scanner is run.

scanning
  the process of converting a sequence of symbols (typically text characters)
  into tokens that represent the semantic quantities (identifiers, numbers,
  punctuation symbols) of the language being parsed.

semantic type
  the category to which a token belongs, often associated with a single
  terminal symbol (parentheses, arithmetic operators, statement terminators,
  etc.) or a class of related terminal symbols that have identical semantic
  behaviour (identifiers, literals, etc.).

semantic value
  the actual value of a token, implied by the text that matched the token
  during scanning. For example, a token whose semantic type is integer might
  have the semantic value 37, or a token of type identifier might have a
  semantic value of tempvar.

symbol
  an object representing a name (such as an identifier) whose identity is
  guaranteed to be unique for any given value. Symbols can be compared using
  equality (of their memory addresses, for example) instead of having to
  perform a more expensive comparison of the characters in the associated
  name. For example, every occurrence of the identifier xyz in a program
  would typically be scanned as the same, unique symbol object.

table lookup
  finding a value by indexing a table. The lookup is performed in constant
  time: no search needs to be performed.

token
  an object or value representing a single semantic item in a language. For
  example, identifiers, integers and the various arithmetic operator symbols
  of a language are typically represented as single tokens (even though they
  may be written using more than one character). A token is often made from
  two properties: the type of the token (indicating the role it plays in the
  language, such as integer, identifier, multiplication operator, etc.) and
  its value (if any, such as the numeric value of an integer or the symbol
  associated with an identifier).