Scanners. Xiaokang Qiu, Purdue University. ECE 468, August 24, 2016. Adapted from Kulkarni 2012.


Scanners
Sometimes called lexers
Recall: scanners break the input stream up into a set of tokens: identifiers, reserved words, literals, etc.
What do we need to know?
- How do we define tokens?
- How can we recognize tokens?
- How do we write scanners?

Regular expressions
Regular sets: sets of strings defined by regular expressions
- Single-string sets are regular sets: xkqiu, orangzeb
- So is the empty string: λ (sometimes ɛ is used instead)
- A choice between two regular sets is regular, using |: (xkqiu | orangzeb)
  To avoid ambiguity, can use ( ) to group regexps together
- Concatenations of regular sets are regular: (xkqiu | orangzeb)@purdue.edu
  Any string from the first regexp concatenated with any string from the second regexp
- 0 or more strings from a regular set is regular, using *: (purdue)*
Some other notation used for convenience:
- Use Not to accept all strings except those in a regular set
- Use ? to make a string optional: A? is equivalent to (A | λ)
- Use + to mean 1 or more strings from a set: A+ is equivalent to AA*
- Use [ ] to present a range of choices: [1-3] is equivalent to (1 | 2 | 3)

Examples of regular expressions
- Digits: D = [0-9]
- Letters: L = [A-Za-z]
- Literals (integers or floats): -?D+(.D*)?
- Identifiers: (_ | L)(_ | L | D)*
- Comments (as in Micro): --Not(\n)*\n
- More complex comments (delimited by ##, can use # inside a comment): ##((# | λ)Not(#))*##

Scanner generators
Essentially, tools for converting regular expressions into scanners
Two popular scanner generators:
- Lex (Flex): generates C/C++ scanners
- ANTLR: generates Java scanners

Lex (Flex)
Lex: commonly used Unix scanner generator (now superseded by Flex)
Flex is a domain-specific language for writing scanners
Features:
- Character classes: define sets of characters (e.g., digits)
- Token definitions: regex { action to take }

Lex (Flex)

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%
{DIGIT}+              { printf( "An integer: %s (%d)\n", yytext, atoi( yytext ) ); }
{DIGIT}+"."{DIGIT}*   { printf( "A float: %s (%g)\n", yytext, atof( yytext ) ); }
if|then|begin|end|procedure|function   { printf( "A keyword: %s\n", yytext ); }
{ID}                  printf( "An identifier: %s\n", yytext );

Lex (Flex)
The order in which tokens are defined matters!
- Lex will match the longest possible token: ifa becomes ID(ifa), not IF ID(a)
- If two regexes both match, Lex uses the one defined first: if becomes IF, not ID(if)
Use action blocks to process tokens as necessary:
- Convert integer/float literals to numbers
- Remove quotes from string literals

Lex (Flex)
Compile the lex file to C code (an example of compiling a high-level language to another high-level language!)
Compile the generated scanner to produce a working scanner
Combine with yacc/bison to produce a parser

ANTLR
More powerful tool than Lex (can generate parsers, too, not just scanners)
Same basic principles, but generates Java code
- Token definition: tokenname : regex1 | regex2 | ...
- Character classes look similar to token definitions:
  fragment characterclassname : regex1 | regex2 | ...
- Can use character classes when defining tokens

How do flex and ANTLR work?
They use a systematic technique for converting regular expressions into code that recognizes when a string matches that regular expression
Key to efficiency: recognize matches as characters are read
Enabling concept: finite automata

Regular Expression → Finite Automata → Lexer

Finite automata
Finite state machine which will only accept a string if it is in the set defined by the regular expression: (abc+)+

λ transitions
Transitions between states that aren't triggered by seeing another character
- Can optionally take the transition, but do not have to
- Can be used to link states together

Non-deterministic FA
Note that if a finite automaton has a λ-transition in it, it may be non-deterministic (do we take the transition or not?)
More precisely, an FA is non-deterministic if, from one state, reading a single character could result in a transition to multiple states (e.g., two different a-transitions out of the same state)
How do we deal with non-deterministic finite automata (NFAs)?

Running an NFA
Intuition: take every possible path through an NFA (think: parallel execution of the NFA)
- Maintain a pointer that tracks the current state
- Every time there is a choice, split the pointer, and have one pointer follow each choice
- Track each pointer simultaneously
- If a pointer gets stuck, stop tracking it
- If any pointer reaches an accept state at the end of input, accept

Example: How does this NFA (diagram on slide) handle the string aba?

NFAs to DFAs
Can convert NFAs to deterministic finite automata (DFAs): no choices, so never a need to split pointers
- Simulate the NFA for all possible inputs; any time there is a new configuration of pointers, create a state to capture it
  e.g., pointers at states 1, 3 and 4 become the new state {1, 3, 4}
- For any new state, explore all possible next states (those that can be reached with a single character)
- The process ends when there are no new states found
This can result in very large DFAs!

Example: Convert the following NFA (diagram on slide) into a DFA

DFA reduction
DFAs built from NFAs are not necessarily optimal: they may contain many more states than necessary
Example (on slide): (ab)+, equivalently (ab)(ab)*

DFA reduction
Theoretical result: a unique, optimal DFA always exists
Intuition: merge equivalent states; two states are equivalent if they have the same transitions to the same states
Basic idea of the optimization algorithm:
- Start with two big nodes, one representing all the final states, the other representing all other states
- Successively split those nodes whose transitions lead to nodes in the original DFA that are in different nodes in the optimized DFA

Example: Simplify the following DFA (diagram on slide)

Building a FA from a regexp
Base and inductive constructions (diagrams shown on slide) for each form of regular expression: a, λ, AB, A | B, A*
Mini-exercise: how do we build an FA that accepts Not(A)?

Regular Expression → Finite Automata → Lexer

Transition tables
Table encoding the states and transitions of an FA: 1 row per state, 1 column per possible character
Each entry: if the automaton in a particular state sees a character, what is the next state?
Table for (abc+)+ (blank entries are errors):

State | a | b | c
  1   | 2 |   |
  2   |   | 3 |
  3   |   |   | 4
  4   | 2 |   | 4

Finite automata program
Using a transition table, it is straightforward to write a program to recognize strings in a regular language:

state = initial_state; // start state of FA
while (true) {
    next_char = getc();
    if (next_char == EOF) break;
    next_state = T[state][next_char];
    if (next_state == ERROR) break;
    state = next_state;
}
if (is_final_state[state])
    ; // recognized a valid string
else
    handle_error(next_char);

Alternate implementation
Here's how we would implement the same program conventionally:

next_char = getc();
while (next_char == 'a') {
    next_char = getc();
    if (next_char != 'b') handle_error(next_char);
    next_char = getc();
    if (next_char != 'c') handle_error(next_char);
    while (next_char == 'c') {
        next_char = getc();
        if (next_char == EOF) return; // matched token
        if (next_char == 'a') break;
        if (next_char != 'c') handle_error(next_char);
    }
}
handle_error(next_char);

Transducers
A simple extension of an FA which also outputs the recognized string: recognized characters are output; everything else is discarded
Annotate transitions:
- T(x): toss x
- x: save x
Example (on slide): DFA to recognize comments and the if token

Example: Transducer for strings
Recognize quoted strings; a doubled quotation mark ("") within a string produces a quotation mark
Regexp: "("" | Not("))*"
Examples: "ECE 468" becomes ECE 468; "Scanning is ""fun""" becomes Scanning is "fun"

Practical Considerations
Or: what do I have to worry about if I'm actually going to write a scanner?

Handling reserved words
Keywords can be written as regular expressions; however, this leads to a big blowup in FA size
(Consider writing a regular expression that accepts identifiers which cannot be if, while, do, for, etc.)
Usually better to specify reserved words as exceptions: capture them using the identifier regex, and then decide whether the token corresponds to a reserved word

Handling compiler directives #include <stdio.h> #pragma omp for for (int i = 0; i < N; i++) { A[i] = B[i] + C[i] } Simple directives (e.g., include files, conditional compilation) can be parsed by the scanner No need to invoke full parser

Generating symbol table entries
In simple languages, the scanner can build the symbol table directly
In more complex languages, with complicated scoping rules, this needs to be handled by the parser

Lookahead
Up until now, we have only considered matching an entire string to see if it is in a regular language
What if we want to match multiple tokens from a file? E.g., distinguish between int a and inta
We need to look ahead to see if the next character belongs to the current token:
- If it does, we can continue
- If it doesn't, the next character becomes part of the next token

Multi-character lookahead
Sometimes, a scanner will need to look ahead more than one character to distinguish tokens
Examples:
- Fortran: DO J = 1,100 (loop) vs. DO J = 1.100 (variable assignment)
- Pascal: 23.85 (literal) vs. 23..85 (range)
Two solutions: backup, or a special action state

General approach
Remember the states (T) that can be final states, and buffer the characters from then on
If stuck in a non-final state, back up to T and restore the buffered characters to the stream
Example: on the input stream 12.3e+q, the FA processes 1 2 . 3 e +, marking T after the 3; at q it gets stuck (Error!) and backs up to T

Why can't we do this?
Just build an FA which recognizes the string D+(λ | .D+)(..)D+(λ | .D+) and use the final state we end in to determine the token type?
Note that this will recognize tokens of the form 12.3 and 12..3

Error Recovery
What do we do if we encounter a lexical error (a character which causes us to take an undefined transition)?
Two options:
- Delete all currently-read characters, and start scanning from the current location
- Delete the first character read, and start scanning from the second character
This presents problems with ill-formatted strings (why?)
One solution: create a new regexp to accept runaway strings

Next Time
We've covered how to tokenize an input program, but how do we decide what the tokens actually say?
How do we recognize that IF ID(a) OP(<) ID(b) { ID(a) ASSIGN LIT(5) ; } is an if-statement?
Next time: parsers