Lexical Analysis. Lexical analysis is the first phase of compilation: the source file is converted from a stream of characters into tokens. It must be fast!



Compiler Passes Analysis of input program (front-end): character stream → Lexical Analysis → token stream → Syntactic Analysis → abstract syntax tree → Semantic Analysis → annotated AST Synthesis of output program (back-end): → Intermediate Code Generation → intermediate form → Optimization → intermediate form → Code Generation → target language

Lexical Pass/Scanning Purpose: turn the character stream (program input) into a token stream Token: a group of characters forming a basic, atomic unit of syntax, such as an identifier, number, etc. White space: characters between tokens that are ignored

Why separate lexical / syntactic analysis? Separation of concerns / good design scanner: handles grouping chars into tokens ignores white space handles I/O, machine dependencies parser: handles grouping tokens into syntax trees The restricted nature of scanning allows a faster implementation scanning is time-consuming in many compilers

Complications to Scanning Most languages today are free form Layout doesn't matter White space separates tokens Alternatives Fortran -- line oriented: do 10 i = 1,100...loop code... 10 continue Haskell -- indentation and layout can imply grouping Separating scanning from parsing is standard Alternative: C/C++/Java: type vs identifier Parser wants scanner to distinguish between names that are types and names that are variables Scanner doesn't know how things are declared done in semantic analysis, a.k.a. type checking

Lexemes, tokens, patterns Lexeme: a group of characters that matches a pattern Token: a class of lexemes matching a pattern A token may carry attributes if more than one lexeme belongs to it Pattern: typically defined using regular expressions REs are the simplest class of languages that's powerful enough for this purpose

Languages and Language Specification Alphabet: a finite set of characters and symbols String: a finite (possibly empty) sequence of characters from an alphabet Language: a (possibly empty or infinite) set of strings Grammar: a finite specification for a set of strings Automaton: an abstract machine accepting a set of strings and rejecting all others A language can be specified by many different grammars and automata A grammar or automaton specifies a single language

Classes of Languages Regular languages: specified by regular expressions/grammars & finite automata (FSAs) Context-free languages: specified by context-free grammars and pushdown automata (PDAs) Turing-computable languages: specified by general grammars and Turing machines (Venn diagram: regular languages ⊂ context-free languages ⊂ Turing-computable languages ⊂ all languages)

Syntax of Regular Expressions Defined inductively Base cases Empty string (ε) A symbol from the alphabet (e.g. x) Inductive cases Concatenation (sequence of two REs): E1 E2 Alternation (choice of two REs): E1 | E2 Kleene closure (0 or more repetitions of an RE): E* Notes Use parentheses for grouping Precedence: * is highest, then concatenation; | is lowest White space is not significant

Notational Conveniences E+ means 1 or more occurrences of E E k means exactly k occurrences of E [E] means 0 or 1 occurrences of E {E} means E* not(x) means any character in the alphabet except x not(E) means any string over the alphabet except those in E E1 - E2 means any string matching E1 that's not in E2 There is no additional expressive power here

Naming Regular Expressions Can assign names to regular expressions Can use the names in regular expressions Example: letter ::= a | b |... | z digit ::= 0 | 1 |... | 9 alphanum ::= letter | digit This grammar-like notation for regular expressions is a regular grammar Can reduce named REs to plain REs by macro expansion No recursive definitions allowed, as in normal context-free grammars

Using REs to Specify Tokens Identifiers ident ::= letter (letter | digit)* Integer constants integer ::= digit+ sign ::= + | - signed_int ::= [sign] integer Real numbers real ::= signed_int [fraction] [exponent] fraction ::= . digit+ exponent ::= (E | e) signed_int
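These token REs translate directly into executable patterns. As a quick illustration (using java.util.regex syntax rather than the course's notation; the class and names below are mine, not MiniJava code):

```java
import java.util.regex.Pattern;

public class TokenPatterns {
    // The slide's REs rewritten in java.util.regex syntax (illustrative):
    // ident ::= letter (letter | digit)*
    static final Pattern IDENT   = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    // integer ::= digit+
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    // real ::= signed_int [fraction] [exponent]
    static final Pattern REAL =
        Pattern.compile("[+-]?[0-9]+(\\.[0-9]+)?([Ee][+-]?[0-9]+)?");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();   // whole-string match, not a prefix
    }

    public static void main(String[] args) {
        System.out.println(matches(IDENT, "hi2bob"));   // true
        System.out.println(matches(INTEGER, "x1"));     // false
        System.out.println(matches(REAL, "-3.14E+2"));  // true
    }
}
```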

More Tokens String and character constants string ::= " char* " character ::= ' char ' char ::= not(" | ' | \) | escape escape ::= \(" | ' | \ | n | r | t | v | b | a) White space whitespace ::= <space> | <tab> | <newline> | comment comment ::= /* not(*/) */

Meta-Rules Can define a rule that a legal program is a sequence of tokens and white space: program ::= (token | whitespace)* token ::= ident | integer | real | string |... But this doesn't say how to uniquely break up a program into its tokens -- it's highly ambiguous E.g. what tokens to make out of hi2bob One identifier, hi2bob? Three tokens, hi 2 bob? Six tokens, each one character long? The grammar says it's legal, but not how to decide Apply an extra rule to say how to break up a string The longest sequence wins
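The "longest sequence wins" rule is easy to sketch in code. Here is a toy maximal-munch scanner over just identifiers and integers (a hypothetical helper, not course code), showing that hi2bob comes out as a single identifier:

```java
import java.util.ArrayList;
import java.util.List;

public class MaximalMunch {
    // Greedy tokenizer: at each position, consume the longest lexeme
    // that matches ident ::= letter (letter | digit)*  or  integer ::= digit+
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }  // skip, no token
            int start = i;
            if (Character.isLetter(c)) {
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
            } else if (Character.isDigit(c)) {
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
            } else {
                i++;   // anything else: single-character token
            }
            tokens.add(input.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hi2bob"));  // [hi2bob] -- one identifier, not hi 2 bob
        System.out.println(tokenize("2 bob"));   // [2, bob]
    }
}
```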

RE Specification of initial MiniJava Lex Program ::= (Token | Whitespace)* Token ::= ID | Integer | ReservedWord | Operator | Delimiter ID ::= Letter (Letter | Digit)* Letter ::= a |... | z | A |... | Z Digit ::= 0 |... | 9 Integer ::= Digit+ ReservedWord ::= class | public | static | extends | void | int | boolean | if | else | while | return | true | false | this | new | String | main | System.out.println Operator ::= + | - | * | / | < | <= | >= | > | == | != | && | ! Delimiter ::= ; | . | , | = | ( | ) | { | } | [ | ]

Building Scanners with REs Convert RE specification into a finite state automaton (FSA) Convert FSA into a scanner implementation By hand into a collection of procedures Mechanically into a table-driven scanner

Finite State Automata A Finite State Automaton has A set of states One marked as initial Some marked as final A set of transitions from state to state Each labeled with an alphabet symbol or ε (diagram: an FSA recognizing /* ... */ comments, with transitions labeled /, *, not(*), and not(*,/)) Operate by beginning at the start state, reading symbols, and making the indicated transitions When the input ends, the state must be final, or else reject
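The "read symbols and make the indicated transitions" loop above can be coded directly. This sketch recognizes /* ... */ comments, like the automaton in the diagram; the state numbering is mine, an illustration rather than course code:

```java
public class CommentDfa {
    // States: 0 start, 1 seen '/', 2 inside comment, 3 seen '*', 4 final.
    // -1 stands for the dead (reject) state.
    static boolean accepts(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            switch (state) {
                case 0: state = (c == '/') ? 1 : -1; break;
                case 1: state = (c == '*') ? 2 : -1; break;
                case 2: state = (c == '*') ? 3 : 2;  break;
                case 3: state = (c == '/') ? 4 : (c == '*') ? 3 : 2; break;
                default: return false;   // input continues past the final state
            }
            if (state < 0) return false; // dead state: reject immediately
        }
        return state == 4;               // accept iff we end in the final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("/* hello */"));  // true
        System.out.println(accepts("/* ** */"));     // true
        System.out.println(accepts("/* x */ y"));    // false
    }
}
```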

Determinism FSAs can be deterministic or nondeterministic Deterministic (DFA): always know uniquely which edge to take At most 1 arc leaving a state with a given symbol No ε arcs Nondeterministic (NFA): may need to guess or explore multiple paths, choosing the right one later (diagram: an example NFA over the alphabet {0, 1})

NFAs vs DFAs A problem: REs (i.e. specifications) map easily to NFAs Can write code for DFAs easily How to bridge the gap? Can it be bridged?

A Solution A cool algorithm can translate any NFA to a DFA Proves that NFAs aren't any more expressive Plan: 1) Convert RE to NFA 2) Convert NFA to DFA 3) Convert DFA to code Can be done by hand or fully automatically

RE => NFA Construct cases inductively (diagrams: NFAs for the base cases ε and x, and for the inductive cases E1 E2, E1 | E2, and E*, each built from the component NFAs using ε transitions)

NFA => DFA Problem: NFA can choose among alternative paths, while DFA must pick only one path Solution: subset construction Each state in the DFA represents the set of states the NFA could possibly be in

Subset Construction Given an NFA with states and transitions, label all NFA states uniquely Create the start state of the DFA label it with the set of NFA states that can be reached by ε transitions, i.e. w/o consuming input Process the start state To process a DFA state S with label [S1,...,Sn]: For each symbol x in the alphabet: Compute the set T of NFA states reachable from S1,...,Sn by an x transition followed by any number of ε transitions If T is not empty If a DFA state already has T as its label, add an x transition from S to T Otherwise create a new DFA state T and add an x transition from S to T A DFA state is final iff at least one of its NFA states is
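The two core operations of the algorithm are the ε-closure and the per-symbol move. The sketch below (illustrative code, not the course's implementation) uses them to simulate an NFA for (a|b)*ab on the fly, computing exactly the label sets the subset construction would assign to DFA states:

```java
import java.util.*;

public class SubsetConstruction {
    static final char EPS = 0;   // label for ε edges
    // transitions: state -> (label -> successor states)
    static final Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();

    static void addEdge(int from, char label, int to) {
        nfa.computeIfAbsent(from, k -> new HashMap<>())
           .computeIfAbsent(label, k -> new HashSet<>()).add(to);
    }

    static {
        // NFA for (a|b)*ab: state 0 loops on a and b, then an ε edge
        // "guesses" where the trailing ab starts; state 3 is final.
        addEdge(0, 'a', 0); addEdge(0, 'b', 0);
        addEdge(0, EPS, 1); addEdge(1, 'a', 2); addEdge(2, 'b', 3);
    }

    // ε-closure: everything reachable without consuming input
    static Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> result = new HashSet<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : nfa.getOrDefault(s, Map.of()).getOrDefault(EPS, Set.of()))
                if (result.add(t)) work.push(t);
        }
        return result;
    }

    // One subset-construction step: T = closure(move(S, x))
    static Set<Integer> move(Set<Integer> states, char x) {
        Set<Integer> out = new HashSet<>();
        for (int s : states)
            out.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(x, Set.of()));
        return closure(out);
    }

    static boolean accepts(String s) {
        Set<Integer> state = closure(Set.of(0));  // DFA start-state label
        for (char c : s.toCharArray()) state = move(state, c);
        return state.contains(3);                 // final iff some NFA state is final
    }

    public static void main(String[] args) {
        System.out.println(accepts("abab"));  // true
        System.out.println(accepts("aba"));   // false
    }
}
```

A generator tool would instead memoize each distinct label set as a DFA state and record the x transitions between them, which is exactly the loop described in the slide.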

Subset Construction (worked example: diagram applying the subset construction to an NFA with states a–f and transitions on /, *, Σ, and ε)

To Tokens Every final state of the DFA emits a token Tokens are the internal compiler names for the lexemes == becomes equal ( becomes leftparen private becomes private You choose the names Also, there may be additional data \r\n might include a line count

DFA => Code Option 1: Implement by hand using procedures one procedure for each token each procedure reads characters one at a time choices implemented using if and switch statements Pros straightforward to write fast Cons a fair amount of tedious work may have subtle differences from the language specification
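Option 1 might look like the following sketch (the token names and the tiny language are hypothetical, but real hand-written scanners follow the same shape: one branch per leading character, with a little lookahead where needed):

```java
public class HandScanner {
    private final String src;
    private int pos = 0;

    HandScanner(String src) { this.src = src; }

    // Returns the next token name, consuming its characters.
    String nextToken() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        if (pos >= src.length()) return "EOF";
        char c = src.charAt(pos);
        switch (c) {
            case '(': pos++; return "LPAREN";
            case ')': pos++; return "RPAREN";
            case '=':
                // one character of lookahead distinguishes '=' from '=='
                if (pos + 1 < src.length() && src.charAt(pos + 1) == '=') { pos += 2; return "EQEQ"; }
                pos++; return "ASSIGN";
            default:
                if (Character.isLetter(c)) {            // ident: letter (letter | digit)*
                    int start = pos;
                    while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
                    return "ID(" + src.substring(start, pos) + ")";
                }
                pos++; return "ERROR";
        }
    }

    public static void main(String[] args) {
        HandScanner s = new HandScanner("x == (y)");
        System.out.println(s.nextToken());  // ID(x)
        System.out.println(s.nextToken());  // EQEQ
    }
}
```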

DFA => Code [continued] Option 2: use a tool to generate a table-driven scanner Rows: states of the DFA Columns: input characters Entries: actions Go to next state Accept token, go to start state Error Pros Convenient Exactly matches the specification, if tool-generated Cons Magic Table lookups may be slower than direct code, but a switch implementation is a possible revision
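A table-driven scanner in miniature might look like this sketch (illustrative, not generated output): the table rows are DFA states, the columns are character classes, and -1 plays the role of the error entry. This table recognizes only digit+:

```java
public class TableScanner {
    // Character classes collapse the alphabet into table columns:
    // 0 = digit, 1 = anything else
    static int charClass(char c) { return Character.isDigit(c) ? 0 : 1; }

    //                           digit  other
    static final int[][] delta = {
        { 1, -1 },   // state 0: start
        { 1, -1 },   // state 1: inside an integer (accepting)
    };
    static final boolean[] accepting = { false, true };

    // The driver loop is generic: only the tables encode the language.
    static boolean accepts(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            state = delta[state][charClass(c)];
            if (state < 0) return false;   // error entry
        }
        return accepting[state];
    }

    public static void main(String[] args) {
        System.out.println(accepts("12345"));  // true
        System.out.println(accepts("12a"));    // false
    }
}
```

The appeal of this design is that the driver never changes; regenerating the scanner from a new specification only rewrites the tables.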

Automatic Scanner Generation in MiniJava We use the jflex tool to automatically create a scanner from a specification file, Scanner/minijava.jflex (We use the CUP tool to automatically create a parser from a specification file, Parser/minijava.cup, which also generates all of the code for the token classes used in the scanner, via the Symbol class) The MiniJava Makefile automatically rebuilds the scanner (or parser) whenever its specification file changes

Symbol Class Lexemes are represented as instances of class Symbol class Symbol { int sym; // which token class? Object value; // any extra data for this lexeme... } A different integer constant is defined for each token class in the sym helper class class sym { static int CLASS = 1; static int IDENTIFIER = 2; static int COMMA = 3;... } Can use this in printing code for Symbols; see symbolToString in minijava.jflex

Token Declarations Declare new token classes in Parser/minijava.cup, using terminal declarations include Java type if Symbol stores extra data Examples /* reserved words: */ terminal CLASS, PUBLIC, STATIC, EXTENDS;... /* operators: */ terminal PLUS, MINUS, STAR, SLASH, EXCLAIM;... /* delimiters: */ terminal OPEN_PAREN, CLOSE_PAREN; terminal EQUALS, SEMICOLON, COMMA, PERIOD;... /* tokens with values: */ terminal String IDENTIFIER; terminal Integer INT_LITERAL;

jflex Token Specifications Helper definitions for character classes and regular expressions letter = [a-zA-Z] eol = [\r\n] Simple token definitions are of the form: regexp { Java stmt } regexp can be (at least): a string literal in double-quotes, e.g. "class", "<=" a reference to a named helper, in braces, e.g. {letter} a character list or range, in square brackets, e.g. [a-zA-Z] a negated character list or range, e.g. [^\r\n] . (which matches any single character) regexp regexp, regexp|regexp, regexp*, regexp+, regexp?, (regexp)
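Putting these pieces together, a fragment of a jflex rule section in this style might look like the sketch below. This is a hedged illustration, not the actual contents of Scanner/minijava.jflex; the helper names and sym constants are assumed:

```
letter = [a-zA-Z]
digit  = [0-9]

%%

"class"                      { return symbol(sym.CLASS); }
"<="                         { return symbol(sym.LESS_EQUAL); }
{letter}({letter}|{digit})*  { return symbol(sym.IDENTIFIER, yytext()); }
{digit}+                     { return symbol(sym.INT_LITERAL, Integer.parseInt(yytext())); }
[ \t\r\n]+                   { /* whitespace: no action, no token */ }
```

Note how longest match and rule order resolve ambiguity: "class" is listed before the identifier rule, so the reserved word wins over an identifier of the same spelling.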

jflex Tokens [Continued] The Java stmt (the accept action) is typically: return symbol(sym.CLASS); for a simple token return symbol(sym.IDENTIFIER, yytext()); for a token with extra data based on the lexeme string yytext() empty for whitespace