Compiler Construction

Similar documents
Compiler Construction

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

Compiler Construction

Compiler course. Chapter 3 Lexical Analysis

Chapter 3 Lexical Analysis

Introduction to Lexical Analysis

Scanning. COMP 520: Compiler Design (4 credits) Professor Laurie Hendren.

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Module 8 - Lexical Analyzer Generator. 8.1 Need for a Tool. 8.2 Lexical Analyzer Generator Tool

Compiler Construction

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Lexical Analyzer Scanner

Lexical Analysis. Chapter 2

Lexical Analyzer Scanner

CSE302: Compiler Design

Lexical analysis. Syntactical analysis. Semantical analysis. Intermediate code generation. Optimization. Code generation. Target specific optimization

Lexical Analysis. Implementing Scanners & LEX: A Lexical Analyzer Tool

Chapter 3 -- Scanner (Lexical Analyzer)

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

Lexical Analysis. Lecture 2-4

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

CS 403: Scanning and Parsing

Figure 2.1: Role of Lexical Analyzer

Introduction to Lexical Analysis

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Lexical Analysis. Lecture 3. January 10, 2018

Implementation of Lexical Analysis

An Introduction to LEX and YACC. SYSC Programming Languages

Lexical Analysis. Implementation: Finite Automata

Parsing and Pattern Recognition

Lexical Analysis. Lecture 3-4

Lexical Analysis. Textbook:Modern Compiler Design Chapter 2.1

Type 3 languages. Regular grammars Finite automata. Regular expressions. Deterministic Nondeterministic. a, a, ε, E 1.E 2, E 1 E 2, E 1*, (E 1 )

Implementation of Lexical Analysis

Scanning. COMP 520: Compiler Design (4 credits) Alexander Krolik MWF 13:30-14:30, MD 279

PRACTICAL CLASS: Flex & Bison

Compiler phases. Non-tokens

CSC 467 Lecture 3: Regular Expressions

Lexical and Syntax Analysis

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

2. λ is a regular expression and denotes the set {λ} 4. If r and s are regular expressions denoting the languages R and S, respectively

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Monday, August 26, 13. Scanners

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Wednesday, September 3, 14. Scanners

Lex Spec Example. Int installid() {/* code to put id lexeme into string table*/}

CSE302: Compiler Design

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Lexical Analysis (ASU Ch 3, Fig 3.1)

Implementation of Lexical Analysis

Lexical Analysis. Introduction

Flex and lexical analysis

Lexical Analysis. Finite Automata

Zhizheng Zhang. Southeast University

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

CSE 105 THEORY OF COMPUTATION

CSc 453 Lexical Analysis (Scanning)

UNIT -2 LEXICAL ANALYSIS

Programming Languages (CS 550) Lecture 4 Summary Scanner and Parser Generators. Jeremy R. Johnson

Week 2: Syntax Specification, Grammars

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

LEX/Flex Scanner Generator

Introduction to Yacc. General Description Input file Output files Parsing conflicts Pseudovariables Examples. Principles of Compilers - 16/03/2006

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

UNIT II LEXICAL ANALYSIS

CSCI312 Principles of Programming Languages!

CS4850 SummerII Lex Primer. Usage Paradigm of Lex. Lex is a tool for creating lexical analyzers. Lexical analyzers tokenize input streams.

LECTURE 6 Scanning Part 2

Compiler Construction

Formal Languages and Compilers Lecture VI: Lexical Analysis

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Cunning Plan. Informal Sketch of Lexical Analysis. Issues in Lexical Analysis. Specifying Lexers

Lex & Yacc. By H. Altay Güvenir. A compiler or an interpreter performs its task in 3 stages:

Lex & Yacc. by H. Altay Güvenir. A compiler or an interpreter performs its task in 3 stages:

An introduction to Flex

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Compiler Construction

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

CS143 Handout 04 Summer 2011 June 22, 2011 flex In A Nutshell

Lexical Analysis. Textbook:Modern Compiler Design Chapter 2.1.

Introduction to Lex & Yacc. (flex & bison)

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Syntax and Type Analysis

Languages and Compilers

2. Lexical Analysis! Prof. O. Nierstrasz!

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Preparing for the ACW Languages & Compilers

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

2. Syntax and Type Analysis

Flex and lexical analysis. October 25, 2016

CD Assignment I. 1. Explain the various phases of the compiler with a simple example.

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

The structure of a compiler

CSE 3302 Programming Languages Lecture 2: Syntax

Compiler Construction

CS308 Compiler Principles Lexical Analyzer Li Jiang

Transcription:

Compiler Construction Thomas Noll Software Modeling and Verification Group RWTH Aachen University https://moves.rwth-aachen.de/teaching/ss-17/cc/

Recap: First-Longest-Match Analysis The Extended Matching Problem I Definition Let n 1 and α 1,..., α n RE Ω with ε / α i for every i [n] (where [n] := {1,..., n}). Let Σ := {T 1,..., T n } be an alphabet of corresponding tokens and w Ω +. If w 1,..., w k Ω + such that w = w 1... w k and for every j [k] there exists i j [n] such that w j α ij, then (w 1,..., w k ) is called a decomposition and (T i1,..., T ik ) is called an analysis of w w.r.t. α 1,..., α n. Problem (Extended matching problem) Given α 1,..., α n RE Ω and w Ω +, decide whether there exists a decomposition of w w.r.t. α 1,..., α n and determine a corresponding analysis. 3 of 26 Compiler Construction

Recap: First-Longest-Match Analysis Ensuring Uniqueness Two principles 1. Principle of the longest match ( maximal munch tokenisation ) for uniqueness of decomposition make lexemes as long as possible motivated by practical considerations: every proper prefix of an identifier is also an identifier real-valued constants contain integer constants as prefixes 2. Principle of the first match for uniqueness of analysis choose first matching regular expression (in the given order) therefore: arrange keywords before identifiers (if keywords protected) 4 of 26 Compiler Construction

Recap: First-Longest-Match Analysis Implementation of FLM Analysis Algorithm (FLM analysis overview) Input: expressions α 1,..., α n RE Ω, tokens {T 1,..., T n }, input word w Ω + Procedure: 1. for every i [n], construct A i DFA Ω such that L(A i ) = α i (see DFA method; Algorithm 2.9) 2. construct the product automaton A DFA Ω such that L(A) = n i=1 α i 3. partition the set of final states of A to follow the first-match principle 4. extend the resulting DFA to a backtracking DFA which implements the longest-match principle 5. let the backtracking DFA run on w Output: FLM analysis of w (if existing) 5 of 26 Compiler Construction

Recap: First-Longest-Match Analysis (4) The Backtracking DFA Definition (Backtracking DFA) The set of configurations of B is given by ({N} Σ) Ω Q Ω Σ {ε, lexerr} The initial configuration for an input word w Ω + is (N, q 0 w, ε). The transitions of B are defined as follows (where q := δ(q, a)): normal mode: look for initial match (T i, q w, W) if q F (i) (1) (N, qaw, W) (N, q w, W) if q P \ F (2) output: W lexerr if q / P (3) backtrack mode: look for longest match (T i, q w, W) if q F (i) (4) (T, vqaw, W) (T, vaq w, W) if q P \ F (5) (N, q 0 vaw, WT) if q / P (6) end of input: (T, q, W) output: WT if q F (7) (N, q, W) output: W lexerr if q P \ F (8) (T, vaq, W) (N, q 0 va, WT) if q P \ F (9) 6 of 26 Compiler Construction

Time Complexity of First-Longest-Match Analysis Time Complexity of FLM Analysis Lemma 4.1 The worst-case time complexity of FLM analysis using the backtracking DFA on input w Ω + is O( w 2 ). Proof. lower bound: α 1 = a, α 2 = a b, w = a m requires O(m 2 ) upper bound: each run from mode N to T Σ consumes at least one input symbol (and possibly inspects all input symbols), involving at most w i=1 i = n(n+1) transitions 2 if no Σ-mode is reached, lexerr is reported after w transitions Remark: possible improvement by tabular method (similar to Knuth-Morris-Pratt Algorithm for pattern matching in strings) Literature: Th. Reps: Maximal-Munch Tokenization in Linear Time, ACM TOPLAS 20(2), 1998, 259 273 8 of 26 Compiler Construction

First-Longest-Match Analysis with NFA A Backtracking NFA A similar construction is possible for the NFA method: 1. For every i [n]: A i = Q i, Ω, δ i, q (i) 0, F i NFA Ω with L(A i ) = α i by NFA method 2. Product automaton: Q := {q 0 } n i=1 Q i ε A 1 q 0 ε. 3. Partitioning of final states: M Q is called a T i -matching if A n M F i and for all j [i 1] : M F j = yields set of T i -matchings F (i) 2 Q M Q is called productive if there exists a productive q M yields productive state sets P 2 Q 4. Backtracking automaton: similar to DFA case 10 of 26 Compiler Construction

Longest Match in Practice Longest Match in Practice In general: lookahead of arbitrary length required that is, v unbounded in configurations (T, vqw, W) see Lemma 4.1: α 1 = a, α 2 = a b, w a + Modern programming languages (Pascal, Java,...): lookahead of one or two characters sufficient separation of keywords, identifiers, etc. by spaces Pascal: two-character lookahead required to distinguish 1.5 (real number) from 1..5 (integer range) However: principle of longest match not always applicable! 12 of 26 Compiler Construction

Longest Match in Practice Inadequacy of Longest Match I Example 4.2 (Longest match in FORTRAN) 1. Relational expressions valid lexemes:.eq. (relational operator), EQ (identifier), 12 (integer), 12.,.12 (reals) input string: 12.EQ. 12 12.EQ.12 (ignoring blanks!) intended analysis: (int, 12)(relop, eq)(int, 12) LM yields: (real, 12.0)(id, EQ)(real, 0.12) wrong interpretation 2. DO loops (correct) input string: DO 5 I = 1, 20 DO5I=1,20 intended analysis: (do, )(label, 5)(id, I)(gets, )(int, 1)(comma, )(int, 20) LM analysis (wrong): (id, DO5I)(gets, )(int, 1)(comma, )(int, 20) (erroneous) input string: DO 5 I = 1. 20 DO5I=1.20 LM analysis (correct): (id, DO5I)(gets, )(real, 1.2) 13 of 26 Compiler Construction

Longest Match in Practice Inadequacy of Longest Match II Example 4.3 (Longest match in C) valid lexemes: x (identifier) = (assignment) =- (subtractive assignment; K&R/ANSI-C: -=) 1, -1 (integers) input string: x=-1 intended analysis: (id, x)(gets, )(int, 1) LM yields: (id, x)(dec, )(int, 1) wrong interpretation Possible solutions: Hand-written (non-flm) scanners FLM with lookahead (later) 14 of 26 Compiler Construction

Regular Definitions Regular Definitions I Goal: modularising the representation of regular sets by introducing additional identifiers Definition 4.4 (Regular definition) Let {R 1,..., R n } be a set of symbols disjoint from Ω. A regular definition (over Ω) is a sequence of equations R 1 = α 1. R n = α n such that, for every i [n], α i RE Ω {R1,...,R i 1 }. Remark: since recursion is excluded, every R i can (iteratively) be substituted by a regular expression α RE Ω (otherwise = context-free languages) 16 of 26 Compiler Construction

Regular Definitions Regular Definitions II Example 4.5 (Symbol classes in Pascal) Identifiers: Numerals: Digits = Digit + (unsigned) Letter = A... Z a... z Digit = 0... 9 Id = Letter (Letter Digit) Empty = OptFrac =.Digits Empty OptExp = E (+ - Empty) Digits Empty Num = Digits OptFrac OptExp Rel. operators: RelOp = < <= = <> > >= Keywords: If = if Then = then Else = else 17 of 26 Compiler Construction

Generating Scanners Using [f]lex The [f]lex Tool Usage of [f]lex ( [fast] lexical analyzer generator ): spec.l [f]lex lex.yy.c cc a.out [f]lex specification Scanner (in C) Executable A [f]lex specification is of the form Definitions (optional) %% Rules %% Auxiliary procedures (optional) Program a.out Symbol sequence 19 of 26 Compiler Construction

Generating Scanners Using [f]lex [f]lex Specifications Definitions: C code for declarations etc.: %{ Code %} Regular definitions: Name RegExp... (non-recursive!) Rules: of the form Pattern { Action } Pattern: regular expression, possibly using Names Action: C code for computing symbol = (token, attribute) token: integer return value, 0 = EOF attribute: passed in global variable yylval lexeme: accessible by yytext matching rule found by FLM strategy lexical errors catched by. (any character) 20 of 26 Compiler Construction

Generating Scanners Using [f]lex Example [f]lex Specification %{#include <stdio.h> typedef enum {EOF, IF, ID, RELOP, LT,...} token_t; unsigned int yylval; /* attribute values */ %} LETTER [A-Za-z] DIGIT [0-9] ALPHANUM {LETTER} {DIGIT} SPACE [ \t\n] %% "if" { return IF; } "<" { yylval = LT; return RELOP; } {LETTER}{ALPHANUM}* { yylval = install_id(); return ID; } {SPACE}+ /* eat up whitespace */. { fprintf (stderr, "Invalid character %c \n", yytext[0]); } %% int main(void) { token_t token; while ((token = yylex())!= EOF) printf ("(Token %d, Attribute %d)\n", token, yylval); exit (0); } unsigned int install_id () {...} /* identifier name passed through yytext */ 21 of 26 Compiler Construction

Generating Scanners Using [f]lex Regular Expressions in [f]lex Syntax Meaning printable character this character \n, \t, \123, etc. newline, tab, octal representation, etc.. any character except \n [Chars] one of Chars; ranges possible ( 0-9 ) [^Chars] none of Chars \\, \., \[, etc. \,., [, etc. "Text" Text without interpretation of., [, \, etc. ^α α at beginning of line α$ α at end of line {Name} RegExp for Name α? zero or one α α* zero or more α α+ one or more α α{n, m} between n and m times α (, m optional) (α) α α 1 α 2 concatenation α 1 α 2 alternative α 1 /α 2 α 1 but only if followed by α 2 (lookahead) 22 of 26 Compiler Construction

Generating Scanners Using [f]lex Using the Lookahead Operator Example 4.6 (Lookahead in FORTRAN) 1. DO loops (cf. Example 4.2) input string: DO 5 I = 1, 20 LM yields: (id, )(gets, )(int, 1)(comma, )(int, 20) observation: decision for do token only possible after reading, specification of DO keyword in [f]lex, using lookahead: 2. IF statement DO / ({LETTER} {DIGIT})* = ({LETTER} {DIGIT})*, problem: FORTRAN keywords not reserved example: IF(I,J) = 3 (assignment to an element of matrix IF) conditional: IF (condition ) THEN... (with keyword IF) specification of IF keyword in [f]lex, using lookahead: IF / \(.* \) THEN 23 of 26 Compiler Construction

Generating Scanners Using [f]lex Longest Match and Lookahead in [f]lex %{#include <stdio.h> typedef enum {EOF, AB, A} token_t; %} %% ab { return AB; } a/bc { return A; }. { fprintf (stderr, "Invalid character %c \n", yytext[0]); } %% int main(void) { token_t token; while ((token = yylex())!= EOF) printf ("Token %d\n", token); exit (0); } returns on input a: Invalid character a ab: Token 1 abc: Token 2 Invalid character b Invalid character c lookahead counts for length of match 24 of 26 Compiler Construction

Preprocessing Preprocessing Preprocessing = preparation of source code before (lexical) analysis Preprocessing steps macro substitution file inclusion #define is capital(ch) ((ch) >= A && (ch) <= Z ) #include "header.h" conditional compilation #ifdef UNIX char* separator = / #endif #ifdef WINDOWS char* separator = \\ #endif elimination of comments (or by scanner) 26 of 26 Compiler Construction