Lexical and Syntax Analysis


Lexical and Syntax Analysis. In Text: Chapter 4. N. Meng, F. Poursardar

Lexical and Syntactic Analysis. Two steps discover the syntactic structure of a program: lexical analysis (the scanner) reads the input characters and outputs a sequence of tokens; syntactic analysis (the parser) reads the tokens, outputs a parse tree, and reports any syntax errors.

Compilation Process

Interaction between lexical analysis and syntactic analysis

Reasons to Separate Lexical and Syntax Analysis. Simplicity: less complex approaches can be used for lexical analysis, and separating the two simplifies the parser. Efficiency: separation allows the lexical analyzer to be optimized independently. Portability: parts of the lexical analyzer may not be portable, but the parser is always portable.

Scanner. A pattern matcher for character strings: if a character sequence matches a pattern, it is identified as a token. Responsibilities: tokenize the source; report lexical errors, if any; remove comments and whitespace; save the text of interesting tokens; save source locations; and (optionally) expand macros and implement preprocessor functions.

Tokenizing Source. Given a program, identify all lexemes and their categories (tokens).

Lexeme, Token, & Pattern. Lexeme: a sequence of characters in the source program with the lowest level of syntactic meaning, e.g., sum, +, -; lexemes are the basic building blocks of programs. Token: a category of lexemes; a lexeme is an instance of a token.

Token Examples

Token        Informal Description                        Sample Lexemes
keyword      All keywords defined in the language        if, else
comparison   <, >, <=, >=, ==, !=                        <=, !=
id           Letter followed by letters and digits       pi, score, D2
number       Any numeric constant                        3.14159, 0, 6
literal      Anything surrounded by "s, excluding "      "core dumped"

Another Token Example. Consider the following example of an assignment statement: result = oldsum - value / 100; The tokens and lexemes of this statement are: result (identifier), = (assignment operator), oldsum (identifier), - (subtraction operator), value (identifier), / (division operator), 100 (integer literal), and ; (semicolon).

Lexeme, Token, & Pattern. Pattern: a description of the form that the lexemes of a token may take, typically specified with regular expressions.

Motivating Example. Token set:

assign -> :=
plus   -> +
minus  -> -
times  -> *
div    -> /
lparen -> (
rparen -> )
id     -> letter (letter | digit)*
number -> digit digit* | digit* (. digit | digit .) digit*

Motivating Example. What are the lexemes in the string a_var:=b*3? What are the corresponding tokens? How do you identify the tokens?

Lexical Analysis. Three approaches to building a lexical analyzer: (1) write a formal description of the tokens and use a software tool that constructs a table-driven lexical analyzer from that description; (2) design a state diagram that describes the tokens and write a program that implements the state diagram; (3) design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram.
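The third approach can be sketched as a transition table indexed by state and character class. This is a minimal illustration under assumed state names and token classes (not from the text): it recognizes the identifier or integer literal at the front of a string.

```c
#include <ctype.h>

/* States: three active, two accepting, one rejecting. */
enum { START, IN_ID, IN_NUM, DONE_ID, DONE_NUM, ERROR };
/* Character classes. */
enum { LETTER, DIGIT, OTHER };

static int char_class(int c) {
    if (isalpha(c)) return LETTER;
    if (isdigit(c)) return DIGIT;
    return OTHER;
}

/* transition[state][class]: the hand-constructed table. */
static const int transition[3][3] = {
    /*             LETTER  DIGIT   OTHER    */
    /* START  */ { IN_ID,  IN_NUM, ERROR    },
    /* IN_ID  */ { IN_ID,  IN_ID,  DONE_ID  },
    /* IN_NUM */ { ERROR,  IN_NUM, DONE_NUM },
};

/* Return the token (DONE_ID or DONE_NUM) recognized at the front of s,
 * or ERROR if no lexeme starts there. */
int recognize(const char *s) {
    int state = START;
    for (; *s && state <= IN_NUM; s++)
        state = transition[state][char_class((unsigned char)*s)];
    if (state == IN_ID)  return DONE_ID;   /* end of input also accepts */
    if (state == IN_NUM) return DONE_NUM;
    return state;  /* DONE_ID, DONE_NUM, or ERROR */
}
```

The table is tiny only because the character classes collapse 52 letter transitions and 10 digit transitions into one column each, which is exactly the simplification discussed below.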

State Diagram. A state transition diagram, or just state diagram, is a directed graph. The nodes are labeled with state names; the arcs are labeled with the input characters that cause the transitions among the states. An arc may also include actions the lexical analyzer must perform when the transition is taken.

State Diagram. State diagrams of the form used for lexical analyzers are representations of a class of mathematical machines called finite automata. Finite automata can be designed to recognize members of a class of languages called regular languages. Regular grammars are generative devices for regular languages. The tokens of a programming language form a regular language, and a lexical analyzer is a finite automaton.

State Diagram Design. A naïve state diagram would have a transition from every state on every character in the source language; such a diagram would be very large, because every node would need a transition for every character in the character set of the language being analyzed. Solution: consider ways to simplify it.

State Diagram Design - Example. Design a lexical analyzer that recognizes only arithmetic expressions, with variable names and integer literals as operands. Assume that variable names consist of uppercase letters, lowercase letters, and digits, but must begin with a letter; names have no length limitation. How many transitions leave the initial state? How can we simplify the diagram?

Example (continued). There are 52 different characters (any uppercase or lowercase letter) that can begin a name, which would require 52 transitions from the transition diagram's initial state. However, a lexical analyzer is interested only in determining that a lexeme is a name; it is not concerned with which specific name it happens to be. Therefore, we define a character class named LETTER for all 52 letters and use a single transition on the first letter of any name.

Example (continued). Another opportunity for simplifying the transition diagram is the integer literal tokens. There are 10 different characters that could begin an integer literal lexeme, which would require 10 transitions from the start state of the state diagram. Instead, we define a character class named DIGIT for the digits and use a single transition on any character in this class to a state that collects integer literals.

Lexical Analysis (continued). In many cases, transitions can be combined to simplify the state diagram. When recognizing an identifier, all uppercase and lowercase letters are equivalent: use a character class that includes all letters. When recognizing an integer literal, all digits are equivalent: use a digit class.

Lexical Analysis (continued). Reserved words and identifiers can be recognized together, rather than having a part of the diagram for each reserved word. Use a table lookup to determine whether a possible identifier is in fact a reserved word.
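The table lookup can be sketched as follows. The word list and token codes here are illustrative assumptions (only IDENT 11 matches the token codes defined later in this section): after scanning a name, the scanner checks it against a table of reserved words and returns the keyword's code if found, or the identifier code otherwise.

```c
#include <string.h>

#define IDENT 11  /* token code for ordinary identifiers */

/* Hypothetical reserved-word table; a real scanner would list
 * every keyword of the language being analyzed. */
static const struct { const char *word; int code; } reserved[] = {
    { "else", 31 }, { "for", 32 }, { "if", 33 }, { "while", 34 },
};

/* Return the keyword's token code, or IDENT if the lexeme is not reserved. */
int lookup(const char *lexeme) {
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(lexeme, reserved[i].word) == 0)
            return reserved[i].code;
    return IDENT;
}
```

With this design the state diagram needs only one path for names; the distinction between identifier and keyword is made once, after the whole lexeme has been accumulated.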

State Diagram

Lexical Analysis (continued). Convenient utility subprograms: getchar - gets the next character of input, puts it in nextchar, determines its character class, and puts the class in charclass; addchar - appends the character in nextchar to the lexeme being accumulated; lookup - determines whether the string in lexeme is a reserved word (returns a code).

/* Function declarations */
void addchar();
void getchar();
void getnonblank();
int lex();

/* Character classes */
#define LETTER 0
#define DIGIT 1
#define UNKNOWN 99

/* Token codes */
#define INT_LIT 10
#define IDENT 11
#define ASSIGN_OP 20
#define ADD_OP 21
#define SUB_OP 22
#define MULT_OP 23
#define DIV_OP 24
#define LEFT_PAREN 25
#define RIGHT_PAREN 26

Implementation Pseudo-code

static TOKEN nexttoken;
static CHAR_CLASS charclass;

int lex() {
    switch (charclass) {
        case LETTER:
            addchar();   /* add nextchar to lexeme */
            getchar();   /* get the next character and determine its class */
            while (charclass == LETTER || charclass == DIGIT) {
                addchar();
                getchar();
            }
            nexttoken = ID;
            break;

        case DIGIT:
            addchar();
            getchar();
            while (charclass == DIGIT) {
                addchar();
                getchar();
            }
            nexttoken = INT_LIT;
            break;
        case EOF:
            nexttoken = EOF;
            lexeme[0] = 'E';
            lexeme[1] = 'O';
            lexeme[2] = 'F';
            lexeme[3] = '\0';
    }
    printf("Next token is: %d, Next lexeme is %s\n", nexttoken, lexeme);
    return nexttoken;
} /* End of function lex */

Lexical Analyzer Implementation: front.c (pp. 166-170). Following is the output of the lexical analyzer of front.c when used on (sum + 47) / total:

Next token is: 25  Next lexeme is (
Next token is: 11  Next lexeme is sum
Next token is: 21  Next lexeme is +
Next token is: 10  Next lexeme is 47
Next token is: 26  Next lexeme is )
Next token is: 24  Next lexeme is /
Next token is: 11  Next lexeme is total
Next token is: -1  Next lexeme is EOF

The Parsing Problem. Given an input program, the goals of the parser are to find all syntax errors (for each, produce an appropriate diagnostic message and recover quickly) and to produce the parse tree, or at least a trace of the parse tree, for the program.

The Parsing Problem (continued). The Complexity of Parsing. Parsers that work for any unambiguous grammar are complex and inefficient: O(n^3), where n is the length of the input. Compilers therefore use parsers that work for only a subset of all unambiguous grammars but run in linear time: O(n).

Two Classes of Grammars. Left-to-right, Leftmost derivation (LL); Left-to-right, Rightmost derivation (LR). We can build parsers for these grammars that run in linear time.

Grammar Comparison

LL:
E  -> T E'
E' -> + T E' | ε
T  -> F T'
T' -> * F T' | ε
F  -> id

LR:
E -> E + T | T
T -> T * F | F
F -> id

Two Categories of Parsers. LL(1) parsers: L - scanning the input from left to right; L - producing a leftmost derivation; 1 - using one input symbol of lookahead at each step to make parsing action decisions. LR(1) parsers: L - scanning the input from left to right; R - producing a rightmost derivation in reverse; 1 - as above.

Two Categories of Parsers. LL(1) parsers (predictive parsers) work top down: they build the parse tree from the root, finding a leftmost derivation for the input string. LR(1) parsers (shift-reduce parsers) work bottom up: they build the parse tree from the leaves, reducing the input string to the start symbol of the grammar.

Top-down Parsers. Given a sentential form xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A. The most common top-down parsing algorithms: recursive descent (a coded implementation) and LL parsers (a table-driven implementation).

Bottom-up Parsers. Given a right sentential form α, determine what substring of α is the right-hand side of the rule in the grammar that must be reduced to produce the previous sentential form in the rightmost derivation. The most common bottom-up parsing algorithms are in the LR family.

Recursive Descent Parsing. Parsing is the process of tracing or constructing a parse tree for a given input string. Parsers usually do not analyze lexemes; that is done by a lexical analyzer, which is called by the parser. A recursive descent parser traces out a parse tree in top-down order; it is a top-down parser. Each nonterminal has an associated subprogram, which parses all sentential forms that the nonterminal can generate. The recursive descent parsing subprograms are built directly from the grammar rules. Recursive descent parsers, like other top-down parsers, cannot be built from left-recursive grammars.

Recursive Descent Example. For the grammar:

<term> -> <factor> {(* | /) <factor>}

a simple recursive descent parsing subprogram is:

void term() {
    factor();  /* parse the first factor */
    while (next_token == ast_code || next_token == slash_code) {
        lexical();  /* get next token */
        factor();   /* parse the next factor */
    }
}
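To make the one-subprogram sketch above concrete, here is a self-contained recursive descent parser, written as an assumption-laden illustration rather than the book's front.c: it adds the usual companion rules <expr> -> <term> {(+ | -) <term>} and <factor> -> id | ( <expr> ), and reads pre-scanned tokens from an array instead of calling a real lexical analyzer.

```c
/* Token codes (illustrative; a real scanner would supply these). */
enum { ID, PLUS, MINUS, TIMES, SLASH, LPAREN, RPAREN, END };

static const int *tokens;  /* current token stream */
static int pos, ok;        /* position and error flag */

static int  peek(void)    { return tokens[pos]; }
static void advance(void) { pos++; }

static void expr(void);

/* <factor> -> id | ( <expr> ) */
static void factor(void) {
    if (peek() == ID) {
        advance();
    } else if (peek() == LPAREN) {
        advance();
        expr();
        if (peek() == RPAREN) advance(); else ok = 0;
    } else {
        ok = 0;  /* syntax error: no rule applies */
    }
}

/* <term> -> <factor> {(* | /) <factor>} */
static void term(void) {
    factor();                                    /* parse the first factor */
    while (peek() == TIMES || peek() == SLASH) {
        advance();                               /* consume * or / */
        factor();                                /* parse the next factor */
    }
}

/* <expr> -> <term> {(+ | -) <term>} */
static void expr(void) {
    term();
    while (peek() == PLUS || peek() == MINUS) {
        advance();
        term();
    }
}

/* Returns 1 if the END-terminated token stream is a valid <expr>. */
int parse(const int *toks) {
    tokens = toks; pos = 0; ok = 1;
    expr();
    return ok && peek() == END;
}
```

Note how each subprogram mirrors its grammar rule: the EBNF braces become a while loop, and the alternatives in <factor> become an if-else chain chosen by one token of lookahead, which is exactly the LL(1) property.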
