Lexical and Syntax Analysis

COS 301 Programming Languages
Lexical and Syntax Analysis (Sebesta, Ch. 4)

Syntax analysis
- Programming languages are compiled, interpreted, or hybrid.
- All of them have to do syntax analysis.
- For a compiled language: parse trees.

Overall syntax analysis
  Program (string of characters)
    -> Lexical analyzer -> tokens and lexemes
    -> Syntax analyzer -> parse trees (decorated)

Why separate phases?
- Different difficulties:
  - Lexical analysis is simple, so a simple approach suffices. It is worth optimizing, since a lot of time is spent here.
  - Syntax analysis is more complex and needs a more complex approach.
- Portability:
  - Syntax analyzer: portable
  - Lexical analyzer: maybe not
- But: they may not really be totally separate phases.

Lexical and syntax analysis
- Lexical analysis:
  - Low-level analysis: looking for identifiers, constants
  - Needs a regular grammar
  - Finite state machine (automaton)
- Syntax analysis:
  - Needs a context-free (or attribute) grammar
  - Pushdown automaton (recursive transition network)

Lexical analysis
- Pattern matching: the lexical analyzer (LA) is a pattern matcher.
- Input: a string of characters
- Looks for patterns: lexemes (e.g., myarray)
- Also determines the categories of lexemes:
  - Categories = tokens (e.g., identifier)
  - Often represented by a numeric code
- Output: tokens + lexemes
- Strips out comments and whitespace

Tokens
- Identifiers
- Literals:
  - Numbers: 2, 3, 5.7, 3E4
  - Characters: 'x'
  - Strings: "foo"
  - Booleans: TRUE
- Keywords/reserved words: while, if, etc.
- Operators: +, -, *, /, **, ^, etc.
- Punctuation: ; ( ) { } [ ]

Non-token strings
- Whitespace (space, tab, ...)
  - Sometimes not just discarded (e.g., Python)
- Comments
- EOL
  - Some operating systems: EOL+newline
  - Sometimes whitespace (C, C++, Java, Lisp, ...)
  - Sometimes statement separators (FORTRAN, Basic)
- EOF

Example output
Input: foo = foo * PI / 2;

  Token       Lexeme
  IDENT       foo
  ASSIGN_OP   =
  IDENT       foo
  MULT_OP     *
  IDENT       PI
  DIV_OP      /
  INT_LIT     2
  SEMICOLON   ;

Building a lexical analyzer
- One way:
  - Write a regular grammar of the tokens
  - Give it to lex, flex, flex++, etc., which produce a table-driven lexical analyzer
- Another way:
  - Draw a state transition diagram for the tokens
  - Write a custom program to implement it
- Third way:
  - Draw a state transition diagram
  - Construct a table-driven implementation

Review: Chomsky hierarchy
Four levels of languages (grammars). Each can be recognized/generated by an automaton (formal machine):

  Language (grammar)        Automaton
  Regular                   Finite-state automaton
  Context-free              Pushdown automaton
  Context-sensitive         Linear-bounded automaton
  Recursively enumerable    Turing machine

- CFGs are needed for syntax.
- Regular grammars are sufficient for lexical analysis, so the state diagram for the LA should represent an FSA.

Grammars
- Regular grammars:
  - LHS: a single nonterminal
  - RHS: at most 1 nonterminal, rightmost/leftmost
- Context-free grammars: only one nonterminal on the LHS
- Context-sensitive grammars:
  - LHS: any number of terminals and nonterminals
  - The sentential form cannot shrink during a derivation
- Recursively enumerable (unrestricted) grammars

Regular grammars
- Tuple {P, T, N, S}:
  - P = productions
  - T = terminals
  - N = nonterminals
  - S = start symbol(s)
- Must be right- or left-regular

Right regular grammars
- RHS contains at most 1 nonterminal
- The nonterminal must be the rightmost symbol
- Let ω ∈ T*, A, B ∈ N; productions:
    A -> ωB
    A -> ω
- E.g.: let a = an alphanumeric character, and n = numeral:
    S -> aR
    R -> aR
    R -> nR

Left regular grammars
- Same, except the nonterminal is on the left:
    A -> Bω
    A -> ω

Linear grammars
- Linear grammars allow both kinds of rules (right- and left-regular).
- Not strictly a regular grammar: more powerful.
- E.g.: balancing (), {}, begin/end
  - Regular grammar: no
  - Linear grammar: yes
- E.g.: {a^n b^n | n ≥ 1}
    S -> aAb            S -> aA
    A -> S | ε    or    A -> Sb | b
- Regular languages ⊂ linear languages ⊂ CF languages

Example regular grammar: integers
Right-regular grammar for whole numbers:
    <num>  -> 0 | 1<num2> | 2<num2> | ... | 9<num2>
    <num2> -> 0<num2> | 1<num2> | 2<num2> | ... | 9<num2> | ε
As EBNF:
    <num>  -> (0 | (1 | ... | 9) {(0 | 1 | ... | 9)})

Finite state automata (machines)
- Automaton = abstract machine
- Two types:
  - nondeterministic FSA (NFSA)
  - deterministic FSA (DFSA)
- Only the DFSA is useful for our purposes.
- Equivalent in power: every NFSA can be converted to an equivalent DFSA.
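The right-regular grammar for whole numbers maps almost directly onto a small recognizer: each nonterminal becomes a state, and each production becomes a transition. The following is a minimal sketch under that reading, not code from the slides. The function name is_whole_number, the state names, and the extra DONE_ZERO state (for the lone 0 alternative) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* States mirror the nonterminals of the grammar. DONE_ZERO is an extra
   accepting state for the lone "0" alternative (hypothetical names). */
enum num_state { NUM, NUM2, DONE_ZERO };

/* Returns 1 if s is a whole number derivable from <num>, otherwise 0. */
int is_whole_number(const char *s) {
    assert(s != NULL);
    enum num_state state = NUM;
    for (; *s != '\0'; s++) {
        char c = *s;
        switch (state) {
        case NUM:
            if (c == '0')
                state = DONE_ZERO;          /* <num> -> 0 */
            else if (c >= '1' && c <= '9')
                state = NUM2;               /* <num> -> 1<num2> | ... | 9<num2> */
            else
                return 0;
            break;
        case NUM2:
            if (c >= '0' && c <= '9')
                state = NUM2;               /* <num2> -> 0<num2> | ... | 9<num2> */
            else
                return 0;
            break;
        case DONE_ZERO:
            return 0;                       /* nothing may follow a leading 0 */
        }
    }
    /* <num2> -> ε makes NUM2 accepting; DONE_ZERO accepts the string "0". */
    return state == NUM2 || state == DONE_ZERO;
}
```

Under this sketch, "0" and "42" are accepted, while "007" and the empty string are rejected, which matches the grammar: a leading 0 is only allowed when it is the whole number.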

DFSA
- DFSA: a formal machine with a finite number of states
- Accepts input from a tape
- State + input symbol -> unique next state
- Start state, accepting (end) state(s)
- Transitions: each consumes (reads) a symbol
- Accepts a string when it reaches an accepting state and there is no more input left
- Else: error

Uses of FSAs
- Language recognition
- Describing other things
- Controlling things (i.e., representing simple programs)

FSA as graph
- FSAs can be represented as directed graphs
- Nodes = states
- Input alphabet + end-of-input symbol
- The state transition function is represented by directed edges in the graph, labeled with a symbol or a set of symbols
- Unique start state
- One or more final (accepting) states

Example: Vending machine
[Figure: state diagram of a vending machine.] Adapted from Wulf, Shaw, Hilfinger, Flon, Fundamental Structures of Computer Science, p. 17.

Example: Battery charger
[Figure: state diagram of a battery charger.] From http://www.jcelectronica.com/articles/state_machines.htm

Regular expressions
- Regular expressions are an alternative to regular grammars.
- They specify a language at the lexical level.
- Also used in text processing and web applications.
- Built-in support in many languages, e.g., Perl, Ruby, Java, Javascript, Python, .NET languages

Regular expression conventions

  Regex   Meaning
  x       a character x (stands for itself)
  \x      an escaped character, e.g., \n
  M | N   M or N
  M N     M followed by N

Note: the meaning of \ varies with software. Typical usage:
- certain non-printable characters (e.g., \n = newline and \t = tab)
- ASCII hex (\xff) or Unicode hex (\xffff)
- shorthand character classes (\w = word, \s = whitespace, \d = digit)
- escaping a literal, e.g., \* or \.

Meta-symbols

  Regex   Meaning
  M+      one or more occurrences of M
  M?      zero or one occurrence of M
  M*      zero or more occurrences of M
  [ ]     surrounds a range or set: one of these
          e.g., [aeiou] the set of vowels
          e.g., [0-9] the set of digits
          e.g., [A-Za-z0-9] the set of alphanumeric characters
  .       any single character
  ( )     grouping

Regex example
Let Σ = { a, b, c } and r = (a|b)*c
This regex specifies repetition (0, 1, 2, etc. occurrences) of either a or b, followed by c. Strings that match this regular expression include:
  c   ac   bc   abc   aabbaabbc

Regex example
Let Σ = { a, b, c } and r = (a|c)*b(a|c)*
This regular expression specifies repetition of either a or c, followed by b, followed by repetition of either a or c. Matches include:
  b   ab   bcccc   abc   aaccaab   aacabccca

Regex example: signed integers
- Leading +/- (optional)
- At least 1 digit in 0..9
- Regex: (\+|\-)?[0-9]+
- Matches include +1, 0, -0, 827356, -98686

Regex example: signed floating-point numbers
Create a regular expression to represent a signed floating-point number. There is an optional leading sign (+ or -), followed by 1 or more digits in the range 0..9, optionally followed by a decimal point and then 1 or more digits in the range 0..9. The \. symbol indicates that . is the literal period and not the . symbol for any character.
1. (\+|\-)?[0-9]+(\.[0-9]+)?
2. [-+]?([0-9]+\.[0-9]+|[0-9]+)
3. [-+]?[0-9]+\.?[0-9]* (this one will allow "9.")
This illustrates how complex regexes can be!
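The difference between the strict and the loose floating-point patterns can be checked with a regex library. The sketch below assumes a POSIX system that provides <regex.h> (regcomp, regexec, regfree) and uses extended regular expressions. The helper regex_matches and the constants STRICT_FLOAT and LOOSE_FLOAT are hypothetical names. The sign is written as the bracket class [-+] as in patterns 2 and 3, and ^ and $ are added so the whole string must match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Pattern 1 from the slide, with the sign written as [-+] and anchors added. */
static const char STRICT_FLOAT[] = "^[-+]?[0-9]+(\\.[0-9]+)?$";
/* Pattern 3 from the slide, anchored; it also accepts a trailing period. */
static const char LOOSE_FLOAT[] = "^[-+]?[0-9]+\\.?[0-9]*$";

/* Returns 1 if text matches the POSIX extended regex pattern,
   0 if it does not, and -1 if the pattern fails to compile. */
int regex_matches(const char *pattern, const char *text) {
    assert(pattern != NULL && text != NULL);
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, regex_matches(STRICT_FLOAT, "9.") reports no match, while regex_matches(LOOSE_FLOAT, "9.") reports a match, which is the weakness of pattern 3 noted on the slide.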

DFSA for regular grammar
E.g.: a DFSA that accepts binary strings with an even number of 1 bits
- Right regular grammar:
    A -> 0A | 1B | ε
    B -> 0B | 1A
- Regex: 0*(10*10*)*
- [State diagram: A is the start state and the accepting state; a 0 loops on A and on B; a 1 moves from A to B and from B to A.]

Regex libraries
- Many are available online
- See for example http://regexlib.com/default.aspx

Lexical analysis: state transition diagram
- For recognizing/generating regular languages
- A DFSA
  - Nodes = states
  - Arcs = transitions between states
  - Labels: input characters
  - Actions (optional)
- Labels can be classes of characters (e.g., [0-9], [A-Za-z], etc.)
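A DFSA like the even-number-of-1s example is often implemented with a transition table indexed by state and input symbol. The sketch below is one possible table-driven version, using the two states A and B from the grammar. The names STATE_A, STATE_B, next_state, and has_even_ones are hypothetical, and any character other than 0 or 1 is treated as an error.

```c
#include <assert.h>
#include <stddef.h>

/* The two states of the DFSA: A = even number of 1s seen so far (start,
   accepting), B = odd number of 1s seen so far. */
enum { STATE_A = 0, STATE_B = 1, NUM_STATES = 2 };

/* next_state[state][bit]: a 0 keeps the state, a 1 flips it. */
static const int next_state[NUM_STATES][2] = {
    /* on 0     on 1   */
    { STATE_A, STATE_B },   /* from A */
    { STATE_B, STATE_A },   /* from B */
};

/* Returns 1 if bits is a binary string with an even number of 1s, else 0. */
int has_even_ones(const char *bits) {
    assert(bits != NULL);
    int state = STATE_A;
    for (const char *p = bits; *p != '\0'; p++) {
        if (*p != '0' && *p != '1')
            return 0;                       /* symbol outside the alphabet */
        state = next_state[state][*p - '0'];
    }
    return state == STATE_A;                /* accept only in state A */
}
```

The loop body never changes when the automaton does: a different DFSA over the same alphabet only needs a different table and set of accepting states.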

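An FSA for identifiers
[State diagram: start state S goes to state 1 on Letter; state 1 loops on Letter, Digit; an ε arc leads from state 1 to F, an explicit accepting state.]
Could also draw as:
[State diagram: start state S goes to accepting state 1 on L; state 1 loops on L, D.]

What language is this?
What language is described by this diagram?
[State diagram not recoverable from the transcript: start state S with arcs labeled a, m, and d.]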

Lexical syntax for a simple C-like language

  anychar      [ -~]       Note: space (0x20) to tilde (0x7E)
  Letter       [a-zA-Z]
  Digit        [0-9]
  Whitespace   [ \t]       Again, note the literal space (0x20)
  EOL          \n
  EOF          \004

  Keyword      bool | char | else | false | float | if | int | main | true | while
  Identifier   {Letter}({Letter} | {Digit})*
  integerlit   {Digit}+
  floatlit     {Digit}+\.{Digit}+
  charlit      '{anychar}'
  Operator     = | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
  Separator    : | . | { | } | ( | )
  Comment      // ({anychar} | {Whitespace})* {eol}

Some common FSA conventions
- Unlabeled arc: any other valid input symbol.
- Recognition of a token ends in a final state.
- Recognition of a non-token (e.g., whitespace, comment) transitions back to the start state.
- Recognition of the end symbol (end of file) ends in a final state.

FSA
- The automaton must be deterministic.
  - Drop keywords; handle them separately with a lookup table.
  - We must consider all sequences with a common prefix together, e.g.:
    - floats and ints
    - comments and division

DFSA for a small C-like language
Legend: ws = whitespace, l = letter, d = digit, eoln = \n, eof = end of input; all other labels are literal.
[Figures: DFSA fragments for whitespace, // comments, the division operator, identifiers, ints and floats, single and double quotes, assignment and comparison, addition, and logical and bitwise AND.]
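Handling keywords with a lookup table means the automaton recognizes every word as an identifier first, and a separate step then checks whether the lexeme is reserved. The following sketch illustrates that step with the keyword list of the simple C-like language. The names word_kind, KEYWORDS, and classify_word are hypothetical, and a linear scan is assumed to be adequate for ten keywords.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Result of classifying a lexeme that the DFSA accepted as a word. */
enum word_kind { WORD_IDENTIFIER, WORD_KEYWORD };

/* Keyword table for the simple C-like language. */
static const char *const KEYWORDS[] = {
    "bool", "char", "else", "false", "float",
    "if", "int", "main", "true", "while"
};

/* Looks the lexeme up in the keyword table (case-sensitive). */
enum word_kind classify_word(const char *lexeme) {
    assert(lexeme != NULL);
    size_t count = sizeof KEYWORDS / sizeof KEYWORDS[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(lexeme, KEYWORDS[i]) == 0)
            return WORD_KEYWORD;
    }
    return WORD_IDENTIFIER;
}
```

With this split, classify_word("while") yields WORD_KEYWORD and classify_word("whilst") yields WORD_IDENTIFIER, and the DFSA itself needs no keyword-specific states.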

Lexical rules
    <id>    ::= <letter> | <letter> <id2>
    <id2>   ::= <letter> <id2> | <digit> <id2> | <letter> | <digit>
    <int>   ::= <digit> | <digit> <int>
    <other> ::= + | - | * | / | ( | )

State diagram implementation: lexical analyzer from the text, front.c (pp. 176-181)
Following is the output of the lexical analyzer of front.c when used on (sum + 47) / total

    Next token is: 25, Next lexeme is (
    Next token is: 11, Next lexeme is sum
    Next token is: 21, Next lexeme is +
    Next token is: 10, Next lexeme is 47
    Next token is: 26, Next lexeme is )
    Next token is: 24, Next lexeme is /
    Next token is: 11, Next lexeme is total
    Next token is: -1, Next lexeme is EOF

Program structure
- The program is a DFSA with global variables.
- Utility routines:
  - getChar: gets the next character of input, puts it in nextChar, determines its class, and puts the class in charClass
  - getNonBlank: advances over whitespace to the first character of a token
  - addChar: appends the character in nextChar to lexeme, where the lexeme is being accumulated
  - lookup: returns a token code; in front.c it handles operators and parentheses (a fuller lexer would also use a lookup to detect reserved words)

front.c

    /* front.c - a lexical analyzer for simple arithmetic expressions */
    #include <stdio.h>
    #include <ctype.h>

    /* Global declarations */
    /* Variables */
    int charClass;
    char lexeme[100];
    int nextChar;   /* int rather than char, so EOF differs from every character */
    int lexLen;
    int nextToken;
    FILE *in_fp;

    /* Function declarations */
    void addChar(void);
    void getChar(void);
    void getNonBlank(void);
    int lex(void);

    /* Character classes */
    #define LETTER 0
    #define DIGIT 1
    #define UNKNOWN 99

    /* Token codes */
    #define INT_LIT 10
    #define IDENT 11
    #define ASSIGN_OP 20
    #define ADD_OP 21
    #define SUB_OP 22
    #define MULT_OP 23
    #define DIV_OP 24
    #define LEFT_PAREN 25
    #define RIGHT_PAREN 26

    /* main driver */
    int main(void) {
        /* Open the input data file and process its contents */
        if ((in_fp = fopen("front.in", "r")) == NULL)
            printf("ERROR - cannot open front.in \n");
        else {
            getChar();
            do {
                lex();
            } while (nextToken != EOF);
            fclose(in_fp);
        }
        return 0;
    }

    /* lookup - a function to look up operators and parentheses
       and return the token */
    int lookup(char ch) {
        addChar();   /* every single-character lexeme is recorded */
        switch (ch) {
            case '(': nextToken = LEFT_PAREN;  break;
            case ')': nextToken = RIGHT_PAREN; break;
            case '+': nextToken = ADD_OP;      break;
            case '-': nextToken = SUB_OP;      break;
            case '*': nextToken = MULT_OP;     break;
            case '/': nextToken = DIV_OP;      break;
            default:  nextToken = EOF;         break;
        }
        return nextToken;
    }

    /* addChar - a function to add nextChar to lexeme */
    void addChar(void) {
        if (lexLen <= 98) {
            lexeme[lexLen++] = (char)nextChar;
            lexeme[lexLen] = 0;
        } else {
            printf("Error - lexeme is too long \n");
        }
    }

    /* getChar - a function to get the next character of input
       and determine its character class */
    void getChar(void) {
        if ((nextChar = getc(in_fp)) != EOF) {
            if (isalpha(nextChar))
                charClass = LETTER;
            else if (isdigit(nextChar))
                charClass = DIGIT;
            else
                charClass = UNKNOWN;
        } else {
            charClass = EOF;
        }
    }

    /* getNonBlank - a function to call getChar until it returns
       a non-whitespace character */
    void getNonBlank(void) {
        while (isspace(nextChar))
            getChar();
    }

    /* lex - a simple lexical analyzer for arithmetic expressions */
    int lex(void) {
        lexLen = 0;
        getNonBlank();
        switch (charClass) {
            case LETTER:    /* parse identifiers */
                addChar();
                getChar();
                while (charClass == LETTER || charClass == DIGIT) {
                    addChar();
                    getChar();
                }
                nextToken = IDENT;
                break;
            case DIGIT:     /* parse integer literals */
                addChar();
                getChar();
                while (charClass == DIGIT) {
                    addChar();
                    getChar();
                }
                nextToken = INT_LIT;
                break;
            case UNKNOWN:   /* parentheses and operators */
                lookup(nextChar);
                getChar();
                break;
            case EOF:       /* end of file */
                nextToken = EOF;
                lexeme[0] = 'E';
                lexeme[1] = 'O';
                lexeme[2] = 'F';
                lexeme[3] = 0;
                break;
        } /* end of switch */
        printf("Next token is: %d, Next lexeme is %s\n", nextToken, lexeme);
        return nextToken;
    }

Example output
Input: (sum + 47) / total

    Next token is: 25, Next lexeme is (
    Next token is: 11, Next lexeme is sum
    Next token is: 21, Next lexeme is +
    Next token is: 10, Next lexeme is 47
    Next token is: 26, Next lexeme is )
    Next token is: 24, Next lexeme is /
    Next token is: 11, Next lexeme is total
    Next token is: -1, Next lexeme is EOF

Quiz
1. Draw a DFSA that recognizes binary strings that start with 1 and end with 0.
2. Draw a DFSA that recognizes binary strings with at least three consecutive 1s.
3. Below is a BNF grammar for fractional numbers:
       S  -> -FN | FN
       FN -> DL | DL.DL
       DL -> D | D DL
       D  -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
   (a) Rewrite it as EBNF.
   (b) Now draw a corresponding DFSA.

Quiz answers
1. Draw a DFSA that recognizes binary strings that start with 1 and end with 0.
[State diagram: from start state S, a 1 leads to a state that loops on 1; a 0 from that state leads to the accepting state, which loops on 0 and returns to the previous state on 1.]

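Quiz answers
2. Draw a DFSA that recognizes binary strings with at least three consecutive 1s.
[State diagram: start state S loops on 0; three successive 1 arcs lead through two intermediate states to the accepting state; a 0 from either intermediate state returns to S; the accepting state loops on 1,0.]

3. Below is a BNF grammar for fractional numbers. Rewrite as EBNF:
       <s>  -> -<fn> | <fn>
       <fn> -> <dl> | <dl>.<dl>
       <dl> -> <d> | <d> <dl>
       <d>  -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
   As EBNF:
       <s>  -> [-]<fn>
       <fn> -> <dl>[.<dl>]
       <dl> -> <d>{<d>}
   And as a DFSA:
[State diagram: from start state S, an optional -, then one or more digits 0,1,...,9, then optionally a . followed by one or more digits 0,1,...,9.]
   Could also have had another state to handle -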
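The EBNF for fractional numbers, [-] digits [. digits], can also be written out as a hand-coded DFSA. The sketch below is one possible implementation, assuming a separate state for the leading - (the variant mentioned in the last remark). The state names and the function is_fraction are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* One state per position in the pattern [-] digits [. digits]. */
enum frac_state { START, SIGN, INT_PART, DOT, FRAC_PART };

/* Returns 1 if s is a fractional number per the quiz grammar, else 0. */
int is_fraction(const char *s) {
    assert(s != NULL);
    enum frac_state state = START;
    for (; *s != '\0'; s++) {
        int digit = isdigit((unsigned char)*s);
        switch (state) {
        case START:                 /* expect '-' or the first digit */
            if (*s == '-') state = SIGN;
            else if (digit) state = INT_PART;
            else return 0;
            break;
        case SIGN:                  /* a digit must follow the sign */
            if (digit) state = INT_PART;
            else return 0;
            break;
        case INT_PART:              /* more digits, or the decimal point */
            if (*s == '.') state = DOT;
            else if (!digit) return 0;
            break;
        case DOT:                   /* a digit must follow the point */
            if (digit) state = FRAC_PART;
            else return 0;
            break;
        case FRAC_PART:             /* only digits from here on */
            if (!digit) return 0;
            break;
        }
    }
    /* Accepting states: a complete integer part or fraction part. */
    return state == INT_PART || state == FRAC_PART;
}
```

Because DOT is not an accepting state, a string such as "12." is rejected, in line with <fn> -> <dl>[.<dl>], where the period must be followed by at least one digit.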