Simple Lexical Analyzer

Lecture 7: Simple Lexical Analyzer Dr Kieran T. Herley Department of Computer Science University College Cork 2017-2018 KH (03/10/17) Lecture 7: Simple Lexical Analyzer 2017-2018 1 / 1

Summary
Use of jflex to generate lexical analyzer for programming language.

TINY Programming Language

    { Factorial program in TINY }
    read x;
    if x > 0 then
      fact := 1;
      repeat
        fact := fact * x;
        x := x - 1
      until x = 0;
      write fact
    end

- Simple toy language
- Running example for cs4150
- Pascal-like syntax
- if-then-end, if-then-else-end, repeat-until, assignment, read and write

TINY cont'd

    { Factorial program in TINY }
    read x;
    if x > 0 then
      fact := 1;
      repeat
        fact := fact * x;
        x := x - 1
      until x = 0;
      write fact
    end

Language Features
- Semicolons as separators, not terminators
- Integer variables only; no declarations
- Arithmetic expressions: variables, constants, +, -, *, /, ( )
- Boolean expressions: arithmetic expressions, <, =
- read, write perform simple I/O
- Comments enclosed in { }

TINY's Tokens

    Reserved Words     if, then, else, end, repeat, until, read, write
    Special Symbols    + - * / = < ( ) ; :=
    Numbers            One or more digits
    Identifiers        One or more letters
    (Comments)         Any sequence of symbols (other than }) enclosed in { ... }

Tiny Scanner Simplified
- Simplified version (TinyScanner1.flex) will merely categorize and list tokens
- One jflex rule per token type: patterns specify token structure; actions are System.out.println() calls

    %%
    %class TinyScanner
    %standalone
    ...
    DEFINITIONS
    ...
    %%
    ...
    "if" { System.out.println("IF"); }
    ...

Illustration

    { Factorial ... }
    read x;
    if x > 0 then
      fact := 1;
      repeat
        fact := fact * x;
        x := x - 1
      until x = 0;
      write fact
    end

    > jflex TinyScanner1.flex
    > javac TinyScanner.java
    > java TinyScanner < sample.tny
    READ ID SEMI IF NUM LT ID THEN ID ASSIGN NUM SEMI ... etc.

Some Useful Definitions

    digit      = [0-9]
    number     = {digit}+
    letter     = [a-zA-Z]
    identifier = {letter}+
    newline    = \n
    whitespace = [ \t]+
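The definitions above can be tried out without running jflex, since each one has a direct counterpart in java.util.regex. The following is only a sketch for experimenting: the class name TinyDefinitions and its helper methods are made up for this illustration and are not part of the lecture's code.

```java
import java.util.regex.Pattern;

// Plain-Java counterparts of the jflex macros (illustrative names, not lecture code).
final class TinyDefinitions {
    static final Pattern NUMBER = Pattern.compile("[0-9]+");        // number = {digit}+
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z]+"); // identifier = {letter}+
    static final Pattern WHITESPACE = Pattern.compile("[ \\t]+");   // whitespace = [ \t]+

    // matches() requires the whole string to fit the pattern.
    static boolean isNumber(String s) { return NUMBER.matcher(s).matches(); }
    static boolean isIdentifier(String s) { return IDENTIFIER.matcher(s).matches(); }
    static boolean isWhitespace(String s) { return WHITESPACE.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isNumber("2017"));     // true
        System.out.println(isIdentifier("fact")); // true
        System.out.println(isIdentifier("x1"));   // false: identifiers are letters only
    }
}
```

Note that, with these definitions, a name such as x1 is not a single identifier; a scanner would split it into an identifier followed by a number.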

Rules for Reserved Words and Symbols

    "if"   { System.out.println("IF"); }
    "then" { System.out.println("THEN"); }
    "else" { System.out.println("ELSE"); }
    "end"  { System.out.println("END"); }
    ... ETC ...
    ":="   { System.out.println("ASSIGN"); }
    "="    { System.out.println("EQ"); }
    "<"    { System.out.println("LT"); }
    ... ETC ...
    {number}     { System.out.printf("NUM (%d)\n", Integer.parseInt(yytext())); }
    {identifier} { System.out.printf("ID (%s)\n", yytext()); }
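Several of these rules can match the same stretch of input: the text write fits both a reserved-word rule and {identifier}. The sketch below imitates, in plain Java, the policy of preferring the longest match and breaking ties in favour of the earlier rule. It is a hypothetical helper (RuleChooser and its rule table are invented here), not what jflex generates; jflex compiles the rules into a single automaton rather than trying each pattern in turn.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: imitates how a rule is chosen, one token at a time.
final class RuleChooser {
    // Insertion order stands in for the order of rules in the specification.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("WRITE", Pattern.compile("write"));
        RULES.put("NUM", Pattern.compile("[0-9]+"));
        RULES.put("ID", Pattern.compile("[a-zA-Z]+"));
    }

    /** Returns "NAME(text)" for the winning rule at the start of input, or null if no rule matches. */
    static String choose(String input) {
        String bestName = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict '>' keeps the earlier rule when two matches have equal length.
            if (m.lookingAt() && m.end() > bestLength) {
                bestName = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestName == null ? null : bestName + "(" + input.substring(0, bestLength) + ")";
    }

    public static void main(String[] args) {
        System.out.println(choose("write fact")); // WRITE(write): tie with ID, earlier rule wins
        System.out.println(choose("writes"));     // ID(writes): the longer match wins
    }
}
```

Moving the ID entry to the top of the table would make every reserved word come out as ID, which is why the reserved-word rules are listed before {identifier}.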

Notes
Could merge reserved word and identifier rules:
- single rule for words (captures both reserved words and identifiers)
- list/map-based lookup function to distinguish identifiers from reserved words
- more efficient than approach overleaf (simpler N/DFA)
When more than one rule applies:
- jflex favours the longer match (e.g. := rather than =): "maximum munch"
- For matches of equal length, the earlier rule is favoured (e.g. the string write matches the write rule and also the {identifier} rule, but the former is favoured).
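The map-based lookup mentioned in the first note might look roughly as follows. This is a minimal sketch under assumptions: the class WordClassifier, the method classify and the string labels are invented for illustration; in a real scanner the single word rule's action would call something like classify(yytext()).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch of the lookup idea; names and labels are assumptions, not the lecture's code.
final class WordClassifier {
    private static final Map<String, String> RESERVED = new HashMap<>();
    static {
        String[] words = {"if", "then", "else", "end", "repeat", "until", "read", "write"};
        for (String w : words) {
            RESERVED.put(w, w.toUpperCase(Locale.ROOT)); // e.g. "if" -> "IF"
        }
    }

    /** Classifies text already matched by a single "word" rule ({letter}+). */
    static String classify(String word) {
        // Anything that is not a reserved word is an ordinary identifier.
        return RESERVED.getOrDefault(word, "ID");
    }

    public static void main(String[] args) {
        System.out.println(classify("repeat")); // REPEAT
        System.out.println(classify("fact"));   // ID
    }
}
```

With this arrangement the scanner specification needs only one pattern for all words, and adding a reserved word means adding one table entry.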

Rules for Whitespace and Comments

    {whitespace} { /* skip whitespace */ }
    \{[^}]*\}    { /* skip comments */ }
    {newline}    { /* skip newlines */ }
    ...
    .            { System.out.printf("UNKNOWN SYMBOL(%s)\n", yytext()); }

- Simply skip whitespace, newlines and comments
- Last rule matches anything not matched by any other rule, e.g. extraneous symbols like #.
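The comment pattern reads as: an opening brace, any run of characters other than a closing brace, then a closing brace. Its behaviour can be checked with the equivalent java.util.regex pattern, as in this sketch (the class SkipDemo and the method stripComments are hypothetical names used only here).

```java
import java.util.regex.Pattern;

// Illustrative only: applies the comment pattern outside of jflex.
final class SkipDemo {
    // '{', then any characters except '}', then '}'. [^}] also matches newlines,
    // so a comment may span several lines.
    private static final Pattern COMMENT = Pattern.compile("\\{[^}]*\\}");

    /** Removes TINY comments; everything else is left untouched. */
    static String stripComments(String source) {
        return COMMENT.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripComments("{ Factorial program in TINY } read x;"));
    }
}
```

Because [^}]* cannot run past the first closing brace, each comment ends at the nearest }, so comments of this form do not nest.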

More Sophisticated Version: TinyScanner2
Facilitate integration with other compiler elements

Skeleton

    %%
    %class TinyScanner2
    %function nexttoken
    %type TinyToken
    ...
    %%
    ...
    "if" { return new TinyToken(TinyToken.TokenKind.RW_IF); }
    ...

- (Most) actions contain return
- jflex creates a "read the next token" method within the generated code:
  - named nexttoken (default yylex)
  - returns a TinyToken object (null at end of file)
- %function and %type options specify the method name and return type

Class TinyToken

    public class TinyToken {
        public TinyToken(TokenKind k) { kind = k; }
        ... OTHER METHODS ...
        public enum TokenKind {
            RW_IF, RW_THEN, RW_ELSE, RW_END,
            RW_REPEAT, RW_UNTIL, RW_READ, RW_WRITE,
            SYM_ASSIGN, SYM_EQ, SYM_LT, SYM_PLUS, SYM_MINUS,
            SYM_TIMES, SYM_OVER, SYM_LPAREN, SYM_RPAREN, SYM_SEMI,
            NUMBER, ID, ILLEGAL
        }
        private TokenKind kind;
        private int value;
        private String spelling;
    }

- Represents token data (kind etc.)
- TokenKind encodes token classification
- value: numerical value for NUMBERs
- spelling: e.g. for IDs
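The slide elides the "other methods". One plausible way to fill them in is sketched below, purely as a guess: only the one-argument constructor, the enum constants and the three fields come from the slide, while the class name TinyTokenSketch, the factory methods number and id, the getters and the toString format are assumptions made for this illustration.

```java
// A guess at a fuller token class; everything beyond the slide's constructor,
// enum and fields is hypothetical.
final class TinyTokenSketch {
    enum TokenKind {
        RW_IF, RW_THEN, RW_ELSE, RW_END, RW_REPEAT, RW_UNTIL, RW_READ, RW_WRITE,
        SYM_ASSIGN, SYM_EQ, SYM_LT, SYM_PLUS, SYM_MINUS, SYM_TIMES, SYM_OVER,
        SYM_LPAREN, SYM_RPAREN, SYM_SEMI, NUMBER, ID, ILLEGAL
    }

    private final TokenKind kind;
    private final int value;       // meaningful only for NUMBER
    private final String spelling; // meaningful only for ID

    TinyTokenSketch(TokenKind k) { this(k, 0, null); }

    private TinyTokenSketch(TokenKind k, int value, String spelling) {
        this.kind = k;
        this.value = value;
        this.spelling = spelling;
    }

    // Assumed convenience factories for the two token kinds that carry data.
    static TinyTokenSketch number(int v) { return new TinyTokenSketch(TokenKind.NUMBER, v, null); }
    static TinyTokenSketch id(String s) { return new TinyTokenSketch(TokenKind.ID, 0, s); }

    TokenKind getKind() { return kind; }
    int getValue() { return value; }
    String getSpelling() { return spelling; }

    @Override
    public String toString() {
        switch (kind) {
            case NUMBER: return "NUMBER(" + value + ")";
            case ID:     return "ID(" + spelling + ")";
            default:     return kind.name();
        }
    }
}
```

With factories like these, a number rule's action could return TinyTokenSketch.number(Integer.parseInt(yytext())) and an identifier rule's action TinyTokenSketch.id(yytext()).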

Using TinyScanner2

    TinyToken current;
    TinyScanner2 scanner = null;
    scanner = new TinyScanner2(new FileReader("sample.tny"));
    current = scanner.nexttoken();
    while (current != null) {
        System.out.printf("Token [%s]\n", current.tostring());
        current = scanner.nexttoken();
    }

(Some exception-handling code omitted for clarity.)

A Scanner for More Sophisticated Languages
- Same general approach works for many programming languages, including C
- Handling C-style comments?
- For non-toy languages (e.g. Java), capturing some aspects of lexical structure may require care:
  - String literals
  - Numerical literals (many formats)
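To give a feel for why these cases need care, here are two candidate patterns written with java.util.regex: one for C-style block comments and one for double-quoted string literals with backslash escapes. They are a sketch, not a complete treatment of C or Java lexical structure, and the class CLexemes and its method first are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative patterns only; a full scanner would need many more cases.
final class CLexemes {
    // "/*", then text containing no "*/", then the closing "*/".
    // The naive /\*.*\*/ would run on to the LAST "*/" in the input.
    static final Pattern BLOCK_COMMENT =
        Pattern.compile("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/");

    // Opening quote, then ordinary characters or backslash escapes, then closing quote.
    // Ordinary characters exclude the quote, the backslash and the newline.
    static final Pattern STRING_LITERAL =
        Pattern.compile("\"(?:[^\"\\\\\\n]|\\\\.)*\"");

    /** Returns the first substring of source matching p, or null if there is none. */
    static String first(Pattern p, String source) {
        Matcher m = p.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(first(BLOCK_COMMENT, "int x; /* a * b */ y /* c */"));
        System.out.println(first(STRING_LITERAL, "printf(\"a\\\"b\");"));
    }
}
```

Searching for each pattern separately, as done here, can be fooled by comment markers inside a string literal. A generated scanner avoids this because it consumes the input token by token from left to right.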

Our Next Assignment
- Should build scanner for C using jflex, but that's too easy
- Will instead use these ideas to build simple plagiarism detector for C programs
- Generate profile for programs based on feature counting
  - Count the number of occurrences of certain selected features, e.g. number of semicolons
- Programs with similar profiles are suspicious
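The feature-counting idea can be sketched in a few lines of Java. Everything specific here is an assumption made for illustration and not the assignment's specification: the choice of features, the class ProfileSketch, and the use of a sum of absolute differences to compare two profiles. The sketch also counts raw characters, so symbols inside comments and string literals are counted too; a jflex-based scanner could skip those.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical profile builder: counts a few single-character features.
final class ProfileSketch {
    private static final char[] FEATURES = {';', '{', '(', '='};

    /** Counts how often each selected feature occurs in the source text. */
    static Map<Character, Integer> profile(String source) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char f : FEATURES) {
            counts.put(f, 0);
        }
        for (char c : source.toCharArray()) {
            if (counts.containsKey(c)) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Sum of absolute differences between two profiles; 0 means identical counts. */
    static int distance(Map<Character, Integer> a, Map<Character, Integer> b) {
        int d = 0;
        for (char f : FEATURES) {
            d += Math.abs(a.get(f) - b.get(f));
        }
        return d;
    }

    public static void main(String[] args) {
        String original = "int main() { int x = 1; return x; }";
        String renamed  = "int main() { int total = 1; return total; }";
        System.out.println(distance(profile(original), profile(renamed))); // 0
    }
}
```

Renaming variables leaves such a profile unchanged, which is the point of the approach: superficial edits do not hide the similarity.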