Lecture 7: Simple Lexical Analyzer

Dr Kieran T. Herley
Department of Computer Science
University College Cork
2017-2018

KH (03/10/17)

Summary

Use of jflex to generate lexical analyzer for programming language.

TINY Programming Language

    { Factorial program in TINY }
    read x;
    if 0 < x then
      fact := 1;
      repeat
        fact := fact * x;
        x := x - 1
      until x = 0;
      write fact
    end

Simple toy language
Running example for cs4150
Pascal-like syntax
if-then-end, if-then-else-end, repeat-until, assignment, read and write

Language Features

Semicolons as separators, not terminators
Integer vars. only; no declarations
Arithmetic expressions: vars, constants, +, -, *, /, ( )
Boolean expressions: arithmetic expressions, <, =
read, write perform simple i/o
Comments enclosed in { }

TINY's Tokens

    Reserved Words     if, then, else, end, repeat, until, read, write
    Special Symbols    + - * / = < ( ) ; :=
    Numbers            One or more digits
    Identifiers        One or more letters
    (Comments)         Any sequence of symbols (other than }) enclosed in { ... }

Tiny Scanner Simplified

Simplified version (TinyScanner1.flex) will merely categorize and list tokens
One jflex rule per token type:
    patterns specify token structure
    actions are System.out.println()

    %%
    %class TinyScanner
    %standalone
    ... DEFINITIONS ...
    %%
    ...
    "if"    { System.out.println("IF"); }
    ...

Illustration

    { Factorial ... }
    read x;
    if 0 < x then
      fact := 1;
      repeat
        fact := fact * x;
        x := x - 1
      until x = 0;
      write fact
    end

    >jflex TinyScanner1.flex
    >javac TinyScanner.java
    >java TinyScanner <sample.tny
    READ ID SEMI IF NUM LT ID THEN ID ASSIGN NUM SEMI &c &c

Some Useful Definitions

    digit      = [0-9]
    number     = {digit}+
    letter     = [a-zA-Z]
    identifier = {letter}+
    newline    = \n
    whitespace = [ \t]+

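The definitions above are ordinary regular expressions, so one rough way to see what they accept is to mirror them with java.util.regex. The sketch below does that; the class DefinitionCheck and its method names are hypothetical, and java.util.regex syntax is close to, but not identical to, jflex's.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of the lecture code): mirrors three of the
// jflex definitions as java.util.regex patterns so they can be tried out.
class DefinitionCheck {
    static final Pattern NUMBER = Pattern.compile("[0-9]+");         // {digit}+
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z]+");  // {letter}+
    static final Pattern WHITESPACE = Pattern.compile("[ \\t]+");    // [ \t]+

    // Each check requires the whole string to match the pattern.
    static boolean isNumber(String s) { return NUMBER.matcher(s).matches(); }
    static boolean isIdentifier(String s) { return IDENTIFIER.matcher(s).matches(); }
    static boolean isWhitespace(String s) { return WHITESPACE.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isNumber("2017"));      // true
        System.out.println(isIdentifier("fact"));  // true
        System.out.println(isIdentifier("x1"));    // false: letters only
        System.out.println(isWhitespace(" \t "));  // true
    }
}
```

Note that under these definitions a name such as x1 is not a single identifier: a scanner would split it into an identifier followed by a number.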
Rules for Reserved Words and Symbols

    "if"          { System.out.println("IF"); }
    "then"        { System.out.println("THEN"); }
    "else"        { System.out.println("ELSE"); }
    "end"         { System.out.println("END"); }
    ... ETC ...
    ":="          { System.out.println("ASSIGN"); }
    "="           { System.out.println("EQ"); }
    "<"           { System.out.println("LT"); }
    ... ETC ...
    {number}      { System.out.printf("NUM(%d)\n", Integer.parseInt(yytext())); }
    {identifier}  { System.out.printf("ID(%s)\n", yytext()); }

Notes

Could merge reserved word and identifier rules:
    single rule for words (captures both reserved words and identifiers)
    list/map-based lookup function to distinguish identifiers from reserved words
    more efficient than the approach on the previous slide (simpler N/DFA)
When more than one rule applies:
    jflex favours the longer match (e.g. := rather than =): "maximum munch"
    For matches of equal length, the earlier rule is favoured (e.g. the string write matches both the write rule and the {identifier} rule, but the former is favoured).

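The merged-rule idea in the first note can be sketched as a table lookup applied to whatever a single word rule matches. This is only an illustration: the class WordClassifier, the method classify, and the labels for reserved words not shown on the slides (REPEAT, UNTIL, WRITE) are assumptions.

```java
import java.util.Map;

// Hypothetical sketch: one "word" rule plus a table lookup, instead of one
// scanner rule per reserved word. All names here are illustrative only.
class WordClassifier {
    // TINY's eight reserved words mapped to the labels a scanner might print.
    private static final Map<String, String> RESERVED = Map.of(
        "if", "IF", "then", "THEN", "else", "ELSE", "end", "END",
        "repeat", "REPEAT", "until", "UNTIL", "read", "READ", "write", "WRITE");

    // Intended to be called with the text matched by a single {identifier}-style rule.
    static String classify(String word) {
        return RESERVED.getOrDefault(word, "ID");
    }

    public static void main(String[] args) {
        System.out.println(classify("write")); // WRITE
        System.out.println(classify("fact"));  // ID
    }
}
```

In a jflex specification the action of the {identifier} rule would call something like classify(yytext()), so the generated automaton only needs to recognise words, not each reserved word separately.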
Rules for Whitespace and Comments

    {whitespace}  { /* skip whitespace */ }
    \{[^}]*\}     { /* skip comments */ }
    {newline}     { /* skip newlines */ }
    ...
    .             { System.out.printf("UNKNOWN SYMBOL(%s)\n", yytext()); }

Simply skip whitespace, newlines and comments
Last rule matches anything not matched by any other rule, e.g. extraneous symbols like #.

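To see what the comment pattern matches, the same regular expression can be tried with java.util.regex. This is a sketch only: the class CommentSkipper and the method stripComments are hypothetical, and a real scanner skips comments as it goes rather than deleting them from the text first.

```java
import java.util.regex.Pattern;

// Hypothetical demonstration of the comment pattern, outside jflex.
class CommentSkipper {
    // Same shape as the jflex rule: '{', any run of characters other than '}', then '}'.
    private static final Pattern COMMENT = Pattern.compile("\\{[^}]*\\}");

    // Returns the source text with every { ... } comment removed.
    static String stripComments(String source) {
        return COMMENT.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripComments("{ Factorial }read x; { input }write x"));
        // prints: read x; write x
    }
}
```

Because [^}] also matches newline characters, a comment may span several lines. The pattern stops at the first }, so comments cannot be nested.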
More Sophisticated Version: TinyScanner2

Facilitate integration with other compiler elements

Skeleton

    %%
    %class TinyScanner2
    %function nexttoken
    %type TinyToken
    ...
    %%
    ...
    "if"    { return new TinyToken(TinyToken.TokenKind.RW_IF); }
    ...

(Most) actions contain return
jflex creates a "read the next token" method within generated code
    named nexttoken (default yylex)
    returns a TinyToken object (null at end of file)
%function and %type options specify these names

Class TinyToken

    public class TinyToken {
        public TinyToken(TokenKind k) { kind = k; }

        ... OTHER METHODS ...

        public enum TokenKind {
            RW_IF, RW_THEN, RW_ELSE, RW_END,
            RW_REPEAT, RW_UNTIL, RW_READ, RW_WRITE,
            SYM_ASSIGN, SYM_EQ, SYM_LT, SYM_PLUS, SYM_MINUS,
            SYM_TIMES, SYM_OVER, SYM_LPAREN, SYM_RPAREN, SYM_SEMI,
            NUMBER, ID, ILLEGAL
        }

        private TokenKind kind;
        private int value;
        private String spelling;
    }

Represent token data (kind etc.)
TokenKind encodes token classification
value: numerical value for NUMBERs
spelling: e.g. ID

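One possible way to fill in the elided methods is sketched below. Everything beyond the fields, the TokenKind enum and the one-argument constructor is an assumption: the class is named TinyTokenSketch to make clear it is not the lecture's TinyToken, and the factory methods number and id, the getter, and toString are hypothetical.

```java
// Hypothetical completion of the TinyToken skeleton; the factory methods,
// getter and toString() are assumptions, not the lecture's actual code.
class TinyTokenSketch {
    enum TokenKind {
        RW_IF, RW_THEN, RW_ELSE, RW_END, RW_REPEAT, RW_UNTIL, RW_READ, RW_WRITE,
        SYM_ASSIGN, SYM_EQ, SYM_LT, SYM_PLUS, SYM_MINUS, SYM_TIMES, SYM_OVER,
        SYM_LPAREN, SYM_RPAREN, SYM_SEMI,
        NUMBER, ID, ILLEGAL
    }

    private final TokenKind kind;
    private final int value;        // used for NUMBER tokens
    private final String spelling;  // used for ID tokens

    // Tokens such as reserved words and symbols carry no extra data.
    TinyTokenSketch(TokenKind k) { this(k, 0, null); }

    private TinyTokenSketch(TokenKind k, int v, String s) {
        kind = k;
        value = v;
        spelling = s;
    }

    static TinyTokenSketch number(int v) { return new TinyTokenSketch(TokenKind.NUMBER, v, null); }
    static TinyTokenSketch id(String s) { return new TinyTokenSketch(TokenKind.ID, 0, s); }

    TokenKind getKind() { return kind; }

    @Override
    public String toString() {
        switch (kind) {
            case NUMBER: return kind + "(" + value + ")";
            case ID:     return kind + "(" + spelling + ")";
            default:     return kind.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(new TinyTokenSketch(TokenKind.RW_IF)); // RW_IF
        System.out.println(number(42));                           // NUMBER(42)
        System.out.println(id("fact"));                           // ID(fact)
    }
}
```

With a class along these lines, a rule action for numbers could return something like number(Integer.parseInt(yytext())), and the action for identifiers id(yytext()).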
Using TinyScanner2

    TinyToken current;
    TinyScanner2 scanner = null;

    scanner = new TinyScanner2(new FileReader("sample.tny"));
    current = scanner.nexttoken();
    while (current != null) {
        System.out.printf("Token [%s]\n", current.toString());
        current = scanner.nexttoken();
    }

Some exception-handling code omitted for clarity.

A Scanner for More Sophisticated Languages

Same general approach works for many programming languages including C
    Handling C-style comments?
For non-toy languages (e.g. Java) capturing some aspects of lexical structure may require care:
    String literals
    Numerical literals (many formats)

Our Next Assignment

Should build scanner for C using jflex, but that's too easy
Will instead use these ideas to build simple plagiarism detector for C programs
    Generate profile for programs based on feature counting
    Count the number of occurrences of certain selected features, e.g. number of semicolons
    Programs with similar profiles are suspicious

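The feature-counting idea can be illustrated with a very rough sketch. It is not the assignment's method: the choice of features, the sum-of-absolute-differences comparison, and the names ProfileSketch, profile and distance are all assumptions, and it counts raw characters rather than tokens from a jflex scanner.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of feature counting; features and distance are assumptions.
class ProfileSketch {
    // Characters whose occurrences make up the profile.
    private static final char[] FEATURES = {';', '{', '(', '='};

    // Counts how often each feature character occurs in the source text.
    static Map<Character, Integer> profile(String source) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char f : FEATURES) {
            counts.put(f, 0);
        }
        for (char c : source.toCharArray()) {
            if (counts.containsKey(c)) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Sum of absolute differences between two profiles; smaller means more similar.
    static int distance(Map<Character, Integer> a, Map<Character, Integer> b) {
        int d = 0;
        for (char f : FEATURES) {
            d += Math.abs(a.get(f) - b.get(f));
        }
        return d;
    }

    public static void main(String[] args) {
        String a = "int main() { int x = 1; return x; }";
        String b = "int main() { int y = 1; return y; }"; // a with one variable renamed
        String c = "int f(int a) { if (a == 0) { return 1; } return a; }";
        System.out.println(distance(profile(a), profile(b))); // 0
        System.out.println(distance(profile(a), profile(c))); // 3
    }
}
```

Counting raw characters also counts symbols inside comments and string literals; counting in scanner actions instead would avoid that, since the scanner knows which characters form real tokens.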